Queues that we can arrange into a DAG, tasks that execute asynchronously on different devices… it’s all not much good unless there is data for our tasks to operate on, and some way of communicating results or transformations of that data back to the host. Host-device memory management often occupies a lot of attention when writing CUDA code (as does device-device memory management, whether in CUDA, or MPI, or…). It will, I’m sure, come as no surprise that any kind of heterogeneous programming specification requires us to think about these issues as well.
SYCL memory models
SYCL has two different memory models for thinking about the flow of data: a “buffer/accessor” model for implicit memory management and one based on unified shared memory (USM) pointers for explicit memory management. I’ll describe them both below, but to be transparent: the AdaptiveCpp documentation is pretty explicit in its recommendation to never use buffers. Although buffers come with several of the conveniences you might expect from an implicit memory management scheme (see below), given that kind of strong recommendation against them I haven’t actually tried to use the buffer/accessor model in real code. All of which is to say: treat the section immediately below as impressionistic rather than literal.
Buffers and accessors
This first model is what to use if you want all of the data management to happen implicitly. The idea is that you have pre-existing data, and buffers can provide a view into that memory. Thus, a buffer is constructed by specifying the memory location and size; it then carries a specification for where it can be accessed (on the host, on the device, or both).
Based on the section title, you will not be surprised to learn that the view into memory provided by a buffer is accessed via an accessor. Accessors describe how the data will be used (read-only, read/write, overwrite) in ways that help the runtime decide how to efficiently move the data around behind the scenes. They come in two flavors: an accessor used for accessing data on the device, and a host_accessor, which reads data on the host from a buffer that was accessed on the device. This all looks something like:
std::vector<int> data(100,32);
sycl::buffer<int> A{data.begin(),data.end()};
Q.submit([&](sycl::handler &h)
    {
    sycl::accessor a(A,h); //device-side access to the buffer
    /* do something with a */
    });
sycl::host_accessor result{A}; //read the results back on the host
It is worth noting that a host_accessor will implicitly wait for the data to actually be available, so it enforces a synchronization of the code at that point. The same is true of buffer destructors. This means that one could define buffers and the SYCL work you do with them within scopes as a way of controlling task synchronization.
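A minimal sketch of that scoping pattern might look like the following (here I’m constructing the buffer from a pointer rather than a pair of iterators, so that the buffer knows where to write the data back to when it is destroyed):
std::vector<int> data(100,32);
{ //scope the buffer's lifetime
    sycl::buffer<int> B{data.data(), sycl::range<1>{data.size()}};
    Q.submit([&](sycl::handler &h)
        {
        sycl::accessor b(B, h);
        /* do something with b */
        });
} //B's destructor blocks until the device work finishes, then writes the results back into data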
It is also worth noting that the runtime will assume that, during the lifetime of the buffer, we won’t touch the memory the buffer is viewing. You could cause quite a bit of trouble, I suppose, if you tried to play with the vector of data above while the action of Q was in operation.
USM
There are three flavors of unified shared memory allocations in SYCL, and coming from CUDA they are immediately familiar: host allocations return a pointer to memory physically on the host but (since it’s USM) accessible by the device (like CUDA pinned memory); device allocations live only on the device and require explicit data transfers to move data to and from them; and shared allocations use a unified virtual address space, allowing access from both the host and the device. I suppose shared allocations are convenient in cases where you are trying to migrate existing code or prototype something. I never really used pinned memory in CUDA; I guess the problem domains I usually work in don’t have that many situations where rare and partial accesses of some memory on the device happen. As a result, I think the most important thing is just getting used to the explicit memory management associated with device USM.
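To make the three flavors concrete, here is a quick sketch (assuming a queue Q and a number of elements N; all three allocations are eventually paired with a call to sycl::free):
auto hostPtr   = sycl::malloc_host<int>(N,Q);   //host memory, device-accessible (pinned-like)
auto devicePtr = sycl::malloc_device<int>(N,Q); //device-only memory; explicit copies required
auto sharedPtr = sycl::malloc_shared<int>(N,Q); //migrates between host and device on demand
/* ... work with the pointers ... */
sycl::free(hostPtr,Q);
sycl::free(devicePtr,Q);
sycl::free(sharedPtr,Q);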
The way this works is to use the malloc_device function, which associates some memory with a specific device via a queue (which is itself associated with that specific device). One then uses the queue to explicitly copy data to and from the device:
sycl::queue Q;
constexpr size_t N = 1000; //an arbitrary number of elements
std::vector<int> hostData(N,0);
auto deviceData = sycl::malloc_device<int>(N,Q);
Q.submit([&](sycl::handler &h)
    {
    h.memcpy(deviceData,hostData.data(),N*sizeof(int));
    });
Q.wait();
A few things are worth pointing out. I’ve added the wait() there because memory copies are asynchronous: like tasks in general, they get submitted to the queue and the code immediately proceeds to the next line, so if you want to guarantee that the transfer has finished you need to wait (or depends_on an event, etc., as discussed on the last page). Additionally, I’ve written the memory copy using the Q.submit(...) and handler syntax, but the queue object itself has a shortcut for this extremely common operation:
//Q.submit([&](sycl::handler &h){
//    h.memcpy(deviceData,hostData.data(),N*sizeof(int));
//});
//Shortened to:
Q.memcpy(deviceData,hostData.data(),N*sizeof(int));
It is, of course, the same pattern for using memcpy to get data from the device back to the host.
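For instance (reusing the names from above, and waiting directly on the event that memcpy returns):
Q.memcpy(hostData.data(),deviceData,N*sizeof(int)).wait();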
A convenient helper class for USM management
This kind of explicit memory management probably feels very familiar if you’re coming from CUDA, and it’s nice that the same small set of familiar-feeling memory copying functions (Q.memcpy(...) instead of memcpy(...)? I can remember that!) gets used for all of the different device backends. To partially automate the task of data migration in these explicit-memory-management schemes, many people writing CUDA have independently written simple GPUArray-like classes: something that points to an array of data on the host and comes with an accessor-like framework for doing all of the device allocations and memory migration when needed (but in the context of plain pointers to device memory, rather than SYCL accessors, which, apparently, can increase register pressure in kernels). For instance, hoomd-blue uses a GPUArray and ArrayHandle combination: the GPUArray represents an array of elements that exists on both the CPU and GPU, with the ArrayHandle being used to access the data during scope-limited lifetimes (the ArrayHandle constructor specifies how and where the data is being accessed, handling any needed data transfers).
The same convenience classes work quite well here, too, of course. The scope-limited access provided by something like ArrayHandle is good for error checking and for really making sure that host and device data can never be independently modified, but for the purposes of this short guide let’s just focus on a deviceArray that does the work for us. Enumerations will describe where and how we’re accessing the data (and where the most-recently-updated version of the data is):
//!A structure for declaring where we want to access data
struct accessLocation
    {
    //!An enumeration of possibilities
    enum Enum
        {
        host,  //!<We want to access the data on the CPU
        device //!<We want to access the data on the GPU
        };
    };
//!A structure for declaring where the current version of the data is
struct dataLocation
    {
    //!An enumeration of possibilities
    enum Enum
        {
        host,      //!< data was last modified on host
        device,    //!< data was last modified on device
        hostdevice //!< data is current on both host and device
        };
    };
//!A structure for declaring how we want to access data (read, write, overwrite?)
struct accessMode
    {
    //!An enumeration of possibilities
    enum Enum
        {
        read,      //!< we just want to read
        readwrite, //!< we intend to both read and write
        overwrite  //!< we will completely overwrite all of the data
        };
    };
From here, we define a deviceArray whose most important features look like the following:
template <class T> struct deviceArray
    {
    //! Construct with a number of elements and a SYCL queue
    deviceArray(unsigned int numberOfElements, sycl::queue &_newQ);
    //! Free any used memory
    ~deviceArray();
    //! Acquire the data as a std::span, handling memory transfers if needed
    inline std::span<T> acquire(accessLocation::Enum location, accessMode::Enum mode);

    //Some state information:
    dataLocation::Enum whereIsTheData;
    sycl::queue Q;
    unsigned int arraySize;
    T* hostData;
    T* deviceData;
    /*
    Other helper functions, etc.
    */
    };
This class has all of the information needed to manage copying data to and from the device: every time the acquire method is called we check where the most up-to-date version of the data is, we check where we’re accessing it, and we check the mode of access. All of this tells us whether we need to use the queue to copy data around, and at the end of the day we get back a std::span<T>. I’m relatively new to modern C++, but returning a span instead of a raw data pointer lets us substantially simplify a lot of code. As we’re about to see in the next section, there are parallel versions of a lot of the standard algorithms we know and love, and a lot of the boilerplate kernel code that I used to have to write to do simple loops just evaporates away!
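To make that bookkeeping concrete, here is a minimal sketch of what the body of acquire might look like. This is illustrative rather than complete (it skips error checking, and doesn’t try to be clever about overlapping transfers with other work):
template <class T>
std::span<T> deviceArray<T>::acquire(accessLocation::Enum location, accessMode::Enum mode)
    {
    if (location == accessLocation::device)
        {
        //copy host -> device only if the host copy is the fresh one and we'll actually read it
        if (whereIsTheData == dataLocation::host && mode != accessMode::overwrite)
            Q.memcpy(deviceData, hostData, arraySize*sizeof(T)).wait();
        if (mode != accessMode::read)
            whereIsTheData = dataLocation::device; //any write invalidates the host copy
        else if (whereIsTheData == dataLocation::host)
            whereIsTheData = dataLocation::hostdevice; //both copies are now current
        return std::span<T>(deviceData, arraySize);
        }
    //otherwise, a host access: the mirror image of the logic above
    if (whereIsTheData == dataLocation::device && mode != accessMode::overwrite)
        Q.memcpy(hostData, deviceData, arraySize*sizeof(T)).wait();
    if (mode != accessMode::read)
        whereIsTheData = dataLocation::host;
    else if (whereIsTheData == dataLocation::device)
        whereIsTheData = dataLocation::hostdevice;
    return std::span<T>(hostData, arraySize);
    }
With that in place, code that wants device data just calls acquire(accessLocation::device, accessMode::readwrite) and hands the span off to a kernel, and any needed transfer happens automatically.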