Writing custom kernels
We finally get back to some of the first lines we wrote when introducing the idea of the queue, when we wrote
sycl::queue Q;
Q.submit(
    /* something we want done */
    );
Many of the things we want done can probably be expressed using the AdaptiveCpp version of the STL algorithms discussed on the last page, but sometimes the easiest (or the most performant) thing to do is to write custom code. This is especially the case when we have a sense of what kind of device we will be running on: we really do want to organize parallel work differently on a GPU (making use of shared memory between threads within a block, etc.).
parallel_for and work groups
The workhorse here is going to be the parallel_for method of a queue, which comes in two basic flavors.
parallel_for and ranges
The simpler is a parallel_for over a range<Dim>, which describes parallelization over a 1-, 2-, or 3-dimensional range object. This is handy, but it is also (typically) the case where we could have just used an algorithm like transform instead. Without seeing specific examples this perhaps feels a bit abstract; I recommend looking at the basic data-parallel kernels and then the tiled matrix multiply example from the ENCCS workshop for a more complete introduction.
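As a quick illustration, here is a minimal sketch of this simpler flavor (the names data and N, the shared USM allocation, and the element-wise scaling are all invented for the example); it uses the SYCL 2020 shortcut of calling parallel_for directly on the queue:
#include <sycl/sycl.hpp>

int main()
    {
    sycl::queue Q;
    const size_t N = 1024;
    float *data = sycl::malloc_shared<float>(N, Q); // shared USM, visible to host and device
    for (size_t i = 0; i < N; ++i)
        data[i] = static_cast<float>(i);

    // one work-item per element of the 1-d range
    Q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i)
        {
        data[i] *= 2.0f; // exactly the kind of element-wise work a transform could express
        }).wait();

    sycl::free(data, Q);
    }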
The parallel_for method can be called directly on a queue object (this is a convenience, just like the direct “use a Q.memcpy(...) to move data around” pattern we saw earlier), but since I’m mostly using it for more complicated kernels, it much more frequently gets used with the handler that you associate with the queue. The basic pattern often looks like
Q.submit([&](sycl::handler &h)
    {
    cellListSortingKernel clUpdate(/* parameters to initialize a stateful kernel */);
    sycl::range globalSize(N); // Total number of work-items
    h.parallel_for(sycl::range<1>(globalSize),
        [=](sycl::item<1> item)
        {
        unsigned int idx = item.get_id(0);
        clUpdate(idx, /* other parameters of the () operator */);
        });
    }).wait();
Here we see a parallel_for that will launch N “work-items” / threads. As shown above, I frequently find it helpful to organize my kernels as lambda functions that call a functor whose operator() takes in the thread index (along with any other information needed for the computation). Among other things, this makes it easy to run the same code on the device as in the host code testing it: the host code just has to replace the whole “queue calling a functor” business with “loop over calling the functor”.
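To make that concrete, here is a hedged sketch of what such a functor and its host-side test loop might look like (the doubleElementKernel functor, its scaleFactor state, and the hostData / N names are all invented for illustration; the cellListSortingKernel above would follow the same shape):
// a stateful kernel functor whose operator() takes the thread index
struct doubleElementKernel
    {
    float scaleFactor; // state set when the functor is constructed

    void operator()(unsigned int idx, float *data) const
        {
        data[idx] *= scaleFactor;
        }
    };

// on the host, "testing the kernel" is just a serial loop over the same operator()
doubleElementKernel testKernel{2.0f};
for (unsigned int idx = 0; idx < N; ++idx)
    testKernel(idx, hostData);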
This is, perhaps, the time to point out that kernels can be any callable, so there is also a style I’ve seen in other SYCL code of replacing the whole lambda function with a functor. I’m sure there are some benefits to this, but I haven’t explored them yet.
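For completeness, a hedged sketch of that style might look like the following (the scaleKernel name and its members are invented, and data is assumed to be a USM pointer): the functor’s operator() takes the item directly, and the functor itself is passed straight to parallel_for.
// a functor used directly as the kernel: its operator() takes the item
struct scaleKernel
    {
    float *data;
    float scaleFactor;

    void operator()(sycl::item<1> item) const
        {
        data[item.get_id(0)] *= scaleFactor;
        }
    };

Q.submit([&](sycl::handler &h)
    {
    h.parallel_for(sycl::range<1>(N), scaleKernel{data, 2.0f});
    }).wait();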
parallel_for and nd_ranges
The above example, using parallel_for with a simple range specifying the number of work-items, is both (a) straightforward and (b) the kind of code that can typically be expressed using transform-like algorithms as on the previous page.
The more complex version of parallel_for instead uses an nd_range<Dim> object. Nd-ranges are composed of work-groups, and within a work-group there is some local memory that each work-item can access. This is very much like the “block composed of threads with shared memory” model from CUDA, with the same potential advantages. It might typically look something like:
sycl::queue Q;
Q.submit([&](sycl::handler &h)
    {
    // using just 1-d ranges, for demonstration
    sycl::range globalSize(N);  // the total number of work-items / "threads" to process
    sycl::range localSize(32);  // the local work-group size you want
    h.parallel_for(sycl::nd_range<1>(globalSize, localSize),
        [=](sycl::nd_item<1> item)
        {
        unsigned int idx = item.get_global_id(0);
        unsigned int localIdx = item.get_local_id(0);
        /*
        if(idx < N)
            do something
        */
        });
    }).wait();
The first part of the Q.submit(... setup is like the cudaKernel<<<nBlocks,blockSize>>>(...) part of a call to a __global__ function in CUDA, and the lambda function with the item parameter is the kernel. Effectively using this pattern, of course, requires all of the usual considerations about organizing memory accesses and so on for efficient parallel computations. Those details will be familiar to anyone coming from CUDA, and they are also the kinds of considerations that are more hardware specific.
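To show where this pattern pays off, here is a hedged sketch of a per-work-group sum that uses local memory via sycl::local_accessor (the USM pointers in and partialSums, the size N, and the work-group size of 32 are all assumptions made for the example):
Q.submit([&](sycl::handler &h)
    {
    sycl::range globalSize(((N + 31) / 32) * 32); // round up to a multiple of the work-group size
    sycl::range localSize(32);
    // local memory shared by the work-items of one work-group (like CUDA __shared__)
    sycl::local_accessor<float, 1> scratch(localSize, h);
    h.parallel_for(sycl::nd_range<1>(globalSize, localSize),
        [=](sycl::nd_item<1> item)
        {
        unsigned int idx = item.get_global_id(0);
        unsigned int localIdx = item.get_local_id(0);
        scratch[localIdx] = (idx < N) ? in[idx] : 0.0f;
        sycl::group_barrier(item.get_group()); // make every work-item's write visible

        // tree reduction within the work-group
        for (unsigned int stride = item.get_local_range(0) / 2; stride > 0; stride /= 2)
            {
            if (localIdx < stride)
                scratch[localIdx] += scratch[localIdx + stride];
            sycl::group_barrier(item.get_group());
            }
        if (localIdx == 0)
            partialSums[item.get_group(0)] = scratch[0];
        });
    }).wait();
(A sycl::reduction object could express this particular sum more compactly; the point of the sketch is just the local-memory and barrier mechanics.)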
That’s all for now. This guide obviously hasn’t gone into performance issues, writing good parallel code, or anything like that, but it does cover the core components one would need to start writing parallel code for a heterogeneous compute environment via SYCL. I had a lot of fun learning about this, and might even try implementing my next piece of scientific software using SYCL rather than CUDA! I also know that there is still a lot I have to learn, and if there’s something I’ve missed, you should feel free to get in touch and let me know!