Queues and executing code on device
A fundamental abstraction in SYCL is the idea of a “queue”: code that we want to be executed (in parallel or in series, on different devices or on the host device, etc.) can be submitted to a queue. Schematically, this looks something like
sycl::queue Q;
Q.submit(
/* something we want done*/
);
The runtime then schedules these queue submissions and executes them asynchronously. In order to manage dependencies between different submissions, SYCL provides a few fundamental mechanisms. The first is the ability to specify a queue as “in-order”: in this mode of operation each submission is assumed to be dependent on the preceding submission, and the queue will wait for one task to finish before starting the next. This linear chain of dependencies is well suited to many of the applications I care about, but it means you cannot effectively overlap potentially independent calculations.
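In SYCL 2020 an in-order queue is requested by passing a property to the queue constructor. A minimal sketch (this needs a SYCL implementation to compile, and the empty `single_task` kernels are just stand-ins for real work):

```cpp
#include <sycl/sycl.hpp>

int main()
{
    // Request the in-order property at construction time (SYCL 2020).
    sycl::queue Q{sycl::default_selector_v, sycl::property::queue::in_order{}};

    // These two submissions now execute one after the other; no explicit
    // synchronization between them is needed.
    Q.submit([&](sycl::handler &h) { h.single_task([]() { /* task 1 */ }); });
    Q.submit([&](sycl::handler &h) { h.single_task([]() { /* task 2 */ }); });

    Q.wait(); // block until everything submitted to Q has finished
}
```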
Thus, SYCL also provides the ability to directly specify a directed acyclic graph (DAG) of tasks by making use of the “event” abstraction. The idea here is that the methods of a queue all schedule an (asynchronous) task and then return an event object (which we simply discarded in the schematic example above). One can capture these events and explicitly synchronize on them (i.e., wait for an event to finish before moving on to the next line of code), or synchronize on collections of them to describe a more general DAG. It looks like
sycl::event ev1 = Q.submit( /*task 1 */);
ev1.wait(); // wait until the first task is finished
sycl::event ev2 = Q.submit( /*task 2 */);
sycl::event ev3 = Q.submit( /*task 3 */);
sycl::event ev4 = Q.submit( /*task 4 */);
sycl::event::wait({ev2,ev4}); // wait until tasks 2 and 4 are finished...
// task 3 might still be ongoing
Command groups and handlers
The above is essentially a mechanism for enforcing (partial) synchronization points in the flow of the code, but it is not necessarily the most convenient way of describing a general DAG of dependencies. SYCL has the additional abstraction of “command groups” associated with queues, which provide more control over how code submitted to a queue is actually processed. “Handlers” are objects that let us interact with command groups, and they are automatically constructed by the SYCL runtime — because of this, we will basically always see them in the context of arguments to lambda functions that are passed to the submit method of a queue. We’ll use handlers more when we get to writing custom kernels — they are useful for expressing the details of how we want code to be parallelized — but they can also be used to more naturally express the execution dependencies of our code. For instance, we could write something like:
sycl::event ev1 = Q.submit( /*task 1 */);
sycl::event ev2 = Q.submit([&](sycl::handler &h){
    h.depends_on(ev1);
    /* code for submitting task 2 */
});
sycl::event ev3 = Q.submit( /*task 3 */);
sycl::event ev4 = Q.submit([&](sycl::handler &h){
    h.depends_on({ev2, ev3});
    /* code for submitting task 4 */
});
ev4.wait();
We have now described a setting in which task 4 won’t launch until tasks 2 and 3 finish, task 2 won’t launch until task 1 finishes, and task 3 will launch independently of what is happening with tasks 1 and 2.
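The /* code for submitting task N */ placeholders above would in practice contain a kernel launch. As a sketch (assuming trivial single_task kernels as stand-ins for real work), the full DAG could be written as:

```cpp
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue Q;

    sycl::event ev1 = Q.submit([&](sycl::handler &h) {
        h.single_task([]() { /* work for task 1 */ });
    });
    sycl::event ev2 = Q.submit([&](sycl::handler &h) {
        h.depends_on(ev1);        // task 2 waits for task 1
        h.single_task([]() { /* work for task 2 */ });
    });
    sycl::event ev3 = Q.submit([&](sycl::handler &h) {
        h.single_task([]() { /* task 3 is independent of 1 and 2 */ });
    });
    sycl::event ev4 = Q.submit([&](sycl::handler &h) {
        h.depends_on({ev2, ev3}); // task 4 waits for tasks 2 and 3
        h.single_task([]() { /* work for task 4 */ });
    });
    ev4.wait(); // by the DAG, this also guarantees tasks 1-3 are done
}
```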
Devices and device discovery
Each queue is associated with a particular back-end — the device which the queue will actually submit tasks to. The SYCL standard requires that the host CPU always be available as a potential device a queue can connect to (either running a serial task or parallelized via OpenMP), and then any hardware accelerator that exists on a given machine can also be specified. This includes Nvidia, AMD, and Intel GPUs, FPGAs, and so on. The fact that SYCL lets you write the same C++ code to be executed on any of these different devices is remarkable!
Queues get associated with a particular device via their constructor, through the selector abstraction. The same code can have many queues (with many queues using the same device, or spread across multiple back-ends), and using the built-in selectors looks like:
sycl::queue cpuQ { sycl::cpu_selector_v };
sycl::queue gpuQ { sycl::gpu_selector_v };
These selectors have internal criteria for selecting a specific back-end — for instance, on a machine with multiple GPUs the above gpuQ will get associated with exactly one of the available devices — and all SYCL selectors will throw an exception if no suitable device is found. (Note that in SYCL 2020 the *_selector_v names are callable objects, not types, so they are passed directly rather than constructed.) In addition to selectors based on the device type (a GPU, a CPU, an FPGA), there are also aspect_selector free functions that let you specify at a finer granularity which device you want to use.
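For instance, a sketch of selecting on aspects (assuming a device with these aspects exists on the machine; otherwise the queue constructor throws):

```cpp
// Ask for a GPU that also supports double precision (SYCL 2020).
// aspect_selector returns a callable selector object that can be
// passed straight to the queue constructor.
sycl::queue Q{sycl::aspect_selector(sycl::aspect::gpu, sycl::aspect::fp64)};
```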
To find out what platforms (OpenMP, CUDA, etc.) can be found on a particular computer, and which devices exist within each of those platforms, we can iterate through them and use the various get_info methods. The following is just a small sample of doing this:
for (const auto &p : sycl::platform::get_platforms())
{
    std::cout << "Platform: " << p.get_info<sycl::info::platform::name>() << "\n";
    // Loop through available devices in this platform
    for (const auto &d : p.get_devices())
    {
        std::cout << "  Device: " << d.get_info<sycl::info::device::vendor>() << " "
                  << d.get_info<sycl::info::device::name>() << "\n";
    }
}
Other kinds of information available include how much global memory a device has, how many threads are allowed to execute simultaneously, and so on. The ENCCS workshop has a nice overview.
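Those extra queries go through the same get_info mechanism; a sketch of a few commonly useful descriptors (SYCL 2020 names) might be:

```cpp
sycl::device d{sycl::default_selector_v};
std::cout << "Global memory (bytes): "
          << d.get_info<sycl::info::device::global_mem_size>() << "\n"
          << "Max compute units:     "
          << d.get_info<sycl::info::device::max_compute_units>() << "\n"
          << "Max work-group size:   "
          << d.get_info<sycl::info::device::max_work_group_size>() << "\n";
```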
Once you know what kinds of things can be selected on, SYCL allows you to write custom selectors: the requirement is just that they are callable on a (const sycl::device &d) argument and return an integer “score”. A queue which uses such a selector will compute this score for all available devices on the system and then select the one with the highest score. A simple example might be something like
using namespace sycl;
class custom_selector
{
public:
    custom_selector(int _whichDev) : whichDevice(_whichDev) {}
    int operator()(const device &dev) const
    {
        int score = 1;
        if (dev.is_gpu())
        {
            auto vendor = dev.get_info<info::device::vendor>();
            auto devName = dev.get_info<info::device::name>();
            if (devName.find("Tesla") != std::string::npos) score += 100;
            if (vendor.find("NVIDIA") != std::string::npos) score += 20;
            if (vendor.find("Intel") != std::string::npos) score += 5;
        }
        if (whichDevice <= 0 && dev.is_cpu())
            score += 200;
        return score;
    }
    int whichDevice = -1;
};
The above will default to selecting the host CPU, but if whichDevice is positive then the selector will prefer a Tesla GPU if it exists, then any other Nvidia GPU if it exists, then any Intel GPU if it exists, and finally falls back to the CPU. Obviously much more intricate and robust selectors can be designed (I extracted this from some code which actually uses something like whichDevice to more sensibly pick between options on my workstation), but since SYCL can target such a large number of back-end devices, the process of picking the device you want is a bit more work than just choosing between Nvidia GPUs in CUDA.