Skip the preamble and jump directly to the guides:

- Setting up AdaptiveCpp (3 minutes)
- Devices, Queues, and Kernels (6 minutes)
- Host/device data management (8 minutes)
- Parallel (standard) algorithms (4 minutes)
- Custom kernels (4 minutes)
They say the best programming language is the one that makes you want to write more code. CUDA is not that language, but it’s certainly a powerful enough tool that it’s worth learning and using! Much of my scientific output over the last few years has been enabled by various open-source software packages that we’ve developed to look at interesting soft-matter and biologically inspired systems. The computational methods we use often have a highly parallelizable structure, and as a postdoc I spent a reasonable amount of time learning how to write CUDA code to develop GPU-accelerated simulations targeting Nvidia devices.
Running a research group with a large computational component, I feel like it’s a good habit to periodically try to learn new tools — something on a similar scale to what I ask students joining the lab to learn all the time. Over the most recent winter break I heard a random podcast talking about SYCL: a cross-platform specification for writing device-agnostic parallel computing applications using modern C++.
The same source code compiling to run on a CPU, an Intel GPU, or an Nvidia card, without having to write vendor-specific code? Using (mostly) completely standard C++? My interest was piqued, and I decided to make learning more about it my holiday project.
As many people have said, I think one of the best ways to learn is to try to teach, so as I was going through the steps of learning about the SYCL ecosystem — and translating what I knew about heterogeneous compute from CUDA to a different setting — I took notes and tried to write in a vaguely pedagogical manner. This is not going to be a complete tutorial, and certainly not a guide to writing maximally performant GPU-accelerated code from scratch. However, particularly if, like me, you are coming to this with some background in CUDA, I think these notes might include helpful information, especially for building up a mental map of the analogous concepts between the different heterogeneous computing paradigms. Hope you enjoy learning about all of this as much as I did!
I’ve broken down these notes into the following pages. The first part focuses on the nuts and bolts of installing a SYCL compiler and integrating it into a C++ compilation flow using CMake. Part 2 covers the core abstractions that SYCL uses for heterogeneous computing: queue objects that are associated with a backend device and can be used to asynchronously schedule code for execution, and handlers that customize how the queue operates and can easily be used to describe the dependencies of one task on others. Part 3 describes the two different memory models that SYCL provides to control the flow of data between the host and different devices.
Moving into actually performing parallel computations, Part 4 highlights how SYCL — and the AdaptiveCpp compiler in particular — provides a lot of nice features for expressing relatively simple parallel tasks in direct analogy to various primitives in the <numeric>
and <algorithm>
STL header. Being able to use things like for_each
, transform
, or reduce
with only a little bit of effort radically simplifies a lot of the process of writing GPU-accelerated code, with performance that is competitive with some of the custom CUDA kernels I’ve written in the past. Some computations are not so easily expressed with these standard algorithms (or need problem-specific tweaking to be as performant as possible), and of course SYCL provides the ability to write custom kernels. Part 5 covers the basic patterns used to write those custom kernels, which in SYCL means using either lambdas or functors to describe the elements of a parallel computation.
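As a rough flavor of the standard-algorithm style that Part 4 goes into, the code below is a small sketch of my own: ordinary <algorithm>/<numeric> calls with a parallel execution policy, which AdaptiveCpp can offload to the device when its stdpar support is enabled (the exact compiler flag, something like --acpp-stdpar, is my recollection — check the AdaptiveCpp docs for the real invocation):

```cpp
// Plain C++17 parallel algorithms; with AdaptiveCpp's stdpar offload these
// same calls can execute on the GPU instead of the host.
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> x(1 << 20, 2.0);

    // Element-wise transform, written exactly as it would be for the host
    std::transform(std::execution::par_unseq, x.begin(), x.end(), x.begin(),
                   [](double v) { return v * v; });

    // Parallel reduction over the result
    double total = std::reduce(std::execution::par_unseq, x.begin(), x.end(), 0.0);
    (void)total;
}
```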
Learn more
Here are some of the most important resources I used when learning about SYCL myself; several of them go into much more depth on the nuts and bolts of writing SYCL code.
The ENCCS workshop is a great tutorial from a few years ago, and working through some of the exercises is very helpful. Intel put together a comparison of the CUDA and SYCL programming models, and especially if you’re already familiar with CUDA it’s quite nice to see the way each model organizes its core abstractions laid out. The AdaptiveCpp documentation is well written and extremely helpful (and there is an associated Discord in which people can ask questions — the developers seem quite friendly and helpful in answering both beginner- and expert-level queries). Finally, this repo is a fantastic resource — it contains benchmarks comparing SYCL and CUDA code written for a large number of different tasks. The tasks range from elemental parallel operations (reductions, memory copies, sorting, etc.) to much more complex integrated operations (particle-based simulations, stencil-based solutions of PDEs in various domains, etc.). That also means it has an absolute wealth of sample custom kernels and queue patterns to look at and learn from.