We finally have everything needed to start easily writing parallel code that runs on a variety of devices! Below I’ll be making use of the deviceArray structure I described earlier, so any time I’m doing foo.acquire(...) you can assume that foo is a deviceArray returning a span that points either to device memory or to an array on the host, with any needed memory transfers to and from the device handled behind the scenes.
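Since everything below leans on that structure, here’s a minimal sketch of the interface I’m assuming; the names match the usage in the snippets that follow, but all of the allocation and transfer logic is elided, so treat it purely as an illustration.
#include <span>

// Interface sketch only: the real class tracks where the current copy of
// the data lives and performs host<->device transfers as needed.
enum class accessLocation { host, device };

template <typename T>
class deviceArray
    {
    public:
        // Return a span over host or device memory, transferring the data
        // to the requested location first if it is stale there.
        std::span<T> acquire(accessLocation location);
        // ...construction, resizing, and transfer bookkeeping omitted
    };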
AdaptiveCPP standard algorithms
One of the first great conveniences of working with AdaptiveCPP is that many of the standard library algorithms (in both the <algorithm> and <numeric> headers) have been implemented. They have signatures exactly like the STL ones, except for also needing a queue object.
Using AdaptiveCPP parallel algorithms: device code that looks exactly like algorithm code!
This makes it easy to write many of the most common pieces of code, especially in the kinds of particle- or lattice-based simulations that I often work with. Easy to write and, I should add, easy to test, since the host-only code can be written to look exactly the same. For instance, here’s a chunk of code from a particle simulation with periodic boundary conditions, updating positions according to a set of precomputed displacements.
//"coordinate" is just a data type storing a few doubles
std::span<coordinate> p = positions.acquire(accessLocation::host);
std::span<coordinate> dr = displacements.acquire(accessLocation::host);
std::transform(p.begin(), p.end(), dr.begin(), p.begin(),
               //"space" has a transportScalar(...) method that handles PBCs
               [ sp = space](coordinate &r, coordinate &_dr)
               { return sp.transportScalar(r, _dr); });
Here we’re using a std::transform that takes two input arrays and uses a lambda to implement a transformation which amounts to “vector addition plus boundary condition stuff.” Nothing too earth-shattering. All we need to do to perform exactly the same operation on the device is change the access location for our deviceArrays, use AdaptiveCPP’s version of transform, and (if we want to be sure the transformation has finished before proceeding) add a wait() call.
sycl::queue Q;
std::span<coordinate> p = positions.acquire(accessLocation::device);
std::span<coordinate> dr = displacements.acquire(accessLocation::device);
auto event1 = acpp::algorithms::transform(Q, p.begin(), p.end(), dr.begin(), p.begin(),
               //"space" has a transportScalar(...) method that handles PBCs
               [ sp = space](coordinate &r, coordinate &_dr)
               { return sp.transportScalar(r, _dr); });
event1.wait();
Just delightful. I suppose it’s just like having the thrust parallel algorithms library not only for CUDA but generalized to various back-ends, although I have to admit that when I was first learning CUDA I didn’t understand modern C++ well enough and missed out on the more generic versions of everything. In any event, especially when combined with C++23’s zip_view to get around the limit on how many iterators can be passed to a transformation, a large fraction of the parallel code I want to write fits exactly into this framework. For instance, here’s the above example generalized to use multiple displacement arrays:
sycl::queue Q;
std::span<coordinate> p = positions.acquire(accessLocation::device);
std::span<coordinate> dr1 = displacements1.acquire(accessLocation::device);
std::span<coordinate> dr2 = displacements2.acquire(accessLocation::device);
auto drs = std::ranges::zip_view(dr1,dr2);
auto event1 = acpp::algorithms::transform(Q, p.begin(), p.end(), drs.begin(), p.begin(),
               [ sp = space](coordinate &r, auto _drs)
               {
                   coordinate dR = std::get<0>(_drs) + std::get<1>(_drs);
                   return sp.transportScalar(r, dR);
               });
event1.wait();
It’s not just transform, of course: the AdaptiveCpp algorithms page details the different parallel primitives that they’ve implemented, all supported on every back-end device.
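As a quick taste of how the other primitives slot into the same pattern, here’s a sketch that resets the displacement buffer in place on the device. I’m assuming here that for_each is among the implemented algorithms and that, like transform, it takes the usual std:: arguments preceded by a queue; check the algorithms page for the exact set of primitives and their signatures.
sycl::queue Q;
std::span<coordinate> dr = displacements.acquire(accessLocation::device);
// Reset every displacement to zero (assuming coordinate value-initializes to zero)
auto event2 = acpp::algorithms::for_each(Q, dr.begin(), dr.end(),
               [](coordinate &_dr)
               { _dr = coordinate{}; });
event2.wait();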
C++ standard parallelism support
Interestingly, AdaptiveCpp also ships with some support for offloading a subset of STL algorithms, mostly via the par_unseq execution policy. This is a potentially quite exciting direction for the project, but for writing parallel code at the moment it seems to make more sense to just use the SYCL specification: it might not quite be standard C++ code, but as shown above the differences are already minimal!
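For comparison, here’s roughly what the position update from earlier looks like when written against the standard execution policies. Whether this actually offloads depends on how the stdpar support is built and enabled, so treat it as a sketch of the style rather than a guaranteed device dispatch.
#include <algorithm>
#include <execution>

std::span<coordinate> p = positions.acquire(accessLocation::host);
std::span<coordinate> dr = displacements.acquire(accessLocation::host);
// Same transform as before, but requesting the par_unseq policy instead of
// passing an explicit queue; no sycl:: types appear at all.
std::transform(std::execution::par_unseq,
               p.begin(), p.end(), dr.begin(), p.begin(),
               [ sp = space](const coordinate &r, const coordinate &_dr)
               { return sp.transportScalar(r, _dr); });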