We finally have everything needed to start easily writing parallel code that runs on a variety of devices! Below I’ll be making use of the deviceArray structure I described earlier, so any time I’m doing foo.acquire(...) you can assume that foo is a deviceArray returning a span that points either to device memory or to an array on the host, with any needed memory transfers to and from the device handled behind the scenes.
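Since everything below leans on that structure, here’s a minimal sketch of the interface I’m assuming; the names match the usage in the snippets that follow, but all of the allocation and transfer logic is elided, so treat it purely as an illustration.
#include <span>

// Interface sketch only: the real class tracks where the current copy of
// the data lives and performs host<->device transfers as needed.
enum class accessLocation { host, device };

template <typename T>
class deviceArray
    {
    public:
        // Return a span over host or device memory, transferring the data
        // to the requested location first if it is stale there.
        std::span<T> acquire(accessLocation location);
        // ...construction, resizing, and transfer bookkeeping omitted
    };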
AdaptiveCPP standard algorithms
One of the first great conveniences of working with AdaptiveCPP is that many of the standard library algorithms (in both the <algorithm> and <numeric> headers) have been implemented. They have signatures exactly like the STL ones, except for also needing a queue object.
Using AdaptiveCPP parallel algorithms: device code that looks exactly like algorithm code!
This makes it easy to write many of the most common pieces of code, especially in the kinds of particle- or lattice-based simulations that I often work with. Easy to write and, I should add, easy to test, since the host-only code can be written to look exactly the same. For instance, here’s a chunk of code from a particle simulation with periodic boundary conditions, updating positions according to a set of precomputed displacements.
//"coordinate" is just a data type storing a few doubles
std::span<coordinate> p = positions.acquire(accessLocation::host);
std::span<coordinate> dr = displacements.acquire(accessLocation::host);
std::transform(p.begin(), p.end(), dr.begin(), p.begin(),
               //"space" has a transportScalar(...) method that handles PBCs
               [ sp = space](coordinate &r, coordinate &_dr)
               { return sp.transportScalar(r, _dr); });
Here we’re using a std::transform that takes two input arrays and uses a lambda to implement a transformation which amounts to “vector addition plus boundary condition stuff.” Nothing too earth-shattering. All we need to do to perform exactly the same operation on the device is change the access location for our deviceArrays, use AdaptiveCPP’s version of transform, and (if we want to be sure the transformation has finished before proceeding) add a wait() call.
sycl::queue Q;
std::span<coordinate> p = positions.acquire(accessLocation::device);
std::span<coordinate> dr = displacements.acquire(accessLocation::device);
auto event1 = acpp::algorithms::transform(Q, p.begin(), p.end(), dr.begin(), p.begin(),
               //"space" has a transportScalar(...) method that handles PBCs
               [ sp = space](coordinate &r, coordinate &_dr)
               { return sp.transportScalar(r, _dr); });
event1.wait();
Just delightful. I suppose it’s just like having the thrust parallel algorithms library not only for CUDA but generalized to various back-ends, although I have to admit that when I was first learning CUDA I didn’t understand modern C++ well enough and missed out on the more generic versions of everything. In any event, especially when combined with C++23’s zip_view to get around the limit on how many iterators can be passed to a transformation, a large fraction of the parallel code I want to write fits exactly into this framework. For instance, here’s the above example generalized to use multiple displacement arrays:
sycl::queue Q;
std::span<coordinate> p = positions.acquire(accessLocation::device);
std::span<coordinate> dr1 = displacements1.acquire(accessLocation::device);
std::span<coordinate> dr2 = displacements2.acquire(accessLocation::device);
auto drs = std::ranges::zip_view(dr1,dr2);
auto event1 = acpp::algorithms::transform(Q, p.begin(), p.end(), drs.begin(), p.begin(),
               [ sp = space](coordinate &r, auto _drs)
               {
                   coordinate dR = std::get<0>(_drs) + std::get<1>(_drs);
                   return sp.transportScalar(r, dR);
               });
event1.wait();
It’s not just transform, of course: the AdaptiveCpp algorithms page details the different parallel primitives that they’ve implemented, all supported on every back-end device.
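As a quick taste of how the other primitives slot into the same pattern, here’s a sketch that resets the displacement buffer in place on the device. I’m assuming here that for_each is among the implemented algorithms and that, like transform, it takes the usual std:: arguments preceded by a queue; check the algorithms page for the exact set of primitives and their signatures.
sycl::queue Q;
std::span<coordinate> dr = displacements.acquire(accessLocation::device);
// Reset every displacement to zero (assuming coordinate value-initializes to zero)
auto event2 = acpp::algorithms::for_each(Q, dr.begin(), dr.end(),
               [](coordinate &_dr)
               { _dr = coordinate{}; });
event2.wait();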
C++ standard parallelism support
Interestingly, AdaptiveCpp also ships with some support for offloading a subset of STL algorithms, mostly via the par_unseq execution policy. This is a potentially quite exciting direction for the project, but for writing parallel code at the moment it seems to make more sense to just use the SYCL specification: it might not quite be standard C++ code, but as shown above the differences are already minimal!
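For comparison, here’s roughly what the position update from earlier looks like when written against the standard execution policies. Whether this actually offloads depends on how the stdpar support is built and enabled, so treat it as a sketch of the style rather than a guaranteed device dispatch.
#include <algorithm>
#include <execution>

std::span<coordinate> p = positions.acquire(accessLocation::host);
std::span<coordinate> dr = displacements.acquire(accessLocation::host);
// Same transform as before, but requesting the par_unseq policy instead of
// passing an explicit queue; no sycl:: types appear at all.
std::transform(std::execution::par_unseq,
               p.begin(), p.end(), dr.begin(), p.begin(),
               [ sp = space](const coordinate &r, const coordinate &_dr)
               { return sp.transportScalar(r, _dr); });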