Welcome. Thank you for coming.
I'm David Olsen. I work on the PGI compiler team at NVIDIA.
And I'm here to talk about parallelism.
This picture is from registration here at CppCon Sunday evening. This is an example of effective parallelism.
Multiple volunteers, each with a group of name tags for a group of attendees.
It's not perfect because registration is sequential within each group.
As you can see, three people lined up at the V through Z sign, the V through Z table.
But it's effective. Twenty volunteers in parallel is faster than having a single check-in person
even if that person were the fastest check-in person in the world.
This picture is from lunch on Monday.
It's a not-so-effective use of parallelism. There are several payment terminals, but not enough to prevent long lines.
At each terminal paying for lunch has to be done
sequentially, even though there's nothing inherently sequential about different people paying for their lunch.
This could be better paralyzed by having even more payment terminals, but that's probably not worth the money to set them up.
This is the last discussion of parallelism in the real world. The rest of the talk will be about parallelism in C++.
This talk is very code heavy with lots of code examples. If you prefer a more theoretical or philosophical talk
there's still time to go to Walter Brown's talking in Summit 8/9.
A couple of assumptions underlying this talk:
First, the vast majority of code cannot be effectively parallelized,
but the gains can be tremendous when you have the right code.
The biggest benefit comes from code that does the same computation over and over and over and over again on different data
where each little piece of that computation is independent of the others.
You might have multiple threads running at the same time doing different things,
but that's concurrency as David Hollman explained earlier this week in the "Lazy Futures" talk.
That's not the parallelism that this talk is going to be focusing on.
The second assumption is that you need the right hardware for parallelism to be effective.
You need a GPU or a multi-core CPU. If the programs that you're working on will run on a single core CPU nothing
I say in this talk will be of any use to you.
First I'm going to go over a number of different parallel programming models.
These are different ways that you can paralyze your code. This list is not comprehensive.
There are other models out there that might be more effective for your particular situation.
But these are some of the ones that I'm familiar with or that are used in the HPC environment where I work.
For each of the models
I will show a very short code example, so you can get a flavor for what it looks like in the code.
Later in the talk. I will have longer code examples with performance data for some of the models.
There are several different strategies for paralyzing your code. They fall into three main camps.
Some of them, like OpenACC and OpenMP, do it through
pragma directives in the code. That's how you express your intentions to the compiler.
Other ways of paralyzing have language extensions, like CUDA and OpenCL. These are actually changes to the language. You have to write
non-standard code in order to get the compiler to paralyze your stuff.
There are a whole bunch of libraries out there, like Thrust in Kokkos, which look like C++ libraries. You include their header files
maybe link in their
These libraries are
always built on one or more of the other models underneath. But you interact with the library rather than with the underlying model.
So here's the the short example I'll show for each of them.
This is a sequential version.
There are two input arrays X and Y. Multiply X by a scalar. Add that to Y and store the result in Z. Really simple.
The first model that we will look at is Pthreads.
This was the original widely used threading library, created back in the '90s as part of POSIX.
It has a C interface. It's not designed for C++.
Here's the Pthreads version of that example code. Because it's a C interface
we have to use pointers to functions. The function has to take a single "void*" argument and return a "void*".
We don't get to use lambdas.
Since we don't have lambda captures
we have to marshal all the data explicitly to get it into our thread.
This is not the best model for C++. Don't use it. It won't be mentioned again in this talk.
In the C++ 11 standard, threads were added to C++.
It's in many ways a C++ interface on top of the Pthreads functionality.
When you create a sd::thread object
it launches a new thread.
The threads must be joined or detached explicitly. (But don't ever detach them. Always join them.)
before the threaded object is destroyed.
You have to manage the threads yourselves.
C++ still does not have a standard thread pool or task manager to help you with your thread management.
There's a little bit. There's async(), ...
It's coming. I guess with co-routines you have some more...
But that's more concurrency than the parallelism that we're looking at in this talk.
The comment is that I said that there's no...
The comment was that I said that there's no
thread pool or other things for managing your threads. There's a little bit creeping its way into C++.
It's more for currency rather than for the massive parallelism that we're talking about.
There's not good tools for managing your threads. Here's the same example using
C++ threads. It's much nicer than the Pthreads version.
We launched the thread by creating a std::thread object, which we pass a lambda. Between the lambda's parameters and it's capture
it's much easier to get the data we want into the thread.
But it still takes some effort to manage the threads.
We have to keep a container of the threads. We have to join the threads at the end.
If you switch to jthread, which is new in C++20,
you can get rid of the explicit join at the end, since jthread automatically joins in its destructor.
But you still need the vector, or some other container, so that the jthreads have the right lifetime.
OpenMP is a programming model for shared memory multi-core systems.
It uses pragma directives to give your instructions, or hints, to the compiler. It's available in C++, C, and Fortran.
More recent versions of OpenMP support
doing your parallelism on the GPU rather than the CPU. But the GPU capabilities aren't widely available in implementations yet.
So here's the OpenMP example. It's about as simple as it gets.
We add a pragma before the for loop. We're telling the compiler
"we want you to paralyze this for loop," and it does all the rest.
It creates the threads. It manages the threads. It takes care of destroying them at the end.
OpenACC is a parallel programming model for shared memory multi-core systems and for GPUs.
Again, it uses pragma directives to give the instructions to the compiler. It's available in C++, C, and Fortran.
Because it targets GPUs it also has directives for specifying the data movement.
CPUs and GPUs have separate memory spaces. So you have to worry about copying your data back and forth between them.
So any GPU model will have ways of managing your data. So we look at the open ACC example...
We have a similar pragma just before the for loop, saying please paralyze this loop.
But we also have the data directive.
The first pragma. It says to copy X and Y to the GPU memory. And then when we're done,
which happens at the end of the scope that the data directive is attached to, copies Z from the GPU memory
back out, and that's what we need for this particular problem.
When you compile OpenACC for multi-core CPU, the data directives are ignored, because they're not needed in that environment.
CUDA is a parallel programming model for NVIDIA GPUs. It uses language extensions and a library.
CUDA exposes more of the capabilities of the GPU than any other model.
Here's the CUDA example. We have the special syntax.
We have a "__global__" attribute on the function that we want it to run on the GPU.
And farther down we have the triple chevron syntax
which is used to launch the kernel to tell [the compiler] "this is where I"
"want you to call this function on the GPU rather than CPU." It has this special syntax that you need.
Within your kernel code there are magic variables so that you can identify the thread that you're on.
Then it has the ways of doing the data movement.
So we allocate memory on the device and copy memory to and from the device.
The next one I'll mention is Kokkos. It's a parallel programming model for performance portability on both CPUs and GPUs.
It is a C++ template library. So you write your code against the library using templates.
It's built on top of a whole bunch of different backends, including CUDA, OpenMP, Pthreads,
HPX, HIP, ROCm, SYCL, and there's a sequential implementation, and there's a couple more that are
Kokkos is open source and liberally licensed.
So our Kokkos example.
Because it does support GPU backends we have to
deal with data, with managing memory. Kokkos::View both manages the memory, and it and allows you to view your data in different ways.
deep_copy is used to transfer the data.
Kokkos provides several parallel algorithms such as the parallel_for that we use here.
And the last one I'll talk about is C++17 parallel algorithms. I think they've been out enough
you've probably heard of them and know something about them. But essentially C++17 defines
execution policies. As of C++20 there are now four of them.
And there are overloads to a whole bunch of the standard algorithms that take an execution policy.
So if you want to run one of your algorithms in parallel, you pass it
the "par" execution policy as its first parameter.
The implementation status
Where are these available?
Intel came out with one of the earliest implementations.
released it as a separate library. It has never been shipped with any compiler as part of a standard library.
All the algorithms are in a different namespace, in the "pstl" namespace. It's based on TBB, or Thread Building Blocks,
which is a library implementation of parallelism that I haven't covered in this talk.
Microsoft was the first of the mainstream compilers to ship an implementation.
All the algorithms are available with the
execution policy overloads. Of those, 39 of them have been parallelized
including all the most important ones: transform_reduce and sort and the scans and for_each and reduce.
16 of them they decided it's probably not worth it to run in parallel
so they haven't done a parallel implementation of those. And nineteen are still to be done.
Others: libstdc++, which is
under the GNU umbrella and and ships with GCC.
There's an experimental implementation of the parallel libraries available in GCC 9.
But it depends on TBB and it requires that you have TBB installed somewhere.
libc++ is under the LLVM umbrella. It's the default library on MacOS and is available with clang.
The parallel algorithms are under development.
If you get their dev branch, there's a an option you can pass to the compiler to make them available.
They're sequential by default, but there is a parallel back-end that is based on TBB.
You'll notice that TBB is mentioned here a lot. That's because
libstdc++ and libc++ both used the Intel implementation as the starting point
for their work.
But neither of them has written a
parallel back-end that doesn't use TBB yet.
non standard library implementations.
Thrust is a CUDA-based implementation for a NVIDIA GPUs.
Their implementation predates the C++17 standard, and in some ways it was the model for it.
But because they built it before the standard there's a number of differences between the Thrust implementation and the standard.
It's in the namespace "thrust" rather than "std", and
because it's a CUDA-based implementation. You have to use a CUDA compiler.
HPX is a
library for all sorts of things in the HPC and the high-performance computing space
It includes implementations of the algorithms in the "hpx" namespace and has some extensions
for the execution policies. You can configure them, like attaching them to an executor or making them asynchronous.
SYCL implements the parallel algorithms on GPUs if you feed it custom execution policies.
And at PGI, we have a project under development that I'm working on which we hope to
release in the first half of next year, if all goes well,
where we will run the parallel algorithms,
we'll run them on the GPU with the standard execution policies.
And so going back to the example with the parallel algorithms
We use the algorithm "transform" and pass the execution policy "par", which means please run this in parallel.
The inputs are X and Y.
The output iterator we send in is Z, and the lambda has the transformation we do, which is "a" times X plus Y.
That's it for the survey of
Now we'll go through some bigger examples. There will be two code examples.
I will show you the sequential code plus several parallel versions for each of the examples.
I will show only the main loop, not the entire program. I will highlight the differences between the versions.
And please do not try to understand all of the code. You'll fall behind and miss the main points.
What I want you to pay attention to is how easy it is to write or read that code.
I've spent some time optimizing the code, but I didn't try to eke out every bit of performance.
Because in this talk, I'm looking for order of magnitude differences in performance. Not that last 10%.
The fine print on the performance measurements...
All the measurements were done on a dual socket Skylake machine with a total of 40 physical cores and
an NVIDIA Volta GPU
and more RAM than I could ever possibly fill up.
The details about the compiler versions and the command line options I used are there on the slide.
On to the first example
The problem is a salesman has a territory and in that territory has some number of cities and he wants to find the
shortest route that will visit every city.
This is your classic NP-complete problem. There are no shortcuts to solving it.
The example code that I have is a brute-force implementation that checks every single route to find the shortest one.
Here's the sequential code. We have a function called "find_best_route".
It takes a parameter, which is the
distances between the cities, an array of distances, and the number of cities, and it returns
the best route.
"Distances" is really a two-dimensional array,
but since C++ doesn't handle two-dimensional arrays very well, especially when they are dynamically sized and passed as parameters,
it's passed as a one-dimensional array. This will get better when mdspan makes it into the C++ standard.
Before we go on with the code, there's some helper code which is the same for all the different versions.
"route_cost" is a pair of route ID and cost.
The default constructor constructs one with a maximum possible cost.
functions to compare them to see which is least costly. There's a regular function "minf", and there's a function object called "min".
They do the same thing, but in some cases we want a regular function. In some cases we want a function object.
"route_iterator" calculates a route. You pass the route ID and
the number of cities into the constructor. You call "first" to get the first city. You repeatedly call "next"
to get the next cities in the route until
"done" returns true. That's when you know you're at the end of the route.
This is all the code for route_iterator. Don't read it now. It's just here
so it's in the YouTube recording. you can read it there later if you're interested.
So back to the code for the main loop.
If there are "N" cities there then there are "N" factorial
unique routes that visit all the cities.
So we iterate over all the integers between 0 and N factorial,
and for each route we iterate through all the cities on the route.
Add up the costs of each of the hops between the cities.
We compare the total cost of the route against the best route so far and when we're done, we return the best route.
Okay before moving on I want to look at this code a little more closely.
The green code is easily parallelizable.
It doesn't depend on any iteration. It's local there. It doesn't write to any variable outside the loop.
You will see this exact same code in all the different versions of this problem.
The code in red is the problematic code. It writes to a variable that's outside the loop.
So if we were to paralyze this naively we would run into race conditions and have problems and it wouldn't work right.
This is a reduction where we are looking for the minimum value among the set of all the costs.
Getting that code
to work is where all the effort will go in the parallel versions.
This is not a very interesting graph because we're not comparing different versions.
I want to introduce the graphs. All the graphs will show speed up against a reference implementation.
So taller means faster. Smaller means slower.
Green bars were compiled with the PGI compiler.
Purple bars with GCC, and blue bars, which will come later, are compiled with NVCC which is the CUDA compiler.
The reference implementation for this problem is the sequential code
compiled with the PGI compiler, which took 22 minutes 40 seconds to calculate
6.2 billion possible routes between 13 cities.
The GCC version was a little slower taking 27 minutes 41 seconds.
So now on to the version traveling salesman that uses C++ threads
to run it in parallel on the CPU
First we ask the
implementation how many threads it thinks it can support by calling "hardware_concurrency".
We have a container to hold the thread objects.
We add the threads to the container as they are created, and then at the end we join all the threads.
That's how the main thread knows that they are done.
Each thread keeps track of the best route found by that thread.
At the end of each thread use a mutex to update the overall best route.
This lock is only acquired at the very end of the thread, not in each iteration. So there should be little or no contention.
And so we run this...
The PGI compiler is
43 times faster. The GCC version is 30 times faster.
These are within the ballpark of what we would expect on a machine with 40 physical cores.
The OpenMP version. This code is identical to the original sequential version except for the two additional pragmas.
This takes advantage of OpenMP's user-defined reductions.
The first pragma defines the reduction. It tells the compiler how to
find the minimum of the route cost objects.
Then the second pragma, we put on the loop. It says please run this in parallel and
there's a reduction inside the loop that fits the one that I just defined.
Those are the instructions to the OpenMP compiler.
PGI's OpenMP implementation does not support user-defined reductions
so I only have results for GCC.
Here it's a little bit faster than the GCC manual threaded version,
32 times faster than the sequential.
OpenACC doesn't have user defined reductions.
So the code needs to be changed to do the reduction manually, which, frankly, is a pain.
The first highlighted pragma specifies the number of threads to use which is not recommended practice for OpenACC. That's normally left to the compiler.
Lower down, to manage all those threads the main loop is broken into three nested loops.
There's a temporary array to hold the best route for each thread.
When all the GPU threads are done the CPU looks through the data structure to find the best overall route.
As I said earlier OpenACC can be compiled either for multi-core CPU or for GPU, so the bar on the right is
OpenACC for CPU for multi-core. It's similar to the other multi-core runs.
When we compile for the GPU
things get a lot faster.
Runtime is down to 1 1/4 seconds, down from our original 22 minutes.
In Hanna's compile time regular expression talk on Wednesday, she showed a 100 fold speed-up over C++ standard regular expressions.
Here we have made the code 1,000 times faster.
Though I cheated by changing the hardware,
Switching from a CPU to a GPU. I still think Hanna's speed-up is more impressive because she did it on the same hardware.
This is the part of the CUDA code.
and this is the rest.
I won't try to explain it all, but I do want to point out the how the
reduction happens. It's broken into four different steps.
First, each thread keeps track of the best route for just that thread.
Second, all the threads in the warp work together to find the best route for that warp.
Third, one of the warps gathers the best route from all the other warps,
finds the best route for the block, and stores that in a temporary array.
Finally, the host code goes through the best route from each block to find the best overall.
That's a lot of code to add to your program to get it on the GPU.
But it sure runs fast.
We're down to 1.1 seconds.
Because Kokkos has GPU backends, we have to use
Kokkos::View and Kokkos::deep_copy to manage the memory.
I want to point out there's very little memory for this just the distance array takes up only
So it'll fit in the cache, but we do have to worry about how to get it to the CPU
So we have the View and the deep_copy here.
We use the parallel_reduce algorithm.
Koko's has a reduction operation that keeps track of both the value to be minimized and an identifier
which is exactly what we need for the traveling salesman.
threw away the route_cost class for this one and use Kokkos's MinLoc reduction directly.
Kokkos has many backends.
I'm going to show you results from the OpenMP back end and the CUDA back-end.
The bar on the right is for OpenMP.
The graph is scaled so you can better see the CPU results.
The Kokkos OpenMP is on par with the other
CPU multi-core runs.
Now let's add the results for the CUDA back-end.
390 times faster is normally a great result, but it's several times slower than the other GPU results.
So I talked to the Kokkos developers.
They pointed out that Kokkos is
by default, it's tuned for typical scientific applications, which have a lot more memory access than
The Kokkos developers sent me a patch which just this week was released in Kokkos 3.0.
So it's available. They told me how to give hints to the Kokkos scheduler.
The text in red shows the change from the previous run.
This configures the policy that we're passing into
giving the hint that the algorithm is computation heavy
So when using this version,
the Kokkos CUDA back end now performs as expected.
The final version for traveling salesman uses the C++ 17 parallel algorithms.
Our function is now a single call to the transform_reduce algorithm with the parallel execution policy.
The input is a sequence of integers from 0 to N factorial.
"counting_iterator" is a helper class
that really should be in the C++ standard, but you can get it from Boost or Thrust or various other places.
"counting_iterator" makes it look like you have a container full of consecutive integers without having to create an actual container.
This is the initial value for the reduction.
"route_cost::min" is the reduction function. In this case we want the function object, not the regular function.
This lambda is the transformation function.
It converts the integer, the root ID, into the cost of that route, or actually into a route_cost object.
Once you're familiar with transform_reduce, this code is at least as simple as the original for loop version.
If you watched Connor's "Algorithm Intuition" talk on Monday, you will understand the importance and the beauty of transform_reduce.
So we can compile this code with GCC 9.2 to run in parallel on the CPU
It is 33 times faster than our sequential version and
similar to most of the other
CPU multi-core versions.
We can also compile it with a development version of the PGI compiler
so it'll run that same code with the parallel algorithms on the GPU.
So we're down to one second flat to compute 6.2 billion possible routes between 13 cities
1355 times faster than sequential version, with code that is clear, portable, and standard.
So on to the next example
This is a very common example in the scientific computing community.
Jacobi Laplace heat transfer thingy. I've seen a bunch of different names. I'm not sure which name is most often used
The problem statement: given a metal plate or some other two-dimensional object
where the temperature of the edges is known, calculate the temperature of the interior.
Laplace refers to the math equations that model the system, and Jacobi refers to the method of solving those equations.
The algorithm that we will use:
Represent the metal plate as a two-dimensional array of temperatures with constant temperatures on the edges
Simulate the heat propagating through the plate by changing each interior point to be the average of the temperatures of its four surrounding points.
And then we keep iterating.
We keep doing step two over and over again until the temperature change from one iteration to the next is below a certain threshold.
So here's the sequential code. We have a function called jacobi_solver. It takes
"data" as an input which is an M by N
two-dimensional array. It takes a threshold. max_diff is our threshold that we're looking for
So when the changes are less than that, then we're done
It returns the temperatures of the plate back through the "data".
So - it modifies the "data" array in place and returns the number of iterations that it took before it converged.
We need to allocate a second array and copy the data into it.
Only the edges are important, but it's easier to just copy the entire thing
Then the main body. We iterate over just the interior points.
So you notice the index goes from 1 to M minus 1 rather than the usual 0 to M
And then we set each point to the average of the surrounding points
And at the same time in the same loop we check how much the value changed and see if it was greater than our threshold
At the end of each iteration
swap the pointers to the two arrays and
If any value changed more than the threshold, keep going and do the next iteration.
Once we're done iterating, if the last iteration that was run wrote to "temp",
then "temp" needs to be copied back to "data" because "data" is how we're returning the results of the user.
If you recall Andre's keynote about optimizing sort,
his advice included avoiding branches in inner loops, and to break out of loops early wherever possible.
This code does not follow those guidelines
We can avoid those pitfalls by changing the code to have two separate loops
The first one does the transformation. It doesn't have any branches in it
The second one checks the threshold and breaks out with a goto as soon as possible.
I said at the beginning I don't care about 10% performance changes, but this little change is much bigger than that.
When compiled with GCC this is 30% faster. When compiled with PGI, it was almost twice as fast.
I'm not entirely sure why.
My guess is that by taking the branch out of the loop, it allows the compiler to much better vectorize
that first loop.
So again, this isn't very interesting because it's just one performance model. It's the same thing. Green is PGI.
Purple is GCC. Blue is NVCC.
For the sequential, it took four and a half minutes for a 12,000 by 8,000 array.
I set the threshold so it took 3,891 iterations to converge.
And this time GCC was a little bit faster than PGI, so it's our reference implementation for this example
The OpenMP version. Mostly we just add "pragma omp parallel for" to all the loops.
Except for the one that checks against the threshold
OpenMP does not allow you to jump out of the loop so we can't do an early exit
Instead we mark the loop as a "logical or" reduction.
For the sequential version it was much faster to have iteration code in separate loops. For OpenMP, it's exactly the reverse.
If we combine the code back into one loop which has a "logical or" reduction,
Compiled with GCC, it's 35% faster. And PGI is 25% faster.
This example proved a lot more interesting from how to code it up and how to optimize it.
So here's the performance for OpenMP. GCC did a lot better. It's 11 times faster
than the sequential version.
we saw speed ups between 30 and 40 times for the
multi-core CPU versions. This time we're only seen 11. The difference has to do with the memory usage.
Traveling salesmen used 676 bytes of memory. This one uses about 800 megabytes.
That is where you access all 800 megabytes each time you're doing an iteration. So there are cache issues here.
We could probably improve things by changing how we access the cache and reordering things, but that was beyond
the scope for this talk
The OpenACC version. Similar pragmas on the loops, except we added "collapse(2)".
That tells the compiler to take the two for loops and
collapse them, to iterate them over in just one box. Basically collapse them into a single loop.
The two loops are more clear in your code, as you're writing your code.
But the compiler can parallelize it better if you have a single loop, so this is telling the compiler...
Basically getting the compiler to do the hard work
which is what you want.
OpenMP also supports a collapse clause. But for reasons that I don't know, I didn't try to figure out,
when you put "collapse(2)" on the OpenMP
pragmas it actually got a lot slower.
With OpenACC it gets noticeably faster.
Because OpenACC can be compiled for the GPU, we have the data directives. We need data directives to manage the
movement of our data
This one says to copy "data" to the GPU at the beginning, and then, at the end of that scope, copy "data" back.
Which is what we want
The "temp" array is only used in the device code. We don't ever need it on the host, on the CPU.
So that one has a create clause which tells the compiler, just make sure this memory is there on the GPU,
but don't ever copy it.
The performance for OpenACC. Remember, we can run this on either CPU or GPU.
CPU version was
7.4 times faster than
the original sequential. When we run it on the GPU, it's down to 5.4 seconds, which is 51 times faster
than the sequential.
So not a thousand times faster, but still pretty good.
Kokkos's policy for a parallel_for can iterate over two different indexes
So we specified down at the bottom. We see in the very last row
We specify from 1, 1 up to M minus 1, N minus 1
It handles that. These indexes don't have to start at zero. So that was nice.
Kokkos::View can also provide a two-dimensional view over your data.
That makes our
our math on the indexes into the arrays simpler and more clear.
How does it perform? Again I'm showing the OpenMP and CUDA backends for Kokkos.
Which are a little bit slower than the other ones in our model.
I have two caveats for this. (1) When I ran these Kokkos runs, the test machine wasn't completely idle.
So the actual numbers were probably a little bit better. I don't know exactly how much. And, (2), this was the
last addition to this talk. It came in just yesterday. So I haven't had time to fiddle with it.
Either me or the Kokkos developers haven't had time to fiddle with it and see... There might be some small changes that could get two
or three times faster, like we saw for traveling salesman, but I just haven't had time to investigate that yet.
And finally the parallel algorithms version
We can use parallel copy at the beginning and end when we just need to copy the whole array
For the main body we use transform_reduce.
As Bryce mentioned here at CppCon three years ago and as Connor mentioned earlier this week,
transform_reduce is the most important standard algorithm. You can do it to solve all sorts of things.
We used it to solve both the traveling salesman and Jacobi
Our input is using a counting_iterator again. We iterate over all the indexes.
But we can skip the first and last rows since those are constant. We don't want to change those.
We have a logical [or] reduction
With initial value of false
Sorry, logical or reduction
Because we are using a single index, the math to index into the array is a little bit different and
We need to skip over the first and last entry of each row
because those are edge notes that we don't want to change. So we have to
do a mod N and see if we're at the first or last entry
So so far we've gone back and forth about whether it's faster to combine the two loops or to keep them separate
You would think that since the combined loop was faster for all the other parallel examples
It would also be faster for parallel standard algorithms, but no
For reasons I can't explain
Separating the code into parallel for_each and parallel any_of is about 10% faster
So this is the one that I did the performance results on
Again with GCC 9.2
compiling for multi-core CPU
we get a pretty good result: 6.8 times faster, in the range of all the other multi-core CPUs.
Using the PGI compiler for the GPU
We've got 37 times faster
That's good. Though not quite as good as open ACC.
So this one we have some more variety in the scale and the range of our performance results
Before I get to the conclusions, I want to give some thanks to people who helped me with this talk
Christian Trott and David Hollman from Sandia helped with the Kokkos.
updated me on how libc++ implementation is going. Billy O'Neil
informed me about Microsoft's status. Gordon Brown
helped me with SYCL.
my coworker at NVIDIA wrote the OpenACC versions for both of the main examples.
Daniel Tien my coworker at NVIDIA is my partner in crime on implementing the parallel algorithms
in the PGI C++ compiler for the GPU.
And we would like to make this team of two be more, so we are looking to hire
more people to work on our PGI compilers. So if you want to do C++ compiler work
contact me or Bryce or anyone else at NVIDIA.
So in conclusion
Parallel is faster than sequential
If you have code that is safe to run in parallel,
I encourage you to try it out. Try out one of these models and
see how much faster you can get it.
Second, parallel on the GPU is faster than parallel on the CPU
So if you have a GPU handy, try it out. See what performance differences you get.
In my opinion, the C++ 17 parallel algorithms have good performance versus readability trade-offs.
Which means you can get good performance without having to sacrifice
clarity, maintainability, or portability (once implementations are more widely available).
>> Hi there, can you go back to your first transform_reduce example
Or really any of them that have counting_iterator
Are you aware that there's UB in this example?
And if you are, have you filed the defect report so that we can fix it?
>> No, what's the UB?
>> The parallel algorithms require forward iterators in the standard and
counting_iterator is like a stashing iterator. Like it doesn't...
>> I made it.... It's a random access iterator.
>> It doesn't like... You can't store an address to the number, increment the iterator ...
>> Oh, right. Its reference type is not quite right.
>> So have you submitted a defect report to make this work?
Because I agree it should work
>> No, I have not
>> Please thank you.
>> The order of magnitudes are such that I don't know how meaningful the difference is, but i'm just curious
How do you normalize the performance of the GPU you're choosing and the CPU? Is it based on cost or what?
>> What do you mean by normalize?
>> I mean so you could get the cheapest CPU you you could find and the best GPU you could buy. How do you
make that so it's kind of a
comparison where you...
>> I haven't tried to do those computations or figure out exactly what variables you would measure.
Off the top of my head
It depends on how much code you can get to run on the GPU. A lot of applications... Most code out there...
Like I said at the beginning, most code out there is not parallelizable easily.
If your code has very little on the GPU then you want to put all your money in the CPU.
If you're a massive scientific simulation where you can run everything on the GPU, that's where you're going to put your money.
So I think it'll depend on
>> Alright, thank you
>> Thanks for the talk. Have you ever measured the performance across multiple GPUs?
>> No I have not.
This was all single GPU.
>> Thank you
>> Yeah, very interesting talk
What would I do if say I wanted something like the FFT which isn't in a standard algorithm?
You know, would I have to use like
FFTW on the CPU and then do something else for the GPU to take advantage of that common code using the PGI compiler?
>> Are there GPU versions of FFTW?
>> No, I don't think so. It's CPU only. I was asking how would you kind of handle that like you just want to do an
FFT and you know
make it choose, I guess.
>> It's a complicated question
Actually, the person sitting across from me worked some on FFTW
There are people in my office who who can answer this much better than I can.
But if there's no GPU version available you would probably have to bug the people who maintain the libraries and
Get them available. I'm sorry. I don't have a good answer for that.
the question is
whether you can see an easy way to run parallel algorithms in parallel
so that say you have a partially shared data and
would like to
parallelize the parallel parts also
to utilize your hardware better.
>> Taking a parallel algorithm and invoking another parallel algorithm within it is
many many times harder than just getting your parallel algorithm
>> So I...
In CUDA there is a way to specify a flag to get defaults in...
>> In CUDA it's possible to launch new kernels from within your device kernel ...
>> I'm not talking about dynamic parallelism.
>> That's not widely used
>> I'm saying that there is a flag for compiler
to give you a default stream per
CPU thread, so in OpenACC I haven't found such an ability and had to
asyncs and then introduce synchronization by hand
>> People have had success doing
Parallelism on the CPU and then having each of those threads invoke parallelism on the GPU
By having OpenMP for the CPU parallelism and OpenACC for the device parallelism
>> Right, but they utilize the same stream by default
>> I am not familiar enough with the details of how that works and the complications of that. But in general
nested parallelism is very hard. It's much harder than your regular
single level parallelism
>> Thank you
>> Hi David. Thanks for including us in this study. And and thanks for the talk.
noticed looking at your examples trying to compare complexity and readability
that a large part of the difference at least in my mind for both OpenACC and
Kokkos was the explicit memory management
But either of those models would have worked just fine with unified memory
It's just that those models also support non unified memory
And so if you want to be generic over that you have to write the extra stuff
So the question that's associated with this is: if you were to add in the explicit
requirements to do non-unified memory in the parallel algorithm case,
do you think that the readability would be a little bit more on par?
>> As David pointed out there are advances in memory management for GPU programming
Unified memory gives you at least some... Right now, the heap is shared...
The CPU and GPU have the same virtual address space for at least part of the memory, often
what's on the heap, and
the OS and device drivers will take care of moving the memory automatically and you don't need your explicit data movement directives.
the models that have explicit data movement would be easier if you can get all your data into the manage memory and
the parallel algorithms would become more complex if you had to explicitly specify data movement. So yes, that is a valid point
>> Excuse me, one more question, whether you have an explanation why
the pure CUDA implementation
showed worse performance than ...
I am not a CUDA programmer. I wrote it myself and I didn't spend time trying to optimize it much.
I know specifically that the final loop of going over the array to find the best route at the end, that's done on the host.
It would be faster if I moved that to the GPU.
That's one example of where I could improve the performance in the CUDA example.
I would expect the CUDA version... If I spent time
optimizing it, or got a CUDA expert to spend time optimizing it,
I'm sure the straight CUDA example could be at least as fast as any of the others.
Thank you very much for coming