- Good afternoon everyone, my name's Jefferson Amstutz.
I'm a software engineer at Intel,
and I'm here to talk to you about SIMD
and doing so in C++ type system
with SIMD wrappers in C++.
So just so you kinda know where I'm coming from,
I do come from Intel, but I'm actually a library maintainer
for a Ray Tracing library called OSPRay.
It is focused at scientific visualization,
doing Ray Tracing at really big scales on HPC clusters.
It's real fun.
So just know that I'm trying to come from a perspective
of having to use these things.
It's not, I don't just implement this every day.
This was kind of like a side project
that turned into something that was mildly useful
from my own stuff.
So the agenda for today is to kind of go
over some expectation management and an introduction.
Then we're gonna talk abut why you could care about SIMD,
look at what SIMD wrappers are and then some examples
of using these to accelerate your code.
So expectations, number one is
talking about x86 optimization is probably
a multi-hour talk in and of itself.
You have actually been to talks here at CppCon
where people dive into the details of assembly code,
and that's fun, it's great, there's just not enough time
to cover what we're gonna cover today and do that.
It's also not a talk that we're gonna,
that I'm gonna talk a lot about
specific CPU instruction sets,
so ISA means instruction set architecture
and the difference between SSE, AVX, AVX2, AVX512,
maybe a different CPU ISA.
That's, again, an entire talk in and of itself.
And another thing I don't want to convey
is that SIMD wrappers are the only way
to get SIMD vector instruction generation out of your code.
If you're an HPC programmer and you've
already been using pragma-based approaches,
that's great, keep doing that, and that's a good thing.
If you're using other techniques and you're happy with that,
that's fine too, so this is really focused on this route,
which would be how do I do this expressively
in the type system of C++.
It's also not a talk about all kinds of parallelism.
So there'll be a slide on this,
but we're gonna look at just SIMD,
not threading, not multi-node,
like networking kind of things, just SIMD.
And then the last piece would be,
I did offer a library called TSIMD
that does this but the point of the talk
is to say what are these kind of libraries doing?
What are the kinds of problems they're trying to solve
and it's not trying to sell you
to use my library 'cause it's the best.
So some definitions before we get started,
ISA, Instruction Set Architecture.
These'll be the specific instructions
for the CPU that you're running on.
SIMD means Single Instruction, Multiple Data.
So if you have a single instruction,
that instruction normally you would think of
as applying to a single element,
like an add is two different, maybe two different floats.
SIMD means we can do multiple floats
all at the same time of that one instruction.
SIMT would be Single Instruction, Multiple Threads.
This will be talked about a little bit later.
It's a higher level concept.
SPMD, Single Program, Multiple Data,
is actually something I would say is equivalent to SIMT,
and we'll talk about that later.
I'm gonna use two different groups of terms
that I hope don't get too confusing.
I'll use them interchangeably
because of the way my brain works.
It's this concept of varying and uniform,
in vector and scalar.
So varying in uniform are SIMT or SPMD terms,
saying is this gonna be like a vector's worth of values
versus a uniform says always a single value.
And then vector here, we're not talking about
algebraic vectors, we're not talking about stood vector,
we're talking about a vector register.
And scalar means not a vector register,
so like a plain old single value.
And then lastly, if I talk about like a Kkernel of code,
this is code that's gonna be applied to more than one value.
So you can think of it as like a function
that's going to be operating on multiple pieces of data.
Okay, so what kind of problems benefit from SIMD anyway
because we need to know why we would use this at all.
Compute intensive codes that have a lot of computation to do
are where you're gonna get most of your benefit.
So SIMD is less likely to help performance
if you're memory bound, if you're disk I/O bound,
if you're network bound, if you're pure latency bound.
Basically if you're not bound
by anything other than computation,
it's not like you're gonna get zero speed up from SIMD,
it's just that optimization is a lot of factors
and this is just one of them.
Okay, so I'm gonna make some assumptions.
If you decide to go home and start using these libraries
about when you're optimizing code,
make sure first that you're testing your codes correctness.
If you go and rewrite your function into something else
and you get different answers,
it's not your ISA, it's your code's probably wrong.
So tests are important.
You're carefully measuring importance.
Again, optimization is really hard,
and it's very multifaceted no matter
what hardware you're targeting.
So SIMD is not always just the only answer.
It's an answer and one to pay attention to,
but it requires careful measurement
for you to make good conclusions,
and that there are always people out there
who have a very deep understanding of an instruction set
and always will tell you that they
can hand code something at assembly faster.
And that's cool.
It's just not where I think the average programmer sits.
So if you're one of those types of programmers, that's cool.
This talk's probably not for you.
Okay, so let's look at a little bit of background
for how we are where we are and need
to care about things like SIMD.
So life in computing circa 2004
was life where CPU frequencies were just getting faster.
Transistors were increasing,
single-thread performance was going up,
so you've probably heard this before
over the last 10 years as the free lunch
and that this free lunch is over because of power.
So there was a wall that was hit
where by adding more transistors and trying
to shove more electricity through the chip to clock faster
that that was not getting good return
on that power investment anymore.
We call this the power wall.
So it was what we called Watershed Moment
because all of a sudden we went from just rely
on single-threaded execution to go faster for my code
to now falling down a different mountain of now I need
to extract parallelism because I can implement
parallel constructs in hardware with less power budget.
So that's how you take advantage of more registers,
or sorry, more transistors.
So after 2005 and all the way up to today,
this graph, the trends extend all the way to 2018,
even though that thing stops at 2015,
is that transistors keep going up
but they're being spent on parallelism,
more cores, wider SIMD, than on just frequency scaling.
So if we just briefly look at different generations of CPUs
that you can find out in the wild.
The left column there's a really old CPU
at 90 nanometer single core, nicely clocked,
but small amounts of cash, not a whole lotta SIMD,
like 128 bit, but as I think in the second column,
V4, that's Broadwell.
And we go from one core to 22 cores.
We go from basically no SIMD to having FMA
that's dual 256 bit register operations,
and Xeon Phi made that even wider at 512 bit with 68 cores.
The story here is as new processors come out
parallelism becomes ever more important,
and so we can visualize this.
Here's, let's take a 12 core CPU,
and if you just use one single core on that CPU
to do your compute and you have no SIMD,
that's about how much of the area of the chip you're using.
If you then figure out how to multi-thread your application
and use every core, let's say all 12 cores,
that's about how much of the chip you're using.
Then if you also can vectorize your code,
you're gonna end up using all of the compute available.
So obviously there's caches, there's memory subsystems,
there's other subsystems that talk
to the rest of the machine at the edges of that chip.
But there's a big difference
between the far left and the far right images
for how much of the chip
you're actually taking advantage of.
So parallelism comes in many shapes and sizes,
instruction-level parallelism, threading,
through implementing those in multiple cores.
So, but we're only gonna, again, gonna focus on one,
and that's vectorization.
So just know that there's lots of topics about parallelism.
This is the one we're talking about.
So data-level parallelism is what SIMD really is.
The idea is, I can have multiple pieces of data
that I compute at the same time
that I could normally compute a single value.
And so if we take this little loop over eight values,
like in an eight-length array,
and I were to loop and do the same thing,
like this alpha times B of i plus C of i to store into A,
that that would look something like this.
You know, you would just loop
and do a single value at a time.
And so this is what we call scalar execution.
So the scalar registers only operate with single values.
These are what you're used to seeing all the time.
But SIMD is a little bit different.
In x86 we represent them with different registers.
Those different registers
can have different widths of values.
And I know this is lots of data.
When we get to what SIMD wrappers are
hopefully this density on the slide
will make a little more sense.
So you don't have to grock all of this right away.
But the point being is these registers
store more than one value,
and that there's special instructions
for manipulating these registers
that are not the same as scalar instructions.
And so SIMD parallelism in general
is something orthogonal to threading.
So threading on a CPU is about what is the stream
of instructions that's gonna go into a core?
SIMD is about, I guess there was a timer on that one,
SIMD is about those instructions that go into the core.
What are they?
Are they SIMD or are they scalar instructions?
So before we get into SIMD wrappers themselves,
I just wanna call out the spectrum
of what SIMD versus SPMD actually is.
And so on the left-hand side I'll say SPMD and SIMT
is about really expressing generic data parallelism
that isn't specifically mapped to a particular ISA.
So this would be like a logical SIMD.
Like I can say I have a SIMD register of like 64 values,
even though of like 64 floats,
even though that doesn't exist in hardware anywhere today.
So if I wanna express my parallelism generically,
think SPMD or SIMT versus when you are,
it's specifically talking SIMD,
this is very specific to the features
that are in an instruction set,
you'll want very exact types to map
to very specific registers 'cause you're programming
to a very particular CPU.
And there's, the different libraries that are out there
are on this spectrum in different places.
Okay, so let's look at a tour of
what SIMD wrappers actually are.
So there's a number that are out there in the wild.
So these are some of the highlights
that I have picked up off of GitHub.
I will note in particular that Vc
is the library that's being used to prove out
the proposal to the C++ standard
to get these in the standard library.
So if today you haven't used these at all
have a look at Vc.
That's the one that's most likely
to make it into the C++ standard.
But at the same time, all of these libraries
are doing the same thing.
I wrote tsimd, the last one there.
And they're all kind of doing the same role.
There was one called boost.simd
that the NumScale folks have decided to pull down to rewrite
to have a kind of a different feature set.
And then there's tons of other ones out there
that are usually embedded in the projects
that are using them.
And it's kind of unfortunate,
but I think standardizing this in the C++ standard
is a great way to fix it.
Okay, so when you're looking at a SIMD library,
there's some attributes
that normally people have issues with,
but I think in the case of a SIMD library,
they're actually very good.
And the most important one I wanted to call out
is that generally these things are header only
because the point of abstracting of register
is for it not to have to be buried under a function call.
You actually want this to be inlined and optimized
to the nth degree.
And so the SIMD library centered around,
kind of three key components,
and we're gonna go through each of these.
First is I can use a type to represent
what a SIMD register is.
The second is how do I get these kinds of types
from memory into some kind of a local function space
that I'm computing with and write them back out?
And then the third would be what are functions
and operators that I have to manipulate these registers?
So the syntax to use them looks like natural compute
syntax that would use with like standard scalar types.
Okay, so the core abstraction I'm going to spell out,
you'll find this in TSIMD, but all these libraries have
this same core abstraction of I have packed values
that are in a register that represents
a SIMD register's worth of these values.
And that the T is the type
of the element in the SIMD register.
So example would be a float, could be an int,
like a char or a short.
The width being how big is the SIMD register.
And so these two things together
give me an abstraction over that,
but I will note that this is a logical SIMD register.
This is not enforcing that it is a particular ISA,
or it's a type that fits to an ISA well.
And that's where some design,
low-level design trade offs of these libraries end up,
but just understand that this
is how we're gonna mentally say this is the core type
that we're gonna be manipulating.
And so the way we can think about these
is kind of different methods for naming these types.
And we can do that with type aliases.
For instance, if I wanted to say I have a vector of floats
that are four wide I might call that vfloat4.
That's not the way these all are spelled
with other libraries but this is the way we're gonna,
I'm going to, this is the naming scheme
I'm gonna use for this talk.
So a vfloat4 would be a pack of float that's four wide,
and that would look like this logically,
that I have a register that have 32 bit floats,
and I got four of 'em for 128 bits total.
So another one might be a vint8.
That would be this pack of 32 bit ints,
that I've got eight of 'em, and that would look
like a logical register that's this big.
So that's very concrete.
That's specifically concretely naming
both of the template types,
the element type and how wide it is,
but there's other things we can do, too.
Note that there's some generic code
that might do width independent code.
So I might say I need to work with a float,
but I'm gonna template over how wide the register is
because I don't need to specify that much.
So for instance, in TSIMD this is how we implement
like different math functions like sine and cosine.
Like, it's the same algorithm,
I just don't care about how wide it is.
I just need to work with registers of them.
And then you can go even further
and say depending on what I'm compiling for,
I don't even care how wide they are.
If I just say vfloat, vint, vchar,
I'm going to get the best,
the best for what I'm compiling for,
and what that means can have some subtle trade offs.
So if I'm compiling for AV or AVX2
my vfloat might be the exact same type as a vfloat8.
It's really just manipulating type aliases
to that core pack of T of W.
Okay, so together we have a big old family way of
naming these types based on what our constraints are.
Some are more specific than others.
Okay so I've talked about logical SIMD registers.
But we still have to deal with
what are we actually gonna run on.
And so if I have a physical SIMD register
in my instruction set,
what on earth, how do I store all these things?
And so if we take an example of a 256 bit register in AVX,
you know, what on earth can we put in them,
and the key point is these SIMD ISAs,
when you go and look at how they're used,
they're defined by the bit width of their registers,
not of elements and number of elements.
And so what you could store in this exact same register
is a bunch of 32 bit elements.
Could be ints, could be floats.
some less 64 bit elements
or a ton of eight bit elements.
It all just depends on what the ISA supports.
So for example, in SSE I can natively represent
a four wide float, a four wide 32 bit int,
or half sized for longs and doubles.
AVX added the ability to eight wide floats
and four wide doubles,
and AVX2 was what's also do eight wide ints and longs.
So note that the first generation of AVX
was just about widening floats
and then AVX2 brought in integer support as well.
And then AVX512 said let's do 16 wide for everything.
And as you go up in those ISAs
those CPUs can also execute the ISAs that came before them.
So I can still run SSE code on a CPU that supports AVX512.
All right, so hopefully you have an idea
of what are like the physical constraints
of the instruction set that I've got.
And maybe how I would logically map a type that's a packtw
to one of these registers.
But there's one lingering thing
that needs some, needs some love.
And that's what is vbool?
So the way that you typically,
or at least the way I think about it,
is that a vbool is tracked for the size of the element
that you're doing logic with,
and the width of the number of elements
that were in that register.
So it doesn't closely track to the number of bits
in the register but rather what
are the higher level elements and number of elements
that I'm comparing with?
So for instance, if I have two vfloat4's
and I want to do a comparison of a < b,
in this view of the world, you would then get a type
of vbool which is a pack of 32 bit wide bool in four.
Conceptually all these libraries do the same thing
the way that's spelled, whether you use
a different type than pack to represent vbool,
that's all there.
Just understand that vbool is this about the logic
of the registers you're doing,
not necessarily tying to particular register size.
So I internally, mentally think that for like an x86ish hint
I usually mentally map vbool to a vint
and just ignore the fact that it
might be stored differently.
So an AVX512, all ISAs all the way up to AVX512s
masks these vbools, it's another name for it,
that these vbools were really just ints underneath the hood
and then on AXV512 they turned into bits,
we used bits instead.
So we use eight bits to do eight wide float comparisons
and 16 wide blah blah blah.
So just logic versus physical understand
that those things are lurking.
Okay, so some various design trade offs between libraries
that you're gonna find.
Individual lane access.
So if I have a vfloat4 maybe I want
to access the ith value out of that.
Just like it's a little tiny array,
it's a little container, which logically they all are,
just understand that
that's not always something you can physically do.
Like that's not always, the a sub i
is not always gonna return a reference to that element
'cause it may not even be addressable.
So if we look at AVX512 vbools, those are bits.
We can't address bits in C++.
So understand that you can do that,
but there might be other functions
that it's not spelled with the bracket operator.
It uses insert or extract to make it very explicit
that you're accessing individual elements.
So the same thing from above might also be spelled
in some SIMD library as extract i to get its value
and then insert to set its value.
So anther design trade off you're gonna find
when you have these registers is do I get implicit
or explicit casting between these things.
And that's a vert SPMD versus SIMD trade off.
So if I have a vfloat and a vdouble,
first, that's super ambiguous to begin with
because how wide's my vfloat?
How wide's my vdouble for what ISA I'm on?
Depends on if you're a SIMD style or a SPMD style library,
but when I do some math with that, is c double?
Is doing that implicit conversion a warning?
Is it an error?
Is this a library trying to say you're
only allowed to code to a specific ISA?
Or is this one that is gonna try to do the best thing?
So remember, we have these different ways
of subdividing these registers?
So it's nonobvious what the answer is.
So when you go find one of these libraries
that you're gonna get whether these different masking types,
like if I have mask for 32 bit integer
can that cast implicitly to a mask over a 32 bit float,
or even different bit widths for the mask part?
And I'll tell you in TSIMD, if you go check it out,
that it does everything implicitly,
like it tries to make in the above example,
it tries to make this a vdouble
because if those were just plain floats and doubles
it does the right thing,
but just know that that's not always universal,
and they have good reason to choose that,
like if their library is making a call
about how explicit it wants you to be.
And another thing that can be a design trade off
that's kind of a usability thing
is what does it mean to have a logical register
that's wider than any physically sized register
that you can have.
So for instance, if I'm on AVX or AVX2,
what does vfloat16 mean?
Is this an error or not?
Like in TSIMD I think it's
completely logical to treat it as two vfloat8s.
So if I do vfloat16 times vfloat16 it'll just be
two vfloat8s under the hood.
It lets you do that.
And then with AVX512 with vfloat16 as native,
as you would expect.
So all I'm trying to call out is sometimes a library
might say it's okay to have just purely logical registers
and I will try to do the right thing
versus, hey, you're trying to call out a register
that doesn't exist for the CPU you're compiling for,
please don't write that.
Okay, so all of this up to now has been about what
is this one abstraction that represents a SIMD register?
This pack, we've got a number of elements that are of type T
that are width W.
And we have aliases for talking about them.
Like what's vfloat versus like vfloatn or a vfloat4?
Okay, so now what can we do with these things.
We haven't actually done anything with the pack yet.
We've just understood what is this abstraction
and how that maps to hardware.
Now let's do some compute.
So one very typical component to a SIMD library
is making the syntax of using them very natural.
So think arithmetic operators.
You've already seen some of that,
like if I have two vfloats
I wanna be able to add them, subtract them, multiply them,
as you would expect.
But you can also do logical operations
like greater than, less than, et cetera.
One thing that's really important to understand
is it's a design trade off for a library
if they provide ampersand, ampersand,
or the bar bar for ands and ors,
because as soon as you overload those in C++
you don't get all the shortcuts that you would expect,
like the early termination
for when logical statements get evaluated.
Like if I say false and then some expression,
like that's always gonna evaluate to false
and the compiler will just
not evaluate the rest of the expression.
When you have a logical SIMD vbool that you're doing
these kinds of ands and ors with that doesn't make sense
and so it can be a little misleading.
So it's, again, a design decision that those library's
will make, but just be aware that you can still do that,
but you have to do it with bit wise mask operations.
So specifically single ampersand,
single bar for ands and ors,
because that's actually what the instructions
are doing underneath the hood.
And then in core C++ when you're dealing with scalar types
and in bitwise and and bitwise or
don't have those shortcut semantics,
so you don't have to worry about it.
It's the same kind of readability of code.
Just it's a little bit of a trickier thing.
Okay, so we can compute these things.
What about getting them in and out of memory?
There's a set of functions that we use
to get values out of arrays and that they have
different semantics, different performance trade offs,
but they all are trying to do the same thing.
First is load.
So if I have some vector defined somewhere of integers,
I might want to load a vint4, so four values from that,
from like the first four values of that array,
of that vector.
And so what that looks like is I got values
stored somewhere on the heap and I wanna load these
and now I have a local vint4
that I can do all my operators and stuff with.
And so store is like you'd expect.
It's the same thing but the opposite direction.
I'm gonna take an entire pack and stick it
into wherever the destination is.
And so, that's great but what if I have values
that come from different places in the array.
This is what gather and scatter are for.
So gather would take, I have offsets,
so if I have a base pointer I'm gonna use
and different indexes off of that pointer,
then I can gather values into a register.
So what that looks like is if I have those offsets
I can go find them and bring them
into one coherent SIMD register.
And scatter, as you would guess,
just goes the other direction.
So, again, I have offsets.
I've chosen the same ones and right now I can have them,
maybe values I had computed I now need
to write back into my data.
So I have load store and gather and scatter.
But sometimes I need to be able to write
only a subset of those values.
And so that's where you can introduce a mask.
If I'm doing some computation I'm like oh,
I only need like the first four values that I want to write,
or just depending on the natural logic
of what I'm computing,
that's where you can have masked versions of all of these.
So if I wanna do a masked store,
I might have, I wouldn't naturally
probably construct a mask like that.
It would be the result of like doing different comparisons
in my kernel, but just for demonstration sake,
let's take a mask that was on, off, off, on,
and I want to do a mask store.
So then what would happen is my local value
when I go do that store only the first
and the last one would make it into memory.
So that's my way of kind of do conditional reads
and conditional writes.
And you can do that for load stores, gathers, and scatters.
So you get this nice matrix of memory IO functions
you can work with.
Okay, so typically these SIMD wrapper libraries
also have additional functions
that are accelerated for the SIMD types,
just like I can take a math function's floats
from your standard C library or through stood for C++.
I can do the same thing with the SIMD types.
So if I look at a little distance equation here.
If I got some X and Y points from somewhere,
maybe this would be an instruct instead.
But just note that that's the exact same syntax
that I wold write it as if I just used plain floats.
That's the whole point of these libraries,
is to make the expression there at the bottom
look like a normal expression that you would use.
Okay, so the last kind of piece that we will look
at for what SIMD wrapper libraries provide are algorithms,
and these algorithms are not the algorithms
I'm talking about with the parallel STL.
So I'm talking specifically about algorithms
that work with the values that are in a SIMD register.
And so I'll go through a couple examples here
of select, reduce, and shuffle.
This list is not complete by any means.
So select says I've got two different values
and I've got some mask to say I wanna blend these together.
Some libraries call it blend,
some call it select,
where if I have a mask and two different vfloats
I can get a resulting vfloat
based on what those values were,
the combination based on that mask.
So if I have the same mask I used before
for the gather and scattering,
if I have two a and b values,
the result is they're combined based on the mask.
Reduce, so this is something where I wanna take
all the values that are in a SIMD register
and I want to bring them down based on some operation.
Could be min, max, could be a sum.
And these can be implemented by hand.
So if you have the ability to manipulate individual lanes
in a SIMD register you might be tempted sometimes
to just write a little for loop,
and that's very much missing the point,
'cause when you express yourself
with these very specific operations,
there's a chance that the ISA
that you're compiling for might have an instruction for it.
When there's an instruction for it
you'll wanna use that instruction.
That'll be the fastest implementation.
So I'll just implore you to understand
that if you wanna do an interregister operation,
look for an algorithm first.
If it's not there then maybe write
the scalar loop over the SIMD register, but,
and then for masked we have some custom reductions.
So I wanna be able to say like are all of these true?
Are any of these true?
Are none of them true?
So if I have that mask, you know, any is true
'cause some of them are true,
but all and none are both false.
All right, this last one shuffle and swizzles,
they can get really complicated.
The whole concept is if I have a register of values,
I wanna be able to rearrange them.
And sometimes that's compile time,
destinations, and source and destinations,
sometimes it's run time.
Just understand it's really complicated
and I'll implore you to go do more research
about where I would wanna use these.
But that's the fundamental thing that you're doing
with a shuffle swizzle is taking a register
and generating an output register
based on rearranging the elements.
so now we've got hopefully a good idea
of what SIMD wrappers are.
It's a core abstraction, a pack of TW.
I can manipulate those things.
I can pull them in from memory.
And I can store them.
But I wanted to call out a very typical usage model
of SIMD in general, but SIMD library,
those groups of functions that you're gonna use
to manipulate these are arranged for a reason.
And it's 'cause typically when a load values
from that you have in memory,
you wanna do some compute and you wanna storm.
So you'll notice there are functions
that load values from memory.
You have your ways of manipulating registers
with operators and functions,
to do whatever algorithm you're implementing.
And then you wanna store them back.
And so as an entire high level view
of what we're trying to do is really this.
So I have some examples.
I have example code that's up on GitHub.
You can download it, run it, it's fun.
I will switch to that for a live demo.
Okay, so saxpy is this like hello world
of parallel computing that you might recognize
is all over the place.
And the whole concept is is forgive the primitive types,
it's just less visual noise.
I'm not advocating this for modern C++ production code,
but just understand if I have some value
and I wanna take that value, multiply it
by an element in the first array and then add it
to the element in the second array and store it
in some result, some output, that's called saxpy.
So the sum of ax plus y, that's what the acronym means.
So a little for loop.
You got my little formula and write it to the output array
there at the end.
I have some comparisons like what open mp is.
So if you don't know what open mp is,
I'd encourage you to go Google it.
It's the whole annotation via pragmas to say,
hey compiler, you can probably vectorize this loop
even if you're really paranoid
that you might do something wrong.
So I put that in there for comparison sake.
And then there's TSIMD.
This is a TSIMD example.
Note that the syntax here, like if I take
x sub i I'm loading a vfloat's worth of from x sub i,
so I'd load two different values,
do the normal expression, I have a result,
then I store the result.
And the key is that then this for loop
is now the width of the vflow,
less number of iterations.
So instead of doing, let's say, 64 iterations,
I can now do this in eight iterations
less code executed.
That's a lot faster.
And so I use Google Benchmark to benchmark these things.
So let's just have a little quick look.
If I run saxpy, this is with gcc here on my Mac locally.
You can see that, hey, the scalar version was pretty fast.
That is really tiny isn't it?
Did not think about that.
All right, so, hopefully that's a little easier to see.
The speed up is not as you would expect.
Note that optimization's hard, it's really, really rare
that your entire compute problem is going to be that simple.
So we're probably hitting things
like some memory bandwidth issues,
just pipelining stuff,
just things to go crazy with investigating.
So that's saxpy if you wanna go look at that offline,
But I did something else.
I noticed like, what could I use as a simple proxy
for a much more computationally expensive code?
Because saxpy's nice because it's simple
and you could plug that into (mumbles)
and be like, look I got vector instructions,
but that's not the whole story.
So what I wanted to create was a version of doing saxpy
but it's like a lot more complicated.
So instead what I did is create what I call saxpy trig,
which is this looks kind of like saxpy,
but I injected this arbitrary number of calls
to trigonometric functions which are expensive.
So you now, a times sine of x plus cosine of y,
take the tangent of that,
and then you can definitely play
with injecting arbitrary additional calls to tangent.
So what it does is we're simulating
like I got a lot of computation to do,
even though that would be numerically nonsense,
saxpy's nonsense as well, so.
But then also at the end I injected something else,
which is like a conditional,
I'm only gonna write it to memory if some condition's true.
So as you'd expect, open mp is, put the pragma there, great.
TSIMD is hopefully not onerously more complicated.
Being able to load a value from memory,
load another value from memory,
and hey look, I just call sine and cosine on them.
It's not complicated.
I call tangent.
Hey look, this is the exact same loop as above
with a different namespace.
And then I can create a little mask,
which is hey should I write the result out?
And if any of those are true I'm gonna do a store.
So basically if they're all false nothing happens.
And if this store is a mask store.
So if I just had, if I just had that,
that would always store all the values,
and that's not what we want.
Instead we wanna say only the values
that are true in that mask we're gonna store those out.
So we have saxpy trig.
There we go.
So I did a couple of things here.
I decided to benchmark stood sine, cosine, and tangent.
I benchmarked the TSIMD sine, cosine, and tangent.
They used different algorithms fundamentally.
So they do take different amounts of time.
'Cause if you're only on scalar values
there's probably certain optimizations you can do
that are hand done inside the C standard library.
But note the scalability difference here.
Once we actually do the saxpy trig calculation,
open mp got really paranoid and said,
ah, I see trigonometric functions,
I just can't vectorize that.
I will tell you that if you are using open mp
with the Intel compiler,
Intel does ship a vectorized math library,
so that's not always true.
I don't have that compiler on this machine,
so we just won't worry about that.
But note that I didn't need that with SIMD wrappers
because my SIMD wrapper library
supplied those trigonometric functions.
So the difference in execution time
between the scalar and TSIMD version is dramatic.
And it actually gets better
if I come in here and say, let's quadruple the number
of times, we just call those additional tangents.
So instead of doing 10 extra calls to tangent
let's do, let's do 40.
So yes, all those should be the same.
Now we're going a lot slower
because we're just tangent all over the place,
and great, on this AVX2 CPU on here
I've got a Skylake CPU in this Mac.
I'm getting somewhere to sevenish x,
seven to eight x speed up, do the math.
You can see it for yourself.
That's the whole point.
The whole point is I'm doing a lot of computation,
and it scales a lot better with my SIMD width.
So, let's go back to the presentation.
All right, so we've got SAXPY, SAXPY_TRIG.
I forgot to show you Mandelbrot.
We can get that at the end.
It's a little more complicated,
but I just wanted to make sure I get through the rest
of this content, then we can go back to Mandelbrot.
And that is, I wanted to talk about data layouts.
So, it's great when you have contrived examples of like,
I have a pure float array and I have a pure integer array,
and that's nice, but I do more complicated things.
So in the world of Ray Tracing
we deal with algebraic vectors all the time.
So this is your x, y, z vector.
You can do dot products and all kinds of things with.
And so typically you would write this like this
or even write it like as a raw vec vf
would just be a float x, y, z.
But for the sake of configuring different types
for what your elements are,
we're gonna template it over what the element is.
And so there's different kinds of data layouts
that you can, that are general high level strategies.
I'm gonna introduce those and then talk about
how we would instantiate these kinds of templates
to do more complicated data structures.
So the first is Array of Structures.
This is what you're used to seeing.
This would be if I have a float x, y, z.
That means if I have an array of those,
we'll call that vec3f,
then that would just be an array of those structures,
Array of Structures.
The other is you can invert that and say,
I want an x, y, z vector where that algebraic vector type
is the entire dataset that I have,
like a whole array of x's, a whole array of y's,
a whole array of z's.
This is what we call a structure of arrays.
And in another talk maybe in the future
I could talk about strategies for like
how you wanna really tune data layout for performance,
but that'd be a different talk.
And then the third category is what we call
an Array of Structures of Arrays.
And we'll visualize this in a sec.
But the concept is of if I have like a vfloat8
for my element and my x is a vfloat 8 and my y is a vfloat8
and my z is a vfloat8 and I take an array of those,
I would have an array of, I'll call,
vector or varying vec3f's.
We'll visualize this in a sec.
That's an array of these little tiny structures of arrays.
So hopefully that didn't lose you.
But let's look at this data layout of this structure
and for the sake of visualization
we're gonna actually orient these differently.
So this means if I have what we call a uniform version,
so I just have scalar types substantiating with,
I'll call this a uniform, just plain vec3f.
There are some type aliases.
Just like we can talk about SIMD registers,
we can talk about algebraic vectors
in the same kind of language.
Like if I have a vec3f, a vec3i, I have x, y, z for ints,
for longs, whatever you want.
And you know, usually like a vector type
you still can do math with operators and stuff.
So the last line is saying hey,
you can add two vec3s together
and maybe subtract a scalar from it, that's fine.
The point is, in memory I have one float,
or one int, so one element, one element, one element.
So then vfloat.
Like where do I get to use my fancy SIMD wrapper?
And remember that these operators operate per lane.
So they would do, like if I'm doing a vfloat plus a vfloat
it's going to add up the first element in each one,
second, third, to generate an output register,
an output pack.
So you have all of these available to you,
which look remarkably like what plain floats look like.
So turns out if I want a varying vec3f I'll just
stick a vfloat in there.
If I want a varying vec3i I'll just stick a vint in there.
So what I can do is, with like if I have a vec, a float4,
so a vfloat4, a varying vec3f4, that's a bit of a mouthful,
where I would have x's four values
stored as a SIMD register,
y is four values stored as a SIMD register, and same for z.
So hopefully you can see this composition
because with a SIMD wrapper you model like a varying float,
a vector of floats,
the same way you would treat a normal float.
These kinds of data structure manipulations
then become a lot more natural and easy.
Okay, so if we were to look at SoA,
the Structure of Arrays version,
instead we would just instantiate like a vector of floats,
like a stood vector of floats.
So I created some aliases here to say, hey,
I'm gonna say my stream type.
Like if I have a stream of floats
I'm just gonna call that a stood vector.
And so my sfloat instead of a vfloat, my stream float
is basically ends up being a stood vector float.
And so my stream of vec3f's or my SoA's of vec3f,
whatever terminology you wanna use is fine.
Now I have vect of sfloat.
And so then that means, let's say I had 64 of these values,
I would have 64 x's, 64 y's, 64 z's.
I would like to point out
this is called horizontal vectorization,
and I'm using my definition of that
because if you look at it logically,
I hope you can gather that when we go from this
to this we're horizontally making it wider.
The problem is I was at SIGGRAPH
and talking with someone about this,
and the terminology's confusing if you look at it in memory
for what's considered horizontal versus what is vertical
because those would actually be concatenated in memory,
they're not, anyway.
The entire point is, keeping this at a logical,
like what I'm widening point of view,
I think is a way to stay sane,
but that's just a personal opinion.
And so what you'll find in the games industry,
in particular, who are operating
on maybe less items of data,
and they're still trying to get speed up from SIMD,
you'll find that they can only use SSE pretty often
because if I've got a single uniform vec3f
and I wanna do SIMD on it,
maybe I have to
use SSE which would be four wide floats.
So maybe I can do like a reduction or something
on that vec3f with SIMD,
but then that doesn't scale like AVX's eight wide.
I don't get any better, AVX512 gets even worse.
So vertical vectorization
is where you're trying to take a single scalar structure
and do SIMD in between those values.
This is a lot trickier, it's a lot,
it's still useful, but it's just a lot more,
it's a lot harder to do.
I'll just, I'll say my recommendation
is to always consider
horizontal vectorization where possible,
because this you can keep in your head with like,
if I were to do that vvec3f expression at the bottom
of that a + b - 1,
I can in my head understand that like, okay,
I'm gonna get a vector add between a and b
and I'm gonna create like a vector full of one
and then take the result of that and do the subtraction.
Like I can reason about what's gonna be vectorized
and it might take a little bit to sink in,
but it's a lot easier to reason
about when you use the strategy versus this.
You're on wider ISAs and you don't
get a whole lot of speedup.
So it really depends on your problem,
is really what I'm trying to say.
Okay, so this concept of data layout, it can compose.
So if I have a vec3t, remember, I can make a vec3f
and a varying vec3f.
If I have a ray, I'm a Ray Tracing guy,
so I gotta show you some rays.
If I have a ray, which is an origin and a direction,
I can actually create two different types of rays.
I can create a uniform ray,
which would be I want my rayt to be a single float,
or maybe I want a varying ray of width four.
Which would just, I would stick in vfloat4 instead.
Because I have vec3t, that template that completely
takes care of the problem of the outer ray vector,
I can now instantiate very different data layouts
with little aliases.
And I may, depending on, what I'm writing,
have different transformations can I,
in that varying ray, can I go grab
like the ith offset in that ray
so it gets the ith vec3 origin,
the ith vec3 direction, blah, blah, blah, blah, blah.
There's really cool things you can do.
Don't have a lot of time to show you all that stuff.
So the advice I wanna give you is just
consider horizontal vectorization first.
You know, where can I, instead of using a plain float
just use a varying float.
Like is that, can that really just solve my problem,
whether I wanna stick it in my data structures
like I showed you and just use varying float
as if it were a float?
Or if I need to do like load the whole
load compute store thing,
both of those end up being hopefully
more straightforward to do with horizontal first.
And so there's some reasons that you get some benefits
for horizontal vectorization, one being scalability.
So if I'm doing like a particle simulation or something
and I'm working on four particles at a time
or eight or 16, the point is,
is like that's the changeable value,
and so if my ISA keeps getting wider
I can basically keep scaling
and I didn't have to really rewrite anything.
The second's portability.
So this is,
this is about ISAs that may or may not map to your problem,
so with vertical vectorization, you know,
if I have an x, y, z type, and they're all floats,
the AVX512 is not gonna do a ton for me.
So I can better use these different ISASs
even across, this isn't even limited to x86.
This is true for Arm and other CPUs architectures.
And then the last is clarity.
So if you wanna look at what is going to be vectorized
when I use these data structures,
it's a lot easier to understand of like
I have a four wide ray
and I'm gonna generate an output of like four pixel colors,
like that's then a lot more obvious for how,
what code gen I'm gonna get
instead of writing, you know, deep function call lists
and then what my data layout in vectorization,
code gen looks like is then not obvious.
So three reasons to choose horizontal vectorization first.
So then there's this, I don't know if it's an elephant
in the room, but I'll call it the elephant in the room,
is what about GPUs?
And the thing I wanna communicate
is SPMD versus SIMT, or SPMD/SIMT versus SIMD.
When you're really optimizing for a particular
instruction set you then start really focusing on
like I need vflow4 and that's the only thing I care about,
or I need vfloat16, the only thing I care about.
If you can write your algorithms with just vflow,
just the concept of a varying flow,
it turns out when you then go to a GPU
and you might have super wide SPMD,
like NVIDIA (mumbles) whether you're in Sickle or whatever
or in Cuda, the idea is I have these configurable widths
in my algorithm.
I've expressed what are the data types that are wide
and which ones are uniform that are gonna stay small.
So these types of techniques that I was showing you
with that vec3f and the varying vec3f,
then hopefully translate to cool, I can instantiate
the right types, it'll be the right
layout for my GPU kernels.
Again, something don't have enough time
to talk to you about.
And so there's another talk from CppCon two years ago
that I want everyone to go take a look at
if you haven't yet.
Nicolas Guillemot, he did a talk on pure SPMD in C++.
He came up with this cool little way
of doing mask management and stuff,
and that's a whole 'nother talk in and of itself,
which was done so you should go take a look.
And another thing to think about with SPMD, SIMT,
GPU computing, is that SIMD really,
like at its strictest definition
where I really care about these register sizes
and what operations I can do on them.
You can think of SIMD is an implementation detail of SPMD.
So when I'm using my SIMD wrapper libraries
to take advantage of AVX2 I might be thinking
on a SPMD type of, like,
I don't really care what the width is,
but eventually when I compile my code
I know what SIMD instructions I'm gonna get.
I know it's pretty dense to kind of reason about,
but I'm trying to give you information
to go maybe help yourself on Google
or come talk to me afterwards.
And so the last thing is, at the end of the day,
we are talking about parallelism.
And the way you schedule kernels
that you write with these types,
I hope I don't get too much flack for this,
but I'm convinced that the core kernels that you write,
they are not actually that tied to the architecture,
it's just how do you schedule like a zillion threads
on a GPU versus like a bunch of tasks on a CPU?
That I think the core of kernels themselves
can actually remain the same.
So hopefully whether you're coming from GPU computing
and you're trying to like, how do I do SPMD or SIMD on CPUs?
Or if that's where you're coming from
and you wanna go learn GPU computing,
the high level concepts are not all that different.
At the end of the day we're trying to say
here's some math where I can do,
I know I can do multiple of these values at the same time,
and once you have that core concept,
how you spell it in Cuda or Sickle or ISPC or in C++,
ends up being more inconsequential.
So I think I'd like to have some time open for questions.
So that's my talk. (audience applauding)
- [Man] Hey, great talk.
Can you please go back to the (mumbles) example?
I want to make-- - Yeah.
Yeah, I went through this pretty quick.
So the saxpy or the saxpy_trig one?
Either? - [Man] Yeah.
So just wonder what are the best practices
in case the static size of the register
is not a multiply of the vector?
So you have like--
- So this is what I'll call a SPMD SIMD trade off.
A SIMD style, more SIMD type implementation decision
would be you only get the possible candidates
that can be instantiated,
like it can be, like with my library
it's you can choose one, four, eight, or 16.
So you might have to do a little logic to say,
if my array, yeah, like it should also be like in this one,
it should be, good catch, and i is less than n.
So then we won't write it if we're beyond the array.
So those elements that by adding this right here,
that was a really poor use of a, there we go,
so this part of the expression we can also say,
okay, if the result is in my range that I care about fine.
But also it has to be less than the number of elements.
(audience member too far from mic)
Sorry, plus size, plus,
good paraprogramming there.
So does that make sense?
Like it'd just be a part of my mask that I used to say
what I write to memory or not.
Because I don't, if there's no time saved
by like, when I compute a vfloat math
the fact that I'm gonna ignore some at the end
won't make any of the instructions go faster.
So kinda doesn't matter if I have
only five that I care about.
And I'm only gonna write five,
but I have eight that I'm totally doing.
A vector instruction's a vector instruction.
So you won't save time by trying to mask the math itself.
You'll need to mask the write to memory,
if that makes sense.
- [Man] Yeah, thank you. - Cool.
- [Audience Member] Hi, I have two questions.
So first question is what's your opinion
of suitable applications for SIMD?
For example, I can think of examples
that are harder to express in (mumbles)
for example, you have multiple keys,
you want to look up those keys in a map at the same time.
It looks really harder to express (mumbles) library.
- So if you want to look up values like in a table?
- [Audience Member] Yeah, map like.
- Yeah, so there's a couple of things.
One is you're probably gonna have to implement
a custom map that is friendlier in is data layout,
so maybe you'll store vectors of keys,
and then you'll fill those piecemeal as you start
setting values and then you can
look up keys faster that way.
Or you're gonna have to do lots of gathers
that may, it's all specifically
the application you're doing.
You would have to write a map
that is stored more SIMD friendly.
So probably stood map would not be your friend.
- [Audience Member] Yeah that's where I needed
the change of the library,
which it's not always doable, right?
- Yeah, well, it ends up being
very specific to what you're doing.
You know, there won't be a one size fits all container
that fits all usages of a map.
- [Audience Member] Okay cool.
Another question is since we needed to lay out
the data correctly first that you use the library,
was that makes the data structure
less straightforward to understand?
- So that's where I think the type aliases help.
So instead of sprinkling in vec3t of vfloat everywhere
in your code, give it a name, vvec3f,
a varying vec3f.
And then use that instead.
So it kind of hides that,
you'll need to use the template instantiation everywhere.
You can talk to me afterwards.
I can show you code.
- [Audience Member] Okay, thank you.
- [Man] Can you go back to slide 90?
- [Man] And I think it's possible they're out of sync
from the, oh yeah, there's it, okay, great, cool.
So can you real quickly explain again
why those two operators are not implementable.
So they are implementable,
they just then kind of lie to you.
So if I say, like, if I say one expression
and another expression, if it's just plain scalar math
like normal Boolean math,
if that first expression is false
the second one will never be evaluated.
However, in C++ when you overload those operators--
- [Man] It's function. - Yeah, it ends up
doing a function call and doesn't go away.
So then people will think, oh yeah, this will shortcut
and it won't execute and then it's not true.
And underneath the hood when you do
SIMD logical operators like that
they end up being bitwise anyway.
So then that's why the suggestion is to instead use these.
So when you're calling those, is it doing a bitwise
for each individual element? - yeah.
- [Man] So it's not reducing those?
- So a reduction would be different.
- [Man] Okay, so my understanding was prior
to the doubles could be appointed
as an operator bool for each of the operands, right?
But if it's by, if it's not being reduced to one value?
- Correct. - [Man] Okay,
that's the difference.
- So yeah, I could do like an an of some expression
and any of some expression that turns those
into single bools and then that does what you would expect.
- [Man] So that is solid for what--
- Yeah, so if I reduce those vbools down
to a single regular C++ bool, then yeah,
then those and and or are still back on the table.
- [Man] And would that behavior match
what you're looking for? - Yeah.
- [Man] Okay, cool, thank you.
- [Audience Member] Hi, can we use scatter
and gather with iterators?
Like if the data is not--
- That could be really interesting iterator implementation,
like, yeah, so you have like a, well,
yeah, so you have like a varying iterator
and then you can conditionally increment it,
and then you end up having offsets
or elements that you're pointing to
and hen when you right click like a vfloat to it--
- [Audience Member] Iterator is continuous.
Like I'm using a map and so in every element
of the map I want to do something.
So I want to lower like four elements together
using an iterator and then to the operation
and then store it back,
but it may not be contiguous in the memory.
- Yeah, so I think you would have to write
a custom iterator for that and then you could just say
on assignment if they're not contiguous
or just always scatter.
'Cause technically I can scatter two contiguous offsets.
You just might not get the most efficient instruction,
but yeah, that's totally a cool pattern.
Like a SIMD iterator, yeah.
- [Audience Member] At some point you presented reduce
as a function (sound distorts voice).
- That would be way down here.
I mean like min, max, and stuff like that?
- [Audience Member] Yeah, okay, my question
is is reduce something that can actually be vectorizable
and eventually how.
- So there it's not that you vectorize the reduce,
so remember these algorithms
are only on the elements in a register.
This is different from like
a stood reduce of a big container.
So like a TSIMD reduce is like,
I'm just gonna do the reductions in a SIMD register
to like a single value.
Like a sum or a max or something like that.
That could be used to speed up an implementation
of stood reduce on a stood vector
to do it in less number of iterations.
- Right, what I was referring to
is the only way I can imagine this implementation
can be is actually normal sums of let's say
we have a vector of four floats,
I can imagine only like four sums being able to--
- Correct, correct.
- That's the only, so I don't see like any other way
to implement it with SIMDs and I was wondering
if there was going to be anything--
- So on most architectures things like
min, max, and sum,
they're actually just intrinsics you use for--
- Yeah, for min, max of course.
No I'm talking about sum.
- Yeah, and so-- - if this is something--
- If you have a more complicated reduction,
yeah you're gonna have to write a for loop,
which hopefully you wrap in your own little function,
but yeah, that's what you got.
- Okay, thank you. - Yep.
Any other questions?
All right, thanks for coming.