All right, folks, come on in.
I wanna give you all a fair warning.
You probably want to move towards the front of the room.
I know it's scary up here.
I'm not gonna throw anything.
But there's gonna be a lot of code.
It's gonna be kind of small.
Because I'm gonna actually be coding it live,
and so I can't use super big slide fonts, sorry.
But yeah, people way in the back,
you're gonna have a hard time.
You might wanna move up a little bit.
We have enough space here.
All right, so my name's Chandler Carruth.
I'm here to talk to you about going
absolutely nowhere faster,
which is probably a little bit of a confusing title.
Sorry for that.
But it seemed like a good funny joke.
This talk is gonna be a little bit unusual.
So a lot of my optimization talks,
I actually try and tell you how
to make things go faster.
I've given several optimization talks at CppCon.
This one's not going to do that.
If that's why you're here, sorry.
I'm trying to set expectations, so.
I am gonna try and tell you how to understand
why your code is going fast or isn't going fast.
But today we're gonna focus on when your code
is actually going nowhere at all,
but either fast or not fast.
We're gonna talk about loops.
We talked about data structures,
talked about a lot of other things around performance.
This time we're gonna talk about loops.
There's gonna be a decent amount of live demo.
I'm gonna show you stuff.
I'm gonna try and show you stuff, hopefully.
Please don't do anything particularly mean to the wifi.
That would be much appreciated.
The other thing I do wanna mention
is please be thinking of questions.
One of the challenges of this topic
is that there, and I'm a terrible speaker.
I'm gonna give away the punchline on the title slide.
There is no one good answer when it comes to loops.
I'm gonna talk about two specific things, mostly.
There's some bonus stuff if we get to it,
but I'm not sure we're gonna get to it.
I'm gonna talk about two big things today.
But there's a laundry list of things I could talk about.
We could keep going for hours and hours.
And there's no telling which one's gonna be
the most important for you.
I'm focusing on the thing that I think
is the most common for people to run into
and the thing that I think is the hardest to understand.
That's how I picked my two.
But I don't want you to get the idea that,
oh, well, those are the two things
that go wrong with your loops.
Okay, we're done now, cool.
It's not quite like that.
And just a little bit of background information in case,
how many folks have been to a talk
from me at CppCon before?
Oh, this is a bad question.
How many folks have not been to a talk from me
about optimization before?
That's a lot of hands too.
Okay, so for reference, I work at Google.
I work on Google's C++ team.
I'm the tech lead for our C++ programming language platform.
And when I'm actually coding, the thing
I like to code on is LLVM,
which is our preferred optimizing compiler.
I do a lot of work on LLVM and compilers
and optimization, cause I really like performance
and I'm totally obsessed with that.
And I come to CppCon to get to actually
talk to you all about cool things you can do
to make programs run faster.
That's really why we're here.
So with that, let's get started.
Briefly, I have given three talks at CppCon before.
You should go and watch them, not because
I'm super proud of them, necessarily,
but because they're way more important than this talk.
I'm just throwing that out there,
these are way more important.
You've gotta use efficient algorithms,
you've gotta use fast data structures,
you have to follow idioms that make your code fast.
You can't just let your code slide down everywhere.
That doesn't mean you add complexity to do it.
There are simple idioms that are fast.
And you wanna make sure that you follow those
really, really carefully.
Always, always, always benchmark the code that matters
and understand why its performance matters, right?
Understand how it performed, in detail.
Otherwise there's no hope.
And you gotta use hybrid data structures
to optimize your memory allocations and cache locality.
These are way more important, but
but we gotta talk about some other stuff, too.
So I just have to remind everybody
you only care about the performance
that you have benchmarked.
If you didn't write a benchmark for your code,
you do not care about its performance.
That's not an opinion, okay?
This is how you show you care about performance.
So in every case that I talk about,
I'm gonna assume that you have a good benchmark.
And the thing that I haven't talked about before
is what happens when you have a loop in your benchmark
and it's actually this loop that is
the critical thing to your performance?
You've got good data structures,
you've tried to do other things,
and you don't understand why this loop
is such a bottleneck for your performance.
Well, turns out that it's still probably data.
I'd like to say there's this brilliant trick
to loop optimization, but 99 times out of 100,
it's actually still data.
If you want to get lots of high-level ideas
about how to fix this, there are a bunch
of great talks out there.
Mike Acton talked about data-oriented design.
Several other people have talked about
how to design for cache locality
and try to minimize your working set of data.
I gave a talk about profiling,
but I also gave a talk about hybrid data structures
that are really trying to help you solve this problem.
And we're hoping in the future you can use
a tool called Efficiency Sanitizer.
How many folks here are familiar
with one of the sanitizers?
Okay, how many folks here are asleep?
You just had lunch, I understand.
But we actually looked at adding an Efficiency Sanitizer.
Instead of looking for correctness bugs,
what if you just found really bad performance patterns?
It's not really very fleshed out.
If you're interested in this, it'd be great
to help out some.
But that's actually something I hope
you can use in the future.
But let's actually go back to profiling for a minute,
because profiling tends to work pretty well.
But there are some tricks to profiling
that I didn't even get into the last time
I talked about it that are very specifically useful
when you have a loop and it's looping
over lots of data, right,
and that data is the source of your performance problem.
All right, let's see if we can
switch to live demo here.
Nothing could possibly go wrong.
All right, so
that doesn't look entirely great.
That looks better, okay.
So let's actually look at a benchmark
that specifically is kind of designed
around understanding bad data structure
or cache locality problems.
And writing this benchmark is actually
a little bit tricky.
So I'm not gonna try and live code it on you.
I think that would just make all of us really unhappy.
So I've gone ahead and I've written this benchmark up.
It's called cache bench, right,
I'm very specifically targeting kind of cache
How do we understand the performance of
your cache and of your working set?
And since we're targeting this kind of data-based
performance problem, we're going to actually
parameterize this benchmark on a particular size, okay?
And a size in bytes, not on how many elements
But we're actually gonna go after very specific byte sizes.
So we take in this s.range zero
to kind of grab a size from the benchmarking framework.
We shift one byte, and this just lets us get
a power of two.
I don't want to type out really big numbers.
I'm bad at it.
So we're always gonna use a power of two bytes, okay?
Then we're gonna figure out how many elements
we want in our data structures
in order to use roughly this many bytes.
We're gonna build two data structures.
They're both gonna be vectors.
We're gonna put ints in them.
And so we can kind of guess at the amount of
elements by dividing the number of bytes
by the size of an int and then dividing that into two,
cause we're gonna have two vectors, cool.
We fill up the first vector here with random data.
All right, I have my little random number
generator class here to make this code
fit on a screen.
It's super fancy.
All it does is produce something I can iterate
in a range-based for loop.
I can show it to you if you're desperate to see it.
But it's not really that awesome.
But it gives me so many random numbers.
That's what we got here.
And we fill up our vector with it.
And then we go and we actually,
the next vector's a little bit more interesting.
So this is called this indices vector, right?
So our indices vector we're going to fill up
with indexes into the other vector,
but random indexes.
And we'll talk about why we're gonna do that
in just a second.
So with this one we go down, we
restrict our range for the random numbers
to zero to the count, right?
We're gonna put a bunch of stuff into this.
And then we're gonna go into our benchmark,
we're gonna keep running the benchmark
until we get enough samples to kind of
have accurate measurements.
We're going to sum things.
And what we're going to do,
and this is the tricky part here,
so we're gonna loop over the indices,
these random indices, and then for each
of these random indices, we're gonna go
load the data out of the first vector
at that offset and add it to our sum.
What does this mean?
It means we have two access patterns to memory.
We have one vector where we're just
walking through memory linearly.
And half of our working set is doing that.
And the other half of our working set
is randomly bouncing all over the place.
No way to predict where it goes next.
So this is designed to undermine cache locality.
This is exactly the kind of access pattern you'd have,
for example, if you have a lot of linked list
data structures and your allocations are randomized.
Then you're gonna be bouncing around
all over your data structure.
In particular, this happens to be exactly
the kind of data structure access you see
if you have something like an unordered map
or an unordered set out of the standard library,
because it stores all of its elements
behind one layer of indirection,
which means you essentially have a vector
of random pointers off into memory,
and you see access patterns that look very,
very similar to this.
This is just a little easier to code and think about.
Everybody happy with my benchmark?
Please, if someone is confused,
if you don't know what I'm talking about,
if none of what's of the screen makes sense,
shout, raise your hand, jump up and down,
find a microphone.
I've gotta explain it, I've gotta actually
help everyone understand it or we're all lost here.
All right, so this is my lovely benchmark.
Let's run it.
That seems like the most prudent thing to do next.
So fortunately, I went and built a whole
build system for this particular demo.
I didn't wanna sit around here hunting
for compiler command line flags.
And so I can run this.
And when we run this, we're gonna see
some lovely numbers come out and some stuff
that I've tried to do here just to make this
a little bit easier for us to think about
when we see these numbers.
I've tried to really carefully pick
the range of values we're benchmarking,
and I've tried to print out kind of
why those values are particularly interesting
for us to look at, okay?
All right, so let's dig in just a little bit here.
All right, so we start off with eight kilobytes of data.
With eight kilobytes of data, we see
things are fast.
With 16 kilobytes of data, things are still fast.
With 32 kilobytes of data, things are still fast.
With 64 kilobytes of data though,
things start to slow down a little bit.
We didn't change what we were doing,
we just grew the amount of data
that we're kind of randomly probing.
All I'm doing here is just measuring time,
and you can already see this.
And then by the time you get to 128 kilobytes,
we're really starting to slow down.
We've gone from 18 gigabytes per second
of memory accesses, essentially, and sums,
to 15 gigabytes per second.
That's a pretty substantial slowdown,
given the fact that we've just gone from
a very small amount of data, 32 kilobytes,
it's tiny, right, to 128 kilobytes.
This is also still kind of tiny.
But it happens that I know something
about this processor and this machine.
That's gonna become pretty important.
So this is
a AMD processor.
This is a little server that I built recently.
And it's running the new AMD Epyc chips.
So I don't know if you all heard of them.
They're super fancy little chips.
But they have a 64 kilobyte L1D cache.
And wouldn't you know, the performance
starts to teeter off right as we get
up to 64 kilobytes and we're actually
getting cache contention.
And because of the way caches work,
when you're doing randomized access,
you're gonna have collisions in your cache.
So you don't get exactly 64 kilobytes.
You get close to that, you start to see
performance fall off, you go past it,
performance degrades substantially.
And so doing nothing more than measuring time,
we can actually observe the cache size.
So I think that's pretty cool.
Maybe everyone else is like, this is so boring.
But I think this is awesome.
So then let's look further down here,
because we got more numbers here.
So if you keep going down, we see
we got 128 kilobytes.
We've got 256 kilobytes.
The performance is kind of degrading,
but there's not really another kind of bend
in the curve here.
We get to 512 kilobytes, still going down,
not a huge bend, but it starts,
it kind of knocks off a bit after 512 kilobytes,
which I found really interesting.
Dropped down to 10 gigabytes per second there.
It's not a huge drop, but it's not trivial either.
And then it starts to drop, drops faster and faster
until we get to eight megabytes.
And then from eight megabytes to 16 megabytes,
the performance just craters, absolutely craters.
So does anyone have a guess about the other two
cache sizes of my processor?
No, no, come on.
(person talking quietly)
16 megabytes, I heard.
(person talking quietly)
Eight, why eight?
Why do you think it's eight?
Look at the jump between four megabytes and eight megabytes.
It's not substantially different
between two megabytes and four megabytes.
And we shouldn't expect to be able
to fully use an eight-megabyte cache.
Cause there's gonna be collisions in that cache.
It's when we hit 16 megabytes
that we see a precipitous drop off.
And that likely means we're missing cache.
And it's really interesting, this one's precipitous.
And it makes sense that one of these
is way more dramatic than all of the others,
cause that's the last level cache.
When we miss here, we hit memory.
When we miss here, we take orders
of magnitude more time.
And so we should expect this precipitous drop off,
and we should expect it around the actual size
of the cache, not before it, not after it,
right at the size of the cache, when it's full,
not when it's too full.
And then it stays down.
But there's another cache layer in here.
This processor has L1 cache, has L2 cache,
has L3 cache.
L1 cache is 64 kilobytes.
L3 is 16 megabytes.
Any guesses what L2 is?
(person talking quietly)
Any other guesses?
Someone's looking up on Wikipedia.
It is 512 kilobytes.
It's exactly 512 kilobytes.
And this one I don't understand.
I don't know how it actually has
this good a performance at 512 kilobytes of working set.
So that's a great question.
We should figure out why this has good performance.
Okay, so we can actually do something
a lot more interesting than just running
this benchmark and looking at timing.
It's cool that I can build a benchmark
to show me the cache and how it behaves.
But I probably am not trying to fix
the performance of that benchmark.
That's not likely an exciting prospect.
I built it to perform badly.
What I actually wanna be able to do
is I wanna be able to recognize
when a real program is doing that
and I didn't intend for it to be doing that.
So we need a better measurement tool.
So the first thing we need to do here
is we need to come up with a way of getting better data.
So the benchmark framework I'm using
has some cool flags here.
You can filter down to one benchmark measurement,
and you could also ask it to run
that benchmark measurement for more time.
And this gives you a lot more data.
So I've asked it to run for probably too much time.
Okay, that's just silly.
Let's run it for two seconds here.
And this'll give us pretty stable numbers.
And so this is right at 64 kilobytes.
So I probably want to back off one step here
so we can look at one step before this.
So this is in cache.
We expect this to be super fast.
This is L1 cache.
Cool, it is.
Now, how do we actually tell
that this is staying in L1 cache?
Well, that's where our performance monitoring
tools come in, perf.
Last time I talked about perf,
I showed perf stat.
I said, this is really cool,
but we're not gonna talk about it a whole lot
and ran off to do profiling.
Today, we're only gonna talk about perf stat,
because there's no interesting source code here.
What we're trying to understand
is what the processor and the machine is doing.
So I run perf stat.
I get a bunch of data here.
But nothing looks bad.
I have very few stalled cycles.
My processor's running fast.
I don't have any context switches.
This looks great.
This is about as healthy as you could get
a benchmark to run.
But as soon as I go up to 64K,
I start to see a different result.
All of a sudden, I have this 25% of my cycles stalled,
So that's your first signal that you have
some kind of memory bandwidth problem.
It might be something else, though.
There are other things that cause front end stalls.
So how do we drill down even further?
Let's look for a better tool.
If you run the perf list command,
it actually will list out all of the awesome
things perf can measure.
And it can measure amazing, amazing things.
And right at the top is this beautiful list
of things, L1D cache load misses.
That looks terribly promising.
Okay, we gotta try that one.
All right, so if you wanna measure, ah, no.
So if you wanna actually measure a specific
counter, you have to tell perf to do that
instead of its typical counters.
So we wanna measure L1D cache loads.
Well, actually, we don't know that it's D cache.
It might be I cache.
And you can actually measure more than one thing
at a time, so you can say
show me the loads.
Also show me the load misses, probably good.
Show me the D cache loads
and the D cache load misses.
Did I manage to typo any of that?
I don't think so.
And this, look at that.
So we run this, we get some information here
about the fraction of L1D cache
queries were cache hits.
Sorry, sorry, were cache misses.
Now, the thing that's just amazing to me
is if we back this up one step,
if we go back one step to the one
that was super fast,
you're just gonna be just staggered by this, I think.
So the different between four to five percent
front end stalls and 25% front end stalled cycles
was a difference of 0.01% L1 cache misses
to 0.04% L1 cache misses.
That's why everyone is so obsessed
about cache locality.
We regressed cache locality by 0.03%,
and we regressed performance by like 20%.
This is crazy talk, right?
Well, that's processors for you.
And this works all, like we can keep going down here.
So if we hop down a bit, I don't remember
the exact one.
You're gonna see this number just keep growing.
So here at one megabyte, we have 12% L1D cache misses.
The L2 isn't terribly interesting,
so I'm just jumping ahead on you people, sorry.
If we go a little bit bigger, we're gonna see even more.
So at eight megabytes, we have 1.6% L1D cache misses,
1.6% L1D cache misses.
So we have 100 times the D cache miss rate.
Because we have preposterously more working set.
But now you know how to actually tell
am I blowing up my cache?
I would love to show you all of the other
hardware performance counters that measure
different parts of the cache.
Unfortunately, most of them don't work.
And that's a little concerning to me,
cause I work on performance stuff,
but hardware performance counters are actually
still really finicky and hard to get to work.
And so this is a very new processor.
The Linux kernel and the profiling tools
don't appear to have caught up.
So the L1 is the best one I can probe here.
Any questions about probing for cache misses
before I move on?
Everybody's just like, I don't even know
what you're doing anymore.
You're playing with these toys that don't make sense.
All right, let's briefly return
to our presentation here, see what else
we wanna talk about.
(person talking quietly)
Is there something to say about the do not optimize
function call in my code?
That is the benchmark library's
beautiful packaging of that horrible, horrible trick
I showed you two years ago to get the compiler
to not optimize away your benchmark code.
And you can thank Eric Selleg, who's over here.
He added it to the benchmark library
like at the same time as I was presenting
the horrible stuff.
But now I get a nice API.
But you can go look at the benchmark library.
It's Google Benchmark, like Google slash Benchmark
on GitHub, and it has documentation for the function.
All right, so that's great.
This is all great.
I've shown you one clever trick
that'll let you debug all of your cache locality problems.
Well, not really.
It'll tell you that you have cache locality problems.
And then you have to go do a bunch of work.
And we've had a lot of great talks
about how to actually fix those.
I don't wanna repeat that.
So now I'm gonna jump to the other topic for today.
The other topic is, I think, the hardest
single topic to really internalize in your brain
about optimizing loops, okay?
And this one applies to tiny, microscopic,
tiny little loops that you sometimes end up with.
And we still need to make those fast.
And we're just gonna go right back to live demo.
I hate slides.
Live demo's better.
All right, let's take a look at
a tiny little loop.
So this is my least favorite loop in the world.
I hate this loop.
I'm not actually being facetious.
This loop has caused me more pain
than any other loop, really.
This is kind of a silly version of it.
I've really simplified it a bit.
This is what's called the clamp loop,
where you run over a big pile of integers
and we see if any of those integers
are bigger than some upper bound,
and if they are bigger, we clamp it
to that upper bound.
This loop, in various forms,
not really this form, but in various forms,
this loop shows up in unimportant code like
every compression algorithm based on zlib.
And every decompression algorithm based on zlib.
It also shows up in most image compression algorithms.
This thing is everywhere.
You'd be just amazed at how much of your CPU
on your phone and your laptop,
how much of your CPU is dedicated to running
this annoying little loop.
Now, I've also turned off a bunch of fancy,
fancy loop optimizations that the compiler does,
because I actually wanna look at
the simple version of this loop.
And in a bunch of cases, we actually see
the simple version in real code
because in a bunch of cases, we don't see
one loop on its own in the middle
of a micro-benchmark designed to test it.
And so much like we have to turn off
the optimizer for micro-benchmarks
to keep them from deleting my code,
I also kind of need to dial them down a little bit
so that I can understand what the behavior
of this loop is without it being vectorized
and unrolled and peeled and turned into
some kind of monstrosity.
All right, so I've turned all that off,
just to simplify things.
And I've got my little benchmark.
Any questions about this benchmark?
If anyone doesn't understand it,
I'm happy to clarify.
Otherwise, I just want to get onto
showing you the craziness here.
All right, we're gonna see if we can do this.
So, again, I've got a build system.
I'm up to date, excellent.
So let's try running my clamp benchmark.
I just gave it some random sizes.
It doesn't really matter.
Right, it's running, it's doing
1.4 billion items per second.
It's running pretty fast.
This is solid.
This is a four-byte item, so it's actually
churning through a reasonable amount of data.
This looks great, this looks fantastic.
But we should actually take a look
at what it's doing and see if it's doing something
that makes sense to us.
And we looked at how to do that last time,
so let's just quickly
do something like this.
So I'm gonna run it for a little while
to make sure I get enough sample data
to have something useful.
Recording it, recording it.
I told you it was live.
Okay, so this is exactly the profiling steps
that I showed you last time I was up here
doing live demos with profiling.
We wanna annotate this thing and look at the code.
Okay, so here's the code.
So this is the loop.
Hopefully folks can read this.
We jump back up to the top, kind of compare these things,
we do this fancy cmovge thing.
All right, so let's figure out
what this actually means.
Step one, move.
This is loading the data we're gonna clamp.
Step two, comparing it with OX100 in hex.
This is checking to see whether we need
to clamp it at all.
Then we do this cmovge thing.
This is a special X86 instruction.
It conditionally moves a value between registers.
And so that GE is a condition predicate.
If that comparison found that the value we had
was greater than or equal to the thing
we compared it with, we wanna replace it,
we wanna clamp it.
And it happens that we've actually put
the replacement into a register already,
so we have that in ebx over here,
and we move it over top of it.
Then we store the result back into memory,
so we actually have done our full clamp.
We add four to our pointer, because
we've gotta go to the next entry.
Then we check to see if we're done.
We're not done, we brach back up.
That's the loop.
Everybody understand the clamp loop?
It's super simple, right?
Everybody's like, this is super simple.
Okay, so X86 has this awesome instruction called cmovge.
It seems designed to implement this clamp loop.
Aren't you glad the compiler took care of
generating this magical code for you?
Well, see, I work on compilers.
And I'm suspicious.
And so I'm like, I don't know.
Last time I got up here I invented a faster mod.
We can do something here, okay.
So maybe we're crazy and we think
we can totally do better than this.
This is totally within our power.
So let's see if we can get at this thing.
So I went ahead and
compiled this thing down so I could get the assembly,
because I don't really like staring at object code
or looking at build systems all that much.
So this should give me the assembly.
All right, so here's my loop.
You can see the cmovege.
There's a bunch of other stuff here.
Actually, I can clear the other stuff we don't need.
We don't need debug information.
We're good without debug information.
Where did you go?
Yeah, we don't need this stuff.
I'll make it easier to read.
If I can type.
So there's my cmovge.
This looks like the code we just saw.
This looks the same.
I mean, it's using a decimal number.
I don't know why it's using a decimal number.
But it looks the same, right?
Okay, so if I'm sitting here thinking, look,
I'm way more clever than my compiler is.
I don't need this cmov thing.
I think that's kinda lame.
Because maybe I'm thinking,
this cmov thing seems kinda clever,
but what if we just didn't do that move at all?
What we could do here is we could just branch,
we could branch to something.
But I need something to branch to,
so I have to make up a name down here.
But we could branch to, I don't know,
So do you all see what I'm doing here?
So we're still doing the comparison,
but we're jumping if not greater than or equal to.
If it's not greater than or equal to,
we're gonna jump over it.
And if it's not greater than or equal to,
we're gonna have to clamp it down.
Is that in ebx, I think?
Have I made any bugs here?
I think we're good.
Yeah, so if it's greater than or equal to,
we're gonna go and we're gonna store our constant into it.
If it's not greater than or equal to,
it's already below our clamp.
We're just gonna skip over the store.
How many folks here think this is gonna be faster?
(person talking quietly)
Depends on my data, depends on my data.
So do we not remember what the data is?
Okay, here, let me go look at my data.
That's a great question.
So if you look up here, it's kinda hard to read, hold on.
So this should be easier to read.
So I'm picking random numbers
between zero and int max.
So there're not gonna be a whole lot of them below 256.
So anyone think this is gonna be faster?
Why do you think this is gonna be faster?
(person talking quietly)
Branch prediction, why?
(person talking quietly)
It's all the same branch.
But you still have an extra branch.
(person talking quietly)
One more time?
(person talking quietly)
(person talking quietly)
So branch prediction guesses the jump forwards.
Not true in modern processors, it turns out.
Used to be true.
Pentium six had a forward-based predictor,
but modern processors don't anymore.
Any other theories?
So for the record, my theory was
this is not gonna be faster.
We had one branch, now we have two.
They're super close together.
This isn't good for performance, I was thinking.
This is a terrible idea, because we're never
actually gonna take the branch.
We're just gonna consider taking the branch,
take up an entry in our branch predictor
for no good reason, we're still gonna store
the same number of times to memory.
This is the perfect case for a cmov.
It's a cmov that actually always happens.
That's the only thing that made sense to me.
I was actually pretty convinced
that this was a bad idea.
I was so convinced I actually spent
a bunch of time making LLVM produce that cmov.
It's a true story, I'm not making that up, okay?
I really cared about this.
I thought this was a bad way to go.
But we should find out whether
this is actually a bad idea or not.
What do I wanna do?
None of that, none of that, none of that, come on.
There we go, I wanted it to look like that.
Let's make myself a fancy clamp.
Oh, goodness, I'm not waiting five seconds.
I'm an impatient soul.
All right, so
uh oh, uh oh, hold on.
I don't remember what the other one was.
This is the problem with doing it live.
So we have our fancy clamp,
and we have our not fancy clamp.
And it turns out that cmov is faster.
the really funny thing is that the cmov
wasn't faster earlier today.
Which is rather surprising.
And now of course you all know
the gag I was trying to play on you
and failing just really miserably.
Did I get this wrong?
Did you people let me down and let me
program this wrong?
Let's do it the other way, cause that's how I did it
when I tested this.
I think that's right.
(person talking quietly)
Oh, ebx is set way, way, way, way, way,
oh, but I bet I know what's going on here.
See, when I did this, I was kind of annoyed
at that ebx anyways, and so I did this,
cause that seems obvious.
Oh, I have to rebuild it.
So somewhat hopefully this will reproduce results.
There we go, cool.
(laughing and applause)
In case you were wondering whether this was live.
Okay, so lo and behold, a branch can be faster than cmov.
And I have to say, I did not expect that at all.
It's not a lot faster, but it's faster.
And I wanted to understand why,
not just in this case, but actually we see this a lot.
We actually see it a bunch of places
where you would kind of expect cmov to be better,
and it doesn't end up being a significant win.
And it's kind of interesting to try
and figure out why that's happening.
But to explain that, I actually,
I can't use a tool to really explain that.
It turn out that this is really hard to explain.
We're also running a little bit short on time.
So I want to prepare all of you.
This is going to be a bit of crazy.
So we're gonna go to slides here,
because I can't do anything but slides.
I'm gonna try and relay to you a talk
that one of the really senior engineers
who worked on LLVM for a lot of years gave
as a keynote for the LLVM developers meeting
a bunch of years back, back in 2013.
He used a different example than clamp,
and there's a good reason to use this example.
Branches make thinking about this even harder,
and this, it turns out, remember
what I said at the beginning, this is one
of the hardest to think about problems
we have with loop performance.
So we're gonna simplify this as something
that doesn't have branches.
We're gonna look at a dot product, all right?
And this is exactly the code you would write
to compute dot product.
There's nothing fancy going on here.
And if you compile this code, that inner loop
is gonna turn into this X86 instruction sequence.
It's just gonna turn directly into that.
And I promise, I actually did compile it.
I'm not making this up.
But this is not what the processor is going to execute.
The X86 processor doesn't actually execute
X86 instructions are very high level.
The X86 processor executes what's called microcode.
It decodes the X86 instructions into smaller,
more precise operations.
And we can also kind of use a better notation,
which actually shows you where the destination is,
where the destination is and separates an input
that might also happen to be a destination.
And so I'd like to use the code on the right here,
not the code on the left.
The right is kind of what X86's micro ops are.
Let's step through this.
So we start off with a load.
A load is actually just a direct mapping into microcode.
No problem there.
It's just one micro op, sorry, not microcode, micro op.
But this imull has two things going on.
It's loading, but it's also doing a multiply.
And so it's doing those as two separate micro ops.
The first one is just gonna load.
And then the second one is gonna do the multiply.
Then you have an add.
That translates very naturally.
You have another add for kind of the array index.
Then you have a comparison.
And you shake your head, because I've been lying
to you this entire time.
Sometimes micro ops are actually bigger than
the X86 instruction.
We have this thing called a macro op fusion
in every modern X86 processor,
and it says, well, I see you're doing a comparison
and then you're branching on the result.
I'm just gonna bake in an operation
to do exactly that.
And so we get this as a single operation in X86
when we're actually executing it.
Everybody happy with my little code sample here?
So there's one other thing the processor knows about this
that's really important.
It knows that this branch is going to be taken.
It is confident this branch is gonna be taken.
Of course it is, we're looping.
It's taken every time until we stop caring.
And so the branch is gonna be predicted as taken.
Now, because this branch is predicted as taken,
there's this fall-through path
that's actually not interesting to the processor.
And so notionally, we can think about this inverted.
We can think about it the other way around,
where we actually continue executing
the loop body and we branch out of this loop if we're done.
And that inversion is really key,
because now we're gonna kind of notionally,
this doesn't actually happen,
but think about it as if you unrolled
an iteration of that loop.
So you have a second iteration of the same loop.
And then another iteration after that.
You can kind of stack these iterations up.
It turns out this is exactly what the processor does,
but not when it's executing it.
It stacks up the iterations for the result
of each computation.
It has what's called a reorder buffer
that looks pretty much like this
and captures the result of every computation
and can capture several iterations' worth of computation,
not just one.
Now, that's hard because we've got these registers
in here that are reused over and over again.
And so the processor also has an amazing feature
called register renaming.
Modern processors don't have the number
of physical registers that you can write in assembly.
They have many times that number of physical registers.
And so when you write percent ecx here,
it's actually gonna go and rename that
to something else.
And so we can kind of rename this entire set
to have unique names so that we can execute
any of these and capture any of their results,
provided the dependencies are satisfied.
Make some sense?
So this is how the processor wants to think
about finishing the execution of this loop.
It thinks about finishing multiple iterations
concurrently, all right?
It's called speculative execution.
It's incredibly powerful.
Let's see exactly how powerful.
To understand how this executes,
you have to think about the parts of the CPU
that are doing the execution.
These are the functional units,
or the execution units inside the processor.
We have a load unit that's loading from memory.
We have arithmetic units that are doing things
like addition, multiplication.
And we might have multiple units in the processor.
Most modern processors, ARM, X86,
most modern processors have multiple units
that can all run at the same time.
So I'm gonna kind of use a hypothetical processor
with one load unit and two arithmetic units here, okay?
And we're gonna kind of step through,
cycle by cycle, how this executes on the processor
or how it could execute on the processor.
First, we have kind of a simple chain of events
where we have no really interesting choices.
We need to load two values.
There's one unit that can do that
we have to go there.
We're loading from memory, and so
there's latency involved in that.
And so the next operation can't execute immediately.
But at some point, the results become available
and we can do the multiply.
And then at some point the multiply results
become available, and we can this accumulate.
And multiply's a relatively slow operation,
so it might take a few cycles.
That's what this is trying to represent.
Don't get the idea that this is precisely accurate for X86.
This is just illustrative.
Okay, and then there are two other instructions
that execute in that first iteration,
but they're actually completely independent
once you do register renaming.
We can execute the increment on that array index
right away because we can just remember the old value
and use it if we need it.
And so here we're seeing out of order execution
and speculative execution from the processor.
Sorry, sorry, not speculative, just out of order.
So this is great, right?
This means we've already handled
those two instructions.
But what happens next?
We have this branch, but we predicted
that it would be taken.
And so the processor just assumes
that this is gonna be taken and continues
executing this stream of instructions.
And so it starts off on the next iteration,
which it can pack in very densely
because there's a lot of remaining space.
So it can go ahead and execute the next increment.
It can execute the next increment
before any of the loads have finished, okay?
Before any of the loads have finished.
And then it can actually do the next test.
And because it's already doing the next test,
it doesn't have to work very hard
to actually continue speculating,
pass this iteration into the third iteration.
It's going to finish the third increment
before the first load becomes available, okay?
But look at what we've done here.
Look at what we've done here.
The latency to get from the very first operation
to the actual result, the ax1 there,
was eight cycles.
The latency to get to the next end result was two cycles.
This is why this is how you get nowhere faster.
This is actually probably the most important thing
in modern CPUs for performance.
It's just unbelievable what this does
to your execution performance.
And just to be fair, let's roll this whole thing forward
until we've actually fully packed the execution units.
And you can see there's actually no space
to fully pack these execution units.
We had one stall cycle here
because we can't compute flag six.
We can't compute the comparison for flag six
on the sixth iteration because the increment
to the index hasn't finished yet.
And we can't quite do the increment to the index earlier
because the previous increment to the index
hadn't finished yet, because we're on
the sixth iteration.
And look at where the sixth iteration starts.
The same cycle as the first iteration's result.
Okay, so we have over six iterations
of this loop executing in parallel
on a single processor.
All the way around the back edge.
How many folks here realized
that this is what executing this loop
would look like in the processor?
You guys pay more attention than I thought, good.
So this is what makes it really hard
to reason about performance.
Now, let's think about something else with this.
We have this cmov.
Why is cmov weird?
Because we can't predict cmov.
We simply have to evaluate both inputs.
See, we didn't wait to find out anything about the flag.
When we were scheduling this, we didn't have to wait
for flag zero to finish resolving
to start the next operation.
We just keep going, right?
But with cmov, you can't do that.
You can't speculate through cmov
in the same way you can through a branch.
And this is why branches, in a bunch of cases
end up faster than cmov.
Making some sense?
Looks like I have a question.
Or you can speak up, I'll try and listen better, sorry.
(person talking quietly)
So the question is, why doesn't cmov
get microcode translated into the right thing?
Because that's hard.
That's a very expensive transformation to do
in a very constrained part of the processor.
Knowing when it's safe to do that,
when it's correct to do that is very, very hard.
And there are cases where cmov is absolutely correct.
There are cases where there's no advantage
to speculative execution, all of the inputs
are going to be ready long before the cmov
has to be executed, and where actually
doing the speculation might be hard.
And so how does the processor know?
And the answer is it doesn't.
And so sadly, this falls on us compiler folks.
this seems really complicated.
I would like tools to help me reason
about this kind of pipelining.
And I showed you kind of a fictitious processor,
but it would be nice if there were better tools.
I was gonna try and live demo a bunch more tools for you,
but unfortunately we only have enough time
for a Q and A session.
And so instead, I'm gonna mention the tool,
and I'm happy to show you guys some other time.
It's called the Intel Architecture Code Analyzer.
It's an amazing, awesome tool.
It's just a piece of magic.
You should all play with it.
It's gonna give you the exact graph I did,
except it'll be real for the code you give it.
It's just amazing.
I could try and do it, but it would take me
like way more than the amount of time we have here.
But let's do the other live demo,
which I think is faster.
And then folks, get ready for your questions
cause we only have a few minutes.
So one thing that I think is useful to do here
is actually to look at what the performance counters
tell you when you run these two different
versions of the benchmark.
Because the performance counters actually
give you some good clues as to what's going on here.
So if we look at the original one with cmov in it,
one thing that we're going to notice is
we have a tremendous number of backend stalls here.
And that's not a good sign.
We are actually struggling to execute that clamp loop.
But this is also probably coming from storing to memory,
and sadly, we are storing to memory on every iteration.
And if I switch over to my fancy clamp loop,
we're gonna see a pretty substantially different profile.
So it finishes faster, but now you see
something really confusing.
Even more backend stalls.
And this should confuse you, right?
Like why did it get worse?
But think about what we just showed.
The processor is going to speculate that branch.
And it's going to have a store into memory
ready on every single cycle.
All this benchmark is doing is storing to memory.
It should be stalled every cycle,
trying to store to memory.
The fact that the other one wasn't
is because it was doing something else.
It was waiting on cmov.
And I mentioned this was hard to tease out.
It looks very counterintuitive.
But this is the reality of X86 processors
and most modern CPUs.
All right, with that we should wrap it up here.
Let you guys ask me the really hard questions
that you've been holding back on,
cause I know you have.
It's good, I would too.
Like I said, hopefully this lets you understand
why your loops are going fast or not,
lets you reason about at least two
of the interesting cases.
There are a million more.
Unfortunately, I can't stand up here
and monopolize the entire conference.
So I'll have to come back and talk about
more fun versions of this next time.
All right, so questions?
I don't know which side was up first.
Let's go over here.
- [Man] Thank you, great demo.
And you showed it on AMD processor, right?
Did you see any difference on Intel processors?
I will freely admit, I haven't tried this
on Intel processors very recently.
However, in the past, I have not seen
Most modern processors are very similar here.
You'll find all of what I'm talking about here
generally holds for modern desktop
or server-oriented and even the high end
of mobile-oriented Intel processors,
AMD processors, ARM processors, and PowerPC processors.
It's really ubiquitous.
This is not actually specific to one architecture at all.
For example, the out-of-order execution talk,
the original version of that was given
talking about an ARM processor, not an X86 processor.
But it translates perfectly.
The processors are much more similar
than you might imagine.
- [Audience Member] Working through SG14
and AWG is a proposal for likely unlikely attributes
to standardize things like built-in expect.
We actually noticed the same thing
while validating the use of that that
when you put likely unlikely on,
it turns the cmov into the jump.
- [Audience Member] And we saw it got better
around one tenth or a hundredth of a percent of the time,
it would sharp drop off.
Do you have any thoughts on using notations like that
and a good way of dealing with them?
I love them, they're great.
I have one problem with them.
Sometimes they get old and out of date and incorrect.
What I would really, really, really like,
and sadly I've never had the opportunity to go with build,
is a tool that takes an actual profile of the binary
and checks to see, so,
did this profile and your static hints
actually agree at all?
Were they even close?
Were they actually contradicting?
And actually tells you, the programmer,
hey, maybe you should go check
these two annotations over here
because when you ran your binary,
those annotations weren't just unimportant.
They were dead wrong.
But I don't have that tool yet.
And so the only hesitation I have is that
if you're not careful, they can get out of date.
Otherwise, I love them.
Oh, one more thing about them.
Those are especially great
when you have some fundamental reason
for them to be correct.
Best example in the world, vectors grow
during pushback, right?
We actually have an algorithmic guarantee
that vectors grow during pushback is cold.
Because it's amortized away.
That kind of guarantee, fantastic.
- [Man] Hey, Chandler.
You actually showed us live that
comparing to a literal is faster
than comparing to a register, which seemed,
well, quite unnatural to me.
- [Man] Yeah, the cmp and the move, no, the move, sorry.
In the move, yeah.
- [Man] Yeah, it seems strange to me
that moving a literal is faster than moving
from another register where you already have the value.
Yeah, I don't know.
I mean, I can sit up here
and try to figure it out, but I don't think
I'm gonna succeed.
It might be entertaining, but
I don't know that I'm gonna have an answer.
My best guess is actually that
it may have gotten in the way of register renaming.
It does mean we have another register live, right?
I don't know why that would be the case.
It seems very strange for that to be the case.
This loop doesn't seem complicated
for the processor to handle.
But I can't argue with your conclusion.
- [Man] All right, thank you.
- [Audience Member] Hey, great talk.
So I had a question about the last loop
that was faster but it had more stalls.
- [Audience Member] So you said that it's stalling
because we are storing to memory, right?
So what about the store buffer?
How does it...
So the store buffer, first off,
for folks who don't understand,
there's a buffer inside the processor
that kind of takes stores before
they actually arrive at memory.
However, we're also reading from memory.
And we're not reading from the same memory location.
And unfortunately, X86 makes guarantees
about memory orderings.
And so we're not waiting on the store,
the store buffer can't help us here
when we actually need to let the store
flush all the way out.
- [Audience Member] Okay, thanks.
- [Man] I actually got up here
to ask that same question about ebx and the literal.
I mean, also I should be clear, I'm guessing.
Again, I can sit up here and try and prove it.
If we had more time, I would,
cause that's actually really interesting,
to try and understand whether it's stalling
because of synchronization, like establishing the TSO,
because it does have to establish TSO
for these stores, sadly.
And we could do something to try
and actually see that.
For example, we could switch them
to non-temporal stores.
Because we happen to know we're not going
to read them again,
and see if that changes the behavior of the processor.
And it might.
It might actually make a big difference.
- [Man] So instead of asking that,
I'll piggyback on Paul Hansom's question
about the built-in expand down notations.
And I think a couple of the HFT talks here
mentioned that sometimes you don't expect
something to happen but you still
need it to be fast.
Got any opinions on that?
No, it's absolutely relevant.
We see this actually a lot with
more genuine profile-based optimizations
because there're kind of very boring reasons
why a profile is sometimes wrong.
However, there is some good news here.
The compiles doesn't just go home
when it sees something is cold.
It tries to make the path that isn't cold
as fast as possible.
And it will make the cold path slower
when doing so makes the other one faster.
But it won't insert arbitrary slowness there.
And so I would not expect this to be that bad.
I mean, it could be, but that seems kind of weird.
The way we actually talk about it
inside of the compiler, inside of the optimizer is
whenever we have a choice between making
one side faster or making both sides faster,
we always prioritize.
But it's about priority.
It's not really about pessimizing.
I don't know if that really helps though.
- [Audience Member] Given that cmov
doesn't seem to do speculative execution
- [Audience Member] In the loops,
what are the cases where it makes sense to use cmov?
So it makes a lot of sense to use cmov
when your dependencies are already
going to be required.
That's the classic way.
In particular, for example, if you have to
materialize the values to compute the condition,
it can just not matter.
And so then you get rid of a branch,
branch predictor gets one less branch
it has to keep track of.
It can be a win.
But I'm sympathetic.
I've never met a cmov I liked.
I just, it hasn't happened to me yet.
I don't know why.
But there is a theory behind the instruction.
- [Man] On the first live, when you were
populating the micro ops into your pretend processor,
you did a load on I think two,
and then you didn't use the result until six.
I was surprised to see that you put
another load right on three.
I was thinking, are the loads sort of
fire and forget on a real processor?
You can just load, load, load, load, load,
and not worry about how long it's gonna take to come back?
But yes, in many cases there is overlap there.
There's two sources of delay.
One is figuring out what to load.
The unit's busy, it can't take more.
But there is also just waiting
for the caching memory subsystem to produce the value.
That's writing into a register file.
The unit's not really involved in that.
You're just waiting for a dependency to be satisfied,
and then you keep executing.
- [Man] So you definitely don't have to wait
until six to put another load line in.
Almost certainly not.
- [Man] Okay, thank you.
Almost certainly not.
- [Audience Member] In this first case,
in the cmov, have you tried profiling-guided
Did it help?
Yeah, LLVM kills the cmov and switches to a jump
if you give it a profile.
- [Audience Member] Oh, okay.
- [Man] In my amateurish view of things,
isn't it clear that when you store the literal directly,
it's faster than doing this because
it doesn't have to, it already knows,
when it decodes the instruction,
it knows what the literal is.
It knows what it has to store.
So it doesn't have to wait for the register
to actually have the value.
I would not expect the processor to work that way.
I would actually expect that the wires
that do the store come from a register file entry.
- [Man] Okay.
But I'm also not a hardware architect.
I tried to give you enough understanding
of the hardware to work with it.
But I don't build processors.
You should track down someone who does.
They might be able to give you
a much more detailed answer.
I know there's one person here
who might have a clue, from Intel, at least.
- [Man] Okay, thanks.
- [Audience Member] So my question is about
going nowhere faster.
So if you're gonna speculate the instructions,
and let's say the processor gonna speculate
doing some loads out of the bounds of the RA.
How does it work in the hardware
that it doesn't crash?
I don't even know.
Magic and unicorns?
Anyways, sadly, the session's over.
I'm happy to try and find folks
and answer more questions later.
Thank you all.