So I'm going to turn it over to our headliner: Chandler Carruth.
Thank you.
(applause)
CHANDLER: Morning everybody
How are you guys doing?
Excellent, excellent!
So everyone who is at the back, you're going to want to scoot up a bit, OK?
This is fair warning!
Guys, you might want to start that shuffle process now...
You're going to want to be able to read very little bitty text up here,
for the whole talk, OK.
So, my name is Chandler Carruth, I'm here from Google
where I'm the C++ and LLVM tech lead
I do a lot of fun compiler stuff,
I do a lot of fun programming language stuff at Google
and I'm here to talk to you today about tuning C++.
When I say tuning C++ I mean tuning C++ for performance -
like runtime performance
we want to have really fast and efficient code
It's one of the primary reasons why we still use C++
today - to get the last one or two percent of performance.
I'm going to try to talk to you a little about how you can do that,
I'm going to cover thing like benchmarking
Some of the crazy things we see with CPUs
Compilers (which are my bread and butter)
All kind of crazy stuff
Sound good?
OK there are a couple of caveats.
The first caveat is, and I actually do feel a little bit bad about this,
This talk is going to be very Linux focussed
Now I'm hopeful that the concepts are going to carry over to other operating systems
I think there's similar software on almost every other operating system
but I'm a Linux expert, I've got to teach from what I know
It's going to be fairly Linux focussed
It's also going to be x86 focussed
Most of my performance work is on x86 these days
and for better or worse
And that means I'm going to talk a lot about x86
There's going to be a lot of assembly
and other stuff up on the projector
So prepare yourself, right, for that part.
The last thing I really want to caution you guys about is
this is not going to be
a talk similar to Herb's talk or Bjarne's talk.
I don't have the right answer here
and that's not the point of this talk
So you're going to see
things that I'm not 100% proud of.
You're going to see things that are very questionable
but they're going to work (that's at least the hope).
And to make sure that you believe me when I tell you that this stuff actually works
and that you can do this if you go do this on your own,
I'm going to do almost the entire talk live.
So prepare yourself for that as well.
Alright, so let's get started.
So performance does actually matter
How many folks here think that performance still matters today, that it's still relevant?
Good - you guys have been paying attention!
Now for anyone who is questioning, like really quickly
do you have any guesses why performance still matters today?
Anyone? You can shout stuff out, it's OK.
(Audience murmurs)
Energy - I heard, yeah!
So this guy - Nicola Tesla. We have this whole electricity thing, it's really cool, super-fun.
And you know, electricity's great but
this kind of gives the wrong impression about electricity
because here electricity is just everywhere, right?
It's just crazy... there's plenty of it
Like we don't even know what to do with it, we're just studying it
What we really have is a scarcity of electricity
and the scarcity of electricity comes
in almost every place where performance matters today.
From the tiniest of little devices that we've been introduced to,
that have these very frustrating batteries (sometimes),
to the largest of data centers.
And all of these things need electricity and they are all struggling to get it.
That's the primary motivation behind performance work today.
So this is what's in the back of my head constantly when I'm thinking about performance
OK, the second thing I really want to talk at a high level about
is exactly what C++ code you should spend your time tuning.
Because, spending your time tuning C++ code -
that introduces complexity, it's a hard task,
you may make mistakes, you may even introduce bugs -
we don't just want to tune anything.
This actually became much more of a pressing concern for me
when I first started at Google, OK?
Very first- so I- you have to understand
I started at Google almost eight years ago
I started at Google right out of university
I had no idea what I was doing
I'm not kidding - I literally had no idea what I was doing
I thought I was crazy for even going to Google...
So on week one, they take you through this lovely orientation thing
They say "Welcome to Google!", they give you funny hat to wear
(Which doesn't go over too well)
You get made fun of by Sergey Brin on stage
It's a good experience on the whole
It's a good experience, I was very happy...
Then week two came around...
(fumbles with mic)
Sorry, my microphone doesn't like me.
Hopefully it stays on this time
So on week two, things get a little bit more interesting,
I started to actually talk to my co-workers,
figure out what team I was on
That was fun,
got to actually learn some stuff about what Google is actually doing
And then I had a meeting with my manager at the time,
great guy
He had an interesting thing,
He came to me and was like:
"I've got a problem...
"I've got a problem because
"we have some code that we need someone to take a look at...
"...and we thought you might be the perfect person to take a look at it"
Which, by the way, for those of you who don't understand the translation,
that was - "You're the new guy!"
I didn't understand this at all...
So he was like "We thought you'd be perfect". I was flattered....
I was like: "Aw - of course I was!"
And so he tells me:
"We had this problem with some of our systems
"they're really running into trouble...
"The latency on this one system that we use internally
for our internal development work"
(I was on our internal developers tools team)
"It's too slow, it's actually the bottleneck for almost all of our developers"
That's a BIG problem!
And this problem had bubbled up several levels of the company
and across several
different parts of the company
and then down and back up
and you know the whisper game?
The one you play with your kids...
That's true in corporate politics as well
So at some point it reached a totally different part of the company
and had lost some small elements of truth to it
and that part of the company had an extremely senior engineer who had decided to have a crack at this problem
And you know he worked this problem
but not talking to us, right?
which is fine - he didn't even know he needed to talk to us
because by the time he heard about the problem it was a totally differently described problem
And then he sent my manager this code
My manager pulls me in to ask me to take a look at this code
Manager says to me "can you take a look at this code?
"Ken Thompson sent it to us and he says it will this service twenty times faster
"but we don't we really understand how, we need someone to step through it"
Now I look at my manager
Dead serious, I Iook at my manager and say -
"Uh, sure but you know, who's Ken Thompson?"
(audience laughter)
I mentioned I didn't know anything at my second week at Google
It was bad!
I didn't actually know who Ken Thompson was and my manager very politely (perhaps too politely)
smothered his burst of laughter and said:
"A very senior engineer"
Gave me the code, I went back, I went home, I got on Wikipedia, and looked into this...
"Oh Ken Thompson! Oh OK, so the inventor of UNIX has sent me some code to look at..."
That'll go over well...
So I sat down, I figured
It's my second week at Google - I've gotta do this!
I can't fail now...
So I sit down, stepping through this code
Now the first thing I've got to say
and I know this is a C++ conference,
but Ken Thompson's code is actually what taught me
that there was such a thing as beautiful C code
I had always thought that C code was terrible
Like - TERRIBLE!
What I didn't know was that I was looking at terrible C code -
it's an easy mistake to make!
Terrible C code is terrible -
it's not because it's in C it's because it's terrible code
Ken Thompson knew how to actually write beautiful C code
It was a wonderful experience,
got to read through all this code
understood the problem...
The system we were trying to solve,
it was essentially serving out access control rules from this big long table of rules
based on some pattern that you had to match in the table
to say what access you were allowed to have.
Straightforward problem, everybody has this -
firewalls have this,
filesystems have this - everybody has this problem it's a really common problem
and Ken Thompson heard about this totally in the abstract
and he noticed that this is actually very similar to regular expression and finite automata construction
That's one approach you can take to solving this problem
And he's written a few regular expression engines in his life (a few!)
Has some domain knowledge,
and so he figured he'd help us out
and he'd write us a finite automata based implementation of this rule matching
And it used some of the regular expression finite automata code that he had written before
and it was twenty times faster
and in his email he says "this is twenty times faster, we can make it another twenty times faster
but it would require substantially more memory
"this seems like the sweet spot in terms of the tradeoff between time and space"
Very reasonable.
So I'm going through and I figure out all of this stuff,
I learn all this stuff - it's wonderful!
And then I send it to another one of my colleagues
who is actually responsible for the service
and I explain to him how all this works
and he says "well that's interesting,
"so it really does get twenty times faster...
"The funny thing is I was looking at our data the other day
"for the queries we actually receive and which rules match them
"and essentially every query matches against a single rule in the table
"so I put that one first and it's one-hundred times faster now."
(audience laughter)
Well, that's unfortunate...
So the moral of this story is - measure first and tune what matters
and it actually solved the problem -
it did exactly what he set out to do.
Unfortunately, he didn't have all of the information
and if you don't have all of the information you CAN'T make the right decision
and you can't have the right information until you measure.
So, that's my preachy bit of this talk. OK!
I thought about spending a whole lot of time here
and then I attended Bryce's talk [Benchmarking C++ Code - Bryce Adelstein Lelbach] yesterday and...
Bryce where are you? Somewhere round here...
We're going to talk about Micro Benchmarks.
OK so micro benchmarks
Who here knows what micro benchmarks even are?
Got a few people...
OK, so micro benchmarks are when you're benchmarking some really isolated part of your system
Something whose performance matters desperately
but is very very narrow in scope.
But who here really loves slides?
Anyone?
Yeah, so I figured - let's just do it live!
Alright (audience laugher)
So the entire rest of this talk is going to be live
Please bear with me, hopefully it's all going to work out well
So what I'm going to try and do is I'm going to try and show you guys some microbenchmarking
But before I do that I have to show you guys how we're going to write benchmarks
If I can get this to work...
There's this wonderful library,
I can't take much credit for it
But my company did this great thing
we had the benchmarking library internally
and we open-sourced it (this is all on Github)
If you just search on Github for google/benchmark you'll find it - it's super easy
This is a great little microbenchmark library
I don't want to bore you guys with a detailed thing
so I'm just going to skip all the fun stuff
and point at some example usage
So this is actually all on the webpage you can go and find it
this is how this thing works
You write a little function
It takes a state object
you run it
for a while
An unspecified amount of time
And you do something in it
And this benchmarks it
Could, not, be, simpler - right?
This is super simple, this is an entire benchmark
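For those following along at home, the shape being described is roughly this (a sketch based on the library's public example; at the time of this talk the loop was spelled with state.KeepRunning(), newer releases also accept a range-for over the state):

    #include <benchmark/benchmark.h>

    // A benchmark is just a function taking a State; whatever runs inside
    // the loop is what gets timed, for an unspecified number of iterations.
    static void BM_example(benchmark::State& state) {
      while (state.KeepRunning()) {
        // ... the code you want to measure goes here ...
      }
    }
    BENCHMARK(BM_example);

    BENCHMARK_MAIN();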
Anybody have super-big questions about how this works?
Please interrupt me at any point
So let's look at a little bit more fanciness
just so we understand how things work
You can also do some other fun stuff with this benchmark library
You can have... ranges of inputs
So here we've got a range_x
accessor on the state
And what this is doing is...
allowing you to control exactly what input space you benchmark
you've got a bunch of arguments down at the bottom
These go into x
Make sense?
Not rocket science here, right?
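A hedged sketch of what that looks like (range_x was the accessor in the version shown here; newer releases spell it state.range(0)):

    #include <benchmark/benchmark.h>
    #include <vector>

    // Each registered argument is fed into the benchmark through the state,
    // so one function covers a whole input space.
    static void BM_ranged(benchmark::State& state) {
      const int x = state.range_x();  // current argument value
      std::vector<int> v;
      while (state.KeepRunning()) {
        v.assign(x, 0);  // placeholder work whose cost depends on x
      }
    }
    // The arguments "down at the bottom": each one becomes a separate run.
    BENCHMARK(BM_ranged)->Arg(8)->Arg(64)->Arg(512);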
we'll see lots of the fun output
So we want to get a std::vector
like that...
we're just going to push_back(42)
Alright? Just to let us benchmark how long it takes to push_back an integer into a vector
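Reconstructed from the live typing, that first attempt is roughly (the name is illustrative):

    static void bench_push_back(benchmark::State& state) {
      while (state.KeepRunning()) {
        std::vector<int> v;
        v.push_back(42);  // the operation we think we are measuring
      }
    }
    BENCHMARK(bench_push_back);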
How many folks like this? Anyone like this?
We've got a good benchmark going here
Hmm, no-one's really happy about this...
What's wrong with this?
So we definitely need to fix our typos here
Alright
That seems plausible to me at least
So the other thing we need to do is figure out how to compile it
Unfortunately compiling this thing requires a bunch of flags
For which I'm kind of sorry
We've got a bunch of flags here
Optimisation flags
I'm trying to turn on all the latest, greatest features
We've got standard library flags
But none of these really matter
The only thing I'm really going to point out are a few things here
So, I am turning off exceptions and RTTI
Please don't throw anything at the stage
But this is the environment I'm used to benchmarking in
and so I'm trying to give you a faithful representation of what I see day in and day out
Alright,
We've got these lovely flags...
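The exact invocation scrolls past quickly; a representative command line under the constraints he mentions (optimizations on, exceptions and RTTI off, linked against the benchmark library) would look something like this - treat the precise flag set as an assumption:

    $ clang++ -std=c++14 -O3 -fno-exceptions -fno-rtti \
        bench.cc -lbenchmark -lpthread -o bench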
So all I really need to do now...
(I think I have a copy of this somewhere... here we go)
Need to compile our code...
Hopefully compiles... Cool, that worked well...
And we can run it.
And it's running, fantastic!
So now this runs and does a bunch of cool stuff that's kind of under the hood
So let's look at what we actually get out of this
First thing is we get this nice column layout that tells us what benchmark we ran,
how long that particular benchmark took,
both in wall-time that's the first column here,
and in CPU-time
and CPU-time is interesting because we're discarding any kind of overhead where our thing got off the CPU -
that's the CPU-time
and to try and get some of what Bryce talked about in his benchmark thing
This whole library has a pile of internal logic
To essentially run just amazing numbers of iterations until we get something reasonably stable
We've got the iterations over here...
So it ran... oh that's a big number
22 million iterations
in order to conclude that this operation takes 32 nanoseconds
Now the one thing- I hate to rain on your parade-
but if I run it again
you're going to see some fluctuations
So the wall-time here fluctuated by three nanoseconds
You're going to see the CPU-time fluctuate as well,
even with tens of millions of runs we still have a noisy system.
If you want to understand more of how to cope with that that's what Bryce was really talking about.
Last little bit of statistical confidence
of how fast things are
Everybody happy with this benchmark now?
How many folks are happy now?
Why are you guys not happy with my benchmark?
(audience member shouts "malloc")
Malloc! Oh man!
So I'm not benchmarking push_back
You're totally right...
So how do I actually compensate for that?
Well, to compensate for that is actually hard...
One thing that's a little bit annoying, if we just take this vector and pull it out of the loop, right?
Then we're going to benchmark push_back,
But we're going to run out of memory when we push_back
too many millions and millions of integers -
we're going to see other artefacts of the system because we're not actually clearing the memory.
So we need to do something else
Now, a good way of handling this is often
to just, kind-of, set up a baseline
Let's look at kind-of an example of this
So here's a baseline benchmark so instead of push_back,
we're going to benchmark just the creation of this vector
Alright, is this making sense?
So here we've got a new benchmark
this is going to benchmark constructing the vector
and then the second we're going to benchmark constructing it
plus push_back
we're going to... be able to look at the delta between these two
to understand the performance impact of push_back
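Roughly what that baseline looks like (a reconstruction; the name is illustrative), sitting alongside the push_back benchmark from before:

    // Baseline: measure just constructing (and destroying) an empty vector.
    static void bench_create(benchmark::State& state) {
      while (state.KeepRunning()) {
        std::vector<int> v;
        (void)v;  // silence unused-variable warnings
      }
    }
    BENCHMARK(bench_create);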
So, any predictions on how fast the first one is going to run?
Like what percentage of the time is this first one going to see?
90%?
0%!
Everybody see the bottom here?
Zero nanoseconds? ZERO nanoseconds!
It's infinitely fast...
It's amazing, right? I love code like this!
Infinite speed...
It's great!
it's just the default constructor, it didn't do anything...
So we've got to work a little bit harder to isolate this
So what do we actually have to do to isolate this?
We've got this nice create thing
That's great
And let's keep that around
and we need something a little bit fancier to really benchmark this
What if we use reserve?
So here we can say...
And that's going to take the malloc, and actually benchmark the malloc
That'll work, that'll work
How many folks think this is going to actually work?
Yeah, you're catching on to the theme of this talk
So now reserve in our push_back benchmark
and we also do just a blind reserve
so that we should be able to see the delta now
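So the pair being compared now looks roughly like this (again a reconstruction, not the verbatim live-coded version):

    // Baseline: construct the vector and reserve space for one element;
    // this is where the malloc should land.
    static void bench_reserve(benchmark::State& state) {
      while (state.KeepRunning()) {
        std::vector<int> v;
        v.reserve(1);
      }
    }
    BENCHMARK(bench_reserve);

    // The measurement: the same reserve, plus the push_back we care about.
    static void bench_push_back(benchmark::State& state) {
      while (state.KeepRunning()) {
        std::vector<int> v;
        v.reserve(1);
        v.push_back(42);
      }
    }
    BENCHMARK(bench_push_back);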
How much of the performance is this going to take?
Is the reserve going to be 0% or more than 0% of the performance?
Go on, shout it!
4%! 4%?
More! More than 90%!
OK, there we go...
So this is going to be all of the performance
Let's find out
So we call our thing, run it...
Wait, now this is a little bit weird
So I know reserve is good and all
And you know I've read Scott Meyer's books and all these other books
I even watched my own talk from last year
and so I know I should do reserve whenever I can
but I did not expect reserve to make push_back TEN TIMES faster
That's a little bit weird...
So my FAVOURITE profiler in the world
(and I'm actually not being facetious here)
is this little tool called perf
That kind of stuff doesn't go into this
and so the wall-time, down at the bottom here,
is not always exactly equal to this task-clock
But it also tells you how much precision this has
This doesn't necessarily have perfect precision
It gives a coarse-grained look at exactly what performance I'm seeing
it's a fantastic tool
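In its simplest mode you just wrap the whole benchmark run in it; something like this (the binary name is just an example):

    $ perf stat ./bench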
But this doesn't actually tell us
what on earth our code is doing that made it ten times faster when we reserved
So we don't yet have the information - we need to do more detailed performance analysis
To do that you can record with perf
Now this is going to do a more traditional profiling run
it's just going to record what takes place as your program executes
and dump out this very dense,
very strangely formatted data file at the end
that contains the entire profile
and once you do this
There's a tool, the tool has a way to analyze it
You can do "perf report"
creates a nice little report
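Concretely, the two steps look something like this (binary name illustrative):

    $ perf record ./bench    # sample the benchmark binary as it runs
    $ perf report            # browse the recorded profile (perf.data)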
perf report looks something like this
it's going to show what you're doing here
and so it's actually showing
from the hardware
exactly what functions
were executing and how much of the time
So this is your most basic profile -
it's a flat profile, right?
Here are the functions
Here is how long we were actually inside of them
Making some sense?
Alright!
So that's great but there's one kind of problem here
There's no context to understand "is this function calling some other function?"
Because we were trying to isolate out malloc()
The allocation of memory
from this benchmark
But we don't actually get that information here -
we don't know if there's memory allocation or something else going on...
I see malloc down here, right?
- we're spending some time in it
but there's no connections between this stuff
What we really want is a callgraph
A dynamic callgraph that tell us
well this function then calls this other function,
which then calls this other function,
and that's where the performance problem really lies.
and when it interrupts your program
it looks at the current state of the stack
of the thread that is running
and then it tries to figure out the call stack from that position
Well, the problem here is that it doesn't have any way of finding that call stack
Because I've optimized all of the information about the callgraph out of the binary
You can actually fix this, OK?
I know this is really ghetto - I'm using a text file for flag management
But it fits in my presentation so...
and I don't like Make
There's this magical flag
This flag...
Is the performance analyzer's most endearing flag
OK? Now what does it do?
-fno-omit-frame-pointer is a very opaque flag name
So what we're actually telling the compiler to do
is to stop deleting the "frame pointer"
Now the frame pointer on x86
(and most other architectures)
is a register which tells you where the bottom of stack frame is
Or the top of the stack frame is depending on how it grows on your architecture
So with the frame pointer you can kind of orient yourself.
At any point in the program's execution you can orient yourself
and understand how to walk up the stack
and discover the call sequence that led to the place you're currently at.
This is AWESOME!
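Put together, a sketch of the build-and-profile loop with call stacks (the -g here asks perf to record call graphs; the compile flag keeps the frame pointer around for it to walk; the remaining flags are elided):

    $ clang++ ... -O3 -fno-omit-frame-pointer ... -o bench
    $ perf record -g ./bench
    $ perf report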
from the perf tool
You can hit the 'a' key
actually hold on, before I do that
let's look at perf a little bit more...
before we dive deep let's look at this a bit more
There's one other thing that's a little bit confusing here
So, you'll see if I expand this benchmark thing...
Hold on, if I expand this benchmark thing
It's not calling push_back... or create... or reserve...
So we actually don't yet understand this
This is actually my LEAST favourite thing about the perf tool
So this callgraph is not the callgraph you expected
what I naively expect it to look like
OK, and now when I expand this I see
We start the benchmark framework
Cool
and it calls three functions
Ah yes, because we have three benchmark functions.
Is that making sense to people now (that nice callgraph)? OK good, phew!
Now you can keep expanding here,
But,
you're going to see strange artifacts if you do
I strongly encourage you not to just keep expanding
up here
But drop down and find where this function is rooted
right down here
and look at that
and you'll get slightly better information
But you'll still be confused because it says it does weird stuff
like, it seems to be calling this apic_timer_interrupt
which doesn't make any sense
but that's because it's in the kernel
this stuff looks really confusing initially when you see it
and the thing you have to understand is that
the perf profiling tool isn't precise
what it's doing is that it's sampling your program
in the least intrusive way it can
so that your program can keep running
at blinding speed forward
But it's not infinitely precise
in fact there are a lot of artifacts and you're going to see them
so it's really important to keep that in mind
OK, so let's back-up and get back to what we were trying to do
We're trying to understand something that should be simple
We're trying to understand why this push_back function
Is ten times faster when we reserve first
and then push_back than it is when we just push_back
Right?
Everybody back on track?
You know how to use a profiler?
Well OK let's dig deep...
So the way we dig deep in the perf tool is
we hit the lovely "a" key
and it's going to annotate this function
I told you were were going to look at assembly
so you've got to prepare yourself
That's going to make the next few instructions look really suspicious
We start here
We execute down
We do some computation which is basically just loading new data...
That's why it's ten times faster
So what happened here?
So what happened here was the optimizer...
Took one look at it and said, pff this code doesn't do anything
I'm going to delete it
That's its job right?
It's doing its job
Unfortunately it's making your job as a benchmarker or performance analyzer tremendously harder
We've run into the first kind of real problem here and that's the optimisers
How do we defeat an optimizer
Now first off, we've got to pause for a minute
I'm about to tell you all how to defeat the optimizer
As a compiler and optimizer writer,
please use these powers for good...
Don't defeat the optimizer because you have undefined behaviour in your source code
and you're like, "No, no, no optimizer. You're not going to miscompile MY code today!"
Uh-uh, that's not actually going to work for you...
it goes very badly
OK but how do we defeat it when we need to for good reasons
we need to for performance analysis reasons
We need two functions
These are kind of magical functions
I would never expect anyone to know how to write these functions
You might reasonably argue that I shouldn't be writing them
they should be in my library
but I want to show you guys how these things actually work
OK we're already in trouble
I'm using void *s - I know, I know...
Type safety and all that...
But there's actually a reason why I'm using void *s
It's going to get a lot worse
I'm going to next type the two scariest tokens
from my compiler/implementor world.
OK? Literally the two scariest tokens...
(audience laughter)
OK!
And now I'm going to do something...
this doesn't make any sense to anyone
What on earth have I done?!
OK so what is this?
This is a magical escape function -
it is truly magical!
This has these unique and mysterious properties...
It's going to assemble it
It's going to build a representation for it
and then it's going to notice it doesn't DO anything...
Or in my case that it's EMPTY.
That doesn't do what I need.
So I have to tell the compiler - "No, no, no,
"this assembly has some crazy observable side effect."
In my case the observable side effect is the benchmarking.
The benchmark numbers.
So it's actually doing what I want here
This is a correct usage of volatile and in my opinion a correct usage of inline assembly.
So the syntax for inline assembly,
and GCC came up with the syntax (I think) and clang uses the same syntax,
is kind of opaque - it's a little bit confusing.
So you have a string literal that is your inline assembly
You have a colon,
you have a comma separated list of outputs,
then you have a colon,
you have a comma separated list of inputs,
you have a colon and then a list of clobbers is what they're called,
"Clobbers".
So the fact that create takes more time here is totally fine
That's not what's interesting
I want to actually look- what's reserve doing?
This is actually pretty straightforward...
Where's our loop? Here's our loop.
We come in, we allocate some memory
Not too surprising
We delete some memory
and then we loop.
Now one thing that might frustrate you is
(It very much frustrates me)
Is why are these things still in my program?
Ah! Because we MADE them be!
Silently, in between
in this empty space is our escape
right?
That's the cool thing about escape
We can actually force something that the optimizer would love to just delete to stick around
But we don't even have a single instruction in here
So we actually get the lowest cost measurements we can - awesome!
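For reference, the way escape and clobber end up wired into the push_back benchmark is roughly this (a reconstruction, not the verbatim live-coded version):

    static void bench_push_back(benchmark::State& state) {
      while (state.KeepRunning()) {
        std::vector<int> v;
        v.reserve(1);
        escape(v.data());  // the storage is now "observable", so the
                           // allocation can no longer be deleted
        v.push_back(42);
        clobber();         // make the written data observable too
      }
    }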
Sad story
So when you explicitly write reserve and then you explicitly write pushback,
the optimizer does something different than when you just write pushback
even though pushback, behind a conditional, calls reserve
The reason it does something different,
is because it now inlines at different points
Unfortunately, it's going to make a different inlining decision
and this is just a horrible horrible deficiency of today's optimizers, right?
This seemingly meaningless change causes such a dramatic performance swing
It's not because it's faster or slower
It's because in one case, the optimizer got very lucky
and managed to inline all of these things and delete all the code
In the other case the inliner didn't quite connect A-B-C-D
even though theoretically it could
and so it didn't delete everything and so we saw the real performance.
Yes?
< Is that because of the order that the compiler sees those functions?
< Visibility while it is running (inaudible) it always looks back it doesn't look forward (inaudible)
No no no, nonono, no
Not at all relevant, compiler doesn't care
The allocation in reserve
and I'm speaking as the author of it - I'm not trying to denigrate it -
it's a very very fantastic thing but
it gets tripped up easily
This has been fun
I want to briefly look at one other crazy example
Because I thought, you know, I should have some fun stuff to look at
but I'm going to benchmark it first because I know I'm supposed to measure things
So we have this generate_arg_pairs()
This is going to generate our input space
We're going to walk from 32 I think- no... 16, up to 1024
then on the other dimension we're going to get the numbers 32, 128, 224
These are kind of arbitrary - it's just a two dimensional space we're going to explore in our benchmark
we'll come back to them more
we pick a ceiling here which is 32, 128 and 224
We create some vectors for our input and our output
Mark this down, OK?
We're going to take that input and if the input is greater than or equal to my ceiling,
I'm going to do a modulo
But if it's not I'm just going to use input directly
See what I did?
It's awesome!
It's fantastic!
I'm assuming all of these are positive numbers
(that's not relevant to this example)
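A reconstructed sketch of what got typed, using the talk-era library API (range_x/range_y and ArgPair; newer releases spell these range(0)/range(1) and Args) - the names and the exact setup are assumptions:

    #include <benchmark/benchmark.h>
    #include <cstdlib>
    #include <vector>

    static void clobber() { asm volatile("" : : : "memory"); }

    static void bench_fastmod(benchmark::State& state) {
      const int size = state.range_x();     // how many numbers to process
      const int ceiling = state.range_y();  // 32, 128 or 224
      std::vector<int> input(size), output(size);
      for (int& x : input) x = std::rand() % 256;  // values in [0, 255]

      while (state.KeepRunning()) {
        for (int i = 0; i < size; ++i)
          // Only pay for the expensive division when the value is at or
          // above the ceiling; otherwise pass it through untouched.
          output[i] = input[i] >= ceiling ? input[i] % ceiling : input[i];
        clobber();  // keep the optimizer from deleting the loop
      }
    }

    // Generate the two-dimensional input space described above.
    static void generate_arg_pairs(benchmark::internal::Benchmark* b) {
      for (int size = 1 << 4; size <= 1 << 10; size *= 2)
        for (int ceiling : {32, 128, 224})
          b->ArgPair(size, ceiling);
    }
    BENCHMARK(bench_fastmod)->Apply(generate_arg_pairs);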
Who thinks this is even going to compile?
Hey look it compiles! It even runs, cool...
What if 80% of the numbers are below it?
I'm going to vary my numbers between 0 and 255
and then I'm going to pick the modulus point at 32, 128, 224
They'll be like 20%, 50% and 80%, right?
So that should give me kind of a different- but all the numbers are the same...
Anyone spot why?
the actual sample
for a very slow operation
gets attributed after the operation completes
So we're attributing all the cost
of this actual modulo operation
on to the store into memory, OK?
So all of this cost - this whole thing
that was super slow
We're actually accounting for it here
This can really throw you and confuse you
So we are actually understanding how this profile works
Cool thing
Before we go though,
The other thing that's a bit weird is:
Why are there two copies here?
Well the answer to that is of course very simple
We have a nice optimizing compiler!
It was like "you have a loop, it has like five instructions in it,
"that's a waste of my time. I'm going to duplicate it so that we can actually do more things at a go."
Well cool, I like it when my optimizing compiler does work for me
But if two is good then I've got to figure that four is better right?
Who's with me? C'mon, four better? Yeah!
Thank you, thank you! There's one person still awake...
(audience laughter)
So four is clearly better
Let's see what four looks like
So unfortunately to get four (because I'm a lazy typist)
I'm going to do something terrible
Yeah I know - I'm sorry. I'm using a macro.
How many folks here have used clang-format?
Not really.
So why is this not, like, crazy faster?
get a look at what's going on
For the record we have these numbers, they're good right?
But I DOUBLED how many times we did it on each iteration
so it should be twice as fast
and it's a little faster but not a lot faster, right?
What's going on here?
This is, of course, where you break out your profiling tool
So we want to profile this thing but when we do
we actually want to look more specifically at one of these things
so you can actually do this benchmark_filter thing
(the benchmark library has nice filtering things)
and we can do benchmark_fastmod- so it's just bench_fastmod...
Alright?
Cool!
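In other words, something like this (binary name illustrative; --benchmark_filter takes a regular expression):

    $ ./bench --benchmark_filter=bench_fastmod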
So it just ran one, this time it will be nice and consistent
evenly distributed
So we can look at this
There's only one function
Alright!
So what's actually taking place here?
Let's go up and get ourselves oriented
you'll just have to trust me that I know this is actually the start of the loop
We're going to load some input data
we're going to compare it
and if it's less we're going to jump over mod
alright?
We're going to load some data, compare it and jump over THAT mod...
We just keep doing this
Now why is this slow?
The reason it's slow is because we've got this very small number of instructions and then we
jump over even more instructions
We're constantly jumping forward
Now, with modern processors
the primary reason why they are slow is cache misses
Like the processor is faster than any program you've ever run on it
Unfortunately it's spending all its time waiting for your program to have data available for it
And we're going to see a really large number here
like fifteen here
just jumping back up
Now, this is not actually the jump back up.
Again, right, our measurements are imprecise
you'll note the utter lack of samples next to idiv over here, right?
We're actually accounting for the cost of computing the modulo
Even though we only spent 20% of our time computing modulos,
that's hotter than the rest of our code...
OK?
Only 20% of our "numbers" needed the modulo computed
but we actually spent most our time in those instructions
which is really frustrating, right?
But as we kind of have more and more dense kind of distributions
towards under the ceiling, this is going to just get better and better
because we're going to reach code less and less often.
Make some sense? Fantastic!
Now, before I stop and take some questions
I bet all you guys have some burning- one burning question
I spent a bunch of time talking about modulo and all this other stuff
Any questions? What's the most important question right now?
What have I failed utterly to do?
(audience member says "baseline this")
🎵 Baseline!
What does it look like if I just do this?
Why do need all this complexity?
OK, so we've got to actually do our homework here
We've kind of being playing fast and loose...
Because I was excited -
I invented something folks, it was awesome!
But now you're going to make me be honest
OK, so we- I have- kicking around here...
a benchmark that includes both
(rapid typing)
In that sense we're also getting different samples as well
But you can fix the seed if you want - there are lots of techniques to pin this down harder
Yes?
AUDIENCE MEMBER: Ah yes, I was wondering,
instead of just using the Google benchmark test,
could you use std::chrono to get exactly
what you really want to measure in a fixed distribution?
So if you're doing something,
you can actually test how much time is actually running...
AUDIENCE MEMBER: Two questions:
First, why omit the frame pointer and not use the LBR unwind on Intel or DWARF unwinding?
CHANDLER: So, why am I using -fno-omit-frame-pointer?
The first option to- sorry. The first alternative to omitting the frame pointer is actually debug information
like DWARF unwinding
Unfortunately I have had terrible success with DWARF unwinding -
it tends to miss frames and misattribute frames really really often,
so I don't trust it.
I just don't trust it.
The frame pointer always gives me consistent results,
even when it's a little bit hard to stare at the profile -
it's much much more consistent than DWARF.
LBR is a fantastic tool,
But LBR magnifies the statistical... random errors that you see with the frame unwinding...
LBR is a fantastic way to do statistical profiling, it's somewhat less useful for humans.
Because it's even more imprecision in your measurements.
That tends to be my preference,
and I don't think you can use LBR with frame pointers...
LBR has to be the unwind information
and it has too many error cases
AUDIENCE MEMBER: OK thanks. And speaking of precision,
why not use precise counters which perf supports,
or even better correlating more(?) counters like cycles with instruction counts and cache misses...
So you had an example of where because of the sampling rate
probably it was missing the div and making the jump look expensive
So increasing the sampling rate helps and I personally don't think it would affect the benchmark
because you run it enough etc etc.
The sample rate won't help here, the reason it's missing div isn't the rate of sampling,
it's the imprecision of actually collecting the sample data
on the other (?)
That to me is a solved problem but
One of the more bothersome problems I've seen
Especially with vectorization instructions is
skid where basically the CPU counters are a little bit off and you can't quite trust them so you
kind of have to do low-level microcodey type things
to the CPU to eliminate skid
What are the strategies or tactics that you have for doing in perf versus
a more complicated tool
Or is perf not capable of doing that?
Can you just talk about that a little bit?
I have no idea if perf is capable of doing that
The way I get around this is I read perf with a very very large grain of salt
Because it's essentially giving me raw data
and it's not correcting for it
and I try to mentally correct for it
Looking at the instructions, reasoning about what their latency should be
and then how these samples actually match up with that
The other tool that I'm a huge fan of
because it's very easy to use and gives you a lot of this data is the Intel Architecture Code Analyser
And you can essentially give it
a snippet of instructions and it'll explain the Intel accurate model for how those instructions should execute
and if you look at that
and you look at your profile and the samples you're seeing
in the wild
you can get a pretty clear idea of where the samples came from
even when they're skewed
and so I do more of a mental mapping
because I don't really trust any of the tools to undo any of the fundamental skew that's inherent in the counters
They are very very general,
you can almost always get them
to preserve the piece of code you want...
and they are unfortunately just a bit more powerful than a volatile variable
they're also a little bit easier to explain in my opinion
I actually think that reasoning about things like escape
and clobbering memory is fairly reasonable
And pretty easy for folks to think about
One of the other nice things about having clobber at the end of a loop iteration
is that it prevents the compiler from doing
loop computation analysis
To just figure out the sequence of things you're going to store into this volatile memory
and ahead of time prepare them and then just store them.
And there are optimizers that will do this
So the clobber isn't just helping with the escape
it also tends to help with the inputs
actually being new each time around the loop iteration
rather than being stored and cached