- My talk is called What Has My Compiler Done for Me Lately?
Unbolting the Compiler's Lid.
Which is the longest and most complicated title I've seen.
Titles, it turns out, are very hard.
So why am I up here staring back at you all?
It's definitely a question I'm asking myself at the moment.
As Jason, in his wonderful introduction, has said,
my name has rather unexpectedly become a noun and a verb,
which has led to some extraordinarily surreal conversations
this last week.
I've not only had people come up to me that,
oh, had people come up to me and introduce themselves
as if we're old friends, and I'm looking
at them smiling going, "Who are you?"
And then they say, "Thank you for Compiler Explorer,"
which is amazing.
Or I've been in conferences and people
have been talking about Godbolt and, you know,
like how there are those neurons in your brain
that are always background process listening out
for your own name, and I'm just looking around like this,
so it's been very very very surreal,
but thank you for the very warm reception.
I'm going to talk a little bit about me
just so you know how it is that I came to come up
with this idea for Compiler Explorer.
Like a lot of people, I'm sure, I started,
a long time ago, programming on
these 8 bit computers here, which is,
as Jason alluded to, still kind of a hobby of mine,
I'm still hacking around on the 6502 and the Z80.
After a few years of trying to write my own games,
I've been writing stuff in assembly 'cause that's
the only way you could get anything done performantly
back in those days.
I moved over to the ARM processor,
I sort of hung onto the 8 bit era rather too long
and then, when the leap came, I made it
all the way to 32 bit.
And back in those days, there wasn't a version
of GCC for my computer, so I was stuck still
writing assembly code.
I even ended up writing a whole IRC chat client
in assembly, complete with windowing and scripting
and everything built in just because
that was pretty much the only option open to me.
When I finally was introduced to C,
I considered it a bit of a macro assembler.
It was kind of like one of these things was an annoyance,
it got in the way of me writing the assembly
that I wanted to write.
But when I got to university and discovered that,
oh, wow, suddenly it's very dark in here.
Nobody falling asleep, right?
When I got to university and realized
that the only way that I could actually like
write software that would run on more than just
my one ARM computer that I had, I realized I had to learn C.
And at that point, things moved on, and eventually C++ too.
That led to me getting a job
in the games industry, where I spent 10 happy years.
Still kind of on the fence between
writing assembly code and C code and maybe a bit of C++.
After that, I had a dabble in writing C++ tools,
so for those people who work on C++ tools,
I have a huge amount of sympathy for you.
The language is absolutely dreadful
and is the worst thing to have to parse.
So if you see anyone from like JetBrains
or anything like that, give them a good firm handshake,
or anyone from Eclipse.
I then worked at a small Internet startup.
And then I moved on to DRW, which is the company I'm at now
where I do sort of low latency stuff.
So you can sort of say I've always stayed
pretty close to the metal.
So what about this talk?
It's an amazing opportunity to speak
to so many C++ developers, and I thought to myself,
what am I gonna do with this time?
What do I love, what do I wanna tell people about?
I love assembly, you've probably realized that
from the last few slides.
So I wanna make sure that you guys
are happy and aren't scared looking at assembly,
it's a useful thing to do and you should be doing it.
Not all the time, and I'm not saying
that you should go out and learn how to write assembly code
or that you should even ever write assembly code,
but you should be able to look at the output
of your compiler and, with some confidence,
know what it's doing.
And when you do that, you'll appreciate
how much work has gone into the compiler
and how clever the compiler is.
There's a lot of glamorous features
that come out in the language.
You know, all these new meta classes,
Herb's awesome meta classes ideas.
Reflection is coming and all
the template meta programming tricks that you can do
with the compiler, that's awesome.
But that's all in the frontend.
It's very important as well that the code
that comes out at the other end,
the thing that actually is going to run on the computer
is efficient, well, first, is correct,
and secondly, is about as efficient as it can be.
So the outline of this talk is that,
and I give a little bit of a backstory
as to how Compiler Explorer came about.
I wanna give you a very quick x86 assembly overview.
I wanna ask the question,
"What has my compiler done for me lately?"
Which is just to say I'm gonna look at
some of the optimizations, some of the cool optimizations
that compilers do.
Now, of course, these are all gonna be slide examples,
so they're all totally contrived,
but hopefully you'll get, it'll whet your appetite for it.
And then I'm gonna talk a little bit about what happens
when you type into my website,
what's happening behind the scenes to give you your
code results back in front of you.
So in 2012, a friend and I were at work
and we were discussing whether or not
it would be possible for us to start turning on
the C++0x flag or the C++11 flag on the compiler
and start using some of these new features
that were coming through, you know.
Things like range fors, auto variables, things like that.
And being a high frequency trading shop,
we can't just say to everyone, "Go ahead,
"just use whatever cool new feature is available,"
we need to make sure that we haven't done any harm.
So we concocted an example bit of code like this,
which is, as you can see, it's just taking a vector
by reference, vector of integers by reference,
and it's just summing them all up.
Yes, this could just be std::accumulate, I know,
but bear with me.
And we wondered that this traditional way of writing it,
that is just a looping over an iterator by using an index
into that, looping over a container, I'm sorry,
by using an index into that container.
That would be how we'd write it before,
we wondered whether or not we could replace it with
the much nicer code here just using, you know, range for.
We've been bitten before by other languages because
some managed languages in particular
that we had some experience with will construct iterators
in order to iterate over a container like that
And that would generate garbage and ultimately
things would go on behind the scenes, obviously,
C++ doesn't have garbage, but we just didn't know,
we would like to check, you know,
we should test these things.
Which one of these is better than the other?
And that would be kind of like the leg into us,
me talking about how Compiler Explorer came about.
And if my clicker works, we'll go to the next one.
Before I get into a whole bunch of assembly,
we're gonna sort of tell you, give you
at least enough information to know about
how to read assembly, I'm gonna give you a big warning here.
It's really beguiling to stare at the output
of your compiler and start to kid yourself
that you can see what it's doing
and know instinctively that this is better than that
or this other thing is best or like,
there are fewer instructions and so it must be faster,
things like that.
If any of you were in here before for Chandler's talk,
he gave just a tiny glimmer of the kind of
complicated things that are going on inside
a modern processor and you really can't predict.
While you can obviously develop an intuition
about what's going on, you should always measure
and you should definitely use something like
Google Benchmark or one of the other benchmarking libraries
that's out there that have done their best
to try and make it as bulletproof as possible,
although microbenchmarks have their own problem.
Or you can, a quick shout out to Fred
if he's in the audience somewhere for his site, quick-bench.
Which is an awesome online tool that allows you
to check out two different snippets
and compare the performance, again,
using Google Benchmark behind the scenes,
but also in an interactive online way.
And I've, I click a down, whoops.
So let's talk a little bit about x86.
Well, first of all, I guess, how many of you
compile for x86 platforms by raise of hands?
I had this terrible fear that suddenly,
everyone was gonna say, "No no, we do ARM now."
All right, who does ARM, ARM predominantly?
There's a few hands.
What about 32 bit x86, is that still being targeted?
Anyone got a more exotic processor
that they regularly build for?
Oh, yes, some PDP-11, right?
All right, well, that gives me at least some legs,
it would be really embarrassing now
if suddenly I had to invent another architecture
to talk about, but we're gonna talk about x86,
and specifically x86-64.
We don't need to know too much about it.
All you need to know, it has registers.
These are 64 bit values and there are 16 of them.
Back in the 32 bit version of the chip,
there were only eight of them with these funny names,
ax, bx, cx, dx and all these things,
and some of them have like special meanings as well,
but for the most part, you can think of them
as general purpose like variables.
Luckily, when AMD decided to extend out
to the 64 bit version, they added in eight more registers,
which was handy, and instead of inventing new funny names,
they just called them the far more sensible r8 to r15.
Those are where the integers are stored,
that's like your integer variables inside the processor.
There are some other registers, the xmm registers
or ymm and zmm, depending on which revision
of the processor you've got.
Any of your sort of floating point operations
will happen in there, and any sort of packed operations
will happen in those registers too,
but by and large, and at least for this talk,
I'm not gonna talk about those registers.
The only other thing you need to know is that
there's an ABI, there's a way that functions
talk to each other, and that's by convention.
And that convention is that, when I call a function,
the arguments of that function wilL be an rdi, rsi, rdx,
and then some other registers.
And then, if you're returning a value,
an integer value, that is, you're gonna leave it in rax.
And there are rules about which registers
you're allowed to overwrite and which registers
you have to preserve, but if you're not writing assembly,
you don't need to know.
That's kind of like our ABI for talking between functions.
Now, registers, as I've said, are 64 bit.
But they have all these different names.
So this is, I've given rax as an example here,
but this is true also for rbx, rcx, all the other ones,
and even, you know, r8 through r15.
If we put an r at the beginning,
we mean the whole 64 bits of the register.
If we say eax, we mean a 32 bit version of the register.
And we just get the bottom 32 bits.
And for complicated reasons, if you write to eax,
it zeroes out the top 32 bits.
That's not true of ax, ah and al,
which are these names for the bottom 16 bits
and then the two bottom bytes, which are kind of legacy.
I mean, this is kind of how Intel have maintained
backwards compatibility right back up to the,
I think the 8088, by having essentially just
adding on these, widening the registers
and giving them new names each time.
Once you've got registers, you have to do things with them.
So that's when we finally get to talk about instructions.
So instructions can take anywhere
between zero and three operands, and unlike Chandler's talk,
I'm using the Intel syntax here.
God, I had a whole routine done for like tabs
vs spaces and things, but I didn't realize
it was gonna get that much of a clap.
By show of hands, who uses Intel syntax
when they're looking at assembly?
Oh, okay, and who uses the AT&T syntax?
Oh, oh good, I thought I was gonna make some enemies today.
This is why Compiler Explorer, listen to me,
Compiler Explorer has Intel assembly by default,
because it makes sense and because I grew up--
And everyone else is wrong, I'm sorry.
No, because I grew up, again, with ARM and 6502,
and they all have like the destination
on the left hand side.
So if you see anything on my slides,
the destination of the instruction
is gonna be the left hand parameter.
So operations could be things like call, return,
add, subtract, exclusive or, and,
all those kind of bits and pieces.
And then sort of uniquely to x86, well, not uniquely.
Specific to CISC type machines like x86 is,
those destination and source registers, sorry,
things that are up there don't have to be registers.
They often are, but they can (audio cuts out)
references to memory.
And being as complicated as it is,
the reference to memory can actually be
more than just a memory address.
We get this huge let of options at the bottom here,
we get like a base, which is a constant.
We can pick another register and say
add that register to it.
And then we can also pick yet another register
and add a multiple of one, two, four or eight to it.
Obviously, this is handy for things like arrays
because you wanna have a pointer to the beginning
of the array, and then you wanna get like
the ith element, you're gonna put i in one of the registers
and then you're gonna multiply it by one, two, four, eight
to kind of move it forward that number of bytes.
So that's really powerful.
But it sorts of blurs the line
about what a single instruction should be.
I mean, if you can do all this on like an add.
And in fact, this is the kind of thing
we're talking about here, we've got a bunch
of instructions there, I also put the C equivalent
on the other side.
You can kind of see the pattern there, I think,
where like, the first one is just reading from r14,
it says r14 is an address and I'm gonna read four bytes
from it and interpret it as an integer, put it into eax.
Second one is just adding two registers together.
Then we can start to see these more sophisticated
addressing modes where we can add
whatever's r14 plus a small constant into eax,
it would read out like the first element of an array
if r14 was an integer array.
And then that subtracts sub eax, DWORD PTR, blah blah blah.
That is doing that array index operation
that I was just saying about.
Now, it turns out that the array indexing stuff
is so powerful and so useful that there's an instruction
that just says hey, compute the address
that I would otherwise want to read from
and just give me the result of what
that address computation is.
That's that penultimate instruction at the lea,
the load effective address.
It's kind of a glorified add, it was designed
to effectively do like the C code equivalent shows,
like take the address of some element in a subarray.
But it is also effectively just adding
a bunch of things together in a more sensible way,
you might argue, than the actual add instruction.
And then it's worth calling out this thing
on the bottom here, that xor edx, edx.
It seems kind of weird
to do exclusive or on something with itself.
But as you've probably worked out,
if you take any number and exclusive or it with itself,
you end up with zero, so this is a fancy way
of setting edx to zero.
Why might you want to do it that way?
Well, in order to encode a move instruction,
move eax, 0, you would need to put the instruction
for move down and then you'd have to put the four bytes
that correspond to the zero that came through.
Whereas this exclusive or is only two bytes
to encode the op code, so it's smaller
and more compact in terms of code efficiency.
There are also some architectural reasons
inside which are super exciting and interesting,
but I don't have time to go into here.
So in summary, funny 64 bit registers,
they have different names, parameters are passed
in rdi and rsi first, result comes out in rax,
operations are with the destination on the left hand side
and destination and source can be registers
or those memory references.
Okay, so now you all know how to read x86 assembly.
Where were we, we were here, we were comparing
these two implementations of effectively standard accumulate
over a vector of ints.
Which is better, well, let's have a look at it.
Let's have a look at the assembly,
that's how we look at these things.
This is Compiler Explorer version 0.1, it's a shell command.
So I was running g++ on a temp file,
putting the optimizer on, telling it
to just output the assembly instead of assembling it,
outputting to dash, which is standard out.
And of course, the very important masm=intel.
Pipe it through c++filt because the mangled names
don't mean anything to me, and then this funny grep here
is just removing all of the funny lines,
the dot directives the compiler needs to put in
in order to talk to the assembler in order to get like
a real program out but I'm not interested in.
And we got something like this.
I don't know if you know about the Unix watch command.
It's a command that takes another command
and it just sits there and runs it in a loop
and it keeps running it over and over again.
And you can even put --dif, and it will show you
a difference from the previous time it ran.
And so I just split my terminal in half with tmux,
ran that in one side and run vi on the other.
There you go, Compiler Explorer.
We found it so useful.
We were able to answer all sorts of questions
about how the code was being compiled,
and once you've got a tool like this,
you start thinking hmm, I wonder what else we can do.
So I went home that night, and it was hurting me
that this was a great solution, but it was not very pretty.
And so what do you do when something doesn't look
very pretty, you could use a cool graphical toolkit, right.
Or you could go to the web.
So, oh gosh, I'm so glad that loaded.
This is Compiler Explorer, many of you have seen it before.
And this is the example, so this is the embedded view.
But I'm not gonna show the embedded view because, frankly,
it is too small, oh, I'm gonna have to...
All right, is that just about readable?
Yeah, oh good, and the yellow shows up, hooray.
So here's our example.
And on the left hand side, we have the code.
On the right hand side, we have the assembly
that comes from it.
I'm compiling with GCC 7, 7.1 here,
it's the version I wrote this talk with
and I'm paranoid about changing anything
in case it all moves around.
Inside the compiler options here, I'm using O2
and C++1z and arch=haswell.
Haswell is an Intel variant that's about five years old
and is in most servers, it's a fairly safe baseline
in my experience.
Just to sort of give a brief sort of wave,
hand-wavy introduction, we're gonna go
into a little bit more detail in a second.
If I mouse over this yellow area here,
you can see that there are two yellow areas
on the right hand side that have sort of become
a bit bolder, although it's difficult to see on that screen.
The color coding tries to match up the source lines
with the assembly output.
So although we don't really know that much about assembly,
apart from my crash course I've just given you,
we can see the int result = 0 corresponds to xor eax, eax.
So we know that xor eax, eax is the same
as eax = 0, so we can at least sort of see
a parallel between those two things.
Similarly, if I look at this red block here,
this is where we're accumulating into the result.
Now again, because we're xoring eax with zero,
we might reasonably intuit that the compiler is using eax
to hold the value of the result while it's running
through its loop.
And indeed, this is backed up by us looking at
the add eax, DWORD PTR [rdx], which is it just saying
I'm gonna accumulate into eax with
whatever is pointed at by rdx, and then it moves rdx on.
And then you'll notice that this return result is white,
there's no actual assembly instruction corresponding to it.
That's because, when the compiler
is finished executing this loop of adding things up,
the result is already in the eax,
which is where we needed to leave
the return value for our caller.
So there's actually no instruction.
I mean, typically, I guess you could argue
this red here should be attributed to the result,
but it's not perfect how these things are lined up.
Okay, so before I get too far into this,
I would just like to show you what happens
if we turn the optimizer off.
So optimizer off, and I'm not gonna go through all this,
but the compiler does pretty much exactly what we asked it.
So this is kind of interesting and useful
just to sort of see fully what it is
that you have asked the compiler to do.
And we can see that there's a whole bunch of code
that's been generated here, these vector operations
have been emitted.
And if you went to the linker talk earlier today
you'd know that they were marked as weak
and things like that.
But on O1...
Yeah, we've got a mov eax, 0 here,
it's funny that GCC has decided that just
replacing mov eax, 0 with xor eax, eax
is just not worth doing at O1.
I don't know quite what the rationale is there.
And then if you put it on O3, which is the one true setting
but not very good for, as we know.
I mean, look at this stuff.
There's pages of it, and I'm sure
it's all super super awesome and it uses
all these vector instructions,
which are super cool and everything.
But we'd have to benchmark to make sure
that was actually faster than the simple version.
I trust the compilers a lot, but anyway,
it's too much to fit on the slide,
so I'm kind of settling for O2 for most of this stuff.
All right, anyway, what was our original question?
It was are we okay to do a range for?
Now, I am gonna drag down one of those
and I'm gonna paste the code down here and I'm going to
hide away the hidden set of code that I've put there.
The font size is slightly too big here,
but I'm sure you'll give me the benefit
of the doubt for a second, so I'm gonna go
for (auto x : v) result += x, and I'm gonna delete
those two lines, and nothing's happened,
'cause all I've done is I've created an edit window.
The UI is such that I need to now actually
slave a compiler to that window,
so I'm gonna pick up a compiler
and I'm gonna put it over here.
And then, just to make it a fair fight,
I am gonna grab the same command line options
and paste them in there.
Okay, well, I can scroll up and down
and see that there's 14 instructions there
and there's 18 instructions there,
but there's a better way to sort of see what's going on,
I'm gonna bring down a diff view
and I'm gonna maximize it, all right.
This is all good, there we go.
All right, this is what I wanted to see.
This is the two versions side by side with a difference.
So you can see, on the left hand side
is the version that is using the loop over from
zero to v.size and just accessing it by an index,
and on the right hand side is the version
that's using the range for.
Now, again, with the caveat that you have to
measure everything and you can't just look
at these things, we can at least draw one thing,
which is that, although there's a bit of difference
at the beginning, this region here, from line 7 through 11
or 10 to 14, is identical on both sides.
So even though we wrote the code
in two completely different ways,
the core loop, the bit that's actually
running over everything and summing them up,
is identical in both cases.
And the difference in the top bit is
of a couple of instructions.
We might reasonably expect them
to perform very very similarly.
Oh, hang on a second, I was gonna do one other thing.
And that is, for the purists among you
that are saying why don't we use std::accumulate.
Accumulate begin(v), end(v).
Typing under pressure is never easy.
Now, if you are eagle eyed, you'll spot that that's
absolutely identical, give or take one reordered instruction
from the handwritten for loop.
So use your standard algorithms
I think is the takeaway there.
Let's just quickly take apart that code.
So this is an example of how, if I were reading some code,
I would take it apart and try to intuit
what's actually going on and why are they different at all.
So if you remember, we're taking
a vector of integers by reference.
Now, there's no such thing as a reference
in terms of the hardware, it's all pointers
as far as the processor's concerned,
so effectively, what we've done is we passed
a pointer to a vector of integers to this function.
The first parameter, the first and only parameter,
will be an rdi.
And you can see these first two instructions
are reading from rdi and rdi+8.
Now, it would be easy for you to sort of fall
into the trap of thinking that rdi is pointing
at the list of integers itself, it isn't.
It's pointing at the vector.
And in at least GCC's implementation of the STL,
this is what ultimately, if you boil through
all the template stuff, you get to.
And that is we got a structure
which has three pointers in it.
The first pointer points to the beginning of the array.
The second pointer points to the one past the end
of the array, and the third pointer points to the end of
the allocated region of storage for that particular vector.
So that means that we can have more space
than we're actually using, and as you know,
vectors will grow to fill the amount of space
that they need, and they will shrink down,
well, they won't shrink down unless you ask them to.
Sorry, I'm getting carried away here, but
the interesting thing here is that the size itself
is not encoded inside that structure,
we just have two pointers.
This will become interesting.
So here is the difference bit, the difference bit,
here are the bits that's different.
On the left hand side, we've got the traditional
counting based approach, and on the right hand,
we've got the range based approach.
Those first three instructions on the traditional side,
the sub, the mov and the shift right
are effectively taking the end pointer
and subtracting it from the begin.
That tells us how many bytes are there
between the beginning and the end of our vector.
And then that shift right by two is dividing it down
by four, you know, if you shift anything right,
you're gonna be dividing by two,
and if you shift right twice, you get a divide by four.
So what we've effectively done there is
that's the size call.
So the size call's been made.
And we now know how many elements there are
in our vector that we're gonna be iterating over.
Implicit in that shift right is
a sort of comparison with zero.
So if we got a zero result, the equal flag will be set
and that je, jump if equal, will disappear
after the .L4, which is the there was nothing to do,
just return zero part of the program.
So so far, we've said is size equal to zero,
if so, we're done, great.
So now we might reasonably expect it to use that
while it counts up until it hits
the end of the, like counts a loop iterator
until it hits that size, but something interesting
has happened here, the compiler has realized
that we never used that index inside the loop.
And then we're just reading an array one after another,
and it's basically gone in and rewritten
the whole thing for us in terms of generating
a pointer to the end of the array and then walking
a pointer forwards through the whole array.
So it now needs to know where the end of the array is,
which we all know is actually in or was in
rcx at the beginning.
Compiler has unfortunately not noticed that,
and so it's had to add rdx back into rcx
in order to reconstitute the end point
that we had to start with, unfortunate.
And then it sets the result to zero there.
So a little bit of extra work's happened here,
and I think, I mean, there are probably
some compiler writers here, you could probably,
I don't think I'm speaking out of turn
to think that that could be optimized by the compiler,
I don't know if there's anything, oh,
there's a note there, (mumbles).
(background noise drowns out audience member)
So I think the comment was that Clang
was giving the same output, that it might be actually a bug.
Is that right, we--
(background noise drowns out audience member)
(background noise drowns out audience member)
It's difficult to communicate.
The comment was that there was an issue with the way,
or there may be an issue with the way Clang,
if we could talk about it at the end,
perhaps that would be clarified,
'cause otherwise we're gonna be derailed forever here.
But that's great, there are compiler people here,
they can help me make this better.
On the range side, of course, what's really happening
is that range for is being rewritten to be
get the begin, get the end, make an iterator
and walk it from begin to end.
And so that's exactly what the compiler's done.
So there's no funny working out
how many iterations we need to do
and then effectively throwing away
the number of iterations there.
The loop, we already saw, was identical in both cases,
so I don't think I need to go into too much detail.
We're just reading each integer one at a time
and adding them to the accumulator.
And then, when the iterator hits the end,
we stop looping around and we just hit
that return instruction.
I think we can probably conclude that, give or take,
a couple of instructions right at the beginning,
they're identical, which is great, really,
because that means that we can say to everyone,
go ahead, use the much better C++ idiom of range for
rather than counting.
Also, we learned that the optimizer settings
make a huge difference.
And in fairness, I haven't actually compiled
the two versions with O3 and spent any time
working out whether or not there's some other thing
that changes if we try counting versus the range thing.
I'm not sure that we would, but it would definitely
be worth checking, and of course,
you would benchmark it if it really mattered to you.
And we saw that standard accumulate is identical,
so we should really just be using standard accumulate.
Okay, so that's an example of the kind of things
you can do if you start sitting down and looking at code.
And as I say, you start discovering these amazing things
the compiler does for you.
So the first thing I'm gonna talk
to you about is multiplication.
Again, as I said in the beginning,
these are all slide examples so they're all necessarily
very small and very simple, but they're pretty cool.
So this is what a multiply x by y routine looks like,
passing two things, x and y, edi will have
the first parameter, esi will have the second parameter.
And we need to get the multiplicated, multiplicated,
listed to me, the multiplied version into eax.
And this is what the compiler emits.
But can we do better than that?
Multiplication, if you remember from grade school,
looks something like this, this is a four bit multiplied
done by hand.
I'm sure there are much cleverer things
going inside the processor, but ultimately,
there's a lot of adds happening,
and this is only a four bit multiply,
so there's gonna be four adds in the middle.
So on a Haswell, where you're gonna be running
at least a 32 bit multiply most of the time.
It's a miracle that they can get that to happen
in four CPU cycles.
Now, a CPU cycle is roughly a third of a nanosecond,
so just get that around your head, right,
we're still talking about something which doesn't take
much more than a nanosecond, so it's crazy.
But an add is one cycle.
So maybe there's a faster way of doing this.
Or maybe my clicker will work, there we are.
What happens if we happen to know
that we're multiplying by a constant?
We know it happens all the time.
You might be walking an array or getting
the ith element of an array, and you need
to be able to multiply by the size of the object.
We might reasonably expect it to use that shift
as a shifter where we can shift up by a power of two, sorry,
shift up and therefore multiply by a power two,
so we might expect it would shift it up by one.
Oh, no shift.
It's that funny lea instruction.
Like I said, it's a glorified add,
you have to think of it as an add,
but the cool thing about lea is that
you can have a different destination from the source,
whereas the shift on x86 is in place.
So it avoided an instruction here
just to move things around into the right place
and instead it said okay, add rdi to rdi, great,
and then put the result into eax, we're done, brilliant.
What about some other numbers?
Surely, it's gonna need a shift if we make it for.
Well no, because we can use the lea again.
All right, well, now we know that the lea instruction
can do one, two, four or eight.
So there'll be no surprises in store for us here,
and there actually aren't.
If we go to 16, we're gonna see it using those shifts,
and there you can see the two instructions
that it would otherwise had to have used
if it was just using shifts alone.
Now, again, for anyone who was at Chandler's talk,
you can see how clever that processor is
at moving things around and reordering them,
it's very very clever but it's still worth it
to not have two instructions for your pipeline.
But what happens if we do something like that?
Okay, it's given up on us.
So there's obviously a limitation in the compiler,
compiler writers are not as clever as they make out.
I know that--
Oh, no, wrong one, sorry, I need to do this.
I know that I can build that multiply myself.
I know that 65599 is 65536 plus 64 minus 1, right,
so that kind of, I can build out of those operations,
so I'm gonna do that and I'm gonna make
better code than the compiler, I know best.
Well, that's awkward.
So yeah, it turns out the compiler is smarter than me.
No surprise there.
So not only is it smart enough to know that the multiply
is faster than all the shifts and adds and subtracts
that it would otherwise have to do.
It has worked out that my sequence of shifts and adds
is actually equivalent to a multiply and it said well,
I'm gonna save you from yourself here.
That, of course, is only true on more modern processors.
If we go back to the golden age, let's go i486.
Yeah, you see.
I am still channeling Michael Abrash's
big book of awesome optimizations
and I can still write better code than the compiler.
Well, no, still.
If I turn this into the multiply
that it really should've been all along,
the compiler is actually even smarter
than that sequence of instructions,
I have no idea what it's doing though,
I haven't bothered to work it out.
So the thing that, the answer really is
let the compiler do the things for you,
it's gonna be smarter than you.
And I guess as a sort of secondary thing there
is like telling it what architecture you're targeting
is probably important as well.
I mean, I don't think multiply has been that slow
for a long time, so I think you're safe
on this particular one, but always take a look.
That's multiplication, and multiplication
with a constant as well, so it's a bit of a special case.
What about division?
If you hated doing long multiplication as a kid,
you probably, like me, loathed doing long division.
And processors don't like it any more than you.
So this is the code that comes out if you do
a divide or a modulus.
It turns out that the circuitry that
they use to make a divide also gives you
the remainder at the same time, so the only difference
between these two routines is whether or not
it picks to return eax or edx.
So it's one of those funny cases where the instruction,
the idea of instruction effectively outputs
the two registers that aren't even named
in the instruction, it's crazy.
A 32 bit divide on a Haswell takes 22 to 29 cycles.
I mean, it's still only 10 nanoseconds, okay.
But it's eternity compared to all those movs.
The other thing that I haven't put up on the slide here
is that there is only one divider for each core
on your Intel chip.
And it's not even fully pipelined.
So those multiplies, for a start, there might be
multiple multiplies on the chip, which means you can have
multiple concurrent multiplies going on.
They're also pipelined, which means
that you can start a new multiplier
every cycle and get the result four cycles later,
but least if you have a chain of them going on dependent,
you can kind of get one multiplier result every cycle.
The divider, no chance.
The divider, you're just, it's wedged,
it's gonna be, you're gonna be waiting
for 20 to 30 cycles before you get the result
and no one else can use it.
Can we do better?
Click, there we go.
Well obviously, trivially, we divide by a constant,
and that constant is a power of two,
we're back into shift land, right.
So I have moved over to using unsigned numbers here
just to avoid extra code, but you can do this trick
with signed numbers too.
So now there no magic lea instruction type trick
to save us here, it's just gonna do the shift,
which is cool.
And you can't even see the code
because I scaled everything up
to try and let you see it, okay.
So there it is, x over 2, we're just shifting right,
and you know, no surprise, or we'll do x divided by 16,
it's gonna shift it times by 4 and so on and so forth.
But how often, I mean, in fairness,
how often do you do integer division,
but how often do you do inter-division by a power of two?
Not that often, I'm guessing.
So let's try three, what happens here?
We've lost a divide, there's still no divide
on the right hand side.
We've got some movs and things with
a funny scary looking constant and a multiply.
What's going on here?
Well, compiler cleverness again.
What's happening is that aaaaaaab value
is actually two thirds shifted up into the sort of,
by 2 to the 32.
So like effectively a fixed point two thirds
where the fixed point dot is between 32 and a lower 32,
so it's 32.32.
So if we multiply by it, we get the answer out,
shifted up by 32 bits.
And if we just discard the bottom bits here,
which is what that mov eax is doing,
then what we've got is the answer
with a fixed point multiplied by two thirds.
And then we halve it.
So why are we multiplying by two thirds?
Well, it turns out that, in order to cover
the entire range of an integer,
an unsigned integer at least, there's just one fewer bit
to do it by a third directly, you have to do it
by two thirds and then shift it down.
And there are some very well published algorithms
for determining the exact sequence
of operations you need to do in order to cover
a particular range of value in a particular way.
And I saw some copies of, was it Bit Hacks or whatever,
there's a book, I saw some outside in the library earlier,
oh, that's great.
I've been clicking report problem,
but nothing seems to happen.
So anyway, that's division.
So if we've done division, we ought to look
at what modulus does.
And modulus is effectively just a divide
so that the bit is grayed out at the top there.
It's a bit that's, sorry, we do the divide,
and then we just multiply it back up again by three,
and you'll see it's using the lea instruction here
to multiply it by three by adding it to itself times two,
that's quite clever.
And then it subtracts it from the original number
and you get the remainder.
So why am I banging about moduli, moduluses, modulants?
Well, they're used in hash maps,
which is everyone's favorite container, right?
So if you've got a hash value,
you've probably got a 64 bit number
that's come out of you like doing
whatever crazy operation on your string.
And now you've got your 1,021 buckets in your hash tree
that you need to, sorry, your hash list,
that you need to start indexing from to look for the result.
And the only way to kind of take your 64 bit value
and smoosh it into the number of buckets you have
is to modulus by the number of buckets,
which is great but we've just seen,
it's about the slowest thing you can do.
Now, obviously, if you know ahead of time
that your container is always going to have
exactly 1,021 buckets, then you can write the code
that does mod 1,021, compiler jumps in and goes,
ah, modulus by a constant, I'll generate
that cool code for you.
And you're done.
But of course, most of the time in general purpose code,
we don't know how big our sets and maps are going to be,
unordered ones anyway.
And so there's going to be a dynamic number
of how many buckets I currently have, and that's
just gonna mean there's gonna be a divide in there.
And that's unfortunate because hash maps
are meant to be the fast thing.
Again, we're only talking about, you know,
10 nanoseconds, but these things do make a difference.
In fact, so much of a difference
that libc++, after a certain point,
gives up using prime numbers, which is like
the good way of doing the number of buckets that you've got,
and it starts using powers of two, even though that's
not actually like perfect, but for a sufficiently large
number of buckets, it's probably okay.
And at that point then, it just uses an and trick
to pick the bottom few bits rather than going into a divide.
So obviously, people are thinking about this,
some smart people, I'm looking,
some people are looking at me and nodding at the moment,
so I'm assuming it's them, some people
are thinking about this kind of stuff.
And in the case of say boost multi_index,
there is an even cleverer thing where
they have like a certain set of allowable sizes
of the buckets so they just know that like this,
we have 20 different possible sizes of the index,
each of which is a prime number.
And then instead of just modulusing by the actual size,
they do a switch statement on how big is it at the moment
and they have case 1021 return hash mod 1021.
Or the equivalent, I think it's done by an ordinal value.
Which means that yes, you've got a big switch statement,
and yes, the compiler's gonna have to dispatch
and jump to the right part, but when you get there,
you're just gonna do a couple of adds and a multiply.
It's obviously worth thinking about.
People cleverer than me are thinking about this anyway.
And that actually is relying on the fact that the compiler
is going to do this optimization for them, right,
you wouldn't write it that way
if you didn't know that that's what was gonna happen.
Who went to Phil Nash's talk earlier?
There's a few.
Cool, I thought he was gonna still my thunder on this part.
There are a number of data structures
for which it is convenient to be able to count
the number of set bits in an integer.
Now, who has ever actually had to do that?
Gosh, actually a lot more than I thought,
I mean, Phil's hand's up, of course.
That's surprising, I thought I was gonna give
this sort of example and everyone would look blankly at me,
like, why would we do that?
So this is a way to count the number of set bits.
We're gonna pass in a value a and we'll say,
while there are still bits in a, increment the count.
And then at a &= (a-1), it's like one of
the oldest bit twiddling tricks in the world
for clearing the bottom set bit.
Go around again, once we've got no bits left,
we're done, we return the count.
Let's see what our compiler does with this.
So this is GCC.
GCC has done something quite clever here.
It has got a loop.
You can sort of see the yellow bit is the loop bit there.
And then the red line there, the a &= (a-1),
it's replaced with a blsr instruction.
I'd never seen the blsr instruction before
until I actually prepared for this talk,
and it turns out it's custom made for doing this operation.
It clears the bottom set bit and lets you
kind of put it into another register,
so there is just, you know, clearing the bottom set bit.
That's pretty cool.
So it's picked an instruction which does
exactly what I want, that's super clever.
Can we do better than that though?
I imagine there's some people stroking their chin
at the moment.
Oh yeah, we can do much better.
So turns out there is an x86 instruction
whose only reason to be is to count the number of set bits,
it's so common that people wanted it
that Intel put it into the instruction set.
Now, just have a think about what Clang must've done here.
Like, there's no bearing on the right hand side,
there's no hint there that says I am counting
the number of set bits.
There are some builtin pop count things
that you can do which tell the compiler
I wish to do this, and therefore, the intrinsic
will let you either emit the code that does this
or the instruction if it's supported
on the architecture I've picked.
But I was blown away when this happened, this was amazing.
So all the bits of code that I have
that actually used the pop count instruction
could be replaced with the much much more readable,
human, well, I guess depends on (audio cuts out)
bit manipulation and stuff, but you know,
more readable to me, sensible, straight line code.
That's an awesome achievement.
The last one I'm gonna talk about is, again,
a bit of a toy example.
So standard caveats apply.
And that is this summation, and I've written a routine here
and I've used constexpr because I've realized
that this is all C so far and that makes it C++, right?
So we're gonna sum up to the value x,
and so we're just gonna do the obvious thing
of starting at zero, counting up to it
and adding it to our sum, brilliant.
And then we're gonna like write a little main routine,
it's gonna sum up to 20.
And so we're gonna have a look at what the compiler does,
how does it, oh, oh right, okay.
Well, it's constexpr, right.
Constexpr means it can only possibly happen
in the compiler, right, well, I'm not using it
in the constexpr context, it turns out,
but let's just replace it with static instead
or let's replace it with an error, why don't we do that?
It's still the same, right.
So the compiler can see it, the compiler knows the game
we're up to and it says, I'm gonna spend some time
just working out if this boils down to a number.
Now, this is a stupidly trivial example of this.
In general, you might find this happens
in code snippets that you're pasting into
Compiler Explorer, and the trick for this
is to make sure it depends on something
the compiler doesn't know.
So in this case, the compiler doesn't know
how many arguments they're passing into the function,
so there we are.
Now we can actually see the loop in its glory.
And again, GCC's done a cool thing.
It's unrolled, no it hasn't, I told a lie, sorry,
that's the next bit.
Okay, so it's a pretty straightforward thing here.
I'm not gonna bore you by going through it,
but you can see the shape of it,
and you know, if we wanna really go to town,
we can turn on the O3 again, you can see,
look at how cool, again, I love all this stuff.
I mean, I'm sure, I don't know
how many command line arguments I need to pass
before that's more efficient than
the previous version, I don't know.
But let's have a look at what Clang does again.
Oh, that's interesting, there's no loop.
There's a test in that blue region,
and that test is just saying if it's less than
or equal to zero, then the answer's zero.
And then that yellow block, which apparently corresponds
to sum += i, I don't know how the compiler
came to that conclusion, but it's just a move,
a funny lea and a multiply.
What's going on there?
Well, as most of you probably know,
there is a closed form solution to summation.
Clang has worked out, again, what I'm doing,
and it's replaced a code with a closed form summation,
yeah, that's pretty awesome, right.
Even more interestingly, it hasn't actually used
the thing that I've always used, the x(x + 1) over 2.
It has, in fact, replaced it with x + (x - 1) over 2.
The reason being is that what happens
if we passed an INT_MAX, right?
It would've worked, the loop version would've worked
for INT_MAX, but it wouldn't have worked
if we had to do x(x + 1), we would've overflown,
it would've all gone horribly wrong.
I mean, how you'd get INT_MAX command line arguments
in to a prime function is another problem, but you know,
essentially, they've dealt with the overflow case as well.
And it turns out you can tinker with that code quite a bit
and it will like, move things around and ultimately
just give you a close form solution every time.
Just think about what that's doing, it's taken
an order n piece of code I wrote and turned it into
a linear, sorry, a constant time operation,
there's another hand going up here, hi.
(background noise drowns out audience member)
Right, so the observation was that
overflowing an integer is an undefined operation,
and so why should it compile?
It's because the compiler isn't allowed
to start doing undefined behavior on my behalf,
my x wasn't overflowing.
Yeah, so you can't introduce, you can't add one to x ever,
because for all it knows, the value
was actually intmax beforehand and nowhere in my code
did it overflow the integer, so it can't introduce
We can talk about it afterwards, anyway.
Okay, these are dumb examples, they're all C examples.
The compiler does so much more than this.
I tried to get a slide together that explained
some of my favorite things.
I'm an old school sort of 00, unfashionable kind of a guy.
So I still use the keyword virtual, I'm a bad person,
I know, but compilers are getting much better
at seeing through that too.
They can do static devirtualization
in all the cases where they can prove
that a type is of a particular type
and they can get rid of the virtual function call overhead.
And now they're getting cleverer and cleverer,
like saying I've looked at your entire program
because of LTO and I've said seeing that
there is exactly one implementation of this interface,
and so I'm going to assume that every call
to that interface is to that particular implementation.
It still has to check, so it does a quick check
on the vtable pointer or the actual address
of the function it's gonna call to,
in case you've dlopened something
and loaded another implementation.
But ultimately, it's effectively inlining in the code
and put a big check at the front.
And that's pretty cool, but again, didn't fit it there.
And of course, the compilers doing range analysis
and it's doing constant propagation and it's doing things
I don't understand, I'm not a compiler writer,
I'm just a guy at the other end of the pub line
looking up in awe and wonder going wow, that's so clever.
And I think back to that 20-something me,
who always thought that the C compiler
was basically a glorified macro assembler,
and now I just realise how wrong I was.
The compiler has done an awful lot for you,
you should really be thankful for the hard, hard work
the compiler writers put into making the code efficient
and most often correct too.
All right, so that's my big sort of love story
for compiler writers, and now I'm gonna tell you
a little bit about how G-Compiler Explorer,
I nearly said GCC Explorer again, gosh.
Compiler Explorer works.
This is the second talk of this conference
So yeah, it's written in Node.js,
And I know we sometimes give C++ a bit of a bad rap
for having funny defaults and weird odd edge cases, but
I mean, honestly, you can just Google for it
and you'll find three cases at least
that will make you just, I don't know,
give up on the whole thing, but it's ubiquitous,
it's easy to stop then get up and running quickly,
and once you start a project and
it becomes really successful, it's kind of hard for you
to like go oh, I should really rewrite this
in something else.
And the irony is, the only other sort
of big open source libraries I have, big is relative,
is a web server written in C++, so you'd think
I would've used that, but never mind.
So it runs on Amazon's infrastructure.
Oh, this is what Node.js looks like, it's horrible.
But you know, this is not that much more code
gives you a very primitive version of Compiler Explorer.
You just say hey, I'm gonna get some requests
being posted to me at this URL,
feed them to the compiler, see what it comes out with,
maybe do a bit of primitive text processing
and then give it back.
So straightforward, I wish it was as easy
to do some things like that in C++,
and I'm sure there are people in room,
who can tell me libraries that I should be using
to make it that easy.
Behind the scenes, it all runs on Amazon's infrastructure,
that means EC2, which is their Elastic Compute Cloud.
I have an edge cache, oh, first it's worth saying
that it started out as like on the free tier,
so literally the world's smallest computer
that Amazon would give me for a year for free.
And as demand increased, it got more
and more sophisticated, which is to say
that the onion layers have grown and grown around it
and now it's just a bit of a dev ops nightmare.
But anyway, it's now fully cached at the edge.
So hopefully, when you load the page,
it comes up pretty quickly.
There's a load balancer behind the edge cache.
So when you actually make your post,
it gets fed out to one of the many instances
that I have in the background.
Those instances are VMs that are running Ubuntu.
And within those VMs, (as I told you it's onion layers)
there are Docker images.
So Docker is this sort of very lightweight container thing
which allows you to bundle together data
and sort of run a process in a namespace,
which means it can only see a subset of the computer,
so it's a bit like a VM within a VM,
although much lighter weight.
That gives me a certain amount of protection
from all the crazy code you guys put into the site
not taking down the whole thing.
And also, it gives me a way of like making sure
I can run the exact binaries locally
that were gonna be deployed to the site.
Most of the time, I don't break it when I update the site.
And for a long time, I used to build
all of the compiler images actually
into those Docker images.
And that started getting less fun
when you get 40 or 50 gigabytes worth of compilers
and you try and build a Node-sorry a Docker image for it
and then you have to try and push it to all of the nodes,
and you know, it takes like 10 or 15 minutes
for a node to start up, because it has to pull down
all this data before it can even start.
I solved it with the cheesiest hack in the world,
and that is I just have a big NFS mount
that everything stays on, so it seems to work.
The compilers themselves, as I said, initially,
I started out by apt-get installing them.
So the Docker images kind of looked like little VMs,
and so you would shell into them, and then I'd have
all these commands, I would like sudo apt-get install
gcc-blah blah blah blah or whatever.
That was great, it was very convenient.
But then I was finding that the compilers
I'd originally put on the site for like Ubuntu 11
were being deprecated and they wouldn't install
on GCC 12 or 14 or 16, so I was keeping an container around
just because it had some ancient compiler
that I wanted to keep.
And I wanna make sure that the URLs
that you guys are sharing around are always gonna work,
which is a whole other story, you know,
URLs are forever, forget like any kind of radiation thing,
it's like URLs stick around absolutely forever.
So now they're built through these Docker images,
I've spend quite a bunch of time trying to learn
good practices for building compilers,
and so I've kind of canonified that,
it's all on my website, which I should link towards the end.
And they're built in Docker images just because then
I have reproducible builds, which is something
I'm very passionate about, I want anyone else
to be able to clone this down and be able to build
the same compiler the Compiler Explorer is using.
And more to the point, actually, once I've built them,
I chucked them on S3 and then marked public.
And I keep getting emails from Amazon
because there have a lot of people recently
who's discovered that they have public information
on S3 and they didn't mean to.
Well, I actually mean to, the reason that they're public
is that you can actually run a shell script
and if you have enough hard disk space,
you're gonna get 40 gigs worth of open source compilers
dumped onto your hard disk, which is cool
because then you can run your own local instance
of Compiler Explorer if you don't wanna ship
your code to me, for example.
That's the open source ones.
Very kindly, Intel have provided me with licenses as well
for the ICC- they're somewhere else.
And the Microsoft compilers also currently run via WINE,
which is a bit of an admission of my lack
of understanding of how to administrate Windows
more than anything else.
I'm glad to say I've met up with Andrew
and some of the other Microsoft folks
who are gonna help me and we're hoping
to get some proper support for the Windows compilers
into Compiler Explorer, so that's
a really exciting thing that's coming up.
Thank them, thank them.
And you know, security.
Well, I only got an email this morning, actually,
from someone who thinks they've hacked my site
and maybe they have, I don't know.
Turns out a compiler is a gigantic security hole
waiting to happen, it's a huge attack vector.
Compiler writers are awesome, as I think
I've already said enough times.
They're not really security experts
and they're not really, they don't have to be, right,
they're on a trusted system.
But I had no idea how many ways there were to inject code
into a compiler.
You know, you think a compiler just builds the code,
I'm not actually executing it, right, I'm just.
But GCC has a plugin architecture,
you can supply a -F plugin and point it
at something which is a load as a dynamic library.
So if you can cause the compiler to crash,
but having written the file somewhere,
and then my cleanup code doesn't work immediately,
then you might be able to guess the path of it
and then do another compile with -F plugin equals boo,
and then you're in.
Only happened once.
Also behind the scenes.
GCC uses something called specs files,
which I don't claim to understand,
but it seems to be part of the processes,
the many, many pipelined processes
that comprise the compilation, from the preprocessor
through CC1+ through the AS assembler
and all that kind of stuff, and essentially,
it's a set of shell commands to run.
And of course, you can say, actually,
I'd like you to use my specs file over here.
And again, you're back in the same world where,
as long as you can write a file somewhere on the disk
that says run, I don't know, netcat or something,
then you know, you're in.
I sort of started down a route of trying
to harden everything, and then eventually
I came to the conclusion that there's just no way
I can make it completely bulletproof.
And so I've accepted that the principle of
what's the worst that can happen, it should apply.
And that is, you can hack one of my nodes
and you can take down the Docker instance,
and even if you don't get out of the Docker container,
the worst you can do is stop it from running
and then something else will come and kill the process
and it'll start up again.
So hopefully, that's not too bad.
Even if you escape the Docker sort of CH route,
namespace jail, which I'm told is possible.
You're in a VM which has no privileges to anything.
I'm saying this for where it is being recorded,
someone's gonna email me and say, "Ah, this is good,
"actually, true," but oh well.
So Docker protects me in some ways from this.
And I experimented for a long time using LD_PRELOAD tricks,
which is a Unix specific thing where you can force
a dynamic library to be loaded before an executable is run
and then you can replace some of the system routines,
so I have a little bit of code that basically
replaces open and read and all that kind of stuff
and just has a blacklist and whitelist
of all the different files that you,
that I was allowing the compiler to read,
and that was my way of preventing people from doing like
#include/etc/passwd, ha ha ha.
Which is essentially the kind of thing
that people try first.
Turns out though the compiler needs to read /etc/passwd,
'cause he wants to know which user's running it.
And I thought well, okay, we'll allow that.
And then it was oh, now it's trying to read
/proc/cpuinfo, I'm like, well, that's a bit dodgy.
Why should I let it read the proc file system?
And it's like, well, because people wanna type
mach=native, and how else is the compiler gonna find out
what CPU it's running on?
And so on and so forth, and before I knew it,
everything was basically fair game for the compiler,
I just threw my hands up and said, well, sod it.
So I guess we should talk a little bit
about that frontend, the bit you see.
So another huge props to Microsoft here.
Visual Studio Code's editor is a separate project
that you can just take and plonk in your project,
and it's awesome, it's a fully featured editor.
And it does everything I'd ever want.
That diff view, for example, that diff view
is just a native thing that my code can do.
It had long been on my backlog to have a diff.
Things like bug number three.
And then, when I moved over to Monaco from CodeMirror,
it had this diff feature.
I was like oh, that's cool, and it was just
two lines of code to enable it, it was awesome.
And then the funny sort of drag and drop thing
that I see other people wrestling with,
which is more of a testament to how rubbish I am
at user experience design, is golden layout,
and I'd like to, thanks to both the teams
behind those things, it's awesome that they open sourced
those things, it made it very easy to sort of like
copy paste the website together.
So the code's on GitHub.
There are two repositories, the first one
has all of the Node's code and all of the frontend code,
so effectively, if you have Node on your machine,
you should be able to git clone that
and type make, and then you're done.
You'll have your own local running version
of Compiler Explorer.
The second line there is the image,
which has all of the Amazon stuff,
it has all of the Docker container stuff,
it has all of the compiler scripts, it has everything.
So basically, if I were to be knocked over
by a bus tomorrow, you should be able
to reconstitute Compiler Explorer with what's there.
And if you, oh, I've done that one.
And there will be more about this running in
the next C++ Weekly, Jason's weekly YouTube thingamajig.
So this is a slide I plonked in earlier
so I haven't really thought about what I'm gonna say.
But this is inspired by the conversations
I've had with you, you folks.
And that is, I think you've got a flavor for how
I use Compiler Explorer, why it was made.
And I've been surprised at how differently
everyone else seems to use it.
There are a few people who like to show
cool and cute optimizations the compiler does,
but it has become this sort of defacto code pastebin.
I think Jason alluded to that in the introduction saying
that, you know, it's how we share stuff now,
which is, you know, hugely exciting for me.
And I know that some of the compiler teams
are using it internally to test out things
that they're doing and I did a search on both
the Clang and the GCC bug forums,
and there are over 100 bugs that have citations
to Compiler Explorer links.
And people use it to like sort of bisect quickly, you know,
you can just drop, the drop down has like
so many compilers, you just go and flick up and down here,
oh, it was introduced in 6.2, and that's great.
I've also seen people or heard of people
using it as a kind of REPL.
Which, you know, if you're doing
a bit of template meta programming, you can write tests,
you can write, you know, constexpr codes
that uses it or static asserts.
And you can start like noodling and typing code.
I think this sort of shows the deficiency in our tooling.
It's a nicer experience to type into a website
which posts your code across the Internet
to someone else's machine that builds it and sends it back
than it is to just do it locally.
So I wonder if tools persons in the audience
might have some ideas from that.
And I've also seen people using it
as a training resource, which is also really rewarding,
you know, to be on this point in the opening keynote,
you know, teaching C++ and making sure
people understand how to write this code
and understanding the processes that we go through
when we write code is important, and it's nice that
I feel like Compiler Explorer can help with that.
Quick sneak, sneak?
Yeah, sneak peak, I always get those the wrong way around,
of things that are coming soon.
There is CFG, that is a control flow graph viewer
coming very very soon, in fact,
it's on the beta site or beta site.
So if you go to godbolt.org/beta,
then you'll see that there is a new icon
on the far right hand side of the assembly pane
you can click and drag, and you get like a view
that shows how the different basic blocks interact.
I'm also looking to unify all the languages,
so you've noticed that it's inappropriately,
as many people have said, called gcc.godbolt.org.
It should just be godbolt.org, well,
should really just be compilerexplorer.com,
but you know, no one's gonna let me get away
with renaming it now.
So I'd like to have all the different languages,
you know, there are other languages supported, D,
there's Haskell, there's Swift, there's ISPC.
They're all there, and it would be nice
to have them all in one places so instead of hitting
different URLs, you could actually bring up
two different editors from two different languages, right.
The Go version and the C version of something,
and then see how the assembly works out side by side.
I fear that will lead to some flame wars,
so I'm not taking any responsibility for that.
And then the one thing everybody always asks me about
is executing the code.
I mean, it's fair to say, there are other online compilers
out there and they are very good and you should
check them out as well, there's OneBox
and other ones that escape my mind right now.
But they have, some of those have execution support,
and that's awesome, because it's nice to write the code,
it's even better if you can see what it actually does.
But you've seen my amateur hour devops and security setup,
so I've really gotta get my head around that
before I allow arbitrary code to be run.
But it is coming soon and I'm taking advice from people
who know much more about this than me.
We're at the end now and I just wanna say thanks.
This is not just me.
Compiler Explorer is now, I'm glad to say,
getting a lot of contributions from external people,
so Ruben, Gabriel, Simon Brand, Johan, Jared, Chedy.
And the other people that have both filed issues
or taken the time to contact me directly
and tell me about issues that they found
or have suggestions for the site and,
you know, it's awesome, it's lovely to have something
that has been a labor of love,
that you were scratching your own itch and then find
that actually, it's useful for other people,
and not only that, that they are willing
to help you with it.
Thanks also to the Patreon folks who help pay the bill.
I'm in an embarrassing situation of actually making
a small amount of money at the moment out of this,
which is awkward, so I keep upping the number of instances
just to counteract it, so if you put more money,
then you'll get faster compiles.
Thanks to the awesome C++ community, oh, ooh,
there we go, spoiled the ending.
Thanks to the awesome C++ community, the folks on Slack
have been a great source of inspiration, idea bouncing,
and of course, thanks to you for sitting through
while I talk about my funny little website.
Go and read some assembly, thank you.
We've got time for some questions until
I'm told otherwise, so yeah, hi.
- [Male Audience Member] Can you go back to the slide
where you showed the summation or the equation?
- Oh, you're gonna find a hole in my amateur maths, yeah.
I'm not, I'm, oh, somewhere in here.
I think it would stick out like a sore thumb, right?
There it is.
- Yeah, so the interesting thing is that actually,
the Clang transformed this to have this like
x - 1 'cause it's worried about overflow,
but the multiply can also overflow because
the whole thing cannot overflow, right.
So if you look at the assembly, you will see that
it uses some very weird instructions and actually
in the compiler, it models it as a 33 bits.
- Oh, I see, well, there's somebody
who actually knows what they're talking about.
- [Male Audience Member] Yeah, I hope that 15.6 MSVC
will also have this.
- I don't know, we've got some MSVC folks in here.
I guess on this side, hi.
- I actually had a question about scope.
Other tools that I've used as like a pastebin for code
provide external libraries like Boost or ranges.
Any plans, or is that even in scope for this project?
- Let's have a look, hang on.
Just the other day, some kind soul, oh look,
thank you, Twitter.
Oh, go away.
Never do things live, let's go to the live website.
So we now have a relatively straightforward way
of adding libraries in, and I'm glad to say Abseil
is one of the more recent ones that we put in.
- [Male Audience Member] That is awesome.
- So thank you, Google, heads up on that.
Oh, thank you, I guess, hi.
Okay, so you said you like people
to use it to share code, right?
Any way of making links shorter?
- Well, the link shortening.
So I have a thing.
At least at the moment, I'm not storing any data at all,
I think I should make that clear, I don't care about
your code, my life is far too complicated and short
to read your code, so I don't do anything with it
other than compile it and throw it away.
And I don't even wanna store it long term at the moment
so what actually happens in the URL has everything,
so when we do the long URL, if I were to get a share here.
You've seen it, it's like, it's giant.
So then I use Google's link shortener
to turn it into the smaller version that it is
and I've been reliably, what's that, sorry?
- [Male Audience Member] It fails at some point.
- Yes, it fails at 8,000 characters.
So I'm gonna have to pony up at some point
and actually store your data, I think, in order to give you
the sort of features that people are asking for.
And we've got some ideas about how to do that
and to do it in a way that means that
your data is safe and backed up,
because I never want those URLs to go away,
that's really important to me that people
can still go back to Stack Overflow five years ago
and find some awesome reply to something
and then still have it work.
- [Male Audience Member] Okay, thanks.
- Thank you.
- Regarding that optimization you were talking about
with the hash table where you have a set of fixed choices
of the length of the hash table--
- [Matt] Right, the boost multi_index one, yeah.
- And then you do a switch.
Well, those possible fixed choices aren't right next
to each other, they're all spread apart.
So you're gonna wind up having to do a binary search,
which is a whole bunch of ifs, so you stall the pipeline.
- Yeah, yeah.
I simplified for the purposes, what I believe
it actually does is it just has the set of ordinal values.
So say it has 20 different choices of size of buckets
and it switches between 0 and 20 then.
And it knows that the 9th possible size of buckets
is 1,021, I know I described it as case 1,021,
but it was actually like, it would be switch case 9,
return hash percent 1,021.
Does that make more sense?
- [Male Audience Member] Well, what I don't understand is,
if you have a bunch of cases that are really far apart,
you can't have a jump--
- Right, they wouldn't be far apart, as I said,
they would just be numbers between 0 and say 20
as the 20 possible sizes.
And then, of course, it just uses a jump table
and so you just jumped to the right bit of code.
Yeah, I'm sorry, I oversimplified that, I was thinking
that otherwise I'd spend ages trying to explain it,
and as you've probably realized, I'm not the best.
I guess you were here first, yeah, go.
Do you have any idea to support execution?
So I think that, if you support WebAssembly,
I think it can be supported.
- That's right, so yeah, using WebAssembly.
I mean, it's already been the case that people have asked me
asm.js and then WebAssembly would be another thing.
And so yes, there is a potential for compiling to
WebAssembly, shipping it back to the client and saying,
well, you can play with it as much as you like,
the only computer you're breaking is your own.
That has a certain appeal to it,
but there are many many compilers here
that don't have backends for WebAssembly.
And so, I mean, I guess one thing I've thought about doing,
I mean, if you came to my other talk, you know,
I'm sort of strangely passionate about emulating things.
And if you've ever seen any of my YouTube videos,
I also love talking about microarchitecture
and how the internals of processors work.
In my dream job, I would sit down
with all the pipelines and everything,
and then I could like complete the circle
and have everything together, and then
you could execute it locally and see
how all the caches work and everything, it will be awesome.
But no, at the moment, I think execution
will probably remain on the server side.
- [Male Audience Member] Okay, thank you.
- Thank you.
Oh yeah, go.
- [Male Audience Member] I have one more--
- Of course, go for it.
- I just came here, you know, C++ conference, with
my team members working for the Linux kernel,
and do you have any idea to support the Linux kernel's code?
- I would suggest that grab the code locally,
point it at your own compiler, there are no restrictions
on which files it can pick out when it's local,
it's not running in a Docker container
when you run it locally, so you can do -I
and point it at user, include Linux
or whatever you'd like it do, and that's what I do.
So when I'm using Compiler Explorer at work,
I'll use it locally and I'll do -I in my project path,
and then I can just #include, you know,
my thing that I'm interested in looking at .h
and then write a little test function,
I can see how the assembly's generated.
So I think you could probably use the same approach there.
- [Male Audience Member] Oh, okay, thank you very much.
- You're welcome.
- I appreciate it if you don't wanna get into this,
but how much does this cost?
- Oh, I don't mind telling you at all.
I'm spending about $120 on Amazon costs just straight up,
and then there's, just on like CPUs and stuff.
And then there's like transfer costs and then there's
the other miscellaneous, the bill, but it's, you know,
about 150 on that, and then there's a bunch of other things
that I have like, you know--
- [Male Audience Member] 150 per.
- Month, sorry.
150 a month.
And then there's a bunch of other stuff,
it ends up being of around about $200 a month.
And you know, I dither on what I'm gonna spend the surplus,
so my patrons are currently about 300, which is,
you know, awesome, so I'm like net up,
although if you count--
- [Male Audience Member] Well, it's gonna go up
by 10 or something like that.
It's just gonna keep going up.
- I mean, maybe.
Don't make it anymore awkward than it is.
So yeah, if people have got ideas about how to improve it
that may involve larger sums of money rather than that,
then I'm all ears for adding stuff
that's like a mouse click away particularly.
Oh, I guess now I've lost track of who is next,
so I'm gonna just go alternating.
- Do you think it's possible to support
like LLVM intermediate representations in--
- Yes, absolutely.
- Like I think 'cause it will help LLVM folks,
especially for people who's ramping up on that who like--
- Can anyone remember what, lvo, lve?
Oh, well, there you are, that works, apparently.
So it already is, don't clap, don't clap that,
I can just put command line options in it
and it just works, right.
Yeah, so it's sort of supported only because
it will just dump whatever the compiler gives me,
I'm not being clever here.
But we have got some, I mean, you can see
there's a load of redness in here that isn't that great.
We're gonna be putting in like better support for
syntax highlighting of LLVM, but you can do that already.
And it would be nice to like filter it as well so, you know.
It's gotten more and more sophisticated,
the filtering that I do has gotten more
and more sophisticated 'cause I really just wanna show
the function that you're interested in
and not all the fluff that comes in.
I don't know enough about LLVM intermediate representation
to know how to do that, but I think it'll be something
that'll be coming down the line.
- [Male Audience Member] Yeah, I think it looks pretty cool.
- It also support -e, which is preprocessing, yeah,
which is how people do the horrible like include line,
we do have that over here, you know,
#includes passwd.h, whatever.
Does that answer your question?
- [Male Audience Member] Yeah yeah, thanks.
- Thank you.
All right, over this side now, hi.
- I'll bring this comment on because you mentioned it.
Yeah, you can look it up in Google,
the one I know is in bellard.org.
And it runs an entire Linux, it has a compiler,
it runs everything even somewhat fast.
I don't know how, that guy's crazy.
- That would be fun.
Yeah okay, thank you, that's a great idea, thanks.
I think, last question by the look of it.
- How do you get the association from C++ lines
to assembly lines?
- Oh, magic.
No, I just pass -g, whatever you type in,
there's always a -g in there and luckily,
the assembly output has the dot lock things,
the hint to the assembler to emit the dwarf debug
and I parse that out.
I also support STABs because
there's some ancient compilers that do that as well,
but it's not that sophisticated and
I've filed a few bugs with, both I think GCC and Clang
about like where they decide to plonk which line
is correspondence to what, and because
there's some obvious glaring ones, we're like, come on,
guys, I'm sure that you could attribute this
to the right spot, but we'll see.
Okay, I, oh, is there, no, I think we're done.
Thank you, everyone, thank you so much.
- Bash Films can shoot your even with multiple cameras,
link to presentation slides, add titles
and edit your event live for a full broadcast experience.
- How is this even working?
So this is actually a more interesting program to,
you know, look at in a lot of ways.
So let's profile it.
Do a little bit of time to do a profile for us.
Let's see exactly what it is that's making this faster
or slower based on the different inputs.
We could really gain a lot of insight by actually looking
at the profile like this.
- I worked at Sesame Street, I got brought on
to be a writer's assistant on
a show called Sesame Street English,
which was to teach English to kids in China and Japan.
It seems very simple, the shows that they put together,
but it's actually really hard to design a show
that is not only for young kids, but also the parents.
- Confession like this is therapeutic,
I hope you all get something out of this,
but if you don't, the therapy will have been good for me,
so thank you.
Seven years ago, I was working, I wasn't working
at (mumbles) for my previous employer,
which was a large multinational investment bank.
I had what was up to that point the worst day of my career.
- And then came the anger.
Anger at ourselves, because we knew we were responsible
for America's first space disaster.
We wrote two more words into our vocabularies
as mission controllers, tough and competent.
Tough meaning we will never again shirk
from our responsibilities because we are forever accountable
for what we do, competent, we'll never again
take anything for granted.
We will never stop learning, from now on,
the teams and mission control will be perfect.
Because as a team, we must never fail.
- One other thing.
We're all in a very fortunate position,
we've been very lucky in our lives and so forth.
And I think, as part of the mission,
it's also good sometimes to take that fortune and give back.
To make sure that we take this platform
and use it towards worthy causes.
That's good karma, that's good stuff in the universe.
- We understand that your even will have needs
that are specific to your organization.
Please, email or call us directly to discuss
your particular event.
We look forward to discussing your goals
and helping make your event a success.