
- All right.
Welcome, come in.
Welcome to the Defense Against the Dark Arts class,
where you will learn about the more
radioactive parts of C++.
For this class, I have several goals.
One is just educational.
If you don't know about atomics,
I will introduce them from the beginning
and all the way to where I want you to be.
But there will be things that,
for those who know something about them,
might be interesting.
More importantly, there will be things
for those who think they know something about them.
Okay, so let's start about C++ atomics.
Now, if you're here on a class about atomic operations,
it's probably because you want to learn
something about lock-free programming
or you want to learn something useful
for lock-free programming.
And if you want to learn something useful
for lock-free programming, it's because
lock-free programming is fast, right?
Is everybody here for that?
Okay, so lock-free programming is fast.
And to demonstrate that,
everybody was at Bjarne's keynote, right?
Bjarne said you should start your presentation
with an application.
So I put in the application.
I have two programs that do exactly the same thing,
and they do it correctly.
There are no tricks.
There are no wait loops.
It's straight up computation.
One of the programs is mutex-based,
and the other one actually
is slightly better than lock-free.
It's going to be wait-free.
And this is the speedup of the two programs,
number of threads versus the speedup,
meaning this is a genuine speedup,
not CPU time divided by real time.
This is really how much faster, in real time,
the job gets done.
So the lock-free program is...
did everybody expect that to happen?
Anybody didn't?
Okay, you weren't in my last talk. (laughs)
So this is the code.
Let's see why this is happening.
Okay, everybody saw what's happening.
Pay attention to the legend.
The wait-free program is way down on the bottom.
It doesn't speed up at all.
And the lock-based program, despite using a lock,
speeds up almost linearly.
So this is the code.
What the program actually does, it computes a sum.
The wait-free program uses an atomic sum
and adds to it atomically.
This works.
There's no cheating; this actually works.
And the one with the lock computes its portion
of the sum using a local accumulator
and then, under lock, adds it to the global sum.
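The two programs from the slide are not reproduced here, but a minimal sketch of the two approaches, with names and details of my own, might look like this:

```cpp
#include <atomic>
#include <mutex>
#include <numeric>
#include <vector>

// "Wait-free" version: every addition hammers the shared atomic.
void sum_wait_free(const std::vector<int>& v, std::atomic<long>& total) {
    for (int x : v) total.fetch_add(x);
}

// Mutex version: accumulate locally, then commit once under the lock.
void sum_with_lock(const std::vector<int>& v, long& total, std::mutex& m) {
    long local = std::accumulate(v.begin(), v.end(), 0L);
    std::lock_guard<std::mutex> guard(m);
    total += local;  // one locked operation per thread, not per element
}
```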
Well, I showed you speedup,
but what about the actual real, just the runtime?
The runtime is even worse.
Not only does the wait-free program scale worse,
it also runs slower.
The main conclusion from that is
before you do anything else, look to your algorithms.
Algorithms rule supreme.
This is a general conclusion,
which I like to bring up because people delve
into details of implementation way too soon.
Your algorithm is going to dominate
everything else you can do to the implementation.
Assuming you picked the best algorithm,
now we're into details of lock-free programming.
And one thing we can see is wait-free
actually has nothing to do with time.
And atomic operations, by themselves,
do not guarantee good performance.
And you have to know what you're doing.
There is no substitute for that,
except there is this class.
In one hour, you will be good at it.
So now let's go and understand the atomics.
So first of all, what, in general, is an atomic operation?
It's an operation that executes as a single transaction.
If you have other threads, they will see
the state of the system as it was before the operation
or as it was after the operation,
but not in any intermediate, partially-completed state.
In the context of C++, we're usually talking
about atomics as low-level, individual operations
where there is some hardware assist.
But it's a more general concept.
So databases may have atomic operations on them,
which have thousands of instructions or more.
But the clients are guaranteed to see the database
either before the transaction or after the transaction
but not during the transaction.
So the concept of atomicity scales
all the way from a single instruction
to the whole program.
So why didn't our increment work well with atomics?
Well, let's first, before we...
Is Scott Meyers here?
Okay, I'll steal his quote, then.
Before you learn to crawl on broken glass,
let's learn to crawl.
So
before we learn atomics, let's just look
at your basic, garden-variety increment on two threads.
You have a variable, you increment it by one
in two threads, and what happens?
Well, increment is a read-modify-write operation.
So you're reading on two threads from memory,
you're adding one, you're writing the results to memory.
That's a data race.
That's undefined behavior,
which means anything can happen, technically,
such as the demons could fly out of your nose.
In reality, something like this will happen.
Both threads will read the value,
both will increment, and both will write it.
You will probably lose the result of one of the increments.
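As a sketch, the broken version being described is simply this:

```cpp
#include <thread>

int x = 0;  // plain, non-atomic int

void broken() {
    std::thread t1([] { ++x; });  // read-modify-write
    std::thread t2([] { ++x; });  // same variable, concurrently: a data race
    t1.join();
    t2.join();
    // x may be 2, may be 1; formally, the behavior is undefined
}
```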
Why?
Well, we have to go back to the hardware.
I'm not going to go back in as much detail
as Paul did in his talk.
Fortunately, he did it before me.
So you have your CPUs at the top level.
You have memory way down here.
Both CPUs access the memory,
but they don't do it directly.
There are these levels of caches between the CPUs and the memory.
Let's say three levels.
This is an X86 picture.
The CPU itself directly interacts with the closest cache.
The variable initially sits all the way down in memory,
so how does it get from memory to the cache?
Well, the caches fetch it from memory
and then store it back.
So if we don't have any synchronization whatsoever,
each CPU reads the value from its cache.
It doesn't know that it needs to do anything else.
There is no synchronization here, after all.
It does the increment, writes it back.
Eventually the value will propagate back to main memory.
Well, one of them will.
But they both did increments without knowing
anything about the other, so one of them
just stomped on the other value.
They both wrote back one.
Well, one is only one possibility.
If you have another reader, another CPU,
that one could see zero because it could read X
before any of the incremented values
trickle down from the cache.
Now, on X86 we're spoiled.
In general, nobody said that reads and writes are atomic,
which means you could actually see anything.
Undefined behavior really means undefined behavior.
Okay, so now that I've scared you enough
with this example, how do you safely access data
from multiple threads?
That's where atomics come in.
Before C++11, life was easy.
The answer was what's a thread?
C++11 introduced threads, so it had to do
something about it.
And what it did was std::atomic.
So the syntax, it's a template-like thing.
You give it a type, you make the type atomic,
and you can initialize it.
By the way, notice that it has to be direct initialization.
That's just how they did it.
So increment is now atomic.
Just by wrapping int into std::atomic,
I made the ++ operator atomic.
So now if I do exactly the same code, this works.
And the result is guaranteed to be two,
because one of the threads does the atomic increment first
and sees zero,
and the other one does the atomic increment second
and is guaranteed to see one.
Sorry?
Yeah, X is now atomic int.
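A minimal sketch of the fixed version:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0};  // note the direct initialization

void fixed() {
    std::thread t1([] { ++x; });  // one atomic read-modify-write
    std::thread t2([] { ++x; });
    t1.join();
    t2.join();
    // x is guaranteed to be 2
}
```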
Okay, so what's really going on now
in the hardware, that I changed my non-atomic int
to atomic int?
Well, each CPU does the increment
as a single atomic operation.
There is a read-modify-write atomic, no interruptions.
Okay, what's really going on deeper?
Well, what's really going on deeper
is the core gets exclusive access to the memory.
The value trickles up and down through the caches
if necessary.
Now, that's a simplified picture.
They may instead be talking to each other's caches.
But that's the idea.
Whether the hardware actually implements it
in terms of communicating between different caches
of different CPUs or through the main memory
doesn't matter for our purposes.
It will matter for exactly how slow it runs, but
conceptually, you can't tell the difference.
So this is really how that atomic operation is built up.
And if there is another core accessing it,
it's guaranteed to not see the intermediate state,
as long as everybody does atomics.
Okay, so which types can be made atomic?
Which operations can be made atomic?
What can you do with atomics?
And how fast is atomic?
Well, any trivially copyable type can be made atomic.
Trivially copyable means you can copy it
with memcpy, basically.
So int and double are okay.
That struct of two longs is also okay:
it's trivially copyable, so it can be made atomic.
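For example, assuming a struct name of my own choosing:

```cpp
#include <atomic>

struct TwoLongs { long a; long b; };        // trivially copyable

std::atomic<int>      ai{0};                // fine
std::atomic<double>   ad{0.0};              // fine: loads, stores, exchange
std::atomic<TwoLongs> as{TwoLongs{0, 0}};   // also fine: memcpy-able, so it can be atomic
```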
Okay, what can you do with them?
You can read and write them, assignment operator,
copy constructor, always.
There are some special atomic operations we'll go through.
And depending on the type T of the atomic,
there may be some additional operations.
So here is an atomic integer,
which is initialized with zero.
And there is a bunch of things you can do to it.
One of these is different from all the others.
Which one?
Anybody?
(man talking quietly)
Sorry?
(man talking quietly)
Okay, that one is,
there is one more just like it.
But there is one that is very different.
Sorry?
(man talking quietly)
No.
That one.
There is no atomic multiply.
So C++ wouldn't let you do this.
There is atomic add, two lines above,
but there is no atomic multiply in most hardware systems.
So this actually will not compile.
If it compiled, it would have to be done atomically,
so C++ refuses to let you do that.
Okay, with these two taken out,
there are two more that are not like the others.
And one of them was already named.
And these two,
they both compile.
But they are not atomic, at least not in the sense
that you would expect looking at the expression.
Why?
Well, they are atomic in the sense
that each operation on the atomic variable is atomic.
But look at what operations are actually being done
on the atomic variable.
The first one is atomic increment,
the second one is atomic increment,
the third one is atomic increment.
There is atomic bit set, yes.
It's in the standard, and it's in hardware.
But when you get to the bottom, there is an atomic read,
and then there is an atomic write.
But they're separate transactions.
So we do atomic read from X,
store it in the register, add one to it,
do atomic write of the result.
Any other thread can come in between the read and the write
and modify X, because there are two atomic operations
with a big gap between them.
There is no synchronization, no exclusivity
during this big gap.
There is a read, and it goes onto the register,
something happens to it,
and then there is a write.
The read is atomic, sure.
The write is atomic.
Nobody will see, you know, half.
It's either zero or one.
In this case it's a one or two.
But nobody will,
anybody can change it to seven in the middle,
and then your write just writes it back to one.
So there is this catch with the overloaded operators
that you should be aware of.
For integer types,
std::atomic provides overloaded operators
for all the math operations.
Well, not all.
You can't do multiply.
It doesn't compile if the operation doesn't support it,
which is a good thing.
It will compile if the expression is not atomic,
which is not necessarily a good thing.
It's pretty easy to make mistakes.
So the three increments, ++x,
x += 1, and x = x + 1,
all mean the same thing,
except if the variable is atomic, then they don't.
Now, the compiler knows the difference.
The compiler will not optimize ++x
into x = x + 1.
Do you always know the difference?
So there is a slight danger.
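As a sketch of the difference:

```cpp
#include <atomic>

std::atomic<int> x{0};

void increments() {
    ++x;        // one atomic read-modify-write
    x += 1;     // also one atomic read-modify-write
    x = x + 1;  // an atomic load, then a separate atomic store:
                // another thread can slip in between the two
    // x *= 2;  // does not compile: there is no atomic multiply
}
```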
Okay, so we covered that.
Oh, it's going backwards.
Now, what else can you do with atomic types?
You can do assignments, reads, and writes
for all types that you can put into atomic,
both built-in and user-defined.
For raw pointers, you can do increment and decrement.
For integers, you can do increment,
decrement, and bitwise operations.
Atomic of bool will compile.
There are no special operations for it.
So it reads and writes.
Atomic of double will compile,
reads and writes, no atomic increments
or multiplies or anything of the sort.
Okay, in addition to overloaded operators,
it has a lot of member functions.
There is load and store, which is the same as assignment,
but it's explicit.
Here is a new one, exchange.
Exchange is an atomic swap.
It's a read-modify-write done atomically.
It both reads the old value and stores the new value,
and guarantees that nobody can get a word in in between.
Compare-and-swap, which is a conditional exchange.
So you can say compare-exchange,
and ignore the strong for a moment.
You give it two values: what you expect it to be,
that's Y, and what you want it to be, that's Z.
If it's not what you expect it to be, nothing happens.
It returns false.
If it is what you expect it to be,
it's replaced by what you want it to be.
So if X equals to Y, X becomes Z, true is returned.
Otherwise, you get back the current value of X.
Note that you get it through the first parameter.
The first parameter is a non-const reference.
Returns false; X doesn't change.
This little operation is the key
to most lock-free algorithms.
What's that strong thing?
Hold that thought.
We'll get to that.
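A small sketch of what exchange and compare_exchange_strong do, with placeholder values of my own:

```cpp
#include <atomic>

std::atomic<int> x{0};

void demo() {
    int old = x.exchange(42);  // atomically: old gets the previous value, x becomes 42

    int expected = 42;                                 // the "y": what we think x is
    bool ok = x.compare_exchange_strong(expected, 7);  // the "z": what we want it to be
    // if x was 42: x is now 7 and ok is true
    // otherwise:   x is unchanged, ok is false,
    //              and expected now holds the current value of x
    (void)old; (void)ok;
}
```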
So why is this compare-and-swap
key to most lock-free algorithms?
Well, pretty much every lock-free algorithm
is centered around a loop like the one you see
right here in the second paragraph.
So
we want to increment X.
Of course, we have an increment for that,
but we could also do a multiply,
for which we don't have an atomic operation.
So we say, I expect X to not have changed.
So first of all, I'll read the atomic value,
store it in a local X0.
I hope nobody got to X before me.
X hasn't changed.
If that's true, I'm going to change it
atomically to the new value that I want to be.
That could be incremented or multiplied.
And if nobody else changed X,
then I did my increment atomically.
If somebody did change X, compare-exchange fails,
returns the new value into, whoa.
Something wasn't atomic.
(laughing)
Returns the new value into X0
so I don't have to do the read again.
And I go in the next iteration of the loop.
And I keep trying until my compare-and-swap
beats everybody else's compare-and-swap
and gets that increment in.
Lock-free, not wait-free.
So that's what's special about compare-and-swap.
Some lock-free algorithms can be done
with simpler instructions, but this one
allows you to do anything.
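Here is a sketch of that loop, doing an atomic multiply out of compare-and-swap:

```cpp
#include <atomic>

std::atomic<unsigned long> x{1};

// Lock-free, not wait-free: keep retrying until our update wins.
void atomic_multiply(unsigned long by) {
    unsigned long x0 = x.load();                      // read the current value
    while (!x.compare_exchange_strong(x0, x0 * by)) {
        // failure: x0 has been refreshed with the current value, so just retry
    }
}
```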
Okay.
In addition to the operator overloads like plus equals,
there are member functions.
Fetch-add is the equivalent of plus equal.
If you just call fetch_add, it adds.
What's that fetch part?
Well, it doesn't just add atomically.
So it increments atomically,
which is a read-modify-write,
but it also returns you the old value.
That's the fetch.
So it returns you the old value
and adds the increment, all of it atomically.
Well, actually, I'm sorry, it returns the old value.
If you want the new value,
you have to add the increment yourself,
but you know what it would be.
So yes, it returns the old value.
Same thing for subtract and the bitwise,
and, or, xor.
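For example:

```cpp
#include <atomic>

std::atomic<int> x{5};

void fetch_ops() {
    int old = x.fetch_add(3);  // x becomes 8; old is 5, the previous value
    int now = old + 3;         // if you want the new value, add it yourself
    x.fetch_sub(1);            // likewise fetch_and, fetch_or, fetch_xor
    (void)now;
}
```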
This is obviously more verbose
than your overloaded operators,
but it's also less error-prone.
You're less likely to write a complex expression
and not notice that you're breaking up
an individual atomic operation,
that the expression is not atomic.
If you have multiple of these function calls,
you hopefully would see that you
have multiple function calls, each one is atomic,
the composite is not.
So overloaded operators make writing code easier.
Unfortunately, they also make making mistakes easier.
As a programmer, you speak to two different audiences.
You speak to the machines,
and you speak to other programmers.
The compilers always understand what you mean.
It's not necessarily what you meant to say,
but they always understand what you actually said.
Speaking to other programmers,
that's the more difficult part,
in a way that makes you understood.
So we're going to focus on that.
The member functions help because they call attention,
something weird is going on here.
They highlight what's being done atomically.
They may be longer to write,
but they focus attention of the reader at the right place.
And programmers tend to read
what they think is going on,
not what's really going on.
x = x + 1: I know what's going on.
It increments by one.
Well, no.
Okay, now, before we talk to the programmers,
the whole point of this is doing things fast.
So how fast are atomic operations?
How fast are atomic operations?
The first rule of performance is what?
Always measure the performance.
There are two caveats that come in with atomics.
You should still always measure the performance,
but you should be careful when deducing
from the measurements what the hardware
is actually doing and what it's supposed to be doing,
because not all hardware implements all of the atomics
the same way.
So you can get some results that, for example,
tell you that this atomic operation
and that atomic operation run at exactly the same speed.
That may be the artifact of this particular hardware.
And we'll see some of that.
Also, if you're comparing atomics with non-atomics,
you're not really comparing the same thing.
Atomic operation does more.
You wouldn't use atomic if you could get away
with non-atomic.
That being said, let's see what the cost is.
Okay, who expects atomic operations to be a lot slower?
By show of hands?
Who expects to be about the same speed?
All right, here it is.
That's X86.
At the top of the line, there is a read and atomic read,
and atomic read is marginally slower.
The next one is non-atomic write,
and atomic write is way down there.
And the light blue in the middle is non-atomic increment,
and atomic increment is way down there with atomic write.
Now, as I said, it's not entirely fair
because if I could use non-atomic increment,
I wouldn't use atomic increment, right?
So let's compare.
Let's now compare with a lock.
Atomics and locks do generally the same thing.
They provide a thread-safe way of doing these operations.
Let's focus on increment.
So here is atomic increment, and here is mutex.
That's a normal C++ mutex.
Quite a bit slower, so that's good.
There is more to locks than mutexes.
I didn't say which lock.
So let me add another lock, which is a spin lock
that I wrote and tuned for this machine.
Spin lock and atomic run at practically the same speed.
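The tuned spin lock from the talk is not shown, but a minimal spin lock along the same lines might look like this sketch:

```cpp
#include <atomic>

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;  // clear means unlocked
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin until the current holder clears the flag
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};
```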
This machine, what's this machine,
and why am I going to 128 threads?
Well, it's a 120-core Broadwell server.
At this point, anybody wants to say something?
I know somebody wants to say something.
Come on, don't be shy?
Anybody want to accuse me of cheating?
Okay, Fedor, you're cheating!
Nobody has 120 cores.
Everybody knows that spin lock is a lot slower
if you run more threads than you have cores.
Everybody knows that?
Okay, this was done on my desktop,
at home.
Well, anybody doesn't believe that?
What's wrong with this audience.
Everybody believes that.
Maybe four cores are still too many.
This laptop only has two.
I'm going to run that spin lock test.
This is a Google benchmark, runs the increment
of a 64-bit integer,
up to 1024 threads, with atomic and with a spin lock.
The number on the right is how many increments
per second it can process.
The spin lock starts off a little slower, but it catches up
at a large number of threads.
So this is pretty generic.
Now, this is just in the benchmark configuration.
In your application, you would see variations,
depending on what else is going on with the cache,
what else is going on with the memory.
Spin lock may be faster, may be slower.
You have to measure in context.
But basically, don't be sure that lock-free
is necessarily faster.
Remember we talked about compare-and-swap.
So I can do increment with compare-and-swap.
I've shown you on one of the slides
how to increment with compare-and-swap.
Compare-and-swap would go somewhere
between the mutex and atomic increment.
In addition to the fact that atomics
aren't necessarily the fastest thing in the universe,
there is a huge secret they're hiding.
They aren't even always lock-free.
So here are three atomic types:
a struct with one long, two longs, and three longs.
And because they're all trivially copyable,
I can put them all into an atomic.
Which of them are lock-free?
A is lock-free.
(man talking quietly)
B might be okay.
X86 is the platform, let's say.
I'm compiling for X86.
Okay, yes, so
A is definitely lock-free.
C is definitely not.
And B may be.
Fortunately, there is a way to find out.
And there is a member function, is-lock-free,
that you can call.
And it, at runtime, will tell you
if this variable is lock-free or not.
Why at runtime?
Why not at compile time?
Well, C++17 adds a compile-time constant
that is called is_always_lock_free,
but it's a weaker guarantee.
So you could get is_always_lock_free false,
which means sometimes it is lock-free and sometimes it isn't.
And at runtime, it will tell you
for this particular variable if it is or it isn't lock-free.
And the reason is mostly alignment.
Depending on alignment,
the variable may or may not be lock-free,
cause some platforms have alignment requirements
for atomics.
So if you have, for example, a 16-byte type
like this struct B, it will only be lock-free
if it has 16-byte alignment.
So on X86, the two longs is indeed lock-free
if it has 16-byte alignment,
because the 16-byte move into an XMM register is atomic.
And if you look at the assembly code
generated by the compiler, you'll see
that that's what the compiler does
if you ask it to load that atomically.
The last one is greater than 16 bytes.
There are no atomic instructions on X86,
at least on the processor that I used,
and is-lock-free returns false.
What about the middle two, the C and D?
They are the same size.
Are they atomic?
Who thinks that C and D are atomic?
Lock-free, I mean.
They can be both atomic.
Who thinks C and D are lock-free?
Who thinks they're not lock-free?
Who thinks it's a trick question?
(laughs)
C is actually lock-free because it's not really 12 bytes.
It's 16.
And the same atomic move into an XMM register handles it.
D is indeed 12 bytes, and it's not lock-free.
Padding matters, in addition to alignment.
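The exact structs from the slides are not reproduced here; a sketch along those lines, with sizes assumed for a 64-bit X86 build, is:

```cpp
#include <atomic>
#include <cstdio>

struct A { long x; };        // 8 bytes
struct B { long x, y; };     // 16 bytes
struct C { long x, y, z; };  // 24 bytes
struct D { int x, y, z; };   // 12 bytes of data, possibly padded to 16

int main() {
    std::atomic<A> a; std::atomic<B> b; std::atomic<C> c; std::atomic<D> d;
    std::printf("A:%d B:%d C:%d D:%d\n",
                (int)a.is_lock_free(),   // yes on x86-64
                (int)b.is_lock_free(),   // maybe: needs 16-byte alignment and support
                (int)c.is_lock_free(),   // no: larger than 16 bytes
                (int)d.is_lock_free());  // depends on padding and alignment
    // C++17 also has the compile-time constant std::atomic<T>::is_always_lock_free
}
```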
Now, atomic operations are lock-free,
maybe even wait-free.
It doesn't mean they don't wait on each other.
So let's compare atomic increment of an atomic variable
versus incrementing two atomic variables that are different.
So one thread has its own, the other thread has its own.
This way they don't have to wait on each other.
And let's run the benchmark for different number of threads.
And it's basically the same thing.
There is a slight blip at over 16 threads,
but basically there is no difference.
Can we conclude that atomic operations
don't wait on each other from this?
Not necessarily.
What's going on here is what's called cache line sharing.
So the two operations are in the same cache line.
When I said that the variable trickles up and down
through caches from main memory
to the CPU and back, I lied to you.
It's not the variable that trickles up and down.
It's the whole cache line on X86, 64 bytes.
You want one variable from that cache line,
the entire 64-byte chunk will go up
through the cache and down.
And if two different CPUs want two different variables
that are within the same 64 bytes,
they have to wait as if it was the same variable
because you don't get a lower granularity
on cache access than 64 bytes on X86, again.
So exclusive access is granted to this entire cache line.
And that's the reason, remember this graph
from the beginning, the wait-free program that didn't scale,
well, because they do wait on each other.
So if I put them on separate cache lines,
then there is no problem.
Then they don't have to wait on anything.
So atomic operations wait on each other,
in particular the writes.
The reads don't.
So here it is.
I can prove it with a benchmark.
The top one, the one that scales, is truly not shared.
These two variables are now 64 bytes apart or more.
They're on different cache lines.
And the two ones on the bottom,
one of them is truly shared, meaning I'm hammering
at the same atomic variable, and the other one
is false shared, meaning they're different atomic variables,
but they are within the same cache line.
Now, why is there a blip at 16 threads?
Because at 16 threads I run out of cache line,
I run out of 64 bytes.
I can't give an eight-byte variable
to each thread anymore, so
some of them are falsely shared and some of them
are actually not shared at all.
So that's why it starts to pick up.
So atomic operations do wait on a cache line access.
And that's what you pay for data sharing without races.
If you don't need to access shared data,
make sure you're not accessing the same cache line.
On more complex hardware, it may be more than that.
May be even memory pages on newer machines.
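A sketch of the fix, using alignment to keep each counter on its own cache line (64 bytes assumed, as on X86):

```cpp
#include <atomic>

// False sharing: two counters in the same cache line make writers wait
// on each other even though they never touch the same variable.
struct FalseShared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Keeping each counter on its own 64-byte cache line avoids that.
struct NotShared {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};
```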
Okay.
We did see the compare-exchange-strong before.
You know, there is also in C++ standard
compare-exchange-weak.
And the standard says that compare-exchange-weak
can spuriously fail, and even when the current value
is the same as the expected value,
it can still return false
and fail to do the exchange.
Everybody knows the difference?
Okay, who thinks they know why there are two
and what the difference is?
Well, yes, it depends on hardware.
And the most common explanation that I read
is something along the lines of
you have these different cores,
and they have to communicate to each other,
and it's kind of like timing out on a socket.
You wait, and if you couldn't get the answer
from the other core in some timeout,
you just return false.
Okay, is that consistent with what you heard?
And I read that, and I thought I understood it.
And then I thought about it some more,
and I realized I don't understand it.
So
what does it return when the failure happens,
when the false failure happens?
Well, it returns the current value of X.
It still has to return the correct current value of X
because that's what you're going to start
on the next iteration.
It's actually going to return
the value that you expect, because that's the definition
of the false failure.
The value was there, the expected value,
but it didn't do the swap.
That's what weak means.
So it knew that the current value was the expected value.
If it knew that, why didn't it do swap?
And if it wasn't sure that the current value
was the expected value, how can it return
the value that
may not be there, cause what are you going to do
on the next iteration?
So I had to do some more figuring out.
Is it worth explaining?
Okay, so this is how compare-and-swap
might be done in hardware.
This is pseudocode.
So
the lock here means we get some sort
of exclusive access, whatever that means in that hardware.
So we get the exclusive access to the variable,
we read the value.
If it's the expected value, we swap it and return true.
If it's not the expected value, we just return false.
Okay.
Now, remember our benchmark.
Reading shared variables is a lot faster
than writing into them.
So we could do, at the hardware level,
again, this is pseudocode, an optimization.
First, we read the current value.
This is faster than writing.
If the value isn't what's expected, we're done.
We're just returning false.
If the value is what's expected,
we have to now get the exclusive access.
Oh, by the way, before we got the exclusive access,
the value may have changed.
We have to read it again.
Check that it hasn't changed,
and if it hasn't changed, now we can do the swap.
That's the double-checked locking pattern, back for you
from the grave.
Only here, it's the hardware actually doing it.
Well, what could happen is exclusive access
on the particular platform may be hard to get.
Read isn't, but the exclusive access is.
So instead of doing a lock, you may do a timed lock.
So you did the read, and if the value you got,
the current value isn't what's expected,
you just return false right away.
So the value, current value, is what's expected.
You're going to try to do a timed lock.
If you timed out on the lock at this point,
well, you didn't do the swap, you timed out on the lock,
you returned false.
That's a spurious failure.
So the result of the read was correct.
There is really, the expected value of X is there.
So you didn't time out on the read.
You got the read.
You timed out on trying to get the exclusive lock.
So the hardware platforms that play games like this
with compare-and-swap, they may have a weak
compare-and-swap.
X86, by the way, doesn't.
X86 has a straight, brute-force,
lock it and hold it until we're done compare-and-swap.
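A sketch of how the choice usually plays out in code:

```cpp
#include <atomic>

std::atomic<int> x{0};

void cas_variants() {
    int expected = 0;
    // One-shot attempt: a spurious failure would look like a real one,
    // so use the strong form here.
    bool claimed = x.compare_exchange_strong(expected, 1);

    // Retry loop: a spurious failure only costs one more iteration,
    // so the weak form is fine, and may be cheaper on some platforms.
    expected = x.load();
    while (!x.compare_exchange_weak(expected, expected + 1)) {
        // expected has been refreshed; try again
    }
    (void)claimed;
}
```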
Now, this is all well, except it actually
doesn't really let you do anything.
So it's good to understand,
but it's not enough by a mile.
And the reason is you very rarely use
only atomic operations across your program.
You never use only atomic operations across your program
unless you have just one variable
in your program, and that one is atomic.
Typically you have some atomic operations,
and most of your variables are non-atomic.
So let's look, for example, at atomic queue.
Here is a lock-free queue implemented with an atomic.
There is an array of slots that are not atomic.
They're just plain integers, queue of integers.
There is an atomic variable for front index.
And what we are going to do is,
in order to push on the queue,
we're going to atomically increment the
index of the front slot.
Let's not worry about multiple writers yet.
So I'm going to atomically increment,
which means it's atomic operation.
Nobody else can see the intermediate value.
So if somebody incremented before me,
I would see the incremented value.
So let's say it starts from zero.
And I do atomic increment.
If I got back one, it means I own slot number zero.
Nobody else can get to slot number zero,
because it's atomic.
If somebody got before me, I wouldn't get back zero.
I wouldn't get zero plus one.
I would get one plus one, two.
If I got to the zero first, nobody else will see zero,
never again.
They'll see one, what I put in.
That's what atomic means.
So now, as a writer, I can store my integer
into my slot anytime I want.
Now, we haven't talked about
how to synchronize with the readers,
because the readers have to know
when you're finished writing.
Let's ignore that problem for now.
The point here is that my atomic front
is used as an index to a non-atomic memory.
And that's how pretty much all lock-free
data structures are.
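A minimal sketch of that idea, ignoring the readers and running out of slots, with names of my own:

```cpp
#include <atomic>

class Queue {
    int slots[1024];             // plain, non-atomic storage
    std::atomic<int> front{0};   // the atomic front index
public:
    void push(int value) {
        int my_slot = front.fetch_add(1);  // claim a slot; nobody else can get it
        slots[my_slot] = value;            // plain write into memory I now own
    }
};
```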
It may not be an index;
it may be a pointer.
Here is a lock-free list.
It has an atomic pointer to the head of the list.
How do you push a new node onto the list?
You create a node on the side.
Nobody can see it.
There is nothing to worry about.
You get the current head.
Again, nothing to worry about.
You put the current head as the next pointer
on your new node.
Nobody can see your new node.
There are no race conditions here.
Now you're going to do an atomic swap,
except you have to worry about the possibility
that somebody else did the atomic swap before you,
and the head is no longer the head.
Okay, so you're going to do conditional swap,
compare-and-swap.
If nobody changed the head, you're going to enqueue,
to store your new node as the head of the list
and atomically replace the head pointer.
If somebody did change the head pointer,
you get the new head pointer back,
and you loop all over again.
The important point is that the atomic node star pointer,
the head, points to non-atomic memory of the list.
And that's how all lock-free algorithms are.
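A sketch of that push, assuming a trivially simple node type:

```cpp
#include <atomic>

struct Node { int value; Node* next; };

std::atomic<Node*> head{nullptr};

void push_front(int v) {
    Node* n = new Node{v, head.load()};  // build the new node on the side
    // If head is still what we stored in n->next, swing it to n; on failure,
    // n->next is refreshed with the current head and we try again.
    while (!head.compare_exchange_weak(n->next, n)) {
        // loop until our swap wins
    }
}
```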
Okay, so basically you have this.
You have atomic pointer that points to memory location,
and you grab it into your exclusive use
and swap atomic pointer to some other memory location.
Or you have some new memory location
that you want to reveal to the world.
You swap some atomic pointer to point
to your new memory location, atomically.
But the memory itself is not atomic.
So what guarantees do we have
that the non-atomic memory is still read correctly,
without race conditions, by all the other threads?
We spoke about atomic operations on atomic variables.
We have guarantees for that.
Atomic read, atomic write, atomic swap,
that's all atomic.
But that's on the atomic variable itself.
What guarantees do we have for the memory?
And we need these guarantees,
we need guarantees of the sort,
like if I'm publishing a new node,
all other threads must see that memory
of the new node in its final state.
I'm done preparing it, I did the atomic swap.
I must have a guarantee that my non-atomic writes
to the list node have completed
and become visible to everybody else.
Otherwise, this atomic pointer is worthless.
Now, that's done by the memory barriers.
And memory barriers really go hand-in-hand with atomics.
You really can't understand atomics
if you don't understand memory barriers.
So let me show you the memory barriers.
Memory barriers control how changes to memory
made by one of the CPUs become visible to the other CPUs.
And the need for them is because if you don't have
any memory barriers or anything equivalent to them,
you have no guarantee of visibility.
By default, there is no guarantee of visibility whatsoever.
You have two CPUs.
They are both modifying their caches here and here.
Main memory doesn't have to change at all.
There's no guarantees that anybody can see anything.
That's not how X86 works, but that's how
some systems work.
So it's a global control of visibility
across multiple CPUs.
Has to be supported in the hardware.
In practice, it's often not a special instruction,
although it can be.
It's often an attribute on some other instruction.
Okay, thank you.
So, again, C++03, life was very simple.
No memory barriers.
C++11 has memory barriers, and in the standard,
the concept that they use is memory order.
The two things are closely related.
So the way it looks in the language is
like what you see here in the last line.
Remember, I told you before that there is
assignment operator.
And it's equivalent to a store function.
Well, I simplified things for you.
I lied.
Store function is more complex than that.
And a store function has the second argument
that is the memory barrier, or the memory order.
So if I say, store one with the memory-order-release,
it means that I put a release-memory-barrier on that store.
What's the release-memory-barrier?
Well, let's start with a simple one.
What's no memory barrier?
No memory barrier means I can reorder
reads and writes any way I want.
So here is my atomic X in the middle,
and I have A, B, and C, some non-atomic variables.
In the program order, I write to A,
then I write to B, then I write to C, then I write to X.
In the actual order, anything is possible,
no restrictions whatsoever.
And that's, in C++ standard, called memory-order-relaxed.
There are no guarantees here.
Acquire, that's the first barrier with some guarantees.
Well, there is also consume, but let's not talk about that.
So acquire, acquire is what's called a half barrier
sometimes, guarantees that all the memory operations
that are scheduled after the barrier in the program order
become visible after the barrier.
Now, all operations means both reads and writes,
not just reads or just writes.
And all operations means not just operations
on that atomic variable, but all reads and all writes.
So what does it look like?
Well, here is my
read, load, on the atomic variable,
with memory-order-acquire.
And before it and after it in program order,
I have some other operations on non-atomic variables.
Some reordering is allowed, and some is not.
Nothing that was after the load can move
in front of it.
Anything that was before can move after.
That's the acquire barrier.
Makes sense that the release barrier is the reverse.
Nothing that was before release barrier
in program order can be observed after that store operation.
But things that were done after the store
can be observed before.
That's the second half barrier.
Now, they're often used together,
this acquire-release protocol.
Why are they often used together?
Well, think about it.
So you have two threads, and the first one
writes the atomic variable with the release barrier.
It guarantees that all the writes done before
become visible.
But the second thread has to also use a barrier.
If the second thread doesn't use a barrier,
the guarantees don't apply.
The second thread reads it with acquire barrier.
All the reads, now I have the guarantee on the acquire side.
All the reads that are done after the barrier
in program order have to be done after the barrier
in actual real execution order.
So now all the writes that were done before the barrier
have become visible.
And all the reads that were done after the barrier
are reading that new, updated memory.
Has to be the same atomic variable.
That's very important.
So the language acquire and release here
has to do with this concept of, I have some data,
I prepared it, it's my private data.
Finally, I am ready to publish it, I release it
for general consumption.
Another thread wants to sync to make sure
what I released is what it's going to read.
It acquires the data.
It acquires the atomic variable
and using acquire fence, it's guaranteed
that it will acquire the data that I published.
So here is how acquire-release protocol works.
I have a release-store, and I have some data
that is all written before the release-store.
That data is published to the reading thread
that does the acquire-load and reads it
after the acquire-load.
And I'm guaranteed, even though my A, B, and C
are not atomic, only X is atomic,
I have the guarantee that the memory I published
will be the memory the other thread will see.
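A sketch of the protocol, with variable names of my own:

```cpp
#include <atomic>

int a, b, c;                 // plain, non-atomic data
std::atomic<int> ready{0};

void producer() {
    a = 1; b = 2; c = 3;                        // prepare the data
    ready.store(1, std::memory_order_release);  // publish: writes above cannot move below
}

void consumer() {
    while (ready.load(std::memory_order_acquire) != 1) {
        // reads below cannot move above this load
    }
    int sum = a + b + c;  // guaranteed to see 1, 2, 3
    (void)sum;
}
```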
Okay, let's skip the locks,
cause actually, in the previous talk,
they covered the locks.
C++ has two more barriers: acquire-release,
it's just acquire and release in one place,
bidirectional barrier, but only works if
both threads use the same atomic variable.
The total sequential consistency barrier,
the highest, the most strict barrier,
that removes that requirement of one atomic variable.
Now if you have sequentially consistent operations
on two different atomic variables,
you now have the ordering guaranteed between them.
Compare-and-swap is the only member function
that actually has two memory order arguments.
Why two?
Well, remember that double-check locking implementation.
You have a load, and after load it may decide
to just bail out with a false.
And there is a memory order on that.
But if after load it got the expected value,
then it has to get the exclusive access and do a swap.
And the swap is another memory transaction,
which has its own memory order.
So I could have two different memory orders
on compare-and-swap,
the one for just reading it
and the one for actually writing it.
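For example, as a sketch:

```cpp
#include <atomic>

std::atomic<int> x{0};

void two_orders() {
    int expected = 0;
    x.compare_exchange_strong(expected, 1,
                              std::memory_order_acq_rel,   // used if the swap happens
                              std::memory_order_acquire);  // used if it only reads and fails
}
```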
If you don't specify anything,
what's the default memory order?
Well, the answer is the strongest memory order.
They really guard you from accidentally making mistakes.
That's a good thing,
except for
the strongest memory order can be really expensive.
Again, remember we're talking to two audiences here,
to the hardware and to the people.
The strongest memory order may not be what you want to say,
and it may not be what you need for best performance.
Well, let's deal with the performance first,
because you are all here to write
the fastest code possible.
So the top line is a non-atomic write
or atomic relaxed write.
They're basically running at exactly the same speed.
The bottom line, one and a half orders of magnitude
slower, is the sequential consistency write,
which incidentally happens to be
what you get when you write assignment operator
on an atomic.
So if you didn't need the sequential consistency order,
you're paying for being lazy,
for not writing store memory-order-relaxed,
for just writing assignment, one character,
instead of the whole big member function.
You're losing about 15 times in performance.
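In code, the two spellings look like this:

```cpp
#include <atomic>

std::atomic<int> x{0};

void stores() {
    x = 1;                                  // sequentially consistent store: the default, slow
    x.store(1, std::memory_order_relaxed);  // same write, no ordering guarantees, much cheaper
}
```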
So memory barriers are expensive.
How expensive?
That varies from platform to platform.
On ARM, really expensive.
On X86, all loads are acquire-loads,
so acquire barriers are free.
All stores are release-stores, so those are free.
But the other barrier, the release on read
or the acquire on write, are really expensive.
That's what we saw on the previous slide.
All read-modify-write operations like exchange
have bidirectional barriers,
and there's no difference between acquire-release
and sequential consistency barriers, same thing on X86.
Doesn't mean that's always the case.
Even if you're writing for X86,
it's very important to say what you mean.
Why is it important to say what you mean?
Lock-free code is really hard to write.
It's hard to write if you don't mind occasional bugs.
It's really hard to write if you do.
It's hard to read.
It's hard to read for other people.
It's hard to read for yourself.
And why would you read your own code?
Well, to reason that you have done it correctly.
And memory order specification expresses
the intent of what you want it to do.
So let's look at how this works.
You wrote count.fetch_add(1, std::memory_order_relaxed).
What did you want to say by this?
At least, as a reader, what do I infer
when I read this code?
Count is incremented concurrently but not used.
Why incremented concurrently?
Because you made it atomic.
If it wasn't incremented concurrently,
you wouldn't have made it atomic.
But it's not used to index any memory.
It's just a count.
You want to know how many times something happened.
But you're not using it to index an array.
How do I know that?
Because you didn't put a memory barrier on it.
Now, fetch-add on X86 is actually
memory order acquire-release.
So you might be tempted to cut corners.
Don't.
First of all, it confuses your readers, including yourself.
Second, the compilers, although currently
I'm not aware of any compiler that does
such an optimization, may, as a reader of your program,
realize that hey, you said memory-order-relaxed.
I can reorder things across this.
It wouldn't be forbidden.
It's difficult to do, for the compiler writers.
It wouldn't be forbidden.
So do write what you mean to say.
Okay, let's look at something else.
Same count, but now I'm saying memory-order-release.
As a reader of your code, what do I infer from this?
Now, as opposed to previous slide,
now I'm saying your count indexes some memory.
And that memory was prepared for my consumption
prior to your increment.
So you did something to that memory, you did some writes,
you did some initialization.
And then you released it to me with this add.
And in order to properly consume that memory,
I have to load with an acquire barrier.
That's what I infer when I read this code.
Okay, what if you wrote this?
Well, there are two possibilities.
One, there is more than one atomic variable here,
so you needed sequential consistency,
because that's the default memory barrier,
or memory order, actually.
Sequential consistency is not a memory barrier,
it's a memory order.
So you needed sequential consistency memory order,
which means you have multiple atomic variables in play.
Thank you.
There is another possibility.
And what's that?
You didn't think about what you were doing.
And that is unfortunately more likely.
Or you had a bug and you couldn't figure out why,
and you kept adding memory order until the bug went away.
Now,
if there is one thing you take away from this class,
please, please, please
don't take away this.
Sequential consistency's fine.
It makes your programs much easier to understand,
much easier to reason about.
It's just not always necessary
to make every atomic operation memory order
sequential consistent in order to get sequential
consistency in your code.
Consider that locks, lock-based programs,
usually are sequentially consistent,
because they can be.
And they actually use acquire and release barriers.
Acquiring a lock has acquire barrier,
releasing lock has a release barrier.
There is no sequential consistency memory order in the lock.
Now, it wouldn't be a talk on C++
if I hadn't complained at least once
about the C++ standard.
It just isn't done.
So let's say you wrote this.
You have a class with an atomic data member,
and in the destructor, you need to know
how many of something you need to clean up.
And you're doing a load with memory-order-relaxed
to get that out of the atomic variable.
What did you just say to me by writing this code?
You said, be afraid.
Why did you say that?
Because somebody else is reading that atomic variable
at the same time on a different thread,
while the object is being destroyed.
That's what you said.
That's why you're doing an atomic read.
Well, of course there is another possibility.
You couldn't read it any other way.
You really wanted to say a non-atomic read,
and there isn't one.
Okay, so with that mandatory gripe about the standard
out of the way, what do we know about atomics?
Well, I covered what the syntax is,
what the member function and operators are,
what the performance is,
the connection with memory barriers,
atomics and memory barriers really two sides
of the same coin.
If you are reaching out for atomics
for lock-free programming, you're doing it
to get the best possible performance.
I've shown you some benchmarks against a spin lock.
That doesn't mean
that you will always fail to get better performance
with lock-free.
That's not true.
You will often get better performance.
But you have to work for it.
You have to pull out all the stops.
You almost always have to actually go
and use the right memory barriers.
You can't afford the default memory barriers
if you want maximum performance.
You have to think about what you need.
You have to use the right memory order
on every atomic operation.
Sometimes it will be sequential consistency,
and often it won't be.
It affects the performance of your code,
it affects the clarity of your code,
when you're saying exactly what you need,
the restrictions that you need
on each of the atomic operations.
So sometimes you still need to use atomic in your C++ code.
Often you don't, but sometimes you do
for maximum performance.
There are data structures that are really expensive to do
with locks, and that's mainly data structures
that allow distributed sort of access, like lists.
Ever tried to do lists with a lock?
Well, when you don't want to lock the entire list,
you can put a spin lock on every node.
That's not very expensive,
but you're going to deadlock really quickly.
Not impossible, but hard to do correctly with locks.
Not that hard to do with lock-free, with compare-and-swap.
There are other drawbacks to locks.
Locks are non-composable.
You try to compose locks, you get deadlocks.
You get priority inversion, you get latency problems
with locks.
Lock-free algorithms exist to solve all of those problems.
If that's what you want, use lock-free algorithms.
But again,
learn the impact on the performance
of all the details.
Be very explicit about your atomic programming,
what you do and why you do it.
State the actual restrictions and guarantees
that you need.
If you need a release, say that you need a release.
Don't be sloppy about this.
It helps me to read it;
it helps you to get the best performance.
If you have a concurrent synchronization protocol
that can be achieved without atomic swap
or compare-and-swap or exchange,
if you can use only reads and writes,
you can get better performance, by a mile,
than with any other synchronization scheme.
When can you use reads and writes
in your atomic programming?
Well, I have a talk on that on Wednesday.
And with that, let's open for questions.
In sequentially consistent order, please.
(applause)
Yes?
- [Man] So I've done far more reading
about lock-free programming than successfully
programmed lock-free, but from my understanding
of the weak versus strong write,
in certain circumstances, in certain architectures,
which have not been fully spelled out,
other people have made the claim
that the weak version may succeed
faster than the strong one would,
besides just spuriously returning false.
- Okay, so the difference between the weak,
remember that I showed you the difference
between the weak and the strong.
So the weak one will do a read,
and that may require communication across cores.
That part has to happen.
But the getting exclusive access part,
which is the global locking down of the entire,
of everybody's access to the cache line
while you're doing your write,
that, on some hardware platforms, is very expensive.
And yes, it may be better, basically,
for this core to give up its write
to the exclusive access and let somebody else
do their update first.
Essentially, if you're doing this kind of timed lock,
and if you're waiting for a long time,
it means somebody else is really trying
to do something to that same cache line
or the same memory location,
and they're in a better position.
Maybe it's in their local cache.
So okay, let them do it.
So on some platforms, giving up on that exclusive access
and essentially going for a spin
on the compare-and-swap loop
will result in a faster performance overall,
across all threads, because you let another thread
that can do it faster, you let that one go first.
It requires that you have platforms
that have a hardware that basically
plays something like this double-check locking trick.
It's not the only implementation that does it,
but something along these lines.
On other platforms, so if you look at X86,
you can just look at assembler-generated code.
Weak compare-and-exchange and strong compare-and-exchange
would generate the same instructions.
So X86 just doesn't have enough tricks
in its bag to make use of that.
Now, you can still say compare-exchange weak,
and you wouldn't be wrong.
Just because it can spuriously fail doesn't mean it will.
Your program obviously has to handle if it might fail,
but what if it never fails?
Your program has to work for that condition.
So it's not wrong to use it if you're just
spinning on it and hammering the atomic
over and over in this compare-exchange loop,
it's usually better to do weak
just in case you're on one of those platforms
where getting a long-range lock is expensive
and it's better to let the owner of that location go first.
- [Man] You say that memory-order-relaxed is
tricky to use.
But do you know any practical use cases
for this memory ordering?
(man talking quietly)
- Well, so relaxed, if you have a true relaxed memory order,
which let's assume your hardware platform
has a true relaxed, so like ARM has a true relaxed.
It means that you have no guarantees
on other variables, atomic or non-atomic,
with respect to their writes.
Now, what's the use?
The use is, I have this atomic variable,
and I'm only interested in that atomic variable.
It's not an index into any array,
it's not a pointer to any memory.
I just want to know how many,
I'm printing something in the log
across my program, and I want to print 10 lines.
So I want to know how many times I printed.
I'm accumulating a sum.
I have a bunch of integers sitting on my multiple cores.
I just want to know the total sum.
Well, it's faster to do the local accumulation first,
but then eventually you're going to go
and commit to the global accumulator.
You just want to know the total sum.
Do a relaxed atomic increment.
You don't care about any other variable.
You just want that atomic to be atomically incremented.
So I have two threads.
One wants to add three.
One wants to add five.
The only guarantee I want, I want to get eight at the end.
Memory-order-relaxed,
cause you're not using that eight to index
into anybody's array.
So that's memory-order-relaxed.
Now, this is by itself.
Now, when it gets tricky is you have multiple
atomic variables.
And you're doing some kind of communications for them,
so you are pre-building.
So you first increment this atomic variable.
And that doesn't really signal anything.
Then you increment that atomic variable,
and that doesn't really mean anything either.
Then there's a third atomic variable, some flag,
which you change from zero to one,
and that means, okay, I'm done, go.
So the first two can be relaxed,
and the last one is sequentially consistent.
You don't need the barriers on the ones
you used earlier because nothing is synchronized by them.
You just want their final values to be known
on the other side.
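A sketch of that pattern, with made-up variable names:

```cpp
#include <atomic>

std::atomic<int> count1{0}, count2{0};
std::atomic<int> done{0};

void worker() {
    count1.fetch_add(1, std::memory_order_relaxed);  // nothing is synchronized by this
    count2.fetch_add(1, std::memory_order_relaxed);  // nor by this
    done.store(1, std::memory_order_seq_cst);        // the flag that says "okay, I'm done, go"
}
```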
So if you have a synchronization schema that,
like a queue, queue has a front and an end.
So two atomic variables,
and you're atomically updating both of them.
And you may go, okay, I cannot dequeue if they are
at the same place, but if there's a gap between them,
then it means there is some data in the queue.
I can dequeue.
So you have two atomic variables,
and you may not, depending on,
there are different synchronization protocols.
If you were in Magit's talk before,
he's shown you two different schemas
on synchronizing a queue.
One of them just used basically memory-order-relaxed
on front and tail, and atomic was somewhere else.
So you have multiple atomics,
very often you don't need a barrier on all of them,
because you're using the last one
to actually publish or synchronize,
and everything else just has to be atomic.
- [Man] So building on top of that,
at the very beginning when you showed the demonstration
of adding a whole bunch of integers,
and you showed that the lock-free one was,
of course, much slower in your benchmark,
is that primarily due to the sequential consistent
memory barriers?
- Well, on X86, it's really hard to answer
because increment or fetch-add is sequentially consistent
on X86, and there is nothing I can do about it.
I could say fetch-add memory-order-relaxed,
and it wouldn't make any difference
because the xaddl instruction that is emitted
is acquire-release.
On X86 there is no difference
between acquire-release and sequentially consistent.
So xaddl is a bidirectional barrier.
So there is nothing I can do about it.
Now,
in this case it's not really because of the barrier so much.
It's because of the cache line bouncing.
So even if I had a true relaxed,
let's assume that I had, on X86 I had the true relaxed,
meaning I increment it, and I get no additional guarantees
about any other memory except for this atomic variable
that I'm incrementing.
It would still do the cache line bounce
from core to core as I'm incrementing,
because I want to write into this variable, says this core.
Okay, send me the cache line.
I'm going to lock this cache line and increment.
Oh, I want to write into this.
Send me the cache line all the way back.
I'm going to lock it and increment.
And the cache line keeps bouncing back and forth
between the cores, because each core
has to have exclusive access in order to do its increment.
So even if you had true relaxed,
it wouldn't really make it
anywhere close in performance to what you
saw in the lock-based one.
The mutex there is a gimmick.
The bulk of this algorithm is adding
into the local sum accumulator.
After I've added thousands of values,
yeah, at the end I go and do one add into the global,
and nobody cares how I do it.
It happens so infrequently.
It's once per the entire computation, the locked part.
Mutex, spin lock, atomic, whatever.
I do thousands of accumulations locally,
in my thread local variable, the one on the stack,
millions of accumulations, however long
you want to run the benchmark.
Because at the end I do one atomic commit.
Under lock, do whatever you want, at this point.
Doesn't matter.
So lock is really a gimmick there.
You have to have a lock, but it doesn't matter
what you do with it.
- [Man] I think I simply missed the local variable
on the slide with the code, so that's what confused me.
Thank you.
- Yeah, so the lock is necessary,
but performance of the lock is utterly unimportant.
The atomic lets you write correct code without that lock.
Correct, doesn't mean it's not stupid.
It's just correct. (laughs)
- [Man] Right, so if you had removed the lock
and replaced it with an atomic plus-equals
in the second one but kept the algorithm the same,
it would have been-
- It would make absolutely no difference,
assuming you actually have,
so you run the loop into the local accumulator.
Let's say it's 1,000 additions.
And then you do one atomic or lock or whatever.
It's not measurable.
Now, if your local accumulator was like three adds,
and then you do the commit,
because you only have six elements
in the array to add, total,
it's questionable whether you want to do six adds
on two threads or not,
but if that's what you had to do,
you would start seeing the difference on the lock
versus atomic, cause three adds plus a mutex,
versus three adds plus an atomic,
that's a difference.
- [Man] Okay, so, okay, I understand, thanks.
- Yeah.
Any other questions?
- [Man] Yes.
- Oh, okay, yep?
- [Man] Hello, so the question that I have is that,
can you give a hint of how the memory barriers
are actually implemented in hardware?
And is there a scope, like do they have anything
to do with cache flush and cache load?
- Well, they have to,
so first of all, how it's implemented,
from your point of view, it's either a special instruction
like membar on SPARC
or mfence on X86, or it's a modifier
on an existing instruction.
What's actually going on in the hardware,
yeah, there is some sort of global coherency
being established.
What exactly is being established
depends on what it has to establish.
So first of all, pretty much everybody,
even the systems with very strict memory model
like X86 have store buffers.
So that's the first thing being affected
by the memory barrier.
It has to basically flush the store buffers.
Because on X86, actually, you wouldn't have
out-of-order writes if you didn't have the store buffers.
The caches themselves are TSO,
total store ordering.
But you have store buffers, so the stores
on the individual CPU are accumulated
in the store buffer.
So while the CPU waits for the cache line to be
sent from another CPU and locked for exclusive modification,
the values themselves, the ones that you want to write,
are accumulated in the store buffer,
which means other CPUs can't see them at all.
So if you have a memory barrier,
the first thing that must happen,
everything in that store buffer has to be committed
before other CPUs will be allowed to proceed.
- [Man] Okay.
All right, thank you.
- Beyond that,
it depends.
So if you're on the same socket,
typically you have cache-to-cache communication protocols,
meaning you don't have to go through main memory.
If you're on different sockets,
you go through the UPI link or its equivalent.
So you don't go through main memory,
but there is basically like a socket
on the microscopic level.
- [Man] Okay, so basically it's not trivial.
- It's not trivial.
It heavily depends on what hardware you're using.
And the performance characteristics will depend a lot,
again, on the hardware.
So if you benchmark the same code on X86 and on ARM,
for example, you measure how it scales,
you start adding memory barriers and see
basically how much the contention costs you,
you would notice totally different behavior.
On X86, basically,
it starts off really fast, and then as soon as
it kicks into contention, it just slows down
to sequential.
And on ARM, you might see that it starts much slower,
but it keeps scaling almost perfectly
for a lot more threads before you start
noticing slowdown.
And that has to do with, well,
the cores themselves being slower,
but the memory model being more relaxed
on ARM than on X86.
- [Man] All right, thank you.
- Any more questions?
No?
Okay, thank you.
(applause)