- So, my name is Gor Nishanov.
I am a developer in Visual C++ team,
and one of the thing I work on
is design and standardization of C++ coroutines.
Most efficient, most scalable, most open coroutines
of any programing language in existence.
And, how many people attended any of the coroutine talks?
We have about, okay, let's do it the other way around.
So, we had probably, I think, every CppCon
there is one or two coroutine talks.
Who haven't been to any of them?
This will not be an introduction to coroutine.
I put a few slides to give a context,
but this still assumes that you are roughly aware
how a C++ coroutines work, but don't worry,
we'll have a few slides to get you up to speed.
So coroutines is a very, very old abstraction.
It is a bare function.
It was invented 60 years ago by Melvin Conway,
and Knuth gave it a more convenient definition as
actually, he did the opposite.
He said that the subroutine is a special case
of a more general construction, which is a coroutine.
And if we look at it, coroutines are just like functions
but better, because they get two more operations.
Just like normal functions which get call and return,
coroutines get two more: suspend and resume.
And in C++ we are roughly getting this,
or at least in terms of the coroutine TS
which is currently going through
the sub-duct-iz-ation process.
So coroutine is a function which has an await
or halt expression, or a special form
of a coreturn statement.
And one of the common patterns people use
with coroutines are generators.
You can think of them as a lazily-produced sequence.
So in this case I have a hello generator,
which yields character one at a time.
And generator, from the perspective
of the consumer, is just an iterable.
It has begin and end, you can pass it to arrange base four,
you can pass it to a standard algorithm,
and what happens is that when coroutine
gets to a suspend point, in this case it's yield
which pushes the value out of the coroutine,
it gets suspended keeping all of its state intact,
and then when the main in this case will
pull the next value we will jump back in
right where the yield was and keep going.
Another common pattern with coroutines
are asynchronous tasks.
In this case we have some activity which as a
an iteration, an expression which is not ready yet.
It could be some lengthy operation like for example
TCP connection, establishing TCP connection,
or maybe reading from a socket.
And instead of blocking the thread, we will suspend
the coroutine while waiting for the result
of that operation to become available.
And once available, coroutine will be resumed
from the point where it was suspended and will keep going.
Many languages, about 11 by last count,
have acquired coroutines of roughly the same same shape
over the last four years.
10 or 11 counting the ones which are
still in the design phases, and those which already
shipped the standard with that language.
I think now Rust is in process of getting them.
Swift is getting in the progress, so anyway.
A lot of languages are getting those facilities.
What makes C++ coroutines unique is the design principles
that we established that wanted to guide
the design of our coroutines.
So we wanted coroutines to be scalable.
Not like hey, we will solve 10k problem.
If you look it up there was this problem, I think,
it mostly union circulated, it's how many...
Can you handle 10,000 connections on the same box
without creating too many threads?
No, we are talking billions of coroutines.
So the overhead of every individual coroutine
is tiny, it's just a few bytes.
So if you want to you can scale to billions.
So far I have not seen a very good example
of an application which uses million coroutines, but...
We can hope.
Now coroutines need to be efficient.
So the overhead of suspend and resume
is equivalent or roughly on the same order as function call.
And that actually gives a hint, or at least small hint,
to you what this talk will be about.
Because now that coroutines are playing off from the fact
that the switch between coroutines is less than 1 nanosecond
And we can do very funny things if that is the case.
We didn't want to do any funny business
with stack switching, changing TLS that
for example fibers do, that put any limit
on what you can do inside of the coroutine.
Moreover, coroutines themselves can be passed through
a C boundary, a boundary as avoid start.
And then from your third-party library, from an OS call
in the callback, you can reconstitute the coroutine back
and resume it, and that allows very nice
interactions with legacy C code.
Finally, unlike most of the other languages
which have coroutines, we did not tie
coroutines to any particular runtime.
Because requirements for the runtimes say in a kernel mode
or in embedded systems could be very different
from requirements needed for, say,
UI applications or some others.
So we made coroutine open-ended.
You can define the meaning to them, which is provide
efficient chopping of a function into a nice state machine.
And finally, we added efficient cancellation to coroutine.
So any coroutine can be canceled at any suspend point
without relying on exceptions.
And that is mostly to increase applicability of coroutines.
There are environments, like Google for example,
that does not use exceptions, or at least
in a large part of your code base,
and you can use coroutines there.
Same with kernel, like in Windows, kernel exceptions
are generally banned or unusable
because runtime does not support them.
Anyway, so this is what distinguishes coroutines in C++
from coroutines in any other language.
And couple of years ago, we went through detailed talk
where I explained what compiler does with coroutines.
How compiler synthesizes the coroutines,
and we saw what major compiler is doing
to turn all of those wonderful coroutines on the left
into a tiny, single constant on the right.
So here you can see that we created
a bunch of generator coroutines.
That we composed them together, we passed them on to
a standard accumulate, and in the end we get a single number
So coroutines are very efficient and optimizable in C++.
Last year we looked at interactions
between the network in TS and coroutines.
So we can take this wonderful server,
which is a little bit hard to read, and
this is the same server but written as a coroutine,
which is the same efficiency but easier on your eyes.
This is an application of coroutines
in the field of databases.
And I have not actually discovered that.
All the created goals to research that presented
and discovered that in VLDB conferences
and written those papers.
So why database people, why people
that go to VLDB Conference, which is
Very Large Database Conference, care about coroutines?
coroutines that can switch in nanoseconds.
Well, the trouble is very large database.
A lot, right?
And this is a tiny, tiny CPU, actually, well, yes.
It's a four core CPU on my laptop on which
I was running on the benchmark.
Look at those tiny, tiny numbers.
L1 cache, 32 kilobytes.
L2 cache, 256.
L3 cache, 8 megabytes.
And then memory.
And it takes 60 nanoseconds to bring a value
from the memory to the...
To the CPU.
And because it is a very large database
their index is, they're gigabytes, their own memory.
And they are hid at random spots.
So whatever you do, whatever wonderful (mumbles)
you are trying to achieve, when you start taking
those indexes, you're hitting memory
in absolutely random spots and nothing can help you.
So let's see what happens, and I will use
the simplest algorithm we can have for index searching.
A binary search, because it fits the screen.
Databases use more complicated b-trees and hashing
and combination of those but this will be good enough for us
So it is a version of the binary search
mostly stolen from a lower bound with one extra check
which allows us to bail out earlier
than the standard algorithm.
So let's run it on a small array.
Wonderful, if it fits L1 we're getting roughly
four nanosecond lookup over log two.
In this case I have this little metric
which allows us to evaluate how good the algorithm performs
by simply dividing how long did it take
to look do this search, divided by the log
of the size of the array.
Which is roughly how many hits into the memory it will do.
So wonderful, array comfortably fits L1.
We are barely having any misses, four nanoseconds per lookup
Okay, let's grow bigger.
Okay, 4.7 nanoseconds, reasonable.
Well, half of the L1 cache misses, but L2 if you see
is just 3.5 nanoseconds latency, so still not bad.
Again, even when the array fits L3, still wonderful.
Now this is where we're getting with databases.
With databases the actual index is gigabytes.
So even with this tiny array of 256 megabytes,
suddenly we are having 0.21 instructions per cycle.
This CPU can do four instructions per cycle.
Here we're doing it the opposite.
It's like one fifth of the CPU speed.
And what can you do about it?
Like, people have those wonderful data structures
trying to do cache optimizations, but what if your
memory access pattern is horribly random?
Well, if you want to do only one lookup,
nothing can help you, nothing can help you.
But luckily in databases, there is this wonderful
join operation which does a lot of lookups.
For example, here we have three tables
containing some customer orders, customer names,
and what food customers are eating.
And now we are doing the join operation.
So probably database would go through one table in a nice,
linear fashion that will have wonderful cache performance.
It will go one-by-one, they will prefetch a big table
of big pages so it will nicely go through all of them.
But then when it has to use the foreign key
to hit other table, oops, it's random.
The core will be very, very sad.
It will do just a little bit of work,
and then 60 nanoseconds of nothing.
Now, 60 nanoseconds is small enough so that it is
not worth to switching to a different thread,
because switch to a different thread is what...
half a microsecond or more, so it's much bigger.
So core is being wasted.
Of course, there is hyper-threading, but normally on CPUs
how many hyper threads are?
Two, two, right?
Two, okay, okay.
So core will be slightly more happy, because what it will do
it will run until it hits the memory stall,
then it will start running the next thread
but if it hits the stall again, we are stuck.
So what can we do to make it better?
Luckily, CPU hardware vendors provided us
with prefetch instructions.
And this is an intrinsic with a bunch of hints
which allows us to tell to a CPU,
"Hey CPU, we are interested in that memory allocation."
And we can give it a bunch of hints, for example.
If I say hint is zero, it means I actually want that value
to be fetched from memory, and then go into L3, L2, L1,
because I want it to be all throughout.
And NTA, the last option.
It's actually the least disturbing flavor of prefetch.
It's where I have important data in my caches
and I don't want to evict it.
Instead, I want to prefetch the value from
that memory allocation as close to CPU as I want as possible
without trying to disturb other levels of caching.
So this is what we will be trying to do.
We will try to do instruction stream interleaving.
Because our joint operation needs to do multiple lookups
we will try to do something like that.
We will start the first search, and then at the point
where our binary search used to just read the value
from memory, we will issue a prefetch instruction
and then switch to a different...
Different search that, one you still need to do
because the joint operation requires multiple lookups.
And then let it run until we hit a memory
read from error in the application.
We will issue a prefetch, switch to the next one.
And then when we have enough of those
hopefully by the time when we did the last one
the first one we prefetched is already in memory, sorry.
It's already near the CPU so we can
switch to the original one.
So essentially what we'll be doing, we'll be doing
hyper-threading in software.
Okay, it's actually not that hard, it's just
you have to be very, very meticulous
about how you're going to do it.
So this red line shows roughly how we're going
to split that function and try
to turn it into a state machine.
So here we have this start middle that is trying
to read from memory, and we'll just partition it,
well, since we need to make it to a state machine,
we first need to extract all of the state
we need that currently inside of that function.
So do we need first?
Yeah, we need it because it is used both in
before the red line and after red line.
Same with last, middle, length, half, value,
the value that we are keep comparing,
okay, and we need a state.
And we need a state so that we know which part
of the algorithm we are before the read or afterwards.
And we'll run the state machine with just two function.
Initial one which executes the first time we enter function
and until we hit that read or prefetch instruction.
And then the one that will run the rest of the loop,
but again we will split it up so that we
we stop just before trying to reach from the middle,
if your prefetch instruction, and then get out.
This is the initial entry.
We copied all of the values that were before the while loop,
then, well, if the range is empty we will bail out
and we will mark the state as not found.
Otherwise, we will do all of the stuff which happens
before the read, namely half divided by, sorry.
We divided the length stored in half, computed the middle.
Say that we need to keep going, and we issue
the prefetch instruction I showed earlier, great.
Now run will do the rest of what happens once we're read.
So now the value we think that value had been prefetched.
So now it is safe to say start middle.
So we go the middle key, we compare it,
is it on this side, is it on that side?
We did the earlier bail out, and one difference here
is that before in our binary search the return value
was indicating whether we found a value or not.
Here I'm using the return value of run
to simply indicate do we need to run it more or we're done.
And the state that will be store and found or not found
will be the thing that will be telling us
whether we found a value or not.
And then we'll go prefetch the value again.
No, yeah, I think we covered all of it,
and we will mark the state as not found.
It's tedious, but doable, right?
Okay, I see one hand, good, good.
Sorry, oh, one head nodding.
So and we want to run it roughly like this.
We will create a bunch of those state machines
and we will say run init on this one so it'll
initialize this frame, hit the prefetch,
then do the same for the second frame, and keep going.
Now we initiated and prefetches.
We go back and call run.
We hope that by the time we get back to the first one
prefetch has already completed.
So we will run it again until we hit the prefetch,
go to the next one and say this one either found a value
or reached the end of the algorithm.
So we're done.
So we consume the result in whatever fashion
the database wants, and replace that frame
by running init now for search number N + 1.
And we will keep going.
I will bore you with roughly showing
the code, how it looks like.
Something like that.
We create vector frames, we go through the list
of all of the lookups we need to do,
and then if the state in a particular frame
is not equal keep running, we will initialize it.
Otherwise we will call run until
it's either found or not found.
In either of those cases we'll initialize
it again and so forth, so on.
And in the very end we need to deal with strugglers.
What it means is that we run through entire list
of lookups we need to do, but we still have frames
that have searchers that haven't completed yet.
So there will be a tiny loop at the end
that we'll take care of those.
Okay, let's see how we did.
So with naive search we were getting 26 nanoseconds
per lookup over the log of an array size.
Now we're getting something.
We just made it two times, 2.5 times faster.
And that is with interleaving with 16 streams.
Essentially I ran it with different amount of frames
until I find the number of frames which is sufficient for us
to do enough work between the prefetches
so by the time we go to the first one it's already there.
But it's nice, but that's a lot of code.
Moreover, we need a very,
very simple algorithm, binary search.
Now imagine we want to apply that technique
to more complicated algorithm.
That is not manageable.
Can we do it better?
Yes we can, and we can do it with coroutines!
So at the very beginning we talked about
two common patterns people use with coroutines.
One was generators, lazily-produced sequences,
and other are sync tasks that await
for some asynchronous I/O completions.
So which of those two kind of problems
this problem reminds me of?
Who thinks it's generators?
Two, who thinks it's async tasks, a lot.
And you are right, and you are right.
Moreover, that coroutine-based TCP server
we looked at last year is actually shown us
almost exactly the kind of patterns we are going
to use to deal with this problem.
Because the top part, which is a session where we
awaiting for TCP read, and then once read completed
we do a write, and then read again, write again.
It has the property that this coroutine doesn't go anywhere
while we're awaiting for the result.
And that is roughly what binary search is doing.
Because binary search you can think about it
a read from memory, oh, that's an I/O.
That is an I/O that takes a lot of time, 60 nanoseconds.
Without a coroutine so fast, you know,
one nanosecond or less switching, we can do a lot.
One change before I switch to a coroutine
I will change the way how we deal with the result.
So here we return the result as a boolean,
I want to return result in a form of a function object.
And if you notice, Alex Tiponov long time ago
discovered that this go behind are the best kind.
If you have a function object, compiler can actually
align it inside of the algorithm
and it will be very, very, very efficient.
That's why we're doing it this way.
And you know where I'm going to with this,
because this is it.
This is how long it would take us to convert
the algorithm from a plain one into a coroutine.
Essentially we have some kind of coroutine type,
we changed void into a coroutine type, and then
we replaced the plain read into a waiting on
a prefetch operation that will suspend that coroutine
while value is being prefetched, and then hopefully resume
it at the right time when the value is already in memory.
And yes, we need to replace return with coreturns.
There was a battle two years ago, and the coreturns have won
So we need to write coreturns on coroutines.
Okay, so that took care of the algorithm.
Now we need to write an orchestrator.
Let's get back to our TCP-based server.
So the bottom part actually has the pattern we want,
because unlike the session which gets suspended
while waiting for the result of the read or write,
we don't want a server to be suspended
while waiting for a particular session to work.
We actually would like to spawn as many sessions
as we want, and we want them to live
independently from the server itself.
And in this case, because having detached activity
right running wild, like detached thread in horrible,
never, never, never do it, here we are actually
associating the coroutine with an I/O context.
So in network in TS I/O context is the thing
which tracks all of the activity
which is currently in flight.
So in this case, we are spawning and we'll make
the coroutine independent of a server,
but still there is a way to join them
in the end, or cancel at the shutdown.
And that brings me to roughly this shape of an orchestrator.
So here in sort of I/O context we will have a throttler.
Because we're potentially doing a million lookups.
We don't really want to launch a million coroutines.
So the throttler will just have enough coroutines
to keep the prefetcher by client busy.
So and this multi-lookup gets the concurrency parameter
will given it to a throttler, and then throttler
will spawn enough coroutines to keep it busy
and in the end we will do the join.
So another name for this pattern is fork-join.
You probably have seen it NCBB and others,
because what we're trying to do is to spawn
a whole bunch of stuff, and then in the end
we want to wait until all of it is done.
So this is how we imagine the
simplest possible way to solve this problem.
Does it look simple to you?
Yeah, yeah because it's very close to what we're doing
first with the TCP server, and second the algorithm itself
is almost the same as a naive version.
So now let's see how much will it take us
to actually make it work.
And that is actually a quite a good approach.
Frequently I advise when people say,
"Hey, you know, how can I do this with coroutines?"
I will say, first think of the best way
of solving this problem.
Just imagine, imagine the syntax
and then you can make it work.
That's roughly what we were doing here.
We thought of a way how would we write it in a DL way,
and then figure out a way to make it with coroutines.
So the first part, we need to make prefetch an awaitable.
A quick reminder for those who haven't
seen the other coroutine talks,
awaitable is a thing with three functions.
Await ready, await suspend, and await resume.
So compiler expanded into something like that.
First it asks, hey you, awaitable expression, are you ready?
And if it's ready, then it proceeds straight to the end
into await resume, which unpacks the result.
And the result of an entire await expression
is whatever await resume have returned.
Now if it's not ready then compiler calls await suspend
and give it as a parameter, a handle to a coroutine
it manufactured, and the handler for await suspend
can then do something with it so that it can resume
or destroy the coroutine later when it knows the result
is available or if it knows it will have to cancel
and get rid of all of this stuff.
So let's do the awaitable for prefetch.
So we're going to try to implement this
prefetch awaitable to templatized on T.
So first we need to remember to grab
a reference to the value.
Right, so whatever is best in a constructure
we will just store a reference.
References are cheap, right?
It's just an address, we're not reading anything.
Now at the very end when our await is satisfied,
we will just return that reference to whomever
and hopefully he will actually use it for good name
to the reference and get the value that we were looking for.
Now in this case, if you look at the prefetch instruction
it doesn't tell us whether the value is already in the cache
Moreover, probably that will not be a very good instruction
because it will take a while for hardware to figure out
whether it's in the cache or not.
We want to be very fast, we want to cycle
a lot of coroutines, so we'll just say nope,
prefetch is never ready.
We assume that it is always in memory.
So we need to go on a suspend path.
And in a suspend path it's quite simple, actually.
We will call that intrinsic that I showed earlier,
give it an address of a memory location
that we want to prefetch.
And in this case I gave it a hint because
I didn't want to pollute any other caches.
I want to just we hit at random spots,
every time will be different.
Give us straight to the CPU.
And now I want to put this coroutine,
you see, there is coroutine handle that was passed
to everybody at this point, and they will put it
into some queue that will put it
at the very end of the queue.
Couple of years ago when we're talking
about different coroutine flavors, we talked
about symmetric and asymmetric coroutines.
So asymmetric coroutines, they support
only suspend and resume operation.
So in order for, well, when you suspend
you return back to the caller, and the resume
is a different operation that will resume you.
So there this unequal variation between the guy
who called you and the coroutine that suspended.
There is also a symmetrical coroutine variety
which allows direct coroutine-to-coroutine transfer.
More like thread to thread context switch.
And even though functionally those kind of coroutines
are accumulating, because we can convert easily
one into another, there is a cost of conversion
asymmetric coroutine into a symmetric one.
And I actually observe that cost right here,
because we're talking nanoseconds.
This was my first implementation where I simply
added a coroutine handle to this queue
and then I returned control back to the scheduler
which runs the loop, which de-queues now the coroutine
from the front and resumes it.
Well, that's about three cycles, and three cycles
is about 0.75 nanoseconds on that machine.
And we want to pack as many coroutines as we want,
and I wanted to eliminate that.
Luckily we have a provision in Coroutines TS
which allows symmetric coroutine-to-coroutine transfer.
So what we want to do here is this.
That way we essentially eliminated the scheduler.
Scheduler became just a very, very simple, dumb queue.
We're saying, put ourselves at the very end of the queue.
Get the coroutine at the front of the queue,
and that is the coroutine I want to
symmetrically transfer control to.
So this becomes simple jump from
this point to a different coroutine.
Queue is boring, anybody can write it.
Yeah, and at the very end is just tri-pop and a run
that will keep trying popping the coroutines
until there is no coroutines anymore.
And resume on the coroutine handle is the thing you use
to resume the coroutine which is suspended.
Any question on the queue?
No, no, that's...
What else is left to implement?
Okay, now we need to implement a throttler.
Let's look at the binary search again.
I said it's a task void, and task void is
something which currently goes through
some (mumbles) processes, the zero overhead future.
We talked about it two years ago
in the coroutine under the cover of stock.
Well, this is not it.
I am going to hack it a little bit.
So it will be a root task.
Because I want root task to be aware of the throttler.
And it's a yucky part, I will show how to fix it later,
but I need this, so my apologies in advance.
So what throttler needs to know is, well,
what are the concurrency factor?
How many coroutines it want to keep concurrently.
And in the spawn it will check.
If it populated the queue with enough coroutines
it will start writing them down.
So essentially here what we're doing we'll
popping the first coroutine from the queue and resume it.
And then it will start jumping around
from coroutine-to-coroutine because we say the await, right?
Await will prefetch is actually doing all of the scheduling
and will keep running until one
of the coroutines run to completion.
And that will create a room in a cube.
That's why if the limit is zero,
I will start running the first coroutine
and then go around everything else.
Otherwise I will associate this coroutine
with this throttler so that it knows to tell me
when it's done so that I can bump up the limit.
Push it into the queue, and decrement the limit.
And on task done it is the callback that essentially
the coroutine will call at the end
that will increment the limit.
And the rest are boring functions.
I needed to do a few tweaks to the zero overhead future.
And this actually shows those tweaks.
First I needed to add set owner which will
disassociate the coroutine from this array type
which is going to destroy the coroutine and is destructor,
and transfer the ownership to the throttler,
because now this will be the guy who running the coroutines
and will be destroying them in the end.
And I will remember the throttler in the promise type.
And I change my final suspend from suspend always
to suspend never, which means the coroutine
will self-destroy itself in the end and will let know
the owner that the task is done, so it reached completion.
Finally because I am creating a million of coroutines,
million potentially coroutines,
I will use a recycling allocator.
So if you overwrite an operator, new and delete.
On the coroutine premise, you can supply your own allocator.
And that helps a little bit.
You know, extra few nanoseconds, and again
every nanosecond counts when we're
doing this kind of operation.
So let's see what happened.
So we have...
26 nanoseconds for a naive version.
We have 10 nanoseconds for
a handcrafted version with 16 streams.
And do you think it will be better or worse with coroutines?
Yes, yes, it's actually better.
So we're getting to, well, almost 3.5 time
faster performance, and we have 20 streams to get it there.
Oh, thank you, so you all have to
leave early, we're almost at the end.
So negative overhead abstraction again?
So Alex Tiponov when he was in one of his interviews
when he was asked how to design good abstractions,
he explained that it's not like you invent abstractions
from just your brain, from thin air.
No, you start from a concrete problem,
and then you look at the most efficient way
to solve that problem you know today.
Okay, at that point you can try to invent abstraction
and now recode that problem using this new abstraction.
And see how much overhead have you introduced.
If not, wonderful, you have a good abstraction.
Try applying to some other pattern.
And so far, so on.
What happened with coroutines, and this is
not the first time, is that when we take
the best possible pattern we do today,
say with handcrafted state machine and then recode them
with coroutines, for some unknown reason it becomes faster.
So it is not just abstraction penalty,
it's a negative overhead abstraction.
And I try to understand why it happen.
So I was looking at disassembly,
because pattern of access for both handcrafted state machine
and the coroutine-based state machine, it is identical.
I am running them on the same array,
with the same set of values to lookup.
One of them was faster, and I think it happens
because of when coroutine transformation takes place.
So both Microsoft compiler, Clang, and a JCC implementation
which is currently in progress discovered that
the most profitable way to do a coroutine
is to actually let the coroutine be observed
by a compiler as a normal function for as long as it can.
So that optimizer can simplify the body,
can do the constant propagation, can do the
common expression elimination.
Like, do all of the awesome stuff compilers do inline things
Discover after aligning that some other stuff disappears.
And once you finish simplifying the body
of the function then you do the transformation
of the coroutine into a state machine.
So you chop it up into small pieces
and then you optimize the individual pieces.
So what happened here, we humans, we did the transformation
of a function into a state machine to early.
We did it at this very high level
before compiler had any chance of doing optimizations.
And once a function was chopped up into pieces
it becomes a state machine, compilers are not
very good at optimizing state machines.
So that is my guess why it actually faster.
If you look at disassembly you will see that the
the handcrafted state machine puts too much
register pressure on a compiler.
It is too complicated code, it starts spilling
and reloading values into the memory and into the registers.
Whereas coroutines after it's chopped up
it becomes very, very simple.
It does almost the same thing that you would have done
if you were trying to write this in Assembly.
So that's nano-coroutines for you.
We're taking advantage of the fact that the switching
from one coroutine to another is taking less than
a nanosecond, and in some cases zero
in case a coroutine is inlined.
Unlike, for example, fibers or stackful coroutines
which takes from 10 to 50 nanoseconds
to switch depending on the OS architecture,
CPU kind, and how many registers they have to
save and restore when they do the context switch.
And naturally you cannot do that with threads,
because they are we're talking...
You know, close to a microsecond switch.
So this approach allows you to...
To deal with memory latency in those cases
where nothing can help you.
No more packing of your data
in a cache where data structures can help.
Nothing, you hit the memory at random spots.
Well, if you have more than one lookup,
you can try to split your code into several streams
and then use coroutines to interweave them.
So coroutines current status, proposal is going through
the C++ Standardization Committee.
Maybe it will be in C++ 20.
We have implementations in Clang and Visual Studio.
GCC implementation is in progress.
Nathan, yes, in progress, I see he is nodding there.
No, he's leaving the room.
Oh no, he's coming here, excellent!
So I think we have confirmation
that CGG implementation is in progress.
And last year we had a talk where we talked
about naked coroutines, and that was one of
the complaints that coroutines are just a compiler feature.
You always have to come up with
your own types to play with them.
No longer naked, we have the task type
going through standardization.
We have the generator type that is supposed to come
in San Diego, which is just in a few months.
And a bunch of wonderful, wonderful, wonderful
algorithms that work on awaitable types
that will helped to...
And this is just a reminder.
I'm told to put the slide.
There are other useful talk, and we are
at the end of Wednesday, so I will put it back again.
- [Man] Yes, so go, I just, correct started
an implementation, as Facebook's hired
a GCC engineer to work on this.
Started a week and a half ago, so (laughing)
But we have a design document that I've been working
with Gor on and the like, so it's not from a standing start.
If you go to the GCC wiki page, there's now a new
project page where you can see what's happening,
but it'll be a while before there is a result.
- Uh, wonderful, do I need to repeat the question?
Well, it's awesome!
So I was not lying, GCC implementation is in progress,
even though it's only a week and a half, it's still awesome.
It's still in progress!
- [Man] So what's the performance
of the coroutine implementation like without the allocator,
and to what extent does that allocator rely on
the fact that all of those stack frames
are expected to be exactly the same size.
If you were in a place where you couldn't
rely on that, would the speedup be similar?
- I think it was about 10 nanoseconds.
I was surprised by how well malloc actually behaved.
I didn't use tcalloc, I just used
very dumb 30 second allocator.
But even without it, malloc was doing an amazing job.
But I wouldn't recommend, because if you are
going for nanoseconds, you better use
some good recycling allocator.
- [Man] Alright, thanks.
- [Man] For those of us who want to write applications
that other people are going to run on a variety of machines
that may have more cores or less cores
or more cache or less cache.
What facilities are there going to be
to make the throttlers more well-behaved
without hand tuning them?
- With the current instruction sets as they exist today,
right, that will not give you visibility of say
how busy a prefetcher is, you have to tune it by hand.
But I've been talking to some, I think, Intel engineers
what kind of are the facilities we could have.
For example, there could be, and again, this is spit balling
It's not real, no, nothing has happened
but potentially could have and instructions can tell you,
hey, prefetcher is that busy, and thus you know
you need to feed more to it.
And if it's completely busy, well, you don't
have to spawn any more, it will not help you at all.
So this kind of a help.
Another one somebody suggested was to have
a prefetch instruction that will give you a hint
whether the value is already somewhere close,
but that is likely to be a more
expensive instruction than a plain prefetch.
Because it would have to go and check in a bunch of places.
- [Man] I would be even be happy to let the
the throttler or whatever initialize itself
and run some kind of test once at the start of my program
and then use that data if that's, you know,
another way that this library could go.
- [Man] Hi, I'm interested if my own data structures
and algorithms are cache-friendly.
Can you comment on the profiling toolkit
you used to show your latency?
- I used perf-stat.
It's Linux tool, perf-stat, to show me
the Intel counters, the CPU counters.
- [Man] Perf-stat, what was it called?
- Perf space, yes, perf.
- Perf-stat, okay.
- Perf, P-E-R-F, perf-tool.
It has a bunch of very useful primitives,
and if you want to look in more detail, I think Chandler
had been given couple of talks last CppCon.
Going nowhere faster, and the previous one before that
he showed how to use perf-tool to get the information
about caches, about stalls, front-end stalls, backend stalls
perf-stat is a wonderful thing.
It also work as a profiler.
You can say perf record and then perf report,
and you will see wonderful text-based UI.
Which is pretty decent.
And on Windows you would use something like VTunes,
but you have to, is it VTunes, right, Intel?
But you have to pay.
On Linux is cheap, it's nothing.
- [Man] So this looks super awesome, but what if you
put a break point inside of a coroutine
and look up a call stack and then debugger?
Like is it a nightmare?
- If you put a break point into a coroutine
you will see roughly in, let me get to that.
Okay, hold on.
Okay, anyway, it's too far to (laughing)
to get to the code I want to show,
but what it will show you, you will have a...
Your binary search out or function, right,
and then that function call resume
and that resume you will be in a coroutine.
So you'll have just two functions on the stack.
You will see a function, the multi-lookup,
and the next one will be a coroutine
at that point where it is currently executed.
- [Man] So there's not much of, like,
sort of impenetrable functions deep
in some standard library in the call stack?
- No, no, every time you look you will see
a normal thread, and you will see the currently executing
coroutine and what function invoked it.
In this example it will be just the multi-lookup
and then a coroutine where it's suspended.
And if you want to see where other coroutines are,
then you need to go and look at the queue
which has coroutine handles and in Microsoft debugger
we will actually tell you what function it points
at what suspend point it points.
But we have plans, both in MSVC and hopefully
have extensions for other debuggers
that will show you as if it's a call stack.
So we'll show you a here's a coroutine suspended here
and here's all of the locals it has at the moment.
Right now you cannot examine coroutine handle
in this fashion unless it's a currently executed coroutine.
- [Man] Gotcha, thank you.
- [Man] I'd like to know if there is anything to think about
in terms of what the prefetch intrinsic does
with regard to C++ memory model.
Is there anything funky there
with acquire-release semantics?
Is it possible not to see the results of some
memory operations done by some other thread and--
- Well, prefetch is just a hint, right?
So the actual operation we are doing is the reference.
That is where we will start worrying about the semantic.
It's we'll start trying to get that value
from a cache or from memory.
Prefetch is just a hint to a CPU hey, try to get me a value.
The real work will be done when, sorry,
not real work, but at least the one
that have all of that semantic associated
it is the actual the reference of the value.
- [Man] So the memory access itself should be
the subject of the same limitations as applies
to any other program outside of coroutine case?
Like putting memory barriers and so on.
- Actually, I think I start, I think I replied
incorrectly to your question.
In the sense that yes, we need to make sure
that the prefetch is not moved
around the actual read operation.
I'm not sure if it's hardware catching it
or it is an intrinsic which is marked somehow
in the backend which prevents reordering.
- [Man] And just another question, it sounds like
execution TS is something that you don't really need
to integrate with or care about, because you are
completely outside of whatever threading model,
execution model's going to be in C++.
'Cause it seems to be like completely unrelated
with anything threading, right?
- Yes, yeah so, coroutines love executors.
Coroutines love network NGS.
Coroutines interact with other things, but by itself
they do not require those facilities.
But if you want to inter-operate with them they can.
And there is a paper in the, if you know where to find
the standardization mailings, there is a paper
of coroutine impact on library facilities.
You Bing this name you will see how it's supposed
to interact with executors, with networking and other things
- [Man] Hello, so in my day job we use, like,
bunch of fiber managers to run on different threads
and basically serve connections with clients.
And there is kinda of pain to use mute access and statics
because they can grab logs that were inside.
They basically block all the fibers on the same thread.
Is this going to be the same with coroutines
or you're going because like for fibers
we use special fiber matrices.
- Okay, so one of the fundamental problems
why you don't want to migrate fibers from one thread
to another is related how thread
local storage code is generated.
So compiler is unaware of at what point of your function
you can have the fiber switch that will also switch the TLS.
So if you go to (mumbles) and compile some arm code
that uses TLS you will observe it, it will cache in some
non-vault vault or registers the pages of TLSs.
And thus you will be either reading garbage
or corrupting the memory if the switch occurs.
Coroutine control flight, the stack class coroutine
like this, suspension point is called out.
It is observable by the compiler.
So no TLS caching is performed.
Now, if you are actively trying to defeat
or write a bug, you can, but you can never
get into a situation where you are using the facility
like magic statics, or maybe thread caching allocator.
Or error_value, which is under covers,
uses the magic statics.
You will not end up in the case where you will
trash the memory or read garbage with these coroutines.
So they are safe to migrate as long as
you're not actively writing bugs.
- [Man] I see, one followup question.
What about your, like, primitives of STL?
Are they going to be like variants which support coroutines?
- Okay, so we have, at the moment we don't have
any asynchronous I/O in the standard,
but there is a network in TS which is in flight,
which might land in 20 or might land slightly later.
So that network NTS will be interacting
nicely with coroutines, and if you look at CppCon talk
called Naked Coroutines Live from last CppCon,
there will be a demonstration how you can hook up
network in asynchronous I/O and coroutines.
- [Man] I would, just that I'm speaking
about, like, I/O streams for example.
- Well, I/O streams are not asynchronous yet.
When they will be we can start talking about coroutines.
- [Man] As I understand the time when we
await the result from prefetch depends on
size of (speaking foreign language).
So if it's enough featured to between the prefetch
and when CPU call, move data to L1 cache it's...
Or isn't, will log best of worst.
You compare two organization, but in one case was
16 streams and the other case 20 threads.
Are you compare the working this, all that it's
on same amount of streams?
- Okay, so what happens is that the handcrafted
state machine were doing more work.
So there are four, I only can launch...
It does more, it consumes more cycles
in that little squiggly area here, sorry.
So this squiggly where we actually do useful work
between the stalls, it was doing more work.
There are four, I needed less of them
to consume this empty space between the squigglies.
Whereas with compiler-generated state machines
that I did from coroutines, it was actually using
less instruction to do the same amount of work.
So I was able to feed more
coroutines to do more concurrency.
But I determined that empirically.
I just random with different concurrency numbers
and saw which one gave the best performance.
- [Man] Okay, thank you.
- So I wasn't trying on purpose to make
the handcrafted state machine worse.
It was actually the fasting timing
I got was with 16 of them.
- [Man] Hi, I wanted to ask about the binary search.
I'll try to make it quick.
So in order to know what else you would need,
wouldn't you need to do some redundant work
and bring in some stuff you might, so,
to know which thing you're going to need next,
you need to resolve the current check.
So you need to sort of do work that you
never would actually need, right?
So as in prefetch stuff that you wouldn't really require.
If you would just do it, right, like this, linearly.
- Well this is exactly the sad situation,
because if you knew in advance that you need
that, that, that, that, and that one,
you can prefetch whole bunch of them.
But you can't, you first have to prefetch a value
before you know where to go.
That is what make this problem very sad problem
unless you do prefetching combined
with instruction stream interleaving.
- [Man] My point is that you're pulling more data
into the memory, evacuating more stuff out of the memory
than this thing, right, in some sense?
- In my testing I have been using the prefetch NTE,
which is trying to move the value directly
into L1 cache without touching L2 and L3.
So supposedly say if your database,
you have some real data you want to do work with
and index is just a way to get to the data.
So in this case I'm trying to be very frugal
and not evict stuff from big caches which have data,
and use prefetch NTE to avoid that.
- [Man] Can you put the slide back
with all the presentations?
- Say again, which one?
- [Man] The slide with all the presentation times.
You had a slide that says all the other talks.
The second to last.
- This one?
- [Man] Uh, nope, the last.
- The last one. - At the end. (laughs)
- Hold on, I think I know the faster way to get there.
- [Man] I wonder if to clarify that the question from before
You weren't optimizing a single search to be faster,
you were optimizing many searches
to happen in a less amount of time.
So it's, it wasn't the single search that he was prefetching
he was just interleaving many other searches.
- Oh, yeah, thank you, thank you for clarification.
Yeah, if you have to do only one search, nothing.
Nothing can help you.
And that is why database people are interested,
because they actually do multiple lookups.
- [Man] Would it be also possible to optimize
single search if you stop prefetching sooner,
because, well, every time you access memory
you are actually pulling a little bit
more data into cache, right?
So like let's say when you are making less steps
and bettering your search, and when you're observing like
last, I don't know, four values or something like that.
So you probably don't need to do it in separate steps.
- Can you speak slightly close to the microphone?
For some reason I'm not getting it here.
- [Man] Oh, sure.
My question was when you prefetch something from memory,
so you were, you know, CPU's not pulling single byte, right,
so it's pulling couple more bytes after that, then, to cache
So probably you could stop prefetching
for the last couple iterations of the binary search.
And by doing that, so probably you would spit up
little bit individual search.
I will put up the code on the GitHub,
and have fun parting on it.
See if you can make it faster!
Okay, thank you very much.