- And I guess we're gonna get started.
My name is John Lakos.
I have been at Bloomberg for now about my 16th year.
Prior to that, I worked at Bear Stearns, which is no more,
and that is where this notion of allocators
became a real thing.
And prior to that, I worked at Mentor Graphics,
where I used memory allocation for a long time
and actually talked about it in my 1996 book.
So we've been dealing with allocators,
we meaning me, for a very long time.
And this is not new to me.
But some of you may feel that this is new.
It's not new.
Also, people in the video games area
know this stuff better than I do,
and they use it all the time.
And it is not magic,
and it is not syntax or any of that stuff.
It is absolutely essential
in order for them to do their jobs.
And the way I came on it,
it was absolutely essential in order for me to do my job.
So if anybody doubts that this is real,
by the end of this talk,
there should be no question at all that it's real.
The only thing that you should be left with
is to make the decision, is it worth the effort?
The good news is, it actually is worth the effort,
and when we get to C++20 and 23,
the effort will go to zero.
So that's good news.
And then you'll start to worry about,
"Well, what about the performance cost?"
Don't even get me started.
Now, there's a lot of material in here.
Normally when I give a talk,
I have like five, six, seven, eight hundred,
nine hundred, maybe even almost 1,000 slides.
You'll be pleased to know I don't have more than 500 slides.
The bad news is they're animated,
which makes them like 5,000 slides.
We have two hours, though.
Hopefully I'll finish early.
I don't mind if you stop me to understand the material.
There's gonna be enough time at the end of the second talk
for you to grill me on any concerns that you have.
I look forward to it, I really do.
So any questions that you have about what I'm talking about,
please do interrupt, and I will try to remember,
Mr. Sign Man, to repeat the question
because I need to do that.
Does anybody have any pre questions
or anything about the rules,
which is, if you don't understand something, you stop me,
but if you wanna heckle me, wait 'till the end?
The first slide is there
because Bloomberg has a standard slide,
but if they thought, or anybody thought
or if it were even possible to go in
and change all my animated slides to these Bloomberg slides,
that was not happening.
So somebody just put it back.
This is the Bloomberg slide.
Everybody. (inhales deeply)
Okay, that's done.
And this is John's slide.
We're gonna use John's slides from now on.
All right, this is the Copyright.
It's there because it has to be there.
This is what we're talking about.
Are they important?
Are memory allocators really worth the trouble?
What situations can we use them in?
How can we apply them?
What's the performance impact?
By the way, performance is extremely important
when we talk about allocators,
but surprise, it's only one of three areas
where allocators are very, very important.
So we're gonna talk about performance today,
but if I were gonna argue the value of allocators,
I would have two other complete talks
about why they're important.
But today is just performance.
By the way, at the end of this,
you won't care about the other two.
So here's what we're gonna do.
I'm gonna talk a little bit about background,
because this is, if anybody saw Bjarne Stroustrup's talk
where you need to know about the layers,
you can't deal with the highest level abstraction,
you have to go down to the libraries
and perhaps even to the operating system
and perhaps even to the hardware, guess what.
This talk is gonna have
a little hardware in it, unfortunately,
so we have to do that.
But we shouldn't be embarrassed about that, right?
We should be proud of that.
We're gonna look at the hardware a little bit.
Even the hardware that I'm gonna show you is an abstraction,
because the idea, it's the gist of it.
It doesn't matter.
If we get too precise,
then we have to look at different processors.
We don't want to look at different processors.
We want the gist, because it is a very rich, robust benefit
that we're gonna get, so we're gonna keep it
above individual architectures.
We're gonna talk about a general architecture,
although we will run benchmarks on a specific architecture.
We have run them and they're published on GitHub
for many different platforms, different architectures,
The results are more or less the same.
You couldn't possibly care.
Your eyes would glaze over
if I showed you all of the data we have,
so I'm not going to.
Let me show you some stuff.
So, why do we like C++?
Well, a couple of things.
It enables us to fine tune at a low level,
and it can deliver
a very high runtime performance when needed.
That's why we use it, right?
Otherwise, we'd use something else.
Why should we care about memory allocators?
Well, they enable us
to fine tune at a low level when needed,
and they can help improve runtime performance.
So it seems like allocators
might be a good match for C++, right?
Seems like it, right?
We want high performance.
But no, we can't be bothered with the platform,
so just forget about allocators.
Doesn't work that way.
It really doesn't work that way
'cause there are people who will not use
the standard library without allocators because they can't.
I was in such a position a long time ago,
and I didn't use the standard library.
What are the benefits?
Not all memory is alike.
Different kinds of memory.
That's one of the benefits.
You can deal with different kinds of memory.
You don't have to say all memory is alike.
Another one is we can do testing.
We can do debugging.
We can do measuring.
We can do profiling that we couldn't otherwise do.
If you talk to somebody who runs a large system,
the profiling is the most important thing,
not the performance, but the profiling,
because when something goes bad
and you need to figure out what's wrong,
this is how you do it.
When memory is getting sucked in somewhere, what do you do?
Really, honestly, this is how you do it.
Oh, yes, runtime performance.
That's very important to some people.
Not everybody, maybe not even half the people.
And given that allocators today require some effort,
and we'll see that it's not as bad as people think,
but that's not the purpose of the talk.
Purpose of the talk is simply to explain
that they actually are worthwhile.
The next step is to figure out how to deliver them
at the least possible pain point.
C++11 is the most possible pain point.
C++98 is they don't work.
Bear Stearns, circa 1997.
I, taking credit, invented
what is now the C++17 memory model.
Make no mistake, it doesn't look like what I did,
because what I did was so obvious and simple
that anybody could understand it,
and what we have in C++17
is a vast improvement over C++11,
which was a necessary painful step
because people had fear, uncertainty, and doubt
about polymorphic memory resources.
No, it had to be part of the type,
'cause that's all we know.
You know, fear of the unknown.
By the way, fear, uncertainty, and doubt, that's FUD.
There is a lot of FUD going around.
We need to debunk the FUD.
That's what the fun questions are gonna be
at the end of the talk.
You're gonna say, "But it isn't it (mumbles) like this?"
All right, so we're gonna see that, but it'll become clear.
So back then, we had coalescing allocators,
and so somebody came to me and said,
"I have a model, and I build up this model.
"And I use it.
"Then, when I go to destroy it,
"it takes 10 seconds to reclaim the memory."
Can you imagine 10 seconds
to reclaim the memory for a model?
So it turns out that because we had coalescing allocators,
the benchmarks were all based on
how fast we could allocate the memory,
but they didn't care because you'd just leak the memory
and it goes away, and who cares?
So to actually reclaim the memory was the big bottleneck.
I gave them a local allocator using my model from 1997,
and instead of taking 10 seconds,
you couldn't see it on Purify because it didn't exist.
That's the kind of improvements we're talking about,
from 10 seconds to zero.
Another one, Bloomberg.
Did I go a little fast here?
We have a way of doing swapping to disk.
The way we do it is we have
something called finch memory.
And so we have memory mapped IO.
We store things that we wanna save in a process to disk.
And because all of our processes are the same,
we can read them into any other like process on the machine,
and we're reading in binary,
but it can just come back the way it was.
You can't do that without allocators,
and you can't use the standard library
unless you can provide a finch memory allocator
to take and put your memory, right?
So we need to do that.
That's the placement aspect.
And then, in 2006, people realized
that you can use allocators on the front end
to make the user interface zippier.
That's something people observed.
And people who were not at all happy
with the way I did things at Bloomberg said,
"That's the best thing you've ever done."
I'm like, "Okay, thank you.
"I'll take it."
So what are the common arguments against?
I'm sure I don't need to give you arguments,
'cause I'm sure you all have arguments.
"Oh, allocators, not worth it."
Well, they require more up front design.
Well, software engineering is about up front design.
There should be a payoff for it, of course.
Complicates the user interface.
How are you designing it?
These are all fair concerns.
May degrade performance.
If I try to use an allocator here
but I don't really need it, it's gonna slow me down.
I'm sure you've heard that.
Suppose we don't need an allocator.
We don't need a special egg.
Why should I worry about it?
And that's most of the time, right?
'Cause when you're using C++,
performance isn't typically important.
It's only sometimes.
So why should I worry about it?
Maybe I choose the wrong allocator
and blow the whole thing up.
Do we have to actually know how to program,
or can anybody write C++?
I don't know.
Anyway, so these are all valid concerns,
and we can deal with them
only by well supported facts and careful measurement.
Does that sound reasonable,
as opposed to fear, uncertainty, and doubt?
That ends here.
There is no doubt, none.
You're going to be shocked at how little doubt there is.
We're gonna talk about some background material,
like main memory.
What does main memory look like?
Well, we have the CPU and we have main memory.
Everybody knows this.
This is three levels down the stack where we have hardware.
I want you to appreciate
the amount of time I put into drawing that.
You see the two gradients there?
That's the wear from that wheel turning.
Now, I'm gonna start it up.
First, we're gonna do some memory.
We're gonna put some memory there.
Did you see it turn?
And then we're gonna put some more memory.
Oh, it turned again.
I hope you're enjoying this,
because I got very frustrated
with how much time it took to do this.
So I kept it up for just long enough
for you to know that I can do it, and that's it.
Applause are welcome.
Then, I'm going to do something that is unthinkable.
I'm gonna add a cache.
That will never happen again, that thing over there.
We're done with that.
That's staying fixed.
And so now we're gonna add something,
and you'll notice that what happens is we get a cache line.
And everybody has heard the term cache line.
For those few of you that have never heard of a cache line,
it's a chunk of memory, contiguous memory,
that gets pulled into the cache all at once.
And so the next time you ask
for something on the cache line, you already have that.
So watch what happens.
That's one cache line.
Here's something from another place.
That's another cache line.
The cache can take something
from anywhere in memory and put it there,
so now you have these two things.
But now, when we add something else up there,
it's instantly in the cache,
'cause it was already there.
And so on.
So we just keep doing this.
And eventually, we might write to the cache like that.
And then we have to save it.
Now, there are all kinds of algorithms and everything.
Please, let's not worry about it.
The idea is that has to be put back to main memory,
and then we'll just get rid of it from the cache.
And now that space is available for something else.
And so we get a cache line, and it goes there,
and that's how a cache works.
And there are many levels of cache, and we don't care.
Then there are things at a higher level, like paging.
So when you have so many pages in your real memory,
then you have to page out.
And that's kinda like a cache,
because it goes to a slower memory.
So we have all these different levels,
from registers to L1 to L2 to L3 to virtual memory.
All of that, right?
What you need to know is,
things that are accessed frequently in some block of time
need to be close together in memory.
If you can deal with that concept, that's locality.
Not having locality is bad.
Can we all just accept that?
Because we should have known that for the last 50 years.
Ah, so now we're here.
We have an executable...
Did I jump over something?
Oh yeah, main memory segments.
This thing is going a little funny.
All right, anyway, this is what happens.
We load in a program,
and it becomes an executable program,
and then we have stack memory,
which grows down typically, let's say,
and dynamic memory, which grows up.
And it can be in the other direction.
It doesn't matter.
But these are the two ways we get memory
other than static memory,
and static memory, of course,
lives in the executable program itself.
There are all kinds of different static memories
that are more efficient and less efficient,
but it's not important.
So what's a memory allocator?
So, a memory allocator is something that, well,
allows us to control memory.
With C language memory allocation utilities, for example,
we have malloc and free.
And malloc and free work on dynamic memory.
Now, we have another one some very smart people
who really cared about performance back in the day
decided that this thing called alloca was important.
What that does is it gives you memory
on the hot program stack.
And that's really good if you know what you're doing.
If you don't know what you're doing,
your program will crash and burn.
That works on bits of memory in the stack.
How am I going with speed?
Everybody with me so far?
I don't wanna lose you.
So, this is a first cut at a memory allocator.
A memory allocator organizes a region of computer memory,
dispensing and reclaiming authorized access
to suitable sub-regions on demand.
Might not be contiguous regions.
But your first thought is,
it's a block of memory, and I'm gonna get stuff out of it.
So we have general purpose allocators
and special purpose allocators.
And you can imagine
a general purpose allocator works for everybody,
and it does a good job all the time.
And a special purpose allocator
doesn't work for everybody,
but it works for some people real, real well,
and so you say,
"I know more than the general purpose allocator.
"I'm gonna do this."
And if you get it wrong, the whole thing blows up.
So if you don't know what you're doing,
don't use a special purpose allocator.
But if you do, you can beat the compiler, hands down.
Then we have a global allocator and a local allocator.
A global allocator is good for the duration of the process,
and a local allocator isn't necessarily good
for the duration of the process.
The global allocator is accessible from everywhere,
and it doesn't need any reference or memory
because it's global.
It would just go there.
Local allocator, you have to provide a handle
so that it knows what's going on.
And the local allocator doesn't talk about all of memory,
but a subset of memory.
And it doesn't talk about
the entire duration of the process,
but only a subset of the duration of the process,
so it's a little time space box.
You can use this memory right now,
just this memory and just for now.
That's a local allocator.
So in C, we have malloc,
and in C++, we have new and delete.
But those operators are not allocators.
They are ways we access allocators, right?
They're not allocators.
So now we have global and local allocators
and general and special purpose allocators.
Look, we have some things in the box.
And you notice that malloc and gree and new and delete
are global, general purpose allocators.
But they're not really.
They're just specs for general purpose allocators.
Now, tcmalloc and jemalloc are not specs.
They're actually allocators.
They're general purpose allocators.
And some might say, "Well, if we have those,
"we don't need local allocators."
Some are wrong.
Then, we have special purpose allocators,
and they could be global.
They could be globally accessible,
but that would be pretty dangerous,
but not necessarily.
If you're in a program that's single threaded,
do you really need an allocator that does synchronization,
or can you just say, "I don't need that,
"because I don't have multiple threads.
"So why do I need synchronization?"
So that might be a reasonable improvement,
and it turns out it is.
You don't need synchronization.
Why on Earth would you pay for it?
Again, if you know
that you're about to do something very intensive,
"I'm gonna go over here
"and I'm gonna work on this for a while
"and then I'm gonna stop,"
local allocator is your friend, because all the memory,
all the stuff that you're doing is right there.
That means the number
of cache lines that are in play are much smaller,
which means the number of hits you're gonna get in cache
is much greater.
And it applies the same to pages and memory
and every level of cache up and down the hierarchy.
It doesn't really matter why you're getting it.
The key is locality is king.
So if you're gonna access something in rapid succession
and you can do all of it in a boxed region, you win.
Otherwise, you lose.
Does anybody have any questions on this?
Going too fast?
Locality is good.
That's a fact.
A memory allocator is a stateful utility or mechanism
that organizes a region of computer memory,
dispensing and reclaiming authorized access
to suitable sub-regions on demand.
So here's a local allocator.
And the computer knows how to access that memory.
The allocator itself has some logic in it,
but it's tied to memory.
Now, some people seem to think you can copy an allocator,
and if you think you can copy an allocator,
I got news for you.
An allocator is, the important part
is that it's a region of memory.
You can't copy it because it's physical.
It's not a value.
So we know about value types.
They represent values,
and I have talks on that out the whatever.
But this is hardware,
and all we're talking about is a pointer.
Every allocator that you ever deal with
is a pointer to some region of memory
and some logic that controls it.
The logic lives in the object,
but the memory lives in memory, and that's a region.
So here's a local allocator,
and you notice it has a default constructor,
like any value type, but it's not a value type,
so don't get confused.
And what it probably has,
it's not a default constructor, actually.
It takes the region of memory that it's gonna operate on,
and then the rest of it is the logic
to allocate and deallocate.
And then, this poor clicker was doing things too quickly.
To allocate and deallocate,
and those are the operations that we normally have.
Now, I know there's alignment
and all of these wonderful, good things
that we could talk about.
We're not gonna talk about them today
because they don't fit on slides.
So we're just keep it real simple,
because we can understand that,
and then we can extrapolate to the reality
which is just a little bit more complex.
But what's nice about local allocators
that's not true of global allocators
is we have this third thing.
This third thing is called release.
So while new and delete
don't have any notion of release, nor should they,
a local allocator, you can say to the local allocator,
"You know on that memory you were managing?
"Forget about it.
"Just give it back to wherever it came from.
"You don't even care what the objects think.
"It's not their call.
"Give it back."
When you're in that realm,
now you're playing with the big people,
and you have to know what you're doing.
But if you can save
significant time in a video game by doing it, then why not?
- [Burke] What is the purpose of this release?
Shouldn't you put that in the structure?
- Okay, so the question is, I will get to this more,
what's the purpose of release?
So we have an object.
Call it your, what's your name?
- [Burke] Burke.
We have Burke.
Burke allocates memory.
We have an allocator.
Burke takes an allocator construction.
When Burke needs memory, Burke says, "Allocate memory."
Now you're full of memory.
Now, here's John.
"Burke, you're done."
I didn't call your destructor.
You're just gone.
Now, you're probably saying, "Is that okay?"
We'll get to that.
It doesn't feel good
to just be dismissed like that, does it?
It's horrible, but it's very quick.
You're gone, all of you, in one shot, boom.
Sorry about that.
So a memory allocator is the client facing interface
for stateful utility or mechanism
that organizes a region of computer memory,
dispensing and reclaiming authorized access
to suitable sub-regions on demand.
I'm trying to build up an idea.
We can supply allocators in multiple ways.
We could supply them as stateful utility functions,
like malloc and free.
But that doesn't really support allocator objects.
We can't really think of an allocator as an object.
It's a pair of functions.
It's not, God forbid, object oriented.
Then, we could think of it
as a reference wrapper in a template context
where we're gonna pass it into, as a template parameter,
and now the type knows what allocator it uses.
This is the C++98 model.
This is also the C++11 model.
What's good about that?
Well, the concrete allocator derived type
is available to the client's compiler.
That might be good.
Unfortunately, it forces the client
to be a template in order to use it.
That could be bad.
It affects the type of the object,
which means it's not a convenient vocabulary type,
because a thing and a thing with an allocator
aren't the same C++ type.
I remember where I was in my house.
It's a very small room on the first floor,
and I was thinking, and I said,
"That's a problem with templates.
"They need an allocator in order,
"and then the type changes.
"And then what are you gonna do?
"And how do you take a function?
"Oh, wait, the function could be a template.
"But wait, then everything has to be a template.
"Wait a minute."
2004, that was not cool.
It's still not cool.
Or we could pass the address of an abstract base class.
What's good about that?
We can actually refer
to a polymorphic allocator with a handle,
and it doesn't effect the type of the object.
The bad news is, allocator must be accessed
via its virtual function interface.
And people say, "Oh, can't afford that."
How much does that cost?
Does it cost anything?
Has anybody ever measured it?
- [Man In Audience] Yes.
- Yes, and?
- [Man In Audience] 20 cycles.
- It takes 20 seconds?
- [Man In Audience] 20 cycles.
- 20 cycles, on your machine with your compiler?
- [Man In Audience] It's average.
- Averaged over?
- [Man In Audience] (mumbles)
Strictly speaking, it's like from five to--
- Okay, so let me just cut to the chase.
So you have platforms.
You did this measurement last week?
I wanna point something out.
We did our measurements about a year and a half ago.
Some platforms there is a performance cost,
and some there aren't.
And the ones where there are
haven't caught up to the ones where there aren't yet,
and as soon as C++17 adopted this,
there's now pressure for people to catch up.
I'll explain why that can be done,
but don't assume that there's a performance cost.
The other one is the object now, unlike the C++11 model,
whether it needs it or not is gonna somehow have to
tuck away an address.
And if we assume that most things
don't need their own custom allocator,
now we're paying a machine address size,
let's say 64 bits, in every object that allocates memory,
not necessarily in the footprint
but in every object that actually allocates,
not one that could allocate,
but in every object that actually allocates memory,
we've gotta find space for a machine word size.
Some people will say, "Well, that's out of the question.
"We can't use allocators,
"'cause that's gonna slow the whole thing down.
"It's gonna make it unworkable."
Have you measured it?
So that's the introduction.
Does anybody have any questions on this?
At least we know some things we're concerned about.
Maybe this is a bad idea.
If it were a bad idea,
would I really be standing up here talking to you?
So now we're gonna understand the problem.
So here's the problem.
I wanna know if I should be using an allocator or not.
So, should I supply an allocator?
Darn, this clicker.
If the answer is yes, I'm done.
I mean no, I'm done.
I don't need to supply an allocator.
But if I do need to supply an allocator,
then should I supply it via a base class
or should I bake it into the type of the class C++11 style
as opposed to C++17 style?
So, whether I do or not is a syntax issue.
The next question is,
which allocator should I use, A, B, or C?
Whichever I decide, I have to ask yet another question.
Do I want to bother
destroying the individual objects and sub-objects
that are fetched from the allocator,
or do I just want to make them go away
in one shot in zero time?
I have to decide that.
So that's where I am.
This is our decision tree.
And I'm gonna see if I can learn how to do this better.
That's our decision tree.
So we're gonna go through this process.
The thing is, what's gonna guide us through this process?
So, we have a tool chest of strategies,
and the first strategy is
we're gonna use the global allocator.
The next strategy, or that strategy has two ways.
We can use the global allocator as a type parameter
or we can create a wrapper pointer
to what's called the new delete allocator, right?
So I can have an object that all it does
is invoke new and delete, right?
So those are the two possibilities there.
Then, I have three possibilities.
I can use what's called a monotonic allocator,
this is a local allocator, a multipool allocator,
or a combination of a multipool and a monotonic allocator,
so now I've chain them together.
So what the third one is is it says
from new and delete, or whatever is supplied to it,
I'm gonna create a big block of memory.
I've got that.
To that, I am going to affix a pooling allocator,
and it's going to pool from that chunk.
Whereas just having the multipool allocator in theory,
every time you needed a new chunk,
you would have to go and get it from whatever,
but when you gave it back,
it would keep it in its free list.
That's not how we do it because we know better,
so we're not gonna do that kind of thing.
We actually build into our multipool some buffering
that makes it ridiculously efficient
to the point where nothing can compete.
But that's to one side.
So I'm trying to give you the slip in the truth
while I'm telling you the story
so you can understand what's going on.
The combination of the two has applications,
and either one individually has applications.
We just need to know when to use what.
And then, again, we can pass it as a type parameter,
in which case we're baking the allocator type
into the container type
or whatever type needs to allocate memory.
Or we can use an abstract base class
and pass it in on construction,
and now it doesn't affect the type at all.
And then the third one,
because it's a local allocator,
we can either destroy it normally or we can wink it out.
Winking it out means it had nothing to say about itself.
It's just gone.
So, that gives us 14 allocation strategies.
And that is ample for this talk,
and it's ample for anything you probably ever wanna do
when it comes to performance.
But it's not ample for anything you'd wanna do,
because you might wanna do something else,
like place memory somewhere
or you might wanna instrument things, whatever.
But as far as just performance,
you're really in good shape here.
You're gonna have to look really hard
to find a reason to do more than this.
How are we doing?
So this nice little chart shows you what we got.
We got 14 allocation strategies,
and then we have how we're going to affix it
to the container object, let's say.
It doesn't have to be a container, but we'll say container,
like a vector or a map or whatever.
But it could be anything that allocates memory.
And then we can decide
whether we're going to destroy it normally or wink it out.
So here are all the possibilities, nice little chart,
and so let's try to understand it.
So here's the normal way of doing business.
So we have an allocation strategy, either AS1 or AS2,
and the only difference is whether we bake it in
or we use an abstract base class pointer.
Take a look at this.
Here's an allocator.
There are no data members.
And this is the standard allocator.
You see allocate and deallocate.
Now, notice that I put them right in line right there.
That's what they do.
This is not a very complex class.
This is the standard allocator.
It's also the global default.
And we're done.
That's all we're talking about.
Then the next one is, let's see.
I wanna use it, I just say that.
And by the way, if I were to supply the allocator,
the underlying machine code would be the same.
So if I don't supply an allocator,
we get the default allocator.
If I supply the default allocator,
we get the default allocator.
So these have the same code.
We're not gonna make a distinction.
That's a type parameter.
Remember, the type parameter changes the type,
so STD vector with allocator is a different beast
than just plain old STD vector.
With a non-default allocator, I'm sorry.
That's the same.
Now, the next one is,
here we have our benchmark.
We're gonna create a vector,
and we're going to populate it like this.
We're gonna do our benchmark on the vector.
And then we're gonna destroy it like this.
That's the old way.
Now, just to prove that I know how to do this,
I'm gonna use auto.
And I can go back and use auto again.
The purpose of this talk, of course,
is not to talk about C++11, 14, or 17,
but really to talk about memory allocation,
so I'm gonna try to keep it very simple
for those people that might not be up to speed
on all of the things.
It's not important.
Next, we have AS2, which is the same thing,
but now I'm going to have it be an abstract base class,
so this is the protocol.
Here's my allocate and deallocate, pure virtual functions.
And you can see, this is a concrete derived type.
What does it do?
Now, I want you to look at this.
Notice that I have a bug here in the slide.
That's on purpose.
The bug is what?
What's the bug here?
That's the bug.
I put the inline there because I'm really serious
that this is a derived class
that has inline virtual functions.
The bug is I don't need inline
because it's gonna be inline anyway
because you can see that the code is inline.
I am not kidding.
I have an abstract base class.
I have a derived class.
The derived class has virtual inline functions.
Does anybody have a problem with that?
- [Man In Audience] (mumbles) compilers--
- [Man In Audience] Some compilers,
if you have pedantic warnings enabled--
- With the inline?
- [Man In Audience] I think so.
- The inline won't compile.
Repeat the question.
So the question is, if I have the inline here,
will it create a warning?
No, it'll create an error.
It's not syntactically valid.
I put it there to remind you that I'm not joking.
It is inline.
Everybody understands, these are deliberately inline,
not just slideware.
They're inline for real.
There's a reason for that, because we want the compiler
to be able to take advantage of it,
and I am not talking about full program optimization.
I'm talking about what the compiler can see today
in your typical compiler scenario.
- [Man in Audience] So it doesn't compile?
- This doesn't compile because the word inline is there.
So when it doesn't compile, you delete the word inline
and then it will compile.
Okay, see this word inline?
You see how right here there's an inline definition?
- [Man in Audience] Okay.
- Okay, all right.
So now, this is what we do in this world.
We create an allocator,
and then we pass in the address of the allocator
to the PMR vector, 'cause that's how it works
in the polymorphic version.
That's the C++17 version.
So instead of baking it into the type,
you pass in the address as an optional argument,
and life is good.
So I'm claiming the virtual functions
can sometimes be bound to compile type,
and the really cool thing is that
in every case where that's important, they can.
And if a compiler doesn't do it a year and a half ago,
it probably does now, and if it doesn't do it now,
as soon as C++17 is out and about
and the game people get ahold of them, they will.
For example, GCC does but Clang doesn't,
didn't, as of a year and a half ago.
So, Clang, if you're not doing it already,
you have some work to do.
So we have normal destruction,
because what else can we have?
It's not a local allocator.
So what does normal destruction look like?
We create the thing.
We go through and we build it up.
You see how we're doing that.
We do our benchmark.
And we destroy it, and this is normal destruction,
'cause that's all we can do with a global allocator.
We can't do that winking stuff with a global allocator
because it's got all the memory
that if we winked it we'd have nothing.
Might as well end the program.
I'm going slowly.
Is this too slow?
Do I need to pick it up a little bit?
I just wanna know.
A little bit faster?
Okay, I see somebody say a little bit faster.
Anybody think that's gonna cause,
are we okay if I go a little faster?
- [Man in Audience] Yeah.
- All right.
I just don't wanna lose you.
It's a little bit of stuff here.
All right, monotonic allocators, what are they?
Well, they're this beast.
A monotonic allocator is intended to be used, typically,
with a buffer on the program stack.
Here we have an example.
I'm calling it bdlma::BufferedSequentialAllocator
This is what we have.
This was pre standardization of 17.
We've had this forever.
This is a really useful thing.
We create a buffer on the stack of 1,024.
There it is.
this is the buffer.
Now, people know about natural alignment maybe.
So if I wanna put a char in here, what happens?
It goes there.
Now I wanna put another char.
Where does it go?
It goes there.
Then, another char.
Where does it go?
Now I wanna put a short, and it goes there.
Now I wanna put an int, and it goes there.
Do you see?
It has to be aligned on its address
that's divisible by its size.
- [Man in Audience] Sorry.
Doesn't the standard require alignment--
- [John] I can't hear you.
- [Man in Audience] Doesn't the standard require
alignment on the largest possible, size which is (mumbles)
- Okay, you asked a question,
does the standard require alignment?
Maybe you're talking about allocators,
but I'm talking about a buffer that's raw memory,
and I can put it wherever I want it.
So what's gonna happen in this particular thing,
it's gonna do natural alignment 'cause I said so.
And it works.
That's what this is gonna do.
You could do it the way you just described,
which is to waste a lot of space,
but I'm not gonna do that, 'cause it's just me.
So anyway, so then you can put the short here.
Then you can put the char here.
But if I get another int, it has to go here.
I'm giving you an idea
of what the most efficient thing is.
You can back it off to your satisfaction,
whatever you wanna do.
That's the most efficient.
We do things that work.
Now, note that if you try to deallocate something,
it's a no op.
Here we have, we can do it as a type parameter.
We know what that means.
We can do it as normal destruction,
so what does that look like?
Normal destruction, I'm saying we have this wrapper thing,
it's this thing that holds a pointer to the thing.
It's just a way to build it into the type.
It's just really, it's an object that's a pointer.
And then we have this buffered sequential allocator.
It's a monotonic allocator.
We can supply stuff on the stack,
we can do whatever we want,
but right now, that's just gonna do,
it's gonna not allocate memory at all.
It's not gonna point to a buffer.
It's gonna go to that dynamic thing
because of the way I did it.
But we could also say, "Here, please use this to start."
So we have some flexibility there,
and we're gonna go through, and we're gonna allocate,
and we're gonna do the new separately.
I'm doing them separately
'cause I want you to see the two operations.
First thing we do is we allocate the memory.
Second thing is we construct
into the memory we just allocated.
Then we come down here to do normal destruction,
and you notice the first thing we do is destroy it.
We destroy it, and then we deallocate it.
So allocate, construct, deallocate, destroy.
That's the normal pattern.
That's the normal way.
Then there's the magically winked way.
So this time, when A goes out of scope,
we just destroy everything,
and then when A goes out scope, the memory is reclaimed,
so we don't have to do that.
But the crazy thing is that we don't need to do any of that.
It's legal C++ to just let those things go away.
And you say, "No, it's not."
And I'm gonna say, "Yes, it is."
And then you're gonna go look it up
and you're gonna go,
"Well, okay, it's legal, but it really," yeah.
So normal destruction and magically winked are the same,
and now we have multipool.
Now, multipool is a completely different beast.
And it's backed by the global allocator.
So what is a multipool?
It's got a bunch of different dynamic pools.
It's like, I call it a beanstalk,
but they're all these different,
I'm gonna call them dynamic pools,
but they're designed for different sizes.
So now ...
What happened there?
That's not right.
Let me try this again.
This clicker is so sad.
So this is what happens.
So during the course of,
you saw with the end picture,
but this is what's gonna happen.
Over the course of running this thing,
we're gonna keep allocating,
and you'll notice the size,
the chunk size that we get each time is gonna double.
And you see how we're using up the chunk size
'cause they're two iterators,
and it's just going until it hits the end.
And then when it hits the end,
we just create another thing that's twice as big
and so on like that.
Now, here we have it saturating,
but in real life it wouldn't saturate
because it turns out with the two iterators
there's no reason not to keep doubling forever.
Well, when you just populated a free list
and went in and loaded the free list,
there needed to be a saturation point.
And I wanna thank Pablo Halpern for pointing out to me
at C++Now about four years ago
that what I was doing here is brain dead,
and thank you very much, Pablo.
So it's now even faster.
It's not so much faster that you could really notice it,
but it's aesthetically faster, and it's better.
So anyway, if you exceed the maximum size,
then all of this stuff is gonna
come straight from the allocator, and that's very important,
because if it comes straight from the backing allocator,
there's no buffering, there's no pooling.
It's as if there's no allocator at all.
So if you have a pooling allocator
and you allocate a megabyte,
you don't have a pooling allocator.
You're just calling straight through.
And that's gonna turn out to be important.
So, we've talked about type parameter and abstract base,
and they're really the same.
And then we have the combination,
and the combination is simply a multipool
backed by a sequential or a monotonic backed by the global,
and it has, of course, you can do it by type parameter.
Wait a second.
I got a new clicker 'cause I lost my old one,
and I don't know to use this one properly.
Let me just come back here.
So we have this one.
we have type parameter,
and then we have magically winked.
Did I get that right?
So anyway, it's the same kind of thing,
rather than go through the whole thing.
It's just the same kind of pattern.
You've gotten all the information.
It's just I could repeat the same steps,
so I'm not gonna do that.
What did I miss here?
So we're gonna have basic parameters
to use to characterize our experiments.
And the two we're gonna use are
N for the number of instructions.
So a program is gonna be assigned a size
by the number of instructions it executes.
And again, this is very vague.
And it's also gonna be based on
the number of threads, worker threads.
Now, most of the experiments we're gonna do
don't involve multiple threads,
so that's just there sort of at the end.
So the next question is,
what aspects of software affect optimal allocation strategy?
And this is kind of like Family Feud.
So I've got these five things,
and the first letter is on the front,
and it'll turn out that these letters
are gonna form a mnemonic.
So can anybody think what those letters might mean?
I'm not gonna take a lot of time,
'cause you'll never guess.
Can you think of some?
Just try it.
What might matter?
Can anybody think of a dimension that might matter?
'Cause this is something we had to think about.
We wanted to think about,
what dimensions of a problem
would we come up with so that we could say
it's got a lot of that or a little of that,
and then based on that,
we could make some sort of recommendation?
- [Man in Audience] Fragmentation.
- Fragmentation, excellent.
- [Man in Audience] It's not up there.
- It's not up there, but we'll get to that.
- [Man in Audience] Duration.
- [Man in Audience] Duration,
how much time you're holding memory.
- Okay, duration.
It's gonna turn out that that's U.
So here we have density of allocation operations.
You can imagine a program
that does a lot of allocating and deallocating, right?
You could imagine a program that doesn't do that at all.
That's a dimension.
Another one is variation in allocated sizes.
You can imagine a program
that allocates one size and that's it,
or it allocates, every time it allocates a different size.
You could imagine locality of accessing memory
as being a dimension.
Is that important?
Is the usage pattern of your program
such that you could exploit locality or no?
How about utilization of allocated memory?
That means, do I at one shot
get all the memory I need, use it, and throw it away,
or do I get a little, give it back,
get a little, give it back, get a little, give it back?
That's more along the lines
of the way in which you use the memory.
Is it all at once, or is it piecemeal?
And then finally, just contention of concurrent allocations.
So these are five dimensions.
You might say, "Those are really stupid."
But if you don't know anything
and you have to come up with five dimensions,
at least I got five dimensions.
That took weeks.
There you go.
These are intended to be rough indications, not precise.
I wanna put them on a scale of zero to one
where zero is not much and,
this thing did it to me again.
So they fit into that thing.
Zero is minimal.
One is maximal.
They're far from independent.
So just because I have five
doesn't mean that they're independent variables.
They are not independent variables.
Now, it would've been nice
if I could've come up with an experiment
that I could vary each one independently
in just one experiment.
That would've been nice, but that didn't happen.
So anyway, density is
the number of allocation, deallocation ops
over the number of operations.
And so zero means we don't use an allocator,
and one means every instruction
is allocating or deallocating.
So, if you think about populating a vector with push back,
so what happens there?
Think about what happens.
If I just say push back, push back, push back on a vector
to a billion, does that have a high allocation
or a low allocation?
And I'm specifically saying vector of int.
Would you say that has a high allocation density
or a low allocation density,
where I push back a billion integers?
- [Man in Audience] It's relatively low.
- Relatively low.
What do you guys think, low?
- [Man in Audience] A little bit low.
- Yes, sir.
- [Man in Audience] Low.
- We're all low.
Right, because if I push a billion things on,
there are gonna be 30 allocations, let's say,
and then there are a billion other things,
so 30 over a billion, small.
Just giving you an idea.
Then variation in allocated sizes.
So everything is allocated the same.
And then everything is different.
Make sense? Okay.
So consider like STD set of int or STD string.
Just think about those guys.
Do they have variation or are they all the same?
If you have a set of int, every node is the same.
If you have strings,
it could vary all over the place, right?
The footprint of the string is the same,
but if it's a long enough string,
then it could vary to whatever it is.
Now locality is the most complex one.
Locality, you have to think about
the number of instructions in a sub-region over a duration,
and then you have the memory size of the sub-region,
and you have the number of transitions
out of that during this period.
And so if you look at those,
they form this little, first order equation.
I don't mean this to be really linear,
but to a first order or a zeroth order, this is the idea.
And then you can break into physical locality
and forget about the T.
You can break it into temporal locality
and forget about the M.
And again, I'm just trying to give you the idea
that the more instructions, the more locality,
and the smaller the memory, the more locality,
and the fewer times you jump out of that region,
the more locality.
I'm just giving you the rough way the thing fits together.
So, zero is, we've got a large and not accessed repeatedly.
That's what most programs are,
large and not accessed repeatedly.
I mean, large or not accessed repeatedly, right?
The region is the whole memory.
Or it's small, but it goes all over the place.
The other one is saying that the sub-region
is small and accessed repeatedly.
If I've got a small region,
not all of memory but it has a little bit of memory,
and I'm pounding on it, that's high locality.
Even if we don't allocate a lot in our program,
even if we just do a little bit of allocation
and then spend the next seven days accessing memory,
the performance improvement of local allocators
can be amazing.
I just want you to understand
that it's not about the speed of allocation.
That's another myth,
that, "Oh, I need to allocate really quickly."
No, you don't.
You need to allocate well.
Utilization, the max memory in use
divided by the total memory allocated.
If I allocate a byte and free it
for the entire duration of the program,
my utilization is very low.
If I allocate everything up front,
use it, and blow it away, it's very high.
Does that make sense?
Finally, again, consider vector of int
and assume that we allocate the capacity up front.
If I allocate the capacity of a vector up front,
I got one piece.
And whatever I do to it, nothing else is getting allocated.
It's done, and then I free it.
So that's everything up front.
Whereas a vector of large strings,
if I were to push and pop and push and pop and push and pop,
I wouldn't be getting all the memory up front.
I'd be getting some memory.
Then I push and I pop and then I get some more memory,
so the amount of memory that I've allocated is much larger
than the amount of memory I have at any given point in time.
And finally, there's contention.
And contention really doesn't fit this,
but you like to try to fit things into something,
so this is the fifth category.
If I have no contention at all, then it's zero.
And if every thread
is constantly trying to access the thing,
or I should say is trying to allocate or deallocate memory,
then I have maximal contention.
So we have this summary here.
And if you notice, the mnemonic is going to be
DVLUCK the duck.
So I hope you remember it.
Density, variation, locality, utilization, and contention.
I'll make it even easier later.
Okay, so remember that.
So we're now at the next step, analyzing the benchmark data.
And of course, this is what we were all waiting for,
but we're almost out of time.
I'm going to start,
because I wanna make sure I have time for questions,
but does anybody have a question right now?
As long as it's not heckling.
I'm saving the heckling for the end, 'cause I'm scared.
- [Man in Audience] With monotonic allocators,
how do you decide the size?
- With allocators, how do I decide?
- [Man in Audience] The size of the monotonic allocator?
- The question is, with the monotonic allocator,
how do I decide how big to make it?
- [Man in Audience] On the stack.
- On the stack.
The way you decide is you figure out
how big it needs to be, and you make it that big.
- [Man in Audience] If you use pragma pack
would that also affect your allocators
if you're using the previous allocator,
the (mumbles) allocator where you're--
- Okay, the question is something about pragma--
- [Man in Audience] Pack.
I don't think I even know what that is.
What is that?
- [Man in Audience] Like naturally,
would you align them back to back?
- So the question is,
if I use this thing that I've never heard of,
will it make things--
(man in audience mumbles)
I will tell you now, I have never heard of this thing,
but if the question is,
do I need a compiler pragma to do something,
I think what you might be asking is,
I think I might I know,
I think I'm gonna figure it out with the question.
So let me say what I think the question is.
If I turn off alignment in the compiler
so that everything just mushes together
and I don't have to worry about any holes, is that right?
- [Man in Audience] Yes.
Why didn't you say so?
The answer is your program will run like a dog with one leg,
because that's not how computers work.
That's how compilers work,
but that's not how computers work.
So we don't do that.
Don't do that.
Do not turn off alignment so that you get this packing.
- [Man in Audience] I'm sorry.
Just to clarify, if you are running x86 or x64,
its performance impact is very, very small.
We just heard from somebody who knows.
On an x86,
the performance impact is very small,
but not zero?
- [Man in Audience] On anything else,
it is completely (mumbles)
So what we heard is it's either not as good or horrific.
I'll take that.
That's good to know, I guess.
- [Man in Audience] So when you're measuring the locality,
how do you determine the size of the memory region?
'Cause you said N was like the size of the memory region.
- Okay, so the question is, how do you measure locality?
And the answer is you don't.
The idea is,
this is a model for a human being to think about,
what are the trade offs and so on?
The idea is you measure various system sizes,
and you say, "The system size is this big.
"What happens if this big and so on?"
And you compare, and it's like an accordion.
You say, "I got one system that's big,
"two systems that are like this,
"four systems that are like this.
"They're all the same size, but they're segmented."
And you look and see what the surface looks like.
And guess what we're gonna do?
So that's happening.
I want you to know that this talk
did not happen in a couple of days.
This talk is serious.
We have about three minutes left.
Maybe, I don't know if I should start,
or if I should just break now
and we'll come back maybe three minutes earlier.
Let's start on time,
'cause we literally have two minutes or something left.
Does anybody have anything else they wanna ask?
- [Man in Audience] Is there anything
about the monotonic allocator that says
its buffer has to be on the stack?
- Question is, does the monotonic allocator
say anything about where the buffer comes from?
The answer is no.
You can specify whatever you want.
Note to self, stack is hot.
- [Man in Audience] Is there a such thing
as a backup strategy for a monotonic allocator?
- Is there a backup strategy for a monotonic allocator?
Typically, it is the backup strategy,
but if you wanted to put something that measured
how much memory was going through,
like a monitoring allocator, you can chain them.
We are gonna start sharp in 15 minutes,
'cause I'm going to get through the rest of this.
The good stuff is now about to happen,
some crazy good stuff.
So go up, but don't be late.