All right.
Hello, everyone.
And welcome to Fantastic Algorithms and Where to Find Them.
An intern once asked me, "Nicolas, how often
"do you write algorithms?"
And the answer is rarely.
Algorithms are tools.
Hammers, and levels, and mops.
And what makes a collection of tools impressive
is not how many of them you made yourself,
but rather, how many unique problems
you can use your toolbox to solve.
I hope that in today's talk, as we go through
a couple fantastic algorithms,
that there will be some tools here rare enough
to be novel, but useful enough to be found in practice.
So a little bit about me before we dive in,
I am an Infrastructure Engineer at Facebook.
I work in infra for two reasons.
First of all, a lot of infra is services.
Services are often written in C++,
and I happen to like that language.
The second reason that I like working in infra
is that my clients are fellow engineers.
If I've got a problem, something's going wrong,
if I need to learn something about the use cases,
I can just go and talk to another technical person.
This leads straight into our first algorithm.
Heavy Hitters.
Despite my co-workers being engineers,
people with no malicious intentions...
I still get DoSed.
Maybe a new feature gets turned on,
maybe an upstream service gets DoSed which causes
my service to get DoSed, or in this particular story,
a command line interface used primarily by humans
is called programmatically for the first time.
In this particular graph, those two spikes
represent high latencies, specifically timeouts,
as a result of too many queries coming into my service.
Now, what I want to do, what I really would like
is to guarantee to all the people
who are using my service normally,
that they will have a bare minimum amount of traffic.
I do not want to starve any of my individual users
in the event that I am overloaded.
And so the ideal thing that I would like is I would like
to have a map of the people who use my service
to the number of times that they have made queries with me.
The problem is that keeping that map is very expensive,
it takes order n (O(n)) space.
Even if I were to simplify my request,
and instead just want to know the identity
of the single user who was making the most requests,
it would still take order n (O(n)) space
to be done in a single pass.
So fortunately, though, there is an algorithm
that exists, Heavy Hitters, which does almost what I want.
Heavy Hitters does not determine
who the most frequent user of my service is;
however, what it does do
is identify the users who single-handedly
account for more than x percent of total traffic.
For this special case, x equals 50%, some of you might
have heard this algorithm under some different names,
for x equals 50%, knowing which elements in an array
occur strictly more than half the time,
the algorithm's called Majority Element.
Also known as Boyer-Moore's Majority Vote algorithm.
And it works something like this.
Imagine, for a moment, that we are all at a conference,
and there's a lot of programmers
who are attending the conference,
and the programmers indicate by the color of their dress
what their favorite programming language is.
As the programmers enter the floor
the passion of the moment overcomes them,
and a furious fight breaks out.
Programmers who represent different languages lock eyes
and instantly do battle.
They run at each other and try to knock the other out.
Now, it happens to be that every single pair
of programmers is perfectly evenly matched,
and so when they run and punch each other,
they both land a blow on the other's chin simultaneously,
and both get knocked to the floor unconscious.
The conference chair enters the room
and notices that there is exactly one
programming language left standing.
Indeed, if there were multiple programming languages
left standing, then one programmer from each of two
distinct languages would lock eyes,
run at each other, do battle, get knocked out.
So there is exactly one language left.
And so the conference chair smiles,
'cause he had a question that he wanted
to have an answer to.
He wanted to know whether or not there existed
a programming language that enjoyed the support
of a strict majority of all attendees.
The conference chair's job is simpler now.
Instead of counting the number of attendees
who support every single programming language,
he merely needs to count the number of attendees
who support the programming language left standing.
The reason this works...
is because we see that pairs of people are knocked out
where each member of the pair represents
a different programming language.
If there's n people, there's at most
n/2 pairs of knocked out people.
And if there's a programming language that is
represented by more than n/2 people,
then it cannot be completely knocked out
by those at most n/2 pairs, and thus will be left over.
This conference room brawl translates
very cleanly into an algorithm.
The only difference is that you let the programmers
into the room one at a time.
Implement this as a counter-value pair,
or sorry, a value-counter pair.
You have a value, the current programming language
which is winning, and some rather arbitrary integer
representing the weight.
So the first programmer walks into the room
representing C++, there's no one else in the room,
C++ is now the winner, the weight is one.
A second programmer, a Java programmer,
walks into the room, locks eyes with the C++ programmer,
they do battle, they get knocked out,
the counter goes back to zero.
Third programmer walks into the room, C++.
C++ is the winner, counter is one.
Fourth programmer, again, C++ walks in,
C++ now has a value of two,
and the fifth and final programmer, a Perl programmer,
walks in, locks eyes with one of the C++ programmers,
they do battle, get knocked out.
The algorithm's done.
At the end of these five people, three of whom were C++,
majority vote does indeed say that C++ is the winner,
and there's a rather arbitrary value of one
left in the counter.
It is important after you do the first pass through
to find your estimate of who the winner is,
that you do a second pass through the array
to verify that the returned element
is indeed a majority element.
We can imagine a case where this does not hold.
Suppose there are a very large, odd number of pirates
each of whom is voting for who the next
captain of the ship is going to be.
1,001 pirates.
And I've chosen pirates, because pirates
are well known to always vote for themselves.
And so the first pirate walks in and he's the winner.
And the second pirate walks in, and they do battle,
and there's no winner.
Third pirate walks in, he's the winner.
Fourth pirate walks in, oh they disagree,
they do battle, no more winner.
And the first 500 pairs of pirates go in
and knock each other out until the last pirate,
the 1,001st pirate, walks in and,
"Hey, guys, I vote for Teddy."
Teddy's the winner!
No, he's not the winner, you have to do a second
pass through the array.
It is sufficient for a majority element
to end up in the value, but it's not necessary.
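In code, the whole thing, brawl plus verification pass, looks roughly like this. A minimal sketch; the int vote type and the function name are purely for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Boyer-Moore majority vote: one pass to find a candidate (the conference-room
// brawl), then a mandatory second pass to verify it really is a strict majority.
std::optional<int> majority_element(const std::vector<int>& votes) {
    int candidate = 0;
    std::size_t counter = 0;
    for (int v : votes) {
        if (counter == 0) {            // room is empty: the newcomer is the winner
            candidate = v;
            counter = 1;
        } else if (v == candidate) {   // same language: one more supporter standing
            ++counter;
        } else {                       // different language: two people knock each other out
            --counter;
        }
    }
    std::size_t occurrences = 0;       // second pass: count the candidate for real
    for (int v : votes) {
        if (v == candidate) ++occurrences;
    }
    if (2 * occurrences > votes.size()) return candidate;
    return std::nullopt;               // no strict majority (remember the pirates)
}
```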
This is the x equals 50% version of the algorithm.
It extends very cleanly and very naturally
to percentages other than 50%.
Specifically to percentages of the form 1/k
for positive integers k.
The only trick that you have to change
is that, inside the conference room,
which could previously accommodate only one
programming language, now allow there to be k minus one
distinct programming languages existing in harmony.
And it's only when a k-th programming language
walks in that the room becomes too crowded.
Say k equals five: when the fifth distinct language arrives,
that newcomer, plus one person from each
of the other four distinct programming languages,
does battle and leaves.
And so they do battle in groups of k,
therefore, no one who exists more than
1/k of the time can be completely knocked out.
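Written down, this 1/k version is what's usually called the Misra-Gries frequent-items algorithm. Here's a minimal sketch, with string user IDs purely for illustration; as with majority vote, you'd still verify the surviving candidates' real counts before throttling anyone.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Misra-Gries style Heavy Hitters: keep at most k-1 candidate/counter pairs.
// Any user who accounts for strictly more than 1/k of the stream is guaranteed
// to survive as a candidate.
class HeavyHitters {
public:
    explicit HeavyHitters(std::size_t k) : k_(k) {}

    void observe(const std::string& user) {
        auto it = counters_.find(user);
        if (it != counters_.end()) {
            ++it->second;                    // already in the room: one more supporter
        } else if (counters_.size() < k_ - 1) {
            counters_.emplace(user, 1);      // still room for a new language
        } else {
            // Room is full: the newcomer and one member of every candidate
            // group knock each other out.
            for (auto i = counters_.begin(); i != counters_.end();) {
                if (--i->second == 0) i = counters_.erase(i);
                else ++i;
            }
        }
    }

    const std::unordered_map<std::string, long long>& candidates() const {
        return counters_;
    }

private:
    std::size_t k_;
    std::unordered_map<std::string, long long> counters_;
};
```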
So coming back to the service that I own
that was getting DoSed.
I was upset that this was happening.
I went and I talked to that engineer, because I'm an adult.
But I also put up some code based on Heavy Hitters
to make sure that, if this happened again,
I'd be able to throttle the people who were
single-handedly accounting for large percentages
of my resources.
I had one small variant, though.
Heavy Hitters does not have a concept of time,
so I did Heavy Hitters in five minute intervals.
It was a long time ago, there was a little bit
of trickery going on.
I can't show you that code, though.
But some code that I can show you is this majority vote
implementation, found inside of Libgcc.
I was talking to Nathan Sidwell yesterday.
Nathan Sidwell, by the way, is going to give a talk
tomorrow on modules, implementing modules inside of G++,
which I would advise going to.
But I was telling him about Heavy Hitters,
and he was like, "Hmm, you know,
"this code sounds familiar to me."
And sure enough, Nathan found this Heavy Hitters
implementation inside of Gcov.
What this code is doing is it is trying to find
whether or not there is a majority argument
for profiling purposes.
During profiling, this extremely cheap,
extremely simple counter-value pair
is used to determine whether or not
there's a specific argument that occurs
the vast majority of the time.
There's one slight hiccup, though,
in the Gcov implementation.
Which is, typically, at the end of the Heavy Hitters
implementation, you have to do a second pass
to verify that the given argument does indeed
occur more than 50% of the time.
And Gcov can't do a second pass.
So, instead, Gcov has this interesting trick
where, at the very end, they check the counter
and they see if the counter is higher
than 50% of the total number of calls.
If this is true, then we know for sure
that that argument occurred at least 50% of the time.
However, an argument that occurred 51% of the time
is not guaranteed to make this cut.
What is guaranteed is that an argument
occurring at least 75% of the time will,
once the algorithm completes, have a value
in the counter of at least 50% of the total calls.
So this is a rather nice, nifty optimization
that Gcov uses to profile your code
so that your code can be optimized.
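The rough shape of that trick, as I understand it; this is a sketch, not Gcov's actual code, and the names here are made up.

```cpp
// One-pass majority vote with no verification pass: at the end, trust the
// candidate only if its counter exceeds half of the total count. Anything
// occurring at least 75% of the time is guaranteed to pass this check;
// values between 50% and 75% may or may not.
struct OnePassMajority {
    long long value = 0;     // current candidate (e.g. a profiled argument)
    long long counter = 0;   // net votes in favour of the candidate
    long long total = 0;     // total observations

    void observe(long long v) {
        ++total;
        if (counter == 0) { value = v; counter = 1; }
        else if (v == value) { ++counter; }
        else { --counter; }
    }

    bool is_confident_majority() const { return 2 * counter > total; }
};
```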
All right.
I've got another algorithm for you.
Who here was at CppCon last year?
That's actually fewer people than I thought,
that was like less than half.
Who here, regardless of whether or not
you've been at CppCon last year,
listened to Herb Sutter's talk Leak-Freedom in C++?
That looks like quite a few more hands.
Not everyone's hands, but more.
For those of you who haven't seen Herb's talk,
I highly advise, go and listen to it,
it's a great talk about how to avoid
leaking memory inside of your programs.
And one of the most important things
that Herb hammers home in his talk
is not to manage pointers yourself.
Do not use new and delete.
Instead, rely on unique ptrs and shared ptrs
when you actually have data that has to
outlive your stack frame.
And one of the examples that Herb had in his talk
was he had a tree implementation.
And instead of having a Tree* left
and a Tree* right, Herb instead had unique ptrs.
And this is very nice, because those unique ptrs
are going to be cleaned up successfully,
regardless of how your tree gets destroyed.
You could have an exception thrown in your constructor
and those child trees are going to be cleaned up,
and you're not going to have a leak.
There's a problem, though, Herb remarked.
The problem is that, in the wild,
not all trees are balanced.
And so if you happened to have an unbalanced tree,
in the worst case, just a straight line like a linked list,
what happens when you free the root node?
The root node's going to be freed,
but in order to do that, it first has to call
the destructor on its left child unique ptr.
So six calls the destructor of two,
two calls the destructor of one,
and so a lot of destructors get pushed onto your stack,
and you can smash your stack.
This isn't good.
And so, at the end of the presentation,
during Q&A, Eric Niebler went to the microphone and said,
"Herb, did you know that there exists a way
"to destroy a tree using order one space overhead?"
Before I go on, I've got a small little story
about Libgcc again for you.
I went and I checked Libgcc's implementation
to see whether or not they were doing a stack-smashing
destruction call, or if they were doing some fancy-schmancy
order one space destruction call.
And the answer is, that inside of Libgcc,
their red-black tree implementation
uses this type of destruction.
It just recursively calls the destructors,
pushes everything onto the stack, and comes back up.
And the reason that makes a lot of sense
is because the STL's red-black tree implementation
is guaranteed to be balanced.
So the height's going to be proportional to log(n),
you're not going to smash your stack with log(n) space.
But back to the fact that Herb has said
that there do exist cases when trees are imbalanced.
If you implement some not necessarily balanced tree,
you have to be careful about smashing your stack
in recursive contexts.
So what are we going to do?
The first answer is I'm going to step it up a notch.
When you want to destroy a tree in order one space,
there's something really handy you can do.
Which is that you can muck with all of your tree's
internal invariants with reckless abandon.
You can rotate whatever internal node you want,
don't worry about it, the tree's getting destroyed.
So I'm going to throw an extra challenge on top.
I would like to iterate through my tree
using order one space overhead,
but I'd like my tree to be in the same state
as the beginning at the end.
So what are we going to do?
And the first thing we're going to do
is we're going to remember a really cool
implementation trick to save space
inside of recursive languages.
Languages like Scheme.
See, one thing about recursive languages like Scheme
is that there's very frequently what's called tail-recursion.
Which is when the value you return
is simply the result of a recursive call to yourself.
Now how does this relate to trees?
Well, suppose, for a moment,
that we don't have a left subtree.
When we do an inorder traversal on node six,
there's no left subtree to visit, we don't visit it,
we visit ourself and then we tail-recurse
in on the right-hand side.
And that right-hand side, that node seven,
does not care that six was its parent.
Node seven never has to come back up the stack
to see that node six was the parent.
And so why bother keeping all the state
for the stack frame about node six on the stack?
Instead of doing recursive call, recursive call,
recursive call, what languages like Scheme mandate
is that the compiler perform tail call optimization.
Six is on the stack, seven is the child,
and then instead of just pushing seven onto the stack,
seven should replace the data for six in the stack frame,
turning the entire program from a recursive function
into a loop.
The difference between Scheme and C++ is that
C++ does not mandate that this optimization be performed;
however, your major compilers will perform this optimization
at optimization levels higher than -O0.
I did a quick experiment: Gcc seems to turn on
tail call optimization at -O2, and Clang at -O1.
Neither of them, though, turns on
tail call optimization at -O0.
Because with -O0 the compiler is trying to turn your code
as faithfully as possible into assembly code,
and that means keeping order n (O(n)) stack frames.
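To make the tail call concrete, here's what traversal looks like when there's never a left subtree; a minimal sketch, with Node as an illustrative type.

```cpp
#include <functional>

// A node with only right children: the recursive call is the very last thing
// we do, so the compiler can reuse the stack frame and turn this into a loop
// (GCC at -O2, Clang at -O1 in the quick experiment above; never at -O0).
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
};

void inorder_no_left(const Node* node, const std::function<void(int)>& visit) {
    if (!node) return;
    visit(node->value);                    // no left subtree: visit ourselves first
    inorder_no_left(node->right, visit);   // tail call: nothing to come back for
}
```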
Back, though, to Morris Traversal.
If our tree does not happen to have a left subtree,
we can do an inorder traversal in order one space.
The problem now is that trees would be kind of boring
if they didn't have left subtrees.
What are we going to do when we do have a left subtree?
Because we need some way of getting back to the root node.
We need a way to visit the left subtree, and then come back.
And the way that Morris Traversal's going to do this
is Morris Traversal is going to find a node
in its left subtree, and set its
right subtree node to point back to itself.
And the specific node that Morris Traversal is going to pick
will be the predecessor.
So Morris Traversal's going to notice that it has
a left subtree, it's going to go one step
to the left and then as many steps to the right as it can,
and it's going to set that node's right ptr to be itself.
At that point in time, it can tail-recurse
on the left-hand side.
And so now we've got a smaller version of the problem.
And one thing that I really like about recursive problems
is that, when I reason about the whole, I can assume
that the smaller cases just work.
It's proof by induction.
Technically, you need a base case,
but we already covered the base case
of not having a left subtree a couple of slides earlier.
So this, Morris Traversal, will successfully
traverse this tree, it'll visit one,
then it'll visit two, three, four, five,
and then finally it visits six again.
And when node six gets visited a second time,
it's again going to notice that it has a left subtree,
it's going to find its predecessor,
and it's going to notice, hey, wait a minute,
my predecessor already has a back edge.
Quite clearly, I've done this before.
Quite clearly, my left subtree has already been visited.
So instead of setting a back edge on this second iteration,
Morris Traversal's going to erase the back edge
to clean up after itself.
It will visit itself, and then it will continue recursing
on the right-hand side.
This pattern of setting a back edge the first time
and deleting the back edge the second time
means that Morris Traversal will clean up
after itself perfectly, leaving the tree
in an unchanged state at the end of the algorithm.
Certainly, it's going to mutate the state
and make an invalid tree in the interim,
but that's the price you pay for order one
traversal of your tree.
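Here is a minimal sketch of Morris Traversal as just described, again with an illustrative Node using raw left/right pointers.

```cpp
// Morris inorder traversal: O(1) extra space, and the tree is restored to its
// original state by the time the traversal finishes.
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
};

template <typename Visit>
void morris_inorder(Node* current, Visit visit) {
    while (current) {
        if (!current->left) {
            visit(current->value);            // no left subtree: visit, then go right
            current = current->right;
        } else {
            // Find the inorder predecessor: one step left, then right as far as possible.
            Node* pred = current->left;
            while (pred->right && pred->right != current) {
                pred = pred->right;
            }
            if (!pred->right) {
                pred->right = current;        // first visit: install the back edge
                current = current->left;
            } else {
                pred->right = nullptr;        // second visit: clean up the back edge
                visit(current->value);
                current = current->right;
            }
        }
    }
}
```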
So, coming back to the fact that trees might be unbalanced.
If you don't have an unbalanced tree,
or if you have a tree that you know won't be too deep,
just use unique ptrs, it'll be fine,
your stack will be fine most likely.
But if you do run into this problem,
then there exist algorithms both for
regular inorder traversal and also for destruction
that will allow you to visit every node in your tree
without requiring extra stack space.
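One way to do that destruction in order one space: a sketch of the rotate-to-a-right-spine trick, not necessarily the exact variant Herb and Eric discussed, and TreeNode is an illustrative type.

```cpp
#include <memory>

// Destroying an arbitrarily deep tree in O(1) stack space, exploiting the fact
// that we may wreck the tree's invariants during destruction: rotate left
// children up until the tree is a right spine, freeing one node per step that
// has no left child.
struct TreeNode {
    int value;
    std::unique_ptr<TreeNode> left;
    std::unique_ptr<TreeNode> right;
};

void destroy(std::unique_ptr<TreeNode> root) {
    while (root) {
        if (root->left) {
            // Rotate the left child up; nothing is freed, but the tree gets
            // closer to being a right spine. O(1) work, O(1) space.
            std::unique_ptr<TreeNode> left = std::move(root->left);
            root->left = std::move(left->right);
            left->right = std::move(root);
            root = std::move(left);
        } else {
            // No left child: stepping to the right child frees this node
            // with both subtrees already detached, so nothing recurses.
            root = std::move(root->right);
        }
    }
}
```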
So one thing about the last two algorithms
that I shared with you is that they're
both space-efficient algorithms.
And I realized that after putting all my
algorithms together, that all of the algorithms
I have are space-efficient algorithms.
Because I work on services, and when I have a service
that uses a function, a method that uses less memory,
then I can do more with an individual server
that I have as my resource.
So Reservoir Sampling is no different.
It is a space-efficient streaming algorithm.
So a while ago, an intern came to me
and she had a problem.
BigGrep wasn't doing what she wanted it to do,
and she wanted BigGrep to do better.
BigGrep, by the way, is Facebook's internal grep tool
for the entire codebase.
You can imagine it like grep dash r,
except it's really, really fast.
And because we are grepping over our entire codebase,
if you grep for something particularly common
such as a string, you'll get thousands
and thousands of results.
Well, you actually won't get thousands and thousands
of results, you'll get an error, too many results.
That's unfortunate.
And this was especially unfortunate for the intern
because she, like many other new employees at Facebook,
learned how Facebook's codebase operates
by looking at other code.
And the way you find code is you hear a magical keyword.
What does that keyword mean? Scuba?
I don't know what Scuba is. BigGrep scuba.
Look at the BigGrep results, see what other people
are doing, copy their style, figure out what they are doing.
And if there's a particularly common name,
a particularly common keyword,
BigGrep's not going to return anything.
So what this intern wanted is she wanted
BigGrep to return a sample of the data.
To return a subset of the 185,000 results
instead of nothing.
And so, on to Reservoir Sampling.
Reservoir Sampling is a generic algorithm
to select k elements from a stream of unknown length.
I have seen many halfway solutions
that kind of try to be Reservoir Sampling.
I have seen people pick elements
with probability k/n from their stream.
This actually works, when n is known,
and also when you don't particularly care
if you have exactly k elements.
It works particularly well when you want
to sample log messages.
I have seen services that just accumulate all elements,
store them all, and then pick some random subset
of them to choose.
I have also seen services just pick the first k elements.
This one's a little bit rarer, but if you say,
have a service that has a queue,
and requests are coming in in a random order anyways,
you might as well take the first k.
All of these, though, pale in comparison
to Reservoir Sampling, which extremely efficiently
selects k elements uniformly at random
from a stream of unknown length.
How does it work?
And the answer is that Reservoir Sampling
maintains only a very small amount of state.
The state that Reservoir Sampling maintains
is the current selection of elements
that it wants to return, as well as
the number of elements currently processed.
And so here's a little pictogram,
and right before we processed element #60,
we'd processed 59 elements from our stream,
and k equals two, we've chosen red and blue to be our
two random elements out of the 59 we've processed so far.
The statistical invariant of this code, by the way,
is that for every element, and every one of those
two cells, each element has a 1/n chance
of being in any given cell.
So that red circle has a one in 59 chance
of being in box zero.
And then, hey, on comes element 60, the green element.
If the algorithm were to terminate immediately
after we processed element 60,
then we would expect that green element
to exist with probability 2/60.
Therefore, we must take element 60 and place it
into our currently selected elements
with probability 2/60.
So take a random number, take it mod 60, take it mod n,
and check if it's less than k; k is two,
so check if the random number mod 60 is less than two.
And if it is, boot out a random one of the two elements
already selected, otherwise discard it.
Thinking back to the statistical invariant,
right after we'd processed element 59,
that red element had a one in 59 chance of existing
inside bucket one.
And after we've processed element 60,
it had a one in 60 chance of being kicked out.
Green had a two in 60 chance of being selected,
but there were two boxes that it could've been placed into,
and so there's a one in 60 chance that green replaced red.
Worded another way, red has a 59 over 60
chance of remaining.
So one in 59 chance of existing,
times 59 over 60 chance of remaining
is the red element, after having processed 60 elements,
has a one in 60 chance of being there,
maintaining the statistical invariant of Reservoir Sampling.
There is one particularly cool implementation trick
that I'd like to share with you for this code,
which is a way to save on a random number generation.
We take our random number and we take it mod 60,
and we get a uniform distribution zero to 60.
Then we check to see if it's less than k.
If it isn't less than k, whatever, move on,
nothing else to do here.
But if it is less than k, we then need to generate
a random number in the range zero to two.
But wait, r was a random number in the range zero to 60,
and furthermore, we know that it's less than k.
Therefore, r is actually a random number
in the range zero to two, and therefore,
we can just replace the rth element with green,
no need to call rand a second time.
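Putting the algorithm and that trick together, a minimal sketch might look like this; the Item type, the RNG, and the modulo (which has the same tiny bias as rand mod 60) are all illustrative choices.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Reservoir sampling: keep k items chosen uniformly at random from a stream
// of unknown length, reusing one random number per element.
template <typename Item>
class ReservoirSampler {
public:
    explicit ReservoirSampler(std::size_t k, std::uint64_t seed = 42)
        : k_(k), rng_(seed) {}

    void process(const Item& item) {
        ++count_;
        if (reservoir_.size() < k_) {
            reservoir_.push_back(item);          // the first k items always go in
            return;
        }
        // r is uniform in [0, count_). If r < k, it is *also* uniform in
        // [0, k), so it doubles as the index of the slot to replace.
        const std::uint64_t r = rng_() % count_;
        if (r < k_) {
            reservoir_[static_cast<std::size_t>(r)] = item;
        }
    }

    const std::vector<Item>& sample() const { return reservoir_; }

private:
    std::size_t k_;
    std::uint64_t count_ = 0;
    std::vector<Item> reservoir_;
    std::mt19937_64 rng_;
};
```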
Algorithm in hand, my intern protagonist went off
and implemented Reservoir Sampling
inside of our BigGrep codebase,
and she added the --sample flag,
which will successfully return a subset of results
even if there are grossly too many results
to be returned by the BigGrep server.
--sample has stayed in production
for the last four years, and is still in use today.
All right.
I've got one last algorithm for you.
HyperLogLog.
Who here, by the way, has ever
made a table in a database?
Wow, I feel embarrassed now, I haven't. (chuckles)
All of you know how to database way better than me.
Because I've never had to.
Facebook has a lot of teams, a lot of
really good abstractions, a lot of good services
that deal with the database abstraction for me.
So all I really need to do is take my data,
throw it over a wall, and I trust that
other people, other teams, other services
are going to do all the hard stuff.
They're going to deal with retention,
they're going to deal with sampling,
and they're going to have these really nice
web UIs where I can go, when I can look at my data,
and they'll take P99 of my data for me,
and they'll make me pretty charts,
and they even have these dropdown buttons
that I can press with my mouse, it's fantastic.
Very, very pleased with the state of databases at Facebook.
Until one day...
The web UI did not have a button to do what I wanted to do.
What I needed to do with my data is I needed to know
how many distinct elements there were.
Specifically, what I was doing is
I had the big, huge database of all the different
compilations that Facebook has run,
and I wanted to know how many distinct filenames
Facebook had compiled in the last 24 hours.
And there wasn't a button for this.
It was, it was awkward.
I had to go to the Wiki.
Wiki...
How do I database?
(tongue clicks)
And the Wiki was actually pretty good,
but it told me what to do, it told me
how to open the database, it told me how to find
the specific table that I was looking for,
and it even told me the name of the magical function
that I had to call.
Approx count distinct HLL.
Okay, count distinct is what I want,
I do want to know the number of distinct elements
in my database, but why approximate?
I'd be quite fine if you just gave me
an exact answer, thanks.
(audience laughs)
And also, what does HLL stand for?
It turns out, HLL stands for HyperLogLog,
and it is a damn cool algorithm.
The thing that HyperLogLog does
is it counts the number of distinct elements,
but for the exact same reason that it was impossible
to find who the most common user of my service was
in Heavy Hitters, it is also impossible to count
the number of unique elements of my database
in a single pass using less than order n space.
And for databases, yeah, sometimes they're big.
And so what are we going to do?
How can we get around this problem
of needing to have order n space?
And the answer is that HyperLogLog trades away
exactness in order to reduce the space from order n
to a particularly tiny number.
As the original author of HyperLogLog would've said,
HyperLogLog can be used to estimate the number of words
in the whole of Shakespeare's works using less memory
than it takes to represent this slide.
It's a very space-efficient algorithm.
And the key observation for HyperLogLog
is that if you have a random stream of bits,
then the probability that the first k bits are all zero
is one over two to the k.
So we can imagine, maybe I have 128 distinct random numbers.
I would expect for those random numbers,
those 128 random numbers, I'd expect 64 of them
to have no leading zeros, 32 of them to have
one leading zero, 16 of them to have two,
eight to have three, four to have four,
two to have five, one to have six,
and one of those 128 distinct random numbers
to have seven or more leading zeros.
So yeah, that's a nice observation.
How do we even use it?
I see three big problems keeping this observation
from being turned into a practical approximation algorithm.
And the first problem is that this is an observation
about numbers of leading zeros,
which is a property of random numbers.
And my input is not random numbers.
My input is arbitrary data.
Specifically, my example was filenames.
Filenames being strings, not integers,
and also paths which are highly structured
and certainly not random.
Fortunately, there's a relatively
straightforward solution to this.
We're going to hash our input.
Hash functions: turning arbitrary data
into sufficiently random numbers since the 1960s.
One of the other really great things about hash functions
is that hash functions are deterministic.
If I had hello world dot cpp occurring twice in my data set,
then hello world dot cpp would hash to the exact same value.
If I've got 1,000 strings, 500 of which are unique,
then when I hash them, I'm going to get 1,000 numbers,
half of which are unique.
Technically, there's hash collisions,
but HyperLogLog is already an approximation algorithm.
And I'll show you guys later that it
usually runs at 2% error rate.
And with 64-bit hashes, and a built-in 2% error rate,
hash collisions aren't going to make a difference
until you have about a quintillion elements.
That's a billion billion.
So, whenever you guys have a billion billion elements
that you want to figure out how many of them are distinct,
HyperLogLog is no longer for you.
Either that, or get a bigger hash function.
Okay, so we can take our input,
we can hash it to get random numbers,
and we're good on that score.
The next problem with this observation
is that this is an observation about random numbers.
However, our input is arbitrary,
and specifically, our input has arbitrary duplication.
That's the whole point of HyperLogLog.
And the problem with duplication
is that our nice distribution of how many leading zeros
we expect to see gets thrown completely out of whack.
And so how, how are we going to solve the problem
of duplication messing up our distribution of leading zeros?
The answer is we need to find some property
of our distribution that is not affected
in the presence of duplications.
And the answer for that is we're going to use the maximum.
We are not going to track how many leading zeros
there were for every different number,
we are merely going to track the largest number
of leading zeros that we have seen in the entire data set.
So back to the example with 128 different elements,
we see our distribution, and we see that there's one
element that has seven leading zeros,
and then without actually knowing
that the data set contained 128 distinct elements
to begin with, we would guess, because we see a maximum
of seven leading zeros that our data set contains,
two to the seven equals 128, distinct random numbers.
Which brings me to the final problem,
which is that that estimation sucks.
It is a terrible, terrible estimation to just say,
find the maximum number of leading zeros,
and two to the power of that
number is your cardinality.
Doesn't work.
What happens if, instead of seeing seven leading zeros,
by random chance, you saw ten.
Three extra leading zeros.
That should happen about 12% of the time.
So 12% of the time, instead of estimating
that you had a 128 distinct elements,
you would estimate that you had 1,024 elements,
exceeding the real number by a factor of eight.
How are we going to solve this?
The answer is that statistical randomized algorithms
have two nice tricks up their sleeve.
The first trick that randomized algorithms
sometimes employ is to repeat the algorithm
several times with different sources of randomness.
And then to aggregate those different answers together
to form a more statistically correct answer.
We'll get to the aggregation function in a bit.
Because in HyperLogLog, we are relying on the hash functions
to provide us with our random bits,
what we need to do is we need to run this function
some number of times with different hash functions.
The second trick that approximation algorithms
like to employ is you can sample the data.
You don't need to investigate every single thing,
as long as you pick a random subset of the data,
you can form a statistically accurate estimate
of what your data set is doing.
How might you pick k random elements
from your stream, you might ask.
There might be an algorithm for that.
But for the moment, we can combine
these two ideas in HyperLogLog.
And what we will do is we will process elements
one at a time, and we will look at an element
and partition it.
We'll say, hey, by the way, this element
belongs to partition 10.
We'll send it off to partition 10,
we'll hash it using partition 10's unique hash function,
and then we'll count the number of leading zeros,
and we'll update the maximum number of leading zeros
that partition 10 has seen with this new number.
There is a really interesting implementation detail here,
as well as a very subtle detail.
The cool implementation trick is we can dispense
with this have a billion hash functions problem,
we can get away with only one hash function.
As long as you're willing to have a number of buckets,
a number of partitions that is a perfect power of two.
Suppose you want to have eight different partitions.
That's three bits.
When you see a new element, take its hash,
and use the first three bits of the hash
to identify the partition to which your element belongs.
The other 61 bits of your hash,
or n minus three for whatever n happens to be,
serve as your partition-specific hash.
And so in this example, we see an element that hashes
to 01000101,
take the first three bits.
Hey, those first three bits represent the number two,
this element belongs to partition two.
Following which, we take the other five digits,
and we observe that those five digits
have two leading zeros.
Hey, we're going to update the maximum number
of leading zeros for partition two with the value two.
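As a sketch of that update step (not the database's actual implementation): std::hash stands in for a proper 64-bit hash here, and m = 2^bits buckets each remember their maximum "rank", the number of leading zeros plus one, which is the convention the published estimation formula uses.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// HyperLogLog update step: the first `bits` bits of the hash pick the bucket,
// and the bucket keeps the maximum rank (leading zeros of the rest, plus one).
struct HyperLogLog {
    explicit HyperLogLog(unsigned bits)                    // assumes 0 < bits < 64
        : b(bits), buckets(std::size_t{1} << bits, 0) {}

    void add(const std::string& item) {
        // std::hash keeps the sketch self-contained; in practice you'd want a
        // proper 64-bit hash with good bit mixing.
        const std::uint64_t h = std::hash<std::string>{}(item);
        const std::size_t bucket = static_cast<std::size_t>(h >> (64 - b));
        std::uint64_t rest = h << b;                       // the remaining 64 - b bits
        std::uint8_t rank = 1;                             // leading zeros + 1
        while (rank <= 64 - b && !(rest >> 63)) {
            ++rank;
            rest <<= 1;
        }
        if (rank > buckets[bucket]) buckets[bucket] = rank;
    }

    unsigned b;
    std::vector<std::uint8_t> buckets;
};
```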
The subtle detail here...
is that the hash function means that we are partitioning
the input space, not the input stream.
If we have hello world dot cpp twice in our data set,
it's going to hash to the same thing both times,
and that means that it's going to get sent
to the same partition.
That means that all of the strings that make it into
partition a are disjoint from all of the strings
that make it into partition b.
Something to keep in mind for a moment.
And so what does this look like?
What does this look like when we run it
on 10,000 random integers?
And the answer is that it looks something like this,
this is literally a run that I ran on my computer.
And we see, for example, that partition six,
of all the different numbers that were assigned
to partition six, six saw a maximum of 10 leading zeros.
Okay, that would mean that we'd estimate
that partition six has two to the power of 10,
which is 1,024, distinct elements.
We would also estimate that partition five
has 2,048 distinct elements.
Keeping in mind that the elements
that go into each partition are distinct,
it is tempting to think that we would get a good estimation
by summing up the estimates for each individual cell
to get our final guess.
Problem is that that estimation really sucks.
I said there were 10,000 elements
that generated that HyperLogLog table.
It turns out that if you sum up
the expected number for all the individual boxes,
you get an estimate of 50,000.
Five times your original number.
And the reason you get 50,000 is because of that seven box
that has a 15 in it.
Box seven, by random chance, saw 15 leading zeros
and so estimates that your data set
has at least 32,000 distinct elements.
So, we're a little bit stuck there.
What are we going to do instead?
Fortunately for us, some very smart mathematicians
have come up with an answer.
Instead of taking the sum, use the harmonic mean.
Couple of extra details here.
Because the different boxes are disjoint,
you have to add an additional factor of m
to account for the fact that all the boxes
have different strings in them.
And in the formula that they provide in their paper,
you also need to throw in an additional constant factor
to account for the fact that there is structural bias
in how this number is computed.
But at the end of the day, you have an implementable formula
that you can apply to the various numbers of leading zeros
for each of your partitions.
And if we apply that formula to these observed numbers,
we get an estimate of about 8,000.
Keeping in mind there were 10,000 elements to begin with,
so we've got an error of about 20%.
That's not actually all that great.
There's a reason for that, though.
This estimate is 20% off because I chose
a very small value of m, I chose m equals eight.
The expected error of HyperLogLog is proportional
to 1.04 divided by square root of m.
When m is eight, that comes out to
an expected 35% error margin.
Most people, I have been led to believe,
choose 2,048 distinct partitions.
That m requires the first 11 bits of your hash.
Because when m is 2,048, your expected error
comes out to a mere 2%.
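And the estimation step, continuing the same sketch: the published formula is E = alpha_m * m^2 / (sum over buckets of 2^(-M_j)), where M_j is each bucket's maximum rank and alpha_m corrects the structural bias. The alpha below is the paper's value for large m, such as 2,048; this sketch skips the small-range correction the full algorithm also applies.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Harmonic-mean estimate over the per-bucket maxima (rank = leading zeros + 1).
double hll_estimate(const std::vector<std::uint8_t>& buckets) {
    const double m = static_cast<double>(buckets.size());
    const double alpha = 0.7213 / (1.0 + 1.079 / m);       // bias correction for large m
    double denominator = 0.0;
    for (std::uint8_t max_rank : buckets) {
        denominator += std::ldexp(1.0, -static_cast<int>(max_rank));  // 2^(-M_j)
    }
    return alpha * m * m / denominator;
}
```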
So, I found this function.
Database, please approx count distinct HyperLogLog my data.
(clicks tongue)
Hmm. That just worked.
That's one of the glorious things about algorithms.
They just work.
You don't need to invent them, you don't need
to implement them, you don't really even
need to understand what they're doing.
All you have to do is you have to know that they exist,
and know what problem they solve.
And that is what turns an algorithm
into a useful tool that you can use.
There are so many algorithms
and there's one place that I like to go to find them.
#include <algorithm>.
It's a very nice place, I like going there,
seeing what algorithms are on offer.
It's not the best.
Do I wish there were more algorithms?
Yes.
Do I wish that the algorithms didn't have
an iterator interface?
Also, yes.
But do they work?
Are they fast, are they tested, are they rigorous?
Yes!
Every algorithm that you find inside of <algorithm>
will be an extremely efficient algorithm
to add to your toolbox.
Some of my favorites.
std::nth_element: given a list of data, find the n largest ones.
std::accumulate: given a stream of input, sum it all up.
And std::count_if, another one of my favorites.
People have a tendency to corrupt the data
in the services that I own, and sometimes I need to count
how much of my data is still valid.
Really, whenever anyone, when I'm doing code review,
and anyone writes a for loop that does
something relatively simple, I go to my favorite C++
reference website, cppreference.com,
and I look through their algorithms description,
and I try to find one that does what I want them to do.
Literally, I managed to do this with nth_element.
A co-worker of mine was taking his input,
he was sorting it, and then he was taking the top k.
And I was like, hey, hey bud, here's a cppreference
link to nth_element, if you'd like to shave a factor
of log(n) off your runtime.
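The nth_element version of that top-k, as a minimal sketch with int values purely for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Top-k without a full sort: std::nth_element partitions the data so the k
// largest values end up in the first k positions (in no particular order),
// O(n) on average instead of O(n log n).
std::vector<int> top_k(std::vector<int> values, std::size_t k) {
    k = std::min(k, values.size());
    std::nth_element(values.begin(), values.begin() + k, values.end(),
                     std::greater<int>{});
    values.resize(k);
    return values;
}
```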
But I'm not going to talk much more
about standard algorithms, and the reason
I'm not going to talk much more about standard algorithms
is because there's already a great talk
on this subject available on YouTube.
Two years ago, at CppCon in 2015,
Michael VanLoon gave a talk, STL Algorithms in Action,
and it was one of my favorite talks that year.
It contained just a whirlwind tour
of the STL Algorithms library, and for me,
it was really useful because it broke down the barrier
of mysteriousness and weird function call syntax
that was present in the STL.
Before that talk, I did not use the STL very much,
I was not very familiar with it, but Michael's talk
introduced me gently into those algorithms,
and I use them all the time now.
If you do not use the std algorithms,
go onto YouTube, watch Michael VanLoon's talk.
It'll be a good use of your time.
I could keep talking for a long time.
I could talk for days about algorithms.
The thing is, though, if I were to talk for days
about algorithms, instead of limiting myself to just a few,
we would all get very bored very quickly.
Because most of the algorithms that I know you also know.
We all know binary search, we all know quick sort,
we all know substring, and that's because algorithms
are the tools of our trade.
I hope that today's presentation of algorithms
has contained some new algorithms for you.
Tools, rare enough to be novel,
but useful enough to be found in practice.
Each, in their own special way, fantastic.
My name's Nicolas.
It's been a pleasure speaking with you today.
(applause)