All right.

Hello, everyone.

And welcome to Fantastic Algorithms and Where to Find Them.

An intern once asked me, "Nicolas, how often

"do you write algorithms?"

And the answer is rarely.

Algorithms are tools.

Hammers, and levels, and mops.

And what makes a collection of tools impressive

is not how many of them you made yourself,

but rather, how many unique problems

you can use your toolbox to solve.

I hope that in today's talk, as we go through

a couple fantastic algorithms,

that there will be some tools here rare enough

to be novel, but useful enough to be found in practice.

So a little bit about me before we dive in,

I am an Infrastructure Engineer at Facebook.

I work in infra for two reasons.

First of all, a lot of infra is services.

Services are often written in C++,

and I happen to like that language.

The second reason that I like working in infra

is that my clients are fellow engineers.

If I've got a problem, something's going wrong,

if I need to learn something about the use cases,

I can just go and talk to another technical person.

This leads straight into our first algorithm.

Heavy Hitters.

Despite my co-workers being engineers,

people with no malicious intentions...

I still get DoSed.

Maybe a new feature gets turned on,

maybe an upstream service gets DoSed which causes

my service to get DoSed, or in this particular story,

a command line interface used primarily by humans

is called programmatically for the first time.

This particular graph, those two spikes

represent high latencies, specifically timeouts,

as a result of too many queries coming into my service.

Now, what I want to do, what I really would like

is to guarantee to all the people

who are using my service normally,

that they will have a bare minimum amount of traffic.

I do not want to starve any of my individual users

in the event that I am overloaded.

And so the ideal thing that I would like is I would like

to have a map of the people who use my service

to the number of times that they have made queries with me.

The problem is that keeping that map is very expensive,

it takes order n (O(n)) space.

Even if I were to simplify my request,

and instead just want to know the identity

of the single user who was making the most requests,

it would still take order n (O(n)) space

to be done in a single pass.

So fortunately, though, there is an algorithm

that exists, Heavy Hitters, which does almost what I want.

Heavy Hitters does not determine

who the most frequent user of my table is,

of my service is, however, what it does do

is it does identify the users who single-handedly

account for more than x percent of total traffic.

For this special case, x equals 50%, some of you might

have heard this algorithm under some different names,

for x equals 50%, knowing which elements in an array

occur strictly more than half the time,

the algorithm's called Majority Element.

Also known as Boyer-Moore's Majority Vote algorithm.

And it works something like this.

Imagine, for a moment, that we are all at a conference,

and there's a lot of programmers

who are attending the conference,

and the programmers indicate by the color of their dress

what their favorite programming language is.

As the programmers enter the floor

the passion of the moment overcomes them,

and a furious fight breaks out.

Programmers who represent different languages lock eyes

and instantly do battle.

They run at each other and try to knock the other out.

Now, it happens to be that every single pair

of programmers is perfectly evenly matched,

and so when they run and punch each other,

they both land a blow on the other's chin simultaneously,

and both get knocked to the floor unconscious.

The conference chair enters the room

and notices that there is exactly one

programming language left standing.

Indeed, if there were multiple programming languages

left standing, then one programmer from each of two

distinct languages would lock eyes,

run at each other, do battle, get knocked out.

So there is exactly one language left.

And so the conference chair smiles,

'cause he had a question that he wanted

to have an answer to.

He wanted to know whether or not there existed

a programming language that enjoyed the support

of a strict majority of all attendees.

The conference chair's job is simpler now.

Instead of counting the number of attendees

who support every single programming language,

he merely needs to count the number of attendees

who support the programming language left standing.

The reason this works...

is because we see that pairs of people are knocked out

where each member of the pair represents

a different programming language.

If there's n people, there's at most

n/2 pairs of knocked out people.

And if there's a programming language that is

represented by more than n/2 people,

then it cannot be present,

then it cannot be completely knocked out

in every single pair, thus will be left over.

This conference room brawl translates

very cleanly into an algorithm.

The only difference is that you let the programmers

into the room one at a time.

Implement this as a counter-value pair,

or sorry, a value-counter pair.

You have a value, the current programming language

which is winning, and some rather arbitrary integer

representing the weight.

So the first programmer walks into the room

representing C++, there's no one else in the room,

C++ is now the winner, the weight is one.

A second programmer, a Java programmer,

walks into the room, locks eyes with the C++ programmer,

they do battle, they get knocked out,

the counter goes back to zero.

Third programmer walks into the room, C++.

C++ is the winner, counter is one.

Fourth programmer, again, C++ walks in,

C++ now has a value of two,

and the fifth and final programmer, a Perl programmer,

walks in, locks eyes with one of the C++ programmers,

they do battle, get knocked out.

The algorithm's done.

At the end of these five people, three of whom were C++,

majority vote does indeed say that C++ is the winner,

and there's a rather arbitrary value of one

left in the counter.

It is important after you do the first pass through

to find your estimate of who the winner is,

that you do a second pass through the array

to verify that the returned element

is indeed a majority element.

We can imagine a case where this is not the case.

Suppose there are a very large, odd number of pirates

each of whom is voting for who the next

captain of the ship is going to be.

1,001 pirates.

And I've chosen pirates, because pirates

are well known to always vote for themselves.

And so the first pirate walks in and he's the winner.

And the second pirate walks in, and they do battle,

and there's no winner.

Third pirate walks in, he's the winner.

Fourth pirate walks in, oh they disagree,

they do battle, no more winner.

And the first 500 pairs of pirates go in

and knock each other out until the last pirate,

the 1,001st pirate, walks in and,

"Hey, guys, I vote for Teddy."

Teddy's the winner!

No, he's not the winner, you have to do a second

pass through the array.

It is sufficient for a majority element

to end up in the value, but it's not necessary.

This is the x equals 50% version of the algorithm.

It extends very cleanly and very naturally

to percentages other than 50%.

Specifically to percentages of the form 1/k

for positive integers k.

The only trick that you have to change

is that, inside the conference room,

which could previously accommodate only one

programming language, now allow there to be k minus one

distinct programming languages existing in harmony.

And it's only when a fifth programming language

walks in that the room becomes too crowded,

and that new programming language

plus the other four distinct programming--

sorry, if k equals five, the other four distinct

programming languages, one person from each group

does battle and leaves.

And so they do battle in groups of k,

therefore, no one who exists more than

1/k of the time can be completely knocked out.

So coming back to the service that I own

that was getting DoSed.

I was upset that this was happening.

I went and I talked to that engineer, because I'm an adult.

But I also put up some code based on Heavy Hitters

to make sure that, if this happened again,

I'd be able to throttle the people who were

single-handedly accounting for large percentages

of my resources.

I had one small variant, though.

Heavy Hitters does not have a concept of time,

so I did Heavy Hitters in five minute intervals.

It was a long time ago, there was a little bit

of trickery going on.

I can't show you that code, though.

But some code that I can show you is this majority vote

implementation, found inside of Libgcc.

I was talking to Nathan Sidwell yesterday.

Nathan Sidwell, by the way, is going to give a talk

tomorrow on modules, implementing modules inside of G++,

which I would advise going to.

But I was telling him about Heavy Hitters,

and he was like, "Hmm, you know,

"this code sounds familiar to me."

And sure enough, Nathan found this Heavy Hitters

implementation inside of Gcov.

What this code is doing is it is trying to find

whether or not there is a majority argument

for profiling purposes.

During profiling, this extremely cheap,

extremely simple counter-value pair

is used to determine whether or not

there's a specific argument that occurs

the vast majority of the time.

There's one slight hiccup, though,

in the Gcov implementation.

Which is, typically, at the end of the Heavy Hitters

implementation, you have to do a second pass

to verify that the given argument does indeed

occur more than 50% of the time.

And Gcov can't do a second pass.

So, instead, Gcov has this interesting trick

where, at the very end, they check the counter

and they see if the counter is higher

than 50% of the total number of calls.

If this is true, then we know for sure

that that algorithm occurred at least 50% of the time.

However, we don't know that an algorithm

that occurred 51% of the time is guaranteed

to make this cut.

However, it is guaranteed that an algorithm

that occurs at least, or an input that occurs

at least 75% of the time will,

once the algorithm completes, have a value

in the counter at least equal to 50%.

So this is a rather nice, nifty optimization

that Gcov uses to profile your code

so that your code can be optimized.

All right.

I've got another algorithm for you.

Who here was at CppCon last year?

That's actually fewer people than I thought,

that was like less than half.

Who here, regardless of whether or not

you've been at CppCon last year,

listened to Herb Sutter's talk Leak-Freedom in C++?

That looks like quite a few more hands.

Not everyone's hands, but more.

For those of you who haven't seen Herb's talk,

I highly advise, go and listen to it,

it's a great talk about how to avoid

leaking memory inside of your programs.

And one of the most important things

that Herb hammers home in his talk

is not to manage pointers yourself.

Do not use new and delete.

Instead, rely on unique ptrs and shared ptrs

when you actually have data that has to

outlive your stack frame.

And one of the examples that Herb had in his talk

was he had a tree implementation.

And instead of having a tree star left

and a tree star right, Herb instead had unique ptrs.

And this is very nice, because those unique ptrs

are going to be cleaned up successfully,

regardless of how your tree gets destroyed.

You could have an exception thrown in your constructor

and those child trees are going to be cleaned up,

and you're not going to have a leak.

There's a problem, though, Herb remarked.

The problem is that, in the wild,

not all trees are balanced.

And so if you happened to have an unbalanced tree,

in the worst case, just a straight line like a linked list,

what happens when you free the root node?

The root node's going to be freed,

but in order to do that, it first has to call

the destructor on it's left child unique ptr.

So six calls the destructor of two,

two calls the destructor of one,

and so a lot of destructors get pushed onto your stack,

and you can smash your stack.

This isn't good.

And so, at the end of the presentation,

during Q&A, Eric Neebler went to the microphone and said,

"Herb, did you know that there exists a way

"to destroy a tree using order one space overhead?"

Before I go on, I've got a small little story

about Libgcc again for you.

I went and I checked Libgcc's implementation

to see whether or not they were doing a stack-smashing

destruction call, or if they were doing some fancy-swancy

order one space destruction call.

And the answer is, that inside of Libgcc,

their red-black tree implementation

uses this type of destruction.

It's just recursively call the destructors,

come back up, push everything onto the stack.

And the reason that makes a lot of sense

is because the STL's red-black tree implementation

is guaranteed to be balanced.

So the height's going to be proportional to log(n),

you're not going to smash your stack with log(n) space.

But back to the fact that Herb has said

that there do exist cases when trees are imbalanced.

If you implement some not necessarily balanced tree,

you have to be careful about smashing your stack

in recursive contexts.

So what are we going to do?

The first answer is I'm going to step it up a notch.

When you want to destroy a tree in order one space,

there's something really handy you can do.

Which is that you can muck with all of your tree's

internal invariance with reckless abandon.

You can rotate whatever internal node you want,

don't worry about it, the tree's getting destroyed.

So I'm going to throw an extra challenge on top.

I would like to iterate through my tree

using order one space overhead,

but I'd like my tree to be in the same state

as the beginning at the end.

So what are we going to do?

And the first thing we're going to do

is we're going to remember a really cool

implementation trick to save space

inside of recursive languages.

Languages like Scheme.

See, one thing in recursive languages like Scheme

is there's very frequently what's called tail-recursion.

Which is when you return a value

which is the smaller answer to your own function call.

Now how does this relate to trees?

Well, suppose we don't, for a moment,

suppose we don't have a left subtree.

When we visit, when we do in order traversal on node six,

there's no left subtree to visit, we don't visit it,

we visit ourself and then we tail-recurse

in on the right-hand side.

And that right-hand side, that node seven,

does not care that six was its parent.

Node seven never has to come back up the stack

to see that node six was the parent.

And so why bother keeping all the state

for the stack frame about node six on the stack?

Instead of doing recursive call, recursive call,

recursive call, what languages like Scheme mandate

is that the compiler perform tail call optimization.

And that six is in the stack, seven is the child,

and then instead of just pushing seven onto the stack,

seven should replace the data for six in the stack frame,

and turn the entire program from a recursive function

into a loop.

The difference between Scheme and C++ is that

C++ does not mandate that this optimization be performed,

however, your major compilers will perform this optimization

at things higher than O0.

Gcc, I did a quick experiment, Gcc seems to turn on

tail call optimization at O2, and Clang on O1.

Both of them, though, do not turn on

tail call optimization with O0.

Because with O0 the compiler is trying to turn your code

as faithfully as possible into assembly code,

and that means having order and stack calls.

Back, though, to Morris Traversal.

If our tree does not happen to have a left subtree,

we can do an inorder traversal in order one space.

The problem now is that trees would be kind of boring

if they didn't have left subtrees.

What are we going to do when we do have a left subtree?

Because we need some way of getting back to the root node.

We need a way to visit the left subtree, and then come back.

And the way that Morris Traversal's going to do this

is Morris Traversal is going to find a node

in its left subtree, and set its

right subtree node to point back to itself.

And the specific node that Morris Traversal is going to pick

will be the predecessor.

So Morris Traversal's going to notice that it has

a left subtree, it's going to go one step

to the left as many steps to the right as it can,

and it's going to set that node's right ptr to be itself.

At that point in time, it can tail-recurse

on the left-hand side.

And so now we've got a smaller version of the problem.

And one thing that I really like about recursive problems

is that, when I reason about the whole, I can assume

that the smaller cases just work.

It's proof by induction.

Technically, you need a base case,

but we already covered the base case

of not having a left subtree a couple of slides earlier.

So this, Morris Traversal, will successfully

traverse this tree, it'll visit one,

then it'll visit two, three, four, five,

and then finally it visits six again.

And when node six gets visited a second time,

it's again going to notice that it has a left subtree,

it's going to find its predecessor,

and it's going to notice, hey, wait a minute,

my predecessor already has a back edge.

Quite clearly, I've done this before.

Quite clearly, my left subtree has already been visited.

So instead of setting a back edge on this second iteration,

Morris Traversal's going to erase the back edge

to clean up after itself.

It will visit itself, and then it will continue recursing

on the right-hand side.

This pattern of setting a back edge the first time

and deleting the back edge the second time

means that Morris Traversal will clean up

after itself perfectly, leaving the tree

in an unchanged state at the end of the algorithm.

Certainly, it's going to mutate the state

and make an invalid tree in the interim,

but that's the price you pay for order one

traversal of your tree.

So, coming back to the fact that trees might be unbalanced.

If you don't have an unbalanced tree,

or if you have a tree that you know won't be too deep,

just use unique ptrs, it'll be fine,

your stack will be fine most likely.

But if you do run into this problem,

then there exist algorithms both for

regular inorder traversal and also for destruction

that will allow you to visit every node in your tree

without requiring extra stack space.

So one thing about the last two algorithms

that I shared with you is that they're

both space-efficient algorithms.

And I realized that after putting all my

algorithms together, that all of the algorithms

I have are space-efficient algorithms.

Because I work on services, and when I have a service

that uses a function, a method that uses less memory,

then I can do more with an individual server

that I have as my resource.

So Reservoir Sampling is no different.

It is a space-efficient streaming algorithm.

So awhile ago, and intern came to me

and she had a problem.

BigGrep wasn't doing what she wanted it to do,

and she wanted BigGrep to do better.

BigGrep, by the way, is Facebook's internal grep tool

for the entire codebase.

You can imagine it like grep dash r,

except it's really, really fast.

And because we are grepping over our entire codebase,

if you grep for something particularly common

such as a string, you'll get thousands

and thousands of results.

Well, you actually won't get thousands and thousands

of results, you'll get an error, too many results.

That's unfortunate.

And this was especially unfortunate for the intern

because she, like many other new employees at Facebook,

learn how Facebook's codebase operates

by looking at other code.

And the way you find code, you hear a magical keyword,

what's that keyword means, scuba,

I don't know what scuba is, BigGrep scuba.

Look at the BigGrep results, see what other people

are doing, copy their style, figure out what they are doing.

And if there's a particularly common name,

a particularly common keyword,

BigGrep's not going to return anything.

So what this intern wanted is she wanted

BigGrep to return a sample of the data.

To return a subset of the 185,000 results

instead of nothing.

And to Reservoir Sampling.

Reservoir Sampling is a generic algorithm

to select k elements from a stream of unknown length.

I have seen many halfway solutions

that kind of try to be Reservoir Sampling.

I have seen people pick elements

with probability k/n from their stream.

This actually works, when n is known,

and also when you don't particularly care

if you have exactly k elements.

It works particularly well when you want

to sample log messages.

I have seen services that just accumulate all elements,

store them all, and then pick some random subset

of them to choose.

I have also seen services just pick the first k element.

This one's a little bit rarer, but if you say,

have a service that has a queue,

and requests are coming in in a random order anyways,

you might as well take the first k.

All of these, though, pale in comparison

to Reservoir Sampling, which extremely efficiently

will select k elements perfectly

with a random probability from an unknown length stream.

How does it work?

And the state, the answer is that Reservoir Sampling

maintains only a very small amount of state.

The state that Reservoir Sampling maintains

is the current selection of elements

that it wants to return, as well as

the number of elements currently processed.

And so here's a little pictogram,

and right before we processed element #60,

we'd processed 59 elements from our stream,

and k equals two, we've chosen red and blue to be our

two random elements out of the 59 we've processed so far.

The statistical invariant of this code, by the way,

is that for every element, and every one of those

two cells, each element has a 1/n chance

of being in any given cell.

So that red circle has a one in 59 chance

of being in box zero.

And then, hey, on comes element 60, the green element.

If the algorithm were to terminate immediately

after we processed element 60,

then we would expect that green element

to exist with probability 2/60.

Therefore, we must take element 60 and place it

into our currently selected elements

with probability 2/60.

So take a random number, take in mod 60, take in mod n,

check if it's less than k, k is two,

check if rand number mod 60 is less than two.

And if it is, boot out a random one of the two elements

already selected, otherwise discard it.

Thinking back to the statistical invariant,

right after we'd processed element 59,

that red element had a one in 59 chance of existing

inside bucket one.

And after we've processed element 60,

it had a one in 60 chance of being kicked out.

Green had a two in 60 chance of being selected,

but there were two boxes that it could've been placed into,

and so there's a one in 60 chance that green replaced red.

Worded another way, red has a 59 over 60

chance of remaining.

So one in 59 chance of existing,

times 59 over 60 chance of remaining

is the red element, after having processed 60 elements,

has a one in 60 chance of being there,

maintaining the statistical invariant of Reservoir Sampling.

There is one particularly cool implementation trick

that I'd like to share with you for this code,

which is a way to save on a random numbered generation.

We take our random number and we take it mod 60,

and we get a uniform distribution zero to 60.

Then we check to see if it's less than k.

If it isn't less than k, whatever, move on,

nothing else to do here.

But if it is less than k, we then need to generate

a random number in the range zero to two.

But wait, r was a random number in the range zero to 60,

and furthermore, we know that it's less than k.

Therefore, r is actually a random number

in the range zero to two, and therefore,

we can just replace the rth element with green,

no need to call rand a second time.

Algorithm in hand, my intern protagonist went off

and implemented Reservoir Sampling

inside of our BigGrep codebase,

and she added the dash dash sample flag,

which will successfully return a subset of results

even if there are grossly too many results

to be returned by the BigGrep server.

Dash dash sample has stayed in production

for the last four years, and is still in use today.

All right.

I've got one last algorithm for you.

HyperLogLog.

Who here, by the way, has ever

made a table in a database?

Wow, I feel embarrassed now, I haven't. (chuckles)

All of you know how to database way better than me.

Because I've never had to.

Facebook has a lot of teams, a lot of

really good abstractions, a lot of good services

that deal with the database abstraction for me.

So all I really need to do is take my data,

throw it over a wall, and I trust that

other people, other teams, other services

are going to do all the hard stuff.

They're going to deal with retention,

they're going to deal with sampling,

and they're going to have these really nice

web UIs where I can go, when I can look at my data,

and they'll take P99 of my data for me,

and they'll make me pretty charts,

and they even have these dropdown buttons

that I can press with my mouse, it's fantastic.

Very, very pleased with the state of databases at Facebook.

Until one day...

The web UI did not have a button to do what I wanted to do.

What I needed to do with my data is I needed to know

how many distinct elements there were.

Specifically, what I was doing is

I had the big, huge database of all the different

compilations that Facebook has run,

and I wanted to know how many distinct filenames

Facebook had compiled in the last 24 hours.

And there wasn't a button for this.

It was, it was awkward.

I had to go to the Wiki.

Wiki...

How do I database?

(tongue clicks)

And the Wiki was actually pretty good,

but it told me what to do, it told me

how to open the database, it told me how to find

the specific table that I was looking for,

and it even told me the name of the magical function

that I had to call.

Approx count distinct HLL.

Okay, count distinct is what I want,

I do want to know the number of distinct elements

in my database, but why approximate?

I'd be quite fine if you just gave me

an exact answer, thanks.

(audience laughs)

And also, what does HLL stand for?

It turns out, HLL stands for HyperLogLog,

and it is a damn cool algorithm.

The thing that HyperLogLog does

is it counts the number of distinct elements,

but for the exact same reason that it was impossible

to find who the most common users of my service was

in Heavy Hitters, it is also impossible to count

the number of unique elements of my database

in a single pass using less than order n space.

And for databases, yeah, sometimes they're big.

And so what are we going to do?

How can we get around this problem

of needing to have order n space?

And the answer is that HyperLogLog trades away

exactness in order to reduce the space from order n

to a particularly tiny number.

As the original author of HyperLogLog would've said,

HyperLogLog can be used to estimate the number of words

in the whole of Shakespeare's works using less memory

than it takes to represent this slide.

It's a very space-efficient algorithm.

And the key observation for HyperLogLog

is that if you have a random stream of bits,

then the probability that the first k bits are all zero

is one over two to the k.

So we can imagine, maybe I have 128 distinct random numbers.

I would expect for those random numbers,

those 128 random numbers, I'd expect 64 of them

to have no leading zeros, 32 of them to have

one leading zero, 16 of them to have two,

eight to have three, four to have four,

two to have five, one to have six,

and one of those 128 distinct random numbers

to have seven or more leading zeros.

So yeah, that's a nice observation.

How do we even use it?

I see there's three big problems with this observation

from being turned into a practical approximation algorithm.

And the first problem is that this is an observation

about numbers of leading zeros,

which is a property of random numbers.

And my input is not random numbers.

My input is arbitrary data.

Specifically, my example was filenames.

Filenames being strings, not integers,

and also paths which are highly structured

and certainly not random.

Fortunately, there's a relatively

straightforward solution to this.

We're going to hash our input.

Hash functions, turning arbitrary data

to sufficiently random numbers since the 1960s.

One of the other really great things about hash functions

is that hash functions are deterministic.

If I had hello world dot cpp occurring twice in my data set,

then hello world dot cpp would hash to the exact same value.

If I've got 1,000 strings, 500 of which are unique,

then when I hash them, I'm going to get 1,000 numbers,

half of which are unique.

Technically, there's hash collisions,

but HyperLogLog is already an approximation algorithm.

And I'll show you guys later that it

usually runs at 2% error rate.

And with 64-bit hashes, and a built-in 2% error rate,

hash collisions aren't going to make a difference

until you have about a quintillion elements.

That's a billion billion.

So, whenever you guys have a billion billion elements

that you want to figure out how many of them are distinct,

HyperLogLog is no longer for you.

Either that, or get a bigger hash function.

Okay, so we can take our input,

we can hash it to get random numbers,

and we're good on that score.

The next problem with this observation

is that this is an observation about random numbers.

However, our input is arbitrary,

and specifically, our input has arbitrary duplication.

That's the whole point of HyperLogLog.

And the problem with duplication

is that our nice distribution of how many leading zeros

we expect to see is totally wacked off and bonkers.

And so how, how are we going to solve the problem

of duplication messing up our distribution of leading zeros?

The answer is we need to find some property

of our distribution that is not affected

in the presence of duplications.

And the answer for that is we're going to use the maximum.

We are not going to track how many leading zeros

there were for every different number,

we are merely going to track the largest number

of leading zeros that we have seen in the entire data set.

So back to the example with 128 different elements,

we see our distribution, and we see that there's one

element that has seven leading zeros,

and then without actually knowing

that the data set contained 128 distinct elements

to begin with, we would guess, because we see a maximum

of seven leading zeros that our data set contains,

two to the seven equals 128, distinct random numbers.

Which brings me to the final problem,

which is that that estimation sucks.

It is a terrible, terrible estimation to just say,

find the maximum number of leading zeros

and to whatever the power of that

number is your cardinality.

Doesn't work.

What happens if, instead of seeing seven leading zeros,

by random chance, you saw ten.

Three extra leading zeros.

That should happen about 12% of the time.

So 12% of the time, instead of estimating

that you had a 128 distinct elements,

you would estimate that you had 1,024 elements,

exceeding the real number by a factor of 10.

How are we going to solve this?

The answer is that statistical randomized algorithms

have two nice tricks up their sleeve.

The first trick that randomized algorithms

sometimes employ is to repeat the algorithm

several times with different sources of randomness.

And then to aggregate those different answers together

to form a more statistically correct answer.

We'll leave out the aggregation function in a bit.

Because in HyperLogLog, we are relying on the hash functions

to provide us with our random bits,

what we need to do is we need to run this function

some number of times with different hash functions.

The second trick that approximation algorithms

like to employ is you can sample the data.

You don't need to investigate every single thing,

as long as you pick a random subset of the data,

you can form a statistically accurate estimate

of what your data set is doing.

How might you pick k random elements

from your stream, you might ask.

There might be an algorithm for that.

But for the moment, we can combine

these two ideas in HyperLogLog.

And what we will do is we will process elements

one at a time, and we will look at an element

and partition it.

We'll say, hey, by the way, this element

belongs to partition 10.

We'll send it off to partition 10,

we'll hash it using partition 10's unique hash function,

and then we'll count the number of leading zeros,

and we'll update the maximum number of leading zeros

that partition 10 has seen with this new number.

There is a really interesting implementation detail here,

as well as a very subtle detail.

The cool implementation trick is we can dispense

with this have a billion hash functions problem,

we can get away with only one hash function.

As long as you're willing to have a number of buckets,

a number of partitions that is a perfect power of two.

Suppose you want to have eight different partitions.

That's three bits.

When you see a new element, take its hash,

and use the first three bits of the hash

to identify the partition to which your element belongs.

The other 61 bits of your hash,

or n minus three for whatever n happens to be,

is your partition's specific hash function.

And so in this example, we see an element of hashes

to zero one zero zero zero one zero one,

take the first three bits.

Hey, those first three bits represent the number two,

this element belongs to partition two.

Following which, we take the other five digits,

and we observe that those five digits

have two leading zeros.

Hey, we're going to update the maximum number

of leading zeros for partition two with the value two.

The subtle detail here...

is that the hash function means that we are partitioning

the input space, not the input size.

If we have hello world dot cpp twice in our data set,

it's going to hash to the same thing both times,

and that means that it's going to get sent

to the same partition.

That means that all of the strings that make it into

partition a are disjoint from all of the strings

that make it into partition b.

Something to keep in mind for a moment.

And so what does this look like?

What does this look like when we run it

on 10,000 random integers?

And the answer is that it looks something like this,

this is literally a run that I ran on my computer.

And we see, for example, that partition six,

of all the different numbers that were assigned

to partition six, six saw a maximum of 10 leading zeros.

Okay, that would mean that we'd estimate

that partition six has two to the power of 10,

which is 1,024, distinct elements.

We would also estimate that partition five

has 2,048 distinct elements.

Keeping in mind that the elements

that go into each partition are distinct,

it is tempting to think that we would get a good estimation

by summing up the estimates for each individual cell

to get our final guess.

Problem is that that estimation really sucks.

I said there were 10,000 elements

that generated that HyperLogLog table.

It turns out that if you sum up

the expected number for all the individual boxes,

you get an estimate of 50,000.

Five times your original number.

And the reason you get 50,000 is because of that seven box

that has a 15 in it.

Box seven, by random chance, saw 15 leading zeros

and so estimates that your data set

has at least 32,000 distinct elements.

So, we're a little bit stuck there.

What are we going to do instead?

Fortunately for us, some very smart mathematicians

have come up with an answer.

Instead of taking the sum, use the harmonic mean.

Couple of extra details here.

Because all the different boxes are unique,

you have to add an additional factor of m

to account for the fact that all the boxes

have different strings in them.

And in the formula that they provide in their paper,

you also need to throw in an additional constant factor

to account for the fact that there is structural bias

in how this number is computed.

But at the end of the day, you have a implementable formula

that you can apply to the various numbers of leading zeros

for each of your partitions.

And if we apply that formula to these observed numbers,

we get an estimate of about 8,000.

Keeping in mind there were 10,000 elements to begin with,

so we've got an error of about 20%.

That's not actually all that great.

There's a reason for that, though.

This estimate is 20% off because I chose

a very small value of m, I chose m equals eight.

The expected error of HyperLogLog is proportional

to 1.04 divided by square root of m.

When m is eight, that comes out to

an expected 35% error margin.

Most people, I have been led to believe,

choose 2,048 distinct partitions.

M requires the first 11 bits of your hash function.

Because when m is 2,048, your expected error

comes out to a mere 2%.

So, I found this function.

Database, please approx count distinct HyperLogLog my data.

(clicks tongue)

Hmm. That just worked.

That's one of the glorious things about algorithms.

They just work.

You don't need to invent them, you don't need

to implement them, you don't really even

need to understand what they're doing.

All you have to do is you have to know that they exist,

and know what problem they solve.

And that is what turns an algorithm

into a useful tool that you can use.

There are so many algorithms

and there's one place that I like to go to find them.

Include algorithm.

It's a very nice place, I like going there,

seeing what algorithms are on offer.

It's not the best.

Do I wish there were more algorithms?

Yes.

Do I wish that the algorithms didn't have

an iterator interface?

Also, yes.

But do they work?

Are they fast, are they tested, are they rigorous?

Yes!

Every algorithm that you find inside of include algorithm

will be an extremely efficient algorithm

to add to your toolbox.

Some of my favorites.

Nth element, given a list of data, find the n largest ones.

Accumulate, given a stream of input, sum it all up.

And count if, another one of my favorites.

People have a tendency to corrupt the data

in the services that I own, and sometimes I need to count

how much of my data is still valid.

Really, whenever anyone, when I'm doing code review,

and anyone writes a for loop that does

something relatively simple, I go to my favorite C++

reference website, cppreference.com,

and I look through their algorithms description,

and I try to find one that does what I want them to do.

Literally, I managed to do this with nth element.

A co-worker of mine was taking his input,

he was sorting it, and then he was taking the top k.

And I was like, hey, hey bud, here's a cppreference

link to nth element, if you'd like to shave a factor

of log(n) off your runtime.

But I'm not going to talk much more

about standard algorithms, and the reason

I'm not going to talk much more about standard algorithms

is because there's already a great talk

on this subject available on YouTube.

Two years ago, at CppCon in 2015,

Michael VanLoon gave a talk, STL Algorithms in Action,

and it was one of my favorite talks that year.

It contained just a whirlwind tour

of the STL Algorithms library, and for me,

it was really useful because it broke down the barrier

of mysteriousness and weird function call syntax

that was present in the STL.

Before that talk, I did not use the STL very much,

I was not very familiar with it, but Michael's talk

introduced me gently into those algorithms,

and I use them all the time now.

If you do not use stood algorithm,

go onto YouTube, watch Michael VanLoon's talk.

It'll be a good use of your time.

I could keep talking for a long time.

I could talk for days about algorithms.

The thing is, though, if I were to talk for days

about algorithms, instead of limiting myself to just a few,

we would all get very bored very quickly.

Because most of the algorithms that I know you also know.

We all know binary search, we all know quick sort,

we all know substring, and that's because algorithms

are the tools of our trade.

I hope that today's presentation of algorithm

has contained some new algorithms for you.

Tools, rare enough to be novel,

but useful enough to be found in practice.

Each, in their own special way, fantastic.

My name's Nicolas.

It's been a pleasure speaking with you today.

(applause)