# Hamming codes part 2, the elegance of it all

I’m assuming that everybody here is coming in from part 1.
We were talking about Hamming codes, a way to create a block of data where most of the
bits carry a meaningful message, while a few others act as a kind of redundancy in such
a way that if any bit gets flipped, either a message bit or a redundancy bit, anything
in this block, a receiver is going to be able to identify that there was an error and how
to fix it.
The basic idea presented there was how to use multiple parity checks to binary search
your way down to the error.
Now, in that video the goal was to make Hamming codes feel as hands-on and re-discoverable
as possible, but as you start to think about actually implementing this, either in software
or hardware, that framing may actually undersell how elegant these codes really are.
You might think that you need to write an algorithm that keeps track of all possible
error locations and cuts that group in half with each check, but it’s actually way,
way simpler than that.
If you read out the answers to the four parity checks that we did in the last video all as
1’s and 0’s, instead of as yeses and nos, it literally spells out the position
of the error in binary.
For example, the number 7 in binary looks like 0111, essentially saying it’s 4+2+1.
And notice where position 7 sits: it does affect the first of our parity groups, and
the second, and the third, but not the last.
So reading the results of those four checks from bottom to top indeed does spell out the
position of the error.
There’s nothing special about the example 7, this works in general, and this makes the
logic for implementing the whole scheme in hardware shockingly simple.
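To make that concrete, here is a minimal sketch of the receiver's logic, assuming the 16-position layout and the even-parity convention from part 1, with the four groups simply written out by hand (this is my own illustration, not code from the video):

```python
# The four parity groups from part 1, written out by hand.
# Each group is assumed to have been arranged by the sender to have even parity.
groups = [
    [1, 3, 5, 7, 9, 11, 13, 15],     # check 1
    [2, 3, 6, 7, 10, 11, 14, 15],    # check 2
    [4, 5, 6, 7, 12, 13, 14, 15],    # check 3
    [8, 9, 10, 11, 12, 13, 14, 15],  # check 4
]

def error_position(block):
    """Run the four parity checks and read the answers, bottom to top, as binary."""
    position = 0
    for k, group in enumerate(groups):
        if sum(block[i] for i in group) % 2 == 1:  # does check k fail?
            position += 2 ** k                     # then it contributes 2^k
    return position
```

Flip the bit at position 7 of a well-prepared block, for instance, and checks 1, 2, and 3 fail while check 4 passes, so the function returns 1 + 2 + 4 = 7.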
Now if you want to see why this magic happens, take these 16 index labels for our positions
but instead of writing them in base 10, let's write them all in binary running from 0000
up to 1111.
As we put these binary labels back into their boxes, let me emphasize that they are distinct
from the data that's actually being sent; they’re nothing more than a conceptual label
to help you and me understand where the four parity groups came from.
The elegance of having everything that we’re looking at being described in binary is maybe
undercut by the confusion of having everything we’re looking at being described in binary.
It’s worth it, though.
Focus your attention just on that last bit of all of these labels, and then highlight
the positions where that final bit is a 1.
What we get is the first of our four parity groups, which means you can interpret that
check as asking: “hey, if there’s an error, is the final bit in the position of that error
a 1?”.
Similarly, if you focus on the second to last bit and highlight all the positions where
that's a 1, you get the second parity group from our scheme.
In other words, that second check is asking: “Hey, me again, if there’s an error, is
the second to last bit in its position a 1?”.
And so on, the third parity check covers every position whose third to last bit is turned
on... and the last one covers the last 8 positions, those ones whose highest order bit is a 1.
Everything we did earlier is the same as answering these four questions, which in turn is the
same as spelling out a position in binary.
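In code, that observation means the four groups never needed to be written out by hand; they fall straight out of the binary labels (again, just an illustrative snippet of mine, not the video's):

```python
# Group k is simply "every position whose k-th bit, counting from the right, is a 1".
groups_from_bits = [
    [i for i in range(16) if (i >> k) & 1]
    for k in range(4)
]
# groups_from_bits[0] == [1, 3, 5, 7, 9, 11, 13, 15], and so on,
# matching the four parity groups exactly.
```

The same comprehension with range(64) and range(6) hands you the six groups for a 64-position block, which is exactly the generalization coming up next.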
I hope this makes two things clearer: The first is how to systematically generalize
to block sizes that are bigger powers of 2.
If it takes more bits to describe each position, like 6 bits to describe 64 spots, then each
of those bits gives you one of the parity groups that we need to check.
Those of you who watched the chessboard puzzle I did with Matt Parker might find all this
exceedingly familiar.
It’s the same core logic, but solving a different problem, and applied to a 64 square
chessboard.
The second thing I hope this makes clear is why our parity bits are sitting in positions
that are powers of two, for example, 1, 2, 4, and 8.
These are the positions whose binary representation has just a single bit turned on.
What that means, is that each of those parity bits sits inside one and only one of the four
parity groups.
You can also see this in larger examples, where no matter how big you get, each parity
bit conveniently touches only one of the groups.
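If you want to see that in a terminal, a quick illustrative check (mine, not the video's):

```python
# Positions 1, 2, 4, 8 each have exactly one bit set in binary,
# so each parity bit lands in one and only one of the four groups.
for p in [1, 2, 4, 8]:
    print(p, format(p, '04b'))   # 1 -> 0001, 2 -> 0010, 4 -> 0100, 8 -> 1000
```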
Once you understand that these parity checks that we’ve focused so much of our time on
are nothing more than a clever way to spell out the position of an error in binary, well
then we can draw a connection with a different way to think about Hamming codes, one that
is arguably a lot simpler and more elegant, and which can basically be written down with
a single line of code.
It’s based on the xor function.
Xor, for those of you who don’t know, stands for “exclusive or”; when you take the
xor of two bits, it's going to return a 1 if either one of those bits is turned on,
but not if both are turned on or if both are turned off.
Phrased differently, it’s the parity of these two bits.
As a math person, I prefer to think about it as addition mod 2.
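As a tiny sanity check (my own, not from the video), the two descriptions agree:

```python
# xor of two bits is exactly their sum mod 2
for a in (0, 1):
    for b in (0, 1):
        assert (a ^ b) == (a + b) % 2
```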
We also commonly talk about the xor of two different bit strings, which basically does
this component-by-component; it’s like addition but where you never carry.
Again, the more mathematically inclined might prefer to think of this as adding two vectors
and reducing mod 2.
If you open up some Python right now and apply the caret operator (^) between two integers,
this is what it’s doing under the hood, but to the bit representations of those numbers.
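For example (arbitrary values, just to illustrate):

```python
a, b = 0b1101, 0b0111
print(format(a ^ b, '04b'))   # 1010 -- each column added with no carrying
```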
The key point for you and me is that taking the xor of many different bit strings is effectively
a way to compute the parities of a bunch of separate groups, like so with the columns,
all in one fell swoop.
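Here's that idea as a snippet, with a few arbitrary bit strings standing in for the columns of the picture:

```python
from functools import reduce
from operator import xor

values = [0b1001, 0b0011, 0b1110, 0b0111]   # arbitrary example bit strings
print(format(reduce(xor, values), '04b'))   # 0011 -- each output bit is one column's parity
```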
This gives us a rather snazzy way to think about the multiple parity checks from our
Hamming code algorithm as all being packaged together into one single operation.
Though at first glance it does look very different.
Specifically, write down the 16 positions in binary, like we had before, and now highlight
only the positions where the message bit is turned on to a 1.
And then collect these positions into one big column and take the xor.
You can probably guess that the four bits sitting at the bottom, as a result, are the
same as the four parity checks that we’ve come to know and love, but take a moment to
think about why that's true.
This last column, for example, is counting all of the positions whose last bit is a 1,
but we're already limited only to the highlighted positions, so it’s effectively counting
how many highlighted positions came from the first parity group.
Does that make sense?
Likewise, the next column counts how many positions are in the second parity group,
the positions whose second to last bit is a 1, and which are also highlighted.
And so on, it’s really a small shift in perspective on the same thing that we’ve
been doing.
And so you know where it goes from here, the sender is responsible for toggling some of
the special parity bits to make sure this sum works out to be 0000.
Now once we have it like this, this gives a really nice way to think about why these
four resulting bits at the bottom directly spell out the position of an error.
Let's say some bit in this block gets toggled from a 0 to a 1. What that means is that the
position of that bit is now going to be included in the total xor, which changes the sum from
being 0 to instead being this newly included value: the position of the error.
Slightly less obviously, the same is true if there’s an error that changes a 1 to
a 0.
You see, if you add the same bit string into the total twice, it’s the same as not having it there at
all, basically because in this world 1 + 1 = 0.
So adding a copy of this position to the total sum has the same effect as removing it, and
that effect again is that the total result at the bottom here spells out the position
of the error.
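In symbols, that cancellation looks like this (a small illustrative check):

```python
# xor-ing the same position in twice leaves the total unchanged: 1 + 1 = 0 here
total, position = 0b0000, 0b0110
assert total ^ position ^ position == total
```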
To illustrate how elegant this is, let me show that one line of python code I referenced
before, which will capture almost all of the logic on the receiver’s end.
We'll start by creating a random array of 16 1’s and 0’s to simulate the data block,
and I'll go ahead and give it the name "bits".
But of course, in practice, this would be something that we’re receiving from a sender.
And instead of being random, it would be carrying 11 data bits together with 5 parity bits.
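The setup being described is presumably something along these lines (the exact code in the video may differ):

```python
import random

# simulate a received block of 16 bits
bits = [random.randint(0, 1) for _ in range(16)]
```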
If I call the function “enumerate(bits)”, what it does is pair together each of those
bits with a corresponding index, in this case from 0 up to 15.
So if we then create a list that loops over all of these pairs, pairs that look like (i,
bit), and then we pull out just the i value, just the index... well, it’s not that exciting;
we just get back those indices 0 through 15.
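That step, roughly:

```python
[i for (i, bit) in enumerate(bits)]   # just 0, 1, 2, ..., 15 -- nothing exciting yet
```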
But if we add on the condition to only do this “if bit”, meaning if that bit is
a 1 and not a 0, then it pulls out only the positions where the corresponding bit is turned
on.
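With that condition added:

```python
[i for (i, bit) in enumerate(bits) if bit]   # only the positions holding a 1
```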
In this case it looks like those positions are at 0, 4, 6, 9, etc.
Remember, we want to collect together all of those positions, the positions of the bits
that are turned on, and then xor them together.
To do this in python, let me first import a couple helpful functions...that way we can
call “reduce” on this list and use the xor function to reduce it.
This basically eats its way through the list, taking xors along the way.
If you prefer, you can explicitly write out that xor function without importing from anywhere.
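Presumably the helpers being imported are functools.reduce and operator.xor, which would make the full line look something like this:

```python
from functools import reduce
from operator import xor   # or write it yourself: xor = lambda a, b: a ^ b

# xor together the positions of all the bits that are turned on
# (the trailing 0 just keeps it safe for an all-zero block)
reduce(xor, [i for (i, bit) in enumerate(bits) if bit], 0)
```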
So at the moment it looks like if we do this on our random block of 16 bits, it returns
"9", which has the binary representation 1001.
We won’t do it here, but you could write a function where the sender uses that binary
representation to set the four parity bits as needed, ultimately getting this block to
a state where running this line of code on the full list of bits returns a 0.
This would be considered a well-prepared block.
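A hypothetical version of that sender-side function, continuing from the imports above (the name and details are mine, not from the video), might look like this:

```python
def prepare_block(bits):
    """Flip whichever parity bits (positions 1, 2, 4, 8) are needed so that
    the xor of the 'on' positions comes out to 0."""
    bits = list(bits)
    syndrome = reduce(xor, [i for (i, bit) in enumerate(bits) if bit], 0)
    for k in range(4):
        if (syndrome >> k) & 1:   # if check k currently fails...
            bits[2 ** k] ^= 1     # ...toggling position 2^k fixes exactly that check
    return bits
```

This leaves position 0 alone; the extended version of the scheme would also set it so that the whole block has even parity, which matters for the meta-parity check mentioned in a moment.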
Now what’s cool is that if we toggle any one of the bits in this list, simulating a
random error from noise, then running this same line of code prints
out the position of that error.
Isn’t that neat?
You could get this block from out of the blue, run this single line on it, and what it'll
do is automatically spit out the position of an error, or a 0 if there wasn't any.
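Concretely, with the hypothetical prepare_block from above:

```python
block = prepare_block(bits)
block[9] ^= 1   # simulate noise flipping, say, position 9

reduce(xor, [i for (i, bit) in enumerate(block) if bit], 0)   # -> 9
```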
And there’s nothing special about the size 16 here, this same line of code would work
if you had a list of, say, 256 bits.
Needless to say, there is more code to write here, like doing the meta-parity check to
detect two-bit errors, but the idea is that almost all of the core logic from our scheme
comes down to a single xor reduction.
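For completeness, here's a rough sketch of what that fuller receiver could look like, assuming the sender also set position 0 so the whole block has even parity (again my own illustrative code, not the video's):

```python
def decode(block):
    """Locate a single error, or flag a two-bit error via the meta parity check."""
    syndrome = reduce(xor, [i for (i, bit) in enumerate(block) if bit], 0)
    even_overall = sum(block) % 2 == 0      # the meta parity check over the whole block
    if syndrome == 0 and even_overall:
        return "no error detected"
    if syndrome != 0 and even_overall:
        return "two errors detected, cannot correct"
    return f"single error at position {syndrome}"
```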
Now, depending on your comfort with binary, and xors, and software in general, you may
either find this perspective a little bit confusing or so much more elegant and simple
that you’re wondering why we didn’t just start with it from the get-go.
Loosely speaking, the multiple parity check perspective is easier to think about when
implementing Hamming codes in hardware, and the xor perspective is easier to think about
when doing it in software, from kind of a higher level.
The first one is easier to actually carry out by hand, and I think it does a better
job instilling the core intuition underlying all of this, which is that the information
required to locate a single error is related to the log of the size of the block, or in
other words, it grows one bit at a time as the block size doubles.
The relevant fact here is that that information directly corresponds to how much redundancy
we need.
That’s really what runs against most people’s knee-jerk reactions when they first think
about making a message resilient to errors, where copying the whole message is
usually the first instinct that comes to mind.
And then, by the way, there’s this whole other way that you sometimes see Hamming codes
presented, where you multiply the message by one big matrix. It's kind of nice because
it relates them to the broader family of "linear codes", but I think it gives almost no intuition
for where they come from or how they scale.
And speaking of scaling, you might notice that the efficiency of this scheme only gets
better as we increase the block size.
For example, we saw that with 256 bits, you're using only 3% of that space for redundancy.
And it just keeps getting better from there.
As the number of parity bits grows one-by-one, the block size keeps doubling.
And if you take that to an extreme, you could have a block with, say, a million bits, where
you would quite literally be playing 20 questions with your parity checks, and it uses only
21 parity bits.
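The arithmetic behind that claim, if you want to play with it (a block of 2^r positions needs r Hamming checks plus the one meta parity bit):

```python
for r in [4, 8, 16, 20]:
    block = 2 ** r
    parity = r + 1   # r Hamming checks plus the meta parity bit
    print(f"{block:>9} bits per block, {parity} parity bits: {parity / block:.2%} redundancy")
```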
And if you step back to think about looking at a million bits and locating a single error,
that genuinely feels crazy.
The problem, of course, is that with a larger block, the probability of seeing more than
one or two bit errors goes up, and Hamming codes do not handle anything beyond that.
So in practice, what you’d want is to find the right size so that the probability of
too many bit flips isn’t too high.
Also, in practice errors tend to come in little bursts, which would totally ruin a single
block, so one common tactic to help spread a burst of errors across many different
blocks is to interlace those blocks, like this, before they're sent out or stored.
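A toy illustration of what interlacing could mean in code (my own sketch):

```python
def interlace(blocks):
    """Transmit the first bit of every block, then the second bit of every block,
    and so on, so that a burst of consecutive errors gets spread thinly
    across many different blocks."""
    return [block[i] for i in range(len(blocks[0])) for block in blocks]
```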
Then again, a lot of this is rendered completely moot by more modern codes, like the much more
commonly used Reed-Solomon algorithm, which handles burst errors particularly well, and
it can be tuned to be resilient to a larger number of errors per block.
But that’s a topic for another time.
In his book “The Art of Doing Science and Engineering”, Hamming is wonderfully candid
about just how meandering his discovery of this code was.
He first tried all sorts of different schemes involving organizing the bits into parts of
a higher dimensional lattice and strange things like this.
The idea that it might be possible to get parity checks to conspire in a way that spells
out the position of the error only came to Hamming when he stepped back after a bunch
of other analysis and asked: “okay, what is the most efficient I could conceivably be with this?”
He was also candid about how important it was that parity checks were already on his mind,
something that would have been way less common back in the 1940s than it is today.
There are like half a dozen times throughout this book that he references the Louis Pasteur
quote “luck favors a prepared mind."
Clever ideas often look deceptively simple in hindsight, which makes them easy to underappreciate.
Right now my honest hope is that Hamming codes, or at least the possibility of such codes,
feels almost obvious to you.
But you shouldn’t fool yourself into thinking that they actually are obvious, because they
definitely aren’t.
Part of the reason that clever ideas look deceptively easy is that we only ever see
the final result, cleaning up what was messy, never mentioning all the wrong turns, underselling
just how vast the space of explorable possibilities is at the start of the problem-solving process,
all of that.
But while that’s true in general, for some special inventions I think there’s a second,
deeper reason that we underappreciate them.
Thinking of information in terms of bits had only really coalesced into a full theory
in 1948, with Claude Shannon’s seminal paper on information theory.
This was essentially concurrent with when Hamming developed his algorithm.
This was the same foundational paper that showed, in a certain sense, that efficient
error correction is always possible, no matter how high the probability of bit flips, at
least in theory.
Shannon and Hamming, by the way, shared an office at Bell Labs, despite working on very
different things, which hardly seems coincidental here.
Fast forward several decades, and these days many of us are so immersed in thinking about
bits and information that it's easy to overlook just how distinct this way of thinking was.
Ironically, the ideas that most profoundly shape the ways that a future generation thinks
will end up looking, to that future generation, well, simpler than they really are.