Disclaimer: This video was produced in collaboration with the US Census Bureau and fact-checked
by Census Bureau scientists; any opinions and errors are my own.
Every ten years the US Census Bureau surveys the American population - the ambitious goal
is to count every person currently living in the entire United States of America and
collect information about them like age, sex, race and ethnicity.
The whole purpose of doing surveys like the census (and many other big medical or demographic
surveys) is to be able to get an overall, quantitative picture of a particular population
- how many people live in Minnesota?
What’s their average age?
And how do these things differ in different places, or by sex, or race?
The results of the US Census are of particular political relevance since they’re used to
determine the numbers of seats that different states get in the US House of Representatives
as well as the boundaries of legislative districts from Congress down to city councils, but big
surveys are also useful for understanding lots of other issues, too. The problem, of
course, is that the Census (like many other medical and demographic studies) is supposed
to be private.
Like, no one outside the Census Bureau is supposed to be able to look at just the published
statistics about the US population demographics and definitively figure out that there’s
a white married male 31-year old with no kids living in my neighborhood (that’s me).
The Census Bureau is supposed to keep my information confidential.
And they’re supposed to keep the information of every single other person living in the
United States confidential, too.
Which is a tall order, because how can you keep everyone’s information entirely confidential
while still saying anything at all based on that information?
The short answer is that you can’t.
There’s an inherent tradeoff between publishing something you learn from a survey and maintaining
the privacy of the participants.
It might seem like you could just remove people’s names from the spreadsheet, or only publish
summaries like averages and totals.
But it’s easy to reconnect names to datasets using powerful computers, and there’s a
mathematical theorem that guarantees that if you do a study, every single piece of accurate
information that you release, however small it seems, will inherently violate the privacy
of the participants in that study to some degree.
And the more information you publicly release, the more you violate the individual privacies
of the participants.
But how do you quantitatively measure something nebulous like loss of privacy, and then how
do you protect it?
To understand how to measure privacy, it’s helpful to start by imagining how somebody
would try to use published results (from a study) to piece together the private information
of the people surveyed.
They could just try to steal or gain direct access to the private information itself, which,
of course, can’t be protected against mathematically - it requires good computer security, or physical
defenses, so we won’t consider it here!
The kind of privacy attack we can defend against mathematically is an attack that looks at
publicly published statistics and then applies brute force computational power to imagine
all possible combinations of answers the participants could have given to see which ones are the
most plausible - that is, which ones fit the published statistics the best.
Imagine checking all possible combinations of letters and numbers for a password until
one of them works, except instead of letters and numbers it’s checking all possible “combinations-of-the-answers-that-330-million-people-could-give-on-their-census-questionnaires”
to see which combinations come closest to the publicly published figures for average
age, racial breakdown, and so on.
The more closely a potential combination of answers matches the published figures, the
more promising a candidate it is (from the attacker’s perspective).
The more poorly it matches, the lower their level of certainty.
As a small example, if there are 7 people living in a particular area and you tell me
that four are female, four like ice cream, four are married adults, three of the ice
cream lovers are female, and if you also give me the mean and median ages for all of these
categories, then I can perfectly reconstruct the exact ages, sex, and ice cream preference
of everyone involved.
I would start with the 3 ice cream loving females; even though there are hundreds of
thousands of possible combinations of ages for three people, only a small fraction of
those - 36, in fact - are plausible - they’re in the right combination to give a median
age of 36 and a mean age of 36 and two thirds.
And the same thing works for the four females overall - there are almost 10 million possible
combinations of ages they could have, but only 24 age combinations that are consistent
with a median of 30, a mean of 33.5, AND with at least one of the plausible age combinations
for the three ice-cream lovers.
Continuing on with this kind of deduction leads to a single plausible (and perfect)
reconstruction of all of the ages, sexes, and ice-cream preferences of the people involved;
a 100% violation of privacy.
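The first step of that deduction is small enough to brute-force directly. Here is a minimal sketch, assuming ages are whole numbers from 1 to 125 (that range is my assumption, not something stated in the video); the helper name `plausible_age_sets` is made up for illustration. It tries every possible set of three ages and keeps only the ones that fit the published median of 36 and mean of 36 and two thirds (a sum of 110).

```python
from itertools import combinations_with_replacement

# Assumption (not from the video): ages are whole numbers from 1 to 125.
AGES = range(1, 126)

def plausible_age_sets(group_size, median, age_sum):
    """Return every sorted tuple of ages matching the published median and sum.

    Only handles odd group sizes, where the median is the middle value.
    """
    matches = []
    for ages in combinations_with_replacement(AGES, group_size):
        # Tuples come out in non-decreasing order, so the middle
        # element of an odd-sized tuple is its median.
        if ages[group_size // 2] == median and sum(ages) == age_sum:
            matches.append(ages)
    return matches

# Three ice-cream-loving females: median 36, mean 36 2/3, so the ages sum to 110.
candidates = plausible_age_sets(group_size=3, median=36, age_sum=110)
total = sum(1 for _ in combinations_with_replacement(AGES, 3))
print(f"{len(candidates)} of {total} age combinations fit the published figures")
```

Under that age range the search leaves 36 candidates out of a few hundred thousand, and each additional published statistic would be one more filter applied to the same list.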
If, however, you didn’t list how many of the ice cream lovers were female, there would
instead be two plausible possibilities, so I would be less certain which was the true
combination of ages and genders and ice cream preferences.
And the potential level of certainty of an attacker is precisely how we measure the loss
of privacy from publishing results of a study.
If all possible combinations of ages and sexes and so on are similarly plausible, then an
attacker can’t distinguish between them very well and so privacy is well protected.
But if a small number of the possibilities are significantly more plausible than the
rest, they stand out - and precisely because they stand out on plausibility, they’re
also likely to be close to the truth.
So to protect privacy, all possibilities need to seem similarly plausible, or at least there
can’t be plausibility peaks that are too conspicuous.
The potential for plausibility peaks is quantified mathematically by measuring the maximum slope
of the graph - if the slope never gets too steep, then you can’t have any sharp peaks
of highly plausible possibilities that stand out.
But how do we publish statistics in a
way that limits the maximum slope (and possible peaks) on the plausibilities plot?
In practice, the best way to limit an attacker’s ability to confidently choose one scenario
over the other is to randomly change, or “jitter”, the published values.
Like, for example, rolling a die and adding that number to the average age reported for
Jittering the published results in a mathematically rigorous way puts a limit on the slope of
the plausibility graph, and thus makes it harder for any particular possibilities to
stand out above the rest.
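To make the slope limit concrete, here is a rough sketch under two assumptions of mine that the video does not spell out: the jitter is Laplace-distributed noise (a common choice in the privacy literature), and "slope" means the change in log-plausibility between neighbouring candidate values. The function names and the example numbers are hypothetical.

```python
import math

def log_plausibility(candidate, published, scale):
    """Log-density of seeing `published` if the truth were `candidate`,
    assuming Laplace noise with the given scale was added."""
    return -abs(published - candidate) / scale - math.log(2 * scale)

def max_slope(published, candidates, scale):
    """Largest change in log-plausibility between neighbouring candidates."""
    logs = [log_plausibility(c, published, scale) for c in candidates]
    return max(abs(b - a) for a, b in zip(logs, logs[1:]))

published_age = 34          # hypothetical jittered value
candidates = range(0, 101)  # hypothetical candidate true ages, one year apart
for scale in (1.0, 5.0):
    print(scale, max_slope(published_age, candidates, scale))
```

With this kind of noise the steepest step between neighbouring candidates is one over the noise scale, so adding more noise flattens the graph and leaves less room for a sharp peak.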
Jittering results might also seem like lying, but as long as the size of the adjustment
isn’t big enough to make any significant changes to conclusions people draw from the
survey, then it’s considered worth it for the privacy protection.
For example, imagine I want to give you a sense of my age while keeping my true age private.
If I just told you my age, obviously there’s just one plausible possibility - 31!
But suppose instead that I secretly pulled a number between minus 5 and 5 out of a hat
and added it to my age before telling you. In this case, all you know is that my true
age is somewhere within 5 years of the number I told you, but you don’t know my age exactly.
My privacy has been preserved, though only to a certain degree because you can be confident
I’m not 20 and not 40.
To protect my age more, I’d have to pull a number between, say, -10 and 10 out of a
hat and add it to my age - this increases the number of plausible possibilities - that
is, the possible true ages that COULD have resulted in the number I told you.
It also increases your uncertainty about my actual age - the tradeoff for privacy is inaccuracy.
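The hat trick is easy to sketch in code. This version assumes the hat holds whole numbers and that each is equally likely to be drawn (both are my assumptions); the function names are made up for illustration.

```python
import random

def report_age(true_age, max_jitter, rng):
    """Add a whole number drawn uniformly from -max_jitter..max_jitter."""
    return true_age + rng.randint(-max_jitter, max_jitter)

def plausible_true_ages(reported, max_jitter):
    """Every true age that could have produced the reported number."""
    return list(range(reported - max_jitter, reported + max_jitter + 1))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
for max_jitter in (1, 5, 10):
    reported = report_age(31, max_jitter, rng)
    options = plausible_true_ages(reported, max_jitter)
    print(f"±{max_jitter}: reported {reported}, {len(options)} plausible true ages")
```

A wider hat always leaves more plausible true ages (3, 11, and 21 for the three ranges here), which is exactly the privacy gained and the accuracy given up.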
If I wanted you to know my age within a year, I could only pull a number between -1 and
1 out of the hat.
In general, the idea is this: more privacy means you get less accuracy. Less
privacy means you can have more accuracy. When you publish results, hopefully there’s a
sweet spot where you can share something useful while still sufficiently maintaining people’s
privacy.
And simultaneously maintaining decent privacy and decent accuracy gets easier and easier
with larger datasets.
Like how as I add more noise to this image, you can still get the general picture even
once you’ve lost any hope of telling the true original value of a particular pixel.
So, to protect people’s privacy, we can and should randomly jitter published statistics
(which the US Census, for example, has been doing since the 1970s).
However, there’s a subtlety - you can’t just add any old random noise however frequently
you want - if I simply add different random noise to this picture a bunch of different
times, once you take the average of all of the noisy images you basically get back the
original clean image - you don’t want this happening to your data.
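The same thing happens with a single number instead of a picture. Here is a small sketch, assuming fresh uniform noise of up to plus or minus 10 is added on every release (the noise shape, the seed, and the function name are all illustrative choices of mine).

```python
import random
from statistics import mean

def noisy_release(true_value, max_jitter, rng):
    """One published value: the truth plus fresh uniform noise."""
    return true_value + rng.uniform(-max_jitter, max_jitter)

rng = random.Random(42)  # fixed seed, chosen arbitrarily
true_value = 31.0        # the secret the noise is meant to hide
releases = [noisy_release(true_value, 10, rng) for _ in range(10_000)]

print("first release:", round(releases[0], 2))
print("average of all releases:", round(mean(releases), 2))
```

Any one release can be off by up to 10, but the average of thousands of them lands very close to the secret value, because independent noise cancels out.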
So, there’s a whole field of computer science dedicated to figuring out how to add the least
possible amount of noise to get both the most privacy and the most accuracy, and to future-proof
the publication of data so that when you publish multiple jittered statistics about people,
those statistics can’t be combined in a clever way to reconstruct people’s data.
But up through the 2010 census, the Census Bureau couldn’t promise this - sure, they
were jittering data published in Census Bureau tables and charts, but not in a mathematically
rigorous way, and so the Census Bureau couldn’t mathematically promise anything about how
much they were protecting our privacy (or say how badly it’s been violated).
The US 2020 Census will, for the first time, be using mathematically rigorous privacy protections.
One of the biggest benefits of the mathematically rigorous definition of privacy is that it
reliably compounds over multiple pieces of information - like, if we have a group of
people and publish both their average age and median age, each with a privacy loss factor
of 3, then the privacy loss factor for having released both pieces of information is at
most 6.
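The bookkeeping here can be sketched in a few lines, assuming the privacy loss factors of separate releases simply add up to give the worst case (my reading of how the losses compound; the function names are hypothetical).

```python
def total_privacy_loss(losses):
    """Upper bound on cumulative loss, assuming losses simply add."""
    return sum(losses)

def split_budget(total_budget, num_releases):
    """Divide a total privacy-loss budget evenly across several releases."""
    return [total_budget / num_releases] * num_releases

# Average age and median age, each released with a loss factor of 3.
print(total_privacy_loss([3, 3]))

# The same total budget spread evenly over ten releases instead.
per_release = split_budget(6, 10)
print(per_release[0], total_privacy_loss(per_release))
```

An even split is only one option; nothing in this sketch stops a publisher from spending more of the budget on the statistics that matter most.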
So you can decide on a total cumulative amount of privacy loss you’re willing to suffer,
and then decide whether you want to release, say, 10 pieces of information each with 1/10th
that total privacy loss (and less accuracy), or if you want to release 1 piece of information
with the full privacy loss and a higher level of accuracy.
But how much privacy we need is
a really hard question to answer.
First, it involves weighing how much we as society collectively value the possible benefits
from accurately knowing stuff about the group we’re surveying vs the possible drawbacks
of releasing some amount of private information.
And second, even though those benefits and drawbacks can be mathematically measured as
“accuracy” and “privacy loss”, we still have to translate the mathematical ideas
of “accuracy” and “privacy loss” into something that’s understandable and relatable
to people in our society.
That’s partly a goal of this video, in fact!
So let’s give it one more shot at a translation.
First and foremost: it is in principle impossible
to publish useful statistics based on private data without in some way violating the privacy
of the individuals in question.
And if you want to provide a mathematically guaranteed limit on the amount of privacy
violation, you have to randomly jitter the statistics to protect the private data.
The
accuracy of the information after being jittered is generally described probabilistically,
by saying something like “if we randomly jittered the true population of this town
a bunch of times, 98% of the time our jittered statistic would be within 10 people of the
true value.”
So accuracy has two components: how close you want your privacy-protected statistic
to be to the real answer, and how likely it is to be that close.
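Those two components can be tied to the amount of noise. Here is a minimal sketch, assuming the jitter is Laplace-distributed (my assumption; the video does not name a noise distribution) and using the standard formula for the chance that such noise stays within a given distance of zero. The function names are hypothetical.

```python
import math

def prob_within(tolerance, scale):
    """P(|noise| <= tolerance) for Laplace noise with the given scale."""
    return 1 - math.exp(-tolerance / scale)

def scale_for(tolerance, probability):
    """Largest Laplace scale that stays within `tolerance` with `probability`."""
    return -tolerance / math.log(1 - probability)

# "Within 10 people, 98% of the time" pins down how much noise is allowed.
scale = scale_for(tolerance=10, probability=0.98)
print(f"noise scale: {scale:.2f}")
print(f"chance of being within 10 people: {prob_within(10, scale):.2%}")
```

Asking for a tighter tolerance or a higher probability forces a smaller noise scale, which is the accuracy side of the privacy tradeoff.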
The loss of privacy due to the publication of information is described in terms of how
confidently an attacker would be able to single out a particular possibility for the true
data, that is, in terms of the plausibility of different possible true values for the underlying data.
Given the published information, are there just a few possibilities for the true data?
Or are there many, many, plausible possibilities for what the true data might be?
Essentially, loss of privacy is measured by the prominence of peaks on the plausibility
graph.
And so the protection of privacy requires policing the possibility for such peaks.
If we individuals are going to willingly participate in scientific or other studies and surveys
or use services where we reveal potentially sensitive personal information, we should
really demand that the researchers or organizations utilize a mathematically robust way of protecting
our privacy.
Simply put, if they can’t guarantee there won’t be a peak in plausibility, then we
shouldn’t agree to give them a peek at our data.
SPONSORSHIP MESSAGE
Thanks to the U.S. Census Bureau for supporting
The founders of the US understood that an accurate and complete population count is
necessary for the fair implementation of a representative democracy, so a regular census
is enshrined in the US Constitution.
The US 2020 Census will be the first anywhere to use modern, mathematically guaranteed privacy
safeguards to protect respondents from today’s privacy threats.
These new safeguards will protect confidentiality while allowing the Census Bureau to deliver
the complete and accurate count of the nation’s population.
They will also give those who rely on census data increased clarity regarding the impact
that statistical safeguards have on their analyses and decision-making.
In short, the Census Bureau views the adoption of a mathematical guarantee of privacy as
a win-win.
Here’s how the chief scientist at the Census Bureau thinks about it: there
is a real choice that every curator of confidential survey data has to make.
If they want the respondents to trust them to protect confidentiality, then the curator
has to be prepared to give (and implement) mathematically provable guarantees of privacy.
Unfortunately, this means there’s a constraint on the amount of information you can publish
from confidential data.
It’s mathematically impossible to provide perfectly accurate answers for as many questions
or statistics as you want while also protecting the privacy of respondents.
So curators need to do two things: understand the needs and desires of the people who provided
data and the people who want to use the data in order to determine precisely what balance
of accuracy vs privacy to choose, and then not waste that limited privacy budget by publishing
accurate answers to unimportant questions.