Disclaimer: This video was produced in collaboration with the US Census Bureau and fact-checked

by Census Bureau scientists; any opinions and errors are my own.

Every ten years the US Census Bureau surveys the American population - the ambitious goal

is to count every person currently living in the entire United States of America and

collect information about them like age, sex, race and ethnicity.

The whole purpose of doing surveys like the census (and many other big medical or demographic

surveys) is to be able to get an overall, quantitative picture of a particular population

- how many people live in Minnesota?

Or Mississippi?

What’s their average age?

And how do these things differ in different places, or by sex, or race?

The results of the US Census are of particular political relevance since they’re used to

determine the numbers of seats that different states get in the US House of Representatives

as well as the boundaries of legislative districts from Congress down to city councils, but big

surveys are also useful for understanding lots of other issues, too. The problem, of

course, is that the Census (like many other medical and demographic studies) is supposed

to be private.

Like, no one outside the Census Bureau is supposed to be able to look at just the published

statistics about the US population demographics and definitively figure out that there’s

a white married male 31-year-old with no kids living in my neighborhood (that’s me).

The Census Bureau is supposed to keep my information confidential.

And they’re supposed to keep the information of every single other person living in the

United States confidential, too.

Which is a tall order, because how can you keep everyone’s information entirely confidential

while still saying anything at all based on that information?

The short answer is that you can’t.

There’s an inherent tradeoff between publishing something you learn from a survey and maintaining

the privacy of the participants.

It might seem like you could just remove people’s names from the spreadsheet, or only publish

summaries like averages and totals.

But it’s easy to reconnect names to datasets using powerful computers, and there’s a

mathematical theorem that guarantees that if you do a study, every single piece of accurate

information that you release, however small it seems, will inherently violate the privacy

of the participants in that study to some degree.

And the more information you publicly release, the more you violate the individual privacies

of the participants.

But how do you quantitatively measure something nebulous like loss of privacy, and then how

do you protect it?

To understand how to measure privacy, it’s helpful to start by imagining how somebody

would try to use published results (from a study) and piece together the private information

of the people surveyed.

They could just try to steal or gain direct access to the private information itself, which,

of course, can’t be protected against mathematically - it requires good computer security, or physical

defenses, so we won’t consider it here!

The kind of privacy attack we can defend against mathematically is an attack that looks at

publicly published statistics and then applies brute force computational power to imagine

all possible combinations of answers the participants could have given to see which ones are the

most plausible - that is, which ones fit the published statistics the best.

Imagine checking all possible combinations of letters and numbers for a password until

one of them works, except instead of letters and numbers it’s checking all possible “combinations-of-the-answers-that-330-million-people-could-give-on-their-census-questionnaires”

to see which combinations come closest to the publicly published figures for average

age, racial breakdown, and so on.

The more closely a potential combination of answers matches the published figures, the

more promising a candidate it is (from the attacker’s perspective).

The more poorly it matches, the lower their level of certainty.

As a small example, if there are 7 people living in a particular area and you tell me

that four are female, four like ice cream, four are married adults, three of the ice

cream lovers are female, and if you also give me the mean and median ages for all of these

categories, then I can perfectly reconstruct the exact ages, sex, and ice cream preference

of everyone involved.

I would start with the 3 ice cream loving females; even though there are hundreds of

thousands of possible combinations of ages for three people, only a small fraction of

those - 36, in fact - are plausible - they’re in the right combination to give a median

age of 36 and a mean age of 36 and two thirds.

And the same thing works for the four females overall - there are almost 10 million possible

combinations of ages they could have, but only 24 age combinations that are consistent

with a median of 30, a mean of 33.5, AND with at least one of the plausible age combinations

for the three ice-cream lovers.

Continuing on with this kind of deduction leads to a single plausible (and perfect)

reconstruction of all of the ages, sexes, and ice-cream preferences of the people involved;

a 100% violation of privacy.
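The first step of that deduction can be sketched in a few lines of Python. This is only an illustration, not the method any real attacker or the Census Bureau uses: it assumes ages are whole numbers from 1 to 125 (the range isn't stated here), and the function name `plausible_age_sets` is made up for the example. A mean of 36 and two thirds for three people means their ages sum to 110, so the code checks the sum instead of the mean to avoid rounding issues.

```python
from itertools import combinations_with_replacement

# Assumption: ages are whole years from 1 to 125 (the range is not given in the text).
AGES = range(1, 126)

def plausible_age_sets(group_size, median, age_sum):
    """Return every sorted age tuple that matches the published median and age total."""
    matches = []
    mid = group_size // 2
    for ages in combinations_with_replacement(AGES, group_size):
        # Tuples come out in non-decreasing order, so the median can be read by position.
        if group_size % 2:
            med = ages[mid]
        else:
            med = (ages[mid - 1] + ages[mid]) / 2
        if med == median and sum(ages) == age_sum:
            matches.append(ages)
    return matches

# Three ice-cream-loving females: median 36, mean 36 2/3, so the ages sum to 110.
candidates = plausible_age_sets(3, median=36, age_sum=110)
print(len(candidates))  # 36 under the assumptions above
```

Out of the 333,375 age combinations the loop tries, only 36 survive, and every extra published statistic would prune that list further.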

If, however, you didn’t list how many of the ice cream lovers were female, there would

instead be two plausible possibilities, so I would be less certain which was the true

combination of ages and genders and ice cream preferences.

And the potential level of certainty of an attacker is precisely how we measure the loss

of privacy from publishing results of a study.

If all possible combinations of ages and sexes and so on are similarly plausible, then an

attacker can’t distinguish between them very well and so privacy is well protected.

But if a small number of the possibilities are significantly more plausible than the

rest, they stand out - and precisely because they stand out on plausibility, they’re

also likely to be close to the truth.

So to protect privacy, all possibilities need to seem similarly plausible, or at least there

can’t be plausibility peaks that are too conspicuous.

The potential for plausibility peaks is quantified mathematically by measuring the maximum slope

of the graph - if the slope never gets too steep, then you can’t have any sharp peaks

of highly plausible possibilities that stand out. But how do we publish statistics in a

way that limits the maximum slope (and possible peaks) on the plausibilities plot?

In practice, the best way to limit an attacker’s ability to confidently choose one scenario

over the other is to randomly change, or “jitter”, the published values.

Like, for example, rolling a die and adding that number to the average age reported for

ice-cream lovers.

Jittering the published results in a mathematically rigorous way puts a limit on the slope of

the plausibility graph, and thus makes it harder for any particular possibilities to

stand out above the rest.
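Here's a rough sketch of what "a limit on the slope" can look like. It assumes the jitter is drawn from a Laplace distribution, which is a common choice in the privacy literature but is an assumption here, since the noise distribution isn't named. The names `plausibility` and `max_neighbor_ratio` and the numbers are hypothetical. With Laplace noise of a given scale, the plausibility of two candidate true values that differ by 1 can differ by at most a factor of e^(1/scale).

```python
import math

def plausibility(published, candidate, scale):
    """Relative plausibility that `candidate` is the true value, given the
    published number, assuming Laplace noise with the given scale was added."""
    return math.exp(-abs(published - candidate) / scale) / (2 * scale)

def max_neighbor_ratio(published, candidates, scale):
    """Largest plausibility ratio between any two adjacent candidate values."""
    ratios = []
    for a, b in zip(candidates, candidates[1:]):
        pa = plausibility(published, a, scale)
        pb = plausibility(published, b, scale)
        ratios.append(max(pa / pb, pb / pa))
    return max(ratios)

scale = 2.0  # hypothetical noise scale; bigger means more jitter
ratio = max_neighbor_ratio(published=34.3, candidates=list(range(20, 51)), scale=scale)
print(ratio, math.exp(1 / scale))  # the ratio never exceeds e^(1/scale)
```

Making the scale bigger pushes that cap closer to 1, which flattens the plausibility graph at the cost of a noisier published number.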

Jittering results might also seem like lying, but as long as the size of the adjustment

isn’t big enough to make any significant changes to conclusions people draw from the

survey, then it’s considered worth it for the privacy protection.

For example, imagine I want to give you a sense of my age while keeping my true age

secret.

If I just told you my age, obviously there’s just one plausible possibility - 31!

But suppose instead that I secretly pulled a number between minus 5 and 5 out of a hat

and added it to my age before telling you. In this case, all you know is that my true

age is somewhere within 5 years of the number I told you, but you don’t know my age exactly.

My privacy has been preserved, though only to a certain degree because you can be confident

I’m not 20 and not 40.

To protect my age more, I’d have to pull a number between, say, -10 and 10 out of a

hat and add it to my age - this increases the number of plausible possibilities - that

is, the possible true ages that COULD have resulted in the number I told you.

It also increases your uncertainty about my actual age - the tradeoff for privacy is inaccuracy.
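As a tiny sketch of the hat example, assuming the hat only holds whole numbers: given the number I tell you, you can list every true age that could have produced it. The function name `plausible_true_ages` and the reported value 33 are made up for illustration.

```python
def plausible_true_ages(reported, max_offset):
    """All true ages that could have produced `reported` if a whole-number
    offset between -max_offset and +max_offset was added to the true age."""
    return list(range(reported - max_offset, reported + max_offset + 1))

reported = 33  # hypothetical number I told you (true age 31 plus an offset of 2)
narrow = plausible_true_ages(reported, 5)
wide = plausible_true_ages(reported, 10)
print(len(narrow), len(wide))  # 11 21
```

Doubling the range of numbers in the hat roughly doubles the number of ages you have to consider, which is the privacy gain, and it also doubles how far off the reported number can be, which is the accuracy cost.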

If I wanted you to know my age within a year, I could only pull a number between -1 and

1 out of the hat. In general, the idea is this: more privacy means you get less accuracy. Less

privacy means you can have more accuracy. When you publish results, hopefully there’s a

sweet spot where you can share something useful while still sufficiently maintaining people’s

privacy.

And simultaneously maintaining decent privacy and decent accuracy gets easier and easier

with larger datasets.

Like how as I add more noise to this image, you can still get the general picture even

once you’ve lost any hope of telling the true original value of a particular pixel.

So, to protect people’s privacy, we can and should randomly jitter published statistics

(which the US Census, for example, has been doing since the 1970s).

However, there’s a subtlety - you can’t just add any old random noise however frequently

you want - if I simply add different random noise to this picture a bunch of different

times, once you take the average of all of the noisy images you basically get back the

original clean image - you don’t want this happening to your data.
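The same averaging problem shows up with a single number. This is a minimal simulation, assuming fresh whole-number noise between -5 and 5 is added every time the value is released; the function name `average_of_noisy_reports` is hypothetical and the seed is fixed only so the run is repeatable.

```python
import random

def average_of_noisy_reports(true_value, max_offset, repeats, seed=0):
    """Release true_value + fresh uniform integer noise `repeats` times
    and return the average of all the releases."""
    rng = random.Random(seed)
    total = 0
    for _ in range(repeats):
        total += true_value + rng.randint(-max_offset, max_offset)
    return total / repeats

true_age = 31
one_release = average_of_noisy_reports(true_age, 5, 1)
many_releases = average_of_noisy_reports(true_age, 5, 10_000)
print(one_release, many_releases)
```

A single release can be off by up to 5 years, but the average of ten thousand independent releases lands very close to 31, so the noise has stopped protecting anything.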

So, there’s a whole field of computer science dedicated to figuring out how to add the least

possible amount of noise to get both the most privacy and the most accuracy, and to future-proof

the publication of data so that when you publish multiple jittered statistics about people,

those statistics can’t be combined in a clever way to reconstruct people’s data.

But up through the 2010 census, the Census Bureau couldn’t promise this - sure, they

were jittering data published in Census Bureau tables and charts, but not in a mathematically

rigorous way, and so the Census Bureau couldn’t mathematically promise anything about how

much they were protecting our privacy (or say how badly it’s been violated).

Until now!

The US 2020 Census will, for the first time, be using mathematically rigorous privacy protections.

One of the biggest benefits of the mathematically rigorous definition of privacy is that it

reliably compounds over multiple pieces of information - like, if we have a group of

people and publish both their average age and median age, each with a privacy loss factor

of 3, then the privacy loss factor for having released both pieces of information is at

most 6.
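That bookkeeping is simple enough to sketch. The helper names `total_privacy_loss` and `split_budget` are made up for illustration, and the sum is the upper bound on the combined loss ("at most 6"), not necessarily the exact loss.

```python
def total_privacy_loss(losses):
    """Upper bound on the cumulative privacy loss: the individual losses add up."""
    return sum(losses)

def split_budget(total_budget, n_statistics):
    """Evenly divide a total privacy-loss budget across several statistics."""
    return [total_budget / n_statistics] * n_statistics

# Average age and median age, each released with a privacy loss factor of 3.
print(total_privacy_loss([3, 3]))  # 6

# The same total budget of 6 spread evenly over 10 statistics instead.
shares = split_budget(6, 10)
print(shares[0], total_privacy_loss(shares))
```

The even split is just one choice; a curator could also give a larger share of the budget to the statistics that matter most.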

So you can decide on a total cumulative amount of privacy loss you’re willing to suffer,

and then decide whether you want to release, say, 10 pieces of information each with 1/10th

that total privacy loss (and less accuracy), or if you want to release 1 piece of information

with the full privacy loss and a higher level of accuracy. But how much privacy we need is

a really hard question to answer.

First, it involves weighing how much we as society collectively value the possible benefits

from accurately knowing stuff about the group we’re surveying vs the possible drawbacks

of releasing some amount of private information.

And second, even though those benefits and drawbacks can be mathematically measured as

“accuracy” and “privacy loss”, we still have to translate the mathematical ideas

of “accuracy” and “privacy loss” into something that’s understandable and relatable

to people in our society.

That’s partly a goal of this video, in fact!

So let’s give it one more shot at a translation. First and foremost: it is in principle impossible

to publish useful statistics based on private data without in some way violating the privacy

of the individuals in question.

And if you want to provide a mathematically guaranteed limit on the amount of privacy

violation, you have to randomly jitter the statistics to protect the private data. The

accuracy of the information after being jittered is generally described probabilistically,

by saying something like “if we randomly jittered the true population of this town

a bunch of times, 98% of the time our jittered statistic would be within 10 people of the

true value.”

So accuracy has two components: how close you want your privacy-protected statistic

to be to the real answer, and how likely it is to be that close.
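A statement like "98% of the time within 10 people" can be checked by simulation. This sketch assumes Laplace-distributed jitter, which is an assumption since the noise distribution isn't named here; the town population of 5000 and the function names are hypothetical. For Laplace noise, the chance of landing within a tolerance t is 1 - exp(-t / scale), so a scale of 10 / ln(50) gives exactly 98% for t = 10.

```python
import math
import random

def laplace_sample(rng, scale):
    """Draw Laplace noise as the difference of two independent exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def fraction_within(true_value, scale, tolerance, trials, seed=0):
    """Jitter the true value many times and report how often the published
    number lands within `tolerance` of the truth."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        published = true_value + laplace_sample(rng, scale)
        if abs(published - true_value) <= tolerance:
            hits += 1
    return hits / trials

# Scale chosen so that 1 - exp(-10 / scale) = 0.98.
scale = 10 / math.log(50)
frac = fraction_within(true_value=5000, scale=scale, tolerance=10, trials=100_000)
print(round(scale, 3), frac)
```

Both knobs show up here: the tolerance (10 people) is the "how close" part and the resulting fraction (about 0.98) is the "how likely" part.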

The loss of privacy due to the publication of information is described in terms of how

confidently an attacker would be able to single out a particular possibility for the true

data.

Given the published information, are there just a few possibilities for the true data?

Or are there many, many, plausible possibilities for what the true data might be?

Essentially, loss of privacy is measured by the prominence of peaks on the plausibility

plot.

And so the protection of privacy requires policing the possibility for such peaks.

If we individuals are going to willingly participate in scientific or other studies and surveys

or use services where we reveal potentially sensitive personal information, we should

really demand that the researchers or organizations utilize a mathematically robust way of protecting

our privacy.

Simply put, if they can’t guarantee there won’t be a peak in plausibility, then we

shouldn’t agree to give them a peek at our data.

SPONSORSHIP MESSAGE

Thanks to the U.S. Census Bureau for supporting

this video.

The founders of the US understood that an accurate and complete population count is

necessary for the fair implementation of a representative democracy, so a regular census

is required by the US Constitution.

The US 2020 Census will be the first anywhere to use modern, mathematically guaranteed privacy

safeguards to protect respondents from today’s privacy threats.

These new safeguards will protect confidentiality while allowing the Census Bureau to deliver

the complete and accurate count of the nation’s population.

They will also give those who rely on census data increased clarity regarding the impact

that statistical safeguards have on their analyses and decision-making.

In short, the Census Bureau views the adoption of a mathematical guarantee of privacy as

a win-win. Here’s how the chief scientist at the Census Bureau thinks about it: there

is a real choice that every curator of confidential survey data has to make.

If they want the respondents to trust them to protect confidentiality, then the curator

has to be prepared to give (and implement) mathematically provable guarantees of privacy.

Unfortunately, this means there’s a constraint on the amount of information you can publish

from confidential data.

It’s mathematically impossible to provide perfectly accurate answers for as many questions

or statistics as you want while also protecting the privacy of respondents.

So curators need to do two things: understand the needs and desires of the people who provided

data and the people who want to use the data in order to determine precisely what balance

of accuracy vs privacy to choose, and then not waste that limited privacy budget by publishing

accurate answers to unimportant questions.