Hello, everyone, and welcome to today's hot topics
I am delighted we have two extraordinary scientists who
are also members of the MIT community
to educate us on computational epidemiology today.
We have professor Ankur Moitra, who
is in the Department of Mathematics at MIT.
And he's also a member of CSAIL.
Ankur has worked in numerous areas of algorithms,
including approximation algorithms, metric embeddings,
combinatorics, and smoothed analysis.
But lately, he has been working at the intersection
of algorithms and machine learning.
And he's collaborating with Professor Elchanan Mossel,
who works in probability, combinatorics, and inference.
And his research has resolved open problems
in computational biology, machine learning, social choice
theory, and economics.
Professor Mossel is on the senior faculty
of the mathematics department, and he has a joint
faculty appointment in IDSS.
Ankur, Elchanan, please go ahead.
Great, so thanks, everyone, for logging in.
So I'm happy to tell you at least a little bit about what
we've been thinking about in the space
of computational epidemiology and what we've learned.
So I should mention at the beginning
that this is a new area for both of us.
In fact, how we got interested in this was
we were approached by a bunch of epidemiologists
about joining a project.
And they persuaded us to start thinking about some
of their statistical and algorithmic and machine
learning problems that they have.
Now, this was back in September.
So we thought it would be fun to learn a little bit
about these problems.
They sounded a bit different than the kinds of things
we usually worked on.
And then coronavirus happened.
And it changed the way we think about these problems.
So instead of thinking about some cute math problems
that we could write about, all of a sudden, these
became much more serious issues.
So I'm talking to you not really as an expert,
but as an outsider, as someone more from a computer science
perspective, thinking about the way
that things happen in epidemiology
and what the perspectives from sort of more modern algorithmic
and computational ways of thinking
might lend to these sorts of problems.
So let me tell you, at least, a little bit
about some of the background about epidemiology.
Because it's important to understand where
the roots of this field are.
So it's hard to imagine, but way back when, back in the 1800s,
there were a lot of fundamental diseases
whose underlying mechanisms we just didn't understand.
And people were dying left and right.
So maybe one of the most important cases of this
was cholera in the 1800s.
So you can see, from this new version of a famous map
by John Snow, first of all, the Broad Street pump and all
of the surrounding cases, mostly deaths, around the Broad Street pump.
And when these things happened, they were catastrophic.
They took out entire families.
And no one really knew what was happening.
At the time, there were a lot of theories
about what the origins of cholera were, none of them
really grounded in any kind of science.
Some people thought cholera was in the air
and it was just endemic and impossible to get rid of.
But John Snow had a theory that it
came from drinking contaminated water.
So when there was an outbreak in 1854
and there was a rash of deaths all over,
what he did was actually trace through all of these cases
and tie them back to the Broad Street pump.
In some cases, this was easy.
It was something like schoolchildren
on the route from their home to school,
they would pass by the Broad Street pump
and they would refill their water.
In other cases, it was much more complicated.
Because there's a story of a mother who was many miles away
from the Broad Street pump.
There were other pumps that were much closer to her.
And it was a complete mystery about how she died of cholera.
So in fact, Snow didn't give up.
What he did was he figured out from talking to the son
that the mother used to live around the Broad Street pump
and that she liked the taste of that water more
than the other pumps.
So whenever someone was in the area,
she would ask them to pick up a bottle of it,
and she would drink it at home and serve it to her guests.
And that resolved one of the mysteries.
In fact, what was quite frustrating around
this time was that despite the scientific evidence
of tracing through these cases and pinning it back
to the Broad Street pump, there wasn't
a molecular understanding of what was going on.
And it was very difficult, in general,
to convince lawmakers to actually try and close the pump
or do some other safety checks on it.
In fact, it wasn't until a reverend, who thought that cholera
was really the work of God deciding who had been evil,
tried to show that John Snow's work was not
the correct explanation that the mechanism became clear.
In trying to disprove Snow, he actually pinpointed that the reason
the pump had been infected was because there
was a nearby sewage dump area where a mother had dumped
some of her dirty diapers.
And a leak from that sewage
actually led to the water becoming contaminated.
So this is by no means a full history of epidemiology.
After that time, people spent a lot of time
identifying new diseases, tracing
the mechanisms of spreading, and even trying
to understand what are the underlying
bacteria or other factors that lead to them.
So one of the things which has really
changed the game within epidemiology
has been the advent of more DNA technologies
to peer deeper into what the underlying mechanisms are.
So this generally goes under the umbrella
of molecular epidemiology.
And if you think about what the DNA revolution has to offer,
well, it allows us to look into specific diseases
and understand what are the chemical pathways
that these things take to cause infections.
That gives us some idea about what
are the potential treatments we could take,
instead of taking some blind approach to looking for cures.
We can look into what kinds of molecules and what particular genes
it has affected and try all sorts of therapies.
In fact, it even gives insight into some
of the original questions of epidemiology.
Because by tracing things like the genetic structure
and how they mutate, you can actually give new insights
into the transmission.
So what's depicted here on this screen
is actually from a paper which studies the transmission of tuberculosis.
And through genetic understanding,
they can distinguish between cases
where there is one rash of transmissions and the case
where the tuberculosis was already present, but was
reactivated after a very long time.
So it adds very much a piece to the puzzle
in trying to trace these diseases.
Now, I should mention that what we're going to try and do is--
because some of these areas are new to us too--
I want to offer some thoughts at least, high level,
about what are some of the new ways
that statistical and computational thinking can
change the types of questions that we can ask.
So I'm sure everyone's been following all kinds of things
on the news because it's a little hard
to ignore these days.
But one of the things that I've certainly
been wondering about that can be posed, potentially,
as a nice statistical problem is you
see that some of the ways that governments have been dealing
with coronavirus, especially early on,
is through contact tracing and trying to understand when there
are community transmissions and whether we can fully
identify all of what those are and stop the spread.
But one of the things which isn't obvious
is, what is the tipping point at which we should no longer be
pursuing these strategies of tracing all of the contacts?
And at what point is it hopeless?
This depends a lot on what kinds of information
you can compel people to disclose about their contacts.
There are cases, for example, in Connecticut
where a single party that someone
had for their 40th birthday led to what's
called a super spreading event.
And for a long time, the small town
tried to trace all of the different contacts,
until, a couple of days later, someone
who had been at the party admitted--
and he waited so long because of some of the social stigma
of being responsible for it--
that he'd attended another party a few days after with 400 people.
And at that point, they completely
gave up trying these strategies of contact tracing
and shut down the schools almost immediately.
So let me switch over to Elchanan
who's going to tell you some of the beginning parts
of how exactly we start to model growth, moving from simple ODEs
to more complicated models.
So maybe let me start with two comments about,
you know, the parallels between the historical perspective
that Ankur presented and what we're seeing right now:
so if you think about the Broad Street pump situation
and you think about the Wuhan market situation or even
the policy decisions in the US, you see the parallels
where the people who hunt for the disease
already know something.
They don't know how to prove it, you know, scientifically, 100%.
But the policy makers are very reluctant to do anything
because of financial interest, because
of the fact that they don't want the public to panic.
And unfortunately, we see this in multiple cases
in the current outbreak.
I also wanted to mention that the molecular techniques
that Ankur mentioned have played an important role, even now,
in trying to identify how long the disease
has been rampant in the US.
And the use of phylogenetic techniques,
led by people at the University of Washington,
was one of the main tools that led to the conclusion
that this has, in fact, been in the community
for a very, very long time.
So I mean, we are still using these molecular
techniques, even with this current disease.
And this is, of course, related to the policy questions
that Ankur mentioned, right?
Now that we know that it's rampant or somewhat rampant,
you know, that it's somewhat widespread,
this leads to the policy decisions
that Ankur and I are both fascinated by.
You know, when do you give up and decide, this is
going to hit almost everybody?
And the question is, how do we do it
in the way that's best according to some optimization
[INAUDIBLE] versus, you know, is there still a chance
to try to do what China or Korea did,
which is to stop the disease by tracking it.
So let me go to one of the simplest
models in this area.
We'll talk at a very high level about a more
sophisticated model later.
But this model, or models very, very similar to it,
are still used in order to inform policy and make decisions.
This is a very, very simple differential-equation-based
model in which the population is divided
into three types.
They are the susceptible, the people who can get infected
but haven't been infected yet; the infected,
the people who are infected; and the people
who have already recovered.
There are many things that are missing here,
in particular, people who might die from the disease.
And so let me go over all of the details of the model.
There is a disease, and at some rate,
susceptible people are being infected.
This is described by the first equation,
and this rate is beta.
And among the infected people, at each point in time,
some of them recover.
And this is decided by the parameter gamma.
And what's easy to see from this picture is that, at the beginning--
and this is what you'll also see with corona--
the infected curve, which is, I believe, the red one,
grows exponentially, and then it goes down, right?
That's the dynamics of many diseases that spread.
Many diseases do not actually spread, right?
They start spreading and they immediately die out.
But for the diseases that do spread,
that's often the dynamic.
And what you see, hopefully, in the yellow curve
is the number of people who have recovered and are now
immune to the disease.
So in this model, again, nobody dies.
So this gets to the full population.
And the parameter of interest that you
may see discussed a lot in the literature regarding corona
is this parameter R0, which governs how
fast the exponential growth of the disease is.
And you shouldn't really think about this parameter
as an absolute constant coming from nature, like one
of the physical constants.
This is something that's very, very much
dependent on the environment of the country or the community
where you are.
And it's also very, very much dependent on policies.
And I think one of the ways to think about policies is,
like, in the idealized situation of applying interventions,
you want to get R0 less than 1.
Because if it's less than 1, the disease will die out.
That's the ideal.
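To make the SIR dynamics just described concrete, here is a minimal sketch using simple Euler integration; in this model, R0 equals beta over gamma, and the outbreak dies out when that ratio is below 1. All parameter values here are made up purely for illustration, not estimates for any real disease.

```python
# A minimal sketch of the SIR model described above, integrated with
# simple Euler steps. The parameter values (beta, gamma, the initial
# infected fraction) are illustrative assumptions only.

def simulate_sir(beta, gamma, s0, i0, days, dt=0.1):
    """Integrate dS/dt = -beta*S*I, dI/dt = beta*S*I - gamma*I,
    dR/dt = gamma*I, with S, I, R as fractions of the population."""
    s, i, r = s0, i0, 0.0
    history = [(s, i, r)]
    for _ in range(int(days / dt)):
        new_infections = beta * s * i * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Here R0 = beta / gamma, so this run has R0 = 3: the infected
# fraction grows exponentially at first, peaks, then dies down.
traj = simulate_sir(beta=0.3, gamma=0.1, s0=0.999, i0=0.001, days=200)
peak_infected = max(i for _, i, _ in traj)
print(f"R0 = {0.3 / 0.1:.1f}")
print(f"peak infected fraction: {peak_infected:.2f}")
print(f"final recovered fraction: {traj[-1][2]:.2f}")
```

Rerunning with beta below gamma (so R0 < 1) makes the outbreak fizzle immediately instead of producing the exponential rise and fall.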
But even for the policies that people are considering right
now, the goal is maybe just to get R0 a little lower than
its current value, which people cannot really estimate.
You know, the estimates go anywhere between 1 and 4.
They really vary a lot.
But even if you lower R0 by a little bit,
maybe you give more time to the hospitals to get prepared.
Maybe you buy more time to develop vaccinations
and so on and so forth.
So just to add a quick comment, too.
I mean, there are a lot of difficulties
in taking these very simple ODE-based models
and trying to fit them to data.
So in fact, we see already in the US that, you know,
something that really affects the growth of these curves
is also changes in the testing policies about who's
allowed to be tested.
We can see that initially when coronavirus started,
there were extremely stringent criteria that you had to meet.
And they were also focused on trying to understand
the community spread.
So if you were in various risk categories
then they would test you, if you had various symptoms,
but also if you had been in contact with someone
who had tested positive.
But, of course, being in contact with someone
who tested positive meant that that person also
had to meet the criteria of being testable.
So it was very difficult to see, you know,
actually these types of curves being present in the data,
especially because as you change the policies about who's
allowed to be tested, you can see these wild ramp-ups that
don't actually have to do with the change in the number
of people who are infected or the growths or the lack
of efficacy of the different policies,
but just the different ways that these things are reported.
So one of the other things to keep in mind in terms
of open questions is that it also makes sense
to try and understand ODE types of models like this in cases
where we work with statistics that we have
a bit more confidence about.
Like, maybe instead of the statistics
about how many reported positive cases we have,
we can put a little bit more faith
in the number of the statistics of how many people actually
have been admitted to hospitals.
Because the admission criteria don't actually vary quite as
much across different states,
at least at the moment.
So you know, when somebody in the modern age
sees such a simplistic approach, they immediately say,
oh, I can do much better.
I can model the disease much better.
And indeed, a lot of the modern work in this area
is based on modeling which is very data intensive.
And, I mean, there's an art to these models.
So these are what's called multi-agent models.
And the idea is that we don't just want to use
some simple parameters, beta, gamma,
and so on and so forth; we want to better model the fact
that there are different kinds of people,
different kinds of communities, different kinds of interactions.
And the hard part of this modeling
is to model people's interactions, right?
So most of the diseases come into play
by people interacting with other people.
And the question is, how do you model
the interaction of a population of, say, 200 or 300 million people?
And what people do is they try to use various kinds of data.
I mean, I think in the last 30 or 40 years,
it was mostly census kind of data,
but also information from traffic
sources about how much time different kinds of people
spend with different kinds of people
and who are these people that they spend time with.
So we want to estimate the interaction graph
for various people.
How many people do you meet a day?
How far away are you from them?
When you are close to them, for how long do you stay close?
I mean, it's very important in trying
to model the spread of the disease,
especially when you talk about respiratory diseases.
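As a heavily simplified illustration of this multi-agent idea, here is a toy simulation in which infection spreads along an explicit contact graph rather than through a single population-wide beta. The random (Erdos-Renyi) graph is only a stand-in for the interaction data estimated from census and traffic sources, and every parameter value is an arbitrary assumption made for this sketch.

```python
# Toy network SIR: agents are nodes, contacts are edges, and
# infection spreads edge by edge. All parameters are invented
# for illustration; real multi-agent models are far richer.
import random

def simulate_network_sir(n, avg_contacts, p_transmit, p_recover,
                         steps, n_seed_cases=5, seed=0):
    rng = random.Random(seed)
    p_edge = avg_contacts / (n - 1)
    # Build a random contact graph with the given average degree.
    neighbors = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p_edge:
                neighbors[u].append(v)
                neighbors[v].append(u)
    state = ["S"] * n                 # S, I, or R per agent
    for k in range(n_seed_cases):     # a handful of initial cases
        state[k] = "I"
    for _ in range(steps):
        infections, recoveries = [], []
        for u in range(n):
            if state[u] == "I":
                for v in neighbors[u]:    # try to infect each contact
                    if state[v] == "S" and rng.random() < p_transmit:
                        infections.append(v)
                if rng.random() < p_recover:
                    recoveries.append(u)
        for v in infections:
            state[v] = "I"
        for u in recoveries:
            state[u] = "R"
    return state

final = simulate_network_sir(n=500, avg_contacts=8, p_transmit=0.1,
                             p_recover=0.1, steps=200)
print("ever infected:", sum(s != "S" for s in final), "out of", len(final))
```

Lowering p_transmit or avg_contacts by a modest amount can flip the outcome from most of the graph getting infected to the outbreak dying out near the seed cases, which is exactly the kind of parameter sensitivity these network models exhibit.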
The other part of the modeling, which is typically easier
because it just involves an individual
or the statistical behavior of an individual,
is to estimate how the disease develops in an individual.
How long is the incubation period?
For how long are you sick?
For how long can you infect other people?
And for this, often, just by observing a large number
of patients, you can get pretty good estimates for that,
even though there is a caveat.
In a disease like the corona, where it
seems like many, many of the patients
get the disease in a way that's very, very mild
or not noticeable at all, estimating
how many people have this kind of mild disease,
how long they are sick in this mode,
and how many people they can infect in this mode--
that's something that we don't know,
because most of the data that we get is skewed,
skewed toward the patients that get the disease in the stronger form.
And then once you have this model
and you believe this model, you say, OK,
now here's what I'm going to do.
I'm going to decide that every second high school is
going to close.
And I'm going to close all kindergartens
and close the Whole Foods and just
leave the Trader Joe's open.
And then I've got my model, and I come up
with a policy, a policy recommendation.
And a lot of what we're seeing, and we'll
see it in the next slide where we talk about the Imperial
College paper that got a lot of traction,
is to try to evaluate the efficiency
of various interventions using these models.
Now, for people like me who have been
in the business of networks for many, many years,
we know that the main challenge in this
is that these models are very, very
sensitive to the parameters.
And even if, you know, you get a lot of census data,
it's actually very difficult to get a good evaluation
of how long you spend in proximity to other people
and what the actual distance is in meters.
And even if you know the actual distance in meters,
it's really important whether you're coughing into your elbow,
coughing into your hand, or coughing
into the open air, right?
So there are a lot of parameters that, eventually, people just guess.
And the parameters of this model--
these are very large networks.
There's a lot of interaction.
And for people who have heard the words "phase transition"--
these models undergo very, very dramatic phase transitions,
depending on [INAUDIBLE].
So these models sound like a very, very good idea
because they include all of the data that you want.
But in practice, the efficacy in actually predicting what
will happen is not so good.
And people in the public health community
are very well aware of that.
And what they do is more like, we
know that this is not the true model,
but let's play with various policies
and see what kinds of pictures we get,
and maybe it will inform us.
So it's a little bit like the models in economics,
in part because people don't believe
that this is a very good model.
They don't believe this is the kind of predictive model
that we would use in statistics or in machine learning.
They don't believe that the quality is good enough.
As a result of that, or compounding
it, is the fact that these models are not public.
I think that once you start playing with them,
you realize there are so many parameters that people just
decided to fix to some value for no particular reason,
and that if you change these parameters even
by a little bit, then the model behavior completely changes.
So it's not really open science.
It's more like an art.
I don't know if fortunately or unfortunately,
but these sophisticated models have a very, very strong impact
on policy, even though the scientific basis of these models
is actually pretty weak.
Yeah, so just to add to that and to emphasize:
one of the most fundamental things
you're talking about is this interaction graph.
So for this Imperial College study, you know,
you need to take census data and somehow
figure out what's the distribution of how much time
people spend on public transportation.
Now, this is an extremely under-constrained problem.
So it's open to interpretation how exactly you figure out
these kinds of things.
One thing that we haven't really seen too much
in the literature in comparison to other fields that
use a lot of network ideas underneath them
is that if you compare it to something like, you know,
statistical physics, you're also interested
in macroscopic properties of the underlying network.
So you don't care about the actual individual atoms,
but you want to understand at what temperatures
you have different magnetization properties.
And what's important there is that actually many
of the nitty-gritty modeling assumptions
you make end up being abstracted in simpler models that
capture the rough dynamics of what the model is doing.
So I think there is not so much of an understanding in some
of these multi-agent systems of, first
of all, how robust are their predictions.
Because these models aren't even public,
it's impossible to play with other people's models
or reverse engineer how exactly they created the interaction graph.
And it's also not so clear whether you
need to go to this level of sophistication
if, at the end of the day, having
that many different modeling parameters
just confounds the issue of trying to understand
what the macroscopic properties are
and what the impact of policy is on those properties.
So here is the table that Ankur and I
like the most from this Imperial College study.
Unfortunately, I cannot see the last column because I see
the windows of the people.
But this table tries to evaluate different interventions
against the disease in terms of closing high schools,
who self-isolates, you know, whether sick people stay at home,
and so on and so forth.
And I won't get into the details of this picture.
I think Ankur and I, like many other people,
were mostly interested in the question of,
you know, how effective is an intervention of the form
you just tell old people, maybe like me, to stay at home
and you tell young people, like most of the viewers,
you can go outside and develop immunity.
And then once you develop immunity,
I will go out and be happy.
So that's a question that, you know,
you can see in discussions in the pages of The New York
Times and other places.
So this maybe is related to the last two columns of the table.
You can look at this a little bit later.
But what I want to point you to is that we really
don't have a very good estimate of R0 before intervention,
which is what you see in the leftmost column.
So those are 2.4 and 2.2, but the range of estimates
that we have for R0 is much wider than 2.2 to 2.4.
And the impact in terms of the number
of deaths and hospitalization and so on and so forth
is huge, right?
So this is just like a huge impact.
This is a parameter that we don't really know.
So you have to run this model with these two parameters,
in addition to the other millions of parameters
that are under the hood that both Ankur and I mentioned.
And you can get dramatically different results.
So that's, you know, just to tell you,
you know, the kind of very, very idealistic assumptions
that people make and their limited application
in deciding on policies.
OK, so I want to also talk about some of the ways
that more traditional computational thinking, you
know, can help in the picture, but also
what are some of the pitfalls and what
are the things people have tried in the past?
So one of the big questions right now is also,
you know, we basically don't understand
most of the statistics about the coronavirus, right?
We don't understand what number of people
are actually infected.
Most experts predict that if you take
some place like Massachusetts, the number of actual infections
is probably somewhere between 5 and 20 times
larger than what's actually being reported.
And in fact, over time, our understanding
of these statistics might even get worse.
Because there are a bunch of places like New York
and California where public health officials have actually
advocated that maybe we should reduce testing.
Because, you know, maybe we should only
be testing people for whom the outcome would actually
change their treatment plan.
Now, the reason why they're suggesting this
is because there's such a drastic shortage
of personal protective equipment that, you know,
this is now starting to drive some of the decisions
that we make.
So every time someone gets tested,
you need to throw out that entire set of PPE.
And then there's just one less patient in the future
that you'll be properly equipped to be able to handle.
So one of the things that I think is very interesting
that's not really discussed in some of these models so far--
actually, let me transition backwards--
is first of all thinking about issues like if I don't fully
understand what the growth is of the number of infected people,
then how much does that hurt my policy implementation?
So you can see, for example, from the Imperial College
paper that there is some understanding that when
you actually institute the policy of quarantining people,
well, it depends on how well you time it with the peak.
So if you time it really well with the peak,
then its ability to flatten the curve
is much greater than if you mistime it.
And one of the things that we don't understand
is, for example, if we lose information
about tracking how many different infections there are,
how much of a price do we pay in terms
of the efficacy of our different policies?
In fact, a lot of these models also
don't take into account some of the other parts
of the picture that are really fundamental to understanding
what we're actually going to deal with.
Ankur, we have a few questions about the slides.
Some of the viewers would like to better understand
what the numbers are.
And so I wonder if you could take a minute
to explain the numbers in the table.
So Elchanan, you want to go for that?
I actually don't have the paper in front of me.
But a big number is better.
This is a reduction in hospitalization or death
and so on and so forth, all right?
So I don't remember if this is hospitalizations or deaths,
but I think that 67%, for example,
means that there are 67% fewer deaths than if you would not
intervene at all.
So the green numbers are big numbers.
This is how much less of the bad effect will you have,
given the intervention that's written on top.
I think, like, for example, the last two columns
are people over 70 self-isolating, combined with closing of schools.
The first one includes closing of high schools and elementary schools
and the second one does not, right?
So this is an example of the column, right?
Right, so you do that and it says,
actually, the left column says peak beds, right?
So the first two items are peak beds,
how many hospital beds will you need?
So a number like 80% means you would
need 80% less than what you would need if you were not
using any intervention at all.
69% would mean you would need about 70%
less beds than if you would not intervene at all, right?
This is, like, the general flow of this table.
And also, I think it's worth mentioning, you know,
there are some parameters--
it's not clear how exactly you would set them--
that are a really important piece of the picture
but are not part of the study.
So for example, the focus here was
on the number of total people infected
and what the peak number of beds in the ICU you would need is.
But they didn't actually explicitly model
things like, when you are over capacity, and by how much,
how much does that affect the mortality rate?
One more question about the slide.
If we don't know R0, are you saying we also
don't know gamma and beta?
I mean, the actual model again, even in this study
and definitely in more--
All right, so here, let me just say,
I mean, this study uses the most sophisticated
kind of multi-agent model.
So there isn't just a gamma and beta, right?
The parameters will be different for different individuals,
depending on their lifestyle.
And similarly, the mortality rate
would be different for different people,
depending on the demographics, their age,
having other diseases, and so on and so forth.
All right, but even in the simpler model that we discussed
at the beginning, which people do use in order
to simulate various policies, the values of gamma and beta
and other relevant parameters--
maybe gamma and beta for different age groups--
are not known.
Right, and just to, you know, point this out, right?
So gamma in the original SIR model
is, you know, what the rate is of people recovering.
So we don't even have any idea how many people are actually infected.
There's plenty of belief that a lot of the spread
is due to asymptomatic cases which we're not even seeing.
So that very much affects gamma.
But there is hope in trying to leverage some types of data
sets that are more complete than others to extrapolate
some of the information we're missing on the larger data sets.
So if you think about, for example,
the Diamond Princess cruise or South Korea,
where there's much more widespread testing,
some of the ways that people are getting these numbers
about the number of actually infected people
in Massachusetts is by extrapolating things
we see there about how many people have
very bad symptoms, how many people are asymptomatic,
and trying to apply this to the partially observed data
that we see in the US when we don't have
this kind of rampant testing.
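The extrapolation idea just described can be sketched as back-of-the-envelope arithmetic: use a fully tested cohort to estimate what fraction of infections would be caught by restrictive, symptom-based testing, then scale up a region's reported count. Every number below is hypothetical, purely for illustration.

```python
# A hedged sketch of extrapolating true infections from reported cases.
# The cohort numbers and the reported count are made up; this is only
# the shape of the calculation, not a real estimate.

def estimate_true_infections(reported_cases, frac_detectable):
    """If only a fraction `frac_detectable` of infections are severe
    enough to be caught by restrictive testing, scale the reported
    count by 1 / frac_detectable."""
    return reported_cases / frac_detectable

# Hypothetical fully tested cohort: 700 infections found under
# universal testing, of which 350 had symptoms that would qualify
# under a restrictive testing policy elsewhere.
frac_detectable = 350 / 700
print(estimate_true_infections(1000, frac_detectable))  # -> 2000.0
```

With a detectable fraction of 1/20 instead of 1/2, the same reported count would imply 20 times as many infections, which is the range of multipliers mentioned above.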
And again, there's more biology in the background, right?
They're starting to deploy antibody testing in which you
can also test for people who have been infected
before and recovered, right?
So obviously as time, you know, moves on, you know,
we'll get more data of different types.
OK, two more quick questions about the slide, please.
One question is about modeling uncertainty.
What are the current approaches?
And would something like Bayesian estimation
or inference techniques be useful?
Yeah, I don't know what they do in this study.
But I think the people who do this work, people
who are trained in public health, are actually pretty aware of these issues.
And they do know that the models are
very sensitive to assumptions.
And they do know that if they would
start to do some serious robustness analysis,
they would get very, very different behaviors,
as we see in this table.
So I think it's really, really different
than the kind of science, you know,
where you care about the prediction
where your accuracy probability is 87% versus 88%.
You know, the spectrum here is much, much wider.
And people in the field, you know,
who have been modeling this kind of phenomena
have had these experiences for very long
and they are very well aware of that.
So I think one way for them
to think about it is to compare different policies,
just to see which one is better and which is not
under different parameters, but not as a predictive tool--
saying, you know, if in various settings
one intervention is better than nothing,
then it will also be better in the real-life situation.
So that's one case where people do it.
The other situation where people do it is,
you know, whenever you have a very complex system
that you don't understand and you want to control it
in some sense, you know, it makes sense
to make some assumptions and just follow
the behavior for a very, very short amount of time
and then adjust all of the parameters
and so on and so forth.
So I think that's a little bit of this
and a little bit of that.
Yeah, and just to give you a sense
of some of the other parameters in these types of studies
that are really sensitive in this way.
So there's another parameter in here,
which is, you know, when you institute the policy of,
let's say, quarantining,
what fraction of the population actually complies.
Now, in the Imperial study, it was 75%.
Now, this type of parameter, it's not so clear
how exactly you set it.
It's not so clear how it depends on which different demographics
you're talking about.
It's not so clear how it varies from country to country.
So a lot of people right now are looking at this,
you know, Imperial College study to try and, you know,
understand whether it's reasonable,
how robust are its predictions.
In fact, there was a nice Reddit thread
in which Bill Gates weighed in with what
The Gates Foundation thinks.
And one of the points that he brought up was, well, you know,
it also doesn't explain what happened in China
where they were actually able to curb the growth of coronavirus.
So there's a lot of, you know, sanity checks and eyeballing
of parameters where in these types of models
if I play with something like the compliance rate,
you can actually see wildly different behavior.
So it's just something to think about.
There is a flurry of questions.
I will ask one more.
And we'll leave the others for later.
so you can go on with the talk.
And one question is, how does R0 relate to the doubling,
to the days to double?
So, yeah, R0 is--
Yeah, so one is the log of the other or so on and so forth.
So if the date to double is x, when x to the R0 is equal to 2,
then x is the number of days to double.
All right, please go ahead.
OK, so let me also tell you about--
I mean, because computational epidemiology really
has many different facets.
So what we talked about so far were some
of the roots of epidemiology.
And you can see already that there
are many different types of problems you can study there.
You can study these contact tracing
types of inference problems.
You can study these modeling types of problems
about trying to understand the growth as a way
to check the efficacy of different policies.
But now, these days, people are also
trying to think about what are different sorts of data that we
can leverage that, you know, can help
complete some of the puzzle.
So I want to give you at least a little bit of a sober
look at these kinds of things.
Because they're both very good and also not necessarily good.
So there is a famous example of Google flu trends
that you can see this curve right here.
There are actually two instantiation of the algorithm.
So there was the 2008 flu trends algorithm.
And there was the 2009 flu trends algorithm
which ran for quite some time.
And the basic idea is that Google
has access to this amazingly rich source of data
which is search queries.
So you can get some idea of how many people are worried
about the fact that they might have the flu by searching,
for example, for flu symptoms or other types of things
like what are good remedies, should I
be running a humidifier, that kind of thing.
So you can see that this 2008 flu trends algorithm,
when it launched, right, at the left end of this curve,
it was quite good at tracking the actual flu rates,
compared to the CDC data.
But then towards the end, it stopped
tracking it quite so well.
And then I had this update to their algorithm in 2009.
And now it looks even better because it
tracks the peaks quite well, other than this blip in 2012.
And then all of a sudden, the thing
went completely off the tracks in 2013
to the point where Google cut the flu trends algorithm.
And it's no longer something that they actually run.
So it's still something that people try.
We don't all have access to Google's search queries.
But something that people have been
trying within computational epidemiology
is to use things like Twitter.
So there are some interesting NLP types
of problems that people have had to solve
en route to doing this.
Because they want to distinguish between tweets which
are about individual people saying that they feel sick
versus people generally talking about the flu.
And this is a big problem around spikes where, all of a sudden,
you'll have tons more people talking about the flu
than will necessarily have the flu.
And this could be one of the confounders
with what happened in 2013.
So there is some kind of hope for trying
to leverage these different sources of data.
But so far, a big problem has been that it historically
can track it well sometimes.
But when you have these big spikes, things all of a sudden
go out of whack.
And then we're just spewing out this incorrect data.
So some of the other big data approaches
that people have tried are they can
use things like mobility networks
from GPS data to try and figure out
all of this fine grained detail about who actually
has been in contact with other people and at what distance.
Now a lot of times when people do this,
they do it in a very small scale.
So there are some instances where
people did this within schools to understand the spreading.
One thing that they do is, you know, as Elchanan mentioned,
there are these very complicated multi-agent models
that people use to try and model the spread of disease.
And so what people sometimes do is
they use these mobility networks that they've
learned from data to try and check
against the actual multi-agent models that they've had
and see if they're reasonable.
So are they the right degree distributions, the right number
of triangles, these kinds of networks statistics
to at least give some proof of concept
that it's getting roughly the right network structure?
But it's, of course, a very complicated thing.
Because you don't really care about all
of the network structure, you care
about the parts that have an effect
on the macroscopic properties which you're observing.
Now, people also can use this GPS data, in principle,
to trace the spreading of diseases
through contact networks.
And here there's a big difference
as you vary from country to country about what
is and is not feasible.
So in other countries, particularly in Asia,
we've seen that the governments have
been able to force people to actually disclose who they've
been in contact with through check-ins,
through their phones, this kind of thing.
But right now, this isn't actually something
that we can do very effectively in the US
to try and trace contacts this way.
So it is one thing to think about.
Now, as I mentioned, there are actually
many different problems that you can
study besides just these modeling types of things.
There are also some interesting sensor questions
about how exactly you can detect early spreading.
Now, we've see this already with the coronavirus
that it popped up in some communities much quicker
and ramped up much quicker than in others.
So for example, West Virginia was the last state
in the country to actually have a positive confirmed case.
And meanwhile New York City was exploding
and continues to explode in the number of positive cases.
So what you can ask is, in some communities
where you haven't yet seen this kind of explosion
but it's happening in neighboring areas,
are there good ways to try and detect early onset of pandemics
within different areas that tells you
something about when is the right time
to time some of the interventions
that you're trying, like closing down schools.
So one of the first famous studies of this
is through what's called the friendship paradox.
It's a beautiful, mathematical, and rigorous statement,
which is that if you take any graph that's connected where
the nodes do not have all the same degree, then if you choose
a random node from the network and look
at its average degree versus you choose
a random node from the network and choose one
of its random friends and walk along the edge
and ask it how many neighbors it has on average,
that's actually strictly larger.
So this is called the friendship paradox.
It has the really disappointing message
that your friends have more friends than you.
But that was the mathematical statement behind it.
So there's some really nice work of Christakis and Fowler,
although there are a lot of questions that people have
about actually reproducibility for these things,
where what they did was they took a group of 400 students
And they wanted to use the friendship paradox
as a way for early detection of the flu.
So they took those 400 students and they
asked them all to name a random friend in Harvard.
And they took that group of 600 something people
who'd been named as friends and they
tracked how early, on average, they got the flu versus how
early, on average, the people from the original group
got the flu.
And what they found was that this idea
of using the friendship paradox supposedly
allows you early detection of the flu by as much as one week.
Now there are a bunch of caveats that people have.
So you should take this with a grain of salt
that not all of these things have been fully reproduced.
But one of the other things that people have studied
are other mechanisms to place sensors,
socially, to understand early detection.
So there are works like graph dominators, which
are very worst-case from a combinatorial optimization
So the way that these things are quantified
is that for all starting nodes of where the infection is
spreading from, you want to have a small subset of nodes that
universally acts as a very good early detection, regardless
of where that initial seed is.
So these things end up, in practice, on big networks
end up being quite large themselves.
So it's not really so clear, feasibly, how to scale them.
But it's also not so clear whether you
want to formulate them in some kind of worst-case sense.
Because we don't necessarily care about the worst
case over all possible initial starts for the seed,
but, generally, as infections are spreading,
we do have some idea about where are the hotspots.
And we might want to seed some kind of sensors
so that we can detect it when it spreads
to other communities that are not currently hotspots.
So this is some simulated data where people tried out
these dominator trees.
And they showed that, again, much like the friendship
paradox, you can get early detection from the dominator
tree, compared to this ground truth of when the infection is
Now, there are other sorts of problems
that you can study within computational epidemiology.
Like one of the other classic problems
here is you might want good algorithms or approximation
algorithms for distributing antidotes.
Now, for those of you who come more from a computer science
perspective, this might be very similar to some types
of problems you've seen within viral marketing
where you want to figure out what
are the most influential nodes to offer deals to
so that it spreads as rapidly as possible
across the social network.
But one thing to keep in mind is that distributing antidotes
is not the same.
And in fact, even the way that you formulate the question
ought to be quite different.
So some of the seminal works in these questions
about good approximation algorithms for distributing
antidotes, like this Anderson and May paper, what they do
is they assume that the entire graph is known.
But really this is a totally unrealistic setting,
as you can see, from the fact that we
don't have a good understanding of what the interaction
So some of the things that people
have been working towards in this area,
especially people like Elchanan, is
trying to understand what we can do in the face of uncertainty
about the graph.
Are there good active learning algorithms
for querying a small number of nodes or connections
to get some idea about how exactly
we should seed viral marketing or, alternatively, distribute
So let me let Elchanan take away and add some more discussion
on that point.
Yeah, so this is still speculative.
As Ankur mentioned, I think people in public health
have noticed that distributing antidotes to high degree nodes,
it makes a lot of sense.
I think also in the context of corona,
it's not just where you distribute the antidote,
it's maybe also who you test and whether you
test either for the disease or for antibodies.
One of the things that Ankur made me think about just today,
so I don't think there's any complete research here,
is to try to think about ideas that we
used in the context of viral marketing of how
do you do viral marketing where you don't know the graph.
I think, unfortunately, our algorithm
is something like, you choose some disease,
you spread it many times, see how many people
and which people it infects, and then you
decide who are the central people in the network.
But it's possible that something like that
can be done if you simulate the disease, you know,
using an app on the phone or something like that.
So there's definitely some venues
to test the ideas like that in the future with research.
So as Ankur mentioned at the beginning, I mean,
Ankur and I were very excited to join this effort of,
you know, of real experts in this area of epidemiology.
And there's an expedition proposed
that was just approved by NSF about
the computational epidemiology.
And one of the reasons we are very happy to give this talk
is that we really want to establish
a community for this area in computer science.
And I think this is obviously a very good time
to do that because everybody is thinking about it.
I think [INAUDIBLE] especially in the context
of the current proposal, given that there
are so many different models, there
are so many different layers.
The models are all bad.
I don't think that Ankur and I have built some amazing models
that we've decided that we determined what's
the right policy to take.
But maybe in the longer run, by coming up
with better models, models that are more robust, more
precise, more quantified, impact policy and inform
So this is a very big grant.
Many people from many institutions,
many of them working in biology and public health.
The two PIs are Madhav Marthe and Anil Vulkanti.
Great, so we'll end there.
I'm sure there are gonna be a bunch more questions.
So I'm happy to take them and maybe answer them.
Thank you so much.
This was awesome, very, very informative.
And congratulations on the expedition.
What a timely topic.
And we are so happy--
--to have experts with us.
OK, so I have a lot of questions.
And I'm going to try to scroll through the chat.
Maybe somebody already corrected me about my formula
So my formula for doubling wasn't quite correct.
But it really depends on the model, right?
So the relationship between R0 and the time for doubling
is very model dependent.
So thanks for the person who told me that for one model,
this is not right for one.
But it really depends on the model.
Though some relationship, it's some sort
of exponential growth.
So what I said is, you know, sort of correct,
depending on the model.
Also related to R. We are wondering
whether different communities have different Rs?
And is this taken into account in your models?
So within the simple SIR types of things,
there is the capacity just to choose different R0s.
But I easily imagine there are different R0s, especially when
you talk about, you know, different policies
At the very least, different communities
are going to have different compliance rates which
very much affects a macroscopic R0 that you get out
of things like the Imperial College model.
Next question, what sort of robust optimization techniques
are typically used in public health?
OK, so this is an area where we can contribute, right?
Again, I think the models are very sensitive.
So at least for short time predictions and decisions,
we might be able to decide.
Yeah, I think you can certainly think about things,
like, a la robust optimization, you
know, which policy is going to be the best,
in a worst-case sense, over the different allowable intervals
your parameters could be.
Now, the original Imperial College stuff,
you can look at it that way, too.
In fact, you can look at the tables
that we presented at least one example of that.
And you can ask, is it robust of a different choice of R0?
And you can ask for what's the most robust policy.
I mean, the truth is that none of them
are particularly robust.
Because as you change R0, the actual impact of the policies
is wildly different.
So one of the things that I do think is interesting
is instead of just thinking about static policies
like from now on, everyone will self
quarantine for the next six weeks,
you can also try and think about dynamic policies that
might have different self-quarantine rules depending
on the different percentages and different communities of how
many people are infected.
Now, one of the real problems with deploying that
is that we just don't know what is
the number of people who are infected
in different communities.
But there's also the capacity that adaptive policies
might quantitatively end up being more robust.
And usually robust optimization asks this question
about what is the most robust.
But one of the questions which is also important
is, just how robust is that to the uncertainty in R0?
And that's one of the sticking points
is that so far the answer is not that robust.
And maybe a different perspective
on that, I think one of the issues
with the adaptive policies that we have to take into account
is compliance or just understanding
of the public of what they're supposed to do, right?
So it could be that if we tell people
to stay at home for the next six weeks, they will comply more.
But it could also be that they will
be much more excited if every day there
will be a radio announcement by the president saying today
you can go play out.
And the following day, oh, the next two weeks
you have to stay inside.
We have to take into account that, you know,
there is the possibility that more dynamic policies
will result in either more or less compliance.
It seems like the ICU data would be very useful,
at least for COVID-19.
What's in the way of getting the ICU data?
To me, it seems like in the US in general,
it is very, very hard to get data.
It's very hard to get data about how many people are tested.
It's very hard to get this data about hospitalization.
I mean, I don't know why.
But I agree.
I completely agree with you.
Even at the community level, how many people have respiratory
disease right now, I mean, I think all of this information
I don't know that it's tracked in the US.
I don't know if it is tracked.
It's all right.
I mean, right now some of the best data sets are.
You know, Italy has a bunch of data sets online.
There's the Princess Diamond data sets.
But certainly, if we had access to the US data sets
and also understanding why people were admitted,
like, you know, if some of the rules about admission
were actually clarified, then it could
be a lot easier to try and back out from the sensor data
where we only see a part of it, what's actually happening
in, at least, a somewhat more meaningful way
than the guesstimates that are sort of floating around
in the news right now.
I guess the one thing we have in the news
is the number of deaths.
And I wonder if you can elaborate
a bit more on how we can use that to get better
models and better predictions.
You touched a little bit on that.
But can you tell us a bit more?
So definitely, you could use things like, you know, death.
So one of the things that--
So certainly, if you look at the US data right now,
we have information about the deaths.
But one thing to keep in mind is that the mortality rate,
the number of deaths divided by the number of positive cases,
obviously, that's not a good estimate.
Because we're under-reporting the number of cases
because of lack of testing.
And that actually varies on a state-to-state basis,
depending on how easy they've made testing available.
But one of the other things to keep in mind
is that sometimes the death rates lag in the sense
that people are admitted to the ICU
and then it can actually take a little bit before we find out
whether they can weather the storm or not.
Now, I mean, I've seen some statistical analyses of,
for example, comparing South Korea to Italy.
In South Korea, we have much more widespread testing.
Certainly, people are wondering a lot about why the mortality
rate is so high in Italy.
And some of the explanations we've seen
are the skew in terms of the population, how much older it
is in Italy.
But this alone doesn't explain very much of the picture.
Because the fraction of the population that's above 80%
is only over represented in the reporting data
by a factor of two compared to what
it is in the baseline population, which
is not a giant amount.
So it actually, in principle, should not
be skewing the death rates by as much as the ratio
that we're seeing in South Korea, compared to Italy.
There have been a bunch of other conjectures
about the prevalence of smoking and all kinds of other things.
But, yeah, I definitely think this is a great way
to go is to try and understand from the data
we have, first, if we can have more insight into how exactly
it was collected, what's the mechanism, what
are the different rules for how exactly
we decided who is admitted and who we're going to test,
then there's much more scope for correcting
across different populations and actually trying to understand
what the heck is going on.
But it may be that the right thing to do right now
is just to do the basic public health work of testing people,
not just more, but maybe even randomly.
And testing for antibodies as randomly as we
can to get a better feeling for what
is happening with this disease.
And my guess is that this will happen in the next few weeks,
So we'll get a much better picture of how many people were
exposed in the next few weeks.
And we'll get a much better picture of what's going on.
So there is a comment from Pete Szolovits
I would like to share.
And he says, Dr. Leo Celi in Roger Mark's lab and an ICU
doctor at Beth Israel are trying to get real-time ICU data.
And there is a focus to enable real-time ICU data, at least
from Beth Israel.
That's great to hear.
But I think that to the extent to which we
can make the medical community aware
of what we could do if we had access to their data
in a more timely way, maybe in this time of crisis,
that will become available I will make a comment
and then ask one closing question to both of you,
since we're getting close to the top of the hour.
The one comment I will make is, I see a lot of questions
from very enthusiastic students who would like to learn more
and join the field.
I wonder if you would be willing to share some resources
that we could post, in addition to the video of this talk,
for the students to get access to.
Would that be all right?
That sounds great.
In fact, maybe one thing we could do
is we could also open it up so that people could
post other resources, too.
Because I find that people know all kinds
of the different pieces of the puzzle.
So personally, I mean, there's only so much news
I can read with the same statistics being regurgitated
over and over again.
But thinking about it academically
is a good kind of antidote and actually
reading papers that try and dig deeper than just understanding,
you know, at very cursory levels what's going on is great.
OK, so here's my question for both of you.
Given where we are today and what
you know about computational epidemiology,
what is the best-case scenario?
And what do you think will happen, more realistically?
I'm happy to go with the best-case scenario.
So I was very much a worst case scenario.
I can show off by telling you that the first lecture
in my class this semester, I told my students that classes
will be cancelled.
So the first week of classes, I told them.
I told people that classes will be gone.
And my students can, you know, attest to that.
I think that actually the situation is probably not
as bad as we think.
Because I think there is a lot of undetected disease going on.
And I think that there is a good chance
that we will be past the worst of it, at least this wave,
in less than a month.
That's sort of my best-case scenario.
One thing that doesn't really answer your question,
I think one of the things where we have to be very careful,
and I don't think that the models right now that inform
policy take into account, is to take into account
the fact that economic outcomes also have a big public health
So if people lose their job, if people lose their health
coverage, they may die.
And they may not have access to the medication that they need.
They may not get the treatment that they need.
So it's something to take into account.
I don't know to what extent is this impacting public policy
But my feeling just on the fact that there are so many people
sick around, all these anecdotal stories that I hear.
My sister is a doctor in Israel.
Stories I hear from other doctors,
there are so many people with this pathology disease that
is not corona, but it's treated somehow.
My feeling is that it's much, much more widespread
in the community.
And this is a good thing, right?
Because if it's so widespread, you know,
many people will get over it.
They will get, you know, recovered.
They will not get it again, at least, in this season.
And, hopefully, we'll be out of it in a month.
Worst case scenario is all yours, Ankur.
[LAUGHS] Oh, god.
Yeah, I'm gonna abstain from saying the worst-case scenario.
But let me at least say some positives that, you know, right
now, not going to conferences, not
doing all of this traveling, it's had an amazing impact
already, in the short term, in reducing our carbon emissions.
So maybe it'll force us to rethink some of these things
about how we've built up our lives.
You know, I think, let me, you know, angle a little bit more
in the worst case and say something else,
which is that, you know, I think we're
going to get into trouble if we think that this is
a random, one-off fluke, that there
are many different types of strains of coronaviruses that,
you know, have been predicted actually for a few years
that they could make the jump to humans.
So I think these kinds of things might end up
not becoming a one in a century kind of thing.
So I'm hopeful that at least now that these issues are
on everyone's mind, we can all start
to take a more sober look at what
are the modeling assumptions that
go into computational epidemiology.
There are a lot of issues that we don't necessarily
talk about, like the fact that these models are not
It makes it very difficult to understand robustness checks.
But, you know, on my end, I certainly
talk a lot to some of my family who
have a lot of questions and debates
about these kinds of things.
And without people being able to play with these models
and understand what these predictions mean,
what these policies mean, you know, how robust or not robust
are these models, it's really difficult
to have a positive and productive conversation
about it that's not just grounded
in fear and uncertainty and lack of knowledge.
So I'm hoping that at least, you know,
there's some positive steps we can take research-wise,
as community wise, as re-evaluating
some of these fields and taking a deeper dive.
Thank you, Ankur.
Thank you, Elchanan.
It's 3 o'clock.
You have inspired us to come to the field.
You have educated us.
It was an awesome hour.
I'm going to clap on behalf of everyone.
You were both awesome.
And for all of our audience, we are
going to post-process this recorded talk.
And we will let you know when the video is available.
Have a safe and healthy afternoon.