Cookies   I display ads to cover the expenses. See the privacy policy for more information. You can keep or reject the ads.

Video thumbnail
Hey everyone! Hello. Welcome to Jon Gjengset's\h thesis defense. And thank you for attending.\h\h
I want to say just a couple of words before\hhanding over to Jon.
In his research life, Jon's been pursuing a really neat vision for how busy web sites ought to manage their data.
It's been exciting to watch him develop this vision\h and impressive to see his skill in realizing it.\h\h
But in parallel with research, Jon's been\h a hugely energetic contributor to all the\h\h
communities he touches. From his infinite patience\h as a TA, to organizing graduate student events,\h\h
to live-streaming systems programming tutorials.\h Jon's sure to be valuable and valued whatever he\h\h
sets his hand to. And assuming his committee lets\h him go, many people at MIT are going to miss him.\h\h
I certainly will.
So with\hthat, I'll hand over to Jon.
Thank you Robert, I appreciate that!
So, thank you everyone for coming, and welcome to\h my doctoral dissertation presentation
where I'll be presenting my thesis on Partial State in\hDataflow-Dased Materialized Views.
Now, my goal with this presentation is to try to summarize the\h past six years of work or so in about 45 minutes.\h\h
And hopefully I will succeed. And at least,\h hopefully, by the end, you will understand what\h\h
the words in the title mean. And why I think it\h was worthwhile to spend six years doing this work.
Now, don't worry, this work was supervised. My\h committee, Robert, who you just saw, but also\h\h
Frans, Sam, and Malte, have been working with\h me in basically trying to figure out that this\h\h
work is worthwhile, and to make sure that what\h comes out of it is something that is valuable.
So, I want to start this presentation with "Why\hare we here?".
And not in the existential sense of, like, "why are we all on this planet?".
But\hmore in the sense of "Why am I here?" — "Why am I talking to you at all?".
I am here because\hI want to make databases better.
And in order to understand the way in which I want to do\hthat, we first need to talk a little bit about what databases are,
in order to understand the\hproblems that some applications have with them.
So let's do a little bit of database 101.
So here,\hin databases, you just sort of take some tables. In this case, we're going to work with a stories\htable and a votes table.
And these two tables hold stories and votes.
And then the idea is that, as an application, you can query these tables by issuing "queries".
And these queries might do all sorts\hof things, like they might aggregate and combine values, or they might
join different tables together, in order to produce some application
data that it might care about. And the\h query operations are here shown in orange.
If you want to modify the\h data, then you perform an\h\h
"insert" or an "update" or a "delete" directly\h into the tables at the bottom. So here, if you\h\h
wanted to insert a new vote, then you just make\h that insert operation directly to the votes table.\h\h
And even though inserts in SQL, and\h updates and deletes, are "queries",\h\h
I will not be referring them as such in this talk.\h Rather, I will be talking about queries as things\h\h
that read data. Think SELECTs. Whereas inserts,\h updates, and deltes, are "writes" or "updates".
And now that we've looked at this picture, you\h might see that there is a lot more orange here\h\h
than blue. And what this indicates is that if you\h do a read, you have to do more work than if you're\h\h
doing a write. But this is unfortunate, because\h in practice many applications, and in particular\h\h
web applications, are read-heavy. They do a lot\h more reads relative to the number of writes.\h\h
And then it seems unfortunate that the\h reads are also the most expensive operation.\h\h
And so, ideally we want some way to mitigate\h this problem. Because it becomes pretty severe\h\h
once you look at applications that\h issue many many read queries that are\h\h
either identical or similar, where there is\h a lot of work that is a same in executing any\h\h
given query. And it'd be great if we didn't\h have to do this repeated, unnecessary work.
And you might say, well Jon, this problem has been\h solved: you just use a cache. And it's true that\h\h
caches are great. Caches are essentially a storage\h system that you place in front of your database.\h\h
And the idea is that the queries look in the\h cache, and if the result is there, then they just\h\h
return immediately, and the reads are now fast.\h And then only if you miss in the cache — if the\h\h
result you're after is not there — you\h issue a query to the backend database.\h\h
And it's true that this does make queries\h fast. And there are some other similar schemes,\h\h
like denormalization where you store and\h maintain derived values like the number\h\h
of votes for a story in the stories\h table, that has a similar effect.
And it's true that caches are great, but\h unfortunately they're also really hard to\h\h
get right. This diagram is an attempt to\h summarize some of the things that you have\h\h
to get right in the application in order to use\h this kind of a caching strategy correctly. So,\h\h
the cache is great, but as the data in the\h tables changes, the cache becomes out of date.\h\h
And it needs to somehow be refreshed. So imagine\h that we insert a new vote into the votes table,\h\h
the cached vote count is now wrong. And so the\h path of the application that inserts this new vote\h\h
also needs to invalidate or somehow mark the cache\h as being outdated, and maybe refresh that value.\h\h
Furthermore, even though it is easy to say that\h if you miss in the cache you have to query,\h\h
it's hard to figure out how this should work\h in practice. Imagine that 100,000 people want\h\h
to read a particular story, but that story is not\h present in cache. Do all those application queries\h\h
all swarm the database, and all execute the\h same query? That seems wasteful too. And\h\h
even if they do that, we still need some way to\h fill the cache. If all of them read when they\h\h
miss, someone has to write that value back into\h the cache. And orchestrating who should do that is\h\h
not entirely trivial. Beyond this, over time, the\h cache is going to accumulate more and more values.\h\h
And we only have so many resources\h on the server that's serving these\h\h
requests. And so we need some way to evict\h from this cache, entries that are old or\h\h
unpopular. But then the question arises again:\h who does this eviction, and how? And hopefully\h\h
this image gives you an idea of just all the\h stuff that needs to be figured out by the\h\h
application developers in order to make databases\h serve these kinds of read-heavy workloads.
And so in my research, I've been looking\h specifically at this problem of automatic\h\h
database caching. Specifically, is there a\h way that we can get all this cache logic to\h\h
not reside in the application, and instead\h be provided by the database. And that is what\h\h
my thesis implements, and what we're going\h to be looking at for the rest of the talk.
Now, the way we're going to do that is we're\h going to be walking through the title of the\h\h
thesis. Which, to remind you, is partial\h state in dataflow-based materialized views.\h\h
And what we're going to do is we're going\h to parse the title from right to left. And\h\h
then move through the words and figure out\h what they mean and how they fit together.
So let's start with materialized views. And in\h particular, why they're useful in this context.\h\h
Materialized views have been around for a long\h time; they were invented by the database community\h\h
in the early 1980s. And essentially they are\h running the query and then remembering the result.\h\h
Which sounds like a fairly straightforward\h thing — it sounds very much like a cache.\h\h
And the key question with materialized views,\h just like with caches, is how to maintain that\h\h
materialization. What happens if the data in\h the tables change? And so this is what is called\h\h
materialized view maintenance. And ideally we want\h this maintenance to be incremental. We want it to\h\h
be so that we don't have to re-execute the query\h that is materialized every time the underlying\h\h
data changes. For example, if some story has\h 100,000 votes, and then one vote more comes in,\h\h
we'd like to not have to count to 100,001, and\h instead just take the current count of 100,000,\h\h
and have the system just realized it can\h just increment that count by one rather\h\h
than recomputing the query from scratch. We also\h have this questions when it comes to materialized\h\h
view maintenance of whether we maintain\h on write, so proactively when a new change\h\h
happens to the underlying data. Or whether we do\h it sort of lazily and on demand on later reads.
And this brings us to the next\h part of the thesis title, which is\h\h
how we maintain these kinds of materializations.\h And the title gives us a clue: dataflow-based.\h\h
Now, dataflow has a lot of meanings\h in different parts of academia,\h\h
and in particular within Computer Science,\h but essentially it is a systems architecture.\h\h
And in this talk, we're going to be talking about\h dataflow as having data move to compute. The idea\h\h
here is that instead of code fetching data\h from tables, you're going to have the tables\h\h
send data towards the compute. You can think of\h this a little bit like push-based computation.\h\h
And these data changes are going to propagate\h through a graph of operators. These are\h\h
relational operators like joins, aggregations,\h and filters. And you can sort of think of them\h\h
as whatever operators you might use in SQL. This\h might sound a little hard to get your head around,\h\h
and so I'm going to give an example\h shortly of how this works out in practice.\h\h
And then each edge of the dataflow is going to\h indicate a data dependency. So for example, a join\h\h
is going to depend on all of the inputs of that\h join. An aggregation is going to depend on the\h\h
table, or the data, that it is aggregating over.\h And then the messages that flow over the edges\h\h
of the dataflow are going to be deltas. A delta\h here is a full row, it has columns just like the\h\h
rows that are coming out of the operator above,\h but they also have a "sign" which is either\h\h
positive or negative. If a delta has a positive\h sign, you can think of it as an addition, an\h\h
additional query row result to what came before.\h If it's a negative, it's a revocation of some\h\h
result that was previously issued. You can think\h of it as sort of "add this" or "remove this". And\h\h
in particular, we're going to be looking at this\h whole dataflow model in the context of "Noria",\h\h
which is an eventually consistent materialized\h view system that is built using dataflow.
So let's take a look at a concrete example of\h how dataflow can be used to maintain materialized\h\h
views. So here on the left I show you a query,\h and on the right, I show you the dataflow that\h\h
Noria constructs for that query. So the query\h here creates a materialized view by the name\h\h
of StoryWithVC. VC for vote count here. And then\h defines a query that does a join between stories\h\h
and votes on the story id. And then counts the\h number of votes for each story. You can see it\h\h
groups by stories.id. And if you squint at\h this, you can see how some of the parts from\h\h
the relational query appear in the dataflow graph.\h For example, there's an aggregation here, a count,\h\h
indicated by the sum operator in the dataflow.\h And it groups by the stories id, which we know\h\h
from the join clause is the same as the story_id\h column of votes. Similarly, there's a join between\h\h
stories and votes, and that is represented\h by the join operator in the dataflow graph.\h\h
You might notice here that these seem a little out\h of order: the join is between stories and votes,\h\h
not stories and the vote count operator. And what\h we see here is the effect of query optimization.\h\h
Noria has decided that it's going to do the\h aggregation first for optimization purposes.
And now the question becomes: what if the\h application wants to read from this view at\h\h
the bottom, StoryWithVC. Well what happens\h is that the application issues a query\h\h
over that view. Which might look\h something like this. So in this case\h\h
SELECT * FROM StoryWithVC WHERE id = ?. Where\h question mark here means some parameter that\h\h
is input by the application at runtime. And that\h parameter is essentially going to be a lookup key\h\h
into the materialized view to get out just\h the subset that the application cares about.
Now imagine that some data changes in the\h underlying tables. So imagine for example\h\h
that a new vote comes in for some story. What\h Noria is going to do is take that input change\h\h
and inject it into the dataflow at the votes table\h node in the dataflow graph. And then it's going to\h\h
stream that as an update through the\h dataflow graph, along each of the edges,\h\h
down the dataflow, all the way down to\h any materialized views we might have.\h\h
And the idea here is that the update here is\h modified by the operators it passes through.\h\h
So for example, when the vote — let's say it's\h for story 7 — passes through the count operator,\h\h
the count is going to replace, or convert, that\h +1 vote into a negative of the old count — a\h\h
negative delta — and a positive delta of the\h new count. So in this case, it'll be a -7,\h\h
the story id, 42, the old count, and a positive\h 7, 43 for the new count. You can think of this\h\h
message as communicating: "It used to be that\h the count for story 7 was 42. Forget that,\h\h
it is now 43". These two deltas then flow along to\h the join, and the join needs to convert that again\h\h
into some kind of delta that can be applied\h to the downstream materialized view.\h\h
From the query, we know that the materialized view\h has a bunch of additional columns that come from\h\h
the stories table. And so the join performs a join\h by doing lookups into the stories table and then\h\h
stitching together the output deltas so that\h they have the additional columns. Those deltas\h\h
then flow through the dataflow again, down to the\h StoryWithVC materialized view where they update\h\h
that view in place. And now subsequent reads\h from that materialized view are going to see\h\h
the updated result. The result that\h has the count of 43 rather than 42.
Now that we've walked through dataflow-based\h materialized views at a relatively high level,\h\h
let's start to look at the last\h part — or first part — of the title:\h\h
partial state. And partial state is\h the core contribution of this thesis,\h\h
and what we'll be looking at\h for the rest of this talk.
So partial state can be summarized as\h "learning to forget". The observation here\h\h
is that the chances are most entries in any given\h view are not accessed. Old and unpopular stories\h\h
are just sitting there wasting memory; if no one\h ever reads story 7, why are we storing its result?\h\h
Furthermore, why are we even computing, and\h keeping the results up to date, if no one reads\h\h
them? Similarly, if we don't have the memory to\h keep all the results for every story in there,\h\h
we might need to choose to only keep some of them.\h Say, only the popular ones that speed up reads the\h\h
most. We want to trade off in favor of the most\h popular and most frequently accessed things.\h\h
But traditional materialized views don't really\h give us this opportunity. As we saw before, you\h\h
just have a query that gets all of the results,\h and then you have to query over that view.\h\h
And what we need is some way to evict\h old entries from materialized views,\h\h
and only add new ones on demand. Which brings me\h to three of the main contributions of this thesis.\h\h
The first is the notion of missing\h state in materialized views:\h\h
the ability to mark some state as not being there,\h and not expending memory on it. The second is the\h\h
mechanism of upqueries, which allow populating\h missing state using dataflow. And finally,\h\h
an implementation and evaluation of partial state\h in the Noria dataflow materialized view system.
So in order to get at partial state,\h one thing we first need to figure out\h\h
is this separation between the definition\h of the view and the query over that view.\h\h
Because unfortunately this doesn't give us\h quite enough information to make the view\h\h
partial. We don't know what key to use as the\h basis for deciding whether a subset of the view\h\h
should be made missing or not. We don't know the\h query key that the application is going to use.\h\h
And therefore, in partial state and\h in partially materialized views,\h\h
what we're going to do is introduce the query\h parameter into the view definition itself. So\h\h
instead of writing what you currently see\h on screen, and having them be separate,\h\h
we can have a single view definition that also\h includes this clause of where the story id equals\h\h
question mark. And then have the question mark\h be a parameter for the view. And we're going\h\h
to use this key to determine if a given result is\h present or missing. And then when the application\h\h
wants to query over this view, it's going to\h execute that view without any additional query,\h\h
and just supply the parameters that are\h identified by the question mark placeholders.
And what's neat about this is that when an\h application wants to execute a given view,\h\h
let's say for story 7, that\h gets sent as a request to Noria.\h\h
Which then receives it and looks up into its\h materialized view, looks for the index for\h\h
story 7, and it might find that that entry is\h missing, indicated here by a hollow green box.\h\h
Now when this happens, some mechanism needs\h to be in place in order to fill that result,\h\h
that subset of the view, so that we can respond\h to the application. The way we do this is using an\h\h
"upquery". And an upquery is, as the name implies,\h a query that goes up the dataflow graph. In this\h\h
case, immediately to the join operator. And you\h can think of an upquery as a request for a summary\h\h
of all the relevant past deltas to be\h retransmitted through the dataflow. You can think\h\h
of it sort of as the downstream saying: "tell me\h about 7 again, because I forgot all about it".\h\h
Now, the join in this case is stateless — it\h doesn't have a way to send such a summary — and so\h\h
in order to support this, upqueries\h can also recurse. In this case,\h\h
the join might choose to forward this upquery to\h one of its ancestors, which might be stateful.\h\h
In this case, let's say it decides to forward\h that upquery to the count operator on the right.\h\h
And you can think of this as sort of asking:\h "can you tell me about 7 so that I can tell the\h\h
join — the downstream — about 7?". Now, a quick\h aside here: we've only talked so far about state\h\h
in tables and in views. But there is other state\h too. Remember how I mentioned that the count takes\h\h
a +1 vote and turns it into a negative for the\h old count and a positive for the new count? Well,\h\h
in order to do that, it needs to\h remember that the current count was 42.\h\h
And the way it does this is it also keeps its\h own little materialized view internally. It's\h\h
a materialized view of its past output.\h And so this might actually be state that\h\h
is present in the dataflow, and that\h can be used to satisfy these upqueries.\h\h
In this case, it might be that story 7 is known\h in that materialized state, and so we don't need\h\h
another upquery to recurse all the way back\h to the individual votes in the votes table.
When the aggregation responds to the upquery,\h it sends that upquery just directly through the\h\h
existing dataflow. And this is a key point: there\h is no need for special operators or different\h\h
forward or backwards query processing in this\h design. Instead, the response that gets sent\h\h
through the dataflow is just a single message\h that holds the current state of the source for\h\h
the requested key. And it's processed just like\h any other dataflow update. When that response\h\h
arrives at the join, the join does the same thing\h it would have done for any other update: looks\h\h
up into stories, patches together the output, and\h then forwards that back to the materialized view.\h\h
And that materialized view, when it receives this\h upquery response, takes that and uses it to fill\h\h
in the hollow box because this represents the\h answer for 7. At this point we now have enough\h\h
state to respond to the application query\h for 7, so we can just send the response back.
What's nice about this is that\h the state has now been populated.\h\h
So if we now get later queries for the same\h value, 7, we can just serve them directly from\h\h
the same materialized view because that state\h is no longer missing. What all of this enables,\h\h
and the sort of key feature of partial state,\h is that at some point later, if story 7 falls\h\h
out of favor — becomes old — we can evict that\h entry. In fact, we can evict any state we want\h\h
internally in the system related to that key.\h So for example, we might choose a year later\h\h
to evict story 7 from both the aggregation\h state and also from the materialized view.\h\h
And this lets us save memory that can then be used\h to materialize other results that are popular.\h\h
I mentioned also that we might waste compute\h to keep materialized view results up to date\h\h
that are no longer being accessed. And eviction\h and missing state allow us to work around this.\h\h
Imagine that after evicting 7, some change\h happens to the story table where say the title of\h\h
story 7 changes. This is going to introduce\h a delta in the dataflow graph that changes\h\h
the story. And when that arrives at the join,\h and the join does a lookup into the state of\h\h
the aggregation to figure out what the count\h is, it's going to encounter missing state.\h\h
And when it does, we can safely evict that…\h Sorry, we can safely discard that update to\h\h
the story. The reasons for this are a little\h subtle, and the thesis goes into more detail,\h\h
but essentially if you observe missing state for a\h given key in a sibling in the graph, it generally\h\h
also means that that state is missing downstream.\h And therefore there is nothing to update — there\h\h
is no state for 7 downstream — and so don't need\h to update it and can discard this update entirely.
At this point I've talked enough about\h materialized views and dataflow and partial state\h\h
that I think it's time for an intermission\h where we talk a little bit about related work.\h\h
And there's been a lot of related work in\h this general area. And the first of course is\h\h
materialized view maintenance. So, materialized\h views, in general, have traditionally been used\h\h
for a slightly different workload than what I'm\h going for with this thesis work. In general,\h\h
materialized views are used for something\h like an analyst that comes in to check\h\h
results occasionally. And it would be too slow\h that complex query the analyst wants to run\h\h
those few times they come in. And therefore, over\h time, we're just going to keep this materialized\h\h
view. And then when the analyst comes in they\h can just open the view and it opens immediately.\h\h
And so the focus has more been on the maintenance\h than the reads because the reads are infrequent.\h\h
Think of it as there is a high frequency and\h volume of writes, and so we want to make sure that\h\h
the maintenance is as efficient as possible. But\h it's okay if the read has to sort of do a bunch\h\h
of work when the analyst sits down as long as it's\h much less than it would be to execute the query.\h\h
These systems also generally have\h little or no support for on-demand\h\h
queries. The queries are often compiled\h into a program, or something similar,\h\h
and it's not really built for the kind of dynamic\h query setting that we often see in something\h\h
like a web application. And for similar reasons\h these systems rarely have support for eviction,\h\h
because it's not really needed; the analyst's\h query is what it is. That is the view.\h\h
And there's no subset of the view that\h is more important than the others.
Another area of related work is automated caching\h systems. We've seen a number of these come out of\h\h
both industry and academia. Unfortunately,\h especially the ones out of industry,\h\h
these tend to be very tailored for a particular\h purpose. They're not really general purpose things\h\h
that you can plug and play into your\h own application. What usually happens\h\h
is some large company has a particular\h caching problem, they build a solution\h\h
that works for them, but you can't just spin\h that up in your Ruby on Rails application.\h\h
These systems also often only support\h invalidation, and not incremental updates.\h\h
They just focus on evicting things from cache\h that are related, and not updating them in place,\h\h
which has the downside that now you might miss\h a lot and have to go the backend a lot more\h\h
than is necessary. Furthermore, these systems are\h often limited to specific database interactions:\h\h
they require that you go through a framework or\h an ORM. As opposed to what we're targeting here,\h\h
which is sort of general-purpose SQL where\h you just write SQL queries, you write views,\h\h
and they just work similar to\h parameterized prepared statements in SQL.
And finally, there's been a lot of work\h on dataflow and stream processing systems.\h\h
And these systems are also usually focused on\h write performance, similar to materialized views,\h\h
where there is a data pipeline and you want\h to perform all these ongoing computations\h\h
over that data pipeline. These tend\h to focus on strong consistency,\h\h
which usually comes at the cost of read latency.\h The reads generally have to coordinated with the\h\h
data pipeline somehow; often the reads even\h have to go through the write processing path in\h\h
order to give consistent results. Whereas with\h Noria we can leverage the eventual consistency\h\h
to give much faster results by leveraging the\h fact that the results are allowed to be stale.\h\h
These dataflow systems, and stream processing\h systems, also tend to have limited support for\h\h
on-demand compute and eviction. Here, too,\h you sort of set up some queries, and they\h\h
run for a while, and you can't evict partial\h results from a given materialized view. If you\h\h
wanted to you would sort of have to terminate\h the process, or something along those lines.
So now that I've talked to you for a bit\h about what the system does, you might wonder:\h\h
well, are we done? Is everything you told me\h everything I need to know, and I can just go\h\h
run this in my application. And unfortunately\h that's not the case. Although if it were,\h\h
this thesis wouldn't be very valuable, and so\h I'm kind of glad that there are more challenges.
Because it turns out that, in practice, things\h are hard. In particular, we need to ensure that\h\h
data changes to the tables take effect exactly\h once. For example, if you do an insert to the base\h\h
table, well that insert has to happen. It can't\h just… The row you inserted can't just vanish. And\h\h
if the database applied the insert multiple times\h so the table now contains that row multiple times,\h\h
that would also not be great. Now, this might\h strike you as weird if you come from a traditional\h\h
database world, because the only way that could\h really happen is through a bug. The database sort\h\h
of goes through and looks at indicies and scans\h tables and such, and would only encounter any\h\h
given row once. But in this model, it's a little\h bit harder. And the reason is because, well,\h\h
it's two-fold. First, upqueries are summaries of\h past state. Things like all the state in the past,\h\h
all the deltas in the past, for story 7. But\h upqueries can happen concurrently to updates that\h\h
flow through the dataflow graph. And those updates\h might be for the same state. Imagine, for example,\h\h
there's an upquery for story 7 at the same time\h as story 7 is being updated. We need to ensure\h\h
that these don't conflict with each other and\h end up violating this sort of exactly once rule.\h\h
Similarly, we want the ability to discard updates\h early to not maintain state unnecessarily.\h\h
However, if we discard things erroneously\h — if we encounter missing state but there\h\h
is downstream state that actually depended\h on the update that's flowing through the\h\h
graph — then that'd be really unfortunate:\h that result would be permanently stale.\h\h
There are a lot of hazards that can cause these\h kinds of problems, and the thesis goes through a\h\h
fair number of them, but in this talk I'm going\h to focus on one just because of limited time.
The one we're going to focus on is incongruent\h join evictions. And if you don't know what these\h\h
are, it's not that weird because it's\h a term that I made up. But hopefully\h\h
you're going to see in a little bit\h why I think the name makes sense.
So let's start then with "what is an incongruent\h join?". What I'm showing you here is another query\h\h
and dataflow side by side. It's a different query\h and dataflow from the ones we looked at before\h\h
even though they look kind of similar. Here,\h we have a stories table like before, and then\h\h
we have a users table. And the idea is that the\h stories table has an author id column. And in\h\h
order to display a given story we probably want\h the author's name rather than just their user id.\h\h
And so we store a separate table that has the user\h ids and various information about that user, such\h\h
as their name. And so in this StoriesWithAuthor\h view we're doing a join between stories and users\h\h
which ends up combining the results such that the\h output rows actually contain the author's name\h\h
pulled from the users table. What makes\h this join incongruent is the fact that\h\h
query key is different from the join\h key. The query key here is the story id,\h\h
whereas the join key is\h the author id of the story.\h\h
That's what we end up looking into users\h with. Now, in general this is not a problem.\h\h
Let's first work through the sort of correct\h case of what happens when an upquery occurs.
Well let's say that someone queries this view as\h well for story 7. An upquery goes to the join,\h\h
the join is stateless, it forwards to the\h stories table. The stories tables looks up\h\h
the state for 7 and sends back a message along\h the dataflow saying "here's story 7, the author\h\h
is 42". It might include some other columns\h too, but let's just consider these for now.\h\h
The join then dutifully does the lookup into\h the users table, and wants to look for the\h\h
user information for user 42. And let's\h say it finds the name for user 42 is Lena.\h\h
Well in that case it takes that information,\h it stitches it together with the update that\h\h
flowed in as a response to the upquery, and\h then it forwards the response downstream to\h\h
the materialized view to fill the hollow box\h — the missing state. So this includes 7, the\h\h
various columns from the stories table, and the\h author name it fetched from users. So far so good.
But now recall this figure. Where if we encounter\h missing state in a sibling, we end up just\h\h
discarding the update, assuming that there's no\h state downstream that we might have to update.\h\h
This ends up causing us a problem in this\h particular case. Let's consider what happens\h\h
if the author changes. So this would be\h represented in the dataflow as two deltas:\h\h
one that removes the story with the old author,\h and one that adds the story with the new author.\h\h
These two deltas flow down\h through the dataflow graph,\h\h
and when they arrive at the join, the join\h then needs to do the lookups as before.\h\h
So it first does a lookup for 42. It again finds\h the state for Lena, and populates the author name\h\h
column for that in preparation for sending it to\h the materialized view. And then it needs to do\h\h
a lookup for author 43 to do the same for the\h positive delta. However, what if that lookup\h\h
misses? So the lookup misses in the users table,\h and now we don't know what the author's name is!\h\h
Now, you might wonder: how could this happen if\h users is a table? And there are two answers I can\h\h
give you to that. One is you could imagine that\h the users table is, say, only partially stored in\h\h
memory, and 43 would have to be fetched from disk.\h Now, Noria doesn't actually support this — keeping\h\h
materialized state in memory versus disk — but in\h practice there are other ways that this can occur.\h\h
For example, here I've given you a simplified\h view where users is just a table. But you could\h\h
imagine that users is actually a view in and of\h itself, and has some large dataflow upstream.\h\h
And then there could totally be results in users\h that are missing. Regardless of how this happens,\h\h
the join still now has a problem: it needs\h to complete the processing of that update,\h\h
but it doesn't have the state that's required\h to do so. So what do we do? We can't produce\h\h
the needed update. We cannot forward just the\h negative, and just sort of discard the positive,\h\h
because if we did, the downstream materialized\h view would now have no rows for story 7. And any\h\h
subsequent query for story 7 would get no results,\h which is obviously not correct. We also can't just\h\h
drop the update altogether. Because if we did,\h any subsequent read for 7 would now see the\h\h
old author name rather than the new one. And\h this would also be a problem. So what do we do?
Well, your first instinct might be that we can\h just fill the missing state. Right? We just have\h\h
the users table do an upquery — or something —\h fill in the state for 43, and then everything is\h\h
great. And while this seems like an attractive\h solution, in practice it doesn't work so well.\h\h
Because we can't… we would have to hold up the\h dataflow at the join until that upquery finishes.\h\h
And that might potentially take a long time.\h Remember, the users table might be a view that has\h\h
huge dataflow above it. And satisfying\h that upquery might take forever. Well,\h\h
at least a very long time. And during that\h time, the join can't process any more updates.\h\h
And we're sort of stuck just holding up\h the dataflow — not processing more writes.\h\h
In fact, it turns out this is even worse. There\h are cases — I won't go through them here — where\h\h
this might end up in deadlock. Where the join\h can't process more updates until the upquery\h\h
finishes, but the upquery cannot finish until the\h join processes more updates. And now we're stuck.\h\h
You might think: well, can the join just\h process later updates to avoid this problem.\h\h
But we can't process them out\h of order because there might be\h\h
later updates that depend on this one. And so\h we need to finish processing this one first.
Okay, so that's not a viable solution. What do we\h do instead? Well, partial state actually gives us\h\h
the mechanism to solve this problem, and that\h is evictions. It's true that we don't know what\h\h
the author for story 7 now is, but we can just\h communicate that downstream with an eviction. We\h\h
can evict story 7 from the downstream materialized\h view. And now, all the application has… or,\h\h
all Noria has to do, is when a later query comes\h in for 7, that's going to fill the required state\h\h
through the normal upquery flow, which we\h already showed worked just fine. And it\h\h
turns out the system can actually detect when you\h have incongruent joins, and only send evictions\h\h
in those cases where it's necessary. It\h doesn't have to do it for every join.
So might wonder now: does all of this work?\h Like, this seems like there's a bunch of\h\h
mechanism internally, does it end up just\h killing performance — is this even worthwhile?\h\h
In order to evaluate that, we need a realistic\h test subject. And for this thesis I chose\h\h
Lobste.rs, which is a Hacker News-like\h news aggregator. Users submit stories,\h\h
the users can vote for and comment on those\h stories, look at top lists of the most popular\h\h
stories, that kind of stuff. And I chose Lobste.rs\h for two reasons. One is that it is open-source,\h\h
which allows us to see the queries that are\h issued for different page requests. And the second\h\h
is that the data statistics for Lobste.rs\h are available. This is stuff like how many\h\h
requests come in per second in general, how are\h those requests distributed across different pages,\h\h
what users what stories are most popular and most\h active. And ultimately all of this data allowed me\h\h
to build a workload generator for Lobste.rs\h that can synthesize Loste.rs-like requests.\h\h
And the reason why we want a generator\h here is so that we can change the load\h\h
of the system artificially, and\h see whether the system keeps up.\h\h
The generator also has a pluggable backend so\h that we can choose to run it against MySQL,\h\h
against Noria, and just see how they perform. It\h also lets us bypass Ruby on Rails so that we can\h\h
benchmark the true backend performance rather\h than the language that's serving the frontend.
So lets' first look at how MySQL does.\h This is MySQL running on Amazon EC2.\h\h
It's a 16-core instance. And this is showing the\h throughput that the workload generator can get to\h\h
before the system falls over. Before all\h the cores on the machine are saturated\h\h
and the latency starts to spike. So\h this gets to about 400 pages per second.\h\h
And this is across all the different\h page types — it generates a sort of\h\h
mix of page requests. Now I want to point out\h that this is already with a denormalized schema.\h\h
The Lobste.rs developers have done a decent amount\h of work on their queries to make sure that things\h\h
like the vote count — or their equivalent of it\h that they call story "hotness"­ — is actually a\h\h
column of the stories table. So they don't\h generally have to join and aggregations in\h\h
their queries. Although their update paths in the\h application have to do a lot of work to make sure\h\h
those values are kept up to date. If you don't\h include this kind of manual denormalization\h\h
the throughput for MySQL… it can't even keep\h up with a single page request per second.
Now let's contrast that with running Noria without\h partial state. Noria without partial state already\h\h
does significantly better — this is about an order\h of magnitude improvement. And this shows you the\h\h
power of materialized view. Especially in a case\h like Lobste.rs where there's a majority of reads.\h\h
All those reads don't have to do joins; they're\h essentially all just key/value lookups. Here,\h\h
however, we run into another bottleneck: we're\h not CPU-bound, we're memory-bound. When Noria\h\h
without partial gets to about 4500 pages per\h second, it runs out of the 128GB of memory\h\h
on this machine. And in fact, it gets\h a little bit worse than that because\h\h
the memory use tends to increase over time,\h and so if you ran the benchmark for longer\h\h
it might not even keep up with this load\h for longer periods of time. And this is\h\h
already with some amount of optimizations within\h Noria to make sure we only materialize requested\h\h
values. So if no one ever asks for a particular\h query, we don't materialize results for it.
And now let's look at what happens if we run\h Noria with partial state. It does significantly\h\h
better — this is about 67% over Noria without\h partial, and about 18x MySQL. And here, there's\h\h
actually a little bit of room to grow. This\h benchmark falls over not because of saturating\h\h
all the cores, and not because of memory. It falls\h over because of processing upqueries. It turns\h\h
out that there's a particularly update-heavy\h path through the dataflow that currently\h\h
bottlenecks Noria. And this might be somewhere\h where additional optimizations could help.
And we can see how Noria with partial state gets\h to this higher performance number by looking at\h\h
the memory use. So here I'm showing you the\h memory use in GB for Noria without partial\h\h
and Noria with partial. At the capacity just\h below where Noria without partial falls over.\h\h
You see that Noria without\h partial uses over 100GB of memory,\h\h
and Noria with partial state\h uses only a third of that.
And you might say well, Jon, forget about\h MySQL, no one does that in practice. If I\h\h
implemented caching myself, is Noria really going\h to keep up with that? Is Noria really a serious\h\h
sort of competitor for that? And this\h is a really hard question to answer\h\h
because it depends on the implementation,\h it depends on the application in question,\h\h
it depends on the workload. And so instead what\h I did is construct a benchmark that is sort of\h\h
an idealized caching setup. This is one where\h there are no evictions, there are no misses,\h\h
almost all requests are reads, almost all\h requests turn into a single key lookup.\h\h
And there's only a single query in these\h benchmarks, so this is not all of Lobste.rs,\h\h
but instead just the stories and vote count query\h I showed you in the very beginning. And I ran this\h\h
against Redis, which is a popular key value story\h that's often used to back things like caches.
And in this idealized workload, Redis gets to\h just under a million requests per second. Which\h\h
might sound pretty good. However, Redis is\h single threaded — it can only run on one core —\h\h
and so if we really wanted to compare these,\h we have to assume that someone implemented\h\h
like perfect scalability across cores for Redis,\h and so here's what I'm going to show you is 16\h\h
times that number. This is not… the number on\h the right here is not a real benchmark number,\h\h
it is just 16 times the number on the left. Think\h of it as a theoretical maximum for what you could\h\h
get against Redis here. And then I ran the same\h benchmark against Noria. Now before I show you\h\h
the result, remember that Noria provides\h SQL, automatic maintenance of this cache,\h\h
and it doesn't have these requirements\h of everything has to be implemented\h\h
in the application logic. Instead, Noria does it\h automatically for you. And Noria here gets pretty\h\h
close to this theoretical maximum for Redis. Noria\h uses as many cores as it needs to satisfy reads.\h\h
It handles all the eviction and caching\h updating for you. And yet it gets within\h\h
a factor for two of what you could get with a\h theoretical Redis deployment on this machine.
Now, I've been talking for\h a while, and so I'm going to\h\h
start to wrap things up now. And I want to\h start first by talking a little bit about\h\h
future work on this. Because Noria is\h neither perfect nor complete. There\h\h
are a number of things that are missing\h both from Noria without partial state,\h\h
and partial state itself. For example, there is no\h support currently for things like range queries,\h\h
cursors, and time-windowed operators that might\h be useful for applications, and is somewhere where\h\h
there's room for innovation both within Noria's\h dataflow but also within partial state. It'd also\h\h
be really nice if there was a way to integrate\h Noria with an upstream database. So imagine that\h\h
the tables in Noria were not stored in Noria\h itself, but was stored in Postgres or MySQL\h\h
or some other source of truth database that the\h user or application developer is already running.\h\h
Similarly, there's an attractive prospect here of…\h because Noria uses dataflow to manage all of the\h\h
updates to materialized views, there's no real\h reason it has to stop at the materialized views.\h\h
Imagine that the user, or the application\h developer, also has client applications running on\h\h
devices that users are holding. Well it might be\h that those are showing subsets of views that are\h\h
on the server. And you could imagine that the\h deltas are allowed to flow all the way to the\h\h
end-user device and update the views there in\h place. This is sort of reactive programming,\h\h
or reactive applications, that are becoming a\h bit of a trend in web application development.\h\h
And finally, fault tolerance is something\h that Noria has a somewhat weak story for.\h\h
But partial state might be able to help here.\h The idea would be that if a given node goes away,\h\h
what we can do is just make the state that\h that node held as missing, and then have the\h\h
partial state and upqueries mechanisms take\h the role of trying to repopulate that state.
Now, before I end, I want to acknowledge the\h influence of a couple of people on this work.\h\h
And I want to start off with the committee.\h Robert and Frans have — over the years — just been\h\h
continuously asking and trying to make me build\h a better story for Noria. And for partial.\h\h
And I think this is one of the reasons why the\h work is where it is today. I think initially I had\h\h
these sort of hazy ideas for something that might\h be cool, and I think the two of them have really\h\h
helped hone the story, hone the argument, for why\h this is useful and what the work should focus on.\h\h
And this project wouldn't be anything like what it\h is today without their invaluable input. Malte as\h\h
well was a post-doc in my group during many of the\h years of Noria's development, and without Malte\h\h
Noria wouldn't have SQL. It wouldn't have query\h optimization. It would essentially just be a\h\h
dataflow system. And his contributions have been\h invaluable and without him this work would not be\h\h
in its current state. I'm also very glad that\h I have Sam on my committee so that I can draw\h\h
from some of his database experience. Which is\h something that he has a lot more of us than us\h\h
mere systems people. I also want to thank the\h people of the parallel and distributed operating\h\h
systems group at MIT. The people there have\h just been giving me… they just have this endless\h\h
curiosity and insight and support that\h I think has made me the researcher I am\h\h
today. And I don't think I would be here without\h having them surround me throughout the years.\h\h
I also want to thank my family —\h my parents and step-parents — who\h\h
have always been encouraging me to pursue my\h interests no matter how geeky they might be.\h\h
And I think they've sort of been nudging me along\h all the way until I got where I am today — where\h\h
I'm now giving a very geeky presentation\h — and I think they'd be happy with that.\h\h
I also want to thank my girlfriend Talia. I\h am amazed that she's tolerated me just locking\h\h
myself in my room for hours on end working on\h this work, on the evaluation, on the thesis,\h\h
on this presentation. I am so grateful and\h happy that I have her, and… I love you.
To conclude: my thesis enables\h materialized views to be used as\h\h
caches. It does so by allowing state\h to be missing from materializations,\h\h
and using upqueries to populate missing state on\h demand. The resulting system provides automated\h\h
caching for SQL queries, and reduces the\h need for complex, ad hoc caching logic.
Thank you all for listening, and please\h ask any questions you might have.