Cookies   I display ads to cover the expenses. See the privacy policy for more information. You can keep or reject the ads.

Video thumbnail
- [Jordi] Thank you everyone for coming.
It's a privilege to be here today
with my colleagues Ken. - Hello.
- I'm James. - It's nice to see you
all here this morning.
- My name is Jordi Mola.
I'm the architect for this time travel
debugging tools and we are happy to present them
to you today.
So I got the clue from a great presenter one time
to start with questions up front.
So I'm going to open Notepad and type up
a few questions from the audience, if there
is any question before we start?
Yes, go ahead. - I wonder how
this differs from RR.
I mean, aside from the fact that the platforms
are mutually exclusive.
- How does this differ from RR?
OK.
There was another question here up front.
Sorry? - Windows only.
- Windows only? - And which Windows?
- Oh.
Which versions of Windows?
OK.
Let's just start with those and then I'm sure
that there will be more questions as we go forward.
So.
What is Time Travel Debugging?
OK.
Time Travel Debugging, TTD, not test development,
just Time Travel Debugging, for short, TTD,
is a reverse debugging solution.
So we record the execution of the process.
Think of it as a movie.
We take a movie of the process execution.
As a movie, when you have a DVR or a DVD
or some of those things, you can go forward,
some backwards, to look for things.
But also you can search through the trace using queries.
Ken will do a demo on how to do queries
to search through the trace.
When we look at debugging, we set out to solve
a few problems.
Debugging is time consuming.
I'm sure all of us here have a single step
through a function painfully to make sure
we didn't get past the point of the function
that was of interest,
and then spend an hour just single stepping
to not yet be at the point we wanted.
Debugging is really hard.
We actually will do a demo today of debugging
a garbage collection bug that can be very nasty
to debug without some of these tools.
Debugging is complex.
So we made sure that we didn't have to learn new things.
All of the things that you are used to,
when you debug live, break points,
single stepping, going, memory break points,
work on backwards debugging.
And finally, this is a pet peeve of mine,
the number of many times you have to repo a bug
when you are debugging normally.
You have to go and, oh, I learned something, start over.
Because it was a little different than I thought.
Now I learned something else and start over
and in time travel debugging, you just simply
have to go back.
Some of us have used a trick of binary searching a bug.
Which is look for a place the bug happening,
not yet, OK, this isn't the second half of the trace.
And move on.
A little bit of background and key features
for this, this is based on a Microsoft research project
that was internally available about 10 years ago.
But three, four years ago, we decided to make this
publicly available, so we went in and did work
to convert the, excuse me, research code
into production code, that was an effort,
we also feel that we are ready for preview,
but it's a preview, it's not final,
we still have some bugs in several areas to work on.
In particular, for those that have worked at Microsoft
have used research technology, our replay is about
two orders of magnitude faster than it was for
the old tool.
Record is still as much overhead as it was before.
It's something we will be working on in the upcoming
months and years.
It's written in C++, but it's C++ of different kinds.
A recorder is a piece of code that we inject
inside of the process that we are recording,
so that uses no exceptions.
Cannot allocate memory.
It has a bunch of restrictions
that resemble embedded systems.
On the other hand, the replay code is normal, modern C++.
It uses coroutines, vectors, maps,
the parallel library, and I mean, this is an enormous
C++ piece of code.
Performance sensitive, but normal.
OK.
Our recorder is multicore.
That's one of the differences with DRR projects
since it was one of the questions.
We do not single core the execution of the process.
To actually record it.
If you have 50 runnable threads,
and 50 cores, we will use all 50 cores.
This is important for us at Microsoft
because this is used on things like chain servers,
that run on 64 core machines, 128 core machines,
so it was an important aspect for us to make sure
that we could use significant amount of that machine
when we were recording.
And this is used a lot in escalation cases.
When you get to escalation engineers at Microsoft
that are working with you and your company
or your consumer side.
And you get to people that have to debug your machine,
this tool is used often.
Having quoting numbers above the 50% of the cases.
So.
The tool supports shared memory and a synchronous IO.
Now, different with the RR project.
And we have the ability to query the trace.
And again, Ken will do a demo of that,
which we think it's an interesting approach to debugging.
If you take one thing out of here is these tools
run in three steps.
You record the execution, you index the execution
to debug it as it automatically for you
as you open the trace for the first time,
and then you replay, you consume the trace
in the debugger.
We have done a few block post.
We will do more.
There's a bunch of documents.
There's how to download this, you have here
a link to go and play with it.
How we use this at Microsoft.
How we use time travel debugging at Microsoft.
We use it to resolve problems from customers.
And we use it to resolve problems internally.
There are teams at Microsoft, in Windows,
which is the organization we are in,
that will not ask you for a repro of a bug,
they will ask you for a trace of a bug.
A trace is essentially one of our recordings.
So you don't give them, do these 15 steps to repro,
and then you get bug, it doesn't repro my bugs.
Which we all have used as an excuse at one time or two
or three, a few times.
You get to trace them, and that also, bugs that have
traces on them have a much higher chance of getting
fixed than bugs that don't have traces with them.
This is awesome for bugs that are difficult to repro.
If you have captured the repro on a trace,
you have it, you don't need anything else.
Corrupted memory, we will see that on an example.
Race conditions, resource leaks, it works
very well at these type of problems.
How to get it.
It's part of the WinDbg Preview tool set.
And here you have a link on the post that has
help install it, a video of how to use it, channel nine,
there's a sticker on one of the sides
of the laptop for that.
But instead of explaining a lot of, you know, what it is
and how to do it,
let's go for demos.
James.
- Thank you, Jordi.
So, yes, some things are much easier to show than
to talk about, so I'm going to walk through
two demos of using time travel debugging.
In the first example, I'm going to show
how you can record an execution of a process,
and then take the trace that it's produced,
and actually debug it in the debugger.
So the first thing that you're going to need to do
is open WinDbg as an administrator.
So you have to have it open as an administrator
in order to record a process.
If you've never used WinDbg before,
this is it.
If you have used WinDbg before, this may look
a little different than what you're used to.
So this is a few weeks ago, we released
the first preview of this new WinDbg UI.
So it's, we're trying to make some major user interface
improvements, it should hopefully be a lot easier to use.
But it uses the same debugger engine under the hood,
so if you're familiar with, you know, a bunch
of WinDbg commands, all of that works, you know,
just as well as it did in the old one.
We're really happy that we made some big steps forward
with WinDbg here, and now that we're adding reverse
debugging, we're looking forward to making a lot of
steps backward as well.
So, what I'm going to do, is I'm going to go
to the file menu, and we have two options here.
That support time travel debugging.
So we can either launch a new program,
or we can attach to a running program.
So I want to run my own little test program.
So I'm going to pick launch executable.
And I'm going to type in the name of my program.
Which I've conveniently named test.exe.
Then I'm going to check this record process with
time travel debugging.
Now when I click OK, what's going to happen is
the debugger will run this program,
and it will be recording everything that it does.
It won't, the debugger won't be attached to the program,
it's just going to be letting it run normally
with the recording active.
So I'll click OK.
And we'll see the program start up,
and it's just a little test program,
so it prints out some strings, and then returns.
I'm gonna open the source file so we can take a look at it.
And so we can see here just this very simple program
in our main function, we have a vector of strings.
We iterate over the first 10 letters of the alphabet,
we create a string for each letter,
we print it so we have something interesting to look at
while we're running the program,
we add the string to the vector,
and then we sleep for a little while
so that the program actually takes a little time
to run and doesn't exit immediately.
The first thing to note is that when you
open a trace,
it starts off at the very beginning of where
you started recording, so here we can see,
we're very early in the execution of the process.
Now, I want to get us into the main function,
so I'm going to add a break point at the end of the loop,
and I'm just going to click go.
And so what the debugger has now done,
is it's replayed execution of the program
to the first point that it hits that break point.
So we can see, in the stack window,
we're inside of the main function,
which is where the break point is,
and we can see in the locals window,
we can actually see all of the local variables,
just like we would if we were debugging a live program,
so we can see the current characters A,
in the first iteration of this loop,
and there's one string inside of the vector.
So if we go, you know, a couple more times,
we can see it advances through the program
and it just keeps hitting that debug, or that
break point over and over again.
So there's a few things to point out here.
The first is that
so all of the visualizations,
so for example, WinDbg uses the same Natvis visualizers
that Visual Studio does, so you get all of the rich,
you know, ability to inspect STL containers here.
I mean, this is a vector of strings.
Looking at that just in raw memory can be rather
tedious and painful.
The other thing is, that all of the debugger commands
are used, too, with live process, generally work.
So for example, we have a G command to go forward
one step or one, that will run the program,
and you can see that we've advanced again
to the next time that that break point is hit.
In general, most of the commands you're used to
using will work, the one set of things
you can't do is you can't do anything
that would change the state of the program
because, again, the program's not executing right now,
we're just replaying what already happened.
Finally, I wanted to note here that
you'll notice that you don't actually
see the console window.
So when we ran the program originally,
you could see the console window printing out the text.
The reason for that is when we replay the program,
or when we replay the trace, everything is contained
within the debugger.
So we don't actually do the side effects
that the program had.
So for example, if it did any IO on the disk
or on the network, we're not going to actually
go back and try and read those files
or do that network IO again, which is very important.
Additionally, if while you are running your program
and tracing it, your computer lit on fire,
and you somehow salvaged the trace,
don't worry, if you run it in the debugger,
your computer isn't going to, you know,
catch on fire a second time.
So, all of that is well and good.
But this is a reverse debugger, so we also want
to be able to, well, reverse execution.
So you can see here on the ribbon,
we have the normal flow control buttons
to go and step into, and we also have,
for time travel debugging traces,
reverse flow control buttons.
So we can go back or step into back.
So if I click the go back, what it will do
is it will be as if it replays the program backwards
and it will stop wherever the previous stop was.
So in this case, it will stop the previous time
that this break point was hit.
So if I click that,
you'll see, we're still at the break point,
we've just hit at the previous time that it was hit.
So in the locals window, you can see instead of
the current letter being D, it's now C.
There's only three elements in the vector,
whereas there was four before.
So we have a G command in the command window
that would let the program go forward.
There's an equivalent G minus that goes backward.
So if I run that, we're now, again,
in the previous iteration of the loop.
If you're familiar with using the P and T commands
for stepping, there's also P minus and T minus commands
for stepping in reverse.
So those are the basics of how time travel debugging works
in the debugger.
But it's a lot more interesting if we look at
how to actually debug a real bug.
So to that end,
I thought since we were the first break out session,
I would welcome you all to the conference.
So I wrote a little program to print out a greeting for you.
So we can see here it says, "Hello everyone,
"and welcome to CPPCon 2017, enjoy the conference."
So consider yourself welcomed.
Oh.
It also seems to have crashed.
That's kind of unfortunate.
So, thankfully, I have a crash dump for this
that we can take a look at.
So I'm going to open up that crash dump,
and we'll see if we can figure out what went wrong.
So I'll open the crash dump that we had for that.
And I'll look at the stack window,
because hopefully it will show us where the problem occurred
and I see we're inside a bunch of exception handling
and then I see
here, that doesn't look like any of our code.
And nope, can't, hmm.
So there's not a lot of information here.
So I also, I saw this bug happen this morning,
so I also recorded a trace.
So I'm gonna go and open the trace
and see if we can debug it a little easier.
So this is the trace I recorded earlier.
You can open it, there's that open trace option
on the file menu.
And again, when we open the trace, it starts off
at the beginning of the trace.
So one thing that you can do,
if you want to move to another location,
we have a bang TT command.
So I can run bang TT.
And one option is you can pass it a percentage.
So I'm going to pass 100%, which will move it
to the end of the program.
And what I wanted to point out here is we're,
you know, this shows you we're in, almost the exact,
we're basically the same place we were in
in the crash dump, so again, the program was crashing,
so you know, this is where we would expect the end
of the trace to be, as well.
And I can move back to the beginning using bang TT
with zero to go back to zero percent.
So since the program crashed, I'm going to assume
that it's because an exception occurred,
and the exception was not handled in the program.
I don't know if it was a C++ exception or something else,
but chances are, it was an exception.
And so I want to find where,
during execution of the program,
that exception occurred.
And to do that, I'm going to use a feature
of the debugger, a relatively new feature called
the debugger data model.
So the data model provides a uniform and simple way
to get all sorts of information about the thing
that you're debugging.
In this case, the trace.
So I'm going to use the DX command,
which is used to query the data model.
And I'm going to look at the current process.
Which is kind of one of the root-ish things
that you can look at.
So in here, we can see the name of the program
while we were tracing it is greeting.exe.
That was the name of my test program.
We can see the process ID that it had
while it was running.
And then we can see a bunch of other information
that we can expand.
Now, the information we're looking for
is time travel debugging civic,
so I'm going to expand this TTD node here.
And then we're looking for an exception event,
so I'm going to expand the events list.
So the events list will have an event in it
for every time that a DLL is loaded or unloaded
in the process.
It will have an event for any time a thread was
created or terminated.
And then it will have en event for every exception
that was raised.
So if we see here, there's one exception in the list,
so that's probably what we're looking for.
So I'm going to click that entry, and then click
exception to see the exception specific properties of it.
So we can see here, it was a hardware exception,
so it was something that was actually raised
by the CPU, it wasn't a C++ exception
or some other kind of software exception.
We can see the program counter,
where the exception occurred.
We can see the exception code,
so in this case, I know this is the exception code
for an access violation.
So it's likely that our program either tried to read,
write, or execute some memory we didn't have proper
access to.
But most importantly, it also has this position field,
which tells us the exact location during execution
of the program where the exception occurred.
So what I can do is I can click this time travel link,
which will actually take us directly to that point.
So that's moved the debugger,
so we're now looking at the exact point
where the exception was raised during execution
of the program.
Now we can see here, in the stack window,
unfortunately, it still doesn't tell us where it happened.
We can see that instruction pointer,
which was that instruction pointer where the exception
occurred, but the debugger hasn't been able to decode it.
It hasn't been able to tell us, like,
what function it was in or even just what DLL that was in.
So that's not particularly useful.
Additionally, I noticed that the stack pointer here,
and the base pointer, are pointing to two
completely different locations in memory.
So between these two things,
I'm kind of guessing this was a stack corruption.
So someone overran a buffer on the stack,
or otherwise corrupted something on the stack,
and then when they returned from the function,
that ended up corrupting the CPU state
with an invalid instruction pointer and base address.
So you know, from our crash dump AA,
we wouldn't have even got to this point.
Like, it would have been very difficult for us to figure out
what was going on here.
Now, if we had a live repro, it might be inconsistent,
it could be very difficult to, you know,
to figure out from that.
We don't have either of those, we have a trace.
We know exactly what the program did,
and so we can just walk backwards and see what happened.
So in this case, I want to get back to before
the CPU state was corrupted.
So before that return statement, you know,
loaded that invalid instruction pointer
so that we can look at kind of get an idea
of where the exception happened.
And I think that happened recently, so I'm just going
to step back a few times until
we get a reasonable looking context.
So I'll step back once, and that didn't change,
so I'm going to step back a second time.
And you can see now, we've stepped backwards.
The next instruction that the debugger would execute
is a return instruction, so we've stepped back
to before the return.
And so we see now that the instruction pointer,
the debugger now knows where we are.
We're inside of the main function.
The base pointer is still incorrect, though,
because we haven't stepped back to before that was
corrupted, so I'm just gonna step back once more.
And now we can see most of the stack,
and we can see we're on the last line of the main function.
Now, we still can't see our caller,
and the reason for that is that we've stepped back
to before the CPU state was corrupted,
but we still haven't found where the stack itself
was corrupted, so the return address is still incorrect.
Now, I'm just curious, so I'm gonna,
now that we have our base pointer, I'm curious
what's there on the stack.
Now we'd expect there to be two pointers there,
we'd expect the instruction pointer to be there,
and we'd expect the previous frames base pointer.
So I'm gonna open up the memory window
and just copy that address in, take a look at what's there.
That does not look like pointers.
That looks like the end of the string
I printed out earlier.
And if we scroll up a little bit, we can see, yes,
there's the rest of the string, so definitely,
we've corrupted the stack somewhere.
So the trick now is to figure out, where did that happen?
And so what I'm going to do is I'm going to use
a break point and then run the program backwards
and have that break point stop on the exact point
where we corrupted the stack.
So to do that, I'm going to use this BA command,
which is break on access.
I'm interested only in rights to the for byte location.
Starting at that base address.
So basically, whoever writes to that pointer,
this will stop.
And at that point, all I have to do is click go back,
and we've now stopped at the exact instruction
that corrupted that pointer.
And we can see, so it's inside of the string copy
function, which is expected, we are copying,
you know some string into this buffer on the stack,
I expect the bug isn't in the string copy function,
I used to work on the C runtime,
and I'm pretty sure we didn't have bugs
in that particular C runtime function.
So I'm gonna look in this get CPP greeting function,
which is the first function in our program on the stack.
So we can see in here, this function just takes
a wide character buffer, and the size of the buffer.
It has the welcome message.
And then it copies that message into the buffer.
We can see in the locals window,
we can see the message points to the complete message.
We can see that the buffer has only part
of the message and the reason is that we've stopped
part way through the copying of the message into
the buffer, and then we can see that the size
of the buffer is allegedly 100 characters,
or hex 64.
So I'm going to guess that that's not correct,
given that we've overrun the actual buffer,
so I'm going to look at our caller and see
what it passed in.
So we can see here, the buffer actually only
has 50 characters.
Well, there's our problem.
And the problem is, well, I accidentally passed
size of greeting instead of greeting.size,
so I passed the number of bytes instead of
the number of characters, so whoops.
I'm gonna blame my code reviewers.
But actually here, the, so with this little test program,
it's only 20 lines of code.
Like we could have easily gone and just looked
at the code and diagnosed this and figured out
what was going on, but in a large program,
you know, debugging this kind of thing,
especially if it's a hard to repro bug,
or if it only occurs occasionally,
or if you're just getting crash dumps from the field,
this can be extremely tedious and difficult to debug.
So time travel debugging, you know,
works just as well with large programs
as it does with small ones.
And Jordi will be showing an example in a minute
of debugging a large program.
So.
So in our first demo, I showed how you can actually
record process using the debugger.
I showed how you can step forward and backward,
that you can use all of the same commands
you're used to using with debugging live processes.
With the exception of things that modify
the state of the program.
And the second example I showed you this debugger data model
that provides you a very rich way of querying information
about the process, and you'll see more of that later
in our upcoming demos.
I showed you how you can find events easily.
So we were very quickly able to find the exception
and navigate right to that point during the execution
of the program.
And then I finally showed how you can use memory
break points to quickly find who last modified something,
which is often the thing that we do most
with these traces, right?
Like, you see some data,
and you say, hmm, how did that get there?
Well, this makes it very easy to go and find that.
Right, you don't have to go hunting around
and see all the places that might write
and try and divine where they came from.
You can actually go directly to the location
where that right happened and find it directly.
So with that, Jordi is going to show a much more,
I guess, large example.
- Thank you, James.
So we wanted to show you an example with production code,
code is used normally, so we picked the Chakra engine.
This is the Javascript engine that is used in Edge on IE.
And I need an excuse to put the console debugger here
since I'm a console debugger user
so we are going to demo the check range
inside of the debugger.
The debugger is scriptable in Javascript.
And we are going to just look at a bug in Chakra that way.
So what we did was inject a bug in the Chakra engine
that is not that different from the bugs
that happen normally in production,
except it has a nice comment next to it when
we find that it will be obvious that we are there.
Click.
So.
With that, I'm going to launch the debugger.
Conveniently in the debuggers folder.
And I'm going to run this live.
First.
So.
When the debugger comes up, I'm going to just,
here's the console debugger, the same debugger
we were using, in the console,
and I'm going to load
the Javascript engine that
has the bug injected in it, so let me put the backslash
here just in case.
It loads it.
And then I'm going to run in a script, Javascript,
that all it does is allocate some memory,
use some memory and enforce some garbage collection.
So it just puts some pressure on the garbage collector.
It really doesn't do much more than that.
And OK.
It has loaded the script and it has secured it
second time, OK, this third time,
it crashed and if we looked at it, it's an access violation.
We see here that it's trying to read from this
awful memory location here which is minus two,
probably not the right address,
and we see in a stack that is
quite intimidating.
In fact,
OK, this is memory, memory recycler.
Realloc.
If I have to figure out where I got this minus two
from this caller stack, it's going to be painful.
In particular, garbage collection bugs
have the root cause usually at the distance
from the side effect.
In fact, if you took a dump from this
and looked for this minus two pattern,
you will actually not find it that much into the trace.
Into the dump.
It's not there.
So that makes things a little bit harder.
Let's repeat this case now with recording a trace.
So what I'm going to do is do the same thing,
I'm just going to record it.
And when I launch it, what you will see is that
there is some overhead with the recording,
in fact, the debugger has not yet come up.
If I have time at the end, I will show you
a case where things are a lot of overhead.
Still usable, but a lot of overhead.
Let me do the same things as before.
GSprovider.dll.
Load the guy that has the bug.
Let's run the script that we were running before.
And you can see, it's a little bit slower
and despite this, as I said, this is used
more than half the cases situation engineers,
and that was before the performance improvements
that we did in the consumption side.
OK, so.
It crashed like the other one.
It's indexing, so for expediency, I'm going to debug
the same thing that I have recorded on an unoptimized trace.
There is an upcoming channel nine video,
I think comes out next week, it's in post production,
where we debug this both optimized and unoptimized,
so you can see that it works on both cases.
This is a trace of the recording,
unoptimized, so it makes more C++ debugging.
And James already showed us that we can look at events
to find the exception.
So I'm going to do it differently,
which is just, tell the debugger to go.
And it will stop at the exception.
You know?
First chance exception, this exception may be handled.
OK.
To show to you guys a little bit of how this also
works the other direction, is let me go to the end
of the trace.
And go backwards.
G minus, and we are stopped in the same spot.
So that's a little bit of how it works,
forwards and backwards, everything works normally.
As before, we have this wonderful minus two here.
In fact, it's the disk pointer here, minus two.
Let's go back a little bit and see if I can figure out
where this value comes from.
So T minus to go back or if you prefer,
press the button in the UI.
Now, here we have the C++ expression, memblock,
ooh, memblock is also minus two.
So, that DX command that James was talking to us about,
can be used to also look at the local variables, memblock.
And it has a C++ indexed parser, so the ampersand sign
in C++ gives us the address.
The ampersand sign will give us the address.
So this is the address of that minus two value.
So I'm going to do the same thing that James was doing.
Put the break point, eight bytes, BA,
when someone writes these eight bytes on this address,
and I'm going to ask the debugger to go backwards.
What this does is it stops at the spot
where we wrote the value.
Now, memory break points, when you are debugging
on a processor, stop at the instruction after
the action took place, is the same here,
so this instruction after the right happened.
If I do want to step backwards, now we are
at the instruction where this thing happened.
Which is memblock, will get the value assigned,
and in fact, you see memblock has the previous values
still here.
I don't see free object list, so let me explain
this pointer, oh, this pointer members, so fine.
DX.
Free.
Object list, this would work,
or DX this free object list.
Same thing.
And as before, an ampersand will get me the address.
So now, I can do the same thing as before.
Put the breakpoint on this address that I have right here.
And as the debugger to go backwards.
Now, a couple of observations.
The address I just was looking at, this address,
is no longer of interest to me.
This is the value that now has the previous values.
No longer the minus two, so I'm going to disable this.
And the other thing is, the debugger,
when it's running live,
has a limit of how many hardware break points,
how many access break points you can put.
Intel processors tends to be four.
Other processors may be different.
With debugging a trace, you have no limit.
You can put 100 of these break points,
and it will just work.
So that's another advantage of debugging a tace,
and another reason why it's used often inside
of the Microsoft.
As before, we stopped, now I stopped on the instruction
after, so let me go one step backwards.
And I see this here.
I look at memblock.
It has an address.
I see free object list.
And if I, OK, has the new value.
The old value, not the new value.
Let me step into the function.
This is the same function as before.
Let me look at the values here.
It's minus one.
Not minus two.
But if I look at this piece of code here,
in fact, if we go back, we can see it.
That's how the minus two came to be.
OK.
So now, the question is, who put this minus one here?
Let me copy this address.
Let me step backwards.
Let me put the, sorry, let me put the breakpoint
and type correctly that the term,
the breakpoints, let me remove the last one I checked
because it's no longer of interest.
And go backwards.
And now here, I see the value minus one.
Which looks promising.
And I see this demo introduced a bug,
so that's the bug that we're looking for.
And to be clear, I am not familiar with this code base,
I have only used it to prepare for this demo.
The person that did the Javascript integration
on the debugger in our team is not me,
someone else, and we made something
that was hard and difficult, quite easy to do.
It was mechanical.
It just followed the recipe.
And it's not always that easy, but it often is
easy to debug bugs this way.
Excuse me.
Now, if we had to do this normally,
for those that have enough memory,
this stack looks nothing like the stack
that we had on the crash.
In fact, this crash happens over 100,000,000 instructions
before the root cause.
So this is not going to be something
that you can do manually by hand,
the old fashioned way.
So.
Let me stop debugging and we go back to the slides.
Let's take a look.
What we have seen is that there is overhead
when we record.
You get a lot of value out of it,
but there's some price to pay.
I said, inside Microsoft, we use this a lot,
so we find the price reasonable.
And it's something we will be working on.
If we have time at the end, I will show
another scenario where it's much more clear,
there is overhead.
Hard problems become easy.
I said, I have not seen this code base before,
I just followed a mechanical recipe of just
follow the bread crumbs backwards
'til you find the place where it does not look
like the right thing to do.
And the other thing is, I had one trace to do.
I did not have to restart the debugger over and over
every time I learned something.
Where this came from, since it's backwards,
just go and find it.
And so, with that.
Ken.
(applauding)
- Thanks, Jordi, that was pretty awesome.
I like that.
OK, so.
OK.
So what we've seen so far is a pretty familiar
debugging experience, setting break points,
looking at the stack, executing forwards.
Now executing backwards.
But time travel debugging allows us to look at
code execution another way.
It lets us look at it as a queryable data source.
So what do I mean by that?
Imagine you had a sequel database table
where each row was a point in time in your execution
and that there was a set of columns that represented
the register state of your processor and the contents
of your address space.
That would be a really big table,
but you could imagine doing queries that would find
specific call patterns, memory access patterns,
things like that.
What we have done is we have enabled this kind
of querying in time travel debugging.
Another kind of way to think about it is
if you've used the system tool's process monitor,
right, you have a work flow of capturing activity
in a file, then you go to the filter,
and you narrow it down to your specific problem
that you're interested in.
And we are trying to bring that kind of power
to debugging your programs.
So to illustrate that,
we took a Windows SDK sample for a media player
and we introduced a bug into it,
and so let's see what that looks like.
OK, so imagine it's four PM on a Friday,
you're getting ready to go home,
and you get an e-mail like this one.
I was trying to use your program,
and this is what I get.
And we see we have the dubious error message
that operation completed successfully.
So how many people have seen this one in the wild?
Yeah, excellent, excellent.
So, all right, well, let's see if we can use
time travel debugging to solve this quickly
so we can get out of the office.
So.
We are going to go here.
So.
You know, I could have asked Jordi for repro steps,
but he didn't remember what he did exactly,
or I could sit in this office and have him
hover over me while I try to debug it.
Which is kind of intimidating,
so I just asked him, please send me a trace file.
And so he sent me a trace file.
And so let's open it up.
And see what it looks like.
And let's, actually, let's stop debugging.
Just to make sure.
And we'll open it up.
And we'll close some of these old files there, great.
So.
We're sitting at the start of the trace.
And I'm gonna start off by making assumption
that Jordi probably stopped recording after he
saw the message box come up,
so I'm just gonna push the go button.
And go to the end of the trace.
OK.
And so we look down at the stack
and we see that the message box
is still in the stack there.
So our theory was right.
And again, this emphasizes the fact that,
you know, when you have your repro captured,
and there's really no cost to like going past
to where you want to execute,
you can just test theories, and if they don't
work out, you can try something else.
So in this case, the theory was right, so
let's take a look at what this code is doing
when it calls message box.
So, I'll scroll up a little bit.
And so, this is fairly typical code.
It calls get last error to find out what the error code was.
It formats it using format message
and then it displays it using message box.
Now, the fact that it said operation completed successfully,
that likely means the error code was zero.
We can prove that by using a standard WinDbg command.
GLE for get last error.
Yep, there it is, operation completed successfully.
OK, so.
Let's see how queries can help us find this problem.
So the first thing, so our basic idea is
that in Win32, a common pattern is,
when APIs fail, they return false.
And if you want more information, you have to call
get last error to get the information.
More detailed error code.
Now, if you forget to call get last error soon enough,
and you call some other Win32 API,
it usually gets, the error code gets set back to zero,
and you end up with lovely message boxes like that.
So, what we're looking for is a error code
that was returned that was non zero before
message box was called.
So let's use queries and see what that looks like.
So for now, I'm going to slide this over,
so we have plenty of room.
And I'm going to clear the console window.
So we access queries through the debugger data model
that we talked about earlier.
Yeah, through the DX command.
And what we're gonna do is we're going to capture
all of the calls to get last error and to message box.
And I'm going to assign it to a variable
so that we can do some other things to it
after it's ready.
So, we're gonna go to the current session.
Use the tab key to have Intellisense
have us find some of this.
And I know kernel base has get last error.
And user 32 has
message box W on the stack.
And we hit enter.
And it runs for a few seconds and it has,
it returns some results.
OK, so that doesn't look like very interesting results,
so let's just clear the screen and display it as a table.
And to do that, we use dash G for the grid command
and I'm going to use my previous variable.
And we can see this time that the results came up
much quicker, so it caches query results in memory
as you execute them.
To save time.
OK.
So here's our table of information.
And one of the things I notice is look at all these zeroes.
A lot of the times, there's no error,
but code is continually calling get last error,
getting a zero, and so, that's a lot of noise.
We like to get that out of the way.
So let's filter out the zeroes.
So, I'm gonna clear the screen again.
And use the up arrow key to bring up a previous query,
and we'll add to this.
And so, here we're going to use a link query
and this is link with a Q, not a K, this is not the linker.
And link is a query technology that's common in C sharp.
And we have more information about it on our blog,
if you haven't seen it before.
But what we're gonna do is use the tab key,
we're gonna do a where clause, and we're gonna limit
what we see.
The where clause talks a land,
so I'm just gonna use X to represent one particular call,
and say OK,
I want to keep anything where the function name
contains message box
because I notice that message box return zero,
we don't want to filter that out.
Or.
The return value
is not equal to zero, OK?
So we'll do that.
OK, now when we scroll up, we don't see any zero
error codes, that's nice.
We actually filtered out about 90% of the calls
in this particular trace file.
So now, the other, going back to what our goal was,
our goal was to find an error code that was set
before message box was called.
And so what we need to do is sort these things
by the time that they occurred in.
OK, so I'm gonna clear the screen, up arrow again.
So we can just refine things over time here.
And I'm gonna use an order by.
Which is another link method that also takes alanda.
And the land that tells us which field we want
to sort on, so I'm gonna sort on the time start field.
OK.
And I also notice that down here, if you get,
it shows 100 results by default, and we see
that there's some ellipses here.
And so I'm gonna make one more refinement to this.
Which is I'm gonna add a comma on the end,
and I'm just gonna say, give me like 1,000 results.
OK, so now I have all of my results.
And now let's see if we can find the message box.
OK, so I found the message box call, and we see that
before that, the first non zero error code was a two.
That's an interesting error code.
Let's go time travel there and take a look at what we got.
So we click on that and it shows us the time travel
thing here.
And here is my scroll.
Oh, there it is.
Empty scroll bar.
OK, and we click on time travel, and it takes us to
the time travel position.
And we see our error completed successfully.
And let's go down, let's see here.
Click on the right one?
I think I missed the right one here.
Clicking can be challenging at times.
Scroll up to message box.
OK.
And we go to on file open.
And our file name.
Zero there, that's not good.
Oh, I forgot the time, thank you, thank you.
OK.
All right.
This looks better.
So we do get last error and there's our system
cannot find the file specified.
We see we're calling into the media player.
Let's go to open file.
And we see that the file name
was passed in was CPPConDemo.txt.
That's not a very good media player option there, Jordi.
OK, so problem solved.
(applauding)
- OK.
Going back to the slides.
What we have learned is I don't know how to open files.
Send me a trace not a repro,
as we said before that's something that works very well.
The power of queries and then a little bit of C Sharp
at CPPCon, no?
Link.
So.
What's next?
What are we planning for the future?
Improve our recording experience.
We talked, there is some overhead.
We have done work from the days that the research
set out to prove that this was doable to shipping
on the performance on the replay side.
On a couple of orders of magnitude improvement.
And the reason is, people record for five minutes
and they replay for four, five hours.
So, cutting an hour out of those four was a bigger win
than cutting a minute out of five.
Very powerful query support.
We want to invest in that space heavily.
We want to do dynamic code analysis.
We have a trace that explains everything that happened,
so we want to use it to find leaks,
or to find double freeze, or user after free,
or this kind of stuff.
And then, we want to add kernel level tracing support
to these tools so you can do this on a driver
that you are developing for Windows or whatever
it happens to be.
If you want to know how we are going to do
some of these things, there's a clue at the bottom
right hand side corner of the slide.
And with that, I'm going to move to questions
that I didn't save.
Oh.
Do you remember the questions? - Jordi, yeah.
You forgot to save them again, didn't you?
- Yes. - Yeah.
Well, this happened a fair amount while we were practicing,
so I didn't tell you this, but I secretly
recorded the note pass session when you were typing
the questions so that we have a hope of getting them
back, so may I? - Please.
- OK.
OK.
All right, I hope this works.
Let's see here.
OK.
Alt tab.
OK, so we'll stop debugging here.
And.
Open trace file.
From.
Documents.
Sorry.
Down here.
In documents.
TTD traces.
Ah, here we are.
So before I open this, I just wanted to point out briefly,
you can see that when we launched Notepad,
you might have noticed it took a little while to launch,
but the typing speed was pretty interactive
and it produced a file that's roughly 200 megabytes.
That's not so bad.
It probably contains on the order
of 200,000,000 instructions.
Something like that.
So let's open the file.
And.
So what I did was I made a little Javascript file
that's going to try to dig out these things, so.
Go into here.
And we'll open up the Javascript.
Great.
And let's give ourselves some room to see the code here.
So, this is a feature of WinDbg that you can actually
write Javascript and do imperative things like
executing instructions immediately,
or you can actually extend the data model
that we've been talking about.
So invoke script is like main in C++.
And debug logs, just like print app.
So the basic idea is Notepad uses a window edit control,
and the edit control holds a piece
of locally alloced memory.
And so, if Notepad wanted to load or save text
from what you're typing, it would have to be
able to get at that memory.
And in Windows, the way you get at that memory is
through a message, so I'll briefly show you,
you know, I have some code that finds the text handle.
Remembers what position it's at,
so it can restore to that point, when it's done.
And it finds the text locations where the text was updated
and it does some simple things like appending to a vector,
finding unique strings, and prints them out.
And in particular, wanted to show you that
you can do queries inside of Javascript, as well,
and it looks very similar to DX.
Here we start with the host current session.
And we go to TTD calls.
Look for send message.
And the edit handle message.
Which is parameter one.
Of send message.
And then we do another query to find the memory
and hopefully, if we all got it right,
bring our.
Bring that over.
And it says here are the unique lines
from the Notepad session.
This looks like the questions we asked at the beginning.
So there you go, Jordi. - Thank you.
(applauding)
So, before answering the questions,
let's go back to the slides for a moment.
And let's see what we learned on this.
Ken saves the day again.
The debugger is scriptable and now we did
C Sharp at the CPPCon and now we're doing JavaScript
at the CPPCon, so not only C++ is not the only
language we can use.
Now, going back to the questions.
How does this differ from RR?
Single core versus multicore.
Different characteristics on the performance overhead
during record.
Depends on the type of program, on the number of cores
you have on your machine, one will be better than
the other and the overhead that you see.
But the basic principles are the same.
On how this works.
Is this Windows only or which version of Windows?
Actually, this is on the WinDbg preview.
So you actually have to have an up to date version
of Windows 10, this will be available on the kits,
and some of the restrictions will be lifted,
but right now, it's an up to date version of Windows 10, OK?
And with that, now for real, let's open the floor
for questions and you know. - I don't think you ever said
how to perform a trace, unless I missed it.
- Since, let me show, and what I'm going to do
is use this moment to also show a case where
our overhead is significant so that you guys
see that sometimes, in Notepad you saw us type,
and that came fine, or even the debugger you saw use this.
So if I'm going to launch, actually, let me launch Notepad
this way, no, sorry, from here, launch executable,
click this checkbox, record process with time travel
debugging, and that will launch, in this case,
the debugger.
If I do something as uninteresting as this.
And click on the record, it should launch Notepad.
And now typing works normally, however, if I want
to show the open file dialogue, this will take a while.
And the reason is this dialogue does a ton of work.
This dialogue launches 14 threads, calculates icons
and thumbnails, it actually, you know,
it still usable but there is some overhead sometimes
and as we said, that's-- - Is there?
Does this require any compiler or language support?
- None.
This will require, in fact, it's one of the conditions
we had, since we support code that sometimes we haven't
returned at Microsoft, we don't have symbols for it,
or we don't have anything, this works just fine.
- [Man] Yeah, so that's just like the normal Notepad
that's like installed with Windows, it's no special
modifications to it.
- No special modification.
The only thing we did is we created a new start menu
shortcut to launch it under the recorder,
but that was it. - So you don't need
any symbols to do the recording?
I mean to do the playback, I guess.
- No, you need symbols if you want to debug with symbols,
but I have debug without symbols just fine with this.
And this is a normal machine with public symbols
of Microsoft, so we have the public symbols of Notepad.
Sorry, there was a question on the back, in the middle.
- [Man] Yeah, actually if people
could go to the microphones, that would be easiest,
because then we don't have to keep repeating questions.
- In fact, if you don't mind, I'm going to change
my glasses so I see you.
- [Man] What size of files are the traces?
Like how much resources does that take?
- So.
The size of the trace varies by the number of instructions
you execute.
The more instructions you execute, the better.
What we do is when we start recording,
we take a dump of oops.
A dump of the entire process,
so that is, and then on a steady record,
we do a bit per instruction, OK?
But it will vary.
The open file dialogue is actually disastrous for us.
But there are other places where we can do
much less than that.
But that's approximately the case.
The overhead is mainly recorders is some overhead,
replay will be faster than live debugging,
some cases not.
On the overhead of the recorder is memory intensive
right now, we have work on the record side, as we said,
to improve that significantly.
Thank you.
- [Man] OK.
You mentioned that you can't modify memory
on the replay.
Can you still execute code?
So can I pull a function that doesn't load my memory?
- [James] No.
(laughing)
- That was a fast answer.
- [Man] I have actually two questions.
One of them, can I attach the recorder while the process
is executing? - So, a few notes.
You can attach to anything that you can debug,
pretty much, there a couple of corner cases,
but in general, if you can debug it, you can attach
to it, or you can launch it.
You can attach to UW processes.
You cannot launch them under the traces just yet.
The other thing to note is when you stop,
we terminate the process.
OK, that's relevant, we have a couple of other
work items to remove that, but that's--
- [Man] Thank you.
And in general, I also had a question,
what are the limitations so that we know
what to not expect?
- If you cannot debug the process,
we have actually done work to avoid us recording
things that you cannot debug.
That means sometimes if you have anti-virus software
on your machine, that logs process right across,
we will not work, you will have to disable anti-virus
software, if that makes any sense.
- And also, I would like to add.
If you're using public symbols with queries right now,
it doesn't resolve the names, it'll just say
unknown symbols, so it's another thing we need to fix.
If, you know, your own code, where you have full
PDB information, it's no problem,
but like, pulling out kernel bases named there
today requires private symbols, but we'll fix that soon.
- [Man] Awesome, thank you.
- [Man] You showed the time travel on launch of application,
I want to know what the impact is when you run it on
huge scale, in application.
Like in my case, there's an application that sometimes
runs for 10 minutes or an hour before it has problems,
and takes 10 to 20 gigabytes of memory.
What is the most efficient way to work with it?
And how, what about size, performance?
- That will be a long question to answer.
We'll be at the table outside to answer it.
I have recorded this for hours, but the traces come back
really, really, really big.
Just to set expectations.
We use it, for us, on chain servers,
which are 64 gigs of ram, fully packed,
and it works, but the overhead is measurable.
You will see it with the naked eye.
The overhead, just to set expectations.
Since we still have time, but if not,
if you see people with these shirts,
you can ask them questions, as well.
Just putting up here other Microsoft docs
so that you guys know where they are.
Please. - I have two questions.
First of them, is it possible to record 32 seconds
of application?
Last time the application working before crash.
- You need to know how to start recording,
but for example, if you touch the debugger
and you have a particular event
and the debugger, then you work to suspend
the process and there are things to do there.
It's an area where we will making things easier for you.
- [Man] Situation or I start test agents
under debug and when test crashed sometimes I need
the last seconds to understand what's happened.
- You still need to start recorder yourself.
But you can start it.
Because you cannot touch, you can start the recorder
when you see fit. - Second question
was about administrative rights.
Is it possible to not use administrative rights
because test agents must be clean.
- Not at this moment in time.
We have been talking inside of Microsoft on how to remove
that requirement, but right now, we rely on
a couple of APIs that are admin rights.
We are working through that. - OK, thanks.
- [Man] Hi there.
Are there commands to integrate to Visual Studio?
- So we are working with the Visual Studio team
on the analyzing the product and how it would integrate.
If you have, if you would like to give them feedback,
e-mail TTDVS@Microsoft.com and give them
your thoughts, and they will be happy to hear about that.
To be clear, they expect someone to e-mail them
if you want to do that, so please do.
- [Man] Visual Studio.
If you want to see this happen,
and you send mail, that increases the chance
of that happening.
- Sorry, for those. - If people want this,
there's a lot more leverage than just an idea
from the likes of the TTD team.
(laughing)
- For those that did not hear the person without
the microphone, if you send an e-mail,
it will really increase the chances of this happening,
so please do if you want to see this happen.
- Thank you. - Thank you, Andrew.
- Hello, once-- - Sorry, we're out
of time, I'm happy to answer questions, but just to be
clear, we're out of time.