- [Voiceover] Hello everyone.

So this is what I might call a more optional video.

In the last couple videos, I talked about

this multivariable chain rule,

and I gave some justification.

And it might have been considered

a little bit hand-wavy by some.

I was doing a lot of things

that looked kind of like taking a derivative

with respect to t, and then multiplying that

by an infinitesimal quantity, dt,

and thinking of canceling those out.

And some people might say, "Ah!

"But this isn't really a fraction.

"That's a derivative, that's a differential operator,

"and you're treating it incorrectly."

And while that's true,

the intuition underlying a lot of this

actually matches with the formal argument pretty well.

So what I wanna do here is just talk about

what the formal argument behind

the multivariable chain rule is,

and just to remind ourselves of the setup of where we are.

You're thinking of v as a vector-valued function,

so this is something that takes as an input, t,

that lives on a number line,

and then v maps this to some kind of high-dimensional space.

In the simplest case, you might just think of that

as a two-dimensional space,

maybe it's three-dimensional space.

Or it could be 100-dimensional.

You don't have to literally be visualizing it.

And then f, our function f,

somehow takes that 100-dimensional space,

or two-dimensional, or three-dimensional,

whatever it is,

and then maps it onto the number line.

So the overall effect of the composition function

is to just take a real number to a real number

so it's a single-variable function.

So that's where we're taking this ordinary derivative,

rather than a partial derivative, or gradient,

or anything like that.

But because it goes through a multi-dimensional space,

and you have this intermediary, multivariable nature to it,

that's why you have a gradient,

and a vector-valued derivative.
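In symbols, the setup just described can be sketched as follows. The letter n is notation added here for the dimension of the intermediate space (2, 3, 100, or whatever it happens to be).

```latex
\[
\mathbb{R} \xrightarrow{\ \vec{v}\ } \mathbb{R}^n \xrightarrow{\ f\ } \mathbb{R},
\qquad
t \;\mapsto\; \vec{v}(t) \;\mapsto\; f(\vec{v}(t))
\]
```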

With the formal argument, the first thing you might do

is just write out the formal definition of a derivative.

And in this case,

it's a limit.

Definitions of derivatives are always gonna be

some kind of limit as a variable goes to zero.

And here, you're loosely thinking about h as being dt.

And you could write delta t,

but it's common to use h just because that can be used

for whatever your differential quantity is.

So that's on the denominator,

'cause you're thinking of it as dt.

And the top is whatever the change

to this whole function is

when you nudge that input by h.

And what I mean by that is you'll take f of v,

not of t, but of t plus h,

that kind of nudged output value,

and you're wondering how different that is

from f of v of t, the original value, v of t.

So this is what happens when you just apply

the formal definition of the derivative,

the ordinary derivative, to your composition function.
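Written out, the limit being described is the ordinary difference quotient applied to the composition:

```latex
\[
\frac{d}{dt} f(\vec{v}(t)) = \lim_{h \to 0} \frac{f(\vec{v}(t+h)) - f(\vec{v}(t))}{h}
\]
```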

And now, what do you do

as you're trying to reason about what this should equal?

And a good place to start, actually,

is to look back to the intuition that I was giving

for the multivariable chain rule in the first place.

You imagine nudging your input by some dt,

some tiny change, and I was saying,

oh, so that causes a change

in the intermediary space

of some kind of, you know,

you could call it dv, a change in the vector.

And the way that you're thinking about that

is that you take the vector-valued derivative

and multiply it by dt; the derivative is the proportionality constant

between the size of your nudge and the resulting vector.

And loosely, you might imagine those dt's crossing out

as if they were fractions.

It doesn't really matter.

And then you ask, "What change

"does this nudge by dv cause for f?"

And by definition, the resulting nudge

to the output space of f

is the directional derivative in the direction

of whatever your vector nudge is of the function f.
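As a loose summary of that intuition (differential notation, not yet a formal statement), the two nudges can be written as:

```latex
\[
d\vec{v} = \frac{d\vec{v}}{dt}\,dt,
\qquad
df = \nabla_{d\vec{v}} f
\]
```

Here the second expression denotes the directional derivative of f along the nudge dv.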

So this is the loose intuition,

and where does that carry over to formality?

You say, "Well, in this intermediary space,

"we have to deal with the vector-valued derivative of v."

So it might be a good thing to just

write down that definition, right?

Write down the fact

that the definition for the vector-valued derivative of v,

again, it looks almost identical.

All these derivative definitions

really do look kind of the same

'cause what you're doing is you're taking the limit

as h goes to zero,

h we're still thinking of as being dt.

So that kind of sits on the bottom.

But here you're just wondering how your vector changes.

And the difference, even though we're kind of writing this

the same way, and it looks almost identical notationally,

what's on the numerator here, this v of t plus h,

and this v of t, these are vectors.

So this is kind of a vector minus a vector.

When you take the limit,

you're getting a limiting vector,

something in your high-dimensional space.

It's not just a number.
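For reference, that definition reads as follows, where the numerator is a difference of vectors and so the limit is a vector as well:

```latex
\[
\frac{d\vec{v}}{dt}(t) = \lim_{h \to 0} \frac{\vec{v}(t+h) - \vec{v}(t)}{h}
\]
```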

And now, another way to write this,

one that's more helpful,

more conducive to manipulation,

is to say not that it equals the limit of this value,

and I'm gonna go ahead and just copy this value here,

kind of down here, and say,

the value of our derivative

actually equals this, subject to some kind of error,

which I'll just write as E of h,

like an error function of h.

And what you should be thinking is that

that error function goes to zero as h goes to zero.
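The same statement with the error term made explicit might be written like this. The arrow on E is notation added here to stress that the error is itself a vector.

```latex
\[
\frac{d\vec{v}}{dt}(t) = \frac{\vec{v}(t+h) - \vec{v}(t)}{h} + \vec{E}(h),
\qquad
\lim_{h \to 0} \vec{E}(h) = \vec{0}
\]
```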

This is just writing things so that we're able

to manipulate it a little bit more easily.

So I'll give ourselves some room here.

And what you can do with this

is multiply all sides by h.

So this is our vector-valued derivative,

just rewriting it.

Multiply it by h.

And you're thinking of this h as a dt,

so maybe in the back of your mind,

you're kind of thinking of canceling this dt with the h.

And what it equals is this top, this numerator here,

which was v of t plus h

minus v of t.

And in the back of your mind,

you might be thinking, this whole thing represents

dv, a change in v.

So the idea of canceling out that dt with the h

really does kind of come through here.

But the difference between the more

hand-waving argument before of canceling those out

and what we're doing here

is now we're accounting for that error function.

In this case it's now multiplied by h

'cause everything was multiplied by h.

And there's actually another way that I'm gonna write this.

There's a very useful convention in analysis

where I'll take something like this

and instead I'll write it

as little o of h.

And this isn't literally a function.

It's just a stand-in to say whatever this is,

whatever function that represents,

it satisfies the property that when we take that function

and divide it by h,

that will go to zero as h goes to zero, right?

Which is true here because you imagine taking this

and dividing by h, and that would be,

this h cancels out and you just have your error function

is gonna go to zero.
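The two steps just described, multiplying through by h and then renaming the error term, can be sketched as:

```latex
\[
h\,\frac{d\vec{v}}{dt}(t) = \vec{v}(t+h) - \vec{v}(t) + h\,\vec{E}(h)
\]
\[
h\,\vec{E}(h) = o(h)
\quad\text{because}\quad
\lim_{h \to 0} \frac{h\,\vec{E}(h)}{h} = \lim_{h \to 0} \vec{E}(h) = \vec{0}
\]
```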

So now what I do is I use this entire expression

to write this v of t plus h.

And the reason I wanna do that

if we kind of scroll back up

is because we see v of t plus h showing up

in the original definition we care about.

So this is just a way of starting to get a grip on that

a little bit more firmly.

So what I'd write, I'd say that that v of t plus h,

v of t plus h, that nudged output value,

is equal to the original value that I have, v of t

plus, and it's gonna be plus this derivative term,

and you can kind of think that it's almost

like a Taylor polynomial,

where this is our first order term.

We're evaluating it at whatever that t is,

but we're multiplying it by the value of that nudge,

that linear term.

And then the rest of the stuff is just some little o of h.

And maybe you'd say, "Shouldn't you be subtracting

"off that little o of h?"

Well, it's not an actual function.

It just represents anything that shrinks faster than h, so the sign doesn't matter.

And maybe I should say it's the absolute value,

like the magnitude, 'cause in this case,

this is a vector-valued quantity.

You know, that error is a vector.

So it's the size of that vector

divided by the size of h goes to zero.

So this is the main tool that we're gonna end up using.

This is the way to represent

v of t plus h.
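Solving for the nudged value gives the first-order expansion described above. The sign in front of the error term is absorbed into the little-o notation, and the condition on the right is the magnitude version of "shrinks faster than h":

```latex
\[
\vec{v}(t+h) = \vec{v}(t) + h\,\frac{d\vec{v}}{dt}(t) + o(h),
\qquad
\lim_{h \to 0} \frac{\lVert o(h) \rVert}{\lvert h \rvert} = 0
\]
```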

And now if we go back up to the original definition

of the derivative of the composition,

and I'll go ahead and copy that,

go ahead and copy that guy.

Little bit of debris.

So copy that original definition

for the ordinary derivative of the composition function,

and now when I write things out according

to all the manipulations that we just did,

this is really, it's still a limit,

as h goes to zero,

but what we put on the inside here

is it's f of,

now instead of writing v of t plus h,

I'm gonna use everything that I did up there.

It's the value of v of t

plus the derivative

at our point times h.

So again, it's kind of like a Taylor polynomial.

This is your linear term,

and then it's plus something that we don't care about,

something that's gonna get really small

as h gets small,

and really small in comparison to h, more importantly.

And from that you subtract off

f of v of t.

Kind of running off the edge.

I always keep running off the edge.

And all of that is divided by h.
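With the expansion of v(t + h) substituted in, the limit for the composition becomes:

```latex
\[
\frac{d}{dt} f(\vec{v}(t)) = \lim_{h \to 0}
\frac{f\!\left(\vec{v}(t) + h\,\frac{d\vec{v}}{dt}(t) + o(h)\right) - f(\vec{v}(t))}{h}
\]
```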

Now, the point here is

when you look at this limit, because we're taking it

as h goes to zero,

we'll basically be able to ignore this o of h component

because as h goes to zero,

this gets very, very small in comparison to h.

So everything that's on the inside here

is basically just the v of t

plus this vector value, right?

And this is h times some kind of vector.
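One way to make "ignore the o(h) term" precise is sketched below. It assumes that f is differentiable at the point v(t), which the argument here leaves implicit. The shorthand a for v(t) and w for the derivative of v at t is introduced only for this sketch.

```latex
% Assumption: f is differentiable at \vec{a} = \vec{v}(t); \vec{w} = \vec{v}'(t).
\[
f\bigl(\vec{a} + h\,\vec{w} + o(h)\bigr) - f\bigl(\vec{a} + h\,\vec{w}\bigr) = o(h)
\]
\[
\lim_{h \to 0} \frac{f\bigl(\vec{a} + h\,\vec{w} + o(h)\bigr) - f(\vec{a})}{h}
= \lim_{h \to 0} \frac{f\bigl(\vec{a} + h\,\vec{w}\bigr) - f(\vec{a})}{h}
\]
```

The first line says that perturbing the input by something much smaller than h changes the output by something much smaller than h, so after dividing by h that contribution vanishes in the limit.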

But if you think back, I made a video

on the formal definition of the directional derivative.

And if you remember, or if you kind of go back

and take a look now, this is exactly the formal definition

of the directional derivative.

We're taking h to go to zero,

the thing we're multiplying it by

is a certain vector quantity.

That vector is the nudge to your original value,

and then we're dividing everything by h.

So by definition, this entire thing

is the directional derivative in the direction of

the derivative of v with respect to t.

I'm writing v prime t instead of getting the whole

dv, dt down there.

All of that of f evaluated at where?

Well, the place that we're starting

is just v of t, so that's v of t.

And that's it, that's the answer.

'Cause when you evaluate the directional derivative,

the way that you do that, you take the gradient of f,

evaluate it at whatever point you're starting at,

in this case it's the output of v of t,

and you take the dot product between that

and the vector-valued derivative.

Well, I mean (chuckles),

the dot product between that and whatever your vector is,

which, in this case, is the vector-valued derivative

of v, and that's the multivariable chain rule.
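Putting the last two steps together, the chain of equalities is:

```latex
\[
\frac{d}{dt} f(\vec{v}(t))
= \nabla_{\vec{v}'(t)} f(\vec{v}(t))
= \nabla f(\vec{v}(t)) \cdot \vec{v}'(t)
\]
```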

And if you look back through the line of reasoning,

it all really did match

the thoughts of kind of nudging, nudging,

and seeing how that nudged, right?

Because the reason we thought to use

the vector-valued derivative

was because of that intuition.

And the reason for all the manipulation that I did

is just because I wanted to be able to express

what a nudge to the input of v looks like.

And what that looks like is the original value

plus a certain vector here.

This was the resulting nudge in the intermediary space.

I wanted to express that in a formal way.

And sure, we have this kind of o of h term

that expresses something that shrinks really fast,

but once you express it like that,

you just end up plopping out

the definition of the directional derivative.

So I hope that gives kind of a satisfying reason

for those of you who are a little bit more rigor-inclined

for why the multivariable chain rule works.
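As a quick numerical sanity check of the result, here is a minimal Python sketch. The particular choices v(t) = (cos t, sin t) and f(x, y) = x²y are assumptions made for illustration only and do not come from the video. The sketch compares the difference quotient from the formal definition with the gradient dotted with the derivative of v.

```python
import math

# Hypothetical example functions, chosen only for illustration.
def v(t):
    """Vector-valued function v: R -> R^2."""
    return (math.cos(t), math.sin(t))

def v_prime(t):
    """Derivative of v, computed by hand."""
    return (-math.sin(t), math.cos(t))

def f(x, y):
    """Scalar-valued function f: R^2 -> R."""
    return x * x * y

def grad_f(x, y):
    """Gradient of f, computed by hand."""
    return (2 * x * y, x * x)

def difference_quotient(t, h):
    """(f(v(t + h)) - f(v(t))) / h, the quantity inside the limit."""
    return (f(*v(t + h)) - f(*v(t))) / h

def chain_rule(t):
    """Gradient of f at v(t), dotted with v'(t)."""
    gx, gy = grad_f(*v(t))
    dx, dy = v_prime(t)
    return gx * dx + gy * dy

if __name__ == "__main__":
    t = 0.7
    for h in (1e-1, 1e-3, 1e-5):
        print(h, difference_quotient(t, h), chain_rule(t))
```

As h shrinks, the printed difference quotient should approach the chain-rule value, which is what the limit argument predicts.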

I should also maybe mention there's a more general

multivariable chain rule for vector-valued functions.

I'll get to that at another point

when I talk about the connections

between multivariable calculus and linear algebra.

But for now, that's pretty much all you need to know

on the multivariable chain rule

when the ultimate composition is,

you know, just a real number to a real number.

And I'll see you next video.