
# More formal treatment of multivariable chain rule

- [Voiceover] Hello everyone.
So this is what I might call a more optional video.
In the last couple videos, I talked about
this multivariable chain rule,
and I gave some justification.
And it might have been considered
a little bit hand-wavy by some.
I was doing a lot of things
that looked kind of like taking a derivative
with respect to t, and then multiplying that
by an infinitesimal quantity, dt,
and thinking of canceling those out.
And some people might say, "Ah!
"But this isn't really a fraction.
"That's a derivative, that's a differential operator,
"and you're treating it incorrectly."
And while that's true,
the intuitions underlying a lot of this
actually match the formal argument pretty well.
So what I wanna do here is just talk about
what the formal argument behind
the multivariable chain rule is,
and just to remind ourselves of the setup of where we are.
You're thinking of v as a vector-valued function,
so this is something that takes as an input, t,
that lives on a number line,
and then v maps this to some kind of high-dimensional space.
In the simplest case, you might just think of that
as a two-dimensional space,
maybe it's three-dimensional space.
Or it could be 100-dimensional.
You don't have to literally be visualizing it.
And then f, our function f,
somehow takes that 100-dimensional space,
or two-dimensional, or three-dimensional,
whatever it is,
and then maps it onto the number line.
So the overall effect of the composition function
is to just take a real number to a real number
so it's a single-variable function.
So that's where we're taking this ordinary derivative,
rather than a partial derivative, or gradient,
or anything like that.
But because it goes through a multi-dimensional space,
and you have this intermediary, multivariable nature to it,
that's why you have a gradient,
and a vector-valued derivative.
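
In symbols, the setup can be sketched like this. The dimension n is a label assumed here for illustration (2, 3, or 100 in the examples above); it isn't named in the video.

```latex
\[
\vec{v} : \mathbb{R} \to \mathbb{R}^n,
\qquad
f : \mathbb{R}^n \to \mathbb{R},
\qquad
f \circ \vec{v} : \mathbb{R} \to \mathbb{R},
\]
\[
\frac{d}{dt}\, f(\vec{v}(t)) = \nabla f(\vec{v}(t)) \cdot \vec{v}\,'(t).
\]
```

The second line is the multivariable chain rule that the rest of the argument is working toward.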
With the formal argument, the first thing you might do
is just write out the formal definition of a derivative.
And in this case,
it's a limit.
Definitions of derivatives are always gonna be
some kind of limit as a variable goes to zero.
And here, you're loosely thinking about h as being dt.
And you could write delta t,
but it's common to use h just because that can be used
for whatever your differential quantity is.
So that's on the denominator,
'cause you're thinking of it as dt.
And the top is whatever the change
to this whole function is
when you nudge that input by h.
And what I mean by that is you'll take f of v,
not of t, but of t plus h,
that kind of nudged output value,
and you're wondering how different that is
from f of v of t, the original value, v of t.
So this is what happens when you just apply
the formal definition of the derivative,
the ordinary derivative, to your composition function.
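
Written out, the limit being described is the following (using the same h as in the narration):

```latex
\[
\frac{d}{dt}\, f(\vec{v}(t))
= \lim_{h \to 0} \frac{f(\vec{v}(t+h)) - f(\vec{v}(t))}{h}
\]
```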
And now, what do you do
as you're trying to reason about what this should equal?
And a good place to start, actually,
is to look back to the intuition that I was giving
for the multivariable chain rule in the first place.
You imagine nudging your input by some dt,
some tiny change, and I was saying,
oh, so that causes a change
in the intermediary space
of some kind of, you know,
you could call it dv, a change in the vector.
And the way you're thinking about that
is that you take the vector-valued derivative
and multiply it by dt; it's the proportionality constant
between the size of your nudge and the resulting vector.
And loosely, you might imagine those dt's crossing out
as if they were fractions.
It doesn't really matter.
And then you ask, what change
does this dv cause for f?
And by definition, the resulting nudge
to the output space of f
is the directional derivative in the direction
of whatever your vector nudge is of the function f.
So this is the loose intuition,
and where does that carry over to formality?
You say, "Well, in this intermediary space,
"we have to deal with the vector-valued derivative of v."
So it might be a good thing to just
write down that definition, right?
Write down the fact
that the definition for the vector-valued derivative of v,
again, it looks almost identical.
All these derivative definitions
really do look kind of the same
'cause what you're doing is you're taking the limit
as h goes to zero,
h we're still thinking of as being dt.
So that kind of sits on the bottom.
But here you're just wondering how your vector changes.
And the difference, even though we're kind of writing this
the same way, and it looks almost identical notationally,
what's on the numerator here, this v of t plus h,
and this v of t, these are vectors.
So this is kind of a vector minus a vector.
When you take the limit,
you're getting a limiting vector,
something in your high-dimensional space.
It's not just a number.
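
In symbols, the definition being written down is:

```latex
\[
\vec{v}\,'(t) = \lim_{h \to 0} \frac{\vec{v}(t+h) - \vec{v}(t)}{h}
\]
```

Here the numerator is a difference of vectors and h is a scalar, so the limit is itself a vector in the intermediary space.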
And now, another way to write this,
one that's more helpful,
more conducive to manipulation,
is to say not that it equals the limit of this value,
and I'm gonna go ahead and just copy this value here,
kind of down here, and say,
the value of our derivative
actually equals this, subject to some kind of error,
which I'll just write as E of h,
like an error function of h.
And what you should be thinking is that
that error function goes to zero as h goes to zero.
This is just writing things so that we're able
to manipulate it a little bit more easily.
So I'll give ourselves some room here.
And what you can do with this
is multiply all sides by h.
So this is our vector-valued derivative,
just rewriting it.
Multiply it by h.
And you're thinking of this h as a dt,
so maybe in the back of your mind,
you're kind of thinking of canceling this dt with the h.
And what it equals is this top, this numerator here,
which was v of t plus h
minus v of t.
And in the back of your mind,
you might be thinking, this whole thing represents
dv, a change in v.
So the idea of canceling out that dt with the h
really does kind of come through here.
But the difference between the more
hand-waving argument before of canceling those out
and what we're doing here
is now we're accounting for that error function.
In this case it's now multiplied by h,
'cause everything was multiplied by h.
And there's actually another way that I'm gonna write this.
There's a very useful convention in analysis
where I'll take something like this
and instead I'll write it
as little o of h.
And this isn't literally a function.
It's just a stand-in to say whatever this is,
whatever function that represents,
it satisfies the property that when we take that function
and divide it by h,
that will go to zero as h goes to zero, right?
Which is true here because you imagine taking this
and dividing by h, and that would be,
this h cancels out and you just have your error function
is gonna go to zero.
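
The steps just described can be written out as follows. The symbol E(h) is the error function from the narration; writing it with a vector arrow is an assumption made here, since the error is a vector in this setting.

```latex
\[
\vec{v}\,'(t) = \frac{\vec{v}(t+h) - \vec{v}(t)}{h} + \vec{E}(h),
\qquad
\vec{E}(h) \to \vec{0} \ \text{as}\ h \to 0
\]
\[
h\,\vec{v}\,'(t) = \vec{v}(t+h) - \vec{v}(t) + h\,\vec{E}(h)
\]
\[
h\,\vec{E}(h) = o(h)
\quad\text{because}\quad
\lim_{h \to 0} \frac{\lVert h\,\vec{E}(h) \rVert}{\lvert h \rvert}
= \lim_{h \to 0} \lVert \vec{E}(h) \rVert = 0
\]
```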
So now what I do is I use this entire expression
to write this v of t plus h.
And the reason I wanna do that
if we kind of scroll back up
is because we see v of t plus h showing up
in the original definition we care about.
So this is just a way of starting to get a grip on that
a little bit more firmly.
So what I'd write, I'd say that that v of t plus h,
v of t plus h, that nudged output value,
is equal to the original value that I have, v of t
plus, and it's gonna be plus this derivative term,
and you can kind of think that it's almost
like a Taylor polynomial,
where this is our first order term.
We're evaluating it at whatever that t is,
but we're multiplying it by the value of that nudge,
that linear term.
And then the rest of the stuff is just some little o of h.
And maybe you'd say, "Shouldn't you be subtracting
"off that little o of h?"
And it's not an actual function.
It just represents anything that shrinks.
And maybe I should say it's the absolute value,
like the magnitude, 'cause in this case,
this is a vector-valued quantity.
You know, that error is a vector.
So it's the size of that vector
divided by the size of h goes to zero.
So this is the main tool that we're gonna end up using.
This is the way to represent
v of t plus h.
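
As a rough numerical sanity check of that expansion, here is a minimal Python sketch. The curve `v` (a unit circle in two dimensions) and the helper `remainder_ratio` are hypothetical examples chosen for illustration, not anything from the video. It measures the size of the leftover term divided by the size of h, which is the quantity that little o of h says should shrink to zero.

```python
import math

def v(t):
    # Hypothetical example curve in R^2 (an assumption, not from the video).
    return (math.cos(t), math.sin(t))

def v_prime(t):
    # Exact vector-valued derivative of the example curve.
    return (-math.sin(t), math.cos(t))

def remainder_ratio(t, h):
    # |v(t+h) - v(t) - h v'(t)| / |h|: the little-o remainder, relative to h.
    vt, vth, dv = v(t), v(t + h), v_prime(t)
    r = [vth[i] - vt[i] - h * dv[i] for i in range(2)]
    return math.hypot(*r) / abs(h)

# Shrink h by a factor of 10 each time and watch the ratio shrink too.
ratios = [remainder_ratio(1.0, 10.0 ** -k) for k in range(1, 5)]
for k, ratio in zip(range(1, 5), ratios):
    print(f"h = 1e-{k}: |remainder| / |h| = {ratio:.2e}")
```

For this particular curve the ratio should fall roughly in proportion to h, which is faster than the definition requires; all little o of h asks is that it goes to zero.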
And now if we go back up to the original definition
of the ordinary derivative we care about,
and I'll go ahead and copy that,
go ahead and copy that guy.
Little bit of debris.
So copy that original definition
for the ordinary derivative of the composition function,
and now when I write things in according
to all the manipulations that we just did,
this is really, it's still a limit,
'cause h goes to zero,
but what we put on the inside here
is it's f of,
now instead of writing v of t plus h,
I'm gonna use everything that I did up there.
It's the value of v of t
plus the derivative
at our point times h.
So again, it's kind of like a Taylor polynomial.
This is your linear term,
and then it's plus something that we don't care about,
something that's gonna get really small
as h gets small,
and really small in comparison to h, more importantly.
And from that you subtract off
f of v of t.
Kind of running off the edge.
I always keep running off the edge.
And all of that is divided by h.
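
With the first-order expression for v of t plus h substituted in, the limit on the board reads:

```latex
\[
\frac{d}{dt}\, f(\vec{v}(t))
= \lim_{h \to 0}
\frac{f\bigl(\vec{v}(t) + h\,\vec{v}\,'(t) + o(h)\bigr) - f(\vec{v}(t))}{h}
\]
```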
Now, the point here is
when you look at this limit, because we're taking it
as h goes to zero,
we'll basically be able to ignore this o of h component
because as h goes to zero,
this gets very, very small in comparison to h.
So everything that's on the inside here
is basically just the v of t
plus this vector value, right?
And this is h times some kind of vector.
But if you think back, I made a video
on the formal definition of the directional derivative.
And if you remembered, or if you kind of go back
and take a look now, this is exactly the formal definition
of the directional derivative.
We're taking h to go to zero,
the thing we're multiplying it by
is a certain vector quantity.
That vector is the nudge to your original value,
and then we're dividing everything by h.
So by definition, this entire thing
is the directional derivative in the direction of
the derivative of v at t.
I'm writing v prime t instead of getting the whole
dv, dt down there.
All of that of f evaluated at where?
Well, the place that we're starting
is just v of t, so that's v of t.
And that's it, that's the answer.
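
In symbols, the identification being made is sketched below. The subscript notation for the directional derivative is an assumption about how it appears on the board, and dropping the o(h) term relies on f being differentiable at v of t, which the narration takes for granted.

```latex
\[
\lim_{h \to 0}
\frac{f\bigl(\vec{v}(t) + h\,\vec{v}\,'(t)\bigr) - f(\vec{v}(t))}{h}
= \nabla_{\vec{v}\,'(t)} f(\vec{v}(t))
\]
```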
'Cause when you evaluate the directional derivative,
the way that you do that, you take the gradient of f,
evaluate it at whatever point you're starting at,
in this case it's the output of v of t,
and you take the dot product between that
and the vector-valued derivative.
Well, I mean (chuckles),
the dot product between that and whatever your vector is,
which, in this case, is the vector-valued derivative
of v, and that's the multivariable chain rule.
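
To see the result numerically, here is a small Python sketch comparing the difference quotient of the composition against the gradient dotted with the vector-valued derivative. The specific functions `f` and `v`, the point `t0`, and the step size are all hypothetical choices for illustration, not from the video.

```python
import math

def v(t):
    # Hypothetical curve in R^2 (an assumption for illustration).
    return (math.cos(t), t * t)

def v_prime(t):
    # Exact vector-valued derivative of v.
    return (-math.sin(t), 2.0 * t)

def f(x, y):
    # Hypothetical scalar-valued function on R^2.
    return x * x * y

def grad_f(x, y):
    # Exact gradient of f.
    return (2.0 * x * y, x * x)

def difference_quotient(t, h):
    # The formal definition, with a small but nonzero h.
    return (f(*v(t + h)) - f(*v(t))) / h

def chain_rule(t):
    # Gradient of f at v(t), dotted with v'(t).
    g = grad_f(*v(t))
    dv = v_prime(t)
    return g[0] * dv[0] + g[1] * dv[1]

t0 = 0.7
exact = chain_rule(t0)
approx = difference_quotient(t0, 1e-6)
print(f"chain rule: {exact:.8f}, difference quotient: {approx:.8f}")
```

The two numbers should agree to several decimal places, with the small gap coming from using a finite h rather than taking the limit.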
And if you look back through the line of reasoning,
it all really did match
the thoughts of kind of nudging, nudging,
and seeing how that nudged, right?
Because the reason we thought to use
the vector-valued derivative
was because of that intuition.
And the reason for all the manipulation that I did
is just because I wanted to be able to express
what a nudge to the input of v looks like.
And what that looks like is the original value
plus a certain vector here.
This was the resulting nudge in the intermediary space.
I wanted to express that in a formal way.
And sure, we have this kind of o of h term
that expresses something that shrinks really fast,
but once you express it like that,
you just end up plopping out
the definition of the directional derivative.
So I hope that gives kind of a satisfying reason
for those of you who are a little bit more rigor-inclined
for why the multivariable chain rule works.
I should also maybe mention there's a more general
multivariable chain rule for vector-valued functions.
I'll get to that at another point
when I talk about the connections
between multivariable calculus and linear algebra.
But for now, that's pretty much all you need to know
on the multivariable chain rule
when the ultimate composition is,
you know, just a real number to a real number.
And I'll see you next video.