When I first learned about Taylor series, I definitely didn’t appreciate how important

they are. But time and time again they come up in math,

physics, and many fields of engineering because they’re one of the most powerful tools that

math has to offer for approximating functions.

One of the first times this clicked for me as a student was not in a calculus class,

but in a physics class. We were studying some problem that had to

do with the potential energy of a pendulum, and for that you need an expression for how

high the weight of the pendulum is above its lowest point, which works out to be proportional

to one minus the cosine of the angle between the pendulum and the vertical.

The specifics of the problem we were trying to solve are beyond the point here, but I’ll

just say that this cosine function made the problem awkward and unwieldy.

But by approximating cos(theta) as 1 - theta2/2, of all things, everything fell into place

much more easily. If you’ve never seen anything like this

before, an approximation like that might seem completely out of left field.

If you graph cos(theta) along with this function 1 - theta2/2, they do seem rather close to

each other for small angles near 0, but how would you even think to make this approximation?

And how would you find this particular quadratic? The study of Taylor series is largely about

taking non-polynomial functions, and finding polynomials that approximate them near some

input. The motive is that polynomials tend to be

much easier to deal with than other functions: They’re easier to compute, easier to take

derivatives, easier to integrate...they’re just all around friendly.

So let’s look at the function cos(x), and take a moment to think about how you might

find a quadratic approximation near x = 0. That is, among all the polynomials that look

c0 + c1x + c2x2 for some choice of the constants c0, c1 and c2, find the one that most resembles

cos(x) near x=0; whose graph kind of spoons with the graph of cos(x) at that point.

Well, first of all, at the input 0 the value of cos(x) is 1, so if our approximation is

going to be any good at all, it should also equal 1 when you plug in 0. Plugging in 0

just results in whatever c0 is, so we can set that equal to 1.

This leaves us free to choose constant c1 and c2 to make this approximation as good

as we can, but nothing we do to them will change the fact that the polynomial equals

1 at x=0. It would also be good if our approximation

had the same tangent slope as as cos(x) at this point of interest. Otherwise, the approximation

drifts away from the cosine graph even fro value of x very close to 0.

The derivative of cos(x) is -sin(x), and at x=0 that equals 0, meaning its tangent line

is flat. Working out the derivative of our quadratic,

you get c1 + 2c2x. At x=0 that equals whatever we choose for c1. So this constant c1 controls

the derivative of our approximation around x=0. Setting it equal to 0 ensures that our

approximation has the same derivative as cos(x), and hence the same tangent slope.

This leaves us free to change c2, but the value and slope of our polynomial at x=0 are

locked in place to match that of cos(x).

The cosine graph curves downward above x=0, it has a negative second derivative. Or in

other words, even though the rate of change is 0 at that point, the rate of change itself

is decreasing around that point. Specifically, since its derivative is -sin(x)

its second derivative is -cos(x), so at x=0 its second derivative is -1.

In the same way that we wanted the derivative of our approximation to match that of cosine,

so that their values wouldn’t drift apart needlessly quickly, making sure that their

second derivatives match will ensure that they curve at the same rate; that the slope

of our polynomial doesn’t drift away from the slope of cos(x) any more quickly than

it needs to. Pulling out that same derivative we had before,

then taking its derivative, we see that the second derivative of this polynomial is exactly

2c2, so to make sure this second derivative also equals -1 at x=0, 2c2 must equal -1,

meaning c2 itself has to be -½. This gives us the approximation 1 + 0x - ½

x2.

To get a feel for how good this is, if you estimated cos(0.1) with this polynomial, you’d

get 0.995. And this is the true value of cos(0.1). It’s a really good approximation.

Take a moment to reflect on what just happened. You had three degrees of freedom with a quadratic

approximation, the constants c0, c1, and c2. c0 was responsible for making sure that the

output of the approximation matches that of cos(x) at x=0, c1 was in charge of making

sure the derivatives match at that point, and c2 was responsible for making sure the

second derivatives match up. This ensures that the way your approximation

changes as you move away from x=0, and the way that the rate of change itself changes,

is as similar as possible to behavior of cos(x), given the amount of control you have.

You could give yourself more control by allowing more terms in your polynomial, and matching

higher order derivatives of cos(x). For example, add on the term c3x3 for some

constant c3. If you take the third derivative of a cubic

polynomial, anything quadratic or smaller goes to 0.

As for that last term, after three iterations of the power rule it looks like 1*2*3*c3.

On the other hand, the third derivative of cos(x) is sin(x), which equals 0 at x=0, so

to make the third derivatives match, the constant c3 should be 0.

In other words, not only is 1 - ½ x2 the best possible quadratic approximation of cos(x)

around x=0, it’s also the best possible cubic approximation.

You can actually make an improvement by adding a fourth order term, c4x4. The fourth derivative

of cos(x) is itself, which equals 1 at x=0. And what’s the fourth derivative of our

polynomial with this new term? Well, when you keep applying the power rule over and

over, with those exponents all hopping down front, you end up with 1*2*3*4*c4, which is

24c4 So if we want this to match the fourth derivative

of cos(x), which is 1, c4 must be 1/24. And indeed, the polynomial 1 - ½ x2 + 1/24

x4, which looks like this, is a very close approximation for cos(x) around x = 0.

In any physics problem involving the cosine of some small angle, for example, predictions

would be almost unnoticeably different if you substituted this polynomial for cos(x).

Now, step back and notice a few things about this process.

First, factorial terms naturally come up in this process.

When you take n derivatives of xn, letting the power rule just keep cascading, what you’re

left with is 1*2*3 and on up to n. So you don’t simply set the coefficients

of the polynomial equal to whatever derivative value you want, you have to divide by the

appropriate factorial to cancel out this effect. For example, that x4 coefficient is the fourth

derivative of cosine, 1, divided by 4 factorial, 24.

The second thing to notice is that adding new terms, like this c4x4, doesn’t mess

up what old terms should be, and that’s important.

For example, the second derivative of this polynomial at x = 0 is still equal to 2 times

the second coefficient, even after introducing higher order terms to the polynomial.

And it’s because we’re plugging in x=0, so the second derivative of any higher order

terms, which all include an x, will wash away. The same goes for any other derivative, which

is why each derivative of a polynomial at x=0 is controlled by one and only one coefficient.

If instead you were approximating near an input other than 0, like x=pi, in order to

get the same effect you would have to write your polynomial in terms of powers of (x - pi),

or whatever input you’re looking at. This makes it look notably more complicated,

but all it’s doing is making the point pi look like 0, so that plugging in x = pi will

result in a lot of nice cancelation that leaves only one constant.

And finally, on a more philosophical level, notice how what we’re doing here is essentially

taking information about the higher order derivatives of a function at a single point,

and translating it into information about the value of that function near that point.

We can take as many derivatives of cos(x) as we want, it follows this nice cyclic pattern

cos(x), -sin(x), -cos(x), sin(x), and repeat. So the value of these derivative of x=0 have

the cyclic pattern 1, 0, -1, 0, and repeat. And knowing the values of all those higher-order

derivatives is a lot of information about cos(x), even though it only involved plugging

in a single input, x=0. That information is leveraged to get an approximation

around this input by creating a polynomial whose higher order derivatives, match up with

those of cos(x), following this same 1, 0, -1, 0 cyclic pattern.

To do that, make each coefficient of this polynomial follow this same pattern, but divide

each one by the appropriate factorial, like I mentioned before, so as to cancel out the

cascading effects of many power rule applications. The polynomials you get by stopping this process

at any point are called “Taylor polynomials” for cos(x) around the input x=0.

More generally, and hence more abstractly, if we were dealing with some function other

than cosine, you would compute its derivative, second derivative, and so on, getting as many

terms as you’d like, and you’d evaluate each one at x=0.

Then for your polynomial approximation, the coefficient of each xn term should be the

value of the nth derivative of the function at 0, divided by (n!).

This rather abstract formula is something you’ll likely see in any text or course

touching on Taylor polynomials. And when you see it, think to yourself that

the constant term ensures that the value of the polynomial matches that of f(x) at x=0,

the next term ensures that the slope of the polynomial matches that of the function, the

next term ensure the rate at which that slope changes is the same, and so on, depending

on how many terms you want. The more terms you choose, the closer the

approximation, but the tradeoff is that your polynomial is more complicated.

And if you want to approximate near some input a other than 0, you write the polynomial in

terms of (x-a) instead, and evaluate all the derivatives of f at that input a.

This is what Taylor series look like in their fullest generality. Changing the value of

a changes where the approximation is hugging the original function; where its higher order

derivatives will be equal to those of the original function.

One of the simplest meaningful examples is ex, around the input x=0. Computing its derivatives

is nice, since the derivative of ex is itself, so its second derivative is also ex, as is

its third, and so on. So at the point x=0, these are all 1. This

means our polynomial approximation looks like 1 + x + ½ x2 + 1/(3!) x3 + 1/(4!) x4, and

so on, depending on how many terms you want. These are the Taylor polynomials

for ex.

In the spirit of showing you just how connected the topics of calculus are, let me turn to

a completely different way to understand this second order term geometrically. It’s related

to the fundamental theorem of calculus, which I talked about in chapters 1 and 8.

Like we did in those videos, consider a function that gives the area under some graph between

a fixed left point and a variable right point. What we’re going to do is think about how

to approximate this area function, not the function for the graph like we were doing

before. Focusing on that area is what will make the second order term pop out.

Remember, the fundamental theorem of calculus is that this graph itself represents the derivative

of the area function, and as a reminder it’s because a slight nudge dx to the right bound

on the area gives a new bit of area approximately equal to the height of the graph times dx,

in a way that’s increasingly accurate for smaller choice of dx.

So df over dx, the change in area divided by that nudge dx, approaches the height of

the graph as dx approaches 0. But if you wanted to be more accurate about

the change to the area given some change to x that isn’t mean to approach 0, you would

take into account this portion right here, which is approximately a triangle.

Let’s call the starting input a, and the nudged input above it x, so that this change

is (x-a). The base of that little triangle is that change

(x-a), and its height is the slope of the graph times (x-a). Since this graph is the

derivative of the area function, that slope is the second derivative of the area function,

evaluated at the input a. So the area of that triangle, ½ base times

height, is one half times the second derivative of the area function, evaluated at a, multiplied

by (x-a)2. And this is exactly what you see with Taylor

polynomials. If you knew the various derivative information about the area function at the

point a, you would approximate this area at x to be the area up to a, f(a), plus the area

of this rectangle, which is the first derivative times (x-a), plus the area of this triangle,

which is ½ (the second derivative) * (x - a)2. I like this, because even though it looks

a bit messy all written out, each term has a clear meaning you can point to on the diagram.

We could call it an end here, and you’d have you’d have a phenomenally useful tool

for approximations with these Taylor polynomials. But if you’re thinking like a mathematician,

one question you might ask is if it makes sense to never stop, and add up infinitely

many terms. In math, an infinite sum is called a “series”,

so even though one of the approximations with finitely many terms is called a “Taylor

polynomial” for your function, adding all infinitely many terms gives what’s called

a “Taylor series”. Now you have to be careful with the idea of

an infinite series, because it doesn’t actually make sense to add infinitely many things;

you can only hit the plus button on the calculator so many times.

But if you have a series where adding more and more terms gets you increasingly close

to some specific value, you say the series converges to that value. Or, if you’re comfortable

extending the definition of equality to include this kind of series convergence, you’d say

the series as a whole, this infinite sum, equals the value it converges to.

For example, look at the Taylor polynomials for ex, and plug in some input like x = 1.

As you add more and more polynomial terms, the total sum gets closer and closer to the

value e, so we say that the infinite series converges to the number e. Or, what’s saying

the same thing, that it equals the number e.

In fact, it turns out that if you plug in any other value of x, like x=2, and look at

the value of higher and higher order Taylor polynomials at this value, they will converge

towards ex, in this case e2. This is true for any input, no matter how

far away from 0 it is, even though these Taylor polynomials are constructed only from derivative

information gathered at the input 0. In a case like this, we say ex equals its

Taylor series at all inputs x, which is kind of a magical thing to have happen.

Although this is also true for some other important functions, like sine and cosine,

sometimes these series only converge within a certain range around the input whose derivative

information you’re using. If you work out the Taylor series for the

natural log of x around the input x = 1, which is built from evaluating the higher order

derivatives of ln(x) at x=1, this is what it looks like.

When you plug in an input between 0 and 2, adding more and more terms of this series

will indeed get you closer and closer to the natural log of that input.

But outside that range, even by just a bit, the series fails to approach anything.

As you add more and more terms the sum bounces back and forth wildly, it does not approaching

the natural log of that value, even though the natural log of x is perfectly well defined

for inputs above 2. In some sense, the derivative information

of ln(x) at x=1 doesn’t propagate out that far.

In a case like this, where adding more terms of the series doesn’t approach anything,

you say the series diverges. And that maximum distance between the input

you’re approximating near, and points where the outputs of these polynomials actually

do converge, is called the “radius of convergence” for the Taylor series.

There remains more to learn about Taylor series, their many use cases, tactics for placing

bounds on the error of these approximations, tests for understanding when these series

do and don’t converge. For that matter there remains more to learn

about calculus as a whole, and the countless topics not touched by this series.

The goal with these videos is to give you the fundamental intuitions that make you feel

confident and efficient learning more on your own, and potentially even rediscovering more

of the topic for yourself. In the case of Taylor series, the fundamental

intuition to keep in mind as you explore more is that they translate derivative information

at a single point to approximation information around that point.

The next series like this will be on probability, and if you want early access as those videos

are made, you know where to go.