Last time we decided to use gradient descent to train our Neural Network
so it could make better predictions of your score on a test
based on how many hours you slept and how many hours you studied the night before.
To perform gradient descent we need an equation and some code for our gradient ∂J/∂W.
Our weights W are spread across two matrices.
W(1) and W(2)
We'll separate our ∂J/∂W computation in the same way, by computing ∂J/∂W(1) and ∂J/∂W(2)
We should have just as many gradient values as weight values, so when we're done our matrices
∂J/∂W(1) and ∂J/∂W(2) will be the same size as W(1) and W(2)
Let's work on ∂J/∂W(2) first
The sum in our cost function adds the error from each example to create an overall cost.
We'll take advantage of the sum rule in differentiation which says that the derivative of the sums equals the sum of the derivatives.
We can move our sigma (Σ) outside and just worry about the derivative of the inside expression first.
To keep things simple, we'll temporarily forget about our summation.
Once we've computed ∂J/∂W for a single example,
we'll go back and add all our individual derivative terms together
We can now evaluate our first derivative. The power rule tells us to bring down our exponent 2 and multiply.
To finish our derivative we need to apply the chain rule.
The chain rule tells us how to take the derivative of a function inside of a function
and generally says that we take the derivative of the outside function and multiply it by the derivative of the inside function
One way to express the chain rule is as the product of derivatives this will come in very handy as we progress through backpropagation
In fact a better name for backpropagation might be: Don't stop doing the chain rule, ever.
We've taken the derivative of the outside of our cost function. Now we need to multiply it by the derivative of the inside.
y is just our test scores which won't change so the derivative of y, a constant, with respect to W is zero
y-hat (ŷ), on the other hand, does change with respect to W(2)
So we'll apply the chain rule and multiply our results by minus ∂ŷ/∂W(2)
We now need to think about the derivative of ŷ with respect to W(2)
Equation (4) tells us that ŷ is our activation function of z(3) so it will be helpful to apply the chain rule again
to break ∂ŷ/∂W(2) into
∂ŷ/∂z(3) times ∂z(3)/∂W(2)
to find the rate of change of ŷ with respect to z(3) we need to differentiate our sigmoid activation function
with respect to z
Now is a good time to add a new Python method for our derivative of our sigmoid function, sigmoidPrime
Our derivative should be largest where sigmoid function is the steepest, at the value z=0
we can now replace ∂ŷ/∂z(3) with f-prime of z(3)
Our final piece of the puzzle is ∂z(3)/∂W(2)
This term represents the change of z, our third layer activity, with respect to the weights in the second layer.
z(3) is the matrix product of our activities a(2) and our weights W(2).
The activities from layer 2 are multiplied by their corresponding weights and added together to yield z(3)
If we focus on a single synapse for a moment, we see a simple linear relationship between W and z where a is the slope.
So for each synapse, ∂z/∂W(2) is just the activation a, on that synapse
Another way to think about what the calculus is doing here is that it is backpropagating the error to each weight.
By multiplying by the activity on each synapse, the weights that contribute more to the overall error will have larger activations,
yield larger ∂J/∂W(2) values,
and will be changed more when we perform gradient descent.
We need to be careful with our dimensionality here, and if we're clever, we can take care of that summation we got rid of earlier.
The first part of our equation, y-ŷ, is of the same dimension of their output data, 3x1.
f-prime of z 3 is of the same size and our first operation is a scalar multiplication.
Our resulting 3x1 matrix is referred to as the backpropagating error, δ(3)
We determined that ∂z(3)/∂W(2) is equal to the activity of each synapse.
Each value in δ(3) needs to be multiplied by each activity.
We can achieve this by transposing a(2) and matrix multiplying by δ(3)
What's cool here Is that the matrix multiplication also takes care of our earlier omission.
It adds up the ∂J/∂W terms across all our examples
Another way to think about what's happening here is that each example our algorithm sees has a certain cost and a certain gradient.
The gradient with respect to each example pulls our gradient descent algorithm in a certain direction.
It's like every example gets a vote on which way is downhill and when we perform batch gradient descent,
we just add together everyone's vote, call it downhill, and move in that direction.
We'll code up our gradients in Python, in a new method, costFunctionPrime
Numpy's .multiply() method performs element-wise multiplication and the .dot() method performs matrix multiplication
We now have one final term to compute, ∂J/∂W(1)
The derivation begins the same way as before, by computing the derivative through our final layer,
first ∂J/∂ŷ, then ∂ŷ/∂z(3).
We now take the derivative across our synapses,
which is a little different from our job last time, which was computing the derivative with respect to the weights on our synapses.
There's still a nice linear relationship along each synapse,
but now we're interested in the rate of change of z(3) with respect to a(2)
Now the slope is just equal to the weight value for that synapse.
We can achieve this mathematically by multiplying by W(2) transposed
Our next term to work on is ∂a(2)/∂z(2).
This step is just like the derivative across our layer 3 neurons, so we can just multiply by f-prime of z(2).
Our final computation here is ∂z(2)/∂W(1)
This is very similar to our ∂z(3)/∂W(2) computation. There is a simple linear
relationship on the synapses between z and W(1). In this case though, the slope is the input value, x.
We can use the same technique as last time and multiplied by x transposed,
effectively applying the derivative and adding our ∂J/∂W(1)s together
across all our examples. All that's left is to code this equation in Python.
What's cool here is that if we want to make a deeper neural network, we could just stack a bunch of these operations together.
So how should we change our Ws to decrease our cost?
We can now compute ∂J/∂W, which tells us which way is uphill in our 9-dimensional optimization space.
If we move this way by adding a scalar times our derivative to all of our weights,
our cost will increase. And if we do the opposite,
subtract our gradient from our weights, we will move downhill and reduce our cost.
This simple step downhill is the core of gradient descent
and a key part of how even very sophisticated learning algorithms are trained.
Next time we'll perform numerical gradient checking to make sure our math is correct.