Last time we decided that to learn from data, we need to make some assumptions. Specifically, before we begin learning, we need to assume that our solution is simple. This assumption has the all-important effect of limiting the number of rules we try out on our data, reducing the probability of our learning algorithm performing artificially well by chance.
Let's try out these new ideas on some real data.
So far we've tried a few different approaches, unsuccessfully, on our finger dataset.
Before we craft a new approach, let's consider two of our existing approaches through the lens of what we learned about computational learning theory last time.
We know that our memorization approach failed to generalize, and we now have a new tool to help us understand why.
Last time we considered a rule, g4, chosen from an enormous class of rules that covered all 2^16, or 65,536, possible truth table combinations. Since this class of rules covers any possible dataset we could ever see, it actually works in the same way as our memorization approach. One way to think of our memorization strategy is as choosing from a huge set of rules that contains a rule exactly matching any possible set of training data.
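To make that picture concrete, here is a minimal sketch of memorization as a lookup table. The function name, the tiny 4-pixel examples, and the default guess of "not a finger" are all hypothetical illustrations, not part of the original dataset.

```python
# A sketch of the memorization "strategy": store every training example
# in a lookup table keyed by its raw pixels. The data below is made up.
def memorize(training_examples):
    """training_examples: list of (pixels, label) pairs,
    where pixels is a tuple of 0s and 1s."""
    table = {pixels: label for pixels, label in training_examples}

    def rule(pixels):
        # Perfect on anything we've seen; a blind guess on anything new.
        return table.get(pixels, 0)
    return rule

train = [((1, 0, 1, 1), 1), ((0, 0, 0, 0), 0)]
g = memorize(train)
assert g((1, 0, 1, 1)) == 1   # matches the training data exactly
assert g((1, 1, 1, 1)) == 0   # unseen input: just a default guess
```

The table matches every training example perfectly, which is exactly why it tells us nothing about examples it hasn't seen.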
Of course, there is one big difference here. As we saw with our grains-of-sand question on our real 9x9 data, there are far more possible examples than the 16 of our toy dataset. To have one rule match each possible dataset, we need an absolutely astronomical number of rules, again guaranteeing that the rule we end up choosing will simply get lucky and fail to generalize, exactly as we saw on our toy dataset.
Now let's have a quick look at our baseline strategy. Sure, it's not really much of a strategy, but it does actually do one important thing surprisingly well.
Why is our baseline strategy so good at generalizing?
Last time we found strong statistical evidence that simple rules will generalize well because there are fewer simple rules to consider when learning.
In many ways, our baseline strategy is the ultimate simple rule. When choosing this rule, we really only considered two possible rules: either all examples are fingers, or all examples are not fingers. Only choosing between two rules means that the probability of our rule getting lucky is very low, so our rule should generalize well. And it does.
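The choice between those two rules can be sketched in a few lines. This is just an illustration of the idea, with made-up labels; it isn't the original code.

```python
# A sketch of the baseline strategy: choose between exactly two rules,
# "everything is a finger" or "nothing is a finger", keeping whichever
# makes fewer errors on the training labels. The labels below are made up.
def baseline(train_labels):
    errors_all_fingers = sum(1 for y in train_labels if y != 1)
    errors_no_fingers = sum(1 for y in train_labels if y != 0)
    guess = 1 if errors_all_fingers <= errors_no_fingers else 0
    return lambda pixels: guess  # ignores the input entirely

labels = [0, 0, 1, 0, 0]        # mostly non-fingers
rule = baseline(labels)
assert rule(None) == 0          # predicts "not a finger" for everything
```

Because the returned rule never even looks at the pixels, there is almost no way for it to get lucky on the training data.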
Now that we have a deeper understanding of where these approaches went wrong, let's consider how this understanding might help us develop a better strategy for learning to recognize fingers.
On one hand, we have our baseline approach which doesn't perform well on training data but does generalize well.
On the other hand, we have our memorization approach which performs flawlessly on our training data
but generalizes terribly.
Whatever strategy we adopt next, we would really like it to do both: we want good performance on our training data, and of course it must generalize.
This brings us to our third big point: there are not one but two goals in learning from data, training set performance and generalization, and more importantly, these goals are fundamentally opposed.
To improve training set performance, we must increase the expressiveness of our rules, meaning that we have more rules to try and a greater chance of our rule getting lucky and failing to generalize. To improve our rules' ability to generalize, we must reduce the probability of a single rule getting lucky by considering fewer, and necessarily less expressive, rules.
So our memorization and baseline strategies are not two separate bad attempts at learning, but two ends of a single spectrum.
By putting too much emphasis on either one of our two goals, training set performance or generalization, we end up with lopsided solutions that fail to learn.
Since good machine learning algorithms must both perform well on our training data and generalize, they must exist between these two extremes. If we want to find an algorithm that will actually learn, this is where we should look.
Each solution along our spectrum represents a specific trade-off between our two goals.
As we move to the left, we encounter less and less complex rules and better and better generalization. These types of solutions make more assumptions about our final result, specifically that it will be simple. This assumption is often called bias. More biased solutions make more assumptions, result in simpler rules, and are better generalizers.
As we move to the right, our rules become more complex and less biased; this complexity is often called variance. The variance of our solution increases as we move to the right, reaching a maximum with our memorization strategy. Since this strategy can model any possible dataset, it has the most complex set of rules possible and makes absolutely no assumptions about the form of our solution: no bias, but very high variance.
When building machine learning algorithms, it's critical that we pay attention to this trade-off between bias and variance. Choosing where to live on this spectrum is key if we're really going to learn from data.
If our solution is too biased, we risk underfitting: finding too simple a solution that doesn't actually fit the underlying patterns we're looking for. Alternatively, if our solution has too much variance, we risk overfitting: matching patterns by chance that don't actually exist in our data.
Now that we understand a bit more about what to pay attention to let's try a new machine learning strategy.
We know we want to live somewhere on our bias-variance spectrum, but where? Let's start on the high-bias, low-variance side and experiment. Our baseline strategy is clearly too simple; how can we make it a little more complex?
One option is to use our approach from last time and search for rules that use just one pixel while ignoring the rest.
This class of rules is probably still too simple, but we can crank up the variance later.
There are 162 different single-pixel rules for our 9x9 data: for each of the 81 pixels, a rule can call examples fingers when that pixel is a 1, or when it is a 0. We can quickly search through all of these with a Python loop and pick out the rule that makes the fewest errors on our training data.
Just as with our toy data, let's give every pixel a number, in this case from x0 to x80. Our winning rule classifies examples with a 1 in the x40 position as fingers.
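The search loop just described might look something like the sketch below. The helper name and the two toy examples are hypothetical; in the toy data I've constructed, pixel x40 happens to predict the label perfectly, so the search lands on the same kind of rule as above.

```python
# A sketch of the single-pixel search: for each of the 81 pixel positions
# x0..x80 we try two rules ("finger if pixel is 1" and "finger if pixel
# is 0"), giving 162 candidates, and keep the rule with the fewest
# training errors. The example data below is made up.
def best_single_pixel_rule(examples):
    """examples: list of (pixels, label), pixels a length-81 tuple of 0/1."""
    best = None  # (errors, pixel_index, target_value)
    for i in range(81):
        for target in (0, 1):
            errors = sum(1 for pixels, label in examples
                         if (1 if pixels[i] == target else 0) != label)
            if best is None or errors < best[0]:
                best = (errors, i, target)
    return best

# Hypothetical data where pixel x40 perfectly predicts the label:
examples = [(tuple(1 if j == 40 else 0 for j in range(81)), 1),
            (tuple(0 for j in range(81)), 0)]
errors, pixel, target = best_single_pixel_rule(examples)
assert (pixel, target, errors) == (40, 1, 0)
```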
As expected, the performance of our winning rule is not great, but we do achieve higher precision and recall than our baseline approach, and we're still generalizing well, which hopefully means we're headed in the right direction.
Next time we'll move further across our spectrum by increasing the number of pixels in our rules one by one. Now, how complex a rule do you think we'll need? How many of our 81 pixels are necessary to make a rule that reliably finds fingers in images?
Our recall and precision numbers on testing data have been pretty dismal so far. How many of our 81 pixels do you think we need to boost these numbers to, let's say, 65 percent? That is, we catch 65% or more of all fingers in our testing data, and 65% or more of our finger predictions are correct.
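Those two definitions translate directly into code. Here is a small sketch with made-up predictions and labels, just to pin down what the percentages measure.

```python
# Precision: of the examples we called fingers, what fraction really are?
# Recall: of the real fingers, what fraction did we catch?
# The predictions and labels below are made up for illustration.
def precision_recall(predictions, labels):
    true_pos = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    predicted_pos = sum(predictions)
    actual_pos = sum(labels)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

preds  = [1, 1, 0, 1, 0]
labels = [1, 0, 0, 1, 1]
p, r = precision_recall(preds, labels)
# 2 of our 3 finger predictions were correct, and we caught 2 of 3 fingers
assert abs(p - 2/3) < 1e-9 and abs(r - 2/3) < 1e-9
```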
Next time we'll figure out how complex of a rule we need to really start learning what fingers look like.