Learning to See [Part 5: To Learn is to Generalize]

Last time, we decided our performance numbers were suspiciously good for our first hack at machine learning.
So, why can't we trust these numbers?
We can't trust them because our algorithm actually hasn't learned a single thing.
Instead, it's done something that is all too often substituted for learning: memorizing.
Memorizing looks like learning, but don't let it fool you. Memorizing is not the same thing as learning.
My favorite explainer, the physicist Richard Feynman, makes this point wonderfully in his quasi-autobiography, "Surely You're Joking, Mr. Feynman!"
Feynman tells the story of spending a year traveling and teaching physics in Brazil,
where he quickly realizes that his students have memorized an impressive number of physics facts but don't have any idea of how to apply them.
After further investigation, Feynman discovers that his students have, throughout their academic careers, been rewarded for memorizing.
At the end of the story, Feynman shares his findings in a lecture to his Brazilian counterparts,
where he dramatically chooses topics at random from the Brazilians' favorite textbook, and explains how, in Feynman's words,
"it's not science, but memorizing, in every circumstance."
So we suspect our algorithm is performing artificially well by memorizing, but how can we test our suspicion?
Just as a good teacher tests students on problems they haven't seen before, we should test our algorithm on examples *it* hasn't seen before.
Let's take two new hand images that our algorithm hasn't seen and measure its performance.
And our results are absolutely terrible.
We correctly identified only eleven percent of finger pixels,
and of the predictions we made, only 64% were correct.
In fact, our machine learning algorithm does worse than our knowledge engineering approach.
So what's going on here?
One possibility is that our learning algorithm just hasn't seen enough examples of fingers yet.
Let's test this theory by increasing the number of training examples in our data set by a factor of 10.
Instead of memorizing examples of fingers from three images, we'll use 30.
For our 30 images, we now have 2,160 examples of finger pixels, and our algorithm again performs very well on the data it's already seen.
However, when we test on new examples again, our performance returns to its previous dismal state.
Increasing our number of training examples by tenfold made almost no difference.
This brings us to arguably the most important point in machine learning: to learn is to generalize.
To ensure our algorithm is actually learning, we must evaluate its ability to generalize by testing it on data that it hasn't seen yet.
By splitting our data into training and testing portions and leaving our testing data alone until it's time to test performance, we can determine if our algorithm is actually learning.
So we can confidently say that our memorization approach is not the way to go. It doesn't generalize. It hasn't learned. But why?
Simply looking for matches to the examples of fingers we've already seen doesn't seem so outlandish. So what went so wrong here?
What is the difference between memorizing and actually learning?
To get to the bottom of this, we need to be a bit more quantitative.
In some ways, machine learning and artificial intelligence are about search.
We can better understand our machine learning process by understanding the landscape of possible rules we're searching through.
What does this landscape look like for our problem?
By allowing positive examples to be rules, we're saying that any nine-by-nine grid of ones and zeros could be a rule as long as it shows up in our data.
How big does this make our landscape?
How many possible rules are there?
To make things a little more tangible, let's say that our 413 positive examples from our first attempt at machine learning are these 413 blue grains of sand,
and our 3,090 negative examples are these 3,090 red grains of sand.
Now, say we have one grain of sand for each possible example,
so one grain of sand for each possible nine-by-nine grid of ones and zeros.
One possible example could look like this.
Sure, I've never seen a finger quite like that, but the idea here is to simply compute how many possible examples are out there.
Now, taking one grain of sand for each possible nine-by-nine grid of ones and zeros,
how much space do you think all these grains of sand would take up?
We'll discuss the answer next time.