Last time, we left off trying to figure out

how many possible rules our first machine learning approach allowed.

That is, how many possible nine-by-nine grids of ones and zeros exist?

Since every possible example could be a rule, and since there are two options for each value in our nine-by-nine grids,

that means we have a staggering two times two times two times two... eighty-one times, or 2^81 rules.

That's about 2.4 septillion possible rules.
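
If you'd like to check that count yourself, here is a minimal sketch in Python. The variable names are just illustrative; the only thing taken from the narration is that each of the 81 cells can be a zero or a one.

```python
# Each cell of a nine-by-nine grid is independently 0 or 1,
# so the number of distinct grids is 2 multiplied by itself 81 times.
cells = 9 * 9
num_grids = 2 ** cells

print(cells)                # 81
print(num_grids)            # exact integer value of 2^81
print(f"{num_grids:.1e}")   # the same value in scientific notation
```

Python integers have arbitrary precision, so the exact value comes out without any rounding.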

This many grains of sand is closest to answer choice D,

which is, like, a lot.

Despite the relatively simple mathematics, grasping concepts like these intuitively is tough.

The inherent variability in our tiny binary slice of the universe,

a nine-by-nine grid of zeros and ones, is staggering.

If we stored every possible example on a computer, this would require 25 yottabytes,

more storage than is available on all the computers humans have ever built combined.

Even if you could capture all possible examples, you couldn't store them.
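
Here is a rough sketch of where a storage figure like that can come from. It assumes the most compact naive encoding, one bit per grid cell, and the decimal definition of a yottabyte (10^24 bytes). Both are assumptions made for illustration, not details given in the narration.

```python
# Assumption: each example is stored as 81 bits, one per cell, with no overhead.
bits_per_example = 9 * 9
num_examples = 2 ** 81

total_bytes = num_examples * bits_per_example / 8

# Assumption: 1 yottabyte = 10^24 bytes (decimal definition).
yottabytes = total_bytes / 1e24

print(f"{yottabytes:.1f} yottabytes")
```

Under these assumptions the total lands just under 25 yottabytes. Any real storage format would add overhead on top of that.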

The vastness of our landscape of examples can help us understand why our memorization strategy performed so poorly.

We've seen about 76,000 total examples so far, and after removing redundancies,

we're left with 19,584 unique examples.

This sounds like a lot, but compared to the landscape of all possible examples, it's nothing.

Well, not nothing exactly. It's actually about 0.00000000000000000081%.
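
As a quick sanity check on that percentage, here is a small sketch that divides the number of unique examples we've seen by the number of possible grids. The variable names are illustrative only.

```python
# Figures taken from the narration: 19,584 unique examples, 2^81 possible grids.
unique_examples = 19_584
possible_examples = 2 ** 81

fraction_seen = unique_examples / possible_examples
percent_seen = fraction_seen * 100

print(f"{percent_seen:.1e} percent")
```

The result is on the order of 8 times ten to the minus nineteen percent, a decimal point followed by eighteen zeros before the first nonzero digit.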

From this perspective, it becomes much clearer why our memorization strategy performed so poorly.

It's ridiculous to assume that all fingers should look like the ones we've seen in our tiny, tiny, tiny sample.

Now that we've gained some deeper perspective on our problem, we can ask some better questions

and hopefully find some better approaches to machine learning.

If we're going to program computers to actually learn from data, we must address some tough questions.

How can we learn rules that will actually generalize?

How can we learn when we've only seen a vanishingly small portion of all possible examples?

To answer these questions, let's simplify our problem one last time.

Instead of trying to learn rules from nine-by-nine examples, let's first consider two-by-two toy data.

This toy data is arbitrary. It doesn't represent fingers or anything. But it's going to help us think about our problem.

Let's start with three positive and two negative examples. We'll call these training examples because we're going to use them to learn a rule,

and we'll later use this rule to classify examples we haven't seen yet.

This is the central architecture of machine learning problems.

We have some data, in our case examples of fingers and non-fingers,

and based on these training examples, we want our algorithm to decide if unlabeled testing examples show fingers or something else.

Just as in our finger data, colored-in squares correspond to ones, and empty squares correspond to zeros.

However, we now just have four variables to consider.

We'll call them x1, x2, x3, and x4, or collectively, x.

Can we find a rule that, based on x, will correctly separate our positive examples from our negative examples?

Put differently, what do our positive examples have in common? What do our negative examples have in common?

This part is important, so I'll wait while you think of a rule.

Got it? Now, using your rule, which class does this new example belong to?

If you said our new example is positive, you're right. Great job!

Your rule might have been that examples with an x1 value of 1 are positive.

Now, if you said our new example is negative, you're... also right.

Your rule might have been something like "count up the total number of colored-in squares, and if the result is bigger than one, the example is positive."

So we have two rules that perfectly fit our training examples but predict exactly opposite results for our test example.
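
To make the conflict concrete, here is a small sketch in Python. The specific grids below are hypothetical: the narration doesn't spell out the training examples, so these were chosen only to match the setup (three positives, two negatives, four variables) and to agree with both rules. The function names are also just illustrative.

```python
# Hypothetical training data: each example is (x1, x2, x3, x4).
positives = [(1, 1, 0, 0), (1, 0, 1, 0), (1, 1, 1, 0)]
negatives = [(0, 0, 0, 0), (0, 0, 0, 1)]

def rule_x1(x):
    """Positive if x1 is 1."""
    return x[0] == 1

def rule_count(x):
    """Positive if more than one square is colored in."""
    return sum(x) > 1

def fits_training(rule):
    """True if the rule labels every training example correctly."""
    return all(rule(x) for x in positives) and not any(rule(x) for x in negatives)

print(fits_training(rule_x1), fits_training(rule_count))

# A hypothetical unseen example where the two rules disagree.
new_example = (1, 0, 0, 0)
print(rule_x1(new_example), rule_count(new_example))
```

Both rules label every training example correctly, yet the first calls the new example positive and the second calls it negative. Nothing in the training data alone can break the tie.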

If this seems troubling, it should.

How the heck do we decide which rule is correct?

And it gets worse before it gets better.

Programming machines to learn is hard.

But before we try to figure out if one rule is better than another, let's try to figure out if we at least understand the big picture here.

We found two rules that perfectly fit our training examples.

Now, are there any more?

If so, how many are there?

This problem is really worth thinking about, so let's make it multiple choice, and we'll sort out the answer next time.