Last time, we left off with a question:
how many distinct rules fit our training examples?
To answer this question, we first need to think about
logical connectives and Boolean variables.
Our variables are Boolean because they can each
take on two possible values:
one or zero.
True or false.
Logical connectives give us a simple way of
writing rules using Boolean variables.
There are quite a few connectives out there,
but we're only going to use three:
AND, OR, and NOT.
We can explore how these connectives work
using truth tables.
Just as we can plot functions of continuous variables
on the Cartesian coordinate system
to see how they behave across their various inputs,
truth tables allow us to see how Boolean functions behave across their various inputs.
Since our logical connectives AND and OR
connect two Boolean variables,
there are only four possible combinations of
input values for each connective.
AND returns true when both of its inputs are true
while OR returns true when either of its inputs is true.
Finally, NOT only applies to one Boolean variable at a time
and does what you might expect—
NOT one is zero; NOT zero is one.
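As a quick sketch, these three truth tables can be built directly from the definitions above, using 0 for false and 1 for true:

```python
from itertools import product

# A minimal sketch: build the truth tables for AND, OR, and NOT
# directly from their definitions, with 0 as false and 1 as true.
and_table = {(a, b): int(a and b) for a, b in product([0, 1], repeat=2)}
or_table = {(a, b): int(a or b) for a, b in product([0, 1], repeat=2)}
not_table = {a: 1 - a for a in [0, 1]}

print("a b | AND | OR")
for a, b in product([0, 1], repeat=2):
    print(f"{a} {b} |  {and_table[a, b]}  |  {or_table[a, b]}")
print("a | NOT")
for a in [0, 1]:
    print(f"{a} |  {not_table[a]}")
```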
Now we're getting somewhere.
We can make our question far more specific.
How many distinct Boolean functions of our four input variables,
built from combinations of ANDs, ORs, and NOTs,
perfectly fit our training data?
It may seem limiting to only use three connectives.
Fortunately, they form a functionally complete set,
which means that these connectives can express any possible Boolean function.
To help us organize things, let's build another truth table
but this time for our whole function.
We have four Boolean input variables,
which means there are sixteen unique possible inputs.
Of these sixteen possible examples, five are our training examples.
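We can check that count by enumerating the rows of the truth table directly:

```python
from itertools import product

# Each of the four Boolean inputs can be 0 or 1 independently,
# so the truth table has 2^4 = 16 distinct rows.
rows = list(product([0, 1], repeat=4))
print(len(rows))  # → 16
```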
Next, let's assume that there is some ideal function out there that we're trying to find.
We'll call it f.
We want f to match our training examples,
but we don't know what f should return in other cases.
Last time, we talked about two functions—
we called them g1 and g2—
that perfectly fit our training data.
We can apply these functions across all the inputs in our truth table to see how they behave.
As we know, they perfectly match our training examples.
We can now really dig into our question:
how many distinct Boolean functions perfectly match our data?
Let's approach it backwards.
I'll give you some truth table values, and you find a rule that
matches these values using only ANDs, ORs, and NOTs.
One answer is g3 = (x1 ∨ x2) ∧ (x3 ∨ x4).
So, counting g1 and g2, we have at least three functions that work.
Let's consider one more set of truth table values.
Can you find some function g4 that matches these?
Notice that this function only matches our three positive cases.
Is there a way to tailor our function
to exactly match these examples alone?
It turns out there is.
It's not pretty, but it's a well-defined rule that
perfectly matches only these examples.
Notice how g4 works.
We made a rule that exactly matches each case
and then just ORed them together.
Importantly, g4 shows us that using simple logical connectives,
we can generate a function that matches any
arbitrary collection of truth table values.
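The g4 construction can be sketched in code. The three positive rows below are hypothetical stand-ins, since the actual training examples aren't listed here; the technique itself, ANDing an exact-match clause for each positive row and then ORing the clauses together, works for any set of rows:

```python
from itertools import product

# Hypothetical positive rows standing in for the real examples.
positives = {(1, 0, 1, 0), (0, 1, 1, 1), (1, 1, 0, 0)}

def matches_row(x, row):
    # An AND of literals: x_i when row_i is 1, NOT x_i when row_i is 0.
    # It is true only when x is exactly this row.
    return all(xi == ri for xi, ri in zip(x, row))

def g4(x):
    # OR together one exact-match clause per positive row.
    return int(any(matches_row(x, row) for row in positives))

# g4 returns 1 on exactly the chosen rows and 0 everywhere else.
for x in product([0, 1], repeat=4):
    assert g4(x) == (1 if x in positives else 0)
```

This is the disjunctive-normal-form trick: because we can always build such a function, every way of filling in the truth table corresponds to at least one rule.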
Since we can make a function that matches any
combination of values in our truth table,
we don't need to write down and count
every possible Boolean function.
To figure out how many functions match our training data,
we can just count the number of possible
ways to fill in our truth table.
After all, for whatever combination of ones and zeros
we come up with, we know that we'll
be able to find a function to match.
Five of our sixteen values on our table are fixed
because we want our function to perfectly match our training examples,
but the other eleven are free to be positive or negative.
Since each of our eleven slots has two possible values,
that means there are 2^11, or 2,048, possible rules
that perfectly fit our examples.
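The arithmetic here is just a power of two:

```python
# Five of the sixteen truth table rows are pinned down by the training
# examples; each of the remaining eleven can be filled in two ways.
free_rows = 16 - 5
print(2 ** free_rows)  # → 2048
```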
So it's even worse than we thought.
With 2,048 perfectly good rules to choose from,
how can we possibly decide which ones are good
and which ones are bad?
Remember that the central idea to our machine learning approach
is that we would like to take some labeled training examples,
learn a rule or function f, and use this rule to make
predictions about unlabeled testing examples.
Our simple toy example has led us to 2,048 rules that perfectly fit our data.
So as far as we know, each rule is as good as any other.
One possible solution to having too many rules
is to allow all of our rules to vote.
We have 2,048 rules, so let's allow the majority
to decide how to classify new examples.
Unfortunately, this strategy falls apart pretty quickly.
We know that our 2,048 rules cover all possible
classifications for our eleven unknown examples.
So this means that for each example we haven't seen,
exactly half of our rules will label it positive,
and half of our rules will label it negative.
Our vote is dead even.
1,024 to 1,024.
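We can verify the tie by brute force: enumerate every way of filling in the eleven free rows, and count how many of those completions label one particular unseen row positive.

```python
from itertools import product

# Every rule corresponds to one way of filling the eleven free rows.
# For any single unseen row, count how many of the 2^11 completions
# label it positive; by symmetry, it is exactly half.
FREE_ROWS = 11
target = 0  # which unseen row we ask about; any index gives the same split
positive_votes = sum(comp[target] for comp in product([0, 1], repeat=FREE_ROWS))
negative_votes = 2 ** FREE_ROWS - positive_votes
print(positive_votes, negative_votes)  # → 1024 1024
```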
This result forces us to admit our second big point.
Machine learning is fundamentally ill-posed.
There is no f.
Our data alone is not enough to uniquely determine an answer.
There are many, many correct answers, and we have
no rational basis to choose between them.
So what are we to do, then?
How are we going to learn from data?
The answer is both beautiful and messy.
We must leave our neat and tidy world of theory
and do something that makes good scientists uncomfortable.
We have to make assumptions.
And it's not that making assumptions makes our problem simpler or easier;
these assumptions are at the very core of machine learning.
If we don't make assumptions, we can't make machines learn.
This idea is called the Futility of Bias-Free Learning.
Next time, we'll try to figure out what assumptions to make.