# Learning To See [Part 1: Introduction]

This is a decision tree.
Right now, it’s learning.
When it’s done, it’ll have learned to do something very human.
Something that everyone knows how to do,
but no one can quite explain how they do it.
Something whose apparent simplicity has fooled generations of brilliant scientists,
hiding its practical complexity.
Something that has only become possible in the last few decades, thanks to creative solutions
in the face of huge complexity.
It’s learning... to see.
Let’s test it out.
With the help of a camera and a computer, our decision tree sees exactly how many fingers we’re holding up.
Now, depending on who you are, the fact that a machine can perform this task
may be completely mind-blowing, not that out of the ordinary,
or not even all that impressive.
The good news here is, whichever camp you’re in, you’re in good company.
Our finger counting machine is a simple example that belongs to a deceptively complex class of problems.
Problems that arise from thinking about how we think.
How exactly is it that you, me and other animals can so easily turn images into ideas?
How, with a simple glimpse, do we so accurately identify objects?
Historically, questions like these have led very smart people to very wrong conclusions.
The great rationalist philosopher and mathematician René Descartes believed
that non-human animals could be completely reduced to automata:
collections of pistons, cogs and cams.
Descartes gave humans a little more credit.
But he, along with Gottfried Wilhelm Leibniz, co-inventor of calculus,
believed that all rational thought could be systematized by assigning every concept to a number.
Arguments could then be resolved by simple calculation.
Ideas like this stuck around for a couple of centuries, but were limited by the technology of the day.
It turns out there’s only so much you can do with cogs, pistons and cams.
When their modern-day equivalent, the transistor, showed up in the middle of the 20th century,
we humans finally had the tools we needed to test these ideas.
Thanks to some deep insight in the 1930s from the great computer scientist Alan Turing,
we knew that any idea that could be formulated in mathematical logic could be carried out on a computer.
If rational thought could be made into a logical system, it could be programmed,
and the computer would, in a very real sense, think.
These breakthroughs, alongside rapid advances in computing technology,
led to the creation of a brand new discipline in the 1950s:
Artificial Intelligence.
Wild optimism quickly ensued.
And, of course, as we all know, AI went on to be completely solved in the late 1970s.
“Siri, what’s the difference between a W-2 and a W-4?”
Well... Maybe not exactly.
Like Descartes and Leibniz before them, the AI researchers of the 1960s were seduced
into believing that the AI problem was much simpler than it actually is.
The first attempts at our problem, programming a computer to make sense of images, were no exception.
When the brilliant AI researcher Marvin Minsky set out to solve our problem in 1966,
he assigned it to an MIT freshman, Gerald Sussman.
Minsky and Sussman began programming their mainframe computer to describe images
they’d fed it from a video camera
and ambitiously set out to complete the task in a single summer.
As Minsky and Sussman quickly learned, programming computers to see is... hard.
Like... really hard.
To see why, let’s look at our problem a little more closely.
If we’re going to program our computer to count fingers, we need some data to practice on.
Using a Leap Motion infrared sensor and some Python code, we’ll take some pictures of hands.
We can have a look at our data using a tool our early AI counterparts could only have dreamed of:
the Jupyter notebook.
We’ll import our images and have a look at the first one.
Cameras like the one in our Leap Motion sensor, and to some extent our eyes,
capture light in discrete blocks called “pixels.”
Our images are made up of 100×100 grids of pixels,
where each pixel is completely described by an intensity value between 0 and 255.
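
To make that description concrete, here is a minimal sketch of what one such image looks like to a program. The image is a synthetic stand-in that we build ourselves (a dark frame with one bright bar), not real Leap Motion data, and the name `make_synthetic_image` is purely illustrative. In practice a NumPy array would be the more usual container; plain lists keep the sketch dependency-free.

```python
# Synthetic stand-in for one 100x100 grayscale frame (not real sensor data).
HEIGHT, WIDTH = 100, 100


def make_synthetic_image(height=HEIGHT, width=WIDTH):
    """Build a dark frame with one bright vertical bar, a crude 'finger'."""
    image = [[20 for _ in range(width)] for _ in range(height)]
    for row in range(30, 90):
        for col in range(45, 55):
            image[row][col] = 200
    return image


image = make_synthetic_image()

# Every pixel is a single integer intensity in the range 0..255.
all_values = [value for row in image for value in row]
print(len(image), len(image[0]))         # grid dimensions: 100 100
print(min(all_values), max(all_values))  # darkest and brightest pixel: 20 200
```
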
We can now begin to think about how to count the number of fingers in our images.
To count fingers, we first need to find them.
More specifically, we would like to know which pixels in our image belong to fingers.
Let’s consider a few example pixels.
For each pixel, we certainly don’t need to consider the entire image to decide
if the pixel belongs to a finger or not.
So we’ll focus our problem by sampling a 9×9 grid around each example pixel.
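
The sampling step might be sketched as follows. The helper name `extract_patch` is our own, the test image is synthetic, and the sketch assumes the example pixel sits at least 4 pixels away from every border, since the transcript doesn't say how edge pixels are handled.

```python
def extract_patch(image, row, col, size=9):
    """Return the size x size block of pixels centered on (row, col).

    Assumes size is odd and the block fits entirely inside the image.
    """
    half = size // 2
    rows, cols = len(image), len(image[0])
    if not (half <= row < rows - half and half <= col < cols - half):
        raise ValueError("patch would extend past the image border")
    return [r[col - half:col + half + 1]
            for r in image[row - half:row + half + 1]]


# Synthetic 100x100 image whose pixel values encode their position
# (illustration only, not sensor data).
image = [[(r + c) % 256 for c in range(100)] for r in range(100)]

patch = extract_patch(image, row=50, col=50)
print(len(patch), len(patch[0]))     # 9 9
print(patch[4][4] == image[50][50])  # True: the patch center is the example pixel
```
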
We now have a much more clearly defined problem.
Given a 9×9 example, we need to decide if these pixels show a finger or something else.
This is easy for us to do visually, but we have to remember
that our computer is completely unaware of what an image is.
We see fingers thanks to our visual cortex, but our computer just sees numbers.
Our algorithm must decide if it’s a finger or not based completely on these 81 numbers.
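
Here is a small sketch of what "just 81 numbers" means in code. The patch below is a hypothetical one we made up (a bright stripe on a dark background), and `flatten_patch` is an illustrative name; row-by-row ordering is an assumption, and any fixed ordering would do.

```python
def flatten_patch(patch):
    """Turn a 9x9 patch (a list of rows) into one flat list of 81 intensities."""
    return [value for row in patch for value in row]


# Hypothetical 9x9 patch: a bright vertical stripe on a dark background.
patch = [[200 if 3 <= col <= 5 else 20 for col in range(9)] for _ in range(9)]

features = flatten_patch(patch)
print(len(features))  # 81
print(features[:9])   # first row: [20, 20, 20, 200, 200, 200, 20, 20, 20]
```
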
Is there some mathematical equation or set of logical rules we can plug our 81 numbers into,
that will output the correct labels?
If so, how do we find it?
How do we even start?
What patterns in our 81 numbers should we look for?
How do we decide which patterns represent fingers?
How do we account for the different hand orientations, shapes, sizes and distances to our camera?
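
To get a feel for why these questions are hard, here is a deliberately naive hand-written rule. It is only a toy for illustration, not the approach this series builds: the function name, the two example patches and the brightness threshold of 128 are all assumptions of ours.

```python
def looks_like_finger(features, bright=128):
    """Naive hand-written rule on 81 numbers (a 9x9 patch, row by row).

    Says 'finger' when the center pixel is bright and the pixels at the
    left and right ends of the middle row are dark.
    """
    center = features[4 * 9 + 4]
    left_edge = features[4 * 9 + 0]
    right_edge = features[4 * 9 + 8]
    return center >= bright and left_edge < bright and right_edge < bright


# Two hypothetical patches: the same bright stripe, vertical and horizontal.
vertical = [200 if 3 <= col <= 5 else 20
            for row in range(9) for col in range(9)]
horizontal = [200 if 3 <= row <= 5 else 20
              for row in range(9) for col in range(9)]

print(looks_like_finger(vertical))    # True
print(looks_like_finger(horizontal))  # False: the same stripe, rotated 90 degrees
```

The rule handles the upright stripe and misses the rotated one, which is exactly the kind of brittleness the orientation question points at.
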
In this series, we’ll answer these questions and ultimately build a robust solution:
a decision tree.
A lot has changed in the 50 years since Minsky and Sussman set out to solve problems like this.
One particularly interesting shift is the discovery that many of the things that make our problem hard
also make lots and lots of other interesting problems hard.
These intersections mean that the approach we’ll use to build our decision tree
applies far beyond finding fingers in images.
Today, decision trees can play chess, detect car crashes, diagnose disease,
predict heart attacks, detect credit-card fraud, reveal hidden structure in data
and do lots and lots of other useful things.
All with a single algorithm.
Next time, we’ll start building our tree.