Cookies   I display ads to cover the expenses. See the privacy policy for more information. You can keep or reject the ads.

Video thumbnail
This episode of Real Engineering is brought to you by Brilliant, a problem solving website
that teaches you to think like an engineer.
Introduction: Opening, scene in a pub listening to a song and opening the shazam app.
Maybe be tricky to film.
What you just witnessed was the Shazam app recognising a song in a noisy environment,
and proceeding to find a match for it among the millions of songs in its servers database.
For most this probably seems like a trivial task.
Our brains can identify songs incredibly quickly from a young age, but the pathways in your
brain that allow you to identify a song quickly are incredibly complex.
Often times you simply need to hear just a few chords to know exactly what song is about
to play, that jolt of excitement when you can hear a DJ fading in the baseline of your
favourite song.
A simple combination of tones in a specific order allow you to identify a song from the
thousands of other songs you have heard in your life in an instant, but coding a computer
do the same thing is an incredible challenge.
A computer does not have an intuitive understanding of music.
A computer can only compare songs to other songs in its database, looking for a match
by comparison.
It is a problem akin to finding a needle in a haystack, where you can only find the needle
by looking at a picture of a needle and comparing it to each individual straw, comparing it’s
length and colour until you finally find the needle.
To create a software capable of doing this task quickly poses a very interesting coding
challenge and the solution the engineers at Shazam came up with gives us some interesting
insight into how our own brains work.
A study by the Manchester Museum of Science and Industry tested 12,000 people’s ability
to recognise a song.
They created an interactive game to search for the most recognisable songs, where they
would play the hook of 1000 best selling songs and recorded the time required to identify
them.[1]
Can you identify this song with just 2.3 seconds of the hook?
That was the Spice Girls song Wannabe, which ranks highest with a recognition time averaging
just 2.3 seconds, and that’s including the reaction time required to hit the button.
Our brains are hardwired for this kind of pattern recognition.
In a world where recognising the sound of an approaching threat meant life or death,
we have evolved incredibly efficient ways of categorizing and accessing historical data
like this.
Our brain does not take the sound and compare it to every sound we have ever heard like
a computer, the specific combination of chords in progression simply activates specific neurons
that unlock that historical data.
What if the chords were played by a different instrument?
Would we recognise the song as quickly?
Those same 2.3 seconds played on a guitar sounds like this.
The notes are exactly the same, but they don’t sound the exactly the same.
We even know intuitively what instrument is playing.
Why is that?
This is called the timbre of a note and different instruments have different timbres.
Pianos and guitars are examples of harmonic instruments and when they produce a note,
they aren’t just producing a pure note of a single frequency.
Each note is a combination of multiple frequencies all related to the base note, the fundamental
frequency.
These are called overtones, and they are simply multiples of the base frequency.
Each instrument has a unique combination and evolution of these overtones that give it
that unique sound.
Again, it’s quite easy for our brains to distinguish between a piano and a guitar,
but we need a way to quantify these characteristics for a computer to recognise, and this is where
the spectrogram comes in.
A spectrogram is a visual representation of sound.
It’s a 3D graph with time on the x-axis, frequency on the y-axis, and the amplitude
of the frequency, or in other words the loudness, on the z-axis, which is often represented
by a colour.
This 3D graph is something a computer can absolutely recognise and store as data, but
there is huge amount of data within a spectrogram like this, and the more data there is the
more computation time is required to find a match.
So the first step in reducing computation time is reducing the data required to classify
a song.
Shazam uses something they call a fingerprint, where they transform these spectrograms into
something that looks like star map.
[2] Here each star represents the strongest frequencies at particular times.
Doing this, we have not only reduced our graph from 3 dimensions down to 2, but have drastically
reduced the amount of data points on the graph.
This is the first vital part of Shazam’s technology.
Every single song in Shazams database is stored in a fingerprint like this.
When you open your phone and hit that Shazam button, the app accesses your microphone and
begins to create its own fingerprint of the sound waves it receives.
This ingenious method also helps the shazam app to filter out noise because it only creates
data points for stand-out frequencies.
Once the app has created a fingerprint of your audio, it then sends it to the shazam
servers where the recognition part of the process begins.
This is where things get difficult.
Let’s look at a simplified song fingerprint, and a recorded fingerprint to see why.
The recorded fingerprint is only a short recording of the song, in our example we have just 3
possible frequencies, and each recorded fingerprint will have just 3 time points.
If we want to check the first 3 time points in the song for a match we first check the
3 frequencies, then we move onto the next time point and check the 3 possible frequencies
again, and do the same for the final time point.
If we find a match, that is 9 operations required to find a match, but obviously that isn’t
likely.
We then need to do those nine operations for every time point in the song, or perhaps every
time point in Shazams massive music archive, this obviously is going to take a lot of computation
time.
This is not how Shazam looks for a match.
First Shazam categorises fingerprints in a clever way.
We don’t search to see if a note exists in a song, we search to see if several notes
exist separated by a particular time, just as brain does.
This becomes our searchable address for a hash table.
Hashes and hash functions are an incredibly useful technique that appear everywhere in
computer science.
Hash functions can be found in search algorithms used by Google, to make sure files are downloaded
correctly, and are the backbone of crypto currencies like bitcoin.
[3]
A hash function takes a varying length of input and produces a fixed length output,
called a hash.
In practice, the input can be anything from a short piece of text, like a password, to
a long data stream like an entire movie.
Consider a library of books.
We want to store each book on a shelf so we can find it later, and we know we’ll have
the title of the book when we’re searching for it.
We can use a hash function to decide which shelf to put a book on, using the title of
the book as the input and producing a shelf number as an output.
The first goal of a hash function is to produce outputs that are uniformly distributed...In
our library, we want the books to be spread evenly across the shelves, so no shelf in
particular will end up full of books, leaving others almost empty.
The second goal of a hash function is that it should reduce collisions.
A collision is when two different inputs produce the same output hash.
In our case, a collision results in two or more books on the same shelf.
If our library only has two shelves, collisions will be really common, no matter what hash
function you use.
If our library had a billion shelves, a good hash function will mean collisions will be
rare.
Another goal of a hash function is that it should be quick to calculate.
If our library has millions of books, we don’t want to take too long figuring out which shelf
each one needs to be on.
A simple hash function might be to take the title of a book, and group them on shelves
alphabetically.
This would be really quick to calculate, but it would result in a lot of collisions, with
many books on the same shelf, and wouldn’t be very well distributed.
Think about how many book titles begin with the word “THE”, compared to how many book
titles start with the letter “Z”.
An alternative might be to take the position of each letter in the alphabet and sum up
the letters in the book title.
We could then divide that number by the number of shelves we have and take the remainder
as the shelf number to store the book on.
This would still be fairly fast to calculate, and would prevent all the books with titles
starting with the word “THE” being stored on the same shelf.
Now imagine, instead of book titles, our hash function takes data from our two frequencies
separated by particular time as an input, t, and produces a number between 1 and...say
1 billion.
First, we go through our database of songs and calculate the hash number for each anchor
point.
Songs will contain multiple anchor points, which will allow us to categorise short snippets
of songs by the frequency of the anchor point, the frequency the following point and the
time between them.
And just like the library, we store each anchor point in order by the hash.
These addresses’ are also categorised with song IDs and time stamps within the song in
a secondary hash table, allowing us to search for matching songs.
This makes it much faster to locate our matches, and to find our song we will require multiple
matching anchor points.
This ingenious method of song recognition allowed Shazam to be sold for 400 million
dollars to Apple, and help you figure out just what that catchy song is.
This is a very simplified view of how the programming of Shazam works, but I have linked
my research materials below if you would like to read more into the process.
Or if you would like to begin learning more about programming and build a solid foundation
of understanding, then you could take this course on Computer Science Fundamentals on
Brilliant which will start with an intro to algorithms and gradually build you up to more
complex ideas like data types and structures, by the end of this course you will have discovered
algorithms that can be used to extract answers from data, and when you are finished you can
take the follow up course in computer science algorithms.
This is just one of many courses on Brilliant, with more courses due to released soon on
things like automotive engineering and Python Coding.
If I have inspired you and you want to educate yourself, then go to brilliant.org/RealEngineering
and sign up for free.And the first 73 people that go to that link will get 20% off the
annual Premium subscription.
As always thanks for watching and thank you to all my Patreon supporters.
If you would like to see more from me the links to my instagram, twitter, subreddit
and discord server are below.