How Shazam Works (Probably!) - Computerphile

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

one thing i'd asked somebody a long time ago was how does this service shazam works and shazam for those who who don't know is a service where you can hold your smartphone up press a button it listens to a song and then it tells you who the song is by and and you know what the song is called and we had a discussion and you've done some research into this and you'll be able to show us how it works yeah well um i have looked into how they've done it over the years and i'd imagine the way they're doing it now is uh vastly more superior than the original algorithms that people propose they were doing and obviously we don't 100 know how they're doing it they've given away some of the information and then other people have kind of speculated and written up their own papers on how things could be done but um what i've actually done is written a working model kind of like a shazam simulator which uh when when we were setting up the zoom call for this i put in the subject line shazoom so we're going to call my service shazoom it does work i mean i've only got a few songs that we're going to load into it and test with but i was going to go through how i think that at least the original version of shazam worked even if it's not the same now fantastic and um i know that you've put it on a website so people can go and have a play with it and we'll come to that in a bit yeah i know that somebody i spoke to said oh they must do it with beats per minute which is you know how fast or the tempo of the song is it is that what it's that's kind of like an emergent property you will get a certain amount of beat detection and other bits and pieces but the first thing to do is to run an fft over it's kind of slice the song into 100 millisecond chunks what's fft fast fourier transform um that's probably one of the most important bits of the whole thing and what i'm going to do is give you the the quickest and most basic explanation on what it does i'm going to give you a pound shop version of it now or in america dollar store version of it okay so we've got a nice little sine wave which all computer people seem to be very bad at drawing because we spend more time typing but let's say that this here the period between this is 100 hertz which means 100 times per second that it that it goes through its cycle right if you imagine that you had a clock with a single hand on it and that clock is spinning around at a rate so we're going to imagine we've got this clock and that hand is going around a particular rate so let's say that that clock is totally in sync with the sine wave so the sine wave is 100 hertz 100 times per second and the clock is going around 100 times per second which may seem fast but for audio that's very very slow what you'll end up with is you'll end up with a correlation with the wave that's coming in just as a quick recap when you're sampling on a computer when you're sampling a wave effectively you are just taking a reading of the voltage the middle of it is zero and above it you consider positive and below it you consider negative numbers so that's all we need to know for this so what we're going to do is imagine you've got this clock that's going around 100 hertz and the hand is at the top and let's say we're right at the peak of one of the waves so that's going to be the highest bit so let's say that's that's 100 from the zero point that value is 100. so in this clock let's say we're going to plot a point here at 100 so the distance there between the center of this clock and there is going to be a hundred now as you come down the wave and you get to the trough at the bottom halfway through the period that hand is going to have turned round to be exactly 180 degrees around the other way so that hand is now down in that direction but of course your number that you're plotting is a negative number because you've come back down the other way so you're plotting in the opposite direction so what you'd end up with is the point on this clock that you're going around the point would be in exactly the same pace as it was when it was at the top of the sine wave because it's in the opposite direction now if you have a hundred hertz sine wave coming in the data all the all the um the volume levels of of the the sine wave coming in it's 100 hertz the data's coming in and your clock's going round 100 hertz what ends up happening is you can end up plotting the same point all the time because it's hitting all the peaks and all the troughs and if you've got a sample rate which you would have which would have many many more slices than just getting the peaks and the troughs what you end up with is a load of other scattered out noise and if you take all of those points so let's say we had a point there and a point there and a point there and a point there well all of those points that are surrounding it if you use complex maths basically you you average it and reduce it down and the noise will cancel itself out and you'll be left with a prominent point on your 2d graph and that prominent point will be how loud that particular frequency you were feeding in is within that particular signal so if you were then to say put in 101 hertz but you've got the clock going at 100 hertz although you will get some correlation it's not going to be as much and then 102 hertz is going to be less correlation it's going to keep going down until effectively it will just turn into noise and all of those points will cancel each other out so if you want to find out how much 100 hertz 200 hertz 300 hertz 400 hertz and you want to keep going you have to run effectively run a sine wave through and do a comparison to the original signal plotting these points average them down and see what's left at the end of it so if you wanted to do it for 100 different frequencies you'd have to do that 100 times there are libraries out there that do this stuff super fast super reliably and and much better than i'm ever going to do it but the whole point of fft is you put in an audio signal and you get out the component frequencies so for example if you had a sound which was like a piano so you're playing an a on a piano it'll be 440 hertz will be the main fundamental frequency but then there'll be harmonics within that and then you'll get the octave up at 880 and and and so on keep doubling the frequencies and they'll they'll fall off and it's the mixture of all of those frequencies together that give you the sound and it just doesn't sound like a pure sine wave so when you combine all of those different elements that make up that note those different elements make the note sound like it's a specific instrument or a specific song yeah guitar will have different harmonics to a piano they'll also have a different you know some some instruments will slowly rise to a peak of a sound some will come in and be very percussive and have what's called a transient at the beginning of it um so when you have all of these different sounds they they're they're a combination of lots and lots of different sine waves at different frequencies and when you put them all together they produce one waveform which is a lot more complicated which is the music you listen to or the speech you listen to or whatever and and at fft and furio transforming basically is a way to get those fundamental frequencies back out of the song so if you've got a song and you've got loads of different frequencies and you want to say do i have 50 hertz playing at this point do i have 250 hertz playing at this point you're effectively running it through and doing a comparison to a to a raw 250 hertz signal do this clock thing plot your plot your points and do an averaging to find out and basically however far away the point you're left with is from the center of the clock is how loud that signal is within that particular bit of data that you've given it and although that does it doesn't matter in this particular case sort of where it's offset you know as in like twelve o'clock one o'clock two o'clock whatever is actually also the phase is how the how the uh how the wave is shifted but we really don't need to get into that for the purposes of this we're just basically getting out what frequencies are being used and obviously the more frequencies you want to test the longer it's going to take to do so we've got a very complicated waveform which is maybe a musical song with a bass and a drums and guitar and piano and all these different things which has become one waveform but it's an extremely complicated waveform because it's made up of all of these different instruments which in turn are made up of all these different harmonics and different elements we're using fft to look at a musical track and to pull out some information about what frequencies are occurring yeah when we get our song let me change a change sheet of paper here if we have a a song coming in that's like a complex waveform what's going to happen with that is that will be made of a set of frequencies and we want to get those out and really all we're interested in is between maybe 100 hertz which is quite low it's not super sub bass but it's it's low enough that a phone would be able to pick it up accurately and then say about 5 000 hertz or 5 kilohertz which is quite a cutting sort of frequency that hurts your ears obviously the higher the frequency the data the more intricate data there is in there but the more noise you're going to have when you're trying to use a phone to record something in a in a loud environment or with some background noise so what we're trying to do is pass through this complicated waveform and we're really interested in looking at frequencies between 100 hertz and 5000 hertz 5k now an average the songs that i'm dealing with here are sampled at 48k which means you can accurately represent a 24k signal in reality the way shazam would do is they downsample that they don't need huge files with all of that top end on it i just had them at that so if you had 24 000 different frequencies you were looking at with your you that you got back you would uh you would end up with 24 000 different numbers and one for each individual frequency obviously that would take way too long so the one that i actually use if i look at my code here i've got an fft size in my javascript at the top here the fft size is 2048 which i believe will return 1024 individual frequencies so if you were to split up naught to 24 000 divide that by 1024 then you work out each frequency chunk that it's looking at so basically we're looking at about 23 hertz 23.4 hertz um per slice that we're looking at so this waveform this complicated song waveform is going to have a reading it's going to have a let's say a figure between 0 and 1 of how loud that is so one is the loudest possible zero being the quietest possible and we're going to have that for 23 hertz chunks all the way up from zero to 24 000 hertz and we get that data back as an array you'd feed a song through this system that you just talked about and effectively pull out that information pull out that data yes effectively pull out the data on each frequency so if there's a lot of low frequencies if there's a load of going on you're going to get more stuff happening in the lower parts of the array and if you've got a load of top end stuff happening you're going to get it in the the higher the higher elements of the array this is what they would do to all the source music or the originals is that right yeah so they go through and get this information out so then there's another stage beyond that so once you've got out all this information so let's say we've got um all of these different frequency ranges and at each frequency range we've got a effectively a volume level how loud is it and we'll say for the sake of this between zero and one okay so we get out how loud that is and we then want to effectively group some of those together so we want to say well anything that's within this range will kind of say this is all one bucket we call it a bucket bucket of frequencies right and i all i'm interested is in this bucket of frequencies is which one of those is the loudest one and then i'm just going to get rid of everything else i don't care beyond that so if i've got say 100 hertz all the way up to 5k 5000 hertz okay so we've got these six frequency buckets going up and all i'm interested is i want one frequency in the bottom bucket i want one frequency in the second bucket i want one frequency in the third bucket and so on so that i get the most prominent frequencies and then effectively with that i'm going to draw a dot i'm just gonna call that a point on my graph in each one so we should get six dots out maximum of six dots per slice and if you have each column as a hundred milliseconds so every second you're getting ten columns of this audio suddenly you've got a nice little graph with lots of different data points in slightly different places and i think the best thing is we just show it on the screen just to make it real for people is 100hz something like a bass guitar or something like that can you give us a couple of examples of those different instruments they don't just have one frequency spectrum they'll have a whole vast range of frequency spectrums and part of mixing and being an audio engineer is finding ways of putting all those together so that they sound nice and kind of thinking well if i lose that bit of fruit spectrum off the guitar then the bass can go in and take the same space and it's not not going to interfere and sound horrible so it doesn't get too muddy okay but different instruments will have uh different areas of the frequency spectrum that they take up so bass drum will be right at the bottom and then i'd say bass guitar just above it and then you you'd have some snare drum around 200 hertz you'd have guitars produce a lot of sort of four or 500 hertz and then vocals can really kick through it about you know the prominent frequencies of sort of 800 to 1000 to so then it starts getting cutting around that area and then all that is happening and the the 4 000 to 8 10 000 hertz range that's where the prominent frequencies are obviously a guitar will produce 50 hertz and it will also be producing you know 12 000 hertz but it's where the most prominent frequencies are for that instrument and that's that's the important thing and generally when you play a note you will have a thing called a fundamental so if i was to play that a at 440 hertz on my guitar it's going to produce loads of other lower and upper harmonics but the most prominent frequency will be that 440 hertz and that will be the fundamental and what we're really looking for is fundamentals within our frequency buckets how do we look at those on the screen well effectively it's a two-dimensional graph you have time along the bottom and you have frequency up and actually i say two-dimensional three-dimensional way because the color that's actually plotted in those instead of it being a number you display it as a color and and uh traditionally you know blue will be quiet and red would be loud or going into the oranges would start to get loud i think the best thing is if i play in a song you'll hear the song it will go through in real time it will scroll across so you'll see the the the x-axis kind of populating you know as time goes across and then you'll see all the all the y-axis filling up with with all the different colors and the data points that is picking out and effectively under the hood it's testing each hundred millisecond millisecond slice with a load of different frequencies using that clock method or something similar however the the the fft algorithm they've implemented is doing it and it's just letting me know how loud it is and actually the numbers i get back are between zero and two five five okay before you do i want to avoid any imperial entanglements tell me the music that you're going to use is it is it's not commercial no it's not commercial this is all music that is uh stuff that i own that i've either written or been part of with um okay i'm looking up at youtube here so what i've got here if you look at the screen we're going to ignore all this stuff at the bottom at the moment and what i've got is i've got these five light blue songs which are actually full tracks i'm not going to play them all in because for the purposes of the demo we can use the intros of the song so i'm just going to play a little bit of each of them and then here i've got 10 clips i've got a clip of the music recorded through my phone so these are to emulate somebody sampling the sound and the the point one clip is actually pretty much the same thing where i've recorded it through my phone but i'm also talking over it so try and emulate some background noise so so we've got two different tests there but the first thing i have to do is play in the raw audio to the system the whole of this following on from our last video the whole of this demo i've done with javascript css and html5 as well and i've been using the the web audio api the um the audio context stuff which has a lovely analyzer that does all the fft for you um all built in under the hood in the browser which is wonderful let's click on the first song and we'll see it populate [Music] that's enough of that one now i've clicked on this song because i put this together as a demo and it's not something where i've labeled it all up as an actual piece of work that's being published when you've analyzed the song it then appears in the list underneath as a green button that you can click on which we'll get to later so effectively once i've played them all in we're going to end up with this this blue row here being duplicated underneath in green and it should look pretty much exactly the same apart from the fact that the buttons will be green but that means that they've all been played in and we've got a noise fingerprint for each one let me play in the other ones it's quite nice to see the base being drawn on this one the most interesting thing is is obviously that song there 2020 rendezvous but there's actually a remix of that which called wise sounds similar similar tempo it's about 2pm bpm difference it's got different drum sounds but it'll be interesting to see how it can differentiate between effectively the same song but in a remix fashion but let's do the others all right that'll do for that one next one 2020 rundown remix last one [Music] okay so that will do for that one there so what i've done is i've got all the songs in here now and before i do the clips because we've got a song up and this one's particularly good on the screen at the moment what we're seeing here is as you can probably see in the pixelated chunks particularly if i if i zoom in on it a bit you can see these tiny little sort of hundred millisecond slices so as you can see we've got these hundred millisecond slices and then the yellow marks are where the most prominent frequencies are and i've effectively split this up into six different chunks that can actually be changed down here if you wanted to change it you can say maximum points per slice and you can actually change that as one of the variables but you can see clearly where like a symbol or something has been hit where you're getting a lot of data on each beat so you can kind of see the tempo of the song as it's playing through the music you see the bass line moving i think that's that's kind of cool but if i now play in these clips and these are tiny clips these are sort of like five six seconds what you get if you were listening off your phone so we'll just go through and play all the clips in [Music] she's got my name there we go so now what we've got is we've got all our five songs in and then we've got five clips of just the song recorded through my phone and then you've got another five clips where i'm clearly chatting over them whatever came into my mind at that particular time which was just nonsense but i was just emulating background noise now the idea is is if i click on any of these green ones here it will then compare it to the songs the first five songs and we should tell how well it's correlated and this will be the big reveal does it work so i'm going to click on one let's see and there you go you can see at the bottom there it's correlated the first clip to a thousand things really well go clip two and it's correlated to the second song clip three to the third song clip four to the full tongue and clip five oh i didn't actually i think you missed five missed five oh let's analyze that one so if i do clip 5 which is now ended up at the end there that's actually correlating there you can see it's correlating a bit with 20 20 rendezvous that's not a remix so that's the same song but just different kind of versions of it right different version of the same song it's got the same melody slightly different words it's it's about 2 bpm faster so i'll explain on the paper a minute how this this matching happens if we go to the other clips here so clip 1.1 should be a thousand things even though i'm talking over it and then it goes down to there to the second song third song fourth song and then the fifth one with me talking over it selects the last one so he's correctly identifying all of them and the way we do that turn over the bit of paper is when we have this this set of slices so we've got these 100 millisecond slices so we've got six points on there and then we've got in the next column over we'll have a slightly different set of points in slightly different places i was effectively going through and numbering all of the points so the first bottom most point we'll call one and then we'll work our way up the column so we'll call that one two three four five six and then when you get to the top and there's no more points to go you go to the next column and then seven eight nine and effectively you're creating an array of these points but with the frequency of that it won't exactly be the frequency it'll be the position within the array which will be set but it doesn't really matter the position within the array represents the frequency um so what we're going to do is we're going to take say the first five frequencies so all we're doing is matching the first point so let's say i've got this graph here these x's on a graph for my original song that i'm trying to match against and then i've got the clip that i've recorded from my phone and that's also got these points all i'm trying to do is just match the first point if if i've got a point at 100 hertz in my sample from my phone and there's a point at 100 hertz in the original song we can then work from there if there's not a match on that first point we can move on you know we don't we don't care about that we can move on to the next next thing so from there we've got the noise print that comes in now the noise print that comes in your phone may not have all of the same frequencies exactly the same and you might have less top end frequency or less bottom frequencies so we're not trying to match an exact pattern to the original one what we're trying to do is pick one anchor point and then say we want x number of other points that were within what i've recorded off my phone to match within a song and it doesn't matter if they match in the right order as long as the frequencies are there so in a way you could sample um have have more maximum points per slice for the original song than you would off the phone if you wanted to you could have less data points for the the information you're getting off your phone for the the quality of the microphone on say the phone in this instance doesn't affect the frequencies does it it will for the absolute bottom and absolute top end if i play an a called on the guitar and i recorded it into a really bad sort of old-school dictaphone or or a really awful like iphone one or something that doesn't sound good into i don't even know whether they had a voice notes at but let's say i did that you're still going to be able to get out usable information and hear that i'm playing an a chord on a guitar okay it's not it's not going to sound beautiful but i mean it really does when i play but um it's it's still going to be a representation of an a chord so as long as the prominent frequencies are there and you've got some kind of system to kind of match them up with the original song it will work so so what you end up with is you end up with these groupings so these groupings will be if i've numbered all my points in my sample and i've numbered all the points within the original song i mean actually forget the original song that doesn't really matter we'll number the points within the sample i've got from the phone um we're gonna test we're gonna look for point one anywhere in any of the songs and then if we find point one we're then gonna try and match points two three four and five and if we look within that hundred millisecond slice or the next hundred millisecond slice because particularly with lower end frequencies they resonate in the room so they could last a bit longer if you were recording in a room compared to having the original recording so you might look over into the next slice and see whether the frequency is there and obviously if your first anchor point is actually at the top and there are no more frequencies in your column you've got to go to the next column so effectively you're you're looking for one and then if you find it you test two three four and five against the original source and if you get a match you just add one to the match score and then you move on you set point two as your anchor point and you try and match three four five and six and then you set point three as your anchor point so what you'd end up with is drawing a line around that lot there and then i'll get a red pen to do it in a different color and then the next one we're going to be testing that lot there and if i go to a green pen the next lot i'm going to be testing is this slot here plus this one at the bottom of the next column what i'm interested is the distance from the anchor point to the next point and then from the distance from there to there and the distance from there to there and then across to there and then across to there and i'm happy to find say this third point i'm happy to find that in the next column that would also be a match in the next column across because you don't know at what point you're taking these hundred millisecond slices you could be slightly out of phase with the original so something that appeared in the same column on one would appear across two columns because of the timing so you these are all the tolerances that i'd imagine shazam have been through and tweaked and gone crazy on to the nth degree all i'm really doing is just looking for these maximum frequencies and then using an algorithm to match them up and it works really well and it's really fast as well the naive approach would be to try and match up the whole array of all the noise print in a five second chunk to all the other array and effectively you get this huge convolution going across the whole thing it'd be like this huge convolution filter going across the whole thing which you know computationally would just i mean i i did that initially that was that was just to check that things were working i knew it wasn't going to be how i ended up doing it but um to do something with five songs that took me no time at all i clicked and it had matched it even in javascript and html5 as my as my ui um it was taking maybe 10 15 seconds to search one clip against five songs with with it um doing it as an array match whereas doing it where you have the points and you just anchor off the first one and you look just seem really instantaneous and then of course if you put those little five point groupings or six-point groupings and you somehow represent them in a hash table they'd be very very quick to look up in a database and you could use all sorts of uh clever computer sciencey stuff to make the algorithm work even faster so you're holding it and it just tells you the song straight away if you're playing on say an old-fashioned turntable uh my memory of these things is that they can play at slightly different rates and then presumably that would affect frequencies would it well it would so so technically it would but it all depends on how big your fft bin is and how many

Info

Channel: Computerphile

Views: 183,448

Rating: undefined out of 5

Keywords: computers, computerphile, computer, science

Id: RRsq9apr5QY

Channel Id: undefined

Length: 29min 48sec (1788 seconds)

Published: Mon Mar 15 2021