Build a Deep Audio Classifier with Python and TensorFlow

Ever wondered how Siri or Alexa picks up your voice commands? Well, in this video you're going to learn how to build your very own deep audio classification model using TensorFlow and Python. Ready to do it? Let's get to it.

[Music]

We're going to be going through a ton of stuff, and as per usual all the code is available via GitHub; the link is in the description below, and I'll throw one up on the screen right about now. So what exactly are we going to be doing? We're working with audio data, so first off we need to learn how to take that audio and bring it into Python as a numerical representation. We'll then transform it a little and convert it into a spectrogram, which means we can use convolutional neural networks to process and classify it. We're also going to perform a sliding-window classification: we'll take a larger audio clip and count the number of specific detections within that clip, which is particularly useful if you want to do something like speech command recognition later on. In terms of the data, we're going to be competing in the Z by HP Unlocked challenge, but more about that later. Let's jump right into it.

As part of this journey we'll be speaking to our fictional client, Damon, along the way; this gives you a little context while we build the deep learning model. Let's go have a chat.

"What's up Nick, heard you're going to be helping us out with that audio analysis problem. We're trying to classify capuchinbird density in the rainforest using audio clips." "Interesting." "Yep, I'll be giving you a hand, Damon. Have you sent over that dataset yet?" "Should be on Kaggle now." "Perfect, I'll check it out." "So what are you going to get up to first?" "First up, we need a way to numerically represent the audio. We should be able to do that using the TensorFlow audio processing libraries." "Nice, good luck."

Let's run through the breakdown board. First up, Damon sends us some bird call clips (that's the best bird I can draw), each roughly three seconds in length, so every bird call we're trying to detect is about three seconds long. We then take that set of bird calls and load it into Python, reading each one in as a waveform using the TensorFlow package. From that waveform, we convert to a spectrogram; the reason is that this lets us use computer vision techniques, like convolutional neural networks, to perform the classification. You'll actually get to see what the capuchinbird waveform looks like through that spectrogram, and it's really, really interesting. Once we've got the spectrogram done, we push it into our convolutional neural network, built using TensorFlow, and the output we get from that network is just a binary outcome: a one or a zero.
The one represents that we've actually heard a capuchinbird within that particular clip; the zero represents that we haven't. Now, the final outcome Damon wants is the density of bird calls within a particular audio file, and the final audio files we'll get from him are longer: rather than three seconds, they're around three minutes each. So if we visualize the full clip as one time period of three minutes, what we need to do is slice that clip into segments, which we'll call windows, each three seconds long. We then take our TensorFlow neural network, which has been trained on three-second bird clips, and slide it across each of those windows. Once we've got those classifications, all we really need to do is count the number of calls heard: if we hear a call here, that's one; a call there is another one; and a window where nothing is heard is a negative. One thing we'll notice later on is that we get consecutive detections that would otherwise be counted as individual calls; we'll group these up, so consecutive detections are aggregated and considered a single call. The end outcome is a CSV with all of Damon's outputs. Let's jump to it.

Alright, so let's build our audio classification model for Damon and the team. To get started, the first thing we need is some data. If we jump over to the associated Kaggle repository (link in the description below), we find a whole bunch of information about the challenge: this is what we're trying to classify, the capuchinbird, and specifically the challenge is to build a machine learning model and code to count the number of capuchinbird calls within a given clip. If we scroll down to the data explorer, we get three different folders: Forest Recordings, which are the full clips we need to parse through and count calls in; Parsed Capuchinbird Clips, which are specific capuchinbird calls of about three seconds each; and Parsed Not Capuchinbird Clips, which are files that are not capuchinbird calls. We need to download this data, so jump over to the Kaggle repository, which was created by none other than Ken Jee, and hit download; that downloads the dataset as a zip archive. Once it's downloaded, grab that data and bring it into the repository you're working in, inside a folder called data. You can see I've got Forest Recordings, Parsed Capuchinbird Clips and Parsed Not Capuchinbird Clips in there; so whatever your main working folder is, create a folder called data and throw those three folders inside it.
The next thing to show you is where to get this code. If you want to follow along, go to github.com/nicknochnack/DeepAudioClassification, where you'll find the Jupyter notebook as well as an example results CSV you can work with. The end goal is to produce a CSV that we can submit to the HP Unlocked challenge. If you go to hp.com's Unlocked page, it takes you to what I'm currently looking at. The cool thing is there's an interactive movie you can watch to see how data science gets used in the real world; it's pretty cool what you can get up to with it (my mouse is lagging). Scroll a little further down and the challenge we're taking part in is Challenge Three. Let's step into it and have a quick read. So, what are we trying to do? Michaela is helping... no, that's the data visualization one; let's go to Challenge Three, right over here. Alright: Michaela, Damon and Eva need to pinpoint the place where a certain bird population is most dense. This is a critical point in their journey and the clock is ticking; use your data science skills to analyze these audio files and identify the areas of the forest that have the greatest density of capuchinbirds. Remember, Ken Jee and I are here to help with the tutorial walkthrough if you need a little assistance, and when you're ready, submit your answers in the form below to continue the journey and enter the sweepstakes. Once you've completed the challenge, scroll down, select your results file, upload it, fill out your details, and you'll go into the draw to win one of ten HP ZBook Studio notebooks and a potential trip to the Kaggle Days Championship in Barcelona, which would be absolutely awesome; I'd love to go.

Now that we've seen what we need to do, let's jump over into our Jupyter notebook. I've got the notebook from the GitHub repository, AudioClassification.ipynb, and if we look at the folder structure, all of my audio clips are inside a folder called data, containing Forest Recordings, Parsed Capuchinbird Clips and Parsed Not Capuchinbird Clips. Let's have a quick listen to some of those first, so we know what we're working with. Play one of the forest recordings, the full clip: it's about three minutes long, and it's just audio of the forest. We'll be parsing through this and counting how many capuchinbirds we actually hear. That's a capuchinbird right there, so ideally you'd count one and keep counting as you go through the clip. Looking at the training data: play one of the Parsed Capuchinbird Clips and that's a capuchinbird, about three seconds. The Parsed Not Capuchinbird Clips are just things like crickets or other birdsong out in the forest, again about three seconds each.

Cool. So the first thing we need to do is take those three-second capuchinbird clips and convert them into waveforms; this takes the audio and converts it into a tensor.
In order to do this, the first thing is to install some dependencies. We run !pip install for tensorflow; tensorflow-gpu, which is particularly useful if you've got a CUDA-enabled GPU; tensorflow-io, which is useful for audio processing; and matplotlib, to visualize our results. Run that cell and it installs the dependencies; I've already got them installed, so it goes relatively quickly. Let's double-check there are no errors. All looks good.

Now that the dependencies are installed, the next thing is to load them into the notebook. There are four specific imports. First, import os, which makes it a little easier to traverse directories: our data lives inside that data folder, and we need to go into Parsed_Capuchinbird_Clips and Parsed_Not_Capuchinbird_Clips, so os makes navigating that simpler. Then from matplotlib import pyplot as plt, to make it easier to visualize our results and waveforms once we've converted the audio. Then import tensorflow as tf, because we'll use TensorFlow a little later to build the deep learning model, and import tensorflow_io as tfio, which makes it easier to process the audio clips and convert them into a format we can use. Select and run that cell to load the dependencies. That's step one done: we've installed our base dependencies and loaded them into the notebook.

The next thing is to build our data loading function, and the first part of that is defining the paths to our files. This is relatively straightforward; we'll just test out a single file to begin with. The first path is CAPUCHIN_FILE = os.path.join('data', 'Parsed_Capuchinbird_Clips', 'XC3776-3.wav'), a specific clip I've picked. If we run that and type CAPUCHIN_FILE, all we really get is a file path, and this is where os comes in handy: whether you're working on Linux, Windows or Mac, it works out the appropriate way to build a path to a file on your operating system. So that's an example of a capuchinbird; we also define one for a not-capuchinbird: NOT_CAPUCHIN_FILE = os.path.join('data', 'Parsed_Not_Capuchinbird_Clips', 'afternoon-birds-song-in-forest-0.wav').
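As a consolidated sketch, the setup described so far looks like this (the exact clip file names are the ones read out above and may differ slightly in your copy of the dataset):

```python
# Install dependencies (run once inside the notebook)
# !pip install tensorflow tensorflow-gpu tensorflow-io matplotlib

import os                              # cross-platform file paths
from matplotlib import pyplot as plt   # visualizing waveforms and results
import tensorflow as tf                # deep learning model
import tensorflow_io as tfio           # audio decoding and resampling

# One positive and one negative example from the Kaggle dataset
CAPUCHIN_FILE = os.path.join('data', 'Parsed_Capuchinbird_Clips', 'XC3776-3.wav')
NOT_CAPUCHIN_FILE = os.path.join('data', 'Parsed_Not_Capuchinbird_Clips',
                                 'afternoon-birds-song-in-forest-0.wav')
```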
If we take a look at NOT_CAPUCHIN_FILE, that's the file path to the specific negative example we'll be using. Remember, these are just single files for the moment, to test things out when we actually go to load our data.

Next we set up the data loading function. This is adapted from a data loading function in the TensorFlow documentation, which makes it super easy to process your audio; I've just added a couple of comments explaining how it works. First we define a new function, load_wav_16k_mono: we're going to process the audio file, resample it to 16 kHz, and keep a single channel, so it's mono. The function expects the file name, or file path, of the specific file we want to read. Then we load it in: file_contents = tf.io.read_file(filename). What we get back is a byte-encoded string; this doesn't return the audio yet. If you run that line on its own against the capuchin file, you'll see the encoded contents of the WAV file rather than a usable signal. The next line is what actually decodes the wave: tf.audio.decode_wav, to which we pass that byte-encoded string and specify that we only want one channel, so rather than stereo we get mono audio, which makes it a little easier to process later on. Out of that we get the wav itself plus the sample rate. Looking at the wav, we get back a tensor of shape (132300, 1): the amplitude of the wave over time, with one trailing channel, which we'll visualize in a second. We squeeze that trailing axis away with tf.squeeze, and we cast the sample rate to a 64-bit integer so it's in a format we can use to resample. Right now the sample rate is 44,100 Hz, or 44.1 kHz; that's how many samples per second the wave contains. What we want is to resample down to 16,000 Hz, so we write wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000); sorry, I hadn't run it yet, and this bit is from the TensorFlow site as well. We pass the wave (with the trailing axis removed), the current input sample rate of 44,100 Hz, and the target of 16,000 Hz, which effectively reduces the size of the final audio. To see that, check the length of the wav beforehand: it's 132,300 values. After resampling from 44,100 down to 16,000, the length is 48,000, significantly smaller, so there's less data for us to process.
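Put together, the loader looks roughly like this; as mentioned, it's adapted from the TensorFlow documentation:

```python
def load_wav_16k_mono(filename):
    """Load a WAV file, decode it to a float tensor,
    and resample to 16 kHz single-channel audio."""
    file_contents = tf.io.read_file(filename)                 # raw bytes
    wav, sample_rate = tf.audio.decode_wav(file_contents,
                                           desired_channels=1)  # mono
    wav = tf.squeeze(wav, axis=-1)                            # drop trailing channel axis
    sample_rate = tf.cast(sample_rate, dtype=tf.int64)
    # Resample from the native rate (44.1 kHz here) down to 16 kHz
    wav = tfio.audio.resample(wav, rate_in=sample_rate, rate_out=16000)
    return wav
```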
Now, all of that lives inside the function, so we don't need to run those lines individually; we just pass the file path and get back a pre-processed wav to work with. So let's get rid of those temporary cells. Next we plot our waves: we load our capuchin file and our not-capuchin file using load_wav_16k_mono, and we get back the waveform at 16,000 Hz for both our positive example and our negative example; I've just called them wave and nwave. Run that (we also need to run the function cell), and those are our two waves. If we visualize them, the blue wave is our capuchinbird, where you can see that warbling, and the orange line is our not-capuchinbird, the negative example. Cool, that's our audio processed, or at least one sample of it. So far we've installed and imported our dependencies, built a data loading function (load_wav_16k_mono), and visualized an example of what a capuchinbird sounds like versus the generic forest and other songbirds. Let's jump back over and give Damon an update on our results.

"Alright Damon, the first phase is done: we've got the audio in a format we can work with. There's one last tweak we need before we get to modeling, though." "Ah cool, what's the last tweak?" "We're going to convert the raw audio waveforms that we loaded using the TensorFlow decode_wav method into spectrograms, using the short-time Fourier transform." "You mind dropping the jargon for me?" "Ah yeah, my bad, man. We're going to convert the audio wave into what's effectively an image representation. This means that later on we can use image classification techniques to count the capuchin density." "Alright, shoot me those spectrograms when you're ready."

Alright, so: spectrograms. We've converted our data into waveforms, or at least we've got one or two examples inside Python. What we want to do now is convert our data into spectrograms, but before that, we should load up all of our data to get ready to build the deep learning model. For that we'll use the TensorFlow Dataset API, which is really good because it lets you build a data pipeline when you're loading significant amounts of data; it's helpful if you don't have much memory on your computer, and it handles that pre-processing for you. The first thing to do is define two paths, to our positive and our negative data.
These are really just file paths. I've called the first one POS = os.path.join('data', 'Parsed_Capuchinbird_Clips'), and the negative path is NEG = os.path.join('data', 'Parsed_Not_Capuchinbird_Clips'). Run that, and again these are just paths: type POS and you get the path to the capuchinbird clips folder; type NEG and you get the path to the not-capuchinbird clips folder. Cool, those are our two file paths defined.

Next we want to load these into the TensorFlow Dataset format, using the list_files method, which effectively gives us the set of files, as strings, inside a TensorFlow dataset that we can then pre-process and load up. Run those lines; looks like they ran successfully, so let's look at what we've written. pos = tf.data.Dataset.list_files(POS + '/*.wav'): list_files just goes into a directory and looks for files matching a specific pattern; here we point it at the positive directory and tell it to match anything ending in .wav, because those are our audio files. We've done a similar thing for the negatives, the only difference being that we point at the negative folder. We can take a look at one example: pos.as_numpy_iterator().next() returns what the dataset currently holds, a string path to a specific file, in this case data/Parsed_Capuchinbird_Clips/XC27882-1.wav. So it doesn't hold the audio yet, just the file name, or path to the file; eventually we'll use the map function from the TensorFlow Dataset API to run load_wav_16k_mono on each of those file paths, which is what will actually load each file.

Cool, so right now we've got a big list of all the different files we want to work with, positive and negative. But the next thing we need is labels, because right now it's just files: we don't know which is a positive example, i.e. a capuchinbird, versus a negative example, i.e. not a capuchinbird. That's exactly what we do next, again using the dataset API: tf.data.Dataset.zip lets us pair two datasets together element by element, so we zip our existing file paths with a one for each positive sample or a zero for each negative sample. To do that, we use tf.data.Dataset.from_tensor_slices, appending tf.ones of the same length as our list of files.
Let's take a look at what that returns. If we check len(pos), we've got 217 files for our positive examples; passing that to tf.ones gives back a big tensor of ones, so each file that is a capuchinbird gets a binary flag of one to say that yes, this is in fact a capuchinbird. It's going to be a binary classification problem in the end. We've done a similar thing for our negatives using tf.zeros, so rather than ones we get zeros, and it looks like we've got 593 samples there. This is an unbalanced dataset, so ideally, to improve performance later on, you might want to balance it out: effectively oversample the positives or undersample the negatives. Run those two lines and take a look: positives.as_numpy_iterator().next() now gives us not just the string but also the flag; this particular file is an example of a capuchinbird, hence the binary flag of 1. Doing the same for the negatives, we again get the file path, but with a 0 flag.

Now we want to join all of these together, which is where the next line comes in: data = positives.concatenate(negatives), so our positive examples and our negative examples live in the same variable. The structure of each element doesn't change at all; if we look at data, it's exactly the same format, just with both classes in there. If I tack on .shuffle(10000) and iterate, we get a negative example, run it again, negative, negative, negative... give me a positive one... there's a positive, a negative, positive, positive, negative. So all of our data is now stored inside this one pipeline, and this data variable is effectively what we'll use later when it comes to training the deep learning model. That's part three done: we've defined the paths to our positive and negative data, created our TensorFlow datasets, added our labels, and combined the positive and negative samples.
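A minimal sketch of the dataset construction just described:

```python
# File paths for each class
POS = os.path.join('data', 'Parsed_Capuchinbird_Clips')
NEG = os.path.join('data', 'Parsed_Not_Capuchinbird_Clips')

# Datasets of file-path strings
pos = tf.data.Dataset.list_files(POS + '/*.wav')
neg = tf.data.Dataset.list_files(NEG + '/*.wav')

# Pair each path with its binary label (1 = capuchinbird, 0 = not)
positives = tf.data.Dataset.zip(
    (pos, tf.data.Dataset.from_tensor_slices(tf.ones(len(pos)))))
negatives = tf.data.Dataset.zip(
    (neg, tf.data.Dataset.from_tensor_slices(tf.zeros(len(neg)))))

# One combined pipeline of (filepath, label) pairs
data = positives.concatenate(negatives)
```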
The next thing we want to do is determine the average length of a capuchinbird call: a little bit of exploratory data analysis. This is important because when it comes to converting to spectrograms, we want to ensure we capture what is effectively a full capuchinbird call, or at least the majority of it, so that when we go and slice through the bigger clip we're capturing the entire call and getting good predictions. What we can do is loop through every file inside the Parsed_Capuchinbird_Clips folder and calculate how long each one actually is. We create a new variable called lengths, just a blank list, then loop through every file in the folder: os.path.join gives the folder path and os.listdir gives every single file (if I show you what that does, we just get the set of file names inside that folder). For each one we use load_wav_16k_mono to load the file into its waveform, the 16 kHz equivalent, and then record how long it is, because they're not all going to be exactly three seconds: some might be longer, some shorter, and we want to ensure we capture the majority of the wave. Run that (it might take a little while, and you can ignore the warning; that's perfectly fine), and once it's done we can calculate the mean, min and max.

Take a look at lengths: we've got a whole bunch of different file lengths. Some are 40,000 samples long, some are 48,000, so they vary, and this is important to know, because if capuchinbird calls were really, really short, you'd want to make sure each window still captures a whole call. Now for some summary stats. Calculating the mean, the average call length comes out around 54,156 samples. What does that work out to? We're at 16 kHz, so 16,000 samples per second; pulling out my phone, 54,156 divided by 16,000 means our capuchinbird call clips average about 3.4 seconds in this particular case. We'll probably want something that captures the majority of that; it doesn't need to be the full 54,156, we can capture most of it with, say, 48,000 or something along those lines. Looking at the other summary stats, the min is about 32,000 samples and the max is about 80,000.
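Here's a sketch of that length check, reusing the loader from above:

```python
# Measure every positive clip's length in samples (at 16 kHz)
lengths = []
for file in os.listdir(os.path.join('data', 'Parsed_Capuchinbird_Clips')):
    tensor_wave = load_wav_16k_mono(
        os.path.join('data', 'Parsed_Capuchinbird_Clips', file))
    lengths.append(len(tensor_wave))

# Summary statistics: roughly 54k mean, 32k min, 80k max samples
print(tf.math.reduce_mean(lengths))   # average call length
print(tf.math.reduce_min(lengths))
print(tf.math.reduce_max(lengths))
```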
So we'll play it safe and grab around 48,000 samples, but you could definitely tune this if you wanted to capture more. Just keep in mind: the larger you make your spectrogram, the more parameters your deep learning model will have, which means more time to train and higher latency when processing. Ideally you want it to be quick, so that's something to keep in mind. So we've now looked at the summary statistics for our audio: the shortest clip is about 32,000 samples, around two seconds, and the longest is about 80,000, around five seconds, so the calls are between two and five seconds in length.

Okay, now we get to what we promised Damon: converting our data into spectrograms, using the short-time Fourier transform in TensorFlow. The function we'll use is called preprocess, and it takes our file path and our label. Now you're probably thinking: Nick, why do I need to pass through the label to create a spectrogram? That's because we're going to use this method as part of the data pipeline we defined, right up above, and when we use the map method on that pipeline, it passes through the full sample: the file path and the label. We won't actually do anything with the label inside the function, but we want to take it in and return it back at the end.

So what does preprocess actually do? First, it loads our data into its waveform, using that load_wav_16k_mono function we wrote at the start. Then it grabs the first 48,000 samples from the signal; you could bump it up or down, since 32,000 is probably a little short and 80,000 means more data to process, but in this particular case let's try 48,000 and go from there. Not every one of our clips meets that length, though; remember, some of our clips are only 32,000 samples long. How do we handle this? We pad with zeros: anything that doesn't meet the full length gets zeros thrown in at the front using tf.zeros. Say a clip was 32,000 samples long; we'd add, what is that, 16,000 zeros at the start of it, to ensure all of our clips meet that 48,000-sample length. That's everything this line is doing: tf.zeros of 48,000 minus the length of the actual wav gives us the number of zeros we need to add at the start.
Then we grab the wav and override it using the tf.concat method, concatenating all of the zeros at the start plus the actual wav, and that gives us the final signal that we then use to create the spectrogram. Which brings us to the spectrogram itself. To create it, we use tf.signal.stft, the short-time Fourier transform, passing through our wav and specifying a frame_length of 320 and a frame_step of 32; these are the window length and hop size in samples, not the sample rate. The STFT returns complex values, so we then take tf.abs to convert everything into magnitudes, all positive values. Finally we expand the dimensions with tf.expand_dims, which wraps our spectrogram in an extra axis to ensure it's in the format we need to build a deep learning model.

Let's actually step through it. Grab the first file path and pass through the capuchinbird call: that's our wave, just a tensor, 48,000 steps long. Apply the zero padding: this one is already 48,000, so you won't see any change there; look at the wave again and there are no zeros at the start, because it already meets the required length. Then create the spectrogram: it comes out with shape 1491 by 257, and that is the shape of our spectrogram. Applying tf.abs doesn't change the format; it just converts the values to magnitudes, and if you comment it out you can see the raw complex STFT output instead. The expand_dims step then gives us a channels dimension, so the shape becomes 1491 by 257 by 1, which is needed so we can use our convolutional neural network on it, since it expects a channels dimension. It's almost like you're processing a grayscale image; that's effectively what you've got on your hands here. With any other image you'd have height by width by a number of channels, and a color image would normally have three channels, RGB or BGR; in this case we have a single channel. But that is our data, our waveform, now converted into a spectrogram.
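Assembled, the preprocess function described above looks roughly like this:

```python
def preprocess(file_path, label):
    # Load and truncate to the first 48,000 samples (3 s at 16 kHz)
    wav = load_wav_16k_mono(file_path)
    wav = wav[:48000]
    # Pad the front with zeros if the clip is shorter than 48,000 samples
    zero_padding = tf.zeros([48000] - tf.shape(wav), dtype=tf.float32)
    wav = tf.concat([zero_padding, wav], 0)
    # Short-time Fourier transform: 320-sample window, 32-sample hop
    spectrogram = tf.signal.stft(wav, frame_length=320, frame_step=32)
    spectrogram = tf.abs(spectrogram)                  # complex -> magnitude
    spectrogram = tf.expand_dims(spectrogram, axis=2)  # add channel dim -> (1491, 257, 1)
    return spectrogram, label
```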
But you're probably thinking: Nick, what does this spectrogram actually look like? Let's test out the function first. We grab a sample from our positives pipeline with positives.shuffle(buffer_size=10000) (you've probably seen me write this a little earlier, just up above, really no different) followed by .as_numpy_iterator().next(), which gets the next example out of the pipeline. We get back our file path and our label, because remember, we haven't run the pre-processing over the data pipeline yet. Run it through preprocess (we need to run the function cell first) and we now have a spectrogram back, in that same format: 1491 by 257 by one channel. Now we can go and visualize it; let's see what this bad boy looks like. Run the plot, and that is our spectrogram: we've taken our raw waveform and converted it into an image, so if we go to visualize a capuchinbird, that's what it looks like.

We can also look at a negative example: change positives to negatives (we probably don't need to see the raw tensor) and run it, and that's what a negative example looks like. Very, very different, right? Our capuchinbird had that distinctive warbling, while this one almost just looks like noise. We can grab another example: that's another negative, another negative, another negative... jump back to the positives: that's a positive, that's a positive. Are you sort of noticing a pattern? The capuchinbird call starts off kind of low, then rises, and then continues on; that's the warbling you hear. How cool is this? We've actually visualized what a capuchinbird sounds like. Keep running the cell and you keep seeing that warbling, starting low and rising up. Pretty cool, right?

Okay, so that's the spectrogram created. A quick look at what we've done: we started by setting up our positive and negative data, creating the file paths, creating the TensorFlow datasets, and adding the positive and negative labels. We then determined the average call length, so we know how much of a call we actually need to capture, and decided on about 48,000 samples; anything shorter gets padded with zeros at the start. Then we converted to spectrograms using the tf.signal method, and that's what you see there. One thing I didn't talk about is how we visualized this: we just used matplotlib, with plt.figure(figsize=(30, 20)) and then plt.imshow, where I've transposed the spectrogram with tf.transpose so you can see it across time. That's what's on the screen. Let's jump back over to Damon and give him an update on our spectrograms.
"So, spectrograms are now done. What's next? We're pretty keen to get this done." "Well, thankfully we're now up to the modeling stage. We'll build a deep learning model to detect single capuchinbird calls." "But how's that going to help with detecting call density?" "Once we've got this model trained, we'll slice the longer clips of the forest into shorter windows, and we can then use the model to classify whether there's a capuchin heard in each specific clip or not." "Ah right, so then you can aggregate the results and calculate how many calls were heard." "Exactly. Let me jump back into it."

Alright, so we've done a lot of pre-processing but not necessarily a lot of deep learning, and this is where it kicks in. The first thing we need to do is create a training and testing partition, and in order to do that we need to finish off our TensorFlow data pipeline, because up until now, if we run data.as_numpy_iterator, the pipeline is still only holding the string (the file path to our audio file) and the label; we haven't run it through the spectrogram method yet. So that's exactly what we'll do. The way I remember how to pre-process TensorFlow data pipelines is "ma-ca-shu-ba-pre": map it, cache it, shuffle it, batch it, prefetch it. I don't know, that's all I could come up with in terms of my memory palace. So, first up, data = data.map(preprocess), which runs every element through the preprocess method from up above; remember, it takes in a file path and a label and outputs a spectrogram and a label. Then we cache it. Then we shuffle it, which mixes up all of our training samples, so rather than having all positives at one side and negatives at the end, we get a mixed bag; this ensures we're not going to overfit or introduce unnecessary bias into our model. Then we batch it up, so we train on 16 samples at a time, and finally we prefetch, specifically eight batches, which helps eliminate CPU bottlenecking. Cool, that's our TensorFlow data pipeline, so let's run it. Our pipeline is now built.

Now we split it into training and testing; again, this is good data pre-processing practice. We can run data.take and grab 36: if we look at the length of our data, it's 51 batches, so if we take 70% for our training partition, that should be around 35.7, and we'll round up and take 36.
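A sketch of the pipeline plus the 36/15 split described next (the shuffle buffer size here is illustrative; tune it to your memory budget):

```python
# map -> cache -> shuffle -> batch -> prefetch
data = data.map(preprocess)            # (filepath, label) -> (spectrogram, label)
data = data.cache()
data = data.shuffle(buffer_size=1000)  # mix positives and negatives
data = data.batch(16)                  # 16 spectrograms per training step
data = data.prefetch(8)                # overlap preprocessing with training

# Roughly 70/30 split over the 51 batches
train = data.take(36)
test = data.skip(36).take(15)
```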
Then we assign that partition to train, so train = data.take(36), and then we skip those 36 and take the last 15 as our testing sample: test = data.skip(36).take(15). That effectively gives us our training and testing partitions; run that and it's done. Now we can test it out: grab a training example, as per usual using the as_numpy_iterator generator and grabbing the next sample. If we look at samples.shape, we have 16 examples of our spectrogram, each of shape 1491 by 257 by 1. This is really important, because we're going to need this specific shape to pass through as the input to our deep learning model. So that's this bit done: we finalized our TensorFlow data pipelines, ran the data through the map-cache-shuffle-batch-prefetch sequence, took the top 36 batches as our training sample, skipped those and took the last 15 as our testing partition, and tested one batch; we are definitely getting spectrograms back. We can also look at the labels, which are just our ones and zeros.

Now the next thing we want to do is actually build the deep learning model. For this we need a couple of dependencies: the Sequential API, because we're just building a sequential model, no fancy inputs here, and three different types of layers: the Conv2D layer, the Dense (fully connected) layer, and the Flatten layer to go from a convolutional layer to a dense layer. So we've written from tensorflow.keras.models import Sequential, and from tensorflow.keras.layers import Conv2D, Dense, Flatten. (I was just thinking about how I remember layers, and I always think of Donkey from Shrek talking about onions having layers. Anyway.)

Now we can build up the Sequential model. First we instantiate an instance of the class: model = Sequential(). Then we add the convolutional layers: the first is model.add(Conv2D(16, (3, 3), activation='relu', input_shape=(1491, 257, 1))), so 16 kernels of shape 3 by 3, leaving the stride at its default, with a ReLU (rectified linear unit) activation, and an input shape that matches what our spectrogram looks like. Remember, this is why I said the shape is important: a spectrogram is 1491 by 257 by 1. If you change the shape of your spectrogram, i.e. you change how big a waveform you pass through, or you change the parameters of the short-time Fourier transform, then you may need to go and update this input shape.
With the parameters I've used, that's the input shape here. That's our first Conv2D layer; then we add another, almost identical: Conv2D(16, (3, 3), activation='relu'). Then we flatten it down, taking the convolutional outputs, which are in three dimensions, down to a single dimension, and pass that to a Dense layer with 128 units and a ReLU activation, and then to the final layer, a Dense layer with a single unit and a sigmoid activation, because it just needs to output a value between 0 and 1. Run that and it creates our Sequential model; no errors there.

Now we compile it, using the Adam optimizer with a binary cross-entropy loss, because this really is just binary classification, and we pass through a couple of metrics: metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()], all inside model.compile. That compiles the model, and then model.summary() shows the network. It's pretty big at the moment, about 770 million parameters, quite a monster, so you could definitely compress it down, for example by dropping a couple of max-pooling layers between the convolutions and between the convolutions and the flatten. I've found this architecture tends to work, but it is very big.

So that's our deep learning model created: we specified a Sequential model, added our convolutional layers, flattened, added our dense layers, then compiled it and took a look at the summary. Now we can actually train it. We assign the result to hist, which lets us capture the training values, specifically the loss and metrics for both our training partition and our validation partition; by capturing this history, you can visualize it afterwards, which I'll show you how to do in a second. So hist = model.fit(train, epochs=4, validation_data=test): we pass through our training partition, specify that we want to train for four epochs (you could train longer or shorter depending on how accurate you want your model to be and whether it's starting to overfit), and set the test set as validation data. Run it and this should kick off training; we'll give it a second to make sure it starts successfully. That's working: you can see the model is now training, which is what this line down here represents, and ideally you should see the loss reduce and the recall and precision increase over time. Let's let that run for a little bit.
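Putting the whole model section together, roughly as walked through above:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten

model = Sequential()
model.add(Conv2D(16, (3, 3), activation='relu', input_shape=(1491, 257, 1)))
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(1, activation='sigmoid'))   # binary output: capuchinbird or not

model.compile('Adam', loss='BinaryCrossentropy',
              metrics=[tf.keras.metrics.Recall(),
                       tf.keras.metrics.Precision()])
model.summary()

hist = model.fit(train, epochs=4, validation_data=test)
```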
Alright, we've now successfully trained our audio classification model. After four epochs we've got a pretty good model, and I mean, I'm always skeptical whenever I get 100% on anything, but on our training partition we have 100% recall and 100% precision, and likewise on our validation partition, so it's looking pretty good in this particular case. Ideally you'd bring in more data and flesh this out, but we'll eventually see how it performs on the final task. Let's take a look at some plots. The loss went pretty much straight down and stayed down; precision trended upwards, and it doesn't look like we've got any overfitting; and likewise the recall is looking pretty good.

To produce these plots, I set a title with plt.title, and then, remember, this is the advantage of saving your history: if you go to hist.history you get back a whole bunch of values, your loss, recall and precision plus validation loss, validation recall and validation precision, one per epoch, which is pretty cool in the sense that you can grab all these values to plot them. We just run plt.plot with the specific metric we want: in the first plot I passed the loss through first and set it to red via the color code, plt.plot(hist.history['loss'], 'r'). Since hist.history is a dictionary (check type() and you'll see), we can grab the specific metric we want, and those values are what gets passed to the plot. Likewise for validation loss, pretty much the same thing except we grab val_loss and set it to blue. I've done the exact same thing for precision and validation precision, and for recall and validation recall. It's looking pretty good right now: after only four epochs we've got essentially 100% recall and 100% precision, without doing much in the way of data transformation or augmentation, though you could definitely add that if you're getting weaker performance.
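A sketch of those plots; note that the history keys ('recall', 'precision') can pick up suffixes like 'recall_1' depending on how many metric instances you've created in the session:

```python
# hist.history is a dict: loss, recall, precision plus val_ variants, per epoch
plt.title('Loss')
plt.plot(hist.history['loss'], 'r')        # training loss in red
plt.plot(hist.history['val_loss'], 'b')    # validation loss in blue
plt.show()

plt.title('Precision')
plt.plot(hist.history['precision'], 'r')
plt.plot(hist.history['val_precision'], 'b')
plt.show()

plt.title('Recall')
plt.plot(hist.history['recall'], 'r')
plt.plot(hist.history['val_recall'], 'b')
plt.show()
```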
Okay, now that that's done, the next thing we want to do is make a prediction. We grab one batch out of our test sample, writing test.as_numpy_iterator() and grabbing the next batch, which gives us our spectrograms and our labels. So X_test should be 16 spectrograms: check the shape, 16 by 1491 by 257 by 1, cool. And y_test should be 16 labels; cool, we've got 16 labels. Now we can pass X_test to the model: model.predict lets us run our audio clips through the deep learning model and make predictions. Run it, and boom, those are our predictions. Looking at yhat, these are our raw outputs, our confidence values; right now we're getting probabilities, so it looks like that's a capuchin clip, that's a capuchin clip, and so on. Rather than reading these manually, we can convert them to classes: we loop through each prediction in yhat, and if a particular prediction is above 0.5 we set it to one, otherwise zero. This is just a Pythonic way of writing it; the long form would be: create a blank list, then for each prediction in yhat, if the prediction is greater than or equal to 0.5, append 1 to the list, else append 0. That's all it's doing, but writing it as a comprehension is much more Pythonic and saves you writing all those lines. Run it and look at yhat: boom, those are our results. Pretty cool, right?

We can also compare against the actual results from y_test. Do we have numpy? We can use tf.math.reduce_sum: let's see how many capuchins were heard in that batch, and it looks like five. If we take tf.math.reduce_sum of y_test, also five capuchins. How good is that? It looks like we're predicting accurately, but let's look at the actual classes (we can convert with astype): the predicted zeros and ones line up with the true labels. So we are very accurately predicting where we're hearing capuchinbirds. Take another sample, make some more predictions, run through the classes again, and again they match, so we're accurately predicting what is a capuchin and what is not. That's already performing really, really well, and keep in mind that what we pass through in X_test are the spectrograms of the original capuchin clips from our training and testing partitions, so it's performing very well in this particular case.
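That prediction step, sketched:

```python
# One batch of 16 test spectrograms and their labels
X_test, y_test = test.as_numpy_iterator().next()

# Raw sigmoid outputs: probabilities between 0 and 1
yhat = model.predict(X_test)

# Threshold at 0.5 to get hard 0/1 classes
yhat = [1 if prediction > 0.5 else 0 for prediction in yhat]

# Compare call counts within the batch
print(tf.math.reduce_sum(yhat))    # predicted capuchin calls
print(tf.math.reduce_sum(y_test))  # actual capuchin calls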
You can see they're MP3s over there, so we're going to need a way to load them. What I've done is tweak the load_wav_16k_mono function to work with MP3s: rather than the tf.io-based WAV decoding approach, we're going to use tfio.audio.AudioIOTensor. This still takes our MP3 file and converts it into a tensor, but each of these files is multi-channel, so rather than just dropping one channel, we take the tensor, add the two channels together and divide by two. Taking the average of the two channels reduces the audio to a single-channel signal, and that's exactly what's happening there. Pretty much the rest of the function is the same as our initial 16 kHz loading method.

If we run this, we can then grab one recording: mp3 = os.path.join(...), going into the data folder, then into Forest Recordings, and grabbing the first recording, recording_00.mp3. That defines our file path, and we can load it with our new loading method.

Now we're going to do something a little trickier: convert this big file into a number of audio slices. Rather than taking one massive clip and trying to make a single prediction, we slice it up into the same-sized audio windows we'd normally pass to our model, which means we'll perform multiple predictions on a single audio clip. Running tf.keras.utils.timeseries_dataset_from_array does exactly that: it gives us the different slices of that longer clip. To this Keras method we pass the wav tensor twice, once as the input and once as the target (it's exactly the same kind of tensor we had for our short clips, just much longer). We then pass through how long we want each sequence to be, and remember we set that to 48,000 samples; the sequence stride is going to be 48,000 as well, so rather than overlapping sequences we take non-overlapping ones. By specifying the sequence stride equal to the sequence length, the windows don't overlap, and we'll use a batch size of one as well.

If we run that, we have our audio slices. samples.shape shows one, because the batch size is only one, but if we take the length of the audio slices, we've taken one full clip and converted it into 60 different windows. We're then going to take those windows, convert them to spectrograms, and then loop through them all making predictions.
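A rough sketch of the tweaked loader and the slicing step could look like this; the function name follows the walkthrough, while the exact folder names ('data', 'Forest Recordings') are assumptions based on the dataset layout described earlier:

```python
import os
import tensorflow as tf
import tensorflow_io as tfio

def load_mp3_16k_mono(filename):
    """Load an MP3, average the two channels, and resample to 16 kHz mono."""
    res = tfio.audio.AudioIOTensor(filename)
    tensor = res.to_tensor()
    # average the two channels down to a single channel
    tensor = tf.math.reduce_sum(tensor, axis=1) / 2
    sample_rate = tf.cast(res.rate, dtype=tf.int64)
    return tfio.audio.resample(tensor, rate_in=sample_rate, rate_out=16000)

mp3 = os.path.join('data', 'Forest Recordings', 'recording_00.mp3')
wav = load_mp3_16k_mono(mp3)

# slice the ~3-minute clip into non-overlapping 48,000-sample (3 s) windows
audio_slices = tf.keras.utils.timeseries_dataset_from_array(
    wav, wav, sequence_length=48000, sequence_stride=48000, batch_size=1)
```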
So let's convert them to spectrograms. It's pretty much the same function as before; we just grab the first element of each slice to squeeze out the extra dimension at the start, and everything else you see here is exactly the same as our initial spectrogram pre-processing method. If we run this, we can convert our audio slices into those spectrograms, and if we then make predictions, this is going to produce 60 different predictions. Typing y_hat, you can see we've got all of these predictions. Let's check the length... it should be 60, but it's 180. Why do we have 180? Are we looking at multiples? Ah, the window size is wrong: this should not be 16,000, it should be 48,000. Let's change that and make our predictions again... 60. Cool.

Now if we take a look at these predictions, though, you'll notice we've got consecutive positive predictions. Let's also take a look at our first example, recording_00.mp3, and actually play the file so we can hear it. Let's do a manual count... oh, my hand's getting tired holding it up... three... and remember, our model predicted five for this particular clip. Let's wait and listen all the way to the end, just to ensure there are no other calls we might have missed... nope. OK, so there weren't any other detections, and the count confirms the model accurately predicted five calls. The only thing you'll note when we look at those predictions is that consecutive windows of the same call are being counted as additional detections, so what we want to do is reduce this down so that consecutive positive windows are treated as a single call.

We can do that using the groupby method out of the itertools library. So let's take a look at how we might go about it: we import the groupby method, from itertools import groupby, then loop through the predictions in y_hat and reduce-sum the grouped keys. Let's take a look at the reduced values first. Hmm, if we look at y_hat now we have one, two, three, four... wait, hold on, are we looking at the right thing? Run model.predict again... one, two, three, four... oh wait, we've got a whole bunch of additional predictions after this. Hold on, I realize what's going wrong: we can increase the required confidence of our model. Rather than treating any prediction over 50 percent as a valid capuchin bird call, let's bump the threshold up to 0.99.
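A sketch of that slice pre-processing plus the prediction step might look like the following; the STFT settings (frame_length=320, frame_step=32) are assumed to match the earlier spectrogram function:

```python
def preprocess_mp3(sample, index):
    # each element from the slicer arrives with a leading batch dimension
    sample = sample[0]
    # zero-pad shorter windows out to the full 48,000 samples
    zero_padding = tf.zeros([48000] - tf.shape(sample), dtype=tf.float32)
    wav = tf.concat([zero_padding, sample], 0)
    spectrogram = tf.abs(tf.signal.stft(wav, frame_length=320, frame_step=32))
    return tf.expand_dims(spectrogram, axis=2)  # add a channel dimension

audio_slices = audio_slices.map(preprocess_mp3).batch(64)

yhat = model.predict(audio_slices)
# only accept very confident windows (score > 0.99) as capuchin calls
yhat = [1 if prediction > 0.99 else 0 for prediction in yhat]
```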
This means we're only going to take the most confident predictions as valid call detections. If we now run through y_hat, we've got one there, one there, and assuming we group those consecutive ones together, that's two, three, four and then five, so we've effectively detected them all. Rather than keeping the lower-confidence detections below 0.99, we only keep the most confident ones in this particular case. If we sum this up (we don't have numpy imported, so tf.math.reduce_sum rather than np.sum), we've got those five detections.

If we drop the threshold lower, y_hat takes in more detections, with a lot more noise bubbling up at the end, and now we're detecting 11, which we know is not right: we only heard five capuchin bird calls. So let's make it really strict and bump the threshold back up to 0.99. We can enable scrolling for this, and if we take a look at those final grouped results, we've got our five detections, so we're performing a lot better. That is effectively what we were aiming for: the number of calls per forest recording. (We can run this next cell as well; it's the exact same as what I've just written, and it tells us we've got five calls.)

Now what we need to do is loop through every single file in the Forest Recordings directory. We're going to do a similar thing to what we did when we calculated the clip lengths: for each recording in the directory, we load it through the load_mp3_16k_mono pre-processing method, convert it into audio slices with a sequence length of 48,000 and a sequence stride of 48,000, run our audio pre-processing to convert the slices into spectrograms, batch them up and make predictions. Afterwards, we'll convert the outputs into classes using our 0.99 detection threshold and group them. For now, let's run this over all of our recordings; it will take a little bit of time, but it's going to make predictions over every single recording inside the Forest Recordings folder, so we'll let it run and be back in a second... five minutes later... alrighty, those are our predictions made, and they're stored inside a dictionary called results. If we take a look at it, you can see that for every single recording we've got a set of predictions; the first, for example, is recording_00.mp3.
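Putting the groupby trick and the full directory loop together could look like this sketch, reusing the load_mp3_16k_mono and preprocess_mp3 helpers sketched above; groupby collapses each run of consecutive 1s to a single key, so summing the keys counts distinct calls rather than windows:

```python
from itertools import groupby

# count distinct calls in the current clip: each run of consecutive 1s
# contributes a single key of 1 to the sum
calls = tf.math.reduce_sum([key for key, group in groupby(yhat)]).numpy()

# the same pipeline applied to every file in the directory
results = {}
for file in os.listdir(os.path.join('data', 'Forest Recordings')):
    filepath = os.path.join('data', 'Forest Recordings', file)

    wav = load_mp3_16k_mono(filepath)
    audio_slices = tf.keras.utils.timeseries_dataset_from_array(
        wav, wav, sequence_length=48000, sequence_stride=48000, batch_size=1)
    audio_slices = audio_slices.map(preprocess_mp3).batch(64)

    results[file] = model.predict(audio_slices)  # one raw score per window
```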
For recording_00.mp3 we've got a prediction for every slice of that audio file. You can see one really strong prediction there, plus a whole bunch of others; if we keep scrolling through, there are some pretty weak predictions, and it doesn't look like there are too many strong ones. Interpreting it like this is a bit of a pain, though, so rather than doing that we can convert all of these predictions into classes. It's exactly the same code as we had before, using our threshold of 0.99, because we want to be very confident in the results: we loop through every set of scores inside our results dictionary and convert them into class predictions. So now, rather than the individual confidence scores, we've got the class values.

Rather than leaving it there, we can again group consecutive detections, and in this particular case, for recording_00.mp3, we can see we've got five capuchin bird calls; maybe that's an indicator of, I don't know, medium density. Let's disable scrolling and skim through: recording 05... recording 08 has 19 calls heard. Let's actually go and listen to that one. If we go into the project's data folder and check which recordings had a lot of detections, 19 is obviously very high density, so let's play recording 08: we've got one already, two, three... this is obviously a very high density capuchin bird area... seven... hey, you get the drill. The model has detected that there are a lot of capuchin birds in that particular area, so in terms of what our client might be interested in, that area is obviously very dense. What other areas are dense? Remember, we've just processed a hundred files; it looks like location 87 has quite a number of calls, and location 98 likewise, so those are the leading areas the client might want to look into.

Now let's jump back over, discuss these results with our client, and then we might have one last task. "Hey Damon, what's your email?" "nftguruforlife@hotmail.com." "Uh... really?" "Yeah." "Alrighty then. Well, I've got those metrics; I'll compile them for each recording and shoot them over to you as a CSV, and you should be good to go."

All right, we promised we'd export these results to a CSV. To do that, we import the csv library by running import csv, and then we can export the results using that library. We've written with open('results.csv', ...), where results.csv is the name of our file; you could just as easily call it capuchinbird_results.csv, and if you're doing this for another use case you could name it whatever you want. We set the mode to write (that's part of the open method), set the newline argument so each row starts on a new line, and bind the file to the variable f so we can work with it. We then define our CSV writer, writer = csv.writer(f, delimiter=','), passing through our file and specifying that every value is delimited by a comma. Finally, we use writer.writerow twice: once to write out the header row, recording and capuchin_calls, and then inside a loop over the post-processed dictionary we just created, written as for key, value in postprocessed.items(), to write out each key and value.
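The post-processing and export described here boil down to two dictionary passes plus the csv writer; here's a minimal sketch, with the header names assumed:

```python
import csv

# 1. threshold every recording's raw scores into 0/1 classes
class_preds = {
    file: [1 if prediction > 0.99 else 0 for prediction in scores]
    for file, scores in results.items()
}

# 2. collapse consecutive 1s so each call is only counted once
postprocessed = {
    file: tf.math.reduce_sum([key for key, group in groupby(classes)]).numpy()
    for file, classes in class_preds.items()
}

# 3. write one row per recording: filename, number of distinct calls
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerow(['recording', 'capuchin_calls'])  # header row
    for key, value in postprocessed.items():
        writer.writerow([key, value])
```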
If we run this, we should now have a file called results.csv, and you can see it there. That is our final set of results: we've got each recording's file name along with the number of calls. In this particular case recording 08 had a ton of calls; what's another one... it looks like recording 39 had 14, which is quite a fair few, recording 61 had 14 again, recording 87 had 24 and recording 98 had 23, so there are a lot of calls concentrated in specific locations.

On that note, that about wraps it up: we've successfully determined how many capuchin bird calls there are in specific areas. Again, all of this code is available via the GitHub repository mentioned in the description below, and once you've calculated your results, be sure to jump over to the Unlocked with Z webpage and upload them; submitting your answer there enters you into the competition. Thanks again for tuning in, guys. Peace.

Thanks so much for tuning in, guys; hopefully you enjoyed this video. If you did, be sure to give it a big thumbs up, hit subscribe and tick that bell, and let me know what you thought in the comments. How'd you go? Did you improve on the challenge? Did you beat it? How did you go with your submission? Did you find anything interesting along the way? Let me know in the comments below. Thanks again for tuning in. Peace.
Info
Channel: Nicholas Renotte
Views: 158,223
Keywords: machine learning, python, audio classification machine learning, audio classification
Id: ZLIPkmmDJAc
Length: 77min 10sec (4630 seconds)
Published: Sat Apr 16 2022