CNN Architectures for Large-Scale Audio Classification | AISC

Captions
I want to share my experience working with sound data. It comes with two common difficulties. One is the lack of data, or rather the lack of good labels. The other is that sounds are ambiguous: lots of different things produce ticking and clicking noises, and a roaring lion can be confused with a revving engine, whereas visually similar images form a much smaller subset of the general problem — everybody can tell that the left-hand image is an engine and the other is a lion, but how about the sounds? Here are six YouTube audio clips; let's go over all of them and try to figure out what each one is and how many labels we should assign to it. [clips play] The first one I couldn't label at all, but the YouTube dataset assigns it the label "ambulance"; it sounds like something you'd hear in a hospital, so I could just as well assign it "hospital". Another one is clearly a siren, yet the label assigned to that clip is something else entirely. And one that everybody hears as a cat is actually synthesized — it is not a real cat at all. These samples come from the datasets we'll discuss in a moment; the main point is that the labels are super weak, which we'll come back to later.

So the lack of data is a fundamental problem: we can't build anything from scratch without investing a huge amount of money and time. The most effective way to deal with this is transfer learning — find some big player with a lot of data who has trained a model and is kind enough to give it to people. Luckily for us there is a company called Google, and Google has a huge library of sound: YouTube. Google's internal dataset is called YouTube-100M. It consists of about 100 million videos, with a 70/20/10 train/validation/test split; the average length of each video is 4.6 minutes, adding up to roughly 5.4 million hours in total. This dataset later gave birth to two other datasets, YouTube-8M and AudioSet, both of which are much smaller but much cleaner subsets of YouTube-100M. [Audience question: are the labels just tags the creators put on YouTube?] They have several ways of getting labels, based on different sources; we'll talk about that in a moment.

This brings us to the paper, in which the authors use YouTube-100M to investigate a few key points: how performance is affected by different model architectures; how training is affected by different training set sizes; how training is affected by different label set sizes; and finally how useful the resulting embedding model is for recognizing AudioSet, a well-labeled subset of YouTube-100M with around 500 classes. What I was looking for from this paper was the answer to one question: even if I scrape audio from publicly available datasets, is it feasible to train a model from scratch that performs as well as Google's VGGish embedding model, so that I could use it as a general-purpose feature extractor?
The idea is an ImageNet for sound: train an embedding model on weak labels, the way ImageNet pre-training works for images. The labels don't need to be super accurate as long as they distinguish what is different and recognize what is fundamentally similar. Since the data comes from YouTube, there are many ways to infer labels from information that is already available. First, these are videos, so an image recognition model can be used to extract content from the frames; the issue is that the visual content is often not related to the audio — a music video is the typical example, where the imagery tells you nothing about who is singing or which instrument is playing. Another source of labels is the metadata from the user and the video web page: the title, the topic, the tags, the channel. Finally, the comment section can be mined with an NLP model to find out whether it says anything about the audio.

Once we have labels, it is relatively easy to train an embedding model or one of the well-known neural network architectures such as VGG, ResNet and Inception. But before that, the raw waveform needs to be transformed into images so that those models can use 2-D convolutions. The usual trick is to use the short-time Fourier transform to preprocess the waveform into a mel spectrogram. Each video is divided into one-second-long frames without overlap, and each frame inherits the labels of its parent audio. The window of the short-time Fourier transform is about 25 milliseconds long with a stride of 10 milliseconds, and the resulting image is 96 by 64: 96 is the time resolution and 64 is the frequency resolution. Training is always done on mini-batches of size 128.

Here is some jargon worth defining: the Fourier transform, the fast Fourier transform and the short-time Fourier transform, as well as the spectrogram and the mel scale. The point I want you to take away is that the spectrogram of an audio clip is essentially a histogram of sound intensity — the volume at each frequency at each point in time. Once you understand that fundamental idea, everything else about spectrograms follows.
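As a rough illustration of that front end (not the authors' exact code), here is a minimal sketch using librosa. The 25 ms window, 10 ms hop, 64 mel bands and 96-frame patches follow the numbers quoted above; the 16 kHz sample rate is my own assumption for concreteness.

```python
import numpy as np
import librosa

def log_mel_patches(wav_path, sr=16000, n_mels=64, frames_per_patch=96):
    """Turn a waveform into non-overlapping 96x64 log-mel 'images'."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    # 25 ms analysis window, 10 ms hop, as described above.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        n_mels=n_mels)
    log_mel = np.log(mel + 1e-6).T            # shape: (time_steps, n_mels)
    # Split into non-overlapping ~1 s patches of 96 time steps each.
    n_patches = log_mel.shape[0] // frames_per_patch
    patches = log_mel[: n_patches * frames_per_patch].reshape(
        n_patches, frames_per_patch, n_mels)
    return patches                            # (n_patches, 96, 64)
```

Each resulting patch would then inherit the labels of its parent video, exactly as described above.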
The general training setup is laid out on this slide, and there is nothing extremely special about it: cross-entropy loss for each label, batch norm after each convolutional layer, and no dropout or weight decay regularization, because the authors believe the huge dataset is enough to prevent overfitting. There is a balanced dataset for evaluation, and three metrics — AUC, d-prime and mAP — for monitoring model performance; we will revisit each of them in later slides. [Audience question about the balanced set:] whether it is YouTube-100M, YouTube-8M or AudioSet, the raw data is extremely unbalanced: classes like music and speech have around a million data points, while some of the rarer classes have only about a hundred. So they provide a balanced set, purely for evaluation purposes, with a roughly equal number of samples per class.

As a baseline model they use a fully connected network with three hidden layers of 1,000 units each. For AlexNet, because the input size is 96 by 64 rather than a full image, they modify the initial 11-by-11 convolution to use a stride of 2 by 1 so the output stays similar in size; this results in roughly a 40% reduction in the size of the model. VGG marks the standardization of convolutional neural networks after AlexNet: the networks got deeper, though not beyond about 20 layers, and the common variants are VGG-16 and VGG-19. VGG standardized the convolutional kernel sizes and max pooling, and made it standard to double the channels after each stage. The original VGG paper used local response normalization to normalize local batches; in this paper it is replaced with the more recent batch norm, along with other minor changes such as resizing the final layer. One important detail: the VGG model mentioned in this paper is not the VGGish embedding model that Google released to the public. The released VGGish model is much smaller — only 11 layers and no batch norm — whereas the model in this paper has 19 layers with batch norm.

Then we have Inception, also known as GoogLeNet: 22 layers in total, the winning model of the 2014 ImageNet challenge. Inception modules are its fundamental building block, and the key idea of the inception module is to design a good local network topology — a network within a network. For this paper they remove a few blocks and adjust sizes, so the modified model is similar in size to the original; being an Inception model, it of course takes the most computing resources to train.

Then we have ResNet-50. The previous models are deep networks built by simply stacking layers, and it was believed that deeper networks should perform better, but that turned out not to be the case. ResNet introduced the concept of the skip connection, where shortcuts are created connecting layer i to layer i+2 to form an identity block; this addresses the problems that networks become difficult to optimize as they get larger, including vanishing gradients as well as the degradation problem. [Audience discussion: do you remember what the dotted lines in the ResNet figure were supposed to mean? Some skip connections are dotted and some are direct.] The dotted lines appear when the layer sizes change from one stage to the next: to go to the next block you have to apply a linear transformation — a matrix multiplication — to make the sizes match. Three options are discussed for these shortcuts: (A) zero padding, (B) using a projection matrix only where the dimensions change, and (C) using projection matrices for all of the connections.
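To make the dotted-line discussion concrete, here is a minimal PyTorch sketch of a residual block (my own illustration, not the paper's code): when input and output shapes match, the shortcut is the identity; when they don't, option B swaps in a learned 1×1 convolution as the projection.

```python
import torch
import torch.nn as nn

class BasicResBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            # "Dotted line": a learned 1x1 projection, used only when shapes change (option B).
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))
        else:
            # Solid line: plain identity skip connection.
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.shortcut(x))
```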
In the end they went with option B, where a projection is learned only where the dimensions change; it adds a few parameters to learn, but not many compared with everything else in a ResNet with on the order of a hundred layers. For this paper the authors use ResNet-50 as their architecture, where only the first convolutional layer is slightly adjusted for the 96-by-64 input; the modified version keeps the overall structure, but the number of model parameters is reduced by about half.

They then ran experiments with the various model architectures, as well as varying training set sizes and label set sizes, and here are the results. The most important finding, to me, is that training set size matters a lot. I tried to pull the numbers into a comparison table so the effect is easier to see. On the right-hand side of the plot you can see that the extra data really helps the ResNet model perform better. The dotted line shows how much data Google released in AudioSet — so even if we scraped all two million audio clips from AudioSet to train an embedding model, it would take a lot of effort, especially in data augmentation, to match Google's embedding trained on YouTube-100M. One comment in the paper is that the poor performance with the smaller datasets is likely due to overfitting, so there could be some room for improvement on the neural network architecture side. ResNet and Inception did a good job, but it also seems that if you let a model train a little longer it continues to learn, and the resulting performance boost is comparable to, or higher than, the improvement from switching architectures; it would be interesting to see how much better VGG could get if it were trained for a long time. As for the effect of label set size, there is only weak support for using the larger set of labels, very likely because the labels are weak and fuzzy and many of them may not be relevant to their videos.

Based on these observations, the answer to my earlier question — assuming I scrape labeled audio clips from AudioSet, is that enough to train a good model from scratch? — is no. Transfer learning is indeed my only choice for a one-person, three-week project.

I also want to be clear about what Google actually released to the public, since it is quite confusing how this paper relates to the AudioSet data. AudioSet consists of two million relatively cleanly labeled ten-second audio clips. The embeddings are available for download, at 128 features per second. Google also provides the video ID and labels for each clip, so it is possible to scrape the raw audio from YouTube, although in my experience a not-so-small portion of those videos are no longer available. The labels are organized into categories in a released ontology, intended not only for the AudioSet data but for any other sound recognition dataset.
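Since the released video IDs, segment times and labels are distributed as CSV segment lists, here is a small parsing sketch. It assumes the layout `YTID, start_seconds, end_seconds, positive_labels` with `#`-prefixed header lines; check the AudioSet download page for the exact file names and fields.

```python
import csv

def load_audioset_segments(csv_path):
    """Parse an AudioSet-style segments CSV into (video_id, start, end, [label ids])."""
    rows = []
    with open(csv_path) as f:
        for line in f:
            if line.startswith("#"):          # header / comment lines
                continue
            ytid, start, end, labels = next(
                csv.reader([line], skipinitialspace=True))
            rows.append((ytid, float(start), float(end), labels.split(",")))
    return rows
```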
As for the embedding model, a VGGish model is currently provided — roughly a VGG-11 architecture. It is not the VGG model trained in this paper but a watered-down version, and no reason was given for why Google did not release their ResNet-50 model or why they watered down VGG. My guess is inference speed: based on my experience it does not make much difference on a GPU, but on a CPU the VGGish model is already pretty slow for real-time inference — it is actually the bottleneck in my app. Anyway, that is my summary of the paper.

[Audience question: in the last layer you replace softmax with a sigmoid, but doesn't that still essentially give you one output — how do you go from that to, say, five labels?] For multi-label classification we do not use softmax; we use a sigmoid for each class, and we set a threshold on the predicted probability: if a class's probability clears the threshold we say the clip carries that label, otherwise we do not assign it. They also mention averaging: say your clip is 60 seconds long and you split it into short windows; you get one prediction per window, and afterwards you average the class probabilities across all the windows — that way a clip can end up with more than one class.
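A minimal sketch of that multi-label scheme (my own illustration): per-window sigmoid scores are averaged over the clip and then thresholded independently per class, so a clip can end up with zero, one or several labels.

```python
import numpy as np

def clip_labels(window_probs, class_names, threshold=0.5):
    """window_probs: (n_windows, n_classes) sigmoid outputs for one clip."""
    clip_probs = window_probs.mean(axis=0)            # average over windows
    picked = np.where(clip_probs >= threshold)[0]     # independent per-class decision
    return [(class_names[i], float(clip_probs[i])) for i in picked]

# Example: three one-second windows, four classes.
probs = np.array([[0.9, 0.2, 0.7, 0.1],
                  [0.8, 0.1, 0.6, 0.2],
                  [0.7, 0.3, 0.8, 0.1]])
print(clip_labels(probs, ["speech", "dog", "music", "siren"]))
```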
[Another attendee recaps the paper:] the idea was to take networks that work well on images and see whether they apply to audio. They took 70 million YouTube videos and the roughly thirty thousand tags that came with them, computed spectrograms over rolling windows, and gave every sliding-window spectrogram all of the parent video's tags — so if a five-second YouTube video has five tags, each individual window carries all five. They found that once the network was trained over the 70 million videos, running it over each individual segment of audio transferred well to audio event detection: if there is a whistle somewhere in five seconds of audio, the whistle tag gets its highest activation on the very second the whistle is blowing. Their code is available, and my code is also available.

For the second part I am going to talk about my application, Caption for Sound, which is an end-to-end solution for closed captioning of sounds in YouTube videos. The motivation is quite simple: there are approximately 3.6 million Canadians with some degree of hearing loss, according to the Canadian Association of the Deaf (2015). On the other hand, YouTube has about 1.3 billion videos that enrich our daily lives. YouTube has done a tremendous job of providing automatic captions for speech, but it has never added the ability to describe other types of sound — and I think understanding the other sounds also matters if you want to fully enjoy a video. That is where Caption for Sound comes into play.

This is the pipeline. Caption for Sound is realized in two parts. The first part is a Chrome extension, which displays the caption directly on top of the currently playing video; I made that decision because I wanted a minimally intrusive way to interact with users. The second part is a back-end API living on my AWS instance, waiting for requests from the front end and serving my deep learning model. When a user opens a YouTube page and clicks the extension icon, the extension sends the URL to the back-end API; the back end fetches the corresponding waveform, feeds it to the model, and returns the predictions. For the demo, I recorded about one minute from each of a few YouTube videos — movie trailers and children's movies that I like. Note that the caption appears in the top-left corner and the font is a little small, because I wanted to keep the same font size Google uses for its auto-generated captions. Let's take a look. [demo plays]

Installation instructions are on my GitHub, so you can download it; there is a bit of setup for the Chrome extension. I have currently turned my AWS instance off, because it keeps burning through my AWS credits. Also, if you want to build your own Chrome extension for closed captioning, feel free to reuse the relevant parts of my code.

In terms of model building, the idea is to convert the sound event recognition problem into an image recognition problem. Using spectrograms for this is widely used and well explored; again, a spectrogram is essentially a histogram of sound intensity — think of the dimension pointing out of the screen as the volume at each frequency at each time. The basic idea is to preprocess the waveform into a spectrogram with the short-time Fourier transform, feed the resulting image into the model, and make a prediction. As discussed with the paper, YouTube-100M is weakly labeled but very rich; it is Google-internal and we cannot get it. Google publicly released two datasets derived from it: YouTube-8M, which you could train on if you were willing to spend the time and the computing budget, and AudioSet, which is the one I used to build this app. A few features of AudioSet matter here. The audio is only provided in an embedding format produced by VGGish — remember, VGGish is essentially VGG-11 without batch norm. It consists of two million ten-second sound clips, which means the input dimension of my data has to be ten seconds long, with 128 embedding features per second, and there are 500-plus classes.
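As a sketch of what fine-tuning on those released features can look like (my own illustration, not the speaker's exact model): each clip arrives as a 10×128 embedding matrix, and a small head with one sigmoid per class is trained on top. I use 527 classes here, AudioSet's published label count, as a stand-in for the "500-plus" quoted above.

```python
import torch
import torch.nn as nn

class EmbeddingClassifier(nn.Module):
    """Small head over AudioSet-style clip embeddings of shape (10, 128)."""
    def __init__(self, n_classes=527, emb_dim=128, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                  # (batch, 10, 128) -> (batch, 1280)
            nn.Linear(10 * emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes))  # raw logits, one per class

    def forward(self, x):
        return self.net(x)

model = EmbeddingClassifier()
loss_fn = nn.BCEWithLogitsLoss()           # independent sigmoid per class
x = torch.randn(8, 10, 128)                # a batch of 8 clip embeddings
y = torch.zeros(8, 527)
y[:, 0] = 1.0                              # toy multi-hot targets
loss = loss_fn(model(x), y)
```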
This dataset comes with a couple of issues. One is low label quality: sometimes a clip is simply given the wrong label. Another is label resolution: the labels are coarse. Within a ten-second clip we do not know where the sound event actually occurs — even if there is, say, a lion roaring only during the first second of the ten, we have no information about its location within the clip. If you are working on the front end and trying to make accurate closed captions, that is a big problem for prediction.

The next problem is that the dataset is extremely imbalanced. Just look at this plot: the music and speech classes have over a million data points each, while a lot of the other classes have only about a hundred. It is quite hard to train on that directly — a naively trained model will predict everything as music or speech. So I had to fight hard against the imbalance when training my final layers, especially the fully connected ones, with things like balanced mini-batches, attention pooling and focal loss. Balanced mini-batches simply means loading the data so that each mini-batch has a balanced mix, trying to include at least one sample from each class in every mini-batch. What is attention pooling? If the diagram doesn't make sense to you, think of the attention weights as a kind of memory gate applied to the features: the model can choose what to emphasize, what to forget and what to partially remember, in order to make a more accurate clip-level prediction. This one actually helped me a lot. I also tried focal loss. Do you know what focal loss is? It shifts effort from the easy examples to the hard examples. What counts as easy versus hard? Imagine you are doing image detection and there is a tiny car located in one corner while everything else is background: when your kernel slides over the image, most of what it touches is background — the easy examples — and only a small portion of the positions touch the car — the hard example. Because there is so much easy background, the model tends to classify everything as background. Focal loss adds a (1 − p)^γ factor that scales down the gradient coming from the easy class — the background — and relatively raises the gradient on the hard class — the car. I found that it helps balance easy and difficult classes when there are essentially two of them, as in foreground-versus-background image recognition.
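Here is a minimal sketch of a multi-label (sigmoid) focal loss along the lines described above — my own illustration of Lin et al.'s formulation, with the optional α class-balancing term omitted for brevity.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-label focal loss: down-weights easy examples by (1 - p_t)^gamma."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)   # probability assigned to the true outcome
    return ((1 - p_t) ** gamma * bce).mean()
```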
In my problem, though, I have 500-plus classes, and there are too many hard examples for focal loss to deal with, so it did not help me at all; it is much more useful in the two-class, image-processing setting. This is my model performance: I think I found a small incremental improvement compared with Google's baseline model from about two years ago. I will quickly go through the metrics. I think everybody knows recall, precision and the precision-recall curve, as well as ROC AUC, so I will just confirm the less common ones. d-prime is simply derived from the AUC through the inverse of the normal distribution, so the two behave quite similarly. If you are dealing with multi-label classification, mean average precision (mAP) is a much more useful number to monitor. Average precision is the area under the precision-recall curve — precision accumulated as the recall threshold increases — and mAP is the mean of the average precisions across all the classes. The paper also mentions that mAP can be a better measure than ROC AUC for imbalanced datasets.
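To make those definitions concrete, here is a small sketch using scikit-learn and SciPy (my own illustration): per-class average precision, mAP as their mean, and d-prime recovered from the mean ROC AUC via the inverse normal CDF, d' = √2 · Φ⁻¹(AUC).

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate(y_true, y_score):
    """y_true: (n_samples, n_classes) multi-hot labels; y_score: predicted scores.
    Assumes every class has at least one positive and one negative example."""
    ap_per_class = [average_precision_score(y_true[:, c], y_score[:, c])
                    for c in range(y_true.shape[1])]
    auc_per_class = [roc_auc_score(y_true[:, c], y_score[:, c])
                     for c in range(y_true.shape[1])]
    mAP = float(np.mean(ap_per_class))
    auc = float(np.mean(auc_per_class))
    d_prime = float(np.sqrt(2) * norm.ppf(auc))   # d' = sqrt(2) * Phi^-1(AUC)
    return mAP, auc, d_prime
```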
A question I keep asking myself: currently I use the short-time Fourier transform to preprocess the audio waveform into a spectrogram and then feed it to my model to make predictions. What if I didn't want to do that — is there another strategy that deals with the raw waveform directly? I think I could use something like WaveNet to train on the raw audio of AudioSet, because, as I mentioned earlier, the dataset provides the video IDs and labels, so I can scrape the raw audio directly from YouTube and try to make predictions on it. I am still working on that, hoping it can speed things up or reduce the model size, but I don't have results to show yet.

Here are some key takeaways from today's presentation: image recognition CNNs are capable of excellent results on audio classification; training on larger and more diverse datasets can improve performance; training with larger label vocabularies can improve performance, though remember that simply increasing your label set does not automatically help; longer training time lets some models continue to improve; and the embedding model is capable of excellent results for audio event detection on the AudioSet data. And here are some points I wanted to discuss in this session.

[Audience question: how do you fetch the audio from the YouTube video to feed into your model from the Chrome extension? Twenty GPUs and a parameter server sounds like a lot of AWS credit.] I was using the AudioSet embedding format to fine-tune my model and make predictions; I run it on an AWS GPU instance — roughly an x-large class machine — so inference is quite fast. I have not trained the full embedding model myself; fine-tuning on the embedding features is fast. That is also where my first discussion point comes from: why didn't Google release their full VGG, the 16- or 19-layer one?

[Audience suggestion:] here is another way of processing the data. For the spectrogram images, instead of processing them in an arbitrary order, what if we started from the loudest and worked our way down to the quietest? The more important classifications would be at the top, and as we went down the labels would matter less, so it could be a quicker way to get a working model than going through linearly. [Response:] right now, if five labels are assigned to a video and the video is split into roughly five-second pieces, every piece shares the labels of its parent audio; there are no piece-specific labels. [Audience:] but the louder audio, even though it has the same labels as the parent, would arguably be a better summary of the parent — so instead of taking the bucket of videos from top to bottom, we turn each second into an image, rank the images by their peaks, and start from the loudest, with their associated labels. Or maybe just cut the silence. I would assume the quieter the audio, the less value it has for the labels; in the demo the majority of it was music, and the quieter things — crackling, pages rustling — would be dampened. [Another attendee:] the order shouldn't matter, because they average over all of the frames; the sliding windows are not connected in a temporal fashion, so in theory the outcome should be the same no matter how you arrange them. [Response:] yes — and one of our limitations is that the embeddings are provided in a fixed format with Google's own sampling setup; the sampling frequency they chose is the one that makes sense for human ears, and if you scale it down you lose resolution.

[Question: are you doing real-time prediction?] No, not real time. For a five-minute video, because I send the URL to the back end, you should expect it to take some time — maybe one or two minutes — before the captions come back; that is how I get the audio, and I am working on improving it. [Follow-up: how long — ten seconds for five minutes? And how often do you sample — how often do the labels change as the video plays, every second?]
Okay — that part is less important; let me think. Actually, I am also looking for help from you on this. As I said, I got the Chrome extension to display the closed captions in time, but there is still a limitation: our data points are ten seconds long and there is no finer-grained labeling within each clip, so I have a problem with timing. That is one of my discussion points, and I am still looking for ideas. [Audience suggestion: for each second of a particular audio clip, you could also take a screenshot of the video at the same moment, run it through an image model, and check whether it produces the same labels.] The thing is, because our data samples — and therefore the model input, which is the embedding format — are ten seconds long, when I want to make a prediction in something close to real time I have to deal with that somehow. My current idea is to repeat a short chunk to fill the ten-second window. I tried repeating one second, but it sounds like a ticking artifact and the predictions stop making sense for my application; repeating four seconds across the ten-second window works pretty well — as you could see during the demo, it updates fast enough and gives good-enough predictions. But I put this here as a discussion point because I am looking for ideas on how to make it more accurate. [Question: for a ten-second window, could you produce an output every second and then average at the end of the ten seconds?] The problem is this: suppose there is speech only in the ninth second of the clip. The extension starts displaying captions when the clip starts playing, so the first second would already show the caption for speech even though the speech is located near the end. [Question: what you are describing is really the window size — you could move a smaller window along.] I tried a window of about two seconds, but I still have a problem: when I move the window, even if the current two seconds are silent, the caption for speech found elsewhere in the ten seconds may still be displayed. So that is where my question comes from; I don't know a good way to solve it, and I am looking for help. For now, what I do is repeat four seconds over the ten-second input and use a window size of a couple of seconds.
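A minimal sketch of that workaround (my own illustration): take the most recent few seconds of audio and tile it until it fills the ten-second window the embedding model expects.

```python
import numpy as np

def fill_window(chunk, sr, window_sec=10.0):
    """Tile a short audio chunk until it covers the model's 10-second input window."""
    target_len = int(window_sec * sr)
    reps = int(np.ceil(target_len / len(chunk)))
    return np.tile(chunk, reps)[:target_len]

# e.g. repeat the last 4 seconds of 16 kHz audio to fill a 10 s window
sr = 16000
last_4s = np.random.randn(4 * sr)       # stand-in for real audio samples
model_input = fill_window(last_4s, sr)  # length = 10 * sr
```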
[Audience suggestion:] you could also use a longer window but with fewer real samples — for example take the first two seconds and upsample them to ten — though that would obviously be noisier, because at that point your effective resolution is only two seconds; unless, say, seconds nine and ten happen to contain the speech, you would miss it. It can help in some cases, but in other cases it does not. [Response:] the underlying problem is that this is multi-label classification over five hundred classes, which makes it a challenging project. If I were dealing with the full YouTube-scale data, with thousands of samples for every class, it would be different, but I don't have that; and since I am building a Chrome extension, at inference time I always have to deal with real-world YouTube videos. [Question: could you just use it for inference — train your model as it is, and do the prediction right there?] I was doing that before, and the predictions get completely noisy. [Question: I am not really sure what you mean by upsampling.] [Audience explanation:] for example, you have two points in a 2-D space and you draw a line between them: all the dots along that line look as if they are part of the distribution, but they are not — you are adding points to the dataset. It is like linear regression: you have a couple of points and you add more along the fitted line, assuming they are connected. The same thing applies in your spectrogram space: if you have a short voice clip, you can stretch it out — similar to playing a YouTube video at a lower speed. The samples are disconnected, but you interpolate between them. [So it is basically interpolation?] Yes, exactly — and the new points presumably come from the same distribution. There are a lot of ways to do it; I am not an expert in audio upsampling. [Question: but wouldn't that change the frequency content?] That is exactly the issue: if you change things only at inference time, you create a mismatch that you then have to deal with — unless you train your own embedding model and keep everything identical between the training session and the prediction session. That is why, for the modeling part, what I do is use the released embedding format as-is and only add and fine-tune the final layers — some convolutional layers and the fully connected ones — on top. During prediction I have to keep the same input size and the same preprocessing: the embedding model is whatever it is once it has been trained, and I can't change it. [Question: but when you construct the STFT spectrogram, there are parameters you can control, like the hop size; you could also interpolate or run a kind of regression on the raw audio file itself, get more data points, and then extract the spectrogram from that.]
[Question: so are you saying that if you have an audio sample, you could upsample it — say to a 12 kHz sample — and then you would potentially have more samples within the same window?] [Discussion:] what are we talking about specifically? High-quality studio audio is recorded at around 44,100 samples per second, whereas audio online might be 8 kHz, or 64 kHz. If you have 8 kHz audio you can increase it to 64 kHz by interpolation, but it is not genuinely higher-quality information, because you don't have the actual measured values. It is like having a 30-frame-per-second camera and interpolating up to 300 frames per second: you can produce the frames, but they will never be as accurate as a true 300 fps recording, because the extra data points were never captured — it would be somewhat fake, lower-quality media. [Question: what is the sampling frequency for your model — how many data points per second?] This model uses about 11,000 samples per second. And because they are dealing with YouTube data, the quality — and therefore the sampling frequency — of the source videos is not uniform, so whatever they receive they convert to that one sampling frequency: some clips get upsampled and some get downsampled. [Audience:] exactly — so in the same way, if you have a two-second clip, you could stretch it into a ten-second clip, feed it to the model as if it were ten seconds of audio, and then slide the window. [Response:] but the embeddings come from Google's released model, which I can't change, so at inference time I have to use the same setup it was trained with. [Question: isn't upsampling like that just slowing the recording down? It only changes the time axis, not the physics of the sound.] [Audience:] I'm just saying, to get to your point: the challenge was that the model wants ten-second inputs — that is why you would stretch two seconds into ten. The input at prediction time has to look like what the model saw during training.
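For reference, this is roughly what that kind of audio upsampling looks like in code (a sketch with made-up rates); resampling by interpolation adds samples but no new information, which is exactly the concern raised above.

```python
import numpy as np
import librosa

sr_in, sr_out = 8000, 16000
y = np.random.randn(2 * sr_in)                       # 2 s of 8 kHz audio (stand-in)
y_up = librosa.resample(y, orig_sr=sr_in, target_sr=sr_out)
# y_up has twice as many samples, but nothing beyond interpolation of the originals.
```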
Yes — the model was trained on ten-second inputs, so any new clip from YouTube, which the model has never seen before, also has to be made ten seconds long, with the same 10-by-128 embedding dimensions. That is the limitation; otherwise you would have to train your own embedding model, which you could then modify however you like, or feed your own spectrograms into your own model. I don't think my workaround is perfect; currently I just repeat about four seconds and make the prediction from that. Anyway, I appreciate the suggestions.

[Question: this is less of a technical question, but the only other thing I can think of that does something similar is closed captioning that puts annotations in brackets or asterisks, like "music playing". Does that data exist in a form you could scrape, or doesn't it have the resolution?] I am not sure it has second-by-second resolution; the window is probably a few seconds long, so I am not sure it would make a reasonable target. [Follow-up: is there anything else publicly available that does something similar? OpenSubtitles has some sound annotations, but YouTube has a much larger set — things like "fire alarm going off" — whereas OpenSubtitles, probably the next biggest source, is maybe 400 hours of movies versus billions of hours on YouTube, and with far fewer labels.]

[Question: as a thought experiment, have you seen this applied to other kinds of time series besides waveforms — for example financial data?] I have seen people use spectrograms, or frequency-domain features of some kind, for time series prediction, but I have not seen any particularly strong published results. Finance is a bit different for two reasons. One is that normally you don't want to label a chunk of the series; you want to predict its continuation, so it is continuous forecasting rather than classification. But there are trading algorithms where people want to detect specific patterns, and those patterns usually mean something, so for those, yes, you could use this approach — though the detection I have seen applied there is usually done with deterministic rules about price movements. [Speaker:] I think you can apply the same technology. Suppose that within one millisecond there are up to a couple of thousand order-book movements, and in some milliseconds there is no movement at all. You can bucket time into equal intervals and compute a few features per bucket — say, the counts of upward, downward and neutral movements, giving three values per interval — and treat those three values as three channels, so the series starts to look like an image.
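A rough sketch of that idea (my own reading of the description, with hypothetical bucket sizes): count upward, downward and neutral movements per fixed time bucket and stack the three counts as channels, so a CNN can treat the result like an image.

```python
import numpy as np

def orderflow_image(price_moves, bucket=1000, width=96):
    """price_moves: 1-D array of signed tick moves (+1 up, -1 down, 0 none),
    one entry per event; assumes at least width*bucket events.
    Returns a (3, width) 'image' of per-bucket counts."""
    n = width * bucket
    moves = price_moves[:n].reshape(width, bucket)
    up      = (moves > 0).sum(axis=1)
    down    = (moves < 0).sum(axis=1)
    neutral = (moves == 0).sum(axis=1)
    return np.stack([up, down, neutral]).astype(np.float32)  # 3 channels x width

img = orderflow_image(np.random.choice([-1, 0, 1], size=200_000))
```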
Then you feed that into your 2-D model — you transform the series into a spectrogram-like image and train an image model on it to make predictions. That is how I would apply the same technology to financial data. It can also be used for music recommendation: you can build a recommendation system based on the content of the music itself, not just its metadata. And because my background is in IoT — when I was working at Intel I dealt with a lot of IoT sensors — there are plenty of applications there too: as long as your data is a time series, no matter how fast it is in terms of sampling frequency, you can apply the same technique — turn it into a spectrogram and feed it to the same kind of model to make a prediction.

[Question: for the order-book time series application in finance, did you find any papers?] There are a couple of papers — I don't remember the names right now — but the granularity at which they usually tackle it with supervised algorithms is whether the stock value goes up, down or stays neutral; that was the extent of what I saw. For some time series you would want to put a recurrent network on top — maybe an LSTM or something like that — in addition to the CNN on the image, because the time series keeps extending itself into the future. [Comment: especially here — when you cut a window and repeat it, the assumption is that the sound is repetitive, but sometimes, for example when water is boiling, the pitch keeps rising; if you cut the first two seconds and repeat them, it doesn't necessarily capture the physics of the boiling phenomenon.] Yes, that is one of the challenges. But here we are doing image-style processing, so within a labeled clip the arrangement doesn't matter that much — you can cut it however you want and still make a prediction, because the label applies to the whole clip. Normally when dealing with a time series you can't shuffle the samples, but in this setting it doesn't matter. I also found an article — I think released by Boeing — where they study bearing failures in machinery. They have vibration sensors monitoring the equipment; you can imagine the vibration readings as a time series, and they used a very similar method to map the vibrations into spectrograms and then trained a CNN on them to associate the patterns with particular variables.
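A minimal sketch of that vibration-to-spectrogram step (my own illustration, not the article's code), using SciPy on a toy accelerometer-style signal:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 2000                                   # assumed sensor sampling rate in Hz
t = np.arange(0, 5, 1 / fs)
vib = np.sin(2 * np.pi * 120 * t) + 0.3 * np.random.randn(t.size)  # toy vibration signal
f, times, Sxx = spectrogram(vib, fs=fs, nperseg=256, noverlap=128)
log_spec = np.log(Sxx + 1e-10)              # image-like array: (freq bins, time frames)
# log_spec can now be fed to the same kind of CNN used for audio spectrograms.
```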
I have personally used the same approach when dealing with machines — for malfunction detection, trying to figure out which failure occurred — and it worked. [Comment:] I read a paper on translating neural signals into text using brain-machine interfaces, and this feels similar to how they did it: they had a frequency representation, but instead of passing it directly as an image to a CNN, they took the most intense frequencies and fed those into an LSTM. I am just wondering whether feeding the image directly to a CNN would give higher accuracy. It also relates to the previous question: is it because of the Fourier transform that you no longer have to worry about sequential patterns? It was suggested that RNNs may help with some of the physical phenomena. [Speaker:] for the stock-movement example, you already have a label for a window — say the last 90 seconds — and you are trying to predict, say, the next five minutes; once you have transformed that window into an image, I don't think the order within it matters. But if you are doing genuine sequential time-series forecasting, then order does matter — during training you can't just shuffle things around. [Discussion:] for neural signals the question is whether they have sequential dependencies and how long those dependencies are. In this paper the spectrogram is able to capture something like a whistle going off, whereas there might be longer-term dependencies — something that happened three or four frames ago — that you would need a recurrent model to capture. And sequential processing is not only about long time dependencies but about the sequence itself: if you switch the order, no matter how short the time scale, it means a different thing; a CNN over a spectrogram can miss that, while an RNN captures it. So the question is whether the signal is order-sensitive or not — and as far as I know, brain signals are; of course, if you go too fine-grained you miss the signal anyway. [Suggestion:] one way to test that would be to take the trained model and feed it the audio backwards. Nice suggestion. Okay, cool — so, are there any other questions?
Info
Channel: ML Explained - Aggregate Intellect - AI.SCIENCE
Views: 1,076
Rating: 4.7142859 out of 5
Id: ylIfbwOS5QU
Length: 96min 57sec (5817 seconds)
Published: Mon Sep 23 2019