Data Agnosticism: Feature Engineering Without Domain Expertise; SciPy 2013 Presentation

Video Statistics and Information

Captions
Welcome, everybody. Our first talk today is Nick Kridler, talking about data agnosticism. We're running a little bit late from the keynote, so we're basically just going to shift everything by five minutes, same timing and everything, but yeah, welcome.

All right, thanks. Good morning, I'm Nick Kridler. I work for Accretive Health in Chicago, and I'm going to be talking about what I'm calling data agnosticism. I'll say more about that in a second, but it's basically feature engineering without domain expertise.

First, to give you a little context about me: I consider myself a generalist. I'm an applied mathematician by training; I focused on numerical solutions to partial differential equations, and that's where I learned how to code. Now I'm working as a data scientist for this healthcare company. More recently I've been involved in some Kaggle competitions and have reached their Master tier, and when I burnt out from all of that, I play Final Fantasy online as a Dragoon, which is pretty fun. I was formerly a defense scientist; I did mostly data analysis and algorithm development, that sort of thing, with a little bit of C++ code here and there, but for the most part I've transitioned to Python since I left defense. I'm new to healthcare; I've only been with Accretive for about six months now.

So today's talk: first I'm going to tell you a bit more about what I mean by data agnosticism. Then, to motivate all that, I'm going to talk about this Kaggle competition to save whales from ship collisions, and I'll use that to illustrate my data analysis process. Then I'll talk a little bit about how Python impacted the work. If you remember one thing from today, it should be that responsible data analysis and quick iteration can produce really high-performing predictive models. I'll talk about this more in a bit, but this is my main message. The takeaway is that you don't need domain expertise.

As I said, I'm a generalist. I work on a variety of problems, but I'm not a machine learning expert, I'm certainly not a healthcare expert, and I definitely don't have any experience with whales. To motivate this a little more, there was a data science debate at Strata last year where the motion was "in data science, domain expertise is more important than machine learning skill." There were some great arguments on both sides of the fence, but one of the takeaways was that if you have subject matter expertise, you know which data and which features are important, you frame the problem much more easily, and you get high-performing predictive models from knowing exactly what you need to measure. There were lots of talks yesterday that added more weight to this and talked about the difficulty of feature engineering without domain expertise.

But I think algorithms don't care, and this is what I mean by data agnosticism. You give the algorithms the data and they take it in; they don't really care where it comes from or who generated it. They pretty much care about one thing: if you give them good data, they'll give you good output. If you put garbage in, you get garbage out, and naturally a bad assumption leads to a bad model. But the thing is, we can use models to help us find features without domain expertise. We start out with a bad assumption, see how we fail, and from that we can iteratively figure out how to make better models.
My evidence for this is the Kaggle competitions I've participated in, where I focused on feature engineering. I don't have any healthcare or bioacoustics experience, aside from just recently working in healthcare. These were all team efforts: my teammates were from a company I used to work at, and in our case we got tenth in the Heritage Health Prize; in the whale competition and then the follow-on I worked with a former coworker, Scott Dobson, and we took first place. So I think the secret to my success was responsible data analysis and quick iteration, plus a lot of really bad ideas, but that's where the quick iteration part comes in.

What I mean by responsible data analysis is that you should take some time to look at individual samples, not just the aggregates, and see what's really going on in the data. Also pay attention to sources of overfitting: that is a danger of looking at samples, because you can see anecdotal evidence of things and fit to that, so try to avoid that as much as possible. That's where skepticism comes in: if you do a really good job, maybe you want to believe you did something wrong, so be thorough and make sure that whatever you're seeing is real. And then quick iteration: really, domain expertise is just a time advantage, and if you can iterate quickly you can reduce that advantage. Plus, how else are you going to get through all those bad ideas?

So I'm going to talk about my process for finding whales. This was the Kaggle competition on North Atlantic right whale up-call detection. They gave about 80,000 two-second audio clips that may contain a whale call. An up-call is really low frequency, so it sounds sort of like a moo. The goal was to determine the probability that a whale call exists in each clip, and there's an area-under-the-curve metric that you're supposed to maximize. If you took a look at the forums, you'd see things about spectrograms, audio signal processing, and mel-frequency cepstral coefficients, things I knew nothing about. The picture I've got shows what a spectrogram is: a time-frequency plot, taken from the competition website, and this really bright red arc feature in the center is the up-call that we're looking for. They published a benchmark from Cornell University where the area under the curve was 0.72. This sounded like a really cool problem to me because I'd never worked with whale data before, so I thought, why not?

So where do we start? You could try googling "whale detection," but interestingly that just leads you back to the competition website, and there are some papers, but they're behind paywalls, so that didn't really work. You could throw the whole thing into a random forest, which some people actually did, but what would you do after that? You might be able to pick out some frequencies that are sort of important, but that's not really a model. What I wanted to know was: what can we do in just a few hours? As a simple way to start, I looked at a correlation-based model. Here I've taken the average of all the right whale spectrograms, and you can see in the boxed-out region this really strong arc feature, similar to the one in the sample from the competition website that I showed you earlier.
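(A minimal sketch, not the speaker's actual code, of the averaging step described above: build log spectrograms for the labeled right-whale clips and average them into a template containing the bright up-call arc. The sample rate, STFT parameters, and the load_clips helper are assumptions for illustration, and it uses today's scipy.signal.spectrogram for brevity.)

    import numpy as np
    from scipy import signal

    # Assumed sample rate for the two-second clips; the real value comes
    # from the audio file headers in the competition data.
    FS = 2000

    def clip_spectrogram(samples, fs=FS):
        """Log-power time-frequency representation of one audio clip."""
        f, t, sxx = signal.spectrogram(samples, fs=fs, nperseg=256, noverlap=192)
        return f, t, np.log1p(sxx)

    def average_template(whale_clips, fs=FS):
        """Average the spectrograms of all right-whale clips; the bright
        up-call arc survives the averaging while the noise washes out."""
        specs = [clip_spectrogram(samples, fs)[2] for samples in whale_clips]
        return np.mean(specs, axis=0)

    # Usage sketch (load_clips is a hypothetical loader returning 1-D arrays):
    # whale_clips = load_clips("train/", label=1)
    # template = average_template(whale_clips)
    # A sub-region of `template` around the arc is then cropped out and
    # used as the correlation chip in the next step.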
So I thought: how well does that chip correlate with the spectrograms from the audio clips? In particular, how well does it correlate with the right whale samples (we hope very well, since it's the average of all of them), and how does it look against the non-right-whale samples, the noise samples? From that I can generate three features. This plot shows the max normalized cross-correlation, with the right whale clips in red and the noise clips in black, along with the corresponding locations of the max, the frequency and the time location. I thought, hey, that looks like some decent separation; let's just throw it into a random forest and see what happens. There's been a lot of great work there; scikit-learn has been mentioned many times at this conference. Random forests are quick, high-performing, and easy to interpret. In this case I only have a little bit of data, so I don't know how much I could trust it, but it seemed to work pretty well.

So how did we do? As a first pass, I have the probability distributions for the right whale in red and the noise in black, and you can see some pretty strong separation. The big takeaway is that it gives you an area under the curve of 0.92, which is much better than Cornell's benchmark. So a simple model seemed to work pretty well, but this was just the start: at this point in the competition there was someone at 0.97 or something like that.
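(A hedged sketch of the three correlation features and the random-forest step described above; `template` stands for the cropped chip from the previous sketch, the global z-scoring is a crude stand-in for a true normalized cross-correlation, and the demo data are random placeholders rather than real clips.)

    import numpy as np
    from scipy.signal import fftconvolve
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    def correlation_features(spec, template):
        """Three features per clip: the max cross-correlation of the template
        chip against the clip's spectrogram, plus the frequency and time
        indices of that maximum. Global z-scoring is a crude stand-in for a
        locally normalized cross-correlation."""
        s = (spec - spec.mean()) / (spec.std() + 1e-12)
        c = (template - template.mean()) / (template.std() + 1e-12)
        corr = fftconvolve(s, c[::-1, ::-1], mode="valid") / c.size
        f_idx, t_idx = np.unravel_index(np.argmax(corr), corr.shape)
        return [float(corr.max()), float(f_idx), float(t_idx)]

    if __name__ == "__main__":
        # Toy demo with random arrays standing in for real spectrograms,
        # just to show the feature -> random forest -> AUC pipeline shape.
        rng = np.random.default_rng(0)
        template = rng.normal(size=(20, 10))
        specs = [rng.normal(size=(129, 60)) for _ in range(200)]
        y = rng.integers(0, 2, size=200)
        X = np.array([correlation_features(s, template) for s in specs])
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X[:150], y[:150])
        print("AUC:", roc_auc_score(y[150:], clf.predict_proba(X[150:])[:, 1]))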
The first thing I always do in my process is look at what I'm missing. There's this long tail in the red distribution where we're doing a poor job of predicting the probability of a whale, so I pulled some samples, and these are pretty noisy: you can sort of see the whale up-call we're looking for, but not really. The first thing I thought was, well, these are cases where I don't really have a strong signal-to-noise ratio, so what about some contrast enhancement? If you do that, the call pops out quite nicely. Then I thought, does this make a difference? So I went and did the same process again. I've got the max correlation again, where the solid lines are the contrast-enhanced feature and the dashed lines are the original, and you can see that the distributions are moving in the right direction: the thick red line is moving further to the right and the black is moving to the left, so you're getting more separation, and that gives you an area under the curve of 0.94. So we're getting there.

I've mapped this to a cycle. What I'm calling the good cycle is: I make a prediction, I evaluate that prediction, and then I figure out how I can improve my model, and I try to do that in as small chunks as possible so I can get through it quickly. What I'm calling the bad cycle is a random walk through algorithm land. I base this on experience working with others, and also on people talking on the forums: people pick an algorithm and it doesn't quite work, so they try another algorithm and it doesn't quite work, then they try optimizing hyperparameters, and then they're stuck in a loop where they've decided they're just going to use this one particular algorithm and they try to make the most of it without ever actually looking at the data. So I say don't get stuck in algorithm land; focus on putting better data into the algorithm.

So I mapped all of the examples I just showed you onto this cycle: we chose an algorithm, we generated the model, we evaluated the model, and then we figured out how to turn whatever ideas we came up with for improvements into code. The way that I can do that quickly is with Python. The great thing about Python is that I feel it shifts the focus from algorithm implementation to data analysis, and all of these great packages helped me create a submission in about three hours. That's really due to all the great work the people in this community have done, so I always speak very highly of Python and I'm always trying to get my coworkers to use more of it.

From that, I'm able to make consistent, data-driven improvement just by constantly looking at where my model is not working. The Heritage Health Prize was a very long competition, and there were definitely some periods when I burnt out and didn't work on it for a while, but the whale detection challenge was about a two-month competition, and after we made some pretty big jumps and settled on a decent model, from there it was just looking for ways to improve. I should mention that the difference between first and second place in the whale detection challenge was, I think, something like 5 times 10 to the negative fifth, really, really small, and one of my features had a typo, so it could have been even better.

Then there was this follow-on to the whale competition, the Right Whale Redux, and I applied the same process. I took my code from the previous challenge and got 0.987, consistent with what I was seeing before. Then I looked at one of the clips I was doing poorly on, and all of a sudden there's this junk at the bottom. That wasn't in the previous clips; this data was apparently collected with a different type of sensor. So I just knocked it out and got first place again, with a score of 0.993. It's also interesting to note that after the first competition I made my code publicly available, and someone downloaded it from GitHub, ran it, and got second place.
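(Both fixes mentioned above, the contrast enhancement and knocking out the low-frequency junk that the Redux sensor introduced, amount to simple spectrogram preprocessing. A sketch with illustrative parameters only, since the talk does not give the actual values.)

    import numpy as np

    def enhance_contrast(spec, low_pct=50.0, high_pct=99.5):
        """Percentile-based contrast stretch (illustrative percentiles only):
        clip the log spectrogram between two percentiles and rescale to
        [0, 1] so faint up-calls stand out against broadband noise."""
        lo, hi = np.percentile(spec, [low_pct, high_pct])
        return np.clip((spec - lo) / (hi - lo + 1e-12), 0.0, 1.0)

    def drop_low_frequencies(f, spec, cutoff_hz=50.0):
        """Discard spectrogram rows below cutoff_hz, where the different
        sensor in the Redux clips left artifacts (the cutoff value here is
        a made-up placeholder)."""
        keep = f >= cutoff_hz
        return f[keep], spec[keep, :]

    # f, t, spec = clip_spectrogram(samples)        # from the earlier sketch
    # spec = enhance_contrast(spec)
    # f, spec = drop_low_frequencies(f, spec)
    # ...then recompute the correlation features exactly as before.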
So in conclusion, algorithms really only care about better data. However you get there, you can trade computational time or data analysis time, but for the most part I didn't use any of the expert approaches: I didn't look at audio enhancement, mel-frequency cepstral coefficients, or zero-crossing rate. I just tried to focus on what the algorithm was telling me it was missing. So again I say: responsible data analysis and quick iteration produce high-performing predictive models. Here's the takeaway from the Marinexplore blog, which was really nice, and that's it. Here's my contact information and GitHub; if you want to see my Marinexplore code, I mean my competition code, it's all there. Also, Accretive is hiring, so if you like machine learning and Python and all that stuff, let me know and we can talk about the great work that can be done in healthcare. We've got a few minutes for questions.

[Question about whether he approaches work projects and competitions differently.] So the question was whether I think differently when approaching a work project versus a competition. Not really; I apply the same process to my day-to-day work. In fact, that's how I started on the Heritage Prize: a bunch of us at work always said that bits are bits and data is data, we can solve any problem, so the Heritage Health Prize was sort of our example. Since we worked in defense, it was, oh, we can do healthcare, let's see how it goes. So we used the same process, and the process seems to work.

[Question about using random forest importance weighting to select additional features.] So the question was whether I use random forest importance weighting to select features, and yes. I tried some of the other things that people talked about in the forums; I always give them a shot, and they tend to not be very high in the list, or they're very correlated with something that I'm already doing well. Thank you.
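(On the importance-weighting question: scikit-learn random forests expose feature_importances_, which is one way to rank candidate features the way the answer describes. A self-contained toy example, with synthetic data standing in for the real features.)

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)

    # Synthetic stand-ins: two informative columns and one noise column,
    # mimicking "max correlation", "frequency location", and a weak forum idea.
    n = 1000
    y = rng.integers(0, 2, size=n)
    X = np.column_stack([
        y + rng.normal(scale=0.5, size=n),   # strongly informative
        y + rng.normal(scale=1.5, size=n),   # weakly informative
        rng.normal(size=n),                  # pure noise
    ])
    names = ["max_corr", "freq_loc", "forum_idea"]

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    ranked = sorted(zip(names, clf.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    for name, score in ranked:
        print(f"{name:12s} {score:.3f}")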
Info
Channel: Enthought
Views: 15,653
Rating: 4.9097743 out of 5
Keywords: python (software), SciPy, scipy2013
Id: bL4b1sGnILU
Length: 16min 39sec (999 seconds)
Published: Tue Jul 02 2013