[Data Science YVR] How to (almost) win at Kaggle - Kiri Nichol

Captions
Thanks, Charles, and thanks everybody for coming out. I'm super excited to give this talk and tell you a little bit about why I decided to become a data scientist and how I used Kaggle to train myself to be one. I want to tell you how I went about turning myself into a data scientist, why I chose the do-it-yourself method, and why I decided to train myself on Kaggle problems. I also want to tell you a little about how a Kaggle competition actually works, and give you some strategy and tips for doing well at Kaggle. The conclusions will cover three things: was do-it-yourself data science actually a good idea in the end; the pros and cons of training with Kaggle; and some tips for interviewing, since I know most of you ultimately want to end up with a job. Quick poll: who here is in school, or studying, or a grad student? Okay, that's maybe a third of people. Cool.

Eight years ago — my goodness, I can't believe it's eight years now — I moved to the Netherlands to do a PhD in physics. My experiment was a bucket of glass beads with a motor underneath that ran a disc, and the disc would jiggle the beads around until they started to behave like a liquid. This was really cool, because I could drop in something heavy and it would sink, and something light would float — and the light thing would also jiggle around a little, and that jiggling looked like Brownian motion. That was the statistical-mechanics part of my PhD, and it's also why there is a rubber bath duck on the front cover of my PhD thesis and on my GitHub profile.

I really liked my PhD and I really loved doing research, but I got to the end of it and thought: academia is kind of a pyramid scheme. I looked at the people a little bit ahead of me, on the postdoc treadmill — their third postdoc, their fourth postdoc — and I thought I would like a little more stability in my life than four postdocs would afford. I also wanted more control over where I lived, and I knew during my PhD that I ultimately wanted to come back to Canada. But I wanted to stay in the Netherlands a little longer because my partner was finishing his PhD, so I needed a job, and I had one criterion for a job: that I would get more programming experience.

The first job I applied for and was offered was a research position at the Dutch Cancer Institute. It turns out they basically hire physics PhDs, because physics PhDs can do research and they can program. I thought, great, I'll go and improve my programming skills — and then I discovered it was all physicists who also had terrible programming habits, and there was no version control system. So I learned a lot of terrible habits, but I also learned a lot of anatomy, in Dutch.

The problem I worked on: when somebody has radiation therapy, they make a 3D x-ray — a CT scan — of the affected part of the person's body. With radiation therapy you get radiation every weekday for about five to seven weeks, and every day you go in for a little bit of radiation they make another CT of your body. Then they use the position of your bones in the daily CT to match you up with the position of your bones in your planning CT.
That way they know where to put the radiation. But somebody clever thought: we can do a better job of delivering this radiation by taking the radiation dose and warping it to match the anatomy of the daily CT. To do that, though, you want to make sure the warping is being done well — so my job was to come up with methods for making sure the computer didn't screw it up.

When I decided to move back to Canada, I still didn't really know what I wanted to do. Just before I left the Netherlands, somebody said, "You should check out Kaggle, this website where you can do data analysis competitions." I thought, okay, what the hell. When I looked at the competitions that were running, I thought, wow, these are all really interesting problems. One was to optimize flight routes based on current weather and traffic. Another was: given samples from a pair of variables A and B, determine whether A is a cause of B — ooh, that's an interesting problem. Predict Yelp business ratings. Recognize gestures in video data. Identify the bird species present in an audio recording. All of these problems sounded really interesting. I had some ideas about how to approach some of them, not so many about others, but the main thing was that I was curious. Once I found that motivation I thought: this is what I want to do, I want to work on Kaggle problems.

Before this slide, I want to do a quick poll, because I know there are people here working as data scientists. If you're working as a data scientist, more or less, put your hand up — that's maybe 16. Of those, how many have computer science backgrounds? Eight, so about half. And who has a PhD or a master's in something — analytical chemistry, genetics? Ten. This is terrible, because I counted 16 first and now I have 10 plus 8 is 18. This is not going to be very statistically significant. So: another route people take into data science is a PhD or postdoc in the natural sciences — genetics, anything where you get some programming and stats experience. And then there are training programs, mostly in the US.
There are some in Europe too — examples are Zipfian and Insight. These programs are competitive to get into, but once you're in, you spend something like six to eight weeks with a cohort of other people, taking classes and building a little portfolio project. The thing that makes this nice is that the process is paid for by companies who want to hire data scientists, so when you finish the program you have the undivided attention of many companies who would like to hire you. A lot of people get into data science this way. The disadvantage is that the programs are mostly in the US, and I wanted to live in Canada. There's also less competition for people with data science qualifications here, so companies expect they can just go out and find somebody — they don't expect to have to train the person who comes to work for them.

So I wanted to be a data scientist. I looked at the skills I had and the skills I thought I needed to develop, and actually, as a physics PhD, I had a lot of skills that made transitioning to data science fairly easy. Math skills: linear algebra, calculus, solving optimization problems, time series analysis, Fourier transforms, and a little bit of statistics — a very little bit, but some. I had some pretty decent programming skills: any time there was a data analysis problem, or I had to get my experiment to run, or I wanted to plot something, I was using MATLAB, Maple, or Mathematica, and if you can use any of those programs, you can learn how to program in Python. I also took a really good class as an undergrad — programming for physicists, taught in Fortran — where we did a bunch of the Numerical Recipes in Fortran: solving systems of linear equations, differential equations, a whole bunch of things that would be useful to a physicist and turned out to be really useful for data science as well. And as a person with a physics degree, you have a ton of experience looking at data and asking, "I wonder what's going on here — how do I represent this?" The modelling aspect of physics is something I definitely use in my job now.

The skills I needed to develop: more statistics; everything about machine learning; database tools, so SQL; cloud computing tools like Amazon Web Services, which are pretty handy; a version control system, because I didn't learn that at the hospital; and just getting better at writing code for an audience — code that other people can read and understand.

I also thought about the programs that have started up to train people to be data scientists, and I thought: I've already done enough school — I've demonstrated I'm awesome at getting degrees, I don't need another one. I did actually think about applying to some of the training programs, Zipfian and Insight, but then I realized I would prefer to stay in Canada. Still, that's definitely an option as well.

Does anybody know who this guy is? Yes — this is Garrett Lisi. He's an
interesting character. He's a physics PhD, and after he got his PhD he was at some university in California where the weather is pleasant and the surfing is good, and he likes surfing. He wanted to keep doing physics, but he realized that most of the postdocs were in string theory, and he didn't really want to do string theory. So he decided he would move to Hawaii, live in a van, go surfing, and do physics for as long as his money lasted. When I read about this guy I thought, huh, this is really cool. I realized I'd spent my whole life mostly being in school, and when I wasn't in school I generally had a job — and, you know, it's okay to be unemployed. That was a little harder to explain to my parents. The only reason I could be unemployed for two years when I came back to Canada, and the only reason I got away with it, is that Dutch PhD candidates are paid way better than Canadian PhD candidates — depending on the dollar-to-euro rate, something like 60 to 100 percent better. So I had enough money saved that I could pretty much do whatever I wanted for a little while. And once I saw that Kaggle problems would give me direction, focus, and motivation — which makes it a lot easier to get out of bed in the morning — then between the money and the motivation I thought: I can do this, I can be unemployed for a little while.

So, my action plan. I arrived back in Canada in July of 2013, and I planned a series of little projects to help myself learn new skills. I wrote some little apps to request travel times from the Dutch Railways website and Google Maps and store the information in a SQL database, and I built a little app that would let travellers with passes for a fixed route exploit some loopholes in the billing system. I also started building a portfolio on GitHub. GitHub is usually used by companies, or by people doing open source software, for version management — so that if you're working on some aspect of the code, you don't wreck other people's versions — but it also lets you host a very simple website, so I started building my portfolio there.

Next, I thought, I'm going to do the Titanic problem on Kaggle. Kaggle has a bunch of problems that are billed as starter problems, and in the Titanic problem you're given data on the Titanic's passengers — their gender, which class of cabin they were travelling in, where they embarked — along with a subset of that data where you know who survived and who didn't, and you're asked to predict survival for the remaining passengers. It's just a way to get started with machine learning, so that's where I got started. Once I'd done that, I could tackle a real Kaggle problem. I had also seen that Coursera ran a machine learning class that was really highly regarded, so I decided to do that too. The class turned out to be so interesting — I had thought the algorithms would be dry and kind of boring and that I'd just want to get to the data, but learning about neural networks was really, really cool. I really enjoyed the class.
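A minimal sketch of what such a Titanic starter script can look like — this is not the speaker's code, just the standard pattern, assuming the competition's train.csv/test.csv files with their usual columns (Survived, Pclass, Sex, Age, Fare, PassengerId):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

features = ["Pclass", "Sex", "Age", "Fare"]
for df in (train, test):
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})           # encode as numbers
    df[features] = df[features].fillna(df[features].median())     # patch missing ages/fares

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["Survived"])

pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(test[features]),
}).to_csv("submission.csv", index=False)
```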
So, what is Kaggle? Kaggle is a business whose clients are companies with some sort of data analysis problem. They can go to Kaggle, and Kaggle will turn their problem into a competition — the clients pay Kaggle to formulate their problem as a contest. Kagglers are just people on the internet who decide they want to have a go at solving the problem. There are prizes for people who do really well, and the prizes are substantial, but you have to be pretty good, and also pretty lucky, to win.

What is Kaggle good for? I should start by saying that a lot of data science work is exploratory: you have some data and you want to go looking for patterns in it. Kaggle only works for problems where you have labelled data, you want to make a prediction, and you have some way to test whether your prediction is correct. A lot of data scientists also work on problems that are more like A/B testing: for example, some users go to a version of a website where the text is red and you count how many people buy stuff, while others get a version where the text is blue and you count again. Kaggle doesn't help you do that. It's really only for problems where you make a prediction and there's a way to check whether the prediction is true.

The problems Kaggle is good for are regression problems, classification problems, predicting elements in a time series, and labelling segments in a time series. Actually, I'm going to group these a little differently, because Kaggle has evolved in an interesting way. There's a class of problems I call feature-building problems, where the features don't all have the same units. In the bots-versus-humans competition, which I'll talk about more later, you have information about people bidding in an online auction — what IP they were visiting from, how often they bid in an auction, the time between their bids in an auction — and all of those measurable things are in different units. Whereas if you're classifying images, you have image one, with one data element that's the value of pixel 1, another that's the value of pixel 2, then pixel 3, and so on: all the information is the same type, and I think that makes those problems more suited to a neural network approach.

How to learn with Kaggle: starting with one of the learning problems is usually a good idea — they're flagged on the website, and I've talked a little about the Titanic problem. Another tip is to pick a problem with a small amount of data, because that makes it easier to manipulate. I also recommend, when you're starting, picking a competition that's been running for a little while, because if there's something funky about the way the competition was set up, people have usually figured it out by then. It does occasionally happen that there's something a little strange in a competition, and people work it out eventually.
Another piece of advice is to look at old competitions and read the forums, especially where people have been able to share their strategies. People who've done really well will often do a little write-up about what worked for them, and you can learn a lot just by reading about what other people have done. While a competition is running, people also post starter code — some quick and dirty script that will usually load the data, run some sort of simple classifier, and write out the prediction. It's not always high-quality code, but at least it gets your data in and makes a prediction, and it's useful to see how other people set up their code. You can use it as a starting point and build your own empire on top of their foundation.

Kaggle is a competition, so while a competition is running there's a public leaderboard. What happens is: you're given some training data with labels, and you're given some test data for which you make predictions. While the competition is running, only about thirty percent of the predictions you submit are actually evaluated for the public leaderboard, so you can see how you're doing relative to other people, but you don't know what the final score will look like. When the competition is over, you see the private leaderboard, where all of your predictions are scored. That's what makes it exciting: you can see how you're doing relative to other people, but you don't know what the final results will look like.

Question: so there are only two datasets? Yes — there's a training dataset that has the labels, and a test dataset that doesn't. You get X_train and y_train, and you're asked to make predictions on X_test without knowing its true values. And yes, you have to split the training data yourself if you want to do cross-validation — that's coming up on another slide.

The problem I actually did really well at was the bots-versus-humans competition, so I'm going to talk a little about feature engineering and strategy with respect to that particular problem. Kaggle problems are all really different, so it's hard to generalize and say "this is the strategy I use for every single problem and it never lets me down" — every problem is a little bit different. In this competition, you're asked to look at bidding data for an online auction: which auction the person was bidding in, when they made the bid, what IP they placed the bid from, and the referring URL (hashed). With that information, you're asked to identify whether a particular account belongs to a person or to a robot.

They give you some data: a train.csv file, a sample-submission file that shows the format you're supposed to submit your predictions in, a test.csv file, and then this other bids.csv file. So you have X_train and X_test, you're given the labels y_train — bot or human — and then you have these events, which are the individual bids.
The challenge is to take those events and map them into X_train — you have to build X_train yourself, and that's what makes this a feature-building competition.

Question: what's the difference between the train file and the bids? Okay — I haven't looked at this data in a really long time, but there was some information about the user that didn't change, that was the same for every bid they made, and that went into X_train. Even the location information was in the events table, because people can move around while they bid. And no, mapping the events into X_train and X_test is not as simple as doing a join — that's the magical operation that happens in my head.

Look at this slide: X is a matrix where each user is a row, and the columns are properties of that user. When you download the training data they give you X_train, but you have to add additional columns to it by examining the event behaviour. Every time a user makes a bid — user 0, for example, makes more than one bid — there's behaviour there, and you have to make some sort of aggregate description of their bidding behaviour that you can put into this matrix as a feature. Is that clear? (There's a sketch of this aggregation step below.) I did write down what the columns of the event matrix are: bid ID, bidder ID, the auction, the merchandise (the type of item being bought), the device used to connect, the time, the country, the IP, and the referring URL.
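One plausible shape for that aggregation step — a hedged sketch, not the competition code, assuming the bids.csv and train.csv column names listed above:

```python
import pandas as pd

bids = pd.read_csv("bids.csv")
bids = bids.sort_values(["bidder_id", "time"])
bids["dt"] = bids.groupby("bidder_id")["time"].diff()  # time between a user's consecutive bids

# one row per user: aggregate descriptions of bidding behaviour
per_user = bids.groupby("bidder_id").agg(
    n_bids=("bid_id", "count"),
    n_auctions=("auction", "nunique"),
    n_ips=("ip", "nunique"),
    median_dt=("dt", "median"),
).reset_index()

train = pd.read_csv("train.csv")  # has bidder_id plus the human/robot label
X_train = train.merge(per_user, on="bidder_id", how="left")
```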
So how do I actually go about feature building? I just start by exploring the data and making lots of plots. You can look at the data and think, oh yeah, I totally know what's going on — but it's not until you make a plot that you can be sure your picture of what you think is happening is what's actually going on. My first idea was to look at bidding strategy at the end of an auction versus the beginning. I know that some people snipe an auction — firing off a bid in the last millisecond — and I thought I could probably use that to describe whether an account was a bot. But when I made a histogram of the bids over time, I discovered there were actually three clumps of bidding activity with gaps of nothing in between. On further inspection, I figured whatever bidding activity there was should have a periodicity of one day, and indeed I could see that nice daily periodicity: about three days of bids, then a gap of eleven days, then three more days of bidding activity, then another eleven-day gap. What I realized from this plot was that I couldn't use any information about bidding strategy, because I had no information about when an auction started or ended — there was no way for me to figure out where in the auction a bid had been made. So I couldn't use that as a feature anymore.

What I did to start with was just look at the bidding activity in a single auction — a little pandas query asking only for the bids in that one auction. The first thing I noticed was this character, 965cc7, who seemed to be a very, very busy bidder. So one thing I realized right from the start was that I could use the number of times a person bids in an auction, and the median time between their bids — these are all usable features. The cool thing about machine learning is that you can make up whatever features you think are interesting, and the algorithm will figure out which features are actually informative. This rewards being creative: you can even be a bit stupid, but as long as you're creative and come up with something interesting and useful, the algorithm will pick out what's useful. Some of the features I made up were things like: the median number of bids per auction; the median time between a user's bid and the previous bid, in the same auction or a different one; the time between bids made from different IPs, in the same auction or a different one; and whether the user is active on an IP known to have bot activity.

And then this is really interesting. My PhD involved statistical physics, and there's this quantity called entropy, which is basically how ordered a system is. I realized entropy could be used to describe somebody's bidding behaviour: if somebody places all their bids from the same IP, that's very ordered; if they place bids from a bunch of different IPs, that's more disordered. So I could use entropy to describe the bidding activity — and there's no stopping at the IP; I could compute the entropy over the URL, or the day of the week. (There's a sketch of this below, after the next question.) It also sometimes helps to take the log of whatever values you're generating. I haven't totally understood why this helps. I think sometimes, if you're looking at something like the standard deviation of a quantity that isn't Gaussian — something with a tail — taking the log helps normalize that tail away. Taking the log can also linearize your data, putting things that would sit on a curved scale onto a linear one, which makes it easier for certain algorithms to pick up on. Whatever the magic is, it works, and we may arrive at a more profound understanding over drinks tonight.

Question: just before you move on from feature building — are you limited to data available in the test set, or, say, if you think Tuesday is a big purchasing day in countries with a certain GDP, could you do additional research and build that in? Answer: you could — the competition will usually specify whether or not they want you to use outside data. In this case, I don't think it would have helped you very much; my suspicion is that they cut out chunks of the data so that you couldn't line it up with things like which days of the week are weekends or holidays in different countries. But there's nothing stopping you from trying to make something out of outside data, if the rules allow it and you think it's useful.
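A sketch of the entropy feature and the log trick, again assuming the bids.csv columns described above; the helper function is an illustration, not her code:

```python
import numpy as np
import pandas as pd

bids = pd.read_csv("bids.csv")

def shannon_entropy(values: pd.Series) -> float:
    # 0 if every bid comes from one IP (very ordered); larger the more
    # evenly the bids spread across distinct IPs (more disordered)
    p = values.value_counts(normalize=True)
    return float(-(p * np.log(p)).sum())

features = pd.DataFrame({
    "ip_entropy": bids.groupby("bidder_id")["ip"].apply(shannon_entropy),
    "url_entropy": bids.groupby("bidder_id")["url"].apply(shannon_entropy),
    "n_bids": bids.groupby("bidder_id")["bid_id"].count(),
})
# the log trick: compress heavy-tailed aggregates onto a saner scale
features["log_n_bids"] = np.log1p(features["n_bids"])
```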
Okay — once you've got your nice features built, the next part of the strategy is to pick an algorithm. One of the nice things about scikit-learn, the machine learning package for Python, is that they have a handy flowchart that helps you make a good decision about which algorithm to pick. For classification problems, though, I've mostly just started using gradient boosting machines and random forest classifiers, because they usually work pretty well and have fewer parameters to tune — it's easier to shove your data in and get something half-decent without worrying about learning rates or mysterious normalization parameters. That's the hot tip of the day, I guess.

When you're doing a competition, one of the things you have to work out is how good your prediction is, and the way to approach this is to construct a cross-validation set from your training set. Basically, you break the training data into two segments — usually 80 percent becomes a new training set and the other 20 percent becomes your cross-validation set. You train your algorithm on the 80 percent, then make predictions for the cross-validation data and see how well the algorithm actually performs. Usually you make the split a couple of different times and check how the algorithm performs on each split. This is useful because it gives you some idea of what score to expect on the leaderboard. (There's a sketch of this below.)

Before the competition ends, you have to pick your two best submissions — or what you think will be your two best — and if you have an idea of which submissions are likely to be best, it's easier to make those picks. You have some feedback from the public leaderboard about how you're doing relative to other people, but you don't want to overfit to the leaderboard; you want to be sure you're doing a decent job of making your prediction on your own terms. Usually I pick one prediction that I think will do really well, based on my cross-validation scores and the leaderboard score, and then I make a second, more conservative pick.
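A sketch of that repeated 80/20 validation loop with a gradient boosting machine; the stand-in data and the AUC metric are assumptions for illustration, not details from the talk:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # stand-in for your features/labels

model = GradientBoostingClassifier()
splits = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)  # five random 80/20 splits
scores = cross_val_score(model, X, y, cv=splits, scoring="roc_auc")
print(scores.mean(), scores.std())  # a rough estimate of the leaderboard score to expect
```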
One thing that shows up every once in a while in a Kaggle competition is that the training data may not overlap with the test data — your test population can be somewhat distinct from your training population. In that case, the leaderboard actually is a helpful hint about whether your algorithm is going to do well. I realized that one way to detect this situation is to train a classifier to see whether it can distinguish the training population from the test population — that's my little trick. (It's sketched below, after the asides.) Another thing to keep in mind is that nearest-neighbour algorithms are better at interpolating than extrapolating, so if you're in a situation where you have to extrapolate and you're using a nearest-neighbour algorithm, you have to be a little careful about trusting your predictions. For extrapolation, I can recommend the statsmodels package, which lets you exploit an underlying distribution that the data conforms to: if you have data about part of a distribution and you know you can extrapolate along, say, a Gaussian, that lets you do the extrapolation.

More general strategy: start by building something simple that makes predictions, does cross-validation, and writes the predictions to a file.

Question: do you have access to a cluster of machines, or are you using just one? Answer: when I started, I had a four-year-old laptop, which was pretty awesome for the most part. After a year of spending basically eight hours a day on Kaggle, I decided I needed a better computer, so I bought a MacBook, and that's what I'm using now. Now that I'm doing more problems with neural networks, I've been buying time on Amazon Web Services — though I got a little tired of being booted off constantly, and I think I might build myself a little GPU machine that is definitely not going to be used to play computer games. And yes, it's actually fairly common that the data will all fit in memory — for most of the competitions I've worked on, even the neural-network-oriented ones, the data has been in the megabytes range. What about Hadoop? I have no experience with it, though I'd sure like some. To justify using Hadoop you must have a lot of data — something like one to five terabytes or even more. Hadoop is for really huge amounts of data, and Kaggle competitions are usually geared toward much, much smaller amounts.

I already alluded to this, but: keep track of your cross-validation scores. Once I have a basic script going, I keep a little lab notebook where I record what I've changed in the code and the cross-validation scores I've generated, and, just to avoid doing anything really stupid, I save the script I used for each submission. Once you have something working, that's the time to invest more energy in feature engineering, and maybe in fiddling with the algorithm you're using and its parameters. And I guess the final tip, if you really just want to win something: recruiting competitions seem to be a lot easier and less competitive. The two competitions I've done really well at have both been recruiting competitions, and I think that's just because there's no money involved — the prize is an interview, and for people who are doing really well, that's not necessarily motivating.

Okay, I'm going to have to go over here — this says "no signal." Shall I unplug it and plug it in again? Look at that: victory is mine. And yes, the final tip is: if it doesn't work, unplug it and plug it in again.

Question: to clarify the notebooks — do you use IPython notebooks? Answer: no. I have a little file in TextWrangler where I literally write down "I just changed this thing," and then "I changed this other thing," and then I cross-validate ten times or so and record the cross-validation statistics. It's very unsophisticated.
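The train-versus-test trick mentioned above (often called adversarial validation) can be sketched like this; X_train and X_test here are stand-ins, not the competition matrices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_train, _ = make_classification(n_samples=400, random_state=0)  # stand-in feature matrices
X_test, _ = make_classification(n_samples=200, random_state=1)

# label each row by which set it came from, then see if a classifier can tell
X_all = np.vstack([X_train, X_test])
is_test = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))].astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X_all, is_test, cv=5, scoring="roc_auc").mean()
print(auc)  # near 0.5: the populations look alike; near 1.0: the test set is a
            # distinct population, so expect local CV scores to be optimistic
```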
Hold on, I might have to go wave my hands over here again — there's the mouse, and we're fullscreen. Great, victory.

Okay: who wins at Kaggle? You may not be surprised to hear that it's smart people who spend a lot of time on Kaggle. That includes a lot of grad students and people with university positions who are fairly free to decide what to do with their time. There are also quite a few consultants who seem to divide their time between Kaggle and industry projects — they Kaggle for a while, then go do a paying job — and some of these people seem to have gotten their start on Kaggle and then made consulting into their business. And there are people who've already made a pile of money in a startup and are just looking for a meaningful way to spend their time. Kaggle has actually started interviewing people who've done well at the competitions, and it's sometimes interesting to read about their backgrounds.

The interesting thing is the breakdown. This is data I looked at in July: who are the top hundred Kagglers, and where are they from? The first thing I noticed was that there are a lot of people with Dutch names in the top 100. Boiling it down a little more, the Netherlands has the highest per-capita number of top-100 Kagglers of all the countries on Kaggle. Canada, I think, had two at that time. The US, Russia, and Japan have a lot of people in the top 100, but those countries also have fairly large populations. Maybe this is not super statistically significant, but having just come from six years in the Netherlands, I had some ideas about why this might be. I think the answer to "why are Dutch people so good at Kaggle?" is related to the question of why Dutch people are so good at speed skating — and the answer is that they're also really good at this sport you do on two wheels. The answer is basically cross-training: the more problems you do, the better you get at approaching a new problem. Dutch people also have a huge amount of vacation time — I had six weeks of vacation as a postdoc, and I cannot take six weeks of vacation; four weeks I can totally take — and a lot of people work a reduced workweek as well. People don't work 50 hours a week at their job; they work a civilized amount, and then they have time to Kaggle when they get home in the evening. That creates a huge opportunity for learning and training, and I think it's one of the reasons there are a lot of Dutch Kagglers in the top 100.

I also want to say that I didn't do do-it-yourself data science by myself — I had a lot of help, mainly from three people. Bruce, who I went to undergrad with, decided to go back and do a master's in machine learning and natural language processing, and he's really good at math, so any time I had anything to discuss about natural language processing, or just math, he was there to chat. My friend Nick worked at a startup and had lots of advice for talking to startups, and thanks to his astronomy PhD he knew how to use Amazon Web Services, so he took on tutoring me in AWS. And then
Philip, who I actually live with, taught me how to use GitHub and resolved much of my frustration involving installing Python packages — which is still the thing I hate most about programming: installing packages.

Advantages of do-it-yourself training with Kaggle. The problems are really interesting, and having that motivation was really the best thing about it. Kaggle also provides starter code, and the people on the message boards give suggestions for methods and insights into the data, and they ask good questions — reading the forums is a really good way to learn. Because Kaggle only rewards accuracy (and, to some extent, being helpful in the forums), it's a low-bias environment. I could be out there with my little icon, which is a duck, and people wouldn't know — I'm standing here, I'm female, but on the internet you can be totally anonymous, and I really appreciated that. It gave me a lot of confidence to know that people were responding only to my ideas. Kaggle gives you a benchmark for your skills, and it gives a potential employer a benchmark for your skills too — a lot of confidence comes from knowing how you're performing relative to other people. Kaggle lets you build a portfolio: every time I did a competition, I would write up what I did and what my strategy was and put it on my little GitHub page, and that was something I could point to in resumes — here's my portfolio. Kaggle problems also give you something to talk about when you're networking. Many people, including me, find networking kind of awkward, but if you have something you can talk about that you're interested in, and that you think will be interesting to other people, that makes networking a lot easier. And finally, Kaggle gave me the freedom to travel. If I'd been in another degree program, I would have had to stay in Vancouver, but I was able to spend some time in San Francisco, and I was able to go to Brazil for six weeks, and that was really cool — I Kaggled in San Francisco, I Kaggled in Brazil, I Kaggled in Vancouver, and it was nice to have that freedom.

Disadvantages of training with Kaggle. Kaggle does a lot of the hard work of data science for you: when you download the data, they've already gone to the trouble of making a nice labelled dataset, and honestly, a lot of data science is just getting the data, cleaning the data, making sure the data is labelled — trying to get data you can actually label tends to be most of the problem in data science at a real company. The flip side of Kaggle only rewarding accuracy is that there are no points for making a nice visualization, or even for putting error bars on your prediction, which are very valuable things to know how to do.

Question: do they ever have requirements for prediction time, like you'd see at a real company, or is it just: if you can make the prediction before the competition ends, you can submit it? Answer: the reality is that it has to be somewhat reasonable, because you probably want to make more than one prediction over the course of the competition — competitions are usually at least two months long, and some of the ones run just for learning can be as long as nine months. Was there a question over here?
Question: how much time do you spend on a particular Kaggle problem? Answer: probably way more time than I should. I find Kaggle really addictive — I love spending time on it. I was kind of treating it as a day job: when I was unemployed, I would Kaggle for about eight hours a day, maybe 12 to 14 hours if the problem was getting really interesting, but I generally would not work on the weekend. You don't have a full social life, though.

Question: how much statistics do you need? Answer: I would say close to zero. The only statistic you really need is making a decision about which of your predictions you think is best; if you understand "split my data 80/20 and do cross-validation a couple of times," that's about all the statistics you need. It's really more about machine learning.

Question: how do the leaderboard scores change as you approach the competition deadline — do they start jumping quite a bit? Answer: different competitions behave differently. Sometimes the scores jump around quite a bit, and that's usually a case where the test set is quite different from the training set. There was this one competition, the Africa Soil Prediction Challenge: they gave you satellite imagery data and a bunch of information about soils, and you had to predict certain nutrient contents. When they packaged the data, the training set was soil samples from one set of locations and the test set was soil samples from a different set of locations, so the populations were actually separate. There's a difference between doing that and mixing data from all the locations together, because in the mixed case you're interpolating, which you'll do a better job of — and in the separate case the leaderboard scores can move around quite a lot. Sometimes people also get tempted to overfit to the leaderboard, and then their scores can drop quite substantially; that usually just means they did a bad job of using cross-validation to pick which of their predictions was best.

Question: where is Kaggle headquartered? San Francisco.

One more question over here: how do you decide about the features? You have the test data, and you have some data for validation, and you run the algorithm on some of it — when you decide about the features, do you use the whole data, or just the part you train the algorithm on? Answer: the way I understand the question is, how do I decide which features to use — and the answer is, I use all of them. The cool thing about machine learning is that it will go in and figure out which of those features are helpful for making the prediction. As for the validation data versus the training data: when I want to do cross-validation, I split the data, usually randomly.
Basically, I take the training data and randomly pick 80 percent of it to be my training set and 20 percent to be the validation set, and I make that split several times — five to ten times — each time training on the 80 percent and testing on the other 20. Sometimes you can't do that, though. If you have time series data — there was this Walmart sales prediction competition that gave you three years' worth of sales data — then ideally, if you had five years of data, you would use one year as your validation set and four years as your training set, and do that five times, each time with a different year as the validation set. Sometimes it's not so easy to figure out how to split the data in a way that actually gives you reliable information.

One more question: do you have any tips on building ensembles, after you've done all of your feature engineering? Answer: I've only actually tried that once, and it didn't provide any value — I can track down some information for you, but I'd have to look at my computer to do that. There are people who've done it quite successfully, though. The strategy you're asking about is this: you can use more than one machine learning tool to make a prediction and then combine all those predictions together, and usually you get a better prediction out than you would have gotten from any one single machine learning algorithm. This is called ensembling — and gradient boosting is itself an ensembling method.
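A sketch of the ensembling idea described in that answer — averaging the predicted probabilities of two dissimilar models. The stand-in data is an assumption; this is the generic pattern, not the speaker's method:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # stand-in data
X_train, X_test, y_train, _ = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# dissimilar models make partly uncorrelated errors, so their average
# often scores better than either model alone
p_ensemble = (rf.predict_proba(X_test)[:, 1] + gbm.predict_proba(X_test)[:, 1]) / 2
```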
Let's see — a lot of data science jobs are primarily statistics with a little machine learning. If you're working at a company whose problem is "how do I get people to click on stuff and buy stuff on the internet," that's mostly A/B testing; it's less about the machine learning side of things. Kaggle doesn't help you learn statistics or experimental design — you actually have to take a statistics class for that. And Kaggle is a winner-take-all economy: there are great prizes for the people who win, but everybody else gets nothing.

Disadvantages of do-it-yourself data science: I've already started to say this, but Kaggle is pretty addictive and can be kind of a time-suck. A degree forces you to learn things that are good for you — when I was an undergrad, I probably wouldn't have chosen to take linear algebra, but it turned out to be useful for things like quantum mechanics and machine learning, so I'm glad I took it. When you're doing a do-it-yourself program, you have to be careful to follow your interests while also forcing yourself to learn things that are good for you, and to stay aware of what you're missing. Also, a degree, or a training program like Zipfian or Insight, provides you with a route to a job or an internship. I didn't have that with the do-it-yourself program, so I had to work really hard to make contacts, to network, and to get interviews, and I think that would have been easier in a degree program. And being unemployed is just kind of hard. It's hard watching your bank account balance dwindle, and my parents — who are generally very laid-back people — were like, "What's wrong with you? You're a smart girl, you don't have a job." That's really hard. I'm glad I stuck with it, though. I liked the autonomy and being able to do interesting things; it was worth it for me.

Okay, so: action plan, revisited. I had this six-month plan where I was going to teach myself SQL, do the Titanic problem, tackle a Kaggle problem, and do the machine learning class on Coursera — and then, at the end of six to nine months, surely somebody would realize my brilliance and offer me a job. It didn't quite work out that way: I wasn't employed for about two years. I have to say, I think it took me a year to get good at data science. After that year, I spent most of my Kaggle time focusing on problems where I would have to use neural networks, just because I thought that was super interesting — but my Kaggle scores stopped increasing substantially, and I realized I wasn't learning as much as when I started.

So, what everybody probably wants to know: does doing well at Kaggle help you get a job? Kind of yes and no. My experience has been that companies don't really care whether you're decent at Kaggle, but they do like seeing a resume with a big unicorn sticker on it, and "Kaggle Master" on your resume is a big unicorn sticker. I definitely saw job advertisements where people asked for Kaggle experience, or at least suggested it would be valuable. But I also had the experience in Vancouver that about 50 percent of the HR people and recruiters I spoke with had no idea what Kaggle was — and in that case, it's not helping you get your foot in the door. My experience has also been that companies don't really care if you're smart or creative or can take initiative; they only care that you've solved almost their exact problem at another company. Companies can be really conservative, and I'm not sure how to persuade them to take risks on people who might be awesome — or even to explain to them that if I'm good at these five kinds of Kaggle problems, I'm probably going to be good at this other kind of problem too. That's something we have to educate the HR people at our companies about, once we do have jobs.

I generally had pretty good luck getting interviews at small companies; for whatever reason, I just did not figure out the magic recipe for getting interviews at large companies, so most of my interview experience is at small ones. What I've found is that small companies, especially ones that don't have a data science person yet, don't always really know what they want or need. And it's genuinely a hard problem to know, when you don't have the technical expertise in-house — maybe you have a software developer who doesn't really understand what data science is, and to a lot of people in marketing, data science looks like magic. So when you go to an interview, they don't necessarily know what to ask you, or even what skill set they should be looking for in order to figure out whether you have the skills they need. Keep this in mind: you can even tell people in the interview
what they should be asking you, or tell them what you think their problems are. That's all okay to do.

Question: how small were the small companies? Answer: I think the smallest companies I interviewed at were three people — I'm not sure you can call that a company at that point — and anywhere up to about 50 people.

Like many people, I don't actually care for recruiters that much. Let me put it this way: I had a much more favourable impression of companies that had their technical staff talk to me when I was interviewing. With recruiters, nobody really has any way to assess whether recruiters are doing their job — including the recruiters — so it's not entirely their fault. My hypothesis is that recruiters have okay precision, so they usually find some decent candidates, but their recall is totally shitty: they find five decent candidates but miss twenty other awesome ones. This is a hard problem, and I don't really know what to do about it. I wish there didn't have to be recruiters in the world, but obviously that's just not going to happen. If I were working at a company, or offering advice to one, I'd say your best recruiting options are recruiters your existing staff know to be knowledgeable and to have treated them well in the past. As for the technical interview: I'm perfectly happy to do a day-long take-home problem, because such a problem lets you give me something to work on that looks like the problem you would actually want me to solve at your company. I think that's the best way to run a technical interview.

So, a question for the audience: are there any recruiters here? I have a question for you later. And are there any people representing companies that are hiring data scientists? Maybe after a couple of slides, one of you could come up and tell us how you're recruiting, what you're looking for, and how you screen — advice to complement what Kiri's talking about. Okay — yes, thank you, Travis, that was very helpful.

This is kind of an addendum to the talk, but I think Kaggle-ification has a lot to offer research, especially somewhere like medical research. There's a lot of medical research of the form "we introduce technique X to solve problem Y," and usually those papers are super boring to read — and it's really hard to take the different techniques that different groups have come up with, compare them, and figure out whether the problems they solved were even the same and which solution is actually superior. I think Kaggle provides a really good way to settle those kinds of questions. I also think it's a really good alternative to the traditional "I have so-and-so many publications, therefore I will be very awesome as a professor" approach to building a CV: if you have the alternative option of "I did really well at this Kaggle-style problem," that could be valuable for identifying individuals who have a lot of promise as researchers. Another point is that when people work on a problem they understand really well, they're in a much better position to learn from other people's solutions — I think that's actually one of the most powerful things about Kaggle. And the other thing I really like about Kaggle is that it helps keep
knowledge and techniques public, and it helps disseminate information. Google has gone and hired all the top neural network researchers, and that information is now basically owned by Google; by keeping this information and these ideas public, Kaggle helps disseminate them. There are actually a couple of different organizations doing Kaggle-style competitions in biomedical research — Grand Challenges and DREAM Challenges — and I'm super excited about this. I think medical research is one of the areas where machine learning can have a lot of really amazing applications, but the biggest problem in medicine is collecting the data and disseminating it. I don't know if any of you have been to the doctor lately, but they're still handing out x-rays on CDs — so doing a good job of databasing our medical information is, I think, the most important problem we need to solve.

Okay, conclusions. Being unemployed is not a giant pit of despair — it helps when you have something that gets you out of bed every day that you find interesting and motivating, and Kaggle did that for me. Kaggling is a good way to get better at attacking predictive modelling problems. Kaggle doesn't really help that much with getting a job, but it did make me pretty awesome at my job, now that I have one. Do-it-yourself training gave me a lot of freedom to learn whatever was interesting to me, but maybe a degree or an internship would have been the faster, cheaper way to get a job. If there are any other questions, I'm happy to take them, and then I guess we'll go for drinks — and somebody is going to say something about recruiting.

Question: have you ever been on a team for one of these competitions — any advice for teams? Answer: I have not actually ever been on a team. I have two friends I talked about joining a team with, but it never transpired — there were complications involving which time zones we were in, and it was also Christmas. Kaggle does let you form teams, so you can work alone or team up with, I think, up to three other people, and they have an algorithm for dividing the points between you.

Question: to what extent does domain knowledge play a role in these competitions? They cover a broad range of domains — is it helpful to have domain knowledge, or is it abstracted away? Answer: to be honest, in a lot of cases people without domain knowledge win. It turns out that if you're reasonably intelligent and can handle machine learning tools, you can make a lot of progress on these problems. Where domain knowledge really helps is in formulating the problem — sometimes I think that's the hardest part of data science: what problem do we need to solve?

Question: on Kaggle, or at the place you work now, do you actually implement any of the algorithms from scratch, or do you use the libraries that are out there? Answer: I'm not allowed to talk about what exactly happens at work. What I can say is that there are definitely things I learned Kaggling that I use every day at work — and it has even happened now that something I learned
Oh, and I work at a company that does fraud detection and data security.

This is somewhat related to Kaggle, but also to data science in general: how often do you have bad data, as in data that is not good enough to make decent predictions? That's kind of a hard question to answer, actually. I think there's definitely more bad data in real life than there is on Kaggle; on Kaggle they've really cleaned the data up quite a bit. I definitely find myself using things like fillna in pandas to figure out what to do with my missing data.

One other question: you mentioned that there is no incentive to specify uncertainties on your predictions. Did you mean for every single prediction? So, the test data will consist of, say, 10,000 items that you're supposed to label, and the score is some function of how many you got right and how many you got wrong; there are different metrics used to evaluate different problems, but the score is based on all of the predictions you make. The only sort of statistical part is that you have to choose which two submissions you want to count: you can enter the competition fifty times, but if you make fifty entries, you have to pick the two submissions that you think are your best ones. And the thing is that your best submissions on the final, private leaderboard may not necessarily be the ones that did best on the public leaderboard that's visible during the competition. Did that answer your question? Sort of; I was wondering if you can assign some uncertainty to each of your predictions. Okay, so a lot of machine learning algorithms will actually spit out a probability: if you have a classifier, it will give you the probability that a sample belongs to a given class, and in a lot of cases you can use that probability to decide whether or not you're confident about your prediction. That's for classification; if you're trying to predict a numerical value, I would have to think about that a little more. That's a good question.
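The pandas call alluded to in the missing-data answer above is presumably DataFrame.fillna. A minimal sketch of that kind of cleanup, with entirely hypothetical column names and values:

```python
import numpy as np
import pandas as pd

# Hypothetical training data with missing values, as loaded from a competition CSV
df = pd.DataFrame({
    "age":    [34.0, np.nan, 52.0, 41.0],
    "income": [72000.0, 58000.0, np.nan, 61000.0],
})

# One simple baseline: fill each numeric column's gaps with its median
df = df.fillna(df.median())
print(df)
```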
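The per-prediction probability the speaker describes corresponds to predict_proba on scikit-learn classifiers. A minimal sketch on synthetic data; the 0.9/0.1 confidence cutoffs are an arbitrary illustration, not something from the talk:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data standing in for a competition dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Probability of the positive class for each sample, not just a hard 0/1 label
proba = clf.predict_proba(X)[:, 1]

# One possible use: treat only extreme probabilities as "confident" predictions
confident = (proba > 0.9) | (proba < 0.1)
print(f"{confident.mean():.0%} of predictions are confident at this cutoff")
```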
Something I read about: there was this famous competition for Netflix, and I read that the winning algorithm was not actually implemented by the company because it turned out not to be very scalable, since the reward is only on accuracy and not on speed. I was wondering if you know whether that is very common. Okay, so I know one other competition where that happened: it was a whale detection challenge. They had a bunch of whales underwater, they were listening to them with microphones, and they asked people to identify which whale was making the noises. It turned out that in the end the algorithm didn't work so well in practice; I don't remember why, but I can look that up afterward. It was definitely a situation where it didn't work in the field. Any other questions?

A question about the mechanics of the process: what are you actually submitting? Is it some sort of executable that they can run? So literally, it's just a CSV file with a prediction for each row in the test data. For example, in the bots-versus-humans competition, the test data has a row for each account ID: account ID 1, account ID 25, account ID 26, and so on, and you have to make a prediction for each one: account ID 1 is a bot, account ID 25 is a bot, account ID 26 is a human, and so on.

And does anybody's source code and algorithms become known, because they put them on the forum? Right, so a lot of people will share their code on the forum; I always shared my code on GitHub. And then there are some situations where the winner's code becomes the intellectual property of the company. The rules are different for different competitions, but usually the idea is that the company gets access to your code: a condition of receiving the prize is that you give them your code.

When you're doing a competition, how do you distribute your time between things like exploratory analysis, feature engineering, model tuning, and model selection? You know, I'm not sure I have a really good recipe for that. When I'm doing stuff with neural networks, it just takes me longer because I'm less familiar with them; it takes a longer amount of time just to get something going that works. That's the best answer I think I can give.

What I'm curious about is this: if your training set and your test set are quite disjoint, and you can set up an easy classifier to distinguish between them, do people end up trying to infer some of the predictables from the hidden test set? Is that a common strategy? You know, I'm not actually sure; that seems like an interesting idea, though. Next time you run into it, come find me.

I was curious whether, when you're using these methods, you're tuning them a lot, or whether you find the defaults work fine for you. Well, usually with a random forest classifier I don't have to fiddle too much; if I use something like stochastic gradient descent, then I will spend some more time optimizing the learning rates and the other parameters. There's a tool in scikit-learn in Python called grid search that will let you go through and sort out which settings are the most optimal.

Hi, so you expressed enthusiasm for neural networks, and you said that you've been reading about them. Have you found that understanding the inner workings of the algorithm has translated to better success with them? Okay, so to be honest, the biggest challenge I have with neural networks is that I don't understand how best to go about selecting the architecture, and I don't actually think anybody else does either, because I've looked and it's not well described anywhere convenient on the internet. I was working mostly with feed-forward neural networks, but I've been trying to learn a little bit more about long short-term memory networks; all these different networks have different architectures and different kinds of operations happening at each node. Mostly, I just really find it interesting to learn how they work. The other part of the problem is that to get them trained you have to have a huge amount of data, and this is where Amazon Web Services comes in: if you have an AWS account, you can upload a huge amount of data, rent a GPU, and train your algorithm on the GPU. But yeah, I'd like to understand them better than I do.
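To make the submission format concrete, here is a minimal sketch of writing that kind of CSV with pandas. The column names and IDs are hypothetical; each competition's sample submission file specifies the real ones.

```python
import pandas as pd

# Hypothetical test-set IDs and model predictions; the real column names
# and ID values come from the competition's sample submission file
submission = pd.DataFrame({
    "account_id": [1, 25, 26],
    "is_bot":     [1, 1, 0],   # 1 = bot, 0 = human
})

# One prediction per test row; the index column is usually not wanted
submission.to_csv("submission.csv", index=False)
```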
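The grid search tool mentioned above is GridSearchCV in scikit-learn (it lived in sklearn.grid_search at the time of this talk and is in sklearn.model_selection in current releases). A minimal sketch of tuning a stochastic gradient descent classifier, with a purely illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions

# Synthetic stand-in data, just to make the sketch runnable
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Illustrative grid over regularization strength and learning-rate settings
param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2],
    "learning_rate": ["constant", "invscaling", "optimal"],
    "eta0": [0.01, 0.1],  # initial learning rate (unused by the "optimal" schedule)
}

search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the settings that scored best in cross-validation
```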
Info
Channel: Polyglot Software Association
Views: 49,948
Rating: 4.8368421 out of 5
Keywords: Data Science, Kaggle (Venture Funded Company)
Id: JyEm3m7AzkE
Length: 72min 45sec (4365 seconds)
Published: Sat Nov 28 2015