Dirty Secrets of Data Science by Hilary Mason

Captions
Wow, well, that's a great introduction. Can everyone hear me all right? So, I have such a terrible cold and I am on a lot of drugs, so please don't hold anything I say against me tonight. If you do want to tweet at me, I'm @hmason. We will have plenty of time for questions later, but if something comes to mind I will definitely respond to it, or if you want to email, there it is. So, hello. Am I pushing the wrong way? Yes. Okay.

Hi everybody. It is amazing that this many people actually turned out to hear a talk about data. Data is not necessarily the sexiest topic. What's also amazing is that I actually don't know that many people here, and I find that amazing because I've been running around New York yelling about data for about three years now, non-stop, and I've met a lot of amazing people along the way. It's really exciting to me to see that our community just keeps growing and growing. So I hope you have had your beer, hopefully a lot of it, and your pizza.

When I decided to do this meetup, I thought it would be like 20 people sitting in the Mortar Data office, and we'd just chat about Hadoop and how much we hate it. Then the list kept growing and growing, and then K was like, "By the way, we've got a space that holds 250 people," and it's raining and you're all here. So I started thinking, okay, this had better be a decent talk, because so many people are spending so much time to come and see it. I started asking people on Twitter, people who I knew personally, "What do you want to hear about?" and nobody said the same thing. There were people who said, "I want to know how to get started. I'm not a data scientist just yet, but I want to become one." There were people who said, "I'm hiring data scientists. Where do I find them? How do I evaluate them?" There were people who said, "Tell me about hierarchical clustering."

So what I decided to do for this talk is to divide it into a few different parts, and hopefully you will find all of
them at least mildly entertaining, and one of them will fit each of your interests. The parts are: to start with, the dirty unspoken history of data science as I've seen it, being here in New York for the last five years or so. The next part is to talk about how we do data science at bitly; I'll use a few of our projects as examples, but really it's to talk about the process and the engineering that goes into it. The last part is my opportunity to expound philosophically on the future of data science. And then the very last part will be open questions and answers: ask anything you want to know and we'll see how that goes.

All right. So this is one of my favorite quotes, from Richard Hamming, father of information theory: "The purpose of computing is insight, not numbers." He said this in 1961, so people have been thinking about this for quite a long time. In fact, there was a movement in AI back (this is the ENIAC, 1946) even then to say computing is a tool for enhancing human cognitive capacity, and who deserves to have their capacity so enhanced? What can we do with this? This was the point in history at which they also said they would solve the machine translation problem in ten years, and, you know, that didn't go so well. The field of AI was founded at this time on a conceit that all of human thinking could be reduced to rules and logic that could be processed on a machine like this, which I find to be a completely amazing idea. We've come a long way from that.

So I want to tell you a little story, and it's a story about how I've seen data science actually go from being an idea to something real. How many people here are data scientists, as in you would put that on your business card? Okay. What do the rest of you do? Okay, how many engineers? All right, awesome. How many business people? Okay, that probably covers everyone. If not, tell me later.

So this is an extremely brief history of data science. Data science actually draws a lot from astrophysics and finance, and so the work
that we do has been done before, and it's been done for a very long time. What's different is that now it is accessible to people without highly specialized training, necessarily, and without huge budgets. So the kind of modeling that a data scientist can do is pretty much the same as what a financier or astrophysicist was doing 20 years ago. They just don't need multi-million dollar equipment budgets and tools anymore, which is awesome.

And there at the same time is this extremely powerful algorithm. It's actually a human algorithm, and it's something I've seen work a couple of times. The algorithm is that you find something a lot of people are doing and you name it. Now, Chris Anderson at Wired is the master of this algorithm. This is where phrases like "big data," which we'll get to, come from. "Data science" also came from this, and it was amazing to see how a bunch of people sort of latched on to it.

So when I started at bitly three and a half years ago, I insisted my title was scientist, because I was coming from academia and I didn't want something that would ruin my CV. Our business guy at the time said, "Yeah, sure, whatever, put that on your business card." But actually my title is still engineer in our HR system, which I found out last year when I went to buy an apartment. Data scientist didn't exist three and a half years ago as a field of practice. It had just started to emerge at that point, and it did so in a couple of different ways. There was a core group of people on the West Coast who started promoting it, and a core group of people here in New York who started promoting it as well, and we used to have brunch and argue: should it be data science or should it be something else?

In 2010 I wrote this with Chris Wiggins from Columbia University, who is also another co-founder of hackNY, and this is where the "awesome" comes from. We actually sat down to write down what the process of data science was, because we
could not find it written down anywhere. This was in 2010; it's not that long ago. And there's the URL, if you actually want to pull up our little essay where we go into each bit of this. This seems completely obvious to those of us in this room today, right? You get some data, you clean it up, you play around with it, you build a robust model, and then you make a graph or write about it. This was not obvious in 2010, at least not to me. So we've come a long way. We've essentially invented this field at the intersection of a bunch of different things.

I've been using this slide now for three years, but I always tweak what it actually says. Data scientists today are a blend of mathematics and statistics (and those are not the same thing), as well as engineering, that is, the ability to actually code the systems to do the math and statistics or to get data out of the systems, and the human interface. When I talk to CEOs, I say they can understand your business problem domain. What it actually means is that you know to ask the right questions: you can understand what people are trying to do, translate that into an analysis, and then translate it back into something they can understand. So it's the ability to write, communicate, make pretty graphs, all of these things. We've had these people for a long time, but it gets better, right?

Now I want to take a step to the side and talk about big data, which is similarly something that had no name until some folks from O'Reilly sort of stamped their "big data" on it. Big data actually does mean something, but it's gone way too far. So this is from the Wall Street Journal this morning. Justin Lin sent this to me because he said, "You know, big data is insane." I just did Ctrl+F to highlight "big data" in the article. It doesn't even mean anything. It says "the company already uses this big data technology." Thank you. "Big data requires massive storage and processing power." "Big data analytics across its 120 offices." It's sort
of gone so far it has lost its meaning for technical people. So when people ask what big data is, some people think it's data too big to fit in Excel. Seriously. There's another article going around yesterday on the top of Hacker News about the dangers of Excel (I tweeted it also, so you should read it), about how our financial system runs on Excel. Some people think it's data too large to fit on one node, and this is the operational definition I like to use: it requires specialized infrastructure in order to do an analysis on the data. That's big, right?

There's another definition I like to use, which is that big data is data that can be made useful. We have always had big data. We've always had years of log files thrown on a hard drive somewhere, but it is only now that we have the tools to ask a question of that data and get the answer back before we forget why we asked that question in the first place. And that's useful data, right?

There's a new definition emerging that is really dangerous, and that's that big data will tell you what to do, and you do not have to think or have a theory about your data. Or, in the case of medicine, it will solve all of our health problems; we don't need doctors anymore. It concerns me that this is starting to pick up steam outside of the tech community, and I think we should do whatever we can to fight against it. Data will tell you whether A or B is correct, but it will not tell you what A and B should be in the first place. That's what you do. That's why we have jobs.

And on that topic, what do data scientists actually do? Because it's still not clear. I think if every one of us who raised our hands when I asked who here is a data scientist actually got up and said what we spend our days doing, it would be very different. Harlan Harris and his crew at the DC data science meetup did a survey (of course they did) and they got some really great data to answer this question. Along the top are the
four categories of data science professional, and on the side are the skills. There's the link here as well, or if you Google "data science DC survey" you'll get the results. It's really interesting. We spend our time doing business and strategy, answering business questions and analytics questions, doing machine learning, and what we might have thought of as big data engineering. We do math, we program, we do statistics. And you can sort of classify yourself as a data business person, a data creative, a data engineer, or a data researcher.

And there is a lot of demand for these people. This is from Indeed job trends, searching for "data science". I'm not sure what's going on, because I haven't seen a drop in demand here in New York.

So if data scientists do so many different things, I think what ties us all together as professionals must be how we do those things and the blend of skills we bring to it. And I'd add another thing to that, which is the passion we bring to it. Data scientists tend, in my completely anecdotal experience, to be really smart, creative people who like to solve hard problems. This is from the Carnegie Mellon protest: support vector machines, everybody! All right. Okay, then. Section 1 of this talk, a brief history of data
Info
Channel: MortarData2013
Views: 135,114
Rating: 4.4870691 out of 5
Keywords: data science, Hilary Mason, NYC Data Science Meetup, Data scientist, Bitly, big data
Id: fZuDwiM1XBQ
Length: 12min 41sec (761 seconds)
Published: Wed Mar 06 2013