The Future of Data Science - Data Science @ Stanford

Captions
Stanford University.

Let's do a quick round of introductions.

All right, I guess I'll start. I'm Euan Ashley. I'm a cardiologist in the School of Medicine, and I have the honor and privilege of leading the biomedical data science initiative. I usually say I was born a geek but trained as a cardiologist. I did medical school in Scotland, as you could probably tell from my accent, and some statistics during a PhD in Oxford, and moved to Stanford about 12 years ago. From when I first visited, this was clearly the place I wanted to eventually call home. The combination of bringing the engineering, computer science, and theoretical tools that we have developed in a place like this to the application of medicine is the thing that excites me the most, so I'm looking forward to the discussion today.

I'm Vijay Pande. I'm the leader of the theory initiative, School of Medicine Sciences. I'm actually in a couple of different departments: I'm in chemistry, but also have courtesy appointments in computer science and structural biology, and I'm director of the program in biophysics. So my research is at the intersection of a lot of different areas, but we use a lot of data science methods, especially for understanding fundamental biophysics and for coming up with new small-molecule therapeutics.

OK, my name is Hector Garcia-Molina. I'm in the computer science and electrical engineering departments. I was talked into being the director of the Stanford Data Science Initiative (my colleagues twisted my arm), which has a lot of engineering folks, but there are also people in statistics and other departments, and we already have a lot of existing collaborations with other groups on campus. I should point out that the executive director, the guy who really does everything, is Steve Eglash, who's sitting right here, and if I get a really hard question I'll just ask Steve to answer it.

I'm John Hennessy. I have the same appointment as Hector, an EE and CS
appointment. I'm a computer scientist. I came here to figure out what the difference between a data scientist and a computer scientist is, which I expect to learn today. But anyway, obviously I have an interest in this. I think it's an important initiative, and I think it has many of the interdisciplinary roots that we've tried to hone and improve here in the university. One of the other things that is attractive is that it has deep roots in everything from theory to application, and I think that promises a lot of opportunity for the university, where we have faculty across that entire spectrum.

Perfect. OK, great, thank you. The first set of questions is more about understanding what data science is, just to make sure we're all on the same page. So let me start by asking each of you to pick one application area for data science, be it healthcare, climate, whatever, and explain how data science is going to revolutionize that field. We can take it in turn.

All right, I'll go first. One of the things that I have been focused on in my own group over the last few years is a tremendous technological advance: the change in the cost of sequencing a human genome. You may be aware that it cost three billion dollars to fund the Human Genome Project about a decade ago, and that over the course of the last ten years that cost has dropped to one thousand dollars. So that's a million-fold drop in the cost of sequencing a genome, and it has already had an absolutely transformative role in changing medicine. To give you an example, we had a little baby admitted to the neonatal intensive care unit at the Children's Hospital a few months ago who had not a mystery syndrome but something we understood well, long QT syndrome. This was associated with cardiac arrest, and the little baby had multiple cardiac arrests on day one of life. We know the genetic basis of that condition, but at the moment, with the
standard approach, a genetic test would take three months. It would look at five genes, so very small data, and it would give an answer that would help the physicians do what we're increasingly calling precision medicine (we'll probably come to that phrase later). With the new technology we can do a whole genome sequence, which produces 6 billion data points, in 24 hours, analyze it using algorithms that used to take days and now take hours, and put the result back into the hands of the treating physicians in days. So if I were to think of one example that could encapsulate what data science could bring to medicine, it's this: we've looked at one gene at a time. You go in, you have a lab sample, and maybe 10 results come back to you. But now we have technology that can deliver 6 billion results back to you, and there's a huge signal-to-noise issue there. So the sorts of things that colleagues in engineering have been dealing with for many years suddenly become extremely relevant to things that are happening in the medical school.

The example I would give is also in healthcare, but from a different perspective. What I think a lot about is how we design small-molecule drugs, and right now the procedure is very empirical. If you think about it, when we design a bridge, like the Bay Bridge, we don't do that the way we design drugs. If we did it that way, we'd have some idea for the bridge, we'd think it works, and then we'd first send mice over the bridge to see if they survive. Then, if the mice survive, we would take people who really want to go to San Francisco and put them over the bridge first. We would never do that, right? So you think, why can't we engineer small molecules in an analogous way? There are many different answers, and this is an extremely difficult problem, but I think the opportunity now is that we have huge data sets, and so
huge that you can't have a single person wrap their head around them, and I think the promise of data science is that we'll be able to algorithmically gain insights, and insights that are really non-obvious.

Hector?

OK, well, just to be contrarian, I'm going to mention old applications like finance, commerce, and manufacturing. They're not as trendy, but they're still very important, and there are a lot of great things that can happen to produce products and services cheaper, more efficiently, and more safely, again because we have a lot more data about the process of manufacturing or about the customers and their needs and desires. So it's a combination of more data and better ways to do things.

It's more data and better algorithms.

And better algorithms, yeah.

I think one thing that is going to be possible now comes from the work that's going on in deep learning. The funny thing about deep learning is that it will give computers what we would call common sense. It's really the ability to devise relationships, and that's happening now because we've learned enough about machine learning and we have enough data. You want to learn about the world? Go read the internet. Read the internet, put down all the relationships you get from it, and then build a system that can reason about them. So you need both large amounts of data and large amounts of compute. We're going to spend a lot of money doing things that people could do easily right now for smaller amounts of money, but in ten years they'll be a lot cheaper, and that will be liberating.

For me, in my area, the application of data science that's really exciting is the application to computer security. As you all realize, attackers are constantly doing new things, and for us it's not large amounts of data and new algorithms; it's the fact that there are new types of data being collected in order to detect attacks. Just as an example, LinkedIn is
constantly being bombarded with fake accounts, right? So how do you detect fake accounts in the massive sea of accounts that they have? That's all based on data science techniques.

But is this really a new field, or is it, like John says, just more computer science and more statistics? Is something fundamentally new happening here? Are there new techniques and new ideas being developed, or is it just a new name for an old concept? Yeah, please. [inaudible]

Definitely, data science is new in the same sense that Columbus discovered a new continent 500 years ago, where there happened to be a lot of people living already. So it's sort of similar: there's a lot of data-related work that's been going on for the past years, but what's really new is that a group of influential people have discovered that there's a lot of value and power in data, and in mining it and analyzing it, and that's what's exciting. Now, I'm not implying that everything has been done, right? Even after Columbus discovered America, there were still a lot of good things that happened in America.

[inaudible] a country.

Well, I meant America in the broad sense.

Hector, I want to point out that Columbus's discovery did not go so well for the people who were already there. I hope you're not going to do the same thing in data science.

Why not? Why not?

Yeah, it's a good question. I think this is a good question, because Jeff Ullman, my colleague in computer science, once said, "Anything that has to call itself a science ain't one," and I think there's some truth to that. The question will be: take computer science as a discipline, right? Surely its foundations are in lots of other disciplines, but over time there came to be a core body of knowledge that distinguishes what computer scientists know. It's about algorithms and complexity and computability and automata, and
there's a core body of knowledge. My guess is the same thing will happen here: there'll be a core body of knowledge. It will build on statistics, mathematics, computer science, and other disciplines, but it will be increasingly distinct as the applications frontier moves forward over time.

Yeah. Also, a lot of times we define these disciplines by the curriculum that we teach, because you can be in a computer science department or a biology department or a math department and do whatever research, but often it's the discipline that we want to teach the younger generation that is a key aspect.

I think the absolute scale is also larger. I think there's no debating that, compared to ten years ago, the amount of data in the world, the amount of data that is being passed through the pipes on an hourly basis, is greater. That can, one, lead to learning that wasn't possible with smaller data, and I think it can also lead to opportunity in extending current disciplines. So it's definitely not replacing anything; as John said, it really builds on those and brings them together. For me, I don't really care at some level whether it's new or not. What is clearly new is the availability of data and the problem of what we're going to do with it.

What are the techniques that make the field its own unique, distinctive field, as opposed to just machine learning or just databases or things we've had in the past? Are there specific things that folks here can learn about that make them data scientists?

Sometimes it's not necessarily going to be something truly unique. Making an analogy to other things like biophysics: biophysics is a mixture of physics and biology and other areas, and it's maybe more about how we do that mixture, what the cocktail is like, because different cocktails can have similar ingredients but different quantities of each.

I think lots of people will want
to own certain techniques and want to have them under their wing, and I think that's OK. Again, for me it's the application that counts rather than exactly which box we put it in. For sure, I think the most common answer to your question is machine learning, deep learning, and applying that to extremely large data. That's probably the closest thing that's new that's associated with data science.

But really, isn't that just an extension of statistics?

I think at many levels it is. I would say there's a very broad range of tools that come from multiple toolboxes.

Yeah, I think that's right. We're using a lot of existing tools, but the process of applying these tools to new domains and new applications will improve the tools over time.

I think Dan's point is well taken. I'd say right now, saying you're a data scientist or you're working on data science is a way of defining a collection of knowledge that you have from different disciplines. It's not yet distinctive in that way, and it may not become so, because I think you can see statistics and computer science embracing these techniques and developing them further as a natural evolution of their own disciplines.

So basically we don't quite know yet what makes a data scientist a data scientist, but we intend to learn; that's part of the frontier. But it does sound like if you want to become a data scientist, you need to know machine learning and you need to know statistics. Those are the general areas that you must know in order to call yourself a data scientist. Is there agreement on that?

Yeah. Yes.

Yeah, and something about computation. I mean, I/O is helpful.

So the other question is: what is the role of universities in this new world? How do we get data that is proprietary or has privacy regulations around it? These companies are just not going to share the data with us, so how are
we going to do data science in academia if we don't have the core data?

I actually would challenge the assumption that the companies are the only ones who have data. I think Euan can speak very much to genomic data, which is huge, and we actually have a whole hospital there full of data. Astronomy is very data rich. Not all fields are data rich, but I think there are many, many fields just within academia that are themselves data rich.

Yeah. If you take the hospital for a moment, of the 500 beds, many are telemetry beds, which means that we have real-time signals coming from the heart rate and from ECG tracings, electrical tracings of the heart. Every day, on probably too many patients, we have lab tests coming in in streams. For the most part that data is reviewed on a daily basis, somewhat in a time-series way, manually on a screen (at least no longer on paper), and then dismissed. There is no algorithm sitting in the background that, like the fraud alert on your credit card, understands what the normal boundaries for you are, and probably there should be.

But actually they've got resources that we don't have. At Google you can run a ten-million-core job and it wouldn't be a big deal; that's impossible to do on campus. So I agree, Dan. I think our data is a drop in the bucket. We don't need data on 500 patients, or even a year's worth of 500 patients. We need data on 10 million or 20 million patients if we really want to start working on the problem, and that's got to be a full set of data: not only genomic data, we need lots of medical data. I think we're going to have to find a way, and I thought the secret was going to be... I thought that's why you were on this panel: Dan is the one who is going to help us break into their systems.

I see. [inaudible] There are all the issues related to privacy; that's one of the topics. Yeah,
because I don't think it's an issue of Stanford data versus industry data. It's an issue of people being very paranoid these days, for various reasons, and even the government, which has public data, doesn't want to share it. So getting any type of data is hard. We've had some trouble in the past, although I'm sure that's no longer going to be true, with our hospital: trying to get any data out of there was hard, and there's a good reason for that. There's HIPAA and all sorts of constraints.

One of the things we learned from the NSA and Snowden affair is that just collecting the metadata in fact gave them a fairly good look into some people's lives. So how you anonymize data while maintaining its structure in such a way that you can reason from it is going to be harder.

I think we have a lot of admissions data on campus. Can we use that for research?

I think that is the intention. We have the education data, and we're going to use that. I think this is one of the big issues about moving to more online methods: you can build the analytics, get more information about what students learn, and have a shorter feedback loop. I think we definitely want to do that.

So the university would be willing to share education data that's being collected with the researchers in the data science initiative?

Sure. Sure, and even admissions data at some level. Obviously there are certain things you've got to keep anonymous, but at some level.

That's great. That's good to hear.

[inaudible] fascinating data set. They wanted our help to study this data and tell them what they could do with it, but at the end of the discussion I asked, "OK, so we're very interested; how do we get the data?" "Oh, I don't think we can give you the data."

Anyone here who would like to come and work on it, please come and find me afterwards, and I can point you to any number of data sets that need people to work on them. So I think absence of data is not our
challenge.

But there is still a role, I think, for synthetic data.

I could just see it. You go to your doctor (I guess you're our doctor, but [inaudible]): "So here, I have this pill for you. We've tested it on Tron, this simulated human, a simulation that was written by an undergrad. I have it ready, and it works fine. Tron survived, [inaudible], the student graduated, and we don't understand how the simulation works, but we think it's OK."

I worry about models that are developed around synthetic data, and the fact that you fit the algorithm to match the synthetic data. I worry about that. I also worry... You know, we've kind of moved away from this name "big data", and I think one of the reasons we did is that there's lots of big data out there, but lots of it isn't particularly complicated. It's simple data; there are no complex relationships. When you think about synthetic data, how are you going to get synthetic data that has deep, complex relationships that are not artificial but in fact look like the real thing?

I think ultimately we're all talking about the same thing. We're not going to simply put data sets together and then sit back and say, "Well, we feel really good now, great data set." The point of putting the data sets together is to ask a question, so I suspect we don't disagree as much as maybe we seem to.

Yeah. We led off with some examples of questions, but I think that good questions are very domain specific, and the part that is common is maybe more the methods.

Yeah. Another quick example is the idea that builds upon federated queries but moves towards federated statistics. As we mentioned earlier, one hospital (or, if I keep away from the domain, one unit) may not have all the data in one place that you want to analyze, and there may be privacy (we'll get to this) or regulatory reasons that you
can't actually bring that data together in one place.

There are a lot of really challenging problems in just getting data that's usable. It's not just "get some file from somebody and you're all set to go." You have to organize the data. You have to understand what it means when you have a field there: what instrument was used to measure this? How do I combine this reading in this field with another reading that's related but was done with a different instrument or measured in a different way? So there are a lot of challenges. How do I represent the accuracy in the data itself?

And generally, how do we adapt the curriculum to deal with teaching data science and data science skills?

One of the things that I have always been really impressed by in the new CS curriculum is that CS can be done in a very project-oriented way. If you think about 106A, students can see the value of computer science very quickly because they're actually doing it. You can imagine having data-science-like projects in other disciplines, where the project has to do with analyzing the data. Maybe they don't collect the data themselves, as that would be very onerous and maybe impossible in some cases, but I can imagine that there could be applications like that. My personal prediction is that education will start to look more and more project-based, as I think it has a greater impact on the students and is also just more exciting.

Yeah, we've had successes in CS with some classes where, at the beginning of the class, the instructor has put together 20 or 30 data sets and the students can pick one that interests them. Often they can even go and talk to the person who generated that data set, and then they can do an interesting project with it. So I think that same idea could be used in many disciplines.

So you see this as... I mean, many faculty here are teaching classes that could adapt to that model, where there would
be more data sets available and the courses would actually focus on how to analyze those data sets?

Possibly. I'm not sure, since I'm not an expert, how far we can go to data mine Shakespeare or something like that, but it can be done. There's actually someone here in the English department who's been analyzing old manuscripts. I don't know, I think Shakespeare.

In the medical school I think we have a slightly different challenge, in that we're generally training people not to be data scientists; we're training them to be doctors. Yet data science will be such an important part of medicine going forward that we need them to be much more quantitatively aware than they currently are. So there have been a number of movements towards thinking about how to increase the amount of exposure to those topics that our medical students have, because they're going to need to understand the theory behind it, even if they don't necessarily have to be able to apply it at the command line, and understand what they can trust and what they can't.

Yeah. We're going to have lots of students who learn about data science techniques in the context of whatever they're doing, biology, chemistry, engineering. But if you think about core data scientists, I think we also probably want to think about a core educational program that trains people who are deeper in the fundamental techniques and in applying those techniques, and maybe not so focused on the applications in any single area.

So here I have to remind everybody of the example of a father who went and complained to Target about his teenage daughter getting bombarded with ads for cribs and diapers, only to find out that Target had figured out that his teenage daughter was pregnant before she told him. So that was quite interesting.

True story?

True story, absolutely true story. So the question is how
did Target actually do this? It turns out they have data scientists who were actively working on their data. They had mothers sign up for their baby registry, and then they looked at what pregnant mothers, soon-to-be mothers, buy. It turns out that in their second trimester they tend to buy unscented lotion, and in their first twenty weeks they load up on supplements like calcium and magnesium. Quickly they correlated that with the fact that someone is expecting, just based on their purchase pattern. So that's an example of crossing the border of ethics and data, and I guess that's what you're getting at: should there be limits to what data is being collected? And generally, what do we do about the ethical question?

And should we teach them the difference between correlation and causation?

There you go. Very important.

[inaudible] the data, you can study it, and if you discover patterns like this, as long as it's not referring to a specific teenage girl, I don't see an issue. It's the specifics, when it leaks that this specific girl is pregnant, that's when you...

But Hector, let's put it this way. So now you discover this connection, you publish it, and the next thing that happens is Target deploys that algorithm using your discovery, or somebody else deploys it, and they start bombarding people with advertisements. Now they've used your insight to create a problem.

Yeah, but this can happen with whatever algorithm I develop. It can be used by bad guys to do bad things. So I shouldn't be doing my research? This is a common issue in technology, from atom bombs to anything else.

I think the challenge especially is that if Hector didn't do it, maybe 10 or 20 years later someone else would.

Of course, somebody else.

In the digital world, increasingly, you have to do something really contrived
too, if you really want ultimate privacy. [inaudible]

I mean, Dan's right, there's a scale issue here and a pervasiveness issue that's very different. Look at credit card data. Does everybody use credit cards? Yes. And imagine that data being crunched not just by the credit card firm but then sold, because of course it is sold, and used in lots of different ways to determine buying techniques and things. And this is true in all the stores now. The new way to try a new product out is you pick a few stores and you watch who buys it, and since they use a credit card, you have information about who buys it. Now you deduce from that what kind of people might buy that product, and you decide what stores should have it and how much space it should get.

That's why when I go shopping I wear a brown paper bag on my head.

Do you use cash?

Bitcoin.

Yeah, some kind of regulation is going to be necessary. I think trying to shut down the research side of the agenda doesn't work. It basically doesn't work. So you're going to have to figure out what companies' regulations are about security and privacy, and the problem is that that involves government, and government moves about 10 times slower than technology moves. So it's going to be very hard to develop that. You just have to look at what's happening with copyright law and how far out of date it is, 50 years out of date, probably something like that, right? So we're going to have that problem as a society.

Is this a soluble issue? [inaudible] That's it.

OK, I guess we have to leave some topics for next time. I did want to talk about the role of libraries: can libraries help with data collection and data retention? But I guess those are things we'll have to talk about...

I think "they'll have to" is the answer. They'll have to adapt.

Including the Stanford library?

Yes. [inaudible] We don't have the most
old-fashioned library in the world. We're ahead of that.

Oh yeah. [inaudible] So actually, I did want to advertise: there is something called the Stanford Digital Repository, [inaudible] about books, libraries, about data.

OK, well, thank you, everybody. Thank you.

For more, please visit us
Info
Channel: Stanford
Views: 110,255
Rating: 4.8898387 out of 5
Keywords: Data Science, Data, Stanford, Stanford University, Medicine, Computational Science, Computational Mathematics, Artificial Intelligence, Internet-of-Things, Social Science
Id: hxXIJnjC_HI
Length: 25min 49sec (1549 seconds)
Published: Mon Jul 06 2015