How to Make a Data Science Project with Kaggle

Captions

YUFENG GUO: On this episode of "Cloud AI Adventures," I've invited Megan Risdal to join me on the show. Together, we'll cook up our own data science project on Kaggle. How are you doing today, Megan? MEGAN RISDAL: I'm doing great. Thanks so much for having me on your show. YUFENG GUO: Awesome. And before we get going, I wanted to let you have a chance to talk a little bit about what you do at Kaggle and your role. MEGAN RISDAL: Sure. So I'm the product lead for datasets at Kaggle. And what that means is that I work with our engineers, our designers, as well as our community to build tools that help data scientists find, share, and analyze data. And today, what we want is for Kaggle to be the best place for our 1.7 million data scientists to share and collaborate on data science projects. YUFENG GUO: Awesome. And so today, we'll be working together to use the freshest ingredients-- MEGAN RISDAL: Data. YUFENG GUO: --and prepare them using different tools and work together to come up with our delicious outcome, this public dataset and notebook that we can share with the world that has cool analysis to go with it. MEGAN RISDAL: Yeah. That's exactly right. And I'm excited today because we're going to really make this a collaborative project. So that's how we're going to get things done is together-- teamwork. YUFENG GUO: Teamwork. All right. Let's go. So Megan, on a previous episode of "AI Adventures," I had a video that showed how to get started with Kaggle kernels. And it was pretty rudimentary in terms of just get started, go, it's awesome, it's a free resource. But since then, there's been a couple of new features released that really enhance the functionality of Kaggle, both kernels and datasets, to be used as a great tool for individuals and teams. MEGAN RISDAL: Yeah, that's exactly right. So today, Kaggle's a really great place for people who use R and Python to work with data. 
They're looking, really, to build data science portfolios, do data analysis work, or even share research. It takes a lot of tools to do data science. And Kaggle really acts as this one-stop shop that provides all of these tools that make this possible, from working with data privately, to sharing it with the world. YUFENG GUO: And that really-- it's really fantastic. Let's explore a little bit more about the fact that Kaggle datasets and kernels can support this kind of collaborative model, this private mode, if you will. MEGAN RISDAL: So some more recent features are the ability to publish and work with private datasets and kernels. And speaking of kernels, this is basically like a laptop in the cloud. It's more powerful than the laptop that I'm working with here today. You've got 16 gigs of RAM, four CPUs, six hours of compute. And one of the really exciting things is that it is all in a Docker container that has all of these packages, that data scientists love, pre-installed. So you've got this environment, this one-click environment. And then finally, we're starting to add on more customization, so if there's any packages missing, I can install those or even do things like add a GPU. YUFENG GUO: Ooh. MEGAN RISDAL: Yeah. YUFENG GUO: Very nice. We've picked out a particular dataset today to play with around data from the city of Los Angeles if I understand correctly. MEGAN RISDAL: Yeah, that's right. So a lot of governments and organizations from around the world and in the United States are making open data available as part of their open-data initiatives to make their work more transparent. So I'm from Los Angeles. I live in Los Angeles. And I was kind of interested in taking a look at some of the open data that the city of Los Angeles makes available. So I was poking around on their open-data portal. And this one caught my eye because I'm a little bit of a foodie. It's a little interesting. 
But it's actually environmental health code violations from restaurants and markets in Los Angeles. YUFENG GUO: OK. All right, let's get into it. Yeah. MEGAN RISDAL: Yeah, so what I've done is I downloaded the dataset. So it's on my local machine right now. YUFENG GUO: Great. MEGAN RISDAL: And what we're going to do is upload it to Kaggle. This is going to be the foundation of our project. YUFENG GUO: Awesome. And one of the things that, a lot of times, I hear about is-- and some people are concerned around distributed computing and massive datasets. And you just mentioned, you download this dataset to your local machine. And some folks say, oh, I need lots of compute and resources. Is Kaggle going to be powerful enough to support my use case? And I guess, looking at Kaggle, and just the vibrant community you mentioned-- was 1.7 million? MEGAN RISDAL: Yeah. That's where we're at today. YUFENG GUO: That's amazing. It clearly shows that there are so many use cases beyond the massive, massive datasets out there. There's the situations where you can get away with just one powerful machine that can take you quite far. MEGAN RISDAL: Yeah, that's right. Yeah, and we-- people are uploading thousands of datasets per month. YUFENG GUO: Yeah, wow. All right, so let's go over to your laptop and see how we go about doing that. How do we make a new dataset on Kaggle? MEGAN RISDAL: Sure. So we're going to start from the datasets page on Kaggle's website. So this is what it looks like. And basically, this is where you have access to all of the datasets that have been publicly published on Kaggle. And we're going to add on our own today. So what I'm going to do is I'm going to click New Dataset. And then, from here, it's just a matter of dragging and dropping the files that I've chosen to upload. And these are inspections of restaurants and markets in Los Angeles and then violations. And then, we're going to add a little bit of metadata to get the dataset started. 
So I'm just going to grab all of the information I need here. So we're going to keep it private because, like we talked about, we want to prepare the dataset so that it's well documented. And then we're also going to play around with the data a little bit and create some kernels before we share it publicly. YUFENG GUO: Yeah, awesome. And that's definitely something that doesn't get talked about as much is documentation for datasets. MEGAN RISDAL: Yeah, that's right. YUFENG GUO: The documentation for code is very well understood, and people hammer that home. But documentation about datasets is kind of a new concept. MEGAN RISDAL: Right. Yeah, it's really about making data accessible. It's not just making the data files themselves machine readable-- so having well formatted CSVs-- but also helping anybody who's interested in working with this data really understand it. So I'm going to go ahead and just click on Create Dataset. YUFENG GUO: Fantastic. All right. And your private dataset was successfully created. MEGAN RISDAL: Yay. YUFENG GUO: Whoo. MEGAN RISDAL: Cool. So now the private dataset was uploaded. And like it tells us here, now we can do anything from starting to analyze the dataset already to adding collaborators, and we're going to do both of those things. YUFENG GUO: Fantastic. MEGAN RISDAL: So we'll click confirm, and it's going to take us to our dataset. YUFENG GUO: Looking good. MEGAN RISDAL: Yeah. YUFENG GUO: That's, like, a real thing. MEGAN RISDAL: Yeah, that's right. So what we want to do when people create a private dataset is make it easy for them to, then, make that dataset public eventually and share it with the community. So we provide this quality checklist that helps people basically document their dataset and help them be successful when they share it. So we're just going to quickly go through this quality checklist. So the first is providing a description. And this is just a markdown file, so I have it saved here. YUFENG GUO: Great. Yeah. 
I mean, that's really nice that there's some guidance on what sorts of things to add in to make a dataset nice. MEGAN RISDAL: Yeah, yeah, that's right. YUFENG GUO: Make for a good experience. MEGAN RISDAL: Yeah. So I think that things like understanding the context of the data and why it's interesting and why you're sharing it is important, as well as providing more details about the contents of that dataset, so that's what we've done here. And then also inspiration-- so some questions that you can use the data to answer. YUFENG GUO: Yeah. I've seen that in some of the other datasets out there. Now I know why [INAUDIBLE], there's some guidance there. MEGAN RISDAL: That's right. Yeah. So then the next thing on this page is we're going to add just a couple of tags. And this helps make the dataset more discoverable once we're ready to share it publicly. So we'll do Public Health and Food and Drink. YUFENG GUO: Seems reasonable. MEGAN RISDAL: Seems reasonable. So then we're going to add a subtitle and a banner image. And this is just to add that final coat of paint to make it look good and, again, help people understand what the dataset is about. YUFENG GUO: Yeah-- a little flair. MEGAN RISDAL: Yeah, that's right. YUFENG GUO: OK. MEGAN RISDAL: So we'll save that. YUFENG GUO: And we want them to replace this image? MEGAN RISDAL: Yeah. So this is what Google will see in the dataset listing. And you're not supposed to judge a dataset by its cover. But if it has a flashy image-- that can only help. YUFENG GUO: Yes. I always pick datasets that have an image of a sliced onion over ones that don't. MEGAN RISDAL: That's right. It looks delicious. And then finally, the most important part is I'm going to add you as my collaborator on this dataset. YUFENG GUO: So now I get to see it? MEGAN RISDAL: Yeah. YUFENG GUO: OK. So eventually-- MEGAN RISDAL: There you are. And I will grant you edit access. YUFENG GUO: Well, thank you. Megan Risdal invited you to edit the dataset. Great. 
And so I can click View on Kaggle? MEGAN RISDAL: Yeah. YUFENG GUO: And let's see what that looks like. Awesome. So this looks basically the same as it looks on your side. MEGAN RISDAL: Yeah, that's right. Cool. So we have uploaded our data, we've documented it, and I've shared it with you. One of the things that we like to encourage people to do is to also document their datasets through code. So what I mean by that is publishing a kernel on a dataset is one way to demonstrate to users, and other people in the community, what they can do with your data. So we might want to show somebody in a kernel how they can read in the data, some of the things that we can visualize using the data, questions that can be answered using it. YUFENG GUO: Yeah. I mean, when I see datasets on Kaggle these days, they all have these exploration notebooks with fancy visualizations, and it's really nice. MEGAN RISDAL: Yeah. Yeah, exactly. And when you start working with a new dataset, usually, when you're working locally, you're starting from a blinking cursor. You don't have any code that shows you how to read in the data and how to work with it. So that's what we're going to do is, we're going to additionally document our dataset by publishing a kernel on it. YUFENG GUO: Fantastic. MEGAN RISDAL: Let's get started. I'm just going to click on this Big Blue Button, as we call it-- New Kernel. YUFENG GUO: Yes. MEGAN RISDAL: So here, we have a choice between a script and a notebook. I'm going to go with notebooks because I like interleaving markdown and code. And then while this starts up, I can see that I have the data accessible right here at my fingertips in my environment. YUFENG GUO: Great. MEGAN RISDAL: And I'm going to change the language to R. I am an R Stats person. That's right. YUFENG GUO: All right. MEGAN RISDAL: Cool. So what I've done is I cheated, and I already prepared the code that I'm going to use. So I'm just going to quickly upload it here. 
And then I'm going to walk you through what I've done to analyze the dataset. YUFENG GUO: Great. MEGAN RISDAL: So in the first cell, we have the inspections CSV file and the violations CSV file. So I'm going to go ahead and read those in, join them together by serial number, and then take a glimpse at the resulting data frame. YUFENG GUO: OK. MEGAN RISDAL: So once that's done, you can see that we have almost 900,000 records that we're looking at. So these are all health code violations for about two years of data. YUFENG GUO: OK. That's a lot for two years. MEGAN RISDAL: Yeah, it-- yeah, it seems like it. So we're going to dig into what that looks like. So what I want to do, now that I've got the dataset prepared, in the shape I want it, is look at the number of violations reported over time by month. YUFENG GUO: Right. This is the big one. MEGAN RISDAL: Yeah, exactly. All right. YUFENG GUO: All right. MEGAN RISDAL: So you can see how quick and snappy that is. And we've got this visualization that's-- cool-- right in front of us. So that's a lot of health code violations. YUFENG GUO: Yeah. It's all over the place. What does that look like-- what is that bar-- 30,000? MEGAN RISDAL: Yeah, that's right. YUFENG GUO: In a month? MEGAN RISDAL: Yes. Yep. YUFENG GUO: That's a doozy. MEGAN RISDAL: Yep. So let's take a look and see if there are any seasonal trends. And we also have information about what the violations were for each serial number. So we'll take a look at that. And we're going to look at just the top 10 violations, so that's what this code is going to be doing here. YUFENG GUO: We run that, and then we're going to get-- wow, very nice color coding here, yeah. Is that-- the darker one is more, or the lighter ones are more? MEGAN RISDAL: The lighter ones are more. YUFENG GUO: OK. MEGAN RISDAL: Yeah. Yeah, so you can see this one here is a violation of the code for floors, walls, and ceilings are properly built, maintained, in good repair, and clean. YUFENG GUO: OK. 
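[Editor's sketch] The notebook in the video is written in R and its code does not appear in the captions. As a rough Python illustration of the steps described above (read the two files, join them on serial number, count violations per month, list the most frequent violations), something like the following could work. The column names `serial_number`, `activity_date`, and `violation_description`, the date format, and the tiny inline sample rows are all assumptions for illustration, not the real Los Angeles data.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical stand-ins for the inspections and violations CSV files.
INSPECTIONS_CSV = """serial_number,activity_date,facility_zip
A1,2017-01-15,90012
A2,2017-01-20,90013
A3,2017-02-03,90012
"""

VIOLATIONS_CSV = """serial_number,violation_description
A1,Floors walls and ceilings
A1,Food storage
A2,Floors walls and ceilings
A3,Floors walls and ceilings
A3,Hand washing
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def join_on_serial(inspections, violations):
    """Inner-join each violation to its inspection by serial number."""
    by_serial = {row["serial_number"]: row for row in inspections}
    return [
        {**by_serial[v["serial_number"]], **v}
        for v in violations
        if v["serial_number"] in by_serial
    ]


def violations_by_month(joined):
    """Count joined violation records per calendar month (YYYY-MM)."""
    months = (
        datetime.strptime(row["activity_date"], "%Y-%m-%d").strftime("%Y-%m")
        for row in joined
    )
    return Counter(months)


def top_violations(joined, n=10):
    """Return the n most common violation descriptions with their counts."""
    return Counter(row["violation_description"] for row in joined).most_common(n)


joined = join_on_serial(read_rows(INSPECTIONS_CSV), read_rows(VIOLATIONS_CSV))
monthly = violations_by_month(joined)
top = top_violations(joined)
```

With real files, `read_rows` would be replaced by opening the uploaded CSVs; the join and the two counts would stay the same.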
MEGAN RISDAL: Yeah. YUFENG GUO: It's always comforting to know that your establishment is in good repair. MEGAN RISDAL: Yeah, that's right. And then finally, I'm going to just save another project for later that I have in mind is, I want to look at the violations by zip code. So we've also got information for each of the facilities with what their address is. So we can look at whether or not there are more violations by zip code and look at a geospatial analysis. But I want to do that a little bit later. So I'm just going to write that CSV to a file. And I'll be able to use that in another kernel. YUFENG GUO: Right. And you can imagine-- I'm just trying to think about this new output that you've created, you could make some kind of mapping with it. You could do one of those fancy color-coded heat maps. MEGAN RISDAL: Right, yeah. YUFENG GUO: We have this sort of heat maps, which shows the violations by type, but you could also show-- MEGAN RISDAL: Yeah, like a choropleth map geospatial-- YUFENG GUO: There's a tongue twister. MEGAN RISDAL: Yeah, choropleth. Yeah, exactly, and you can see, now, how taking just a peek at this dataset has already inspired new questions. And that's exactly what we want to do for our users. So I'm going to go ahead and give my notebook a title. YUFENG GUO: Yes, always good to have a title. MEGAN RISDAL: Yeah. And then I'm going to hit Commit and Run. YUFENG GUO: OK. So let's hit that. And while that's running, a question for you-- does the notebook not save if you haven't clicked Commit and Run? If you were to close that tab before you click that, what would happen to all that code, all that work? MEGAN RISDAL: So it's saving a draft. But if you want to save your code and come back to it later and share it with other people, you want to hit Commit and Run. And what that does is it executes the code from top to bottom. YUFENG GUO: Right. Perfect. So once that's done, I guess-- what is our next step? 
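[Editor's sketch] The CSV export for the later zip-code analysis might look roughly like this in Python. The captions do not say whether the notebook writes the full joined table or an aggregate, so this sketch assumes a per-zip-code count; the `facility_zip` and `n_violations` column names and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical joined records: one row per violation, with the facility zip code.
joined = [
    {"serial_number": "A1", "facility_zip": "90012"},
    {"serial_number": "A1", "facility_zip": "90012"},
    {"serial_number": "A2", "facility_zip": "90013"},
    {"serial_number": "A3", "facility_zip": "90012"},
    {"serial_number": "A4", "facility_zip": "90013"},
]


def count_by_zip(rows):
    """Count violation records per facility zip code."""
    return Counter(row["facility_zip"] for row in rows)


def write_zip_counts(counts, out_file):
    """Write zip-code counts as CSV, sorted by zip code for stable output."""
    writer = csv.writer(out_file)
    writer.writerow(["facility_zip", "n_violations"])
    for zip_code, n in sorted(counts.items()):
        writer.writerow([zip_code, n])


zip_counts = count_by_zip(joined)

# An in-memory buffer keeps the sketch self-contained; in a notebook this
# would be a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_zip_counts(zip_counts, buffer)
csv_text = buffer.getvalue()
```

A file written this way from a committed kernel is the kind of output that a second kernel could then read in as its input.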
What's our plan here with-- because right now, we have a dataset that's private, but shared between us, and we have this kernel, which I think is still just private to you, right? MEGAN RISDAL: Yeah, so once this is finished, I'm going to go ahead and click View Snapshot. And this is going to take us to the Notebook Viewer. And from here, this is what I'll be sharing with the world. And this is what somebody looking at the dataset can come and find. So I'm going to go ahead and, again, share this with you just to make sure that you think that all of our work is ready to be made public. YUFENG GUO: Right, yeah, so in a team environment, you could do this to essentially do some sort of a code review scenario. MEGAN RISDAL: Yeah, exactly. YUFENG GUO: OK. So once you've done that, I can go over here on my laptop and, in the dataset, click Kernels and go to your work, which I guess, in this case, is your work. MEGAN RISDAL: Right. YUFENG GUO: And we'll open up your notebook here. And you can see that it loads nicely. And I have the option to either edit or to fork the notebook. MEGAN RISDAL: Yeah. So why don't you go ahead and fork it, and just make sure that everything runs as expected, and you can get everything to compile. YUFENG GUO: So when I fork it, is that then similar to when you fork a repo on GitHub-- MEGAN RISDAL: Right. YUFENG GUO: --where you make your own copy? Now, this is really mine? MEGAN RISDAL: Yeah. This is your copy of, not just the code, but also the data that I used and the environment that I used. YUFENG GUO: OK, gotcha. And so anything you now make changes to on your side won't affect my copy. MEGAN RISDAL: Correct. YUFENG GUO: OK. So now, I'm running it, and this will generate a different kernel. Do I need to change the name? Will there be a name collision there if I leave it the same? MEGAN RISDAL: You don't need to change it. So the slug that gets used is your username and then the slug of the notebook title. YUFENG GUO: Gotcha. 
So I could change it, but I don't have to. MEGAN RISDAL: Right. YUFENG GUO: Now, if I go back to-- oh, we can watch it do its thing, and I could share my fork with other folks. MEGAN RISDAL: Yeah, that's right. YUFENG GUO: And we can see our data. If I click back now, I guess this is-- it should be-- once it finishes, it'll just show up? MEGAN RISDAL: There it is. YUFENG GUO: And there it is. All right. MEGAN RISDAL: Awesome. So what do you think? YUFENG GUO: It's pretty good, pretty good. I guess it's time to make this thing public for real. MEGAN RISDAL: Yeah, let's go public. YUFENG GUO: All right. MEGAN RISDAL: Cool. So I'm going to go back to the dataset. And I go to Settings, Sharing, and if we think we're ready, we can click Make Public. YUFENG GUO: All right, let's do it. So this is a Make Public Permanently. MEGAN RISDAL: That's right. YUFENG GUO: Great. That's always good to know what you're getting into here. Nice. MEGAN RISDAL: And here we go. And then the next step is, of course, we want to make the kernel public. YUFENG GUO: Oh, right. Because the kernel itself is separate from the dataset, and so those two concepts are distinct. MEGAN RISDAL: That's right. YUFENG GUO: And so in this situation, it could be one where you wrote some stuff, and then you make yours public. But then I fork it, and it's private. And I can extend it privately with you or with other folks and then release another version, perhaps, with some different analysis. MEGAN RISDAL: Right. Yeah, exactly. So that flexibility is up to you. YUFENG GUO: Awesome. MEGAN RISDAL: So our dataset is public. And anybody from our community of data scientists can go ahead and explore more about restaurant inspections and violations in Los Angeles County. YUFENG GUO: That's right. That's right. And zooming out and looking at what we've covered today, it's quite a bit. And the tools are all in this package, in this really nice, seamless platform. I really enjoyed going through. 
And we made a notebook and dataset on your side. We were able to share it across privately and then publicly. And we didn't even get into things like commenting system and the discussion forums. And there's so much more to Kaggle. But even this environment of collaboration and sharing is so rich. MEGAN RISDAL: Yeah. So we really did create a project from start to finish. We got data files off of my local machine and into this reproducible, documented dataset that's now publicly shared with the world. And you can kind of see how somebody could do this for a school project or as a way to share research. YUFENG GUO: Absolutely. Yeah. So this notebook is really public. So if you're watching this video, you can go on Kaggle right now and access this dataset. We'll include links to the notebooks in the description below the video and share it, and then you'll be able to see the notebook, the dataset, and post comments, fork your own notebook, make edits. Thanks so much for joining me today, Megan, on the show. It's been really fun putting together this Kaggle kernel. We're making this dataset and making it public to the world. If you've liked this video, be sure to hit the Like button down below and click Subscribe to get all the episodes of "Cloud AI Adventures" right when they come out. For now, me and Megan, we're going to go back to working on this kernel. But this time, maybe I can convince her to do it in Python. MEGAN RISDAL: We'll see about that. YUFENG GUO: All right.

Info

Channel: Google Cloud Tech

Views: 116,340

Rating: 4.9525261 out of 5

Keywords: How to make a data science project with kaggle, kaggle, how to use kaggle, kaggle kernels, intro to kaggle, kernels, data science, data scientist, getting started with kaggle, public datasets, data, Data science portfolios, create a dataset, csvs, private datasets, data sets, kaggle community, kaggle notebooks, jupyter, python, data analysis, data sharing, machine learning, ML, AI, artificial intelligence, google cloud platform, GCP, google cloud, Yufeng Guo, GDS: Yes;

Id: m2DfpM6MyB8

Length: 20min 59sec (1259 seconds)

Published: Tue Jul 03 2018