Datalab: Notebook in the Cloud

Captions
YUFENG GUO: Streaming your data down to your local environment can be slow and costly. In this episode of AI Adventures, we'll see how to bring your notebook environment to your data.

What's the only thing better than an interactive Python notebook? An interactive Python notebook with fast and easy cloud connectivity, of course. Last episode, we looked at just how useful Jupyter notebooks are; if you missed it, I've included a link in the description below. This time, we'll see how they can be taken even further by running in the cloud, with lots of extra goodies. We'll run some analysis with powerful tools and even larger datasets.

As you work with larger and larger datasets in the cloud, it becomes more and more unwieldy to interact with them from your local machine. It can be tough to download statistically representative samples of the data to test your code on, and streaming the data to train locally relies on having a stable connection. So what is a data scientist to do? Well, if you can't bring your data to your compute, bring your compute to your data. Let's see how you can run a notebook environment in the cloud, close to your dataset.

Google Cloud Datalab is built on top of the familiar Jupyter notebook, with some added abilities, including easy authentication with BigQuery datasets, fast operations to Google Cloud Storage, and SQL query support. Let's create a Datalab notebook and get set up for running our notebook in the cloud.

Install the Datalab component using gcloud components install datalab. Then you'll have a new command-line tool called datalab, and starting Datalab is a one-line command. That command spins up the virtual machine you'll use for your analysis, configures the network, and installs the necessary libraries, including TensorFlow, pandas, NumPy, and more.
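Concretely, the setup from this episode boils down to two terminal commands. This is a sketch, assuming you have the Google Cloud SDK installed; the instance name is just an example, and it becomes the name of the Compute Engine VM:

    # Install the Datalab command-line component
    gcloud components install datalab

    # Create a Datalab VM and connect to it (instance name is an example)
    datalab create ai-adventures

Once the connection is up, the datalab tool serves the notebook environment through an SSH tunnel, typically at localhost:8081 in your browser.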
Once Datalab starts, it opens up a notebook environment that looks pretty similar to what we saw with Jupyter notebooks. Unlike before, though, this environment is running on a virtual machine in the cloud.

Datalab installs some samples by default, which make a great place to start exploring. Let's take a look at the Hello World notebook in the docs folder. Here, we can start playing with the notebook right away, running the cells and experimenting. This is super convenient, since there's no need to deal with installing and configuring different Python libraries.

Let's take a look at a few additional tools that are built in. We'll click on the account icon in the upper-right corner and look through the options and information presented there. First, note that the notebook is running as a service account. The service account is already authenticated with the assets of the project we're in, but if we want to access resources from another project, we need to grant access to this service account, not to a user account. Since the virtual machine the notebook runs on is accessible to anyone who can access the project, we don't want to expose our own account credentials in the Datalab notebook.

Continuing downward, we see that this Datalab notebook is hosted on a virtual machine on Google Compute Engine, that the virtual machine is called AI Adventures, and that we can shut down the VM by clicking the button here. By default, Datalab will shut down your virtual machine once it's been idle for about 30 minutes, and you can toggle this feature by clicking the message to disable and re-enable it.

The timeout can also be set to a custom value; let's go over to Settings to see how. The value we set here remains in place across reboots of the virtual machine, and if you set it to zero, auto-shutdown never occurs, so that's just another way to toggle it off. Additionally, you can select a light or dark theme, so you can make the screen dark if you're feeling like it, or stick with the light theme and keep it airy.

Now that we have our Datalab notebook set up and we've gotten familiar with the environment, let's see what we can actually do with it. Today I'm going to run through an example that looks at the correlation between programming languages used on GitHub. That is, if you program in language A, how likely are you to also program in language B? The notebook is in the samples directory, under the docs directory we're in right now, and I've also placed a direct link to it on GitHub in the video description.

This notebook uses NumPy and pandas to analyze BigQuery data and visualize it using Matplotlib. So, naturally, the first step is to import those libraries.

The table we'll be pulling from is the GitHub timeline table, which lives in the public samples dataset on BigQuery. Let's see what the table schema looks like. Notice that we can use the %%bq magic to send operations to BigQuery without needing any additional scripting or authentication, and without dealing with REST API responses. There certainly are a lot of columns! We can also see how many rows there are: running this query pulls the number of rows from the table metadata, and we see we have about 6.2 million rows.

We can also use BigQuery's sampling feature, which pulls rows directly from the table into Datalab as a smaller chunk of data. That way you don't need to query the whole table, which would otherwise take a while, and it lets us easily see the format of the data and do some sanity checking. Here we've pulled five sample rows and just a few of their columns, really just a slice of the data.
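To make that concrete, the exploration cells look roughly like the sketch below. Each snippet goes in its own notebook cell, since a cell magic has to be the first line of a cell. The %%bq flags come from the google.datalab magics and may vary slightly across Datalab versions, and the table name assumes the public samples dataset:

    --- cell: inspect the table schema ---
    %%bq tables describe --name bigquery-public-data.samples.github_timeline

    --- cell: count the rows ---
    %%bq query
    SELECT COUNT(*) AS row_count
    FROM `bigquery-public-data.samples.github_timeline`

    --- cell: pull a few rows for sanity checking ---
    %%bq sample --table bigquery-public-data.samples.github_timeline --count 5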
Now let's move on to the core of our analysis. These are the queries we'll run to grab the GitHub commit data and the languages associated with it. Notice, again, that we're using the %%bq magic, this time to create a SQL query. We get to write SQL directly in the cell, complete with syntax highlighting, rather than having to wrap it in a string for an API call.

These cells create the queries, but they don't execute them yet. To execute one, we just call execute. Notice that popular_languages is the name we gave our first query: that's where we construct the query, and here is where we execute it. Looking at the results, we see that JavaScript, Java, and Ruby have some of the largest numbers of pushes, while at the bottom, R, MATLAB, and Puppet have the fewest pushes in this sample dataset. We'll also execute the other query we defined and see the number of pushes by user. Anyone see themselves on this list, perhaps?

Next, let's load our query results into a pandas DataFrame. It doesn't get easier than this: we just call to_dataframe. Now we can use all our pandas analysis skills on this new DataFrame. With the data loaded, we can call describe and see some interesting stats, including who has contributed in the largest number of distinct languages. Looks like someone has contributions in nine different languages. Pretty impressive.

Now we need to shape our data toward our end goal of correlating languages with each other. Specifically, since we want a column for each of the 25 languages, we need one row per user name and 25 columns across. pandas has a handy function for exactly this, called pivot, and we'll use it to create our 25-column table with all of our users. This table is going to contain a lot of zeros, because even the most ambitious user, as we saw, has pushed to only nine different languages.

Now we can compute the correlation between the columns for each possible language pair, 25 by 25, using the corr function. Because corr is the last expression in the cell, the resulting 25-by-25 table prints out. Unfortunately, this table of decimals is no fun to look at, and it's hard to tease out any trends from it. So we'll use Matplotlib to make a more colorful plot that shows where the notable correlations lie.

When we run that: wow, that's a lot of colors. The stronger the blue, the more positively correlated the two languages are, and the stronger the red, the more negatively correlated the given language pair. There are some interesting pairs of note here. First, Java doesn't really mix well with a few other languages: it doesn't show up much alongside JavaScript, PHP, or Ruby. C and C++, on the other hand, do seem to go together, which isn't super surprising. CoffeeScript and JavaScript also land in a blue box as a pair that's often combined. Looking more broadly at the dataset, I noticed that Ruby and PHP both have a lot of red, some of it quite deep, across their rows. It makes me wonder which of these languages would win the prize for being least likely to go with another language. Perhaps something we can analyze another time.
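To make the query steps concrete, here's a sketch of the named-query cells. Treat it as an approximation rather than a copy of the sample notebook: the column names (actor, repository_language, type) are assumptions based on the github_timeline schema and may need adjusting:

    --- cell: define, but don't yet run, a named query ---
    %%bq query -n popular_languages
    SELECT repository_language AS language, COUNT(*) AS pushes
    FROM `bigquery-public-data.samples.github_timeline`
    WHERE type = 'PushEvent' AND repository_language != ''
    GROUP BY language
    ORDER BY pushes DESC
    LIMIT 25

    --- cell: a second named query for pushes per user and language ---
    %%bq query -n pushes_by_user
    SELECT actor, repository_language AS language, COUNT(*) AS pushes
    FROM `bigquery-public-data.samples.github_timeline`
    WHERE type = 'PushEvent' AND repository_language != ''
    GROUP BY actor, language

    --- cell: run the first query and display its results ---
    %%bq execute -q popular_languages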
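The pandas side of the analysis might then look something like this. Again a sketch, assuming google.datalab's Query API, where execute().result() returns a results table with a to_dataframe method, and reusing the hypothetical pushes_by_user query defined above:

    import numpy as np
    import matplotlib.pyplot as plt

    # Load the per-user, per-language push counts into a DataFrame
    df = pushes_by_user.execute().result().to_dataframe()
    df.describe()  # as the last line of a cell, this displays summary stats

    # Keep only the 25 most popular languages so the pivot is 25 columns wide
    top25 = df.groupby('language')['pushes'].sum().nlargest(25).index
    df = df[df['language'].isin(top25)]

    # Reshape: one row per user, one column per language, push counts as values
    pivoted = df.pivot(index='actor', columns='language', values='pushes').fillna(0)

    # Pairwise correlation between the language columns (25 x 25)
    corr = pivoted.corr()

    # Blue for positive correlation, red for negative (the RdBu colormap)
    fig, ax = plt.subplots(figsize=(10, 8))
    cax = ax.matshow(corr, cmap='RdBu', vmin=-1, vmax=1)
    fig.colorbar(cax)
    ax.set_xticks(np.arange(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(np.arange(len(corr.columns)))
    ax.set_yticklabels(corr.columns)
    plt.show()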
This analysis used only a small sample from the larger GitHub public dataset, and if you want to work with the full GitHub commit history, I've included a link to that public dataset down below as well. Datalab is a great way to run cloud-connected notebooks close to your data, with convenient connectors to tools like BigQuery and easy authentication to your datasets in the cloud. Thanks for watching this episode of Cloud AI Adventures. Be sure to subscribe to the channel so you can catch all the future episodes right when they come out. For now, go try out Datalab and see if it's the right choice for your dataset.

Info
Channel: Google Cloud Tech
Views: 45,900
Rating: 4.9555554 out of 5
Keywords: Machine Learning, TensorFlow, Big Data, Cloud, Artificial intelligence, AI, ML, machine learning with gcp, gcp machine learning, cloud and machine learning, ai adventures, training, estimators, classification, linear classifier, machine learning models, data science, google cloud, datalab, cloud datalab, google cloud datalab, google cloud platform, big data science, google cloud services, GDS: Yes;
Id: Eu57QKNHaiY
Length: 11min 45sec (705 seconds)
Published: Thu Jan 11 2018