YUFENG GUO: Streaming
your data down to your own local environment
can be slow and costly. In this episode
of AI Adventures, we'll see how to
bring your notebook environment to your data. What's the only thing better than an interactive Python notebook? An interactive Python notebook with fast and easy cloud connectivity, of course. Last episode, we
looked at just how useful Jupyter notebooks are. And if you missed it,
I've included a link in the description below. This time, we'll
see how it can be taken even further by
running it in the Cloud with lots of extra goodies. We'll run some analysis
with some powerful tools and even larger datasets. As you work with larger and larger datasets in the cloud, it becomes more and more unwieldy to interact with them from your local machine. It can get tough to download
statistically representative samples of the data
to test your code on. And streaming the data
to do training locally relies on having a
stable connection. So what is a data
scientist to do? Well, if you can't bring
your data to your compute, bring your compute to your data. Let's see how you
can run a notebook environment in the Cloud
close to your dataset. Google Cloud Datalab is built
on top of the familiar Jupyter notebook with some
added abilities, including easy authentication
with BigQuery data sets, fast operations to Google Cloud
Storage, and SQL query support. Let's go create a
Datalab notebook and get set up for running
our notebook in the Cloud. Install the Datalab component using gcloud components install datalab. Then you'll have a new command line tool called datalab, and starting it is a one-line command. That command spins up the virtual machine you'll use for your analysis, configures the network, and installs the necessary libraries we'll use, including TensorFlow, pandas, NumPy, and more.
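For reference, that setup looks roughly like this; the instance name and zone are just placeholders I picked, and the exact flags may vary a bit by Datalab version:

    # install the command line tool, then create and connect to a Datalab VM
    gcloud components install datalab
    datalab create my-datalab-vm --zone us-central1-a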
Once Datalab starts, it will open up a notebook environment that looks pretty similar to what we saw in Jupyter notebooks. However, unlike before,
this is an environment that's running on a virtual
machine in the cloud. Datalab installs some
samples by default, which makes a great
place to start exploring. Let's take a look at the
Hello World notebook, which is in the Docs Folder. Here, we can start playing
with the notebook right away, running the cells
and experimenting. This is super convenient, since there's no need to deal with managing and configuring different Python libraries. Let's take a look at
a few additional tools that are built in. We'll click on the Account
icon in the upper right corner and take a look at all the
options and information that we're presented with here. First, note that the notebook is running as a service account. The service account is already authenticated with the assets of the project we're in, but if we want to access resources from another project, we'll need to grant access to this service account, not to our user account. Since the virtual machine the notebook is running on is accessible to anyone who can access the project, we don't want to expose our own account credentials in the Datalab notebook.
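As a rough sketch of what that grant can look like, with a made-up project ID, service account address, and role (use whichever role actually matches the resources you need):

    # give the Datalab service account read access to BigQuery data in another project
    gcloud projects add-iam-policy-binding other-project-id \
        --member="serviceAccount:my-datalab-sa@my-project.iam.gserviceaccount.com" \
        --role="roles/bigquery.dataViewer"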
Continuing on downward, we see that this Datalab notebook is hosted on a virtual machine on Google Compute Engine, and the virtual machine is called AI Adventures. We can shut down the VM by clicking this button here. Now by default,
Datalab has a feature that will shut down your
virtual machine once it's been idle for about 30 minutes. And you can toggle this
feature by clicking the message to disable and re-enable it. This timeout can also be
set to a custom value. Let's go over to Settings
to see how we can do that. The value we set here
will remain in place across reboots of
the virtual machine. And if you set it to zero,
auto shutdown will not occur. So this is just another way
to toggle it off, basically. And additionally,
you can also select a theme of light or dark. So if you're feeling like it, you can make the screen dark, or you can stick with the light theme to keep things airy. Now we have our Datalab
notebook set up, and we've gotten familiar
with the environment. So let's see what we can
actually do with Datalab. Today I'm going to run
through an example that describes the correlation
between programming languages used on GitHub. That is, if you
program in language A, how likely are you to also
program in language B? The notebook is in the Samples directory under the docs directory that we're in right now, and I've also placed a direct link to it on GitHub in the video description. This notebook uses NumPy and
pandas to analyze BigQuery data and visualize it
using Matplotlib. So naturally, the first step is to import those libraries.
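That first cell looks something like this; the Datalab BigQuery import path has moved around between versions, so treat that last line as an assumption:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import google.datalab.bigquery as bq  # Datalab's BigQuery helpers; path may differ by version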
The table we'll be pulling from is the GitHub timeline table, which lives in the public samples dataset provided on BigQuery. Let's see what the table schema looks like. Notice that we can use this %%bq magic to send operations to BigQuery without needing to do any additional scripting or authentication, or deal with the REST API response.
Well, there certainly are a lot of columns. We can also see how many rows there are, as soon as I get past all of the columns. OK, so if we run this query here, we can pull the number of rows from the metadata, and we see we have about 6.2 million rows.
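The notebook pulls that count from the table's metadata, but if you'd rather sanity check it with a query, a plain count works too. This is my own sketch, and the table path is an assumption about where the public sample lives:

    %%bq query
    SELECT COUNT(*) AS row_count
    FROM `publicdata.samples.github_timeline`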
We can also use a sampling feature from BigQuery, which will pull rows directly from the table into Datalab. It's a smaller chunk of data, so you don't need to query the whole table, which would otherwise take a while, and it allows us to easily see the format of the data and do some sanity checking.
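The sampling call looks something along these lines; the exact flags of the sample magic are an assumption on my part, so check the built-in help if they don't match your version:

    %%bq sample --table publicdata.samples.github_timeline --count 5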
So we can see here we just pulled five sample rows, and all the different columns associated with them. And here we just pulled these four columns, so it's really just a slice of the data.
Now let's go onwards to the core of our analysis. These are the queries that we'll run to grab the GitHub commit data and the languages associated with them. Notice, again, that we're using the %%bq magic. And this time, to create our SQL query, we'll actually be able to write SQL directly. And you'll notice here that we get syntax highlighting, and it all goes directly in the cell, rather than having to wrap it in the string of an API call.
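A named query cell looks roughly like this. The SQL is my own illustrative version rather than the exact query from the sample notebook, and the column names are assumptions about the github_timeline schema; popular_languages is the name we'll refer to later:

    %%bq query -n popular_languages
    SELECT repository_language AS language, COUNT(*) AS pushes
    FROM `publicdata.samples.github_timeline`
    WHERE type = 'PushEvent' AND repository_language != ''
    GROUP BY language
    ORDER BY pushes DESC
    LIMIT 25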
Now, these cells create the queries, but they don't execute them yet. To execute them, we'll actually just call execute. And you'll notice here we have popular_languages, which was the name we gave our first query when we defined it. So that's where we constructed the query, and here is where we execute it.
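Executing a named query is a one-liner in its own cell, something like this (the -q flag is how I recall the execute magic; adjust if your version differs):

    %%bq execute -q popular_languages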
Looking at the results, we see that JavaScript, Java, and Ruby have some of the largest numbers of pushes, while at the bottom we see that R, Matlab, and Puppet have the fewest pushes in this sample dataset. We'll also execute one of
the other queries that we had and see the number
of pushes by user. Anyone see themselves
on this list, perhaps? Anyhow, let's now load our query results into a pandas DataFrame. It doesn't get easier than this: we just call to_dataframe.
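In code, that's a single call on the query results. The chain below is a sketch, and pushes_by_user is just a stand-in name for whichever named query we want to load:

    # run the named query and pull the results into a pandas DataFrame
    df = pushes_by_user.execute().result().to_dataframe()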
And now we can use all our pandas analysis skills on this new DataFrame. So we can see that we have the data loaded in the DataFrame, and we can use describe here to see some interesting stats, including who has contributed to the largest number of distinct languages. Looks like someone's actually had contributions in nine different languages. Pretty impressive.
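With the DataFrame in hand, standard pandas calls get us those stats; the actor and language column names below are assumptions about the query output:

    df.describe()  # summary statistics for the numeric columns

    # how many distinct languages has each user pushed to?
    df.groupby('actor')['language'].nunique().sort_values(ascending=False).head()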
Now we will need to further shape our data toward our end goal of correlating languages with each other. Specifically, we want the user names as rows and a column for each of the 25 different languages, so the table needs to be 25 columns across. Pandas actually has a handy function to help us do this, called pivot, and we'll use that to create our 25-column-wide table with all of our users. This table is going to have a lot of zeros, because even the most ambitious user, as we saw, has only pushed to nine different languages.
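The pivot itself looks something like this, again assuming actor, language, and pushes as the column names:

    # one row per user, one column per language, push counts as the values
    pivoted = df.pivot(index='actor', columns='language', values='pushes').fillna(0)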
Now we can take this and compute the correlation between the different columns for each possible language pair, so 25 by 25, basically. And we'll do that using the corr function. So we get that correlation, and you'll notice that we have corr at the bottom of the cell, so it prints the result out. And so now we have our 25 by 25 table.
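Computing the correlation matrix is a one-liner on that pivoted table:

    corr = pivoted.corr()  # 25 x 25 pairwise correlations between the language columns
    corr  # leaving this on the last line of the cell prints the table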
Unfortunately, this table of decimals is no fun to look at, and it's hard to really tease out any trends here. So we'll use Matplotlib to make a more colorful plot, so that we can more easily see where the notable correlations lie.
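Here's a minimal sketch of that plot, assuming the corr table from above; the colormap choice is mine, and the sample notebook may style it differently:

    plt.figure(figsize=(10, 10))
    plt.imshow(corr, cmap='RdBu', vmin=-1, vmax=1)  # blue for positive, red for negative correlation
    plt.colorbar()
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.show()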
So once we have that, we'll run it. And wow, that's a lot of colors. The stronger the blue, the more positively correlated the two languages are; and the stronger the red, the more negatively correlated the given language pair is. There are some
interesting language pairs that are notable here. First, you see that
Java doesn't really mix well with a couple
of different languages. You know, here's
our Java, and we see that it doesn't show up much
with JavaScript or PHP or Ruby. And C and C++, on
the other hand, do seem to go
together very well. Not super surprising there. And CoffeeScript and JavaScript also show up in that blue box as languages that get combined together. Looking kind of
broadly at the dataset, I also noticed that Ruby and PHP
both seem to have a lot of red, and some of them quite
red across their rows. This makes me wonder which of these languages would get the prize for being least likely to go with another language-- perhaps something we can analyze another time. So this analysis used only
a small sample of the larger GitHub public dataset. And if you want to work with
the full GitHub commit history, I've included a link
to that public dataset down below, as well. Datalab is a great way to run cloud-connected notebooks close to your data, with convenient connectors to tools like BigQuery and easy authentication to your datasets in the Cloud. Thanks for watching this
episode of Cloud AI Adventures. Be sure to subscribe
to the channel so you can catch all the
future episodes right when they come out. For now, go and try
out Datalab and see if it's the right
choice for your dataset.