JEFF DAVIS: Welcome to the
Google Cloud Platform education series. My name is Jeff Davis,
and in this session we'll discuss analyzing
your data in less time using Google BigQuery. We'll be running a moderated Q&A during the session, so feel free to use the "Ask your question" field at the bottom of your screen to submit questions as we go along. Let's get started. So the agenda for this
session: we'll start out with a series of demos. We'll show you typical use cases, the speed at which BigQuery operates, and how easy it is to use. Then we'll talk about the
general set of features for BigQuery and how you
might be able to leverage them in your environment. We'll talk about how to find
answers to questions after you leave today's session and get started working with BigQuery on your own. We'll talk about some best
practices for evaluating BigQuery in your environment. And again, we'll be
doing Q&A as we go along, so submit those questions. Let's start with
a series of demos that show both BigQuery's
scale, as well as how easy it is to
integrate with other systems that you use for data
analysis and reporting. What we have here
is the web UI that allows end users to easily
play with data within BigQuery. We're going to show you
some integrations later on, but right now this is
just for doing quick iterative analysis of large data sets. So let's start with
some sample data sets that Google
makes available. So for instance, here we
have airline on-time data. So we have a list of flights. And if we take a look at this table, you can see it's about 8 GB in size: 70 million rows of data regarding different airlines' on-time arrival and departure performance in different states and so forth. And I want to get
some information. So I'm going to start with a canned query that I've got here, OK? And this is going to find all the arrival entries just for United Airlines. So I'm going to copy this, go back, and query the table. OK, you can see I'm using standard SQL. I'm going to be querying the flights table, looking for all rows where the airline code is 19977; that's the code for United Airlines. And I'm going to select three columns from every row. Now I'm going to disable caching, because we want to see how long it takes to do an initial query against the 70-odd million rows. You'll notice it gives me an estimate of how much data it's going to have to process. This is also useful for estimating how much it's going to cost: you pay, I think, $5 per terabyte of data processed after the first free terabyte each month, so this uses some fraction of that.
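If you wanted to run that same canned query outside the web UI, here's a minimal sketch using the Python client library. The table path, column names, project ID, and the string form of the carrier code are illustrative assumptions, not the exact ones in the demo.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # illustrative project ID

# Stand-in for the canned query: three columns, filtered to
# United Airlines (carrier code 19977, possibly stored as a string).
sql = """
    SELECT date, arrival_state, arrival_delay
    FROM `bigquery-samples.airline_ontime_data.flights`
    WHERE airline_code = '19977'
"""

# A dry run returns the bytes-processed estimate the UI shows,
# without actually running (or billing for) the query.
dry = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
print("will process ~%d bytes" % dry.total_bytes_processed)

# Disable the cache so we time a cold query, then fetch the rows.
job = client.query(sql, job_config=bigquery.QueryJobConfig(use_query_cache=False))
for row in job.result():
    pass  # each row exposes the three selected columns
```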
So I'm going to run this, and it should take between 25 and 30 seconds to complete. When it comes back, it will show us the three columns for all the matching rows, and it will give us a row count. And then we'll move on to a
slightly more interesting query that gives us some more
insight into our data. So you'll notice it
took 23.8 seconds. Processed 2 gigabytes of data. It returned 5,043,578 rows. So it's a lot of data returned. That's part of why it
took so long to come back. It would have been
faster if there had been fewer rows in the result set. So what this does is it just
gives us the date, the arrival state, and the arrival delay. I would like to get a little bit more insight, so what I'm going to do is run this query. This query should tell me the average delay by month and by arrival state for United Airlines. So I want to see which months are the worst and which states are the worst.
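For reference, one plausible shape for that rollup in standard SQL follows; the demo's actual query used subqueries, and the column names and the DATE type here are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # illustrative project ID

# A plausible month-by-state rollup (assuming date is a DATE column);
# the demo's actual query used subqueries instead of COUNTIF.
sql = """
    SELECT
      arrival_state,
      EXTRACT(MONTH FROM date) AS month,
      AVG(arrival_delay) AS avg_delay,
      COUNT(*) AS flights,
      COUNTIF(arrival_delay < 0) AS early,
      COUNTIF(arrival_delay > 0) AS late,
      ROUND(100 * COUNTIF(arrival_delay < 0) / COUNT(*), 1) AS pct_early,
      ROUND(100 * COUNTIF(arrival_delay > 0) / COUNT(*), 1) AS pct_late
    FROM `bigquery-samples.airline_ontime_data.flights`
    WHERE airline_code = '19977'
    GROUP BY arrival_state, month
    ORDER BY arrival_state, month
"""
rows = client.query(sql).result()
```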
So let's run this. Again, it's going to process the same amount of data. That came back
quite a bit faster. And you'll notice that we
have 5,000 rows instead of 5 million rows returned. That's part of why it
came back so quickly. And you'll notice
that what we've got is for each state we've
got individual months. And for those months, we've got
the average delay, the number of flights, the number of
flights that were early, the number of flights
that were late, the percentage of
flights that were early, and the percentage of
flights that were late. So this is using the lingua
franca of data analysis. It's using standard SQL. It's not that
difficult to write. There's no actual query
optimization here. You'll notice I'm
also using subqueries. That's perfectly valid. And I was able to
very quickly get back insights from a very
large collection of data. So again, you can do this all
interactively through the web UI. Let's actually
look at just scale. How fast can this go on
really big data sets? So you think, well,
Jeff, 70 million rows. That's a pretty big data set. Well, it's not as big as some of
these data sets, for instance. Here we have Wikipedia
benchmark data. And if you look, you can see that the table basically has got year, month, day, project, language, title, and views. So this has got data
about how many times particular articles were viewed. So we've got a million-row version. We have a 10 million-row version, a 100 million-row version, a billion-row version, and even a 100 billion-row version.
Let's see how this all scales. So we'll start with a million. I'm going to turn off standard SQL and use legacy SQL, just for giggles. So we're going to query a million articles. We're looking for all articles that have Davis, because I'm fairly vain: I'm going to look for articles about people with my same last name. We're going to use a regular expression to match these, and what I want to do is actually group them by title and sum the number of views.
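A sketch of what that legacy-SQL query might look like follows; the bracketed table name follows the public bigquery-samples Wikipedia benchmark naming, which is an assumption on my part.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # illustrative project ID

# Legacy SQL uses [project:dataset.table] brackets, and REGEXP_MATCH
# is legacy-SQL-only (standard SQL uses REGEXP_CONTAINS instead).
legacy_sql = """
    SELECT title, SUM(views) AS total_views
    FROM [bigquery-samples:wikipedia_benchmark.Wiki1M]
    WHERE REGEXP_MATCH(title, r'Davis')
    GROUP BY title
    ORDER BY total_views DESC
"""
job = client.query(
    legacy_sql,
    job_config=bigquery.QueryJobConfig(use_legacy_sql=True),
)
for row in job.result():
    print(row.title, row.total_views)
```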
So let's do this over the million-row data set. This is 47 megabytes. And it should come
back fairly quickly. So you'll notice Anthony
Davis and Thomas Davis are the most frequently viewed
articles in the 1 million entry data set. Let's look at 10 million. OK, so we go from 47
megabytes to 476 megabytes. And we go from just a couple
of seconds to three seconds. We now have roughly
2,500 results. Daniel Davis is the top now. So this was three
seconds to do 10 million. Let's do 100 million. And you see where
we're going with this. You want to know if I'm going to
get to the 100 billion number, and I am. I'm going to have to pay $20 out
of my pocket to run that query, but I'll do it just for you. So you notice, ah,
Jefferson Davis. Not my namesake. It's a coincidence. Favorite jazz musician,
Miles Davis, is up there. So we've done 100 million. Let's do, oh, what the heck. That took two seconds. Let's do 100 billion. This is going to process
3.64 terabytes of data, 100 billion rows. At $5 per terabyte processed, that's roughly $18; call it $20. Cross our fingers. Get out my ATM card. And yet, think about it. Twenty dollars to run this
query versus the amortized cost of a query against a
big Teradata system, or some other big data
rig that you've got. It's going to give you
equivalent performance. It's actually a heck of a value. So I think it's about 30 to 50
seconds to complete the query. While that's running, I'm going
to get-- oh, actually it's done. So 37 seconds. Notice it came back with
185,000 rows in the results and it has aggregated and told
me that Miles Davis is in fact the most popular article within
the 100 billion row data set. So think about it. That's 30 seconds to do
analysis on 100 billion rows. I could sit here and
iterate on my queries, interact almost in real time
with my data very, very easily. Now obviously, doing this
interactively with the web UI is one thing. Another thing is to
actually expose this data, pull it out of
BigQuery and expose it through some other application. So we have an example here. This is a customer
success application that pulls data from BigQuery. So let's take a look at
one of the dashboards. If you went to the
App Engine session, you saw this dashboard earlier. This has basically got a
bunch of visualizations of different types of
user engagement metrics within my product. And what this allows me to do
is pull the data out of BigQuery through APIs and
represent it in code. So behind the scenes here
I've got a user events table. Now, this table is a paltry 500,000 rows, a modest amount of data for the usage of this product; it's a demo application, so it doesn't have tens of millions of rows. What we do then is we
write a little bit of code to access this. And here's the code. We're going to go to BigQuery. This is a little Python script. You'll notice what we do
is we import a library. So Google has code libraries
or client libraries available for a variety
of different languages, and it's quite simple. We create a BigQuery client,
we connect to the data set, and then within the data
set, we connect to a table. And within the table, we are
then going to run a query. If we want to run an
arbitrary query for instance, we take our client, we run a
query, we designate the query. We give it a time out,
so this makes sure that the query has up
to 45 seconds to run. We run the query,
and then we simply iterate through the results. This lets us process the query's results row by row. If you want to write to BigQuery, it's quite straightforward: you simply take the table and you insert a row. And this adds events into the backend. So this data gets written automatically, then the API pulls it out and renders it in dashboards within your application.
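Here's a minimal sketch of that whole flow using today's google-cloud-bigquery Python package; the demo ran on an older client library, and the table and field names here are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client(project="gcp-webinar-demos")

# Run an arbitrary query and iterate through the results, giving
# the query up to 45 seconds to finish.
job = client.query(
    "SELECT event_type, COUNT(*) AS n "
    "FROM `gcp-webinar-demos.app.user_events` "
    "GROUP BY event_type"
)
for row in job.result(timeout=45):
    print(row.event_type, row.n)

# Writing is just as straightforward: stream a row into the table.
errors = client.insert_rows_json(
    "gcp-webinar-demos.app.user_events",
    [{"user_id": "u123", "event_type": "call_started"}],
)
assert not errors, errors  # an empty list means the row was accepted
```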
Now you may be using a different tool to do your visualization and reporting. You may not use the
web UI and you may not want to go so far as to
write a custom dashboarding application. Let's say you want
to use something like Tableau or Looker, or in
this case, I'm using a product called Mode Analytics. These BI visualization
and analysis tools allow you to connect to
BigQuery as a backend. So you'll notice over here. Let me make this a
little bit bigger. You'll notice I
have data sources. If I click on the plus
sign, I can actually select a backend
data source, and I want to connect to BigQuery. OK. So it's saying I want
to connect to BigQuery. It wants to know my project ID. So I am going to pick my project
ID, which is GCP webinar demos. And then I need to
create a service account. And so the service
account is created through IAM, the identity
and access management. And you'll notice that I've got
a variety of service accounts here. I can go ahead and create
a new service account. Actually, sorry, service
accounts are here. Those were the roles. I'm going to create
a new service account and I'm going to call this mode. And I want to give this
service account BigQuery admin permissions, so that it can
create tables, delete tables, read data, and so forth. And I want to get
a new private key. And because I read the
Mode documentation, I know that it needs
to be a P12 key. So I'm going to create
this service account. And I get the credentials. OK? Actually we'll
cancel out of that. I don't want to put
it in my key chain. We'll just skip over that part. We go back to Mode and
we've got to give it the name of the service account. So the name of the service
account or the email address here is this. So we designate that and then
we have to upload the key file. And that should be the key file. And if all went well, this
has now connected my tool with BigQuery. So I can create a new query. Let's pick one that looks a lot
like what I was using before. Let's not use the 100 billion. We can use the 100 million. OK, now I need to enter a query. So I'm going to go pick one
of my pre-canned queries. Let's go ahead and use
the standard SQL format for getting information about
United's on time arrival. I can paste that in. It will run the query. It gives me a tabular
version of the results. And if I want to, I can
create a chart for instance that shows on time arrival. I can change the chart
type if I want to. This sort of thing. So this shows how you can do
analysis and visualization within a third party tool,
so you can fit BigQuery into your normal data analysis
and analytics workflows. Pretty cool. OK. Let's switch back and talk
about BigQuery's functionality overall. We showed you some pieces of it. Let's highlight
additional capabilities and use cases so
that you'll have a better understanding of how to
apply it in your environments. So again, BigQuery is
this fully managed, massive scale, low
cost enterprise data warehouse that Google
provides, running on top of their compute, storage, and networking infrastructure. There is a demo query that
I've seen at some conferences that when you run it, it takes
3,300 cores, 330 hard drives, and 330 gigabits of network
bandwidth for 30 seconds to complete. Now those are
tremendous numbers. But because Google has so
much available capacity, they're able to give you a slice
of 30 seconds with that level of resources to run
your query and then claw those resources
back and deploy them against somebody else's query. There is no infrastructure
for you to manage. You'll often hear this referred
to as being serverless. Well, it's clearly
not serverless because there are
servers involved, but you don't have to
deal with any servers. You focus just on
getting your data in and on analyzing your data. You don't have to worry about
standing up servers, tearing down servers, configuration
updates, any of that stuff. So when you perform
queries, you're going to get results
really, really quick. So we showed scanning terabytes
of data in roughly 35 seconds. At the petabyte level,
you're talking a few minutes. So this actually makes turning around analysis on huge data sets very, very fast, which means better
analysis, more creativity, more interesting insights
that you can derive. It uses standard SQL. I showed you that there
is a legacy SQL syntax, but in the newest generation
of their query language it's ANSI SQL:2011 compliant, so you can actually lift queries from other environments and, with very little tweaking if any, they will run in the BigQuery environment. You can use the web
UI as I demonstrated, or you can use third
party tools of choice, like Looker, Tableau, or Qlik. They've got date-partitioned tables, so you can actually have BigQuery automatically partition time series data by day. And this allows you to have much higher-performance queries.
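As a rough sketch, here's how creating a day-partitioned table looks through the Python client; partitioning options have grown since this talk, and the schema and names here are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # illustrative project ID

# Create a table partitioned by day on its timestamp column.
table = bigquery.Table(
    "my-project.telemetry.events",
    schema=[
        bigquery.SchemaField("ts", "TIMESTAMP"),
        bigquery.SchemaField("counter", "STRING"),
        bigquery.SchemaField("value", "FLOAT"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="ts"
)
client.create_table(table)
# Queries that filter on ts only scan the matching day's partitions.
```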
You can also have BigQuery query against external data sources. So you can have a bunch of
data inside of BigQuery, and then you can join that data against external source data as well.
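A sketch of that federated pattern, assuming a lookup CSV sitting in Cloud Storage; all the names here are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # illustrative project ID

# Define a CSV in Cloud Storage as an external table for this query
# only, then join it against a native BigQuery table.
ext = bigquery.ExternalConfig("CSV")
ext.source_uris = ["gs://my-bucket/lookups/airports.csv"]
ext.autodetect = True

job_config = bigquery.QueryJobConfig(table_definitions={"airports": ext})
sql = """
    SELECT f.arrival_state, a.region, AVG(f.arrival_delay) AS avg_delay
    FROM `my-project.demo.flights` AS f
    JOIN airports AS a ON a.state = f.arrival_state
    GROUP BY f.arrival_state, a.region
"""
rows = client.query(sql, job_config=job_config).result()
```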
How do you get these massive amounts of data into BigQuery? Well, you can do this interactively: you can upload files through the web UI, or you can use the bq command-line utility and upload a file from your local PC. However, obviously that's going to work better for a few megabytes of data than it will for a few terabytes of data.
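For the record, here's what that file upload looks like programmatically; the bq CLI's load command does the same thing, and the file and table names here are illustrative.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # illustrative project ID

# Load a local CSV file into a table via a load job.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # infer the schema from the data
)
with open("flights.csv", "rb") as f:
    load_job = client.load_table_from_file(
        f, "my-project.demo.flights", job_config=job_config
    )
load_job.result()  # block until the load job completes
```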
So there are other ways that make more sense for transferring really large volumes. You can stream data directly into BigQuery at up to 100,000
rows per second. So one of the things
that we did at Highfive, this video conferencing
startup that I used to work at, is stream telemetry. Say you had 10,000 users on calls: for all 10,000 calls we were generating second-by-second telemetry data on hundreds of different counters for every single call. We were writing that into
a production database, but then streaming
it into BigQuery. And BigQuery was the
data store from which the customer facing
team dashboards drew all of their data. Google Cloud Dataflow is a data pipeline framework that allows you to easily stand up ETL-style transformation jobs that take data from incoming sources, manipulate the data, and write it out into some target repository. Dataflow works beautifully with BigQuery, and so you can easily be processing batch data or streaming data and pumping it through a transformation engine into BigQuery.
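A minimal Apache Beam sketch of that pattern, as you might run it on Dataflow; the bucket, table, and field names are illustrative.

```python
import json
import apache_beam as beam

# Read raw JSON events, parse them, and append them into BigQuery.
with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/events/*.json")
        | "Parse" >> beam.Map(json.loads)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="user_id:STRING,event_type:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```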
There's also the newly announced BigQuery Data Transfer Service, where you can set up regular transfers of your YouTube data, your DoubleClick data, and your AdWords data, and have that stream automatically into BigQuery for performing analytics. There will be
additional services coming to the Transfer Service in the future. Right now, through the middle of 2017, the Transfer Service is free. After the midpoint of the year, it will become a billable monthly service. Availability and access. Your data is replicated
across multiple geographies. What that means is that the
durability and availability are through the roof. If you have a device go down,
if you have a zone go down, if you have a region
go down, you're still going to be able to
get access to your data, and you will not lose any data. You also then can control
who can access that data. So there are identity and
access management roles that you can assign. So you can say developers
can write data, and your analysts can read data. Or your end users can read data. You also can set permissions
at the per data set level. You have really high
availability, as I mentioned. A 99.9% SLA on availability. The data that you store in
BigQuery is encrypted at rest. We mentioned the fact that
you can use IAM roles, and the fact that you can either
go through the command line, the web UI, or external clients. Pricing. So there are two components to pricing for BigQuery: a storage component and a query processing component. For the storage component, you pay $0.02 per gigabyte per month. If a table goes unmodified for longer than 90 days, you actually get a discount on that. So if the data doesn't change but you keep querying against it, the price automatically goes down. You pay $5 per terabyte of data processed for queries. The first terabyte each month is free.
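To make the arithmetic concrete, here's a quick back-of-the-envelope sketch using the list prices just mentioned (2017 prices; check current pricing before relying on them).

```python
# Rough cost math at the prices quoted in this session.
STORAGE_PER_GB_MONTH = 0.02  # active storage
QUERY_PER_TB = 5.00          # per terabyte processed, after 1 free TB/month

tb_scanned = 3.64  # the 100-billion-row demo query from earlier
print("storing 1 TB for a month: $%.2f" % (1000 * STORAGE_PER_GB_MONTH))
print("that demo query, at most:  $%.2f" % (tb_scanned * QUERY_PER_TB))
```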
For companies that do more than $10,000 a month in analytics processing, they can actually get flat-rate pricing. And you buy what
are called slots. It's sort of technical,
but it's a different model for processing fees. And you can get what
essentially is a discount by using flat rate pricing. If your volume goes
above $10,000 a month, certainly up to about
$40,000, $50,000 a month, talk to your Google sales
rep and they can hook you up. It should be noted
that BigQuery is not a replacement for every
enterprise data store. It's not a transactional
relational database. So if you're looking to read
and write into individual cells in rows in a table and
constantly update that, that's not what this is for. BigQuery is for data that you write in once, don't change, and report against. Similarly, it's not an
operational data store, and it is a
cloud-based solution. So if you have an
application that requires on-premises
data warehousing, BigQuery would
not work for that. But you want to ask
yourself, are the benefits of BigQuery worth
making the changes necessary in our workflow to
put this data in the cloud? A lot of organizations that didn't think they would want to, or be able to, do their data warehousing in the cloud are finding that the benefits are great enough that they can find creative ways to make it work. So given this, you want to
work with your internal teams to figure out what can
we do with BigQuery. So what Shine Technologies did was they were working
with a customer that did a huge amount
of internet advertising. And what this
customer wanted to be able to present to
their executives was a live dashboard that showed
the performance of their ads. So what they needed to have was
the ability to stream millions and millions of ad
impressions into BigQuery, generate the analytics on
this constantly updated data set that had billions
of rows, and do complex queries to
drive insights, and then visualize these. And Shine was able to do this
without any Hadoop clusters, without any sophisticated
systems operations investment. And they were able to deliver the client's dashboards, which would consistently render visualizations of up-to-date data in 20 to 30 seconds. Saved tons of money. Actually became a
product that they were able to start delivering
to other customers as well. Now when you walk away
from today's session, you may have queries as
you're trying things out and you may need guidance on
how to perform certain tasks, so where do you look
for data or information? How do you find what
you're looking for? So as always, you want to start
with search, and just type in BigQuery and your topic. So for instance,
you want to know how to get data into BigQuery. BigQuery ingestion. And you'll notice you've got
preparing data for loading, streaming data in, improving
data ingestion times, and so forth, OK? So for instance, this article talks about using Avro, which compresses the data before transmitting it and decompresses it on the other end. This is built into BigQuery. It makes data loading much more efficient.
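A sketch of an Avro load for reference; the URI and table name are illustrative, and Avro files carry their own schema, so none needs to be supplied.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # illustrative project ID

# Load Avro files from Cloud Storage: the same load-job call,
# just with a different source format.
job = client.load_table_from_uri(
    "gs://my-bucket/flights/*.avro",
    "my-project.demo.flights",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO
    ),
)
job.result()
```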
There's also the typical table of contents in the BigQuery documentation. So let me actually take
you on a quick tour. So if we type in BigQuery. Up here at the top will be
the splash page from Google. If you click on
View BigQuery docs, you'll get the typical
table of contents. There's a Quickstart here. Actually two. One for the web UI, and
one for the command line. These get you up and running
in less than five minutes. They give you a
very quick taste. And then you've got a drill-down into the how-to guides. So you can see: how
do I load data? How do I prepare the
data before I load it? How do I transform the data? How do I load it from Datastore, or Cloud Storage, or via the API? How do I stream it in? And then here's,
how do I do queries? And there's a section down
here on how do I export data? And then general basics around
security, and management, and so forth. So pretty well organized,
reasonably exhaustive reference. Should be very useful to you. Now, I don't have a
prescribed methodology for how to do an
evaluation of BigQuery because it really depends
on your applications. I've heard two very different
approaches commonly used by companies that have large
scale data warehousing needs. One is to take your
top half dozen queries. So your top half
dozen jobs that are breaking on your
existing infrastructure, or just take a long time,
and move them to BigQuery. So you just pick five or six
queries, you load your data, you run the queries, you see
how much better they are. If they're better, great. You're ready to go. On the other end of the spectrum
are companies where they really are going to be very methodical
and exhaustive about the way they do the evaluation. So before they can move any
of their data into the cloud, they have to do compliance
and regulatory research. They have to cleanse
all of their data. They may need to
anonymize the data. They load massive
amounts of data and then they
essentially replicate their existing analytics
workflows in BigQuery. And then they may go so
far as to actually do blind taste tests. So they have their front end
representation be identical, but they have half of the users
go through the new BigQuery enabled pipeline, and half
the users go through the old. And they judge the efficacy,
the level of satisfaction, and so forth. So it's going to be
up to you, and it will depend on what
vertical you're in, what data you're looking
at, and so forth. Certainly, a quick-and-dirty way to get a sense of what the potential benefits are is to do the half dozen queries. And then depending on your
long-term applications, you may need to do a much more
exhaustive proof of concept. So that's it for this session. Thanks so much for joining us. I hope you found it
interesting and helpful. We're going to send
you a follow-up email, and it will have a
link to a codelab, so you'll get a guided walkthrough of BigQuery that you can do on your own. We'd love to get
your feedback, so feel free to provide
the feedback in the UI in front of you, and please
stay tuned for the next session.