Analyzing Big Data in less time with Google BigQuery

Captions
JEFF DAVIS: Welcome to the Google Cloud Platform education series. My name is Jeff Davis, and in this session we'll discuss analyzing your data in less time using Google BigQuery. We'll be running a moderated Q&A during the session, so feel free to use the Ask a Question field at the bottom of your screen to submit questions as we go along. Let's get started.

So the agenda for this session: we'll start out with a series of demos. We'll show you typical use cases, the speed at which BigQuery operates, and how easy it is to use. Then we'll talk about the general set of features for BigQuery and how you might be able to leverage them in your environment. We'll talk about how to find answers to questions after you leave today's session and get started working with BigQuery on your own. We'll talk about some best practices for evaluating BigQuery in your environment. And again, we'll be doing Q&A as we go along, so submit those questions.

Let's start with a series of demos that show both BigQuery's scale and how easy it is to integrate with other systems that you use for data analysis and reporting. What we have here is the web UI, which allows end users to easily play with data within BigQuery. We're going to show you some integrations later on, but right now this is just for doing quick, iterative analysis of large data sets. So let's start with some sample data sets that Google makes available. For instance, here we have airline on-time data. So we have a list of flights. And if we take a look at this table, you can see it's about 8 GB: 70 million rows of data regarding different airlines' on-time, arrival, and departure performance in different states and so forth.

I want to get some information, so I'm going to start with a canned query that I've got here, OK? And this is going to find all the arrival entries just for United Airlines. So I'm going to copy this, go back, and query the table. OK, you can see I'm using standard SQL. I'm going to be querying the flights table, looking for all rows where the airline code is 19977, that's the code for United Airlines. And I'm going to select three columns from every row. Now, I'm going to disable caching because we want to see how long it takes to do an initial query against the 70-odd million rows. You'll notice it gives me an estimate of how much data it's going to have to process. This is also useful for estimating how much it's going to cost; you pay, I think it's $5 per terabyte of data processed after the first terabyte, so this uses some fraction of that. So I'm going to run this, and it should take between 25 and 30 seconds to complete the query. When it comes back, it will show us the three columns for all the matching rows and give us a row count. And then we'll move on to a slightly more interesting query that gives us some more insight into our data.

So you'll notice it took 23.8 seconds and processed 2 gigabytes of data. It returned 5,043,578 rows. So it's a lot of data returned. That's part of why it took so long to come back; it would have been faster if there had been fewer rows in the result set. What this does is just give us the date, the arrival state, and the arrival delay. I would like to get a little bit more insight, so what I'm going to do is run this query. This query should tell me the average delay by month by arrival state for United Airlines. So I want to see what months are the worst and what states are the worst. So let's run this.
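As an editorial aside: the exact SQL isn't shown in the transcript, but a query along the lines described, average delay by month and arrival state for United, might look roughly like the sketch below, run here through the BigQuery Python client library rather than the web UI. The table and column names are assumptions, not the demo's actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default project and credentials

# Hypothetical table and column names; the demo's actual schema may differ.
query = """
    SELECT
      arrival_state,
      EXTRACT(MONTH FROM flight_date) AS month,
      AVG(arrival_delay) AS avg_delay,
      COUNT(*) AS num_flights
    FROM `my-project.airline_ontime_data.flights`
    WHERE airline_code = '19977'   -- United Airlines in the demo
    GROUP BY arrival_state, month
    ORDER BY avg_delay DESC
"""

# Disable the cache so we measure a cold query, as in the demo.
job_config = bigquery.QueryJobConfig(use_query_cache=False)
for row in client.query(query, job_config=job_config).result():
    print(row["arrival_state"], row["month"], row["avg_delay"])
```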
Again, it's going to process the same amount of data. That came back quite a bit faster, and you'll notice that we have 5,000 rows instead of 5 million rows returned. That's part of why it came back so quickly. And you'll notice that what we've got is, for each state, individual months. And for those months, we've got the average delay, the number of flights, the number of flights that were early, the number of flights that were late, the percentage of flights that were early, and the percentage of flights that were late. So this is using the lingua franca of data analysis: standard SQL. It's not that difficult to write. There's no actual query optimization here. You'll notice I'm also using subqueries; that's perfectly valid. And I was able to very quickly get back insights from a very large collection of data. So again, you can do this all interactively through the web UI.

Let's actually look at just scale. How fast can this go on really big data sets? So you think, well, Jeff, 75 million rows, that's a pretty big data set. Well, it's not as big as some of these data sets, for instance. Here we have Wikipedia benchmark data. And if you look, you can see that the table basically has got year, month, day, project, language, title, and views. So this has got data about how many times particular articles were viewed. We've got a million row version, a 10 million row version, a 100 million row version, a billion row version. Let's see how this all scales.

So we'll start with a million. I'm going to turn off standard SQL and use legacy SQL just for giggles. So we're going to query a million articles. We're looking for all articles that have Davis, because I'm fairly vain; I'm going to look for articles about people with my same last name. We're going to use a regular expression to match these, and what I want to do is group them by title and sum the number of views. So let's do this over the million row data set. This is 47 megabytes, and it should come back fairly quickly. So you'll notice Anthony Davis and Thomas Davis are the most frequently viewed articles in the 1 million entry data set.

Let's look at 10 million. OK, so we go from 47 megabytes to 476 megabytes, and we go from just a couple of seconds to three seconds. We now have roughly 2,500 results. Daniel Davis is the top now. So that was three seconds to do 10 million. Let's do 100 million. And you see where we're going with this. You want to know if I'm going to get to the 100 billion number, and I am. I'm going to have to pay $20 out of my pocket to run that query, but I'll do it just for you. So you notice, ah, Jefferson Davis. Not my namesake; it's a coincidence. Favorite jazz musician, Miles Davis, is up there. So we've done 100 million. Let's do, oh, what the heck. That took two seconds. Let's do 100 billion. This is going to process 3.64 terabytes of data, 100 billion rows. Cross our fingers. Get out my ATM card.

And yet, think about it: twenty dollars to run this query, versus the amortized cost of a query against a big Teradata system or some other big data rig that you've got. It's going to give you equivalent performance. It's actually a heck of a value. So I think it's about 30 to 50 seconds to complete the query. While that's running, I'm going to get-- oh, actually it's done. So 37 seconds. Notice it came back with 185,000 rows in the results, and it has aggregated and told me that Miles Davis is in fact the most popular article within the 100 billion row data set. So think about it.
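For reference, a query along these lines, finding titles containing Davis and summing their views, might look something like this sketch in standard SQL via the Python client. The demo used legacy SQL in the web UI; the table reference below is an assumption based on the public Wikipedia benchmark samples, and the column names may differ.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumed public benchmark table; the demo swaps between 1M, 10M, 100M,
# 1B, and 100B row versions of the same schema.
query = """
    SELECT title, SUM(views) AS total_views
    FROM `bigquery-samples.wikipedia_benchmark.Wiki100M`
    WHERE REGEXP_CONTAINS(title, r'Davis')
    GROUP BY title
    ORDER BY total_views DESC
    LIMIT 25
"""

job = client.query(query)
for row in job.result():
    print(row["title"], row["total_views"])

# Bytes scanned is what drives on-demand cost (roughly $5/TB after the free tier).
print("Bytes processed:", job.total_bytes_processed)
```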
That's 30 seconds to do analysis on 100 billion rows. I could sit here and iterate on my queries, and interact almost in real time with my data very, very easily. Now obviously, doing this interactively with the web UI is one thing. Another thing is to actually expose this data, pull it out of BigQuery, and expose it through some other application.

So we have an example here. This is a customer success application that pulls data from BigQuery. Let's take a look at one of the dashboards. If you went to the App Engine session, you saw this dashboard earlier. This has basically got a bunch of visualizations of different types of user engagement metrics within my product. And what this allows me to do is pull the data out of BigQuery through APIs and represent it in code. So behind the scenes here I've got a user events table. Now, this table is a paltry 500,000 rows, but it's a modest amount of data for the usage of this product; it's a demo application, so it doesn't have tens of millions of rows.

What we do then is write a little bit of code to access this. And here's the code. We're going to go to BigQuery. This is a little Python script. You'll notice what we do is import a library. Google has client libraries available for a variety of different languages, and it's quite simple. We create a BigQuery client, we connect to the data set, and then within the data set, we connect to a table. And within the table, we are then going to run a query. If we want to run an arbitrary query, for instance, we take our client, we run a query, we designate the query. We give it a timeout, so this makes sure that the query has up to 45 seconds to run. We run the query, and then we simply iterate through the results. And this allows us to process the results of the query. If you want to write to BigQuery, it's quite straightforward: you simply take the table and you insert a row. And this adds events into the backend. So this data gets written automatically, then the API pulls it out and renders it in dashboards within your application.

Now, you may be using a different tool to do your visualization and reporting. You may not use the web UI, and you may not want to go so far as to write a custom dashboarding application. Let's say you want to use something like Tableau or Looker, or in this case, I'm using a product called Mode Analytics. These BI visualization and analysis tools allow you to connect to BigQuery as a backend. So you'll notice over here-- let me make this a little bit bigger-- you'll notice I have data sources. If I click on the plus sign, I can actually select a backend data source, and I want to connect to BigQuery. OK. So it's saying I want to connect to BigQuery. It wants to know my project ID. So I am going to pick my project ID, which is GCP webinar demos. And then I need to create a service account. The service account is created through IAM, the identity and access management. And you'll notice that I've got a variety of service accounts here. I can go ahead and create a new service account. Actually, sorry, service accounts are here; those were the roles. I'm going to create a new service account and I'm going to call this mode. And I want to give this service account BigQuery admin permissions, so that it can create tables, delete tables, read data, and so forth. And I want to get a new private key. And because I read the Mode documentation, I know that it needs to be a P12 key. So I'm going to create this service account.
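As an aside on the Python script described above: the transcript's code isn't shown, and it uses an older client-library style (client, dataset, and table objects), but with the current google-cloud-bigquery package a similar read-and-write flow, authenticating with a service account key like the one being created here, might look roughly like this. The project, dataset, table, and field names are all placeholders.

```python
from google.cloud import bigquery

# Authenticate with a JSON service account key; the Mode setup in the demo
# uses a P12 key instead, which is specific to that tool.
client = bigquery.Client.from_service_account_json("service-account-key.json")

# Read: run an arbitrary query and give it up to 45 seconds to finish.
query = """
    SELECT event_type, COUNT(*) AS event_count
    FROM `my-project.app_data.user_events`   -- placeholder table
    GROUP BY event_type
"""
for row in client.query(query).result(timeout=45):
    print(row["event_type"], row["event_count"])

# Write: stream a new event row into the table.
errors = client.insert_rows_json(
    "my-project.app_data.user_events",  # placeholder table
    [{"user_id": "u123", "event_type": "call_started", "ts": "2017-04-20T12:00:00Z"}],
)
if errors:
    print("Streaming insert failed:", errors)
```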
And I get the credentials. OK? Actually, we'll cancel out of that. I don't want to put it in my keychain; we'll just skip over that part. We go back to Mode, and we've got to give it the name of the service account. So the name of the service account, or the email address, is this. So we designate that, and then we have to upload the key file. And that should be the key file. And if all went well, this has now connected my tool with BigQuery. So I can create a new query. Let's pick one that looks a lot like what I was using before. Let's not use the 100 billion; we can use the 100 million. OK, now I need to enter a query. So I'm going to go pick one of my pre-canned queries. Let's go ahead and use the standard SQL format for getting information about United's on-time arrival. I can paste that in. It will run the query. It gives me a tabular version of the results. And if I want to, I can create a chart, for instance, that shows on-time arrival. I can change the chart type if I want to, this sort of thing. So this shows how you can do analysis and visualization within a third-party tool, so you can fit BigQuery into your normal data analysis and analytics workflows. Pretty cool.

OK. Let's switch back and talk about BigQuery's functionality overall. We showed you some pieces of it; let's highlight additional capabilities and use cases so that you'll have a better understanding of how to apply it in your environments. So again, BigQuery is this fully managed, massive scale, low cost enterprise data warehouse that Google provides, running on top of their compute, storage, and networking infrastructure. There is a demo query that I've seen at some conferences that, when you run it, takes 3,300 cores, 330 hard drives, and 330 gigabits of network bandwidth for 30 seconds to complete. Now, those are tremendous numbers. But because Google has so much available capacity, they're able to give you a 30-second slice of that level of resources to run your query, and then claw those resources back and deploy them against somebody else's query.

There is no infrastructure for you to manage. You'll often hear this referred to as being serverless. Well, it's clearly not serverless because there are servers involved, but you don't have to deal with any servers. You focus just on getting your data in and on analyzing your data. You don't have to worry about standing up servers, tearing down servers, configuration updates, any of that stuff. So when you perform queries, you're going to get results really, really quickly. We showed scanning terabytes of data in roughly 35 seconds. At the petabyte level, you're talking a few minutes. So this actually makes turning around analysis on huge data sets very, very fast, which means better analysis, more creativity, and more interesting insights that you can derive.

It uses standard SQL. I showed you that there is a legacy SQL syntax, but the newest generation of the query language is ANSI SQL 2011 compliant, so you can actually lift queries from other environments and, with very little tweaking if any, they will run in the BigQuery environment. You can use the web UI as I demonstrated, or you can use third-party tools of choice, like Looker, Tableau, or Qlik. They've got date-partitioned tables, so you can actually have BigQuery automatically partition time series data by day, and this allows you to have much higher performance queries.
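As a quick illustrative sketch of date partitioning (the table and column names here are placeholders, not from the demo), creating a day-partitioned table and querying a slice of it with the Python client might look roughly like this:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create a table partitioned by day on a TIMESTAMP column (placeholder names).
table = bigquery.Table(
    "my-project.telemetry.call_events",
    schema=[
        bigquery.SchemaField("call_id", "STRING"),
        bigquery.SchemaField("event_time", "TIMESTAMP"),
        bigquery.SchemaField("jitter_ms", "FLOAT"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_time",
)
client.create_table(table)

# Queries that filter on the partitioning column only scan the matching days,
# which is where the performance (and cost) win comes from.
query = """
    SELECT call_id, AVG(jitter_ms) AS avg_jitter
    FROM `my-project.telemetry.call_events`
    WHERE event_time >= TIMESTAMP('2017-04-01')
      AND event_time <  TIMESTAMP('2017-04-08')
    GROUP BY call_id
"""
for row in client.query(query).result():
    print(row["call_id"], row["avg_jitter"])
```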
You can also have BigQuery query against external data sources, so you can have a bunch of data inside of BigQuery and then join that data against external source data as well.

How do you get these massive amounts of data into BigQuery? Well, you can do this interactively. You can upload files through the web UI, or you can use the bq command line utility and upload a file from your local PC. However, obviously that's going to work better for a few megabytes of data than it will for a few terabytes of data. So there are other ways that make more sense for transferring really large volumes. You can stream data directly into BigQuery at 100,000 rows per second. So one of the things that we did at Highfive, this video conferencing startup that I used to work at, is we would generate-- say you had 10,000 users on calls. For all 10,000 calls, we were generating second-by-second telemetry data on hundreds of different counters for every single call. We were writing that into a production database, but then streaming it into BigQuery. And BigQuery was the data store from which the customer-facing team dashboards drew all of their data.

Google Cloud Dataflow is a data pipeline framework that allows you to easily stand up ETL-style transformation jobs that take data from incoming sources, manipulate the data, and write it out to some target repository. Dataflow works beautifully with BigQuery, so you can easily be processing batch data or streaming data and pumping it through a transformation engine into BigQuery. There's also the newly announced BigQuery Transfer Service, where you can set up regular transfers of your YouTube data, your DoubleClick data, and your AdWords data, and have that stream automatically into BigQuery for performing analytics. There will be additional services coming to the BigQuery Transfer Service in the future. Right now, through the middle of 2017, the BigQuery Transfer Service is free; after the midpoint of the year, it will become a billable monthly service.

Availability and access. Your data is replicated across multiple geographies, which means that the durability and availability are through the roof. If you have a device go down, if you have a zone go down, if you have a region go down, you're still going to be able to get access to your data, and you will not lose any data. You also can control who can access that data. There are identity and access management roles that you can assign, so you can say your developers can write data and your analysts can read data, or your end users can read data. You can also set permissions at the per-data-set level. You have really high availability, as I mentioned: a 99.9% SLA on availability. The data that you store in BigQuery is encrypted at rest. We mentioned the fact that you can use IAM roles, and the fact that you can go through the command line, the web UI, or external clients.

Pricing. There are two components to pricing for BigQuery: a storage component and a query processing component. For the storage component, you pay $0.02 per gigabyte per month. If you store the data for longer than 90 days without changing it, you actually get a discount on that. So if the data doesn't change but you keep querying against it, the price automatically goes down. You pay $5 per terabyte of data processed for queries; the first terabyte each month is free. Companies that do more than $10,000 a month in analytics processing can actually get flat rate pricing.
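As a side note on the on-demand pricing just described, you can estimate what a query will cost before running it by doing a dry run, which reports the bytes that would be scanned without executing anything. A minimal sketch with the Python client follows; the table reference is the same assumed public benchmark table as above.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Dry run: validates the query and reports bytes scanned, but processes nothing.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT title, SUM(views) FROM `bigquery-samples.wikipedia_benchmark.Wiki100B` GROUP BY title",
    job_config=job_config,
)

tb_scanned = job.total_bytes_processed / 1e12
print(f"Would scan about {tb_scanned:.2f} TB, roughly ${tb_scanned * 5:.2f} at $5/TB")
```

At that rate, the 3.64 terabyte, 100 billion row query from the demo works out to roughly $18, in line with the $20 or so the presenter mentions paying.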
With flat rate pricing, you buy what are called slots. It's sort of technical, but it's a different model for processing fees, and you can get what is essentially a discount by using flat rate pricing. If your volume goes above $10,000 a month, certainly up to about $40,000 or $50,000 a month, talk to your Google sales rep and they can hook you up.

It should be noted that BigQuery is not a replacement for every enterprise data store. It's not a transactional relational database. So if you're looking to read and write individual cells in rows in a table and constantly update them, that's not what this is for. BigQuery is for data that you write in and don't change, but that you're doing reporting against. Similarly, it's not an operational data store, and it is a cloud-based solution. So if you have an application that requires on-premises data warehousing, BigQuery would not work for that. But you want to ask yourself: are the benefits of BigQuery worth making the changes necessary in our workflow to put this data in the cloud? A lot of organizations that didn't think they would want to, or be able to, do their data warehousing in the cloud are finding that the benefits are great enough that they can find creative ways to make it work. So given this, you want to work with your internal teams to figure out what you can do with BigQuery.

For example, Shine Technologies was working with a customer that did a huge amount of internet advertising. What this customer wanted to be able to present to their executives was a live dashboard that showed the performance of their ads. So what they needed was the ability to stream millions and millions of ad impressions into BigQuery, generate analytics on this constantly updated data set that had billions of rows, do complex queries to drive insights, and then visualize them. And Shine was able to do this without any Hadoop clusters, without any sophisticated systems operations investment. They were able to deliver dashboards to the client that would consistently render visualizations of up-to-date data in 20 to 30 seconds. It saved tons of money, and it actually became a product that they were able to start delivering to other customers as well.

Now, when you walk away from today's session, you may have questions as you're trying things out, and you may need guidance on how to perform certain tasks. So where do you look for information? How do you find what you're looking for? As always, you want to start with search, and just type in BigQuery and your topic. So for instance, you want to know how to get data into BigQuery: search for BigQuery ingestion. And you'll notice you've got preparing data for loading, streaming data in, improving data ingestion times, and so forth, OK? For instance, this article talks about using Avro, which compresses the data before transmitting it and decompresses it on the other end. This is built into BigQuery, and it makes data loading much more efficient.

There's also the typical table of contents in the BigQuery documentation. So let me actually take you on a quick tour. If we type in BigQuery, up here at the top will be the splash page from Google. If you click on View BigQuery docs, you'll get the typical table of contents. There's a Quickstart here. Actually two: one for the web UI, and one for the command line. These get you up and running in less than five minutes; they give you a very quick taste. And then you've got a drill-down into the how-to guides.
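As a small sketch of the kind of load job that ingestion article is describing, loading Avro files from Cloud Storage with the Python client might look roughly like this; the bucket, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Load Avro files from Cloud Storage into a table. Avro is self-describing,
# so no explicit schema is needed here, and it compresses well in transit.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/flights-*.avro",      # placeholder bucket and path
    "my-project.airline_ontime_data.flights",     # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish

table = client.get_table("my-project.airline_ontime_data.flights")
print(f"Loaded {table.num_rows} rows.")
```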
So you can see: how do I load data? How do I prepare the data before I load it? How do I transform the data? How do I load it from Cloud Datastore, or Cloud Storage, or via API? How do I stream it in? And then, how do I do queries? There's a section down here on how do I export data, and then general basics around security, management, and so forth. So it's a pretty well organized, reasonably exhaustive reference that should be very useful to you.

Now, I don't have a prescribed methodology for how to do an evaluation of BigQuery, because it really depends on your applications. I've heard two very different approaches commonly used by companies that have large scale data warehousing needs. One is to take your top half dozen queries, the jobs that are breaking on your existing infrastructure or that just take a long time, and move them to BigQuery. So you just pick five or six queries, you load your data, you run the queries, and you see how much better they are. If they're better, great. You're ready to go. On the other end of the spectrum are companies that are going to be very methodical and exhaustive about the way they do the evaluation. Before they can move any of their data into the cloud, they have to do compliance and regulatory research. They have to cleanse all of their data. They may need to anonymize the data. They load massive amounts of data, and then they essentially replicate their existing analytics workflows in BigQuery. And then they may go so far as to actually do blind taste tests: they have their front end representation be identical, but they have half of the users go through the new BigQuery-enabled pipeline and half the users go through the old, and they judge the efficacy, the level of satisfaction, and so forth. So it's going to be up to you, and it will depend on what vertical you're in, what data you're looking at, and so forth. Certainly, a quick and dirty way to get a sense of what the potential benefits are is to do the half dozen queries. And then, depending on your long-term applications, you may need to do a much more exhaustive proof of concept.

So that's it for this session. Thanks so much for joining us. I hope you found it interesting and helpful. We're going to send you a follow-up email, and it will have a link to a code lab, so you'll get a guided walkthrough of BigQuery that you can do on your own. We'd love to get your feedback, so feel free to provide it in the UI in front of you, and please stay tuned for the next session.
Info
Channel: Google Cloud Tech
Views: 158,938
Rating: 4.9527068 out of 5
Keywords: BigQuery, big query, analyze data with BigQuery, analyze data, BigQuery demo, BQ, BiqQuery best practices, bigquery tutorial, Google Cloud webinar, BigQuery capabilities, Qwiklabs BigQuery, intro to BigQuery, Google Bigquery, BigQuery for beginners, how to use BigQuery, how to use big query, google bigquery tutorial, analyze data Google, what is BigQuery, Google Cloud Platform, GCP, google cloud platform tutorial
Id: qqbYrQGSibQ
Length: 29min 14sec (1754 seconds)
Published: Thu Apr 20 2017