BigQuery Tips: Nested and Repeated Fields and How GOJEK Builds Data Warehouses (Cloud Next '19)

Captions
[INSTRUMENTAL MUSIC] GAURAV ANAND: We all know the world is generating more data than ever. So how do we make sense of this data? Traditional data warehouses were not designed to handle this explosive growth in data. So as data volume increases, complexity increases, and the number of use cases around the data grows, enterprises need actionable insights.

So what do we mean by actionable insights? As the cost of storage and data processing falls, enterprises want to process, store, and analyze all their data sets, internal and external to the organization. The modern data warehouse also needs to reflect the current state of the business. That means you need to act in real time. You need to make those decisions in real time. We don't need a data warehouse that takes weeks, or even months, to get the software installed or the hardware provisioned. We also want the insights to be available to business users, in their hands, so that enterprises move towards data-driven decision making. And last but not least, we need security and governance. We basically want the data accessible to the right stakeholders, internal and external to the organization.

Now this statement still holds true. However, as demand for new data types and new use cases has grown, and complexity has come in over the past few years, the data warehouse architecture needs to evolve to meet these demands, in both distributed and centralized solutions. If you look over the last two decades, enterprises started out using the data warehouse for operational reporting purposes. Then we had self-service BI coming in. And over the past decade, there has been a huge shift towards data mining, AI, and ML. So as businesses aspire to be data driven, this requires the data warehouse to support their machine learning and AI initiatives.

So we all need the data warehouse of the future, but what worries us? We need to remember that data warehousing is not a use case. Data warehousing is a solution that enables higher-order use cases that make businesses data driven. And these are some of the common challenges we see enterprises and businesses face while managing and growing a data warehouse. So it's time to modernize your data warehouse with BigQuery if you haven't already.

BigQuery is Google Cloud's enterprise data warehouse. It scales from gigabytes to petabytes. There are numerous features of BigQuery to call out, out of which a few unique ones are listed there. In my experience working with large digital enterprises, we have seen them get a lot of value from real-time analytics, because BigQuery natively supports real-time analytics: you can stream your data all the way into BigQuery and make those real-time business decisions. So the challenges I showed you on the slide before, BigQuery enables us to overcome by offering scalability, simplicity, security, and a low TCO for all kinds of businesses, no matter how small or large.

In terms of architecture, with which most of you might be familiar, BigQuery is fully managed and serverless. Storage and compute are decoupled so that they scale independently on demand. This offers immense flexibility and control for enterprises, and you don't need high-compute resources running 24/7. This is very different from traditional node-based, MPP data warehouses.
So if I have to summarize what BigQuery brings to the table, it gives you a modern data warehouse, focused on these six key areas. With that, I would like to introduce Zaki from GOJEK to share his experience, and GOJEK's experience, building data warehouses and using BigQuery. [APPLAUSE]

ZAKI SUKMA: Thank you, Gaurav. Hello everyone. Thank you for joining our session today. I am Zaki, and I work as a data engineer at GOJEK. Here, I would like to share GOJEK's experience in building our data warehouse while we also carry out our international expansion effort. Has anyone here heard about GOJEK before? You can raise your hand. Oh, wow. Cool. Thank you.

OK. So GOJEK is a technology company in Indonesia that aims to improve people's quality of life. GOJEK was started in 2010, actually not as a technology company, but as a call center for a motorbike taxi service called ojek. The idea of having ojek take passengers around and move things around is very powerful, so in 2015 we were able to launch our first three digital products, called Go-Ride, Go-Send, and Go-Shop, all centered around moving people and things around. And people liked our products a lot. So in 2016, we were able to expand to many other kinds of services, like lifestyle, digital payment, entertainment, and many others. Those many products have empowered people in Indonesia, so their quality of life has improved. And our businesses have been very successful, so in 2018 we began our international expansion to countries in Southeast Asia.

OK. So we have so many products here. You can see we have Go-Ride and Go-Car for transport services. We have Go-Send and Go-Box for logistics services. We have lifestyle services like Go-Massage and Go-Glam, we also have Go-Clean, and many others. We also have Go-Pay for payment. By having that many products in a single application, people like to call us a super app company. And all of those products are very nice, because they bring convenience to people while also empowering the drivers and the service providers, so they can earn a living.

OK. This is our global footprint. We have expanded to four countries in Southeast Asia-- Vietnam, Thailand, Singapore-- and soon we are also expanding to other countries. In those countries, we are operating in 204 cities in Southeast Asia. Operating in those cities, our application has been downloaded more than 130 million times. We also empower more than 400,000 merchants and more than 2 million drivers, generating more than 100 million bookings.

So we move on to our data warehouse architecture. This is our data warehouse architecture using GCP. You can see that we have many, many Google products there. We use BigQuery for our data warehouse. We use Cloud Storage for our data lake, and then Dataproc, Kubernetes, and Dataflow for our ELT execution, and Composer and Stackdriver for operations and monitoring.

So we already know that we offer many services for many different kinds of businesses. So our data warehouse has the requirement to support a range of different businesses. The Google products have helped us build a data warehouse that fulfills all these requirements, supporting many different teams, each probably with a different kind of setup. One team can have a Postgres cluster, a MySQL
cluster, and then a user tracking system using a third-party platform. So all the Google products have helped us in building our data warehouse. And the data warehouse is called a multi-product data warehouse, because we have to support that many kinds of products.

OK. These are the current statistics of our data warehouse at the end of Q1 2019. You can see that our data volume is increasing more than 30% each month. And because more and more data is ingested into our data warehouse, many people can use the data for insights and analytics. They have generated more than 13,000 dashboards. The dashboards support everything from the high level, like the strategy team making strategic decisions, down to the operational level, solving passenger and driver cases and problems.

So the most important metric of our data warehouse is this: the growing number of BigQuery users. Our data warehouse uses BigQuery, and you can see that at the end of Q1 2019 we had 2,100 active users of BigQuery. That is three times the number of users in mid-2018, which is the period when we did our global expansion. More and more people rely on data when making decisions.

And then here comes the challenge. When people want to develop ETL jobs, when people want to schedule their analytics queries, they rely heavily on our BI engineers to do that. People rely heavily on our BI engineers to create ELT jobs. And our BI engineers' capacity is limited, and more and more requests keep coming, beyond our BI engineers' capacity.

And then here comes the other challenge. While we are undergoing very fast growth, we also have to do our international expansion. So we have to keep up with our growth, while we also have to support the global expansion. We have to extend our data warehouse to countries in Southeast Asia, and we are very lucky, because the GCP products that empower us have helped in many ways to support our data warehouse. But improvement can be made. Improvement can always be made. While we are operating and can support all of those use cases, we realize that our current setup is not the most efficient setup. We can do more and more things to improve our efficiency.

So we came up with several questions that we asked ourselves: how can we survive the global expansion? Several questions that we essentially had to solve in a short period of time. The first question is, how can we extend the data warehouse to countries in Southeast Asia, and set up GCP projects faster and easier for those extensions? The second question is, how can we manage user access? We know that the number of our data warehouse users is growing, and the thing that will directly impact us is how to manage all that user access. If we don't manage user access better, we will have difficulties handling all of those accesses. The third question is, how can we make ELT job development less dependent on our BI engineers, and how can we make the development friendlier? Usually the one who develops the ELT job is a BI engineer. How can we make development friendlier to non-BI engineers, like analysts, like product managers, or even the operations team, who want to run some queries to solve people's problems? And the last question is, how can we maintain users' confidence in the data? We don't want users to always worry about, oh, what is the quality of this data? Is this data good to use now?
Or what are the things that I need to check to ensure that the data can be used?

OK. So the first thing that we did, to set up GCP projects faster, was to express all the configuration as infrastructure as code. We want our data warehouse to be extended to each country in Southeast Asia. Some of them will have a common configuration; some of them will have a specific configuration. Those configurations can include the data modeling layers, expressed as GCS buckets and BigQuery data sets, and many other things, like the Composer instances and the Kubernetes clusters for executing the ELT jobs. So how can we set up the projects? Here we are using Terraform, a platform that helps us express our configuration as code.

The nice thing about expressing the configuration as infrastructure as code is that we can set up faster. And by setting up faster, we can experiment with different kinds of setups. Say we want to experiment with Vietnam, for example. We can try this data modeling, we can try this Composer instance, we can check whether this Kubernetes cluster is enough or not. We just express all the configuration, run Terraform plan and Terraform apply, and the projects are there. We can experiment with them, and if we are done and not satisfied with the settings, we can use Terraform destroy to delete the projects and repeat with other configurations. The other nice thing is that all the core configuration is stated explicitly in one file, or maybe two files. So we can share the configuration, everybody can see it, and we can have a common understanding of how our data warehouse is set up.

The second part is, how can we better manage user access to data? So we built a tool called Data Pier. Data Pier assists users in onboarding onto the data warehouse. Previously, users requested access from us by email, or by coming to our desks directly. It's not scalable to do it that way. Every time we get an email asking for access, we have to do some work. And some people just come over and say, hey, give me access to these data sets. Then we have a conversation, and yeah, we lose productivity. So we built Data Pier, so people who need access just open Data Pier and request access in the dashboard. In Data Pier, people can also see which Google Groups can access each data set, and they can join a Google Group that has access to the data set. So we manage user access using Google Groups, and the backend of our tool is the Google IAM API.

The third thing that helped us very much to boost our productivity in developing ELT jobs is having a tool that lets people set up their own ELT jobs. We think SQL is a very nice thing. We can express transformations in SQL, because SQL is simple, straightforward, and very intuitive. So we moved from code-centered transformation to SQL-based transformation. When people already know SQL from doing analysis, we can teach them SQL more easily than teaching them how to code. If they know SQL, they can set up their own ELT jobs using the tool. People who have a query only need to install our CLI tool, submit the SQL query, select the destination data set and table, and determine the schedule for the job. The tool will take care of the rest. Everything is automated. And people will also know the result of the execution.
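A rough illustration of the kind of SQL transformation an analyst might submit through such a CLI tool (a sketch only: the raw.bookings table, its columns, and the destination are hypothetical, not GOJEK's actual schema):

    -- Hypothetical scheduled transformation: daily completed bookings.
    -- The destination table and the schedule are configured via the tool.
    SELECT
      DATE(event_time) AS booking_date,
      COUNT(*) AS completed_bookings
    FROM raw.bookings
    WHERE status = 'COMPLETED'
    GROUP BY booking_date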
And then the fourth thing that we built is a data quality service. We had the problem of convincing people that this data, this table, is good to use. We don't want them to always worry about the quality of the data. We don't want them to be surprised later because their data has completeness issues or uniqueness issues. So we built an automated data quality service that helps people check the quality of their data. We implemented the data quality service on Google products, like Kubernetes and Composer, to schedule the execution of the data quality profiling.

When people finish creating their ELT jobs, they can continue by creating a data quality specification: for these columns, what is the threshold for completeness? What is the threshold for uniqueness? They submit the data quality specification to a repo, and the tool takes the specification, compiles it, and runs the data quality profiling based on it. The data quality assessment is done immediately after the execution of the ETL job, so people always know the quality of their data, and they can get an alert if some columns are not complete enough or some values are not unique.
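A minimal sketch of the kind of completeness and uniqueness profiling such a service might run (the table, columns, and thresholds here are hypothetical, not the service's actual implementation):

    -- Hypothetical profiling query: share of non-null values (completeness)
    -- and share of distinct keys (uniqueness) for the specified columns.
    SELECT
      COUNTIF(customer_id IS NOT NULL) / COUNT(*) AS customer_id_completeness,
      COUNT(DISTINCT booking_id) / COUNT(*) AS booking_id_uniqueness
    FROM raw.bookings;
    -- The service would compare these ratios against the thresholds in the
    -- submitted specification and alert when they fall below them.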
But still, a problem with data remains. So what is the problem with data, actually? We already know that a data warehouse is a data warehouse, and data comes to us. How can we create a single version of truth? How can we create an authoritative table, a single table that can be used by many people for analytics and reporting, so people don't have to compare one source of data with another because there are too many data sources? The other thing is, how can we make sure that the single version of truth is rich in context? For example, in GOJEK, we have booking data with many dimensions, like pickup, drop-off, duration, and the event logs. So the problem is, how can we structure and model the data? Luckily, BigQuery has some nice features that help us with such problems. And here we have Evan, who will show us how we can structure the data in our data warehouse using those features, so we can get the best of BigQuery in solving our problems. Evan, please come. [APPLAUSE]

EVAN JONES: Good afternoon. It's always interesting doing a session at 4:00 PM, right before-- do you guys know the act who is coming on tonight? Who's attending the party later on tonight? Anyone know who the act is? Gwen Stefani, right? So I'm literally the gatekeeper between you and Gwen Stefani, which is tough. She's a lot better of a singer than I am, I can promise you that. But I would hazard a guess that I can help you out with data warehousing a little bit better than she can. So the problem of data is still there. If you don't mind switching back to the slides, I'll do the demo just a little bit later.

So I've been at Google four years, and one of the problems I really, really like tackling-- I'm a technical curriculum developer-- is taking BigQuery and the things it can do that seem like magic, like including many different types of data, petabytes of records, millions of different events, all inside of one place, and making it accessible to everybody. I'm going to show you a little snippet of code, and I'll provide you the code. The session is being recorded. Also, we're going to be doing a hands-on lab after the session, so if you really want to get into the granular details, you can follow me back to Moscone South.

I started my career 15 years ago teaching SQL, and one of the biggest things that was taught to me, and I'm going to ask you for the answer to it: if you have data that's in four different places-- you have events, you have orders, you have pickups or drop-offs-- that's what we call a normalized data structure. Raise your hand if you've worked with databases before and you've heard of normalization. Raise your hand if you've worked with BigQuery before, as well. Awesome. Keep your hands raised if you work with arrays inside of BigQuery, or JSON data. OK, cool. So I'll speed through this a little bit.

The constant struggle is, do we split our data across many different tables? Inside of SQL, what do you have to do as a BI analyst to bring all those back together? You've got to join them all back together. It's everyone's SQL 101. You absolutely hate that stuff. Or you dump it all in one big table, and you have data issues: you no longer have one fact in one place, you repeat them. So since you've worked with, or heard of, arrays before: one of the really cool data types natively supported inside of BigQuery is the array. You have object-oriented programmers saying, obviously, yeah, I want to do arrays for everything. And you have SQL developers saying, whoa, whoa, we're rows and columns, keep your arrays to yourself. BigQuery, as a data warehouse, as a data platform, can support both, and we're going to walk through examples of what's really beneficial about storing your data in a nested and repeated format.

This is GOJEK. Zaki and Gaurav provided me with an awesome table. This is a bookings table. Zaki mentioned you have lots of different information about GOJEK bookings, like motorcycle ride shares, all inside of one table now. All that different context from 15 different tables is all in one. If you notice, it looks really strange if you've never seen arrays inside BigQuery before. This is actually only four rows of data in this screenshot; the other 1,000 or so are in BigQuery. What's interesting is that you see a status field and a time field, arrays that look like there are four rows of data inside of that first one. That's an array of data. If you're familiar with arrays, it's the brackets. Essentially, BigQuery visually breaks that out for you, but technically, behind the scenes, it's still in those brackets. And it's going to get really interesting when you see the demo of how to work with that. It's natively supported, that's one data type, and you'll see how we work with those in just a minute. The other data type that's interesting-- if you're looking at the top, you're like, I'm familiar in SQL with an alias, naming a field something else, but what on earth is events dot something, pickup dot something, destination dot something?
It almost looks like you've got four or six different tables all inside of one. The structure-- and this is not specific to BigQuery, it's just generally part of SQL, and it's supported in BigQuery-- is called a STRUCT, short for structure. You combine both of them together, and you can have really wide tables. A STRUCT is kind of like a pre-joined table inside of BigQuery. So you get both the width of having information on events, pickup, destination-- I'm going to show you an example data set that has over 30 STRUCTs in it; that's the Google Analytics schema-- and the granularity of having one row technically represent one order, like a ride-share pickup or something like that. But within one row, you can have extreme granularity at the array level, with as many different things happening within one order. So your queries can stay at the high level-- count star from all orders? Sure, I got four orders-- or I can look deep into the array values there, without having to worry about data stored in a repeated fashion, like a truly denormalized table.

All right, so a little bit of practice for you, and a question. I'm a technical curriculum developer; I'm not going to leave you without a quiz, and we definitely have some cool lab promotions for you, as well. In a schema inside of BigQuery-- if you've worked with BigQuery before, you've seen this before-- yell out maybe one or a couple of the STRUCTs. How do you know whether it's a STRUCT? You're looking for a data type of RECORD. Whenever you see RECORD in somebody's schema, you're like, whoa, this person is using STRUCTs. Yell one out. AUDIENCE: Events. EVAN JONES: Events is one. Give me another one. AUDIENCE: [INAUDIBLE] EVAN JONES: Pickup, cool. What else? AUDIENCE: [INAUDIBLE] EVAN JONES: If you leave this class with nothing else: understand this schema, and the fact that whoever created that table you've inherited is using semi-structured data, either nested or repeated. Especially when you-- raise your hand if you've worked with JSON data before. I never know if it's JSON or Jason data; that's another argument for a different time. Ingesting that in, you definitely get some nested and repeated values. You can actually nest values 15 layers deep. The computer can handle it; mentally, a human-- if you gave them essentially 15 tabs in a data set, your BI analyst might get really mad.

All right, see, you're exactly right. These are essentially like four other tables that are pre-joined in there. Pre-joined means a huge performance benefit: you actually get a lot of performance by not having to run that join behind the scenes, which is cool. So MySQL, Postgres, Cloud SQL: that's transactional, that's 80% writes, 20% reads. BigQuery, a data warehouse: you're doing the majority, 80%, reads instead of writes. It's better to have that data stored inside of one table, not taking performance hits on joins.

All right. Now look at the mode, and then find me the array or arrays. You guys are fast. That's exactly right. The events have that double level of granularity. The rows of this table, yes, they are on the order level. When you look at Google Analytics, it's on the session level. But within each of those different rows, you can get extreme granularity by nesting an array, which is great. But when you write SQL against an array, unless you're experienced with it-- which you're going to see in the demo-- it's really, really painful. So a quick recap before the demo.
Anytime you see RECORD, that immediately means a STRUCT. STRUCT is not specific to BigQuery, but it's natively supported. Anytime you see REPEATED in the mode, that means it's an array. You don't have to have a STRUCT to have an array, but if you have both, you can nest arrays within STRUCTs. For example, the events field is a STRUCT, as you see here, but a STRUCT doesn't have to be repeated. There are other things in here that aren't repeated, like the pickup latitude and pickup longitude. Unless you're changing your motorcycle ride mid-trip, there's no need for that to be an array. So that's not an array, that's just a normal STRUCT. But the first one, the events: there are many different things that can happen over the course of the order. Rider picked up, dropped off, charge reversed, that type of stuff. That's an array.

All right. Let's take a look at what this actually looks like inside of code. So we're inside of BigQuery. You're familiar with BigQuery by now; if you're not, it's a petabyte-scale data analytics platform. You can throw arbitrary amounts of SQL against it. If you're going to take a screenshot of one slide-- and this is what I did for those people who want to leave early and catch Gwen Stefani-- I want you to take a screenshot of this. These are all the free links that you should know about. This session has a Dory, which is essentially a question-and-answer board. If we don't get to your question at the end, I want you to post it there-- if you have a complicated data question, or you want a cool link to a lab, post the question there-- and later tonight, or later this week, Gaurav, Zaki, and I will definitely answer it. Next thing: all the code that you learn here-- I've spent the last year of my life building out interactive labs to get you from a blank canvas in BigQuery, to building machine learning models, to building nested and repeated data structures, to performance-optimizing your queries, and you can get them for free, which is pretty cool. So those are the free labs. If you want to hear more of my voice, you can actually take the Coursera course, and you get a month free. [INAUDIBLE] I don't want to spend too much time advertising, but if you don't get what you want out of this session, you're welcome to self-study with the labs, or come back with me to Moscone South after this, where we're going to work through one of them.

So inside of BigQuery, Zaki and the team at GOJEK were kind enough to give us a table. Inside of BigQuery, a cool thing that you can do-- and I'll provide this code to all of you-- if somebody gives you a table and you've never seen this before: you can actually select the entire query and, holding down the Command key or the Windows key, highlight all the data sets that are inside of the query, which is kind of cool. In there, generally we advise against select star: don't let your BI analysts do select star from a table limit 10. BigQuery already stores a lot of that metadata in a preview, so I don't even need to write any SQL to see what data is in there. So, recap: you can click on a table name, you can look at the metadata, and this is exactly where I took the screenshot for the slides. And you're going, whoa, it's one row, but it's three things. And now you know what that is: it's an array of different values for timestamps. You're like, OK, cool, that's not too bad.
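To make that concrete, here is a rough DDL sketch of a table shaped like the bookings table in the demo. The demo.bookings name and the exact fields are illustrative guesses modeled on the fields shown on screen, not GOJEK's actual schema:

    -- Illustrative only: a nested (STRUCT) and repeated (ARRAY) schema.
    CREATE TABLE demo.bookings (
      order_id STRING,
      payment_method STRING,
      pickup STRUCT<latitude FLOAT64, longitude FLOAT64>,       -- RECORD in the schema view
      destination STRUCT<latitude FLOAT64, longitude FLOAT64>,
      events ARRAY<STRUCT<status STRING, time TIMESTAMP>>       -- RECORD, mode REPEATED
    );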
Now what I want to do is just get all the orders and all the different statuses for those orders. So I'm querying this-- and you can see how this might potentially blow up-- and I'm going to query these. So what's that going to look like? Well, you saw that big red exclamation mark. What? So this is the huge takeaway for the next five minutes of this lecture: anytime you see this error, you can Google it. It will probably take you to a video, to this session, or to the documentation. Don't be afraid. You're literally doing what seems like a simple, single query, but all the error means is that this field right here is an array, and we need to essentially unpack that array back into rows and columns before we can operate on it. BigQuery will definitely take your array and ingest it, no problem.

But before we do that, let's talk about how arrays work inside of BigQuery. If you're familiar with object-oriented programming, you're going to see arrays a lot. And if you're a data analyst or data engineer, you'll definitely see them a lot in your data warehouse. An array is an ordered set of values, and they must share a data type. So here's just an array of strings, no big deal. We just passed in a bunch of fruit and stored it as an array, and we're specifying an array here. We got one row, with four things in that row. You can also infer it: I don't tell it that this is a string value, but it automatically knows that that's going to be an array.

What are some cool things you can do with arrays? Once it's in an array format, you might want to say, hey, I want to find a particular element in the array. Who's brave enough to shout it out and have it recorded on YouTube: what are we going to return here when I run this? I've got raspberry, blackberry, strawberry, cherry, and I've got OFFSET of two in there. Look at line 41 before you shout out. What do you think? Hands up if you think cherry. Hands up if you think raspberry. What about blackberry? You all are too smart. What about strawberry? All right, yeah. You definitely are earning your alcoholic beverages later. It is zero-indexed, yes, exactly. And if you didn't want OFFSET, you could do ORDINAL. How about now? Yell it out. What are we going to return now? AUDIENCE: [INAUDIBLE] EVAN JONES: Yeah, that's precisely right. So you can do cool things with arrays. You can zoom in on just one particular value. And if I want to see, hey, how many things are in that array, I can do ARRAY_LENGTH. There's a ton of array functions; I'm just showing you the tip of the iceberg of some of the most common ones. You have four items in your cart.
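In SQL, the slide examples look roughly like this (a sketch using the fruit array shown on screen):

    -- Array literal, element access, and length.
    SELECT
      fruits[OFFSET(2)]    AS by_offset,   -- zero-indexed: returns 'strawberry'
      fruits[ORDINAL(2)]   AS by_ordinal,  -- one-indexed: returns 'blackberry'
      ARRAY_LENGTH(fruits) AS item_count   -- returns 4
    FROM (SELECT ['raspberry', 'blackberry', 'strawberry', 'cherry'] AS fruits);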
Now here's the issue. We saw different levels of granularity in the bookings, and it's the same thing here. I can't run a typical WHERE clause-- select star from table where-- against this item; they're on two different levels of granularity, right? So how do I work with arrays? Well, let's go back to the GOJEK example. Taking a look at the preview, we're going to look at the payment method. Always start with a good question, right? All data analysis begins with a good question. Cash, cash, cash: is cash the most popular payment method for the orders? We can write a very simple, single SQL query. We're just going to count orders by the payment method and group by it-- we're not even touching the array values-- and you get the answer back. You're like, OK, cool, this is basic SQL, and we can see an insight: most of the people in this demo table pay with cash, and the second highest is Go-Pay, the native app-based payment. Not a problem. We're not touching any array values.

All right, now I'm going to go back to the table, and there's a bunch of interesting status values associated with an order-- driver canceled, created, driver found. If you're familiar with SQL, one of the easiest ways to get a distinct list of values is select distinct. So, select distinct status. If someone comes along to you with this query, you might just look at it and go, that's absolutely going to run fine. Boom-- you get this error. Immediately you're going to get angry. Don't worry about it, but I want you to train your mind: as soon as you see array, I want you to memorize this word-- UNNEST. If you ever get an array and you need to unpack it, the fancy way we decided to describe it is to UNNEST that array. Once we break that array apart, it's going to look like this, and normal SQL runs fine. You can see that there are, let's see, nine unique statuses.

This doesn't give us too much context, so let's add an additional aggregated field. Let's do a count of the orders by each of the distinct statuses that were in there. Now, a note-- let's pause here. Events is normally stored like this: 1, 2, 3, 4, 5. That's a normal array, nested. Unnested means we're going to break this apart. That's UNNEST. Now, one really devilish thing you can do in your code, and I don't recommend it, is to give both the same name. In my mind, these are conceptually very different: this is nested, this is unnested. Do me a huge favor. When you UNNEST your arrays, AKA turn what's on line 106 into rows, don't give them the same alias. It's incredibly confusing. One of the best ways you can help out your fellow programmers: unnested_array, or whatever you want to name it-- just make it really obvious which one is which, because that's the alias you're actually pulling from. And you can see that that works just fine. Let's see the total count of each status. If I'm going to rename it on the fly for a demo, I need to rename it in all places-- so that's the note I have there, always rename it as something different. And this is fundamentally what's going on: nested, unnested. This is all in one row; this is all in normal, regular rows and columns, and you can go crazy with SQL on it. So you can see the majority here. We get the quick insight: the majority of orders have the status driver found, and then created, picked_up, completed, customer cancelled. Link it to a dashboard if you want-- you can actually explore directly in Data Studio. I'm not going to show you that right now, but you can link a table directly into Data Studio.
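Putting those pieces together, the status-count query from the demo looks roughly like this. The gojek.booking table name comes from the demo, and the field names are assumptions:

    -- Sketch: count bookings by status by unnesting the events array.
    -- Note the alias deliberately differs from the array's own name.
    SELECT unnested_events.status, COUNT(*) AS status_count
    FROM gojek.booking,
      UNNEST(events) AS unnested_events
    GROUP BY unnested_events.status
    ORDER BY status_count DESC;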
All right, I do want to leave a little bit of time at the end. One of the things you might be asking is where you get access to this code. Hey, Evan, I don't have a gojek.booking table, because it's a sample. But what you do have is the Google Analytics schema. And if you think two or three STRUCTs inside of a table is a lot-- this is all in BigQuery public data; there are over 130 BigQuery public data sets for you to play with-- one of them, or if you have your own Google Analytics account, is this Google Analytics sample.

Any guesses on how many-- and you might have already seen the answer-- how many STRUCTs are in the Google Analytics schema? How wide is the schema? Just throw out a number. One, two, three? Seven? Let's see. What are we going to search? Oh, I already cheated-- what word are you going to search for in the type? You're going to search for RECORD. There are 32 STRUCTs inside the Google Analytics schema, essentially 32 pieces of context, other tables that are jammed in there. That's because there's a lot of data collected about the visitors to your e-commerce website. And you can see you can even nest STRUCTs: hits.product.isImpression, hits.eventInfo.eventCategory. So you can get a really, really-- what's called denormalized-- gigantic, singular table for reporting insights. That's exactly how Google Analytics is set up, and that's the data set you can experiment with. And as I said before, you can go 15 levels deep. So if you wanted to fix this query, for example-- the page title is inside an array-- what you would need to do is take hits and unnest it. Give it a different name; don't name it the same thing and confuse everybody. And then you'll be able to select from it, and you get some really cool data back.
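For example, a fixed-up version of that page-title query against the public Google Analytics sample might look like this (the dataset is the real BigQuery public data set; the aggregation itself is illustrative):

    -- Unnest the repeated hits field, aliased differently, then query it.
    SELECT h.page.pageTitle, COUNT(*) AS pageviews
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`,
      UNNEST(hits) AS h
    WHERE h.type = 'PAGE'
    GROUP BY h.page.pageTitle
    ORDER BY pageviews DESC
    LIMIT 10;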
Now, a lot of you might be wondering: why do I care? I'm never going to come across nested and repeated fields; I don't have any JSON data. Ha-ha, you will soon. Because if you're building a classification model inside of BigQuery-- which you can now do with just two lines of code: create model, model type classification, using logistic regression-- a little bit outside the scope of this talk, but what you're going to get back (and this is what the lab is going to be at 5 o'clock in Moscone South) is this: BigQuery is going to say, hey, remember that talk? Hopefully you didn't sleep through it. It's going to give you back, for example-- say the problem is whether or not a visitor to your website is going to buy or not buy on a return visit; it could be whatever problem you want-- for a classification model, each of those classes, yes they will buy, no they won't buy, is going to come back in an array. And it's going to give you a confidence: 55% I'm sure, 44% I'm not sure. And again, if you're working with arrays, as soon as you see that word array inside of BigQuery, what's that other word, starting with a U, that you should immediately memorize? UNNEST. That's exactly what I'm going to do in the last piece here, and then we'll open it up for questions.

So this is the example of the final, final result, where you can do cool things like: I want the machine learning model to tell me its highest-confidence predictions of who's going to come back. You have to break that array apart, and then you can do normal SQL and get, hey, I'm 80% sure this person is going to come back and buy from our website. Now, if you want to know how to create models, that's the next session at 5:00 in Moscone South-- BigQuery ML is really, really cool. Or if you want additional practice working with arrays, we did a full-day boot camp predicting NCAA March Madness, men's and women's, on which teams are going to make it to the finals; we created an entire lab on machine learning for that one. That's this quest. We have an entire lab on loading in and querying JSON data-- essentially this demo, but as a two-hour, really comprehensive lab. And this lab is the one we'll be running a little bit later tonight, or you can take these at your leisure. [INSTRUMENTAL MUSIC]
Info
Channel: Google Cloud Tech
Views: 10,961
Rating: 4.915493 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: 3TVeG_dpGxk
Length: 48min 24sec (2904 seconds)
Published: Thu Apr 25 2019