[INSTRUMENTAL MUSIC] GAURAV ANAND: We all know
the world is generating more data than ever. So how do we make
sense of this data? Traditional data
warehouses were not designed to handle this
explosive growth in data. So as we see data
volume increase, complexity increase,
and the number of use cases around the data
grow, enterprises need actionable insights. So what do we mean by
actionable insights? As the cost of storage and
data processing falls, enterprises want to
process, store, and analyze all the data sets, internal and
external to the organization. The modern data
warehouse also needs to reflect the current
state of business. That means you need
to act in real time. You need to make those
decisions in real time. We don't need a data warehouse
which takes like weeks, or even months, to get the
software installed or even the hardware provisioned. We also want the insights to
be available to the business users, in their hands, so
that enterprises move more towards data driven
decision making. And last but not least, we
need security and governance. So we basically want
to have the data accessible to the right
stakeholders, internal and external to the organization. Now this statement
still holds true. However, as demand for
new data types and new use cases, and more complexity, has come in
over the past few years, the data warehouse
architecture needs to evolve to meet these
demands, in both distributed and centralized solutions. So if you look over
the last two decades, enterprises started
using the data warehouse more for operational
reporting purposes. And then we had self-service
BI coming in. And over the past
decade, there has been a huge shift towards
data mining, AI, and ML. So as businesses aspire
to be data-driven, this requires the data warehouse
warehouse of the future, but what worries us? We need to remember that data
warehousing is not a use case. Data warehousing
is a solution that enables higher order use
cases to make the businesses data driven. And these are some of the
common challenges which we see enterprises
and businesses face while managing the
data warehouse, or even growing
the data warehouse. So it's time now to modernize
a data warehouse with BigQuery if you haven't already. So basically, BigQuery is
Google Cloud's enterprise data warehouse. It scales from
gigabytes to petabytes. There are numerous
features of BigQuery to call out, out of which there
are a few unique features which we have listed out there. In my experience
working with large digital
enterprises, we have seen them get a lot of
value from real-time analytics, because BigQuery natively
supports it: you can
stream your data all the way into BigQuery and
make those real-time business decisions. So the challenges which I just
showed you on the slide before: BigQuery enables us to
overcome those challenges by offering scalability,
simplicity, security, and a low TCO for all kinds of businesses,
small or large. In terms of architecture, with which
most of you might be familiar, it's fully managed
and serverless. The storage and compute
are decoupled so that they scale independently on demand. This basically offers immense
flexibility and control for the enterprises,
and you don't need high compute resources
to keep running 24/7. And this is very different
from traditional node-based MPP data warehouses. So if I have to summarize what
BigQuery brings to the table, it gives you a modern
data warehouse. And basically it's focused
on these six key areas. With that I would like to
introduce Zaki from GOJEK to share his experience,
and GOJEK's experience, building their data warehouse
using BigQuery. [APPLAUSE] ZAKI SUKMA: Thank you, Gaurav. Hello everyone. Thank you for joining
our session today. So I am Zaki, and I work
as a data engineer at GOJEK. And here, I would like to
share GOJEK's experience in building our data
warehouse while we were also making our international
expansion effort. Has any one of you heard
about GOJEK before? You can raise your hand. Oh, wow. Cool. Thank you. OK. So GOJEK is a technology
company in Indonesia that aims to improve
people's quality of life. GOJEK was
started in 2010, actually not as a technology
company, but as a call center for a motorbike
taxi service called OJEK. The idea of having
OJEK take passengers around and move things
around is very powerful, so in 2015 we were
able to launch our first three digital products, called
Go-Ride, Go-Send, and Go-Shop, all centered around moving
people and things around. And people liked
our products a lot. So in 2016, we were able
to expand our products to many other kinds of
services, like lifestyle, digital payment, and
entertainment, and many others. So those many products have
empowered people in Indonesia, so their quality of
life has improved. And then we had
very successful businesses, so that in 2018 we began
our international expansion to countries in Southeast Asia. OK. So we have so many
products here. You can see we have Go-Ride,
Go-Car for transport services. We have Go-Send and Go-Box
for logistics services. We have lifestyle services
like Go-Massage, Go-Glam, and we have also
Go-Clean, and many others. We have also Go-Pay for payment. So by having that many products
in a single application, people like to call us
a super app company. And all of those
products are very nice, because they bring
convenience to people, while also empowering
the drivers, empowering the
service providers, so they can earn a living. OK. This is our global footprint. So we have expanded to four
countries in Southeast Asia-- Vietnam, Thailand,
Singapore, and soon we are also expanding
to other countries. So in those countries we are
operating in essentially 204 cities in Southeast Asia. And operating in those
cities, our application has been downloaded more
than 130 million times. And we also empower more
than 400,000 merchants, and more than 2 million
drivers, essentially generating more than 100 million bookings. So we move on to our data
warehouse architecture. So this is our data warehouse
architecture using GCP. We can see that we have many,
many Google products there. We use BigQuery for
our data warehouse. We use Cloud Storage
for our data lake, and then Dataproc,
Kubernetes, Dataflow, for our ELT execution. And also like
Composer, Stackdriver, for operations and monitoring. So we already know that we
have to offer so many services for so many kinds of businesses,
different businesses. So our data warehouse
has the requirement to fulfill all of those
different businesses; it has to support a range of
different kinds of businesses. So Google products have helped us
in building our data warehouse. We can fulfill all the
requirements supporting many different teams, probably
with different kinds of setups in each team. Each team can have, like, a
Postgres cluster, a [? MySQL ?] cluster, and then they can
have a user tracking system using a third-party platform. So all the Google
products have helped us in building our data warehouse. And our data warehouse is called
a multi-product data warehouse, because we have to support
that many kinds of products. OK. This is the current statistics
of our data warehouse at the end of Q1 2019. We can see here that our
data volume is increasing more than 30% each month. And then, because
more and more data has been ingested into
our data warehouse, many people can use the data
for insights, for analytics. So they have generated more
than 13,000 dashboards. The dashboards support everything
from the high level, like strategic teams doing
more strategic decision making, down to the operational level,
solving passenger and
driver cases and problems. So the most important
metric of our data warehouse is this: the growing
number of BigQuery users. So our data warehouse
uses BigQuery, and we can see that
at the end of Q1 2019 we had 2,100 active
users of BigQuery. That is three times the number
of users in mid 2018. This is the period we
did our global expansion. And then more and more
people rely on data when making decisions. And then here comes
the challenge. Actually, when people
want to develop ELT jobs, when people want to schedule
their analytics queries, people rely heavily on
our BI engineers to do that. People rely heavily on our
BI engineers to create ELT jobs. And our BI engineers'
capacity is limited, and more and more requests
come in, beyond our BI engineers' capacity. And then here comes
the other challenge. While we are undergoing
very fast growth, we also have to
do our international expansion. So we have to keep
up with our growth while we also support
the global expansion. We have to expand our data
warehouse to countries in Southeast Asia, and we
are very lucky because the GCP products that empower
us have helped us in many ways to support
our data warehouse. But yeah. Improvement can be made. Improvement can always be made. While we are operating and can
support all of those use cases, we realize that our
current setup is not the most efficient setup. We can do more and more things
to improve our efficiency. So we came up with
several questions that we asked ourselves. How can we survive
the global expansion? Several questions
that we essentially have to solve in a
short period of time. So the first question
is, how can we extend data warehouse to
countries in Southeast Asia, and set up some GCP
projects faster and easier for those extensions? And the second question is,
how can we manage user access? So when we know that the number of
our data warehouse users is growing, the thing that will
directly impact us is how to manage
all that user access. If we don't monitor
user access better, we will have difficulties in
managing all of those accesses. And then the third
question is, how can we make the ELT job development
less dependent on our BI engineers, and how can we make
the development friendlier? And then, well, usually the
one that develops the ELT job is a BI engineer. How can we make the development
friendlier to non-BI engineers, like analysts, like
product managers, or even an operations team that wants to
run some queries to solve people's problems? And then the last question
is, how can we maintain users' confidence in the data? So we don't want users to
always worry about, oh, what is the quality of data? Is this data good to use now? Or what are the
things that I need to check to ensure that
the data can be used by me? OK. So the first thing that we do,
to set up GCP projects faster, is expressing
all the configuration as infrastructure as code. So we want our data
warehouse to be extended to each country
in Southeast Asia. Some of them will have
a common configuration. Some of them will have
specific configuration. And then those configurations
can include data modeling layers, expressed as GCS
buckets and BigQuery data sets, and many other
configurations, like the Composer
instances and the Kubernetes clusters for executing the ELT jobs. So how can we set up
the projects faster? So here we are using Terraform,
a platform that helps us express our configuration as code. And then the nice thing about
expressing the configuration using infrastructure as code
is, we can set up faster. And then, by having
the setup faster, we can experiment with
different kinds of setups. Say we want to experiment
with Vietnam, for example. We can try this data modeling,
we can try this Composer instance, we can check whether this Kubernetes
cluster is enough or not. We can just express
all the configuration and run terraform plan
and terraform apply. And then the projects are there. We can experiment with them,
and then if we are done, and we are not satisfied
with the settings, we can use Terraform Destroy
to delete the projects. And we can repeat with
other configurations. And then the other nice thing
is all the core configuration is stated explicitly in one
file, or probably two files. So we can share
the configuration, and everybody can see it. So we can have a
common understanding of how
our data warehouse is configured. And then the second part
is, how can we better manage our user access to data? So we built a tool
called Data Pier. So Data Pier will assist
us in helping users onboard with the data warehouse. So previously, users
requested access to us by using email, or coming
to our desk directly. It's not scalable
to do it that way. Every time we
get an email asking us to
give someone access, we have to do some work. And then some people
just come over and say, hey, give me access
to these data sets. And then we have a
conversation, and yeah, we lose productivity. So we built this Data Pier
so people that need access only open Data Pier, and
then they can request access in the Data Pier dashboard. And then in
Data Pier, people will also see which
data sets can be accessed via which Google Groups. And then they can join
the Google Groups that have access to a data set. So we manage user access
using Google Groups, and we developed the
tool using the Google IAM API. The back end of our
tool is the Google IAM API. The third thing that
helps us very much to boost our productivity
in developing ELT jobs is having a tool
that helps people set up their own ELT jobs. So we think that SQL
is a very nice thing. We can express
transformations in SQL because SQL is simple,
straightforward, and very intuitive. So we moved from code-
centered transformation to SQL-based transformation. When people already know
SQL from doing analysis, teaching them SQL is easier
than teaching them how to code. Once they know SQL, they
can set their own ELT jobs by using the tools. So people that have
the query only need to install our
CLI tool, and then submit the
SQL query, select the destination
data set and table, and determine
the schedule for the job. And then the tool
will take care of the rest. Everything is automated. And people
will also know the result of the execution.
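To give a concrete flavor of what such a SQL-based ELT job might look like, here is a minimal sketch. The project, dataset, and table names are hypothetical stand-ins, not GOJEK's actual tooling; the point is that the whole transformation is just SQL plus a schedule.

```sql
-- A SQL-only transformation: aggregate raw bookings into a
-- reporting table. The scheduler (for example Composer, or
-- BigQuery scheduled queries) reruns it on the chosen cadence.
CREATE OR REPLACE TABLE `my-project.reporting.daily_bookings` AS
SELECT
  DATE(created_at) AS booking_date,
  service_type,
  COUNT(*) AS total_bookings
FROM
  `my-project.raw.bookings`
GROUP BY
  booking_date,
  service_type;
```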
And then the fourth thing that we built is a data quality service. So we had the problem
of convincing people that this data, this
table, is good to use. And we don't want
them to always worry about the quality of the data. We don't want them to be
surprised at some later time because their data has
completeness issues or uniqueness issues. So we built an automated
data quality service that helps people check
the quality of their data. And we implemented the
data quality service on Google products, like
Kubernetes and Composer, to schedule the execution
of the data quality profiling. So when people finish with
creating the ELT jobs, they can continue with creating
a data quality specification. Like, for these columns, what is
the threshold for completeness? What is the threshold
for uniqueness? They submit the
data quality specification into a repo, the tool
picks up the specification, compiles
it, and then runs the data quality profiling
based on the specification. And the data quality
assessment is done immediately after the
execution of the ELT jobs. So people will always know the
quality of their data, and they can get
alerts if some column is not complete enough or
not unique enough.
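A profiling job of this kind can boil down to a query like the sketch below. The table, columns, and metrics are hypothetical stand-ins; the real service compiles checks like these from the submitted specification and compares the results against the thresholds.

```sql
-- A completeness/uniqueness profile over a hypothetical
-- bookings table.
SELECT
  COUNT(*) AS row_count,
  -- completeness: share of non-NULL values in a column
  COUNTIF(customer_id IS NOT NULL) / COUNT(*) AS customer_id_completeness,
  -- uniqueness: share of distinct values in a key column
  COUNT(DISTINCT order_id) / COUNT(*) AS order_id_uniqueness
FROM
  `my-project.raw.bookings`;
```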
But still, a problem
with data is still there. So what is the problem
with data, actually? We already know that,
yeah, a data warehouse is a data warehouse. And as data
comes to us, how can we create a
single version of truth? How can we create a table,
authoritative table, that single table can
be used by many people for analytics, for reporting? So people don't have to
compare one source of data to another source of data,
because, yeah, there are too many data sources to use. So the other thing
is, how can we make sure the
single version of truth is rich in context? So for example, in GOJEK,
we have booking data with many dimensions like pick
up, drop off, the duration, and the event logs. So the other problem is, how
can we structure and model the data? So luckily, BigQuery
has some nice features that help us with
such problems.
And here we have Evan,
who will show us how we can structure the data
in our data warehouse using those features, so we can
get the best of BigQuery in solving our problems. Evan, please come. [APPLAUSE] EVAN JONES: Good afternoon. It's always interesting
doing a session at 4:00 PM, right before-- do
you guys know the act who is coming on tonight? Who's attending the
party later on tonight? Anyone know who the act is? Gwen Stefani, right? So I'm literally the
gatekeeper between you and Gwen Stefani,
which is tough. She's a lot better of
a singer than I am, so I can promise you that. But I would hazard a guess
that I can help you out with data warehousing a little
bit better than she can. So the problem of
data is still there. If you don't mind switching
back to the slides, I'll do the demo just
a little bit later. So I've been at
Google four years, and one of the problems I
really, really like tackling-- I'm a technical
curriculum developer-- is taking BigQuery
and the things that it can do that seem
like magic, like including many different types of
data, petabytes of records, millions of different events,
all inside of one place, and making it
accessible to everybody. I'm going to show you just
a little bit of a snippet of some code that you can see. I'll provide you the code. The session is being recorded. Also, we're going to be doing a
hands-on lab after the session, so if you really want to get
into the granular details, you can follow me back
to Moscone South. When I started my career
15 years ago, I was teaching SQL. One of the biggest things
that was taught to me, and I'm going to ask you
for the answer for it, is if you have data that's
in four different places, you have events. You have orders, you have
pickups, or drop offs. That's what we call a
normalized data structure. Raise your hand if you've
worked with databases before, you've heard of normalization. Raise your hand if you've worked
with BigQuery before, as well. Awesome. Keep your hands raised if
you work with arrays inside of BigQuery or JSON data. OK, cool. So I'll speed through
this a little bit. So the constant struggle
is, do we split our data across many different tables? Inside of SQL, what are
you going to do as a BI analyst to bring all those
back together? You've got to join
them all back together. It's everyone's SQL 101. You absolutely hate that stuff. Or you dump it all
in one big table, and you have data issues: one fact belongs in one place, not
repeated all over. So since you've worked with, or
you heard of arrays before, one of the really
cool data types that's natively supported
inside of BigQuery is the array. So you have object
oriented programmers. They're like, obviously, yeah. I want to do arrays
for everything. And you have SQL developers
saying, whoa, whoa. We're rows and columns. Keep your
arrays to yourself. BigQuery, as a data warehouse,
as a data platform, can support both,
and we're going to walk through
examples of what's really beneficial of storing
your data nested in a repeated format. This is GOJEK. Zaki and Gaurav provided
me with an awesome table. This is a bookings table. So Zaki mentioned you have
lots of different information about GOJEK bookings, like
motorcycle ride shares, that are all inside of one table now. All that different context
from 15 different tables is all in one. So if you notice, it
looks really strange if you've never seen arrays
inside a BigQuery before. So this is actually
only four rows of data on this screenshot. The other 1,000 or
so rows are all in BigQuery. But what's interesting
is that you see that there's a status field
and a time field for arrays that look like there are
four rows of data inside of that first one. That's an array of data. So if you're familiar with
arrays, it's the bracket notation. Essentially, BigQuery visually
breaks that out for you. But technically,
behind the scenes, it's still in those brackets. And it's going to get
really interesting when you see the demo of
how to work with that. It natively supports that,
that's one data type, and you'll see how we work
with those in just a minute. The other data type
that's interesting, if you're looking at
the top and you're like, I'm familiar in
SQL with an alias, like naming a field
something else. What on earth is events
dot something, pickup dot something, destination
dot something? It almost looks like you've got
six or four different tables all inside of one. This structure, and this is
not specific to BigQuery, this is just
generally part of SQL, natively supported in BigQuery, is called a STRUCT,
short for structure. You combine both
of them together, and you can have
really wide tables. A STRUCT is kind of
like a pre-joined table inside of BigQuery. So you both get the width of
having information on events, pick up, destination. I'm going to show you
an example data set that has over 30 STRUCTS in it. That's the Google
Analytics schema. And you get the granularity
of having one row technically represents one order,
like a ride share pickup or something like that. But within one row, you can
have extreme granularity at the array level, with as
many different things happening with one order. So your queries can
get at the high level. Count star from all orders? Sure, I got four orders. Or I can look deep
into the array values there, without having to worry
about how the data is stored in a repeated
fashion: a truly denormalized table.
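As a concrete illustration, here is a minimal sketch, with made-up values rather than GOJEK's actual schema, of a bookings-style row that combines a STRUCT with a nested array:

```sql
-- One denormalized row: a pickup STRUCT plus an events ARRAY
-- of STRUCTs. All values here are invented for illustration.
WITH bookings AS (
  SELECT
    'order-001' AS order_id,
    STRUCT(-6.2 AS latitude, 106.8 AS longitude) AS pickup,
    [STRUCT('created' AS status, TIMESTAMP '2019-04-01 08:00:00' AS event_time),
     STRUCT('driver_found' AS status, TIMESTAMP '2019-04-01 08:01:30' AS event_time),
     STRUCT('completed' AS status, TIMESTAMP '2019-04-01 08:25:00' AS event_time)] AS events
)
SELECT
  order_id,
  pickup.latitude,                    -- dot notation reads into the STRUCT
  ARRAY_LENGTH(events) AS num_events  -- granularity inside the single row
FROM bookings;
```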
All right, so a little bit of practice for you, and a question. I'm a technical
curriculum developer. I'm not going to
leave you without a quiz, and we definitely have some
cool lab promotions for you, as well. In a schema inside of BigQuery,
if you've worked with BigQuery before, you've seen this before. Yell out maybe one or a
couple of the STRUCTS. How do you know
whether it's a STRUCT? You're looking for a
data type of record. Whenever you see record
in somebody's schema, you're like, whoa, this
person is using STRUCTS. Yell one out. AUDIENCE: Events. EVAN JONES: Events is one. Give me another one. AUDIENCE: [INAUDIBLE] EVAN JONES: Pickup, cool. What else? AUDIENCE: [INAUDIBLE] EVAN JONES: If you leave
this class with nothing else, understanding this
schema and the fact that whoever's created that
table that you've inherited is using semi-structured data,
either like nested or repeated, especially when you--
raise your hand if you've worked with JSON data before. I never know if it's
JSON or JASON data. That's another argument
for a different time. Ingesting that
in, you definitely get some nested and
repeated values. You can actually nest
values 15 layers deep. The computer can handle it. Mentally, for a human,
if you gave them essentially 15 tabs
in a data set, your BI analyst
might get really mad. All right, see,
you're exactly right. These are essentially kind
of like four other tables that are pre-joined in there. Pre-joined, you can think of as
a huge performance benefit: you actually
gain a lot of performance by not having to run that join behind
the scenes, which is cool. So MySQL, Postgres, Cloud
SQL, that's transactional. That's 80%
writes, 20% reads. BigQuery, a data warehouse,
you're doing the majority, 80% reads instead of writes. It's better to have that data
stored inside of one table, not doing performance
hits on joins. All right. Now, look at the mode, and then
find me the array or arrays. You guys are fast. That's exactly right. The events have that double
level of granularity. The rows of this table, yes. They are on the order level. When you look at
Google Analytics, it's on this session level. But within each of
those different rows, you can get an
extreme granularity by nesting an array,
which is great. But when you write
SQL against an array, unless you're
experienced with it, which you're going
to see in the demo, it's really, really painful. So a quick recap
before the demo. Anytime you see record,
that immediately means a STRUCT. STRUCT is not
specific to BigQuery, but it's natively supported. Anytime you see repeated in the
mode, that means it's an array. You don't have to have a
STRUCT to have an array, but if you do, you can
nest arrays within STRUCTs. For example, the events field is
a STRUCT, as you see here. That doesn't mean it has to be. Like there are other things
in here that aren't repeated, like the pickup latitude
and pickup longitude. Unless you're changing your
motorcycle ride mid-trip, there's no need to have
that be an array. So that's not an array,
that's just a normal STRUCT. But the first one,
it's an event. There's many different
things that can happen over the course of the order. Rider picked up, dropped
off, charge reversed, that type of stuff. That's an array. All right. Let's take a look at
what this actually looks like inside of code. So we're inside of BigQuery. You may be familiar
with BigQuery already. If you're not, it's a petabyte-scale
data analytics platform. You can just throw arbitrary
amounts of SQL against it. If you're going to take a
screenshot of one slide, what I did for those people who want
to leave early and catch Gwen Stefani, I want you to
take a screenshot of this. These are all the free links
that you should know about. This session has a Dory,
which is essentially a question and answer. If we don't get to your
question at the end, I want you to post
that question on there, and then later on tonight,
or later on this week-- if you have a complicated
data question, you want to know a
cool link to a lab, I want you to post
that question there, and Gaurav, Zaki, and I
will definitely answer it. Next thing. All the code that
you learned here, I've spent the last year
of my life building out interactive labs to get you
from a blank canvas in BigQuery, to building machine
learning models, to building nested
repeated data structures, performance optimizing
your queries, and you can get them for
free, which is pretty cool. So those are the free labs. If you want to hear
more of my voice, you can actually take
the Coursera course, and you get a month free. [INAUDIBLE] I don't want
to spend too much time advertising, but if you
don't get where you want out of this session, you're
welcome to self-study with the labs, come back with
me to Moscone South after this, and we're going to work
through one of them. So inside of BigQuery,
Zaki was kind enough, Zaki and the team at GOJEK were
kind enough to give us a table. Inside of BigQuery, a cool
thing that you can do, and I'll provide this
code to all of you, is, if somebody
gives you a table, a cool thing-- if you never
seen this before, what on Earth? You can actually
select the entire query by holding down the Command
key or the Windows key, and you can actually
highlight all the data sets that are inside of a query. It's kind of cool. Inside of there, generally
we advise against-- don't let your BI analyst
do select star from a table limit 10. BigQuery already stores a lot
of that metadata in a preview, so I don't even need
to write any SQL to see what the data is in there. So recap-- you can
click on a table name, you can look at the
metadata, and this is exactly where I took this
screenshot from for the slides. And you're going, whoa. It's one row, but
it's three things. And now that you know that, it's
an array of different values for timestamps. You're like, OK cool,
that's not too bad. Now what I want to do
is I want to just get all the orders and all
the different statuses for those orders. So I'm querying
this, and you can see how this might
potentially blow up when I query it. So what's that
going to look like? Well, you saw that big
red exclamation mark. What? So this is the huge takeaway
for the next five minutes of this lecture:
anytime you see this error, you can Google it. It will probably
take you to a video, to this session, or to
the documentation. Don't be afraid. When you're literally
doing just-- well, it seems like a
simple single query-- but all that means is that this
field right here is an array. And we need to essentially
unpack that array back into rows and columns
before we can operate on it. BigQuery will definitely
take your array, ingest it, no problems. But before we do that,
let's talk about how arrays work inside of BigQuery. If you're familiar with
object oriented programming, you're going to
see arrays a lot. And if you're a data
analyst or a data engineer, you're definitely going to see them
a lot in your data warehouse. An array is an
ordered set of values. They must share a data type. So here's just an array
of strings, no big deal. We just passed in
a bunch of fruit, and stored it as an array. And we're specifying
an array here. And we got one row. We got four things in that row. BigQuery can also infer it. So even if I don't tell it that
this is a string array, it automatically
just knows that that's going to be an array.
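As a rough reconstruction of what the demo shows (the exact on-screen query may differ), an array literal in BigQuery looks like this:

```sql
-- One row, four values: an explicitly typed array literal.
SELECT ARRAY<STRING>['raspberry', 'blackberry', 'strawberry', 'cherry'] AS fruit;

-- The element type can also be inferred from the values.
SELECT ['raspberry', 'blackberry', 'strawberry', 'cherry'] AS fruit;
```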
What are some cool things that you can do with arrays? Once it's in an
array format, you might particularly want to say, hey, I
want to find the second element in the array. Who can tell me,
who's brave enough to shout it out and have it
be recorded on YouTube, that-- what is it, what are we going
to return here when I run this? I've got raspberry,
blackberry, strawberry, cherry, and I've got offset
of two in there. Look on line 41
before you shout out. What do you think? Hands up if you think cherry. Hands up if you think raspberry. What about blackberry? You all are too smart. What about strawberry? All right, yeah. You definitely are earning
your alcoholic beverages later. It is zero-indexed. Yes, exactly. So if you didn't want the
offset, you could do ORDINAL. How about now? Yell it out. What are we going to return now? AUDIENCE: [INAUDIBLE] EVAN JONES: Yeah,
that's precisely right. So you can do cool
things with arrays. You can just look at one
value, zoom in on just one particular value. But if I want to see, hey, how many
things are in that array? I can do ARRAY_LENGTH. There are a ton of
array functions. I'm just showing you the tip of
the iceberg, some of the most common ones. You have four items in your cart.
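Put together, a sketch of those array functions against the same fruit example looks like this:

```sql
WITH basket AS (
  SELECT ['raspberry', 'blackberry', 'strawberry', 'cherry'] AS fruit
)
SELECT
  fruit[OFFSET(2)]    AS zero_indexed,  -- 'strawberry' (OFFSET starts at 0)
  fruit[ORDINAL(2)]   AS one_indexed,   -- 'blackberry' (ORDINAL starts at 1)
  ARRAY_LENGTH(fruit) AS num_items      -- 4
FROM basket;
```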
Now here's the issue. We saw different levels of
granularity in the bookings. This is the same thing here. I can't run a typical WHERE clause,
select star from table where,
against this item. They're on two different
levels of granularity, right? How do I work with arrays? Well, let's go back
to the GOJEK example. Taking a look at
the preview, we're going to look at
the payment method. One insight, always start
with a good question, right? All data analysis begins
with a good question. Cash, cash, cash. Is cash the most popular
method for the orders? Well, we can write a very
simple single SQL query. We're just going to
count orders by the payment method and group by it. We're not
even touching the array values, and you get the
answer back for that. You're like, OK, cool.
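That query is essentially the following sketch; the table name is a hypothetical stand-in for the demo's bookings sample:

```sql
-- Plain SQL over top-level columns; no arrays involved.
SELECT
  payment_method,
  COUNT(*) AS num_orders
FROM
  `my-project.gojek.bookings`
GROUP BY
  payment_method
ORDER BY
  num_orders DESC;
```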
This is basic SQL, and we
can see an insight: most of the people in this
demo table pay with cash, and the second
highest is Go-Pay, the native app-based payment. Not a problem. We're not touching
any array values. All right, well I'm going
to go back to the table, and there's a bunch of
interesting status fields associated with an order-- driver canceled,
created, driver found. If you're familiar with
SQL, one of the easiest ways you can get a distinct list
of values is select distinct. So select distinct status. If someone comes along
to you with this query, you might just
look at it and go, that's absolutely
going to run fine. Boom, you're going
to get this error. Immediately you're
going to get angry. Don't worry about it. But I want you to
train your mind. As soon as you see array, I
want you to memorize this word-- UNNEST. If you ever get an array
and you need to unpack it, the fancy way that we decided
to call it is to UNNEST that array. So once we break
that array apart, it's going to look like this. Normal SQL runs fine.
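The UNNEST pattern looks roughly like this, again with a hypothetical table name standing in for the demo table:

```sql
-- SELECT DISTINCT fails directly against an array column,
-- so first UNNEST the events array back into rows. CROSS JOIN
-- with UNNEST pairs each order with each of its events.
SELECT DISTINCT
  e.status
FROM
  `my-project.gojek.bookings` AS b
CROSS JOIN
  UNNEST(b.events) AS e;
```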
You can see that there are, let's see, nine unique statuses. This doesn't give
us too much context. Let's add an additional
aggregated field. Let's just say, hey, let's do
a count of the orders by each of the distinct
statuses that were in there. Now a note. Let's do a pause here. So what you've done--
event is normally stored like this, 1, 2, 3, 4, 5. That's a normal array, nested. Unnested means we're
going to break this apart. That's UNNEST. Now one really devilish thing
you can do in your code, and I don't recommend
it, is essentially-- this in my mind, conceptually
is very different-- this nested, this unnested. Do me a huge favor. When you UNNEST your arrays,
AKA turn them from line 106 to the rest, don't give
them the same alias. It's incredibly confusing. And one of the best
ways you can help out your fellow programmers,
unnested_array, you can name it whatever you want. Just make it really
obvious which one is which, because that's the
alias you actually use when
you're pulling values from it.
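Applied to the aggregated query, the naming advice looks roughly like this sketch (hypothetical names again):

```sql
-- The unnested rows get a clearly different alias
-- (unnested_event) from the original array column (events).
SELECT
  unnested_event.status AS status,
  COUNT(*) AS num_orders
FROM
  `my-project.gojek.bookings` AS b,
  UNNEST(b.events) AS unnested_event
GROUP BY
  status
ORDER BY
  num_orders DESC;
```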
And you can see that that works just fine. Let's see the total
count of statuses. If I'm going to rename
it on the fly for a demo, I need to rename
it in all places. So that's the note
that I have there, always rename it as
something different. And this is fundamentally
what's going on. Nested, unnested. Rows. This is all in one row. This is all in normal,
regular rows and columns, and you can go crazy
with SQL on it. So you can see
the majority here. We just have the quick
insight, majority of orders have the status driver found. And then created, picked_up,
completed, customer cancelled. Link it to a
dashboard if you want. You can actually explore
directly in Data Studio. I'm not going to show
you that right now, but you can link a table
directly into Data Studio. All right. I do want to leave a little
bit of time at the end. So one of the things
you might be asking, when you get
access to this code, is: hey Evan, I don't have
a gojek.booking table because it's a sample. But what you do have is you have
the Google Analytics schema. And if you think two or three
STRUCTs inside of a table is a lot, well, this is all in
BigQuery public data. There are over 130 BigQuery public
data sets for you to play with. One of them, or if you have your
own Google Analytics account, is this Google Analytics sample. Any guesses on how many-- and
you might have already seen the answer-- how many STRUCTS are in the
Google Analytics schema? How wide is the schema? Just throw out a number. One, two, three? Seven? Let's see. What are we
going to search? Oh, it's already cheated. What word are you going
to search for in the type? You're going to
search for record. There are 32 STRUCTs inside
the Google Analytics schema, essentially 32
pieces of context. Other tables that
are jammed in there. That's because
there's a lot of data that's collected
for your visitors, for your e-commerce website. And you can see you
can even nest STRUCTS. hits.product.isimpression. hits.eventinfo, event category. So you can get a
really, really, what's called denormalized,
or gigantic, singular table for reporting insights. And that's exactly how
Google Analytics is set up. And that's the data set that
you can experiment with. You can actually, as I said
before, you can go 15 deep. So if you wanted to fix
this query, for example, page title is in an array. What you would need to do is
UNNEST hits. Give it a different name,
don't name it the same thing, don't confuse everybody. And then you'll be
able to select from it, and you get some really
cool data that's returned.
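Against the public Google Analytics sample, a fixed-up query along those lines looks like this; the date and selected fields are arbitrary choices, but the dataset path is the public one:

```sql
-- The hits field is an ARRAY of STRUCTs, so give the unnested
-- rows their own alias instead of reusing the name hits.
SELECT
  date,
  h.page.pageTitle
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` AS s,
  UNNEST(s.hits) AS h
LIMIT 10;
```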
So a lot of
you might be wondering, I'm never going to come across
nested repeated fields. I don't have any JSON data. Ha-ha, you will soon. Because if you're building
a classification model inside of BigQuery,
which now you can do with just
two lines of code, create model, model
type classification, using logistic regression. A little bit outside
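The statement looks roughly like the following sketch; the model name, label column, and training table are hypothetical stand-ins for the lab's actual ones:

```sql
-- Train a classification model with BigQuery ML. The label
-- here is whether a visitor buys on a return visit.
CREATE OR REPLACE MODEL `my-project.demo.will_buy_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['will_buy_on_return_visit']
) AS
SELECT
  will_buy_on_return_visit,
  pageviews,
  time_on_site
FROM
  `my-project.demo.visitor_training_data`;
```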
A little bit outside the scope of this talk. But what you're
going to get back, and this is what the lab
is going to be at 5 o'clock in Moscone South. Again, you're going
to get back something. BigQuery is going to say,
hey, remember that talk? Hopefully you didn't
sleep through it. It's going to give
you back, for example this problem is whether or not a
visitor who visits your website is going to buy or not
buy on a return visit. It could be whatever
problem that you want. But for a classification
model, each of those classes, yes they will buy,
no they won't buy, is going to return
back in an array. And it's going to give you
a confidence, 55% I'm sure, 44% I'm not sure, and again,
what's the main, huge-- if you're working with arrays,
as soon as you see that word array inside of BigQuery,
what's that other word it starts with a U that you
should immediately memorize? UNNEST. That's exactly what I'm going
to do in the last piece here, and then we'll open
it up for questions. So this is the example of the
final, final result, where you can do cool things like-- I want the machine
learning model to tell me its highest
predictions of who is going to come back. You have to break
that array apart, and you can do normal SQL,
and get, hey, I'm 80% sure this person is
going to come back and buy from our website.
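That final query looks roughly like this sketch, assuming the hypothetical model above. For a logistic regression model, ML.PREDICT returns a predicted_&lt;label&gt;_probs column: an array of (label, prob) STRUCTs that you UNNEST like any other array. The visitor_id and eval table are made up.

```sql
-- Unnest the per-class probabilities returned by ML.PREDICT.
-- Assumes an INT64 0/1 label named will_buy_on_return_visit.
SELECT
  visitor_id,
  probs.label,
  probs.prob
FROM
  ML.PREDICT(
    MODEL `my-project.demo.will_buy_model`,
    (SELECT * FROM `my-project.demo.visitor_eval_data`)) AS p,
  UNNEST(p.predicted_will_buy_on_return_visit_probs) AS probs
WHERE
  probs.label = 1       -- the "will buy" class
  AND probs.prob > 0.8  -- keep only high-confidence predictions
ORDER BY
  probs.prob DESC;
```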
Now if you want to know how to create models, that's the next session
at 5:00 in Moscone South. BigQuery ML is
really, really cool. Or if you want to get
additional practice on working with arrays, we
did a full day boot camp predicting NCAA March Madness
for men's and women's basketball: which teams
are going to make it to the finals. We created an entire lab on
machine learning for that one. That's this quest. We have an entire lab on
loading in and querying JSON data,
essentially this demo, but in a two hour,
really comprehensive lab. And then this lab is going
to be the one that we'll be running a little
bit later on tonight, or you can take these
at your leisure. [INSTRUMENTAL MUSIC]