[MUSIC PLAYING] DAN MCCLARY: First
of all, I'd like to say I'm kind of
blown away, Daniel, that we've got so
many people at 4:30 in the afternoon, who've come
to talk on data modeling. DANIEL MINTZ: I know. DAN MCCLARY: It's
really telling. DANIEL MINTZ: Yeah. Welcome to the
Daniel and Dan show. DAN MCCLARY: Yeah, and again
thank you all for coming. We'll try and make data modeling
as interesting as it can be. So I'm Dan McClary, I'm part of
the BigQuery team at Google. I'm joined by Daniel
Mintz from Looker. He's the Chief Data Evangelist. And like we said, we're going
to talk about data modeling. And we're going to try and make
it as exciting as possible. Now, the title of the slide
is a little bit misleading, because we're not going to
talk about data modeling just for BigQuery. We're actually going to
talk about data modeling, to some extent,
its history, what's changed, how we should
challenge our assumptions. And similarly, not just
how data modeling has maybe changed for massive
analytical systems, but also for business
intelligence and reporting. And we'll talk a little
bit about how BigQuery is different from the
systems for which some of our original assumptions
around data modeling originated. And then similarly
about how modern BI tools can take advantage
of some of these differences. So hopefully that's
interesting to you guys. And we'll try to keep
it light, and we'll try and take some questions, and
talk through some common cases as well. So, for me, as I was
thinking about how we were going to
put this together, the thing I sort of asked
myself is, where did we learn to model data? And some people didn't
learn to model data, they just have an ORM,
or something modeled data for them, or they inherited
it, and that's sometimes tough. But where did we
learn to model data? And I think for most of
us the answer is actually quite simple, Codd. Oh, no, not that Cod, this Codd. This is Edgar Codd. If you guys haven't
seen Edgar Codd before, he is owed a great debt in
the data and analytics space. So Edgar Codd is the man who
created the relational data model. And he started this with
a paper he wrote in 1970 called "A Relational Model of Data for Large Shared Data Banks." Quick history note, back in the
'70s, we didn't have databases, we had databanks. Same thing, but that's
what they were called. Now in this paper, Codd
outlines the foundation of a relational model, right? So he defines a system of
tuples, and importantly, an algebra on those tuples. And this is the thing that
leads us to languages like SQL, where we can have declarative
languages that are very easily able to produce
execution plans that we can use to analyze data. And this gets us a
long, long way, right? Like this gets us from the very
early databases of the '70s, to all the things we can do with
big data processing and systems like BigQuery today. Now, in that, Codd had two
really primary concerns. Data independence and
data inconsistency. So in his mind, these
were the things that had to be addressed, right? And part of the value
of the relational model is that you get data
independence, right? Your data looks a little
bit like relationships in the real world, it's
easy to add fields, it's easy to do stuff with it. But it also helps guard
against data inconsistency. And data inconsistency
can lead to wrong results, and miscomputed bills,
and all kinds of pain. And if you guys are
familiar with Codd, you may have heard of these
things in your university class, normal forms, right? And this starts with
the first normal form, which is introduced
in the paper in 1970. And with the help of others
evolved all the way up through the fifth normal form. Now, many of us who work
in the database industry may remember taking
schemas and saying I've got to get to 3NF, 4NF,
or BCNF, which is Boyce Codd Normal Form, or even to fifth NF
in the most extreme situations. Now there are lots of good
reasons for normalization, right? So increasing data
independence is actually great. Because it makes it easy for
us to extend our models of what the world looks like. Our data can well represent
what happens in the world. And again, reducing
data inconsistency is great because anomalies
are confusing and problematic. So what do we mean when
we say easy to extend? Add new types, mirror
the real world, I've said that a few times now. Avoiding anomalies can
be different things. One of the things that Codd was
really, really concerned about was avoiding duplicates
and conflicting values. Conflicting values in particular
can remain a challenge, but the other thing that
Codd was dealing with were operational constraints. I mean, Codd was
a man of his time, he was a man at the
cutting edge of technology, and he was making sure that
the system he was describing fit well with the operational
constraints of the day. And this was the
constraint of the day. You may have never seen
a picture of this before. This is the IBM System/360. This is the premier database
platform from 1964 to 1978. And it ran a hierarchical
database called IMS. IMS well, IMS is now 51
years old, and I think still runs in production
in many places. But so Codd's primary argument
was hierarchical databases, like IMF-- IMS, sorry-- can have these
problems of consistency, have these problems
of data duplication, and can make it really difficult to model lots of general-purpose situations. And so his relational
model made it much simpler for us to capture and
represent information in a queriable fashion
that could answer a broad set of constraints. Things though have changed. Most of us don't come to
work every day in sort of a burnt umber kind
of room, with spinning tapes, and looking
very fashionable, sometimes I wish I did. But things have changed. And because things have
changed, it's worthwhile when we think about modeling
data, to say, yes, the things that Codd was after
were important, and they matter to
how we represent our data and how we analyze it. But some things have
probably changed with respect to our operational
constraints, and as such we should think about maybe we
could do things differently. So again, I don't want to make anyone think that
normalization is not important. Normalized data models
are really important, they still matter. Data is moving fast, it's
changing all the time, and so this notion of
data independence is huge. It still matters quite a lot. And nobody wants wrong
results, wrong results are just unacceptable. Now access and simplicity
of queries though, I think, matter far more
than they ever did. And part of the reason for this
is that more and more people are analyzing data. More and more people are
analyzing lots of data. And so making it
simple for other people to write queries against
data, or to build reports against data, is really,
really important as well. So the question I
keep coming back to, is how do we challenge these
sort of normal assumptions in modern environments? What are the things
that are different? One of the things I think
is really, hugely different in modern distributed
data processing systems, is that duplicate records don't
cost money like they used to. If we go back to our
picture of the IBM 360, that data was stored on tapes,
or on giant magnetic hard drives. The cost of a megabyte of
storage in the '60s and '70s was tremendous. And so the graph we have here
is basically average cost of a gigabyte in a year,
starting from 1980. It didn't go all the
way back to the '70s. And what we see in this is that it's basically dropped to zero. So when we think
about normalization, when we think about
our data modeling, one of the real specters
of denormalization kind of goes away, because
the price just drops out. Having a duplicate
record is fine. Now if a duplicate record causes
inconsistency, that's an issue. So this allows us to kind of
relax some of the constraints that we might have heard
of when we were going through our university courses. So the old Codd's
rule, you know, "the key, and nothing but the
key, so help me Codd," right? We can reduce that a little bit. The notion of these things
like non-prime attributes, we can have more stuff in
tables than we used to, because it doesn't cost us
a fortune to store data, or to store small
amounts of data. We also have the ability now
to think about complex types stored inside rows in
a way that we couldn't. And this is a thing that
BigQuery is actually quite good at, and somewhat
natively designed for. The other thing is that
distributed systems are the new normal. When Codd was writing
his paper in 1970, networks weren't a thing. They weren't a thing that
query processors had access to. Nowadays, when we look at
big data systems like Hadoop, Spark, things like that,
and systems like BigQuery, distributed systems
are the way we process massive amounts of data. Now, that means that there
are new costs to consider. Now, just a quick
show of hands, who knows what a shuffle operation is? All right, so a few people know
what a shuffle operation is, a few people don't. Just to make sure
people understand, if I have a multistage
distributed process and I have to pass data from
one set of workers to another, that effectively amounts
to shuffling data around. Now this is costly because
I have to take those bytes, I've got to serialize, I've
got to put them on a protocol stack, I've got to send them down to the NIC, across the wire, and bring it all the
way back up again. And that's a cost we
didn't used to have in single monolithic systems. So this means when we do things like joining data, there's a cost, because there's transport that's happening. And so, while joins are incredibly useful and they're a key part of the way anyone analyzes data, we do have to ask ourselves, is a join really improving my workload, or is this join just something I've incurred because I've blindly normalized data according to rules from the past?
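To make that concrete, here is a rough sketch of the kind of join in question--the table and column names are invented for illustration. When both inputs are large, a distributed engine typically has to shuffle rows from each table on the join key so that matching keys land on the same workers (a small table may instead be broadcast, which avoids the big shuffle):

    -- Both inputs are large, so rows are repartitioned (shuffled) on customer_id before matching.
    SELECT
      o.order_id,
      c.customer_name,
      o.order_total
    FROM warehouse.orders AS o
    JOIN warehouse.customers AS c
      ON o.customer_id = c.customer_id;

That transport is the cost in question, and it is worth paying only when the join actually models something the analysis needs.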
So the other thing is that columnar access has become a big deal for
analytical processing. So back in the '70s
and then onward, we talk about OLTP systems. These are usually row
oriented, and this means that when we
access data, we're accessing a row at a time. And this is one of
the reasons that Codd had this notion of like,
well, look at the key, nothing but the key, and
facts related to that key. Because when you go
and fetch that data, you're fetching the whole row. Columnar storage, such
as what BigQuery uses, allows us to say, I'm
interested in column one, just go get column one. And we don't pay a cost
for accessing null values, we don't pay a cost for
storing null values, and we don't pay a cost for
accessing the other columns, because we don't have to. We only access
the column itself. And this kind of raises the question in my mind of, well, we can relax Codd's rules around keys a little bit, because in a columnar system, many fields can be key-like.
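As a minimal sketch of what that means in practice (the table and column names are hypothetical), the first query below only has to read the two columns it names, while the second touches every column in the table:

    -- Reads only order_date and order_total; the other columns are never scanned.
    SELECT order_date, SUM(order_total) AS revenue
    FROM warehouse.orders
    GROUP BY order_date;

    -- Reads every column, whether or not the analysis needs it.
    SELECT *
    FROM warehouse.orders;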
So I think this, for me, sets up some priorities for data models and modern data warehouses. When I think about systems like
BigQuery and other cloud data warehouses, I think
we need to re-evaluate what our priorities are. And I think one of
the biggest ones is make your queries
easy to write. Because you want more and
more people to analyze data, you want to build
more systems, so make your queries easy to write. All right? If you can denormalize in
a way that helps you get to an easy query, go for it. Make that data easy to join. So don't have things that
require you joining data six ways, maybe just the
most important joins, the most obvious
joins, or things that users can get a
lot of benefit out of. Now, if you can aid
performance by utilizing things like complex types, using
things like nesting, which I think Daniel is
going to talk about a bit, this is great. We want to get performance
out of our queries, but the first order
of business is to make sure people
can use the data. All right, and
then make data easy to update if it's
required, right? So don't go overboard
on the performance or on the normalization
if it means that your updates
process is going to be tremendously complicated. And finally I think parsimony
is a huge piece of this. And when I think about the
spirit of what Codd was writing about in the '70s, I
think that's actually a huge piece of this. The relational model, in some
sense, is all about parsimony. It's about making the data
model something you can look at and say like, yes, that's
like the real world. And we have the opportunity,
with modern systems, to reflect that even
more closely I think. So, Daniel, do you want to
tell us a little bit about what was going on in business
intelligence as it evolved, and what's different now? DANIEL MINTZ: Yeah,
so I'm going to come at this from an entirely different angle. So Dan started at the bottom
of the stack with Codd, and the databases, the IBM
System/360, and the tapes that drove it. And I'm going to come
from the top of the stack, from people who are
working in business and trying to understand
what the data means. What it means to
their business, how do they make better
decisions with that data. So, when we talk about
business intelligence, this is a thing that's
been around at least since the early '80s. And the systems that people
used for business intelligence back in the early '80s,
were monolithic systems. And they were
monolithic systems where you bought the BI System that
sat on top of the database, or the data warehouse, and
you bought all those things together. You had systems like Cognos,
and Business Objects, MicroStrategy, Informatica. Lots of those are actually
still around today. Lots of Fortune 500 companies,
if you go into them, you're going to see these
systems still in place. And they sat on top of the
very first columnar stores like Teradata and Oracle. And the reason that you had
to buy a monolithic stack was pretty simple. It was because that was
the only thing there was. Systems, because they
were slow and because they were expensive, the only way
that you could query data to get information
about your business, was with these
monolithic stacks. You bought all in one. Why? Well, you had slow
expensive hardware. That was the defining
constraint that decided what you were going
to buy and how it was built. IT, which is a word
that has come to mean I think a different thing among
the newer breed of companies. It's like IT is the people
you go to fix your laptop now, but IT used to be the
department that owned all technology in the company. And when you were talking
about these monolithic stacks, IT was a department
that owned that stack. And they owned that stack, they
probably owned the server farm where the semi backed up
with your new Hyperion data warehouse, and trucked it into
the data warehouse that was actually a physical warehouse. So that was another
big constraint was that IT had to own
it, because they were the only people with the
technological know-how to make it work. And the other
constraint was we didn't have data flowing in from
every device in our pocket, and every server in the
cloud, and just there's so much data now. But that was not true back
then, because data entry was actual data entry. There were data entry clerks
who sat and entered data. If you were managing hundreds
or thousands of stores, there wasn't a
point of sale system that automatically sent data
about what had been bought back to the mothership. Someone sat down every day,
or every week, or every month, and typed that in. So you just didn't
have that much data that you were dealing with. And so these systems grew
up, these first wave systems grew up, and they did
have some advantages. First of all, because
IT owned them, the answers that you got
out of them were reliable. There was a single
department, a single person, in a lot of cases,
who owned the system, and they said, yes, I stamp
this with my stamp of approval and I said this data is right. And so if you were trying
to produce an SEC report, you knew that the
data in it was right because
somebody who really knew the data had blessed it. They also made it really easy
to make pixel perfect stuff because you were still
printing these reports out in a lot of cases. You didn't have touch screens
that you could interact with them on, they had
to be pixel perfect because you were printing them. And for the 1980s and 1990s,
they were pretty fast. You could actually get access to
the data in a reasonable amount of time, you know, days,
not weeks or months, and you could get
your reports that you needed to run your business. But there were
some disadvantages, not surprisingly. First of all, they were
terribly inflexible. You'd put in a request for a
new report or a new dashboard, and first of all, you
only had the authority to put in that request if you
were probably a C-level exec, and you only could
do that if you were in a giant company
that could actually afford one of these systems. And so you'd put
in that request, and then you'd wait
two or three months to get your new dashboard. They were locked down. People couldn't access the data,
because, my god, if you access the data, it would overload
the data warehouse, and that would be
crazy, because we don't want to have to spend
another million dollars to buy another appliance
to run more queries. And they were pretty
low resolution. Luckily, there wasn't
that much data, so low data resolution wasn't
that big a deal, but they were low resolution nonetheless. And so as the scales
of data grew over time, people were not
seeing the raw data, but they were seeing
roll ups and summaries. Because the only way
to make these systems continue to work on
larger sets of data, was to reduce the size of
that data by summarizing it, and accessing those
in the dashboards. So not surprisingly,
people started to get fed up with
these first wave tools because they weren't living
up to their evolving needs. And so in the late
'90s, early 2000s, you start to see the
second wave of BI tools, and these are all
about self-service. You don't have to
wait for IT anymore, you can self-serve to the data. And you get this
proliferation, this explosion of different tools. Visualization tools, and
you have a governance tool, over here, and
your ETL tools over here, and you still get your
data engine up here, and you still have your
legacy system, which still runs your SEC reports because
those have to be perfect, and when we close quarter,
we use that, right? And so you just have all
of these different tools. And, you know, again, why? Well first of all computers
have gotten a lot faster. You now have PCs
and even laptops that can hold a
reasonable amount of data. Not nearly as much data as
what lives in your warehouse, but you could maybe
carve off a chunk of data that you really cared about
because you were in finance, and you could carve off
some of that finance data, you'd get a hold of it,
you wouldn't give it back, and you'd load it into
a server under your desk or onto your laptop so you
could do some analysis. And your computer
was fast enough to actually slice and dice that
data in some reasonable ways so you could get some
meaning out of it. People went to the second
wave because they were fed up with the IT bottleneck. They were sick of waiting
three months for a change to their dashboard. I want to change the title on
dashboard, great, get in line, we'll be with you
in three months. That didn't fly anymore. And, because the amount of
data that people were trying to work with was
growing, and they were sick of having to do
roll ups, they said well, but I only care about
this little slice of data, and I can fit that
on my computer, so I want high resolution
access to just that data. And so they could do that
with these new tools. So in terms of advantages, this
gave people a lot more agility than they'd ever had before,
and that was a big win. They could actually
self-serve to the data, and that gave them much
faster time to insight. It meant that they weren't
waiting months and months to get any explanation,
any response. And it also gave them that
higher data resolution. But it had some disadvantages. You know, I think the
tools of this era, partially by
necessity, necessitated by the technology
of the day, they made some choices
about things that they would hold onto and
keep, and things that they would throw away. And one of the things
that they threw away was a shared model, a
shared understanding at the business level, about
what the data actually meant. That thing used to live,
in the first wave tools, in the monolithic stack and
in the way that the data was rolled up. You get these second
wave tools, people are slicing off a
little bit of the data, analyzing it in a workbook,
now all of a sudden the meaning of the
data, that shared understanding of what the data
means, that gets left behind. You're also really
dependent on data extracts, and data extracts
are really troubling. Because I extracted
my data last month, and you extracted
your data last week, and now they're not in
sync, and we don't know why. And so rather than
spending the meeting discussing what our strategy
going forward should be, we spend the whole
meeting trying to debug why we're getting
different results out of our data. And then the third
disadvantage was just that you had this tool explosion. You had so many tools
that people just had trouble keeping
track of, how am I supposed to get this
data from here to there? And so that created
a lot of problems. So that's from the BI
perspective what was happening. At the same time, that Codd
and the data warehouses-- or the databases-- that turned
into columnar data warehouses were growing underneath. So let's jump forward to today. DAN MCCLARY: So we can
actually talk about the fact that maybe things
are different now. Maybe we can do some new things. I think it's interesting
to point out, and worthwhile, that the
consumerization of data, that business intelligence
and reporting enabled, actually really challenges the
fact that we need good models. We need a good data
model, largely relational. But we may have to
think about ways to make it easier
for consumers to get at it in ways that allow them
to have a shared understanding. Real shared understanding
both at the warehousing level, and at the tool level. So it's a talk about
BigQuery, and other things, and so would be-- I would be remiss if I
didn't talk about BigQuery. In part, because we think of
it as a really good example for a modern data warehouse
that's cloud native, that requires us to challenge
our modeling assumptions. Now for those of you who
don't know about BigQuery, you don't know what it is,
BigQuery is Google's enterprise data warehouse. It's the warehouse
we built starting with the Dremel
project which we wrote a paper about 10 years ago. It's central to the way
Google analyzes data. And in BigQuery, we make it
available to GCP customers. And the thing about it
that's really different is, unlike systems
in which maybe I was designing for a fashionable
1970s IMS system, or even the warehousing appliances
of the '80s and '90s, Dremel and BigQuery
were really designed to be query engines that
operate at a data center scale. And this means that
we can take advantage of things that are
somewhat unique to GCP and unique to Google. We separate, specifically,
storage and computing. Now in many big data systems,
this would seem like anathema. This would be a real
problem, because I want to keep close to the data. However, because Google's
networks are so fast-- and I encourage everybody
to take a look out on the show floor at the Jupiter
switch that we've brought, because this is really
some of the stuff that makes GCP super powerful. We have a petabit of bisection bandwidth in our network. Which means all
of a sudden, I can have vast amounts of compute,
that are separate from storage, but access that data as if
it were effectively local. Now this means that our query
engine can do a lot of stuff. At scale, that's really,
really different. It also means I
can start to play some really interesting games
with the way I model my data. Now, the compute engine itself is Dremel. It is our Dremel architecture. And this is
effectively a dynamic serving tree of many, many
shards of compute workers that can apply thousands
of cores to problems. This is a dynamically
organized tree, and we are able to use the
resources per the query. Now the nice thing is that
there's an economic advantage, inasmuch as you only pay
for the CPU cycles you're using, but at the same
time it also means we can apply a tremendous
amount of compute to some of these problems. It allows us to have multistage
queries that go really fast. Joining's OK, joining's
safe, but at the same time it's a distributed system. And so we need to
make sure that joins are things that we want to do,
that they really make sense in the context of the problem. And the other thing,
I talked about shuffle and the cost of shuffle. Well, shuffle in any distributed system costs. However, BigQuery does a
unique thing when it shuffles. It shuffles to remote memory. We wrote a really interesting
blog about this earlier in the year. And by having a
remote shuffle layer, this allows us to handle
shuffle operations much faster, and at a higher scale
than most systems. So this means that
we can join data even in situations in which the
normalization might help us. But it does mean that because
there is a shuffle stage, things like [INAUDIBLE]
can cause problems. So this is just an example of how something like a SELECT with a WHERE clause and a GROUP BY would actually move in and out of remote shuffle.
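As a hedged sketch of that kind of query (the names are invented): the WHERE filter and a partial aggregation run on the workers that scan the data, the intermediate rows are shuffled--in BigQuery's case, to remote memory--so that all rows for a given group key land together, and the final GROUP BY happens after the shuffle:

    -- Stage 1: scan and filter. Shuffle on customer_id. Stage 2: final aggregation.
    SELECT customer_id, COUNT(*) AS order_count
    FROM warehouse.orders
    WHERE order_date >= DATE '2017-01-01'
    GROUP BY customer_id;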
And the pipeline that you get from a remote memory shuffle is such that, again, you can do really interesting joins. But it also means that there is a tax when you do a join. Now, our managed storage, as I mentioned earlier, is columnar. And we take care of things
like durability by default. So when you put
data into BigQuery, not only is it translated
into our proprietary columnar storage format, Capacitor, which there's another great blog on, we store columns
in our own files, we compress these, and
encrypt them on disks. The data is always encrypted. And then we replicate
it across data centers. Your data is available
in multiple zones and resilient to failure. Now this columnar storage allows us to do some really interesting stuff, because not only is it columnar, and allows us to have many key-like attributes, it also allows us to take
advantage of the fact that we can filter
on these things when they're still compressed. So the notion is, I don't
have to go and pay the CPU cost to decompress
things, just to figure out how the WHERE clause applies,
I can apply it to the data before I've decompressed it. Now, all of this means,
I think, a few things. Again, we can think about
having many key attributes, we can think about having
complex fields and types-- because a columnar storage engine, a columnar storage system that supports complex types, allows us to model data
in a different fashion. But it also allows us
to say, you know what, normalization matters to me. I have a relational
model that represents my business, that represents my
world, and I need to run joins. That's OK. We can bring that, and I can
move whole hog into BigQuery, and have exactly the same
kind of querying experience I would expect. So Daniel, what was happening
and what sort of new tools have emerged in this space to
deal with the cloud experience, if you will? DANIEL MINTZ: Yeah, so we talked
about the first two waves. So let's start from the
sort of middle layer. So we've talked about the bottom
layer, the actual hardware that's driving these, the top
layer of the business side, but in the middle, you've got
this way of accessing the data in its sort of raw form. And so, I'm sure this won't come as news to most of you, but programming language
development looked a little bit something like this. You start with
machine code, people realized that writing ones
and zeros is kind of tedious, so they're like all
right, let's do assembly. You get Fortran,
and COBOL, BASIC, then C comes along
in the early '70s, this is a huge revolution. '95, it turns out, was a great
year for programming languages. PHP, JavaScript, Ruby, and
Java all appear in 1995. But you get this wide range. And really, what's
happening here is that the level of abstraction
from the actual processor that is doing the work,
is increasing. Go is this amazing
language that very pithily lets you run incredibly
fast programs on tons of distributed processors. Machine code, you can't
do that because you wouldn't want to write
that in ones and zeros. On the data side, this is
kind of what it looked like. So we start with data
written to files. This is before you have
random access memory at all, you roll your own b-tree. Codd comes along. Then in the late '70s, we
get SQL, which is really an operationalization. Wow, that's hard. Operationalization. I'm curious what the-- yeah, the robot got it right. The robot that is transcribing
got it right, amazingly. So you get SQL shows up
in the late '70s, that's an operationalization of
Codd's rules, Codd's laws. And then you get Oracle
and IBM, which turn them into commercial software. In the late '80s you get T-SQL, which is still SQL. And in the '90s, you get MySQL and
PostgreSQL, which are still SQL fundamentally. And really, there hasn't
been very much development on the data side. So at the same time that
programming languages have been evolving
enormously, data language has been kind of stuck. And so in terms of
what we actually want, we wanted to define
relationships and definitions once. We don't want to have
to do that repeatedly. Computers are good at
doing tasks repeatedly, humans like to do it
once and be done with it. We want to retain
all of the agility that SQL gives us,
which is amazing. We want an easy way of
translating what we actually mean, the business question,
the question that we care about, into a data query. And sending that off to
the database or the data warehouse so we can get that
data that answers the business question back. We want to stay performant, which is becoming less and less of a problem. But still, when you're querying petabytes of data, it's still an issue. And we don't really want
to worry about syntax. Because, you know, the
proliferation of SQL dialects means that you
have to keep track of little bits of, oh, this
is how this one handles date reformatting, and
humans shouldn't have to worry about that, right? Computer programmers
figured that out. They said, we don't want
to worry about that stuff. I don't need to worry
about memory management except in 1% of the cases, so
I won't use C most of the time, I'll use something that takes
care of memory management and garbage collection for me. Analysts are still
stuck writing SQL where they're worrying about syntax. They're doing a lot
of these things. And so, lots of analysts I
talked to say, but I love SQL. And I love SQL, too. Dan loves SQL, too. Does anybody else love SQL? Yeah, SQL's amazing. I am getting a lot of like, meh. Yeah, I have a love hate
relationship with SQL, too. That basically describes it. So SQL's amazing. It's proven it's powerful, it's
versatile, it's everywhere. SQL and C fundamentally
run the world, right? But, SQL's really
easy to screw up. In my first job, where
I was writing SQL, in the Orders table there
was the status field. And you had to say Orders.Status equals 'Completed'. And if you didn't, you would
get all the failed credit card transactions returned
as part of your query. And you go, oh my god
we had an amazing day, but you didn't, you just had
a lot of failed credit card transactions. And so keeping track of, oh, I
have to put the group by there, and I have to-- what
dialect am I in, how does it handle date
conversions, that's stuff that programmers
have gotten away from and analysts are
still stuck with. And so what Looker has done-- and I should say I was a
Looker customer for far longer than I have worked for Looker. And so I've stumbled on Looker
early on in Looker's history, and they made this promise,
and I said, oh my god, I want this thing,
but it doesn't-- it kind of exists, maybe it
won't really live up to it-- and it totally has
lived up to it, and that's why I sort of
jumped over the fence. But Looker started with this
kernel of this idea of LookML, which is SQL, evolved. It says, let's stop worrying
about all that stuff. And so it makes this data
language reusable, which is not a thing that SQL is. I know when I go look at
SQL I wrote two weeks ago, I give up immediately. I'm like, oh, I'll write
this from scratch again, because I had no idea
what I was doing. It makes data language
collaborative. Because if I try
reading something I wrote two weeks ago,
something Dan wrote is probably even worse and
harder to understand for me, because he has his own
style of writing it. No offense to the way
that Dan writes SQL. I'm sure it's great. It makes data language flexible. It makes it easy to organize
so you don't have untitled_43.sql, untitled_44.sql. These little recipes
on your desktop. I definitely have those. And it gives you
version control, which is another thing that
programmers a long time ago were like, hey,
maybe we should keep track of what we did in the
past, so if we need to go back, we can. Data people, not so much. And the way that
it does this is it says, well, really any query can
be decomposed into four things. One is a set of
relationships between tables. The second is the fields
that you actually want. The third is what filters you
want to apply to the data. And the fourth is how
you want to sort it. And if you have
those four things, you can really
compose any query. A computer could compile those four things down into SQL and send it off to a database.
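A small illustration of that decomposition in plain SQL--the tables and fields here are hypothetical, and each of the four pieces maps onto a clause:

    SELECT
      users.region,                              -- the fields you actually want
      COUNT(*) AS order_count
    FROM orders
    JOIN users ON orders.user_id = users.id      -- the relationships between tables
    WHERE orders.status = 'Completed'            -- the filters you want to apply
    GROUP BY users.region
    ORDER BY order_count DESC;                   -- how you want to sort it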
And so that's what LookML is, and that's what allows you to build a
sort of third wave of BI. Which is a data
platform that says, you know what,
we'll have the data platform take care of
those annoying things that we've always worried about,
and allow people, whether they speak SQL or not, to
come in and write queries without looking at the
SQL, without needing to look at the SQL, write
queries against the data. And that allows you to
access the data directly. And it's only possible because
of this data infrastructure revolution that's
happened underneath. I think big data
is a lot of hype. The big data revolution has
been a lot of hype, and not that much delivery quite yet. But one thing it
absolutely did deliver on, is enormously fast databases. You go back even
five or 10 years, when Hadoop was going to be
the answer, how far we've come from then of
like, well, maybe we should shard MySQL a lot. Ooh, that's really ugly. Let's use Hadoop. Oh, boy-- you know? Hadoop was kind of going
back to the old days where you'd like type in the
program at night at 11:00 PM and then you come
back the next morning, and aw, man I had a typo. And you'd have to rerun
it the next night. It was very much
going backwards. So these incredibly
fast databases, that mostly live
in the cloud, have made this revolution possible. Because all of a
sudden, you didn't have to do all the preparation. You could actually
set a data platform on top of these databases
and just access the data, leveraging all of their power. And so, in this third
wave, well, first of all, it was only possible because
the databases were that fast. Without these databases,
it doesn't work. One of the other reasons
that people wanted this was because they were just
sick of having a million tools. If you could centralize
all this work in one tool, that would make
life a lot easier. And the third, as
Dan talked about, I don't know if you guys noticed,
but the x-axes of those two price drop graphs, one started
in 1980, the other only started in 2000. And that's because this
idea of shuffling data across a network, simply
didn't exist in the old days. But now that it does,
you can access the data in these incredibly
fast databases which live in the cloud, and then
bring it to your machine. And so in terms of
advantages, well the third wave means you're
getting reliable answers. And not just reliable
answers in the sense that the data is consistent, but
reliable answers in the sense that someone who knows what
the data is supposed to mean, went in, and said this is
the definition of our KPI. This is the definition of
average contract value, and it's a plus b, divided by c. And so when you want
to access that, you don't have to remember oh,
wait, was it a plus b, minus c, or over-- that's in the system. The system worries
about that for you. It gives you agility
because you're not constrained by the data
that lives on your machine, you're accessing the data
right where it lives. And so you can
access all the data. It lets you pick the
tools that are right. So you choose what is the right
amazing cloud database for you? What is the right
ETL tool for you? What is the right
visualization tool for you? You're not locked into
these monolithic systems. And it gives you
full resolution. You're not rolling up the
data and summarizing it before you access
it, you can always drill down right
to the lowest row, because you send a billion
row or 100 billion row query to BigQuery
and BigQuery says, all right, I'll
be right with you. There are some disadvantages. First, is that it's a
major shift in thinking, especially for the
people who worked in IT when it was called IT. It's like a big deal. They have to let go. And that's hard for
a lot of businesses, because these business
processes are deeply embedded in the organizations. And so for
organizations to think of how they can use
data in these new ways is actually pretty hard. You do need a powerful
data warehouse. If you've got an IBM
System/360, the third wave of BI is not for you. Don't do that. And this is something that
we've only begun to see, but all of a sudden you have
insight coming from everywhere. Because the people
who are closest to what the data actually
means, the business people who understand what it means,
are able to explore freely, and so they're finding stuff
that they couldn't find before. And so going from information
scarcity to insight abundance, is actually hard in some cases. It's the same thing of going
from a bunch of word docs on your machine, now you're
collaborating on stuff with a bunch of people,
you need Google Drive because you need a way to
keep all that stuff organized. Otherwise you're stuck. Oh, no, you're working
on that version. I already made changes
to that version, I emailed-- did you
not see the email? You don't want to
go down that road. So that is a new way of
thinking, but this is possible. So now that we've bored you
all with the very high level abstract, you've lived through
the data model section, now we're going
to get practical. And talk about what it means to
model your data in this brave new world. DAN MCCLARY: Wait, wait, I
had one more academic point to make. DANIEL MINTZ: Sorry, I lied. DAN MCCLARY: I know, I know,
I just can't give it up. Which is though, when we think
about the spirit of Codd's original work, and the notion
that a relational algebra is a powerful tool to apply to
modeling real world situations, the business user needs it,
too, but for the business user, semantics and context
matter as well. It's not simply the set theory or the set theoretic
operations I want to perform, it's that these have
specific in-context meaning. And so I think a
lot of what we see and the interplay between
classic data modeling and data modeling as it evolves
for the BI user, is that context is brought in
line with the relational model. DANIEL MINTZ: And doing that I
think is enormously valuable, and people sometimes miss
that, because the people who actually in the store,
running the POS system, they're the ones who could
look at the data and go, something's wrong here. A data analyst
back at HQ is going to miss that, because they don't
have that tactile feel for what the data should be, what
it should look like. They're going to miss the
insights if the data has to be shipped off
to HQ, and then an analyst has to groom it
and put it in the system, if they can actually
access the data themselves, they're the ones that are going
to come up with those insights. DAN MCCLARY: Outliers look
different the further away you are. DANIEL MINTZ: That's
exactly right, yeah. DAN MCCLARY: OK,
so now we can be a little
bit more practical. And unfortunately, I think
the one thing to sort of tell everybody, is that there's no
one hard and fast rule for, thou shalt model
your data this way. All we can do is talk
through some of the things that we see, some of the
questions that Daniel and I have been asked, and what we think
are reasonable rules of thumb to apply. All right, so one
of the things I get asked a lot is, I'm really
worried about denormalizing my data. I've heard systems like BigQuery
really like denormalization. I don't know, I
have this dimension, should I put it in the
multiple fact tables? The answer is, of course, maybe. Everybody's least
favorite answer, maybe. Now the questions you
have to ask yourself are actually about
the dimension. Is this a static dimension, is
it way smaller than the fact? Is the dimension just a
list of the 50 US states? You could probably denormalize
that pretty safely. Now, if the dimension
is something that changes, or is used
independent of the facts, if I have a dimension table that's quite large and used by many, many different fact tables, the challenge of denormalizing it, whether as a nested structure, or into every table as a flat set of columns, is quite large. So it might be better
left normalized, because that actually
models the world better. It models the world. And large scalable joins are
entirely possible with systems like BigQuery. So again, I think this notion
that I have a dimension table and I really, really want to know whether or not I should put it with the fact table, should I denormalize it, well how is it used? How is it updated? How big is it? These are the sort of fundamental questions we have to ask.
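As a hedged example of that trade-off (all of the names below are invented): a tiny, static dimension can simply be carried on the fact rows, while a large or frequently updated dimension is usually better left as its own table and joined at query time:

    -- Small, static dimension: safe to denormalize onto every fact row.
    SELECT order_id, state_name
    FROM warehouse.orders_denormalized;

    -- Large, independently updated dimension: keep it separate and join.
    SELECT o.order_id, c.customer_segment
    FROM warehouse.orders AS o
    JOIN warehouse.customers AS c
      ON o.customer_id = c.customer_id;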
So another one I get asked a lot, and I think comes up a lot
when people come to BigQuery, they look at our type system
and they say, OK, wait a second, you've got arrays and structs, you have these nested
and repeated fields. What do I do with those again? Is everything now
an array of structs? What am I supposed to do there? Now, the reality
of it is, I think Daniel, you've got
some thoughts on this that you want to
share in a bit, is that nested and
repeated fields are hugely powerful
ways of preserving the logic, or the
logical relational view, while getting the physical
benefits of denormalization. So if I have a
dimension table that represents a one to
many relationship, I can take that dimension
table that is the many, and embed it, as nested
and repeated fields, into that table and maintain
the logical view of, oh, yes, this order has a
list of items in it. OK, that's reasonable. But there is a question. So that repeated data needs
to be updated frequently, that could be a real problem. So to say the shopping
basket had these items in it, that's fine. Because that happened once. To say that I'm going
to have a nesting of all of the transactions
going on forever, becomes a real challenge. Because now I'm saying,
oh, I have an array that I'm going to modify
on every row, every day, with all of the new data. That's probably not the
right thing to go and nest. DANIEL MINTZ: I mean or
even to go even simpler than that, items are something
that belong to the order, but orders aren't
really something that belonged to the user. The user can exist
separate from that. And so the idea that, oh,
I need to go into the users table, which then
contains the orders, which contains the items, every
time that this user makes a new order-- yeah. DAN MCCLARY: Well it creates
a real problem. Then also in terms of
trying to reflect the world. If we want the people who were
writing reports and people were analyzing data to be
able to write queries or build reports that model
the world, we have to make sure that we're not
sort of overly denormalizing for the sake of
performance, and as such, breaking our view of reality. DANIEL MINTZ: Yeah. DAN MCCLARY: And
also if you have to, as an analyst, write SQL, you would like it not to be terribly complicated. Why do we have to do an unnest
on the orders in the user to get to the thing? That's not what we want. So when do we think
about nesting? Daniel, you've got more
thoughts you're going to share, as I said, but
again, sometimes it makes sense if we're
trying to preserve the logical relationship,
because it's an important relationship. But we shouldn't just
do it because we're blindly seeking performance or
because I read the BigQuery doc and it's like, oh,
well, they have nested fields, I better figure out how to use nested fields. So one of the
things I see a lot, particularly with enterprises
who are coming to BigQuery from older systems,
is that you may have a query that
has many, many stages or you're used to running
it on your Teradata machine, and it runs for two days,
and then it shows up. I feel like this is
one of those situations in which you really do
have to kind of challenge your assumption, but not
necessarily your assumptions about the data model. The model may be fine. There may be tweaks
you can make, but you have to think of
the workflow a little bit. So sometimes when we run
these queries where we say, I'm going to submit it,
I'll come back on Wednesday and we'll see what happened. Some of that's because
you're actually operating in a resource
constrained environment. You may have gotten
used to it, but it is kind of resource constrained. I'm only getting a
little bit of batch work, and I can submit it
in these time periods. And I think the challenge that I ask people to push on
that workload is, well, what if you broke it down? Are there materializations
that can preserve work? Remember, at the top of the
talk, we talked about the fact that the cost of
storage has plummeted, such that duplicate data
doesn't really cost that much, particularly if it doesn't
live for very long. So if you do intermediate materializations, you can not only sort of preserve your work, and maybe speed up the way you compute things over time, but you could also think, hey, maybe there's a broad intermediate form, either a denormalized one, or a set of normalized relationships, that's actually really useful for a broad set of people. If I have a number of analysts who are looking at the same kind of things, maybe I can break my work down into sort of a large materialization that many analysts can then go do their leading-edge computations on.
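One hedged sketch of what such an intermediate materialization could look like in BigQuery--the dataset, table, and column names are made up--is a shared rollup that gets built once and then queried by many analysts:

    -- Materialize the broadly useful intermediate form once...
    CREATE OR REPLACE TABLE analytics.daily_orders_by_region AS
    SELECT
      order_date,
      customer_region,
      COUNT(*) AS order_count,
      SUM(order_total) AS total_revenue
    FROM analytics.orders
    GROUP BY order_date, customer_region;

    -- ...and let individual analysts run their own questions against it.
    SELECT customer_region, SUM(total_revenue) AS q1_revenue
    FROM analytics.daily_orders_by_region
    WHERE order_date BETWEEN DATE '2017-01-01' AND DATE '2017-03-31'
    GROUP BY customer_region;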
And so, one of the things I also see is this notion of
like, oh, well I'm coming from the big data space. I suffered through MapReduce,
we got to Hive, well it's OK. It's OK. But one of the things that
really worked for me in Hive was how we partitioned our data. We had multiple levels of
list partitioning, it was OK. We were able to go
a little bit faster. But one of the things that I
think is worth pointing out is, not only does
BigQuery offer the ability to date partition tables,
which is pretty common in data warehouses, but we have
this sort of table sharding which allows you to
effectively have prefixes that are common amongst
tables of common schema, and then wildcard-searchable suffixes. So you can use that to emulate something like partitioning, while providing yourself a lot of flexibility from a sort of partition management standpoint.
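For example, with a family of sharded tables that share a prefix and a date-like suffix (the project, dataset, and table names here are hypothetical), a wildcard query plus the _TABLE_SUFFIX pseudo column addresses just the shards you care about:

    -- Scan only the 2015 through 2017 shards of the observations_* family.
    SELECT station_id, AVG(temperature) AS avg_temperature
    FROM `myproject.weather.observations_*`
    WHERE _TABLE_SUFFIX BETWEEN '2015' AND '2017'
    GROUP BY station_id;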
So I think that's the thing that people can kind of look at if they're used to
coming from something where list partitioning was really
key to getting a little bit of performance out. Or just even organizing
large amounts of data. A good example of
this, actually, is a lot of the things we've
done with the ground station observation data and the
weather public data sets. This is a great example
of like, oh, we've got a ton of data
from 1929 to today, gosh, we really ought to
figure out a way to deal with that in a
somewhat partitioned fashion. So, Daniel, do you want to talk about the nesting stuff? DANIEL MINTZ: Yeah. DAN MCCLARY:
Because I think it's interesting to watch
someone from the outside think about our
nested structures. DANIEL MINTZ: Totally. And I'll be honest,
this challenges me. I'm a deeply relational thinker. I want things to be up and
down, side to side, and just-- DAN MCCLARY: Tables are cool. DANIEL MINTZ: Yeah,
tables are good. Tables are good. Tables are good. So let's use a really
simple example. A simple example I can print on the screen. I've got sessions, or
orders, or some data that is conceptually nested. In the real world, there is a
nesting pattern going on here. And so normally, no
pun intended, normally, the question that you
would ask yourself is well should I normalize this data? Which will be more efficient
on the space front, and the answer, as that
graph from before showed us, is don't bother. That's not a good reason. There may be good
reasons to do it, but space efficiency
is not a reason to do it when you have an
infinitely scalable cloud. OK so should I
denormalize it so I don't have to worry about joins. OK, normally the answer
would be, well, maybe. That might be more performant
for certain types of queries, but maybe not for others. And, in BigQuery at least,
there is another option. There's a third option,
which we never had before. So here is my really
simple example. I have three orders which
happened in January. This was very hard to
come up with because I had to figure out fruits and
vegetables that started with e. But I did it. And so we've got three orders. They each contain
some number of items. Each of those items
has a unit price. Super, super simple. So one thing that we
could do with this data is we could turn it
into a snowflake. We could have our orders
table, our order items table, and our items table. Great. Normalized. Yay! I feel so at peace,
because it's like-- DAN MCCLARY: I'm pretty sure
I did that in my first database class. I still hearken
back to those days. DANIEL MINTZ: I realized
that I organized my CDs. Some people may not
know what CDs are. They're these physical
disks, that we-- never mind. But I liked to do it in
order when I was a kid. Anyway, so we could do this,
and this would be totally fine. Another thing that we could
do is we could denormalize. We could stick it
all in one table. We've got some
repeated data here. We now know the unit
price of the banana twice. Unnecessary, space inefficient,
but, hey, no joins. Great. In BigQuery there
is a third option, which looks like this--and no one scream in horror. This is really weird looking and
not OK for us who like tables, but we can just leave the
table nested inside a column. DAN MCCLARY: Daniel, we
heard you liked tables, so we put tables
inside your tables. DANIEL MINTZ: Is it turtles
all the way-- never mind. I'll give everybody a
second to compose themselves after looking at this. All right, so what
we have here is a table in BigQuery, which
still has an order ID, still has all the top level
information about an order. Order ID, order
date, order total. And then inside the order items
column, which is a column, we have, fundamentally,
another table in each row. They all have the same schema. And I'm not going to go into
the structs of arrays of structs of arrays of structs, but-- DAN MCCLARY: That's a thing we could talk about in Q&A. DANIEL MINTZ: But this is totally kosher, legit, in BigQuery. You can do this.
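As a rough sketch of what that schema could look like in BigQuery DDL--the dataset and field names are invented, loosely following the toy example--the "table inside a column" is just an ARRAY of STRUCTs:

    CREATE TABLE mydataset.orders (
      order_id    INT64,
      order_date  DATE,
      order_total NUMERIC,
      -- The nested table: one element per item, repeated inside each order row.
      order_items ARRAY<STRUCT<item_name STRING, quantity INT64, unit_price NUMERIC>>
    );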
Now, why would you want to do this? Other than to drive your
Databases 101 teacher crazy. Well, let me go back. The reason is
actually because if I want to query something
just about orders, remember that
BigQuery is columnar. So that crazy
thing on the right, don't need to worry about it. Stays in memory,
I never touch it. Right, I can get a count of orders really simply: SELECT COUNT(*) FROM orders. It's going to give
me the answer. That's great. Now, when I do want
some information about the items in the orders,
I can unnest that structure, go in and get the information,
and then, and only then, unpack it and look at it. In the other examples, I have
to choose one or the other up front. I have to either say I'm going
to normalize because that makes it really easy to query
stuff about orders, but kind of a pain
to query stuff about the things in the orders
because now I have to do two joins. Remember again, this is an
incredibly simple dataset, so if you're dealing
with real data, this would actually
be a pain, not just kind of a fun exercise. In the denormalized one,
sure, I can get stuff about the items and the
orders really easily, but just to find out
how many orders I had, I have to do SELECT COUNT(DISTINCT order_id). And so now I'm scanning
a bunch of stuff repeatedly that I don't need. In this, I put off
the choice about how I want to structure the
data, until query time. Which is great, because I
can choose the structure that makes sense for the type of
queries that I want to run. This is a big deal
when you're dealing with billions of rows of data. So the one problem,
remember, we said, well, in our list of things we want
out of our data language, is that writing the SQL for this is going to look a little funny. We're going to use this idea of LEFT JOIN UNNEST, which is a little weird, right?
LookML, which is our markup language, that deals with this. It says, yeah, there's this
table at the top called orders, and it has this join
called order items, and here are the dimensions
inside the views, inside orders and order items. I can compose things
like item total by multiplying quantity
and item price. I can specify
multiple time frames. And this probably
looks a little bit weird to people who
haven't seen LookML before, but the fun part is I can
learn this real quickly because I already speak SQL. And this is
fundamentally just SQL. And then I can expose
this to my business users. And so all of a
sudden, they can then explore this data set freely
and leave it to the tool, to the data platform, to
write the appropriate SQL. They don't worry about
how orders and order items are related. Whether I did the normalization,
the denormalization, the nested structure,
tool worries about that. So that stuff only gets written
once, and now all of a sudden, they can say, well, I want
to know orders by dates, I want to know how many bananas
were bought on the 26th. They can just freely move
around in this environment, and ask questions. We send out those queries
that the platform writes off to a BigQuery, and assuming that
there are more than seven rows of data, BigQuery, then,
can very powerfully churn through that data
and return an answer. DAN MCCLARY: Well you
preserve, importantly, you preserve the relational
model of the world throughout, DANIEL MINTZ: That's right. And so the relational
model now spans all the way from the bottom, from the
way that the data is actually structured in memory, all
the way to the top to the way that business people
are accessing the data. And so we have this
sort of unified model that is both how the
information is organized, at the most basic level, and
how people are accessing it. DAN MCCLARY: So at
least logically, we're actually accomplishing a
lot of what Codd set out to do. DANIEL MINTZ: Codd,
would be so happy! DAN MCCLARY: He'd
probably be freaked out. I think cell phones
would freak him out. DANIEL MINTZ: So I'm going to
hand it over to Dan to wrap up. We're downstairs--Looker's downstairs at E14--so as you're drinking
a beer, if you want to put our
engineers to the test and make them actually
explain LookML to you, on bigger data sets and
this, come on down to E14 and you can make them do that. And I'll enjoy watching
them have to do it, so-- DAN MCCLARY: Okay so I'm going
to try and wrap up really quickly so we maybe
have even one or two minutes for questions,
because I know there's also a happy hour that's
happening and I'm sure people are quite thirsty. I think for me though, the
core of this is really-- the relational model matters,
normalization to some extent matters, it's at the foundation
of systems like BigQuery; you don't get relational query processors, SQL-driven processors, without that model. And the logical
continuity of it, I think, is really important,
all the way from how we built our
distributed processing engine, to the way
you can model schema. Using not only the tools
you have learned, but also things like nested
and repeated fields. And it passes through tools like
Looker into the reporting space as well. But, I think, again,
it's important as we think about
trying tools like BigQuery, trying tools like
Looker, we can't blindly accept the rules of the '70s. We have to know that
they're important, and know that they
have given birth to the point we're at now, but
anytime we're modeling data, we should just continue to ask
ourselves did this actually increase independence, data
independence, independence for our users, and their
ability to analyze data? Is it helping users
fundamentally do more, or find insights? Is it reducing the
inconsistencies? Is it making it easier for
me to actually figure out what happened? Is it making those outliers
come closer to the fore so I actually know
what's going on? And then finally, if we're
making a modeling choice, because we believe
performance is the reason, is it really meaningfully
enabling performance? Or is it really
kind of premature? Well you know it goes
a half a second faster, and that's really important. Is it really important, or is
it better to help the users? So again, all of the
things we learned from our friend the fish and
from Mr. Codd, are important. We just need to
make sure that we're aware that the systems
we work with now allow us to potentially
do quite a lot more. And with that we've
got only a couple of minutes left, but on
behalf of myself and Daniel, I'd really like to thank
you guys for sticking around even as happy hour started. We really appreciate it. Who knew data modeling could be so fun and fishy? [APPLAUSE] [MUSIC PLAYING]