[MUSIC PLAYING] CHAD JENNINGS: Hey. Good afternoon, everyone. It's 4 o'clock on the dot. And we have practiced this talk
with military grade precision. So I thought I would get started
right away as folks can come in and fill in the extra seats. My name is Chad Jennings. And I'm one of the
managers for BigQuery. And thanks for spending
this hour with us. Obviously, we're going to be
talking about Data Migrations: Lessons Learned. And we put that in
there on purpose. I'm going to spend a little
bit of time talking about some of Google's tools for
folks who aren't maybe as familiar with GCP as
some of the other vets here. And then our customers,
we've got an excellent field of customers who are
going to come up and talk about their sessions
and actually give you their war stories. I couldn't put War Stories up
here, or like, whipping post, or like anything negative. So I had to put lessons
learned on the slide. I guess that's captured
in the video now. Anyway, they're going to come
up and give you the real skinny. Before we get into
that, though, can we do a little bit of
audience participation? How many folks
here are relatively new to GCP and BigQuery? OK. A fair few. Do you guys come from
the G Suite side? And how many have like gotten
really deep, dark, and dirty with BigQuery and
know it pretty well? OK. A fair smattering. And how many folks
are in this room who are thinking about migrating
to BigQuery as your enterprise data warehouse? OK. Great. I think we've got a range
of materials for you all. And I think that was about the
distribution I was hoping for. So we have three customers,
kind of in data volume migration sizes, you know, small,
medium, and large, and representative across a
wide set of industries and such. So Nikhil Mishra
from Yahoo is going to come up and take
the stage and tell you about their effort
migrating to BigQuery. Then Nick Ursa will
come up and do the same. And then getting
into the agenda, starting down at the green
dot, why migrate at all? What we're going to do with the
last 20 minutes of the session is I'll be joined by the
gentleman up here on stage, and we'll all just have a bit
of a panel discussion about why go ahead and do it? If you've got a system that's
working well enough for you, the thought of moving
encompasses a lot of risk, maybe a fair bit of fear,
maybe a career is on the line. And so there has to
be a real big benefit. So we'll talk about that. And that will also
give us an opportunity to have an extended
question and answer period. So if you're thinking
about moving, this is a time to ask people
who have actually done it what their experiences were
and what roadblocks or hiccups they had to overcome. OK. I'll take the first two sets. So I'll set the
stage a little bit. And then I'll talk about
Google and partner tools to help get this
kind of work done. Okeydoke. So there are a number of
reasons why you might find yourself compelled to move. Maybe it's time to renew your
license on your on-prem system. Maybe you're frustrated
with the inability to scale storage and compute. Maybe you're just frustrated
with the inability to scale them at all. Maybe your system
works well enough, but it's not going to keep up
with your forecasted growth if you're lucky enough
to be in that problem. Maybe your current system is
just too hard to maintain, or takes too much work to keep
it fed and watered and humming along. Maybe it's just
too dang expensive, and you're looking
for an option. So now you're faced
with a choice, right? You can either move
now or move later. At some point you're going to
be forced to make some change, whether it's buying more
of your legacy system or thinking about
something else. And so that's what we're
here to talk about. So for those of you
new to BigQuery, we'll talk about this
for a little bit. If you've known BigQuery
for a long, long time, I'll call you back in, you
can just check your email for a couple of minutes here. So BigQuery is Google Cloud's
enterprise data warehouse. It's an analytics tool. So it's not an OLTP store-- we have other products for that, like Bigtable. This one is focused
on analytics. BigQuery goes really
big and really fast. You interact with BigQuery in SQL. It's the lingua franca
of data analytics. And so standard SQL is the
coding language that you use. And because BigQuery
is basically the public face or
the productization of a number of the elements
of Google's infrastructure, our storage is on
the same system as storage of products
like AdWords, YouTube. Let's see, another small
one-- oh yeah, Gmail. What's that other
one that Google does? Search. So we get a lot of
benefit at BigQuery by being built on the
exact same infrastructure that these other, you know,
billion plus user products utilize. Now, something that is unique
to BigQuery and not something that we get from
Google's infrastructure is that we are fully
managed and serverless. So what that means
is you do not have to worry about spinning up
reservations, or provisioning, or figuring out when you're
going to have peak loads. You just load your data,
write your science, and then press the red button,
and you execute your queries. Google takes care of managing
all the computation resources that you need. And we can scale up
storage almost infinitely. Our largest customers store
well over 100 petabytes. OK. I'm not going to go through
this slide in detail. But I do think it's a really
good slide for context. So Google has an awful
lot of storage options. And if you're in this
migration, like should I migrate or should I not? It's important to know what
these other things do, or might do for you, or might not. So BigQuery is all the way
over there on the left in case you couldn't find it amidst
all of the other blue hexagons here. And that is the data warehouse. So it's relational in nature,
but it's not an OLTP system. It's not a database. It's our data warehouse. Okeydoke. So let's talk about
the migration path. And so here I have the legacy
enterprise data warehouse on the left. BigQuery there kind
of in the middle. And then I put this little
chart, this last icon in to really drive
in the fact that when you're migrating from
one place to another, getting the data
landed in BigQuery and in the right shape, and
getting the queries to run appropriately quickly, that's
actually not the end line. The end line is the
understanding and the insight that you're able to implant
in people's brains, right? It's not even the
charts, or whether it's a bar chart, or a line chart,
or how pretty the colors are. It's the understanding
that people take away. And so I wanted to make that
point explicitly here so basically that we make sure we
cross all the way to the finish line. Now, to get data from
a legacy on-prem system to understanding in
your users' minds, there are a couple of
steps along the way. Google Cloud Storage is
one such stop for BigQuery. We're trying to make
this process smoother and be able to load things
directly into BigQuery. But at the moment, if
you want to migrate even little amounts of data,
or significant amounts of data, then doing it through GCS is
the way you're going to do it. However, we do have
some people, probably in the room, who make this
much easier by abstracting that layer out of the way. And that's our partners. We'll go through some of
these in detail a little bit later on. But our partner tools will
abstract this step from you. And so they'll help you
with schema migration, they'll help [AUDIO OUT]. Are we good? They'll help you
with migrations, data cleanliness,
deduplications, et cetera, and make it a lot smoother to
get your data into BigQuery. If, however, you want to
control that process in detail, you can do it all on your own. Maybe it's more cost
effective for you to do some
transformations or shift your schema into the right
format before you move it out. That's fine. If you're moving data from
the legacy into cloud storage, you can either use online
transfer, which most people do. If you don't have really
good internet connections, we even offer offline transfer
where you can actually pull the hard disks out of
your system, put them in a box, and send them to us, and
we'll load them up for you. Once you get the
data there, bq load is the command that takes things from Cloud Storage and drops them into BigQuery. Now, you have to be careful here, right? When you do that you have to specify the schema. And doing it this way assumes that you've got the data appropriately partitioned and the schema appropriately designed inside of Cloud Storage. If that's not the case, there are a number of Google tools that you can use to make that happen.
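For readers following along, here is a minimal sketch of that load step using the google-cloud-bigquery Python client rather than the bq CLI; the project, bucket, table, and schema names are made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[  # the schema must be spelled out for CSV loads
        bigquery.SchemaField("event_ts", "TIMESTAMP"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField("revenue", "FLOAT"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/events_*.csv",   # hypothetical GCS path
    "my-project.analytics.events",           # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load finishes
```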
And then second to last but not least, and Nick may talk about this, doing ETL inside of SQL is a way to generate a much more transparent ETL pipeline for your users.
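As a rough illustration of that SQL-based ETL pattern, a query can write its result straight into a destination table; the project and table names below are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The transformation is just a SQL statement whose result is written
# to a destination table, so the pipeline stays readable to analysts.
job_config = bigquery.QueryJobConfig(
    destination="my-project.analytics.daily_revenue",   # hypothetical table
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
sql = """
    SELECT user_id, DATE(event_ts) AS day, SUM(revenue) AS revenue
    FROM `my-project.analytics.events`
    GROUP BY user_id, day
"""
client.query(sql, job_config=job_config).result()
```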
And last but not least, you have to make sure that you have an efficient
pipeline between BigQuery and whatever vis tool
you're going to use. Okeydoke. We'll spend just
a moment or two, because I'm running at
the edge of my time. We'll spend just a moment or
two talking about the tools that exist here for you. So everyone in this room
is on this chart somewhere. The x-axis is the distance
to Google Cloud Platform. And it's virtual
distance, right? You're very, very close
if your data is already in GCS on the left
side of the axis. And you're very far away if
you're on-prem with no network whatsoever. And then the y-axis here
is the amount of data. And so we're going to go
through some strategies depending on where you line up. So Google has a
transfer service. So if you're in GCS
or in another cloud, you can use our transfer
service to migrate data into Google Cloud Storage. If you're a little further away
in this virtual distance scale, if you're in a
colo, or if you're on-prem with good
connectivity, you can use gsutil,
or these partners to move your data over. If you're a little
bit further away and you've got less
than 100 terabytes, then Iron Mountain and Prime
Focus are good bets for you. And then if you're far
away and a lot of data, like way off in
right field there, then Zadara is the place to go. I thought I had moved
that slide out of there. I'll spend just a moment here
talking a little bit more about one of our new products
in the Google ecosystem. And then we'll go
through partner tools, and then I'll turn the mic over. So Google Cloud Dataprep
was announced earlier today at the keynote. And so this is a tool. It doesn't do migrations. However, when you bring
data in it can automatically detect schema. It runs analysis on the
columns to draw histograms, as shown here along the
bottom of the screen, to identify what
type of data it is. And then it can automatically
do things like deduplication, or finding nulls,
and things like that. So this is a new tool. We just launched it today. And it can help
clean up your data before you start giving
it to your analysts. And the partners, I won't go
through this one in detail. There's too much. There are too many
different axes on this four legged octopus. What I will say is that
depending on the source, we have partners to help you
ingest data from that source. Since BigQuery is
a data warehouse, I'll just spend
a moment on that. We've had a lot of success
using partner tools and even folks managing it themselves
from data sources such as, you know, Teradata, Vertica,
Netezza, AWS, Hadoop, and Hive. OK. So I will now turn it over to Nikhil Mishra, who will talk about Yahoo's
experience migrating an awful lot of
data, to BigQuery. [APPLAUSE] As you like. NIKHIL MISHRA:
I'm Nikhil Mishra. I'll just walk through
some of the experiences that we had with BigQuery. And just probably share
some of the learnings that we have had as
part of this migration, and share as much detail
as I could in this talk. So I wanted to give
you some context on how the Yahoo data pipelines work. And this is a general model in most companies where
you are using primarily Hadoop for your data processing,
and ETL processing in Hadoop. So a lot of data gets generated
through our user activity over the web, on native apps, on
ad servers, and a bunch of data centers collecting those
data, and moving it out into an HDFS storage system. ETL processes run on Hadoop. They use MapReduce, Pig,
Hive, whatever you're comfortable with and whatever
your data transformation needs are. And then you put this
data on Hive, Hive on Tez, which is the
fastest possible querying interface you have there. And it goes in the hands
of users through a BI tool. And that's typically the
pattern that we follow. What we are trying to do is
not to replace this data flow with BigQuery or
something on GCP. What we are trying to do is
add BigQuery into the mix. And I'll talk to you about the reasons
for going through this modified architecture. And what the pain points were
with our existing systems, and what we're
trying to solve here. The rest of our system
works pretty well. And you know, the
goods, as I said, the data size is humongous. Now, I've been advised not
to use the exact numbers that we process. But I can always say
humongous, freakingly large, multi-something. So I'm going to use
that lingo here. So the data size that
we process is humongous. It's big. We do it throughout
the day, every hour. The batch system is
extremely stable. The pipeline's strong. There are massive joins
happening with multiple data sets. And we've never had any
major issues in our system. So this part is
working pretty well. Now, the pain point,
and this is primarily from the customer standpoint. So once we deliver
the data to our users-- we massage this data with typical data transformations. We cleaned up the data. We standardized it. We enriched the data with a
lot of additional annotations. We added a lot of
metrics in there and made it available
for our users. And they would just go and
query through a BI tool, and that's pretty much it. That's what you need to
make your data accessible. But what is missing
for us was the speed. And the speed was
needed at Yahoo scale. And there are not many
options available. We tried a lot of things. And we keep trying new stuff. So every once in a while
we'll just see, you know, what eventually turns out
to be the best architecture. So the scale at Yahoo, as
I said, is freaking big. It's multi-petabyte
data warehouse. The incremental size, as I said,
our processes run every hour. And there is a
reason behind that, and I'll talk a little bit about that. And every hour when
we process the data, again, it is
multi-terabyte data. The concurrency is high. And this is the result of
a cultural shift happening in the company, or has been
going on in the company, where we are moving
away from the model where they've got a set
of analysts, set of data engineers, who would
prepare data for you, massage it for you, build the
dashboards, run your queries, and hand it over
to you on a platter and just explain to
you what exactly it is. And you make your decisions
on what you wanted to do. What we are doing is
instead we're just handing all the data to
all of our engineers, all our product advisors. No matter who you
are in the company, you should have
access to the data. And you should be
running your own query. Now, when you do that you
need to understand that those guys have a daytime job. They cannot be wrangling
with the data the whole day. And one experience
for me was, you know, one of our colleagues, a pretty smart product manager-- I was talking to him, and he was using my system, and he just ran a query. And he said, you know, I'll
go for a smoking break, and probably by then
it could be done. And if you're managing
an analytic system, that's as humiliating
as it can get. If your queries start
taking that much time, because that's eating
up a lot of productivity from your company. So key consideration for
us was that at this scale there are very few options as it is. So for us, being
hosted, being managed was good, because we didn't
have to worry about setting up the system, because that
just adds up to the lead time before we start testing it out. So that's pretty good,
pretty convenient for us. SQL, again, it's the
language of choice for our data
engineers, analysts, and pretty much everywhere. And with SQL comes BI
integration, pretty good SQL integration. And one really
important thing for us was independent scaling
of compute and storage. That was really
important for us. Because even though we
have multiple petabytes of warehouse-- and that's
true for everyone's case-- even if you have this
multi-petabytes warehouse, users are not going
to come, and each user is not going to come and query
all of the data all the time. He's going to query a subset
of data, what he needs, and probably work with it. So we wanted to be able to
adjust our capacity for storage and compute as we see fit. So we set out to test BigQuery. It fit in most of the
[INAUDIBLE] we laid out. And another point we added was
we defined what fast means. When we said that we
wanted our queries to be interactive and
fast, what does fast mean here? So we said that, you
know, when you're running your queries, for
typical day-to-day queries that you are running,
daily usage queries, you know that you're going
to run it every single day. We don't want you to
be wasting your time running those queries. We want you to do
your daytime job. So we said, you know, 30 seconds
seems to be a good threshold. If a user is typically
going through a seven step, eight step now process to
get answers to his query, he should be back
probably in a few minutes, or in a half an hour. And we said if you are going
to do your monthly reporting, your weekly reporting,
which is typically the case with most of
our business users, we said less than
a minute should be good enough, because that's
something you do once a month, or a few times in a month. So with that we started
our process of migration. And the very first step
that we do at Yahoo when we are doing some
of those things, and there's something new-- as we were moving our
data for the first time, or at least what I know
of as the first time, outside of our data centers. So it ran through a
comprehensive, you know, legal review. The data that we are
moving should not be violating any
policies, any agreements we have with our users. And that's an important
consideration. Another important consideration
was network bandwidth, because we were going to move
many, many petabytes of data across the wire. So we had to be very
clear how we phase it out, and if we have enough
bandwidth to support that. Another interesting
consideration, mostly for engineers
on the team, was that as we move the data
between the two systems-- because we already
had our data prepared. All we were doing is moving
it from one database, one system to another. And the data-- we wanted
the schema to also move along with the data. And partitioning was important
for a couple of reasons. We wanted to manage
the retention, because as we load the
data into the warehouse, over a period of time the
warehouse size increases. You don't need all the
history in some cases. So you should be
able to manage it. And you should also be able to
optimize your queries leveraging partition pruning. One interesting thing we
learned very early in our cycle was that even though
BigQuery stores data in an uncompressed
format, there is no need to move your data uncompressed. You should be able
to move your data, and you should be moving
it, in a compressed format. And we realized that
BigQuery supports Avro. GCP supports Avro natively. And what we wanted to do is
we wanted to leverage that. And we tested it out. And it turns out that using
the Avro with [INAUDIBLE], we ended up reducing
our size by 6x. And that was a pretty good deal. So we said, you know, that's
what we're going to use. And the schema
migration, which was an interesting criteria for
us, and a needed criteria, came for free. So with Avro, with every single
row, you get your schema. And we don't have to worry about schema changes; they get propagated in both systems as the ETL progresses.
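A minimal sketch of such an Avro load with the Python client, assuming hypothetical bucket and table names; because Avro files carry their schema, no schema needs to be supplied.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Avro files are block-compressed and carry their own schema,
# so no separate schema definition has to be kept in sync.
job = client.load_table_from_uri(
    "gs://my-bucket/warehouse/2017-05-01/*.avro",   # hypothetical path
    "my-project.warehouse.events",                  # hypothetical table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
    ),
)
job.result()
```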
Now, we moved the data. We moved it in Avro format and loaded it into BigQuery. The loads were efficient. We monitored the load
processes, and we benchmarked some of our queries. We benchmarked
some of our loads. And we kept it within the SLA
when the data was available. And one interesting
thing we learned was that in order to utilize
the capacity efficiently and not to have this blip
every day, where you're using the entire
capacity in a daily batch by moving your
hundreds of terabytes, or multi-terabytes
of data in one day. If you split it out
throughout the day every hour, you would be not only speeding
up all the load process, you'd be also
needing less capacity throughout the day instead
of massive capacity once every day. And you know, that helps a lot. That keeps your pipeline stable. And that helps you leverage your
capacity much more efficiently. Query benchmarking,
what we did, we set out 30 seconds as our daily target and one minute for the weekly and monthly reports. And we were
pleasantly surprised. The query response time
was pretty amazing. It was way, way below what we set out to achieve. So it was a big win. Interesting thing we learned
was there is something called Coalesce that BigQuery does. We don't know much about it. We typically work
with the Google team to make sure that all of
our tables are coalesced. And there is a huge difference
between the query performance when the tables are coalesced
versus when they are not. You don't have any visibility into it. Typically you work with your Google partners to make sure that [INAUDIBLE]-- if there are query problems, just talk to them and
make sure that, you know, your tables are being coalesced. Another thing we
learned was that partitioning was a new feature when we started off. And it was an important feature. There's a pseudo column called _PARTITIONTIME, and you pick your date to be the partition, which is typically your data date. And what you need to make sure is how your users end up accessing it-- if you can obscure it, if you can have your BI tool translate _PARTITIONTIME into a date and give that to your users to query, you will be able to easily prune partitions from your data, scan less data, make it less expensive, and make it faster for your users.
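For illustration, a query that filters on the _PARTITIONTIME pseudo column so only one day's partition is scanned; the table name is made up.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Filtering on _PARTITIONTIME prunes partitions, so only one day
# of the table is scanned (and billed).
sql = """
    SELECT user_id, COUNT(*) AS page_views
    FROM `my-project.warehouse.events`   -- hypothetical day-partitioned table
    WHERE _PARTITIONTIME = TIMESTAMP('2017-05-01')
    GROUP BY user_id
"""
for row in client.query(sql).result():
    print(row.user_id, row.page_views)
```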
What we also did when we launched BigQuery with our customers
was we gave them-- we are trying out
another hosted system, which is something,
again, that I'm not going to talk much about-- and we gave a set of users
the new back-end, but exactly the same user interface they
were using with the old system. And we kept both the
systems running in parallel. And we started monitoring
their usage pattern, the query average,
the query response time, and all those things. And we benchmarked both
the systems side by side. And BigQuery was like way
better than the other one in all parameters. Its speed, its stability,
its query performance. Everything else was pretty good. And the users didn't even know
that the back-end had changed. So they were kind of in the dark. So what that allowed us to do
was over the weekend we just flipped the switch. We initially
thought that we were going to keep running
the systems in parallel for a few months. But we flipped the
switch in a few weeks, because we thought, you know,
it's working out pretty well. And we started moving all
our users and our use cases over to BigQuery. And again, the most
important part here is when you're doing this
scale of data migration, you would be working pretty
closely with your Google team, because there are a lot
of things that happen. It's a black box, and
I like it that way. You don't want to
worry about what's happening behind the scenes. If you run into a problem, you
just tell Google, go fix it. And they are usually
pretty prompt in fixing it instead of you trying
to figure it out through the documentation
what's going on. So we'll talk a bit more
later on in the panel, but it was a pretty
interesting experience moving this scale of data. And some of the key learnings
I'll share with you. Next up, I guess, it's Nick
from "The New York Times." [APPLAUSE] NICK URSA: Hi, everybody. My name is Nick. I work as a data scientist
and data engineer at "The New York Times." I'm actually going to
interrupt my normally scheduled performance to tell you about
a little piece of history I think we just witnessed. In 2006, Doug Cutting
was working at Yahoo. And if you don't know
Doug, he was the one who came up with the Hadoop elephant. It was his son's plush toy. And a lot of Hadoop,
to which many of us owe some part of our careers,
was really born at Yahoo. So it's kind of historic that
Yahoo has kind of merged back using elements of the
original Google File System under BigQuery, and they were
the developers of the Yahoo that we all used. There's kind of this family
reunion that you just saw. Anyways, I want to
talk to you about what you're interested in
your 20s and your 30s. A lot of us kind of grew
up getting interested in the technology behind data. And we loved scale. And we loved getting
things to work. But when I got a
little bit older and started kind of
working with the data, instead of getting the
data to work at all, I saw a lot of people
who were disenfranchised by the kind of
problems that you're having when you're working
with technologies like Hadoop. What I realized is that a lot of
people are left out in the cold as soon as they hit like a
Java stack trace or something. And it led me to switch kind
of from doing data science to trying to go back
to data engineering at "The New York Times," because
I saw this problem around us. We had a situation where
we had a lot of people staring at Java stack
traces when they should be kind of computing things. So it caused us to take
a step back and look, not at the technologies that
were available at the time, but the technologies that
maybe were nascent at the time. And this was last year. So to the people in
this room who maybe were flirting with the idea. You're running Hadoop
stacks, and you're not sure if you really want to take a
chance on this thing, which is, after all, as a
commercial product, quite new. I was exactly where you are last
year at last year's Google Next wondering if I'm going
to be risking anything. So this is potentially
you one year later. At "The New York Times," you
know, we're an older company. I think our founding
predates the Civil War. And we had a similar
archaeological cross-section of data storage technologies. We still have a few
mainframes, and they work-- much respect to mainframes. There are just so bulletproof
after all these years, but we've got to move
a lot of these things into a common place. Because if you're going
to work in media today, you've got to be able
to know all the touch points for a given
person in the data silos that have been
created over time. Different IT systems for
different departments, you know, that's
not going to work when you need to respond in 10
milliseconds to a web request. So if you can see, we've gone
through Oracle data warehouses. We've spun up at least 500
MySQL instances on Amazon. Redshift, which I
like, but-- and I'll tell you the but later. And maybe a year ago, we
spent-- well, maybe a year and a half ago-- $1 million
on our hardware Hadoop cluster to cut down all
these cloud costs. And I got to tell
you, after we like kicked the tires on
Hadoop, it was a no-brainer to junk all that capital
and just go for this system. So we did that with
that kind of outlay. You have people trying to
answer a simple question about the business
that we're in, and they get this kind of thing. And what happened
is that the kind of people who can
deal with this problem are not the people who know what
our quarterly earnings should be. Are the business metrics
at the right level? And we separated the people who
have the technical capability to answer the question
from the people who originated the question. I think that's one of the
unfortunate sides of big data. And so what we want to do is
really bring the capability to the outer part of the circle. People who are closer
to the product design, people who are closer to the
heartbeat of the business. We fetishized data
engineering a little bit, and it's been fun
for a lot of it. But I think we need to make
these tools as polished as the tools that we used
to use, as polished as those old mainframes, you know? And there's a problem
with opensource, I think, because the kind of
thing that takes you to the 100%, the
polish, is really boring QA work of edge
cases and corner cases. And from what I
know of engineers, they want to solve scaling
problems and new core engines. Unfortunately,
we've got businesses to run in the meantime. So we took a look at kind
of what our priorities were. And although machine learning
gets a lot of headlines, and certainly everything
in there is important, it relies on stable
infrastructure. Nothing works unless
you can count on it to work without a smoke break. We need it to be in SQL. We don't want to make
people write Java just to answer some question. And then once you
do that, you can start taking control of the
proverbial octopus of ways you're storing data
in your company. And then you can have regular
normal people ask questions. Now you have four
times as many people getting the answers they want. So I'm going to briefly show
you how we conducted our trial. We benchmarked performance. We took our six
most common queries. We estimated price by
looking at the Hive logs and seeing what was the data
footprint on each of them and adding it up. My estimate was that
we were actually going to be at 1/3 the costs
of our Redshift cluster. It turned out to be 1/4
when we actually put it in. And people are running
their queries a lot faster. So then we went to
an internal beta. And here's a little bit
of social engineering I'll talk to on the next slide. It's really important
if you do the shift you budget for two systems
running at the same time. Because this is data,
and you have to make sure it is correct. So that means you need
an audit benchmark. So we kept-- I'll show you this
a little in zoom. So the bottom is our old flow. The top is what
we set up for GCP. So we split our thing
going to a Pub/Sub. The Pub/Sub calls the AWS SNS. And that goes into Dataflow. Now, you don't have to use Dataflow. We wanted to move from mini-batch systems to near real-time. And it took three weeks in the POC for us to get latency from 45 minutes-- log files, gzip, I know some of you still do this-- down to 45 seconds. From the time that you click on "The New York Times" to the time that I can see your click in a query is down to 45 seconds.
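A rough sketch of that kind of streaming pipeline using the Apache Beam Python SDK (not necessarily what the Times ran); the project, topic, table, and schema are assumptions.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True, project="my-project")  # hypothetical project

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadClicks" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/clickstream")      # hypothetical topic
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:analytics.clicks",                       # hypothetical table
            schema="user_id:STRING,url:STRING,event_ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```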
And internally you get a lot of love. No one likes queries to fail.
we saw pretty well immediately with the beta group. And once the beta group started
getting this everyone else started getting a
little jealous of it. And this is just
organizational information. That made the rollout across
the organization really good, because everyone's like,
oh, can I try that? Oh, can I try that? And through that we
learned a little something about how to onboard
people to new technology. It's really important,
if you do this, you write your internal
getting started guide that uses your
queries and your data. Because if you hit
the Google Docs, their first day may
be one of frustration. Then we made sure
we had like a chat group internal to the
company so that you could get the early
users helping each other. They became your local heroes. And it formed kind
of a Hive mind when it came to solving queries. And we tried to notice the
little things that were keeping people from working. There were a few
wins with Google here that we didn't expect. At first I thought
not having JDBC/ODBC was going to be a problem. Turns out that
having a web client was a bigger advantage, because
it enabled so much casual use. When you can give someone
an URL and say, hey, check out this
query, versus having to get them to set up
a whole kind of client, and like download the
opensource JDBC driver, it was such a winner to
have the web interface. Email account creds,
it helps a lot. You don't have to set up all
kinds of LDAP things here and there and manage permissions in the database. There is a module in pandas called, I think, read_gbq, or something like that, which goes straight to a pandas data frame. So all the data scientists were absolutely thrilled.
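He is referring to the pandas-gbq integration; a minimal example, with a hypothetical project and table:

```python
import pandas as pd  # requires the pandas-gbq package

# Runs the query in BigQuery and returns the result as a DataFrame.
# The project ID and table are made up for illustration.
df = pd.read_gbq(
    "SELECT user_id, SUM(revenue) AS revenue "
    "FROM `my-project.analytics.events` GROUP BY user_id",
    project_id="my-project",
    dialect="standard",
)
print(df.head())
```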
And because of the permissioning you can actually use the Google
Groups as mailing lists to alert people
of schema changes. The first thing
you're going to have to do when you start trying
to get useful stuff in there is get some of your relational
data in there as well. The click stream I showed
you, we did with Dataflow. Ironically, the large data
is the easy stuff to do. It's the thousands of tables
all through your organization that you're going to finally
do this data lake on that end up being kind of the real pain. But I don't know, BigQuery-- all Google Cloud is
Google commercializing the technology that
allowed their developers to be successful. So what you're seeing
is a productization of what has made
Google successful. That's what we're all
trying to tap into here. And they know that when a
Google Developer sits down and they're paying them
what they're paying them, they want it to be
productive on the first day. And I got to say, when you start
loading data into BigQuery, you go hit your database,
push it to a FIFO or a file, and push it into BigQuery,
it works like right away. And you know, this thing,
there's no indexes. There's no sort order. At first I thought,
well, what are the trade-offs to get all this performance? But the ability for
me to tell an analyst that they can upload
data without worrying about the layered
on disk or anything like that, it's been a huge win. Because now people can
do these loops where they have a question, they
have an external data source, they push it up into
the database themselves, and they can answer
their own question. And so I think what
happened is we really changed our performance metrics
from-- how does this query run? To-- given a question,
how long does it take until a reasonably skilled person can get an answer? And so the fewer people you have in that loop the better. I'm not going to read
all these slides. The slides are more for you
if you check it out later and you want to know what
are the best practices for every type of data. These are the things
that I came up with. But I do want to talk about
data flow a little bit. If you don't need real-time--
and be honest with yourself about it-- don't do it. Little small batches
are totally fine. A lot of people say
real-time, they're still fine with five minutes. Dataflow can be a little bit expensive to use for streaming, because you're just keeping
these connections alive no matter what the
data volume is. And here are a few other things. Telling people to use BQMate-- for the first time
people will become aware of the cost of
what they're doing, which is sometimes a shocker. But it teaches them how
to write a better query. And you probably need to get
out of the habit of ordering. I think relational
databases made us do that. But on distributed
systems ordering can really screw up
your performance. So don't order if
you don't need to. Finally, don't think of
BigQuery as a database. And it's in this uncanny
valley right now, because it almost looks
like a relational database. But you can't do a lot
of mutations on it. But the storage
is the same price as cloud storage, which means
you can store your data as cheaply in BigQuery, where you
have query access to it, as you can on disk, where you would need to spin up something like a Hive cluster to get at it. So you basically have-- all structured data is just as
happy in BigQuery price-wise, opex-wise, as it is in any other system. So finally, let's talk
about what your KPIs are. When we started this
project we were like, what should we target? Deprecated systems? Tables migrated. And I said, no, let's
do a survey of people. And we did a survey
before and after BigQuery. And we ended up with
3.5 to 4.7 out of 5 in terms of analyst
satisfaction. And around Christmas I
get an email like this. So I hope that if you
go down this path-- and you can grab my email,
or grab anybody here, because it's a small enough
community that we can help each other a lot-- that you end up with
an email like this when you're into your attempt
at trying a technology, which I think really
deserves a break. That's it. Chad? [APPLAUSE] CHAD JENNINGS: Thanks very much. The last part of the
session here-- we've got right on 20 minutes or so. I'll invite my co-presenters
up onto the stage. So also acknowledging the-- Nick, please-- and
acknowledging the dual purpose of these slides, we're going
to talk a little bit right now about why bother doing
this crazy migration thing if it has risks,
and it has work to do, and et cetera, et cetera. So we'll talk a little
bit about that starting with Jason in just a second. But I also wanted to
point out that in the deck the gentlemen here
were kind enough to add some bullet points
about what the benefits were. So like you can get the
deck, you can get the video, and have a look through these. Oh, yeah. Sorry. This is also for
reference of the related-- oh, there's the clicker. Thanks. Some other sessions that
may be interesting to you. And then because I chose the
slide with the animation-- yeah, I won't leave
this one up forever. But if you want to
grab our handles, you can get a hold
of us another way. I'll leave this one up
for a couple of minutes. But I wanted to start off
the conversation here. And, Jason, I'll ask
you to lead us off. What was the benefit of doing
the migration that you did, and can you describe
what the migration was? JASON JHO: Yeah, sure. So I guess I'll start with
just first introducing myself. So for those of you who may
not know about Blue Apron, we're a company whose mission
is to bring incredible home cooking to everybody. And in the process of doing
so build a better food system, which is essentially how food
is grown and distributed. So we're not actually as
big as Yahoo and perhaps "The New York Times," in terms
of data size and scale. But we actually do have a
lot of complex data sets-- everything from our
warehouses, the physical kind, where we actually ship
boxes out to our customers. To the consumer website
that we have collecting data about consumer behavior,
to all the data that is relevant to
shipping and logistics. So a lot of
different fragmented, heterogeneous data sets
that are relatively complex. We rely on that data
to perform analytics as well as to help
us forecast demand. And to also drive personalized
content to the end user. So data is very important
for our business success. BigQuery has been an
incredibly important asset to our platform and pipeline. So one of the things that
really shines about BigQuery, it's really simple. It's our ability to not
think about performance-- so tuning query performance,
thinking about indexes and capacity planning. All of that sort of
magically went away. And we instead took that time
to think about the data model. What is the right
data model that we want to curate for our analytics
and machine learning teams, so that they don't have to run
all of these complex processes to get data from
different places? That semantic layer
that we're building can be consistently
shared across the company and scale as a business grows. And BigQuery is a real
incredible enabler for that sort of process, which
leads me to my next point. I think having that sort of
scalable petabyte scale data warehouse at your fingertips,
without having to really think about the management
part of it, has allowed our analysts
to really explore data at interactive scale. And I want to
emphasize interactive. I think for the very
first time using Looker, which is a tool
that we use for BI, we've been able to answer
questions as they come. You know, like, the
ability to do full table scans, to aggregate
data on demand is an incredibly
game changing thing. Before we would have
wasted hours and perhaps, embarrassingly, days to get
answers to our questions that we had. So BigQuery, again, has
been a tremendous tool to our tool box. And finally, I would say
that, to Nick's point earlier about storage costs,
having data in BigQuery is-- number one, it's available
for hot querying. But you can actually have large
amounts of historical data just sitting there. And that's sort of
what we've adopted in terms of getting historical
data into the data warehouse without having actual
time series data. So we've been able to batch
load consistent snapshots of our data into BigQuery using day partitioning and essentially have the ability to query over time.
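One way to express that pattern with the Python client is to load each snapshot into a specific day partition via the $YYYYMMDD decorator; the names below are hypothetical and this is only a sketch of the approach described.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Load one day's snapshot into its own partition of a day-partitioned
# table using the "$YYYYMMDD" decorator. Names are hypothetical.
destination = bigquery.TableReference(
    bigquery.DatasetReference("my-project", "warehouse"), "customers$20170501"
)
job = client.load_table_from_uri(
    "gs://my-bucket/snapshots/customers/2017-05-01/*.avro",
    destination,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
job.result()
```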
So that's been a huge boost in our ability to understand customer
behavior over time. So all in all BigQuery has
been an incredible tool. And it's allowed
us to stay ahead of the insights instead of
playing catch up all the time. CHAD JENNINGS: Great. Thank you. Thank you very much for that. I'll say that the team
will be particularly pleased to hear that story. Actually it was
at this conference last year that we introduced
long term storage. Where if your table hasn't
been modified for 90 days, we just reduce the
price of storage by 50%. And it was exactly
to induce people to say, hey, let's
keep this around and continue to get
value out of it. So that's terrific. I wanted to start out with a
quick question here for you. How did you guys,
when you were first adopting BigQuery, or
first thinking about it, how did you design
the proof of concept? JASON JHO: You want
me to take this one? CHAD JENNINGS: Yeah, why
don't you lead us off? JASON JHO: Sure. For us it was really simple. To Nick's point, getting
started with BigQuery is as easy as opening up
a URL in your web browser and putting in a query. So we actually took a couple
of the most expensive queries that we have, queries that
would actually run for hours, and without much
migration work at all we demonstrated that this thing
can run in less than a couple of minutes. So that was a huge just proof
of concept right off the bat. So yeah, it was really
as simple as that. CHAD JENNINGS: And,
Nikhil, you had a very different, like orders
of magnitude, proof of concept to design. How did you pick that? How did you design it? NIKHIL MISHRA: So for us, yes,
it took us time to get started. So we had to think
through a lot of stuff. And the way we
decided was we didn't want to go to a specific query
model, or benchmark something-- an industry-standard benchmark-- or one of those things. We essentially wanted to
replicate a real life use case. What that required was to
actually move our real data out there. And that requires all the kind
of prep that I talked about. Go through the legal review. That's a very important step. And we started off with that. The network bandwidth
was something that we actually did not start
off looking at initially when we did it. We actually ended up choking our
pipe between Yahoo and Google by moving a massive
amount of data. We brought down some of
our internal systems. And that's where
the consideration has become so important
for everybody at Yahoo. And it was something new. And we assumed a lot of stuff. And we were wrong. And as we started off
migrating the data, we also were not aware-- and I believe Avro
is still beta. I don't know. But at that point
we were not aware that Avro is natively supported,
which is a huge deal for us. So when we got those pieces
together what we did was we really picked up
our real life use case. And we were using another
third party hosted cloud. And we had something
going on there. And we just picked it
up and moved it here and moved some
additional data sets to just try out Google's
claim of petabyte scale. And it worked out pretty well. And we replicated the use case. And as I said, without
telling our users we flipped the switch. They didn't even know that they
were querying another database. We used Looker as our BI tool. And Looker had pretty
interesting integration with legacy SQL. The legacy Google SQL,
which is not on single plan. Which I love, by the way,
it is pretty interesting. It's fun to use. So we just hooked it up. We created exactly-- all we had
to do was flip the connection, reuse the same
model, same stuff, hand it over to set up
users, and just monitor them. And that's what we did. And that allowed us, and
that gave us the confidence that this is something that
we could possibly work with. CHAD JENNINGS: Yeah, you
were in a fortunate position where you had a cloud
workflow running, and you were able to
switch that one over and do a direct compare. I wanted to ask Nick
a question here. So BigQuery, if you're thinking
about just doing lift and shift from a Teradata instance
or something like that, BigQuery is not optimized
for a star schema, right? BigQuery wants the
tables flattened. So there are definitely some--
let me not sugarcoat that. There was a session yesterday
on appropriate data models, so we're not going to go
into that in detail here. But I did want to ask Nick-- when you do that shift there's
an opportunity for you to, A, endure some heartache and
do some work, like redesign-- NICK URSA: We did,
and it was heartache. CHAD JENNINGS: So
the question here is, how much schema
migration do you want to do? The analogy is when you do
remodeling at your home, you pull the plasterboard off
and you're like, oh, geez. Do I do the electrical as well? So how far do you take that? NICK URSA: I don't want
to say that you can't do star schema in BigQuery,
especially if your dimension tables are small. They're going to like-- CHAD JENNINGS: You can,
it's just not optimal. NICK URSA: Right. Right. Well, the thing is-- maybe other places have
this, too, where there's a clickstream data flow. And clickstream dataflows
are high volume, wide tables. Often you're only taking
a subset of the data drop. And then you have like a
bunch of data warehouses with smaller tables. And we're at the point
that those two worlds are merging inside BigQuery. So we used to do a lot of
stamping in applied values, because we knew we
couldn't join it later. I think that we were a little
more conservative in BigQuery, because we saw that
we could actually get good join performance in
some of the smaller tables. But if anything, we
had to scale back on how much we denormalized. We had denormalized as if we
were going to run on Hadoop and never do a
join in the world. And we found BigQuery actually
had better join performance coming from a Hadoop workflow. But when you revisit
things it's a chance to do those things that you
kind of never had the chance to. So we did, I think, more
important than the performance was just a data governance exercise of-- what are these fields? Are we still using them? Do they mean what we think they mean? And although it delayed
our rollout by a month, it's important that everybody
who sees a column name and makes a logical assumption
of what that thing means-- that they're right. Because I think worse than slow
queries, worse than anything is getting two
different answers. So I think it's more
of the data governance that drove the schema
redesign than any performance consideration. CHAD JENNINGS: Interesting. I'll also invite
the audience here. So I have a couple
more questions, but I'm going to run out
before the session is over. So I'd like to
invite the audience to come up and ask
questions of us as well. Please, use the microphone
so that it's on the record. I'll give you a second. I did want to ask Nikhil,
what's been the biggest impact of migrating to
BigQuery and having BigQuery be that like fast store at
the end of your block diagram? NIKHIL MISHRA:
It's productivity. There was one of the PMs in
Yahoo, a pretty smart guy, he actually put a sheet together
and handed it over to me after we did this migration. And he had a dollar
amount of how much we saved improving
the productivity of different product managers. And he took himself as a base,
which is not the right thing to do, because he did all
kinds of crazy steps there. But I think that
to me is a big win. Initially, even though we had
all the data easily available and accessible to
the users, users knew that that's where the
data is, and I can query it, there was a fear. There was a barrier. And the barrier was they
didn't know whether they would run the query and
whether they'll be able to get the
answer back, right? Depending on how much
data they're querying. Now, what's the load on the
system at that point in time? And by nature, you know, it was
more suited for batch loads. And it kind of, you know,
with concurrent users, we just ended up waiting way longer than we would anticipate when a single user or a few users are running the queries. So I think that was a big win. When users don't think
twice before just logging into the system and
running their queries, that's a huge win. CHAD JENNINGS: Fantastic. NICK URSA: I just
wanted to add to that. Which is that it's not
only an efficiency thing, it's an innovation thing. Because you know what it's
like when you try something out and it's five minutes. You lose your train of thought. Imagine if your IDE didn't
respond in five minutes. But that's what it's
like for an analyst when they run in BigQuery. And so when you're trying
to refine a business model or you're trying to search
for the cause of a problem, the ability to
retain your train of thought as you query
is hugely valuable. And people like
Justin are literally having so much more fun now. CHAD JENNINGS: That's
a wonderful story. Yes, sir. Please. AUDIENCE: I wanted to ask you,
do you have any scenario where the data need to be updated,
and how frequent are the data need to be updated? How you solve it? And does that affect
your schema design? NICK URSA: I mean,
I might want to talk to you about what it is. When it comes to, like, logs
and aggregations, append only, obviously those don't
need to be updated. So let's look at the categories. Large things that
need to be updated, small things that
need to be updated, small things that don't
need to be updated-- well, let's not even talk about that-- and large things that
don't need to be updated. So the last category
is very easy. Those are just append only. Small things that
need to be updated, we have relational
databases for that. So for me, the solution
to that is just replicate, replicate, replicate. And there's a company called
Alooma, which can even read MySQL binlog files
directly and move them into BigQuery for you. And so the tricky one
is the first column. Things that need to be
updated that are large. This is the use case that
Bigtable was designed for. When you're in that
category you maybe need to look at different
storage options. You know, Spanner may actually
be a good choice for you if you have large scale. But a lot of people can
get by on a 256 gigabyte SSD-backed PostgreSQL instance. The 9.6 is amazing. SSDs are amazing. And a lot of RAM solves
a lot of your problems. CHAD JENNINGS: If I may, I'll
just add one thought to that. Nick mentioned Bigtable. I can't say what it
is now, but there's a talk at 5:20 that
has a lot to do with how Bigtable and
BigQuery might possibly, maybe possibly work together. NICK URSA: Hint, hint. CHAD JENNINGS: Yeah, no kidding. Hint, hint. So to take Nick's point, if you
have to do a lot of streaming, then you can load it
into BigQuery query, and then you can
export it from there. 5:20. JASON JHO: Yeah, so
we actually do both. So we take consistent
transactional snapshots of our relational data. And we batch load them into
BigQuery several times a day. But we also have
event data that's coming in through it like
an append only fashion, streaming that
directly into BigQuery. So I think depending
on your needs, for us we have the benefit of
taking our relational data, sort of loading them
into a big history table. But we also have sort of
real-time event data coming in, as well. So it really serves well
for both of those use cases. CHAD JENNINGS: Yes, sir? AUDIENCE: My question is
about separation of the data. So do you have role-
based security implemented in your systems, guys, somehow? CHAD JENNINGS: Yeah, actually. Thank you for
bringing this one up. So the GCP security model is
based on GCP's IAM-- Identity and Access Management. And so you can get granularity down to row- and column-level access by creating views and then giving particular sets of people access to that view.
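As a sketch of that approach, you can define a view over the sensitive table and then grant a group access only to the view's dataset; all names and the filter below are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A view that exposes only the rows and columns one group should see.
# Project, datasets, table, and filter are made up for illustration.
view = bigquery.Table("my-project.reporting.us_orders_view")
view.view_query = """
    SELECT order_id, order_ts, total
    FROM `my-project.warehouse.orders`
    WHERE country = 'US'
"""
client.create_table(view)
# Then grant the group read access to the `reporting` dataset, and
# authorize the view on the source dataset, via IAM / dataset ACLs.
```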
AUDIENCE: OK. Thank you. CHAD JENNINGS: Yes, sir? AUDIENCE: I'm just curious how you guys control costs by throwing Looker
on top of BigQuery and releasing it
out in the wild. JASON JHO: I think Looker comes
with a dry run functionality. You can actually estimate the number of bytes scanned, since that's how you actually get charged.
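A dry run can also be issued directly with the Python client to estimate bytes scanned before paying for a query; the query and project here are made up.

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and reports how many bytes it would
# scan, without running it or incurring any cost.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT * FROM `my-project.analytics.events` WHERE user_id = 'abc'",  # hypothetical
    job_config=job_config,
)
print("Would scan {:.2f} GB".format(job.total_bytes_processed / 1e9))
```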
So that's one thing that we do. We also have budget calculators. And I think Google actually has a tool that allows you to cap-- CHAD JENNINGS: Yeah. There's a form that you
can fill out online. Apologies, it's a bit clunky. But you can fill out
the form and say, hey, I want to cap spend on
X project at Y dollars. It's basically quota control. AUDIENCE: Thanks. NICK URSA: By far, the largest
cost is mistaken queries, not like the normal ones. CHAD JENNINGS: Yes, sir? AUDIENCE: First of
all, I appreciate you said that migrating
star schemas is not easy. You didn't sugarcoat it. So the reality is
that there are a lot of legacy star schemas,
Teradata, Netezza, or whatever, right? I'm just curious
if any of you guys had the experience,
how you went about it. Did you just try
to break it down? Chunk it down? What stays as star schema? What are some things which
you can refactor, redesign? What are some
effective strategies? I'm just curious on that. CHAD JENNINGS: Thank you. JASON JHO: So I might not be
the best person to answer this, but we just threw the same
data model into BigQuery because it joins. So that actually
allowed us to buy time to do that, get the benefits
of the query performance, and in parallel
think about, well, how do I denormalize
and flatten this data so that we can really optimize? So I think just the
fact that it supports standard SQL and big
joins is a huge win, because it allows you to
be effective immediately and giving you time to really
think about the optimal data schema. AUDIENCE: Because some
of the other sessions they talked about
using the nested repeated features,
but that requires a rewrite in SQL, to your point. You've got to train all
the analysts to un-nest, which is an investment. That's what happened. CHAD JENNINGS: Yeah. I might I might add also-- so the session on
the data models is the best place
to go for that. The video actually should
be available today. But I will say,
in the experience that I've had helping
people specifically from Teradata migrate, the lift
and shift, with the expectation that it's going to
work just like it did before is a fool's errand. That's just not going to
work, and we found that out the hard way. Because we tried it
once, and it was like, oh, that was not a good idea. And so I can't give you
too much advice here without like sitting down
with a piece of paper, but it definitely
takes some doing. We found that the partner tools
have been very useful here. Although, partner
tools, it's basically like they'll give you
a pair of scissors and invite you to go running. They're not going to give
you advice about what shape to put your data in. They're going to let you code
that up any way you want. So it's a useful
tool, but you still need to do that design work. Do reference that other
talk from yesterday. AUDIENCE: All right. Thank you. CHAD JENNINGS: Yes, sir? In the blue shirt, please. AUDIENCE: Go first. [INTERPOSING VOICES] CHAD JENNINGS: All right,
you guys can fight it out. AUDIENCE: I was just wondering
if the custom partitioning was going to come in
the near future. CHAD JENNINGS: So
BigQuery's partitioning is not as full featured
as some others. And yeah, so custom partitioning
based on any user's hash is definitely on our roadmap. I really can't speculate. I know where it
is on the roadmap. I can't share it here
at a public forum. But let me say it this way. So BigQuery is well aware
that as we market ourselves as an enterprise data warehouse,
and as we talk about things like Teradata, and
we talk about Oracle, and moving from
those things, that we need to answer use cases
with features that people are accustomed to having. And so the team has
that really very much in the front of their mind. You guys have heard
this story before. When we look at features
that we're prioritizing, I put them into 2 buckets. It's table stakes and
market differentiators. And things like hashed
partitioning are table stakes. We have to do that
to be really useful. And so those things
tend to get prioritized when we're doing our work. AUDIENCE: OK. CHAD JENNINGS: Thank you. AUDIENCE: Hello. Hi. I'm Tiger, coming from Seattle. First I want to
say, this session is a very solid session. It's the best I have ever
taken since yesterday. It's full of good practice
and good knowledge. Thank you so much. CHAD JENNINGS: It's the
best one since yesterday? AUDIENCE: Yeah. CHAD JENNINGS: That's awesome. [LAUGHING] NICK URSA: What was yesterday? CHAD JENNINGS: Shh, shh,
just take the compliment. AUDIENCE: My question is as
the artificial intelligence and machine learning
is in fashion, do you guys think you
want to integrate it with artificial intelligence
somehow in with your data warehouse solution? Or any practice already? NICK URSA: No. [LAUGHING] CHAD JENNINGS: OK. Give me. NICK URSA: So most ML
starts with clean data. I guess they can write
clustering functions or whatever but the
computational requirements of most machine
learning algorithms are not appropriate for the
thing that BigQuery's based on. BigQuery is based on
being able to subdivide your data tremendously. But a lot of machine learning
is about global optimums, and there's so much
shuffling in between. So who's doing this? I'll tell you. Spotify is now using a lot of
BigQuery as the source data, because good ML starts
with clean data. And having a nice
repository that you can pull training
data from quickly and know that it's right-- and then the algo is
like icing on the cake. CHAD JENNINGS: Yeah. We're getting the signal that
we're out of time here. So let me just add
a step to that. And if there are folks
with questions remaining, I think the speakers will be
available a little bit after. But to talk about
integrating BigQuery with machine learning,
that is something that is very much top of mind
in that market differentiator column. So it's prioritized often behind
things like hash partitioning. But you can use it now. It's a bit clunky. But you can export data out
of BigQuery, store it as CSV in Google Cloud Storage, and then use that as input to a Cloud ML call or something like that.
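A minimal sketch of that export step with the Python client; the table and bucket are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Export a table to CSV files in GCS so they can feed a training job.
# The table and bucket are hypothetical.
extract_job = client.extract_table(
    "my-project.analytics.training_examples",
    "gs://my-bucket/ml/training_examples-*.csv",
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()
```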
But we definitely have it top of mind to smooth out that workflow. All right. Folks, thank you so
much for your time. To my co-presenters. [APPLAUSE] [MUSIC PLAYING]