[MUSIC PLAYING] RYAN MCDOWELL: Hi, everyone. I'm Ryan McDowell, a cloud
data engineer with Google. Joining me a little
bit later today is going to be Alban from Teads. And today, we're going to be
talking about data warehousing with BigQuery best practices. So if you're coming from your
traditional on-premise data warehouse system or
even another cloud, you may be familiar with some
of the challenges listed here. When I talk to customers,
these are very top of mind. They can't keep up
with data growth. They don't have scalability
within their current system. They are spending tons of
time with teams of people just to manage their
data warehouse. They have expensive
license renewals. And upgrades for them
require downtime. In fact, some customers
that I talk to aren't even using their data
warehouse as a data warehouse anymore. They're offloading data
from their data warehouse into Hadoop and other
places, in order to perform the type of
analysis that they require in their business needs. And this is really because
their data warehouse system is no longer fitting
the needs of the business. When we look at why many
customers have decided to move their data
warehouse to BigQuery, it's because
BigQuery solves many of the issues with your
traditional data warehouse system. With BigQuery, you're able
to automate data delivery with BigQuery data
transfer service. You're able to share queries
among users and people within your organization. You can set the foundations
for AI and real time analytics and significantly
reduce the amount of effort required to
maintain your data warehouse. But one of the major advantages
over the traditional data warehouse systems is
BigQuery's architecture. Google BigQuery is fully
managed and serverless. Storage and compute are
decoupled from each other, and they scale
independently and on-demand. This offers immense
flexibility and cost control for your business. So you're not keeping expensive
compute clusters running 24/7. This is also very different
than the traditional node-based cloud data warehouse, or
even on-premise MPP systems. Some of the secret
sauce of BigQuery sits in between the
storage and compute layers in its in-memory shuffle. BigQuery uses remote
memory shuffle to gain unprecedented query
speed over big data sets. This enables you to query
large amounts of data and get your results very fast. Each query that you
execute within BigQuery is split up into a
sequence of stages. In each stage, workers read data from shuffle, perform some operation on it, and then write the results back out to shuffle. If a worker was to, let's say,
go down because of a hardware failure or something like that
during your query processing, you wouldn't notice it
because what would happen is another worker would spin
up and then start reading from the previous point that
the previous worker was reading from in shuffle. This offers you great
resilience to downtime. And it enables you to
continually perform your analytics
without interruption. And then finally when a query
is complete within BigQuery, after this reads from shuffle
and writes back to shuffle and we're sort of done with
the entire query processing, the results are written
out to persistent storage and then returned
back to the user. And this persistent
storage enables you to-- essentially, if you were to run the same query over again, you get the cached results of the previous query. So we're able to serve
up consistent results to the users over time. And speaking of storage, one of
the major benefits of BigQuery is its managed storage. BigQuery offers durable
and persistent storage for your data warehouse. And this helps you dramatically
reduce your data operations. Gone are the days that your DBAs
are calling you up on the phone and saying, hey, we need
you to delete data out of the data warehouse. We're getting too big. That was exactly the scenario
in my previous company where it was a regular basis the
DBA would call us up and tell us to start deleting data. And we didn't want to delete
data out of our data warehouse, but we had resource
constraints and we had to. And so with BigQuery,
you're able to ingest petabytes of data and basically
throw as much at it as you can. And when you throw
data into BigQuery, all data is compressed
and encrypted by default. This means that
your data is secure, once you migrate to BigQuery
and it's in the cloud. And not only is it
encrypted, but when you save a record
to BigQuery, it's not that the record ends up physically on a single disk somewhere. Those records are split up into tiny data blocks and smeared across thousands of disks within our data centers. And each individual data block is encrypted with a different encryption key. Which means if I went to
the Google data center and just pulled
out a disk drive, I couldn't read your data, because
it's smeared across everything. And it also is encrypted
at the data block level. And you also don't have to worry
about data replication in DR. Because in BigQuery,
when you are running in a regional configuration,
your data is in multiple zones within a region. When you're running in a
multi-regional configuration, your data is spread across
multiple geographically distinct regions. And so you don't have
to worry about downtime from just a region
or zone going down, because you will
have that durability and consistency across regions. Now, the first thing
that you'll probably end up doing with
BigQuery, whether it be in a pre-prod environment
or sandbox environment, is loading data. With BigQuery, batch ingest is free. It doesn't consume
query capacity. And so this is huge,
coming from a lot of the traditional
on-premise systems where you have to carve
out separate resources just to deal with batch loading data. That doesn't exist
within BigQuery. You can load up data
using the communal slots available to you, and load
it without having any impact on your query capacity. When you're loading data,
it has ACID semantics. This means that I can
replace an entire table that may be terabytes in
volume, and the user would never know that that
operation is happening. Because the operation is
only exposed to the user once it's fully complete. Or if it fails, it's
never exposed to the user. So you don't have
these situations where you have to
worry about, oh, I have users querying the data. I can't load at this time. And I don't have
situations where I have partial data
loaded, and so people are getting bad results when
they're doing their reporting. Finally, you can,
like I said earlier, load petabytes of data per day. There's no real limit
on the amount of data that you can load into BigQuery. And so a lot of times when
I see customers move over to BigQuery, they may be
moving over with, let's say, 100 terabytes. But then they end up
using substantially more than that on BigQuery
because of pent up demand within their organization. And they're able to just scale
as necessary to deal with that. And then finally, there's a
streaming API for real time. So if I needed to ingest real
time records or near real time, I can stream thousands
to hundreds of thousands of records into BigQuery,
without any impact to my query performance. Which actually brings us
to our first best practice. So BigQuery supports
a wide variety of file formats
for data ingestion. Some are going to be
naturally faster than others. When optimizing for
load speed, prefer using the Avro file format. Avro is a binary, row-based format, which enables us to split it and then read it in parallel with multiple workers. In fact, in a recent test with a customer, we found Avro with deflate compression to be an order of magnitude faster than the other methods of loading data into BigQuery.
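For example, here's a minimal sketch of a batch load expressed in SQL with the LOAD DATA statement; the dataset, table, and GCS path are illustrative, and the same load can be done with the bq command-line tool or the load API.

```sql
-- Minimal sketch: batch load Avro files from GCS into a BigQuery table.
-- The dataset, table, and bucket names here are illustrative.
LOAD DATA INTO mydataset.events
FROM FILES (
  format = 'AVRO',
  uris = ['gs://my-bucket/events/*.avro']
);
```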
Now, Parquet and ORC are also binary-based formats, and they're relatively fast for loading data. When I'm talking to
customers, they often think that since Parquet and
ORC are columnar formats, and I'm loading into
managed storage, which is also columnar, that should
be the fastest way to load in. But when we're
loading data, we're actually taking the
entirety of the record. So we're not taking advantage
of any of the columnar format. And there's a lot
of random reads. So loading data via
Parquet and ORC is fast, but not going to be as fast
as loading data from Avro. And then finally
here at the bottom, we have CSV and JSON compressed. Loading data from
compressed files, specifically CSV
and JSON, is going to be slower than loading
data in any other format. And the reason being is because,
since the compression of Gzip is not splittable, we
have to take that file, load it onto a slot
within BigQuery, and then do the decompression,
and then finally parallelize the load afterwards. So there's going to
be more overhead, and it'll be slower than loading
data from these other formats. And after I loaded
data into BigQuery, I'm probably thinking
about, how am I going to do transformations
within my system? Should I be doing ELT or ETL? What is the best practice there? So in general, you want
to prefer ELT over ETL where possible. BigQuery is very
scalable and can handle large transformations
on a ton of data. So generally, when
customers think about doing ELT versus
ETL, ELT is where they're going to start within BigQuery. It's also quite a bit simpler,
because you could just write some SQL queries,
transform some data, and then move data
around between tables, and not have to worry about
managing a separate ETL application. And then as you're loading
data in, what you may find is that, well, I have
a CSV file maybe. And the date isn't really
in a standard format. How do I get this into the
data type within BigQuery? Well, you can have separate
scripts or ETL processes which process that and
pre-process it before loading it into BigQuery. But you can also leverage federated queries over GCS to transform that data and load it in a single step.
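As a rough sketch, this could look like an external table over the CSV files plus an INSERT...SELECT that parses the date during the load. The table names, bucket path, and date format below are assumptions for illustration.

```sql
-- Hypothetical federated (external) table over CSV files sitting in GCS.
CREATE EXTERNAL TABLE IF NOT EXISTS staging.sales_ext (
  sale_id STRING,
  sale_date STRING,   -- non-standard date, e.g. '03/15/2019'
  amount NUMERIC
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/sales/*.csv'],
  skip_leading_rows = 1
);

-- Transform and load in a single step (this consumes query capacity).
INSERT INTO warehouse.sales (sale_id, sale_date, amount)
SELECT
  sale_id,
  PARSE_DATE('%m/%d/%Y', sale_date) AS sale_date,
  amount
FROM staging.sales_ext;
```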
Now, the only downside of this is that it will consume your query capacity. But it enables you
to keep everything in SQL, if you just have a bunch
of one off use cases for this. Once you've started
loading data into BigQuery, in general, within
your warehouse, you're going to
want to leverage raw and staging tables before
publishing to reporting tables. The raw table essentially
contains the full daily extract, or a full load of
the data that you're loading. The staging table
then is basically your change data capture table. So you can utilize
queries or DML to merge that data
into your staging table and have a full history of all
the data that's being inserted. And then finally,
your reporting tables are going to be the ones that
you publish out to your users. And your users are
going to execute queries against these tables. And they may be dimensionalized
in a star schema, or they may be denormalized,
depending on the use case. And that method of
going from staging to reporting using
MERGE statements, that's where you can
perform your SCD Type 2 transformations. So if I have the history of records all within my staging table, I can easily, within my MERGE statement, mark a record as expired and then set the new date for the current record.
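A minimal SCD Type 2 sketch might look like the following. The dimension, staging table, and column names are made up for illustration, and real pipelines often fold the second insert into the MERGE with a UNION ALL on the source.

```sql
-- Expire the current row of any customer whose tracked attribute changed,
-- and insert customers we have never seen before.
MERGE warehouse.dim_customer AS d
USING staging.customer_changes AS s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND d.address != s.address THEN
  UPDATE SET is_current = FALSE, expired_at = CURRENT_DATE()
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, is_current, effective_at, expired_at)
  VALUES (s.customer_id, s.address, TRUE, CURRENT_DATE(), NULL);

-- Insert the new current version for the customers we just expired:
-- they exist in the dimension but no longer have a current row.
INSERT INTO warehouse.dim_customer (customer_id, address, is_current, effective_at, expired_at)
SELECT s.customer_id, s.address, TRUE, CURRENT_DATE(), NULL
FROM staging.customer_changes AS s
WHERE EXISTS (
  SELECT 1 FROM warehouse.dim_customer d
  WHERE d.customer_id = s.customer_id
)
AND NOT EXISTS (
  SELECT 1 FROM warehouse.dim_customer d
  WHERE d.customer_id = s.customer_id AND d.is_current = TRUE
);
```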
And then finally, when you're getting into streaming loads or really complex batch loads, you can leverage Dataflow or Data Fusion to speed up those pipelines. So oftentimes when I
work with customers, they're starting
out with the ELT as their primary
mechanism of loading and transforming
data into BigQuery. But then they may have
some really gnarly batch process that doesn't really
fit into SQL cleanly. And this is where they'll
sort of scale themselves out. They'll start utilizing
Dataflow or Data Fusion to speed up those loads
and transformations and do more complex
activities on that data. And if you're getting
started with streaming, I recommend using the
Google-provided Dataflow templates. So Google provides Dataflow templates for loading data from multiple different places and moving data around. You can find those Dataflow templates within the Dataflow UI. So within the Create Job from Template button, you'll find all these Dataflow templates. And if you find that one mostly fits your use case, but you want to make one
slight modification, all those templates
are also open sourced. So you can go to the
Open Source repo. You can modify the
code to fit your needs, whether it be adding some
windowing in, or just using it as a baseline for
your own pipelines, and then quickly accelerate
your development. And this is sort of like
an end-to-end example of change data capture
coming into BigQuery. So some customers, for example, may have Oracle. And we'll be capturing the changes with GoldenGate. We'll publish those records
to either Pub/Sub or Kafka, to ingest those
records in real time. For doing batch loads, we may
just load that data into GCS as our landing zone, before
loading it into BigQuery. And then we'll utilize Dataflow
or BigQuery batch loads to get that data into our raw tables. Then once the data
is in our raw tables, we can utilize Cloud
Composer to schedule queries within our BigQuery system
to move that data from raw to our change log, then
dimensionalizing for our star schema. And this is a very
standard process that I see a lot of
customers utilizing. Now, once we have a
strategy for loading data, it's time for us to
think about, well, how do we want to
design our schema? In general, I find
that customers will overthink this step upfront. They'll think about like,
well, let's redesign everything for BigQuery. It's a new system. Maybe I need to
denormalize everything. But in general,
the best practice is actually to migrate your
star schema and snowflake schema as is, and then
optimize as necessary. This enables you to
find what may not run as is quickly within BigQuery. And then optimize it after
identifying those gaps. Denormalization is
not a requirement, but it can speed up
slow analytical queries by reducing the amount of
data that we have to shuffle. So to give you an
example, there is one customer that I
was working with who had 77 joins in a query. I was impressed that they
were actually executing a query with 77 joins. And what we found is that,
well, you know, a lot of this did not need to be
a join statement. A lot of this was
like a country field, or it was a date dimension. It was stuff that could
be naturally denormalized into the output table. And so we were able to reduce
the 77 joins that they were doing down to four
joins, which made it a lot more scalable as
their data volume grew. And speaking about
denormalizing, BigQuery has capabilities for
nested and repeated fields. And this enables you to store
semi-structured data very naturally within BigQuery,
and then query over it. Now oftentimes, this is
a very different concept than people used to
row-based architectures where they're not necessarily
storing semi-structured data within their data warehouse. So for leveraging nested
and repeated fields, I generally have
three guidelines. One, leverage them when you have
tightly coupled or immutable relationships. If I have an order and I have
an order line items table, the line items will
never really be used without the order table. And these are naturally
coupled, oftentimes immutable. And I'll denormalize
that into a single table. So as opposed to having an orders table and an order line items table, I can have a single orders table that has a list of line items.
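A minimal sketch of what that could look like, with illustrative table and field names:

```sql
-- Orders with line items nested as a repeated STRUCT,
-- instead of a separate order_line_items table.
CREATE TABLE warehouse.orders (
  order_id STRING,
  order_date DATE,
  customer_id STRING,
  line_items ARRAY<STRUCT<sku STRING, quantity INT64, unit_price NUMERIC>>
);

-- Querying the nested line items is just an UNNEST away.
SELECT o.order_id, li.sku, li.quantity * li.unit_price AS line_total
FROM warehouse.orders AS o, UNNEST(o.line_items) AS li
WHERE o.order_date = '2019-04-09';
```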
A similar way to do that is sessions with events. As opposed to having separate tables for sessions and events, I can have a list of events within each session record in a sessions table. I'll also think about
using this for infrequently changing data-- so country,
region, date dimension. Hopefully, these are
infrequently changing. And so this is stuff that
you can actually just merge into your
reporting table and have it be slightly more
denormalized and easier to query for the users. And then finally,
simplifying queries. A lot of times, queries for
users will be super complex. And just by using nested and repeated fields, you can significantly decrease the complexity of those tables and their usage. Which brings us to our next
best practice-- partitioning. Partitions in
BigQuery physically split your data on disk,
based on ingestion time, or based on a column
within your data. As seen within the diagram,
each column of data is split according
to the partition key. And then for queries that filter on that partition key, as opposed to scanning the entirety of that column and all data within that column, we're able to do partition elimination and only scan the data that's necessary to answer that query. And this provides huge
cost benefits, but also huge performance
benefits, especially on large fact tables. So as a best
practice, whenever you have a fact table
or temporal table, utilize a partition column
on your date dimension. And as your users are using
these tables, oftentimes, what I've found is
users will forget to supply a partition key. So they won't get the benefits
of the performance of partition elimination. BigQuery has an option in which
you can require a partition filter on the table,
which basically means that if I'm a user,
and I execute a query, and I don't specify
an as-of-date filter, and the table is partitioned
by as-of-date, then my query will return back
an error and basically say, I didn't specify a
required partition filter. And this enables you
to avoid inadvertently scanning large amounts of data,
to essentially keep your costs down and avoid errant queries by
users who may not know better, until they're actually presented with that error.
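As an illustration, the DDL for a date-partitioned fact table with a required partition filter might look like this; the table and column names are assumptions.

```sql
-- Fact table partitioned on its date dimension; queries without a filter
-- on transaction_date are rejected instead of scanning every partition.
CREATE TABLE warehouse.fact_sales (
  transaction_date DATE,
  store_id INT64,
  customer_id STRING,
  amount NUMERIC
)
PARTITION BY transaction_date
OPTIONS (require_partition_filter = TRUE);

-- Prunes to a single partition:
SELECT SUM(amount)
FROM warehouse.fact_sales
WHERE transaction_date = '2019-04-09';
```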
Clustering allows you to physically order data within a partition. So you can do clustering
by one or multiple keys. So if I had a
partition of 20180603, I can physically order
the data for a user ID in that partition. And that way when the
query is executing, we're only scanning a very
small subset of that partition. And this can provide
massive performance benefits when used properly. You're going to
want to prefer using clustering on common
filtered columns, especially within
your fact table. So if you had, let's say, a
sales fact table or transaction fact table, you may
want to partition it based on transaction date. But you may want to
cluster it on, let's say, region, or store,
or customer ID. And this will enable you
to have better performing queries over that table. Now, you can only have one
clustering key per table. So it's basically most customers
will look at the queries that they are running
and then determine what is the best
clustering key that will satisfy most queries that
are running against this table. When you're filtering
on clustering columns, you want to filter in the
order that they're specified. So when you specify, let's
say, on that transaction table that I have partitioned
by transaction date and that I'm
clustering by, let's say, region and then
store, if I was just to provide a store
filter to my query, then I'm not taking
advantage of clustering. So you want to order your clustering keys properly, so that they can effectively limit the amount of data scanned per query.
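For example, a sketch of a partitioned and clustered fact table, with a query that filters on the clustering columns in the order they were specified; the names are illustrative.

```sql
-- Partitioned by date, clustered on the most commonly filtered columns,
-- in the order users typically filter on them.
CREATE TABLE warehouse.fact_transactions (
  transaction_date DATE,
  region STRING,
  store_id INT64,
  customer_id STRING,
  amount NUMERIC
)
PARTITION BY transaction_date
CLUSTER BY region, store_id;

-- Benefits from partition pruning and from both clustering columns.
SELECT SUM(amount)
FROM warehouse.fact_transactions
WHERE transaction_date = '2019-04-09'
  AND region = 'EMEA'
  AND store_id = 42;

-- Filtering on store_id alone, without region, gets far less benefit,
-- because region is the leading clustering column.
```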
And then finally, most people will ask me, how do I do clustering on a table that has no date dimension? So let's say I have
a dimensional table, say a product or customer dimension, and I just want to cluster on that customer ID or product ID. Well, you can utilize
natural partitioning on that table and cluster
based on that column. Even though the
partition means nothing, it'll still provide
performance benefits. You can also utilize a fake
date column within your table, and then cluster on
customer ID or product ID. And that'll provide
even better performance because everything is going
to end up on one partition. And then all the data is going
to be ordered within there. And then finally, this is
sort of a workaround today. If you had a use case where
you have a dimensional table and you have maybe an
integer key, in alpha, BigQuery has support for
integer range partitioning. And so if you're interested
in that functionality, you can sign up for that alpha
and begin testing that out. When customers are
coming from systems which have sequence generators,
there are oftentimes questions of, how do I handle
surrogate keys in BigQuery? BigQuery doesn't have
any sequence generator. And their first attempt
at this will often be, well, I'll use ROW_NUMBER(). I'm doing ELT, I'm going to use ROW_NUMBER(). And that's going to give me
my sequence over my table. But that's not a
scalable solution because, when you're using
row number in that way, everything within the
query is ending up on a single partition. And then you're doing
basically a sort of what could be millions or
billions of records. A more scalable
way to do this is to prefer UUIDs in
place of surrogate keys. And then, if you have
the use case where you need a deterministic
surrogate key, you can use a hashing function, such as SHA-256, which enables you to hash over one or multiple business keys, so that a deterministic key is generated every time.
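For instance, with an illustrative staging table and column names:

```sql
-- Random, non-deterministic surrogate key:
SELECT GENERATE_UUID() AS order_sk, o.*
FROM staging.orders AS o;

-- Deterministic surrogate key hashed from one or more business keys:
SELECT
  TO_HEX(SHA256(CONCAT(o.order_number, '|', o.source_system))) AS order_sk,
  o.*
FROM staging.orders AS o;
```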
Now, once I've loaded data into BigQuery and designed a schema, I may be worried about how to manage
workloads once users start using the system. But before we talk about
workload management within BigQuery, it's
important to understand how BigQuery executes queries. BigQuery has a fair scheduler. What this means is that if I was
to execute a query in BigQuery, just a single query,
it has full access to the resources
available to it. So in this case, it's a
project that has 2,000 slots. So that query has full
access to those 2,000 slots. Now, it may not use
those 2,000 slots because the number of slots
that each stage requires is based on the number of
parallel inputs to that stage, but it still has access
to the full 2,000 slots. Now, when I execute two
queries within BigQuery, we subdivide those resources. So as opposed to that one query
having access to 2,000 slots, now we have two queries
that have access to 1,000 slots each. And then so on and so forth,
as you execute more queries, the resources continually
get subdivided. Now, there's a common
question there is, well, how do I prioritize
between queries? Maybe there's some
queries that I care a lot more
about that I don't want to have necessarily
equal resources available to as much as my other queries. So this is where you
can take advantage of BigQuery reservations. BigQuery reservations allow
you to create reservations of slots, assign projects
to those reservations, so you can allocate
more or less resources to certain types of queries. So in this case, I have
a dashboarding project, I have a data science project,
and I have an ETL project. I may want to
allocate more slots to my dashboarding project,
because those dashboards are being utilized by
my C-level execs. And I want to make sure that the
response time is always quick. And I don't really care about my
ETL project or my data science project. They can sort of get
results as needed. And this allows me
to sort of prioritize between the different
workloads that I'm executing within BigQuery. And an important thing
here is that by carving out resources in this way, you're
not carving out data silos. Because idle slots
are seamlessly distributed between
the different nodes within your reservation. So let's say that my
dashboarding project is not using 1,000 slots. That means that any other
project within my system can take advantage of it. These aren't physical
resources that you're sort of segmenting
off and putting away. And if you have low
utilization, then you're just paying for compute
that you're not using. You're able to efficiently
distribute these slots across all these
reservations, but also have that priority so that if a
dashboarding query did come in, it sort of preempts
the other projects and is able to take advantage
of those full 1,000 slots. And this is also where
the separation of storage and compute comes in handy. Because we're segmenting
off compute resources, but we don't have to
think about storage. Storage still exists
somewhere in some project. And any project and any user can
access that storage, provided that they're permissioned to that storage via our standard ACLs on datasets. So this brings us to another best practice: segment your
users based on roles. So as opposed to having
everybody in one project accessing BigQuery, the same project in which you're storing all your data, segment out your
users into different projects based on their role
within your organization. So I may have a separate project
for my data science team, or a department's data science
team, and a separate project for my data analysts. And then if I needed to start
doing that prioritization among the different
groups, I can do that very naturally
just by assigning their projects to different
nodes within the reservation hierarchy. So once I have
workloads executing, I probably want to know
what's going on in my system. BigQuery logs all admin
activity and data access into Stackdriver Logging. So you're able to access
all the logs of what is going on within your system
and do analysis over them. But in most cases,
customers will want to do analysis
over that in BigQuery. And so as a best
practice, we actually say, export your data access
logs back into BigQuery. That way, you can query
over those data access logs, and you can
have those data access logs in a single place. So you could have thousands
of projects using BigQuery. But if you use an
aggregated export of those logs to a single
BigQuery project, then all that data is going
to exist in a single place. All the audit data for your entire organization sits in a single dataset, easily queryable by people who are interested in knowing exactly what's going on.
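For example, a rough sketch of an approximate cost-per-user query over the exported data access logs. The project, dataset, and field paths below follow the older AuditData export schema and should be treated as assumptions to adapt to your own export.

```sql
-- Approximate on-demand cost per user from exported BigQuery audit logs.
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
  SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes)
    / POW(2, 40) * 5.0 AS approx_cost_usd  -- $5 per TB, treating 1 TB as 2^40 bytes
FROM `audit-project.bq_audit.cloudaudit_googleapis_com_data_access_*`
WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY user_email
ORDER BY approx_cost_usd DESC;
```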
And then finally, as a best practice, go the extra mile and visualize your data audit logs in Data Studio. By visualizing your audit logs,
you can analyze spending trends and query volume over time and
break down costs by project, by user, and be proactive about
tracking expensive queries. So I actually have
a short demo here. There we are. So this is an audit
dashboard that I built on top of the audit
logs within my project, right? So using the audit
logs, I'm able to get at slot utilization over time. I'm able to estimate the cost of
the queries that I'm executing. I'm able to then
filter this down by different tables
that are being accessed. I can even filter this
down by hour of day. So I guess I'm active
24/7, sometimes. And you're able to also
filter this down by user ID. So if I wanted to filter
this down and just see the logs that are the
queries that I'm running, I can filter that down. I could see the estimated
cost of all the queries that I've executed. And then I can go
in here, and I could see these are all the
queries that have executed. Maybe potentially,
there's a query in there that is substantially more
expensive than the others. I can then take a look
at that, and analyze it, and see if there's a
way to optimize it. And when you have an
organization of end users, they may not always write
the most efficient queries. So visualizing your
audit logs in this way could enable you to be
proactive about tracking down those expensive queries and then
informing those users of better ways to run them and also
track those costs per user per project. Back to the slides. All right. So that's a little bit about
BigQuery best practices. But now, Alban from Teads
is going to come up on stage and talk a little bit about
how Teads utilized BigQuery within their organization. ALBAN PERILLAT-MERCEROZ:
Thank you, Ryan. Three years ago, I
moved from San Francisco to Montpellier, in the south of France. I joined Teads in Montpellier
and had the privilege to work on a project called
BigQuery POC at the time. Today, I am thrilled to be back
in San Francisco to showcase how BigQuery has grown
from a proof of concept to our main data
warehousing system. I'll start with an overview
of how we use BigQuery. I'll explain why
we chose flat rate and how we manage our workloads,
and I'll finish with the impact that BigQuery has had. But first, a few
words about Teads. Anyone heard about Teads? Not a lot, although
I'm pretty sure you regularly see ads served by
Teads on these news publishers. So we are an ad tech company. We have united the best
publishers in the world into this global media platform. With Teads, advertisers can
reach 1.4 billion people monthly. We serve ads at scale
within a premium context. Now, why BigQuery? This is the kind
of user interface that we build for our
clients, for publishers to monitor their inventory
and monetization, and for advertisers to
manage their campaigns. It's full of dashboards where
you can see plenty of metrics and in which you can
explore multiple dimensions. We want a fast and
interactive experience. And to do that, we need to
collect, store, and process billions of events. Let's take a look at the
sources of these events. To do that, we need an ad break. So what happened here? This ad experience
emitted five events so we can know what's going on. The first one is
Ad Slot Available. That's when there's actually
room for an ad on the web page. Then we have Ad Available. This is the result of a
complex auction system that is running behind
the scenes to select the right ad for the right
person at the right price. Impression is when the
ad is actually displayed. Then User Click
and Video Complete are useful metrics
for advertisers. 10 billion events a day-- that's more than 100,000
events per second that we have to collect, store,
and process within our systems. Another important metric is 10. We are only a team of 10 to give
meaning to all these events. That's why we are more
than happy to delegate to BigQuery the complexity
that comes with that scale. Now, let me tell you
where we come from, what made us choose
BigQuery at Teads, and how it slowly became our
main data warehousing system. So all these events are
first stored into Kafka. Before BigQuery,
our data pipeline was following the principles
of the lambda architecture. On the batch layer, we were storing the events in S3. And then Spark jobs were aggregating these events into Cassandra tables. The speed layer was doing some of these aggregations in real time using Storm. This model was complex. We had to maintain
two code bases. And it did not scale well. The cost of Cassandra was
getting out of control, and we were limited
in the features that we could build upon it. In 2016, we chose BigQuery to
overcome these limitations. We also chose to simplify
the data pipeline with a single data
flow process to load all the events into BigQuery. We also chose an ELT approach,
as opposed to an ETL approach, meaning we first store all
the events into BigQuery, and then use BigQuery only
to process these events. Let's have a look at
this ELT. So Dataflow is loading all these events
into BigQuery using batch mode, and writes one table per
batch roughly every 30 minutes. The events you see in
these tables are unordered. Then the ELT in BigQuery
takes over here. And we do several jobs inside
BigQuery from these tables. I'll give you two examples. The first one is a simple job
that takes all these events and appends them into a
single federated table. And this table is very
important for our data analysts to run exploration queries
or for debugging purposes. The second job I will show
here is the rollup job. This job has two purposes. First, it groups events
into one table per hour, so we know where to look for the events. And the second
role of this job is to aggregate these events so
we can query them much faster. Let's take an example with
five events that we have. So each of these events
has a timestamp, an ad ID, a website ID, and
the name of the event. In this example,
we want to group into three dimensions-- the hour
from the timestamp, the ad ID, and the website ID. And we want to compute
three metrics, which are the number of impressions,
of starts, and clicks. In this example, we've
grouped these five events into two lines. So that's a 2.5x compression ratio. Whereas, in real life, our production job can achieve a 70x compression ratio. Now before BigQuery, to do that
at scale, we had to run complex Spark jobs written in Scala. With BigQuery, it's been simplified into a simple SQL query. As you can see, we are grouping by our dimensions and computing the metrics in the SELECT statement.
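A rough reconstruction of the shape of that rollup query, with only a handful of dimensions and metrics; the table and column names are assumptions.

```sql
-- Rollup sketch: group raw events by hour, ad, and website,
-- and compute a few of the metrics.
SELECT
  TIMESTAMP_TRUNC(event_time, HOUR) AS hour,
  ad_id,
  website_id,
  COUNTIF(event_name = 'impression') AS impressions,
  COUNTIF(event_name = 'start') AS starts,
  COUNTIF(event_name = 'click') AS clicks
FROM raw_events.batch_table
GROUP BY hour, ad_id, website_id;
```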
In production, we have 60 dimensions and 150 metrics. I can tell you the query is much longer, but it's not more complex. It's just as simple as that. In production, this query runs
in 1 minute and 40 seconds, despite having to
group by 60 dimensions and process 100
million rows per batch. And the good thing
is it saturated the equivalent of our 3,000
slots for only eight seconds. It means we have plenty of
room to run multiple queries like this at the same time. Speaking of slots, I'll now
explain why we chose flat rate and how we manage it. The most common pricing
model for BigQuery is the on-demand
pricing where you pay $5 per terabyte
of data you scan. Flat rate is an
alternative pricing model. For $40,000 a month, you
get 2,000 dedicated slots, regardless of the
amount of data you scan. If you're scanning few
enough bytes per month, it makes sense to use
the on-demand pricing. No capacity planning. No commitment. You only pay for what you use.
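(For reference: at $5 per terabyte, $40,000 a month corresponds to roughly 8 petabytes scanned per month, which is the approximate break-even point between on-demand and flat rate.)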
If your bill reaches $40,000 a month, you have two options. You can either reduce your
usage or optimize your queries to reduce your bill,
or you can go flat rate and try to make
the most out of it. At Teads, we chose
the second option. It made us focus
on building value from BigQuery's unlimited
power, rather than on optimization and cost control. It also encouraged other teams to use BigQuery too. In three years, Teads' costs stayed stable, and we only had to increase
our slot capacity once from 2,000 to 3,000. And by the way, a
few days ago Google announced that flat rate is
available from 500 slots. So the entry price is now
about $10,000 a month. One of the challenges of flat
rate is workload management. With flat rate
pricing, you're not counting the bytes scanned by your queries. You're monitoring your slot usage. At some point, you'll
reach saturation of your compute capacity. That's OK. That means you're getting
the most out of BigQuery. But you have to make sure
some of your workloads are not impacted too much by the others. You have to make sure some workloads are not starving the others. Throwing money at the problem
is always the easiest solution. So it helps, but it's not a silver bullet, as you may end up saturating all your slots anyway. And yes, you still
have no guarantee that some of the workloads
are not starving the others. Splitting different workloads
into different projects helps the BigQuery
scheduler give a fair share of slots for each project. This way, a given project cannot
impact the others too much. In this example,
the orange project is using most of the slots. But when the green and
blue projects need some, they can access a
fair share of it. Another way to
isolate workloads is to implement
hierarchical reservations. With the reservation system,
you can assign slots to nodes that are
groups of projects. And you can make the slot
sharing uni-directional. In this example, we've isolated
the most critical projects into the critical node. And they get 1,000
dedicated slots. They can also access the 2,000
slots of the other projects. Whereas, the other
projects cannot have access to the critical one. This makes sure that
critical projects always have a minimum of
resource available. Now that I've showcased
how we use BigQuery, I'll finish by the three major
impacts that it had at Teads. It made us build projects
and features that we were not able to do before. It made data accessible
by everyone easily-- thanks to SQL, which is a game changer when you compare it to Jupyter notebooks, and thanks to a UI with single sign-on and built-in security. And last but not
least, it made us focus on building projects and
not maintaining infrastructure. We use AWS Redshift 2. And to give you an idea, we
spend more time operating clusters of hundreds
of gigabytes on it than we spend time
operating BigQuery with hundreds of terabytes. Key takeaways-- BigQuery gave
us performance and simplicity. It made our data accessible. Flat rate drives usage
with predictable costs, but requires workload management. If you want to know more
about our data pipeline, our engineering blog has
phenomenal articles about it. Honestly, they are great. And I'm not only saying
that because I wrote them. Finally, if like me
you prefer the sunny south of France over the 101 commute, we are hiring. [APPLAUSE] [MUSIC PLAYING]