Data Warehousing With BigQuery: Best Practices (Cloud Next '19)

Captions
[MUSIC PLAYING] RYAN MCDOWELL: Hi, everyone. I'm Ryan McDowell, a cloud data engineer with Google. Joining me a little bit later today is Alban from Teads. And today, we're going to be talking about data warehousing with BigQuery: best practices. So if you're coming from a traditional on-premises data warehouse system, or even another cloud, you may be familiar with some of the challenges listed here. When I talk to customers, these are very top of mind. They can't keep up with data growth. They don't have scalability within their current system. They're spending tons of time with teams of people just to manage their data warehouse. They have expensive license renewals. And upgrades for them require downtime. In fact, some customers that I talk to aren't even using their data warehouse as a data warehouse anymore. They're offloading data from their data warehouse into Hadoop and other places in order to perform the type of analysis that their business requires. And this is really because their data warehouse system is no longer fitting the needs of the business.

When we look at why many customers have decided to move their data warehouse to BigQuery, it's because BigQuery solves many of the issues with the traditional data warehouse system. With BigQuery, you're able to automate data delivery with the BigQuery Data Transfer Service. You're able to share queries among users and people within your organization. You can set the foundations for AI and real-time analytics and significantly reduce the amount of effort required to maintain your data warehouse. But one of the major advantages over traditional data warehouse systems is BigQuery's architecture. BigQuery is fully managed and serverless. Storage and compute are decoupled from each other, and they scale independently and on demand. This offers immense flexibility and cost control for your business, so you're not keeping expensive compute clusters running 24/7. This is also very different from the traditional node-based cloud data warehouse, or even on-premises MPP systems.

Some of the secret sauce of BigQuery sits in between the storage and compute layers, in its in-memory shuffle. BigQuery uses remote memory shuffle to gain unprecedented query speed over big data sets. This enables you to query large amounts of data and get your results very fast. Each query that you execute within BigQuery is split up into a sequence of stages. Each stage is read from shuffle by workers, which perform some operation and then write the results back out to shuffle. If a worker were to go down because of a hardware failure or something like that during your query processing, you wouldn't notice it, because another worker would spin up and start reading from the point in shuffle where the previous worker left off. This offers you great resilience to downtime, and it enables you to continually perform your analytics without interruption. And then finally, when a query is complete within BigQuery, after all the reads from and writes back to shuffle are done and the entire query processing is finished, the results are written out to persistent storage and returned to the user. This persistent storage means that if you were to run the same query over again, you get the cached results of the previous query. So we're able to serve up consistent results to users over time.
And speaking of storage, one of the major benefits of BigQuery is its managed storage. BigQuery offers durable and persistent storage for your data warehouse, and this helps you dramatically reduce your data operations. Gone are the days when your DBAs are calling you up on the phone and saying, hey, we need you to delete data out of the data warehouse, we're getting too big. That was exactly the scenario at my previous company, where on a regular basis the DBA would call us up and tell us to start deleting data. We didn't want to delete data out of our data warehouse, but we had resource constraints, so we had to. With BigQuery, you're able to ingest petabytes of data and basically throw as much at it as you can.

And when you throw data into BigQuery, all data is compressed and encrypted by default. This means that your data is secure once you migrate to BigQuery and it's in the cloud. And not only is it encrypted, but when you save a record to BigQuery, it's not that that record ends up physically on a disk somewhere. Those records are split up into tiny data blocks and smeared across thousands of disks within our data centers. And each individual data block is encrypted with a different encryption key. Which means that if I went to a Google data center and just pulled out a disk drive, I couldn't read your data, because it's smeared across everything, and it's also encrypted at the data block level. You also don't have to worry about data replication and DR. Because in BigQuery, when you are running in a regional configuration, your data is in multiple zones within a region. When you're running in a multi-regional configuration, your data is spread across multiple geographically distinct regions. So you don't have to worry about downtime from a region or zone going down, because you have that durability and consistency across regions.

Now, the first thing that you'll probably end up doing with BigQuery, whether it be in a pre-prod environment or a sandbox environment, is loading data. With BigQuery, batch ingest is free. It doesn't consume query capacity. And this is huge coming from a lot of the traditional on-premises systems, where you have to carve out separate resources just to deal with batch loading data. That doesn't exist within BigQuery. You can load data using the communal slots available to you, without having any impact on your query capacity. When you're loading data, it has ACID semantics. This means that I can replace an entire table that may be terabytes in volume, and the user would never know that the operation is happening, because the operation is only exposed to the user once it's fully complete. Or if it fails, it's never exposed to the user. So you don't have these situations where you have to worry about, oh, I have users querying the data, I can't load at this time. And I don't have situations where I have partial data loaded, and people are getting bad results when they're doing their reporting. Finally, like I said earlier, you can load petabytes of data per day. There's no real limit on the amount of data that you can load into BigQuery. And so a lot of times when I see customers move over to BigQuery, they may be moving over with, let's say, 100 terabytes. But then they end up using substantially more than that on BigQuery because of pent-up demand within their organization. And they're able to just scale as necessary to deal with that. And then finally, there's a streaming API for real time.
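As a minimal sketch of the atomic load behavior described above, assuming hypothetical project, dataset, and column names: a reporting table can be replaced in one statement, and readers see either the old contents or the new ones, never a partial load.

```sql
-- Sketch only; all names are hypothetical. Readers querying daily_sales see
-- either the previous version or the new one once the statement completes,
-- never a partially loaded state.
CREATE OR REPLACE TABLE `my_project.reporting.daily_sales` AS
SELECT
  sale_date,
  store_id,
  SUM(amount) AS total_amount
FROM `my_project.staging.daily_sales_staging`
GROUP BY sale_date, store_id;
```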
So if I need to ingest real-time or near-real-time records, I can stream thousands to hundreds of thousands of records into BigQuery without any impact on my query performance. Which actually brings us to our first best practice. BigQuery supports a wide variety of file formats for data ingestion, and some are going to be naturally faster than others. When optimizing for load speed, prefer the Avro file format. Avro is a binary, row-based format, which enables us to split it and then read it in parallel with multiple workers. In fact, in a recent test with a customer, we found Avro with DEFLATE compression to be an order of magnitude faster than the other methods of loading data into BigQuery. Now, Parquet and ORC are also binary formats, and they're relatively fast for loading data. When I'm talking to customers, they often think that since Parquet and ORC are columnar formats, and I'm loading into managed storage, which is also columnar, that should be the fastest way to load. But when we're loading data, we're actually taking the entirety of each record, so we're not taking advantage of the columnar format, and there are a lot of random reads. So loading data via Parquet and ORC is fast, but not as fast as loading data from Avro. And then finally, here at the bottom, we have compressed CSV and JSON. Loading data from compressed files, specifically CSV and JSON, is going to be slower than loading data in any other format. The reason is that, since gzip compression is not splittable, we have to take that file, load it onto a slot within BigQuery, do the decompression, and then finally parallelize the load afterwards. So there's more overhead, and it'll be slower than loading data from these other formats.

And after I've loaded data into BigQuery, I'm probably thinking about how I'm going to do transformations within my system. Should I be doing ELT or ETL? What is the best practice there? In general, you want to prefer ELT over ETL where possible. BigQuery is very scalable and can handle large transformations on a ton of data. So generally, when customers think about doing ELT versus ETL, ELT is where they're going to start within BigQuery. It's also quite a bit simpler, because you can just write some SQL queries, transform some data, move data around between tables, and not have to worry about managing a separate ETL application. And then as you're loading data in, what you may find is that, well, I have a CSV file, maybe, and the date isn't really in a standard format. How do I get this into the right data type within BigQuery? Well, you can have separate scripts or ETL processes which pre-process that data before loading it into BigQuery. But you can also leverage federated queries onto GCS to transform that data and load it in a single step. Now, the only downside is that this will consume your query capacity. But it enables you to keep everything in SQL if you just have a bunch of one-off use cases for this.

Once you've started loading data into BigQuery, in general, within your warehouse, you're going to want to leverage raw and staging tables before publishing to reporting tables. The raw table essentially contains the full daily extract, or a full load of the data that you're loading. The staging table then is basically your change data capture table.
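A hedged sketch of the single-step transform-and-load via a federated query over GCS mentioned above, assuming a hypothetical bucket, dataset, and a CSV date column in DD/MM/YYYY format:

```sql
-- Sketch only; bucket, dataset, and column names are hypothetical.
-- Define an external (federated) table over CSV files sitting in GCS.
CREATE OR REPLACE EXTERNAL TABLE `my_project.staging.orders_ext`
(
  order_id       STRING,
  order_date_raw STRING,   -- e.g. '31/12/2018', a non-standard date format
  amount         NUMERIC
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-landing-bucket/orders/*.csv'],
  skip_leading_rows = 1
);

-- Transform and load into a native raw table in a single SQL step.
-- Note this consumes query capacity, unlike a plain batch load.
INSERT INTO `my_project.raw_dw.orders_raw` (order_id, order_date, amount)
SELECT
  order_id,
  PARSE_DATE('%d/%m/%Y', order_date_raw) AS order_date,
  amount
FROM `my_project.staging.orders_ext`;
```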
So you can utilize queries or DML to merge that data into your staging table and have a full history of all the data that's being inserted. And then finally, your reporting tables are going to be the ones that you publish out to your users. Your users are going to execute queries against these tables, and they may be dimensionalized in a star schema, or they may be denormalized, depending on the use case. And that method of going from staging to reporting using MERGE statements is where you can perform your SCD Type 2 transformations. So if I have the history of records all within my staging table, I can easily, within my MERGE statement, mark a record as expired and then set the new date for the current record.

And then finally, when you're getting into streaming loads or really complex batch loads, you can leverage Dataflow or Data Fusion to speed up those pipelines. So oftentimes when I work with customers, they're starting out with ELT as their primary mechanism of loading and transforming data into BigQuery. But then they may have some really gnarly batch process that doesn't fit into SQL cleanly. And this is where they'll scale themselves out. They'll start utilizing Dataflow or Data Fusion to speed up those loads and transformations and do more complex activities on that data. And if you're getting started with streaming, I recommend using the Google-provided Dataflow templates. Google provides Dataflow templates for loading data from multiple different places and moving data around. You can find those templates within the Dataflow UI, under the Create Job from Template button. And if you find that one mostly fits your use case but you want to make one slight modification, all those templates are also open sourced. So you can go to the open source repo and modify the code to fit your needs, whether it be adding some windowing or just using it as a baseline for your own pipelines, and then quickly accelerate your development.

And this is sort of an end-to-end example of change data capture coming into BigQuery. Some customers, for instance, may have Oracle, and we'll be capturing the changes with GoldenGate. We'll publish those records to either Pub/Sub or Kafka to ingest them in real time. For batch loads, we may just load that data into GCS as our landing zone before loading it into BigQuery. And then we'll utilize Dataflow or BigQuery batch loads to get that data into our raw tables. Once the data is in our raw tables, we can utilize Cloud Composer to schedule queries within BigQuery to move that data from raw to our change log, and then dimensionalize it for our star schema. And this is a very standard process that I see a lot of customers utilizing.

Now, once we have a strategy for loading data, it's time for us to think about how we want to design our schema. In general, I find that customers will overthink this step up front. They'll think, well, let's redesign everything for BigQuery. It's a new system. Maybe I need to denormalize everything. But in general, the best practice is actually to migrate your star schemas and snowflake schemas as is, and then optimize as necessary. This enables you to quickly find what may not run well as is within BigQuery, and then optimize it after identifying those gaps.
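A hedged sketch of the staging-to-reporting MERGE for SCD Type 2 mentioned above, with hypothetical table and column names; a full Type 2 load typically needs a second statement (or a UNION in the USING clause) to insert the new version of the rows that were just expired.

```sql
-- Sketch only; names are hypothetical. Expire the current dimension row when
-- its attributes changed, and insert customers we have never seen before.
MERGE `my_project.reporting.dim_customer` AS d
USING `my_project.staging.customer_changes` AS s
ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHEN MATCHED AND (d.address != s.address OR d.segment != s.segment) THEN
  UPDATE SET is_current = FALSE, valid_to = CURRENT_DATE()
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, segment, valid_from, valid_to, is_current)
  VALUES (s.customer_id, s.address, s.segment,
          CURRENT_DATE(), DATE '9999-12-31', TRUE);
```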
Denormalization is not a requirement, but it can speed up slow analytical queries by reducing the amount of data that we have to shuffle. To give you an example, there was one customer I was working with who had 77 joins in a query. I was impressed that they were actually executing a query with 77 joins. And what we found is that a lot of those did not need to be joins. A lot of it was things like a country field or a date dimension, stuff that could be naturally denormalized into the output table. And so we were able to reduce the 77 joins down to four joins, which made it a lot more scalable as their data volume grew.

And speaking of denormalizing, BigQuery has capabilities for nested and repeated fields. This enables you to store semi-structured data very naturally within BigQuery and then query over it. Now, oftentimes this is a very different concept for people used to row-based architectures, where they're not necessarily storing semi-structured data within their data warehouse. For leveraging nested and repeated fields, I generally have three guidelines. One, leverage them when you have tightly coupled or immutable relationships. If I have an orders table and an order line items table, the line items will never really be used without the order. These are naturally coupled, and oftentimes immutable, so I'll denormalize that into a single table. As opposed to having the orders and order line items tables, I can have an orders table that has a list of line items. A similar case is sessions with events: as opposed to having separate tables for sessions and events, I can have a list of events within each session record in a sessions table. Two, I'll also think about using this for infrequently changing data, so country, region, the date dimension. Hopefully these really are infrequently changing, and so this is stuff that you can just merge into your reporting table and have it be slightly more denormalized and easier to query for the users. And then finally, simplifying queries. A lot of times, queries for users will be super complex, and just by using nested and repeated fields, you can significantly decrease the complexity of those tables and their usage.

Which brings us to our next best practice: partitioning. Partitions in BigQuery physically split your data on disk, based on ingestion time or based on a column within your data. As seen in the diagram, each column of data is split according to the partition key. Then queries over that partition key, as opposed to scanning the entirety of that column and all data within that column, are able to do partition elimination and only scan the data that's necessary to answer the question. And this provides huge cost benefits, but also huge performance benefits, especially on large fact tables. So as a best practice, whenever you have a fact table or a temporal table, utilize a partition column on your date dimension. And as your users are using these tables, oftentimes what I've found is users will forget to supply a filter on the partition column, so they won't get the performance benefits of partition elimination.
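A hedged sketch of the nested and repeated fields guideline above (the orders and line items case), assuming hypothetical names:

```sql
-- Sketch only; names are hypothetical. Line items live inside each order row
-- as a repeated STRUCT instead of in a separate order_line_items table.
CREATE TABLE `my_project.reporting.orders`
(
  order_id    STRING,
  order_date  DATE,
  customer_id STRING,
  line_items  ARRAY<STRUCT<sku STRING, quantity INT64, unit_price NUMERIC>>
);

-- Querying the nested data: UNNEST flattens the repeated field on the fly.
SELECT
  o.order_id,
  li.sku,
  li.quantity * li.unit_price AS line_total
FROM `my_project.reporting.orders` AS o,
  UNNEST(o.line_items) AS li
WHERE o.order_date = '2018-06-03';
```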
BigQuery has an option to require a partition filter on a table. This basically means that if I'm a user and I execute a query without a filter on as-of-date, and the table is partitioned by as-of-date, then my query returns an error saying I didn't specify the required partition filter. This enables you to avoid inadvertently scanning large amounts of data, to essentially keep your costs down and avoid errant queries by users who may not know better until they're actually presented with that error.

Clustering. Clustering allows you to physically order data within a partition, and you can cluster by one or multiple keys. So if I have the partition for 2018-06-03, I can physically order the data by user ID within that partition. That way, when a query executes, we're only scanning a very small subset of that partition. This can provide massive performance benefits when used properly. You're going to want to prefer clustering on commonly filtered columns, especially within your fact tables. So if you had, let's say, a sales or transaction fact table, you may want to partition it on transaction date, but cluster it on, let's say, region, or store, or customer ID. And this will enable you to have better-performing queries over that table. Now, you can only have one set of clustering columns per table, so most customers will look at the queries they're running and determine the clustering keys that will satisfy most of the queries running against that table. When you're filtering on clustering columns, you want to filter in the order in which they're specified. So on that transaction table that I have partitioned by transaction date and clustered by, let's say, region and then store, if I were to provide only a store filter to my query, then I'm not taking advantage of clustering. So you want to order your clustering keys properly so that they can limit the amount of data scanned per query.

And then finally, most people will ask me, how do I do clustering on a table that has no date dimension? Let's say I have a dimension table, product or customer, and I just want to cluster on that customer ID or product ID. Well, you can utilize natural, ingestion-time partitioning on that table and cluster based on that column. Even though the partition means nothing, it'll still provide performance benefits. You can also utilize a fake date column within your table and then cluster on customer ID or product ID. That'll provide even better performance, because everything is going to end up in one partition, and all the data is going to be ordered within it. This is sort of a workaround today. If you have a use case where you have a dimension table with, say, an integer key, BigQuery has support for integer range partitioning in alpha. So if you're interested in that functionality, you can sign up for the alpha and begin testing it out.

When customers are coming from systems that have sequence generators, there are oftentimes questions of, how do I handle surrogate keys in BigQuery? BigQuery doesn't have a sequence generator, and the first attempt at this will often be, well, I'll use ROW_NUMBER. I'm doing ELT, so I'm going to use ROW_NUMBER, and that's going to give me my sequence over my table.
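A hedged sketch pulling together the partitioning and clustering advice above, with hypothetical names: a fact table partitioned on its date column, clustered on commonly filtered columns, and configured to require a partition filter.

```sql
-- Sketch only; names are hypothetical.
CREATE TABLE `my_project.reporting.fact_transactions`
(
  transaction_date DATE,
  region           STRING,
  store_id         STRING,
  customer_id      STRING,
  amount           NUMERIC
)
PARTITION BY transaction_date
CLUSTER BY region, store_id
OPTIONS (require_partition_filter = TRUE);

-- Prunes to a single partition and benefits from clustering because it
-- filters the clustering columns in the order they were declared.
SELECT SUM(amount) AS total_amount
FROM `my_project.reporting.fact_transactions`
WHERE transaction_date = '2018-06-03'
  AND region = 'EMEA'
  AND store_id = 'store_42';
```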
But that's not a scalable solution, because when you're using ROW_NUMBER in that way, everything within the query ends up on a single partition, and you're doing a sort over what could be millions or billions of records. A more scalable approach is to prefer UUIDs in place of surrogate keys. And then, if you have a use case where you need a deterministic surrogate key, you can use a hashing function such as SHA-256, which lets you hash over one or multiple business keys so that you get a deterministic key generated every time.

Now, once I've loaded data into BigQuery and designed a schema, I may be worried about how to manage workloads once users start using the system. But before we talk about workload management within BigQuery, it's important to understand how BigQuery executes queries. BigQuery has a fair scheduler. What this means is that if I execute just a single query in BigQuery, it has full access to the resources available to it. So in this case, it's a project that has 2,000 slots, so that query has full access to those 2,000 slots. Now, it may not use those 2,000 slots, because the number of slots that each stage requires is based on the number of parallel inputs to that stage, but it still has access to the full 2,000 slots. When I execute two queries within BigQuery, we subdivide those resources. So as opposed to one query having access to 2,000 slots, now we have two queries that have access to 1,000 slots each. And so on and so forth: as you execute more queries, the resources continually get subdivided.

Now, a common question here is, how do I prioritize between queries? Maybe there are some queries I care a lot more about, and I don't want them limited to the same share of resources as my other queries. This is where you can take advantage of BigQuery reservations. BigQuery reservations allow you to create reservations of slots and assign projects to those reservations, so you can allocate more or fewer resources to certain types of queries. So in this case, I have a dashboarding project, a data science project, and an ETL project. I may want to allocate more slots to my dashboarding project, because those dashboards are being used by my C-level execs and I want to make sure that the response time is always quick. And I don't care as much about my ETL project or my data science project; they can get results as needed. This allows me to prioritize between the different workloads that I'm executing within BigQuery. And an important thing here is that by carving out resources in this way, you're not carving out data silos, because idle slots are seamlessly distributed between the different nodes within your reservation. So let's say my dashboarding project is not using its 1,000 slots. That means any other project within my system can take advantage of them. These aren't physical resources that you're segmenting off and putting away, where low utilization would mean you're just paying for compute you're not using. You're able to efficiently distribute these slots across all these reservations, but still have that priority, so that if a dashboarding query did come in, it preempts the other projects and is able to take advantage of those full 1,000 slots. And this is also where the separation of storage and compute comes in handy.
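A hedged sketch of the surrogate key advice above, assuming hypothetical column and table names: GENERATE_UUID for a random key, or a SHA-256 hash over the business key(s) when the key must be deterministic.

```sql
-- Sketch only; names are hypothetical.
SELECT
  GENERATE_UUID() AS customer_sk_random,
  TO_HEX(SHA256(CONCAT(source_system, '|', customer_number)))
    AS customer_sk_deterministic,
  source_system,
  customer_number
FROM `my_project.staging.customer_changes`;
```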
Because we're segmenting off compute resources, but we don't have to think about storage. Storage still exists somewhere in some project, and any project and any user can access that storage, provided they're permissioned to it via the standard ACLs on datasets. So this brings us to another best practice: segment your users based on roles. As opposed to having everybody in one project accessing BigQuery, the same project in which you're storing all your data, segment your users into different projects based on their role within your organization. So I may have a separate project for my data science team, or a department's data science team, and a separate project for my data analysts. And then, if I need to start doing prioritization among the different groups, I can do that very naturally just by assigning their projects to different nodes within the reservation hierarchy.

So once I have workloads executing, I probably want to know what's going on in my system. BigQuery logs all admin activity and data access into Stackdriver Logging. So you're able to access all the logs of what is going on within your system and do analysis over them. But in most cases, customers will want to do that analysis in BigQuery. And so as a best practice, we actually say: export your data access logs back into BigQuery. That way, you can query over those data access logs and have them in a single place. You could have thousands of projects using BigQuery, but if you use an aggregated export of those logs to a single BigQuery project, then all that data is going to exist in a single place: all the audit data for your entire organization in a single dataset, easily queryable by people who are interested in knowing exactly what's going on. And then finally, as a best practice, go the extra mile and visualize your audit logs in Data Studio. By visualizing your audit logs, you can analyze spending trends and query volume over time, break down costs by project and by user, and be proactive about tracking expensive queries.

So I actually have a short demo here. There we are. This is an audit dashboard that I built on top of the audit logs within my project. Using the audit logs, I'm able to get at slot utilization over time. I'm able to estimate the cost of the queries that I'm executing. I'm able to then filter this down by the different tables that are being accessed. I can even filter this down by hour of day. So I guess I'm active 24/7, sometimes. And you're able to also filter this down by user ID. So if I wanted to filter this down and just see the queries that I'm running, I can do that. I can see the estimated cost of all the queries that I've executed. And then I can go in here and see all the queries that have executed. Maybe there's a query in there that is substantially more expensive than the others. I can then take a look at that, analyze it, and see if there's a way to optimize it. When you have an organization of end users, they may not always write the most efficient queries. So visualizing your audit logs in this way enables you to be proactive about tracking down those expensive queries, informing those users of better ways to run them, and also tracking those costs per user and per project. Back to the slides. All right. So that's a little bit about BigQuery best practices.
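A hedged sketch of the kind of per-user cost query that can sit behind such a dashboard, assuming the data access logs are exported to a hypothetical audit_logs dataset and that the export uses the classic AuditData schema with its jobCompletedEvent fields:

```sql
-- Sketch only; the export dataset name is hypothetical and the field paths
-- assume the classic AuditData export schema for BigQuery data access logs.
-- Estimates on-demand cost per user at $5 per TB billed.
SELECT
  protopayload_auditlog.authenticationInfo.principalEmail AS user_email,
  COUNT(*) AS completed_query_jobs,
  ROUND(SUM(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes) / POW(2, 40) * 5.0, 2) AS estimated_cost_usd
FROM `my_project.audit_logs.cloudaudit_googleapis_com_data_access_*`
WHERE protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.eventName = 'query_job_completed'
GROUP BY user_email
ORDER BY estimated_cost_usd DESC;
```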
But now, Alban from Teads is going to come up on stage and talk a little bit about how Teads utilizes BigQuery within their organization. ALBAN PERILLAT-MERCEROZ: Thank you, Ryan. Three years ago, I moved from San Francisco to Montpellier, in the south of France. I joined Teads in Montpellier and had the privilege to work on a project called "BigQuery POC" at the time. Today, I am thrilled to be back in San Francisco to showcase how BigQuery has grown from a proof of concept into our main data warehousing system. I'll start with an overview of how we use BigQuery. I'll explain why we chose flat rate and how we manage our workloads, and I'll finish with the impact that BigQuery has had.

But first, a few words about Teads. Anyone heard about Teads? Not a lot, although I'm pretty sure you regularly see ads served by Teads on these news publishers. So we are an ad tech company. We have united the best publishers in the world into a global media platform. With Teads, advertisers can reach 1.4 billion people monthly. We serve ads at scale within a premium context. Now, why BigQuery? This is the kind of user interface that we build for our clients: for publishers, to monitor their inventory and monetization, and for advertisers, to manage their campaigns. It's full of dashboards where you can see plenty of metrics and explore multiple dimensions. We want a fast and interactive experience, and to do that, we need to collect, store, and process billions of events.

Let's take a look at the sources of these events. To do that, we need an ad break. So what happened here? This ad experience emitted five events, so we can know what's going on. The first one is Ad Slot Available; that's when there's actually room for an ad on the web page. Then we have Ad Available. This is the result of a complex auction system that runs behind the scenes to select the right ad for the right person at the right price. Impression is when the ad is actually displayed. Then User Click and Video Complete are useful metrics for advertisers. 10 billion events a day: that's more than 100,000 events per second that we have to collect, store, and process within our systems. Another important number is 10. We are only a team of 10 to give meaning to all these events. That's why we are more than happy to delegate to BigQuery the complexity that comes with that scale.

Now, let me tell you where we come from, what made us choose BigQuery at Teads, and how it slowly became our main data warehousing system. All these events are first stored in Kafka. Before BigQuery, our data pipeline followed the principles of the lambda architecture. On the batch layer, we were storing the events in S3, and then Spark jobs were aggregating these events into Cassandra tables. The speed layer was doing some of these aggregations in real time using Storm. This model was complex. We had to maintain two code bases, and it did not scale well. The cost of Cassandra was getting out of control, and we were limited in the features that we could build upon it. In 2016, we chose BigQuery to overcome these limitations. We also chose to simplify the data pipeline, with a single Dataflow process to load all the events into BigQuery. And we chose an ELT approach, as opposed to an ETL approach, meaning we first store all the events in BigQuery, and then use BigQuery itself to process these events. Let's have a look at this ELT.
So Dataflow loads all these events into BigQuery using batch mode, and writes one table per batch, about every 30 minutes. The events you see in these tables are unordered. Then the ELT in BigQuery takes over, and we run several jobs inside BigQuery from these tables. I'll give you two examples. The first one is a simple job that takes all these events and appends them into a single federated table. This table is very important for our data analysts, to run exploration queries or for debugging purposes. The second job I'll show here is the rollup job. This job has two purposes. First, it groups events into one table per hour, so we know where to look for the events. And the second role of this job is to aggregate these events so we can query them much faster.

Let's take an example with five events. Each of these events has a timestamp, an ad ID, a website ID, and the name of the event. In this example, we want to group on three dimensions: the hour from the timestamp, the ad ID, and the website ID. And we want to compute three metrics, which are the number of impressions, of starts, and of clicks. In this example, we've grouped these five events into two lines, so that's a 2.5x compression ratio, whereas in real life our production rollup job achieves a 70x compression ratio. Now, before BigQuery, to do that at scale, we had to run complex Spark jobs written in Scala. With BigQuery, it's been simplified into a simple SQL query. As you can see, we are grouping by our dimensions and computing the metrics in the SELECT statement. In production, we have 60 dimensions and 150 metrics. I can tell you the query is much longer, but it's not more complex. It's just as simple as that. In production, this query runs in 1 minute and 40 seconds, despite having to group by 60 dimensions and process 100 million rows per batch. And the good thing is it saturates the equivalent of our 3,000 slots for only eight seconds. It means we have plenty of room to run multiple queries like this at the same time.

Speaking of slots, I'll now explain why we chose flat rate and how we manage it. The most common pricing model for BigQuery is on-demand pricing, where you pay $5 per terabyte of data you scan. Flat rate is an alternative pricing model: for $40,000 a month, you get 2,000 dedicated slots, regardless of the amount of data you scan. If you're scanning few enough bytes per month, it makes sense to use on-demand pricing. No capacity planning, no commitment; you only pay for what you use. If your bill reaches $40,000 a month, you have two options. You can either reduce your usage or optimize your queries to reduce your bill, or you can go flat rate and try to make the most out of it. At Teads, we chose the second option. It made us focus on building value from BigQuery's unlimited power, rather than on optimization and cost control. It also encouraged other teams to use BigQuery too. In three years, Teads' costs have stayed stable, and we only had to increase our slot capacity once, from 2,000 to 3,000. And by the way, a few days ago Google announced that flat rate is available from 500 slots, so the entry price is now about $10,000 a month. One of the challenges of flat rate is workload management. With flat-rate pricing, you're not counting the bytes scanned by your queries; you're monitoring your slot usage. At some point, you'll reach saturation of your compute capacity. That's OK. That means you're getting the most out of BigQuery.
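A hedged sketch of the shape of the rollup query described above, with hypothetical table, column, and event names; the real job groups by 60 dimensions and computes 150 metrics in the same pattern.

```sql
-- Sketch only; table, column, and event names are hypothetical.
SELECT
  TIMESTAMP_TRUNC(event_time, HOUR) AS event_hour,
  ad_id,
  website_id,
  COUNTIF(event_name = 'impression') AS impressions,
  COUNTIF(event_name = 'start')      AS starts,
  COUNTIF(event_name = 'user_click') AS clicks
FROM `my_project.raw_events.batch_20190410_1030`
GROUP BY event_hour, ad_id, website_id;
```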
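And as a hedged sketch of monitoring slot usage under flat rate, assuming the INFORMATION_SCHEMA jobs view is available in your region (it postdates parts of this talk), with a hypothetical lookback window:

```sql
-- Sketch only; region and lookback window are assumptions. total_slot_ms
-- summed per hour, divided by the milliseconds in an hour, approximates the
-- average number of slots in use during that hour.
SELECT
  TIMESTAMP_TRUNC(creation_time, HOUR) AS usage_hour,
  SUM(total_slot_ms) / (1000 * 60 * 60) AS approx_avg_slots
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY usage_hour
ORDER BY usage_hour;
```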
But you have to make sure some of your workloads are not impacting the others too much; you have to make sure they are not starving the others. Throwing money at the problem is always the easiest solution, and it helps, but it's not a silver bullet, as you may end up saturating all your slots again, and you still have no guarantee that some of the workloads are not starving the others. Splitting different workloads into different projects helps the BigQuery scheduler give a fair share of slots to each project. This way, a given project cannot impact the others too much. In this example, the orange project is using most of the slots, but when the green and blue projects need some, they can access a fair share of them. Another way to isolate workloads is to implement hierarchical reservations. With the reservations system, you can assign slots to nodes that are groups of projects, and you can make the slot sharing unidirectional. In this example, we've isolated the most critical projects into the critical node, and they get 1,000 dedicated slots. They can also access the 2,000 slots of the other projects, whereas the other projects cannot access the critical node's slots. This makes sure that critical projects always have a minimum of resources available.

Now that I've showcased how we use BigQuery, I'll finish with the three major impacts that it has had at Teads. It made us build projects and features that we were not able to build before. It made data easily accessible by everyone, thanks to SQL, which is a game changer when you compare it to Jupyter notebooks, and thanks to a UI with single sign-on and built-in security. And last but not least, it made us focus on building projects, not on maintaining infrastructure. We use AWS Redshift too, and to give you an idea, we spend more time operating clusters of hundreds of gigabytes on it than we spend operating BigQuery with hundreds of terabytes. Key takeaways: BigQuery gave us performance and simplicity. It made our data accessible. Flat rate drives usage with predictable costs, but it requires workload management. If you want to know more about our data pipeline, our engineering blog has phenomenal articles about it. Honestly, they are great, and I'm not only saying that because I wrote them. Finally, if like me you prefer the sunny south of France over the 101 commute, we are hiring. [APPLAUSE] [MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 85,681
Keywords: type: Conference Talk (Full production); purpose: Educate; pr_pr: Google Cloud Next
Id: ZVgt1-LfWW4
Length: 43min 8sec (2588 seconds)
Published: Wed Apr 10 2019