Modern Data Warehousing with BigQuery (Cloud Next '19)

Captions
[MUSIC PLAYING] JORDAN TIGANI: Good afternoon, everybody. My name is Jordan Tigani. I'm the director of product management for Google BigQuery. I was also one of the first engineers on Google BigQuery, so I know where a lot of the skeletons are hidden in the code-- or at least I still remember some of it.

So a data warehouse is a tool. It's a tool that can be used in a lot of ways, but it's a relatively simple tool. And with simple tools, you can build some really impressive things. This is my house in Seattle, and you can build this house with very simple tools. But if you want to build a skyscraper, you need totally different tools. So what is the difference? It's scale, it's technology, and it's use cases. And a data warehouse is similar. A traditional data warehouse is a fairly simple tool. It's been around for 30, 40 years in pretty much the same format.

How many people here have a data warehouse that they run? So lots of you. Actually, how many of those people have a data warehouse that's not BigQuery? I just want to see if I'm preaching to the choir here. OK. So folks do have lots of data warehouses that aren't BigQuery. But of those people that have data warehouses that aren't BigQuery, how many of you run Hadoop, or Spark, or Impala, or something like that as well? So lots of people are running data analysis tools outside of their data warehouse. So the data warehouse is clearly missing something. And how many of you run some sort of Kafka or streaming analytics? That's a lot-- a bunch of you. And how many of those are integrated with your data warehouse? So not too many. So there's clearly also a real-time component that people want, or people have streaming data that's coming in.

So the way we see data warehousing at Google is that it starts with the data warehouse. As one of the folks from Home Depot said a couple of weeks ago, 99.9% of all the traditional data warehousing stuff is still relevant. So I'm not going to try to say that old traditional data warehousing isn't relevant. It is. But there's something more. There's something more that people want.

And so Google's data warehouse is BigQuery. Hopefully, people have heard of it by now. It's our enterprise data warehouse for analytics. We used to call it petabyte-scale; now we call it exabyte-scale, in case you're tracking these slides over time. And you can run petabyte-scale queries, and I'll show you one of those in a few minutes. Security is super important for us, so all your data is stored encrypted, durably, and available. And I'll get into some of the other properties as I go along.

You don't have to just take my word for it. Forrester named BigQuery a leader in their cloud data warehousing Wave. And we recently had a study out from ESG that said using BigQuery as a traditional data warehouse gives you significant savings over other on-prem data warehouses.

I also want to highlight something that we've announced at Next. Sudhir mentioned it in his talk yesterday. A lot of people that I've been meeting with over the last few days have been saying, we love BigQuery and we want predictable pricing, but the $40,000 a month that we charge for 2,000 slots is out of their price range. And so we're announcing 500 slots for $10,000 per month. Hopefully, there are a lot more people for whom that's in their price range.

So one of the key things for data warehousing when you're moving to the cloud is the separation of storage and compute.
When Home Depot moved to BigQuery, they had 100 terabytes in an on-prem data warehouse. Just recently, they finished their migration to BigQuery, and now they've got tens of petabytes, so huge amounts of data. And it's not something that's new. This data was always coming in, but they were constrained by the environment that they were working in. So when you go to the cloud, you remove a lot of the constraints. You can scale up the storage amounts virtually infinitely. You can scale up the compute amounts to tens or even hundreds of thousands of CPU cores. We've also announced some of our big statistics: we have a single customer with a quarter of an exabyte of data, and we run queries that are over a petabyte relatively frequently.

So this is the retailer that I mentioned. Three years ago at GCP Next, I introduced this petabyte dataset that we had. And I ran a query over this petabyte dataset, and at the beginning of the talk I started it running, and at the end of the talk I said, let's see how it's doing. And it took about four minutes to complete. Then, last summer at GCP Next, I ran that same query again against that same dataset, and it was down to a minute and a half, so roughly a 2x performance improvement. But the big difference between those two was actually not the performance difference. It was, if you look to the right, it says that one had to process all 1.09 petabytes. The latter one only had to process half a gigabyte.

So if we could switch over to the demo, please? I'm going to try running this query again, and let's see how we do. The difference between the half a gigabyte and the single petabyte was that we enabled clustering. And clustering enables you to find data much more quickly. So there we go. 11.7 seconds to do this scan of a one-petabyte table. I'm hoping that by next time we'll get that down to one second, but-- [APPLAUSE] Could we switch back, please? So that's the challenge for the engineering team, to make me look good next year by getting that to a second.

So the last bit in the traditional data warehouse space, where you're just sort of expanding the traditional data warehouse, is serverless. Nobody wants to manage servers. If you can get somebody else to keep your servers running, then that's great. And BigQuery does automatic patching. There's no downtime. When we launch new versions, we just sort of roll them out seamlessly. This was a quote from last week, actually. One of our more colorful Australian customers mentioned that one day, things just started getting much faster. And so he had this to say on Twitter.

So what makes BigQuery BigQuery? What makes it work? It's really the architecture that we build on, the infrastructure that Google has. We really can stand on the shoulders of giants: the extremely fast petabit network, our highly scalable storage systems, and highly scalable compute clusters as well. So it's serverless. You get to focus on what's important to you. You get to focus on actually doing your analytics. You don't have to focus on configuration management, reliability, et cetera.

So next is real-time. And I think real-time is an underappreciated side of data warehousing. Traditionally, you take your operational database and, overnight, you dump that into your data warehouse and you build reports.
And then the next morning, everybody comes in and looks at the reports of what happened yesterday. But people don't want to see what happened yesterday. They want to know what's going on right now. People always talk about how fast data is coming in and how hard it is to process, and more and more of that data is streaming oriented. When you think about it, really, all data is generated one event at a time. In its natural format, it's a stream. So across Google, and across Google Cloud data analytics, we really want to make it possible to keep data in its natural form-- if it's a stream, to keep it as a stream. That's why in Cloud Dataflow, with one line of code, you can switch back and forth between batch and streaming. And in BigQuery, we are investing heavily in streaming analytics.

So one of our customers is Zulily, and they have these daily deals. I think they have something like 100 new products a day. And they want to know how those products are doing throughout the day, because if they have to wait until tomorrow to get their report, then it's too late to do anything about it. So they actually stream into BigQuery, and they send an hourly report that's read by all their executives, and they are able to make changes on the fly. So it's super powerful to be able to make changes on the fly.

So I'm going to show a demo of some high-volume streaming. In the past, people have run into some streaming limits in BigQuery. Let me just start this. We're constantly working on breaking through those limits. And if you think about limits you hit, quotas you hit, one overarching thing is that you can be sure we're working on making those limits not actually hard limits anymore. So we put together this streaming demo using Dataflow. We're using thousands of workers in Dataflow. And here's the Dataflow instance-- you can see how many bytes have been written, and this has only been running for a couple of hours. So that's a pretty significant amount of data. Now I'm going into the Compute Engine instances. We have ten of these running. And let's check out the monitoring. There we go. Check out the monitoring, check out the network bytes. So each one of these ten is doing about two gigabytes per second. So we're streaming at 20 gigabytes per second.

And if you don't believe me, we can look at the BigQuery table into which we're streaming. This is looking at the data that has come in over the last 20 minutes, and we're computing the byte length and the number of rows per second. So we're sending 22, 23 gigabytes per second and 2.3 million rows per second, so kind of a decent velocity. But, OK, perhaps lots of systems can handle that sort of scale, but we want real-time. We want to be able to do something and see that it happens immediately. So what we're simulating here is a sensor network. Let's say we have IoT devices all around the world, and these sensors are reporting in. So we're looking at what's happened in the last ten minutes. In the last ten minutes, we have all of these sensors that are reporting that they're happy. Now, let's inject an unhappy sensor. So I'm basically just piping this into BigQuery via our streaming API and our command line client. And now we're looking for sensors that are not happy. So let's see whether this is going to show up. Come on, unhappy sensors. I feel bad for the sensor that we're making unhappy, but at least it's a fake one. There we go. So now we have one unhappy sensor.
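For reference, here is a minimal sketch, in BigQuery standard SQL, of the kinds of queries behind that demo. The project, table, and column names are hypothetical, since the demo's actual schema isn't shown in the talk.

-- Per-second throughput over the last 10 minutes of streamed rows
-- (hypothetical table and columns).
SELECT
  TIMESTAMP_TRUNC(event_time, SECOND) AS second,
  COUNT(*) AS rows_per_second,
  SUM(BYTE_LENGTH(payload)) AS bytes_per_second
FROM `my-project.streaming_demo.sensor_readings`
WHERE event_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
GROUP BY second
ORDER BY second DESC;

-- The "unhappy sensor" check is just a filter over the same freshly streamed data.
SELECT sensor_id, status, event_time
FROM `my-project.streaming_demo.sensor_readings`
WHERE event_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
  AND status != 'happy';

Because rows inserted through the streaming API become queryable very quickly, the second query picks up the injected sensor almost immediately, which is what the demo shows.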
And switch back to the slides, please. Thanks. So real-time data warehousing, being able to make decisions quickly, being able to do things at high velocity-- that's something that we're pushing on.

Another important thing for modern data warehousing is the centralization of storage. How many people use a data lake? So lots of people use a data lake. We believe, actually, that BigQuery does an excellent job as your data lake for structured data. We also want to make sure that you can put the data where you want and how you want it, but we've done a pretty good job of understanding structured data. The statistic is that less than 50% of structured data is used to make decisions. That's a lot higher than it used to be. It's not a bad number, but people haven't really started looking at unstructured data at all. So a data lake is something that's important.

But we've also recently launched the BigQuery Storage API, which is a way to sort of turn the data lake upside down. It's a way to have the data warehouse be your data lake. And so what are the advantages of that? Well, BigQuery automatically optimizes the shape of the data. One of the problems with a data lake is that you have to optimize the number of files you have and the size of the files you have. You have to worry about consistency issues-- what happens if you're running a job while somebody adds or deletes a file-- and it's harder to apply consistent security. BigQuery lets you do DML over your data. It gives you a higher-level table abstraction. And when you have a higher-level table abstraction, you can apply things like security policies that mark certain fields as PII, so other people can't read them. So the Storage API is a way of reading at full velocity from BigQuery storage. There's a Dataflow connector, a Spark connector, a Hive connector, and a Dataproc connector, and these let you read in parallel and scale out virtually as large as you want, to make access from these other processing engines super fast. It also supports column projections and filters, so that you don't have to read the full table.

So next is security and trust. When people move to the cloud, they get nervous. They feel like they've lost control of their data. Somebody else is managing the data. Maybe they can't fire the person who leaks the data. And it's not just a perception problem. There are various attack vectors that can only happen in the cloud, which is why Google has been paying a lot of attention to security and working very closely with customers that have significant security needs. One of those is HSBC. We worked hand-in-hand with them to make them comfortable putting their data in our cloud and being able to rely on the safety and security of the data in Google's cloud. Some of the things that we developed in conjunction with them were customer-managed encryption keys for BigQuery, and Access Transparency, where, even if an insider needs access to your data for support reasons, you get detailed audit logs of what happened. There are also a number of other security features coming down the line.

Next is sharing. Data that's locked in a silo is not all that useful. In the traditional data warehousing model, only a select few analysts were actually granted access to the data warehouse, because the data warehouse generally ran 100% of the time at full capacity, and people wanted to make sure you didn't lock it up or bring it down.
When you move to the cloud, and you have virtually unlimited, scalable capacity, you have the option and the ability to grant more people access to your data. So one of the main things that we're trying to do is democratize the ability to access the data. Anybody in your organization should be able to make sense of the data. So there are various things we've announced these last couple of days, like Connected Sheets. With Connected Sheets, anyone who can use an Excel spreadsheet can now use BigQuery and can create pivot tables and build reports in their spreadsheet over data sizes that are virtually unlimited.

And then there's BI Engine, which is our accelerated engine that sits on top of BigQuery. It can power dashboards-- it can power Data Studio dashboards, and it will soon power other partner tools. With the speed and versatility of Data Studio, anybody who can do drag-and-drop in Data Studio can take advantage of it, and you can also share dashboards and do drill-downs with other folks. The other thing that BI Engine gets you is high concurrency. So it's not just faster; it also allows your dashboards to be accessed by hundreds or thousands of people at once. We built a dashboard for March Madness to showcase the machine learning stuff that we were doing for the college basketball tournament, and the dashboard was public, and every time someone loaded the page, it ran a BI Engine query-- or a number of BI Engine queries. And BI Engine was just able to scale and sort of magically serve that.

One of the nice things for you folks is that, when you use BI Engine, you don't get charged for a query. Anything that hits the BI Engine in-memory cache is not charged. So you can buy a certain amount of memory, data will be automatically cached in that memory, and whatever we can serve out of that cache will be free. The other nice thing is that anything that misses the cache is still consistent. One of the driving things for the BigQuery team is that we always want to serve the freshest data.

And then we also have a lot of partner tools. The tools that you're used to investigating your data with will work with BigQuery. So, yes, I mentioned BI Engine. When we launched BI Engine-- here's another colorful user-- this was one of the first pieces of feedback that we got. I probably shouldn't have shown that slide, but I snuck it in at the last moment. And Connected Sheets, you've likely seen before-- drag-and-drop pivot tables-- and we have a number of early partners that have already been validating it. This is a little more PR-ready quote from a customer.

And the last bit of modern data warehousing is predictive. If traditional data warehousing lets you know what happened yesterday, real-time data warehousing lets you know what's happening now, and predictive data warehousing lets you know what's going to happen tomorrow. What are your customers going to do? This can be super powerful, and I think it's an area that has only just started to be explored. In a survey, Forbes said that 82% of executives believe that it's going to be highly impactful. But the truth is, very few people are actually doing it yet. And so our mechanism for unleashing the power of predictive analytics is BigQuery ML. With BigQuery ML, you just write a SQL statement, and you can build a model and run predictions over your models.
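As a rough illustration of that point, and not the exact models from the talk, here is what building and using a logistic regression model looks like in BigQuery ML; the dataset, table, and column names are hypothetical.

-- Train a model that predicts whether a visit converts, directly over a table
-- in BigQuery (hypothetical schema).
CREATE OR REPLACE MODEL `my_dataset.purchase_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
SELECT purchased, country, device, pageviews, time_on_site
FROM `my_dataset.visits`;

-- Run predictions with the trained model, again as plain SQL.
SELECT country, predicted_purchased
FROM ML.PREDICT(
  MODEL `my_dataset.purchase_model`,
  (SELECT country, device, pageviews, time_on_site FROM `my_dataset.new_visits`));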
And in the past, we had only two classes of models, linear and logistic regression. They were actually very, very powerful and good at making predictions over large-scale datasets, but they're not the cool models anymore. So we launched a couple of new ones this time around. We launched k-means clustering, so you can build customer segmentations and do clustering right in the database. We also launched matrix factorization to alpha, and that's super useful for recommendations. As an initial use-case, we took the Netflix dataset-- I don't know if you remember, a few years ago they offered a million-dollar prize for whoever could beat their machine learning-- and we just sort of ran it, untuned, through our matrix factorization. It only took a few minutes to process all the data, and we got results that were more or less equivalent to the best published results. And it's not because we're doing anything fancy. It's just that we were able to process the whole thing because we had the scale to understand the full dataset.

We also have some DNN neural network models that are in alpha, and that's an interesting one, because it's our first one that, under the covers, actually goes out of the database. The other ones are building things in the database; we're not moving the data out. For the neural networks, the database access patterns are very different from the access patterns you need to build a neural network, so we ship the data over to Cloud Machine Learning Engine under the covers. You don't actually see any of this. It just magically happens, and we build the neural network for you. But you can imagine that, once we can do that, then, really, we can do any model. So we haven't announced other models, but I wouldn't be surprised if more of them were impending.

The other one that I think is very cool is importing TensorFlow models. You can build a TensorFlow model anywhere you want-- your data science team can build a TensorFlow model that does a chat bot-- and you can load that into BigQuery and use it to make predictions and inferences within BigQuery. And that actually does happen within the database, so we can do it very fast. And I think this is a challenge to you folks, because there's a lot of stuff you can encode in a TensorFlow model, a lot of things that are not just machine learning. You can encode just about anything. So we're hoping people come up with some interesting use-cases just to push on this.

AutoML Tables was also launched. AutoML Tables lets you just point at a BigQuery table and say, this is what I want to predict, and it will automatically generate a machine learning model for you. So very, very hands-off. I mentioned before that we're trying to democratize data analysis and make it possible for more people to do data analysis. We're also trying to do the same with machine learning, because when we talk to customers, many of them say, we really want to do ML, we want to do AI, but I just can't hire anybody that can do that. I just can't find the talent. It's also a really good market for people that know how to do that stuff-- you can get paid very well. But we want to bring this to more people. And somebody who's a machine learning PhD and deeply understands the data is always going to produce the best models.
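Going back to the k-means launch for a moment, a segmentation model follows the same CREATE MODEL pattern; this is a minimal sketch with hypothetical table and column names, not something shown in the talk.

-- Cluster customers into four segments right in the database (hypothetical schema).
CREATE OR REPLACE MODEL `my_dataset.customer_segments`
OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
SELECT total_spend, order_count, days_since_last_order
FROM `my_dataset.customer_features`;

-- Assign each customer to its nearest cluster.
SELECT customer_id, centroid_id
FROM ML.PREDICT(
  MODEL `my_dataset.customer_segments`,
  (SELECT customer_id, total_spend, order_count, days_since_last_order
   FROM `my_dataset.customer_features`));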
But you can get very, very good results with AutoML and BQ ML with less work and less deep understanding of what's going on. AutoML and BQ ML are still different. I might expect those things to start looking similar in the future. That's just a hint. And we've also got a number of users that have really been using ML and predictive analytics to move their business forward.

So you put all this stuff together, and each piece individually is sort of not that different from what a traditional data warehouse can do. But you put all these things together, and it starts to look like something more than your old-school traditional data warehouse can do.

There are some other differentiators, and another one that I want to call out here is BQ GIS-- there have been several sessions on it-- and I want to mention it again because lots of data is streaming in nature, but also more and more data has a location associated with it. Many of the apps on your phone collect location data. People that have delivery drivers want to know the GPS tracks of the delivery driver. So lots and lots of datasets have location built into them, and BQ GIS lets you turn that into what's actually happening in the real world. It lets you turn lat and long, and paths, and these simple points into actual interaction with the real world. And many of our customers have been finding very cool use-cases for this. There was a talk this morning on Global Fishing Watch, which was looking for poachers using geospatial. There are various transportation boards that use it for understanding traffic flows and traffic patterns. So lots of interesting ways of using it.

One of the ways that a researcher at Google has actually started to dig into is using BQ geospatial to understand astronomy. And that sounds sort of weird because, OK, the stars are 3D, and in geospatial, everything is mapped onto a sphere. But think of the old-school globes, or what the ancients used to think of as a celestial sphere: if you take a point on the Earth, and you go straight up, and you see what the intensity is, either of light or of some other electromagnetic range, you can map that back down onto where that would be on the Earth. And once you do that, then all the calculations that you can do over geospatial data can be done on this astronomical data. This is still early, but one idea is just looking for exoplanets. This dataset is from satellite-based telescopes, and these are three passes of the satellite. You're looking for exoplanets-- you're looking for the transit of an exoplanet in front of the star system. So anytime you have an unexpected dip or an unexpected deviation, that's an area where you may want to look more closely. So this one in the middle-- I'm not saying we found an exoplanet, but it might be something to look at more closely.

So next, I'm going to hand it over to Rick. Thanks. [APPLAUSE]

RICK FULTON: Hi, everyone. So I'm Rick Fulton. I am the senior engineering manager of the simulation platform at Cruise Automation. I'm going to be talking a little bit about how we use BigQuery on the simulation platform. So to introduce the company, Cruise Automation: we're building self-driving cars.
Our mission is to build the world's most advanced self-driving vehicles to safely connect people with the places, things, and experiences they care about, and to transform the future of transportation. So, for example, we're going to be launching a self-driving rideshare service.

The simulation platform is my team. A little context: to build a self-driving car, one way you could do that would be to make code changes, put them on the car, and see what happens. Not the most efficient way to do it. It's much better to have really accurate simulation systems and to test your code changes on those simulations before you put them on the car. So the simulation platform is all about accelerating and making more efficient use of simulations. That means faster simulations, more reliable simulations, being able to run those simulations in more expansive, interesting ways, and then, for the purpose of this presentation, analyzing the results of the simulations efficiently. So again, the goal of my team is, within minutes, to be able to determine the effect of a code change on the AV's behavior and to understand where to make improvements if needed.

So again, we are using BigQuery, and I'm going to touch on the points that Jordan was talking about-- some of the data needs we've run into and how BigQuery has helped us. To start off, we have to handle surprisingly large amounts of data in a real-time way in order for us to do the simulations we want. With the number of simulations we're running, we're generating gigabytes per second and billions of rows a day, and we need a data warehouse solution that's going to be able to support that. The data needs to be available within minutes, because we have AV engineers who run their simulations and want to know the effect on the car, so they want to have access to that. And then finally, we've been massively scaling the number of simulations we run, so we really need a solution that is low operational overhead for us. So that went into our selection of BigQuery, which I'll discuss in a minute.

For the purpose of the presentation, I just want to dig a little bit into context for a typical AV architecture. This is actually the architecture the Udacity self-driving car course uses, so it's not necessarily ours. On the left, we have sensors-- camera, radar, LiDAR data. That's raw input data to the car. It feeds into the perception system. The perception system is all about the car reasoning about where it is in the world and what's around it-- where are the cars around me, where are the people on bikes. Maybe if there's a car next to me and I see it has a left turn signal on, then maybe I will predict that it's going to change to the left lane. So that is essentially the state of the world. And then that feeds into the planning system, which is, how do I get from point A to point B? If I need to make an unprotected left turn, how do I make an unprotected left turn safely? How do I go through an intersection or a four-way stop safely? And that all feeds into the control system, which is the low-level controls. How do I actually turn the wheels?

So just for some color, I think what's important to note is that we have a ton of different types of testing frameworks. So for instance-- oh, right. This is a picture of our web visualization platform.
So this basically shows what the car is seeing, and it's pretty instrumental in building out our simulations.

Some of the types of simulations we have: we have a 2D sim system, which is like a top-down view, where you can see the car executing various maneuvers. You can put in cars and people and just see how the car would react. There's the same system, except it's three-dimensional, and that's more of a full system where we can feed data into the car within the 3D sim, and the car doesn't know it's within a simulation, so it's kind of like an end-to-end integration test. Really important to us is sensor replay. We want to feed the sensor data into the perception system and make sure that the perception system is reasoning correctly about the world-- given this radar and LiDAR data, did we accurately identify all the objects around me? There are also hardware performance tests. Do we have tests to make sure the hardware is functioning properly? Do we feel confident that the hardware is going to react similarly to what we're running in the cloud? There are many more not worth mentioning in this presentation, but suffice it to say, there are many different types of simulations we run.

Right. So this is a pretty important point. Simulation testing is hard. It is not your typical regression testing, where you have pass-fail tests you run to know if you can merge or deploy. The first point is that it's more than binary pass-fail results. For instance, you could see a significant decrease in some metric you care about, but that might be OK if you see increases in other metrics. So you need to take a more holistic view of all the different metrics. As you could see a couple of slides ago, there are many interdependencies in the stack. So if I'm a LiDAR engineer and I'm making a change to a segmentation model to identify objects through LiDAR, I might see that the model is doing fine or doing better than before, but maybe it has some kind of bad downstream effect on other systems, like the planning system. It's really important to understand how my current iteration, my current commit, is doing in relation to previous commits. So I want to see, given a metric, am I doing better compared to base, or over time? And then finally, we want to be pretty flexible about being able to add new metrics. If I'm an AV engineer and I decide that it's useful to compute some new derived metric, we need a data solution that can handle that without having to do an onerous schema migration.

I'm going to briefly talk through our old architecture. There are some obvious issues with it, so it's not worth spending too much time on. You have a code change, it goes into GitHub, our CI system kicks off and requests the standard set of regression tests and simulations, and they get scheduled and run. We have a graph compute engine that will turn those results into Avro-- we use Avro as our data serialization format-- and put those Avro tables into S3. The main point here is that we just had raw Avro tables. There's no querying layer, so there's really not a ton we can do here. There are many types of queries we can't answer. We can't do average detection accuracy over a test run, or averages over time, or specific metrics over time. There's really not a lot we can do. All of that aggregation has to happen on the front end, so it's a significant memory and CPU burden.
It doesn't really scale at all with the increasing number of simulations, and it's really time-consuming to build front-end analysis tools because they're so bespoke to the particular use-case.

OK. So we chose BigQuery. The difference here is that we moved from S3 to GCS, which was cool, but also, when we load those tables into GCS, we send a signal via Pub/Sub to this simple ingestion service, which does some of the ETL to put the data into BigQuery, including abstracting away adding a new table into BigQuery, so that the AV engineer doesn't have to worry about that. So it's all taken care of for us, and we've just been able to feed tons of data into BigQuery. And there have been applications-- now we can do things we couldn't do before: direct queries, Jupyter notebooks, BI tools like Looker and Tableau.

And I'm going to give a specific example. We built out this front-end analysis platform, and I'm going to talk about one specific AV metric and the tool we built on top of BigQuery now that we had this data in there. So this is an unprotected left turn, a very important maneuver. The metric is the selected gap metric, which is the time between when our car enters the intersection and makes the left turn, and when the oncoming car enters the intersection. It's very important to make sure that there's a good cushion here. You don't want to make an unsafe left turn. So in this case, in this second picture here, that's when our car enters, and you can see whatever that time is up there. And then when the oncoming car enters the intersection, that's about five seconds later. So we want to make sure that there's a good cushion no matter what speed the car is coming at, no matter what the headway is between those two cars-- how much temporal distance there is between car two and car one. So you might think that we want a series of simulations that don't just test this particular scenario, but ask: what if the cars are going faster? What if they're closer together? Are we still going to make the right decisions? So that's exactly what we did.

This is built directly on top of BigQuery. There's a lot going on in the slide, but what we're doing here is comparing our feature branch against base. These axes-- one axis is the oncoming car speed, and the other one is headway time. And we want to make sure that as we modulate these values, we still have a nice, healthy selected gap length. And, interesting point here, each one of these cells is itself a simulation-- a full simulation that takes a long time to run, generating at least a gigabyte of data. So we're basically pulling all of this data easily into BigQuery, and we have very powerful tools that allow us to analyze how we're doing, how the feature branch is doing against base, and where we still have to improve.

So, yeah. We've had some really good results so far. We are ingesting something like half a million to a million rows per second, gigabytes per second. The data is available within minutes, and it's scaled, literally, 10x as we've grown-- we looked at the numbers, and it was the number 10. So we've been very happy with this, and we expect to scale this another order of magnitude or two. So we're not quite at the limits that Jordan was talking about, but we've had no problems ingesting a pretty considerable amount of data so that we can be much more efficient about how we're doing AV development. So, yeah.
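As a rough illustration of the kind of query that can drive a comparison grid like the selected-gap one described above, here is a sketch in BigQuery standard SQL; the schema is hypothetical, not Cruise's actual tables.

-- Average selected gap per (oncoming speed, headway) cell, feature branch vs. base
-- (hypothetical table and columns).
SELECT
  oncoming_speed_mps,
  headway_s,
  AVG(IF(branch = 'base', selected_gap_s, NULL)) AS base_avg_gap_s,
  AVG(IF(branch = 'feature', selected_gap_s, NULL)) AS feature_avg_gap_s
FROM `av-sim.results.unprotected_left_metrics`
GROUP BY oncoming_speed_mps, headway_s
ORDER BY oncoming_speed_mps, headway_s;

Each cell of the grid then comes from one row of this result, and adding a new derived metric is just another column in the SELECT rather than a schema change in a bespoke front end.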
In the future, we're going to continue to scale the number of simulations. We are targeting another order of magnitude or two, and we believe that BigQuery will be able to support that. We're going to continue to expand the simulation tooling we have and build more of it. We might be looking at the external data storage that Jordan mentioned for certain kinds of simulations that have particularly large outputs. And most likely, we're going to be adding some kind of ML application here-- we might want to take all the data that's in BigQuery and use it to build a model to predict on-road performance, given the metrics that we compute from simulation. So that would be really cool. So thank you, everyone. [APPLAUSE] [MUSIC PLAYING]
Info
Channel: Google Workspace
Views: 38,655
Rating: 4.9492388 out of 5
Keywords: type: Conference Talk (Full production), pr_pr: Google Cloud Next, purpose: Educate
Id: eOQ3YJKgvHE
Length: 44min 33sec (2673 seconds)
Published: Thu Apr 11 2019