[MUSIC PLAYING] JORDAN TIGANI: Good
afternoon, everybody. My name is Jordan Tigani. I'm the director of product
management for Google BigQuery. I was also one of the first
engineers on Google BigQuery, so I know where a
lot of the skeletons are hidden in the code-- or at least I still
remember some of it. So a data warehouse is a tool. It's a tool that can be
used in a lot of ways but it's a relatively
simple tool. And with simple tools, you can
build some really impressive things. This is my house in Seattle,
and you can build this house with very simple tools. But if you want to
build a skyscraper, you need totally
different tools. So what is the difference? It's scale, it's technology,
and it's use cases. And so a data
warehouse is similar. So a traditional data warehouse
is a fairly simple tool. It's been around
for 30, 40 years in pretty much the same format. How many people here have a
data warehouse that they run? So lots of you. Actually, how many of those
people have a data warehouse that isn't BigQuery? I just want to see if I'm preaching to the choir here. OK. So folks do have lots of data warehouses that aren't BigQuery. But of those people that
have data warehouses that aren't BigQuery, how many of
you run either Hadoop, or Spark, or Impala, or something
like that as well? So lots of people
are running things that are data analysis tools
that are outside of their data warehouse. So the data warehouse is
clearly missing something. And how many of you
run some sort of Kafka or streaming analytics? That's a lot. OK, a bunch of those. And how many of
those are integrated with your data warehouse? So not too many. So there's clearly also
a real-time component that people want or people
have streaming data that's coming in. So kind of the way we see data
warehousing at Google is it starts with the data warehouse. On the data
warehouse, as actually one of the guys from Home Depot
said a couple of weeks ago, 99.9% of all traditional
data warehousing stuff, it's still relevant. So I'm not going to try to say
that this old traditional data warehousing stuff isn't relevant. It is. But there's something more. There's something
more that people want. And so Google's data
warehouse is BigQuery. Hopefully, people have
heard of it by now. And it's our enterprise data
warehouse for analytics. We used to call it Petabyte-scale,
now we call it Exabyte-scale, in case you're tracking
these slides over time. And you can run
Petabyte-scale queries, and I'll show you one of
those in a few minutes. Security is super
important for us, so all your data is stored
encrypted, durable, and available. And I'll get into some of the
other properties as I go along. You don't have to just
take my word for it. Forrester named BigQuery a
leader in their cloud data warehousing wave. And we recently had
a study out from ESG that said using BigQuery
as a traditional data warehouse gives you significant
savings over other on-prem data warehouses. I also want to
highlight something that we've announced at Next. Sudhir mentioned it
in his talk yesterday. So a lot of people
that I've been meeting with over the last few days have
been saying, we love BigQuery and we want predictable
pricing, but the $40,000 a month that we charge for 2,000 slots
is out of their price range. And so we're announcing 500
slots for $10,000 per month. Hopefully, there's
a lot more people for whom that's in
their price range. So one of the key things
for data warehousing when you're moving to the
cloud is the separation of storage and compute. When Home Depot
moved to BigQuery, they had 100 terabytes in
an on-prem data warehouse. Just recently, they finished
their migration to BigQuery, and now they've got
tens of petabytes, so huge amounts of data. And it's not
something that's new. It's that this data
was always coming in, but they were constrained
by the environment that they were working in. So when you go to
the cloud, you remove a lot of the constraints. You can scale up the storage
amounts virtually infinitely. You can scale up the
compute amounts to tens or even hundreds of
thousands of CPU cores. So we've also announced that,
where we talk about some of our big statistics. We have customers that have
a quarter of an exabyte of data for a single customer,
and we do run queries that are over a petabyte
relatively frequently. So this is the retailer
that I mentioned. So three years
ago at GCP Next, I introduced this petabyte
dataset that we had. And I ran a query over
this petabyte dataset, and I kind of did
this thing where I, at the beginning of the
talk, I started it running. At the end of the talk, I
said, let's see how it's doing. And it took about four
minutes to complete. Then, last summer at GCP Next,
I ran that same query again against that same
dataset, and it was down to a minute and a half,
so 2x performance improvement. The big difference
between those two, actually, was not the
performance difference. It was, if you
look to the right, it says that one had to
process all 1.09 petabytes. The latter one only had to
process half a gigabyte. So if we could switch
over to the demo, please? So I'm going to try
running this query again, and let's see how we do. So the difference between
the half a gigabyte and the single petabyte was
that we enabled clustering. And so clustering enables you to find data much more quickly. So there we go: 11.7 seconds to do this scan of a one-petabyte table.
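For anyone who wants to try this on their own data, here's a minimal sketch of setting up a clustered table from the Python client; the project, dataset, and column names are hypothetical, and the same thing can be done directly in SQL DDL.

```python
from google.cloud import bigquery

# Hypothetical project, dataset, and column names for illustration.
client = bigquery.Client(project="my-project")

# Create a date-partitioned table clustered on the columns you filter
# by most often; queries filtering on those columns can skip most of
# the underlying storage blocks.
ddl = """
CREATE TABLE IF NOT EXISTS `my-project.demo.events`
(
  event_time  TIMESTAMP,
  customer_id STRING,
  country     STRING,
  payload     STRING
)
PARTITION BY DATE(event_time)
CLUSTER BY customer_id, country
"""
client.query(ddl).result()

# A filter on the clustering columns lets BigQuery prune blocks, so far
# less than the full table gets scanned.
query = """
SELECT COUNT(*) AS n
FROM `my-project.demo.events`
WHERE customer_id = 'abc123'
  AND event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""
print(list(client.query(query).result()))
```

That block pruning is essentially where the drop from a full petabyte scanned down to a fraction of it comes from.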
I'm hoping that by next time we'll get that down to one second, but-- [APPLAUSE] Could we switch back, please? So that's the challenge
for the engineering team, to make me look good next year
by getting that to a second. So the last bit in kind of
the traditional data warehouse space, where you're just sort of
expanding the traditional data warehouse, is serverless. Nobody wants to manage servers. If you can get somebody else
to keep your servers running, then that's great. And BigQuery does
automatic patching. There's no downtime. When we launch new versions, we just sort of roll them out seamlessly. This was a quote from
last week, actually. One of our more colorful
Australian customers just mentioned that one day, things
started getting much faster. And so he had this
to say on Twitter. So what makes BigQuery BigQuery? What makes it work? It's really the architecture
that we build on, the Google infrastructure where we really can stand on the shoulders of giants: the extremely fast petabit network, our highly scalable storage systems, and our highly scalable compute clusters as well. So it's serverless. You get to focus on
what's important to you. You get to focus on actually
doing your analytics. You don't have to focus on
configuration management, reliability, et cetera. So next is real-time. And I think real-time is
an underappreciated side of data warehousing. Traditionally, you take your operational database and, overnight, you dump that into your data warehouse and you build reports. And then the next
morning, everybody comes in looking at the reports
of what happened yesterday. But people don't want to
see what happened yesterday. They want to know what's
going on right now. The amount of data that
is being created is-- people always talk about
how fast data is coming in and how hard it is to process. And more and more of that
data is streaming oriented. And when you think of
it, really, all data is generated one
event at a time. In its natural
format, it's a stream. So across Google and across
Google Cloud and data analytics, we
really want to make it possible to keep data
in its natural form, to keep data-- if it's a
stream, to keep it as a stream. That's why, in Cloud Dataflow, with one line of code you can switch back and forth between batch and streaming. And in BigQuery, we are investing heavily in streaming analytics.
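As a rough illustration of that batch/streaming symmetry, here's a hedged sketch of an Apache Beam pipeline (the kind of thing Dataflow runs) that streams JSON events from a hypothetical Pub/Sub topic into a hypothetical BigQuery table; swapping the unbounded source for a bounded one is essentially what flips it from streaming to batch.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project, topic, bucket, and table names.
options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     # Unbounded source: Pub/Sub messages arrive as bytes.
     | "ReadEvents" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/events")
     | "Parse" >> beam.Map(json.loads)
     # The same sink works for batch pipelines as well.
     | "WriteToBQ" >> beam.io.WriteToBigQuery(
           "my-project:demo.events",
           schema="event_time:TIMESTAMP,sensor_id:STRING,status:STRING",
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```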
So one of our customers is Zulily, and they have these daily deals. I think they have like
100 new products a day. And they want to know how
those products are doing throughout the day
because, if they have to wait until tomorrow
to get their report, then it's too late to
do anything about it. So they actually use
streaming into BigQuery, and they send an
hourly report that's read by all their
executives, and they are able to make
changes on the fly. So it's super powerful to
be able to make changes on the fly. So I'm going to show a demo
of some high-volume streaming. So in the past,
people have kind of run into some streaming
limits in BigQuery. Let me just start this. And we're constantly working on
breaking through those limits. And I think, if you think about
limits you hit, quotas you hit, one overarching
thing is you can be sure we're working on making
those limits not actually hard limits anymore. So we put together this
streaming demo using Dataflow. So we're using thousands
of workers in Dataflow. And so here's the
Dataflow instance that's-- you can see how
many bytes written, and this has only been
running for a couple hours. So that's some pretty
significant amount of data. Now I'm going into the
Compute Engine instances. So we have ten of these running. And let's check
out the monitoring. There we go. Check out the monitoring,
check out the network bytes. So each one of
these ten is doing about two gigabytes per second. So we're streaming at
20 gigabytes per second. And if you don't believe me, we
can look at this BigQuery table into which we're streaming. So this is looking
at the data that's coming in over the
last 20 minutes, and we're computing the
byte length and the number of rows per second. So we're sending 22, 23 gigabytes per second and 2.3 million rows per second, so kind of decent velocity.
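The monitoring query behind that view is roughly along these lines; the table and column names here are hypothetical stand-ins, not the demo's actual schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table; `payload` stands in for the streamed record and
# `ingest_time` for its insertion timestamp.
query = """
SELECT
  TIMESTAMP_TRUNC(ingest_time, SECOND) AS second,
  COUNT(*) AS rows_per_second,
  SUM(BYTE_LENGTH(payload)) AS bytes_per_second
FROM `my-project.demo.stream_events`
WHERE ingest_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 20 MINUTE)
GROUP BY second
ORDER BY second DESC
"""
for row in client.query(query).result():
    print(row.second, row.rows_per_second, row.bytes_per_second)
```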
But, OK, perhaps lots of systems can handle that sort of scale, but we want real-time. We want to be able to
do something and see that it happens immediately. So what we're simulating
here is a sensor network. So let's say we
have IoT devices, they're all around the world. And these sensors are-- so we're looking at what's
happened in the last ten minutes. In the last ten
minutes, we have all of these sensors that are
reporting that they're happy. Now, let's inject
an unhappy sensor. So I'm basically just piping this into BigQuery via our streaming API and our command line client.
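Doing the same thing from the Python client rather than the command line would look roughly like this, assuming a hypothetical sensor_readings table: stream one row in through the streaming insert API, then query for unhappy sensors seen in the last ten minutes.

```python
import datetime

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.demo.sensor_readings"  # hypothetical table

# Stream one "unhappy" reading via the streaming insert API.
rows = [{
    "sensor_id": "sensor-042",
    "status": "UNHAPPY",
    "reading_time": datetime.datetime.utcnow().isoformat(),
}]
errors = client.insert_rows_json(table_id, rows)
assert not errors, errors

# The dashboard side: which sensors reported unhappy in the last ten minutes?
query = """
SELECT sensor_id, MAX(reading_time) AS last_seen
FROM `my-project.demo.sensor_readings`
WHERE status = 'UNHAPPY'
  AND reading_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
GROUP BY sensor_id
"""
for row in client.query(query).result():
    print(row.sensor_id, row.last_seen)
```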
And now we're looking for sensors that are not happy. So let's see whether this is going to show up. Come on, unhappy sensors. I feel bad for the sensor that we're making unhappy, but at least it's a fake one. There we go. So now we have one
unhappy sensor. And switch back to
the slides, please. Thanks. So real-time data warehousing,
being able to make decisions quickly, being able to do
things at high velocity-- something that we're pushing on. Another important thing
for modern data warehousing is the centralization
of storage and-- how many people use a data lake? So lots of people
use a data lake. We believe, actually, that
BigQuery does an excellent job as your data lake
for structured data. We also want to make sure that
you can put the data where you want or how you
want it, but we've done a pretty good
job of understanding our structured data. The statistic is that less than 50% of structured data is used to make decisions. That's a lot higher
than it used to be. It's not a bad number
but people haven't really started looking at
unstructured data at all. So a data lake is
something that's important. But we've also recently
launched the BigQuery Storage API, which is a way to sort of
turn the data lake upside down. It's a way to have the data
lake be your data warehouse. And so what are the
advantages of that? Well, BigQuery
automatically optimizes the shape of the data. So one of the problems
with the data lake is you have to optimize the
number of files you have, the size of the files you have. You have to worry about
consistency issues, what happens if you're running a job
while somebody adds or deletes a file. It's harder to apply
consistent security. So BigQuery lets you
do DML over your data. It lets you just have a higher
level table abstraction. And when you have a higher
level table abstraction, you can apply things
like security policies that mark certain fields as PII, so other people can't read them. So the Storage API is a
way of reading at full velocity from BigQuery storage. So there's a Dataflow connector,
a Spark connector, a Hive connector, and a
Dataproc connector, and these will let
you read in parallel and scale out virtually
as large as you want to make access through
these other processes super fast. It also supports column projections and filters so that you don't have to read the full table.
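For a sense of what that looks like in code, here's a hedged sketch using the google-cloud-bigquery-storage Python client with hypothetical table and column names; selected_fields is the column projection and row_restriction is the pushed-down filter.

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Hypothetical project, dataset, and table names.
table = "projects/my-project/datasets/demo/tables/events"

session = client.create_read_session(
    parent="projects/my-project",
    read_session=types.ReadSession(
        table=table,
        data_format=types.DataFormat.AVRO,
        read_options=types.ReadSession.TableReadOptions(
            # Column projection: only read the columns we need.
            selected_fields=["customer_id", "country"],
            # Filter pushdown: skip non-matching rows server-side.
            row_restriction="country = 'US'",
        ),
    ),
    max_stream_count=4,  # streams can be consumed in parallel
)

# Read the first stream; a real reader would fan out across all streams.
reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):
    print(row["customer_id"], row["country"])
```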
So next is security and trust. When people move to the cloud, they get nervous. They feel like they've
lost control of their data. Somebody else is
managing the data. Maybe they can't fire the
person who leaks the data. And it's not just a
perception problem. There are various
attack vectors that can only happen in
the cloud, which is why Google has been paying
a lot of attention to security and working very carefully
with customers that have significant security needs. One of those is HSBC. We worked hand-in-hand with
them to make them comfortable putting their data in
our cloud and being able to rely on the
safety and security of the data in Google's cloud. And some of the things that
we developed in conjunction with them were customer-managed
encryption keys for BigQuery, and the Access Transparency project where, even if an insider needs access to your data for support reasons, you get detailed
audit logs of what happened. There's also a number of
other security features that are coming down the line. Next is sharing. So data that's sort of locked in
a silo is not all that useful. In the traditional
data warehousing model, only a select few of analysts
were actually granted access to the data warehouse. Because the data warehouse
generally ran 100% of the time at full capacity, people wanted
to make sure you didn't lock it up, you didn't bring it down. When you move to
the cloud, and you have virtually unlimited
or scalable capacity, you have the option
and the ability to grant more people
access to your data. So one of the main things
that we're trying to do is democratize the ability
to access the data. So anybody in your
organization should be able to make sense of the data. So various things that we've
announced this last couple days with connected Sheets. Connected Sheets lets you-- anyone who can use
an Excel spreadsheet can now use BigQuery and
can create pivot tables and build reports
in their spreadsheet over data sizes that
are virtually unlimited. And then there's
BI Engine, which is our accelerated engine
that sits on top of BigQuery that can power dashboards, can
power Data Studio dashboards, and will soon power
other partner tools. And Data Studio and the
speed and versatility of that lets you-- anybody who can do
drag-and-drop in Data Studio can take advantage of
it, and you can also share dashboards, and do
drill downs with other folks. And the other thing
that BI Engine gets you is high concurrency. So it's not just
faster, it also allows you to have your dashboards be
accessed by hundreds of people or thousands of people at once. We built a dashboard
for March Madness to showcase the
machine learning stuff that we were doing for the
college basketball tournament, and the dashboard was public
and it was rendered-- every time someone loaded the page, it
ran a BI Engine or a number of BI Engine queries. And BI Engine was just able
to scale and sort of magically serve that. One of the nice
thing for you folks is that, when you
use BI Engine, you don't get charged for a query. So anything that hits the
BI Engine in memory cache is not charged. So you can buy a certain
amount of memory, and data will be automatically
cached in that memory, and then whatever we can serve
out of that cache will be free. The other nice thing
is that anything that misses the cache will
be actually consistent. So one of the driving
things for the BigQuery team is we always want to
serve the freshest data. And then we also have
a lot of partner tools. The tools that you're used to
investigating your data with, those will work with BigQuery. So, yes, mentioned BI Engine. And so when we launched--
here's another colorful user-- when we launched
BI Engine, this was one of the first things
of feedback that we got. I probably shouldn't
have shown that slide but I snuck it in
at the last moment. So Connected Sheets, you've likely seen before-- drag-and-drop pivot
tables-- and we have a number of
early partners that have already been validating. This is a little bit more
PR-ready quote from a customer. And the last bit of modern
data warehousing is predictive. So traditional data warehousing lets you know what happened yesterday. Real-time data warehousing lets you know what's happening now. Predictive data warehousing lets you know what's going to happen tomorrow. What are your
customers going to do? And this can be
super powerful, and I think it's an area that has only
just started to be explored. In a survey, Forbes said
that 82% of executives believe that it's going
to be highly impactful. But the truth is, very
few people are actually doing it yet. And so our mechanism
for unleashing the power of predictive
analytics is BigQuery ML. And with BigQuery ML, you
just write a SQL statement, and you can build a model
and you can run predictions over your models. And in the past, we had only
two classes of models, linear and logistic regression. They were actually very, very
powerful and good at making predictions over
large-scale datasets but they're not the
cool models anymore. We launched a couple of
new ones this time around. We launched K-means
Clustering, so you can actually build customer segmentations
and do clustering right in the database. We also launched Matrix Factorization to alpha, and that's super useful for recommendations.
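To make the you-just-write-SQL point concrete, here's a minimal, hypothetical k-means example submitted through the Python client (the names and features are all made up); matrix factorization and the regression models follow the same CREATE MODEL and ML.PREDICT pattern.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Build a k-means customer segmentation model entirely in SQL.
client.query("""
CREATE OR REPLACE MODEL `my-project.demo.customer_segments`
OPTIONS(model_type = 'kmeans', num_clusters = 5) AS
SELECT
  total_orders,
  avg_order_value,
  days_since_last_order
FROM `my-project.demo.customer_features`
""").result()

# Assign each customer to a cluster with ML.PREDICT; extra columns in
# the input (like customer_id) are passed through to the output.
rows = client.query("""
SELECT centroid_id, customer_id
FROM ML.PREDICT(
  MODEL `my-project.demo.customer_segments`,
  (SELECT customer_id, total_orders, avg_order_value, days_since_last_order
   FROM `my-project.demo.customer_features`))
""").result()
for row in rows:
    print(row.customer_id, row.centroid_id)
```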
And just some initial use-cases that we had, we took the Netflix
dataset from-- I don't know if you guys
remember-- a few years ago, they offered a million
dollar prize for whoever could beat their
machine learning, and we just sort of ran it, untuned, through our matrix factorization. And it only took a few minutes
to process all the data, and we also got
results that were more or less equivalent to
the best published results. And it's not because we're
doing anything fancy. It's just that we were able to process the whole thing because we had the scale to understand the full dataset. We also have some DNN
neural network models that are in alpha, and
that's an interesting one because that's our first
one that, under the covers, actually goes out
of the database. The other ones are building things in the database. We're not moving the data out. For the neural networks, the database access patterns are very different from the access patterns that you need to build a neural network, and so we ship the data over
to Cloud Machine Learning Engine under the covers. You don't actually
see any of this. It just magically
happens, and we build the neural network for you. But you can imagine that, once
we can do that, then, really, we can do any model. So we haven't
announced other models but I wouldn't be surprised if
more of them were impending. And the other one that
I think is very cool is importing TensorFlow models. If you build a TensorFlow
model anywhere you want, say your data science team builds one that does a chatbot, you can load
that into BigQuery and use that to make predictions
and inferences within BigQuery. And that actually does happen within the database, so we can do it very fast.
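Importing a TensorFlow model uses the same CREATE MODEL statement; a small sketch, with a hypothetical Cloud Storage path and an input column name that would have to match your model's serving signature.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical GCS path to a SavedModel exported by your data science team.
client.query("""
CREATE OR REPLACE MODEL `my-project.demo.imported_tf_model`
OPTIONS(model_type = 'tensorflow',
        model_path = 'gs://my-bucket/models/chatbot_intent/*')
""").result()

# Run inference in-database; the input column alias must match the
# model's expected input name.
rows = client.query("""
SELECT *
FROM ML.PREDICT(
  MODEL `my-project.demo.imported_tf_model`,
  (SELECT message AS input_text FROM `my-project.demo.chat_messages`))
""").result()
```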
And I think, actually, this is a challenge to you folks, because there's a lot of stuff
you can encode in a TensorFlow model, a lot of things that
are not just machine learning. You can encode just
about anything. So we're hoping people come up
with some interesting use-cases just to push on this. AutoML tables was also launched. And AutoML tables lets
you just point to a BigQuery table, you say, this is what
I want to predict, and it will automatically
generate a machine learning model for you. So very, very hands-off. And so I mentioned
before that we're sort of trying to
democratize data analysis and make it possible for more
people to do data analysis. We're also trying to do the
same with machine learning because, when we talk to
customers, many of them say, we really want to
do ML, we want to do AI, but I just can't hire
anybody that can do that. I just can't find the talent. It's also a really
good market for people that know how to do that stuff. You can get paid very well. But we also want to bring
this to more people. And somebody who's a machine
learning PhD and deeply understands the data is going to
always produce the best models. But you can get very, very good
results with AutoML and BQ ML with less work and
less deep understanding of what's going on. So AutoML and BQ ML
are still different. I might expect in the
future for those things to start looking similar. That's just a hint. And so we've got also a number
of users that have really been using ML and
predictive analytics to really move their
business forward. So you put all this
stuff together, and I think each one
individually is sort of not that different than what a
traditional data warehouse can do. You kind of put all
these things together, and it starts to look like this
is something more than your
data warehouse can do. But there's some
other differentiators, and another one that I
want to call out here-- and there have been several
sessions on BQ GIS-- and I want to mention it
again because lots of data is streaming in nature,
but also more and more data has a location
associated with it. All the apps on your phone, or
many of the apps on your phone, they collect location data-- people that have
delivery drivers, and they want to know the GPS
tracks of the delivery driver. So lots and lots of datasets
have location built into them, and so BQ GIS lets you turn that into what's actually happening in the real world. It lets you turn lat and long, and paths, and these simple points into actual interactions with the real world.
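As a small, hypothetical example of that in SQL (the table, columns, and coordinates are all made up), turning raw lat/long columns into GEOGRAPHY values and filtering delivery-driver pings by distance from a store:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table of GPS pings from delivery drivers.
query = """
SELECT
  driver_id,
  -- Turn raw lat/long columns into a GEOGRAPHY point.
  ST_GEOGPOINT(longitude, latitude) AS location,
  -- Distance from a (hypothetical) store location, in meters.
  ST_DISTANCE(ST_GEOGPOINT(longitude, latitude),
              ST_GEOGPOINT(-122.4194, 37.7749)) AS meters_from_store
FROM `my-project.demo.driver_pings`
WHERE ping_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
  -- Keep only pings within 5 km of the store.
  AND ST_DWITHIN(ST_GEOGPOINT(longitude, latitude),
                 ST_GEOGPOINT(-122.4194, 37.7749), 5000)
"""
for row in client.query(query).result():
    print(row.driver_id, row.meters_from_store)
```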
And many of our customers have been finding very cool use-cases for this. There was a talk this morning
on Global Fisheries Watch that was looking for
poachers using geospatial. There's various
transportation boards use it for understanding traffic
flows and traffic patterns. But lots of interesting
ways of using it. And one way that a researcher at Google has actually started to dig into is to use BQ geospatial to understand astronomy. And that sounds sort of weird because, like, OK, the stars are 3D, while in geospatial everything is mapped onto a sphere. But if you kind of think of the old-school globes, or what the ancients used to think of as a celestial sphere, you can take a point on the Earth, go straight up, see what the intensity is, either of light or some other electromagnetic range, and map that back down onto where that would be on the Earth.
then all the calculations that you can do over your
dataset or over geospatial data can be done on this
astronomical data. And so one that-- this is sort of still
early, but one idea is just looking for exoplanets. So this dataset is from
satellite-based telescopes. And so these are three
passes of the satellite. And so you're looking
for exoplanets. You're looking for places where there's a transit of the exoplanet in front of the star system. And so anytime you
have an unexpected dip or unexpected deviation, that's
sort of an area where you may want to look at more closely. So this one in the
middle, I'm not saying we found an
exoplanet, but it might be something to
look more closely at. So next, I'm going to
hand it over to Rick. Thanks. [APPLAUSE] RICK FULTON: Hi, everyone. So I'm Rick Fulton. I am the senior engineering
manager of the simulation platform at Cruise Automation. So I'm going to be
talking a little bit about how we use BigQuery
on the simulation platform. So to introduce the
company, Cruise Automation. So we're building
self-driving cars. Our mission is
that we're building the world's most advanced
self-driving vehicles to safely connect people with places,
things, and experiences they care about and transform
the future of transportation. So, for example,
we're going to be launching a self-driving
rideshare service. So the simulation
platform is my team. So I guess a little context,
to build a self-driving car, one way you could do that
would be to make code changes and put it on the car
and see what happens. Not the most efficient
way to do that. Much better to have really
accurate simulation systems and to test your code
changes on those simulations before you put it on the car. So the simulation
platform is all about accelerating and
making more efficient use of simulations, so that means
faster simulations, more reliable simulations,
being able to run those simulations in more
expansive interesting ways. And then, for the purpose
of this presentation, analyze the results of the
simulations efficiently. So again, the goal of my
team is, within minutes, to be able to determine the
effect of the code change on the AV's behavior and
be able to understand where to make improvements if needed. So again, we are using
BigQuery, and I'm going to talk a little bit
and touch on the points that Jordan was talking
about, about some of the data needs we've run into and
how BigQuery has helped us. So I guess to start
off, we have to handle surprisingly large
amounts of data in a real-time way
in order for us to do the simulations we want. So we're talking about
the number of simulations we're running or generating,
gigabytes per second, billions of rows
a day, and we need to have a data warehouse
solution that's going to be able to support this. The data needs to be
available within minutes because we have AV engineers,
they run their simulations, they want to know what's
the effect on the car. So they want to
have access to that. And then finally,
we've been massively scaling the number of
simulations we run, and so we really need
a solution that is low operational overhead for us. So that went into our
selection of BigQuery, which I'll discuss in a minute. For the purpose of
the presentation, I just want to dig a little
bit into context for typical AV architecture. So this is actually the
Udacity self-driving car course architecture they use, so
it's not necessarily ours. So on the left, we
have sensors, so that's like camera, radar, LiDAR data. So that's raw input
data to the car. It feeds into the
perception system. The perception system is
all about the car reasoning about where it is in the
world, what's around it, where are the cars
around me, where are the people on the bikes. Maybe if there's a car next to
me and I see it has a left turn signal, then maybe I
will predict that it's going to change to a left lane. So that is essentially
the state of the world. And then that feeds
into the planning system, which is, how do I
get from point A to point B? If I need to make an
unprotected left turn, how do I make an unprotected
left turn safely? How do I go through an
intersection, a four-way stop safely? And so that all feeds into
the control system, which is the low level controls. How do I actually
turn the wheels? So just for some color, I
think what's important to note is that we have a ton of
different types of testing frameworks. So for instance-- oh, right. And so this is a picture of
our web visualization platform. So this basically takes
what the car is seeing, and so this is pretty
instrumental in building out our simulations. So some of the types
of simulations we have, so we have a 2D
SIM system, so this is like a top-down view, where
you can see the car executing various maneuvers. And you can put cars
and people and just see how the car would react. There's the same system,
except it's three dimensional. And so this is more of a full
system where we can feed data into the car within
the 3D SIM, and the car doesn't know it's
within a simulation. So we can see that it's kind of
like an end-to-end integration test. Really important would
be sensor replay to us. So we want to feed the sensor
data into the perception system and make sure that the
perception system is reasoning correctly about the world. So given this radar
and LiDAR data, did we accurately identify
all the objects around me? There's also hardware
performance tests. So do we have tests to make
sure the hardware is functioning properly? Do we feel confident
that the hardware is going to react
similarly to what we're running in the cloud? There's many more not worth
mentioning in this presentation but suffice it to say,
there's many different types of simulations we run. Right. So this is a pretty
important point. Simulation testing is hard. It is not your typical
regression testing, where you have pass-fail
tests you run to know if you can merge or deploy. I guess the first
point to talk about is that it's more than
binary pass-fail results. So for instance, you could
see a significant decrease in some metric you care
about, but that might be OK if you see increases in other metrics. So you need to take
a more holistic view of all the different metrics. As you could see in the
previous couple of slides ago, there's many interdependencies
in the stack. So if I'm a LiDAR
engineer and I'm making a change to
a segmentation model to identify objects
through LiDAR, I might see that model is
doing fine or doing better than before, but
maybe it has some kind of bad downstream
effects on some systems, like the planning system. It's really important
to understand how my current iteration,
my current commit is doing in relation
to previous commits. So I want to see,
given a metric, am I doing better compared
to base or over time? And then finally, we want to
be pretty flexible about being able to add new metrics. So if I'm an AV
engineer and I decide that it's useful to compute
some new derived metric, we need a data solution
that can handle that without having to do an
onerous schema migration. I'm going to briefly talk
through our old architecture. There's some obvious
issues with it, so it's not worth
spending too much time on. But you have a code change,
it goes in to GitHub, we have our CI system kick
off, requests the standard set of regression tests
and simulations. They get scheduled and run. We have some kind of
graph compute engine that will change those
results into Avro, and we use Avro as our
data serialization method to put those Avro
tables into S3. So the main point here is that
we just had raw Avro tables. There's no querying
layer, so there's really not a ton we can do here. So there's many types of
queries we can't answer. We can't do average detection
accuracy over a test run, or average over time, or
specific metrics over time. There's really not
a lot we can do. All of that aggregation
has to happen on the front end, so a
significant memory and CPU burden. It doesn't really scale at all
with the number of increasing simulations, and it's
really time consuming to build a front end analysis
tools because they're so bespoke to the
particular use-case. OK. So we chose BigQuery. And the difference here
is that, OK, well we moved to GCS, which
was cool, from S3, but also, when we load those
tables into GCS, then we use, via Pub/Sub, we send a signal
to this simple ingestion service, which will
do some of the ETL to put it into
BigQuery, including abstracting away, adding a
new table into BigQuery so that the AV engineer doesn't
have to worry about that. So it's all taken
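A very rough sketch of that kind of ingestion service, with hypothetical names and not Cruise's actual code: a Pub/Sub subscriber that loads each finished Avro file from GCS into BigQuery and lets BigQuery pick up the schema from the Avro itself.

```python
import json

from google.cloud import bigquery, pubsub_v1

bq = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("my-project", "sim-results-avro")

def handle(message):
    # The upstream job publishes the GCS URI of a finished Avro file.
    event = json.loads(message.data)
    uri = event["gcs_uri"]                               # e.g. gs://sim-results/run-123/*.avro
    table_id = f"my-project.simulation.{event['metric_name']}"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        # Create the table from the Avro schema if it doesn't exist yet,
        # so AV engineers never run a schema migration by hand.
        create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    bq.load_table_from_uri(uri, table_id, job_config=job_config).result()
    message.ack()

# Block and process messages as they arrive.
subscriber.subscribe(subscription, callback=handle).result()
```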
So it's all taken care of for us, and we've just been able to
feed tons of data into BigQuery. So there's been
applications that now we can do things we
couldn't do before, so direct queries,
Jupyter notebooks, BI tools, Looker and Tableau. And I'm going to give
you a specific example. We built out this front
end analysis platform. And so I'm going to talk
about one specific AV metric and the tool we
built on top of BigQuery now that we had
this data in there. So this is an unprotected left
turn, very important maneuver. So what this metric is, is
the selected gap metric, which is the time it takes
between when the car enters the intersection and
makes the left turn, and when the oncoming car
enters the intersection. So it's very
important to make sure that there's a
good cushion here. You don't want to make
an unsafe left turn. So in this case, in
this second picture here, that's when
the car enters, and you can see whatever
that time is up there. And then when the car
enters the intersection, that's about five seconds later. So we want to make
sure that there's a good cushion no matter
what speed the cars coming, no matter what the headway
is between those two cars, so how much temporal
distance there is between car two and car one. So you might think that we want
to have a series of simulations that not just test this in
this particular scenario, but what if the cars
are going faster? What if they're closer together? Are we still going to
make the right decisions? So that's exactly what we did. This is built directly
on top of BigQuery. A lot going on in the slide,
but what we're doing here is that we're
comparing against-- we have our feature branch
we're comparing against base. And what we're doing
here is, these axes-- so one axis is the
oncoming car speed, and the other one
is headway time. And so we want to make sure that
as we modulate these values, that we're still having a nice,
healthy selected gap length. And so, interesting point
here, each one of these cells is itself a simulation,
like a full simulation that takes a long time
to run, generating at least a gigabyte of data. And so we're basically
pulling all of this data easily into BigQuery, and
we have very powerful tools that allow us to analyze how we're doing, how the feature branch is doing against base, and where we still have to improve.
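The kind of query behind that grid might look something like this, as a sketch against a hypothetical results table rather than our actual schema: average the selected gap per cell for the feature branch and for base, then pivot it into the speed-by-headway grid.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical schema: one row per simulation with the branch it ran on,
# the swept parameters, and the resulting selected-gap metric.
query = """
SELECT
  oncoming_speed_mps,
  headway_s,
  AVG(IF(branch = 'feature', selected_gap_s, NULL)) AS feature_gap,
  AVG(IF(branch = 'base',    selected_gap_s, NULL)) AS base_gap
FROM `my-project.simulation.unprotected_left_results`
WHERE scenario = 'unprotected_left'
GROUP BY oncoming_speed_mps, headway_s
ORDER BY oncoming_speed_mps, headway_s
"""
df = client.query(query).to_dataframe()

# Pivot into the speed-by-headway grid the dashboard renders, with the
# feature-vs-base delta in each cell.
grid = df.assign(delta=df.feature_gap - df.base_gap).pivot(
    index="oncoming_speed_mps", columns="headway_s", values="delta")
print(grid)
```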
So, yeah, we've had some really good results so far. So we are ingesting something like half a million rows, a million rows, gigabytes per second. The data is available within minutes, and it's scaled, actually, literally 10x. We looked at the numbers,
and it was the number 10. So we've been very
happy with this. And we expect to scale this
another order of magnitude or two. So we're not quite at
the limits that Jordan was talking about, but we've had no problems ingesting a pretty considerable amount of data so that we can be much
more efficient about how we're doing AV development. So, yeah. In the future, we're
going to continue to scale the number of simulations. We are targeting an order of magnitude or two. So we believe that BigQuery
will be able to support that. We're going to continue to expand and work on more of the simulation tooling we have. We might be looking at
external data storage that Jordan mentioned for
certain kinds of simulations that have particularly
large outputs. And most likely,
we're going to be using some kind
of ML application here, like we might want to take
all the data that's in BigQuery and use that, build some model
to predict on-road performance given the metrics that we
compute from the simulation performance. So that would be really cool. So thank you, everyone. [APPLAUSE] [MUSIC PLAYING]