GEOFFREY NOER:
Hello, and welcome to Introduction and Best
Practices for Google Cloud Storage. My name is Geoffrey
Noer, a product manager for Cloud Storage. And with me today is Lavina
Jain, a software engineer for Cloud Storage. Today, we'll be walking you through a brief
introduction, including some information about how
to take advantage of location types and storage classes. Then we'll focus
on best practices for two of our most important
use cases-- content serving, and compute and analytics. And then, finally,
some words of wisdom on performance and
scaling and how to take advantage of the performance
that cloud storage has to offer. So first of all, what
is cloud storage? So, not to be confused with
file storage or block storage, cloud storage is
our object store and is often considered by many to be the native format
for storage in the cloud. What we provide is
a simple interface with a consistent API,
strongly consistent listings, and low latency and equivalent
speed across all of our storage classes, which makes for
a very smooth experience on the product. Reliability is of
utmost importance, both from a durability as
well as from an availability standpoint. Cost effectiveness,
of course, as you don't want storage to be
the majority of your budget. And then, finally, security,
because, of course, you want to make sure
that you can control who has access to your
data, and to make sure that your data is
always encrypted at rest and in transit. The foundation, in many
respects, of cloud storage is a high-performance,
high-quality network that Google built initially for its own purposes, for all of the consumer properties that you are used to using. And so this involves multiple undersea cable investments, large numbers of points of presence across the planet, as well as a substantial
number of cloud regions. And this allows
you to really pick a location that makes sense for
your data with the knowledge that you can
distribute your content to any part of the globe
that you may need to. The first of the three primary use cases that we see is content storage and delivery. So for example,
National Geographic uses cloud storage for its
content serving purposes. And this is ideal because of
low latency and the ability to take advantage of
all of those edge caches to push content
across the globe. We also see a lot of use for compute, analytics, and machine learning. So this is more the high-performance, inside-the-cloud use of storage. Here, customers like
the Broad Institute doing high-performance
genomic sequencing, analysis, or for that matter,
Twitter doing analytics on the overall
corpus of tweets are both examples of customers using
us for that sort of purpose. And here, the sky's the limit. You can scale up to
terabits per second of throughput from a
cloud storage bucket to make those workloads stream. And then, finally, we have
backup and archiving use cases where it's really more about
storing the data long term and meeting regulatory compliance requirements. Tools like Bucket Lock allow you to guarantee that data will not be deleted for a specified length of time, and those sorts of facilities really help with those needs.
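As a rough sketch of what Bucket Lock looks like with the Python client library (the bucket name and retention period are hypothetical, and locking a retention policy is permanent, so treat this purely as an illustration):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-archive-bucket")  # hypothetical bucket

# Require that every object be retained for at least 4 years.
bucket.retention_period = 4 * 365 * 24 * 60 * 60  # seconds
bucket.patch()

# Optionally lock the policy; once locked, the retention period
# can never be reduced or removed for this bucket.
bucket.lock_retention_policy()
```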
So let's turn first to storage classes, to go a little bit deeper into Cloud Storage, because it's really about the locations that you choose and then the storage classes that determine your frequency of access and, ultimately, the price point that you pay. So the first stage
in storing data is actually to decide what
kind of a location type you want to use. Regions, multi-regions
and dual-regions are all available
to you with a number of different specific
location choices within each. So regions are the
most basic choice. They allow you to
have all of your data in one specific region. So for example, you might choose us-central1, or you might choose europe-west1. And once you've chosen
that specific location for your bucket, all of your
data will be stored there. So the advantage is you can
then run your VMs or other cloud services and have that compute
and storage be co-located for the best
possible performance. And regions also provide a
very attractive price point. Now, if you want georedundancy
for a higher availability protecting you against
particular regional outages, then what you can do is you
can choose either a dual-region or multi-region. Dual-regions are
more like regions in that you know exactly
where your data is located. And, namely, they'll be located
in the two specific regions that are part of
that dual-region. And so, again, for analytics
and other workloads that are high
performance, dual-regions are usually the right choice. With multi-regions, on the other hand, you just know that your data is redundant and stored somewhere within a continent. And this is an ideal choice for content serving.
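To make the location choice concrete, here's a minimal sketch with the Python client library; the bucket names are hypothetical, and the location codes shown are just one example each of a region, a dual-region, and a multi-region:

```python
from google.cloud import storage

client = storage.Client()

# Regional bucket: data stays in one region, ideal for co-locating with compute.
client.create_bucket("my-regional-bucket", location="us-central1")

# Dual-region bucket: georedundant across two specific regions.
client.create_bucket("my-dual-region-bucket", location="NAM4")

# Multi-region bucket: redundant somewhere within a continent, good for serving.
client.create_bucket("my-multi-region-bucket", location="US")
```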
So how do we place the data? So in a region, which is the yellow squares, if you chose Region B as your region, all of your data will be placed in Region B. So
that's the most straightforward example. In blue is the dual-region. And with the dual-region,
it's one copy in one region and one copy in the other. So your data is redundant across
those two specific regions. And then with
multi-regions, every object might be replicated differently. And so for your
entire bucket, you're likely to have data distributed
across the continent. And if you're serving
to the internet, any additional
latency is not really a consideration because
of the internet latencies that will be added to it. So once you've picked a particular location type, the next choice has to
do with your default storage class. And in reality, you
can have objects that are in all of
your storage classes. So you can start
out in standard, and then also have colder data
that is in nearline, coldline, or archive. So what is the difference
between the storage classes? It really comes
down to a trade-off between price and frequency of
access and length of storage. And so if we look at
the best practices, this is where that
becomes clear. If you're going to retain data
for, for example, more than a year and only access it once every year or two, that would be an ideal use case
for the archive storage class. And, in fact, much
of the data there is intended never to be read. Maybe it would only
be read if there is a regulatory
compliance issue that needs to be looked
into four years after some logs are
written, for example. But on the other
hand, maybe you're doing a really frequently
accessed analytics workload where you're churning through
a very large amount of data quickly. Maybe your data is only
living for a few days, but other objects are
staying around for longer. That's where standard
comes into play. And then, of course, nearline
and coldline are in the middle. So nearline is usually for data you access about once a month and retain for at least one month, whereas coldline is for data you retain for at least three months and access only a few times a year. So you can use this
as a good benchmark. So then what do you do? Well, this is where you take advantage of object lifecycle management. We provide all of the utilities inside the product to automatically migrate your data from standard down to nearline, coldline, and archive. And you can do that based on the age of the objects, which is usually very applicable to workloads because cold data is usually infrequently accessed. You can also manually move particular data sets with a storage rewrite to change their storage classes.
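Here's a minimal sketch of both approaches with the Python client library; the bucket name, object name, age thresholds, and storage-class progression are hypothetical choices for illustration:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-data-bucket")  # hypothetical bucket

# Lifecycle rules: downgrade objects by age, then eventually delete them.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)
bucket.add_lifecycle_delete_rule(age=5 * 365)
bucket.patch()

# Manually rewrite a single existing object into a colder class.
blob = bucket.blob("reports/2018/summary.csv")
blob.update_storage_class("COLDLINE")
```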
So no matter how you approach it, you can use the multiple storage classes as a way of optimizing your total cost of ownership, so that your cold data is charged the least and your hot data the most, as you would hope for. Security is critical, and
we provide a good amount of flexibility in the product. So one thing that you'd
never need to worry about is whether your
data is encrypted. So your data is always encrypted
at rest and in transit, and there's not even an option to turn that off. With some other services, you have to be a little bit more careful about that. With us, it's always encrypted. The default is that the
keys for the encryption are actually stored in
Google and managed by Google. And then, depending on your
level of paranoia and desire for control over
the encryption, you can take more and
more control, first by still keeping the keys
inside the key management service inside Google
Cloud, but where you manage the key
rotations and revocations. So that's kind of a best of
both worlds, in some respects. And then, if you want
full, complete control, you can actually supply
the keys with every access and have a key server running
on premises outside of Google Cloud, in which case Google
really literally could not access your data, even
if you asked Google to.
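As a sketch of those three levels of control using the Python client library; the bucket, object names, KMS key path, and raw key bytes are all hypothetical placeholders:

```python
import os
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-secure-bucket")  # hypothetical bucket

# 1. Google-managed keys: nothing to do, encryption at rest is automatic.
bucket.blob("default-encrypted.txt").upload_from_string("hello")

# 2. Customer-managed keys (CMEK): keys live in Cloud KMS, you control rotation.
kms_key = "projects/my-project/locations/us/keyRings/my-ring/cryptoKeys/my-key"
bucket.blob("cmek-encrypted.txt", kms_key_name=kms_key).upload_from_string("hello")

# 3. Customer-supplied keys (CSEK): you pass the raw AES-256 key on every access.
csek = os.urandom(32)  # in practice, fetched from your own key server
bucket.blob("csek-encrypted.txt", encryption_key=csek).upload_from_string("hello")
```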
With the other two, where Google does have access to the keys, any accesses are fully logged. And we never look at data
unless there is a support case or other legitimate
reason where we do need to go in
and access the data. And again, that would all be
fully logged and transparent. So we take security and privacy
both extremely seriously. So with that basic
introduction to the framework of location types and
storage classes and security, let's see how those choices
apply to the first of our two use cases that we're going to
discuss today, which is content serving. So there are really four primary approaches to content serving. Some people decide
to do direct serving. Others use the Google Cloud
CDN product as a complement to cloud storage. Others combine cloud storage
with a custom front-end, which also might be
used with Cloud CDN. And then, finally,
there's the opportunity to use independent
CDNs outside of Google to combine with cloud
storage to actually stream or serve the data from those
independent third parties. So for direct serving,
this is, in many respects, the easiest approach. It's fully integrated
with the product. There's nothing more
that you need to do. You can do this just straight
out of the cloud storage offering. And here there are some real advantages, in that we automatically
use the premium Google network for distribution
and take advantage of the edge caching
if you turn it on with Cache-Control headers. So go ahead and turn Cache-Control headers on.
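For example, a quick sketch with the Python client library (bucket and object names are hypothetical, and the object is assumed to already exist) that sets Cache-Control metadata so edge caches can serve repeated reads:

```python
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-content-bucket").blob("images/hero.jpg")

# Allow edge and browser caches to hold this object for up to an hour.
blob.cache_control = "public, max-age=3600"
blob.patch()
```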
That will allow repeated accesses to have significantly lower latency than, of course, the first cold access would have had. So it's, in many ways, a
built-in content distribution network. In terms of the location
type, we would typically advise using multi-regions
because this provides you with the highest possible
availability and basically continent-based serving. And that actually can
even be serving out, for example, a US
multi-region to the globe. Or you might actually have different multi-region buckets in different continents to
serve those populations. It really depends
on your use case and how your
content is oriented. What Cloud CDN does is ensure that all of your reads don't go back and hit the Cloud Storage service. Only on cold misses, where you haven't recently accessed that content, does it go back to Cloud Storage. So you won't have any egress
charges from cloud storage. Those will all get handled
by Cloud CDN instead. It also provides you
with custom domains with SSL and some more flexibility in the URLs used for access. But, of course, the
number one reason is to access the
differentiated pricing that's available through Cloud CDN. Custom front-ends
are a way that you can have an intermediary
service between cloud storage and the customer. This can be popular,
for example, if you want to have
an authorization stage before access, or if you need to dynamically transform the data in some way before it streams out from the service.
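One way an authorization stage is often built, shown here only as an illustrative sketch with hypothetical names, is for the front-end to verify the user itself and then hand back a short-lived signed URL instead of proxying the bytes:

```python
from datetime import timedelta
from google.cloud import storage

# After your front-end has checked the user's permissions, return a signed URL
# so the client fetches the object directly for a limited time window.
# Requires service-account credentials capable of signing.
client = storage.Client()
blob = client.bucket("my-content-bucket").blob("videos/episode-01.mp4")

url = blob.generate_signed_url(
    version="v4",
    expiration=timedelta(minutes=15),  # link is only valid briefly
    method="GET",
)
print(url)
```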
So here it depends: you might want to use a region, or you might want to use a dual-region, to co-locate the storage with your custom front-end, depending on how much disaster recovery and business continuity you want to build into the model. The dual-region would
be the better choice for high availability. But if your service
doesn't require that, a region would be sufficient. And then you can still
combine this with Cloud CDN to take advantage of the
caching and the other advantages that I just talked about. If you're using an
independent third party CDN, this is, again, an example where
using a region or a dual-region makes more sense
than a multi-region. And this is because
the whole purpose of using an independent
CDN is to have an origin shield between the two that
means that only cold content hits will end up
going back to Google. But you want those to be
hitting the Google region that is in close proximity to
where the independent CDN has their setup. So if they're located very
close to Iowa, for example, in a data center near there, you might want to use us-central1, which is the Google region that is located in that same state. So trying to have close
proximity and low latency between the independent
CDN and Google is the first matter
of importance for the best performance
with an independent CDN. So with that, I'd like to
turn the presentation over to Lavina Jain, who
will take it from here. LAVINA JAIN: Thanks, Geoff. So I'll talk about how to
pick the right location type for compute and
analytics workloads, and then share some tips on
performance and scalability. Compute and analytics pipelines
have several types of data, and your choice of
location type really depends on the type of data. First, you have the source data. This is data that is
persistent and that you need to load up right when you start your pipeline. What is really important
here is very high throughput. So you would want to co-locate
your compute with storage, and using a region or
a dual-region location makes sense. The second type of data
is intermediate data. This is data that is produced
by one stage of your pipeline and consumed by another stage. Now, if this data is persistent,
then using cloud storage makes sense, and the
same recommendations of co-locating
compute with storage and using a region or a
dual-region location apply. However, many times
this data is very short lived and users need high
throughput and low latency. In that case,
actually, cloud storage is not the best product that
you could use for this data. Consider using local SSD or
persistent disk, for example. The third type of data is
side inputs and staging data. This data is typically read
only once per workload. So choice of location
type doesn't really matter much here because cloud
storage does a lot of caching. So even if the first read is remote and slow, the successive reads will be fast. And finally, you
have final outputs that your pipeline produces. The choice of location type
and storage class for this data really depends on what you
plan to do with the data. For example, if your pipeline
is producing processed images to be served to internet users, then really the recommendations that Geoff shared earlier about content-serving workloads apply. Or if these final outputs are
fed into another pipeline, then it really acts as source
data, and the recommendations that I shared earlier
about source data apply. So before moving on, I want to give a quick shout-out to the Cloud Dataproc service. There has been a lot of interest in moving Hadoop workloads into the cloud, and Cloud Dataproc lets you do exactly that. It is a managed Hadoop service that can bring up new clusters really fast. There is also a connector for Cloud Storage that lets you read your data from Cloud Storage directly, without having to first copy it into HDFS.
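For example, with the Cloud Storage connector available on a Dataproc cluster, a PySpark job can read gs:// paths directly; the bucket, path, and column name here are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-read-example").getOrCreate()

# Read input straight from Cloud Storage instead of copying it into HDFS first.
df = spark.read.csv("gs://my-analytics-bucket/source-data/*.csv", header=True)
df.groupBy("country").count().show()
```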
There are a few gotchas to be aware of here. The first one is that Cloud Storage is not a file system. We do not have
directory semantics. So you have to get around that by eliminating any dependencies on [INAUDIBLE] directory operations. For example, you could upload your data into a temporary directory and then, when the data set is complete, rename the directory. Or you could load directly into the final directory and, when you're done with the data, write a sentinel file that says, hey, I'm done with this data.
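A minimal sketch of that sentinel-file pattern with the Python client library; the bucket, prefix, and sentinel name are hypothetical:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-analytics-bucket")

# Write output objects directly under the final prefix...
for i in range(3):
    bucket.blob(f"output/run-42/part-{i:05d}").upload_from_string(f"data {i}")

# ...then mark the data set complete with an empty sentinel object.
# Downstream jobs wait for this object instead of relying on a directory rename.
bucket.blob("output/run-42/_SUCCESS").upload_from_string("")
```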
Secondly, Cloud Storage latency is typically higher compared to HDFS, especially for small objects. So you could get around that by
before for temporary data, consider using local SSD. Now let's talk about
performance and scaling. So first, let's understand
the general performance characteristics
of object storage. One of the things
to remember here is that latency
for small objects could be relatively high. At 95th percentile,
a 1-byte read could take about
100 milliseconds. And the same applies for writes as well. The strengths of Cloud Storage are really horizontal scalability and single-stream throughput. So for a large enough object, you should be able to get about 800 megabits per second for reads and 400 megabits per second for writes. What this means
is you would want to favor large objects sizes
and avoid unnecessary sharding. Now, even if you are able
to pack most of your data into large objects,
you may still have those internal
segments or structured data, for example, that typically
tend to be smaller in size. See if you can tune those
up to be a bit larger. At least 2 megabytes is
the recommended size. Larger is, of course, better. And then, if you have to use small object sizes, then parallelize your requests to get around the inherent latency of the system.
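If you are stuck with many small objects, a simple thread pool hides much of that per-request latency. Here's a minimal sketch with the Python client library; the bucket, object names, and worker count are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-data-bucket")
names = [f"shards/part-{i:05d}" for i in range(200)]

def fetch(name):
    # Each download pays roughly 100 ms of latency, but the requests overlap.
    return bucket.blob(name).download_as_bytes()

with ThreadPoolExecutor(max_workers=32) as pool:
    contents = list(pool.map(fetch, names))
```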
Now let's talk about how to get the best performance out of your writes. There are three different
write strategies that you can choose from. The simplest one is called
non-resumable upload, where you can upload your entire
object using a single request. The second one is
called resumable upload, where you first send a request
to start a resumable upload session, and then send
subsequent requests to upload data using the same session ID. Now, if one of the
upload requests fails, you can query the
server to find the last offset that it could
successfully commit, and then resume writing from that offset. The third approach is called parallel upload, where you split your object into multiple parts, upload those parts in parallel, and then compose them.
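Here's a rough sketch of that flow with the Python client library; the bucket, local file, and object names are hypothetical, and a real implementation would upload the parts from parallel workers rather than a simple loop:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-upload-bucket")  # hypothetical bucket

# Split a local file into three parts and upload them as separate objects.
with open("bigfile.bin", "rb") as f:  # hypothetical local file
    data = f.read()
chunk = len(data) // 3 + 1
parts = []
for i in range(3):
    blob = bucket.blob(f"parts/bigfile.part{i}")
    blob.upload_from_string(data[i * chunk:(i + 1) * chunk])
    parts.append(blob)

# Compose the parts into the final object; this is mostly metadata-only.
bucket.blob("bigfile.bin").compose(parts)

# Delete the temporary parts. In non-standard storage classes this deletion
# can trigger early-deletion charges, which is the gotcha mentioned later.
for blob in parts:
    blob.delete()
```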
Compose is mostly a metadata-only operation and does not involve rewriting bytes. I say mostly because it
does cause data deletion. And then, depending on
certain options that you pick, it may result in rewriting data. So be careful with the
options that you pick there. So the recommendation here is
to use non-resumable upload by default. The
bar to switch over to start using resumable
upload or parallel upload is really high. That's because
resumable upload has a drawback that uploading
every single object takes a minimum of 2 requests. Now, if the first request
to start the session takes about 100
milliseconds, that could be a really large
part of your upload time for small objects. So the rule of thumb here is to multiply your expected throughput by 30 seconds, and that gives you the cutoff in terms of object size at which it makes sense to switch over to using resumable upload.
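That rule of thumb is just arithmetic; a tiny sketch:

```python
def resumable_cutoff_bytes(throughput_mbps: float, window_seconds: int = 30) -> float:
    """Objects larger than roughly throughput x 30 s are worth a resumable upload."""
    return throughput_mbps * 1_000_000 / 8 * window_seconds

print(resumable_cutoff_bytes(4) / 1e6)    # ~15 MB for a 4 Mbps cable client
print(resumable_cutoff_bytes(400) / 1e9)  # ~1.5 GB for a 400 Mbps VM client
```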
To give you some concrete numbers, if you have clients uploading over cable at 4 Mbps upload speed, then the cutoff object size is 15 megabytes. Or if you have clients on a VM getting a full 400 Mbps upload speed,
then you shouldn't really have to use a resumable
upload unless you have objects as large as 1
and 1/2 gigabytes or larger. The bar to switch to parallel
upload is even higher. You shouldn't need it unless
you're uploading using a VM and need more than 400
megabits per second. I do want to call
out a gotcha here: compose really only makes sense in the standard storage class, because otherwise it could trigger early-deletion charges. Lastly, I want to talk about
scaling and hotspotting. Cloud Storage generally
scales really well, and you don't have to worry
about object [INAUDIBLE] and object sizes, though
there's one case where we have seen users run into
some scaling bottlenecks, and that's localized
hotspotting. To explain this,
let me first explain what happens under the hood,
and then go over recommendations on how to get around it. Cloud Storage uses
a sharding strategy called Range-Based Sharding. So we store your objects
across multiple servers. And the way we distribute
your objects to servers is by splitting the object
namespace into ranges. In this example
here, any objects that begin with a
letter between E and G will go into the fourth shard. Objects beginning with
a letter between G and S will go into the second
shard, and so on. What this allows us to do is
provide consistent listing. So consider a user request to list all objects that start with D/. This request will be routed to the third shard, and looking up this one shard is enough to find all objects starting with D.
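For example, with the Python client library (bucket name hypothetical), a prefix listing like this maps onto a contiguous key range:

```python
from google.cloud import storage

client = storage.Client()

# Strongly consistent listing of everything under the "D/" prefix;
# with range-based sharding this touches only the shards owning that range.
for blob in client.list_blobs("my-data-bucket", prefix="D/"):
    print(blob.name)
```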
Now, consider an alternate sharding strategy, which is called hash-based sharding. Here the shards are
assigned to servers based on the hash of the object name. This distributes the
objects all over the place. But the same listing
operation would now have to look up
every single shard. Worse yet, if one of
the shards is slow, then the latency of the
entire list operation goes up. An even worse scenario is: imagine what happens if one of the shards fails. Then we could either
return incomplete results or fail the entire
list operation. So hash-based sharding
doesn't really scale well when we have to
provide consistent listing. Hence, we chose to go
with range-based sharding. Now, one of the shards
could get very hot, and we handle that by
doing auto scaling. So we detect hot shards
and split them up. So I want to give you a sense of when this behavior really kicks in. So you can easily do up to 1,000 object writes per second or 5,000 object reads per second before really
detect hot shards ahead of time and split them up so you
don't notice anything. But you can expect a
delay of up to 20 minutes. The key thing to
making this work really well is really to distribute the
load evenly across key ranges. For the majority of cases, auto
scaling provides the scalability that you need without
any other considerations. But there is one pattern
that trips this up, and that is sequential access. So let's talk about some of
the object naming patterns that can cause sequential access. The first one is using
a year, month, and day format. The second is generating
order numbers or user IDs using monotonically increasing
or sequential numbers. And the third one is
using a timestamp. So you can get around it by
choosing alternate object naming schemes. For example, you can pick
a part of your object name that is more variable
and bring it up front. Log type, in this example. So if you're accessing different
kinds of logs on a single day, those accesses
would be distributed over multiple shards. Don't use a sequential
number if you don't have to. You could use a UUID instead. And the third approach is to hash the object name, or part of the object name, and then prefix the object name with that hash.
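Here's a small sketch of that hashing approach in Python; the naming scheme and the choice of mapping to 10 prefixes are hypothetical:

```python
import hashlib

def hashed_name(object_name: str, buckets: int = 10) -> str:
    """Prefix the object name with a short hash (mapped to a fixed number of
    prefixes, e.g. 10) so writes spread across key ranges. The hash is only
    for distribution, not security."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    prefix = int(digest, 16) % buckets
    return f"{prefix}/{object_name}"

# e.g. "logs/2019-04-10/access.log" -> "3/logs/2019-04-10/access.log"
print(hashed_name("logs/2019-04-10/access.log"))
```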
The drawback of this approach is that you cannot do in-order listing. So if you care about
in-order listing, then you could map the
hash to a fixed number. 10, for example. And then, to list in order, you could do a [INAUDIBLE] look-up and merge the results. And finally, I want to
add that you don't always have to change the
object naming scheme. In case of bulk
operations, for example, you could simply make sure
that all your accesses are distributed randomly and evenly. So to summarize, what we
talked about some common use cases of Cloud Storage,
shared some best practices on how to pick the right
storage class and location type, and finally shared some tips on
how to get the best performance and scalability out
of Cloud Storage. Thank you.