Introduction and best practices for Cloud Storage

Captions
GEOFFREY NOER: Hello, and welcome to Introduction and Best Practices for Google Cloud Storage. My name is Geoffrey Noer, a product manager for Cloud Storage. And with me today is Lavina Jain, software engineer for Cloud Storage. Today, we'll be walking you through a brief introduction, including some information about how to take advantage of location types and storage classes. Then we'll focus on best practices for two of our most important use cases: content serving, and compute and analytics. And then, finally, some words of wisdom on performance and scaling and how to take advantage of the performance that Cloud Storage has to offer.

So first of all, what is Cloud Storage? Not to be confused with file storage or block storage, Cloud Storage is our object store and is often considered by many to be the native format for storage in the cloud. What we provide is a simple interface with a consistent API, strongly consistent listings, and low latency and equivalent speed across all of our storage classes, which makes for a very smooth experience on the product. Reliability is of utmost importance, both from a durability and an availability standpoint. Cost effectiveness, of course, because you don't want storage to be the majority of your budget. And finally, security, because you want to make sure that you can control who has access to your data, and that your data is always encrypted at rest and in transit.

The foundation, in many respects, of Cloud Storage is a high-performance, high-quality network that Google built initially for its own purposes, for all of the consumer properties that you are used to using. This involves multiple undersea cable investments, large numbers of points of presence across the planet, as well as a substantial number of cloud regions. And this allows you to pick a location that makes sense for your data, with the knowledge that you can distribute your content to any part of the globe that you may need to.

The three primary use cases that we see are, first, content storage and delivery. For example, National Geographic uses Cloud Storage for its content serving purposes. This is ideal because of low latency and the ability to take advantage of all of those edge caches to push content across the globe. Second, we see a lot of use for compute, analytics, and machine learning. This is more the high-performance, inside-the-cloud use of storage. Here, customers like the Broad Institute doing high-performance genomic sequencing and analysis, or Twitter doing analytics on the overall corpus of tweets, are both examples of customers using us for that sort of purpose. And here, the sky's the limit: you can scale up to terabits per second of throughput from a Cloud Storage bucket to make those workloads stream. And then, finally, we have backup and archiving use cases, where it's really more about storing the data long term, and about regulatory compliance. Tools like Bucket Lock, which allow you to guarantee that data will not be deleted for a specified length of time, really help with those needs.

So let's go a little bit deeper into Cloud Storage, because it's really the locations that you choose, and then the storage classes, that determine your frequency of access and, ultimately, the price point that you pay. The first stage in storing data is actually to decide what kind of location type you want to use.
Regions, multi-regions, and dual-regions are all available to you, with a number of different specific location choices within each. Regions are the most basic choice. They allow you to have all of your data in one specific region. For example, you might choose us-central1, or you might choose europe-west1. And once you've chosen that specific location for your bucket, all of your data will be stored there. The advantage is you can then run your VMs or other cloud services and have that compute and storage be co-located for the best possible performance. And regions also provide a very attractive price point.

Now, if you want geo-redundancy for higher availability, protecting you against a particular regional outage, then you can choose either a dual-region or a multi-region. Dual-regions are more like regions in that you know exactly where your data is located; namely, it will be located in the two specific regions that are part of that dual-region. So, again, for analytics and other workloads that are high performance, dual-regions are usually the right choice. With multi-regions, on the other hand, you just know that your data is redundant and stored somewhere within a continent. This is an ideal choice for content serving.

So how do we place the data? In a region, shown as the yellow squares, if you chose Region B as your region, all of your data will be placed in Region B. That's the most straightforward example. In blue is the dual-region. With a dual-region, it's one copy in one region and one copy in the other, so your data is redundant across those two specific regions. And with multi-regions, every object might be replicated differently, so for your entire bucket, you're likely to have data distributed across the continent. And if you're serving to the internet, any additional latency is not really a consideration, because of the internet latencies that will be added on top.

Once you've picked a particular location type, the next choice has to do with your default storage class. In reality, you can have objects in all of the storage classes. So you can start out in Standard, and then also have colder data that is in Nearline, Coldline, or Archive. So what is the difference between the storage classes? It really comes down to a trade-off between price, frequency of access, and length of storage. If we look at the best practices, this is where that becomes clear. If you're going to retain data for more than a year, for example, and only access it once every year or two, that would be an ideal use case for the Archive storage class. In fact, much of the data there is intended never to be read. Maybe it would only be read if there is a regulatory compliance issue that needs to be looked into four years after some logs are written, for example. On the other hand, maybe you're running a frequently accessed analytics workload where you're churning through a very large amount of data quickly; maybe most of your data only lives for a few days, but other objects stay around longer. That's where Standard comes into play. Nearline and Coldline are in the middle. Nearline is usually for data accessed about once a quarter and retained for at least one month, whereas Coldline is for data retained for at least three months and accessed only a few times a year. So you can use this as a good benchmark.
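To make the location and storage class choices concrete, here is a minimal sketch using the google-cloud-storage Python client. The bucket names are hypothetical, and the specific regional, dual-region, and multi-region locations shown are just examples of the kinds of choices described above.

```python
from google.cloud import storage

client = storage.Client()

# Regional bucket co-located with compute; Standard class for hot data.
regional = client.bucket("example-analytics-bucket")  # hypothetical name
regional.storage_class = "STANDARD"
client.create_bucket(regional, location="us-central1")

# Multi-region bucket for continent-wide content serving.
serving = client.bucket("example-serving-bucket")  # hypothetical name
serving.storage_class = "STANDARD"
client.create_bucket(serving, location="US")

# Dual-region bucket (a pair of specific regions) for geo-redundant analytics.
dual = client.bucket("example-dual-bucket")  # hypothetical name
dual.storage_class = "NEARLINE"  # example of a colder default class
client.create_bucket(dual, location="NAM4")  # NAM4 pairs us-central1 and us-east1
```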
So how do you actually move data between these classes over time? This is where you take advantage of object lifecycle management. We provide all of the utilities inside the product to automatically migrate your data from Standard down to Nearline, to Coldline, to Archive. You can do that based on the age of the objects, which is usually very applicable to workloads, because cold data is usually infrequently accessed. You can also manually move particular data sets with a rewrite operation to change the storage class. No matter how you approach it, you can use the multiple storage classes as a way of optimizing your total cost of ownership, so that your cold data is charged the least and your hot data the most, as you would hope for.

Security is critical, and we provide a good amount of flexibility in the product. One thing that you never need to worry about is whether your data is encrypted. Your data is always encrypted at rest and in transit, and that's not even an option. Some other services make you be a little more careful about that; with us, it's always encrypted. The default is that the encryption keys are stored in Google and managed by Google. Then, depending on your level of paranoia and desire for control over the encryption, you can take more and more control: first by still keeping the keys inside the key management service inside Google Cloud, but where you manage the key rotations and revocations. That's kind of a best of both worlds, in some respects. And then, if you want complete control, you can actually supply the keys with every access and have a key server running on premises outside of Google Cloud, in which case Google literally could not access your data, even if you asked Google to. With the other two options, where Google does have access to the keys, any accesses are fully logged. And we never look at data unless there is a support case or other legitimate reason where we do need to go in and access the data, and that would all be fully logged and transparent. So we take both security and privacy extremely seriously.

So with that basic introduction to the framework of location types, storage classes, and security, let's see how those choices apply to the first of our two use cases for today, which is content serving. There are really four primary approaches to content serving. Some people decide to do direct serving. Others use the Google Cloud CDN product as a complement to Cloud Storage. Others combine Cloud Storage with a custom front-end, which might also be used with Cloud CDN. And then, finally, there's the option of using independent CDNs outside of Google, combined with Cloud Storage, to stream or serve the data from those independent third parties.

Direct serving is, in many respects, the easiest approach. It's fully integrated with the product; there's nothing more that you need to do. You can do this straight out of the Cloud Storage offering. And there are some real advantages here, in that we automatically use the premium Google network for distribution and take advantage of edge caching if you turn it on with Cache-Control headers. So go ahead and turn the Cache-Control headers on. That will allow repeated accesses to have significantly lower latency than the first cold access would have had. So it's, in many ways, a built-in content distribution network.
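As a sketch of what turning on Cache-Control looks like with the google-cloud-storage Python client, the snippet below sets the header on a new upload and on an existing object. The bucket and object names are hypothetical, and the max-age values are arbitrary examples; edge caching generally applies to publicly readable objects.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-serving-bucket")  # hypothetical name

# Set Cache-Control before uploading so edge caches can keep copies.
blob = bucket.blob("images/logo.png")             # hypothetical object
blob.cache_control = "public, max-age=3600"
blob.upload_from_filename("logo.png")

# For an object that already exists, update the metadata in place.
existing = bucket.get_blob("images/banner.png")   # hypothetical object
if existing is not None:
    existing.cache_control = "public, max-age=86400"
    existing.patch()
```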
In terms of the location type, we would typically advise using multi-regions, because this provides you with the highest possible availability and basically continent-based serving. That can even mean serving out of, for example, a US multi-region to the globe. Or you might have different multi-region buckets in different continents to serve those populations. It really depends on your use case and how your content is oriented.

What Cloud CDN does is ensure that not all of your reads go back and hit the Cloud Storage service. Only on cold misses, where the content hasn't been accessed recently, does the request go back to Cloud Storage. So you won't have egress charges from Cloud Storage for cached content; those get handled by Cloud CDN instead. It also provides you with custom domains with SSL and more flexibility in the URLs used for access. But, of course, the number one reason is to access the differentiated pricing that's available through Cloud CDN.

Custom front-ends are a way to have an intermediary service between Cloud Storage and the customer. This can be popular, for example, if you want to have an authorization stage before access, or if you need to dynamically transform the data in some way before it streams out from the service. Here, you might want to use a region, or a dual-region, to co-locate the storage with your custom front-end, depending on how much disaster recovery and business continuity you want to build into the model. The dual-region would be the better choice for high availability, but if your service doesn't require that, a region would be sufficient. And you can still combine this with Cloud CDN to take advantage of the caching and the other advantages that I just talked about.

If you're using an independent third-party CDN, this is, again, a case where using a region or a dual-region makes more sense than a multi-region. This is because the whole purpose of using an independent CDN is to have an origin shield in between, which means that only cold content misses end up going back to Google. But you want those to hit the Google region that is in close proximity to where the independent CDN has its setup. So if they're located very close to Iowa, for example, in a data center near there, you might want to use us-central1, which is the Google region located in that same state. Close proximity and low latency between the independent CDN and Google is the first matter of importance for the best performance with an independent CDN. So with that, I'd like to turn the presentation over to Lavina Jain, who will take it from here.

LAVINA JAIN: Thanks, Geoff. I'll talk about how to pick the right location type for compute and analytics workloads, and then share some tips on performance and scalability. Compute and analytics pipelines have several types of data, and your choice of location type really depends on the type of data. First, you have the source data. This is data that is persistent and that you need to load right when you start your pipeline. What is really important here is very high throughput, so you want to co-locate your compute with storage, and using a region or a dual-region location makes sense.
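As one way to pull source data with high aggregate throughput from a co-located regional bucket, here is a minimal sketch that lists objects under a prefix and downloads them in parallel with the google-cloud-storage Python client. The bucket name, prefix, local paths, and worker count are all hypothetical starting points.

```python
import concurrent.futures
import os

from google.cloud import storage

client = storage.Client()
bucket_name = "example-analytics-bucket"  # hypothetical bucket in us-central1
prefix = "source/2020-09-15/"             # hypothetical prefix
local_dir = "/tmp/source"

os.makedirs(local_dir, exist_ok=True)
blobs = list(client.list_blobs(bucket_name, prefix=prefix))

def download(blob):
    # Each download is a single stream; running many in parallel keeps
    # aggregate throughput high even though any one stream is limited.
    dest = os.path.join(local_dir, os.path.basename(blob.name))
    blob.download_to_filename(dest)
    return dest

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for path in pool.map(download, blobs):
        print("downloaded", path)
```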
The second type of data is intermediate data. This is data that is produced by one stage of your pipeline and consumed by another stage. Now, if this data is persistent, then using Cloud Storage makes sense, and the same recommendations of co-locating compute with storage and using a region or dual-region location apply. However, many times this data is very short-lived, and users need high throughput and low latency. In that case, Cloud Storage is actually not the best product for this data; consider using local SSD or persistent disk, for example.

The third type of data is side inputs and staging data. This data is typically read only once per workload, so the choice of location type doesn't really matter much here, because Cloud Storage does a lot of caching. Even if the first read is remote and slow, the subsequent reads will be fast.

And finally, you have the final outputs that your pipeline produces. The choice of location type and storage class for this data really depends on what you plan to do with it. For example, if your pipeline is producing processed images to be served to internet users, then the recommendations that Geoff shared earlier about content-serving workloads apply. Or if these final outputs are fed into another pipeline, then they really act as source data, and the recommendations that I shared earlier about source data apply.

Before moving on, I want to give a quick shout-out to the Cloud Dataproc service. There has been a lot of interest in moving Hadoop workloads into the cloud, and Cloud Dataproc lets you do exactly that. It is a managed Hadoop service that can bring up new clusters really fast. There is also a connector for Cloud Storage that lets you read your data from Cloud Storage directly, without having to first copy it into HDFS. There are a few gotchas to be aware of here. The first one is that Cloud Storage is not a file system; we do not have directory semantics. So you have to get around that by eliminating any dependencies on [INAUDIBLE] directory operations. For example, you could upload your data into a temporary directory and then, when the data set is complete, rename the directory. Or you could load directly into the final directory and, when you're done with the data, write a sentinel file that says, hey, I'm done with this data. Secondly, Cloud Storage latency is typically higher compared to HDFS, especially for small objects. You can get around that by parallelizing small operations or using larger object sizes. And third, as I mentioned before, for temporary data, consider using local SSD.

Now let's talk about performance and scaling. First, let's understand the general performance characteristics of object storage. One thing to remember is that latency for small objects can be relatively high: at the 95th percentile, a 1-byte read could take about 100 milliseconds, and the same applies for writes as well. The strengths of Cloud Storage are really horizontal scalability and single-stream throughput. For a large enough object, you should be able to get about 800 megabits per second for reads and 400 megabits per second for writes. What this means is that you want to favor large object sizes and avoid unnecessary sharding. Now, even if you are able to pack most of your data into large objects, you may still have internal segments of structured data, for example, that typically tend to be smaller in size. See if you can tune those up to be a bit larger; at least 2 megabytes is the recommended size, and larger is, of course, better.
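One way to follow the "favor large objects" advice is to pack many small records into a few multi-megabyte objects rather than writing one tiny object per record. The sketch below, which assumes the google-cloud-storage Python client and hypothetical object names, batches JSON records into newline-delimited objects of roughly 8 MB each.

```python
import json

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-analytics-bucket")  # hypothetical name

def upload_in_batches(records, batch_bytes=8 * 1024 * 1024):
    """Pack small JSON records into newline-delimited objects of a few MB
    each, instead of writing one small object per record."""
    buf, size, part = [], 0, 0
    for rec in records:
        line = json.dumps(rec)
        buf.append(line)
        size += len(line) + 1  # rough size including the newline
        if size >= batch_bytes:
            blob = bucket.blob(f"events/batch-{part:05d}.ndjson")  # hypothetical prefix
            blob.upload_from_string("\n".join(buf), content_type="application/x-ndjson")
            buf, size, part = [], 0, part + 1
    if buf:
        blob = bucket.blob(f"events/batch-{part:05d}.ndjson")
        blob.upload_from_string("\n".join(buf), content_type="application/x-ndjson")
```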
And then, if you do have to use small object sizes, parallelize your requests to get around the inherent latency of the system.

Now let's talk about how to get the best performance out of your writes. There are three different write strategies that you can choose from. The simplest one is non-resumable upload, where you upload your entire object using a single request. The second one is resumable upload, where you first send a request to start a resumable upload session, and then send subsequent requests to upload data using the same session ID. If one of the upload requests fails, you can query the server to find the last offset that it successfully committed, and then resume writing from that offset. The third approach is parallel upload, where you split your object into multiple parts, upload those parts in parallel, and then compose. Compose is mostly a metadata-only operation and does not involve rewriting bytes. I say mostly because it does cause data deletion, and, depending on certain options that you pick, it may result in rewriting data. So be careful with the options that you pick there.

The recommendation here is to use non-resumable upload by default. The bar to switch over to resumable upload or parallel upload is really high. That's because resumable upload has a drawback: uploading every single object takes a minimum of two requests. If the first request to start the session takes about 100 milliseconds, that could be a really large part of your upload time for small objects. The rule of thumb here is to multiply your expected throughput by 30 seconds; that gives you the cutoff in terms of object size where it makes sense to switch over to resumable upload. To give you some concrete numbers: if you have clients uploading over cable at a 4 Mbps upload speed, then the cutoff object size is 15 megabytes. Or if you have clients on a VM getting the full 400 Mbps upload speed, then you shouldn't really have to use resumable upload unless you have objects as large as 1.5 gigabytes or larger. The bar to switch to parallel upload is even higher: you shouldn't need it unless you're uploading from a VM and need more than 400 megabits per second. I do want to call out a gotcha here: compose really only makes sense in the Standard storage class, because in the colder classes it could trigger early deletion charges.
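Here is a minimal sketch of the parallel upload-and-compose strategy described above, using the google-cloud-storage Python client. The bucket and object names are hypothetical, and splitting the source into chunks is left as a placeholder.

```python
import concurrent.futures

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-analytics-bucket")  # hypothetical bucket

# Placeholder: split your source into byte chunks locally (at most 32,
# since a single compose call accepts up to 32 source objects).
chunks = [b"first part...", b"second part..."]

def upload_part(args):
    index, data = args
    part = bucket.blob(f"staging/final.bin.part-{index:03d}")  # hypothetical names
    part.upload_from_string(data)
    return part

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(upload_part, enumerate(chunks)))

# Compose stitches the parts into one object without re-uploading the bytes;
# as noted above, this is best limited to the Standard storage class.
final = bucket.blob("final.bin")
final.compose(parts)

# Clean up the temporary part objects.
for part in parts:
    part.delete()
```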
Lastly, I want to talk about scaling and hotspotting. Cloud Storage generally scales really well, and you don't have to worry about object [INAUDIBLE] or object sizes. There is one case, though, where we have seen users run into scaling bottlenecks, and that's localized hotspotting. To explain this, let me first describe what happens under the hood, and then go over recommendations on how to get around it.

Cloud Storage uses a sharding strategy called range-based sharding. We store your objects across multiple servers, and the way we distribute objects to servers is by splitting the object namespace into ranges. In this example, any objects that begin with a letter between E and G go into the fourth shard, objects beginning with a letter between G and S go into the second shard, and so on. What this allows us to do is provide consistent listing. Consider a user request to list all objects that start with D/. This request will be routed to the third shard, and looking up this one shard is enough to find all objects starting with D.

Now, consider an alternate sharding strategy called hash-based sharding. Here the shards are assigned to servers based on the hash of the object name. This distributes the objects all over the place, but the same listing operation would now have to look up every single shard. Worse yet, if one of the shards is slow, the latency of the entire list operation goes up. An even worse scenario: imagine what happens if one of the shards fails. Then we could either return incomplete results or fail the entire list operation. So hash-based sharding doesn't scale well when we have to provide consistent listing. Hence, we chose to go with range-based sharding.

Now, one of the shards could get very hot, and we handle that by doing auto-scaling: we detect hot shards and split them up. To give you a sense of when this behavior really kicks in, you can easily do up to 1,000 object writes per second or 5,000 object reads per second before running into this scenario. Also, we typically try to detect hot shards ahead of time and split them up so you don't notice anything, but you can expect a delay of up to 20 minutes. The key to making this work really well is to distribute the load evenly across key ranges. For the majority of cases, auto-scaling provides the scalability that you need without any other considerations. But there is one pattern that trips this up, and that is sequential access.

So let's talk about some of the object naming patterns that can cause sequential access. The first one is using a year, month, and day format. The second is generating order numbers or user IDs using monotonically increasing or sequential numbers. And the third one is using a timestamp. You can get around this by choosing alternate object naming schemes. For example, you can pick a part of your object name that is more variable and bring it up front; the log type, in this example. So if you're accessing different kinds of logs on a single day, those accesses would be distributed over multiple shards. Don't use a sequential number if you don't have to; you could use a UUID instead. And the third approach is to hash the object name, or part of the object name, and then prefix the object name with the hash. The drawback of this approach is that you cannot do in-order listing. So if you care about in-order listing, you could map the hash to a fixed number of values, 10, for example, and then, to list in order, do a [INAUDIBLE] look-up and merge the results. And finally, I want to add that you don't always have to change the object naming scheme. In the case of bulk operations, for example, you could simply make sure that all your accesses are distributed randomly and evenly.

So to summarize: we talked about some common use cases of Cloud Storage, shared some best practices on how to pick the right storage class and location type, and finally shared some tips on how to get the best performance and scalability out of Cloud Storage. Thank you.
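As a closing illustration of the naming recommendations above, here is a small sketch of prefixing sequential object names with a hash mapped to a fixed number of values, so that consecutive writes spread across key ranges while in-order listing stays possible with a bounded fan-out. The names are hypothetical.

```python
import hashlib

def hotspot_resistant_name(original_name: str, prefixes: int = 10) -> str:
    """Prefix a sequential object name with a short hash so that consecutive
    writes land on different key ranges (shards)."""
    digest = hashlib.md5(original_name.encode("utf-8")).hexdigest()
    prefix = int(digest, 16) % prefixes  # map the hash to a fixed number of prefixes
    return f"{prefix}/{original_name}"

# Sequential, timestamp-style names spread across 10 prefixes.
print(hotspot_resistant_name("logs/2020/09/15/000001.log"))  # hypothetical name
print(hotspot_resistant_name("logs/2020/09/15/000002.log"))
```

Listing in order then means listing each of the 10 prefixes and merging the results, which matches the bounded look-up-and-merge approach mentioned in the talk.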
Info
Channel: Google Cloud Tech
Views: 16,758
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: WSwqYGln-vU
Length: 27min 19sec (1639 seconds)
Published: Tue Sep 15 2020