Choosing your storage and database on Google Cloud Platform

Captions
PAUL NEWSON: Hi. My name is Paul Newson, and I'm a developer advocate for the Google Cloud Platform. Almost all applications need some kind of durable, persistent storage to get their job done. This might be to store information about user accounts, images you want to capture or serve, a time series of events that you want to analyze, or pretty much any type of data you can imagine. Not surprisingly, the Google Cloud Platform can store all of these things. However, because different applications have different storage requirements, the Google Cloud Platform has not one, but several ways of storing persistent data. Today I'm going to introduce each of the storage options in the Google Cloud Platform and talk about their different characteristics, to give you some idea of which ones you should use for your application. Note that depending on your application, you might use one or several of these storage services to get the job done.

Let's start our tour in the Google Developers Console. I should mention that we are constantly improving the Developers Console to add new features and improve usability, so don't be surprised if the console you see looks different from the one shown here. You should be able to perform tasks similar to what I demonstrate here today in future versions of the Developers Console. Here we have a new project I recently created. In the left navigation menu, we see the major categories of services offered by the Google Cloud Platform, including Storage. Under Storage, we see four of the storage services offered in the Google Cloud Platform: Cloud Bigtable, Cloud Datastore, Cloud SQL, and Cloud Storage. There is a fifth storage option that is not listed here, but is instead under the Big Data heading: BigQuery. BigQuery is both a storage service and a powerful analysis service, which is why it is listed under Big Data.

We can divide these five storage services into two categories: structured and unstructured. If the data you want to store can be organized into a table structure with columns and rows, then it is structured data. Examples might be user profile information, event logs, sensor measurements, sales records, or stock trade data. Structured data comes in many different shapes, sizes, and usage patterns, and there is a great diversity of ways to store it and interact with it. So perhaps not surprisingly, all but one of the storage services offered by the Google Cloud Platform are for structured data: Cloud SQL, Cloud Datastore, Cloud Bigtable, and BigQuery all store structured data.

The remaining storage service, Google Cloud Storage, stores unstructured data, by which I mean that you give Cloud Storage a sequence of bytes to store, a place you want to store it, and a name to identify that sequence of bytes, and it stores them for you. Of course, there may be internal structure to that sequence of bytes. For instance, it could be a ZIP archive with a table of contents and individual files inside of it, or it might be a JPEG image file, which has a well-defined format, or even a text file in CSV format organized into rows and columns, which sounds a lot like structured data. The key is that Cloud Storage has no insight into that internal structure. It just faithfully stores and retrieves the exact sequence of bytes you ask it to, regardless of any internal structure that may exist in those bytes. At this point, you might be thinking this unstructured data you're talking about sure sounds a whole lot like files.
And you'd be exactly correct. But Cloud Storage calls them objects instead, not because they're not like files, but because files are generally stored in a file system, and the file systems we're used to interacting with on modern operating systems generally have a hierarchical structure and naming conventions that Cloud Storage does not have. Thus Cloud Storage calls the place you store things a bucket, and the things that you store, objects.

Let's use the Developers Console to get started with Cloud Storage. This project does not currently have any buckets, so let's create one. First you have to choose a name for the bucket. It's important to note that bucket names are global across all projects. If you try to create a bucket with an obvious name, such as "test", you will probably find that the bucket name is not available. As you can see, the Developers Console helpfully checks your bucket name while you're typing it to let you know if it's a legal name and if it's available.

Now we have a legal and available bucket name, but we still have a couple of choices to make. The first is the class of storage to use. Standard storage is the default, and if you're using Cloud Storage as part of an application, it is almost always the right choice. It provides the best latency and highest availability of the storage classes, which is usually what you want. If, on the other hand, you can accept lower availability for the objects you are storing in this bucket, you can choose the durable reduced availability class of storage instead. As the name implies, objects stored with this class of storage are just as durable as standard, but have a reduced availability expectation. The service level agreement for the standard storage class specifies 99.9% availability; the durable reduced availability class specifies 99% availability instead. The trade-off for accepting the reduced availability expectation is a lower price. Pricing is subject to change, so you should consult the current price list for details on the cost of the different classes of storage.

The final class of storage is nearline. Nearline storage is intended for archival scenarios where you do not expect to access the stored objects often. Nearline storage also carries a 99% availability SLA. It also has slightly higher latency than standard or durable reduced availability storage, with an expectation of a few seconds to the first byte, compared with less than a second to the first byte for the other classes of storage. Nearline storage is significantly less expensive than standard or durable reduced availability storage, but there is a surcharge for accessing your objects. A good rule of thumb for when to use nearline storage: if you think you will access an object less than once a month, nearline storage will probably be the more cost-effective option; if you expect to access an object more than once a month, durable reduced availability storage will be more cost-effective.

The second choice is the location where your data will be stored. There are two types of locations available: multi-regional and regional. Regional locations correspond to the regions supported in other Cloud Platform services such as Compute Engine. If you primarily intend to read and write to this bucket from Compute Engine instances in a particular region, then setting the location of the bucket to that same region will provide lower latency and higher throughput for that workload. However, by setting the location to a particular region, you restrict the storage to only that region, which can be a disadvantage if your objects are going to be accessed from many different places. If your objects will be accessed from many different places, then you can choose a multi-region location, which restricts the placement of your data to a broader geographic area. Currently, the choices are Asia, the European Union, and the United States. These multi-region locations can also be useful when you want to ensure your data stays within a certain jurisdiction.

With these choices made, you can now create your bucket. Here we see our new bucket, which is, of course, empty.
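For reference, the same bucket creation can be scripted with the gsutil command-line tool that the video introduces next. This is a minimal sketch, assuming the Google Cloud SDK is installed and authenticated; the bucket names are hypothetical placeholders and must be globally unique, and current gsutil releases may prefer the full storage class names over the short codes used here:

```bash
# Create a bucket with the default (standard) storage class in the US multi-region.
gsutil mb -l US gs://my-example-bucket-12345/

# Create a durable reduced availability bucket pinned to a single region.
gsutil mb -c DRA -l us-central1 gs://my-example-dra-bucket-12345/

# Create a nearline bucket for infrequently accessed archival data.
gsutil mb -c NEARLINE -l US gs://my-example-archive-bucket-12345/
```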
We could use the Developers Console to upload objects. But instead, I would like to quickly show you how to use the command-line utility gsutil that comes with the Google Cloud SDK. I have previously installed the Google Cloud SDK and configured it to point to the project and use my credentials. We can use gsutil to see the bucket we just created using the ls command. gsutil uses a URI syntax for object names, where the gs:// prefix specifies objects stored in Google Cloud Storage. It can also access files in the local file system, and objects stored in Amazon S3 using the s3:// prefix, which is useful for transferring objects between Cloud Storage and the local file system, or between Cloud Storage and Amazon S3. Here I will quickly demonstrate how to upload a large number of objects in a single command. Then we can switch back to the Developers Console, refresh the window, and see those objects in our bucket.
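A session like the one demonstrated might look like this, assuming the hypothetical bucket from earlier and a local directory of files:

```bash
# List the buckets in the current project.
gsutil ls

# Upload a directory full of files in one command.
# -m parallelizes the transfer and -r recurses into the local directory.
gsutil -m cp -r ./photos gs://my-example-bucket-12345/photos/

# List the newly uploaded objects.
gsutil ls gs://my-example-bucket-12345/photos/
```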
Now let's take a quick hands-on tour of the structured storage options, starting with Cloud SQL. When we click on Cloud SQL, we are given the opportunity to create a Cloud SQL instance. We choose a name for the instance, which must be unique within this project, choose a region in which to run the instance, and a size for the instance. There are lots of additional configuration options that I won't go into detail on now, but the one I want to point out is database version. The important thing to notice is that you are selecting an explicit MySQL version. This demonstrates a key point about Cloud SQL: it is a hosted MySQL service. It is not similar to MySQL. It is MySQL. Whatever you do today with MySQL, you can do with Cloud SQL, but Google takes care of keeping the operating system up to date, configuring the application, performing backups, and other ops tasks.

To underscore the point that this really is a MySQL server we started, let's connect to it using the MySQL client. Let's create a new Compute Engine instance we can use to run the MySQL client. Because this instance is part of the same project, no additional network configuration is required. However, we do need to make sure we enable access to Cloud SQL for this instance. We can do this explicitly in Access and Security, or we can allow this instance to access all Cloud resources that are part of this project. Since we may use this instance for more than just running the MySQL client, we will enable access to all the APIs with this checkbox. Once the instance is created, we can SSH into it and install the MySQL client via apt-get.

However, we need to make a couple of configuration changes to our Cloud SQL instance before we can connect to it. First, we need an IPv4 address to connect to. Then, we need to tell Cloud SQL that we're OK with connections coming from the IP address of our Compute Engine instance. In this example, we only want one Compute Engine instance to get access, so we specify a network with a 32-bit prefix, which effectively means a single host. Finally, we need an account that allows us to sign in from our host. The default root account only works from localhost, so we create a new account that allows root to sign in from our instance's IP address.

Now we are ready to connect to our Cloud SQL instance using the MySQL client. At this point, you have a fully functional MySQL prompt, with which you can do all the normal stuff you would expect to be able to do from a root MySQL prompt, such as creating new databases, creating tables, and so on. By now you get the idea that Cloud SQL is a fully functional MySQL database, which is fantastic if what you want is a low-maintenance MySQL instance. It has everything you expect from SQL: a rich query language, primary and secondary indexes, ACID transactions, relational integrity, stored procedures, the works. If you are familiar with MySQL or another relational database, and you value the power of a full relational database, and if your scalability needs are not too great, Cloud SQL could be the perfect solution for your application.
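In today's gcloud syntax, that setup can be sketched from the command line. The instance name, tier, addresses, and password below are hypothetical, and flag spellings vary across gcloud releases, so treat this as a sketch rather than a recipe:

```bash
# Create a MySQL instance (name, tier, and region are placeholders).
gcloud sql instances create my-sql-instance \
    --database-version=MYSQL_5_7 --tier=db-n1-standard-1 --region=us-central1

# Authorize our Compute Engine instance's external IP; the /32 prefix
# limits access to that single host.
gcloud sql instances patch my-sql-instance \
    --authorized-networks=203.0.113.7/32

# Create an account that can sign in from hosts other than localhost.
gcloud sql users create root2 --instance=my-sql-instance \
    --host=% --password=an-example-password

# From the Compute Engine instance: install the client and connect
# to the Cloud SQL instance's IPv4 address.
sudo apt-get install mysql-client
mysql --host=203.0.113.99 --user=root2 --password
```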
However, the so-called NoSQL family of databases was created to address some of the issues relational databases face in maintaining relational integrity and full ACID semantics while attempting to attain massive scale in a cost-effective way. Which brings us to Cloud Datastore.

Cloud Datastore was born as the structured storage solution for App Engine, but is now accessible from outside of App Engine as well. Cloud Datastore scales gracefully from very small database sizes to very large database sizes. It grows with your application, both in terms of scale and by being flexible when it comes to schema definition. For those of you who are accustomed to third normal form with full relational integrity, you may find the NoSQL approach to schema definition to be a little bit fast and loose. And you'd be right. It is fast, by which I mean highly scalable. And it is loose, by which I mean it is highly flexible. There is no equivalent to a CREATE TABLE statement in Cloud Datastore; you simply start storing things. There is no equivalent to an ALTER TABLE statement to add new columns; you just start storing things with additional columns.

We can demonstrate this in the Developers Console. There are a few important things to notice about what we're doing here. While this may look a little like defining a table, it's not. What we're doing is storing the very first entity of its kind in our project's datastore. We can then store a second entity of the same kind, but we can add new properties to it that were not in the first entity. You'll see that when we create a second entity, the console knows which of the first entity's properties were marked as being indexed and asks us for a value for those. But it does not demand that we provide a value for every property that was in the first entity. In fact, you can even create entities that do not have values for properties that were previously identified as being indexed.

At this point, it bears mentioning that just because Cloud Datastore doesn't require you to define your schema in advance doesn't mean you don't need to think very carefully about your data model. You still need to think carefully about what you're planning to store and how you intend to retrieve it. But the NoSQL approach means that when the time comes to store additional information about a kind, which in a SQL world would mean an ALTER TABLE statement to add columns to a table, you can simply start storing new entities of that kind with the additional information you now require. You'll need a plan to either backfill old data or deal with entities that are missing that data, but that's true no matter what your database may look like.

Another thing to notice is that at no point did we tell Datastore how many resources should be used for our structured data. It is a true NoOps solution, with almost no operational overhead: no need to specify instance sizes or cluster sizes or anything of that nature. Datastore is a great fit for an application database that scales from zero to terabytes of data as your application grows in popularity.

However, if your scaling needs are truly massive, there is another NoSQL solution you should consider: Cloud Bigtable. Cloud Bigtable is a hosted version of the same Bigtable technology that Google has been using internally for over a decade. In 2006, Google published a paper describing Bigtable, which resulted in a flurry of activity in the open source community to build similar systems. One of those open source projects inspired by Bigtable was HBase, which is part of the larger Hadoop ecosystem. Cloud Bigtable is exposed through the HBase API, which makes it instantly compatible with a great deal of existing code written for the Hadoop ecosystem. But it is a managed service, which reduces your operational overhead. If you know you're going to be storing over a terabyte of structured data, if you have a very high volume of writes, if you need read and write latency in the single-digit millisecond range with strong consistency, or if you're already using HBase and you want a straightforward migration path to a managed cloud service, Cloud Bigtable is something you should look into.

Let's create our very own Cloud Bigtable cluster to see what that looks like. We have a couple of choices to make at this point. First is to choose the zone in which our cluster will run. Recall that for Cloud SQL, we selected a region, and for Cloud Datastore, we selected nothing; we simply started storing data and let Cloud Datastore worry about where it should go. Cloud Bigtable is a lower-level database than both Cloud SQL and Cloud Datastore, which is one of the reasons it can scale so high and provide such low latency. But to deliver such low latency guarantees, your whole cluster runs in a single zone. The second choice is how many nodes to provision. As we increase the number of nodes, we increase the available operations per second and throughput, and also, of course, the cost. You can go up to 30 nodes without even talking to us. As the Developers Console shows, a 30-node cluster is capable of performing 300,000 operations per second with a throughput of 300 megabytes per second. To put that in perspective, 300 megabytes per second is about 25 terabytes per day, which would add up to about nine petabytes in a year. That's scale.

Similar to how we used the native MySQL client to demonstrate that Cloud SQL really is MySQL, I'm going to demonstrate how we can use the HBase client to connect to our Cloud Bigtable cluster. Here I am using the quickstart script that is described in the Cloud Bigtable Quick Start. Setting up the HBase client can be tricky, and the quickstart packaging script makes sure you have the right things on your machine and understands how to discover and connect to your Cloud Bigtable cluster. We can see that we don't currently have any tables in our cluster, so we create one, then add a row to it, then scan the table to verify the row is there. Learning how to model data in Bigtable and use the HBase API is beyond the scope of this video, but this short example serves to show that Bigtable does indeed play nicely with the HBase toolchain.
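The demo session boils down to a few commands. This sketch assumes the quickstart script from the Cloud Bigtable quickstart is in the current directory; the table, column family, and values are hypothetical:

```bash
# Launch the HBase shell wired to the Cloud Bigtable cluster; the
# quickstart script checks local prerequisites and discovers the
# cluster's connection settings.
./quickstart.sh

# At the resulting prompt (HBase shell syntax, not bash):
#   list                                          # no tables yet
#   create 'my-table', 'cf'                       # one column family, 'cf'
#   put 'my-table', 'row1', 'cf:greeting', 'hello from bigtable'
#   scan 'my-table'                               # verify the row is there
```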
Given the massive scalability provided by Cloud Bigtable, why not make it your number one choice? After all, everyone wants to be prepared for the overwhelming success of their application, so why not choose the most scalable database available? Simply put, Cloud Bigtable does not effectively scale down to small sizes. The smallest Cloud Bigtable cluster you can create still has three nodes and can handle 30,000 operations per second, which is far more than a small application needs. This isn't a problem from a performance standpoint, but it is a problem from a cost-effectiveness perspective: those three nodes are dedicated for your sole use and cost the same per hour whether you use them or not. Contrast this with Cloud Datastore's model, where you pay by the operation, so when your usage is small, so is your cost.

Each of the structured storage solutions we have discussed so far -- Cloud SQL, Cloud Datastore, and Cloud Bigtable -- is primarily an operational database, intended to be used as part of an application. The final structured storage service we're going to talk about is BigQuery, which is an analytical database. The power of BigQuery is its ability to run SQL queries over terabytes of data in seconds. To demonstrate, I will use one of the several public datasets available in BigQuery. This dataset contains 100 billion rows of page view data from the Wikimedia Foundation. By running this row count query, we can see that there are indeed more than 100 billion rows in this table. Now we're going to scan all 100 billion of those rows, run a regular expression against each one, then aggregate the page views for each matching row.
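Similar queries can be run from the command line with the bq tool that ships with the Cloud SDK. This sketch assumes the public wikipedia_benchmark dataset; the exact table and column names are assumptions to verify against the public datasets documentation before running:

```bash
# Count the rows in the (assumed) 100-billion-row page view table.
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`'

# Scan every row, keep titles matching a regular expression,
# and aggregate page views per matching title.
bq query --use_legacy_sql=false \
  'SELECT title, SUM(views) AS total_views
   FROM `bigquery-samples.wikipedia_benchmark.Wiki100B`
   WHERE REGEXP_CONTAINS(title, r"G.*o.*o.*g")
   GROUP BY title
   ORDER BY total_views DESC
   LIMIT 10'
```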
And there we have it. We just scanned over three terabytes of data in less than a minute.

I hope you've enjoyed this whirlwind tour of the storage options in the Google Cloud Platform. We were barely able to scratch the surface of each of these great products, but hopefully you now have a better idea of which ones might be best for your application. [MUSIC PLAYING]

Info
Channel: Google Cloud Tech
Views: 107,292
Rating: 4.9114504 out of 5
Keywords: google cloud platform, cloud storage, cloud computing, gcp, google cloud storage, backup, database, bigtable, nearline, bigquery, big data, datastore, sql
Id: mmjuMyRBPO4
Length: 19min 56sec (1196 seconds)
Published: Mon Aug 29 2016