PAUL NEWSON: Hi. My name is Paul Newson, and
I'm a developer advocate for the Google Cloud platform. Almost all applications
need to use some kind of persistent durable
storage to get their job done. This might be to
store information about user accounts, images
you want to capture or serve, the time series of events
that is happening that you want to analyze, or pretty
much any type of data you can imagine. Not surprisingly, the
Google Cloud platform can store all these things. However, because
different applications have different
storage requirements, the Google Cloud platform
has not one, but several ways of storing persistent data. Today I'm going to introduce
each of the storage options in the Google Cloud
platform and talk about their different
characteristics to give you some
idea of which ones you should use for
your application. Note that depending
on your application, you might use one or several
of these storage services to get the job done. Let's start our tour in the
Google Developers console. I should mention that
we are constantly improving the Developers
console to add new features and improve usability. So don't be surprised
if the console you see looks different from
the one shown here. You should be able to
perform similar tasks to what I demonstrate here
today in future versions of the Developers console. Here we have a new project
I recently created. In the left navigation menu,
we see the major categories of services offered by
the Google Cloud platform, including Storage. Under Storage, we see four of
the storage services offered in the Google Cloud
platform-- Cloud Bigtable, Cloud Datastore,
Cloud SQL, and Cloud Storage. There is a fifth storage
option that is not listed here, but is instead under the
Big Data heading-- BigQuery. BigQuery is both a storage
service and a powerful analysis service, which is why it
is listed under Big Data. We can divide these
five storage services into two categories--
structured and unstructured. If the data you want to store
can be organized into a table structure with columns and rows,
then it is structured data. Examples might be user profile
information, event logs, sensor measurements, sales
records, or stock trade data. Structured data comes in
many different shapes, sizes, and usage patterns, and
there is a great diversity of ways to store it
and interact with it. So perhaps not
surprisingly, all but one of the storage services offered
by the Google Cloud platform are for structured data. Cloud SQL, Cloud Datastore,
Cloud Bigtable, and BigQuery all store structured data. The remaining storage
service, Google Cloud Storage, stores unstructured
data, by which I mean that you give Cloud
Storage a sequence of bytes to store, a place
you want to store it, and a name to identify
that sequence of bytes, and it stores them for you. Of course, there may
be internal structure to that sequence of bytes. For instance, it
could be a zip archive with a table of contents and
individual files inside of it, or it might be a
JPEG image file, which has a well-defined format, or
even a text file in CSV format organized into rows and
columns, which sounds a lot like structured data. The key is that Cloud
Storage has no insight into that internal structure. It just faithfully
stores and retrieves the exact sequence of
bytes you ask it to, regardless of any
internal structure that may exist in those bytes. At this point, you might be
thinking this unstructured data you're talking about sure
sounds a whole lot like files. And you'd be exactly
correct, but Cloud Storage calls them objects instead, not because they aren't like files, but
because files are generally stored in a file system. And the file systems
we're used to interacting with on modern operating
systems generally have a hierarchical structure
and naming conventions that Cloud Storage
does not have. Thus Cloud Storage
calls the place you store things a bucket and
the things that you store, objects. Let's use the Developer
console to get started with Cloud Storage. This project does not
currently have any buckets. So let's create one. First you have to choose
a name for the bucket. It's important to note
that bucket names are global across all projects. If you try to create a
bucket with an obvious name, such as Test, you will
probably find that the bucket name is not available. As you can see, the
Developer console helpfully checks your bucket
name while you're typing it to let you
know if it's a legal name and if it's available. Now we have a legal and
available bucket name, but we still have a
couple of choices to make. The first is the class
of storage to use. Standard storage is the
default, and if you're using Cloud Storage as
part of an application, is almost always
the right choice. It provides the best latency
and highest availability of the storage classes, which
is usually what you want. If, for some reason, you can
accept a lower availability for the objects you are
storing in this bucket, you can choose the durable
reduced availability class of storage instead. As the name implies,
objects stored with this class of storage are
just as durable as standard, but have a reduced
availability expectation. The service level agreement
for the standard storage class specifies 99.9% availability. The durable reduced
availability class specifies 99%
availability instead. The trade-off for accepting
the reduced availability expectation is a lower price. Pricing is subject to change. So you should consult
the current price list for details on the cost of the
different classes of storage. The final class of
storage is nearline. Nearline storage is intended
for archival scenarios where you do not
expect to access the stored objects often. Nearline storage also carries
a 99% availability SLA. Nearline storage also has
slightly higher latency than standard or durable
reduced availability storage with an expectation of a few
seconds to the first byte, compared with less than a
second to the first byte for the other
classes of storage. Nearline storage is
significantly less expensive than standard or durable
reduced availability storage. But there is a surcharge
for accessing your objects. A good rule of thumb for
when to use nearline storage is that if you think you will
access an object less than once a month, then nearline storage
will probably be the more cost effective option. If you expect to access an
object more than once a month, durable reduced
availability storage will be more cost effective. The second choice
is the location where your data will be stored. There are two types of locations
available-- multi-regional and regional. Regional locations
correspond to the regions supported in other
cloud platform services such as Compute Engine. If you primarily intend to
read and write to this bucket from Compute Engine instances
in a particular region, then setting the location of
the bucket to that same region will provide lower latency
and higher throughput for that workload. However, by setting the
location to a particular region, you restrict the storage
to only that region, which can be a disadvantage
if your objects are going to be accessed from
many different places. If your objects will be accessed
from many different places, then you can choose a
multi-region location which restricts the
placement of your data to a broader geographic area. Currently, the choices are
Asia, the European Union, and the United States. These multi-region
locations can also be useful when you want
to ensure your data stays within a certain jurisdiction. With these choices made, you can now create your bucket.
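The same choices can also be made from the command line with the gsutil tool I'll introduce in a moment. This is just a sketch: the bucket name, storage class, and location shown here are placeholders rather than the values used in this demo.

    # Create a bucket, picking a storage class with -c and a location with -l.
    # The bucket name is a placeholder; remember that bucket names are global.
    gsutil mb -c nearline -l US gs://my-example-bucket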
Here we see our new bucket, which is, of course, empty. We could use the developer
console to upload objects. But instead, I would
like to quickly show you how to use the command line
utility gsutil that comes with the Google Cloud SDK. I have previously installed
the Google Cloud SDK and configured it to
point to the project and use my credentials. We can use gsutil to
see the bucket we just created using the ls command. gsutil uses a URI
syntax for object names where the gs prefix
specifies objects stored in Google Cloud Storage. It can also access files in the
local file system and objects stored in Amazon S3 using
the S3 prefix, which is useful for
transferring objects between Cloud Storage
and the local file system or between Cloud
Storage and Amazon S3. Here I will quickly
demonstrate how to upload a large number of
objects in a single command. And we can switch back
to the Developer console, refresh the window, and see
those objects in our bucket. Now let's take a quick hands-on
tour of the structured storage options starting with Cloud SQL. When we click on Cloud SQL,
we are given the opportunity to create a Cloud SQL instance. We choose a name for
the instance which must be unique
within this project, choose a region in which
to run the instance, and a size for the instance. There are lots of additional
configuration options that I won't go
into detail on now. But the one I wanted to point
out is database version. The important thing to notice
is that you are selecting an explicit MySQL version.
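As a rough command-line equivalent of these console steps: the instance name, tier, and region below are placeholders, and the exact flags and tier names can vary between releases, so check gcloud sql instances create --help before relying on them.

    # Create a Cloud SQL instance with an explicit MySQL version.
    gcloud sql instances create my-sql-instance \
        --database-version MYSQL_5_6 \
        --tier D1 \
        --region us-central1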
This demonstrates a key point about Cloud SQL. It is a hosted MySQL service. It is not similar to MySQL. It is MySQL. Whatever you do
today with MySQL, you can do with Cloud SQL. But Google takes care of
keeping the operating system up to date, configuring the application, performing backups, and other ops tasks. To underscore the
point that this really is a MySQL server we
started, let's connect to it using the MySQL client. Let's create a new
instance we can use to run the MySQL client. Because this instance is
part of the same project, no additional network
configuration is required. However, we do need to
make sure we enable access to Cloud SQL for this instance. We can do this explicitly
in Access and Security, or we can allow this instance
to access all Cloud resources that are part of this project. Since we may use this
instance for more than just running the MySQL client,
we will enable access to all the APIs
with this checkbox. Once the instance is
created, we can SSH into it and install the MySQL client
via apt-get. However, we need to make a couple
of configuration changes to our Cloud SQL instance
before we can connect to it. First, we need an IPv4
address to connect to. Then, we need to tell
Cloud SQL that we're OK with connections coming from
the IP address of our Compute Engine instance. In this example, we only want
one Compute Engine instance to get access. So we specify a network
with a 32-bit prefix, which effectively
means a single host. Finally, we need an
account that allows us to sign in from our host. The default root account
only works from localhost. So we create a new account
that allows root to sign in from our instance's IP address. Now we are ready to connect
to our Cloud SQL instance using the MySQL client.
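If you would rather script these steps than click through the console, the flow looks roughly like this. The instance name and the IP addresses are placeholders, and the gcloud flag names may differ between SDK releases.

    # Give the Cloud SQL instance an IPv4 address, then authorize connections
    # from the Compute Engine instance's external IP (a /32, i.e. a single host).
    gcloud sql instances patch my-sql-instance --assign-ip
    gcloud sql instances patch my-sql-instance --authorized-networks 203.0.113.10/32

    # On the Compute Engine instance: install the MySQL client and connect to
    # the Cloud SQL instance's IPv4 address with the account created above.
    sudo apt-get install mysql-client
    mysql --host=198.51.100.25 --user=root --password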
At this point, you have a fully functional MySQL prompt with which you can
do all the normal stuff you would expect to be
able to do from a root MySQL prompt such as creating
new databases, creating tables, and so on.
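For example, from that prompt you could run statements like these; the database and table are made up purely for illustration.

    mysql> CREATE DATABASE inventory;
    mysql> USE inventory;
    mysql> CREATE TABLE products (
        ->   id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        ->   name VARCHAR(100) NOT NULL,
        ->   price DECIMAL(10,2)
        -> );
    mysql> SHOW TABLES;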
But now you get the idea that Cloud SQL is a fully functional MySQL
database, which is fantastic if what you want is a low
maintenance MySQL instance. It has everything you expect
from SQL-- a rich query language, primary and secondary
indexes, ACID transactions, relational integrity, stored
procedures, the works. If you are familiar with MySQL
or another relational database, and you value the power of
a full relational database, and if your scalability
needs are not too great, Cloud SQL could be the perfect
solution for your application. However, the so-called
NoSQL family of databases was created to address
some of the issues faced by relational databases
when maintaining relational integrity and full
ACID semantics while attempting to attain massive scale in
a cost effective way, which brings us to Cloud Datastore. Cloud Datastore was born
as the structured storage solution for App
Engine, but is now accessible from outside
of App Engine as well. Cloud Datastore
scales gracefully from very small database sizes
to very large database sizes. It grows with your application,
both in terms of scale and by being flexible when it
comes to schema definition. For those of you who are
accustomed to third normal form with full relational
integrity, you may find the NoSQL approach
to schema definition to be a little bit
fast and loose. And you'd be right. It is fast, by which I
mean highly scalable. And it is loose, by which I
mean it is highly flexible. There is no equivalent
to a create table statement in Cloud Datastore. You simply start storing things. There is no equivalent to
an alter table statement to add new columns. You just start storing things
with additional columns. We can demonstrate this
in the Developers console. There are a few important
things to notice about what we're doing here. While this may look a little
like defining a table, it's not. What we're doing is storing the
very first entity of its kind in our project's data store. We can then store a second
entity of the same kind, but we can add new
properties to it that were not in the first entity. You'll see when we create a
second entity that the console knows which
were marked as being indexed and asks us for a
value for those. But it does not
demand that we provide a value for every property
that was in the first entity. In fact, you can
even create entities that do not have values for
properties that were previously identified as being indexed. At this point, it
bears mentioning that just because Cloud
Datastore doesn't require you to define your
schema in advance doesn't mean you don't need
to think very carefully about your data model. You still need to
think carefully about what you're
planning to store and how you intend
to retrieve it. But the NoSQL approach means
that when the time comes to store additional information
about a kind, which in a SQL world would mean an alter
table statement to add columns to a table, you can
simply start storing new entities of that kind with
the additional information you now require. You'll need a plan to
either backfill old data or deal with entities that
are missing that data. But that's true no matter what
your database may look like. Another thing to notice
is that at no point did we tell the Datastore
how many resources should be used for our structured data. It is a true no-ops solution with
almost no operational overhead, no need to specify instance
sizes or cluster sizes or anything of that nature. Datastore is a great fit
for an application database that scales from 0 to terabytes
of data as your application grows in popularity. However, if your scaling
needs are truly massive, there is another NoSQL
solution you should consider-- Cloud Bigtable. Cloud Bigtable is a hosted
version of the same Bigtable technology that Google
has been using internally for over a decade. In 2006, Google
published a paper describing Bigtable
which resulted in a flurry of activity in
the open source community to build similar systems. One of those open source
projects inspired by Bigtable was HBase, which is part of
the larger Hadoop ecosystem. Cloud Bigtable is exposed
through the HBase API, which makes it instantly compatible
with a great deal of existing code written for the
Hadoop ecosystem. But it is a managed
service, which reduces your operational overhead. If you know you're
going to be storing over a terabyte of
structured data, if you have a very
high volume of writes, if you need read and write
latency in the single digit millisecond range with
strong consistency, or if you're already
using HBase and you want a straightforward migration
path to a managed Cloud Service, Cloud Bigtable is something
you should look into. Let's create our very own
Cloud Bigtable cluster to see what that looks like. We have a couple of choices
to make at this point. First is to choose the zone
in which our cluster will run. Recall that for Cloud
SQL, we selected a region. And for Cloud Datastore,
we selected nothing. We simply started storing
data and let Cloud Datastore worry about where it should go. Cloud Bigtable is a lower level
database than both Cloud SQL and Cloud Datastore,
which is one of the reasons it
can scale so high and provide such low latency. But to provide such low
latency guarantees, your whole cluster
runs in a single zone. The second choice is how
many nodes to provision. As we increase the
number of nodes, we increase the
available operations per second and throughput,
and also, of course, the cost. You can go up to 30 nodes
without even talking to us. As the developer console
shows, a 30 node cluster is capable of performing
300,000 operations per second with a throughput of 300
megabytes per second. To put that in perspective,
300 megabytes per second is about 25 terabytes
per day, which would add up to about
nine petabytes in a year. That's scale. Similar to how we
used the native MySQL client to demonstrate that
Cloud SQL really is MySQL, I'm going to
demonstrate how we can use the HBase client to connect
to our Cloud Bigtable cluster. Here I am using the
Quick Start script that is described in the
Cloud Bigtable Quick Start. Setting up the HBase
client can be tricky, and the Quick Start
packaging script makes sure you have the
right things on your machine and knows how to
discover and connect to your Cloud Bigtable cluster. We can see that
we don't currently have any tables in our cluster. So we create one,
then add a row to it, then scan the table to
verify the row is there.
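For reference, that shell session looks roughly like the following; the table name, column family, and values are placeholders.

    # Inside the HBase shell started by the Quick Start script:
    list
    create 'my-table', 'cf1'
    put 'my-table', 'row1', 'cf1:greeting', 'hello from Cloud Bigtable'
    scan 'my-table'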
Learning how to model data in Bigtable and use the HBase API is beyond the scope
of this video. But this short
example serves to show that Bigtable does
indeed play nicely with the HBase toolchain. Given the massive scalability
provided by Cloud Bigtable, why not make it your
number one choice? After all, everyone
wants to be prepared for the overwhelming success
of their application. So why not choose the most
scalable database available? Simply put, Cloud Bigtable
does not effectively scale down to small sizes. The smallest Cloud Bigtable
cluster you can create still has three nodes and can
handle 30,000 operations per second, which is far more
than a small application needs. This isn't a problem from
a performance standpoint, but it is a problem from a
cost effectiveness perspective. These three nodes are
dedicated for your sole use and cost the same per hour no
matter if you use them or not. Contrast this with
Cloud Datastore's model, where you pay by the operation. So when your usage is
small, so is your cost. Each of the structured
storage solutions we have discussed so far-- Cloud
SQL, Cloud Datastore, and Cloud Bigtable-- are primarily
operational databases. They're intended to be used
as part of an application. The final structured
storage service we're going to talk
about is BigQuery, which is an analytical database. The power of BigQuery
is its ability to run SQL queries over
terabytes of data in seconds. To demonstrate, I will use
one of the several public data sets available in BigQuery. This dataset contains 100
billion rows of page view data from the Wikimedia Foundation. By running this
row count query, we can see that there are indeed
more than 100 billion rows in this table. Now we're going to scan all
100 billion of those rows, run a regular expression
against each one, then aggregate the page
views for each matching row.
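The queries themselves look roughly like the following legacy-SQL sketches, run here with the bq command-line tool. The table and column names are assumptions about the public Wikipedia benchmark dataset, so substitute the table you actually want to query.

    # Count the rows in the table.
    bq query "SELECT COUNT(*) FROM [bigquery-samples:wikipedia_benchmark.Wiki100B]"

    # Scan every row, apply a regular expression to each title, and aggregate
    # page views for the matching rows.
    bq query "SELECT title, SUM(views) AS total_views
        FROM [bigquery-samples:wikipedia_benchmark.Wiki100B]
        WHERE REGEXP_MATCH(title, '^G.*o.*g')
        GROUP BY title
        ORDER BY total_views DESC
        LIMIT 100"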
And there we have it. We just scanned over three terabytes of data in less than a minute. I hope you've enjoyed this
whirlwind tour of the storage options in the Google
Cloud platform. We were barely able to
scratch the surface of each of these great products. But hopefully you
have a better idea of which ones might be
best for your application. [MUSIC PLAYING]