[MUSIC PLAYING] TOM SALMON: Good
afternoon, everybody. Thank you for coming
to my session. I really hope I can teach you
something new in the next 45 minutes. The title of this
session is a bit large. I don't actually
like it anymore. I kind of want to change it. And it's "A Security
Practitioner's Guide to Best Practice GCP Security." And I'm not going to be
everything to everybody, so I'm probably going
to disappoint everybody in this room. But hopefully, I can
teach you, at least, maybe one thing new that you can
take away and implement today. So my name is Tom Salmon,
and I'm a customer engineer in Google Cloud. I'm based in London,
and I primarily work with financial service
customers primarily in banking. I joined Google 18 months ago. Before that, I
worked in security doing engineering, design,
architecture, and consulting, mostly building security
operation centers. My aim here today
is to share with you what I've learned
from my customers, what they've told me, the
conversations we've had together, and where I
think there are maybe gaps in people's
knowledge where I can hopefully upskill you. So what we are and
aren't going to do today. We will cover the
common questions I've had from my
customers, the areas of common misunderstanding where
we spend time going over things that can be fairly fundamental,
but sometimes people get wrong. We're going to look at how we
take lots of different services and bundle them into
security solutions. We're not going to cover
everything in the GCP platform. We're going to cover a
core number of services, and they're mostly focused
around infrastructure. And by that, we're going to
be looking at permissions and logging and
security, and how that cuts across every service
we have, rather than digging into specifics around
App Engine or BigQuery. We won't talk about Roadmap. There are no
announcements here today. We won't talk about
third-party solutions. I'm not going to talk
about other vendors you can work with. And we won't get really deep
into network or encryption. There are a huge amount of
sessions going on at Next. I thought I'd leave
that to the experts. And there's an assumption
this session is set under. And that is that you trust
Google Cloud is secure. So when customers tell
me, we trust you Google. We believe your platform
security is good enough for us. What we want to do
is build solutions on top of it that are
secure to our needs. If you don't agree with
that statement, that's fine. Security specialists
and customer engineers like myself are happy
to meet with you and discuss the security
of the platform. But everything we talk about
today is on top of GCP. Cool. So there are three things
in my agenda, roughly. We're going to
talk about control. Control is around
access control. It's around IAM. It's around roles
and service accounts and granting people
access to resources. What are the best practices,
and how do you do that securely, and how would an
enterprise do that? We're going to
look at visibility. How do you monitor
what's happening? How do you validate that
the controls and permissions that you put in
place are actually giving you the controls you
think they're giving you? And then we're going
to talk about how we wrap up some
services into solutions and solve some common problems
that my customers have brought to me. So controls-- the first thing
we have to talk about is IAM. Identity and Access
Management is fundamental to everything
that happens on Google Cloud Platform and in any cloud. And the thing that took a
while for some of my customers to grasp is that it's
very centralized. If you look at GCP, everything
is integrated with IAM. You can assign a role to
a user, and suddenly, they might have access to
a ton of resources they didn't have before. Whereas, in traditional
on-premise environments, they found that actually it
might be four or five or six different teams. The team that manages the
firewall rules opening a port, the team that
manages the service controls to allow
you in, the team that then grants you permissions to
[INAUDIBLE] Active Directory access to it-- it's a
whole bunch of people. And actually, it's a bit
higher risk having everything controlled through IAM because
you can more easily give people the wrong access. So you really need to
put a lot more focus into doing it right by making
sure that you're locking it down a lot more than you
would necessarily on premise. So what we're
going to look at is who can access what resources,
which sounds pretty simple. But there are lots
of questions that come out of this, such as how
do you figure out what resources they should be accessing? And actually, who can
access those resources? It's a pretty hard question
to answer most of the time. The very first
piece of advice I'll give you is all of your access
should be based around groups. If a user is directly given
permission to resources, then you have a
significant burden in monitoring and
management and trying to understand who
can access what. I would actually say you
should be actively scanning for and looking for any
usage of identity controls that rely on the individual
accessing any resource. Everything through
a group allows you to put in
proper JML (joiner/mover/leaver) processes and correct monitoring, and it's
much easier to manage as well. And I see this
mistake being made at the start of implementations
when people are starting to work with Google Cloud. They go, oh, we'll just have
this one person accessing these rules, and
these couple of people can have this product over here. It's best to start strong and
stick with it consistently as you scale and grow
your use for the platform. Once you're 6 or 12
months down the road, it can be hard to back out. So just a really simple
example of a group that some of my
customers might use would be Global Sec Ops admins. There are then
two roles below it. There are security admins,
and there are log viewers. You can basically
figure out what they do based on the name of the
group, so it's self-describing. So if you are reviewing
an alert or a log that contains the group
name, you can kind of figure out what access they
have based on that group name.
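Just to make that concrete-- this isn't on a slide, and the group address, folder ID, and role choices here are made-up examples-- binding those two roles to that kind of group at the folder level would look roughly like this:

    # Hypothetical group and folder ID -- substitute your own naming convention.
    gcloud resource-manager folders add-iam-policy-binding 123456789012 \
        --member="group:global-secops-admins@example.com" \
        --role="roles/iam.securityAdmin"

    gcloud resource-manager folders add-iam-policy-binding 123456789012 \
        --member="group:global-secops-admins@example.com" \
        --role="roles/logging.viewer"

People then gain and lose access purely through group membership, which is where your JML process and monitoring hook in.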
And secondly, understanding folder structures and how they fit into
the resource hierarchy is pretty important. So just to go over this
again, if people aren't aware, at the very top you
have your organization. There's one organization,
and that's your company. Below it, you have folders,
many folders and hierarchies. Below that, you have projects,
and then those projects have the resources
inside of them. The best way to map
it out, in my opinion, and what I've found with most
of my customers, is to map your folder hierarchy
to your company layout and structure. And by that, I
mean separating it into different countries,
different working groups, different teams, and
doing the logical separation, and using folders
as many times as you need to to have that
clean separation. Because when you apply
constraints and policies at an organization-level,
and they trickle down, you can then apply it to a
folder further down the chain, and that will
trickle down as well. Another related
thing that I've found is that the folder hierarchy
isn't always apparent. So for example, if you're
looking at an access control list, if you're
looking at a log, it will come with a project name. But it's not always
obvious to many people, where does that project sit
in the folder hierarchy? So simply, just expand it
out into the project name. Structure it as
organization name. I recommend you always prefix
with your organization name at the start, and then
expand out the folder tree in the project name. Make use of long project names. It's absolutely fine. Simple example-- Acme Corp
have a sales organization. There's an application
running inside of it that provides insight around
what their clients are doing. And this is the production
version of that application. It's self-describing,
it's easy to understand, and it's easy to debug where
it sits in the folder tree without having to go to
someone that can actually see it and give you
that information. You're trying to lower the
burden and the admin overhead. So Service Accounts is
something that I didn't really know that much about. And I assumed I kind of knew
how they worked until I started writing this presentation. And I had it completely wrong. [LAUGHS] And the real
key advice that I had from one of the
security PMs was, you need to think
of Service Accounts as both a resource
and an identity. There are two
perspectives, and you need to be really aware of them. So when we think
about a resource, a resource has controls as to
who can access that resource. So in this case, let's say
you created a Service Account. You want to limit who can
use that Service Account. So this simple diagram shows
you there's a user called Alice, and she'd like to start
a virtual machine. She'd like to start that
using a particular Service Account that's being generated. So she needs a role that allows
her Service Account user to use that Service Account. So she has a constraint on
being able to access it. Normally, that list of people
with Service Account user should be pretty small
and pretty focused, and you should be reviewing
who has that access. Because as soon as
you actually start using the Service Account,
and you start up that instance using the Service
Account, it flips over, and it becomes an identity. Then the identity has access
to resources on the other side. So that's, then, having
permissions and roles as to can it access a Google
Cloud Storage Bucket, can it access a
spanner database, can it create some
credentials elsewhere. So what this means is that a
user who has Service Account user permissions
can then actually access all of the resources that
Service Account can access as well, not necessarily directly. But actually, through
that Service Account, that can access everything else. And that can really greatly
increase the scope of access that individual user has
through that one role. So simple tips, things that
have caught me out before and caught out my customers--
have a naming convention. And one of the things I'd
say for a naming convention is actually saying it's a
Service Account in the name. If you look at a log, or
report, or an ACL, it just looks like
an email address. Is it a person, or is
it a Service Account? Just put in front
SVC, or SA, or Service to obviously delineate
this is a Service Account. The display name
is a longer name. A number of my
customers have found putting the
particular roles given for that Service Account
into the display name also makes it easier to
figure out what it does and what it should
access and, actually, how much permission does it
have to access other resources. Being verbose and
repeating yourself isn't really a problem.
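Pulling those tips together into a rough sketch-- every name here is invented, and the roles listed in the display name are just examples-- it might look like this:

    # Hypothetical project and account names following the convention above.
    gcloud iam service-accounts create svc-clientinsight-prod \
        --display-name="svc-clientinsight-prod (roles/cloudsql.client, roles/storage.objectViewer)"

    # Keep the list of people who can act as the service account small and reviewed.
    gcloud iam service-accounts add-iam-policy-binding \
        svc-clientinsight-prod@acme-sales-clientinsight-prod.iam.gserviceaccount.com \
        --member="user:alice@example.com" \
        --role="roles/iam.serviceAccountUser"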
So then the question becomes how many Service Accounts do we need? And the answer is, it
depends, which is the worst answer you can give to anyone. So generally, as a rule of
thumb, what I am looking at is how many roles are given to
a particular Service Account? We generally say that each of
your applications or services should have a Service Account. If you're, then,
giving it many roles-- 5, 10, 15, or 20-- that's probably
a hint you should be refactoring the application
if you can control it or trying to decompose
it into multiple items. So what you have
to think about here is, if someone picked
up that Service Account, and they took it away,
and they used it later on, it can be hard to monitor how
a Service Account is being used and if there's multiple
Service Accounts being used in multiple places. And if that one account can
access 10 different resources, then that's a concentration
risk that you should be looking at a lot more closely. And don't rely on the
default Service Accounts. So whenever you
create a new project, and you spin up a
Compute Engine API, it will create you
a default Service Account that's really useful
for testing and development. We do not guarantee
the functionality of that Service Account, the
roles and scopes it might have. We also don't guarantee it
will exist in the future, whether we create that default
Service Account or not. If anything today relies on
the default Service Account, you should immediately stop
using it and transition to custom Service Accounts that
you can control because there may be a breaking change. So something I actually
learned on Monday this week-- and added this
slide on Tuesday-- was trying to
answer the question of who can access what? So which user can access
this particular resource? Or who could delete
this virtual machine is a really hard
question to answer. And there's a great piece
of software called Forseti. Forseti's open-source
software originally developed by Google Cloud that
we use to monitor our policies. And we need to answer
those questions ourselves. It's pretty simple
in how it works. We do an inventory. We take a list of
all the users, we take a list of
all the groups, we take a list of all
the roles, we take a list of all of the resources. We build a model on
top of it, and then you can query that model. And you can directly
ask it, who could delete this virtual machine? For this user, what
can they access? Who has Service Account
user permissions? It's a really quick way to get
answers to those questions. It's freely available
open-source software online. I'd highly recommend
you review it. So that was control,
and that was really centered around identity
and access management. And as you'll see as
we go through this, everything builds on
top of it when you're looking at cloud security. So for visibility, we're going
to talk a lot around logging trying to understand
which users are actually using those permissions,
and who is doing what in the operational perspective? The main tool here
is Stackdriver. Stackdriver has a
ton of functionality. It's a fantastic platform that
does a million things really well. There are some great
announcements at Next this week about what it does. But we're going to talk
about a subsection of it-- Stackdriver Logging. So Stackdriver
Monitoring is generally for the operations teams. It's asking you
how much CPU usage is on this particular machine? How much latency is on this
particular load balancer? Stackdriver Logging takes,
generally, human-readable logs in text form or JSON
format, parses them, and makes them
available for developers debugging an application and,
typically, security teams who want to look at what's
changed, how did it change, and how does that affect me? So on GCP, there are a number
of different logging platforms and a number of different
logs that are produced. The two we're going to
really talk about here are Admin Activity Logs
and Data Access Logs. Admin Activity Logs
are on by default, and you can't turn them off. So whenever you make a change
through the Admin Console, whenever you run
a GCloud command, whenever you do something
that changes your platform in any way, that's logged for
you, and you can't stop that. That's a good thing,
and it's free. It's baked into the cost. What's also available
is Data Access Logging. So Data Access Logging looks on
top of the platform and says, well, inside your applications,
we can log what happens. An example, Cloud SQL-- if I go and create a Cloud SQL
cluster and create some nodes, that will be logged through
the Admin Activity Logs. If I actually send a query
to my Cloud SQL database, that can be logged inside
the Data Access Logs. We can log if it's
a READ query, we can log if it's a change, if
it's a modify or a delete. And this is fairly consistent
across most of our products. But it's not enabled
by default, and it can produce a huge amount of logs. You need to be really aware
of how much this can produce. And we're going to
talk in a second about reducing that volume
to a manageable quantity. So the first thing to say is
you need different permissions to access the Data Access Logs. Because they're inherently more
sensitive: they contain data access and data change. So the default
logging permission you're going to give to your
ops teams and developers is logging.viewer. They can view the logs. To get Data Access
Logging access, you need
logging.privateLogViewer. It's in the name. It's a lot more
private information. It should be a much
smaller subset of people because they're looking at
actually this data access and who's changing
the data access. So a quick example of
how you can turn this on and how you can configure it. You can do it through a GUI, or
you can do it through GCloud. And you can configure this
at an organizational level and push it down the
hierarchy, or you can do it on a per-project basis. So in this example, we're doing
it for one particular project. So I'm pulling out the IAM
policy for myproject123. I didn't follow my own
naming convention advice. We feed it into a yaml
file, and then we modify it, and we just append at the
bottom the audit configuration. And all we're saying is we're
going have, for every service, enable DATA_READ, DATA_WRITE,
and also ADMIN_READ. So ADMIN_READ is every
time you run a READ command through GCloud. So if you list all of the
virtual machine instances you have, that's an ADMIN_READ. If you create a
virtual machine, that's an admin change
and an ADMIN_WRITE. So simply, we change the yaml
file, we write it back in, and that project
will now produce logs for all the DATA_READ
and all the DATA_WRITE for every single
service you have.
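As a sketch of that flow-- same throwaway project name, and this is roughly what the appended block looks like rather than a copy of the slide:

    gcloud projects get-iam-policy myproject123 --format=yaml > policy.yaml

    # Append the audit configuration at the bottom of policy.yaml:
    auditConfigs:
    - service: allServices
      auditLogConfigs:
      - logType: ADMIN_READ
      - logType: DATA_READ
      - logType: DATA_WRITE

    # Then write it back:
    gcloud projects set-iam-policy myproject123 policy.yaml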
And that is a huge amount of information. Could be good, could
be bad, depending on what you want to monitor. Typically, you want to
exclude certain services. So building on
this example, we're going to show you how you can
pull out, say, for Cloud SQL, how do we remove one
particular Service Account from logging those
Data Access Logs? So the use case here is you
have an application that's serving an API. And you've written it in Python. It's serving your users. Behind the scenes is
a Cloud SQL database. And that Cloud SQL database is
the primary storage system-- lots of reads, lots of
writes coming through. Your application is
running under the Service Account you created, your
custom Service Account. Actually, this is the
default Service Account. I should have changed that. And what you want to
say is, let's exempt this particular Service Account
for Cloud SQL from DATA_READ and DATA_WRITE.
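Roughly, the audit configuration becomes something like this-- I'm using a default-style service account address here, as on the slide, but in practice it would be your custom one:

    auditConfigs:
    - service: cloudsql.googleapis.com
      auditLogConfigs:
      - logType: DATA_READ
        exemptedMembers:
        - serviceAccount:123456789012-compute@developer.gserviceaccount.com
      - logType: DATA_WRITE
        exemptedMembers:
        - serviceAccount:123456789012-compute@developer.gserviceaccount.com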
So what you're left with is any data access not by the production
Service Account running your application. So it means that
you'll get access and logs every time a human
or a different service tries to access or change
data in that database. So instantly, that's kind of
interesting, saying, well, my DBA has now changed something
in a production database, or an engineer is logged in and
runs some queries against it. That isn't the typical
flow that we'd expect. That's worth monitoring. You can also do this
through the GUI-- so through the IAM
console, you can go in, you can turn on the
default audit logs, you can turn on the
DATA_READ and WRITE logs, you can choose
particular services, you can build the exemptions. I'd recommend this for
testing to figure out how many logs does it produce. It's a good way to
figure out, actually, is it a couple of megs,
or a couple of gigs, or a couple of terabytes
we're dealing with here? But normally, you want to do
this programmatically most of the time, particularly,
dealing with exceptions. When you're looking at
Data Access Logging, you should be doing it
on a case-by-case basis most of the time. ADMIN_READ logs could be
turned on across the platform. Again, be aware of how
much volume it can produce. So another challenge my
customers brought to me was, by default, logs in
GCP live within the project they're created in. If you're an enterprise,
you could well have hundreds or thousands
of different projects created in many different countries. And that becomes
an admin overhead having to go to
every single project to look at the
logs inside of it. And it stops you
correlating them together and looking for
interesting patterns across your environment. So with Stackdriver, we can
do aggregated log exporting. So simply, we can take the logs
from lots of different member projects, put them
through some filters, so the filters could be to only
forward on particular services, particular users,
particular accounts. You could sample it down. So you could say,
just send 10% of it. That's only interesting to me. And there are three different
sources you can send it to-- sorry, destinations. There's cloud storage,
so you can write it just as plain text files. Commonly, this is
done for compliance. We need to store this for three
years, five years, seven years, as cheap as possible
so it's available if we did need to do an
investigation down the line. You can send it to
BigQuery so that you can run SQL queries against
it and do some analysis. And this is pretty common. I'll talk to you about an
example of that in a minute. You can also send it to Pub/Sub. And Pub/Sub is typically
where third-party platforms integrate. So tools like Splunk
have connectors that read from Pub/Sub to take
those logs out the other side.
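A minimal sketch of one of those aggregated exports-- the organization ID, bucket name, and filter here are placeholders:

    # Send the Admin Activity logs from every project under the org to one bucket.
    gcloud logging sinks create org-audit-archive \
        storage.googleapis.com/acme-central-audit-logs \
        --organization=123456789012 --include-children \
        --log-filter='logName:"cloudaudit.googleapis.com%2Factivity"'

The sink gets its own writer identity, and you grant that identity permission to write into the destination bucket.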
So let's assume that all of your projects are sending their logs to
one particular Cloud Storage bucket. You want to do your
archiving compliance and make them stay
there for seven years. There's a couple of
things I'd recommend to make sure those logs are
sound and safe and secure. The first thing is turning on
Object Versioning on the Google Cloud Storage bucket. The reason for
that is that, when you turn on Object
Versioning, you can't actually delete files anymore. You can run the delete command,
but that file's never deleted. The reference to
it says it's gone, but the file is still there. If someone accidentally
changed the file, we keep all of the
previous versions as well, so you never
actually lose anything as long as that's turned on. And of course,
this bucket, you'll lock down permissions on it. The project it's
running in, you'll lock down the access to
this project as well. And there's another
level you can apply where we can put
controls on the project itself. So you can apply
something called a Lien, which essentially says
you can't delete this project. It's really simple to
apply, and it's just another level of control
against accidental deletion or potentially
malicious deletion. It's a really simple command. It's done by Resource Manager. You do it on a
per-project basis. And I'd highly recommend doing
this for all of your production projects, for all of your
applications that are running. Anything that just doesn't
need to be deleted, accidentally, I'd recommend
turning this flag on. All that happens is, if
you run a command to delete the project, it comes
back with an error saying, no, that can't be done. And you can give
a specific reason as to why that can't be done.
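As a sketch-- the bucket, project, and reason are placeholders, and the liens command may sit under alpha or beta depending on your gcloud version:

    # Keep every version of every object in the log bucket.
    gsutil versioning set on gs://acme-central-audit-logs

    # Stop the project that holds the bucket from being deleted.
    gcloud alpha resource-manager liens create \
        --project=acme-logging-archive-prod \
        --restrictions=resourcemanager.projects.delete \
        --origin=security-team \
        --reason="Central log archive - do not delete"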
I probably wouldn't call it Super Secret Logs, but I might call it something a
bit more generic just in case. But that will at least stop
you from both malicious and accidental deletions, trying
to make sure your logs are safe and sound. Cool. So I want to talk about
security solutions. And by solutions,
I mean combining lots of different
elements together to solve common challenges
that my customers had. The first and
probably most common conversation I have
with people is, how do I run my applications securely? So I have a line-of-business
application, an HR application or
finance application, and it's out there,
and our users should be able to access this
from the corporate environment. Really, where we
want to get them to, is the BeyondCorp model. If you're not familiar
with BeyondCorp, I highly recommend doing
some research into it or attending the session that
we have running at the moment or watching the replay. BeyondCorp is talking about
beyond the corporate network. How can we have all of our
projects and applications running in a way that people
can access it from anywhere without being behind a
particular IP address, without having to dial
up VPN into the office and access it from
trusted IP ranges, which doesn't work at scale? So when we start to take
people on the journey towards a
BeyondCorp-style model, the first thing that
comes up is VPNs. People don't like VPNs. And it's having the
constraints around people, either remotely
accessing in, or dealing with just the admin
overhead of managing it. How can we start to
build applications that take advantage
of Google's scale so you can take advantage of our
denial-of-service protection, so you can deal with
our scalable networking without having a
bottleneck in place? So there are three components
we're going to put together. The first one is Cloud
Armor, and that's protecting the edge-- so limiting who and
how they can access. We're going to talk about
Identity-Aware Proxy, so we can make sure that only
users that are authenticated and authorized to
access the application can access the application. And then we're going to
talk about VPC firewall-- so how do you actually restrict
the ports and protocols that can communicate with the
application you're running? So in a really
simplified example, it looks a little bit like this. So this is kind of step
one of the journey. So let's imagine you have your
employee in your San Francisco office. And the external IP
address from that office is 1.2.3.4 to make
life easy for everyone. The first thing that would
happen with that traffic is, hopefully, it will
get routed over to the Google Cloud Platform. The first place it meets is
our Edge proxies or edge PoPs. And Cloud Armor sits on top
of them as a bolt on service. So everything we're
talking about here-- so Identity-Aware
Proxy and Cloud Armor-- relies on, not changing
your architecture. We're not telling
you to install a VM and push your
traffic through it. We're trying to layer the
capabilities onto your existing design and the existing
capabilities we have today. We don't believe you should
change your application to get more security. So the first thing
we're going to do is we're just going to do simple
whitelisting and blacklisting through Cloud Armor. We're just going to
allow connections from particular IP addresses. In this case, we'll just do
it for your corporate ranges-- so 1.2.3.4 can connect through.
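A rough sketch of that first step-- the policy and backend service names are placeholders:

    gcloud compute security-policies create corp-only \
        --description="Allow corporate ranges only"

    # Allow the corporate range, and flip the default rule to deny.
    gcloud compute security-policies rules create 1000 \
        --security-policy=corp-only \
        --src-ip-ranges=1.2.3.4/32 --action=allow
    gcloud compute security-policies rules update 2147483647 \
        --security-policy=corp-only --action=deny-404

    # Attach it to the backend service behind the HTTPS load balancer.
    gcloud compute backend-services update my-app-backend \
        --security-policy=corp-only --global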
And that will allow it to initiate a TLS session using Google networking
technologies to a HTTPS load balancer serving your
particular application and your particular project. So then we're going to add
on Identity-Aware Proxy. Identity-Aware Proxy
is actually the service that we released that's
as close to BeyondCorp as we have internally
at Google as possible. And there's a great roadmap
for Identity-Aware Proxy, it's really coming
on leaps and bounds. The concept is this. You should provide
authentication and network restriction based on the
identity of the user. So if I'm an employee,
and I work in HR, the ideal scenario
is the only network services I can ever
talk to are those which relate to my job
and my function in HR. There might be tens or hundreds
or thousands of applications that are running out there. And normally, you can ping
them, you can talk to them. You probably can't log into
it, but it's a very wide net that you could touch. Identity-Aware Proxy looks
at your user credentials, looks at the roles
and the groups that you're associated with. You can integrate
it because it's backed off by Cloud Identity. So you can use Active
Directory Federated Services. You can use two-factor
authentication and single sign-on. And then it says,
well, actually, you can't get through
this load balancer, you can't get through
this proxy unless your job role and function
and group is correct. So it means, actually,
the attack surface is significantly reduced
because no longer does any employee have network
connectivity to all of the applications
that are out there. You're no longer reliant on
the application authenticating and authorizing those services. Because as we all
know, if there's a particular vulnerability
in some part of the stack, in some middleware
messaging platform, that any of those desktops,
if they were compromised, are going to sweep
the network, try and find one particular
application service environment that has that
vulnerability, and then they have a way into your network. With this, you've
suddenly restricted that down to a small subset
of your application servers that are only
hosting the services they should be able to access. So the user is passed through
Identity-Aware Proxy, and they are allowed
to access the service. Next up, again, we're
going to use Cloud Armor. So what Cloud Armor
can do is that, now we know that this is
a trusted user, and it's a HTTPS
load balancer, we can inspect the
contents of the packet. So we can do some simple
SQL injection and checking. We can look for
cross-site scripting. We can just check
the packets, seeing that they're the right
size, and shape, and volume, and sort of conform
to standard bounds. If that looks good,
we'll then pass it through to the service
on the back end. In this instance, we
use Compute Engine so it's running a simple
infrastructure as a service. Could be Kubernetes
Engine, could be App Engine Flexible, which, in turn,
will have the VPC firewalls. And those VPC
firewalls will allow you to restrict ports and
protocols at a networking level. So the next question I
normally get from people is-- OK. So let's assume we're trying to
move away from IP whitelisting and blacklisting. We want to promote
employee mobility so people can work from home. We're not having to get them
to dial up to our remote office range and connect through there. But we only want people
to access it from the US, or if they're in the
UK, only in the UK. So again, Cloud Armor
has functionality. You can write a
simple rule that says we're only going to let you in
if the origin of your traffic comes from this or these
particular countries. Or on the flip side, we don't
like these particular countries for you to come from.
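That sort of rule, sketched out-- the policy name is a placeholder, and the expression needs the Cloud Armor rules language to be available on your tier:

    # Only allow traffic whose origin country is the United States.
    gcloud compute security-policies rules create 1000 \
        --security-policy=corp-only \
        --expression="origin.region_code == 'US'" --action=allow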
So this is kind of step two on the journey. You're removing the hard
IP whitelisting and saying, right, well, actually,
now my employees-- I know they're
working in the United States, which is better for me. They're going through, and
we're validating the packets. We know that they're doing
the right job function. We know they're using
two-factor authentication. And finally, they can actually
access the application services we're running. I generally find, talking
to my customers, that's a higher level of security
than they'd ever been able to build
with some kind of DMZ they've had for
the last 15 years. And this is all scalable
Google technology. You're not relying on a
single point of failure. You're not reliant on managing
traditional technologies. Another thing you can do now
is you can enforce standards around SSL and TLS. So attached to the HTTPS load
balancer that we're running, you can simply state the
ciphers that you're allowed and the versions of
TLS that you'll accept for clients to connect to it. So you can enforce a
minimum of TLS 1.2. You can add or remove
the particular ciphers that you'd like people to use
to ensure that those clients-- if they are working from
home and they're employees-- and hopefully, they're in the
right country and they've got the right roles-- are using good
encryption in transit that you trust and validate
and conforms to your standards.
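A sketch of that-- the policy and proxy names are made up:

    # Require TLS 1.2 or higher with a modern cipher profile.
    gcloud compute ssl-policies create modern-tls \
        --profile=MODERN --min-tls-version=1.2

    # Attach it to the HTTPS load balancer's target proxy.
    gcloud compute target-https-proxies update my-app-https-proxy \
        --ssl-policy=modern-tls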
So another question that comes from this is-- OK. So we're running our service in
more of a remote access model, we're using Identity-Aware
Proxy, we're using Cloud Armor, we're allowing people to access
without the corporate network anymore. So where are they
connecting from? Where does the traffic go to? And actually, how do my services
interact with each other, and what are they doing? So VPC Flow Logging
is typically how we get this information out. VPC Flow Logging, you turn
it on on a per-subnet basis and record four different
types of traffic. We record intra-subnet traffic-- so any
traffic within the subnet-- inter-subnet traffic-- traffic
between multiple subnets, assuming the logging
is turned on-- traffic to and from
Google services-- so sending requests to BigQuery
and getting responses-- and also internet traffic-- so internet, egress, and ingress
between your applications and services. And all of these logs
as I've mentioned before, are going to Stackdriver Logging,
and they're pretty valuable. So the quickest way to
get some insight into them is to feed them
out into BigQuery as we talked about earlier. Use aggregated log exports
for each of your projects, set it up to feed the logs
out, aggregate the logs, set up the filters to send
just the VPC Flow Logs out into BigQuery so they're
immediately available to query, and we're going to use
Data Studio to build a nice little report
on top of it to try and figure out
what's in the logs.
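To sketch that pipeline-- the subnet, dataset, and filter here are placeholders rather than the exact ones I used:

    # Turn on flow logs for a subnet.
    gcloud compute networks subnets update my-app-subnet \
        --region=europe-west1 --enable-flow-logs

    # Aggregate just the VPC Flow Logs from the whole org into BigQuery.
    gcloud logging sinks create vpc-flows-to-bq \
        bigquery.googleapis.com/projects/acme-sec-analytics/datasets/vpc_flows \
        --organization=123456789012 --include-children \
        --log-filter='resource.type="gce_subnetwork" AND logName:"vpc_flows"'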
So here's a little dashboard that I pulled together with a sample application
that I spun up. I didn't put any country
restrictions on it. I was just interested
to see who's going to start pinging it around
in a "Hello, World" application that I spun up. So you can see here that it
was running in two regions. We got Europe West
1 and Europe West 2. We had connections from a
lot of different countries. We can see the round-trip
time for the packets that were coming through. There's a huge amount of
detail and information that's really valuable
inside these logs. And typically, building this
in non-cloud environments requires pretty
expensive technology, fairly specialized vendors. And dealing with actually
the volume of traffic it can produce, VPC Flow Logging
can produce a decent amount of information, and you can mine
some really interesting trends in it, particularly, when you're
looking across your projects and across your
applications, looking for common types
of traffic, looking for particular applications that
are connecting to you that you don't like, looking for
particular ports and protocols that you're not expecting. So then that leads
to another use case. Which services
talk to each other? So in a world where
things are very connected, there are lots of
dependencies around it. There are some great
announcements at Next this week talking around Istio. So Istio is fantastic
if you're running in a microservices
environment inside Kubernetes. Many of my customers are
not at that stage yet. They're using traditional
virtual machines and infrastructure,
and they're talking to each other in all
kinds of wonderful ways, and they just kind of hope
they know what they're doing. So if you're using
VPC Flow Logging, we turn on the logging for
all the different subnets. And now we can start to
profile the applications. And we can figure out
which ports and protocols are being used? How much traffic is
being sent between them? And one of my customers
took this even further, took the aggregated exports and
the summaries of those flows, and then turned them
into VPC firewall rules, and applied them, and
locked it down as a policy. So they monitored it for a
period of around two weeks. They found every
single connection that they saw between
these applications and then built very
specific firewall rules to limit that
connectivity to only those particular
approved connections. So what they found was that,
when the application changed, and then they found a new
connection was coming through, the question becomes,
is it a developer? Can we correlate this
with an update or a patch to the software we were using? OK, fine. We just need to update the
documentation and the release process better to
incorporate that change. Or is it nefarious activity? Has someone compromised our
application, and now, they're trying to move around to
see what other services are around they can
go and connect to? Answering that question
is pretty critical. And this gives you
a lot of visibility to see that very quickly and
then do the investigation. So a use case I saw recently
where the customer asked, how do I keep my secrets secret? And we're going to use
Cloud KMS for this. And it's really
important to share, you cannot store secrets
inside Cloud KMS. But it's an enabling
technology that will help you keep
your secrets secret. The use case here was API keys. So an application
that was being built talked to a third-party service
that authenticated an API key. In this case, it
was sending emails. The development and
the test environment had one key that had
a limited quota on it, a limited amount of spend. Production had another key. And they simply
wanted to make sure that that key wasn't
distributed widely. They definitely didn't want to
keep it inside the source code, because then every
developer could access it. And they didn't quite
know where to put it. So the recommended solution
for this-- and this is in our public documentation--
is building two projects to separate the duties. Now, actually, there
are three if you think about the application
it's running as well. So on the left-hand side,
we have the Secret Storage Project, and we're
going to store them in Google Cloud Storage. On the right-hand side, we
have the Cloud KMS Project, which is the encryption
and decryption. And the nice thing here is
that one person or one team can manage the keys. Another person or
another team can manage the encrypted secrets. So the first thing that you
do is you take your API key, you send it to Cloud KMS, you
pick the particular key and key ring that you want, and it
gives you the encrypted blob to store. You then place it in Google
Cloud Storage in a bucket as the encrypted blob. When your application starts
up, it does two things. First, it connects to
Google Cloud Storage and reads that blob
from the bucket. You obviously set up all
the permissions and scopes and roles needed
to pull that down. If any other person had
access to that bucket and logged in and saw
it, it would be no use, because they couldn't
actually see what's inside that encrypted blob. So the next step is
it goes to Cloud KMS, it sends the encrypted
blob, and then it returns a decrypted
version of it, and says here's your API key.
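A sketch of that flow from the command line-- key ring, key, and bucket names are all invented, and in the application itself you'd call the Cloud Storage and Cloud KMS APIs rather than shelling out:

    # One-off: encrypt the API key and park the ciphertext in the secrets bucket.
    gcloud kms encrypt --location=global \
        --keyring=app-secrets --key=mail-api-key \
        --plaintext-file=apikey.txt --ciphertext-file=apikey.enc
    gsutil cp apikey.enc gs://acme-secret-storage/

    # At application start-up: fetch the blob and decrypt it.
    gsutil cp gs://acme-secret-storage/apikey.enc .
    gcloud kms decrypt --location=global \
        --keyring=app-secrets --key=mail-api-key \
        --ciphertext-file=apikey.enc --plaintext-file=apikey.txt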
So what you've done is you delinked it. You've brought in a
separation of the duties. Someone can independently
rotate the keys and manage the keys for you. Another team can manage the
encrypted blobs and secrets, then the application can
run fairly statelessly. It doesn't need to be
bundled with the secrets. Every time it
starts and restarts, it goes and grabs
that key again. And of course,
this is all logged. So you can start to look
for activity and say, well, was there any failed
decryption attempts? Was anyone pulling down
these particular keys from a different application
in a different source that wasn't, then, going
and decrypting them? You can get some pretty
good visibility there. So the final solution I'm
going to talk about, which is a really
common question is, are my machines up to date? And we're talking about
infrastructure here, so we're talking about
plain old virtual machines. And the assumption is
that they need updating, they need patching,
and also, they need to conform to your
security standards. So there's the baseline
that you define. The baseline might be a
NIST or a CIS standard, and probably your
organization-specific requirements as well. But let's say some
particular vulnerability has occurred, and you want to
update GLIBC or libssl to make sure it's maintained. So there are two
questions really. How do you make sure people
are only using trusted images? And secondly, how do you
make sure they're always using the latest version? So the first thing
we're going to do is we're going to
build custom images. So part of Compute Engine
includes an image registry. We don't actually call
it an image registry, but it is an image registry
inside of Compute Engine. And with this, you can
store baked images. Typically, it's an
automated process that will go through and
pull down the patches, build them together, do
your security configurations and hardening, and publish it. It could also be a
manual process as well, depending on volume. And you can choose
whatever flavor you like-- particular versions
of Linux, or Windows, or whatever takes your fancy--
and publish those custom images, share
them, and make them available for other users and
other projects to start with. So in this case, a
user that has access to start a virtual
machine, an instance, pulls the image from the image
registry based on the URL and then builds it
from that version. There's a constraint you can
apply in Resource Manager that allows you to control which
particular versions and which particular images can we use? And this is important
because, by default, Google publishes a whole range
of images for you. We produce standard
Ubuntu, standard CentOS, standard Windows Server. And that doesn't conform
to your security standards and policies. It's patched, and it's updated,
and it's fairly up to date with all the vulnerabilities. But it probably doesn't contain
your particular configurations and your endpoint tools
and management tools. So you can apply this
constraint and essentially say, none of my
users can use any of the Google-provided
images. It's completely denied. So what they're left
with is they can only access the custom versions
which you've published. So then, what you're saying is
that any developer, any person anywhere in your
organization that wants to spin up an instance
has to use an instance that comes from our custom
images which we've approved, and we've standardized,
and we've pushed out.
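That constraint, sketched out-- the image project and organization ID are placeholders:

    # policy.yaml -- only trust images from our own image project.
    constraint: constraints/compute.trustedImageProjects
    listPolicy:
      allowedValues:
      - projects/acme-golden-images

    # Apply it at the organization level so it trickles down.
    gcloud resource-manager org-policies set-policy policy.yaml \
        --organization=123456789012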
So images go through a lifecycle. There are four different
states that they go through. By default, they're active. You create a new image,
people can use it-- happy days. After that, you
can deprecate it. And you can say that
it's allowed to be used, and often, people will support
two or three old versions, but they'll get a warning saying
that this is not the latest version. And typically,
people access images based on the name or the family. They'll say just give me
the latest version of Ubuntu rather than the particular
version that's being produced. But this will tell
them, actually, the version you just specified,
if they've hard-coded a particular version, that's not OK. Obsolete will actually
stop them using it. So if they try and
start an instance using that particular image that's
now obsolete, it will fail, it won't start up,
and it will give them a failure instead of a warning. Then the fourth state is
DELETED where the image is flagged as deleted-- it's not actually
removed from disk. You have to do that
separately-- but it's no longer visible and available. So you can do
automated obsolescence. So what this means is that, when
you tell a version of an image to be deprecated, there's
two extra flags you can pass. There's an obsolete-in flag,
and there's a delete-in flag. And this can be a
relative amount of time-- so in seven days-- or an absolute amount
of time-- a particular date that you give it. So as part of your processes--
ideally an automated baking process-- you can say, right, here's
the latest version of an image I just published for you
and the previous version is deprecated today. So all of my users will
get warnings telling them this isn't the latest version. And automatically,
we'll then make it obsolete in seven days' time. We'll delete it in 14 days' time.
It's another thing you don't have to think about. It will just run through
that lifecycle for you, and users should treat
it appropriately. If people start complaining
that the version of the image isn't available, they need to
change their deployment scripts to use the image family rather
than particular versions. So we talked about three
different things there. We talked about control. Identity is super important
in any cloud provider, particularly Google
Cloud, because all of your access to resources is
focused around your identity. There's much less of a
focus on network controls, particularly, with more
of the managed services. Service accounts are really
powerful if you use them correctly. And actually, there are lots of
little details around who can access that service account. That's definitely something
I'd recommend you go away and review, probably,
with Forseti to see who has that control. Visibility is out there. You can see everything that
happens inside Google Cloud, but not by default.
And you can't wait until there's an
incident to go and find that, because, unless the
logs are on, you can't go backwards and see it. So you need to review
data access logging. You need to review
VPC flow logging. And if it's appropriate,
turn that on today to prepare you for the future. And we talked a bit
about some solutions. How do we combine different
GCP services together to get you to a point that
you can run your applications and services more
securely and get you prepared for a BeyondCorp
world of the future where all your employees can
access remotely, securely, and do whatever they need to. Thank you very much
for attending my talk. I'll be able to take
questions outside. [APPLAUSE] [MUSIC PLAYING]