A Security Practitioner's Guide to Best Practice GCP Security (Cloud Next '18)

Captions
[MUSIC PLAYING] TOM SALMON: Good afternoon, everybody. Thank you for coming to my session. I really hope I can teach you something new in the next 45 minutes. The title of this session is a bit large. I don't actually like it anymore. I kind of want to change it. And it's "A Security Practitioner's Guide to Best Practice GCP Security." And I'm not going to be everything to everybody, so I'm probably going to disappoint everybody in this room. But hopefully, I can teach you at least one new thing that you can take away and implement today. So my name is Tom Salmon, and I'm a customer engineer in Google Cloud. I'm based in London, and I primarily work with financial service customers, primarily in banking. I joined Google 18 months ago. Before that, I worked in security doing engineering, design, architecture, and consulting, mostly building security operation centers. My aim here today is to share with you what I've learned from my customers, what they've told me, the conversations we've had together, and where I think there are gaps in people's knowledge where I can hopefully upskill you. So here's what we are and aren't going to do today. We will cover the common questions I've had from my customers, the areas of common misunderstanding where we spend time going over things that can be fairly fundamental but that people sometimes get wrong. We're going to look at how we take lots of different services and bundle them into solutions. We're not going to cover everything in the GCP platform. We're going to cover a core number of services, and they're mostly focused around infrastructure. And by that, I mean we're going to be looking at permissions and logging and security, and how that cuts across every service we have, rather than digging into specifics around App Engine or BigQuery. We won't talk about roadmap. There are no announcements here today. We won't talk about third-party solutions; I'm not going to talk about other vendors you can work with. And we won't get really deep into networking or encryption. There are a huge number of sessions on those at Next, so I thought I'd leave that to the experts. And there's an assumption this session is built on, and that is that you trust that Google Cloud is secure. Customers tell me, "We trust you, Google. We believe your platform security is good enough for us. What we want to do is build solutions on top of it that are secure to our needs." If you don't agree with that statement, that's fine. Security specialists and customer engineers like myself are happy to meet with you and discuss the security of the platform. But everything we talk about today is on top of GCP. Cool. So there are three things in my agenda, roughly. We're going to talk about control. Control is around access control. It's around IAM. It's around roles and service accounts and granting people access to resources. What are the best practices, how do you do that securely, and how would an enterprise do that? We're going to look at visibility. How do you monitor what's happening? How do you validate that the controls and permissions that you put in place are actually giving you the controls you think they're giving you? And then we're going to talk about how we wrap up some services into solutions and solve some common problems that my customers have brought to me. So, controls-- the first thing we have to talk about is IAM. Identity and Access Management is fundamental to everything that happens on Google Cloud Platform and in any cloud.
And the thing that took a while for some of my customers to grasp is that it's very centralized. If you look at GCP, everything is integrated with IAM. You can assign a role to a user, and suddenly they might have access to a ton of resources they didn't have before. Whereas in traditional on-premise environments, they found that it might actually be four or five or six different teams: the team that manages the firewall rules opening a port, the team that manages the service controls to allow you in, the team that then grants you permissions through [INAUDIBLE] Active Directory access to it-- it's a whole bunch of people. And actually, it's a bit higher risk having everything controlled through IAM, because you can more easily give people the wrong access. So you really need to put a lot more focus into doing it right, by making sure that you're locking it down a lot more than you necessarily would on premise. So what we're going to look at is who can access what resources, which sounds pretty simple. But there are lots of questions that come out of this, such as how do you figure out what resources they should be accessing? And actually, who can access those resources? It's a pretty hard question to answer most of the time. The very first piece of advice I'll give you is that all of your access should be based around groups. If a user is directly given permission to resources, then you have a significant burden in monitoring and management and trying to understand who can access what. I would actually say you should be actively scanning and looking for any usage of identity controls that rely on the individual accessing any resource. Everything through a group allows you to put in proper JML (joiners, movers, leavers) processes and correct monitoring, and it's much easier to manage as well. And I see this mistake being made at the start of implementations, when people are starting to work with Google Cloud. They go, oh, we'll just have this one person accessing these roles, and these couple of people can have this product over here. It's best to start strong and stick with it consistently as you scale and grow your use of the platform. Once you're 6 or 12 months down the road, it can be hard to back out. So just a really simple example of a group that some of my customers might use would be Global SecOps Admins. There are then two roles below it: there are security admins, and there are log viewers. You can basically figure out what they do based on the name of the group, so it's self-describing. So if you are reviewing an alert or a log that contains the group name, you can figure out what access they have based on that group name. And secondly, understanding folder structures and how they fit into the resource hierarchy is pretty important. So just to go over this again, if people aren't aware: at the very top you have your organization. There's one organization, and that's your company. Below it, you have folders-- many folders and hierarchies. Below that, you have projects, and then those projects have the resources inside of them. The best way to map it, in my opinion and in most of my customers' experience, is to map your folder hierarchy to your company layout and structure. And by that, I mean separating it into different countries, different working groups, different teams, doing the logical separation, and using folders as many times as you need to have that clean separation.
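To make that concrete, here's a minimal sketch of binding roles to groups rather than individuals. The organization ID, group names, and role choices are hypothetical, not from the talk:

```
# Hypothetical organization ID and group names; the point is that roles are
# bound to groups, never to individual users.
gcloud organizations add-iam-policy-binding 123456789012 \
    --member="group:global-secops-admins@acme-corp.com" \
    --role="roles/iam.securityAdmin"

gcloud organizations add-iam-policy-binding 123456789012 \
    --member="group:global-secops-logviewers@acme-corp.com" \
    --role="roles/logging.viewer"

# Joiners, movers, and leavers are then handled purely through group
# membership, so the IAM policy itself rarely needs to change.
```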
When you apply constraints and policies at the organization level, they trickle down; and you can also apply them to a folder further down the chain, and they will trickle down from there as well. Another related thing that I've found is that the folder hierarchy isn't always apparent. So for example, if you're looking at an access control list, or if you're looking at a log, it will come with a project name. But it's not always obvious to many people where that project sits in the folder hierarchy. So simply expand it out in the project name. I recommend you always prefix with your organization name at the start, and then expand out the folder tree in the project name. Make use of long project names. It's absolutely fine. Simple example: Acme Corp have a sales organization. There's an application running inside of it that provides insight around what their clients are doing, and this is the production version of that application. So the project name is something like acmecorp-sales-insight-prod. It's self-describing, it's easy to understand, and it's easy to debug where it sits in the folder tree without having to go to someone who can actually see it and give you that information. You're trying to lower the burden and the admin overhead. So Service Accounts are something that I didn't really know that much about, and I assumed I kind of knew how they worked until I started writing this presentation. And I had it completely wrong. [LAUGHS] And the real key advice that I had from one of the security PMs was: you need to think of Service Accounts as both a resource and an identity. There are two perspectives, and you need to be really aware of them. So when we think about a resource, a resource has controls as to who can access that resource. So in this case, let's say you created a Service Account. You want to limit who can use that Service Account. This simple diagram shows you there's a user called Alice, and she'd like to start a virtual machine. She'd like to start that using a particular Service Account that's been created. So she needs the Service Account User role, which allows her to use that Service Account. So she has a constraint on being able to access it. Normally, that list of people with Service Account User should be pretty small and pretty focused, and you should be reviewing who has that access. Because as soon as you actually start using the Service Account, and you start up that instance using the Service Account, it flips over, and it becomes an identity. Then the identity has access to resources on the other side. So that's then having permissions and roles as to whether it can access a Google Cloud Storage bucket, whether it can access a Spanner database, whether it can create some credentials elsewhere. So what this means is that a user who has Service Account User permissions can then actually access all of the resources that Service Account can access as well-- not necessarily directly, but through that Service Account, which can access everything else. And that can greatly increase the scope of access that individual user has through that one role. So, simple tips, things that have caught me out before and caught out my customers: have a naming convention. And one of the things I'd say for a naming convention is to actually say it's a Service Account in the name. If you look at a log, or a report, or an ACL, it just looks like an email address. Is it a person, or is it a Service Account? Just put SVC, or SA, or Service at the front to make it obvious that this is a Service Account.
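Here's a minimal sketch of both perspectives, and of Alice starting the instance under that Service Account. All of the identifiers are hypothetical:

```
# Hypothetical names throughout. First, a service account whose name says it
# is a service account, with a display name that lists the roles it holds.
gcloud iam service-accounts create sa-insight-prod \
    --display-name="SA: client insight prod (roles/storage.objectViewer)"

# The service account as a RESOURCE: only Alice may act as it.
gcloud iam service-accounts add-iam-policy-binding \
    sa-insight-prod@acmecorp-sales-insight-prod.iam.gserviceaccount.com \
    --member="user:alice@acme-corp.com" \
    --role="roles/iam.serviceAccountUser"

# The service account as an IDENTITY: the roles it holds on other resources.
gcloud projects add-iam-policy-binding acmecorp-sales-insight-prod \
    --member="serviceAccount:sa-insight-prod@acmecorp-sales-insight-prod.iam.gserviceaccount.com" \
    --role="roles/storage.objectViewer"

# Alice starts the VM under that custom service account (rather than the
# default Compute Engine one).
gcloud compute instances create insight-web-1 \
    --zone=europe-west2-a \
    --service-account=sa-insight-prod@acmecorp-sales-insight-prod.iam.gserviceaccount.com \
    --scopes=cloud-platform
```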
The display name is a longer name. A number of my customers have found that putting the particular roles given to that Service Account into the display name also makes it easier to figure out what it does, what it should access, and, actually, how much permission it has to access other resources. Being verbose and repeating yourself isn't really a problem. So then the question becomes: how many Service Accounts do we need? And the answer is, it depends, which is the worst answer you can give to anyone. So generally, as a rule of thumb, what I am looking at is how many roles are given to a particular Service Account. We generally say that each of your applications or services should have a Service Account. If you're then giving it many roles-- 5, 10, 15, or 20-- that's probably a hint you should be refactoring the application, if you can control it, or trying to decompose it into multiple items. So what you have to think about here is, if someone picked up that Service Account, and they took it away, and they used it later on, it can be hard to monitor how a Service Account is being used and whether there are multiple Service Accounts being used in multiple places. And if that one account can access 10 different resources, then that's a concentration risk that you should be looking at a lot more closely. And don't rely on the default Service Accounts. Whenever you create a new project and you enable the Compute Engine API, it will create you a default Service Account that's really useful for testing and development. We do not guarantee the functionality of that Service Account, or the roles and scopes it might have. We also don't guarantee it will exist in the future, or whether we will keep creating that default Service Account at all. If anything today relies on the default Service Account, you should immediately stop using it and transition to custom Service Accounts that you can control, because there may be a breaking change. So something I actually learned on Monday this week-- and added this slide on Tuesday-- was trying to answer the question of who can access what. Which user can access this particular resource? Or who could delete this virtual machine? It's a really hard question to answer. And there's a great piece of software called Forseti. Forseti is open-source software originally developed by Google Cloud that we use to monitor our own policies, because we need to answer those questions ourselves. It's pretty simple in how it works. We do an inventory: we take a list of all the users, we take a list of all the groups, we take a list of all the roles, we take a list of all of the resources. We build a model on top of it, and then you can query that model. And you can directly ask it: who could delete this virtual machine? For this user, what can they access? Who has Service Account User permissions? It's a really quick way to get answers to those questions. It's freely available open-source software online. I'd highly recommend you review it. So that was control, and that was really centered around identity and access management. And as you'll see as we go through this, everything builds on top of it when you're looking at cloud security. So for visibility, we're going to talk a lot about logging, trying to understand which users are actually using those permissions and who is doing what from an operational perspective. The main tool here is Stackdriver. Stackdriver has a ton of functionality. It's a fantastic platform that does a million things really well.
There were some great announcements at Next this week as to what it does. But we're going to talk about a subsection of it: Stackdriver Logging. So Stackdriver Monitoring is generally for the operations teams. It's answering questions like: how much CPU usage is on this particular machine? How much latency is on this particular load balancer? Stackdriver Logging takes, generally, human-readable logs in text form or JSON format, parses them, and makes them available for developers debugging an application and, typically, security teams who want to look at what's changed, how it changed, and how that affects me. So on GCP, there are a number of different logging platforms and a number of different logs that are produced. The two we're really going to talk about here are Admin Activity Logs and Data Access Logs. Admin Activity Logs are on by default, and you can't turn them off. So whenever you make a change through the Admin Console, whenever you run a gcloud command, whenever you do something that changes your platform in any way, that's logged for you, and you can't stop that. That's a good thing, and it's free-- it's baked into the cost. What's also available is Data Access Logging. Data Access Logging looks on top of the platform and says, well, inside your applications, we can log what happens. An example: Cloud SQL. If I go and create a Cloud SQL cluster and create some nodes, that will be logged through the Admin Activity Logs. If I actually send a query to my Cloud SQL database, that can be logged inside the Data Access Logs. We can log if it's a read query, and we can log if it's a change, a modify, or a delete. And this is fairly consistent across most of our products. But it's not enabled by default, and it can produce a huge amount of logs. You need to be really aware of how much this can produce, and we're going to talk in a second about reducing that volume to a manageable quantity. So the first thing to say is you need different permissions to access the Data Access Logs, because they're inherently more sensitive: they contain data access and data changes. The default logging permission you're going to give to your ops teams and developers is logging.viewer-- they can view the logs. To get Data Access Logging access, you need logging.privateLogViewer. It's in the name: it's a lot more private information. It should be a much smaller subset of people, because they're looking at this data access and who's changing the data. So here's a quick example of how you can turn this on and how you can configure it. You can do it through the GUI, or you can do it through gcloud. And you can configure this at an organizational level and cascade it down the hierarchy, or you can do it on a per-project basis. In this example, we're doing it for one particular project. So I'm pulling out the IAM policy for myproject123-- I didn't follow my own naming convention advice. We feed it into a yaml file, and then we modify it, and we just append the audit configuration at the bottom. And all we're saying is, for every service, enable DATA_READ, DATA_WRITE, and also ADMIN_READ. ADMIN_READ is every time you run a read command through gcloud. So if you list all of the virtual machine instances you have, that's an ADMIN_READ. If you create a virtual machine, that's an admin change and an ADMIN_WRITE. So simply, we change the yaml file, we write it back in, and that project will now produce logs for all the DATA_READ and all the DATA_WRITE for every single service you have.
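In concrete terms, that flow looks roughly like this; a minimal sketch using the myproject123 example, with the auditConfigs block in the documented format:

```
# A minimal sketch, using the myproject123 example from the talk.
gcloud projects get-iam-policy myproject123 > policy.yaml

# Append the audit configuration at the bottom of policy.yaml:
cat >> policy.yaml <<'EOF'
auditConfigs:
- service: allServices
  auditLogConfigs:
  - logType: ADMIN_READ
  - logType: DATA_READ
  - logType: DATA_WRITE
EOF

# Write the policy back; the project now emits Data Access logs for every service.
gcloud projects set-iam-policy myproject123 policy.yaml
```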
And that is a huge amount of information. Could be good, could be bad, depending on what you want to monitor. Typically, you want to exclude certain services. So building on this example, we're going to show how you can pull out, say, for Cloud SQL, one particular Service Account so it stops generating those Data Access Logs. The use case here is you have an application that's serving an API, and you've written it in Python. It's serving your users. Behind the scenes is a Cloud SQL database, and that Cloud SQL database is the primary storage system-- lots of reads, lots of writes coming through. Your application is running under the Service Account you created, your custom Service Account. (Actually, this slide shows the default Service Account. I should have changed that.) And what you want to say is, let's exempt this particular Service Account from DATA_READ and DATA_WRITE logging for Cloud SQL. So what you're left with is any data access not made by the production Service Account running your application. It means that you'll get access logs every time a human or a different service tries to access or change data in that database. So instantly, that's kind of interesting: my DBA has now changed something in a production database, or an engineer has logged in and run some queries against it. That isn't the typical flow that we'd expect. That's worth monitoring. You can also do this through the GUI-- through the IAM console, you can go in, you can turn on the default audit logs, you can turn on the DATA_READ and DATA_WRITE logs, you can choose particular services, you can build the exemptions. I'd recommend this for testing, to figure out how many logs it produces. It's a good way to figure out whether it's a couple of megs, or a couple of gigs, or a couple of terabytes we're dealing with here. But normally you want to do this programmatically, particularly when dealing with exemptions. When you're looking at Data Access Logging, you should be doing it on a case-by-case basis most of the time. ADMIN_READ logs could be turned on across the platform-- again, be aware of how much volume they can produce. So another challenge my customers brought to me was that, by default, logs in GCP live within the project they're created in. If you're an enterprise, you could well have hundreds or thousands of different projects created in many different countries. And that becomes an admin overhead, having to go to every single project to look at the logs inside of it. And it stops you correlating them together and looking for interesting patterns across your environment. So with Stackdriver, we can do aggregated log exporting. Simply, we can take the logs from lots of different member projects and put them through some filters. The filters could be to only forward on particular services, particular users, particular accounts. You could also sample it down-- you could say, just send 10% of it, that's all that's interesting to me. And there are three different sources you can send it to-- sorry, destinations. There's Cloud Storage, so you can write it just as plain text files. Commonly, this is done for compliance: we need to store this for three years, five years, seven years, as cheaply as possible, so it's available if we did need to do an investigation down the line. You can send it to BigQuery so that you can run SQL queries against it and do some analysis-- this is pretty common, and I'll talk about an example of that in a minute. You can also send it to Pub/Sub.
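Picking the same policy.yaml back up, the Cloud SQL exemption described above could look something like this. The service account name is hypothetical; the effective audit configuration is the union of the allServices entry and the per-service entry, with the listed members exempted:

```
# Edit the auditConfigs block in policy.yaml so that it reads as follows
# (the service account name is hypothetical):
#
#   auditConfigs:
#   - service: allServices
#     auditLogConfigs:
#     - logType: ADMIN_READ
#     - logType: DATA_READ
#     - logType: DATA_WRITE
#   - service: cloudsql.googleapis.com
#     auditLogConfigs:
#     - logType: DATA_READ
#       exemptedMembers:
#       - serviceAccount:sa-api-prod@myproject123.iam.gserviceaccount.com
#     - logType: DATA_WRITE
#       exemptedMembers:
#       - serviceAccount:sa-api-prod@myproject123.iam.gserviceaccount.com
#
# Then write it back:
gcloud projects set-iam-policy myproject123 policy.yaml
```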
Pub/Sub, that third destination, is typically where third-party platforms integrate. So tools like Splunk have connectors that read from Pub/Sub to take those logs out the other side. So let's assume that all of your projects are sending their logs to one particular Cloud Storage bucket. You want to do your archiving for compliance and make the logs stay there for seven years. There are a couple of things I'd recommend to make sure those logs are sound and safe and secure. The first is turning on Object Versioning on the Google Cloud Storage bucket. The reason for that is that, when you turn on Object Versioning, you can't actually delete files anymore. You can run the delete command, but that file is never deleted: the reference to it says it's gone, but the file is still there. If someone accidentally changes a file, we keep all of the previous versions as well, so you never actually lose anything as long as that's turned on. And of course, you'll lock down permissions on this bucket, and you'll lock down access to the project it's running in as well. And there's another level of control you can put on the project itself. You can apply something called a lien, which essentially says you can't delete this project. It's really simple to apply, and it's just another level of control against accidental or potentially malicious deletion. It's a simple command, done through Resource Manager, and you do it on a per-project basis. (There's a short sketch of both of these protections at the end of this passage.) I'd highly recommend doing this for all of your production projects, for all of your applications that are running. For anything that just shouldn't be deleted accidentally, I'd recommend turning this flag on. All that happens is, if you run a command to delete the project, it comes back with an error saying, no, that can't be done. And you can give a specific reason as to why it can't be done. I probably wouldn't call it Super Secret Logs, but I might call it something a bit more generic, just in case. But that will at least protect you from both malicious and accidental deletions, trying to make sure your logs are safe and sound. Cool. So I want to talk about security solutions. And by solutions, I mean combining lots of different elements together to solve common challenges that my customers have had. The first, and probably the most common conversation I have with people, is: how do I run my applications securely? I have a line-of-business application-- an HR application or a finance application-- and it's out there, and our users should be able to access it from the corporate environment. Really, where we want to get them to is the BeyondCorp model. If you're not familiar with BeyondCorp, I highly recommend doing some research into it, or attending the session that we have running at the moment, or watching the replay. BeyondCorp is about going beyond the corporate network. How can we have all of our projects and applications running in a way that people can access them from anywhere, without being behind a particular IP address, without having to dial up a VPN into the office and access them from trusted IP ranges, which doesn't work at scale? So when we start to take people on the journey towards a BeyondCorp-style model, the first thing that comes up is VPNs. People don't like VPNs: the constraints they put around people accessing remotely, and the admin overhead of managing them.
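Before moving on to the application side, here's the short sketch of those two log-protection steps promised above. The bucket and project names are hypothetical, and the liens command has moved between gcloud release tracks over time, so check your gcloud version:

```
# Hypothetical bucket and project names.

# Object Versioning: a "delete" only hides the live version; every previous
# version of every object is retained.
gsutil versioning set on gs://acmecorp-central-log-archive

# A lien on the project: the project itself cannot be deleted until the lien
# is removed.
gcloud alpha resource-manager liens create \
    --project=acmecorp-logging-prod \
    --restrictions=resourcemanager.projects.delete \
    --reason="Log archive project - do not delete"
```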
How can we start to build applications that take advantage of Google's scale, so you can take advantage of our denial-of-service protection, so you can use our scalable networking without having a bottleneck in place? So there are three components we're going to put together. The first one is Cloud Armor, and that's protecting the edge-- limiting who can access and how. We're going to talk about Identity-Aware Proxy, so we can make sure that only users who are authenticated and authorized to access the application can access the application. And then we're going to talk about VPC firewalls-- how do you actually restrict the ports and protocols that can communicate with the application you're running? So in a really simplified example, it looks a little bit like this. This is kind of step one of the journey. Let's imagine you have an employee in your San Francisco office, and the external IP address of that office is 1.2.3.4, to make life easy for everyone. The first thing that happens to that traffic is, hopefully, it gets routed over to Google Cloud Platform. The first place it meets is our edge proxies, or edge PoPs, and Cloud Armor sits on top of them as a bolt-on service. Everything we're talking about here-- Identity-Aware Proxy and Cloud Armor-- relies on not changing your architecture. We're not telling you to install a VM and push your traffic through it. We're trying to layer the capabilities onto your existing design and the existing capabilities we have today. We don't believe you should have to change your application to get more security. So the first thing we're going to do is simple whitelisting and blacklisting through Cloud Armor. We're just going to allow connections from particular IP addresses-- in this case, just your corporate ranges, so 1.2.3.4 can connect through. And that will allow it to initiate a TLS session, using Google networking technologies, to an HTTPS load balancer serving your particular application in your particular project. Then we're going to add on Identity-Aware Proxy. Identity-Aware Proxy is the service we released that's as close as possible to the BeyondCorp setup we have internally at Google. And there's a great roadmap for Identity-Aware Proxy; it's really coming on leaps and bounds. The concept is this: you should provide network-based authentication and network restriction based on the identity of the user. So if I'm an employee, and I work in HR, the ideal scenario is that the only network services I can ever talk to are those which relate to my job and my function in HR. There might be tens or hundreds or thousands of applications running out there. And normally, you can ping them, you can talk to them. You probably can't log into them, but it's a very wide net that you can touch. Identity-Aware Proxy looks at your user credentials and looks at the roles and the groups that you're associated with. You can integrate it, because it's backed by Cloud Identity: you can use Active Directory Federation Services, you can use two-factor authentication and single sign-on. And then it says, well, actually, you can't get through this load balancer, you can't get through this proxy, unless your job role and function and group are correct. So it means the attack surface is significantly reduced, because no longer does every employee have network connectivity to all of the applications that are out there.
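A rough sketch of that first step, with hypothetical policy, backend service, project, and group names. It assumes the application is already behind an HTTPS load balancer with Identity-Aware Proxy enabled on its backend service:

```
# Cloud Armor: allow only the corporate egress IP, deny everything else.
gcloud compute security-policies create corp-only-policy
gcloud compute security-policies rules create 1000 \
    --security-policy=corp-only-policy \
    --src-ip-ranges="1.2.3.4/32" \
    --action=allow
gcloud compute security-policies rules update 2147483647 \
    --security-policy=corp-only-policy \
    --action=deny-404
gcloud compute backend-services update hr-app-backend \
    --security-policy=corp-only-policy \
    --global

# Identity-Aware Proxy: only the HR group can get through the proxy at all.
gcloud projects add-iam-policy-binding acmecorp-hr-app-prod \
    --member="group:hr-staff@acme-corp.com" \
    --role="roles/iap.httpsResourceAccessor"
```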
You're no longer reliant on the application itself authenticating and authorizing those users. Because, as we all know, if there's a particular vulnerability in some part of the stack, in some middleware messaging platform, then any of those desktops, if they were compromised, is going to sweep the network, try to find one particular application server environment that has that vulnerability, and then they have a way into your network. With this, you've suddenly restricted that down to a small subset of your application servers that are only hosting the services those users should be able to access. So the user is issued a pass through Identity-Aware Proxy, and they're allowed to access the service. Next up, again, we're going to use Cloud Armor. What Cloud Armor can do is that, now we know this is a trusted user and it's an HTTPS load balancer, we can inspect the contents of the packet. We can do some simple SQL injection checking. We can look for cross-site scripting. We can just check the packets, seeing that they're the right size, and shape, and volume, and sort of conform to standard bounds. If that looks good, we'll then pass it through to the service on the back end. In this instance, we use Compute Engine, so it's running as simple infrastructure as a service. It could be Kubernetes Engine, it could be App Engine Flexible. That, in turn, will have the VPC firewalls, and those VPC firewalls will allow you to restrict ports and protocols at a networking level. So the next question I normally get from people is-- OK, let's assume we're trying to move away from IP whitelisting and blacklisting. We want to promote employee mobility so people can work from home. We're not having to get them to dial up to our remote office range and connect through there. But we only want people to access it from the US, or if they're in the UK, only from the UK. So again, Cloud Armor has functionality for this. You can write a simple rule that says we're only going to let you in if the origin of your traffic comes from this or these particular countries. Or, on the flip side, we don't want you coming from these particular countries. So this is kind of step two on the journey. You're removing the hard IP whitelisting and saying, right, now my employees-- I know they're working in the United States, which is better for me. They're going through, and we're validating the packets. We know that they're doing the right job function. We know they're using two-factor authentication. And finally, they can actually access the application services we're running. I generally find, talking to my customers, that's a higher level of security than they'd ever been able to build with the kind of DMZ they've had for the last 15 years. And this is all scalable Google technology. You're not relying on a single point of failure. You're not reliant on managing traditional technologies. Another thing you can do now is enforce standards around SSL and TLS. Attached to the HTTPS load balancer that we're running, you can simply state the ciphers that are allowed and the versions of TLS that you'll accept for clients connecting to it. So you can enforce a minimum of TLS 1.2. You can add or remove the particular ciphers that you'd like people to use, to ensure that those clients-- if they are employees working from home, and hopefully they're in the right country and they've got the right roles-- are using good encryption in transit that you trust and validate and that conforms to your standards.
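And step two might look something like this, continuing the same hypothetical names; depending on your gcloud release, the Cloud Armor rules language used for the country check may sit behind the beta track:

```
# Only accept traffic originating in the US or the UK.
gcloud compute security-policies rules create 900 \
    --security-policy=corp-only-policy \
    --expression="origin.region_code == 'US' || origin.region_code == 'GB'" \
    --action=allow

# Enforce TLS 1.2+ with a modern cipher profile on the HTTPS load balancer.
gcloud compute ssl-policies create modern-tls \
    --profile=MODERN \
    --min-tls-version=1.2
gcloud compute target-https-proxies update hr-app-https-proxy \
    --ssl-policy=modern-tls
```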
So another question that comes from this is-- OK, so we're running our service in more of a remote-access model, we're using Identity-Aware Proxy, we're using Cloud Armor, we're allowing people to access it without being on the corporate network anymore. So where are they connecting from? Where does the traffic go to? And actually, how do my services interact with each other, and what are they doing? VPC Flow Logging is typically how we get this information out. With VPC Flow Logging, you turn it on on a per-subnet basis, and it records four different types of traffic. We record traffic within a subnet, traffic between subnets (assuming the logging is turned on for them), traffic to and from Google services-- so sending requests to BigQuery and getting responses-- and also internet traffic-- egress and ingress between your applications and services and the internet. And all of these logs, as I said before, go to Stackdriver Logging, and they're pretty valuable. So the quickest way to get some insight into them is to feed them out into BigQuery, as we talked about earlier. Use aggregated log exports for each of your projects, set them up to feed the logs out, aggregate the logs, set up the filters to send just the VPC Flow Logs out into BigQuery so they're immediately available to query, and then use Data Studio to build a nice little report on top of it to try and figure out what's in the logs. So here's a little dashboard that I pulled together with a sample application that I spun up. I didn't put any country restrictions on it; I was just interested to see who was going to start pinging around a "Hello, World" application that I spun up. You can see here that it was running in two regions: europe-west1 and europe-west2. We had connections from a lot of different countries. We can see the round-trip time for the packets that were coming through. There's a huge amount of detail and information that's really valuable inside these logs. And typically, building this in non-cloud environments requires pretty expensive technology and fairly specialized vendors, plus dealing with the volume of data it can produce. VPC Flow Logging can produce a decent amount of information, and you can mine some really interesting trends in it, particularly when you're looking across your projects and across your applications: looking for common types of traffic, looking for particular applications connecting to you that you don't like, looking for particular ports and protocols that you're not expecting. So that leads to another use case: which services talk to each other? In a world where things are very connected, there are lots of dependencies around them. There were some great announcements at Next this week about Istio. Istio is fantastic if you're running in a microservices environment inside Kubernetes. Many of my customers are not at that stage yet. They're using traditional virtual machines and infrastructure, and those are talking to each other in all kinds of wonderful ways, and they just kind of hope they know what they're doing. So if you're using VPC Flow Logging, you turn on the logging for all the different subnets, and now you can start to profile the applications. You can figure out which ports and protocols are being used and how much traffic is being sent between them.
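A sketch of that pipeline, with hypothetical subnet, organization, project, and dataset names. The log filter, table name, and field names follow the BigQuery export layout for the vpc_flows log, so adjust them to whatever you see in your own dataset:

```
# Turn flow logs on for a subnet.
gcloud compute networks subnets update prod-subnet \
    --region=europe-west1 \
    --enable-flow-logs

# Aggregated export: just the VPC flow logs, from every project in the
# organization, into one BigQuery dataset.
gcloud logging sinks create org-vpc-flows \
    bigquery.googleapis.com/projects/acmecorp-logging-prod/datasets/vpc_flows \
    --organization=123456789012 \
    --include-children \
    --log-filter='logName:"compute.googleapis.com%2Fvpc_flows"'
# (Remember to grant the sink's writer identity access to the dataset.)

# Once logs are flowing, a simple profile of traffic by destination port.
bq query --use_legacy_sql=false '
SELECT
  jsonPayload.connection.dest_port AS dest_port,
  COUNT(*) AS flows
FROM `acmecorp-logging-prod.vpc_flows.compute_googleapis_com_vpc_flows_*`
GROUP BY dest_port
ORDER BY flows DESC
LIMIT 20'
```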
One of my customers took this even further: they took the aggregated exports and the summaries of those flows, turned them into VPC firewall rules, applied them, and locked it down as a policy. They monitored it for a period of around two weeks, found every single connection that they saw between these applications, and then built very specific firewall rules to limit that connectivity to only those particular approved connections. So what they found was that, when the application changed and they saw a new connection coming through, the question becomes: is it a developer? Can we correlate this with an update or a patch to the software we were using? OK, fine-- we just need to update the documentation and the release process to better incorporate that change. Or is it nefarious activity? Has someone compromised our application, and now they're trying to move around to see what other services are out there that they can go and connect to? Answering that question is pretty critical, and this gives you a lot of visibility to see that very quickly and then do the investigation. So a use case I saw recently from a customer was: how do I keep my secrets secret? And we're going to use Cloud KMS for this. And it's really important to share: you cannot store secrets inside Cloud KMS. But it's an enabling technology that will help you keep your secrets secret. The use case here was API keys. An application that was being built talked to a third-party service that authenticated with an API key-- in this case, it was sending emails. The development and test environment had one key that had a limited quota on it, a limited amount of spend. Production had another key. And they simply wanted to make sure that that key wasn't distributed widely. They definitely didn't want to keep it inside the source code, because then every developer could access it. And they didn't quite know where to put it. So the recommended solution for this-- and this is in our public documentation-- is building two projects to separate the duties. Actually, there are three if you count the project the application is running in as well. On the left-hand side, we have the Secret Storage project, and we're going to store the secrets in Google Cloud Storage. On the right-hand side, we have the Cloud KMS project, which does the encryption and decryption. And the nice thing here is that one person or one team can manage the keys, and another person or another team can manage the encrypted secrets. So the first thing you do is take your API key, send it to Cloud KMS, pick the particular key and key ring that you want, and it gives you back an encrypted blob to store. You then place that encrypted blob in a bucket in Google Cloud Storage. When your application starts up, it does two things. First, it connects to Google Cloud Storage and reads that blob from the bucket. You obviously set up all the permissions and scopes and roles needed to pull that down. If any other person had access to that bucket and logged in and saw it, it would be of no use, because they couldn't actually see what's inside that encrypted blob. The next step is that the application goes to Cloud KMS, sends the encrypted blob, and Cloud KMS returns a decrypted version of it and says, here's your API key. So what you've done is delink it. You've brought in a separation of duties. Someone can independently rotate the keys and manage the keys for you.
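A sketch of that pattern end to end, with hypothetical project, bucket, key ring, and key names; in production the application would call the Cloud Storage and Cloud KMS APIs directly rather than shelling out to gcloud:

```
# In the Cloud KMS project: create a key ring and a key.
gcloud kms keyrings create app-secrets --location=global --project=acmecorp-kms-prod
gcloud kms keys create email-api-key --location=global --keyring=app-secrets \
    --purpose=encryption --project=acmecorp-kms-prod

# Encrypt the API key; only the ciphertext goes into the secrets bucket.
gcloud kms encrypt --location=global --keyring=app-secrets --key=email-api-key \
    --plaintext-file=email-api-key.txt --ciphertext-file=email-api-key.enc \
    --project=acmecorp-kms-prod
gsutil cp email-api-key.enc gs://acmecorp-secret-storage/prod/

# At application start-up: fetch the blob, then ask Cloud KMS to decrypt it.
gsutil cp gs://acmecorp-secret-storage/prod/email-api-key.enc .
gcloud kms decrypt --location=global --keyring=app-secrets --key=email-api-key \
    --ciphertext-file=email-api-key.enc --plaintext-file=- \
    --project=acmecorp-kms-prod
```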
Another team can manage the encrypted blobs and secrets, and then the application can run fairly statelessly. It doesn't need to be bundled with the secrets; every time it starts and restarts, it goes and grabs that key again. And of course, this is all logged. So you can start to look for activity and ask: were there any failed decryption attempts? Was anyone pulling down these particular blobs from a different application or a different source and not then going and decrypting them? You can get some pretty good visibility there. So the final solution I'm going to talk about, which answers a really common question, is: are my machines up to date? We're talking about infrastructure here, so plain old virtual machines. And the assumption is that they need updating, they need patching, and they also need to conform to your security standards. So there's the baseline that you define. The baseline might be a NIST or a CIS standard, and probably your organization-specific requirements as well. But let's say some particular vulnerability has come up, and you want to update glibc or libssl to make sure it's maintained. So there are really two questions. How do you make sure people are only using trusted images? And secondly, how do you make sure they're always using the latest version? The first thing we're going to do is build custom images. Part of Compute Engine is an image registry. It's not actually called an image registry, but that's effectively what it is inside Compute Engine. And with this, you can store baked images. Typically, it's an automated process that will go through and pull down the patches, build them together, do your security configuration and hardening, and publish the image. It could also be a manual process, depending on volume. And you can choose whatever flavor you like-- particular versions of Linux, or Windows, or whatever takes your fancy-- and publish those custom images, share them, and make them available for other users and other projects to start from. So in this case, a user that has access to start a virtual machine instance pulls the image from the image registry based on its URL and then builds the instance from that version. There's a constraint you can apply in Resource Manager that allows you to control which particular versions and which particular images can be used. And this is important because, by default, Google publishes a whole range of images for you. We produce standard Ubuntu, standard CentOS, standard Windows Server. And those don't necessarily conform to your security standards and policies. They're patched, they're updated, and they're fairly up to date against known vulnerabilities, but they probably don't contain your particular configurations and your endpoint and management tools. So you can apply this constraint and essentially say, none of my users can use any of the Google-provided images-- it's completely denied. What they're left with is that they can only access the custom versions which you've published. So then what you're saying is that any developer, any person anywhere in your organization who wants to spin up an instance, has to use an image that comes from your custom images, which you've approved, standardized, and pushed out. So images go through a lifecycle. There are four different states they go through. By default, they're active: you create a new image, people can use it-- happy days.
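A sketch of that constraint, with hypothetical image-project and organization names, plus a preview of the deprecation flags covered in the lifecycle discussion that follows:

```
# Limit Compute Engine to images from your own image project.
cat > trusted_images_policy.yaml <<'EOF'
constraint: constraints/compute.trustedImageProjects
listPolicy:
  allowedValues:
  - projects/acmecorp-trusted-images
EOF
gcloud resource-manager org-policies set-policy trusted_images_policy.yaml \
    --organization=123456789012

# And, for the image lifecycle described next: deprecate an old image with
# automated obsolescence and deletion after relative amounts of time.
gcloud compute images deprecate base-ubuntu-1804-v20180720 \
    --state=DEPRECATED \
    --replacement=base-ubuntu-1804-v20180725 \
    --obsolete-in=7d \
    --delete-in=14d \
    --project=acmecorp-trusted-images
```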
After the active state, you can deprecate an image. It's still allowed to be used-- often people will support two or three old versions-- but users will get a warning saying that this is not the latest version. And typically, people access images based on the name or the family: they'll say, just give me the latest version of Ubuntu, rather than a particular version that's been produced. But this will warn them that, if they've hard-coded a particular version, the version they just specified is not the latest. Obsolete will actually stop them using it. If they try to start an instance using a particular image that's now obsolete, it will fail, it won't start up, and it will give them a failure instead of a warning. Then the fourth state is deleted, where the image is flagged as deleted-- it's not actually removed from disk, you have to do that separately-- but it's no longer visible and available. And you can do automated obsolescence. What this means is that, when you mark a version of an image as deprecated, there are two extra flags you can pass: an obsolete-in and a delete-in. And these can be a relative amount of time-- so, in seven days-- or an absolute time-- a particular date that you give. So as part of your processes-- ideally an automated baking process-- you can say, right, here's the latest version of the image I've just published for you, and the previous version is deprecated as of today. So all of my users will get warnings telling them this isn't the latest version. And automatically, we'll then make it obsolete in seven days' time and delete it in 14 days' time. It's another thing you don't have to think about: it will just run through that lifecycle for you, and users should treat it appropriately. If people start complaining that the version of the image isn't available, they need to change their deployment scripts to use the image family rather than particular versions. So we talked about three different things there. We talked about control. Identity is super important in any cloud provider, and particularly in Google Cloud, because all of your access to resources is focused around your identity. There's much less of a focus on network controls, particularly with the more managed services. Service accounts are really powerful if you use them correctly, and there are lots of little details around who can access a service account. That's definitely something I'd recommend you go away and review, probably with Forseti, to see who has that control. Visibility is out there. You can see everything that happens inside Google Cloud, but not by default. And you can't wait until there's an incident to go and find that out, because unless the logs are on, you can't go backwards and see it. So you need to review Data Access Logging, you need to review VPC Flow Logging, and if it's appropriate, turn them on today to prepare you for the future. And we talked a bit about some solutions: how we combine different GCP services together to get you to a point where you can run your applications and services more securely, and get you prepared for a BeyondCorp world of the future where all your employees can access things remotely and securely and do whatever they need to do. Thank you very much for attending my talk. I'll be able to take questions outside. [APPLAUSE] [MUSIC PLAYING]
Info
Channel: Google Cloud Tech
Views: 31,338
Keywords: type: Conference Talk (Full production); pr_pr: Google Cloud Next; purpose: Educate
Id: ZQHoC0cR6Qw
Length: 42min 29sec (2549 seconds)
Published: Wed Jul 25 2018