AWS re:Invent 2020: Data lake security in Amazon S3: Perimeters and fine-grained controls

Captions
Welcome, everybody, to our discussion of data lake security in Amazon S3. My name is Becky Weiss. I'm an engineer with AWS, and I'm honored to be doing this presentation with Rajeev Sharma, a security architect from the Vanguard Group.

Maybe I should start by talking about what exactly we mean by a data lake, because that term gets used a lot. It typically means a lot of data, and that data can live in a number of data sources, of which S3 is only one: log data, data in relational databases, other kinds of databases. There are often a lot of different consumers of the data, with different levels of access to it, trying to do different things with it, and a bunch of services in the middle, applications that you own as well as AWS services, that all of those consumers use to get various insights out of the data. What we're here to talk about is the stuff in this box: the data, probably a lot of data, that you are keeping in S3, and how you keep it secure.

So, you are in the right place if you are keeping a lot of data in S3 as part of your data lake, running a lot of different analysis tools against it, and you're looking for strategies to keep your data secure, both at the coarse-grained level, building a good perimeter around it, and at the fine-grained level, making sure the right entities have access to the data they need to do their jobs. Here's how we're going to structure this: I'm going to tell you about the AWS capabilities, techniques, and strategies that you can use to secure this data lake, and then Raj is going to talk about how Vanguard has put some of those principles into practice, drawing on their lived experience.

Now, to kick this off: if you do nothing else at all about your data lake in S3, go and turn on Block Public Access in S3. It's available as both a bucket-level and an account-level setting. The account-level setting is the easy mode. It basically asserts that you don't have any publicly accessible data, and the Block Public Access setting will keep it that way. As you probably know, when you create an S3 bucket, it is private to your account by default. You can, of course, add policies that share it with specific other entities. What Block Public Access does is ensure that the entities that should have access, the ones you specifically named, still have access, but outsiders are kept out regardless of exactly how the policy is written.
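As a concrete illustration (not shown in the talk itself): the account-level setting is a single configuration with four flags. This is a sketch of the PublicAccessBlockConfiguration document you would pass to the S3 Control PutPublicAccessBlock API, with all four protections enabled:

```json
{
  "BlockPublicAcls": true,
  "IgnorePublicAcls": true,
  "BlockPublicPolicy": true,
  "RestrictPublicBuckets": true
}
```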
Now, of course, in practice, you may actually have some data that's deliberately public. Raj is going to talk here about how Vanguard manages its static content within an environment that normally has Block Public Access turned on.

Thanks, Becky. So, here at Vanguard, we sometimes do have to publish data through S3 buckets, and with Block Public Access turned on, that makes it very difficult. So, how do we publish things like static content: HTML files, JavaScript, and CSS? We do it through the CloudFront service. CloudFront can connect to an S3 bucket and allow public access to that bucket, and what goes into that bucket is controlled through a pipeline. In this model, content writers and content approvers review content and place it into a static-object bucket, things like HTML and JavaScript. The content pipeline has Get and Put permissions on that bucket, and the bucket itself has what's called an origin access identity (OAI), which allows CloudFront to serve that data out to the internet.

However, since these static buckets live in the same account as some data buckets, it is possible for a data user, who also has Get and Put permissions, to accidentally put confidential data into a static bucket and thereby expose it to the internet. Another problem we've faced at Vanguard: when we have hundreds and hundreds of buckets within an account, how do we know which bucket is actually the one that can serve static content out to the internet? That becomes a very hard problem to solve. And there's an insider threat pattern we've considered as well: what if an external attacker account tries to get data from an S3 bucket by having an insider add the OAI of the attacker's account?

Solving this turned out to be pretty simple. We created two separate accounts, logically: one account contains only static objects, and the other contains the data buckets. The pipeline is only allowed to put static content into the static content account, and we enforce this through a service control policy, while data users only have access to the accounts that hold data in their S3 buckets. So, the service control policy we put in place really says: deny anyone PutObject into any of these content buckets, unless it's coming from the content pipeline.
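The policy itself isn't captured in the transcript, but a minimal sketch of the shape Raj describes might look like the following, assuming a hypothetical account ID, content-bucket naming convention, and pipeline role name. Since SCPs apply to every principal in the organization and have no Principal element, the pipeline is exempted with a condition:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyPutObjectExceptContentPipeline",
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::static-content-*/*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/content-pipeline"
        }
      }
    }
  ]
}
```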
The other thing we wanted to make sure of is that a CloudFront distribution in the static content account doesn't inadvertently point to a bucket inside one of the data accounts, because if it did, it would make that data public. We accomplish that using CloudWatch and CloudTrail events. An event is triggered any time a distribution is created or modified within the AWS account, and that event is sent through a topic to a Lambda function. The Lambda function verifies that the CloudFront distribution only allows access to the correct S3 buckets in that one account and no others. If it does find a distribution crossing accounts into an account that contains data, the Lambda function assumes a role and applies remediation, either disabling the distribution or removing the origin altogether.

Well, now that you've turned on Block Public Access, you probably want to think a little more about the perimeter around your data: the coarse-grained controls you use to keep control of it. I talk to customers about this topic a lot, and they all have different words for it, but what they're all trying to accomplish is this: "I have my identities and my data. My identities should be accessing my data, and I've got network locations that I expect them to be accessing it from. That's how I want it to work." AWS provides a number of perimeter boundary controls that let you assert exactly that: my identities, accessing my data, from my networks.

So, let's start with my identities. As you know from using AWS IAM, you write permission policies for your identities that say what they can and cannot do. This is a great place to assert that an identity talks only to the specific resources it needs to talk to. If it's an application, you know exactly what those are. You can also assert where in the network its requests are supposed to come from. I'm going to show you an example policy that I might attach to an IAM role that's part of an application in my network. IAM policies sometimes reward pausing and reading them slowly, but I'll talk through this one quickly in English. The first statement is specific about my buckets: these are the buckets example-bucket-1 and example-bucket-2, which are part of the application. The second statement is a deny. Notice, though, that it's really an assertion of an invariant: I'm saying that if this identity tries to take any action from anywhere other than the VPC network it's expected to operate from, I want the access denied. Now, you'll notice there's a "via AWS service" condition in there. It's relatively new, so you may not know what it is yet. I'll talk later about what it does and why it's so useful for your data lakes.
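The slide isn't visible in the transcript, but a policy like the one Becky walks through might look like this sketch, using the aws:SourceVpc and aws:ViaAWSService condition keys (the bucket names and VPC ID are hypothetical placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowApplicationBuckets",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket-1",
        "arn:aws:s3:::example-bucket-1/*",
        "arn:aws:s3:::example-bucket-2",
        "arn:aws:s3:::example-bucket-2/*"
      ]
    },
    {
      "Sid": "DenyUseOutsideExpectedNetwork",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": { "aws:SourceVpc": "vpc-0example1234567890" },
        "Bool": { "aws:ViaAWSService": "false" }
      }
    }
  ]
}
```

The deny fires only when both conditions match: the request is not arriving through the expected VPC, and it is not being made by an AWS service on the identity's behalf.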
Now, that deny statement: if you want to scale it out, to say "all my identities in this account, in this organizational unit, in this organization may operate only from this network or from a given set of networks," you use your organization as your identity boundary and write that same kind of statement as a service control policy.

You can also enforce this on the network, which is another boundary you own. You're probably using a VPC endpoint to reach S3 from your virtual private cloud, and that has a feature called a VPC endpoint policy, where you can make assertions like "only my identities," and optionally "only my resources." I'll show just the "only my identities" part here. Again, this is a boundary policy: it doesn't grant anybody access to anything, but it must be satisfied for requests through that network path to succeed. What it means is that if somebody is using S3 from this network, it's going to be one of my identities; otherwise, the request won't work. So, that's your second boundary.

The final boundary is the resource boundary, and that is the bucket policy. A bucket often holds a lot of S3 data, objects from maybe several different data sets, and the policy on the bucket is where you make assertions over all of that data about how it's going to be accessed. Thinking in perimeter terms: by my identities, and from my expected networks. In the example I'll talk through quickly, the first statement says, "I know which accounts are going to access this bucket, so I'll list them here." The second says, "an AWS service is going to be accessing it," a pattern I'll come back to later. And finally, there's the by-now-familiar network location: this bucket is effectively part of this network and isn't going to be accessed from outside it, so let me assert that here.

So, that's how you build a perimeter. You use your identity perimeter to say, "my identities access my resources, and only from my networks." You use your network perimeter to say, "in this network, it's my identities and my resources." And you use your resource perimeter to say, "it's my identities, from my networks." Now, we're going to turn it over to Raj, who's going to talk about implementing some of these perimeter controls at the Vanguard Group.

So, how does Vanguard actually implement these network security controls in practice? One thing that led us to these controls was a work-from-home scenario. An interesting thing happened: we saw that work-from-home users, connected over a VPN to our corporate data center, were accessing an AWS account, say, making an Athena query, through our proxy server. That's the intended pattern; the proxy server is part of the security stack we use to monitor individuals' access. At times, though, the VPN may drop unexpectedly, and now the user working from home is accessing the Amazon Athena service directly over the internet. What happens then is an alert to the Security Operations Center, because it looks like a credential that was used from one IP address suddenly moved to another IP address in a geographically different location, and that can trigger an alarm. So, for those reasons, we looked at network-based controls to ensure the intended path is continuously followed.

We limit the STS tokens generated by AWS so that they're usable only from the Vanguard network, and we do this by creating an IAM permissions boundary policy. The boundary policy says it will allow any action on any resource, so long as the request comes from this IP range: basically, our secure stack. One thing to note: boundary policies do not grant permissions; they only set the boundary, so the IAM permissions the user already has remain in effect within it. But what happens when this user needs to use Athena to, say, query an S3 bucket? Since the request to S3 comes directly from Athena, it would cause that IP restriction to fail. This is where we also use the aws:ViaAWSService condition key, which allows the credentials to safely pass through the Athena service to the S3 buckets. So the boundary has two statements: the first says the request must come from our IP range, as long as it's not coming from a service; the second says that if the request is coming from a service, it's allowed to proceed. That's what really creates the trust between Athena and the bucket.
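The actual Vanguard policy isn't shown in the transcript, but a minimal sketch of a permissions boundary with those two statements, assuming a hypothetical corporate IP range, might look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyFromSecureStack",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": "203.0.113.0/24" },
        "Bool": { "aws:ViaAWSService": "false" }
      }
    },
    {
      "Sid": "AllowOnwardCallsViaService",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "Bool": { "aws:ViaAWSService": "true" }
      }
    }
  ]
}
```

Remember that a permissions boundary only caps what the role can do; the role's own permission policies still have to grant the underlying actions.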
We also had to do a similar thing when the user or role was running within a VPC. For principals running in a VPC and going through a VPC endpoint, instead of an IP restriction we use the aws:SourceVpc condition key, which allows the traffic to go through the VPC, through Athena, and then into the S3 bucket. So, in summary, what we're doing here is ensuring that an IAM principal must come through our proxy servers and secure stack, or through an approved VPC endpoint. It gets denied access if it comes through, say, a home network or an unapproved VPC endpoint, and that ensures the tokens remain within the Vanguard environment.

So, how do we get this to scale? We've talked about the IAM policy side; what about the resource policy side? When you're looking at hundreds of principals and thousands of buckets within an account, you wind up with a mesh-style authorization scheme, and it's hard to really know who has access to what. And then, if an auditor wants to go take a look at some of the buckets, or a scanner robot needs to validate that bucket policies are set up correctly, all of a sudden those principals need access to the buckets too, and when they get it, the resource policies grow. If you have thousands of buckets deployed by hundreds of DevOps teams, this becomes a significant amount of churn just to keep maintaining the resource policies.

Our solution was to move certain IAM principals into a path structure. For example, if we had a developer principal, we would move that principal into a path, let's say the "universal" path. What that allowed us to do in the resource policies was really simplify the code. Originally, a resource policy would have had to list out every single role that needed access to the bucket, and when a new principal needed to be added, every one of those resource policies had to be updated. Using the path structure, we can instead wildcard the roles by path, maintain a list of known roles, and add or remove roles in that path; they then get access to the bucket. Now, these roles, as you can imagine, become extremely sensitive and need to be watched over very carefully. So, on top of this, we added protection with a service control policy at the org level, which denies anyone the ability to modify these roles unless they're using the IAM engineer administrator role.
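The policy text isn't shown in the transcript, but a common way to express this path-based wildcard in a bucket policy is an ArnLike condition on aws:PrincipalArn, since the Principal element itself doesn't accept partial wildcards. A sketch, with a hypothetical account ID and bucket name:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAnyRoleUnderUniversalPath",
      "Effect": "Allow",
      "Principal": "*",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-data-bucket",
        "arn:aws:s3:::example-data-bucket/*"
      ],
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/universal/*"
        }
      }
    }
  ]
}
```

Adding or removing a role under the /universal/ path now grants or revokes access without touching the bucket policy itself.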
All right. Now that we're on the topic of assigning IAM permissions to data: in a data lake, there are a number of different access patterns. There were a bunch of things in that opening picture, but there are really three fundamental patterns of access, and to write effective policies you need to know all three and apply each in its appropriate use case.

The first is mostly what we've been talking about: an IAM identity, like a role, accessing data, with permissions policies that authorize it. This is what I would call direct access, an identity directly accessing the data it needs. Now, there's a variant of this. Often, in analytics use cases, you might be going through Amazon Athena to run nice, scalable queries over your data. In that case, the role is making an API call not to S3 directly, but to an Athena API, StartQueryExecution. The way Athena works, it then makes an onward request on your behalf, using your identity and your permissions policies, to the underlying raw data in S3. That means, of course, that your identity needs access to the raw data it's about to query. The one thing you need to know here is that Athena doesn't run in your network; it's an AWS service. So, if you're using those network perimeter controls, like the source-VPC condition we looked at earlier, this is where aws:ViaAWSService comes in so handy: it lets you account for these onward-call use cases simply and at scale.

The second pattern is an AWS service that has persistent access to your data. Athena had temporary access to your data to run your query, and it did so under your own identity. A service like CloudTrail, and many of our other services, particularly ones that deliver logs to S3, will be writing data to your S3 buckets on a continuous basis. The way those work, the pattern to understand, is that the service makes its requests under its own identity, called a service principal. So, the cloudtrail.amazonaws.com that you see there is the allowed principal; that's who's putting data in your bucket. You'll see I'm also following the best practice in this policy of being specific about the path it writes to, with my account number in it, so that I know it's writing data specifically for my account. That's the second pattern.
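The policy on the slide isn't captured in the transcript, but the standard CloudTrail delivery policy has this shape (the bucket name and account ID are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSCloudTrailAclCheck",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::example-log-bucket"
    },
    {
      "Sid": "AWSCloudTrailWrite",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::example-log-bucket/AWSLogs/111122223333/*",
      "Condition": {
        "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" }
      }
    }
  ]
}
```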
The final pattern is one you'll see a lot in analytics use cases, because often you're accessing your data through some other application that offers value-added processing, say, machine learning training. These applications, including your own applications and AWS services (I have EMR and SageMaker notebooks as examples here), run in their own compute environments. EMR, for example, runs on EC2 instances, and those EC2 instances, in turn, access the data using IAM roles associated with the compute environment. So you'll often see this pattern where it's actually the identity of the compute environment itself accessing the data. The reason that's important to think about is that, at the end of the day, you have human users connecting to these environments in different ways, say, a data scientist connecting to a Jupyter notebook. You'll want to manage people's access to these environments on the basis that each environment itself has access to the set of data it needs. There are two identities in the mix here, and that's the third pattern to be aware of. Now we're going to hear from Raj about how the Vanguard Group provisions its EMR clusters to take these security and access patterns into account.

So, how do we, in practice, manage the AWS service environment when it comes to EMR? We have a development team that creates EMR items in AWS Service Catalog. They build the catalog item using CloudFormation, parameter files, and various tags, and add it to the code repository, where an approver can review it and ensure the roles and permissions are set correctly. Once it passes those checks, the build and deploy agents deploy the portfolio and product into Service Catalog, including the role that the EMR cluster will use and all the CloudFormation that goes along with it. When a business user wants to use one of these EMR clusters, they have access to the AWS Management Console, and all they need access to is the Service Catalog item, which contains all the information needed to construct an EMR cluster. That product, launched with the Service Catalog role and through CloudFormation, deploys the EMR cluster with the correct roles and the correct access.

Now, when data scientists, data engineers, and data analysts want to gain access to a cluster, they first must authenticate against a local directory server, which provides their authentication. On top of that, within the clusters themselves, we use role groups: users must belong to a particular role group to be authorized on the cluster. Once they're authorized, the cluster has the correct permissions needed to access the S3 data, as well as any KMS keys used for decryption.

Okay, for the final couple of minutes here, I'm going to talk about going even finer-grained than the permissions policies we've been looking at. Of course, with IAM you already get a lot of fine-grained control. You're probably familiar with the pattern where different IAM roles (capital-R Roles) fulfill different lowercase-r roles, for example, project-based roles. Here, I have the yellow, green, and blue projects, with people assigned to them, maybe through my identity provider, and I probably have my data structured in my S3 bucket along matching prefixes. They look like folders, but they're not; they're just strings, which makes wildcards possible, such as the one in this policy. This is the permission policy for the blue project role, and it grants permission to the data under the blue project prefix.

Now, if you start to scale that up, particularly if these roles live in a bunch of different accounts, you start to get some fairly long, monolithic bucket policies. To make that easier to manage, once you start to see a pattern of data sharing with discrete use cases that you'd like to factor out and manage separately, S3 Access Points, which we announced at re:Invent 2019, offer exactly this factored-permissions model. I might have hundreds of different access patterns on my bucket, and I don't want to encode them all in one very long, monolithic bucket policy. So, what I do is create an access point for each of these use cases, tailored to it. For a role, interacting with an access point is the same as interacting with a bucket: the request looks very much like an ordinary S3 data-plane API call, and you can think of the access point as an alternative endpoint for the bucket. Then, in the bucket policy, rather than managing each use case individually, I might have a policy that allows entities in my organization as long as they're coming through an access point I created for one of these projects. The access point itself is where you write the policy for the specific use case, and it's well factored because it's scoped to that use case. If you have these data-sharing scenarios, a number of customers have had a really good experience using access points to factor them out.
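The bucket policy on the slide isn't captured in the transcript, but the delegation pattern Becky describes is commonly written with the s3:DataAccessPointAccount and aws:PrincipalOrgID condition keys. A sketch, with placeholder IDs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateAccessControlToMyAccessPoints",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-data-bucket",
        "arn:aws:s3:::example-data-bucket/*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "o-exampleorgid",
          "s3:DataAccessPointAccount": "111122223333"
        }
      }
    }
  ]
}
```

With this in place, the per-use-case allow and deny decisions live in each access point's own policy rather than in the bucket policy.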
Now, this next part is a whole topic unto itself, so I'm going to go through it really quickly. Often, your data in S3 is actually structured as databases, and your more natural mode of assigning permissions is at the database level: databases, tables, columns. You want to use that mechanism, and in fact those columns sit below the object level. S3 and IAM go down to the object level in terms of access permissions, because they're about API requests being made, but what you actually want here is content-based filtering. Well, if you're using these analytics services, they're integrated with AWS Lake Formation. What you do is configure access in terms of databases, tables, and columns in Lake Formation; the services integrate with it, and it does the content filtering for you. So it's useful to explore for those use cases where you have database-structured data.

That brings us to the end of this whirlwind tour of securing your data lake data in S3. There's a lot around this picture, but if you have a lot of your data in S3, and you get the security practices right on that data, you've taken a really large step toward good security for your whole data lake. Enjoy the rest of re:Invent, and thank you so much.
Info
Channel: AWS Events
Views: 344
Rating: 5 out of 5
Keywords: re:Invent 2020, Amazon, AWS re:Invent, STG302, Storage, Amazon Simple Storage Service (Amazon S3), Vanguard
Id: 6AROHrwj9GQ
Length: 25min 26sec (1526 seconds)
Published: Fri Feb 05 2021