Welcome, everybody, to our discussion
of data lake security in Amazon S3. My name is Becky Weiss.
I'm an engineer with AWS. And I'm honored to be doing
this presentation with Rajeev Sharma, a security architect
from the Vanguard Group. Of course,
maybe I should start by talking about what exactly
we mean by a data lake. That term gets used a lot. It's typically used
to mean a lot of data. And data can be
in a number of data sources, of which S3 is only one. There can be log data,
data in relational databases. Other kinds of databases. Often a lot of different
consumers of data with different levels
of access to it, trying to do
different things with it, and a bunch of services
in the middle. Applications that you own,
as well as AWS services, that all of those consumers
of the data are using in order to get
various insights out of the data. What we're here to talk about is
the stuff in this box. The data, probably a lot of data, that you are keeping in S3
and how you keep it secure. So, you are in the right place if you are keeping a lot of data
in S3 as part of your data lake, running a lot
of different analysis tools in it, and you're looking for the strategies
to keep your data secure, both at the coarse grain level, how to make
a good perimeter around it, and at the fine grain level to make sure the right entities
have access to the data that they need to do their jobs. Here's how we're going to structure this: I'm going to tell you a lot
about the AWS capabilities, and techniques,
and strategies that you can use in order to secure this data lake. And then, Raj is going to talk about
how they have put some of those principles
into practice with their lived experience
at Vanguard Group. Now, to kick this off,
if you do nothing else at all about your data lake in S3, go and turn on
block public access in S3. It's a bucket-level
and account-level setting. The account-level setting
is the easy mode. It basically asserts
that you don't have any publicly accessible data, and the block public access setting
will keep it that way. As you probably know,
when you create an S3 bucket, it is private
to your account by default. Now, of course, you can add policies that share it
with specific other entities. And what block public access does is it ensures that the entities
that should have access, the ones you specifically named, will still have access, but outsiders are kept out regardless of exactly
how the policy is written.
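If it helps to see it concretely, here's a rough sketch of the account-level setting as you'd pass it to the S3 Control PutPublicAccessBlock API; all four flags are the real ones, and turning them all on is the "easy mode" I mentioned:

```json
{
  "PublicAccessBlockConfiguration": {
    "BlockPublicAcls": true,
    "IgnorePublicAcls": true,
    "BlockPublicPolicy": true,
    "RestrictPublicBuckets": true
  }
}
```

Now, of course, in practice,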
you may actually have some data that's deliberately public. Raj is going to talk
to you a little bit here about how they manage
their static content within an environment that's
normally using block public access. Thanks, Becky.
So, here at Vanguard, we sometimes have to actually
publish data through S3 buckets. But with the public access block
turned on, that makes it very difficult. So, how do we actually publish
things like static content, like HTML files
and JavaScript and CSS? We do that
through the CloudFront service. So, CloudFront can connect to an S3 bucket and serve that bucket's content publicly. What can go into that bucket
is controlled through a pipeline. So, in this mode, content writers and content approvers
can look at content and then place it
into a static object bucket, things like HTML and JavaScript. The content pipeline has Get
and Put operations into that bucket. And then, the bucket itself has what's called
an origin access identity, which allows CloudFront to serve
that data out into the internet.
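To make that concrete, a bucket policy granting an OAI read access looks roughly like this sketch; the bucket name and OAI ID here are placeholders, not our real ones:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCloudFrontOAIReadOnly",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity E1EXAMPLEOAIID"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-static-content-bucket/*"
    }
  ]
}
```

However, since these buckets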
are in the same account as some data buckets,
it is possible for a data user to accidentally put confidential data
into a static bucket and then expose it out
into the internet because they have
the Get and Put operations. So, another problem that we've faced
here at Vanguard was: when we have these accounts, with hundreds and hundreds of buckets within them, how do we know which buckets are the ones allowed to serve static content out to the internet and which ones hold data that never should be served out? This becomes
a very hard problem to solve. Another internal threat pattern that we've considered: an insider adds the OAI of an external attacker's account to one of our buckets, which would let that attacker account pull data out of the S3 bucket. So, to solve this problem,
it was actually pretty simple. What we did is create
two separate accounts, logically. One account would just contain
static objects, and the other account
would contain data buckets. The pipeline would only be allowed
to put static content into the static content account. We can control this
through a service control policy. Here, we deny any kind of PutObject,
except for the pipeline. And then, the data users
only have access to accounts that have data in their S3 buckets. So, this is an example
of the service control policy that we put in place. And what
we're really saying here is that we're going to deny anyone PutObject into any of these content buckets, unless it's coming from the content pipeline.
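A condensed sketch of what such an SCP could look like; the bucket name and pipeline role naming convention are placeholders for illustration. Because SCPs don't have a Principal element, the pipeline carve-out is expressed as an aws:PrincipalArn condition:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyPutExceptContentPipeline",
      "Effect": "Deny",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::example-static-content-bucket/*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": "arn:aws:iam::*:role/content-pipeline-*"
        }
      }
    }
  ]
}
```

The other thing that we wanted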
to make sure was going to happen is that the CloudFront distribution
in this static content account doesn't inadvertently point
to a bucket inside one of the data accounts. And the way that we accomplish that
is using CloudWatch Events and CloudTrail. So, here, an event is triggered any time a distribution is modified
or created within the AWS account. And that event is sent
through a topic to a Lambda function. Here, this Lambda function ensures that the CloudFront distribution
only allows access to the correct S3 buckets
in that one account and no other accounts. Because if it had access
to a data account, it would then make that data public. If for any reason, it does find
that it's crossing accounts and going into an account
that contains data, the Lambda function
will assume a role and then apply remediation
by either disabling the distribution or just removing
the origin altogether.
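Since CloudFront management calls are recorded in CloudTrail, the trigger can be a CloudWatch Events rule with a pattern along these lines; a sketch, assuming the rule lives in us-east-1, where CloudFront's API calls are logged:

```json
{
  "source": ["aws.cloudfront"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["cloudfront.amazonaws.com"],
    "eventName": ["CreateDistribution", "UpdateDistribution"]
  }
}
```

Well, now that you've turned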
on block public access, you probably want to think
a little bit more about the perimeter around your data,
the coarse-grain-level controls that you use
to keep control of your data. I talk to customers
about this topic a lot, and they all have
different words for it. But really, what they're all trying
to accomplish is they say: "Look, I have
my identities and my data. My identity should be
accessing my data. I've got network locations that I expect it to be accessed from. And that's how I want it to work." Well, AWS provides a bunch of different perimeter boundary controls in order for you to assert that my identities are accessing
my data from my networks. So, let's start with my identities. Well, as you know from using AWS IAM, you write these permission policies
for your identities that say what they can and cannot do. Well, this is a great place
to be asserting that this identity is going to talk to specific resources
that it needs to talk to. If it's an application,
you know exactly what those are. And you can also assert where in the network
they're supposed to come from. So, I'm going to show you
an example policy here that I might have attached
to this IAM role that might be part
of an application in my network.
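A minimal sketch of that policy; the bucket names and VPC ID are placeholders standing in for the slide's examples:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowMyApplicationBuckets",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::examplebucket1",
        "arn:aws:s3:::examplebucket1/*",
        "arn:aws:s3:::examplebucket2",
        "arn:aws:s3:::examplebucket2/*"
      ]
    },
    {
      "Sid": "DenyOutsideMyNetwork",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEqualsIfExists": { "aws:SourceVpc": "vpc-0example12345" },
        "BoolIfExists": { "aws:ViaAWSService": "false" }
      }
    }
  ]
}
```

The great thing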
about this format is IAM policies, sometimes you want to take a moment and look at them. You can pause. But I'm going to talk it
through quickly in English. That first statement, well, I'm being specific
about my buckets. These are the buckets example bucket one
and example bucket two. They're part of the application. The second policy statement here,
it's a deny. Notice though, it's kind of
an assertion of an invariant. I'm saying that if this identity
tries to take any action from anything
other than this VPC network that they're expected
to take action from, I want the access denied. Now, you'll notice that there's
a "via AWS service" thing in there. It's relatively new,
so you may not know what it is yet. I'm going to talk about that later, and why that is so useful
for your data lakes. But that deny statement,
if you want to scale that out, you want to say:
"For all my identities in this account,
in this organizational unit, in this organization, I want them to operate
only from this network or from a given set of networks." Well, that's how you scale it out. You use your organization
as your identity boundary to create a policy like that,
a service control policy. But you can also do that
on the network. The network is
another boundary you own. You're probably using
a VPC endpoint to reach S3 from your virtual private cloud. And that has a feature called
a VPC endpoint policy, where you can make assertions
saying "only my identities," and optionally "only my resources." I'm going to show you an example
where it's only my identities. I'm just going to show you
that part here.
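Here's roughly what that identity-constraint portion of a VPC endpoint policy can look like, using the aws:PrincipalOrgID condition key; the organization ID is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OnlyMyIdentities",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:PrincipalOrgID": "o-exampleorgid" }
      }
    }
  ]
}
```

And again, this is a boundary policy. It doesn't grant anybody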
access to do anything, but it needs to be satisfied
in order for access from the network to work. What this means is that if somebody
is using S3 from this network, it's going to be
one of my identities. Otherwise, it's not going to work. So, that's your other boundary. Final boundary here is
the resource boundary, and that is the bucket policy. A bucket often has
a lot of S3 data in it, objects
maybe from different data sets. And the policy on the bucket is
how you make assertions over all of that data,
how it's going to be accessed. So, thinking perimeters, it's going to be accessed
by my identities and from my expected networks. Give you an example here.
I'm going to talk through it quickly.
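A condensed sketch of such a bucket policy; the account IDs, bucket name, and VPC ID are placeholders, and the aws:PrincipalIsAWSService key in the deny is my own addition here so the network restriction doesn't break the service-principal writes allowed above it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowMyAccounts",
      "Effect": "Allow",
      "Principal": { "AWS": ["arn:aws:iam::111122223333:root", "arn:aws:iam::444455556666:root"] },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-data-bucket",
        "arn:aws:s3:::example-data-bucket/*"
      ]
    },
    {
      "Sid": "AllowServiceLogDelivery",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::example-data-bucket/AWSLogs/111122223333/*"
    },
    {
      "Sid": "DenyOutsideMyNetworks",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-data-bucket",
        "arn:aws:s3:::example-data-bucket/*"
      ],
      "Condition": {
        "StringNotEqualsIfExists": { "aws:SourceVpc": "vpc-0example12345" },
        "BoolIfExists": {
          "aws:ViaAWSService": "false",
          "aws:PrincipalIsAWSService": "false"
        }
      }
    }
  ]
}
```

The first statement says: "Well, I know what accounts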
are going to access it, so I'll list them out here." Second one is: "Well, an AWS service
is going to be accessing it." I'll talk a little bit
about that pattern later. And then, finally, that familiar,
by now, network location. This bucket is
kind of part of this network. It's not going to be accessed
from outside that network, so let me assert that here. Well, that's how you do a perimeter. You use your identity perimeter
to say: "For my identities, they can access my resources
and only work for my networks." You can use your network perimeters
to say: "In this network, it's my identities and my resources." And you can use
your resource perimeter to say: "It's my identities
from my networks." Now, we're going to turn it over
to Raj who's going to talk about implementing
some of these perimeter controls at the Vanguard Group. How does Vanguard actually implement these network security controls
in practice? One thing that led us
to using these controls was a work-from-home scenario. An interesting thing happened. We saw that work-from-home users, who are connected over a VPN
to our corporate data center, were accessing the AWS account, let's say, making an Athena query
through our proxy server. This is the intended pattern. Our proxy server is
part of our security stack that we use to monitor
access for individuals. However, at times,
the VPN may drop unexpectedly, and now the user
who's working from home is accessing
the Amazon Athena service, let's say directly
over the internet from home. And what happens is it raises an alert to the Security Operations Center, because it looks like a credential that was used from one IP address suddenly moved over to another IP address in a geographically different location. And this can trigger an alarm. So, for those reasons,
So, for those reasons, we looked at using
these network-based controls. So, how do we ensure that
this path is continuously followed? We limit the STS tokens
that are generated by AWS to only the Vanguard network. And we do this by creating
an IAM permissions boundary policy. And this boundary policy
really is showing that it's going to allow
any action and resource so long as it's coming
from this IP address. Basically, our security stack. One thing to note: boundary policies
do not grant permissions. They only set up the boundary. So, the corresponding IAM
permissions that the user has still are in effect. But what happens
when this user needs to use Athena, to, let's say, access an S3 bucket? Since the request is coming
directly from Athena, it would cause
that IP restriction to fail. So, this is where we also use
the aws:ViaAWSService condition key. And this actually allows
the credentials to safely pass through the Athena service
into the S3 buckets. So, the two statements
that are created here: the first one says that the request must come from our IP addresses, as long as it's not coming through a service. And then there's another allow statement that says if it is coming through a service, it's allowed to proceed. This really creates the trust
between Athena and the bucket.
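A sketch of a permissions boundary along those lines; the CIDR range below is a documentation placeholder standing in for our real egress addresses:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyFromSecurityStack",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "IpAddress": { "aws:SourceIp": "203.0.113.0/24" },
        "Bool": { "aws:ViaAWSService": "false" }
      }
    },
    {
      "Sid": "AllowOnwardServiceCalls",
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "Bool": { "aws:ViaAWSService": "true" }
      }
    }
  ]
}
```

And we also had to do a similar thing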
if the user or role was within a VPC. So, here the principals are running within the VPC and going through a VPC endpoint. So, instead of an IP restriction, we use the aws:SourceVpc condition key. And this allows the traffic to go
through the VPC, through Athena, and then into the S3 bucket.
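Only the first statement's condition changes in that variant; a sketch of just that statement, with a placeholder VPC ID:

```json
{
  "Sid": "AllowOnlyFromApprovedVpc",
  "Effect": "Allow",
  "Action": "*",
  "Resource": "*",
  "Condition": {
    "StringEquals": { "aws:SourceVpc": "vpc-0example12345" },
    "Bool": { "aws:ViaAWSService": "false" }
  }
}
```

So, really, in summary,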
what we're doing here is we're ensuring
that the IAM principal must come through our proxy servers
and security stack or through an approved VPC endpoint. It's going to get denied access
if it's coming through, let's say, a home network
or an unapproved VPC endpoint. And this really ensures
that the tokens remain within the Vanguard environment. So, how do we actually get
this to scale? We talked about the IAM policy side. What about the resource policy side? So, when you're looking
at hundreds of principals and thousands of buckets
within an account, you wind up with this mesh type
of authorization scheme. And it's hard to really know
who has access to what. And then also,
if you need to scale this, and an auditor wants to go and
take a look at some of the buckets, or we have a scanner robot
that needs to go and validate that bucket policies
are set up correctly, now, all of a sudden, these principals need
access to the buckets. And when they do,
the resource policy will grow. If you have thousands of buckets
deployed by hundreds of DevOps teams, this becomes a significant amount
of churn within the bucket policies to keep maintaining
these resource policies. So, our solution was to move
certain IAM principals into a path structure. And, for example, if we had a developer resource, a developer principal, we would then move that principal into a path, let's say the /universal/ path. What that allowed us to do
within the resource policies was really simplify the code. So, now in the resource policy, you originally would have had
to list out every single role that would need
access to this bucket. And then when a new principal
needed to be added, every one of these resource policies
would have to be updated. Using the path structure,
we can actually wildcard the roles and then create a list of known roles and add or remove
roles into this path. They can then get
access to this bucket.
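One subtlety: a bucket policy's Principal element doesn't accept wildcards, so the usual way to express this is a wildcard principal constrained by an aws:PrincipalArn condition. A sketch, with a placeholder account ID, bucket name, and path:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRolesUnderUniversalPath",
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-data-bucket",
        "arn:aws:s3:::example-data-bucket/*"
      ],
      "Condition": {
        "ArnLike": { "aws:PrincipalArn": "arn:aws:iam::111122223333:role/universal/*" }
      }
    }
  ]
}
```

Now, these roles, as you can imagine,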
are going to be extremely sensitive and need to be watched
over very carefully. So, on top of this,
we added protection with a service control policy
at the org level. And what this is doing
is denying anyone the ability to modify these roles unless it's the IAM engineer
administrator role.
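A sketch of that protection; the role names, path, and action list here are illustrative placeholders rather than the actual policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ProtectPathedRoles",
      "Effect": "Deny",
      "Action": [
        "iam:AttachRolePolicy",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:DetachRolePolicy",
        "iam:PutRolePolicy",
        "iam:UpdateAssumeRolePolicy"
      ],
      "Resource": "arn:aws:iam::*:role/universal/*",
      "Condition": {
        "ArnNotLike": { "aws:PrincipalArn": "arn:aws:iam::*:role/iam-engineer-admin" }
      }
    }
  ]
}
```

All right, so now that we're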
on the topic of assigning IAM permissions to data, in a data lake, there are a number
of different access patterns. There were
a bunch of things in that picture. And there are really three fundamental patterns of access; in order to write effective policies, you need to know about them and be able to apply each in its own appropriate use case. The first is kind of what
we've been talking about. An IAM identity,
like a role, is accessing data. There are some permissions policies
that need to authorize it. And this is what I would call
direct access, an identity directly accessing
the data that it needs. Now, there's a variant of this. Often in analytics use cases, you might be going through Amazon Athena in order to do these nice,
scalable queries of your data. Well, in that case,
if you think about this role, it is going to be making an API call not to S3 directly,
but to an Athena API, StartQueryExecution. Now, Athena, the way
it works is it makes an onward request on your behalf using your identities,
using your permissions policies to the underlying raw data in S3. The nice part, of course, is that it means your identity needs access to the raw data
that it's about to query. In this case, the one thing
you need to know about is Athena doesn't run in your network. It's an AWS service. So, if you're using
those network perimeter controls, like that source VPC control we were looking at before, that's where aws:ViaAWSService
comes in so handy because it lets you account
for these onward-call use cases simply and at scale. There's another pattern here,
which is an AWS service that has persistent access
to your data. Athena had temporary access
to your data to do your query. It was doing it
under your own identity. A service like CloudTrail
or many of our other services, particularly ones
that leave logs in S3, well, they're going to be
writing data to your S3 buckets on a continuous basis. The way those work,
the pattern to understand there, is the service is making
the requests under its own identity. That's called a service principal. So, cloudtrail.amazonaws.com that you see there is
the allowed principal. That's who's putting
data in your bucket. You'll see I'm also following
the best practice in this policy, where I'm specific
about what path it's going to write to, with my account number in there, so that I know that it's writing data specifically for my account.
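The standard shape of that policy looks something like this sketch; the bucket name and account ID are placeholders, and the ACL condition is the usual companion requirement for CloudTrail delivery:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSCloudTrailAclCheck",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::example-trail-bucket"
    },
    {
      "Sid": "AWSCloudTrailWrite",
      "Effect": "Allow",
      "Principal": { "Service": "cloudtrail.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::example-trail-bucket/AWSLogs/111122223333/*",
      "Condition": {
        "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" }
      }
    }
  ]
}
```

That's the second pattern. Final pattern. You're going to see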
this a lot in analytics use cases, because often,
you're accessing your data through some other kind
of application that's offering value-added processing,
such as machine learning training. And often these applications, including your own applications
and AWS services, I have EMR and SageMaker
notebooks as an example here, they are running
under their own compute environment. For example,
EMR runs on EC2 instances. The EC2 instances, in turn, in order to access
the data, are using IAM roles that are associated
with the compute environment. And you'll often see this pattern where it's actually the identity
of the compute environment itself accessing the data. Now, the reason why that's
important to think about is because, well, at the end of the day, you have
your human users connecting to these various environments
in different ways. You have a data scientist
connecting to a Jupyter notebook. And you'll want to manage the access
of the people to these environments on the basis of the fact that each of these environments itself has access to the set of data that it needs. You have these two identities
in the mix here, so that's the third pattern
to be aware of. We're going to hear from Raj about how the Vanguard Group
provisions their EMR clusters to take into account
these security and access patterns. So, how do we, in practice, manage
the AWS service environment when it comes to the EMR service? So, what we do is
we have a development team that creates basically
these EMR service catalog items. So, they'll create the catalog item using CloudFormation,
parameter files, and various tags and add them to the code repository, where an approver can take
a look at these and ensure that the roles
and permissions are set correctly. Once they pass the checks, the build and deploy agents
will go ahead and deploy the portfolio and the product
into the service catalog. It will contain the role
that would be used by, let's say, an EMR cluster, and all the CloudFormation
would go along with it. When a business user wants
to go ahead and use one of these EMR clusters, they'll have access
to the AWS management console, at which point all they need access to is
the service catalog item. The service catalog item contains
all the information they need to construct an EMR cluster. That product, launched
with the service catalog role and through CloudFormation,
will deploy the EMR cluster with the correct roles
that have the correct access. Now when data scientists,
data engineers, and data analysts want to gain access to a cluster,
let's say an EMR cluster, they first must authenticate
with a local directory server. This will provide
their authentication. On top of that, within the clusters
themselves, we use role groups. So, the users must belong to a certain role group in order to actually be authorized to the cluster. Once they're authorized, then the cluster has
the correct permissions needed to access the S3 data, as well as any of the KMS keys
used for decryption. Okay, the final couple
of minutes here. I'm going to talk
about going even finer grain than those permissions policies
that we've been looking at before. Now, of course, with IAM, you get a lot of control. In fact, a lot of fine-grain control. You're probably familiar with this pattern where you have different IAM Roles, capital R, fulfilling different organizational roles, lowercase r. For example, project-based roles. Here, I have the yellow, green, and blue projects. I've got people assigned to them
maybe through my identity provider. And in fact, I probably have
my data structured in my S3 bucket along these different prefixes. They look like folders,
but they're not. They're just strings,
which makes wildcards possible, such as this one. This is the permission policy
for the blue project role. And they have permission
to the data under the blue project.
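A sketch of such a project-scoped policy; the bucket name and prefix are placeholders for the slide's example:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BlueProjectObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::example-datalake-bucket/blue-project/*"
    },
    {
      "Sid": "ListBlueProjectPrefix",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::example-datalake-bucket",
      "Condition": {
        "StringLike": { "s3:prefix": "blue-project/*" }
      }
    }
  ]
}
```

Now, of course,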
if you start to scale that up, particularly if these roles are
all in a bunch of different accounts, you start to get some fairly long
and monolithic bucket policies. So, to make
the management of that easier, once you start to see
a pattern of data sharing where you have
these discrete use cases that you'd actually like to factor out and manage each separately, S3 Access Points, which we announced at re:Invent 2019, offer this factored
permissions use case. If you focus on this role, I might have hundreds of different
access patterns to my bucket. I don't want to encode them in a very long,
monolithic bucket policy. So, what I do is
I create an access point for each of these use cases. It's tailored to it. Now, for this role to interact
with the access point, it is the same
as interacting with a bucket. You'll notice that this looks
very much like a normal API call to S3. It has the same data plane. You can think of the access point
as an alternative endpoint to an S3 bucket. Now, in the bucket policy, rather than managing
each of these use cases, I don't actually have to talk
about this use case specifically. I might have a policy like this. I'm going to allow
entities in my organization as long as they're coming
through an access point that I wrote
for one of these projects.
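Here's a sketch of that kind of delegating bucket policy, using the s3:DataAccessPointAccount and aws:PrincipalOrgID condition keys; the bucket name, account ID, and organization ID are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateAccessToMyAccessPoints",
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "*",
      "Resource": [
        "arn:aws:s3:::example-datalake-bucket",
        "arn:aws:s3:::example-datalake-bucket/*"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalOrgID": "o-exampleorgid",
          "s3:DataAccessPointAccount": "111122223333"
        }
      }
    }
  ]
}
```

Now, the access point itself, that's where you write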
the policy for the use cases. And it's well factored.
It's for a specific use case. If you have
these data-sharing scenarios, we've had a number of customers
have a really good experience using access points to factor them out. Now, this is
a whole topic unto itself. I'm going to go
through it really quickly. But often, your data in S3
is actually structured as databases. And your more natural mode
of assigning permissions is on the database level,
databases, tables, columns. And you want to use that mechanism. But in fact, those columns are actually below the object level, and S3 and IAM only go down to the object level in terms of access permissions, because they're about API requests being made; what you actually want here is
content-based filtering. Well, if you're using
these analytic services, they're integrated
with Lake Formation. What you do is you configure
access on the basis of databases, columns, and tables
in Lake Formation. These services integrate with it, and Lake Formation actually does
the content filtering for you.
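For a feel of the granularity, a Lake Formation grant is expressed against catalog objects rather than buckets; a sketch of the JSON input you might pass to the Lake Formation GrantPermissions API, with placeholder role, database, table, and column names:

```json
{
  "Principal": {
    "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/example-analyst-role"
  },
  "Resource": {
    "TableWithColumns": {
      "DatabaseName": "example_sales_db",
      "Name": "orders",
      "ColumnNames": ["order_id", "order_date", "region"]
    }
  },
  "Permissions": ["SELECT"]
}
```

So, useful to explore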
for those database use cases where you have
that database-structured data. That brings us to the end
of this whirlwind tour of securing
your data lake data in S3. There's a lot around this picture. But if you have
a lot of your data in S3, and you get the security practices
right on your data in S3, you've taken a really large step towards good security
of your data lake. Enjoy the rest of re:Invent,
and thank you so much.