AWS re:Invent 2019: [REPEAT 1] All in on AWS Fargate for high-security workloads (CON322-R1)

Captions
Welcome, and thanks for coming to CON322, "All in on AWS Fargate for high-security workloads." I'm Rafael Segura, a Solutions Architect with AWS supporting Vanguard. With me today are Akshay Ram, the product manager for Fargate, and Yoni, Chief Enterprise Architect at Vanguard. Today we're going to talk about thinking big, taking action, and going all in, and let me tell you a story that exemplifies that. Around this time last year, at re:Invent 2018, Yoni reached out to me to discuss the container platform and container deployments at Vanguard. They were using a third-party container orchestration platform, but they were looking for a new platform that could provide more cloud-native capabilities — things like elasticity, lower management overhead, and more cost efficiency. So we started discussing how they could leverage ECS and Fargate and all the benefits that come with them. Of course, there were also challenges. Vanguard has very high security requirements, as you would expect from one of the world's largest investment companies, so we had to go through the process of getting ECS and Fargate approved and validated for security to run production workloads. But as I said, this talk is about going all in, so as you can imagine — fast forward one year — success. We're here to tell you this story: how Vanguard was able to use the new features Fargate launched with ECS, and all the hard work the team at Vanguard put in to make sure the platform is compliant with their requirements. We really hope it helps you accelerate the adoption of Fargate in your own company. So let's get started. To cover that, we're going to talk about how Fargate adoption usually goes, and then the new features we've launched for Fargate — a lot of things are launching here at re:Invent.
We're really excited to share these new features with you, and Yoni was kind enough to share the story of how Vanguard deploys and customizes Fargate. So let's get started. To understand Fargate adoption — or container adoption in general — it's useful to see how different teams inside an organization look at this new paradigm of adopting containers. First of all, the developers: they're always excited. They love the elasticity Fargate provides, and they know that with containers you can get your code from development to production much faster, so it's all good there. The architects are also excited about containers: now they have an option to match the microservices architectures they design to the container deployment model, so those two things match really well. The infrastructure team, though, has some questions. They start thinking about deploying containers and realize they now have to manage those clusters. They understand that deploying one container on one laptop is an easy task, but deploying and managing thousands of containers in production requires better planning. The same thing happens with the CFO or the budget team: they know how to use things like Reserved Instances and Spot Instances on EC2, but how does that translate to containers? And of course they want to run those platforms on AWS cost-effectively. Same with the security team: they know how to secure instances at runtime — for example, they use agents for monitoring — but how does that translate to a container environment where, ideally, I won't be running any kind of agents? That principle doesn't carry over to the container environment anymore. And how can I make sure only validated images run in my container environment? So all those questions come into play, and we need to go
through them — address and validate each one — to get a container platform approved for usage. Enter AWS Fargate. Fargate helped us address all those questions about management, about cost, and about security. As I said, some of that is based on new service features we launched; some is based on customizations that customers like Vanguard put in place. "Fargate helps" is a blanket statement, so to understand how it effectively helps, let's take a quick recap of how containers run on AWS managed platforms. In the managed container platforms we have at AWS, there's an orchestration layer, and its job is, as the name says, to orchestrate how containers get deployed throughout your AWS infrastructure. We have services like ECS, the Elastic Container Service, and EKS, the Elastic Kubernetes Service — the core services for this type of orchestration. These services remove the heavy lifting: they enable you to run orchestration without having to manage the servers behind it, without worrying about patching and upgrading. We also have services like ECR, which offers a registry where you can create private repositories, lock them down, and restrict which networks can connect to those repositories — that also helps with security requirements. Once you have the orchestration in place, the whole idea is that you can just make one API call, pass a task definition, and your tasks are running.
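As a rough sketch of what that single API call can look like, here is the shape of an ECS `run_task` request built in Python. The cluster name, task definition, subnet, and security group are invented placeholders, and the actual boto3 call is left commented out:

```python
# Sketch: launching containers with a single ECS API call.
# All identifiers below are illustrative placeholders, not values
# from the talk.
def build_run_task_request(cluster, task_def, subnets, security_groups):
    """Assemble the parameters for a single ecs.run_task call."""
    return {
        "cluster": cluster,
        "taskDefinition": task_def,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "DISABLED",  # keep the task private
            }
        },
    }

request = build_run_task_request(
    "prod-cluster", "web-app:3", ["subnet-aaa"], ["sg-bbb"]
)
# import boto3
# ecs = boto3.client("ecs")
# response = ecs.run_task(**request)  # one call, tasks running
```

One request carries everything the orchestrator needs: the task definition to run, the launch type, and where in your VPC to place the task.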
But your tasks and containers need to run somewhere — that's where the runtime layer comes into play. Together with the orchestration layer, you need a runtime layer. On that runtime layer you can have a cluster of EC2 servers that you manage, size, and provision yourself, or you can run that runtime layer with Fargate. That's where you start to see the benefits of Fargate: with Fargate we bring a serverless operational model to the container runtime. What I mean by that is you don't have to size, manage, or provision those EC2 servers anymore. It's a serverless platform, which implies you're not going to pay for idle resources — the platform scales up and down based on your requirements, you pay only for what you use, and you get embedded high availability and resiliency. That highlights the benefits of Fargate and why so many customers are choosing it. And all those serverless benefits come without compromising the things I said the developers and architects love. For example, with Fargate you still use the same containers — you don't need code changes to run whatever containers you have today. Fargate also provides security: thinking about compliance, you can architect your applications to be compliant with things like PCI or HIPAA, for example, by using the service. Another interesting point is that by opting for the serverless operational model you do not lose the integration with other AWS services. Fargate allows you to integrate with other AWS services, so you can create powerful, complex applications by combining them — think about load balancers, networking, service mesh and service discovery, logging, monitoring. All of that works really well with Fargate. To give two examples of how those integrations helped Vanguard adopt Fargate, I'm going to talk about the integration with Identity and Access Management and the integration of
networking. There's much more than that, but let's focus on those two examples — they're very critical. Talking about identity and access management: with Fargate you have total control over the IAM permissions assigned to the tasks in your cluster. But what does that really mean — what am I controlling with those permissions? First of all, you have control over the cluster permissions: you control who can run a task inside your cluster, which IAM role or IAM user can start a new task on that cluster. That's the first level of control. The second level of control, very important, is application permissions — a different type of permission. It means you can now control which services a task is able to access. Is this task able to access a specific DynamoDB table, and does it have read or write permissions? Does this task have permission to access an S3 bucket? The list of services you can control access to goes on and on. The third type of permission gives you control over the housekeeping permissions — think about all the housekeeping operations that have to happen for a task in the cluster to run with its containers deployed. If the task needs to pull an image from ECR, or push logs somewhere, you control which repositories it can access, which CloudWatch log groups it can push to, and which topics it can publish to.
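As a rough illustration of the application-level (task role) permissions just described, here is a sketch that builds an IAM policy document scoping a task to one DynamoDB table and read-only access to one S3 bucket. The table and bucket ARNs are invented for the example:

```python
import json

# Sketch of a task-role policy granting only the access described in
# the talk: read/write on one DynamoDB table, read-only on one S3
# bucket. The resource ARNs are invented placeholders.
def task_role_policy(table_arn, bucket_arn):
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # application permission: one specific table
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
                "Resource": table_arn,
            },
            {   # application permission: read-only on one bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }

policy = task_role_policy(
    "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
    "arn:aws:s3:::my-app-bucket",
)
print(json.dumps(policy, indent=2))
```

A task that only carries this role simply cannot reach any other table or bucket, which is the "application permissions" layer of control; the cluster and housekeeping layers are separate policies attached elsewhere.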
Another type of control is at the data-plane level: networking. Running tasks in a serverless environment does not mean losing control over network access, because your tasks are spun up and run inside the VPC and subnets you select — we call that awsvpc mode in our documentation. That means each task gets an IP address that belongs to your IP address space, with the same level of isolation and features the VPC provides. Think about exposing your task as a service to the internet behind a load balancer — that's supported. You can privately and securely connect to other AWS services inside your VPC — databases or any other service that lives there. You also control everything that needs to talk outside your VPC. When you talk about high-security workloads, you want those connections private, secure, and controlled by you, so you can use VPC endpoints: they give you private, secure connections to services like ECR, where you pull your container images, or KMS, where you fetch the encryption keys used for security, and many other AWS services. Another type of connectivity you might need is to a different VPC: your container can use VPC peering or Transit Gateway for that, also secure, also encrypted. Same thing going back on-prem: you can leverage Direct Connect or VPN connections so your application can access on-premises resources if required. With all of those controls together — and many others Yoni is going to talk about — you can see that Fargate lets you adopt the serverless operational model while staying very secure, with total control over the security posture of your container platform. To give more details about the new features we launched here at re:Invent, I'd like to bring on stage Akshay Ram. [Applause] Thanks, Rafael. Hi everyone, my name is Akshay Ram, and I'm the product manager for AWS Fargate. So, as Rafael was mentioning,
customers are starting to go big on Fargate and really leverage that operational model, and we can actually see this in our metrics: 40% of customers who are new to container services on AWS choose Fargate first. That's because we've been adding new features over time, making it more powerful in terms of extensibility and this concept of security by default. You can also see it in the diversity of use cases. Serverless was typically associated with event-driven applications, and we see a lot of that, but we're starting to see growth from embarrassingly parallel workloads — genomics processing jobs and parallelized data-science applications on Fargate — and we have customers like Samsung who run long-lived applications there. They just keep running and serving requests; you don't maintain or patch instances, it's completely serverless, zero maintenance. It enables this no-ops way of running applications and helps you run in a very lean operational fashion. So we asked customers: what else do you need to go bigger on Fargate — are there any gaps or features we're missing? What does it take to go all in? The answers fell into three specific categories. One is pricing. Customers were happy with Fargate pricing: in January of this year we dropped our prices by up to 50%, and it made us highly competitive, because the Fargate configurations you choose are smaller than VMs, so you also save the cost of not fully utilizing your VMs. The second is extensibility around observability tooling: the observability space clearly has a lot of selection, and customers wanted that extensibility point on Fargate. And lastly — this is not a feature per se — customers wanted
clarity on what the security posture of Fargate is. We have this concept of security by default: what do we offer out of the box in terms of hardening every container that runs on Fargate? Let's start with pricing. The AWS container services team has a public roadmap on GitHub — if you have any feature requests you can always create an issue there, and we're always listening. Customers were saying: we love the serverless benefits of Fargate, but we miss something from VMs — we miss the ability to use Spot. They use EC2 Spot Instances today and wanted them on Fargate. And for applications that run continuously — no scale-out, no scale-in, just always running — and as I said there are many customers who like to run long-lived applications, they asked: can we have a discount model similar to EC2 Reserved Instance pricing on Fargate? When we dug a little deeper, we realized we wanted to marry two worlds — VMs, with spare capacity through Spot and reserved capacity, and serverless — and figure out how to make those two worlds meet. So this year we launched three significant pricing innovations. First, as I alluded to earlier, we invested in technology like Nitro and Firecracker that makes us more efficient, and we dropped our prices by up to 50 percent in January, passing those cost savings on to customers. Second is the Compute Savings Plan, and this works really well in a serverless world, because — and I'll get into detail a little later — you don't have to commit to a specific instance type; you just commit that you're going to use compute at AWS, and you can use any sort of compute option you
want, whether that's EC2 or Fargate, and you get discounts. And third — this is hot off the press, we just launched it yesterday — is Fargate Spot: the ability to get Spot-like discounts on Fargate, but without dealing with managing instances and infrastructure. It's all completely serverless for you. That's how we really tried to figure out how to get you the best of both worlds through a serverless operational model. Let's take a look at some of these in a bit more detail. The first, as I said, is the Compute Savings Plan. When we asked customers how they'd like it, they said: make it similar to what we have — one- or three-year commitments — and you get up to 50% discounts. It's also super migration friendly, and I think this is a really important point: we understand and recognize that customers will always have some EC2 and some Fargate, always in some state of migration, and that can vary by business cycle or by your need for a specific instance type. You actually commit at the AWS level, so you get the flexibility to use EC2 or Fargate and you get discounts across the board — wherever your usage goes, your discounts follow. And the third point: you get built-in recommendations in Cost Explorer. We make it super easy to avail these discounts — you can go to Cost Explorer and see, based on your past usage, the recommended savings plan to buy for your applications. The second pricing innovation we launched this year is Fargate Spot, with up to 70% discounts. This is for applications that can be interrupted — it could be your ETL jobs, it could be embarrassingly parallel jobs — it's typically for
the applications that run on EC2 Spot today, but here we made it a little more migration friendly. We didn't want our customers to go through too much rearchitecture to use Fargate Spot: if they wanted to move their tasks from Fargate to Fargate Spot, it shouldn't be too much of a rewrite. So we use the SIGTERM signal — the standard signal customers already use today to perform cleanup operations. We give you that same signal, and you get up to two minutes to do your cleanup. And the third point — I think this is really important — is that you get application-first controls. What I mean by that is that on Fargate there's no cluster to manage; the cluster on Fargate is just a namespace. So your web application team can mix Fargate and Fargate Spot, and since there's no back-and-forth between two teams, it's the application team alone who controls how much they want to mix Fargate and Fargate Spot. They get clear decisions, because they get feedback through cost allocation: how much am I willing to put on Fargate, and how much on Fargate Spot, so I can leverage those discounts? And hey, if you want to be really conservative and run a lot on Fargate instead of Fargate Spot, you can always buy a Savings Plan and see cost savings as well. So combined, I think Spot and Savings Plans work together and really help you leverage cost savings as you use Fargate — that was us trying to address our customers' request for levers similar to what they had with virtual machines. The second big pain point customers wanted us to solve was observability. When Fargate launched in 2017, we had support only for the awslogs driver, which essentially meant you could send logs to CloudWatch. We soon learned customers had integrations with partners like Splunk, Datadog, and Sumo Logic.
They had built custom ingestion pipelines — an ELK stack, or Kinesis Data Firehose and Kinesis Data Streams to transform logs — and they wanted to migrate those applications over; this was adding friction because we didn't have that extensibility. The second ask was native support for deep visibility into metrics. This is a common ask, because customers want to understand the resource utilization of their tasks on Fargate so they can right-size and auto scale the right way. Let's jump into logs first. We just shipped FireLens for logs — actually a couple of months ago. It's essentially one interface to send logs anywhere; it fulfills the extensibility ask from customers. It's built on open-source technology, which I'll speak about on the next slide, but it gives you a native integration: say you want to send to Datadog, Splunk, Kinesis Data Firehose, MSK, or CloudWatch — it just enables you to say "send it here, send it there," and you don't have any intermediate hops or intermediate code to maintain to get logs to the destination you want. It's quite extensible. It also gives you levers to reduce costs: the open-source technology underneath has a lot of filters, which allow you to send some logs to S3 and index them later into Amazon Elasticsearch. There's an architecture talk by Ancestry along those lines — send all my logs to S3 and pull them on demand into Elasticsearch — so you have levers to tune between highly available search and cheap storage, which gives you cost-optimization options. We've also built it in a way that decouples log ingestion pipelines. What I mean by that is you can configure them separately and independently: many of our customers told us they have a log ingestion team separate from the application team, and the log ingestion team wanted to control the configuration
independent of the application, and you can do that now with the FireLens interface. Let's take a look at how it looks. You use a log driver called awsfirelens, and you can specify any destination you want — Datadog, Sumo Logic, Splunk, Kinesis Firehose, an AWS log-analytics store, or a partner tool. You can specify secret options to configure your driver — mostly your API key. This is something we built working with the Fluent Bit community: Fluent Bit is a CNCF open-source project, and there's an image you run as a sidecar with your application, which has the plugins for all the destinations we wanted to support. We worked by contributing, and asking our partners to contribute, to that community. So that's how the interface works — it's super simple, and you can try it out today. Because it's a sidecar, customers said: we care about overhead — what is the overhead of this? It's super light: even at high log volume it takes up to about 50 MB. We have a detailed blog post on this, and I can share the content later. Lastly, customers asked us for metrics. Metrics are super critical — like air, food, and water for customers to understand what their tasks are doing. So we launched Container Insights for metrics. This interface is even smoother than FireLens, because it's essentially built in: you click a button and it just works. You also get a deep dive into all your application data, because it's integrated with CloudWatch Logs Insights — the data is actually sent as structured logs, from which we then parse the metrics, so you can always run queries against the structured logs.
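The awsfirelens interface just described can be sketched as a task-definition fragment: a Fluent Bit sidecar plus an application container whose log driver points at a third-party destination. The destination plugin name and API key below are illustrative placeholders:

```python
# Sketch of a container-definitions fragment wiring an app's logs
# through FireLens to a partner destination. Option names follow the
# Fluent Bit output-plugin style; values are placeholders.
def firelens_container_defs(app_image):
    log_router = {
        "name": "log_router",
        "image": "amazon/aws-for-fluent-bit:latest",  # Fluent Bit sidecar
        "essential": True,
        "firelensConfiguration": {"type": "fluentbit"},
    }
    app = {
        "name": "app",
        "image": app_image,
        "essential": True,
        "logConfiguration": {
            "logDriver": "awsfirelens",
            "options": {
                "Name": "datadog",           # destination plugin
                "apikey": "<your-api-key>",  # placeholder secret option
            },
        },
    }
    return [log_router, app]

defs = firelens_container_defs("my-app:1.0")
```

Swapping destinations is then just a matter of changing the options on the app container — the log ingestion configuration lives apart from the application code, which is the decoupling Akshay described.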
That lets you pass really high-cardinality data right down to the container level; it's available out of the box, and it's also super competitive on price. This is how a dashboard looks — it's GA, launched sometime around the New York Summit, so check it out and give us any feedback on the roadmap if you want more added to it. Finally, the Fargate security model — and I'm pretty sure this is why most of you are here: to understand how Vanguard benefited from Fargate to reduce their operational overhead. Fargate by default gives every task a full VM isolation boundary, so you don't have to worry about container breakout — it's part of our design. Every task gets its own isolated CPU, memory, and storage, and as Rafael said, you have credentials isolation. You also get — and this is a big thing for customers — hands-off patching: you don't have to patch the host. Because it's serverless, we do everything behind the scenes for you, so you don't have to worry about AMIs, versions, or CVE patching; it's pretty seamless for the most part. And if you search for Fargate security you'll find multiple customers referencing how super simple it's been for them running on Fargate and not having to patch instances. To elaborate more on this no-ops model — or less-ops, or ops-where-you-want-to-focus model — and how Vanguard benefited from the security of Fargate, I'd like to introduce Yoni. [Applause] Thank you, Akshay and Rafael. My name is Yoni, I'm an enterprise architect at Vanguard, and I'd like to share our journey onto ECS Fargate over the last year, as Rafael illustrated before. A quick introduction to who Vanguard is: we are one of the largest financial institutions in the world, located in Malvern, PA — strategically located between Wall Street and the us-east-1 data centers.
We started operations in 1975, and we have multiple lines of business: institutional, retail, and a new one we're rolling out, advice. What type of applications are we running on ECS? We started with web-based applications only, usually fronted by a load balancer — an ALB or an NLB; for web-based it will most likely be an ALB. Recently, probably about two or three months ago, we started exploring batch-based applications on ECS, primarily for those use cases where Lambda wasn't a good fit. We chose ECS Fargate for a variety of reasons, some of which are on the screen. It's a really fully distributed architecture. It enabled us to implement a DevSecOps — or NoOps, depending on how you like to look at it — pattern, and by that I mean isolating ownership of the entire stack to just one team: they can deploy things, and by deploying those things they do not affect any other team. That was a big deal. The solution was backwards compatible with the legacy container orchestration we had until we started using ECS Fargate. It is fully automated and automatable. As Akshay said, it's secure by default — one of the most important things for us. It is cost-effective, and by that I mean more cost-effective than our existing legacy solution. And the last point is probably one of the most important ones as well: it enables a no-ops pattern, which I will elaborate on in future slides, but it reduced our operational frequency dramatically, almost to non-existent. I want to cover our web-based architecture a little bit. Our idea was to create a highly parameterized CloudFormation template — a single template that any application development team in Vanguard could use, with the right parameters for their application, to deploy their stack. These are the components deployed by that CloudFormation template: we have Amazon
Elastic Container Registry, where the images are pushed; Fargate as the compute platform — we didn't even look at EC2, even though that was available first; and AWS Certificate Manager. ACM is another thing we brought in along the route. We didn't talk much about it, but it is a service that enables less operational frequency, so I'll talk a little about it because it's very pertinent to this discussion. ACM will issue certificates for free to any AWS resource that can handle ARN-based certificates — basically a certificate that you don't have direct access to; you just have access to an ARN, and that ARN can be associated with a load balancer or other AWS primitives that support it. But more importantly, when time comes to rotate the certificate, as long as the certificate is attached to that resource and the validation record for domain ownership exists in a public zone, AWS will seamlessly rotate the certificate — there is nothing else you need to do. And speaking about Route 53 zones: the load balancers in this architecture are primarily internal-facing, so their resolution happens in a private zone for safety reasons, but the validation record that ACM issues in order to validate ownership of your domain resides in a public zone, because the service comes from the internet and needs to resolve it in a public zone to validate ownership. Everything is encrypted in motion in this implementation. As I said, on the ALB side the traffic comes in encrypted with TLS using the ACM certificate. The connection from the ALB to the ECS tasks — by the way, we have two by default as a minimum, deployed across multiple Availability Zones — is encrypted with privately signed certificates that get generated when the container starts. The ALB doesn't validate the validity of the certificate of the property to which it
connects, so that kind of makes sense. Whenever a container inside an ECS task connects to AWS services, we try as much as possible to use VPC endpoints — S3, DynamoDB — we never want to go over the internet; we use endpoints whenever possible. Our batch-based applications support two flavors, and again, I'm not advocating using ECS for batch-based applications — you should probably use Lambda if you can — but sometimes there are use cases where you can't. So we support two trigger-based implementations. One is time-based, supporting both cron and rate expressions, and it's somewhat simplistic: it launches a task, presumes that the application or container inside the task knows what it needs to do, and as soon as the task finishes its run it exits. Since it's the last foreground process inside the container, the container exits and the task gets shut down. The event-based implementation is a little more complicated — definitely more complicated than with Lambda, where you can just pass the event object into the Lambda context. There's something called an event override that you need to use for event-based triggers, and what happens is that those overrides get injected as environment variables into the container. The use case we're using it for, for the most part, is the S3 object PUT: sometimes we receive very large objects from our partners and need to process them. Lambda has a runtime limitation of only 15 minutes, so for anything that needs to run longer than 15 minutes — or where you simply don't know how long it's going to take — this is a good use case. Finally, security by default: both Akshay and Rafael mentioned it. For us as a customer, AWS has this shared responsibility model, where AWS is responsible for part of the stack and we as the customer are responsible for the remaining part; with Fargate we push that shared responsibility boundary toward AWS even further.
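The event-override mechanism described above — injecting event details as environment variables when launching the task — can be sketched like this. The container, bucket, and object names are invented placeholders, and the actual `run_task` call is left commented out:

```python
# Sketch: turning an S3 PUT event into per-run environment variables
# via ECS task overrides. All names below are placeholders.
def build_event_overrides(container_name, bucket, key):
    """Build the 'overrides' argument for ecs.run_task."""
    return {
        "containerOverrides": [
            {
                "name": container_name,
                "environment": [
                    {"name": "S3_BUCKET", "value": bucket},
                    {"name": "S3_KEY", "value": key},
                ],
            }
        ]
    }

overrides = build_event_overrides("batch-app", "partner-drop", "big-file.csv")
# ecs.run_task(cluster=..., taskDefinition=..., overrides=overrides)
# Inside the container, the app reads os.environ["S3_BUCKET"] and
# os.environ["S3_KEY"] to know which object to process.
```

Unlike Lambda, where the event arrives in the handler's arguments, here the task has to discover its input from the environment — which is the extra plumbing Yoni is referring to.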
Less responsibility for us is great. Full tenant isolation: Akshay mentioned that, using the Firecracker micro-hypervisor. And this is a big one, the ability to attach a task role to the task itself. You are probably familiar with instance profiles from EC2; this is a very similar implementation. You operate on entitlement-based security: if you need to connect to a service, you don't worry about secrets anymore, you have that entitlement through the role. No operational overhead; there is nothing for us to patch or maintain. And finally, we are in a highly regulated industry, so having SOC 2 compliance is a must for us; we wouldn't even consider a service that didn't have that certification. But you can see that Fargate has a lot more certifications than just SOC 2. It's HIPAA eligible, so if you're in health care you can consume the service, and there are others as well; some of them I don't even know what they mean. So let's talk a little bit about data protection. Obviously we have some environment-variable data that is sensitive. The way ECS is implemented, if you look at the task definition in the console, you can actually see the environment variable key and value, so for us it was extremely important to encrypt those sensitive environment variables and decrypt them only at runtime, within the task that is supposed to have access to those values. End-to-end encryption in motion is a requirement; I showed you how we do it through the load balancer, ALB or NLB, with TLS, and then into the container using the private certificates from the ALB to the task, with the ALB certificate issued and rotated by ACM, and all the data accessed through VPC endpoints whenever possible. This example here that I want to talk to you about, I'm not advocating for it; I'm just telling you what are the things that you can do. For us it was a requirement: we had, and to some extent still have, a legacy secrets management system.
Because that secrets management system was not cloud native (today there is AWS Secrets Manager, which you probably should be using, but at the time it didn't even exist), we had to inject an encrypted environment variable into the container through the pipeline, and that's the way we decided to do it. The pipeline, or the agent of the pipeline, would have access to that legacy secrets management system. The policy of the agent's role would allow it to decrypt anything with any key and then encrypt again with the key specific to the task; each task would have its own KMS key. So the agent would decrypt at deployment time, re-encrypt with the application-specific key, and inject the KMS-encrypted value into the environment variable of the task, and the role of the task would only allow it to decrypt with the key that belongs to that task. So at runtime the application would decrypt the value from the environment variable of the container, and then it would have access to the clear-text value. Passing environment variables through CloudFormation was another interesting challenge for us to solve. You could simply do one parameter for the key and one parameter for the value, but very quickly, depending on the amount of parameters you need to pass to the applications you run, you learn that you run up against the limit on parameters per stack. So we had to devise a clever way, and I don't know how clever that way is; I had some feedback yesterday that a pipe is not necessarily a great separator, but it works for us, and if it doesn't work for you, maybe there is another implementation that can be done. What we said is: we're going to pass a string of key and value separated by a pipe, and then use CloudFormation native primitives, the Fn::Select and Fn::Split functions, to extract the key and the value and pass them right into the ECS context within CloudFormation.
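The pipe trick described above renders to an intrinsic-function structure like the following (sketched here as the raw CloudFormation JSON that a tool like troposphere would emit; the parameter name is illustrative, and in the real stack described in the talk the Value half additionally goes through the re-encryption custom resource first):

```python
# Emulate what the template does: one parameter holding "KEY|value"
# becomes an ECS environment-variable entry via Fn::Split + Fn::Select.

def env_entry_from_piped_param(param_name):
    split = {"Fn::Split": ["|", {"Ref": param_name}]}
    return {
        "Name": {"Fn::Select": [0, split]},   # text before the pipe
        "Value": {"Fn::Select": [1, split]},  # text after the pipe
    }
```

One parameter per variable instead of two keeps you further away from the per-stack parameter limit, at the cost of the pipe being a reserved character in your values.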
From the diagram before, another thing: we have a lot of custom resources that are doing different things. For the encrypt/decrypt flow that I showed you before, we had to write a custom resource; obviously CloudFormation doesn't support everything out of the box. This CloudFormation, or rather troposphere, section of code illustrates how a custom resource, using the Fn::Select and Fn::Split built-in functions in the CloudFormation template, passes an encrypted text from a parameter into the context of the custom resource, which in the next step re-encrypts it, and now that value is going to be available to ECS as an environment variable. And here is a block of code for the ECS environment variable within CloudFormation, where the key is the first element of the parameter, and the value, re-encrypted with the key of the task, is the output of the custom resource that is being passed to ECS. ACM is another story which required a custom resource. When you request a certificate from ACM, it actually doesn't issue it right away; it needs to validate the ownership, and that can take up to 30 minutes, though usually it's between 2 and 5. We couldn't continue with the run of the CloudFormation stack until we knew that the certificate state was "issued" and we could attach it to the load balancer. Another thing: we had to pass additional names for the certificate, so it's actually issuing the same certificate for multiple names by which we can access that load balancer. This section of the code shows how we did it with a custom resource called ACM domain validator; once the certificate is issued, it's passed to the load balancer listener's Certificates property, which takes the output of the previous custom resource. Task roles, as I have mentioned before, are very similar to instance profiles, and even simpler to implement. They allow us to create policies that are extremely specific to the application use case; the policy is very narrow from the access-control standpoint.
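In that spirit, a task-role policy can end up as narrow as something like this. This is a generic sketch, not Vanguard's actual policy; the bucket ARN and the choice of S3 as the entitled service are placeholders.

```python
# Entitlement-based security: the task role allows exactly the calls the
# application needs, here read-only access to one (placeholder) bucket.

def narrow_task_policy(bucket_arn="arn:aws:s3:::my-app-bucket"):
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [bucket_arn + "/*"],
        }],
    }
```

Because each service gets its own role, a compromised container can only ever exercise that one entitlement, which is the point of the envelope described later.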
It allows only access to what the application inside the container requires. And finally, once the role is created, to attach it to the ECS task, the task definition block simply passes the role's ARN, and that attaches the role to the task. By the way, it's done at the service level, so every task that is spun up through elastic events within the ECS service will have that role attached. Let's talk a little bit about infrastructure protection. We started using a third-party tool, and there are a couple of those tools on the market; there's Aqua Security and some others. I think a couple of weeks ago AWS announced scan-on-push, so you can actually scan images while they're being pushed into ECR. A very important thing for us was to be able to protect an application at run time. First of all, we need to make sure that the container running within the ECS task is one of the approved images, meaning we scanned it, it was validated against our security posture, and only those images are running. Then, if or when an application within a container gets compromised, we wanted to make sure that the application could not escape the boundary of its initial permissions, some sort of an envelope around that run time, and when a compromise is alerted, we wanted to get notified. And as I said before, we don't want the application to escalate its own privileges and access something it was not supposed to. So while we allow our developers to compose their Dockerfile and build any kind of Dockerfile composition they want, we always want to make sure that the starting point, the FROM image from which they're building their container, is one of the approved ones, and that the entry point has as its first argument a MicroEnforcer, which is that envelope that protects the application code. I did mention before that one of the most important things for us was reliability and no-ops.
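A naive sketch of that build-time guardrail, purely as an illustration (this is not Vanguard's tooling; the approved-image list and the "microenforcer" entrypoint name are placeholders standing in for the real enforcement product):

```python
# Naive pipeline check: the FROM image must be on an approved list and
# the ENTRYPOINT must invoke the protective enforcer binary.

def dockerfile_ok(dockerfile_text, approved_images, enforcer="microenforcer"):
    from_ok, entry_ok = False, False
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) < 2:
            continue
        if parts[0].upper() == "FROM" and parts[1] in approved_images:
            from_ok = True
        if parts[0].upper() == "ENTRYPOINT" and enforcer in line:
            entry_ok = True
    return from_ok and entry_ok
```

A real implementation would parse the Dockerfile properly and handle multi-stage builds; this just shows the two invariants the pipeline enforces.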
An ECS service will always maintain the desired number of tasks running. When you deploy an ECS service, it asks for three numbers for tasks: one is the minimum number of tasks, which for us defaults to 2; another is the maximum number of tasks, which for us defaults to 6; and then the desired number of tasks, which usually at initial time defaults to the minimum, but it will scale up and down between minimum and maximum depending on the elastic events. There is an integration, a tight relationship, between the target group of the load balancer, whether it's an ALB or an NLB, and the ECS service itself. The actual health check of the tasks, in our use case, well, it can now be done through the ECS service, but we started doing it through the load balancer, so the ECS service gets a signal from the target group health check, and whenever a task stops being healthy, or simply doesn't respond within the time allocated within the thresholds, ECS will restart that task. Auto scaling is another feature that is built into our single-stack implementation; I'll show you a little more about that. ECS will make sure that the desired number of tasks is always running, and will scale up and down based on CloudWatch metrics. So this screen here illustrates the relationship: this is the ECS service screen, and it shows you that there is a mapping between the target group within a load balancer and the ECS service itself. As I said before, ECS will restart a task which is no longer deemed healthy by the target group in the load balancer, and as the previous slide stated, the desired number of tasks will always be running. So here I actually purposely went into the service and killed a task, and within a minute, under a minute actually, ECS detects through its loops that the desired number of tasks is no longer running, and it starts a new task from the same image within the target group of the load balancer. In this particular case it's an ALB; you can see that there is a path for the health check of your application or microservice, and obviously the port. Those other numbers are dialed down to the minimum AWS allows; you cannot set them any lower. We want to be as aggressive as possible in our elastic events; we want the applications, or rather the containers, to go into service as soon as possible. This is the auto scaling tab within the ECS service (while you use the Application Auto Scaling service to define this, it shows you the relationship): every time the CPU of the service, which means all the tasks running at that time, goes above 35%, we want to bring up another task, and it will scale up all the way to the maximum defined in the service definition; then, when the CPU of the service drops below 5%, we want to remove one task, and it will continue removing until we are at the minimum and will never go below that. We use CPU utilization as the metric for scaling; there is a memory utilization metric available, and obviously you can use your own custom metrics. One more thing I wanted to note here: those settings are also set to the most aggressive possible within AWS; you cannot actually make them less than that, we tried. So what are the next steps for us, as far as our journey with ECS Fargate? It is not on this slide, but our first idea was that we would deploy a stack with an ECS cluster, which, when you run in EC2 mode, is kind of relevant because you have EC2 instances attached to the cluster; when you're in Fargate mode it's simply a namespace, a separation, and doesn't cost anything. So we are actually creating a cluster per microservice application that we deploy; within that stack there are, obviously, the ECS service, the tasks, and the application load balancer. What we learned after a couple of months of implementation is that it actually takes time to deploy this stack; the load balancer especially takes a couple of minutes to provision.
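The scale-out/scale-in behavior just described boils down to arithmetic like the following. This is a toy model of what ECS with Application Auto Scaling does for you, not real scaling code; the 35%/5% thresholds and the 2/6 bounds are the ones quoted above.

```python
# One step of the scaling loop described in the talk: add a task above
# 35% service CPU, remove one below 5%, always clamped to [minimum, maximum].

def next_desired_count(cpu_pct, desired, minimum=2, maximum=6):
    if cpu_pct > 35:
        desired += 1
    elif cpu_pct < 5:
        desired -= 1
    return max(minimum, min(maximum, desired))
```

Running this against a stream of CloudWatch CPU samples walks the desired count up toward 6 under load and back down to 2 when idle, which is the behavior visible on the auto scaling tab.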
Then everything else needs to be done; the ECS components are not as expensive from a time perspective, but it still takes time. So we decided to separate the stack into a couple of components. We have a notion of a persistent stack: the load balancer is in the persistent stack, ECR is in the persistent stack, and then what we do at elevation is actually just changing the image in the task definition, and we use the same load balancer for all versions of the same service. So when we elevate blue/green, we simply point the same path to a different task definition, which has the elevated image. The other limitation is that it seems really cool, but when you deploy all those stacks you actually consume a lot of IPs: IPs for the application load balancer for each version, IPs for the ECS tasks; it's all being taken out of your VPC subnets' scope, and this new implementation actually reduces that consumption. We also created subnets that we call outbound-only; they're kind of NATed subnets. Not the greatest implementation, but when you run out of IPs you need to be creative. The tasks actually go into those outbound-only subnets, because they only make outbound connections; the connections into them come through the load balancers. We really wanted that optimization as well, because now we don't have multiple ALBs, and they are not cheap; we have one ALB for all versions of the same microservice, and that reduced the cost somewhat. Sizing the task to the sweet spot is extremely important. We have traditionally had some resource-hungry applications, and rewriting those applications to optimize them, so that they consume less memory and CPU, is something extremely important to us; through that we actually reduce the size of the task required to run those containers. So that's something to consider. Fine-tuning your scaling for cost: as I said before, ours is configured to the most aggressive settings possible.
Maybe we'll stay with that, but maybe you don't need it in your non-production regions; every auto scaling event costs you money through the deployment of more Fargate tasks, so if you don't need it, consider reducing it. Reducing Docker image size will increase elasticity and reduce cost: one, because you're storing a smaller artifact in your ECR, and you do pay per gigabyte of storage there; but also, the smaller the image, the quicker the service can pull it from the repository and run the application inside it, so your elastic events happen that much faster. AWS Savings Plans were just announced, the compute savings plans, and we definitely want to implement that; it will reduce our cost. For the most part all the optimizations are around cost, which is kind of your maturity curve, right: you implement something, it's running, and now you're trying to get it for less money. Experimenting with Cloud Map: you can actually send traffic to ECS tasks, which are just IPs within your VPC, through Cloud Map; it doesn't have to be an ALB, and the more you start exploring App Mesh, that becomes a lot more of a viable option. It can stand on its own; unfortunately there is no way for you to attach ACM certificates there, so that's an area for consideration. Obviously we want to implement App Mesh, and hopefully the service matures and gives a lot more functionality, hopefully tomorrow, and we definitely will experiment with it; service-to-service calls can be a lot more secure. Prepare for the age of sidecars: I think every vendor on the market today has a sidecar, and what happens when you daisy-chain those sidecars is that they create a train, because everyone waits for another, and the elastic event of your application, the logic which is pertinent for your customers, takes that much longer to go into service and actually give value to your business.
So don't forget that. While it's easy to implement, it might have some issues integrating with other AWS services, though I think AWS is constantly giving us more that we can integrate with. We have some application intelligence tools in-house, but we do want to look at AWS X-Ray, because through the role you can do a lot more there; not many more things are needed. Container Insights was rolled out a couple of months ago, and we definitely want to take advantage of that. And then finally, anomaly detection is another feature where everything it can do for you, you don't need to do anything for, so it's probably a good consideration. With that being said, I want to thank you for coming today and listening to us. If you don't mind, please complete your session survey, and if you have any questions, please stop by and we'll be happy to answer them. Thank you so much. [Applause]
Info
Channel: AWS Events
Views: 2,701
Rating: 5 out of 5
Keywords: re:Invent 2019, Amazon, AWS re:Invent, CON322-R1, Containers, The Vanguard Group, AWS Fargate
Id: gNDi6l2tIws
Channel Id: undefined
Length: 54min 6sec (3246 seconds)
Published: Fri Dec 06 2019