AWS re:Invent 2017: Batch Processing with Containers on AWS (CON304)

Captions
Thanks everyone, I appreciate you attending this session. I know it's late in the day and late in the week, and I'm really happy to see so many people in the room. My name is Jamie Kinney, I'm the principal product manager for AWS Batch and high-performance computing at AWS, and I'm happy to introduce a few of my friends and colleagues. We have David joining us from GoPro, Tom joining us from AWS, one of our solutions architects, and Lee from the highly automated driving division of HERE Technologies. I hope we've got a great talk for you. I want to spend a bit of time today on a few things. First, a quick introduction and recap of what batch processing is and what we mean by that. Then Tom is going to give us an introduction to Amazon ECS, the EC2 Container Service, and talk about why it's relevant for batch processing, before we get into David's presentation on GoPro and how they use Amazon ECS. Then we'll switch gears a bit and I'll focus on AWS Batch: introduce it to those of you who might not have had a chance to kick the tires, talk about some of our big releases at re:Invent and leading up to this conference, and give you a glimpse into our relatively near-term roadmap. Then Lee will talk about how HERE has been using AWS Batch for autonomous driving and a number of other workloads, and we'll save some time for Q&A, I promise.

So first, the quick introduction to batch computing and batch processing. Batch is a really interesting paradigm that's been around for literally decades. Batch processing is kind of like a TiVo for your work: it lets you time-shift when a given job runs and where a given job runs. In exchange for that flexibility you can do things like take advantage of compute resources acquired under different provisioning models, or run workloads at times of day when Spot resources might be available at a lower cost or in higher capacity. But in order to take advantage of that flexibility and the many benefits that come from batch processing, you need to think about the sequencing of your jobs, and you need orchestration systems that keep track of available resources or provision the right types of resources in a just-in-time manner. Historically this has presented some challenges, because batch workloads tend to be very high-scale, very bursty workloads. You'll frequently see hundreds of thousands, if not millions, of jobs coming in over a relatively short period of time, meaning you need to be able to scale up your compute resources, or at the very least keep track of the jobs that aren't yet able to run because you don't have that capacity on hand, which is less of an issue in the cloud of course. The other thing is that you need to think about concurrency of workloads and how you distribute your available processing across all of the available compute resources. And often, especially with batch workloads, because these run asynchronously and automatically without human intervention, you need to make sure that you've got error handling: things like the ability to retry a job in case of an application failure, or in case part of your infrastructure becomes unavailable, a Spot Instance termination being one example.
So batch workloads require reliability tools to make sure that you can automatically retry jobs and that you're able to distribute work across a range of Availability Zones in the case of AWS. Batch workloads also benefit from simplicity. Batch is a pervasive model of computing: it's used by data scientists, by life scientists, by media and entertainment, by autonomous vehicle systems; it's used to transcode and to do computer vision and image recognition on video streams, as you'll hear from GoPro. Because there are so many workloads, you want to make it very easy to develop these tools and deploy them in your batch framework, hence the importance of things like container technology. It's also really important to keep the cost of batch processing as low as possible, and this is actually one of the reasons AWS transitioned to per-second billing from the hourly billing we had previously. With per-second billing you can easily spin up a compute resource that's precisely tailored to the needs of each of your jobs, so the job runs as performantly as possible while that resource is running, but you stop paying for the resource the moment it's no longer needed. If you have jobs that are CPU-intensive, maybe you're launching C5 instances; if you have jobs that need GPUs, maybe you're running those on P3 instances, or FPGAs with the F1 instances. To orchestrate the selection of these types of resources you need a higher level of infrastructure, and that's what we'll be talking about in a bit. The cloud makes sense for this because we have a massive amount of capacity and a wide range of instance types, and with services such as ECS, EKS, AWS Batch, and Fargate you now have the ability to use these resources even more efficiently. Because we're using Docker, you can deploy your applications just as easily on AWS as on your laptop or in your own environment and your own data centers, under a pay-as-you-go model of course.

Containers make sense for batch processing workloads for a number of these reasons. It's a polyglot mechanism: within a container you can run a Java application, an R or Python application, Fortran code, C or C++, really the sky's the limit, and we have customers using containers to run this tremendous diversity of applications. Because you're packaging each application within a container, you can easily constrain the amount of resources you give a specific job while it's running, and this lets you do things like bin-pack multiple jobs onto an EC2 instance, giving you even more efficiency in the way you run your workloads. And finally, it helps make sure you're not locked into any one platform, because again, you can develop on your laptop and deploy on Amazon just as easily as you can deploy anywhere else. So with that brief introduction, I want to make sure we honor a really important tenet of Amazon, which is that we want to give you as much choice as possible. Today we're going to be talking about just two of the many options that exist for running containerized batch processing workloads on AWS: we'll be focusing on EC2 Container Service and AWS Batch. So to focus on ECS, I'm going to hand us off to Tom at this point. Thank you.
Thanks, Jamie. Before we jump into the specific use cases, I want to make sure everybody has a good baseline understanding of the Elastic Container Service. ECS is a service we provide as a way to bring your Docker containers and schedule those resources as tasks. You're able to build out the same kind of infrastructure, cluster, and scale the way you always have with Auto Scaling, but then there's this additional layer where you're able to take containerized code, schedule it, and place it. There are even very flexible options for placing different types of processing containers on, say, GPU-based instances or heavy-RAM instance types, because as you define containers and tasks you're specifying what sort of resources are necessary. Tasks are very fast and super agile, they can be spun up and spun down, and then you start to integrate with all the other native AWS building-block services: pulling in a load balancer, for example, if you're going to run a long-running service, or integrating with CloudWatch and our event-streaming systems, which lets you trigger all kinds of different execution patterns.

Batch is an example of a workload that works very well for containers, which is something Jamie just talked quite a bit about. This is a case where you see something like S3 become the storage mechanism for input and output when you're processing and executing a batch job. These are reference architectures we've seen emerge: using the container service, you schedule out the jobs with the pure algorithmic code you want to run against data that's stored separately. In another example you can use triggers, and this is where these integrations really become advantageous: I can trigger a Lambda function that starts up a batch process, that batch process being a container running as tasks scheduled on top of that infrastructure. The infrastructure could scale, and it could even be a scheduled-scaling scenario, because these batch jobs are often on a quarterly boundary or done weekly; they're a little more predictable than some of the real hyper-scale, spiky workloads, so in some cases you can use the scheduled scaling APIs to do some really interesting optimization. Another example, of course, is queues as an intermediary: I can throw work into a queue, abstract away the data persistence, and let the tasks get scheduled out and go retrieve the data to process, using the queue as a mechanism to drive more scale or to pass all kinds of parameterization through in the messages themselves. The concept of a queue in between your processing is super powerful, and you may even see a little something like that when Jamie comes back and talks more about the Batch service itself. Long-running batch jobs are where you can really take advantage of things like Spot: if you have idempotent work items where you can retry things, get to a certain point, stop, snapshot your output, and then restart at any point in time, that lends itself very well to a Spot scenario. Spot can often give you a 50 to 90 percent reduction in cost, so you can go one of two directions: scale quite a bit higher so that you get done faster, or pay less.
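As a rough, generic sketch of that idempotent, snapshot-and-restart worker pattern (this is not code from any of the presenters; the queue URL and bucket are hypothetical placeholders), a Spot-friendly container worker might look like this:

```python
import json
import signal
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-items"  # hypothetical
CHECKPOINT_BUCKET = "example-batch-checkpoints"                            # hypothetical

stop_requested = False

def request_stop(signum, frame):
    # Spot interruption or instance draining delivers SIGTERM to the container;
    # finish the current item, snapshot progress, and exit cleanly.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, request_stop)

def process(item: dict) -> dict:
    """Placeholder for the actual (idempotent) unit of work."""
    return {"item_id": item["id"], "status": "done"}

while not stop_requested:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        item = json.loads(msg["Body"])
        result = process(item)
        # Snapshot the output so a retried job can skip completed items.
        s3.put_object(
            Bucket=CHECKPOINT_BUCKET,
            Key=f"results/{item['id']}.json",
            Body=json.dumps(result).encode(),
        )
        # Delete the message only after the snapshot: an interrupted job leaves
        # the message to reappear on the queue for another worker to retry.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```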
There are all these kinds of options at your fingertips when you can build these batch jobs, schedule them out, and make sure they can all be restarted gracefully; that's where Spot is super valuable. So just to summarize the container service: bring your Docker containers, schedule them, and you get all this flexibility. But keep in mind a couple of key tenets. Stay away from incorporating too much local state; you want that stored somewhere else, because then you get much more flexibility. It's stateful only in the sense that the processing logic is something you can add and remove at will. Make sure to minimize dependencies in your task definitions: as you start to get these things designed, you can paint yourself into a corner as to where different workloads can actually execute if they have a ton of additional dependencies, so be smart about keeping things small, compact, and easy to run. And monitor things with the APIs: the event stream, CloudWatch, all those things are there to let you make decisions based on the state of the cluster or the state of the tasks themselves, and now that you've got these multiple layers of auto scaling capabilities you can do a lot more. So that's my piece; I want to hand it off to David now to talk about a very specific example of using ECS for batch.

Hey everybody, my name is David, I'm with GoPro. We're the team at GoPro that's been developing the AWS cloud platform behind some of our key products for the last several years, first using EC2 and Auto Scaling groups, and now more recently using ECS for both our web and our API layers. Today I'm going to share the team's experience migrating from those EC2-based applications. I'll tell you a bit about the GoPro Plus service and the cloud platform behind it: what it looks like, what kinds of services we run there, and some of the lessons we've learned along the way. For those of you considering migrating to a container-based paradigm, I hope this is useful and gives you some pointers on how to make that transition as smooth as possible. But first, if you'll indulge me: "GoPro, turn on." I just want to do an extreme selfie since I'm here. "GoPro, take a photo." Awesome. That's not just to indulge myself, but also to give some context. When one of our customers takes photos and videos, they can be uploaded directly to the cloud via GoPro Plus. We launched GoPro Plus in 2016; it's a subscription service that makes it easy to automatically upload, edit, store, and share your videos and photos from your GoPro devices, and it also provides a number of other benefits like discounted GoPro accessories. When a user signs up for the service, logs on, and uses our Quik mobile app or desktop app, they're interacting with our applications on the cloud platform, which are all running in the EC2 Container Service. Today we're running one hundred percent of our production services for GoPro Plus on ECS. The platform consists of a number of different components, like photo and video processing, user and subscription management, keeping track of devices like cameras and drones and mobile apps, and serving up the user interfaces on the web and mobile platforms, not to mention all the infrastructure support services we need to keep those systems healthy.
That's something my team, the DevOps team, is very interested in. I forgot my graphics; there they are. Today we have 60 services in ECS in production. We've split them into several different clusters to handle apps that need a large amount of memory and apps that need a large amount of compute, for example video transcoding and different types of media processing, and we have around 500 EC2 instances today, although that number is shrinking as we move over to ECS, which was part of the main reason we decided to migrate. Our platform serves around 100 million requests a day; those are API requests. For us, container orchestration was a long time coming, and I'd like to give you some sense of the pain points we were dealing with before our move to containers; some of you may be very familiar with these. For our apps and microservices we used to run the classic model of deploying AMIs: we would deploy our code to an AMI, create a new Auto Scaling group off that AMI, and then do a blue/green deployment with that new Auto Scaling group. This works, but it's very slow, it has multiple steps, and there are many places to fail. It led to some very long deployments and some very unhappy developers and QA engineers. As for the workers doing the asynchronous job-queue work, we had challenges there too, and it comes down to visibility, control, and cost. Visibility, because orchestration was kind of a black box: we used a third-party managed message-queue SaaS, and in that model all the orchestration logic was hidden from us and ran in somebody else's infrastructure. If a container fails to start, for example, that's hard to see. We had limitations in metrics and monitoring, and we actually ended up writing middleware to scrape that third-party API just to import the monitoring metrics into CloudWatch. Having fewer sources of data to monitor is always a good thing, and that's one of the things we love about ECS: all the stats are right there in CloudWatch, which makes it very easy to alert. The final point: we couldn't always rely on the efficiency of the scheduler, so again, operational overhead; it wasn't always very efficient at running these workloads and we didn't have much control over that. So we decided to move to an orchestration solution. What about Kubernetes? Well, we learned yesterday that we'll be able to run managed Kubernetes on AWS, which is pretty cool, but I'll share some of the reasons that ECS was the best choice for us a few months ago when we started this project. Both solutions, ECS and Kubernetes, met most of the requirements for us, but ECS came out ahead in a few key areas. Number one: a security policy specific to every service running in the ECS cluster. This is built into ECS via IAM roles; they're very familiar, and we can be sure that every service has only the minimum access it needs to do its job. This was key. We also found a very familiar set of abstractions that look very much like Auto Scaling groups, and the same AWS CLI and APIs we all know so well work just as well with ECS; integrations with AWS services like CloudWatch and Elastic Load Balancing are there, as are the usual Python APIs like boto. The next one was really crucial for us: enterprise support. We've been working with our AWS enterprise support team for a number of years now, they're really great, and we really value their support.
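A minimal sketch of what that per-service permission model looks like in an ECS task definition; the role, image, and names are illustrative placeholders, not GoPro's actual configuration:

```python
import boto3

ecs = boto3.client("ecs")

# Register a task definition whose containers run under a service-specific
# IAM role, so the service gets only the permissions it needs (for example,
# access to one S3 bucket and one SQS queue). Role, image, and names are made up.
ecs.register_task_definition(
    family="media-worker",
    taskRoleArn="arn:aws:iam::123456789012:role/media-worker-task-role",
    containerDefinitions=[
        {
            "name": "worker",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/media-worker:1.4.2",
            "cpu": 1024,        # explicit resource limits let the scheduler bin-pack tasks
            "memory": 2048,
            "command": ["python", "worker.py"],
        }
    ],
)
```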
From the business point of view, that support was extremely important: we were migrating all of our production services to a brand new platform, so it was very important to have that level of support, which we could get from AWS. And the last bullet point: less cluster maintenance. The only cluster maintenance we've had to do with ECS is really updating the base AMI for the EC2 instances, and of course now with Fargate you don't even have to do that, which is pretty awesome as well. Our DevOps team, some of whom are here today, also automated a nice solution that goes out, retrieves the most recent ECS-optimized AMI, and automatically rolls it through our ECS clusters with zero touch. So pretty much we already made Fargate before AWS came up with it; that was a joke. OK, let's move on and look at an example of one of the services we run in ECS. Here's a high-level architecture diagram of our media service. In this example we have a client over on the left; that could be my GoPro camera or a mobile device running our app, and it's making calls through an ALB to an app layer serving the API. The client in this example might receive back from the API a signed URL for upload to S3, so it can directly upload photos or videos to our S3 buckets. After the file is uploaded, the media service might kick off a job asynchronously to transcode the video, pull a frame out of it, or create a thumbnail; there's a lot we can do with the raw image or video once we have it. This is a pretty classic decoupled, asynchronous architecture: all the long-running worker tasks are able to take the time they need, and meanwhile the client gets a really fast response from the API layer. The workers just work out of band while we keep latency very low for that client request. Another crucial point is that both the app tasks and the worker tasks scale independently: they could be running on the same ECS cluster, or on two different ECS clusters with different hardware under them. The key point is that we want to be able to scale out those workers to meet the actual demand of the moment and stay within our SLAs. You'll also notice we're using SQS queues here; this is a key part of the decoupled architecture. It means that if one of these microservices is down, that's okay, because the messages are already on the queue and we don't have to block and wait for that service to come back up; the service comes back up, looks at the queue, picks up its messages, and keeps going.
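To make that decoupled pattern concrete, here is a minimal sketch of the API side: returning a presigned S3 upload URL to the client and then queueing an asynchronous transcode job on SQS. The bucket, queue URL, and message fields are hypothetical, not GoPro's actual implementation:

```python
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

MEDIA_BUCKET = "example-media-uploads"                                          # hypothetical
JOB_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/transcode-jobs"  # hypothetical

def create_upload(media_id: str) -> str:
    """API-side step: hand the client a short-lived presigned URL so it can
    upload the file straight to S3 instead of proxying through the API layer."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": MEDIA_BUCKET, "Key": f"uploads/{media_id}.mp4"},
        ExpiresIn=900,
    )

def enqueue_transcode(media_id: str) -> None:
    """After the upload completes, drop a message on SQS; worker tasks in the
    ECS cluster pick it up out of band and the API responds immediately."""
    sqs.send_message(
        QueueUrl=JOB_QUEUE_URL,
        MessageBody=json.dumps({"media_id": media_id, "action": "transcode"}),
    )
```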
As far as our migration goes, I'll share some of our key learnings from moving into ECS, which hopefully will make it relatively painless for those of you undertaking this. Number one, our main paradigm shift was moving to infrastructure as code. We chose Terraform, and we use it for almost everything in this whole stack, from the VPCs all the way up to the task definitions and the scaling parameters, even deployments. Deployments to ECS can be done with either Terraform or CloudFormation, which are pretty much equivalent for this, because deploying to ECS is just updating a task definition with a new Docker image name. Among our learnings with infrastructure as code: our DevOps team maintains base modules that implement the stack you just saw, and our development teams just import those base modules. They just have to fill in a few key variables like the service name, task definition parameters such as CPU allocation and memory allocation, and some of the alarm thresholds that dictate when the service should scale up and when it should scale down. For our workers we make use almost exclusively of the SQS queue length as the metric to scale on. This is one of those paradigm-shifting things that DevOps is supposed to be doing: we're giving more control and more visibility to the development teams. No longer do they have to come to an operations person and ask about the scaling parameters; they can just look right in their code. We check the Terraform definitions for the infrastructure into git alongside the application code, so everybody can read it, and some people can write it too.

Next key point: release tagging. In moving to container-based workloads it's really important to keep track of your container image versions. We want to make sure that an identical tag is applied to the git commit, to the Docker image, and to the task itself running in ECS. This makes it very easy to track changes all the way through to production: operations, QA, and management can look at the console and know exactly what version of code is running. We use semantic versioning to do this, which applies a nice major.minor.point release version tag. This is also really nice if you're doing blue/green deployments, because it's easy to see in the ECS console which version of the code is deployed where. I'm going to include the slide in the deck but not go through it in the interest of time; you can look it up on SlideShare, and it details our release-tagging flow. The next one is our deployment pipeline; some of the technologies we're using there include API Gateway, CircleCI, and GitHub to get our containers deployed into our VPCs and into ECS, so check that out on SlideShare as well.

One more: auto scaling. A few hints and best practices. ECS scales the underlying EC2 cluster out when the sum total of the service resource requirements is more than what's available on the cluster. As I mentioned, we use the queue-length metric to scale most of our workers, and for almost all the web services we just scale on CPU or memory, which is pretty easy. The key difference with ECS is that you want to allow those ECS services to scale both up and down, and this is one place ECS had a bit of a gap: when long-running processes need to scale down, they can be interrupted, and if the EC2 instance underneath goes away you might lose some work. There's a pattern called container instance draining; Amazon publishes a best-practice reference architecture for it, and we implemented it. The short version is that you can use Lambda to hook into the Auto Scaling group lifecycle policies. This tells the containers running on the instance that they're about to be killed and gives them a grace period before they actually are. You just put a little hook in your code that says, when I get a certain signal, it's time to shut down gracefully, and you set a timeout for the apps to do so. It's a little hard to read, but the URL for the reference architecture is at the bottom of the slide.
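A rough sketch of that draining hook, assuming a Lambda function subscribed (via CloudWatch Events) to the Auto Scaling instance-terminate lifecycle hook; the cluster name is a placeholder and the full reference architecture also waits for running tasks to drain before completing the action:

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "media-cluster"   # hypothetical cluster name

def handler(event, context):
    """Simplified draining hook: when the termination lifecycle hook fires,
    mark the matching ECS container instance DRAINING so running tasks get a
    grace period, then complete the lifecycle action."""
    detail = event["detail"]
    ec2_instance_id = detail["EC2InstanceId"]

    # Find the ECS container instance that corresponds to this EC2 instance.
    matches = []
    arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
    if arns:
        described = ecs.describe_container_instances(
            cluster=CLUSTER, containerInstances=arns
        )
        matches = [
            ci["containerInstanceArn"]
            for ci in described["containerInstances"]
            if ci["ec2InstanceId"] == ec2_instance_id
        ]

    if matches:
        # DRAINING stops new task placement and signals running tasks to stop.
        ecs.update_container_instances_state(
            cluster=CLUSTER, containerInstances=matches, status="DRAINING"
        )

    # The full reference architecture re-invokes itself until the instance has
    # zero running tasks before making this call; abbreviated here.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=detail["LifecycleHookName"],
        AutoScalingGroupName=detail["AutoScalingGroupName"],
        LifecycleActionResult="CONTINUE",
        InstanceId=ec2_instance_id,
    )
```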
All right, a few high-level lessons learned. Scale up with a little headroom: application startup time is a factor even if you're running in containers. Yes, containers are small and lightweight, but we found that in many cases we were just containerizing existing legacy applications, and those still take time to start up. As a best practice we usually scale up twice as fast as we scale down: we'll add two or four tasks when we scale up, and scale down much more slowly, one task at a time. And I mentioned this already: scaling on custom metrics. You're not limited to scaling on CPU and memory; you can push your own custom metrics, you can use queue lengths, and so on. The other thing people talk about with containerization is immutable infrastructure and immutable images. Immutable images are good, however they don't solve all your problems: your application code will definitely still break if, for example, you pass in the wrong environment config. One example is a variable called deploy environment: you think it should be "production", the application expects something like "prod", and while the logic is correct, it's hard to test that before you actually get to production, so it will still break your containers even though it's the exact same application code that ran in your pre-production environment. It probably goes without saying, but application code will also still break if there are environmental differences in your actual VPC. So if you're going to use infrastructure as code, I recommend that it be holistic and comprehensive, covering your entire environment; otherwise it's almost as good as nothing at all. The lesson here is post-deployment testing and automated rollback. My final lesson is around identity and access management: access policies are tough. Using CloudFormation or Terraform, anyone who wants to modify the environment with those scripts probably needs wide access to all of the different things in your environment: S3 buckets, RDS instances, ECS, Batch, and so on. That's fine for your administrators; it might not be what you want for all of your developers. This is tricky and there are a lot of possible solutions; the one I recommend is to isolate your environments and put your production environment in a separate AWS account altogether. That's nice isolation, and it hopefully means no nightmare scenarios of running the wrong command against the wrong environment. All right, towards the end here, the benefits we realized included a really simplified deploy pipeline: we can deploy any kind of app we want using the same pipeline, as dockerized containers, and we saw great efficiency increases in our deploy time. We were taking 30 minutes to roll out those Auto Scaling groups in the old days, and we can deploy to ECS in 30 seconds or less, which we do all the time now. And finally, predictability and visibility: promoting Docker images from our staging environment to production is now very clear-cut, because we're using the same tagging for our images, our git code, and our ECS tasks. So overall, simplified operations. That's about all the time I have; thank you all, and I think I hand it back to Jamie.

Thanks a ton, good job. OK, for the next portion of the talk I'd like to switch gears and focus on AWS Batch. A quick show of hands: how many folks have had a chance to kick the tires on the service, or looked at it? We launched Batch last year at re:Invent, and we've actually had a number of releases over the past year.
But before I get into the capabilities of Batch, I want to talk a little bit about what it is and what the motivation was for creating the service. Historically, if you were to deploy batch computing infrastructure on premises, you would start by provisioning a large number of relatively homogeneous resources. You would pick the instance type, or the virtual or physical machine size, based on the typical needs of the jobs you wanted to run in your environment, keep that infrastructure around for a few years, and then go through a refresh cycle. In doing so you'd have to shoehorn your jobs, as their requirements evolved over that multi-year process, into the resources you had provisioned. Obviously the cloud simplifies this, but to take advantage of the many instance types, and instead have compute resources, instance types, and containers provisioned in response to the needs of your jobs, you needed to build a lot of automation; you needed to stitch together probably about a dozen different AWS services. A few folks, myself included, helped a lot of our customers go through the process of organizing all of these services and assembling these building blocks to create a system where you could eventually submit your first job to the queue. This was taking a lot of our customers a few months just to get to that first successful job submission. We wanted to simplify that and offer a fully managed service, and that was the first tenet, or design goal, for AWS Batch: give you a managed service that provides batch computing primitives, so you focus on submitting jobs to queues and let us pick the right resources and run your workloads for you. The second thing we wanted to do, and David even touched on this, is the importance of being able to specify a role for the work that's happening, so you have fine-grained permissions. You want it to be very easy for your batch processing workloads to make calls to Rekognition or DynamoDB or S3 without having to embed credentials within your applications, so tight integration with Identity and Access Management was important to us. And thirdly, we wanted to help our customers reduce the cost of using AWS for batch processing. That meant making Spot a first-class citizen and simplifying automated retries of workloads in response to Spot terminations or other things that might interrupt your work, so we built that into the service too. So what Batch does is give you a way to submit your jobs, which run within a Docker container, each with an amount of CPU and memory that you allocate to it. You submit those jobs based on a job definition, or template, similar to an ECS task definition, which tells us the container you'd like to use, the command you'd like to run within the container, environment variables and parameters and things along those lines, including the Identity and Access Management role. You submit your job to a job queue within your account. You can have multiple job queues, and these job queues can have a priority relative to each other; for example, you might have a production job queue and a development job queue, and you might have a job queue specifically for GPU jobs that's optimized in terms of what resources it has access to.
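A minimal sketch of that flow with boto3; the job definition name, image, role, and queue below are hypothetical placeholders:

```python
import boto3

batch = boto3.client("batch")

# A job definition is the template: container image, default vCPU/memory,
# the command to run, and the IAM role the job assumes. Names are illustrative.
batch.register_job_definition(
    jobDefinitionName="image-processing",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/image-processing:latest",
        "vcpus": 2,
        "memory": 4096,
        "command": ["python", "process.py", "Ref::input_key"],
        "jobRoleArn": "arn:aws:iam::123456789012:role/image-processing-job-role",
    },
)

# Submitting a job points at a queue and a job definition; parameters fill in
# the Ref:: placeholders, and retryStrategy covers Spot interruptions.
batch.submit_job(
    jobName="process-photo-0001",
    jobQueue="production",
    jobDefinition="image-processing",
    parameters={"input_key": "uploads/photo-0001.jpg"},
    retryStrategy={"attempts": 3},
)
```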
Mapped to these job queues is something called a compute environment. The compute environment is a logical set of rules where you tell us how big and how small to go: min vCPUs equals zero, max vCPUs equals 10,000. You tell us the instance types you'd like us to choose from, and you can be very prescriptive and say use p3 instances, or a specific p3 size, or be as general as "optimal", and we'll pick from any of the C, M, or R instance families. You then map these compute environments, also specifying whether you want us to use On-Demand or Spot, to your job queues. We'll provision resources, choosing from the range of instance types you've given us permission to launch on your behalf, and launch the right quantity and the right distribution of instance types based on the needs of the jobs that are running or ready to run. We launch those instances very quickly, and then, especially with per-second billing, as your work finishes, if there are no more jobs that can take advantage of an already-launched instance, we'll very quickly turn it off for you, typically within a minute or two of the job finishing, if not sooner. The AWS Batch scheduler is the part of AWS Batch responsible for sequencing your jobs: making sure we run jobs in your high-priority queue before jobs in the low-priority queue if they're contending for a common set of compute resources, and making sure that if you have jobs A, B, and C such that C depends on B and B depends on A, we run those in the right order. And if A fails, we don't try to run B and C; instead we let you go fix that, and we make sure you're notified that there's been a failure so you can correct the issue.
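A sketch of creating that mapping with boto3; the subnets, security group, roles, and names are placeholders, and "optimal" is the instance-type setting Jamie describes (the C, M, and R families):

```python
import boto3

batch = boto3.client("batch")

# A managed compute environment: Batch scales between min and max vCPUs and
# picks instance types for us. SPOT keeps cost down; all ARNs and IDs below
# are placeholders, not a real account's resources.
batch.create_compute_environment(
    computeEnvironmentName="spot-optimal",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "minvCpus": 0,
        "maxvCpus": 10000,
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],
        "bidPercentage": 100,
        "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)

# Queues map onto compute environments; the higher priority value wins when
# two queues contend for the same compute environment.
for name, priority in [("production", 10), ("development", 1)]:
    batch.create_job_queue(
        jobQueueName=name,
        state="ENABLED",
        priority=priority,
        computeEnvironmentOrder=[{"order": 1, "computeEnvironment": "spot-optimal"}],
    )
```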
Now, talking a bit about what's happened since this time last year. When we launched Batch we had one region, us-east-1; we're now in nine regions around the world, and in early 2018 you can expect us to be in the remaining regions we don't yet support today, including the commercial regions we aren't covering as well as GovCloud and some of the other non-public regions. We added support for custom machine images: by default we use the ECS-optimized machine image, but if you'd like, you can use your own machine image and we'll launch instances based on it. That's useful if you want to auto-mount Elastic File System volumes so a file system shows up as /data or /input or /output in your container instances while your job is running; it gives you the ability to easily run FPGA- or GPU-accelerated workloads; and it lets you provision storage in a way that's more optimal for the type of work you're running, maybe moving to larger or faster or different types of EBS volumes. We added support for new instance types as they've launched: C5 support launched on Tuesday, and M5 is coming very soon. And in response to per-second billing: previously, instead of scaling up instances right away, because we didn't yet know how long your jobs might run, we would wait and see whether those resources were going to be fully utilized or sit idle, and we would keep instances around for the majority of your billing hour in case additional work arrived. With the transition to per-second billing, what we can do now is immediately go to eleven: when you give us a ton of jobs, you submit 10,000 or 100,000 jobs and you've set max vCPUs in your compute environment to a reasonably high value, we'll scale up as high as we need to go to run as many of those jobs concurrently as we can, and once the work finishes we'll scale back down to min vCPUs, ideally zero, which is what most of our users specify for that parameter. We also added support for tagging of Spot Instances. Manageability and performance features were really important for our users, so we added the ability to automatically retry jobs: if you have a Spot termination, or if your application fails with a non-zero exit code, we can automatically retry your job up to the number of times you specify when you submit it. We'll put your job back at the head of the queue and run it on the next available instance, either one that already exists or one that we launch to replace a failed EC2 instance. When we initially launched AWS Batch we designed it to support jobs that lasted fifteen minutes or longer, and on day one we had a ton of users running jobs that lasted a fraction of a second. To handle resource provisioning and get the highest possible utilization of your compute resources, you take very different approaches for long-running versus short-running jobs, so we've done a lot of work over the past year to make sure we can just as easily run two-second or five-second jobs in a way that still gets you 90 percent or higher utilization of your underlying compute resources. We've added support for Batch to CloudFormation, Terraform also supports AWS Batch, and a very recent and important launch is the ability to use an event-driven architecture: as your jobs transition from one state to another, from submitted to pending to runnable, starting, running, and so on, we now emit CloudWatch Events at each of those state transitions, so you can set up filters, understand that a job based on a particular job definition just failed with a given reason, and automatically handle that in a way that's different from jobs that have succeeded. We also added support for HIPAA-compliant workloads.
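A sketch of subscribing to those job state-change events, assuming a pre-existing target such as a Lambda function (the ARN is a placeholder); this pattern matches only jobs that reach FAILED:

```python
import json
import boto3

events = boto3.client("events")

# Match AWS Batch "Job State Change" events where the job ended up FAILED.
events.put_rule(
    Name="batch-job-failures",
    EventPattern=json.dumps({
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
        "detail": {"status": ["FAILED"]},
    }),
    State="ENABLED",
)

# Route matched events to an existing target (Lambda, SNS, SQS, ...).
# A Lambda target also needs a resource policy allowing events.amazonaws.com
# to invoke it; omitted here for brevity.
events.put_targets(
    Rule="batch-job-failures",
    Targets=[{
        "Id": "notify-on-failure",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:notify-on-failure",
    }],
)
```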
An area where we've seen very recent innovation is around workflows, pipelines, and the submission of jobs that have a large number of copies. When we designed the service, we deliberately didn't want to be too opinionated about how workflows would operate. If you want to tell us about the dependencies between your jobs, you can do that: you can submit jobs A, B, and C and tell us the relationships between them. Or, if you prefer, you can use a workflow system, maybe Step Functions or Airflow or Luigi or Pegasus or a number of other commercial and open-source workflow systems, and have it submit a job to Batch, listen for the success or failure of that job, and then proceed to the next stage in your workflow. Both of those models work equally well with AWS Batch, and you can combine them too. For those of you who'd like to integrate Step Functions with AWS Batch, we have a reference architecture; the link here goes to the GitHub repo, and in the readme of that repo we link to the four-part blog post that walks you through the architecture you see here, which shows how you would use Step Functions to submit jobs to AWS Batch through an AWS Lambda function, using a template that we have in the Lambda console. Your jobs of course run images stored in ECR or Docker Hub or your own private repo, and your jobs can of course interact with services like S3.

Now, a feature we just added on Tuesday is array jobs. Array jobs are the ability to submit not just one job, but many copies of a job, with a single API call. Initially we support submitting jobs with up to 10,000 copies, and we'll be increasing that pretty quickly. You could use this, for example, for a job that runs the same command, maybe a transcoding job, against a thousand objects that reside in an S3 bucket. Each copy of the job is identical in the command, the CPU, the Docker image it uses, and the memory we allocate to it, and we give you an extra environment variable that tells you the index of that particular child within the array. We manage the degree of concurrency of execution for you, and each of those jobs performs its part of a much larger piece of work. With this we're also updating the dependency model between jobs, so you can express, in the case of job A, job B with many copies, and job C, that job B depends on job A, and that job C should only start once all elements of job B have completed successfully. We've added support to the console, and you'll see that we have different dependency models, like N-to-N dependencies, which we'll get to in just a moment. The syntax is fairly straightforward; the big change is around arrayProperties and telling us the size of the array job. In this case we're submitting a 10,000-wide array job, and every child job will have the same job ID with an additional field, :0 through :9999. This is how you would submit jobs with dependencies between array and non-array jobs: we have an array job A, and once all of its elements have completed, by expressing a dependency on the parent job ID, you can then proceed to job B. If you call DescribeJobs on an array job with 10,000 copies, you'll see a summary of how many of the children are in each status: how many are pending, how many are runnable, how many have succeeded, how many have failed. You can also call DescribeJobs on individual child elements within that broader array job, and you can use the ListJobs API to get a listing of all the jobs in a particular state; in this case we have an array job that depends on a non-array job. There's also a really interesting model, the N-to-N dependency model. Say you have a processing pipeline, maybe an image processing pipeline, where in the first stage you validate that the file wasn't corrupted in transit, maybe the second stage does a rectilinear correction, and in the third stage you make Rekognition calls or do some computer vision analysis. You could easily submit that and process 10,000 images by making just three API calls, one for each of the commands you'd like to run. By expressing an N-to-N dependency, as element 42 of job A completes, element 42 of job B is considered runnable and we can proceed with it, even if there are some stragglers in job A. If you have sequential processing, you can express a sequential relationship between the elements within a single array job, and we'll process element 0, then 1, then 2, then 3, and so forth.
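A sketch of the array-job submission and of how a child job finds its slice of the work; the queue, job definition, and object naming scheme are placeholders:

```python
import os
import boto3

batch = boto3.client("batch")

# Submit 10,000 copies of the same job with a single API call.
resp = batch.submit_job(
    jobName="transcode-catalog",
    jobQueue="production",
    jobDefinition="transcode",
    arrayProperties={"size": 10000},
)
print(resp["jobId"])  # parent ID; children are <jobId>:0 through <jobId>:9999


def child_main():
    """Inside the container, each child reads the index Batch injects and
    picks the corresponding piece of work (the key scheme is illustrative)."""
    index = int(os.environ["AWS_BATCH_JOB_ARRAY_INDEX"])
    key = f"videos/{index:05d}.mp4"
    return key
```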
You can also express dependencies on individual elements of an array job, say elements 37 and 42: only once those have completed do we move on to the next stage. This becomes really useful when you have pipelines where each stage not only runs a different command, but might also require different CPU, memory, GPU, or FPGA configurations. In this model, as we complete job A, the initial setup job, and proceed to a network-I/O-intensive job, we'll run those on one instance type; as that work completes we'll scale those instances down quickly and move on to, say, C5 instances for the next stage, and then maybe on to an M5 or an R4 instance before the final cleanup tasks. In doing so, you can submit these workloads without having to tell us which instance type your job runs on; we'll pick all of that for you, and of course you still have the ability to tell us the instance types if you prefer.
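A sketch of chaining those stages: each submit_job call names a different job definition and overrides resources, so Batch can move the work to different instance types per stage. All names and sizes here are illustrative placeholders:

```python
import boto3

batch = boto3.client("batch")
SIZE = 10000  # one array element per image

def stage(name, definition, vcpus, memory, depends_on=None):
    """Submit one pipeline stage as an array job. N_TO_N means element i of
    this stage becomes runnable as soon as element i of the previous stage
    finishes, even if other elements are still running."""
    kwargs = {
        "jobName": name,
        "jobQueue": "production",
        "jobDefinition": definition,
        "arrayProperties": {"size": SIZE},
        "containerOverrides": {"vcpus": vcpus, "memory": memory},
    }
    if depends_on:
        kwargs["dependsOn"] = [{"jobId": depends_on, "type": "N_TO_N"}]
    return batch.submit_job(**kwargs)["jobId"]

# Three API calls process all 10,000 images through three different stages.
validate = stage("validate", "validate-image", vcpus=1, memory=2048)
correct = stage("rectilinear-correct", "correct-image", vcpus=4, memory=8192, depends_on=validate)
analyze = stage("analyze", "analyze-image", vcpus=8, memory=16384, depends_on=correct)
```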
So, just summarizing with the roadmap and what you can expect from us going forward. We've invested heavily in our web development team for the AWS Batch console, so in addition to having parity with the AWS Batch APIs, we'll be adding a number of capabilities that give you additional telemetry, the ability to see more holistically what's happening in your job queues and your compute environments, and a number of other features that make the console far easier to use. We're going to give you the other half of event-driven architectures: the ability to automatically submit a job to AWS Batch, to a particular job queue, using a pre-selected job definition, when events matching a particular filter are emitted to CloudWatch Events; for example, an object arrives in S3 and we automatically submit a job to the production queue running the image-processing job definition. We'll also be extending Batch to support CloudTrail auditing of our APIs; we already support CloudTrail auditing of the underlying services we manage on your behalf. Consumable resources is a super interesting feature if you're running jobs that depend on licensed software, or if you have thousands of compute resources at your disposal but jobs that connect to a database that can only handle so many connections. The consumable-resource feature will let you specify an integer that corresponds to the maximum number of jobs that can run at any given time with a dependency on that resource: as we start a job we decrement, as we finish a job we increment, and we make sure we don't start your job until that resource is available, so we don't waste compute resources needlessly trying to run jobs that are never going to succeed. We'll be adding support for multi-node parallel jobs, so for MPI applications you can have us provision a cluster and run your job across many machines with low-latency, high-bandwidth connectivity between those instances; that's coming to Batch very soon. And finally, we'll finish out the regional expansion. I'm sharing a couple of links, which you can see if you look at the slides on SlideShare; feel free to quickly take a picture if you like. With that, I'd like to hand things off to Lee from HERE to talk about how HERE has been using AWS Batch for their autonomous driving systems. Thank you. [Applause]

Thanks, Jamie. My name is Lee Baker, and I'm a senior architect with the highly automated driving division of HERE Technologies. A lot of people may not know who we are, but odds are you interact with our products quite often: four out of five in-vehicle navigation systems use our maps. To generate this map data we have four hundred cars around the world driving around collecting high-resolution imagery, GPS, lidar, and so on, and that generates 28 terabytes of map data per day. We provide a lot of products to a lot of different companies in a lot of different industries, with the one common thread being location. In fact, we just launched the Open Location Platform to some customers, and it will be opened up more broadly soon, so if you're interested, check it out at here.com. OK, the use case I want to talk to you about today is the occupancy grid. At a high level, what we do is translate a lidar point cloud into what's called an occupancy grid; the purple cubes you can see are the occupancy grid, and they indicate when something is occupying that space. This is very useful for self-driving cars, for obvious reasons: so they don't run into things. We had to create a pipeline for this, and basically how it goes is that a customer submits a route, or many routes, and a route is then partitioned into equal-sized pieces that we can process in parallel, so this is kind of an embarrassingly parallel problem. Some of the requirements we had: a one-month deadline, because the previous architecture just wasn't scaling and the team came to us with one month until the first customer deliverable. We also didn't want to hamper researcher productivity; we wanted to get everything out of their way so they could just focus on developing the algorithm. We wanted to make it generic for other teams: occupancy grid is pretty typical of the workloads we run, so if this worked well, we could apply it to other projects. We needed to support many different languages: occupancy grid is C++, we run a lot of Scala, a lot of Python, some Java, some Go, and I'm sure there are other languages in use around the company that I'm not even aware of. And of course we wanted it to auto scale. This is a very bursty workload: customers will submit many routes at once, these need to be turned around very quickly, sometimes within a day, and then maybe it's idle for a few days until we get the next batch, so we didn't want to pay for idle capacity; we wanted it to scale down to zero, with minimal care and feeding. We wanted to be able to hand this off and have it just work, without a whole lot of centralized infrastructure, and really minimize the things that could go wrong. One common question we get is why this isn't built on a deep learning framework. Deep learning frameworks are actually better suited to structured data like images; lidar point clouds are unstructured, so it's just not a good fit as of now (they're working on that), and more importantly, it was the researchers' decision and we didn't want to impose anything. Also, Spark: Spark really shines at parallel processing, but it doesn't have native C++ support. You can use the pipe operator and make it work that way, but you quickly run into memory issues because Spark isn't aware of what's going on in that external process, so we ruled that out. Containers were the logical way to go. So we looked at ECS, just kind of vanilla ECS, and ECS is great, but you still need to deal with manual auto scaling tuning and instance-type optimization; if you have jobs of varying sizes, you end up having to pick the largest instance so that your jobs won't fail. And queue management: we still would have had to solve that problem as well, so we kept looking. Kubernetes was another option, but we simply didn't have the team experience and felt it would be too risky with such a tight timeline.
So we were aware of Batch and decided to give it a try, and this is what we came up with. I really struggled with this slide because it's so simple, but I think that actually illustrates how easy it was for us: the operator submits all the jobs onto the Batch queue, and Batch handles the rest. We use a managed Batch compute environment; it picks the instance types that are best for our resource requirements, and it actually learns and gets better over time.
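As a rough illustration of that fan-out (this is not HERE's actual code; the queue, job definition, and parameter names are made up), submitting one Batch job per route partition might look like:

```python
import boto3

batch = boto3.client("batch")

def submit_route(route_id: str, num_partitions: int):
    """Fan a route out into equally sized partitions, one Batch job each; the
    managed compute environment picks instance types and scales back to zero
    when the queue drains. All names and parameters are illustrative."""
    job_ids = []
    for part in range(num_partitions):
        resp = batch.submit_job(
            jobName=f"occupancy-grid-{route_id}-{part:04d}",
            jobQueue="had-processing",          # hypothetical queue
            jobDefinition="occupancy-grid",     # hypothetical job definition
            parameters={"route_id": route_id, "partition": str(part)},
            retryStrategy={"attempts": 2},      # survive Spot interruptions
        )
        job_ids.append(resp["jobId"])
    return job_ids
```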
We did run into some challenges, but luckily we had Tom, who helped us out with all of those, and he's going to walk you through them.

OK, so let's talk a little bit about some of the things that were a challenge. I'm a solutions architect, so I really get into the weeds with the customer, and I've been with HERE for almost two years now, so lots of different types of projects, lots of things going on; this one was really fun. The story starts early this calendar year, with a data scientist asking, how do we make this easy? I really want the simplicity of: I write my algorithm on my machine, I hit a button, and it all just runs. Lee was doing a lot of the work gathering requirements and understanding the different services, and I was on the back end working with our different teams to make sure they were making the right decisions. Along the way we had some, I won't say false starts, I'll just say you learn by doing: you go in, you feel it out, and you see whether it fits what you're trying to accomplish. The big one was orchestration. We spent a lot of time on this, we did different POCs, and we saw some things that were successful and some that weren't. One thing to keep in mind, and we talked about this earlier: the only constant is change, so a lot of what you're dealing with is point-in-time decision-making. At that time, array jobs weren't an available feature, and we'll continue to reassess, but early this year you could really only set up 20 dependencies on a job, and they wanted to scale this to thousands and thousands, so we were orders of magnitude out of balance. We had to come up with a better way to orchestrate the scale of the fan-out and fan-in of their pipeline. Spark was an option that was considered; it felt a little bit like overkill, and it had some of the limitations Lee mentioned, like not being able to run the C++ code natively, so that was ruled out. The Step Functions project was really interesting: we took Step Functions and ran Lambda shims on top of it to execute Batch processes, but then you have to build a lot of pollers to check on things, and this was before you had these wired-up events that can now come through and keep the pipelines moving. It just felt like the custom activities and the amount of work necessary to orchestrate these jobs was too cumbersome. And then Airflow ended up being a very nice, tight fit to what they were doing, and they're continuing to use it today. So that was orchestration: the Batch service and the architecture diagram Lee showed are really simple, and dropping jobs on there and watching it scale and manage the resources were no longer a problem area, but driving the pipeline from start to finish and tracking it still had to be solved, and they did that with Airflow.

The other big one was data: the data scientists didn't want to write S3 code; they didn't want a completely different mechanism for storing input and output data. They wanted to be able to run it, hit a button, and just go. So how were we going to accomplish that? The first idea was to just include the libraries and the data right there, packaged in the image, but it got to the point where every time the data changed you had a new image, and the versioning was problematic. Similarly, a shared base Docker image was another attempt; that also became too difficult to manage with all the versioning and maintenance, there were flexibility issues, and it was risky in that the base image itself could become a problem. Then EFS, because now we've got the idea of shared state: you don't have to rewrite the code, you get the same normal semantics for accessing and writing the data. The challenge with EFS was that we had a difficult time running long, sustained jobs, and at that point there was no equivalent of the EBS Provisioned IOPS model; EFS gives you performance based on the size of the file system, and this data is very transient, so you had to do a lot of manipulation to get it to the performance levels they were looking for. In the end it became a runner-model container with secondary containers: the runner uses Docker to spin up secondary containers and link to them, so the data stays separate, and that just ends up being part of the architecture of what they included in their application. A couple of other small challenges: storage, like auto-scaling EBS volumes. Similar to the scenario where you have to pick the largest instance, you have to think about having the largest amount of data capacity available for varying levels of potential ingest, so right now it's over-scaled and we're going to work on ways to fix that so it's a little better from a cost perspective. And then managed versus unmanaged compute environments: you really want to let Jamie's team do their thing with their software and scale things down to zero, because it can very easily and quickly get out of bounds from a cost perspective if you don't set it up to auto manage.

Do you want to talk about next steps? All right. So, like Tom said, array jobs: Jamie actually came out and talked to the whole team in Chicago, and they were really excited about array jobs and continue to be, so we'll be looking at that going into early next year.
And there are a lot of other projects, so there's definitely more opportunity to keep building on this. I've actually met with quite a few other teams building similar concepts at HERE, and this model they've got is really ideal: it's serverless, and it doesn't require a lot of heavy maintenance, so we're looking forward to bringing in other teams in the company.

To wrap up, one more comment. With regard to storage at the job level, this is another area where we're going to be innovating in AWS Batch, giving you the ability in your job definition to express the storage requirements of your job. In the meantime, I'd recommend you take a look at the batchit project, which we covered in the CMP323 session on genomics. It's an open-source capability so that when you submit your Batch jobs, a little wrapper will actually provision an EBS volume for that job and then get rid of it once the workload is done. You can also conceive of using it to pass EBS volumes from job to job if you have dependency chains; something to look into, and we'll be playing a more active role there too. We're a little bit short on time, so we're going to wrap up the main presentation now, but we're the last session in this room, so we're happy to stick around, and if you have any questions, come on up and we'd love to take them. Thanks again for all of your time today, and thanks to David, Lee, and Tom, really appreciate it. Good job, guys. Thank you.
Info
Channel: Amazon Web Services
Views: 3,179
Keywords: AWS re:Invent 2017, Amazon, Containers, CON304, AI, Lex, AWS Batch, Compute, Security
Id: Zgqrd0XQwTw
Length: 59min 56sec (3596 seconds)
Published: Fri Dec 01 2017