Introduction to HashiCorp Nomad

Video Statistics and Information

Captions
Hi, my name is Armon Dadgar, and today I want to give a brief introduction to Nomad. When we talk about Nomad, one of the things we see that's really common is this deployment pattern of a single operating system with a single application running on top of it, and in this configuration they meet on top of a single VM. The challenge we often see with this configuration is that you really have two distinct audiences. You have your developer audience, who cares about the application lifecycle: scaling up, scaling down, changing configuration, deploying a new version. At the same time you have your operator audience, and they care about a different set of things: are we running the right version of the OS, is it patched, do we have enough capacity in our fleet? The challenge is that although they have independent sets of concerns, they have to coordinate, because at the end of the day the application is running on an OS on a VM. So what we often see is that any time the development group wants to do anything related to the application lifecycle, they have to file a ticket against the operations group, and that's the layer at which the coordination happens.

So the first thing we're really looking at with Nomad is how we split this so that we can have independent workflows. The primary goal of Nomad is to sit in between, disintermediate, and provide a layer with a southbound API focused on the operator and a northbound API focused on the developer. What does this mean for the developer? We want to let them write what Nomad calls a job file, which is an infrastructure-as-code way of declaring everything about their job. They would say: I have this web application I want to run, it's version 10 of my application, and I want three instances of it running. Now the developer
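A minimal Nomad job file along the lines described might look like the sketch below. The job name, image tag, and resource figures are illustrative, not taken from the video:

```hcl
# Hypothetical job file: three instances of a web server, version 10.
job "web" {
  datacenters = ["dc1"]

  group "frontend" {
    count = 3              # the developer asks for three running instances

    task "server" {
      driver = "docker"

      config {
        image = "web:v10"  # version 10 of the application (name is made up)
      }

      resources {
        cpu    = 500       # MHz
        memory = 256       # MB
      }
    }
  }
}
```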
just submits this job file to Nomad's API, and it's Nomad's responsibility to find space in the cluster to run three instances of this web server. We might be running a hundred-node cluster, and Nomad will find three machines with available capacity and deploy the web server there. Now, as a developer, when I come back and say I want to deploy version 11 of my application, I simply change my job file and specify what rollout strategy I want: do I want a canary, a blue/green deployment, a rolling deploy? I specify my strategy for deploying my application, submit it to Nomad, and Nomad takes care of rolling out the change across the fleet safely.

Beyond deploying our application and making changes to it as we change versions or scale up and down (we could easily come in here, change three to five, and Nomad will run two more copies), Nomad also automates some of the operational challenges that have historically belonged to the operations group. These are things like: how do we make sure that if the application crashes, it gets restarted? We want to gracefully restart the application and ensure it stays online even though it might have crashed or hit an issue. The other side of that is: what if the machine we're running on, or the rack, the cage, or the data center, fails? What we really want is to automatically reschedule the work somewhere else. If Nomad detects that the machine running our application has failed, it will find somewhere else with available capacity to run it. These are traditionally things where we might have paged someone to deal with the operational issue and keep the service reliable, but Nomad lets us automate them. Now, that's the developer
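The rollout strategy and crash-restart behavior described above map to Nomad's `update` and `restart` stanzas. A hedged sketch, with all values illustrative:

```hcl
# Hypothetical update and restart configuration for the "web" job.
job "web" {
  update {
    max_parallel = 1      # rolling deploy, one instance at a time
    canary       = 1      # run one canary before replacing the rest
    auto_revert  = true   # roll back automatically if the new version fails
  }

  group "frontend" {
    count = 5             # scaling from three to five is a one-line change

    restart {
      attempts = 2        # restart a crashed task locally a couple of times...
      interval = "30m"
      delay    = "15s"
      mode     = "fail"   # ...then mark it failed so it can be rescheduled
    }
  }
}
```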
focus: the application lifecycle. As we said, the operator is really focused on the southbound side, and they have a few different concerns: are there enough machines in the fleet with capacity, are they patched, are we running the latest version? So they have a set of needs as well. They need to be able to say: these ten machines, I want to take them out of the fleet so I can patch them and then bring them back. So they have an API too, where they can say: I'd like to gracefully drain these ten machines, and over the next four hours get all the workload off of them; then I can take each one out of service, patch it, bring it back in, and allow workload to land on it again.

This is how we think about these two different audiences: what does the developer need for the application lifecycle, and how do we let them define their job requirements in this infrastructure-as-code way; and on the operations side, how do we decouple them and allow them to do the things they care about around cluster management and node management without tightly coordinating with developers. That's the first-level goal.

The second-level challenge we see with Nomad is that when you look at most infrastructure, you have a really bad rate of hardware utilization, typically less than 2%. So how do we actually solve this? We have all these eight-core and sixteen-core machines running an application that does a hundred or a thousand requests a day, effectively sitting idle; we're not making good use of the hardware. The approach Nomad takes is to run multiple applications on the same machine. So how do we move from less than 2% to 20 or 30 percent utilization? Now, you might look at this and say 20 to 30 percent utilization doesn't sound that good, so why don't we shoot even higher? But what we have to realize, with the sort of law of small numbers, is that because we're starting from
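The graceful drain described above maps to Nomad's node drain workflow. With the CLI it might look roughly like this; the node ID is hypothetical, and this would be repeated (or scripted) for each of the ten machines:

```shell
# Take a node out of service, giving workloads up to four hours to migrate.
nomad node drain -enable -deadline 4h 4d2ba53b

# ...patch and reboot the machine, then make it eligible for work again.
nomad node drain -disable 4d2ba53b
```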
such a bad place, going from 2% to, call it, 20% still gets you an incredible reduction in fleet size. Going from 2 percent to 20 percent is actually a 90 percent reduction in the amount of hardware we need overall; we can basically replace every ten machines with one. So there's a great total-cost-of-ownership optimization that comes from running multiple applications and making better use of our resources. That's the framing: how do we allow this decoupling and self-service as the primary focus, and how do we look at total cost of ownership as a secondary goal.

Now, what we haven't really mentioned is that we've been talking generically about an application running on a machine, and this comes back to how flexible Nomad is. On one side, a major use case for Nomad is acting as a container platform. The application we're deploying might be packaged as a Docker container: we package our application as a Docker container, specify as part of our job file that our web server uses this container, say web-v10, and then hand that to Nomad to do the deploy. But what about applications that aren't containerized, or can't easily be containerized? This is a whole second use case for Nomad, covering both Windows and legacy applications. Maybe it's a simple C# application we're deploying on Windows, or something more heavyweight that we can't easily containerize. Nomad allows us to run many of these types of workloads without needing to make that transition in packaging format. A common use case is running C# apps directly on top of Windows without containerizing them and porting them to Linux, so this ends up being a common workflow for us.

Beyond that, the interesting thing is that when we talk about this north- and southbound API, what we're really providing is an API for
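The two packaging styles described, containerized versus legacy, correspond to different task drivers in the job file. A sketch of the two task fragments side by side; the image name and Windows path are made up, and note that the `raw_exec` driver must be explicitly enabled on the client before it can be used:

```hcl
# Containerized task: the application ships as a Docker image.
task "web" {
  driver = "docker"
  config {
    image = "web-v10"
  }
}

# Legacy Windows task: the binary runs directly on the host, no container.
task "reports" {
  driver = "raw_exec"
  config {
    command = "C:\\apps\\reports\\Reports.exe"
  }
}
```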
scheduling work. It could be that we're specifying our job in the form of the job file we've talked about and submitting it manually to Nomad, but we could also programmatically consume Nomad's API to deploy jobs. This leads to a few interesting use cases. One of them, call it job queuing, or a serverless pattern of deployment, is this: when an event comes in, how do we translate that event into something that needs to execute? A great example is CircleCI. Every time a commit comes in, CircleCI has to trigger a build that goes and tests whether the change passes, yes or no. CircleCI has publicly talked about how they use Nomad behind the scenes for their infrastructure: they get a webhook, an event that a commit has taken place, they translate that, and they submit a job to Nomad to go run the build. What they see is the ability to submit well over a thousand jobs a minute to Nomad. In this sense Nomad is acting in two ways. We're queuing up jobs for Nomad to run: it might be that for a temporary period the number of incoming events exceeds our ability to process them, so the rate in might be a thousand a minute while we only have enough hardware capacity to process 500 or 800 events a minute, and Nomad will allow this work to back up, queue it until there's available capacity, and drain it as capacity frees up. This also lets us start to think about the serverless paradigm: how do I take an event, turn it into a small unit of work that processes just that event, and schedule all of this work independently? This becomes an interesting use case because we have an API we can use to programmatically consume infrastructure.

Now, one of the interesting things about CircleCI's use case is that we're talking about a relatively large scale, a thousand different events per
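The event-driven, queued pattern described here is commonly modeled in Nomad with a parameterized batch job: the job is registered once, and each incoming event is turned into one dispatched instance, which Nomad queues and runs as capacity allows. A hedged sketch; the job name, image, and metadata field are illustrative, not CircleCI's actual setup:

```hcl
# Hypothetical parameterized batch job: one dispatch per commit webhook.
job "build" {
  datacenters = ["dc1"]
  type        = "batch"

  parameterized {
    meta_required = ["commit_sha"]   # each dispatch must name a commit
  }

  group "ci" {
    task "run-build" {
      driver = "docker"
      config {
        image = "builder:latest"     # build environment image (made up)
      }
    }
  }
}
```

Each event would then be submitted with something like `nomad job dispatch -meta commit_sha=abc123 build`, either from the CLI or programmatically through the job dispatch HTTP endpoint.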
minute, but this is actually relatively easy going for Nomad, because a whole other use case for us is high-performance computing. When we talk about high-performance computing there are many interesting use cases. It could be that you're a financial institution, and every night you want to run a complex risk model, so you're spinning up a hundred thousand cores to run complex risk calculations to determine: should I buy or sell stock, am I overexposed in certain areas? What really matters to you is being able to consume an enormous amount of compute for a period of time, and how long it takes to finish some job or calculation. This is a use case we benchmarked very publicly in what we called C1M, our million-container challenge, where we looked at how quickly we could schedule a million containers on a cluster of 5,000 machines. What we found is that we could run all million containers, each an instance of Redis, in less than five minutes. That's an incredible rate of scheduling, and at the time we thought it set an upper bound on what's reasonable and what we'd actually see customers doing. But what we found in practice, and Citadel has publicly talked about this (for those unfamiliar, Citadel is a large hedge fund), is that they came to us and said: this is cute, but could you actually do it at 40 times the scale? Their use case is very much like the one I described: periodically they want to run incredibly large calculations and simulations where speed is of the essence, because their ability to make a trade within the same day waits on the calculation completing. They want to be able to scale to incredibly large clusters with thousands and thousands of cores, quickly run these massive-scale computations, and then spin the cluster back down. So the ability to
programmatically generate and submit these jobs, and then tear them down, so that you might be running multiple such workloads on a fixed set of hardware, is a powerful feature of Nomad above and beyond the self-service infrastructure capability.

So again, when we talk about Nomad, it's really about moving away from the tight coupling of the application to the operating system and introducing a layer of abstraction. This layer of abstraction buys us both a north- and southbound API for cluster management as well as application management. A secondary side effect is the automatic cost optimization that comes from bin packing, placing multiple applications on the same machine. And this enables the four distinct types of use cases and patterns we see around Nomad. Hopefully this was a useful high-level introduction to Nomad. There's a lot more material available on our website, as well as content that goes a lot deeper than this, so I encourage you to check it out. Thank you so much.
Info
Channel: HashiCorp
Views: 44,566
Keywords: Armon Dadgar, HashiCorp, HashiCorp Nomad, Nomad, Scheduler, Operations, DevOps
Id: s_Fm9UtL4YU
Length: 12min 14sec (734 seconds)
Published: Mon Jul 23 2018