Nomad and Next-generation Application Architectures

Captions
To kick off our afternoon, I'm really excited to introduce Armon Dadgar. Armon is the co-founder and CTO of HashiCorp, and he's here to talk to you about Nomad. Please give it up for Armon.

So today I wanted to spend a little bit of time talking about Nomad and how we can leverage it in next-generation application architectures. For those who aren't familiar — I think Seth beat me to this — I'm one of the founders and CTO, and I think you've seen this slide a few times now in terms of how we think about the problem space and delineate the challenges. Today we're just going to focus on Nomad and what we can really do in this space when we change our perception of what the scheduler is and how it actually gets leveraged by our applications themselves.

When we talk about Nomad, it's been mentioned a few times that it is a cluster manager or an application scheduler, and it really is both of those things. But the groundwork I want to set is: when we talk about a scheduler, what are we really talking about? What makes a scheduler a scheduler? At the most philosophical level, the goal of a scheduler is to map a set of work onto a set of resources.

To make this a little more concrete, an example we're probably all familiar with, or have at least heard about, is a CPU scheduler. The function of a CPU scheduler — and we are surrounded by these; they're in almost any device — is to map the virtual work of processes and threads onto the physical world of CPU cores. Your laptop may only have two or four cores, but there might be hundreds or thousands of threads of work that we want scheduled on top of that. The challenge is that there's no static way to assign 2,000 threads onto only four cores, so instead we need to take a more dynamic approach where we're scheduling that work as we go. This really is the function of the CPU scheduler: the input to the system is the set of virtual work, and the resources underneath are what it can leverage to achieve that. Over time the scheduler is going to move things around — it's not a static assignment of what's running where — and this lets the CPU scheduler provide a higher quality of service and create the illusion that everything is running at once, even though we don't actually have the hardware to allow that to take place.

The CPU scheduler is really just one example of a scheduler in the wild. In reality we're surrounded by schedulers; we see them all over the place. The CPU scheduler fills the function of mapping virtual threads and processes onto physical cores, but there are all sorts of other types of schedulers — this is just a very small sample of what's actually out there. If we think about something like EC2 or OpenStack Nova, it's doing a similar sort of thing but at a coarser scale: it's mapping a virtual machine, which is maybe four cores and eight gigs of RAM, onto a set of hypervisors, which are physical machines. It's the same dynamic mapping of work onto a fixed set of resources under the hood. When we talk about something like the Hadoop or Spark scheduler, they're doing a similar thing: they're mapping a query or a batch process onto a fixed cluster, potentially running multiple jobs at
the same time, overlapping their execution and putting things in queues, but they're filling that same mapping role. So when we talk about a cluster scheduler — something in the context of Nomad — what we're talking about is mapping applications onto servers, and it's within this context that we're going to talk about Nomad.

The goal of schedulers, almost regardless of whether we're mapping threads onto CPU cores or applications onto servers, is really to achieve three things: higher utilization of our resources; decoupling of the two sides, so our work isn't intimately connected with the underlying resource; and a better quality of service. What that last one means varies by application. If we're talking about a desktop CPU scheduler, maybe we care about responsiveness: if I'm clicking around as a user while my machine is really busy, I want the machine to feel responsive. That's not necessarily a concern for a cluster scheduler — it doesn't care about a user clicking around; it cares more about overall density or minimizing time in queue.

There are a number of approaches we leverage to achieve each of these, and I'll touch on them briefly. When we talk about optimizing resource utilization, the core of it is what's known as bin packing. This is the algorithmic approach of saying: I have a number of boxes I'm trying to fill — servers, cores, whatever they may be — and how do I maximize how full they are, or conversely, minimize the amount of wasted resource, the resource I'm not using? Bin packing is a well-known algorithmic approach to doing this, where you try to minimize that unused space, and it leads to higher resource utilization.

The other approach is resource oversubscription. Where we're most familiar with this is memory overallocation: machines will often allocate more memory than they actually have available, and if we actually try to use all of that memory, it will swap to disk. We've overallocated memory on the machine overall, and we have measures in place to make sure things don't fail catastrophically. In the land of cluster schedulers you do the same sort of thing: applications have requested, say, ten gigs of memory on a machine that only has eight gigs; we'll provide that to them, we'll allow the oversubscription, and if that memory actually starts getting used, then counter-control measures kick in to keep the machine from crashing.

The last one is job queueing. How do we ensure there's a steady stream of work? You just queue it up. You don't fail when you run out of capacity; you put the work in a queue, and as soon as room clears up, that work immediately starts executing.

The next property is the decoupling of resources from the actual work being done, and there are really three primary techniques here. One is abstraction: we hide some of the essential details of the underlying resource, and this gives us some amount of portability. The other is an API contract: we have to provide some set of guarantees to the application — the consumer — so they can reason about how their work is going to execute and how they should think about modeling their resource utilization. This goes hand in hand with the abstraction,
and the last one is packaging. We want our app to be packaged independently of the underlying system so that we don't have that coupling. A great example of this is something like Docker: it's wholly independent of whether it's a Red Hat system that wants an RPM or a Debian system that wants a deb package. It's an independent, flexible packaging format.

The last goal is better quality of service, and like I said, this means different things for different applications. For a big data scheduler, quality of service may simply be optimizing the wait time in queue — first in, first out. For a desktop CPU you probably care about things like real-time scheduling. In the more general case of a cluster manager, you care about some mix of things. You care about priorities: high-priority work should get access to resources before low-priority work. You care about resource isolation: if I've asked for eight gigs of memory, I should actually be able to use that eight gigs; some other app shouldn't be able to show up and take my resources away. And the last one is preemption: if I have a whole bunch of low-priority work consuming my whole cluster and high-priority work shows up, we should be able to evict the low-priority work to make space to run the high-priority work. Together these three properties give a higher quality of service to our applications.

So cluster schedulers have come into vogue over the last few years, particularly led by things like Docker giving us that flexible, agnostic packaging, but these are techniques that have been around for quite some time. Google is relatively famous for running its Borg scheduler since the early 2000s; Amazon, if we think about EC2 as basically a giant scheduler, has been doing it since 2006; and big folks like Netflix and Twitter talk about how they've scaled their infrastructure over the last decade using these approaches as well. Hopefully that provides a little bit of context: when we talk about cluster managers and schedulers, what we're really talking about is how we maximize resource utilization, how we get that decoupling, and how we get that quality of service. And our scheduler is Nomad.

I want to spend a little bit of time unpacking Nomad. Within the realm of being a cluster scheduler, there were a number of goals we had for it. One was really thinking about the two primary users of something like Nomad. On one side we have developers who are running applications, deploying new versions, scaling up and down, doing rollbacks — so how do we make their life easy, and make deploying a service and managing a service, that full lifecycle, as simple as possible? On the other side we have the folks who have to stand up and manage the cluster — so how do we make it operationally simple to deploy, scale, manage, and do all the operational bits? And in doing both of those things, how do we ensure the system is built to scale, so that you don't hit these elbows in the scaling curve where you have to say, great, now I have to re-architect my whole application because I've hit the limits of the scheduler? A system like this lives at the heart of your application, and it can be very uncomfortable to hit its scaling limitations. In doing so, we can provide an API that enables next-generation application patterns,
because we're confident in the ability of the system to scale and support them at any scale. So let's get into what that looks like.

For those who have never seen a Nomad job file, this is what they look like — hopefully non-threatening. They're meant to be human-readable and human-writable. These are what developers actually work with to specify what their job looks like and what its constraints are, and to do rollouts of new versions. These declarative job definitions are the heart of what a developer works with. This is a very simple job; it's fully valid, and it will run Redis using Docker. Here it's just going to run a count of one, in the us-east datacenter. The goal of the job file is very similar to the philosophy behind Terraform: it's meant to be very declarative. We don't want to tell Nomad how to do anything; we just want to tell it exactly what we want it to run — one Redis, a thousand Redises — and Nomad figures it out. That job file really is totally devoid of any imperative detail about what to do or where to do it. That is all left to Nomad, and that is part of the abstraction and API contract: we've abstracted the developer away from how something gets run. They just specify, as part of their API contract, "I want one or five Redis servers." What this is really doing is getting at that abstraction — it lets us abstract the underlying machine from the work that's being done — and so we get that natural decoupling.
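For reference, a minimal sketch of the kind of job file described here might look like the following; the exact image tag, datacenter name, and resource numbers are illustrative rather than taken from the talk.

```hcl
# A hedged sketch of a simple Nomad job: run one Redis container in us-east.
job "redis" {
  datacenters = ["us-east-1"]
  type        = "service"

  group "cache" {
    count = 1            # declare how many instances; Nomad decides where they run

    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
      }

      resources {
        cpu    = 500     # MHz
        memory = 256     # MB
      }
    }
  }
}
```

Scaling to a thousand instances is just a matter of changing `count`; nothing in the file says which machines to use.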
In doing so, we really want to make sure we can support the broadest range of applications you might need to run inside your organization. There's not one class of application, not one type of thing you want to run, so when we talk about workload flexibility with a system like Nomad, we mean it in a very broad sense. On one side there's operating system flexibility: you have some set of Windows apps and some set of Linux apps, so how do you run those in a unified way on one cluster? Nomad has no issue with 20% of your fleet being Linux and 80% being Windows; that's totally fine.

The types of workloads you want to run also vary. Probably the most common class we talk about is the long-lived microservice application — web servers, API servers, things like that — and those are of course first-class. The next largest class is short-lived batch: a monthly billing run, a big data job, some sort of data transform — something that runs to completion and exits, giving up the resources it doesn't need perpetually. As a subset of that category there are the cron workloads: things like a billing run that we do on a weekly or monthly basis. How do we let Nomad know, here's the definition of my job, run it every month, this is the amount of resource it needs — and now we have a highly available cron. The last kind of workload is system agents: things we want to run everywhere in the fleet, like a Datadog agent, a logging agent, or telemetry — things we want fleet-wide. All of these workloads are considered first-class to Nomad.

Lastly there are drivers, which get at what we touched on earlier: to get that decoupling of work and resource, we need these independent packaging constructs — we need to free the application from being OS-specific. This can come in many shapes. One way of doing it is containerization, with things like Docker, rkt, and LXC, which are all natively supported. The more traditional approach is virtualization: if you have a VMDK or an ISO, you can feed that into something like QEMU or KVM and run virtualized workloads on Nomad. And as you get into things like static binaries, fat JARs, or DLLs for C#, you can provide those to Nomad as well, and it will wrap them with OS-primitive containerization — chroots and cgroups, essentially everything Docker and rkt use under the hood — implicitly, on behalf of your application.
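Tying those two ideas together — the cron-style workload and the non-Docker drivers — a hedged sketch of a monthly billing run might look something like this; the artifact URL, jar name, and schedule are hypothetical, and the `java` driver's exact config options may differ across Nomad versions.

```hcl
# A sketch of a periodic (cron-style) batch job using the java driver
# instead of Docker; Nomad sandboxes it with chroot/cgroup primitives.
job "monthly-billing" {
  datacenters = ["us-east-1"]
  type        = "batch"

  periodic {
    cron             = "0 0 1 * *"   # midnight on the first of every month
    prohibit_overlap = true          # don't start a new run if the last is still going
  }

  group "billing" {
    task "run" {
      driver = "java"

      artifact {
        # hypothetical location; Nomad downloads this into the task's local/ directory
        source = "https://artifacts.example.com/billing.jar"
      }

      config {
        jar_path    = "local/billing.jar"
        jvm_options = ["-Xmx512m"]
      }

      resources {
        cpu    = 1000   # MHz
        memory = 512    # MB
      }
    }
  }
}
```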
The goal of all of this is to say: if we have these declarative jobs, how do we enable the infrastructure-as-code approach, moving up the stack from Terraform — which looks at things from the storage, compute, and network level — a little higher, to the service and application level? Part of that is asking what these applications need. They need to integrate with systems like Consul to register their services, do health checking, and discover their upstreams. They need to integrate with things like Vault to get database credentials, TLS certificates, and secure key-value storage — Jeff went into a lot more detail on that than I will. And this really highlights Nomad's approach to the problem, which is to compose together these independent pieces to solve the broader problem, as opposed to trying to make Nomad an end-to-end platform. Nomad very much focuses on the cluster management and scheduling aspect, and pulls in these other systems, or allows you to integrate them, to solve the whole spectrum. The ultimate goal really tends to be empowering developers by decoupling the operations role: how do we let our operators do the cluster management they want to do without constantly being pulled in by developers, and how do we empower the developers to self-service?

When we talk about the operators, they have a big role to play with Nomad as well — they're the ones actually providing this as a service — so how do we make it as simple as possible for them? Where this really starts for us is making Nomad a single binary with no external dependencies. The Nomad binary can run as the client or as the server; there's no external data store like ZooKeeper or Consul; it's fully self-contained. Next, we asked whether we could build on the other systems we already know and have learned from, and for us that really starts with two different systems. One is Serf, which we don't talk about a lot, but it's the gossip system that lives inside both Consul and Nomad. The other is Consul, which is really our consensus-based service discovery tool. Each of these systems has taught us different things.

Serf is a totally peer-to-peer, gossip-based system for cluster management, and it really gives us three different things. One is membership: who is in my cluster, what's their IP, what datacenter are they in? One is failure detection: great, I know these 500 nodes are in my cluster — which of them is alive at this moment? And one is an event system: how can I efficiently broadcast an update to all 500 members of the cluster in a reliable way? Serf taught us a lot of the core distributed-systems bits of how you put together a large-scale system, and we've learned a lot from that. In practice we have users running over 10,000 nodes of Serf in a single datacenter, powering incredibly sensitive use cases. What this lets us do is really simplify the clustering story and the federation story for users of Nomad; they get this out of the box with one command.

Similarly with Consul. Consul, on the other hand, taught us a whole lot about building a higher-level, multi-datacenter service: it provides service discovery, configuration, coordination, a whole set of features, and its architecture is very different. Where Serf is totally and wholly peer-to-peer, Consul has a centralized server architecture: a set of servers and a distributed set of clients. What this really taught us is how to architect for multiple datacenters out of the box, how to leverage Raft consensus to build strongly consistent systems, and how to do this in a production-hardened way at large scale. There are a lot of difficult problems to solve there, but we have individual users of Consul spanning over a hundred thousand machines, so we've learned a lot about what it takes to make this work at scale.

In bringing all of that together with Nomad, it started with a single binary — we didn't want external dependencies — high availability had to be there out of the box, and multi-datacenter and multi-region support shipped with 0.1, because these are all very important to us. The next aspect is how we build it for scale. Great, it's operationally simple, it should be enjoyable and easy for developers to use, but that's not helpful if you have to totally re-architect your app when you hit 50 nodes or 500 nodes — we want it to be able to scale with you. So where we started with Nomad was asking what we could learn from Serf and what we could learn from Consul, and pulling in both mature libraries and proven design patterns — why reinvent the wheel? What we didn't have, in building those, was any scheduling logic: neither Serf nor Consul does any scheduling, so this was a new area for us. So we turned to the research: what's out there, what is the state of the art in terms of doing scheduling at scale? What we looked at was really two major camps of work: one out of Google, in terms of what they've learned over the last decade and a half with Borg and Omega, and one out of the AMPLab, in terms of what they've learned from more big-data-oriented scheduling with Mesos and Sparrow. These four papers were really the big inspiration for how we actually built Nomad.

In practice, the architecture ends up looking very similar to Consul, with a twist. For those who have seen the equivalent Consul diagram: you have a central set of servers that elect a leader among themselves, replicate data, and forward requests, and you have clients. What's different about Nomad is that the clients can now span multiple datacenters. Unlike Consul, where you have one set of servers per datacenter, Nomad has one set of servers per region, which might be a single datacenter, a dozen datacenters, or a thousand datacenters — so we have these clients that live in totally different datacenters. As we compose this into a multi-region architecture, it looks very similar to Consul federation across multiple datacenters: each region has its own set of servers.
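As a rough illustration of that split, a minimal pair of agent configurations might look like the following; the region and datacenter names, paths, and addresses are made up for the example.

```hcl
# server.hcl — one of (typically) three or five Nomad servers for the "us" region
region     = "us"
datacenter = "us-east-1"
data_dir   = "/var/nomad"

server {
  enabled          = true
  bootstrap_expect = 3
}
```

```hcl
# client.hcl — clients can live in any datacenter within the same region
region     = "us"
datacenter = "us-west-2"
data_dir   = "/var/nomad"

client {
  enabled = true
  servers = ["nomad-server-1.example.com:4647"]   # the region's servers, RPC port
}
```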
The regions loosely federate together over the WAN, using Serf and gossip as the backbone, and together they can form a global cluster: you can forward a request to any server in any region and it will route correctly to the servers that should process it.

As we started thinking about the scheduling algorithm — okay, this is what it's going to look like from an architectural standpoint — here were the design goals, the box we drew for ourselves: how do we make sure the system can tolerate hundreds of unique regions, with tens of thousands of clients per region, and thousands of jobs running on top of that? We set the box there and worked backward from that goal into what the architecture had to look like. Where we landed was driving a lot of this from Google's Omega design. What's unique about Omega is that it's optimistic in its concurrency, and what that really means is that if you submit 50 jobs to the system at one time, it will not try to schedule them sequentially — job one, then job two, then job three — it will try all 50 in parallel. So with Nomad, as you add more servers, as you add more CPU cores, it will just do more work in parallel, and this is really important to hit some of those design targets — particularly if you're looking beyond service workloads to batch, where you can get very high rates of concurrency because it's not a developer issuing a scale-up or scale-down, it's automated: it's an application saying "hey, I need to spin up a hundred thousand machines" and doing it in a programmatic way. Lastly, the pluggable architecture of Omega meant it was possible for us to fit these different types of workloads — service, batch, periodic, system — as plug-in schedulers to meet different needs.

So we wanted to say: great, these are the design goals, this is what drove our architecture — let's put our money where our mouth is. Does the system actually do what we promised it should do? This is the million container challenge that we put together for ourselves, loosely based on the C10K challenge of the early 2000s: can it actually run a million containers? What we did was launch a thousand jobs, each with a thousand tasks — a task being an app to run, in this case Redis — for a total of a million containers running Redis. We launched all of this on GCP with 5,000 hosts, and what we saw was that Nomad does what it promises. That's what this graph is showing, with three different lines. The topmost line is Nomad actually picking where to run something — doing the initial scheduling and placement. The second line shows the time lag to notifying the client that, by the way, you should be running Redis. And the third line shows the client actually acknowledging and signalling that Redis is now running. So there's a necessary trailing edge: schedule first, notify the client, and then it finishes running. You can see there's basically a linear line from zero to a million scheduled instances in under five minutes, and slightly after that, in under six minutes, all million instances of Redis are running. So I think we can confidently say we've run the world's largest Redis cluster.

The instant reaction to this from a lot of folks was: why would you ever need that?
Why would you ever need something as obscene as a million containers running? No one has that workload, right? To us this instantly triggered that reaction of "640 KB should be enough for anyone" — what are you doing that you need more than this much memory? And what perfectly highlighted it for us was that literally days after we published our C1M challenge, we got a note from the Citadel group — for those who are unfamiliar, I believe the world's second largest hedge fund, with something like a hundred eighty billion dollars under management. They came to us and said, that's really cute, we like your C1M; we're curious whether we can do a workload forty times larger than that. And we were like, well, we don't know — we thought this was the obscene scale where, once we could do this, no one would ever come to us and ask "will it do X?" And we were wrong. So they came to us and said, great, we want to battle-test this at the next scale, and what they did was, basically, for five hours across 18,000 cores, continuously schedule work to see how far they could push it. They were able to sustain 2,200 containers a second at that scale. So where we thought we were pushing the limits, we didn't even realize we were one step behind what was already happening in industry.

This brings us to what I really wanted to talk about, which is: what are those next-generation patterns you can start to enable when you have a system that can do this — when you have more than 640 kilobytes of memory, what can you actually do with the computer? Where I think this starts is what I refer to as the service-oriented architecture spectrum. When we talk about services these days we hear microservices, we hear monoliths, we hear functions-as-a-service, and to me these are all just points on a continuum. On one extreme end of the spectrum is one application with 10 million lines of code; on the other extreme end is a million applications, each with 10 lines of code. And there's a fundamental complexity here: if you're an app that's running a Fortune 500 bank, you're going to have the 10 million lines — you just get to pick where they live. You can slice it however you want, but you're never going to boil it down to 50 lines. There's a utility curve on top of this: there are certain pain points you start facing at the extremes, and somewhere in the middle tends to be healthy. The reason is that there are two different trade-offs you have to make. On one side, monoliths have this enormous application complexity: each of the subsystems interacts with the others in a complex way, you have these huge class hierarchies, it's hard to really understand what's going on, and changing something on one side weirdly breaks the other side of the application — so you have this high level of inherent app complexity. On the other end of the spectrum, with the million microservices, you have a very high level of operational complexity: now your operations really has to be very sophisticated in how it deals with that volume of things. The more scale you inherit, the more things just break — there are simply more moving gears. What lets us escape this, on both sides, is abstraction. Abstraction allows us to
hide some of the details, hide some of the complexity, and reach higher up on the scale — it lets us build bigger applications that can do more interesting things by abstracting the lower-level details. The analogy we see is what allowed those monoliths to achieve their huge code bases: frameworks. Frameworks decompose them — great, you're going to have a separation of data management, of controllers, of rendering, of these different subsystems that shard out their work and compose together into a larger application. The framework gives us the skeletal structure that lets us layer on bits of functionality. Schedulers are the same thing, but for services: they give us that framework to say, great, here you're going to add an app that does an image resize, here we're going to add an app that does video transcode, and we're going to compose all of these individual services together into a larger application — and we're going to retain our sanity by using schedulers to abstract some of the underlying details. The key really is: can we abstract enough detail? Not all of it — we still need to think about the fact that our app runs on hardware, and there are certain implications for what we should and shouldn't do, and performance trade-offs — but can we abstract enough that we can really focus on service composition, as opposed to things like deployment, provisioning, and capacity optimization?

When we start doing this, the first pattern we run into is the sidecar pattern. This has been emerging for a few years — Netflix has been pretty vocal about some of the work they do here — and the idea is that you run an application alongside the main process, your main API server, and offload some of its concerns to these side processes. This gets represented in different schedulers in different ways: Nomad has an abstraction it calls a task group — a group of tasks, as you might expect; Borg refers to this as an alloc; Kubernetes refers to this as a pod. But what we're all really talking about is a collection of apps, where one of them is probably the main app or the leader app and the other ones are co-processes. The kind of thing you start seeing is different applications offloading different chunks of their functionality. Instead of trying to natively build in rich logging libraries, we say: actually, let's just build a common logging sidecar that doesn't care what the programming language or the framework is, and run it alongside the app. Now we don't need the Java logging library and the Go logging library and so on — you build one, in any language, and compose it at the app level. You see the same with other things like routing proxies. In practice there's a whole host of interesting sidecars emerging. Some of them are around things like configuration: maybe we don't want our app to be aware of our custom configuration store, or CMDB, or even Consul, so we abstract things like Consul with consul-template. You might abstract your observability stack — things like logging and telemetry — and hide those behind sidecars that are aware of the API. In fact, Datadog ships exactly like this: they give you a Datadog agent, basically a sidecar, that runs on every machine.
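In Nomad terms, that co-scheduling shows up as multiple tasks in one task group. A hedged sketch, with made-up image names, might look like this — the point is just that the two tasks always land on the same node, so the app can talk to the log shipper over localhost:

```hcl
# A sketch of the sidecar pattern as a Nomad task group: the main API task
# and a logging sidecar are scheduled together on the same client node.
job "api" {
  datacenters = ["us-east-1"]

  group "api" {
    count = 3

    # main application task
    task "server" {
      driver = "docker"
      config {
        image = "example/api-server:1.4"
      }
      resources {
        cpu    = 1000
        memory = 512
      }
    }

    # co-located sidecar that ships logs; the app just writes to localhost
    task "log-shipper" {
      driver = "docker"
      config {
        image = "example/log-shipper:0.3"
      }
      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```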
Then you get into some of the more sophisticated sidecars around load balancing and service meshes, which start to get more intimately involved with the execution of the app. Some take the more traditional load-balancing approach; some are now looking at layer seven — can you intercept on a per-request basis, as opposed to a per-connection basis, but keep it outside of the application? What's nice about the sidecar approach is that oftentimes it doesn't require changing our app; it's transparent. It's just a configuration detail: instead of reaching out over the network, talk to localhost and your logging endpoint is there. So this is nice, but it doesn't necessarily enable us to do more dynamic things. It lets us break down some of the ways we compose our functionality within one machine or one application, but it doesn't let us get richer or more dynamic than that, and the reason is that we're bounded by the abstraction of a process — there's nothing richer there that lets us say "I want to launch something over the network."

This is where API awareness of schedulers comes in and starts changing things, because if we can talk to the scheduler — if we can leverage its API in a dynamic way — then we can start doing richer, more interesting things with it. These are some of the patterns I want to touch on briefly. The first is the traditional queue example, and we see this all the time: a publisher/consumer paradigm where on one side we have a set of producers — web servers, API servers — emitting events like "user signed up" or "user logged in", and maybe we're doing things like sending them an email or a 2FA code, and it's running in the background. So there's a set of consumers that pull those events off and process them, and once they're done, they throw the event away and move on. In some sense what those workers are doing is batch work: it's an activity that runs to completion and then it's over; it's ephemeral in nature. The challenge in the traditional model is that we have to provision these workers in advance — the queue doesn't know how to start workers, it's just a queue — so if there are no workers running, nothing will dequeue and actually run anything. Then when we start provisioning, we say, well, we want more than one instance, because what if it goes down? We don't want messages to be stuck in the queue with no one consuming them while work just piles up. So typically we provision way more than we need: either we need it at a very low rate and now we have at least two for HA, or we actually need five but we're running eight. There's this over-provisioning that leads to idle, underutilized resources.

This is where Nomad's dispatch comes in — Mitchell talked about it a little bit this morning. The idea is: can we actually just start the consumer on demand? Instead of having these consumers always sitting there idle, acting like an online service even though they're doing batch work, can we just treat it like batch work? When an event actually comes in, launch the consumer, process the task to completion, and then exit and return the resources to the cluster. We get the same queue abstraction; our publishers are still shielded from the details — they don't care about how the event is being processed, they just want a guarantee that it does, at some point, get processed.
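Before walking through the workflow, here's a hedged sketch of what such a dispatch job — the parameterized job described next — can look like. The worker image, payload file name, and resource numbers are illustrative.

```hcl
# A sketch of a parameterized (dispatch) job: it isn't run when registered;
# each dispatch supplies a payload and launches a fresh consumer.
job "my-job" {
  datacenters = ["us-east-1"]
  type        = "batch"

  parameterized {
    payload = "required"          # every dispatch must include an input payload
  }

  group "worker" {
    task "consume" {
      driver = "docker"

      config {
        image = "example/my-worker:latest"   # illustrative worker image
      }

      dispatch_payload {
        file = "input.json"       # the payload is written here for the task to read
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```

Once registered, each invocation is something like `nomad job dispatch my-job input.json`, or the equivalent HTTP API call, and Nomad creates a new instance of the job with that payload.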
And Nomad can now act as the queue for us: if we submit more work than there's actual capacity for in the whole cluster, Nomad will just queue it. It does this with batch work just as it does with service work — it's not really different; it still wants to provide you that promise that the work will eventually get done. But now we avoid underutilization, because we don't have to launch n-plus-one instances for availability and we don't have to launch them in advance. When the work shows up, great; if the machine running that worker dies, Nomad will reschedule it. We don't need an n-plus-one story — if it fails, we deal with it at that time instead of paying the cost continuously.

What this looks like is actually a pretty small modification to how we define and use jobs. Take a normal job and add the required delta to make it a parameterized dispatch job — essentially the parameterized block in the sketch above. The moment we add that block, Nomad knows this thing needs some parameters, so it is now a dispatch job, as opposed to a job you want to run at that specific moment. Here we're just saying we need an input every time this job is invoked: start a Docker container with the my-worker image and feed the input file through to it. That could send an email, it could transcode a file, it could resize an avatar — it could do whatever; it's a function of the service. From a workflow perspective, the owner of that service writes the Nomad file and registers it: here's my dispatch job, call it my-job. Now the web server shows up and says, please invoke my-job, here are its inputs. At that point Nomad takes care of launching the consumer and scheduling the work to be done, and as we dispatch more and more things, there will just be more and more workers running — we'll use the appropriate amount of capacity.

So this starts to look a lot like functions-as-a-service, an AWS Lambda sort of model. The difference is really how small the granularity is. If we're doing relatively low-volume work where latency isn't of critical concern, Nomad dispatch is well-suited. As we get into very high-volume, latency-sensitive work, the setup overhead becomes prohibitive. It's sort of like asking: do you launch a web server for every single request, or do you amortize that launch cost by launching it once and letting it service many requests over its lifetime? It's a similar concern here: if it's a batch-processing task that only takes a few milliseconds, the overhead of actually launching it might be two orders of magnitude more than the actual work being done. So how do we solve these fast, serverless-type patterns as they emerge? There are a few things we have to do. One is to do more than one piece of work per execution — process multiple events. We still want to dynamically scale our workers to avoid that over-provisioning, and we want some level of queueing, so that if there's a burst of messages we don't drop them on the floor.

What this can look like with a system like Nomad is still very similar: we still register a dispatch job against our service, but now we add an intermediate controller, and the controller has a few functions. One of them is to receive the actual event, so the web server fires it to the
controller instead of dispatching it directly. Now the controller can choose: do I have enough capacity, or do I need to dispatch more? This gets at that dynamic scaling question. In the case where we get an event and there's no worker running, that's a good opportunity to go ahead and schedule the first worker; this worker gets started, and it pulls the job from the controller, which is queueing it to prevent us from dropping messages when there's no capacity. As the message queue grows deeper — because we've exhausted the ability of one worker to keep up with the events — the controller can make the decision to continue to scale up. So as these queues grow, we make this adaptive and just dispatch more instances of the worker as we need them, to accommodate the growing queue, drain it back down, and right-size the workers.

Another logical fit for this dynamic consumption of resources is big data processing. These are applications that tend to be very large-scale consumers of compute, and they have complex graphs — MapReduce is maybe the simplest example, but usually you have a chain of different jobs that are processing and feeding into future jobs. At each of these phases there are different widths to the pipeline: your map phase might be very wide, your reduce phase very narrow, then another map phase that's very wide, and so on. What we'd like to be able to do is programmatically set up and tear down the workers to right-size each phase, so we're not just using the maximum amount of compute and having it sit idle most of the time. One direction we're taking this is a native Spark integration, which Mitchell touched on. What this looks like is totally native to the user: the user submits their job just like normal to Spark, Spark then goes and talks to Nomad to figure out, okay, I need 50 executors or 500 executors, and Nomad will dynamically provide those to Spark and scale them up and down as appropriate for each job and phase.

What we've seen in practice is the value of merging these things, and this is particularly something Google has spent a lot of time studying: why not have separate big data clusters and separate service clusters? It's a very different workload pattern, so why not deal with them as separate issues entirely? What they found in analyzing their different cells — their different datacenters — was that the overhead of doing that would vary; you can see the CDF on the right, from 0% in very rare cases all the way up to a 60% overhead. So they would need 50 to 60 percent additional hardware footprint if they did the segregation, and this adds up to a pretty large sum of hardware for most users. Of course Google may not be representative of all use cases, but you get the idea: across these different workloads there's a spectrum of overhead, and it's almost always more than zero.

What becomes interesting, as we start talking about the scheduler API enabling this dynamic behavior, is that this API starts to blur the line between what's an application and what's the underlying infrastructure. When we think about something like Spark — yes, it's a user-facing application, but it's intimately connected to the infrastructure and how it consumes it. So it starts blurring
that boundary. As we talk about things like serverless controllers or dispatch, there's a fine line between what's an app concern and what's an infrastructure concern, and I think you're going to enable richer functionality by blurring that line. Our goal is really to look at this and ask: how do we enable that dynamic behavior by providing the right contract, the right API, while still providing all the other promises of Nomad — higher resource utilization, better quality of service, a decoupling of concerns? How do you still get all of that while leveraging the scheduler in this very visible way from the application?

So to sum it up: what Nomad is really focused on is being a great cluster scheduler — just focusing on that task of mapping applications onto underlying servers; looking at our developer audience and asking how we make it easy and pleasant for them to use, and flexible enough to handle all their applications; making our operators' lives simple by really thinking about how the architecture affects the way they have to deal with the system; and building for scale, particularly to enable these next-generation patterns — if we give you more than 640 kilobytes, what cool things can you actually build on top of it? Hopefully this has seeded some ideas about the kinds of things we're starting to see and starting to talk about. Thank you guys so much.

[Applause]
Info
Channel: HashiCorp
Views: 7,111
Id: S0TxyUx8suk
Length: 39min 16sec (2356 seconds)
Published: Mon Jun 26 2017