Mitchell Hashimoto - Introducing Nomad and Otto

Captions
I'm Mitchell, and this is Armon. We're the two founders of HashiCorp, and between us we've built pretty much everything I'm going to show you today. We just got back from HashiConf, which was last week (that's why we have these jackets), and we announced a couple of new things there, so DigitalOcean asked us, since we were going to be in town, to come talk about them. We're going to spend most of the time on those two things, and since this isn't a keynote environment we get to go into a lot more technical detail about how things work, so we've added a lot more detail and removed the keynote-y stuff from the presentation.

To get started, I want to give a quick overview of what we've done: the one-or-two-sentence description of everything we've built, in the order we built it, and why, leading up to the final two things we're talking about today. That should help explain how they fit in and the purpose they serve. We've built nine things, listed here, over the past six years, and these two are the ones we just announced. Our goal from the beginning of starting this company, which was three years ago, was to make going from development to production easy. And by production we don't mean a Heroku-style, more limited approach (although Heroku is getting more flexible every day); we wanted to help very large users get to production too. We wanted tools that would scale from a hobbyist to a Fortune 500.

We started with Vagrant, which is the one that's six years old now. I think most people here know what it is: the development environment tool. That's where we started from, and we knew it was the starting point we had to build out from. The funny thing we laugh about today is that when we started the company and analysts started analyzing us, they categorized us as a dev tool company, an IDE company. It was funny to us then because we knew what we wanted to do but couldn't really talk about it, and it's funny now because it's really far from the truth. So we were an "IDE company" according to analysts, and we started to push forward. The first thing we had to do was change that perception: get people comfortable running our tools in an operator environment, then in data centers, and eventually in the path of downtime. If we had come out with something like Consul as just "the Vagrant developers," we would have gotten a lot of skepticism about our ability to ship something that could technically cause downtime. We had to earn trust along that path, so we went in a very calculated order.

The next thing we came out with was Packer, the tool for creating immutable images. When it came out it could only build machine images: AMIs, VirtualBox images, things like that. Nowadays it can create containers, and in the most recent version, application-level images, meaning jars and Heroku slugs and things like that, at a finer granularity than a container. When it came out, people saw it and said "okay, the Vagrant people created a tool for creating Vagrant boxes," and they're not wrong; that's how we purposely advertised it. But there was a much deeper plan: this would be our tool for creating immutable images going forward, and it has been.

Next we came out with Serf and Consul, within about six months of each other. If you look at Vagrant, it's a developer tool; if you look at Packer, it's kind of an operator tool; and then we came out with Serf, and I think that's when people got really confused, because they suddenly couldn't anchor it to Vagrant in any way. There just isn't any good reason for it to exist if you view us as an IDE company. What we were trying to get to with Serf was service discovery. In this immutable world you have things coming up and down really quickly, and because they're immutable there's no runtime configuration, or at least it's not easy to do runtime configuration, so we needed a tool to find these things: how does the web server find the database? Before, you would use something like Chef, but given that you're running Chef at a compile step rather than on demand, it's harder to do that sort of thing, unless you want to run Chef twice, which is not what we wanted to advocate. So we came out with Serf, which then became the building block of Consul, which you just heard about. These are both very firmly runtime tools: production, running in your data center, could-cause-downtime sorts of things.

Consul's adoption went really well. We didn't plan it, but we came out at a great time, right as Docker was taking off and microservices were taking off, and suddenly a lot of people had the problems Consul solved. So we got lucky there, and that caused it to get adopted really quickly. Armon worked really hard to make it very stable; to date we've never had a single data-loss incident with Consul, and that, even at a 0.1, greatly increased people's confidence in it. This was great for us because it solidified us as being able to make stable, production-level tools. We were still being called "the Vagrant people who made Consul" on Twitter and such, but it was getting less and less, so things were looking good.

Next was Terraform, which was last year. A funny tidbit: we finished and polished it up in New York about a year ago, so we were here, and the DigitalOcean people were actually the first people ever to see Terraform. Terraform is back toward operator tools in a way: it's our tool for codifying and building complete data centers with code. Chef and Puppet and friends try to do something like this, but due to their model of running on the machine they manage, being the agent of change on whatever they're running on, it just doesn't work right with this cloud-service-provider, control-plane model. They've tried, and they keep trying, but it just doesn't feel right. So we came out with Terraform, which is a way to describe your whole data center, all your regions, in text files and then execute them. Most recently we used Terraform to load-test Nomad on DigitalOcean, and we were able to spin up thousands of droplets in parallel really quickly; that's the sort of thing it's good at. I'll explain how all of these tie together a little later. At this point, looking at news and social media, we were pretty firmly becoming "a DevOps company." The perception was shifting, and things were looking really good.

Then this year we came out with Vault. Vault stores secrets; it's our security solution. When it first came out it was just a secret-management tool to store secrets: passwords, tokens. Since then it's evolved quite a lot more: it's now a fully self-run CA, a PKI system, so it can be your CA, generate certs, and hand them out. It's all security related; it's the tool that's meant to secure everything else. It came out earlier this year and became, at the time, our fastest-growing tool ever in terms of vanity metrics: within the first month it had 3,000 stars. It's been out six months, and while I'm not able to talk specifics, I like to say that if you've used non-cash forms of money or traded stock recently, you've touched this thing. That's really crazy for a tool that's six months old, and security is a very scary thing: it not only causes downtime, it causes brand damage if it gets messed up. So the amount of trust we've been able to earn to get this thing out is great. It's audited now, so it's been verified secure by other people. But who knows; hopefully. It's doing really well.

Then this week, or seven days ago, we announced two things. The first is Nomad, which is what we're going to go into detail about today. Nomad is our application scheduler: it schedules Docker containers, it schedules VMs, and other things; we'll talk about that. This was critical to split immutability into two layers: machine level and application level. If you only have machine-level immutability, deploys kind of suck: you have to build a whole image for every single code change, which takes minutes, and you have to orchestrate starting new machines and stopping the old ones. It's just not a great experience, although it's workable. So we wanted to split out an application level of immutability, so that you build the machine image relatively infrequently and then deploy the application one very frequently, maybe a dozen times per day, where the deploy is now on the order of milliseconds rather than minutes. The last thing we came out with is Otto, and Otto is quite a bit different from the rest, so I'll hold off on talking about it.

This is the one diagram we like to show of how things fit together, in a loose way. The idea is you start in Vagrant, your development environment. You then use Packer to build the machine image it runs on, or the application-level image that will get deployed; we use Packer to build AMIs or droplet base images as well as Docker containers, jars, things like that. From there, it branches depending on what type of image you made. If you made a machine image, it goes to Terraform: Terraform launches new machines, gets rid of the old ones, and tries to do it with minimal downtime; it has various tools for that. If you built an application-level thing, like a container, it goes to Nomad, which schedules it onto your hopefully-Terraform-managed cluster. The arrow between Nomad and Terraform is there because Nomad knows things like "CPU pressure is really high on this machine" or "I can't schedule new jobs because there's not enough capacity; I need more capacity, please tell Terraform," and Terraform knows how to spin up more capacity, so it can scale your cluster. That isn't built into Nomad and Terraform automatically right now, but it's where we're heading.

Both of these tie into Consul. Terraform spins up nodes that register with Consul, so we can see every Nomad agent, or anything else, in Consul. Nomad schedules application-level jobs that also register in Consul. So using the data from Terraform you can ask "where are the Nomad agents," and using the data from Nomad you can ask "where is my application"; you get both in Consul. Floating out here is Vault. Vault is used by everything, and drawing arrows to everything doesn't work. Vagrant is probably the least likely one to use Vault, though you could get some credentials from it. Packer uses Vault pretty heavily to get AWS credentials and DigitalOcean credentials, and Terraform uses it heavily for the same purpose. Nomad doesn't integrate yet, but eventually it will, in a much different way: to give each application an identity, a verified identity. If you trust Vault, and Nomad tells Vault what this application is and that it can talk to the API, then you have trust in your network. So Vault will do a lot more there. And Vault's secret storage is backed by Consul; its data store can go into Consul, as well as ZooKeeper and other things. So that's how it all fits together.

Then you might notice, in the little corner, Otto, the little robot. We'll talk about what that is, but basically, if you're trying to build a development-to-production workflow, learning five, no, six tools is pretty complicated. Otto is really the meta-tool we've tried to build to simplify this whole thing and manage it for you; we'll talk about how we do that later. And the last one, which isn't in this diagram, is Atlas, because it's our commercial product; it actually gives you this whole thing. How it relates to Otto: Otto is to Atlas what git is to GitHub. The idea there is workflow, not technology; obviously Otto is not a version-control system. Otto is your open-source workflow to develop, deploy, and get everything done from just a CLI, and you use Atlas to do collaboration, to get an interface, security, things like that. It's very much like the relationship between git and GitHub. So that's all the tools we've built so far, and next I want to pass it off to Armon, who's going to talk about Nomad: why we built it, what it can do, and its architecture.

So, Nomad. What is it? As Mitchell said, it's an application scheduler. What does that really mean, for those of you unfamiliar with the space? Basically, it's a single system that pools the resources of your whole data center: whether you have ten machines, a hundred machines, or a thousand machines, it makes them look like a single very large machine. Think of it almost like the CPU scheduler in your operating system: you don't really think "my process is running on this core"; it just spreads things around and gives you that abstraction. Nomad does the same sort of thing, and it gives you an API above that to push jobs into. You push jobs into its single API and it manages placing them all over your very large fleet of machines.

The goal of doing this isn't just to add complexity; it's to be able to easily deploy applications. You no longer have to think about which machine something runs on. If it's a Linux-based job, am I accidentally running it on a Windows machine, or vice versa? Does that machine have enough memory and CPU and disk to actually run my thing? You declare all of your requirements, and the enforcement and the logic of making sure none of them are violated live within Nomad. As a developer it frees you from all of this: you just say what you want to have happen, and you don't really have to care how it happens. The way to make this possible is the job specification: you specify to Nomad what you want to run and what the shape of that thing is.

So what does that look like? We use what we call the job specification. This is a really simple example; actually, it's not that simple, it's a fully runnable, real-world example that we'll come back to in a second to explain what the specifics mean. What I really want to show is that it's very high-level, human-readable, easy to read, easy to write; you can throw comments in there. It's not a very low-level specification; it's meant to be read and written by developers on a day-to-day basis. The idea behind it, like I said, is to allow a developer to just declare what they want to have run, and then all the details (which machine it runs on, how do I run it there, how do I monitor it, how do I service-discover it, how do I connect it to Vault and have security around it) are left to the scheduler. You don't have to think about them as a developer; you declare what you need to run, and the where and how are left to the system itself.

One of the reasons we took this approach, if you've ever played with something like Terraform, is that we're really huge fans of this kind of declarative language. The power behind it is that we can make these very simple, high-level specifications where all of the complexity of how it's implemented can be hidden: the complexity of, okay, you changed the definition of your application, how do I do a rolling deploy now? If it were very imperative code, where you said "first apt-get install this thing, then apt-get install that thing, then do a rolling upgrade," it would be very hard for the system to tease apart the how of something versus the intention of what you're trying to do. By giving us just the intention, we can do a lot of very interesting, powerful things around the how.

So going back to that exact same job file, I want to decompose the different bits of it and show what I mean by hiding a lot of power there. The first thing is defining the job itself: here we're just naming a Redis job. A job has a name to logically group it and talk about it with the API; it's an arbitrary name. The second line is where we specify the data centers we want to run in; here we're saying us-east-1. That looks like a really simple line, but underneath there's a full multi-region, multi-data-center model. If I wanted my job to span two data centers, I could just throw us-west-1 in there, or if I wanted to span AWS and DigitalOcean, I could say us-east-1, SFO1, NYC3, Amsterdam, whatever. Now it's the scheduler's problem to figure out how to do a multi-cloud, multi-data-center deployment. As a developer I don't care; I just put some strings in there and it's the scheduler's problem. It's a single, very readable line, and there's a lot of power in making that actually possible.

Then we move down a little and get to the actual task. The task is the real unit of work in Nomad; you can think of it as a one-to-one mapping between a task and your application. A task really has two things: one is the application to actually run, how to run it, and the other is the set of resources that app is going to consume. Every task has to have both. Here we're defining a pretty simple set of resources: give me about half a CPU, 200 megs of RAM, 10 megabits of network, and, by the way, dynamically assign a port to me. We're not asking for port 80 or, since it's Redis, 6379; we're saying: of the machine's 65,000-odd ports, pick one that's free and reserve it just for my application. The other side is specifying what we actually want to run: in this case the redis:latest image, using the Docker driver. The driver piece again looks very simple, but underneath it there's a lot of interesting functionality, which we'll talk about. Similarly with the resources: the point isn't just to document how many resources we need. Nomad will actually reserve those resources and find a machine that has enough space for us. It's not going to place this on a machine that doesn't have enough memory or CPU; it guarantees you actually have enough resources to run your application, and it reserves them so another application can't come and take them.

Then, going into the drivers, there are a lot of interesting features. Here we just said "run this Docker image for me," but underneath, Nomad has the flexibility to run multiple drivers, and the idea is that we want to support any containerized, virtualized, or standalone application workload. So there it was: okay, I have a Dockerized application, my organization already packages everything that way; great, you just specify the type as docker and let the Docker driver spin it up.
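The slides themselves aren't captured in this transcript, but piecing together the fields Armon walks through, the job file being decomposed would look roughly like this. This is a hedged sketch in HCL: the overall shape matches what's described (job name, datacenter list, task, reserved resources, a dynamic port, the Docker driver, and an arbitrary constraint of the kind mentioned later in the talk), but exact key names varied across early Nomad releases, so treat it as illustrative rather than the literal 0.1 syntax:

```hcl
# Sketch of the Redis job file decomposed in the talk.
job "redis" {
  # Where it may run; adding more strings here is the whole
  # multi-datacenter / multi-cloud story from the developer's side.
  datacenters = ["us-east-1"]

  # An arbitrary constraint: only place this on Linux nodes.
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }

  group "cache" {
    task "redis" {
      # How to run it: the Docker driver pulls and runs the image.
      driver = "docker"
      config {
        image = "redis:latest"
      }

      # What it consumes: Nomad reserves these on the chosen node
      # rather than merely documenting them.
      resources {
        cpu    = 500 # MHz, roughly half a core
        memory = 200 # MB
        network {
          mbits = 10
          port "db" {} # dynamic: any free port on the node, reserved for this task
        }
      }
    }
  }
}
```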
There are a lot of applications, Vault for example, that still benefit from hardware virtualization. Container security has improved pretty dramatically in the last year or two, but the hardware-level isolation you get from virtual memory and hardware-level protection is still stronger. So if you're running a security-sensitive application, you probably still want to run it inside a VM. Or suppose you're a public cloud that runs VMs as a service: they talked about it only briefly before, but DigitalOcean in some sense has a scheduler as well. You can imagine providing an internal cloud to your company, where you're buying bare metal but exposing VMs. The scheduler doesn't care; you're just asking for another slice, another allocation, of a machine, so you can express that as a virtualized workload.

The last set of interesting workloads are standalone applications. For a lot of things, Java for example, when you compile you end up with a jar file that has all the class files, all the assets, everything you need to run your application on top of the JVM. For something like a Java jar there's very little benefit to re-containerizing: Java has already containerized everything into a jar, and you'd just be re-wrapping it in another layer. In Nomad's world you don't need to do that; you don't need to express it as a Docker container. You can say: I have this jar file, find a machine that has Java installed and go run it there. I already containerized it, in the form of a jar. Another example is how Google runs their data centers, which is to compile everything as a static binary. If I take my C++ application, or really any application, and compile it into a single enormous static binary, then instead of a two-meg hello world I bring in libc and libxml and everything I need and end up with a 50-megabyte binary. I don't depend on anything from the machine anymore except the syscall layer; everything else is already compiled in. Again, there's no benefit to re-wrapping it and bringing the operating system along. So with Nomad you can say: here's my static binary, find a machine that can run this thing; there's no need to re-wrap it. If you're moving toward Docker and that's what you've standardized on, great. If you're somewhere else on that spectrum and it doesn't make sense for you, the goal for Nomad is to have the flexibility to run whatever workload you have.

Those were the initial set we wanted to launch with, so 0.1 shipped with Docker, QEMU, Java, and static binaries. But there's a whole host of things we want to bring in: in the FreeBSD world there's Jetpack, which is like their version of Docker; Windows is bringing Windows Server Containers; in the virtualized world Xen is still huge and Hyper-V on Windows is huge; and things like C# can also compile down to a single self-contained executable that depends only on the CLR, so as long as the CLR is there, you're good to go.

So the initial set of things we wanted to target was really making this application deployment process as easy as possible for developers, and along the way we wanted to make sure we support Docker as a first-class citizen of this world, which makes it really easy for organizations that are already Dockerizing things to now schedule at enormous scale.
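None of the non-Docker variants appear in the captions, but the point is that only the driver stanza changes. A hedged sketch: the java driver's config keys below (jar_path, jvm_options) match later Nomad documentation and are assumptions for the 0.1-era syntax, and the exec example presumes a statically linked binary already present on the node:

```hcl
# Run a jar directly on a node that advertises Java -- no container wrapper.
task "billing" {
  driver = "java"
  config {
    jar_path    = "local/billing.jar" # assumed key name; illustrative path
    jvm_options = ["-Xmx256m"]
  }
  resources {
    cpu    = 500
    memory = 256
  }
}

# Or a Google-style static binary via the exec driver.
task "indexer" {
  driver = "exec"
  config {
    command = "/usr/local/bin/indexer" # illustrative path
  }
  resources {
    cpu    = 200
    memory = 128
  }
}
```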
Multi-datacenter and multi-region was a huge requirement for us, just given the kinds of customers we talk to. Multi-datacenter isn't some future scenario for them; it's the default scenario. And when you start talking about disaster recovery, multi-region or multi-cloud-provider is also just a practical reality for a lot of these folks, so it had to be there out of the box. Flexible workloads, again: we work with a lot of people who have tens of thousands of applications, and it's not practical for them to Dockerize everything or do that kind of integration. They have to be able to say, "this set of applications is virtualized and it's going to stay virtualized for the next five years," so how do we airlift those things into the scheduler without forcing them to retool everything? And bin packing you just get for free. It's the thing we don't even talk about, yet it's one of the biggest strengths of a scheduler: maximizing density and resource utilization. On average, most servers (it depends on the company) sit somewhere between 5 and 20 percent utilization, which, flipped around, means somewhere between 80 and 95 percent of your compute is wasted. Because we specify the exact amount of resources we need on a per-task basis, the goal with Nomad is, maybe not to get you to a hundred percent, but at least to 80 or 90 percent utilization. So instead of talking about 95 percent of resources being wasted, we're talking about 10 percent. In some large-scale organizations that's tens of thousands of machines that can actually be shed and no longer need to be managed and dealt with.

And really, all of these features, although each one sounds pretty complex, we saw what they actually looked like in the HCL specification, and it fit on a page. That one spec did Docker, it did multi-datacenter, it supported multiple workloads, it was doing bin packing, and it was like ten lines of code. By having this very high-level specification, we're not sacrificing much in terms of complexity for developers.

When we thought about Nomad, and really when we think about any one of our tools, our design approach at HashiCorp is to look at what's in the market, what tools or solutions exist for a problem, and try to 10x that in every category we can. When we looked at the scheduler space, the application delivery space, there were three distinct categories we saw we could improve on. One was ease of use for developers: a lot of these tools are pretty challenging to learn, pretty challenging to use; the tooling is just not intuitive. The second is the operations side: some of these tools are simple for developers but really challenging for the actual operators to run at scale, in an HA environment, where they care about security, across regions; the operational challenges are extreme for some of them. The last one is building for scale: it had to support tens of thousands of machines, because it's cute if we can solve it at the ten-node or hundred-node scale, but the extreme challenges in application delivery exist at the 10,000-node scale. So how do we tackle that? I think we've done a pretty good job, and I'll talk about how we tried to get there with each of these.

The ease of use for developers really comes from that job specification we talked about. That simple file is pretty much the way developers interact with the system: they edit job files, they read job files. But that's not quite the end of their interaction; there are times they need to actually interface with the system. Typically, in testing and development, you have to submit the job to check: does it have a syntax error, does it work, will it schedule at all, am I going to get an error back from the scheduler? So one thing we wanted to optimize for was a special dev mode. Here I've just added a flag, `nomad agent -dev`, and this automatically spins up a scheduler just for development. It doesn't persist any state, and it goes from hitting enter to fully running in about a second. And it's a full scheduler: you have the server-side component, you have the client-side component where you can actually schedule and run work, and the full Nomad API. So a developer can use the CLI, submit jobs, and develop against the tool, and if you're building higher-level interfaces and tooling that consume the API, there you go, you have the full API. It's incredibly easy to spin up; it's not a game of "how do I hack this thing into a development mode."

But it would be no good if it's easy for developers, just one flag, and then a super nightmare when you actually want to operationalize it. This is a tool designed for production, the same way Consul is, the same way Serf is, the same way Vault is, and we wanted it to be as operationally simple as we've made Consul. So, a similar story to Consul: we ship only a single binary. Whether it's the clients that actually run the tasks, of which you'll have tens of thousands, or the servers that are the control plane of the system, of which you'll only have three, five, or seven, it's just one binary; you're only changing the flags to the agent. The developer might be running `-dev`, where it acts as both a client and a server; in production you'd very likely never configure it that way, and instead say: I have my three servers, I have my thousand clients. But it's one binary, same configuration syntax and everything. So on the operations side (we'll come back a little to the architecture), it's a single binary with simple configuration, and it's really easy to get going.
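As a sketch of that one-binary, two-roles idea: Nomad agents are configured in the same HCL format, and stanzas along these lines select the role. The region/datacenter split is explained below; the addresses and exact defaults here are assumptions, not verbatim docs:

```hcl
# server.hcl -- one of the three/five/seven control-plane nodes.
region     = "us"
datacenter = "nyc3"

server {
  enabled          = true
  bootstrap_expect = 3 # wait for three servers before electing a leader
}
```

```hcl
# client.hcl -- one of potentially tens of thousands of workers,
# started from the very same binary.
region     = "us"
datacenter = "sfo1"

client {
  enabled = true
  servers = ["nomad-server-1:4647"] # hypothetical server address
}
```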
The building-for-scale part was more interesting. In some sense we had the advantage going in that we have a lot of experience building very large-scale distributed systems at this point. We started with Serf; it was our first production-oriented tool. It's a completely decentralized, peer-to-peer gossip system designed to operate at massive scale; it powers the San Diego Supercomputing Center. And you learn a lot of interesting things building something like that, bizarre things like ARP storms: you only really find out what tens of thousands of machines trying to ARP with one another does to your network routers when it happens. You learn a lot by operating Serf at large scale and figuring out what weird edge cases you run into as you scale into the tens of thousands of machines, and as it has run at ever larger scale those fixes have been incorporated, so that codebase has been maturing for two or three years now.

On the flip side we have Consul. Consul is kind of a weird hybrid: it embeds Serf, so it brings in the decentralized peer-to-peer gossip side of things (and still gets to torture-test that at scale), but it also has a centralized consensus algorithm. It's built on Raft, which is a derivative of Paxos, and those systems are notoriously hard to get right. Strong consistency is one of those things that's really, really nice as an abstraction for a developer and really, really difficult to actually provide as an abstraction, because we're operating in a distributed world, under failures, in cloud environments where networks come and go and machines come and go. Having these libraries inside Consul has really allowed us to refine them, in terms of both stability and performance. In its short lifespan we've never had a data-corruption problem, never had a violation of our strong consistency, but we have learned a lot about edge cases, error messages, the things you might run into, and refined that over time. So going into Nomad we got to start with these two as our building blocks: a very mature consensus library and a very mature peer-to-peer gossip layer. Those are awesome building blocks to have when you're building a distributed scheduler.

But we didn't want to just build on those and postulate what the best design for a scheduler is; it's a very challenging space in itself. So in the same way we read the state of the art in academia and research before building those two tools, we did the same thing with Nomad, and a lot of it is based on three different papers: Google Borg, Google Omega, and Berkeley Sparrow. These three are the heart of the design patterns we'll talk about in a little bit. The goal was really to deliver a state-of-the-art scheduler: it should be based on the latest cutting-edge research, and who better to learn cutting-edge scheduling from than Google? No one really does it at the same scale.

So what does the system look like architecturally? If you've ever looked at the Consul diagram for a single data center, Nomad's regional architecture is going to look very similar. Nomad introduces a new level of hierarchy. In the Consul model, the data center was the unit you cared about, your failure-isolation domain, and when I say data center I mean a grouping of machines that are probably in the same physical region, less than five to ten milliseconds apart on the network. The problem is that a lot of people have many data centers that are relatively small, maybe only 10, 20, 50 machines, and a lot of them complained to us: "I don't want to run five Consul servers for the ten servers I have here; the overhead is huge." With Nomad we had a chance to split this and introduce the concept of a region. A region is a collection of multiple data centers: you might have SFO1, NYC1, NYC2, each its own data center, and group them into a larger United States region. In Nomad's world we now have regional control servers: you might deploy three servers in NYC1, and then have clients in SFO, New York, Miami, wherever; it doesn't matter. So there's a split, and the control plane operates at the larger scope. Within a region, clients don't really talk to each other; they make remote procedure calls to the servers. The servers internally do leader election, so one of them becomes the leader and provides additional coordination for the system (we'll talk about that in a second), and among themselves they replicate data and forward requests. This is all transparent: a user doesn't have to know whether they're talking to a server, a client, or the leader. They just make the API request to any endpoint, and the request is properly multi-hop forwarded.

Then there's a multi-region design as well. We might have our US region spanning three data centers, and then an EU region spanning, whatever, FR1, UK1, DE1, with a separate regional control plane. Again, from a user's perspective, we didn't want them to have to think, "if I'm submitting a job to the EU I talk to this endpoint, and if I'm submitting to the US I talk to that one." You just submit to any endpoint that is Nomad and it follows the right route. Maybe I'm sitting in my SF office and I submit a job to the UK: I hit my US regional servers, which communicate with all of the other servers over the gossip network and know, "okay, I'm going to forward you to a server sitting in the UK." They do the forwarding automatically, the job starts running in that region, and the response unwinds back to the client. The client doesn't have to care who they're talking to. And in this way, each region is a failure-isolation domain: even if I lose a majority of my servers in the EU, everything continues functioning in the US; I can submit new jobs, things are still scheduling.

The nice part of this design is that you can pick whatever granularity makes sense for your company as a failure domain. By default, out of the box, we configure Nomad with a single global region. You might say, "you know what, it's okay with me if, in the unlikely case that three of my five servers are down, I can't schedule any job in any data center anywhere in the world; I'm willing to take that risk, I'll deploy five servers and hopefully not lose three of them." That's fine; a single region can handle tens of thousands of machines. If you decide that's completely unacceptable and every data center must be its own failure-isolation domain (hell, every cage has to be its own failure-isolation domain), you can set it up that way: one region per data center, all federated together, so that if NYC1 goes down, SFO1 still operates as normal. You design it around the risk tolerance of your organization.

To support that we needed a lot of flexibility. It wasn't enough to say no one will have more than two regions, A and B; because there's so much flexibility, we really had to support an arbitrary number, so, picking an arbitrary number, let's say thousands. And since it's built on the same gossip network that we know works at tens of thousands, even low hundreds of thousands of nodes, that scale of region federation isn't a problem. Within each region we really wanted to support tens of thousands of clients, and the goal there was: if your risk profile matches, you should be able to run a single global region, because why not? Let's run one set of schedulers that's good enough for the whole world. Realistically, the number of users who have more than tens of thousands of machines is relatively low, so if that's the design threshold, great: you can have a global region and one set of schedulers for the entire planet. And within each region, thousands of jobs was simply a requirement, because we talk to folks who have tens of thousands; they're going to have region-splitting issues, though hopefully we get to the point where tens of thousands is reasonable. We wanted you to be able to very easily submit thousands of jobs, millions of tasks, and have it just work.

So this is the section we didn't actually get to cover during the keynote for lack of time: how do you make this possible, and where does some of the inspiration from Google leak in? One of the particularly interesting design points of Nomad is that it's one of, and might actually be the only, optimistically concurrent scheduler that's open source. What does that actually mean? In most systems, say I have three different servers: they do a leader election and then they serialize. One server makes all of the scheduling decisions. It's very pessimistic: it says the other servers, if they all tried to schedule, might make decisions that conflict with mine, so let's exclude them from the scheduling process; as the leader, I decide everything. The problem with that is, whether I have one server or three servers or seven servers, I get the exact same scheduling throughput: whatever that one machine can schedule. My other servers are kind of useless; they just sit there and replicate data. In our design, we asked: how do we actually put those other servers to work? How do we make sure that adding servers doesn't just increase your replication factor but adds throughput, so they aren't sitting idle? Part of this was really to support those thousands of jobs and tens of thousands of clients; there's a limit to how much work one machine can do, so how do we ensure the others participate?

Before we dive into how we make it optimistically concurrent, it helps to understand Nomad's data model. It's relatively simple: there are basically two external inputs to the system. One is the client node, which is a schedulable target; a server, basically. A node might be in datacenter one and have four cores, a hundred gigs of RAM, some disk and network: some unit of capacity, with some capabilities. Maybe it has Java installed, maybe it has Docker installed. It exposes all of that up to the central servers, so they know what the node can do. The other input comes from the developers. A node's input is implicit: you don't really configure a node, it just intrinsically has a set of resources. A job, on the other hand, is configured; it's the main input into the system. It's provided by a user, it changes over time, jobs are updated and deleted, and the system has to react to that. Nodes, similarly, are joining, leaving, and failing; hardware fails.

The two pieces that are internal to the system are evaluations and allocations, and it's important to understand these two as we go into the next slide. An evaluation is created basically any time the external world changes: you submit a new job, you modify a job, you delete a job, a node comes up, a node fails. There's a transition between what Nomad currently knows and what the external world is actually doing, and we have to reconcile that. Nomad creates an evaluation to say: I need to evaluate the difference between what I think the world looks like and what the world actually looks like. And the mapping, the join table if you like, between jobs and nodes takes the form of allocations. An allocation is quite literally an allocation of work: this job says I should run Redis, so allocate Redis to node 1. Node 1 can now query "what work is allocated to me? Okay, Redis, I need to run that." That's the fundamental data model, and it's relatively simple.

Like I said, evaluations roughly map to any time there's a state change in the world. You can imagine maybe my developers are changing the job file once a second; that's an extreme example, and I don't know why you'd update things that often. But the other type of state change, node failures and node joins, happens at machine scale. Things just fail, and especially if you have tens of thousands of machines, these can be relatively common. It doesn't even have to be a machine failure: the switch on that rack fails, and now 40 machines fall off the network at the same time. So there's a lot of possible state change at large scale. The goal of any of these evaluations is to modify the set of allocations in the system. Allocations are the assignment of work: the mapping of "this is the set of work I have to do" onto the set of machines that can actually do it. Once the state of the world changes, I have to figure out: do I need to update my mapping? Do I need to create new allocations because there's a new job? Do I need to move one because a node just failed? Or do I need to destroy allocations because that job doesn't exist anymore, because a user decided they're not running Redis anymore? That's the purpose of an evaluation. A scheduler, in some sense, is that mapping function: the function that takes an evaluation and generates a set of allocation updates. It applies some set of business logic to modify the allocations.

Now it's going to get into a bit of detail. This is the fun slide; it's my favorite slide, the one I didn't get to present, and it lays out the optimistically concurrent core of the system. At the very top, in dashed lines, are updates to the system: events that occur external to the system. A job got registered, a job was updated, a node failed, whatever. What Nomad does is create an evaluation any time one of these things happens. In the job-register case, the API call being made is a PUT to the v1 jobs endpoint; the user created a job, and underneath, Nomad both updated the job record and injected a new evaluation into the system. All of these evaluations get enqueued into a central broker. This is what I meant when I said the leader has additional coordination responsibilities; you probably can't read the small print, but it says "leader" under there. There's only a single evaluation broker in the whole system. If you've ever worked with a Redis task queue or RabbitMQ or any message-queue kind of system, it's effectively that: almost a first-in-first-out queue of evaluations, except that priority is built right into the heart of Nomad. If I schedule a job at priority 100 and there are a thousand jobs sitting at priority 50, I jump to the front of the queue and run first. This brokering happens in a central location, and its goal is to ensure at-least-once delivery. In any queue system you have two choices, at-most-once and at-least-once, and we need a scheduler to respond at least one time to an evaluation; it's this thing's job to make sure it gets delivered.
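That priority isn't a separate API; it's just another field in the same job specification. A minimal sketch (the job itself is hypothetical; the 1-100 priority range with a default of 50 matches Nomad's documentation):

```hcl
job "urgent-batch" {
  datacenters = ["us-east-1"]

  # Default priority is 50; at 100 this job's evaluations (and,
  # later, its plans) dequeue ahead of default-priority work.
  priority = 100

  task "worker" {
    driver = "docker"
    config {
      image = "acme/worker:latest" # illustrative image
    }
    resources {
      cpu    = 100
      memory = 64
    }
  }
}
```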
to them so each of the servers basically DQ's an evaluation and ingest it into a different scheduler function so the system has this kind of notion of pluggable schedulers that function that you might want to use to figure out what your mapping is is different depending on your workloads so at the core of it there's two that we ship with them there's no way anyone can read this but me one of these is the service scheduler so if I'm running a long live service for example Redis you know I don't expect to you know start read us in 10 minutes later it finishes I expect to start write us and it runs forever so there's a lot of considerations you want to make if you're scheduling a service because they will live forever and it's hard to move right once my service accumulates state I can't just kill Redis and move it I mean I can but it's velocity right so what I'd rather do is pay a lot of attention when I'm making the system and run very expensive kind of computations to figure out am i running to Redis is on the same machine on the same rack on the same page on the same core router how do I kind of isolate the kind of correlated failure that can occur with my services with one Redis you know it's not so interesting but let's say let's say I'm scheduling web servers write an absolutely terrible placement strategy if I said I want 40 web servers as if the scheduler puts all 40 web servers on the same rack because now I lose a single top-of-rack switch and my entire service takes an outage right the correlated failure of all of those instances was very very high so for something like a service scheduler you care deeply about these kind of things you don't you want to spread across your datacenter you want to isolate you know the damage that random failure will do to your service where something like a Bachelor scheduler I'm saying run a MapReduce job I have 10,000 mappers I don't really care I don't really expect any of them to live more than 5 minutes 10 minutes 20 minutes on the outside and should it fail I just schedule it again it's like just run it again I don't care right so it doesn't it's not as sensitive to a failure because you don't it's not damaging you lose 20 minutes of work at most versus something like my web servers a total service outage is pretty damning and then you might imagine you have custom scheduler logic for whatever reason you need to write your own scheduler with your own specialized thing for your organization that's possible as well so a lot of this architecture comes from the Google Omega model of supporting hey how do we have enough flexibility in our scheduler they clean or get different workload cases and make sure we can fit that logic in so great we have these different schedulers maybe we have 40 different threads that are making progress on evaluations at once how do we actually make sure that you know the scheduler doesn't do insane things like say I'm going to put all 40 of the allocations on to you know web 1 or node 1 well the way they do it is they don't actually create allocations directly there's a layer of indirection so again the leader is the one providing the coordination here so when these when the actual scheduler decides to make a change either creating allocations modifying deleting allocations they create a plan they don't actually just do it right away so what they'll do is they'll create an allocation plan and they'll submit it into the plan queue so similarly this is now the inverse of the evaluation broker all the different servers are 
pushing into the single queue and saying hey please apply my set of changes that I care about and again on this side the leaders pride in the coordination it's producing a priority queue system internally so if my job you know priority 100 job comes in and you know there's a bunch of work sitting at priority 50 I'm gonna get first access to resources basically my plan jumps to the front of the queue so now our leader deke used these things one at a time and looks at the plan and checks okay you want to put you know you know web machine over here Redis machine over here are you violating any constraints of the system basically right so if you imagine I start with a blank slate and I had one gig of you know memory first job came in and took half of that second job came in and took the other half if a third job tries to get place on the same machine it must be rejected because we've over committed the resources and because these guys are making decisions in parallel one two and three might say yeah I need 512 Meg's of RAM on the exact same machine so when these three plans get ingested here plan number one will get approved that will go back to a server and say yup great your 500 mega allocations exceeded plan number two again same thing plan number three the plan Q says oh no sorry you've actually picked a machine that's over committed so now the scheduler will receive a plan result back that basically says of your 20 allocations or your 2,000 allocations you tried to make 950 succeeded and here the rest of them remaining whatever 50 failed you need to make a new decision and so now the server is free to implement the logic to basically try again pick a different set of machines and hopefully and this is where the optimism comes in hopefully the second time around it does conflict so the first time around 950 out of a thousand succeed second time around you know maybe 49 succeed third time around the last one succeed and now the evaluation is done so the leader is then the only one who's actually modifying the state so this coordination that's happening within the plan queue and the evaluation broker are critical for maintaining the integrity of the system itself but at the same time these are very expensive functions to be running right like these things may chew a way at things for hundreds of milliseconds to figure out where do I place a thousand Hadoop tasks right like that or where do I put my five hundred web servers it's a very expensive function it's not instant so we don't want to just run it on our one leader so this is kind of the heart of it I hope that was somewhat interesting for people other than me and so if you if you're interested in this kind of style this kind of architecture highly recommend the Google Omega paper it basically describes very much this style of thing theirs has a little more flexibility in the kind of its implementation for various various reasons that make more sense for for their kind of environment that they're in but nonetheless this kind of separation of constraint management and coordination provided by a single leader from the distinct kind of computation and optimistic concurrency of the schedulers themselves is kind of pioneered by by Google's work and so you know we've talked a lot about wanting to build the system that is designed for scale and you know operationally pleasant and yadda yadda yadda you know so at some point you want to test that it works right and so the problem is when you're building a system was designed for scale you you 
need those machines right you can't just spin up a virtual machine be like just pretend this is ten thousand machines right because it doesn't work very well and it tries to run ten thousand containers on your machine and so we worked with some friends at digitalocean to be like hey can you guys give us a bunch of free money and spin up machines for us and they were kind enough to do so so we spent up three servers in the New York City three region and then a hundred clients in both New York City three SFO one an Stern m2 and Amsterdam three so the idea was to have a nice spread in terms of latency so this is also to kind of stress test the system right in real world you're crossing the Atlantic Ocean packets are dropping you're experiencing hundreds of milliseconds sometimes thousands of milliseconds and latency can the system tolerate these kind of things right it's it's no fun to do it over vagrant where you have zero milliseconds of routing time and so then the goal was can we do the c 1k test basically can we just submit a thousand containers and see what happens basically so that's what we did so we submitted a job that had a thousand was it reticence nginx I don't remember what the job even was something like that and the scheduler took less than a second basically by the time the benchmarks were actually you know pulled it every 100 milliseconds it was already done so took less than a second for it to finish the scheduling of a thousand containers to 400 machines the first container booted within one second so this was the one that happened to be in the same data center right New York City threes latency was a lot faster than your city three than the hamster name within six seconds 95% of the containers had finished booting across the fleet and at eight seconds we had 99 percent there's you start seeing an extreme kind of tail latency at some point just because we were downloading from docker hub and one of them got hung up who knows but effectively the job was completed scheduling within about eight seconds across you know going across the going across the Atlantic for different data centers thousand containers eight seconds is a pretty good role lap time and so kind of in summary as we were working on the system these were the three really big points of focus for us was can this be really easy for developers so the it is the past path of least resistance right like we wanted to offer kind of the most interesting features of like rolling upgrades scaling up and down like really easy to express arbitrary constraints like mus run on Linux must be 3.9 kernel you know so on and so forth we want it really easy for them operationally we wanted it to be really nice for operators as well it should be able to be simple to run effectively arbitrary scale without requiring a complex song-and-dance to make it actually work and last one was built for scale can we actually deliver a system that can handle a thousand thinners ten thousand containers we're hoping to do one where we try for a million containers and see what happens and so that's nomad it's available at nomad project IO there's a lot more documentation there's a getting started I promise you can schedule a job within like for copy pasted into your terminal so go play with it and I will hand it back to Mitchell all right cool sounds no man I'm gonna talk about Auto instead for better or worse I'm not gonna go into as much technical detail I we both thought Nomad is more technically interesting so there's a lot more detail there Auto 
With that, I'll hand it back to Mitchell. All right, cool, thanks Armon. I'm going to talk about Otto instead. For better or worse I'm not going to go into as much technical detail; we both thought Nomad was the more technically interesting one, so there's a lot more detail there. With Otto we're going to focus more on why it exists and how you use it, because I think that's the more confusing part of Otto versus something like Nomad. So what Otto is, is the successor to Vagrant, which is kind of interesting because we made Vagrant and we're slowly trying to replace it. But it doesn't replace Vagrant today; we'll get into how it doesn't replace Vagrant today and how it does over time. For now, that's the goal of Otto. We have big Vagrant releases coming out for the foreseeable future; I tweeted today about some pretty cool new Vagrant features, so stuff is coming. But Otto is here. With Otto, what we did was take a look at Vagrant and ask: what have we learned? Vagrant is six years old now, and the state of the world, and the state of our world, changes a lot in six years. People are using microservices a lot more, there are a lot more containers, people are scheduling things. The way applications are developed and run is very different. So we went from living in a runtime, production-ops-focused world for the past three years all the way back to Vagrant and developers, and asked how we could make things better. The three big things we learned from Vagrant are these. One: development environment deviation is minimal. What I mean is, if two people don't work in the same place but are both Ruby developers, their Vagrant environments are going to be pretty similar. They'll both have Ruby, they'll both have Bundler, they'll probably both have a web server. There are little deviations, one might be using MySQL and the other Postgres, but most of it is similar. And being able to represent that in a Vagrantfile is difficult without repeating yourself; it's likely each of them has a Vagrantfile with all the same instructions for how to install things. So we wanted a way to make that DRY, to make it less verbose. The second thing is pretty obvious: developers want to deploy. That's what almost all development is for; most things, you don't work on them on your laptop, say "that was cool," and never show anybody. You want to deploy, and this was a feature request in Vagrant for its entire life. I think since Vagrant 0.1 people have wanted to "vagrant up" to production, and we tried at various points to make that a possibility. But ultimately what we learned is that the Vagrantfile, again picking on the Vagrantfile, just isn't the right way to describe how to go to production. It describes how to set things up on a single machine; it describes how to install debugging-type things; it doesn't set up monitoring. It's just not the right level of abstraction for getting to production in a best-practices way. We certainly had internal demos where we could turn a Vagrant environment into a cloud instance, but it was just weird; it wasn't right. And the last thing is microservices. This is still relative; most companies aren't doing microservices, but it's pretty clear to us that microservices are the future. Breaking down monoliths into smaller applications, maybe not microservices but at least breaking them down into services, is the way things are moving, or moving back to, if you want to think of it as SOA.
That's what's happening, and again, picking on the Vagrantfile, it's just not a great abstraction for microservices. We have multi-VM support in the Vagrantfile, but it wasn't made for that; it was made to represent one big monolithic application talking to one big database or something. It wasn't meant to do a dozen different things, and even a modern laptop struggles running twelve VMs; that's really rough on your computer. So multi-VM wasn't the right abstraction, it wasn't built for that, and yet microservices are here and are going to grow. How do we build a dev tool that is friendly to microservices? Those are the three major things we wanted to solve, and to solve them we built a successor to Vagrant. The reason we did that, versus trying to fix Vagrant itself, is that we thought the Vagrantfile was fundamentally not the right level of abstraction for the goals we wanted to achieve. We wanted to change things in a major way, and bolting that onto Vagrant would have made things really terrible for current Vagrant users without producing the optimal experience for the future. So we built something new, but it's actually built on top of Vagrant. Vagrant is six years old; it has a lot of wisdom in it and very few bugs, and the bugs it does have are weird edge cases: "I'm using Windows with this version of Hyper-V running this obscure operating system with Puppet and things don't work." They're really obscure. The core of Vagrant is extremely stable, and we didn't want to reinvent the wheel right off the bat, so we started with that very stable core and built Otto on top of it. So Otto is three things: a development tool, a deployment tool, and a microservices management tool. To make those three things possible we made a new format called the Appfile, and the Appfile, we believe, is the right level of abstraction versus the Vagrantfile. That's what we're going to talk about first. Just judging by the name, you can already tell the Appfile focuses on a different level: it focuses on your app rather than the machine. In a Vagrantfile, the first thing you configure is a box; the first thing you tell Vagrant is how to install an operating system onto a machine. I think that's a great philosophical difference between Otto and Vagrant, because the first thing you tell Otto is what kind of application you're running: is it Ruby, is it something else? So here's an example of an Appfile. It's not wrong; it's intentionally blank, because we went all the way with the Appfile: it's completely optional. Whereas with Vagrant you have to tell it a bunch of things right off the bat, when you run Otto it just looks at your application and says: this looks like a Ruby application, I'm going to do what a Ruby application should do; I see AWS environment variables, I guess we're using AWS. It detects and does things for you, and we'll get more into that intelligence a little later. But the Appfile is a real format, so while it's optional, you can do things with it. Here's an example of a more complicated Appfile. It might look kind of familiar; it's like Nomad's, since at HashiCorp we've sort of standardized on this config format, which is JSON-compatible. We have an application block where we tell it the name of the application and its type, and we can specify dependencies; here's where microservices start to come in. A file along those lines is sketched below.
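Since the slide itself isn't reproduced in the transcript, this is a hedged reconstruction of what such an Appfile could look like. The application name, the source URL, and the exact keys are illustrative, written from memory of the format rather than copied from the talk:

    # Hypothetical Appfile in HashiCorp's JSON-compatible config format.
    application {
        name = "my-app"
        type = "ruby"

        # A service this app depends on; discussed in more detail later.
        dependency {
            source = "github.com/example/postgresql"
        }
    }

    # Optional customizations; the whole file, including this, can be omitted.
    customization "ruby" {
        ruby_version = "2.1"
    }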
We can tell Otto that we depend on, in this case, Postgres. So we depend on something, and we'll explain what that means later. And then we can make some customizations: while we're leaving out a lot of detail, you can still say, well, for the Ruby version I really want 2.1, I can't work with 2.2 or 2.0. But all of this is optional, and you'll notice right off the bat that a lot of things are missing that you might think are necessary. We're not telling it what operating system to run on, we're not telling it how much memory it needs, we're not telling it how to install Ruby. There's just a lot of stuff missing, and this is a really fundamental difference between Vagrant and Otto. Otto follows this idea of codification, whereas Vagrant I would describe as a tool of fossilization. The idea behind Vagrant is that if you take a Vagrantfile you wrote five years ago, we've worked really hard to make sure it still works today. You can "vagrant up" a five-year-old Vagrantfile, and if you do, it's going to spin up the same box and run the same set of commands. As Armon described, it's not declarative; it says exactly what to do. Philosophically, we've moved much more toward a declarative model. Vagrant spins something up identically to how it would have five years ago: it's a fossil. When you created that Vagrantfile five years ago, you snapshotted it; it's going to be that way for all eternity. Otto is not a fossil. It's codification, and more specifically it's the centralization of knowledge out of the config format and into the tool itself. Rather than the config format being very specific about how to run things, it's declarative; it specifies intent rather than mechanics. I intend to deploy a Ruby application; I intend for it to be on this specific version of Ruby; it needs these dependencies. And Otto itself is the intelligence that knows how to do that. If you run otto deploy today, it's going to deploy a very specific type of infrastructure, set up service discovery a certain way, maybe use Docker, maybe use Nomad. But in five years the state of the world changes. Maybe Nomad isn't the best way to do things anymore; maybe Docker has been replaced by unikernels; maybe AWS has burned down and DigitalOcean replaced all of it. The best-practice state of the world changes, and it's Otto's job to know what the best practice is for the time and do that for you. So if you run Otto today it may do something very different from running Otto tomorrow, but the end goal should be what you want, and the Appfile is supposed to describe that end goal. That's kind of a scary thought; it's a shift in how you think. But at the same time, the reason Otto is open source is so that what you believe is best practice, you can help get in there. One of the best ways I like to describe it: if you run otto infra, it stands up AWS infrastructure for you, if you choose to target AWS, and the person who designed the way Otto creates that AWS infrastructure was also, two years ago, the director of ops for the second-largest AWS site in the world. So you just got the knowledge of a person who knows how to manage that level of AWS infrastructure, for your hobby application, perhaps.
And this is the power over time: we hope to get the best PHP developers, the best Ruby developers, the best Node developers all contributing their knowledge and codifying it into Otto, so that it does the best thing for you and your application, and you can contribute that too. Another concise way to say it: Otto is pretty smart, and Otto is going to get smarter. If you run otto dev today to create a development environment and it does something wrong, you can try to contribute the fix, or you can just wait a little bit, run otto dev in a month, and hopefully it will do the right thing; it got a little smarter. Things will change. But of course it's pretty scary if every time you update Otto it does something totally different, so we have a concept of fossilization in Otto too, to restrict how often things change, and that idea is Appfile compilation. Otto takes an Appfile and compiles it into a fossil, and every time you recompile, it might totally change something; another way to describe Otto is as a compiler. So the first thing that happens when you use Otto is that you compile the Appfile, and this is the only time Otto ever reads your Appfile. Ever. This is very different from Vagrant, and it's actually something we learned from Vagrant too: when you run vagrant up, or any Vagrant command, it reads your Vagrantfile. So if you vagrant up and then change the box, which I'm sure a lot of you have done, when you run another Vagrant command it's still the same box. It doesn't go back and change it, even though you want it to change, and you can't detect that; there are all sorts of problems. Otto instead very clearly documents that compile is the only time it will ever read your Appfile. If you make any change to that file at all, it won't take effect until you recompile, and if you update Otto itself, no changes take effect until you recompile. It's very much like an application's source and its binary. In this case Otto loads the Appfile, which may or may not exist (again, it's optional), detects a bunch of things, and starts creating stuff. We ran otto compile on our Vault project, which doesn't have an Appfile, and this is what it detected: it detected it's a Go project, it detected we want to put it on AWS, it compiled a bunch of stuff, and we're done. So what did that just do? It created the .otto directory, which doesn't go into version control, and if you were to look inside you'd see a bunch of stuff; it actually created 49 files. It's a compiler, so what it actually did was generate Vagrantfiles, a bunch of different Vagrantfiles, plus scripts, upstart configurations, Packer templates, Terraform configurations, and more I'm forgetting. It generates all the low-level configuration for the other tools we've built, in order to let each do what it does well: Otto uses Terraform to start servers, it uses parts of Vagrant to manage the development environment (those parts are changing over time), and it uses Packer to build the Docker containers and machine images. It keeps each of those a single-purpose tool, and Otto just manages them; it even installs them for you, so you don't need to know how to install them. It just does things. A rough sketch of what lands in that directory is below.
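For orientation, here is a hypothetical sketch of the kind of layout a compile might produce. The file names and structure are illustrative, not Otto's verbatim output:

    .otto/                          # generated by otto compile; stays out of version control
        appfile/
            Appfile.compiled        # the frozen, compiled form of the (possibly implicit) Appfile
        compiled/app/
            dev/Vagrantfile         # drives the otto dev environment
            build/build.json        # Packer template for images and containers
            deploy/deploy.tf        # Terraform configuration for the deploy
            foundation-consul/      # scripts and config for service discovery setup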
A lot of people, the first time they run compile and look at what's in .otto, realize: by running one command, I didn't have to write those 49 files; it wrote the 49 files for me. And that's really the power in Otto. You can see how, if it generated a Vagrantfile, which is a fossil, and you updated Otto without recompiling, it's going to run the same Vagrant commands, so you get that consistency. When you recompile, it might change the Vagrantfile, and future versions of Otto will handle that migration for you: if you've already deployed something and you run otto compile, it will generate the migration steps toward the new version. It doesn't do that today, because nothing has changed yet, but that's what it's going to do. So this better explains why Otto wraps all these tools; it's using all of them under the covers. We didn't reinvent the wheel; we're just simplifying the whole dev-to-production process. That's fossilization. So let's take a look at what Otto looks like for development. It is, of course, meant to be the successor to Vagrant, so it should have a development experience that is better than, or at least equal to, Vagrant's. Vagrant has vagrant up; Otto has otto dev, and they're effectively the same: vagrant up gives you a development environment in one command, and otto dev gives you a development environment in one command. It looks like this; you can see Vagrant in the output, which seems kind of weird for something meant to be the successor to Vagrant, but like I said, we built on top of it. We use Vagrant to orchestrate the development environment process underneath, but we add a lot on top. We do fancy things like caching the SSH credentials. This is pretty nice, because if you've run vagrant ssh, you probably know it takes two or three seconds to actually enter the machine. otto dev ssh takes about 100 milliseconds; it's really fast, because we cache credentials. Since Otto is the only entry point into Vagrant, we know you're not messing with the environment, we know the IP hasn't changed, so we can do it fast. The next version of Otto, which is actually what I was working on today, uses linked clones and layers underneath. otto dev today takes a normal vagrant up time, perhaps a minute or so; my goal for otto dev in 0.2 is that it could take five seconds. So you get development environments really, really fast, and that process is just going to improve: otto dev might feel a lot like Vagrant today, but in the coming months it's going to get better and better, and we're doing fancy things. The other thing we do is assign an IP address for you. With Vagrant, you had to decide how to network and which IP to give the machine. Otto looks at your network interfaces, finds one that doesn't conflict with anything, chooses an IP that works, one no other application is using, and allocates it. The next version of Otto actually allocates a DNS name too, so you can reference everything with DNS names. You can see how things just keep getting better. And the last thing you'll notice, down at the bottom, is that the output is human-friendly, and the whole bottom part is tailored to your application, in this case a Ruby application. Otto is telling you, in human-friendly terms, how to work with that application: by the way, Ruby is pre-installed; to work on the project, file changes will be synced; when you're ready to build, use SSH; Bundler and Ruby are already in there; you can access it at this IP. The basic workflow looks something like the sketch below.
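As a quick recap of the workflow described above; command names are as given in the talk, but the session itself is a mock-up, not captured output:

    $ otto compile      # the only time Otto reads the (optional) Appfile
    $ otto dev          # one command for a dev environment; Vagrant underneath
    $ otto dev ssh      # ~100ms to a shell, thanks to cached SSH credentials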
And if you ran this with a Rails application, the output would be different. Someone who works at HashiCorp already added Rails detection, so not only will we detect that it's Ruby, we'll detect that it's Rails, and we automatically set up databases for you: we migrate, we seed the database for you, and that's in the output. It says: by the way, I already seeded the database; here's the username, here's the password. It gives you a bunch of stuff, and it's getting smarter already. You can run otto dev ssh, that's the hundred milliseconds I mentioned, and you're in a Vagrant environment, which is kind of funny, but you are, and that's just how easy it is. You can get the address with one command; you can destroy it with one command. It's very similar to Vagrant right now, and we did that on purpose so you're comfortable; we're not trying to totally change your world. So I think what we've created is a really nice development experience, I think it's going to get a lot better, and I'm excited for it. Of course, we did development, so now let's try to get to deployment, to production. This is new; Vagrant doesn't do this, so we're in uncharted territory: how do we make this environment as nice as Vagrant? First of all, let's talk about what happens when you deploy. Everyone in here is probably an amazing developer or ops person, so this is not what you do; you just know what to do. But if you're in the long tail of most developers, you just Google it, especially your first time. If you're working with Rails, let's use Rails as an example, you Google "how do I deploy a Rails application." I googled this a week ago, and this was the result. The top result really doesn't say anything useful; interestingly enough, it has no useful information. But the second one sounds perfect: "how to deploy a Rails app with Passenger and Apache on" dot dot dot. That sounds exactly like what I want. So you click it, and you get something like this, pages and pages of it: copying and pasting commands into a terminal. If you've been doing DevOps or something, you sort of laugh at this, because this wasn't a best practice fifteen years ago, and yet despite the innovation we've had since then, it's still the top hit. So you have to ask yourself, and we asked ourselves: why is this the most popular result? What you'd actually want is the current best practice, and this is the best practice: you want to set up a private subnet and a public subnet; you want to hide the services that don't need external internet access in the private subnet, your databases, your services, your web servers; you want a public subnet with a load balancer that routes back into it; you want a bastion, or jump host, in order to get in; and you want a NAT so the private machines can reach out to the internet. This is what you want, whether you know it or not. If you go ask some recent graduate of any sort of computer education how they would deploy an application, this is not what they would come up with, but experience has taught us that this is the best foundation on which to deploy things, something like the Terraform sketch below.
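To give a feel for the shape of that architecture, here is a hand-written sketch in Terraform. This is not Otto's actual generated output; the names and address ranges are invented, and the bastion, NAT, and load balancer resources are elided:

    # Hand-rolled sketch of the "best practice" network shape described above.
    resource "aws_vpc" "main" {
        cidr_block = "10.0.0.0/16"
    }

    # Public subnet: the load balancer and bastion/jump host live here.
    resource "aws_subnet" "public" {
        vpc_id     = "${aws_vpc.main.id}"
        cidr_block = "10.0.1.0/24"
    }

    # Private subnet: databases, services, and web servers hide here,
    # reaching the internet only through a NAT.
    resource "aws_subnet" "private" {
        vpc_id     = "${aws_vpc.main.id}"
        cidr_block = "10.0.2.0/24"
    }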
It gives you the best ability to grow, it gives you the best minimum security, that sort of thing. But the problem is that it's super, super complex. Even for the best operators it takes time to do this, and to automate it away you have to learn other tools. We've built a lot of those tools, Terraform, Consul, Packer, but now you're asking somebody to learn a whole other toolset that they don't want to be their expertise. They want to be a really great Rails developer; they don't want to be a really great Terraform-Packer-Consul person. There are those people, which is great, but this person doesn't want to be one. And that's why the copy-paste tutorial is the top result: despite all the innovation we've made, despite how much we believe we've made things better, that tutorial is still so much easier than anything else out there. You might as well just copy and paste things in, because even if you have to blow away the server and recreate it, it's still faster than learning half a dozen different tools. So we wanted to change that with Otto. With Otto we've split this process into three commands; in future versions it'll probably be even fewer, we're probably going to make it even easier. There's otto infra, otto build, and otto deploy. Otto manages your complete infrastructure: running otto infra actually creates that diagram. It creates the private subnet and the public subnet, creates the bastion or jump host, creates the NAT, automatically configures them, sets up the security group rules, sets up everything for you, so you don't need to know how. And like I said, that infrastructure was designed by somebody who did this at a larger scale than most people will ever see, so you're getting a lot of wisdom in it. Otto has different options, too. I just described a minimum of maybe three servers, and maybe you don't want to pay that much, so Otto has options that sacrifice scalability for cost: throw everything on one server in a public subnet and try to lock things down with security groups. More options will come with it; DigitalOcean support is coming in the next release, so over time Otto will learn more infrastructure providers as well. You could say DigitalOcean is just a lot cheaper for what I need, so let's go to DigitalOcean, or I have credits from Google, so let's use Google. Otto is able to target different things; like I said, the Appfile doesn't say which cloud provider you want, so it gives you that flexibility of jumping around. Next up is otto build. The build process takes your source code and turns it into something deployable. That might be a container; it might be at the machine level, like a droplet base image. It's meant to make something deployable, and what it makes is sort of up to Otto; in the current release it builds AMIs. We built Otto and Nomad in parallel, and the goal is that very soon in Otto's lifecycle, probably not 0.2 but very soon, everything it builds will be something Nomad can run, and we'll just defer to Nomad. But again, that best practice is evolving: we built Otto in a world, in our world, where there were schedulers but no schedulers easy enough to automate the operation of, and where machine images were still easier, so it uses machine images. One day you're going to update Otto and suddenly it uses a scheduler; you get way more density, your costs drop, and all you had to do was download a new binary. That's how easy it is, and it's going to get easier. The whole flow is sketched below.
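Putting the three commands together; again a mock session, not captured output:

    $ otto infra        # stand up the diagram: subnets, bastion, NAT, security groups
    $ otto build        # turn source into something deployable (image or container)
    $ otto deploy       # put the built artifact onto that infrastructure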
These commands are separate because you can actually share infrastructure between multiple applications: you can have one infrastructure and then deploy a bunch of apps, including apps other developers wrote, on top of it. That's why they're separated. And then the last thing in deployment is microservices. Microservices are coming up; we're going from a world of a few big pieces to one of many small ones, and in production you need to set up more things: monitoring, service discovery, service configuration, and more security stuff. It's just really complicated, and the thing that makes it complicated is all these connections: it's hard to efficiently describe all your dependencies and be able to manage them at scale. When we looked at microservices today, I don't think anyone who uses microservices would disagree when I say that they are complicated to develop and deploy. And the reason is what people are doing for development today. Say they're using Docker: you create a docker-compose file that has every container you need in it, and not just your immediate dependencies but everything those depend on too. So you end up with a file where it's up to that one developer to hold the full transitive list of dependencies, to know how to configure each one, to know which version to install, to start them all in parallel, all that. And then for deployment, you have to know what order to launch them in and how they find each other. It's a whole new can of worms, and we wanted to solve it. So with Otto, looking at that Appfile we had, what we ended up doing was creating this dependency block, and you can specify a bunch of them; if you have multiple dependencies, you just repeat it. What we did was set up pointers. Instead of a docker-compose file where you need to know how to install everything, the Appfile is the perfect abstraction for this: each Appfile already describes what that application wants and how to run it. So let's put the burden of telling Otto how to install something onto the developer of that service; it should be their responsibility to make that happen. That's what this does: you just point to the dependency you need, and that pointer could be on GitHub, Bitbucket, even plain HTTP. Otto, during the compilation process, fetches those Appfiles, and their dependencies, for you, so you don't need to know any of it, and it automatically figures out the ordering and how to set everything up based on those Appfiles. So: pointers, something like the sketch below.
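A hedged illustration of the pointer idea; the source URLs are made up, and the exact keys are from memory of the format rather than the slide:

    # Each dependency is a pointer to another application's Appfile.
    # Otto fetches it, and its own dependencies, at compile time.
    application {
        name = "my-app"
        type = "ruby"

        dependency {
            source = "github.com/example/postgresql"   # Git, Bitbucket, or plain HTTP
        }

        dependency {
            source = "github.com/example/redis"        # repeat the block per dependency
        }
    }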
Here's what it looks like when you run it. If you run otto dev with that Appfile to create a development environment, you'll actually start seeing messages like "installing dependency: postgresql." That dependency's Appfile described how to install it, and Otto does it for you. It also registers it in Consul for you, so you can find it via service discovery; Otto automatically installs Consul and manages that distributed system for you, so you don't need to know how. And it puts all of this into one virtual machine. This is very similar to running Docker today, where you usually run one boot2docker machine and docker-compose puts a bunch of containers on it: Otto multiplexes all your dependencies, your whole microservice graph, down onto one virtual machine, exposes everything through Consul, and you're off to the races. If you SSH into this machine you can actually see it: we SSH'd in, ran a DNS lookup, and got results. All of that is there automatically. You didn't need to know how to configure it, you didn't need to know how to put in seed data, any of that; it's up to the dependency to tell Otto how to set all that up for development. And if you were to deploy, this isn't very useful output, but if you were to deploy, it will deploy all the dependencies too. Otto currently deploys just one of each dependency, a singleton; eventually you'll be able to say you want multiple distinct copies, that sort of thing, but for now it's one of each. If you ran otto deploy on a folder with no Appfile, containing a Ruby app that depends on Postgres, you would actually get a Postgres machine configured; the Ruby application deployed behind Phusion Passenger; service discovery set up to find the database; and credentials injected into environment variables. Eventually we'll automatically integrate consul-template, and it all just works. Think about that: it just happened with zero lines of configuration, a few commands, and a credit card for AWS. And it's going to get smarter. We already have plans for otto deploy to automatically set up Vault for you, to set up Nomad for you, to store secrets for you and give you easy access to those secrets, all of that, so you can just redeploy and redeploy and Otto keeps getting better and better. So what Otto is, is a development tool, a deployment tool, and a microservices tool, all configured with this one Appfile, which we think is the right way to think about things. We've spent six years building these low-level tools to work with machines and images and things like that, and now we're trying to move up to a higher level of abstraction to make us all more productive. Otto is our way forward. It doesn't replace Vagrant for a lot of things. My hope is that over the next three years, 90-plus percent of developers using Vagrant move to Otto, that it becomes good enough to move to. But Vagrant never gets replaced for machine-level work: if you're testing provisioning at the machine level, if you're working with obscure operating systems, if you're trying to model weird environments, Vagrant doesn't get replaced. So we're still developing and working on Vagrant, but Otto is going to get better and better every day. I think it's very similar, not that I was alive for it, to when compilers came out for C and the earlier high-level languages. When we moved from assembly to higher-level programming languages, there were a lot of assembly developers saying: well, I could do it better in assembly, I have more control in assembly, it could be better. I think Otto is going to be at that stage for a while: a lot of people will say, well, if I control Packer and I control Terraform, I could do things better. And that's probably not wrong today; you're probably right that you could do it better. But we're hoping those people contribute to Otto to make it smarter, and eventually you'll run otto compile, look at the Terraform output, look at everything it did: automatic service discovery, security,
scheduling, automatic density calculations, automatically choosing the most efficient cloud for your workload, and you're going to look at it and think: I could not have done this better. That's the goal of Otto. You can find it at ottoproject.io; it already exists, it was released a week ago, and it has now surpassed Vault as our fastest-growing project. Vault got three thousand GitHub stars in a month (a vanity metric, but still), and Otto has gotten three thousand in six days. Fast enough, I think. So that's Otto, and that's it. Thanks for being here, thanks for watching; I know it was long. I don't know if you want to ask questions here, or if you just want to talk later. [Audience question] Okay, so Otto itself runs on anything. What it deploys right now is Ubuntu. Otto is meant to be best practices, so we're not going to try to have Otto support every operating system in the world; at least in the early versions we're going to go with Ubuntu and RHEL-like flavors, and we'll see. It's meant to be best practice, so if you want to deploy, say, a full Nix-based OS deployment, and I'm not saying Nix is bad, it's just not pragmatic today in terms of being widespread with lots of resources, that's not what Otto is going to do today. [Audience question] Yep, that's a great question. Today, not in a good way, but we completely planned for it. The way Otto is going to work in production is that git/GitHub flow: you'll want to sign up with Atlas. Otto provides you the tools to work on a team; it's sort of like how you could, in theory, git pull from everybody's machine, but GitHub just gives you a nicer experience. We're building Otto so that, in theory, you can work with a team using what it has, but Atlas is going to give you a much nicer experience for that. Any other questions? Any questions for Nomad? All right, well, cool. We're going to stick around a little while, so thanks for having us again.
Info
Channel: DigitalOcean
Views: 12,377
Keywords: DigitalOcean, Digital Ocean, Cloud, Iaas, hashicorp, mitchell hashimoto, developer, developer meetup, security, encryption
Id: aF_HPTHtqCA
Length: 83min 13sec (4993 seconds)
Published: Tue Oct 27 2015