Automating Infrastructure Management with Terraform

Captions
All right, I want to welcome everyone to the SF Cloud Ops Meetup — thank you for coming tonight. Just so you know, if you want to tweet about the event or anything good you hear, these are the hashtags we like to see, so feel free to put something out there about SF Cloud Ops tonight. I want to give you the requisite "who we are": we're Lithium, we do social CRM, we do community really well, and we're doing some really cool, fun stuff — that's why we get to meet people like Armon; I saw Mitchell's here tonight as well. We've really tried to bring in some of these speakers who have helped us, whether we use their tool or not, because we've found they really helped shape our vision of what we're trying to do here at Lithium. With things you hear about like hybrid cloud — trying to mix these environments — you find out when you really dig into the details that it's a challenge, so we really appreciate the opportunity to work on it. If you'd like to learn more about joining Lithium, we do have some opportunities here, and I'm going to invite up Tara real quick; she's going to talk about some immediate openings. We'll keep it brief, but just to let you know what we have available. How many of you were here for our Skybase meetup? Any of you? Ooh, new crowd, okay. So we're working on hybrid cloud application deployment systems — if you had been here, you would know about it. We'll be open-sourcing it, hopefully in the next month or two, in a beta sort of form, and we're looking for people who are interested in working on this platform. The Jobvite will go out soon, but if you're remotely interested, just come find me by the bar and I'll give you a drink and we can talk about it. (She does make good drinks — I appreciate that, yes.) So with that — I'm really excited about tonight's presentation, because, you know, we use AWS a lot, we know CloudFormation a lot: the good and the
bad — and a lot of you guys probably know that as well. There was one feature where, when I heard about Terraform, I thought, first off, these guys are really kind of close to how we think; and then there were some features in here where we went, well, that kind of speaks to what we're trying to get accomplished. So that's why we wanted to invite up Armon Dadgar. (Have you ever been in a data center before in your life, at all? You have? All right, I was wondering — that was a debate we were having: has he even been in a data center? Because this new generation, they're not like the old-school crusty guys or the neckbeards; these are guys who are new-school, doing cool stuff.) So with that, I want to invite up Armon, and he's going to talk about automating infrastructure with Terraform.

[A/V setup] My laptop sees the other display... it's not picking up... cool, awesome. Yeah, the colors are a bit off. Awesome.

So, thanks a lot to Joe and Lithium for hosting this. I'm pretty excited to be here and talk to you today about Terraform. Lithium is a pretty awesome use case — as they said, they have a pretty complex hybrid cloud scenario — and so today we're going to be talking about what it's like to do orchestration at cloud scale. For those of you who don't know me, my name is Armon — I'm everywhere on the Internet as just "armon" — and I work at HashiCorp. We're a DevOps tooling company, basically working towards a software-managed data center. Some of our tools you might recognize: we make Vagrant, Packer, Serf, Consul, and Terraform, which is what we're going to talk about today. Just as an overview, what we're going to be talking about is, one, what is orchestration — because this term is thrown around a lot these days and it's become a little bit meaningless, so in the context of this talk I want to frame what we're talking about here — and then
we're going to go through the origin story of orchestration: how we've ended up with the current modern-day data center, and where we're moving with the next generation of cloud-scale data centers.

So the first question is: what is orchestration? This term is super overloaded — it could mean anything from booting an AWS instance to running Puppet — but for the purposes of this talk, what we're concerned about is infrastructure lifecycle management. There's a fixed relationship here. The first step is acquiring the resource: resource acquisition. Once you've acquired the resource, how do you go about provisioning it? Whether it's a switch, a server, or a logical record like a DNS entry — how do we set that thing up, and how do we update it if need be? And then, when the lifecycle is over and we no longer need that resource, how do we destroy it? For our purposes, this is the lifecycle we're going to be talking about.

So where did we start? We started with this guy. At some point — maybe you started a startup and you only had five users — this is where you started; or, you know, 20 years ago this was state of the art. We had one server — one to rule them all — and when there were changes, you'd just go in there and fiddle the bits. But at some point you reach the more-than-one scale: okay, now we have racks, now it's more of an issue, now we have to think about these things. And this is not that long ago — the data center of yore is this, and people still run this. In this world, acquisition time is on the order of weeks. This is something you have to think about: you're calling Dell, you're calling HP, saying "please send me 20 servers," and then you wait. Those servers come and you rack them, and that's manual; then people start SSHing in, and that's manual; and for pretty much every part of the process
you're calling a sysadmin, or you're calling someone — there's just a lot of calling involved, and you have to manually get things set up.

But right around 2006 there was a tectonic shift, which was the cloud — or, less ambiguously, what I'm going to refer to as Elastic Compute. Elastic Compute really is a specialization of labor. What it's saying is: I'm a software company — why am I in the business of running a data center? I don't know anything about networking; I don't know anything about power and AC units. I really shouldn't specialize in building and managing my own data center; instead, someone else should specialize, and they should be a utility company. So Elastic Compute lets us think about computing as an on-demand utility: it's cheap, it's on-demand, it's unlimited in nature, and it's elastic. One of the nice benefits is that, instead of someone physically shipping servers to us, we moved from months or weeks of acquisition down to minutes — you just make an API call, and 60 seconds later you have an EC2 instance. This is a huge shift. The other shift is that we move from a CapEx model of spending, where you have to spend a million dollars to get that initial data center, to spending two cents an hour. It's a totally different world, and what this does is reduce the barrier to entry, which enables a lot of innovation in the SaaS kind of model. Now, if you want to start a startup, you need a $25 AWS credit, you spin up some instances, and you can take that risk on building the next-generation SaaS to do DNS or a CDN or whatever it is. So this enables a lot of innovation, and very quickly we see the beginnings of the cloud wars. Amazon got here first, and what's great about this is that it's a race to the bottom: all of us benefit, because we get more features at lower cost with higher reliability. The only problem is, now we have to deal with — okay, I'm on 17
different clouds now, and this is kind of a nightmare. Very quickly, enterprises also figured out that, hey, there's something to this elastic model, so we see the rise of on-premise cloud solutions as well. Mostly this is companies still willing to front the capital expenditure cost — you still have to go out and spend 10 million dollars on the hardware and then another 20 million on VMware — but you get the nicety of having a cloud, with that elastic, on-demand, API-driven nature. So we're now able to move into a world where our acquisition time for new resources is on the order of minutes.

So now the question becomes: okay, great, we can spin these servers up left and right, but how are we provisioning them? That's the next logical step once we have them. Again, as we talked about, in the early days you likely had the one server; you plugged in your keyboard, maybe you had a CRT monitor, and you started doing it manually. At some point you had a sysadmin who decided Bash and Perl were the way to go, but they were the knowledge silos — there were one or two of these guys who knew the magic incantations to get the Perl script to self-evaluate its regex and do things. Luckily, right around the same time as Elastic Compute, we see the rise of config management: Chef, Puppet, Salt, Ansible — and probably more that I'm forgetting. They allow us to think about provisioning at a higher level of abstraction. Instead of writing shell code or Perl against whatever operating system and version we have, it's a higher-level abstraction: I just need this file here, with these permissions. But more than this abstraction, it lets us codify knowledge. Before, you had that sysadmin guy who knew what that Perl script did and nobody else; now we're able to move that knowledge into shared Chef repos or Puppet repos, codify it, and share it across the organization.
One benefit of this is that now anyone can go in there, look, and ask: how did this file get here? How do we set up a web server? How do the pieces come together? So from a knowledge-sharing perspective it's great — but because it's code, an executable representation, we're also able to use it to automate. One of the beauties of Chef and Puppet is that they're much faster than doing it by hand, and less error-prone, because it's totally automated. On the horizon, and maturing rapidly, is containerization. It's solving a similar kind of problem around provisioning: how do I do OS-agnostic packaging and simplify my application delivery? You don't want to have to think, do I need a deb or an RPM package or whatever — instead you just containerize it, deliver it to a Linux machine, and it works. And it works alongside your existing config management tooling: you spin up your container, run Chef inside of it, package it, and you're ready to go. So that brings us to a world where we're able to rapidly acquire new resources, run our config management tooling, and very quickly pump out new resources that we can acquire and provision reliably. We're now doing things better, faster, and cheaper than ever before — but really we want to get to this world of wu wei, "doing without action," because the only thing better than doing it better, faster, and cheaper is not doing it at all. Because of all of this SaaS innovation enabled by Elastic Compute, we've seen all of these great services showing up. Nobody really runs their own CDN or DNS or email or payment servers anymore; instead we use a SaaS for each of these systems, and we outsource all of these things that aren't our core competency to SaaS providers — because why do you want to manage your own email delivery? What this allows us to do, as software companies, is we get to
specialize on what our value add is — because, realistically, unless you're Google, no one's value add is infrastructure. So we're seeing huge diversity now in the SaaS world. This is just a short list of a few different problem spaces, but you could go down the list, and for every possible category there are probably a dozen competent SaaS providers. There's huge diversity, there's huge competition, and we all win, because we're moving more efficiently and there's less stuff to maintain. But as operators, we now have to deal with the complexity of running all of these different SaaS products as part of our infrastructure. Instead of a consolidation and a reduction in complexity, the complexity of running our data centers is just rising — and it's not going to go away, because it just makes sense: you're never going to move things in-house when it's more expensive and more difficult for you to do so. So the modern data center really is very different from that picture of the racks sitting in one room. Now it's common to see hybrid deploys: you have physical hardware running an on-premise cloud like OpenStack, VPNed to public clouds, and these are globally distributed in nature. You're not managing one room anymore; you're managing three-plus data centers across the world, and you have a matrix of different provisioning systems used for each of them. You're probably still hand-configuring the physical boxes from 20 years ago; you have some Chef, you have some Puppet; maybe there's a cluster running Docker, but it's not standardized — it's just layered, like that tower. And on top of all of this, you have integrated use of PaaS and SaaS throughout the organization, in a non-unified fashion. So really, as operators, it's overwhelming: how are we supposed to deal with the fact that this complexity
is just bubbling out of control? The role of an operator really comes down to three things. The first is to truly understand the infrastructure, because when something goes wrong, they're the first line of defense — they're the ones who have to ask: okay, what's happening? Is BGP broken? Did we lose a core switch? You have to understand your infrastructure at a deep level to be able to diagnose it. The next is that you have to be able to adopt new technology and stay nimble. With all of the SaaS innovation we're seeing, you have to move faster than ever before to stay relevant — it almost doesn't matter how big a company is; you talk to them, and they're worried about being irrelevant in five years. So you have to be willing to adopt new technology and move quickly. And part of this is that you have to be able to deploy applications efficiently. Really, what an operator wants is to move fast without breaking things.

To that end, a tool that we've developed at HashiCorp is Terraform. We had a number of goals when developing Terraform. We wanted a single workflow to unify any of the underlying technology, so we need to be able to support and represent all of it. Part of this is treating SaaS and PaaS as first-class entities in the data center — they're not going away, and they're only going to proliferate, so how do we pull them into the fold, into the management infrastructure, in a way that operators can use and understand? Really, all of this comes down to the operators: how do we put them first, and how do we make sure they understand this and can continue to scale the data center? One of the first principles behind Terraform is that infrastructure should be code. In the same way that config management brought codification of knowledge, we wanted to pull that into the infrastructure layer. We want you to be able to describe
our desired state and simply let Terraform get us from state to state-prime, as opposed to having to figure out, on a complex infrastructure, how to get there ourselves as operators. We want the management of our data center to be automated by Terraform without us telling it how — just the what. And again, as I mentioned, a fallout of this is that anytime you want to know what your infrastructure looks like, you just look at the Terraform configuration, because it's fully documented — the documentation is executable. You also get the same knowledge sharing that you got with config management: once one person in your organization describes how to set up and scale a 50-node Redis cluster, everyone can make use of it. We have to be able to represent the entire data center for this tool to really be useful — if you had to use multiple different tools and piece together your representation, it wouldn't be useful, because you'd fragment your knowledge, parts of it wouldn't stay up to date, and you'd end up losing consistency with what your infrastructure actually looks like. So again, this means SaaS and PaaS have to be in there. It means we can't just deal with compute — we have to deal with networks, we have to deal with storage, and you have to think about switches and routers as first-class elements of your data center.

To that end, HashiCorp introduced a new language. We call it HCL, the HashiCorp Configuration Language; it's heavily based off of libucl, so it's not totally out of the blue. Its goals were basically twofold. One is that it should be human-readable, because most of the time it's going to be humans looking at the configuration, asking what's wrong, or how does this work, or how do we go from A to B. Very little of the time is spent by computers reading these files, and they're so incredibly fast that you don't need to optimize for their parse speed. But if you do want the computers to be able to automatically generate configuration, and you want to parametrize
in a way that's smoother, more complex, and more organization-specific — the tooling is also JSON-interoperable, which makes it easy for automated tooling to push things in and out of Terraform.

So here's a little example of what Terraform HCL looks like. This first block — I just want to unpack it a little bit — defines a resource. In Terraform's world, a resource is basically anything that can be represented, and it has two parts: a type and a logical name. In this case, digitalocean_droplet specifies that the type of this resource is a droplet — an instance of compute in DigitalOcean's infrastructure-as-a-service world — and we're just going to logically name it "web." Then we give it some parameters that describe the minimum necessary for DigitalOcean to be able to boot this instance; here we're saying it's a small-sized instance running CentOS in the SFO data center. Then, in a totally different model of the world, we define a separate resource: DNSimple is a software-as-a-service DNS provider, and we're going to assign this record the logical name "hello." We want to define an A record such that the IP address is actually that of the DigitalOcean instance. This is a really powerful concept if you think about it for a moment: we have a software-as-a-service that we're a subscriber to, but we're going to spin up an actual infrastructure node, get its IP address, and create a DNS record based off of it — and we're codifying all of this. Otherwise you're going to have a Perl script somewhere that's like, okay, spin up the thing, then parse the response, get it, and then go ahead and make this request — two thousand lines of indecipherable Perl where you're asking, what is this doing? Versus this, where it's instantly: oh, spin up the thing, take its IP address, create a DNS record.
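The two blocks just described look roughly like this — a sketch of the talk's example, where the exact attribute names and values (size, image, region, and the example.com domain) are illustrative rather than taken from the slides:

```hcl
# Compute resource: type "digitalocean_droplet", logical name "web".
resource "digitalocean_droplet" "web" {
  name   = "web-1"
  size   = "512mb"
  image  = "centos-6-5-x64"
  region = "sfo1"
}

# SaaS resource: an A record at DNSimple pointing at the droplet.
resource "dnsimple_record" "hello" {
  domain = "example.com"   # hypothetical domain
  name   = "hello"
  type   = "A"
  # Interpolating the droplet's IP is what creates the dependency:
  value  = "${digitalocean_droplet.web.ipv4_address}"
}
```

The interpolation in `value` is doing the work: Terraform sees that the record reads an attribute of the droplet, so the droplet must be created first.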
It's instantly obvious — you didn't have to go through 2,000 lines of code to figure that out. Part of this is that Terraform uses it to do dependency management for you: as an operator, you don't have to figure out whether the DNS record comes first or the droplet comes first. By maintaining a resource graph and understanding the dependencies between resources, Terraform automatically handles the ordering of operations, it can automatically parallelize operations that aren't dependent on each other, and it lets us visualize what this actually looks like. As an example, here's Terraform being asked to visualize what that toy infrastructure looks like. What it represents is that we have this droplet, which depends on the DigitalOcean provider — it has to be able to properly authenticate and have the API tokens there — and then we have a DNSimple record, which depends both on DNSimple and on the droplet existing first. This notion of providers is the key to allowing Terraform to interface with all sorts of external systems. The providers, in essence, are the integration points between Terraform and the rest of the world. They expose different resources: what we saw before was that the DigitalOcean provider had the droplet resource and DNSimple had a record, but both of them export a few other resources that we can make use of as well. For each resource, the provider implements a simple CRUD API. This creates a separation between Terraform core and the provider: the core handles all of the actual logic — parsing the config files, understanding the dependency flow, ordering, parallelization, error handling — all of that gets offloaded to the core, and to integrate new systems you're really just writing a very thin CRUD layer against a REST API. And because of this notion of having multiple providers that can be independent of each other, Terraform can understand the infrastructure as kind of a layer cake.
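Each provider is configured once with the credentials the graph showed as dependencies. A sketch — the attribute names here are assumptions based on early versions of these providers, not something shown in the talk:

```hcl
# Provider blocks hold the credentials that the resources depend on.
provider "digitalocean" {
  token = "${var.do_token}"
}

provider "dnsimple" {
  email = "${var.dnsimple_email}"
  token = "${var.dnsimple_token}"
}

# Declared variables keep secrets out of the checked-in configuration.
variable "do_token" {}
variable "dnsimple_email" {}
variable "dnsimple_token" {}
```

This is why the graph draws the droplet as depending on the DigitalOcean provider: without a configured, authenticated provider, the resource's CRUD operations can't run.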
(I use this analogy a lot more than my co-workers appreciate, but I just like cake more than they do.) In Terraform's world, you basically have a provider that understands each level of your infrastructure, and they're orchestrated in the order that logically makes sense. For example, you might have to start at the physical level and provision that up into an OpenStack cluster — you can think of the OpenStack API as a thin dividing layer: there's the layer above the API and the layer below it. Below that layer, Terraform is booting and provisioning machines and installing OpenStack; above that layer, it's using your OpenStack API to start virtual machines — a different kind of workflow. Then you're bringing up virtual machines, and maybe above that you're loading on a containerized workload. At each of these levels, Terraform is able to use the appropriate providers in the correct order, such that going from physical hardware all the way to a virtualized, containerized workload is a single terraform apply command.

So it's great and all that we can do this — start from physical and go up to the container — but, relatively speaking, the initial construction is the easy part: you could hard-code it and probably get away with reasoning about the ordering yourself. Where things get hard is changes. If you want to change something in a complex infrastructure, there are all sorts of intricate interdependencies that have to be captured and handled. If you want to change the version of your web server, do you need to update the load balancer? Does the load balancer's DNS record need to be updated? Do you have to notify CloudFlare to do a CDN invalidation? All of these things are interconnected in a way that becomes very hard to reason about once you're running hundreds of nodes, tens of apps, tons of SaaS products. And so, as an operator, when you
were like, okay, I want to make this change — there's the question of what's going to happen. I'm about to change something, but what do I expect the result to be? Terraform acknowledges that it's very hard to define what changes are going to take place — what is the butterfly effect of any given change — and so we take a measure-twice, cut-once approach to the problem: how can we make it clear to an operator, before we do something, exactly what we're going to do, so they can fully understand the implications? If it's clear that changing this value is going to cause this resource to be destroyed and recreated, maybe you want to rethink how you're going to do that update — "actually, I don't want to destroy every web server." The Terraform approach is to separate the planning phase from the actual execution phase, and to do this in a very strict way: Terraform doesn't allow anything to be executed, so to speak, until it's been pre-planned, and once that plan is finalized, it strictly executes only what's in the plan. This is a pretty big departure from things like CloudFormation, which don't make that separation — you figure out what it did after it did it, which maybe is what you wanted, but maybe it's not. So the model of how Terraform operates is that it examines the existing state of the world along with your desired configuration and forms its execution plan. The execution plan can then be visualized: Terraform will show you what it's going to change, what it's going to destroy, and what it's going to create. You can inspect it, decide it looks great, and then pass it through to the plan application phase, which moves you from state to state-prime. You can also abort at that point — no, this plan doesn't look right — and tweak your configuration until it does what you want it to do.
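On the command line, the strict plan-then-apply separation described above looks roughly like this — a sketch of the workflow, where the plan filename is arbitrary and exact flags may vary between Terraform versions:

```shell
# Examine current state plus desired configuration,
# and write the resulting execution plan to a file.
terraform plan -out=web.tfplan

# After inspecting the proposed changes, execute exactly that plan
# and nothing else:
terraform apply web.tfplan

# If the plan looks wrong, discard the file, edit the .tf
# configuration, and plan again.
```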
So, to give kind of an overview: Terraform has a few big tenets behind its design. One is that infrastructure should be code. This gives us understandability; it gives us documentation that's always up to date; it lets us compose providers in a way that can express this layer-cake leveling of our infrastructure; and it lets us merge software-as-a-service and platform-as-a-service — like Heroku — with our infrastructure-as-a-service, like AWS. Because of its separation of the planning and application phases, and its resource graph, it's able to safely iterate on that: we can make changes in the right order and be confident it's going to do exactly what we expect, without an "oops, I accidentally blew away my load balancer." And because of its technology-agnostic provider model, it doesn't lock you in: you can build your whole thing on AWS, and if tomorrow there's a brand-new infrastructure-as-a-service you want to migrate to, that provider gets implemented and now you have a migration path forward. It future-proofs you against technological change, because its model isn't specific to any particular technology or vendor.

As an operator, these are awesome features to have. You get a unified configuration that describes your infrastructure end to end, at every level of abstraction. You also get an atlas for your infrastructure, so that when the main DevOps guy leaves and you're wondering, oh god, how does anything work — you just open up the files: okay, this is how everything connects, this is how the pieces come together, these are the dependency chains. And you only have one tool to deal with for all of this: you're not first running one tool to boot your AWS instances, then a separate tool to run your provisioner, then a third tool to update your load balancer and a fourth to update your DNS records — it's all under one umbrella, one command. We already
talked about how it future-proofs you, but its uses keep going past operators: you can extend Terraform to help end users as well — in this case, your developers. If you have a central ops team that's responsible for maintaining and babysitting the infrastructure itself, and your developers are responsible for developing the applications, what you'd like to move to is a self-serve model: as operations people, you're not responsible for every application-level change or deploy; you just make sure the substrate is ready to serve, and the applications can be deployed by the development teams independently, so you can move a lot more quickly without bottlenecking on a central ops team. When you move to the self-serve model, you really need to be able to decompose your infrastructure into its various layers and delegate each to the appropriate team, so that your deploys can happen in parallel. Going back to the previous example: developers don't have to care about the stack-level resources; they get to treat it all as an elastic internal cloud. Terraform takes the separation further, enabling this operator-and-user interaction, with the notion of modules. A module lets you think about your infrastructure in terms of its abstract components: you don't have to think about your Redis cluster as the Redis nodes plus the sharding layer plus the monitoring systems — you just think, "I have a Redis cluster." It allows you to reason about your infrastructure at a higher level, in the same way that Chef and Puppet let you think about throwing files on disk at a higher level. It also lets you reuse that configuration across your organization: you probably have multiple projects that need to make use of a cache, and you don't need to re-figure out, per team, how to configure and scale Memcached or Redis; you can have
a single configuration that takes variables, and share it across your org. An operator who's writing and developing these modules wants to treat each one like any other Terraform resource: "I'm going to write a Consul module so the organization can make use of Consul — they just have to require it." To the operator, Terraform can show the unpacked, white-box view, where they care about the intricate details of what's happening inside that box. But an end user developing their service maybe doesn't care about those details; they just say, hey, I have this instance and I need a Consul cluster running — I don't care how you do it, just N equals three, and you deal with the details of bringing it up, managing it, and making sure the whole orchestration flow works; I just want to pull out the IP address of a Consul node.

This helps us move toward a model of cloud-scale infrastructure. Beyond a certain organizational size, you have a very complex infrastructure — it's no longer 12 nodes running in the basement; there are thousands of machines running tons of different layers of technology across the provisioning matrix — and this infrastructure is fundamentally shared among the organization; it's not your own cloud for every dev team. So you end up with this weird dichotomy between dev and ops. You usually have your central ops team, who care about stability and security, and you have your developers, who aren't necessarily as concerned about those things — they just need to deploy their bug fix, because it's 2:00 in the morning and the app is down. They want to move fast, and they're doing a decentralized deploy model: there's no hive mind, there's probably one dev team per internal tool or application, and they're not coordinating with each other. So you want to move to this continuous-delivery model, and the operations people really are the front line of enabling that.
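That consumer's black-box view might look something like this — a hypothetical sketch, where the module path, the `nodes` variable, and the output name are all invented for illustration and would be defined by the ops team's module:

```hcl
# Developer's view: a Consul cluster as one opaque component,
# sized with a single variable — no sharding or monitoring details.
module "consul" {
  source = "./modules/consul"   # hypothetical module maintained by ops
  nodes  = 3
}

# Pull out just what the application needs: the address of a node.
# (Assumes the module defines a matching output.)
output "consul_ip" {
  value = "${module.consul.first_node_ip}"
}
```

The ops team iterates on everything inside `./modules/consul`; the consuming team only sees the variables in and the outputs back.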
of enabling that. So how do we reduce the friction between these two teams? How do we reduce the throwing over the fence between dev and ops and make this easier? Well, self-serve is kind of the answer. The operations people provide the self-serve portals, and the developers simply consume them. And you want to do it in a way where you're ready for change, so that when you want to change the underlying technology, you're not having to change the workflow all the way from operators clear through developers. And so Terraform offers a pipelined way of doing this. As we talked about, you can split the interactions with Terraform and delegate the work appropriately to each team, and manage the layer cake layer by layer. So the operations people maintain the underlying substrate; maybe they're responsible for writing the modules and making sure they're tested and scalable. And then your developers treat it more as a black-box, plug-and-play system. They pull in the modules they need, they allocate and provision the resources they want, but they don't really care about maintaining the substrate, like the VPN and the broader networking and storage and configuration of the data center as a whole. And so together, by using this one pipeline, the two teams are able to move fast without breaking things, because Terraform is there to enforce the constraints and make sure nothing crazy is happening.

So, in conclusion, Terraform offers one tool to unify them all, so to speak. It's really there to enable operators, but by doing that we're enabling end users as well, by making the self-serve model possible. And by being able to pull all of that under a unified language and a unified workflow, we're able to tame the complexity of the data center and bring all these PaaSes and SaaSes and IaaSes under one giant umbrella. So, any questions?

[Question about rollback] No, so you can't safely do rollback with a lot of infrastructure. So what
it does is, because it's able to figure out what the dependent resources are, if there's an error applying a particular operation, any of the dependent operations subsequent to that won't be executed, because you've hit an error state. But the other changes that are not affected, the ones that can happen in parallel, Terraform will allow those to continue. And when that tree of operations is done, it'll say: this web server failed to provision, Puppet reported exit code one, we've marked this resource as being tainted. And it won't go ahead and destroy it, because of that strict plan enforcement I talked about before: Terraform promised you, I'm going to boot a web server; it didn't promise you, I'm going to boot it and then maybe destroy it. So what it will do is say, I've marked it as tainted, and the next time you try to do an operation, I'm planning on destroying this. So the next plan and the next execution will handle that cleanup. But because of its model of strict execution, it won't attempt to roll back, because that's not something the operator anticipated. Yep. Yep, so it's like, if you're getting a temporary error...

[Question about abstracting over providers] Yeah, we've thought about this. The problem is there's a large graveyard of people who have tried to abstract away the semantic differences, right? Because from a distance the providers kind of look the same, but up close there are enough gotchas that it becomes very challenging to do right. And it's usually not the general case that's the problem; it's once you start getting close to the edge boundaries that it's hard to mask the differences. So it is possible in the very general case, but as you start getting more specific, where it's like, oh, I need an IPv6 address and this particular provider only does IPv4, well, now it becomes very challenging to mask the semantic difference. So right now Terraform doesn't make an attempt to mask that; it exposes the semantics all the way through
because you're like, this is very clearly a DigitalOcean Droplet type; it's not general-purpose compute, right? So right now we don't take that approach. It's of course possible to build meta-tooling on top, where you say, generate me a DigitalOcean-specific configuration or an AWS-specific configuration, and you emit the proper Terraform code to do that, as long as you know that within your domain those kinds of edge cases won't bite you, if that makes sense.

[Question] Right, you could seriously make a magical provider that underneath just shims out to the right level. As a question of theoretical possibility, you certainly could; it's just doing RPC calls. Whether or not you'd want to is maybe a different question. It's possible; I won't say no. You could basically just forward the request to DigitalOcean and then pass through the results, just lie to Terraform core, and it wouldn't be able to tell.

[Question] Well, because the provider is required to implement the full CRUD cycle. So it tries to create, and suppose the create fails: then it knows, later I can come around and call delete and go back to fully resetting that particular instance.

[Question about resizing] It depends. Some providers allow you to do a resize; in DigitalOcean's case it will actually snapshot and resize the machine, but in AWS's case it'll have to do one of two things: either create the new one and then destroy the old one, or destroy the old one and then create the new one. Basically, it'll have to replace it. Right, so when you run the plan, it'll tell you that it plans to do a create-destroy operation: it's going to create, and then destroy the existing instance.

[Question about state] Okay, so Terraform is actually stateful itself. Terraform maintains a state file; once you run apply, it basically emits: this is the current state of the
world. You can run terraform refresh, where it goes, okay, I'm going to go out, query AWS's API, and see: is this still the state of the world? So it has a way to reconcile its view of the world with what the underlying providers think the state of the world is. But yeah, it uses its own state to track this, so it'll mark that particular instance ID as being tainted in its state file.

[Question about storing state] Yeah, it's a great question. So right now we recommend checking it into Git. We're working on a way to avoid this, by being able to easily tell Terraform to materialize my state when I need it, push it to a state store when you're done, and just pull it back down any time you're about to do an operation. So we want to move towards a world where the state file can be treated as a bit more ephemeral locally: you don't want to lose it, but it's just in transit, basically.

[Question about importing existing infrastructure] Yeah, so we're working on a recovery tool to be able to do this. One of the blockers was that the state store was in a binary format up to 0.2, and it's moving to a more human-editable format. The idea is there'll be a separate tool from Terraform that acts like CloudFormer: it just queries AWS and generates a Terraform state file, so it bootstraps you into the Terraform world. It covers most of AWS right now, yeah. I think there are some of the weird edge ones missing, but they're being added all the time. I don't actually know what the current status is, but it's pretty close. Oh, Mitchell might know. Right, right.

[Question] Yeah, so it goes back to almost the same question he had with the state store stuff: we're looking at being able to use a centralized repo, almost like Git, to pull that in and out.

[Question about environments] Yeah, we're working on making the environment a first-class thing for the state file, so it'll be clear, like, are you sure you want a new production? And you'd be like, no. Oh, sorry, I think there was a question over to the side. Go for it.
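The refresh, plan, apply loop he keeps returning to can be sketched as a CLI session (a minimal illustration; only the bare subcommands are taken from the talk):

```shell
# Reconcile Terraform's local state file with what the provider's API
# actually reports (this also runs implicitly before a plan)
terraform refresh

# Compute the diff between the configuration and the refreshed state;
# this is the strict promise of what apply is allowed to do, including
# recreating any resources marked as tainted by a previous failed run
terraform plan

# Execute exactly the planned create/update/destroy operations, nothing more
terraform apply
```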
Yeah, so this goes back to that terraform refresh I was talking about. When it builds that state file, it's a view of the world as of when it ran the apply. So say you don't run apply for a week, and, whatever, five nodes fail in AWS. The next time you run refresh (and refresh runs by default any time you run a planning phase), at the same time Terraform starts planning what it's going to do, it checks what the world actually is, and it'll notice, whoops, these five web servers I thought were running are actually dead. So its plan will be to recreate those five that have died. So you'll catch the drift that's occurred while it hasn't been running.

[Question about watching for drift] You could do something like that if you were seeing a lot of drift, but typically infrastructure doesn't drift fast enough for that to really be an issue; you're not losing machines every 30 minutes, hopefully. You totally could just put it on a cron and have it run a refresh and email you any time it sees drift from the last state. You could do that today.

[Question about drift during an apply] Right, that would probably expose itself as an error during the apply. Terraform will be like, whoops, I thought this node existed, I tried to do something to it, AWS said it doesn't exist. So that dependent update will fail, and the next time you run plan, it'll be like, oh, actually the node doesn't exist any more, let's do a recreate, and then continue. It's built into the model: it catches the error, you do another refresh-plan-apply cycle, and the next time you run it, it's able to account for the drift. Accounting for it during the apply would mean breaking our strict planning, because our plan said we're going to do an update, and if we detect the node doesn't exist and try to create it instead, we did something we didn't tell the operator about. And so we shy away from doing anything in that
vein, just because it makes it very hard as an operator, at large scale, to reason about what's about to happen. So we're very strict: if we say we're going to do an update, we try the update; if it fails, we don't try anything else, like we don't try a create. But it naturally fits in: you just do the whole cycle of refresh, plan, apply again, which really is all built into just apply with the right flags. So just run terraform apply again and it'll do the right thing, but it'll only do what it promised you. Oh, and it depends on your definition of strictly doing what it said, because it's like an idempotent delete at that point: it promised to delete, but the thing doesn't exist, so it's deleted, right?

[Question about who writes providers] We did a handful of them for the launch, but since then a lot of them have just been contributed; people have been adding features in. So it's kind of mixed: we do some of them as we have a need, and other people, as they have needs, pull them in. Yeah, totally. Right, yeah, exactly. Yeah, it's definitely a limitation we're aware of, and we think about it. That's why I knew right away, like, yeah, this one has issues; we think about this one. Yeah, we're thinking about it. That one has fewer of the backward-compatibility worries that apply to the state file, because the plan file is very ephemeral. So probably not in 0.3, but we're talking about the plan file as well; it's easier to change in the future because it only exists for that short period between plan and apply, so you're not checking it in or really persisting it long term.

[Question about scaling a resource] Yeah, that's a great question. So basically, count is one of a few meta-parameters that Terraform treats magically. You can define a resource as having some number of occurrences of that resource, and you can make that a variable and then change it. So you can say, you know, count of my web servers is four; then change the variable, through an environment flag or whatever, to five, and
then the next time you run Terraform, it just does that create. So it'll figure it out. You can defer to Amazon's auto-scaling model, where you just say, I have one autoscaling group, it has min one, max 50, and just let Amazon handle it; or you can move that into Terraform and say, I have web server count of five, change count later to ten, and Terraform will manage that for you. So it's up to you. If you want to use the auto-scaling stuff that's built in, fine; but if you want to scale something that isn't, say, an EC2 instance, then you have the opportunity to do that in Terraform as well. Again, it just depends: if it's something where you're happy with the kind of always-on nature of Amazon's autoscaler, then you can just represent that as well.

[Question about Docker] You could. The way Docker would be treated in Terraform: Terraform draws a very, very clear boundary against being a scheduler. Terraform is not an application scheduler; it's not in the world of Mesos or Kubernetes. So what you could do is say, on this machine I want count four Docker containers. It can totally do that, because that is a static acquire-and-provision kind of problem that fits in Terraform's world. It's not the dynamic problem of, I want four of these containers living on any one of my instances, because then you're in the world of scheduling, which is not a problem Terraform handles. It just runs at a different granularity, and it draws a very clear boundary there. So again, if it's, I have four cores, my Node.js app is single-threaded, I just need count equal four because I have four cores, that is something that fits in Terraform's model. If it's, I need count 4,000 and they have to just be somewhere on my cluster of 10,000, that doesn't fall into Terraform's world any more. Yeah, it would be a Docker provider.

[Question about plan-time policy checks] Yeah, no, it doesn't have any integration like that currently, unfortunately. The problem is, at the planning step, you want to be able to say, this is what I'm going to do, and right now, today, it can't. But it has a pretty generic hook system built in, so there are ways of adding separate hooks. You can add your own quota logic, where it'll go, hey, I'm about to boot 50 web nodes, is that okay? And the hook can be like, no, that's way over budget, and Terraform will just blow up and say, nope, the hook said no. So there are ways we're planning where you can integrate that kind of business logic. It's not currently part of the core, but the mechanisms are in place to do that kind of enforcement.

[Question about larger examples] The problem is the examples become pretty big, right? You end up with a lot of resources, so it's not something that fits on a slide. We have some online: go to the Terraform website, and under the intro there's a more realistic examples section or something; there are four or five. For a larger-scale one, I think there's one that's a blog post on how to deploy a four-tier Discourse blog setup, so that gives you a more real-world, okay, that's what it looks like when it's not just two resources kind of a thing. And I think, well, maybe we published it, but the Consul demo, which does a multi-region Consul cluster deploy, is actually all orchestrated with Terraform too. That one is DigitalOcean and a CDN and a DNS provider layered together, so that one is a more real-world example too, and I think it's checked into the Consul GitHub. I think it's in there; if not, just send me an email and I have it somewhere. Oh, sorry, I'll get him and then I'll get you.

[Question about health testing] Yeah, so you can kind of abuse the tainting system to do this. As part of the provisioning cycle, let's say you run a stress test against it. So you do
a stress test against the disk, and if it comes in just below ten, I'm going to return an exit code from my provisioner. Terraform goes, oh, this provisioner returned an error back; I'm going to taint this resource. Then on the next cycle it will do the destroy-recreate for you. So you can push that testing into a provisioner, and Terraform will just manage the taint lifecycle for you.

[Question about scale] That's a good question. I don't know what the scale points are on this thing; I can't give you a good answer off the top of my head.

[Question about the roadmap] So for community development: any time you run into a provider where you're like, this doesn't fit into the Terraform world, let us know so that we can get it into that world, or, even better, open a pull request for it. In terms of what's coming, 0.3 is the big one we've been working pretty hard on. It brings a lot of improvements in the graph representation and the state file representation, and lots of bug fixes. I think a few new providers: I think OpenStack is on the horizon, and CloudStack is also on the horizon. Remote state storage, state migration, and the module system are coming. So there's a lot of stuff coming down the pipeline right now. We want to do a little better job around things like parallel deploys; I don't remember who asked, but someone was like, if dev and ops are deploying at the same time, how do you make sure these things are done safely and the state is updated correctly? There's stuff around that. So mostly, core is able to move at a separate pace from the providers. Provider-side, we just want support for everything; that's easy, because as long as the core maintains the same API we can just keep adding more providers. And then on core, we're just trying to make it better, faster, stronger, so to speak. Mm-hmm. Oh, I see what you're saying: a more interactive workflow. We don't currently, but based on the way providers are written, it's something we
could add, so maybe a good thing to talk to Mitchell about, actually; he's more intimately familiar with it. The providers use a self-describing kind of schema; there are helper libraries that do most of the heavy lifting for the providers. So in theory it's possible to interrogate the schema, to just ask, what is the schema for this resource? And it'll say, well, I'm going to give you an IPv4 and an IPv6 address, and whether or not you're in a VPC. So theoretically the schema knows this, and we could build some kind of interaction to let you query it, because it's in there, it's codified; it's just not exposed other than in the documentation. Yeah, we've thought about it; we're just not quite there with that. There's enough work that has to happen on the provider and core side that we're just not quite there, but yeah, it would be awesome.

Cool, awesome. Well, thanks a lot, guys. Thanks, Armon, appreciate you coming out and delivering that. You're welcome to hang out here with us and have another drink if you'd like, you know, chat some more with the guys. And thanks, Mitchell, as well, for popping in as a guest; appreciate it as well. So that's it for the night. Thank you.
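As a footnote to the Q&A, the module and count patterns discussed above might look roughly like this in Terraform configuration (a minimal sketch; the module path, variable names, and output names are hypothetical, not from the talk):

```hcl
# Consumer-side, black-box view: a Consul cluster sized by one input.
# "servers" is a hypothetical variable the module would declare; its
# internals (boot, join, orchestration) are the operators' concern.
module "consul" {
  source  = "./modules/consul"   # hypothetical module path
  servers = 3
}

# An output defined by the module lets the consumer pull out just the
# piece they care about, e.g. one Consul node's IP address.
output "consul_ip" {
  value = "${module.consul.first_node_ip}"
}

# The count meta-parameter on a resource: change web_count from 4 to 5
# and the next plan is to create exactly one more instance.
variable "web_count" {
  default = 4
}

resource "aws_instance" "web" {
  count         = "${var.web_count}"
  ami           = "ami-12345678"   # hypothetical AMI ID
  instance_type = "m3.medium"
}
```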
Info
Channel: SF CloudOps Meetup
Views: 26,556
Keywords: cloud, Cloud Ops, CloudOps, internet, DevOps, Dev Ops, TechOps, Tech Ops, The Internet (Issue), Cloud Computing (Industry), Software Development (Industry), Software As A Service (Industry), Software Engineering (Algorithm Family), Terraform, Hashicorp, Infrastructure (Industry), Technology (Industry), Infrastructure Asset Management, Automation (Industry), Amazon Web Services (Website), OpenStack (Software), Open Source (Software License), Infrastructure As A Service, Hybrid Cloud
Id: WdV4eYZO5Ao
Length: 56min 17sec (3377 seconds)
Published: Wed Oct 08 2014