Real World Elixir Deployment // Pete Gamache

Captions
Hi, my name is Pete Gamache. I'm head of engineering at a little startup called Appcues in Boston, right near Fenway Park. I've spent the last six months learning how to deploy Elixir in the context of a long-lived Erlang cluster, and I am here today to give the talk that I wish I had heard back then. How many of you want to be using Elixir in production? Sweet. I'm going to throw a small amount of cold water on you.

All right, I'm going to start at the end, with the summary of this talk, so you can leave if you want. Elixir and Erlang are absolutely a rock-solid platform; I don't think anyone in this room is going to debate that point. It's a good starting point for building many different types of applications. Erlang excels at traditional long-lived server clusters, and I'm going to go into more detail about what makes it "traditional" rather than a more modern approach, and why Erlang is so good at it. Modern deployment practices do not optimize for long-lived clusters; I think most people here know this. It's all about servers that poof into existence and poof right back out when you're done with them. And deploying Elixir, there's going to be pain. You're going to hurt, and I'm not going to sugarcoat this for you. But if you've chosen the platform for the right reasons, you can still get away with it and come out ahead.

So at this point I'm going to step back a little bit, and let's talk about what makes an ideal platform. I came up with a little list of platform desirables; none of these are terribly controversial. Horizontal scalability: essentially being able to throw money at a computing problem and just add more servers. Very desirable, because the alternative is that you make bigger servers, and eventually you hit a wall: you can't jam in any more RAM, you can't jam in any more storage or CPU. Fault tolerance: everyone likes it when their big distributed system does not completely fall apart when one small element of it does. Zero-downtime upgrades
and rollbacks: certainly for those of you coming from the Ruby community, this was a hot topic a few years ago, and it's a hot topic everywhere. Everyone wants to be able to upgrade to the next version without downtime, and if you can do that, everyone doubly wants to be able to do it in a rollback, because in a rollback your hair is on fire and there's a time limit. Stability and performance are just table stakes; very few people are willing to use a janky platform for an internet service. And finally, this last one is for us humans: ease of administration and maintenance. A lot of things are theoretically possible if you want to climb into the Turing tarpit; I'd prefer to focus on what I can get done today-ish.

So let's talk about an average platform. I'm using the word "average" not as a disparagement, but just to mean the median these days. The average platform on the internet is very good: if you go to websites and the websites return 200 OK and serve you what you want, that means things are working. So let's talk about what works now. I arbitrarily labeled this "DevOps 2010." You have fleets of servers; they're stateless; they don't care about any of the servers around them. Depending on what sort of workloads they're serving, they're either in a load-balancing pool or they're taking work off the same shared queue. And since they are incredibly stateless, all message passing to other services (other microservices) happens externally, all state is held externally, and these servers are just disposable.

So in this system, when you want to upgrade (you've written some new code and you want to deploy it), great: you start up a bunch of new servers running that code, you add them to the pool, and you remove the old servers. If you want to do it right, you have to starve them of input for a little bit, do the connection-draining thing,
let them finish the work they currently have slated, and then, who cares what happens to the old servers: you take them out of the pool and they're done. There's a slight enhancement to this, alternately called hot/cold deployment or blue-green deployment, where you keep two sets of servers running and only one of them is serving traffic at a given time. The advantage is that you don't have the overhead of spinning these servers up (they're already there), you can deploy to them fairly quickly, and if you need to roll back, you basically just flip the switch to point at the old ones rather than the new ones and you're out of trouble. I think a lot of us here have done that.

Now, there are a lot of moving parts in the system I just described. You have to have good scriptability over your provisioning; this sort of implies you're using virtualized servers just to even try doing this, and if you're doing this on hardware, you're probably virtualizing it and running VMs on it. You need good control of the load balancer. You need either very good intuition or some good insight into the connection draining, and into who's still working on what. So it's achievable, but it is a bit of a burden.

Now I'm going to skip to "DevOps 2016." This is the future. Pretty much all of DevOps 2010 still applies. Containerization is sort of like an eight-in-one multi-tool: it's good for encapsulation of services, it's good for distribution of services, it's good for security, it's good for this, it's good for that. All right, great. Heavily automated: that is one of the themes of DevOps 2016. In order to use all of these super-spiffy tools, there is a lot of tooling you need to make sure all of the hoses are connected and all of the wires are plugged in, or else it doesn't work. There are also some far-out ideas which are beginning to take hold. Sort of; I don't know exactly the
right way to put it: resource pools that can execute stuff. Being able to have a bunch of servers act as an amorphous blob of compute, storage, and RAM, and then assigning tasks to run on that. AWS Lambda is a great example. I don't know if they're using Mesos under the hood at Lambda; if not, they had to write something that does substantially the same thing. But it's pretty neat to have a pool of resources and not have individual computations tied to particular servers at any given time.

So does the average platform actually get there? Horizontal scalability, fault tolerance: check. They do it; can't deny it. Zero-downtime upgrades and rollbacks: they're mostly there. It takes some dancing around, like I mentioned, but you get there. Stability and performance: obviously. We all use the internet, it remains there for us to use, and it's not all written in Erlang and Elixir. Ease of maintenance, though? I am not so sure we got there.

So now let's turn our eyes to Erlang: the Erlang platform, not the language. Things that Erlang does very well: extreme fault tolerance was one of the goals of the platform from the very beginning (telephone switches are not allowed to go down, ever), and you achieve this fault tolerance through a very smart supervision API that's baked right into the core language. It's very simple to scale an Erlang cluster into the dozens of nodes, and this is an actual cluster: every machine knows about every other machine. So it's not the kind of thing where you can practically throw five thousand machines at it, cluster them all together, and set them loose on a problem, because your network is just going to melt due to the crosstalk. But for a couple dozen servers doing roughly the same thing, you're good. Zero-downtime upgrades and rollbacks: we know Erlang has them, and they're fast. They are really fast. Once you've built the release, the upgrade is typically just a thumbs-up or thumbs-down: you made it or you didn't. Stability and
performance: we know it's there. Now, there's one thing missing, and that is what Elixir helps with: creature comforts. Ease of administration, ease of maintenance, ease of writing code on the damn platform to begin with. I love Erlang the platform; I think Erlang the language looks like Prolog and ML got into a car crash in the '80s sometime, so I'm kind of happy to leave that part alone.

So, Erlang clustering. It's been around a while, and there aren't a whole lot of other systems that do this. It's about making multiple machines all work together on a single workload in an internally stateful way; the machines in this cluster know about each other. You can do things with this. OTP apps are a really nice way of encapsulating pieces of functionality; people would call them microservices, and it's certainly like a microservice if you squint hard enough. An Erlang cluster is basically a little Mesos-in-a-box: if you don't need massive computation or scale, you can set something up very quickly to do this sort of thing. And because you can cluster so easily, you can postpone sharding your workloads onto different servers or different clusters, and you can postpone bringing in outboard message brokers, whether those are HTTP services or you're putting things onto a queue or a stream. You can take care of everything in the box. Now, this does not look like DevOps 2010 at all, but it achieves a lot of the things that DevOps 2016 is trying to do with what I'm going to call lesser platforms. I decided that if I'm going to buy into this platform, I want to really buy into the platform. Certainly you can use Erlang any way you want and you're going to get some wins out of it; I wanted more. Startups love time-to-market advantages, and everything I've just described (really easy to set up a cluster, really easy to run multiple microservices on it, no need to set up external queues or streams) gets you to market faster.
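As a taste of how little ceremony that clustering involves, here is a minimal sketch from two IEx sessions. The node names and cookie are illustrative (not from the talk), and both nodes must share the same cookie:

```elixir
# Terminal 1: iex --name a@127.0.0.1 --cookie demo_cookie
# Terminal 2: iex --name b@127.0.0.1 --cookie demo_cookie

# Then, from node a:
Node.connect(:"b@127.0.0.1")   # true once the other node is reachable
Node.list()                    # the connected peers, e.g. [:"b@127.0.0.1"]
Node.spawn(:"b@127.0.0.1", fn -> IO.puts("hello from b") end)
```

Once connected, every node learns about every other node's peers automatically, which is the full-mesh behavior the talk describes.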
Now, don't use it just because it is the shiny new thing. If you're bringing this into your company, you're putting money on the line, and it is worth evaluating exactly what you're going to gain and where you're actually going to pay the toll. If Erlang and Elixir have features that will save huge amounts of programming time on your project, it may well be worth it to sit through the occasional pain of deployment. It works when it works, and that's really nice, but I'm not going to pull the wool over your eyes: it is going to blow up in your face sometimes. Also, if things like maintaining connections don't matter, if dropping connections isn't a big deal, consider DevOps-2016-style deployment of the Erlang platform. It will work, it will work fine, but you are leaving certain features of the Erlang platform on the table. Ultimately, nothing is free: once you get into the hundreds of servers, you are going to need to make some tough decisions about what you're going to extract into other services. You won't be able to automatically operate at big scale, but at small to medium scale, it's got you covered.

So, deployment. Roughly speaking, it's the sum of your tools and your network topology. On topology there's a lot to say, and I am not the most qualified person in the world, so I'm just going to hang back and let other people say it. For the purposes of this discussion, let's assume you have two of everything, because that makes it a much more interesting discussion when we're talking about clustering. So let's talk about tools, then, and not topology.

This is what Erlang gives you; this is an excerpt from Learn You Some Erlang. You may notice I have skipped steps 5 through 21. They're all "copy this file to this place, and edit this, and then run this command, and then run this command." "You might want to write a few scripts to automate this." Yeah, you do. And it sounds like the vast majority of you who are
making releases are using our friend exrm, the Elixir release manager. This handles most of steps 1 through 25; there are a few it doesn't. I think one of the steps is "make sure your code doesn't have bugs," and crap like that. But it takes care of generating an Erlang release from your code, which is basically a package containing everything your application needs to run, including Erlang itself. Just put it in place; it gives you a command that will start, stop, restart, attach a console, and run RPC calls.

But then you've got to put those releases on servers, and this is the part I found to be a little bit lacking when I first started to investigate how I was going to get away with this. Then I found a nice tool called edeliver. Anyone here use it? Cool. It's based on deliver, which is a deploy tool written in bash. You can think of it as similar to something like Capistrano, except the main difference is that when Capistrano has 15 tasks to accomplish, you SSH into your server 15 times in a row, whereas deliver will make a bash script that is 15 steps long, transfer it to the server, and just execute it. I'm kind of agnostic about that part of the tool; the point is that these fine people at boldpoker, an online poker firm in Berlin, decided to write an Erlang-and-Elixir delivery tool on top of deliver. It works with rebar, mix, relx, exrm: basically all of the tools we're used to using to generate these releases and version our code. No problem. It handles building the code, deploying the code, versioning, upgrades, downgrades. There are some newer features where it claims to handle database migrations as well; I'm not as familiar with those, so I won't speak as much about them. But the general core of the functionality is pretty easy to use. You add it as a dependency in mix.exs, as you might expect. There's a config file that you fill out; I will show an example. You set up your servers with the correct
logins and directories, config files where they need to be, vm.args where it needs to be. You use git tags to mark your mix.exs version; this is not strictly mandatory, but I'm going to strongly recommend you do it. And then you run mix tasks to make money. Do it!

OK, so here we have mix.exs. I added edeliver to not only the deps section but also the applications section: it runs a little service which can be pinged by edeliver to tell you what version of your code you're running, which is pretty convenient. Next up is the edeliver config file. It's a bash file, so it's real easy to understand. The important parts: APP, which you set to your app name. BUILD_HOST, BUILD_USER, and BUILD_AT: that's your build server. STAGING_HOSTS, STAGING_USER, TEST_AT: there's your staging setup. PRODUCTION_HOSTS, PRODUCTION_USER, DELIVER_TO: there's your production setup. The staging and production hosts can be space-separated lists of hostnames or IPs, so you don't have to do it on one server; you can use more. For this example I just set everything up on localhost, because, you know, I own it. I also have MIX_ENV and LINK_VM_ARGS in there; MIX_ENV is a hack I put in to make LINK_VM_ARGS work. Does everyone know what vm.args is? Actually, no: I talk about it in like two slides, so I'll shut up.

A little further down the edeliver config is pre_erlang_get_and_update_deps. That's a hook edeliver provides where we can add some code; in this case I copy all of the config files into the config directory. My service is structured so that if you don't have dev.secret.exs, staging.secret.exs, and prod.secret.exs all there, it won't compile, and I like it when it compiles. git_checkout_remote I'm going to talk about later; it becomes important during the gotchas section, and there is a gotchas section. The directories are pretty simple: BUILD_AT, TEST_AT, DELIVER_TO. Don't even create them; just make sure your build, test, and deploy users have the permissions to create them automatically.
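To make that concrete, here is a sketch of a minimal .deliver/config. Every host, user, and path below is a placeholder, not from the talk's actual setup:

```shell
#!/usr/bin/env bash
# .deliver/config -- plain bash, read by edeliver (illustrative values)

APP="my_app"                       # name of your app/release

BUILD_HOST="build.example.com"     # where releases get compiled
BUILD_USER="deploy"
BUILD_AT="/home/deploy/builds"

STAGING_HOSTS="stage1.example.com stage2.example.com"  # space-separated list
STAGING_USER="deploy"
TEST_AT="/home/deploy/staging"

PRODUCTION_HOSTS="prod1.example.com prod2.example.com"
PRODUCTION_USER="deploy"
DELIVER_TO="/home/deploy/production"

# symlink a per-server vm.args into the release (path is illustrative)
LINK_VM_ARGS="/home/deploy/my_app/vm.args"
```

Since it is just bash, hooks like pre_erlang_get_and_update_deps are ordinary shell functions you define in this same file.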
Configs go in config/; vm.args files go on each individual server you're deploying to. This is a subtle distinction the docs don't make blazingly clear: the configs live on your build server, while vm.args lives on the individual servers. And this is important, because that is how the servers know their own names.

A really quick clustering how-to. There are params in vm.args: these are the arguments you start the Erlang VM with, and you can keep them in a file that gets automatically applied as part of your release. There is also config you put in your usual config.exs, with sync_nodes_mandatory or sync_nodes_optional. There are a lot more parameters you can play with, and the documentation I give the URL for describes them much better than I will, but here's a practical example. Up top, I set the name of the node; there can only be one node with this name in the Erlang cluster. I set the cookie: this is the shared secret for distributed Erlang, so protect it with your life. And then in prod.exs you'll notice I have sync_nodes_optional. I use optional because when one of my nodes goes down, I don't want the others to deliberately follow it. So there are my two server nodes; there we go.

Now, versioning. edeliver has introduced some new features around both versioning and auto-versioning. Use them at your own risk; I would not recommend it. Every time I've tried, there has been a small explosion. No one was hurt, but I stick to this: every time I commit a change where the version in mix.exs changes, I make a git tag with that version. Exactly 0.0.1, not v0.0.1, not version-0.0.1: exactly the semver you're putting in mix.exs. It just works better that way. Maybe in a few months we'll have more choices; I didn't write the talk for then.
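Pulling those pieces together, a per-server vm.args might look like this sketch; the node name and cookie are placeholders:

```
## vm.args -- lives on each individual server, not on the build server
-name my_app@10.0.0.1        ## unique node name; this is how the server knows itself
-setcookie change_me_please  ## distributed-Erlang shared secret: protect it
```

The matching piece of prod.exs would then set something like `config :kernel, sync_nodes_optional: [:"my_app@10.0.0.2"]` (peer name illustrative): optional rather than mandatory, so losing one node doesn't deliberately take down the others.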
So we're going to take it out for a spin. It's pretty easy, and that's one of the reasons I was drawn to it: it's really simple to use. Let's build our first release. We type this magical incantation, mix edeliver build release, telling it what version we want to release, and it does some stuff; you can see it. It pushes... oh, I should mention this, it's important: this uses your local git repository, not, for instance, the GitHub main repo. It uses the repository that is on your computer, so you don't actually have to push your commits or your tags; you can keep them local and push them when you're super sure of them. So: it pushes commits to gamache@localhost, which is my build host in this case, does the needful, generates a release, and copies it back to my local computer.

When we want to deploy it: mix edeliver deploy release to production, with the version number I want to deploy. And since this is the very first time, I say --start-deploy, because it wasn't running before and I want it to start. It uploads the release from our local releases, extracts it, and starts it. That's that. Here's a little example: we go into our production deploy directory (this is what an unpacked release looks like) and run bin/my_app (change the name for your application). If you ping it, it should pong. We started it, and we see that it does.

Now for the fun stuff: hot upgrades. That is why we all showed up today; some of us took trains and planes to talk about hot upgrades. So get comfortable. I bump the version number to 0.0.2 (it's one more than 0.0.1), git commit, git tag, then mix edeliver build upgrade from 0.0.1 to 0.0.2. A very straightforward command once again, and in this lovely ideal case it just worked: all of a sudden our application has been hot-upgraded to 0.0.2.

Now, the hot-upgrade process is pretty neat. What's actually happening behind the scenes is that there is a set of files called appups: scripts that say, for each OTP application, how to transform the old-style state into the new-style state.
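Condensed, the whole cycle so far looks roughly like this. Versions are illustrative, and flag spellings are from memory, so check `mix edeliver help` against your installed version:

```
# first release
mix edeliver build release
mix edeliver deploy release to production --version=0.0.1 --start-deploy

# later: bump the version in mix.exs, then
git commit -am "bump to 0.0.2" && git tag 0.0.2
mix edeliver build upgrade --from=0.0.1 --to=0.0.2
mix edeliver deploy upgrade to production --version=0.0.2
```

These commands talk to the build and production hosts configured in .deliver/config, so they only make sense with that infrastructure in place.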
Then the relup (the release upgrade) collects all those appups into one, and that is the script of what happens when you're actually performing an upgrade. edeliver will make the default appups and relup for you, so essentially I'm just saying nothing needs to change. I haven't changed the shape of my data anywhere, because, one, I didn't have to, and, two, it's kind of an involved thing to crack open an appup file, and I would prefer to just avoid that. What actually happens when the appups are all executed: we start a whole bunch of new code at the new version, and it receives all the messages you would send to those processes in the old version. The old version stops getting input, but it's allowed to continue executing. Anything that's two versions old, which has probably just been sitting there for hours or days or weeks, gets killed. So basically, unless you do two of these upgrades in extremely rapid succession, the connection draining is just taken care of by Erlang. You didn't have to write any code to do it. Love that.

Deploying the hot upgrade is just as easy: mix edeliver deploy upgrade to production, you say the version, and it just does it. Here's a little magical demonstration. I added a little function to my app that just returns the version of the application; it's a lovely little code snippet, and I use it all the time. I open a remote console while it's running 0.0.2, and MyApp.version spits out 0.0.2. Then I say it's upgrade o'clock: I open another terminal window, build an upgrade to 0.0.3, and deploy it. I call MyApp.version again, and there I am: 0.0.3. I didn't have to disconnect. This extends to web connections, WebSocket connections: you don't drop connections. It's really nice.

But it does not always work. It works a lot of the time, but it doesn't always work.
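For reference, an appup is just a file of Erlang terms. A trivial one that merely reloads one module on the way up or down might look like this sketch (module name and version strings are illustrative):

```erlang
%% ebin/my_app.appup
{"0.0.2",
 [{"0.0.1", [{load_module, 'Elixir.MyApp.Worker'}]}],   %% upgrade instructions
 [{"0.0.1", [{load_module, 'Elixir.MyApp.Worker'}]}]}.  %% downgrade instructions
```

When the shape of your state does change, this is where `{update, Mod, {advanced, Extra}}` instructions (which drive `code_change/3` callbacks) would go instead of plain `load_module`, and that is the manual work the talk is glad to avoid.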
Sometimes some of the software you're using has changed its shape (the shape of its state, rather), and you need to take that into account manually. Sometimes certain packages are just not in a position to be hot-upgraded; Cowboy is a good example. It is very difficult to not drop connections when the thing holding those connections is what you're trying to kill and restart. So keep your Cowboy upgrades infrequent and I think you'll probably be fine. And sometimes, you know what, sometimes it just feels like it's not in the cards. I am not a scientist about this stuff; I've just been in the trenches for a little bit, and I've seen good men die, and I don't have great explanations for it. But I do have some ways we can make it around the mines in the field.

If you can't upgrade hot, then rolling upgrades are your best fallback. Here we are back in DevOps 2010: you pull servers out of the pool, you restart them, you put them back in the pool. Depending on how much you care about people and how your app is structured, you might be able to just not take them out of the pool and restart them anyway. Rolling upgrades will feature a non-trivial period of time where you have multiple versions of your software running at the same time, so there's a best practice to follow, very similar to doing a gentle database migration: don't immediately assume that the new column is there and the old column is gone. Write your code in a transitional way. It takes a little bit longer, and there's a little bit of temporary technical debt, but it makes the whole process a whole lot less error-prone. For a rolling upgrade, you build a release and you deploy the release, probably without --start-deploy; if you give it --start-deploy it will restart all of the servers at the same time, which you might want, but probably don't. Then on each server you execute the restart yourself: just cd to
your release directory and run bin/<app_name> restart.

Rollbacks: super easy. They are just upgrades. If you could upgrade from A to B, you can roll back from B to A. This is immensely convenient, because upgrades generally take place in well under a second. So when you see that you have just deployed the worst code in the company's history, everything is going haywire, and customers are at your door with pitchforks and torches, all you do is "upgrade" to a lower version number, for which you've already built a relup release, and it does what it has to do.

Now, I mentioned migrations and the fact that edeliver now supports them; I've never used this feature. Also, on the Plataformatec blog about a month ago there was a nice article about how to stick a little bit of code into your project so that you can run your database migrations from the release. Honestly, what I do, because my build server is on the same network as my production servers, is log into the build server and run the migration, like I'd run a migration at my desk. It works. You can choose any one of these three.

Now for the fun part: I ran into a lot of problems over the last six months, and I was genuinely surprised at some of them. Here comes one of the least surprising ones: ulimits. How many people have dealt with ulimits before? Unix does this wonderful thing where, in the Unix philosophy, everything is a file. A network connection is a file, and Unix made the decision early on that if too many files are being used, this implies there is a problem on the machine, so the number of open files is limited. This can screw up your day in a number of different ways, but in Erlang, which is pretty much known for the number of connections you can keep open concurrently, it's a real pain in the butt. So you need to set your ulimit very high. I won't describe how to do that, because it's operating-system-specific; read the man pages, and set it very, very high.
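On a Linux box, "very high" might look like this sketch. The user name and the number are illustrative, and the exact mechanism varies by distribution and init system:

```
# /etc/security/limits.conf -- raise open-file limits for the deploy user
deploy  soft  nofile  1048576
deploy  hard  nofile  1048576

# verify from a fresh login shell with:  ulimit -n
```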
And then in your vm.args there is an environment variable called ERL_MAX_PORTS, which sets the maximum network concurrency of an Erlang node. Set that very high too, once you've unlimited your ulimit.

Exceptions. Exceptions, when thrown too quickly, can cause death in parts of your application. This is generally taken care of by a supervision tree, but depending on what part of your application is throwing exceptions very, very quickly, and depending on how dumb you were when you set up (or did not set up) your supervision trees and are just falling back on what you're given, maybe your entire app is going to die. I'm not going to say this happened. This happened. So here's a crap solution, but it's not as bad as the one I'm about to tell you. In vm.args there is a -heart parameter you can give, and what it does is set up a heartbeat watcher process separate from your Erlang process. If Erlang dies, if the whole VM packs up and flies south, the heartbeat monitor will restart it. It says right there in the vm.args file: use this with caution. So, do that.

When you're ready to throw caution to the wind, there's solution number two, which I like to call the brick-on-the-pedal pattern. You can run a release's start command an arbitrary number of times; if it's already running, it won't start it again, it'll just say "hey, I'm already started." So, yeah: if you cron that every minute or so, you end up with pretty good coverage. I haven't calculated precisely how many nines that is. And at the bottom of the slide is the less glib solution: fix your code. That's really the only way to avoid this; you shouldn't be throwing exceptions all that quickly.

Here's something I didn't really know could happen on the BEAM: in pure Elixir code, I was managing to segfault. Maybe it was cosmic rays; I don't have good answers for you. But the same instructions as the previous slide apply: you just need to make it not do that. I'm not concerned with how you do it. Make it stop.
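The brick-on-the-pedal cron line mentioned above might look like this; the path is illustrative, and it leans entirely on the release script's refusal to start a second copy:

```
# crontab for the deploy user: try to start every minute; a no-op if already running
* * * * * /home/deploy/production/my_app/bin/my_app start >/dev/null 2>&1
```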
Here's an edeliver-related gotcha: sometimes a release will stick around a little longer than you want it to. For instance, if I build an upgrade from X to Z, put it on the server, and then decide, "you know what, I actually needed the upgrade from Y to Z; I forgot that I deployed Y yesterday," it might not ever actually unpack that new tar.gz. It transfers it to the server and then goes, "oh, it's already unpacked, I'm just going to use that." The solution is to just blast the directory. Not complicated, but I ran into it a couple of times.

Here's one for production deployment that I also did not expect. When I went into production and started testing, I was seeing 400s come back when trying to use the WebSocket feature. Now, I had tested this extensively on my own machine, tests all around it, and I was getting these 400s and couldn't find them in the server logs. That was the clue: I didn't find them in the server logs, so I was hitting a wall before I even got there. It turns out that in many cases you need to be very specific with whatever web proxy or load balancer you're using that you want to support WebSockets. Amazon Elastic Load Balancer, which is how I'm deploying in production, gives you two sets of roughly identical settings. HTTP and HTTPS don't work with WebSockets, but if you use TCP on port 80 and SSL on whatever port you're using, it will work exactly like HTTP and HTTPS, except it will also do WebSockets. I don't even know why they offer HTTP and HTTPS, considering they're somewhat crippled and TCP and SSL seem to work just as well. With nginx, there are a couple of proxy headers you need to send along that are not in most of the example nginx web-proxying tutorials on the web; those are the ones.

You'll have to excuse the old-style screenshot on this one. Sometimes you can run into git problems. I ran into this one on Monday; it's a late addition to my talk.
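Backing up to the nginx point for a moment: the proxy settings in question are the standard WebSocket upgrade headers, roughly like this fragment (upstream name illustrative):

```
location / {
    proxy_pass http://my_app_upstream;
    proxy_http_version 1.1;                  # WebSockets require HTTP/1.1
    proxy_set_header Upgrade $http_upgrade;  # pass the Upgrade header through
    proxy_set_header Connection "upgrade";   # mark the connection hop-by-hop
    proxy_set_header Host $host;
}
```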
What was happening is that, in between two versions, it would build one version, and the process of building it would mutate mix.lock. Then it would say, "oh wait, you have changes; you totally cannot check out anything until you stash your changes." That pissed me off, and it was stupid, and I didn't know how to solve it. So the first thing I did was go into the source code for edeliver and grep for the phrase "checking out," and I found it: the piece of code that was blowing up. I noticed git_checkout_remote was the name of the function, and I thought, what if I could just override that in my config file? That seems really simple; could it actually work? It worked. You'll notice I've updated it to say "checking out revision (the rad way)," because we're doing better than we used to, and you'll see on the second line from the bottom that git stash takes care of that mix.lock file I do not care about. And then it works: you see it checking out 0.0.2 the rad way and going along its merry way, doing what I just told it to do.

So, we're near the end, and I'm going to give you another summary. In my opinion, Elixir and Erlang deployment shouldn't look like what you do on lesser platforms; if you want to take advantage of the full platform, you're going to have to leave some of the current best practices in DevOps aside, for now. edeliver gets this right so much of the time. I want to say ninety percent, and I think that's about right: I've made 84 releases of my production API in the last few months, and how many gotcha slides did you see? Six, seven? And a couple of them happened twice. So I'm just going to say that 90% of the time, you're going to be okay. And even when it does blow up, the amount of time you spend on that may well be far less than you would have spent on something else. In my case, I'm using Phoenix channels very heavily, and if I had to write that on another system it would have taken weeks to get right; in Elixir with
Phoenix channels, I think I spent maybe two hours on that part of the code, and I've barely touched it since. So for me, I think this was a good bargain. And the final takeaway is that 2016 is very early in the Elixir (and Elixir deployment) timeline. The book is still being written on this stuff, so the problems you're going to have today might not be there six months from now, and hopefully most of them will have vanished a year from now. That's what I've got. Questions?

[Audience question.] Sure. The question is, have I considered using Ansible along with this. I don't go into that in this talk, but my servers are provisioned with Vagrant and Puppet. I chose that rather than Ansible because I don't know Ansible from a hole in the ground where DevOps is concerned, and I managed to make something work with Puppet.

[Audience question.] When I say edeliver is under active development, is that daily, weekly, monthly? I would say daily to weekly. It's moving. I don't want to call it a moving target, because they haven't made breaking changes, but they are actively working on it, and there is opportunity for the rest of us to pitch in. I may end up kicking some patches back to the project.

[Audience question.] So the question is, doesn't this run afoul of the servers-as-cattle metaphor, because I'm treating my servers as pets. I am treating my servers as pets. Which is not to say I can't banish one of the pets from my cluster if I need to, or bring new ones in, but I am not intending to immediately bring them to the slaughter when I'm done with them.

Maybe one more? I do not have any comments on umbrella apps and deployment, because I have no experience. But if I've got a couple of minutes, I could just, like, talk; I have no expertise. [Host:] You've had 40 minutes to talk! Well, thank you. Thanks!
Info
Channel: EMPEX Conference
Views: 8,974
Rating: 4.9572191 out of 5
Keywords: EmpEx, Elixir, elixir-lang, NYC, @gamache
Id: H686MDn4Lo8
Length: 40min 36sec (2436 seconds)
Published: Fri Jun 10 2016