ElixirConf 2021 - Mark Ericksen - Globally Distributed Elixir Apps on Fly.io

Captions
[Music]

When I first started learning about Elixir and the BEAM, I got this idea somewhere that you could have a cluster of nodes that would cross continents. I'd seen that as a demo of "this actually works," and I thought, man, that is so cool, I have no idea how to actually do that. And that's really the problem: it becomes an operations problem. It's not that Elixir can't do it; it's operations, it's networking, it's data centers, and how do I cross-connect things? So I forced myself to think smaller, to put my ideas into a box: the us-east Virginia box, right? Just kind of limit it.

But there's this quote from Joe Armstrong, one of the creators of Erlang and the BEAM. He says, "I have this idea in which we'll connect all the world's Erlang systems to each other. Imagine if every process could talk to every other process worldwide." You see, Joe was thinking: what if? What if we could connect all these things? What could we do then? We could bring ourselves out of the little boxes that we've limited ourselves to. And really, I don't know what we could do then. It takes the lid off, and lots of things become possible.

So I'm going to start small. We're going to address some of the most common problems that you have when you try to go multi-region. We're going to talk about going global on Fly.io: taking your Phoenix app, which is using a Postgres database, and going multi-region. My name is Mark Ericksen. I work at Fly, and I'm also a host on the Thinking Elixir podcast.

So there are three problems we have to address, and I guess they're not so much problems as they are constraints. The first one is that we're going to talk about an ordinary Phoenix app. That's just a constraint, not a problem, and we'll define what that means more in a second. Number two, the speed of light: that is a constraint, and we
have to be able to deal with that. And number three is Postgres as our database. Postgres was never created to be a multi-master, eventually consistent database, so we'll talk about how our apps can still go global while using Postgres. We'll talk about replicas: how replicas help solve some problems but introduce other problems, and then how to solve those problems.

First, let's come back to our ordinary Phoenix app. When we talk about an ordinary Phoenix app, we say you're using a regular Ecto Repo, you're using Postgres, you have contexts, and it's a read-heavy app. This is your traditional SaaS-style app, where your user is navigating around, clicking things, maybe toggling the state of something, maybe editing something, but most of their actions with your app are read operations. That's where this works really well, but I think that's most apps. And I'm hoping that your app is using LiveView, because this works really well with LiveView. As Chris was talking about, LiveView becomes super awesome when you get close to the users.

But specifically, we're talking about an app that was not designed for being multi-region. This was something that I wrote on my own laptop, with a single machine, my dev machine, and the code that works on my dev machine is what I want to be able to take global. So that's the goal: go global with an ordinary Phoenix app that I don't have to rewrite. Honestly, I think that's the holy grail, because how everyone else does it is you have to engineer for that from the beginning or re-engineer everything.

The benchmark we're going to use is this ordinary code. Everyone has code like this in their app: I'm doing an insert, and then I'm turning around and doing a read. The problem is, this actually doesn't work when you go multi-region. We'll cover why it doesn't work and what we can do about it. But that's the goal, that's our benchmark: to see, can we
make this ordinary code work?

Now we're going to talk about the speed of light. Here are the speeds in meters per second and miles per second. The point is, distance actually matters. For the sake of this demo, the distance we're talking about is Los Angeles to Sydney, Australia. That's 7,400 miles, 12,000 kilometers. If I wanted to take a flight from Los Angeles to Sydney, I would be on an airplane, a direct flight, for fifteen and a half hours. That's about the worst case I could think of. But we're actually in Austin, which is a little further east, so it's actually over 8,000 miles. So we're now an even worse case than what I built this for.

What we want to talk about is that 12,000 kilometers requires light 40.2 milliseconds to travel. If light were traveling in a straight line, that's how long it would take just to travel that distance, and that's the theoretical fastest time. But the world isn't ideal, right? The electrical signals traveling through the wires or the fiber-optic cables aren't taking a straight line. They're going through some other path, and there are routers along all these different hops where the packets have to be inspected and then forwarded on to another route. So we know it's going to be slower than the theoretical minimum. We're also having to deal with network congestion, dropped packets, and retries.

Just thinking about the multiple trips is where it becomes a real deal killer. It takes 40.2 milliseconds to make a request, and getting a response takes 40.2 milliseconds. Doing that back and forth, just in transfer time alone, is 160 milliseconds added. And that's the ideal, and we don't even have the ideal. If I check the ping time from my server in lax to my server in Sydney, I'm getting 148 milliseconds. So that's realistically the distance between my two servers.

So how
do big companies deal with the speed of light? You've got Netflix, Xbox with live gaming, Cloudflare. The way they do it is they actually move the servers closer to the users. They just shorten the distance. They don't fight the speed of light.

But that's where it gets really hard, because then you start dealing with multiple regions and multiple data centers. How do I bridge them and keep it secure? Honestly, as a solo developer or a small-team developer, I have no idea. It is an impossibility for me. The big guys can do it, Google, Amazon, I don't know how; they have huge engineering teams. But I want to be able to play with the big boys and girls too. I think we often take a look at that problem and just say, "Well, I guess I can't play with those guys." But our solution can be their solution too: we can deploy close to our users, and Fly.io makes that really easy.

Fly has regions, and right now there are about 20 available. You say "I want to add a region" using the command-line interface. This adds lax, which is Los Angeles, and then you can add maa for Chennai, India, or scl for Santiago, Chile. Then you just scale your app to put instances in those regions.

So what happens when we actually scale and we have multiple nodes now? This is the internal network within Fly. I have my app in LA and my app in Sydney, and there's a WireGuard tunnel that's put up between the two. WireGuard, Chris mentioned it, is a new, cutting-edge VPN replacement, but it's so much better than any VPN you've ever used. It's built into the Linux kernel; it's on iOS, Android, Windows, everything. It is the replacement, the way to do things. So Fly is automatically creating a VPN network for you between your nodes, so they already have a secure, direct pathway to each other. You won't be able to see anything else on Fly, so you're already boxed to just your apps, and no one else can see your apps. Just getting to that point, I
never could get that far with AWS or anything.

From the external part of the network, how do my users get anywhere? There's a standard called anycast DNS. When a user tries to go to my website, anycast figures out which of my Fly regions is closest to them and sends them to that one. So if I have customers in LA but I'm running everything out of Sydney, they will be directed to the lax one. Or if I'm adding customers in England, I can add something close to them. And really, that's where LiveView starts to shine. So many people complain, "Oh, LiveView can't actually work because of the speed of light." They actually say that. And it's like, well, okay, maybe that's the case for you and your app, but if you just bring it close to the users, then man, it is snappy, it pops, it's wonderful.

So now we're coming to Postgres, and really that is the main bulk of this talk. The problem is how we deal with Postgres here, because if my app is close to my users but my database is still really far away, it doesn't actually help anything. So let me walk you through the path that I took on this.

Coming back to this distance: my primary database is in Sydney, Australia. I did this to punish myself. I wanted to see what our poor brothers and sisters down under deal with when they're visiting my website and your website in Virginia. So I deployed the primary there and then created an app in lax, which happens to be the closest Fly region to where I live in Utah. But if I do just that and deploy my app, this is what it's going to look like: my app in lax is going to have database connections that connect directly to my primary, which is across the ocean in Sydney. That's really not ideal, because we know what kills us is the back and forth, and the code that I wrote in my app did not account for my database being far away from me. My code makes the assumption that my database is zero to two
milliseconds away from me.

What Fly provides here is multi-region PostgreSQL. There are two steps to adding a replica. The first is just adding a volume in the region where you want it to be, and you choose the size of the volume, say 10 gigabytes. Then you scale your database app to say "I want to have two." So if lax is my second region, scaling it to two will put a database app in lax, and now I have a read replica. That means my app in lax has direct database connections to the replica that's right next to it. So now it's super fast; all the reads are really smooth. One thing I do want to mention, though: when you add that replica, Fly automatically sets up the replication and does all that for you, so there's nothing to manage, which is super helpful.

So the question now is, how do I do writes? The default approach, which I took on my test, was: I'm just going to have my lax app hold open connections to the primary, so it can talk to the primary when it needs to. But that requires changing my app in a significant way. I have to add two repos, one that does reads only and one that does writes, and now my code has to know which one to use. That's where I start to not have an ordinary app anymore; it starts getting more complicated. And it doesn't scale well, because as I add more regions I just have all of these open database connections to my primary, and that's really not great.

But that's not even the real problem. The real problem is read-your-writes consistency. Because the replication process is asynchronous, we don't know when it's going to happen and we don't know when it's going to be done. It's a background process. That's the real problem.

So what does that mean? Let's walk through it. When I do a write to my primary and then my next line of code says now read
from it, my app says, "Oh, well, I do reads from the replica." So I do my write over there: I insert some data in my primary, I turn around and read it, and my database says, "I don't have that record." Or I do an update and say "change this email address," and it says, "I have the old one." Or "delete this record": it's still here. Your user starts to not like that. It feels like something's wrong, and they have to hit refresh to see the stuff as it happens. It's because replication is happening back here, and the main thing is we don't know when it's going to happen. Sometimes it'll replicate before your read action and it'll work fine, and sometimes it won't. It's just the fact that we can't count on it.

Now, this is true for any read replica you have. If you have your own hardware, you're hosting everything in your own data center, and you have your primary and your replica sitting right next to each other, the same condition exists. This is just a problem with having read replicas.

So let's check in on our goal. Our goal is to make this code work, and we know that if I insert an order, I can't count on being able to read it. So we're not there yet.

Maybe there's another way. We're using Elixir, and as Chris was talking about, we can do things that other environments, frameworks, and languages really can't do as well. So let's try something different. Instead of talking to the database, I'm shifting the layer at which I'm talking, and I'm going to talk to my app instead. I can RPC, because Elixir makes RPC easy to run: Node.spawn, where you say "run this function on that other node." That's so awesome; it just works. So I can just talk to my app and have my app do the write. Okay, well, that doesn't actually fix our problem yet. We're not there yet, but we've changed the layer at which we're having the discussion. So we still have to
solve the replication problem. Let's do that.

This is where Postgres is really a cool database. I love Postgres; it really is my favorite, and that's why I want to keep using it. Postgres has two functions that are really important here: pg_current_wal_insert_lsn and pg_last_wal_replay_lsn. WAL is the write-ahead log, and LSN is a log sequence number. Postgres is already keeping a log of all the changes that are happening, and it creates a sequence number for where it is in that process of changes. So with the insert one, after I've made a change to my database, I can ask Postgres, "What is the log sequence number for where you are now?" And on the replay side, I can go to the replica and say, "You're applying all these changes. Where are you now in what's been applied?"

We need to make sure we can trust the WAL, so let's talk it through a little bit. The primary and the replica are connected through the replication. The replication process is a stream of data. It's sequential, and it has to be applied in order, so we know it can be counted on to be sequential. When an insert happens, where I made a change, that insert flows down to the replica through this whole stream, and with that second call I can ask the replica, "Where are you now in the log that has been applied?"

And don't worry about having to track IDs. We're not talking about user IDs, and we're not talking about transaction IDs. It's a sequential number, and that's the power of this: these LSNs are comparable. I can say, "Is this LSN greater than or equal to this LSN?" and that tells me where I am in this log stream. Have I gotten to at least this point in the log stream? If I have, then I know the data that I care about has been replicated. And also, don't worry about having to wait for replication to all settle down,
right? We're not waiting for all the changes to settle out. They can all be going on concurrently. I can have the insert I care about, with a bunch of other inserts happening before and after it, and there are still changes happening actively. I just care about when the point that I care about in the log has reached the replica.

So now, back to the RPC. If I can talk to my app and have my app do the write for me, then it can ask, "What is the insert LSN after this change has been made?" and it can pass that back through the RPC call. So now, on the app in my replica region, in my faraway, distant place, I know the point in the replication log that I care about to have my data, my change. And that means we can wait for replication. This is it, guys. We have arrived. Okay, it's kind of anticlimactic as a slide, but this is it.

Let's talk through the way this works. Our insert in lax goes to Sydney, to the primary database. But we can have library code that says, "I'm going to take the RPC result, which has the LSN that I care about, and talk to a service that's running here on my replica." It's just polling the replica constantly, and I can say, "Hey, once the data that this LSN represents is replicated, just tell me when it's done, and I'll sit here and wait." So I'm able to create a blocking insert. The insert can sit there and execute, the data gets inserted far away and replicated locally, and then I continue on to the next line with my read. So my data will be there. I can actually read my writes.

Now, obviously you'll be saying, "Well, my repo doesn't do that," and it's true. So we created a library called fly_postgres. The idea with this is you take your existing repo and you just move it out of the way. In this case I gave it a Local name, so I just changed the namespace. Then I created a new repo that had the same
name as the one in my app, and it does a use Fly.Repo, and the local repo is pointing to the actual Ecto Repo. The job of the Fly repo is that it proxies those calls. It says, "You're doing an update, an insert, or a delete. I know how to proxy them. I know what the regions are. I know how to find your app running in a different region." The main reason for moving the Ecto repo out of the way is so none of your code has to change. All of your references to your Ecto.Repo, your MyApp.Repo, stay the same. So ordinary business-logic code does not need to change.

You want to see it? Let's see if it actually works, because you saw how the demo gods were not kind. Okay, let me get a mouse. All right. So this app is running in Dallas, which is nice and close, and my primary app is running in Sydney, Australia. Let's see if this will work.

First, this is a LiveView app, and you can see, conference Wi-Fi excepted, it's still darn snappy. That's LiveView. It's not the new stuff that Chris was showing off with the modals that happen locally; these are all server-rendered modals. If I type something and remove it, I'm getting instant feedback. And now when I actually go to do a record insert, let's see. Okay, I didn't see any lag, did you? No. All right, what happens if I edit this guy? Okay, there's a little bit of lag. Did you see that? I'm totally okay with that. And what if I do a delete? I just love how snappy that is. It's even better when you're not on conference Wi-Fi. But when I delete, there's a little bit of a lag. We still have to deal with the speed of light and the speed of replication, but that's so much better. So it works really well for a read-heavy app, which most SaaS apps are.

Okay, so that's just a simple insert, update, and delete. What about something more complicated? Here's this feature called Quick Add, where it's a LiveView
managed list of IDs. These are templates, and I can actually create a bunch of records from them. There are quite a few, and that's what we're going to be talking about. So let's look at the code that sits behind that.

This is the non-optimized approach. This is a LiveView application that has an event, and it says, "I'm going to take a list of all the IDs for the different templates that we want to create." On about the third line we see it's just an Enum.each: we go through each one of these and say, "Go create this location." I didn't change any business logic; this is just what happens. So it's going to create a location, and we know that my proxy is going to sit in there and proxy the inserts. So it's going to insert, wait for data replication, then do the second insert and wait for data replication, and you can start to see that's going to be really slow. And then it gets even worse, because after that we fetch all the locations that were added. They come out in an implied order, and we want to make that order explicit so it's reproducible, so we do an update on each one of them to set its explicit position, because it's a draggable list.

So let's see how terrible that is. Now we know what we're going to see. Here's where I'm going to add them all. So it's actually going, and we know what it's doing: it's the whole multi-round-trip problem. We're doing insert, wait for data replication, insert, wait for data replication. Okay, they all came in. So first of all, yeah, it worked. But that's suboptimal, right? I don't like that. So how can we deal with that?

All right, so with the refactor, what I'm going to do is take all the business logic that was in my LiveView
event and extract it out into a function that can be called. It just needs to be addressable, which just means it needs to be a public function. Then I change the return type so that I'm not returning my whole list of locations, because that's an arbitrarily large list. I don't know how big that could be, that's message passing really far, and I'm already waiting for the data to come through my database replication, so I don't need to return it and bring the data over twice. Then, in handling the event, since I've already extracted that code out, I'm just going to say "RPC and wait": in this same module, call this function, passing these arguments. I'm going to wait for it to make the call over to Sydney and have Sydney, because it's close to the database, do all of the work. It can do all of the normal reads and updates, and then I wait for it to finish and replicate before I go on to my next line, which says, "Okay, I know my data is here; fetch my list of locations and show it." And so I can continue with my render.

All right, let's see if we're able to get a look at that. If I select them all, it is quite a list. So this is doing the RPC all the way over. There's still a lag, but that was darn fast. We're doing all the work over there, letting it run all the normal business logic we already had. Let's do that one more time. You can see this could be better. We could add maybe a little spinner or something to say, "Hey, we are doing something, we are working." But that works.

So when do I want to do an explicit RPC? When we have slow performance, like we saw there. When I'm doing explicit transactions or transaction management; using Ecto.Multi is one of those cases where it's doing explicit transaction stuff. When I have lots of sets of reads and updates, where it's the back and forth. And background jobs: lots of times background jobs make the assumption that they're doing some work
and updating the database, so those probably should happen in the primary region. There are other background jobs, like ones sending emails, where it doesn't matter. I checked with the Oban project, and they have queue splitting, where you can say, "If I am in the primary region, run these jobs." So you can totally do something like that too.

To do all of this, we created two Elixir libraries. fly_rpc adds Fly region awareness to your app, and it lets you easily RPC to another region. Not just to a node: I could have six nodes in that region, and I just send this over to that region. And fly_postgres uses the RPC, tracks the replication process with the LSNs, and does the part where my LiveView process can say, "Tell me when this is done and has replicated." It also defines the Fly repo that does the proxying. Both of these libraries are pretty early days, so we'd love to have people use them and give feedback. I already got some feedback: someone's trying to use it with the Ash framework, and we're missing some repo calls. So there's stuff to do, but you can see it's working.

What I think is awesome, though, is when we look at who is providing what benefits. Fly is providing a lot of the operational infrastructure: the global deployments, the multi-region networking, the multi-region Postgres replicas that are already set up so I don't have to manage them. And Elixir is bringing some really powerful stuff to this equation. There's the built-in clustering, so I can have my nodes clustered. I have concurrency. I have processes, so I can have a blocking process, like an insert, that doesn't freeze the whole app. That's so cool. And I have message passing, and I have ETS tables, which keep things fast and smooth.

One of the things to be aware of is that most Elixir systems can't easily do this on other platforms, and most other languages and frameworks, like Node, Django, Rails,
can't take advantage of what Fly is doing, because they don't have clustering, they don't have concurrency, they don't have processes. They can't do the LSN wait, because they're just not able to. So it really does feel like Elixir and Fly were made for each other. I love it.

So we've covered everything. Just coming back to Joe Armstrong's talk, where he was thinking "what could happen if": I want to encourage you to climb out of the mental box that you've been limiting your ideas to. I don't know what ideas and things you'll come up with, but I can't wait to hear about them. Following this approach means that I don't have to rewrite my app. I don't have to relearn how to work with a different kind of database or how to work with eventually consistent data. I don't have to do any of that today to be able to start going global now. I can do this with my ordinary Phoenix app, the one I know you have sitting on your laptop, probably six of them. And once your app has gone global, what could you do then? What kind of problems could you solve for your customers? You've got that idea for that app, and you think, "Well, I'm a solo developer; I think I could at least target all the English-speaking countries without having to do internationalization." Well, now you can. So it's time to play. Have fun. Thank you.

Well, you have time for questions. John?

[Audience] Yeah, so Mark, I was going to ask, and I don't have a lot of DevOps experience or anything, but what about an optimistic style? I know we have optimistic updating with JavaScript front ends. Could something like that work for this read-write problem?

[Mark] Yeah, it just complicates your app, because your app says, "Well, I have to insert it," so then you're saying I can't write normal code. I have to insert it into my list, and then it might come in later. If there was a problem with the insert and it failed, like for a
uniqueness check that happens at the database, then how do I show that to the user? It just complicates it.

[Audience] But what if it was at the level [inaudible]?

[Mark] Yeah, then you're doing multi-master almost, because you're talking directly and inserting to your local ETS, and that has to be replicated and avoid conflicts. So it's tough.

[Audience] Yeah, I'm curious: do you guys have your own servers, do you build your own data centers, or are you utilizing the existing clouds, like AWS?

[Mark] Yeah, so we don't build our own data centers. We're using other people's data centers. We have some DigitalOcean, some AWS, some stuff in Germany that's not one of those. So it's lots of other data centers, and we're using those. But we're already able to do the WireGuard connection to give you the private network across them, so you don't actually have to care where your actual app is.

[Mark] Good question. I don't know.

Yeah, the pricing is really good. If you sign up and want to get as much as you can for free, you can get three apps. One app is your database and one app is your app. If you want to go multi-region and have read replicas, then you do need four apps, because you'd need a read replica app for your database and your primary database, and then your app that's close to the read replica and your app that's close to the primary. So you'd end up with four. But most of Fly's customers don't end up actually hitting a bill; only about 25 percent of them do. So if you just want to start experimenting and play, you can do it for free.

[Audience] Does the fly_postgres library allow you to choose whether you want to wait for replication or not?

[Mark] Not yet. Waiting is the default, because that's what makes it work. But yeah, I do need to add the ability to say, "This one I don't care about, whatever." Yeah. One more?

[Audience] Do you
guys have any plans to expand which databases you support and introduce this across those?

[Mark] I know Kurt, who is the CEO at Fly, is a very technical founder, and one of the things he's really excited about is databases like CockroachDB, which has a Postgres-compliant query structure but is multi-master and replicating. It doesn't do all the things that a Phoenix app might want right now, but there are directions we can go in the future. You can run CockroachDB, and there's another one that's similar, but I can't remember the name of it. So there are other options, but Postgres is the default one.

Yes? Yes, so assuming you've already set up a cluster. I wrote the guide on Fly for getting your Elixir app deployed and clustering it, so I can tell you clustering is actually really, really easy. There is a libcluster DNS strategy that was added, which Kurt actually created. He's a very technical co-founder, or founder. He wrote the first Elixir library, which makes it so libcluster can work with Fly, and it uses DNS as the way it finds your other apps. But yeah, you do have to have clustering in order to RPC.

[Audience] Clustering's heartbeat can struggle when there's long latency. Have there been apps in production?

[Mark] I haven't seen it. But when I say we're early, we're early. Before this library, there wasn't a good way that I could use a Postgres database that was far away. But yeah, it's totally worth continuing to explore.

Yeah, there's a reason why we're polling versus streaming the Postgres log. For polling, there was a function already available for just doing that. I would like to look into reading the stream. Part of what we're trying to do is see what I can do that doesn't require special permissions. So
that's a good question, though. Any other questions?

Yeah. It's very cheap, because it's a built-in function in Postgres. Right now I have it set to poll every 100 milliseconds, but that could totally be tuned. Right. Yep. Not yet. Any other questions? Cody?

[Audience] Does the fly_rpc [inaudible]?

[Mark] It totally could. And really, what I should stress about these libraries is that they're really small, so you could just take that code and run it anywhere. Yeah, it does add some regional awareness, but if you had some way to identify that these nodes belong to some group, then it would just totally work.

All right, well, thank you, guys.
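
The LSN comparison and polling wait described in the talk can be sketched in a few lines. This is only an illustration in Python, not the fly_postgres implementation (which is an Elixir library whose internals are not shown in the talk). The function names `parse_lsn` and `wait_for_replay` and the `fetch_replay_lsn` callback are hypothetical. It assumes the target LSN came from `SELECT pg_current_wal_insert_lsn()` on the primary and that the callback returns the text result of `SELECT pg_last_wal_replay_lsn()` on the replica. Postgres prints an LSN as two hexadecimal numbers separated by a slash, so the values must be compared as numbers rather than as strings.

```python
import time
from typing import Callable, Optional


def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN string such as '16/B374D848' to an integer.

    The part before the slash is the high 32 bits, the part after is the
    low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wait_for_replay(
    target_lsn: str,
    fetch_replay_lsn: Callable[[], Optional[str]],
    poll_interval: float = 0.1,  # the talk mentions polling every 100 ms
    timeout: float = 5.0,        # assumed value, not from the talk
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until the replica has replayed at least target_lsn.

    fetch_replay_lsn is a hypothetical callback that asks the replica for
    its current replay position. Returns True once the replica has caught
    up, or False if the timeout expires first.
    """
    target = parse_lsn(target_lsn)
    deadline = time.monotonic() + timeout
    while True:
        current = fetch_replay_lsn()
        # None models a NULL result (for example, a server not in recovery).
        if current is not None and parse_lsn(current) >= target:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated replica that advances a little on every poll.
    positions = iter(["0/3000060", "0/3000100", "0/3000148"])
    caught_up = wait_for_replay("0/3000148", lambda: next(positions))
    print("replicated:", caught_up)
```

The wait only asks whether the replica has reached one specific point in the log, so other writes can keep flowing through replication while a caller is blocked, which matches the "we're not waiting for all the changes to settle" point in the talk.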
Info
Channel: ElixirConf
Views: 1,869
Keywords: elixir
Id: IqnZnFpxLjI
Length: 42min 0sec (2520 seconds)
Published: Sat Oct 23 2021