ElixirConf 2021 - Chris Freeze - Using Simple Patterns to Solve Complex Problems with ChoreRunner

All right, might be a little early, but I think I'm going to go ahead and get started, so maybe I can have one or two extra minutes. Hopefully I don't go over, but when I timed this it got pretty lengthy; we'll see. Well, all right, thank you all for attending. I understand there are quite a few competing talks in this time slot; even I personally am going to go re-watch that RPG video game one after this. That one looked pretty cool, but mine's going to be pretty cool too, especially if you're unfamiliar with the things I'm talking about. Hopefully you can learn something and add a nice little useful tool to your tool belt after my talk.

So yeah, I am Chris Freeze. I'm an Elixir backend engineer at PepsiCo eCommerce, and I've been doing Elixir since about 2016 or so, maybe even earlier than that I was looking at it. I'm an Austin, Texas local, my GitHub is cjfreeze, and my Twitter is Chris Freeze Dev. A little bit about PepsiCo, obligatory: we have a lot of really cool brands, not just Pepsi-Cola the soda, and here are some of them. There are probably a bunch there you're not familiar with; I just think it's cool that all these random brands that I've eaten at one time or another happen to be owned by PepsiCo.

So the reason I'm giving this talk is mainly to introduce a library which I call ChoreRunner, and if any of you recognize the font there, yes, it is a Blade Runner reference. Despite the cool name it's a really simple library, but before I get into it I'd like to talk about something first. I love Elixir. It's great. Elixir has a ton of extremely cool and useful features, and one of the features I want to talk about first is IEx. I love IEx, and IEx shells, especially remote IEx shells, are extremely useful for debugging and playing
around; generally they allow really fast feedback between you and your project. So I want some audience participation here. Show of hands: who has production Elixir experience? That is a ton. Compare that to the first or second ever ElixirConf, where there were barely any hands. It's crazy. Who here has ever remoted into a production IEx shell for debugging? Nice, nice. And who here has ever copy-pasted code into that production IEx shell? Oh, that's a lot of you. Wow, that is maybe a little worrying. This one's my favorite, I do this all the time, confessions of a developer: who here has ever used IEx to replace an entire existing module in production, to avoid having to commit the change, just to see if it works? Yeah, that is extremely dangerous. Don't ever do that. All right, what about databases? Who here has ever remoted into a production database instance? That's a lot. I mean, there are pretty common reasons to do that. Who here has manually run a destructive query in a production database instance? Wow, just about the same number of people. And, no shame if you don't want to raise your hand here, who here has ever made a mistake doing any of these things? That's a lot of people.

So there are plenty of stories we share about these kinds of mistakes. Some of them only get passed around in the office, and some of them make international news. Let me share a story I heard from PepsiCo eCom's previous SVP, Mark. Mark had a long history of working at various tech companies before he joined eCom, and of course that comes with his fair share of mistakes. He told me about how he was once working at a live event ticketing company, and how they had an event whose title was just "Super Bowl," because the teams weren't decided yet. And so when the
teams were decided, he thought: okay, I'll just remote into the database real quick and update that to reflect which teams are actually playing each other. So he did, and he ran a query that probably looked something like this. Does anyone see the issue? Yeah. He hit enter, and before he could even read the number of rows that were just updated, every single event in his system had its name changed to "Super Bowl: Team One versus Team Two." Luckily they had frequent database backups, but that's never a fun day at work.

Really, mistakes like this are just a symptom of moving fast. (Oh yeah, I have an error there, whoops.) At PepsiCo we move pretty fast, maybe faster than you would expect for a company of our size. Since the first launch of PantryShop in April of 2020, we've added five new stores to our ecommerce platform, and pretty much all of our developers are working on new features or shops at any given time. That means certain infrequent activities simply aren't worth the time to solve in a proper fashion. Here's an example. When we first got into selling a variety of PepsiCo products online, we quickly learned about bottler agreements. Bottler agreements are essentially contracts which give certain bottling companies exclusive rights to distribute in their markets, which are geographically based. For bottlers not owned by PepsiCo, we had to ask them for permission to sell these products in their regions, and those bottlers requested to choose which zip codes we could sell certain products to, on an opt-in basis. That was the deal. It's great that we could at least have some opt-in, but our platform didn't support that kind of restriction. Time was of the essence and we wanted to sell these products as soon as possible, so we speedily created a rule system. It was a pretty simple system: products can implement rules, and zip code rules can have a bunch
of excluded zip codes. A product with a zip code exclusion rule can't be sold to you from our store if your billing or shipping address matches any of those excluded zip codes. So how do we know what zip codes to add to each rule? The business gives them to us in the form of several Excel documents that change every month. Updating them turned into a bit of a chore: it started out with copy-pasting a script module that had all the zip codes hard-coded into it, pasting that module into a production shell, and running it to configure the rules. This was a bit stressful, as the script was upwards of thousands of lines due to all the hard-coded data, and we wanted a safer way with minimal dev downtime spent on it. So I implemented the first pass of the chore system. It wasn't really much of a system, though; it was just a single behaviour that looked like this.

It's pretty simple, but despite its simplicity I had two main goals I wanted to accomplish here. The first was speed. I wanted to retain as much speed as possible. The whole reason we used remote shells in the first place was that the frequency of updating these zip codes, about once a month, didn't really warrant writing a whole specialized admin feature for it; it's faster just to use that production IEx shell. The second goal was safety. In many cases, mistakes with these things could have legal ramifications: misconfiguring something could result in a breach of a bottler agreement, or in charging the incorrect amount of taxes. I figured this could be prevented the same way we try to prevent mistakes with non-chore code, which is testing and code review. By including our scripts as application modules, they can be tested in CI, reviewed by other developers, and easily trialled in staging environments. Fast and safe, very nice. Some of you might say that speed and safety are mutually exclusive, or that it's always a trade-off.
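The first-pass behaviour itself isn't visible in the captions, so here's a hedged sketch of what a single-behaviour chore system along those lines could look like. The module and callback names here are my own guesses for illustration, not the real first version:

```elixir
# Hypothetical reconstruction of the first-pass "chore" behaviour: a plain
# Elixir behaviour, so each one-off script becomes a reviewable, testable module.
defmodule MyApp.Chore do
  @doc "Performs the chore, returning :ok or an error tuple."
  @callback run(args :: map()) :: :ok | {:error, term()}
end

defmodule MyApp.Chores.UpdateZipCodeRules do
  @behaviour MyApp.Chore

  @impl true
  def run(%{zip_codes: zips}) do
    # A real implementation would update the zip-code exclusion rules here.
    Enum.each(zips, &IO.puts("excluding zip code #{&1}"))
    :ok
  end
end
```

Because the chore is an ordinary application module, it can go through CI and code review like any other code, which is the safety property described above.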
And I agree, though I imagine it as sort of an uneven gradient between the two goals. I wanted the best-value solution, one which maximized each goal as much as possible. And while what we had was usable in the short term, in the form of that behaviour, I was far from finished; we still had to connect to a remote IEx shell to run these chores. So our chore system eventually turned into ChoreRunner. From the beginning, my vision was a much more feature-rich, open-source library that truly found the sweet spot of speed and safety. For starters, I added a simple public interface for working with chores. It lets you run chores and see which chores are running; it's not shown here, but you can also stop an existing chore early, just in case you didn't mean to run it. I expanded the chore behaviour to also offer a simple framework and DSL. Let's look at an example implementation: simply `use ChoreRunner.Chore` to get started, and once you've done that, you can take advantage of the dynamic input definitions and validations that are part of the framework. The system offers multi-node concurrency guarantees, a simple and intuitive logging DSL that broadcasts chore updates over PubSub, and a plug-and-play LiveView UI. And of course I have a demo, so let's get into that.

All right, so here's an example of a chore. It's completely empty; ignore that. And here are two browsers, both connected to different nodes in the same cluster, running the same code. For starters, if we run this empty chore on one node, it instantly completes on both nodes; it's distributed, supported across nodes because of PubSub, and we can dismiss the completed chore individually in each browser. Now let's have it do something: let's have it say hi to us. Can everyone see that? Cool. First we need to reload both browsers so that both nodes recompile, and then if we run the chore, it gives us a nice little hi message.
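As a rough illustration of the public interface described above, usage could look like the following. The function names (`run_chore`, `list_running_chores`, `stop_chore`) are assumptions inferred from the talk, not a verbatim copy of ChoreRunner's API:

```elixir
# Hypothetical sketch of the public interface: run a chore with inputs,
# list what's running cluster-wide, and stop one kicked off by mistake.
{:ok, chore} = ChoreRunner.run_chore(MyApp.Chores.SayHi, %{name: "Chris"})

# See which chores are currently running on any node:
running = ChoreRunner.list_running_chores()

# Stop a chore early, just in case you didn't mean to run it:
:ok = ChoreRunner.stop_chore(chore)
```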
Well, what if we want to take some input? We can go over here and define the `inputs` callback. Let's make it say hi to us based on a name: we need a string input called `name`. Cool. Now let's reload, and we can see that the Run Chore form has automatically generated the name field for us. So if I put my name in there and run the chore... oh, I forgot to actually use it. Of course, it's kind of nice if you actually modify the implementation to use the input. As you can see, we can just pattern match on the expected key right here, and then when we go back and run it: "Hi Chris." Hello.

These inputs also support validations. Let's see, is Jason here? Nope, Jason is not here; I was going to heckle him a little from up on the stand, but I'll do it anyway even though he's not here. Let's say I wanted to prevent the system from saying hi to Jason. I'd add `validators`, and then I could return an error if the name is Jason, and of course for everyone else I'll just return ok. Then over here we capture it and make sure... oh, I forgot a bunch of things, but it's all good now. So if I go over here and type in "Jason," we get a nice little error message, live. But Jason's smart, he's a very smart engineer; he's just going to type it lowercase instead. Well, we can solve that by putting in another validator that also modifies our input, by calling `capitalize` here: validators can also be used to transform input. So now if we go back, refresh, and type a lowercase "jason," it's still caught, and as a bonus, if I type a lowercase "chris" and run the chore, it capitalizes my name for me. And finally, one of the cool things this system does is prevent you from running more than one of the same chore at the same time, based on a configuration.
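Putting the demoed pieces together, the chore could look roughly like this. The DSL details (`string/2`, the shape of validators, `log/1`) are approximated from what's said on stage, not taken verbatim from the library:

```elixir
# Hedged sketch of the demo chore: one string input, with validators that
# both transform the value (capitalize) and reject a specific one (Jason).
defmodule MyApp.Chores.SayHi do
  use ChoreRunner.Chore

  def inputs do
    [
      string(:name,
        validators: [
          # Transform first, so a lowercase "jason" is caught too.
          fn name -> {:ok, String.capitalize(name)} end,
          fn
            "Jason" -> {:error, "no saying hi to Jason"}
            name -> {:ok, name}
          end
        ]
      )
    ]
  end

  def run(%{name: name}) do
    # Pattern matching on the expected input key, as shown in the demo.
    log("Hi #{name}")
    :ok
  end
end
```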
It's actually far from one of the coolest features, but we'll go over it real quick. Let me just sleep here for about ten seconds, so that each chore takes a while to complete. If I run it, you can see it still says Stop up here; I can actually kill it at this point and then dismiss it. But if I run it multiple times, it doesn't work, because it's already running, and if I try that on the other node, it also doesn't work, because it's already running. I can configure that with a `restriction` callback: by saying the restriction is none, the system then lets me spam as many chores as I want, which is nice and fun.

And finally, let's say you want some nice counters, like "how many rows of data have I processed," or a percentage. We can do something like this using `set_counter`; there's also an equivalent `inc_counter` for incrementing, if you don't want to keep track of the count yourself, and you can set a nice little data value too. Oh, and the reason that instantaneously completed is that Elixir is so fast it counted from 0 to 100 and sent all 100 of those messages faster than it could even render the intermediate states, so I have to put a `Process.sleep` here. Let's try this again: we can see it gives us a nice little count up to 100, and it did so on both nodes. And that's pretty much it for the demo. Thank you. [Applause]

I've got like twenty minutes left, don't I? Well, don't worry, that's not actually it. I intentionally have this much time left, and that's because the real purpose of this talk isn't to announce my library. ChoreRunner isn't really groundbreaking; this isn't the next LiveView. It's in fact remarkably simple behind the scenes, but that's exactly why I wanted to give a talk about it: it combines many of our favorite buzzwords, like OTP, metaprogramming, PubSub, and LiveView, into one useful tool, which makes it the perfect example to talk about those things, and of course to share what I've learned with all of you, so you can maybe use it practically in a production scenario.
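The counter part of the demo could be sketched like this. `set_counter/2` and `inc_counter` are named in the talk; the loop, the arity, and the sleep interval are my own illustration:

```elixir
# Sketch of the progress-reporting demo: report a counter while doing
# slow work, so the LiveView UI has intermediate states to render.
def run(_inputs) do
  for i <- 1..100 do
    # Without this sleep, all 100 updates arrive faster than the UI
    # can render them and the chore appears to complete instantly.
    Process.sleep(50)
    set_counter(:rows_processed, i)
  end

  :ok
end
```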
So, we saw several things in the demo. We could track a chore from start to finish, we could see chores on every node and prevent conflicting chores from running simultaneously in a distributed system, and we implemented callbacks and used imported DSL functions to build chores from scratch. That's quite a few features; is the implementation really so simple? Let's dive in and find out.

The first thing I want to share is ChoreRunner's use of dynamic OTP, which is an OTP use case I really don't see much in my everyday coding. Chores only run when you tell them to run. We also want to isolate the running chore from everything else, rather than running the chore on whatever process happened to kick it off, so that exceptions are isolated. This includes separating the reporting process from the chore process itself, so that even catastrophic failures are properly recorded. This ends up being a great use case for a Task — but not that type of task. We can't use `Task.async`: it links the calling process to our chore, so an exception kills them both. We could call unlink, but there's an easier way: we can use `Task.Supervisor.async_nolink`. All tasks started this way will be linked to our chore Task.Supervisor, which can then notify the calling process when our task finishes and exits, including for unhappy reasons.

This is only one half of the puzzle, though, as we need a separate dedicated process to receive these messages. One way we could approach this is a dedicated worker process which reports on all chores across a node, but instead I opted for a more dynamic solution: starting one GenServer per chore. These reporter processes still need to be supervised, though, so we can use a DynamicSupervisor to start and supervise them, at least until we want them to die. When a reporter is spawned, it's in charge of starting the chore.
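The supervision shape described above can be sketched as follows. The supervisor and module names are illustrative, and `chore_mod`/`inputs` stand in for real values:

```elixir
# A Task.Supervisor for the chores themselves, plus a DynamicSupervisor
# that starts one reporter GenServer per chore on demand.
children = [
  {Task.Supervisor, name: MyApp.ChoreTaskSupervisor},
  {DynamicSupervisor, name: MyApp.ReporterSupervisor, strategy: :one_for_one}
]

# When someone runs a chore, start a dedicated reporter for it:
{:ok, reporter} =
  DynamicSupervisor.start_child(
    MyApp.ReporterSupervisor,
    {MyApp.Reporter, chore_mod}
  )

# Inside the reporter: the chore task is linked to the Task.Supervisor,
# not to the reporter, so a crashing chore never takes the reporter down,
# but the reporter is still monitored for the task's lifecycle messages.
task =
  Task.Supervisor.async_nolink(MyApp.ChoreTaskSupervisor, fn ->
    chore_mod.run(inputs)
  end)
```

The design choice here is exactly the isolation property described in the talk: the chore can crash catastrophically and the reporter survives to record it.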
The chore's Task.Supervisor registers the starting process, which in this case is always the reporter, to receive task lifecycle messages. If we want to know when the chore has finished, whether gracefully or not, all we have to do is listen for the right messages. These messages go to the `handle_info` callback of our reporter GenServer. We can receive two different types of messages here: one is the happy case, where the chore ran to completion and even gave us back a result; the other will always be received when the chore exits, regardless of reason. That means for success you'll get one of these messages in addition to the previous one, but it can also give you other exit reasons, such as exceptions. That's all we need to track when a chore ends.

But what about things that happen while a chore is running? We already saw code like this during the demo, as well as a bunch of helpful live information about the chore's progress. Let's look at some of these reporting functions. Behind the scenes, all these functions just call `GenServer.cast`. This asynchronously sends the information to our reporter, resulting in minimal slowdown of our chore's execution time; sending a synchronous message with `GenServer.call` would work, but it's much slower, and it's okay if the reporter runs a little behind the chore itself, as every message will get there eventually.

But what is this magic function, `get_reporter_pid`? Clearly it returns the pid of the reporter, but it accepts no parameters, so how does that work? It could work in a number of ways, such as using Registry or some sort of common naming scheme, but it could also use something a little more magical, called the process dictionary. If you're unfamiliar: all BEAM processes have a mutable key-value store, accessible from anywhere as long as you know the process's pid. While I generally wouldn't recommend using it where you can use a parameter or a variable instead, it's great for library and logging convenience in a controlled setting.
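The two lifecycle messages and the cast-based reporting described above can be sketched as reporter fragments. These are illustrative snippets, not the library's literal source; `finalize/2` is a hypothetical helper:

```elixir
# Messages a process receives for a task started with
# Task.Supervisor.async_nolink (which monitors the task for us):

# Happy path: {ref, result} arrives only if the chore ran to completion.
def handle_info({ref, result}, %{task_ref: ref} = state) do
  {:noreply, %{state | result: result}}
end

# Always received when the task exits, whatever the reason:
# :normal on success, or an exception/exit reason on a crash.
def handle_info({:DOWN, ref, :process, _pid, reason}, %{task_ref: ref} = state) do
  {:noreply, finalize(state, reason)}
end

# Reporting helpers are fire-and-forget casts, so the chore never blocks
# waiting on the reporter the way a GenServer.call would.
def log(message) do
  GenServer.cast(get_reporter_pid(), {:log, message})
end
```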
In fact, Elixir's own Logger uses the process dictionary for storing metadata. So how does this work in ChoreRunner? When the reporter starts the chore task, I just put the reporter's pid inside the chore task's process dictionary before calling the chore's `run` callback. As a result, `get_reporter_pid` can then retrieve the reporter's pid at any time to talk to the reporter. And it's really that simple — except there's actually a problem here. Done this way, only the chore process has the reporter pid stored. That might not seem like a problem, but this is Elixir, and in Elixir we like to do things like concurrently parallelize stuff, because it's more efficient. We could use `Task.async_stream` to process stuff in our chore, and that would be much faster than doing it one item at a time. But if the reporter pid is only in the calling process, then those other tasks started by `async_stream` can't use reporting, and limiting our chores to a single process isn't very Elixir-y.

But there is a cool solution that retains the magic of our original `get_reporter_pid`. A lesser-known fact about tasks in Elixir is that in certain situations they record their parents in the process dictionary. For example, here we have a task called Task A. If Task A spawns another task, which we can call Task B, the spawning task's pid gets copied into the spawned task's dictionary under the key `$callers`. If Task B spawns another Task C, it will copy both its spawner's and its spawner's spawner's pids into `$callers`. But what can we do with those pids? Well, I mentioned earlier that the process dictionary can be read from any process, as long as you know the pid. This is done with the function `Process.info`: simply call that function with the pid and the atom `:dictionary`, and you've got it. So with the pids found in `$callers` and the `Process.info` function, our `get_reporter_pid` implementation now works in any task spawned by our chore process.
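The `$callers` trick described above can be demonstrated standalone. This is my own minimal example of the mechanism, not ChoreRunner's code: a task reads its caller's dictionary through `Process.info/2` and recovers a value the parent stored there:

```elixir
# A parent process stores a pid in its own dictionary; a task it spawns
# walks :"$callers" (populated by Task functions) to find it again.
parent = self()
Process.put(:reporter_pid, parent)

Task.async(fn ->
  # Task.async records the spawning chain under the :"$callers" key
  # of the spawned task's process dictionary.
  callers = Process.get(:"$callers", [])

  reporter =
    Enum.find_value(callers, fn caller ->
      # Read another process's dictionary by pid.
      {:dictionary, dict} = Process.info(caller, :dictionary)
      Keyword.get(dict, :reporter_pid)
    end)

  send(parent, {:found, reporter})
end)

receive do
  {:found, pid} -> pid == parent
end
```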
So that's pretty much it for chore tracking, besides the message accumulation logic, of course, but that's really just putting data in a struct; I think you all know how to do that, so I'll skip it in the interest of time.

Next, of course, I want to talk about distribution. One of the things that makes Elixir great is how easy OTP makes running stateful distributed systems. Many apps run multiple nodes clustered together via the BEAM, and I definitely needed to make sure that ChoreRunner would not work unexpectedly in multi-node apps. There are two main problems that need to be solved here for ChoreRunner to work. The first problem is relatively simple: the list-running-chores function on one node needs to be able to find all the chores on the other nodes, not just locally. There are many possible solutions here, but I chose pg. pg stands for "process groups"; as the name would imply, it lets you create groups of processes. You can then ask pg to list all processes in a group and iterate through them to communicate with them, and if a process dies, it's automatically removed from the group. The best part is, it works multi-node right out of the box. All you have to do is add a reporter to the group after it starts, and after that, if I want to list the reporters, I can just ask pg for the members. That's pretty much all we need for our first distribution problem.

The second problem, though, is a bit trickier. Miscommunication, or the lack of communication altogether, is the source of many a mistake in the world. One of the potential issues I saw when writing ChoreRunner was two developers both trying to run certain chores at the same time; who knows what chaos could result from two chores making the same or similar changes simultaneously. A simple solution could be to just mention it in a precautions section of the readme and move on, but nobody likes having more things to worry about.
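The pg usage for the first problem can be sketched like this. `:pg` ships with OTP 23+; the group name here is my own choice for illustration:

```elixir
# Start pg's default scope (normally done once in a supervision tree).
{:ok, _pid} = :pg.start_link()

# When a reporter starts, join it to a group; a dead process is removed
# from the group automatically.
:ok = :pg.join(:chore_reporters, self())

# Listing all reporters visible from this node, local and remote:
reporters = :pg.get_members(:chore_reporters)

# Iterate through the members to communicate with them.
Enum.each(reporters, &send(&1, :report_please))
```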
That would result in additional things a developer has to worry about, and nobody likes having to code defensively. We could use our list-running-chores function before running any chore and check manually, but that poses a problem: pg is kind of lazy. Its groups are only eventually consistent, which means state can temporarily diverge. In addition, group membership isn't transitive: let's say we have three nodes with a minor netsplit, where node A can see node B but cannot see node C. That means node A can only see group members on nodes A and B, even if node B can see all of the group members. For the listing function this is okay, because it's really only used for the UI and we don't need perfect introspection there; in a disaster scenario you could potentially just hit the load balancer over and over to see everything, but that's not really relevant. For preventing conflicting runs, though, I needed something a bit stricter, so I went with global.

Yes, it's yet another Erlang module. It offers several conveniences for working with multi-node systems, such as a distributed registry and distributed locks. If you've ever dealt with multi-node name registration for GenServers, then you've seen its registry functionality, but today we'll be looking at the lock functionality instead. Locks are simple: if you lock something, nothing else can lock it. If someone else tries to lock the same resource, they will synchronously block until the resource is unlocked, and can then access it — though the attempt can also time out if it takes too long. Let's see how this looks in code. All I need is the resource to lock and the list of nodes to lock on, which in this case is always all nodes. We don't even have to call `:global.del_lock`, which is how you would normally release a lock afterwards, because locks also get released if the holding process dies, similar to how processes get removed
from a process group when they die. That way, when the chore ends, it automatically releases the lock, and something else can then go in. But what is the resource here? Well, it depends on the chore's `restriction` callback: for a global restriction I just lock one shared atom, for self I lock the chore module name, and for none I skip the lock step entirely. And that's pretty much it for distributed chore management.

I may have made it seem simple, and indeed the implementation is simple, so let me state a really quick disclaimer here. Distribution is a complex topic. Off-the-shelf solutions are great for simple applications like ChoreRunner, but they are by no means a one-size-fits-all solution. While pg or global could work to help you solve your distributed problems, you should definitely evaluate many different implementations and weigh their differences when solving distributed problems in a production scenario, especially for something user-facing.

So let's switch gears and talk about behaviours. Behaviours offer a great way to define a framework with which users of your library can write code that works with your library. Since Elixir is of course a functional language, as long as your callback function definitions meet the spec, you're good. Sometimes callback specs can get a little complex, though; imagine having to manually build and transform a `Plug.Conn`. It's often a good idea to provide additional functions that can be used within the domain of your framework, to make it easier to conform to more complex specs. Broadly, you can refer to the resulting syntax as a domain-specific language. Many libraries offer powerful DSLs through the use of macros; a notable example is Ecto. The Ecto.Query DSL may take some getting used to at first, but it's very powerful for those who understand it. But there are always trade-offs. For example, let's take a look at the chore DSL. If we take a look at this example chore `inputs` callback, we see it's a little more complex than what we saw in my demo.
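The `:global` lock flow described above can be sketched as follows. The lock semantics (`set_lock/2` blocking and retrying, automatic release on process death) are standard Erlang behavior; the chore module used as the resource is an example:

```elixir
# :global lock ids are {ResourceId, LockRequesterId} tuples. For a :self
# restriction, the chore module name is the resource, so two runs of the
# same chore contend while different chores do not.
lock_id = {MyApp.Chores.SayHi, self()}

# Lock on all nodes, this one included. With the default (infinite)
# retries, this blocks until the lock is acquired and then returns true.
true = :global.set_lock(lock_id, Node.list([:this, :visible]))

# ... run the chore ...

# Explicit release; if the holding process dies first, :global
# releases the lock on its own, so a crash can't wedge the system.
:global.del_lock(lock_id)
```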
Or actually, it's about exactly the same complexity; this slide was written before I'd planned my demo. Anyway, now let's take a look, without any other context, at the same definition in an earlier version of the ChoreRunner library, which looked like this. The current approach uses data and functions, whereas the old approach uses pretty macros and magic. Ultimately, though, they compile to pretty much the same code and work in the same way. Now, I'll be honest: I may have moved away from the macro approach, but I still kind of miss it. It's very pretty, and it looks like it would be faster to write, too. So why did I move away from it? I actually had to do a lot of work behind the scenes to get it to work, so it was very difficult — there was sunk cost, right? Plus, less code is faster to write, right?

Well, this pretty code has a dark side. Behind the rosy exterior lies a mess of compile-time code generation. It claims to define inputs for your chore, but how it does so is unclear. If you wanted to know how it worked, you'd have to go digging through quite a few lines of macro source code to figure it out. You'd learn that it generated a bunch of functions, iterated through the abstract syntax tree of the input macro's do block to capture those validator functions into validation logic generated at compile time, and had some dirty hacks to make sure that the contents of each input's do block actually triggered correct compilation errors if there were any. But even if it was a bit messy, it still worked, so why change it? Well, if I'm talking about macros, I'm obliged to mention this meme, of course: rule number one of Macro Club is "don't write macros." I was so caught up in how nice my code was, and how much fun I was having writing the macros, that I broke several macro rules along the way, but the
main one I'll cite today is this: don't write macros if you can write a function instead — or, I guess more appropriately for me, if you can make your library's users write a function instead. Ultimately, the code I was generating was not complex at all. It was so simple, in fact, that I could just document the structure of the callback and let the developer write it. The only mystery here is this `string` function, so let's take a peek into the source and see what it does. Here, as you can see, I actually managed to sneak some metaprogramming into the library, but there's not really anything complicated here. If we expand this out after compilation, it looks like this: just five simple functions that output some happily formatted tuples, according to the `inputs` callback spec.

In addition to this, there's actually one more useful macro implemented in ChoreRunner. This is of course a `__using__` macro, which is one of the more common macros implemented across Elixir. For those of you who are unfamiliar, this is what actually gets called when you call `use` anything. So for example, if you were to call `use ChoreRunner.Chore`, which you've seen at the top of the chores thus far, it would call this macro, and this is vital to the functionality of our DSL. Let's look at what it does for us. It gives us a few aliases and imports, which is convenient. It flags which behaviour we're implementing, which is nice because we get those callback warnings. It defines some default callbacks and functions, so that we don't have to implement every single callback, and marks them as overridable, so we can define our own if we choose. And it provides a little `validate_input` function that's really just a convenience for me, the library writer. So yeah, I guess the takeaway here is that in many cases you can write convenient and intuitive DSLs and frameworks without the use of heavy macros that disguise functionality from the user.
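A `__using__` macro along the lines just described could be sketched like this. This is my own reconstruction of the pattern, not the library's literal source; the default bodies are illustrative:

```elixir
# A __using__ macro that wires a chore module into the framework:
# imports for the DSL, the behaviour declaration (so the compiler warns
# about missing callbacks), and overridable defaults.
defmodule ChoreRunner.Chore do
  defmacro __using__(_opts) do
    quote do
      import ChoreRunner.Chore
      @behaviour ChoreRunner.Chore

      # Sensible defaults, so a chore only implements what it needs.
      def inputs, do: []
      def restriction, do: :self

      # Mark the defaults overridable so user definitions win.
      defoverridable inputs: 0, restriction: 0
    end
  end
end
```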
At the same time, metaprogramming is still an important feature of Elixir that can be used with great success, if used responsibly. And that's pretty much all for the framework, and the lessons learned while writing it. That's all of our cool features covered: we used tasks, dynamic supervisors, GenServers, and the process dictionary to enable our chore reporting; we used pg and global to power our distribution logic; and we learned about when to use, and when not to use, macros while writing a simple DSL and framework for a library. And that, unfortunately, means we're approaching the end of my talk — just in time, too.

So I have one last thing to say: I love Elixir. Elixir and the technology it's built on enable me to build things like ChoreRunner. Major problems that other languages would face around concurrency and distribution are effortlessly solved, and the resulting code is clean and nice to look at. There's so much the language has to offer, and one reason I love coming to ElixirConf is that I get to go to talks, learn about features and details I didn't know before, and then turn around and use those things myself. This is actually my first ever talk at ElixirConf, pretty much ever — well, that line was written before I was invited to give a talk at NervesConf on Monday, so it's not true anymore, but I still thank you all for being part of this historic occasion for me. I wouldn't be able to be here giving a talk without all of you, so thank y'all for coming to my talk. I hope you learned something, and I also hope ChoreRunner gets some good use out there after this. There's honestly a lot more I wanted to talk about, too, such as some of the things I was doing in LiveView with the dynamic form building, and also packaging it into my library. If you'd like to hear about that, feel free to talk to me in the hallway, or reach
Or, if you'd like to do some self-study, you can always just check out the source on GitHub. ChoreRunner is open source, though we are not currently accepting open-source contributions, unfortunately. We hope to be able to in the future, but for now, if you have an issue while using it, you can open a GitHub issue to address it. Once again, I'm Chris Freeze with PepsiCo eCommerce. Thank you.

[Applause]

I think we have maybe a little bit of time for questions. Well, I don't know where the MC is, so ask away if you have any.

[Audience question about how to package LiveView into a library]

Yeah, so a quick comment about that: eventually the intention is to separate ChoreRunner from the UI, but for now they are together. Either way, I'd still need to answer your question about how to package LiveView into a library. It's honestly pretty simple, but there's not really any way to just add the dep and have it work; you still have to essentially call the LiveView like you would any other LiveView. Ultimately, these LiveViews are just OTP processes, so just the way that you would pull in a library and then throw, say, a supervisor into your application, you basically just take the LiveView and throw it into your router, and it pretty much just works. The main thing to keep in mind is what to do with static assets like CSS files and whatnot. Luckily, Elixir applications do come with priv directories, and by default your priv directory will be forwarded along with your dependency, so you can just throw it all in there. I would recommend that you make the CSS files and JavaScript files compile-time assets, and then if you want to be able to reference things like images with send_file, you're going to have to use Application.app_dir with the priv path. It's kind of weird; I can go into more detail about it later, but yeah, it's honestly not that hard.
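As a hedged sketch of the answer above: the route, scope, and module names here (MyAppWeb.Router, ChoreRunnerUI.ChoresLive, the /admin scope, the logo file) are all invented for illustration, but the shape is that the host application mounts the dependency's LiveView in its own router, and resolves priv assets at runtime with Application.app_dir/2.

```elixir
# In the HOST application's router (hypothetical names throughout):
defmodule MyAppWeb.Router do
  use Phoenix.Router
  import Phoenix.LiveView.Router

  scope "/admin" do
    # Mount the library's LiveView just like one of your own.
    live "/chores", ChoreRunnerUI.ChoresLive
  end
end

# Files shipped in the dependency's priv/ directory must be located with
# Application.app_dir/2 at runtime, because the on-disk path differs
# between dev builds and releases:
logo_path = Application.app_dir(:chore_runner, "priv/static/logo.png")
```

This is a configuration fragment, not a runnable program; it assumes a Phoenix host application with phoenix_live_view as a dependency.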
[Audience question] It seems to me like a major use case of this would be jobs that are supposed to run one time, like a fix that you need to do, but the way this is set up, the job, or the chore, is there forever until you delete it, right?

Yeah, essentially. And it's not necessarily that you would only want to run it one time. For example, the main use case we're going to have at PepsiCo is updating rules that constantly change due to business requirements. We're just expecting that they're going to come up with new rules that we need to implement, and instead of having to create a brand-new admin interface for every single type of rule we want to change, we can just use ChoreRunner. But yeah, you could also use it for one-off things, and we've actually done that plenty of times with the existing chore system; we just clean up the chores about once a month. It's like, okay, this directory is getting a little big, just delete it. There's nothing wrong with deleting code, at least in my opinion.

[Audience question about running chores locally versus in production]

Definitely the former. I would not ever run something locally and then try to connect it to production. The idea is that this can just be plugged into an admin, and generally, if you're writing some sort of web app, you're probably going to write some simple admin portal that only your developers can access, and you just throw this in there. Any other questions?

[Audience question]

Yes, full disclosure, he is my father, but I hear that he has absolutely no say in what talks get accepted, just for the record. Yes, that's my younger brother; he's a little bit better looking than me, so I use his picture instead. Our faces are pretty similar, but I've got this hat surgically attached to my head. Any other questions? All right, thank you.

[Applause]
Info
Channel: ElixirConf
Views: 1,096
Keywords: elixir
Id: YDaPUX0-X5c
Length: 36min 31sec (2191 seconds)
Published: Sat Oct 23 2021