Lynn Root - Advanced asyncio: Solving Real-world Production Problems

Video Statistics and Information

Captions
Good morning! How are you doing, not that hungover I hope? I think it's probably the most expensive hangover they've had in a while. Anyway, my name is Lynn Root and I'm a staff engineer at Spotify. I've been at Spotify for nearly six years, but since the beginning of the year I've been working on building machine learning infrastructure for really smart people who do digital signal processing and need to productionize it, which is pretty fun. So if anyone here uses Apache Beam or does streaming data pipelines, I'd love to chat afterwards. I'm also Spotify's FOSS evangelist, and I help a lot of teams with their projects and tools to get them open sourced under the Spotify GitHub organization. And lastly, I'm one of the global leaders of PyLadies, which is a mentorship group for women and friends to help increase diversity in the Python community. I brought a lot of stickers with me, so if you want a PyLadies sticker, come find me.

Alright, so here's the agenda, and it's a bit jam-packed: I'm going to be covering graceful shutdowns, exception handling, and threading, along with testing, debugging, and profiling. I'll use probably most of my time, if not all of it, so I won't take any questions, but we can go outside afterwards and talk. This presentation is very code heavy, but don't worry: the slides and the full write-up with the code are at a link I'll show again at the end.

OK, so asyncio, right? The concurrent Python programmer's dream, the answer to everyone's asynchronous prayers. The asyncio module has various layers of abstraction, allowing developers as much control as they need and are comfortable with. Simple hello-world examples do show how simple it can be, but it's easy to get lulled into a false sense of security, and those hello-world examples aren't that helpful. We're led to believe that we can do a lot with the async and await API layer. Some tutorials, while great for developers getting their toes wet, try to illustrate real-world examples but are actually just beefed-up hello worlds. Some tutorials even misuse the asyncio interface, allowing one to easily fall into the depths of callback hell. And there are tutorials that will get you easily up and running with asyncio, but then you may not realize that what you have isn't exactly correct, or isn't what you want, or only gets you part of the way there. And while some tutorials and walkthroughs do a lot to improve upon the basic hello-world use case, often it's still just a web crawler. I'm not sure about others, but I'm not really building web crawlers at Spotify. I've built services that do need to make a lot of HTTP requests, sure, and I need them to be non-blocking, but these services of mine also have to react to pub/sub events, measure progress of actions initiated from those pub/sub events, handle any incomplete actions or other external errors, deal with pub/sub message lease management, and measure service level indicators and send metrics. And we need to do all this with non-asyncio-friendly dependencies. So for me, my problem got difficult quickly.

So allow me to provide a real-world example that actually comes from the real world. Has anyone heard of Netflix's Chaos Monkey? I see some hands. A few years ago at Spotify we built something similar: a chaos-creating service that does periodic hard restarts of our entire fleet of instances.
We're going to do the same here and build a service called Mayhem Mandrill, a pun off of Chaos Monkey, which will listen for a pub/sub message and restart a host based off of that message. As we build this service, I'll point out best practices that I may or may not have realized when first using asyncio, and this will essentially become the type of resource that past Lynn would have wanted about three years ago. Again, don't worry about the code in the slides; the link at the end refers to all of the code.

So we're going to start with some foundational code and write a simple publisher. Here's where we start: we have a while True loop, and a unique ID for each message to publish to our queue. I want to highlight that we're not going to await the queue.put of a message; asyncio.create_task will schedule the coroutine on the loop without blocking the rest of our loop. The create_task method does return a Task object, but we can also use it as sort of a fire-and-forget mechanism. If we added the await here, everything after it within the publish coroutine would be blocked. That isn't an issue with our current setup, but it could be if we were to limit the size of our queue, because then that await would be waiting on space to free up in the queue. So we'll just stick with asyncio.create_task.

So we have a publisher coroutine function, and now we need a similar consumer. This consumer will consume the messages that we've published, and it's sort of similar to the publisher: we have a while True loop and we await the queue for a message. But here we don't want to create a task out of queue.get; it makes sense to block the rest of the coroutine on this, because there isn't much to do if there are no messages to consume. To highlight this again: we're only blocking within the scope of the consume coroutine; we're not blocking the actual event loop.

Then let's replace asyncio.sleep with a function that will restart a host. I'm sure it looks like I'm just pushing the simulation of I/O work into the restart_host function, but in doing so I'm actually able to create a task out of it, so we're not blocking on awaiting more messages. Perhaps we want to do more than one thing per message; for example, in addition to restarting a host, maybe we'd like to store that message in a database for potential replaying later. So we'll make use of asyncio.create_task again for the save coroutine, scheduling it on the loop and basically chucking it over to the loop to execute when it can. In this example, the two tasks of restarting and saving don't depend on one another, and I'm completely sidestepping the potential concern, or complexity, of whether we should restart a host if we can't save to the database, and vice versa.

But maybe you actually want your work to happen serially; you may not want concurrency for some asynchronous tasks. For instance, maybe you only restart hosts that have an uptime of more than seven days. This is similar to how, in banking, you should check the balance of an account before you actually debit it. Needing code to be serial or sequential, to have steps or dependencies, doesn't mean that you can't be asynchronous. The await of last_restart_date will yield to the loop, but that doesn't mean that restart_host will be the next thing that the loop executes; it just allows other things to happen outside of this coroutine.
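To make the code in the slides a bit more concrete, here's a rough sketch of where the publisher and consumer stand in the concurrent version described above. The function names follow the talk, but the bodies are simplified stand-ins for the code in the full write-up:

```python
import asyncio
import logging
import random
import string

logging.basicConfig(level=logging.INFO)


async def publish(queue):
    """Simulate an external publisher of messages."""
    while True:
        # unique ID for each message
        msg_id = "".join(random.choices(string.ascii_lowercase, k=4))
        # fire-and-forget: schedule the put without blocking this loop
        asyncio.create_task(queue.put(msg_id))
        logging.info(f"Published message {msg_id}")
        await asyncio.sleep(random.random())


async def restart_host(msg):
    """Simulate the I/O work of restarting a host for this message."""
    await asyncio.sleep(random.random())
    logging.info(f"Restarted host for {msg}")


async def save(msg):
    """Simulate saving the message to a database."""
    await asyncio.sleep(random.random())
    logging.info(f"Saved {msg} into the database")


async def consume(queue):
    while True:
        # blocking *this coroutine* (not the loop) while waiting is fine
        msg = await queue.get()
        logging.info(f"Consumed {msg}")
        # fan out work per message without blocking further consumption
        asyncio.create_task(restart_host(msg))
        asyncio.create_task(save(msg))
```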
With all of that in mind, I'll just put this message-related logic into a separate coroutine so we don't block the consumption of messages; saving a message shouldn't block restarting a host if needed. So we'll return it to being a task, and we're just going to remove the uptime check and restart hosts indiscriminately, because why not.

So we've pulled the message from the queue and fanned out work based off of that message, and now we need to perform finalization work on that message. Often with pub/sub technologies, if you don't acknowledge a message within a predefined time frame, it will get redelivered, so as a finalization task we should acknowledge the message so it isn't redelivered. We currently have two separate tasks, save and restart_host, and we want to make sure that they are both done before the message is cleaned up. Now, we could go back to the sequential awaits, since that's a very direct way of manipulating the ordering, but we can also use a callback on a completed task. What we therefore want is some way to have a task that wraps around the two coroutines of save and restart_host, since we have to wait for both to finish before the cleanup can happen. We can make use of asyncio.gather, which returns a future-like object to which we can attach the callback of cleanup via add_done_callback. We can now just await that future in order to kick off the save and restart_host coroutines, and then the callback of cleanup will be called once those two are done. Visualizing this a bit, you can see that both the save and the restart coroutines complete, and then cleanup is called to signify that the message is actually completely done, and we've also maintained appropriate concurrency. But I don't know about you, I have an allergy to callbacks. Perhaps we also need cleanup to be non-blocking, so we can actually just await cleanup after awaiting the gather itself, which I think is that much cleaner looking.

So to quickly review: asyncio is pretty easy to use, but being easy to use doesn't automatically mean that you're using it correctly. You can't just throw async and await keywords around blocking code. It's a shift in mental paradigm, where you need to think about what work can be farmed out and left to do its thing, what dependencies there are, and where code might still need to be sequential. Having steps within your code, first A, then B, then C, may seem like it's blocking when it's not; sequential code can still be asynchronous. For instance, I might have to call customer service for something and wait to be taken off hold, but while I wait I can put the phone on speaker and pet my super-needy cat. I might be single-threaded as a person, but I can still multitask.

Often you'll want your service to gracefully shut down if it receives a signal of some sort: cleaning up open database connections, stopping the consumption of new messages, finishing responding to current requests while not accepting any new ones, that kind of thing. So if we happen to restart an instance of our own service, we should probably clean up the mess that we've made before exiting completely. Here's some typical boilerplate code to get a service running: we have a queue instance, we set up the loop, schedule the publish and consume tasks, and then start the event loop, and maybe you even catch the commonly known KeyboardInterrupt exception.
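Continuing the sketch above, a minimal version of the handle_message coroutine and that boilerplate entry point might look something like this. The cleanup step here just simulates acking the message, and consume now hands all per-message work to handle_message; the real slide code differs in detail:

```python
async def cleanup(msg):
    # simulate acking the pub/sub message once all work for it is done
    await asyncio.sleep(0)
    logging.info(f"Done. Acked {msg}")


async def handle_message(msg):
    # run save and restart concurrently, then clean up once both finish
    await asyncio.gather(save(msg), restart_host(msg))
    await cleanup(msg)


async def consume(queue):
    while True:
        msg = await queue.get()
        logging.info(f"Consumed {msg}")
        # one coroutine owns all the message-related logic, scheduled as a task
        asyncio.create_task(handle_message(msg))


def main():
    queue = asyncio.Queue()
    loop = asyncio.get_event_loop()
    try:
        loop.create_task(publish(queue))
        loop.create_task(consume(queue))
        loop.run_forever()
    except KeyboardInterrupt:
        logging.info("Process interrupted")
    finally:
        logging.info("Cleaning up")
        loop.close()


if __name__ == "__main__":
    main()
```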
So if we run it as is and then send it the interrupt (SIGINT) signal, we see that we do get to that except and that finally block, with those two log lines. However, if we send our program a signal other than SIGINT, like SIGTERM, we see that we don't actually reach the finally clause, where we're logging that we're cleaning up and closing the loop. It should also be pointed out that even if you only ever expect a SIGINT signal or KeyboardInterrupt, it could happen outside of the catching of that exception, potentially causing the service to end up in an incomplete or otherwise unknown state.

So instead of catching KeyboardInterrupt, we can use a signal handler on the loop itself. First we define our shutdown behavior: a coroutine that will be responsible for doing all of our necessary shutdown tasks. Here I'm just closing database connections, returning messages as nacked so that they can be redelivered and not dropped, and then collecting all outstanding tasks, except for the shutdown task itself, and cancelling them. Now, you don't necessarily need to cancel pending tasks; we could just collect them and allow them to finish. We may also want to take this opportunity to flush any collected metrics so that they're not lost. Then let's add our shutdown coroutine function to the event loop: the first thing we do is set up our loop, then add our signal handler with the desired signals that we want to respond to, and remove the KeyboardInterrupt handling. Running this again, we actually do see that we get to that finally clause.

Now, you might be wondering which signals to react to, and apparently there is no standard. Basically, you should be aware of how you're running your service and handle things accordingly, and it seems like it can get particularly messy with conflicting signals once you add Docker to the mix.

Another misleading API in asyncio is shield. The docs say that it's meant to shield a future from cancellation, but if you have a coroutine that must not be cancelled during shutdown, asyncio.shield will not help you. This is because the task that asyncio.shield creates gets included in asyncio.all_tasks, and therefore receives the cancellation signal just like the rest of the tasks. To help illustrate this, I have a simple async function with a long sleep that finally logs a line saying it's done, and we want to shield it from cancellation. As per the docs, we have a parent coroutine shielding the coroutine from getting cancelled; if the parent task running the shielded coroutine is cancelled, that shouldn't affect the shielded coroutine. So we add our parent task to our main function, and when we run this and interrupt it after a second, we see that we don't actually get to the "done" log line and that it's immediately cancelled. Even if our shutdown coroutine function skips cancelling the shielded coroutine, or even the parent task, it still ends up getting cancelled.

So basically, we don't really have any nurseries in asyncio to clean ourselves up; it is on us to be responsible and close up connections and files that were opened, respond to outstanding requests, and basically leave things how we found them. Doing our cleanup in the finally clause isn't enough, though, since a signal could be sent outside of the try/except clause.
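A sketch of that shutdown coroutine and the signal handler wiring, replacing the KeyboardInterrupt-based entry point above, might look roughly like this. The database and nack steps are just placeholder log lines, and the exact cleanup behavior is an assumption based on the talk's description:

```python
import signal


async def shutdown(sig, loop):
    """Cleanup tasks tied to the service's shutdown."""
    logging.info(f"Received exit signal {sig.name}...")
    logging.info("Closing database connections")
    logging.info("Nacking outstanding messages")
    # collect every task except the one running this coroutine
    tasks = [t for t in asyncio.all_tasks() if t is not asyncio.current_task()]
    for task in tasks:
        task.cancel()
    logging.info(f"Cancelling {len(tasks)} outstanding tasks")
    await asyncio.gather(*tasks, return_exceptions=True)
    loop.stop()


def main():
    queue = asyncio.Queue()
    loop = asyncio.get_event_loop()

    # tell the loop how it should be deconstructed, as early as possible
    signals = (signal.SIGHUP, signal.SIGTERM, signal.SIGINT)
    for s in signals:
        loop.add_signal_handler(
            s, lambda s=s: asyncio.create_task(shutdown(s, loop))
        )

    try:
        loop.create_task(publish(queue))
        loop.create_task(consume(queue))
        loop.run_forever()
    finally:
        logging.info("Cleaning up")
        loop.close()
```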
So as we construct our loop, we should tell it how it should be deconstructed, as soon as possible in the program, to ensure that all of our bases are covered. We also want to be aware of when our program could be shut down, which is closely tied to how we run it: if it's a manual script, then SIGINT is fine, but if it's daemonized within a Docker container, then SIGTERM might be more appropriate. And finally, if you use shield in a service that has a signal handler, you should be aware of its funky behavior.

You might have noticed that we're not doing any handling of exceptions so far, so let's revisit our restart_host coroutine and add a super-realistic exception. Running this, we do see that the super-serious exception is raised, but we also get "Task exception was never retrieved", and that's because we don't properly handle the result of a task when it raises. What we can do is define a global exception handler, this one super simplified, and then attach it to our loop, similar to the signal handling. If we rerun this, we do see the logging of that exception, so we are actually handling it.

But perhaps you want to treat exceptions from certain tasks more specifically. It's good to have exception handling on a global level, but also on a more specific level. So let's revisit our handle_message coroutine. Say, for instance, you're fine with just logging when saving a message fails, but you want to nack, or not acknowledge, the pub/sub message and put it back on the queue to retry the whole message if the restart fails. Since asyncio.gather returns results, we can add a more fine-grained exception handler here and handle the results as we wish. I want to highlight that setting return_exceptions to True is super imperative here; otherwise exceptions will be handled by whatever default handler is set. So be sure that there's some sort of exception handling, either globally, individually, or a mix, most probably a mix, otherwise exceptions may go unnoticed or cause weird behavior. I also personally like using asyncio.gather because the order of the returned results is deterministic, but it's easy to get tripped up by it: by default it will swallow exceptions while happily continuing to work on the other tasks it was given, and if an exception is never retrieved, weird behavior can happen.

Alright, sometimes you need to work with threads, say with a threaded pub/sub client, and you might want to consume a message on one thread and then handle the message within a coroutine on the main event loop. So let's first attempt to use the asyncio API that we're familiar with and update our synchronous callback function to create a task, via create_task, from the handle_message coroutine that we defined earlier, and then call our threaded consuming function via a ThreadPoolExecutor. But we don't get very far: at this point we're in another thread, and there's no loop running on that thread; it's only on the main thread. If we take what we have right now and update our function to use the main event loop, we actually do get it working, or it looks like it worked, but this is deceptive; we're sort of lucky that it works. Some folks can probably already see that we're not being thread safe. So instead of loop.create_task, we should be using asyncio's thread-safe API: asyncio.run_coroutine_threadsafe.
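A sketch of handing a message from a consumer thread back to the main event loop with that thread-safe API might look like this. The threaded pub/sub client is simulated with a simple loop here; the real client in the talk is an external library:

```python
import concurrent.futures
import random
import time


def consume_sync(loop):
    """Simulated threaded pub/sub consumer running off the main thread."""
    while True:
        msg = f"host-{random.randrange(1, 100)}"
        # schedule the coroutine on the *main* loop from this other thread
        asyncio.run_coroutine_threadsafe(handle_message(msg), loop)
        time.sleep(1)


async def run_threaded_consumer(loop):
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    # run the blocking consumer in a thread, wrapped as an awaitable
    await loop.run_in_executor(executor, consume_sync, loop)
```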
It can be difficult to tell when you're not being thread safe, particularly when it looks like it works, as it did in our previous attempt, but in a bit I'll show you how to easily surface when there's a thread-safety issue. In my opinion, it isn't too difficult to work with threaded code in asyncio when it's similar to how we work with non-async code in the async world: we just make use of the ThreadPoolExecutor, which essentially creates an awaitable for us. However, it is difficult to work with both threads and asyncio when there's some sort of shared state between a thread and the main loop, so if you must, use the thread-safe APIs that asyncio gives you. It took me an embarrassingly long time to realize they existed.

Alright, now on to testing. For a more simplistic starting point, we're going to test asyncio code before we introduce threading. To start simple, we're going to test the save coroutine using pytest. Since save is a coroutine, our test will need to run the coroutine in the event loop, which Python 3.7 makes easy for us with asyncio.run; with older Python 3 versions you have to construct and deconstruct the loop yourself. But there is a better way: there's a pytest plugin called pytest-asyncio that will essentially do that hard work for you. All you need to do is mark the particular tests that exercise async code with a decorator from the plugin, and make the test function itself a coroutine function. Now, when running the test, the plugin will do the work of constructing and deconstructing the event loop for you.

The pytest-asyncio plugin can get you pretty far, but it doesn't help when you need to mock out coroutines. For instance, our save coroutine function calls another coroutine function, asyncio.sleep, or maybe some actual call to a database, and you don't actually want to wait for asyncio.sleep to complete, or you don't want a real connection to the database to happen. Both the unittest.mock and pytest-mock libraries do not support asynchronous mocks, so we're going to have to work around this a bit. First we make use of the pytest-mock library and create a pytest fixture that essentially returns a function: the outer function returns an inner function as the fixture that we'll use in our tests, and the inner function creates and returns a mock object that we'll use in our test, as well as a stub coroutine that the mock will end up calling. It also patches, if needed, the coroutine function with that stub, so we can avoid network calls, sleeps, et cetera. Then we create another pytest fixture that uses this coroutine-mock fixture to mock and patch asyncio.sleep; we don't need the stub coroutine it returns, so we can just throw that away. Then we use the mock sleep fixture in our test_save function: down here we've basically patched asyncio.sleep within our mayhem module with the stub coroutine function, and then we just assert that the mocked asyncio.sleep object is called when mayhem.save is called. Because we now have a mock object instead of an actual coroutine, we can do anything that is supported with standard mock objects, like assert_called_once_with, setting return values, and side effects. So it's pretty simple, I guess.
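A sketch of what that coroutine-mock fixture and the save test might look like with pytest-asyncio and pytest-mock. The mayhem module name and the fixture shapes follow the talk's description, so treat the details as approximate:

```python
import pytest

import mayhem  # the module under test (name assumed from the talk)


@pytest.fixture
def create_mock_coro(mocker, monkeypatch):
    """Return a factory that builds a mock plus a stub coroutine wrapping it."""
    def _create_mock_coro(to_patch=None):
        mock = mocker.Mock()

        async def _coro(*args, **kwargs):
            return mock(*args, **kwargs)

        if to_patch:
            # replace the real coroutine function with the stub
            monkeypatch.setattr(to_patch, _coro)
        return mock, _coro

    return _create_mock_coro


@pytest.fixture
def mock_sleep(create_mock_coro):
    # we only need the mock, not the stub coroutine it returns
    mock, _ = create_mock_coro("mayhem.asyncio.sleep")
    return mock


@pytest.mark.asyncio
async def test_save(mock_sleep):
    msg = "a-message-id"
    await mayhem.save(msg)
    assert 1 == mock_sleep.call_count
```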
Alright, but maybe you want to test code that calls create_task, and we can't simply use the coroutine-mock fixture there. For instance, let's revisit our consume coroutine, which creates and schedules a task on the loop out of handle_message. We first need a couple of fixtures for the queue that gets passed in: the first will mock and patch the asyncio.Queue class in our module, and then we'll use that mock queue fixture in another one, a mock queue.get fixture. Unlike our mock sleep fixture, here we will use the stub coroutine that the coroutine-mock fixture returns and set it as the mock queue's get method. So here's our test_consume function, where we're given our newly created fixtures. Let's try to use the coroutine-mock fixture to mock and patch the call to handle_message, the coroutine scheduled via create_task. Note that we're setting the mock get's side_effect to one real value and one exception, to make sure we're not permanently stuck in that while loop. Finally, we want to assert that the mock for handle_message has been called after consume has been run. When running this, we see that the mock handle_message is not actually called as we're expecting, and this is because the scheduled tasks are only scheduled at this point, still pending, and we sort of need to nudge them along. We can do this by collecting all running tasks within the test itself and running them explicitly. This is a bit clunky, I know; if you use unittest from the standard library, there is a package called asynctest that handles this better and exhausts the scheduled tasks for you.

So I hear that you all want 100% test coverage, which is great, but it might be difficult for our main function: we need to set up signal handling and exception handling, we need to create a few tasks, and then start and close the loop. And we can't exactly use the event_loop fixture that the pytest-asyncio library gives us as it is; we need to manipulate the event loop that pytest-asyncio will inject into the tested code. So what we do is update the testing event loop and override its close behavior: if we close the loop during the test, we will lose access to the exception and signal handlers that we set up within the main function, so we actually want to close the loop only after we're done with the test, and then we can also use the mock to assert that our main function actually closes the loop. Now we'll write a test_main function that borders on an integration or functional test: we want to make sure that, in addition to the expected calls to publish and consume, shutdown gets called when expected. But we can't exactly mock out shutdown with a coroutine mock, since that would patch it with just another coroutine and therefore run the mock coroutine each time it receives a signal, rather than cancelling the tasks and stopping the loop. So instead we're going to mock out what's called within the shutdown coroutine, the asyncio.gather, and then here I'm starting a thread that will actually send the process a signal after a tenth of a second. After starting the thread, we call the main function that we want to test. Looking at the second half of the test, we can assert that the loop setup is the way we expected, and that our mocked functions have been called, referring back to the setup in the first half of the test.
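Going back to the consume test for a moment, the "nudge the pending tasks along" trick might look roughly like this, continuing in the same test module as the sketch above. The fixture names are assumptions following the same pattern:

```python
import asyncio


@pytest.fixture
def mock_queue(mocker, monkeypatch):
    queue_cls = mocker.Mock()
    # patch the asyncio.Queue class as seen from the mayhem module
    monkeypatch.setattr(mayhem.asyncio, "Queue", queue_cls)
    # the fixture hands back the mocked queue *instance*
    return queue_cls.return_value


@pytest.fixture
def mock_get(mock_queue, create_mock_coro):
    mock_get, coro_get = create_mock_coro()
    # the stub coroutine becomes the queue's get method
    mock_queue.get = coro_get
    return mock_get


@pytest.mark.asyncio
async def test_consume(mock_get, mock_queue, create_mock_coro):
    mock_handle_message, _ = create_mock_coro("mayhem.handle_message")

    # one real message, then an exception so we break out of `while True`
    mock_get.side_effect = ["a-message-id", Exception("break the loop")]

    with pytest.raises(Exception, match="break the loop"):
        await mayhem.consume(mock_queue)

    # the handle_message task is only *scheduled* at this point; run the
    # pending tasks explicitly so it actually executes before we assert
    pending = [
        t for t in asyncio.all_tasks() if t is not asyncio.current_task()
    ]
    await asyncio.gather(*pending)

    mock_handle_message.assert_called_once_with("a-message-id")
```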
For that main test, you might also want to parametrize the test function to cover not just the SIGINT signal but all the signals you're expecting, and probably also a signal you're not expecting, like SIGUSR1 or something like that. So the TL;DR of testing: use pytest-asyncio with pytest. There's also a package called asynctest for unittest that functions similarly to pytest-asyncio, with the bonus that it will exhaust the tasks scheduled on the loop for you automatically, as well as provide coroutine mocks.

Alright, so we're decent programmers and have code coverage, but sometimes things break and we can't figure out what's going on, and we use everyone's favorite debugger: print, even if you won't admit it. If you have just one tiny little thing to debug, you can use the print_stack method on the task instance, and when you run this, it will print the stack for each running task. You can also increase the number of frames that are printed. But you will probably actually need to use asyncio's debug mode, which is within the standard library itself. Along with setting our logging level to DEBUG, we can easily turn on asyncio's debug mode while we run our script. If we don't have proper exception handling set up, we get information about which task is affected, and also what is called a source traceback block, which gives us more context in addition to our normal traceback. So without debug mode, we're only told that there's an exception that isn't properly handled, but with debug mode, we get additional clues as to what might be going on and where.

Another very handy thing that I wish I'd known of a few years ago is that if you have threads and an event loop interacting with each other, debug mode will surface not being thread safe as a RuntimeError and just quit out. Super helpful. Also, a really nice feature of asyncio's debug mode is how it acts like a tiny profiler that will log async calls slower than 100 milliseconds. We can fake a slow coroutine by putting in a blocking call with time.sleep, and when we run the script, we can see that it surfaces the slow-to-finish tasks, potentially highlighting an unnecessarily blocking call. The default for what's considered slow is 100 milliseconds, but that is easily configurable too: you can set slow_callback_duration, in seconds, directly on the loop itself.

Much like some people's testing philosophies, sometimes we want to debug in production, because why not, but usually you don't want full-on debug mode while in production. There's a lightweight package called aiodebug that will log slow callbacks for you, and it also comes with the ability to report delayed calls to statsd, if you use statsd. That's all this library does, so it's super lightweight and quite handy.

So: you can easily print the stack of a task if needed, but you get a lot with asyncio's debug mode; it gives more information around unhandled exceptions, around not being thread safe, and around slow-to-complete tasks. And if you want to understand slow tasks while in production, aiodebug is a lightweight library that essentially does only that.
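For reference, turning on debug mode and tweaking the slow-callback threshold is only a couple of lines; a minimal sketch of the kind of setup described above:

```python
import asyncio
import logging
import time

logging.basicConfig(level=logging.DEBUG)


async def blocking_coro():
    # a fake "slow" coroutine: time.sleep blocks the whole event loop
    time.sleep(0.5)


async def main():
    loop = asyncio.get_running_loop()
    # anything slower than this (in seconds) gets logged by debug mode;
    # the default threshold is 0.1
    loop.slow_callback_duration = 0.2
    await blocking_coro()


if __name__ == "__main__":
    # debug=True is equivalent to running with PYTHONASYNCIODEBUG=1
    asyncio.run(main(), debug=True)
```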
As we saw with asyncio's debug mode, the event loop can already track coroutines that take up too much CPU time to execute, but it might be hard to tell what is an anomaly and what is a pattern. A lot of folks might first reach for cProfile when trying to understand performance, and we can try that here too, but there isn't that much to glean from it: the top item is essentially the event loop itself, and even if we look at our own code specifically, you can kind of get a picture of what's going on, but it doesn't immediately surface any bottlenecks or particularly slow areas. Of course, our main function has the most time cumulatively, since that's where the event loop is run, but nothing is immediately obvious.

So I recently discovered KCachegrind, even though it's been around for a while, and you can use it with Python. To do so, we first save the output of cProfile and then use a package called pyprof2calltree, which takes the cProfile output and converts the data into something that KCachegrind can understand. When it's run, you're met with this UI, and it's okay if you can't make anything out, but basically on the left-hand side is the profiling data that we would otherwise see from the output of cProfile, and the two sections on the right show information about callers and callees, including a call graph on the bottom and a map of the callees on the top. If we limit our view to only our script and start clicking around, we can start to get an idea of where time is most spent. The visualization groups modules together by color, and when I first ran this service I noticed there was a lot of blue on the callee map up top; if you click into that blue, it's actually logging that's taking a lot of time. We'll come back to that in a second. So KCachegrind allows us to get a broad picture of what's going on and gives us visual clues about where to look for potential areas of unnecessary time spent.

Then there's the line_profiler package, which we can use to hone in on areas of our code that we're suspicious of. After installing it, you add a profile decorator where you want to profile; here I'm just decorating the save coroutine for now. The line_profiler library comes with a tool called kernprof that we invoke our script with, and then we render the output of line_profiler itself, which is a line-by-line assessment of our decorated code. The total time spent in this function is just over two milliseconds, and the majority of that time is spent in logging. Now, if only there were something we could do about that. Coincidentally, someone has done something about it: there's a package called aiologger that allows for non-blocking logging. If we switch out the default logger for aiologger and rerun line_profiler, we can see that the total time spent in the function has halved, as has the time spent while logging. Certainly these are minuscule improvements we're making here, and there's probably a lot more that we could do, but if you imagine this on a larger scale, we could probably save a lot of time. And as I see it, if we have an event loop, let's try to take full advantage of it.
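Swapping in aiologger for the blocking default logger is a small change; here's a rough sketch, assuming aiologger's Logger.with_default_handlers API (check its docs for the exact setup in your version):

```python
import asyncio

from aiologger import Logger


async def main():
    # non-blocking logger that writes via the event loop
    logger = Logger.with_default_handlers(name="mayhem")
    # log calls return awaitables; awaiting is optional, fire-and-forget also works
    await logger.info("Restarting host")
    # flush and close the logger's handlers before exiting
    await logger.shutdown()


if __name__ == "__main__":
    asyncio.run(main())
```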
So we've profiled using cProfile and line_profiler, but we had to stop the service in order to look at the results. Perhaps you'd like a live profiler along with your production testing and debugging. For that, there's a package called profiling that provides an interactive UI, and it supports asyncio as well as threads and greenlets. Granted, you can't attach to an already running process with this particular tool; you'll need to launch your service with it. But when you do, you get a text-based UI that regularly updates. You can drill down, you can pause it while inspecting something and then resume, and you're able to save performance data to view with the same UI at a later time. The tool also provides a server which you can connect to remotely.

So the TL;DR of profiling: there isn't much difference between profiling asyncio code and non-asyncio code, though it can be a little confusing to look at the cProfile output. To get an initial picture of your service's performance, using cProfile with KCachegrind can help surface areas to investigate; without that visualization, we saw that it can be a bit difficult to see the hotspots. Once you have an idea of those hotspot areas, you can use line_profiler to get line-by-line performance data. And finally, if you want to profile in production, I suggest taking a look at the profiling package.

So in essence, this talk is something that I would have liked a year ago, speaking to past Lynn here, but I'm hoping that there are others who can benefit from a use case that is not a web crawler. This is the link to the full blog post, the slides, and all the code. And I must give my obligatory spiel: Spotify is hiring for various engineering and data science positions at all of our engineering offices, Stockholm, London, New York, and Boston, so if you're interested, come talk to me. Thank you very much. [Applause]
Info
Channel: EuroPython Conference
Views: 5,553
Rating: 4.9506173 out of 5
Id: sW76-pRkZk8
Length: 40min 2sec (2402 seconds)
Published: Tue Oct 01 2019