Lonestar Elixir 2020 Speaker Talks: Michael Crumm - Batch Operations with Broadway

Captions
Good afternoon, everybody. I'm Michael Crumm, and welcome to my Broadway talk. Broadway is a concurrent, multi-stage tool for building data ingestion and data processing pipelines. Specifically, this talk is about how to increase pipeline efficiency by using batchers and the handle_batch/4 callback. If you don't know what any of those words mean, that's okay, but buckle up, because we're going to go fast.

I'm an Elixir engineer at DockYard, and I'm the author and maintainer of the Google Cloud Pub/Sub connector for Broadway. Broadway ingests messages, messages that originate remotely. I mentioned Google Cloud; there are also official adapters, or connectors, for Apache Kafka, Amazon SQS, RabbitMQ, and a few others out in the wild. In fact, any GenStage producer can be a Broadway producer, as Broadway is built on top of GenStage.

Last fall I wrote an article about batch operations. You can find it on the DockYard site, and it was also about a pizza shop. So, keeping with the pizza theme, this is my elevator pitch for Lone Star Elixir 2020.
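To make the "any GenStage producer can be a Broadway producer" point concrete, here is a minimal pipeline sketch. This is an illustrative assumption, not the speaker's exact code: the PizzaBox.Pipeline module name is hypothetical, and Broadway.DummyProducer is the no-op producer that ships with Broadway for testing.

```elixir
defmodule PizzaBox.Pipeline do
  use Broadway

  # Called by default when the pipeline is added to a supervision tree.
  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [
        # Any GenStage producer works here; in practice you would use
        # the adapter for your environment, e.g. BroadwayCloudPubSub
        # or BroadwaySQS. DummyProducer emits nothing on its own.
        module: {Broadway.DummyProducer, []}
      ],
      processors: [
        # The definition takes a list, but Broadway allows exactly one
        # processors pool, conventionally named :default.
        default: [concurrency: 2]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    # Returning the message unchanged lets Broadway acknowledge it.
    message
  end
end
```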
That's just a picture of a freight container, but a pizza box container would have a kiosk up front, maybe a carport in the back, a quadcopter landing pad on top. I don't know; I thought, maybe. But it turns out someone has already wasted millions and millions of dollars on a worse version of this, so we move on.

Today we're going to act as the franchise payments team for Pizza Box. Our roll (you see what I'm doing here), our role, will be to ingest these very Stripe-y charge events and insert them into our database. Each franchisee for Pizza Box gets its own DB prefix, and we'll need to take that into account when creating these charges. So, without further ado, let's dive in.

To use Broadway, you `use Broadway`, and we add a start_link function, which will be called by default when you add your pipeline to a supervision tree. We invoke Broadway.start_link with the pipeline module and the name of this particular pipeline process, and then we add a producer definition. This is my custom producer, the backstage event producer: it emits events equal to the demand, and will do so forever. It's purely for example purposes; in practice this is where you would use whatever adapter works for your environment, whether that's Cloud Pub/Sub, SQS, or something custom. Finally, for this stage, we configure the processors. This can be a little confusing given that the definition takes a list, but Broadway will only allow you to define one processors pool. We do that here and name it :default.

Processors are the first stage in the Broadway pipeline, your first opportunity to interact with the message you receive. To do that, we add a few aliases: Broadway.Message, which is the data structure that will contain your data, and our Payments context. Then we define our handle_message/3 callback. handle_message/3 receives the name of the processor stage, the message itself, and an optional context, and we go about inserting our first charge. Let's take it step by step. We unpack the actual charge data from that monstrosity of JSON in the message we got earlier, pop the franchise from the list of charge attributes, and use it for the DB prefix. Then we create the charge, and if it's successful, we just return the message; Broadway will handle acknowledgments for you. If you're familiar with any of these queues, SQS or Pub/Sub, you need to acknowledge the message when you're done with it. In the case of an error, we fail the message by invoking Message.failed/2 with the message and the error reason. The behavior of a failed message is entirely dependent on your connector: Google Cloud Pub/Sub will continue to retry that message up to some configured expiration point, by which time you will have received it over and over again, every 10 seconds or so by default, until you acknowledge it. SQS has a little more functionality there, letting you set up dead-letter queues and do some other things like that, so look at your connector documentation for the specifics. Independent of the connector, you also have another opportunity to work with failed messages in the handle_failed callback, which you can read more about in the docs.

Okay, so who likes charts and graphs? I've been building the Phoenix telemetry dashboard, and it's still pretty rough, but let's check it out. Oh, that's no fun. My apologies, just a moment. There, that's better. Okay, this is the live dashboard. It
is instrumenting a chart for a custom metric I wrote to poll Ecto, or the DB specifically, for charge totals, the number of charge records per franchise. Right now we're doing about 100 records per second per franchise (this is polling every two seconds), so about 800 records a second in total. It's okay. It's working. Neat.

So with just the implementation of the required callback, handle_message/3, our pipeline is pretty okay, around 800 messages a second. You could probably be done there and that would be fine, but we want to get past okay, and to get past okay we need to reach beyond the required and dive into the optional. Let's take a look at what this would look like with a handle_batch/4 callback. First, we alias the BatchInfo struct, which will contain information about our batch when we receive it, and we add a definition for our batcher stage, which we'll name :charges. Unlike the processors stage, a pipeline can have multiple batcher stages, and messages are routed to a particular batcher when they're processed in the handle_message/3 callback. So, just in case you missed that: handle_message/3 first, handle_batch/4 second. That seems to be a point of confusion sometimes; it certainly was for me.

The first step towards batchers, once we have our configuration in place, is to modify our handle_message/3 callback, so we'll modify the example we had before. Instead of creating the charge, we build a charge entry. A charge entry is simply a map of attributes that we'll use with Ecto to call Repo.insert_all/3, as opposed to Repo.insert/2. Unlike insert/2, auto-generated fields aren't populated for you, so one thing you have to take into account is that if Ecto is auto-generating your IDs or your timestamps or anything of that sort, you need to account for that yourself and get those attributes into the map for an insert_all/3 call. Once we have a successful entry, we update the data in the message through the update_data function, which receives the current data and returns the new data. After that, we set the batcher that we want to handle this particular message, using the put_batcher function, and we set it to the name of the batcher stage we defined in our start_link function. Finally, and most importantly for this example, we set a batch key. A batch key is a term used to group messages within your batcher stage; as the docs word it, messages will be grouped per batch key, per batcher, up to batch_size (another option, which sets the size of your batches). We'll group these by the prefix so that we can insert them by the prefix, and we'll fail the message as usual if it fails at this stage. So there's the updated handle_message/3 function.

Finally, here we are: handle_batch/4. handle_batch/4 receives the batcher name, the list of messages (which is the batch), the BatchInfo struct, which carries the metadata we need (in this case we only care about the prefix), and again the optional context. Now, those who've heard me rehearse this talk for many, many days know that I struggled very much with this slide, so if you don't like the code here, I don't either; we'll just leave it at that. We're going to insert and get all the charge IDs: we run an insert_all, return just the IDs of those, and then fail the messages we didn't insert. That basically looks like this: we unpack the data from the messages again, insert and get all the
entries, map those to just their IDs, and, to fail the messages, split based on the IDs that aren't there, then return both the good messages and the bad messages that have now been failed. It's important to note here that whether it's handle_message, which gets one message, or handle_batch, which gets a list of them, in all cases you return all the messages you received; Broadway wants them all back.

Okay, let's take a look. There we go. That looks significantly better. After an initial, relatively large jump, we're now seeing (it's easier to see if I kill it) about 1,500 messages every two seconds, and that's per franchise, so around 750 messages a second. Not too shabby, quite an improvement, and I'd consider that more than okay. I'm relatively happy with this, and I'm going to leave it there because I'm tired of fussing with the window.

So, have I convinced you to use batchers yet? Yes? No? Maybe? Okay, would you believe me if I told you we could do better? Fantastic, because we can. Each stage in your Broadway pipeline can be tuned independently: concurrency, as well as the amount of messages being handled at any given time. It's important to note that this is not tuned, this is just more; I want to be very clear about that. By default, I think min and max demand are something like 5 and 10, so we're substantially increasing that, and we're going to run four batcher processes. They'll expect batches of size 500, and if they don't have 500 within the timeout of one second, they'll deliver whatever they have. All right, here we go. Whoops, no, stop it. Okay, I'll give it a second to reconnect. There. Okay, awesome.

Once again, a fairly large improvement over what we were doing before. Hard-number-wise, and again there's literally no tuning here (did I mention this is Postgres in a Docker container?), we're sitting around 56,000 to 59,000 on the chart, so we've about doubled our throughput, about 1,500 messages a second per franchise at this point. And that's where I'll leave it. This was a rough, quick look at some of the things coming on the dashboard side, and I hope this talk at least gave you a view into why we should move past just what's required and take a look at the additional features of Broadway and its pipelines.

I have to thank DockYard. As I said before, I work at DockYard; they pay for me to be here and give me time to work on all these awesome open source projects, and it's just such a great place to be. I'm so happy to be able to get up here and talk to you all about these things. If you're looking for training in Elixir, LiveView, all things Broadway, what have you, come talk to us. All right, thanks so much.
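The pieces described in the talk can be put together as one batcher-enabled pipeline. This is a sketch under stated assumptions, not the speaker's exact code: the PizzaBox module names, the build_entry/1 and prefix_for/1 helpers, the "charges" table, and the tuning numbers are all illustrative, and Broadway.DummyProducer stands in for a real adapter.

```elixir
defmodule PizzaBox.BatchedPipeline do
  use Broadway

  alias Broadway.{BatchInfo, Message}

  def start_link(_opts) do
    Broadway.start_link(__MODULE__,
      name: __MODULE__,
      producer: [module: {Broadway.DummyProducer, []}],
      processors: [default: [concurrency: 8, max_demand: 100]],
      batchers: [
        # Four batcher processes; batches of up to 500 messages,
        # flushed after at most one second even if not full.
        charges: [concurrency: 4, batch_size: 500, batch_timeout: 1_000]
      ]
    )
  end

  @impl true
  def handle_message(_processor, message, _context) do
    prefix = prefix_for(message)

    # Build an insert_all-ready entry instead of inserting one row,
    # then route the message to the :charges batcher, grouped by the
    # franchise DB prefix via the batch key.
    message
    |> Message.update_data(&build_entry/1)
    |> Message.put_batcher(:charges)
    |> Message.put_batch_key(prefix)
  end

  @impl true
  def handle_batch(:charges, messages, %BatchInfo{batch_key: prefix}, _context) do
    entries = Enum.map(messages, & &1.data)

    # One insert_all per batch. Auto-generated fields (ids, timestamps)
    # are not populated by insert_all, so build_entry/1 must include them.
    {_count, _returned} = PizzaBox.Repo.insert_all("charges", entries, prefix: prefix)

    # Always return every message you received; Broadway wants them all back.
    messages
  end

  # Illustrative helpers: in the real talk these came from the
  # Payments context and the decoded charge JSON.
  defp build_entry(data), do: data
  defp prefix_for(%Message{data: %{"franchise" => franchise}}), do: franchise
end
```

A real version would also fail the messages whose entries were not inserted, using Message.failed/2, before returning the batch.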
Info
Channel: Groxio
Views: 763
Rating: 5 out of 5
Keywords: functional programming, myelixirstatus, elixir language, learn to code, programming conference, coding, programming, computer software, elixir
Id: g88eT6Ow9aQ
Length: 16min 6sec (966 seconds)
Published: Thu Aug 20 2020