AWS Step Functions: Parallelism and concurrency in Step Functions and AWS Lambda

Captions
Alright, hey everybody, it's Rob. If you're joining from the previous IoT All the Things, thanks for sticking around. If you're joining from previous episodes, thanks for coming back. This, of course, is AWS Step Functions, and today we're going to be talking about parallelism and concurrency in AWS Step Functions and AWS Lambda. First, let me change the title here for y'all so we get what we want. Hooray, we got the title.

Here's what we're gonna go over today. Real quick: parallelism versus concurrency. Yes, they're two different things, but do we care in our context? Then I'm gonna talk about three different methods of service implementation of parallelism to help you speed up the execution of your workflows. I'm gonna talk you through when to use each, with a specific use case, and I'm gonna give you a built example of each. It says "build," but when I was pulling these together last night and building them, there was a lot of tempting of the demo powers. I'll be honest: I went ahead and deployed these beforehand to test them, and we're just gonna walk through them this time. The bright side is we know everything will work, and we can spend a little bit more time on the specifics. I'd encourage you to ask questions as we go. If you have a question, in general, about parallelism or concurrency or any of this stuff, please share it. There are a couple of commands you can use in the chat. The links command that I just dropped in there will give you a lot of good links that we're gonna cover today, including definitions of parallelism and concurrency, the definition of the map state, JSONPath, and all these different things that we'll cover, so that's there for your convenience. There's also a GitHub link and a docs link that you can use if you want those.

So, looking at our agenda here, the first thing we want to knock out is parallelism versus concurrency. Let me check my notes: parallelism is the simultaneous execution of possibly related computations, that is, doing lots of things at once, whereas concurrency is the composition of independently executing processes, or dealing with lots of things at once. I bring this up because, depending on which language paradigm you come from, you might understand one better than the other. If you're coming from something like Scala, which implements true parallelism, then you can think about this like parallelism; but if you're coming from something like Go, which treats concurrency very differently, you can think about this like concurrency as well. We're also going to get down into some of the language details of implementation, and then you'll see this stuff come back up. But the answer to that big question, from a service perspective, is: we don't really care. All that matters to us is that we're taking big blocks of work and accomplishing them in a shorter amount of time by using more resources at the same time. If you want to think of it as parallelism, that's great; that's probably more accurate. If you want to think of it as concurrency, that may be more accurate. It doesn't really matter to us.

Don't worry, it's not all gonna be slides. I just really want to make sure I hit this, because it's a little complicated. The first type of parallelism that we're going to talk about is parallel invocations, and this is almost like a default built-in when you're building on the AWS cloud. This is when you have multiple AWS Step Functions workflows, or multiple AWS Lambda functions (multiple instances of them), or arguably even multiple instances of Fargate tasks, running together. Whether you've auto-scaled them out, or you've provisioned that amount of concurrency, or it's just happening, it's the same definition: executing at the same time with different inputs or different data. This can be thought of as web requests that come in, are processed through API Gateway, and hit a Lambda function behind it. All of those Lambda functions might be executing sort of staggered, or concurrently, but that's parallel invocation.

The important characteristic to think about here is that each process is independent. If I make a request to an API Gateway and you make a request to the API Gateway, a failure in my process has absolutely nothing to do with yours, so it shouldn't impact yours; it should be pretty much like an atomic transaction. The example that we use here is some sort of task that runs and looks for non-compliant resources, either scheduled or in response to rules, and then kicks off a remediation workflow for each individual violation. The resolution of each individual violation doesn't have anything to do with the other violations: there may be one, there may be no others, there may be a thousand executing concurrently. The limitation on parallel invocations is basically your account quotas: how many of these things can you execute at the same time? In the links that I just gave you, I included both the Step Functions service limits and the Lambda service limits, and I'd encourage you to check those out. Most of them are soft limits, so if you're finding that you're close to running up against them, go ahead and file a support request and ask for them to be increased. Someone will get with you and make sure we can get that resolved for you. But that's the limiting factor for parallel invocations. Really, it's a lot of independent little workflows doing their own thing at the same time. I don't even want to say ants, because it's not a colony; they're just a bunch of individuals doing their own thing.

So let's take a look at what this looks like in some code. For parallel invocations, I've got a SAM template built here for us, and this SAM template comes from last week's episode of App 2025, which I'd really encourage you to check out as well; it's on Thursdays at the same time. What it does is create a custom Amazon EventBridge event bus, a function for publishing events onto that bus, and a rule that listens for events and invokes a Step Functions workflow that's defined here. Your application just puts whatever the event is, in this case the expired subscription, because that was already written. Anytime a subscription expires, that event is just put onto the bus and this workflow is kicked off independently. Again, going back to the example: my subscription expiring yesterday doesn't have anything to do with your subscription expiring today. Maybe I've been a customer for two years, so I get a special offer, and you were just on a trial subscription for 30 days, so you get an even more special offer. Who knows. This is pretty straightforward, but I want to show you what it looks like visually.

When we come back over here and look at parallel invocation, by definition it's just a simple workflow. These things are for the one item that comes into the workflow, and then it's done. It doesn't look for all other canceled subscriptions, or build statistics around it, or try to process a batch of subscription expirations together. It just takes the one subscription in, processes it, and pushes it back out. So when we execute it, it's gonna look like this. Alright, I've created a file here, and this will all be in the repo afterwards. It just has the format that you need for events to go onto the event bus, and there are five here; two of them will trigger the rules, same as during the App 2025 episode. When we put these events, we get back: alright, none of them failed. That's great, that's what we hoped for. And then we get event IDs back from the bus that you can use for tracing or debugging later.
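A minimal sketch of that kind of wiring in a SAM template might look like the following. To be clear, this is not the actual template from the App 2025 episode: the bus name, the event pattern, and the placeholder state machine body are all illustrative assumptions.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  # Custom event bus that the application publishes onto
  SubscriptionsBus:
    Type: AWS::Events::EventBus
    Properties:
      Name: subscriptions-bus

  # One independent execution of this workflow is started
  # per matching event that arrives on the bus
  ProcessExpirationStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      Definition:
        StartAt: ProcessExpiration
        States:
          ProcessExpiration:
            Type: Pass    # stand-in for the real remediation/offer logic
            End: true
      Events:
        OnExpiredSubscription:
          Type: EventBridgeRule
          Properties:
            EventBusName: subscriptions-bus
            Pattern:
              detail-type:
                - SubscriptionExpired
```

The key point of the shape is that the rule, not your code, fans out the work: each event that matches the pattern starts its own isolated execution.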
We said only two of them would go through, but let's look at these events for a second. That execution will already be done, but we can tell that this matching event, for this subscription, really has nothing to do with the other matching event for its subscription; they just happen to show up at the same time, for our convenience of demonstration. When we look at these executions, we see that we just had two successful ones, and if we dig into one, again we'll only see the information about this one event: the single subscription with its details. Then it completes by putting a processed event back onto the event bus, and we're done. A lot of this functionality is the default parallel invocation behavior that's provided to you when you choose AWS to run your workflows. It isn't necessarily something you need to engineer, although it is something you should be thinking about: you want to make sure you have enough concurrency that you don't get a backup here, that your cost is managed, and that, if these things turn out to be related, as in one of the next use cases we'll talk about, you could rearchitect and refactor to execute more efficiently.

So let's go to our second type of parallelism, which is dynamic parallelism. Dynamic parallelism is also known as the map state in Step Functions workflows, and you choose this model whenever the workflow itself generates a collection of items. Maybe you're executing the workflow on a scheduled basis, or in response to some request, but part of the business process is to identify all of the items that meet some condition, gather them, and then apply a task state or workflow to them. The items are processed individually with an iterator that I'll show you, and then optionally reduced to an aggregate, and that aggregate can be any aggregate function: a sum, a statistical function, whatever works out.

The one major difference here is that errors in processing one item can typically affect the outcome, and sometimes other items. Typically not the processing of other items, but they can affect what happens at the end of the workflow. I want to say hi to Nicol Lounge: welcome to the channel, first time joining us, appreciate you being here. We're discussing different ways that you can execute parallel workflows to speed up your workflows on AWS. So errors in one item, when you're executing this way, can especially affect the outcome of the workflow in aggregate. An example is an order fulfillment system. A customer places several different items into the cart and hits checkout; great, you go out and pre-authorize that payment, and then you need to pull and pack before you charge. But say you didn't have enough inventory to pull all of one item. You still want to ship the rest of the order, but you don't want to charge them for that item. The ultimate end of the workflow is the amount that you charge them, and it's impacted by a failure in one of the items that you generated in your collection. If you look at the links that we put in the chat (I'm gonna put them again for anyone new who's joined us), there's a Step Functions map state blog by one of my colleagues, Danilo, that uses this same example to walk you through the introduction of the map state.

The main limitation around choosing dynamic parallelism is that your state in an AWS Step Functions workflow is limited to 32,768 characters, so only whatever you can fit into that state can be used as an input or a result. This limits you, typically, to the order of hundreds of items if you're designing it well, maybe thousands of items if you're extremely efficient with it, but you're kind of running out of space there, so you need to be judicious with what you're putting in. There's also a different set of limits on how many of these will execute concurrently, so you'll want to check those service limits again.

But I want to go back now and show you what this kind of code looks like. We'll close our parallel invocations example, and we have dynamic parallelism here. This is a workflow that I've generated with the AWS Toolkit for Visual Studio Code, and the way I generated it, if you haven't been with us in previous episodes, was by creating a state machine and then choosing the map state here; it's one of these quick starts that'll get us going. Okay, I'm not gonna go back through that, but we also get the visualization of this workflow over here, and you see, just like we were talking about, a customer places an order and we say, okay, let's authorize that charge. In this case we just mocked it up with a pass state, and we've said, okay, the total amount of items in their cart plus shipping and tax is $81.48, let's authorize that. The authorization succeeded; great, let's move on. We check the authorization here and we have a choice: did the authorization succeed? Yes: retrieve the items. No: go down here to failed. Again, all this code is going to be posted to the repo afterwards so you have a working sample. We retrieve the items using a pass state just to generate the collection, so we have this collection of two items (and I have something bouncing in my dock, which is very strange to see on screen). That's the collection the map state is going to iterate over. You see, we place this collection into the order items path, in JSONPath, and then in the map state our items path is the same. Now, this is really important to note; again, I've left you a link to the JSONPath definition in the links in the chat. We're gonna do an episode next week, I believe, just on JSONPath: input path, items path, result path, output path, all this stuff, because it helps you track the state as it moves through. Then we've provided this max concurrency here, which means, for all of these items (if you see this dashed-line box over here with the sort of parallel floating look to it, that's the iterator), we're gonna execute at most 40 concurrently, and as each one completes, an iterator becomes available to execute the next one. Then, just like any state, when we're done with this entire iterator we move on to the print label phase. Here we've just mocked up "lock the item, pull the item, pack the item," and then you print the label and complete the charge. This complete-charge step can be the Lambda function that takes in the items that were actually packed and then charges you only for those. For each item, as you're processing through, when you try to lock and you fail, you can put a separate workflow in there: if it can't be locked, you push it into a delayed-items queue or an out-of-stock queue for later processing. All these types of things can grow out of here, but I just wanted to give you the most basic workflow. This one we can invoke directly from Visual Studio Code: dynamic parallelism, start execution. Oops. Because we hard-coded everything with pass states, we don't really need any data to pass to it. Now we go back to our state machines; we have this dynamic parallelism state machine, and we see that it just succeeded. This is in the AWS Management Console, under Step Functions, so if any of this is unfamiliar to you, please let me know. I'm looking at questions, and I don't see any, but I am paying attention, so if you have any, let me know. We see the same definition that we had before. What's interesting is to track the change when we move into that dynamic parallelism state, or map state. When we get here, the output is as we expect: we've got this order items state with two items in a collection that we're gonna pass in, and that's where our iterator is going to pull its items.
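As a rough sketch, the map state portion of a workflow like that looks something like this in Amazon States Language. The state names, paths, and iterator body here are illustrative assumptions, not the exact definition generated by the toolkit quick start:

```json
{
  "ProcessAllItems": {
    "Type": "Map",
    "ItemsPath": "$.OrderItems",
    "MaxConcurrency": 40,
    "ResultPath": "$.LockedItems",
    "Iterator": {
      "StartAt": "LockItem",
      "States": {
        "LockItem": {
          "Type": "Pass",
          "Result": { "Locked": true },
          "End": true
        }
      }
    },
    "Next": "Print Label"
  }
}
```

`ItemsPath` points at the collection the previous state produced, `MaxConcurrency` caps how many iterations run at once (40 here, matching the demo), and `ResultPath` is where the per-item results are reassembled so a later state can reduce over them.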
Now, if you look, we can highlight this entire collection, which shows us the statuses. You can filter here based on whether an iteration succeeded, to see only the ones you want, or you can filter based on specific indices from the items path that you gave it, and when you go into an individual step inside that workflow, you can move back and forth between those values. We'll see here that this was our first line item, where we took one at $2.50, and this was our second line item, where we took 17 at $4.21. So you can see, really granularly, what's happening inside this state, even though you're doing multiple things at the same time. Then, at the very end of this, we get a confirmation sent back, one for each item, that it was locked. Of course, in a real workflow you're gonna want to build this out so the locked-items result includes the SKU or some sort of line-item identifier, whatever your business process is, but you see that we have one entry here for each of these items. At the complete-charge step, we could now take this list of two locked items as input, perform the reduce operation on it to determine the total cost that should be charged, and then complete the charge. Any questions on that? Looks like we're tracking. Alright, so again, that's dynamic parallelism, or the map state.

Now I'm gonna show you the third category, as promised earlier: function-native concurrency, or parallelism; either term applies here, based on whatever paradigm your language has. You do this whenever a single task in your workflow receives a single input, and that task generates the collection, performs the operations, performs the reducing operation, and returns the reduced result. What do we mean by this? An example is traversing an S3 bucket and prefix, looking for items that match a pattern, whether it's upload date and time, media type, file name, size, permissions, whatever you have: any aspect that you can query. You're essentially checking that bucket, so the input is an S3 bucket and a prefix, and the output is the number of items, or a collection of items, or a DynamoDB table that contains information about the items that matched. Importantly, here, errors in one item can impact the entire task, sometimes even to the point that the entire task fails to complete. "Super awesome info, thanks," says Majestic Pears. Thank you for being here, and for the compliment. I'm happy to put it out to you again: if you have questions, please raise them; I want to get them answered for you. But errors in one item as you're working through can impact the entire execution of the task; maybe it's an all-or-none type of task, so it's even more tightly bound than dynamic parallelism.

Your main limitations on using this type of concurrency or parallelism are just the limitations around execution in the AWS Lambda runtime environment. Your functions can't execute for more than 15 minutes; if you can't get your task completed in that time, this won't work for you, whereas you might be able to do that with standard workflows in AWS Step Functions, which can run for up to a year. You're limited on memory: 3 gigabytes is the max, so if you can't fit all of your processing inside 3 gigabytes, that won't work for you either; disk space, things like that. You have to be able to execute this inside your language's runtime. Also, if your language isn't very good at parallelism or concurrency, or is sort of awkward around it, then you may not want to do this, but those are more like personal concerns about how you feel about the runtime.

So let me jump in here and show you what this looks like in code. It's very similar to our parallel invocation, and in fact these could be used together. You could have a parallel invocation of tasks where each task generates a collection of items and performs a reducing operation on it. You could even use all three together, where you have all these workflows coming in, part of the workflow generates the collection of items, and then individual tasks operate on the items. All of these things are here for you.

As I get into this, Majestic Pears asks: "How would you handle multiple error handling paths? What if each customer wants errors handled differently?" I'm not sure I understand what you mean by customer here. If you mean consumer, like each executor on the item... oh, you're probably referring back to the e-commerce example, where we talked about placing an order in a cart and executing the order. Multiple error handling paths will be the subject of a future episode; I'm not just kicking that down the road for you, but there's a lot of "it depends" around that, and you'll handle them differently in each paradigm we've discussed here. When you look at parallel invocations, you're handling that error in a simpler fashion. You can use retry logic that's built into AWS Step Functions, but you sort of only have the one error at a time, so it's a little easier to reason about. Quick side note here: I really recommend that you throw all of your errors out of your code, up into AWS Step Functions, so that it can handle the retry logic for you. That's a better practice: make your function simpler and simpler and simpler, rather than rewriting that type of code. If you're doing dynamic parallelism and the map state, it really depends. You can reassemble those errors at the end, and it depends on how you define an error. Is an item being out of stock an error, or is handling it by putting it into the out-of-stock queue and returning success? I would argue it's success, just of a different kind. So again, a lot of this is an "it depends" answer.
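That "throw errors up into Step Functions" recommendation looks roughly like this in Amazon States Language. The function name, the retryable error list, and the failure-queue state are illustrative, not part of the episode's demo code:

```json
{
  "CompleteCharge": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke",
    "Parameters": { "FunctionName": "complete-charge" },
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.error",
        "Next": "SendToFailureQueue"
      }
    ],
    "End": true
  }
}
```

The function just raises its exception; the `Retry` block handles transient failures with exponential backoff, and anything that exhausts the retries falls through to the `Catch` branch, keeping the function code itself free of retry loops.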
I'm sorry. Finally, for function-native concurrency, I would recommend that you handle errors in accordance with your language's error-handling paradigm, and after that consideration I would still throw them back out to the step function to handle. If each customer wants errors handled differently, that's where you talk about composition in workflows, and I would say you can create individual workflows for each of those types of customer wants and route between them based on fields. Okay, hope that answered your question; thanks.

Good, so, function-native concurrency. The example that I've got here is, again, one that I've used in the App 2025 program. The things we need to look at are our step function and also this do-stuff function. I've written it in Go; Go is known for its concurrency. It's a simple implementation of our example. We define an input type for our Lambda function, which is that S3 bucket and prefix, and our output is just gonna be the number of operations performed. Then our handler gets us a new S3 client and lists the objects. This is not production, y'all: I do not handle pagination. If you have more than a thousand items, with the V2 ListObjects you're still gonna have to handle pagination. I don't do that here because I pre-rigged the data; this is a very simplistic example. So we list all the objects that satisfy the criteria, check for errors as always, and then this is our language-specific concurrency. If you're not familiar with Go, this essentially creates a wait group, and for each thing that matches our pattern we kick off a goroutine, which is a lightweight function. First we say, hey, we've got one more task to complete; we initiate that task; when the task finishes we call Done, so it removes one from the counter; and then down here we just wait for the counter to reach zero. When it reaches zero, we create our return object and pass it back.
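The WaitGroup pattern just described can be sketched in Go like this. To keep it runnable without AWS credentials, a plain slice of keys stands in for the S3 object listing, and the names (`processAll`, the key format) are made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processAll fans out one goroutine per key, waits for all of them to
// finish, and returns the number of operations performed. This mirrors
// the wait-group pattern from the demo, with the S3 listing replaced
// by a plain slice of keys.
func processAll(keys []string) int64 {
	var wg sync.WaitGroup
	var ops int64
	for _, key := range keys {
		wg.Add(1) // one more task to complete
		go func(k string) {
			defer wg.Done() // remove one from the counter when finished
			// ... the real per-object work would go here ...
			_ = k
			atomic.AddInt64(&ops, 1)
		}(key)
	}
	wg.Wait() // block until the counter reaches zero
	return ops
}

func main() {
	// Stand-in for the 400 objects under the prefix in the demo bucket.
	keys := make([]string, 400)
	for i := range keys {
		keys[i] = fmt.Sprintf("rob-stuff/object-%03d", i)
	}
	fmt.Println("operations performed:", processAll(keys)) // prints: operations performed: 400
}
```

Note that, as in the demo, this fans out one goroutine per object with no cap; for a large real listing you'd bound the fan-out (for example with a buffered-channel semaphore) and feed in keys page by page from ListObjectsV2.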
For this one, I've created a file and populated some test data for us. The test data just passes in a bucket that I created in the stack, and a prefix. What I did was put 400 objects in the root of that bucket and 400 objects inside this rob-stuff folder in the bucket, so whenever we list all of these items, we should only list the ones inside the prefix, and the number of operations performed should be 400. So, again: function native, start the execution, with that sample file we just looked at. That was the test data, I believe. Yeah, perfect. We execute it, and we return to the AWS Management Console over here for a look at this state machine. This one, yeah, I had some errors earlier; this is why I don't tempt the demo spirits. We come over here and we see we just had the simplistic do-stuff workflow that we built, and it did stuff: it took the input that we gave it, the bucket and the prefix, passed it in, and returned this payload: operations performed, four hundred. You can see this is very simple; you don't get the idea that anything parallel is happening here, and in some cases you may not even know. I just say, here's an S3 bucket and a prefix, go do some magic. That magic could be some sort of big-data processing that reads all the objects underneath that prefix. I don't know what's there, but you've black-boxed that function for me so that it's opaque, and then it returns a magic number back to me. The point is, you can't really tell from here that anything parallel is going on, because it's going on inside this single invocation. So that's our function-native invocation, and that's it.

Hello, KayBen, you're showing up right when I finished everything. This has been an unusually short episode at 30 minutes; I could go over the whole thing again if you want, just like a little private episode. Any questions on the ones that we've covered? Again, those three patterns were parallel invocations, dynamic parallelism (or the map state), and function-native concurrency. The three examples we used were: scheduled tasks that identify independent resources for remediation; a checkout or order fulfillment system that checks individual items, charges you for the ones that ship, and puts the ones that don't ship into a queue; and an S3 bucket and prefix where some sort of operation is performed over all of the objects in that prefix, returning some sort of aggregate value.

Again, for the links, I'll drop these in here one last time for those of you that have made it. The GitHub repo is here. I haven't put these examples up yet; I'm gonna clean them up, make sure there's no weird data in there that would cause you to not be able to deploy them in your account, and then I'll push them up. The other thing I would suggest: on Thursdays, if you're interested in building more with these types of services, like Amazon EventBridge, AWS Step Functions, Amazon SNS, and Amazon SQS, join me at the same time for App 2025. We're going over how to architect an entire application with very little code. It's not no-code, but it's service-integration heavy, so these things are very resilient, because the underlying APIs don't really change; you don't have issues with dependencies, with operating system patching, all of that stuff. It's sort of next-level serverless, so I really suggest you join for that. The link there is to my Twitch channel; it's also twitch.tv slash Rob Sutter. I don't see any comments in the chat; I'm happy to stick around if you have any, and I'll be here for a little while. I'm gonna go ahead and end the stream here. I do want to say thank you to everybody for joining today. Again, you've got my Twitch channel up there, twitch.tv slash Rob Sutter, or Rob Sutter TV. My Twitter is there too; if you have questions, feel free to DM me, my DMs are open. It can take me a day, usually, to figure out that somebody I don't follow has sent me a DM, so be patient, but I do respond. Again, I just want to thank you all for joining us, and thanks to my colleagues who are in here moderating.

"What happens if an error occurs within a parallel task in a step function?" from Bogart. Yeah, so you can implement an error-handling workflow specific for that inside the iterator, or you can place it back into the state and handle it outside of there. It depends on how you define an error. In the example that we had, we talked about this: if an item in an order has gone out of stock between when you said "send the order" and when we started fulfilling it, is that an error? It's not ideal, but it's up to you and your business process to determine whether it's an error. I would not call that an error in the workflow; I would say it's a branch in the business process, and so I would kick it off to that branch. Now, if you have an execution error, like you got bad code in there, or a Lambda function call failed to execute, things like that, that's when you want to use robust error handling inside your Step Functions workflow to catch those and then either retry or send them to a queue. You definitely want to explicitly manage for those; that's part of my recommendation on sending errors back out. Where you put that handling largely depends on whether the error impacts the collection of items that you're iterating over, or only that item. Can you handle the error inside the workflow, go on with your life, and return some sort of result, or do you need to fail the entire iterator? That's something we'll cover in a future session, but there's a lot of detail to dig into there, so thanks for that question.

I can't read your username (Holli f UK?), but thank you, thanks for joining us. I'm gonna end the stream here, and I'm gonna stick around; if you have any questions, please keep them coming, I'm still here in the channel chatting. Thank you again for joining us, and I hope you'll join us next week, and join me in my channel on Thursday at the same time for App 2025. We'll be covering standard workflows and the callback pattern for placing human interactions into your Step Functions standard workflows. Thanks, everybody!
Info
Channel: Serverless Land
Views: 6,161
Id: At5mw8T2riY
Length: 34min 57sec (2097 seconds)
Published: Tue Apr 21 2020