Apache NiFi Anti-Patterns Part 4 - Scheduling

Captions
Scheduling, threading, and concurrent tasks: that's what we're talking about today. This is one of the most critical topics for a NiFi user to understand, but a lot of people don't fully grasp all the concepts; even people who have been using NiFi for years get this wrong. Get it wrong and your performance will suffer. Get it very wrong and even the stability of your cluster can suffer. But if you nail this, it can make the difference between processing a thousand events per second and processing a hundred thousand, even a million, events per second on each node in your cluster. I'm Mark Payne, and this is part four of my series on Apache NiFi anti-patterns.

NiFi offers quite a few knobs that you can turn to tune your data flow, but one of the most important is the size of the thread pool. To configure it, go to the hamburger menu in the top right corner and then down to Controller Settings. The settings on this screen affect all of the users and all of the flows running on the system, so they require special permissions; if you're running in a secure, productionized environment, you may need help from an administrator to change them.

On this screen you're given two thread pools to configure: the Event Driven thread pool and the Timer Driven thread pool. The Event Driven thread pool is pretty much unused at this point, so we won't cover it beyond saying you should probably leave it at one thread; it's likely to be removed in a future version. The Timer Driven thread pool, though, is really important. It is essentially telling NiFi how many processors you want it to run at any given time. Set this value far too low and you end up under-utilizing your resources, because you're not scheduling enough work to happen concurrently. Set it far too high and you end up trying to do too many things at once and overwhelm the system, even to the point that it may not be able to complete all of the background tasks it needs to, and that can lead to system instability.

That begs the obvious question: what value should we set it to? In general, we recommend somewhere between two and four times the number of CPU cores on the system. On a large system that's quite a wide range: with 64 cores we're talking about anywhere from 128 to 256 threads, so there's plenty of room to narrow down what's best for the system, and we'll walk through that over the next few minutes. It's also important to note that this is a per-node setting: if you have a 10-node cluster, each node with 64 cores, you set each node to somewhere between 128 and 256, not ten times that number. To find what makes the most sense for your particular system, I typically suggest starting with two times the number of cores and then increasing the size of the thread pool gradually as necessary. So let's talk about when we actually need to increase it.
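First, to make that sizing rule of thumb concrete, here's a minimal Python sketch of the math; the 2x and 4x multipliers are just the guidance above, and `os.cpu_count()` is the standard way to ask how many cores the node has:

```python
import os

def timer_driven_pool_range(cores=None):
    """Recommended starting range for the Timer Driven thread pool on ONE node,
    using the 2x-4x-cores rule of thumb (this is guidance, not a NiFi API)."""
    cores = cores or os.cpu_count() or 1
    return 2 * cores, 4 * cores

low, high = timer_driven_pool_range(64)
print(f"64-core node: start near {low} threads, tune upward toward {high}")
# 64-core node: start near 128 threads, tune upward toward 256
```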
I've got a simple data flow here: I generate some random data, I compress that data, and then I just destroy it. A really simple, straightforward flow. If I start this flow, we see pretty much right away that the CompressContent processor is our bottleneck. We see this because its incoming connection has turned red, a lot of data has queued up, and its output connection has no backlog, so nothing downstream is causing back pressure that would prevent the processor from running; it's simply not able to keep up with the data rate.

What we'll sometimes see is that a user looks at this and says, okay, I need to give it more threads, that will give me better throughput, and they go adjust the size of their thread pool. But that really won't help much here. If we look in the corner of the UI, the number of active threads in our data flow is only two, and if we go to Controller Settings we see that the thread pool actually has 10 threads in it. We can use up to 10 threads, but we're only using one or two at any given time, so increasing the size of the pool isn't going to buy us much.

Instead of adjusting the thread pool, let's look at the configuration of the CompressContent processor. In the Configure dialog, on the Scheduling tab, there's a setting for the number of Concurrent Tasks: the maximum number of threads from that thread pool that this one processor is allowed to use. The default is one, and that's what it's set to here, so let's set it to two and see if that makes a difference. We start the processor and give it a little while; we now have two threads running, but we've still got quite a backlog, holding steady around ten thousand flow files. At this point we could give it three of the threads in the pool, and keep increasing until the processor is able to keep up with the data flow.

But notice that this processor is also handling a lot of really small flow files. If we configure the processor again, there's a setting called Run Duration. The default is zero milliseconds, which gives the lowest latency, but in this case we know we're going to process a large number of flow files, not a small volume of them, so we can increase it to 25 milliseconds. For those familiar with the concept, what we're configuring here is micro-batching: do we want every flow file processed and transferred out of this processor as soon as possible, or do we want to batch together some of that processing for a period of time before transferring it on? Here I'm saying: let's change the batch duration to 25 milliseconds and see if that helps. We click Apply and start the processor, and just like that, the entire backlog has been processed and the processor has no problem keeping up with the throughput.
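If you'd rather script this kind of change than click through the UI, the same scheduling settings are exposed through NiFi's REST API. The sketch below is illustrative only: it assumes an unsecured NiFi at localhost:8080, a placeholder processor ID, and that the processor is stopped before its configuration is updated; the field names (`concurrentlySchedulableTaskCount`, `runDurationMillis`) match the processor configuration DTO in recent NiFi versions, but verify them against your version's API docs.

```python
import requests

NIFI = "http://localhost:8080/nifi-api"   # assumed local, unsecured instance
PROC_ID = "REPLACE-WITH-PROCESSOR-UUID"   # hypothetical placeholder

# Fetch the current entity: every update must echo back the current revision.
entity = requests.get(f"{NIFI}/processors/{PROC_ID}").json()

update = {
    "revision": entity["revision"],
    "component": {
        "id": PROC_ID,
        "config": {
            "concurrentlySchedulableTaskCount": 2,  # "Concurrent Tasks" in the UI
            "runDurationMillis": 25,                # "Run Duration" (micro-batching)
        },
    },
}
resp = requests.put(f"{NIFI}/processors/{PROC_ID}", json=update)
resp.raise_for_status()
```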
Now, if it were still a bottleneck, we would go ahead and give it three threads and see if that helped, or maybe four, and slowly step that number up; but in this case a 25 millisecond run duration was all we really needed. That's a common pattern: whenever you're configuring processors that run over a very large number of small flow files, you really want a run duration larger than zero milliseconds whenever it's available. Not all processors support it, for reasons that depend on the underlying implementation. It's also worth noting that now that we have no backlog, we haven't just increased the throughput, we've also decreased the latency significantly. The Run Duration slider, with lower latency on the left and higher throughput on the right, assumes the processor can keep up with the rate of data coming into it. If it can't keep up, adding this batch duration will actually lower your latency as well, because it prevents data from sitting in the queue for a long period of time waiting to be processed.

Now, if the processor couldn't keep up with two concurrent tasks, of course I could set it to three or four. But it's very important that we don't go overboard with the number of concurrent tasks, either. In fact, it's really rare that you want to go above, say, 12 concurrent tasks, because once we start using a lot of concurrent tasks, all those threads end up vying for write access to the queue. If a lot of threads are constantly pulling from the queue at the same time this processor's threads are writing to it, you get a lot of lock contention. It's just like a door: if you suddenly try to squeeze 200 people through it, nobody actually gets through. The same thing happens if we schedule this processor with 200 concurrent tasks. Yet this is what I see all the time: I'll look at a user's flow, and one processor has 100 concurrent tasks, the next has 200, the next has 200, and the processors perform really poorly because they're all vying over a few nanoseconds of CPU time to actually run their tasks. To get the best performance, start with a small value, typically one concurrent task, and increase to two or three as needed. So we'll set this back down to two concurrent tasks, and again it has absolutely no problem keeping up.

We've now seen a huge performance improvement without changing the size of the thread pool at all. But what if we were using all of the threads in our pool? What if we kept seeing 10 active threads here; should we increase the size of the thread pool then? Well, it depends. If the CPU is already over-utilized, adding threads isn't going to help us; in fact, it will probably hurt us. So we really need to check the CPU utilization, or the CPU load.
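The active-thread counter the UI shows in the corner is also available over the REST API, which makes it easy to watch while you tune. A minimal sketch, again assuming an unsecured NiFi at localhost:8080; the `flow/status` endpoint and `activeThreadCount` field are part of the standard API, though it's worth confirming against your version:

```python
import requests

NIFI = "http://localhost:8080/nifi-api"  # assumed local, unsecured instance

status = requests.get(f"{NIFI}/flow/status").json()["controllerStatus"]
print(f"Active threads: {status['activeThreadCount']}")
# If this stays well below the Timer Driven pool size, a bigger pool won't
# help; tune the busy processor's Concurrent Tasks / Run Duration instead.
```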
There are different tools we can use to do this: on Linux you can look at the top command, on Windows there's Task Manager, and on macOS there's Activity Monitor. But the NiFi UI actually provides this information as well. From the hamburger menu, in a clustered environment you can go to the Cluster dialog, which gives you information about each node in the cluster. In this case I'm not using a cluster, so I'll go to the Summary table, come down to System Diagnostics, and then over to the System tab. Here we can see that we have 12 cores available, and that the core load average is 6.13. That's the one-minute load average; it's telling us that over the last minute, on average, we asked the CPU to do about six things at a time, and we could have asked it to do up to 12. So we can in fact increase the size of our thread pool, because we do have more CPU cycles available to us.

But we want to be careful here and only increase the size by a small amount, say 20 or 30 percent. We don't want to say, well, we've got a load average of 6 and we can go up to 12, so let's double it. We want to grow the thread pool slowly, because when we give it more threads, the processors that use them may be much more CPU-intensive than the ones that happened to predominate over the last minute. So rather than immediately jumping to a huge increase in the size of that thread pool, we do it in a more controlled manner.

It's also important to note that we typically don't want our one-minute load average to reach the number of available cores. If the load average equals the core count, we're asking the CPU to do exactly as much as it's able to handle, which might sound like a good thing, but it means the CPU is doing absolutely as much as it possibly can. If we have a cluster of, say, 10 nodes and we lose one node, the other nine now have more data to process, which means they'll be using more CPU cycles. We want to avoid the point where losing a node in the cluster would overwhelm all of the other nodes, so we keep the load average at some value less than the number of available cores. Typically I'd say you want the one-minute load average somewhere around 70 percent of the number of available cores. You might decide that for your situation 50 percent of the available cores is the most you're willing to use, or that you're willing to go to 80 or 90 percent, depending on your tolerance for risk.
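Here's a small sketch of that 70 percent rule of thumb. It reads the one-minute load average from the operating system, so it has to run on the NiFi node itself (Linux or macOS), and the 0.7 target is just the guidance above, not a hard limit:

```python
import os

cores = os.cpu_count() or 1
load_1min, _, _ = os.getloadavg()  # POSIX only; 1-, 5-, and 15-minute averages

budget = 0.7 * cores               # ~70% of cores, per the guidance above
print(f"1-minute load {load_1min:.2f} vs budget {budget:.1f} on {cores} cores")
if load_1min >= budget:
    print("Near the CPU budget: don't grow the thread pool; look at flow design.")
else:
    print("Headroom available: grow the pool gradually, maybe 20-30% at a time.")
```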
Situations can arise, and sometimes do, in which we've set the size of the thread pool effectively, our CPU is far from fully utilized, and yet a processor still can't keep up with the data rate. We increase the number of concurrent tasks, we increase the run duration, and nothing really changes: it uses the same amount of CPU and the throughput stays the same, even though we've gone from six to maybe eight or ten concurrent tasks. When this happens, a lot of users are very quick to say, okay, ten concurrent tasks is clearly not enough, let's use 100 or 200 concurrent tasks, and of course we have to make sure our thread pool can handle that, so let's set our thread pool to 2,000 threads. Please don't do this. We've already talked about some of the problems this can cause, and I can guarantee you're not going to get the results you want.

When we end up in a situation like this, what it's really indicating is that our bottleneck is not the CPU. Oftentimes the bottleneck is actually the disk. If the content, flowfile, and provenance repositories are all writing to a single spinning disk, that disk can become the bottleneck pretty quickly. So if you're using spinning disks, I'd certainly recommend a separate disk for each of the content, flowfile, and provenance repositories; ideally, use more than one disk for the content and provenance repositories, or better yet, use an SSD or even an NVMe drive if you have really high volumes of data. The NiFi admin guide walks you through how to configure those different repositories to use separate disks, and to use multiple disks.
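For reference, those repository locations live in nifi.properties. A sketch of what splitting them across disks might look like; the property names are the documented ones, but the mount points and the suffixes on the content repository entries are placeholders you would choose yourself:

```properties
# FlowFile repository on its own disk
nifi.flowfile.repository.directory=/disk1/flowfile_repository

# Content repository spread across two disks; the suffix after "directory."
# is an arbitrary name for each location
nifi.content.repository.directory.content1=/disk2/content_repository
nifi.content.repository.directory.content2=/disk3/content_repository

# Provenance repository on its own disk
nifi.provenance.repository.directory.default=/disk4/provenance_repository
```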
The other thing we need to consider is the design of the flow itself. It's really important to think about design as we're building out a flow. If we build a flow that takes in a huge number of really small flow files, we get to a point where a lot of garbage collection occurs. If back pressure isn't configured well, we can certainly get into a case with a lot of swapping, writing those flow files to disk and then serializing and deserializing them over and over again, and that can really cause performance problems. We can even reach the point where the provenance repository starts to apply back pressure because it can't keep up with the number of provenance events. We could certainly add more hardware, more nodes to our cluster, but the design of the flow is actually far more important to performance than the hardware we run on. We saw in part one of this series that going from a flow that uses a very large number of small flow files to one that uses record-oriented processors can often yield at least an order of magnitude better throughput, and that's not at all uncommon. This really matters because for most people it's hard to grow their hardware by an order of magnitude: users with a five-node NiFi cluster typically don't have the resources to scale out to 50 nodes. If we're just careful when designing our flows, we can avoid a lot of the pitfalls around garbage collection, back pressure from the provenance repository, swapping, and the lock contention and other problems we inherently see with very large numbers of small flow files.

As I said at the beginning, NiFi offers quite a few knobs you can turn, and today we discussed some of the most important ones. Please keep in mind that as you start to adjust these settings, start small and adjust gradually. If a small change helped but wasn't enough, try another small change, maybe slightly bigger. Going from one concurrent task to two improved performance but not quite enough? Try three or four, or look at the run schedule. We want to avoid going from one or two concurrent tasks to ten or a hundred; it's rarely going to give us what we're looking for. Yet time and time again I've seen users running on VMs with four or eight cores who set the size of the thread pool to 2,500. If your system can do four or eight things at a time, I promise it's not going to behave the way you want when you ask it to do 2,500 things at a time. So start small, make slow, small adjustments, and you'll get to where you need to be. Thanks a lot for watching, guys. Take care.
Info
Channel: NiFi Notes
Views: 1,907
Keywords: apache, nifi, anti-patterns, performance, scheduling
Id: pZq0EbfDBy4
Length: 22min 42sec (1362 seconds)
Published: Thu Sep 24 2020