Apache NiFi Anti-Patterns Part 3 - Load Balancing

Captions
Hey everybody, welcome back for the third installment of this series on Apache NiFi anti-patterns. I'm Mark Payne, one of the co-creators of Apache NiFi, and today I'm going to take some time to talk about load balancing data across your cluster.

Before we jump into exactly how NiFi handles this, and when you should and shouldn't do it, I want to take a step back and talk about some of the general concepts of load balancing. What exactly do we mean when we talk about load balancing data across the cluster? In general, we're talking about taking the data that's on one node in your NiFi cluster and spreading it across the entire cluster, so that you can parallelize the processing of that data. A lot of big data and streaming frameworks, such as Apache Storm, Spark Streaming, and MapReduce, handle this for you behind the scenes, so a lot of newcomers to NiFi are shocked when they find out that NiFi doesn't automatically distribute data between the nodes. So let's look at why that is.

We've seen several benchmarks published where people take a simple algorithm like PageRank, run it on a laptop, and then run it on, say, a 100-node complex event processing engine, and what they see is that they can actually process the data faster on a lowly laptop than on that huge infrastructure. On the other hand, if you're going to train a deep learning model against petabytes of data, good luck doing that on your laptop. So there has to be somewhere you draw the line when you're deciding whether to distribute data across a cluster or just process it on the node that already has it.

Where do we draw that line? What it really boils down to is this: is the efficiency gained by parallelizing the processing of the data enough to offset the cost of pushing that data to another node? If it is, then you should go ahead and load balance; push that data across your cluster. If it's not, then don't bother; just process the data where you already have it. In NiFi, we give you the ability to make that decision on a case-by-case basis. In your data flow, you can choose where you want to start pushing data across the cluster and where you don't. By default, it's not going to push data anywhere, and for most data flows you probably don't need to distribute the data across the cluster at all. We'll talk about where we do and don't want to distribute data, and how often, as we look at how to use the mechanisms NiFi provides. So let's take a look.

Alright, so in the UI we've got a flow here that's pretty straightforward. It starts with a ConsumeKafkaRecord processor, so we're going to get some purchase order data from Kafka. That purchase order data has a zip code in it but no city or state, so we're going to use LookupRecord to perform some enrichment, grab the city and state, and put them into the record itself. We're then going to partition by the city that we just looked up and apply some routing logic on that city: if the city happens to be one of the cities we're interested in, we're going to publish the data to a different Kafka topic, in this case special purchase orders.
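To make that enrichment step concrete, here is a rough sketch in Python of what the LookupRecord step is doing conceptually. The field names and the zip-to-city mapping are illustrative assumptions, not taken from the video; in NiFi itself this is configured declaratively, with no code written:

    # Conceptual sketch of the zip -> city/state enrichment (illustrative only).
    zip_lookup = {
        "30301": ("Atlanta", "GA"),   # hypothetical lookup table entries
        "10001": ("New York", "NY"),
    }

    def enrich(record: dict) -> dict:
        """Add city/state to a purchase-order record based on its zip code."""
        city, state = zip_lookup.get(record["zip"], (None, None))
        return {**record, "city": city, "state": state}

    order = {"orderId": 1234, "zip": "30301"}   # record as it arrives from Kafka
    print(enrich(order))                        # record after enrichment, ready to partition by city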
It's a pretty straightforward flow, but there's quite a bit of processing happening here. What we can do is configure this connection, and in the Settings tab there's a Load Balance Strategy. By default it's set to "Do not load balance," so whatever data shows up on a node is going to stay on that node. We've got a few different strategies we can use to move data around the cluster, but for the vast majority of cases what we actually want is round robin, which simply spreads the data evenly across the cluster. I'm not going to get too much into the other load balancing strategies; you can read the documentation on those if you're interested.

The first anti-pattern I'll point out here is load balance compression. Typically, when you're moving data around a cluster, you don't want to compress it, because you've usually got a really high-throughput link between the nodes in a given cluster. That's not always the case, but it typically is. We often see people choose to compress attributes and content when they would really benefit from not performing any compression at all, so we don't want to pay the cost of that compression and decompression. In this case, we'll choose round robin and "Do not compress" and hit Apply, and now any data that shows up in this connection is automatically going to be spread evenly across the cluster. We can see that load balancing is happening because of this icon here on the connection; the other connections don't have that icon. It's an indicator that load balancing is configured, and if we hover over it, it tells us that we're using the round robin strategy with no compression at all. So it's a pretty simple configuration change to enable this capability.
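The same change can also be made programmatically through NiFi's REST API rather than the UI. Below is a minimal sketch using Python and the requests library; the host, port, and connection ID are placeholders, and authentication and error handling are omitted:

    import requests

    NIFI = "http://localhost:8080/nifi-api"   # placeholder NiFi instance
    conn_id = "your-connection-id"            # placeholder connection UUID

    # Fetch the current connection entity so we have its revision and component.
    entity = requests.get(f"{NIFI}/connections/{conn_id}").json()

    # Set round-robin load balancing with no compression on the connection.
    entity["component"]["loadBalanceStrategy"] = "ROUND_ROBIN"
    entity["component"]["loadBalanceCompression"] = "DO_NOT_COMPRESS"

    # Submit the update; the current revision must be echoed back.
    requests.put(
        f"{NIFI}/connections/{conn_id}",
        json={"revision": entity["revision"], "component": entity["component"]},
    )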
But what we often see people do is build a flow that looks like this: they've decided they want to pull data into the cluster and do a lot of processing, and they want to make sure the data stays spread across all the nodes, so they configure the first connection for round robin load balancing, and then they choose round robin for the second connection, and the third, and the fourth. Conceptually it makes sense, but what ends up happening is that once we've pulled the data in from Kafka, we distribute it across all the nodes in the cluster, then perform the enrichment step, and then immediately start moving that data all throughout the cluster again. We then partition the data and move it all throughout the cluster yet again. Once the data has already been evenly distributed across the cluster by the first connection, there's really no need to keep pushing it between the nodes, because it's going to stay on whatever node already has it. So typically what we want to do is set "Do not load balance" on all of these connections except the first one. Now, when we pull data into the cluster, we distribute it across the nodes, and then each node, in parallel, processes the share of the data that it owns.

So this looks pretty good once we turn off load balancing here, but we can actually do a lot better, because if we consider the source of the data, it's Apache Kafka, and Kafka already gives us queuing semantics. That means every node in the cluster is already working to pull its own share of the data so that it can process it; there's no need to pull the data and then distribute it across the cluster, because each node already has its own share. Now, there is a caveat: if we have more nodes in our NiFi cluster than we have partitions on the Kafka topic, we may end up in a situation where the data doesn't get evenly spread across all of the nodes, and in that scenario we may actually want to round robin. But the vast majority of the time we don't want to round robin after pulling data from Kafka. We'll say "Do not load balance," because Kafka's queuing semantics mean the data is already load balanced across the cluster. And of course that's not specific to Kafka: if we were using ConsumeJMS or any pub/sub mechanism that gives us queuing semantics, we'd already get the data distributed across the cluster for free.

So if load balancing doesn't make sense in this situation, where does it make sense? Generally, any source that doesn't offer queuing semantics is a good candidate for load balancing the data. Take the same flow that gets these purchase orders, does some enrichment to add the city, partitions by that city, and does the routing, but change it so that we're pulling data from a GCS (Google Cloud Storage) bucket. We no longer have those queuing semantics, and in this case the ListGCSBucket processor can only run on the primary node, which means the listing arrives on just that one node. So what we can do is perform that listing, fetch the data, and then use round robin to evenly distribute the data across the cluster. If we were to try to run ListGCSBucket on all nodes in the cluster, every node would produce the same listing, pull the same data, and we'd end up with a duplicate of everything on each node; if you go to configure this processor, you'll see that its execution is restricted to the primary node to prevent exactly that. So this is what we'll typically see: a ListGCSBucket followed by a FetchGCSObject, and then the round robin load balancing strategy to spread the data across the cluster.

But we can do a lot better than that, because at this point we've already used that one node to pull the data from GCS; it's already done a lot of the hard work, and then we read that data and push it to another node. So rather than using this approach, let's configure this connection to say "Do not load balance" and instead load balance one step earlier, because the way this processor works, it puts out a FlowFile for every object it finds in the GCS bucket. If we're just a little bit smarter about where we configure the load balancing, we distribute the listing itself across the entire cluster. We don't have to pull the data onto a single node and then push the data around; we push just the listing across the cluster, and now every single node in our cluster gets its own share of that listing, and each node can pull its portion of the data in parallel and handle all of the processing.
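A quick back-of-the-envelope comparison shows why balancing the listing beats balancing the fetched content. The numbers below are made up for illustration; the key assumption, consistent with how list-style processors work, is that a listing FlowFile carries only attributes (bucket, object name, and so on) rather than the object's bytes:

    # Illustrative comparison of intra-cluster traffic for the two patterns.
    nodes = 10
    objects = 10_000
    avg_object_mb = 50                  # assumed average object size
    listing_entry_kb = 1                # a listing FlowFile is just attributes

    total_data_mb = objects * avg_object_mb

    # Pattern 1: fetch everything onto one node, then round-robin the content.
    # Roughly (nodes - 1) / nodes of the content must cross the network a second time.
    fetch_then_balance_mb = total_data_mb * (nodes - 1) / nodes

    # Pattern 2: round-robin the listing, then every node fetches its own share.
    balance_listing_mb = objects * listing_entry_kb / 1024

    print(f"content re-shipped across the cluster: {fetch_then_balance_mb:,.0f} MB")
    print(f"listing shipped across the cluster:    {balance_listing_mb:,.1f} MB")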
This gives us a very scalable, very high-performance approach. In fact, this is the same pattern that I used recently in a blog post where I explored NiFi's scaling and performance capabilities. I used Google Compute Engine to scale NiFi out to a 100-node cluster, then a 500-node cluster, and then a 1,000-node cluster, and we were able to see that even at a thousand nodes, NiFi handled that scale quite well; we could probably have gone quite a bit bigger if we'd had the resources. We were also able to scale up a single node to the point that it could easily handle a million events per second, so in that 1,000-node cluster we could process up to a billion events per second doing something very similar to this. In that case it was log data, but it was a very similar type of processing: optionally decompressing the data, doing some enrichment, doing some routing, and eventually pushing the data back to GCS. So clearly we can hit some really high numbers in terms of performance and scalability, but we couldn't do that if we came through here and configured load balancing on every single one of these connections.

To hit those kinds of performance and scalability numbers, we definitely need to make sure we're following best practices. For load balancing, the biggest thing to consider is that we're not load balancing constantly throughout the flow; we load balance only where it makes sense, and we really take into account where in the flow we perform that load balancing, so that we minimize the amount of data we actually push across the cluster.

And that's all I've got for you today. Thanks for watching, guys. Cheers.
Info
Channel: NiFi Notes
Views: 1,420
Keywords: nifi, anti-patterns, load balancing, cluster
Id: by9P0Zi8Dk8
Length: 14min 48sec (888 seconds)
Published: Wed Sep 09 2020