A Simple Apache Nifi 3 Node Docker Cluster | Apache Nifi | Part 13

Video Statistics and Information

Captions
Hello and welcome back to my next video on Apache NiFi. In this video we are going to set up a quick-and-dirty Apache NiFi cluster: a three-node cluster running in Docker, driven by a Docker Compose file. Before we get started, take a second to hit the subscribe button down below, and the notification bell if you want to be notified of new videos.

The first thing to do is look at the Docker Compose file we'll be using. I've made modifications to an existing one, so let's open it up and walk through it. It's Compose file version 3 and defines a few services. The first is the ZooKeeper service, and after that come the nifi_01, nifi_02, and nifi_03 services, each identical except for their ports: each one maps a different host port to port 8080 inside the container. I've also added a network called nifi_net that all three NiFi containers join. Down in the environment section for each NiFi service there are a few settings we need: the web HTTP port is set to 8080; "cluster is node" is set to true, meaning this instance is part of a cluster; the cluster node protocol port is set for node-to-node communication; the ZooKeeper connect string points at the zookeeper container on port 2181, which is why it matches the hostname of the ZooKeeper service above; and the election max wait is set to one minute, which bounds how long the nodes wait while electing a leader. At the bottom there's a short networks section that creates nifi_net using the bridge driver.
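Here is a minimal sketch of a Compose file along the lines of what I just described. The service names, the zookeeper image tag, the host-side ports (8091-8093), and the protocol port 8082 are placeholders rather than the exact values from my file, and the environment variable names are the ones the standard apache/nifi image documents, so double-check them against the image you're using:

```yaml
# Sketch only -- names, tags, and ports are assumptions; adjust to your setup.
version: "3"
services:
  zookeeper:
    hostname: zookeeper
    image: zookeeper:3.6
    networks:
      - nifi_net

  nifi_01:
    image: apache/nifi:latest
    ports:
      - "8091:8080"            # host port differs per node, container port stays 8080
    networks:
      - nifi_net
    environment:
      - NIFI_WEB_HTTP_PORT=8080
      - NIFI_CLUSTER_IS_NODE=true
      - NIFI_CLUSTER_NODE_PROTOCOL_PORT=8082
      - NIFI_ZK_CONNECT_STRING=zookeeper:2181   # matches the zookeeper hostname above
      - NIFI_ELECTION_MAX_WAIT=1 min

  # nifi_02 and nifi_03 are identical except for the host-side port mapping
  # (e.g. "8092:8080" and "8093:8080").

networks:
  nifi_net:
    driver: bridge
```

Mapping a different host port to container port 8080 on each service is what lets all three nodes expose their UI on the same machine.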
Let's quit out of the file and start the cluster with docker-compose up. I'm deliberately not adding the -d flag this time, because I want to keep it in the foreground and watch the log output, as you would on a first run. As it kicks off you can see the images are already available, because I've run this before, so they're cached on the system. The logs show ZooKeeper plus nifi_01, nifi_02, and nifi_03; ZooKeeper is already up and running, and each NiFi server is busy expanding its NARs, the same work they do when you start a server in standalone mode. Once that's done they'll hold their election, communicate with each other, and figure out who will lead the cluster, so I'll come back as soon as that's finished.

All right, welcome back. The cluster is done with its election, so we're good to go, and I have access to the web UI, so let's switch over to that. Welcome to our three-node NiFi cluster. A few things in the UI are different. The top-left corner now shows the number of connected nodes out of the total number of nodes in the cluster, currently 3/3, so all of them are connected and working. Over on the far right there's a warning related to heartbeats; that's no big deal, it's just an informational message that will clear up on its own, because the nodes are communicating just fine.

Now let's look at a couple of other things that change in a cluster. In the global menu we now have a Cluster option. Selecting it gives us a list of every node in our cluster. All of our nodes are currently available, and for each one we get its address, which in this case is the container hostname and port rather than an IP address, its active thread count, its current queue size, and its status. The first node in the list is connected, the last one is connected, and the middle one is connected and is also the Primary Node and the Cluster Coordinator, so it has a couple of extra responsibilities. We can also see when each node was started and when its last heartbeat came in; they are constantly sending heartbeats to each other. From this screen we also have the option to disconnect a node from the cluster.

Keep in mind all of these containers are running on the same host, a 32-core machine with 32 GB of RAM, so each node reports 32 cores and roughly the same load, though not the same thread counts, since that depends on what each one is doing. The JVM figures look mostly the same across nodes, but not identical, because each node runs its own JVM in its own container. FlowFile, content, and provenance storage will be effectively identical because they share the same host machine, and the versions match as well. You'll also notice that because the Compose file points at the latest NiFi image, the version shown is 1.12, the most recent image available at the time of recording.
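As a quick aside, you can check that same connected-node count from a terminal as well, since NiFi exposes cluster information through its REST API. This is just a sketch assuming the standard NiFi 1.x REST endpoints on an unsecured demo cluster like this one, with the host port taken from the Compose mapping above:

```bash
# Ask any one of the nodes for a cluster summary (unsecured HTTP, demo setup).
curl -s http://localhost:8091/nifi-api/flow/cluster/summary
# The response includes a clusterSummary object with fields such as
# "connectedNodes" (e.g. "3 / 3") and whether this instance is clustered.
```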
All right, let's get out of here, bring in a quick flow, and look at some of the differences and throughput considerations when running a NiFi node as part of a cluster. I have a flow already created that we can import as a template, so let's select the template, my "Cluster Example", upload it (that was successful), then drag the template component onto the canvas and pick it from the list. Here's the pre-made example we're going to use; let's take a quick look at what it does, because we also have to fix a couple of parts of it.

It starts with a GenerateFlowFile processor that just creates a small flow file containing JSON, taken from the aviation data we've been playing with, and I've set the mime.type attribute to application/json as well, so that takes care of that. Scheduling is already set up; we'll leave that alone. One quick change: on the connection out of this processor I'm putting the load-balance setting back to its default of "Do not load balance", which is what it normally looks like, because we'll play around with that setting later. So this flow generates a JSON flow file, and then a ConvertRecord processor converts the record from JSON to Avro. For that we need a JSON tree reader and a writer as controller services, and, pretty cool, the template import brought the reader and writer in with it; they just need to be turned on and enabled. That takes care of that part, so we have no more issues there. On ConvertRecord we'll just take a look and make sure everything is legitimate, and I'll set the MergeContent and the remaining processors back to the defaults from when I was playing with things. Okay, that all looks good; this is the flow we're going to be working with.

Now there are a couple of things to understand about how it behaves differently. In a cluster we have three nodes, so how does that differ from a regular standalone environment? Let's right-click GenerateFlowFile and start it. It quickly starts producing a lot of flow files, and the processor shows three active threads; the cluster-wide summary at the top shows three active threads as well. That's a little odd, because if we look at the scheduling tab we only have one concurrent task configured. Another thing that's really weird is that the queue shows 30,000 as its limit, even though back pressure on the connection is set to 10,000 objects. How does that work? In a cluster, everything is basically multiplied by the number of nodes you're working with: this GenerateFlowFile is running on every single node, so three copies of the same thing are being produced at the same time, and each node has its own copy of the queue.

Let's stop it and look at the Execution setting. In a standalone environment it doesn't matter, because there are no other nodes, but in a cluster you have two options: All nodes and Primary node. With Primary node, the processor will only run on the primary node, which, if you remember the cluster screen, is the middle node here; if we select that option, this node takes on all of that work and is the only one to process it. Having three nodes also means three different queues are being used, which is why the effective queue limit gets bigger. So let's clear all of the queues, reconfigure this processor to run on the primary node only for now, and leave everything else as it is.

One big difference to see in the UI is that the processor itself now shows a "P" in its top corner, which the other processors don't have, visually indicating that it will run on the primary node only; a tooltip pops up saying the same thing when you hover over the P, which is nice to know. Let's start it back up. Now the queue size only climbs to 10,000, because the primary node is the only one running the processor, so it's the only one whose queue is filling up toward its back-pressure limit. Next let's start ConvertRecord, which is set to execute on all nodes. As we saw, it ran through its backlog quickly, so not too bad, but it wasn't able to drain that queue against the back pressure up top. Let's also start the merge, the compress, and the update processors.
On the route processor at the end I'm doing something slightly different: I'm not sending the output anywhere. It takes the matched and unmatched relationships and just terminates them, so we're not keeping that data around and can run this flow indefinitely without having to worry about it.

So what else is different here? We have the generator running on the primary node only, with its queue limit of 10,000. What's odd, though, is that ConvertRecord is only consuming one active thread, even though I just told you it is set to execute on all nodes. So why is only one node actually doing the work? We can verify this: right-click and choose View status history. All of the nodes in the cluster are listed up there, and we can uncheck them to compare. The solid blue line is the NiFi cluster as a whole, and the other colors are the individual nodes. Before, when the generator ran on all nodes, we got a quick boost from everybody, but now only one node is doing all that work: it's generating all the flow files, which is why only that one line keeps rising until it hits the threshold where back pressure kicks in. What's weird is that on ConvertRecord, too, only one node is actually doing any work, and that doesn't make much sense. We're limiting our total processing power, because two other nodes are just lying around waiting for work, and in this flow that work isn't getting handed out to them.

So how do we take care of this? There is a way to make sure they get some work. Stop the generator up top and ConvertRecord down below; we have a few flow files left in the queue. Edit the queue, and inside it we now have a couple of additional options that we haven't needed before. One of them is the Load Balance Strategy. By default it's set to "Do not load balance", but there are other options: we can partition by attribute, round robin, or have a single node do all the work. There's also a Load Balance Compression setting you can turn on if you need it, but we're not going to mess with that here. Single node is not what we want, because we want to invite all of our nodes into the workload, so let's use round robin. The available prioritizers, which have always been there, are listed too: first in, first out; newest flow file first; oldest flow file first; and a priority-attribute prioritizer that works off an attribute value. But all we need here is round robin, so let's apply that and start things back up.

We had 345 flow files sitting in the queue; once the generator starts again the queue fills back up, all the way to 30,000, because the queues on all of the nodes are being populated. There's also a visual change on the connection, and when we hover over it a tooltip shows that load balancing is configured with the round robin strategy and no compression, and that it is actively balancing.
So now we have three individual queues available, one per node, which is why the limit goes up to 30,000, and we know that each new flow file generated at the top is being load balanced to one of the nodes with the round robin strategy. ConvertRecord now bumps up to three active threads, one for each node, dropping down a little now and then, so it's trying to stay busy. MergeContent is not having a hard time at all, and the reason it keeps up so easily is that the merge is set up so that roughly every thousand flow files coming in produce one flow file going out, and it can fill up to five maximum bins at a time, so it can handle quite a bit at once and get a lot done.

One thing I do notice is that we're unable to get that queue emptied all the way; it gets close, but it just doesn't drain completely. It is down in the 10,000 range now instead of up around 30,000, but it bounces around a lot. If we open the status history again, we can see that only one node, the primary, is still generating all the files, but on ConvertRecord the other two nodes, whose lines are right on top of each other, are starting to make their way up as the load gets distributed, and all three lines eventually converge, so they are all working together. It's pretty cool to see how much they're getting done: ConvertRecord is taking in about 1.5 million records and putting about the same amount out fairly quickly, which works out to over 1.9 million processed in total in the last five minutes, with reads and writes around two gigabytes.

So this is working pretty well, but we're still not keeping up performance-wise: we're not emptying that queue, and the three nodes on ConvertRecord aren't keeping pace with the one node doing the generating. Most of that has to do with the cost of each conversion, but we can probably improve it a little, so let's stop ConvertRecord and see if we can bump up its performance. We're going to change the Run Duration. The best way I've had this explained to me is that it's like micro-batching: the processor runs a little longer per scheduled run and gets a little more done. The slider is labeled lower latency on one end and higher throughput on the other, although I feel like if you go too far you start to get diminishing returns, depending on the processor and what you're doing with it. I'm going to bump it up to 25 milliseconds and still leave it at one concurrent task, which means three tasks are still running, one per node, but each run takes a little longer. We were sitting around 1.69 million, so let's see if we can get a little better performance.
The throughput numbers immediately start jumping up, and we notice right away that we've cleared the back pressure in that queue: it's basically hovering around the 1,000 range now, so that queue is being dealt with pretty easily, and we're up to about two million processed even though it's not a full five minutes yet. You can see we're much busier now. Let's increase how many records we're creating: instead of running the generator on the primary node only, we'll switch it back to all nodes, which will really blow things up for us. There we go, the queue fills all the way up to 30,000 pretty quickly, so we're back to ConvertRecord not being able to process fast enough. It is using three threads, one for each node, and the run duration tweak did get us better throughput, but we're back at a point where it can't keep up with the back pressure here.

Down below, the queue into the merge bounces back and forth, but the merge is doing pretty well mostly by itself, and every once in a while it has to kick in an extra thread from one of the other nodes. We haven't set a load balance strategy on that connection; it's simply set to execute on all nodes with one concurrent task. If I remember correctly, I believe what happens when you don't set a load balance strategy is that once one node becomes full, hits back pressure, and is too busy, the work starts landing on the next node, and then the next, as they each become too busy to keep up; at least that's how I believe it works. That's why this processor bounces between one and two active threads and never goes up to three: two nodes are basically able to handle the load, which is also why the queue bounces around so much. Really, one node can mostly keep up and sometimes needs the help of a second. I think it mostly comes down to the binning; if you adjust your bins a little, you'd probably see a single node do a better job. You just have to play around and tweak it until you get there.

We can see we're definitely not keeping up at the top, though, so let's make one more quick bump on ConvertRecord. Stop it. The run duration change did help, but I don't think pushing the duration further will give us much more on this processor, so instead let's add one more concurrent task. Each node will now run two concurrent tasks at once, which gives us a total of six tasks at a time from this processor. It climbs up to six threads right away and looks like it will maintain six; it's not able to clear out the queue immediately, but the back pressure is fluctuating a lot more now.

Let's look at a couple of other things real fast. Back in the menu, under Cluster, we can now see the queue size on each node, how each one is performing right now, and the total active thread count for each node in the cluster, and we can look at the System, JVM, and all the other per-node stats that are available, so you have some options in there. We can also open the status history again and see how our nodes and the cluster as a whole are performing while handling the data for each of the processors.
In that view we can see two of the node lines have merged together, so two are definitely busy all the time, and in fact all three of them are busy now and keeping up with each other, using all six threads the processor has available.

Really, the biggest takeaway is that you want to pay attention to how a processor will behave when you run it inside a cluster environment, because there can be some very serious negatives if you don't. One good example is ExecuteSQL. If I have an ExecuteSQL in a cluster, run it, and leave it set to execute on all nodes, then every time it's scheduled, every one of the nodes will run the exact same query I've configured, so I'm querying the same data three times from the same database or source and duplicating my data. What I've found, using NiFi in a cluster environment for the first time, is that you want to make sure you do things like setting your query processors to run on the primary node only. Something similar can apply to other processors, say PutSQL: when a flow file comes in, depending on the strategy you have set up, you could end up writing it multiple times. With ExecuteSQL you will for sure see the query executed once by each node, but that isn't necessarily true for PutSQL: if you have a ConvertJSONToSQL ahead of it feeding it flow files, and you have the queue set up for round robin, then I think you can pull that off safely, because each flow file is only sent to one node, so you shouldn't see any problems there. Really, you want to be careful when you're running SQL queries; those are the biggest ones I've seen problems with.

So there you go, a quick idea of how, when you do run a cluster, you can scale up pretty easily and push a lot of throughput through it. We can already see that we're not even keeping up against the back pressure at the top and we're starting to get back pressure down at the merge, but we're moving six million flow files, 7.8 gigabytes, through here in the five-minute window this view covers, and that's quite a bit for what we're doing on just three nodes. And when I glance over at the terminal on my server, those 32 cores are sitting at roughly 50 percent; in fact, if we go to the Summary screen and look at the system diagnostics, it shows around 56 percent utilization across those 32 cores over the last one-minute window.

One other thing you want to pay attention to, real quick before we stop, is the Controller Settings. Under the General tab, where you set your maximum timer-driven thread count, that value is no longer "10 for my entire environment" or "10 across my entire cluster"; it is 10 per node.
So if you're sizing with something like a one-core-to-four-threads ratio, you still want to keep the number based on a single machine, because whatever you enter is automatically applied to every node. For example, if my nodes are two-core machines and I want a 1:4 ratio, I want eight threads per machine, so I still only put eight in here; that eight is applied to each node, and the total available across the entire cluster is that value multiplied by your cluster size, so with three nodes that works out to 24 timer-driven threads overall.

The other thing to keep in mind with clusters is what happens to each flow file. In this case we're doing round robin, which means each individual flow file lives on exactly one server: it is not reproduced across all of the servers, it gets routed to one node, that node does its work, and then sends it on to wherever it ends up going next. The nodes are literally splitting the work out between themselves rather than duplicating it and all processing the same data; that's not how clustering works inside NiFi.

All right, that's all I have for you today. Don't forget to hit like and subscribe if you enjoyed the video, and leave a comment if you have anything you'd like to add or contribute. Talk to you next time, bye.
Info
Channel: Steven Koon
Views: 2,648
Keywords: Steven Koon, apache nifi, apache nifi examples, data streaming, data flow processing, learn nifi, nifi training, nifi training videos, nifi training online, nifi training courses, nifi for beginners, learn nifi in a day, big data, data science, data scientist, dataflow, etl, apache kafka, streaming, data, data engineer, datascience, bigdata, kafka, spark, data pipeline, data pipelines, data engineering, Kinesis, nosql, aws, azure, api, computer science, data analytics, database, docker
Id: eQ4UlerXJ24
Length: 28min 37sec (1717 seconds)
Published: Wed Sep 30 2020