Apache NiFi Anti-Patterns Part 2: Flow Layout

Video Statistics and Information

Captions
Hey everybody, I'm Mark Payne. I'm one of the co-creators of Apache NiFi, and over the past five or so years I have worked with teams from some of the largest companies in the world to help ensure that they're making the most out of their use of Apache NiFi. Some of the use cases that I've seen are really simple: maybe grabbing a few files off of an FTP server once a day and pushing those into HDFS. Some are far more complex: receiving millions of events per second from thousands of different sources, doing some analysis and enrichment, maybe doing some transformation, and ultimately routing and sending that data to hundreds of different destinations. But across all these different use cases, there are certain practices that I see come up over and over again that really prevent NiFi from reaching its full potential. I call these practices anti-patterns, and in this video series I want to show you some of them and what we can do to avoid them, to help ensure that you're not running into the same problems. So let's get into it.

In part one of this series I talked about three different anti-patterns: splitting and re-merging data, treating structured and semi-structured data the same as unstructured data, and blurring the lines between flow file content and flow file attributes. Today we're going to talk about flow layout.

No developer wants to inherit a code base that's an absolute spaghetti mess. That's true of code, and it's true of a data flow. A data flow that's all jumbled up and messy presents several problems. First, it's hard to understand: you can't really follow the flow of the data by looking at the UI. And if you can't understand it, it's difficult to update and maintain. It also makes it really hard to onboard new flow developers and maintainers; nobody wants to mess with a flow like that. Yet when I ask a customer to show me their data flow, more often than not I see something that looks a little bit like this.

If the customer tells me they're having trouble with data arriving in Kafka from this PublishKafka processor, one of the first things I'm going to do is ask them to explain the flow of the data through their system: how does the data get there? And when you look at a graph like this, it's really difficult to understand what's happening. As I've talked to users, there are two reasons they so often lay out their flows this way. First, they just haven't taken the time to sit down and think about how their flow should be laid out. Second, there are a lot of tips and tricks in the UI to help you lay out your flow that may not be obvious to everybody. So today we're going to fix this. I'm going to show you a handful of tips to keep your data flow clean, simple, and easy to understand.

Let's start with flow direction. You can build your flows going from top to bottom or from left to right; you can build them going bottom to top if that's what makes you happy. But I've found that most people tend to align their processors from left to right. I think they do this because the screen is a lot wider than it is tall, so it's very common for users to expect that a left-to-right layout will give them the most screen real estate to work with. But the processors themselves are also much wider than they are tall, so the reality is that you can stack many processors vertically and fit a lot more on the screen at one time, and then you don't have to deal with the connections overlapping your processors like this.
Typically, when I want to start reorganizing a flow like this, I'll pick a starting point, grab one of these processors, and just pull it over to the right, out of the way. At that point we can grab the source or the destination and start moving those over as well. So we've started this linear approach going from top to bottom, but the components aren't aligned very well. To take care of that, we can zoom out using the mouse's scroll wheel, then hold Shift and drag an area across the canvas to select a group of components. Then we can right-click on those components and go to the Align menu, which offers horizontal or vertical alignment. In this case I want vertical alignment, and that will line up all of the selected components for me.

Now we can start moving over more of these components, but we can see that the CompressContent processor's output goes in two different directions: we've got a success and a failure relationship. In this case, what I typically do is create a divergent path to the left or to the right, so that the main path goes straight down and the divergent paths also move straight down, but offset to the left or the right. You'll see what I mean in just a moment.

At this point some of these components are a little too close together, and their labels start to overlap. We could move the components further apart, but then we start to use up a lot of screen real estate. The other option is to double-click a connection, which creates what we call a bend point. Once we've created that bend point, we can bend the line however we want. If we double-click again, we can create another one, and we can start making shapes that make sense for our data flow. If we decide we no longer want a bend point, we can just double-click it to remove it. This way we can keep these two input ports close together, and the bend points let us draw the connections much more cleanly so that they don't overlap each other.

We can also see that some of the connections going from the RouteOnAttribute processor to this LogAttribute processor swing way over to the left; that's just because of bend points that were already added there, so we can remove them. But now we've actually got two different connections going from RouteOnAttribute to LogAttribute. There may be reasons to do that; for instance, we may want to manage the back pressure of each relationship independently. In most cases, though, we can just configure one connection, either by right-clicking and choosing Configure or simply by double-clicking it, and select all of the relationships we want for that single connection. Now this one connection carries both relationships, so we can get rid of the other connection and keep the canvas a little tidier.
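The Align menu works on whatever you have selected in the UI, but on a very large flow the same repositioning can be scripted against NiFi's REST API (GET and PUT on /nifi-api/processors/{id}). Below is a minimal sketch, assuming an unsecured NiFi instance at http://localhost:8080; the processor IDs, x coordinate, and spacing are hypothetical values for illustration, not anything from the video.

```python
import requests

# Assumed/hypothetical values: point these at your own NiFi instance
# and the processors you want to line up.
NIFI_API = "http://localhost:8080/nifi-api"
PROCESSOR_IDS = [
    "11111111-2222-3333-4444-555555555555",
    "66666666-7777-8888-9999-000000000000",
]
X = 400.0        # shared x coordinate so the column lines up vertically
SPACING = 200.0  # vertical gap between processors

for i, pid in enumerate(PROCESSOR_IDS):
    # Fetch the current entity; the update must echo back its revision.
    entity = requests.get(f"{NIFI_API}/processors/{pid}").json()
    payload = {
        "revision": entity["revision"],
        "component": {"id": pid, "position": {"x": X, "y": i * SPACING}},
    }
    requests.put(f"{NIFI_API}/processors/{pid}", json=payload).raise_for_status()
```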
We'll continue to pull these components over and get rid of some of these bend points, because they don't make sense for us anymore, and we can keep iteratively moving components around until the data flow makes a little more sense. In a couple of places here we have a connection going from a processor down to an output port, and that connection overlaps our process group and some other components. This is a great use case for adding bend points: we might add a couple of them so that the connection routes around the process group, and we'll do the same thing over here. I'm being a little particular about my bend points; there are certainly other layouts that would make just as much sense, and it really depends on your personal taste. But at this point we've aligned these components much more clearly, and when you look at this PublishKafka processor it's a lot easier to understand how the data got to this point, at least once it came into this process group; you can see the steps that it took along the way.

It still doesn't tell us much about what the flow is doing, though, or more importantly, why the flow is built the way it is. Take, for example, this IdentifyMimeType processor. Clearly it's going to identify the MIME type, but why in the world would we want to do that? In this particular case, it's there because we just want to check whether or not the data is compressed. So if we configure this processor, we can give it a name: "Check If Compressed". The next thing the flow does is use a CompressContent processor, but we're not actually going to compress the data: if you look at the properties, the Mode is set to "decompress". This CompressContent processor is built to work really nicely with the IdentifyMimeType processor: IdentifyMimeType determines whether the data is compressed, and this processor decompresses it, but only if it is compressed; otherwise it just sends the data straight through to success. So we'll call this one "Decompress If Necessary".

The next processor in our flow is ReplaceText. Again, obviously we're going to replace text in the flow file, but what are we changing, and why? If you look at the properties, you'll see that we're searching for a particular pattern, a dollar sign, dollar sign, curly brace, and replacing it with a dollar sign, dollar sign, underscore. This is done because some of the sources feeding us data have a bug that puts the wrong value there. So in the settings, let's give this processor a more descriptive name, and in addition to changing the name, we might also want to leave some comments on the processor before applying the changes. You'll notice that the ReplaceText processor now has a black downward arrow, and hovering over it shows the comments that I just typed into that box. So now, as you're looking at this flow, you can tell not only that we're replacing text but that we're fixing some corruption, and hovering over the indicator gives even more context: whatever comments were left on that processor.
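To make that ReplaceText step concrete: the flow searches for the literal sequence $${ and replaces it with $$_. The exact regex configured in the processor isn't shown, but a plain-Python equivalent of the substitution described would look something like this, with an invented sample input:

```python
import re

# Repair the upstream bug described above: replace the literal
# sequence "$${" with "$$_" (the sample input is invented).
corrupted = "priority=$${high} source=$${feed}"
fixed = re.sub(r"\$\$\{", "$$_", corrupted)
print(fixed)  # priority=$$_high} source=$$_feed}
```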
Next we come to a RouteOnAttribute processor, and again we can be a lot more descriptive about what we're actually routing on. In this case we're routing on a priority field, and the data will go down one path or the other. Then we have a PutJMS processor; again, we can be more descriptive about where we're putting the data. Clearly it's going to JMS, but specifically it's going to the "my destination" topic. Similarly with PublishKafka, we don't want to just say that we're publishing to Kafka; we'll say what we're publishing to, the "other destination" topic, or we might describe the particular Kafka brokers we're pushing to if there's more than one Kafka cluster we might send data to. Do whatever makes sense for your particular use case, but the idea is to give these processors names that are really descriptive and make it easier to understand how the data flow was meant to behave. It serves as documentation, both for people being onboarded onto the flow and for the people who developed the flow when they come back to it a couple of weeks later and have forgotten the decisions they made at the time.

One other thing we can do when building these flows, to provide more context and documentation, is to drag labels onto the canvas. We can double-click a label to configure it; it's just an open text field for whatever we might want to type. So let's make it larger and note that we're going to remove these processors once the source of the data is fixed. That gives us the ability to put a block of text right onto the canvas for documentation purposes. We do want to be careful not to overuse labels, though, because most of the time it makes more sense to leave comments in the processors themselves rather than clutter the canvas with labels; but they can be useful at times.

Now, if we zoom out, which we can do using this icon over here to fit everything onto the screen, or click this icon to go to the one-to-one zoom level, we can step back and really look at this flow. We can see how nicely aligned it is; the processors have descriptive names, and they have comments where applicable to describe what's actually happening. We've got an absolutely gorgeous data flow at this point. It's clear, it's concise, it's easy to understand, and it took all of what, ten minutes, to rearrange everything to look this way. Of course, if you've got hundreds or thousands of processors it will take longer than that, but you'll also reap ten times the benefits when you're reorganizing a really large, complex flow. So don't just watch me do it. Go on, try it. And when you realize how much better this is, don't forget to come back and like and share the video and subscribe to the channel. Cheers.
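Renaming processors and leaving comments can also be done programmatically, which is handy when cleaning up a large inherited flow. This is a sketch under the same assumptions as the earlier snippet (an unsecured local instance, a hypothetical processor ID); it relies on the processor entity's component.name field and, for the Settings-tab comments, component.config.comments, so double-check those field names against your NiFi version.

```python
import requests

NIFI_API = "http://localhost:8080/nifi-api"   # assumed local instance
pid = "11111111-2222-3333-4444-555555555555"  # hypothetical processor ID

entity = requests.get(f"{NIFI_API}/processors/{pid}").json()
payload = {
    "revision": entity["revision"],
    "component": {
        "id": pid,
        # A descriptive name, as the video recommends.
        "name": "Route On Priority",
        "config": {
            # Shown when hovering the processor's comment indicator.
            "comments": "Routes on the 'priority' attribute set upstream.",
        },
    },
}
requests.put(f"{NIFI_API}/processors/{pid}", json=payload).raise_for_status()
```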
Info
Channel: NiFi Notes
Views: 1,726
Keywords: nifi, dataflow, anti-pattern, technology, apache nifi
Id: v1CoQk730qs
Length: 18min 45sec (1125 seconds)
Published: Thu Aug 27 2020