Apache NiFi & Kafka: Real time log file dataflow

Video Statistics and Information

Captions
All right, so this is basically my NiFi interface. Assuming this is my first time ever logging into NiFi, the very first question I have is: how do I add a new data source to my enterprise architecture? What does it take to do that? It's actually very intuitive: I can just drag and drop a processor onto the canvas to add a new data source. I can collect data from different sources. For example, I can search for HDFS, and we have different processors to interact with HDFS in different ways. I can also search for Kafka, and we support different versions: this is a Kafka 0.9 broker, this is a Kafka 0.10 broker, and so on. There are so many different things I can do with all these processors, and we have about 172 processors in total, so imagine how many things you can do with them.

Now, as a very quick example, I'm going to add a GetFile processor, which allows me to grab a file from the local machine and then deliver it to another processor, a different place. For example, I can add a PutHDFS processor, and I can simply connect these two processors together. This connection between the processors indicates that flow files are moving from my GetFile processor to HDFS storage. That's how easily you can set up a data flow. Maybe I'm not a developer, I'm not a data scientist, I'm just an operations person; I don't care about coding because my job is to deliver data to the data scientists, and I can have that plumbing working in a couple of minutes. That's how easy it is.

So if we go back here, I know that demo is maybe too simple; it doesn't really convey the message. Do you want to see a more detailed use case? Let me show you how quickly you can set up a data flow that is actually meaningful in an enterprise environment. The flow scenario is this: I have a running log file that gets updated on the fly. I want to monitor that log file, and whenever there is a new entry, a new line added to the log file, I want to be able to grab that log entry and send that particular event or message to a Kafka broker. And the best part is, not just to any random Kafka broker: I want to route my log entries to different destinations based on their types. There are info entries, error messages, and warning messages, and I want to be able to send them to different topics based on those types. Additionally, I also need some kind of downstream mechanism that pulls data from my Kafka brokers and delivers it to a place like HDFS. That's pretty much the data flow job. Now, how do I do it in NiFi, and how long does it take? Let's see how that goes.

Back in my demo environment, I'm going to start by adding a processor to tail a log file, so I'll go for a TailFile processor, go to that processor's configuration, and specify where my log file is. In this case I'm going to look for the NiFi app log... actually it's not showing up here. Where's my... all right, maybe I can use another log, another running log, since I know how that one behaves, but it's not showing up here for some reason. Anyway, I specify where the log file is, and I can also go back to the configuration and customize all these processor settings by changing the property values, but I can also leave them at the defaults for now because it will work out of the box.
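As an aside on that first GetFile → PutHDFS example: everything done by dragging processors onto the canvas can also be scripted against NiFi's REST API. Below is a minimal sketch in Python, assuming an unsecured NiFi instance at localhost:8080; the processor class names are real, but the host, canvas positions, and the absence of any property configuration or error handling are purely illustrative.

```python
# A minimal sketch of building the GetFile -> PutHDFS flow through the NiFi
# REST API instead of drag-and-drop. Assumes an unsecured NiFi on
# localhost:8080; ids, positions, and error handling are simplified.
import requests

NIFI = "http://localhost:8080/nifi-api"

# Resolve the id of the root process group (the top-level canvas).
root_id = requests.get(f"{NIFI}/process-groups/root").json()["id"]

def create_processor(proc_type, x, y):
    """Create a processor on the canvas and return its entity."""
    body = {
        "revision": {"version": 0},
        "component": {"type": proc_type, "position": {"x": x, "y": y}},
    }
    resp = requests.post(f"{NIFI}/process-groups/{root_id}/processors", json=body)
    resp.raise_for_status()
    return resp.json()

get_file = create_processor("org.apache.nifi.processors.standard.GetFile", 0, 0)
put_hdfs = create_processor("org.apache.nifi.processors.hadoop.PutHDFS", 0, 300)

# Connect the two processors on GetFile's "success" relationship, which is
# what dragging a connection between them in the UI does.
connection = {
    "revision": {"version": 0},
    "component": {
        "source": {"id": get_file["id"], "groupId": root_id, "type": "PROCESSOR"},
        "destination": {"id": put_hdfs["id"], "groupId": root_id, "type": "PROCESSOR"},
        "selectedRelationships": ["success"],
    },
}
requests.post(f"{NIFI}/process-groups/{root_id}/connections", json=connection).raise_for_status()
```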
The second processor I need is, for example, a SplitText processor. What I'm going to do is split a log entry into individual lines. What it asks for is how many lines per flow file; I want a single line in every flow file, so I split the big log file into multiple lines, one line per file. Now I can just connect the two processors, delivering data from the TailFile processor to the SplitText processor. Another processor I can add is, for example, RouteOnContent, which basically allows me to make a routing decision, and then I connect those two processors as well, meaning I want to deliver the split line entries to that downstream processor.

I can also see that there is some kind of error message showing on the face of the SplitText processor, which means that processor needs some attention. Let's take a look at the actual error messages. It basically says that there are undefined relationships; for example, I have not defined failure handling. That's another powerful feature of Apache NiFi: you can specify what failure actually means at any given point along the flow. You can go to the processor configuration, go to Settings, and auto-terminate the failure relationship, which means whenever there's a failure I'm not going to do anything about that data, I'll just terminate it and drop it from the flow. Or I can add a new processor, for example PutHDFS, drag a connection over, and deliver all the failed data to HDFS. Or I can even drag a line back to the same processor again, which means I send the failed data back to the same processor and keep reprocessing it until it succeeds. There are so many different ways to handle failure, and it is supported out of the box by the NiFi framework. In the meantime, I also want to terminate the other relationships so that the processor is valid.

Moving forward, I'm going to configure this RouteOnContent processor. Under Properties, the match requirement is either "content must contain match" or "content must match exactly"; I'm going to change it to "content must contain match". I want to add different entries to specify keywords, whether it's an error message, a warning message, or an info message. So what I can do here is add a property called, say, "info" and specify that it should capture the keyword "info" from the log entries. This is basically the Expression Language; we have a complete guide to all the Expression Language syntax in the NiFi documentation, so you can look it up whenever you like. Similarly, I can capture different keywords, for example "warning" to capture warning messages, or "error" to extract the keyword "error" from the log entries. So now I have different relationships, and in the end I'm going to deliver all this data, based on the different keywords, to different Kafka topics.

I can add a processor called PublishKafka; this is going to deliver data to the Kafka broker. I go to that processor's configuration and can see different options here. Let me change the broker address to sandbox.hortonworks.com; this is where my Kafka broker is, running locally on my laptop, and I happen to remember the port number, which is 6667. I can also specify a topic name; in this case this is the default topic.
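To make the routing logic concrete, here is a rough Python stand-in for the TailFile → SplitText → RouteOnContent → PublishKafka chain. This is not how NiFi runs the flow; it's a sketch assuming the broker address from the demo (sandbox.hortonworks.com:6667), hypothetical topic names (info, warning, error), and an assumed log path.

```python
# A rough, self-contained stand-in for the tail -> split -> route -> publish
# chain, to show the decision RouteOnContent is making. The broker address is
# the one from the demo; topic names and the log path are assumptions.
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(bootstrap_servers="sandbox.hortonworks.com:6667")

def route(line):
    """Pick a topic by keyword, like RouteOnContent with 'content must contain match'."""
    for keyword, topic in (("error", "error"), ("warn", "warning"), ("info", "info")):
        if keyword in line.lower():
            return topic
    return None  # no keyword matched; the demo simply terminates these lines

with open("/var/log/nifi/nifi-app.log") as log:  # path assumed
    log.seek(0, 2)                               # start at the end, like TailFile
    while True:
        line = log.readline()                    # one line per message, like SplitText
        if not line:
            time.sleep(0.5)
            continue
        topic = route(line)
        if topic is not None:
            producer.send(topic, line.encode("utf-8"))
```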
Back on the NiFi canvas, I make a copy of that PublishKafka processor because we also want to deliver data to different topics, maybe an error topic and a warning topic; I know those topics exist. I can just connect the routing processor to these Kafka processors: this one is info, this is warning, and this is the error messages. I can go back and handle the relationship for messages that don't match; in this case I just want to terminate everything that doesn't match any of the keywords. On the Kafka processors I drag the failure relationships back to the processor itself, meaning I just want to keep reprocessing the failures, and then I terminate the remaining relationships because there are no more downstream processors, so I terminate the success relationships.

So now I basically have this entire flow from end to end, gathering data from a local log file and delivering it all the way to the Kafka brokers. Now let's turn on those processors and see how the flow files move through NiFi. I start the processor here that tails the log file, and since it's a running log file it keeps picking up new messages. I turn on the processor that splits the log file into single lines; there are thousands of lines in those log files. Then I turn on the RouteOnContent processor to deliver them to the different Kafka topics based on the keywords; all those lines happen to be info, which kind of makes sense. If I turn on the PublishKafka processor, it's going to store all the data in my Kafka broker. I start all these processors, and this is a running data flow from end to end; once all the processors are started I do not have to maintain the flow at all, there's no human involvement needed.

But on the other side of the house I also want to retrieve, or pull, data out of the Kafka brokers and deliver it to downstream HDFS storage. How can I do that? I can add another processor called ConsumeKafka. I can read from any of the three topics; in this case I'll use sandbox.hortonworks.com with the same port number and, say, the info topic, and for the group ID I'll just give it a random value. Then I add a processor to load that data into HDFS, delivering the data from the Kafka broker to HDFS. If I hover over it and double-check, the configuration looks all right, looks promising. Now if I start that processor, it's going to grab data from the info topic and keep delivering it to HDFS storage, and I can see all the data. Whenever new data is delivered to the Kafka broker, my downstream ConsumeKafka processor grabs those new entries in real time. So this is how easily you can set up a data flow that is actually meaningful in an enterprise environment.
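The downstream half, ConsumeKafka → PutHDFS, amounts to a consumer group reading from a topic and landing records in storage. Here is a minimal sketch with the same assumed broker and topic names; a local staging file stands in for HDFS, whereas in the real flow PutHDFS writes the records to HDFS.

```python
# A minimal sketch of the ConsumeKafka side of the flow: read from the info
# topic with an arbitrary consumer group id and hand each record off to
# storage. A local file stands in for HDFS here; in the NiFi flow PutHDFS
# does that delivery. Broker and topic names are assumptions from the demo.
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "info",
    bootstrap_servers="sandbox.hortonworks.com:6667",
    group_id="nifi-demo-consumer",   # any group id works, as in the demo
    auto_offset_reset="earliest",    # pick up existing records on first run
)

with open("/tmp/info-topic-staging.log", "a") as out:
    for record in consumer:          # blocks and receives new entries in real time
        out.write(record.value.decode("utf-8"))
```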
Info
Channel: Hortonworks
Views: 72,735
Keywords: hortonworks, apache nifi, apache kafka, apache hadoop
Id: 4yBc7hHvaQU
Length: 12min 36sec (756 seconds)
Published: Mon Nov 14 2016