A Real Use Case with NiFi, the Swiss Army Knife of Data Flow

Captions
Okay, it's a minute after the hour, so we'll go ahead and get today's webinar started. Hello and welcome, everyone. Today's webinar is "A Real Use Case with NiFi, the Swiss Army Knife of Data Flow." My name is John Silvers, I'm with Compose, and I'm one of the organizers of today's presentation. In a minute I'll introduce today's speaker, Hayes Hutton, who is also at Compose. Today's call will be in listen-only mode; you can ask questions using the GoToWebinar panel on your screen, and we'll come back at the end of the presentation to answer any questions if there are any. We are recording today's webinar, and we'll send everybody who registered a link to the recording by the end of the week.

I wanted to briefly introduce Compose for anyone watching who's not already a customer or who hasn't signed up for a trial. Compose makes it easy for developers to use production-ready databases for building apps, without all the hassles of managing the databases. Each database is deployed in a high-availability cluster, and what our customers tell us is that with Compose they can focus on writing apps, not managing databases. We offer Compose as a multi-tenant hosted service on AWS, Google Cloud Platform, and SoftLayer, and we also offer Compose Enterprise, which includes all the same goodness as Compose multi-tenant but on your own cloud servers. We've been around since 2010; back then we were called MongoHQ, and about two years ago we changed our name to Compose to reflect the fact that we host other databases besides MongoDB. We currently offer nine different databases and developer services. We released MySQL about two weeks ago, and we launched ScyllaDB the week before that; if you're not familiar with it, ScyllaDB is a drop-in replacement for Cassandra that boasts very fast speeds without all the overhead that goes into managing a Cassandra database. So please go check it out, and check out all of our services; if you're new to Compose, we offer everything free for the first 30 days.

Okay, and with that we're going to move to today's topic. Today's presenter is Hayes Hutton, one of the technical content creators here at Compose. He has also written a three-part series on NiFi on our blog at compose.com/articles; search for NiFi and you'll find it. As Hayes will show, NiFi is a really powerful tool for moving data across data stores, and since many companies now routinely use two or more databases for running their apps, we felt this would offer some terrific insights into how to make that process even easier. So please be sure to post any questions for Hayes in the GoToWebinar questions box and we'll get to them at the end. And with that, Hayes, I'm going to pass controls over to you.

All right, thanks John. Has my screen come up? Yep. So we're going to get started here and move pretty quickly. Again, thanks for everybody's time. I'm excited to share some insight into NiFi; it's a great tool, and I've got lots to go over, so I'm going to start moving pretty quickly. Here we go.

I'll start by giving an overview of what we're going to talk about today. I've got a brief history and some context for NiFi. Then we'll talk about how it works, covering the main core abstractions and the other things I think you should know in order to actually develop with NiFi, utilize it, and know when to use it. Then I'm going to build a simple flow — the simplest of flows that I could come up with — so that you can see what
it takes and how easy it is. Then I've got a larger example already built, and we'll go through actually pulling some data out of some Twitter feeds and running it through a modestly complex flow: sending the data through Watson for sentiment analysis, storing some scores for the data in Redis, and then storing all of the data into MongoDB — a complete flow of competitive intelligence data, as I like to call it, because we're going to pull Twitter feeds about different cloud competitors and then score the tweets. We'll also look briefly at extending NiFi. One of the beauties of NiFi is that it's customizable in lots of different ways; today we're going to look at extending NiFi with Node.js and JavaScript, which is really pretty powerful, especially when it comes to prototyping and getting started quickly.

All right — some brief history and context in regards to NiFi. Where did it come from? NiFi was born in the NSA. Before it was turned over through the technology transfer program, it ran inside the NSA, where they used it to move data — probably bigger sets of data than a lot of us ever contend with. So it has an eight-year history, before it was released to the public, of running signals intelligence data flows in a really high-volume environment. NiFi can handle just about anything you could think of throwing at it, but as we'll see today, one of the beauties of it is that it also scales down to running on a laptop or a single server and doing interesting processing that way.

It's based on the notion of flow-based programming. This is nothing new; flow-based programming is a paradigm that was created in the early 70s. It's about defining an application as a network of black-box processes with message-passing connections. What's interesting is that when you have a component model where components don't know about each other and everything is hooked up by queues, you can actually build a full programming paradigm that way. That's what NiFi does. When NiFi was spun out of the NSA, the team came out of the NSA too, and they were pretty quickly acquired by Hortonworks, so that's where most of the folks who built NiFi reside currently. It's still actively worked on; it went 1.0 in the Apache project probably just a couple of months ago.

Now I'm going to change gears a little bit and talk about why you should use NiFi. When it comes to moving data around, certainly you can build your own custom program that does such a thing, but once you really get into that process, one of the things you start to find is that there are a lot of features and functionality that you probably shouldn't be developing yourself, especially when a tool like NiFi exists. Here are a few features that I think are really relevant whether your data set is modest or really large. First, guaranteed delivery: the core philosophy is that everything needs to be delivered, and the infrastructure for that is persistent write-ahead logs and content repositories, which we're going to look at in just a little bit. They're designed to handle high transaction rates, with effective load spreading and a copy-on-write content repository, and everything goes through disk. So what's interesting about NiFi is, as we
build these flows, one of the things I want you not to forget is that the data is protected at each step along the way, basically by a transactional write-ahead log that is managed in memory but where everything gets persisted to disk. Once data comes into NiFi it's persisted to disk, and the metadata attributes that we're going to look at are persisted too, so everything is protected through the entire flow. We're going to look at tweets — whether tweets are really worth protecting is a good question — but there are plenty of situations where you really do want to protect your data.

NiFi also handles data buffering. This matters as you get multiple systems in place: we're going to have a Twitter stream, we're going to have Redis, we're going to have MongoDB, and all of these systems perform in different ways. Buffering, back pressure, and that type of infrastructure is really important when you start mediating data between multiple systems. NiFi does it out of the box; you don't have to think about it. You can also prioritize the data as it flows, which can be important for certain use cases — the NSA probably had some of those. And there are a bunch of other dials you can toggle and twiddle, like latency versus throughput. You can configure loss tolerance too: if you've got too much data and you start getting overrun, you can drop data on the floor if you want to. That's all configurable, which is the way it should be. So those are the reasons to use NiFi.

Now I'm going to jump into the main abstractions that I think you should know about before we look at some examples. The core abstraction of NiFi is called a flow file, and a flow file is really nothing more than some metadata attributes — key-value pairs that help manage your piece of data as it moves through the data flow. A flow file is also a pointer into the actual content itself. You can imagine that if we're trying to manage millions and millions of records or tweets or whatever, we don't want to copy that data around. What NiFi does to solve that is this flow file abstraction: the metadata flows around through the system while the content gets written once to a repository, and the flow file points at that piece of data. When you start getting really big sets of data, that's really important.

Next is the flow file processor. Processors are the little black boxes in flow-based programming terminology; they're the only places where work is actually done. We'll do mediation between systems — we'll pull data out of Twitter, we'll put data into MongoDB — and each of those is a little processor, a black box that performs some functionality for us. These black boxes are connected via connections, and all a connection really is is a named queue. We'll be able to look inside these queues and see what's buffered up — I'll do that on purpose in a minute, and we'll look at an actual flow file and see what's in a queue.
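To make the flow file abstraction a little more concrete, here is a conceptual sketch in JavaScript. This is purely illustrative — NiFi itself is Java and its real classes are richer, and the field names and values below are invented — but the shape is the point: a small attribute map that travels through the queues, plus a claim that points at content written once into the content repository.

```javascript
// Conceptual sketch only: NiFi's FlowFile is a Java abstraction, not this object.
// The point is the shape -- lightweight attributes plus a pointer to content.
const flowFile = {
  attributes: {
    uuid: "5d1c-...",          // every flow file gets a unique id
    filename: "tweet-1.json",  // attributes are plain string key-value pairs
  },
  // The content itself is NOT carried around; the flow file just points
  // into the append-only content repository. Changing content triggers
  // copy-on-write: new bytes are written, the old bytes stay put.
  contentClaim: { offset: 4096, length: 2480 },
};

// A connection between two processors is, in effect, a named queue of
// these flow files; processors are the only places where work happens.
const successQueue = [flowFile];
```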
All of this is run by the flow controller, which is basically the server itself: it manages what gets run when, the threading of the processors, the transactional writes to the different repositories, and that type of thing. So those are the main abstractions. They're still pretty abstract here, since I haven't shown any good pictures of them, but we'll get to that in just a second.

One last architecture diagram, because I think it's important. NiFi is a Java-based program which runs on the JVM. We're going to control NiFi via its web server — the entire interface of NiFi is delivered over HTTP, so everything you'll see in a minute, as I run a real live data flow, is actually done through a web server. The core of the server runs the processors — that's what the flow controller is — and interacts with the repositories. What's nice about NiFi is these three repositories, because they're the three things you really want when you have a data flow. The flow file repository is a write-ahead log that manages the key-value pairs that are the metadata of each piece of information coming into your system. Say a tweet comes in to the first processor: a flow file gets written, and the data payload that gets written is the actual JSON text of the tweet — we're going to see this in a second — while the attributes get written to the flow file and kept in memory. NiFi writes the payload once into the content repository, which is an immutable log of content; if you change the content, it does a copy-on-write. So as you architect a NiFi flow, you really want to be conscious of how often you read and write the content itself; you obviously want to try to work with the metadata, since that's what manages the flow. The provenance repository is a Lucene-based index, so you can search for all kinds of audit-style information: as data moves through, each piece of data is tracked via this searchable audit history. So they've really covered all the bases, with specific stores built to handle each of the features necessary to do data flow correctly.

All right, the simplest of flows — let's go ahead and build one. I've got a live NiFi up and running, and we're going to build the simplest flow I could think of: we're going to go out and get a web page — so this is a source, and it's going to create a flow file — pass it through a connection, called a success connection, and into another processor called PutFile. So: this is a processor, this is a processor, this is a connection. That's the simplest flow I could come up with, and we're going to do it real quick to give you a sense of how NiFi works.

This is the NiFi canvas, as I call it. This is the whole control plane, the admin panel; even users can come into this if you have it set up that way. These are process groups; I'm going to go into this particular group of processes, and I already have this built up here. To give you a sense of how NiFi delivers these little components: NiFi has a bunch of components already written, and they typically try to do one thing and do it well. We're going to use one that does GetHTTP. I pulled in this GetHTTP component — I just searched for it; it's one of 172 of these things — and added it to the canvas. It's not ready to run yet; it's ready to be configured. Everything in NiFi gets configured. I can
control-click on this component, this menu comes up, and I can see all the things that can happen in regards to this particular component — "processor" is the better NiFi terminology. If we look at configuring it: each processor has its own set of properties that are defined when the processor is developed, in the code, so all of these things live in each individual component. To do a GetHTTP, obviously you're going to need a URL, and a filename to assign to the file. If SSL is used, then usernames and passwords, timeouts — all the things you can think of that you might need to configure a GetHTTP processor. I've got one configured here, so we'll go look at it: it's going to go out to noaa.gov and pull the home page. It can also be scheduled. One of the other nice things about NiFi is that these little independent components, the processors, can be run in parallel, so I can do concurrent tasks. In this particular one, when the task runs and finishes, it waits a second and then runs again — that's timer-driven scheduling — and it can also do cron, if you're familiar with cron. So all of this processing can be scheduled and run continuously to generate streams of data; this particular processor is all about being a source of data. You can also name your components, and there's some other configuration information — pretty straightforward.

I've got another processor here called PutFile. So GetHTTP is going to go get NOAA's web page, an HTML file; it's going to write all of the HTML to a flow file and put it into the queue; and then the PutFile processor writes the contents of a flow file to the local file system. I'm going to start this up — it's been scheduled, and the processor is currently running; that's what the triangle means. Since I don't have the PutFile processor running yet, in a second we should see a flow file show up in this queue; it takes a moment to refresh. I'm going to go ahead and list the queue — it's there — and now we can look inside. So now I'm looking into this connection at a buffered flow file, and these are the attributes of this particular flow file. It's gotten a UUID, which is effectively its key. It doesn't have a lot of attributes, but it does have that, and the filename we set up, and it does say that it's HTML. These are the attributes of the flow file that live in the flow file repository — the metadata that can be utilized and changed along the way. When I click View, it goes into the content repository and pulls out the bits of the actual payload. There's no viewer for this, so I have to look at it in hex — if you look over here you'll see the hex; this is the binary of the NOAA web page that we just pulled. So that is basically the simplest flow I could develop, just to give you a sense of the pieces: processor, connection, processor. If I actually ran this to completion, it would pull these flow files out and write them to the file system, so you can see how that works.
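For reference, the attribute listing on that freshly fetched flow file looks roughly like the sketch below. The values are invented for illustration; `uuid` and `filename` are the attributes called out above, and `mime.type` is assumed to be how the UI knows the payload is HTML.

```javascript
// Roughly what listing the queue showed for the fetched page (values invented):
const attributes = {
  uuid: "ca54-...",          // assigned by NiFi; the flow file's identity
  filename: "noaa.html",     // the filename we configured on GetHTTP
  "mime.type": "text/html",  // why the UI reports the payload as HTML
};
// The page's HTML bytes are not here -- they sit in the content
// repository, and the View button pulls them out on demand.
```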
Okay — that's the simplest flow ever. I'm going to go back, and we'll look at a more interesting one in just a second. Now let's look at the real use case that I've come up with. This is the setup for the more complex flow: social competitive intelligence. In this example I'm going to take filtered Twitter streams. I've signed up with Twitter, and NiFi has a built-in processor called GetTwitter that takes filters — when I say filters, you can filter the tweets just like you would when you search for a tweet. I've got one set up for AWS, one for IBM, one for Google, all for their cloud stuff. So we're going to take all the cloud tweets we can find, bucket them, label them with the particular stream they came from, and extract some attributes. Then we send the text of the tweet to Watson for sentiment analysis. This is a custom processor that I built in Node.js; if I wanted to go further and productionize it, I could carry it over into Java — in one of the blog posts I've written, I show how to write a custom processor in Java. Once we get the tweet back scored with sentiment analysis from Watson, we send the score itself, with a good key, into Redis. This would be kind of the halfway point to delivering data you could use in some kind of dashboard — if you had a soft real-time visualization web page you wanted to build, Redis is good for that kind of hot data — so we put some of those basic metrics in there. And then we store the full tweet, with the new sentiment analysis, right into MongoDB.

So let's look at that. Back in NiFi, I'm going into my process groups here — let me scroll out a little bit. This is the actual full flow: pulling multiple Twitter streams, tagging the tweets, merging all of the streams' connections into one, and then — I've got a buffer here, stopped right before the score-sentiment step — this is where we take each individual tweet and send it off to Watson, to the AlchemyAPI, to do sentiment analysis of the actual tweet text. Real quick, I'm going to go in here and look at a tweet, just to give you a sense of the raw data. Once again, these are flow file attributes; this particular tweet I tagged as Google, so now that's in the system and I'll be able to utilize that information. Now I'll look at the actual flow file content itself — I go into the view, and this is the actual payload of that particular flow file. I just picked this one at random; it has GCP in it, so that's how it got picked. The tweet might not have anything to do with Google itself, but it did have GCP — Google Cloud Platform — in it, so the filter picked it up. Anyhow, this text gets sent off: this is the text of a real tweet, and this data set comes directly off of Twitter. When I send it off to Alchemy, I just take the text, parsed out of the JSON. The other piece of information I use to create a score is the followers count. For each tweet I also put into Redis something I called reach — a rather arbitrary score on my part. A user who has
20,000 followers and tweets something gets a score of 20,000, whereas if you've only got 40 followers, you get a 40 for that particular tweet. So that's the data that gets sent off and scored, and when it comes back from the scoring process it gets sent into Redis. I've got Compose's data browser open here — if you haven't used Compose before, this is a data browser. This one is looking at a MongoDB database whose collection isn't even created yet, and this one is looking at the keys of a Redis database, and I don't have any keys yet. I've got 760 tweets queued up, so I'm going to go ahead and start this process. This is what's really cool about NiFi: you can stop pieces along the way and they'll queue up, and it doesn't hurt any of the data — I haven't lost any data; it's still all written to disk — and now I can start it up and it'll start flowing through. As it flows through, we'll see the data populate into both Redis and MongoDB.

Already I've got 8 tweets that have flowed into MongoDB. This is post-processing: these are tweets that have been put into MongoDB after being processed by the AlchemyAPI. This particular tweet was about Microsoft and their Slack competitor, AWS startups, something like that, and this user had a follower count of 1,039. If we go down here, we'll see some scores: this tweet was scored as negative by Watson; the category was AWS; and I also gave it a minute bucket via the Unix epoch — I just took the timestamp and divided it down into buckets, because when I put it into Redis, I keyed it with the category (which was AWS), the word sentiment, and the minute bucket. So in Redis I key all of the sentiment scores and group them together by minute. When we go over here and look, we'll see a bunch of keys — in this particular one, azure was the label, then sentiment, and this was the minute bucket. It's a typical complex key pointing into a list of sentiment scores for that minute. You could take this now and do some simple visualizations, maybe some averages, those types of things.
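As a sketch of that Redis step — the exact key delimiter, function names, and config are my reconstruction from the talk, not the speaker's verbatim code — the put-redis side amounts to something like this:

```javascript
// Sketch of the Redis scoring step; the key format is reconstructed from
// the talk ("category, sentiment, minute bucket"), not copied from it.
const redis = require("redis");
const client = redis.createClient(process.env.REDIS_URL); // e.g. a Compose Redis URL

function storeScore(category, sentimentScore) {
  // Minute bucket: the Unix epoch timestamp divided down to whole minutes.
  const minuteBucket = Math.floor(Date.now() / 1000 / 60);
  // e.g. "aws:sentiment:24635520" keys a list of that minute's scores,
  // so a dashboard can average each minute's sentiment per competitor.
  const key = `${category}:sentiment:${minuteBucket}`;
  client.rpush(key, String(sentimentScore));
}
```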
The last thing I want to show before I run out of time is this processor called ExecuteStreamCommand. When I work with NiFi, one of the things I like to do before spending a lot of time building a custom Java processor is flesh out my ideas in Node. ExecuteStreamCommand works over standard in and standard out: for each flow file, the command gets called, the content of that flow file becomes standard in, and whatever you put on standard out becomes the new flow file. That's how I handle this particular call into the AlchemyAPI, because there isn't a built-in processor for it. I'll show you that code real quick. It's just a simple npm Node package. To give you a sense of the dependencies: I depend on watson-developer-cloud, I depend on redis — because I also use this package to put data into Redis — and just some simple config for ease. One of the nice things about npm is that you can define commands in the package, and I've got two: get-sentiment and put-redis. The process that calls Alchemy is get-sentiment, so let's look at that.

This is the complete NiFi extension — the custom process that interacts with Alchemy and scores the text from the tweet. A lot of it is just boilerplate to get set up: config stuff, requiring the Alchemy library, getting an Alchemy key, which I keep in a config file like I mentioned. For standard in, we use the regular process.stdin: while it's readable, we build up the flow file, because the data can arrive in more than one chunk — the chunks come in, you fill the buffers, and you build up the entire flow file; when the stream ends, you know you have a complete flow file and you've got everything. Then this is where the real work is done, and there isn't much to it: parse the flow file, so now I've got a JavaScript object, and call the sentiment API. What's handy is that each tweet from Twitter already has a text attribute, and the sentiment API requires an object that has a text attribute, so I send the whole thing in and it uses that text attribute. When it comes back, the response has the score — response.docSentiment.score — which I turn into a number and store on the original flow file object, and then I stringify it and send it out. Right there, when I write it to standard out, that creates the new flow file. So this whole process has extended NiFi, and all of the semantics and guarantees — the write-ahead log, managing the data through the multiple processors — still work. And this is just with Node; you can do this with Python, you can do it with Go, you can do it with anything, because it's just basic standard in and standard out. Then, when you want to productionize it — maybe you've got more than one output, multiple connections, and you want to do some failure handling and that type of thing — at that point you really should look at doing something more in Java.
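Reconstructed from that walkthrough, the get-sentiment script looks roughly like the sketch below. Treat it as a sketch, not the speaker's actual code: the config handling and names are assumptions, and it targets the 2016-era watson-developer-cloud SDK and AlchemyAPI service that the talk uses (Alchemy has since been retired).

```javascript
#!/usr/bin/env node
// Sketch of the get-sentiment command described above (names/config assumed).
const AlchemyLanguageV1 = require("watson-developer-cloud/alchemy-language/v1");
const config = require("./config"); // assumed: exports { alchemyKey: "..." }

const alchemy = new AlchemyLanguageV1({ api_key: config.alchemyKey });

// Build up the whole flow file from stdin -- content can arrive in chunks.
let input = "";
process.stdin.on("readable", () => {
  let chunk;
  while ((chunk = process.stdin.read()) !== null) {
    input += chunk;
  }
});

process.stdin.on("end", () => {
  // The flow file content is the raw tweet JSON from GetTwitter.
  const tweet = JSON.parse(input);

  // The tweet already carries a `text` property, which is the field the
  // sentiment call reads, so the whole object can be passed as the params.
  alchemy.sentiment(tweet, (err, response) => {
    if (err) {
      process.exit(1); // exit non-zero so the failure is visible to NiFi
    }
    // Attach the score to the original object (neutral results may omit
    // a score, hence the fallback to 0)...
    tweet.sentimentScore = Number(response.docSentiment.score || 0);
    // ...and whatever lands on stdout becomes the new flow file content.
    process.stdout.write(JSON.stringify(tweet));
  });
});
```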
All right — extending NiFi. We just went over ExecuteStreamCommand. There's also something called ExecuteScript, which is similar but aimed at JVM-based scripting. And then you can create and deploy your own processors, packaged basically as your own archive; NiFi has its own custom class loaders so you don't get dependency problems and that type of thing. If you want to know more about that, please go look at the blog, because I did write up a post on extending NiFi with Java.

Some resources: these slides will be available, like I said, on the articles page; the NiFi docs are great; and I highly recommend this GitHub repo — this person keeps a whole list of resources in regards to NiFi. So hopefully you found this a good quick introduction to NiFi, and maybe it'll become a good tool in your toolkit. I think that's all the time I've got, so John, I think that's it. Thanks, everyone — I appreciate your time, and if there are any questions I'm happy to answer them.

All right, let me pull up my window. We have a little time, so if you have any questions, please go ahead and post them in the question window. I did have one question, which is: can NiFi be used for data modeling?

Data modeling — no, I can't envision exactly how you would use it for data modeling. There are some processors, like some Avro processors, that will read data models from source systems and keep the metadata around something like a record format, but there's really not an ER-diagramming tool where you're actually doing data modeling itself. This is more about the movement of data around, and as far as NiFi is concerned, once the data hits that content repository, it's really just a blob. We were looking at formatted JSON there, but that was just a nicety of the tools; as far as NiFi is concerned, it's just bits.

Okay, great. One other question, which is: are there any data loss concerns when you change or update an existing data flow midstream?

NiFi does a pretty good job in regards to managing data, so data loss — it's pretty transactional, unless you configure it not to be. Every one of those tweets we saw flowing through the system is actually the original tweet, and anything that was changed or written was done with a copy-on-write: a whole new version of it was written to disk. So not only do you have the original, you get the steps along the way. I didn't show the provenance repository, but you can actually go back, look through those steps, and see what changed when. All of this is predicated on steps that write things to disk: every one of those processors, when it writes a flow file, basically commits a session, which is very akin to a transaction — it confirms that it's written to disk and then returns. That happens in the metadata too: the flow file metadata has the write-ahead log, it commits to disk, and if the server goes away and comes back up, it repopulates memory. It won't acknowledge that it's done until everything is written to disk. So it's pretty good about protecting your data as it flows through — that's really one of the core points — and it can do that at scale too. We didn't even talk about scaling this out and clustering it: you can get into a setup with ZooKeeper and multiple nodes across multiple servers. Like I said, this was used in the NSA for eight years; they weren't pushing little bits of data around, and they cared very much about keeping it, and it shows in the product. So yeah, data loss is managed pretty well — better than any tool you'd write yourself, other than maybe some transactional data store, like a Postgres or something. But yeah, it's pretty good.

Okay, one last question, which is: how is auth handled for Twitter, and would it be the same for the Facebook Graph API?

Each processor has the ability to be configured to do its own thing. With that GetHTTP, you could have set up SSL, and it would have used the normal Java JVM infrastructure, with key stores and trust stores, to handle it. For the Twitter one — I didn't show it, just because of time, but just like we configured the file-based processor, you configure the GetTwitter processor, and it has places for all the keys. If you go to Twitter and sign up for developer
access, it gives you all the keys. I'm assuming Facebook would have something similar — I haven't hooked that up in particular. But if it doesn't exist as a NiFi processor, you could do something very similar to what I just did with the AlchemyAPI: that doesn't exist as a built-in NiFi processor either, so it screams "hey, let's extend this," and I just used the keys and the library that Alchemy gave me. You could build your own custom one very easily, and once it wrote the flow file, and it hit disk and said it was done, it's in the system. So I don't think there's a real big barrier there — maybe some custom work, but that's just because I'm not familiar with the Graph API you've got.

Okay, great. I've put up the questions screen, which has some more information. If you go to compose.com/articles and search for NiFi, you'll see the three-part series that Hayes wrote, so there's more detail there. And if you have any general questions about this webinar or anything else about Compose, just reach out to us at compose.io and we'll forward it along. There are no other questions, so that's the end of today's webinar. Thank you very much for attending; that concludes our presentation. Please do come to our webinars or visit our website to learn more about NiFi — you'll find it on our articles page — and we will email everybody shortly with a link to today's recording, so anybody who registered will receive that. Thank you very much.
Info
Channel: Compose, an IBM Company
Views: 22,735
Keywords: nifi, java, javascript, aws, npm, node, js
Id: zRW8_Xb-zaA
Length: 37min 45sec (2265 seconds)
Published: Thu Nov 03 2016