Demo Jam Live Returns: Build Data Flows with Apache NiFi

Captions
Hi everyone, thank you for joining this session. I'm Pierre, and I'm going to do another live session about Apache NiFi. Before we go through it, let me start with some slides, because we are using a new format today and I want to make sure everyone knows how it works. First of all, you can ask your questions via the Q&A widget you should see on your screen. I have Dan Chaffelson and Tim Spann with me today, so if you ask questions we will do our best to give you answers. When you get the exit survey, click yes and we will get in touch with you and point you to additional resources you may want to watch after this session. If you joined us via LinkedIn, Twitter or YouTube and you want to participate in the poll questions we will ask during this session, please use the link in the chat; otherwise you won't be able to answer them.

Before we get into the demo, a few slides about myself and about what NiFi is, just to make sure we are all on the same page. I'm Pierre Villard, product manager at Cloudera in charge of all the products around NiFi, including NiFi Registry, MiNiFi, edge management and so forth. I've been involved in the Apache NiFi project for five years now, and I'm both a committer and a PMC member in the Apache community. Before joining Cloudera as product manager for NiFi I was at Google, and before that at Hortonworks. If you want news about NiFi, when we release new features or publish new blog posts, you can follow me on Twitter; that's the only thing I tweet about. There are also two ways to get involved with Cloudera: the community page, where you can ask questions and where a lot of our people answer them, and the meetups we schedule around NiFi and many other subjects, so if you want to know about upcoming meetups please look at that page as well.

So, what is NiFi? I will go through these slides very quickly, because there is a good chance you already watched the previous live session and already know all of this. NiFi is really the tool you want to use when you have data you want to move around: you have data and you want to get it to the right place, in the right format, at the right time. Then we have the MiNiFi agents that you can use for edge management and data collection at the edge; that's our answer to the first-mile problem for data: how to collect data efficiently and collect only what you need. NiFi is a drag-and-drop, no-code UI that you use to design your flows; once you are up and running you can acquire data from a source, do some transformations, and deliver it to a destination. We provide hundreds of processors to connect with almost any kind of system you can think of, and NiFi is extensible by definition, so you can build your own processors and components in case you need a specific feature that is not available in NiFi out of the box.
For the session today I will be talking a lot about the NiFi Registry, which is another tool you want to use in combination with NiFi when you want to move your flows from one environment to another. When you use NiFi in the enterprise, you develop your flow on a development cluster where you try things out, and then you want to promote it to a production cluster. That's what we are going to focus on today.

Just a quick word about the Cloudera products. The Cloudera Data Platform is a platform where we put together a lot of components in a consistent and coherent way. We will be focusing on the data-in-motion components, which include NiFi, Kafka and Flink, and we will be talking about NiFi exclusively today. On top of that we have what we call SDX, a common layer shared across all of your components to manage security, data governance and lineage in a consistent way, in a single place. All of this can run in your data centers on premises, or in the cloud, or on containers; there are a lot of options, and we run on all the main cloud providers today. For the demonstration I will be using CDP Public Cloud running on AWS, but just so you know, since December, so about one month ago, we added NiFi on CDP Public Cloud on Google Cloud, so NiFi is now available on all three main cloud providers. If you are a Google Cloud customer, reach out to us if you want to try NiFi there.

In CDP Public Cloud we have the concept of Data Hub clusters, which is a very easy way to start specialized clusters that are independent from each other. You can select the components you want on each Data Hub cluster, which is great for isolating your workloads, while still sharing the common SDX layer to manage authorizations and so on. That's what we will be using today.

Very quickly: over the last two weeks we asked on social media what you wanted to see today, and I'm happy to say we got a fairly evenly balanced set of answers. So I will try to get to the real technical details and the scripts as soon as possible, but since the answers are evenly balanced I will also spend some time on parameters, on how the NiFi Registry works, and on how we go from manually deploying from one environment to another to fully automating the deployments.

That's it for the slides. Before we go into the live demo, there is a first poll question for you; hopefully you can see it on your screen. It asks whether you are using the NiFi Registry in your environments today. That's an interesting question for us, because this is really the tool we want you to use when it comes to moving a flow definition from one environment to another. I'll leave you a few seconds to answer, and Dan, once you have the results, please let me know.

Yeah man, just letting them come in. Hi everybody, good to be back doing more Demo Jam. I am not actually on a beach in New Zealand, I would just like to be, and while you're giving poll results I think that's a reasonable place to wish I was,
considering it's night time in London and it's cold. A few more seconds here, let's see what we get. I feel like I should be wearing a fancy outfit and doing "but wait, there's more" gestures, something like that, just to get fancy. Yeah, I like it. OK, it looks like we've got some good answers. The question was "do you use NiFi Registry to deploy flows in production". About 27% of respondents say they do not use NiFi in production at all, so just under a third; and of the people who do use NiFi, we've got about 40% saying no and about 32% saying yes. Well, hopefully after this session you will be using NiFi in production and you will be using the NiFi Registry. Thank you, Dan.

So let's get started. Let me share my screen again and go into the demo. You should see my screen now. This is what it looks like when you start an environment in CDP Public Cloud. Right now you can see that I have my environment running in AWS. The part at the top is what we call the Data Lake, the SDX layer I was talking about, where we have Apache Atlas for managing metadata and lineage of your data, Ranger for the authorizations, and the Data Catalog for looking at your data sets, data quality and things like that. Then, as I said, we have the concept of Data Hub clusters, which are specialized clusters. For this demonstration all of my clusters are running in the same cloud environment because it makes my life easier, but you could use a different model where you have a development environment with your development Data Hub clusters and another dedicated environment for production. In this case I use the same environment: I have one dev cluster for NiFi, named dev-cluster, one dev cluster for Kafka, and the same two clusters for production. So I have four clusters, two running NiFi and two running Kafka.

If I go into one of the NiFi clusters, we can see a lot of information about what is deployed there. In this case I have what we call a gateway node, which is where my NiFi Registry is going to run, and then the NiFi nodes; here it's a three-node cluster. From there we can access the NiFi UI and the NiFi Registry UI, and you can also access the Cloudera Manager in charge of this specific cluster, and you have the same for all of the Data Hub clusters, the other dev clusters and the production clusters. If I look at the Kafka cluster, it's quite similar: we have a master node running the Schema Registry, which is what we use to manage schema versioning and so on, and Streams Messaging Manager, which is the tool we provide for monitoring and supervising your Kafka clusters; that's also where you can easily create topics and so on. All of this runs on the master node, and then we have the three broker nodes running the Kafka brokers.

Before I move on to the NiFi UI and we start designing a flow, I want to quickly move to another tab; apparently my sharing is limited, but OK, now you should see a new tab. This is a fairly old blog post I published two years ago,
almost three years ago now, about what we are going to talk about, and I want to stop on it because this is a discussion I have with our customers quite frequently. There are two ways to look at how the NiFi Registry can be used to move flows from one environment to another. The first architecture is one NiFi Registry used as a common piece for all of your NiFi clusters: on your development cluster you create and develop your flow, adding processors and doing whatever you need to meet your use case requirements, and then you commit and version your NiFi flow into the NiFi Registry, and then — (Yeah buddy, but your screen share has dropped out, do you want to try that again? — Yep, thanks, let me try something else; that should be better. — Yes, we have it loaded.) — OK, thank you for that.

So I was talking about the first schema, and hopefully this makes more sense now: one NiFi Registry shared by all the environments, where only the development cluster is allowed to write to the Registry, while the staging and production clusters just check out a version of your flow to run it, perform some testing, and finally deploy it in production for real-life use cases. That's the architecture we at Cloudera recommend. Consider the NiFi Registry just like an Ansible Tower, a Jenkins or a GitHub repository, something shared across multiple clusters, and set the right policies to make sure write access comes only from dev, while you just check out from the other environments. That's the easiest architecture, and it's where you benefit from most of the features of the NiFi Registry. There is another option, though, which is having a NiFi Registry for each cluster and using APIs to move things from one Registry to another. Sometimes that makes sense, but it obviously makes all of the operations a bit harder; if you go through this blog post I explain how it works, and there is a CLI sketch of that kind of Registry-to-Registry promotion just below. For this demo I will be using a single NiFi Registry, shared by both my development cluster and my production cluster.

So, if I go back here: remember, I have two Kafka clusters and two NiFi clusters, one dev and one production of each. I chose — but this is really up to you, it doesn't matter for this demonstration — to use the NiFi Registry deployed in the production cluster, and I'm also going to use the Schema Registry running in the production cluster. So all of my NiFi flows will use the NiFi Registry from prod and the Schema Registry from prod. If I go to that cluster and open the NiFi Registry UI, I can quickly show you what it looks like. That's the NiFi Registry UI; right now there is not much to see, but basically here you have the list of all the flows you versioned in the Registry. There is just one I played with before, and we will start a new one during the demo. You can look at the versions and do some actions, and you also have some administration features if you have the right authorizations: you can create new buckets and manage users and policies, even though in this case, when you are using NiFi in Public Cloud, we want you to do that from Apache Ranger.
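As a side note on that second architecture (one Registry per environment): the promotion between Registries described in the blog post can be scripted with the NiFi Toolkit CLI, roughly along the lines below. This is a minimal sketch with hypothetical flow identifiers and properties files, and the exact command and option names should be checked against `cli.sh registry help` for your toolkit version.

```bash
# Export a specific flow version from the dev Registry to a local JSON file,
# then import it as a new version of the matching flow in the prod Registry
# (this assumes a flow entry already exists in the prod Registry to import into).
./bin/cli.sh registry export-flow-version -p dev-registry.properties \
    --flowIdentifier f2f2f2f2-demo-jam-use-case --flowVersion 17 \
    --outputFile demo-jam-v17.json

./bin/cli.sh registry import-flow-version -p prod-registry.properties \
    --flowIdentifier a9a9a9a9-demo-jam-prod-copy \
    --input demo-jam-v17.json
```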
So that's really everything you can do in the NiFi Registry UI. It's a nice UI, you can see the history of the changes on your flow, but that's pretty much it. We are going to add new features to the NiFi Registry this year, and I will come back to that later if time permits, but that's really it for the Registry itself. What is really interesting is the integration of NiFi with the NiFi Registry and how all of this works together.

So let's go to the NiFi UI of my dev cluster. I'm authenticated with my own identity on this dev cluster, and I quickly want to mention that when you manage a lot of NiFi clusters, there is a small feature that allows you to display some custom text, which can be helpful to be sure you are on the right cluster at the right time and you're not doing anything stupid. Hopefully you have the right authorizations anyway, so you can't do anything stupid in production, but just in case, there is a property in NiFi that lets you set this custom banner text, which is useful, and we are also adding new features as we speak to make this even more obvious.

So how does it work with the NiFi Registry? What you version in the NiFi Registry is a process group, so the first step is to have a process group. You drag and drop a process group and say: well, this is my use case — "Demo Jam use case" — and that's it. Usually you create one process group per use case you want to deploy on your NiFi cluster, which is what we want to do anyway. In this case I create this process group, I can go inside and start designing my flow, which is what I'm going to do in a second. Let's start with something very simple — I won't show you all the design, because that's not what we want to talk about. Let's say I have a ListenHTTP processor; I'm going to receive some JSON data, and then I want to do some enrichment based on the IP address I receive in my data. So I add a GeoEnrichIPRecord processor: for every IP address it will associate a city, based on a database you can find on the internet that links an IP to a city, country and so on. I connect the success relationship, and you can keep designing your flow like that; it works well. Then, when you are happy with it, or maybe it's time for your break or the end of the day and you want to make sure this is saved so you can start from here tomorrow, or you want to make it available to another team working on the same or a similar use case, you right-click and you have a Version menu that allows you to start version control. You can also do it from the process group itself, which may be a bit easier.

Your NiFi in Public Cloud is directly configured to exchange with a NiFi Registry: by default, the NiFi Registry from the dev cluster is used by the NiFi cluster in dev. What I did is go into the settings of my NiFi cluster and, instead of using the NiFi Registry from the dev cluster, change it to use the endpoint of my production NiFi Registry instance. That's it, nothing else. (The same registry-client change can also be scripted, as sketched below.)
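As an aside, the registry-client change described above can also be done with the NiFi Toolkit CLI rather than the UI. This is a minimal sketch: the client name, URL and properties file are hypothetical, and option names may vary slightly between toolkit versions (check `cli.sh nifi help`).

```bash
# See which NiFi Registry clients the dev NiFi cluster currently knows about.
./bin/cli.sh nifi list-reg-clients -p dev-cli.properties

# Register the production NiFi Registry as a client of the dev NiFi cluster
# (hypothetical URL; in CDP Public Cloud this would be the prod Registry endpoint).
./bin/cli.sh nifi create-reg-client -p dev-cli.properties \
    --registryClientName "prod-registry" \
    --registryClientUrl  "https://prod-registry.example.com:18443"
```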
So now, instead of using the NiFi Registry created for me in the dev cluster, I'm using the one from the production cluster. OK: Version, Start version control. NiFi asks the NiFi Registry which buckets you can save your flow into. Buckets are just kinds of directories, if you want to organize flows by business unit, by functionality, by team, whatever; buckets are also where you can set permissions, so keep that in mind when you decide how to organize your flows in the NiFi Registry. In this case I'm just using the default bucket, the one created for you by default. I'm going to name my flow "Demo Jam use case", with the description "this is the flow from the Demo Jam", and this is the initial version with nothing much in it. That's it: here you can see that the flow is now versioned in the NiFi Registry, and what we have on the screen is the latest version.

Now if I go back into it and start making changes — say I configure some processors, for instance the port on which I want to listen for HTTP requests, and then I add a PublishKafkaRecord processor because I want to publish my data to Kafka — you can see that instead of the green logo we had before, there is a grey one saying that you have local changes and can commit a new version if you want. So let's say I want to commit a new version: you can right-click and select Commit local changes; you can also revert your local changes, or list them. For instance, here it shows that I configured the Listening Port property of my ListenHTTP processor, that I changed the value of the content-listener path on the same processor, and so on. You have the list of changes and you can revert them; in this case I commit my local changes, which creates version 2: "added PublishKafkaRecord processor". And that's it, you have a new version. Now if I go back to the NiFi Registry and refresh, you will see the new flow appear, if my connection is fast enough. Here it is: it's in the default bucket, and I can see I have two versions, the first created two minutes ago and the second created just now, along with my identity, so I know who made changes to the flows, and so on. So, my flow is versioned. (For reference, the same information can be pulled from the Registry with the toolkit CLI, as sketched below.)
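For reference, the buckets, flows and version history (version number, author, comment) shown in the Registry UI can also be listed with the NiFi Toolkit CLI. A minimal sketch, with hypothetical identifiers and a properties file pointing at the shared Registry:

```bash
# List the buckets in the Registry, then the flows in a bucket,
# then the version history of one flow.
./bin/cli.sh registry list-buckets -p registry-cli.properties
./bin/cli.sh registry list-flows -p registry-cli.properties \
    --bucketIdentifier b1b1b1b1-default-bucket
./bin/cli.sh registry list-flow-versions -p registry-cli.properties \
    --flowIdentifier f2f2f2f2-demo-jam-use-case
```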
That's great, and this is really cool, but think about everything you configure in your NiFi flow. For the source and destination processors, since you are most likely using processors that interface with external systems, the configuration you set is most likely going to change between the development cluster and the production cluster. That's really something to keep in mind: for instance, when you configure your Kafka processor you give it a list of Kafka brokers, and obviously that list changes between my dev cluster and my production cluster. Maybe you use a different topic name; if you use a service account, maybe the account and the associated password change from dev to production. You don't want to manage all of this directly in the processors. So, since roughly one or two years ago, we added the concept of parameters. Before I go into what parameters are, there is a poll question for you, to see what your experience with parameters in NiFi is.

And we're back again. Hi Tim. All right, new poll: do you leverage parameters in NiFi to ease flow promotion between environments? As Pierre said, parameters are fantastic: they integrate much better with the Registry, they are much more portable, much easier to use across the various contexts and process groups you're working with, you can import and export them a lot more easily, and you do not have to do any XML wrangling. I personally cannot be bothered with XML any more; I don't know about you, I'm just a bit sick of it. — Yeah, I agree. — Just waiting for results to come in as we chat away... there we go, time for some results. I'm seeing quite similar numbers to last time, which is interesting. They're still coming in, but at this point about a quarter of the audience has never used parameters, which makes sense because it is a recent-ish feature in NiFi — I think we introduced it around the 1.10 or 1.11 release. About a third of the audience is using parameters (excellent job keeping up with the features), and about 40% overall say no. So I think that's good adoption, considering the feature is not that well aged yet. — Yep, sounds good. Cool.

OK, let's go back to the demo, and since I see we are already half an hour in I will try to go as quickly as I can, because there's a lot of good stuff and I want to make sure we cover everything. Hopefully you should see my screen again. — Yep, we've got your screen back. — Cool, thanks. So instead of watching me create a flow, let's switch gears and go directly to the flow I created before, and let me just make some changes to it. Let's imagine that this is the first version of the flow that I consider production ready. What is the flow doing? It's really simple; let me stop it and go through it very quickly. I have a ListenHTTP processor, which is listening for HTTP requests. From my laptop, I'm going to send some data to this processor: some random JSON data. Let me start that right now, and we should hopefully see some data coming in. If I list the queue and look at the data, this is random JSON data with an id, a name, an email and an IP address. It could be logs from a website, whatever — just random data containing an IP address, which is really what we want to focus on today. Then there is this GeoEnrichIPRecord processor, which looks at the IP address, looks it up in a mapping database, and gives you a city for each IP. Let me start that as well. (For reference, sending this kind of test data to ListenHTTP can look like the curl command sketched below.)
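To make the ListenHTTP step concrete, here is a minimal sketch of how such a test record could be posted from a laptop. The hostname, port, base path and field values are assumptions for illustration, not the exact values used in the demo; ListenHTTP listens on whatever port and base path you configure on the processor.

```bash
# Post one hypothetical JSON record to a ListenHTTP processor.
curl -X POST "http://nifi-node-1.example.com:9876/contentListener" \
     -H "Content-Type: application/json" \
     -d '{"id": 42, "name": "Jane Doe", "email": "jane@example.com", "ip": "8.8.8.8"}'
```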
Now if I look at the data, there is a new field that came in, which is the city. I also changed the output from JSON to Avro, which is very useful if you want smaller datasets. Here we can see that for this IP address the city is Columbus; there are some IP addresses we cannot resolve, because I'm using a freely available online database which is not very accurate, so for some IPs I don't have a city, but that's fine. San Jose, Dorchester, Charlton, and so on — when I see all these cities I really want to travel again, but anyway. We have the cities, that's great, and then we can send the data to the Kafka processor; I start it, and the data is made available to downstream applications that will consume it from Kafka. So that's really what my flow is, and I feel it's good enough to go to production; this will be my first version. I'll commit it — in this case, because I did a lot of work before today, this is version 17, but we don't really care; let's say this is the first version ready to go to production.

Before we switch to the production cluster and I show you what it would look like doing things manually, let's look at this concept of parameters, because as we said it is really a key concept and you should use it even if you are not using the NiFi Registry. It is nicely integrated with the NiFi Registry, but if for some reason you don't want to use the Registry, please use parameters anyway. If we look at the configuration of the processors, we see properties whose value uses the #{parameter-name} format. In the GeoEnrichIPRecord processor I'm referencing a parameter for where my file is, this MaxMind database file. In the PublishKafkaRecord processor I have the brokers, the topic name, and the service account I'm using, and, very importantly, there is also a parameter used for a sensitive value, in this case the password associated with my user. This is one of the big differences from what we provided before parameters existed, what we called variables. Variables were a very limited feature, something you could only use where Expression Language is supported in NiFi, so it was limited in scope, and you couldn't use variables for sensitive properties. Parameters can be used everywhere, including for sensitive properties, so it's really powerful and very useful. I'm also using them in the controller services: if I look at all the controller services of my process group, for instance the Cloudera Schema Registry service where I store the schema of the data I'm dealing with, I'm using a parameter to reference where the Schema Registry is running, and here again my service account and the password associated with it. (Because of my really bad internet connection the properties appeared duplicated for a moment — that's just a UI issue, sorry about that.)
Well, that's nice, but you may ask: where are these parameters defined? If you want to define parameters, there is this concept of a parameter context. In the hamburger menu there is a Parameter Contexts entry where you can create one. A parameter context is really a set of key/value pairs that you use in your flow definition to externalize the parameters you want, and you can map a parameter context to one or many process groups, but not the other way around: a process group can have only one parameter context. This is something we are going to improve this year, but just so you know, right now you can attach only one parameter context to a process group, so basically you create a parameter context for each of your use cases. Then, if you go to the process group and open Configure, you can select the parameter context you want to attach to it; in this case I'm attaching my "geo IP enrich" parameter context.

Before I go to the production cluster, I'm going to clean up a little of what I did earlier. Let's stop for three or four minutes — Tim, Dan, do you see any interesting questions you want to answer right now?

Yeah, we've got a bunch of good stuff coming in. Quite a few people are not only in the live Q&A system, which is one of the links we sent out, but we're also restreaming on YouTube, LinkedIn, Twitter — the places people go. And we've got Tim back; hi Tim, how are you? There's a couple of good ones here I'll pick up while you're doing that. The first one is about the parameter context you just showed: where do we set different parameter values for different environments? Basically, parameter contexts are set in each NiFi cluster, so each environment has its own local collection of parameter contexts, but those can be pushed and pulled via the API. So you could store them elsewhere: you could have a configuration management database, or secret stores that you pull passwords and things out of, and then populate them into that particular NiFi environment. And obviously you may have parameters that change between dev and prod, for database names and IPs and all that sort of business, so you would push those in and out of the particular NiFi environment using the REST interface, or via the GUI like Pierre is showing you here; there are various ways to do that. We'll probably go into a bit more detail on that later in the demo, but if you want me to speak more about it, or you have a particular question around it, just pop it back into the question chat. (A CLI sketch of setting environment-specific parameter values follows below.)
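To make that push of environment-specific values concrete, here is a minimal sketch using the NiFi Toolkit CLI. The context ID, parameter names and values are hypothetical, and the exact command and option names can differ between toolkit versions, so check `cli.sh nifi help` in your release.

```bash
# Find the parameter context on the target cluster
# (connection details such as the base URL come from a CLI properties file).
./bin/cli.sh nifi list-param-contexts -p prod-cli.properties

# Set environment-specific values in that context (IDs and values are hypothetical).
./bin/cli.sh nifi set-param -p prod-cli.properties \
    --paramContextId 1a2b3c4d-example --paramName kafka.brokers \
    --paramValue "prod-broker-1:9092,prod-broker-2:9092"

# Sensitive values (passwords) are never exported with a versioned flow,
# so they have to be set on each environment, e.g. pulled from a secret store.
./bin/cli.sh nifi set-param -p prod-cli.properties \
    --paramContextId 1a2b3c4d-example --paramName service.password \
    --paramValue "${PROD_SERVICE_PASSWORD}"

# Whole contexts can also be moved between clusters as JSON with the
# export-param-context / import-param-context commands mentioned in the Q&A.
```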
The next one I'll grab — just marking that one as answered so we don't lose track — is from somebody who has been through the blog post Pierre mentioned. They've got different NiFi Registries in each of their environments, they've gone with that architecture, and they're wondering about the APIs around flow deployment: should they create a Python script to do that in their Jenkins pipeline? I have good news and I have better news about that. The good news is that I've already written the script for you; we'll probably go into it later in the session as well. One of my various hats is that I maintain the Python client for NiFi, which Pierre planned on talking about later, so I'm just jumping in a bit early. I contributed it and I maintain it, and there are all the Python methods you might need around this: there is a handy top-level module called versioning, another one for parameters, and there is even a script that walks through the process Pierre is doing right now — it's called the FDLC demo, and we'll have a look at that as well. So there are absolutely methods available for you to just lift and use to do all of that in Python. And if you're feeling fancy, the interface is actually a published Swagger spec, so you can choose your preferred language and make your own client; go to town. Go is pretty popular with the kids; I haven't had time to implement a Go client yet, but we're actually talking about it in the community, because there's a lot of work happening around Kubernetes operators and most of those are written in Go, so you should be able to find one of those as well. I've grabbed two — Tim, do you want to grab one, or Pierre, do you want to take things back?

Well, Tim, let's have one from you. — I'll pick an easy one. There was an interesting one about exporting parameter contexts when you want to move to a production instance. There's a little difference here between parameter contexts and parameters, because you'll set up the parameter contexts in the different environments and then you'll want to load your parameters, and as mentioned we can do that very easily with the NiFi command-line interface or with the REST API. I posted an article in several places that shows how easy that is: it's one or two steps, you get those files, and if you want to change them you change them and push them back. That is very easy to do.

That covers that. "Can you update parameters from data?" I don't think so — that might mean from NiFi data. — Technically you can, but they won't be updated by the time that piece of data finishes processing. In theory you could be a little bit fancy and have a Python script that executes inside NiFi, using ExecuteScript, that takes the content of the flowfile and then runs a parameter update command to update the parameters on the flow the data is flowing through. The problem is that the flow gets paused while the parameter context is updated and then restarted, so that flowfile would have moved past that point. But if your design is such that those flowfiles come into one process group — OK, now I'm being mental, Tim, I know you're going to laugh at me — if those flowfiles came into an initial process group where the data was analyzed and checked, and then you used the parameters you extracted from it in a second process group you passed them to, that would totally work. Not that I've done that, but that's the idea. — No, it's not a good idea, but it's funny that it would work. Does anyone else endorse that? I do not recommend that you treat NiFi like Lisp. — Yeah, I'd want to know about the use case before someone does that; it doesn't seem like a good idea, and there are many other ways to pass values around in NiFi, right? — Yeah.

There's one more quick question I'll grab before we go back: is there a path to migrate variables in existing data flows in prod to parameters? The answer is basically yes. Parameters are accessed in a similar way to variables: where variables are referenced with a dollar sign and curly braces, ${...}, parameters use a hash and curly braces, #{...}.
We stick with our standards around here. There are largely similar methods and approaches you can use for them, so it's pretty much drop-and-replace. I can't actually think of any cases offhand, Pierre, where it's not just drop-and-replace, except where parameters have additional features, like sensitive values. — Yep, correct, I agree. — So anything you can do with a variable you should be able to do with a parameter, and if I'm wrong you can hunt me down and I'll fix it.

Cool, let's go back to the demo. I will try to speed things up a little to make sure we cover all the great stuff we just talked about. So now I've moved to the production cluster — you can see it here, the NiFi cluster. I cleaned up everything I did before this session, just to start from something clean and show you what it's like to deploy a new flow for the first time, because that's where you will actually do most of the manual work: the first production deployment of the flow for a new use case. As we will see afterwards, it is really easy to automate deployments and updates to new versions of the flow, but the first deployment of a flow can be a bit more involved. So let's do it in the UI, and then we will talk about the CLI and about what Dan provides in Python, if you prefer Python, which is perfectly fine; Dan's repository provides a lot of out-of-the-box methods that do a lot of things for you, whereas with the CLI it can be a bit more challenging to automate some parts.

Anyway, in the UI, to deploy the flow for the first time we just drag and drop a process group, but instead of giving it a name we click Import. Before that, I want to mention something: as I said, when you have one NiFi Registry used for both dev and prod, you don't want to let anyone push changes from production to the Registry, because fixing things in production and committing from production is really bad — I know it happens, but it's bad. Let me create a process group named "this is bad, working in prod", and here, thanks to the authorizations I set in Apache Ranger on my NiFi Registry, if I try to start version control it is disabled, because I don't have write access to the Registry from the production cluster. That's just something to keep in mind, and I strongly recommend setting up something like that.

So instead, I just drag and drop a process group and click Import. Here I can see my NiFi Registry, the bucket, and then I can choose the flow and the version I want to check out. I said that this version is the one ready to go to production, so let's import it. When you import a process group for the first time — and this is really the first time this version of the flow is deployed in production — there are quite a few steps to perform. First, you want to update the parameters. If you look at the parameters here, you can see that the sensitive values have no value set. That's because the NiFi Registry doesn't store anything sensitive: we care about security, and we don't want any password to be stolen or to leak on the internet, so no sensitive property ever leaves NiFi.
So when you move your flow from one environment to another, whatever sensitive value you set in the development cluster does not travel to the production cluster. For all the other parameters, however, we still see the values from the dev cluster, so where it makes sense we want to update them. I don't want to update the listening port, so let me go quickly: I set my password here — it's a value that, once you apply it, you can't see any more, even if I edit it again; you only see it the first time you set it. Then I have my service account for production. The MaxMind file is in the same location, and that's also something to keep in mind when you move flows from one cluster to another: if your flow relies on external files — an hdfs-site.xml, an external file like the one used by the GeoEnrichIPRecord processor, keystores, truststores, things that are specific to your workflow — and you need to move them from one cluster to another, that's something you must do before deploying your flow. Any file that is a dependency of your flow is not managed by the NiFi Registry. We are looking at having bigger bundles of flows that would also include some of the dependencies in addition to the parameters, but that's a long-term roadmap item. Anyway, before this session I deployed the files my flow depends on to the NiFi nodes, at this location. Then the Kafka brokers: here I set my Kafka brokers for production. The Kafka topic is the same, and the Schema Registry is the same, because I'm using the single Schema Registry in the production cluster, so I just need the keystore password, which I set here. That's really the first step: updating the parameters to match what you need in this environment. Once it's done, it shows the affected processors, and everything looks good.

The next step is starting the controller services: you go to the configuration and start them. Again, everything I'm doing here can be automated, and you can do all of it using the CLI — we'll see that in a minute — or with the Python library that Dan maintains; all of this can be scripted in many ways, no issue with that. But when you deploy a new use case for the first time in production — and again, really only the first time — you probably want to be careful about automation. Automating the deployment and upgrade to new versions, which is what we are going to do right after this, is fine, but for the first deployment you probably want to do it more carefully. Once the controller services are started, you can start everything. I'll start my script that sends data, this time to the production cluster, and it's running. That's cool: my flow is now live in production. (The CLI equivalents of these first-deployment steps are sketched below.)
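For reference, here is a minimal sketch of what those first-deployment steps can look like with the NiFi Toolkit CLI instead of the UI. All identifiers are hypothetical placeholders, and the exact option names can vary between toolkit versions (see `cli.sh nifi help`).

```bash
# Import the versioned flow from the Registry onto the production canvas
# (bucket ID, flow ID and version number are hypothetical).
./bin/cli.sh nifi pg-import -p prod-cli.properties \
    --bucketIdentifier b1b1b1b1-default-bucket \
    --flowIdentifier  f2f2f2f2-demo-jam-use-case \
    --flowVersion 17

# Update environment-specific parameter values, including the sensitive ones
# that are never stored in the Registry (see the earlier set-param sketch).

# Enable the controller services of the imported process group, then start it.
./bin/cli.sh nifi pg-enable-services -p prod-cli.properties \
    --processGroupId a3a3a3a3-imported-pg
./bin/cli.sh nifi pg-start -p prod-cli.properties \
    --processGroupId a3a3a3a3-imported-pg
```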
Now let's go back to my dev cluster. If you remember, for some of the records I'm processing, the city field I'm computing is null, because I couldn't map the IP address to a city. Let's say someone from the downstream application tells us it would be better if the data you make available in Kafka only contains records that have a value for the city field. So I want to filter out the records where I didn't find a city; let me stop the flow and make an update. I add a QueryRecord processor, specify the record reader and writer, and then add a dynamic property, "city not null", whose value is a SQL query that does the filtering: SELECT * FROM FLOWFILE WHERE city IS NOT NULL. That's it; apply. I route the "city not null" relationship onwards and auto-terminate failure and original. I like it when the design is tidy, so let me line this up. Now, if I start sending some data and look at it, every city field should have a value — and indeed, I only have records where the city is set. OK, this is great; I consider this ready to go to production, so I commit my local changes: "added a QueryRecord to filter out null city values". And here is the trick: in the comment I say "this is prod ready". I save this, and now I go back to my production cluster, and thanks to the automation I put in the background, when my UI refreshes I should see some updates. (My dev cluster is complaining; let me refresh.) OK, this looks good, maybe some delays. Here we go: there is a new icon, this little icon here, saying that a new version of this flow is available and that you can update to it. Things are a bit slow, but in theory the automation in the background should be applying the update for me... maybe something is wrong, let me check very quickly. Ah, that's my bad — my automation is not working because I cleaned everything before the session. Let me show you what I did; let me change the screen I'm sharing, and I'll go as fast as possible because I realize we are already over time.

In the NiFi Registry there is a concept of hooks, which gives you the ability to trigger any action you want when something happens in the Registry. In this case, what I want is to trigger the update of my flow when I receive a "create flow version" event — meaning I created a new version of a flow — provided it is not the first version (because the first deployment of a flow is something I want to do manually) and provided the commit comment contains "prod ready". If the developer says in the comment that this is prod ready, I automate the deployment to production. Then I'm using the CLI — you can play with it, there are a lot of options, and as Dan said there is also the Python toolkit — and it updates the flow to the latest version. If you want, you can also set the values of any new parameters at that point; that's an option for you in case you want to retrieve the parameter values from, say, HashiCorp Vault or an external database: that's where you could automate retrieving the values and setting them on the parameters when a new version of the flow goes into production.
So: set the parameters, then start the controller services as I did manually before, then start the process group. My script is very simple. Dan is doing an amazing job with Python, where everything is ready for you to use, but I'm really bad at Python, so I wrote a very quick, ugly bash script that uses a mapping file where I say: this is my flow ID, and this is the process group associated with this flow in production. That mapping file is actually what I need to update, because since I started from scratch there is a new process group ID. So let me fix my mapping file — maybe you want to take two or three questions in the meantime, and then we'll show you the actual automation. (A sketch of that kind of hook script is below.)
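Here is a minimal sketch of the kind of Registry hook script described above, assuming the Registry's script-based event hook invokes it with the event type followed by the event fields. The argument order, the position of the comment, and all IDs, paths and option names are assumptions used to illustrate the idea — this is not the script from the demo, and you should check your Registry's hook documentation and `cli.sh nifi help` before reusing it.

```bash
#!/usr/bin/env bash
# Hypothetical NiFi Registry event hook: auto-promote a flow to production
# when a new version is committed with "prod ready" in the commit comment.
# Assumed arguments: EVENT_TYPE BUCKET_ID FLOW_ID VERSION AUTHOR COMMENT...
set -euo pipefail

EVENT_TYPE="$1"; FLOW_ID="$3"; VERSION="$4"; COMMENT="${*:6}"
CLI=/opt/nifi-toolkit/bin/cli.sh
PROPS=/opt/scripts/prod-cli.properties        # connection details for the prod NiFi
MAPPING=/opt/scripts/flow-to-pg-mapping.txt   # lines of "<flow-id> <prod-process-group-id>"

# Only react to new flow versions (not the first one) flagged as prod ready.
[ "$EVENT_TYPE" = "CREATE_FLOW_VERSION" ] || exit 0
[ "$VERSION" -gt 1 ] || exit 0
echo "$COMMENT" | grep -qi "prod ready" || exit 0

# Look up which production process group runs this flow.
PG_ID=$(awk -v f="$FLOW_ID" '$1 == f {print $2}' "$MAPPING")
[ -n "$PG_ID" ] || { echo "no prod process group mapped for flow $FLOW_ID"; exit 0; }

# Update the process group to the latest version, re-enable services, restart.
"$CLI" nifi pg-change-version  -p "$PROPS" --processGroupId "$PG_ID"
"$CLI" nifi pg-enable-services -p "$PROPS" --processGroupId "$PG_ID"
"$CLI" nifi pg-start           -p "$PROPS" --processGroupId "$PG_ID"
```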
Sounds legit. Yep, we've got questions, people are talking about scripting, it's all really good. Do you want to go first, Tim, or shall I? There's a bunch of them. I'll answer a couple of quick ones that aren't particularly related to anything else. Can you use a Schema Registry when NiFi and whatnot are built within a Docker container? The simple answer is yes. If you create, say, a Docker private network, you can attach multiple containers to it and they can address each other via hostnames on the internal Docker DNS service. We do this a lot for testing NiFi and Registry; they come as pre-built Docker containers, and if you look in the NiPyAPI repo that Pierre mentioned, I actually use that for all of my testing: there are Docker Compose files under the resources folder that generate both secure and insecure NiFi and Registry deployments, and you could add a Schema Registry container to that. We provide a Schema Registry, which we open sourced; the other popular one people use is the Confluent one, and there are others around as well. They're all supported, they all integrate, so use whichever is best for you, it should work. There was another question around Schema Registry that was a similar sort of thing — does it work, does it integrate — so I think I've answered that as well. Do you want to grab the next one, Tim?

Sure. And one other thing, for people who have very minimal schema needs: there is one you can use within NiFi without anything else, the Avro schema registry controller service. It is hard-coded; most people just use it for testing, but if you're very limited on runtime space and your schema is not changing you could use it — though I guess it probably will change. I'll go through a couple of quick ones here. There's one asking about NiFi processor execution — we probably answered this in another place — asking whether it's event based or time based. What's nice with NiFi is that you have the option of deciding: you can have a processor wait for something to happen, you can set a cron schedule, you can have it run every second, minute or hour, and all of that is configurable at each step in a flow, which is really nice. So if you want to run something every minute, you can; and if you want it to run when something hits it, like ListenHTTP, it doesn't make sense to have that on a timer — it really waits for the call to come in. Same thing if you're watching for, say, a file to change: you wait for the file to change. So for something like putting NiFi behind a load balancer that receives a ton of UDP traffic, to maximize availability and throughput, you don't want that running on a timer, you want it to scale with load, and we totally do that.

I think that's similar to another question here about how NiFi scales in general, and it does: it scales and responds to data within the limits of the infrastructure available to it. If you give it no hardware, it isn't going to do an awful lot; if you give it a ton of hardware, it will scale up and auto-balance itself. It will first commit all the data it has received to disk to guarantee it doesn't lose it, and then it will work through the processing DAGs you've given it — that's in a traditional cluster. And if you run it on Kubernetes or some auto-scaling service, like we and others do, then it will use whatever hardware it gets as pods are spun up, and away it goes. That's pretty handy as well. I have a couple more questions here, but it looks like you're almost ready to crack on.

Yeah, we're running out of time, so let's finish the demo. Let me do that very quickly: I just fixed my mapping file in the background — it really just links my flow ID to the process group ID where I deployed my flow — so that's fixed, sorry about that. I'm going to commit again from the dev cluster, and this time it really is prod ready; let's assume I did something wrong before. This is properly saved, so it creates a new version, and we can see the green logo here. And if I go to production, while data is flowing in, this is going to update the flow. That's something very important to know about the NiFi Registry: we just updated everything without actually removing any data; data is still going through my flow, we moved to the new version of the flow automatically, and it's running, everything is up to date. If the update from one version to another requires removing a connection that still holds data, the update will fail, because by default NiFi ensures there is no data loss; the automatic update through the CLI will fail if the update requires removing data. In that case, what you would do is probably stop some of the source processors, let the data drain out, or stop the parts of the flow where you made the critical changes, and then try the update again. That's where some manual care is required. But other than that, as you can see, with ten or twenty lines of very poorly written bash you can automate all of this. I won't go into more detail; I'll stop sharing. I don't know if there are other questions we want to answer — Dan, Tim, I don't have access to the questions, so if you see some, let's go.

There's a question here about dealing with the scope of parameters. As you showed earlier, parameter contexts, which are collections of parameters, are scoped to a process group, and I believe they are still inheritable, aren't they? You assign one to a process group — do the children of that process group also get to use those parameters, or do you have to set it each time? Remind me. — Right now you have to set it each time, on each process group. What we are going to work on this year is the concept of composite parameter contexts, where you can mix multiple parameter contexts attached to your process group, and we'll also look at the inheritance of parameters
through a jar key of process group so that's that's something coming that's here yeah that makes sense to me because i was remembering that variable registry is inherited and it used to cause a lot of confusion amongst users where they weren't sure actually where that variable was set within their stack of nested contexts so yeah i think that'll be good so uh and since then the scope is set to the process group and if you want to change the scope you just move it to a different process script i know it goes so i think that answers that question um i think that is actually most of the questions i was collecting some more from the uh the other live streaming channels like youtube and that sort of thing but i've answered a lot of those in place one of them was just a generic question about when should you use a funnel and uh that's generally useful to people so um in nifi most of the time you would uh set a parameter or a variable inside the configuration of a processor that allows it to address you know multiple things like lots of different ports a lot of different directories but there's quite a few services that cannot do that maybe because the underlying library that we use is compiled at execution time and so it can't dynamically update properties a good example of this is like a flume collection service where you have to hard code the port most of the time so if you had to run five of those in parallel then you might have five feeds coming out each with very similar data but coming from different sources and so you shove those in a funnel and then they'll come down to a common stream which is coming from five different sources of flume in that case but it's all going to be treated as the same data because we can simply use the source from flume as a parameter further down the flow and not have duplicate uh parallel threads which is really great for handling your back pressure and your data context and control around that so yeah it's a handy little thing only when you need it and i think that's um that is all the main questions that i have got here did you pick up any others tim with a monitoring question so i just want to take this opportunity to say that the next uh live demo i will be doing so probably uh in a month or two will be focused on knife monitoring uh so to the question can we have uh some performance metrics about each processor yes we can uh there is different ways to extract these metrics and make them available in dashboards and stuff like that or be alerted if a processor is not processing uh performing uh efficiently or as expected uh there are ways to do that uh mainly using the reporting tasks um but also also other ways so to the answer uh yes um everything you can access through the ui is available through the ui and you also have a lot of data through the reporting test yeah cool um so we are unfortunately over so hopefully people who can stick around thank you very much and hopefully the extra information was useful um we will go into some more topics the next time we do this but i think back to you pierre to wrap up yes uh so let me just share my screen again very quickly to wrap up that sure let me present again here so uh yeah so uh so this slide i mean or maybe yeah actually no i'm not sharing my screen never mind i got it uh so yeah so uh you have uh some resources available uh on the screen where you are right now uh for uh uh everyone uh still here with us and thank you for staying over time uh we have uh some content for you in case you want to know 
more about Cloudera Edge Management, so everything around MiNiFi and the edge flow management capabilities, and we also have some content about flow management, so if you want to look at that, please do. Regarding what we discussed today, I also invite you to look at the GitHub repository Dan maintains: the Python client is really amazing, it automates a lot of things for you and makes your life much easier, and Dan is very keen to improve it, so if you have any feedback please feel free to file pull requests or issues and get in touch with him; I'm sure we will make it even better. Thank you for staying with us, hopefully this was interesting, and see you soon. Thanks guys. Thanks everyone, see you on Slack.
Info
Channel: Cloudera, Inc.
Views: 2,377
Keywords: hadoop, hadoop tutorial, data warehouse, data processing, apache hadoop, hbase, cloudera, hadoop training
Id: XYHMExiWM6k
Length: 76min 58sec (4618 seconds)
Published: Thu Jan 21 2021