Data Engineer's Lunch #29: Introduction to Apache NiFi

Video Statistics and Information

Captions
Hi everyone, welcome to Data Engineer's Lunch #29. Today's topic is an introduction to Apache NiFi. I'll be the speaker today; I'm an engineer here at Anant. Our current organizers for this event are Rahul, myself, and Josh, but if you're interested in helping the community, or in finding speakers or sponsors for this event, you're more than welcome to. If you have a topic you'd potentially like to talk about, you can reach out to Rahul, myself, or Josh and we can get you slotted into the queue. We have a couple of speakers on the call, like Will, who have given presentations for Data Engineer's Lunch before.

We are part of Data Community DC, a diverse and inclusive organization that brings people from all walks of life together to learn and practice data. Data Community DC is made up of a number of different groups: Data Wranglers DC, which is where we host this event, plus others like DataViz and Full Stack Data Science. If you're interested in finding other events in Data Community DC, not only ours but anything else held across the organization, you can find out more at datacommunitydc.org.

So what do we cover here? Pretty much anything data engineering: ETL, data processing, data cleaning, machine learning, and the general processes people don't necessarily talk about when it comes to data. If we pick up data from somewhere, what are the steps that take place before it gets used by the end user? A lot of the topics we cover are simple tools you can use to substantiate and improve your data pipelines and data processes. And again, if you're interested in talking about something, you're more than welcome to; just drop an email to Rahul, myself, or Josh.

This is generally a good stopping point for any new members who would like to say hi, introduce themselves, and mention what they work with in data engineering and what they hope to get out of this event. You're more than welcome to say hi, but I won't pull any teeth. People can also introduce themselves in the chat if they're more comfortable with that.

Group rules: this is a community, so if you're confused about something or would like more clarity, don't be afraid to ask questions, and be polite and courteous. Depending on the speaker, they may have their own rules about how they'll take questions or comments, and they'll generally go over that once they take over. And because this is a community, please do share what you know with us. If something wasn't touched on and you can provide further insight, feel free to drop it in the chat or just speak up and add to the conversation.

At Anant we help design, build, and manage global data and analytics platforms built around technologies like Cassandra, Spark, and Kafka, so data engineering is a daily grind for us. DataStax is a partner and a sponsor, and GW University is also a sponsor; they provide venue space when we do these events in person. As the U.S. starts to open up again we might start exploring in-person events, but keeping everyone's safety is still the number one priority.
We have some institutional and organizational sponsors as well. This is a stopping point for any announcements: anyone looking for jobs, anyone offering or hiring, any meetups, hackathons, or conferences you would like to plug. If Mike's here, he usually has an event he's putting on or promoting. If you don't want to speak up, you can drop something in the chat and we'll get to it.

"Hi, the company I work for is looking for a Java developer. If you want to know more, say so in the chat; that's all I can say right now." Great, thanks Dan. If you're interested, just shoot Dan a message in the chat. Anant is also hiring, full-time and part-time positions around data platform operators, engineers, and architects; if you're interested, you can find out more at careers.anant.us.

Some upcoming events: for Data Engineer's Lunch we're wrapping up the end of July and finalizing topics for next month, so we'll have those updated soon. We also have a sister group, Cassandra Lunch, on Wednesdays at the same time. If you're interested in Cassandra, we have upcoming topics on Spark SQL, Parquet tables, and DSEFS, and on how to use secondary indexes in Cassandra. And now we'll move on to the topic for today.

For questions or comments I have no preference: if you have something to ask, go ahead and speak up, or drop it in the chat and we'll get to it at a good stopping point. Today's topic is an introduction to Apache NiFi and how we can use it for data engineering. As a baseline, NiFi can do a lot of things, so today is mainly an introduction to some core concepts plus building a very simple data flow that we could later expand into other aspects of data engineering. It's meant to get you introduced to NiFi if you've never used it before, and to show how you can play around with it and test flows as you go.

So what is Apache NiFi? It was built to automate the flow of data between systems, and it supports powerful and scalable directed graphs of data routing; we'll take a look at that once we get to the demo. It has a web-based user interface, which is one of the nice things about it: you can essentially build your entire data pipeline in the UI without writing any actual code, at least for setting up the data flows, and it's highly configurable. Data provenance is another big feature. You can track your data from beginning to end, and as you move from step to step you have queues, so if you want to stop something and drill down, you can physically look at the flow file, the specific record being transformed or whatever data you're working with, and see what's happening to it along the way. NiFi is also designed for extension: you can build your own processors, and it enables rapid development and effective testing. There's already a ton of processors in the NiFi ecosystem, which we'll see in the demo. And it's secure: you can add SSL, SSH, HTTPS, and so on, with multi-tenant authorization and internal authorization and policy management.
NiFi is a very useful tool if you have the right use case for it. Sometimes other tools might be a better fit, but we'll take a look and you can decide whether to use it in your own data engineering processes.

Some core concepts we'll discuss, or at least mention, during the demo. A flow file represents each object moving through the system, and for each one NiFi keeps track of its key/value pair attribute strings; we'll see in the demo how one large flow file can turn into, say, a thousand split flow files. A flow file processor is the thing that actually performs the work, doing some form of transformation, cleaning, and so on. Processors are connected to each other with what is essentially a drag-and-drop arrow, and you can set relationships between them, for example on success move on, on failure retry; we'll talk a little about that in the demo. The flow controller maintains the knowledge of how processes connect and manages threads; we won't talk much about controllers, and will mainly focus on flow files, processors, and process groups. A process group isn't strictly necessary for what we're doing, but it's a way to create, so to speak, microservices within your data pipeline: specific process groups for specific tasks, connected together in some manner.

I'll pause here to see if anyone has questions, and then we'll move on to the demo. "Looking forward to seeing the demo."

All right. For the demo, everything is set up via Gitpod so that it can all be done in your browser; you don't have to download anything to your local machine and deal with potential OS inconsistencies. Essentially, we're going to start Apache NiFi, create a process group, and then create a flow that reads a CSV from a local directory, picking it up as a whole flow file with a thousand rows or so. One thing about NiFi is that its architecture runs on the JVM, so if you don't have enough memory there can be errors, and that might factor into how you use it. NiFi can also scale: you can create clusters with primary nodes and so on, and set those configurations within your process groups or within your processors. With the flow file we pick up from the CSV, we're going to split it so that each row becomes its own flow file, then convert those individual flow files into JSON format. At that point we could extend the flow: we'd have a thousand flow files in JSON format, and there are processors in NiFi that let you send items to Kafka, so if you have a Kafka topic and want to push these records to it, that's a possibility, and you could continue on in your data pipeline from there. We'll also show one more piece of functionality, how to export and import templates, so if you're using similar or configurable process groups, or just flows in general, you can export and import them to move them around or hand them to other people.
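As a rough mental model of what this flow will do (not NiFi code, just plain Python, with a hypothetical file path and column names), the whole CSV-to-JSON pipeline boils down to something like this:

```python
# Rough, non-NiFi sketch of the demo flow: read a CSV with a header row and
# emit one JSON document per data row, the way the flow produces one flow file
# per record. The path and column names are made-up stand-ins.
import csv
import json

def csv_to_json_records(path):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)      # the header row supplies the JSON keys
        for row in reader:              # each row becomes its own "flow file"
            yield json.dumps(row)

if __name__ == "__main__":
    for record in csv_to_json_records("csv/employees.csv"):   # hypothetical file
        print(record)   # e.g. {"job_title": "Biologist", "employee_id": "1"}
```

The point of the demo is that NiFi gives you the same result with drag-and-drop processors, plus queues, provenance, and retry relationships around every step.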
And that will just be a basic introduction to how we can use NiFi; we can always expand on it at a later time.

I already have Gitpod running and have downloaded NiFi onto the container, so we'll go ahead and start NiFi. (Oh no, that's not the one I want; let me delete this real quick, apologies.) As I was saying, there are a couple of different ways to run NiFi, in the background, in the foreground, and so on, and we'll just run it in the background. To do that we run bin/nifi.sh start, which runs NiFi in the background, and the UI becomes accessible shortly after. One thing to note, and it may just be an issue with Gitpod that you won't have locally, is that luckily NiFi gives you a redirect hint once the page loads, so just give it a second. If you want to check on it, you can run the same command with status instead of start, so if you started it in the background you can always check on it. We'll go ahead and open port 8080, and as I mentioned it will say "did you mean /nifi", so on your local machine it would typically be localhost:8080/nifi, depending on whether you're running locally or on a hosted cluster.
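NiFi can take a little while to come up, so rather than refreshing the browser, a small script can poll the UI until it responds. This is just a sketch, and it assumes the default unsecured HTTP port 8080:

```python
# Poll the NiFi UI until it answers, assuming the default http://localhost:8080/nifi.
# Complements `bin/nifi.sh status`; NiFi can take a minute or two to start.
import time
import urllib.error
import urllib.request

def wait_for_nifi(url="http://localhost:8080/nifi", timeout_s=300):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as resp:
                if resp.status == 200:
                    print("NiFi UI is up at " + url)
                    return True
        except (urllib.error.URLError, ConnectionError):
            pass                      # still starting up; try again shortly
        time.sleep(5)
    return False

if __name__ == "__main__":
    wait_for_nifi()
```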
Starting off, this is just the UI. There's some standard stuff you can click through; I won't go into every single item, since this is just an introduction and we want to get started quickly. Looking at the directory structure, we have a csv directory with a CSV inside it. If we open the CSV, it's about a thousand rows plus the header line, so 1,001 lines total.

Back in NiFi, the first thing we want to do is create a process group. You hover over the toolbar, look for Process Group, and drag and drop it onto the canvas. We'll name this one "CSV to JSON" and hit Add. If we drill down into the process group, we've gone one layer deep, and if we want to go back to the top-level flow we just click up here. If you think about an expanded data pipeline, you might have multiple process groups connected to each other; you can drag and drop to connect the output of one process group into another and continue the pipeline. For now we'll just focus on this initial one. You can also navigate around: if you have a very large pipeline or flow you can zoom out, zoom in, center, and so on. If you click out onto the canvas you can start the entire operation; that applies to the process group, and if you go up one level it applies to the entire flow.

Inside CSV to JSON, the first thing we want to do is get that CSV file, so we'll create a processor, because, remember, processors do the tasks. As I mentioned, NiFi has a large ecosystem, so if there are certain tools you already work with, you can search for them. Say I just search for "kafka": it shows all the different things we could do, like consuming records from Kafka or publishing Kafka records, et cetera. As an extension of where we're going with this demo, we could take the JSON flow files we produce and put them into a Kafka topic. But what we'll mainly be working with is just CSV and JSON.

So the first processor we need is GetFile. What it does is scan a local directory, depending on the parameters we give it, and pull files in. When we open it, there are configurations to set, and, as I mentioned, relationships. A nice thing NiFi does is show a little caution symbol that tells you, before this can be turned on, these things have to be addressed. Under Scheduling, if you don't want it to run every zero seconds, you can have it run every five minutes or so, which is about 300 seconds. Properties is where the actual configuration goes; again, with NiFi you don't have to do any "real" coding, since most of it can be done directly in the user interface. The input directory is just the path to the csv directory, so we paste that in and hit OK. You can also set file filters if you have multiple files; since we're only working with one, we'll set the batch size to 1. One thing to know when working with the GetFile processor: if you want to keep the source file, you need to set Keep Source File to true, otherwise when it runs that CSV is going to be gone, which is a pain for testing. I had to learn that the hard way. Say you have a process where your analytics team or business analyst team exports a CSV of some data and drops it into this directory; this data flow could look for new files every X seconds and ingest them. With Keep Source File set to false, it ingests the file completely and removes it; if you want to keep the files, you can, but then keep in mind you can also set minimum file ages and maximum file sizes. That's essentially all the configuration we have to do for this processor.
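To make the Keep Source File behavior concrete, here is a rough plain-Python equivalent of what a polling GetFile does. It is purely illustrative, not NiFi code, and the directory, pattern, and interval are placeholders:

```python
# Toy stand-in for GetFile's polling behavior. With keep_source=False the
# picked-up CSV disappears from the directory after ingest, which is the
# testing surprise mentioned above.
import glob
import os
import time

def poll_directory(input_dir, keep_source=True, batch_size=1, interval_s=300):
    while True:
        for path in glob.glob(os.path.join(input_dir, "*.csv"))[:batch_size]:
            with open(path, newline="") as f:
                data = f.read()                      # "ingest" the file
            print("picked up %s (%d bytes)" % (path, len(data)))
            if not keep_source:
                os.remove(path)                      # source file is gone
        time.sleep(interval_s)                       # roughly the run schedule
```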
But if we want to test this, we can add a Wait processor. We're not actually doing anything with the Wait; it's just there so we can test the GetFile functionality. For the relationship we'll choose success, and once we do that, the caution symbol on GetFile goes away. Now we see a queue between them. We can start just this GetFile, and with NiFi you can start and stop any processors in the line individually, so if there's something in the queue and you only want to start downstream items, you can do that. We see a success and one item queued. Clicking on the connection shows some information about it, but if we want to see what's actually in the queue, we right-click it and choose List Queue. In the list queue, if we hit the little "i" we can download what's in the flow file, and if we hit View we can see its contents right from the UI, and we see it has picked up all 1,001 rows.

So now, what can we do to split this, and then take those split records and convert them into JSON? One thing to know: if you want to delete something like a processor, you first need to clear the queue before you can delete the relationship between them. So we'll empty the queue, stop GetFile for now, delete the relationship, and then delete the Wait processor.

Next we drag and drop another processor, SplitText. SplitText takes a text file and splits it into smaller text files, so, as I mentioned, it will take the 1,000-row CSV and split it into a thousand one-row text files. For the relationship from GetFile we use success, but notice that SplitText needs its failure and original relationships terminated or routed somewhere. This is where we bring the Wait processor back so we can see the output of SplitText: we connect them on the splits relationship, because that's where the actual split values come out; we'll take a look at that in a second. To further configure SplitText and clear the current caution symbol: splits is already covered, so we terminate on failure and terminate on original. Then, under Properties, we set Line Split Count to 1, and because we have a header we set Header Line Count to 1 as well. That's pretty much everything we need for SplitText.

Now, if we want to start the whole thing, we click out on the canvas so that no individual processor is selected; if I select one, you can see the controls change to that processor, and if I click out they change back to the process group. When I hit Run, the entire flow starts: the success happens, the queue gets picked up by SplitText, and now we see a thousand splits. If we list that queue we see 100 at a time, but we know there are a thousand from the queue count. If we click the "i" again and View, we can see we've kept the header, with the individual record as line two. If we close this one (it's a Biologist) and click on number two (a Physician), we know the data is getting broken up the way we want. The reason we're keeping the header is that JSON needs key/value pairs; we want to assign the job title to the first value, the employee ID to the second value, and so on. You could also do this another way, setting attributes with a different processor, but this version works just fine. Because NiFi has a ton of processors in its ecosystem, there are different ways to do the same thing, so it's whatever floats your boat or whatever is optimized for your pipeline. Now that we know we can get a thousand split records, I'll go ahead and stop this and clear the queue.
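For a toy illustration of what SplitText does here with Line Split Count = 1 and Header Line Count = 1 (again plain Python, with made-up column names), each split keeps the header as line one and a single record as line two:

```python
# Non-NiFi illustration of SplitText with a retained header line: every chunk
# carries the header plus one data row, so a record reader downstream can pair
# keys with values.
def split_with_header(text, lines_per_split=1, header_lines=1):
    lines = text.splitlines()
    header, body = lines[:header_lines], lines[header_lines:]
    return ["\n".join(header + body[i:i + lines_per_split])
            for i in range(0, len(body), lines_per_split)]

sample = "job_title,employee_id\nBiologist,1\nPhysician,2"   # hypothetical rows
for chunk in split_with_header(sample):
    print(chunk + "\n---")
```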
Now we're going to add a ConvertRecord processor. We drop it in, connect the SplitText output to it on the splits relationship, because that's what we want to pass in, hit Add, and then connect ConvertRecord back onto the Wait processor for success. In ConvertRecord's configuration we set failure to terminate so the warning goes away. With this kind of processor there are also things called record readers and writers. If we look at the properties, there are no controller services yet, so we create new ones: for the record reader we want something like the CSVReader (there are a bunch of different options, but we'll use CSVReader and create it), and for the record writer we'll use the JSON record set writer, then hit Apply. I'll create both first, and then we'll look at some configuration we need to set.

You also need to enable these readers and writers. If you click the little arrow it takes you to the Controller Services tab, and you can see the lightning icon is currently off, meaning the service is disabled. Before enabling the CSVReader, we'll edit its configuration. Instead of "Infer Schema" as the Schema Access Strategy, we're going to use "Use String Fields From Header", because the header is included in each of our single-record flow files; that way it just reads top-down, pairing line one with line two. There are other options you can set as well: if you're working with TSV or some other separated-value text file, you can adjust the value separator, and so on. We also want to set "Treat First Line as Header" to true, so the header line itself is not treated as data. That's pretty much all the configuration for the CSVReader, so we hit Apply and turn it on. For the enable scope you can choose service only, or service and referencing components; in a much more complex data flow you'd have more configuration and adjustments around what can be used where, but we'll just do service only, since that's all we're working with, and now it's enabled. For the JSON record set writer we'll look at the configuration, but we don't need to change anything: there are options for writing the schema, the schema access strategy, pretty-printing the JSON if needed, and other items like the output grouping, where you could choose an array, one object per line, et cetera. I'll just go ahead and turn it on.

Now that both of these services are on, we'll rerun the entire flow. Ideally we should see a thousand flow files that are now in JSON format. Our file gets picked up, and we should see a thousand splits once it happens. There we go; it happened almost simultaneously, from one here to a thousand here and a thousand there. If we look at this queue, open it up, and view line one, we now see the records in JSON format.
If we want to see it as prettier, prettified JSON, we can view it as-is here, and there are other options; if you want to flatten arrays, for instance, you can potentially do that as well. So what we've done is take a thousand-row CSV with a header and convert it into individual JSON-formatted records that we could then move into a SQL database, a NoSQL database, Kafka, or some other tool that feeds back into your pipeline. NiFi can be used as an end-to-end tool, but it can also sit between different processing steps; if there are places where you don't want to write a whole lot of code just to do some kind of data transformation, NiFi might be an option.

That wraps up the introduction demo to NiFi. As I mentioned, we can expand on this later, for example putting the records into Kafka and having something do work with them, or putting them into Kafka and pulling them back out. Actually, sorry, before we open the floor I should also show you how to export and import templates; I forgot about that. To do this we multi-select the items and choose Create Template; we'll call it "CSV to JSON" and hit Create. Then, if we go up a level, zoom out a bit, and move things over, we can go to Templates in the toolbar, drag and drop, and add it, so now we've configured something we can drag, drop, and move around. If we want to download it, we go to the Templates menu (there it is) and download it. To add it back, we remove it from Templates, then use the little upload arrow in the UI: hit the browse button and it takes you into your local directories so you can upload it. And that will wrap up the demo.

Mike has a question: if you have a CSV with some records with incomplete data, how can we ignore those records? I believe there are methods; it might be in the reader service itself, where I believe you can ignore null rows... null strings... nope. I'm pretty sure I remember seeing it, but I don't remember exactly where; again, the ecosystem is pretty expansive and I don't know every single processor by heart. It's probably possible, honestly, because you could just look for a particular regex and route on that; I don't think that wouldn't be supported functionality. You might just need to look into it, Mike, apologies.

Next question: "I've been over here multitasking, so I hope my question isn't dumb if I missed something you already covered, but let's say you run into some kind of limitation with NiFi, a missing connector for example, and you need it to execute some custom code. Does it have the capability to execute a script, or would that be something you'd have to do outside of NiFi and then point NiFi at the results of whatever that script was doing?" Yeah, it looks like there's a processor that can execute a script. If we click on it we can see what options there are: there's a Script Engine property, so different languages, including Python. There might be some limitations on what you can use for your script, but Python is a pretty prevalent data engineering language, and you might just need a script file that you give it the path to. So it looks like you might be able to do that using this ExecuteScript processor, and there are other potential processors too. As I mentioned, there are a ton of processors in the ecosystem, so if you're looking for a very specific tool you can just search for it here, or search their docs, because every processor is documented there with more information.
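For a flavor of what an ExecuteScript body can look like with the Python (Jython) engine, here is a hedged sketch following the commonly used session and StreamCallback pattern; the transformation and the "job_title" field are made up for illustration:

```python
# Hedged sketch of an ExecuteScript body ("python"/Jython engine). The
# `session` variable and REL_SUCCESS relationship are bound by the processor;
# the uppercasing of a made-up "job_title" field is only a placeholder.
import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class UppercaseTitle(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        record = json.loads(text)
        record["job_title"] = record.get("job_title", "").upper()
        outputStream.write(bytearray(json.dumps(record).encode("utf-8")))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, UppercaseTitle())
    session.transfer(flowFile, REL_SUCCESS)
```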
"OK, and my second question. I work a lot with SSIS, so I'm kind of looking at this through that lens. Those little connectors that connect the components: can you assign expressions to them, so you don't execute the next task unless the first one completed, or if it failed then do something else, or maybe even check the output value from a previous component and then decide what you want to do in your next step? Do you have those kinds of capabilities?" I believe so, though I haven't built anything like that. "In SSIS, for example, you have your little connectors, I think they call them precedence constraints, and you right-click them and get some options there." Let's see, that might be something I'd have to look into. I don't see why not; there may be some method for it, but the relationships themselves are relatively simple in terms of what to do on an outcome, so on failure, for example, I can say redo it. I can't speak to that question without having had a deeper look yet. "OK. What about variables, can you set variables?" I believe so; I think they're set within the... yeah. "OK, all right."

"OK, so the reason I wanted to come to this presentation: I picked up a data engineering book. I think probably most of us have been doing data engineering for years, they just weren't calling it data engineering, but that book mentioned Apache NiFi, or however you pronounce it, and when I saw this pop up I said, OK, I'll check this out rather than try to read through a book. Part of my motivation is to expand beyond the enterprise SSIS kind of tools into more open-source tooling, with programs like this and scripting with Python, that kind of thing." Yeah, and if you're interested in learning NiFi for yourself and creating some sort of demo, we could do that and you could also present it here. "Sure." That would be a cool demo to see as well.

"I'd love to hear a little bit more about the computational requirements, because NiFi is doing the processing itself, right?" Yeah. "Which is different from a lot of other data flow orchestration systems, like Airflow or some of the others, where the processing happens in some other system. Can you elaborate a little more on what a typical hardware setup would look like?" Yeah, they have a pretty decent architecture diagram of how it works.
They do mention operating within a cluster, and I have run into some resource allocation issues myself, where I wasn't able to process as much data as I wanted, though that may have been due to the constraints of the Gitpod container, which is why I scaled this demo down. The docs mention it in terms of scale-out; let me find it so I don't spew the wrong information. Yeah, right here: for I/O, for CPU, and for RAM. Because NiFi lives in the JVM it's limited by the memory space, so garbage collection becomes important. Depending on the amount of processing you're doing in your job, hardware might come into play: you might need to play with the RAM configuration, or with CPU allocation or CPU monitoring, based on whether you're overclocking or underclocking.

"That makes sense. And from what you've seen, when NiFi runs into hardware limitations for a particular job, how graceful is the degradation? It doesn't just stop, it doesn't just die? It's always good when it doesn't fall over." It had little flags. If I go back to the UI, a little red box showed up here with cautions, and it showed logs of what was happening, so it would tell you the source of the error and the reason for it. I think it kept trying to move on, but the resources just weren't enough and the queue didn't get filled, so it kind of hard-stopped on itself because it couldn't process. "That makes sense." That was one thing I noticed: sometimes if you overload something it'll just crash and die and you have to restart and reconfigure everything, but luckily NiFi didn't do that. That may have just happened once for me, so I can't speak to every instance or every use case of overloading NiFi. Then again, that's where clusters potentially come into play: in the configuration you can set things to run on certain nodes. For example, for this GetFile, if we had a cluster with Concurrent Tasks set to 1 and execution on all nodes, I believe it would run GetFile on each node, so with three nodes you'd get GetFile, GetFile, GetFile; if you change it to run on the primary node only, I think it would only run once, and you could also adjust concurrent tasks to account for that. Depending on your data flow you might actually need concurrent tasks, but that's something to research.

"That makes sense. And as far as use cases go, the example in the demo takes a file and moves it, serializes it, which is a good example. What sort of destinations would you want to use NiFi for? It looks like you've got a nice list over there." Yeah, so you can convert the JSON to SQL and then potentially put it into a SQL database; there are HDFS-related processors; I think there are some Amazon connectors, DynamoDB, Kinesis ("put" essentially just means write, so if you see Put it means write something); there's also AWS and Azure; and I know there are Cassandra-related ones as well, which I'll probably do a topic on, extending this to take what we have and put it into Cassandra, or make it a little bit better. There are options for your ingress and egress, your sources and destinations, and from what I've seen there's a fair amount of databases and other tools. I don't know if there was an option for Spark, but Kafka is one, maybe RabbitMQ, yep, see, probably Flink, no, Splunk, yes. So you could potentially use this for monitoring: if you need to put stuff like error logs into Splunk, NiFi might be a way to pick up your logs from some kind of service and dump them into Splunk, one way potentially. And you can literally just go on their docs and search; they have a filter, so if I do the same filter we see the same exact processors.
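To make the Kafka extension that keeps coming up a bit more concrete: inside NiFi you would reach for one of the PublishKafka processors, but outside NiFi the equivalent step of pushing those per-row JSON records to a topic might look roughly like this with the kafka-python client. The broker address, topic name, and record are all made up:

```python
# Rough, non-NiFi sketch of the "send the JSON records to Kafka" extension,
# using the kafka-python client. Broker, topic, and record values are
# placeholders; inside NiFi a PublishKafka-style processor does this instead.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("employees", {"job_title": "Biologist", "employee_id": "1"})
producer.flush()
```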
"That is very cool. So with, say, getting a CSV file from an external SFTP server and then putting it into Cassandra, you could do that entirely within the NiFi web browser?" Yeah, granted you would need to be running Cassandra, but yeah. Topic for another day. "Yeah, I guess in that case it sounds like getting NiFi set up would take longer than actually building the flow for that use case." Probably, if you know what you're doing.

"I have one more question. How does the logging work, just kind of out of the box? You set up your flows, your packages, and you have them executing; where is that execution info being saved, and how is it accessed?" I believe there are repositories here. Certain things get updated in the content repository, which I think is related to the processors, the actual bytes of a given flow file; that's where the RAM issue kind of comes into play, because if you have 300,000 flow files and you don't have the RAM for it, that might become an issue. If you want to see individual runs... I want the status history, no, not that one, I know it's somewhere... there's the flow configuration history, and I think the one you want is Data Provenance, where you can see what's happening to your data; the actual bytes would get stored wherever NiFi is running locally. "OK, and whether the process completed or failed, that would be something I'd expect to see in a file somewhere: execution started at this time, it ended at this time, this is how many records came through the pipeline, it succeeded or failed, that kind of information?" Yeah, I'm trying not to just read through the docs; I don't remember exactly where, but I remember seeing something similar somewhere in this flow file repository. If we look at the flow file repository, it's probably saving them as... OK, there's this journal file, and I don't want to break this yet, so I would need to look further into where the logs are getting stored; that might be something you can configure. "Hello, if you go back to the previous tab, I think I can see a log directory; it should be there." OK, yeah, there we go: "successfully checkpointed", so I guess it's based on checkpoints, and it looks like it's potentially creating new logs; this one is from 16:59, so, oh yeah, see, it's happening in real time. So for your purposes you might just need to take a look at how it's doing that and see if it applies. "Yeah, thanks." Yep.
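If you just want to keep an eye on those logs from a terminal while a flow runs, a tiny follow script works; this sketch assumes the default logback setup that writes logs/nifi-app.log under the NiFi install directory:

```python
# Follow NiFi's application log and surface warnings/errors as they appear.
# Assumes the default logs/nifi-app.log location relative to the NiFi home.
import time

def follow(path="logs/nifi-app.log", keywords=("ERROR", "WARN")):
    with open(path) as f:
        f.seek(0, 2)                          # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1.0)               # nothing new yet
                continue
            if any(k in line for k in keywords):
                print(line.rstrip())

if __name__ == "__main__":
    follow()
```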
"Hello, I have a question; I think I wrote it in the chat. Based on the use case that I think Will mentioned, where you have to get some file from an FTP server and then load it into a relational database like Postgres: can you have dynamic names? For example, if a file is generated on the FTP server on a daily basis, and you have to run the flow every day, can it dynamically get today's date and concatenate it into the name of the file it's meant to get from the FTP server?" You might be able to, using variables potentially, or some kind of regex; I haven't played with that, so I'm not sure. Looking at the processor, there's a file filter and a path filter, so you may just need to look and see whether that's an option; I'm currently unsure. "OK, no problem, thank you. I'll check the logs and try it on my own." Great.

So, yeah, that will wrap it up; we are at the top of the hour. I'd like to say thank you to everyone for coming out, great questions, and we will see you all next week. Have a great day.
Info
Channel: Anant Corp
Views: 125
Id: weVVpcg716o
Length: 54min 39sec (3279 seconds)
Published: Thu Jul 08 2021