Data Movement into the Cloud with Apache NiFi

Captions
Okay, so to get started — and if the audio or screen share breaks up, please tell me; it shouldn't. So what is this, and where am I? If you're here, you're in a webinar on evaluating data movement options. We're going to discuss some reference architectures, we're going to discuss some tools, and then we're going to demo some of those tools. The way this course is structured, the longer you stick around, the more technical it gets: the first part is probably good for MBAs and business owners, the second part for your software architects, and the third part for the infrastructure engineers who will be working on that component. I really have to stress, though: if you feel it's not technical enough, feel free to send me a personal message, and if you feel I skipped over too many steps, send me a message too. We are covering an abridged version of a three-day course that Web Age offers in about two and a half hours, and to do that, as you may imagine, we've accelerated things a bit. If you're interested in more of these topics, Web Age offers a lot of courses and blog posts on them, and I'm more than happy to help guide people to those.

So who am I? You might be asking who this person is, why he's talking at you, and what he's doing here. I'm Chris Gambino, one of the architects and founders of Calculated Systems. We partner with Web Age a lot on training and materials like this. I previously worked at Google and Hortonworks on big data and related problems. My company focuses mostly on software solutions and consulting, and we partner with companies like Web Age to bring that into a legitimate, understandable form of training. I also wrote the book NiFi for Dummies; I believe Carling, if she's still on the call, can get a copy sent out to everybody here, or we can direct you to a download page on one of our websites — either way, it's a free ebook that covers some of the software tools we'll touch on today. Since we get more technical as we go, we will have a bit of an AWS focus: we'll start high level and then discuss AWS. I'm well versed in Google Cloud and Azure, so I can answer questions there too, but if you like AWS or have some experience with it, you'll find yourself right at home.

A little more about what I've worked on, and the kinds of systems everything today is based on. Data movement is such a broad topic that it's impossible to cover even in a four-year degree, so here are the types of problems I've personally worked on and seen at scale. I've worked in the financial services space, building out financial data feeds for portfolios — a traditional reporting and movement use case; we'll discuss how to classify these later. I also developed IoT backbones with Jaguar Land Rover as part of the GENIVI auto alliance — basically, how does Jaguar Land Rover have its convertibles send data to the cloud? That was a very fun use case, because we were allowed to have a Jaguar F-Type convertible on the floor of the LA Auto Show with direct API control of it, which was a lot of fun to play around with. As I said, I also worked at Google Cloud as a big data architect, and since forming Calculated Systems I've done lots of cool stuff, including a use case where we streamed hundreds of millions of events per second, and a conference called Battle of the Quants.
Anyway — what are the goals of this course? I encourage people to type to me; as I said, we have some flexibility. This is based on a bigger course, so depending on which parts we want to pull from, we can absolutely cover slightly different topics. Feel free to write in chat or message me — I think as a webinar you're not able to speak up directly, but just message me and I can add your questions and try to cover them as part of the class. The goals here are: understanding data movement as a problem — how do you quantify it, how do you understand it, how do you qualify it? There are a lot of tools and a lot of companies out there who say they can solve all of your data movement problems; how do you actually know if they're telling the truth, and how do you evaluate them? As I said, my company helps with this, and this will teach you how to qualify — or disqualify — our help as well; it's a pretty agnostic approach. Then, once you understand what type of data movement problem you have, how do you select the correct tools? I think Amazon has well over 100 services in just their baseline infrastructure, and there are over 5,500 in their extended marketplace, so whatever data movement problem you have, you can get everything in there from Snowflake to Splunk, plus the native tools like Redshift. And finally, a preliminary understanding of how Amazon approaches this — not just evaluating tools, but how Amazon's approach maps onto many of these common patterns.

So let's define data movement. This might sound a little broad, but we're going to get really specific about how you can explicitly define it at the business level, the technical level, and the execution level. The simple explanation — we could end the webinar right now if we wanted to be snarky — is that data is not where you need it to be: it's at point one and it needs to be at point two. The classic example is that you have a data dump in an object store such as S3 and you want to move it to DynamoDB. This is the most simple form of data movement; everybody has done this. It's as simple as moving data to a thumb drive back in the day, or copying it to a disk.
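As a concrete illustration of that simple case, here is a minimal sketch that copies a CSV dump from S3 into a DynamoDB table with boto3. The bucket, key, and table names are hypothetical placeholders, and a real job would add error handling and type conversion; treat it as a sketch, not a reference implementation.

    import csv
    import io

    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("accounts")  # hypothetical table

    # Read the dump exactly as it was landed in the object store.
    obj = s3.get_object(Bucket="example-data-dump", Key="daily/accounts.csv")
    rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

    # Batch-write the rows; batch_writer handles chunking and retries for us.
    with table.batch_writer() as batch:
        for row in rows:
            batch.put_item(Item=row)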
But if that were the whole problem, none of us would be here — that's not really reality. Reality sometimes gets a little more complicated: maybe we move data from one source of truth and need to enrich it against a second source of truth. Maybe we get a list of accounts as a daily report, need to look up their account numbers and names in a SQL database, and then land the result in the final CRM or DynamoDB. While I go through this, I encourage the audience to type in any data movement problems you have, or specific patterns you would like to see discussed — I'm more than happy to go through those. You should have the chat functionality in Zoom and can type to "all panelists"; the secret being that I'm the only panelist in this webinar, so it's effectively a private message to me.

Once again, I very rarely encounter problems that are just that simple. Usually it's something like what I call the "oh no": the sales team sells you the simple picture, it's really the enriched picture, and then it turns into this. This is the more typical data movement problem I've seen — data starts in one spot, in the wrong data center; you need to somehow move it to the cloud, then enrich it, then land it.

So how do you build this type of pipeline? How do you understand what the pipeline is? What are the steps of transformation, the steps of enrichment, the steps of movement required for it? And then it gets worse: if you're part of a company with more than 50 or 100 people — which I suspect most of you are — you know that once you solve the problem technically, your production IT group (or maybe you're part of it) is going to come in and say, well, now we need a security plan around all of these components on the right, because they count as production and we have access to them. So what was originally "just move my data from S3 to DynamoDB" quickly becomes "move it to the right data center, enrich it, and do it securely." This all means you're trying to solve several core concepts at once — movement, transformation, enrichment, and security — and understanding how those relate is going to be key.

I'm going a little fast because I'm trying to cover the entire section; if anyone wants to give a thumbs up or thumbs down in chat, I'm more than happy to see that, and feel free to message me as well. I'm paying attention, so I'm going to assume I'm doing fine until somebody tells me otherwise.

So, data movement, defined a little more canonically: it's the movement of bits from one medium to another. That's what we started with, and it is uselessly vague — you shouldn't even need a webinar to be told that. Let's be more specific and categorize it. All we've covered so far is that we're moving it, transforming it, and enriching it; let's categorize it more precisely, because that alone doesn't help us yet. The canonical definition says there are four types of data migration: storage migration, database migration, application migration, and business process migration. If we go back to that slightly elongated architecture, you can see that maybe there's a database over here, maybe a file server over there, maybe we're sending application data across the cloud. The boxes represent technical infrastructure; the four types represent what you're actually trying to do. But they're a little vague, a little complex, and they sit on the academic side of things — look up the types of data movement in any textbook and you'll see these four answers.

As I said, we get increasingly technical as we go through the class, and this is where the MBA or CEO-level discussion stops. If you're a CEO or a business owner and you're not technical, and you can say to your technical team, "we need to move our storage location," "we need to move some databases," "I need to change where we host our application," or "I need to be able to make my decisions in a different location," that's probably where you hand off to your director. This is the highest level of understanding, and if you can define it at this level, you're good. But let's go a level deeper, a level closer to execution. My problem with these four categories is that the top two define what kind of bytes are going back and forth — storage migration is typically moving files, database migration is typically moving database rows — while the bottom two rely on infrastructure-dependent things, such as an application moving files or sending messages or rows, and business process migration represents software combined with data.
So let's break it down into a more approachable problem. If everyone takes off their executive-level hats and puts on their technical-director-level hats: we've been told we need to do a type of migration, so let's think about it.

The first practical definition I would ask about as an IT director: is this a raw data movement problem? This is a subtype of storage movement, and it means we're moving logs or other unrefined sources of data. Those of you in the financial services space, maybe the IoT space, and definitely the social media space will recognize these patterns. This is the kind of thing where you're moving a log file, moving email accounts, maybe even moving HTTP or TCP streams — basically, taking unrefined data and moving it across the wire. It extends to things like FTP and things like S3. The important point is that we're not assigning a tool yet; we're taking the practical definition that we are moving raw, unrefined data that isn't yet understood.

The next practical definition is database movement. This is moving structured data: the data is already refined, already understood, it lives in a source database and is being moved to another database. The key difference is that raw data movement will often include a "person understands it" step — we take raw data and turn it into structured data — whereas database movement starts with structured data and builds on top of it. As we'll discuss a little later, a database movement problem has its usual requirements, such as ACID transactions and SQL or NoSQL compliance, and there are additional complications that can make it more difficult than, or at least different from, a raw data movement problem. That said, you might find your database movement problem is also a raw data movement problem: if you take a snapshot of your database, the management of that snapshot is really just a piece of raw data — an unrefined blob we can upload and restore — whereas if we say we want to copy the rows, we have a database movement problem. These are the most common archetypes I've seen, particularly if you're in an introductory or early-stage role: most companies are getting reports — maybe from Bloomberg, maybe from a social media source, maybe from an IoT source — and trying to move that data in. Many data engineers focus entirely on these two and specialize in them.

But it's important to add a third one: application communication. This is also a form of data movement, and it can be very difficult at times, or it can be very easy. Here we're not necessarily moving logs from one place to another or moving individual bytes; we're trying to help applications communicate structured messages to each other. Maybe we're using APIs, maybe they're sending messages to a queue. This data movement problem specializes in highly structured messages and what gets built on top of them.

I'm going to pause here for a second — if anybody wants to type a question, feel free — because I want to make sure everybody understands, or maybe agrees or disagrees with, these core points, because you might have an idea or a different way to slice it.
I'll wait one second in case anyone has a question... Great — I have one person saying they agree with me, which is good enough for me to call this one hundred percent non-controversial. So, moving on: the definition so far is the movement of bits from one medium to another, where what's moving is either raw data, database data, or application data — with, once again, no software products assigned just yet. But we've defined it.

The next problem — again, an IT-director-level problem — is: what frequency, what latency, what are all the other requirements that go along with this? Do I need to move the data once? Do I need to move the database regularly? Do I need an ACID transaction where I can roll back a day's worth of database rows? Do I need application data? What's going on? For that we can turn to something that has been both highly applauded and called a little vague: we used this at Hortonworks, we used this at Google, and I tend to like it — the three V's of data. Some people prefer five V's, and some people say the V's aren't useful at all, but I've personally found them extremely useful.

The first V is volume. You might think this simply means how many gigabytes, or terabytes, or petabytes — how much data do you have, how much data do you need to move. But when you discuss the volume of your data, it's really whatever the bulking load is. If it's an FTP server, it could be the number of bytes being stored — how many terabytes — but it could also be the number of files. Those of you who have worked with Hadoop systems will know the small-file problem; for those unfamiliar, Hadoop really had two ways of measuring data: how many terabytes you stored, and how many files you had, and depending on the system — not to get too technical — things like your Hadoop NameNode might fall over, or there may be a limit on how many files a particular storage application can hold. Another classic volume dimension is the number of messages, which is very important on message queues like Kafka, Pulsar, or Pub/Sub. So think: what is the main load on this system, and how do I quantify and measure it? That's going to be gigabytes and terabytes, number of files, number of messages. Note that I'm not making any claims beyond the sheer volume.

Because the next V is velocity: how fast is the data moving, how fast is it going to grow, how fast is it going to change? Where the first V represents bulk, this one represents how fast things are changing. It covers things like latency, messages per second, and — not changing schemas, but changing values. It's important to define this concretely and not just say "fast" or "slow" or "streaming." I once got into a ridiculous debate with somebody who claimed we were not a streaming application; it turned out this individual was a high-frequency trader in New York City, and to him anything longer than 50 microseconds — not milliseconds, microseconds — was considered batch processing. In my opinion, anything in the low milliseconds is probably streaming for most people.
Many people consider even single-digit seconds streaming. It's worth noting that all velocity, at its fastest possible rate, bottoms out at the tick speed of your CPU, to be technical for a second. So: how fast is it moving? Qualify it. How low does the latency need to be, how many messages per second, how many new files, how fast is it growing? That's velocity.

And finally, variety. You might find this one tied to your data movement type — a database problem might have lower variety than a raw data problem — but it's important to understand the format of your data, the schema of your data, and how much it's changing. For example, we can select specialized software tools such as Qlik, Striim, or GoldenGate for database movement, but that restricts us to database movement; understanding how varied the data is lets you select the appropriate tool.

And then, for the MBAs who haven't gotten bored yet and are still on the call, we're going to sneak a fourth one in and call it value. I know I said three V's — I don't really count this one — but I've had enough business people say they only care about valuable things that we put it in here. This is just a lesson learned: the idea that you're going to move all your data, that you need every last piece of your information, is a little dated now that we're getting so much data people don't know what to do with it. All the value question asks is: do I need and use this data? Do I need it for compliance? Is it useful now? Can I just store it and look at it later — on premise, or at the edge? Maybe you should limit the scope of your data; if we're capturing all of the logs, maybe we don't need all of the TCP stream as a redundant copy. Just ask whether your data is valuable. I once worked with an insurance company that had built a very large data lake of all their data, and the administration burden became so high they were thinking of discontinuing the project. It turned out most of the data in there wasn't needed — they added it because they thought it might be useful, but when you talked to the business analysts, they weren't using any of it. So focus on the valuable parts of your data, and feel free to sideline the rest. I know this goes against what some people might have heard a few years ago, but the amount of data in the world is blowing up out of control; in my opinion it's impossible for any one entity or company to wrangle all of it right now. Focus on what you need and what you want, and feel free to liberally restrict the scope of your project to just that.

If we combine these together, we have what I consider a good, actionable definition of data movement: it's the movement of bits from one medium to another, where we're focusing on raw data, database data, or application data, and it's governed by the three V's — volume, velocity, and variety. If I can answer the top question (which type of movement) and then put values to those three V's, I have a well-defined data movement problem that I can begin to hand off to my technical people, my individual contributors, my programmers.
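One illustrative way to make that hand-off concrete is to force every project to fill in the blanks before any tool is named. This checklist structure is not from the course — the field names and example values below are my own hypothetical sketch of the definition just described.

    from dataclasses import dataclass
    from enum import Enum


    class MovementType(Enum):
        RAW_DATA = "raw data"
        DATABASE = "database"
        APPLICATION = "application"


    @dataclass
    class DataMovementProblem:
        movement_type: MovementType
        volume: str    # the bulking load, e.g. "2 TB/day", "40k files/day", "5k msgs/sec"
        velocity: str  # e.g. "end-of-day batch" or "sub-second per message"
        variety: str   # e.g. "one fixed schema" or "loosely structured JSON"
        value: str = "all of it"  # optional scope restriction


    # Hypothetical example: a point-of-sale ledger sync like the one discussed next.
    pos_sync = DataMovementProblem(
        movement_type=MovementType.DATABASE,
        volume="tens of thousands of transactions per day",
        velocity="end-of-day reconciliation",
        variety="one fixed transaction schema",
    )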
If you're a programmer or an individual contributor on this call, you might be wondering how this helps you do your job, or how you're supposed to ask these questions — maybe your boss isn't attending Web Age webinars like they should be, maybe your management structure doesn't attend training — so it's on you to ask these questions back up the chain.

A very classic example: you have a point-of-sale system in a store, and you need to get the data into a central ledger. Let's say the CRM solution in the individual store has a database of all transactions, and the company runs on Redshift on Amazon. What we have is a database problem: we're starting in a SQL Server on premise and going to a SQL-style server in the cloud, Redshift. Then we ask: how much data, how fast do we need it, and what's the variety? The variety is super low — it's transactions. The volume is either the number of transactions or the size of those transactions, both of which matter. And velocity: how fast does the business need these resolved? It could be that the business is using these point-of-sale transactions simply to reconcile accounts at the end of the day, or maybe they're looking for fraud. It's important to understand which, because it changes everything.

For example, if a client says they only need an end-of-day transaction ledger versus an all-day one, I can go two entirely different ways and be one hundred percent right either way. Problem one: imagine a relatively small number of transactions being synchronized from on-premise to the cloud — maybe a department store that's winding down, not doing many transactions; it's low volume, low velocity, and we just need to sync it to the cloud for accounting. It might be completely valid to take a snapshot of that on-premise database every day, upload it to the cloud, and unpack it there. However, it's also a valid scenario that this is your marquee location with a high number of transactions, and you need to check for fraud because people are attempting lots of returns — which probably means you need something like a change data capture system that regularly and almost instantly synchronizes the transactions. By understanding both the top part (the movement type) and the bottom part (the three V's), you can draw drastically different conclusions, all of which are valid.
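For the first, low-volume option, a nightly snapshot job really can be this small. The sketch below is an assumption-laden illustration, not reference code from the course: it exports the day's rows, lands the file raw in S3, and loads it into Redshift with a COPY. The connection strings, bucket, and IAM role are placeholders, and it uses a Postgres-compatible driver for brevity even though the store in the example was SQL Server (which would use pyodbc instead).

    import csv
    import datetime

    import boto3
    import psycopg2  # stand-in driver; swap for pyodbc against SQL Server

    today = datetime.date.today().isoformat()
    export_key = f"pos-exports/transactions_{today}.csv"

    # 1. Export the day's transactions from the store's database.
    with psycopg2.connect("host=store-db dbname=pos user=etl") as src, \
            open("/tmp/export.csv", "w", newline="") as f:
        cur = src.cursor()
        cur.execute("SELECT * FROM transactions WHERE sale_date = %s", (today,))
        writer = csv.writer(f)
        writer.writerow([col.name for col in cur.description])
        writer.writerows(cur)

    # 2. Land the raw export in S3, exactly as generated.
    boto3.client("s3").upload_file("/tmp/export.csv", "example-landing-bucket", export_key)

    # 3. Load it into Redshift; COPY reads straight from S3.
    with psycopg2.connect("host=redshift-cluster dbname=analytics user=etl") as rs:
        rs.cursor().execute(
            f"COPY transactions FROM 's3://example-landing-bucket/{export_key}' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' CSV IGNOREHEADER 1"
        )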
Now I see a question in chat about sharing the presentation: the webinar video will be recorded and sent out — for those of you lucky enough to see the video, I'm wearing a very fancy headset to hopefully get good audio quality — and an abridged version of the slides will be sent out as well. Feel free to reach out to Web Age for those details; I believe Carling will be sending a follow-up email soon. As we go through, is this a good pace for everybody? Do people want to spend more time on any of this? Otherwise I'm going to continue on.

The next section discusses some common patterns. At this stage we're moving from the IT director level down to the software architect level, where we can start assigning specific tools and thinking about the problem concretely. I'll give everybody a second to think and ask questions, but by this point in the lecture you should be able to take the data movement problems you face in your day-to-day life and assign these answers to them. Once again, this is part of a three-day class where we sit with you and actually help you work through your specific problems — that's what it's designed for; it's part lecture, part workshop, and your team gets some one-on-one attention as well. I'll give everyone a second to think, absorb, or respond. Okay, no questions yet — let's review some common patterns.

These are the places where we see these problems in the world and can draw regularized answers and conclusions from them. One of the biggest ones I see in the field is log streaming — it could really be report streaming as well — where your team has some type of data being generated in one spot that needs to go somewhere else for central processing. The classic example is IoT: you have some type of device and its data needs to go to the cloud. I just finished an engagement with Sift; they run spectrometers — for those of you who've been through an airport and been swiped with that brush thing that gets put into a machine to check for explosive residue, it's that kind of chemical analysis, though they mostly focus on industrial uses — and the question was how to transmit that data to the cloud for complex processing that maybe can't be done on board. Another is security: what reports are coming in from all of your machines? This one is big at the Fortune 500s; it's where Splunk typically lives, where Logstash lives — how do I monitor all the web traffic being generated on my devices, how do I measure Windows event logs for anomalous activity? And then daily reports: maybe Bloomberg or Reuters provides a daily dump and you need to capture it. These three are common; I'm sure everybody has others as well.

The questions you need to ask yourself here: can the tool handle the number of logs being produced? I'd actually break volume down three ways: first, what's the total number of megabytes, gigabytes, or terabytes of logs; second, how many log files are being generated; and third, how many rows are being generated — maybe we don't need both rows and files, but those are the questions I would ask before even starting to architect the solution. Variety: what types of logs are we getting? Is this the kind of log we need to sort, the kind we need to develop against, or is it the same type every time — does one schema fit all? And velocity: how fast do I need the results — how often do these logs need to reach the central system? A security application might be very different from a daily report.
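To make the log-streaming pattern concrete, here is a minimal, assumption-heavy sketch of a shipper: it follows a log file as it grows and forwards new lines to an Amazon Kinesis stream in small batches. The stream name, file path, and partition key are placeholders, and a production shipper would also handle log rotation, retries, and partial batch failures — roughly the job agents like Logstash do for you.

    import time

    import boto3

    kinesis = boto3.client("kinesis")
    STREAM = "example-security-logs"  # hypothetical stream


    def ship(lines):
        # Each Kinesis record needs bytes plus a partition key; the host name will do.
        records = [{"Data": line.encode("utf-8"), "PartitionKey": "web-01"} for line in lines]
        kinesis.put_records(StreamName=STREAM, Records=records)


    with open("/var/log/nginx/access.log") as f:
        f.seek(0, 2)  # start at the end of the file, like `tail -f`
        buffer = []
        while True:
            line = f.readline()
            if line:
                buffer.append(line)
                if len(buffer) >= 100:  # stay well under the 500-records-per-call limit
                    ship(buffer)
                    buffer.clear()
            elif buffer:
                ship(buffer)
                buffer.clear()
            else:
                time.sleep(1)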
So let's look at how the big providers solve this — and I don't want you just to trust me. A big part of these classes is that you've just met me; I'm sure you've seen my credentials and think that's wonderful, but let's reference some other architectures to show that what I'm presenting wasn't generated in a vacuum — it's a big part of what people actually do. Amazon published what I think is their best architecture showing raw data movement; they call it their IoT approach because they pipe everything through the IoT hub. But let's say we didn't start with the IoT hub — let's say we started at the top half. If we clip off the top of that diagram, suddenly we no longer have an IoT approach, we have a streaming raw data approach. You can see we're starting to assign technology here; I'm going to loop back to this slide in a little bit to explain their choices, but let's skip ahead and showcase the general shape first.

This is how I really think about a raw data architecture, and it's a very common pattern that applies anywhere data is being generated and needs to be landed. On the left we have something called the device. It might not be a device your company owns, it might not be a device you ever see, it might not even be a device in the typical sense — something, somewhere, is sending you data. It might be your problem, Bloomberg's problem, your client's problem, or your vendor's problem, but someone somewhere is generating data and sending it. The first thing, as the data engineer or architect — because that's the level we're at now; remember, we started at C-level, went to director level, and are now down at the architecture level — is that it's your responsibility to land the data. That means we let the client or sender connect, and we store the data exactly as we received it. This is a pretty big caveat that some people will agree or disagree with, but in my opinion, and in what I and many of my peers have done professionally, you should always try to land the data exactly as you got it. You don't necessarily need to archive it, but get it onto your system reliably in the exact format it was sent; you'll thank yourself later when you have to debug.

After you get the data, you need to process it. I've never really seen a scenario where an individual report is instantly useful as sent; typically we need to do something — update some field values, merge it with a database to produce a daily report — some type of processing and transformation. So we pick up the data, look at it, and transform it before pushing it through to the final end product.

On top of that there are four more things. We need to store it: in a raw data architecture the data is generated on the device, we have no guarantee it's being stored there, and we maybe can't even guarantee we can get it re-sent, so we land it, process it, and store it. It could be stored in raw format or in enriched format — on Amazon this is where things like S3, DynamoDB, and RDS come in. With raw data I typically see S3 used to store the raw variant and databases used to store the enriched or processed data. We also need a way to explore it — usually some type of SQL query, maybe some Python or data science code if you're more advanced — some way to interact with the data beyond the automated processing. So looking at this pipeline and the user experience it creates: on the left we're landing the data and letting things connect, then processing and storing it; the users then need to be able to look at the data. That gives us exploration, which is more of a programmatic thing, and visualization.
These bottom two might seem similar, so let me differentiate them — even Amazon's architecture starts to delineate the difference. Exploration is asking a new question of the data: "hey, raw data, I need to discover a relationship in you." Typically your data analyst or subject matter expert asks this — somebody who knows at least advanced SQL, or maybe some data science, Python, Java, or Scala. Once that question has been asked — what is my daily run rate, what is my daily revenue — you can then visualize the answer. Visualization is done in your Tableaus, your Metabases, your Lookers, by people who know a little SQL or Excel but aren't really formulating new questions. So exploration is for the people discovering relationships, and visualization is for the people monitoring and watching those relationships — typically something like SageMaker or Python notebooks on one side, and Tableau or your standard BI tool on the other. And finally there's act: that top-right box, in purple, is by far the hardest, because it means deciding what you're actually going to do with the data, and it goes a little beyond the scope of a two-and-a-half-hour lecture, so I won't go too far into it.

By the way, someone mentioned the PDF is missing some slides — we can chat about that later. Due to some content agreements we couldn't share all of the slides for individual download and redistribution, but the video will be uploaded in full, glorious narration included.

Okay, back to Amazon's IoT approach — let's dig in and look at it as a raw data problem, without dismissing those of you who actually have IoT problems. We have the connected vehicles over here: that's the device, sending the data to us, and this is the initial landing. If we drill down, we can say the Amazon Kinesis Firehose stores the data — which it does, buffering it through — so we can really start there. It can also be viewed as the initial landing: the client sends data to your Kinesis Firehose, you allow them to connect and grant the request the right permissions — I'll showcase a little of the permission side of the house, time permitting, when we show how to connect to this system. So that's the initial landing; but Kinesis, as some of you may know, doesn't necessarily do processing, so we send the data on to Lambda or Kinesis Analytics — those are the two initial processing pieces. Mapping back to the generic picture: we have the connected car or other device; Kinesis handles initial landing and connection (the IoT hub can facilitate the connection); and then the data is processed. Let me annotate these slides for those of you following along, so we can review at the end: these are the cars in Amazon; the data gateway and initial landing are both the IoT gateway and Kinesis — everything before processing — and then, one step further, the processing is either Kinesis Analytics or Lambda.
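As a hedged illustration of that processing box, here is roughly what a Kinesis-triggered Lambda handler could look like: decode each record, do a trivial enrichment, and write the result to DynamoDB. The table name, field names, and unit conversion are assumptions for illustration only, and a real table would also need its key attributes present in every item.

    import base64
    import json
    from decimal import Decimal

    import boto3

    # Hypothetical destination table for the enriched readings.
    table = boto3.resource("dynamodb").Table("vehicle-telemetry")


    def handler(event, context):
        # Kinesis delivers each payload base64-encoded inside the Lambda event.
        for record in event["Records"]:
            raw = base64.b64decode(record["kinesis"]["data"])
            # parse_float=Decimal because the DynamoDB resource API rejects floats.
            payload = json.loads(raw, parse_float=Decimal)

            # Minimal "processing": tag the reading and derive one field.
            payload["source"] = "connected-vehicle"
            if "speed_kph" in payload:
                payload["speed_mph"] = (payload["speed_kph"] * Decimal("0.621371")).quantize(Decimal("0.1"))

            table.put_item(Item=payload)

        return {"processed": len(event["Records"])}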
Let's go back to the reference architecture. They show a couple of processing steps, which is pretty standard — the tool I specialize in, Apache NiFi, is a great way of doing multiple steps in a visual fashion, and we'll showcase that in a bit. From there we have DynamoDB, which is storage. You might notice they also stored the raw data, as is best practice: data was siphoned straight off the Kinesis Firehose and stored raw, while the enriched and transformed data is stored separately. So the top part is a really good reference architecture; the bottom part, you might notice, is pretty similar, and they go one step further and serve the data out to a topic. In total, what you're seeing is the pattern: connected vehicles (the device), a landing zone, then an initial processing and storage area — connection and landing, then processing, then storage here, here, and here. Their IoT approach doesn't go into exploration or visualization, but we see many of the core first components.

And it's important not just to take my word, or Amazon's word, for it — Microsoft says essentially the same thing. This is taken from the Azure reference architectures: we have IoT devices, the cloud IoT gateway/hub, stream processing, and storage, and as you can see, storage is connected to both processed data and raw data. Microsoft takes it a step further and adds UI and reporting tools for visualization, and even an actions tab. I know I've shown two IoT examples in quick succession, so for those of you who don't like IoT, let's just pretend it said "this is my server with security applications," "this is my Event Hubs or Kafka queue or SQS or Kinesis or Pub/Sub storing all of the messages," and then we have our stream processing, our UI, and our storage. The point is that these patterns repeat themselves regardless of the venue: if data is being generated in one place and needs to go to another — a raw data type problem — this pattern covers just about everything. Of course there will be caveats and different technology selections, but if you can answer each of these pieces, you've got yourself a full software stack. Any questions so far? We're going quite fast and skipping a few steps along the way to fit this in, so I'll pause for a second and leave this on screen in case anybody wants to take a quick screenshot.

Now we're going to discuss database migrations. For a quick level-set: we'll discuss databases, then applications, then pause for a short break so people can get water or coffee; we'll come back and go through a real-life example from Jaguar Land Rover — fortunately, what we did there has been open-sourced, so I can talk very freely about it — and then we'll do a live demo of some of the tools on Amazon. That's what you should expect from the next hour and a half. But first, database migration for the next fifteen or twenty minutes before the coffee break.

Database migration is the second type we discussed: your data exists in a SQL-style server and needs to move to the cloud. Actually, I don't like that definition — "needs to move to the cloud." Let's make it simpler: sometimes it's going from the cloud to on-premise, sometimes from on-premise to on-premise.
Basically, your data is in a SQL-style server and needs to move — and I'll even allow that it could be NoSQL. If I were to make these slides truly canonical, I'd say: we have a tabular-format system that needs to move, with some level of SQL or ACID compliance. We can be precise about it, but you get the point: we need to move some type of tabular data between systems.

For volume, databases fortunately let us ask very specific questions: how big is the database, how much data is being created, and how many rows are there? In my experience, database volume can be defined by rows and total size. For those of you thinking you also have to consider things like load on the database, the complexity of the calculations, the size of a field, or the width of a row — I'm going to argue those aren't the data movement problem. They're factors in your database selection, and as we said in the value section, control the scope of the problem: as long as I can move the data, the databases on either end can be selected to match those needs, and that's a separate selection process. So: how much data is being created, and how big is the database? Variety: how many different types of tables are we moving? Is the data predictably shaped and predictably sized? For those of you working with — I think it's Type 5, or star-schema stuff — you're going to have fairly standard, predictably shaped SQL tables with known schemas. For those of you working with NoSQL, you're going to have a little more fun, because the size of your rows will change a lot, so there's more variety. And velocity: how fast do you need the rows updated? This is a very important one, because a lot of people rely on databases to run their business.

Now, before we move on to some reference architectures, there's one important differentiation in database migration that is often a sticking point for people considering it for the first time: the difference between append-only movement and change data capture. This is critically important, and if you don't know the answer, you're going to pick the wrong software tool. If you say "I need to move data from one point to another," can you define which data needs to be moved? That is, can I express with a SELECT statement which data has changed? Maybe I say "where the date is January 9th, 2020, move all those rows"; maybe "everything within the last five hours"; maybe "everything above row number one million." The important thing is that I can select the rows of my database. This is often called the append-only problem, because it's super easy when your database is only adding new rows — it's trivial to key off an auto-incrementing integer or a timestamp and select everything above a specific value. Where it gets a lot harder is when you have things like in-place edits, where values and flags flip over when you're not expecting it. So it's called append-only, colloquially at least, but it really just means you can point to the data being added onto the end as new data.
On the flip side, if you're facing things like edits — maybe account balances, maybe fraud detection — and you need to do things like rollbacks or capture the edits, you need a change data capture tool. We're getting more technical now, as promised. Basically, most if not all databases have a write-ahead log (sometimes called a copybook) that captures everything that happens to the database in some order, which lets you accurately recreate what happened and roll it forward or back. It is a trap to try to write your own tool here — I've looked into it. CDC solutions are often a little on the expensive side, but they're worth it; write-ahead-log capture is really something where you should look into a commercial software solution if you need to capture all of the edits. I'm not partnered with or compensated by any CDC vendor for saying this — I've just never seen people succeed at building edit capture themselves. I'd recommend looking into companies like Striim, Qlik, or Attunity; if you're an Oracle shop you might have GoldenGate; Microsoft has its own solutions; and Amazon has its Database Migration Service, which is really some heavily modified Attunity-style tooling. The good news is you'll get a solid, fully featured solution that captures all of the edits and can roll forward and back; the downside is you have to buy some new software. If, on the other hand, you can define your problem — or structure your database — to be append-only, you stand a much better chance of doing it in-house. Just keep that in mind.
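For that append-only case, the in-house version can be as small as a watermark query. The sketch below is an assumed illustration, not a product recommendation: each run asks the destination how far it has already copied, then pulls only newer rows from the source. Table, column, and connection details are hypothetical, and it deliberately misses in-place edits — which is exactly why edits push you toward a CDC tool.

    import psycopg2  # assumes Postgres-compatible databases on both ends

    SRC_DSN = "host=onprem-db dbname=sales user=etl"
    DST_DSN = "host=cloud-db dbname=sales user=etl"


    def incremental_copy(batch_size=5000):
        with psycopg2.connect(SRC_DSN) as src, psycopg2.connect(DST_DSN) as dst:
            src_cur, dst_cur = src.cursor(), dst.cursor()

            # The destination remembers how far we have copied: the high-water
            # mark on an auto-incrementing key.
            dst_cur.execute("SELECT COALESCE(MAX(txn_id), 0) FROM transactions")
            watermark = dst_cur.fetchone()[0]

            # Only rows appended after the watermark are moved; edits to older
            # rows are silently missed.
            src_cur.execute(
                "SELECT txn_id, account_id, amount, created_at FROM transactions "
                "WHERE txn_id > %s ORDER BY txn_id",
                (watermark,),
            )
            while True:
                rows = src_cur.fetchmany(batch_size)
                if not rows:
                    break
                dst_cur.executemany(
                    "INSERT INTO transactions (txn_id, account_id, amount, created_at) "
                    "VALUES (%s, %s, %s, %s)",
                    rows,
                )


    if __name__ == "__main__":
        incremental_copy()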
So let's look at some database architectures. If we're saying it's append-only, I'll actually go one step further and say, boldly, that an append-only problem is almost a raw data problem: we have data being generated that's clearly defined, and if we put a small agent on the database that monitors it, we can capture those rows and send them — and then we have ourselves a classic raw data problem. Normally the hard database migrations, the ones people build reference architectures for, are the CDC ones, which are more complex. In this example we have Amazon's approach to a complex database migration: an on-premise user with an existing database solution, connected through an internet gateway, with Amazon's Database Migration Service handling the CDC under the covers. I'm not sure exactly how similar it is technically, but functionally it's very much like an Attunity or Qlik solution: you can copy from your database on premise and replicate the rows, edits, and everything straight into the cloud — mostly into RDS, maybe SQL Server or MySQL; Oracle can get a little wonky with licensing, so check your license before trying that. And I really like this Attunity (now Qlik) example as an illustration of how it works: you have your mainframe, your Hadoop cluster, your Hive, your Impala over here; the application reads the write-ahead log, and then the changes and edits are published out to all the potential endpoints.

Any questions about the database migration approach before I move on? I know I essentially broke it down into a pseudo-raw-data problem or an application problem.

Okay, let's go one step further and quickly discuss applications, and then we'll take a short break for water or coffee. In my experience — and in my colleagues' and co-workers' experience — the biggest challenge with application data is ensuring connectivity and paths. Unlike raw data movement, where you have to write some type of agent to move the data, and unlike database migration, where the problem is edits, application data movement is often about figuring out how to get the applications to talk to each other in the first place. There are a lot of connectivity concerns in this one. If you're facing an application challenge you'll also face some VM issues, of course, but that's more of an application engineering problem; the data engineering problem for applications is very heavily about connectivity and paths. One very important point: as the data engineer, you might not think you need to know this, but your life becomes a trillion times easier if you think about how your applications figure out where to send data. That's more of a network engineering vertical, but if you understand DNS servers and hostname resolution, you as the data engineer or data architect can change where your applications send data simply by changing DNS, without updating the applications themselves. If we look at Amazon's reference architecture for application data movement, we have a customer network on the right and Amazon on the left, and it highlights how much of this is a connectivity issue: gateways, network address translation between the two, name resolution. I'm not going to dig as deep into this one — we promised everyone some data movement, we only have two and a half hours total, and we don't really have time for all the networking concerns — so let me summarize, and then we'll get ready for a short break and an example.

In summary of the first hour so far: data movement is a complex issue. Data is not where it needs to be, it's not enriched against the source of truth, and you need to solve that — data is in one place, needs to go through a network connection, needs to go to S3, needs to be enriched. You then ask yourself: am I moving raw data, database data, or application data? Those are the three practical definitions. Then you add the three V's: how much are we moving, how fast is it going, and how much is it changing? Combining those gives you a good director-level understanding of the problem — "it's this type of movement, with these parameters." To go from director level down to architecture level, you can start looking at Amazon's and Microsoft's published architectures — I'll showcase a little live Amazon toward the end of this class — and start mapping your problem to common archetypes. So after this class, if you're asking yourself "how do I do this, how do I think about it": answer the movement-type question, answer the three V's, and then start mapping them to existing architectures. You'll find that most of your problems are similar enough, I'd bet, to map to something you can find as a public reference.
Then, with your knowledge of the three V's and the data movement type, you'll be able to start picking the right technology. We're going to walk through a real-life example of this — a raw data movement problem where we chose to use Apache NiFi for many parts — and then I'll show a little bit of Amazon. If anybody has any specific requests, I'm more than happy to accommodate them over the next hour or hour and fifteen minutes. Just a reminder that there will be a short evaluation at the end. If nobody has requests or questions, let's get started.

Okay — let's talk about how we solved a literal data movement problem in real life. We had several problems that needed to be thought through. At a high level — starting again at the top, not quite C-level, more technical-director level — we needed to receive data coming off an automobile and land it in some centralized location. Rather than unrolling it piece by piece, I'll show the whole thing: we needed to decouple the data before storing it (this is the landing — sorry, the initial landing — and gateway area), then process it, then store it, and then present it. As you can see, we use slightly different names here than in the generic architecture — this is more specific to this problem — but the pattern repeats itself: you receive data, you store it in some initial gateway, you apply logic and enrichment, you store it, and you present it to the end users.

So what does this mean in practice? How do I, as a software architect or a programmer, start to actually think about this? First: how do I want to receive the data — how is it going to come in, how am I going to collect it? We could select many different transport technologies: TCP, UDP — if you really wanted to be daring, raw HTTPS — but nowadays it boils down to the fact that a lot of APIs are built on SOAP or REST, and in this case REST APIs were the clear choice. So we need to let these systems connect to send us the raw data, and in this case that was a REST API server: a well-established standard, we're not inventing new technology, and it's ideal for receiving data as it's produced.

Then we need a way to take those RESTful API messages and store them in a decoupled state. What does that mean, and where does it go? This is basically a messaging queue: a temporary place to store the messages in their initial format before archiving or processing. A lot of companies — including the Amazon architecture above — show a message queue at this stage. A message queue lets you store the message exactly as-is and also begin delivering it to other places. This could be your Kafka, your SQS, your Pulsar; it's a little situation-specific, and some people do this directly in NiFi — I'll show a bit more of that shortly.

Then we have event processing. As we said, after you connect and land the data, you have to process it, and in this case that's what's known as an event processor: we're not batching it, we're not necessarily streaming it, but every message that comes in needs to be looked at, broken down, and understood.
What that means is we're taking these individual messages and routing them through some type of processing. We then have a database to store the results, and in this case we selected a NoSQL database. This course really isn't designed to cover how to store your data, but in short: if you have a lot of data coming in with high variety — unpredictable structure — NoSQL databases are ideal because they allow flexible schemas. As a ridiculous little side story, we would get weird readings from Jaguar every so often: sometimes windshield wiper position, sometimes tire pressure, and sometimes they were running a system test and we'd get no sensor readings at all. So we needed a way to store a high variety of messages as they were received. And finally we needed an interface: a way to present the solution to end users that lets them interact with it.

That was roughly the technical-director-to-software-architect level of communication — determining the exact steps we needed. Now, how do the individual programmers and architects communicate and build this out? For the REST API server, we need to select a specific technology, and that selection is based on location and on the three V's of volume, velocity, and variety. In our case the volume was going to scale up; the velocity was moderate — we weren't doing vehicle control off it, so we didn't need a C-grade, super-fast, millisecond-latency solution, but we did want to push things like vehicle configuration settings, so we needed some control; and we needed to handle a high variety of messages, because as I said, sometimes we'd get fun things like windshield wiper position, sometimes fuel level, and sometimes nothing. So we needed a REST API server, and I've listed a few of the most popular and common options — I'm sure everybody can name a few more. You'll notice I put NiFi at the top; that's the tool we ended up using, so you're going to see it in a lot of these steps.

For the messaging system — this is where the REST API server connects and the messages get stored in the queue — one of the most popular industry solutions is Kafka, an Apache open-source project as many of you probably know, but things like Pub/Sub on Google, RabbitMQ, maybe even GemFire (Apache Geode, as it's called now), SQS, and Kinesis all serve as great messaging systems. Once the data was in there and stored, we needed a way to process it with an event processing engine: that's things like Apache Beam — a super popular one in circles around Google Cloud, where they call it Google Dataflow — while in the Amazon community it's more typically Kinesis Analytics (I should fix that slide for everybody on the call) or Lambda functions. This is what does the actual compute part of the workload. For those of you a little more on the technical or data science side, you might see Apache Spark used here — Spark Streaming — and Databricks has really continued to knock it out of the park in that space, so Spark is doing quite well.
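To make the first two verticals concrete — receive over REST, then decouple onto a queue — here is a tiny sketch built from off-the-shelf Python pieces (Flask and kafka-python). The JLR build used NiFi for this step, so treat this purely as an illustration of the equally valid alternative stack; the endpoint path, topic, and broker address are hypothetical.

    import json

    from flask import Flask, request
    from kafka import KafkaProducer

    app = Flask(__name__)

    # Publish every accepted message, unmodified, onto the decoupling topic.
    producer = KafkaProducer(
        bootstrap_servers=["broker-1:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )


    @app.route("/telemetry", methods=["POST"])
    def receive_telemetry():
        # Land the message exactly as it was sent; enrichment happens downstream.
        producer.send("vehicle-telemetry-raw", request.get_json(force=True))
        return {"status": "accepted"}, 202


    if __name__ == "__main__":
        app.run(port=8080)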
And then finally, of course — as we selected it, so I'm going to keep referencing it — Apache NiFi was used for this. You might see NiFi referenced in all three of those verticals; it's what we ended up using at Jaguar, which is why you're seeing it in all these steps. But I really wanted to capture the selection process for everybody here so you understand there's lots of technology. We could have a system that goes through these first steps as NiFi, NiFi, NiFi, and it would be just as valid a solution to have a Tomcat server publishing to Kafka and then Apache Beam reading from it — we can achieve the same use case with those. By understanding your three V's and by understanding your data movement problem, you can start to select which technology you're going to use.

For NoSQL we have a few good options. DynamoDB is a bombshell option — I realize that slang might not translate: it's a phenomenal option. You also have things like Bigtable on Google, MongoDB of course is classic, and HBase is a really big one too. We ended up using HBase on this project, but if I were to do it again Amazon-specific, accepting the cloud lock-in, I would pick Dynamo because it's really quite good. And then finally, an interface to external users. Here we have to think about what type of user we want to support and how we want to build this out: we could have a JDBC connection for, say, Tableau users, we could have a REST API, we could simply provide regular file dumps, I could publish to a client's messaging queue, I could even provide an ODBC connection and connect Excel. As much as a lot of people like to pretend Excel is on the way out, it's still probably the most used data analytics tool in the world.

So what we have here is a selection across five verticals of lots of different technologies, and I really want to emphasize that even though in this case we went NiFi, NiFi, NiFi, HBase, REST API, it is just as valid to go Tomcat, Kafka, Apache Beam, JDBC — you can literally achieve the same business function. It's up to you as the architect and director to make this decision.

So, a unified solution — just to break down what we did. We used Apache NiFi as the REST API server: it's a drag-and-drop, visual tool, the team understood it well, and it scales; I'll showcase that a little more in a bit. For the service bus or message queue — note I updated the label to "service bus" because it's more proper for Kafka — Kafka was chosen because of its ability to replay. Actually, let's back up on why we picked these. NiFi was picked for the REST API server because it's open source and because of its ability to scale: you're not worrying about licenses, it scales horizontally, and if you're on Amazon, NiFi servers behind an Elastic Load Balancer serve as an out-of-the-box, highly available REST API system that doesn't require you to cluster anything. So NiFi is a really good option there. Kafka was picked for its ability to replay messages. Apache Kafka, for those who don't know, is supported by Cloudera, by Confluent, and a few others, and the big feature is replay — I changed the label from message queue to service bus because with Kafka you can take a message that's already been delivered to a server or an end user and play it right back. That makes it a really good initial landing spot, because it allows you to iterate and redo messages.
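To make that replay point concrete, here is a minimal sketch — assuming the kafka-python client and hypothetical broker, group, and topic names, not the actual Jaguar Land Rover setup — of rewinding a consumer so already-delivered messages get played back:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",   # hypothetical broker address
    group_id="replay-demo",               # hypothetical consumer group
    enable_auto_commit=False,
    consumer_timeout_ms=5000,             # stop iterating once the backlog is drained
)

# Pin the consumer to one partition and rewind to the earliest retained offset,
# which is exactly the "replay" capability discussed above.
tp = TopicPartition("vehicle-events", 0)  # hypothetical topic name
consumer.assign([tp])
consumer.seek_to_beginning(tp)

for record in consumer:
    print(record.offset, record.value)    # every retained message is redelivered here
```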
For event processing we went right back to Apache NiFi — so the flow in this Jaguar solution is actually NiFi, to Kafka, and back to NiFi. It's extremely adept at handling high amounts of load; for the same reason it's a great REST API, it's also a great event processor. For NoSQL we went with HBase. As I said, if I were to do this again and we were specialized on Amazon, I'd probably pick Dynamo, but this solution needed to be portable to on-premise, so Dynamo wasn't an option. HBase is really good at extreme load — it handles many terabytes; I know for a fact they use it at Ford, all the big autos use it — it can take massive loads. And finally, for the interface, we supported a few different options: the main one was the REST API, but since HBase has built-in Python and JDBC connectivity through its Phoenix plugin, we were able to support those as well. Does anybody have any questions on this?

What we're now going to do, now that we've gone through a bit of the architecture, is showcase how this starts to come together. I know the webinar description talked about Apache NiFi, so I'm going to run through NiFi, how we solve problems with it, and a data movement problem that's similar to the Jaguar Land Rover problem but easier to understand in ten minutes versus a couple of hours. Any questions here? Otherwise we're going to dive into our last 45 minutes to an hour of lecture.

Okay, great question: in the unified solution, does that mean we create a separate data lake for each IoT application? Basically you need to think about how the data sets interact with each other. In this case the answer is no — we used the existing HBase infrastructure they had, and what they already had as a tool influenced a lot of our decision making. But in some cases, particularly in my work at Eaton Vance, they actually had several data lakes for compliance reasons. So you might need to isolate your data, or you might be able to put it on existing infrastructure — let me type this reply so everybody can see it. The part I haven't discussed is that if your company is already strongly invested in a specific technology, that of course makes your selection a little more limited. Think of it as: leverage your existing investment as much as possible, don't recreate the wheel unless you have to — but in many cases I've seen the wheel have to be recreated, because certain data can't live with other data and needs to stay compliant.

I want to move through this quickly — this is what we'll be diving into next. The last part of this lecture is going to be very technical; it's designed for the individual contributors, the programmers, the tinkerers, the managers who still want to pretend they're individual contributors, and the managers for whom programming is half their job. We're going to talk about how to start your journey with Amazon, I'm going to cover common pitfalls, it's going to be some 101 stuff, and then some NiFi. So, in the data journey we're going to showcase today, we start with Amazon's S3. This is a very common starting point for data: it's among the simplest solutions on the cloud for storing data, and you can simply drag and drop files in here.
Amazon charges you by the byte — something silly like a buck or two a month per gigabyte, or rather, no, a couple of cents per month per gigabyte; I'd have to look at the exact price, but either way it's very cheap to store data on S3. What we have here is just example text files from Reuters. For this demo we took a sample Reuters data feed — it's all .txt files. Let me stop sharing my screen for a second while I open this without revealing any private information... okay, everyone should be able to see my screen again; if you can't, please tell me. When we showcase Reuters data here, this is a publicly available, licensed sample set — if you're ever going to use this data set for real, talk to Reuters, of course — but it's taken from their public data sources. As you can see, it's literally just an article: raw text, nothing fancy.

The sample use case is: we get all these files into S3, but I need them in a database — perhaps a searchable index. So what we have is an S3 bucket of raw files, and you might notice they're not labeled .txt; that's part of the problem. Going back to Amazon's streaming approach to raw data architecture, we start with S3 and then we need to move to processing. Let me map this onto the architecture — we'll label it the case study architecture so people can screenshot it at the end. The device, in this case, is something we don't control; I'm going to gray it out entirely, because the device here lives entirely on Reuters' side of the house and we have no visibility into it. The initial landing in this case is a dump to S3: the data is landed on S3 and just stored there. That actually handles the first steps — one of the advantages of working with Amazon versus somebody else is that they facilitate some of these transition and connectivity steps quite quickly.

So the data is in S3 already — how do we begin processing it? We could use Lambda, or we could use Kinesis. To touch on each of those quickly: Lambda, for those of you unaware, is a serverless way of hosting code — as the console says, "run code without thinking about servers." I can simply create a function — and let me slow down a little so those of you following along can see: if you're unfamiliar with Amazon services, search for Lambda at the top, click it, and it loads. If I hit create function, I can create functions that trigger off all sorts of things: S3, Kinesis, Kafka, whatever — lots of little serverless functions that can be written in Python and other languages. Kinesis is a streaming data platform, and it also has Kinesis Analytics, which allows you to do some processing; however, I've really not seen as many people use it for that — they often pick a different execution engine, and as you can see, even Amazon recommends Lambda as the execution engine in these reference architectures.
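As a rough illustration of that Lambda-plus-S3-trigger pattern — a minimal sketch, not production code, and the processing step is just a placeholder — a Python handler wired to S3 object-created events might look like this:

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by S3 object-created events; fetches each new object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # placeholder: parse, enrich, or forward the raw article text here
        print(f"fetched {key} from {bucket}: {len(body)} bytes")
```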
What I recommend is using NiFi. Now, full disclosure: my company supports a NiFi distribution on the cloud — it's in the marketplace, and if you go to EC2 you can spin up a pre-made instance of it out of the box. That's just something my company supports; it's an open source tool, and it's completely viable to install and configure it yourself. I just want to disclose that before I recommend it.

So how does NiFi approach problems? You might have it from your Cloudera/Hortonworks distribution, you might have installed it yourself, or you might have it from our marketplace option — either way, let's walk through how we think about processing data. As you can see, the sections we'll be referencing are the REST API and event processing, and all of this is in relation to an upcoming demo of how we're going to move Reuters data from S3 into various other states. If anybody has questions, now is a great time.

Before I showcase the solution, it's important to understand its parts. NiFi is a visual tool — it's my preferred way of processing data on the cloud; I did write a book on it, so I'm a little biased about why and how it's my favorite. Basically you have a visual user interface with what's called the canvas: a blank area you can drag and drop processors into. There's a collection of roughly 300 processors — 293 to 330-ish depending on your distribution — that accomplish all sorts of functions: everything from list file, to get file, to list FTP, put FTP, invoke HTTP, to handling metadata. It was designed originally to move data around between systems and has evolved ever since. NiFi is actually a relatively old solution, believe it or not — it came from the NSA, of all places. This tool that I enjoy quite a bit was open sourced around 2014, after being rolled out into the private sector starting in 2011.
I'm not really sure where the tool started — there's evidence of it going back publicly to 2007, but that's all classified under the NSA umbrella — so our journey really starts in 2014, and it's undergone active development ever since. There are lots of functions, lots of integrations, it's cloud agnostic, and you can get it through Hortonworks/Cloudera, Calculated Systems, or set it up yourself, all with different advantages — and of course you can always get it on the marketplace.

So what does it do? Beyond the features, it has monitoring built in. You can see here that list file listed 5,500 files — so as we go through the demo and I showcase how to build out data movement pipelines, keep in mind you have a live monitor right there: what I'm saying isn't made up, it's actually moving data, and the monitoring is built in. Processors can also be configured. You might say, "Chris, this is great, you're showcasing some tool that demonstrates data movement, but I'm different" — well, all these processors are highly configurable. If you right-click on a processor you can change lots of functional parameters and add different properties, and by connecting them you get a very high number of potential interactions between processors. Some companies make custom processors — my company has been making them commercially for a while — but I've noticed many people can get started on their initial use case with out-of-the-box options and then get customized ones as they mature.

One of the classic patterns is list file, then fetch file. You can replace this with list S3 / fetch S3, get Google Cloud Storage / fetch Google Cloud Storage, get Azure Blob / fetch Azure Blob — the main pattern I'm showcasing is that list file can send data to fetch file over a connection. The connection just defines the flow of information from point A to point B, and as you can see we're monitoring 5,500 messages out and 5,500 in queue. Of course there are different relationships, too: we talked previously about how your data movement problems need different levels of connection, different functionalities, different options — in this case we can route to different pipelines or storage options depending on whether it successfully found the file, didn't find it, failed, or was denied. So connections connect processors to each other, and depending on what happened, the data gets routed in a very specific way. You can also route to these little things called funnels — if you ever see that icon, just think of it as a dead end used for development or collection; here, the fetch file success relationship goes into a funnel.

Another important thing I'll be showcasing today is the concept of listing a queue, so we can inspect the data stream. That's a big part of why Jaguar Land Rover went with this originally: we were able to show what was in flight, not just be blind to it. We can see not only the metadata — things like the file name, access time, and location — but we can also view the flowfile content itself.
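For anyone who would rather see that list-then-fetch pattern as code instead of processors, here's a hedged boto3 equivalent — the bucket name is a hypothetical stand-in for the demo bucket, not its real name:

```python
import boto3

s3 = boto3.client("s3")
bucket = "reuters-demo-bucket"   # hypothetical stand-in for the demo bucket

# "List" step: enumerate the keys only -- no object bytes are transferred yet.
paginator = s3.get_paginator("list_objects_v2")
keys = [obj["Key"]
        for page in paginator.paginate(Bucket=bucket)
        for obj in page.get("Contents", [])]
print(f"listed {len(keys)} objects")

# "Fetch" step: retrieve the payload for each listed key (just a handful here).
for key in keys[:5]:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    print(key, len(body), "bytes")
```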
That means we can view it and download it — as I showed, we can download the Reuters news messages from S3 directly, but I'm also going to showcase how we can get at them indirectly as well. And then there are even more attributes on top of those attributes; this is all referenceable metadata. At the most technical level, the way this tool handles data movement is through what we call attributes, or metadata, and these are referenceable — in this example I took a screenshot of NiFi reading its own Jetty directory, so it's just pulling in from there.

To quickly review before I dive into some live demos: we have the canvas, which is where NiFi visually builds the flow; processors, which are predefined actions that can be included in the NiFi data flow — there are a couple hundred default options, and some companies such as mine also sell add-ons; connections, which define how processors talk to each other; and flow files, which are how data is transmitted between processors.

So let's jump into a bit of a demo. I want to showcase how a data movement problem gets started on Amazon. Here we have Amazon S3, and for those of you who might be new to it, I'll show a little of the permissions management, because I want people comfortable with whatever tool they're using — the permission settings I'll showcase today apply to anything: if you go to Amazon's CLI or SDK, you're going to require the same permissions. The first challenge of data movement at the technical, individual-contributor level is connectivity: how do I connect to my landing zone or initial gateway? In this case that's an S3 bucket. For those of you unaware, you can search for IAM, which pulls up the Identity and Access Management toolset. Our demo account has a slew of dummy users, but let's create a user just for this live demo; we'll call it something like aws-webage-webinar-user.

One very important thing to understand about Amazon users: you have the Amazon account — which we named "super root admin" for some crazy reason — and then within that account you have IAM users. This is very different from how Google or Azure handle things: in Amazon you have an account, and users are provisioned within it. A small company will probably have one account; larger companies might have a few. So I have my account, which we called our super root admin, and within that we have users with different levels of permission. I'm going to enable programmatic access, and now I need to add the user to a group — I'll add it to the NiFi demo group. For those of you creating things from scratch, you might instead directly attach policies, which lets you attach things like AWS S3 full access. In your data movement problems with S3, as the individual contributor doing the hands-on programming, what you really need to understand is how to connect — and no matter what tool you use, whether it's NiFi, Kafka, Python, or the AWS CLI itself, you're going to need an IAM user in the correct group. Frankly, you can skip the tags and hit review.
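The same user, group, and key setup (and the cleanup afterwards) can be scripted instead of clicked through. A hedged boto3 sketch with hypothetical names that mirror the demo, run under an identity that's allowed to administer IAM:

```python
import boto3

iam = boto3.client("iam")

# Create the demo user and drop it into a pre-existing group (hypothetical names).
iam.create_user(UserName="webinar-user")
iam.add_user_to_group(GroupName="nifi-demo", UserName="webinar-user")

# Alternative to a group: attach the managed S3 policy directly to the user.
# iam.attach_user_policy(
#     UserName="webinar-user",
#     PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
# )

# Programmatic access: the secret is returned exactly once, so store it securely.
key = iam.create_access_key(UserName="webinar-user")["AccessKey"]
print("access key id:", key["AccessKeyId"])

# When the tinkering is over, deactivate (or delete) the key -- the same cleanup
# step shown in the console during the demo.
iam.update_access_key(
    UserName="webinar-user",
    AccessKeyId=key["AccessKeyId"],
    Status="Inactive",
)
```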
You can see I'm creating a username of webinar-user with a programmatic access key, belonging to the NiFi demo group; I hit create user, download the credentials, and I'm going to revoke these particular credentials quickly, so I'll just show them. Make sure you copy down your access key and your secret access key when you download them — they're like your SSH key, and you won't be able to get them back later. Now, let's say you did something stupid, like flashing your secret access key on a webinar to a ton of people you just met today: you can always go back to IAM, click on the user, go to security credentials, and revoke the keys or make them inactive. That way I can kill the key I just showed everybody and I'm secure again. The reason I show this is that when you're done with your demo or your tinkering, it's good to clean up your access keys unless you're actively using them. So I'll download a new one and not hit show this time, so I have a nice, secure access key — you just saw me recycle my credentials in real time.

With those credentials done, I have access to S3 using this key: I can go to my Lambda function, my CLI, anything, and start to pull in that Reuters data. So let's look at how NiFi does it. I actually have a working demo here, but I'm feeling bold and I'm going to start afresh over here. We drag and drop a processor in — as you can see this version has 298 processors; some builds have a little over 300 — and I want to get data from S3, so I type "s3" and you can see lots of options (and remember, if anyone has questions, feel free to ask). We'll pick ListS3, which means we're going to list all the files in that bucket — and if you're using Lambda instead of NiFi, you're probably going to follow a similar pattern no matter what; this is just the fastest to demo. Double-click it and we need to set properties: where is it reading from? For those of you new to Amazon, S3 buckets are based in regions — as a little secret for people on this call, I think Google has a slightly better answer with multi-region buckets, whereas under the surface Amazon has distinct regions with replication between them; it's a solution that works. Then I put in the bucket name, which I can retrieve from S3 — our Reuters demo bucket — and now I need credentials. I could paste my access key and secret access key right in there, but instead I'm going to use a credentials provider service, something we pre-configured ahead of time that just has the credentials saved in it. NiFi has the concept of shared controller services that do things like registry lookups, stored logins, and JDBC connections; in this case we set up a controller service that stores credentials for us, so the demo goes smoothly and I don't have to keep flashing secret access keys on this recording — that feels like a leak waiting to happen, and if you saw the news about Starbucks leaking an API key... since this webinar is recorded, we're not going to showcase working credentials either way.

So that's ListS3; now we want to fetch from S3. That means we're going to list all the objects and then fetch all the objects, so we connect the two processors by dragging a connection between them.
Now we can list all the Reuters news objects in there, and as you can see there are 660 news objects — that's 100% correct — but you might notice there are no bytes. This is an important data movement point: remember how we talked about different stages having different volumes and velocities? In this example the volume is a number of files, and no data has been retrieved yet because we've only listed it. A large number of files can overload a system just as fast as too much data. So the files are listed; next we fetch, which means we actually retrieve them. As you can see, I made a connection for the success queue, and I'm going to terminate the failure relationship. If you think I'm going a little fast — I am; this part of the class would normally be interactive and take a few hours. Remember, Web Age offers the full three-day version of this course, and Calculated Systems typically delivers those as well, so there's a good chance you'd get me teaching it.

As we build this out, an important NiFi concept is metadata and configuration. You can see FetchS3Object is throwing an alert; if I hover over it, I can see the problem: the bucket is invalid because the bucket is required. Okay — I double-click it, go to properties, and see that the bucket setting is required but not provided. That's a property on the processor: processors have properties, and we need to fill them in. I can type in the Reuters demo bucket name and this processor will run just fine... well, it won't run just fine — what happened? A 301. I know what I did wrong. You can always see the error codes NiFi is throwing by hovering over the top right, and it says the bucket is in the region us-east-1: please use this region to retry the request.
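That 301 is worth a code-level note: the fix is to look up where the bucket actually lives and build the client in that region. A minimal boto3 sketch, with a hypothetical bucket name standing in for the demo bucket:

```python
import boto3

bucket = "reuters-demo-bucket"   # hypothetical stand-in for the demo bucket

# Ask S3 where the bucket lives; us-east-1 is reported as a null LocationConstraint.
location = boto3.client("s3").get_bucket_location(Bucket=bucket)["LocationConstraint"]
region = location or "us-east-1"

# Build the client in the bucket's own region to avoid the 301 PermanentRedirect.
s3 = boto3.client("s3", region_name=region)
print(s3.list_objects_v2(Bucket=bucket, MaxKeys=1).get("KeyCount"))
```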
us-east-1 is North Virginia — always referenceable on their website — but the funny thing here is how Amazon handles clients: storage clients such as the S3 access client need to be initialized per region. That creates a fun little problem where you can't bounce between regions without re-initializing your client, and you'll see that no matter what tool you're using, so make sure you always initialize in the correct region. Restart this and... we're still throwing an error. What's going on? Access denied — so now we don't have the right credentials. As you can see, we're debugging what happens at each stage of the demo. For those of you paying attention, you might have noticed that I didn't put in an access key, secret access key, or credentials file, so we're going to reference the credentials provider service, hit run, and we finally see some messages come through. So there you go: we were able to use just a user interface to process data. Amazon is pretty good about the error codes it throws, and I was able to debug both a wrong region and a 403 in real time in front of an audience. If anybody has questions or wants me to focus on different tools, feel free to ask — we have about 20 minutes of lecture left, in which I'll be showcasing some databases (excuse my cough here, I'm dying), some more S3 work, a little more NiFi, and even a little machine learning thrown in for fun.

Either way, we're fetching now, and you can see the data movement details under attributes: we're pulling from this bucket, we're pulling in files, and if I look at a file it's not a known type yet, because the objects don't have a .txt extension. One other thing I want to showcase before fixing that is how to make this pipeline a little more programmatic. I'm going to speed through this, but if you ever need to reset a processor, you can stop it and hit view state and clear it, which resets that processor's stored state — or you can just make a copy of it, and the copy will act as a new processor if you've done a one-off read. So here we have FetchS3Object; we hit start, run it, and it runs just fine — but let's take it one step further. If I look at the incoming flow files by right-clicking the connection and hitting list queue, then hitting the information icon on the left and viewing the attributes, you can see I have a piece of metadata called s3.bucket. I could hard-code the Reuters demo bucket name, but by referencing this piece of metadata I can create a dynamic flow: I replace the bucket property with s3.bucket wrapped in NiFi's expression language — dollar sign, open brace, close brace — which tells it to evaluate the expression instead of taking the value literally. You'll notice this processor came pre-filled with the expression language on the file name property, evaluated the same way. Hit ok, apply, and when I turn it on it continues to process just fine. That's a big deal: we now have a dynamic flow that monitors this bucket in real time and pumps the successfully fetched files down to the right.
Now, you might have noticed some failures: FetchS3Object also lets me route failed files off to the left, so as I develop my production load I can have a failure queue where I monitor for connections that didn't process successfully. I'm now going to show how we can build this out a little faster than doing it all manually. If you look here, we took it one step further: we have ListS3 and FetchS3Object, and in this part of the pre-made demo I set a key name and a data type. Remember how I said NiFi has attributes and metadata? This is where we use them — we need to update those attributes, and I can do that with a processor very unimaginatively named UpdateAttribute. To get it, you just drag from the top left, pick update attribute, hit add, and it's in (I just deleted the old one). If I double-click here, you can see I'm setting a document number and setting mime.type to text/plain. That just tells the system this is a text document and that the file name is, well, the document number — and it lets me hit view and inspect the content right in NiFi. Here you see a fascinating article about the mines and energy minister of Indonesia and how they feel about the tin pact extension — really riveting stuff — but more importantly, I can view it through the application, and that really helps with debugging and development.

So far I've showcased how to list, fetch, and start to edit files — we're starting down our data pipeline journey with this tool. I really have to stress that although this webinar was billed as showcasing NiFi, if you want to write code, this pattern is applicable to many tools; I just find this one the easiest to use. The pattern is repeatable, and it's honestly the cornerstone of good data work on the cloud: list your files, fetch your files, and manage your metadata.

Another NiFi concept I want to showcase, beyond the specific processors, is this thing you might have seen me click: a process group. I can double-click into it and you see three processors hiding in there nicely; they then feed into a little thing called an output port, which I named "files out." This lets us group processors together, and I can now drag that files-out relationship anywhere I want — for example over to this dead-end funnel — and hit add. So I can pipe my files out that way; in this case I want to send them in two different directions, one way through some prep and another way through some transformation.

To cut back to the PowerPoint — or rather Google Slides, because I prefer G Suite — very rarely do we have a simple problem. Very rarely do we go straight from S3 to Dynamo; we often have to route through another source of truth. This is the architecture we have time to go through today: I'm going to show a quick lookup against Google's equivalent or Amazon's Comprehend service, and then land the result. For those of you who don't know, Amazon offers a full set of NLP, machine learning, and data engineering services. This is the Comprehend service, and there are a lot of different features in here.
I'm not going to sell Amazon for Amazon's sake, but out of the box it has a lot that can help with your data engineering and transformation — and once again, I'm not compensated on anything Amazon sells, so this is just my unbiased opinion: Amazon's good, Google's good, Azure offers a similar product (though Azure does some weird stuff with facial recognition, so keep that in mind). Here we have things like entities, sentiment, key phrases, syntax, and so on — different types of natural language enrichment where I could extract every entity, all the sentiment, all the key phrases, and so forth. So our data journey for this data movement problem is S3, to an API lookup where we extract enrichments, and then landing in a database.

We've already extracted the S3 files; now we're going to enrich them. We turn this on, turn on the output port, and we see events starting to stream through. This is just another UpdateAttribute where we say we're detecting sentiment, and then all the next processor does is make a Comprehend call — it's one of the processors my company produces, but Comprehend is an open API as well, so you can code it yourself if you want to stay platform agnostic; that said, I think this is a pretty slick and clean way to do it. What we have here is "comprehend text sentiment": I want to see whether this is a positive, negative, or neutral message. I'm configuring it to read the incoming flow file body — the payload coming in — write the result out to an attribute (it will name itself), and use the existing credentials provider service. Hit apply, turn it on, and it starts processing and enriching data.

So now we have enriched messages. In this example the incoming file is just a shares report — let's get something a little juicier, at least tin-pact-extension-level juicy. Here we go: Bankers Trust of New York has downgraded Brazil from accrual basis to cash basis — either way, they're in trouble. Let's see: sentiment neutral. According to Amazon's NLP this is a relatively neutral statement — you're seeing some of the shortcomings here, since maybe it should be negative — but either way it was rated neutral. So we made an API call, did some enrichment, and processed it through this pipeline; finally, we can send that data to a database. I'm not going to dig too far into that part because I don't want to get too specific, but we can format it as JSON, route on different attributes — I'm looking at the clock and we're running out of time, so I'll jump through this — and ultimately we can put the data into DynamoDB after routing it through this pipeline. I'm going to focus a little less on this tool and a little more on Amazon, after seeing a request or two sent to me in chat, but the point is that this tool lets us build a single end-to-end pipeline without ever leaving it.
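Under the hood, that sentiment lookup boils down to a single Comprehend call. A hedged boto3 sketch — the article text and region are placeholders, and this is not the exact implementation of the processor used in the demo:

```python
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")  # placeholder region

article = "Bankers Trust of New York has downgraded Brazil..."    # placeholder payload text

result = comprehend.detect_sentiment(Text=article, LanguageCode="en")
print(result["Sentiment"])        # POSITIVE, NEGATIVE, NEUTRAL, or MIXED
print(result["SentimentScore"])   # confidence score for each class
```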
Back to some Amazon-specific stuff so you can get started on your own journey after this call. I could talk about Google or Azure a little as well if anybody requests it, but without a specific request I'm going to assume everybody is on Amazon, because that's typically what I've seen at scale, broadly — even though I love Google. To recap the Amazon side: we've talked about IAM, which is the security layer. At a baseline you need to create a user group, create a user in that group, and give them credentials — that's what you need from an IT and individual-contributor perspective. Then you have the different services: S3 for storing your raw data objects — you can see some artifacts from the demos I ran earlier; everything you see here is publicly disclosable — and then things like RDS and Dynamo, and I'm going to showcase Dynamo. I see another panelist just joined — whoever that is, feel free to speak up; we're wrapping up the live part of the demonstration and then we'll pause to collect some surveys.

We talked in the architecture about lots of places to land the data, and I've mostly focused on the left half of that architecture, so let's take five to ten minutes on the right half of the data movement architecture. If I go to Amazon's services, you can see Storage and Database, which are really the big ones here. To iterate through them quickly, just so you have an overview of your tools for moving data to and around the cloud: S3 and S3 Glacier are different tiers of their object store — really good for storing objects in the cloud without necessarily understanding what they are; when we talk about raw data movement, S3 is usually one of the major cornerstones. The Storage Gateway is more of an appliance — don't worry about that just yet. Elastic File System is for when S3 doesn't work for you: if you're moving data from an FTP system and need it on a compliant file system disk, EFS lets you mount a common file system across multiple VMs. It's really good if the access layer needs to remain available over FTP, or if you need POSIX compliance, to be nerdy for a hot second — and that's the last time I'll mention POSIX on this call, because nobody wants to worry about file system definitions just yet.

Down here in databases we have a few interesting options. RDS is the cornerstone — it's what everybody usually starts with. RDS is essentially multiple database engines pretending to be one service, kind of like the kids in the trench coat from The Little Rascals, if you saw it. If I hit create database you can see I'm given a lot of options: at its core there's MySQL, there's Postgres, Oracle, and SQL Server — once again, if you're doing Oracle, pay attention to your licensing, because it can get a little wonky. And then Aurora: Aurora is a very interesting storage option if you're migrating data to the cloud, because it allows you to commit a couple of sins you couldn't commit on premises. Remember when we talked about volume and velocity — Amazon Aurora lets you create a 64-terabyte MySQL database; whether or not you should be allowed to create a 64-terabyte MySQL database is another question. But RDS really supports lots of SQL-style interactions and lots of existing workloads.
If you're not sure what to do — if you're a real noob and you're still sticking around at this point — MySQL is a real safe bet to get started, but I'm pretty sure most people on this call already knew that. Going down the list, DynamoDB is one of their premier NoSQL options. It's cloud native and you pay for capacity in terms of reads and writes: with DynamoDB you're not provisioning servers, you're provisioning read and write capacity. This is really good if you have an unstructured schema. In fact, I think I have a table in this demo instance — yep, "items" — where I started doing some NoSQL on the raw data Reuters was sending. Here we have a DynamoDB entry; this one is an economic spotlight on booming Australian markets — really good stuff, apparently. Either way: NoSQL had no idea what was incoming, it just stored it as-is, and there are no servers to manage. Going down the list a little more, you've got ElastiCache, which is a caching service — we won't go into that much, but if you need low latency it's a good option. Neptune is an interesting one: it's a graph database, so if you want a differentiator, Amazon's Neptune service is one of the few hosted graph database offerings. What you uniquely need graph databases for is another problem — you usually know if you need one — but keep in mind that a hosted graph database service exists, because not everyone realizes that. Cassandra is another NoSQL-style solution that gets used alongside graph workloads. The last one I want to discuss is Redshift. Redshift is a massive data warehouse — pretty much the biggest thing on Amazon for storing data other than maybe S3. For anyone on this call thinking about Redshift: I hear Amazon isn't very interested in working with you on Redshift unless it's a multi-million-dollar deal, so unless you're a really, really big use case I would avoid it — but if you are a really, really big use case, Redshift might be a strong option. I'm going a million miles an hour right now — does anybody have any questions? I'll pause to make sure nobody does.
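Circling back to that DynamoDB "items" table for a moment: landing an enriched article is a one-call write. A hedged boto3 sketch — the key attribute and the item fields here are hypothetical, not the real demo schema:

```python
import boto3

table = boto3.resource("dynamodb").Table("items")   # the demo table name shown earlier

# Schemaless write: only the table's key attribute is mandatory; everything else is free-form.
table.put_item(Item={
    "document_number": "0042",                       # hypothetical partition key
    "headline": "ECONOMIC SPOTLIGHT - BOOMING AUSTRALIAN MARKETS",
    "sentiment": "NEUTRAL",                          # e.g. the Comprehend result
    "body": "raw article text goes here...",
})
```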
Okay. So I've gone through the storage options in this short class; the full class goes through a bit more. Just as a teaser for those of you still interested: we would go through the networking options and how you enable application data movement, the machine learning section on how you can enrich data yourself, and the analytics section. So let's go through a quick recap, and then we'll take fifteen minutes for the survey and to make sure everybody gets the handouts.

In summary: data movement, at its simplest, is the movement of information from one source to another. That's the simplest answer; in reality it's much more complex. You have to move data from one system to another, go through complex transformations, change data centers and locations, and worry about security and access. At an academic level, data movement is defined as storage movement, database movement, application movement, or business process movement. At a more practical level, you have raw data movement (the movement of files and unrefined sources), database movement (the movement of structured data between sources), and application movement (structured messages between systems). In addition to understanding these, you need to go through the three V's of data — the four V's, if you're an MBA — to help quantify what you're doing. Volume: how much data there is and what the main bulk is — how many messages, how many bytes, how many rows. Velocity: how fast the data is moving — messages per second, acceptable latency. Variety: what the schema is, what the format is, how much the data changes. And then of course value, the bonus fourth V if you're an MBA: is it valuable to you, do you need it?

So the first part of this class was defining what data movement is — for those of you who joined late, it's the movement of bits from one medium to another, and it's characterized by raw data, database data, or application data, and by volume, velocity, and variety. The second part discussed common patterns. Amazon suggests, for a native Amazon stack, going from something like Kinesis to Lambda to Dynamo, with S3 serving as a raw data backup; Amazon also provides a reference architecture going from Kinesis to Lambda straight into SNS for a messaging system. I expanded on this to showcase some options using other open source tools that, in my opinion, do it a little better, such as NiFi and Kafka — and as corroborating evidence, Microsoft offers the same pattern of server, to messaging queue, to stream processor, to storage locations, with business and UI integration on top. This pattern is often repeated for database migration, which we broke down into two formats: append-only and CDC — really, can you define what data has changed and how to get it, or do you need to capture all the edits and complex relationships that might have happened? The top option is doable in-house; the bottom option really relies on some commercial software, with reference architectures there as well. (Sorry about that — I think I coughed into the mic before muting; I think I made it in time.) Finally, there's application data, where the big challenge is the movement of data from left to right or right to left: routing the data between your applications. In my experience, and most people's, the biggest challenge there is pathways and routing — understanding how your data gets from one point to the other. That's less of a data engineer problem and more of a network engineer problem, but it might become your problem as a data engineer, particularly as you move toward the director level.

And then finally we went through some real-world architectures, where we discussed how in a real Jaguar Land Rover setup we had to receive data, decouple it, apply logic and enrichment, store it, and expose an interface. From there we defined requirements around the V's and what we actually needed: highly scalable but with slightly relaxed latency, a messaging server that was also scalable, event processing, a high variety of data (which informed our NoSQL option), and accessibility. We went through a selection process — I'll leave this on screen briefly if somebody wants to screenshot the potential options in each vertical — and after all that we demonstrated the unified solution, which is what was demoed at the LA Auto Show with Jaguar Land Rover, before diving into some Amazon-specific tooling and examples, namely around NiFi, S3, and permissions. I encourage everyone to try it.
NiFi is a good tool to use, and so is the native Amazon stack. You can get our distribution off the marketplace or set it up yourself — both are options. Any questions here? If you look in chat, Carling just posted a handout — that's the PDFs and an evaluation — so if you would, please go fill out the evaluation. I'm going to stay here and answer some quick questions if anybody has them, but I want to make sure everybody has time for the evaluation.

Okay, one question about databases: how do you connect to them from NiFi? It's a little specific, but we'll go through it quickly. We don't have any databases in the demo account, but once you set one up you'll get a hostname and a JDBC connection string. In NiFi — I'm assuming this person meant a SQL database — if I search for "put sql" you can see there are a few options, and the main thing they share is the JDBC connection pool. This is a controller service, much like the Amazon credentials provider service: a JDBC connection pool provider service that manages connections across multiple accounts and environments so you don't overwhelm the server. If I hit create and click through to it, you can see the properties where you put in the URL, the driver class name, a Kerberos credentials service if you're a serious Hadoop junkie, and so on — and of course the user and password. I hope that answers the question; please ask a follow-up if it doesn't. The general answer for everybody: if you're ever worrying about credentials, right-click in the open area, hit configure, make sure you're on the controller services tab, hit the plus button, and you'll find a lot of credentials provider services.

Another person asked: is NiFi running on AWS or locally? NiFi, as an open source tool, can run on either. If you go to EC2 and hit launch instance, you can search for NiFi and you'll see our certified NiFi install, which is a one-click launch my company provides; alternatively you can set it up on your own server. Internally we have a slew of development variations, and we also put a little web UI in front of it. If you launch on the cloud you can use our variant, which has passwords and such to help keep your connections relatively secure, or you can roll it yourself, which is perfectly valid — just make sure you set it up securely if you do. The main point is that whether you run it on the cloud or on premise, the controller services handle the routing to cloud or on-premise resources, so it doesn't matter where you run it. In fact, you can run it on Amazon or Google and reference each other's object storage; it's more a matter of where you want to host it and whether it has a network path to the resources. This particular one is running on Amazon, because we're using our distribution and, of course, I like it — it also comes with some of the extra little processors used in this demo. But it will route wherever: the cool thing is I could drop a put-GCS processor right here and literally send the data to a Google Cloud Storage bucket simply by using a Google Cloud credentials provider service — but that's a subject for a fuller lecture.
Okay, everyone — it's getting pretty late on a Thursday, now approaching 4:30. I'm going to wait around another minute to make sure there are no more questions and then close off this webinar. I hope everyone has found this informative. I'm going to put my email on screen as well if you need to reach out directly, but of course you have your Web Age contacts — you can talk to Carling — and since their company name is so long, I need a smaller font... okay, there we go. This has been a preview of the full three-day course we offer, potentially in person, potentially remote, depending on what you want; it includes a live lab where you can do hands-on work, and we won't race through everything quite as much and will go into much more detail.

Okay, we have a question in chat: how can you scale NiFi? That is a phenomenal question. There are two — sorry, three — ways to scale it, with a little bit of a caveat, and we can of course discuss specifics if you want to reach out. The first, of course, is vertically: you can always put it on a bigger server. That's the lazy answer, and it gets you up through maybe 16 cores, maybe 32 if you're feeling daring, but it isn't the best answer. The next consideration is state management. NiFi can run in clustered or unclustered mode. In clustered mode the NiFi nodes connect to each other and, using an internal process based on ZooKeeper (you don't have to worry about the details), the cluster self-manages and keeps track of state. That means when it's doing things like reading from S3 or from an FTP server, multiple nodes communicate what they've read last — multiple nodes reading from the same source, keeping track of where they are. A really cool pattern is unclustered mode: NiFi can work unclustered if what you'd call state management is handled by an outside party. For those of you familiar with Kafka, you might know consumer groups, or round-robin delivery: if you have something like a consumer group keeping track of the last message received and sent, you can simply put all the NiFi nodes in the same consumer group and that will manage the state. That's exceptionally powerful because it doesn't require you to cluster NiFi at all — it lets you spin nodes up and down ad hoc. So in summary: the easy, lazy way is a bigger server — Amazon is priced pretty linearly, so that makes a lot of sense at first; the better ways, which give you high availability and DR, are to run clustered if the nodes need to manage their own state, or unclustered if you have external state management coming from Kafka, SQS, or Kinesis. We have seen NiFi clusters handle up to 100 million messages a day very comfortably — things like 400 megabytes per second — it scales up extraordinarily well.

Well, thanks everybody. I'm going to end this webinar with a few minutes to spare so everyone can get a bit of their late afternoon back, and thank you all for being a great audience.
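As a footnote to that scaling answer: the unclustered pattern really is as simple as giving every worker the same consumer group. A minimal sketch with the kafka-python client and hypothetical topic, broker, and group names:

```python
from kafka import KafkaConsumer

# Run this same script on any number of independent, unclustered nodes.
# Because they share one group id, Kafka balances partitions across them and
# tracks offsets, so no NiFi-level cluster state management is required.
consumer = KafkaConsumer(
    "vehicle-events",                     # hypothetical topic
    bootstrap_servers="localhost:9092",   # hypothetical broker
    group_id="nifi-workers",              # identical on every node
    enable_auto_commit=True,
)

for record in consumer:
    print(record.partition, record.offset, record.value)  # stand-in for the per-node flow
```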
Info
Channel: Web Age Solutions Inc
Views: 533
Keywords: apache nifi, webinar, training, web age solutions
Id: oBzUgfqcgjw
Length: 129min 44sec (7784 seconds)
Published: Thu Apr 15 2021