Building a Data Pipeline on AWS | AWS Data Pipeline Tutorial | AWS Training | Edureka | AWS Live - 1

Video Statistics and Information

Captions
So good morning, good afternoon and good evening, guys, depending on the time zone you are joining from. Before we start the session, can you give me a quick confirmation that you can see my screen and hear me loud and clear? Perfect, thank you so much for the confirmation, everyone. My name is Niraj Keria, and I have been working in the IT industry for more than 13 years now.

Before we get started, let me quickly introduce our Edureka masterclass community. This community was started back in 2019 and has since grown to close to 30,000 members. In these masterclasses we conduct multiple webinars and live events, almost 100 a month, on different topics including AWS, blockchain, IoT, artificial intelligence, big data, data science, RPA, and multiple front-end and back-end development technologies. The best part about these webinars is that they are absolutely free of cost; there are no charges involved, and they are a really great platform for anyone looking to increase their knowledge of a technology they are interested in. To be part of the community and get access to all the scheduled webinars, you can click on "join this group", and you will also get access to the recordings of every webinar you attend.

The main agenda for today's session: we are going to discuss what exactly a data pipeline is, and before that we are going to understand what cloud computing is. So we will cover what we mean by the cloud, what AWS is, which domains are available in AWS, why we need a data pipeline, what exactly AWS Data Pipeline is and what its components are, and then we will do a small hands-on with AWS Data Pipeline and see how we can create one.

First of all, let's talk about cloud computing itself. If we go back 17 or 18 years and we had a requirement to deploy a website (say we are a company and we have a new website that needs to be launched for end users), we had to purchase a large stack of servers, take care of the power and the setup on infrastructure we had rented or bought, set up all the servers, and handle maintenance, scaling the servers up and down, and monitoring, all at our own end. That resulted in increased cost. Even if we had the knowledge to manage and maintain the servers ourselves, or if we did not have time for that, we could hire an expert or a team of experts to manage the servers for us, and that also resulted in increased cost.

So there were multiple disadvantages: setting up servers locally and maintaining them manually was expensive, hiring a team was also expensive, troubleshooting problems can be tedious and can simply conflict with our business goals, and because traffic varies, the servers would be idle most of the time. We still had to purchase the servers, which means that even for the time periods when we were not using them, or did not need the extra capacity, we still had to pay for them, and that again resulted in increased cost.

Those drawbacks are where cloud computing platforms came into play. With cloud computing, someone else has already done the entire infrastructure setup at their end, and they allow us to use their infrastructure to start deploying our applications; that is exactly what cloud computing is all about. For example, AWS, Azure and GCP have done the infrastructure setup globally, in almost every major region around the globe. We choose the region in which we want to deploy the application, subscribe to the product, and start deploying; that is how it works. On top of that, there is no upfront payment required: we pay exactly for the time period and the kind of resources we have consumed. This pay-as-you-go model has been one of the main reasons these cloud computing platforms have become so popular. And we can scale the servers up or down at any point in time; there is no restriction defined there.

Now, if we talk about AWS, AWS has been the global leader in the cloud computing domain. AWS was conceptualized in 2003 (although the first cloud computing platform was officially launched by Salesforce back in 1999), released as a beta in 2004, and officially launched to the public in 2006. AWS currently offers services across many categories: computation, storage, migration, IoT, machine learning, security and more. Although AWS launched with just two core services, it now offers more than 175 services under different categories such as compute, migration, storage, networking, management tools, databases and security. To see the complete list of services, we can navigate to aws.amazon.com, where the core services are listed under analytics, compute, containers, customer engagement, databases, developer tools, front-end web and mobile, IoT, machine learning, and management and governance.

AWS also offers a free tier account. To sign up on AWS we need an active payment method, just as we do for Netflix, Prime or most other subscription-based services, so we have to add an active payment method.

Once the payment method is added, we get access to a free tier account. A free tier account means we are given access to certain services that we can use, within usage limits, for free for a period of 12 months from the date of sign-up. Different vendors have different ways of giving free access: Microsoft Azure gives us 200 dollars of free credit that we can use against our billing for a period of one month; on GCP, that is Google Cloud Platform, we get access to 300 dollars of credit; and in AWS we do not have the concept of free credits, we only get free tier access, meaning certain services, within limits, free for 12 months from the date of sign-up.

To create the account on AWS, we use the button in the top corner of the page. If we are using AWS for the first time, instead of the "Sign in to the Console" button we will see "Create an AWS Account". We click on that option and enter the email address, the password, the password confirmation and the AWS account name, then add a valid payment method, and we will have access to the free tier account. Once we are done with the sign-up, we can navigate to console.aws.amazon.com and log in with that email, so let's log into our demo account. After the sign-up is complete and the payment method has been added, most of the time we get access to the console the moment the account is created, but sometimes it can take a while, and even once we have console access we may not have access to all the services right away; we may have to wait a couple of hours.

Now let's talk about the data pipeline itself. We know that with the advancement of technology and the ease of connectivity, the amount of data being generated is skyrocketing, and buried deep within this mountain of data is the captive intelligence that companies can use to expand and improve their business. Companies have a constant requirement to move, sort, filter, reformat, analyze and report data in order to derive value from it, because unless they know what the consumer trends are, how data is being generated, and in what patterns, they will not be able to plan their investments and operations accordingly.

As an example of the actual requirement for a data pipeline, take Pinterest. Its first goal is to improve the business by targeting content; it also has to manage the application efficiently; and its third goal is to grow the business faster but at a cheaper rate. Those are the challenges. The problem statement is that there is a huge amount of data in different formats, so processing, storing and migrating that data becomes complex. Data is growing exponentially and at a fast pace, and companies of all sizes are realizing that managing, processing, storing and migrating data has become more complicated and time-consuming than in the past.

So first of all we have a huge amount of raw and unprocessed data that needs to be processed, which includes log files, demographic data, data collected from sensors, transaction histories and more, and it arrives in a variety of formats and in different states. The entire process is also time-consuming and costly, and that is exactly why we need a more efficient solution.

We have different services offered under AWS for managing different kinds of data: real-time data for registered users, web server logs for potential users, demographic data and login credentials, and sensor data and third-party data sets. For real-time data we have DynamoDB; just as MongoDB and Cassandra are NoSQL databases, DynamoDB is the NoSQL database service in AWS. Then we have S3, the Simple Storage Service, which we can use much like Google Drive to store any files and easily share them with anyone, for example for the web server logs. Data such as demographic data, login credentials and inventories for e-commerce stores can be saved in the fully managed database service offered by AWS called RDS, the Relational Database Service, where we can choose any database engine, for example MySQL, PostgreSQL, MariaDB or Oracle. And for sensor data and third-party data sets we can again make use of S3.

Now, about the solution. The first feasible step with this large amount of data is to analyze it and convert it from an unstructured to a structured format, because until we do that we can neither reduce the storage space required nor run queries on top of it. The optimal solution is to use a data pipeline, which handles processing, visualization and migration as well. AWS Data Pipeline is a web service that helps us reliably process and move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals. With AWS Data Pipeline we can easily access data wherever it is stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon RDS, DynamoDB and EMR (EMR stands for Elastic MapReduce). It also allows us to create complex data processing workloads that are fault tolerant, repeatable and highly available.

There are multiple benefits to using Data Pipeline. First, it provides a drag-and-drop console within the AWS interface. It is built on distributed, reliable infrastructure designed for fault-tolerant execution of every activity we run. And it supports scheduling and error handling, providing features such as scheduling, dependency tracking and error handling.

These features help make sure the entire project is completed with ease. Data Pipeline also allows us to distribute work across machines: using AWS Data Pipeline we can easily distribute and dispatch work from one machine to another, or to multiple machines, and we can do that serially or in parallel as per the requirement. It is also inexpensive to use, as it is billed at a low monthly rate, and it offers full control over the computational resources that we use to execute our data pipeline logic.

Now let's talk about the components of a data pipeline. As we have discussed, AWS Data Pipeline is a web service we can use to automate the movement and transformation of data: we define data-driven workflows so that tasks can depend on the successful completion of previous tasks, we define the parameters of our data transformations, and AWS Data Pipeline enforces the logic we have set up. There are three kinds of components: data stores as the input, compute resources, and data stores again as the output. We always begin designing a pipeline by selecting the data nodes; the pipeline then works with a compute service to transform the data if and when required (most of the time a lot of extra data is generated during this step); and optionally we have output data nodes where the result of transforming the data can be stored and accessed.

If we talk about the data node itself, in AWS Data Pipeline a data node defines the location and type of data that a pipeline activity uses as input or output. Multiple data node types are supported, such as the DynamoDB data node, the SQL data node, the Redshift data node, the S3 data node and so on.

To understand this further, consider a small example. Say we have to collect data from different data sources, perform Amazon Elastic MapReduce (EMR) analysis, and generate weekly reports. We collect event data from DynamoDB, analyze it using daily EMR analytics, bring in bulk data from S3 as well, process everything, collect and store the daily EMR results, and automatically generate a report out of it that can be referred to in Redshift. So in this use case we are designing a pipeline to extract data from data sources like Amazon S3 and DynamoDB, perform EMR analysis daily, and generate weekly reports on the data.

We also have activities. An activity is a pipeline component that defines the work to be performed on a schedule, using a computational resource and, typically, input and output data nodes. Activities can be things like moving data from one location to another, running Hive queries, or generating Amazon EMR reports.

Then we have preconditions. A precondition is a pipeline component containing a conditional statement that must be true before an activity can run. The main use case is to check whether the source data is present before a pipeline activity attempts to copy it, or whether the respective database and table exist.

A resource is the computational resource that performs the work that a pipeline activity specifies, for example an EC2 instance or an Amazon EMR cluster that performs the work defined by the pipeline activity.

And towards the end we have actions. Actions are steps that a pipeline component takes when certain events occur, such as success, failure or late activities. They include sending an SNS notification (SNS is basically the notification service we can use to get alerts about these events) and triggering the cancellation of a pending activity on a given data node.
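
To make these components concrete, here is a minimal, purely illustrative sketch of how a schedule, two data nodes, a compute resource and an activity reference each other, written as the pipelineObjects structure that the boto3 Data Pipeline client accepts. The ids and bucket paths are made up, and for simplicity it shows an S3-to-S3 copy activity rather than the DynamoDB export used later in the demo (the console's export template is built around an EMR activity instead); treat the field names as a guide and verify them against the Data Pipeline documentation.

```python
# Illustrative only: how data nodes, a resource, a schedule and an activity
# reference each other in a pipeline definition (boto3 pipelineObjects format).
pipeline_objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
    ]},
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [   # schedule
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 days"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]},
    {"id": "InputNode", "name": "InputNode", "fields": [           # data node (input)
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://my-input-bucket/raw/"},
        {"key": "schedule", "refValue": "DailySchedule"},
    ]},
    {"id": "OutputNode", "name": "OutputNode", "fields": [         # data node (output)
        {"key": "type", "stringValue": "S3DataNode"},
        {"key": "directoryPath", "stringValue": "s3://my-output-bucket/processed/"},
        {"key": "schedule", "refValue": "DailySchedule"},
    ]},
    {"id": "Worker", "name": "Worker", "fields": [                 # compute resource
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "terminateAfter", "stringValue": "1 Hour"},
        {"key": "schedule", "refValue": "DailySchedule"},
    ]},
    {"id": "CopyJob", "name": "CopyJob", "fields": [               # activity
        {"key": "type", "stringValue": "CopyActivity"},
        {"key": "input", "refValue": "InputNode"},
        {"key": "output", "refValue": "OutputNode"},
        {"key": "runsOn", "refValue": "Worker"},
        {"key": "schedule", "refValue": "DailySchedule"},
    ]},
]
```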

All right, so now we are going to see a small hands-on with Data Pipeline, and for that we can go back to our console. To work with Data Pipeline we will first create a simple NoSQL table using DynamoDB, the NoSQL database service. To work with any service we can search for it by name, so here we search for DynamoDB, open the service, and click on "Create table". (As an aside, to see the list of all the services offered under the AWS free tier, we can navigate to aws.amazon.com/free; there we can see everything included in the free tier account, for example 750 hours per month of EC2 as part of the free tier access.)

For the DynamoDB table we can use any table name. Let's create a simple employee table called "emp table" and define "employee id" as the partition key, that is, the primary key. With the default configuration we can click on Create, and then we can start defining the remaining components one by one.
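
The same table could also be created from code. Here is a rough boto3 sketch, assuming a table named EmpTable with a string partition key EmployeeID in us-east-1; the names simply mirror the demo and are otherwise arbitrary.

```python
import boto3

# Hypothetical names mirroring the demo; adjust the region to where you work.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="EmpTable",
    AttributeDefinitions=[{"AttributeName": "EmployeeID", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "EmployeeID", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},  # roughly the console defaults
)

# Block until the table is ACTIVE before adding items to it.
dynamodb.get_waiter("table_exists").wait(TableName="EmpTable")
```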

We are also going to create an S3 bucket into which the DynamoDB table's data will be copied. Once we have the table defined, we navigate to S3 and create one bucket there. S3 is the Simple Storage Service; in simple layman's terms it is a bit like Google Drive, where we can store any amount of data and then share it with anyone as and when required. Let's name the bucket something like "employee data"; bucket names have to be globally unique. We also deploy it in the same region in which we created the DynamoDB table, which was North Virginia. AWS resources are divided into regions and availability zones, and we can choose the region in which each resource is deployed, so let's create the bucket in North Virginia as well, scroll down, and click on "Create bucket". As you can see, a bucket has been created with the name "employee data 11".

So now we have the DynamoDB table defined and the S3 bucket ready for copying the DynamoDB table's data, and we can start working with AWS Data Pipeline. We search for the Data Pipeline service and open it. To create a data pipeline we give it a suitable name and an appropriate description, specify the source and destination data nodes, and then schedule and activate the pipeline. So we click on "Get started now". First we define the name of this pipeline, let's say "pipeline one", and then the source. There are multiple sources we can choose from; we want to export a DynamoDB table to S3, which is exactly why we created the table and the S3 bucket.

Before we do that, let's add some items to the DynamoDB table so that we can actually see data being exported to S3 by our data pipeline, because currently there is no data stored in DynamoDB. We go back to DynamoDB, open Tables, select the table we created, and move to Items. Since DynamoDB is a NoSQL database service we cannot run SQL queries against it; we work with it through the console, the CLI (command line interface) or the SDKs. Here we click on "Add item". We defined "employee id" to be the primary key, so suppose we set the employee id to 1201; then we click Insert, choose the data type, a string, and add an attribute "employee name" with the value "Helen"; then Insert again, another string, "department", with the value "CS", and click Save. So now we have an employee id, a department and an employee name defined for this item.

If we want to create more values without starting from scratch, we can duplicate the item: give it a different id, say 1209, change the name, and keep the department as CS. We replicate it once more, change the department to Finance, set the employee id to 1205 and the name to George, and click Save. So now we have three different items, 1201, 1205 and 1209, and this is the table that we are going to export into the S3 bucket we created.
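
For reference, the items and the bucket we just created through the console could be written with boto3 along these lines. The item values mirror the demo (the second employee's name was not clear in the recording, so it is a placeholder), and the bucket name is only an example, since bucket names must be globally unique.

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# The three demo items; the second employee's name is a placeholder.
employees = [
    {"EmployeeID": "1201", "EmployeeName": "Helen",     "Department": "CS"},
    {"EmployeeID": "1209", "EmployeeName": "Employee2", "Department": "CS"},
    {"EmployeeID": "1205", "EmployeeName": "George",    "Department": "Finance"},
]
for emp in employees:
    # Low-level client syntax: every attribute carries an explicit type ("S" = string).
    dynamodb.put_item(
        TableName="EmpTable",
        Item={key: {"S": value} for key, value in emp.items()},
    )

# The export bucket. Pick your own globally unique name; in us-east-1,
# create_bucket is called without a LocationConstraint.
s3.create_bucket(Bucket="emp-data-11-example")
```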

All right, so now let's create the data pipeline. Here we can choose a template; we are going to use a template for exporting data to S3. The available templates include "Export DynamoDB table to S3", importing DynamoDB backup data from S3 (the other direction), running a job on Elastic MapReduce, and taking a full copy of an RDS MySQL table to S3 or importing it back into RDS MySQL. We want to export a DynamoDB table to S3, so we select "Build using a template" and choose that template. Then we define the DynamoDB table name; our table name is "emp table", so that is what we enter. Next is the S3 output folder, where we select the bucket we created, "employee data 11". We set the read throughput ratio, which we can keep at 0.25, and the region of the DynamoDB table, which is us-east-1 because it was deployed in North Virginia.

Then we define the schedule, which means deciding when the job should run; for example, we could have the job execute on a daily basis at a given time and end at a specific time. For now let's run it on pipeline activation, since this is the first time. If we want all the logs to be recorded we can keep logging enabled and choose where the logs should be delivered; for now we can keep it disabled.

Then we define security and access. IAM is the Identity and Access Management service, through which we can define users, groups, roles and policies so that we have full control over every user and resource we deploy in AWS. With IAM we define permissions, that is, what kind of access a given IAM user has, so that they cannot perform activities they have not been authorized for, and in the same way we grant access between resources. For example, if we deploy a website on EC2 as a virtual server and we want that website to fetch data from a database deployed in the RDS service, then whenever one resource interacts with another we have to define permissions using the concept of roles. This entire permission setup is defined using IAM; here we can keep the default for now.

Tags are like common labels that we can attach to any of the resources we create in AWS. For example, suppose we deploy an application on AWS that uses 15 or 20 different resources and later want to filter out those resources; we can do that using tags. Or say we have deployed 500 data pipelines and want to find just the ones created for DynamoDB for a specific project: we can add tags while creating the resources so that we can use them to filter the resources at any point in time. That is what we mean by tags here.
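
The console template does most of this for us, but the create-and-tag steps can also be sketched with boto3 as below. The pipeline name, uniqueId and tag values are illustrative, and the export definition that the template generates in the console would still have to be supplied separately through put_pipeline_definition.

```python
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create the pipeline shell; uniqueId makes the call idempotent across retries.
pipeline_id = dp.create_pipeline(
    name="pipeline-one",
    uniqueId="pipeline-one-demo-001",
    description="Export EmpTable from DynamoDB to S3",
)["pipelineId"]

# Tags are just key/value labels used later for filtering resources.
dp.add_tags(
    pipelineId=pipeline_id,
    tags=[{"key": "project", "value": "dynamodb-export-demo"}],
)

# In the console the "Export DynamoDB table to S3" template generates the pipeline
# definition; via the API we would now supply equivalent objects with
# dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=[...]).
```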

We can clear the tag for now. Once we are done, we simply click on Activate, and this creates the data pipeline. If there are errors we will see exactly what they are, and as you can see it says the role is not present. The role is normally created automatically, but if it is not there we have to create it ourselves. So let's go back to the pipeline section where this pipeline is defined and do a quick recap: we define the name of the pipeline as "pipeline one", use the "Export DynamoDB table to S3" template, and this time we also choose the roles. As we have discussed, roles are used to grant one resource access to interact with another; here DynamoDB is going to interact with S3 so that it can export its data and store it in the S3 bucket we defined, and that is why we need to define the roles. We can either define a custom role or use the default one offered here; we choose the default DataPipelineDefaultRole. We set the table name to "emp table", choose the bucket "employee data 11" in us-east-1, run the pipeline on activation, and click on Activate.

As you can see, the data pipeline is now being created; the pipeline is active and is currently waiting on its dependencies so that it can first validate that all the components are valid and up and running. After a few minutes the status will change to running, and at that point, if we go to the EC2 console, we will see that two instances have been created automatically; they are launched by the EMR cluster triggered by the pipeline. So we may have to wait a couple of minutes for everything to be up and running. If we go to the EC2 console we can indeed see these two instances, launched as part of the EMR cluster that the Data Pipeline service spun up. Once the status changes to running, the data is captured through EMR and stored in the S3 bucket, and once the pipeline says FINISHED we can navigate back to our S3 bucket.
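
If we wanted to drive the activation from code instead of the console, a rough boto3 sketch would look like the following. The pipeline id is a placeholder, and the '@pipelineState' key reflects our reading of the describe_pipelines response, so double-check it against the API reference.

```python
import time
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")
pipeline_id = "df-EXAMPLE1234567"  # placeholder for the id returned by create_pipeline

dp.activate_pipeline(pipelineId=pipeline_id)

# Poll the pipeline description until it moves out of its initial states.
while True:
    description = dp.describe_pipelines(pipelineIds=[pipeline_id])
    fields = description["pipelineDescriptionList"][0]["fields"]
    state = next(
        (f.get("stringValue") for f in fields if f["key"] == "@pipelineState"),
        "UNKNOWN",
    )
    print("Pipeline state:", state)
    if state not in ("PENDING", "ACTIVATING"):
        break
    time.sleep(30)
```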

If we refresh the S3 bucket, we will see that files have been delivered into it. As you can see, there is a manifest file and a _SUCCESS file, which are part of the export process, and there is also a text file that contains the data exported out of DynamoDB. That is how it works. Thank you so much for joining, and have a great day ahead. Take care, bye-bye.
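
To confirm the export without opening the console, we could list the bucket with boto3 roughly like this, again assuming the example bucket name used above.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# List what the export wrote into the bucket: typically a manifest file,
# a _SUCCESS marker and one or more files containing the exported items.
response = s3.list_objects_v2(Bucket="emp-data-11-example")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```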
Info
Channel: edureka!
Views: 17,472
Rating: 4.7004404 out of 5
Keywords: yt:cc=on, AWS Data Pipeline, AWS Data Pipeline Tutorial, aws data pipeline emr, AWS Data Pipeline EMR example, AWS Data Pipeline S3 to Redshift, AWS Data Pipeline Example, AWS Data Pipeline S3, AWS Data Pipeline Tutorial RDS to S3, Data Orchestration Service, Cloud-Based Data Workflow Service, AWS Data Pipeline Cloudformation, aws data pipeline architecture, aws data pipeline demo, aws data pipeline demo tutorial, AWS Certification Training edureka, aws data pipeline edureka
Id: 1nhy4kMwo8E
Length: 37min 5sec (2225 seconds)
Published: Thu Jan 21 2021