Twitter Data Pipeline using Airflow for Beginners | Data Engineering Project

Captions
This is what we are going to build in this project: we will extract data from Twitter using Python, transform it, and then deploy our code onto Airflow running on an EC2 machine on Amazon Web Services. At the end we will store the data in Amazon S3, which is an object store. There is a lot to do here, and we will do everything step by step.

I will divide this video into three sections. In the first section we will talk about the prerequisites for this project. In the second section we will cover the basic concepts you need, so we will understand the basics of Airflow and a few other things. In the third section we will do the actual execution of the project. Before we begin: if you are new here, don't forget to hit the subscribe button, and if you learn something from this video, hit the like button so the channel can grow and reach more people.

Let's start with the prerequisites. First, you need a laptop and an internet connection. It doesn't matter what kind of laptop you have, because we will be working on EC2 machines, so you don't need high-end hardware. Second, you need Python installed on your local computer; anything above version 3.5 works fine. If you don't have Python installed, I will put a tutorial link in the description so you can check that out. You also need to know the basics of Python, because this tutorial focuses mainly on the Airflow part and I won't go deep into Python itself. You also need an AWS account. You can create one for free: go to the AWS website and fill in some information; AWS offers a free tier for 12 months. I will put a tutorial link for that in the description as well. Lastly, you need discipline to work on this project. Whenever you start learning something it is difficult at the beginning, and I am 100% sure you will face errors while executing this project. Believe that you can do it, use Google and Stack Overflow to research your errors, and if you still can't solve them you can join the Discord channel (link in the description) and ask your questions there.

Now let's go a little deeper into the architecture diagram. The first part, extracting data from Twitter, is pretty easy: Twitter provides an API for this, so we just need to sign up and get API credentials. In Python we have a package called Tweepy that we can use to extract data from Twitter. We will also use pandas to work with the data, transform it, and store it somewhere, and at the end we will deploy that code onto Airflow.
So now let's try to understand the basic concepts of Airflow before executing the project. Apache Airflow is an open-source workflow management platform for data engineering pipelines. It was developed at Airbnb, where it was used to manage their workflows and data engineering tasks; it later joined the Apache Incubator and became open source, so now everyone can use it. In simpler terms, Airflow is a workflow orchestration tool: you can build, schedule, and monitor data pipelines. A workflow is basically a sequence of tasks, and in Airflow it is defined as a directed acyclic graph, or DAG. The individual nodes in that graph are called tasks, and a task is a unit of work you want to perform. If you want to extract data from an API, that is one task; if you want to transform that data, that is another task; and if you want to store the data somewhere, that is a third task. The sequence of these operations is the DAG, and one DAG can contain multiple tasks. These tasks are built using operators. Operators are like predefined templates you use to create tasks; they define what actually happens inside an individual task. There are many operators available in Airflow, such as the BashOperator if you want to run a bash command and the PythonOperator if you want to run Python code, and you can also create your own operators.
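To make that idea concrete, here is a minimal sketch (not this project's DAG) of a DAG with two tasks built from operators. The DAG name, task names, and function are made up for illustration, and the import paths assume Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def say_hello():
    # a trivial Python callable, just to show how a PythonOperator wraps a function
    print("hello from a python task")


with DAG(
    dag_id="example_two_tasks",          # illustrative name, not the project DAG
    start_date=datetime(2022, 9, 1),
    schedule_interval=None,              # only run when triggered manually
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    transform = PythonOperator(task_id="transform", python_callable=say_hello)

    extract >> transform                 # run "extract" first, then "transform"
```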
Instead of focusing only on the theory, let's go to the Airflow console and look at it. This is what the Airflow UI looks like: you have the DAGs, and these example DAGs ship with Airflow when you install it, along with other information about each one. Let's open one of them, for example the example task group DAG. You can see a lot of things here: if you want to see the code, click on Code, and if you want the graphical representation, click on Graph, which shows the different tasks inside that DAG. The whole thing is the DAG; the individual boxes are the tasks.

Now let's look at the DAG we will be creating in this project. Don't worry about writing this code yet, because we will write it in the execution section; just try to follow what it does. We import some date and time functions from Python, and to define a DAG you need to import the DAG class from the airflow package. We will use the PythonOperator because we will be calling Python functions, and there are a few other packages we need.

Then we define some default arguments. The owner is useful if you work in a team and want to keep track of who wrote a particular DAG. There is also a dependency flag: if you want a run to wait until the previous run has completed, you can set it to true, but in our case we have only one simple DAG, so you don't really need to worry about it. Then there is the start date, an email address if you want notifications on failure or retry, and retries, so if your pipeline fails for some reason it can be re-initiated. If you want to understand these arguments better, the Airflow documentation explains how DAGs and the default arguments work.

After that comes the DAG itself. It takes three things: the DAG name (it can be anything, even my own name, but you should use a logical name), the default arguments we just defined, and optionally a description. Then we use the PythonOperator; this is important because it is what calls our Python function. The task ID is again up to you, but keep it logical. The python_callable is where you provide the function: we import that function from another file (we will look at it in the execution part), and the operator calls it. We also pass the DAG to the operator, and at the bottom we write the operator name to register the task in the DAG.

All of this is kept as simple as possible for beginners; if you want to go deeper I will create more tutorials in the future. We will write all of this code ourselves in the next section, step by step, so don't get scared if something doesn't make sense yet; everything will make sense once you start doing the hands-on work.
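For reference while following that explanation, here is a sketch of the kind of DAG file being described. The exact names, dates, and email address are illustrative, and the PythonOperator import path assumes Airflow 2.x; we write the real twitter_dag.py later in the video:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

from twitter_etl import run_twitter_etl   # the ETL function we will write in twitter_etl.py

default_args = {
    "owner": "airflow",
    "depends_on_past": False,              # don't wait for the previous run
    "start_date": datetime(2022, 9, 20),
    "email": ["airflow@example.com"],      # placeholder address
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

dag = DAG(
    "twitter_dag",                         # the name shown in the Airflow UI
    default_args=default_args,
    description="Twitter ETL pipeline",
)

run_etl = PythonOperator(
    task_id="complete_twitter_etl",        # any logical name works
    python_callable=run_twitter_etl,       # the Python function this task calls
    dag=dag,
)

run_etl
```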
So now let's jump into the actual project. The first step is to get access to the Twitter API. Search for "Twitter API" on Google (I'll also put the link in the description) and open the first result. All you need is a Twitter account; if you don't have one, create one, then sign in with it and you get access to the developer dashboard. Here you create an application: click Add App, click Create New, select the development environment, click Next, and give it a name like "etl-airflow-project" (in my case I'll add "yt" because it's a YouTube project). Click Next and it will show you two keys: your API key and your API secret key. Make sure you copy these and save them somewhere, because you won't be able to see them again. I will revoke access to these keys after I finish the tutorial, so don't worry about mine being leaked, but make sure you create your own API key and secret key. After that, go back to the dashboard, open your application, click App Settings, and go to Keys and Tokens. Here you need to generate the access token, which you also need in order to call the Twitter API, so click Generate. You will see the API key you created earlier plus the access token and its secret; copy those two values as well, because we will need them.

Once you have your keys, it's time to start writing the code that pulls data from the Twitter API. Create a folder for the project; in my Documents I'll create a folder called airflow_project and save all of my files there. You can use any editor you are comfortable with; I will use Visual Studio Code to write my Python code. The first file will be twitter_etl.py, because we will be pulling data from Twitter and doing a basic ETL, which is extract, transform, load.

Before writing code, make sure you have a few packages installed on your PC. Open a terminal (Command Prompt if you are on Windows). I'm using Python 3, so I'll use pip3; depending on how Python is set up on your machine, either pip or pip3 will work, so try both. Run pip3 install pandas, because we will be working with DataFrames; I already have it installed because I have worked on this project before. Then install tweepy with pip3 install tweepy; this is the package we will use to access Twitter data. Finally, install s3fs, which is used to read and write data from an S3 bucket. Once those are installed, you are ready to start writing the code.
First we import the packages. I assume you already have a basic understanding of Python, because this is not a Python tutorial; it's a slightly more advanced tutorial about Airflow. After the imports, paste in the API key and API secret you copied first, and then the access token and its secret that you generated afterwards. Again, don't worry about my keys being leaked, because I will revoke access after completing this tutorial; just make sure you use your own keys to access your own data.

With the keys in place, we use functions from the tweepy package to do the authentication, because the first step is to create a connection between our code and the Twitter API. For authentication and authorization there is a function in tweepy called OAuthHandler: you give it your API key and secret, and then you set the access token on it. This creates the connection between your code and the Twitter API so that you can access data. Once you have that, you create the API object with tweepy.API(auth). If you want to understand these functions in more detail, the tweepy documentation explains how each API call works; I don't want to spend time repeating the documentation here, because in this project we want to focus on the execution side.

Now that we have the API object, we can call whatever we want on it. To get the tweets on a user's timeline there is a function called user_timeline, so we call api.user_timeline with a few parameters. The first is screen_name, the username you want to extract data from; in this case we'll use Elon Musk, but you could put your own username (or anyone else's) and you would get the data from that account instead. Then count is how many tweets you want; there might be thousands available, but I just want 200 for now. include_rts controls retweets: if Elon Musk retweets someone else's tweet and you want those too, keep it True, but we will set it to False because we don't want retweets. And tweet_mode='extended' gives you the full tweet text. There are other parameters you can explore in the Twitter API, but this is the basic version. After writing this you should be able to extract data from the Twitter API. To test whether the code and the authentication are working, just add a print of the tweets at the end.
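Here is a minimal sketch of that extraction step, assuming tweepy is installed. The credential strings are placeholders you replace with your own keys, and the variable names are just illustrative:

```python
import tweepy

# placeholders -- use the API key/secret and access token/secret from the developer portal
API_KEY = "YOUR_API_KEY"
API_SECRET = "YOUR_API_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_SECRET = "YOUR_ACCESS_TOKEN_SECRET"

# authenticate: connect this script to the Twitter API
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
api = tweepy.API(auth)

# pull up to 200 tweets from one user's timeline, skipping retweets,
# with tweet_mode="extended" so we get the full tweet text
tweets = api.user_timeline(
    screen_name="@elonmusk",
    count=200,
    include_rts=False,
    tweet_mode="extended",
)

print(tweets)   # quick check that the connection works
```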
Open the terminal and go to the folder where the file is stored; in my case that is Documents and then airflow_project, so I change directory into it, and you can see the project there. Now run the code with python3 twitter_etl.py; in my case Python 3 is installed under a different path, so I use python3, but try python or python3, whichever works for you. As you can see, we get a lot of data back. These are the tweets as JSON, and it's hard to read because it contains a lot of information: image URLs, the date when each tweet was sent, and many other fields. What this output tells us is that our connection to Twitter is successful and we can easily extract data from the Twitter API.

Now we want to transform this data into a proper structure. For now we will save it to local disk; later we will store it in S3. I have already written the transformation code, and if you know basic Python it is simple to follow: we loop through each individual tweet and extract the fields we care about. For the username we use tweet.user.screen_name, because the user object is nested inside the tweet JSON. Then we take the tweet text, how many people liked the tweet (the favorite count), how many people retweeted it, and when it was created. Each of these goes into a dictionary, and the dictionary is appended to a list. Then we write the file: pd.DataFrame creates a DataFrame from that list, and df.to_csv, a function available in pandas, writes it out; I'll call the file elonmusk_twitter_data.csv. Save it and run the code again. Let's hope we don't get any errors... and the code ran successfully. If you go to the folder you can see the CSV file with the data: the user is elonmusk, the text column has each individual tweet he posted recently (around 192 of them in this case, since that's what the API returned), along with the favorite count, the retweet count, and when each tweet was posted. A lot of information, extracted easily.

So that is the first step. If you got this far, congratulations: we extracted data from Twitter, did a basic transformation into a DataFrame, and stored it in a local file.
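Putting the two pieces together, twitter_etl.py at this stage might look roughly like this (a sketch, not the exact on-screen code). The credential placeholders are the same as above, and the text field assumes tweet_mode="extended", which exposes the full text as full_text:

```python
import tweepy
import pandas as pd

# placeholders -- replace with your own credentials
auth = tweepy.OAuthHandler("YOUR_API_KEY", "YOUR_API_SECRET")
auth.set_access_token("YOUR_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

tweets = api.user_timeline(
    screen_name="@elonmusk", count=200, include_rts=False, tweet_mode="extended"
)

# transform: keep only the fields we care about
tweet_list = []
for tweet in tweets:
    refined_tweet = {
        "user": tweet.user.screen_name,        # nested inside the tweet JSON
        "text": tweet.full_text,               # full_text because tweet_mode="extended"
        "favorite_count": tweet.favorite_count,
        "retweet_count": tweet.retweet_count,
        "created_at": tweet.created_at,
    }
    tweet_list.append(refined_tweet)

# load: write the result to a local CSV for now (S3 comes later)
df = pd.DataFrame(tweet_list)
df.to_csv("elonmusk_twitter_data.csv")
```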
We now have our ETL script ready: it extracts data from Twitter, does a basic transformation into a DataFrame, and stores it in a local file. The second step in this project is to create an EC2 instance, which is an online machine, and deploy Airflow on it. So let's start with that. Log in to your AWS account and go to EC2. Make sure the region you selected is near your location: if you live in the US, pick one of the US regions; if you live in India, pick the Mumbai region; if you are somewhere in Europe, pick a region accordingly, because that will be faster for you. Click Launch Instance and give the instance a name such as airflow-test-project (you can use whatever name you want). Make sure you select Ubuntu as the operating system. For the instance type, t2.micro should work if you are on the free tier, but Airflow has some basic resource requirements, so I'm going to go with t3.medium, which has 2 vCPUs and 4 GB of memory. You will get charged if you use t3.medium, so only do that if you are okay with it; otherwise start with t2.micro to stay inside the free tier, and change the instance type later if it doesn't work for you.

Next you need to create a key pair; this key is how you will access the EC2 instance. Click Create New Key Pair, give it a name such as airflow_ec2_key, and create it; keep everything else as it is. The key gets downloaded to your Downloads folder; copy it into the airflow_project folder. After that, make sure you allow HTTPS and HTTP traffic, just to be on the safe side, then click Launch Instance. It takes around one to two minutes to create the instance; once the instance state shows Running, your instance is ready and you can move forward.

Now we need to connect to this instance, because we will be installing Airflow on it. Select the instance, click Connect, and use the SSH client tab; AWS automatically generates an example ssh command for you, so copy it. Go to your terminal, make sure you are in the folder where the key you downloaded is stored, and paste the command. If you see the ubuntu prompt, you are connected to the EC2 machine.
The next step is to install a bunch of packages and update the machine. These are the commands you need to run, and I will put the document link in the description so you can copy them. First run sudo apt update, which updates the package lists on the Ubuntu machine; if it asks you to confirm, type yes and hit Enter. Second, install pip for Python 3 on the machine. Third, install Apache Airflow itself; this pulls in all of the packages Airflow needs. Once Airflow is installed, install the same packages we installed on the local machine: use sudo pip to install pandas, and then s3fs. Run these commands one by one, and once everything installs successfully, type clear to clean up the screen, then type airflow to check that it is actually installed; if you see the Airflow help output, it is installed.

To run the Airflow server, type airflow standalone; this is the command that runs the entire Airflow development server, so it starts launching now. Once it says Airflow is ready, copy the admin username and password it prints, because you won't be able to see them again. Then go back to your instance, copy the public DNS, paste it into a browser tab, and add :8080 at the end, because Airflow runs on port 8080.

At first we can't connect to the Airflow server, so what might be the reason? If you paste the EC2 address with port 8080 and can't see the Airflow console, it's a security issue. Go to your instance, scroll down to the Security section, and click on the security group. A security group is where you allow certain IPs access to certain protocols. Click Edit Inbound Rules, click Add Rule, and for the sake of the tutorial select All Traffic and either My IP or Anywhere-IPv4, then save the rules. This is not best practice and you should not do this in actual production code, but we are just trying to get the project working as quickly as possible so you can understand how the process works. After editing the inbound rules, you should be able to reach the Airflow login page.
Log in with the admin username and password you got from the console and click Sign In. This is the actual Airflow console: you can see the DAGs, Security, Browse, and Admin menus. If you go back to the terminal you will see a lot of requests being logged, because the server is running on that machine and you are accessing it. Congratulations: if you see this, Airflow is installed on your machine.

At the start of this tutorial we already covered the concepts of DAGs and tasks, so we won't go over that again; now let's write our own DAG in Python. The first thing we want to do is convert the whole ETL script into a function, so that the DAG can call it. In our ETL code we create one function named run_twitter_etl, then select the entire script and hit Tab so everything is properly indented inside the function. This is basic Python (again, if you don't know Python, go learn the basics first): all we did is take the code we wrote for the Twitter ETL and put it inside one function, so we can easily import and call that function from our DAG.
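As a sketch, twitter_etl.py now looks like the same code wrapped in one function. The credential strings are still placeholders, and the output path stays local for the moment; it switches to S3 shortly:

```python
# twitter_etl.py -- the script from before, now wrapped in a function the DAG can import
import tweepy
import pandas as pd


def run_twitter_etl():
    auth = tweepy.OAuthHandler("YOUR_API_KEY", "YOUR_API_SECRET")
    auth.set_access_token("YOUR_ACCESS_TOKEN", "YOUR_ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth)

    tweets = api.user_timeline(
        screen_name="@elonmusk", count=200, include_rts=False, tweet_mode="extended"
    )

    tweet_list = [
        {
            "user": t.user.screen_name,
            "text": t.full_text,
            "favorite_count": t.favorite_count,
            "retweet_count": t.retweet_count,
            "created_at": t.created_at,
        }
        for t in tweets
    ]

    # still a local file here; we swap this for an s3:// path once the bucket exists
    pd.DataFrame(tweet_list).to_csv("elonmusk_twitter_data.csv")
```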
Now we create the actual Twitter DAG. Click New File and name it twitter_dag.py; this is the DAG file. We start by importing the packages we need: the date and time helpers such as timedelta and datetime (you will see why we import them), the DAG class from airflow, and the PythonOperator, which is used to run Python code and which we already covered at the start of the tutorial, plus a couple of other packages you can research if you want more detail. Then we import our ETL function into the DAG file: from twitter_etl import run_twitter_etl, which brings that whole function into this script.

Next we define the parameters the DAG needs. We already went through the default arguments: the owner, the start date, the email, and whatever other arguments you want to set. Once you have the default arguments, creating the DAG is pretty simple: pass the DAG name (which is what shows up in the UI), the default arguments we just created, and a description. After that we use the PythonOperator to run the task. Give it a task ID such as complete_twitter_etl; it can be anything, but keep it logical. The important part is python_callable, which is the function you want to call, so make sure it is the run_twitter_etl function we imported from twitter_etl.py, and dag is the DAG we created above. At the end, write run_etl on its own line. If you don't understand any of this you can ask questions in the comments and read the documentation (I will provide some tutorial links too), but this is how you create a basic DAG in Airflow, and it follows the same structure we sketched earlier.

One more thing we want to do is save the DataFrame to an S3 bucket instead of a local file, so we need to create an S3 bucket. Go to the S3 console and click Create Bucket if you don't have one. Bucket names are globally unique, so you won't be able to copy my name; I'm going to use my own name, something like darshil-airflow-youtube-bucket. Note that you can't use underscores or spaces in a bucket name, so use hyphens instead. Keep everything else as it is and click Create Bucket, and you should see the new bucket in the list. To store our data in S3, all we have to do is change the path in to_csv to s3:// followed by your bucket name and then the file name; that will write the file we create straight into the S3 bucket.
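Concretely, the only line that changes is the output path inside the run_twitter_etl function sketched earlier. The bucket name below is a placeholder for the one you just created, and s3fs (installed earlier) is what lets pandas write directly to an s3:// path:

```python
    # inside run_twitter_etl(): write to S3 instead of a local file
    pd.DataFrame(tweet_list).to_csv("s3://your-airflow-bucket/elonmusk_twitter_data.csv")
```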
After you make these two changes, we need to deploy the code onto Airflow, so make sure both files are ready. Hopefully this runs on the first try; if it doesn't, we will hit some errors, but facing errors is good, it teaches you a lot. We already have the Airflow server running in one terminal, so open a new terminal window and repeat the same process to connect to the EC2 instance: go to the project folder (Documents/airflow_project, where the private key is), go back to the EC2 console to copy the SSH connection string, paste it, and you should have the connection. At this stage of the tutorial it should feel pretty easy.

Now we need to make a few changes on the Airflow side. If you run ls you should see the airflow folder; cd airflow, and inside it there is a file called airflow.cfg. Run sudo nano airflow.cfg and change one thing: the dags folder path. Instead of the default dags folder, point it at a folder called twitter_dag; don't make any other changes. Press Ctrl+X, confirm with y to save the modified buffer, and hit Enter. If you open the file again with sudo nano you can confirm the change. Make sure it points at the proper path, because we are about to create that folder. Clear the screen and create the new folder with mkdir twitter_dag; this is where we want to store all of our Python files. If you run ls you should see the folder, and its name has to match the path stored in airflow.cfg, so adjust accordingly. Now cd into the twitter_dag folder we just created.

We currently have two files, twitter_dag.py and twitter_etl.py, and they live on our local computer, so we need to copy them onto the EC2 machine. Copy the DAG file first: run sudo nano twitter_dag.py (sudo so it runs as the superuser), which creates the new file, paste the contents from your local file, then Ctrl+X and save. If you run ls you should see twitter_dag.py. Repeat the same procedure for the ETL file: sudo nano twitter_etl.py, hit Enter, paste the ETL code, then Ctrl+X, yes, and save. At the end of this you should have both files copied from your local computer to the EC2 machine.

Now let's go back to the Airflow UI. We still can't see our DAG in the list, so something is wrong; let's check. The twitter_dag file itself looks fine, so open airflow.cfg again and look at the path. Our folder is named twitter_dag, but the path in the config still had an extra "s" on it, so remove the "s" so the path exactly matches the folder name. If the folder name and the path in the cfg file are not the same, the DAG won't show up. After making that change, stop the Airflow server with Ctrl+C (it shuts the whole server down, which is fine since this is only a development server) and run airflow standalone again; you stop the server and start it again so it picks up the change. Once the server restarts, the DAG count goes up, from 32 to 33 in my case, and our twitter_dag appears. Click on it and you can see the code we wrote earlier. To run it, open the Graph view.
In the graph view, click the run button to trigger the twitter_dag, and it starts running. There are several states a task can be in, shown by the colors we already discussed: green means it is running, the brownish one means it is queued, yellow means it is up for retry (it keeps retrying until it succeeds), and red means it has failed. Right now we are seeing yellow, so let's find out what is really happening inside the code: click on the task, go to Logs, and look at the error. The error says access was denied when calling the CreateBucket operation, so we don't have permission to write from this EC2 machine to the S3 bucket, and we need to grant that permission. Go to your EC2 instance, click Actions, then Security, then Modify IAM Role. We have to create an IAM role, which basically means giving the EC2 machine access to make changes to the S3 bucket based on the permissions you put inside the role. Click Create IAM Role, which takes you to the IAM service in AWS. Click Create Role, keep AWS Service, select EC2, and click Next. Search for "s3" and select S3 full access, then search for "ec2" and select EC2 full access; we are giving it those two permissions. Click Next, name the role ec2-s3-airflow-role, and click Create Role; it creates the new role for you. Go back to the Modify IAM Role page on the EC2 instance, refresh, select ec2-s3-airflow-role, and update the IAM role. This gives the EC2 machine permission to access the S3 bucket.

Now run the DAG again. If you see the green mark, the run succeeded; you can click on Logs and check: if you see the task being marked as SUCCESS, your code ran successfully. To validate it, go to the S3 bucket, refresh it, and check whether the data is there. It is, so you can download the file and check the results: we got the same tweets we got on the local machine, but this time the code ran from Airflow.

So that is how you get started with Airflow. This was a basic tutorial and my first tutorial on Airflow; I just wanted you to understand these things at a high level. In future videos I will create dedicated Airflow tutorials where we deep dive into Airflow topics, how the Airflow UI works, and the other details; this video was about building a data pipeline from Twitter to an S3 bucket using Airflow. First of all, congratulations on completing this project.
I hope you learned something new in this project, and if you did, make sure you hit the like button. If you are stuck on an error or didn't understand a concept, leave a comment or join the Discord channel to discuss it. And if you are new here, don't forget to subscribe to the channel for more of this kind of content. Thank you for watching; see you in the next video.
Info
Channel: Darshil Parmar
Views: 285,174
Keywords: apache airflow, data engineering, data engineer, airflow for beginners, airflow tutorial, data engineering course, data engineering tutorials, apache airflow tutorial for beginners, airflow installation, airflow, airflow tutorial for beginners, etl process, data engineering project, airflow project, etl project, apache airflow demo, learn airflow for free, learn airflow, what is airflow, airflow course, apache airflow tutorial, data engineering projects
Id: q8q3OFFfY6c
Length: 40min 21sec (2421 seconds)
Published: Tue Sep 20 2022