Airflow for Beginners - Run Spotify ETL Job in 15 minutes!

Video Statistics and Information

Captions
Hi guys, I'm Karolina and I work with data. Whenever I ask you guys a question, it always feels like a mini US election, so much agreement. Still, the majority of you wanted to see how to schedule jobs in Airflow, so this is exactly what we're going to do today. In this video I'm going to show you how to set up Airflow from scratch and how to schedule a job that runs an ETL process for Spotify data. If you have your own code to schedule, that's great; if you don't, feel free to follow my data engineering course for beginners, where I show you how to build a simple pipeline that uses the Spotify API.

This is going to be a very crude introduction just to get you started, but to be honest, configuring Airflow is actually the most difficult part. That's why there are quite a few tutorials out there that use Docker, a technology that is supposed to make configuration easier, so many people use it with Airflow. But if you are a total beginner, you don't want to overwhelm yourself with yet another technology, which is why I decided to base this tutorial purely on the Airflow documentation, which will hopefully be much more beginner friendly. All we need is a little Python and a positive attitude. Yes, we are going to fix a lot of errors before we see the light of Airflow, but where there's a will there's a way, and I know how much you want it, so let's get into it.

I'm creating a fresh virtual environment using the virtualenv module. Now let's just follow the steps listed on the quick start page of Airflow's website. First, we set the AIRFLOW_HOME variable to a subfolder of our home folder; this is where Airflow will live and where all its data gets saved. Then we install Airflow. Oh, what's this? I don't know what's wrong, but a quick Google search suggests that we need to install Airflow with some constraints, so let's do that: I'm uninstalling Airflow and reinstalling it with the suggested constraints file. What, again? A few modules seem to have the wrong versions. For example, Airflow says it needs a module called future between versions 0.16.0 and 0.17.0, but what we've got is 0.18.2. I'm being way too avant-garde for Airflow. Sometimes you have to take a leap of faith, so back to downgrading our future. Looks like we're done with the errors, for now anyway.

Now let's edit the config file in the ~/airflow directory. When you installed Airflow, the config file appeared there, and now you can open and edit it. I'm using nano, which lets you edit a text file within the command line; you can do it your own way, that's just the one I'm familiar with. There's also vim and emacs, and you could probably open it in some graphical editor, but it's simpler to type nano. The config file determines many important settings, but let's just look at the one that concerns us at the moment: we need to tell Airflow where to look for our code. So let's edit the dags_folder setting and point it to the location of our DAGs. I know you don't know what DAGs are yet, I'll get to that in a second, but for now let's just create a subfolder called dags in our code repo, where we keep our code. Can we start our Airflow web server now? Yay! Now, in a separate terminal, let's activate our virtual environment again and run the Airflow scheduler. If both the scheduler and the web server are running, we're all set.
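If you want to follow along in your own terminal, the setup boils down to something like the commands below. This is a sketch based on the Airflow quick start, not a transcript of the exact commands from the video: the Airflow version, the port, and the database-initialisation step are assumptions you should adapt to your own installation.

```bash
# fresh virtual environment (the video uses virtualenv; the built-in venv works the same way)
python3 -m venv airflow-venv
source airflow-venv/bin/activate

# tell Airflow where to keep its config, metadata database and logs
export AIRFLOW_HOME=~/airflow

# install Airflow with the constraints file recommended in the docs
# (pick the Airflow version you actually want; 1.10.12 was current around the time of the video)
AIRFLOW_VERSION=1.10.12
PYTHON_VERSION="$(python --version | cut -d ' ' -f 2 | cut -d '.' -f 1-2)"
pip install "apache-airflow==${AIRFLOW_VERSION}" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"

# point Airflow at your own code: open the config and set
#   dags_folder = /path/to/your/repo/dags
nano $AIRFLOW_HOME/airflow.cfg

# initialise the metadata database (not shown in the video; it is 'airflow db init' on Airflow 2.x),
# then start the web server, and the scheduler in a second terminal with the venv activated
airflow initdb
airflow webserver -p 8080
airflow scheduler
```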
So now open your favorite browser, type localhost (port 8080 by default), and voilà: that is the Airflow UI. Cool stuff. As you can see, there's one DAG here already; it's something I was playing around with while creating this video, so let's not worry about it for now. I just wanted to show you that Airflow is running, but it's time to write our own DAG for the Spotify pipeline in the dags subfolder.

First of all, let's clarify things: what is a DAG in Airflow? A DAG is a directed acyclic graph. If you've studied computer science or algorithms, you know that a directed acyclic graph is a representation of a series of events. It is directed because there are arrows: the flow goes in one direction, say from left to right. It is acyclic because there's no coming back, there can't be any cycles; for example, you couldn't draw an arrow from node three back to node two. Each DAG represents a collection of tasks that you want to run, meaning the code you want to run, organized in a way that reflects dependencies and relationships. Each circle in a DAG, or in proper terms each node, represents one task: one piece of code, one function that you want to run. Simple idea. The whole DAG is defined as a Python script, so let's write one.

The very first thing we have to do is create a Python file called spotify_dag.py that will hold our DAG code; let's place it in the newly created dags subfolder. First we need to import some modules. Secondly, let's create a dictionary called default_args. It holds the arguments, the parameters, that we want to pass to Airflow, such as the start date (by the way, use the corrected value of the start date shown on screen rather than what is originally in the video, because the start date should be hard-coded). Other parameters are whether we want to email a group of people if our DAG fails, which is something you might want in an organization where people need to be aware of whether the data is flowing or not, and the number of retries: if a DAG run fails, do we want to retry and re-run it? Here we set it to one.

Next we define our DAG. We give it a name, for example spotify_dag, we pass the default_args dictionary we've just defined, and we give it a description and a schedule interval that defines how often we want our program to run; in our case, daily. Then I'm defining a little helper function, just to show you what will happen in a second.

The third important part of the DAG file is defining your operators. Operators determine what actually gets done by a task; one operator equals one task in a DAG. In our case we are using the PythonOperator, but if you read the docs you'll see that there are plenty of other operators, for example the BashOperator, which allows you to run a bash script. Operators are usually atomic, which means they are a standalone piece of functionality and they don't share resources, data, arguments or variables with other operators. That means you cannot pass data from one operator to another. Well, you can, Airflow has a feature for cross-communication called XCom, but that should rather be avoided. So we call our task run_etl, give it a task_id, and in python_callable we pass the function I've just created. Lastly, and you might find this a little bit weird, we define the order of execution of the tasks by simply putting the task names at the bottom of the file. Here we're putting just run_etl; if we had more tasks, we would use arrows to specify the direction of the flow.
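Pieced together from the description above, the DAG file might look something like this. Treat it as a sketch rather than the exact code from the video: the start date, email address, description and task_id are placeholder values, and on Airflow 2.x the PythonOperator import path is airflow.operators.python instead of the 1.10-style path used here.

```python
# dags/spotify_dag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path

# arguments passed to every task in this DAG
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2020, 11, 8),   # hard-code the start date (placeholder value)
    'email': ['someone@example.com'],      # who to notify on failure (placeholder address)
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,                          # re-run a failed task once
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'spotify_dag',
    default_args=default_args,
    description='Simple Spotify ETL pipeline',
    schedule_interval=timedelta(days=1),   # run once a day
)

def just_a_little_function():
    # dummy callable used before we plug in the real ETL code
    print('Hello from the Spotify DAG!')

run_etl = PythonOperator(
    task_id='run_etl',
    python_callable=just_a_little_function,
    dag=dag,
)

# with a single task there is nothing to chain; with more tasks you would
# write something like: extract >> transform >> load
run_etl
```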
Okay, cool, so let's go to the web server to see what we've just done. You can see our new DAG, spotify_dag, has just appeared in the list of DAGs, so let's open it. There we are, in our newly created spotify_dag. Let's switch to the graph view; this is perhaps the most useful view, especially if you have plenty of tasks. A white border around a task means it's got no status: it's never been run. There are multiple tabs here you can explore; for example, the code tab shows you the code of our DAG, the code we've just written. When you click on a task, it gives you some options to run that individual task, but what we want to do is run the whole DAG, so let's trigger it. Okay, that was quick. A green border means success, so it ran successfully. Now let's open the logs to see what happened: as you can see, our little function has executed and we've got the output.

Now onto our ETL code. Previously we put the majority of our code in main; now let's create a separate function, call it run_spotify_etl, put everything that was in main inside that function, and delete main. At the top of the file we also had some global variables, so let's copy them and paste them inside the newly created function (by the way, we should make them lower case now, because they are not global variables anymore, but I'm just rushing to get you through this tutorial). Also don't forget to update your token from the Spotify website, because surely it has expired by now. Then let's rename the file to something more sensible than main, import the newly created function in our DAG file, and pass that function to python_callable instead of the dummy function I created earlier. Lastly, let's get rid of the validation step that was checking for dates, because we haven't thought about our schedule strategy yet. A minimal sketch of this refactor follows a little further down.

Okay, let's re-run the DAG. Hooray, a dark green border again: success. Let's see the logs again; as you can see, our ETL code executed. Now that is all great, but how can we know whether the data actually got downloaded? Did it even work? Let's head to DBeaver to inspect our database. Awesome: the data we've just downloaded from Spotify got appended to the table (and as you can see, I don't listen to Arctic Monkeys anymore). So now, as long as Airflow is running, the data will get automatically downloaded and saved to your database every day. We've made it. Well, almost.
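Here's a minimal sketch of the refactor described above, assuming the ETL module has been renamed to spotify_etl.py and everything that used to be in main() now lives in run_spotify_etl(). The file name, variable values and task_id are placeholders, and the Spotify token is still hard-coded at this stage.

```python
# spotify_etl.py (the file formerly called main.py)
def run_spotify_etl():
    # the former module-level "globals", now ordinary local variables
    database_location = 'sqlite:///my_played_tracks.sqlite'  # placeholder path
    user_id = 'my_spotify_username'                          # placeholder
    token = 'PASTE_A_FRESH_TOKEN_HERE'                       # still hard-coded; it expires quickly

    # everything that used to be in main() goes here: call the Spotify API,
    # validate the response, and append the rows to the database
    ...
```

And in the DAG file, the dummy callable is swapped for the real function:

```python
# dags/spotify_dag.py (same imports and dag object as in the sketch above)
from spotify_etl import run_spotify_etl  # the ETL file must be importable from the dags folder

run_etl = PythonOperator(
    task_id='run_etl',
    python_callable=run_spotify_etl,     # was the dummy function before
    dag=dag,
)
```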
Now let me be honest with you about a few points. One, we've only just scratched the surface of Airflow. Two, what we've done is not very proper: it's a quick and dirty example to get you started, and you probably wouldn't write code like this in a commercial environment. I strongly encourage you to learn more about Airflow on your own by reading the documentation; it should be way easier now that you've got some foundation to build on. I would suggest you start with the so-called Airflow hooks, which are something we should probably be using in our ETL process if we are saving data to a database: that is the proper way to store credentials and the proper way to connect to a database. And three, there's a problem with the Spotify token: it expires quickly (after an hour or so), so if we just hard-code it like we are doing right now, this is not going to work and tomorrow our pipeline will simply fail. What needs to be done is to automatically download the token from Spotify and automatically pass it to the API. It's not really that hard to do, and I've seen some nice tips and answers on Stack Overflow about how to download the token automatically, so I will leave it as a little challenge for you.

Perhaps in the future I will create another video that goes a little more in depth into Airflow, shows you more features, and explains everything I didn't manage to cover today. I hope today's video was useful, because it took me so much time to make, so I hope this is not time wasted. If you are new to this channel, please subscribe if you are interested in videos about data engineering, machine learning and software engineering. See you next Thursday everyone, ciao!
Info
Channel: Karolina Sowinska
Views: 139,476
Keywords: airflow, airflow tutorial, airflow for beginners, airflow installation, airflow set up, airflow set up mac, airflow setup, schedule pipeline jobs, schedule etl process, run etl in airflow, schedule jobs in airflow, schedule pipelines, schedule jobs, etl process, job scheduling, data engineering, data engineering course, spotify api, python etl project, apache airflow, apache airflow tutorial for beginners, apache airflow installation, apache airflow demo
Id: i25ttd32-eo
Length: 16min 37sec (997 seconds)
Published: Thu Nov 12 2020