Airflow Tutorial: End-to-End Machine Learning Pipeline with Docker Operator

Video Statistics and Information

Captions
Hello everyone, this is coder2j. You're all amazing: today we've crossed over 7,000 likes on our video "Airflow Tutorial for Beginners Full Course in 2 Hours". As promised, we're unlocking the second bonus tutorial, about the Airflow DockerOperator. If you missed our first bonus tutorial on debugging Airflow DAGs, you can catch it right here. In this tutorial I'll guide you through using the DockerOperator to execute tasks in a Docker container. But before we dive in, don't forget to check out my latest one-hour beginner-friendly PySpark course, plus there are two bonus videos waiting for you to discover. All right, let's get started.

So what exactly is the Airflow DockerOperator? It's quite straightforward: it's just another type of Airflow operator. While you're probably familiar with operators like the PythonOperator, which runs Python functions as tasks, the DockerOperator lets you run tasks inside Docker containers as part of your workflow. You might be wondering why you should bother to learn the DockerOperator. Here are a couple of good reasons. First, code isolation. When you use the PythonOperator, you need to define your Python functions within your Airflow project. That's perfectly fine for small pieces of code, but when dealing with complex logic it's better to keep it in a separate project for better long-term maintainability. That's where the DockerOperator comes to your rescue. Second, the DockerOperator significantly improves reproducibility and scalability. By packaging your tasks into Docker images, you no longer need to worry about managing the environment and dependencies; Airflow can simply pull your image and run it inside a container.

Are you convinced yet? Let's take a look at a real use case. Here I aim to create a data pipeline that generates a dataset and pushes it into an S3 bucket. After that, I need to train a machine learning model on the created dataset and publish the model artifacts to S3. To keep things organized, I want to separate the source code for the machine learning part from the Airflow project, so I'll package it into a Docker image and use the DockerOperator to run it. You can find a link to the GitHub repo with the source code of the demo in the video description.

Now let's dive into it. Inside my Airflow project folder, I've placed the machine learning part under the src folder to keep everything in a single repository, which makes it easier for you to clone and reproduce locally. In reality, I'd split the project into two separate ones: one for Airflow and the other for the machine learning part. The machine learning part is quite straightforward. It consists of a Dockerfile that uses the python:3.10 base image and defines the necessary environment variables; it copies the Python requirements file, installs the required dependencies, then copies the application code and executes the train_and_publish.py script. The model_tuning.py script is a placeholder that just prints a string message, used here for demonstration purposes. The requirements.txt file lists all the Python dependencies needed for the project. The train_and_publish.py file is a functioning script: it downloads the dataset from MinIO, trains a regression model on it, and then uploads the trained model artifacts back to MinIO.
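For reference, here is a minimal sketch of what a script like train_and_publish.py could look like. The MinIO client usage, the environment variable names (MINIO_ENDPOINT, MINIO_ACCESS_KEY_ID, MINIO_SECRET_ACCESS_KEY, MINIO_BUCKET_NAME), the object paths, and the column names are assumptions for illustration; the actual script lives in the linked GitHub repo.

    # Sketch of a train-and-publish script (assumed layout; see the repo for the real one).
    import os
    import joblib
    import pandas as pd
    from minio import Minio
    from sklearn.linear_model import LinearRegression

    # Environment variable names are assumptions based on the ones used later in the video.
    client = Minio(
        os.environ["MINIO_ENDPOINT"],                  # e.g. "host.docker.internal:9000"
        access_key=os.environ["MINIO_ACCESS_KEY_ID"],
        secret_key=os.environ["MINIO_SECRET_ACCESS_KEY"],
        secure=False,                                  # local MinIO without TLS
    )
    bucket = os.environ["MINIO_BUCKET_NAME"]

    # Download the dataset that was uploaded to the bucket.
    client.fget_object(bucket, "datasets/data.csv", "data.csv")
    print("data.csv downloaded from MinIO")

    # Train a simple regression model (column names here are placeholders).
    df = pd.read_csv("data.csv")
    X, y = df.drop(columns=["target"]), df["target"]
    model = LinearRegression().fit(X, y)

    # Upload the trained model artifact back to MinIO.
    joblib.dump(model, "regression_model.joblib")
    client.fput_object(bucket, "models/regression_model.joblib", "regression_model.joblib")
    print("Model uploaded to MinIO")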
Now let's navigate to the src folder and build a Docker image for it. It's a straightforward process. Before we begin, make sure Docker is up and running; you can verify this with the docker version command. Once Docker is confirmed to be running, use a simple docker build command. This builds the image from the provided Dockerfile and tags it as regression-training-image with version 1.0. After the build completes, you can use the docker image ls command to check that the image exists.

Now that we have our Docker image, let's take it for a test run with the docker run command to make sure it's executable. To do this, we need to set up a local MinIO instance. Within the Airflow project you'll find a docker-compose.yaml file that has everything required to launch a local MinIO; use the docker compose up command to start it. You can access the MinIO console by navigating to localhost:9001. To log in, use the credentials defined in the docker-compose environment, where both the username and password are minioadmin. Once you're logged in, navigate to the left-hand menu, click Buckets under the Administrator section, and create a bucket with an appropriate name (here I have already created one). Then move to the User section, click the Object Browser button, and inside the bucket you've created, add a new path named datasets and upload a data.csv file into it using the Upload button. You'll find the data.csv file in the Airflow project folder, which you can obtain by cloning the GitHub repository for this video.

Now that MinIO is set up, let's run the Docker image using the docker run command. You'll need to provide the four environment variables defined in the Dockerfile: the MinIO endpoint, which should be set to host.docker.internal:9000; the MinIO access key ID and MinIO secret access key, which you can find in the MinIO console under Access Keys in the User section (if you don't have one, create it with the Create Access Key button, and make sure to save it because you won't see it again); and the MinIO bucket name, which should be the name of the bucket you created, in my case coder2j-awesome-ml-artifacts. Lastly, specify the image name, which is regression-training-image:v1.0. Execute the command and you should see two log messages: one indicating the download of data.csv from MinIO, and the other saying the model was uploaded to MinIO. If you check the MinIO console, you should see the uploaded regression model artifacts. This confirms that our model training and publishing script is working as intended.
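The video uses the plain Docker CLI for this build-and-test step. As a sketch of the same flow in Python, the Docker SDK (docker-py) can build the image and run the test container; the build path, environment variable names, and credential values below are placeholders.

    # Equivalent of the docker build / docker run test using the Docker SDK for Python
    # (the video uses the Docker CLI; this is just an illustrative alternative).
    import docker

    client = docker.from_env()  # talks to the local Docker daemon

    # Build the image from the machine learning folder and tag it.
    image, _ = client.images.build(path="./src", tag="regression-training-image:v1.0")

    # Test run with the four environment variables the Dockerfile expects
    # (variable names and credential values here are placeholders).
    logs = client.containers.run(
        "regression-training-image:v1.0",
        environment={
            "MINIO_ENDPOINT": "host.docker.internal:9000",
            "MINIO_ACCESS_KEY_ID": "<your-access-key-id>",
            "MINIO_SECRET_ACCESS_KEY": "<your-secret-access-key>",
            "MINIO_BUCKET_NAME": "coder2j-awesome-ml-artifacts",
        },
        remove=True,          # clean up the container after it exits
    )
    print(logs.decode())      # should show the download and "model uploaded" messages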
Now let's construct an Airflow DAG for our machine learning data pipeline. I'm running Airflow version 2.7.2 locally with SQLite and the SequentialExecutor. In the dags folder I've created a DAG file named ml_pipeline_with_docker_operator.py. Within this DAG file I've imported the necessary packages and defined the default_args. I've named the DAG with an ID of dag_ml_pipeline_docker_operator_v01 and set the schedule interval to None for manual triggering. This pipeline consists of two tasks. The first task simulates dataset creation by running a simple bash command with a BashOperator. The second task is the model training and publishing task, which uses the DockerOperator. There are several parameters that can be configured. First, we need to set the docker_url; since I have Docker running locally, I've set it to the default Docker socket. We can set the api_version to auto to let Docker choose the appropriate API version, and if we want to remove the container when the task is complete, we can set auto_remove to True. We can also specify the container name and image name using the container_name and image parameters. Lastly, we need to provide the environment variables for the container; in this case, we should pass exactly the environment variables we used in the Docker test run. Once we define the task dependencies, we're ready to launch Airflow and run the data pipeline.
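Here is a minimal sketch of what such a DAG could look like. The task IDs, start date, default_args, environment variable names, and credential values are assumptions for illustration; the actual DAG is in the linked GitHub repo.

    # Sketch of the DAG with a BashOperator and a DockerOperator.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.providers.docker.operators.docker import DockerOperator

    default_args = {"owner": "coder2j", "retries": 1, "retry_delay": timedelta(minutes=5)}

    with DAG(
        dag_id="dag_ml_pipeline_docker_operator_v01",
        default_args=default_args,
        start_date=datetime(2023, 10, 1),
        schedule_interval=None,          # manual triggering only
    ) as dag:

        create_dataset = BashOperator(
            task_id="create_dataset",
            bash_command='echo "Hey, the dataset is ready, let\'s trigger the training process"',
        )

        train_and_publish = DockerOperator(
            task_id="train_and_publish_model",
            image="regression-training-image:v1.0",
            container_name="regression_training",
            api_version="auto",
            auto_remove=True,                         # newer Docker provider versions may expect "success"/"never"/"force"
            docker_url="unix://var/run/docker.sock",  # default local Docker socket
            environment={
                "MINIO_ENDPOINT": "host.docker.internal:9000",
                "MINIO_ACCESS_KEY_ID": "<your-access-key-id>",
                "MINIO_SECRET_ACCESS_KEY": "<your-secret-access-key>",
                "MINIO_BUCKET_NAME": "coder2j-awesome-ml-artifacts",
            },
            # command="python model_tuning.py",       # optional override of the image's default command
        )

        create_dataset >> train_and_publish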
Before we start Airflow, let's make sure the AIRFLOW_HOME environment variable is set to our current Airflow project directory. Normally we can use the airflow standalone command to start Airflow locally; however, there's a known bug when running Airflow locally on macOS. We can fix this issue by adding no_proxy=* as a prefix to the airflow standalone command; without this, your tasks may fail with the return code Negsignal.SIGSEGV. Once everything is set up, let's launch Airflow. Copy the login password and use it to log into the Airflow web server. Here we should see our newly created DAG; toggle it on. Before triggering a DAG run, check the MinIO console to ensure that the model published by the previous Docker test run has been deleted and that the data.csv file still exists.

Now go ahead and trigger the DAG. We'll notice that the first task succeeds with the log message "Hey, the dataset is ready, let's trigger the training process". Check the second task, which also succeeded. From the log we can see that it runs the Docker container from the image regression-training-image:v1.0; it then downloads the dataset from MinIO, trains the model, and finally publishes the model artifacts back to MinIO. Double-check in the MinIO console: boom, the model artifacts have been saved properly.

If you remember correctly, our machine learning project has two Python scripts. By examining the Dockerfile, we can see that the default command is set to run the train_and_publish.py script. But what if we want to run the other Python script without updating the Dockerfile and rebuilding the image? Yes, it's possible with the DockerOperator: there's a parameter called command that can be set to the command you want to run inside the container. Uncomment the command parameter in the DockerOperator and set it to run the model_tuning.py script. Save the DAG file and return to the Airflow web server; you may need to wait a few seconds until the updated DAG code is displayed in the Code section. Now trigger a new DAG run. The first task should remain the same; in the second task's log you should see the message "Hello world, I am a fake model tuning script", output from the model tuning script. This demonstrates that you can easily switch between running different Python scripts without altering the Docker image.

If you are running Airflow itself within a Docker container, a few additional configurations are needed (see the sketch after these captions). First, in our Airflow Docker project directory, open the docker-compose.yaml file and set the environment variable AIRFLOW__CORE__ENABLE_XCOM_PICKLING to True. Second, add a docker-socket-proxy service; this is necessary because Airflow inside a container cannot directly connect to the default Docker socket outside the container, a known issue referred to as Docker-in-Docker. Third, in the DAG file, modify the docker_url to tcp://docker-socket-proxy:2375 so that Airflow can communicate with Docker correctly. Fourth, for the purpose of this demo, we'll continue using the train_and_publish.py script. Now launch the Airflow server, log in, and make sure you can see the ML pipeline DAG; toggle it on, and double-check that the model artifacts in the MinIO console have been deleted and the dataset is present. With everything set up, trigger the DAG. You'll notice that the first task completes with its log message; in the log of the second task, you'll see that it runs the Docker container with the defined Docker image, then trains the model and publishes the artifacts to MinIO. Verify this by checking the MinIO console: boom, the model artifacts have been successfully uploaded.

Congratulations, you've just learned how to use the Airflow DockerOperator to streamline an end-to-end machine learning pipeline. If you found this video helpful, please consider subscribing and giving it a thumbs up, and let me know what topic you'd like to see in the next video. Thanks for watching, and see you in the next one.
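For reference, here is a minimal sketch of the DockerOperator adjustments described above for running Airflow itself inside Docker. The docker-compose.yaml additions (the AIRFLOW__CORE__ENABLE_XCOM_PICKLING variable and the docker-socket-proxy service) follow the repo; the service name and port below are the commonly used defaults, and the environment variable names and credentials are placeholders.

    # DockerOperator settings when Airflow runs inside Docker: point docker_url at the
    # docker-socket-proxy service instead of the local socket; other parameters stay as before.
    from airflow.providers.docker.operators.docker import DockerOperator

    train_and_publish = DockerOperator(
        task_id="train_and_publish_model",
        image="regression-training-image:v1.0",
        api_version="auto",
        auto_remove=True,
        # Airflow in a container cannot reach the host's /var/run/docker.sock directly
        # (the "Docker-in-Docker" issue), so it talks to the socket proxy over TCP.
        docker_url="tcp://docker-socket-proxy:2375",
        environment={
            "MINIO_ENDPOINT": "host.docker.internal:9000",
            "MINIO_ACCESS_KEY_ID": "<your-access-key-id>",
            "MINIO_SECRET_ACCESS_KEY": "<your-secret-access-key>",
            "MINIO_BUCKET_NAME": "coder2j-awesome-ml-artifacts",
        },
        # command="python model_tuning.py",  # uncomment to run the other script instead
    )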
Info
Channel: coder2j
Views: 4,288
Keywords: python, etl, datapipeline, airflow tutorial, airflow tutorial for beginners, apache airflow, airflow tutorial python, apache airflow use cases, airflow docker, airflow example, apache airflow tutorial, airflow introduction, apache airflow intro, data science, data engineer, data engineer tutorials, data engineer projects, airflow, airflow dockeroperator, airflow dockeroperator example, end to end machine learning project, airflow mlops, ml pipeline, airflow real usecase
Id: uZy2Lwioi3g
Length: 13min 40sec (820 seconds)
Published: Tue Oct 24 2023