Docker for Data Engineers

Captions
What is up YouTube, in this video let's look at Docker and why it is important for data engineers. There are many things you have to do as a data engineer: a lot of the time you need to write code, use different packages from pip, and run SQL in some kind of environment, so you often need to write this code with a lot of different dependencies in place. To name a few, dependencies can be things like having Spark installed in the right place for big data engineering, having pandas for smaller transformations, running SQL through a library like psycopg2, or, if you're running a complex machine learning model in your pipeline, importing scikit-learn or more advanced frameworks such as TensorFlow or PyTorch. In a sense, Docker is like a magic bullet here that resolves all of these dependencies through one piece of software.

First of all, let's try to understand what Docker is. Docker is software that allows you to package your code along with its dependencies in the form of images, called Docker images, and then run that image anywhere, be it a dev environment or a production environment. It allows you to package your code and run it in standard units called containers. You can think of a container as a bag that holds everything together in one place, the code, the dependencies, any other installations, all as one package that you can deploy anywhere and it just runs.

Because of this concept, there are key differences between how things were deployed in the past, on a virtual machine or a regular machine, and how you deploy an application with Docker. In my experience we keep different virtual machines for different environments like prod, staging, or development, and if there are multiple applications, for example an Airflow instance and an API, we deploy them as multiple containers within that environment.

Apart from that, let's look at a few terminologies and details about Docker and the ecosystem behind it. First of all, there's a place called Docker Hub, which you can think of as a marketplace for Docker: many companies build images on top of Docker and publish them, and as a developer you can just use those existing images, which are publicly available for you to use with the right licenses. Let's look at a small example: say you need a Python API and it needs all these different dependencies like pandas and whatnot. You can build your own custom image on top of something like an Ubuntu image, but there may already be a pre-existing image on Docker Hub that you can use.

One of the key facts about Docker is that it works on a layer system: images are built layer by layer, so you can take an existing image, customize it, add your dependencies on top of it, and use it as needed. That's the best part I find about Docker.

Let's look at some simple terminologies. First there's the Dockerfile. A Dockerfile is where you write down the components of your Docker image: it starts with something like "I want to start from this existing image", then copy my code from local into this machine, then install the dependencies in the requirements file. You put these kinds of steps in a Dockerfile, and then you build the Dockerfile into an image.
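As a rough illustration, a minimal Dockerfile along these lines might look like the sketch below; the base image tag, the requirements file, and the pipeline.py script are illustrative placeholders, not files from the video:

    # Hypothetical minimal Dockerfile for a small Python data pipeline
    FROM python:3.11-slim

    # Install the pip dependencies first so this layer is cached between builds
    COPY requirements.txt /app/requirements.txt
    RUN pip install --no-cache-dir -r /app/requirements.txt

    # Copy the rest of the code into the image
    COPY . /app
    WORKDIR /app

    # Default command when a container starts from this image
    CMD ["python", "pipeline.py"]

Each instruction adds one layer on top of the previous one, which is exactly the layer system described above.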
The next step, after the Dockerfile, is the image. Once you build the Dockerfile, you use it to build the image, and now you have an image in place. The image is like a blueprint or a template; it's not running at the moment, you can think of it like a CD or a disk, but you can run it as a container using Docker.

One of the key things you need to understand is that you need a container registry to push these images. It's similar to GitHub: just as you push your code to GitHub, you push your images to a container registry. One container registry that is publicly available is Docker Hub, and there are private registries within the clouds, Google Cloud and AWS have their own registries where you can push your images so they can be used by other applications. The image then doesn't live on your local machine, it lives in the cloud or in an environment from which it can easily be pulled.

The next step, once the image exists in a registry or even locally, is to run it as a container with a few commands. A container is like a lightweight environment for your application to run in, with the pre-packaged dependencies and your code.

All right, so now that we've looked at Docker and why it's important in any engineer's toolkit, let's go through a simple walkthrough of how you can customize your data engineering workflows with Docker. I've already published this as a blog previously, so feel free to check out the blog as well, but I'm going to do a simple walkthrough instead of writing the code down.

Let's look at the structure first. There are a few files in place, including a docker-compose.yaml and a run_compose script, and there are four key elements in the package we're going to build: a packages file with the Linux system requirements, the Python pip requirements, which come under python_commons.txt, and other configuration files related to Airflow and Spark, the Spark defaults config, the supervisord config, and the Airflow config, which we keep in order to run all these libraries.

Next is the main Dockerfile, which I went through. Docker is a layer-based system, so we start from an existing image, OpenJDK, which is built on top of Ubuntu. Then we set the environment variables needed to run our packages and code. Next is the Spark installation, and we download the Hadoop file into the right place. Then we configure GCS, because we want to connect Spark to Google Cloud Storage. After that we install the requirements, first the Linux packages, then the Python pip packages. Then we install some simple command line tools related to Google Cloud, so you can see from this that it's heavily customizable, you can customize it for your needs. After that we set some configuration files, open up ports, and finally run the entrypoint script, which configures the Airflow DB and so on, and we set up the supervisord config, which runs the scheduler behind Airflow.
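To give a feel for that push-and-pull workflow, the commands look something like the following sketch; the image and repository names are made-up placeholders, not the ones used in the video:

    # Tag the locally built image with a repository name, then push it to a registry
    # (Docker Hub style shown here; cloud registries use their own host prefixes)
    docker tag my-pipeline:latest myuser/my-pipeline:latest
    docker push myuser/my-pipeline:latest

    # On any other machine, pull the image back down and run it as a container
    docker pull myuser/my-pipeline:latest
    docker run --rm myuser/my-pipeline:latest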
As you can see, this whole Dockerfile is a blueprint: all the dependencies, the code, everything is in place, and you can use it to run an Airflow container. Right now it's not packaged; once you build it, you create an image, and once it's packaged as an image you can run it as a container.

So as a next step you need to build this image. It exists as a Dockerfile, but first you build it: I give it a name and just build it locally. Once you build it, you can run it. I've skipped the step where you push it to a container repository in this part, but the idea is that it doesn't live locally, it actually lives in a cloud container repository, and you just pass that path and run it. Here I'm using the local path and the local image name, and "latest" is the version tag of the image. I'm passing some environment variables which are needed for Google Cloud, and I'm also passing a volume, we call it a volume mount, so that folders inside the container are mapped to folders on your local machine; this is how you get your code inside the container. As a final step you run the run_compose.sh file, which is this one, and that's how the whole installation works.

Using Docker for data engineering use cases is really important nowadays, because you want to develop and deploy your applications very quickly. All right, so that's about it in terms of this video. If you gained value out of this, definitely hit the like button and subscribe to my channel, it really helps me a lot to push my content to people like you. Thanks a lot for watching, see you in the next one.
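For reference, the build-and-run step described in the walkthrough would look roughly like the sketch below; the image name, mounted paths, and the credentials variable value are hypothetical placeholders rather than the exact values used in the video:

    # Build the image from the Dockerfile in the current directory and give it a name
    docker build -t my-airflow-spark:latest .

    # Run it as a container: pass Google Cloud credentials via an environment
    # variable, mount local folders into the container (volume mounts), and expose
    # the Airflow webserver port
    docker run --rm \
      -e GOOGLE_APPLICATION_CREDENTIALS=/opt/keys/service-account.json \
      -v "$(pwd)/dags:/opt/airflow/dags" \
      -v "$(pwd)/keys:/opt/keys" \
      -p 8080:8080 \
      my-airflow-spark:latest

In the walkthrough these steps are wrapped in shell scripts such as the run_compose script, so the same build and run can be repeated with a single command.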
Info
Channel: Anuj Syal
Views: 2,718
Id: COMEVcZtx1s
Length: 8min 28sec (508 seconds)
Published: Thu May 04 2023