What Is Docker - Docker Intro And Tutorial On Setting Up Airflow | High Paying Data Engineer Skills

Captions
hey there guys, welcome back to another video with me, Ben Rogojan, AKA the Seattle Data Guy. Today we're going to talk about Docker. Many of you have probably heard the term or know of the technology; I really want to give some background on what Docker is, why it's important to data engineers, and also give a few examples — a quick tutorial on how you can set up a basic Docker container.

Let's start with the more obvious question: do data engineers need to know Docker? There's so much technology we need to know, so do we need to know this one too? The short answer, like everything else in technology, is yes and no, and let me explain the slightly longer version. Before you get too upset: you could spend your entire data career and never run into a situation where you actually have to use Docker. That's 100% possible — there are so many different ways you can set up your infrastructure and your teams that a different team might manage it altogether, and you may never touch Docker. On the flip side, if you work for a startup, or you go more the data platform route — where you're not a data engineer in the traditional sense but a data platform engineer, the person who actually manages the platform — then Docker (or Kubernetes, but that's a different video; here we're just talking about Docker) will likely be something you run into. So the short answer is: kind of, yes. It's good to know, good to understand, good to grasp why it's important and what it does — those are the important concepts. Will you maybe never have to run a docker compose or docker run command? Maybe. Some specific examples of where a data engineer might actually use Docker: possibly you set up your entire data platform in Kubernetes, or via some sort of Helm chart, or some other way where you've got multiple components — DataHub, Airbyte, Airflow.

So let's dive into what Docker is and why it's important. For those of you who don't know, Docker basically allows you to take all of the dependencies, all of the code, all of the things you want to deploy into some piece of software, and do it smoothly. Many of you might not know what that means — maybe you've lived in a world where Docker always existed. I have sadly lived in a world where, in order to deploy code, you got to do things like put together release notes and instructions on how to deploy said code, and go step by step — not just saying "hey, here's code, deal with it," but actually talking to an operations team who then took that code and made it work in real life. Like, yeah, it's great that you made it work in development; now let the professionals take over and make it work in operations. In that process, all those connections to databases and other systems you rely on could instead just be nicely packaged in some form of Dockerfile or Docker compose file — we'll get to the difference there eventually. That's what we're trying to get rid of: the old-fashioned approach, which many of us had to live through, where in order to deploy a new piece of software you'd zip it up, tag a configuration file along with it, and go through all these steps to get the code ready for production. Instead, all of that can just live in your Docker container — all your dependencies. Especially if you're used to Python: your requirements.txt file becomes a basic RUN command in your Dockerfile — run pip install -r requirements.txt — and it's set up. It's very simple, honestly: you give it the exact instructions.
In theory, it should run the same way on your computer as on other computers — unless you got the M1 chip when it came out, like a year and a half ago, tried running a Docker container, and quickly found out that M1 chips were not developed with Docker in mind (or vice versa — Docker hadn't taken them into consideration yet). Now you can actually download the M1 version of Docker, and we'll talk about that in a second. But that's what Docker is about: it makes the deployment of code, and everything it depends on — configuration files, databases — nicely packaged into a single set of files, so instead of a lot of manual back-and-forth with your ops team, it's a much quicker process.

Okay, so that's generally what Docker does; it's not really what Docker is. Docker is a platform that gives you something similar to virtualization, although it doesn't actually create a virtualized OS. Traditionally, virtualization was done at the hypervisor level, but Docker essentially takes that up one step: it leverages the host operating system and communicates with its kernel, and by skipping half the machinery standard virtualization needs, you get a far faster setup time. When you spin up Docker it doesn't take minutes — basically, once you say docker run, it's running. If you want a quick story: when I first started in the world of technology, virtual machines were everything. Everyone was like, "we're doing everything in virtual machines, it's amazing, it's great." One thing I always noticed is that in theory they were supposed to run as their own isolated systems — the software was, in theory, supposed to take, say, 32 gigs of RAM and split it between two VMs so they'd never touch each other. But because they're often physically connected — it's all the same hardware — they still caused issues: when you ran a large process on one, the other would be impacted, even though supposedly that wasn't supposed to happen. I saw it happen all the time. Just fun stories.

Okay, so that's basically what Docker is. It's related to virtualization, though slightly different — people who do actual sysadmin work and are deep in the technical weeds might get upset that there's even a comparison, but it's an easy way of thinking about it. And if you're a data engineer, there are lots of times you need Docker to run things. Docker, for example, probably plays a role when you're spinning up Airflow (although you're likely using Kubernetes) in getting everything set up — because you want a system that can spin up the workers, has a scheduler set up, has a web server set up, maybe even in different pods, so they either don't impact each other or can scale up and down as needed. All of this gets very complicated, and personally, if you're ever going to spin up Airflow yourself, I'd recommend talking to a DevOps person first, because I've just seen too many ways that Airflow gets set up poorly.

Okay, before diving into Docker, we're going to go through a few commands. I want to cover them now so that, as I'm explaining things, you know what I'm talking about. The most basic, somewhat self-explanatory command is docker run, which is what it sounds like: the command that runs your container, based off an image you've built. The next one I use very frequently — it's almost like ls — is docker ps.
I usually add the -a flag anyway, because I want more information. This lists out all the Docker containers and their current state, and I use it constantly, especially to see whether a container has failed. If it has failed, I'll use docker logs — you reference the actual container ID — to see why it failed. Next, you'll often see me use docker exec -it, then the name of the actual instance, followed by bash. This is basically what I use to access a Docker container — you can think of it almost like SSHing into a different server, except that server lives on your computer and doesn't have as many cool things on it. Every time I set up Docker I forget that it has basically nothing on it, so I'll exec into it, try to vim something, and nothing works — and then you realize, oh yeah, I don't have vim or nano or anything installed in this container. The other commands you're going to see me use are docker compose up, docker compose down, and docker volume prune (actually, I might not use the prune one). These commands are for when you have a Docker compose file: docker compose up lets you start or restart the services defined in your Docker compose YAML file, whereas docker compose down, as it suggests, stops all of the running containers, removes them, and cleans up any networks or other things that were set up when you ran docker compose up. Then usually you'll do docker volume prune, which gets rid of everything left over, so volumes that were created get swept away. That's the process I follow when I'm doing a complete wipe of something that just doesn't seem to be working.
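As a quick cheat sheet, the commands I just described look like this — the container names and IDs are placeholders, and you'll obviously need Docker installed to run any of them:

```shell
# Run a container from an image you've built
docker run <image-name>

# List running containers; -a includes stopped/failed ones too
docker ps -a

# See why a container failed (use the ID from docker ps)
docker logs <container-id>

# Open a shell inside a running container (like SSHing into it)
docker exec -it <container-name> bash

# Start / stop the services defined in a docker compose file
docker compose up
docker compose down

# Clean up leftover volumes when doing a full wipe
docker volume prune
```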
So those are the commands you're going to see me use. There's obviously a whole list of others, but those are some good commands to know. All right, let's dive into the actual demo. To show you a baseline example, let's go through one simple way you can set up a Dockerfile to run, in this case, just a basic Flask app.

Okay, first step, for anyone out there who's just getting started: you're going to need to download Docker. If you're on a Mac there are multiple ways to do it; I'm just going to say you can use the Docker Desktop download approach — that's the way I like. You'll hit "Apple chip" and download if you have the M1, and "Intel chip" if not. I have the Apple chip, which caused a lot of issues when I first tried to spin up Kubernetes. All right, so we've got that downloaded, and in theory it's technically already running. Now let's look at my files — there are two of them. What I've got here is this Dockerfile, so let me open it up. What you'll see is that this first command pulls down an image. There's something called Docker Hub where there's a bunch of pre-built images — you can go on there and scroll through them — and Docker will pull the one you reference when you run the build. It's used as the base; you can almost think of it as the building's foundation. So I'm pulling this one in. Next, it sets the working directory, which I called app, and then into that working directory we copy the requirements.txt file. This is basically setting up what things are going to look like, because when you spin this up you'll have, again, a virtualized system that you can go into — "log into" is the wrong word; a term people use is to exec (or mount) into it — almost like SSHing into an EC2 instance or another server, and we'll show you that in a second. From there it runs the pip install command, then we expose port 80, because we're going to use this to build a basic app.py Flask app, and then from there we run this python app.py command. That's really it. I'm going to spin this up in a second and show you what it does — both how we can go into it and how the whole Flask app works.

All right, so I'm in that Docker demo folder. If I run docker ps, which lists what currently exists in terms of containers, there's nothing there. If we do docker images, again nothing is there, because I haven't built anything yet. So both Docker containers and Docker images are blank. I'm going to run this docker build command and give it this name, flask-docker-example, and you'll see it run through the various build steps until it finishes all eleven of them.
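I'm not pasting the exact file from the video, but a minimal Dockerfile along the lines of what's described — base image from Docker Hub, working directory, requirements install, an exposed port, and a python app.py start command — would look something like this (the base image tag is an assumption, not necessarily the one used in the demo):

```dockerfile
# Pull a pre-built base image down from Docker Hub
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Copy in the dependency list and install it
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy in the rest of the application code (e.g. app.py)
COPY . .

# Document the port the Flask app listens on
EXPOSE 80

# Start the app
CMD ["python", "app.py"]
```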
So what you'll see, if you scroll through those steps, is that these are all the commands we had in the Dockerfile — the COPY and so on — all of them were run. But if I go through this again: docker ps — nothing is in containers yet — but if I do docker images, there is an image there. And there's a distinction: the image is essentially just the template for how something is going to work; you haven't actually run it yet. That's what docker run is for — "deploy" isn't quite the right word; nothing is really running yet. It's there, it's ready to start, but it's not running. Now if I say docker run with this image — fingers crossed, this should work — cool, it's running. You can see it running right here, saying hey, go to this location, essentially localhost... actually, port 80 is exposed in the container, but we mapped it over to 4000, so I need to go to localhost:4000. If we go there, it says hello — this is the Docker Flask app. So we've got it running; we've built it and set it all up. The one thing here is you'll see it's logging in my terminal, which may be something you don't want. To get around that, you can add -d, which says to run it detached — so if you want to avoid having all those logs in your terminal (which is kind of cool to see: you go to the website, you see the request come through), you run it with -d; d stands for detach. Once it's running detached, what it prints back is the container ID.
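The build-and-run sequence from this demo looks roughly like this, assuming Docker is installed and you're in the folder containing the Dockerfile (the image name matches the one used in the demo):

```shell
# Build the image from the Dockerfile in the current directory
docker build -t flask-docker-example .

# Run it, mapping port 80 in the container to port 4000 on the host
docker run -p 4000:80 flask-docker-example

# Same thing detached, so the logs don't take over your terminal
docker run -d -p 4000:80 flask-docker-example
```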
So when I go again to localhost:4000, you'll notice there's nothing showing up in my terminal anymore. And that's basically Docker. Now there are a couple of other commands you'll see — for example, docker compose — and like I said, there are a lot of others you can use. The Flask example might not even be relevant for some of you, because you might not be interested in building Flask apps, but many of you will likely want to work with something like Airflow. So I did want to go through an Airflow example, and with Airflow we're going to use a different command: like I referenced, docker compose. Basically, what docker compose does is, as the name suggests, allow you to spin up multiple — components, that's the right word — multiple components. We're going to use Postgres, the Airflow web server, and the Airflow scheduler, have all of that running, and the Docker compose file is going to set it up. You can literally set it up so that one service runs, then another, then another, in a specific order — which is going to be important with Airflow, because if you run things out of order the scheduler won't work, or if you're pointing at the Postgres instance and Postgres isn't up yet, that won't work either.

So let's go over that really quick. With docker compose you put together a Docker compose file. It's obviously different from what you're typically used to, but it's got some similarities. You call out the image — if you recall, the prior example had the FROM line; this is similar — you call out the image you want to pull in. Again, if you go to Docker Hub you can find this specific setup already; here it's Postgres. You set up some environment variables, and then from there we set up the Airflow web server — this next one — and then the Airflow scheduler. What you'll see is a few other important configurations, such as the depends_on section here, and here as well. This has to run in a specific order to make sure everything actually works; if not, what ends up happening is the Airflow scheduler or web server tries to start, not everything is there, and it freaks out. You also have this command here that I've set up so you already have a user on your Airflow instance, and same thing here — I'm just running airflow scheduler so it starts when we start. So it's pretty straightforward: it's like the prior example, except now we have three different services rather than just one.

So here we've got some new commands. It's pretty simple: in this case it's just going to be docker compose up. There's also docker compose down, and if you're trying to get rid of a leftover volume, you can use docker volume prune. But we're going to do docker compose up, and we're going to see a lot of things happen all at once: you can see everything running — Postgres is getting set up, the Airflow web server is trying to start, the Airflow scheduler is getting set up. I don't think it's finished just yet, if I recall. Okay, so in theory I can go to localhost now; let's check. Okay, it's running. I set the username and password to admin and admin, and you can see Airflow running here. If you don't have it running correctly, what you'll see up here is a yellow banner saying hey, the Airflow scheduler is not running — but you can see it's here, it's running, we're good.
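The compose file from the video isn't reproduced in full here, so this is a trimmed-down sketch of the same shape — Postgres, a webserver service, and a scheduler service ordered with depends_on. The image tags, credentials, email, and connection string are placeholders I've filled in, not the exact values from the demo:

```yaml
version: "3"
services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow-webserver:
    image: apache/airflow:2.7.1
    depends_on:
      - postgres
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    ports:
      - "8080:8080"
    # Initialize the metadata DB, create the admin user, then start the webserver
    command: >
      bash -c "airflow db init &&
      airflow users create --username admin --password admin
      --firstname Admin --lastname User --role Admin --email admin@example.com &&
      airflow webserver"

  airflow-scheduler:
    image: apache/airflow:2.7.1
    depends_on:
      - airflow-webserver
    environment:
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
    command: airflow scheduler
```

Note that depends_on only controls start order, not readiness — a production setup would add health checks on top of this.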
So this is how you set up Airflow. Obviously there are some things I want to point out in that Docker compose file, because someone's going to yell at me in the comments, so let's pull it up — I forgot to call these out. The executor here: LocalExecutor is fine if you're running locally for testing, but if you want this in production it will not function correctly; you're going to want something like the Celery executor. The other thing you don't see here — and I've called this out before, at least in some articles — is where the logs are going. In this case the logs are being stored locally; in fact, I can show you that. If we go in here, logs are stored in this folder, which is not good in production. If you're storing them in some local location, like on an EC2 instance, that EC2 instance's disk will eventually fill up and blow up. Maybe it runs for six months, maybe twelve — it depends on how big your logs are and how big your EC2 instance is — but eventually that location will blow up unless you've got some sort of deletion job running. The best thing to do is usually to dump logs into S3 or CloudWatch or whatever you're working with, to make sure those logs don't break anything. So yeah, there are a lot of little things you'd want to change for production, but this is a local version, it works, and it's great. That's really docker compose: we've just added more layers, so instead of that one instance with a basic Dockerfile, we now have multiple services going on. And that's kind of it — we could even go into these different services if we wanted to.
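To make those two production caveats concrete — switching executors and shipping logs off the box — the relevant Airflow settings, using the standard AIRFLOW__SECTION__KEY environment-variable convention, look something like this (the broker URL, bucket name, and connection ID are placeholders):

```yaml
environment:
  # LocalExecutor is fine for local testing; in production use Celery
  # (which also needs a message broker such as Redis)
  AIRFLOW__CORE__EXECUTOR: CeleryExecutor
  AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0

  # Ship task logs to S3 instead of filling up the local disk
  AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
  AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: s3://my-airflow-logs/logs
  AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: aws_default
```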
We could connect to this Postgres instance, for example, but we're not doing that today — this is just a basic example. Hopefully that's clear. Honestly, for the longest time I just didn't Google what docker compose was. I kind of understood that when you wanted more things spun up, you'd want something more complex — except now, instead of a Dockerfile, it's a Docker compose file — and it's kind of obvious what it means: it really means what it sounds like, you're composing a bunch more things together. This is something you'll see very commonly with, say, a basic Airflow setup: you need a Docker compose file because you've got things like Postgres that you're likely setting up to manage the actual metadata of your Airflow instance, and you likely have a log location you might be configuring in there too. Actually, I recently had an Airflow setup with a client where we even got to pre-configure a bunch of files and environment variables and other stuff prior to even loading, as we were building everything up — I thought that was really cool; things I just hadn't seen. Again, for me, my Docker skills are good enough to work with, but there are definitely people who are far better. If you haven't seen it, check out TechWorld with Nana — honestly, if you want a better, deeper dive (there's no ad here), that's where I go. Every time I'm looking for more information and want a deeper understanding — of DevOps in general — that's where to go. There's a reason she has almost 900,000 subscribers; her videos are on point. But anyways, guys, thanks so much for watching this video — I really appreciate your time. If you're a data engineer out there looking into Docker, hopefully this was helpful; if not, let me know what other things you'd want to see — I'm happy to give you more. Other than that, thank you so much for watching, and I will see you in the next one. Thanks, and goodbye.
Info
Channel: Seattle Data Guy
Views: 7,038
Keywords: Data engineering, Docker intro, what is docker, how to set up airflow with docker, how to become a data engineer, docker tutorial, should data engineers learn docker, docker basics, docker compose, how to set-up flask with docker, seattle data guy, ben rogojan, should i become a data engineer, do data engineers need to know docker, docker 101, what is data engineering, data engineer
Id: oVKuwk8xY38
Length: 22min 24sec (1344 seconds)
Published: Wed Sep 13 2023