Data Engineering Course for Beginners

Captions
Welcome to this data engineering crash course created by Justin Chow. This course covers the essentials of data engineering: you'll learn about databases, Docker, and analytical engineering, you'll explore advanced topics like building data pipelines with Airflow, and you'll work through batch processing with Spark and streaming data with Kafka. The course culminates in a comprehensive project that puts your skills to the test by creating a full end-to-end pipeline.

Hey everyone, welcome to the course. My name is Justin Chow, and I'm a developer advocate at Airbyte. I'm going to walk through what you need to know to become a data engineer and what goes into data engineering, and then we'll move into a fully fledged demo where you can follow along in creating a data pipeline from scratch and evolving it into a complete pipeline using open-source tools such as Airbyte. If data engineering, or software engineering in general, is of interest to you, this should be good to know. Without further ado, let's get into the course.

We're starting with the introduction to data engineering, but you may be wondering: why data engineering in the first place? There are a couple of reasons to consider getting into it. First, big data projects have a high failure rate; based on the data we've collected, roughly 85 to 87% of those projects fail, which is obviously very high. Those failures come down to unreliable data infrastructure and poor data quality. In the past, data scientists were expected to build the data infrastructure those projects needed, and with the infrastructure they put in place there was often incorrect data modeling and redundant work. Because of that, there's a high rate of data scientist turnover, and so there's a strong need for data engineers to process that data and extract value instead of the data scientists. With all of that said, the importance of the data engineering role is growing as we speak. If we're talking compensation, in the US the median salary for data engineers is roughly 90 to 150k a year; that still depends heavily on location and whether you're in the States or not, so the number will vary.

Now let's look at the business potential and impact data engineers can have in any company. First, you'll have a crucial role in the success of data-driven decisions in the AI and ML space, especially with the rise of ChatGPT and the AI space in general; for any company that wants to (and should be) getting into AI, data engineers are a crucial factor in what happens with those applications and their data. The next thing you'll be doing is ensuring efficient management and processing of the actual data as it moves downstream.
While it moves downstream, you're also ensuring the data's quality, security, and availability for the teams downstream, such as the data scientists, the data analysts, or any other team in your company that works with data: it's quality, it's secure, it's formatted the way they need it, and it never stops flowing downstream. This leads to better insights and outcomes for your company and overall drives innovation and competitiveness so your company and your team can thrive.

Now we're moving on to the first module, which is the introduction to Docker and what we use Docker for in data engineering. First, though, we need a better understanding of what it is if you haven't used it before. Docker is an open-source platform that simplifies the process of building, shipping, and running applications inside containers. Imagine you're working on a project and you want your co-workers to try it to see if there are any issues, but they can't run it for some reason, and you run into the good old saying of "well, it runs okay on my machine." What Docker lets you do is put your environment into a container, and that container includes every dependency and silos everything, so you can pass it on to your co-workers or even run it up in the cloud and have your application behave exactly as it did when you were developing it. It essentially helps organize things and make sure everything works as developed.

Going over containers and their benefits: containers are lightweight, portable, and self-sufficient environments that package an application along with all of its dependencies, libraries, and configurations. Docker has really helped engineers package their applications so they don't run into issues down the line when testing and developing, so their environments are reproducible in case anything goes wrong, and so that when they push to the cloud or into production, the application works the same way in prod as it did in testing. Just imagine all of your app's code and dependencies landing in one of these containers, and you ship that container off to a co-worker or onto AWS or GCP; everything that was inside the container will be there on the other end as well. In software development we reach for Docker in testing and deployment for its ease of use, speed, and flexibility. It assists with the deployment, scaling, and management of applications using what we call containerization, which is just a lightweight form of virtualization. In the illustration on the right-hand side you can see a couple of apps in containers running with Docker, sitting on top of the operating system and the infrastructure, all within the Docker ecosystem.

There are three main concepts in Docker that we need to understand: Dockerfiles, Docker images, and Docker containers. The first one is the Dockerfile. The Dockerfile is a text file that consists of all the instructions, hence the blueprint, that we want Docker to use in order to create the Docker image.
The Docker image is what gets created from the Dockerfile, the blueprint. The image is a lightweight, standalone, executable package that includes everything needed to run your piece of software: the code, the runtime, libraries, environment variables, and config files. One thing to note is that images are read-only and immutable, so nothing can change within the Docker image itself. If you want to modify anything, you modify the instructions in your Dockerfile and build a new image. That's why we run the image inside a Docker container, which is the runtime instance of the image; in other words, the running version of the image is the Docker container. A container is completely isolated from the host and from other containers you may have running; it has its own file system and can be started, stopped, and deleted independently. For example, we may have one Docker image that contains our application but three different Docker containers running that same image, so three different environments, three different instances of that image. I can stop one and the other two keep running; I can delete one and nothing changes in the other containers, because they are isolated environments, and nothing will change unless it's done inside the container.

First things first, to get started we obviously want to install Docker, so head over to docker.com. On the landing page, download Docker for whatever operating system you have. I'm currently on a Mac with an Apple chip, so I'd click that and download Docker Desktop, which also comes with Docker Compose; we'll need Compose later in this course, but just know that Docker Desktop and Docker Compose are the two prerequisites for following along. If you're on Windows, follow the instructions for Windows; if you're on Linux, follow the instructions for Linux. If you have any questions, the Docker docs have a really good explanation of how to install Docker Desktop for your machine, so make sure you do this first.

Now that we have Docker installed, the Docker docs have a great getting-started guide on how to containerize an application, and we're going to walk through it quickly. It gives a great idea of how to containerize an application with a Dockerfile, how to build the image, how to get the image running inside a container, and a couple of other things like networking, and then we'll move into the SQL module right after. If you're still confused about what a container or an image is, part one of the guide gives a good rundown, but let's go into part two on how to containerize an application. The guide has a getting-started app, which is just a to-do app for you to containerize, so the first thing we'll do is clone it and use it for the rest of the guide. I'm going to open up my terminal and make a directory called docker-tutorial.
Then I'll cd into docker-tutorial and paste in the command from the guide, which simply clones the repo. Make sure you also have a text editor installed; I'm using VS Code, and obviously have Docker installed. Now that it's cloned, let's cd into the getting-started app, and if we open it in VS Code there are a couple of things here: a spec folder and the source folder, which is all we really need to look at, since this is just a simple to-do app.

Back in the getting-started guide, there are instructions for Mac and Windows, so follow the right ones for your operating system. We've already cd'd into the project, so inside the root directory of the getting-started app we're going to create a Dockerfile, which, as mentioned, holds the instructions for building the image that containerizes the application it sits next to. Now that I've created it, if I type ls you can see the Dockerfile is there. The guide provides the contents we'll use for this Dockerfile, and there are a few things here we haven't seen before. FROM says we're pulling the node 18 image from Docker Hub to use as our runtime; the working directory is /app, the root directory of the application; COPY copies everything in; then we RUN the command yarn install --production; the other command is node pointed at the src/index.js file; and we're also exposing port 3000. These instructions are pretty straightforward and read like English, but one thing to keep in mind is that the order matters and can't be shuffled around, so keep that in mind when writing your own Dockerfiles. We'll copy this, switch back to VS Code, paste it into the Dockerfile, and save.

Now the guide gives us a build command: docker build -t, where -t tags our image, and we name it getting-started, so that will be the name of our image. The dot at the end tells Docker where to look for everything to containerize, which is simply the current directory. Also make sure the Docker daemon is running; if you installed Docker Desktop, simply open the application and let everything launch. I'll hit enter and it does its thing. Now if we go into Docker Desktop and open Images, you can see getting-started at the very top: we have an image, the tag is latest, it hasn't been used yet, and it shows its size. The image contains everything from the Dockerfile: you can see the commands that ran, node version 18 from the FROM instruction, and port 3000 being exposed, so the Dockerfile's blueprint is in play here.
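For reference, here's roughly what that Dockerfile and build step look like. This is a sketch based on what's described above; the guide's exact base image tag may differ (some versions of it use node:18-alpine rather than node:18), so check the guide you're following.

```dockerfile
# Runtime pulled from Docker Hub (the guide may pin node:18-alpine)
FROM node:18
# Work inside /app in the container
WORKDIR /app
# Copy the application code into the image
COPY . .
# Install only production dependencies
RUN yarn install --production
# Command the container runs on start
CMD ["node", "src/index.js"]
# Port the app listens on
EXPOSE 3000
```

```sh
# Build an image from the Dockerfile in the current directory (.) and tag it "getting-started"
docker build -t getting-started .
```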
The next thing we're going to do is start this container, so we can say docker run, which runs the image in a container, and flag it with -dp. The -d flag is short for detach, which runs the container in the background, and the -p flag is short for publish, which creates the port mapping so we can actually access the container locally. With the mapping we're pointing it at localhost, 127.0.0.1, and opening it on port 3000. Keep in mind this can change: say you already have an application running on port 3000 and you want to serve this one on port 6000, you can definitely change that, but the number after the colon needs to stay 3000, because in our Dockerfile we exposed port 3000, so to access the app we need to keep that side of the mapping at 3000. After that we specify what we're running, which is getting-started. Let me copy and paste this into the terminal, and you can see it spits out a container ID; I'll copy that for later. If we go back into Docker Desktop, just for a simpler GUI, and open the Containers tab, you can see a container was created from the getting-started image and it's currently running on port 3000. Now if we go to localhost:3000, the app is running locally and we're accessing the container.

I can show you that you can actually write to this; as I mentioned, it's a to-do app, so I can add "get cat food" and "take out the trash." It's a very simple React app running inside a container. As I said before we started this demo, Docker contains the runtime, the environment variables, and every single dependency that React, or this application in general, needs in order to run, so we're siloed inside a container with its own runtime and environment and we can use the application within it. There are a couple of other useful commands: in the command line we can run docker ps and it will list any containers that are currently running, along with their container IDs, names, and other information. In this case the getting-started container is the only one running. Also note that you don't need the full container ID to reference a container; the first few characters are enough.

Let's move on to part three. Now we want to make some changes to the code, which shows how updating the application changes things. The guide asks us to change a paragraph inside the src/static/js/app.js file, so let's head back into VS Code, go to src, then static, then js, then app.js, go to line 56, change that text to what the guide tells us, and save. Now that we've updated the application, we have to build another version of the image, because the image we built cannot be rewritten; it is completely read-only, so any change means creating another image. That's exactly what we're going to do: run the same docker build command in the terminal to build another image.
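As a quick reference, the run and listing commands from this stretch look like this; the host side of the mapping (127.0.0.1:3000) could be any free port, but the container side must stay 3000 to match the EXPOSE in the Dockerfile.

```sh
# Run the image in the background (-d) and publish container port 3000 on 127.0.0.1:3000 (-p)
docker run -dp 127.0.0.1:3000:3000 getting-started

# List running containers (IDs, names, ports); a partial container ID works in later commands
docker ps
```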
Now if we run the container command again, we're going to get an error, because we already have a container with that name bound to port 3000. To remove the old container, we go back to docker ps, and we can see that the container is still running. If we say docker rm and paste in the container ID, we get another error saying you cannot remove a running container, so we say docker rm with the -f flag, which stops the container as well as removing it. That gets rid of it, and now we can copy and paste the run command again, and we've created another container running the updated image. If we go back to the to-do app and refresh, you can see it's completely fresh and the new text has been added. That's what happens when you update an application.

We're going to skip part four, but if you want to push this image up to Docker Hub for other people to access, you can absolutely do that on your own; we're not doing it right now. Part five is a big one, because sometimes we're dealing with data, and as we saw in the to-do app, we have data and we want to persist it, but if we were to stop the container and start a new one, the data we had previously would be gone. So let's go through the guide's steps: first we run a command that starts a container running Ubuntu and creates a file called data.txt containing a random number between 1 and 10,000. I'll open my terminal and run that, and you can see it created another container from the ubuntu image; I already had it locally, but if you don't, Docker will detect that and download it from Docker Hub. Once that's going, we copy and paste the next command, which checks the file inside the container, data.txt, and it should print out the number.
We need to replace the placeholder in that command with the actual container ID, and there we go: it prints 848, so we know there's data inside. Next the guide has us start another Ubuntu container and list every folder in the root of that container, which shows there's no data.txt in it, and then we remove that first container with docker rm -f.

So the way to persist data for a container is by creating a volume, and it's pretty simple to do. The first thing we do is run the command the guide gives us, which creates a volume called todo-db; in Docker Desktop we can now see todo-db in the Volumes tab. Next we have to remove the getting-started container we created, because it's running without a persistent volume (we created the volume after the container), so we need a fresh one. Again, grab the container ID of the getting-started container and run docker rm -f with that ID, and for good measure I'll remove the Ubuntu container too, so we have a clean slate.

Now we copy the run command, and there are a couple of things in it: along with the -dp flags and binding it to port 3000, we're also mounting the volume with the --mount option. The type is volume, the source is the todo-db volume we just created, and the target is the path where the app's data actually sits inside the container. Then getting-started at the end specifies the image we're using. Let me copy and paste this, and now you can see we have another getting-started container and it's using the todo-db volume we created. Since it's running, we go back to the to-do app, and as you can see there's data in there that I didn't just add; you saw it was empty, and these two tasks are ones I added yesterday, and they persisted. Now we're tapping into the volume, and anything we do from here on out, let's say I add "go to the gym," persists: if we stop this container, start it again, and refresh, the data is still there. Volume creation is the way to persist databases, especially if you're testing things; it's easy to play around with servers and actual databases now that data persists, instead of having to reload and re-enter data over and over. One more thing: if you run docker volume inspect on the volume itself, you can see some metadata, specifically the mount point, the name, the scope, and the creation time, which are cool things to know if you want to persist databases within Docker.
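Put together, the volume workflow looks roughly like this. The target path /etc/todos is my assumption of where the guide's to-do app stores its data file; substitute whatever path your version of the guide specifies.

```sh
# Create a named volume for the app's data
docker volume create todo-db

# Remove the old container, then start a new one with the volume mounted
docker rm -f <container-id>
docker run -dp 127.0.0.1:3000:3000 \
  --mount type=volume,src=todo-db,target=/etc/todos \
  getting-started

# Show the volume's metadata: mount point, name, scope, creation time
docker volume inspect todo-db
```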
Next, we're going to skip ahead to part seven; I don't really want to get into the bind mounts part right now. Part seven covers multi-container apps: as I mentioned, we may have several things running, so multiple containers are needed. In this case we have our to-do app, but we also want a MySQL container running in tandem with it, so we can isolate our database from our front-end app. For these two containers to talk, we need to create a network, and that's exactly what we'll do: we run docker network create, and the network's name is going to be todo-app. I'll run that now; it gives me an error because I already created it earlier, but go ahead and run it and it will create the network for you.

Next we create another container and attach it to the network we just created, and you can see a couple of flags here. The first is the network: we're tying this container to the todo-app network. The network alias is mysql, so when we see the container through the network we'll know this is the MySQL container. We're also creating a volume called todo-mysql-data to persist the actual database, and we have two environment flags specific to MySQL (if you were using something like Postgres, there would be different environment variables to attach): the root password, which is secret, and the database name, which is todos. And we're pulling the MySQL version 8 image from Docker Hub. Copy that into the terminal and create it, and now we have a MySQL image (again, I had it downloaded locally already) and the container running.

To confirm the database is up, we run docker exec -it with the container ID right after it, which drops us into the mysql command, we enter the user, which is root, and it asks for the password. Let me copy that, paste in the container ID it gave us in place of the placeholder, hit enter, and enter the password, which as we saw is secret. Now we're inside our MySQL database. The next thing the guide asks is to type SHOW DATABASES to confirm our database was created, and we see the todos database is there, so our commands ran successfully; now we can type exit. We're going to skip the next part because it's not really necessary, but if you want to follow it, it simply spins up a DNS tool that lets you see the IP address of a container and shows that each container has its own IP address. This networking step is really what lets the two containers talk to each other.
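Here's a sketch of those commands. The data path /var/lib/mysql is MySQL's standard data directory, and mysql:8.0 is my guess at the exact tag; the transcript just says "version 8", so adjust to whatever the guide pins.

```sh
# Network the two containers will share
docker network create todo-app

# MySQL container: attached to the network, aliased as "mysql",
# with a named volume for its data and the required env vars
docker run -d \
  --network todo-app --network-alias mysql \
  -v todo-mysql-data:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=secret \
  -e MYSQL_DATABASE=todos \
  mysql:8.0

# Open a MySQL shell inside the container (password is "secret"),
# then run SHOW DATABASES; to confirm the todos database exists
docker exec -it <mysql-container-id> mysql -u root -p
```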
What I'm going to do now is first remove the getting-started container we created earlier, because it's not tied to this network yet, and then create another one using the command the guide gives. It's the same run command as before but with a few more flags: we tie it to the todo-app network, there are a few environment variables for hooking it up to the MySQL database so the app can actually push data to it, we use the node 18 image, and we initialize the application with yarn install and yarn run dev. Let me copy and paste that over, and there we go, we have another container, this time built from the node:18-alpine image. Now if we want to check the MySQL database, we connect to it the same way as before and enter the password, which is secret. Now that we're in, we run a SELECT: select all from todo_items, and we can see that the items we created inside the application are now here inside the database. That confirms the data we created is flowing into our MySQL database, and that right there is the power of networking.
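For reference, the dev-mode app container command described here looks roughly like this. The MYSQL_* variable names and the /app bind mount follow the guide's conventions as I understand them, so treat them as assumptions and check the guide if yours differ.

```sh
# To-do app container: same port mapping, on the todo-app network,
# with env vars pointing at the "mysql" network alias
docker run -dp 127.0.0.1:3000:3000 \
  -w /app -v "$(pwd):/app" \
  --network todo-app \
  -e MYSQL_HOST=mysql \
  -e MYSQL_USER=root \
  -e MYSQL_PASSWORD=secret \
  -e MYSQL_DB=todos \
  node:18-alpine \
  sh -c "yarn install && yarn run dev"
```

And the check from inside the MySQL container:

```sql
-- run inside: docker exec -it <mysql-container-id> mysql -u root -p todos
SELECT * FROM todo_items;
```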
Earlier in this module I also mentioned Docker Compose, and it makes this whole process we just did with this application ten times easier. As I said before, Docker Compose is a tool that helps you define and share multi-container applications, and that's what we're going to do right now. Those long commands we had to run in the terminal to get the MySQL and to-do app containers going can all be expressed in one file, a Compose file, which is named docker-compose.yml or, in its shorter form, compose.yaml.

We'll head back into the terminal, and inside the root directory where we initially put our Dockerfile, we create a new file called compose.yaml. Docker has provided the contents, so we'll copy and paste them in; just make sure no stray spaces sneak in, because YAML is very strict about spacing and indentation and it will error out otherwise.

The first thing we do is specify the services of our Compose file. The first service is our app, which uses the node version 18 image, so we'll run Node in this container. Next we add the command we want to run, which is yarn install followed by running the dev environment. Then we open up the ports, the same mapping as before, 127.0.0.1 on port 3000. Next we set the working directory and the volume mapping, and finally we add the environment variables; these are all the MySQL environment variables so that the application can reach the database when we add a new to-do, so the host, the user, the password, and the database name are all specified for the app container.

Now we do the same for MySQL. Backing up a couple of spaces so that app and mysql sit at the same level, the mysql service pulls the MySQL version 8 image, the volume we want this container to use is todo-mysql-data, the overall volumes section sits outside of that, and finally we add the environment variables for this container so we can access it. Everything in your compose.yaml should now look like this: the top level identifies what services we want this file to create; the app container uses the node 18 image, this is the command we want to run, this is the port we expose to access the application, our working directory is the root directory that contains everything, these are the volumes we're mapping, and these are the environment variables we use to connect to our MySQL container, which in turn has its own image, volume, and environment. It's much easier to run this application with Docker Compose.
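Here's a sketch of what that compose.yaml ends up looking like. The working_dir, volume paths, and MYSQL_* variable names follow the guide's conventions as I understand them, so double-check them against the file the guide actually gives you.

```yaml
services:
  app:
    image: node:18-alpine            # the guide may pin node:18 or node:18-alpine
    command: sh -c "yarn install && yarn run dev"
    ports:
      - "127.0.0.1:3000:3000"
    working_dir: /app
    volumes:
      - ./:/app
    environment:
      MYSQL_HOST: mysql              # the network alias/service name of the mysql container
      MYSQL_USER: root
      MYSQL_PASSWORD: secret
      MYSQL_DB: todos
  mysql:
    image: mysql:8.0
    volumes:
      - todo-mysql-data:/var/lib/mysql
    environment:
      MYSQL_ROOT_PASSWORD: secret
      MYSQL_DATABASE: todos

volumes:
  todo-mysql-data:
```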
Now that we have our compose.yaml file, I can open my terminal, exit MySQL, clear the screen, and all I have to do is run docker compose up, adding the -d flag so it runs in the background. That just spun up two containers; one looks like it failed because I already have a container bound to port 3000, so let me delete the old ones from the Docker Desktop app so we have a clean slate, and run the same thing again. There we go: it spins up the getting-started app with our app and mysql containers, both running, with port 3000 open. If I head over to localhost:3000 and refresh, it's all fresh (note that we're using a different volume here), but the application is running successfully, it's attached to our MySQL database, and any data we add will persist because we've tied a volume to these containers. If we want to tear it down, all we have to do is run docker compose down, which stops our containers.

That right there is a simple rundown of how you can create images and containers with Docker using Dockerfiles or a compose.yaml file, and then work inside those containers. We ran an application in the form of a to-do app and spun up a MySQL database pretty quickly, each in its own environment. That way, if we have co-workers working on the same thing, they can run the same image with the same environment variables and the same dependencies, and recreate any problems or simply have the same environment we're running every single time. That is the power of containerization through Docker. We'll be using a lot of these skills in the coming lessons, but now let's move on to the SQL module, where we'll use SQL and also start our own little database inside Docker.

So now that we have Docker set up and know the basics, let's move on to SQL (or "sequel," however you want to say it). SQL stands for Structured Query Language; it's the standard language used for database creation and manipulation, and databases like MySQL or Postgres use SQL to let you query the data within their tables. Let's set up our playground: we'll stand up a Postgres database inside a Docker container and then create our own tables with fake data to play around with; we'll use them later in the course as well. First things first, I have a terminal open and I've created a data-engineering directory. Because we're using Docker, we don't have to install Postgres locally; we can run docker pull postgres, which looks up the postgres image on Docker Hub and pulls down the latest version, and then we can create a container running the Postgres image. So now we start our Postgres container: I'll say docker run, the name of this container is going to be data-engineering-postgres, the environment variable I want is the Postgres password, which is just going to be secret, I'll add the -d flag, and we'll use the postgres image.
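Collected in one place, the commands from this stretch look like this; the container name and password are just the ones used in the walkthrough.

```sh
# Bring the multi-container app up in the background, and tear it down again
docker compose up -d
docker compose down

# Postgres playground for the SQL module
docker pull postgres
docker run --name data-engineering-postgres \
  -e POSTGRES_PASSWORD=secret \
  -d postgres
```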
Go ahead and run that, and now I have my container. Next I want to create a new database inside Postgres, so I'll say docker exec with -u for the user, which is just postgres by default, then the name of the container, data-engineering-postgres, then createdb and the name of the database, which is going to be postgresdb. Now that we've run that, we've created a sample database inside. Let's connect to it: docker exec -it data-engineering-postgres psql, -U for the user postgres, and the name of the database, postgresdb. Now we're inside the Postgres database, and if I type \dt there are no tables yet, which is exactly what we're going to fix by creating our first table. I have these commands already written out so I don't waste too much time.

As an example, I'll create a table called users. Inside this command you see a few things: we have an id of type SERIAL, which is also our primary key and acts as an identifier; a first name of at most 50 characters; a last name, same thing; an email that's also characters but allowing up to 100; and a date of birth, which is just the DATE type. Hit enter, and it shows the table was created; now if I do \dt we can see the users table exists inside this database. I'm going to copy in some fake data as well, and don't worry, this will all be in the description below so you can copy and paste it as you follow along. It's simply users: a first name, a last name, an email, and a date of birth. Hit enter, and you can see it entered 14 records.

This brings us to the first command we're going to learn: SELECT. It's pretty self-explanatory. We entered 14 rows of data, and if I want to read anything back from that table, I need the SELECT command. I'll say SELECT and add an asterisk; the asterisk is a wildcard, meaning you want all of the columns within a table, so you can read this query as "select all." Then I add FROM and, as the last part, the table name, users. And do not forget, this is very important: every query needs to end in a semicolon, otherwise the query won't go through, so always add a semicolon at the end of every query. Once I hit enter, I get back a formatted view of the table I just queried: exactly 14 rows, each with the id, first name, last name, email, and date of birth, formatted in a much more pleasing way than the raw insert. The SELECT query is the most frequently used command by everybody who uses SQL, so make sure you know and understand it; again, it's very straightforward.
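Here's a sketch of that setup. The table definition mirrors what's described above; the INSERT row is just illustrative sample data, not the actual seed data from the video.

```sh
# Create a database inside the running container, then open a psql session to it
docker exec -u postgres data-engineering-postgres createdb postgresdb
docker exec -it data-engineering-postgres psql -U postgres postgresdb
```

```sql
-- Users table as described: serial primary key, names, email, date of birth
CREATE TABLE users (
    id            SERIAL PRIMARY KEY,
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    email         VARCHAR(100),
    date_of_birth DATE
);

-- Illustrative row (the video inserts 14 of these from a prepared script)
INSERT INTO users (first_name, last_name, email, date_of_birth)
VALUES ('John', 'Doe', 'john_doe@example.com', '1990-01-15');

-- All columns, all rows; every query ends with a semicolon
SELECT * FROM users;
```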
Another keyword that gets used in conjunction with SELECT is DISTINCT. We use DISTINCT with SELECT when we want the unique values within a particular column. Let's say we want every single email within the users table: we write SELECT DISTINCT email FROM users. Reading that back: we want to select the distinct emails from the users table. If you haven't noticed already, SQL keywords are typically written in all caps; you don't have to do that, but it's best practice to capitalize everything that is a SQL command and lowercase everything that is a table or some kind of value. Again, it reads just like English. Add the semicolon, hit enter, and it gives us just the emails, every distinct one. Keep in mind this isn't in any particular order, but we get 14 rows back, so the query was successful. So that's how you select data from the whole table to get a better view, or select distinct values within a particular column. Let's move on to other commands.

Now that we have a users table with data in it, what if we want to update some information, say an email if a user changed it, or a date of birth they entered wrong? This is where the UPDATE command comes in handy. Let's say, looking back at our table, that we want to change John's email. The query starts with UPDATE, then the table name, users, and then SET, because we need to say which column and value we want to change in the users table: email, equals, and the new value inside single quotes, for example new_mail@gmail.com. The problem with this query is that if we hit enter on it as-is, it would change the email for every single row in the users table, which isn't what we want, and if you were doing this in a production database it would go haywire. So how do we modify this query so it only updates one person's email? We introduce the WHERE clause. After we set the email to the new value, we say WHERE first_name = 'John' and end with a semicolon. Backtracking through it: we update the users table, we set the email column to new_mail@gmail.com where the first name equals John, meaning we look for John in the first_name column, and once we find him we change his email to the new value. I hit enter and we can see one row was updated. Now if I run SELECT * FROM users again, we can see that John actually changed: he gets moved down in the output, he still has an id of 1, but his email is now new_mail@gmail.com. So the UPDATE command is what you reach for whenever data changes and you need to modify it.
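As a sketch, the dangerous version and the corrected version look like this:

```sql
-- Without a WHERE clause this would overwrite the email on EVERY row:
-- UPDATE users SET email = 'new_mail@gmail.com';

-- Scoped update: only rows whose first_name is 'John' are touched
UPDATE users
SET email = 'new_mail@gmail.com'
WHERE first_name = 'John';
```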
Now let's say you want to enter new data into a table. That's where the INSERT command comes in; if you're starting to see the pattern here, SQL commands are very self-explanatory in that they read like English, and kind of like Python. So to insert new values, you reach for INSERT. We're going to create a new table, the films table, add some films, and go more in depth into the INSERT command.

All right, we're back in our database and I have the command ready. I'm going to create the films table: CREATE TABLE is another command we didn't really dig into, but again it's self-explanatory, we're just creating a table called films. Within the parentheses we have a film_id with its type, a title, a release_date, a price, which is new for us, with a DECIMAL type, a rating, and a user_rating. The thing to note on user_rating is that we check that the value going in is greater than or equal to 1 and less than or equal to 5, so ratings stay between 1 and 5. Hit enter, and if I run \dt, which is kind of like ls in a normal bash terminal in that it lists every table inside the database, we see the films table is now created.

Now let me copy in my INSERT command. It reads INSERT INTO, so again self-explanatory: we want to insert values into the films table. Within the first set of parentheses are the names of the columns we want to fill, matching exactly what we defined when we created the table. After that comes the VALUES keyword, and then each set of parentheses holds values that have to match the order in which you listed the columns. The first one is the title (this data isn't real, I generated it), so we have Inception, the release date in the date format, its price as a decimal, the film's rating, and the user rating, and you can see each value fits within its set of parentheses and matches the column order. Hit enter and we've entered 20 rows of data. If I say SELECT * FROM films, it's a little out of place because my terminal is a bit big, but if I make the text smaller we can see the table is formatted properly, the same way the users table was. That's how you insert data into a table; it doesn't have to be that big every time, and if you just wanted to insert one more row it would be the same query with a single line of values.

Another really useful command to know is LIMIT. Imagine you have a table with thousands upon thousands of records but you only need a certain amount; you want to limit the query to that amount, and that's where LIMIT comes in. Our films table has 20 rows, but what if we want fewer than 20, say five? We write SELECT * FROM films, and where we'd normally end the query with a semicolon, we add LIMIT and the number we want to limit the query to, so LIMIT 5. Hit enter and you only get the first five records back.
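A sketch of that table and the queries around it. The exact column types for title and rating aren't spelled out in the video, so the VARCHAR sizes here are assumptions, and the Inception row uses made-up price and rating values.

```sql
-- Films table with a CHECK constraint keeping user_rating between 1 and 5
CREATE TABLE films (
    film_id      SERIAL PRIMARY KEY,
    title        VARCHAR(255),
    release_date DATE,
    price        DECIMAL(5, 2),
    rating       VARCHAR(10),
    user_rating  DECIMAL(2, 1) CHECK (user_rating >= 1 AND user_rating <= 5)
);

-- Column list and value list must line up in the same order
INSERT INTO films (title, release_date, price, rating, user_rating)
VALUES ('Inception', '2010-07-16', 12.99, 'PG-13', 4.8);

-- Only return the first five rows
SELECT * FROM films LIMIT 5;
```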
That's a very easy command to understand and very useful for limiting the number of records that come back, because as I said, if you're querying a table with thousands of records, that query could take a very long time; if you want it to speed up and you only need a sample to see what the data looks like and whether it's formatted properly, the LIMIT command is your best friend.

The next thing we'll cover involves a little math: aggregate functions. Aggregate functions in SQL let us do computations over a single column across every row of a table. They're incredibly useful in data analytics, where they let us calculate important metrics and gain insights from our data. The first one is COUNT. The COUNT function counts the number of rows that match the condition we give it. Now that we have our films table, let's say we want to count the number of films without knowing how many are in the table. We start the query with SELECT, then COUNT; in the parentheses goes what we want to count, and if I put an asterisk there, it counts every record in the table that matches the condition I'm about to give it. Then FROM films. What this gives back is the number 20, because we have 20 rows. Just remember that what goes in the parentheses is the condition, and COUNT always spits back a number based on it.

The next aggregate function is SUM, which we use to calculate the sum of all the numbers in a numeric column. We have a price column, so if we want the sum of every single price, we say SELECT SUM, and inside the parentheses price, because we want the price column, FROM films. The SUM function looks into the price column, adds up every value in it, and returns the total across every movie in the table; if you had gone to every single one of these movies at these prices, you would have spent $228.80. Now say we want the average of something in our table, for example the average user rating of the movies. That's where the AVG function comes in, and it looks exactly like SUM: SELECT AVG, with user_rating in the parentheses, FROM films, and that gives us the average of every user rating the movies have, which is 4.6.
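Those three aggregates as runnable queries (the results quoted above — 20, 228.80, and 4.6 — come from the video's generated dataset):

```sql
SELECT COUNT(*) FROM films;          -- how many rows are in the table
SELECT SUM(price) FROM films;        -- total of the price column
SELECT AVG(user_rating) FROM films;  -- mean user rating across all films
```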
Lastly, two of the more useful ones are MAX and MIN, which find the highest and lowest numeric values within a column. Let's look at our price column again: if we want the lowest and highest values in the price column, the query looks like SELECT MAX with the column we want, price, inside the parentheses, then a comma, then MIN, again with price in the parentheses, FROM films. If I hit enter, we see the max value was $14.99 and the minimum was $7.99, so the cheapest movie in the table was $7.99 and the most expensive was $14.99. Very helpful for sifting through a big table when you're just trying to find the highest and lowest values.

Another useful statement in SQL is GROUP BY. We use GROUP BY to collect identical data into groups, and it's typically used with aggregate functions like the ones we just covered (COUNT, SUM, MIN, MAX, AVG) so that we can group the result set by one or more columns. Let's see how this looks using GROUP BY with the AVG aggregate. Looking at the films table, say we want to group the average user rating by each movie's actual rating. How do we do that? We select the rating column, then add a comma and our aggregate function, since we're stringing two things together in the SELECT, and inside the AVG parentheses we put the user_rating column, FROM films. After that we say GROUP BY, and what we want to group by is the rating. So we're taking the average of every user rating in the films table and grouping the results by each individual rating; we have PG-13, R, G, and PG, so we should see four rows in the results. After I hit enter, you can see the average rating for R-rated movies was 4.61, PG-13 was 4.5, PG was 4.8, and G was 4.4. You can see how useful this is for grouping together identical forms of data and comparing them against each other; it's especially important in data analysis, because it lets us organize the data in a different way and run computation on top of it, especially with the rows in groups. It's also important to note the order of this query: if we had switched the rating column and the AVG function, we'd get a differently shaped result, so the order of the query does matter depending on how you want to format your data.
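Those two queries written out:

```sql
-- Highest and lowest price in one result row
SELECT MAX(price), MIN(price) FROM films;

-- Average user rating per film rating (one output row per rating group)
SELECT rating, AVG(user_rating)
FROM films
GROUP BY rating;
```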
One cool feature of SQL is joining two different tables that hold different data sets and getting a result that combines the data you need into one formatted table. There are two commands we'll use for this, and we're going to go over them right now: INNER JOIN and LEFT JOIN. They do very similar things, but there are key differences that make each unique. Before we get to those, we need another table to use alongside our existing films table, so let's first create the film_category table. I'll copy and paste this in: it has a category_id, a film_id, and a category_name, and these categories match every movie we currently have in the films table. Then I'll insert the values, hit enter, and we have 39 records. Now if I run SELECT * FROM film_category, we can see our category IDs, our film IDs, and their category names.

Skipping past the output, we now have two tables we can join with each other if we want a result that has all the movies alongside their film categories. We're going to learn about two things here: how INNER JOIN works and how aliases work in your SQL queries. For joins to really make sense, though, we need to create a couple more tables that relate to the data we already have and insert some new rows so we can join them together. The first table is the actors table, which has an actor_id and an actor_name and will hold all the actors from every film we currently have. We'll also create the film_actors table, which will make sense in a second: it has a film_id and an actor_id, together forming its primary key. Into the actors table go the names of each actor associated with the specific films, and the film_actors rows pair those actor IDs with the film IDs. So now we have two separate tables we just created: if I run SELECT * FROM actors, we see the actor_id and the actor_name, and if I run SELECT * FROM film_actors, we see a film_id and an actor_id. These relate to the other tables we have, and now we can focus on joining all this data together and making sense of INNER JOIN and LEFT JOIN.

Like I said, we'll focus on the INNER JOIN first and also add some aliases and see what those look like. Let me copy and paste the query in so we can walk through what it looks like.
First we have our SELECT statement and what we're selecting: f.film_id (this will make sense in a second), f.title, and a.actor_name, so the result will show the film ID, the title, and the actor name. Next we select FROM the films table, and after films there's a space and an f: that's the alias we're giving the films table, so any time we want to refer to the films table, all we have to do is type f. After the FROM and the alias comes our INNER JOIN statement; this is where we bring in other tables and join them together. After INNER JOIN we take on the film_actors table and give it an alias of fa, short for film actors, and we join the two tables ON a condition, which is why we write ON: where the film_id from the films table equals the film_id from the film_actors table, that's where those rows get joined. Then we add another INNER JOIN (you can string on as many joins as you want; in this case we use two) to bring on the actors table with an alias of a, on the condition that the actor_id from the film_actors table equals the actor_id from the actors table. Lastly we have ORDER BY, a statement we haven't covered yet: we're ordering the results by the film IDs from the films table.

I know that's a lot to take in, so let's go over it one more time. The final table will have three columns: the film_id from the films table, the title from the films table, and the actor_name from the actors table. The reason we don't see the film_actors table in the select list is that we're using it as a bridge to connect the actor names to their correct films. If we look at the film_actors table, it only has a film_id and an actor_id; each is just a number, but they correspond to an actor and a movie. So we take the data from film_actors and join it onto the films table: wherever a film_id matches a movie in the films table, we match those rows together and use that data for the result. That bridge then lets us join the actors table on top of film_actors, so we can associate actors with the correct movies now that we have the film IDs and the actor IDs that go with them. Finally, we order everything by the film_id from the films table.
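Putting that walkthrough together, the query reads like this:

```sql
-- films -> film_actors (bridge) -> actors, with table aliases f, fa, a
SELECT f.film_id, f.title, a.actor_name
FROM films f
INNER JOIN film_actors fa ON f.film_id = fa.film_id
INNER JOIN actors a       ON fa.actor_id = a.actor_id
ORDER BY f.film_id;
```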
So everything is ordered numerically, 1, 2, 3, 4, and film ID 1 matches Inception; Leonardo DiCaprio, who was in Inception, checks out, Tim Robbins was in The Shawshank Redemption, Marlon Brando was in The Godfather, and so on down the list, so this all makes sense. If we do SELECT * FROM films we can see that Inception has a film ID of 1, and if we do SELECT * FROM actors we can see that Leonardo DiCaprio has an actor ID of 1, so the actor ID of 1 and the film ID of 1 match exactly across the three tables we have relationships between, all to give us one very nice table we can now use for any kind of data analysis or whatever computation your teams need. That's a really awesome way to join multiple tables, and it was an example of an inner join. Left joins do something similar, but there's a catch between the two that makes each unique, as I mentioned before, so let's take a look at what left joins look like. For left joins to make sense we need to add another entry to the films table, so let's do that: I have a single entry here that I'll copy and paste, simply going back into the films table and inserting the Mystical Adventures movie. If we do SELECT * FROM films we can see Mystical Adventures is now added at the very bottom, so we now have 21 films. Now let me copy the left join query. It looks very similar to the inner join: we're taking the film ID from the films table, the title from the films table, and the actor name from the actors table; we're adding the aliases again, with the letter f for the films table, and we're left joining the film_actors table, again using it as the bridge. It's essentially the same query, but if we hit enter right now you can see the table still comes back filled out and formatted properly, except Mystical Adventures shows up with no actor. That's the thing with left joins: even if there's a null value or a missing match, the row from the left table still shows up in the result. If we go back up to the inner join query after adding the new movie, Mystical Adventures doesn't show up at all: with an inner join, if there's a null value and the condition isn't fully met, that row is not in the result. So the inner join result has only 20 entries, whereas the left join has 21, just with a missing actor. To show this is actually true, we'll insert new actors and the film_actors rows that tie them to the new movie. The movie we added is fictional, so let's add Jane Doe and John Smith as the actors, and add the film_actors entries as well, 21 and 22. If we run the left join command again, Mystical Adventures now populates with the actor names. This proves that no matter what, a left join will always show the rows from the left table in the result, even if the conditions aren't met and there's a null value.
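Here's a sketch of the left-join version just described; the only change from the inner join is the join type, which is what lets the new film show up with a NULL actor until matching rows exist:

```sql
-- LEFT JOIN keeps every row from films, even when there is no matching actor yet
SELECT f.film_id, f.title, a.actor_name
FROM films f
LEFT JOIN film_actors fa ON f.film_id = fa.film_id
LEFT JOIN actors a       ON fa.actor_id = a.actor_id
ORDER BY f.film_id;
-- "Mystical Adventures" comes back with actor_name = NULL until the new
-- actors and film_actors rows are inserted
```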
There are two more commands we're going to go over: the first is UNION and the second is UNION ALL. These sometimes get mixed up with inner join and left join since they feel similar in how they work, but they're completely different, so let's go over what makes them unique. We use the UNION operator to string together two SELECT statements, but UNION removes any duplicate rows with the same values, while UNION ALL keeps the duplicate rows in the result set. So the relationship between the two is a bit like inner join versus left join in terms of their differences, but the queries themselves are very different, so let's take a look. Back in the terminal, we copy the UNION command and go over every line. With this command we SELECT title AS name FROM films; we're using the alias "name" here, and note that the AS operator is completely optional when giving aliases, so you can omit it, but it makes things a little clearer. So instead of the column being called title, it's going to be called name in the result. Then we add the UNION operator and string on another SELECT statement: SELECT actor_name AS name FROM actors, and we ORDER BY name. If I hit enter, the result doesn't make a ton of sense on its own, but you can see how it works: there's one column, since we named both columns "name", and it mixes actor names and movie titles, all ordered alphabetically. Now if we do the same thing with UNION ALL, remember there are going to be duplicate rows, or duplicate values, in the result: if we scroll down you can see we get Tom Hanks twice, whereas in the table above we only got Tom Hanks once. So that's a prime example of UNION and UNION ALL: we're combining the data from two SELECT statements into one result table.
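A sketch of the two queries being compared here, same SELECT statements, different set operator:

```sql
-- UNION removes duplicate rows from the combined result
SELECT title AS name FROM films
UNION
SELECT actor_name AS name FROM actors
ORDER BY name;

-- UNION ALL keeps duplicates (e.g. "Tom Hanks" can appear twice)
SELECT title AS name FROM films
UNION ALL
SELECT actor_name AS name FROM actors
ORDER BY name;
```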
Another cool little thing you can do in SQL, and one that's going to be very useful later down the line as a data engineer, is something called subqueries. We've seen hints of them already with the parentheses, but let me show you what one looks like and then we'll go over it. Here we first have a SELECT statement where we're selecting the title; then, after a comma, inside a set of parentheses, we're using another SELECT statement that selects the actor name, again using aliases and a join (from actors with an alias of a, joining film_actors, again as the bridge), so the result of that inner SELECT is an actor name. The outer query is pulling from films with the alias f. So it's just a SELECT statement with another SELECT statement nested inside it: a subquery. Before the outer query finishes, SQL executes the inner query and adds its value into the result, so what this is doing is retrieving each film's title along with the name of one of its actors. If we hit enter, you can see it's essentially the same kind of thing we did earlier when we joined two different tables, just done with a subquery instead. Subqueries can be used in multiple ways: we saw one in the SELECT clause, but we can also use them with the IN operator. If we wanted to retrieve films that feature either Leonardo DiCaprio or Tom Hanks, the query would select the title from the films table WHERE the film ID is IN the result of a subquery; the subquery joins film_actors and actors and returns the film IDs where the actor name is either Leonardo DiCaprio or Tom Hanks. If I hit enter you'll see we only get titles back: Inception, Forrest Gump, and Toy Story are the three movies containing Leonardo DiCaprio or Tom Hanks, and most of us have seen those, so that checks out. That's a quick intro to subqueries, a pretty awesome little thing you can do in SQL. All right, now that we've gone through SQL, we can finally get into building our first data pipeline. It's going to evolve as the course progresses, but we're going to start by building the pipeline from scratch with a Python script and moving data that way. We're also going to use Docker to host both our source and destination databases with Postgres, so you'll get to see how this all works together with Docker. One thing to note before we move on: there's a repo for this whole project, with a README on how to get started and how everything works. If you look at the branches you can see the evolution of the project: first we start with just the main ELT script, then we add a cron job to show orchestration, from there we add dbt to the mix (which we'll get into later), then Airflow, and eventually Airbyte. So there's a progression through the course, showing what a hand-written ELT script looks like and then, as we add more open-source tooling, the power of how everything works together and why we're using these open-source tools in the first place. The link to my GitHub repo is in the description with everything you need, just in case you get stuck on something. For the rest of the project I'm going to be creating a new directory, but it directly mirrors that repo, so keep an eye on it while you work through this. All right, I have my terminal open, so let's create our first directory: I'll make an elt directory and cd into it, then touch elt_script.py, and let's also create a Compose file, since I'm going to use Docker Compose for this.
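The scaffolding commands boil down to something like this; the exact Compose filename isn't spelled out on screen, so docker-compose.yaml is an assumption here:

```bash
mkdir elt && cd elt          # project root
touch elt_script.py          # the Python ELT script
touch docker-compose.yaml    # Compose file (filename assumed)
```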
Let's go ahead and open that up. You can see the two files we just created, the elt script and the docker-compose file, both empty, and now we're going to fill them out. The first thing to fill out is the Docker Compose file, so that we have the two databases we need: the source, where our data is coming from, and the destination, where we want to send data, so the ELT script can move data between them. We're going to use Postgres for this example because it's an open-source database we can use freely. The first thing we need is the version, and version 3 is what I'll use for the compose file. Then we list our services. The first service I'm going to name source_postgres. Remember, with YAML files it's crucial to have proper spacing: if the indentation isn't right, the YAML file won't parse and will error out, so keep that in mind. The first thing we do is list the image; I'm just going to pull the latest Postgres image from Docker Hub, so we're always up to date. Then the ports: this is very important for accessing the service itself. For both the source and destination databases we need to expose ports so we can reach them on localhost, and for the source I'm going to map host port 5433 to the container's default, 5432. For the networks, we're going to create a network that all of our containers share (you can look at each of these services as a container, each running an image). I'm going to name it elt_network, and at the end of the file we'll have a networks section that lists elt_network and specifies a bridge driver; you don't have to worry about exactly what that means, just know that within this network all of the containers can talk to each other. Remember that the networks section sits outside the services section, as its own top-level section with its own properties, so make sure you do that. Now, within the source_postgres service we add some environment variables that are very important. The first is the Postgres database name, which we'll set to source_db to keep it simple. The next is the user (and note this should be a colon, not an equals sign, my bad), which is postgres by default. Then the password, which you can set to whatever you want; I'm going to use "secret". Now for the volumes, which will make way more sense in a second, and I should actually create the folder for the script now.
We're going to drag the elt script into that folder just so we have some separation between everything; the docker-compose file will remain in the root. Next is the volumes section. If you remember from the Docker section, volumes are how we persist data: each container can mount a volume, and whatever data we work with inside that container lives in the volume and persists as long as we don't delete it from Docker. What I'm going to do here is specify a path mapping (this will make a lot more sense in a second) from source_db_init/init.sql on our machine to a path inside the container. We're going to initialize this Postgres database with some data, and the init.sql file we'll create inside the source_db_init folder is what does that. After the colon comes the directory inside the Docker container that we're mapping to: think of the left side as our local path and the right side as the container path, which is /docker-entrypoint-initdb.d/init.sql. So that's what the source service should look like, and the destination is going to look almost identical, so I'm going to duplicate it (shift option down) and add a space. The first service is the source database, and the new one is the destination database: we change the port mapping to 5434, keep it on the elt_network, rename it to destination_postgres with the database name destination_db, and keep the user and password the same. We deliberately don't add a volumes section here, because we don't want the destination's data to persist between runs: for testing purposes I want to be sure the ELT script runs successfully every single time, and if the data persisted I couldn't confirm or deny that the script is actually doing the work. So every time we kill this Docker container, I want the data to vanish. The next service, and the last one for this section, is the elt_script itself. Why do we need one here? Because we want Docker to see that we're using a script to send data between the source and destination databases, and to run it for us at runtime instead of us triggering it manually. So we're going to use a hand-written Dockerfile alongside Docker Compose; there are different reasons for doing it this way, but I think it's just easier to run certain commands like this, and you'll see what it's doing. It looks a little different: under the service we say build, and under the build property we add a context. The context is essentially the folder we're building from, which is going to be elt_script inside the root folder, relative to where the docker-compose file is, and we point it at a Dockerfile, spelled with a capital D, because that's how I'm going to name it when we create it.
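Putting the two database services together, the compose file described so far looks roughly like this (a sketch with the ports, names, and paths stated above; the elt_script and, later, dbt services get layered on top of it):

```yaml
version: '3'

services:
  source_postgres:
    image: postgres:latest
    ports:
      - "5433:5432"              # host port 5433 -> container port 5432
    networks:
      - elt_network
    environment:
      POSTGRES_DB: source_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
    volumes:
      # seed the source database on startup
      - ./source_db_init/init.sql:/docker-entrypoint-initdb.d/init.sql

  destination_postgres:
    image: postgres:latest
    ports:
      - "5434:5432"
    networks:
      - elt_network
    environment:
      POSTGRES_DB: destination_db
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: secret
    # no volumes: destination data should not persist between test runs

networks:
  elt_network:
    driver: bridge
```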
Now, the command we're going to run: we put it in square brackets and say python, since this is a Python file, and then elt_script.py. Same thing with the networks: networks, then elt_network. And then there's something we learned in the Docker section, the depends_on property, which says this container depends on other containers and won't start until the containers it depends on have finished initializing. If I say this service depends on both the source and the destination services, the script won't run until those two are done starting up, which does take a moment. That makes sense, because there's no way to move data if the source has no data in it yet and the destination isn't ready to receive it; if we ran the script anyway we'd hit a lot of errors. This is exactly why I wanted a Docker container for the ELT script: so we can control how and when it runs. Now let's create the Dockerfile, named just like we said. Inside the Dockerfile the structure looks a little different. FROM is the image we want to build on, and we're going to use python:3.8-slim. Then we want to install the Postgres command-line tools, so we add a RUN instruction, apt-get update && apt-get install -y postgresql-client, since the image runs on Linux; that installs the Postgres client. We also COPY elt_script.py into the image (make sure you have that period at the end for the destination), and the command we run is again python elt_script.py. That's the structure of the Dockerfile. The next thing to do is create our source data and initialize the source database with it. If you go to the repo for the custom ELT project, you'll see a folder called source_db_init, and what it holds is a SQL file containing fake data: it creates a couple of tables and inserts values into them. So let me go back and create a new folder named source_db_init, and inside it a file called init.sql, exactly the file we pointed at in the compose volume, mapping it into the container so we can initialize the source database from it. Inside that file, just copy everything you see in the repo's version and paste it in to save some time; it creates a couple of tables and inserts some fake data for us, like I mentioned. Save that, and now we can actually go about running all of this and creating everything.
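The Dockerfile for the script, as described, is roughly:

```dockerfile
FROM python:3.8-slim

# Postgres client tools give us pg_dump, pg_isready, and psql inside the container
RUN apt-get update && apt-get install -y postgresql-client

# copy the ELT script into the image
COPY elt_script.py .

CMD ["python", "elt_script.py"]
```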
The way we're going to do that is through our ELT script, so let's fill out elt_script.py now. There are a couple of things to import: first subprocess, which is how we'll run external commands and control their inputs and outputs, and also time, which will make sense a little later. The first thing we want to do is double-check that the databases are up. I know the docker-compose file already makes sure the ELT script won't run until the source and destination databases are working, but we can add a fallback just in case. We'll create a function called wait_for_postgres, and the parameters we need are the host, max_retries with a default value of 5, and delay_seconds, which we'll also leave at 5. Inside the function we set retries to zero, and while retries is less than max_retries we try: result = subprocess.run with pg_isready and the -h flag pointing at the host (this is all Postgres tooling), with check=True, capture_output=True, and text=True. Then an if condition: if "accepting connections" appears in the output, meaning the database we're pinging is ready to go, we print "Successfully connected to Postgres" and return True. In the except branch we catch subprocess.CalledProcessError as e, print an f-string like "Error connecting to Postgres:" with the error, increment retries by one, print another log line along the lines of "retrying in {delay_seconds} seconds (attempt {retries}/{max_retries})" just for logging, and call time.sleep(delay_seconds). Outside the loop we print "Max retries reached, exiting" in case all else fails, and return False. That's our check function; I know that was a lot, and I spelled "retries" wrong both times, but fixing that clears the squiggly lines. Before we get into initializing the source database and moving data to the destination, let's actually use the check: if not wait_for_postgres(host="source_postgres"), we exit with error code 1; otherwise we print "Starting ELT script".
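Here's a minimal sketch of that check function, assuming the structure just described (pg_isready from the Postgres client tools, a retry loop, and a sleep between attempts):

```python
import subprocess
import time


def wait_for_postgres(host, max_retries=5, delay_seconds=5):
    """Ping the database with pg_isready until it accepts connections."""
    retries = 0
    while retries < max_retries:
        try:
            result = subprocess.run(
                ["pg_isready", "-h", host],
                check=True, capture_output=True, text=True,
            )
            if "accepting connections" in result.stdout:
                print("Successfully connected to Postgres!")
                return True
        except subprocess.CalledProcessError as e:
            print(f"Error connecting to Postgres: {e}")
            retries += 1
            print(f"Retrying in {delay_seconds} seconds... (attempt {retries}/{max_retries})")
            time.sleep(delay_seconds)
    print("Max retries reached. Exiting.")
    return False


if not wait_for_postgres(host="source_postgres"):
    exit(1)

print("Starting ELT script...")
```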
Now we can set up the configuration. We create a source_config dictionary, and the first thing to specify is the database name. We're going to be working with dump files here: if you're familiar with Postgres, we can dump the source database to a file and then load that dump into the destination, and that's exactly how this batch-style ELT script works. So dbname is source_db (and note it's a colon in the dictionary, not an equals sign), then the user, which is what we specified in the docker-compose file, since all of this information comes from there, then the password, which was "secret", and then the host. The host should be the name of the service we put in the docker-compose file, so source_postgres. That's the source config. We do the exact same thing for destination_config: dbname is destination_db, the user is postgres, the password is secret, and the host is destination_postgres. No typos there, perfect. Next we create the dump command for exporting the source data. Underneath, we define dump_command using pg_dump: the -h flag points at source_config's host, -U at its user (you can see how each flag maps to a piece of the config: host, user, and so on), -d is the database name, pointing at source_config's dbname, and -f is the file we're dumping into, which is 'data_dump.sql' in single quotes. If we add the -w flag, pg_dump won't prompt us for the password, which saves us from typing it every single time. The other piece is setting an environment variable to avoid the password prompt entirely: subprocess_env = dict(PGPASSWORD=source_config['password']), so now there's a PGPASSWORD environment variable and we won't be asked for the password each time. Then we execute the dump: subprocess.run(dump_command, env=subprocess_env, check=True), and that dumps everything from the source database into the file. Now we need to get everything from the source over to the destination, so we create a load command that looks almost identical; I'll copy and paste it and make some slight modifications. Instead of pg_dump we use psql, the Postgres command-line tool; we point -h at the destination host instead of the source; the user is the same but pulled from the destination config; and the database name is the destination's. For the file, what we actually do is pass two flags, -a and -f, delete the -w, and we're still pointing it at the data_dump.sql file.
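And here's a sketch of how the rest of the script hangs together, as described in this and the next part: the two config dictionaries, the pg_dump into a file, and the psql load into the destination.

```python
source_config = {
    "dbname": "source_db",
    "user": "postgres",
    "password": "secret",
    "host": "source_postgres",        # matches the compose service name
}

destination_config = {
    "dbname": "destination_db",
    "user": "postgres",
    "password": "secret",
    "host": "destination_postgres",
}

# Dump the source database to a file; -w means "never prompt for a password"
dump_command = [
    "pg_dump",
    "-h", source_config["host"],
    "-U", source_config["user"],
    "-d", source_config["dbname"],
    "-f", "data_dump.sql",
    "-w",
]

# Pass the password through the PGPASSWORD environment variable instead
subprocess_env = dict(PGPASSWORD=source_config["password"])
subprocess.run(dump_command, env=subprocess_env, check=True)

# Load the dump into the destination (-a echoes statements, -f reads the file)
load_command = [
    "psql",
    "-h", destination_config["host"],
    "-U", destination_config["user"],
    "-d", destination_config["dbname"],
    "-a", "-f", "data_dump.sql",
]

subprocess_env = dict(PGPASSWORD=destination_config["password"])
subprocess.run(load_command, env=subprocess_env, check=True)

print("Ending ELT script...")
```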
Now we do the same environment variable trick: subprocess_env with PGPASSWORD again, copied from above but using destination_config this time, since it's literally the same thing with the destination's password. Then we execute the load command, subprocess.run(load_command, env=subprocess_env, check=True), and finally we print "Ending ELT script" just for our own purposes. So that's our full ELT script. Now how do we run it? We're going to let Docker run everything. Because we've hooked everything up with Docker Compose, it's as simple as this: inside the terminal, making sure we're in the directory we created, I run docker compose up. I did forget to change one of the host values, so let me fix that; I also hit a slight error around the compose services and a path in the build context that needed fixing, so just a couple of errors here, but nothing that's going to shake us up. Now that docker compose up is running, it goes through and builds the images, pulling them down from Docker Hub. Looking at the logs (let me pull this over so you can actually see it), the destination and source databases have initialized, the source database was created, and the tables and their values were inserted, but it looks like we ran into an issue with the ELT script: "starting ELT script" shows up, so that part was good, but then there's a traceback on line 54. Let's look at the script. You can see all the little nuances that come with creating a script like this by hand, and this is kind of on purpose: I'd obviously rather not run into issues, but it comes with coding and with building these things out manually, and we'll see that it gets easier to deal with later down the line. Line 54 is the subprocess.run(dump_command) call, and I think the user flag should be a capital -U, so let's try one more time: docker compose down, then up. The source is ready to connect, but we're still hitting the same line 54, "returned non-zero exit status". Let's try without check; no, that's not it. Let me re-check the config values: dbname source_db, user postgres, password secret, hosts source_postgres and destination_postgres. Okay, I've deleted the image in Docker, so let's build it again and see if that fixes it. Ah, I see the issue now; I just screwed that one up.
I think that should be good now, so let's docker compose down again and docker compose up; it was just a casual typo. I don't think it will work quite yet, because it's probably still reading the old image, so I'll docker compose down again, go into Docker Desktop (you can see what the troubleshooting looks like), delete the image we built since it hasn't changed, come back to the terminal, and docker compose up so the image gets rebuilt. There we go. After some troubleshooting and some small typos here and there, we've successfully got our logs back, ending with "Ending ELT script". The way to check our work now is to open a new terminal, go back into the project directory, and use docker exec, which we learned in the Docker tutorial: docker exec with the interactive flag -it and the name of the container. We don't want the elt_script container, we want the destination one, so we grab the destination_postgres container, followed by psql, the Postgres command line, with -U postgres for the user. Once we're in, the one thing we need to do is \c destination_db to connect to the destination database. Now if I type \dt (that's a backslash, by the way) you can see all of the tables are now inside the destination database, and if I do SELECT * FROM actors (remembering the semicolon) you can see that all the data we inserted into the source database is now in the destination database. It's fully working. It's not automated at all yet, but our batch script is successfully working, simply by taking the dump commands built into Postgres and moving the data into the destination database. Now, this is Postgres to Postgres, so it's about as simple as it gets, but you can see that even this basic script has a lot of moving parts when you build it manually. When things change, or if you're trying to move data from Postgres into some other database type where you potentially have to match schemas and things like that, it can get very tricky; here we didn't have to worry about that because it's the same database and the same protocol on both sides. That's exactly why we're going to be bringing in a lot more open-source tooling.
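To recap, checking the destination from a second terminal looked roughly like this; the container name depends on your project directory name, so yours may differ:

```bash
# open a psql shell inside the destination database container
docker exec -it elt-destination_postgres-1 psql -U postgres

# then, inside psql:
#   \c destination_db        -- connect to the destination database
#   \dt                      -- list tables
#   SELECT * FROM actors;    -- confirm the data arrived from the source
```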
For this next part, I said we'd add a cron job to automate the script a little, since right now we're just triggering it manually; if we needed it to run every day or every three days, we could add that. But actually, I lied: the next evolution isn't going to be the cron job, it's going to be dbt. dbt is an open-source tool we can use to write custom transformations, or custom models, on top of the data we've just sent into the destination database. Take a use case where this destination database is primarily going to be used by data analysts or data scientists, and they need their data formatted or structured in a certain way, combining certain tables together, maybe because they can't write SQL themselves or they just need it shaped a particular way. We can use dbt to define those custom models, so that once the data is inside the destination we run dbt on top of it, creating custom tables and so on. It gives us a lot of capability; it even lets us bring Python into the mix alongside SQL. So let's see what that looks like and implement it. There are a couple of prerequisites you need in order to run dbt successfully on your machine, even when using Docker, so let's get those installed. I'm here on the dbt docs, which we'll also link in the description so you can get to them easily. Some things you do need to do locally; not everything can happen in Docker. I'm on a Mac, so most of these instructions are for Mac, but you need dbt installed locally, and the way to do that, assuming you have Python and therefore pip, is to follow the instructions there. You can skip most of the optional steps, because the main one is installing the adapter, and you're not just installing the adapter: installing dbt-postgres also installs dbt Core, the open-source version of dbt. I've already done this, but running the pip install for dbt-postgres gets you the Postgres adapter plus dbt Core. If I run dbt --version you can see I have 1.6.2 installed (I'm not going to update for the sake of this tutorial, though 1.6.6 is out) along with the Postgres adapter. Make sure you go through that installation quickly; it's the only prerequisite for this section. Now that dbt is installed, let me show you how to initialize a project. It's as simple as dbt init, run inside the root directory. It asks for the name of our project, which I'll call custom_postgres, and then which database we'd like to use; it lists postgres, obviously the only adapter we installed, so we hit 1. It then creates two things for us: if you look at the file tree, there's a logs folder, so we now have logs for anything dbt does, as well as the dbt project itself, with a handful of folders: analyses, macros, models, seeds, snapshots, and tests. We're primarily going to focus on models and macros for now; the rest you can learn as you go, but for the sake of this tutorial we'll be concentrating on those folders.
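To recap, the setup for this section boils down to something like this (the exact package versions you see will differ):

```bash
# install dbt Core plus the Postgres adapter
pip install dbt-postgres

# confirm the install and the adapter version
dbt --version

# scaffold a new project in the root directory (prompts for a name and an adapter)
dbt init
```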
There are a couple of other things we need to do to set up the project itself before we write our models. One of them is this: run nano against the profiles.yml file. When you install dbt with pip, it creates a .dbt directory in your home folder, and inside that folder is a profiles.yml file. We're going to go in there; you can see I've already set this up before, and my old dev profile was called postgres_transformations, but that's different now, so I'm going to rename it to custom_postgres, which matches the profile created for this new project. Inside the custom_postgres profile there are two outputs, prod and dev; we're going to focus on dev, and we need to fill out the connection information. For host I'm going to use host.docker.internal, which is needed for dbt to reach the Docker container we're targeting, our destination database; make sure it says host.docker.internal so it resolves to the Docker network address of the destination database container. The port is 5434, for the destination. The user we change to postgres, the password to secret, the dbname to destination_db, the schema should be public, and I forgot to fill out threads, so set that to 1. The target is dev; by default it should already point at the dev output, but make sure it does. Now that the information is filled out, dbt should be able to connect to our destination database. In nano we hit Ctrl+X to exit, say yes to saving the modified file, tab to accept the filename, and hit enter, so that should have saved it.
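After those edits, the dev output in ~/.dbt/profiles.yml should look roughly like this (the prod output is omitted here; the profile name matches the project we just created):

```yaml
custom_postgres:
  target: dev
  outputs:
    dev:
      type: postgres
      host: host.docker.internal   # lets dbt reach the Docker container
      port: 5434                   # the destination database's exposed port
      user: postgres
      password: secret
      dbname: destination_db
      schema: public
      threads: 1
```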
The other thing to do is in the dbt_project.yml file. The profile it references is the custom_postgres one we just wrote, and you don't have to change anything except the bottom section, which controls how our models are materialized. Inside the models folder you can see there's an example folder with a couple of models in it. There are a few options for the materialized setting, but because we want actual Postgres tables we change it from view to table; if you keep it as view, we won't see any of the tables we create come through, and dbt won't materialize them the way we want. With that done, we can go in and write our models. So we're inside the models folder, and inside the example folder are the sample models dbt created for us. We don't really need those since we're writing our own, so I'm going to delete both; we'll keep schema.yml for now and reference it later. You can think of models as just SQL files: we're querying the data, but reshaping it into what we need. We define it once here, and once the destination database has its data, everything we've queried, all the modifications and the new tables we've created, ends up inside the destination database; you'll see what I mean once it's all hooked up. The first thing we need to do is source the data itself, because a model can't simply reference tables in the destination database without having a reference defined. So we create our first file, film_actors.sql, for a table we're reading from the destination database; this is just creating a reference for us to build the custom models on, so it isn't really a custom model itself. The SQL is: select * from, then double curly brackets, and inside them we say the source of our data is the destination database and the table we're referencing is film_actors, with both of those wrapped in single quotes. Very cool: now we've sourced the destination database, taken the film_actors table, and we have a film_actors reference. Let me copy and paste that and do it again for the actors table in actors.sql, simply changing the name to actors, and the same for films in films.sql.
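Each of those reference models is just a one-line select wrapped around a dbt source() call. For example, film_actors.sql looks roughly like this, assuming the source is named destination_db in the sources file we set up next:

```sql
-- models/example/film_actors.sql
-- pull the raw film_actors table from the destination database source
select * from {{ source('destination_db', 'film_actors') }}
```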
So we have our references, and now we can create our custom models, but before we do there's some prerequisite work. First, the schemas: these are our custom dbt models, and we have to specify what their schema should be, so that if you need to test anything (and you should be testing), you can run dbt against all of these SQL files and get errors thrown back if something doesn't match what's declared in the schema. You should definitely always do this. I'm going to head over to the repo, into the models/example folder, where there's a schema.yml with the schema for the films table, the actors, the film_actors, and the film_ratings model we're about to build. What we're going to do is simply copy and paste that whole file, and I implore you to do the same, into our schema.yml, so now we have the schema for all the tables we just added. The other thing we need is the sources.yml, which specifies the raw data coming from the destination database, so I'm going to copy that over as well: you can see it lists films, actors, and film_actors, literally the exact tables we just referenced. We create a new file called sources.yml and paste in everything we just saw, and now we have sources to point at for the tables we're querying. If you think about this from the top down: the data is theoretically already written into the destination database, because dbt runs after we've moved the data over, so all we're doing here is referencing data we already have, so that we can modify it and create a custom model from it. Now let's go create the custom models that reshape the data we have into the table we want. The first model, inside the example folder, I'm going to name film_ratings.sql.
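A sources.yml along those lines looks roughly like this; the exact file lives in the course repo, so copy that one, but structurally it just declares the destination database's raw tables so that source('destination_db', ...) resolves:

```yaml
version: 2

sources:
  - name: destination_db          # referenced as source('destination_db', ...)
    schema: public                # database/connection come from the dbt profile
    tables:
      - name: films
      - name: actors
      - name: film_actors
```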
What we're going to do is append the films and the actors together into one table, primarily focused on the film ratings, so let's get to it. Inside this file we're actually going to build two intermediate tables and then use a join to create one, if that makes sense. The first one we'll call films_with_ratings, written as "with films_with_ratings as", followed by the SQL that builds it: select film_id, title, release_date (make sure you're adding commas), price, rating, and user_rating, all columns from the table we already have, and then something more specific: a CASE expression, which you can think of a bit like an if statement. We say when user_rating is greater than or equal to 4.5, we want the rating to read "Excellent"; when user_rating is greater than or equal to 4.0, the in-between range, it should read "Good"; when it's greater than or equal to 3.0 it's "Average"; else we label it "Poor"; and we end the CASE as rating_category. Then comes the reference: from, double curly brackets, ref('films'); we're pulling from the films reference we created, because we labeled it films, so we can reference it here inside this query. One thing I forgot to mention: that "with ... as" construct is what's referred to as a CTE, or common table expression. You can see it as a wrapper around the query, kind of like a subquery, that makes queries maintainable; it's also scoped just to this file, so it doesn't leak outside. It lets us write subqueries and then reference them later in the same file, which is a really nice part of writing dbt models. So remember: this is a CTE, a common table expression.
Now that we have our first CTE, let's write the second one; I add a comma after the first CTE so we can string on another. This one is films_with_actors as, which opens our second CTE. We select, using aliases, f.film_id and f.title, and then we do a string aggregation: string_agg of a.actor_name with a comma separator, as actors, so we're aggregating the actor names for each film. Then we hit the reference again, from double curly brackets ref('films'), aliased as f. We left join the film_actors reference, giving it fa for film actors, on f.film_id equals fa.film_id, and we do another left join, referencing actors (don't forget the parentheses and single quotes), aliased as a, on fa.actor_id equals a.actor_id. Finally we group by f.film_id and f.title. That's the second CTE, so we now technically have two queries, films_with_ratings and films_with_actors. We've taken the film ID and the title and joined the two tables we had, film_actors and actors, and as before, film_actors is the bridge to the films table. Now we write the main query that joins them into one: select fwf.* and fwa.actors, from films_with_ratings (you can see how we reference the CTE by name now), aliased as fwf, then left join films_with_actors, which we label fwa, on fwf.film_id equals fwa.film_id. So we're essentially building two tables from the references we created and then joining them together in this one custom model.
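Putting the two CTEs and the final join together, film_ratings.sql looks roughly like this (thresholds and aliases as described above):

```sql
with films_with_ratings as (
    select
        film_id,
        title,
        release_date,
        price,
        rating,
        user_rating,
        case
            when user_rating >= 4.5 then 'Excellent'
            when user_rating >= 4.0 then 'Good'
            when user_rating >= 3.0 then 'Average'
            else 'Poor'
        end as rating_category
    from {{ ref('films') }}
),

films_with_actors as (
    select
        f.film_id,
        f.title,
        string_agg(a.actor_name, ',') as actors
    from {{ ref('films') }} f
    left join {{ ref('film_actors') }} fa on f.film_id = fa.film_id
    left join {{ ref('actors') }} a on fa.actor_id = a.actor_id
    group by f.film_id, f.title
)

select
    fwf.*,
    fwa.actors
from films_with_ratings fwf
left join films_with_actors fwa on fwf.film_id = fwa.film_id
```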
If we want to run this, we now need to set up the Docker container for dbt, so let's do that and then run the custom model to see what it looks like. Back inside the docker-compose file, underneath the elt_script service, I add a service named dbt. The image I want to use is ghcr.io/dbt-labs/dbt-postgres, and the version I'm using is 1.4.7, which at the time of this recording is most likely already out of date, but it's what I use, so I'm going to run with it. For the command I'm going to pass a few arguments in square brackets: the first is run (dbt run is the command you use to run your models against the destination), then --profiles-dir pointing at /root, and --project-dir pointing at /dbt, which is essentially going to be the name of our project directory inside the container. After that, networks: elt_network, so it stays on the same network. Then we point some volumes. For the project itself (you noticed I said /dbt), we create a mapping, because on a Docker container every path refers to a directory inside the container, not your local machine, so you have to map the container path to where the project actually lives locally. The mapping is ./custom_postgres, then a colon to create the mapping to the container path, which is simply /dbt, exactly what we pointed --project-dir at. We need one more volume (remember the space after the hyphen, since that's required): we map ~/.dbt, the local dbt folder containing profiles.yml, over to /root in the container, which is where we pointed --profiles-dir. Hopefully that makes sense, because in my mind it kind of does. We also say depends_on: elt_script, because we only want dbt to run after the ELT script: when this container starts it runs dbt run against the destination database, and if the destination has no data yet, there's nothing to reference and nothing to build models on. The elt_script in turn depends on the two databases, so we've essentially created a chain of dependencies. For the environment we need two variables: DBT_PROFILE, set to default (no stray hyphen this time, since we made that mistake before), and DBT_TARGET, set to dev. That's everything we need for the dbt container.
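Putting that together, the dbt service sits under services in docker-compose and looks roughly like this (image tag and paths as described; adjust the version to whatever you're actually using):

```yaml
  dbt:
    image: ghcr.io/dbt-labs/dbt-postgres:1.4.7
    command: ["run", "--profiles-dir", "/root", "--project-dir", "/dbt"]
    networks:
      - elt_network
    volumes:
      - ./custom_postgres:/dbt    # the dbt project -> /dbt inside the container
      - ~/.dbt:/root              # local profiles.yml -> /root inside the container
    depends_on:
      - elt_script                # only run once data has been loaded
    environment:
      DBT_PROFILE: default
      DBT_TARGET: dev
```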
Now all we need to do is clear the terminal, docker compose down everything (I think the volumes are still up, but that's totally fine), and docker compose up. I had a typo in the image tag (dbt-labs/dbt-postgres), so let me fix that and run it again. The ELT script exits with code zero, which is a good sign, and then the dbt container comes up and errors with "invalid project directory, missing dbt_project.yml": the project directory mapping should have been custom_postgres with an underscore, so you can see every typo really does make a difference. Run that again with the mapping pointing at the correct project, and it almost works: now there's "unable to do partial parsing" and a runtime error while parsing a block mapping in sources.yml, so there was an error somewhere in our sources file, around lines 13 and 14 with the actors and film_actors entries; my cat may have had something to do with that, so you can see there's a lot of back and forth here. Fix that, run again, and the ELT script runs successfully; it takes a little longer, which is fine, but now dbt says the model "fims" was not found. There are a lot of typos going on: in film_ratings the reference says ref('fims') instead of ref('films'), so there's the typo. Let's docker compose down again. Also, if you're curious why I keep running docker compose down with the -v flag: I'm killing the volumes every single time on purpose, to make sure everything works even when we start completely from the beginning with no data persisting. Now we docker compose up again, and this should work unless I've made even more typos. There we go: you can see we've successfully run our models. The first three were the actors, film_actors, and films references, so those were all taken in, and the fourth is the actual model for film_ratings, which worked. Now if we docker exec -it into the destination_postgres container, run psql -U postgres, connect with \c destination_db, and list tables with \dt, you can see film_ratings, the new table we defined in dbt, has actually been written. If we select * from film_ratings (with the semicolon), it's a little wide for my terminal, so it's somewhat cut off, but everything is in there: we have our actors, our film ID, the release date, the price, and most importantly the rating category we created. The CASE statement worked: anything 4.5 and above is "Excellent", anything between 4.0 and 4.4 is "Good", and there happened to be no "Average" ones. So we've successfully created this custom model, along with the aggregated actor names, like Inception with Leonardo DiCaprio; it looks a little odd just because of how my terminal wraps, but you can see that it works. That's creating a simple model in dbt. Now let's look at some more advanced options in dbt that make your queries a little more dynamic, which is the best way to put it. Now that we've created the models, I told you we'd get into some advanced dbt, and that's why I mentioned the macros folder. You can think of macros as reusable components in dbt: if there's some query you want to reuse that's universal across pretty much all your models, you can create a macro so you don't have to rewrite the same thing every single time. Take the film_ratings file, for example: say we wanted to generate that every single time. That's probably not going to happen, but let's say we wanted to.
Now that we've created the models, let's look at some of the more advanced things you can do in dbt to make your queries a bit more dynamic — which is why I mentioned the macros folder earlier. You can think of macros as reusable components in dbt: if you have queries you want to reuse that are universal across pretty much all of your models, you can create a macro so you don't have to rewrite the same thing every single time. Let's take the film_ratings file as an example. Say we wanted to generate this model every single time — that's probably not going to happen, but let's say we did. Take the whole file and cut it. Inside the macros folder we create film_ratings_macro.sql — everything is a SQL file here. The file starts with curly brackets and percent signs, and inside that we say macro generate_film_ratings() with parentheses after it. The end of the macro also gets curly brackets and percent signs, but inside those you say endmacro as one word. Everything in between is where the macro body goes, so I'll paste what we just cut from film_ratings into the macro. Now film_ratings.sql has nothing in it, but all we have to do is add two curly brackets and call generate_film_ratings() — that's the model that actually gets written; the macro itself never gets materialized. Back in the terminal: docker compose down -v to kill the volumes, docker compose up, and let it run through everything again. dbt starts running and we still have four models being written; back in the destination database — \c destination_db and then \dt — the film_ratings table is still getting written, and select * from film_ratings shows exactly what we wrote before, even though we're only using the macro. Literally anything inside the macro file between the macro and endmacro tags gets written exactly as shown. So let's do something that's probably more plausible: a macro just for the CASE expression. Create generate_ratings_macro.sql, do the same thing with the percent-sign tags — macro generate_ratings(), then endmacro — and paste the CASE statement inside. In the model all we have to do now is call generate_ratings(), and we should get the same result. Docker compose down -v, compose up, let it run again — yes, it takes a little time every single run, but that's just what it comes with. We get an "is undefined" error on generate_ratings — looks like the name I'm calling doesn't quite match the name in the macro file, so let's fix that. So now you can see how macros work: they give you reusable queries throughout your whole dbt project that you can just plug in, instead of rewriting the same thing every time. It follows the DRY principle from software engineering — don't repeat yourself — so if you have the same queries over and over again, use macros; it's very efficient.
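A minimal sketch of that pattern, assuming the column is called user_rating (the same macro/endmacro wrapping works if you paste the entire film_ratings query inside instead):

```sql
-- macros/generate_ratings_macro.sql (sketch)
{% macro generate_ratings() %}
    CASE
        WHEN user_rating >= 4.5 THEN 'Excellent'
        WHEN user_rating >= 4.0 THEN 'Good'
        ELSE 'Average'
    END
{% endmacro %}
```

```sql
-- models/example/film_ratings.sql — the model now just calls the macro (sketch)
SELECT
    film_id,
    title,
    user_rating,
    {{ generate_ratings() }} AS rating_category
FROM {{ ref('films') }}
```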
Now let's move over to Jinja, which is another feature of dbt. Jinja is essentially what lets you add conditionals and control structures to your SQL queries. For our example, let's say we wanted to select a specific title, but using Jinja so we don't have to hard-code it inside the actual SQL query — that's what it allows us to do. It could also let you decide things like whether to use the prod target instead of the dev target inside your dbt project, and change how queries run — again, adding a control structure. So inside the example folder we create a file called specific_movie.sql, and that's where we start with our Jinja. Again it opens with curly brackets and two percent signs, and inside that we say set — when you see set, that's what starts our control structure — and we set film_title equal to 'Dunkirk', because I know that's a movie inside the test database. Then we write the actual query: select * from the films reference, where title equals, in single quotes, film_title in double curly brackets — the variable we just set. We're only selecting the row whose title matches the value we set dynamically, which right now is Dunkirk, and obviously it could be anything we want. Save that and run docker compose up; when dbt runs we should have a fifth model, specific_movie — and there it is, all passed, very good. Connect to the destination database — \c destination_db, \dt — and specific_movie is now a table in there. If I select * from specific_movie, you can see Dunkirk with all of its information: film_id, release date, price, rating, and user rating — all because we used the Jinja variable. It lets you do pretty much anything you want: if we wanted to go into the macro and create conditionals with an excellent condition, a good condition, and an average condition, each set to a specific number, we could absolutely do that, and it lets us keep that logic outside the query. That way, up at the top, we can just set it to a different movie — say Inception (pretty sure that's how you spell it; if not, this is going to be very awkward). Docker compose down -v, docker compose up, run it again, and the specific movie changes from Dunkirk to Inception, which shows the Jinja works. It also makes things a bit more readable, because film_title is set as a variable rather than being manually searched for inside the query. Select * from specific_movie, and now Inception is there. So that's a very simple example of what Jinja is and how it works — go ahead and play around with it.
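Here's a sketch of that specific_movie.sql model as dictated; swapping the value in the set line is all it takes to point the model at a different film:

```sql
-- models/example/specific_movie.sql (sketch)
{% set film_title = 'Dunkirk' %}

SELECT *
FROM {{ ref('films') }}
WHERE title = '{{ film_title }}'
```

Changing 'Dunkirk' to 'Inception' and re-running dbt rewrites the table without touching the query itself, and values set this way can just as easily be passed into macros.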
I'd also encourage you to integrate Jinja with macros to really use the full power of both together — I think it can 10x your SQL queries and your custom models within dbt, and really accelerate the whole data engineering / data analytics process. Okay, so as I mentioned before, we're going to add a cron job to this. It's not ideal, in the sense that it's not necessarily the best implementation, but a lot of people out there use cron jobs, so I'm going to show you one way to implement one — because you don't want to be manually triggering these data syncs, or this pipeline, on your own; you want some automation here. So first we'll add the cron job, and I'll show you how to do it in this kind of setup (there are obviously other ways to implement it), and this will eventually evolve into us adding a tool called Airflow. Airflow is an orchestration tool that gives us much more control than a cron job over each individual task in the project, beyond just scheduling when it runs — and I think the best thing you can have is that extra control. So let me show you how the cron job works first, and then we'll move into Airflow and other orchestration tools like it. For the cron job, the main file we need to touch is the Dockerfile, because we're primarily working with the ELT script and we want to schedule it, so that's where everything runs — it's just some slight modifications. The first thing is to add cron to the end of the list of installs, so we're installing the Postgres client as well as cron. Next we create a new file called start.sh — a bash script that sits in the root directory; we'll modify it later. Above the existing COPY command in the Dockerfile we add another one that copies start.sh into the app directory, so the container now has start.sh inside it. The working directory is /app, and in between the WORKDIR and the CMD is where we add the cron job. For this example, let's run the script every day at 3:00 a.m.
So we add a RUN command that echoes, in quotes, 0 3 and then three asterisks — and I'm sure there are sources online that show you how to write a cron expression for any specific time; this is just the one I'm using — followed by python /app/elt_script.py, and then outside the quotes we pipe that into crontab. Save that, and then start.sh is where we write how to actually start the cron daemon: at the very top we say cron, which starts cron running in the background, and then python /app/elt_script.py. So the bash script sits in our working directory, we run it when the Docker container starts, it kicks off cron in the background and runs the script, and from then on cron runs it every time it's 3:00 a.m. It's a very simple implementation of how cron is supposed to work — it's obviously hard to demonstrate, because it's 6:50 p.m. right now and I can't really sit here and wait until 3:00 a.m. — but from now on this will run at 3:00 a.m. every day. So yeah, that's essentially how this cron job works.
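Pieced together, the additions being described look roughly like this — the package-install line, the exact paths, and the existing COPY/CMD lines depend on what your ELT Dockerfile already contains, so treat these as a sketch to slot into your own file:

```dockerfile
# Dockerfile additions (sketch): install cron alongside the Postgres client
RUN apt-get update && apt-get install -y postgresql-client cron

COPY start.sh /app/start.sh

WORKDIR /app

# Register the schedule: run the ELT script every day at 3:00 a.m.
RUN echo "0 3 * * * python /app/elt_script.py" | crontab -

CMD ["bash", "start.sh"]
```

```bash
#!/bin/bash
# start.sh (sketch): start the cron daemon in the background,
# then run the ELT script once on container start-up
cron && python /app/elt_script.py
```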
Now let's move into how Airflow plays into the mix and how we can orchestrate this using a real open-source tool that's out there in the wild. All right, so in this section we're adding Airflow to the project. As I mentioned, Airflow is an open-source orchestration tool that essentially gives you a top-down view of every single task, so you can say "I want my ELT script to run first, and once it finishes successfully, run the next thing" — and that's exactly what we're going to do: orchestrate all of our tools and moving pieces so they run sequentially. There are lots of different use cases for Airflow and other open-source tools like Dagster or Prefect, but we're simply going to use Airflow as our orchestration tool so the pipeline works the way we want and has some form of automation. First, create a new folder named airflow, and inside it another folder called dags. A DAG is a directed acyclic graph — the DAG is what we write to orchestrate all of our tasks together and hold the logic for each one. In dbt we were writing custom models; in Airflow (or any orchestration tool) we write DAGs — another term to file away as a data engineer. The other thing we create, not inside the dags folder but in the root of the airflow folder, is a file called airflow.cfg. I think we just need this for Airflow specifically — we don't need to put anything in it, but if you ever need to change your Airflow configuration, this is the file it goes in. With that structure in place, let's go into the docker-compose file and add the Airflow services. This is going to be a little extensive because it's a big set of services, so we'll take our time and really go through it. Underneath the db section we're going to add quite a few new services for Airflow — it could be a lot, but I've narrowed it down to be pretty minimal. The first one is actually another Postgres service. Why another Postgres? Because Airflow needs a database to store its metadata and its logs, and this is that database. It looks the same as before: the postgres image at latest, tied into the elt_network, and for the environment we set POSTGRES_USER to airflow, POSTGRES_PASSWORD to airflow, and POSTGRES_DB to airflow — very simple. That's the metadata service; you can name it whatever you want, but for simplicity's sake I'm keeping it as postgres. The next thing we need is an init-airflow service. What this primarily does is initialize the database we just defined and create the user for Airflow. We'll add two more services after this, but this one is very important, because without it we won't have the admin user or be able to create new connections and things like that. The image is apache/airflow:latest — the Apache open-source Airflow project — and it'll be the same image for pretty much every Airflow service here. It depends_on postgres, because that database needs to be up before we initialize Airflow or we'll get errors, and it sits on the elt_network. For the environment there's one long variable, AIRFLOW__DATABASE__SQL_ALCHEMY_CONN, which points at the Postgres database we just created: postgresql+psycopg2://airflow:airflow@postgres/airflow — username and password airflow at the postgres host, database airflow. That's a long one, but that's the environment. The last thing it needs is a command, written as a YAML block with bash -c in quotes: airflow db init to initialize the database, chained with && onto airflow users create — this is where we create the admin account, with username airflow, password password, first name John, last name Doe, the role set to Admin, and an email like admin@example.com.
All right, that's a pretty long one already. Let's move down to the next service: now that Airflow gets initialized, we need the actual UI, which is the webserver service. For its image we're actually going to create a Dockerfile, so let's do that first. In the root of the project, create a Dockerfile, and in it say FROM apache/airflow:latest — the same image we used for the initialization — then RUN pip install apache-airflow-providers-docker. This will come in handy later, but we need it now: Airflow has a bunch of providers that give you access to different integrations, Docker being one of them, and there are a couple more we'll use later down the line, but Docker is the one we need. Back in the docker-compose file, for the webserver's build section we set the context to the current directory and point dockerfile at that Dockerfile (capitalized). The user is root, because we want root here, and it depends_on the postgres service — we do not want Airflow to start until the metadata store is up. Networks point to the elt_network, and then there's a special one: extra_hosts. This is an interesting one because we're running everything in Docker and we need it for reasons that come up later — in double quotes, host.docker.internal mapped to host-gateway, which points host.docker.internal at the Docker host gateway from inside the container. Then come the volumes, which are very important. First, the dags folder: ./airflow/dags maps to /opt/airflow/dags, so the same dags directory exists inside the container. Next is the ELT script, because we need access to that: ./elt maps to /opt/airflow/elt, so we have the elt folder with the script and its contents inside the container too. Then the transformations, because we're orchestrating not just the script but also dbt and we need those files: ./custom_postgres maps to /opt/dbt, so the dbt project is in there as well. We also need the dbt profiles, so ~/.dbt maps to /root/.dbt. Hopefully this is all making sense. The last one essentially gives Airflow access to Docker itself, which is going to be very important: /var/run/docker.sock mapped to /var/run/docker.sock, giving the container access to the Docker socket.
Okay, cool. Now for the environment variables, and this gets really long. The first one is LOAD_EX=n — these are all just things Airflow needs — then EXECUTOR=Local, and then the same AIRFLOW__DATABASE__SQL_ALCHEMY_CONN pointing back at the metadata database, which I'll just copy and paste over. One thing Airflow does need is a Fernet key, AIRFLOW__CORE__FERNET_KEY: it essentially encrypts usernames and passwords — it's there for security, and for the sake of this video we'll just generate one. I have a command (which I'll also leave for you in the description) that generates the Fernet key for us; paste it into the terminal, it spits out a key, and we paste that in as the value — that's what will be used for encryption. Next we need AIRFLOW__WEBSERVER__DEFAULT_USER_USERNAME, which is just airflow; copy that and change the last bit to DEFAULT_USER_PASSWORD, set to password. Then AIRFLOW_WWW_USER_USERNAME equals airflow, and the same again for AIRFLOW_WWW_USER_PASSWORD equals password. The last one is AIRFLOW__WEBSERVER__SECRET_KEY, which is just going to be secret. Cool — last bits here: ports, because we need to expose the web server; it sits on 8080, so we map 8080 to 8080, and that's how we'll access the UI. And the command we run is webserver. Now, the thing that saves us some time is that we can copy and duplicate the whole webserver section for the scheduler, because it's going to be exactly the same — I don't think we need to change anything. The scheduler is a required worker for Airflow, so we absolutely need both of these: the webserver to give us a UI to access, and the scheduler to schedule all of our jobs — and they point at the exact same things.
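Assembled from everything dictated above, the Dockerfile and the Airflow-related compose services look roughly like this. It's a sketch: the exact environment-variable names, the Fernet key placeholder, and the service layout are my reconstruction of what's being typed, and the scheduler service (not shown) ends up being a copy of webserver with the ports mapping removed and command: scheduler — a correction that comes up a little later in the walkthrough.

```dockerfile
FROM apache/airflow:latest
RUN pip install apache-airflow-providers-docker
```

```yaml
  postgres:                     # metadata database for Airflow
    image: postgres:latest
    networks:
      - elt_network
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  init-airflow:                 # one-shot job: init the DB and create the admin user
    image: apache/airflow:latest
    depends_on:
      - postgres
    networks:
      - elt_network
    environment:
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
    command: >
      bash -c "airflow db init &&
               airflow users create --username airflow --password password
               --firstname John --lastname Doe --role Admin --email admin@example.com"

  webserver:
    build:
      context: .
      dockerfile: Dockerfile
    user: root
    depends_on:
      - postgres
    networks:
      - elt_network
    extra_hosts:
      - "host.docker.internal:host-gateway"
    volumes:
      - ./airflow/dags:/opt/airflow/dags
      - ./elt:/opt/airflow/elt
      - ./custom_postgres:/opt/dbt
      - ~/.dbt:/root/.dbt
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      - LOAD_EX=n
      - EXECUTOR=Local
      - AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
      - AIRFLOW__CORE__FERNET_KEY=<paste-your-generated-fernet-key-here>
      - AIRFLOW__WEBSERVER__DEFAULT_USER_USERNAME=airflow
      - AIRFLOW__WEBSERVER__DEFAULT_USER_PASSWORD=password
      - AIRFLOW_WWW_USER_USERNAME=airflow
      - AIRFLOW_WWW_USER_PASSWORD=password
      - AIRFLOW__WEBSERVER__SECRET_KEY=secret
    ports:
      - "8080:8080"
    command: webserver
```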
Okay, so that's the setup for Airflow in docker-compose. Now let's go into the dags folder and write our first DAG to orchestrate all of our tasks. Create a new file in there called elt_dag.py — we're writing some Python now. The first things we import are datetime and timedelta from datetime, since we'll be using some time values, and then, obviously, Airflow: from airflow we import DAG, which we need in order to define the DAG itself. From docker.types we import Mount — that comes in handy later, because when you're orchestrating Docker services or containers you need some specific Docker types, and we'll also need a specific Docker operator from Airflow to actually orchestrate them. Next, from airflow.operators.python_operator we import PythonOperator — this is crucial, because our ELT script is a Python script and we need the PythonOperator to run it. We copy that line down, because from the same operators package we also grab the BashOperator, so we can interact with bash and run a couple of commands. The last one is the DockerOperator — I think the module is just docker — so we import DockerOperator from there. Oh, and we actually need one more: subprocess, silly me. Those are all the packages we need. Now some default arguments: default_args is a dictionary with a few settings — the owner of this whole project is airflow; depends_on_past set to False; email_on_failure set to False so it doesn't email on failure — these are just options built into Airflow and we're setting defaults — and email_on_retry set to False. (And these should be curly brackets, not square brackets — I messed that one up.) Now we write a function, run_elt_script, which we use to actually run the script. We need the script path, so we set a script_path variable pointing to where the script lives inside the Docker container — remember, that's the /opt/airflow/elt mount, not where it sits locally for us — so /opt/airflow/elt/elt_script.py. Then we say result = subprocess.run(...), which is essentially the command we already had in the bash script: we run python, pointed at the script path, then add capture_output=True because we want to capture the output, and text=True.
After that we add a control structure: if result.returncode doesn't equal zero, something failed, so we raise an exception saying the script failed with an error; otherwise we print the result's output. Cool. Now let's actually put this into a DAG. We say dag = DAG(...), using the DAG object we imported, and inside the parentheses goes all the information. First the name — we'll call it elt_and_dbt. Then the default arguments come into play: default_args equals the default_args we specified. The description of this DAG is "An ELT workflow with dbt" (rhyming accidentally). The start_date uses datetime — at the time of this recording it's October 28th, 2023, so that's our start date. Catchup is another one: catchup=False, we're not going to catch up. (Oh, I didn't close the parenthesis — done.) Now we get into actually writing the tasks. You can name a task whatever you want; the easiest way for me is t1 for task one — that's just how I like to do it — and we use the PythonOperator, because the first task we want is the ELT script, and since it's a Python script we need the PythonOperator. Same as when we initiated the DAG, we open parentheses: the task_id we'll label run_elt_script so we can identify it properly; python_callable equals the run_elt_script function we just created; and dag equals dag. Very simple — we've given it an identifier, attached the Python function that actually runs the ELT script, and tied it to the DAG we created. The next one is task two, which is dbt, and for dbt we've imported the DockerOperator, so let's add all the information it needs. Same thing here: a task_id, which we'll name dbt_run — simple. After the task_id we need to attach an image. We do have a dbt container in docker-compose, but we're actually going to remove it and replace it with this, using the same information that container had: ghcr.io/dbt-labs/dbt-postgres for the Postgres adapter, pinned to 1.4.7. The command we run is exactly the same as before, so you can see the similarities: we point it at the profiles directory, which is /root, and the project directory, which is /dbt, and that should be it. The next option is auto_remove — do we want to automatically remove the container once it's finished? — and we say True, because we don't want it hanging around. Then docker_url, which points directly at the socket we opened up: unix://var/run/docker.sock, so it can find the Docker daemon.
Now the network_mode, which we set to bridge — if you remember, the network driver for the Docker network we created was a bridge, so that's our mode. Then mounts, which is why we imported Mount: this is where we do the same thing you see in the dbt service, where we mounted those two volumes. We want to mount them because we're doing exactly what that service did, except now we're doing it from inside Airflow — we're creating that Docker container from scratch with the same information, so Airflow can see exactly what it's orchestrating, create it, and run it. The first Mount has a source that's the local path to my custom_postgres folder — for me that's under my user's Development directory, inside the elt folder — and this will be different depending on your setup. The target is '/dbt' in single quotes, which maps the source into the /dbt folder, and type equals 'bind'. Copy that, add a comma, and the second one is shorter: the source is my .dbt folder, because that's the profiles, and instead of /dbt the target is /root, with type still 'bind'. The last thing after the mounts is to tag it to the DAG: dag=dag. And that's both of our tasks written. The final thing is to set the order we want them to run in, so at the very bottom we write t1, two greater-than signs, t2 — meaning t1 runs before t2. Make sure to save this (it looks like there's a warning, but we don't have to worry about it). So that's writing our DAG. Let's go over it one more time: we created the default arguments Airflow needs inside the DAG; we created a Python function that correctly locates the script and runs it with some error checks; we created the DAG itself and gave it a name, the default arguments, a description, and the start time; we created the first task with the PythonOperator, gave it an identifier, pointed it at the function we created so it knows what to run, and attached it to the DAG; the second task uses the DockerOperator, essentially creating dbt in a new container and mounting everything it needs so the dbt project runs on top of the destination once the first task has finished; and then we set the order of operations. Hopefully that all makes sense — next we'll run it, show you what the web server looks like, and make sure the DAG actually works. So let's get into that.
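Putting the whole file together, here's roughly what elt_dag.py looks like, using the import paths that end up working after the debugging that follows. The host paths in the mounts are placeholders for wherever your custom_postgres project and .dbt folder live, so treat those (and the exact error text) as assumptions:

```python
# dags/elt_dag.py (sketch)
import subprocess
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # airflow.operators.python on newer versions
from airflow.providers.docker.operators.docker import DockerOperator
from docker.types import Mount

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
}


def run_elt_script():
    # Path to the script as it is mounted inside the Airflow container
    script_path = "/opt/airflow/elt/elt_script.py"
    result = subprocess.run(
        ["python", script_path], capture_output=True, text=True
    )
    if result.returncode != 0:
        raise Exception(f"Script failed with error: {result.stderr}")
    print(result.stdout)


dag = DAG(
    "elt_and_dbt",
    default_args=default_args,
    description="An ELT workflow with dbt",
    start_date=datetime(2023, 10, 28),
    catchup=False,
)

t1 = PythonOperator(
    task_id="run_elt_script",
    python_callable=run_elt_script,
    dag=dag,
)

t2 = DockerOperator(
    task_id="dbt_run",
    image="ghcr.io/dbt-labs/dbt-postgres:1.4.7",
    command=["run", "--profiles-dir", "/root", "--project-dir", "/dbt"],
    auto_remove=True,
    docker_url="unix://var/run/docker.sock",
    network_mode="bridge",
    mounts=[
        # Host paths are placeholders — point them at your own project and profiles
        Mount(source="/path/to/your/elt/custom_postgres", target="/dbt", type="bind"),
        Mount(source="/path/to/your/.dbt", target="/root", type="bind"),
    ],
    dag=dag,
)

t1 >> t2
```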
Okay, so we have our DAG. One thing to note is that the DAG now runs the script itself and spins up a new Docker container for dbt, so we can go back into the docker-compose file and comment out the dbt and elt_script containers — we don't need to create them anymore because Airflow is doing it. Comment those out and run docker compose up to bring everything up. Oh — I forgot one thing: on the scheduler service, remove the ports section, and the command should be scheduler, not webserver; my bad. Docker compose up again — a "ports must exist" complaint — there we go, just a weird little issue. So this is Airflow initializing everything; it all seems to be going to plan, the init exits with code zero, and the scheduler and web server should pop up in a second. Actually, now that everything's defined, let's bring the containers up in the right order: the first thing we have to run is the init-airflow container, because it has to go first — docker compose up init-airflow -d, with -d so it runs detached. That starts the postgres metadata database and the init-airflow container; in Docker Desktop you can see they're already running, and we need those up before anything else, because without them the web server and the scheduler won't work. Now we run docker compose up to create the rest of the containers and wait. The web server is up, so open a new window and go to localhost:8080 — remember, we said 8080 — and we're in the Airflow UI; airflow and password were the username and password we set. But you can see we have an issue: a broken DAG at /opt/airflow/dags/elt_dag.py. That's fine — I think we got the Python operator import slightly wrong, so let's debug it really quickly; nothing like a little problem, right? Looking at the airflow.operators.python import — ah, this should be operator, not operators. Kill everything with docker compose down -v, bring init-airflow back up, let it run for a second, then docker compose up. Back to the web server — airflow, password (I may have tried a little too soon) — there we go. We're still running into an error, this time with the Docker operator. I think I know what I did: looking at the Dockerfile and the pip install of the Docker provider — okay, I found the issue; I wasn't writing the package and the import path properly. The package is apache-airflow-providers-docker, and the import should actually be airflow.providers.docker.operators.docker.
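As a quick reference, this is the bring-up / tear-down loop being repeated throughout this debugging (init-airflow here is whatever you named that one-shot service in your compose file):

```sh
docker compose down -v             # tear everything down and drop the volumes
docker compose up init-airflow -d  # metadata DB + one-shot init first, detached
docker compose up                  # then the webserver, scheduler, and the rest
```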
So: docker compose down -v, the same routine again — init-airflow, wait a couple of seconds, then docker compose up — and hopefully that fixes it, because it was just another simple typo. Go back, airflow / password... still the same issue. Oh, I had it twice — another typo: this shouldn't say operators, it should say providers. Okay: compose down -v, initialize again, docker compose up again — just some simple typos screwing it all up. There we go — now that I haven't typoed anything, you can see we have the elt_and_dbt DAG here, with the owner being airflow and all the run states: failed, running, success, queued. You have a schedule, your last run, and your next run — which, as we pointed out, would be today — but we're going to run it manually. So we trigger the DAG and watch it run: one task is successful and one is still queued, so we're waiting for one more to go... and we see one has failed. Let's check why. The dbt_run task failed, so we can check the logs — and remember, we created that Postgres service for Airflow exactly for this reason: without the metadata database we wouldn't get any logs here. It's erroring out because a model in custom_postgres depends on a node named my_first_dbt_model, which was not found — which is interesting, because that's not at all what we had in mind. So it looks like the script worked but the dbt section did not, and we have to figure out why; it has to be something in here, right? The dbt run command has the profiles directory set to /root and the project directory set to /dbt. If we check Docker Desktop, go into the web server container, and look at the files, /root should contain our .dbt folder with the profiles.yml in it.
And it does — the profiles for custom_postgres have exactly the information we're looking for, so /root is fine. The other thing to check is where we put the actual dbt project: it should be at /dbt, but it isn't — it's under /opt/dbt, and I believe our models should be in there. So I think we're pointing at the wrong area — let's change it to /opt/dbt and see if that fixes everything, since we didn't set that piece up quite right. Restart it, go back to the UI, log in, trigger it. One task is running right now, which is a good sign... and it fails. Check the logs: "invalid project directory — no dbt_project.yml". That's actually a better sign, because it means we just need to adjust the path once more. I believe that when we run this now it should work — kill it, bring it back up, and run it again. The ELT task ran successfully, and we're waiting for the second one to pass — hopefully it does... and it's still failing. Why is it still pointing at the other models? I don't know how this happened, but inside the example folder there was a my_second_dbt_model.sql file that was throwing the error, so I just deleted it. Let's try again and see if that fixes it — I'm pretty sure it will; I don't even remember whether I'd deleted that file or not, and while we were testing manually it wasn't throwing an error, but now Airflow decides to throw it. And again, I'm leaving all these mistakes in — not because it looks great, but because these errors are just a part of engineering in general, and I think it's good to shine some light on that. Not every course or lesson is going to be perfectly polished, and I think it's better to leave this stuff in, because we all make mistakes, and things like this happen where one stray file throws everything off. Now let's see if I can actually get in here — refresh, and I'm in. Hit start on the DAG and see what we get: it looks like our ELT script worked, and now we're just waiting on Airflow, hoping for number two in the green... still running... and that was it — that was literally it. Now we have two in the success column: our ELT script and our dbt run both succeeded. To prove it, we can go into the destination Postgres, connect to destination_db, run \dt, and all the models and everything we ran are now in the destination. I can select * from specific_movie again and the Inception selection is still there, and select * from users and everything's in there too. So that's Airflow successfully added, with some minor hiccups — but you can see the PythonOperator and the DockerOperator in play and working. It does take a bit of work to make sure everything is moving, because yes, there are a lot of moving parts here.
But in this next section we're making it a lot easier, because we're removing the need for the ELT script entirely: we're adding another open-source tool — one I happen to work for — called Airbyte, which makes it easy to add more integrations, more data sources, and more destinations. So let's look at how that works with the orchestration and everything else all put together. All right, for this next and final section, like I mentioned, we're implementing an open-source data integration tool called Airbyte, which I just so happen to work for. Airbyte is essentially a data movement platform: you have a bunch of sources — like our source database — and destinations — like our destination database — both in Postgres right now. But imagine you also have sources that aren't Postgres: places like Salesforce, LinkedIn, or Facebook Marketing that you need data to come out of and move into your destination Postgres database, or maybe some cloud data warehouses. Airbyte is the platform that lets you easily sync all of them without managing manual scripts like the one we just wrote, which used a dump file to go from Postgres to Postgres — it would be much more complicated to do that between a different source and a different destination. So I'm going to show you how easy it is to get Airbyte into the project; it's open source, like I said, and it runs on Docker, so it's pretty easy to hook up to the rest of our application and then to Airflow as well. I'm on the Airbyte repo right now, which you can find on GitHub, and all we do, pretty simply, is copy the clone URL, go back into the terminal, stop all of the Docker containers, and git clone Airbyte into our repo. That takes a little while — it's a big open-source package — and if you want, the README on how to get started is all right there; you can use this project for free, and that's the beauty of open source. Now if we run ls you can see an airbyte directory has been created, and if we open the project in VS Code, airbyte is sitting in the file tree along with everything else. So let's hook everything up so it all works in harmony. Before we run anything with Airbyte, we have to set up a couple of things. We don't need to touch the docker-compose file anymore — that's completely set and ready to go — but in the Dockerfile, where we installed the Docker provider, we're going to install the Airbyte provider as well, plus one more: like I mentioned, there are a couple of integrations between Airflow and Airbyte, and with open source it's a little different — we use the HTTP provider as well as the Airbyte provider. So we're going to extend that RUN command: everything stays the same, but after the pip install of the Docker provider we chain on two more installs with backslashes and &&.
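Here's roughly what the updated Dockerfile ends up as — I'm assuming the same :latest base tag used earlier, so double-check it against your own file:

```dockerfile
FROM apache/airflow:latest

RUN pip install apache-airflow-providers-docker && \
    pip install apache-airflow-providers-http && \
    pip install apache-airflow-providers-airbyte
```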
So now we're installing three different packages: the Docker provider (yes, we still need that one), the HTTP provider, and the Airbyte provider — Airflow, Docker, HTTP, and Airbyte all work together here, and the HTTP provider is a dependency of the Airbyte provider. Save that. I'm also going to change up the start.sh file to make it a little easier to run the whole project together: we won't have to keep waiting for init-airflow to start, and we can control the order all of these containers come up in with one bash script. So let's do that now. Delete everything that's in there. The first thing we want is docker compose up init-airflow -d, which handles the initialization of Airflow. After that we wait — sleep for 5 seconds. Next, docker compose up -d to run everything else in the background; if we don't run it in the background we can't really keep going and watch Airbyte come up. Then sleep for another 5 seconds, and then change directories into the airbyte folder — because, although you won't see it there yet, there's a docker-compose file that needs to be downloaded, and that's exactly what happens next. So we cd into airbyte and do a little check: the way you run Airbyte is with a run-ab-platform.sh shell script, which on first run downloads all the files needed to spin up Airbyte's Docker containers. So we say if -f docker-compose.yaml — that's the file we're looking for — and if it exists we just run docker compose up -d; if it doesn't exist, we run ./run-ab-platform.sh instead, and then we're done. A very simple script.
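Written out, that start.sh might look like this — a sketch, with the sleep lengths and file name taken straight from what's dictated above:

```bash
#!/bin/bash
# Bring up the one-shot Airflow init job (and the metadata DB it depends on) first
docker compose up init-airflow -d
sleep 5

# Then the rest of our ELT/Airflow containers, detached so the script can continue
docker compose up -d
sleep 5

# Finally, spin up Airbyte. The first run uses run-ab-platform.sh to download
# Airbyte's own docker-compose.yaml; after that we can compose up directly.
cd airbyte
if [ -f docker-compose.yaml ]; then
    docker compose up -d
else
    ./run-ab-platform.sh
fi
```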
We'll also create another script called stop.sh: it runs docker compose down -v, sleeps for 5 seconds, cds into airbyte, and runs docker compose down there too. So those are the start and stop scripts that make it easier to run every container. Now let's go into our DAG and change things, because we no longer need the ELT script: we obviously still need the DockerOperator, but we don't need the PythonOperator anymore, since Airbyte is doing that work. So let's hook that up. Like I said, we never actually ended up using the BashOperator, and we don't need the PythonOperator anymore, so we can remove those — and let's remove timedelta as well, since we don't need it right now. We could also remove the run_elt_script function... actually, I'll leave that for now. Right above the default_args we create a CONN_ID variable, which is a string, and this will make sense in a moment: within Airbyte there's a connection ID and a workspace ID, which are identifiers for your specific Airbyte instance — more specifically, for the source and destination we currently have, the two Postgres databases. Connecting those two creates a connection in Airbyte, and that connector has a connection ID. That's exactly what this variable is for: to hold that ID and feed it into the DAG so we orchestrate the correct connector. Back up at the top, the first import we add is days_ago from airflow.utils.dates, so we can use Airflow's date utilities.
Then, from airflow.providers.airbyte.operators.airbyte, we import the AirbyteTriggerSyncOperator — that's the module we use to trigger syncs on the connection we create within Airbyte. Now we replace t1 with this operator — whoops, I replaced the wrong one at first; it's the PythonOperator task we're replacing. The first thing to change is the task_id, which becomes airbyte_postgres_postgres. We no longer have a python_callable; instead, the first argument we need is airbyte_conn_id, which we set to 'airbyte' — this will make more sense once we're inside Airflow, but we need to create a connection inside Airflow and we're going to name it airbyte. Then connection_id is set to the CONN_ID variable we created (don't forget the comma). Underneath that is asynchronous — did I spell that right? — if we wanted to run jobs asynchronously we'd need another operator, but we're only triggering one job, so asynchronous=False. We leave the timeout at 3600, set wait_seconds to 3, and then attach dag=dag. So it's pretty simple in terms of what it accepts; the main things to note are those two properties — the airbyte_conn_id, which we'll define within Airflow, and the connection_id, which is what really matters to us. Save that, and nothing else changes — we've successfully put Airbyte into the DAG. The one thing left is the connection ID itself, which we haven't gotten yet, and there's a little trick we'll have to do — it's kind of inefficient, but for the sake of this project it's what we've got: we may have to make another bash script, but it should all be okay. This is just what happens when you're working across multiple sets of Docker containers, since Airbyte runs in its own. So let's go do that right now.
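Before moving on, here's a sketch of the reworked task. The connection ID is a placeholder you'll paste in from the Airbyte UI shortly, and the dbt DockerOperator task plus the t1 >> t2 ordering stay exactly as before:

```python
# dags/elt_dag.py — the Airbyte pieces (sketch)
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

# Paste the connection ID shown in the Airbyte UI once the source and
# destination are hooked up — this value is just a placeholder.
CONN_ID = "<your-airbyte-connection-id>"

t1 = AirbyteTriggerSyncOperator(
    task_id="airbyte_postgres_postgres",
    airbyte_conn_id="airbyte",   # the Airflow connection we create in the UI later
    connection_id=CONN_ID,       # the Airbyte connection being triggered
    asynchronous=False,
    timeout=3600,
    wait_seconds=3,
    dag=dag,                     # the same DAG object defined earlier in the file
)
# t2 (the dbt DockerOperator) and `t1 >> t2` are unchanged from before.
```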
I'm going to create another bash script, elt.sh, and it's exactly what we did before: docker compose up init-airflow -d, sleep for 5, then docker compose up -d. The reason is that Airbyte has its own set of Docker containers running, and we have our own ELT containers we've been working with, and we want to be able to bring them down separately. For this to work, we need to grab the connection ID of the connection we make inside Airbyte, and to create that connection we obviously need to hook up the source and the destination — which we'll do next. But once we have the connection ID, we can't just save it into the DAG and have it picked up immediately: we'll have to kill the ELT containers so the new connection ID propagates, while Airbyte stays up. We'll go through that process, and at the end we'll see that Airflow can read everything we've done, and we can orchestrate both Airbyte (which replaces our ELT script) and dbt writing on top of the destination database. So let's run Airbyte now and hook up the source and destination to create our first connection. All right, we're in the UI, and all we have to do is put in our information — you can pick through the options if you want, but I'll just hit Get Started — and now we're in the Airbyte UI. Let me walk you through it really quickly, because it's very simple: there's Connections, which is where your sources and destinations show up once they're hooked together; under Sources there are a bunch of different connectors, both certified and community; and it's the same on the Destinations side. We're going to go for Postgres, and it's pretty simple — it's just logging into our databases. I'll name this source source_postgres; for the host we say host.docker.internal; the port, I believe, is 5433; the database name is source_db; the username is just postgres and the password is just secret. No SSL modes or SSH tunnels are needed for our purposes, I'll use "Detect Changes with Xmin System Column" as the update method, and if you have any JDBC URL parameters you can absolutely add them here. Hit Set up source — it essentially makes a call to the Postgres database through the API, and if everything is correct it connects; if I look at Docker we can see the connector containers spin up, which is correct — cool, all the connection tests have passed. It then sends us over to create a connection, and the first step is adding a destination, which is also going to be Postgres.
It's the same thing: name it destination_postgres, host is host.docker.internal, the port here is 5434, the database name is destination_db with the default schema, the user is postgres, and — it's a little hidden right now, but in the optional fields — we set secret as the password, because we do need that. Hit Set up destination, and cool, all the connection tests passed. Up at the top you can see the connection name — source_postgres to destination_postgres — and the replication frequency. There is an orchestrator within Airbyte, but we're switching this to Manual because we're orchestrating with Airflow; if we wanted this to just run on its own we absolutely could, and you can even set a cron schedule from in here, but for our purposes it's Manual. There's a namespace, a stream prefix to make streams easier to identify, and then all the tables coming in from the source database. I'm going to check all of them, because I want all of these streams — all of these tables — to come along. From here you choose a sync mode: do we want to fully refresh every single time, or go incremental, which we recommend — it only syncs the changes, and it does support CDC, so only the changed transactions move from the source to the destination. In our case we'll do Full refresh | Overwrite for all of them so we can do some testing, but incremental sync is probably the way to go. We also offer normalized tabular data, which does some normalization on top of the data, and that's totally fine. Set up the connection, and there we go. Once it's set up, you can see a couple of IDs at the top that are very important to us: the first is the workspace ID and the next is the connection ID, and the connection ID is the one we want. Copy it, head over into our DAG, and paste it into the CONN_ID variable. Now, remember what I said: when we launched Airflow, all of those containers were already running in their separate instances, so the change I just made hasn't propagated, and we can't start the DAG right now. So we're going to kill the ELT section first, leave Airbyte open, restart it, and then we should be able to start the sync from Airflow. Since we now have that elt.sh script, I first have to run chmod +x elt.sh, and then ./elt.sh — that runs the init first and then the rest of the Docker containers we need on the ELT side. The init is done — we can see it down there — and the rest follows. Everything's initialized and everything is running, so let's go to localhost:8080 and sign in to our instance... and we're having an issue with the DAG again. I think it's the provider — yep, I had an issue like this before, with the airflow.providers.airbyte.operators.airbyte import.
Let me double-check that that's the correct installation. For some reason this doesn't fully work here; it's the user airflow line, and I forgot to handle it. The one thing we actually need to do is remove that user tag, because we want the user to switch every time: it's a quirk where you have to be the airflow user first in order to install the correct operator, and then switch back to root in order to run the DAG. So that's what we're going to do: docker compose down -v, and then run our ELT script again after we remove that. It will pick up the airflow user first, run the command to install all the correct operators, and then switch back to the root user once they're installed, so we can actually run the DAG. Okay, that should be good; let me wait for it to run here... still having an error. All right, I got it working, and this is just a note: there are some odd issues with Docker here. If you ever run into something like this, go into the Images section; I had to look for the ELT webserver and scheduler images, because we had already built those images and they don't pick up these new commands. When I deleted the images and reran it, it actually went through and installed the correct operator. So now if we go back to the web UI, you can see the DAG is in there; we did everything correctly, it was just a matter of deleting the images in Docker first. Now, before we even run it, we need to go up into Connections and hit the plus sign. You can see there's a connection type of Airbyte, but that's not exactly what we're going to use. If you remember the connection ID, going back to the code of the DAG we had a connection ID of "airbyte", so we're going to name the connection airbyte, go down and pick HTTP, and for the host use host.docker.internal. We don't need to provide a login and password yet, but for the port we're going to point it to 8001, because that's where we can reach the Airbyte server. Hit save, and now we have this airbyte connection. If we go back to DAGs, let's go ahead and trigger this; it's triggered, and if we go back to the Airbyte connections page we should be able to refresh and hopefully see a sync start eventually. Okay, it failed instantly; it looks like it errored, so let's go back to the connection, because we probably do need to set a login and password.
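Under the hood, the operator is just hitting the Airbyte API over that HTTP connection. As a rough sketch of the call it makes, assuming Airbyte's default basic-auth credentials (airbyte / password) and the API exposed on port 8001; the exact response shape can differ between Airbyte versions:

    import requests
    from requests.auth import HTTPBasicAuth

    # host.docker.internal works from inside the Airflow containers; from your
    # own machine you would use localhost instead (an assumption about the setup).
    AIRBYTE_API = "http://host.docker.internal:8001/api/v1"
    CONNECTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

    # Ask Airbyte to run a sync for the given connection.
    resp = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": CONNECTION_ID},
        auth=HTTPBasicAuth("airbyte", "password"),  # Airbyte's default credentials
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())  # details of the sync job that was kicked off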
So back in the connection, with the host still host.docker.internal and port 8001, let's add airbyte and password as the login and password, because that's the default login Airbyte requires on its side for security reasons, and obviously that can be changed. Let me rerun this, and that should give us permission to run the sync. There we go: now you can see we have a sync running, so we've been able to trigger the actual sync using Airflow, and that should just work. If we go back to the status page we can watch it go, and once the data has loaded into the destination, the dbt container will spin up and all of our models will be written onto the destination side, on top of the data Airbyte moved. Let's wait for that to load; if we go back to the DAGs we can see it's all working now, we were able to successfully connect to Airbyte, and the task isn't going to succeed until the sync is completely done. There we go, the sync has succeeded, and you can see we have 113 records extracted. Back in Airflow we should see that task succeed, and then it runs dbt; there's the first success, and now we're waiting for dbt to run. dbt just ran successfully, so let's go back into our terminal and into our destination database: \c destination_db and then \dt. Now we have some extra stuff: you can see Airbyte actually wrote some raw data into the database, and it came along with all the other tables; notice the dbt models like specific_movie and film_ratings alongside the tables coming from the source database, so you can see that dbt actually ran. And if we do select * from specific_movie, that shows the actual Inception row and all that, along with some Airbyte metadata. So you can see it has actually worked: we have Airbyte being orchestrated by Airflow and dbt now being orchestrated by Airflow, all within this simple project, and that right there is the full project completed. Hopefully that gives you a pretty high-level understanding and example of how you can create a data pipeline from scratch and ultimately turn it into a fully-fledged open-source data pipeline, where you're sending data from a Postgres database using a tool like Airbyte to the destination database where the data needs to go, orchestrating it with Airflow, and writing custom models on top of it with dbt to then use for other sorts of data analysis. Ultimately the destination could be anything; it could end up in a CRM, or in a BI tool like Metabase (an open-source option) or Looker. Essentially this is my version of storytelling, where you can see how a data pipeline can evolve: how complicated it can be at the very beginning, and how simple it can be at the end once you start using tools from the modern data stack, like Airbyte and Airflow and many others you can bring into the stack to make life a lot easier as a data engineer. The opportunities here are endless, and hopefully this opens some doors for you to test things out and play around on your way to becoming a data engineer. Please let us know how we did, and if you have any questions or learned anything new, please let me know down in the comments.
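If you'd rather script that last verification step instead of using psql, here's a minimal Python sketch that runs the same checks, assuming the destination container publishes port 5434 on localhost and that the specific_movie model landed in the default schema (both assumptions about this particular setup):

    import psycopg2

    conn = psycopg2.connect(
        host="localhost",
        port=5434,
        dbname="destination_db",
        user="postgres",
        password="secret",
    )
    with conn, conn.cursor() as cur:
        # Rough equivalent of \dt: list the tables Airbyte and dbt wrote
        cur.execute(
            "SELECT table_schema, table_name FROM information_schema.tables "
            "WHERE table_schema NOT IN ('pg_catalog', 'information_schema') "
            "ORDER BY table_schema, table_name;"
        )
        for schema, table in cur.fetchall():
            print(f"{schema}.{table}")

        # Spot-check one of the dbt models (name taken from the walkthrough)
        cur.execute("SELECT * FROM specific_movie LIMIT 5;")
        for row in cur.fetchall():
            print(row)
    conn.close()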
I hope you all enjoyed this course, and thank you so much for watching.
Info
Channel: freeCodeCamp.org
Views: 235,043
Id: PHsC_t0j1dU
Length: 183min 43sec (11023 seconds)
Published: Tue Jan 16 2024