Code along - build an ELT Pipeline in 1 Hour (dbt, Snowflake, Airflow)

Video Statistics and Information

Captions
What's the difference between ETL and ELT? They both extract, transform, and load data, but the steps are arranged differently. Back when ETL was created, cloud storage was very expensive, so to minimize cost businesses would transform the data first and only then load it into a data warehouse. Fast forward to today: with new technologies like Snowflake, storage is much cheaper, which is why ELT was created. It now makes more sense to dump your data into a data warehouse and slice and dice it later. There are so many tools on the market for creating ELT pipelines (dbt, Snowflake, Prefect, Dagster, Spark, and more), so how do you pick the right one?

This is a live coding tutorial where I'll walk you through how to build an ELT pipeline from scratch. I'll show you every step of the way and talk through the thought process. We'll cover basic data modeling techniques such as how to build fact tables and data marts, we'll look into Snowflake RBAC, and we'll deploy our models on Airflow. For tools, we'll use dbt for transformation, Snowflake for data warehousing, and Airflow for orchestration; for orchestration, feel free to use something else you're more comfortable with. We'll use the Snowflake TPCH data set, a free sample data set provided by Snowflake. Make sure you have your Snowflake personal account ready, and let's dive right in.

For this hands-on tutorial we'll be developing locally with dbt Core, so the first thing to do is install it: run pip install dbt-core in your terminal. You can create a virtual environment if you want to, and there are several other ways to install dbt Core, such as Homebrew. Also make sure your Snowflake account is set up.

The first step is to set up our environment in Snowflake. We're going to create a warehouse, a database, and a role, and inside that database a schema where dbt will write its tables. We'll assign the role to our Snowflake user and grant it access to the warehouse and the database. Start with USE ROLE ACCOUNTADMIN; account admin is essentially the superuser in Snowflake. Create a warehouse called dbt_warehouse with warehouse size X-Small, a database called dbt_db, and a role called dbt_role, then run those statements (press Cmd+Enter to run commands in a Snowflake worksheet). Before granting anything, check the existing grants with SHOW GRANTS ON WAREHOUSE dbt_warehouse: there is only one entry, the ownership privilege held by ACCOUNTADMIN. Next, grant usage on the warehouse to dbt_role; if we run SHOW GRANTS again we can see a second entry with the usage privilege whose grantee is dbt_role, so that looks good. Then grant the role to your user (I'm using my own user here, so make sure to replace it with yours), and finally grant all privileges on the database dbt_db to dbt_role so the role has access to the database we've created.
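Here is a sketch of the Snowflake setup statements described above. The object names follow the walkthrough, and <your_user> is a placeholder to replace with your own Snowflake user:

```sql
-- Run in a Snowflake worksheet as the account admin (the superuser role).
USE ROLE ACCOUNTADMIN;

-- Warehouse, database, and role for the dbt project.
CREATE WAREHOUSE IF NOT EXISTS dbt_warehouse WITH WAREHOUSE_SIZE = 'XSMALL';
CREATE DATABASE IF NOT EXISTS dbt_db;
CREATE ROLE IF NOT EXISTS dbt_role;

-- Inspect existing grants, then let the role use the warehouse.
SHOW GRANTS ON WAREHOUSE dbt_warehouse;
GRANT USAGE ON WAREHOUSE dbt_warehouse TO ROLE dbt_role;

-- Assign the role to your user and give it access to the database.
GRANT ROLE dbt_role TO USER <your_user>;
GRANT ALL ON DATABASE dbt_db TO ROLE dbt_role;
```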
Now that we've created our warehouse, database, and role, let's switch into the role we just created with USE ROLE dbt_role and create a schema, dbt_db.dbt_schema. If we refresh the database objects in the Snowflake UI, we can see that dbt_schema exists but no objects are found yet, so everything looks good. If you have trouble rerunning these statements, you can use CREATE DATABASE IF NOT EXISTS (and the same for the role); otherwise, trying to create the role a second time will complain that it already exists. If you want to drop the warehouse later so you don't incur cost, switch back to ACCOUNTADMIN and run DROP WAREHOUSE IF EXISTS dbt_warehouse, DROP DATABASE IF EXISTS dbt_db, and DROP ROLE IF EXISTS dbt_role.

Now we're going to initialize our dbt project, and all you have to do is run dbt init. I'm calling my project data_pipeline. For the database adapter, choose snowflake. For the account, go to Snowflake, hover over your account to find the locator, copy it, and paste it in. Type in the user you created for Snowflake, and for simplicity I'll use password authentication and just type in my password. For the role, enter the one we just created, dbt_role; for the warehouse, dbt_warehouse; for the database, dbt_db; for the schema, dbt_schema; and for threads I'll use 10. Now cd into data_pipeline and open VS Code.

Let's configure the dbt_project.yml file. This file tells dbt that this is a project folder; it's the main file dbt references, and it contains a bunch of useful information such as where to find your models, your tests, your seeds, your macros, and so on. I'm going to configure two model folders: one called staging, materialized as views and using the Snowflake warehouse dbt_warehouse, and another called marts, materialized as tables and using the same warehouse. Delete the example models and create two new folders under models: staging and marts.

Let's also install a third-party library; this will be useful later when we create surrogate keys for our models. Create a new file called packages.yml and add dbt-labs/dbt_utils, a really common library in dbt. Look up the versions available for dbt_utils and use the latest one.
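A sketch of the two configuration files described above. The project name has to match whatever you chose during dbt init, and the dbt_utils version shown is only an example; pin the latest release you find on the dbt package hub:

```yaml
# dbt_project.yml (models section)
models:
  data_pipeline:
    staging:
      +materialized: view
      +snowflake_warehouse: dbt_warehouse
    marts:
      +materialized: table
      +snowflake_warehouse: dbt_warehouse
```

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```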
To install the packages, run dbt deps.

To give you a high-level tour of a dbt project: the dbt_project.yml file tells dbt where to look for your models. The models folder is where we write our SQL logic, and it's where our source definitions and staging files will live. It's really good practice to separate your staging files, which are one-to-one with your source tables, from the marts folder, where the business transformations go. All of these models will materialize in Snowflake, and the configuration at the bottom of dbt_project.yml is where we tell dbt to materialize all models in the staging folder as views and all models in the marts folder as tables. The macros folder is where you write reusable macros, and I'll show you how to do that later. The dbt_packages folder, populated by dbt deps, is where third-party libraries live. Seeds are for static files, meaning data that doesn't change very often; maybe you have a CSV you need to reference that only changes every few months, and you can put it in the seeds folder. Snapshots are useful when you're building incremental models, and then there's the tests folder. There are two types of tests in dbt: singular tests, and generic tests, which are parameterized queries that accept arguments, for example checking that a column has no null values or that its values are greater than zero. That's the high-level tour of a dbt project.

What we're going to do next is set up our source and staging tables. Go to the models folder and create tpch_sources.yml (the exact name of this file doesn't really matter). The first source comes from the TPCH data set: I'll name it tpch, and it comes from snowflake_sample_data, which is the sample database you can see in Snowflake with a bunch of schemas; we'll reference the tpch_sf1 schema. For tables, pull in the orders table. It has a bunch of columns, and I'm going to write some tests here: there's an order key, and I want to make sure this key is unique and not null, which is what I meant earlier by generic tests. The second table I want to pull in is lineitem, and its l_orderkey column is a foreign key to the orders table, so let's test that relationship with the generic relationships test, pointing to source tpch orders on the field o_orderkey. This test makes sure every value in that column is actually a key in the orders table.

Now that we have our sources defined, let's create our staging models, starting with stg_tpch_orders.sql.
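A sketch of the sources file described above, assuming the standard TPCH column names (o_orderkey on orders, l_orderkey on lineitem):

```yaml
# models/staging/tpch_sources.yml
version: 2

sources:
  - name: tpch
    database: snowflake_sample_data
    schema: tpch_sf1
    tables:
      - name: orders
        columns:
          - name: o_orderkey
            tests:
              - unique
              - not_null
      - name: lineitem
        columns:
          - name: l_orderkey
            tests:
              - relationships:
                  to: source('tpch', 'orders')
                  field: o_orderkey
```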
Let's first try to pull in data from our source to make sure everything works. The way you query a source in dbt is with the source() function wrapped in Jinja curly braces, so select from source('tpch', 'orders'). Let's do dbt run... oops, it's not working. Let's see what the problem is. Okay, the issue is that the YAML file has to be indented correctly: it wasn't indented properly, "names" was a typo that should be "name", and the relationships test needed another level of indentation. Run it again and everything looks good. With the staging folder in place, hit dbt run to run all your models: it passes, and if we go back to our Snowflake worksheet and refresh, we can see our first relation, stg_tpch_orders, created right there. Pretty cool.

What we want to do in the staging model is rename some of these columns: o_orderkey as order_key, o_custkey as customer_key, o_orderstatus as status_code, o_totalprice as total_price, and o_orderdate as order_date.

Let's do the same thing for the line items: create stg_tpch_line_items.sql, pull from the lineitem source, and copy in a bunch of the columns. This time I'm also going to create a surrogate key using dbt_utils. A surrogate key is useful in dimensional modeling, when you have a bunch of fact tables and dimension tables that you want to connect. I'll build it from both l_orderkey and l_linenumber and name it order_item_key; think of it as a hash. To run this model only, use dbt run --select stg_tpch_line_items (or -s for short). It throws a deprecation warning saying this function has been replaced by a new name (surrogate_key is now generate_surrogate_key), so rename it and run again. Perfect, it works.

The next step is to transform our models. When you work for a company you usually have to do some kind of business transformation on top of these staging tables, which are one-to-one with the source tables; here we're going to aggregate some data from the line items and create a fact table as a result. So what is a fact table? In the terms of the Data Warehouse Toolkit, a fact table is a dimensional modeling construct that stores the results of a business process: it contains numeric measures produced by operational measurement events in the real world. You can think of a fact table as a table that represents a bunch of numeric events and is connected to other tables called dimension tables, and a fact table always contains foreign keys for each of its associated dimensions, which is why we created our surrogate key.

Let's go back and build an intermediate model, int_order_items.sql, that references the files we just created using the ref function: stg_tpch_orders aliased as orders, joined with stg_tpch_line_items as line_item on orders.order_key = line_item.order_key. I'll copy over a bunch of the columns and order the result by orders.order_date.
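The two staging models might look roughly like this; the exact column list for line items is an assumption, since the video just copies a handful of columns from the source:

```sql
-- models/staging/stg_tpch_orders.sql
select
    o_orderkey as order_key,
    o_custkey as customer_key,
    o_orderstatus as status_code,
    o_totalprice as total_price,
    o_orderdate as order_date
from {{ source('tpch', 'orders') }}
```

```sql
-- models/staging/stg_tpch_line_items.sql
select
    -- hash of order key + line number, referenced later by the fact table
    {{ dbt_utils.generate_surrogate_key(['l_orderkey', 'l_linenumber']) }} as order_item_key,
    l_orderkey as order_key,
    l_partkey as part_key,
    l_linenumber as line_number,
    l_quantity as quantity,
    l_extendedprice as extended_price,
    l_discount as discount_percentage,
    l_tax as tax_rate
from {{ source('tpch', 'lineitem') }}
```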
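And a sketch of the intermediate join model; the column selection is again an assumption, and the discount macro gets added to it in the next step:

```sql
-- models/marts/int_order_items.sql
select
    line_item.order_item_key,
    orders.order_key,
    orders.customer_key,
    orders.order_date,
    line_item.part_key,
    line_item.quantity,
    line_item.extended_price,
    line_item.discount_percentage
from {{ ref('stg_tpch_orders') }} as orders
join {{ ref('stg_tpch_line_items') }} as line_item
    on orders.order_key = line_item.order_key
order by orders.order_date
```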
Let's see if this works: dbt run. There's a syntax error near orders.order_key, so remove the stray comma, then run the orders model and the int_order_items model again, and everything works.

What I'm going to do next is create a macro. Macro functions are a good way to reuse business logic across multiple models. Create a file called pricing.sql in the macros folder; you can Google "dbt macros" and copy the example from the docs as a starting point. I'll rename it discounted_amount and give it two inputs, extended_price and discount_percentage, and the macro body is where we typically write our business logic. Let's reuse this macro in the marts folder: in int_order_items, call discounted_amount on line_item.extended_price and line_item.discount_percentage to get item_discount_amount. Run dbt run -s int_order_items and it works. By the way, I'm also going to add one more column here, extended_price.

Let's create another intermediate file, int_order_items_summary.sql. I'll do this really quickly: it's a simple GROUP BY over the intermediate model we just created, aggregating the extended price and the discount amount. And finally, let's create the fact model, fct_orders.sql. It selects star from ref('stg_tpch_orders') as orders and joins it with ref('int_order_items_summary') as order_item_summary on the order key, ordered by the order date; I take everything from the orders table and combine it with the order item summary. Run dbt run, and it works, so we have our fact table. Back in Snowflake, click refresh, and under tables you can see fct_orders, connected to the other models through the order key; the order items model plays the role of our dimensional model here. This is a huge oversimplification, and there's a lot more to fact tables and dimension tables, but we're not going to cover that in this video.

Now let's write some test code. There are two types of tests in dbt: singular tests, where you write SQL queries that return failing rows, and generic tests. Let's write some generic tests first. In the models folder I'll create a file I'm calling generic_tests.yml (you can name it whatever you want). There are several built-in generic tests, such as unique, not_null, relationships, and accepted_values. On fct_orders, let's test the foreign-key relationship of order_key back to stg_tpch_orders on the field order_key, with severity set to warn, and let's add an accepted_values test on status_code saying that I only want to accept the values 'P', 'O', and 'F' (make sure to indent these values correctly). So this piece of configuration is telling dbt that these are the only acceptable values for status_code, the relationships test is checking the foreign keys, and order_key must be unique and not null. Hit dbt test: nice, it completes successfully.

Now let's build some singular tests. Create fct_orders_discount.sql and write a test on the item discount amount. A singular test returns the rows that violate your expectation, and because the pricing macro stores the discount as a negative amount (it's money taken off the order), any row where item_discount_amount is positive would be wrong; so the test selects from fct_orders where item_discount_amount > 0. Run dbt test and it passes. What happens if you flip the condition to less than zero? It fails, because the query now returns rows: about 1.4 million values didn't meet the expectation. Set it back to greater than zero and it passes again. I'm going to write one more singular test, fct_orders_date_valid.sql: select star from ref('fct_orders') where the order date, cast as a date, is later than the current date or older than 1990, which is a long time ago. In other words, this test makes sure the values fall within an acceptable range. Run dbt test and everything passes again.

To summarize everything we've built so far: we created our source definitions; we have a bunch of staging models that reference the sources, and for the line items staging model we generated a surrogate key so it can be referenced later by downstream fact tables; we have marts models that apply a bunch of transformations, and those transformations use the macro we wrote; and finally we have a fact table that references the dimensional models.
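The macro might look like the sketch below. The exact expression isn't spelled out in the video, so treat the -1 convention and the decimal cast as assumptions; they are at least consistent with the discount test above, which expects the stored amounts to be negative:

```sql
-- macros/pricing.sql
{% macro discounted_amount(extended_price, discount_percentage, scale=2) %}
    (-1 * {{ extended_price }} * {{ discount_percentage }})::decimal(16, {{ scale }})
{% endmacro %}
```

Inside models/marts/int_order_items.sql, the call then becomes one extra line in the select list:

```sql
    {{ discounted_amount('line_item.extended_price', 'line_item.discount_percentage') }} as item_discount_amount,
```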
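Sketches of the summary model and the fact model; the aggregate name gross_item_sales_amount is an assumption, so adjust it to whatever you actually want to report:

```sql
-- models/marts/int_order_items_summary.sql
select
    order_key,
    sum(extended_price) as gross_item_sales_amount,
    sum(item_discount_amount) as item_discount_amount
from {{ ref('int_order_items') }}
group by order_key
```

```sql
-- models/marts/fct_orders.sql
select
    orders.*,
    order_item_summary.gross_item_sales_amount,
    order_item_summary.item_discount_amount
from {{ ref('stg_tpch_orders') }} as orders
join {{ ref('int_order_items_summary') }} as order_item_summary
    on orders.order_key = order_item_summary.order_key
order by order_date
```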
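The generic tests file, roughly:

```yaml
# models/marts/generic_tests.yml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_key
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('stg_tpch_orders')
              field: order_key
              severity: warn
      - name: status_code
        tests:
          - accepted_values:
              values: ['P', 'O', 'F']
```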
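And the two singular tests; each query should return zero rows when the data is healthy:

```sql
-- tests/fct_orders_discount.sql
-- Discount amounts are stored as negative values by the pricing macro,
-- so any positive item_discount_amount is a failing row.
select *
from {{ ref('fct_orders') }}
where item_discount_amount > 0
```

```sql
-- tests/fct_orders_date_valid.sql
-- Orders dated in the future or before 1990 are outside the acceptable range.
select *
from {{ ref('fct_orders') }}
where order_date::date > current_date()
   or order_date::date < '1990-01-01'::date
```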
We've done a lot so far, so now let's actually deploy this using Airflow. I'm going to use Airflow to run this dbt project as a DAG, but you can choose your own orchestration platform, whether that's Prefect or Dagster; you can even try to run this on AWS if you prefer. Prefect and Dagster are great alternatives that improve on Airflow, but let's stick with Airflow here, together with a library called astronomer-cosmos, which lets us run dbt Core projects as Airflow DAGs and task groups. It's really simple to install: go to your terminal and run brew install astro (I already have it installed, so for me this will probably just update the Astro CLI).

Now let's initialize a new Airflow project: make a directory, which I'll call dbt-dag, cd into it, and run astro dev init to scaffold a new Astro project. Let's take a look at the generated code. Open the Dockerfile and add the dbt installation command at the bottom; this installs dbt-snowflake inside the image. We're not done yet: we also need to add two lines to the requirements.txt file, astronomer-cosmos and apache-airflow-providers-snowflake. Then run astro dev start, and Airflow spins up. Go to localhost:8080; the username is admin and the password is admin.

In order to run dbt on Airflow, copy the data_pipeline dbt project into the dags folder, so it lives at dags/dbt/data_pipeline. Then write the DAG itself: import a few libraries (datetime, os, and the Cosmos classes), set the Snowflake connection to snowflake_conn, and in the profile arguments point at the database and schema we created in Snowflake. For the DbtDag itself, make sure to point it at the right project directory, give it the dag_id dbt_dag, and schedule it for a daily run.
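The Dockerfile addition follows the pattern from the astronomer-cosmos docs, installing dbt-snowflake into its own virtual environment inside the image (the FROM line stays whatever astro dev init generated for you):

```Dockerfile
# Dockerfile (appended below the generated FROM line)
RUN python -m venv dbt_venv && source dbt_venv/bin/activate && \
    pip install --no-cache-dir dbt-snowflake && deactivate
```

And the two extra lines in requirements.txt:

```text
astronomer-cosmos
apache-airflow-providers-snowflake
```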
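A sketch of the DAG file based on the astronomer-cosmos quickstart pattern; the file name, paths, connection id, and start date are assumptions to adapt to your own setup:

```python
# dags/dbt_dag.py
import os
from datetime import datetime

from cosmos import DbtDag, ProjectConfig, ProfileConfig, ExecutionConfig
from cosmos.profiles import SnowflakeUserPasswordProfileMapping

# Build the dbt profile from the Airflow connection "snowflake_conn",
# targeting the database and schema created earlier in Snowflake.
profile_config = ProfileConfig(
    profile_name="default",
    target_name="dev",
    profile_mapping=SnowflakeUserPasswordProfileMapping(
        conn_id="snowflake_conn",
        profile_args={"database": "dbt_db", "schema": "dbt_schema"},
    ),
)

dbt_snowflake_dag = DbtDag(
    # Point Cosmos at the dbt project copied into the dags folder.
    project_config=ProjectConfig("/usr/local/airflow/dags/dbt/data_pipeline"),
    operator_args={"install_deps": True},
    profile_config=profile_config,
    # Use the dbt executable installed by the Dockerfile's virtualenv.
    execution_config=ExecutionConfig(
        dbt_executable_path=f"{os.environ['AIRFLOW_HOME']}/dbt_venv/bin/dbt",
    ),
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    dag_id="dbt_dag",
)
```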
Everything here looks quite good, so let's go back to Airflow, where I can see dbt_dag. There's one more thing left to do: go to Admin, then Connections, and click to add a connection. Call it snowflake_conn, set the connection type to Snowflake, and fill in your Snowflake account, the warehouse dbt_warehouse, the database dbt_db, and the role dbt_role. Double-check that everything looks good and hit save.

Back in the DAGs list, open dbt_dag. Before we run it, notice how Cosmos has rendered a graph for us showing how the fct_orders table is constructed: the two staging models we created feed into one intermediate model, that feeds into another intermediate model, and that feeds into fct_orders; you can even see your test code right there. Trigger the DAG and let's see if this works... it failed. Let's take a look at why: go to the logs, failed to execute, okay, I know what it is. Go back to Admin, Connections: I need to type in my Snowflake username and password, so I'll add those to the connection and click save. Try again, and this time it looks like it's working; you can see it in the user interface, and it ran successfully. If we open a task run and go to its logs, you can see the dbt code actually running. (As an aside, XCom is pretty difficult to use in my opinion, and there's a limit to how much data you can pass through XCom.)

Awesome. If you made it this far in the tutorial, I hope you learned a lot; we've made a lot of progress. Let's wrap up. Thank you for watching, and I genuinely hope you got a lot of value out of this hands-on tutorial. To summarize what we learned today: we set up our Snowflake environment with warehouses, roles, users, tables, schemas, and databases; we learned how to connect dbt Core with Snowflake; inside dbt we built models in our staging and marts folders to keep our models organized; after building the models we learned how to use macros to template code and reuse business logic; we learned the difference between generic tests and singular tests in dbt; and finally we orchestrated the dbt project inside Airflow. Let me know in the comments what kind of videos you want to see next. As usual, thank you so much for watching, and I'll see you next time. Peace.
Info
Channel: jayzern
Views: 2,457
Id: OLXkGB7krGo
Length: 36min 34sec (2194 seconds)
Published: Mon Feb 26 2024