Orchestrating Data Pipelines with Snowpark, dbt Python Models, and Airflow

Video Statistics and Information

Captions
Good afternoon, everyone. I believe this is the last session for today, so hopefully we've saved the best for last. Today I'm going to be presenting on orchestrating data pipelines with Snowpark, dbt Python models, and Airflow. As mentioned, my name is Adrian, and I'm a Solutions Engineer based out of Singapore, very far from here. In fact, I've only just gotten accustomed to the time zone here and I fly back today, so I'll have to get used to Singapore's time zone all over again, which is not so good for my sleep cycle. But you know, it's Vegas, right?

Today we'll walk through a technical architecture, go over the different components of what I've built, look at the technical aspects, do a quick demonstration, and then finish with a quick summary of what we've achieved. For the overview, we'll look at each individual component, and I'll explain why I chose each particular tool.

Let's start with Airflow. For the folks in the audience who aren't technical or who need some familiarization, Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. Why Airflow, and why has it been so popular? Because it's a very easy-to-use orchestration tool for analytical workflows. It also gives you easy-to-use DAGs (for folks who don't know the term, that's directed acyclic graphs), which provide a clear picture of your dependencies. And best of all, it's written in Python, so you can write out your whole DAG in code, which I'll walk through later so you can understand how the DAG is built up.
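To make the idea that a DAG is just Python concrete, here is a minimal sketch of an Airflow DAG with two tasks and one dependency; the DAG id, schedule, and task bodies are hypothetical placeholders rather than code from the demo project.

    # A minimal, hypothetical Airflow DAG: two tasks with an explicit dependency.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pretend to pull some data")   # placeholder task body

    def transform():
        print("pretend to transform it")     # placeholder task body

    with DAG(
        dag_id="example_minimal_dag",        # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t1 >> t2   # ">>" is how the dependency graph is declared

The >> operator between the two tasks is all it takes to declare the dependency graph that Airflow then schedules and monitors.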
Another thing I want to talk about is dbt: what is dbt, and why dbt? dbt has been gaining quite a lot of traction, and I've been using it for a while. dbt stands for data build tool, for the folks in the room who aren't familiar with it, and it performs the transformation step of the ELT process. Notice that I specifically say ELT here, because with the rise of cloud data warehouses, especially the likes of Snowflake, you can push all of the transformations down to Snowflake and run them there natively. dbt really helps with building, testing, and maintaining that data infrastructure: it takes your code and compiles it into SQL to run against Snowflake.

So why have so many data engineers and analytics engineers been using dbt? If you attended the session from Flipside Crypto this morning, they also explained how modular it is to build out a dbt pipeline. To summarize: with dbt you write purely repeatable code, written purely in SQL (up until recently, which I'll get to in a bit), and macros, code switches, and configurations are easily achievable through Jinja templates.

For the folks in the room, this is the outline of how dbt runs. First, you write a dbt model. A dbt model is a single SQL file, and inside its select statement you can reference, for example, a table name or a source name; that model_a.sql will translate into a table. dbt code is essentially SQL plus Jinja, and Jinja, as mentioned before, is a templating engine written in Python.

The next piece is what we call the dbt models layer, where dbt models can reference one another. For example, instead of writing one giant SQL statement with a subquery inside a subquery inside a subquery, I can create model_b.sql and, inside it, simply select star from a reference to model_a. This makes it very easy for me to track the different transformations or workflows I'm writing across different SQL files. The best part about the models layer is that it creates those natural dependencies: the models execute in the right order on their own, and that makes it really flexible to build up data pipelines.

The next part is how you actually run it. dbt has a command called dbt run, and when you run it, it packages up all the models and references you've built, creates a DAG, and executes the whole thing against your data warehouse. On top of that, dbt gives you parallel execution.

Now, a bit about the state of dbt. dbt is still primarily SQL-based, but there has been a shift: since the release of dbt Core v1.3 it has supported Python models (the latest release right now is 1.5), and Python models natively support the Snowpark integration. Later on we'll look at some of the Python model code so you can see how the Snowpark code is written, and when I run the DAG on Airflow you'll see the push-down from dbt into Snowflake via Snowpark.
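To make that shift from SQL models to Python models concrete, here is a minimal sketch of the shape a dbt Python model takes. The model(dbt, session) entry point, dbt.config, and dbt.source are the standard dbt Python-model conventions; the source name and column names are hypothetical and not taken from the demo project.

    # models/staging/stg_customers.py -- hypothetical staging model
    # dbt calls model(dbt, session); with the dbt-snowflake adapter, `session`
    # is a Snowpark session and the function must return a Snowpark DataFrame.
    def model(dbt, session):
        dbt.config(materialized="table")
        raw = dbt.source("raw_data", "customers")   # hypothetical source name
        return raw.select("CUSTOMER_ID", "FIRST_NAME", "LAST_NAME")

Because the return value is a Snowpark DataFrame, dbt can hand the whole transformation to Snowflake to execute rather than pulling data back to the client.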
Just to give everyone in the room an overview, I put together a heat map of what you can do in terms of data engineering in the Data Cloud with Snowflake. From the left you have the different data ingestion techniques: of course we announced Iceberg, and there are the different connectors, batch loading, Snowpipe Streaming, the Kafka connector, and schema detection and evolution. For data transformation, everyone has been hearing about Snowpark, which is what you use to run Python on top of Snowflake, and you also have things like streams and tasks, as well as dynamic tables, which Steven was just talking about and which give you CDC capability within Snowflake. Layered on top of that are the different capabilities for observability and the pipeline experience. Today, though, we'll mainly be talking about the Snowpark area.

In case you haven't heard it enough already, as an overview, Snowpark allows you to run your Python, Scala, and Java code on the server side instead of the client side. That makes things really easy, because you get a natural push-down into Snowflake: it takes the dependency off your laptop or your individual servers, and you can simply offload the computation to Snowflake.

Naturally there are reasons to use Snowpark for Python, and I'm only going to give three. First, it streamlines your architecture: with the integration of dbt Python models, you can collaborate on and run all of those Python models within Snowflake and dbt. Second, you can build scalable, optimized pipelines using Snowpark for Python, and within Snowflake, features such as zero-copy cloning let you create separate dev and UAT environments, so it acts as a single platform supporting your pipelines across dev and UAT. Lastly, it helps enforce consistent security, because everything stays within the Data Cloud, with controls and governance across all the different flows running on the platform.

As an architecture overview: Snowpark for Python is GA now, and to access it you can write Snowpark code from VS Code or use Python worksheets. One benefit is that, from a capacity management perspective, offloading execution to Snowflake reduces that overhead and lets you collaborate on the same data.

Now let me get to the more interesting part: using dbt's Snowflake Python models. This leverages the native support in Snowpark for Python, which includes capabilities such as the Python DataFrame API, scalar and batch UDFs, stored procedures, the Anaconda integration, and UDTFs. dbt's Python model support builds on top of this and relies on the Snowpark API for the push-down.
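Outside of dbt, this is roughly what running Python on the server side looks like with the Snowpark DataFrame API; the connection parameters and table names below are placeholders, not values from the demo environment.

    # Standalone Snowpark-for-Python sketch: build a session, express a
    # transformation as a DataFrame, and let Snowflake run it server-side.
    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import col

    connection_parameters = {                 # placeholder credentials
        "account": "<account_identifier>",
        "user": "<user>",
        "password": "<password>",
        "warehouse": "<warehouse>",
        "database": "<database>",
        "schema": "<schema>",
    }
    session = Session.builder.configs(connection_parameters).create()

    orders = session.table("RAW_ORDERS")      # hypothetical table name
    recent = orders.filter(col("ORDER_DATE") >= "2023-01-01")
    # Nothing has executed client-side yet; this action triggers SQL in Snowflake.
    recent.write.save_as_table("RECENT_ORDERS", mode="overwrite")

Nothing executes on the client while the DataFrame is being built; the SQL is generated and run inside Snowflake when an action such as save_as_table is called, which is what makes the push-down possible.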
Now for the technical architecture, before we take a look at the code and I do a quick demo. The setup is pretty simple: I've dockerized all of my environments. Airflow is running in a container, and dbt is running in a container as well. Airflow triggers the dbt container, dbt deploys the run, loads the CSV data into our Snowflake tables, and then performs the analysis on that data; dbt talks to Snowflake through Snowpark and performs the push-down into Snowflake.

Let's hover over to the code and take a quick look at what's happening. For folks who have been using dbt, this is the usual dbt project folder that you create, and this is a normal dbt model file where, for example, I configure the model to be materialized as a table and use the source data. This is the dbt Snowpark Python model push-down I was talking about, and each Python model file will map to one table later on.

In this case I'm loading some customer data, along with orders and payments data, and then doing some analysis across those areas. Looking at the code, in the customers model I'm selecting from the DataFrame and just keeping, for example, first name and last name; the orders and payments models are similar. Then I have a final model, a customer data table, which references all of those dbt models I just created, performs some group bys, order bys, and joins, and produces the final DataFrame. It's very standard Python code using Snowpark for Python, but in the back end, as mentioned before, dbt is going to compile all of this and perform the push-down into Snowflake, and each of these files will create an individual table inside my schema later.
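As a rough sketch of what that final model could look like as a dbt Python model (the staging model names, columns, and aggregations here are hypothetical stand-ins, not the exact code from the demo project):

    # models/marts/customers.py -- hypothetical final model combining staging refs
    import snowflake.snowpark.functions as F

    def model(dbt, session):
        dbt.config(materialized="table")
        customers = dbt.ref("stg_customers")   # hypothetical staging model names
        orders = dbt.ref("stg_orders")
        payments = dbt.ref("stg_payments")

        # Aggregate payments per order, then per customer, and join the result
        # back onto the customer dimension, all as Snowpark DataFrame operations.
        order_totals = payments.group_by("ORDER_ID").agg(
            F.sum("AMOUNT").alias("ORDER_AMOUNT")
        )
        customer_orders = (
            orders.join(order_totals, "ORDER_ID", "left")
            .group_by("CUSTOMER_ID")
            .agg(
                F.count("ORDER_ID").alias("NUMBER_OF_ORDERS"),
                F.sum("ORDER_AMOUNT").alias("LIFETIME_VALUE"),
            )
        )
        return customers.join(customer_orders, "CUSTOMER_ID", "left")

Even though this reads like ordinary DataFrame code, none of the joins or aggregations run where dbt runs; they are compiled into SQL and executed inside Snowflake.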
So that covers the dbt portion and how Snowpark for Python works with the dbt models. In terms of the Airflow component, it runs out of a Docker Compose file. It's a fairly standard Airflow setup, but the key difference is that on top of it I installed the dbt Snowflake and Snowpark modules as well as dbt Core, and all of that is packaged in the Dockerfile and run inside the Docker Compose environment.

Now let's hover over to our Airflow environment. For folks who haven't seen it, this is the Airflow UI where you can trigger your DAGs, and I'm just going to trigger a DAG to run. When I trigger the DAG, you can see that in the background dbt really starts to run; it's now in the running state. So let's take a look at what's happening inside the Snowflake environment. I'm going over to my query history to take a quick look while it runs. You can see that the dbt models are running; it's still compiling and executing, so give it a minute.

Okay, now you can see that the code we wrote is being created and run as a stored procedure. Of particular interest, that whole block of Python code I wrote is stored as a stored procedure, with the code passed into a Snowpark Python model to run. With this stored procedure, dbt in the background creates the procedure and then adds the workflow that creates the different tables in Snowflake. From here you can also see, right inside the console, the tracking of the different dbt models running against Snowflake.

Let's go over to our demo schema, where my tables are being materialized. We can see the data being materialized and created here: this is the first dbt model I created, as a transient table, and here I have my staging customers, staging orders, and staging payments. Notice that each of these maps exactly to one of the model files I showed earlier. In the background, each one is materialized as a table: getting the data, creating the DataFrame, and writing it into a table here. You can see the customer data table is created the same way; it performs the same analysis I was talking about just now, aggregating all of that information together and pushing it down to Snowflake to run.

So within a few lines of code we saw how to create a very simple DAG and this whole pipeline to do transformations within Snowflake, and how we can easily use Airflow to orchestrate all of it. Think about it: if you have different data sources, you can use Airflow to extract the data, perform the transformations, push them down to Snowflake, and run the DAGs there as well.

Going back very quickly to Airflow: it has finished running and everything looks good. Yes, all done. We can see the results: everything has been created successfully. And here's the cool part: in the back end, Snowpark generates that whole SQL statement for me without me having to do anything. I literally just write the DataFrame code and join the data, and in the background it does the joining for me.

Just to summarize, we covered a few parts. One aspect is the Airflow component: how Airflow acts as the orchestrator and triggers the dbt container, and how, when dbt runs, it loads the CSV data and performs the analytics engineering on top with Snowpark.

One more thing that I forgot to show earlier: let me scroll down so you can see the Airflow DAG itself. If we take a look at it, this is a very simple Airflow DAG that I created. It uses the dbt project I showed just now, runs with the dbt profiles, and here I'm pulling in my dbt user and password for it to run. The whole Snowpark dbt Airflow DAG that you saw being triggered is just this code here. I forgot to show that earlier, but I wanted to come back to it in case you were wondering what happens in the background.
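That DAG really can stay very small. A hedged sketch of the shape described here might look like the following; the DAG id, paths, and environment variable names are hypothetical, and the actual demo may reach the dbt container through a different operator.

    # A hypothetical Airflow DAG that shells out to dbt; credentials are read
    # from environment variables and passed through for the dbt profile to use.
    import os
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="dbt_snowpark_pipeline",      # hypothetical DAG id
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,              # triggered manually, as in the demo
        catchup=False,
    ) as dag:
        dbt_run = BashOperator(
            task_id="dbt_run",
            bash_command="dbt run --profiles-dir /opt/dbt --project-dir /opt/dbt/project",
            env={
                "DBT_USER": os.environ.get("DBT_USER", ""),
                "DBT_PASSWORD": os.environ.get("DBT_PASSWORD", ""),
            },
        )

The credentials are only read from the environment at run time and handed to dbt, which keeps them out of the DAG code itself.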
So, in totality, that brings us toward the end of my presentation. I've been seeing a lot of different customers using different pipelines to run their workflows. Traditionally, and even now, a lot of them are still using Airflow, and a rising number are using dbt, so naturally many of them are also thinking about how to combine these with Snowpark to do all of their transformations on Snowflake. That brings me to the end of my session. I hope you enjoyed it, and thank you very much. [Applause]
Info
Channel: Snowflake Developers
Views: 5,843
Id: bnRZL64airk
Length: 17min 12sec (1032 seconds)
Published: Tue Aug 29 2023