What Is DBT and Why Is It So Popular - Intro To Data Infrastructure Part 3

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
data build tool or dbt as it is more commonly known perhaps you've heard of it perhaps you haven't if you've been sleeping under iraq for the last two years but as of recently it ended up raising 222 million dollars in funding to build out a tool that is based on well sql and jinja yes a sql jinja based tool is almost worth at least according to vcs and investors 4.2 billion dollars now we have all learned recently that current funding rounds from vcs is more similar to that of enron in terms of the fact that you know you can sometimes just put valuations based on possible fantasy earnings versus true earnings right now i do think dbt is a slightly different story as it does have a pretty large user base that all enjoy the product but why what is it about dbt that has given it such a strong following in order to understand what dbt is as well as why i believe it's gained a lot of popularity let's take a quick step back and look at least 10 years ago if not longer on how people were building data pipelines in the past now if you're looking at my pass about a decade ago when i first started when i first started with data pipelines i was using ssis and i wouldn't have even called them data pipelines i was just creating automated jobs that's all it was to me no one even used the term data pipeline it was just a way to get data from point a to point b that's all i kind of knew you would often connect to some sort of data source do some sort of transforms inside of ssis i would watch wise owl tutorials all day on how to transform my data use some c-sharp in there occasionally to do more complex um transforms maybe write some sql queries and then inject them in uh to do some sort of model development of like tables of like dimension tables and fact tables and maybe just put that in the sql there once i had developed the jobs in ssas i take sql server agent and basically set the time of when i want this job to run and if you used sap or oracle there were similar tools that they basically offered that essentially did the same thing that would pull data from sources do some sort of transforms push it to your data warehouse and then you could schedule it somehow still other people would then do things like you stored procedures or sql files and and configure it with powershell or bash or something similar to that and set up cron or windows jobs actually i think i might say windows task scheduler the point being that people would have 10 different ways and 30 different tools on how to schedule essentially sql queries even if you worked at facebook what i would often tell people is really what you're doing is writing parameterized pipelines uh in python that would just basically set up configuration in the main point of your pipeline would be sql that would do the transform that would create your table on the other side and that's really all it was so what you'll notice is all of this revolves around sql but often the framework around it was often different and often you have to be a specialist in this tool at every step of the way there was no way to just know sql to run your job you'd have to know sql you have no python you'd have to know something like informatica or ssas or something similar azure data factories another popular one these days so all this increased the technical barrier for people to actually write data pipelines lo and behold dbt kind of came in and kind of shook things up a little bit in a lot of different ways first let's talk about how it's kind of been a major player in the unbundling of the data pipeline first it kind of helped unbundle airflow so to speak that is to say if you were an airflow user although technically airflow is an orchestration tool you probably used it for everything if you were doing data pipelines you used it to directly connect to data sources whether that's salesforce asana some sort of my sequel connection whatever you then probably have your sql in some sort of like bigquery operator that would then run the transform on bigquery and you might even have some data quality checks all in this one pipeline but with the rise of tools like five tran airbyte multano the whole connector game kind of got shook up and similarly dbt has shook up the transform layer or the t layer okay but what is dbt at the end of the day when you look at kind of what dbt is it's a tool that does more than again just sql and jinja but it also puts a framework that is very similar to software best practices around essentially sql that is to say things like version control testing repeatability modularity and even how you can run your code or in this case sql in different environments very kind of naturally as part of your process is all baked into dbt so it turns sql from something that's purely more analytical into something that can be more code-like or more engineered like it provides a framework for a better analytics process i think that's personally one of the benefits again taking away what is dbt exactly but looking at what it does it gives you this framework that can set up better processes for a team out of the box there's still plenty that you need to do well with dbt but out of the box it has the possibility that if you have good practices it will be much easier to implement them looking at dbt itself it's really just an easy framework for you to install onto your computer just hit pip install dbt that can then quickly initialize a whole again framework that you can utilize by hitting dbt in it plus the name of your project and then using things like the models folder you can start developing essentially sql models that will represent real tables in your target database you can use your target database by setting up the correct configuration and then when you run essentially the compiler it will compile your sql that could have jinja in it into an actual sql statement that it will hold into the targets folder and in that target folder it will then take the sql it will run it against your database and then essentially each of these files represents a table so it will create a table and could even do things like create tests in your configuration of set table that it will run against this table and that will just automatically live in your database so now we've taken all this other extra stuff you generally have to learn um all these separate different tools such as ssis and azure data factory and now just put it into have someone that can manage dbt and then everyone else just has to know sql that's kind of the end of the day like i think that's what the benefit is you need one person to manage dbt and then everyone else can know sql and then can build solid tables obviously they still need to have solid like database and data warehouse design skills but after that they just need to build essentially sql scripts to build tables and of course there are plenty of ways that i've heard this go wrong but if you set up again the correct processes that's kind of the win here with dbt now that you've got that high level understanding of what dbt is let's talk about some quick pros and cons for dbt one it's open source meaning it's open for you to customize it two it automatically comes with version control from day one three as i kind of referenced earlier it doesn't necessarily require a specific skill other than sql which you'll always need at most analytical jobs four it kind of already has testing built into the process meaning you can easily run things like schema tests or referential integrity tests without having to do something later five it's well documented and has a growing community that's pretty large i mean their slack channels like 30 000 people almost and of course there are cons for example sql based isn't necessarily the best for everyone it's not necessarily the easiest to read its logic is a little more complex and hidden behind joins and things of that nature another issue is that dbt doesn't really have debugging functionality so if you create custom macros which macros are essentially just uh similar to almost creating user-defined functions they just allow you to kind of repeat code but if you create one there's not really a good way of debugging it in dbt itself and of course dbt is just the t which is helpful in some ways but it means you're gonna have to have all these other components already done now a lot of people are switching over to elt methodology at least i think in the smb side of the data market maybe at enterprise it's still more etl that's what i assume but i'm seeing a lot more people on the elt uh side on the uh smb side of the market so in that side it's not that hard to buy a solution like 5 trillion or use error byte or something similar to pull data and then just use dbt to transform it but it's still technically a negative at the end of the day dbt is just another tool if you haven't had to use it i wouldn't necessarily worry too much if you understand docker python in sql to some regard you can pick it up really quickly and if you want to pick it up today you need a project or an example to show you how i would check out start data engineering's blog about what is dbt and how to kind of set it up they kind of go through the whole process they go a little bit into what is dbt but then just show you an example of how you can use dbt uh to set up a model and then push it to a data warehouse with that that covers the t part of the modern data stack there's still plenty more to go i think next i'm going to do reverse etls and kind of like why they're so popular because at this point i think two or three companies have now announced that they're either adding in reverse ctl functionality and talking about you know is reverse etl even a thing is it just another way to integrate or use data pipelines just pushing things back and there's a lot to be thought there as well so i'm going to dig into that next thanks guys so much for watching and i will see you guys next time bye
Info
Channel: Seattle Data Guy
Views: 133,313
Rating: undefined out of 5
Keywords: what is dbt data, what is dbt data build tool, what is dbt tool, what is dbt sql, sql tools, why is dbt popular, what is an analytics engineer, what tools do analytics engineers use, new data engineering tools, tools for analysts, data tools for analysts, data transformation, data build tool, etl vs elt, data pipelines, etl process, data transformation in power bi, what is dbt, dbt, dbt analytics, data build tool training
Id: 8FZZivIfJVo
Channel Id: undefined
Length: 9min 47sec (587 seconds)
Published: Fri Apr 15 2022
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.