SQL Workshop: SQL Patterns Every Analyst Should Know w/ Ergest Xheblati

Video Statistics and Information

Captions
Thank you so much for attending today. Today we are going to learn from Ergest Xheblati. He is a SQL expert, data architect, and author of Minimum Viable SQL Patterns. I'm excited for this event specifically because I love learning from people who are just experts at their craft, and that's exactly who Ergest is. He's been practicing SQL for over 15 years, and today he's going to teach us about specific SQL patterns he's learned over the years to go from being a good SQL practitioner to a great SQL practitioner. This event is brought to you by the Operational Analytics Club, so if you want to meet more SQL experts like Ergest, you can join at operationalanalytics.club, and I hope to see you there shortly. With no further ado, Ergest, the floor is yours.

Thank you so much. Thank you, Parker, for having me, and thank you for setting this up. Hello everyone, wherever you are around the world. My name is Ergest Xheblati, and like Parker said, I've worked in the data space for the past 15-plus years. I like to say that I'm fluent in English, Albanian, and SQL — it's my favorite language by far. I started my career as a business analyst doing a variety of different things, but the moment I ran into SQL and started to learn it, I just fell in love with it, and to this day it still remains my deepest passion. I've spent 15 years in the data space doing a combination of what you might call analytics engineering these days — what we used to call data engineering, or business intelligence development. I've also done actual data analysis, data science, a little bit of machine learning, so I've run the gamut of roles in the data space. But I've found that what I enjoy most, by far, is SQL development, and because of that I noticed certain patterns I was using over and over again when I was solving queries, and I was able to solve them a lot faster than someone starting from scratch. So I realized there must be these Lego blocks, if you will — these patterns, these templates — that allow you to solve problems really quickly. You'll see here that I published the book earlier this year, in April of 2022.
You can find the link to the book at bio.link/ergestx, or you can reach me on Twitter and see the link there. In this talk I'll cover just a few slides before going into a live session where I show you a query that runs and works really well, but is not very easy to read or maintain, and we'll go through it and try to improve the readability of the query, and hopefully even its performance. So: why did I write the book, what are these patterns I keep talking about, and what are some of the most common ones?

So why did I write this book? As a software engineer coming from the world of engineering into the world of data, working directly with data analysts — people like yourself — I noticed something, because I was reviewing their code and submitting it into our git repo so our processes could run the code automatically in the data warehouse. The analysts knew their way around SQL; they could write very complex SQL queries that solved really complex problems and created really complex models. But what I started to notice is that a lot of their code would sometimes be inefficient — it would take a really long time to run — or it would be so hard to read and understand that I had to go in and rewrite some of it just to understand what it was doing and make sure it was doing the right thing. As I reviewed a lot of this code, I noticed it was inefficient, repetitive, and even brittle: if the data were to change, the code would inevitably break. So I started to think about how to solve these things in a way that lets people reuse these patterns, these ideas.

So what are patterns? They're basically templates for solving common problems. For example, if you need to dedupe your data a lot, you'll notice that after you've done it manually once or twice, you reach a point where you know exactly what the pattern looks like: you know that if you use, say, ROW_NUMBER with PARTITION BY — the window function — you can just copy that from your mind, or from another piece of code, and reuse it in many future queries. It becomes like a Lego block, something you use to solve problems in the future (there's a sketch of this dedupe pattern below). And once you've formed these patterns in your mind, you can sequence them to solve pretty much any problem: if you need to dedupe data, then build a very long table, then pivot it out, you know the recipe is the dedupe pattern, then the UNION ALL pattern, then the pivoting pattern. These are bigger conceptual pieces for solving problems.

I broke these patterns down into four categories. As I looked at code I had written and code other people had written, I noticed it fell roughly into these four categories. Since publishing the book I've come up with a few others, and I hope to expand on them and release another version of the book with more patterns.
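A minimal sketch of the ROW_NUMBER dedupe pattern mentioned above; the `events` table and its columns are hypothetical stand-ins, not from the talk:

```sql
-- Dedupe pattern: keep exactly one row per key using ROW_NUMBER() with PARTITION BY.
WITH ranked AS (
    SELECT
        id,
        payload,
        loaded_at,
        ROW_NUMBER() OVER (
            PARTITION BY id          -- one surviving row per id
            ORDER BY loaded_at DESC  -- prefer the most recently loaded copy
        ) AS row_num
    FROM events
)
SELECT id, payload, loaded_at
FROM ranked
WHERE row_num = 1;
```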
For now, these are the ones I initially wrote about in the book. First, there are query composition patterns, which help you compose queries that are easier to read and easier to understand — not necessarily more efficient, but definitely more maintainable. An example is using CTEs instead of subqueries. Then there are query maintainability patterns. In software engineering there's the DRY principle — "don't repeat yourself" — so if you notice yourself repeating the same chunk of code in multiple queries, or even within the same query, that's a good candidate for one of these maintainability patterns. Then there are query performance patterns. These differ a lot based on the system you're working with; different systems have different ways to optimize queries, both for speed and for credit usage, which matters more and more these days when cloud data warehouses charge you based on the credits your queries consume. One of these patterns is not ordering your data unless you really have to, or waiting to order it until the very end. And finally, my favorite: query robustness patterns, which I call "defense against dirty data" — a quick Harry Potter reference for you.

Okay, that's it for the presentation; let's jump into the live session. What I'm showing here — and the details don't really matter; what's important is that you see the query the way I see it — is a query that uses Stack Overflow data to aggregate user-level metrics at the user ID, username, and date level: questions they created, answers they created, questions they edited, answers they edited, and so on. We can run this query; it's running locally on my DuckDB database, by the way — I've been experimenting with DuckDB recently and I love it, it's great. The query takes only about ten seconds, and we get some data. This guy, Gordon Linoff, by the way, is another SQL expert: he has written thousands upon thousands of answers on Stack Overflow, and he has a SQL book as well. Since we've sorted this data by the number of posts created, we see the person who has posted the most. Okay, that's just to show you the query runs.

Now let's analyze the query from the perspective of some of the patterns I mentioned. The first thing I notice is the use of nested subqueries: there's one level of subquery here, then another level here — this has three levels of nested subqueries. There's nothing wrong with subqueries per se, but if you want code that's easy to read and understand, I don't like them; I prefer CTEs. So what's wrong with these subqueries? When I'm doing the SELECT here, I'm using this activity_type — but where is activity_type defined? I have to go down to this level; it's not here; then down to the next level; okay, now I see where activity_type is defined. This is a very small query, as you can see, but imagine 300 lines of SQL doing this sort of thing.
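A sketch of the contrast being drawn here — the live query itself isn't reproduced in the transcript, so the table and the CASE logic below are stand-ins:

```sql
-- Hard to read: the definition of activity_type is buried two levels down.
SELECT user_id, activity_type, COUNT(*) AS activity_count
FROM (
    SELECT user_id, activity_type
    FROM (
        SELECT
            user_id,
            CASE WHEN history_type_id IN (1, 2, 3) THEN 'created'
                 ELSE 'edited'
            END AS activity_type
        FROM post_history
    ) AS typed
) AS flattened
GROUP BY 1, 2;

-- Easier to read: the same logic as a named CTE, defined once at the top.
WITH post_activity AS (
    SELECT
        user_id,
        CASE WHEN history_type_id IN (1, 2, 3) THEN 'created'
             ELSE 'edited'
        END AS activity_type
    FROM post_history
)
SELECT user_id, activity_type, COUNT(*) AS activity_count
FROM post_activity
GROUP BY 1, 2;
```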
You're trying to debug a query somebody else wrote, and before you can even read what it's doing, you have to ask: what is the meaning of this activity_type? You have to drill all the way down through the nested levels to figure out what activity_type even is. Then you go back up: is post_type defined anywhere? It's not here, so maybe it's coming from... okay, post_type is defined over here, at this level. You can see how hard this is: back and forth, back and forth, trying to understand the query. That's the first thing I notice.

If we scroll down a little, we see there's a UNION ALL, because the questions and the answers live in separate tables in this particular schema. The questions and answers are schema-aligned — brought into the same column names, basically — and then unioned. But look at this: the same subquery we saw up here is repeated again down here. That's an anti-pattern against DRY, do not repeat yourself. If you make a change to this query here — maybe you had it wrong the first time — you then have to go and change it everywhere else it appears. This isn't a very long query, but in a really long one, who knows how many search-and-replace passes you'd need to fix a single subquery like this. That's the second thing I notice.

The third thing I notice is this ORDER BY here. What is that ORDER BY doing in there? We don't care about activity date; we're sorting by the number of posts created. You might think, well, it isn't really doing anything, so it's fine to leave it. It's not fine. Yes, your query engine might execute it really quickly, just like DuckDB does for me, but imagine if this were a really large table with hundreds of millions or billions of rows. Sorting is a very expensive operation: it takes a long time and a lot of compute, and data warehouse vendors are more than happy to do it for you and charge you the appropriate number of credits based on how much data you scanned or how long the query took to complete. I like to call this premature sorting. You don't actually need the sort here, but when an analyst was developing this query, they started with the inside query first: okay, I can run this and get all the post activity; now I need the user information, so I do a little join with users; now I need the other post types, the answers, so I just union this and copy and paste all the code. The query works, everything is great. But looking at this query, I say: it's unmaintainable and it's not efficient.
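A small sketch of the "premature sorting" smell, using hypothetical names — the inner ORDER BY buys nothing, because only the final ordering is observable:

```sql
-- Anti-pattern: sorting inside an inner block whose order the outer query ignores.
WITH post_activity AS (
    SELECT user_id, post_id, activity_date
    FROM post_history
    ORDER BY activity_date   -- premature sort: wasted compute on large tables
)
SELECT user_id, COUNT(*) AS posts_created
FROM post_activity
GROUP BY 1
ORDER BY posts_created DESC;  -- the only ordering the result actually needs
```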
So what can we do? How do we turn this into something a little more efficient? I'll try to do it live; let's see if that works. The first thing is to take this piece of code and put it into a CTE: WITH cte1 AS (...). I copy it into cte1 — this works fine if I highlight it and run it — and now I've taken this repeatable piece of code and put it at the top as a CTE, with a comma ready for the next one. The next thing I notice is that these two pieces don't need to stay like this; this piece of code needs to be unioned with the other, so I create another CTE, cte2: we take the question types and paste in the answers, combining the two into a single CTE. Now, let's see if we can run this... ah, we can't, because this query was joining with the one up here. Do we want to join with that again? We could, but I like to delay joins until the very end, so I'll remove the columns that depend on the join and try to run it. Okay, now I have another CTE. So we have one CTE that gets us all the post activity and another that gets us all the post types. Now I can take the final query, paste it up here, and say: FROM cte1 JOIN cte2 ON cte1.post_id = cte2.post_id. And let's not forget the GROUP BY we have down here; let's paste it over here too. Let's hope this works... no. As always, when you do things live, it doesn't work. Rather than debug this, I'll show you what the finished version looks like.

Here is the same query, but now it's what we call in software engineering refactored. Refactoring means you retain the same functionality as before, but you make the code easier to read, understand, or maintain. Now when I read this query, I see: I'm defining a CTE called post_activity, where I get one row per user, per post, per activity date, along with the activity type — whether they created the post or edited it. Then I add another one. If you think of this like a DAG, a directed acyclic graph — if you're familiar with dbt or Airflow, this works the same way — we have a node for post_activity, a node for post_types, and then we combine them at the end. The post_types CTE just takes a post ID and adds a column for the post type based on where it came from: a post from the questions table gets the question type, one from the answers table gets the answer type. Notice also that I removed the ORDER BY from the CTE up here, because I don't need it, and I'm saving myself both time and potentially money by not sorting the data before I really need to. Then my final query does all the aggregations I actually care about: I pull from post_types, I join post_activity, and with these aliases, pt and pa, I know exactly where the data is coming from — activity_type came from post_activity, and the question/answer post type came from post_types. By naming my CTEs properly, not only did I make the query easier to read and understand, I made it very easy to maintain: if I come back in the future, I know exactly what's going on.
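The refactored shape described above, reconstructed as a sketch; the live query isn't shown in the transcript, so the table names (post_history, posts_questions, posts_answers) and the CASE logic are approximations:

```sql
-- Refactored shape: one named CTE per concern, joins and aggregation at the end.
WITH post_activity AS (
    -- One row per user, post, and activity date, with the activity type.
    SELECT
        ph.user_id,
        ph.post_id,
        CAST(ph.creation_date AS DATE) AS activity_date,
        CASE WHEN ph.post_history_type_id IN (1, 2, 3) THEN 'created'
             ELSE 'edited'
        END AS activity_type
    FROM post_history AS ph
    GROUP BY 1, 2, 3, 4          -- dedupes repeated history rows
),
post_types AS (
    -- Schema-align questions and answers, tagging each with its type.
    SELECT id AS post_id, 'question' AS post_type FROM posts_questions
    UNION ALL
    SELECT id AS post_id, 'answer'   AS post_type FROM posts_answers
)
SELECT
    pa.user_id,
    pa.activity_date,
    SUM(CASE WHEN pt.post_type = 'question' AND pa.activity_type = 'created'
             THEN 1 ELSE 0 END) AS questions_created,
    SUM(CASE WHEN pt.post_type = 'answer' AND pa.activity_type = 'created'
             THEN 1 ELSE 0 END) AS answers_created
FROM post_types AS pt
JOIN post_activity AS pa ON pt.post_id = pa.post_id
GROUP BY 1, 2
ORDER BY questions_created DESC;   -- sort once, at the very end
```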
If I make a change to post_activity — say I add another join, or a filter condition; let's say I want to filter where the creation date is between... (you can't type when you're live) '2021-09-01' and '2022-01-01', I guess. And of course it doesn't work — ah, I misspelled "creation," of course. Okay, now I should be able to run this... "creation_date is ambiguous," so I need to specify that it's the post_history creation date — and now it runs. And when I run the whole thing, guess what: it runs even faster, because now I'm pre-filtering the data. This is another pattern I cover in the book when it comes to query efficiency: you could very well take this creation-date filter and put it at the very end, but in some cases it's much better to put it next to the tables you're filtering, so it takes advantage of the indexes and everything else that exists on those tables — and that makes my query even faster. So that's about it in terms of the patterns. I hope that wasn't too fast for the audience, but I'm happy to take questions.

Awesome, thank you very much, Ergest. We've received a few questions in the Q&A — everyone attending, if you have questions, use the Q&A panel on the right of your screen and we'll bring them onto the screen. Trevor Fox had a question: is that join for performance or organization?

I see the question — that's a good one, Trevor. Why am I joining here? What I could have done is take these columns, bring them in here, join with post_activity, and then in the final query just select from post_types. It's neither performance nor organization; it's more of a philosophical software-engineering principle: you want each CTE to do one thing and one thing only, and to keep it self-contained. If you added columns to post_activity and wanted to bring them all the way down to your SELECT, you'd have to change post_activity and post_types and the SELECT — you'd be repeating yourself, and you want to not repeat yourself. If I can limit the scope of this CTE to just getting all the post types, and keep whatever else I need in a separate CTE, it's much easier to change in the future.

Another thing I'll mention: if you notice that you keep using the same piece of code everywhere in your queries — because for whatever reason you need to union these two tables, and everywhere you go you keep pasting the same CTE — that should be a trigger. Wait a minute: if I need to change this CTE, I'd have to change 15 different models by hand. The moment I see that I need to copy the same piece of logic somewhere else, that tells me it should probably live in its own model, its own table or view — probably a view, since there are already tables underneath — so I could take this, put it into a view, and use the view in all the downstream models (there's a sketch of this below). I think the term in software engineering is the single-responsibility principle — I'd have to look it up — but the idea is to contain each CTE to just the one purpose you want.
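A sketch of promoting repeated CTE logic into a view, reusing the hypothetical post_types logic from earlier:

```sql
-- If the same UNION ALL keeps getting pasted into many queries,
-- promote it to a view once and reference the view downstream.
CREATE VIEW post_types AS
SELECT id AS post_id, 'question' AS post_type FROM posts_questions
UNION ALL
SELECT id AS post_id, 'answer'   AS post_type FROM posts_answers;

-- Downstream queries now share one definition; a fix here fixes them all.
SELECT pt.post_type, COUNT(*) AS post_count
FROM post_types AS pt
GROUP BY 1;
```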
Again, if I were adding columns to the top CTE or the middle CTE, I'd have to go change things in all the different places.

Okay, I see another question here, from David: what are your suggestions for where and how to comment in SQL? I like to comment right above what I'm doing, so I would add a comment right there: "get the post types and combine them." Now, I'm a big fan of self-documenting code — if you're reading my code, you shouldn't need comments to understand what it's doing. This is another, I guess you'd call it a pattern, but it's really a principle: your code should be self-documenting. That makes it easy not just for yourself but for other people. For example, I could have named these CTEs the way a lot of people do — cte1, cte2, cte3. They solve the problem, they're done, they close the book on that query and forget all about it, and six months later they're looking at it going: what on earth does cte1 do? What does cte2 do? If you take the time to name them properly, you never have to answer that question.

Moving on to the next question: the GROUP BY 1, 2, 3, 4, 5 — how is that in terms of query optimization? The GROUP BY 1, 2, 3, 4, 5 has nothing to do with optimization; it's just a way to make queries much shorter and easier. Otherwise I'd have to group by all these columns one by one — group by this, group by that — and I'd even have to copy and paste this whole expression, or the query engine would complain. So this should technically work, hopefully... okay, maybe I forgot something — ah, you can't use the aliases in here. So this also works... okay, there's another alias — that's what happens when you do things live. So this also works, but look at that query: it's ugly and unmaintainable; I've copied and pasted code everywhere. If you wrote your queries like this, I would probably not be very happy if I were the one reviewing your code. GROUP BY 1, 2, 3, 4, 5 makes it a lot easier.

There's one thing to note about GROUP BY 1, 2, 3, 4, 5: it only works cleanly if your aggregate functions all come at the end of the SELECT. If there were another plain column in here — if user_name were here rather than up top — then 1 through 5 would not work; I'd have to count one, two, three... up to nine and write 1 through 9, otherwise it would break. I'm doing this live, so bear with me, but you get the point: if you use aggregate functions, put them at the very end of the SELECT. With the aggregates at the end, you can do yourself a favor and just write GROUP BY 1, 2, 3, 4, 5, and so on.
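A sketch of the ordinal GROUP BY convention just described, with the aggregates kept at the end of the SELECT (table and column names are hypothetical):

```sql
-- GROUP BY ordinals work cleanly when every non-aggregate column comes first
-- and every aggregate comes last in the SELECT list.
SELECT
    user_id,                                      -- position 1
    user_name,                                    -- position 2
    CAST(activity_date AS DATE) AS activity_day,  -- position 3: expression never repeated below
    COUNT(*) AS total_posts,                      -- aggregates at the end...
    SUM(CASE WHEN activity_type = 'created' THEN 1 ELSE 0 END) AS posts_created
FROM post_activity
GROUP BY 1, 2, 3;   -- shorthand for repeating the three grouped expressions
```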
Next question: "I break complex problems down into smaller ones; each smaller solution becomes a CTE or a subquery, and I find myself with deeply nested CTEs or subqueries. I can't help but feel there's a better way." Okay — if you find yourself writing a lot of CTEs, there are limits; for example, BigQuery will only allow a certain depth of nested queries, and after a while it'll basically say: sorry, I can't run that code anymore, you've exceeded the limit. If you find yourself doing that a lot, that's a good sign you should break your code into individual models. If you're using a tool like dbt, take some of your code and materialize it into a table upstream, or turn it into a view upstream, and your downstream work becomes much less complex. You're still breaking the problem down into CTEs, but those CTEs then turn into tables or views and get materialized in the database itself. That's not always possible unless you're using dbt or some other data transformation tool, but if you have one available, the best approach is to go beyond CTEs into the realm of tables and views.

Okay, next question: "I have a habit of casting my date-string data types. Does the cast impact performance at all?" Of course — like every operation, casting impacts performance a little, although databases have gotten very good at doing this quickly. Where you might run into trouble is when your date is a string in the WHERE clause, you're doing the string-to-date conversion there, and the date column has an index on it. When that happens, query optimizers tend to perform worse. Say you have a date column in your table with an index on it — it's a timestamp, unique for every row, so the index is really useful. If you then call CAST on that column in the WHERE clause, you can cause the index not to be used and make your query worse. So the cast itself doesn't hurt much, but it can hurt further downstream, when you're filtering in a WHERE clause or joining (see the sketch below).

Next question: "If you have a datetime column and you want to find the latest date, just the date — does CAST(MAX(column) AS DATE) perform better than MAX(CAST(column AS DATE)), or is there no difference?" I believe the difference is negligible. From what I've seen, the order of the functions doesn't really matter; query optimizers these days are really smart, and they'll treat those two forms as identical and optimize them the same way, so you shouldn't have to worry about it. One thing I will mention: in an aggregate query, CAST(MAX()) may not work — your query engine might complain that the column isn't aggregated — so you might have to swap them; the aggregate function always has to go first. I tend to put the aggregate functions on the outside of the function chain, if you will, and all the conversions inside, so I prefer MAX(CAST()) over CAST(MAX()), because then I know I'm doing a MAX, and whatever happens inside the MAX is secondary. Performance-wise it's negligible, but for readability, and sometimes for query optimization, MAX(CAST()) makes more sense.
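A sketch of the index-friendly rewrite alluded to above — cast the literal bounds, not the indexed column (hypothetical table and column):

```sql
-- Anti-pattern: wrapping the indexed column in a function means the database
-- must transform every row before comparing, which can prevent index use.
SELECT * FROM post_history
WHERE CAST(creation_date AS DATE) = DATE '2021-09-01';

-- Index-friendly equivalent: compare the raw column against typed bounds.
SELECT * FROM post_history
WHERE creation_date >= TIMESTAMP '2021-09-01 00:00:00'
  AND creation_date <  TIMESTAMP '2021-09-02 00:00:00';
```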
Next question: when do you decide to optimize the query versus the underlying table indexes? Always try to optimize the query first, before you mess around with indexes. I've had debates about this with data people; I believe indexing tables is a bit of an art as well as a science, and it's not an easy game to play or learn. I've tuned a lot of indexes in the past, and the warnings are real: too many indexes on a table can cause performance degradation, and too few indexes can cause performance degradation. So optimize the query as much as you can before you tackle any index creation. I'll also say that indexes work really well for columns used in the WHERE clause or in a join condition, and for columns that are very unique — a post ID, a timestamp. If a column is a categorical string with only 15 distinct values, it's not a good candidate for an index. One thing to keep in mind, especially with table indexes in Postgres or MySQL or SQL Server: when optimizing a query, one of the first things to look for is functions used in the WHERE clause. String concatenation, substring functions — any of that in the WHERE clause will cause performance degradation. So instead of writing SUBSTRING(name) = 'this', it's much better to use LIKE — 'this' plus a percent sign, the wildcard — because LIKE is friendlier to query optimizers than functions in the WHERE clause.

Next question: "You suggest avoiding functions in the WHERE clause, such as replacing GETDATE() with a hard-coded date. Have you used a date-valued variable in the WHERE clause?" Yes, I have used date variables, and I'll tell you, query optimizers will still complain even then, because they're trying to predict the worst-case scenario. If you write date = variable_name, that might work, but if you write date BETWEEN variable_1 AND variable_2, the optimizer may very well decide: worst case, I'm scanning 50 years of data, so I'll just do a full table scan every time. I've dealt with this before: I'd declare the variable and set it to a dynamically generated date — GETDATE() minus three days, for example — and still get full table scans. That's why I'm not a fan of putting functions in the WHERE clause. Now, modern databases like Snowflake, DuckDB, and BigQuery don't care; everything is a table scan and they do it really fast because they have massive compute. But having worked with SQL Server for a long time, I'm very sensitive to these things.
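A sketch of the LIKE rewrite recommended above (hypothetical table; the two predicates are logically equivalent here):

```sql
-- Function on the column: the optimizer can't use an index on user_name.
SELECT * FROM users
WHERE SUBSTRING(user_name, 1, 6) = 'Gordon';

-- Prefix LIKE: same result, and friendlier to index range scans.
SELECT * FROM users
WHERE user_name LIKE 'Gordon%';
```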
Next question, from Joe: "I don't understand the grouping in the CTE if there's no aggregation in the CTE. Is that different than sorting when you aren't aggregating?" I'm not sure I understand the question — maybe it's referring to post_types, because there's no aggregation there. The way I think about CTEs is as mini tables, mini tables in memory, so you can do whatever you want inside a CTE. I just don't like sorting data in a CTE. I've seen cases where — sorry, my bad, let me undo what I just said — you can filter data in a CTE, and that's really good, but if you sort data in a CTE, that's not good, because you're causing additional computation to be used. I'll move on to the next question; if you can rephrase, I'll take another look.

Okay: "Sorting is done in O(n log n) time — isn't it relatively cheap, even on a very large data set?" You may be right about that, but I've seen cases — say you're using a window function, and window functions always require sorting — where there's real performance degradation. Maybe it's only a few milliseconds and invisible to the naked eye, but I've sorted really large tables, and let me tell you, the degradation is noticeable if you want a query to complete in a reasonable amount of time. So my suggestion is simply: take it into account.

"You mentioned views — what's your take on a view versus an indexed view?" You can't really index a view, by definition: a view is basically a query that gets executed at runtime. For the database to index a view, it has to materialize the view into a table dynamically, in memory, without you seeing it, and then index that. So yes, it may perform better, but the materialization of the view costs CPU cycles. My take: I want a view to be a view and a table to be a table; I don't want to combine the two.

Next question: "Is it a good idea to add a sequential ID column that has no business meaning to large tables, and add an index on the sequential ID? I've been told this is good for performance." You might be referring to what's known as a surrogate key in data warehouses. Surrogate keys are used so you can join data in the warehouse much more efficiently: if the business key is a string, creating a numeric key can improve performance. It's a known pattern used in some data warehouses, but it doesn't always improve performance — it's not like you can magically speed up any table by adding this ID. It only works when you're transforming data: you take your raw tables and add a key that makes sense within that internal data set. It means nothing to the business, but internally the tables all know about the structure, so those joins get faster. You can't just add an index to any old table and have it magically improve performance.
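A sketch of the surrogate-key idea described above, assuming hypothetical warehouse tables (dim_users, raw_users, fact_user_activity) that are not from the talk:

```sql
-- Surrogate key pattern: replace a wide string business key with a compact
-- integer generated during transformation, then join on the integer.
CREATE TABLE dim_users AS
SELECT
    ROW_NUMBER() OVER (ORDER BY user_name) AS user_key,  -- surrogate key, no business meaning
    user_name                                            -- business key (string)
FROM (SELECT DISTINCT user_name FROM raw_users) AS u;

-- Fact tables carry the integer key, so joins compare integers, not strings.
SELECT f.posts_created, d.user_name
FROM fact_user_activity AS f
JOIN dim_users AS d ON f.user_key = d.user_key;
```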
Next question: what's the performance and readability difference between using CTEs and temp tables? It's a good question. I used to solve problems with temp tables, so in my mind it's roughly equivalent to solving the problem with CTEs. In some databases you don't get a choice — we were working with BigQuery, and temp tables weren't available at the time, so we had to use CTEs. Temp tables are good because you can materialize a result set to disk. Let me put it this way: doing things in memory is always faster than dumping data to disk, but if you put really large data in memory and it doesn't fit, it's going to spill to disk anyway. So in my mind they're equivalent, and you can solve the problem either way.

Okay: "Clarify — in cte1 you are grouping, yes, but there's no aggregation in that CTE." Oh, I see what you mean; got it. The reason I'm doing a GROUP BY there is to dedupe — that's a nice catch. I'm not aggregating, but I believe that without the GROUP BY there would be multiple copies of the same row; it's a way of doing data deduplication. Thank you for catching that — it's purely deduplication. And just to continue on this: it's equivalent to doing SELECT DISTINCT. GROUP BY and SELECT DISTINCT are equivalent here, though internally the GROUP BY may sometimes be more efficient than the SELECT DISTINCT. By the way, here's another anti-pattern I've noticed a lot: people love to use DISTINCT everywhere. That just adds more and more computational complexity to your query, so I avoid SELECT DISTINCT unless I know the underlying data is messy. Anyway, good question, Joe.

Next one: is it more efficient to do the GROUP BY in the first CTE versus using SELECT DISTINCT? I think so, but you can try it — it looks like in this database it makes no difference whether I use SELECT DISTINCT or go back to the GROUP BY. Try it and see which is better. I prefer to get rid of duplicates as early as possible using one of a variety of patterns — a ROW_NUMBER rank, a DISTINCT, a GROUP BY. Logically they're equivalent: you're doing the same thing, reducing the number of rows being processed.

Ankit, good question: MySQL 5.7 does not support CTEs. Yep, it's unfortunate — you have to stick to subqueries; there's no other choice. However, I was once working with a MySQL database where switching from a join to a correlated subquery — something like WHERE NOT EXISTS — was a lot faster than the actual join. So unfortunately you have to use subqueries, or tell the IT person to upgrade, because MySQL 5.7 came out, what, close to ten years ago. Just kidding — it could be a very old system where you can't really upgrade the database, in which case you're stuck; I'm sorry.
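A sketch of the correlated-subquery swap mentioned for older MySQL (hypothetical tables; both queries return users with no match in banned_users):

```sql
-- Anti-join written as a LEFT JOIN ... IS NULL:
SELECT u.user_id, u.user_name
FROM users AS u
LEFT JOIN banned_users AS b ON b.user_id = u.user_id
WHERE b.user_id IS NULL;

-- The same anti-join as a correlated NOT EXISTS, which can be
-- faster on some MySQL versions:
SELECT u.user_id, u.user_name
FROM users AS u
WHERE NOT EXISTS (
    SELECT 1 FROM banned_users AS b
    WHERE b.user_id = u.user_id
);
```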
Next question: speaking of optimization and readability, do you prefer SUM(CASE WHEN ... THEN 1 END) or SUM(CASE WHEN ... THEN 1 ELSE 0 END)? For readability — which one did I use? I like to be explicit, because if you're not explicit, you don't know how it will handle NULLs: every row that matches sums a one, but a row that doesn't match becomes NULL. In some cases you could use a COUNT function instead and drop the zero, but SUM(CASE WHEN ... THEN 1 END) with no ELSE — I prefer to spell it out. You're saying I could have inverted these, made this a zero and that a one? That's very confusing to me. Any time you invert logic it's confusing, especially something like NOT NOT TRUE. Just say whether it's true or whether it's false; simplify your logic, make it easy on yourself.

Next question, from Sam: "I work with staging, dim, and fact tables. As the only data person at the company, I'm forced to join staging to dims in my CTEs due to time pressure — bad practice. Should I do the data engineering first so I avoid using staging?" Yes, absolutely you should. One of the worst patterns I've seen is people joining fact and dim tables to staging tables because, like you said, there's pressure to deliver quickly and not enough resources to fix the code. I would much, much rather go fix all the upstream tables to add the columns I need. That can have repercussions later on, and many analysts don't want to do it, so you basically need help from data engineering — but if you're asking me: yes, absolutely do the data engineering first and avoid using staging. If you continue that pattern and your company grows 10x in headcount, you're going to have a hell of a time debugging models later. You'll end up with what I call the weed garden of data models: there are so many of them, all interlinked and interconnected — I've seen this many times — and it makes it extremely difficult to debug anything. If you want to know where a revenue field came from, you have to trace it through crazy, perhaps self-referencing loops and DAGs all the way back to the source. Do yourself a favor and do the data engineering first if you can.

Next: for the creation-date filter, instead of a hard-coded date, can you use two variables for better readability, and does using variables like this have any performance impact? Yes, you can. I answered a bit about variables earlier: the system I was working with, BigQuery, didn't support variables, and the raw SQL itself doesn't really support them — dbt does if you use Jinja — so unless you're writing a stored procedure or a function, you often can't use variables in SQL itself; sometimes you can. If you're filling in the dates dynamically and running the query every day to go back a day, then yes, you could use a dbt variable that automatically pre-fills that day's value each time, and that will work. Will it impact performance? No, the impact will be minimal.

Okay, next question, from Daniel: I see the WHERE TRUE; I've heard arguments for and against this — does it make a difference to performance? No, it makes no difference to performance. Honestly, the only reason I use it is when I'm writing a query for the first time: I can write WHERE TRUE AND some condition, then comment out the condition and rerun the query without worrying about it. Without the TRUE, if I comment out the condition and try to run the query, it fails, and then I think: oh no, I also have to comment out the WHERE. So it's purely a convenience while you're drafting queries, but it's one of my favorite things to do. I hope that answers the question.
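A sketch of the WHERE TRUE drafting trick just described (hypothetical table and filters):

```sql
-- WHERE TRUE turns every real filter into an AND line that can be
-- commented out independently while drafting.
SELECT *
FROM post_history
WHERE TRUE
  AND creation_date >= DATE '2021-09-01'
  -- AND user_id = 42   -- toggled off without breaking the WHERE clause
;
```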
Next question: when using joins, do you prefer JOIN table, or JOIN (SELECT specific columns FROM table)? I prefer JOIN table. The former does not fetch all the columns — joins are very efficient in a database, and it only fetches what you put in the SELECT clause. The second form is a case of using a subquery; there are situations, like the older MySQL versions I mentioned, where you're limited to subqueries, but I always use JOIN table or JOIN CTE.

Next question: "I read it's bad practice to add a sequential ID as a primary key when the database already has a surrogate key. Where can I learn about these practices?" I don't know that there are any books on this other than The Data Warehouse Toolkit by Ralph Kimball — the last edition as of now was published in 2013, but people still use it. He recommends these sequential IDs as primary keys even when the database already has a key, but the cases he was dealing with had limited space, limited storage, and limited compute, so he needed the queries to be very efficient, and he basically replaced strings with integers.

Next question: any tips for using CTEs in databases where they act as an optimization barrier? You're stuck, Tim. I don't think there's any way around that on an old MySQL, unless you use some kind of translation layer — a SQL transpiler that takes a CTE and turns it into native MySQL code.

DuckDB supports GROUP BY ALL — what's my take on that? I'm used to GROUP BY 1, 2, 3, and most databases support that. Don't use GROUP BY field1, field2, field3 — like I said, if you're using any calculations or CASE statements, you have to paste the whole thing in there. I used to work with SQL Server 7, or 6.5 I believe, which didn't have GROUP BY 1, 2, 3, and it was a nightmare: if your SELECT was 20 lines, your GROUP BY was an additional 20 lines. GROUP BY ALL is new in DuckDB; I've only just started playing with it, I like it, and I wish other databases supported it.
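A tiny sketch of the DuckDB GROUP BY ALL shorthand just mentioned (hypothetical table):

```sql
-- DuckDB's GROUP BY ALL groups by every non-aggregate expression in the
-- SELECT list, so the grouping columns never have to be repeated or counted.
SELECT
    user_id,
    activity_type,
    COUNT(*) AS activity_count
FROM post_activity
GROUP BY ALL;
```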
What IDE am I using? This is DBeaver — dbeaver.io — it's free for everybody to use.

Okay, how are we doing? I think we're a little over on time. How are we doing with time and questions, Parker? I know I ran through those questions at a hundred miles an hour, but hopefully that was helpful for folks.

We're doing great. Thank you, everybody, for asking such great questions. It looks like there's one more right now that I'll pull up for you: temp tables versus CTEs — is there any situation where they get locked up?

Temp tables are unique to the session you're in at the moment, so if I run a query and create temp table one, and you run a query and create your own temp table one, there will be no locking — unless the database is performing some kind of backup or restore operation where reading the table is restricted. You'll be fine. I think that's it.

Perfect. Thank you so much, everybody, for attending and for being so engaged — very good questions — and Ergest, thank you for being such an expert in SQL as well.

I just love answering SQL questions, and I am on the OA Slack group, so if you have additional questions, you can add me on Twitter and send me a DM — my DMs are open — or you can ping me on the Operational Analytics Slack group, and I'll look at it and try to answer your question individually.

And there you'll see the link to join the Operational Analytics Club: you just go to the website, fill out the form, and then you can meet a ton of data professionals like Ergest. Ergest, before we sign off, do you want to pull up your slide so people can see your Twitter and LinkedIn, or whatever you were sharing?

Of course, of course, thank you for mentioning that. My Twitter is @ergestx, and from there you can view all my sites: I have a Substack newsletter that I just launched, and you can get a link to my book on that bio.link if you're interested. I'm active on Twitter, LinkedIn, and a few Slack channels.

Amazing. Thank you again, and that is the end of the session.

Thank you very much, Parker, for setting this up. I appreciate it.

Absolutely. Have a great day, everybody.

You too.
Info
Channel: Census
Views: 15,486
Id: UFiZx5NlzL4
Length: 60min 34sec (3634 seconds)
Published: Thu Oct 13 2022