Master the SQL SELECT statement part 06: Aggregate Functions, GROUP BY and HAVING clauses

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hi there thank you for tuning in i'm michael fudge and i'd like to welcome you back to our series on mastering the sql select statement in this video we're going to take an in-depth look at aggregate functions like min max sum count and average and we'll explore the group by and having clauses it ought to be a good one so let's get into it all right in this video we're going to take a look at the group by and having clauses specifically we'll start by looking at the aggregate functions we'll look at the many different ways that you can count data and use that count aggregate opera aggregate function then we'll discuss group by how to use it the the secret processing order of group by where it occurs in the processing of the select statement and then we'll talk about the having clause and why the having clause exists and how you can use that all right here i am once again out in azure data studio ready to show you what this is all about with aggregate operators and group i and having okay so we're starting out as we usually do with a table here we're looking at the students table and maybe you want to know just how many students are in this table so you can use the count operator for that it's a good idea to alias the column maybe i won't alias the column to start just to show you what happens so this is going to retrieve one row so unlike the other commands we've learned so far which return actual rows out of the table these aggregate operators are are what are called reducer functions they will take several rows and then collapse them down to one or more rows but not the original rows that are in the table they are now summarized rows so you see that there's 31 students in the table and i might want to aliase this column as as count of students oh i gotta put underscores in there mike a little off my game this morning apparently good so a couple other things let's do some of the other aggregate maybe you want to find out the smallest the lowest gpa so you can say select min student gpa as lowest gpa from students so someone's got a one eight you don't know who they are because again this is an aggregate operator we lost the original rows and now we have a summary so let's try to add in let's see who's got the highest gpa right so you might have max max student gpa as highest gpa got to spell it right it's not stud net looks better and there we go so someone's got a 4-0 or there could be several people with a 4-0 and then someone or several people have a 1.8 those are the min and max there's also an average right you can get an average get an average as well so different aggregate operators there's also a sum operator but we'll get into that in a little bit okay so this is great let's talk about different ways that you can count things so i had this example before where i counted students the first question that someone usually asks when they when it's dealing with counts is what's the difference between count star and counting a column so up here we have all these different columns in the table right so what's the difference between count you know wild card and just counting this column when you count star you're counting physical rows in the table when you count a column you're counting values in that column that are not null so i believe we have a null down here right down here um student 31 has a null so if i count the major id i'll get 30 because there's 30 major ids and then one student does not have a major id it's no so that's the difference between counting a column and counting star which is counting the entire row i'll show you that this is actually a count of students with majors okay so i have 31 students and 30 students with actual majors if you count the primary key it's the same as count star because the primary key is going to definitely it doesn't allow nulls and so therefore it's it would be the same the same thing or counting any column for that matter that does not have nulls in it okay but it could be circumstantial that it does not have nulls so you don't want to rely on that all right so let's show you another trick so count distinct student major id what does that do distinct says only count the different values in a column so in student major id i have you know um people that are majoring in accounting or people that are majoring in information management so i want to count each different one of those so this would be a count of different student majors so there's lots of different ways that you can count things using the count function and that's good so there's four different majors i can check that and verify that by just doing a select distinct student major id from students this will give me you know there's four different majors one two three four now null is included when you say select distinct however when you count you're not going to count nulls nulls don't get included the same thing with when you average it doesn't it does include nulls in the average when you min or max nulls are ignored nulls mean no value why would they be included in there they wouldn't okay so let's just label this different ways to count things verification that there are four different majors so maybe i want to know i want a count of students uh i want to i want a count of uh majors but only in you know only among freshmen so count of majors but only among freshmen first i get a data set of freshmen where student year name is freshman make sure that works those are all my freshman's now i'm going to count the different major ids here oh i want to count distinct right i don't want to count the physical rows i want to count the different ones and there's three diff the freshmen have three different majors so there's four different majors in total but freshmen have three different majors okay now why did i show you that last example i showed that last example to sort of segue into the purpose of group by it's great to go through and you know count everything among the entire data set or count the min or get the min or max or average among the entire data set but sometimes you might want to get the lowest gpa among the freshmen or the highest gpa among the freshmen so you might want to break them out by student year name rather than filtering on student year name so for that we need to use the group by clause so i'll give you an example so let's first of all create the data set then we'll create the query so this is usually how i like to approach it so i'm going to say select and let's just say start from students to start and i am going to get the student year name and the student gpa okay now that i've got you sort of id with a clue as to what i want to do what i'd like to do is for each freshman i want to get the average gpa so i want to say freshman here's the average sophomore here's the average junior here's the average okay so here's how you do that i want to take the average student gpa as student average now when i do this and i run it i'm going to get an error that says it's invalid in the select list because it's not contained in either an aggregate function or the group by clause this is telling you that you must include a group by and then you must group by the column that you want to display now you might wonder why let me just run it first and i'll sort of explain the why so now what i get is each of the different years i'm still getting a reduction right there was 31 rows now there's five rows but what do i have here in my five rows i have freshman and then this is the average gpa among all the freshmen graduate students average gpa among the graduate students juniors average gpa among the juniors okay so why do you need to group by if you include the column well when you aggregate right you are summarizing the rows so it doesn't know what operation to perform on student year name as far as the summarization goes do i want to minimize that or maximize that or whatever so because i didn't specify that it doesn't know what to do if i put in like you know give me the minimum if i put in minimum student year name watch what happens and then let me turn off the group by what you get is just one row that says what's the what's the first alphabetically of these year names freshmen and then what is the average amongst the entire 31 students that's not what i wanted what i wanted to do is say for each change in student year give me a different gpa so in order to trigger that for each change that's what the group by clause does is it says among the data set collect them by this column and then apply this aggregate over that collection okay if i leave out this it still works you just don't know why you're seeing it you just don't know what you're seeing and why you're seeing it right why is this a 2.9 9 what is it what does it mean right is it is it by eye color is it by gpa is it you know is it by major is it by you know student year name we don't know i mean looking at the code you know but looking at the results you do not and it's a good practice that your query results should be self-explanatory you should be able to look at the query results and know what the information means i don't know what this information means now i could help out by putting that up here student average by year but that still doesn't tell me which student which year is this the freshman is this the seniors i don't know okay so this is important thing to understand about writing any query that you write is that you should want the output of the query to be able to explain itself without the need to look at the code that you used to write the query in the first place okay anybody should be able to just look at this and understand what it means and that's important okay so that is that is group by so group by allows us to take something out of our data set and say i would like to organize the data set by that and then apply this aggregate to that so likewise i could continue this on so let's get you know averages some are not really that useful without counts as well i would argue because you want to know how many so a better way to represent this report might be count of students in major with average gpa there's my count or there's my average and then i'm going to add the count let's just count rows and i'm going to shape this up a little better there so i now know there's seven freshmen and their average is 2.9 and there's um three grad students and their average is 3.9 so i i see that i know that now okay so that that is how group by works and you can apply any of the aggregate operators to it count count distinct average sum min max i can actually include those in here maybe i want to get the minimum gpa by year i can just add that to the existing query and i can make a full report here let's say the maximum student gpa and this is the student max throw that in there for good measure so you know someone's got a 4-0 in every category except junior senior and sophomore there's nobody in those in those three years that has a 4-0 so you can answer and you can answer a lot of questions this way by shaping the data you can see which which year has the highest gpa which year has the most students so which year has the lowest gpa well that would be juniors which year has the most students well that would be seniors which year has the lowest the student with the lowest gpa uh which student has the among the highest gpas possible which which year has the lowest gpa that would be sophomore with a 3-3 okay the best gpa among the software's 3-3 okay so it sort of gives you an idea about group i now you can group by more than one column so i guess maybe i should talk about that maybe you want to get a a let me set it up first okay we want to get a group by year and major and then within that i want to get an average gpa and a count of gpas we'll say okay so i'm going to do it like this i'm going to say well anything that you show in the select line you have to group by anything you group by you do not have to show in the select line but it doesn't make a lot of sense to do that right because you need to see this column and this column in order to understand what the aggregate means so you you should you don't have to but you should so i want to group by student year name it's interesting because this here gets processed before the projection one of the last things that happens in the select statement is the projection the projection then the order by everything else sort of happens in the right order right first you do from then you do where then you do group by et cetera and then you do the projection and then you do the order by that's just something you gotta get used to with the select statement so student year name student major id and then because i'm grouping by those i probably should show them so that's my start so i have you know freshman and then there's three majors now i might want to know the names of those majors so let's do a join right join and let's for the sake of helping you out there's majors and it's major id and student major id are what i'm going to join them on and then i want the major name so rather than show student major id i'm going to show major name and then i'm going to get major name by using a join student major id is my foreign key major id is my primary key just to show you that that's the four key in case you didn't remember from a previous video student major id right there fk see that foreign key primary key that's how i'm going to join them all right let's see what we get here okay so now i know freshman accounting freshman applied data science freshman information management and technology i know each of the majors among the students now that i have that figured out i would like to figure out what the average gpa is so i can just throw that in here as an aggregate now a great a great name for this column is not really average gpa that's kind of a bad name for this column because once again it implies that when someone just sees this report they know that that average gpa is by freshman and by major are they going to know that yeah maybe maybe not it's better to be crystal clear with it and alias the column has averaged gpa by year and major that way everybody knows what we're talking about okay now does this mean that every single freshman in here has a 4-0 yes but how many are there we don't know this is why counts are important to include with averages counts are important to include with averages now another thing i don't like is maybe i want all the freshmen first then all the majors see how it sorted it so i need to include an order by here and order by student year name first grouping by is no guarantee that it's going to sort it either you need to explicitly include the way you want to sort the query okay now i get you know freshman freshman freshman accounting applied data science information management okay now these aren't ordered right but i'd have to include that other table to do that i'm not going to do that let's do the count let's add the count in here we'll just count rows as average as count of students count by year and major there we go so yes there's a 4-0 but there's only one student in there okay that makes it makes a little more sense so it's using group by pretty slick now let's kick it up one more notch maybe you want to get let's go back to an example that we did before just to keep things simple let's get an average of gpa by major okay and we're going to use this join to do that so let's say major name and then average student gpa as as average gpa by major okay now i'm going to run this but you know already this is not going to work right you're with me on this it's not going to work because i need to include group by i cannot aggregate a column and then display a column that isn't an aggregate unless i explicitly tell the select statement to group by this column otherwise it's expecting this to be an aggregate min max something like that again the order in which it processes is different right so first it does this then last it does this second it's going to do this so it does this first makes a big fat table then it groups them then it projects them projects them last and there you go so we have our major names and then our average gpa by a major now couple things so this is average gpa by major so let's talk having having the having clause let's add counts i always add counts with averages mike there we go we added counts now um one other thing here is if you add these up you get 9 right and 6 is 15 and 15 is 30. we're missing one person one person doesn't have a major one person does not have a major so if we want to include that one student one student that does not have a major we need to include all students even the ones without majors so we need to put a left here because we want every single student even those without majors now to make this sort of read a little better i'm going to put this up there go like that okay so there's our one student with no major null that soon has a 4-0 and there's one student that does not have a major okay so now they're all there the gang's all here now let's suppose we want to take this data and show only freshmen we can use a where clause for that right we can just put a where clause right here you know where student year name freshman do that i argue if you're going to do that it's a good idea to include freshmen as a column in here so that people know what you're looking at when you look at this you don't know is it freshman nothing in here again nothing in here indicates it's just freshmen so you should include that as like so i got a group by it and then i have to project it i really just want to project it but in order to project it i need to group by it okay now i see freshman there okay only freshmen all right okay that's one thing now let's talk having let me take freshman out this is what we had before maybe i want to take this data and only show it when the gpa the average gpa is bigger than three okay so i want to take this data and i want to filter what i see so that only the gpas are larger than 3-0 okay so i don't want to see this one and i don't want to see that one in this particular case you cannot do that with a where clause because the where condition is based on this column which is an aggregate column so you need a way to filter the data after you group this is the purpose of having having basically allows you to run what's similar to a where clause it's just not called where not called where to it's called having and you're filtering the data after you group so now you might say oh cool well now i can say having gpa having average gpa by major bigger than three boom done awesome no it's not going to work why again the way the query gets processed is the select projection occurs last last so it doesn't know about this column alias here it has no idea what that is so you actually have to say having and then use the aggregate operator having an average student gpa bigger than three now does that mean it's doing it the average twice because i wrote it twice hell no that's not what's happening here okay it is in reality making this column and then filtering that column it's just that the column alias is not available at that time this is just one of the weird things that you have to get used to with the select statement it's one of those legacy baggage things with the sql select statement i will say this when we talk about building table valued expressions with the with statement you can simplify a lot of your intent um with the with statement and make things a little easier to understand and uh we'll get to that in another video and then that way when you look at this you don't look at it and say what is this all about i don't know it doesn't make a lot of sense right so what what else what in essence we did was we filtered if i run this okay we filtered out with this halving line those rows where the major if you will was more than three okay all right that is min max average all the aggregate functions group by and how you use having the last thing i'll mention is whenever you use having you should always be filtering an aggregate there's no point if you're going to filter an actual row like if i said we're having major name accounting i can do that in a where clause so there's no point in doing it and having and as a matter of fact you're better served doing it in a where clause because then there will be less things that you run through these aggregate functions so you should only be using having when you need to filter based on an aggregate function okay that's the rule of having that's the general guideline of having
Info
Channel: Michael Fudge
Views: 821
Rating: 5 out of 5
Keywords:
Id: mX61Ww3wRgU
Channel Id: undefined
Length: 31min 7sec (1867 seconds)
Published: Mon Mar 22 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.