F2023 #02 - Modern SQL (CMU Intro to Database Systems)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
foreign all right uh let me make sure I do backup recording um all right so uh some quick things administrative things from from last class uh it's only been one day we already got emails um so uh the first things that I made a mistake last class I said Ted Cod got his PhD at Penn you gotta appreciate Michigan where all the great datas people got their phds uh so uh Mike Stormbreaker got his PG there Ted got a PG there David DeWitt another famous Davis guy so it was not Penn it was uh it was Michigan all right the other thing is people complain that uh about the audio dear Andy the audio of your class sucks so I can't seem you afford a sound engineer do it right Andy I love the course but I can't listen to it because the audio is messed up what happened why are you doing this I lost the way to live um so that that was my fault last time so we're double recording so hopefully we won't have any issues uh we're posting it this time and then we actually got emails about you uh which is surprising right uh I saw dj2pl last month performing at the bridge uh 21 and overshow in Pittsburgh is that true all right uh you're very lucky to get a DJ like that he is expensive uh seniors have a lot of money TV does not have a lot of money okay yo TJ okay dj2pl is ridiculous is he single I'm asking for my friend and then they were she's like Taylor Swift without any thumbs I don't know what that means are you single okay all right so all right anyway that's their problem uh all right cool uh so today's class we're gonna talk about SQL um the last class we spent time talking about the relational model we talked about how in my opinion that it's Superior data model for every possible database you can sort of think of because the data the relational data model can be uh used to represent pretty much all the different schemes that that are out there and then we showed how relation algebra was the building block for how we would execute queries or Define queries to operate over uh on a relational database so today's class is really now to talk about SQL which is the again a declarative or non-procedural language for interacting with a database system and we'll see over time is that SQL has evolved where in the beginning in the 1970s when it was first defined it was very strict about what what a relational database should look like but uh in the last 40 years it's expanded to support things that don't look relational like Json for example um so let's let's start the beginning talk about how sibo got started and then we'll talk about uh the sort of more interesting interesting things you can do in modern versions of it so the the sequel goes back to the 1970s um and again for Ted Cobb when he wrote that first paper he didn't Define a programming language for operating relation databases it was all mathematical people said like oh the the the you know the paper was so inscrutable no one can understand it uh if you actually read it it's actually pretty easily understandable it's just people didn't like math back in the day I guess um so then he so some people at IBM solves paper and tried to start building you know experimental relational databases to see whether they can actually take his mathematical ideas and put it out put it into practice and so the very first relational database language as far as I know um was this thing called Square that IBM invented in 1971. um and this was for a earlier project that IBM was developing for a one of the first relational database systems and probably the first one um which sounds it sounds like a weird experimental Rock Band It's called The Peter Lee relational test vehicle right but that was the first thing they built as an early prototype to show that you could take Ted cod's ideas and actually put it into a real system the problem with square though is that you can never actually reuse it because you had to write in weird notation and vertically which is not you can which you can't really do right so this is uh this is from the the original one of the original papers right you would write right this is how to do a you know scan on the on the sales table by Department like you would write in this weird vertical way with characters that you wouldn't have on the keyboard even today right so no one actually ever did this so then uh IBM threw that away and they started building a new query language called SQL spelled s-e-q-e-l um for the system R project which is a system we'll talk about throughout the semester but this was the the second relational database system that IBM started building to try to show that Ted cod's uh work had actually done the the Peter Lee one that was in the UK that was a small team the system R project was in San Jose um at IBM research and that was that was a major major undertaking um so they defined SQL back in 1972 this was uh don Chamberlain and Boyce they work in this query language um and the idea is supposed to be the Structured English query language but in the 1980s when when IBM put out a commercial relational database system they got sued for the term SQL like the name SQL because some there was some other system or some some other piece of software that was using it so then they just reverted it back to us uh to SQL just the letters right the structured query language um there was another Pro uh very famous project at the same time out of Berkeley in the 1970s when system R was getting started uh called Ingress who here has heard of Ingress nobody who here's heard of postgres you're wondering why postgres is called postgres because it's post Ingress the guy that built stone brick when he built Ingress uh he commercialized it in the late 1970s and went back to Berkeley 1980s they built a new system that was supposed to be post Ingress that's why it's called postgres um so they postgres or sorry Ingress had this other query language called quell and so SQL the plan the words is that SSB the sequel to quell because they the IBM guys knew what the Berkeley people were doing and they were trying to build a better query language stonebreak would argue that quell is better uh but of course no one uses that today IBM released a couple of relational in the 1970s IBM was making a lot of money off of IMS which is a not a relational system not didn't support SQL and then they realized that SQL was going to go somewhere relation data is going to go somewhere so they released a bunch earlier prototypes like system 38 Seco DS uh but the big one that that really took off was db2 which is still around today um again IBM was it was a big Juggernaut in the Computing world so whenever IBM said they were going to do that sort of became the de facto standard so when IBM came out with a relational database that supported SQL even though there were competing languages like well everyone uh coalesced around uh around around SQL so SQL became a standard in a Nancy standard it was American Standards body in 1980 1986 and then it became a international standard in 1987. um and so even though it's a you know 50 60 year old language now uh it had evolved and expanded over time so the latest version of the SQL standard actually came out in March this year in 2023. um and he sees her list here the history of all the updates the various features they've added over time and the main takeaway from this from this listing here is that as as programs evolves applications evolved or the trends in in development software development has evolved SQL has has has moved along with it and adopted the ideas and adopted new new capabilities so in in 2023 the big two features that have come out is now you can do property graph queries directly in SQL so somebody brought up neo4j last class right that's a that's a a special purpose graph data model database system but now you don't need that anymore because now you can run graph queries directly in SQL because the SQL standard supports it they also added support for multi-dimensional arrays right so I said before that a lot of machine learning stuff is based on arrays or matrices now you can operate directly on SQL these things now just because the standard has defines it doesn't mean every system is going to support it right uh I I don't think any system really supports the the multi-dimensional array stuff like Oracle supports the property graph stuff postgres will eventually get their duct TB eventually ducktb has a prototype for it but you do it because it's in the standard not everyone's going to actually be able to support it so I would say in my opinion the the minimum support you need for SQL to say that your database support SQL is defined in the SQL 92 standard right that's like select insert update deletes create tables that's that's the basic functionality so again even though SQL is over 50 years old it's not a dead language and there's updates all the time and of course every 10 years or every five years some new technology comes out and people say that sql's dead and it's about to be replaced um 10 years ago it was no Sequel and the hot thing now is chat gbt or vector databases so you see a lot of these kind of things on on Twitter or uh on social media where they claim SQL is going to die because chat TPT is going to replace it or natural language is going to replace it right this is all this is a bunch of hype um it's interesting but it's not going to play SQL like SQL was here before you were born and seek will be here when you die okay and I I've made public statements to basically about this so there's an article uh they quoted me in uh last year and some some magazine or something I basically said you need to know SQL if you want to do anything in in computer science all right so the in a relational language like SQL it's going to have sort of three parts there's gonna be the DML the data manipulation language that's how we're going to uh that's our select insert update delete queries that interact within our database it'll be the ddl the data definition language that's the crate table statements the crate views right to create the entities of the objects uh in our database and then they'll be we're not really going to cover this but they'll be the DCL the data control language that's for like security and access control right some systems allow you to have like you know you can specify what users are allowed to see what rows or what columns or what tables and so forth right so the SQL standard specifies for these things another big thing we'll see later in the semester is definition of transactions right how do you how do you define a bunch of SQL statements you want to happen atomically um in an isolated way and again the SQL standard supports this so again we'll see bits and pieces of this as we go throughout the semester but for today's lecture we're really going to focus on the first one the DML and then reminder from from picking up what we want where we all left we talked about last class uh Seco is going to be based on bags meaning there there could be duplicates um whereas relational algebra was based on sets and we can see some cases where we'll we'll have to add extra stuff in our SQL statements to um to deal with that so today again is supposed to be a crash course on Modern SQL I'm assuming everyone whether or not you know it or not you you know enough from the SQL 92 standard right select insert up deletes and I want to talk about the I want to talk about these are more more sophisticated things you can do with them but but the other arching theme also will be is that we will open up the terminal we'll try a bunch of these queries in different database systems and we will see that even though there is a SQL standard there is a you know internationally recognized document that says here's what SQL should look like nobody implements it exactly that way right everyone's going to have these weird nuances and quirks where they have different features or different uh nomenclature or syntax to do certain things uh in some cases different semantics of of different operations where even though there's a SQL standard it's going to be different from one system to the next who did you guys who who do I think is the biggest offender for the worst SQL implementation worse is not the right word but like the one that deviates from the standard the most let me take a guess all right the top four out of yes in the back yes my sequel he got it right yes my sequel is going to be the out of all these where they're gonna do all sorts of weird uh more recently they they now have a flag where you can make it be more strict and and try to be more closer to this the SQL standard but for the longest time they do a bunch of weird things and my problem is that I first started using databases relation databases in like when I was in high school in the 90s and we were using my sequel three so I have all these bad habits that like I picked up from my Sequel and I'm like oh yeah this is what SQL is then you realize when you start playing other systems like this is crazy they're doing some weird stuff um but it's gotten better and my SQL 8 has has certainly improved a lot all right so we'll go through through all of these aggregations group eyes the string date and time operations that's gonna be the one where we see all the problems and then a bunch of other different ways to to interact with SQL queries and then another theme about what we'll talk about is the goal of writing SQL statement oftentimes is to try to do all the computation on the database server itself within one sort of one overarching SQL query meaning we don't want to have to do a select get some data back into a Python program or something then then operate on it and then push it back and do more queries we want to try to do everything we can on the on the server side inside the database system because we want to be able to push the query to the data not pull the the data to the query again this makes more sense as we go along all right so for today we're going to use a simple uh example database like this um it has three tables student enrolled and course right it's it's basically trying to mimic a university there's students they take classes uh and they're enrolled and they get grades in the various courses that are there okay so we use this as the as the sample database as we go along all right first things aggregations so aggregate aggregate functions are a a way to compute some mathematical computation on a on a sequence of data or a bag of tuples and you're basically going to coalesce it down into a single value so the classic things would be average Min Mount average min max summon count right you're trying to compute like the the Min value of of a column across all tuples within within a relation so an example example like this so say we want to get off for the students we want to count the uh the number of students so have a login where the the login ends with at CS like do you have an at CS email address and so we just put the count function here um and then the inside of it actually doesn't matter for account but we're just gonna count the logins and then we have our where Clause specifying when uh or what tool should qualify so again my example here I'm putting login you don't actually have to do that you could put a star that's equivalent right because again it's just counting the number of of entries you can actually put one again also equivalent it doesn't matter inside you can really put anything right you can put one plus one plus one all right and the database system should be smart enough to realize that hey in this last example here I don't care what's what the expression is inside of the count and I won't actually do that math because I just care about what's the what's the count tuples that I have you can have multiple Aggregates in a single select output so here now we're going to compute the average GPA and the the counting the number of students again with have that have the at CS uh login and you get this you know you get back a single entry or single record in the output result for for the two computations important thing to understand though if with aggregation since you're trying to coalesce down a you know multiple tuples within you know to a single scalar value you can't reference anything in the in the select output that isn't that that isn't part of the aggregate so I can't do something like this I can't go select the average GPA uh after you join the student table of the enroll table and then also spit out the the course ID of the enrolled enroll table right because this is not defined right this doesn't make it make any sense right there there isn't again you're taking multiple rows you're you're condensing it down closing it down and collapsing down for the complete the average what is the course ID in this context right it's nothing so if you in this case here what you if you sort of look at this what you're really trying to do is you're trying to get for each course ID get the average GPA so what you need to do use is what are called a group by clause where you're going to project tuples into to buckets based on whatever the the parameters are in the group by clause and then compute the aggregate on each individual bucket so you sort of take it like this if I first do the the join between the enroll table and the student table I have all the you know I have all possible combinations based on the join and then now I'm going to split them up based on the course ID because that's what I have in my in my group I clause and then now I compute the average on on for the GPA for each for each of those buckets makes sense all right it just matches up like this so again the main takeaway this again you have to have anything that's in a uh editing that's not part of an aggregation has to appear in the group by Clause so again in this case here I don't have the student name I I can't I can't put that there I'd have to put it in in the group by clause uh we can open the terminal if you want my SQL used to let you do do this in some cases um but we can actually let's try let's see what happens and I hate typing on my uh my Surface I'm going to use this this laptop here let's log into it all right so we want to do this is right so I had postgres I have a bunch of Data Systems running so the query we were trying to do was uh this right select average GPA course ID from enrolled joining enroll table and student table right so postgres doesn't let you do this because it says the the course ID has to appear in the group by Clause that's good that's what we expect let's go over to my sequel my SQL doesn't let you do it um but let me see if I put it in the right mode by default right it doesn't let you do it but there's a way to it's enforcing um it's enforcing the what what mode it's in so if I go to traditional then now run the same query it lets me do it and it spits out course ID 15445. is that correct no right because what does that mean it's the average GPA for all courses but that's spitting out one of them so that's that's bad so let's go now take the same query and we'll go over to uh to sqlite all right who thinks it's going to work raise your hand if you say yes and the man work meaning like it'll actually run the query I'm not saying the results correct who thinks we have one yes two yeses who says no most people say no I did it right it also spit out 15 445 is that the same value as my SQL yeah GPA looks sorry GPA looks the same all right let's go to Oracle Oracle doesn't like it let's go to duckdb didn't like it so my SQL you can do it if you if you make a b more like MySQL 5.7 instead of eight but sqlite will do it so again like the this is the first example we'll see how many times that like sequel says sander says one thing but the different systems are doing different things all right the next thing you do is have you have a hiding Clause like say if you want to start filtering on on these aggregation the aggregate columns you're generating uh you can add a having Clause to specify whether what how many what people should match after you complete the aggregation right so say I want to get only show me show me only the students that have an average GPA uh that's greater than 3.9 so in this case here I'm Computing the aggregation right select average GPA as as a GPA and I'm trying to reference it here inside of my where Clause right I can't do that because at this point when the system is actually calculating the query it's Computing aggregation as it goes along it can't doesn't know what the final result is right so the easy fix with this is to have a having clause which is basically telling the system okay formative aggregation is produce the output that's defined in the select statement and then apply this additional filter uh for having this is actually not correct either in some cases I don't think the SQL stand unless you do this either right because even though I have an alias up here for average GPA the the data system can say I don't know what this is my SQL lets you do it uh postgres does not so instead you have to basically write the the the the the aggregation Clause again and again the data resistor should be smart enough to recognize that this average from the GPA is the same as that average GP up there and therefore compute the same computation don't perform perform the same computation twice right so essentially just doing this again Computing aggregation and then do the additional filing to throw out things you don't make sense all right strings and time stamps or what or dates of when things get get really uh get really weird not weird but like really inconsistent so for string string operations or sorry string data types the SQL standard specifies that the the case of the strings within the values because I don't mean the strings and the select statements I mean like the actual data you're storing that they should be case sensitive and that you you when you want to have in your SQL statement constant strings you want to use single quotes postgres SQL server and Oracle follow the standard my SQL is by default case insensitive um and then they both sqlite and my SQL support both single and and double parenthesis or double quotation marks to represent constants and strings so let's see what my SEO does and see how weird this is let's go back here um right so um so you can represent a constant like this right so you can have a select statement without a without a from clause in my SQL right and I can represent it basically takes whatever the input is and I can it'll spit it out so I put a comma like I can get like I can do like one two three like that it'll make comms for all the the things in the output so for Strings I can have it as double quotes and single quotes right in the case of postgres it won't let me do double quotes right can't do that because it's trying to look for a column name Tupac that's the way you sort of Escape column names but it'll support uh single quotes so in SQL Lite they support both so that can go Tupac like this and I can go with single quotes like that in Oracle it's single quotes but it doesn't like queries without without a from clause so in Oracle they have this weird thing called the Dual the Dual table and this is a fake table that comes with Oracle to allow you to write these kind of queries that against tables that don't actually exist right so then I can get that if you try to do like select star from from Dual you just get like an X I think the newer version they got rid of the Dual you don't need any more this is this is Oracle I mean 21 so it's a rather newer version right um so like you can't do you can't do this but like in postgres or any other data system you can treat you can treat SQL as a calculator you can just put whatever you want and close like that right so all right let's go back to my sequel and let's look at some string functions so I can call now I can do like select stuff from student where name equals Tupac with you know weird casing and then it matched on the string Tupac right because internally my sequel is treating the VAR chart as as case insensitive so if you want to now uh if you want to have it treat it like a you know like any other database system where it actually is actually looking at the case as a true varchar you can add this binary flag in front of the or keyword in front of it the column name and that'll treat it as like a binary uh string like any other system in this case here now it doesn't match but now it tells me I have a warning so now I gotta go now call show warnings and this is again this is my SQL specific so now they tell me that the the binary expression is deprecated and be removed and they tell me at least how to write it correctly sorry now I have to cast the the name as as a binary and then I can call it right so if I change the casing again then I get Tupac so this Burns a lot of people because they end up you end up like thinking oh I'm if you don't know that your varchar is case insensitive you could store things multiple times and uh you know thinking that it's gonna be different because the case is different but then MySQL says they're the same again this is only my sequel I don't know any other database that actually does this so that's a weird one yes uh the question is why is the name capitalized so this question is uh I'm telling you so there's the data is being stored with the case sensitivity the comparison operator when it actually executes the where Clause is ignoring case right so it's not calling whatever string compare that you'd have in lib C uh it's calling either their own version of it or the caitson sensitive version of it because that was some decision that somebody made in the 1990s that has carried out over today yes why did they make that decision uh ask me that question at the very end if we get through all the books I again it's probably because somebody just did it the one way you know decided how to do it right or my sequel the guy was actually in many cases trying to follow what Oracle did in some cases but Oracle doesn't do this I I have no idea right we can email the guy uh he's still alive um so yeah I mean there's a lot of times where people just did stuff because like one person did it without like thinking through the implications of it um or they're trying to copy some other system where they like that some other you know particular feature of functionality right and other questions we'll see many examples are like why would anyone ever do this you know do it this way um all right so I think I showed a query like this before just make sure you see it so there's this like uh like operation in in SQL and you use this for sort of really primitive string matching or pattern matching so you would use a you call like and then you would say um you would have a percent sign to represent a wild card so instead of if you're coming from from like the Unix world star or regular expression usually means uh match anything or Dot in SQL it's the percent sign and that'll match any substring uh including empty strings but if I just want to match one character you would use the the underscore and there is support for regular Expressions I forget whether that that is in the SQL standard but everyone does it slightly different um but you can write more complex uh string matching um matching patterns there's a bunch of string functions that also come in the single standard to do things you would expect if you're familiar with python there's like all the python functions um uppercase lowercase substrings replacing strings right all that all that's in the SQL standard and for the most part these are going to be pretty consistent across the uh the various systems where things go wrong is which we think would be most simple operation concatenating two strings that's where Everyone likes to do something slightly different so the Siegel standard says the double uh the Double Bar is the way you can catch strings um in in SQL Server they support they use the plus sign um and then in my sequel they don't have these under the the default mode they don't have the double bar they don't support the plus sign you have to use the concat function right and we can see that real quickly so going back to my sequel so if you want to do something like this right I get another warning I show warnings and it tells me it doesn't like my syntax right um I mean that was the that was the first one sorry boom I've got two warnings says the the double bar is a synonym for the or and therefore it's to be deprecated um and they didn't like the way I uh was sending along at the at sign in CS so if we now try to call we change the SQL mode in my sequel to follow the the SQL standard now I can I can get the the concatenation that I want um right so again it's concatenation to think it would be super super it would be everyone should do the same thing but again it's some in case of my sequel some Legacy thing from the 90s that they're they're trying to slowly undo all right date and time is probably the the worst one so the SQL standard defines a bunch of ways to Define uh date types time types also time times with time stamps um different calendar types Julian calendar Gregorian calendar but how again how the the syntax is going to vary is going to be pretty annoying so I'm going to give it now a demo where try to do what would seem like a simple simple calculation a simple computation we just want to count the number of days since from today to the beginning of the year it's like 230 something 240 something right just the number the total number of calendar days so we're going to do this first in postgres and then we'll do this in uh in in MySQL and do this in in SQL Server so the first thing we need to do is figure out how to get the current date right the current current time right well there's in in postgres there's a function called now and that'll give you uh you know that you'll get back a timestamp with the current date in in my SQL you can do the same thing in sqlite they don't have a Now function inductdb ductdb is going to follow pretty much postgres for a lot of things because it's based they use the same uh the same SQL grammar so they have a malfunction I'll go to Oracle Oracle does not have a malfunction all right so there's another way you can get the timestamp so in the SQL standard there's something called a function called current timestamp right except it's not a function it's a keyword and then in my SQL they have the function they have the keyword in sqlite they don't have the function they have the keyword and an Oracle doesn't gives us a weird error about that one we'll come back to that in a second and then they don't have the keyword so they have the function but we're getting this other weird error date time interval Precision out of range okay so what's that uh so now we're going to go back and maybe oh because guys it's it's Oracle it doesn't like having a select Clause without a front a select statement without a front Clause so let's add our fake table dual right then we get it right but it's the keyword and not the timestamp all right all right so now all right so at least now we can get the the current timestamp the current of the current day um and so what we can do is now there's a we can start casting strings or varchars into date types and then there's this extract function in the SQL standard allows us to extract some part of of that data timestamp so this is saying extract the day from and then today's date as a string casted into a uh to a date type all right and again there's syntactic sugar for all these different systems that are like non-standard so in postgres if I try to give the string it's going to throw an error because it says it can't I need to operate the extract what you need to operate on a date but you're giving me a varchar but I can add these the two colons at the end and then put date at the end and then that's going to cast it to a uh to a date type oh can you see that or no sorry um shoot sorry let me do this yeah right but I know it's wrong sorry let's try it again all right so here I can give it a string and then I put colon colon date and that converts it to a date but that's only in uh in postgres I can't do this in in any other system except for duct DB because they follow the same standard right so if I go to my SQL try to do the same thing doesn't like that go to sqlite doesn't like that go to duckdb or oracle's not gonna like that from Dual doesn't know what a date is um this.db should do it right because again wtb's follows the same grammar okay so we can use the distract function to maybe extract what the the current date is or try to figure out how many days since from now until beginning of the year so let's start with postgres so it turns out it's pretty simple um with postgres so we can just cast the string uh of today's current date to uh to a to a day type uh and then subtract it from the string of the current um sorry at the beginning of the year and we could use the if you wanted to we could go back here and use current timestamp or use maybe use the Now function and this should work right so that gets today so cast it as a date subtracting the or taking the today's date and subtracting by the beginning of the year and we get 241 which I assume is correct um so let's try the same thing now in uh in my SQL so again since we don't have the Now function we'll do it for casting right so now we get a weird number we get 729 what's that uh surprisingly actually somebody on YouTube in a comment of all places told me what it was it's the and this is weird so the first number is the today's current month subtracted by January so 8 minus one is seven then it's uh today's what the 30th so then it's the today's day subtracted by January 1st so that's 29 so you get 7 29. all right so that's wrong can't do that uh so what we can do instead is we can we can uh sorry we can get the 20 Windows sorry uh there we go all right sorry so what what we're doing here now is we're getting the the Unix we're getting the date of today and beginning of the year converting it to a Unix timestamp all right the Unix timestamp is the the number of seconds since the Unix epochs like like January 1st 1970. um so we're converting it now to the number of seconds from today uh since 1970 and then we subtract that from the number of seconds since since January 1st and we divide that by 60 seconds times 60 Minutes times 20 24 hours um and we get 241. so hey this is this is my original idea um and then turns out there's a date diff function in in my signal that you can do this but processing doesn't have it Dr DB doesn't have a see if Oracle has it I'm dual right they don't have it right all right so that's a my secret that's a MySQL thing all right so now let's try in SQL light so sigal light doesn't have date diff uh we can't do that that subtraction that we did in postgres the the best solution I could come up with is the convert the timestamp for today and begin a year to the Julian calendar which is the number of days since Julian Caesar's birthday in whatever BC uh you laugh a lot of the banks ran off that in the in the up until the 80s right um and then you get 241 but of course we're getting it as a as a as a floating Point number so we can cast it as an integer and then we get 241. all right I I'm not I forgot how to do this in record I'm not gonna do an oracle um but the main point again is like seems like it'd be a simple thing but all these timestamp stuff is is is woefully different or any questions about this before yes so why would you want a lower function um uh good question um I mean you might need it for like data cleaning uh you might want it for yeah that's a good question maintain the standard right it could be I mean it doesn't have to be also in the where Clause you can have it in the from Clause right so if I go back to my SQL right so select star from students where a name equals Tupac right student singular right so maybe I want to do this though in my output right get a lowercase like that right yeah sorry yes at least from this demo yeah so her question is uh why do people have all these weird idioms in in their SQL uh when at a high level they seem to be all sort of doing the same thing but it's these one-off things are different and that's related to his question why do why are all these why are all these different nuances for these different systems because somebody was writing and thought it was cool right and then they showed their friends like yeah that's cool right so that the double colon and postgres I agree that's cool that casting thing but they only do it right uh the Dual table at I don't know what I've whatever uh the yeah so like let me give another example so like there's a shortcut in SQL to do basically select Star right so select star from uh student gives you all the tuples right but in postgres which I think is also in the SQL standard I can just write table and get that right in uh my sequel I can do that that's cool um SQL Lite yup doesn't like it in Duck DB they do it but they also have another one they can I think you just go I think just go fetch now or is it from right you can just do that I so they all had their weird idioms I mean so some of these things were are based on customer feedback like the customer says I want you know I need functions that operate on Json right so somebody adds that and a lot of times these features get added before they show up in the standard right so like the Json XML stuff is a good example of this they that got out of the SQL standard like 2006 but a lot of relational databases at the time in the early 2000s had some support for XML and so what happens is like the standards body is it's it's done a bunch of randos it's the people at different companies so in the SQL standards body there's there's somebody from Oracle there's somebody from cybase somebody from you know IBM and they show up the standards committee and they all try to get whatever they have proprietary thing that they have they try to get that into the standard right or could probably do this more the best people more recently the Oracle got their version of property graph queries in the SQL standard right they based theirs on Cipher which which is in neo4j that's now the the pgq stuff in the SQL standard and so they got their extensions for for graph queries in the SQL standard because they were sort of only ones ahead at the time so that's how these things show up in the SQL standard and so if everybody has competing ideas for how something should be done you end up with the lowest common denominator somebody could try to support everyone but then then no one exactly supports the seat supports the standards I'm not saying it's a good thing but like it's we live in a different time also too where there's so many different database companies and there's not one there isn't one company I say that owns the market and is and can bend people according to the will right so I said before in the 1980s IBM was was the huge company right IBM was the the Computing company so whatever IBM said that was considered the the de facto standard and so that's sort of how we ended up with with SQL today but there isn't a company like that now like the closest thing would be Google put out their standard a sequel called Zeta SQL internally it's called something else but they they open source a parser uh and and and the grammar file and the spec for their version of SQL nobody uses it and Google's huge right the closest you're going to get today is postgres a lot of these database companies when you start out instead of building like the grammar file from scratch you go take the progress one hack it up and inject it in your system that's what we did uh and then duckdb took our code and they put it in Duck DB right like this there's a bunch of systems are based on postgres grammar because they because it's open source and they use it that's the closest you're going to get to your Universal standard today but again I just showed you how there's from in induct EB but that's not in um in postgres right because they've adapted it yes this question is what's the point of having a standard if no one's going to follow it uh I mean there's a speed limit everyone drives over it right like um no so so I showed you a bunch of Select statements like the the and that were slightly different from one system to the next but you understood what it was doing basically right the nuances of different systems yeah you basically read the documentation or ask chat gbt what to do but like at a high level the concepts are the same right um just you know the the the specifics each of each system is going to be different snowflake is a good outlier actually snowflake started from scratch in 2013 they didn't take postgres they said they just came up with their own grammar so there's now a snowflake SQL grammar that has things that other systems don't support um if I was if I was building a new data system scratch today I would not do what snowflake did it was a different time I would start with postgres and then expand upon the way duck DB did okay keep going because it's still a lot to cover um in the sake of time I'm going to skip output redirection um because you're not really going to need that for the the homework let's Jump Ahead to window functions all right so before we showed aggregations uh they were Computing sort of sort of one shot calculation across the entire sort of input set to or the relation that was being inputted to the um to the to you know for the aggregate function that you're operating on the from clause but there's also times where you may need to want to support what is called a sliding calculation where think of like a a rolling tally as you go from one Tuple to the next as you're scanning along you want to update some some kind of aggregate function so that for every single Tuple that you're outputting uh from your select statement the aggregate is is sort of a snapshot in time of when that that Tuple was processed it's like an aggregate function where you're not grouping them to a single output for every single you know single final output for every single Tuple it's going to have its own computation for that aggregation so the way this works you would have like a function here right this would be all your aggregate functions min max you know count uh average as we saw before as well as some additional ones and then you're going to specify what is the sort of scope or the range that you're going to compute this calculation for right it's basically sort of how to slice up the data and Source it and sort it so let's look at some examples like this all right so I can have all the aggregation functions that I had before um min max account so forth but I have these additional ones like the row number that tell me what row my tuples is is in my output as well as a rank if I'm if I'm sorting them so if I have like an order by Clause like order students by GPA I can tell you what your position is using the rank function right you couldn't you couldn't you can't do that with a aggregation function because there's things just get collapsed down right so in this case here this example here I can do select star from row number over and then the empty parentheses because I'm not partitioning it and that'll give me output like this what will tell me again for all my output tuples where where do I appear that appear in the list for that right um if you have the Over clause you can specify how you want to group group tables together or start a group tuples together we're going to continue the winning function and then you can use the Partition by like a group by of how to how to group them up sorry so for this query here we're doing select the course ID and the student ID from the enroll table and I want to get the the row number of each of each of each student record or in the role table um but then I want to partition it by course ID so I would get an output like this for every single course it would tell me the for every student ID what what position they are in in that in that group right is it sort of a cluster like this then if you have an order by Clause you can then control how the tuples will be sorted within within either a partition or within the window right so in this case here now I can order the students by uh when they roll table based on the course ID so a little more complicated example here so if we want to find the student with the second highest grade uh for each course so for this one here we're going to have a nested query which we'll discuss in a second but basically I have a select statement that has a from clause and aside that from Clause I have another query right and I can inside this this inner query I can reference uh actually this here I'm doing the lookup on the enroll table and then the outer query can just do filtering based on the output of of this nested query I'm gonna cover this queries in a second so the first I'm going to do is going to grip the tuples uh by the course ID and then sort them by the grade and then we'll get the rank would be what is their position uh in the sort of list of of grades right and then in my where calls here I can reference now the the uh the the window the window function calculation of column right so any questions about this yes can you basically can you make up the window functions using Google this question is can I can I make a window function using group eyes yes let's try it see what you're saying all right so all right so we do this in postgres right so again select star from from the roll table and then we'll get the row number uh you know where each student appears right um and then the second excuse me the second example was uh maybe the course ID student ID and then the row number we're going to partition it by the course ID so and then we're just going to order them the and the output by the course ID right so again in this case here we see that we have for each course 1545 721 and 826 right here's the students that are enroll in them and then this is their position within within each group and then my last example was like this and this is where you were asking whether you can do a group by um we're now here again so I can get the the first the inner query is going to give me the the rank position of every record and then rank is just where you are in the Sorting output um actually let me remove the where the this part here first right so here's here's the output of the of the the inner query of the select rank so for every every course I'm going to get the uh the grades and I'm going to order them by the grades and then the rank is just where their position is in the you know within the sort of the grades and the rank can have repeats so if I say and insert another record here it's insert into enrolled values so we need a student ID let's do have Tupac take um so values course ID would be 15 721 and let's say he got a let's give a name he's dead um um what about Craig that's ordered by the order by rank theme well actually that that screws out the partition let me give her that sorry all right so here what we're doing is uh every every single course again we're getting the grade and then we're sort of in the rank and so we inserted This Record here Tupac we gave an A but there was also another student who got an A in the same class and therefore they both had the same rank position of one and then for the the student that got the C their ramp position is three so rank you can have duplicates row numbers will not so yeah so you're proposing to do what a bunch of random groups where like sorry and the inner query or what sorry um the question is it possible to create the same query using Group by um you wouldn't you wouldn't be able to get the rank right because you wouldn't be able to get what is my sort position there isn't a concept of that in in in SQL right so so row number is interesting because it is let's do that row number it's all right make the point here switch to row number row number so row number is interesting because it's calling it rank but trust me it's row number is because again it's bag algebra there is no sort order in these relations that's a sort of weird concept that we think about programming like what do you mean there's not an ordering because we're used to programming in like under x86 where there's a you know uh the ordering how memory operations occur right there isn't any of that here everything can be unordered so in without a window function you can't get a row number because there's no way to say where do I exist in this position you know in my position on my output um Oracle does have row number they hide it from you you can get it but like it's it's that's just an oracle thing um so so the window functions allow you to in addition to doing the averages and all the other Aggregates it allows you to to get the order of things in a way that you would not be able to get otherwise okay all right so I should have showed nested queries before but let's just go through more more detail so a NASA query net square is a really powerful concept sometimes called sub queries where it allows you to have a query inside of a query inside of like inside of a query like you have multiple queries inside of sort of overarching calling queries and you would need this because you want to be able to express certain computations uh it would be difficult to express certain computations without these nested queries without taking the data out doing some computation and then putting it back in the database system so it allows us to put these things together to to create more complex logic than we would not be able to otherwise do and these inner queries can appear almost anywhere inside of a select state but actually really almost any query like you can have in the select output the from Clause the where Clause you can put them in update queries and delete queries right and they can now reference all the tables within your own query like it's it's a very powerful construct so the basic idea is something like this so here we're doing selected from the name table and then I want to get the uh the name of a student that is at least enrolled in in one course so you can think of this out this the select story in the top part that's called the outer query and then this inner part here we would call this inner query so NASA queries are notoriously difficult for database systems to optimize um right because you think about the stupidest way to execute this query would be for every single Tuple in my student table rerun this thing right get the list of all the student IDs then compute the in the way to really execute this is this is just a join for this one example here right this one's easy to do because uh you know you're looking for this this thing to match something here so you can do like convert that to a quality predicate things get more complicated when there's uh non-trivial relations between the inner query and the outer query we won't we'll come to that later in the semester but this is something uh this is again this is the hardest the one of the hardest part of database systems and the only system that does this does nested queries correctly is the system called Umbra which is a the academic system out of Germany um duct EB does it correctly now for two reasons one because they copied what Umbra did it's in papers it's not like they stole the ideas um and then we also sent them patches last semester at a 721 so we fixed it for them they can do some of these nested queries correctly at least with lateral joins um so Dr B is probably the best implementation of this a lot of times a bunch of heuristics uh hacks again we'll cover this later my sequel is always the worst it's gotten much better though um all right so the so here's a query like this so we won't get the name of the students that rolled in 15445 so we have the outer query that we say you know we want to get the name from the student table and then we want to have this where Clause we want to specify the logic that will get us the student ID of the set of people that are taking 445. so this is a way to sort of think about how you want to actually construct this start with the autoquarium with the overarching uh computation or the output you want to be and then you figure out what the inner part needs to be separately so in this case here we can convert this English part here into a domestic query like this but now we need to be able to reference it or do do the the check that we want in the where Clause of the outer query and would use that that in Clause that we that we had before so in this case here the now we see that the student ID in the in this where Clause here the adequate that's referencing the student ID from the the Ada query but the second student ID in the inner query that's referencing the student ID in the in the in the enroll table so the the parser in the database system is smart enough to recognize the the context of where a column is being referenced to know which table you're you're looking at in the cases where it doesn't know that if it's two things have the same name it'll throw an error and make you qualify uh the table name of where a column is coming from so there's a bunch of ways you can interact with nested queries to do instead of where Clauses so you can have things like an all command or all operator that that every Row in the nested query has to satisfy some kind of strain you can have any or something sometimes called sum it's the Alias s-o-m-e where you can say at least one row must ratchet might match my sub query the in Clause is this that I showed before it's the same thing as is equals any and then exists this means that I want to find something where I know there's what there's at least one match sorry there's just one row being returned but I don't actually care what's in it so I can rewrite the example I have before instead of using in I can use equals any and it's considered equivalent and so we can show real quickly how uh postgres picks different plans for this and you see how it's actually being executed um right so here's our query we have Rizza and Tupac taking the class um so in SQL you can put this explained keyword in front of it and of any query and what that's going to do if the system supports it it'll come back with the query plan and tell you what what operations would it execute if it actually tried to execute this thing right so when we run that we get something like this that's going to tell us basically think of this as a tree structure so these are these are the leaf nodes and then it builds up this is the final output so what this is telling us we're going to the postgres wants to do a sequential scan on the enroll table and then it's going to Hash it because it's doing a hash join up there which we'll cover what a hash 20s later on and then as does suspension scan on the student table and then now it does uh my matching the student ID uh with the enroll student student ID with this enrolled student ID so postgres was smart enough to convert this nested query into a join which is always going to be the fastest way to execute something when you have these kind of references we can try the same thing in MySQL but you get their explained output is terrible uh there's a way to get I forget the syntax you gotta put like a standard or something like that I forget how to do it in my SQL there's a way to get something a little bit better um in SQL Lite I don't think you can do this oh you can do this hmm the net Frameworks what's that what is this all right uh all right sqlite if they don't like the select statement um surprising right why doesn't that work let's see if duck DB does it think of the output.db has very pretty um they give you they give you nice little trees you guys are easily amused this impresses you like Unicode output for explain oh my gosh all right um but yeah it gives you the shows you what the physical plan is um and then we can try it in Oracle right it has the right output well we need another one there's a getting the plan out of Oracle and SQL Server is a huge pain but I'm just surprised that um sqlite doesn't support the select I'm not going to debug this live um yeah I don't know why it doesn't like that let's try in ah there we go didn't like eagles Anna you liked in all right so in sqlite if I run explain I get this so the way sqlite does which is genius is that uh it the way it executes your query plan it converts the query plan into its own DSL op codes and it has its own VM that runs the op codes thinking like the jvm you take Java code convert it into Java byte code and then the jvm executes it or interprets it that's what SQL Lite does right we'll discuss query compilation later in the semester um so you got to put I think explain plan uh planet extended there's some syntax to get the real plan but whatever trust me it's there all right so uh so yeah so they all do something something slightly different um and then if the system is smart you're trying to convert it into a join all right so skip this and take a time because we got to get through um I want to go through lateral joins and um and and ctes all right so lateral joins are a uh it's a newer concept but and not all systems are going to support it but the basic idea is that it's going to allow you to have a a a nested query reference data in a in another query that is adjacent to it so normally in if you have two nested queries one asset query can't reference what's inside the other nested query right because it doesn't know about what's inside of it but with a lateral join uh it allows you to do this you can almost think of like it's got like a for Loop where one table for every single every single uh every single Tuple in the outer it's out of for Loop you can do some you can run some query do some computation here so in this simple example here I have two uh I have two nested queries I have a select one as as X so this is turning back a single two but has one column with a value one and then my lateral joint here can now reference the the output of this first query here and just do plus one on it right so I get I get I get one and two that way right without lateral you can't do this right because this would be treated as completely two separate queries which we can do this in postgres and see real quickly right um so the select star from an intercourse like one as sorry yep yep as X as T1 right so I get back a single tubal that has has one in it right but if I try to put a nest another nested query next to it select uh say two as y as T2 right I'm getting the Cartesian product but I can't reference inside of this thing I can't go T1 dot X plus one right because it doesn't know about T1 because those two steps those queries are running separately if I add the lateral keyword now my my second nested query can can reference with whatever's in the first one right and you can chain these things together uh as many times as you want okay let's go back quickly let's see what the query plan for this one would be in theory you should convert this to a join wow all right that it did a shortcut ignore that okay because it basically it says I I know what the answer is I don't run anything and just spits out the answer that's what it did like select one plus one it knows what it knows how to compute that without running a query all right let's look at more complicated example so say I want to calculate the number of students that enrolled in each course and then I'm going to count the number of students in the world needs course and then I also want to get all the average GPA of all the students in that course um and so yes you can write this without using a ladder join I just want to show you how to do this with a lateral joint so there's two gonna be two nested queries where we have this select statement on the out data part and then for every single Tuple that's in the course table I want to then compute the number of enrolled students and then again for every single student in the course table I want to compute the the average GPA of all the enrolled students right so I could write it as this we have two again two nested queries that are that are with the lateral keyword but again the first one here I compute the uh the the count again inside of it I'm able to reference what's in the outer query here or the Json query and then for this one down here same thing I can have this one reference there now I'm not showing this example here because they're contrived but like in the second lateral query I can also reference what was in the first one because these things get changed to get chained together and again this is a different concept when you think of SQL because SQL is the unordered we're not specifying the or we don't specify the order in which the database system should should execute anything we're not really doing that we're just telling it that the order we want the computation to to be performed to compute the answer that we want so the database systems can decide do I want to rewrite this as a bunch of joins and just execute them all concurrently or decide to do it one after another which we can then test postgres real quickly and see what it does which I don't think I've copied here um yeah sorry I don't copy pasted real quickly so you can try it online later all right the last thing I want to show you is Common Table expressions and so ctes were added 10 ish 20ish years years ago um and this is a sort of similar to nested queries or similar to uh if you're writing data to like a temp table or something like that it's a way for us to take specify a query that we want to get materialized maybe I don't use that word we want to specify a query that could be stored in quotation marks at some virtual table and then we can have another query of reference whatever's inside of it right so in my really simple example here I have a this with Clause I give my CTE a name then I have my as statement and then whatever's inside this parentheses whatever select query here is going to get bound to this this this name here and then editing that that comes below after the with statement can then reference it as if it was a table right so again so sort of some some example like this so I essentially the as Clause is binding things to uh names to whatever whatever's inside my with statement here so I have again a no table query select one and two that's going to produce one two bullet has a value one column of one one column of two but then within my uh with my with statement up here I can give now names The Columns which then can be referenced Down Below in in the query you can do weird things too like you can actually in postgres will let you actually name the columns the same thing um but then when you actually try to reference it below it'll throw an error uh so you again this is an example like the syntax is roughly the same but the semantics can be different across across different systems let's see how so let's see people how we actually want to use this so for this one again we want to find the student record that has the highest ID that's enrolled at least in one course again with the short-haired examples how to do this with with nested queries that we do the joins um but now we can do it with the CTE where inside the CT first thing I'm going to do is compute the the max student ID from from the enroll table and then now in my select statement down below I can just reference my CTE to get that Max ID and then do my my join on that again the database system should be smart enough to realize that oh I only have to run this uh the CTE once materialize it and then now I can reference it as if it was a temp table in in any query below that that calls it any questions about ctes okay just to finish up all right so again hopefully the main takeaway from all this is that SQL is not dead language there's a lot of cool things you can do with it uh you want to try to do as much computation as you can within a single statement now it's gonna be it could be nested query you can do a bunch of other weird stuff inside of it uh we want to avoid the round trips going back and forth between the client and and the server because again the database system should in theory be smart enough to know what's the best way to execute the query that you're giving to giving to it as soon as you take stuff out of the database do some python code on it that's obviously outside the purview of the database system so we can't optimize it can't optimize that python code if you keep everything inside a database system it should be able to make a good effort how to optimize it further and again also the main takeaway from all this is that there is a SQL standard nobody follows it exactly every single database system is going to be slightly different people claim that oh it's great if you support SQL because then you can go and be portable like if I rent my application on on my SQL I can very easily just quote the postgres that is not the case right oftentimes whatever data system you pick at the beginning that's what you're going to be stuck with for a long time all right so last thing homework one it'll be out today it's gonna be writing SQL queries do basic data analysis this year we're gonna require you to do it on sqlite and duct DB all right reason why is because you are right you write the same query syntax will be slightly different it won't be too bad but you'll run the same query in SQL Lite then you run induct DB and you'll see which one's faster okay and then you'll have this Epiphany oh one of them is much faster than the other one everything against which ones would be faster why what's that he says more efficient that's Aquarius part of the reason Maybe it's not based on postgres talk to me does not turn that's not the answer either okay so yes in the back one last shot that's not the reason okay so you'll run these queries duct TB should be faster you'll be like okay why that's the rest of the semester okay all right so next this will be it for sequel next class we're actually start talking how do you build a system okay hit it thank you
Info
Channel: CMU Database Group
Views: 7,949
Rating: undefined out of 5
Keywords: databases, carnegie mellon university
Id: n9S4Ibh5O0E
Channel Id: undefined
Length: 75min 50sec (4550 seconds)
Published: Thu Aug 31 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.