Practice Writing SQL Queries using Real Dataset(Practice Complex SQL Queries)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hey guys i'm tawfiq in this video let's learn how to download a realistic data set from the internet and then write some sql queries on this data set now the best way to learn sql is by writing sql queries but to write sql queries we need data the more realistic the data the more interesting and easier it's going to be to write sql queries so in this video we will be downloading a realistic data set from the kaggle website now kaggle is basically a website where you would find thousands of different data sets and you can download these data sets for free kaggle also has a lot of different projects you can also post your own projects and you can do a lot more on kaggle we'll be talking more about kaggle maybe in one of our future videos now the data set that we will be downloading in this video is going to be the olympic history data set this data set basically has all the details of all the olympic games that have been held from the very first olympic games till the 2016 olympic games once we download the data set from kaggle we load this data into a postgresql database once the data is available in our postgresql database we'll write some sql queries to extract some information from this data set now i have come up with a list of 20 different sql queries on this data set now some of these sql queries are pretty simple some are intermediate and some are pretty complex so i will not be able to solve all the 20 sql queries in this video because then this video is going to be too long but rather what i have done is i have posted a blog where i have mentioned all these 20 sql queries so i'll leave a link to my blog in the description below so you can go to my blog you will also find a link to download the data set once you have the data set in the video i'll be showing you how you can load the data set into your database once you have the data set in your database you can then refer to the queries that i mentioned in my blog and you can try to write these queries on your own now for each of the sql queries that i mentioned in my blog you will find a problem statement which basically explains what is the query that you are expected to write you also find a screenshot to the expected output so you can refer to this screenshot to understand what is the data that is expected from the query that you are about to write for each of the questions and then you'll also find a download link where you can download each of the sql scripts that i have written for each of these questions now what i would recommend is you go to my blog try to solve each of these questions or each of these sql queries on your own if you're not able to solve it only then download my script and refer to my script so in this video i will be solving three to five different sql queries i have chosen some sql queries which i feel are slightly complex so we'll try to solve them in this video i'll try to explain you what is the thinking behind solving a particular sql query with this hopefully you should get some idea about the data set and you'll be able to solve the remaining sql queries on your own if you're not able to solve it then you can definitely refer to my script in the blog okay so before i can start writing sql queries if you like my work and if you would like to support this channel then please make sure to subscribe to the channel and give me a thumbs up thank you and let's begin okay so the very first thing that we will need is the data set now to download the data set just go to kaggle.com and here you would see something like data sets just click on that and here you can just search for any data set so we will be searching for the olympics data set i'm just going to type olympics and click search and you can see that we have around 132 data sets available now among these i am going to be using this one that is the 120 years of olympic history this is by the user rg refin so we'll just click on that and here you can see a quick glance of the data that is available so this is it and then it's mentioning here all the different files that are available in this data set so currently we have only two files here so we have the athlete underscore events dot csv and then we have the noc underscore regions dot csv so both these files are csv files which is again good for us so i'll just go ahead and download this i'll just click on the download button here and then in order to download the data set you will need to have created an account with kaggle so if you have not created an account then you can just click on the create one here and you can create an account i already have a kaggle account so i'll just sign into that so i'll choose my google account to log into kaggle and then just click on the download button here and then it will ask you to save the file so i'll just click save and i think the download is completed so i'll just go to my download folder here and you can see it will download an unzipped file so i'll just unzip that and i'll just open this so this is the two files that we have just downloaded okay so we have basically completed the first step that is to download the data set from kaggle the next thing that we need to do is to load this data into our database now we have two different files here so how we can load this data into a database is by creating two different tables so i'll create a table to store the data from the athlete underscore events file and then i'll create another table to store the data from this noc underscore regions file in order to create the table first thing that we need to do is we need to open this file let's open this in excel and just see how basically the file structure is so we will need to create the same number of columns that are present in this file in our database table now i have opened the athlete underscore events file you can see that i have around 15 columns here i have id which is basically a unique identifier for each athlete then i have the athlete name their sex their age height weight team noc which basically stands for national olympic committee i think and this basically is the country code then we have the game which olympic game it is and then we have the year in which this olympic game was held we have the season the city in which developing gape was held we have sport event and middle okay so all these columns and then we have another file that is noc underscore regions so if i just open that in excel you can see that this file only has three columns so we have noc which basically is the country code then we have the region which basically is the country name okay and then we have nodes but i think we'll just ignore this column for now so what we are going to do is we are going to create two different tables which will basically have the same structure as this file so that we can load all of the data from this file into that table okay so i'll just go to my postgresql database and if you want you can create a new database but i already have a demo database i'm just going to be using this and i'll just click on query tool and here i'll be creating my table so i already have the script created for this so i'm just going to copy that and i'll just paste it here and okay so before i can execute this create table script i'll just show you the columns so the column name that i have used here is exactly the same that i have in this file and the data type also i have tried to keep it the same now id i can see that this is all like integer number so i'm just going to use the data type as in and wherever i have some characters i'll use the data type as varchar now there are columns like age height and weight which basically should be having numeric values but then there are some values in this data set where the values are n a so if i try to create these columns as an integer then it's going to throw me an error because n a is not an integer and if i try to insert n a into this integer column it's going to throw me an error so what i'm what i have done is i'll just make all of this age height and weight also as varchar and whenever in my query i want to convert this into an integer i can just do that right there so for now i'll just keep it work here so basically that's it it's just simple nothing too complex i'm not going to be creating any primary key or foreign key i'll just my intention is to just create a table with the same number of columns as this file and then load all of this data into this table let me go ahead and create this table okay so i think the table is now created let me just query and see if the table is created so i can see that yeah olympic history is created and then i have olympic history noc region so what i have done is olympic history is the table that will hold the data from this athlete events file and then olympic history underscore noc regions is the table which is going to hold the data from my another file that is noc underscore regions okay so we have created the tables the next thing is to load the data into this table so we need to load the data from those files which we have downloaded so we can pretty easily do that by using the pg admin tool so if you just go to your database and schema i'm just using the default schema that is public and if you go to tables you should be able to see all the tables i'll just refresh so these are the two tables that i just created olympic history is the first one so just click right click on this table name here and then you will see an option of import and export so i'm using postgresql and the tool that i'm using is pg admin but let's say if you're using mysql or oracle or some other database then whatever tool you're using i'm sure that you will have this option of import and expert so you just right click on the table and then do a import and export and you should have the option of importing data from some external file so i'll just click on that and then here i'll just choose import and then once i choose import next i need to need to choose the file that i want to import i need to go into our chief and here i have this athlete events file so just double click on that so i selected the file and then this file format is csv so i'll keep it as it is and then i'll not bother to change the encoding i'll keep it as it is and then i'll say it has header so if i go back to my file here and if i go to this file you can see that the very first line here is basically the column name for all of this data now this basically is the header and i don't want that header to get inserted into my table so i'll say that this file as header what this will do is it will it will basically ignore the very first line from your file okay so the very first line will not get considered and that's what i'm doing and then i have this delimiter so for a csv file the delimiter by default is comma so it's already selected i don't need to specifically select it here and then i have something like code and escape i'll i'll ignore it for now i'll just keep it the default value and then i have another tab called as column so if you want to insert only a specific column you can just remove any columns that you want okay but i'll just want to insert all the columns from that file into this table so i'll just go ahead and click ok and this will give me an error now okay now this is one thing to remember whenever you download a data set from the internet it's there is no guarantee that the data set is always going to be 100 correct there might be some data issues so whenever you find some data issues it's up to you to fix that error okay so currently in this data set that i just downloaded from the from the internet from kaggle website actually i found one record which is having some issue so basically i'll just show you that record so i'll just copy this this is basically the id so if i go back to this file and let me just search for that particular athlete and let me just zoom this so what's basically happening is for this name here you have a single quote and then you have some double quote so this data pg admin is not able to properly import it it's throwing some error so i can fix it manually by just removing the single quote the additional single quote that is present here i can remove that but i don't want to modify this file so i'll just leave this file as it is but rather what i'll do is i'll go back to my pg admin and here i'll try to import it again so i'll just right click i click import and here i'll say import i'll choose the file that is athlete events i'll just say it has header and then in the escape character i'll just change it to double quotes by default it was single quotes and then i'll just say ok and hopefully this should work and now you can see that it's telling me successfully completed so basically all the data has come what happened is in this data in this particular record there is an additional single code in between this value and since our escape character was single code it was showing some error so i just changed escape character to double quote and it worked fine okay so you will need to manually fix any data issues that you might face so i think the data is loaded now if i just check you can see that i have the data here and then if i just count the number of records you can see that i have two seven two lakh seventy one thousand one hundred sixteen records and if i just go back and check in this file how many data i have here okay so you can see that i have two seven one one six records okay so all the data has been loaded into this particular table now let's try to load data into our noc regions table as well and again let me go back here go to the table that is noc regions i'll right click i'll click import and i'll say import here and then i'll choose the file name which is noc regions and it's a csv file i'll say it has headers and i'll not bother to change any other default settings i'll just say okay and here there is no data issue so it's working fine all the data has been inserted so if i just run you can see that i have all the data that i want okay so basically we have now loaded the data into our database now we are all set to write sql queries on this data set okay so as i told you i have come up with 20 different sql queries in my blog and if you go to my blog i leave a link to my blog in the description below you can see all the 20 different queries now all the 20 different queries if i try to solve it this video is going to be a few hours long so i don't want to do that so i'll just solve few queries which i feel uh will help you to solve remaining all the remaining queries so we'll try to solve for different queries i'll start with the query number six okay this one so if i just go to query number six okay that is this one you can see that we need to write a query to identify the sport which was played in all summer olympic games and you can also see the screenshot to the expected output you can see that i have cycling swimming fencing gymnastic and analytics these are the five sports these are the only five sports which were played in all the summer olympic games okay now how do we write a query to find this kind of data so the first thing that we need to do is we need to find how many summer olympic games were played as per this data set okay once we get this number then we need to find each sport how many olympic summer games have they been part of okay once we get these two numbers we'll then try to compare these two numbers so let me just write it down here so the very first thing is we find total number of summer olympic games this is our first step and the second step is find for each sport how many games where they played in and the third step would be you get a count from this step then you will get a count for each spot and then we need to compare these two counts so i'll just compare one and two and with this we'll be able to get the final result that we are looking for okay so let's try to write that query now since this query is going to have multiple steps we will need to write multiple sub queries now instead of using a single query and then having sub queries within them i will just use a width clause okay which will hopefully become more neater and cleaner to understand so i'm just going to say with t1 i'll just give an alias as t1 for my temp table so and here i'll just write my query so the very first thing that i need to do is from this olympic history because all my data is in olympic history so i'll just copy it here and i know that i'm only interested in summer olympics so my very first filter condition will be to basically filter the data so that we only fetch summer olympics now if i just look at the data here i have this id i have the athlete name and then i have something like season here now this is a season where it tells whether the olympic games were summer olympics or winter olympics so i'll just use this column to filter the data just to fetch data for summer olympics so i'll just say where season equal to summer from this what i need to do is i need to find all the different games that were played so i'll just say games and now if i just run this query so you can see that i'm getting the data for all the different games that were played but then you can see that data is repeated and i just need to find the total games that were played in total i don't need this duplicate data so i'll just do a distinct of this and now if i just run this it's now going to return me i think you can see it's returning me 29 records and if i just do an order by so order by let's say games and if i just run this you can see that the very first olympic games were the 1896 summer olympic games then we had a 1900 summer olympic games and then finally we have this 2016 summer olympic games so this data set has data only till 2016 the latest olympic data is not available in this data set okay so this is fine so we have got all the games that were played till 2016. so i'll just remove this order by in order to get the count i can just use the count function that is count of distinct of games and i'm going to name this like or given alias like let's say total summer games okay so this query here should hopefully return me 29 because there were totally 29 olympic summer games played as per this data set so we have basically got the first part of our query the next part is to find for each sport how many games where they played in so let's try to write that query now so i'll just move this ahead and i'll say t2 as so if i just run this query here just to see the data i can see that i have all of this data here now the data that i'm interested in is i just need to know for each sport how many games were they played in okay so the only two columns that i am interested in is so let me just copy these two first because i know that my data is going to be from the same table that is olympic history and i know that i'm only interested in the summer olympic games so i'll just put season equal to somewhere and the columns that i am interested in is only sport and games so if i just run this query you can see that i'm getting sport and games but now you would have multiple entries for each sport and within each game the reason for that is this table here is having information about each athlete now there are sports where you would have multiple athletes participating in a sport for example basketball you would have multiple athletes and then this basketball is also played by multiple countries so you would have multiple records in this table for a particular sport corresponding to a particular olympic games so in order to get the distinct record for each sport represented in each game i can just use the distinct clause here and then let's say just to see the data i'm just going to do say order by games okay and then if i just run this now you can see that and if i just extend this you can see for 1896 summer olympic games there are totally nine records meaning that these were the nine sports that were played in 1896 summer olympic games so we have athletics fencing cycling tennis gymnastics wrestling weightlifting swimming and shooting so these were the nine games or nine sports that were played during 1896 summer olympic games okay now what our intention is we need to find for each sport how many in how many games or how many olympic games they were part of so how we can do that is i can just group this data based on each spot okay but we will need to do that group by or grouping of this data based on the result of this query okay so either i can move this into a sub query and do that but a better way would be just to use another temp table so i'm just going to say p3 as and here i'll just write another query so in this case i'll just say select star from e2 because i'll be taking the data from t2 and then here i'll just say i need sport that is my column and then i need to count the total number of games they have been part of okay and then i'm going to give an alias to this like let's say number of games okay and then i do a group by here so i say group by support and finally in my main query here i'll just say select star from d3 and let me just minimize this and if i just okay so let me remove this i don't need all of this now and if i just run this you can see that i'm getting the data so what's happening is for each sport it's telling me how many games or how many olympic games they were part of so lacrosse was part of two olympic games i'm not really sure what lacrosse game is or sport is but then we have wrestling wrestling has been part of 28 summer olympic games judo has been part of 13 summer olympic games so we have got this data but the data that we are interested in is if i go back to my blog it's telling me identify the sport which was played in all summer olympic games now how do we get that now we know that there were totally 29 summer olympic games and we have got that from our t1 table here that is 29 so what we need to do is we just match this total summer olympic games result with the total number of games that we have got here so from t3 we get for each sport how many games they have been part of so if there are there is any sport that has been part of 29 games then that is the sport that basically was present in all olympic games so i'm just going to join t3 table with t1 and i'm going to join it based on let's say p1 dot this column is total summer games equal to d3 dot this is number of games number of games here so if i just run this now and now this is what the expected output is so from this query it's very clear that swimming cycling fencing gymnastics and x are the sports which were played in 29 olympic summer games and in total there are only 29 summer olympic games so basically these are the only five sports which were played in all the summer olympic games okay so this basically what our output is i hope this query was clear now let's move on to our next query and the next query that we will be solving is the query number 11 so if i just go into my blog and search for query number 11 and that is this one you need to write a query to fetch the top 5 athletes who have won the most gold medals so in all of the olympics we need to find who are the athletes who have won the most gold medals so if i go back to my postgresql here i know i need to fetch the data from this table itself so i'll just do select star from this table here and if i run the data here this is fine so i know that i need to find top 5 athletes who have won the most gold medals so we are only interested in athletes who have won the gold medals so my very first filter condition is going to be where middle equal to gold so i'll just say where medal equal to gold so this is my filter condition i'll run this and i've got the data for only those athletes who have one gold medal so this is fine now from this i need to group this data based on the athlete name so i need to find each athlete has one how many gold medals so i can just do that by saying name comma count of one and i'm going to name this like let's say total medals and here i need to group this data based on group by name if i run this so i'm getting the data for each athlete but let's also do an order by so let's say order by count of 1 in descending order and if i run this now we can see that i get michael phelps who has won 23 raymond clarence who has 110 and so on and so forth so these are all the athletes who have won gold medals and have grouped their data together now i have data for all the athletes here but i only want to fetch the top five athletes so how do i modify this query in order to fight just the top five athletes i cannot just fetch the first five record from this table because these are not the top five athletes because you can see that all athletes here have just won nine gold medals these are basically number three and then we have some athletes we have one eight gold medals and then there are some athletes who have won seven gold medals so we cannot just fetch the first five records rather what we can do is we can rank the records here based on the values present in the total medals column now in order to rank a record in sql we can just use a window function called rank rank or dense rank so let's see which function we should be using in order to do that first thing what i'll do is i'll use a width clause so i'll say with e1 as and i'll move this whole query inside my t1 and i'll just add a bracket here and then in order to get the rank for each of this record i'll just create another temp table p2 as and i'll say this like let's say select star from t1 but here i need to add a rank so i need to assign a rank to each record in this basically result set here that i have got from t1 so i'll just use the function rank for now and let's see what happens i'll just say rank and then i need to use the over clause i don't need to do any partition here because i want the whole data to be treated as a single partition but i of course need to do an order by so i'll say order by total medals in descending order and i'm going to name this column as let's say rnk okay and let's do at the end select star from d2 okay and if i run this now you can see that i'm getting the rank and if i just maximize this here you can see that michael phelps get rank one then raymond clarence with 10 gold medals get ranked two then all these who are having nine gold medals gets rank three and then the next athletes who are having gold medals eight have are getting the rank as seven now this is because what rank will do is if there are repeated values then it's going to skip the next rank okay so we do not want to skip the rank because we want one two three four five rank and so that we can filter the data based on that rank so instead of using rank i'm just going to use dense rank now dense rank is similar to rank it's just not going to skip any rank so if i run this query now now you can see that i'm getting the rank in sequence so i have rank 1 2 3 and then i have rank four and then i have rank five okay so this is basically what i wanted so i have got this and now the last step would be to just filter the data based on the rank that is one to five so i'm just going to say where rnk is less than or equal to 5 and if i run this i think i have got the expected data if i go back to my blog here you can see this is almost what i wanted and this basically what i have got okay so this basically how the query looked like i hope this was simple it was not so complicated you just needed to add a logic to identify the top five records using a rank window function okay so the two queries that we have solved now are not that complex but the next two queries that we will be solving is going to be slightly more complex so let's see what is the next query that we are going to be solving so let me go back to my blog and here the query that we will be solving next is going to be the query number 14. what the query number 14 tells is list down total gold silver and bronze medal one by each country so corresponding to each country we need to identify how many total gold how many total silver and how many total bronze medals they have won in all of the olympic games as per this data set so we can see that usa has won 2 638 gold medals 1 641 silver medals 1 358 bronze medals and we have similar information for all the countries okay so we need to write a query to get this data now if i go back to my query here and let's say if i just run this query and i just do select star from olympic history and if i just run this you can see that i have this data here and if i go to the last column that is the column where i have middle now from here let's just filter this data because i'm only interested in where uh basically the nations have one gold silver or bronze now this na basically means that that ethic did not win any gold medal so i'll just eliminate this na record so i'll just say where middle not equal to n a so we only get the records where the athletes had one gold bronze or silver now you can see that in the middle column i have gold bronze and silver medals but these are all at row level but if you look at the output that i am expecting from this query i want this gold silver and bronze at column level so basically looks like we need to convert the data from row level data into a column level data and not just converting when we are converting we also need to aggregate the data that is we need to count the total number of gold silver and bronze medals one by each country so when you have this kind of requirement where you have to transform the data from row level to column level and then do some aggregation on that then the best option is to use a pivot table now if you're using mysql oracle or some other rdbms then you would generally have a pivot function to be used but since i'm using postgresql we do not have something like pivot table but we have something called as crosstab which basically does the exact same thing that a pivot table would do so we will be using crosstab to transform the data from row to column but before transforming the data from row to column level let's try to segregate or basically filter out this data into exactly how we want it okay so let me just do select star from olympic history this is my table where middle not equal to n a if i go back here i want to find for each country how many medals they have one right so what i can just do is i just need to group the data based on country and then count the total number of medals right so what i can just do is i'll just say country okay but i do not have a column called as country in this table now i have something like teams but it's not exactly a country the country would be something that you see in noc this is basically the country code now in order to get the country name i need to join this with another table so what i'm just going to do is i'll give an alias to this like o h and then i'm going to join the another table that we had created that is olympic history underscore noc underscore regions and i'm going to give an alias to this like nr and i'm going to join it let's say on nr dot n or c equal to o h dot n o c and then here i have this that is n r dot region which is basically my country so i'll call this like country so this table here noc regions has a column region which is basically the full country name so that is what i have fetched and i have given the alias to that as country next i want the medals and that's it and then i want to count the total number of middle so i'll just say count of one and i'll i'm going to give an alias to this like let's say total medals okay and here since i'm using count i need to do a group by so i'll say group by both this column so i'll just copy this that is nr region and then middle so if i run this now now you can see that for every country it's basically telling how many bronze medal it has one gold medal it has one and silver medal it has one and the same thing goes for every country so algeria has one eight bronze medal it has one five gold medal and four silver medal and this is in the entire data set so so from the very first olympic game that is from 1896 olympic game until the 2016 olympic games okay so we have got this information but if i go back to my expected output you can see that i need just one record for a country okay here for a country you would have maximum of three record because the medals there are three types of medals bronze gold and silver so for each medal there would be one record now we need to transform this three record into a single record so when we are transforming we need to make sure that bronze gold and silver comes in its own column and the count that you see here will be corresponding to each of these metals now this is where we will need to use the crosstab function but before that i also need to do an order by here so i'll just do an order by so we don't have some weird data in between so i'll just do an order by and see okay the data looks fine so now we need to use the crosstab function to transform this row level data into a column level data now crosstab function is part of the postgresql extension called table fun so in order to use the crosstab function we need to first enable the extension table fun so in order to enable the extension table func we just need to run a command like create extension table func and i'll just run this and that's all now i'll leave all of this script in my blog so you can download it from there for now i'll just remove this so i have enabled the table func extension and now i can use the crosstab function and in order to use my crosstab function what i can just do is i can just say select star from crosstab and here i need to pass in the query as a parameter so i'll just pass this whole query as a parameter so i'll just move this to the right and here i'll make this single quote into two single quotes so it will be accepted here and then i'll remove this semicolon i'll move this one here okay so basically what i have done is the same query that we have written so if i just run that query here you can see that i am getting all of this data here so i have just passed the whole query inside this crosstab function so this will be my first parameter and the rule is that we need to pass this query as a string so that is why i have passed it inside the single quotes to make it a string and the second rule is that this query can only have three columns so it should have the first column which basically is going to be the main column basically the column on which the whole data will be grouped together and the second column is generally used only for ordering i'm going to explain this further in a short while and the third column is go basically the aggregated value that will be returned for each distinct value from this middle column you'll understand this further when we'll complete this query now whenever you are using the crosstab function you need to pass in this query as an input parameter and lastly you also need to specify the result so basically whatever is going to be returned from this crosstab you need to specify all the different column names and its data type so how we can do that is just by saying as and then providing a name i'll just say result and here you need to specify all the columns that you want this to return so we know that we want this query to return the country and then the three columns one column each for our middle okay so i'll just say country and the data type has to exactly match with what will be returned from this inner query okay and we already have the inner query data here so country is character varying which basically stands for where care so i'll just say where cap and then i have bronze big end and then i'll say gold big end and then i'll say silver big end okay so the output from this crosstab function is going to have four columns that is country bronze gold and silver now the reason why i have mentioned bronze gold and silver in this order is if you look at the data here that will be returned from this inner query here you can see i am sorting the data based on region and middle so region will be basically my country and the middle will be bronze will come first then gold will come second and silver will come last because depending on the ascending order b g and yes okay and whatever data will be returned i want to store the data in this order okay and that is why i have mentioned the column in this order so this is fine i think let's just run this and see what happens okay so i'm getting the data and looks fine so i'm getting all of the data here so what happened is now for each country so for algeria instead of having three records what crosstab has done is it has grouped the data together for the algeria country and it has passed the count value for bronze corresponding to this branch column for gold corresponding to this gold column and then the silver will have the count value corresponding to the silver column okay now there is one slight problem in this when crosstab will assign values to these columns it's actually not going to compare if the value needs to go to bronze gold or silver now let me explain that further so let's go back and run this query here so i'll run this query and here i'll just search for those countries and let's say okay botswana so botswana you can see that it has never won a gold or a bronze medal it has only one one silver medal so let me just copy this data here similarly let's go back here and search for another country let's say burundi now this burundi has won one gold medal and one silver medal but it has never won a bronze medal okay so let me just paste that here okay so this i'm just copying this because i want to show you an issue in this crosstab function and how we can fix that so we have this data here this is actually the original data the correct data but now i run this crosstab function okay i'll just run this for these two countries so let's say where nr dot region in so i'll say burundi and then i'll say botswana okay and if i just run this very only for these two countries now you can see that it's telling botswana has one one bronze but if i just look back here butsan has not won one bronze it has just one one silver and then burundi is telling that i'm getting bronze one and gold one but actually the correct data is gold one and then silver one so looks like whatever is written from this cross tab is something is wrong what's happening is the data that is returned is going into some different columns and that is what i was saying this cluster function will not will not basically compare the column name with the values that are written so basically we can avoid this by specifying the values that will be returned from this query so how we can do that is just by passing a second argument and that is the values argument so i can just say let's say like okay but i need to pass it inside the single quote so i'll say values and here i'll open a bracket and then i will pass in all the distinct values so my first value is bronze and then i have another value which will basically be i think gold and then i have the last value which will basically be silver okay so these are the three values that will be returned from this query under this middle column and that is what i have specified here so we don't have to specify basically the main column that is the grouping column the country we just need to specify the values that will be returned from our middle column basically the distinct values based on which the aggregation was done okay so now if i just try to run this okay now you can see that i'm getting botswana is getting one silver and if i go back here okay this is correct and then i am getting burundi where i'm seeing gold is one and silver is one gold and silver one so this is fine so now i think the crosstab function is going to work fine so now i'll just remove this filtering i don't need that and i'll remove this data as well so this is one important thing that you need to remember when you're using crosstab function so i have this data and if i run this now so i'm getting the data for everything now we have almost there just couple of more things that we need to do the first thing is i'm having this null value so wherever a country has not won any middle it's returning null i don't want it to return null i want it to return zero so i'm just going to say country and i'm going to use a function called as corliss okay so i'll use this function call this and here i'll pass in my column so i'll say gold and if gold is null then i want it to return as 0 okay and i'm going to alias this column as gold and i'll do the same for silver and bronze as well so i'll just say silver and i'll copy it here and paste it here and then i'll do bronze and i'll do the same thing here as well and finally i also want to order the data okay so let's say i will order the data order by okay but let me not show you order by for now let's just run this query and see what it happens okay i'm getting some error i added an additional comma here so let me run this now so okay so i'm getting the data instead of null i'm getting 0 now which looks much better now i just need to sort the data based on who has won the highest medals gold silver and bronze so i'll just order the data so i'll say order by gold in descending order and then silver in descending order and then finally bronze in descending order and now if i just run this query okay so now i'm getting the data so you can see that i'm getting the first data for usa because usa is a continuous one the highest gold silver and bronze medals then russia germany uk and so on and so forth so basically this is how we have written our query we have just used the crosstab function or if you're using mysql or oracle you can use the pivot table but in first grade we do not have pivot so we are just using crosstab so i hope this was clear okay so the last sql query that we will be solving in this video is going to be the query number 16 so if i just go here it's telling me that identify which country won the most gold most silver and most bronze medal in each olympic games so you can see and sample output here so i have this games column so for 1896 summer i have max gold was won by germany 25 gold medals max silver was won by greece which got 18 silver medals and then greece also won the highest bronze that is 20 bronze medals and we need to do this for each olympic games so from 1896 till the 2016 olympic games okay so let's see how we can write this query so we will be using the same query that we have just written for the query number 14 but we will do some modification so this query was basically segregating the data based on each country for each country it was finding out total number of gold silver and bronze medal one now for the query number 16 we need to segregate the data not just based on country but mainly based on olympic games as well so let's read the question again identify which country won the most gold most silver most bronze now the process of finding the most gold most silver and most bronze let's try to do it at the later stage the first stage is to identify gold silver and bronze medal one by each country in each olympic so we need to segregate the data not just based on country but also based on olympic games so let's first try to do that okay so here i'm using the nr region the country and then i'm grouping by nr region but what i want to do is i also want to segregate this or basically group this together based on games now i told you the inner query of this crosstab function cannot have more than three columns so what we are going to do is we will still be passing three columns but this this first column here instead of just passing the country we will concatenate the games along with the country okay so i'm just going to use a function called as concat which will just concatenate multiple values together and then i'll pass the first parameter here as games and the second parameter here as let's say space dash space and then the third parameter as the region which will basically be my country okay so i've just passed this whole value as a column called as country so okay instead of keeping this column name as country i'll just make it like game underscore country or games underscore country and then i just need to use these games in my group by so your group by i'll say games and then nr region and middle and the same thing i'll do for order buy so basically the only change that i have done is instead of just passing country as my first column here i'm passing the concatenation of games and country as my first column so i have just changed the column name to games country and here also i'll just pass the column name as games country and then here as well i'll just use games content now let's just try to run this you can see that i'm getting the output but let me just change this order by to just say let's say games country and if i now run this okay so now it's printing the data in the order of each olympic games corresponding to each country so in 1896 summer olympic games for the country australia it won two gold medals zero silver and one bronze medal and the same thing for each country and also if i go to the next summer olympic games that is 1900 australia won three gold medals and three bonus medals and the same thing continues okay so this is fine okay so the concatenation of games and region that is country i had done inside the crosstab function because crosstalk function can only accept a query which has three columns but when i come to my main output i can see that i want my first column to be just games and then the country should be part of the other three columns so again from this main query here i again want to separate out the games that is olympic games from the country so in order to separate out what i'll do is i'm just going to use the function substring and position so let's see how we can do that now i'm going to be using substring and position in positive sql but if you're using oracle or some other rdbms you would have your own function so you can just search for corresponding functions for substring and position i know in oracle the you have instr and you can use sub str to do this and but for other rdbms you can try to find it out if you are not able to find it out then leave a comment below i'll try to find it out myself so let's try to use this so i'll just say substring of this column that is games country and let me just copy this value here just for reference i'll just keep it here okay so from this value what we first need to do is i'm just going to say start from the first character and then go until certain character that is go until you find this value so go until you find this value so i'll say inside position i'll pass in this that is my first parameter in this column okay so if i just show you okay so if i just have to quickly show you i'll just say select position of this in let's say i'll just copy this whole thing okay so if i just run this what happens it's going to return position as 12. so what it's going to do is it's going to look for this character in this string and it's telling it's finding it in the 12th 12th position that is this is the 12th position so in order to find this one so this basically will be until 11 so one two three four five six seven eight nine and eleven so this first 11 characters are basically my game that is olympic game name and then you have the other characters so this position of the space dash and space is in position 12 so i get 12 from here and then i need to do a subtraction of 1 so what happens is i'll do a substring of so if i just show you substring here i'll just say select substring of this whole thing so i'll just copy this whole string here one and then let's say 11 okay so if i just do this now i will get basically my olympic game name so i'm just going to follow the same logic here so i'll do substring of games country 1 until that position minus 1 and this will be basically my games and then i'll just copy this whole thing and i'll paste it here and now instead of starting from the first position because i want to find the this one that is a country name so i'll tell from games country first search for the position of this particular character that is space dash space you will find it so i think you will find it in the 12th position right but the country would be like this much is 11 and then you have three more that is 11 plus 3 14 and then from 15 is where you need to look to find the country so what i'm going to do is you find this position this would return 12 for the first example here and then i'll do a plus three okay and this should written hopefully 15 so from the game's country if i search for let's say if i just pass in 15 this should return me the country that is australia okay and that is the logic that i'm applying here so i'll just make this as country so that's all now if i just run this query you can see that what i have done is from four columns i have just split that first column again back into two columns so i get name game separately and country separately okay so we have done the first part of our query where we have segregated the data based on each olympic game corresponding to each country and the total number of medals they have won okay but our expectation is that we need to find one record for each games and then see who won the highest gold highest silver and highest bronze and the country which won that okay so so let's see how we can do that so in order to do that i am just going to move all of this query into a width clause i'm just going to say with temp as and then i'll just move this right and here i'll just add a parenthesis here and then in my main query so let's do initially select star from m and just run this so the output will still be the same now how we are going to do this second part of this query is what we need to do is for each olympic games so let's just consider the first olympic game that is the 1996 summer olympic game so what i'm going to do is i'll create a separate partition for each game so 1896 will be one partition 1900 summer will be second partition and so on and so forth and in each partition i will see who has the highest goal now how we can easily do that is we need to do partition of data plus we need to do sorting of data at the same time and for that we can just use a window function now i can use the max function as well as a window function or i can use the first value function of the window function so let's try to use the first value so my first column here needs to be games i know that and my second column here so let's try to achieve this only for the gold column so let's say i use the function first value and let's say i want to fetch the column that is gold and i will say over and here i need to do a partition by so i'll say partition by games because for each game i need to find who has won the highest gold highest silver and highest bronze so i'm going to partition by games and then i'm going to order by gold so this particular column i'm going to write it for gold but gold in descending order okay so what happens is for each game it's going to create a partition so for 1896 there will be one partition and in that one partition it's going to order the data by gold in descending order so the the record where the gold was highest will come at the at the first record and since i'm using first value and first value i'm fetching the column gold it will fetch the value of gold okay i can also use max here but i'll use first value and you'll come to know that uh later why i'm using first value okay so i'll use this and then i'm going to name this whole column as gold so i'll just run this and see what the output is so if i just see the output here you can see that i'm getting the output the highest gold was 25 and it's for the 1896 summer olympic games but i'm getting multiple records here and the reason for that is since i'm using a window function window function will return the output corresponding to each record so in order to i do not want all of this multiple record for each olympic games i just want one record so what i'll do is i'll just use a distinct function or clause here and now if i run it and let's do an order by so we can see the data in more clearly so i'll say order by games and now if i run this now you can see for 1896 i am getting gold as 25 and then 1900 summer the highest gold one was 59 and 1904 summer the highest gold one was 128. this is fine but if i go back to my data you can see that i am not just displaying the highest gold one but also the country that won it okay so in order to extract the country corresponding to this value i can again use the first value function so i'll just use the same thing i'll just copy this here and here instead of fetching the gold value i'll just fetch the country value and i'm going to name this like let's say country okay and now if i run this you can now see that i have fetched gold corresponding to 1896 summer basically the record with the highest gold and then the country which has won that so germany has won 25 gold in the olympic game 1896 summer and that is what i have got here now the data that i want is i want the country then a dash and then the total number of gold one so i want all of this into a single column so instead of having all of this in separate columns i'll just concatenate all together into a single column so i can just do that by using the function concat so i'm just going to say concat and then i'll just copy this whole thing okay first i need to pass the country so i'll pass this whole country here as my first argument for my concat and then my second argument would just be space dash space and my third third argument would be the count of total gold okay that is a maximum gold and that's it and this whole thing i'm just going to name it like the column name as gold i'll remove these two entries here and now if i just run this now you can see that i'm getting the output how i wanted so i have this column games which basically has an entry for each olympic game so totally there are 51 olympic games this includes summer and winter games now for each game it's telling me gold highest gold one in 1896 was 25 it was won by the country germany so this is fine so we have got this for gold similarly let's try to do it for silver and for bronze so what i just need to do is i'll just minimize this and here instead of gold i'll just say silver i'll copy this and here i'll just pass in silver silver and silver and then finally i need bronze so i'll just copy bronze and i'll just pass in bronze bronze and bronze and now if i run this this basically is the output that i was looking for so corresponding to each olympic games i have got the country which has won the highest gold the country which has won the highest silver and the country which has won the highest priest along with the total number of medals they won okay so this basically the output that i wanted from this query and that is what we have got so the query we have written is the query might look a little big but i hope you have understood what i have explained and i hope this video was helpful okay so in this video i have only explained four queries and this itself took so much of time but let's say if you are not clear on any of the other queries that is mentioned in my blog then leave a comment below and tell me which is the queries that you are not clear and i will try to make a separate video covering any of the other queries i hope you found this video useful and i hope you gained some knowledge from this and this helps you to learn sql and write sql queries even better if you like this video then please make sure to subscribe to the channel and give me a thumbs up thank you and see you soon in the next one bye
Info
Channel: techTFQ
Views: 14,464
Rating: undefined out of 5
Keywords: SQL, SQL Tutorial, SQL Queries, Practice SQL Queries, Practice Writing SQL Queries, Writing SQL Queries, Practice Complex SQL Queries, Complex SQL Queries, Practice Writing Complex SQL Queries, SQL Queries using real data, SQL Queries using real dataset, SQL Queries using Kaggle Dataset, SQL Query using Olympics Dataset, Using Kaggle dataset to write SQL Queries, SQL queries on Kaggle dataset, Practice Writing SQL Queries using Real Dataset, learn sql, learn sql queries
Id: XruOBp7yPXU
Channel Id: undefined
Length: 54min 41sec (3281 seconds)
Published: Thu Oct 28 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.