Data Analyst Portfolio Project | SQL Data Exploration | Project 1/4

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
what is going on everybody welcome back to another video today we are starting our data analyst portfolio project series [Music] now before we jump into our first project i wanted to talk with you for just a second so we're all on the same page first thing is that they're gonna be four projects the first one is gonna be sql and we're doing a lot of data exploration and we'll be setting up a lot of our data to visualize it in tableau tableau is gonna be our second project in our third project again we're going back to sql but we're going to be doing a lot more of the etl process so a lot more of the data cleaning i did that one as the third project because i think it's going to be a little bit more advanced than this first project i tried to make it as beginner friendly as possible so even if you are a complete beginner as long as you've walked through uh you know the tutorials that i have made on my channel you should be pretty good and then the fourth and the final project will be with python we'll be using a lot of pandas doing a little bit of data cleaning and then doing visualizations as well as i said just a second ago i'm trying to make this as beginner friendly as i possibly can the whole point of this series is that if you are trying to apply for a data analyst job by the end of the series you should have an entire portfolio or at least a a really good start at a portfolio to show a potential employer i give you full permission to copy every script every query line for line if that is what you want to do and create your own portfolio i am totally fine with that but i will encourage you and i'm sure i'll say this throughout the video i encourage you to try to think of your own queries try to think of your own insights and your own things that you can do to make this portfolio project unique with that being said i'm super excited to get started on this with you guys so let's jump over to my screen and get started on our very first project all right so now that we are on my screen we are going to get started on this project we're going to download the data set we are going to format it just a little bit in excel and then we're going to get it into sql where we will start querying it i will say that i think this is going to be a very long video i'm hoping to keep it under an hour and a half i may separate this into two videos depending on how long it runs um but you know i will do my best to keep it short but we have a lot to get through i'm going to basically do no cuts i'm that's my goal is to do no cuts um in this because i want to walk you through each step of the process so that you understand everything that's going on and i i you don't get lost at some point but i think this is probably the best way to do it we'll see the very first thing we're going to do is download our data set so you know as we're looking at this there's an option right here to download the data set i don't recommend that one um you can it just won't give you all the information that i personally want which is to go back to like the very beginning if you go down right here to the very first graph you can actually push this back and then download it and what this will do is it will go back to i think january 1st of 2020. so let's open this one up and when we get in here we're going to reformat it just a little bit it's nothing too complicated i hope um i'm just going to double click here actually let me let me go up here and filter just in case you want to filter anything so what we have here is a ton of information on coping i mean just a ton and it goes back to early 2020. i believe it does go back to the first of 2020. so really quick a really brief introduction of what kind of data is in here we have total cases new cases um total deaths new deaths we use those quite a bit in the queries that are coming up um if we go way over here we have total vaccinate vaccinations people vaccinated and then over here a little bit farther we have population that's the main stuff we're going to be working with today as you can see there's so many other things in here i mean you can use this if you want to go back and do more stuff on this i highly recommend it there's such you know there's so such unique data in here about smokers and diabetes and like all this random stuff that i did not do a deep dive in i mean i could i could spend you know a month just like looking at this data set and and getting really interesting stuff from it um but i'm not gonna do that i wanted to do this faster than uh two months to complete what we're going to do is we're going to go back over here i'm going to take this population and we're going to click on this a s and we're going to click control x and that's going to cut it we're going to go back to the very beginning and we're going to place it right here and we're going to right click and say insert cut cells now why are we doing this because i've already done this entire project and if you don't do this you're going to do a join with every single query you do which if you want to do that keep it there and then just you know change your query for for that i did it like this because i wanted to show joins later on i wanted to keep it kind of simple at the beginning um and then work my way to a little bit more advanced things which you will see um it gets you know semi-advanced but not too much i promise um just stick with me let's go back over here we're gonna go to z uh actually double a and then we're gonna click control shift right key that's going to select everything over here and we're going to literally delete it okay this is going to be our first table over here so everything you see over here is our first table and we're going to save that so let's save as i'm just going to keep it in my downloads as and let's do covid deaths so that has our death information the next one is going to include our vaccination information which is what we're going to join on and then we're going to do that later so let's let's hit ctrl z that's going to bring it back now let's select on z and go all the way to e and we're going to do the same thing we're going to delete this it looks like there's no data but i promise there is later on the vaccinations um like total vaccinations if we go down you can see that that starts on in february and very end of february in 2021 that's because vaccinations are you know didn't come out until recently now let's save this file i'm going to save as instead of covet deaths we'll do covid vaccinations all right now let's save that so now we have our two excels that we want we need to get them into sql we're going to go over to sql and we're going to create a portfolio project database i've already done this all you have to do though is right click click new database type in portfolio project and then click ok and it will create your database for you if you open up the tables it should be empty and that's where we're going to put these two excel files now i had a ton of trouble actually importing these excels um i mean i tried everything and i eventually just went down a rabbit hole of how to get these in i don't know if it's me or or what but i could not figure out how to do it if you go to portfolio project you hit tasks and you hit import data that may do it for you and it may work um it did not work for me uh it just it kept giving me errors so what i would recommend you do right off the bat just to make sure that we're doing the same thing um and you can do it that way if you want i went over here to start again i'm on a windows and i went down to microsoft sql server 2019 and clicked import and export it looks the same but for whatever reason it it all the research i did it has to do with the 32-bit versus the 64-bit when you do it this way it goes to the 64-bit and it is able to import the data if you do it the other way it was doing it the 32-bit version and gives you an error i don't understand it don't ask me that's that's the reason that i mean i went down a huge rabbit hole but this one works so let's go over here and this is going to be our data source where is the data coming from it's an excel file so let's do that let's browse and let's go over to my downloads i thought i saved it in downloads maybe because it's an excel workbook what was i saving before oh that's a csv okay something important to note is we're doing an excel and not a csv you're gonna get the same error i'm just doing it live and i'm making myself look stupid so um we're gonna save it but instead of a csv we're gonna save it as an excel workbook so let's save that um now we have to go back to how it was right here um the same way and we're gonna file save as and let's do this is now covet deaths and save it as a workbook now we have them now let's go back um now we have our covet deaths and our covid vaccinations let's do our deaths first let me get back right here so it looks kind of more normal so we have our excel file we have our covid deaths let's go next and now we have to say where we're going to place it where's our destination so we're going to click over here and go down to sql server native client 11.0 i want to say this is something that i messed up and it took me like 45 minutes to figure out it was the stupidest mistake um it's going to auto populate a server name and i never checked to confirm that this was my server name and so i couldn't figure out why i wasn't able to insert this into my portfolio project database that's because mine is one i created two different servers um intentionally and for whatever reason i forgot that and so all i have to do is add 01 over here so just make sure yours is is the same thing click portfolio project click next yes we're going to copy the data it should auto populate if it doesn't if it gives you like multiple you can always check mark on the one that you think is the right one it should be the first one we'll click next we'll just click finish i'm sure it says run immediately we'll click finish and finish now while this is running um there should be around 89 000 that's how it was like a week ago when i started it maybe a little more now because there's extra days um with that being said you know there's going to be a good size amount of data um we're about to do a lot of different things we're going to start at the very basics of just like querying the table like super simple and then we're going to go into things like joins ctes temp tables creating views um the whole purpose of what we're about to do is not to it's not to keep it too simple um i want to showcase to a potential employer right that you can do more advanced advanced things so i'm going to probably do i mean i'm looking at because i have already done this entire project individually i mean we've probably got like 15 to 20 queries here you don't have to do all of them um i'm going to walk through all of them and you can choose which ones you want but you don't have to do all of this it is quite a few so just know that so there's 85 000 right here that's fantastic uh it won't show up immediately you need to refresh it and there we go so that's our covid vaccinations let's get rid of this so we just have covered vaccinations um i thought that was our covid deaths one but maybe i'm wrong but let's do the exact same thing down here and we will import say next we're going to go down to excel and browse and now we want to do the covet deaths apparently last time we did the vaccination switch i actually actually you know what i bet what it did was it took yeah it took this right here as code vaccinations but that was the deaths one as it saved so uh forget that let's go right here let's do the covet vaccinations it just has the same sheet name so sorry for the confusion destination is going to be the exact same place it's going to be sql server native client let's add that 01 and let's click refresh portfolio project next next like i said before if it does this just click the first one it's going to be code vaccinations that did that for the covet deaths that's because i made the mistake earlier i hope you i hope when you're watching this you weren't super confused um the whole point make two tables or make two excels one should be covered deaths one should be code vaccinations upload them and then rename them in a nutshell so we have the same amount let's refresh this this one is actually the code vaccinations this one is covid deaths i'm telling you this stuff is it confuses me sometimes to be honest um but we're going to query this really quick to make sure we act are actually doing um what we're supposed to be doing so let's do select everything from um and let's do portfolio project and you can do dbo or you can do dot dot i tend to just do that because it's easier let's look at this one make sure it's the right table so we have total cases new cases perfect um and let's order on let's do three comma four just to make sure order by of course um just to make sure that we have all everything that we're looking for so this looks right this looks like our excel let's copy this let's go down here we're going to do covid vaccinations and let's run this one make sure the second one came in correctly as well so perfect so we have our two tables this is fantastic news and now we can get going um [Music] we can keep this one i'm gonna comment it out in case you know we want to come back to it um i'm gonna really quick again right here i have another laptop i have already done this whole project so i'm just using it as a guideline to know kind of what i'm doing next so that i don't waste everyone's time um so really quickly let's just let's select the data that we are going to be using you don't have to use these comments i will say that i'm going to specify i'm going to say hey this comment is something i would keep in your portfolio project i'm going to add a bunch of extra stuff that is not needed just for your purpose but when you are creating your portfolio project you shouldn't be adding some of the things that i'm gonna be commenting um on so we're gonna do or actually let's do really quick let's copy this so that it kind of knows what we're doing so let's select the location the date the total cases the new cases the total deaths and then population uh now where we're at i'm going to turn off my camera because it's going to get it's going to start getting in the way to be honest i don't want it to interfere with your ability to see what we're doing on screen so it's been great seeing you guys i'm going to turn this off and we will continue from here all right that should be turned off so let's keep running so this is what we're doing let's actually let's keep this going because i i don't like things not being organized um so we have our location oh no we want to do one two we want to do it based off the location and the date makes things everything easier i promise you so we're gonna be the first one's obviously afghanistan here's our date we have our total cases our new cases total deaths and population so really quick i'm just going to scroll down just a second um they started having you know the the total deaths it's um it started about a month after they got their first case it looks like so and then it just like ramps up a lot um and we're gonna be diving into all these numbers what they mean how to you can do some really simple calculations on them um but really quickly we're just going to do again a super simple calculation and one that we do multiple times for different things so let's go right down here and let's say we're going to be looking at the total cases versus total deaths so how many cases are there in this country and then how many deaths do they have per um you know how many deaths they have for their entire cases so let's say they have a thousand people who have been diagnosed they had 10 people who died what's the percentage of people who died who had um who had it so uh let's go right down here and we're gonna i'm just gonna copy this really quick it's just gonna make our life easier i think you should do the same as well so we have location date total cases and we're going to get rid of our new cases because we don't need that one in this query right here nor doing this population so let's work on our calculation really quick it should be super super easy let me make sure i'm still recording perfect oh man we're 25 almost 25 minutes in um or more because i have the intro so now we're going to do we want to know the percentage of people who are dying who actually get infected or or or who report being infected so we're going to do total underscore death so we'll go right down here and we're going to divide that by the total cases total cases and if we do this really quick um what it's going to have and let's go down to where there's actually numbers so we have 34 we have one it's it's showing 0.029 percent if you ever try to get a percentage of something you have to multiply times a hundred um so let's do that really quick all you have to add is that what's that the asterisk sign times 100 and while we're here let's just add the um what's called alias so let's do let's call this death percentage i don't know that works for me and let's take a look at this it'll be a little bit more accurate accurate so when there were 34 there was one that gives us a 2.94 percent death rate and we can go down even further and this is still all afghanistan let's go down to the very bottom let's go down to the very very bottom so as of as of today yesterday there were 59 745 total cases in afghanistan and there were 20 2 625 deaths which is 4 so you have a 4 chance basically right now of dying i mean if you want to look at it like that four percent chance of dying if you get it and you live in afghanistan um let's we you don't have to but really quick just to look at it further let's look at where the location um i think it's let's say like real quick because i'm not 100 if it states um it should i think it's united states but yeah so i mean i live in the united states if you don't you can look at your country but um you know we this is like this is genuine real reported data so it's really interesting um right at the beginning i mean though i don't know if it was the way we were reporting or what but we had really high percentage rates as we go down we're looking at a 5 6 i mean this was the peak of it this got really bad in the u.s um maybe i hope it gets better um but how many are we at this is i'm gonna go to the end of this year we're sitting around two to three percent um yeah it goes down to under two percent so at the end of at the end of the year we were looking at over two million people that's two million no wait 20 million nine three six three wait wait wait 20 million people who have been infected um that's a lot that's a lot of 20 million people who have had it 35 000 or 352 000 deaths by the end of the year that's a lot um let's keep going um and at the very end we had over 32 million 346 971 that's a lot of people who have been infected um there's a lot of deaths 576 000 and i verified this number um i googled it google knows all i googled this number and it's pretty accurate um and it's really sad that's a lot of lot of lives um and that's 1.78 so as of right now if you're to get it today a rough estimate is around one uh and three fourths to two percent chance that you're that you could die from it um so really interesting numbers this is the kind of exploratory stuff that that you know we're going to be doing we're going to get a lot more advanced as we go on but this shows you know the likelihood i mean we can i'm going to write that shows the likely i hope i'm spelling this right i'm not spelling this right likely i hope that's right if this not i apologize uh likelihood of dying if you contract uh covid in your country um again rough estimates but you know just glancing at the data that's kind of what we're looking at now we're going to look at and let's go down here let's look at looking at the total cases versus the population again we're going to do a lot of this like percentage stuff um it it's pretty simple um that will only last for so long i promise you but it'll be really i'm going to keep it on the states just because i'm going to be looking at that one the most because it obviously is pretty relevant to me um so if you're in another country filter by your country you'll be really interested in the stats i i know i was really really really um shocked by a lot of the things that we're going to find today so we're going to keep the location we're going to [Music] we're going to keep the date keep the total cases but let's change this to population and then instead of the total cases being here we're going to put the total cases there and then change this to population so what is this going to do for us this is going to show us what percentage of population has gotten covered so shows what percentage of population oops coveted um some of these things again they're good to know um the one that i upload to github will have the notes that i recommend keeping um again not everything in here is um not everything in here is what you know you need to have in there this is mostly just you know what i think you guys need to see while we're actually typing this out all right so let's take a look at this um actually i want to change this i'm going to put this right here it just is easier for me visually just for because the total case is right here so our population in the u.s is around 331 million um so at the beginning when we had one case i mean that's like nothing let's keep scrolling um and see where we get to one percent so one percent that's three million three hundred and eleven thousand three hundred and twelve uh people and that happened in was that august august of last year so one percent of the population let's keep going all the way down again we're just kind of glancing at this we're about 10 um again we're at that 32 million so 10 of the population has gotten it gotten a test and it's been confirmed so really interesting um you know we'll come back to that one i'm sure in the future i you know we might make we might use this one as like um a visualization again i'm only looking at the states or the united states right now but you know think about it in terms of how we're going to visualize this in the future because a lot of what we're doing we're going to visualize in the future in tableau i have tableau even open right here you can see i have a map um this is just a soup i threw this together in like two seconds um we have the uh we have the location and so you know this is like our future this is what you need to be envisioning when you're looking at this data so we have you know afghanistan and let's just scroll through belarus and bolivia and bulgaria and cambodia all the every single country that that is reporting so we're just looking at the states but remember all of these are going to be used so um just something to remember um i want to know and i'm really curious as to what countries have the highest um infection rates compared to the population so we're just looking at our population up here um how are we going to do this we'll do actually let me say well let me write it out really quick so let's look looking at countries with highest infection rate compared to population so that's what this script is going to do or this query is going to do i'm going to copy this so we're going to keep the location we are not going to keep the date this is not going to be date specific this is just going to be overall and then we'll look at the max of the total cases so we only want to look at the highest so when we're looking at the u.s we had 32 million we don't want to look at every single pop of the total cases we only look at the very highest one so we'll look at the max total cases um and let's right here we'll just say give it an alias at least something to recognize it so highest i guess we can say infection count so we'll say highest infection that's the highest infection account per country um so per location um and then we want to also take because it's going to it's not going since we don't have max total cases here if we just kept total cases here it'll give us the same one that we're looking at in this above query what we need to do is we need to look at the max of this so we're going to look at max and just add a parenthesis there and we'll look at this isn't the death percentage anymore i forgot to change in this last one this is what is this this percent of population infected [Music] so let's change that for both of these because i don't want to get confused when you're looking at the column headers later so look at the percent of population infected let's run this and see what we get uh list is not contained in either the aggregate oh i need to add a group by of course so let's add group by and we need to group by both the population and the location so let's try that really quick let's see if this works awesome well we ordered on location and population but i really want to look at the highest so let's so let's just see really quick look at some of these numbers they're like one percent four percent um ten percent okay so yeah yeah what we want to do is order on um this percent population infected so let's go ahead and do that uh and let's do that descending so the descending gets the highest number first um my goodness 17 so what percentage of your population has gotten covered it's been reported and and and um we can see that now so the very first one small population so it doesn't surprise me but if you look right down here so that's that 32 million that we were talking about that's that max of total cases um which is the the highest number of our infection count so we have 33 so we're at i mean we're we're right up there on the list let's look for other large countries i mean it's us you know there's israel there's belgium portugal france so you know we're up almost to about 10 in a lot of these countries so some some of us including the united states we are we are in there as well some of us has have really high percentage rates we just did not keep it under control um and you know a large amount of the population has gotten it that's what this one shows um now let's look uh kind of at the sad side of things we were just looking at how many people were infected let's look at how many people actually died um so let's do let's comment and we'll say this is gonna this is showing the countries with the let's do highest high am i spelling that right yeah highest death count per population um now how we're gonna do this let's copy this off the bat but i don't know if we're gonna do it the exact same way because we just need location um and not much else honestly so let's get rid of all this stuff but we do need we're looking at the highest death count so like we did up here with the max total cases we're gonna do max and then we'll do total deaths i hope it's like this total deaths um and then we'll do as total oops total death count and we'll order that by the total death count see i don't need this i think i need to group by cause there's an aggregate function and let's try this really quick okay so if you're getting this there's a there's a simple slash confusing explanation to this total deaths right now let's go into our covid deaths columns okay let's show the total deaths which is right here it's an n-varchar 255 it's an issue with the data type um oh wait total deaths no no total deaths right here it's an issue with the data type um it just has to do with how the data type is read when you use this aggregate function we need to convert it um or cast it is what we're actually doing we need to cast this as an integer so that's read as a numeric um why i cannot 100 give you a perfect explanation for it but this happens all the time you just need to look at the data and realize oh it's probably because of this data type let's try something else and then it'll work so let's cast this and we're in casting i find it's just easier but just as int boom there you go so now we're taking this and varchar255 over here and then we are converting it to an integer now let's run this um and let's get rid of this just for visual visual purposes now we are much more accurate but we have a slight issue or we're now seeing a slight issue with our data in our data in the location section we have a few ones that really shouldn't be there ones like world or africa um or south america these are grouping entire continents so let's go back up to our um let's go back up here and let's do actually let's pull it up really quick because this is just part of exploring the data and figuring it out so if we scroll down we're gonna find we're gonna see one like right where is it right here this this location is all of asia whereas in other ones the continent is asia if i can pull one up real quick so like right here the continent is asia whereas before the location is asia but if you also notice um the continent is null here so what we need to do is say where continent is not no because when it is null that means that this location is actually an entire continent and we don't want that that may be helpful for us um later on but is not helpful now so now this right here will get rid of that um and just knowing that figuring that out now we can add that to every every script um and we can do you know you don't have to do this i'm just doing this for you know visual purposes i'm not gonna do that for everyone um so let's say where consonant is not null and now let's look at this and now you can see that the united states is number one and so number one is not the best thing to be number one in but we have a death count of 576 000 and again i i googled this earlier these numbers are pretty accurate there are some of them are like a day or two behind give me a second i'm gonna take a sip of water they're like a couple days behind um this number's actually higher i'm in as you know as we continue to have more people die unfortunately that number just continues to go up um so the dataset that you download maybe a lot higher um as of right now we've been breaking everything out by location right really quickly let's just do this by something we kind of saw earlier i mean i'm just going to do this for breaking it up purposes but i'm going to say i'm going to caps lock let's break things down by continent facebook continent conte net jeez is that a nice spelling i don't even know let's keep going um but now we can do consonant right here and we'll just copy and paste that let's get that back up here um and now we can see where continent is not null let's see if that makes that yeah okay so now it's breaking it out by continents um with north america south america asia europe africa oceania is this perfect no no it's not perfect um north america looks like it's only including the numbers from the united states and not canada so we have some small issues in here um but for the purposes of what we're trying to do which i don't think anyone's gonna come in here and fact check us or check the data they may and then you know you might be screwed but for the purposes of hierarchy um and you know that drill down effect in tableau which is something we are going to do we want to start including this continent in our in our queries so that we can drill down um further into these things we can also do where this way i'm going to do where is null um actually let me see so before we were doing work continent is not null well let's do location i'm just i i'm doing this on the fly i haven't done this before i just kind of am doing this um this actually is the correct numbers and i don't know why i didn't do this before when i was actually creating this project but now this is a wonderful beautiful thing i believe this is the correct numbers um i could verify but i don't want to do that live because i i might look stupid but i think this is accurate um remember before we were looking at the location and the location um and it was actually the countries itself and then there were ones where we did where is not null to get rid of all the ones that were like world and all those other things well now i'm just filtering on those instead of deleting them before we were looking at everything but these now we're only looking at these and these numbers look a lot more accurate so with that being said um i'm going to use this going forward in my script so i'm going to kind of change things up to where from what i originally had um let me see though because if that is the case it may screw up our drill down effect which is highly unfortunate i may i honestly might just revert back to it for the pure fact that we want the visualizations to look correct um just know that this is the right way and if you want to go back and do that i highly encourage that i didn't figure that out they're my first time around but i'm willing to admit when i'm wrong let me see what let me do a time check i run into like 15 minutes or so i think we're gonna we're just gonna keep going all the way through i i don't think we're to stop um i don't think we're going to stop in this project so we want to do some of the the above queries were kind of what we were going for nothing crazy difficult right nothing crazy hard um and now we want to we want to start breaking this out by um continent as well i'm gonna go back and is this correct let me look no so it's not known um so we want to start doing some of the above queries but adding that continent in there you can even go back and add that as well um if you want to that's totally fine i'm gonna do some more queries down here um or at least one one or two more and then we're gonna start getting i think into some a little bit more advanced things we're gonna start getting some temp tables stuff like that because we're going to eventually set these up in views so that we have these views to um use for tableau later and again it shows you know how to create a view so that's important so we we've we've done this first one this next one is going to let me go down one more this is showing the continents with the highest death count so almost the exact same as we did before but now we're looking at the continents we can even go up and look at uh just wait we literally just did that um so that's what this one is actually looking at my notes wrong idiot okay perfect um now you know we want to start looking at this from a viewpoint of i'm going to visualize this so how do we do that what we want to look at let's look at some global numbers you can do as many of these as you want anything up here just add continent to it um anything where it's like group i just replace it with continent and you got it so i don't want to go through and do every single one of those but that is kind of the gist of what you might want to do especially if you want that drill down effect and if you don't what that is um you know it's like clicking on north america and then when you bring up north america then it shows all the countries in north america so canada and the united states and so it's a drill down so you look on africa and then there's all the african countries that's what drilling down does and that's what you can do when you have um those layers so you have the continent then you have the location so you know i'm not going to we'll look at that when we actually get to tableau but i don't want to actually spend all the time writing that out but what we now want to do is we want to calculate everything for the across the entire world so let's do this let's say um breaking let's do global let's just say global global numbers easier easier than nothing um all right let me really quick find the i think it's probably the first one the death percentage let me see if this is one that we want okay let me see all right so let's take this one i'm sorry that took me a while to find again i'm not cutting any of this stuff out you just got to stick with me you if you're sticking with me this long i know you care i know you're not you're not cutting away because i'm trying to figure things out on my side so um let me get rid of this so this is the exact same scroll well let's say where just so we can get the right numbers um so we are now going to look at the global numbers uh so we're not going to we're not going to include any location any continent or anything like that but we do want to make sure that we're only looking at all of the countries and we're not looking at the world numbers plus all the countries because then the numbers would get astronomical so instead of no now we can't do so let's try running this really quick so now we really can't do this um because now it's breaking everything out by um by you know that uh which is the dates it's breaking everything out by the date because uh these total cases the numbers are different right so really quick let's group by date and now let's see what it looks like uh it's going to give us an error obviously that's because we're looking at um that's because when we're looking at this we're looking at multiple things and we can't group by just the dates obviously if we wanted to group by something which we need to do we then need to start using aggregate functions on everything else um so really quickly let's do some aggregate functions i'm looking at my notes for just a second um to see what i did basically what we want to do and i think what will make things easier is i mean i could try to do the sum of max total cases i don't think that's possible um let me comment this out really quick yeah um it's because there's an aggregate function within an aggregate function and we can't really do that um if we go back to the data and you we kind of looked at this earlier there's one called new cases let's use this because instead of doing max we can just sum it or do a sum on it and that's going to give us the sum of all the new cases which adds up to the total cases so if we do this let's see this will give us on each day the total across the world because we're not filtering by any continent or or we're filting out um like the world and in the actual continents we're not filtering by location or continent or anything it's just by date so we're looking at the sum of the new cases so now let's do uh let's do the sum of new underscore deaths and we can run that one operand data type and varchar is invalid for the sum operator so going back and this is something i encountered a lot when i was doing this is these new cases is a float which is why it's working in the sum but the new death is an end varchar so what we need to do again is cast that as an integer it's just the easiest thing to do and now that one should work so um let's get rid of the well let's get rid of down to here so we're about to do another one and that's going to be our death percentage globally across um across the i guess the world so we need to do the sum of i think it's we need to do new deaths all right divided by the sum of new cases all right times 100. uh let's see where this takes us um okay of course we're getting the same thing let me um let me put this right here and see if this works um invalid data oh that's because this was new cases the new death one is right here and let's run this and now we are looking good and as you can see the death percentage is right here we have 91. um and let me give these i don't we can't let me go back real quick and just say as total cases as total deaths and let's run that again okay and so across the world these are our numbers so we have total cases on that very first day that cases were starting to be reported there were 98 total cases there was one total death that gives us a death percentage of one percent across the country or across the world i mean as we scroll down it gets lower and lower and that's because we have a lot of people who have gotten infected are the total cases again that's per day right so if we remove this all together that date altogether which we can do right now this will uh this will give us the total cases which is oh gosh let me read this through 150 million um versus three million one hundred eighty thousand two hundred six so overall across the world we are looking at a um a death percentage of a little over two percent so interesting numbers you can keep both of those queries separate if you'd like you know they might come in handy later but let's do this so we have um in one second check out my notes again because i just want to make sure i'm not doing something stupid all right all right so again we have a whole nother table that we haven't used yet uh it's this covid vaccinations um and just to you know refresh your memory let's do um let's look at the table from portfolio project dot dot gov vaccinations let's jog our memory on what we got here so we have um we have these tests we have vaccinations over here which is what we're actually going to be using um excuse me that's what we are going to be using so let's join these two tables together uh and let's let's actually just do from actually let's just do this whole thing from let's do covet deaths and here's how we're going to join it so we're going to say join and we're going to say oops wait that is wrong join and we're going to say on so what are we going to join them on um we're going to join them on two things we're going to join them on location because that's much more specific than the continent we're going to join them on location and we're going to join them on date let's call this one dea let's call this one vaccination so a little alias for these so that we don't have to type out this entire table name each time so let's do dea dot location is equal to vac dot location and d a dot and we'll say date is equal to v a c dot data and let's just see what we get really quick so we'll have all of these things and let's look at granada oh seven one seven and let's go all the way over here and it should have granada0717 so just making sure that they were joined correctly for this query what we're going to do is look at the total population and let's do that right here so looking at total population versus vaccination so how many peop what is the total amount of people in the world that have been vaccinated that is that is what we're going to do in this query right here so let's do d a dot continent [Music] location date and again these are going to be the same in either one but we have to specify let me just for example if we do population population oh actually that's a terrible example because population's only in one let me go back real quick let me say i only write date that's going to give me an error because there's date in both of them in fact we joined it on them so we know there's date in both of them so it's going to give us an error we just have to specify what table we want to pull it from so i'm going to do dea and d8.population just to keep it consistent and now we're going to add the next one dea dot and let's do new vaccinations and really quick let's just look at this um and let me get my orders because i want it to be organized i i actually one let's do one two three i don't like it when it's not organized it bothers me so we're looking at oh no i also need to add or continent is not null there we go uh d a perfect now let's run this this should look much better there we go all right we are in fact if we want to look at afghanistan like we have normally been doing um in previous ones we do two slash three so there's our population here's our new vaccinations now let's see we're going to go back go down and let's see they have vaccinations starting on 2 18. if we go even further down let's just go to who's this canada oh yeah can it'll be a good one to look at they started doing vaccinations on right here so 12 15 i mean they started very early and their numbers only increased and now they're you know doing this is per day right so this is 280 000 in one day so that's you know really high numbers but this is the number of new vaccinations um there is a column called total vaccinations in this table but we're going to do something pretty just to display again this whole portfolio project is to show potential employers that you know how to do certain things so i want to set up opportunities to do that we're not going to use the total vaccinations we're going to use this new vaccinations which is new vaccinations per day so we want to we want to know or do kind of like a rolling count out here so as this number let me go back to the beginning as this number increases 718 2300 4179 we want it to add up over here it's a pretty cool thing i mean you know it's once you see it you'll be like oh that's pretty easy but you know we're going to be using partition by we're going to be using um this is a windows function so it's really good to to showcase i think so we're gonna do um and let's do [Music] um we need to do the sum because we're going to be adding these together so we need to do the sum of new vaccinations oops you do the sum of new vaccinations let's do over and we're going to say partition oh gosh partition by and we need to partition by the location first and foremost because we're breaking it up by if we do it by consonant the numbers are going to be completely off we need to do it by location location and also partly the date but you'll see that in just a second but we need to partition it by breaking it up by um location and why is that because every time it gets to a new location we want the count to start over we we don't want this aggregate function to just keep running and running running it'll ruin all of our numbers we only want the this part to partition on the the location so that it runs only through canada and then when it gets to the next country it doesn't keep going um and if we only did that by the way let's look at what this looks like uh okay real quick i need to cast this as an integer like we've been doing in the past you can also do um real quick i want to show you another one convert and i think it's comma integer um or is it integer comma let me try integer comma i think it's that way actually and you can do it this way as well that is up to you um you know either one is totally fine if you want to use both that's even better because then it kind of shows that you can do both but they basically do the exact same thing so let's go down and let's see what what's happening here so it goes down to albania and since we're partitioning on albania albania their total amount of vaccinations is 347 000 i know that going into it because it has it on every single stinking row but down here they started to add they started to add up right but we didn't do that we only partitioned on location so it added it did the sum of all the new vaccinations by that location so what we need to do is go over here and say order by and we need to order it by both the location oops d a dot location and the date that is very important uh the date is what's going to separate it out um and you'll see in just a second what i mean so now let's run this and let's go back down to albania i think it was so here's albania let's go to our first one so here's what we have we have 60 and it gives us 60. then we add 78 so we add 60 plus 78 equals 138 then 78 plus 178 i'm sorry 60 plus 78 plus 42 equals 180 then 60 plus 78 plus 142.61 241 so you get the point it adds up every single uh consecutive one and when there's nulls or there's zeros it's going to uh not add anything it's just gonna keep it uh going and then you can see as it's a rolling count so we're gonna name this let's do as um let's do as um rolling people vaccinated let's call that um i think that's good now what we want to do is actually look at the total population versus the vaccinations and really what we want to do is use this rolling people vaccine we want to use the max number because at the very bottom is our max number this is how many people in albania we want to use that number and then divide it by the population to know how many people in that country are vaccinated so what we want to do is we'll do this we'll do rolling people vaccinated divided by population times 100 and as you can see we're getting an error you can't use a column that you just created to then use the next one so what we need to do is we need to create either a cte or a temp table um this is at this is the time of of the show of this tutorial whatever you want to call it where i'm going to give you some options you can do one you could do both you know there's no preference to me um but we're going to take this and we're going to at least for this first one we're going to use a cte so we're going to say excuse me we're going to say with and let's call it um pop versus back i don't know population versus vaccination and then all we need to do is specify the basically the columns that we're going to input so let's put as and then let's insert that down here because what we need to do is we want to say we're going to do continent oh gosh i'm so bad at spelling content and location date population um and then we'll have this rolling people vaccinated that should be it and let's see if there's we just need to close this parenthesis so this is our cte should be working um actually that's not true i need an open parenthesis here that's why it's giving me that error um let's see it's i'm still getting an error so let me see if i'm doing something wrong [Music] i have this in parentheses there and there i say with bought back says continent location date population ah i believe that is the issue so then we need we just need to add that last column new vaccinations if the number of columns in the cte is different than the number of columns here it's going to give you an error so you've got to make sure and then let's just say for real for right now select everything from and we'll do and we can even say pop versus back it'll come up right away so really quickly let's run this and see what happens the order by clause can't be in there i knew that but whoops let's comment that out let's get that all the way up here let's run this so now that query that we were looking at before is now in here but now we can actually use it to perform further calculations so we'll just do everything comma and then we'll do rolling people vaccinated divided by and that needs to be population times 100 i'm pretty sure this is incorrect give me a second um invalid object oh that's cause i have to run it with the cte my bad um so let's look at this percentage really quick um it's not wrong and it's actually going to give us a rolling number and this may actually be what we want [Music] so basically what it's doing is taking this column and doing it versus this column and so this number should only increase because as this number increases this number will increase because the population stays stagnant um again i'm kind of looking at this as we go so right now 12 of the population in albania is vaccinated so that you know that is that's all we know i don't think we need to go any further than that i think if you want to you can look at the max one um but you'll have to get rid of date and just keep the location um population et cetera because the date is gonna throw everything off so if that's something you wanna do absolutely do that um you can use a temp table here we can look at how to do that really quickly i think so that you guys know how to do that again i recommend throwing in one or two of these um like even up here you can do different um different counts and then do one for each um so let's do temp table all right so it's going to be a lot of the same stuff we're going to keep this and this is going to be what we insert so let's say insert into and we need to write where we're inserting it into but let's say again i'm only doing this for it's going to be basically the same it's going to have the same effect but um with a temp table so we're going to do temp table and let's look at um let's say let's call percent population vaccinated and we need to specify our columns so let's go down here excuse me let's go down here and let's do the basically the exact same thing so continent i think i spelled that right no i didn't spell that right i almost did i got really confident we'll do we and and just so you know for these we have to specify the data type as well um because we're basically creating like a genuine table it's just a temporary one so let's do invarchar 255 we'll do location we'll do the same thing in varchar 255 we need to do date and we'll do that as date time we'll do population and we can do i mean there's lots of different ones we can do but we'll do numeric for this example there's new underscore vaccinations and let's do that one as numeric again you can use different things and then we'll do rolling people vaccinated and this can be numeric as well and then we need to insert that into here okay so we're inserting the data and then down here we can actually select it and let's let's take this and do right here except we're going to be doing this by this right here but it hasn't been created yet but it will be created in just a second okay so you let me see it yeah so these were the rows that were affected um and we in our then we got our actual output from this right here now let's say you wanted to change something in here you're like oh you know i don't want to do it where this let me comment that out and then let me do this and um create that table again oh no we we got an error how can we get around this very simple i've done this and i should do this in a different one you can do drop table if exists and then do this right here and when we run this it should give us our output i highly recommend just adding this especially if you plan on making any alterations so that when you um run it multiple times you don't have to you know go and then delete the view or delete the temp table or drop temp table or you know it's just built in it's at the top it's easy to maintain and it looks good it's it's something that a lot of people do and so if you have that at the top of your query and somebody you know somebody who wants to hire you looks at this thing like oh okay that makes sense i'm glad they included that they know what they're doing this guy's smart i should hire them um now what we're going to do is i feel like i've showed you as much as i can show you um with the limited data that we've looked at again i could have done this for six hours straight if i had used all the data at least i mean there's just so much data but let's create a view you know i'm only going to show you how to create one view but i want you to go back and create multiple views you know if this is one that you want to look at these global numbers um let's look at this one really quick if you want to look at this number right here toss it in a view i mean that one doesn't make sense to toss in a view but this one toss these numbers in a view and we're going to look at it in tableau later but for right now let's just create our view so like let's just say creating view to store data for later visualizations all right so let's say create view um and i want i'm just going to keep the same thing um like that um and for views it's so easy i mean i'm literally just going to and i can even take um the order by i believe we'll see if i'm correct um actually let's get rid of both of these things so it says create view percent uh percent populate oops percent population vaccinated um and let's see am i doing anything wrong here let me see the order by clause l i was completely wrong i was wondering why i was getting that now let's try running it okay so it ran successfully let's look at our views it's not going to be in there let's refresh it hey look we got our very first view we can open that up like a table if we want to um i mean it's gorgeous um if you want to get rid of that select or sorry ctrl shift r that's a refresh um and now it it basically recognizes it but let's go back here for a second um and you know we can now query off of that it's a view now so you know it's it's something that you can it's permanent you know you have to go in and actually delete it's not like a temp table this is now permanent and this could be something that we now use for a visualization later so do some of these look at some of the queries that we've looked at and create a few of these views and we will use them later um normally in a normal setting uh if i was actually working i would put some of these in actual like i would call them like a work view or a work table or something set aside so that i can use them consistently but i would also set them aside so that i could connect tableau to that view now we're going to be using something called tableau public that will be in the very next tutorial unfortunately um let me see if i can show you i can't show you tableau public does not connect to sql databases um and that's because it's free and i totally get it you have to pay for the upgraded version but i am not a billionaire okay i cannot afford uh the real version of tableau i'm also not like a student or like something where i can get it cheap so i'm not paying for that so we're gonna use tableau public and and i recommend this anyways because anybody can access it it's free for anybody so we're going to be using tableau in the next one to actually visualize a lot of these things i want to get at least five visualizations we're going to create a dashboard it's going to be a beautiful beautiful thing all right so the very last thing that we are going to do is we are going to actually save this and then put it into github and i just want to show you how to do that that's where we're going to be storing our code at least for now um so let's go up here let's click file let's click save as i've already have multiple versions of this let's just push v2 we're going to save that so we have this saved now i'm going to go over here i'm going to go to my github now if you don't have an account i highly recommend getting an account so you can start putting your portfolio projects in here of course we're not going to put our tableau one in here but our sql ones and our python ones you can put in here again i'll talk a lot more about how we actually want to display this in github or other places but what we're going to do for this is we're going to create a new repository let's call this one portfolio projects make it public we'll create the repository we'll do all that extra stuff later so what we now want to do is upload an existing file we'll click right there go to choose files and we'll click this latest one that we saved and we'll open it and we can always change the name of it later on and you can add notes if you'd like but we'll commit that change so we'll actually upload this uh this file um but let's look at it really quick and yeah i'm gonna go back and i'm gonna use the real one where it has the formatting and and the notes that i have that i wanted to add in there but as you can see you know you can see all of the queries that we wrote and this is fantastic so if somebody comes in here you know we'll have more notes and kind of better comments on what they do and what the takeaway is this from for a hiring manager to you know when they actually look at this so this is a really really good place to start again this may not be your optimal place to put this i'll give you a few different options in a later video about how we can actually potentially improve upon this i'm really looking forward to getting more portfolio projects done so we can actually start building a complete portfolio if you've stuck around all this way i just want to say congratulations i mean i know this was a long video i know that it took a long time but you stuck with me you put in the hard work and that is fantastic and i really hope that it pays off and i hope that this has been helpful thank you for watching we'll have a lot more videos in the future on these portfolio projects and i'm i'm just really really looking forward to doing them to be honest so thank you for sticking with me thank you for watching i really appreciate it if you like this video be sure to like and subscribe below and i will see you in the next [Music] video [Music] you
Info
Channel: Alex The Analyst
Views: 168,429
Rating: 4.9765286 out of 5
Keywords: Alex The Analyst, data analyst project, data analyst portfolio, data analyst portfolio project, data analyst projects for beginners, data analyst projects for resume, data analyst sql project, data analyst github, data analyst tableau project, Portfolio project for data analyst, data exploration project, projects for data analysts, project for data analyst, data analyst projects, sql portfolio project, sql project, data analyst sql, sql project for data analyst
Id: qfyynHBFOsM
Channel Id: undefined
Length: 77min 9sec (4629 seconds)
Published: Tue May 04 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.