Data Analyst Portfolio Project | Data Cleaning in SQL | Project 3/4

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
what's going on everybody welcome back to another video today we will be heading back in a sequel for our third portfolio project [Music] now i am extremely excited for this project in particular for a few reasons one we're getting back into sql and i really like sql and two we are finally focusing on data cleaning and i have talked so much about why data cleaning is important and that you really need to learn how to clean data and that that's a big part of what a data analyst does but i haven't actually showed you how to do it yet and so that is what this whole project is going to be and then at the end you'll get to add it to your portfolio so it's really a win-win now before we start i just want to say that i think it's going to be a little bit more advanced than our very first video in sequel where we walk through data exploration if you see something that you have never seen before i will do my best to explain it while we're walking through it but if you get confused or it seems a little complicated please pause it google it do a little bit of research and then come back and i think that will be very helpful with that being said let's jump over my screen and we'll get started on the project so we're gonna start over here on github and this is where i've actually put the data set that we are going to be using so i will put this link in the description uh we're gonna go right over here to the nashville housing data for it data cleaning all you have to do is click download and it's going to download it and you can open it up if you want to we're not going to do anything to this data at all but really quick i'm just going to show you what it does look like and we'll of course look at this in sql in just a little bit but we have a unique id parcel id we have this address a sales date the price of the home so this is housing data if you didn't pick up on that already who actually owns the home the owner address and then some information about land value bedrooms bathrooms things like that again not super important because we're going to be doing all of this in sql so let's actually get this data into sql we're going to import it the exact same way that we did in the very first video so we're going to come right over here and go all the way down to microsoft sql server 2019 import and export we'll click next our data source is like last time a microsoft excel and let's take a look and we'll take that first one this is the most recent one i've downloaded but i just wanted to make sure so i downloaded a few times um for the destination we're going to click sql server native client 11.0 and this is my client or my server right here and i'm going to go down here and i want to put it in this portfolio project so you know just configure this to what your server is again if you haven't done this before you've never set up sql server or a server to go on sql server i will leave a link hopefully right here also in the description like i did for the first project so you know be sure to go through that video so that you know how to download this and have everything we're going to copy the data we're going to take sheet1 we could have renamed sheet1 to something else but we didn't and then we're going to finish this and finish and it should run successfully hopefully it's looking good perfect so we have 56 477 so let's head over to sql all right let's go to our database portfolio project and here is our sheet one now i'm going to rename this let's rename it was it nashville let's just do nashville housing that's what i'm going to rename it as um at least so when i post these queries um to the github and you see them this is what they will be so if you want to have them the exact same or be able to copy and paste them you know you should you should do that as well so let's take a look really quick let's select the top 1000 but there's about 56 000 rows there's a lot of data in here um and a lot of things so uh i'm about to open up a save thing and we'll walk through the exact things that we're going to be working on in just a little bit but um yeah this is what the data looks like in here there's lots of columns lots of data so really excited about this um let me pull this open really fast it's going to be this project walkthrough here are the things and i'm going to show you this really quickly here are the things that we're going to be walking through so we're going to standardize the date format we're going to populate the property address data that's referring to this right here if you notice there's the address and there's also the city that it's in so we want to be able to separate that out and that is actually right over here we're going to be doing the same thing same thing to the owner address except that has an address a city and the state which makes it a little bit more complicated and so um that one should be really really cool to show you um oh whoops i i messed up that's what this one is breaking out in individual columns that's what we're gonna do for that this is populating the property address um you know if you notice and we'll go into this a little bit there's actually some values in the property address that are blank but i'm gonna show you how you can actually populate that um which you know has a it's just a cool trick that i've used a few times and it it does work i mean i think you'll find that one interesting um in the sold as vacant field we're going to be doing some um some case statements if then um then we're gonna be removing duplicates and then deleting unused columns so we have a lot to get through this could be potentially the longest video and i'm okay with that um because i'm i love sql down here and and i will say that when i s when i in the very first video i said is gonna be etl video um and i fully intended on doing that but i ran into not issues on my side but issues and the fact that the vast majority of people who are going to be watching this are not going to be able to do what i did to configure my server but i left it in here anyways when i think etl is an automated process in order to extract the data from somewhere we're going to transform it and then put it somewhere this was going to be the extraction method and i was going to put it in a start procedure so that you could you know run the run the stored procedure run the job import the data is going to be really cool but i know that if i was having trouble with it me trying to explain it to you and you being able to figure it out on your side was gonna be very tough i left this anyways because i was able to get to work on my computer um but it is tough and it took a lot of research um and i did this for a previous server like a year or two ago and i remember being crazy hard but i was able to figure it out on my computer so if you want to try it out um try it out and and look into the stuff so i'm going to leave this here this is just for if you want to try it it's a little more advanced um and so you don't have to just import it and this will be a data cleaning project instead of an etl project but data cleaning is what 90 of it was going to be anyways anyways let's go back up to the very top really quickly i have a whole nother laptop right here as i did in the first video i didn't show it to you last time but um i have all of my queries written out over here i'm going to try to do this as quickly as possible we have a lot to get through now before we start writing our queries i am going to turn off my camera so i do not get in the way all right you should still be hearing my voice but let's get started let's just start with select everything and we'll do from and it is portfolio project dot dbo dot nashville housing so let's just get this pulled up on screen awesome so this is exactly what we were looking at before and the very first thing that we're going to be looking at is this sale date now i wrote standardized sale date but i'm really just going to change the sale date um so let's copy this really quick and let's look at just sale date and it has this time on the end and it serves absolutely no purpose and i it just annoys me i want to take that off and so right now it's a it's a date time format but we're going to convert and we're going to do date and we're going to take sale date sale date and we're going to go like that and let's run this really quick and this is what we want it to look like all right so let's say update and we have portfolio project specified up here so we can just say nashville housing and we're going to set sail date equal to and we're going to copy this now i will say before we do this i had some issues in my when i was initially doing it whether or not it made the update i was i'm not sure why or why not it was doing it um so yeah it's not doing it right now i you try it out on yours it may or may not be working i'm not exactly sure why that is because i would say like eighty percent of the time it's doing it twenty percent it's not i don't know why um no logical explanation of that but uh when i most time when i did it they would then be the same column something we can do i just thought of we can do alter alter i can't even say that word alter table and we can say um i think it's new or it's add add give me one second yeah so add and we'll just do sale date converted and let's make that a date format and just like this and then we can say like this and set sail date converted let's try this and see what happens so i'm going to add this column and then i'm going to update this and it says it's affected let's see what happened so let's write sale date convert sale date converted let's see what happened let's see if it actually worked and it worked okay so we we now have a column um and maybe at the end we'll remove that sale date column so that we just have that sale date converted but we know what that is you don't have to name it that you can name it sale date two or something like that um cool well let's go down to the property address and let's get a just a really quick look at it let's copy this up here i hate rewriting this stuff so i'm always copying and pasting but we're going to be working with the property address there we go so let's take a look at this really quick um so let's look at sorry i was looking at my notes we need to look at where the property address is null so what you'll see really quick when we run this is that there are null values um why there are null values yeah i really don't know um i really am not sure but let's look at everything where this is um where it's known so we have this property address we have a sale date a price legal reference there's this parcel id and there's this unique id so we have a lot of information and when you have something like this something like a an address an address is you know the address isn't going to change the address is the address the owner the owner's address might change but the property itself the address the 99.9 of the time is not going to change so you can say with almost certainty that you know this property address could be populated if we had a reference point to base that off of so really quickly let's look at just everything and let's look at and we'll just order by let's do property not property address let's do parcel id and let's take a look at this so we have to do a little bit of some research on this um but i'm going to show you something really quick let's see if i can find an example [Music] in not too long okay so here's an example here's the same id so 015 and that's the exact same address and we'll find this a lot of times and i look through the data and it's it is pretty much accurate um when it does have it it it is the exact same address so this parcel id is going to be the same as the property address so something that we can do is basically say if this parcel id has an address and this parcel id does not have an address let's populate it with this address that's already populated because we know these are going to be the same that is basically what we are about to do and it's not super complicated but let's get started writing it let's copy that down there one thing we're going to have to do with this is do a self join so we have to join the table to itself to look at if this is equal to this then this needs to be equal to this that kind of thing um so real quick let's just write that join part out and we'll go from there i don't know why i sounded canadian right there we'll go from there uh so we'll join on this and we'll say on a dot oh wait let's let's label them i'm gonna do this in the really lazy way i'm just gonna do a and b a dot parcel ids equal to b dot parcel id and um let's see really quick so we need to find a way to distinguish these the sale date could be the same one thing this unique id is it is unique so we need these to be different so let's use this and let's say let's say and a dot unique id is not equal to b.unique id so all we have done here is we've joined these the same exact table to itself and we said where the parcel id is the same but it's not the same row right because this is a unique id a unique will never or that means these will never repeat themselves so we'll never get the same one so if this is equal to this but these are different we want to then populate um populate the other one so let's do a dot parcel id and we'll say a dot property address b dot parcel id comma b dot property address and let's take a look at this really quick and let's do let me see if this works where a dot property address is null and let's see see what comes up here okay so this is perfect this is exactly what i wanted to see so we have this parcel id we have this parcel id and here is our address and it's blank in all 35 of these so we have an address for all of these but we're not populating it so what we want to do is we want to say use this thing called is null so is null is basically saying it's the first thing is what do we want to check to see if it's null so we want to check a dot property address this whole thing now if it is null what do we want to populate um we want to put in there this b b.property address because we want to take that property address and stick it in there so um let's run this really quick so this row is what is eventually going to be stuck into this row so this is perfect um it's literally saying when it's null take take this and put it there and so that's what this this part of is doing so let's go in here and write our update so we want to update and let's take this whole thing from here up and we'll this will be the set oops um so we're going to set um property okay we need to specify and just so you know when you're doing joins in an update statement you're not going to say nashville housing okay that's going to give you an error you need to use it by its alias so let's put a so now we're going to say property address is going to be equal to and now we're just going to copy this is null and put it right here and we only want to update let's see if it does take this so i think this should be correct let's let's test it out really quick and we're going to run this above query and see if it made that update all right so there you go as you can see there are now none that have nolan there otherwise it'd be giving us an output right now so that one is fixed we can go back and check it if you want to please go back and and double check that um but that is what we did and it worked perfectly so that's what that is null does it checks to see if this is null if it is null it it can populate it with a value you can also do like a string and what we i mean you can write you know no address if you wanted to do something like that we don't want to do that we're going to keep it how it is let's keep moving on we do not have unlimited time here trying to keep this i'm gonna try to keep this on one under two hours stretching the rules because for my love of sequel that is the only reason um and this i think is gonna take a little longer so let's take a look and let's copy this real quick and let's take a look at what are we doing the property address the property address um and we can get rid of this as well so if you notice we have two things here we have both the address and then there's this comma after all of them and there is the city now you know you don't know that or you maybe you haven't looked into this but i have and there are no other commas anywhere except for in between these things as a separator as a delimiter a delimiter is literally if you don't know if you've never heard that term delimiter a delimiter is something that separates different columns or different values so for us the delimiter is a comma and for this first one because we're going to be separating this one out and then we're going to be doing the owner address for this one we're going to be using something called a substring and we're also going to be using something called a character index or a char index so let's start writing that out and let's do select and let's say substring now the substring that we want to take we of course want to be looking at oops let me um put this down here so it helps us out a little bit and i'll get like that so substring and of course when we looking at property address and we want to look at position one so we're gonna start at position one now this next part is something that you may have never seen before um and if that if you haven't that's totally okay we're going to be the character index is going to be searching for the um it's going to basically be searching for a specific value okay that's all it's doing and you can you can look into this a little bit more if you want um so it's going to be char index that's how it's spelled and then like an open parenthesis and we want to specify what we're looking for so it can be anything you can even do you know if you wanted to things like um tom or you can do value well you do it um like this you can look for tom or if you're looking for a specific word like john you can search that that's what this is for but we're going to do a comma where are we looking that's what this next one is so we're looking in property address and then we're going to close the parentheses and we'd also close it again to complete off that substring and we'll say as address and let's just take a look really quick at this so right now it's taking the it is basically going it's looking at property address it's going to the very first value or starting at the first value and then it's going until the comma now the unfortunate thing is is we're actually getting this comma in this output and we don't want that you don't want a comma at the end of every address we can change that so we can say because this is specifying a position if we just look at this chart index which we can do really quick it is going to give us a a number it is saying at position 19 that is where the comma is right so it's not like it's taking it's not a value or it's not a um it's not a string it's a it's a number so we can say minus one and if we do that and now we run it now that comma is gone because we're looking back we're going to the comma and then going back one from one behind the comma so that's how you get rid of that comma right there the next one's a little bit more tricky because we're not starting well it's not super tricky but we're not starting at that first position anymore so let's put a comma and we have our substring now where we want to start is at this as we at where the comma is so instead of position one we want it to be where that character index um i don't want it to look like this this whole time is it like this what am i doing uh it doesn't matter let's just get rid of this and see if that fixes it what am i doing here oh it's just because this is wrong um and we'll just do comma parentheses that might fix it that doesn't matter okay i'm wasting time i'm going to keep going we want to start in this in this position okay but we actually don't want to start at -1 we need to start at plus 1 because we want to go to the actual comma itself then once we get to the comma we want to add one so we didn't if we just left it the same again it would include the comma at the beginning then we need to specify where it needs to go to where does it need to finish now every single thing is going to be different every single address has a different length but we can use that to our advantage in this one and we can literally say the length of property address you guessed it right and then we can close this off let's see if that works okay what's messing up so we have property substring property address comma character index and then we have specifying it in the comma um we have the property address plus one okay we can't have that right there i don't know why i had that finally figured it out at the end um so let's see what we're doing here let's see if it worked it works perfect um and again this was one that i'm guessing a lot of people haven't used before so i was trying to explain it a little bit more than other ones but if we take that out if you take out that plus one you're gonna see the comma at the beginning right here so that's what that is so plus one and that's what we're gonna keep now we can't separate two values into from one column without creating two other columns so just like we added this um table up here we're just gonna i mean we're i'm just gonna copy this down here really quick we're gonna create two new columns and add that value in so we're gonna we're gonna uh add that we're gonna call this let's call it because it's property address let's do property property split and this is the address and then we'll say this one this next one is going to be property and this is city split city city and this isn't going to be a date of course uh there's going to be let's do envar char let's make it 255 just in case it's a large just in case it is a large string a large text so then we can say update that update that and now we need to in insert what we did for it so this first one is the address so we're going to say that equals the address and we're going to take this whole thing this whole substring oops and copy that and that's going to equal this and then at the end we'll we'll look at it really quick so first let's add this table i'm gonna do this one at a time really quick so you can see it so it adds the table now it adds the results and again adds the table of city and sets that city to that substring and now let's take um let's take this and just do select everything from this and you should see at the very end because when you add it it goes to the end we should have two new values and here we are so property split address and property split city um it's much more usable than this i mean this would be a nightmare not a nightmare it'd just be annoying to use this column i mean now that it's separated on the address and the city it's so much more usable of data it really really is the next thing we're going to be looking at is this owner address now it was hard enough or it was tough enough to do this but i want to show you maybe even a simpler way to do it even though this is more complicated so let's go down here and let's get rid of this so let's say um let's get this and let's just say property oops no we're doing owner owner address here we go let's just take a look at this let's see what we got so again we're using or we what we have in here is the address the city and the state so what we need to do is split all of those out and again i don't want to use substrings again that was a pain i want to use something a little different something again that you may not have never seen it's called parse name and parse name is super useful um especially for like the limited stuff stuff that's delimited by a specific value um so let me just show you what it is and then we'll go from there so we can say parse parse name and we're going to be doing this on the owner address okay let's let me see let me see yeah i mean it's because i don't have this of course i do that all the time so annoying so on the owner address um and then let's do one and let's just see what happens uh nothing changed of course because parse name only is useful with periods or that's what it looks for that's what parts name looks for and these are commas so something we can just do is we can replace those commas with a a instead of a comma we replace it with a period so super easy we're just going to do owner address comma and we'll look for the comma in there then we need to specify what we need to change it to we'll change it to a period and let's close that and now let's run it and it's taking tennessee so something odd about at least to me odd about parse name is that it kind of does things backwards than what you would expect it to do uh let's really quick let's add the other things um you'll you'll get a kick out well you won't get a kick out of this as much as i do here's one two three let's execute this and it separates everything for us but it's backwards so it's one two three you would imagine it'd be one two three but no it's one two three so all we need to do is go three two one and run this and there we go so now we have it broken out this is now our address this is our city and this is our state so super what i would consider super easy a lot easier than the substring but i didn't want to show you the easy one first and then give you the hard one um so now we just need to add those columns and then we need to add the values so let's do this uh let's make some room and i need to get rid of one of these i think oh did i do that right what'd i do i have my alter table update alter table update what is this doing here what is this i don't even know what this is i'll just go like that so now we have three perfect um so from national housing we're going to say we're going to say this is the owner oops owner split address um actually let me just copy the owner make it easier so we have owner split address owner split city and let's do owner split and then state oops and copy there owner split city there we go owner split address owner split address so i'm putting all the sets equal to what we're about to add to so now this first one this three is the address we'll paste it there the second one is the city so we'll put that oh i see what happened here that's what happened can i get rid of that um i set the owner split city equal to that middle one and then of course the third one is the state so let's go do that and that should be done so let's do it two at a time oops owner split address what's wrong with that oh i probably just gotta run this first let's try that tried to get go too quick you can do this a much more efficient way i'm just doing this for visual purposes i would update all the tables first or add all the columns first i mean and then do all the updating at the end that's normally how i do it but again for visual purposes this is what we're doing so let's go get this actually let's get this bring this down here um don't keep this in your final queries it's a lot of extra selecting everything you don't need to do that um so here we go so owner split address owner split city owner split state again so much more usable than when it's all in one column i mean it is 10 100 times more useful data now um i you know that one to me that gets used a lot let's keep it going i feel like we're making fantastic time i don't even know i'm not even keeping track of time time is not even relative anymore it could be three hours and i wouldn't care let's keep going um let's take a look at this column right here sold as vacant right now is no but let's look at let's do select distinct oh my gosh i hate when i do this i do this all the time am i the only one i don't think i'm the only one and we'll do spl uh what is it sold as okay sold as vacant let's do a distinct count on or distinct on these so right now we have yes no ny i'm guessing which is known yes and then no so let's look just for just because i'm curious um let's look at a count of i don't want to do that let me just do sold as vacant let me do a count of this and we'll group by uh sold as vacant okay let's run this and see what we get oh gosh let me order by okay here we go now we're now we're moving that's not what i wanted at all order by two here's what i wanted okay so at no we have fifty one thousand yes four thousand almost five thousand no and then of just a few so let's change them to to yes and no because these are obviously the vastly more populated ones um and we're just gonna do this through a case statement so we're gonna say oh yeah let me get this ready before we start oh yeah i'm ahead of the game now let's do select and we'll do sold as vacant and then we'll start our case statement um yeah let's do right here so we'll do case when sold as vacant is equal to yes all we want to do is say then we want to make it no oh we won't make a yes what am i doing jeez i'm losing it when and i'm just oops oops ignore that pretend that didn't happen when sold as vacant is equal to n then no and then else we want to say if it's already if it's not one of those values that means it's already a yes or no so we're just going to say just keep it as sold as vacant and then we'll end it so let's take a look okay so let's scroll through here and see if we get any that we can see oh i just went by some didn't i oh i just went buy something i know i did um let's see okay here we go so here's an n it's now a no so this this sold as vacant as this column the newly uh the case statement right here is changing it so the n is no so this should work all and this will be a unique update statement um and i hope it works on like the first update statement that we we did that was uh that was a travesty um let's do update nashville housing and we'll say set sorry i'm talking faster than i'm going set sold as vacant equal to and we can just literally put in this case statement it's not pretty but let's try it okay now let's go look at this again and see if it made the update there we go the update statement worked oh fantastic it's a beautiful thing okay great i'm glad that one worked i was worried for a second that uh my update had broken in um in sql server now now we're gonna do something um these next two things is we're gonna remove the duplicates and then we're gonna get rid of unused columns um this removing duplicates i gotta be honest i don't do it a ton in sql but i have done it um especially for like queries you know when i'm looking at full tables i will write some sort of temp table and like put the remove duplicates in there i normally don't delete actual data we are we're going to do that but it's not a standard practice to delete data that's in that's in your database so just for future purposes don't blame me if you delete all the all the duplicates by accident in your table at work so you can do this a few different ways but the way i'm going to show you is we're going to write a cte and we're going to do some windows functions to find where there are duplicate values okay so excuse me so let's start writing out our cte and or you know even we can write out the query first then put it into a cte that might be a little bit better so let's do select everything and oh my gosh i was about to do it somebody's out there just like waiting for me to make that mistake again so we want to partition our data um when you're doing removing duplicates we're going to have duplicate rows and we need to be able to have a way to identify those rows right so you can use things like rank order rank row number there are a few different options we're going to be using row number um and you know if you want to look into how rank and rank uh like dense rank and all those ones work please do that so you know why we're doing it um but we're using row number because it's the i think the simplest and it's going to do what we need exactly so i'm going to get this over here we'll say select everything because we're selecting everything then we're going to add this row number on here so row number and we're going to do these parentheses right here i'm going to say over and an open parenthesis now we need to write our partition because we're going to partition this data so we're going to say partition by cool now really quickly while we're here we need to actually know what we're partitioning on that's helpful so let me write this so while we're writing it we can see what we're doing we need to partition it on things that should be unique two basically two each row um if and i guess for the sake of what we're doing we're gonna pretend this unique id isn't here um although you know you can say i'm cheating it doesn't matter but i'm going to say you know if things like the parcel id are the same if the sale date is the same um the property address is the same the sales price is the same this legal reference which i'm guessing is some type of legal document saying it's like somebody's property if all of those are the exact same then to me that is the same data it's it's unusable just for example i mean this may i don't i mean this date is just some random data set i found online right so that's what we're going to be going with that's what we're going to be running with and pretend that lie that i just told you is completely true so what we want to partition by let's start with the parcel um can i is this not right here why is it saying this why is it not giving me okay it doesn't even matter i'm just gonna say parcel id we can say we'll do a property address stick with me we're getting somewhere we'll do sale price um what do we say sale date i mean there shouldn't be two of this they didn't sell twice on the same day come on and then legal reference and oh i know why it's not working or my autocomplete isn't working which i love um it's because we're creating our own partition so it's it's own column of course i don't know why i'm it's late as you can see down here it's 11 15. it's getting late from me but hey i i this is an adrenaline rush for me um now we need to order it now we want to order on something that should be um not necessary i guess unique so we're going to order on this unique id we'll see if that actually does what we want it to do um oops what am i doing order buy come on and we'll say uh unique oops unique id perfect and we should be able to close that off and we're going to call this roanum i mean that's just that's just makes sense so now we have this and let's run this really quick and see what happens so um and maybe we should order this as well but we'll maybe we'll do that later yeah let's order this on parcel id order by parcel id let's just see what happens because this i think that should be pretty accurate um let's scroll down and see if we get any this is all ones maybe we should be doing it on unique id i don't know let's see if we get any hits okay there's a two in there let's let's look at this really quick because i want to see it maybe i did something wrong i don't know it is absolutely possible uh somebody play some jeopardy music for me real quick yeah i don't know i don't know why it's um okay so let's see let's look at these two and let's see if i did something wrong oops don't need to pull that up i was doing some research when i learned that convert by wasn't working um okay so this one and this one it's giving different row numbers so let's look at the actual data ignore the unique id but the data itself so the the sale date is the same the sale price is the same the legal reference is the same the owner's the same this is the same i mean literally every single thing in here is the same so this is a good example so we're going to in this query that we're about to write that that will be that second one will be deleted because we don't need it now there's only one so it looks like this is working as intended i can also do um let's do where row underscore num is greater than one let's see if that i don't think it will work actually yeah that's because uh it is that is in a windows function of course we can't do that what am i thinking that's why we need to put it into a cte oh of course it all comes back so let's call this oh it comes back to the ct these things are amazing um let's call this row num cte and we'll say as and then open parentheses and i don't think we can have an order by in here let's do it like this and let's just do select everything from row number c to e so again if you haven't watched my like ctv cte video or you've never used a cte before um this is now basically almost like a temp table so we're gonna be able to this query down here is querying off of this table that we quote unquote created so um it looks like it's working so all we're going to do is select um everything from that and we want to say where row num because that's now a row is greater than one and let's order that by i don't know property address let's see if that works and let's see what happens okay so all of these are duplicates we have 104 of them it looks like so there's not many but it there's twos i think threes no no three so there's multiple of these rows or columns that are basically duplicates and we want to delete them so all we're going to say is we're going to select instead of saying select everything from row we're just going to say delete and uh yeah i got to get rid of that order bye that doesn't work and let's do this there's 104 let's see if it worked um so now let's do let's go back and we'll say select everything and let's see if there's any more duplicates in there there are none that is fantastic every i'm like biting my nails now to see if each one of these works um because i that first one didn't work um so yeah so it worked we got rid of the duplicates that is fantastic and now it's smooth sailing from here because we're just going to delete some unused columns that we don't care about this doesn't happen often this i would say actually happens more like views when i'm creating views i have a view and i'm like oh i didn't mean to add that column let me just remove it because it's uh i don't need it you don't do this to um like the raw data that you import usually this is i mean again best practices please don't do this to your raw data that comes into your database um talk to somebody before you do this that's just my my legal advice for the day i'm not legally bound or legally held responsible for any mistakes you make so let's keep going we're literally just going to delete some columns it could be any columns that we want but for example we have these properties split address and owner's split address in city and state and city and these are perfect and much more useful than these owner this owner address because this is really unusable to be honest so we're gonna delete those um and maybe we'll also get rid of like i don't know maybe the land the land use might be useful this tax tax district who cares about that um so it's going to be super easy we're just going to write alter table alter table do i say that right jeez and we're going to say alter this table and we're going to drop a column and you can do as many as many as we want so we're going to say owner address we're going to do tax district and let's also do the property address all right and let's try this and let's see if it works i'm nervous all right so as you can see that the property address is gone the owner address is gone the tax what was it tax district has gone and now we are left with this um now remember the whole point of everything we were doing was to clean up the data right we wanted to clean the data and actually now well now that we're here we have this sale date as well and we have the sale date converted over here let's get rid i forgot let's get rid of this oh that was my dog max excuse them let's get rid of oops let's get rid of that sale price that that or the um sale date that made me look like an idiot this is sweet revenge sale date sweet sweet revenge all right and it is gone so it's as easy as that now remember like i was saying before the whole point of this project is to clean the data and make it more usable um and it may not have felt like that as we were going through because i wasn't you know really looking at the cleaning data we were cleaning it but you know what was the purpose of it i may not have highlighted that too much all these other columns that we created um are just it's much more usable much more friendly this is standardized now and you know we we did that through quite a few various methods um so let's go back up to the top we're going to recap what we did really quick so using this convert we tried to standardize the date format or change the date format may or may not have worked for you didn't work for me we populated this property address which we did that before we broke this out because if we reversed it if we broke these addresses out into individual columns and then we populated this thing um we would have because then we went and deleted uh we went and deleted this column oops sorry we went and deleted this property address so we wouldn't have actually gotten any of that data so there was a reason it was in that order don't mess that up that's happened so we broke it out we did that to using substring chart index as well as parse name and replace then we went through and we changed yes to node or y and ends to yes's and no's using case statements then we use we removed duplicates using a row number a cte and windows function of partition by and in the end we deleted a few useless columns that we no longer want to see because they are horrible and terrible and you know we don't want to see them anymore that is the entire project that was everything and you did it and i'm honestly super proud of you for sticking around this long this this was not necessarily an easy project we used quite a few new things that i may have not talked about or showed you before um this to me is just the beginning right this is just a a glimpse into all the things that you need to do you need to look for in order to clean data so you know i really do think this is a good portfolio project because it will show that you understand and know how to clean the data although this is not an end-to-end project right that could that would take a long time and a lot more exploratory analysis looking into the data to figure out what we need to change but for all intents and purposes i mean this is a pretty good project for cleaning data and i hope that you learned something i also hope that you worked on this hard if you want to make any improvements please do that this is not perfect by any means there's other things that you could change you could you know i don't even know i'm not even going to try to guess you could do other things to this data though um and and create your own queries create your own um data cleaning uh a part of this and so um you know do that if you were able to get this the etl part of it done do that i think it'd be really really cool again i was able to get it to work but i don't think 90 of people out there would be able to get it to work um it's just every computer is different every server is configured differently um and so it would just be a huge pain so i decided to cut that on and i'm sorry but hopefully this will suffice um with that being said this is it you made it all the way to the end again i'm super proud you guys are doing fantastic you guys are the ones putting in the hard work to build the portfolio for your future job i mean it's not easy but you're putting in the work and so and so kudos to you um in our next video we're gonna be going into python for the very first time really excited about that one because i think the only python video that i have up right now is on one where i was scraping data from twitter so um you know this will be a nice change of pace or a little bit different content that i normally put out and so i'm really excited about it and i hope you are as well with that being said i am done with the video i'm gonna be stopping it soon thank you for joining me if you liked this video be sure to subscribe be sure to like this video leave a comment below telling me how it changed your life and i will see you in the next video bye [Music] you
Info
Channel: Alex The Analyst
Views: 33,088
Rating: 4.9861474 out of 5
Keywords: Alex The Analyst, Data Analyst portfolio project, Data Analyst Portfolio Projects, Projects for data analysts, data analyst tableau, Data Analyst project, Data cleaning in sql, sql for data cleaning, how to clean data in sql, how to clean data, how to clean data in sql server, Data cleaning, Data Analyst data cleaning, data cleaning for data analysts, data cleaning for data analysis, sql portfolio project, sql project
Id: 8rO7ztF4NtU
Channel Id: undefined
Length: 54min 44sec (3284 seconds)
Published: Tue May 25 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.