Data Analyst Portfolio Project | Correlation in Python | Project 4/4

Captions
What's going on, everybody? Welcome back to another video. Today we are continuing our data analyst portfolio project series with our fourth project, in Python. [Music] Now, I am extremely excited about this project because this is the very first portfolio project that we're doing in Python, and we're going to be using lots of popular libraries like pandas, seaborn, and matplotlib. If you have never used Python at all, this project may be a little bit difficult for you. I kind of expect you to know at least the basics of Python, and I go a little bit above and beyond that in some areas, but I explain the more difficult concepts. I'm not going to explain what an array or a for loop is, because I'm going to expect that you know that. I tried to make this as beginner friendly as I possibly could, but I will say I think this is one of the more technical projects that we have worked on so far. With that being said, let's jump over to my screen. We are going to download the data set, install our Python IDE, complete the entire project, and upload it to GitHub, so let's get started. All right, so the very first thing that we need to do is download our data set, and this is coming straight from Kaggle. As you can see up here, this is the Movie Industry data set. Let's go down here and take a sneak peek really quickly. It's showing us 10 of 15 columns, so we're not able to see all of them at the moment; we could, but I don't want to take the time to do that. So we have the budget, we have the company, and we're going to be looking at things like genre, the gross revenue of the entire movie, the name of the movie, the released date, and then there are a few other ones. I'm not going through all of them right now because we will get plenty of time to look at that in just a little bit. We're going to click download right here, and it's going to go into our Downloads folder. What we now need to do is download our Python IDE, so what we are going to do is we
are going to be using something called Jupyter Notebook, and we can get that through something called Anaconda, so we're going to be downloading Anaconda right now. All you need to do is go right here and click download. Again, I'll include all these links in the description. If you're using a Mac or Linux, just click on one of those and you can get the version that you need. Let's click on download. I am not going to walk through actually installing it because it's super, super easy. I don't foresee anybody having any issues, I hope, unless you have storage issues, like you don't have any room for it. But just install that, and right down here this is going to pop up, and what we're going to do is click on this Jupyter Notebook right here, so we're going to launch it. Of course, I already have this pulled up over here, but I'm going to pull it up over here as well. Here is where we store our projects. This project is for a later time, and these are ones that I've already completed. I've already completed this entire project, and I have it on another screen right down here, so I'll just be reading a lot of it, because it took me a long time to create and I don't want this to take as long as it took me to create. I want to do this quickly. So I'm not going to pull up any of these; I want to click New and click on Python 3.
So this is what it looks like, and this is where we're going to start. Excuse me, give me a second. What we need to do first is import our libraries, or import the packages, and then we're going to read in the data. So the first thing is import libraries, and we're going to be using a lot of classic ones. I don't want to waste a bunch of time on this part specifically, so I'm just going to paste this in here, and you guys can do the same thing. I will have this on GitHub, so look at the link; everything that you're about to see will be in there, so you don't have to write this out either. When I initially did it, it just took me a while, so I didn't want to waste that time. What we're going to be using is pandas as pd. This pd, sns, and plt is what a lot of people use; it's common practice to use those, so I advise you to do that as well. So we're using pandas, seaborn, and matplotlib, and those are the big ones. Be sure to include this %matplotlib inline. If you don't know what that is, just Google it; it's useful for what we're about to do. Now we can actually read in the data, so let's go right down here. I'm going to say "read in the data". We're going to be using pandas for this, and we're going to be creating a data frame using, I think it's read_csv, so it's super easy. We're going to do df.read_csv, open parenthesis, and then, what are these, apostrophes? Double? Yeah, just apostrophes, I think. What we now need to do is locate where it is in our folder. If we go right over here, we have this movies file right here. We can right click on it, and what I usually do is go right in here: the location says C, Users, alex f, Downloads, and that's where it's located. So we're going to go right in here, and we're just going to type in the name, and that's movies.csv. So that is as easy as it's going to get. It's probably the next
few things are probably as easy as it's going to get. So let's try running this really quick, and we should get an error. It's going to say unicode escape. All you have to do to resolve this, and this happens all the time if you don't do it, is include an r in front of the path string. Okay, so now it should work. What did I say? Oh, I said df. Why did I say that? Somebody should have told me I was doing that. We're going to do df equals; df just stands for data frame. So pd, since we're using pandas. I don't know why I wrote that; I was getting ahead of myself, I guess. So we're going to run that, and it should work. It did work, so let's look at the data. We're just going to do df.head() and run this as well, just to get a really quick glimpse. This is just the top five rows. So we have our budget, our company, some of the ones that we were looking at earlier, and ones that we didn't see before, which I believe were score, star, votes, writer, and year, and we'll get into all of this in a little bit. The very first thing that we're going to be doing is cleaning up the data. In just a second I'm going to cut myself off screen; I wanted to join you for at least a little bit longer. But we're going to be cleaning the data and formatting it how we need it for what we're going to be working on. As I said earlier, we're not going to be working on what I think most people thought we would be working on. We're not going to be using pandas like SQL, where we're doing things like group bys. We're going to be going in a different direction: we're going to be working on correlations. If you don't know what correlation is, for example, let's say as the budget increases you also expect the revenue to increase, right? If they spend 100 million dollars on a movie, you expect them to make 500 million, or if
they spend zero dollars, you expect them to make, you know, ten thousand dollars. That's a high correlation, because when one is low the other is low, and when one is high the other is high. So we're going to be doing that for all of the fields that you see here, and trying to find which fields are directly correlated, or highly correlated, with this gross revenue, because I think it's interesting to know what things impact the revenue of a film. That's something that I found interesting, so that's why we're doing it; again, I created this, and I'm just kind of going with the flow. So it's a little bit different, and I hope that is exciting. We're going to be using some really fun things coming up. But with that being said, I'm going to take myself off of here so that you guys can see my full screen, and we're just going to get going. So goodbye, I will miss you, but we'll see each other again, I promise, in a future video. All right, so what we're going to do is come down right here, and we want to look to see if there's any missing data. So: "let's see if there's any missing data". I wouldn't be putting this as my description; go back and change that to something that makes sense, but we're going to keep it as is. There are lots of ways to do this, but what I'm going to do is create a for loop. We're just going to loop through these columns and, within each one, see if there's missing data in it. So we're going to say for column in df.columns, and then let's do np.mean, open parenthesis, df, and we're going to put in that column, like right up here. So when that column gets inserted into here, it'll say df with the specified column, and then we're going to do .isnull(). It's pretty straightforward, I guess. We're going to call this percent missing. Then what we want to do is
we want to create our output, so we're going to be printing this. Let's do a little bit of formatting; this is purely for visual purposes, and you don't have to do this if you don't want to. I'm going to do it like this, and then we're going to insert what our format is going to be. Give me a second; let's go back here and add this percent sign. So we're going to do .format, and then it's going to be the column name, which is going to be inserted right here, and then we're going to do our percent_missing. Let's run this really quick and see if it works. "np is not defined": that's because I didn't import it. Import numpy as np. Why did I not do that? Now let's see if it works. There we go; it happens to the best of us. So this np is supposed to be from numpy, which apparently I didn't include. So we basically looped through every single column to see if any of them had nulls, and it looks like every single row, every single value, is filled in, so we don't have to worry about that, right now at least. Let's keep going on to the next part. We're going to be doing some really basic data cleaning; I think I mentioned that earlier. The first thing that I want to look at is the data types for our columns. It's super easy to do: we're just going to do df.dtypes and run this. So we have floats, objects, integers, and that's about it. One thing that I noticed right away in the data was this value. Is this 80 million? Let me see if that's what it is: one, two, three, one, two, three. Just eight million, okay. So there's eight million, but it has this .0 at the end, and we don't need that. It's the same thing for the gross revenue; I think it's the only one that it does that on. I just want to get rid of that, purely because I think it doesn't look great and we don't need it. So what we're
going to do is specify which column we're looking at first, and the one we're going to be doing first is this budget. So we're going to say df, and then we need to say budget, which specifies the column that we're going to be working on, and we're going to do astype. This is just going to change the data type. We're going to do open parenthesis, apostrophe. I hope that's what it's actually called; if I'm calling it an apostrophe and it's something else, I'm going to feel like an absolute idiot. So this is going to change it to an integer, but we need to apply it, right? So we're just going to take this and assign it back; otherwise it wouldn't apply. We're going to do it just like that, and I'm just going to make the note "change data type of columns". Make yours something better than that, please; I'm just doing this quickly for our purposes. So we're just going to copy that and do the exact same thing for gross. Okay, let's run this and see if that actually worked. We're just going to run our data frame, and now it looks a little better, right? It's nothing huge, a super small change, but it does work. The next thing that I want to look at is something that you may not notice unless you're really looking in the data. It has this year here. If we go back to right here (and I actually am now going to pull in all these), it says that the year is the year of release, and then we also have this column called released, the release date. So the year column and the year in the released column should match, hypothetically speaking, but they don't always. So here's 2016; here's 2017.
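As a recap of the steps narrated up to this point (checking each column for missing values, inspecting the data types, and converting budget and gross to integers), here is a minimal sketch. The tiny data frame is a hypothetical stand-in for the Kaggle file, and its values are made up for illustration; in the video the frame comes from pd.read_csv on the downloaded movies.csv. The sketch also multiplies the missing share by 100 so the printed number really is a percentage, which is an adjustment not shown in the narration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the downloaded data set. In the video the frame is
# loaded with pd.read_csv(r'<your path>\movies.csv'); the r prefix makes the
# path a raw string, so backslashes are not read as escape sequences.
df = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C"],
    "budget": [8000000.0, 6000000.0, 15000000.0],
    "gross": [52287414.0, 70136369.0, 179800601.0],
    "released": ["1986-08-22", "1986-06-11", "1986-05-16"],
    "year": [1986, 1986, 1986],
})

# Share of missing values per column, printed as a percentage
for column in df.columns:
    percent_missing = np.mean(df[column].isnull()) * 100
    print("{} - {}%".format(column, percent_missing))

# Inspect the data types, then drop the trailing .0 by converting to integers.
# astype returns a new Series, so the result has to be assigned back.
print(df.dtypes)
df["budget"] = df["budget"].astype("int64")
df["gross"] = df["gross"].astype("int64")
print(df.head())
```

Note that astype("int64") raises an error if the column contains missing values, so this conversion relies on the missing-value check coming back clean first.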
Let's see if there are any up here that show it. There are a lot of them, though, like a 1987 that said 1986. You can go through and see those yourselves; I'm not going to do that. You can if you would like, but it takes a lot of time to dig into the data, and that's what you need to do to figure out how to clean the data. So let's fix it. We're not going to fix it by changing this one; we could, but what we're going to do is create a new column. We're going to take this released column, take just the first four characters, and that's going to become our new year column. I'm not going to delete the year column; we might drop it later, but for right now we're going to create this new column. So we're going to say df, and then bracket, apostrophe (I'm telling you, if I've been saying this wrong this whole video, I'm not going to be happy about that), and we're going to take that from released. So again, we're taking this released column. Right now, what data type is released? It's an object. I want to make it into a string so that I can pull from it, as you'll see. So astype, and we're going to make this a string, and then what we want to do is take the first four characters. So .str, and then open bracket, and then colon 4.
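The column-building step just described can be sketched as follows. This is a minimal illustration on made-up rows, assuming the released values are strings that begin with a four-digit year (as the slicing in the narration implies); the new column name yearcorrect follows the name spoken in the video.

```python
import pandas as pd

# Hypothetical rows where the year column disagrees with the release date
df = pd.DataFrame({
    "released": ["2017-01-13", "2016-12-25", "1987-02-06"],
    "year": [2016, 2016, 1986],
})

# Cast to string, then keep the first four characters (the year).
# .str[:4] is the same as .str[0:4]; an omitted start means "from the beginning".
df["yearcorrect"] = df["released"].astype(str).str[:4]

# Rows where the original year and the extracted year differ
mismatch = df[df["yearcorrect"] != df["year"].astype(str)]
print(df)
print(len(mismatch), "rows disagree")
```

The extracted values are strings, so a further astype(int) would be needed before using the new column in any numeric calculation.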
You can also put a zero there, but it's understood: if you leave that blank, it starts from the very beginning. And let's name this new column. We can do year underscore... actually, let's call it yearcorrect, and we're going to do it just like this, and we're going to say "create correct year column". Let's add this right here, run it, and see what we get. So this is the original year column, which had 2016, and this is our yearcorrect column, which has 2017. It looks correct. I don't know why the one in here is a mistake, or why they made that mistake, but it looks like that's what it is, and ours fixes that. So if we're going to be using the year at all, we'll pull from yearcorrect; I think it was called yearcorrect. So we just corrected that, and we should be good to go on that front. The last thing I'm going to do to the data is order it. Super simple: I'm just going to order it by the gross revenue. So we're going to come down here and say df.sort_values, and we're going to say by equals, open bracket, and specify that gross column, and we'll do inplace equals False (oops, False with a capital F), and then ascending, and that's going to equal False because I want it to be descending. So let's look at this. You'll notice a trend down here: some of the highest grossing films are really big ones, right? Avatar, Titanic, Jurassic World, Avengers. Some of the ones down here are not as well known. I don't think I Spit on Your Grave 2 was the most popular movie. One was fantastic; two, you know, I saw it, it just wasn't my cup of tea. I think I was all their revenue, to be honest. One thing I'm just going to add: we can make it so that it doesn't have these,
you know, we're only looking at a little bit of the data. Let's say we want to look at all of the data. Let's really quickly do that, because I'm sure some of you guys are wondering how to do that. Maybe you're not, but I'm going to show you how to do it anyway. So we're going to do pd.set_options, and we're going to say open parenthesis, apostrophe, display.max_rows, and this is set to, I think, something like 20 by default, and we're just going to say None. So we're going to do that right here and run this. Oops, what did I do wrong? It has nothing called set_options. Oh, that's because it's supposed to be set_option. All right, so that should be good. Let's try running this again and see what happens. Okay, so it's going to take a lot longer because it's pulling in all of the data, but when you come in here now you'll be able to scroll, right? Super useful. I prefer it this way; most people don't have it this way as a default, so I just wanted to show you how you can change that. So that will now be like that for the rest of the project. Let's keep going. One thing that is important, even when you're working with something that doesn't have any null values: you want to make sure you don't have any duplicates. So super quickly, we're just going to look to see if there are any duplicates, and we're just going to drop them. It's super easy to do. Let's write "drop any duplicates". So let's do df, open bracket, and you can do this on any column you want, you can do this on multiple columns, and you can do this across the entire thing, and you should. But how you do this is you can say company (oops, excuse me, company), and you can do .drop_duplicates [Music] and .sort_values (oops, values), and you can say ascending equals False. Let's just run that really quick. Uh, what did I
say? sort_values... oh, that's because I didn't go like this, my bad. So it's going to sort the values, and it's going to tell us if it drops any. It doesn't look like we're dropping any duplicates, right? These are the distinct values, sorted, so this is just showing us all of the unique values in here. If we were to get rid of this (and I'm just showing you really quick so I can make sense of what I'm trying to say), if we get rid of that drop_duplicates, then we start seeing all of these: Zentropa Entertainments, whatever that is, we're seeing it tons of times. So all that does is show us which values are distinct in here. And if we want to get rid of the duplicates, we can do df company equals, and then do it like this. We're not going to do it, because I don't want to get rid of all the ones in there. But if we wanted to do it across the entire thing, we wouldn't do df company or any of that; we would do this, drop_duplicates on just the data frame. And if we wanted to do that, we could absolutely do that. There aren't any duplicates, but if you run that, it will drop any duplicates across the entire data frame. So that's what that does. Let's keep going really quick. Something else, by the way: another reason why we could be looking at this is to see if there are issues in the actual quality of the data. Actually, let me go back up, because there was one up here, I think it's Warner Brothers or something. Let me see... that went too far. So right here; actually, this is fine. We have Walt Disney, Walt Disney, Walt Disney, right? There are a bunch of them. Something that you might need to do when you're data cleaning is aggregate all these, or standardize all these, however you want to say it. I've already looked into this, and we don't need to do this, but all of these are different companies or
were companies during different times, right? So let's say this one was from, like, 1995 to 1980, and then they changed the name to this. We don't want to standardize those, because those are two distinct time frames and two distinct companies. But if, for example, there's Walt Disney Feature Animation and then Walt Disney Feature Animations with an s on the end, that'd be a mistake, and we would want to correct that. Luckily we don't have to do that, because that's a huge process. Trust me, I've done that; it is tough. So we're not going to do that today. Thank you for sticking with me in my rants that I'm doing at the moment. So that's an additional reason why we wanted to look at this, and how we looked at it, but you can also drop the duplicates, which helps clean it up, because you shouldn't be having any duplicates in here. With that being said, I believe we now have our data how we want it, which is fantastic. I think that's probably the easier part of what we're doing. So now that we have our data, we're going to start looking at which variables, or which columns (and let's pull this back up; oh, I should have done .head(), but I'll let it run), are most correlated to this gross revenue. Okay, so my hypothesis, what I'm going to be checking: because it does take time to go one at a time and compare all of these (not hard, because we're going to do it), I'm going to be doing ones that I think will have a high correlation, and then we're going to test it, and then we're going to look at all of them together, and I'll show you how to visualize all of this and write all this out. But I believe that this budget (and I'll write down here "my predictions") is going to have a high correlation. I think that the more money they spend, the more money they're going to bring in. That's my guess. I believe that the budget is
going to have a high correlation. I also think, and this may not be correct, that the company would have a somewhat high correlation as well. Some of these bigger ones, like 20th Century Fox Film Corporation or Walt Disney, make movies that bring in a lot of money, so I think that the company will have a high correlation. Let me write that out. These are my educated guesses. Don't put that in your scripts; you don't need that. This is what I think is going to happen, but we're going to test it out, right? One thing that we can do super quickly to compare the budget and the gross revenue is a scatter plot. So let's build a scatter plot with budget versus gross revenue. What we're going to do is go right down here and say plt, so this is our matplotlib, plt.scatter, and that's going to be the scatter plot that we were just talking about. We're going to say x equals, and this is what data we're going to be looking at on the x-axis, so x equals df, and this is going to be our budget, so we're going to do bracket, apostrophe, budget. Again, I keep hesitating on that apostrophe; I feel like I'm wrong, and if I have been wrong this whole time, I'd be so mad, I'm telling you. And then our y-axis is going to be df, and then it's going to be our gross. Oops, what did I do? I'm messing stuff up. So it's going to be our gross. Super easy. Let's do plt.show(); this is going to actually bring it out. So this is what it looks like. It's hard to interpret exactly what's going on here. I am going to .head()... actually, let me go pull the thing that we were looking at right up here. Actually, no, what I'm going to do is just say df is equal to that, so that I can just run the data frame down here. So there we go: .head().
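The ordering, display, duplicate, and scatter plot steps narrated above can be sketched together as follows. The rows are hypothetical, with one exact duplicate added on purpose so that drop_duplicates has something to remove. The matplotlib.use("Agg") line is only there so the sketch runs outside a notebook; inside Jupyter it can be dropped. The axis labels in this sketch are chosen to match the data on each axis (budget on x, gross on y).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows standing in for the movie data
df = pd.DataFrame({
    "name": ["Film A", "Film B", "Film C", "Film D"],
    "company": ["Studio A", "Studio B", "Studio A", "Studio A"],
    "budget": [237000000, 200000000, 150000000, 5000000],
    "gross": [760000000, 650000000, 640000000, 9000000],
})
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)  # add an exact duplicate row

# Show every row instead of a truncated preview
pd.set_option("display.max_rows", None)

# sort_values returns a new frame unless inplace=True, so assign it back
df = df.sort_values(by=["gross"], ascending=False)

# Distinct values of one column, sorted in descending order
unique_companies = df["company"].drop_duplicates().sort_values(ascending=False)
print(unique_companies)

# Drop rows that are duplicated across all columns
df = df.drop_duplicates()

# Scatter plot of budget against gross
plt.scatter(x=df["budget"], y=df["gross"])
plt.title("Budget vs Gross Earnings")
plt.xlabel("Budget for Film")
plt.ylabel("Gross Earnings")
plt.show()
```

Calling drop_duplicates on a single column only lists that column's distinct values; it is the call on the whole data frame that removes repeated rows.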
All right, so I just wanted to have these ones on top. It's hard to tell exactly what's going on here, so I'm going to add a little bit of information so we can all read it. Let's add a title: this will be plt.title, and this is going to be "Budget vs Gross Earnings". We'll do plt.xlabel (oops, not clabel, xlabel), and that was our gross, so I'm going to say "Gross Earnings", and we'll do plt.ylabel, and this is going to be, open parenthesis, apostrophe, "Budget for Film". Oops, that's not what I wanted. So now let's run this and see what we get. This is a pretty good, really quick visualization of what we're looking at here in terms of the budget versus the gross. If you look at this one right here (this one is easy to find because it's the very first one down here), we're looking at a budget of 245 million, and these are in the millions, so 2.45 is going to be right here, so that's right. And then the gross earnings was 930 million, so almost a billion dollars, and this is by hundred millions, right? So 200 million, 400 million, 600 million, 800 million, almost to a billion right here. So that's just a super quick fact check to make sure that this is in fact correct. What we want to do is determine if these are correlated. Visually you can kind of guess; it seems to be, a little bit, but it's hard to tell. So what we're going to do is something called a regplot, or regression plot. We're going to come down here, and we're going to be using seaborn for this, so let's do sns. Let me actually type in right here, bear with me: "plot the budget versus gross using seaborn". So let's do sns.regplot, open parenthesis, and again we're using... I'm just going to steal... oh wait, no, I'm not going to steal that. I was thinking about stealing something, but that doesn't actually work. So our x is going to be our budget, and our y is going to... oops, what did I hit? Insert. I hate
when I do that. y equals gross. I'm just really fumbling things up here. And then we're going to say our data is equal to our data frame. Let's run this really quickly, and then I'm going to add some additional things to it. But now we have this line, and this is going to show us the correlation. In super simple terms, it's going up, so it's showing a positive correlation. So just at a glance, with what we've done, I can already tell you that the budget and the gross are correlated, but how much, we don't know. I will get to that in just a second to show you exactly how much it is, but I want to add some other information to this just so it looks better. So we're going to do scatter_kws, and we're going to change some of these colors so it's a little bit easier to read. We're going to do these, oh gosh, what are these called? I'm just going to call them squiggly brackets. So squiggly brackets, then color, and then a colon, and we're going to do red. So I want to keep the dots red, but I want to change up that line so it's easier to visualize; that will help us down the road, I promise. line_kws, squiggly brackets, color, and let's make it blue, why not? And just like that. Whoops. Let's see if this actually works. I feel like I messed something up. Yeah, I did something wrong here; give me a second. I just need to make sure I'm closing this off correctly. You guys are probably seeing what I'm missing. Oh, that's it. Oh, I must have hit insert again; I'm telling you, it messes me up every time. There we go, this should work now. So yeah, it was a simple syntax error, but you can specify these things and make it look a little more appealing and easier to visualize. It's so much easier to see this now; when it was red on red, it was a little more challenging. It's hard to see in here. You can make this almost any color you
want, by the way. You can make this black; you can really do anything you want. I just prefer the red and the blue for this; it's super simple to see and looks totally fine. So use any colors that you want, and we'll go from there. But now let's determine what the actual correlation is, because we can see that there's a positive correlation, but we don't know how much. Is it more or less than other fields? We don't know. So let's start looking at correlation, and something that you can do that's so easy is df.corr(). Let's run this, and these are some of the fields from our data, right? Now, year is in there, but our yearcorrect isn't; I'll go back and look at that in just a second. What's important to know is that this correlation only works on numerical fields. It's not going to work on our company, our title, the fields where there are strings in there. It's only working on the numerical ones, which is okay, but that does pose an issue, so we're going to have to solve that later on. Another thing to know about this correlation is that there are different types of correlation, or what are called methods. There is Pearson, which is the one we're using; the default is Pearson. There's also one called Kendall, and there's one called Spearman, and they're all going to give you slightly different results (or, I think this one gives more than slightly different results), but they each have their own way of determining correlation. So it's just something to be aware of. If you really want to use this, you should be aware of which one you're using by default and which one you want to be using, and I recommend doing some research into these just so you know. But let's just try the different ones real quick. Pearson is the one that I believe we're using by default. So, before we actually hit enter, or run this
really quick: the budget and the gross have a pretty high correlation. It's 0.712196; that's a pretty good correlation. There aren't many other ones in here that are that high. Votes and gross are close, but for gross, the budget, I think, is the highest one, and then next is votes. That's what we know. So let's run this again with Pearson, and it's going to be the exact same. But now let's do Kendall and run this one, and now budget and gross is 0.523459. I don't know why I'm saying the entire thing, but I am. And then let's try Spearman; this one should be a lot closer than Kendall: 0.698. So again, you need to be aware of what you're using and the different types. For what we're going to be doing today, we're just going to keep the default and use Pearson, and that should be all we need to look at. It really is hard to look in here and read each number individually. What would be super easy is if we could visualize this, and we can. Really quickly, I want to make note that there's a high correlation between budget and gross. I was right. Not important, I just wanted to toss that out there. But what we're going to do now is visualize this information right here. This is what's called the correlation matrix. So we're going to take this and assign it as our correlation matrix; that's going to be equal to this. So this is now called correlation_matrix right here, and what we're going to do is something with seaborn: it's going to be sns.heatmap, open parenthesis, and we're going to use this correlation matrix, and I want the annotations, annot, equal to True. I'll show you later what it looks like if I don't have that on. We'll do plt.show() really quickly, and let's look at what this looks like. As you can see, it has our
numbers so right up here actually let's run this again we had 0.71 we have 0.71 0.29 0.29 so now we have a visualization of this correlation matrix that we wrote and it has this nifty little bar over here and if it's black it's a very very low correlation so anything that's black super low anything that's brighter colors are a high correlation so we have of course a one-to-one correlation on everything in this matrix that is on itself so year to year budget to budget and then 0.71 0.71 0.66 0.66 so ones we were just looking at but now it's visualized it's a little bit easier and this will come in handy in just a little bit when we're visualizing every single column which will be really fun but what we should always do is i'm gonna go steal these real quick because i don't feel like writing these out again i would consider myself somewhat lazy when it comes to this so we're just gonna say the title is going to be correlation matrix and let's just say for numeric features sounds good to me and then we'll do movie features i mean they're both on the same x axis and y axis so let's run that looks a little bit better there we go so it's nice to visualize this because it is tough to kind of read through every single number it's just nice to see that okay these are highly correlated based off the color and based off these numbers so super easy to see and again you can always go up here and change it to kendall and see what that looks like it changes things and so yeah statement of the year it changes things so i'm going to keep it as pearson as we talked about and we can move on from there now the next one i think we set up we're going to look at company and let's just pull this up really quick company is not numeric as we can see that's not numeric at all but we can convert this and then we can create a numeric representation of it
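A minimal sketch of the correlation matrix and heatmap steps described above, assuming a small made-up data frame in place of the Kaggle movies data (the column names mirror the video, the values are hypothetical). It picks the numeric columns explicitly with select_dtypes, since newer pandas releases may raise an error when corr meets string columns instead of quietly skipping them as in the video. Kendall is only mentioned in a comment because, as far as I know, pandas computes it through SciPy.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the movies data frame (values are made up).
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "company": ["x", "y", "x", "z", "y", "z"],
    "budget": [10, 20, 30, 40, 50, 60],
    "gross": [12, 35, 20, 80, 70, 150],
    "votes": [100, 90, 400, 300, 800, 700],
})

# Correlation only makes sense on numeric fields, so select them explicitly.
numeric_df = df.select_dtypes(include="number")

# Pearson is the default method; Spearman ranks the values first.
# method="kendall" is called the same way (assumed to need SciPy installed).
pearson = numeric_df.corr(method="pearson")
spearman = numeric_df.corr(method="spearman")

# Heatmap of the Pearson matrix with the coefficient printed in each cell.
ax = sns.heatmap(pearson, annot=True)
ax.set_title("Correlation Matrix for Numeric Features")
ax.set_xlabel("Movie Features")
ax.set_ylabel("Movie Features")
# In a Jupyter notebook, finish with plt.show() as in the video.
```

Swapping pearson for spearman in the heatmap call is enough to see how much the picture changes between methods.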
so for example this 20th century fox film corporation could be number one where lucasfilm is number two marvel studios is number three and this will say one one two and three so they'll all have their unique identifier so instead of it being a string it's going to be a numeric so that we can include it in this correlation matrix up here so let's look at company okay so what we're going to do is what i'm going to call numerize and that may not be a term at all but that's what we're going to call it so we're just going to say for the sake of simplicity df underscore numerized is equal to the data frame super easy and we're going to do a for loop and we're gonna do this for all fields we could specify just doing it on company but let me take a step back by doing all the fields at one time we'll be able to look at company as well as country and director and genre and name all at one time we could just do company but it's better to just do them all at the same time so let's use a for loop as we did before somewhat similar we're going to say for column let's just do column name in df_numerized.columns so this should seem quite familiar because we kind of did something like this before we're going to say if df numerized and we're going to put the column name so if that column has a dtype that is equal to object that means it's like company country director genre if it has that what we want to do is we want to change that to a category type so all we're going to do is say df numerized column name .astype so we're changing the column type and we're going to change it to a category and actually let's do it like this equal to that so that in the next one we can do something called cat codes so we're
gonna do df_numerized.cat.codes and this is what's going to actually give it the random numerization i'm sorry again it's called df numerized just roll with me okay so then now let's look at df numerized actually yeah let's look at it here and let's run it and see if it works yeah so it did exactly what we wanted it left budget alone it left the gross earnings alone it left score alone anything that was already numeric it left alone because it has a numeric representation any of the ones that had an object type that we looked at before those were all numerized again i don't know if that's a real term but that's what i'm calling it maybe it is so we have this and let's compare that to our original data frame i should have done like head i always do that and then it ends up taking a ton of space oh whoops let me go back way up to the top really quick let me just run this again and see if that may have ruined everything but we'll see if that does what we needed it to do okay so we haven't ordered oh no we don't have it ordered the same i did scrub everything jeez let me go back up here and run through no yeah i'm gonna run this one [Music] and add that field sorry this is totally my fault what's happening why is it taking so long where's that one that ordered the data here it is and then we ordered it by gross so let's go back down again i'm gonna make mistakes i'm only human now let's look at the data frame and i just wanted to do a quick comparison just so you feel confident that what we're doing is what we're supposed to be doing okay so the company has 1428 for lucasfilm 2062 for 20th century 2062 better yet an easier one to look at is country so 54 is usa 53 is uk and you can see that really easily so it looks like it worked properly right it still is keeping to what it's supposed to do so now what we're going to do is we're going to go back up
here and we're literally just going to steal this correlation matrix and we're going to put it right down here but instead of the data frame we're going to use data frame numerized and let's run this and see what we get and it should be quite significantly larger than what we were looking at before so this is every single field now right because now it has a numeric representation of it and so let's look at where's the gross i'm just going to skim it looks like the company had quite a small part in if it was related to the gross revenue but right here let's look so it looks like budget is pretty highly correlated this has a negative correlation it looks like run time like longer run times earn more money sometimes just in a super small scale votes so if a movie was really successful and got voted on hundreds of thousands of times usually those ones made more money and you know it's not hard to see but there is a ton of stuff right here something we could do is keep it in the original matrix we had just the numbers instead of this heat map we could also do that and so let's just for the sake of it do that and filter it down to kind of look at this as well it'll give us some of the same output but it'll be a little bit easier to visualize than this huge thing but this is good this is still good let's do df_numerized.corr and we'll just run that so this is what we're looking at and i want to kind of organize this to where i can see the ones that have the highest correlation quickly so what we're going to do is use something called unstacking and so we're going to do right here and let's just call this correlation underscore matrix just to keep it simple and we're going to do correlation matrix dot unstack and then parentheses and then
corr underscore pairs and whoops i need to do this so what this does when it unstacks it it says okay here's our budget and this is what all the things are compared to for budget if we go down to gross which is what we've been looking at this whole time we can see that the budget has a high correlation obviously the gross is correlated to itself and votes so we can see that in a really quick way let's do it in an even quicker way i think we can do corr pairs dot sort underscore values and let's call that sorted underscore pairs oops it's equal to that okay so now everything is paired up right so it's kind of like the matrix except in a linear way i don't know if that's right but you see what i'm saying this genre versus budget budget versus genre it is still in the correlation matrix type so now that we have that sorted we can say sorted pairs and inside of that i'm going to say where we have sorted pairs that's greater than 0.5 so if it has a high correlation and we'll just call that high correlation and oops we'll do this right here high correlation and now we can see all the ones that had a high correlation these ones obviously don't count right because these all are themselves but the year correct and released did really well i'm not surprised but for gross so we had gross and budget we had votes and gross this is the only other one for the gross revenue that had a high correlation so it looks like my hypothesis of the company being significant didn't really play a part it wasn't necessarily correct but we did find one that i didn't think of was that votes and budget have the highest correlation to gross earnings so that is our project i mean we are at the very end and we can say company has low correlation and i was wrong we have come to the end i hope that you stuck with me i hope you got to
the well actually we're not at the very end at all i was completely wrong we have to upload it to the portfolio project so what we need to do is we need to save this let's rename it really quick let's type in movie correlation project i'm going to call it v2 you don't need to call it that but i'm going to call it that i'm going to save it so now we've saved it if we go back to right here you can see it's there one thing to note is that you can only upload it to github if it's under 25 megabytes so this is a little big but it's really easy to fix all you need to do is where it's just like looking at a ton of data like this we're just going to do dataframe.head and if we do that on one or two of these it will resolve all of our issues just trust me on that it'll make it much smaller let's see if there's any other ones like that yeah df_numerized.head and then let's save that because you need to actually run those let's make sure i ran those yeah perfect so now let's save this i want to save you some heartache and i mean it literally dramatically reduced it so we're going to go up here we are going to add our file upload files and this is where we need to go find it so i need to go into my c drive i need to go to users alex f then go right down here and click on movie correlation project v2 whoops i didn't want to actually open it that was my fault what i wanted to do is i wanted to drag it in here so i am going to go right here i'm going to drag it right over here drop it in there and just say oops initial commit i'm going to commit changes and there we go so let's open this up see what it looks like in here and it could potentially still be yeah it's still loading because it doesn't immediately go in there but it will show it in there and i'm hoping if i keep rambling for a second it'll work because sometimes it takes a little bit to get everything to work properly but that is that
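Looking back at the numerizing and pair-ranking steps from the walkthrough, here is a minimal sketch under a few assumptions: the tiny data frame is made up, and the variable names follow the ones spoken in the video. Two choices differ from what was typed on screen and are my own. The frame is copied with .copy(), because assigning the data frame directly makes both names point at the same object, which may be why the original got overwritten during the comparison. The loop also tests for non-numeric columns rather than dtype equal to object, so newer pandas string dtypes are covered too. Note that cat.codes follows the sorted order of the categories rather than being random, which fits uk getting 53 and usa getting 54.

```python
import pandas as pd

# Hypothetical miniature of the movies data frame (values are made up).
df = pd.DataFrame({
    "company": ["Lucasfilm", "Marvel Studios", "Lucasfilm",
                "Pixar", "Marvel Studios", "Pixar"],
    "country": ["USA", "USA", "UK", "USA", "UK", "USA"],
    "budget": [10, 20, 30, 40, 50, 60],
    "gross": [12, 35, 20, 80, 70, 150],
    "votes": [100, 90, 400, 300, 800, 700],
})

# Copy first so the original keeps its readable strings.
df_numerized = df.copy()

for col_name in df_numerized.columns:
    # Assumption: any non-numeric column gets numerized
    # (the video checks dtype == "object" instead).
    if not pd.api.types.is_numeric_dtype(df_numerized[col_name]):
        as_category = df_numerized[col_name].astype("category")
        # Integer codes follow the sorted order of the categories.
        df_numerized[col_name] = as_category.cat.codes

# Full matrix, then flatten it into (field, field) -> coefficient pairs.
correlation_matrix = df_numerized.corr()
corr_pairs = correlation_matrix.unstack()
sorted_pairs = corr_pairs.sort_values()

# Keep the strongly correlated pairs, as in the video.
high_corr = sorted_pairs[sorted_pairs > 0.5]

# Extra step not in the video: drop each field paired with itself.
first = high_corr.index.get_level_values(0)
second = high_corr.index.get_level_values(1)
high_corr = high_corr[first != second]

print(high_corr)
```

Each remaining pair shows up twice, once in each direction, because the matrix is symmetric.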
now one thing that we're going to be taking a look at very soon in the very next video is how to actually put all these projects together and put it into a portfolio website right i have done this already i have already created the website let me see if i can let's do alex the analyst github dot io so we're gonna be using github pages for this and so i'm gonna show you how to create this and it's not that hard and it's completely free and so i'm looking forward to showing you how to do this i learned this through youtube and now i'm teaching it through youtube so i've come full circle and this is a really good one i use a similar variation of this for my own portfolio but yeah okay this loaded so this is what it looks like when you actually upload it i mean it literally just looks like the output of the jupyter notebooks and so everything that we just looked at oh geez maybe we trim that down before you upload it yeah so that is an issue if you do that i know when i uploaded mine before i limited these to a certain amount so that this didn't happen but that's just funny to me and let's see here we go and so you see it exactly how it is in jupyter notebooks don't do what i did and have this definitely make sure that you limit it in some way so you can do the .head on all of them and that will work so that is our project for this week i hope it was helpful i hope that it worked and i hope that you can add this to your portfolio project or your portfolio and feel good about it i feel like i'm going to do more of a traditional one on this data set because i like this data set and we'll go look and see you know we'll do counts and like a lot of the stuff that we do in sql when we're doing exploratory analysis and then visualizations we'll do that with this i like this data set so i think i'm gonna do another project with the same exact data set except look at it in a
much different way and clean it up a little bit different and that year correct that we added over here we'll probably actually use because we'll do some time series stuff so with that being said thank you for joining me i hope that this was a good project if you stuck around this long i mean you definitely have invested quite a bit of time so thank you i hope that the next project in the next coming videos to finish out our data analyst portfolio project series i hope that they're super helpful and that you can get up and running and have a complete portfolio by the end of this thank you guys for watching i really appreciate it if you like this video be sure to like and subscribe below and i'll see you in the next video
Info
Channel: Alex The Analyst
Views: 21,533
Rating: 4.9492245 out of 5
Keywords: Data Analyst, Data Analyst job, Data Analyst Career, Data Analytics, Alex The Analyst, portfolio project, data analyst portfolio project, analyst portfolio project, data scientist portfolio project, python portfolio project, pandas portfolio project, python, python programming project, python project, data analyst python, data analyst python project
Id: iPYVYBtUTyE
Length: 60min 44sec (3644 seconds)
Published: Tue Jun 22 2021