Day in the Life of a Data Analyst - SurveyMonkey Data Transformation

Video Statistics and Information

Captions
Hey guys, today I'm going to teach you how to manipulate data using Python and pandas in order to turn a SurveyMonkey data set, which comes out in a very wide format, into a long format, so you can visualize the data more easily inside tools like Power BI and Tableau. Let's get right into it.

The reason I'm making this video: I'm a data analyst, and I also do some independent consulting on the side. A client asked me one day to visualize some data in Tableau, and one thing that's important to understand is that when you work as a data analyst, or even as a data scientist, you're going to spend a lot of time just getting data into a format where you can actually use it. If you're starting your data analyst or data science journey, it's really important to build these skills up. It's a very useful skill that you can sell on websites like Upwork or Freelancer, or even to friends and family, basically whenever Excel is no longer an option.

I took the file I got from my client and scrubbed it clean, so there's no identifying information here; the questions just read "Question 1", "Question 2", "Question 3", and so on. Let's do what any good analyst should do first and try to understand the data. The format the client needs in order to load it into Tableau looks like this: an ID column, a few demographic columns, a column that holds the question being asked, then a "question + subquestion" column, then the answer, then the total respondents for that question, and finally the total who gave the same answer. So at the end of the day we want to transform this sheet into that desired format.

Let's go through it step by step. The ID corresponds to the Respondent ID column. For demographic info there are several columns: we won't need start date or end date, but we keep things like which division you work in, which of the following best describes your position level, the gender you identify as, what we can call tenure, and your employment type. These questions identify the demographics of the individual, which is very useful in any survey for finding larger patterns in the data, so those columns should stay exactly as they are.

Then you have the question columns, and this is where the interesting stuff happens. The questions are written in what we call a wide, or pivoted, format, and we want to unpivot them: take them from this wide format and turn them into a long format. Take the respondent ID of the last row, for example:
that ID will be duplicated once for every question there is, because we have to correlate each of their answers with a question. That turns what I believe is a roughly 200-row sheet into something like 17,000 rows, and that's exactly what we want. Computers are much better at handling data of significant length than of significant width, even though as humans we generally prefer to read across rather than up and down. So these question columns are what we're going to transpose.

There's an interesting pattern here: we actually have two header rows. There's what I'll call the primary header, for example "Question 5", and underneath Question 5 it looks like there were five or six different responses you could have given. We'll have to deal with that somehow and collapse it into a single response column, which is why the desired format has a "question + subquestion" column; we'll call that second header row the subquestion. Going across, everything else follows the same pattern, and I believe there are about 100 columns in total. That is what we'll be transforming today. And I just want to reiterate: this isn't some random exercise. I was actually paid by a client to do this, and you can be too. It's a relatively easy skill to pick up and it pays pretty well.

So let's get started. I'm going to be doing this in Python, but we'll benefit from preparing some of the data in Excel first and then moving to Python. As a best practice, this is the raw data, so if we're going to make any changes we'll want to Save As a new file; that way, when we inevitably make mistakes at some point, we can always roll back to the oldest version of the data.

So we have the raw data and the desired format, and the first thing we need to do is collapse that second header row into the first one. The way I'll do that is by concatenating, which means sticking together, the first row and anything on the second row, if there is anything. Let's create a new sheet and call it "question". Because we're going to bring this into Python, you usually want to use underscores instead of spaces; that's just good practice. Select the first header row, copy it, and Paste Special > Transpose (on Windows the transpose icon looks like two arrows at a right angle). Then insert a row at the top and name this column "question", and do the same thing with the second header row: Ctrl+C (or Cmd+C on a Mac), Paste Special > Transpose, and call it the subquestion. Actually, I'm going to call these "raw_question" and "raw_sub_question"; you'll see why in a minute.

You'll see that for a lot of these, like Respondent ID, start date, end date, and email address, there's just a single entry, which is fairly
simple. A question like "What division do you work in?" is more interesting: you can either give one of the standard multiple-choice responses, or, if you select "Other", type something in, and we want to know what that person typed when they said "Other". You'll also see that for something like Question 25 there were nine different responses they could have given.

So let's get started on concatenating these two columns together, so we can collapse the two header rows into one. The first thing I need to do is fill in all of these blanks with the question above them. Let's call this column "question", and I'll say: I want it to equal the raw question as long as it's not blank. In Excel, "<>" means "is not", and the two quotation marks with nothing between them mean an empty string, so the condition reads "if this cell is not blank". For the value if true (the great thing about Excel is that it just tells you what it needs) I take the raw question itself, and if it's false I take the value above it, which you would think would be the cell directly above. Let's try that and see what happens; the only correction is that I forgot a closing parenthesis. Looks good; now let's double-click the fill handle to copy it down.

Interesting: we got some zeros, which basically means it didn't find any values, and you'll see why. Take Question 5. For the first of its rows, row 21: if this cell is not blank, which it isn't, we just use that value. Works well. Row 22: the cell is blank, so we don't pull its value; we pull the value above it. Sounds good. Then we get to the third row, row 23:
the cell is blank, so the formula takes the value above it, and there's nothing there, because the cell above was itself computed from a blank. How do we fix this? Instead of telling it to take the value from the formula cell above, I tell it to take the value from the raw question column, anchored back to the first row of that question, and you can see that fixes the problem: "Question 5" now repeats on every single row. The reason we need this is so that when we concatenate the question and the subquestion, Excel actually has something to concatenate on every row. And remember, concatenate just means to stick values together literally: 1 + 1 is 2, but 1 concatenated with 1 is 11.

We do the same fill for the raw subquestion and call that column "sub_question". Because the first few columns are demographic data, I could make the Excel formula more robust, but we just want the simplest solution, so for those rows I'll simply leave the subquestion blank.

Now we add a column called "question + subquestion" using =CONCAT (if someone can explain the difference between CONCAT and CONCATENATE, I actually don't know; I'm guessing one is a legacy Excel function). We concatenate the question, a dash wrapped in quotation marks, and the subquestion. For the demographic rows I just copy the question across; again, because this is a static data set and we're not pulling from a database, it's okay to use one-off fixes you probably wouldn't use otherwise. So now, instead of the header across the raw data reading "Question 24, blank, blank, blank, blank, blank", it will read "Question 24 - Response 1", "Response 2", "Response 3", and so on.

Let's get this situated. The first thing I'm going to do is create a new sheet called "edited_data". I know everyone was expecting a Python tutorial, and we'll get there, but what I really wanted to show is what it's actually like to be a data analyst: you're not coding all the time. Sometimes the best solution to a problem isn't to code everything up. Coding is great when you want a replicable solution, so if I had to do this over and over again I would code it, or if the logic is simple enough to code up; but sometimes it's just easier to do things manually. That's why we're starting in Excel, and we'll get to the code in a minute.

I just copied and pasted all the data, and now I'm going to go to the question sheet and copy these columns. They're all formulas right now, so I hit Cmd+C (Ctrl+C on Windows) and then paste values only, because Excel won't let me transpose a formula.
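For anyone who would rather stay in Python for this step, here is a minimal sketch of the same fill-down and concatenation done in pandas instead of Excel. The frame and column names are hypothetical, made up to mirror the raw_question / raw_sub_question columns described above.

```python
import pandas as pd

# Toy stand-in for the two transposed header rows; the real sheet has ~100 rows.
headers = pd.DataFrame({
    "raw_question": ["What division do you work in?", None, "Question 5", None, None],
    "raw_sub_question": [None, "Other (please specify)", "Response 1", "Response 2", "Response 3"],
})

# Fill each blank question cell with the question above it (what the Excel IF formula does).
headers["question"] = headers["raw_question"].ffill()
headers["sub_question"] = headers["raw_sub_question"].fillna("")

# Stick question and sub question together with a dash; when there is no
# sub question, keep just the question (as done for the demographic columns).
headers["question_plus_subquestion"] = headers.apply(
    lambda row: row["question"] if row["sub_question"] == ""
    else row["question"] + " - " + row["sub_question"],
    axis=1,
)
print(headers[["question", "sub_question", "question_plus_subquestion"]])
```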
At least, if it does, I don't know how to do it. And that's another thing: if anyone has any critiques or suggestions about the process I followed in this video, please leave them in the comments below. I'm always improving and always looking to learn better ways of doing my analysis. So, transpose it, and now we have Respondent ID, start date, end date, email address (we'll be dropping those columns), division, other... cool. This is going to work really well for our data manipulation. Over on the question sheet I only want the question column and the question-plus-subquestion column; I won't need the others. Cmd+S or Ctrl+S to save our work, and exit. We now have "SurveyMonkey output edited".

Now let's get into the exciting part. Open up your code editor. I'll probably make a video in the future on how I set up my environment and what I believe to be the ideal setup for a beginner Python or data analyst environment. As a data analyst I spend the majority of my time in pandas; while I'm in Python it's almost entirely pandas, because I work with a lot of structured data. So let's open up a new workspace, call it "SurveyMonkey tutorial", and open it.

We're going to be using IPython notebooks, also known as Jupyter notebooks; you've probably heard that term before. The great thing about them is that they have code cells that let you run your code step by step. They're missing some other features compared with plain Python in Visual Studio Code, but the ability to run code cell by cell is the reason I prefer them. That said, you can also run code in cells inside a regular Python script in Visual Studio Code, and that's definitely something I want to start trying, but it's not what we're doing today. I just use Visual Studio Code; in the future I'll probably make a video on why I prefer it. I've used PyCharm and Spyder before, but I like Visual Studio Code.

We'll open a new file, and let me show you my generalized format for naming files. When you become a data analyst, more so than in a lot of professions, you will live and die by your organization. It's really easy to mess things up if you're not organized, and I've learned that lesson the hard way, so the naming of your files is really important. By adding these prefixes, on a much bigger project I can come into this folder and easily tell what is what: an Excel file isn't always data, because sometimes I use Excel for parameters and arguments, just as an easy way to write them out manually and have Python ingest them, and I call these files "scripts" because this is the Python script we're writing to manipulate the data. Another reason I like to number the scripts is that often I'll do an analysis for someone, they'll say it's a one-off, and then they come back months later saying, hey, I need this done again. By numbering them, months later, when I've completely forgotten about it and it's a small enough project that I didn't
bother documenting it, it's easy to look at the scripts step by step and just run this, then this, then this. Although it seems minor, naming your files in an easy-to-use format is a great way to stay productive.

So let's get right into it. The first thing we need to do is import pandas. pandas stands for "panel data", and it's the standard Python library for data frame manipulation. I just started learning R, and apparently R does this natively, but Python is a general-purpose programming language, so a lot of its data science and data analysis features come from other libraries. We import pandas and, by standard practice, call it pd, so that whenever I want to use a method from pandas I don't have to keep writing "pandas."; I can write "pd." instead. Something else I like to do is "import os"; I always import os, and you'll see why in a second. I hit Shift+Enter, which runs that code cell, and it looks like everything ran as expected... oh, I was using the wrong environment. Environment management can be one of the more annoying parts of programming, so I just want to make sure that worked.

It's very important to keep organized environments, and I'll make a video in the future about how I organize mine. I have one called "minimal_ds", minimal data science, which just has pandas, scikit-learn, NumPy, the base things you need for any data science project, and whenever I need a specialized project I make a new environment for it and add packages there. That's just to make sure I don't dump a bunch of libraries into one environment and run into compatibility issues later.

There's one line of code I like to put in whenever I start anything: pwd, which stands for present working directory (I believe that's Windows terminology; I forget where I picked it up), set equal to os.getcwd(). The dot means "go find something inside os", and getcwd means "get current working directory". What this does is take the directory I'm currently in (a directory just means a folder), turn it into a string, and assign it to the variable pwd. The reason I do this is that it makes things easy when I move files around: even if the code might otherwise think it's in an older directory, everything I reference will be grabbed from the appropriate folder. I'll explain it a bit more as we go along.
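Put together, the setup cell described above might look something like this (a sketch rather than the exact code from the video):

```python
import os
import pandas as pd

# pwd: the folder this notebook/script is running from, so every file we read
# or write can be addressed relative to it even if the project folder moves.
pwd = os.getcwd()
print(pwd)
```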
Now the first thing we need to do is import the data set: dataset = pd.read_excel(), where the first argument is the file path, i.e. where the file we're trying to import lives. That's the present working directory plus the file name; I believe we called it the SurveyMonkey output edited data file, so I'll right-click it, rename, copy the name, wrap it in quotation marks, and paste it in, remembering to put a slash in front of it. If you're using Windows you'll need a double backslash instead, just because of the way Windows handles file paths; as a Mac user I put in a forward slash, and if you're a Linux user, I'm sure you know that already. Then I type dataset on its own line, Shift+Enter, and... it looks like we brought in the first sheet, because I didn't specify a sheet name. So let's specify it with the sheet_name argument: we renamed the sheet "edited_data". Try again; my mistake, sheet_name should be separated by an underscore. Looks good, this is exactly what we had earlier, starting with the respondent ID.

The cool thing with IPython (Jupyter) notebooks is that you can just type the name of the data set, without print or anything, and it shows a preview of the data along with how many rows are in it. I don't believe the head method does that for you; let's try it out: dataset.head() is generally how you're taught to look at the top few rows, and it only tells you how many rows it displayed. That's one reason I like doing it this way: it tells me how many rows are in the entire data set, and it's important to know we brought in all of the data. 198 rows; that's accurate, that's what we're looking for.

Next I want a look at all of the columns, with dataset.columns. Actually, before that, let's do something else: dataset_modified = dataset.copy(). The reason we do the .copy() is that sometimes you'll be bringing in a very large file, and by large I literally mean something like 50 megabytes, which can take a while to import depending on your computer and a bunch of other factors. As we modify the data set we're going to be changing things, and occasionally you'll make a mistake and need to roll back a change. Importing the data again can take a long time and gets really annoying as you iterate on your code, so I like to import the data once and then take a copy of it; if I mess up the copy, I can always just take another copy. All of our changes will happen on dataset_modified, so if we make a mistake we can roll it back.

So, dataset_modified.columns gives us a list of our columns. The reason I want this is that there are a couple of columns we want to drop, as you may remember from earlier, and I'd rather get rid of them sooner rather than later.
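Continuing from the setup cell, a sketch of the import-and-copy step as described; the file and sheet names below are assumptions based on the naming used in the video, so adjust them to your own workbook:

```python
# File and sheet names are placeholders following the video's naming convention.
dataset = pd.read_excel(pwd + "/data_SurveyMonkey_output_edited.xlsx",
                        sheet_name="edited_data")

# Work on a copy so the (potentially slow) import never has to be repeated
# if a later step needs to be rolled back.
dataset_modified = dataset.copy()

print(len(dataset_modified))        # expect all 198 responses
print(dataset_modified.columns)     # list the columns before deciding what to drop
```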
The less data that flows through our program and our script, the faster everything will work, so generally you want to get rid of as much data as early in the process as possible. I know we don't want start date, end date, email address, first name, last name, or custom data 1, so what I'm going to do (and there are multiple ways to go about this, this is just one) is copy those names and turn them into a list. A list is a Python iterable, literally something you can iterate through, like a collection of values, and lists are incredibly useful for all kinds of things when you're manipulating data. We'll say columns_to_drop equals that list and check it: the variable columns_to_drop is now a list of the columns we want to drop.

Now, dataset_modified = dataset_modified.drop(columns=columns_to_drop); it looks like my code completion isn't working. The reason you can't just write the column names out directly, the way you could for a single column, is that Python would treat each of them as a new argument to the drop method, and that's not what we want; you need to provide a list, a single object that drop can look inside to decide what to remove. And as you can see, we've now dropped all of those columns: we've gone from 100 columns to 94, meaning we removed six, which is exactly what we were looking for. That finishes the first step.

The next thing we want to do is unpivot this data. There are multiple terms for this: unpivoting, melting, making the data long. The way we do that is with the melt method, so we'll write dataset_modified = pd.melt(...), except I've forgotten the exact arguments, and my code completion still isn't working, so this is a great chance to demonstrate how I actually look things up when I forget them: the pandas documentation. One important thing to remember is that almost nobody remembers how all of this stuff works all the time; the only things I remember are the things I use constantly, and I don't use the melt method all the time, so it's perfectly okay to look up documentation. What really makes a good programmer or analyst, my developer friends tell me, is often just the ability to know what to look up, where, and how to find it quickly.
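Going back to the column-dropping step, it might look like the sketch below; the exact header spellings are assumptions based on a typical SurveyMonkey export rather than the video's file:

```python
# Columns that are never needed downstream; dropping them early keeps the
# data flowing through the script small. Header spellings are assumptions.
columns_to_drop = ["Start Date", "End Date", "Email Address",
                   "First Name", "Last Name", "Custom Data 1"]

# .drop() returns a new data frame, so assign the result back
# (or pass inplace=True instead).
dataset_modified = dataset_modified.drop(columns=columns_to_drop)
print(len(dataset_modified.columns))   # 100 columns -> 94
```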
With pandas, the great thing is that recently, I think within the last year or so, they overhauled their documentation and finally went to version 1.0 (they're now at 1.2.1), and the documentation is tremendous now. I went to Google, typed "pandas documentation", clicked the first link, and looked up the melt function: pandas.DataFrame.melt, or pd.melt, either one works. So I actually wrote it slightly wrong; the DataFrame in pandas.DataFrame.melt refers to the data frame we're working on, so it's dataset_modified.melt(). The arguments we'll use are id_vars (the identifier variables), value_vars, var_name, and value_name. "Columns to use as identifier variables": in our example that's the demographic data, the stuff we do not want to pivot, so let's identify those first.

I want to get the columns again with dataset_modified.columns, and because we're going to be talking about a lot of columns, I'm going to create lists out of this and slice those lists, so we don't have to type everything out or copy and paste massive walls of text; we have 94 columns to deal with now. It looks like the identifier (demographic) variables are the first eight columns. I can tell where each column name ends because I'm looking for a comma between two of the apostrophes; that's how I know it's a separate item in the list.

This output is actually an Index right now, and what I want is a list; the great thing about Python is that you just wrap dataset_modified.columns in list() and you get a list, which is a bit cleaner to look at: one value per line, simple. Let me explain what we're doing here. We created a list, and if I want a specific item from it, say the first column, I use brackets and type 0, because Python uses zero-based indexing: it starts counting at zero. That gives me respondent ID, which makes sense. If I want the first and second items, I use a colon (the slicing operator; I always forget what it's called), and when I write 0:2 it gives the first two items in the list: slicing includes your first index and goes up to, but not including, the last index. So if I want the first eight items, I write 0:8 and get the first eight columns.
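As a quick reference, the indexing and slicing behavior described above:

```python
cols = list(dataset_modified.columns)   # turn the pandas Index into a plain list

cols[0]      # first column (indexing starts at 0), e.g. the respondent ID
cols[0:2]    # slicing: start index included, stop index excluded -> first two columns
cols[0:8]    # the first eight columns -> the demographic columns
cols[8:]     # a blank stop means "go to the end" -> all of the question columns
```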
The reason we do it this way, rather than naming the columns out explicitly the way we did for the drop, is that I also need to specify the question columns, and there are about 90 of them; I don't want a wall of text 90 rows long, and this is a lot cleaner. So we'll call the first eight columns id_vars, then come down and run dataset_modified.melt with the first argument id_vars=id_vars.

Now for the value variables. value_vars is basically "what do you want to melt", what do you want to unpivot, and that's all of the questions: we want to take all of those questions and the answers underneath them and unpivot, or melt, them. So let's do the exact opposite of id_vars: value_vars is the same expression, except it's every column from the eighth column onwards, so I put an 8 before the colon and leave the end blank; a blank just means "go until the end". (Similarly, I could have written the id_vars slice with a blank start, meaning "from the beginning up to the eighth item".) Let's take a look: question 1 response, question 2 response, all the way to the end; that's exactly what we're looking for. And you can see why I didn't want to paste the column names in directly: I could, but it would make the code very long and ungainly, and there's a lot to be said for clean, good-looking code. Especially in a workplace, when you're sharing code with people, you're doing a service to yourself and everyone else by making your code easy to read; you'll spend less time explaining it and they'll understand it faster. So I'll hit Cmd+/ (Ctrl+/ on Windows) to comment out that exploratory line, and add value_vars=value_vars.

One more thing I want to change: I'm going to assign the result to dataset_melted instead of dataset_modified, because if I kept it as dataset_modified, then every time I re-ran that line it would try to unpivot data that has already been unpivoted, and I couldn't run the cell more than once. Whenever you're about to make a change that you might want to reverse or try a few times, assign it to a new object. Let's take a look: we have 17,028 rows and 10 columns. We've taken the wide data and made it long. There are a couple of other transformations we still need, but this is the bulk of the work right here. This is the kind of thing that pandas and Python, or R, let you do very easily and that is otherwise really difficult to do in Excel; you might be able to do it with Visual Basic (VBA), which I don't know how to use, but in Python we did it in a few lines of code.
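A sketch of the full melt call under the column names used so far (the video labels the unpivoted columns slightly differently, as discussed next):

```python
cols = list(dataset_modified.columns)
id_vars = cols[0:8]      # demographic columns stay exactly as they are
value_vars = cols[8:]    # every question column gets unpivoted

dataset_melted = dataset_modified.melt(
    id_vars=id_vars,
    value_vars=value_vars,
    var_name="question_plus_subquestion",   # the video uses a "+" in this label
    value_name="answer",
)
print(dataset_melted.shape)   # roughly (17028, 10): long instead of wide
```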
Looking at the result: respondent ID and all the demographic questions have not been pivoted; they stayed exactly the same. "variable" is the question plus subquestion and "value" is the answer, so we got exactly what we needed. That said, I want to rename the variable and value columns, and luckily the melt method provides arguments for that: var_name and value_name. I want the variable name to be "question + subquestion" (it's probably not best practice to use a plus sign in a column name, but it is what it is) and the value name to be "answer". Run it again and you'll see the question-plus-subquestion and answer columns.

Remember how I said earlier that I wanted to assign this to a separate variable? Here's what would have happened if I hadn't. I ran it once, and then if I want to change something, say the value name, it errors out, because now it's trying to reference dataset_modified, which has already been manipulated and changed. Whenever I make a major change to a data set, I personally assign it to a new variable. That can create its own problems, like storing too many variables or struggling to come up with good variable names after a while, but in this case we'll be just fine. So: 17,028 rows, 10 columns, question plus subquestion, and answer. That's the bulk of the work.

Now there are a couple of other things we need to do. If you remember, the client also wanted to know how many people responded per question and how many people responded with the same answer, plus they wanted to know what the original question was; I have question plus subquestion, but they want just the original question too. Those three things will teach us two major skills you'll use whenever you manipulate data: one is joining, and the other is aggregating.

Let's add the question column first; that will teach you a join, and a join is one of the most important and fundamental skills you have as a data analyst. It is also one of the easiest things to mess up, and I'll show you why. I have this question sheet, and, for those who don't know, a join is basically taking two sets of data with columns like these and sticking them next to each other. There are all kinds of joins: you can literally stick them side by side, combining row by row, or you can say, for example, "whenever you see this respondent ID, use that row" or "whenever you see question 4, use that row". So there are many ways to combine your data, and a join is one of the most important ones, which we'll use right now.

First we need to import the Excel file again, because when pandas imports an Excel file with a sheet name it's only importing that one sheet (you can import an entire workbook, but that's not what we did). So let's import the question sheet: questions = pd.read_excel(), with the present
working directory plus the file path, then sheet_name="questions"... and I get a FileNotFoundError, which means it literally couldn't find the file, because I forgot to add the slash again. Then a KeyError: the worksheet doesn't exist, so it was probably called "question", not "questions". There we go.

The only columns we need here are question and question + subquestion, so let's drop the others. We'll do the same thing as before; actually, an easier approach is to call the import questions_import and then say questions = questions_import.copy(), because remember, we make a copy. Then questions.drop(columns=[...]): this time I'll show you a slightly different way of doing it, and since we're only dropping three columns I'll just type them in as a list. You'll notice they're not shown as orange text at first; that's because I hadn't turned them into strings yet. This is the exact same approach I followed when I manipulated this data for the client last week; you'll find your own flow for how you like to do things, but this is exactly what I did.

It looks like we dropped those columns, so let's look at the data one more time... oh no, the columns are still there. What happened is that the .drop() method creates a new copy of your data frame, so all we did was make a copy, and because we didn't assign it to anything, that copy is gone as soon as it's created. That's why up above we wrote dataset_modified = dataset_modified.drop(...): drop returns a copy, and we assigned that copy back. Another approach, whenever you run into this kind of problem where a method doesn't seem to make the change stick, is the inplace argument: inplace=True, and you'll see the change sticks. "In place" basically means "perform this operation right now, on this data".

It also looks like we brought in about 16,000 rows; I honestly don't know why, but I know for sure we did not have 16,000 rows of data, and it looks like we have a lot of NaNs ("not a number"), so let's just get rid of them. The method is questions.dropna(), and here's another instance where we need inplace: before I add it, you can see the change didn't stick, so questions.dropna(inplace=True), and there we go, 99 questions.

So now we have our questions data set and our unpivoted data set. The questions data set has a question and a question plus subquestion, and we're going to join the two data sets together, basically to correlate every question plus subquestion with its original question.
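A sketch of the questions-sheet import and cleanup; the helper column names come from the Excel prep step and may differ in your own sheet:

```python
# read_excel with a sheet_name only imports that one sheet, so pull in the
# question sheet separately.
questions_import = pd.read_excel(pwd + "/data_SurveyMonkey_output_edited.xlsx",
                                 sheet_name="question")
questions = questions_import.copy()

# Keep only the question / question-plus-subquestion pair; the raw helper
# columns from the Excel prep are no longer needed (names are assumptions).
questions.drop(columns=["raw_question", "raw_sub_question", "sub_question"],
               inplace=True)

# Empty Excel rows come in as NaN; drop them.
questions.dropna(inplace=True)
print(len(questions))   # 99 question / sub-question pairs in the video
```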
For example, Question 29 apparently had 14 different responses you could have given; what was the original question? That's what we want to know, so let's do a join.

There is a join method in pandas, and I've used it before, but I watched a YouTube video a couple of months back when I was learning Python, and the person there recommended just sticking to the merge method. I'm not 100% sure why (if someone knows, leave it in the comments); I was told it's a best practice and it has worked out fine for me. So: dataset_merged = pd.merge(...). I use this method all the time, so I've memorized the important arguments: your left-hand data set, your right-hand data set, how you want to join (merge) the data, left_on (which columns from the left-hand side to use), and right_on (which columns from the right-hand side to use).

As a matter of practice, I always make the left-hand data set my primary data set, the one with the information I really care about; that's the new melted data set we created, dataset_melted. The right-hand data set is the questions one. For "how", we'll do a left join. There are different types of joins: a left join keeps every single row from the left-hand side and attaches every matching value it finds on the right-hand side; an inner join takes both sets of data, keeps only what matches between them, and drops everything else; and a right join does the exact opposite of a left join. I never use right joins; if I need one, I make the right-hand data set my left-hand data set instead. That's just a convention of mine, so that anyone reading my code knows I only ever use left joins. We want to join on the column called question + subquestion; that's what I want to match. (If anyone's interested, I can make a video on what I'd call primary data skills, things like unions and joins; once you understand the concepts, it's just a matter of learning how to do them in different languages like SQL, R, and so on.)

One major problem you can run into when joining data is duplicating rows. For example, if a question + subquestion value were repeated more than once in the questions table, the merge would try to match every instance of it with every matching instance on the left-hand side, and that duplicates your data. That's something you really have to check for whenever you combine data, so what I always do when I join is work out how many rows I should have at the end of the join; if the result doesn't match, there's a problem with my join. It's very important that you do this, because it's one of the easiest things to get wrong when you're doing data analysis.
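A sketch of the left merge plus the row-count sanity check, assuming the frame and column names used so far:

```python
# Left join: keep every row of the melted data and attach the matching
# original question from the questions sheet.
dataset_merged = pd.merge(
    left=dataset_melted,
    right=questions,
    how="left",
    left_on="question_plus_subquestion",
    right_on="question_plus_subquestion",
)

# Sanity check: a clean one-to-one join must not change the row count.
print("original data:", len(dataset_melted))
print("merged data:  ", len(dataset_merged))
```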
Here, I want 17,028 rows, because literally all I want to do is add a question column that carries the original question. So I'm going to print the length of the melted data frame: in Python, len() gives the length of any iterable, and for a data frame that's the number of rows. (A good question is how you get the number of columns; offhand I'd have to look that up.) Then I print the length of the merged data, len(dataset_merged), and look at the data set itself. Original data and merged data are the exact same length, so I got exactly what I needed, and you can see we have all of our normal columns plus answer and question. Perfect.

The next thing we want to do is work out how many people answered each question, because not everyone participated in every question, and we want to add that as a new column. We're going to do something called an aggregation: group the data by the question column and ask, for question 1, how many people answered; for question 2, how many people answered; and so on. pandas makes this super easy.

I probably should have sketched this earlier, but it helps to illustrate what I want. Right now the data is a question column and an answer column: Q1, Q1, Q1, Q2, Q2, with the answers next to them. I want to transform that into one row per question with a count of how many people answered it, say one person answered question 1 and three people answered question 2. So we take all of those repeated "question 1", "question 30" rows and collapse them into a single value per question. We'll create a new data frame with those counts and then join it back onto our data.

To create it: dataset_merged.groupby("question"). I want to group by the question column, so I just type in "question"; if you have multiple things to group by, you pass a list, as usual, but here we only group by one. Then: which column do I want to aggregate, and how do I want to aggregate it? Aggregating basically means taking the sum of a bunch of data, or in this case taking the count. I know I want to group by question and aggregate the answers: how many people actually answered the question. (.nunique literally means "number of uniques", but that comes later; for now, .count will just give me the count of people that
did what we expected.) You'll see the result comes out in a funky format; whenever that happens, just chain .reset_index() and you get a proper data frame again.

It looks like we have some weird answer counts, though, and here's the mistake: we don't actually want to count the number of answers, we want to count the number of unique respondent IDs, because one respondent ID is one person and we want to know how many respondents answered each question. We want unique respondent IDs because for a given question a person could have given multiple answers, but I only want to count them once; that's where .nunique() comes in. Let's try again... and now we get 198 for every single question.

This is where it's very important to always run your sanity checks: does this make sense? Let's open the data. I can tell right away that not everyone answered every question; these respondents here didn't answer question 3. So what's the problem? It's counting the null values, the NaNs. My first attempt was respondents = dataset_merged.dropna(), but that's wrong: dropna() looks for an NA in any column and drops the entire row if it finds one. What we actually want is to drop the rows where the answer is NA. So we take dataset_merged and specify the column with bracket notation; you can also use dot notation, a period, to specify a column, but I don't like doing that because a lot of my columns have spaces in their names, and brackets let me address those columns. Okay, I figured it out: the mistake was trying to filter with dropna when what I really wanted was to keep all the rows in dataset_merged where the answer column is not NA. (Even after you have some experience, you'll still completely forget how to do things sometimes.) With the not-NA filter we're down to about 9,600 rows, which removes a large share of what we had, and the respondents data frame is now every row where the answer is not null. That's good, because a null answer means they didn't answer the question, and we don't want to count those people.

This looks a lot more accurate: nothing goes above 198, and 198 is the highest, which means every single person answered question 29.
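A sketch of the respondents-per-question aggregation; the "Respondent ID" header spelling is an assumption about the export:

```python
# People who skipped a question show up with a NaN answer, so keep only the
# rows where an answer exists, then count distinct respondents per question.
respondents = dataset_merged[dataset_merged["answer"].notna()]

respondents = (
    respondents.groupby("question")["Respondent ID"]
    .nunique()                 # one person counts once, even with several answers
    .reset_index()             # turn the grouped result back into a data frame
)
print(respondents.head())      # no question should exceed the 198 total respondents
```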
Given the way the data is structured, that makes sense. So let's assign that, and now we want to merge this respondents data frame back onto the main data. First, though, I want to rename its respondent ID column, because our original data set already has a Respondent ID column; if we combined the two as-is, pandas would call one of them Respondent ID_x and the other Respondent ID_y, and we don't want that. So: respondents.rename(columns={...}), where you provide a dictionary, a key-value pair with the key being the original column name and the value being the new name; we'll call it "respondents", and make it in place.

Now let's do another merge. For the sake of time I'll just copy the earlier merge and call this our second merge. We want dataset_merged, the original data, on the left, with the same 17,028 rows (we want to make sure we don't duplicate anything), and respondents on the right, joining on the question column, which is called "question", not "questions"; it can be a little annoying, and I've had joins fail before just because I forgot an "s" at the end of a column name. Let's see what we get: original data 17,028, merged data 17,028, perfect, and we have a respondents column at the end, which makes sense: all the question 1 rows share the same value, and question 30 has a different count.

The second-to-last thing we need is the client's request for how many people gave the same answer per question. So we'll take the question plus subquestion, group by answer as well, and count the number of respondent IDs per answer. It's quite similar to what we did before, with one or two key differences, so again I'll copy the earlier code and call this one same_answer. Do we want to keep the filter that the answer is not null? When the client said "the same answer", they didn't mean people who skipped the question... actually, no, my mistake, we do want to handle that, so we can remove the filter; what I actually prefer is to comment it out with a hash, so if I ever need it back it's easy to re-add and I can follow my own thought process later.

So same_answer is grouped by question plus subquestion and answer: for every question plus subquestion, and for every answer that could have been given to it, what is the count? We pass the group-by columns as a list, question + subquestion and answer, and rename the count column to same_answer. That looks about right: basically it's saying there are 688 unique question-and-answer combinations people could have given, which is about in line with what we're expecting.
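A sketch of the rename, the second merge, and the same-answer aggregation, under the same assumed column names:

```python
# Rename the count column up front so the merge doesn't create _x / _y suffixes.
respondents.rename(columns={"Respondent ID": "respondents"}, inplace=True)

# Second merge: attach respondents-per-question to every row.
dataset_merged_two = pd.merge(
    left=dataset_merged, right=respondents,
    how="left", left_on="question", right_on="question",
)

# How many people gave each particular answer to each sub-question.
same_answer = (
    dataset_merged.groupby(["question_plus_subquestion", "answer"])["Respondent ID"]
    .nunique()
    .reset_index()
    .rename(columns={"Respondent ID": "same_answer"})
)
print(len(dataset_merged_two), len(same_answer))   # 17028 rows, ~688 combinations
```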
So let's do this one more time: one more merge, which we'll call dataset_merged_three. We want it to have the same number of rows as the second merge, the right-hand data set is same_answer, and this time we need to merge on two columns, question plus subquestion and answer, so as always we pass them as a list. Let's see what we get: 17,028 and 17,028, and yes, the same_answer column is there.

One more thing: you'll see we have some NaNs here, which basically means no one gave that answer for that question, and we want to fill those NAs with zero. So we call fillna(0) on the same_answer column of dataset_merged_three, and I believe I need inplace=True... I do. Awesome; this is exactly what we're looking for, and it's really close to the final data set. The only other thing the client needs is for these columns to be renamed to something that makes more sense.

We'll use the lazy method of copying the column names straight out of the output. You have to be careful here, because if I make a mistake Python won't necessarily tell me. Once we have all of these strings, we make sure to put commas between them, and then, since we're doing a rename, we need a dictionary rather than a list. Whenever I have an output I like to call it "output", so: output = dataset_merged_three.copy(), then output.rename(columns={...}, inplace=True). I'll rename the first demographic column to division_primary, then division_secondary, the position-level question to position, then generation, the "what gender do you identify as" question to gender, tenure, and "which best describes your employment type" to employment, all in place.

About the whole inplace thing: honestly, that's just my style. Some people prefer to reassign the result to the same variable over and over, and that's the beauty and the curse of Python: there are so many ways to do a single thing. (It's actually one of the things I've heard makes Go interesting as a language: there's a very limited number of ways to do things, which can make code easier to read.) So now we have respondent ID, division primary, division secondary, position, generation, gender, tenure, employment type, question plus subquestion, answer, and question; as you can see, we've produced that much longer data set we wanted.

Finally, let's write this out: output.to_excel(), with the present working directory plus a slash (again, a double backslash if you're on Windows) and a file name, let's call it final_output.xlsx, and we'll let pandas set the sheet name however it wants. One thing pandas will always do is write the index, that 0, 1, 2, 3 column, out as a separate column, and I almost never want that, so I pass index=False.
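A sketch of the final merge, the zero-fill, the rename, and the export; the rename mapping is abbreviated because the exact demographic headers depend on your export:

```python
# Third merge: attach the same-answer counts; this one needs two join keys.
dataset_merged_three = pd.merge(
    left=dataset_merged_two, right=same_answer,
    how="left",
    left_on=["question_plus_subquestion", "answer"],
    right_on=["question_plus_subquestion", "answer"],
)

# Combinations nobody picked come through as NaN; report them as zero instead.
# (The video uses fillna(..., inplace=True); plain assignment avoids the
# chained-assignment warnings of newer pandas versions.)
dataset_merged_three["same_answer"] = dataset_merged_three["same_answer"].fillna(0)

# Friendlier column names for the client, then write out the final workbook.
output = dataset_merged_three.copy()
output = output.rename(columns={
    "Respondent ID": "respondent_id",
    # ...one entry per demographic column, e.g. the division question
    # becoming "division_primary", the position-level question "position", etc.
})

output.to_excel(pwd + "/final_output.xlsx", index=False)  # index=False drops the 0,1,2,... column
```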
Run it; it created that final output, and if we open it up, that's exactly what the client asked for.

So, in summary, throughout this video I've shown you how to import data from Excel into pandas, rename columns, unpivot columns (which we call melting), do a join, drop columns, aggregate columns, and a few other things. The reason I wanted to go over this is that I think it's really easy to pick up, and I think it's awesome. When you go on YouTube and watch classes, people give you problems to solve, and that's a tremendous way to learn, but one thing I've realized is that it's also good to see what actually happens in the real world. This is a project I was paid real money to do; I charge a certain hourly rate for my work, and I put this together in about an hour. The cool thing is that because I charge a minimum of an hour, if this ever needs to be done again I have a script that's more or less ready, and I just charge an hour for what might be only 10 to 20 minutes of work. That's the power of programming and data analysis once you know how to do this.

I wanted to show you this because I want to show you how easy it is to pick up skills like this to make some extra cash on the side, and to show you what a real job might look like; this is something I would do on a regular basis at work, alongside other things. If you're interested in me going over more of my projects in the future, let me know. I'd love to make more tutorials like this, in this natural, easy-to-produce format, if that's what people are interested in. Thank you for joining me and watching me do this; I really hope it was useful. If you liked it, be sure to subscribe and like the video; it really helps me out and lets me know these videos are worth making and actually help people. Have a great day, guys. Thank you.
Info
Channel: Shashank Kalanithi
Views: 2,251,476
Id: pKvWD0f18Pc
Length: 77min 13sec (4633 seconds)
Published: Sun Feb 07 2021