Efficient Data Integration: Importing Multiple Excel Sheets & Files into R - A Step-by-Step Tutorial

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
all right okay today we are going to learn how to import multiple Excel sheets and files into R and we will also be following instructions from R for data science by Edie Wickam Ed wiam at all so go ahead and share my screen all right so this is what will follow from here reading multiple files but in addition we will do multiple Excel sheets so I'll be flipping back and forth from RS Studio to here and then maybe other places as we progress so let's go let's get it started so we are here and so this is what we are doing and yes yesterday we learned how to create create a project directory so today I change my mind again and I created a directory called R for data science uh dads and this is the directory that we'll be using for data science tutorials we have saved our data here so under data I created two additional folders one for raw data and one for tidy data there's no data that has already been tided but we have two items here under raw uh file which is we have multiple files there are 10 of these these here we are going to use the 2008 or South Sudan 200 sensus data just for learning purposes we are not saying this is what the population counts are now but we are just using this data to to teach what we want to share with you all and then this here contain multiple uh sheets and let's see if we can view it okay so it has multiple sheets as you can see so the first one is legs legs State Unity State J State upile and so on so these we will learn how to look through these so first we will see how to look through these Excel cheeks by using for Loop and then later on we will use the power package from thaiverse which is more efficient then for Loop so the first thing we will do we will load the packages the libraries that we are going to use I'm inan qu um visual editor so we we load so before we load the libraries I would like to suppress messages and in quar you just write the pound key and then the the vertical to that or a symbol and then space and then when you start typing message or the argument or option then it's going to give you the autoc completion like when you click next then here maybe suppress warnings or warning now we will load our libraries to load a library or a package in R you call library then type the name of the library here we are going to load all the I think there are nine packages within tidy verse we are going to load all of them at once and then we will also load read Excel which we are going to use for reading in these Excel files the other package that we may need but I may not implemented in this tutorial is write Excel for saving Excel files so but I think we will save it for next time now we have our libraries so let's go and see how we can Loop through all these Excel sheets again we go to our chunk Cod Chun let retrieve let let us retrieve the path where we have saved our file we have it and it's going to be this one okay so I did do do forward slash because we have the file here but now we are here we are in this here so I had to come outside and then now go into this uh directory so this is the path of the file let me do this and this is the path to the file that has multiple Excel sheets and then now let's let's say for Loop to do uh for Loop we need to extract the cheats by calling Excel cheet function on path now let's run it and this is telling us that this file or part contains 10 sheets which is true because we just saw it when we open the file or the F yeah the file now we are going to look through this and to look through this so we will save for and let's see when we say this so when we just do the autoc completion it's always going to give us the suggestion of what we want to say so so we are going to save for Sheet it doesn't matter what we call the first one the second one is what we must call sheets so for just a single sheet in sheets we are going to do something and we are going to call maybe let's call this okay I forgot something one step that I forgot we need to initialize so let's say sensus 2008 we will do we will create this as a table a data frame and that is when we look through all this like each sheet each sheet at a time and then later on we will combine all these sheets into one uh large file and then maybe inside here we can say this and let's do read underscore and it's going to be Excel so there are three options for Excel but we will use this one and we put in the path and and then shet is going to be the same sheet that we are saying here this cheat here is cheat is an argument so cheat is sheet and then we are going to read that and so let's say when we run this what do we find and then the next thing we will do is at the end we will combine all we will combine all the cheats and we will use bind row rows from D player and we are saying sensus 28 we are going to add what do we call it I think we call it yeah we call it this here and now let's see so we have to run it from here downward let's run it and see see what will happen and do we have any output okay okay so we are getting an error and it is saying incomplete expression path let here okay so let's see what the issue is let's try this again we come in here this is this okay now it works so we have just generated this data and this is the data that we have imported by just looping through the Excel sheets 10 of them now we see the states are here uh but this year is not the best way to go about this doing this so there's a better way and that is using power package it's faster it's efficient so this year this is normally how most people would do it this is how I used to do it before I learned about par package now I'm going to we are going to learn how to implement this in using par uh map function from par package so we will go back to R for data science and we will come and see what they are telling us here here they are talking about multiple Excel files I will first do the the multiple sheets and then second we will do the multiple Excel files but let's brow through this and see what they are doing so here they recommend that can use this function list. files from base R and so I'm going to copy this I can type it up but I'm just going to cheat a little bit copy it um but that's not what I want this time so this's let's go back to here first we are going to work with multiple Excel sheets we already have the path here so what we will do here is let's name our maybe we call it multi multi Excel sheats we just need a unique name to name the new file that we are going to create and here we are going to call the Excel sheets function we are going to pipe this into map function and now it's telling us okay give the it's not map mly so let's say we say path equal path that we created earlier and then the function that we are going to use which is this now let's try running this and see what will happen so we are getting an error and saying the argument but is missing and the reason is saying that because I need to pipe it in here now when we run this we get a list 10 lists CU there are 10 Excel sheets so we get 10 list when we use the maap function it will output list so there are 10 of these and we need to find a way to combine this so that we have a data frame rather than a list and the way we can can do that is we will pipe this sorry we will pipe this into list uncore rbin from par package as well and run it and now we have a data a data frame of 100 450 entries or rows observations so this here it's just maybe three line of s compared to this year so this year is the beauty of using part package or mapped function from part package next so that was quick next we are going to now I will copy this to work with the multiple Excel files so again what I have done is create 10 files one for each state of South Sudan there are other things that we could do uh to clean this data but I will show them in the next step after pulling in multiple Excel uh files so we will call this path we are I just copied this from the book online and then now I don't have to use this because that's not my that's not the path to my files so I will do the same thing I did earlier but this time I'm choosing the multiple Excel multiple files and the PN is the Excel so let's run it and see what it will say so path there are 10 path in this directory or files and then now we are going to do what we did above we will need to rename this we need to rename this uh name this data data frame let's go it may be just data frame DF and let's go back to the book and let's see what they are telling us here so just browsing through this is where they are talking about the maap function and this is one way to do the map and then the path and then now read Excel so we will come back to our studio and say okay let's do exactly what they they have done the way they did it and then save path let's call our function Excel now let's run it and see what will happen Okay so there are again 10 list and we are going to use so we are going to pipe this into list or buy and this here is going to convert that into a data frame now we have the same data frame that we have here when we combine these and also here so this is how you import multiple Excel files or Excel sheets into R using power package just only one function map that did that for US map as variances so we say question mark map to look up no that's not what we want let's look it up to see more information about map from per package the description and then this these are the variances I was talking about mapcore LGL for logical and then integers double chore which is corrector and then mapped vector or vect there are also other like work walk two and so on so these are the variances here the was maore DX but I think that has been superseded let's see so let me try this and that would give us the same so maore DF was a function a variant of mapped function but again I think it has been superseded or yeah let's look it up I don't want to miss speak yeah so it has been super uh superseded using the map map 2 IMAP and then pmap a super CED that it it's not like map is not there are the poit people now advise using map instead of using mapcore F and but it's still there you can use it if you want to but I stopped using it a while back so I just use map and along with lcore arbin arbine to just convert the list into Data frame now what I will do here is just do some data clean up because when we look at this data we see a lot of Nas we see a lot of columns that are not useful so we will pick out columns that we are interested in and then leave out columns that do not make sense to us and so this is not part of the data input but it is just part of the data uh cleaning or Tiding and transformation which is data wrangling so I'm just going to do this as a part of this tutorial but basically this what we have done here is how you import put multiple Excel sheets and multiple Excel files into R there is we could have done this one here [Music] so let me do this we could have done this and then could have written a Lambda function we could do that or we can just say path and this is how you write the anonymous function or Lambda function in R now let's run and see if we're going to get an error so we didn't get an error but let me let me do this let me change it to see this is how I always validate to make sure that that one ran yeah so it ran without throwing an error so that's this is another way we could have done and you can do it this way especially if you have let's say for example there are other arguments that you would like to include for example if there were rows that you wanted to skip then using anonymous function would have allowed you to do so are the argument that you that are part of the reor Excel uh but this is not necessary because our data is as expected there are no other arguments that we would like to include within the reor Excel function and then now let's see that's all on that side so what I will do is just clean up the data a little bit and then that will end our tutorial so let's look at the data frame I will keep this column just for the sake of it but when you look at when we look at column Nam so we can view our data inside here by using the view function instead of clicking it like the way I was doing uh so what is why is it doing that okay so let's see okay so so we can do it view it like that and we can also gleam use the glams function and what the glimps function will do is will print out summary like this so we see all the columns there are 11 columns that are four 50 rows and these are column so when you have a column name that has a space between the words then it's contain or included bra within the back tis we will need that first column we will also need the the region name which is the state name we will need the variable name which is population it contains population we will need H and then H name and excuse me and then this 2008 is the population column so let's go back and and we will remove this now let's say Tiding T and transform first thing we will do we will use a function called cleancore names from a generator package to tidy our column names clean them now the names have been converted into lower cases and underscores are used where there are more than one word in a column name next we are going to select Columns of Interest columns that we are interested in using the select function or select verbed from dly R and what I will do I'll use a we use the Glam function here glams function and then let me just select there are several ways or two ways we could have done this we could have just used the index or the position of the column by using like one for the First Column and so on but we are going to just select the column names that we are interested in region name I will rename this as state so what I will do I will break or let me let me just do like this because there's something that I would like to show you all as well [Music] now so let me see what that is variable name okay I'll call this here A call it gender and now you see I'm like running out of his space but I will keep it like that because I need to show you something that is going to make it look nice for us so we need this here we will call this H category and then the last one we will call it population let see so the column name is a number so we put it in the backtick okay so I got got an error and so the newer version of R is just it's not doing it that way anymore it's just putting X before the number that's why through the error but this is important because this error is it's a good learning example it says cannot subset columns that don't exist so it didn't find the column that I put in with this here was in the vtic and then so that's how you diagnose or the B now let's see see so here we have only five columns and then we can do more data cleaning so what I was saying this year is not it doesn't look good because it's all running over you have to be organized because the most important thing is for your code to be um legible easy to read by you maybe hly we can say the fuchia U so by fuchia U and all the other people will be reading your code so there's a function from that there a package called style r or Styler style R I think we will use it to format this code so I'm just going to highlight this I can do control shift p uh and then Ty typed no type style and then when I start typing it will bring it up so style the select CH so we are going to select and I think it didn't do it so why it didn't do it the other way I have been doing it is going to add add ins and so where is it so that didn't do it okay let me try again let me let me remove some of these okay so I'm going to try again let me highlight this or you can do it for the whole active file so I'm doing command shift p and then start typing what I'm looking for so let's say active file and again it didn't do it so maybe something has changed so I'm just going to manually do it and what I'm doing I'm breaking to the next line so that all these columns can be read clearly so now this is better this is how you supposed to write or like this kind of code here other thing I usually do is I go to the next line and like that but it doesn't make any difference so we have now selected the column of interest and then the next thing we need to do the next thing we need to do is so let me come [Music] here we need to replace this n as you can see this is saying greater equatorial So Below each uh state name are n a and those rows are for that state so we are going to use uh field to fill them in and we are going to separate the genda column because as you can see there is a population comma M and then space and then number in parenthesis we will need the middle part where it say male and female and so on we don't need this other part so we will use uh a variant of separate function from tid R so the first thing we need to do is to use the the field from tidy R and here we give it a column that we need to uh to F and we can do it like this and then come back and see what has happened by default it's going to fill it downward but you can specify those if you want to so when I click Like This there is dot Direction and this dot Direction has some options so let me I'm going to delete this yeah so the option the arguments that will the okay so there's down upward up or downward upward a down up or up down I've always used down I have never used the other arguments but that is what you do to to fill those Nas just go in there because I I I know that those rows belong to the column that they state with whose name appears above otherwise you w you should not do that if they were different other thing we should have done we could have we could have called this former region yeah former region is correct let's leave it like that and then now we are going to separate the genda column so we will say separate and then we start typing and now separate we can do separate underscore longcore dim or subate underscore long we will do the wider and wider here we will go back to to R for data science then we will see it has been used here okay I think I pass it let's go back okay so it's hiding for me so let me leave it it's not important I can so let's look it up here I just want to show you how it works I use it a lot this is experimental so it's new it's it's it's just being recommended in place of just separate function and I always go straight to the examples so this is what the function uh does so you put in the data but here we are piping our data in give it the columns or column and then give it the limiter and then the names and then names underscore separate and names underscore repairs some of these I have not used before but I have used two few so there are times when you want to only pull out part of your column and then don't care about the other part so let's see let's go back up and say the column that we want to separate is the gender column and now limiter is going to be here I will use it spaces I'm not going to use uh comma if I use comma then it's just going to separate it into two but I don't want to separate it into uh yeah I don't want to separate it into two I also want to get rid of this uh parenthesis number inside so if I use spaces it's going to split it into into three columns and then so now we can do that and then we will say names and then names we will do I will do na I need the middle part and the middle part is what I will call gender and now let's try running it like this and see what will happen so now we are getting an error and say Li must be a single string not the empty string so okay and if we put comma then it's going to separate it into two and we will come here and it has retained this part the middle part that I was saying and the reason we don't see the other column is because we are calling it with still r or separate underscore y underscore limit limit to just make it any a drop it and so I like it this way because it will give me me one more step to show something else so now we look at this but we don't need this part where it says uh parenthesis number uh close parenthesis so we will have to remove that and the way we will have to remove that is we are going to use a mutate function from dly R so mutate and we will take the gender column and here we can solve this using different methods but what I will do is I'm just going to use Str Str from a string R I will use St move and so it it is asking us to give it a string so the string is going to be the column name and the PN the PN that we are going to remove I'm going to say remove this here I will also include that is space so I'm just going to say remove this I will put it here but we will run it because I want us to learn something from this because the parenthesis is Al a special character in R and it's R is just going to ignore it so let's say that and now I didn't take it away because of what I said so now [Music] um let's try doing this let's just skip uh let's just skip those and now let's try now that's remove it so before we told our that these close and the open parenthesis are not special character these are something that we don't care about if we only put one back slash a back slash is also a special Cor in R so R will not do anything about but now we are saying okay get rid of those we don't want them there is a space here and that's why those have been removed and now we have cleaned this column we are left with what we are actually interested in the other thing that I will get rid of we don't need the total columns these total columns can be calculated from the rows no sorry not columns this total rows can be calculated from row containing male and female because those are the original uh uh observations so we can do this by filtering those out using the filter verb from dly R and we will say gender if we say genda is exactly total then that will only return the total columns so let's do that but that's not what we are interested in we are we want to remove those so we got like um an empty data data set so something is wrong uh and we can go back and run the first spot and see what we did wrong and gender agenda here okay I think I know what might be wrong but let me let me see what I typed in there so I think there is probably an empty space okay so that didn't do it okay then what is the issue let's see oops okay um going to do this okay so there is a space before total in the gender column that's why it is not uh R is not finding it because it's saying what you have given me is not what is in the data frame so let's let's see if we add a space see if that is going to remove or give those yeah see it works but this is not what we want so another way to solve this issue because we don't want these spaces the other way we can do this is we can chain on to this function here we can change something on another function called trim sorry Str Str trim and that should get rid of yep now I got rid of the spaces of that space here and that's why after we ran it it didn't give us gave us the blank yeah now we got rid of that and we can easily check this again the other thing we can do we can say data frame we won the gender column and let's say unique values and now there are no spaces here but like I said earlier this is not we are interested in so we can say to negate this so we say okay give us all the rows excluding the total columns the total rows or rows containing total in them then that is going to do the opposite and this is what we want uh and then the next thing we can do is clean up this eight category I think there are too many categories but we can shrink them down and it depends on what you want uh there are no r or wrong answers but especially when you have too many categories and you want to perform exploratory analysis so what will happen is your data is going to be over stretch so too many columns you will have fewer uh people or counts in each category and too so too many uh categories will over stretch data to fewer uh categories will lead to overcrowded uh data so you want a like a a balance like the best number or you need to find the best number of categories or age categories that will fit your data well we used to do this by hand in statistics when I used to teach statistics at Mid Tennessee State University um but there are formulas that are used for determining the number of categories that you want to divide your data into but here I'm just going to decide so first we need to look at the AG categories and see how many of unique categories we have or classes we have so there are actually so 15 15 is too many but we don't need this total column here that's why but I will go from zero to 14 and then from 15 to 24 and so on but let's get rid of this uh total total Row in the h category column so here we will say we could have said yeah we can we can do this one and say h category we don't want it to be equal total now let's do that we got rid of that and then now we will come back here again and say okay let's look at those now there are 14 so let's take these I'll just copy this go back to your data frame we will use case again within mutate function and we could have done it up there but I'm just going to do it down here so within muted function we are going to take the H category and we will say is equal case when so we will say each time we will say when H category is in and then we'll do a subset so just going to paste this here this is not the best way to do it because you can easily make mistakes but what I will do I'll say if age category is in this interval like this if it is in the that c uh in those intervals then we will call it 0 to 14 and then we will repeat the same thing until we are done so AG category in and then we will say we I said earlier that we are going to go from 15 to 24 if that is the case then we will call this 15 to 24 you need to watch out because you can easily make a mistake here I will intentionally intentionally make a mistake later on so that we can debar that so let's come here here we will call it so we are using the tier we will call this 25 25 to 34 H category in and then now 35 to 44 next one will be 45 to 54 this is 45 to 54 we are almost done so the last but not the least it's going to be 55 to 54 so here I will make a mistake intention I intentionally make a mistake the last category we can leave it as just 65 and above or plus so the way you do it you say if it is true then just make it that make it that now let's run it and we get an error it says unexpected a string contain in and then it is just running around without telling us what it is so let's see below here if it's going to help us find it okay so I've already spoted one and one mistake we have or issue is here we are missing a comma here so that is not even what I intended that's not where I made a mistake so let's see again I made a mistake here but this year was an not intentional so that was like a real mistake and then this one wasn't intentional and then now let's run it and then we are getting another mistake but that mistake was intentional and what it is let's see if we can diagnose it says unexpected numeric constant in and then true so the mistake that I know was intentional here is I did not I left out the comma there now let's see if there's another mistake so there are no other mistakes and let's go back to our data frame and we now excuse me we clearly see that the age categories are now from 0 to 4 15 to 24 25 to 34 35 to 44 45 to 54 55 to 64 and then 65 Plus and yeah our data is now sat tidy and we can do more with it but that is all I wanted to show you the main the purpose of this tutorial was just to show you how to import multiple Excel sheets and files into R using um for Loop but but mostly using a b function from Power package so I hope you will find or I hope this was helpful uh please let us know if there are topics that you are interested in we will be doing tutorial live tutorials every Saturday and Sunday it will depend mostly Saturday will be data Sciences are Sunday will be it will depend on on the day on the week there are weeks when I will be uh be doing uh data science with python or just talk about General um knowledge think next Sunday I will be talking I will show you all how to use uh Google duck because some people reach out to me about tools using opensource tools or tools that are freely available without having to pay for Microsoft subscription so I will go I I will show I will have a tutorial live tutorial on Sunday on how to use Google ducks to write for your writing so again thank you all for your time we appreciate it please don't forget to share our videos like comment and also subscribe to our YouTube channels Jung Institute and Y data studio also follow us on Linkin and on Facebook and until then see you take care bye
Info
Channel: Alierwai DataStudio
Views: 2,482
Rating: undefined out of 5
Keywords:
Id: ex5ubFNIwMw
Channel Id: undefined
Length: 62min 10sec (3730 seconds)
Published: Mon Dec 04 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.