Combine Data from Multiple Excel Files with Inconsistent Column Names

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
in today's video we are talking about a very common problem that people face while combining data from multiple Excel files what if the data of those files has inconsistent headers how do you then combine the data and bring all of the data together and append them one below the other is something that we will take a look at in this video no further Ado let's start all right it wouldn't hurt to take a look at the problem first and then try to build up the logic and then actually build the solution together all right I'm working with these three simple Excel files which are there in one folder this is 2006 file this is 2005 file and this is 2007 file containing some random sales data now if you take a look at the headers of these files the headers are inconsistent and also you'll find that one of the file has got one extra column how do we deal with all of that it's something that we will take a look so for the moment if you take a look that date here is called date date and date everywhere that's good the sales rep however is called as rep in the second file and third file it's again called a sales rep so you have inconsistent headers similarly the client is called as customer in one of the files and again a customer the amount is at times called as sales amount so you again have inconsistency there and the profit in the region are fine it's just that now we have one additional column which is there in 2007 file which is called profit percentage which is missing from the two files above so we'll have to handle all of this inconsistency and then bring the data together into a single Excel file before we start to build the solution it's very important for you to understand the logic first and then you can tweak the logic in case you would want to customize the solution to fit your own needs here is what we are trying to do the first thing that we are trying to do is we're trying to build a bridge mapping table in between which is where I would accumulate all the column names so we are working with all of these Excel files that we have and in these Excel files the data is inconsistent that actually the column headers are inconsistent so I'm going to extract all the names of the columns and bring it into a list right here so I will just maybe have all the names of the columns right here now power query automatically cannot know that what would you like to rename the column as so you'll have to give it as an input that this is what I would like to rename my column for example let's just say that we have a column called amount Now power query wouldn't know that what would you like to call the amount column as finally when you present the data together so I'm going to say that hey power query when you have the amount column I would like to mention the Mount Column as a value column rename that as an as a value column so this particular column which is the rename column it needs to be filled by the user as to how would you like to rename all the columns that you have in all of these Excel files and this would accumulate into a mapping table once you have created a mapping table this mapping table is going to be used to rename all the data coming from the Excel files and then eventually we'll be able to combine it so here is how it's going to work and hopefully I'll be able to give you some visual picture of it so imagine that we're trying to create a query right here and we have a couple of steps source and we have navigation and at one point in time in the query you have a column called data and in that particular column called Data you have like a table and then you have another table and then you have another table and you'd be able to peek into these tables and see the preview of the data at the bottom I'm sure you have done that in the in the past so if you take a look at the preview of the tables the headers are going to be inconsistent right here what we are going to do is we're going to use this particular mapping table to rename The Columns of the closed tables or the nested tables without opening them and once the columns have been renamed using the mapping table we will then expand the data or combine the data eventually in the next step right so what we're trying to do is we're trying to rename the tables when they are closed and then eventually we'll try to kind of work this query out I hope you got the essence of the logic now let's start to build the mapping table first and then Carry On from there all right in power query here and I've already done some bit of leg work to be able to speed in up the process the first thing that I have done is I have created this folder location this folder location is nothing but the place where all the files are kept this is simple blank query so I can right click new query other sources in a blank query and I can just rename that as folder location and I just paste the path of the folder location right here however you can also say get data from file from folder but I am just used to of creating variables in case you would want to change the location yet tomorrow for some other data you can just quickly come here and change this folder location and it will change everywhere if you have connected that any which ways and then I have this simple mapping query that I have created and I have just written The Source step right here so in the source I have written a very simple formula right here which is folder dot files and I'm using the folder location that I just created right here all right this gives me a list of all the Excel files which are there in that folder so I can see that I have three Excel files in the folder year 5 year 6 and year 7. now in order for power query to be able to read the data of these Excel files I need to convert these Excel files into a table format so that you can able to read the data okay so I'm going to create a new column add column and a custom column and I'm going to start to use a function called excel.workbook in the Excel dot workbook function it's asking you hey where are the Excel files so the Excel files are there in the content column but these Excel files need to be in a data type of a binaries take a look so I need Excel workbooks but they need to be in the binary format and sure enough they are at the binary format so I'm just going to maybe feed the content column in there and the next input of this function is that use headers as any so once you're reading the data of off of these Excel files the first row of every page should be considered as a header that's what I'm wanting to do so that I don't have to do promote headers later you'll see so I'm just gonna maybe write a true right here close the bracket this is automatically promoting the headers and close and then okay now as of now I have been able to get a table table this is peaking into the Excel file and converting the Excel files to different pages in that file so every single Excel file has just got one page and I'm just going to expand this particular column all the columns are good and here is where I can take a look at the data so as of now if you take a look at the data you can see that the headers are automatically promoted and I don't have to do another step called promoted headers that is the reason why I wrote a true a couple of steps ago any which ways but the problem here is that that headers are inconsistent so we have a client here but we have a customer here so inconsistent headers but I'm trying to create a mapping table to capture all the columns together and then you know kind of relabel those columns so I'm going to right click on this column I'm not interested in the data just the column headers right click on the column and I'll say drill down this converts this particular column into a list which is where I have three tables which has got inconsistent headers I'm going to maybe wrap this into a function called table dot combine so table dot combine now what this function is going to do is it's going to actually combine the data from the three tables that I have so the first table plus the second table plus the third table and it will just append the data one below the other having the columns in an inconsistent format so if I now press enter you're going to see that we also have a rep column but we also have let's say a sales rep column so these columns should have been renamed before you combine the data but as of now it just combines The Columns but I'm not interested in the column the actual data I'm just interested in the column names right so I'm just going to say that hey I don't really want to have the the data but I just want to have the names of the column so I can wrap this around in another function called table dot column names and what this is going to do is this is going to give me the names of the columns in this table which has been combined so I just close this close this in the end press enter and what I have been able to get is the names of all the columns of all the combined tables which are nothing but your Excel files let's just load this data into Excel all right once the data has been loaded into Excel you can see that it just has got one column but we decided that we are going to have two columns the name of the original column and the name which is the rename that you're going to give the rename is the column that we'll have to give so we'll add a column to this table which is called the rename column and we'll start to rename our columns this is going to be a manual input that you will have to provide and tell power query that I'd like to rename the column so and so so the date column remains State the rep column remains sales rep this time however the customer column remains customer sales amount becomes amount profit is profit region is region sales rep is sales rep client becomes customer amount is amount and profit percentage is profit percentage right so no change in that now that we have this query which has got two columns one of the column is the output of the query and the other column is the column that we have created we'll take this table and load this back into Power query and start to work with this further so I'm going to click on this query go to data and then say I'll save from table range and this table with two columns as of now gets loaded back into my query now let's just rename this particular query as a renames and then we will start to use it a bit later as of now what we are going to do next is start to combine the data of the multiple Excel files and then try to use this query to rename our tables before we expand them all right part one is done we have the mapping table ready now let's just start to create an actual query which is where we'll start to use the mapping table and then combine the data from multiple Excel files also making sure that in the process of combining the data we are also renaming the headers and all the data fits in consistently all right I'm going to start to use this mapping table because I have done a few steps in here that I'd like to reuse so I can duplicate this table rather than writing the steps again so I'll click on duplicate and the table gets duplicated however I'm going to change the name of this table to as combined data and delete maybe this step and just retain the first three steps so if you take a look I am still getting the data of the same folder location the binary is Excel file I have to convert the Excel file into a table format and we have the tables right here and then I expand the table and all the sheets within that Excel file I accumulate those sheets in the data column as of now if you take a look at the data column my headers are still inconsistent that means you have a custom in here and maybe if I take a look at the previous one I have a client in here and so on and so forth but I'd like to kind of have the names consistent now let's just start to work with one table and then let's just see that how are we going to manually read in the columns and then try to apply the same logic to our mapping table so I'll just open this table for just a quick second and I'll try to manually rename the column so let's just say that the client column needs to be renamed as customer for instance I press enter a new Step gets inserted and let's just say that the amount column needs to be renamed as value just for an instance now if you take a look at the query that has been generated up on the top the step has been generated and the step is saying that hey the client column was renamed to a customer and the amount column was renamed to a value every single column that you rename is going to come into two pairs what is the name of the column and what is the rename of the column so name of the column and rename of the column this name rename ampere is packed into a list so this is like a list you can see the curly brackets and this is another list and every single list is then again packed into a list so what we are talking about is a list of a list structure so a list outside and within which we have sub lists and every single list is going to have two parts the column name and the rename if you're able to create a structure like this I think we should have solved the problem all right so I'm just going to get rid of this rename step right here and the data step right here and we are back onto our table which is where we have inconsistent names right here now let's just go back to our renames which is nothing but the mapping table that we are in the process of creating what we're going to do is we're going to transpose this particular table and I'll explain to you the reason why we transpose so I'll go to the transform Tab and I will transpose this particular table now if you take a look once we transpose the table what we have been able to get is a table with just two rows so the first row is nothing but the name name of the column and the second row is the rename so name and the rename and the reason why being able to transpose is that because we were trying to build a list of a list structure and the list had just two parts the name and the rename all right now this cannot go in as a table this needs to go in as a list so this entire thing needs to be converted into a list format so I'm just gonna say that this like a table dot two columns function what this I'm trying to say is that hey you have a lot of these columns which has got two rows name and a rename please convert all of these columns into a list all right let's just do that close the bracket and press enter what we get is a list within the list we have sub lists every single sub list is going to have two parts the name of the column and the rename of the column right so this is pretty good all right now what we're going to do is we're going to come back to the combined data query and before we expand all of these tables right here we'll apply the renaming so I'm just going to create a new column create a new column and I'm going to say something like this I'm going to say hey table dot rename columns and I'm going to say hey we have a table right here so in the data column we have three tables pick up every single table and then apply the reading so I'm just gonna maybe go ahead and say pick up the data column and then it says what are your renames so renames is nothing but the renames query that I have created and this query adheres the list of all this structure that this is asking pretty good and I'm just going to close the brackets and press enter however this gives you an error let's just go find out what the error is if you maybe click on the error and you see that it's trying to rename a sales rep there is a column that we're trying to rename which is called the sales rep however if you maybe go into this particular table and see that we don't really have a sales rep we have a rep column so what we're going to do is we're going to maybe have to tell the query that in case you don't find a column please ignore that so I'm just going to go back to my step right here and I'm going to maybe add an optional input in the end and I'm going to say that if you find a missing field please ignore that so missing field dot ignore is what we'll have to write pretty good now I'm just gonna maybe come right here this starts to work again so you can see that the column has been called as sales rep pretty good all the columns have been renamed pretty good and now is when I can combine the data of all of these files now one way is that you can click on the expand button but the problem with the expand button is that all the names of the columns are again going to be hard coded which I don't really want I however going to right click on this column and click on drill down this converts this into a list and in this step I'm just gonna maybe use the table dot combined function once again so I'm going to say table dot combine and close this into the brackets press enter now because the headers are consistent of the three tables right here the data just gets combined and all the columns are renamed and pretty good now if you remember that one of the files had extra column which is nothing but profit percentage so if you obviously have just one column present in one Excel file and the other files have a missing column obviously you will not be able to have the data for the other Excel files but this is pretty good all that I will do is I'm just gonna maybe take this data back into Excel and real quick test this out all right the data has been loaded into Excel let's just put this query to test I'm gonna go back into my 2005 Excel file right here and start to add a new column which has got inconsistent column name so let's just add this column called PCT profit which is nothing but profit percentage and I'm gonna maybe take my profit divide that by the amount add a new column in there and drag that down to the rest of the cells all right pretty good now if you take a look at one of the other Excel files the column was profit percent symbol but here the column is PCT profit what I'm going to do is I'm going to save this file close it and I'm going to go ahead and go back to my query remember that we have this mapping table which is where in case a new column has been added it will start to show up right here I'll right click and I'll say a refresh first you can see that a new column starts to get added which is nothing but PCT profit however the problem is all the other renames have moved so once you're refreshing this particular query you will have to write the renames once again that's a problem but I'll do that once again so date is a date a rep is a sales rep customer is a customer this is good this is good region is good this is nothing but uh profit percentage sales rep customer amount all right I've been able to rename all the columns right here this is pretty good I'm going to go ahead to my combined data query right here and now once I right click and I say refresh against 2005 I get to see the data right here and you can see that I have been able to create two different columns because one of the columns that I renamed was without the space so if I just go back and correct that so you can see that this one has I'll just add a space here as well now the columns are consistent and I can just go back and start to do a refresh now I just will get one particular column which is profit percentage for the Year 2005 and the data is all pretty good all right that's been it combining data from multiple Excel files when the files have inconsistent headers in case you have struggled with this particular problem in the past I hope the video was helpful in case you have any questions around this please feel free to drop in a comment and I would be glad to reply in the end a big shout about my tax and my power query courses in case you're a beginner in power bi and you'd like to master the fundamentals of data modeling power query and Dax in order to solve more difficult problems at highly encourage that you take a look at my courses it's going to be super helpful thanks so much for sticking around all this while and I will catch you guys in the next one [Music] thank you
Info
Channel: Goodly
Views: 46,015
Rating: undefined out of 5
Keywords: Power BI, Power Query, Excel, DAX, M Formulas
Id: 0ZOY8is-bgY
Channel Id: undefined
Length: 18min 31sec (1111 seconds)
Published: Tue Nov 15 2022
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.