Automatically Remove Top Junk Rows & Combine Data from Multiple Excel Files

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
I get this question a lot take a look at these three Excel files that I'm working with each of these Excel files has a few junk rows on the top and I'd like to remove the junk rows and then combine the data from multiple Excel files take a look we have 2007 Excel file it has got two junk rows on the top and I'd like to remove that and combine the data this is 2006 Excel file it has got four junk rows on the top I'd like to remove those and then combine the data and here we just have one junk row I'd like to remove that and then combine the data the tricky part in problem is that we have inconsistent number of junk rows on every single Excel file that we have so we need a mechanism to be able to handle that let's just go take a look how do we solve this type of problem no further Ado let's start [Music] foreign people I'm in Excel right here let's just go connect to that folder that contains the three files so data get data from file from a folder that's my folder click on open and that is going to give me the preview of the three files so at the moment I cannot combine so I'm going to click on transform data and that is going to land me in the power query window and I'll start to work with those files now even before I start to do all of the magic of the M language let's just start to understand the better fundamentals as to what logic are we going to form so for now I'm going to pick up let's say 2006 Excel file which is right here start to clean that manually a bit manually and then pack that logic and apply to the rest of the Excel files as well and hopefully that should work so I'm going to click on the binary light here to load that table it shows the table right here which is the data of that Excel file and I can see that there are a couple of junk rows on the top I'm gonna click on the table to load the table and start to work only with this table at the moment all right if you've worked enough with power query you understand that you have a function on the top right here remove rows and remove top rows and that gives you the ability to declare how many rows you'd like to remove from the top the problem is that this particular formula is going to hard code the number of rows that you would like to remove and we do not really have that so I need to come up with a logic to tell power query that this is where you should stop removing the row so for example go take a look at row number one row number two row number three row number four and as soon as you get to the headers that is where you should stop removing the rows but for power query to understand we also have to supply some bit of header understanding to power query how would power query understand that these are the headers of your data so what I'm going to do is first of all create a custom column right here so add column custom column and start to play around with my data so I'm going to write an underscore right here and underscore simply means a record not always in the context of the table so currently we are working with a table and we are writing an underscore so underscore simply means one row one row means one record so if any click on OK at the moment I'm gonna get a record record is nothing but the row of the data and that's the entire row of the data with all the columns that I have now I cannot really work with records at the moment because records contains the values and the headers I just want to work with the value so I'm going to convert the record into a list so that it becomes more convenient to work with that it happens to be a function called record dot to list I'm going to use that and start to work with that click on OK and the record that I had received it has now become a list now if you take a look at the list in row number five I can see that this list actually contains the headers that I have now how do I tell power query that at this particular position you have to stop you know removing the rows from the top so I need to kind of perform some check and the check is going to be performed by another function called list dot contains any so I'm going to go right here and I'm going to say that hey now that you have a list why don't you check that the list contains any of the possible headers or not so let's just say that I'm gonna write something like this all right what I've done is written a very simple function with list dot contains any and all that I'm trying to say is that hey you have a list that you are working with which is right here why don't you check in this list that do you have any of the two headers or not date and profit I mean to increase the probability that the headers are caught you can write more number of headers as well but for now I have just written two that means in the list if you do find date and profit that is the legit header row and you can just maybe stop removing the rows from the top now if I maybe press on OK at the moment you can see that I get a false false false false and that's where I get a true that is where it did find the profit and on the left hand side it did find the date as well all right once I've received a bunch of trues and falses with list dot contains any now I will start to feed this particular function in the table dot skip function which can skip the number of rows but based on a condition before I start to write the table dot skip function let's just try to understand that how that function works so so table dotscape can continue to skip or remove the rows till the time the condition becomes false not true false so it's going to say that hey this does not contain the header so yeah that is true that's going to be removed this also does not contain the headers that is true so true true true true and as soon as this becomes false it is going to stop removing the rows from here but the problem is according to the logic that we have built we are generating a true here so we need to invert this particular logic and that's going to be very very easy all that I'm going to do is right here I'm going to say that whatever logic is generated by list dot contains any please inverse that I'm going to write a not at the start and that is going to invert the logic click on OK and what we have been able to get is just the inverse of what we have received now this formula that we have it right here we can feed this formula into the table dot skip function to skip the number of rows as soon as you hit the first fall goals take a look so I'm gonna make a new step right here and I'm gonna say something like table dot Skip and table dot skip the first part is which stable would you like to skip which is nothing but added custom that is my previous step right here and in that I would like to use this formula that I have written right here so Ctrl C on that come back right here and I'm gonna maybe paste that formula and that is the condition that you have to use to skip the number of rows I close I press enter and you see that the number of rows have been skipped so all the rows that matched the true they were skipped as soon as you got the first false that is where power query stopped to skip the rows now what we can do is we can copy this particular piece of logic and this is the logic that we can apply to all the tables that we are working with all right to be able to do that what I'm going to do is I'm going to duplicate this particular query and work with that so I'm gonna get rid of all of the steps that we have and I'm gonna get back to the source and that is where we had the three tables 2005 tables six table and 17 table and that's where I had to kind of do my magic but at the moment these are not tables that power query can read it these are the binary Excel files right these are binaries so I need to convert the binaries into a table first before I can start to work with them so a very simple function so I'm just going to create a new step and I'm going to say something like table dot transform columns and I am trying to go inside this particular column called the content column and every single boundary should be converted into a table so that I can read it and the formula for that is Excel dot workbook so I'm going to say something like Excel dot workbook and this is going to be wrapped in the content column I have to tell which column that I'm trying to work with of course and that is pretty good close the bracket close the bracket and press enter missed the bracket right here press enter and now what we have been able to get is nothing but a table in which the sheet lies so if I can expand that table click on OK and that is where I can take a look at the table where ideally I have to skip the number of rows so that's my first table that's my second table inconsistent number of rows and that is my third table now what do we do we create a new column add column a custom column and the formula that we used just a while ago was table.skip in table.skip if you remember the first part was a table in which table would you like to skip the rows so here is a table here is a table and here is another table all of these three tables are there in the data column so I'm going to say something like table dot Skip and I'm going to say the column that I'm trying to work with is data which contains the three tables in each row of the table I'd like to write the logic that I just copied from there and that's pretty much it I can close the brackets and hopefully this should work click on OK it again gives me a table but if you peek into the table you can see that all the junk rows are skipped all the junk rows are skipped and all the junk rows are skipped now before I start to combine the three tables of course I have to promote the headers and for that there is a very simple function called table dot promote header so I can wrap the entire thing in table dot promote headers and close the bracket in the end and that's kind of good to go click on OK my spelling was incorrect table dot promote headers I just corrected that and you can see that the headers are now promoted nice and easy and I can now combine the three tables now I'll tell you a trick I can create a new step and I can say hey the three tables that I'd like to combine are in the custom column and I'm going to use a function called table dot combine start the bracket added custom is the name of the previous step in the previous step I have a column called custom which contains the three tables that I'd like to combine and I'm going to maybe in the square bracket write the name of the column close the bracket press enter and here are all the three tables combined with the top junk rows removed that is pretty damn awesome all right that's been it let me know how did you find this one in case you have any questions around this please feel free to drop in a comment and I'm gonna be glad to reply in the end a big shout out about my Dax and my power query training courses in case you are starting out your journey with power bi and you'd like to learn the fundamentals really good power query data modeling dats and then move on to solving more dense more complicated problems even of your own data my course is going to be highly beneficial in giving me the confidence of attacking problems in a logical way I suggest that you take a look at my courses it's going to be super helpful thanks so much for watching this and I will catch you guys in the next one bye now [Music] foreign [Music]
Info
Channel: Goodly
Views: 33,460
Rating: undefined out of 5
Keywords: Power BI, Power Query, Excel, DAX, M Formulas, combine data
Id: rZQZhoWu92A
Channel Id: undefined
Length: 10min 22sec (622 seconds)
Published: Fri Jun 16 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.