Power Query Tutorial - What is it, how to use it & 4 complete examples + Free download

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hello and welcome to chandu.org in this video we are going to look at power query and understand all the powerful and useful features of it through a full length tutorial in my opinion power query is a life-changing tool once you know how to use power query you can automate most of the common data collection clean up extraction transformation and data maintenance jobs that tend to take up quite a bit of our time as analysts or reporting professionals so i believe that just by learning power query and using it efficiently you can save a ton of time and the best part of power query is it is actually a transferable skill so if you know how to use power query you can use it both in excel and also in power bi so one tool but two completely different applications this is why i believe learning power query can pay off rich dividends for anybody in the space of data analysis so in this video what i will do is we will start off by showcasing power query as a mind map because power query is somewhat traditional but at the same time somewhat completely new so if you want to really understand and appreciate it then you must really look at it from a holistic network kind of a picture so i thought i'll make a mind map and i'll show you how power query really works and then we will get into four full length examples on power query so we will start off by looking at some really simple data like a made up employee data set how to load that how to clean it up and how to publish and analyze that that's the first example the second example is how to do web scraping through power query so you can use power query to connect to any kind of websites and internal internet websites like sharepoint or other types of portals scrape the data in a structured manner and maintain the data in excel or power bi for analysis so how to do that the third example is consolidation let's just say you got a folder full of invoices or purchase receipts or project files or something else and you want to combine all of that into one neat table how to do that so that's the third example and the fourth example is how to do merges and append tables these are similar to sql operations like joins and unions or intersections so how to do those kind of things with power query and how to combine data that is in different places into one picture of truth so that's the fourth example all these examples and the sample data set completed workbooks are available for download just check the video description for the download link and grab a copy of the files so that you can practice and follow along all the way through finally at the end of the video i will share some closing remarks on power query and provide some additional pointers for you if you want to learn and grow your power power query journey so continue we have got a big plate [Music] okay so in this session we are going to look at power query from a mind back point of view so i'm going to explain power query because power query is a fairly complex piece of software trying to understand it from a traditional way is really tricky so i thought a mind map would be good so i got my sketchboard and i will be drawing so we have power query as the central concept i might actually run out of the space here if that happens we will just flip over and continue the journey and let's first start by asking the question what is it okay so what power query really is it has multiple purposes multiple meanings but at the core or basic level it is pitched or or provided as a etl application etl is a fairly technical word but essentially what it means is extract transform load so etl it is a process normally in data analysis and reporting lines of work where in order for you to build any kind of reporting you must first extract the data from other sources so for example imagine you work in a bank and you want to understand consumer trends or preferences for you know various types of mortgage products or whatever what you would do is you want to do the reporting on the ongoing data you would extract the data to meet your criteria from your existing data sets and store it but instead of just extracting and storing you will go through an in between step called transformation and at that process you will basically change the nature of data so you will for example you may remove some transactions that are not of the type of data that you are interested in analysis or whatever and then finally you load it into a different place from where data analysis happens so power query is essentially an etl application but for us people who are working in excel or power bi chances are we are not uh very keen on all the la all the processes of etl so power query also helps us with other things like data cleaning or data cleansing okay so that's another application of power query so we could think of power query as a data cleaning the application another thing that you could use power query is simply to connect this is because uh if you take an application like excel it can openly only open excel files of text files or comma separated values what if you want to connect to a sql server database or a pdf file or a an azure data set on the cloud or some other databases in that case you can use power query to essentially connect and get the data that you want to analyze in excel or power ba so you could think of power bi power query as a connector it can also be used as a consolidator so this means we are not interested in cleaning or changing the nature of data the data is all right it just so happens that it is in multiple places or parts of data are available in different different types of systems a good example is let's say you got monthly statements all set up in a folder and you would like to get everything into one place you could use power query to do that kind of work you are consolidating by combining multiple values another kind of work is let's say you are analyzing sales and you have sales data coming from one system customer data coming from another system and you would like to get both and consolidate to create one full picture and power query can be used for that particular purpose as well so this is uh what power query really is it is essentially an etl tool a data cleansing tool a connector a consolidator you can also use power query for other purposes like data generation okay so you could think of power query as a generator or creator of data when would you create data for example if you want to make a calendar table for your analysis you can use power query to do that kind of work you can also use power query to make lists and whatnot but you know that's what power query really is okay so we now understand what power query is let's uh get into detail of what will it look like so this next bubble is ui or user interface of power query now power query user interface looks pretty much like a typical excel screen in a way because it a majority of screen area is given to the data part and but then there are other elements on the screen so we will first make some bubbles and then expand them so the first one is your ribbon of the power query and irrespective of whether you're using power query within excel or power bi this ui does look more or less similar power bi power query tends to have some additional features usually added one or two cycles before it appears in excel but in general the screen behavior is more or less similar so you have your ribbon which is used to access various features of power query you have your preview grid or data view grid and this is where all the data is being shown to you you have your queries pane and this is where if you have multiple queries or multiple connections within power query they will all be listed and you can select one at a time to view and work on that you have your steps and query settings this is where you can see what is happening to the current thing that is being worked on in power query and then you also have technically not like they they're part of ribbon itself but we tend to view them separately which is uh you know publish button and then there is also a backstage screen where you can adjust power query settings so this is how the ui generally looks now majority of time when you are working within power query you will spend your time between ribbon and preview screen so you are viewing the data and you are adjusting um the next step for your process so let us just say you are looking at some sample data of employees and you would just like to delete any employees that are in the marketing department so what you are doing is you are viewing everybody's data here and you apply a filter and then you remove all the marketing people so the filter can be applied directly here or using the ribbon itself um and then work with that apart from these things power query also offers a lot of things by just right clicking on on a selected item and and using those right click context menu options um which i don't i consider them part of ui but they're not really shown anywhere explicitly but they are there okay so you we have all of these now let's get into the ribbon in the ribbon itself there are uh three main ribbons so we will make three bubbles one is your home ribbon the other is your transform ribbon and then the third one is add column ribbon the home ribboning is where common operations and frequently used items are usually bunched together and a lot of things about query management also go here so for example within home ribbon you would find your save button or publish button and then you will find your query management like merging multiple queries so combining different tables into one table or appending things uh and then common operations okay so these could be you know simple things like moving uh first row as header and etc etc so common operations are all listed in the home ribbon in the transform ribbon and add column ribbon they kind of tend to look same but they each have their own purpose so transformation is any operation on the data where things are happening happening within place so that means you don't want to add anything you just want to change the nature of data itself so common transformations are changing the shape of data so that means you got one type of a table layout and you want to go to another type of layout so this is usually things like unpivot um transpose and then changing headers etc so changing shape of the data and then text transformation so you got some text value and you want to extract items from left or between delimiters and those kind of things number transformations so you want to multiply every value with 5 or add 17 to it or get the integer division output as well you know those kind of things and then date and time related stuff and duration related stuff so those are transformations just as you are doing transformations you can also add columns so for example instead of extracting the first five characters of the text and replacing them you can extract and put them in the next column so those are the kind of things that you normally find in add column within add column you can use conditional columns so you can define based on condition what kind of output you want in the new column you can use column from example so that means you teach power query what you want and power query will build that for you so i'll call this a cfe and then you have your usual text number and then date uh related transformation add column operations there is more in the add column and transformation area but i'm just kind of giving you uh the highlights apart from these three ribbons there is also another ribbon that we we have within power query this is our view ribbon and from here you can adjust some settings and and things like that so this is the ribbon area so now we come to the preview section of the power query i'm going to just change the pen color so that it's easier to spot where these lines are going so the preview of the data is a simple massive grid and you can see all of your data if you select any certain type of elements a further preview is also shown within the preview area so we will just cross over here and then you have your data grid you also have as part of the data grid column profiles so that means it will show you how the data is within the column and how many blanks are there and if there are any errors or what kind of data type it is etcetera and you can also have preview of selection here so that means if if the one of the columns of the data is a list or another table in itself then you can see the preview of that when you select that particular item okay and then the data is usually within power query when you are working within power query power query only fetches a preview of the data so it could be top hundred or thousand or so many rows and you will work on that the actual operations do not happen when you are doing the just records the steps and then when you save or publish that's when the operations will happen and final data set will be pushed into the next layer which could be excel or power bi so this is uh the preview in the queries pane you will just see all the queries there is usually two or three kinds of things that can happen here um it all depends on whether you're using power query within excel or power bi but usually within queries you you have all the queries listed and you can then structure them into groups or folders and some of them can be kind of hidden they are not really hidden uh it's just that they are not being published so when you save and publish the hidden ones or are not really going into the next layer they just stay within power query so that's queries and you have your steps and query settings this is where um you can see for a selected query the one that you are currently working on all the steps that are being applied and the name of the query as well so name and uh and steps are are shown here so that's the uh that's the ui portion uh you also have publish button which is same as save and publish thing i kind of put it separately because it's the final step before you finish your work within power query so that is uh what power query is and how the ui is structured now there are other things that you want to look at power query when you are working the this is things to keep in mind so i'll do this here things to know okay so power query uses its own language this language is called m language it is fairly technical language and you can use m language to develop advanced things within power query fortunately about 99 percent of the time you do not need to know m language you can just use the ui features right click left button options and menu options to do what you want so you don't really know need to know m language but it is handy to understand that there is actually a language behind all of this and that is the m language another thing is power query is case sensitive so that means if you are doing a finder place or anything for that matter whether writing code or doing operations everything is dependent on the case of the data this is somewhat confusing especially for those of us who are coming in from excel or power bi because most of our things that we do in excel or power bi are not case sense two so you could be writing capital case b lookup small letter vlookup and it will all be same but within power query they mean different things so you want to be mindful about that case sensitivity and then um everything within power query almost all the time is basically in a step-by-step fashion because what power query is essentially is it's a step recorder so you do operations on your data and power query records those steps and then whenever you save and publish it will run all those operations and when you refresh it will again run those operations on on that data so that's what a step recorder really means it is actually listening to what you are doing and then learning from you um in a very passive manner and then just repeats those steps for you every time so it's kind of a data process automation tool essentially so that's what this step recorder one of the pre important things to remember within step recorder is each step depends on the previous step it again this is actually a rule that can be broken but usually um depends on previous step okay so if you take some data and you first filter something and then you do an operation so that next operation is happening after the filter operation is done so and then um and it goes on like that so if there is let's say five steps one two three four five and then there is an error happen when you do there was not an error it worked fine but because something changed in the data there was an error on number step number three then power query will not be able to proceed to four and five because four and five depend on three and and four respectively so because three is an error four cannot proceed so this is why um when you when you are setting up things you need to keep that it is a step recorder and each step depends on the previous step unless you make changes to the query in an advanced manner okay so for this reason when you are working within power query you have to think about future okay everything that you do within power query it may work while you are working on it but because what you are building is a repeatable process when you refresh power query in future that might actually things might change so you have to think about what are the things that could go uh or change and then build your queries accordingly so that's one of the key things to remember and and use it within power query all right there are a few other things to know but this is these are the good things to know what kind of data sets can we connect to so data sets within power query what kind of things that can be connected to again time for a new pen um so you can connect to pretty much any kind of thing especially within power bi it offers even wider choice but both excel and power bi are fairly flexible so we will start off with simple things like excel csv and then you have your web connection so you can connect to a web page where tabular data or unstructured data is maintained and use power query to grab that and scrape the data and then process it you can use xml json these are more structured ways of maintaining data on the web you can connect to apis for example if you have a google maps api account you can use that to get the latitude and longitude of a given address through power query and then process that of course you can connect to databases these could be all sorts of things like oracle um ms access sql server mysql and whatnot and you can connect to cloud sources you can connect to folders so all the files within the folder and you can also connect to pdf uh etc so these are all the types of things available there's more options available and because power query uh is part of both excel and power bi and both of these tools are always changing and improving every now and then microsoft introduces some new ways of connecting to data and adds new provisions depending on what is there and all of these are available certain things like web and apis require some authentication as well for example you may need to provide your user id and password to get to a certain resource and power query gives a structured way of providing those things as well so you can provide your your account or you can type to your windows login address as well so that means it will use your windows authentication to get to a data source and bring the bring what you want a classic example could be like an intranet page on sharepoint or some other source which can only be accessed when you connect to your organizational network and power query can do that and then get the data that is behind that kind of a an authentication layer so this is uh what power query really is all about i hope you found this particular mind map technique uh useful and interesting to understand where power query is now in the next section we will look at four examples of power query and then play with them and understand how transformations and ad columns can happen what happens when you publish what is all these steps about what kind of sources we can connect and some of these other things that you that i mentioned earlier so that's all i'll see in the next section in the first part of our power query tutorial we will be looking at some made up employee data set and understanding power query user interface and how to perform common data connection cleanup and publishing activities in the subsequent examples we will look at some additional situations and how to use power query for those so first let's get into our employee data set this data set is in a in an excel workbook and we would like to bring that over to this file through power query and pre-process the data by cleaning it up and making some changes to the data i have the file open here you are more than welcome to download the sample data and give it a go just check the video description for links on that so here i have one two three four five six columns of employee data and there is roughly 1000 rows of information as you could see there is this data is fairly cleanly structured but it is probably not adequate for doing analysis for example you could see that there is some salary information missing probably that employee is no longer with us and that's why that salary is not there likewise some departments are coded as null they should be probably you know defaulted to some other value not null likewise some gender values are also missing and if you look at the inconsistency in the location uh some of the addresses are entered like this it might seem like everything is written like that but it's actually a new line character and some other locations are hyphen and comma separated and another location like the los angeles ones are just comma separated and then the start date is mentioned in a date format and we would like to just load this up into power query and then perform some action so i'll close this file and let's just see how to get into power query so the first step is to go to data and click on get data button so this is the button that you normally use to access data sets that are outside if the data happens to be within the current excel workbook you use that button which is from table range so get data from file from excel workbook and we just point to the file where our data is uh so you just go and select the folder address where your workbook is so for example this one is here so i will just select that and import at this point you will come across a navigator screen and this is the place where you can preview the data and you can select which particular worksheet tab or table or you know named range you want to bring so here my worksheet has just one tab there is no tables or anything and this is giving me a preview of everything i think this is good enough we will quickly click on the transform data button you may think you know why are we not clicking on the load button we could load this but this data is not in a good shape so this is where power query is really used to transform the data and change its shape and size and things like that so we will place this power query editor window uh into just i'm gonna resize this into the screen area and we could see that everything is there obviously the file had data from row number two but row number three and then the first two rows were some additional information like power query demo and and a blank row followed by the data set so all of that got loaded here and everything is showing up now before we jump into actual cleanup activities let's just quickly understand what is it that you are seeing on the screen here this is a special window called power query editor and you can access this from both excel and power bi by using the queries button within within the respective applications so once you get here there is these all these ribbons you you can use those ribbons to access various functionalities of power query the power query editor window itself is split into basically apart from the ribbon there is three main areas the left hand side is queries area this is usually collapsed but you can expand this and you can see all the queries that are available right now there is only one query the connection to this file by the end of this video you will have four or five queries because we are going to demonstrate four different examples so all of them will be listed here and then you have the data set preview window here this will show a preview of the data usually the preview here is about top 1000 rows of the data so if you connect to an excel file or a database table that has let's say 600 000 rows then it will just bring the top 1000 rows and show them for us and then these are shown by columns and rows this structure might look deceptively like excel table but it's not excel because you can't really edit anything you can only do operations on the data that is you know you can delete or filter or those kind of things but you cannot individually update the values in the cell so you're only seeing them and you're doing operations on them all the operations that you perform on the data are listed in the query settings area on the right hand side uh the query settings area has two important boxes one is the name of the query and then the applied steps so you could see that although we just connected to an excel file and brought up the data it has actually done four steps on the data already so it went to the source which is the excel file that we pointed and then it navigated to the data table and then it automatically took the row number one and promoted that into headers and kind of done a data table data type changes for each column so all of these steps were applied uh if you are not happy with any of these steps you can select a step and delete it now keep in mind that power query is a step-by-step language so every subsequent step by default depends on the previous step so if i delete this then this change type step will be kind of referring to something that doesn't exist so it might give an error so a good way would be to delete from bottom for all the steps that you don't need [Music] so for example in this case i believe we are only needing to go to source and navigation and then at this point we need to first delete the first two rows so this kind of operation is removing rows and you can find that from the home ribbon you can also find some of these things on the transform ribbon as well but usually you know removing rows and columns is something that you find in the home ribbon so remove top and then we simply specify how many rows we want so i'll say two rows and then they're gone so at this point what is happening is your original data set is not changing that excel file will contain the data as it is but when it is brought over into power query power query makes a note that hey bring the table uh bring the data worksheet and then delete the first two rows and use the remaining data to process so that's the kind of a step that we are applying as a cleanup process so then what happens is every month whenever there is new data you don't have to repeat these steps because you've already taught power query what you want power query will then go ahead and apply those steps automatically whenever you refresh so we remove the top two rows now we can use the first row as headers and then that will basically create this kind of a view as you could see there is extra columns also added i'm not really sure where power query is picking up those columns it could probably be because i have my chandu.org on the eighth column and that's why it is going all the way up to eight columns nevertheless you know we could um select column seven and column eight and right click and remove columns the same can be accessed from the menu so many of the functionalities within power query can be accessed through the ribbon or occasionally by right clicking again we add another step remove the columns now usually whenever you promote the headers whenever there is a new set of headers power query automatically adds a data type change step this can be both useful or useless depending on what your preference is normally i would like to know exactly what steps are being applied on my data so i delete these kind of automatic steps and i do manual steps myself so that i know exactly what is happening and we have better control on these things okay so now that we are familiar with power query and we know how to get in and how to do some simple operations like deleting rows promoting headers and removing columns let's go and quickly define some tasks for us to clean up this data i have already defined these tasks in a notepad window just so i can remind myself what what is it that we are doing so we have already loaded the data and we looked at the power query window now let's go and address the big concerns here which is blanks and nulls there is two types of blanks here one is a blank in the gender column and the other is a blank in the salary column and then null in the department column so there are happening in different places and we need to address them um by a process so we need to define exactly what rule to follow and then apply that rule then anybody who is missing salary we would also like to delete such employees because they may not be here in the organization anymore and we would like to also extract the country and ignore the location altogether we don't need location for our analysis we just want to get the country so whether they are in usa or new zealand uh that's where these companies operating and and then finally we want to publish that we would then like to do some analysis in excel by building a pivot table and then see what happens when you update this file with august 2020 data and what happens when you refresh so that's what we will do in the next few minutes let's go in the gender column there is actually you could see that male female and the null then is what the missing gender values are and we don't want to delete employees whose gender is null instead we would like to kind of rename this as other or missing so to do that we can right click and replace values and value to find would be small letters null one of the things to remember is power query is case sensitive so capital and ull is different from small letter null small letters null is essentially null or just nothing there whereas capital letters null would be a text value that is none and then replace this with other and when you click ok it will add a step here and then anybody whose gender was previously null would now become other as you could see here at each step there is an associated formula attached to that and you can see those formulas in the formula bar by default this formula bar is turned off so if you are not seeing the formula bar in your power query you can go to view ribbon and enable this this is actually a one time thing so once you do it then it will remember that and it will use the same setting every time within power query whichever workbook you're using so we have replaced the gender now let's go to the department and department has a null department and what happened is let's just say from a business point of view uh we had some people join into a department when they join the organization they're not assigned into any department but eventually they end up into the engineering department it's just that the recruitment was a bit sloppy and you know they didn't put the department code when the employee got onboarded so it's saying null but it should really be engineering so we can again go and do another replace value step this time we will say null and then engineering okay so this is again a simple step it is at this point you might be thinking oh this is good but what if i have something even more complicated like i have an employee who joined in in their department is null but if they have joined in 2019 they should go to engineering in 2020 they should go to business development or something like that in such case you can't really use a replace values thing you need to be using an add column with a conditional column because you have a condition based on date and then do that we will cover that in another context later but for now we will use this way all right so that is done next thing is anybody whose salary is missing we technically consider them to be employees that are no longer here and they should not be in this data set anymore so we can go and apply a filter and uncheck the null we want to leave everything else just uncheck null in power query will add a step that essentially says salary is not equal to null and then it will filter so it what it is doing is essentially at each step the data is changing a little bit so now we have filtered and that is uh taken of all those salaries let's just double check what else is there we want to extract the country as well so this is where we got the location and we would like to really extract the last word the last word is usa or nz right so to extract the very last word we can do it in few different ways but if you observe this i think the last word would have a new letter new line character or a space and then the country name is followed by that so we can go and add a column extract extract last characters last characters what this would do is it will ask you how many characters you want so for example i could say i want last three characters and then that would essentially just get me usa or nz but nz with a space so this is one way of cheating power query and getting the answer there are other ways to do this as well for example if you have more countries like germany or australia also there then you can't really use three then you would need to be using seven or eight and then in that case then it will not be correct so we could also use um a text after delimiter kind of a thing and then also extract from the end um but if you're not if you're not really sure exactly how to do this which functionality to use then you could use this column from examples option as well in this way what you are doing is you are simply telling power query that i want to i will give you some examples of what i want and you figure out the logic and get me the data okay so this uh the way you need to be doing this is you select the column from which you want to extract so in this case location column from example rather than all columns from selection so power query only listens to looks at location column while trying to understand what you are doing so here uh it's now looking at just the location and whatever you type here it will based on that it will define the logic so for example here in the very first row i want to usa um the second row also you usa and then this one is nz um here is again usa and at this time i think it has done its job everybody's location details are extracted properly and we can just click ok and that would be added as a step you can examine the logic here in the thing uh but i'm not really you know you don't need to look at that as long as you can kind of eyeball the data and make sure that it is working then that's that's there so this is added and it will be called custom i'm just going to double click on the column header and rename that with country okay so this is how you name a column again that will be added as a step so once this is done we will right click and then we will remove the original location we no longer need the location full details because we have already extracted the country bit and now all of our data is there and we can go ahead and start by closing and loading this data so we will just say close and load and this will now come up and sit as a nice table within excel all the data [Music] so once this data is in excel i can go and build anything on top of it for example i could build a pivot table on top of this using summarize with pivot table option and i could for example just see how much money we are paying by department and how many people are there and then for example sort this largest to smallest and you know apply some currency format and and things like that so we are looking at all of our data and this is good uh i can then even um maybe add a slicer so i'm looking at either female employees male employees or other employees and we are analyzing the data like this so let's just say all of this is good uh we will rename this as pivot um and and then we realize that all of this data is up until july 2020 and now we are already in august and we have brought in some more staff and we would like to add their data so what i will do now is i will open the original file the one that we just connected to and let me make sure that this is actually the one uh it's in the data sets and i'll go all the way to the end and i'll paste some extra data so i've got some extra information here i'll just copy and paste it here so these are the employees that we hired in august of 2020 and and then i will save this file close this and now let's pay a close attention to the overall head count 985 and when i refresh we should be bringing in few more people from august and you know these numbers should change so how do you do that you don't need to go back to power query to do that all you have to do is go to data and refresh all so you could see that now the new details have been added and uh 1002 we have people in 73 million is the total so this is how the the refresh process works and then the new details come through as well so that is our very first example of power query let me just do a quick recap by showing you how to actually get back into the query and you know if you change your mind or you want to modify a step or something how to do that [Music] to see the queries you can just click on the queries and connections thing it will open up the queries pane and it will show you the data query that is added you can also look at the data that is if if you load it into excel and the data is also shown and from here you can right click and also um access the refresh button to refresh this data set so once this is there and you can you have it here you can right click and edit that particular one and you are now sent back into power query window where you can see all the steps you have applied and if you want you can add more steps as well uh for example you find that oh this is all good but the start date is there i would like to for example extract the year of join as well so for that to happen first thing we want to do is change this into a date type and then from there this is a date so i can add a column because this is date the date button will be enabled and from this i can extract any kind of thing so for example i can extract the year and that will give me 2020 or 2019 as a year field as well and i can close and load and now the year column will be added here if i go to my pivot and if i refresh this i will see that the year thing is now available for me and i can use that to for example put that on the pivot table to see the head counts and sum of salary by year uh how much is in the 2019 joinings and how much is it for 2020 joinies as well so this is how uh the whole thing works now the second example is where we are going to look at web data and connect to a website load the data from there and do some additional transformations on the data because usually if the data is coming from places like websites or xml files or those kind of things that data structure is not really suitable for analysis within excel or power bi so we would need to know exactly how to manipulate that within power query and that's what we will cover in the second example [Music] in the second example of our power query we will be looking at how you can connect to a website and fetch the data from the websites into excel now from a business data analysis point of view connecting to websites is not something that you would do often but from an organizational data analysis point of view many times you may have to connect to data sources that are like websites for example a sharepoint portal or on intranet web page is also a website so you may want to connect to that and get some sort of a tabular data or list of items that are on that page and use that for further analysis so that's where the web data example comes in really handy and this gives us a chance to really understand and appreciate the versatility and dynamic power of power query so for the purpose of this exercise i'm going to get the data of indian state population from wikipedia you could be using any website as long as the website has some sort of data in a meaningful format you can connect to that through power query and fetch that even if the website is only accessible from an internet let's say your local intranet web page it cannot be accessed by people outside only you can access it because you are connected to the windows network and then also power query works because this is more of a public demo for query i am going to use a publicly available website so the website that i'm using is wikipedia list of states in india by past population as you could see this lists all the states in india india has about 29 states and seven union territories and their population for each of the census uh calculations that was done uh in past decades so 2011 was the last one that india had and then 2001 91 81 like that and as you could see while this does have all the data there are some potential issues with this data for example uh some states did not exist until certain point in time uh so prior to that their population is listed as n a likewise uh some uh the same goes for these states and all of that information is there okay so we would like to for example go and connect to this web page connect the data and then deal with those missing values and also uh transform this table so this has one column per the state and then one column per each census uh let's say from analysis point of view this kind of a structure is very tricky to deal with because it is like a pivot table so we want to unpivot this data and get just three columns state year and population and kind of unstack this and make it a longer table and use that so let's go there let's copy the url and go to excel get data so data get data and then from other sources web and this will give you a prompt where you can paste the url so i'll paste my url here and then click ok at this point if this is the first time you are connecting to wikipedia power query may actually show you a message asking how you want to connect i have already done this a few times so it kind of jumps in here and it will show you the web page and power query says i found three different tables on that web page which one do you want you can kind of select them to see which one has the data that we want for example that table the very first one is what we really want it is the table that we saw on the web page but there is some issue though that header row is repeated and and then there is also a grand total row at the very bottom so these issues need to be addressed as well which we did not realize when we when you are viewing that here because i mean there is a total row but we didn't kind of pay attention to that but now when you connect to connected through power query you will get all of these information and you need to be dealing with those as well so let's click on transform data and um and this will bring us into the power query editor window where we can go ahead and do the transformations so the very first thing that we would like to do is remove that particular row because that's not needed there is already a header row there also we want to remove that change data type step because that's not something that we have done and we would like to keep control on what is what are the steps being happening so once that deleted step is there then we just go to remove rows remove top and then simply save one row so that kind of goes off likewise the very last row is my grand total i can also do the same remove bottom rows the bottom most row this automatically deletes the total row but imagine if the total is no longer there then it will delete the very last row whichever that is so it's kind of a blind operation and when you are doing this you want to be mindful because um whatever you are doing within power query you are kind of building a repeatable process so every time there is new data you just want to refresh you don't want to revisit this page so that means if some changes are happening where these assumptions are no longer true it will still go ahead and delete them so you may want to think about these and validate these assumptions from time to time either that or figure out another way of deleting for example rather than removing bottom rows one other way of doing this is we can go and filter any word that says total so if there is no such row as total then the filter operation will not really make any changes to the data it will keep it as it is but if there is a total row it will delete that so that's one other option of doing this okay so that part is done and now what we will do is this rank column has really no meaning i mean it has a meaning from the web page point of view but here we don't really need that so i'm going to remove that column also now the state and or union territory this column is needed but every other column here we don't need them we just want to take this kind of a pivot table format and unpivot this so that we want to keep this column and for every other column we just want to turn this into two columns one is year and the other is population so this kind of an operation is called transform on pivot because we are changing the nature of the data and you can do this from the transform ribbon you can go to transform and then you can see that there is unpivot option there there are multiple ways to unpivot we would like to keep this column and unpivot everything else so we select this column go to unpivot other columns this is one way of doing the other way is you can also right click here and do other pivot other columns i tend to go with the right click method because it's very handy and we don't need to go into the ribbon but they both do the same thing so as soon as you unplug it you notice that our state or union territory remains and for each state for each incidence of the other columns it will create two values one is the attribute the other is value this might seem a bit confusing but essentially that's the year that's the population and it kind of uh you know took the data and flattened it into a three table three column structure we will uh now go and extract just the ear bit which is 1951 because we don't need the entire thing there we just want the 1951 there so to do that we will add a column extract between delimiters open bracket and space are the delimiters that we would like to pay attention to anything between open bracket and space we want to extract so we will say this is open bracket space and we will get the ear we can now remove this one this one i will call it as year and that's my population let's go ahead and address the n a issue there are many states that have n a value there or we can decide what to do with them for example you could say okay i want to delete all of that or i want to default them to 0 or whatever maybe the method you prefer uh but first let's turn this into a whole number this way everything becomes a number and as you could see from the data snapshot here this is called data profiling it will tell me that immediately you know seven values i have an error there because they i cannot make them into numbers that is none of these have errors and we could see that all these nas have now become error because na cannot be converted into a number at this point you can deal with the errors in two different ways one is you can remove the errors other is you can replace the errors so if you don't want to have those state year combinations when the state didn't exist you can remove the errors alternatively if you want to do some analysis where you want to have the presence of that you can also replace them with zero i'm just going to remove them that will make that data tidy and then finally before we publish this i want to rename this as pop or india pop whatever you want to call you can put it like that and then once all of this is there i can go ahead and close and load this this will now come and sit into a table at this point the connection to wikipedia is established and then the data comes in here and sits in nice and tidy so imagine next year we have another round of census in india 2021 and then the numbers are updated on the wikipedia page you can just refresh this and it will come in and sit but keep in mind though you know because it's a wikipedia page anybody can go and edit it and change this page it's like a public website isn't it so what that means is if somebody decides that hey this format is no longer working i'm going to change this and make maybe two sets of tables or something else and that case your web connection and the query will no longer be correct so you need to be mindful about the format and the assumptions that you have made the assumptions are that this table will be there and it will be in this format but if that changes then your power query will no longer work so that is something that you want to keep in mind all the time when you are working with web sources this is why at the beginning i said in strict business data analysis you don't normally use like this kind of a connection but let's say you're connecting to a sharepoint web page and the data is always maintained in that format you can always use this method and get to it so that's uh our second example of our query [Music] in the third example of our power query tutorial we are going to look at how to take a folder of all files and consolidate them into one big table using power query for this example we are going to look at a bunch of project excel files and combine the data into one final set of data you can use this technique to combine data that is in text files csv files even pdfs through power query so here is our folder of files i got four files here uh each having the same pattern they they just list by month how much amount is spent on the project and certain projects have 12 months data some other projects have just a few months of data like this but nevertheless two columns and they all just go into sheet one so the very first thing that you want to do is make a make a note of the folder path normally the trick that i use is i navigate to the folder in windows explorer and then i just select the click on the folder path and then copy that text value into my notepad and then from here in the data ribbon get data from file from folder so this is the option that we can use and just paste the path in here and then this will open the navigator screen where power query says hey i found all these files do you want work with them so you can simply select load or transform and that will just do whatever power query by default wants to do but a better option would be to actually kind of get into the transform data you notice that there is actually a special button called combine so we are going to use that we will say combine and transform data because we want to really combine all the data in these files this will now open another screen where power query now is asking you so you got four files in the folder you want me to combine all of that but what do you want in each file what what is the process that you want to run on each file teach me for one and then i'll do the same for all of them so it will show a sample of the first file you can actually pick any file it doesn't really matter and it will show you what power query found in the very first file so in the very first file it found a sheet and if you say okay get this particular worksheet then it will do the same for all the files in that folder and do that so we will say that and click ok and at this point power query will open the query editor because we said combine and transform and it will load up that query here so this is our query notice that apart from this query there is actually whole bunch of helper queries now added into our queries panel these are things that power query needs in order to generate this so let us quickly understand what all of these things are because for the first time you see this it might look actually scary and confusing like you know what are all of these things i didn't do any of them so the way this works is power query wants to run the same process on all the files in the folder so you have taught us taught it what you want to do which is select the sheet one in the first file and then it will do the same for all of that so that kind of a process is usually done through functions so that is the function i'm going to just do some screen drawings here so that we can understand so this is our final query but what we started off was we loaded a sample file and then we taught power query on that sample file get to the sheet one so based on the sample file power query builds a small internal query called transform sample file where the operations that you want to do are defined so once these are done based on this power query will then create a transform file function so that function is for defining the process that you want to run on all the files in the folder so once this process is defined then power query will go to the folder for each file in the folder it will simply run that process that function and get the output and put everything together here in in one go so that's how power query behaves let me just me just get rid of all these drawings so let's understand what is happening here from the output perspective the output it shows the source name month and amount and all the projects are nicely concatenated if you're happy with this you can just leave it there if you want you can actually go and do some more additional things on top for example if you want to just extract the project name not the dot xlx thing you can go right click and then just say replace values and find the dot xlsx and replace that with nothing so that you will end up with project name and then rename that column as project so project month and amount all these details are now mentioned and our data is ready it is linked to the folder so if you add another file into the folder and you refresh you will get this output there let's just say you don't want to just have the month number but you want to actually have it like a name of the month or date or something like that so such things because they need to be happening on each and every file any changes to that need to happen at the function level but we don't know how to edit the function so that's why this transform sample file is built if you make changes here the function will automatically change so i can go here where it says month and amount i can introduce a new column for example let's just add a custom column which would be that month in the year 2020 so i will add this custom column and then just say date and the date would be a date which is equal to first of that month in year 2020. so we want to generated it to generate a date within power query you need to use hash date notation and then specify year day month year is 2020 month is that and the day is this so this is how we can write a custom column which will be a formula dependent on the year month and day and when you click ok it will generate the date so this is my month number and that's the date so once this date is there we can just remove this other column for now i will leave it there and we made the change here now internally power query will update the function so when you go back here it will now have the date column for all the files so this is how power query will really generate that function and then apply it on all the files let's say you don't want to run this function on all the files on the folder let's say the folder has some additional files like not just excel files but probably a pdf and some text files and some other documents in there as well and you want to ignore those and you just want to grab only the excel files to do that you would need to actually edit some of the steps so if you go to the source step it is listing all the items in the folder and one step that power query automatically does is it removes anything that is hidden so if there is a hidden file in the folder by default it will ignore that but you can add another step here for example you can say the extension need to be equals and then you can insert the step in the middle which would be dot xlsx okay so we are now looking only at excel files if there is another file in the folder that is not an xlsx file then it will not be considered likewise you could add a condition that says name begins with project so only files that have name as project and extension as xlsx will be considered so once you insert that step of course there's no change in the final outcome here because all the files in that folder are excel files but in future if you want to add something else then it will automatically be ignored so this is how that thing is done you can go and change the transform sample file again and remove this column and that will automatically update this particular outcome here as well so now let's go and close and load this and load this data here all the data comes through very nicely of course this date is shown here as a number because excel dates are numbers but i can select this and quickly turn that into a date format and our data is all good to go now let's go ahead and see what happens if you add another file to the folder i have now added a file called project new to the folder this was not there earlier it is just a copy of the project something else let's see what happens on this page here so right now it doesn't have project new and if i go to data you can refresh all the queries or if you just want to refresh this particular query without touching the others you can right click on the csv query and refresh that this will now be refreshed and the project new is listed right on the top because it's alphabetically the very first project and the data is coming through very nicely so this is how power query can be used to consolidate data that is all in a folder and bring that together [Music] in the fourth and final example of our power query tutorial we are going to look at how you can join two or more tables using merge and append operations this is actually a very powerful and helpful feature of power query where you can take two sets of data and then either join them or combine the tables to get one full set of data so for this example i have some sample data here where we know the student's information from year 11 and 12 we know the names gender and which course the student has signed up for and this goes on for about 176 rows for year 11 and another 125 grows for year 12. we also have course information what is the course name who is the teacher what kind of lessons they are how many credits are there for that course and what kind of assessment that course has now what we would like to do is we would like to first start by appending both sets of students into one big table notice that in this data set you have year 11 in one tab and year 12 in another tab what we want is we want to get one final table where all the students are in one table because they share the same columns three columns in year 12 and three columns in year 11 exactly same columns we can append the data so that's the first operation that we want to do then we want to combine that appended data with the course information so that for each student i can see for example who is their teacher what kind of lesson they are going to take in that course and how many credits they are going to get in what kind of assessment it is going to be all right and then finally we will also want to filter all the female students that are on three or more credit courses so anybody who is on the courses that have three credits or four credits we would like to filter them down and see such student data and publish that back to excel so i'll close this file and let's go here and bring the data so we go to data this time the data is in excel file so we'll say from workbook and connect to the file so we point to our file and click on import button and this will now bring the navigator screen where power query wants to know what do you want to bring unfortunately we need to get all the three tables right all the three sets of data so it's not one of them that we want this is where the select multiple items is useful you just stick that and you connect all the three items and then when you click on transform data power query will now create three different queries one for year 11 one for year 12 and one for courses table so we will go here into the power query let me just uh bring up the power query window into the recording area and we have all these information here along with this we also have our previous queries listed there okay [Music] so the first operation is very very simple we want to take year 11 and year 12 and combine them into one table so we can go into either one of these and from home ripa click on append queries if you just happened it will add the other queries underneath you can also append it as a completely new query so that means you can leave the original ones as they are and create a third query that combines your element in 12 data we'll just tap in it here in place so you can just do it for two tables or three or more tables also we'll just do it for two tables and so year 11 is my current table i'll pick year 12 and when you click ok it will go and append everybody's data here in one big table so now this one has all of the information this is good but it is somewhat annoying like we don't know exactly whether somebody is in year 11 or 12 and also the column headings are wrong so there is actually a problem with this happen so what i'm going to do is i'm going to undo the step by deleting it and first we will fix the problems here by using first row as header here and even here i will delete the data type change and use the first rows header so now that is done uh we go to year 11 and we will add a column here custom column and then this one will be simply saying here and within the field value will simply say year 11 that's these are all year 11 students we will repeat that process for year 12 here and in the custom column double quotes year 12. so now we have a way to identify from the table itself whether somebody is in year 12 or years 11 student so i can now go to this table and then click on home append queries and select year 12 at this point the data will be appended and then i can also see whether somebody is a near 11 student or 12 student right in the final data so now that this is all appended we can just go and rename this table as students because this is no longer year 11 students it has everybody's information and this has my student data [Music] if you go to the course table now the course table has this information i know which course it is and if for example if the course is aquamarine then the mirale is teacher it's a practical course with one credit and what not so if i go back to the student wherever aquamarine is if i if i can get to the corresponding details like who's the student who is there how many credits it is what kind of examinations it will be and what not that's that's it for us so this kind of thing in excel we normally use vlookup or x lookup to do within power query you can just join the tables you can take this table and join it with the other table to get the final table so this operation is called merging normally within sql it is called join but within part query it's called match it's the same thing so we can click on the students table merge and select the courses table now when you are merging you must specify what is the column on which you want to match so within the student course column corresponds to courses table course column so we are merging based on that column you just select that and it will apply if you want to do it based on two columns course and uh year and then there is actually different features per year then you may have an ear component here and you can select two columns you just have to hold ctrl and select and map it like that of course this would be wrong so i will just unselect this and we will just map it to course and course there are different ways to join data it kind of gets a bit technical but normally the default one is called a left outer that means if there is a course here but no corresponding information in the courses table that row will still be kept here in the main table because we are going from student to course table so whatever is on the left side students is on the left side it will retain all of its information and on the right side if there is corresponding information that will be fetched and shown so that's what a left outer join is you can do different types of joins in in the join is another common one where only there is a matching information that means the course is present in both tables only then that will be retained but for now we will just stick to left outer you can also use fuzzy matching options this is useful when you have spelling mistakes or you know similar sounding spellings and you just want to still match them you can apply that in our case everything is clear and easy so we'll just use this and click ok and for each course it will add a new table so it's not directly merging it is actually giving you the corresponding row of the other table if you select red and click on the table here you will see the entire information for the red course here so this is attained by terencio internship course three credits and it has an assessment of project work so once that is there all you have to do is expand this table and select the course column because we no longer need course information course information is already here and you can also undo uncheck this and click ok and it will add who is the teacher how is the lesson type how many credits it is and what is the assessment type so that is my students table okay [Music] at this point we would like to do the third operation which is filter female students on three or more credit courses so what i will do is instead of directly filtering in place here we will keep the main set of students table as it is i will create a reference to this so this is actually another table but it refers to the original table so whatever original table contains this will also contain female students right in this i'm just going to filter down to female and apply a condition on the credit so credit should be greater than or equal to three so now we are only looking at female students who are on courses that are three credits or more in in this query whereas that query gives me all students this query gives me only these students so once all of this is done we would like to publish the female students we don't want to publish any of these other tables like courses years or students so one way is we can uh instead of close and load we can click on close and load to and select the load behavior for all the queries uh but before i do that i will just create a new group here so that i can structure my support queries so i'll call this as [Music] support queries and in this table i will just move my courses year 12 and students so support queries group has all these queries that i'm using to build my female students thing the other queries are the ones that we are loading so we loaded data we loaded population csv and then finally this one so now i can click on close and load two this will show a special box normally we have not seen this box because every time we want we're just loading the data but now we are seeing this box and i will simply say just make a connection don't do anything and it will create connections for everything so it has not loaded anything right at this point i can right click on this and then just say load to and then create a table in new worksheet so this will bring up all the female students or three or more credits into my excel workbook into new worksheet i can just rename this as number four students and there is our power query example on how to merge and append tables so that concludes all the four examples just to give you a quick overview in example one we looked at employee data loaded it from excel file and applied some cleanup activities on it in example 2 we looked at a data set that is on wikipedia used web connections to load the data and unpivot the data to get it into a tabular format in example 3 we looked at a folder of csv files and applied folder connection option through power query got all the files and done transformations on all of them and then concatenated or consolidated all of the data into one place and fourth one is where we are merging and appending tables to combine data from different places into one one view and then uh load that after applying some filters so those are four examples i hope you found these four examples useful and helpful [Music] so that concludes our power query uh examples and mind map discussion i hope you found this particular video quite helpful and useful it took me quite a bit of time to prepare and edit and kind of arrange all of this so i really appreciate if you can share this video or leave a thoughtful comment or like this video and talk about this with your colleagues and other friends so that other people can also learn this powerful technology in terms of taking it to the next step because you've already learned some useful things and how do you take it to the next level here is how i would approach just instead of going and learning new things all these ideas that i have presented so far just to take them and apply them to your line of work okay so take them to your data sets your challenges and start right away apply those things first that way you get to solidify these ideas you will remember them better and you can apply them and you will understand which works which doesn't and what happens with your own situations once you finish that then i would highly recommend either getting a power query book or going for an online class where power query plays a role in the overall data analysis things so in terms of books i recommend this book and it is written by gil ravi and i highly recommend that because it is very useful and it will help you learn power query more in a holistic manner there is a link for the book in the video description please check that out now in terms of courses i recommend my excel school online training program again this is very very useful and it will help you understand and appreciate power query along with the overall spectrum of data analysis and reporting so that is it for now i hope you found this useful go ahead and practice all these ideas and tell me what worked what did work in the comment section and if you enjoyed it please give it a like or thumbs up thank you bye
Info
Channel: Chandoo
Views: 85,382
Rating: 4.9624143 out of 5
Keywords: chandoo, chandoo.org, Excel, spreadsheets, power query, Power Query tutorial, Excel Power Query Tutorial, power bi power query, Data cleansing examples, Combine folder of files with Excel, Web scraping tutorial, In depth power query tutorial, Power Query for beginners
Id: PiFAa_jjaEI
Channel Id: undefined
Length: 84min 53sec (5093 seconds)
Published: Wed Aug 26 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.