How to Do Data Profiling in Excel? #dataprofiling

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
hi everyone and welcome to my data office today i'd like to introduce you to a quick and easy way of doing data profiling using microsoft [Music] excel if you are new to that lights on data youtube channel please don't forget to subscribe i release videos every week at least once a week on topics such as data quality data management data governance bi and so much more today's topic is on data profiling using excel listen ideally we would have a dedicated data quality tool a dedicated data profiling tool but not all of us have the budget for it so we just got to work with what we have what we have what a lot of us do have is access to microsoft excel now let's go into this data profiling technique that we could use with microsoft excel and for that we need some data so let's just bring that over here and i did find this open data on info please there you go this is it we're just gonna load this data into excel and i have it ready here as i've also split the first column of the city into two more to include the county and the state so let's paste it over here what we do need to do now is make sure that this is seen as a table as such looking good and now what we want to do is we want to go over to the data tab and select from table range which we just select this one and this will open the power query editor all of a sudden it just loads all that data in under the view column here yes everything looks great we do love the mono space where we can unflag that to look a little bit uh as it was but i i do enjoy the mono space anyhow and now what we do need to do is just close and load just so we can have a copy of this table and then we'll just repeat the process go back to sheet 1 data from table and range and we'll just repeat it just so we can reference the table that we just created and over here to the left we can see the queries tab we need to click on the arrow to expand it and see the table one that we've just created oh that's my cat jumping up and down near me and table the other table is here so this is what we want to do i'm just going to delete all of this and just go equals table dot profile and we're just going to reference table one because basically what we're telling our query here is just do a data profile on the data from table one and that's it oh error oh and i know why it's um it's finicky we do need to capitalize these two and it worked okay and now we just go close and load beautiful one thing i do want to mention is you all have power query if you have excel 2016 and higher this just comes as part of the excel package if you have a 2010 2012 then you just need to install this add-on of power query but it's also freely available from microsoft and with the micros microsoft ms 365 it comes in pre-built obviously and actually that one has even more functions that the data profiling functionality is able to provide and i'll try and get my hands on a 365 copy and i'll just do a walkthrough of that one as well okay so we loaded this data let's quickly take a look at it and see what can we find well all of a sudden we can see that it has added these new columns based on the initial columns that we did have in our sheet one and those were city county state company index grocery housing utilities transportation health care and miscellaneous goods and services in there it's hard to pronounce some of these sometimes especially when it's late at night like it is now so um as i did mention the profiling tool actually creates these columns based on the columns that we did input from table one these were these are the minimum and maximum values what the average is standard deviation count how many nulls do we have and how many distinct values and all of a sudden just by looking at this summary table it provides us with some awesome insights well let's start with with the minimum and the fact that we do have a city that it's called as such i mean obviously there's some sort of an error here and oh ohio should actually be in the states call so that's something that we need to correct and we can correct it right away let's just sort by city there you go state ohio already found what's really cool too is we can just right click and just click refresh and there you go we just took under consideration the updates that we've just made and we can see that reflected in here let's see what else we can find state obviously this is a bit of an issue with the brooklyn that was maybe just going under um interesting maybe that was going under county but no never mind that's a neighborhood i don't even know why it's listed here as such so that's obviously another thing that we need to look at i think another clear piece for state is if we were to take a look at the distinct count we see there are 69 states but we know already that there's some data quality errors in that column because there are non um there are no states that are you know 69 unless maybe they are from outside of the u.s maybe but we can quickly take a look but no if we sort this from a to zed obviously we can see these are neighborhoods in new york so these need to be oh even the county new and york look at that it's not good we need to correct that but you know what if we're gonna go through all this data it's gonna take me quite a bit of time to correct everything so i'm just going to correct these but see i'm so eager to correct data whenever i see that it's bad it's hard to help myself okay back to this oh let's uh let's refresh it see how it looks like now well we got rid of two states i guess that were incorrect so we know the 67 distinct that's a bit of an issue we need to take a look at that entire column review it a little bit further what else well cities so we know there are 325 entries all together in the table obviously there are cities that repeat themselves okay what what else is standing out well the fact that county is not really populated so we know we can't really rely on doing any analysis on that data just because it's missing so much maybe it's something that we don't even need anyways but it's something to note and this definitely is a good red flag that we can't rely on that data we can't really derive any meanings out of the county column the other piece pieces we can see there are a few no values so for grocery items and for the state they're really missing entries so there are five missing entries that probably we should add i don't know about the grocery items how we can find that data but definitely for states we could derive it based on the city let's see what else can we identify in here i think another red flag is on this on the average this one is is really intriguing the grocery items for the averages this needs to be a bit lower than the others so maybe there's something there so we can take a really quick look grocery and it's good to just sort from smallest to largest and yeah therefore they're quite low and i wonder if this is a typo and they were supposed to be 71.0 or there's there is a digit missing there i would assume it is so ideally we would want to contact the data steward of this data set and just find out more information to see if it is confirming our assumption or this is actually accurate data that it needs to be taken into account anything else anything else i think that's really it so as you can see all of a sudden just by doing this really quick power query function we can find out quite a few errors and we can spot and pinpoint exactly where we need to look into further for some data cleansing that would need to happen after this exercise so there you have it this is just a really quick and easy solution that we have available in excel in i guess power query in order to do some data profiling if we don't have a ddk tool for it obviously that's always preferred but again sometimes we just have to work with what we have and if all that you have is excel then this could be an easy solution to use please let me know if you do have any questions any comments and as always i appreciate your feedback please like the video if you've learned from it and don't forget to subscribe if you haven't already thank you
Info
Channel: Lights OnData
Views: 4,953
Rating: 5 out of 5
Keywords: lightsondata, lights on data, data, data profiling tool, data profiling, power bi tutorial for beginners, data profiling in power query, data profiling techniques, data profiling in excel, data science, data analytics, how to do data profiling in excel, how to do data profiling in power query, Excel data profiling, excel data quality, excel data quality exercise
Id: KZ7xSpLYD-o
Channel Id: undefined
Length: 10min 59sec (659 seconds)
Published: Thu Nov 26 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.