Data Profiling using SSIS

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

Hi there! In this video I'm going to show you how you can profile data using SQL server integration services. So SQL server integration services or SSIS for short has a component for performing data profiling and a really nice tool for looking at the output of the profile as well. So let's get started. So the first thing we want to do is I will launch the sequel server data tools which is a plug-in of Visual Studio. Let me start a new project. I will select integration services project. .Ok, so within the SSIS toolbox under the common folder you will find the data profiling task. Drag-and-Drop it. Double clicking on the component will open up the editor. Once you are in this window click on Quick Profile and connect to the database. So a limitation with the data profiling task is that it can only profile data that is stored in a SQL Server database. It really cannot profile data in any other format it has to be in the SQL Server database. So make sure if you have file in the ----- if you have data in a CSV file or a text file or excel file make sure you load it first to the SQL Server tables. All right so I will click on new and I will connect to the database. For localhost I'm just typing a dot it is equivalent of typing a localhost. So for server name of typed dot and select or enter a database name so for that let me click on the drop-down, and as you can see now it has connected to the database. I'll select the AdventureWorks DW 2014 test connection. Click on Ok. So this all set. Now I will have to select a table from this so let me pick Dim_Customer and here are some of the options. So these options represent what are the different, different data checks that it's going to do, right it can it can perform all of these. So the first is column null ratio profile so if you check this it is going to scan every column in this table and it will find what is the percentage of null records versus not null records all right so about the first option the second option is it will it will give some statistics at a column well levels such as what is the min value what is the max what is the standard deviation and if it happens to be a date column then it will tell you what is the minimum date what is the maximum date and so on then is the column value distribution so column value distribution will check what is the pattern of repeating data weight and how much percentage of the total values does it repeat so it gives a very nice breakdown next is the column length again very similar to column value however here we are checking the length of the string and what is the occurrence of that length across so so far in case these options did not make any sense I will show you the output and then you will understand what each of these means then it's a column pattern profile so column pattern profile you can enter a regular expression to check for that but it's slightly complicated so I'm just going to skip that the next just candidate key profile now this is very very useful the candidate key profile will look at all columns and identify which column can be defined as a unique key for the stable right or which can be used as a primary key now you see an option here for candidate key profile for up to one column keys and now you have the option of incrementing this what this means is if I keep the default number of one it is going to check one column at a time and it is going to find out if it if it is suitable to be treated as a primary key now if I bump up this number to two it's going to take two columns at a time and it will it will check every possible combination of all the columns at two columns at a time and then it is going to check if two columns if any of those two columns can be treated as the primary key and similarly you can you can check for three and four and so on alright so this takes a lot of time to compute right especially if it's anything greater than one then it will take time so for this example I will just leave it at the default value of one then is the functional dependency so for functional dependency if I say if I leave it as one it is going to take each column and compare it with other to see how dependent is it how related is the data again you see you can bump up this number to any number you want and of course this number has to be within the number of columns that are there in the table right so this again is it's very intensive it is very intensive on the on the CPU and it will take quite a bit of time so I am going to uncheck that now with these options I will click on OK and you will see that all of all all the options that we selected have been added to the profile request click on OK again now we are not done yet we will have to set a target where the profile will be stored so in the destination type as you can see it's a file connection I will click on the destination I will say new file correction make sure you select create file rather than the existing file I will choose a location here I'll give the output as profile output that's the name of the file click on OK and we are all set now let's let's run the package okay so the task is complete wait I I guess it barely took few seconds to complete right all right now let's look at the output for looking at the output for looking at the output double-click on the data profiling task again and you said you have a open profile viewer so just click on this and it will automatically launch the most recent profile that has been created ok so this is what the output looks like right so you will see that the dim customer was what was profile from the adventure works database and here are all the options that we checked so let's look at the first one the candidate key profiles so candidate key profiles so it's showing that there are three columns that are 100% unique across all the records that were there right so which means any of these three columns can be treated as a primary key in this data set so the key strengths clearly shows 100% now let's look at the second output which is the column length distribution profile so here for every column in the data set it's showing what is the minimum length maximum length right and within address line one it's showing how many records are there that have a length of 23 right and as you can see there are 335 records of of string length 23 similarly there are 29 records of total length of 33 and so on now you can sort this by clicking on this column so this is very very useful in understanding the data right and it helps you define the the target data structure right or what is the maximum what is the maximum you know size that you should allocate for a column and so on so this is computed at every column for every column in your data set and let's see if there is any date key in this okay I I guess it's not here all right so let's move on the next output is the column null ratio profiles it's a it's a very similar option here so for every column that you select it will show what is the percentage of null so as you can see except address line 2 or R ok a few other columns most of them do have data all right the percentage of null is 0 which means it has it has complete data so I just line 2 has around eighteen eighty thousand 172 records the terminal right and it's it's displayed here similarly if I scroll down and if I select the middle name as well you will see these are the records that have the middle name as blank so that's right here right so similarly suffix and title now next is the column statistics profile so column statistics is mostly done for integers and not strings and and dates as well right so it selects dates or or integers so birth date you'll see it's showing the what is the minimum value maximum value in the column for customer key what is the minimum maximum mean standard deviation and similarly for number of cars owned and so on so anything that's our number it's going to calculate these statistics for it right this again is quite useful next is the column value distribution profile so for every column we can see what is the number of unique records so address line 2 has a total of 166 records unique records of which this particular value is repeating 34 times so come you distance is five there are there are five distinct values and since five is a smaller number it shows you the statistics for all the five date first purchase there are thousand one hundred twenty four distinct dates and so on so so this this information is quite useful and the nice thing about this is this whole data is stored as an XML file which means you can consume it into any other application and you know you can you can store it permanently and you can now you can you can use it for some some kind of reporting right okay so this is about this is about the data profile viewer alright so look out for my channel for more such interesting videos I hope you found this useful thank you

Info

Channel: DataAcademy.in

Views: 29,592

Rating: 4.9601331 out of 5

Keywords: SSIS, SSIS Data Profiling Task, Data Profiling, SQL Server Integration Services, SQL Server Integration Services Data Profiling Task, Data Profiling Task, Data Profile Viewer, edureka

Id: avDqzhuHFkw

Channel Id: undefined

Length: 12min 28sec (748 seconds)

Published: Fri Jun 16 2017