Data Quality Services (DQS) in SQL Server 2012 Webinar

Video Statistics and Information

Captions
Hello everyone, thanks for joining us today. My name is Josh Buckler and I'm the Director of Business Development here at helpIT systems. I work closely with our client base and development team to continually understand and address data quality challenges on a day-to-day basis. Just a quick introduction: helpIT systems is one of the early pioneers in the data quality industry, which is to say that we have been refining matching and cleansing concepts and strategy for more than 20 years. All of our products integrate closely with SQL Server, so contact data quality within SQL Server has been an area of focus for us as an organization. In just a moment I will be handing this over to Ben Gallagher, our Systems Development Director, who has been working with our team internally to understand DQS and how it can help our customers get closer to their contact data quality objectives. As Ben will discuss, there are some data cleansing applications for DQS, and this webinar is really designed to get you started on that process and help you navigate some of the nuances we found along the way.

Many of the organizations that we work with come to us to resolve their contact data quality issues. We often explain to them that resolution is more than just cleansing; it's also about issue prevention, which led us to a concept we call the data quality firewall. The data quality firewall is a strategy designed to identify the best means to find and resolve data quality issues at every capture point, and to allow organizations to take advantage of their data with confidence in real time. Most companies need to deal with information coming in from several sources that can be broken into two primary segments: transactional and batch. Transactional would include anywhere that there is a single record entry, like a customer website, a call center, CRM, or in a store. The ideal time to fix data quality issues with these types of entries is in real time, while interacting with the customer or end user, because there are data issues that can only be fixed by the customer providing more information. For example, if someone were to provide an address in a web form like 100 Main Street, but there are both a valid North and a South Main Street, only the customer can confirm which is correct. Batch, on the other hand, includes all data file imports, EDI feeds, bulk lists of transactions, and potentially even data appends. Since the opportunity to interact with the customer or end user has already passed, the most that can be done from a data quality perspective is to clean the data as best as possible prior to importing the new information into your master data, and to flag bad or incomplete data thereafter.

So with that brief introduction, I'll turn it over to Ben to walk you through the agenda and proceed. If you'd like to interact with helpIT or one another during the webinar, you can join the conversation by using the hashtag #DQSIntro on Twitter or send us a direct message to @helpIT. We'll also be available to answer any questions throughout the presentation in the webinar chat feature.

Okay, thanks for the introduction, Josh. So today we're going to provide an overview of the DQS product, which is new in SQL Server 2012. We're going to use a few examples to illustrate how you might go about creating a simple knowledge base and how you might use that to clean your data, and once we've covered that we'll have a look at creating a matching policy and how you can use that to identify duplicates within your database. So we're going to start out with a few slides, but then we'll quickly be shifting to demonstrations of DQS at fairly regular intervals, and hopefully that way you can get a flavor for the product, how it works, and how you might be able to use it.
As we go along we'll try to highlight any obvious limitations that we've uncovered, but just bear in mind that the product's pretty new and we're learning as we go; we'll try to answer any questions as best we can. We're aiming to run through the webinar in about 30 minutes, and hopefully it's going to be beneficial, but with that we'd better get moving.

Okay, so the first thing we're going to be looking at is knowledge bases and domains and how to create them in DQS. The knowledge base is really the heart of the product and its most powerful feature. Knowledge bases are collections of knowledge about related data domains that allow DQS to make intelligent decisions about the data that's being processed. So, for example, you can use it to change things like "St" to "Street" or "Bob" to "Robert", or you can use it to correct common misspellings. You might create a knowledge base for names and addresses, or for product models and descriptions, and the domains you might create include individual elements such as first name, last name, state, zip, city, that sort of thing. The domains within the knowledge base represent key fields in your data, and that basically allows DQS to get an understanding of the structure of your data. Just note, though, that DQS does expect you to create the knowledge base, and you need to populate it with appropriate values before you can start processing.

Domains can be used to define data types and validation rules. As domain data is parsed, there's a number of tricks that DQS can perform to improve the quality of the individual records. Domains allow incoming data to be normalized to improve matching, for the data to be properly cased and spell-checked, and common errors can be corrected. Another neat trick is to apply synonyms: so, for example, the names Bob, Bobby, Rob, Robbie and Robert could all be mapped to Robert, and that's quite important when it comes to matching later on, so that Bob and Rob, for example, can be matched as the same first name. Term-based relations allow a simple find-and-replace to occur as domain data is processed; so, for example, you could replace "Blvd" with "Boulevard".

Domains are single entities, but they can be combined to create composite domains, and this is where it gets a bit more interesting. For example, you could create a composite domain for name, consisting of first name and last name. Composite domains allow us to create cross-domain rules, so if you had a composite domain, full name, comprised of a first name and last name, then you could set up a rule that says both the first name and last name must exist for the record to be valid. When defining rules you've got some flexibility there too, because you can use regular expression matching, and I've got an example later on where I'll show you how you can use a name quality composite domain to output a quality score based on the contents of your title, first name and last name data. This is pretty cool, but having done that, you couldn't then create an addressee composite domain for combining your name data into one field, since a domain such as first name is only allowed to be used in a single composite domain. Having said that, though, you could create a second composite domain rule to do this. DQS can parse the composite domain data using reference data, where it can parse it in order or using defined delimiters, so if you build up enough reference data then the parsing is going to be improved. Obviously that's going to be a bit time-consuming, but it does give you the ability to fine-tune your system as it processes more data.
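To make the cross-domain rule idea concrete, here is a rough T-SQL analogue of the "first name and last name must both exist" rule described above. In DQS itself this rule is built in the Data Quality Client UI rather than in code, and the table and column names here are hypothetical:

```sql
-- Rough analogue of a cross-domain rule on a full-name composite domain:
-- both parts must be present for the record to count as valid.
SELECT
    FirstName,
    LastName,
    CASE
        WHEN NULLIF(LTRIM(RTRIM(FirstName)), '') IS NOT NULL
         AND NULLIF(LTRIM(RTRIM(LastName)),  '') IS NOT NULL
        THEN 'Valid'
        ELSE 'Invalid'
    END AS FullNameStatus
FROM dbo.ExampleData;   -- hypothetical table name
```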
Okay, then, let's have a look at DQS and see how we could go about creating a knowledge base in practice, and we'll create a few domains in our knowledge base as well.

So let's open up DQS and see how you go about creating a simple knowledge base. Firstly, you get to give your knowledge base a name and a description. You've also got the options to create it from an existing knowledge base, or by importing from a DQS file if you're transferring a knowledge base that you've created previously, maybe on a different server, for example. In our case we're just going to create a knowledge base from scratch. You've got several options: domain management, where you can define the data types that make up your knowledge base; knowledge discovery, which we'll come on to later, but it's a tool that allows you to import sample data; and matching policy, again we'll come to this later, but once your knowledge base has been set up you can define matching rules to allow you to identify matching records. So we're going to go ahead and choose domain management.

The domain management screen is, as its description says, for managing your domains. We can create domains and composite domains; composite domains are comprised of multiple domains and allow you to create cross-domain rules. So we'll start out by creating a few simple domains, and then we'll add a couple of composite domains too, and that way you can get to see what these do.

Firstly we'll add a domain for premise. When you create a domain you get to give it a name and a description, and you can also say what kind of data DQS should expect. You can also tell DQS whether to make corrections using a set of stored synonyms; for example, you might want to change Bobby to Robert. There's several other options too, such as whether to ignore punctuation, whether you want to proper-case the data, the dictionary to use depending on your language, whether to apply spell checking, and whether or not to use the syntax error checking algorithm; by default DQS performs some basic checks on string domains and may attempt to change the data if it determines it's syntactically incorrect. Okay, so now I'll add some domains for street, then first name, then last name, and finally we'll have title.

Okay, let's go ahead and create a composite domain now. We'll give this domain a name of address 1 and make it comprised of the domains premise and street. By default DQS will use a space when it's parsing the address 1 data, and then it's going to split that data into the two domains that comprise the composite domain, in this case premise and street. I'm also going to add a domain called name quality, and a composite domain that uses it, and we'll use this later on to demonstrate how we could use a composite domain to implement a very simple name quality algorithm; basically it's going to give you a score depending on how good the name data is in your database. And finally I'm going to create some domains for zip and the other address lines that might be present in our data, so if you can just bear with me while I do that and then we'll move on. Okay, so I've added fields for zip, address 2, address 3 and company.
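As a rough sketch of what that space-delimited parsing does to an address 1 value (DQS handles this internally; the table and column names below are hypothetical):

```sql
-- Split on the first space: leading token -> Premise, remainder -> Street.
SELECT
    Address1,
    LEFT(Address1, CHARINDEX(' ', Address1 + ' ') - 1)            AS Premise,
    LTRIM(STUFF(Address1, 1, CHARINDEX(' ', Address1 + ' '), '')) AS Street
FROM dbo.ExampleData;
-- e.g. '100 Main Street' -> Premise = '100', Street = 'Main Street'
```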
For each domain that you create, you've also got the option to use or add reference data from an external service provider, though of course you have to pay for these; an example might be using a service provider to validate address data. You can also apply simple and complex domain rules, and we'll go into that in a bit more detail later on, and you can also manually add domain values; these are values that will be considered correct and valid, but you can also specify corrections, so maybe you've got misspellings you want to handle, for example. Finally, term-based relations: that's just a very grand name for find and replace, and that can be applied to your domain data as it's being processed. So once it's published, the knowledge base is then available to be used in any DQS projects that you may create, and we'll give you an example of how to do that a bit later on.

One brilliant feature in DQS is the knowledge discovery functionality, and this allows you to preload your knowledge base with values; that could be a list of names or states or some other list of common values for a specific data field. Microsoft also released a default knowledge base, and that's got some common data in it as well, such as country names and abbreviations, US states, etc. There will also be more third-party and shared data sources with standard industry-based information as time goes on. As DQS processes your data, you'll have the option to manually review the processing results, to add new values into the knowledge base, and to tune the processing for future runs; it's very much a product that will gradually improve the more data it processes, but it will definitely need appropriate resource to manage and implement this.

Okay, so I think the best thing to do is have a look at it. I'm going to give a quick demo now of how you might use knowledge discovery, and then we'll move on and look at how you'd actually go about using your knowledge base with a real DQS cleansing project.

So let's have a look at the knowledge discovery tool in DQS. It's pretty cool, and it allows you to help build your knowledge bases. We'll go ahead and right-click on our knowledge base and just choose knowledge discovery this time. There's a variety of different data source input formats that you can use, things like Excel, but for the example I'm going to run here we're just going to use a simple database table, one I prepared earlier; in this case it contains just a set of sample first names. So, as we've seen before, we'll choose the table and map the fields. Okay, so we'll click Next and kick off processing; it should run through pretty quickly. Looks like it has: it's found 22 unique names, and we can have a look at those names. You can see that several of the names were found more than once, and you've got options there to specify corrections as well. So there you go, we've loaded our first knowledge base. If we go and publish the updates that we made to our knowledge base, then basically any new projects that we create that use this domain will now have access to this data. What that means in reality is that if you map a field to the first name domain, then the data in that field will get validated against our first name knowledge base, so only names that are in our knowledge base will be marked as valid, and you can then use the results of that to do any processing that you want to do.
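The sample table itself wasn't shown in detail, but a minimal stand-in would look something like this (hypothetical name and values; any single-column list of names would do):

```sql
-- A small sample table to feed into knowledge discovery.
CREATE TABLE dbo.SampleFirstNames (FirstName NVARCHAR(50) NOT NULL);

INSERT INTO dbo.SampleFirstNames (FirstName)
VALUES (N'Robert'), (N'Bob'), (N'Bobby'), (N'Mary'),
       (N'John'), (N'John'), (N'Jane');   -- duplicates are fine;
                                          -- discovery counts unique values

-- Knowledge discovery reports each unique value and how often it occurs:
SELECT FirstName, COUNT(*) AS Occurrences
FROM dbo.SampleFirstNames
GROUP BY FirstName;
```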
Okay, then, to summarize: knowledge bases are really good at making minor corrections, normalizing and tidying up names, that sort of thing. The knowledge discovery feature is a great way of populating your initial knowledge base, and then, when processing datasets going forward, you can manually review the results and tune your knowledge base, and in that way it becomes more and more effective. That process can't be automated, though, so you are going to need someone to do that manual review, and that could be quite an overhead in someone's time. Having said that, it's pretty easy to use and you don't need to be an expert on SQL Server or a DBA, so you could get the right member of staff, the one who knows most about the data, involved in creating that knowledge base, rather than leaving it to the DBA, who may come with a more technical rather than business-focused mindset.

Okay, so once you've created your knowledge base and set up your domains, you'll want to start using it to do some actual cleansing, and the easiest way to demonstrate that is to go back into DQS and have a look at how you do that.

Let's go back to our knowledge base again and look at setting up a few simple domain rules. So again we right-click on the knowledge base, and this time choose domain management, and we can go back to the domains we created previously. The first rule I'm going to add is to our premise domain, and this rule is going to say that any valid premise must consist of only numeric data. The way we can achieve this is to say that data in the premise domain must match a regular expression, and then we'll add a simple expression to indicate that the field should contain only numbers.

Okay, let's have a look at how we can use a composite domain to implement some basic name scoring, that's scoring based on the quality of the data in our name fields. This time we're going to open up the NQ score composite domain we created earlier; this domain was comprised of title, first name and last name data. This time, though, we create a composite domain rule which states that if our title, first name and last name fields are all populated, then put the numerical value ten into our name quality domain. You can also create multiple rules here, so you could set up different rules that say, for example, if all of the name fields are empty, then put the score zero into your name quality domain. For each rule that you create, you have to give it a name; the description is optional, but I'd recommend putting something meaningful in there just from a maintenance point of view. If you come back to modify the domains, you know, several months down the line, it'll be a helpful hint as to what's going on, because some of the rules can get quite complicated. I'm actually going to be a bit naughty and ignore my own advice, as this is just a demo and it's a very simple rule, but I would definitely recommend that you normally put in a description.

Okay, now let's extend our first name domain, and we can add a couple of synonyms here for Robert. I'm adding synonyms for Bob and Bobby, and that basically means that those two names will now be normalized to Robert in the output from our processing. Finally, let's change one of our domains, and this time we can add some term-based relations for Mississippi and Missouri, to correct some misspellings that might occur. Another way you could do this might be to use a synonym instead, but you'd need a dedicated state input field to do that. Okay, having done all that, we'll finish up and publish the changes to our knowledge base, and then we'll have a go at using them.
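For reference, the two rules just added behave roughly like the T-SQL sketch below. The actual regular expression isn't shown on screen, so a "digits only" pattern such as ^[0-9]+$ is assumed, approximated here with LIKE; all table and column names are hypothetical:

```sql
-- Premise rule: valid only when the value is entirely numeric.
-- Name quality rule: score 10 when title, first and last name are populated.
SELECT
    Premise,
    CASE WHEN Premise NOT LIKE '%[^0-9]%' AND Premise <> ''
         THEN 'Valid' ELSE 'Invalid' END  AS PremiseStatus,
    CASE WHEN Title <> '' AND FirstName <> '' AND LastName <> ''
         THEN 10 ELSE 0 END               AS NameQualityScore
FROM dbo.ExampleData;
```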
Okay, so let's go ahead and create a new project. We'll tell DQS that we want to use our demo knowledge base, the one we've just created, and we'll also select the cleansing activity for the project. Now we can go ahead and choose our database and select the table; I've got a table called example data, and that's got some names and addresses in it. I'm now going to go ahead and just map all these fields, so just bear with me for a minute while I do that. In a second I'm going to map the address 1 field to the composite domain for address 1, and I'm also going to map the name quality score field to our name quality score domain; this is where the output from the NQ score composite domain is going to get written. Once we've done all that, we can move ahead and run our processing, so I'll click on the start button and off we go; it should run through relatively quickly.

Okay, so we'll go ahead and click Next now it's finished, and we'll have a look at our results. Firstly, in our address 1 results we can see that where DQS has not recognized a numeric value in the first part of our address data, it's marked the records as invalid, and that's based on the rule we put on our premise domain. We can also see records it's marked as correct; these are showing up as new records, and they've actually been added to the knowledge base in the background, and in this case the numeric data is present in the first part of the address field. My example is fairly crude, but I'm sure you can implement a lot more intelligent rules to try and extract premise data. It will be a bit tricky, though, as address data is often not well structured, depending on the source of the data; if your data is already well structured then the DQS parsing and correcting will work pretty well, but if your data is not well structured then it may be a bit harder.

If we look at our address 3 column, we can see there are a couple of corrections that have been made to misspellings of Mississippi and Missouri, and that's based off the term-based relation rules that we added in our knowledge base. Then, with our first name domain, we can see that Bob's been changed to Robert, due to the first name synonym that we set up, and we've also generated name quality scores for 34 of our records; these are the records that had titles, first names and last names, and they've all been given a score of 10 according to the rule that we'd set up.

Next, we can actually review the data that will be output, and we also get the option to choose a table and database to tell DQS where to output the data. Each domain has got a domain source, a domain output value and a domain reason that you can use to query whether data is valid or not, so using this information you can pull out your invalid data for manual or automatic correction. In a second I will open up a table in SQL Server Management Studio, and you can have a look at the results that have been output from DQS.

Okay, so if we have a look at the results that have been output by DQS, we can see that Bob's been corrected to Robert according to our first name rule. We can also see some of our address 1 output, and where it's failed and passed the composite domain rule; in fact we've got a rule on premise, and that's the reason that things have been passed or failed in this case. We can also hopefully see a couple of corrections for states, for Missouri, and there we go, also for Mississippi, and you can see the reason: term-based relation. We can also see the name quality score of 10 being output for various records where the name data and title were all populated, and if we scroll across we'll see the premise results: you can see where it's failed, because the first word was "Post" and it wasn't numeric, and you can see where it succeeded as well.
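A sketch of the kind of query you might run against those exported results. DQS writes source, output and reason columns for each mapped field, as described above; the table name and exact column names below are assumptions based on the demo's mappings:

```sql
-- Pull out the rows DQS corrected or flagged, for manual or automatic review.
SELECT FirstName_Source, FirstName_Output, FirstName_Reason,
       Address1_Source,  Address1_Output,  Address1_Reason
FROM dbo.CleansingResults
WHERE FirstName_Source <> FirstName_Output    -- e.g. Bob -> Robert
   OR Address1_Source  <> Address1_Output;    -- e.g. premise rule failures
```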
So hopefully that's given you an idea of the kind of rules you can set up within DQS and how you can use them. It's a pretty cool tool, but it really is a starting point for getting your data clean.

Okay, then, to summarize: having created your DQS knowledge base, this then becomes available to be used by DQS data cleansing projects. Multiple projects could be created, and they could all use the same single knowledge base if you wanted, or you could have multiple knowledge bases. If you subsequently go back and amend your knowledge base and then republish it, existing cleansing projects don't seem to be able to pick up on the changes; that's actually a little bit annoying, especially in the early stages when you're developing your knowledge base and may be making frequent changes. DQS provides a very nice, easy-to-use interface to create these projects, so you don't need to be a SQL Server expert to get started. Processing results can be reviewed within DQS itself, in the user interface, and then written to your database, and you can then write your own routines, if you like, to use those results and clean up your original source data tables.

Now, DQS provides the ability to create a matching policy within your knowledge base, and this is a set of rules that can be applied to identify potential duplicate records within a data set. Matching can either be fuzzy or exact; just keep in mind that the fuzzy algorithm being used is not phonetic, so names that sound the same but are spelled very differently are not going to match. The grading of matches is something that will take a fair amount of experimentation, but the aim would be to try and establish clear rules for automation, so that you're not accepting false matches, or missing too many true matches, or having to review too many matches manually either. How successful you're going to be is going to be constrained by the nature of your data, as well as the time and effort that you've got available to put in. So the best way to have a look at this matching is probably to go back into DQS and create a matching policy.

Okay, so let's have a look at how you can implement a basic matching policy with the knowledge base we created earlier. Again, right-click on the knowledge base, but this time choose matching policy. Now we're going to choose a database and the table, and then select the fields that we want and the mappings against our domains; these are going to be the fields that we're going to use in our matching rules. So I'm selecting first name, last name, address 1 and zip, and we'll map the domains now: first name, last name, the address 1 composite domain and my zip domain, and then we'll click Next.

Okay, so we can go ahead and add our matching rule; basically it's a combination of scores for various domains in your data. I'm adding scores for first name and last name; I'm going to mark premise as a prerequisite, which means it must match, so it's got to be an exact match for this rule to apply; and also a score for zip and street as well. The score percentage total has to add up to 100%, and you have a minimum matching score percentage of 80%, though that can be customized; there's an administration tool within DQS which allows you to change that, if you've got the right user access level. So we kick that off, we'll get some results, and we'll have a look at those. Okay, so we've actually only found two matches, which is not very good, because I know for a fact that there are more matches in this data set.
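For intuition, a matching rule combines weighted per-domain scores roughly like the sketch below. DQS uses fuzzy similarity per domain, so the exact-equality comparisons here are a deliberate simplification, and the weights, Id column and table name are hypothetical:

```sql
-- Weighted pairwise scoring: weights total 100; pairs at or above the
-- minimum matching score (80 by default) are reported as matches.
SELECT IdA, IdB, Score
FROM (
    SELECT a.Id AS IdA, b.Id AS IdB,
           CASE WHEN a.FirstName = b.FirstName THEN 20 ELSE 0 END
         + CASE WHEN a.LastName  = b.LastName  THEN 30 ELSE 0 END
         + CASE WHEN a.Street    = b.Street    THEN 30 ELSE 0 END
         + CASE WHEN a.Zip       = b.Zip       THEN 20 ELSE 0 END AS Score
    FROM dbo.ExampleData AS a
    JOIN dbo.ExampleData AS b
      ON a.Id < b.Id                -- compare each pair once
     AND a.Premise = b.Premise      -- prerequisite: must match for the rule to apply
) AS pairs
WHERE Score >= 80;                  -- minimum matching score
```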
So we're going to have to go and review our matching rule. We can see from these two records, though, that some of our first name data is in fact just initials and not complete names, so we may be able to get some more matches by using a rule that scores matches based on the first character of the name. Obviously it's a bit contrived, as just because the first character of a name matches doesn't necessarily mean that the complete name is a match, but it should illustrate the kind of matching process that you need to follow in DQS, and it also shows that your data really needs to be quite granular if you're going to get the best out of the matching process. Within my table, luckily, I've actually already got an initial field, so we're going to go and use that, and we'll map that into our first name domain, and we'll try the matching again; we should get quite a few more matches this time.

Okay, so it shouldn't take too long to run, and there we go, we've got quite a few more matches this time. You can see them there, and they're all given various different scores depending on how good DQS thinks the matches are, and those scores are obviously made up from the domains in your matching rule. This is quite a useful little tool, actually, because it lets you preview your matches and tweak your matching rule on a subset of your data. I've actually created a project already where I've used this rule and output some data, the matching results, I should say, into a SQL Server database, and we can have a look at those and compare them with the source data, to see how good those matches are and see what matches we've got and what matches we've actually missed, to get a feel of the product.

Okay, so if we have a look at the table here in SQL Server Management Studio, these are the results that I've output from DQS, and as you can see they all look like pretty good results; I'm just highlighting a few there for you. If you scroll down you can see the unmatched records at the bottom of the output. If we compare that with the actual source data, and I've ordered this by last name so you can see quite easily where there should be matches: so, J. Acevedo, the address 1 data there is actually in the address 2 field in the second record, and we'll see that because of that DQS couldn't pick that up as a match, which really highlights the point that your data needs to be really well structured and granular to get the best out of the matching process. The same thing here with these De Coster records: again the address data is in address 1 in one record and address 2 in the other, and again DQS won't pick that up. If we look down here, we've got two Dayton records, D-A-Y-T-O-N; you can see the address data is the same, and if we go and look at the source data we can actually see we've got a Dayton spelled differently there, but it sounds the same and the address is all the same, so it should really be a match, but DQS can't pick that up, presumably because its algorithm is not phonetic. And there's two records here, Griffis and Griffis; they look like they should be a match, but because of that address 1 data being slightly different, I think that's the reason anyway, DQS couldn't pick that up as a match.

So that gives you an idea of the kind of matching that you can do. DQS is pretty easy to initially configure, and, you know, it's a pretty good tool, but it is going to miss certain matches, and if your data is not very granular then the matching is going to struggle a bit. The other thing you'll notice is that DQS is really focused on finding duplicate records within a single data set, but not on comparing two different data sets, that is to say, finding records that are in one data set that are also in another data set. That's quite important, because quite often in a business you'll have an existing universe of data and you may have some sort of update file coming in that you want to match against your existing universe, and that's one scenario DQS doesn't cater for out of the box. You might be able to work around it, but it's obviously going to be a bit of extra work.
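A minimal sketch of that workaround, assuming hypothetical table and column names: load both data sets into one staging table with a source marker, run the DQS matching project over the staging table, and then keep only the match clusters that span both sources:

```sql
-- Stage the existing universe with a source marker.
SELECT CAST('Master' AS VARCHAR(10)) AS SourceId,
       FirstName, LastName, Address1, Zip
INTO dbo.MatchingStage
FROM dbo.MasterCustomers;

-- Add the incoming update file with a different marker.
INSERT INTO dbo.MatchingStage (SourceId, FirstName, LastName, Address1, Zip)
SELECT 'Update', FirstName, LastName, Address1, Zip
FROM dbo.UpdateFile;

-- After running the DQS matching project on dbo.MatchingStage, filter the
-- exported results to clusters containing at least one 'Master' and one
-- 'Update' record; within-source pairs are ordinary internal duplicates.
```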
So, in summary, DQS provides useful matching functionality, but it will miss some genuine matches that more sophisticated matching engines could pick up. There's no obvious out-of-the-box way to identify duplicates between two different sets of data, but you might be able to work around this by loading both sets into a master table and then using some kind of source ID field. You also can't use DQS to identify whether a single record exists within your database; sometimes people might want to use matching as a duplicate prevention tool, to check whether a single record already exists in the database, but DQS is very much a batch cleansing product, and so it doesn't really lend itself to that kind of thing. And data really needs to be granular, well structured and well formed for matching to be effective; this is really a key point, at least for contact data: you're simply not going to get decent matches without having split your data into its component parts.

So what are the performance and hardware requirements for running DQS? Microsoft provided a couple of tables in their documentation that give an illustration of what you can expect. Bear in mind that if you're running the product against a remote database then the network traffic may be quite high, so bandwidth can be quite important. We can see from the table that you could expect to process maybe a million records in about an hour from a matching point of view, but it'll vary a little bit depending on how many matches are in the database and how many rules you're running. In terms of hardware requirements, DQS is pretty memory intensive, and if you're running with less memory than recommended then processing speed will reduce quite noticeably; in fact, I've been unable to process five million records on a machine with less than 16 gigabytes, so the requirements are really pretty rigid.

So, to summarize: a variety of common data sets are gradually coming on stream and can be loaded through the knowledge discovery feature. Knowledge bases can be tuned over time, but there will always be somewhat of a need to create and update knowledge bases around your specific business needs, and you need to be prepared to put that time and effort in; the more depth your knowledge bases contain, the stronger your data quality processes will be. Similarly, creating and testing domain rules and composite domain rules can be complex and rather time-consuming too. DQS is really better suited to more structured data, such as typical product data, than it is to typical contact data; for contact data, DQS could be used for applying corrections and standardizations that are specific to the database being processed, and the data can then be passed to an application that focuses on challenges more typically found with contact data. We've also found that processing large volumes of data using DQS is quite resource intensive, and there are faster products out there.
Perhaps one of the biggest limitations of DQS is that it cannot completely address the challenge of acting as a data quality firewall. From the transactional perspective, DQS does not yet have a real-time integration feature, so the power of the standardization and cleansing techniques that Ben has demonstrated today must take place after information has already been added to SQL Server. On the batch side, incoming data and staged information can be reviewed against the knowledge base for standardization and consistency, but DQS cannot match the incoming batch data to your master customer file; therefore, in order to link new information, you would need to reprocess the entirety of the master data with the new batch table in a single process, each and every time. So while DQS has a lot of potential, and with iterative learning can do a great job of cleansing a single source of information like your customer table, it does fall a bit short in providing you a methodology to keep the data clean and available to the business in a real-time and accurate fashion. The best way to incorporate DQS into your data quality firewall strategy is instead to view the solution as a means to support the maintenance and cleanliness of your data, by policing the data rules that are already enforced by other technologies like SSIS and third-party applications.

If anyone has any further questions, they can always email them to Ben or me at the address shown above. helpIT does offer a variety of data cleansing tools that pick up where DQS leaves off. If you are not sure how to get your project started, or need additional consultation to determine how to best take advantage of DQS, you can reach out to me and we can provide a complimentary consult and data evaluation that some of our clients have found very helpful in determining how to move forward. Thank you so much for your time today.
Info
Channel: helpIT systems
Views: 28,496
Rating: 4.8285713 out of 5
Id: KctO28lHaQo
Length: 36min 19sec (2179 seconds)
Published: Fri Jun 21 2013