SAS Data Preparation 2.1 is now available and it includes the ability to perform data quality transformations on your data using the definitions from the SAS Quality Knowledge Base (QKB).
The SAS Quality Knowledge Base is a collection of files that store the data and logic defining data quality operations such as parsing, standardization, and match code generation (which facilitates fuzzy matching). These definitions are organized by geographic locale. SAS software products reference the QKB when performing data quality transformations on your data. These products include SAS Data Integration Studio, SAS DataFlux Data Management Studio/Server, SAS code via dqprocs, SAS MDM, SAS Data Loader for Hadoop, SAS Event Stream Processing, and now SAS Data Preparation, which is powered by SAS Viya.
Out-of-the-box QKB definitions include the ability to perform data quality operations on items such as Name, Address, Phone, and Email.
The following are the data quality transformations available in SAS Data Preparation:
- Casing – case a text string in upper, lower, or proper case. Example using the Proper (Organization) case definition – input: sas institute output: SAS Institute.
- Parsing – break up a text string into its tokens. Example using the Name parse definition – input: James Michael Smith output: James (Given Name token), Michael (Middle Name token), and Smith (Family Name token).
- Field extraction – extract relevant tokens from a text string. Example using a custom created extraction definition for Clothing information – input: The items purchased were a small red dress and a blue shirt, large output: dress; shirt (Item token), red; blue (Color token), and small; large (Size token).
- Gender analysis – guess the gender of a text string. Example using the Name gender analysis definition – input: James Michael Smith output: M (abbreviation for Male).
- Identification analysis – guess the type of data for a text string. Example using the Contact Info identification analysis definition:
- Match codes – generate a code to fuzzy match text strings. Example using the Name match definition at a sensitivity of 85:
For more information on match codes, view this YouTube video on The Power of the SAS® Matchcode.
- Standardize – put a text string into a common format. Example using the Phone standardization definition – input: 9196778000 output: (919) 677 8000.
Note: While all the examples above are using definitions from the English (United States) locale in the SAS Quality Knowledge Base for Contact Information, definitions are available for dozens of locales.
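To make the input/output shape of two of these transformations concrete, here is a minimal Python sketch that reproduces the Standardize and Parsing examples above. It is only an illustration, not how the QKB implements them: the function names `standardize_phone` and `parse_name` are hypothetical, the phone logic assumes 10-digit US numbers, and the name logic is a naive positional split, whereas real QKB definitions rely on locale-specific vocabularies and rules.

```python
import re


def standardize_phone(raw: str) -> str:
    """Toy stand-in for a Phone standardization definition (assumes 10-digit US numbers)."""
    digits = re.sub(r"\D", "", raw)  # drop everything that is not a digit
    if len(digits) != 10:
        return raw  # leave unexpected input untouched rather than guess
    return f"({digits[:3]}) {digits[3:6]} {digits[6:]}"


def parse_name(raw: str) -> dict:
    """Toy stand-in for a Name parse definition: first word, last word, and the rest."""
    parts = raw.split()
    if not parts:
        return {"Given Name": "", "Middle Name": "", "Family Name": ""}
    return {
        "Given Name": parts[0],
        "Middle Name": " ".join(parts[1:-1]),
        "Family Name": parts[-1] if len(parts) > 1 else "",
    }


print(standardize_phone("9196778000"))
print(parse_name("James Michael Smith"))
```

A positional split like this breaks on inputs such as "Smith, James" or names with prefixes and suffixes, which is exactly the kind of case the QKB's locale-aware definitions are designed to handle.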
You can also customize the definitions in the QKB using SAS DataFlux Data Management Studio. This allows you to update the out-of-the-box QKB definitions or create your own data types and definitions to suit your project needs. For example, you may need to create a definition to extract the clothing information from a free-form text field, as shown in the Field extraction example. These customized definitions can then be used in SAS Data Preparation as part of your data quality transformations. For more information on customizing the QKB, you can view this YouTube video.
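As a rough illustration of what a custom extraction definition like the clothing one produces, the Python sketch below maps words in free-form text to Item, Color, and Size tokens. It is a toy keyword lookup under stated assumptions, not the way definitions are built in SAS DataFlux Data Management Studio: the `VOCAB` word lists and the `extract_clothing` function are hypothetical and cover only the words in the Field extraction example.

```python
import re

# Hypothetical vocabularies standing in for a custom definition's word lists.
VOCAB = {
    "Item": {"dress", "shirt"},
    "Color": {"red", "blue"},
    "Size": {"small", "large"},
}


def extract_clothing(text: str) -> dict:
    """Collect known words per token, in order of first appearance."""
    found = {name: [] for name in VOCAB}
    for word in re.findall(r"[a-z]+", text.lower()):
        for name, words in VOCAB.items():
            if word in words and word not in found[name]:
                found[name].append(word)
    # Join multiple hits with "; " to mirror the example output format.
    return {name: "; ".join(hits) for name, hits in found.items()}


text = "The items purchased were a small red dress and a blue shirt, large"
print(extract_clothing(text))
```

Note that this lookup does not tie each size or color to a specific item; it only groups the matched words by token, as the example output does.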
For more information on the SAS Quality Knowledge Base (QKB), you can view its documentation.