SAS Data Preparation 2.1: Data quality transformations


SAS Data Preparation 2.1 is now available and it includes the ability to perform data quality transformations on your data using the definitions from the SAS Quality Knowledge Base (QKB).

The SAS Quality Knowledge Base is a collection of files which store data and logic that define data quality operations such as parsing, standardization, and generating match codes to facilitate fuzzy matching based on geographic locales. SAS software products reference the QKB when performing data quality transformations on your data.  These products include: SAS Data Integration Studio, SAS DataFlux Data Management Studio/Server, SAS code via dqprocs, SAS MDM, SAS Data Loader for Hadoop, SAS Event Stream Processing, and now SAS Data Preparation which is powered by SAS Viya.

Out-of-the-box QKB definitions include the ability to perform data quality operations on items such as Name, Address, Phone, and Email.

SAS Data Preparation 2.1
SAS Data Preparation – Data Quality Transformations

The following are the data quality transformations available in SAS Data Preparation:

  • Casing – case a text string in upper, lower, or proper case. Example using the Proper (Organization) case definition – input: sas institute   output: SAS Institute.
  • Parsing – break up a text string into its tokens. Example using the Name parse definition – input: James Michael Smith   output: James (Given Name token), Michael (Middle Name token), and Smith (Family Name token).
  • Field extraction – extract relevant tokens from a text string. Example using a custom created extraction definition for Clothing information – input: The items purchased were a small red dress and a blue shirt, large   output: dress; shirt (Item token), red; blue (Color token), and small; large (Size token).
  • Gender analysis – guess the gender of a text string. Example using the Name gender analysis definition – input: James Michael Smith   output: M (abbreviation for Male).
  • Identification analysis – guess the type of data for a text string. Example using the Contact Info identification analysis definition:
  • Match codes – generate a code to fuzzy match text strings. Example using the Name match definition at a sensitivity of 85:
    For more information on match codes, view this YouTube video on The Power of the SAS® Matchcode.
  • Standardize – put a text string into a common format. Example using the Phone standardization definition – input: 9196778000   output: (919) 677 8000.

Note:  While all the examples above are using definitions from the English (United States) locale in the SAS Quality Knowledge Base for Contact Information, definitions are available for dozens of locales.

You can also customize the definitions in the QKB using SAS DataFlux Data Management Studio.  This allows you to update the out-of-the-box QKB definitions or create your own data types and definitions to suit your project needs.  For example, you may need to create a definition to extract the clothing information from a free-form text field as shown the Field extraction example.  These customized definitions can then be used in SAS Data Preparation as part of your data quality transformations.  For more information on Customizing the QKB, you can view this YouTube video.

For more information on the SAS Quality Knowledge Base (QKB), you can view its documentation.


About Author

Mary Kathryn Queen

Principal Technical Training Consultant

Mary Kathryn Queen is a Principal Technical Training Consultant in the Global Enablement and Learning (GEL) Team within SAS R&D's Global Technical Enablement Division. Her primary focus is on SAS Data Management technologies, particularly data quality, data preparation, and data governance.

Leave A Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Back to Top