The previous post in this series described the first step of the data monitoring process: defining. Once we have defined the objective of our initiative, and with it the data sources and the specific requirements they must meet, we can move on to studying the data.
Step 2. Profiling
Gaining a better understanding of data and discovering potential irregularities requires analysis. This process is called profiling. Profiling allows us, among other things, to verify whether table descriptions match their physical contents and whether the relationships between tables are in line with those generally accepted in the organisation. Profiling can be split into four basic types of analyses:
- Column profiling — this type of analysis gives us access to various statistical measures, such as minimum, maximum, mean, median, mode and standard deviation, as well as insight into value frequency distributions and data patterns. There are two types of patterns: word and character patterns. In a word pattern, each word is replaced with a single symbol (e.g. John Smith aged 59 is presented as the sequence A A A 9), while in a character pattern each individual character is replaced with a symbol for its class (e.g. John Smith aged 59 is presented as Aaaa Aaaaa aaaa 99).
- Analysis between columns — this type of analysis verifies the consistency of several columns within a single table. We check, for example, whether the values of one attribute (e.g. gender) are consistent with the name field, or whether the expected proportions between values in related columns hold. Analysing multiple columns within a single table also lets us verify natural keys.
- Analysis between tables — this group of analyses verifies not only consistency between different tables but also whether the data contain redundancy. Key analyses include foreign-key analysis, orphaned-record search, and searching for differences and similarities between data sets. Sometimes columns with seemingly different names contain the same information, while columns with identical names contain data from completely different areas.
- Validation using business rules — this group of analyses is the most proactive one: it detects predefined conditions, such as non-compliance with organisational standards. For instance, we may require that the telephone number field in the customer file not be empty, or that the email address follow a specific format. Rules can be defined at the level of a single column, a set of columns within a table, or columns spread across multiple tables.
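The word and character patterns described under column profiling can be sketched in a few lines of Python. This is only an illustration of the idea: profiling tools such as SAS Data Management Studio compute these patterns for you, and the symbol conventions below simply mirror the examples in the text (A/a for letters, 9 for digits).

```python
from collections import Counter

def character_pattern(text):
    """Replace each character with a class symbol: A (upper), a (lower), 9 (digit)."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)  # keep spaces and punctuation as-is
    return "".join(out)

def word_pattern(text):
    """Replace each whole word with a single symbol: A (alphabetic), 9 (numeric)."""
    symbols = []
    for word in text.split():
        if word.isalpha():
            symbols.append("A")
        elif word.isdigit():
            symbols.append("9")
        else:
            symbols.append("?")  # mixed tokens, e.g. "A12"
    return " ".join(symbols)

# Pattern frequency over a column reveals dominant and anomalous notations.
column = ["John Smith aged 59", "Anna Kowalska aged 41"]
pattern_counts = Counter(character_pattern(v) for v in column)
```

Counting pattern frequencies over a whole column, as in the last lines, is what makes patterns useful: a handful of rows whose pattern differs from the dominant one are immediate candidates for inspection.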
All these analyses will not only help us become better acquainted with the data and design data cleansing processes; first and foremost, they will let us verify the usefulness of our data quality indicators and shape their final form.
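Two of the checks described above, the orphaned-record search and business-rule validation, can be illustrated with a minimal Python sketch. The tables, field names and rules here are hypothetical, chosen only to match the customer-file examples in the text:

```python
import re

# Hypothetical sample tables, represented as lists of dicts.
customers = [
    {"id": 1, "email": "anna@example.com", "phone": "+48 600 100 200"},
    {"id": 2, "email": "not-an-email", "phone": ""},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 99},  # references no existing customer
]

# Orphaned-record search: foreign keys with no matching primary key.
customer_ids = {c["id"] for c in customers}
orphans = [o for o in orders if o["customer_id"] not in customer_ids]

# Business rules: phone must be non-empty, email must match a simple pattern.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
violations = [
    c["id"]
    for c in customers
    if not c["phone"] or not EMAIL_RE.match(c["email"])
]
```

In a real deployment these checks would run against database tables rather than in-memory lists, and the rule set would come from the business-rule definitions agreed in the previous step.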
Tools that may be useful in data profiling include SAS Data Management Studio, with its profiling tasks, and SAS Enterprise Guide. While both tools allow for data analysis, they are intended for different types of users. The former is aimed at business users, since it contains a ready-made set of analyses dedicated specifically to profiling tasks.
Figure 1. Report on a data profiling task in SAS Data Management Studio
These analyses include, among others, analysis of notation patterns, relationships, outliers and descriptive statistics. It is also possible to define one's own measures or business rules to verify the data. Everything is available through wizards tailored to business users' needs. Users with a more technical profile may prefer more flexible tools, which require more work but offer more freedom. They will find SAS Enterprise Guide useful: it lets them independently define data queries and reports, which they can subsequently group into one or more data flows.
Figure 2. Sample data analysis process in SAS Enterprise Guide
Once we know our data and what it really contains, and once we have defined descriptions of our business rules, we can start creating technical data monitoring rules.