GDPR: The complexity of identifying sensitive personal data


With the specter of the European General Data Protection Regulation (GDPR) breathing heavily down companies’ necks, there's growing anxiety about corporate capabilities for complying with the privacy and management directives involved in collecting, storing, using, sharing, disposing of and producing personal data. The regulation describes personal data broadly. It ranges from identifying data (such as names or national identifiers) to demographic (and in some cases psychographic) information deemed to be “sensitive.” That includes characteristics like race, ethnic origin, union membership, political opinions, health information and other aspects of an individual’s personal life.

The GDPR imposes constraints on a company’s ability to process personal data, giving individuals much greater control over how their personal data may (or, more importantly, may not) be used. Compliance, though, is largely a process of interpreting the intent of the regulation in terms of:

Identifying data assets that contain personal data.
Implementing a broad set of controls to determine whether any particular processing application falls within the bounds of lawfulness (such as processing that is necessary for the performance of a contract with the data subject).

That being said, the challenging part of compliance is not the mechanical aspects of developing code and implementing controls. Rather, it lies in the determination that a data asset contains personally identifiable information (PII), or sensitive personal data – and ensuring that adequate measures can be taken to protect against unauthorized processing or exchange.

Structured sensitive data

Conceptually, analyzing a structured data asset for personal data does not seem that difficult. We can enumerate the data attributes that correspond to identifying information and use that list as a starting point: name, address, social security number, etc. In fact, the US HIPAA rules specify 18 identifiers that make health information protected health information (PHI) when they appear in conjunction with an individual’s health condition. In turn, scanning a data set’s model structure (or more simply, the names of the columns) can surface many data elements that are likely to represent personal information, particularly when a column's name matches one of the enumerated data attributes.
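As a rough illustration of the column-name scan described above, here is a minimal Python sketch. The attribute list, the `normalize` helper and the `flag_columns` function are all hypothetical names chosen for this example; a real inventory would be far longer and would need to handle abbreviations and other languages.

```python
import re

# Hypothetical, non-exhaustive list of attribute names treated as identifying.
SENSITIVE_ATTRIBUTES = {
    "name", "first name", "last name", "address",
    "social security number", "ssn", "email", "phone", "date of birth",
}


def normalize(column_name):
    """Lowercase a column name and split camelCase, underscores and hyphens into words."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", column_name)
    return re.sub(r"[\W_]+", " ", spaced).strip().lower()


def flag_columns(column_names):
    """Return the columns whose normalized name contains an enumerated attribute as whole words."""
    flagged = []
    for col in column_names:
        padded = f" {normalize(col)} "
        if any(f" {attr} " in padded for attr in SENSITIVE_ATTRIBUTES):
            flagged.append(col)
    return flagged


columns = ["cust_id", "LastName", "SSN", "order_total", "home_address", "filename"]
print(flag_columns(columns))
```

Matching on whole words keeps a column such as `filename` from being flagged just because it contains the letters "name". A name match is only a hint that a column deserves a closer look, not proof that it holds personal data.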

The next step goes beyond comparing data attribute names and starts to examine data asset content to identify sensitive data. One approach is to use a data profiling tool to scan a table’s columns and then compare each column’s contents with the values in tables known to house personal or sensitive data. For example, if a table under investigation has a column whose values match the set of last names in an already assessed data set, then one might infer that the column is a last name column and consequently requires protection. This process can accomplish two objectives:

It can iteratively build an inventory of value domains that represent sensitive data.
It can be a means for rapidly identifying sensitive columns in structured data assets.
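The value-matching idea can be sketched in a few lines of Python. This is not how any particular profiling tool works; the `overlap_ratio` and `classify_columns` functions, the 0.8 threshold and the sample data are assumptions made for illustration.

```python
def overlap_ratio(column_values, reference_domain):
    """Fraction of a column's distinct, non-empty values that appear in a known value domain."""
    distinct = {
        str(v).strip().lower()
        for v in column_values
        if v is not None and str(v).strip()
    }
    if not distinct:
        return 0.0
    return len(distinct & reference_domain) / len(distinct)


def classify_columns(table, domains, threshold=0.8):
    """Map each column to its best-matching sensitive domain when the overlap meets the threshold.

    table:   dict of column name -> list of values
    domains: dict of domain name -> set of lowercase reference values
    """
    findings = {}
    for col, values in table.items():
        best_domain, best_score = None, 0.0
        for name, domain in domains.items():
            score = overlap_ratio(values, domain)
            if score > best_score:
                best_domain, best_score = name, score
        if best_domain is not None and best_score >= threshold:
            findings[col] = (best_domain, best_score)
    return findings


# Hypothetical reference domain taken from an already assessed data set.
domains = {"last_name": {"smith", "jones", "garcia", "nguyen"}}
table = {
    "col_a": ["Smith", "Jones", "Garcia", "smith"],
    "col_b": ["red", "blue", "green"],
}
print(classify_columns(table, domains))
```

Once a column is confirmed as sensitive, its distinct values could be added to the matching domain, which is one way the inventory of value domains might grow iteratively. The threshold trades false positives against missed columns and would need tuning on real data.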

Unstructured sensitive data

Unstructured data assets are more difficult, both in terms of identifying levels of sensitivity and instituting controls. Yet there are approaches that integrate natural language analytics, metadata tagging, and semantic taxonomies to engineer a “knowledge net” that flags certain metadata tags and business terms as indicative of sensitive data when they appear in proximity to embedded entity data. An example might include tweets from an individual’s Twitter account that include hashtags promoting the name of a particular political candidate (“political opinions”). Text analytics can extract the individual’s identifying information and the information associated with the referenced candidate. The knowledge net can be consulted to determine that the referenced individual is indeed a political candidate, then to infer the political opinion, and then (based on the directives of GDPR) mark that piece of information as sensitive personal data.
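To make the tweet example concrete, here is a deliberately simplified Python sketch. The knowledge net is reduced to a small dictionary, and a hashtag regular expression stands in for real text analytics; the entries, the `tag_post` function and the account names are all invented for illustration.

```python
import re

# Hypothetical knowledge net: hashtag -> the entity it refers to and that entity's type.
KNOWLEDGE_NET = {
    "janedoe2024": {"entity": "Jane Doe", "type": "political_candidate"},
    "acmecoffee": {"entity": "Acme Coffee", "type": "brand"},
}

# Assumed mapping from entity type to a sensitive-data category.
SENSITIVE_CATEGORY_BY_TYPE = {"political_candidate": "political opinions"}

HASHTAG = re.compile(r"#(\w+)")


def tag_post(author, text):
    """Return sensitive-data findings for one post, based on hashtags found in the knowledge net."""
    findings = []
    for tag in HASHTAG.findall(text):
        node = KNOWLEDGE_NET.get(tag.lower())
        if node is None:
            continue  # unknown hashtag: nothing to infer
        category = SENSITIVE_CATEGORY_BY_TYPE.get(node["type"])
        if category:
            findings.append(
                {"data_subject": author, "entity": node["entity"], "category": category}
            )
    return findings


print(tag_post("@alice", "Proud to support #JaneDoe2024 at the rally!"))
print(tag_post("@bob", "Morning run, then #AcmeCoffee"))
```

The sketch treats any mention of a candidate's hashtag as an indication of political opinion. It does not attempt to tell support from criticism, which is one of the places where a real implementation gets much harder.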

Of course, this is just a conceptual approach, and implementing it is much more complex. All these examples indicate that as more data protection laws are passed, identifying and managing sensitive data will keep getting more complex and will continue to raise new challenges.

Want a deeper dive? Download I Spy PII: How to Use SAS Data Management for Personal Data Protection