I've been working on a pilot project recently with a client to test out some new NoSQL database frameworks (graph databases in particular). Our goal is to see how a different storage model, representation and presentation can enhance the usability and ease of integration for master data indexes and entity data repositories.
It's relatively easy to install an evaluation version of a data management software package and tinker with its bells and whistles. But when you're evaluating how a product or tool will fit into an existing operational environment, you may need to assess how the product works using the same data that would be used in a production environment.
The data at this client's organization contains a significant amount of personally identifiable information (PII). Their environment is closed – only people with the appropriate access rights are allowed to see the data. And there are other complications: the client does not just want to see how a particular type of data management product works – they want to see how it works with their data, in their environment.
Several potential conflicts emerge. The first is obvious: You can't test how a product will work within a closed environment if you can't install, play with and test that product as it will be used in that environment. You can configure a test environment whose characteristics mimic the production system's – and that might provide a reasonable platform within which the new tool can be tested. But it does not address the data protection issue.
Testing the product inevitably means testing the product with real data. But protection directives often prevent the developers from accessing the real data as part of the design-development-test cycle.
Overcoming the challenges of working with protected data
This challenge is hard to surmount. It also reflects a broader development pattern: the need for access to protected data as a prelude to system design and implementation. Aside from our scenario, in which we want to test the use of private master data in a new database environment, here are some additional examples:
- Developing data transformations for integration. Given a data set with protected data, how can you design, develop and test data transformations as part of a data integration application without getting access to the protected data?
- Developing data visualization applications. How can you configure data visualizations for end-user computing without being able to see and use the protected data?
- Developing analytics (probably the most complex). How can you perform undirected analytics and machine learning, then apply those analyses to find interesting patterns, when the underlying data is protected?
We had two ideas, although neither is ideal. The first was to use automatically generated test data. In this scenario, a test data generator would be configured to create records that mirror the characteristics of the protected data set – including selected, randomly generated errors that mimic the quality issues found in the real data.
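To make the first idea concrete, here is a minimal sketch in Python of the kind of generator we had in mind. The field names (name, a nine-digit identifier) and the error types (character transpositions, missing values) are hypothetical, chosen only to illustrate the pattern: generate plausible-looking records, then inject controlled errors so the test data exhibits the same kinds of quality problems the real data would.

```python
import random
import string

random.seed(42)  # fixed seed so test runs are reproducible

# Hypothetical value pools standing in for realistic name distributions
FIRST_NAMES = ["Alice", "Bob", "Carol", "David", "Erin"]
LAST_NAMES = ["Smith", "Jones", "Lee", "Garcia", "Chen"]

def make_record(record_id):
    """Generate one synthetic record shaped like a master data entry."""
    return {
        "id": record_id,
        "first_name": random.choice(FIRST_NAMES),
        "last_name": random.choice(LAST_NAMES),
        # Random 9-digit string in place of a real national identifier
        "national_id": "".join(random.choices(string.digits, k=9)),
    }

def inject_errors(record, error_rate=0.2):
    """Randomly corrupt fields to mimic real-world data quality issues."""
    rec = dict(record)
    if random.random() < error_rate:
        # Typo: transpose two adjacent characters in the last name
        name = list(rec["last_name"])
        i = random.randrange(len(name) - 1)
        name[i], name[i + 1] = name[i + 1], name[i]
        rec["last_name"] = "".join(name)
    if random.random() < error_rate:
        # Missing value: drop the identifier entirely
        rec["national_id"] = None
    return rec

# A thousand synthetic records, some clean, some deliberately dirty
test_data = [inject_errors(make_record(i)) for i in range(1000)]
```

A real generator would of course be driven by profiling statistics from the protected data set (field formats, value frequencies, observed error rates) rather than hard-coded pools, but the structure – generate, then corrupt at a measured rate – stays the same.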
The second idea was to selectively provide data samples under a data use agreement. Because these samples contain real data, they would be representative of the entire data set for design, development and testing purposes. At the same time, the data use agreement specifies conditions that limit the ways the data set can be used. While this doesn't guarantee that a rogue developer won’t misuse the data set, it does introduce some expectation of compliance with an agreed-to policy regarding acceptable use.
As I suggested, neither of these approaches completely addresses the issue of data protection during the product evaluation process. But they do establish a governance framework that can be engaged to ensure that proper measures are taken to protect personally identifiable information.