Balancing privacy concerns for analytics design


I've been working on a pilot project recently with a client to test out some new NoSQL database frameworks (graph databases in particular). Our goal is to see how a different storage model, representation and presentation can enhance the usability and ease of integration for master data indexes and entity data repositories.

It's relatively easy to install an evaluation version of a data management software package and tinker with its bells and whistles. But when you're evaluating how a product or tool will fit into an existing operational environment, you may need to assess how the product works using the same data that would be used in a production environment.

The data at this client's organization contains a significant amount of personally identifiable information (PII). Their environment is closed – only people with the appropriate access rights are allowed to see the data. There are other complications as well. In this situation, for example, the client does not just want to see how a particular type of data management product works – they want to see how the product works with their data in their environment.

Several potential conflicts emerge. The first is obvious: You can’t test how a product will work within a closed environment if you can't install, play with and test that product as it will be used in that environment. You can configure a test environment whose characteristics mimic the production system’s – and that might provide a reasonable platform within which the new tool can be tested. But it does not address the data protection issue.

Testing the product inevitably means testing the product with real data. But protection directives often prevent the developers from accessing the real data as part of the design-development-test cycle.

Overcoming the challenges of working with protected data

This challenge is hard to surmount – and it reflects a more common development pattern: the need for access to protected data as a prelude to system design and implementation. Aside from our scenario, in which we want to test the use of private master data using a new database environment, here are some additional examples:

  • Developing data transformations for integration. Given a data set with protected data, how can you design, develop and test data transformations as part of a data integration application without getting access to the protected data?
  • Developing data visualization applications. How can you configure data visualizations for end-user computing without being able to see and use the protected data?
  • Developing analytics (probably the most complex). How can you do undirected analytics and machine learning, then apply analyses to find interesting patterns in protected data?

We had two ideas, although neither is ideal. The first was to use automatically generated test data. In this scenario, we would use a test data generator configured to create records with characteristics similar to those of the protected data set. That would also involve introducing selected, randomly generated errors into the records, so the test data mimics the quality problems found in real data.
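As a rough sketch of what such a generator might look like: the field names, value lists and error rate below are illustrative assumptions, not the client's actual schema – a real generator would be configured from profiling statistics drawn from the protected data set.

```python
import random
import string

# Illustrative value pools -- a real generator would draw these from
# profiled distributions of the protected data, not hard-coded lists.
FIRST_NAMES = ["Alice", "Bob", "Carol", "David", "Eve"]
LAST_NAMES = ["Smith", "Jones", "Taylor", "Brown", "Wilson"]

def random_ssn():
    """Generate a fake SSN-shaped identifier (never a real SSN)."""
    return (f"{random.randint(100, 899):03d}-"
            f"{random.randint(1, 99):02d}-"
            f"{random.randint(1, 9999):04d}")

def inject_error(value, rate=0.05):
    """Randomly corrupt a value to mimic real-world data quality problems."""
    if random.random() < rate and value:
        kind = random.choice(["truncate", "typo", "blank"])
        if kind == "truncate":
            return value[:-1]
        if kind == "typo":
            i = random.randrange(len(value))
            return value[:i] + random.choice(string.ascii_lowercase) + value[i + 1:]
        return ""
    return value

def generate_record():
    """Build one synthetic record with selectively injected errors."""
    return {
        "first_name": inject_error(random.choice(FIRST_NAMES)),
        "last_name": inject_error(random.choice(LAST_NAMES)),
        "ssn": random_ssn(),
    }

test_data = [generate_record() for _ in range(1000)]
```

The resulting records are safe to hand to developers, but – as noted below – they only approximate the real data's shape, which is exactly why this approach is not ideal.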

The second idea was to selectively provide data samples under a data use agreement. Because these samples contain real data, they would be representative of the entire data set for design, development and testing purposes. At the same time, the data use agreement specifies conditions that limit the ways the data set can be used. While this doesn't guarantee that a rogue developer won’t misuse the data set, it does introduce some expectation of compliance with an agreed-to policy regarding acceptable use.
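In practice, the sample handed over under such an agreement would typically also have its direct identifiers tokenized before release. A minimal sketch, assuming Python and a hypothetical record layout (the field names and salt below are placeholders, not the client's actual configuration):

```python
import hashlib
import random

def pseudonymize(value, salt="demo-salt"):
    """Replace a direct identifier with a stable, non-reversible token.
    The salt here is an illustrative placeholder; a real deployment would
    manage it as a protected secret."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def draw_sample(records, n, pii_fields=("ssn", "email")):
    """Draw a random sample and tokenize direct identifiers before
    releasing it under the data use agreement."""
    sample = random.sample(records, min(n, len(records)))
    released = []
    for rec in sample:
        out = dict(rec)  # leave the source records untouched
        for field in pii_fields:
            if field in out:
                out[field] = pseudonymize(out[field])
        released.append(out)
    return released
```

Because the tokens are stable, matching and integration logic can still be tested against them – but the agreement, not the code, is what governs how the sampled records may be used.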

As I suggested, neither of these approaches completely addresses the issue of data protection during the product evaluation process. But they do establish a governance framework that can be engaged to ensure that proper measures are taken to protect personally identifiable information.

Read how SAS can help you identify, govern and protect personal data


About Author

David Loshin

President, Knowledge Integrity, Inc.

David Loshin, president of Knowledge Integrity, Inc., is a recognized thought leader and expert consultant in the areas of data quality, master data management and business intelligence. David is a prolific author regarding data management best practices, via the expert channel at and numerous books, white papers, and web seminars on a variety of data management best practices. His book, Business Intelligence: The Savvy Manager’s Guide (June 2003) has been hailed as a resource allowing readers to “gain an understanding of business intelligence, business management disciplines, data warehousing and how all of the pieces work together.” His book, Master Data Management, has been endorsed by data management industry leaders, and his valuable MDM insights can be reviewed at . David is also the author of The Practitioner’s Guide to Data Quality Improvement. He can be reached at
