The ancients’ practice of publicizing set-in-stone personal records would run afoul of modern data privacy laws.

These days, rather than work with contemporary personally identifiable records, I anonymized a 4,000-year-old tax record from ancient Babylon to illustrate three principles for effective data anonymization at scale:

  • Embracing rare attributes: preserves the unique data points that aggregation would otherwise erase.
  • Combining statistics and machine learning: improves both the plausibility and the analytical value of synthesized data.
  • Stewarding anonymization fidelity: protects the quality and integrity of anonymized data and prevents re-identification.

Data openness, efficiency and privacy protection

Modern governments make tradeoffs between data openness, efficiency and privacy protection. In carving a path for data governance, state-mandated privacy laws, like the EU General Data Protection Regulation (GDPR), China's Personal Information Protection Law (PIPL) and the California Consumer Privacy Act (CCPA), introduce considerations that shape the balance between data privacy and open access.

Personally identifiable data exempted from release under the US Freedom of Information Act (FOIA) is no less meaningful because it contains names, addresses or rare disease indications. What if we could honor the FOIA spirit of openness while protecting privacy? And what if non-coders could use cloud computation resources to do that work in the agencies where they work?

Figure 1. Transliteration of a tax record from the Old Babylonian Empire (OECT 15, 134), credit to UPenn Corpus for the transliteration (Creative Commons Attribution Share-Alike license 3.0). Sufficiently complete records in “name” and “location” provided encoding attributes for “area classification,” “tenant land,” “gardener,” “ox driver” and “ownership lineage.” Three continuous tax attributes were converted to consistent units based on Wikipedia’s Ancient Mesopotamian units of measurement. Rare attributes of “ox driver” and “gardener” are highlighted. Online materials were accessed on March 14, 2024.

The anonymization of this 4,000-year-old Babylonian tax data set (Figure 1) shows how government agencies can open anonymized citizen-level data, and do it well, by applying those three principles. Let’s walk through each in detail.

1. Embracing rare attributes

In modern statistical departments, aggregated summaries are commonly used to present data to the public. While these summaries are efficient for broad analysis, they often wash out rare attributes at lower unit levels. In our Babylonian tax data set (Figure 1), for example, aggregation would lose the rare occupations of ox driver and gardener, along with singleton data points like the only sole proprietorship in town (the lone case of Šamaš-ilum, highlighted by the green box in Figure 1).

Figure 2. Mosaic plot of non-missing data items showing the relationship between area classification (field, tower, town and subsistence land) and tenant classification (cultivated, sole proprietor or tenant). One combination (sole proprietor from town) has only one record, and convention would not allow it to appear in a published aggregation because of the risk of individual re-identification.

Aggregation limits data richness and stifles the representation of diverse elements. Data synthesis, in contrast, preserves an account of the rare attributes (e.g., “ox driver”) while severing the link to any individual’s actual record.
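To make that information loss concrete, here is a minimal Python sketch of a standard small-cell suppression rule. The toy records and the k = 3 threshold are illustrative assumptions, not the actual Babylonian data set:

import pandas as pd

# Hypothetical citizen-level records echoing Figure 2's categories
# (illustrative only; not the actual Babylonian data).
records = pd.DataFrame({
    "area_class":   ["field", "field", "tower", "town", "town", "subsistence"],
    "tenant_class": ["cultivated", "tenant", "tenant", "sole proprietor", "tenant", "cultivated"],
})

# Cross-tabulate the two attributes, then apply a typical disclosure rule that
# suppresses any cell with fewer than k = 3 records. The lone "sole proprietor
# from town" cell disappears entirely from the published aggregate.
counts = pd.crosstab(records["area_class"], records["tenant_class"])
suppressed = counts.mask((counts > 0) & (counts < 3))
print(suppressed)

The singleton combination that the mosaic plot highlights is exactly the kind of cell this rule blanks out, which is why aggregation alone cannot carry rare attributes to the public.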

2. Combining statistics and machine learning

Even without being a skilled coder or an interpreter of ancient Near East documents, I could count on the multimodal analytics capabilities in SAS® Viya® to iteratively prep, explore, model and make decisions on data synthesis. At the center of my workflow was a synthetic data generator (SDG), available as a no-code step in SAS Studio Flow on Viya 4 installations. The basis of this step is a Correlation-Preserving Conditional Tabular Generative Adversarial Network (CPCTGAN) that works well with continuous, ordinal and nominal data. Other machine learning-based algorithms, such as the Synthetic Minority Oversampling Technique (SMOTE), are also available in SAS.

Figure 3. SAS Studio Flow shows how I explored, prepared, modeled and decided on the data synthesis. While the synthetic data generator (SDG) is at the center of the process, statistical processes (a negative binomial generalized linear model and a propensity score matching regression) were also necessary to hone and validate the synthetic data.
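For readers who want to see the general shape of such a workflow in open-source form, here is a minimal sketch using the sdv package's CTGAN synthesizer. It is an analogue, not the SAS SDG step or its CPCTGAN algorithm, and the file name, DataFrame and columns are assumptions for illustration:

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# real_df is a hypothetical DataFrame of the prepared tax records, with
# continuous (e.g., iku), ordinal and nominal columns.
real_df = pd.read_csv("babylon_tax_records.csv")  # assumed file name

# Describe the table so the synthesizer knows each column's type.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

# Fit a conditional tabular GAN on the real records, then oversample freely:
# hundreds of thousands of synthetic rows are cheap to generate.
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_df)
synthetic_df = synthesizer.sample(num_rows=100_000)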

The SDG step produced results that nicely preserved the weak correlations among the continuous tax elements (Figures 4a and 4b). Closer inspection, however, revealed negative numbers in several continuous measures, especially “Iku,” where the original data was heavily skewed toward its lower bound, with a mode of 0 (Figure 4c).

Figure 4. Raw results of the SDG machine learning model. While continuous correlations were preserved (allowing for noise and for relationships mediated by conditional categorical covariates), some individual samples were implausible, showing negative tax values.

Enter statistical modeling, which produced highly plausible Iku results! Using a code-generating task in SAS Studio (also available in Viya 4), I created a statistical model to score and overwrite the SDG-generated Iku values. Statistics + Machine Learning = Success!
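As a rough open-source analogue of that fix (not the SAS code the task generated), the sketch below checks the raw SDG output for negative values and then fits a negative binomial GLM on the real records to overwrite the implausible Iku values. The column names, and the real_df and synthetic_df DataFrames from the earlier sketch, are hypothetical:

import statsmodels.api as sm
import statsmodels.formula.api as smf

# Sanity check on the raw SDG output: how many synthetic rows carry an
# impossible negative Iku value?
print((synthetic_df["iku"] < 0).sum(), "synthetic rows have negative Iku")

# Fit a negative binomial GLM for Iku on the real records, conditional on the
# categorical covariates the SDG already synthesized (hypothetical columns).
nb_fit = smf.glm(
    "iku ~ C(area_class) + C(tenant_class)",
    data=real_df,
    family=sm.families.NegativeBinomial(),
).fit()

# Score the synthetic rows and overwrite the GAN-generated Iku values with
# non-negative predictions. A fuller version would sample from the fitted
# negative binomial rather than plug in the conditional mean.
synthetic_df["iku"] = nb_fit.predict(synthetic_df).round().clip(lower=0)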

3. Stewarding anonymization fidelity

To ensure quality, a data synthesis team can create additional strata to offset sampling bias and can test re-identification potential, whether through managed data or through subsets of only the most plausible synthetic records. To accomplish this, I took the surfeit of SDG-generated data (hundreds of thousands of samples are easy to generate) and short-listed the most plausible records through propensity score matching. The output data set reflected the actual samples well across all axes of diversity (Figure 5).

Figure 5. Cumulative distribution functions matching synthetic (fake) samples to real records. The matched observations align with the real samples on multivariate propensity scores, whereas differences remain within the common support region and among all observations.

The result of back-end fidelity checks, or anonymized data stewardship, is synthetic data that maintains important attributes of the real data while preventing re-identification. Such fine-tuning also avoids the data fuzziness that currently affects neighborhood-level data offered by the US Census Bureau.
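A minimal sketch of that short-listing step, assuming scikit-learn, a real_df of actual records and an oversized candidates pool of synthetic rows with shared numeric feature columns (all names hypothetical), might look like the following. It is a stand-in for, not a reproduction of, the SAS propensity score matching regression:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Label real records 1 and synthetic candidates 0, then fit a propensity model
# on feature columns present in both tables (hypothetical names).
features = ["iku", "area_class_code", "tenant_class_code"]
combined = pd.concat([real_df.assign(is_real=1), candidates.assign(is_real=0)],
                     ignore_index=True)
prop = LogisticRegression(max_iter=1000).fit(combined[features], combined["is_real"])

real_scores = prop.predict_proba(real_df[features])[:, 1]
fake_scores = prop.predict_proba(candidates[features])[:, 1]

# For each real record, keep its nearest synthetic neighbor on the propensity
# score, short-listing the most plausible fakes from the oversized pool.
nn = NearestNeighbors(n_neighbors=1).fit(fake_scores.reshape(-1, 1))
_, idx = nn.kneighbors(real_scores.reshape(-1, 1))
shortlist = candidates.iloc[idx.ravel()].drop_duplicates()

Matching on propensity scores rather than raw values is the design choice that lets the short list mirror the real data's multivariate structure without copying any single real record.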

A historical perspective

Hammurabi posted his code publicly, a social innovation venerated across social studies curricula. The ancient Fertile Crescent’s mathematical innovations are less well known, such as when the Sumerians began recording “whos” against “whats,” row by column, in a table. These easy-tally tablets named people, their occupations and aspects of each citizen’s relationship to the state. The Babylonians, and eventually the written world, inherited the Sumerian innovation of tabular data.

In hindsight, we might celebrate these societies’ computational efficiency and openness of data access while lamenting their complete disregard for individual privacy. The multimodal analytics system of SAS Viya gave me, a low-skill coder unfamiliar with Babylonian tax records, a way to generate plausible anonymized records in six hours (including plenty of doubling back, learning as I went and going the extra mile for quality). The result was pretty good: a synthetic tabular data set that maximized data openness, efficiency and privacy protection (Figure 6).

Figure 6. The final anonymized records, paired with similar records from the original data set. Filtered to include only fake data, this version represents all scenarios and axes of diversity in the original records while also protecting individual privacy.

Imagine the possibility of toppling every unnecessary agency firewall so that citizens, including citizen data scientists, could connect the dots for themselves. The technology is ready. Are we?



About Author

John Gottula

Principal Advisor for AI and Biostatistics

Dr. John Gottula is a farmer, data scientist and professor whose technical interests include biostatistics, composite AI and synthetic data. Dr. Gottula cultivates strategic public and private sector partnerships focusing on food, health and the environment, while leading analytics and digital change management projects as an Agile Scrum Product Owner. Additionally, Dr. Gottula stars in the SAS Users Mixed Models YouTube video, leads a network advancing agile analytics for agriculture (#AgileAg), and supports analytics-for-research at NC A&T State University, the United States’ oldest and largest Historically Black College and University (HBCU) Land Grant. When not working, John enjoys forestry, gardening and being a girl dad.

