Synthetic data has become a valuable resource in data science and machine learning.
High-quality, reliable synthetic data enables analysis and iteration at scale, mitigates the privacy concerns associated with real data, and can fill gaps where real data is scarce.
Note, however, that “good” synthetic data is not defined in black-and-white terms. “Good” synthetic data is a function of what it is meant to be used for, also called being “fit for purpose.” The end purpose of synthetic data drives analytics practitioners to make a tradeoff between similarity and privacy.
Consider an example where the generated dataset tests different scenarios modeled on customer behavior. To ensure your scenarios capture realistic patterns, you wish your synthetic and real data to be as similar as possible.
However, for some datasets in sensitive and regulated industries, high similarity increases the risk of re-identification and data leakage: a sufficiently motivated actor could reverse engineer identifiable or sensitive records or patterns. You may need to introduce deliberate dissimilarities into those records through techniques such as differential privacy, settling on a middle ground between maximum similarity and maximum privacy.
Therefore, once you are clear on the downstream use of synthetic data, visual and statistical evaluation metrics help you assess the data by comparing distributions, correlations, and other statistical properties between synthetic and real data.
This checklist can help you assess existing synthetic data. The steps below cover techniques for ensuring that synthetic data can be used effectively for its intended purpose.
1. Check for bias amplification in synthetic data
Bias introduced by synthetic data can lead to inaccurate and unfair outcomes, especially in machine learning models. Analyze the original data for inherent bias and conduct a similar analysis on the generated data to ensure that bias is not inadvertently amplified in the synthetic data.
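As a concrete sketch, the comparison can be as simple as measuring the gap in a positive-outcome rate between groups in both datasets and flagging any growth. The function names, toy data, and the 0.05 tolerance below are illustrative assumptions, not a standard metric:

```python
import pandas as pd

def group_positive_rates(df, group_col, target_col):
    """Positive-outcome rate per group (e.g. approval rate by segment)."""
    return df.groupby(group_col)[target_col].mean()

def bias_gap(df, group_col, target_col):
    """Spread between the most- and least-favoured groups."""
    rates = group_positive_rates(df, group_col, target_col)
    return rates.max() - rates.min()

# Hypothetical real and synthetic samples with two groups, A and B.
real = pd.DataFrame({
    "group":  ["A"] * 50 + ["B"] * 50,
    "target": [1] * 30 + [0] * 20 + [1] * 20 + [0] * 30,
})
synth = pd.DataFrame({
    "group":  ["A"] * 50 + ["B"] * 50,
    "target": [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
})

real_gap = bias_gap(real, "group", "target")    # 0.60 - 0.40 = 0.20
synth_gap = bias_gap(synth, "group", "target")  # 0.80 - 0.20 = 0.60
# Tolerance of 0.05 allows for sampling noise before flagging amplification.
amplified = synth_gap > real_gap + 0.05
```

In this toy case the generator has exaggerated the group difference, so the check flags amplification; in practice you would run this for every protected attribute and outcome pair that matters for your use case.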
2. Assess the need for balancing share of data segments
Synthetic data can be used to scale up the volume of observations or to selectively augment (boost) specific segments, for example to balance out target event rates or underrepresented segments. Some use cases, like fraud modeling, suffer from rare event rates. Others, such as health care studies, may have historically underrepresented segments. In financial products, some applicants may have a “thin file” problem. When required, use synthetic data generation to balance the data distribution and address any underrepresented segments or groups.
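A minimal sketch of the rebalancing target, using simple resampling with pandas. In practice the upsampled rows would be replaced by freshly generated synthetic records rather than duplicates; the column name and event rate here are hypothetical:

```python
import pandas as pd

def oversample_minority(df, target_col, random_state=0):
    """Upsample every minority class (with replacement) to match the
    majority class count. Resampling stands in for generating new
    synthetic records for the underrepresented segment."""
    counts = df[target_col].value_counts()
    majority = counts.max()
    parts = []
    for value, n in counts.items():
        part = df[df[target_col] == value]
        if n < majority:
            extra = part.sample(majority - n, replace=True,
                                random_state=random_state)
            part = pd.concat([part, extra])
        parts.append(part)
    return pd.concat(parts, ignore_index=True)

# Hypothetical fraud dataset with a 2% event rate.
df = pd.DataFrame({"fraud": [1] * 2 + [0] * 98})
balanced = oversample_minority(df, "fraud")  # 98 fraud + 98 non-fraud rows
```

The same pattern applies to underrepresented demographic segments: group by the segment column instead of the target.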
3. Assess privacy and security of synthetic data
A significant advantage of synthetic data is its ability to preserve privacy. However, as mentioned earlier, we need to make sure synthetic data does not inadvertently reveal sensitive information. Employ techniques such as differential privacy to add noise to data, making it challenging to re-identify individuals, and implement robust security measures to protect synthetic data from unauthorized access. Security risks and weaknesses of synthetic data can be assessed through a vulnerability score, a metric that evaluates its security posture. This step ensures the synthetic data is secure and does not expose sensitive information.
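As an illustration of the differential-privacy idea, a counting query can be protected with Laplace noise: the sensitivity of a count is 1, so noise with scale 1/ε yields ε-differential privacy for that query. The epsilon value and seed below are arbitrary choices for the sketch, not recommendations:

```python
import numpy as np

def dp_count(true_count, epsilon, rng):
    """Epsilon-differentially-private count: a counting query has
    sensitivity 1, so Laplace noise with scale 1/epsilon suffices."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(42)
true_count = 1000
# Smaller epsilon = stronger privacy = more noise in the released count.
noisy_count = dp_count(true_count, epsilon=0.5, rng=rng)
```

Full synthetic-data generators apply the same budget-based reasoning across many queries or model parameters rather than a single count, but the accuracy-versus-privacy tradeoff controlled by ε is the same.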
4. Test the usability of synthetic data
Usability is a key factor to consider when assessing synthetic data. Ensure that synthetic data can be easily integrated with real data and used effectively in your projects. Test the data in various scenarios and applications to ensure it meets requirements and can be used seamlessly with other datasets. Remember that your tolerance towards similarity and privacy depends on the final intended use of the synthetic data.
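One basic usability test is schema compatibility: can the synthetic table be concatenated with, or swapped in for, the real one without breaking downstream pipelines? A minimal sketch with hypothetical column names:

```python
import pandas as pd

def schema_compatible(real, synth):
    """Check that the synthetic frame has the same columns, in the same
    order, with the same dtypes as the real frame."""
    same_cols = list(real.columns) == list(synth.columns)
    same_dtypes = same_cols and all(
        real.dtypes[c] == synth.dtypes[c] for c in real.columns
    )
    return same_cols and same_dtypes

real = pd.DataFrame({"age": [34, 45], "income": [52000.0, 61000.0]})
synth = pd.DataFrame({"age": [29, 51], "income": [48000.0, 70000.0]})
ok = schema_compatible(real, synth)          # matching schema

bad = synth.rename(columns={"income": "salary"})
not_ok = schema_compatible(real, bad)        # mismatched column name
```

Schema checks are necessary but not sufficient; value ranges, category sets, and null patterns should be compared the same way before declaring the data usable.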
5. Monitor the performance of models trained on synthetic data
Once you have validated and tested the synthetic data, monitoring the performance of models trained on it is essential. Synthetic data can enhance model training by surfacing more robust evidence of relationships, enabling more iterations and modeling approaches, and accommodating more complex, data-hungry algorithms. Before releasing models trained on synthetic data, track key performance metrics and compare them with models trained on real data. Ensure that the synthetic data does not negatively impact the performance of your models and that it provides the desired benefits.
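A sketch of such a comparison, training the same model once on real data and once on synthetic data and scoring both on a held-out real test set. The one-feature threshold classifier and Gaussian data below are deliberately trivial stand-ins for a real model and dataset:

```python
import numpy as np

def threshold_model(x_train, y_train):
    """Fit a one-feature threshold classifier: predict 1 when x exceeds
    the midpoint of the two class means (a stand-in for any real model)."""
    t = (x_train[y_train == 0].mean() + x_train[y_train == 1].mean()) / 2
    return lambda x: (x > t).astype(int)

def accuracy(model, x, y):
    return float((model(x) == y).mean())

rng = np.random.default_rng(0)
labels = np.array([0] * 200 + [1] * 200)
def sample():
    # Hypothetical data: class 0 ~ N(0,1), class 1 ~ N(2,1).
    return np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 200)])

x_real, x_synth, x_test = sample(), sample(), sample()

acc_real = accuracy(threshold_model(x_real, labels), x_test, labels)
acc_synth = accuracy(threshold_model(x_synth, labels), x_test, labels)
gap = acc_real - acc_synth  # near zero when the synthetic data is faithful
```

The quantity to monitor is the gap on real held-out data, not the synthetic-trained model's score on synthetic data, which can be misleadingly high.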
6. Assess similarity, distribution, and correlation
A similarity metric is essential for the assessment of synthetic data. It quantifies how closely the synthetic dataset resembles the real one and highlights areas of divergence; in practice it is built by comparing distribution and correlation metrics between the two datasets. Techniques such as hierarchical clustering and correlation-preserving methods can be used to develop and validate the metric. This step ensures that the synthetic data closely resembles real data in both structure and content.
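A minimal sketch of such a comparison on hypothetical data: a per-column two-sample Kolmogorov–Smirnov statistic for marginal distributions (implemented by hand to stay numpy-only), plus the maximum absolute difference between the two correlation matrices for dependence structure:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two empirical CDFs. 0 = identical, 1 = fully separated."""
    a, b = np.sort(a), np.sort(b)
    points = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, points, side="right") / len(a)
    cdf_b = np.searchsorted(b, points, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

rng = np.random.default_rng(7)

def sample():
    # Hypothetical two-column data with correlated columns.
    x = rng.normal(50, 10, size=(1000, 2))
    x[:, 1] = 0.8 * x[:, 0] + rng.normal(0, 5, 1000)
    return x

real, synth = sample(), sample()

# Marginal similarity: one KS statistic per column, near 0 means similar.
ks_stats = [ks_statistic(real[:, j], synth[:, j]) for j in range(2)]

# Correlation-structure similarity: largest absolute correlation gap.
corr_diff = float(np.abs(np.corrcoef(real.T) - np.corrcoef(synth.T)).max())
```

Because both samples come from the same assumed generator here, both metrics come out small; on real projects a threshold on each metric would be set by the fit-for-purpose requirements discussed above.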
7. Check sequential quality
Sequential quality refers to the consistency and coherence of data over time or across sequences. Evaluate the sequential quality of synthetic data by comparing it with real data to ensure it maintains the same patterns and trends. This step is important for time-series data and other sequential datasets.
Also, watch out for adverse effects of sequentially generated data on referential integrity with related datasets. Failure to do so can produce implausible observations that would not occur in real life, for example, continued deposits into a transaction table while the balances on a related account table remain unchanged.
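The deposits-versus-balances example can be turned into an automated reconciliation check. The table layouts, column names, and one-cent tolerance below are hypothetical:

```python
import pandas as pd

def balances_consistent(transactions, accounts):
    """Check that the net transaction amount per account matches the
    change between opening and closing balance on the account table."""
    net = transactions.groupby("account_id")["amount"].sum()
    merged = accounts.set_index("account_id").join(net.rename("net"))
    merged["net"] = merged["net"].fillna(0.0)  # accounts with no activity
    drift = (merged["closing"] - merged["opening"] - merged["net"]).abs()
    return bool((drift < 0.01).all())

transactions = pd.DataFrame({
    "account_id": [1, 1, 2],
    "amount": [100.0, 50.0, -25.0],
})
accounts = pd.DataFrame({
    "account_id": [1, 2],
    "opening": [500.0, 300.0],
    "closing": [650.0, 275.0],   # consistent with the transactions above
})
ok = balances_consistent(transactions, accounts)

# Synthetic tables where deposits continue but balances never move:
frozen = accounts.assign(closing=accounts["opening"])
bad_ok = balances_consistent(transactions, frozen)
```

Similar reconciliation rules can be written for any parent-child relationship your synthetic tables are supposed to preserve, such as order lines summing to order totals.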
By following this checklist, you can ensure that your synthetic data is of high quality, provides the right mix of similarity and privacy preservation, and is “fit for purpose,” i.e., serves its intended purpose effectively.
Synthetic data holds immense potential in data science and machine learning and, with careful assessment, can be an asset for your projects.