How insurance companies can use synthetic data to fight bias

The insurance industry has been scrutinized for years due to “fair bias” practices. Indeed, bad data and bias have long gone hand in hand in insurance business practices. Unfortunately, the result is that some populations are marginalized.

Some industry experts – including a former insurance commissioner in the US – believe that discrimination will become the largest AI regulation issue. This is because customer data can easily reveal too much adverse information about individuals, allowing insurance companies to pick only the most desirable risks.

What is bad data for insurance businesses?

When building models, training data matters – a lot. Consider the example of body mass index (BMI) in life insurance. This example shows how a lack of diverse, representative and high-quality insurance data led to 80 years of an “ideal risk” that the American Medical Association eventually decried as inherently biased.

In this case, BMI data was based on a data set of predominantly white male heights and weights. Recent research shows that BMI does not account for factors such as bone density and muscle mass, so it is an inaccurate risk assessment measure for many people.

As the BMI example shows, a lack of data can create availability bias (an overreliance on data that’s easily accessed) – which leads to bad outcomes. And because data is the fuel for artificial intelligence, it follows that feeding bad data into AI systems will lead to poor results.
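One way to make the representativeness point concrete is to compare the makeup of a training sample against a reference population. The following is a minimal sketch under stated assumptions: the group labels, the reference shares, the tolerance and the function name `representation_gaps` are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares, tolerance=0.05):
    """Return groups whose share of the sample differs from the
    reference share by more than `tolerance` (observed minus expected)."""
    total = len(sample_groups)
    if total == 0:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 3)
    return gaps


# Made-up numbers: the sample is 90% group A, but the reference
# population (an assumption for this sketch) is only 60% group A.
sample = ["A"] * 90 + ["B"] * 10
reference = {"A": 0.6, "B": 0.4}

gaps = representation_gaps(sample, reference)
print(gaps)
```

A positive value means the group is over-represented in the sample and a negative value means it is under-represented. A check like this only detects skew on attributes you already record; it does not show that a model trained on the data is fair.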


Bias: A 4-letter word

Historically, insurers have used zip codes or territory codes to calculate insurance premiums. But seemingly innocent variables like these can be proxies for sensitive data – such as race, gender or religion. Such variables can, in turn, hide bias.

Consider a ProPublica story from 2017 in Chicago. The story focused on disparities in auto insurance premiums where zip codes were used as a primary data point for setting rates. Later research showed that those living in minority zip code areas paid higher premiums – holding constant factors such as age, coverage, gender and loss history.

In the most egregious example, changing only the zip code to a neighborhood that was more than 50% minority made the premium more than 300% higher. And premiums were higher at every one of the 34 companies quoted.
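The idea of comparing premiums while holding other factors constant can be sketched in a few lines. This is an illustrative example only, not the method used in the reporting described above: the quotes are made up, and the field names (`profile`, `minority_zip`, `premium`) and the function `premium_ratio_by_profile` are hypothetical. Each profile stands for one fixed combination of age, coverage, gender and loss history.

```python
from collections import defaultdict
from statistics import mean


def premium_ratio_by_profile(quotes):
    """For each risk profile, return the mean premium quoted in
    minority-majority zip codes divided by the mean premium elsewhere."""
    buckets = defaultdict(lambda: {True: [], False: []})
    for quote in quotes:
        buckets[quote["profile"]][quote["minority_zip"]].append(quote["premium"])
    ratios = {}
    for profile, groups in buckets.items():
        # Only compare profiles that have quotes on both sides.
        if groups[True] and groups[False]:
            ratios[profile] = mean(groups[True]) / mean(groups[False])
    return ratios


# Made-up quotes for illustration only.
quotes = [
    {"profile": "30-39/full/no-losses", "minority_zip": True, "premium": 1500},
    {"profile": "30-39/full/no-losses", "minority_zip": True, "premium": 1700},
    {"profile": "30-39/full/no-losses", "minority_zip": False, "premium": 1000},
    {"profile": "30-39/full/no-losses", "minority_zip": False, "premium": 1000},
    {"profile": "50-59/basic/one-loss", "minority_zip": True, "premium": 2400},
    {"profile": "50-59/basic/one-loss", "minority_zip": False, "premium": 2000},
]

ratios = premium_ratio_by_profile(quotes)
for profile, ratio in ratios.items():
    print(f"{profile}: {ratio:.2f}x")
```

A ratio well above 1.0 within the same profile suggests the zip code is carrying weight that the other rating factors do not explain. It is a signal to investigate further, not proof of discrimination on its own.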

If biases like this are not assessed and mitigated, vulnerable populations will be further marginalized. AI will only exacerbate the inequities.

AI and trustworthiness: Efforts to promote AI literacy, inclusive contribution and demonstrable trustworthiness have risen to the highest levels of government.

Where generative AI comes into play

Most business cases of generative AI (GenAI) feature large language model (LLM) capabilities. But another type of GenAI – synthetic data – is especially useful for addressing data concerns like privacy and fairness. Synthetic data offers modelers the advantage of not relying on data masking to protect sensitive personal data. Consider what these organizations are saying:

Property Casualty 360 cites this statement: “By 2027, as many as 40% of the AI algorithms used by insurers will integrate synthetic data in order to ensure fairness within their processes and comply with regulations” (a prediction by IDC FutureScape).
MAPFRE calls synthetic data a “strategic advantage” for insurance. As they put it: “Synthetic data, being completely dissociated from specific individuals, ensures both respect for privacy and strict regulatory compliance.”

Too good to be true? Not at all.

A real-world example of synthetic data results

In 2022, SAS, in collaboration with Syntho and the Dutch AI Coalition, demonstrated that synthetic data produced more reliable results than anonymized data while maintaining the deep statistical patterns required for more advanced analysis.

Such advances, coupled with growing concerns about protecting privacy, are why IDC predicts that by 2027, 40% of AI algorithms insurers use throughout the policyholder value chain will use synthetic data to guarantee fairness within the system and comply with regulations.

Synthetic data for insurance: holy grail or AI snake oil?

Synthetic data, in and of itself, will not heal all wounds. Remember, you still need the original data to create the synthetic data. Because of that, biases present in the original data can carry over into the synthetic data.
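A small sketch can show why this happens. The generator below is a deliberately naive stand-in (it simply resamples whole records from the original data) and is not how any particular synthetic data product works. The records, group labels and function names are hypothetical. The point is only that a generator that reproduces the original patterns faithfully also reproduces the gap between groups.

```python
import random


def high_premium_rate(records, group):
    """Share of records in `group` that were placed in the high-premium tier."""
    flags = [high for g, high in records if g == group]
    return sum(flags) / len(flags)


def naive_synthesize(records, n, seed=0):
    """Naive stand-in for a synthetic data generator: draw n records
    with replacement from the original data (assumption for this sketch)."""
    rng = random.Random(seed)
    return rng.choices(records, k=n)


# Made-up original data: 20% of group A and 50% of group B
# are in the high-premium tier.
original = (
    [("A", True)] * 20 + [("A", False)] * 80
    + [("B", True)] * 50 + [("B", False)] * 50
)
synthetic = naive_synthesize(original, n=10_000)

original_gap = high_premium_rate(original, "B") - high_premium_rate(original, "A")
synthetic_gap = high_premium_rate(synthetic, "B") - high_premium_rate(synthetic, "A")

print(f"gap in original data:  {original_gap:.3f}")
print(f"gap in synthetic data: {synthetic_gap:.3f}")
```

The two gaps come out close to each other, which is why the quality and balance of the source data still matter when synthetic data is used to address fairness.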

Any dialogue on the safe consumption of AI, including GenAI, must acknowledge several truths:

Bias creates inequities.
All models possess bias.
Bias can be mitigated, but not eliminated.

To position themselves as leaders in this space, organizations need to develop their own trustworthy AI principles. They should also:

Foster a culture of data literacy and the use of data-driven decisions.
Empower employees to call out unintended AI risks.
Embrace a code of data ethics as an integral part of their enterprise.

Recently, SAS hosted a synthetic data insurance project with a large insurer experimenting with synthetic data and credit scoring. Results of the experiment were encouraging. The ensuing discussion also highlighted certain ugly truths about the use of credit and other factors that affect premium rating. For example:

- Multiple studies have confirmed minorities and female drivers pay more for auto insurance.
- Driving history can be influenced by police bias.
- Tracking driving behavior through smart devices can be skewed based on road conditions that vary among neighborhoods.

Read a detailed review from the House of Representatives’ Committee on Financial Services on Auto Insurance Practices.

What’s next for synthetic data in insurance?

There are many ways for insurers to use GenAI.

Insurers can use generative AI models to create scenarios, then proactively identify risks and predict outcomes. GenAI can inform decisions about pricing and coverage. It can automate claims processing to help lower costs and improve customer experience and satisfaction. It can also improve fraud detection and make targeted risk prevention recommendations to customers that reduce the likelihood of claims.

Synthetic data holds the key to breaking the cycle of bias perpetuated in the insurance industry.

Rather than focusing on potential negative aspects of AI, the collective insurance community should ask the right questions and place a discrete focus on the quality of the data being used to generate their synthetic data. As a result, we can protect privacy and significantly reduce bias – all while unlocking the tremendous value of generative AI.

