Learning from experiences (or “data”) in the world around us is as hard-wired as breathing. But this beautiful endeavor that perfectly reflects the human condition is no longer exclusively a human experience.
To be direct: Machines learn like humans learn. Let’s consider how.
Neural networks are computing systems with interconnected nodes that work like neurons in the human brain. Through algorithms, they can recognize hidden patterns and correlations in raw data, cluster and classify it, and – over time – continuously learn and improve.
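To make that a bit more tangible, here’s a minimal, hypothetical sketch in Python: a small neural network finds a hidden pattern in made-up data and gets better the more examples it sees. Nothing here comes from a real insurer; the data and settings are invented purely for illustration.

```python
# A minimal sketch of a neural network "learning from experience."
# Hypothetical example: the data is random noise around a simple hidden rule,
# not real insurance data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))            # two made-up "experience" features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # the hidden pattern to discover

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Interconnected "neurons": two hidden layers of 16 nodes each
model = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=1000, random_state=0)
model.fit(X_train, y_train)               # learning = adjusting connection weights

print(f"Accuracy on unseen data: {model.score(X_test, y_test):.2f}")
```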
An early form of artificial intelligence, neural networks are fueled by data. And data represents experience. The faster the world changes, the quicker the data from which we (or machines) learn becomes unreliable.
Data from past experience: Can we still trust it?
Think about what was true just five years ago: no COVID, no Ukraine War, no ChatGPT (or hype around generative AI (GenAI)), no inflation, no supply chain disruptions or toilet paper wars.
Considering the current pace of change, how reliable is the historical data we use to determine rates, make underwriting decisions or settle claims? How long is that data viable before it can no longer be trusted? Do our loss experiences, our policy acceptance (or declination) decisions, or our sales and marketing tactics accurately reflect evolving risk?
In 2019, the answer might have been yes. But with every passing day, it feels like our data is some double agent working against us.
We shouldn’t allow ourselves to be handcuffed to old truths. Instead, we should explore the possibilities of infusing synthetic data, created with generative AI, into our processes.
Synth and (T)win
Why use data that’s not straight from the real world? Well, lots of reasons: sensitive or private information, cost, bias, availability, rare scenarios… the list goes on.
For insurers, there are several widely accepted and reliable techniques to generate synthetic data.
- Generative adversarial networks (GANs) were first introduced by Ian Goodfellow and his colleagues in their 2014 paper "Generative Adversarial Nets." For a technical deep dive, feel free to explore this discussion by Jason Colon. The short explanation: a generator creates data – image, text, audio, video or tabular – and tries to “fool” a discriminator, which tries to tell the generated data from the real thing. Because the two networks compete against one another (hence the name, “adversarial”), the generated results can reach upwards of 99% accuracy when compared to real data. A minimal code sketch follows this list.
- Synthetic minority oversampling technique (SMOTE) addresses class imbalance by supplementing the underrepresented (minority) class with synthetic examples, strengthening the statistical value of the entire data set. In one technical paper, SMOTE proved to be a highly reliable data science technique for predicting insurance premium nonpayment cancellations.
- Digital twin technology generates a virtual model of a physical object or system from the real world. For example, a manufacturer might build a digital twin of a large piece of equipment to understand potential loss scenarios. This could prevent catastrophic failure due to vibrations or centrifugal forces and could project when components need to be replaced or maintained. Digital twins can use a combination of historical, real-world data, synthetic data and system feedback loop data as inputs. These inputs can be processed in batch or in real time.
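Here’s that adversarial idea in code form: a minimal, hypothetical PyTorch sketch in which a generator invents one-column “claim severity” values and a discriminator tries to tell them from (simulated) real ones. It’s a toy, not production code or any insurer’s actual pipeline; practical tabular GANs such as CTGAN are considerably more involved.

```python
# A toy GAN sketch in PyTorch (illustrative only). The "real" data is simulated.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Simulated "real" claim severities, normalized for training
real_data = torch.exp(torch.randn(5000, 1) * 0.5 + 8.0)
real_data = (real_data - real_data.mean()) / real_data.std()

generator = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(2000):
    real = real_data[torch.randint(0, len(real_data), (64,))]
    fake = generator(torch.randn(64, 8))

    # Discriminator: learn to label real samples 1 and generated samples 0
    d_loss = (loss_fn(discriminator(real), torch.ones(64, 1)) +
              loss_fn(discriminator(fake.detach()), torch.zeros(64, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator: try to "fool" the discriminator into outputting 1 for fakes
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# Draw synthetic samples from the trained generator
synthetic = generator(torch.randn(1000, 8)).detach()
print(f"Synthetic sample mean {synthetic.mean().item():.2f}, std {synthetic.std().item():.2f}")
```

In practice, the synthetic output is judged by how closely its statistical properties match the real data it imitates, and by whether models trained on it behave like models trained on the original.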
Insurers can use any of these synthetic data generation techniques when faced with rare events, incomplete data or hard-to-obtain data. In addition to the above examples, insurance companies can use synthetic data to fight bias, avoid violation of privacy regulations and prevent exposure of sensitive information.
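Digital twins don’t compress into a few lines quite as neatly, but the feedback-loop idea from the digital twin bullet above can be sketched roughly like this. Everything here is hypothetical: the baseline, thresholds and simulated vibration readings are invented purely for illustration.

```python
# A deliberately simplified "digital twin" feedback loop: the twin tracks
# cumulative wear on a piece of equipment from streaming vibration readings
# (which could be historical, real-time or synthetic) and flags maintenance
# before a failure threshold is reached. All numbers are made up.
import random

random.seed(0)

FAILURE_THRESHOLD = 100.0   # hypothetical cumulative stress limit
MAINTENANCE_MARGIN = 0.8    # flag maintenance at 80% of that limit

class EquipmentTwin:
    def __init__(self):
        self.cumulative_stress = 0.0

    def ingest(self, vibration_mm_s: float) -> None:
        # Feedback loop: every new reading updates the twin's state
        self.cumulative_stress += max(vibration_mm_s - 2.0, 0.0)  # wear above a 2 mm/s baseline

    def needs_maintenance(self) -> bool:
        return self.cumulative_stress >= FAILURE_THRESHOLD * MAINTENANCE_MARGIN

twin = EquipmentTwin()

for hour in range(1, 1001):
    reading = random.gauss(2.5, 0.8)   # simulated (synthetic) vibration reading, mm/s
    twin.ingest(reading)
    if twin.needs_maintenance():
        print(f"Hour {hour}: schedule maintenance "
              f"(stress {twin.cumulative_stress:.1f} of {FAILURE_THRESHOLD})")
        break
```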
A haze of clarity
Insurers’ investment in synthetic data generation will help counter data decay and add value. Pioneering organizations like Hazy have already proven what synthetic data can deliver.
Gartner says that by 2026, 75% of businesses will use generative AI to create synthetic customer data, up from less than 5% in 2023. IDC specifically notes that by 2027, “40% of AI algorithms utilized by insurers throughout the policyholder value chain will utilize synthetic data to guarantee fairness within the system and comply with regulations.” The report further predicts this integration will expand to underwriting, marketing and claims.
Data and AI research from SAS echoes these predictions: “50% of insurers expect up to two times, and 41% over three to four times, return on AI investments.” The same research notes that GenAI will improve claims processes and operational efficiency.
These results come with trustworthy-by-design assurances, which matter when navigating data privacy and protection laws like the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA) and the EU AI Act.
It’s so easy…
How easy is it? Point-and-click. No coding.
It’s true. The returning champion team for the 2024 SAS Hackathon, the StatSASticians, demonstrated the ease of use and functionality built into today’s data and AI tools.
Their hack story focuses on worker safety and the SMOTE technique. Data gathered from “smart helmets” was fed into a dashboard, with the intention of monitoring for early warning signs of heat stroke. However, the collected data was imbalanced – readings showing those early warning signs were far too rare for reliable modeling – so the team used SMOTE to address the imbalance.
The result? A worker safety model applicable to workers' compensation insurance that can inform “predict and prevent” outcomes.
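The StatSASticians did this with point-and-click SAS tools, but for readers who like to see code, here is roughly what that balancing step looks like using the open-source imbalanced-learn library in Python. The “smart helmet” feature names and the data are invented for illustration and are not taken from the team’s solution.

```python
# Hypothetical SMOTE sketch with imbalanced-learn: oversample the rare class
# (readings showing early heat-stress warning signs) before training a model.
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "core_temp_c": rng.normal(37.2, 0.5, n),     # invented helmet-sensor features
    "heart_rate_bpm": rng.normal(95, 15, n),
    "ambient_temp_c": rng.normal(30, 4, n),
})
# Rare positive class: only a small share of readings show early warning signs
y = ((X["core_temp_c"] > 37.8) & (X["heart_rate_bpm"] > 110)).astype(int)

print("Before SMOTE:", pd.Series(y).value_counts().to_dict())

# SMOTE interpolates new minority-class examples from existing ones
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X, y)

print("After SMOTE: ", pd.Series(y_balanced).value_counts().to_dict())
# X_balanced / y_balanced can now feed a classifier that isn't starved of rare cases
```

One practical note: SMOTE is typically applied only to the training split, so the held-out evaluation data stays untouched.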
Impressively, the team built the solution in a few weeks – with minimal data. This is the equivalent of Tony Stark building the original Iron Man suit in a cave. Imagine what a large enterprise could do with such powerful technology. (Did you know part of Iron Man 3 was filmed at SAS headquarters – crazy, right?).
So, which is better – real-world or synthetic data?
The answer to that question sounds like the setup for a bad joke, but it comes from personal experience.
Imagine this: You sit down to breakfast with the head of AI and the chief actuary at a large insurer. You start discussing synthetic data. The head of AI says, “We don’t like synthetic data. We like real data.” The chief actuary says, “If we don’t have real data, synthetic data works well.” The head of AI says, “It’s not as good as real data, that’s why we don’t like it.” The chief actuary responds, “Well, having something is better than having nothing.”
And around and around they went until the check arrived.
Both sides are correct. If you have sufficient amounts and types of real-world data that you can access, use and trust, that’s great. But this will not always be the case.
The bottom line: Challenge the status quo
To paraphrase some brilliant insight from Tommy Lee Jones (Men in Black, 1997), knowledge and certainty can be stupid and dangerous. Whether the claim was “The earth is flat,” “The 4-minute mile can’t be broken” or “We only like real data,” someone eventually pushed back on those notions.
Insurers like MAPFRE already refer to synthetic data as a “strategic advantage.” ERGO champions the call to action to “unlock your treasure trove of data” to settle claims, fight fraud and develop new products.
The two camps can coexist – we can use both real-world and synthetic data. Just remember that as data decays, prioritize the most recent and most reliable experience and combine it with the power of generative AI.