3 Questions: Innovation in financial services through synthetic data

AI in Finance

Published on 02/07/2023

Wells Fargo’s Jasmine de Gaia discusses how artificially generated data can preserve privacy, drive innovation, and expedite technology development.

Robust, reliable, and scalable testing is crucial to successful technology development and deployment. However, having large volumes of test data readily available to developers can be a challenge, particularly in a sector like financial services, where the data may need to be anonymized to protect its sensitive nature.

The MIT-IBM Watson AI Lab’s corporate membership program works with member companies like Wells Fargo (which provides banking, investments, mortgage, and consumer and commercial financial services) to develop tailored solutions for leading industry challenges, such as this one. The Lab’s Membership Program Lead, Kate Soule, sat down with Jasmine de Gaia, Senior Vice President and Head of Customer Data Strategy at Wells Fargo, to discuss how the Lab helped the company create synthetic test data with generative AI.

Q: What challenges and solutions did you consider with the Lab for developing large volumes of anonymized test data and why?

A: For a firm like Wells Fargo that handles the sensitive data of many individuals, rigorous testing is a non-negotiable component for the products and services we build. This means we need to have extensive test data to support and validate development and integration scenarios, both within and across our models. It also means having test data to support negative testing for failure scenarios. However, having large volumes of anonymized test data readily available to developers can be a challenge.

We explored several paths for test data creation. One common approach, data masking, involves stripping out any sensitive or personally identifiable information from production data. We also investigated temporarily bringing production data into test environments for use by developers. These approaches worked on a small scale, but they were ultimately unsustainable for accelerated product development. We needed a better way to generate test data, and when we explored this use case with the Lab, we found that synthetic data could achieve all three of these goals:

  1. Data Scalability — Synthetic data generators can create enough data to stand in for millions of individuals;
  2. Data Efficiency — Once trained, synthetic data generators can create hundreds of thousands of records in minutes; and
  3. Data Privacy — Synthetic data doesn’t contain any real personal information and can be further imbued with statistical privacy protections along the development pipeline.

Working with our research collaborators at the Lab, we are now leveraging tools from the field of generative AI to automatically learn and produce realistic, synthetic test data. These generative AI models are first trained on a real dataset to learn a representation of the data; that representation is then used to create new, realistic records that match the learned distribution but do not correspond to any actual observation.
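To make the train-then-sample pattern concrete, here is a minimal sketch that fits a simple generative model to a numeric table and then samples fresh records from it. The column semantics, the choice of a Gaussian mixture, and all parameters are illustrative assumptions, not the models used in this work.

```python
# Minimal sketch: fit a generative model to real tabular data, then sample
# brand-new synthetic records from it. Columns and distributions are
# hypothetical stand-ins, not real customer data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for a real numeric dataset, e.g. account balance and monthly spend.
real_data = np.column_stack([
    rng.lognormal(mean=8.0, sigma=1.0, size=10_000),  # balance
    rng.lognormal(mean=6.5, sigma=0.8, size=10_000),  # monthly spend
])

# "Training": learn a representation of the data (here, a Gaussian mixture).
model = GaussianMixture(n_components=5, random_state=0).fit(real_data)

# "Generation": draw new records that match the learned distribution but
# do not correspond to any actual observation.
synthetic_data, _ = model.sample(n_samples=100_000)
print(synthetic_data.shape)  # (100000, 2)
```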

Q: Several projects with the Lab are helping to unlock new applications of synthetic data in your industry; what are they, and what are the benefits?

A: In one project, we are exploring how to create synthetic versions of not just one dataset, but of an entire database. This allows us to provision multi-table development environments with realistic, non-sensitive data. Here, the technical challenge lies in creating multiple tables that maintain referential integrity and stay mutually consistent across the database. Relying on a form of hierarchical modeling, these AI techniques enable the creation of large volumes of relational test data.
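As a rough illustration of the hierarchical idea (and not the project’s actual method), the sketch below synthesizes a parent table first and then generates child rows conditioned on each parent, so every foreign key points at a row that exists. The two-table schema and all distributions are hypothetical.

```python
# Toy sketch of hierarchical multi-table synthesis: generate the parent
# table, then child rows conditioned on each parent row so referential
# integrity holds by construction. Schema and parameters are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

def generate_customers(n):
    """Parent table: one row per synthetic customer."""
    return pd.DataFrame({
        "customer_id": np.arange(n),
        "credit_score": rng.normal(700, 50, n).round().clip(300, 850),
    })

def generate_transactions(customers, mean_tx_per_customer=5):
    """Child table: each customer gets a random number of transactions
    whose amounts depend on that customer's attributes."""
    rows = []
    for _, c in customers.iterrows():
        n_tx = rng.poisson(mean_tx_per_customer)
        amounts = rng.gamma(shape=2.0, scale=c["credit_score"] / 10, size=n_tx)
        rows.extend({"customer_id": c["customer_id"], "amount": round(a, 2)}
                    for a in amounts)
    return pd.DataFrame(rows, columns=["customer_id", "amount"])

customers = generate_customers(1_000)
transactions = generate_transactions(customers)

# Every transaction's foreign key refers to a customer that actually exists.
assert transactions["customer_id"].isin(customers["customer_id"]).all()
```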

Another project we are working on aims to create synthetic datasets for never-before-seen scenarios. Working with economic data, the models first learn the underlying causal relationships in an observed dataset (e.g., how inflation influences spending at the level of the individual), and we then leverage those learned relationships to conditionally generate data for new scenarios, such as changes in spending behavior as inflation rises. The scenario data can be used to test our predictive models for robustness on out-of-distribution events. Conditional data generation is exciting because it can correct for biases in the underlying data.
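The following toy sketch shows the general shape of learn-then-intervene scenario generation: fit a simple causal mechanism from observed data, then fix the cause at a new value and sample synthetic outcomes under it. The linear inflation-spending relationship and every number here are invented for illustration; the project’s actual economic models are far richer.

```python
# Toy sketch of conditional scenario generation with a one-edge causal
# model (inflation -> spending). All values are illustrative, not real data.
import numpy as np

rng = np.random.default_rng(2)

# Observed "historical" data produced by some underlying causal process.
inflation = rng.normal(0.02, 0.01, 5_000)                        # observed rates
spending = 1_000 - 8_000 * inflation + rng.normal(0, 20, 5_000)  # individual spend

# Step 1: learn the mechanism spending = f(inflation) + noise from the data
# (here, a simple least-squares fit stands in for a learned causal model).
slope, intercept = np.polyfit(inflation, spending, deg=1)
residual_std = np.std(spending - (intercept + slope * inflation))

# Step 2: intervene on the cause to create a never-before-seen scenario,
# e.g. inflation at 8%, and generate synthetic spending under it.
scenario_inflation = np.full(5_000, 0.08)
scenario_spending = (intercept + slope * scenario_inflation
                     + rng.normal(0, residual_std, 5_000))
print(scenario_spending.mean())  # out-of-distribution test data for downstream models
```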

Q: What’s next for synthetic data and its application at financial institutions?

A: As the quality of synthetic data generation continues to improve, it’s becoming increasingly indistinguishable from real data. This opens up exciting applications where synthetic data can be used as a near-perfect substitute for real data, even in complex tasks such as AI model development. However, as synthetic data becomes more and more realistic, stronger protective measures are needed to ensure that no real information accidentally leaks into the synthetic data and creates potential privacy violations. That is why we are excited about a new project from the Lab on differentially private synthetic data, where the team is both automating the generation of synthetic data and protecting it with statistical privacy guarantees.
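One classic way to attach such a statistical guarantee, shown in the hedged sketch below, is to perturb a histogram of the real data with Laplace noise (the standard epsilon-differentially-private mechanism for counting queries) and then sample synthetic values from the noisy histogram. This is only a one-dimensional illustration of the principle, not the Lab’s method.

```python
# Minimal sketch of differentially private synthetic data via a noisy
# histogram: each bin count has sensitivity 1, so Laplace(1/epsilon) noise
# gives epsilon-DP; synthetic values are then drawn from the noisy bins.
# The attribute and all parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
epsilon = 1.0  # privacy budget: smaller epsilon means stronger privacy

# Hypothetical sensitive attribute, e.g. account balances.
real_values = rng.lognormal(mean=8.0, sigma=1.0, size=10_000)

# Histogram the real data and add calibrated Laplace noise to each count.
counts, bin_edges = np.histogram(real_values, bins=50)
noisy_counts = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
noisy_counts = np.clip(noisy_counts, 0, None)

# Sample synthetic values from the noisy distribution: the released data
# depends on the real records only through the noisy counts.
probs = noisy_counts / noisy_counts.sum()
bins = rng.choice(len(probs), size=100_000, p=probs)
synthetic_values = rng.uniform(bin_edges[bins], bin_edges[bins + 1])
print(synthetic_values.mean())
```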

The views expressed in the publication are personal views and do not necessarily reflect the views of Wells Fargo Bank, N.A., its parent company, affiliates and subsidiaries.