Founder, MIT Data-to-AI Lab
Abstract: Enterprises and organizations across sectors need high-quality data to effectively test thousands of software applications and train robust machine learning (ML) models. But in many (or even most) cases, there is not enough data available, either because it does not exist or because it is not accessible to those who need to use it. A promising solution to this problem is synthetic data. Synthetic data is created using a generative model that has been trained on real data. Because of this, it looks real and is statistically similar to real data.
While steady advances have been made in visual, language, audio, and video data, tabular data has not been a main focus of development for generative AI. Driven by our own need to solve data availability and access bottlenecks, we invented the first generative AI for tabular, relational data and named it “The Synthetic Data Vault” (SDV). SDV’s open source product has since been downloaded more than a million times and has been used by 50+ Global 2000 companies, including Mastercard, JPMorgan Chase, and many others.
In this talk, I will present how synthetic data generated using SDV is transforming enterprise data work. Developers who create generative models and use them to generate synthetic data for testing their applications are able to achieve 100x coverage while reducing the time it takes to create the data by 1/10th. Machine learning engineers are using synthetic data to train models that are more robust and more accurate: models that can better predict health outcomes, or increase fraud prediction accuracy by 20x. Enterprises are also using synthetic data generated with SDV to solve access issues, sharing data with third-party entities or releasing datasets for data science competitions. Synthetic data is even transforming the process of training generative models themselves.