Generative AI Redefining the Data Landscape: From Text and Images to Enterprise Applications

2024-03-06

Generative artificial intelligence (AI) has attracted much attention for its remarkable ability to create text and images. However, those media are only the tip of the iceberg of the data produced in modern society. Whenever a medical system records a patient's information, a storm affects a flight, or a person interacts with a software application, data is generated.

In these scenarios, creating realistic synthetic data using generative AI has great value for organizations. It can play a significant role in treating patients, adjusting flight routes, and improving software platforms, especially when real data is limited or sensitive.

In recent years, DataCebo, a spin-off company from MIT, has launched a generative software system called Synthetic Data Vault (SDV). The system aims to help organizations create synthetic data to support various applications such as software testing and machine learning model training.

Since its launch, SDV has been downloaded more than one million times, and more than ten thousand data scientists have used it to generate synthetic tabular data. According to Kalyan Veeramachaneni, the company's founder and chief research scientist, and MIT alumna Neha Patki, that success is largely due to SDV's innovative software testing capabilities.

SDV's Rise

In 2016, Veeramachaneni's Data to AI Lab team at MIT introduced an open-source generative AI toolset. This toolset helps organizations create synthetic data that matches the statistical properties of real data.

By using synthetic data, companies can protect sensitive information while preserving the statistical relationships between data points. Additionally, synthetic data can be used to simulate the operation of new software to predict its performance before release.
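
To make the idea of "matching the statistical properties of real data" concrete, here is a minimal sketch in plain Python. It is not SDV's algorithm or API. It assumes a toy two-column table (a hypothetical patient age and annual cost), fits a simple bivariate normal model to it, and samples brand-new rows from that model. All function and variable names are illustrative.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(model, n, rng):
    """Draw n new rows from the fitted bivariate normal (no real row is copied)."""
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Stand-in for a "real" table: (age, annual_cost), with cost rising with age.
real = []
for _ in range(1000):
    age = rng.uniform(20, 80)
    real.append((age, 50 * age + rng.gauss(0, 500)))

real_model = fit_gaussian(real)
synthetic = sample_synthetic(real_model, 5000, rng)
synth_model = fit_gaussian(synthetic)

print("real correlation:     ", round(real_model["rho"], 3))
print("synthetic correlation:", round(synth_model["rho"], 3))
```

The synthetic rows are fresh draws, yet their means, spreads, and age-to-cost correlation track the original table. A production tool has to handle far more than this sketch does, such as categorical columns, non-normal distributions, and multiple linked tables.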

Veeramachaneni's team became interested in this problem because they collaborated with several companies that wanted to share data for research purposes.

"MIT showed us all these different use cases," explained Patki. "We collaborated with financial and healthcare companies, and all these projects helped us develop cross-industry solutions."

In 2020, the researchers founded DataCebo to build more SDV features for larger organizations. Since then, SDV has been applied to an impressive range of use cases.

For example, using DataCebo's new flight simulator, airlines can plan for rare weather events in ways that traditional methods cannot achieve. In another case, SDV users synthesized medical records to predict the health outcomes of patients with cystic fibrosis. Recently, a team from Norway used SDV to create synthetic student data to evaluate various admission policies for fairness and bias.

In 2021, the data science platform Kaggle hosted a competition that used SDV to create synthetic datasets so that proprietary data did not have to be used. Approximately 30,000 data scientists participated, building solutions and making predictions based on the company's realistic data.

As DataCebo continues to grow, it has stayed close to its MIT roots: all current employees are MIT alumni.

Enhancing Software Testing Efficiency

Although their open-source toolset is used in various scenarios, the company focuses on increasing its impact in software testing.

"You need data to test these software applications," said Veeramachaneni. "Traditionally, developers had to manually write scripts to create synthetic data. With generative models created using SDV, you can learn from collected data samples and generate a large amount of synthetic data with the same attributes as real data or create specific scenarios and edge cases to test your applications."

For example, if a bank wants to test a program that aims to reject transfers from accounts with no money, it must simulate many accounts making transactions simultaneously. Creating such data manually would be time-consuming. With DataCebo's generative models, customers can create any edge case they want to test.
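
The bank scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `approve_transfer` rule, the `Account` type, and the `generate_accounts` helper are all hypothetical, and a hand-written random generator stands in for a learned generative model. It also does not attempt to simulate simultaneous transactions. What it shows is the core idea of steering generation toward a chosen edge case (empty accounts) so that the test data covers it heavily.

```python
import random
from dataclasses import dataclass


@dataclass
class Account:
    account_id: int
    balance_cents: int


def approve_transfer(account, amount_cents):
    """Hypothetical system under test: reject transfers the balance cannot cover."""
    return amount_cents > 0 and account.balance_cents >= amount_cents


def generate_accounts(n, empty_fraction, rng):
    """Generate n test accounts, forcing roughly empty_fraction of them to a zero balance."""
    accounts = []
    for i in range(n):
        if rng.random() < empty_fraction:
            balance = 0  # the edge case we want well represented
        else:
            balance = rng.randint(1, 1_000_000)
        accounts.append(Account(i, balance))
    return accounts


rng = random.Random(42)
accounts = generate_accounts(10_000, empty_fraction=0.3, rng=rng)

empty = [a for a in accounts if a.balance_cents == 0]
funded = [a for a in accounts if a.balance_cents > 0]

# Every transfer attempted from an empty account must be rejected.
rejected_all_empty = all(
    not approve_transfer(a, rng.randint(1, 5_000)) for a in empty
)
# A funded account should be allowed to send exactly its full balance.
full_balance_ok = all(approve_transfer(a, a.balance_cents) for a in funded)

print(len(empty), "empty accounts tested; all rejected:", rejected_all_empty)
print(len(funded), "funded accounts tested; full balance allowed:", full_balance_ok)
```

Raising or lowering `empty_fraction` changes how heavily the edge case is represented, without anyone writing test accounts by hand.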

"It is common for industries to hold data that is sensitive in some way," said Patki. "Usually, when you deal with sensitive data, you need to comply with various rules. Even without legal rules, companies are better off handling data access with caution. Therefore, from a privacy perspective, synthetic data is always a better choice."

Expanding the Application Scope of Synthetic Data

Veeramachaneni believes that DataCebo is advancing the development of what he calls "synthetic enterprise data," which is generated by users' behavior on large companies' software applications.

"This type of enterprise data is very complex and not universally available, unlike language data, which is easy to obtain," Veeramachaneni said. "When people use our publicly available software and provide feedback on whether it fits certain patterns, we learn a lot about these unique patterns. This allows us to improve our algorithms. In a way, we are building a corpus of these complex patterns, the kind of corpus that is already readily available for language and images."

Recently, DataCebo has also released new features to enhance the practicality of SDV, including tools for evaluating the "realness" of generated data (called the SDMetrics library) and a method for comparing model performance (called SDGym).
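
One common way to put a number on how closely a synthetic numeric column resembles the real one is to take one minus the two-sample Kolmogorov-Smirnov statistic, so that 1.0 means the two empirical distributions are identical. The sketch below is a standalone implementation in plain Python, written for illustration. It is not code from SDMetrics, and the function name `ks_complement` and the sample data are assumptions made for this example.

```python
import random


def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic.

    1.0 means the empirical distributions coincide; values near 0 mean
    they barely overlap.
    """
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    i = j = 0
    max_gap = 0.0
    # Walk both sorted samples, tracking the largest gap between the two
    # empirical cumulative distribution functions.
    while i < n and j < m:
        value = min(real_sorted[i], synth_sorted[j])
        while i < n and real_sorted[i] <= value:
            i += 1
        while j < m and synth_sorted[j] <= value:
            j += 1
        max_gap = max(max_gap, abs(i / n - j / m))
    return 1.0 - max_gap


rng = random.Random(7)
real = [rng.gauss(50, 10) for _ in range(2000)]
good_synth = [rng.gauss(50, 10) for _ in range(2000)]  # same distribution
bad_synth = [rng.gauss(70, 10) for _ in range(2000)]   # shifted distribution

good_score = ks_complement(real, good_synth)
bad_score = ks_complement(real, bad_synth)

print("faithful synthetic column:", round(good_score, 3))
print("shifted synthetic column: ", round(bad_score, 3))
```

A score like this covers only one column at a time. Judging a whole table also requires checking relationships between columns, which is why evaluation libraries typically report several metrics together.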

"This is about ensuring organizations trust this new data," Veeramachaneni said. "Our tools provide programmable synthetic data, which means we allow companies to insert their specific insights and intuitions to build more transparent models."