Jan 12

Real-world vs. synthetic data: Which to use in AI model training?

Most companies today use real-world data for artificial intelligence (AI) model training, but it can be scarce, especially for smaller businesses or new products. And even when real data appears to be abundant, data scientists often run into limitations once AI model training begins: real data may be biased or unusable due to privacy concerns. Synthetic data can solve some of these issues, often more cost-efficiently, but it has a few drawbacks of its own. Many advanced solutions require a mix of real-world and synthetic data to perform well and handle edge cases. In this post, we compare the pros and cons of real-world and synthetic data.

How do real-world and synthetic data compare?

Real-world data, or real data, is genuine information produced by real events and collected in a real-life use case. For example, it could be the data companies collect through day-to-day customer service interactions.

Synthetic data is artificial data, generated by a program, that imitates real data. There are three types of synthetic data:

Fully synthetic data: Nothing is retained from the original data. The data generator identifies specific real-world characteristics to estimate realistic parameters and then generates synthetic data from them.
Partially synthetic data: Some of the original data is retained. The data generator replaces certain features of the original data (for example, those with a high risk of disclosure) with synthetic ones. Partially synthetic data can also be used to fill in gaps in the original data.
Hybrid data: A blend of real and synthetic data. The data generator pairs random records from the real data with synthetic records.
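
To make the three types concrete, here is a minimal sketch in Python. It is an illustration under strong assumptions, not how any particular data generator works: the records, field names, and function names are hypothetical, and each numeric field is modeled as an independent Gaussian, which ignores correlations that a real generator would need to preserve.

```python
import random
import statistics

# Hypothetical "real" records; the fields are made up for illustration.
REAL_ROWS = [
    {"age": 34, "spend": 120.0},
    {"age": 51, "spend": 310.5},
    {"age": 27, "spend": 89.9},
    {"age": 45, "spend": 205.0},
]

def fit_gaussian(values):
    """Estimate mean and standard deviation from real values."""
    return statistics.mean(values), statistics.stdev(values)

def fully_synthetic(real_rows, n, rng):
    """Sample new rows from parameters estimated on the real data.
    No original record is retained."""
    age_mu, age_sigma = fit_gaussian([r["age"] for r in real_rows])
    spend_mu, spend_sigma = fit_gaussian([r["spend"] for r in real_rows])
    return [
        {"age": rng.gauss(age_mu, age_sigma),
         "spend": rng.gauss(spend_mu, spend_sigma)}
        for _ in range(n)
    ]

def partially_synthetic(real_rows, sensitive_field, rng):
    """Keep the real rows but replace one high-risk field with sampled values."""
    mu, sigma = fit_gaussian([r[sensitive_field] for r in real_rows])
    return [{**r, sensitive_field: rng.gauss(mu, sigma)} for r in real_rows]

def hybrid(real_rows, synthetic_rows, rng):
    """Pair each synthetic row with a randomly chosen real row."""
    return [(rng.choice(real_rows), s) for s in synthetic_rows]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
full = fully_synthetic(REAL_ROWS, 5, rng)
partial = partially_synthetic(REAL_ROWS, "spend", rng)
pairs = hybrid(REAL_ROWS, full, rng)
```

In the partially synthetic case only the `spend` field changes, while `age` is carried over from the original records unchanged.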

Now, let’s compare real and synthetic data in terms of cost, quality, annotation, scalability, privacy, and regulatory compliance.

Is it cheaper to source real or synthetic data?

Typically, it is less costly to source synthetic data, but it depends on the use case. For example, if you are training a machine learning algorithm for a self-driving car, it would be much cheaper to simulate a car crash than to stage a real one to collect the necessary data.

Larger companies may already have a lot of real-world data to work with, but these datasets are often imbalanced: edge cases happen rarely and are therefore underrepresented in the data. Data scientists then have to turn to synthetic data to fill the gap.
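
As a rough illustration of supplementing an imbalanced dataset, the sketch below creates extra examples of a rare class by interpolating between pairs of real rare examples. It is a simplified stand-in for established oversampling techniques (it picks random pairs rather than nearest neighbors), and the function name, field names, and values are hypothetical.

```python
import random

def oversample_minority(rows, label_key, minority_label, target_count, rng):
    """Add synthetic minority rows until the class reaches target_count.
    Assumes every non-label field is numeric."""
    minority = [r for r in rows if r[label_key] == minority_label]
    if len(minority) < 2:
        raise ValueError("need at least two real minority examples to interpolate")
    synthetic = []
    while len(minority) + len(synthetic) < target_count:
        a, b = rng.sample(minority, 2)   # two distinct real examples
        t = rng.random()                 # interpolation weight in [0, 1)
        new_row = {
            key: a[key] + t * (b[key] - a[key])
            for key in a if key != label_key
        }
        new_row[label_key] = minority_label
        synthetic.append(new_row)
    return rows + synthetic

# Hypothetical driving data: many normal rows, only two crash rows.
rows = [{"speed": 50.0 + i, "brake": 0.2, "label": "normal"} for i in range(8)]
rows += [
    {"speed": 80.0, "brake": 0.9, "label": "crash"},
    {"speed": 95.0, "brake": 1.0, "label": "crash"},
]
balanced = oversample_minority(rows, "label", "crash", 8, random.Random(1))
```

Because every new row lies between two real rows, this approach cannot invent edge cases that are absent from the real data; it only makes the existing rare cases less rare.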

Creating synthetic data for text algorithms is usually cheaper as well. Existing data generators can generate grammatically correct texts of acceptable quality.

Meanwhile, synthetic data for speech algorithms is used less frequently due to the complexity of speech data. In practice, producing high-quality speech data often still means having real speakers simulate or act out the necessary situations.

Data privacy and regulatory compliance

Privacy and regulatory compliance are among the most significant drivers of synthetic data use in AI. Because properly generated synthetic data does not need to contain any sensitive or personally identifiable information (PII), it makes it easier to preserve the privacy of real individuals and comply with regulations such as GDPR.

However, when synthetic data mimics real data that contains privacy issues or bias, those problems may carry over into the synthetic data. This often happens with algorithms that generate synthetic data from biased historical data. For example, the demo of Meta’s language model Galactica was removed three days after going live because it produced heavily biased or faulty results.

Data quality

Synthetic data can provide more diversity by enabling the inclusion of edge cases, while sourcing such cases from real data may take a lot of work. In terms of bias, both real and synthetic data can be inherently or historically biased because, as mentioned earlier, synthetic data mimics real data, absorbing existing vulnerabilities.

The overall quality of the AI model depends on the data source. For synthetic data, quality is determined by both the data generator and the quality of the original data.

To reflect reality, synthetic data needs to be sufficiently accurate and diverse. However, excessively manipulating synthetic data to achieve a level of fairness might distort the end result – unrealistic data cannot provide reliable insights.

Labeling and annotation

Data labeling and annotation can be a months-long, resource-intensive process. Here synthetic data has an advantage over real-world data because it is often labeled automatically as it is created. This also removes the human element of mislabeling. However, it can instead introduce faulty labels from the algorithms that create the data.
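
The sketch below shows why synthetic data can arrive pre-labeled: when examples are generated from templates, the generator already knows which label each example belongs to. The intents, templates, and slot values are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical intents and templates for an intent classification dataset.
TEMPLATES = {
    "check_balance": [
        "What is the balance on my {account} account?",
        "Show me my {account} balance.",
    ],
    "report_lost_card": [
        "I lost my {card} card.",
        "My {card} card is missing, please block it.",
    ],
}
SLOTS = {"account": ["savings", "checking"], "card": ["credit", "debit"]}

def generate_labeled_utterances(n, rng):
    """Generate n utterances; the label comes for free from the template's intent."""
    examples = []
    for _ in range(n):
        intent = rng.choice(sorted(TEMPLATES))
        template = rng.choice(TEMPLATES[intent])
        # str.format ignores keyword arguments the template does not use.
        fills = {slot: rng.choice(values) for slot, values in SLOTS.items()}
        examples.append({"text": template.format(**fills), "label": intent})
    return examples

dataset = generate_labeled_utterances(20, random.Random(42))
```

The same mechanism explains the risk noted above: a template filed under the wrong intent would mislabel every example generated from it, so generated labels still deserve spot checks.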

Ability to scale

Data needs can change as you start working on your machine learning algorithm. A lack of data is one of the more common problems AI development teams face. Without more training data, it is challenging to scale AI projects.

Synthetic data generally scales better because it is easier and cheaper to source, with a few exceptions; for example, high-quality synthetic speech data is typically harder to produce.

Final verdict: To use or not to use synthetic data?

In the perfect scenario, you have plenty of diverse, unbiased real data matching the use case. Theoretically, real-world data is more likely to deliver higher performance. However, in practice, synthetic data helps with edge cases, fills data scarcity gaps, and alleviates privacy concerns.

For many use cases, it does not need to be a choice between synthetic and real-world data. Often, a combination of both will get you the best results. For most data scientists, what matters more is having enough high-quality, unbiased data.

If you are looking for data, StageZero Technologies offers hybrid speech data for conversational AI and intent classification data for speech and text use cases. Our data is produced by a global network of 10 million contributors in over 40 languages. Get in touch and tell us about your data needs.
