Synthetic Data

Data is the fuel that powers firms' advanced analytics and machine learning initiatives, but between privacy concerns and workflow challenges, researchers don't always have easy access to what they need. Synthetic data, which can be shared and used in ways that real-world data cannot, is a promising avenue to explore. However, this emerging approach is not without risks or downsides, and companies must carefully consider where and how they invest their resources.

What Is Synthetic Data?

Synthetic data is created artificially by an AI system that has been trained on real data. It offers the same predictive power as the original data, but rather than hiding or altering the original, it replaces it. The objective is to reproduce the statistical properties and patterns of an existing data set by modeling its probability distribution and sampling from it. The algorithm effectively generates new data with all of the same characteristics as the original, leading to the same answers. However, it is nearly impossible to reconstruct the original data (think personally identifiable information) from either the algorithm or the synthetic data it has generated.

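As a minimal sketch of that model-and-sample idea, the example below fits a Gaussian mixture to a toy numeric table and draws brand-new rows from it. The column names, the toy data, and the choice of a Gaussian mixture are illustrative assumptions; production systems typically rely on far more sophisticated generative models.

```python
# Minimal sketch of "model the distribution, then sample from it".
# The columns and the Gaussian mixture are illustrative assumptions;
# real synthetic-data systems use much richer generative models.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for the real, sensitive data set.
real = pd.DataFrame({
    "age": rng.normal(45, 12, 1000).clip(18, 90),
    "balance": rng.lognormal(8, 1, 1000),
})

# Model the joint probability distribution of the real data...
model = GaussianMixture(n_components=5, random_state=0).fit(real.values)

# ...then sample entirely new records that mimic its statistics.
samples, _ = model.sample(n_samples=1000)
synthetic = pd.DataFrame(samples, columns=real.columns)

# Aggregate statistics should be close, yet no synthetic row maps
# back to any individual in the original data.
print(real.describe())
print(synthetic.describe())
```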

The technique has applications in a variety of sectors. Companies in financial services, where data usage and customer privacy requirements are particularly stringent, are beginning to utilize synthetic data to help them uncover and eliminate bias in how they treat consumers – without violating data privacy regulations. Retailers are also realizing the possibility of new revenue streams from selling synthetic data on customer purchase habits without disclosing personal information.

[Figure: Synthetic Data Generation]

The Value for Business: Security, Speed, and Scale

The most obvious benefit of synthetic data is that it eliminates the risk of exposing critical data and jeopardizing the privacy and security of organizations and customers. Encryption, anonymization, and advanced privacy-preservation techniques (for example, homomorphic encryption or secure multiparty computation) focus on safeguarding the original data, along with any information in the data that could be linked back to an individual. But whenever the original data is involved, there is always a chance of it being compromised or exposed in some way.

Synthetic data also helps enterprises gain access to data more rapidly by removing the time-consuming hurdles of privacy and security reviews. Consider one financial institution with a wealth of data that could help decision-makers resolve a number of business issues. Even for strictly internal purposes, gaining access to the data was difficult because of its high level of security: in one instance, it took six months to obtain a modest quantity of data, followed by another six months to receive an update. Now that the organization generates synthetic data based on the original, the team can update and model it in real time, producing continual insights into how to improve business performance.

Furthermore, synthetic data lets a company quickly train machine learning models on large data sets, accelerating the training, testing, and deployment of an AI solution. This addresses a real challenge many companies face: not having enough data to train a model. Access to a large set of synthetic data gives machine learning engineers and data scientists more confidence in the results they're getting at each stage of model development, and that means getting to market more quickly with new products and services.

Security and speed also enable scale, enlarging the amount of data available for analysis. While companies can currently purchase third-party data, it’s often prohibitively expensive. Buying synthetic data sets from third parties should make it easy and inexpensive for companies to bring more data to bear on the problem they’re trying to solve and get more accurate answers.

Why Isn’t Everybody Using It?

While the advantages of synthetic data are clear, achieving them can be challenging. Synthetic data generation is a highly complex process, and to execute it well, an organization must do more than simply plug an AI tool into its data sets. The work requires people with deep, highly specialized AI expertise. A company also needs highly precise, sophisticated frameworks and metrics to validate that it has accomplished what it set out to create. This is where things get really challenging.

Evaluating synthetic data is complicated by the many different potential use cases. Specific types of synthetic data are necessary for different tasks (such as prediction or statistical analysis), and those come with different performance metrics, requirements, and privacy constraints. Furthermore, different data modalities dictate their own unique requirements and challenges.

A simple example: Let’s say you’re evaluating data that includes a date and a place. These two discrete variables operate in different ways and require different metrics to track them. Now imagine data that includes hundreds of different variables, all of which need to be assessed with very specific metrics, and you can begin to see the extent of the complexity and challenge. We are just in the beginning stages of creating the tools, frameworks, and metrics needed to assess and “guarantee” the accuracy of synthetic data. Getting to an industrialized, repeatable approach is critical to creating accurate synthetic data via a standard process that’s accepted — and trusted — by everyone.
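
To make that concrete, here is a small sketch, with illustrative column names and metric choices, of how the evaluation changes with the variable type: category frequencies for a "place" column might be compared with total variation distance, while a "date" column might be compared with a two-sample Kolmogorov-Smirnov statistic on timestamps.

```python
# Sketch: different column types demand different comparison metrics.
# Column names, toy data, and metric choices are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare_place(real: pd.Series, synth: pd.Series) -> float:
    """Total variation distance between category frequencies."""
    p = real.value_counts(normalize=True)
    q = synth.value_counts(normalize=True)
    all_cats = p.index.union(q.index)
    p = p.reindex(all_cats, fill_value=0.0)
    q = q.reindex(all_cats, fill_value=0.0)
    return 0.5 * np.abs(p - q).sum()

def compare_date(real: pd.Series, synth: pd.Series) -> float:
    """Kolmogorov-Smirnov statistic on dates treated as timestamps."""
    return ks_2samp(real.astype("int64"), synth.astype("int64")).statistic

# Toy data standing in for a real/synthetic pair.
real = pd.DataFrame({
    "place": pd.Series(["NY", "SF", "NY", "LA"] * 250),
    "date": pd.to_datetime("2023-01-01") + pd.to_timedelta(
        np.random.default_rng(0).integers(0, 365, 1000), unit="D"),
})
synth = real.sample(frac=1.0, replace=True, random_state=1)  # placeholder "synthetic" set

print("place TVD:", compare_place(real["place"], synth["place"]))
print("date KS:  ", compare_date(real["date"], synth["date"]))
```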

What Could Go Wrong?

Proving the veracity of synthetic data is a critical hurdle. The team working on the effort must be able to demonstrate that the artificial data it created faithfully represents the original data, yet can't be tied to or expose the original data set in any way. That's hard to do. If the match isn't precise, the synthetic data set isn't truly valid, which opens up a host of potential problems.

Assume you've produced a synthetic data set to aid in the development of a new product. If the synthetic set does not accurately replicate the original customer data, it may contain misleading buying signals about what consumers are interested in or likely to purchase. As a result, you may end up spending a lot of money developing a product that no one wants.

Incorrectly generated synthetic data can also land a company in hot water with regulators. If the use of such data leads to a compliance or legal problem, such as a product that harms someone or does not work as described, it could result in significant financial penalties and, potentially, greater scrutiny in the future. Regulators are only now beginning to examine how synthetic data is created, evaluated, and shared, and they will likely play a role in shaping this process.

A distant, but still real, ramification of improperly created synthetic data is the possibility of what's known as membership inference attacks. The whole premise of synthetic data is that it's not in any way tied to the original data. But if it isn't created exactly right, malicious actors might be able to find a vulnerability that lets them trace some data points back to the original data set and infer who a particular person is. The actors can then continually probe and query the synthetic set to figure out the rest, eventually exposing the entire original data set. Technically, this is extremely difficult to do. But with the right resources, it's not impossible.
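
One hedged way to probe for that kind of leakage is a distance-to-closest-record check: if synthetic rows sit far closer to individual original rows than original rows sit to each other, the generator may have memorized records. The toy data and the 0.1 threshold below are illustrative assumptions, and this is a rough sanity check, not a formal privacy guarantee.

```python
# Sketch: distance-to-closest-record check as a rough leakage signal.
# Toy data and threshold are illustrative assumptions; this is a sanity
# check, not a formal privacy guarantee.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 5))       # stand-in for original records
synthetic = rng.normal(size=(1000, 5))  # stand-in for generated records

# Distance from each synthetic row to its nearest real row.
nn = NearestNeighbors(n_neighbors=1).fit(real)
dist, _ = nn.kneighbors(synthetic)

# Baseline: typical nearest-neighbor distance within the real data itself
# (use the 2nd neighbor to skip each point's zero distance to itself).
self_nn = NearestNeighbors(n_neighbors=2).fit(real)
self_dist, _ = self_nn.kneighbors(real)
baseline = np.median(self_dist[:, 1])

# Synthetic rows much closer to real rows than real rows are to each
# other are candidates for memorization / membership leakage.
suspicious = (dist[:, 0] < 0.1 * baseline).mean()
print(f"fraction of suspiciously close synthetic rows: {suspicious:.3f}")
```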

One potential problem with synthetic data that can arise even if the data set was created correctly is bias, which easily creeps into AI models trained on human-created data sets that contain inherent, historical biases. Synthetic data can be used to generate data sets that conform to a pre-agreed definition of fairness. Using that definition as a constraint in an optimization model, the new data set will not only accurately reflect the original one but do so in a way that satisfies that specific definition of fairness. But if a company doesn't make such adjustments to account for bias and simply copies the patterns of the original, the synthetic data will contain all the same biases.
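
As a toy illustration of treating a fairness definition as a constraint, the sketch below resamples a generated data set so that approvals occur at the same pre-agreed rate in every group (demographic parity). The column names, the target rate, and the resampling shortcut are illustrative assumptions; a real pipeline would build the constraint into the generative model itself.

```python
# Sketch: impose a simple fairness definition (demographic parity) on a
# synthetic data set via stratified resampling. Column names, the target
# rate, and the resampling approach are illustrative assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for generated data that inherited a historical bias:
# group "B" is approved far less often than group "A".
group = rng.choice(["A", "B"], size=10_000)
rate = np.where(group == "A", 0.60, 0.30)
synthetic = pd.DataFrame({
    "group": group,
    "approved": (rng.random(10_000) < rate).astype(int),
})

TARGET_RATE = 0.45  # pre-agreed approval rate for every group

def resample_group(df: pd.DataFrame) -> pd.DataFrame:
    """Draw len(df) rows from df so that P(approved) == TARGET_RATE."""
    pos = df[df["approved"] == 1]
    neg = df[df["approved"] == 0]
    n_pos = int(round(len(df) * TARGET_RATE))
    return pd.concat([
        pos.sample(n_pos, replace=True, random_state=0),
        neg.sample(len(df) - n_pos, replace=True, random_state=0),
    ])

fair = pd.concat(resample_group(g) for _, g in synthetic.groupby("group"))
print(fair.groupby("group")["approved"].mean())  # ~0.45 for both groups
```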

What It Takes to Move Forward

With the relevant skills, frameworks, metrics, and technologies maturing, companies will be hearing a lot more about synthetic data in the coming years. As they weigh whether it makes sense for them, companies should consider the following four questions:

1. Do the right people know what we’re getting into? For most individuals, synthetic data is a novel and perplexing notion. Before implementing any synthetic data program, it is critical that the whole C-suite, as well as the risk and legal teams, completely comprehend what it is, how it will be utilized, and how it may help the organization.

2. Do we have access to the necessary skills? Because creating synthetic data is a sophisticated process, businesses must assess whether their data scientists and engineers are capable of learning how to accomplish it. They should examine how frequently they will generate such data since this will determine whether they should invest time and money in developing this skill or hire external experts as needed.

3. Do we have a clear purpose? Synthetic data must be generated with a particular purpose in mind because the intended use affects how it’s generated and which of the original data’s properties are retained. And if one potential use is to sell it to create a new revenue stream, planning for this potential new business model is key.

4. What’s the scale of our ambitions? Creating synthetic data isn’t for the faint of heart. The sheer complexity associated with doing it right — and the potential pitfalls of doing it wrong — means organizations should be sure it will deliver sufficient value in return.

Although synthetic data is still at the cutting edge of data science, more organizations are experimenting with how to get it out of the lab and apply it to real-world business challenges. How this evolution unfolds and the timeline it will follow remains to be seen. But leaders of data-driven organizations should have it on their radar and be ready to consider applying it when the time is right for them.