Intro to synthetic data

Synthetic data is a valuable tool for enabling better research. Find out more in this explainer.

What is synthetic data?

Synthetic data is artificial data that contains no information about real people, but follows some of the same patterns as real-world data. Each piece of information in the synthetic dataset is usually designed to be plausible, but is created at random based on the structure of original, real data.

Find out more about synthetic data in this short video, supported by funding from Health Data Research UK (HDR UK) and the Medical Research Council (MRC):

How is synthetic data used?

Processes for accessing secure data for research are often complex and time-consuming, so researchers can wait a long time before they are able to begin. Synthetic data can speed up the research process by allowing researchers to start working on their project before they are granted access to sensitive data.

While they wait for access to the real data, researchers can use a synthetic version of a dataset to explore whether the data is suitable for their project, and to begin developing the code they will use to refine models and test hypotheses. This means they can hit the ground running once access to the real data is granted.

Synthetic data can also be valuable for training and teaching purposes, allowing users to experiment with realistic datasets without sacrificing the privacy of real people’s data.

While real decisions are always made using real data, synthetic data can be a valuable resource, enabling researchers to make progress on their projects, validate their methods, and gain insights into the data they expect to work with.

How is synthetic data generated?

Synthetic data is usually created using complex algorithms which analyse the structure of existing data and generate new datasets.

One example of this is synthpop, a package for the widely used programming language R, which allows users to create synthetic versions of sensitive individual-level data for researchers to work with.
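
To make the idea of generating data "at random based on the structure of original, real data" more concrete, here is a minimal sketch in Python. It is not how synthpop works and it is not a method described on this page: it is a deliberately simple illustration that assumes each column can be sampled on its own from the values seen in the original. The function name synthesise_independent and the example records are hypothetical.

```python
import random

def synthesise_independent(rows, n, seed=0):
    """Build n synthetic records by sampling each column independently.

    Assumption for illustration only: every record is a dict with the
    same keys, and each column is resampled from its observed values.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())
    # Collect the observed values of each column separately.
    observed = {col: [row[col] for row in rows] for col in columns}
    # Each synthetic record mixes values drawn column by column, so
    # relationships between columns are not preserved.
    return [
        {col: rng.choice(observed[col]) for col in columns}
        for _ in range(n)
    ]

# Made-up example records (hypothetical, not real data).
original = [
    {"age_band": "20-29", "sex": "F", "visits": 1},
    {"age_band": "30-39", "sex": "M", "visits": 4},
    {"age_band": "30-39", "sex": "F", "visits": 2},
    {"age_band": "60-69", "sex": "M", "visits": 7},
]

synthetic = synthesise_independent(original, n=10)
```

The result has the same columns and the same kinds of values as the original, which is enough to start writing and testing code. Because the columns are sampled separately, any link between them (for example between age band and number of visits) is lost.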

Synthetic data fidelity

The fidelity of the synthesis is a measure of how closely the synthetic data resembles the real data. Fidelity sits on a spectrum from low to high: the higher the fidelity, the more closely the synthetic data resembles the real data. The level of fidelity required may differ depending on how the data will be used.

Synthetic datasets that closely match the original data are known as high-fidelity datasets. Because they are similar to the actual data, they can be very useful for enabling researchers to develop their code.

Low-fidelity synthetic datasets are designed to have lower levels of similarity to the original datasets they are based on. They may simplify complex patterns in the data. Although they may not capture all the nuances of the original dataset, they can be used to help determine the suitability of the data for a project and for training purposes.
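
As a rough illustration of what "how closely the synthetic data resembles the real data" could mean in practice, the sketch below compares the share of each category in one column of a real dataset with the share in a synthetic one. This is only one of many possible measures and is an assumption made for illustration, not a measure prescribed by this explainer. The function names and the example values are hypothetical.

```python
from collections import Counter

def proportions(values):
    """Return the share of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

def total_variation_distance(real_values, synthetic_values):
    """Compare category shares in one column.

    0.0 means the shares are identical; 1.0 means the two columns
    have no categories in common.
    """
    p = proportions(real_values)
    q = proportions(synthetic_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)

# Made-up example columns (hypothetical, not real data).
real = ["A", "A", "B", "C"]
close_copy = ["A", "B", "A", "C"]   # same shares as the real column
rough_copy = ["C", "C", "C", "A"]   # noticeably different shares

distance_close = total_variation_distance(real, close_copy)
distance_rough = total_variation_distance(real, rough_copy)
```

A smaller distance suggests higher fidelity for that one column. A check like this says nothing about relationships between columns, which a higher-fidelity dataset would also be expected to reflect.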

How is personal data kept safe?

Synthetic data is usually generated from datasets that have already been made anonymous through a process called deidentification. This means that no personal information is available in the synthetic dataset, and the data doesn’t refer to any real people. Learn more about deidentification.

When a high-fidelity dataset is generated, it may go through additional checks before it is released to the researcher, to make sure that there is no risk of personal information being inadvertently shared. Low-fidelity datasets that do not maintain patterns across the dataset are very low risk.

What is RDS doing to develop the use of synthetic data?

As part of our mission to improve the ways data is used for research in Scotland, Research Data Scotland (RDS) is working with partners and researchers to develop the production and use of synthetic data.

Discover some of the key milestones in RDS’s work on synthetic data and our current areas of focus.