
Discussing Data: exploring public perceptions of synthetic data


Overview

Many public sector services now have a growing interest in the creation and sharing of synthetic data. These synthetic datasets are imitations of real administrative data, but they don’t include real people’s information. 

While real data is essential for analysis and decision-making, synthetic datasets are a useful tool for researchers in several ways. When used in the early stages of a project, such as in data discovery and code development, they can enable researchers to be better prepared when they receive the real data — leading to more effective research that benefits the public.  

As the creation of synthetic data has become more common, the need to ask the public what they think about its use has become more urgent. One project that set out to do this, supported by RDS, was ‘Discussing Data’. Run by Cardiff University and funded by ADR UK, it explored public perceptions of synthetic data and its use in research.

This involved four online workshops with 39 members of the public. Participants were selected from a wide range of backgrounds and experiences to ensure a diverse mix of views was represented. This included people of different ages, cultures and geographical locations across the UK.

Attendees participated in expert talks, group discussions and interactive activities, culminating in the co-production of 10 recommendations for data owners on creating and sharing synthetic data.

Findings 

The key findings on public attitudes to synthetic data related to communication, access, ethics and public trust. Awareness of synthetic data was low when the sessions began, and some participants found the term confusing. This is why transparency is key: people want clear and accessible explanations of what synthetic data is, how it is being used, and what the benefits and risks are.

Controlled access to synthetic data was preferred. Attendees were concerned that freely available synthetic data could be misunderstood or misused, and felt some level of access control was necessary.

Ethical oversight matters. It was important to know that real people are making decisions about how synthetic data is created, checked, and used. 

The presence or absence of these factors is linked to public trust. People wanted to know synthetic data was being used for public benefit, such as for health research. They were less comfortable if they felt it could be used in ways that were unfair, unclear, or not properly regulated.

Recommendations

1. The term 'synthetic data' is not widely understood by the public. A brief explanation should make clear that the data is not real, but is based on real data and is created in a way that minimises personal privacy risk.

2. It must be made clear what the synthetic data can and cannot be used for. 

3. Benefits and impacts of the datasets for the organisations, researchers and the public must be made clear.

4. Explain the personal privacy benefits that synthetic data offers.

5. Provide a simple explanation for how your synthetic data is created. In particular, you should explain the role of humans vs automation (such as Artificial Intelligence) in the process.

6. Human oversight in checking personal privacy risks is important. Explain the quality and privacy checks you undertake before your synthetic dataset is released.

7. There is not widespread support for a fully open access approach to synthetic data. Use a simple registration process which records the requestor’s name, email address and intended use (as a minimum). Implement a simple user agreement covering the key terms and conditions such as allowed usage and how long synthetic data can be held.

8. Use accessible case studies from researchers to demonstrate what synthetic data is, report outcomes from the synthetic data, and emphasise the positive impact for the public.

9. Use creative communication methods including infographics and engaging videos to convey information about synthetic data to the public.

10. Work with the public to ensure all of this information is accessible to people with a diverse range of needs.

You can read full details of these recommendations in the Discussing Data Project Report.
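For data owners wondering what recommendation 7 might look like in practice, the following is a minimal Python sketch of a registration record and user agreement. It is illustrative only and is not taken from the Discussing Data report or from any RDS system: the class names, the register function, the list of allowed uses and the 30-day holding period are all hypothetical assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Tuple


@dataclass(frozen=True)
class AccessRequest:
    """Minimum registration details named in recommendation 7."""
    name: str
    email: str
    intended_use: str


@dataclass(frozen=True)
class UserAgreement:
    """Key terms: allowed usage and how long the data can be held."""
    allowed_uses: Tuple[str, ...]  # hypothetical categories set by the data owner
    max_holding_days: int          # hypothetical holding period


def register(request: AccessRequest, agreement: UserAgreement, today: date) -> date:
    """Check a request against the agreement and return the date by which
    the synthetic data must no longer be held."""
    if not request.name.strip():
        raise ValueError("name is required")
    if "@" not in request.email:
        # Deliberately simple check; a real service would verify the address.
        raise ValueError("a valid email address is required")
    if request.intended_use not in agreement.allowed_uses:
        raise ValueError(f"intended use not permitted: {request.intended_use!r}")
    return today + timedelta(days=agreement.max_holding_days)


if __name__ == "__main__":
    agreement = UserAgreement(
        allowed_uses=("code development", "data discovery"),
        max_holding_days=30,
    )
    request = AccessRequest("A Researcher", "a.researcher@example.org", "code development")
    print("Hold until:", register(request, agreement, date.today()))
```

Keeping the record this small reflects the recommendation's emphasis on a simple process: only the name, email address and intended use are collected, and the agreement covers only allowed usage and holding time.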

 

Research Data Scotland’s impact

RDS are interested in synthetic data due to its potential to speed up research projects for public benefit. We first introduced the concept of synthetic data to members of the public in 2023, presenting to the SCADR Public Panel on work we were developing to help make synthetic datasets available to researchers. Feedback from this initial Panel session not only informed the work taking place at RDS but, after it was shared at a UK-wide meeting on public engagement with synthetic data, went on to inform synthetic data approaches and processes more widely. This feedback, together with conversations with the public during a 2022 public dialogue led by ADR UK and the Office for Statistics Regulation on perceptions of public good use of data for research and statistics, directly influenced ADR UK’s call for a larger project to gather insights from the public on the use of synthetic data. That project, ‘Discussing Data’, was awarded to Cardiff University. Feedback from the ADR England Public Insights Panel was also incorporated into the Discussing Data work, with members of this panel providing feedback on infographics for public audiences.

We have continued to engage with members of the public directly on synthetic data through the Scotland Talks Data Panel, which has directly informed how RDS communicates, creates and allows access to synthetic data. The Panel also gave feedback on an HDR UK-funded series of three data explainer videos, one of which was about synthetic data; this in turn was used in the public dialogue sessions as part of the Discussing Data project. Our Senior Engagement Manager Katie Oldfield also contributed to the project as a member of the steering group.

In turn, RDS are fully supportive of the findings and recommendations that have come out of Discussing Data and have made efforts to incorporate them within our overall approach to making synthetic data available for research. We have introduced controlled access and an End User License Agreement for researchers, and our communications around synthetic data have been informed by the recommendations set out above. Quality checking is also embedded within our method of creating synthetic datasets, ensuring that personal privacy is protected. 

Find out more

If you would like to find out more about the Discussing Data project, you can read their summary report or visit ADR UK to learn about how this project informs other ADR UK-funded synthetic data projects.

You can also learn more about the Scotland Talks Data Panel in our Impact Report.

Still getting to grips with what synthetic data is? Watch our handy explainer video: 

 

 
