Unlocking the potential of synthetic data

Research Data Scotland (RDS) is working to improve access to public sector data for research in the public good. Anyone who has tried to access this type of data will know that there are several potential obstacles, not least the often very long timescales and the complex information governance and approvals processes.

The data is held in a safe haven. It’s not always easy to know what data is available and whether it is suitable for a project, never mind what the data looks like in terms of messiness and missingness. So, how can synthetic data help with this? RDS, with input from external partners and other data organisations, has developed a proposed strategy to advance the production and research use of synthetic data in Scotland.

Workshop to understand user needs

RDS held an online synthetic data workshop on 22 November 2022, with the aim of getting feedback on the draft strategy. There were 20 participants, representing researchers, data provider organisations and others. The workshop opened with two short presentations describing what synthetic data is and what RDS plans to do, followed by breakout sessions. In these sessions, participants discussed what they saw as the benefits and issues of synthetic data, which uses of it (such as training, data discovery or code development) mattered most, and what level of fidelity would be most useful for researchers.

You can access the presentation here for more information on what synthetic data is, what fidelity means and what questions we asked.

One of the main themes that came out of the discussions was that it would make sense to start by generating low fidelity datasets with no discernible disclosure risk, as it was felt that the risk appetite of data controllers would vary depending on the dataset and the level of fidelity of the synthesis. Researchers and data providers noted that low fidelity datasets were good enough for their purposes too.

Insights from the workshop

One of the many surprises was the level of enthusiasm for developing low fidelity datasets. Researchers were keen on being able to access low fidelity, unlinked datasets for data discovery purposes at the early stages of a project, something that I wasn’t sure would be a particularly high priority. However, several participants noted that just being able to see the structure of the data and, for example, the range of values each variable can take would be really useful. Hearing this from potential users is helpful to RDS when planning future synthesis work.
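
To give a feel for what a low fidelity dataset of this kind might look like, here is a minimal sketch in Python. It is an illustration only, not the method RDS uses or plans to use: the schema, variable names and limits are all hypothetical, and the function `synthesise_rows` is invented for this example. Each variable is drawn independently from its declared limits and no real records are read, so only the structure and value ranges carry over, not any relationships between variables.

```python
import random

# Hypothetical schema: variable names, types and limits are illustrative only
# and do not describe any real dataset.
SCHEMA = {
    "age": {"type": "int", "min": 0, "max": 110},
    "sex": {"type": "category", "values": ["F", "M"]},
    "length_of_stay_days": {"type": "int", "min": 0, "max": 365},
}


def synthesise_rows(schema, n_rows, seed=0):
    """Draw each variable independently from its declared limits.

    No real records are read, so relationships between variables are not
    preserved; only column names, types and value limits carry over.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, spec in schema.items():
            if spec["type"] == "int":
                # randint includes both end points
                row[name] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "category":
                row[name] = rng.choice(spec["values"])
            else:
                raise ValueError(f"unsupported type: {spec['type']}")
        rows.append(row)
    return rows


rows = synthesise_rows(SCHEMA, n_rows=5)
for row in rows:
    print(row)
```

A dataset built this way would be of no use for analysis, but it could let a researcher check column names, types and plausible value ranges, or test code, before applying for access to the real data.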

Another issue raised was the information governance (IG) challenges around synthetic data. It was suggested that there could be a role for RDS here: creating a forum that brings IG and security experts from across organisations together to reach consensus on synthetic data risk management, rather than having many individual conversations. We are now considering this and will add it to the strategy for discussion with our newly formed Synthetic Data Working Group.