Synthetic Data Policy

Find information on synthetic data use and read our policy statement.

Scope

The scope of this policy includes the production, use, dissemination and sharing of synthetic data by Research Data Scotland (RDS).  Where relevant, RDS may do any combination of the following: produce, own, host, and/or make available synthetic datasets on behalf of data controllers.

This policy outlines the considerations and requirements for RDS regarding:

  • use of synthetic data
  • producing synthetic data
  • quality standards and quality assurance checks applied to synthetic data
  • the process of making synthetic data available
  • the use of synthetic data in the context of artificial intelligence

Background

Synthetic data is artificial data that contains no information about real people but follows some of the same patterns as real-world data. As a result, synthetic data carries a far lower disclosure risk than real data. 

Fidelity of synthetic data

Fidelity of synthetic data is a measure of how closely it resembles the real data in relation to its properties and characteristics. The higher the fidelity, the more it resembles the real data.

Researcher and public engagement carried out by RDS identified a desire for low fidelity synthetic data, which is considered best practice for code development and data discovery. High fidelity synthetic data may need to be accessed through a Trusted Research Environment (TRE), so low fidelity synthetic data is more accessible.

Typically, low fidelity synthetic data has the same structure as the real data and, in some cases, the ranges or distributions of its variables may match those of the real data (a high level of plausibility). However, relationships between the variables are not maintained in low fidelity synthetic data, so a synthetic record is unlikely to match the combination of values in any real record.

Low fidelity synthetic data can be generated in two ways. It can be generated from the metadata of the real data (‘data free’ synthetic data generation), which does not require access to the real data. Alternatively, it can be generated using sampling or statistical methods, for example with the synthpop package, which do require access to the real data.
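
The ‘data free’ approach described above can be illustrated with a minimal Python sketch. It is not RDS's R-based pipeline, and the metadata layout, variable names, ranges and missingness rates are all hypothetical. Each variable is drawn independently from its own metadata, so no relationships between variables are reproduced.

```python
import random

# Hypothetical structural metadata. The variable names, types, ranges, levels
# and missingness rates are illustrative and not taken from any real dataset.
METADATA = {
    "synthetic_age": {"type": "int", "min": 0, "max": 100, "missing_rate": 0.0},
    "synthetic_sex": {"type": "category", "levels": ["F", "M"], "missing_rate": 0.05},
    "synthetic_income": {"type": "float", "min": 0.0, "max": 90000.0, "missing_rate": 0.1},
}


def generate_value(spec, rng):
    """Draw one value for a variable using only its metadata."""
    if rng.random() < spec["missing_rate"]:
        return None  # reproduce missingness at the documented rate
    if spec["type"] == "int":
        return rng.randint(spec["min"], spec["max"])
    if spec["type"] == "float":
        return round(rng.uniform(spec["min"], spec["max"]), 2)
    if spec["type"] == "category":
        return rng.choice(spec["levels"])
    raise ValueError(f"unsupported type: {spec['type']}")


def generate_rows(metadata, n_rows, seed=0):
    """Generate rows with every variable drawn independently of the others."""
    rng = random.Random(seed)
    return [
        {name: generate_value(spec, rng) for name, spec in metadata.items()}
        for _ in range(n_rows)
    ]


rows = generate_rows(METADATA, n_rows=5)
```

Because only the metadata is read, the real data is never touched during generation. Access to the real data would still be needed afterwards for quality checking and statistical disclosure control.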

In both cases, access to real data is required for quality checking (to ascertain how closely the synthetic data resembles the real data) and statistical disclosure control (to ensure that no individual or place is identifiable).

Objectives

RDS will endeavour to make low fidelity synthetic data available in three ways, the choice of which is for the Data Controller to determine:

  • on RDS’s Metadata Catalogue, in collaboration with Data Controllers, without requiring the use of a Trusted Research Environment (TRE)
  • through a TRE such as the Scottish National Safe Haven, in collaboration with Data Controllers. RDS will work with the Data Controller and other relevant parties to ensure access is proportionally controlled
  • to researchers as part of a request to access the real data, in collaboration with Data Controllers. Whether a TRE is used would be for the Data Controller to determine

Roles and responsibilities

Data Controller 

The Data Controller retains full ownership of the synthetic data as RDS does not own any synthetic datasets. The Data Controller will determine, in agreement with RDS, restrictions on the use of synthetic data. They may delegate the power to approve access to synthetic data to RDS.   

Any breaches of the End User License Agreement will be reported to the Data Controller.  

The Data Controller can ask for a synthetic dataset to be removed from the metadata catalogue at any time.   

Data Team   

The Data Team in RDS will be responsible for the delivery of synthetic data. They will be the only members of RDS involved with the handling of synthetic data. Quality checks comparing the synthetic data with the real data will be carried out by an RDS Data Analyst, with appropriate agreements and safeguards in place. Applications to use synthetic data will be primarily assessed by the Data Team; however, members of the Senior Leadership Team or the Data Controller may be consulted if required.

Information Governance Officer  

An RDS Information Governance Officer must approve any request to access synthetic data as well as any decision to revoke access to synthetic data. In the absence of an Information Governance Officer, this responsibility can be carried out by an Information Governance Manager.

An Information Governance Officer is primarily responsible for reporting any breaches of the End User License Agreement to the RDS Data Protection Officer. Where a data breach has occurred as a result, the RDS Data Breach Policy will be followed and the Data Controller will be contacted if there are concerns around the quality checking of a dataset.

Uses of synthetic data

Synthetic data made available by RDS can be used for public benefit purposes. Specific conditions relating to individual synthetic datasets, such as use for non-commercial purposes only, can be found in their End User License Agreement.

Public benefit purposes can include exploring the data to ascertain whether it is suitable for a project or research question, and beginning to develop code to refine models and test hypotheses. Synthetic data could also be used to train researchers in how to work with administrative or clinical data.

If RDS suspects that a user may not be using synthetic data for approved purposes, RDS may contact the user to address these concerns and reserves the right to deny access to the synthetic data or to other RDS services such as the Researcher Access Service.

If the applicant’s intended use of the synthetic data is not covered by the agreement between the Data Controller and RDS, then RDS must refuse access to the synthetic data. The applicant is free to contact the Data Controller directly at this point.

Producing synthetic data

It is considered best practice to use synthetic data of the lowest fidelity possible for the purposes of code development and data discovery, and where practicable, only use real data where necessary in the synthetic data generation process. Therefore, RDS will aim to use a ‘data free’ (metadata approach) methodology when generating synthetic data and encourage partners to do the same.

To facilitate this, RDS has developed a reusable R-based pipeline for generating synthetic data. RDS will make this code freely available to support synthetic data generation in this manner. 

Alongside the synthetic data generated, documentation will be made available to describe it. At a minimum, this will include:

  • information pertaining to how the synthetic data was generated, including generation tool(s), and whether real data or metadata was used
  • how analysis using the synthetic data would be expected to differ from analysis using the real data
  • structural metadata for the synthetic data
  • ownership and contact details for queries and improvements
  • a Digital Object Identifier (DOI) if available, or guidance for users to provide citations when using the synthetic data

Quality standards and checking

Synthetic data should not be expected to preserve all the features of the real data, and therefore it is important to check it against real data to ascertain the aspects where the synthetic data matches or differs from the real data, especially in relation to missingness, inclusion of errors etc. In addition, checks for statistical disclosure control and any necessary mitigation measures must be carried out.

Therefore, all synthetic data hosted by RDS must be subjected to quality and statistical disclosure control checks to the following standards and signed off by the relevant Data Controller: 

Labelling

The filename, each variable name and data dictionary should contain the word ‘synthetic’.
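
The labelling standard above could be checked automatically along the following lines. This is a sketch in Python rather than RDS's own R tooling. The function name and inputs are hypothetical, and it assumes the data dictionary is available as plain text.

```python
def check_labelling(filename, variable_names, data_dictionary_text):
    """Return a list of labelling problems; an empty list means the check passes.

    Assumes the data dictionary is supplied as a single text string.
    """
    word = "synthetic"
    problems = []
    if word not in filename.lower():
        problems.append(f"filename lacks '{word}': {filename}")
    for name in variable_names:
        if word not in name.lower():
            problems.append(f"variable name lacks '{word}': {name}")
    if word not in data_dictionary_text.lower():
        problems.append(f"data dictionary lacks '{word}'")
    return problems


# Hypothetical example inputs.
problems = check_labelling(
    "synthetic_cohort.csv",
    ["synthetic_age", "synthetic_sex"],
    "Data dictionary for the synthetic cohort dataset",
)
```

Returning the full list of problems, rather than stopping at the first, lets every mislabelled variable be corrected in one pass.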

Disclosure

The synthetic data has undergone disclosure risk evaluation, and any risks identified have been mitigated.

Structure

The synthetic data should have a structure as close as possible to the real data, with similar patterns of missingness, range of values (numeric variables) and replication of levels (categorical variables), and comparable variable names.
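
A structure comparison of this kind could be sketched as follows, in Python rather than RDS's own R tooling. The column-oriented input format, the `synthetic_` name prefix and the 0.05 missingness tolerance are all assumptions made for illustration. Because it reads the real data, a check like this would need to run wherever access to the real data is permitted.

```python
def missing_rate(values):
    """Share of values that are missing (None); 0.0 for an empty column."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def compare_structure(real, synthetic, tolerance=0.05):
    """Compare two column-oriented tables (dict of column name -> list of values).

    Returns a list of human-readable differences. Synthetic names are expected
    to be the real name with a 'synthetic_' prefix (an assumed convention).
    """
    differences = []
    for name, real_values in real.items():
        synth_name = f"synthetic_{name}"
        if synth_name not in synthetic:
            differences.append(f"{synth_name}: missing from synthetic data")
            continue
        synth_values = synthetic[synth_name]

        # Patterns of missingness: compare the share of missing values.
        gap = abs(missing_rate(real_values) - missing_rate(synth_values))
        if gap > tolerance:
            differences.append(f"{synth_name}: missingness differs by {gap:.2f}")

        real_present = [v for v in real_values if v is not None]
        synth_present = [v for v in synth_values if v is not None]
        if not real_present or not synth_present:
            continue
        numeric = all(
            isinstance(v, (int, float)) for v in real_present + synth_present
        )
        if numeric:
            # Numeric variables: synthetic values should stay within the real range.
            if (min(synth_present) < min(real_present)
                    or max(synth_present) > max(real_present)):
                differences.append(f"{synth_name}: values fall outside the real range")
        elif set(synth_present) != set(real_present):
            # Categorical variables: the same set of levels should appear.
            differences.append(f"{synth_name}: levels differ")
    return differences


# Hypothetical toy tables.
real = {"age": [10, 20, None, 40], "sex": ["F", "M", "F", "M"]}
synthetic = {
    "synthetic_age": [15, None, 30, 40],
    "synthetic_sex": ["M", "F", "M", "F"],
}
differences = compare_structure(real, synthetic)
```

Any differences found would feed into the documentation required under the utility standard rather than automatically failing the dataset.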

Utility 

Any differences in the synthetic data compared to the real data are documented, particularly in reference to structure, size, relationships between variables and ways in which analyses of the synthetic data will be expected to differ from the real data.

In addition, checks that all other documentation is present are required.

Following these checks, RDS can be confident that the synthetic data conforms to the required quality standards and that there is no disclosure risk. Where these checks have been performed by the Data Controller or Trusted Third Party, they are required to provide written evidence that these checks have been satisfactorily completed, as per the terms of the agreement between RDS and the Data Controller.

The outcomes of these checks are stored as a ‘quality certificate’ within the synthetic data to allow users to understand the nature of the synthetic data.

If RDS becomes aware of any issues relating to the quality and disclosure control checks of a synthetic dataset, access to the synthetic dataset may be paused and existing users will be informed as appropriate. If required, RDS will report any data breaches in line with its Data Breach Policy.

Making synthetic data available

Low fidelity synthetic data made available by RDS has been checked to ensure that any disclosure risk is minimised. This means that the synthetic data can be made available outside of a trusted research environment without posing a potential privacy risk.

Synthetic data can, however, pose a risk to the public if used incorrectly, for example if the results of a study using synthetic data are published as though they were based on the real data. This risk is mitigated by only providing low fidelity synthetic data (which, by definition, has low analytic value), as it is highly unlikely that any analysis using it would make sense. Nevertheless, individuals looking to access synthetic data hosted by RDS must agree to the terms of an End User License Agreement, which governs behaviour when accessing the synthetic data.

Different synthetic datasets may have different terms of use, including any requirements specified by the Data Controller. The contents of these End User License Agreements will be drafted by RDS in collaboration with the respective Data Controller.

Following a request to access synthetic data, the request must be approved by either RDS or the Data Controller, depending on the agreement in place between RDS and the Data Controller. Requests requiring the approval of the Data Controller will be assessed first by RDS before being passed to the Data Controller.

When approval from the Data Controller is required, the criteria for approving or refusing requests are determined by the Data Controller. If the Data Controller consents, RDS will publish the criteria alongside the synthetic dataset to ensure there is full transparency for applicants. Similarly, RDS will publish the criteria it uses to assess applications for synthetic data that it approves itself.

Breach of End User License Agreement

If RDS suspects that a user may not intend to follow the terms of the End User License Agreement, even if they have agreed to the terms, RDS may contact the user to explore whether the basis for these suspicions can be addressed. If this is not possible, RDS reserves the right to refuse access to the synthetic data.

In the event RDS becomes aware of or suspects a user is breaking the terms of an End User License Agreement, RDS reserves the right to revoke access to the synthetic data and may contact the user if appropriate to discuss these concerns. Legal action may be taken against the user at the discretion of RDS and RDS may wish to revoke access to other services like the Researcher Access Service.

If access must be revoked from a user for any reason, RDS will make the Data Controller aware.

Users who are unhappy with a decision to refuse them access or who have had their access revoked can make a complaint via RDS’s Complaints Handling Policy.

Synthetic data access requests

RDS will endeavour to respond to all requests for synthetic data within 3 working days. However, requests may take longer depending on the nature of the request, staff availability, or other factors.

Once a request has been approved, a link to the synthetic data will be provided. The synthetic data is password protected; this password will also be provided via email. The user is permitted to download the synthetic data. The user is not permitted to share the link, the synthetic data or the password with others, except within a secure teaching environment.

Policy statement

Synthetic data made available by RDS must only be used for purposes that support public benefit research or teaching.

It should be expected that synthetic data will not accurately reflect all properties of the real data, though the structure will resemble the real data as closely as possible.

Synthetic data are Crown copyright works of the Data Controllers. If RDS were to cease to exist, synthetic datasets are to be returned to their original owner.

All synthetic data made available by RDS, whether the data is produced by RDS or not, must be subjected to quality and statistical disclosure control checks assessing structure, labelling, privacy and documentation, in line with the ‘Four Checks for Low Fidelity Synthetic Data’ as detailed by Raab G et al. Following completion of these checks, the synthetic data can be made available on the RDS metadata catalogue for researchers to request access.

The quality and statistical disclosure checks can be completed by RDS, by the producer of the synthetic data or by a competent third party, i.e. the original Data Controller. When the data is not quality and disclosure checked by RDS, the party performing the checks is free to use whichever tool or approach they deem appropriate; however, RDS can provide an R tool to assist with the quality and disclosure checks on request.

Synthetic data that is routinely hosted by RDS is low fidelity and therefore poses a minimal disclosure risk: it does not require a secure environment for it to be made available. However, to ensure that the synthetic data is used appropriately, users are requested to agree to the terms of an End User License Agreement before access to the synthetic data is granted. RDS retains the right to deny access to the synthetic data if it has reason to believe that the terms of the End User License Agreement will not be adhered to.

Interested in using synthetic data?

Explore our metadata catalogue for the most up to date availability on synthetic datasets or sign up to our engagement contact list to be the first to know about opportunities to shape our work.

Last Updated 11 Mar 2026
