RDS Data Strategy

Our Vision

Our vision is enabling people to systematically innovate with data in the public good to promote and advance health and social wellbeing, deliver precision public health, and reduce inequalities. 

So, the vision for data is to build access to a portfolio of data assets around people, places, and businesses that are held securely with public support and accessed at pace. Having this portfolio will reduce duplication of research effort by enabling existing data to be reused rather than collected anew.

We will enable researchers and analysts to bring new insights to problems. The starting place for a data strategy is the set of key structural issues where access to and linkage of data can improve societal wellbeing and reduce inequalities. This needs to start with our National Performance Framework, and will in particular cover issues where people use multiple services, are at key transitions in their lives, or where connecting data on people with businesses and places is vital. Key themes will be:

Equalities

Part of the Purpose from the National Performance Framework is “creating a more successful country with opportunities for all to flourish”. Many of our research programmes and official statistics report on the situation for Scotland as a whole, without exploring how this affects people from different backgrounds and in different circumstances. Data access and linkage should make it easier to build that insight on equality groups, and on intersectionality between those groups. Data on people’s circumstances may not be collected in every research project or set of statistics. We will link data across a number of key datasets to create a pseudonymised protected characteristics record that can then be securely linked to other data for analysis.

Vulnerable People & Families

A number of areas of policy and operations are cross-cutting. This is particularly true for many vulnerable people, people with experience of the care system, migrants, or those perpetrating or experiencing crime. Being able to bring together data around people and families can paint a richer picture of people’s lives, help us understand and predict future risks, and help services to support people on positive pathways.

Inclusive Growth

Scotland’s labour market strategy sets out a vision of a strong labour market that drives inclusive and sustainable economic growth through growing and competitive businesses, high employment, a skilled population, and fair work. Being able to understand how the labour market and business base are changing, and how these changes are affecting people from different backgrounds, is vital. This includes following cohorts through their working lives and being able to understand people’s family circumstances.

Health & Wellbeing

We are fortunate to have rich data about the health of the population, and use of healthcare services. However, much of this is held in different places and not connected. We have seen through Covid-19 the benefits of bringing this together, whether around assessing risk or vaccine efficacy. We also know that most of the determinants of health are social and economic. So, being able to connect between health and other data, build cohorts, and look at family and intergenerational circumstances means we can better understand the causes and solutions. It allows us to set goals and measure progress in more intelligent ways. 

Protecting & Enhancing the Environment

The Scottish Government has been clear in its commitment to securing a just and green recovery, which prioritises economic, social and environmental wellbeing, and responds to the twin challenges of the climate emergency and biodiversity loss. Data is held across many organisations, and typically not connected. Bringing some of this data together around places, people or businesses can help us understand more about how these interplay and therefore what works in moving towards our goals. 

Having a strategy for how we bring together data, allow it to connect around and between people, places and businesses, and add value will help us prioritise our efforts to support these themes. Over the last year we have brought together data from around 30 datasets to enable research around Covid-19. We need to take learning from this to widen the approach in trustworthy ways.

There will be linked service and infrastructure strategies that lay out what ideal researcher and data controller services would look like and the steps we are taking to achieve them. These will be built upon our systems that maintain the security of the data, including the use of safe haven environments and the use of NRS as a trusted third party in the de-identification process. They will also enable researchers to bring their own data to link to existing datasets available through RDS.
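As an illustration of how de-identification through a trusted third party can allow linkage without exposing identifiers, the sketch below replaces a direct identifier with a keyed hash before two datasets are joined. It is a minimal example under stated assumptions, not a description of the process used by RDS or NRS: the use of HMAC-SHA256, the field names, the project key and the example records are all hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a keyed hash of an identifier (HMAC-SHA256, hex encoded)."""
    # Normalise so trivial formatting differences do not break the link.
    normalised = identifier.strip().upper()
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(records, id_field, secret_key):
    """Replace the direct identifier in each record with a pseudonym."""
    output = []
    for record in records:
        cleaned = {k: v for k, v in record.items() if k != id_field}
        cleaned["pseudonym"] = pseudonymise(record[id_field], secret_key)
        output.append(cleaned)
    return output


def link(left, right):
    """Join two de-identified datasets on their shared pseudonym."""
    right_by_pseudonym = {r["pseudonym"]: r for r in right}
    return [
        {**l, **right_by_pseudonym[l["pseudonym"]]}
        for l in left
        if l["pseudonym"] in right_by_pseudonym
    ]


# Hypothetical key, held only by the trusted third party, and made-up records.
PROJECT_KEY = b"example-project-key"
health = [{"person_id": "A001", "admissions": 2}, {"person_id": "A002", "admissions": 0}]
education = [{"person_id": "a001 ", "qualification": "Higher"}]

linked = link(
    deidentify(health, "person_id", PROJECT_KEY),
    deidentify(education, "person_id", PROJECT_KEY),
)
```

In this sketch the researcher only ever sees the pseudonym and the analytical variables. Using a different key for each project would mean pseudonyms cannot be matched across projects.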

Principles for trustworthiness

First, without trust in how we handle and use data, we cannot operate. There are a few ways we will maintain that trustworthiness:

• The use of data needs to follow the FAIR principles of Findability, Accessibility, Interoperability, and Reuse of digital assets.
• The data assets need to be used, and priorities for this need to be driven by an understanding of likely use.
• We should be problem led, not data led, and should not bring together data that will not be used.
• We need to develop approaches that enable data on people (including families and households) to be linked to businesses and places.
• We should be pragmatic. For example, there may be some situations where the creation of standing linked datasets that are well used would be proportionate.
• We should reduce complexity wherever possible.
• People should be able to choose to be excluded from some or all pieces of research where there is no overwhelming case for including all relevant people.
• We will be transparent about the data that is being held, and importantly who is accessing which data sets for what purpose.
• We will develop clear, published policies at each stage of data processing, building upon what is in place already.

Understanding user needs

We will iterate our data strategy based on what we hear from researchers and data controllers who will be using the service. This means establishing routes for listening and communicating those priorities.

Core national datasets

To tackle any of these themes, there is a set of national data assets that it will be important to make available for research. These include:
• Vital events (including the outputs from the Digitising Scotland project)
• Population censuses
• A picture of the protected characteristics of the population
• A register of Scottish based/operating businesses
• Property ownership and transactions
• Land use
• Interaction with “universal” public services, in particular GPs and schools.
• Scotland wide household surveys

Under each theme, there is likely to be a group of datasets that are most important for research around that objective. An initial proposal for these is in the Annex.

We will turn this into a prioritised data acquisition plan, adjusting based upon conversations with users. 
In addition, organisations may want to pay Research Data Scotland to bring together data around particular themes for research in the public good. We will need to develop a policy for decision making on whether to take these opportunities.

Data federation

A number of other organisations are bringing data together to improve access and data linkage, and some include data about Scottish people, places and businesses. These include the Office for National Statistics, Ordnance Survey, Urban Big Data Centre and the Global Open Finance Centre of Excellence.
Rather than secure arrangements directly with all data providers, we should work to ensure researchers can search metadata from a set of federated data services, and can access and link data from these places.

Open data

While the focus for RDS is on case level data on people, places and businesses, there is a significant amount of open data that gives researchers context, for example the Scottish Index of Multiple Deprivation, or prescribing data by GP practice. These datasets and many more are available via websites like statistics.gov.scot, opendata.nhs.scot, spatialdata.gov.scot and SE Web.

We will work to make key open data available in secure settings in ways that can be brought together with case level data. We will also work to make the metadata for key open data searchable as part of the RDS data catalogue.

Adding value to datasets

To make data useable, researchers first need to be able to find what data is available for reuse. Having a searchable metadata catalogue is core to this. We need to agree an approach to how this is done, and build that into the data acquisition process. The development of this catalogue will need to happen iteratively.
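To make the idea of a searchable metadata catalogue concrete, the sketch below shows a very small in-memory catalogue with keyword search. It is illustrative only: the record structure, the field names and the example entries are assumptions, not the design of the RDS catalogue.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical minimal metadata record for one dataset."""
    title: str
    description: str
    keywords: list = field(default_factory=list)


def _words(text):
    """Split text into lower-case words so matching is on whole words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(catalogue, query):
    """Return records whose metadata contains every word in the query."""
    terms = _words(query)
    results = []
    for record in catalogue:
        text = " ".join([record.title, record.description, *record.keywords])
        if terms <= _words(text):
            results.append(record)
    return results


# Illustrative entries only; descriptions are placeholders, not real metadata.
catalogue = [
    DatasetRecord("Vital events", "Illustrative entry: births, deaths and marriages", ["people"]),
    DatasetRecord("Land use", "Illustrative entry: how land is used", ["places"]),
    DatasetRecord("Business register", "Illustrative entry: businesses operating in Scotland", ["businesses"]),
]

matches = search(catalogue, "land")
```

Matching on whole words rather than substrings is a deliberate choice here, so that a search for "land" does not also return every record that mentions "Scotland".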

Organising our data better opens up the opportunity to carry out research that would not otherwise have been possible. The building block of most of our sensitive datasets is data about people, but people live in family and household units, which shape people’s experiences. We know that there are many things that pass from one generation to the next which are key contributors in explaining people’s circumstances and outcomes. As part of our ADR-Scotland work, we are exploring how we establish these family and intergenerational links and turn this into a business as usual process.

Having linked data over time allows us to create cohorts that explore the experiences of businesses and people, particularly around significant transitions. We need to develop an approach to cohort creation, based upon user demand.

Businesses are staffed by people and they are sited in places across Scotland. Being able to link between people, places and businesses would allow us to understand more about how people live their lives, labour market and business dynamics, and many more issues beyond this. This relies on developing approaches to systematically maintaining those links. We will need to develop approaches that are affordable and proportionate, again based on user demand.
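As a simple illustration of what a maintained link between people and places makes possible, the sketch below groups pseudonymised person identifiers by a place identifier to derive households. It is a sketch under stated assumptions: the identifiers are made up, each person is assumed to have exactly one current address, and real linkage would also need to handle communal establishments and moves over time.

```python
from collections import defaultdict


def households_from_links(person_place_links):
    """Group person pseudonyms by place identifier."""
    households = defaultdict(list)
    for person, place in person_place_links:
        households[place].append(person)
    return dict(households)


def household_size_by_person(person_place_links):
    """Return the size of the household each person belongs to."""
    households = households_from_links(person_place_links)
    return {
        person: len(members)
        for members in households.values()
        for person in members
    }


# Hypothetical link table: (person pseudonym, place identifier).
links = [("p1", "u1"), ("p2", "u1"), ("p3", "u2")]

households = households_from_links(links)
sizes = household_size_by_person(links)
```

A derived variable such as household size can then be attached to person-level records for analysis without the researcher needing to see the address itself.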

Synthetic datasets mimic real data but are populated by fictional people or businesses. They allow researchers to test approaches and write code that analyses data while the necessary permissions are secured to use actual data. It should be possible to produce synthetic datasets as standard and make these freely available to researchers. We will investigate how to do this and make it business as usual.
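One very simple way to produce a synthetic dataset is sketched below: each column is sampled independently from the values observed in the real data. This is an assumption made for illustration, not the method RDS will adopt. It preserves the shape and column-level distributions of the data, which is enough for testing code, but it deliberately breaks relationships between columns, and rare values copied from real data could still need disclosure checks.

```python
import random


def synthesise(real_rows, n_rows, seed=0):
    """Build synthetic rows by sampling each column independently."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    columns = list(real_rows[0].keys())
    # Pool of observed values for each column.
    pools = {c: [row[c] for row in real_rows] for c in columns}
    return [{c: rng.choice(pools[c]) for c in columns} for _ in range(n_rows)]


# Made-up input rows standing in for a real dataset.
real = [
    {"age_band": "16-24", "employed": True},
    {"age_band": "25-49", "employed": True},
    {"age_band": "50-64", "employed": False},
]

synthetic = synthesise(real, n_rows=5, seed=42)
```

Because each row is assembled from independently sampled values, a synthetic row does not correspond to any one real person, although by chance it may coincide with a real combination of values.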

Approach

Short-term (2021)

We will 
• Focus on reducing time to access datasets
• Establish a user network that can advise on priorities, as part of the data and intelligence network
• Deliver pathway 1 arrangements for Information Governance agreed at the December 2020 Transition Board and, where appropriate, work through agreements with data owners for use of data beyond Covid-19 research
• Make data discoverable through a catalogue of searchable metadata
• Focus on delivery of data as part of ADR-Scotland (a focus on justice, children and environment) and the National Core Studies programmes (including GP data)
• Start driving forward approaches to federate regional and national data 
• Bring together data from key administrative and census datasets to create a pseudonymised protected characteristics record available for research
• Build on the CURL work to establish an ongoing link between places (UPRN) and people (CHI) that enables household level analysis.
• Pilot approaches to offering synthetic datasets to researchers

Medium-term (2022 and 2023)

We will 
• Move forward with data acquisition based upon user feedback
• Develop a communications approach with data owners and the public to present the benefits already accrued and the potential further benefits of making their data available.
• Further refine the process for data acquisition, in particular our offer to data owners to work with them, and support data quality improvements as part of the data preparation process. 
• Move to arrangements with data controllers around functional anonymisation.
• Establish approaches to building broader expertise in the use of individual and linked data assets.
• Develop arrangements for bringing together key open datasets alongside case level data 
• Work to establish ways to enable family and intergenerational research.
• Use the development of the ONS Integrated Data Programme to improve the range and interconnectivity of data on Scottish businesses, and connection between other key UK wide and Scottish datasets.

Long-term (2024 and beyond)

We will
• Establish research cohorts, building on existing longitudinal studies
• Make broader connections across Scotland, the UK and internationally to allow greater federation of data
• Continue to scan for international best practice in keeping data safe, acting in trustworthy ways, and speeding up access to data.

Roger Halliday, CEO RDS
March 2021

The table below suggests a starting point for priority datasets underpinning each key theme. This is based on conversations with many researchers and data controllers, and as part of the Administrative Data Research-Scotland and Health Data Research Scotland partnerships. Many of the datasets will be useful for more than one theme.

Table: Priority datasets underpinning each key theme