Case study: SteatoSITE
An integrated gene-to-patient Data Commons for MASLD Research
Overview
Metabolic dysfunction-associated steatotic liver disease (MASLD) affects 25% of adults and is the leading cause of liver disorders. However, only one-fifth of those affected progress to metabolic dysfunction-associated steatohepatitis (MASH), which can cause cirrhosis, liver cancer, and reduced life expectancy. Currently, no tools predict which MASLD patients will develop MASH or cirrhosis, and no licensed treatments exist.
Understanding why MASLD progresses in some patients but not others is crucial. This knowledge will enable development of diagnostic tools allowing the NHS to prioritise resources for high-risk patients. Comprehensive understanding of MASLD will also facilitate creating effective therapies, potentially individualised through precision medicine strategies for optimal patient outcomes.
The SteatoSITE project had two primary objectives:
- Create a MASH Data Commons to analyse and share data.
- Create and verify novel diagnostic methods and therapeutic interventions for individuals with MASH.
Who was involved?
Led by Precision Medicine Scotland, the SteatoSITE project worked with the Grampian Data Safe Haven (DaSH) and partners including The University of Edinburgh, NHS Scotland, Innovate UK, Glasgow Molecular Pathology Node, Edinburgh Genomics, Eagle Genomics and NHS Lothian.
The solution
A Data Commons combines data, storage, and computing infrastructure. It is a proven approach for examining and sharing information across research teams, healthcare professionals, patients, and advocacy groups.
The investigators compiled comprehensive MASLD cases across Scotland, collecting data on liver damage characteristics, gene expression changes as disease severity increases, and relationships with different health outcomes.
SteatoSITE combines tissue-derived metrics with clinical data including pathological evaluations, liver RNA sequencing, and electronic medical records. The system utilised samples from the NHS Biorepository network and data from 12 of 14 Scottish NHS boards through the Scottish Safe Haven Network, later migrating to Edinburgh Parallel Computing Centre (EPCC) Safe Haven Services.
This dataset supports AI research utilising multiple data types, exemplified by the Innovate UK Eureka-funded INTErPRET-NAFLD collaboration involving University of Edinburgh specialists alongside NHS AI deployment (Bering, UK), digital pathology (HistoIndex, Singapore), and data processing (BioDev, UK) organisations.
Additional healthcare data will transform the MASH Data Commons into an advanced knowledge system for discovering new disease management approaches, assisting in creating diagnostic methods and determining which patients benefit most from new therapeutic options.
“The INTErPRET-NAFLD collaboration demonstrates that multimodal human databases like SteatoSITE are the preeminent model for studying complex diseases. It showcases Scotland as an ideal place to undertake healthcare research using real-world data and I hope the approach provides a template for other researchers studying different conditions”
“This collaboration is a fantastic example of how we can drive innovation in healthcare research using the ‘triple helix’ of partners: the NHS, UK academia, and industry, benefitting from the unique healthcare ecosystem available at Edinburgh BioQuarter.”
Problems encountered
Working with three NHS Safe Havens required rigorous data cleaning to address duplicate records, human errors (incorrect input, typing mistakes, sample misidentification), standardisation challenges, varying delimiters/encoding, formatting inconsistencies, and missing information. Notably, certain data (e.g., FibroScan) was recorded inconsistently between Health Boards, hindering even manual extraction.
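As an illustration of the kind of cleaning described above, the following is a minimal sketch, not the project's actual pipeline. It assumes each Safe Haven supplies a delimited text export, and the column names (`patient_id`, `sample_id`, `fibroscan_kpa`) and both function names are hypothetical. It handles mixed encodings, mixed delimiters, inconsistent header casing, stray whitespace, missing-value markers, and duplicate records.

```python
import csv
import io

# Strings treated as "missing" after trimming and lower-casing (an assumption).
MISSING = {"", "na", "n/a", "null"}


def read_export(raw: bytes) -> list[dict]:
    """Parse one export, coping with mixed encodings and delimiters."""
    # Try UTF-8 first (utf-8-sig also strips a byte-order mark),
    # then fall back to Windows-1252.
    try:
        text = raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        text = raw.decode("cp1252")

    # Pick the candidate delimiter that occurs most often in the header line.
    lines = text.splitlines()
    header = lines[0] if lines else ""
    delimiter = max(",;\t|", key=header.count)

    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        clean = {}
        for key, value in row.items():
            if key is None:  # surplus fields beyond the header are ignored
                continue
            value = (value or "").strip()
            # Normalise header casing and map missing markers to None.
            clean[key.strip().lower()] = None if value.lower() in MISSING else value
        rows.append(clean)
    return rows


def merge_exports(exports, key_fields=("patient_id", "sample_id")):
    """Combine exports, keeping the first record seen for each key."""
    seen = set()
    merged = []
    for raw in exports:
        for row in read_export(raw):
            key = tuple(row.get(field) for field in key_fields)
            if key in seen:
                continue  # duplicate record across or within exports
            seen.add(key)
            merged.append(row)
    return merged
```

Keeping the first record per key is a simplification. In practice, conflicting duplicates would likely need manual review rather than silent removal, and inconsistently recorded fields such as FibroScan would still need manual handling.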
Selecting cases from secondary care on the basis of available tissue creates inherent spectrum bias. This enriches the cohort for clinical outcomes, but it makes SteatoSITE less useful for modelling population-wide MASLD progression.
Patient characteristics (elevated age, BMI), increased comorbidities, and baseline disease severity likely explain higher outcome frequency versus other MASLD datasets. Limited ethnic diversity (predominantly white Scottish) warrants caution regarding generalisability.
Confounding from undetected alcohol consumption remains challenging. Researchers manually examined clinical histories and diagnostic reports to eliminate alcohol-related liver disease cases.
However, alcohol consumption data and drinking patterns are constrained by this retrospective real-world design and absence of comprehensive standardised alcohol documentation in historical electronic health records.
Key achievements
- Integrated tissue-based measurements with clinical information including pathological evaluations, liver RNA sequencing, and standard healthcare data from electronic medical records.
- Facilitated artificial intelligence research utilising multiple data types to create meaningful tools for global public health applications.
- Established a robust positive correlation between histological fibrosis staging and subsequent clinical outcomes, including mortality from all causes.
- Conducted thorough bioinformatics examination of hepatic bulk RNA-seq data, revealing essential genes and enriched genetic functions/pathways that characterise the disease.
Image credit: nadezdagorosko. Source: Freepik