DECOVID: A UK Two-Center Harmonized Database of Acute Care Electronic Health Records for COVID-19 Research

DECOVID Consortium; Aslett, Louis J. M.; Avramescu, Andreea; Bakewell, Nicholas; Birds, Isabel; Bowler, Louise; Camilleri, Michael P. J.; Chung, Sheng-Chia; Clifton, David A.; Cohen, Samuel N.; Constantine-Cooke, Nathan; Daub, Eric G.; Davidson, Shaun; Denaxas, Spiros; Diaz-Ordaz, Karla; Feltbower, Richard; Gallier, Suzy; Gardiner, Stephen; Gasperoni, Francesca; Goudie, Robert J. B.; Green, Rebecca E.; Hall, Marlous; Holmes, Chris; Hurst, John R.; Iles, Mark M.; Jorge, Joao; Karoune, Emma; Keogh, Ruth; King, Ruairidh; King, Ruth; Kirk, Paul D. W.; Klapaukh, Roman; Kouchaki, Samaneh; Lai, Alvina G.; Lea, Nathan; Leyrat, Clemence; Li, Kezhi; Lilaonitkul, Watjana; Lu, Huiqi Y.; Lyons, Terry; Mallon, Ann Marie; Manderson, Andrew; Margaritella, Nicolò; Matteson, Joshua; Morley, Sam; Nicholls, Hannah; O’Reilly, Martin; Pagel, Christina; Palmer, Edward; Roberts, Jack; Roberts, Timothy J.; Robertson, David S.; Robinson, James; Rockenschaub, Patrick; Ruddle, Roy; Sapey, Elizabeth; Santos, Luis; Soltan, Andrew A. S.; Gao Smith, Fang; Starr, Colin; Strickson, Oliver; Su, Li; Tackney, Mia S.; Thygesen, Johan H.; Torralbo, Ana; Turner, Alice; Vallejos, Catalina A.; Wang, Chenyang; Whitaker, Kirstie; Whitehouse, Tony; Westhead, David R.; Wong, Wai Keong; Wu, Yue; Yang, Lingyi; Zou, Xiaoxu

doi:10.3390/data10120195

Open Access Data Descriptor

by DECOVID Consortium; Louis J. M. Aslett ¹; Andreea Avramescu ²; Nicholas Bakewell ³; Isabel Birds ⁴,⁵; Louise Bowler ²; Michael P. J. Camilleri ⁶,⁷; Sheng-Chia Chung ⁸; David A. Clifton ⁹; Samuel N. Cohen ¹⁰; Nathan Constantine-Cooke ¹¹; Eric G. Daub ²; Shaun Davidson ¹²; Spiros Denaxas ¹³; Karla Diaz-Ordaz ¹⁴; Richard Feltbower ¹⁵,¹⁶; Suzy Gallier ¹⁷,¹⁸; Stephen Gardiner ¹⁹; Francesca Gasperoni ³; Robert J. B. Goudie ³,*; Rebecca E. Green ²; Marlous Hall ¹⁶,²⁰; Chris Holmes ²¹; John R. Hurst ²²; Mark M. Iles ¹⁶,²³; Joao Jorge ¹²,²⁴; Emma Karoune ²; Ruth Keogh ²⁵; Ruairidh King ²; Ruth King ²⁶; Paul D. W. Kirk ³,²⁷; Roman Klapaukh ²⁸,²⁹; Samaneh Kouchaki ¹²,³⁰; Alvina G. Lai ¹³; Nathan Lea ¹³; Clemence Leyrat ²⁵; Kezhi Li ¹³; Watjana Lilaonitkul ³¹; Huiqi Y. Lu ⁹; Terry Lyons ¹⁰; Ann Marie Mallon ²; Andrew Manderson ³; Nicolò Margaritella ³²; Joshua Matteson ¹³; Sam Morley ¹⁰; Hannah Nicholls ²; Martin O’Reilly ²; Christina Pagel ³³; Edward Palmer ²⁹,³⁴,³⁵; Jack Roberts ²; Timothy J. Roberts ¹³,²⁹; David S. Robertson ³; James Robinson ²; Patrick Rockenschaub ¹³; Roy Ruddle ¹⁶,³⁶; Elizabeth Sapey ¹⁷,¹⁸; Luis Santos ²,³⁷; Andrew A. S. Soltan ³⁸,³⁹; Fang Gao Smith ¹⁷,¹⁸; Colin Starr ³; Oliver Strickson ²; Li Su ³; Mia S. Tackney ³; Johan H. Thygesen ¹³; Ana Torralbo ¹³; Alice Turner ¹⁷,¹⁸; Catalina A. Vallejos ¹¹; Chenyang Wang ⁹; Kirstie Whitaker ²,⁴⁰; Tony Whitehouse ¹⁷,¹⁸; David R. Westhead ⁴; Wai Keong Wong ⁴¹; Yue Wu ⁴²; Lingyi Yang ¹⁰ and Xiaoxu Zou ¹⁷

¹ Department of Mathematical Sciences, Durham University, Durham DH1 3LE, UK
² The Alan Turing Institute, London NW1 2DB, UK
³ MRC Biostatistics Unit, University of Cambridge, Cambridge CB2 0SR, UK
⁴ School of Molecular and Cellular Biology, University of Leeds, Leeds LS2 9JT, UK
⁵ LeedsOmics, University of Leeds, Leeds LS2 9JT, UK
⁶ School of Informatics, University of Edinburgh, Edinburgh EH4 2XU, UK
⁷ School of Science and Engineering, University of Dundee, Dundee DD1 4HN, UK
⁸ Institute of Cardiovascular Science, University College London, London NW1 2DA, UK
⁹ Department of Engineering Science, University of Oxford, Oxford OX3 7DQ, UK
¹⁰ Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK
¹¹ Institute of Genetics and Cancer, University of Edinburgh, Edinburgh EH4 2XU, UK
¹² Institute of Biomedical Engineering, University of Oxford, Oxford OX3 7DQ, UK
¹³ Institute of Health Informatics, University College London, London NW1 2DA, UK
¹⁴ Department of Statistical Science, University College London, London WC1E 6BT, UK
¹⁵ Child Health Outcomes Research at Leeds (CHORAL), School of Medicine, University of Leeds, Leeds LS2 9LU, UK
¹⁶ Leeds Institute for Data Analytics, University of Leeds, Leeds LS2 9NL, UK
¹⁷ University Hospitals Birmingham NHS Foundation Trust, Birmingham B15 2GW, UK
¹⁸ Department of Inflammation and Ageing, School of Inflammation, Infection and Immunity, College of Medicine and Health, University of Birmingham, Birmingham B15 2WB, UK
¹⁹ Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, Nuffield Department of Medicine, University of Oxford, Oxford OX3 7LF, UK
²⁰ Leeds Institute of Cardiovascular and Metabolic Medicine, University of Leeds, Leeds LS2 9JT, UK
²¹ Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
²² UCL Respiratory, University College London, London WC1E 6JF, UK
²³ NIHR Leeds Biomedical Research Centre, Leeds Teaching Hospitals NHS Trust, Leeds LS7 4SA, UK
²⁴ NIHR Biomedical Research Centre, Oxford OX3 9DU, UK
²⁵ Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London WC1E 7HT, UK
²⁶ School of Mathematics and Maxwell Institute for Mathematical Sciences, University of Edinburgh, Edinburgh EH9 3FD, UK
²⁷ Cambridge Institute of Therapeutic Immunology & Infectious Disease (CITIID), University of Cambridge, Cambridge CB2 0AW, UK
²⁸ Research Software Development Group, University College London, London WC1E 6BT, UK
²⁹ University College London Hospital, London NW1 2BU, UK
³⁰ School of Computer Science and Electronic Engineering, University of Surrey, Guildford GU2 7XH, UK
³¹ Global Business School for Health, University College London, London E20 2AE, UK
³² School of Mathematics and Statistics, University of St Andrews, St Andrews KY16 9SS, UK
³³ Clinical Operational Research Unit, University College London, London WC1H 0BT, UK
³⁴ Bloomsbury Institute of Intensive Care Medicine, University College London, London WC1E 6BT, UK
³⁵ Whittington Hospital, London N19 5NF, UK
³⁶ School of Computer Science, University of Leeds, Leeds LS2 9JT, UK
³⁷ Medical Research Council Harwell Institute (Mammalian Genetics Unit and Mary Lyon Center), Harwell OX11 0RD, UK
³⁸ Department of Oncology, University of Oxford, Oxford OX3 7LE, UK
³⁹ Oxford University Hospitals NHS Foundation Trust, Oxford OX3 9DU, UK
⁴⁰ Berkeley Institute for Data Science, University of California at Berkeley, Berkeley, CA 94720-1234, USA
⁴¹ Cambridge University Hospitals, Cambridge CB2 0QQ, UK
⁴² Department of Mathematics and Statistics, University of Strathclyde, Glasgow G1 1XH, UK

* Author to whom correspondence should be addressed.

Data 2025, 10(12), 195; https://doi.org/10.3390/data10120195

Submission received: 17 September 2025 / Revised: 1 November 2025 / Accepted: 4 November 2025 / Published: 24 November 2025

Abstract

The DECOVID database contains harmonized pseudonymized electronic health record (EHR) data on all adult (≥18 years old) patients presenting to two large, digitally mature centers in the United Kingdom between 1 January 2020 and 28 February 2021, with follow-up until at least 28 March 2021. The database was originally developed to support the COVID-19 response but is now available via the PIONEER data hub for researchers to explore a wide range of research questions, including exploratory analyses, risk factor assessment, prediction modeling, and comparative effectiveness studies. Raw data were extracted from local EHRs and transformed into a standardized form (Observational Health Data Sciences and Informatics-Common Data Model version 5.3.1). The database includes 165,420 patients across 256,804 hospital presentations. For these patients, highly granular data are available, including patient demographics, longitudinal vital signs, physiology, treatments, laboratory findings, clinical diagnoses, and outcomes. There are 10,030 patients with COVID-19, of whom 1472 died in hospital.

Keywords: hospital data; electronic health record; COVID-19

1. Summary

DECOVID, a multi-center research consortium, was founded in March 2020 by two United Kingdom (UK) National Health Service (NHS) Foundation Trusts [1] (comprising three acute care hospitals) and three research institutes/universities: University Hospitals Birmingham (UHB), University College London Hospitals (UCLH), University of Birmingham, University College London, and The Alan Turing Institute. The original aim of DECOVID was to share harmonized electronic health record (EHR) data from UCLH and UHB to enable researchers affiliated with the DECOVID consortium to answer clinical questions to support the COVID-19 response [2]. The DECOVID database has now been placed within the infrastructure of PIONEER, a Health Data Research (HDR) UK-funded data hub that contains data from acute care providers, to make the DECOVID database accessible to external researchers not affiliated with the DECOVID consortium [3].

The raw EHR data from the two hospital Trusts are stored in disparate databases and are not expressed in a single standard format (i.e., they are unharmonized). Analysis of unharmonized EHR data is difficult because researchers often have limited or no knowledge of the particularities of each hospital’s EHR system and clinical workflows, which often underlie important quirks in the raw data. Requiring each researcher to harmonize data separately duplicates effort unnecessarily and can easily lead to basic but significant errors in analyses, for example, through misconstruing the labels of clinical items in their raw format.

To address these problems, we performed the laborious task of harmonizing the raw EHR data into a single standard format at source before transferring the data into a centralized database. This required several rounds of iterative feedback between clinicians, clinical informaticians, and analysts to ensure harmonization was performed consistently across hospital Trusts. The harmonized data are represented using the internationally recognized Observational Health Data Sciences and Informatics-Common Data Model (OHDSI-CDM) version 5.3.1, commonly referred to as the Observational Medical Outcomes Partnership (OMOP)-CDM [4]. The OMOP-CDM is a mature data model that facilitates observational and comparative effectiveness research across organizations. Utilizing the OMOP-CDM has been found to improve data quality, increase efficiency, and facilitate inter-database comparisons to support a more systematic approach to observational research [5,6].

We make available the final database from both hospital Trusts in a single centralized database. Despite the substantial upfront costs and technological infrastructure required to ensure data security when transferring data into a centralized database, for analysts, this approach is more efficient and convenient than the federated model often adopted for EHR data, since access only needs to be granted by a single data controller. This approach also facilitates iterative data science approaches and allows for tight feedback to the host centers to drive data quality improvement, which can be challenging in federated infrastructure, and avoids the approximations to complex models often required by federated analyses [7]. It also allows any data anonymization to be applied in a standardized way, if required. The two-center nature of the centralized database also has important benefits over disparate single-center databases when conducting statistical analyses. In particular, the resulting larger sample size increases statistical precision and power, and the availability of data from two centers may broaden the generalizability of results.

The availability of highly granular secondary care data in the DECOVID database, harmonized into the OMOP-CDM, covering all adults (≥18 years old) presenting to two UK centers (between 1 January 2020 and 28 February 2021), represents a step forward for UK health data research. While summaries of secondary care visits are routinely collated nationally in England by NHS England (i.e., Hospital Episode Statistics (HES) [8]), until the NHS Research Secure Data Environment Network becomes available, more detailed data have been largely siloed at local hospital Trusts in the UK in unstandardized data formats, or have been explicitly collated only for specific research studies [9,10,11], or for specific patient groups, such as critical care patients [12]. Making such data easily available according to the Findable, Accessible, Interoperable, and Reusable (FAIR) principles [13] enables a broad range of researchers to use these data for research.

Other UK harmonized secondary care data sources contain data for a greater number of UK centers, but with much less detail than DECOVID [8,14]. The DECOVID database also complements international data sources such as the Consortium for Clinical Characterization of COVID-19 by EHR (4CE), the eICU Collaborative Research Database, and the National COVID Cohort Collaborative (N3C) [15,16,17]. In particular, the DECOVID database contains longitudinal records of all vital signs recorded in the EHR, along with relevant treatments, laboratory findings, clinical diagnoses, and outcomes. This granularity and temporal resolution enable detailed research questions to be explored using the DECOVID database that would be infeasible in coarser national datasets.

Data Visualizations

We present high-level visual summaries of the data. Figure 1 and Figure 2 show selected patient demographics, and Figure 3 shows chronic medical conditions by hospital Trust and COVID-19 status. Figure 4 shows a weekly time series, by hospital Trust, of incident COVID-19 cases, defined as patients who tested positive for SARS-CoV-2 or had a suspected or confirmed clinical diagnosis during or within 14 days prior to presentation.

Figure 5 exemplifies the granularity of data in the DECOVID database and depicts data on several selected measurements/procedures during a patient stay. The data presented in Figure 5 show that this patient’s health status deteriorated over the visit, requiring admission into “level 2” (high dependency unit (HDU) [19]) care, but then improved, leading to the patient being discharged after 11 days and 10 h. Additional data summaries are presented in Table 1 of Section 3.

2. Methods

The DECOVID database originally received favorable ethical approval for research use by the UK Health Research Authority (REC reference 20/HRA/1689) [20], while ongoing use of the data falls under the remit of PIONEER (REC reference 20/EM/0158).

2.1. Cohort

The DECOVID database brings together existing EHR data from two large UK hospital Trusts, which are organizations of healthcare providers that together provide hospital, community, and other health services to patients [1]. All adult (≥18 years old) patients presenting (i.e., admitted to or visiting the emergency department, excluding outpatients) to any of the three hospitals across the two hospital Trusts between 1 January 2020 and 28 February 2021 are included, with follow-up running until 28 March 2021 for UHB and 13 April 2021 for UCLH. Patients who opted out via the UK’s national data opt-out (NDOO) policy are excluded.

The UCLH cohort includes patients presenting to any hospital within the hospital Trust, which has a catchment population of approximately 600,000 people [21]. The acute centers within the hospital Trust are University College Hospital, a 665-bed teaching hospital with 44 critical care beds (surge capacity of roughly 100 critical care beds), and University College Hospital at Westmoreland Street, a 95-bed specialist thoracic hospital.

The UHB cohort includes patients presenting to the Queen Elizabeth Hospital Birmingham, a 1215-bed tertiary care teaching hospital, with a 100-bed critical care unit (an additional roughly 70 surge capacity critical care beds were used during the pandemic [22]) and a catchment population of approximately 1.2 million people [21]. Data from the other hospitals (Good Hope, Heartlands, and Solihull) within UHB are not included.

2.2. Data Pipeline

The two hospital Trusts use entirely separate and different EHR systems. UCLH’s EHR data are stored in and transferred from an Epic EHR system [23], which was adopted by the hospital Trust in 2019. UHB’s EHR data are stored in and transferred from the Prescribing Information and Communications System (Birmingham Systems), a bespoke EHR system developed at UHB over the last 20 years [24].

Fortunately, the OMOP-CDM can flexibly accommodate the harmonization of different EHR systems and data domains (e.g., demographics, healthcare encounters, and physiological measurements), and has been used in practice for over a decade [4,5,6,25]. A key advantage of the OMOP-CDM is its schema representation of multiple clinical ontologies (e.g., LOINC [26], SNOMED-CT [27], READ [28], ICD-10 [29]) with internal mappings, hierarchies, and relationships between ontologies preserved.

The data pipeline is shown in Figure 6. Each hospital converted its local EHR data into the OMOP-CDM format. The following were removed at source within each hospital’s EHR system(s) in conformity with the UK’s NHS National Data Guardian’s Data Security Standards: any local database identifiers; clinically sensitive fields, including the exact date of birth and sexual health-related diagnoses (excluded diagnoses are listed on the GitHub repository [30,31]); and patients who opted out via the UK’s NDOO policy (including “people of interest”, such as politicians and celebrities). Each pseudonymized cohort was uploaded via an encrypted route (Secure File Transfer Protocol) to the DECOVID Data Safe Haven (DSH), a secure research environment, for initial analysis by the DECOVID consortium, and ultimately made available for future analysis by the wider research community via PIONEER.

2.3. Local Data Extraction Processes and Principles

Data within each participating hospital were transferred into the OMOP-CDM format using a common data specification [25]. The Extract, Transform, Load (ETL) processes were validated and verified as detailed in Section 3.7 below.

The overarching principle guiding the ETL processes was to preserve the raw source EHR data as closely as possible. As with all EHR data, the degree to which the raw data truly reflect the underlying biological processes or clinical events will vary for numerous reasons. For example, EHR data are sourced in various ways: some data are automatically extracted from medical devices (e.g., drug infusions, ventilator settings) and then verified by nursing staff at the bedspace, whereas other data are hand entered (e.g., height and weight) or obtained from automatic standard lab panels (e.g., hemoglobin levels, white blood cell count). Given our overarching principle, the ETL process simply extracted all relevant information in the EHR. For example, laboratory test results were included regardless of whether they were in a plausible range. However, as described in Section 3.7 below, any identified non-structural “impossible” values were labeled as such. The advantage of this approach is that it allows researchers to make research study-specific judgments about which data to rely upon, reducing biases that may have otherwise occurred if data were modified during the ETL process [32].
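The principle of labeling rather than removing implausible values can be illustrated with a minimal Python sketch. This is not the consortium's ETL code: the `Measurement` class, the `flagged_impossible` field, and the plausibility range are all hypothetical and chosen only to show that every extracted value is retained.

```python
from dataclasses import dataclass

# Hypothetical plausibility limits for illustration only; these are NOT
# the checks actually applied to the DECOVID database.
PLAUSIBLE_RANGES = {"heart_rate": (0.0, 350.0)}


@dataclass
class Measurement:
    name: str
    value: float
    flagged_impossible: bool = False


def flag_impossible(measurements):
    """Return every measurement, labeling (not removing) out-of-range values."""
    labeled = []
    for m in measurements:
        # Measurements without a configured range are never flagged.
        low, high = PLAUSIBLE_RANGES.get(m.name, (float("-inf"), float("inf")))
        labeled.append(Measurement(m.name, m.value, not (low <= m.value <= high)))
    return labeled


raw = [Measurement("heart_rate", 72.0), Measurement("heart_rate", 7200.0)]
labeled = flag_impossible(raw)
```

Both input values survive in `labeled`; only the second carries the flag, so a downstream analyst can decide per study whether to exclude it.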

Most data items are accompanied by datetime stamps reflecting the time of clinical events. These were extracted in raw form: no coarsening or time-binning was applied. Both the recorded specimen collection time and the time when the laboratory result became visible to a clinician were extracted.

Some simplification of the raw data was required to harmonize the data into a standardized representation. For example, patient movements within a hospital are mapped to a standardized representation of clinically meaningful locations so that “ward 7A” might be translated to “acute adult inpatient ward”. This permits insights into progressions of patient care, without detailing identifiable or site-specific information that would not be meaningful outside the source hospital.
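As a rough sketch of this kind of location harmonization, the Python below maps site-specific ward labels to standardized categories. Only the “ward 7A” example comes from the text above; the other labels, the fallback category, and the function name are assumptions made for illustration.

```python
# Hypothetical lookup from local ward labels to standardized locations.
# Only the "ward 7A" entry is taken from the text; the rest are invented.
WARD_TO_STANDARD = {
    "ward 7A": "acute adult inpatient ward",
    "ward 3C": "acute adult inpatient ward",
    "ICU east": "intensive care unit",
}


def standardize_location(local_label):
    """Map a site-specific ward label to a standardized care location."""
    return WARD_TO_STANDARD.get(local_label, "unmapped location")


# A patient's sequence of ward movements, expressed in local labels.
movements = ["ward 7A", "ICU east", "ward 3C"]
standardized = [standardize_location(label) for label in movements]
```

The standardized sequence still shows the escalation to and return from intensive care, while the site-specific ward names are no longer exposed.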

3. Dataset Description

The data records within the DECOVID database originate from EHR systems of participating hospitals before being transferred into the OMOP-CDM. The clinical data items that make up the data records of the DECOVID database were chosen based on iterative feedback meetings involving clinicians, clinical informaticians, and analysts. While we aimed at satisfying the OMOP-CDM wherever possible, a number of changes were required to match the data available in UK EHR systems. The key changes are described in this section, in addition to summaries of the data and an in-depth description of the data tables and records included in the DECOVID database.

3.1. Data Summaries

The DECOVID database contains 256,804 hospital visits from 165,420 adult (≥18 years old) patients presenting (i.e., admitted or visiting the emergency department, excluding outpatients) at either UCLH or UHB between 1 January 2020 and 28 February 2021 (see Section 2 Methods for details of the cohort). In total, the dataset covers 16.7 million hours of clinical care; 108 million measured clinical observations encompassing vital signs, acute physiology, and laboratory findings; 2.64 million clinical diagnoses relating to both acute and chronic health conditions; and 15.19 million drug administration events.

Table 1 contains summary statistics for the cohort and details of the frequency of selected clinical features in the database, by hospital Trust and COVID-19 status. Of all patients, 48.6% (80,402) were male and 56.4% (93,261) were of white ethnicity (68.5% of those with ethnicity recorded). The median age at presentation across patient visits was 48 (interquartile range (IQR) 32–66) years. In summary, the DECOVID database covers city-center hospitals whose patients are more ethnically diverse and younger than patients admitted to hospitals in England overall (86% of white ethnicity among all admissions with ethnicity recorded and median age between 60 and 64 years) [33].

Of all hospital visits, 10,767 (4.2%) were by patients who tested positive for SARS-CoV-2 by polymerase chain reaction (PCR) or who had a suspected or confirmed clinical diagnosis but no positive diagnostic test during a visit or within 14 days prior to presentation. Of these, 1699 (15.8%) visits involved admission to “level 2” (HDU) and/or “level 3” (intensive care unit (ICU) [19]) care, and the in-hospital mortality was 13.7% (1472).
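The percentages in this paragraph can be reproduced from the COVID-19 columns of Table 1. The short Python check below is only a worked example of that arithmetic; the variable names are ours.

```python
# Counts copied from the COVID-19 columns of Table 1 (UCLH + UHB).
covid_visits = 2783 + 7984    # hospital visits
hdu_icu_visits = 857 + 842    # HDU/ICU (level 2/3 care)
deaths = 428 + 1044           # outcome "Died"
total_visits = 256_804        # all visits in the database

pct_covid = round(100 * covid_visits / total_visits, 1)
pct_hdu_icu = round(100 * hdu_icu_visits / covid_visits, 1)
pct_died = round(100 * deaths / covid_visits, 1)

print(covid_visits, pct_covid)      # 10767 4.2
print(hdu_icu_visits, pct_hdu_icu)  # 1699 15.8
print(deaths, pct_died)             # 1472 13.7
```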

Table 1. Database cohort demographics, and measurement frequencies for selected demographic and clinical features by hospital Trust and COVID-19 status.

	UCLH		UHB
Characteristic	COVID-19 ¹	Non-COVID-19	COVID-19 ¹	Non-COVID-19
Patient-level Summaries ²
Patients, n	2483	75,153	7547	80,237
Number of visits per patient, median [IQR]	1 [1, 2]	1 [1, 2]	1 [1, 2]	1 [1, 2]
Sex at birth, n (%)
Female	1107 (44.6)	40,309 (53.6)	3613 (47.9)	39,952 (49.8)
Male	1376 (55.4)	34,809 (46.3)	3934 (52.1)	40,283 (50.2)
Not recorded or withheld ³	0 (0.0)	35 (<0.1)	0 (0.0)	<10 (<0.1)
Ethnicity ⁴, n (%)
Asian or Asian British	332 (13.4)	6610 (8.8)	1295 (17.2)	10,243 (12.8)
Black or African or Caribbean or Black British	269 (10.8)	6256 (8.3)	326 (4.3)	3284 (4.1)
Mixed or Multiple ethnic groups	47 (1.9)	1593 (2.1)	111 (1.5)	1518 (1.9)
White	1153 (46.4)	37,339 (49.7)	4420 (58.6)	50,349 (62.8)
Other	259 (10.4)	8091 (10.8)	216 (2.9)	2426 (3.0)
Not recorded or withheld ⁴	423 (17.0)	15,264 (20.3)	1179 (15.6)	12,417 (15.5)
Visit-level Summaries
Hospital visits, n	2783	115,478	7984	130,559
Age at presentation ⁵, years, median [IQR]	59 [45, 74]	40 [30, 59]	63 [48, 78]	53 [35, 70]
Presenting from, visits, n (%)
Home	1155 (41.5)	44,250 (38.3)	5741 (71.9)	54,183 (41.5)
Transferred as inpatient	276 (9.9)	2059 (1.8)	298 (3.7)	2890 (2.2)
Other settings ⁶	<10 (<0.1)	53 (0.0)	207 (2.6)	697 (0.5)
Not recorded or non-standard ⁷	1348 (48.4)	69,116 (59.9)	1738 (21.8)	72,789 (55.8)
Body Mass Index at presentation, kg/m², median [IQR]	27.7 [23.0, 31.2]	25.6 [22.2, 29.8]	27.6 [23.7, 32.5]	26.6 [23.1, 31.1]
Length of stay, days, median [IQR]	4 [0, 12]	0 [0, 1]	3 [0, 10]	0 [0, 1]
Types/levels of care ever received during visit ⁸, visits, n (%)
Emergency Department	2251 (80.9)	82,466 (71.4)	6954 (87.1)	94,190 (72.1)
Inpatient ward (level 1)	1926 (69.2)	45,402 (39.3)	6143 (76.9)	60,003 (46.0)
HDU/ICU (level 2/3 care)	857 (30.8)	4961 (4.3)	842 (10.5)	11,659 (8.9)
Theaters ⁹	203 (7.3)	13,627 (11.8)	<10 (<0.1)	161 (0.1)
For selected clinical observations, median number of measurements per patient-day ¹⁰ [% of patient-days with no measurements] ¹¹
Inpatient ward (level 1)
Heart rate	7.3 [5.2]	7.0 [8.7]	5.5 [6.7]	5.5 [6.1]
Temperature	7.2 [2.8]	6.7 [8.7]	5.5 [6.7]	5.5 [6.1]
Respiratory rate	7.4 [2.7]	7.0 [8.5]	5.5 [6.5]	5.5 [6.0]
Oxygen saturation	7.7 [2.8]	7.0 [8.7]	5.6 [6.6]	5.5 [6.1]
Glasgow Coma Score Total Sum Score [34]	0.1 [65.1]	0.0 [66.8]	0.0 [91.8]	0.0 [86.1]
NEWS2 [35]	7.0 [8.0]	6.0 [14.4]	5.4 [8.2]	5.4 [6.7]
HDU/ICU (Level 2/3)
Heart rate	24.3 [3.1]	23.0 [4.5]	22.0 [2.3]	18.0 [4.1]
Temperature	11.8 [3.4]	9.3 [4.8]	9.4 [2.8]	8.2 [4.6]
Respiratory rate	23.3 [3.1]	21.0 [4.4]	4.7 [67.9]	10.7 [23.4]
Oxygen saturation	24.9 [3.1]	22.7 [4.5]	22.5 [2.3]	18.0 [4.1]
Glasgow Coma Score Total Sum Score [34]	5.0 [16.4]	5.7 [19.1]	3.3 [46.2]	2.4 [45.4]
NEWS2 [35]	0.1 [75.9]	0.2 [67.9]	0.0 [96.1]	0.0 [62.6]
For selected laboratory/point-of-care tests, median number of measurements per patient-day ¹⁰ [% of patient-days with no measurement] ¹¹
Inpatient ward (level 1)
Full blood count	0.7 [50.7]	0.7 [47.4]	0.5 [57.3]	0.6 [52.8]
Electrolytes (Creatinine, Sodium, Potassium)	0.7 [49.5]	0.5 [47.8]	0.5 [53.1]	0.6 [49.8]
C-reactive protein	0.5 [60.0]	0.1 [71.3]	0.5 [61.7]	0.5 [59.9]
HDU/ICU (Level 2/3)
Full blood count	1.0 [17.2]	1.0 [27.5]	1.1 [14.3]	1.1 [33.4]
Electrolytes (Creatinine, Sodium, Potassium)	1.0 [16.2]	1.1 [25.9]	1.1 [13.3]	1.1 [32.7]
C-reactive protein	1.0 [19.1]	1.0 [33.4]	1.0 [24.2]	1.0 [42.3]
Arterial blood gas	5.0 [30.5]	0.0 [57.1]	7.8 [11.6]	2.0 [47.7]
For selected medications, number of visits with at least one drug exposure record (% of total number of visits)
Dexamethasone [36]	876 (31.5)	4175 (3.6)	2009 (25.2)	3734 (2.9)
Tocilizumab [37]	53 (1.9)	59 (0.1)	11 (0.1)	1 (0.0)
Maximum respiratory support received during visit (% of total visits)
No respiratory support	1187 (42.7)	98,854 (85.6)	3777 (47.3)	112,000 (85.8)
Supplementary oxygen	559 (20.1)	4424 (3.8)	3100 (38.8)	15,244 (11.7)
High flow nasal oxygen	346 (12.4)	1557 (1.3)	297 (3.7)	960 (0.7)
Non-invasive ventilation	351 (12.6)	8966 (7.8)	299 (3.7)	801 (0.6)
Invasive ventilation	340 (12.2)	1677 (1.5)	511 (6.4)	1554 (1.2)
Clinical diagnoses, number of condition records per visit, median [IQR]
Past medical history ¹²	2 [1, 3]	1 [1, 3]	NA ¹³	NA ¹³
Hospital visit/episode level ¹⁴	1 [1, 2]	1 [1, 1]	3 [2, 4]	2 [2, 2]
Consultant episode level ¹⁴	12 [7, 21]	4 [2, 8]	20 [11, 34]	9 [5, 17]
Problem list ¹⁵	5 [2, 8]	2 [2, 3]	1 [1, 1]	3 [1, 3]
Discharged to/Outcomes, visits, n (%)
Discharged home	2088 (75.0)	106,088 (91.9)	6086 (76.2)	109,007 (83.5)
Discharged to other setting	44 (1.6)	414 (0.4)	410 (5.1)	2206 (1.7)
Transferred as inpatient	181 (6.5)	3548 (3.1)	355 (4.4)	3900 (3.0)
Remain in hospital ¹⁶	39 (1.4)	4460 (3.9)	33 (0.4)	85 (0.1)
Died	428 (15.4)	675 (0.6)	1044 (13.1)	1605 (1.2)
Not recorded or non-standard ¹⁷	<10 (0.1)	293 (0.3)	56 (0.7)	13,756 (10.5)

¹ For both UCLH and UHB, “COVID-19” indicates a positive SARS-CoV-2 test collected or a COVID-19 clinical diagnosis confirmed/suspected during or within 14 days prior to presentation at hospital.
² Patient-level summaries are disaggregated by whether a patient was ever diagnosed with COVID-19, as patients can be linked to multiple visits in DECOVID with different COVID-19 diagnoses.
³ “Not recorded” includes records where the sex at birth of a patient is withheld, or not asked/missing.
⁴ Aggregated from 18 ethnic groups recorded according to the 2011 England and Wales Census [38]. “Not recorded or withheld” includes records where the ethnicity of a patient is listed as “Ethnicity not stated” or “Unknown racial group”.
⁵ Since the DECOVID database only contains the year of birth to ensure anonymity of patients, age at presentation was calculated assuming a birth date of 2nd July of the year of birth.
⁶ “Other settings” include hospice, assisted living facility, intermediate mental care facility, adult living care facility, and prison/correctional facility.
⁷ “Not recorded or non-standard” includes records where a patient was admitted from a source that was not recorded/missing, or not part of the NHS’s standard admission sources [39].
⁸ During the pandemic peaks, higher-level care may have been provided in non-standard settings. This may not be accurately reflected in these data.
⁹ Includes operating theaters; theater recovery; anesthetic induction room; endoscopy or other procedure; and theater complex. Theaters were used for both elective surgery capacity as well as surge capacity primarily for COVID-19 patients admitted to ICU.
¹⁰ Patient-days were defined as complete 24 h blocks of a patient’s stay, restricted to any measurements/laboratory tests that were taken during a complete 24 h block (e.g., not occurring during a visit lasting less than 24 h, or not near the end of a visit during an incomplete 24 h block).
¹¹ Minimum mean frequency and maximum missingness across the constituent components of each group of observations/tests. The mean frequency of observations/tests per patient-day was calculated as the total number of observations/tests (numerator) divided by the total number of patient days (i.e., complete 24 h blocks) (denominator) for each level of care category. This calculation accounts for patients with no observations/tests.
¹² “Past medical history” is one of the condition/diagnosis types for conditions/diagnoses in DECOVID. Past medical history pertains to pre-existing conditions/diagnoses prior to presentation.
¹³ NA = Not Applicable. The Past medical history condition type was not used by UHB.
¹⁴ “Hospital visit/episode level” diagnosis combines “Encounter Diagnosis” and “Chief Complaint” diagnosis types in DECOVID; the Chief complaint condition type was not used by UCLH. “Consultant episode level” diagnosis refers to “Billing Diagnosis” in DECOVID. See Section 3.4 for further details.
¹⁵ “Problem list” is one of the condition/diagnosis types for conditions/diagnoses in DECOVID. A problem list is meant to provide a complete summary of a patient’s diagnoses and health issues to facilitate continuity of care [40].
¹⁶ Patient follow-up runs until 28 March 2021 at UHB and 13 April 2021 at UCLH.
¹⁷ “Not recorded or non-standard” includes records where a patient was discharged to a destination that was not recorded/missing or not part of the NHS’s standard destinations of discharge [41].
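The age convention in footnote 5 (year of birth only, with an assumed birth date of 2 July) could be implemented along the following lines. This is a sketch under our own assumptions: the function name is hypothetical, and we assume age is reported in completed years, which the footnote does not state explicitly.

```python
from datetime import date


def age_at_presentation(year_of_birth, presentation_date):
    """Age in completed years, assuming a birth date of 2 July.

    Counting completed years is an assumption of this sketch; the
    footnote only specifies the assumed birth date.
    """
    years = presentation_date.year - year_of_birth
    # Subtract one year if the assumed 2 July birthday has not yet occurred.
    if (presentation_date.month, presentation_date.day) < (7, 2):
        years -= 1
    return years


age_before = age_at_presentation(1970, date(2020, 7, 1))
age_on = age_at_presentation(1970, date(2020, 7, 2))
```

Because the true birthday can fall anywhere in the year, an age computed this way may differ from the true age by up to one year.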

3.2. Data Tables

The database schema is depicted in Figure 7. The OMOP-CDM schema is normalized in third normal form [42], meaning there are a number of standardized clinical data tables that contain data records on the clinical events that occurred longitudinally for each patient, as well as demographic information for each patient. Normalized databases have the advantage of containing less redundancy, making it easier to avoid data anomalies. There are 10 standardized clinical data tables used for DECOVID containing core information about clinical events that occurred during a visit (Table 2). The five OMOP-CDM clinical data tables not included (observation_period, device_exposure, observation, note_nlp, and note) are for data not extracted by DECOVID.

Only numerical identifiers are stored in the clinical data tables; text labels of clinical items are not recorded. The numeric identifiers are referred to as “concepts”. For example, the gender_concept_id column in the person table contains numerical concept IDs that are associated with specific labels rather than directly recording labels such as “Male” or “Female”. In the case of “Female”, the person table records the gender_concept_id = 8532.

The association between concept IDs in the clinical data tables and meaningful textual labels (and other information) is contained in the concept table. These data are derived from international and national vocabularies representing clinical information across a domain (e.g., conditions). For example, clinical vocabularies such as SNOMED-CT are used for standardized medical diagnoses, and LOINC is used for standardized health measurements. The concept table forms part of the Standardized Vocabularies provided as part of the OMOP-CDM. There are several additional Standardized Vocabularies data tables that describe the relationship between concepts and concept vocabularies, including concept_relationship and concept_ancestor. As well as being provided as a data table, the Standardized Vocabularies can be viewed on the OHDSI Athena online browser (https://athena.ohdsi.org/, accessed on 3 November 2025).
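
As a minimal sketch of this lookup, the following Python snippet builds a toy in-memory SQLite database with cut-down person and concept tables and joins them to recover the textual label for each gender_concept_id. SQLite stands in for the actual database engine, the tables are reduced to a few columns, and all rows are invented for illustration. Only the column names and the OMOP gender concept IDs (8532 for "Female", 8507 for "Male") are taken from the OMOP-CDM Standardized Vocabularies.

```python
import sqlite3

# Toy in-memory database; the real DECOVID tables hold many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id    INTEGER PRIMARY KEY,
    concept_name  TEXT,
    vocabulary_id TEXT
);
CREATE TABLE person (
    person_id         INTEGER PRIMARY KEY,
    gender_concept_id INTEGER,
    year_of_birth     INTEGER
);
INSERT INTO concept VALUES (8532, 'FEMALE', 'Gender'), (8507, 'MALE', 'Gender');
-- Invented patients, for illustration only.
INSERT INTO person VALUES (1, 8532, 1950), (2, 8507, 1971), (3, 8532, 1988);
""")

# Resolve the numeric concept ID stored in person to its label in concept.
rows = conn.execute("""
SELECT p.person_id, c.concept_name
FROM person AS p
JOIN concept AS c ON c.concept_id = p.gender_concept_id
ORDER BY p.person_id
""").fetchall()

print(rows)
```

The same join pattern applies to any *_concept_id column in the clinical data tables.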

Standardized Health System data tables are also available, which contain data on the healthcare provider(s) responsible for administering healthcare (e.g., care_site). The OMOP-CDM Health Economic data tables (e.g., cost) are not available in DECOVID, and derived element tables (e.g., drug_era) have not been created.

3.3. Database Identifiers

The data tables are linked deterministically using unique keys/identifiers that have the suffix _id. Patients are identified by the unique identifier person_id. The identifiers from the two hospital Trusts are not harmonized, so the same patient could appear in the data for both hospital Trusts with distinct identifiers; since the hospitals are in two distant cities, this was considered unlikely.

Hospital visits, episodes of care (e.g., admission, discharge), and encounters with a healthcare provider that occur during a patient’s visit are identified by unique identifiers: visit_occurrence_id and visit_detail_id. There are also unique identifiers for medical diagnoses/conditions, laboratory measurements, specimens, and medications during a patient’s visit: condition_occurrence_id, measurement_id, specimen_id, and drug_exposure_id. Each person_id has at least one visit_occurrence_id and visit_detail_id, and can have zero or more drug exposures, conditions, biological samples, and laboratory measurements collected.
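
The cardinalities described above can be illustrated with a short sketch. The snippet below uses a toy in-memory SQLite database with invented rows and cut-down tables (assumptions made for illustration, not the DECOVID schema itself). An inner join reflects that every person has at least one visit, while a LEFT JOIN keeps visits that have no drug exposures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);
CREATE TABLE visit_occurrence (
    visit_occurrence_id INTEGER PRIMARY KEY,
    person_id           INTEGER
);
CREATE TABLE drug_exposure (
    drug_exposure_id    INTEGER PRIMARY KEY,
    person_id           INTEGER,
    visit_occurrence_id INTEGER
);
-- Invented data: person 1 has two visits and three drug exposures,
-- person 2 has one visit and no drug exposures.
INSERT INTO person VALUES (1), (2);
INSERT INTO visit_occurrence VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO drug_exposure VALUES (100, 1, 10), (101, 1, 10), (102, 1, 11);
""")

# LEFT JOIN so that visits without drug exposures still contribute a row;
# COUNT(DISTINCT ...) ignores the NULLs produced for those visits.
visit_counts = conn.execute("""
SELECT p.person_id,
       COUNT(DISTINCT v.visit_occurrence_id) AS n_visits,
       COUNT(DISTINCT d.drug_exposure_id)    AS n_drug_exposures
FROM person AS p
JOIN visit_occurrence AS v ON v.person_id = p.person_id
LEFT JOIN drug_exposure AS d ON d.visit_occurrence_id = v.visit_occurrence_id
GROUP BY p.person_id
ORDER BY p.person_id
""").fetchall()

print(visit_counts)
```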

3.4. Data Records

Table 2 explains the contents of each of the 10 clinical data tables. It is important to note that there are inter-hospital Trust differences in the concept IDs recorded in the 10 clinical data tables (e.g., “Tobacco smoking behavior—finding” and “Tobacco smoking consumption—finding” measurement_concept_ids in the measurement table are only recorded at UHB), and summary statistics for key concept_ids in all clinical tables by hospital Trust are available [30,31].

Since not all aspects of the OMOP-CDM are applicable to the UK healthcare setting or the needs of the DECOVID collaboration, a small number of changes to the OMOP-CDM were necessary. Table A2 (Appendix B) contains a comprehensive list of changes.

The visit_occurrence table records the time span of a patient’s episode of care. The admitting source and discharge to concepts are recorded according to the NHS Data Model and Dictionary [39,41]. The visit_detail table provides additional data on the visits in the visit_occurrence table and records clinically meaningful movements between types or levels of care, such as from the Emergency Department to an Adult Inpatient Ward. Note that at UHB, movements between wards at the same level are also recorded.

The condition_occurrence table records diagnoses in a variety of vocabularies: the majority are ICD-10 codes (approximately 66%) or SNOMED-CT codes (approximately 32%). The relationship between concepts from different vocabularies for the same diagnosis can be explored using the concept, concept_ancestor, and concept_relationship standardized vocabulary tables. The table includes diagnoses originally recorded for a variety of different purposes. It is important to note that the meaning of condition_start_date varies depending on the condition type. For diagnoses entered in the EHR’s Problem List or Past Medical History records, the condition_start_date is the date when the diagnosis was entered into the EHR (which may not be the same as, e.g., the date when the associated condition was first observed by the patient). For EHR Encounter Diagnoses and EHR Chief Complaint Diagnoses, the condition_start_date is the start of the corresponding hospital visit, since these data are only coded to the visit level in the EHR. For EHR Billing Diagnoses, the condition_start_date is the start of the hospital consultant episode (which may span a widely varying number of days).

The measurement table contains quantitative findings (e.g., laboratory results). The majority (87%) of records are mapped to SNOMED codes, and the remaining records are mapped to LOINC codes. The time when the finding became visible to clinical staff on the EHR is recorded in measurement_datetime in this table. The time when specimens were collected from a patient is recorded as specimen_datetime in the specimen table. The specimen_id of the specimen corresponding to a particular measurement_id for a measurement is provided via the “Measurement to Specimen” relationship (relationship_concept_id = 32668) in the fact_relationship table. The UHB EHR has the concept of “panels” of measurements, allowing both results from a single physical specimen and contemporaneous observations to be easily grouped. These “virtual panels” are recorded in the specimen table for UHB only. Additionally, specific to the recording of COVID-19 test results, point-of-care (PoC) SARS-CoV-2 tests are not recorded for UHB in the measurement and specimen tables; however, any positive PoC test is reflected in the clinical diagnosis coding in the condition_occurrence table at discharge.
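
To illustrate how the “Measurement to Specimen” relationship might be used, the sketch below links toy measurement and specimen rows through a cut-down fact_relationship table and computes the delay between specimen collection and the result becoming visible. It assumes that fact_id_1 holds the measurement_id and fact_id_2 the specimen_id, omits the domain concept ID columns of fact_relationship, and uses invented rows in an in-memory SQLite database; the actual layout should be checked against the DECOVID documentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurement (measurement_id INTEGER PRIMARY KEY, measurement_datetime TEXT);
CREATE TABLE specimen (specimen_id INTEGER PRIMARY KEY, specimen_datetime TEXT);
-- Cut-down fact_relationship; direction of fact_id_1/fact_id_2 is an assumption.
CREATE TABLE fact_relationship (
    fact_id_1               INTEGER,
    fact_id_2               INTEGER,
    relationship_concept_id INTEGER
);
INSERT INTO measurement VALUES (1, '2020-04-01 14:00:00'), (2, '2020-04-01 15:30:00');
INSERT INTO specimen VALUES (500, '2020-04-01 08:00:00'), (501, '2020-04-01 15:00:00');
INSERT INTO fact_relationship VALUES
    (1, 500, 32668),
    (2, 501, 32668),
    (1, 999, 12345);  -- unrelated relationship type (invented), filtered out below
""")

# Minutes between specimen collection and the result appearing on the EHR.
delays = conn.execute("""
SELECT m.measurement_id,
       s.specimen_id,
       (CAST(strftime('%s', m.measurement_datetime) AS INTEGER)
        - CAST(strftime('%s', s.specimen_datetime) AS INTEGER)) / 60 AS minutes_to_result
FROM fact_relationship AS f
JOIN measurement AS m ON m.measurement_id = f.fact_id_1
JOIN specimen    AS s ON s.specimen_id    = f.fact_id_2
WHERE f.relationship_concept_id = 32668
ORDER BY m.measurement_id
""").fetchall()

print(delays)
```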

The drug_exposure table records the administration of biochemical substances to patients. Nearly all (99%) records in the drug_exposure table are mapped to the NHS Dictionary of Medicines and Devices (DM+D) codes [44], and a small proportion are mapped to SNOMED codes. Most (75%) drug exposures are mapped to the Virtual Therapeutic Moiety (VTM) of DM+D for UHB, while for UCLH, most (91%) are mapped to the Virtual Medicinal Product (VMP) of DM+D. As with the condition_occurrence table, the relationship between concepts from different vocabularies for the same drug or drug ingredient can be explored using the concept, concept_ancestor, concept_synonym, and concept_relationship standardized vocabulary tables, such as the relationship between RxNorm terms and ingredients. Note that the dosage data for drugs at UHB may be of variable quality, for two reasons. First, UHB maps most of its drug exposures to VTM, which, unlike the VMP primarily used at UCLH, does not encode dosage in the drug name. Second, the OMOP quantity field permits only one quantity type to describe the dosage, which, in the case of drug infusions, is reserved for the rate of infusion. It is therefore not possible to determine the concentration of drug infusions at UHB. For example, “Glucose” may be recorded as administered at 100 mL/h at UHB with no information on the concentration of glucose being administered.
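
One way such relationships might be explored is sketched below: the concept_ancestor table is used to collect all drug exposures whose drug_concept_id descends from a chosen higher-level concept. The concept IDs (9001, 9002, 9003, 9100) are invented placeholders, the tables are cut down, and an in-memory SQLite database stands in for the real one. Whether a given pair of DECOVID concepts is actually connected through concept_ancestor depends on the vocabulary release and should be verified in Athena.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_ancestor (
    ancestor_concept_id   INTEGER,
    descendant_concept_id INTEGER
);
CREATE TABLE drug_exposure (
    drug_exposure_id INTEGER PRIMARY KEY,
    person_id        INTEGER,
    drug_concept_id  INTEGER
);
-- Invented IDs: 9001 plays the role of a moiety/ingredient-level concept,
-- 9002 and 9003 of products containing it, and 9100 of an unrelated drug.
-- Each concept is also listed as its own ancestor, as in OMOP.
INSERT INTO concept_ancestor VALUES (9001, 9001), (9001, 9002), (9001, 9003), (9100, 9100);
INSERT INTO drug_exposure VALUES (1, 1, 9001), (2, 2, 9002), (3, 2, 9100), (4, 3, 9003);
""")


def exposures_under(connection, ancestor_concept_id):
    """Return drug_exposure_ids whose concept descends from the given ancestor."""
    rows = connection.execute(
        """
        SELECT d.drug_exposure_id
        FROM drug_exposure AS d
        JOIN concept_ancestor AS a
          ON a.descendant_concept_id = d.drug_concept_id
        WHERE a.ancestor_concept_id = ?
        ORDER BY d.drug_exposure_id
        """,
        (ancestor_concept_id,),
    ).fetchall()
    return [row[0] for row in rows]


matched = exposures_under(conn, 9001)
print(matched)
```

Grouping at a higher-level concept in this way could help when comparing drug exposures between UHB (mostly VTM) and UCLH (mostly VMP), although it does not recover the dosage information that VTM-level records lack.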

The care_site standardized health system table records institutional (physical or organizational) units where healthcare delivery is practiced (e.g., Emergency Complex, In-Patient Areas, and Accident and Emergency Department). The entries in the care_site table were made specifically for DECOVID and do not follow a national or international standardized vocabulary. The institutional units at UHB and UCLH included in the care_site table are listed on the GitHub repository [30,31].

Further documentation detailing the standardized vocabularies, health system, and clinical data tables that underpin the DECOVID database schema is openly available [30,31].

3.5. Data Granularity

The DECOVID database contains highly granular clinical data on patient flow, demographics, and longitudinal physiology, treatments, laboratory findings, diagnoses, and outcomes. However, the granularity of data may differ between clinical data types, as well as hospital Trusts, depending on clinical guidelines and practice for each clinical data type/item. For example, vital signs measurements may be taken relatively infrequently for ward patients (e.g., four times daily), but four times per hour for ICU patients. Table 1 reports the frequency of recording in DECOVID of selected clinical features.
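
The mean frequency per patient-day reported in Table 1 (total observations divided by total complete 24 h patient days, per level of care) can be sketched as follows. The stays and observation counts are invented, and the sketch assumes that all observations in a stay count toward the numerator, which the definition above does not state explicitly.

```python
from datetime import datetime


def complete_patient_days(start, end):
    """Number of complete 24 h blocks between two datetimes."""
    return int((end - start).total_seconds() // 86400)


# (level of care, stay start, stay end, number of observations recorded)
stays = [
    ("ward", datetime(2020, 4, 1, 8), datetime(2020, 4, 4, 8), 12),   # 3 complete days
    ("ward", datetime(2020, 4, 2, 9), datetime(2020, 4, 3, 21), 0),   # 1 complete day, no observations
    ("icu", datetime(2020, 4, 1, 0), datetime(2020, 4, 3, 0), 192),   # 2 complete days
]

# Accumulate total observations and total patient days per level of care.
totals = {}
for level, start, end, n_obs in stays:
    obs, days = totals.get(level, (0, 0))
    totals[level] = (obs + n_obs, days + complete_patient_days(start, end))

# Patients with no observations still contribute patient days to the denominator.
mean_frequency = {
    level: obs / days for level, (obs, days) in totals.items() if days > 0
}

print(mean_frequency)
```

In this toy example the ward rate is three observations per patient-day, pulled down by the stay with no observations, whereas the ICU rate of 96 per patient-day corresponds to four observations per hour.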

The time granularity of the data is also detailed, with most time- or date-related fields recorded to the second. For example, the visit_detail, specimen, measurement, and drug_exposure data tables include datetime fields that are recorded to the second. However, not all data tables are recorded to the same precision. For example, the death data table is mostly recorded to the day, and some entries in the condition_occurrence data table are only recorded at the consultant episode or visit level [45].

3.6. Data Quality

There are several limitations to the data, as with any observational EHR data. Data quality and incomplete data capture are major limitations of EHR data that are often affected by inaccurate coding (e.g., diagnosis), data transcription and entry errors, and disparate, unstandardized EHR systems [46]. Although DECOVID has mitigated these limitations by adopting the OMOP-CDM for database harmonization and implementing data quality procedures (see Technical Validation), these are common data limitations that persist with EHR data.

3.7. Technical Validation

Bespoke validation and verification procedures were developed for DECOVID that followed best practices of database design, validation, and verification, including verification procedures that adapted OHDSI’s Data Quality Dashboard approach [47]. A bespoke approach was taken because it was not possible to efficiently utilize the open-source tools provided by the OHDSI program for validation and verification [48] due to difficulties setting these up.

Code used to develop the DECOVID database followed best practices of version control and was developed collaboratively across several institutions and validated by a number of database developers. The validation process extends beyond the database system itself to data quality during the ETL process, where uniqueness, plausibility, relational conformance, and completeness checks (i.e., data type, structure, format, coherence) are implicitly conducted both locally and centrally. Figure 6 presents the data and quality evaluation pipeline of the DECOVID database.

Assessing data quality is essential to ensure the validity of analyses. Data quality assessments are generally one of two types: “validation”, which focuses on the alignment of data with external benchmarks that come from known truths or relative gold standards that are independent of the data in question, and “verification”, which focuses on the alignment of data with respect to metadata constraints, system assumptions, and resources within the local environment [49]. An iterative manual quality evaluation of validation and verification steps occurred at multiple stages in the assembly of the analysis-ready DECOVID database. The main goal of this procedure was to “identify and fix” structural errors arising from source ETL, and “identify and label” non-structural errors that, while indicating “impossible” events have occurred, are nevertheless real reflections of the source EHR. Four main quality evaluation procedures took place:

Source Validation: Local checks of the OMOP-CDM database were validated against a parallel query run directly against the source EHR (Parallel QE Query). This ensures that data fidelity loss during concept mapping or creation of the OMOP-CDM database is minimized.

Source Verification: Local OMOP-CDM databases were checked internally for consistency, using expert domain knowledge to ensure that data representations are accurate over a broad range of requirements. Local OMOP-CDM database summary reports were compared by a DECOVID data wrangling team between sites to ensure that major differences are attributable to case mix differences, and not structural issues [50]. Figure A1 (Appendix A) is an example of the graphical checks performed.

Global Verification: The unified OMOP-CDM database underwent extensive quality evaluation. Potential data value errors were flagged and embedded in the database non-destructively. The list of checks made is in Table A1 (Appendix A). This allows analysts to apply a common set of rules when encountering erroneous data patterns, without introducing bias into the data by their outright removal. Any structural errors in the database (for example, violations of referential integrity within the relational database) were immediately addressed at the source.

Inferential Verification: Analysts engaged directly in the quality evaluation procedure, reporting back to central teams if data patterns appeared to be erroneous. This verification was performed through many communication channels but tracked via issues in GitHub.

For each of the quality evaluation procedures, flagged data and artifacts were fed back and checked at the source. This was particularly important during the Source Validation procedure to minimize data fidelity loss. Once the local EHR data were transformed into local OMOP-CDM databases, the data underwent a side-by-side review between the two hospital Trusts (i.e., Source Verification, see Appendix A). In some cases, findings reported from these initial steps fed back into processes within each hospital Trust, which gave rise to improvements in data exporting. After this, the data were transferred to PIONEER at UHB, where they were loaded into the DECOVID (Microsoft SQL) database. The database was queried for potentially erroneous values, such as measurement values falling outside physiologically possible bounds, temporal inconsistencies, and data missingness. These flagged data are recorded in tables that are separate from the DECOVID database, but refer directly to the potentially erroneous rows. The code for these checks is available on GitHub [30,31].
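
The non-destructive flagging approach can be illustrated with a minimal sketch. The snippet below is not the DECOVID code (which is available on GitHub [30,31]); it uses an in-memory SQLite database, and the plausible_range and quality_flag tables, the concept IDs 7001 and 7002, and the bounds are all hypothetical. Out-of-range rows are recorded in a separate flag table, and the measurement table itself is left unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurement (
    measurement_id         INTEGER PRIMARY KEY,
    measurement_concept_id INTEGER,
    value_as_number        REAL
);
-- Hypothetical clinician-specified bounds per measurement concept.
CREATE TABLE plausible_range (
    measurement_concept_id INTEGER PRIMARY KEY,
    low  REAL,
    high REAL
);
-- Hypothetical flag table, kept separate from the clinical data.
CREATE TABLE quality_flag (table_name TEXT, row_id INTEGER, check_name TEXT);

INSERT INTO measurement VALUES (1, 7001, 72), (2, 7001, 720), (3, 7001, -5), (4, 7002, 36.8);
INSERT INTO plausible_range VALUES (7001, 0, 300), (7002, 25, 45);
""")

# Record a reference to each suspicious row instead of deleting or editing it.
conn.execute("""
INSERT INTO quality_flag (table_name, row_id, check_name)
SELECT 'measurement', m.measurement_id, 'value_outside_plausible_range'
FROM measurement AS m
JOIN plausible_range AS r ON r.measurement_concept_id = m.measurement_concept_id
WHERE m.value_as_number < r.low OR m.value_as_number > r.high
""")

flagged_ids = [row[0] for row in conn.execute(
    "SELECT row_id FROM quality_flag ORDER BY row_id")]
n_measurements = conn.execute("SELECT COUNT(*) FROM measurement").fetchone()[0]

print(flagged_ids, n_measurements)
```

An analyst can then decide, per research question, whether to exclude, correct, or retain the flagged rows by joining against the flag table.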

4. User Notes

4.1. Data Access

The DECOVID database is stored within the secure infrastructure of PIONEER at UHB, and the data stored in it are compliant with the UK NDOO policy. Access to the data presented in this paper is available on request through PIONEER (www.pioneerdatahub.co.uk). The data are not publicly available to ensure the privacy of personal data.

Access will be granted under the principles of the five safes (data, projects, people, settings, outputs) and data minimization [51]. Researchers wishing to transfer the database to another secure research environment would need to satisfy PIONEER’s requirements, such as the ISO 27001 security standard and the Data Security and Protection Toolkit assessment [52], and comply with the five safes framework [51].

The data request process through PIONEER is straightforward: applicants wishing to access the data must fill in all sections of the data request form and upload/attach any additional information. The form can be accessed online on the PIONEER website [53] or downloaded as a Word document and submitted via email to pioneer@uhb.nhs.uk. Information relating to the costs of data access and services provided by PIONEER is available [54]. Those wishing to discuss data projects prior to submitting a data request can contact PIONEER via email or online through the PIONEER contact form.

The data request form contains three sections: Section A (information about the project, methods, and the type and size of data required), Section B (level of data access required, selection of data fields, data environment, statistical analysis, and ethical approvals), and Section C (information about the applicant and research team). PIONEER provides a user guide explaining how the form should be completed [55]. Note that data items that are not necessary to answer the research question(s) of interest will be removed. The balance of patient privacy and statistical utility for a research question is judged according to the Caldicott principles on protecting patient information [56].

A publication including lay summaries of approved and rejected data requests that have been submitted to PIONEER will be published quarterly and publicly available on PIONEER’s website. For approved data requests, a list of research outputs will also be published when research is complete, as determined by researchers requesting and analyzing the data.

4.2. Data Usage

The DECOVID database was originally developed for academics and affiliate researchers across several academic and public institutions of the DECOVID consortium to answer research questions pertaining to COVID-19 [57]. Extending access to the DECOVID database for non-commercial research beyond the DECOVID consortium will provide a platform for the development of research questions not yet explored. The database could be used for a wide range of research questions, including exploratory analyses, risk factor assessment, prediction modeling, and comparative effectiveness studies. For example, it could be used to understand aspects of the disease trajectory of COVID-19 patients, or for developing or assessing methods for predicting severity or progression of disease, or for comparing the efficacy of specific medications in a real-world setting. Project summaries of projects approved by PIONEER are also made public via the PIONEER website (https://www.pioneerdatahub.co.uk/project-summaries, accessed on 3 November 2025).

It is important to highlight ethical issues surrounding the use of the DECOVID database, many of which also apply to EHR data in general. It has been observed that the COVID-19 pandemic has had a disproportionate effect on disadvantaged and marginalized groups, especially Black, Asian, and Minority Ethnic (BAME) communities [58,59,60]. Factors related to social inequity and inequality, such as overcrowding, compelled work, impeded access to healthcare, and food insecurity, are social determinants of infectious disease exposure, susceptibility, and severity, as well as of vulnerability to public health emergencies [61,62]. Against this backdrop of health inequities and inequalities, the use of COVID-19 datasets should include appropriate consideration of the role of socioeconomic and sociodemographic characteristics in differential health outcomes [63]. The DECOVID database makes available data on the ethnicity of patients.

EHR data, such as the EHR data included in the DECOVID database, were not originally collected for research and reflect not only the health status of patients, but also patients’ interactions with healthcare systems, evolving clinical judgements about optimal treatment pathways, and other unobservable clinical and administrative processes that influence the presence or missingness of data [64]. It is therefore crucial to approach analyses of the DECOVID database with an active awareness of the contextual challenges surrounding the generation of EHR data, as well as the limitations already described above. Such contextual issues include clinical learning processes over time, the evolution of treatment strategies in response to new research, and changes in treatment policies, levels of care, and incident reporting that are triggered by stress on the healthcare system during pandemic waves and peaks. Additionally, depending on the research question, appropriate statistical analysis may also be challenging due to the presence of selection bias, missing data, informative observation frequency/times, misclassification, measurement error, and outliers. This is not an exhaustive review of the challenges of analyzing EHR data, but rather highlights that researchers analyzing the DECOVID database should consider the limitations of the DECOVID database in the context of their specific research question.

The Health Data Gateway Metadata Catalog (https://healthdatagateway.org/en/dataset/998, accessed 3 November 2025) is freely available for researchers to browse to enable an understanding of the data.

Author Contributions

Conceptualization—L.J.M.A., S.G. (Suzy Gallier), S.G. (Stephen Gardiner), R.J.B.G., C.H., E.K., M.O., C.P., E.P., R.R., E.S., D.R.W., M.H. and W.K.W. Data curation—A.A., N.B., M.P.J.C., S.D. (Shaun Davidson), S.G. (Suzy Gallier), S.G. (Stephen Gardiner), R.J.B.G., R.E.G., J.J., R.K. (Ruairidh King), R.K. (Roman Klapaukh), H.Y.L., A.M.M., J.M., H.N., E.P., T.J.R., J.R. (James Robinson), E.P., L.S. (Luis Santos), D.R.W., W.K.W. and X.Z. Formal analysis—N.B., L.B., S.D. (Shaun Davidson), S.G. (Stephen Gardiner), R.J.B.G., E.P. and D.R.W. Funding acquisition—L.J.M.A., S.G. (Suzy Gallier), C.H., M.O., C.P., E.S., K.W. and W.K.W. Investigation—L.J.M.A., A.A., N.B., L.B., M.P.J.C., S.-C.C., D.A.C., S.N.C., S.D. (Spiros Denaxas), K.D.-O., S.G. (Stephen Gardiner), F.G., R.J.B.G., R.E.G., C.H., J.J., E.K., R.K. (Ruth Keogh), S.K., K.L., W.L., H.Y.L., T.L., A.M.M., J.M., E.P., J.R. (Jack Roberts), P.R., R.R., F.G.S., A.T. (Ana Torralbo), C.W., K.W., T.W. and L.Y. Methodology—N.B., S.G. (Suzy Gallier), S.G. (Stephen Gardiner), R.J.B.G., C.H., R.K. (Roman Klapaukh), J.M., E.P., P.R., R.R., M.H., E.S., K.W. and L.Y. Project administration—L.J.M.A., N.B., E.G.D., S.G. (Suzy Gallier), R.J.B.G., R.E.G., E.K., R.K. (Ruth King), N.L., H.Y.L., H.N., M.O., C.P., E.P., T.J.R., R.R., E.S., K.W. and W.K.W. Resources—S.G. (Suzy Gallier), R.J.B.G., C.H., J.M., E.P. and E.S. Software—N.B., I.B., L.B., E.G.D., S.G. (Stephen Gardiner), R.K. (Roman Klapaukh), E.P., J.R. (James Robinson), C.S. and O.S. Supervision—L.J.M.A., D.A.C., E.G.D., S.D. (Spiros Denaxas), K.D.-O., S.G. (Suzy Gallier), R.J.B.G., M.H., C.H., E.K., R.K. (Ruth Keogh), N.L., T.L., A.M.M., M.S.T., E.P., T.J.R., R.R., E.S., L.S. (Luis Santos), F.G.S., K.W., T.W., D.R.W. and W.K.W. Validation—A.A., N.B., S.G. (Stephen Gardiner), R.J.B.G., R.E.G., J.J., R.K. (Roman Klapaukh), J.M., E.P., W.K.W., Y.W. and L.Y. Visualization—N.B., S.G. (Stephen Gardiner), E.K., R.J.B.G., E.P. and R.R. Writing—original draft preparation—N.B., S.G. (Stephen Gardiner), R.J.B.G. and E.K. Writing—review and editing—all authors. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the EPSRC (under the grant EP/N510129/1) through the DECOVID project, which was founded by University Hospitals Birmingham NHS Foundation Trust, University College London Hospitals NHS Foundation Trust, University College London, The Alan Turing Institute, and University of Birmingham. This study was funded by the National Institute for Health and Care Research (NIHR) Midlands Patient Safety Research Collaboration (PSRC) and the NIHR Birmingham Biomedical Research Centre. The views expressed in this paper are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care or any other funder named here. Individual members would like to acknowledge the following funding: David A Clifton was funded by the Oxford NIHR Biomedical Research Centre, an RAEng Research Chair, and an NIHR Research Professorship; Shaun Davidson was supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health; Suzy Gallier reports funding support from HDRUK, MRC and NIHR; Francesca Gasperoni was funded by UKRI Medical Research Council (programme code MRC_MC_UU_00002/5); Robert JB Goudie was funded by UKRI Medical Research Council (MRC) (programme codes MC_UU_00002/2, MC_UU_00002/20 and MC_UU_00040/04) and supported by National Institute for Health and Care Research (NIHR) Cambridge Biomedical Research Centre (NIHR203312); Marlous Hall received funding from the Wellcome Trust (ref:206470/Z/17/Z); Mark Iles was supported in part by the National Institute for Health and Care Research (NIHR) Leeds Biomedical Research Centre (BRC) (NIHR203331); Joao Jorge was funded by RCUK Digital Economy Programme, under Grant EP/G036861/1 (Oxford Centre for Doctoral Training in Healthcare Innovation); Watjana Lilaonitkul was funded by an MRC Rutherford Fellowship (funded by HDRUK via the MRC); Huiqi Y Lu was funded by the Royal Academy of Engineering Daphne Jackson Research Fellowship; Hannah Nicholls was funded by a British Heart Foundation PhD Studentship; Edward Palmer was funded by an MRC Doctoral Training Programme; E Sapey reports funding support from HDRUK, MRC, Wellcome Trust, EPSRC and NIHR; Fang Gao Smith was funded as an NIHR Senior Investigator; Alice Turner is also funded by the National Institute for Health and Care Research (NIHR) Midlands Patient Safety Research Collaboration (PSRC), as well as NIHR EME and HTA; Andrew Soltan was funded by an NIHR Academic Clinical Fellowship (ACF-2020-13-015); David R. Westhead is supported in part by the National Institute for Health and Care Research (NIHR) Leeds Biomedical Research Centre (BRC) (NIHR203331); Lingyi Yang was funded by EPSRC [EP/S026347/1] and the Hong Kong Innovation and Technology Commission (InnoHK Project CIMDA).

Data Availability Statement

Restrictions apply to the availability of these data. Data were obtained from PIONEER and are available at https://healthdatagateway.org/en/dataset/998 (accessed on 3 November 2025) with permission for access through an application to PIONEER. See details of how to gain access to the dataset in the data access section above.

Acknowledgments

This work was supported by PIONEER, the Health Data Research Hub in Acute Care, which is affiliated with Health Data Research UK. Data curation and licensed access for this study through PIONEER have been approved by the East Midlands (Derby) REC (20/EM/0158) and are supported by the Confidentiality Advisory Group (Reference 20/CAG/0084). We would like to acknowledge the following DECOVID team members, who made contributions to this paper: Rahul Arora, Matthew Chun, Anna Hadjitofi, Harry Hemingway, Adam Mahdi, Ian Piper, Alec Topham, Alexey Youssef, and Chris Williams. We are grateful to the wider DECOVID team, particularly during the early stages of the project in Spring 2020, and their employers. This work uses data provided by patients and collected by the NHS as part of their care and support. We would like to acknowledge the contribution of all staff, key workers, patients, and the community who have supported our hospitals and the wider NHS at this time.

Conflicts of Interest

There are no conflicts of interest.

Appendix A

The code used to generate tables, figures, and descriptive statistics included in this paper is contained in an online, publicly accessible repository [30,31].

Table A1. Quality checks implemented in SQL during centralized global verification.

Quality Check
Every patient in the DECOVID database should have a valid record in the VISIT_OCCURRENCE data table
A patient’s year of birth should be between 1900 and 2002
Discharge information should be coded as NULL for a patient if their visit_end_datetime is NULL (and vice versa)
A patient cannot have a visit_start_date from one visit that is strictly between the visit_start_date and visit_end_date from another visit. This quality check covers both the VISIT_OCCURRENCE and VISIT_DETAIL data tables
In the VISIT_OCCURRENCE data table, check that visit_start_date/datetime does not occur after the row was last updated (last_updated_datetime)
Patient visits to the A&E (i.e., emergency department) should have an interval of less than 24 h between visit_end_datetime and visit_start_datetime
Visits for the same patient should have the same person_id for all visits. Where there is a preceding visit_occurrence record for a patient, the visit_occurrence should have the same person_id and have an earlier start date
When discharge_to implies that a patient died, check that the patient has a death recorded
No patient should have a visit_occurrence with an end_date/datetime that extends 24 h beyond their death_date. This includes having no records in the VISIT_OCCURRENCE data table with a NULL visit_end_datetime
Rows in the MEASUREMENT data table that are complete duplicates of each other (apart from the measurement_id) are possibly a result of being part of different overlapping observation sets. These records are de-duplicated
For COVID swab test results in the SPECIMEN data table, pending results should have NULL time and NULL value. The time a sample was taken should be within the last 14 days
Except for the COVID swab tests, value_as_number and value_as_concept_id should not both be null, and measurement_datetime should not be null
In the MEASUREMENT data table, check that all values lie within the ranges of possibility specified by a clinician, where possible
For the VISIT_OCCURRENCE and VISIT_DETAIL data tables, check that every record in the VISIT_OCCURRENCE data table has at least one record in the VISIT_DETAIL data table
Where there is a preceding visit (VISIT_OCCURRENCE and VISIT_DETAIL) ID, this visit should have an earlier start date
In any clinical data table, check that the concept is not from the wrong vocabulary, and the concept is of the correct type
In any clinical data table, check that all date and datetime fields match, and if a datetime field is not null then its corresponding date is not null
In the CONDITION_OCCURRENCE data table, the condition_start_date must be between a year before the patient’s year of birth and the condition_end_date
In the CONDITION_OCCURRENCE data table, condition_start_date/datetime must be greater than visit_occurrence_start_date/datetime; and condition_end_date/datetime must be greater than (or equal to if date) condition_start_date/datetime
In any of the clinical data tables, table end_date/datetime must be greater than or equal to table start_date/datetime
In the CONDITION_OCCURRENCE data table, stop_reason, provider_id, condition_source_value, condition_source_concept_id, and condition_status_source_value are null
A person_id in the CONDITION_OCCURRENCE data table must match a person_id in the VISIT_OCCURRENCE and VISIT_DETAIL data tables
In the CONDITION_OCCURRENCE data table, condition_status_concept_id and condition_type_concept_id must match a concept_id in the concept table of the correct type
No sexual health conditions should be included in the CONDITION_OCCURRENCE data table
For the VISIT_OCCURRENCE data table, check that records start less than 60 min after a previous visit
Except for COVID swab tests, check that all measurements can be assigned to a visit_occurrence
For any clinical data table, check that the start_date/datetime is within bounds of the DECOVID project period (i.e., after 1 January 2020), including reference to the end_date/datetime
For records in the VISIT_OCCURRENCE and VISIT_DETAIL data tables, end_date/datetime should be after start_date/datetime
Where there are multiple records per patient in the VISIT_DETAIL data table, all but the first record should have a non-null value in preceding_visit_detail_id
For patients who have died and have a record in the DEATH data table, check that the death_datetime is not NULL, is not later than the latest update_datetime, and does not fall outside the DECOVID observation period (i.e., is not prior to 1 January 2020)
Check that the concept is not from the wrong vocabulary for race/ethnicity
Local quality check reports are created, including data visualizations and tabular data summaries, to explore the distribution of key variables, for example to identify patients with an implausibly high age (see Figure A1 for an example)

Figure A1. Example of local quality check reports for measurement_concept_id: 37310255, Detection of 2019 novel coronavirus using polymerase chain reaction technique. The red shading highlights potential issues.
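Several of the checks listed in this appendix are simple row-level rules. The following is a minimal Python sketch of three of them (date/datetime agreement, end not before start, and visits starting less than 60 min after a previous visit). It is illustrative only and is not the code used by the DECOVID sites: rows are represented as plain dictionaries, the function names are hypothetical, and the 60 min gap is measured from the end of the previous visit, which is an assumption about how the check was defined.

```python
from datetime import date, datetime, timedelta

# Start of the DECOVID project period, as stated in the checks above.
PROJECT_START = date(2020, 1, 1)


def check_date_fields(record, prefix):
    """Return a list of issues for one row's <prefix>_start/_end fields.

    `record` is a plain dict standing in for a table row (hypothetical layout),
    e.g. prefix="condition" reads condition_start_date, condition_end_datetime, ...
    """
    issues = []
    for part in ("start", "end"):
        d = record.get(f"{prefix}_{part}_date")
        dt = record.get(f"{prefix}_{part}_datetime")
        if dt is not None and d is None:
            issues.append(f"{part}_datetime is set but {part}_date is null")
        elif dt is not None and dt.date() != d:
            issues.append(f"{part}_date does not match {part}_datetime")
    start = record.get(f"{prefix}_start_date")
    end = record.get(f"{prefix}_end_date")
    if start is not None and end is not None and end < start:
        issues.append("end_date is before start_date")
    if start is not None and start < PROJECT_START:
        issues.append("start_date is before the project period")
    return issues


def visits_soon_after_previous(visits, gap=timedelta(minutes=60)):
    """Return visit_occurrence_ids that start less than `gap` after the
    same patient's previous visit ended (assumed reading of the check)."""
    flagged = []
    last_end = {}
    ordered = sorted(visits, key=lambda v: (v["person_id"], v["visit_start_datetime"]))
    for v in ordered:
        prev_end = last_end.get(v["person_id"])
        if prev_end is not None and v["visit_start_datetime"] - prev_end < gap:
            flagged.append(v["visit_occurrence_id"])
        last_end[v["person_id"]] = v["visit_end_datetime"]
    return flagged
```

Returning a list of issues per row, rather than raising on the first failure, makes it straightforward to tabulate failures in a local quality report of the kind shown in Figure A1.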

Appendix B

Table A2. Important Exclusions, Additions and Field Type Changes to OMOP-CDM Clinical Data Tables.

Data Table	Revisions
person	Exclusions: month_of_birth, day_of_birth, ethnicity_concept_id, provider_id, care_site_id, location_id, gender_source_value, gender_source_concept_id, race_source_value, race_source_concept_id, ethnicity_source_value, ethnicity_source_concept_id. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: race_concept_id, gender_concept_id, person_id changed to bigint; person_source_value changed to varchar(64).
observation_period	Clinical data table not used in DECOVID
specimen	Exclusions: specimen_type_concept_id, quantity, unit_concept_id, disease_status_concept_id, specimen_source_id, specimen_source_value, unit_source_value, anatomic_site_source_value, disease_status_source_value. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: specimen_id, person_id, specimen_concept_id, anatomic_site_concept_id changed to bigint.
death	Exclusions: death_type_concept_id, cause_concept_id, cause_source_concept_id. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: person_id changed to bigint.
visit_occurrence	Exclusions: visit_type_concept_id, provider_id, care_site_id, visit_source_concept_id, admitting_source_value, discharge_to_source_value. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: visit_occurrence_id, person_id, visit_concept_id, admitting_source_concept_id, discharge_to_concept_id, preceding_visit_occurrence_id changed to bigint.
visit_detail	Exclusions: provider_id, visit_detail_type_concept_id, visit_detail_source_concept_id, admitting_source_value, admitting_source_concept_id, discharge_to_source_value, discharge_to_concept_id. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: visit_detail_id, person_id, visit_detail_concept_id, care_site_id, preceding_visit_detail_id, visit_detail_parent_id, visit_occurrence_id changed to bigint.
procedure_occurrence	Exclusions: quantity, provider_id. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: procedure_occurrence_id, person_id, procedure_concept_id, procedure_type_concept_id, modifier_concept_id, visit_occurrence_id, visit_detail_id, procedure_source_concept_id changed to bigint.
drug_exposure	Exclusions: verbatim_end_date, stop_reason, refills, days_supply, sig, lot_number, drug_source_value, drug_source_concept_id, route_source_value, dose_unit_source_value. Additions: last_updated_datetime, deleted_datetime; dose_unit_concept_id added specifically for DECOVID. Field Type Changes: drug_exposure_id, person_id, drug_concept_id, drug_type_concept_id, dose_unit_concept_id, provider_id, visit_occurrence_id, visit_detail_id changed to bigint; quantity changed to numeric/float.
device_exposure	Clinical data table not used in DECOVID
condition_occurrence	Exclusions: stop_reason, provider_id, visit_detail_id, condition_source_value, condition_source_concept_id, condition_status_source_value. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: condition_occurrence_id, person_id, condition_concept_id, visit_occurrence_id, visit_detail_id changed to bigint.
measurement	Exclusions: measurement_time, measurement_type_concept_id, provider_id, measurement_source_value, measurement_source_concept_id, unit_source_value, value_source_value. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: measurement_id, person_id, measurement_concept_id, operator_concept_id, value_as_concept_id, unit_concept_id, visit_occurrence_id, visit_detail_id changed to bigint; value_as_number, range_low, range_high changed to numeric/float.
note	Clinical data table not used in DECOVID
note_nlp	Clinical data table not used in DECOVID
observation	Clinical data table not used in DECOVID
fact_relationship	Exclusions: verbatim_end_date, stop_reason, refills, days_supply, sig, lot_number, drug_source_value, drug_source_concept_id, route_source_value, dose_unit_source_value. Additions: last_updated_datetime, deleted_datetime. Field Type Changes: domain_concept_id_1, fact_id_1, fact_id_2, relationship_concept_id changed to bigint.
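To make the effect of the revisions in Table A2 concrete, the sketch below applies the person row to a baseline field list. The baseline is an assumption: it is the PERSON field list of OMOP CDM v5.3, whereas this paper only states that version 5 was used, so the derived list may not match the fields shown in Figure 7 exactly. The helper name `apply_revisions` is hypothetical.

```python
# Baseline PERSON fields (assumption: OMOP CDM v5.3 field list and order).
OMOP_PERSON_FIELDS = [
    "person_id", "gender_concept_id", "year_of_birth", "month_of_birth",
    "day_of_birth", "birth_datetime", "race_concept_id", "ethnicity_concept_id",
    "location_id", "provider_id", "care_site_id", "person_source_value",
    "gender_source_value", "gender_source_concept_id", "race_source_value",
    "race_source_concept_id", "ethnicity_source_value",
    "ethnicity_source_concept_id",
]

# Revisions for the person table, transcribed from Table A2.
PERSON_REVISIONS = {
    "exclusions": {
        "month_of_birth", "day_of_birth", "ethnicity_concept_id", "provider_id",
        "care_site_id", "location_id", "gender_source_value",
        "gender_source_concept_id", "race_source_value", "race_source_concept_id",
        "ethnicity_source_value", "ethnicity_source_concept_id",
    },
    "additions": ["last_updated_datetime", "deleted_datetime"],
    "type_changes": {
        "race_concept_id": "bigint",
        "gender_concept_id": "bigint",
        "person_id": "bigint",
        "person_source_value": "varchar(64)",
    },
}


def apply_revisions(base_fields, revisions):
    """Return {field: revised_type_or_None} after exclusions and additions.

    A value of None means Table A2 lists no type change for that field.
    """
    unknown = revisions["exclusions"] - set(base_fields)
    if unknown:
        raise ValueError(f"exclusions not in baseline: {sorted(unknown)}")
    kept = [f for f in base_fields if f not in revisions["exclusions"]]
    fields = kept + list(revisions["additions"])
    return {f: revisions["type_changes"].get(f) for f in fields}


person_fields = apply_revisions(OMOP_PERSON_FIELDS, PERSON_REVISIONS)
```

Under these assumptions the derived person table retains person_id, gender_concept_id, year_of_birth, birth_datetime, race_concept_id, and person_source_value, plus the two added audit fields.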

References

NHS. NHS Trust. Available online: https://datadictionary.nhs.uk/nhs_business_definitions/nhs_trust.html (accessed on 3 November 2025).
DECOVID. DECOVID Protocol, DECOVID: A Highly Granular, Near Real Time Clinical Database and Research Environment from Digitally Mature NHS Trusts to Answer Critical Questions and Improve Patient Care During the COVID Pandemic. Available online: https://7aa1b654-606c-4622-8514-d83dfc3eba35.filesusr.com/ugd/93d683_8fd95f128e3d4daba264806de39685a9.pdf (accessed on 3 November 2025).
Gallier, S.; Price, G.; Pandya, H.; McCarmack, G.; James, C.; Ruane, B.; Forty, L.; Crosby, B.L.; Atkin, C.; Evans, R.; et al. Infrastructure and operating processes of PIONEER, the HDR-UK Data Hub in Acute Care and the workings of the Data Trust Committee: A protocol paper. BMJ Health Care Inform. 2021, 28, e100294.
Garza, M.; Del Fiol, G.; Tenenbaum, J.; Walden, A.; Zozus, M.N. Evaluating common data models for use with a longitudinal community registry. J. Biomed. Inform. 2016, 64, 333–341.
Voss, E.A.; Makadia, R.; Matcho, A.; Ma, Q.; Knoll, C.; Schuemie, M.; DeFalco, F.J.; Londhe, A.; Zhu, V.; Ryan, P.B. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J. Am. Med. Inform. Assoc. 2015, 22, 553–564.
Overhage, J.M.; Ryan, P.B.; Reich, C.G.; Hartzema, A.G.; Stang, P.E. Validation of a common data model for active safety surveillance research. J. Am. Med. Inform. Assoc. 2012, 19, 54–60.
Grimson, F.; Niklas, N.; Hermans, R.; Cirneanu, L.; Maissenhaelter, B.; Kim, J. Evaluation of statistical software for federated analysis of multi-site real world studies. In Proceedings of the Pharmaceutical Industry (PSI) 2019 Conference, Stockholm, Sweden, 27–30 May 2019.
NHS Digital. Hospital Episode Statistics. Available online: https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics (accessed on 3 November 2025).
Dahella, S.S.; Briggs, J.S.; Coombes, P.; Farajidavar, N.; Meredith, P.; Bonnici, T.; Darbyshire, J.L.; Watkinson, P.J. Implementing a system for the real-time risk assessment of patients considered for intensive care. BMC Med. Inform. Decis. Mak. 2020, 20, 161.
Wood, A.; Denholm, R.; Hollings, S.; Cooper, J.; Ip, S.; Walker, V.; Denaxas, S.; Akbari, A.; Banerjee, A.; Whiteley, W.; et al. Linked electronic health records for research on a nationwide cohort of more than 54 million people in England: Data resource. BMJ 2021, 373, 826.
NIHR Oxford Biomedical Research Centre. Infections in Oxfordshire Research Database (IORD). Available online: https://oxfordbrc.nihr.ac.uk/research-themes-overview/antimicrobial-resistance-and-modernising-microbiology/infections-in-oxfordshire-research-database-iord/#:~:text=The%20Infections%20in%20Oxfordshire%20Research,covering%20about%201%25%20of%20England (accessed on 3 November 2025).
Harris, S.; Shi, S.; Brealey, D.; MacCallum, N.S.; Denaxas, S.; Perez-Suarez, D.; Ercole, A.; Watkinson, P.; Jones, A.; Ashworth, S.; et al. Critical Care Health Informatics Collaborative (CCHIC): Data, tools and methods for reproducible research: A multi-centre UK intensive care database. Int. J. Med. Inform. 2018, 112, 82–89.
GO FAIR. FAIR Principles. Available online: https://www.go-fair.org/fair-principles/ (accessed on 3 November 2025).
ISARIC4C Consortium. ISARIC4C (Coronavirus Clinical Characterisation Consortium). Available online: https://isaric4c.net/ (accessed on 3 November 2025).
Brat, G.A.; Weber, G.M.; Gehlenborg, N.; Avillach, P.; Palmer, N.P.; Chiovato, L.; Cimino, J.; Waitman, L.R.; Omenn, G.S.; Malovini, A.; et al. International electronic health record-derived COVID-19 clinical course profiles: The 4CE consortium. NPJ Digit. Med. 2020, 3, 109.
Pollard, T.; Johnson, A.E.W.; Raffa, J.D.; Celi, L.A.; Mark, R.G.; Badawi, O. The eICU Collaborative Research Database, a freely available multi-center database for critical care research. Sci. Data 2018, 5, 180178.
Bennett, T.D.; Moffitt, R.A.; Hajagos, J.G.; Amor, B.; Anand, A.; Bissell, M.M.; Bradwell, K.R.; Bremer, C.; Byrd, J.B.; Denham, A.; et al. Clinical characterization and prediction of clinical severity of SARS-CoV-2 infection among US adults using data from the US National COVID Cohort Collaborative. JAMA Netw. Open 2021, 4, e2116901.
Observational Health Data Sciences and Informatics. ATHENA. Available online: https://athena.ohdsi.org/search-terms/start (accessed on 3 November 2025).
NHS. Critical Care Level. Available online: https://datadictionary.nhs.uk/attributes/critical_care_level.html (accessed on 3 November 2025).
NHS Health Research Authority. DECOVID V1 [COVID-19] (Ethics Approval). Available online: https://www.hra.nhs.uk/planning-and-improving-research/application-summaries/research-summaries/decovid-v1/ (accessed on 3 November 2025).
Public Health England. NHS Acute (Hospital) Trust Catchment Populations. Available online: https://app.powerbi.com/view?r=eyJrIjoiODZmNGQ0YzItZDAwZi00MzFiLWE4NzAtMzVmNTUwMThmMTVlIiwidCI6ImVlNGUxNDk5LTRhMzUtNGIyZS1hZDQ3LTVmM2NmOWRlODY2NiIsImMiOjh9 (accessed on 3 November 2025).
Oakley, C.; Pascoe, C.; Balthazor, D.; Bennett, D.; Gautam, N.; Isaac, J.; Isherwood, P.; Matthews, T.; Murphy, N.; Oelofse, T.; et al. Assembly Line ICU: What the Long Shops taught us about managing surge capacity for COVID-19. BMJ Open Qual. 2020, 9, e001117.
Epic Systems Corporation. Software Verona, Wisconsin. Available online: https://www.epic.com/ (accessed on 3 November 2025).
University Hospitals Birmingham NHS Foundation Trust. Birmingham Systems PICS. Available online: https://web.archive.org/web/20170806072442/https://www.uhb.nhs.uk/birmingham-systems-pics.htm (accessed on 3 November 2025).
Observational Health Data Sciences and Informatics. OMOP Common Data Model. Available online: https://ohdsi.github.io/CommonDataModel/ (accessed on 3 November 2025).
LOINC (Regenstrief Institute, Inc.). Logical Observation Identifiers Names and Codes. Available online: https://loinc.org/ (accessed on 3 November 2025).
SNOMED International. Systematized Nomenclature of Medicine Clinical Terms. Available online: https://www.snomed.org/ (accessed on 3 November 2025).
NHS Digital. Read Codes. Available online: https://digital.nhs.uk/services/terminology-and-classifications/read-codes (accessed on 3 November 2025).
World Health Organization. International Statistical Classification of Diseases and Related Health Problems (ICD). Available online: https://www.who.int/standards/classifications/classification-of-diseases (accessed on 3 November 2025).
DECOVID Data Paper GitHub Repository. Available online: https://github.com/alan-turing-institute/DECOVID-data-paper/ (accessed on 3 November 2025).
Bakewell, N.; Goudie, R.J.B.; Gardiner, S.; Karoune, E.; Rockenschaub, P.; Green, B.; Nicholls, H.; Whitaker, K.J.; Aslett, L. Alan-Turing-Institute/DECOVID-Data-Paper: DECOVID Data Paper Repository; Version V1; Zenodo: London, UK, 2025.
Rusanov, A.; Weiskopf, N.G.; Wang, S.; Weng, C. Hidden in plain sight: Bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Med. Inform. Decis. Mak. 2014, 14, 51.
NHS Digital. Hospital Admitted Patient Care Activity 2020-21. Available online: https://digital.nhs.uk/data-and-information/publications/statistical/hospital-admitted-patient-care-activity/2020-21 (accessed on 3 November 2025).
Teasdale, G.; Jennett, B. Assessment of coma and impaired consciousness: A practical scale. Lancet 1974, 304, 81–84.
Royal College of Physicians. National Early Warning Score (NEWS) 2: Standardising the Assessment of Acute-Illness Severity in the NHS; Updated report of a working party; RCP: London, UK, 2017.
RECOVERY Collaborative Group; Horby, P.; Lim, W.S.; Emberson, J.R.; Mafham, M.; Bell, J.L.; Linsell, L.; Staplin, N.; Brightling, C.; Ustianowski, A. Dexamethasone in hospitalized patients with COVID-19. N. Engl. J. Med. 2021, 384, 693–704.
RECOVERY Collaborative Group. Tocilizumab in patients admitted to hospital with COVID-19 (RECOVERY): A randomised, controlled, open-label, platform trial. Lancet 2021, 397, 1637–1645.
Office for National Statistics. Ethnic Group, National Identity and Religion. Available online: https://www.ons.gov.uk/methodology/classificationsandstandards/measuringequality/ethnicgroupnationalidentityandreligion (accessed on 3 November 2025).
NHS Digital. NHS Data Model and Dictionary, Admission Source. Available online: https://www.datadictionary.nhs.uk/attributes/admission_source.html?hl=admission%2Csource (accessed on 3 November 2025).
Poulos, J.; Zhu, L.; Shah, A.D. Data gaps in electronic health record (EHR) systems: An audit of problem list completeness during the COVID-19 pandemic. Int. J. Med. Inform. 2021, 150, 104452.
NHS Digital. NHS Data Model and Dictionary, Destination of Discharge. Available online: https://www.datadictionary.nhs.uk/attributes/destination_of_discharge.html (accessed on 3 November 2025).
Codd, E.F. Further normalization of the database relational model. In Data Base Systems. Courant Computer Science Symposium, 6th ed.; Rustin, E., Ed.; Prentice-Hall: Englewood Cliffs, NJ, USA, 1972; pp. 33–64.
Observational Health Data Sciences and Informatics. Standardized Clinical Data Tables. Available online: https://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:standardized_clinical_data_tables (accessed on 3 November 2025).
NHS Business Services Authority. Dictionary of Medicines and Devices (DM+D). Available online: https://www.nhsbsa.nhs.uk/pharmacies-gp-practices-and-appliance-contractors/dictionary-medicines-and-devices-dmd (accessed on 3 November 2025).
NHS. Consultant Episode (Hospital Provider). Available online: https://datadictionary.nhs.uk/nhs_business_definitions/consultant_episode__hospital_provider_.html (accessed on 3 November 2025).
Cowie, M.R.; Blomster, J.I.; Curtis, L.H.; Duclaux, S.; Ford, I.; Fritz, F.; Goldman, S.; Janmohamed, S.; Kreuzer, J.; Leenay, M.; et al. Electronic health records to facilitate clinical research. Clin. Res. Cardiol. 2017, 106, 1–9.
OHDSI. Data Quality Dashboard. Available online: https://github.com/OHDSI/DataQualityDashboard (accessed on 3 November 2025).
OHDSI. Software Tools. Available online: https://www.ohdsi.org/software-tools/ (accessed on 3 November 2025).
Kahn, M.G.; Callahan, T.J.; Barnard, J.; Bauck, A.E.; Brown, J.; Davidson, B.N.; Estiri, H.; Goerg, C.; Holve, E.; Johnson, S.G.; et al. A harmonized data quality assessment terminology and framework for the secondary use of electronic health record data. EGEMS 2016, 4, 1244.
Palmer, E. d.inspectEHR. Available online: https://github.com/DocEd/d.inspectEHR (accessed on 3 November 2025).
United Kingdom Data Service. What Is the Five Safes Framework? Available online: https://ukdataservice.ac.uk/help/secure-lab/what-is-the-five-safes-framework/ (accessed on 3 November 2025).
NHS England Data Security and Protection Toolkit. Available online: https://www.dsptoolkit.nhs.uk/Help/3 (accessed on 3 November 2025).
PIONEER Data Request Process. Available online: https://www.pioneerdatahub.co.uk/data/data-request-process/ (accessed on 3 November 2025).
PIONEER Data Service & Costs. Available online: https://www.pioneerdatahub.co.uk/data/data-services-costs/ (accessed on 3 November 2025).
PIONEER Data Request Form User Guide. Available online: https://www.pioneerdatahub.co.uk/wp-content/uploads/PIONEER-DRF-User-Guide-Final.pdf (accessed on 3 November 2025).
The UK Caldicott Guardian Council. The Caldicott Principles. Available online: https://www.ukcgc.uk/the-caldicott-principles (accessed on 3 November 2025).
DECOVID. Research. Available online: https://web.archive.org/web/20211206225110/https://www.decovid.org/research (accessed on 3 November 2025).
Office for National Statistics. Deaths Involving COVID-19 by Local Area and Socioeconomic Deprivation. Available online: https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/bulletins/deathsinvolvingcovid19bylocalareasanddeprivation/deathsoccurringbetween1marchand31july2020 (accessed on 3 November 2025).
Leslie, D.; Mazumder, A.; Peppin, A.; Wolters, M.K.; Hagerty, A. Does “AI” stand for augmenting inequality in the era of COVID-19 healthcare? BMJ 2021, 372, 304.
Public Health England. Beyond the Data: Understanding the Impact of COVID-19 on BAME Groups. Available online: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/892376/COVID_stakeholder_engagement_synthesis_beyond_the_data.pdf (accessed on 3 November 2025).
Abrams, E.M.; Szefler, S.J. COVID-19 and the impact of social determinants of health. Lancet Respir. Med. 2020, 8, 659–661.
Quinn, S.C.; Kumar, S. Health inequalities and infectious disease epidemics: A challenge for global health security. Biosecur. Bioterror. 2014, 12, 263–273.
Chowkwanyun, M.; Reed, A.L., Jr. Racial health disparities and COVID-19—Caution and context. N. Engl. J. Med. 2020, 383, 201–203.
Agniel, D.; Kohane, I.S.; Weber, G.M. Biases in electronic health record data due to processes within the healthcare system: Retrospective observational study. BMJ 2018, 361, 1479.

Figure 1. Age-sex bar-plot pyramid, stacked by hospital Trust (note that the x-axes have different scales). Patients were classified as having COVID-19 if they had a PCR-positive test for SARS-CoV-2 or a COVID-19 clinical diagnosis confirmed/suspected during or within 14 days prior to presentation at the hospital (visit start date) for any visit in the DECOVID database.
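The COVID-19 classification rule used in the figure captions can be expressed as a date-window test. The following Python sketch shows one reading of it and is not the consortium's implementation: `evidence_dates` stands for the dates of PCR-positive tests or confirmed/suspected COVID-19 diagnoses, "during" is taken to mean up to and including the visit end date, and both window bounds are treated as inclusive. All function names are hypothetical.

```python
from datetime import date, timedelta

# Look-back window before presentation (visit start date), per the caption.
LOOKBACK = timedelta(days=14)


def evidence_in_window(visit_start, visit_end, evidence_date):
    """True if evidence falls within 14 days before visit start or during the visit.

    visit_end may be None for a visit that has not ended (assumed open-ended).
    """
    if evidence_date < visit_start - LOOKBACK:
        return False
    return visit_end is None or evidence_date <= visit_end


def patient_has_covid(visits, evidence_dates):
    """Classify a patient as COVID-19 if any visit has qualifying evidence.

    `visits` is a list of (visit_start_date, visit_end_date) tuples.
    """
    return any(
        evidence_in_window(start, end, d)
        for start, end in visits
        for d in evidence_dates
    )
```

Because the rule applies to any visit in the database, a patient with one qualifying visit is classified as having COVID-19 even if their other visits have no such evidence.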

Figure 2. Bar plots of ethnicity groups in each hospital Trust, by COVID-19 status.

Figure 3. Bar plot of selected chronic conditions in each hospital Trust, by COVID-19 status. The following Athena codes [18] were used to identify chronic conditions: 321588 (Heart disease), 201820 (Diabetes mellitus), 4063381 (Chronic respiratory disease) and 46271022 (Chronic kidney disease).

Figure 4. Bar plot time series of suspected and confirmed COVID-19 cases, stacked by hospital Trust (weekly case counts <10 were suppressed to 10 in the time series plot). Patients were classified as having COVID-19 if they had a PCR-positive test for SARS-CoV-2 or a COVID-19 clinical diagnosis confirmed/suspected during or within 14 days prior to presentation at the hospital (visit start date) for any visit in the DECOVID database.
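The small-count suppression described in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions rather than the plotting code used for the figure: weeks are taken to be ISO weeks, the function names are hypothetical, and weeks with zero cases are left at zero unless `include_zero` is set, since the caption does not say how zero counts were handled.

```python
from collections import Counter
from datetime import date


def weekly_case_counts(case_dates):
    """Count cases per (ISO year, ISO week)."""
    return dict(Counter(d.isocalendar()[:2] for d in case_dates))


def suppress_small_counts(weekly_counts, threshold=10, include_zero=False):
    """Replace counts below `threshold` with `threshold` for display.

    Zero counts are kept as zero unless include_zero is True (assumption).
    """
    suppressed = {}
    for week, n in weekly_counts.items():
        is_small = n < threshold and (n > 0 or include_zero)
        suppressed[week] = threshold if is_small else n
    return suppressed
```

For example, a week with 3 cases would be plotted as 10, while a week with 12 cases would be plotted unchanged.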

Figure 5. Example of the granularity of data in the DECOVID database for a single patient stay.

Figure 6. Data and quality evaluation pipeline. The solid lines and arrows represent flow of data; the dotted lines represent comparisons made and associated flow of “knowledge”. Data from hospitals are harmonized into the OMOP format and uploaded into the Data Safe Haven for unification. Abbreviations: EHR = electronic health record, OMOP = Observational Medical Outcomes Partnership Common Data Model version 5, UHB = University Hospitals Birmingham, UCLH = University College London Hospitals, QE = quality evaluation, ETL = extract, transform, load.

Figure 7. OMOP-CDM accounting for DECOVID revisions. Only fields used in DECOVID are displayed for the clinical data tables. Adapted from [43].

Table 2. Clinical data tables in the DECOVID database schema.

Table Name	Description
person	Contains records that uniquely identify each patient in the source data. The sex at birth of patients is recorded in the gender_concept_id field. This field is mapped to standard mapping values of male, female, or other/unknown, which includes cases where the sex at birth of a patient is withheld, or not asked/missing. Note, the name of this field is the naming convention used in the person table of the OMOP CDM, which is why the name has not been revised to sex. The ethnicity of patients is recorded in the race_concept_id column, according to the 18 ethnic groups used in the 2011 England and Wales Census [38].
death	Contains the clinical event for when a patient dies, including both in-hospital and out-of-hospital deaths (extracted from the NHS Spine up until 31 March 2021). Cause of death is not included, and the date of death, rather than the precise time of death, is recorded.
visit_occurrence	Contains records on the spans of time describing a patient’s individual episodes of care/visits.
visit_detail	Contains records on clinically meaningful movements of a patient within each record of the parent visit_occurrence table. Each row represents a movement between geographically separate care sites within a hospital Trust (e.g., patient transferred from one Adult Inpatient Ward to another Adult Inpatient Ward (geographically distinct from the first)) or between care sites within a hospital (e.g., patient transferred from A and E Majors to ICU within a hospital).
condition_occurrence	Contains records on the presence of a disease or medical condition stated as a diagnosis, a sign, or a symptom, which is either observed by a provider or reported by the patient. The concepts in this table were mapped from vocabularies, including diagnosis standards such as SNOMED-CT and ICD-9/10.
measurement	Contains records of measurements, i.e., structured values (numerical or categorical) obtained about a patient or a patient’s clinical/biological sample. The concepts of this table were primarily mapped from SNOMED and LOINC codes. There are 135 distinct clinical measurement types.
specimen	Contains records identifying clinical/biological samples from a patient.
drug_exposure	Contains records about the utilization of a drug when ingested or otherwise introduced into the body. The concepts used in this table were primarily mapped from SNOMED codes.
procedure_occurrence	In DECOVID, this contains only records on the insertion and removal of endotracheal and tracheostomy tubes. The concepts used in this table were all mapped from SNOMED codes.
fact_relationship	Contains records (i.e., facts) that belong to OMOP-CDM domains (e.g., Measurement) and their relationship(s) with other records from any of the OMOP-CDM data tables that may belong to the same OMOP-CDM domain or a different OMOP-CDM domain.
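Because each visit_detail row represents one movement and later rows reference their predecessor through preceding_visit_detail_id, a patient's pathway can be reconstructed by following those links. The sketch below is a minimal illustration, assuming rows are plain dictionaries and that the rows passed in form a single linear chain (exactly one row with a null preceding_visit_detail_id); the function name is hypothetical.

```python
def order_visit_details(rows):
    """Order visit_detail rows by following preceding_visit_detail_id links.

    Raises ValueError if the rows do not form one unbroken linear chain.
    """
    # Index each row by the id of the row that precedes it.
    by_preceding = {r["preceding_visit_detail_id"]: r for r in rows}
    ordered = []
    current = by_preceding.get(None)  # the first movement has no predecessor
    while current is not None and len(ordered) < len(rows):
        ordered.append(current)
        current = by_preceding.get(current["visit_detail_id"])
    if len(ordered) != len(rows):
        raise ValueError("visit_detail rows do not form a single linear chain")
    return ordered
```

Raising on a broken chain also serves as a simple consistency check on preceding_visit_detail_id before the ordered pathway is used in analysis.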

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

DECOVID Consortium; Aslett, L.J.M.; Avramescu, A.; Bakewell, N.; Birds, I.; Bowler, L.; Camilleri, M.P.J.; Chung, S.-C.; Clifton, D.A.; Cohen, S.N.; et al. DECOVID: A UK Two-Center Harmonized Database of Acute Care Electronic Health Records for COVID-19 Research. Data 2025, 10, 195. https://doi.org/10.3390/data10120195
