Phenotyping Diabetes Mellitus on Aggregated Electronic Health Records from Disparate Health Systems

Tan, Hui Xing; Lim, Rachel Li Ting; Ang, Pei San; Foo, Belinda Pei Qin; Koon, Yen Ling; Neo, Jing Wei; Ng, Amelia Jing Jing; Tan, Siew Har; Teo, Desmond Chun Hwee; Tham, Mun Yee; Yap, Aaron Jun Yi; Ng, Nicholas Kai Ming; Loke, Celine Wei Ping; Peck, Li Fung; Huang, Huilin; Dorajoo, Sreemanee Raaj

doi:10.3390/pharma2030019

by Hui Xing Tan, Rachel Li Ting Lim, Pei San Ang, Belinda Pei Qin Foo, Yen Ling Koon, Jing Wei Neo, Amelia Jing Jing Ng, Siew Har Tan, Desmond Chun Hwee Teo, Mun Yee Tham, Aaron Jun Yi Yap, Nicholas Kai Ming Ng, Celine Wei Ping Loke, Li Fung Peck, Huilin Huang and Sreemanee Raaj Dorajoo *

Vigilance & Compliance Branch, Health Products Regulation Group, Health Sciences Authority, Singapore 138667, Singapore

* Author to whom correspondence should be addressed.

Pharmacoepidemiology 2023, 2(3), 223-235; https://doi.org/10.3390/pharma2030019

Submission received: 20 April 2023 / Revised: 2 June 2023 / Accepted: 28 June 2023 / Published: 3 July 2023

Abstract

Background: Identifying patients with diabetes mellitus (DM) is often performed in epidemiological studies using electronic health records (EHR), but currently available algorithms have features that limit their generalizability. Methods: We developed a rule-based algorithm to determine DM status using the nationally aggregated EHR database. The algorithm was validated on two chart-reviewed samples (n = 2813) of (a) patients with atrial fibrillation (AF, n = 1194) and (b) randomly sampled hospitalized patients (n = 1619). Results: DM diagnosis codes alone resulted in a sensitivity of 77.0% and 83.4% in the AF and random hospitalized samples, respectively. The proposed algorithm combines blood glucose values and DM medication usage with diagnostic codes and exhibits sensitivities between 96.9% and 98.0%, while positive predictive values (PPV) ranged between 61.1% and 75.6%. Performances were comparable across sexes, but a lower specificity was observed in older patients (65 and above versus below 65) in both validation samples (75.8% vs. 90.8% and 60.6% vs. 88.8%). The algorithm was robust for missing laboratory data but not for missing medication data. Conclusions: In this nationwide EHR database analysis, an algorithm for identifying patients with DM has been developed and validated. The algorithm supports quantitative bias analyses in future studies involving EHR-based DM studies.

Keywords:

diabetes; clinical phenotyping; electronic medical record; rule-based algorithm; validation studies; pharmacovigilance

1. Introduction

Identifying patients with diabetes mellitus (DM) is often an obligatory feature in many electronic health record (EHR)-based analyses for defining and applying inclusion/exclusion criteria and addressing possible confounding and effect modifications.

Many algorithms for detecting DM in EHR data already exist, but they are predominantly derived from data arising from a single or common health system in which critical data elements used to define DM status (e.g., coded diagnoses and specific medication exposures) are presumably identical throughout the data [1,2,3,4,5]. These features of existing algorithms undermine their applicability in external settings such as ours in Singapore, where data from various medical record systems across the country are aggregated with minimal processing. These aggregated medical records are primarily intended for care provision, in a setting where patients routinely consult multiple providers across different health systems over time. Challenges arise, however, when this relatively unharmonized database is analyzed for insights. While the upfront conversion of all data contributors to a common data model is a viable strategy for circumventing issues of disparate data schemas when conducting multi-center analyses [6,7,8,9], the migration of source data into the data model on a regular basis can be considerably burdensome [10,11].

In this validation study, we sought to develop an algorithm that is sufficiently flexible to identify patients with DM in an aggregated database of diverse EHR sources. A combination of EHR data elements was used, and the algorithm’s accuracy and consistency were assessed on two datasets comprising 2813 chart-reviewed patients. The first dataset was a group of 1194 patients who were hospitalized and newly diagnosed with atrial fibrillation (AF) and who had initiated oral anticoagulation therapy in 2019 or 2020. The second was a randomly sampled set of patients admitted to any public healthcare institution in 2019 or 2020 who had the required data elements for the gold standard labelling of diabetes status through chart reviews (n = 1619).

2. Results

There were a total of 608 and 586 patients in the 2019 and 2020 AF cohorts and 808 and 811 patients in the 2019 and 2020 random hospitalized samples, respectively. Sex distributions were similar across both samples (Table 1). As expected, patients in the random sample were younger (mean age 47.5 and 45.8 years in 2019 and 2020, respectively) than those in the AF cohort (mean age 72.2 and 72.4 years in 2019 and 2020, respectively). There was also a larger proportion of DM patients in the AF cohorts (37.5 and 39.1%) than in the random sample (24.5 and 20.8%, Table 1).

Table 1. Demographic profile of patients in both study samples.

		Atrial Fibrillation Cohort (n = 1194)		Random Hospitalized Sample (n = 1619)
		2019 (n = 608)	2020 (n = 586)	2019 (n = 808)	2020 (n = 811)
Sex, n (%)	Male	305 (50.2%)	310 (52.9%)	380 (47.0%)	401 (49.4%)
Sex, n (%)	Female	303 (49.8%)	276 (47.1%)	428 (53.0%)	410 (50.6%)
Race, n (%)	Chinese	451 (74.2%)	458 (78.2%)	514 (63.6%)	489 (60.3%)
	Malay	92 (15.1%)	81 (13.8%)	139 (17.2%)	137 (16.8%)
	Indian	29 (4.8%)	25 (4.3%)	84 (10.4%)	99 (12.3%)
	Others	36 (5.9%)	22 (3.8%)	71 (8.8%)	86 (10.6%)
Age	Mean	72.2	72.4	47.5	45.8
Age	Standard deviation	11.8	12.0	28.8	27.5
Diabetes	Yes	228 (37.5%)	229 (39.1%)	198 (24.5%)	169 (20.8%)
Diabetes	No	380 (62.5%)	357 (60.9%)	610 (75.5%)	642 (79.2%)

Collectively, 50.0% (n = 597) and 36.1% (n = 584) of patients were predicted to have DM in the AF and random samples, respectively, using an algorithm designed to classify each record through sequential checkpoints that screened for the presence of DM-related diagnosis codes, abnormal laboratory tests, and DM medications. Figure 1 illustrates the number of patients identified at each stage of the algorithm for the combined AF cohort.

The sensitivity and positive predictive value (PPV) ranged from 96.9 to 98.0% and from 61.1 to 75.6%, respectively, across all groups (Table 2). The PPV was notably lower in the random hospitalized sample by approximately 12 to 15 percentage points compared to that of the AF cohort. False-negatives were, however, uncommon, as illustrated by the high negative predictive values (NPV) ranging between 97.5 and 99.3%.

With diagnosis codes alone, modest sensitivity values of 77.0% and 83.4% were achieved (Table 3). When the laboratory test and medication criteria were added, the sensitivity rose to 97.4% for the AF cohort and 97.8% for the random sample. The majority of the DM patients were identified by the diagnosis and laboratory test checkpoints, likely due to their sequential application, although there were marked increases in false-positives on applying the laboratory test criteria (Table 3).
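The cumulative sensitivities reported in Table 3 follow from dividing the true positives accumulated at each checkpoint by the number of chart-confirmed DM patients. The short Python sketch below reproduces that arithmetic for the combined AF cohort; the variable and function names are illustrative and are not taken from the study's own code.

```python
# Counts taken from Table 3 (combined AF cohort). Names are illustrative only.
CONFIRMED_DM = 457
CUMULATIVE_TRUE_POSITIVES = {
    "diagnosis codes": 352,
    "diagnosis codes + laboratory tests": 422,
    "diagnosis codes + laboratory tests + medications": 445,
}


def cumulative_sensitivity(true_positives: int, confirmed: int) -> float:
    """Sensitivity (%) = true positives so far / all chart-confirmed cases."""
    return round(100 * true_positives / confirmed, 1)


sensitivities = {
    step: cumulative_sensitivity(tp, CONFIRMED_DM)
    for step, tp in CUMULATIVE_TRUE_POSITIVES.items()
}

for step, value in sensitivities.items():
    print(f"{step}: {value}%")
```

Substituting the random hospitalized sample counts (306, 355 and 359 of 367) gives the second set of values in the same table.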

The algorithm performed consistently across the age and sex subgroups, with high sensitivity and NPV but lower specificity and PPV across all strata. In both samples, the specificity was higher in the younger age group (64 years and below) than in those aged 65 and above (90.8% vs. 75.8% and 88.8% vs. 60.6%) (Table 4).

A total of 152 false-positives and 12 false-negatives were found in the AF cohort, and 225 false-positives and 8 false-negatives were found in the random hospitalized sample. While the majority of the misclassifications occurred because DM was often stated in the hospital discharge summary but not captured in the structured data elements (diagnosis, laboratory tests or medication records data), there were also other reasons for misclassification, such as the patient having impaired fasting glucose, pre-diabetes or hyperglycemia due to other reasons (Table 5).
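The false-positive and false-negative counts above, together with the sample sizes and chart-confirmed DM counts, are enough to rebuild the full confusion matrix for each combined sample. The sketch below does this in Python as an illustration only; the `Confusion` class and `from_counts` helper are hypothetical names, and the combined-sample percentages it yields are derived here rather than quoted from the paper's tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int
    fp: int
    fn: int
    tn: int

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


def from_counts(total: int, confirmed_dm: int, fp: int, fn: int) -> Confusion:
    """Derive TP and TN from the sample size and the chart-review totals."""
    tp = confirmed_dm - fn
    tn = total - confirmed_dm - fp
    return Confusion(tp=tp, fp=fp, fn=fn, tn=tn)


af = from_counts(total=1194, confirmed_dm=457, fp=152, fn=12)
random_sample = from_counts(total=1619, confirmed_dm=367, fp=225, fn=8)

for name, cm in (("AF", af), ("Random", random_sample)):
    print(
        f"{name}: sens={cm.sensitivity:.1%} spec={cm.specificity:.1%} "
        f"PPV={cm.ppv:.1%} NPV={cm.npv:.1%}"
    )
```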

When the algorithm was modified to simulate scenarios of missing medication or laboratory test data (i.e., only diagnosis codes with either laboratory tests or medication data but not both), sensitivity fell in both samples, but to a larger degree in the combined AF cohort (Table 6). While the availability of laboratory test data (but missing medications) led to a smaller loss in sensitivity compared to having medication data (but missing laboratory tests), considerably higher PPV and specificity were observed when medication data were available but laboratory tests were missing, suggesting that elevated glucose tests are more sensitive but DM medication use is more specific.

In terms of demographics, the DM cohorts identified by all three algorithms had similar age and sex distributions as compared with the actual DM patients. However, in the DM cohorts identified using (i) all three criteria and (ii) excluding medications, the proportion of Chinese patients identified was slightly higher as compared to the actual DM group (Table A7).

3. Methodology

3.1. Study Setting and Algorithm Development

The database includes patients with visits to all public healthcare facilities and captures approximately 85% of all nationwide acute hospital admissions and over 40% of all chronic disease outpatient visits [12]. An exploratory exercise was undertaken to identify potentially useful data elements that could help identify patients with DM in this database. All patients who fulfilled at least one of the following criteria (between 2018 and 2021) were first identified: (a) presence of a Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) or International Classification of Diseases, Ninth or Tenth Revision (ICD-9 or ICD-10) code related to DM, (b) an abnormal blood glucose or glycated hemoglobin (HbA1c) laboratory test result or (c) a prescription for any DM-related medication. Commonly used DM diagnosis codes, medications, and laboratory tests for measuring blood glucose levels, along with their upper bound thresholds and observed test frequencies, were shortlisted and used to derive an algorithm for identifying DM patients (Figure 2). The full lists of shortlisted diagnosis codes, laboratory tests and medications are found in Table A1, Table A2, Table A3, Table A4 and Table A5 of Appendix A [1,2,3].

Patients are categorized as diabetic if any one of the following is fulfilled: presence of a DM-related diagnosis code, presence of at least two glucose or HbA1c laboratory tests above the upper limit of normal taken at least 30 days apart, or a prescription for any DM-related medication. For ease of deployment, the algorithm was modularly designed to allow for assessing one data element at a time. As our database includes records from different healthcare institutions using a variety of laboratory assay equipment, defining a fixed threshold for the upper bound of normal values on all relevant blood glucose tests was not possible, as different facilities have slightly varying reference ranges. Setting-specific reference ranges are therefore used to identify abnormally high test results.
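The three rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the record structure, function names, and the small code and medication sets are hypothetical (the full lists are in Tables A1, A2 and A5), any mix of glucose and HbA1c tests is assumed to count towards the two abnormal results, and "above the upper limit of normal" is read as a strict inequality against a per-result, setting-specific limit.

```python
from dataclasses import dataclass, field
from datetime import date

# Illustrative subsets only; the full lists are in Tables A1, A2 and A5.
DM_DIAGNOSIS_CODES = {"73211009", "44054006", "46635009", "25000"}
DM_MEDICATIONS = {"metformin", "gliclazide", "insulin glargine"}


@dataclass
class LabResult:
    taken_on: date
    value: float
    upper_limit_of_normal: float  # setting-specific reference range


@dataclass
class PatientRecord:
    diagnosis_codes: set = field(default_factory=set)
    labs: list = field(default_factory=list)
    medications: set = field(default_factory=set)


def has_dm_diagnosis(record: PatientRecord) -> bool:
    return bool(record.diagnosis_codes & DM_DIAGNOSIS_CODES)


def has_repeated_abnormal_labs(record: PatientRecord, min_gap_days: int = 30) -> bool:
    """True if two abnormal results exist that are at least min_gap_days apart."""
    abnormal_dates = sorted(
        lab.taken_on for lab in record.labs
        if lab.value > lab.upper_limit_of_normal
    )
    if len(abnormal_dates) < 2:
        return False
    # The earliest and latest abnormal dates give the widest possible gap.
    return (abnormal_dates[-1] - abnormal_dates[0]).days >= min_gap_days


def has_dm_medication(record: PatientRecord) -> bool:
    return any(name.lower() in DM_MEDICATIONS for name in record.medications)


def classify_dm(record: PatientRecord, use_labs: bool = True, use_meds: bool = True) -> bool:
    """Apply the checkpoints in sequence; each module can be switched off."""
    if has_dm_diagnosis(record):
        return True
    if use_labs and has_repeated_abnormal_labs(record):
        return True
    if use_meds and has_dm_medication(record):
        return True
    return False


patient = PatientRecord(
    labs=[
        LabResult(date(2019, 1, 5), 8.1, 6.0),
        LabResult(date(2019, 3, 1), 7.9, 6.0),
    ]
)
print(classify_dm(patient))                  # laboratory data available
print(classify_dm(patient, use_labs=False))  # simulated missing laboratory data
```

The `use_labs` and `use_meds` switches mirror the modular design described above and allow the missing-data scenarios to be simulated by disabling one checkpoint at a time.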

3.2. Validation Population and Chart Review

We validated the algorithm on two distinct patient samples, each with data from 2019 and 2020. The first dataset was a pre-selected group of 1194 patients who were hospitalized and newly diagnosed with AF and who had initiated oral anticoagulation therapy in 2019 or 2020. Diabetes is an important risk factor that potentially influences complication risks in patients with AF, and it would therefore be of interest to accurately identify DM status amongst AF patients [13,14]. The second group was a randomly sampled set of patients admitted to any public healthcare institution in 2019 or 2020 who had the required data elements for the gold standard labelling of diabetes status through chart reviews (n = 1619), as with the two AF cohorts. Only data that were recorded before or on the discharge date of the patient’s inpatient admission episode were used. Stratified analyses were performed by age and sex, and the reasons for misclassification were reviewed for a sample of false-positives and false-negatives. The performance of the algorithm in instances of missing laboratory and medication data was additionally evaluated. Lastly, a comparison of the DM cohorts identified by each algorithm was performed to analyze the impact of the choice of algorithm on the final DM cohort selected.

Chart reviews were performed on all cases used for validation (n = 2813, from both samples) by 15 clinically trained pharmacovigilance officers. These officers had previously annotated a common set of 200 patient records (not included in this paper), achieving near-perfect agreement of 98.1% against the collectively derived gold standard label and good inter-annotator agreement of 0.88–1 (Table A6) for the presence of DM.
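The text does not name the statistic behind the pairwise agreement values in Table A6. Assuming, purely for illustration, that a chance-corrected measure such as Cohen's kappa is used, the agreement between two annotators' binary DM labels could be computed as sketched below. The function name and the example labels are hypothetical and are not study data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same records."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set.")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / n ** 2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for ten records (1 = DM, 0 = no DM).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Repeating this for every pair of annotators would fill a symmetric matrix with the same layout as Table A6.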

4. Discussion

DM poses a significant public health burden worldwide. A ‘War on Diabetes’ has been officially declared by the health ministry in Singapore, and diabetes has been made a key research focus area by national research funding agencies, with the aim to identify effective strategies for minimizing the impact of DM on its citizens and the health system [12]. With initiatives to make EHR data available for secondary analysis more readily, several forthcoming EHR-based epidemiological analyses on DM may be expected [15]. The proposed algorithm is therefore developed in anticipation of its use over time.

A key feature of this study is its relatively large validation samples, which include both narrowly and broadly defined patient populations. Previously proposed DM algorithms have often been developed from single institutions and validated on pre-selected rather than random samples [1,3]. Chart reviews were performed by reviewers after an initial run-in annotation phase to confirm inter-annotator agreement. Sensitivity analyses in different subgroups and in scenarios of missing data facilitate subsequent studies that apply the algorithm, where adjustments can be performed to quantitatively correct for misclassification bias when DM is studied as an exposure or outcome [16,17]. Nonetheless, the following limitations should be considered. First, as our database captures only unstructured notes from the inpatient setting (but not outpatient clinic visit notes), it was not possible to conduct comprehensive chart reviews of patients who were not hospitalized and consequently not possible to validate the algorithm on outpatients. Although the database captures the necessary data elements from outpatient visits, the algorithm’s performance remains unassessed in a healthier population that has not required hospital admission.

Second, the proposed algorithm has been designed to maximize sensitivity and therefore generates a substantial number of false-positive predictions. The main data element responsible for this is the laboratory test criterion of repeatedly elevated blood glucose levels. Glucose test results taken in the inpatient setting have been shown to be less specific, as these capture patients who may not have DM but rather other conditions manifesting in abnormal glucose metabolism [4]. If PPV is deemed more important in future studies, it is possible to simplify the algorithm by dropping the laboratory test requirement altogether, using only diagnosis codes and medication records to detect DM cases; the algorithm is fairly robust for missing laboratory test values, where the loss in sensitivity incurred is relatively small, but substantial improvements in PPV and specificity are observed. Overall, in terms of sensitivity and specificity, the algorithm performs comparably against previously published algorithms, although data source differences may limit some of these comparisons [1,2,18,19].

While DM medication use serves as a useful discriminatory factor for identifying DM patients at present, it is noteworthy that some classes of DM medications (such as GLP-1 agonists and SGLT2 inhibitors) are increasingly prescribed for non-DM indications, such as obesity and heart failure. While there may be considerable overlap of these conditions with DM, performance drift of the algorithm is possible over time. Drifts are, however, less likely to occur with algorithms primarily based on diagnosis codes and laboratory test values. Lastly, the current algorithm does not distinguish between the main subtypes of DM. Further work is necessary to identify patients with Type 1 DM of whom a substantial proportion may have been misdiagnosed as having Type 2 DM initially, only to have their diagnosis revised through subsequent testing [20,21]. Likewise, identifying patients with gestational diabetes requires a preceding algorithm to detect pregnancy status. The current algorithm nonetheless provides a starting point for developing subsequent DM subtype-specific algorithms.
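As an illustration of the kind of quantitative correction for misclassification mentioned earlier in this section, one standard approach (not one described by the authors) is the Rogan-Gladen estimator, which adjusts an algorithm-derived prevalence using known sensitivity and specificity. The sketch below applies it to the combined AF cohort using counts derived from Table 3; the function name is hypothetical.

```python
def rogan_gladen(apparent_prevalence: float, sensitivity: float, specificity: float) -> float:
    """Estimate true prevalence from an apparent (algorithm-derived) prevalence.

    true = (apparent + specificity - 1) / (sensitivity + specificity - 1)
    """
    denominator = sensitivity + specificity - 1
    if denominator <= 0:
        raise ValueError("Classifier is no better than chance; cannot correct.")
    estimate = (apparent_prevalence + specificity - 1) / denominator
    return min(1.0, max(0.0, estimate))  # clamp to the valid range


# Combined AF cohort, counts derived from Table 3:
# 597 predicted DM of 1194; 445 true positives of 457; 585 true negatives of 737.
apparent = 597 / 1194
sensitivity = 445 / 457
specificity = 585 / 737

corrected = rogan_gladen(apparent, sensitivity, specificity)
print(f"apparent prevalence:  {apparent:.1%}")
print(f"corrected prevalence: {corrected:.1%}")
```

Because the sensitivity and specificity here come from the same sample, the corrected value simply recovers the chart-reviewed prevalence (457 of 1194). In a new study, the apparent prevalence would come from that study's data, and the transferability of the validation estimates would need to be judged case by case.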

5. Conclusions

Identifying DM using diagnosis codes alone in EHR studies can generate inaccurate estimates of disease prevalence and measures of association relating to DM. An algorithm for detecting DM patients in this database has been developed and validated in two distinct chart-reviewed samples. The algorithm can be calibrated to prioritize PPV over sensitivity, if needed. The data presented in this paper support quantitative bias analyses by future investigators performing DM-related studies.

Author Contributions

Design conceptualization, H.X.T.; Data analysis, H.X.T., R.L.T.L., D.C.H.T. and S.R.D.; Manuscript writing, R.L.T.L., P.S.A., B.P.Q.F., Y.L.K., J.W.N., A.J.J.N., S.H.T., D.C.H.T., M.Y.T., A.J.Y.Y., N.K.M.N., C.W.P.L., L.F.P., H.H. and S.R.D.; Data collection, P.S.A., B.P.Q.F., Y.L.K., A.J.J.N., S.H.T., M.Y.T., A.J.Y.Y., N.K.M.N., C.W.P.L., L.F.P. and H.H.; Supervision, P.S.A. and S.R.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable because this analysis was conducted as part of activities to facilitate public health surveillance by a public health authority and does not constitute ‘research’.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are not available in the public domain. The analysis was conducted as part of public health surveillance (not research), and therefore the data used for this analysis cannot be considered ‘research data’.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. SNOMED-CT codes related to Type 1 and Type 2 DM that were used as criteria for DM patients.

Diagnosis Code	Description of Code
200687002	Cellulitis in diabetic foot
73211009	Diabetes mellitus (DM)
280137006	Diabetic foot
371087003	Diabetic foot ulcer
310505005	Diabetic hyperosmolar non-ketotic state
312912001	Diabetic macular oedema
399864000	Diabetic macular oedema not clinically significant
232020009	Diabetic maculopathy
25093002	Diabetic oculopathy (eye disease)
49455004	Diabetic polyneuropathy
268519009	Diabetic—poor control
127014009	Diabetic peripheral vascular disease (angiopathy)
127013003	Diabetic renal disease
4855003	Diabetic retinopathy
420789003	Diabetic retinopathy associated with OR due to Type 1 DM
232023006	Diabetic traction retinal detachment
312910009	Diabetic vitreous hemorrhage
402864004	Diabetic wet gangrene of the foot
441656006	Hyperglycemic crisis due to OR in DM
237633009	Hypoglycemia due to DM
421750000	Ketoacidosis due to Type 2 DM
420422005	Ketoacidosis in DM
426875007	Latent autoimmune DM in adults (LADA)
236499007	Microalbuminuric diabetic nephropathy
312903003	Mild non-proliferative diabetic retinopathy
312904009	Moderate non-proliferative diabetic retinopathy
230572002	Neuropathy due to DM
405749004	Newly diagnosed diabetes
390834004	Non-proliferative diabetic retinopathy (NPDR)/Background diabetic retinopathy (BDR)
59276001	Proliferative diabetic retinopathy (PDR)
236500003	Proteinuric diabetic nephropathy
312905005	Severe non-proliferative diabetic retinopathy
46635009	Type 1 DM Insulin-Dependent Diabetes Mellitus (IDDM)
44054006	Type 2 DM Non-Insulin-Dependent Diabetes Mellitus (NIDDM)
443694000	Type 2 DM uncontrolled
190331003	Type 2 DM with hyperosmolar coma

Table A2. ICD-9 code used as criterion for DM patients.

Diagnosis Code	Description of Code
25000	DM without mention of complication, T2 or unspecified type, not stated as uncontrolled

Table A3. Glucose laboratory threshold values that were used as criteria for DM patients.

Laboratory Test	Components of Blood	Threshold Values ^$
		mmol/L	mg/dL
Fasting glucose	Plasma/Serum/Venous	≥7.0	≥126
Glucose Tolerance Test (GTT)—Fasting	-	≥7.0	≥126
Random glucose	Plasma/Serum/Venous	≥11.1	≥200
Oral Glucose Tolerance Test (OGTT)—1 h	-	≥10.0	≥180
Glucose 1 h post-prandial	-	≥10.0	≥180
Glucose (60 min)	Plasma/Serum	≥10.0	≥180
Oral Glucose Tolerance Test (OGTT)—2 h	-	≥11.1	≥200
Glucose 2 h post-prandial	-	≥11.1	≥200
Glucose (120 min)	Plasma/Serum	≥11.1	≥200

^$ not used in final algorithm.

Table A4. HbA1c laboratory threshold value applied when phenotyping patients with DM [1].

Laboratory Test	Threshold Values
	%	mmol/mol
HbA1c	≥6.5	≥48
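The paired units in Tables A3 and A4 can be cross-checked with standard conversion factors. The sketch below assumes the usual glucose factor of 18.016 mg/dL per mmol/L (from a molar mass of 180.16 g/mol) and the NGSP-to-IFCC master equation for HbA1c; the function names are illustrative. Note that the fixed glucose thresholds in Table A3 were not used in the final algorithm, which relies on setting-specific reference ranges.

```python
GLUCOSE_MG_DL_PER_MMOL_L = 18.016  # glucose molar mass 180.16 g/mol


def glucose_mmol_l_to_mg_dl(mmol_per_l: float) -> float:
    return mmol_per_l * GLUCOSE_MG_DL_PER_MMOL_L


def hba1c_percent_to_mmol_mol(percent: float) -> float:
    """NGSP (%) to IFCC (mmol/mol) master equation."""
    return (percent - 2.15) * 10.929


for mmol in (7.0, 10.0, 11.1):
    print(f"{mmol} mmol/L = {round(glucose_mmol_l_to_mg_dl(mmol))} mg/dL")

print(f"HbA1c 6.5% = {round(hba1c_percent_to_mmol_mol(6.5))} mmol/mol")
```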

Table A5. List of DM-related medications, categorized according to their functions and drug classes, that were used as criteria for those with DM [2,3].

Drug Class	Active Ingredient		Brand Name
Biguanide	Metformin		Adimet
			Diabetmin
			Diabetmin XR
			Diamet
			Formet
			Glucient
			Meijumet
Thiazolidinedione	Pioglitazone		Actos
Sulfonylureas	Glipizide		Beapizide
			Diacon
			Diactin
			Dibizide
			Glynase
			Melizide
			Minidiab
			Sunglucon
	Gliclazide		Diamicron
			Diamicron MR
			Dianorm
			Diapro
			Gliavis
			Gliclada
			Glimicron
			Glizide
			Glynade
			Medoclazide
			Melicron
			Mexan
			Sun-gliclazide
			Sun-glizide
	Glimepiride		Amaryl
			Dialosa
			Diapride
	Glibenclamide		Benil
			Clamide
			Daonil
			Glyboral
	Tolbutamide		Tobumide
	Tolbutamide		Tolmide
Meglitinide	Repaglinide		Novonorm
Dipeptidyl peptidase-4 (DPP-4) inhibitors	Linagliptin		Trajenta
	Saxagliptin		Onglyza
	Sitagliptin		Januvia
	Vildagliptin		Galvus
GLP-1 Agonists (Incretin mimetics)	Dulaglutide		Trulicity
	Liraglutide		Saxenda
	Liraglutide		Victoza
	Semaglutide		Ozempic
	Semaglutide		Rybelsus
α-Glucosidase inhibitors	Acarbose		Garbose
α-Glucosidase inhibitors	Acarbose		Glucobay
Sodium-glucose co-transporter-2 (SGLT-2) inhibitor	Canagliflozin		Invokana
Sodium-glucose co-transporter-2 (SGLT-2) inhibitor	Ertugliflozin		Steglatro
Short-acting insulins (Bolus insulins)	Insulin aspart		Fiasp
	Insulin aspart		Novorapid
	Insulin glulisine		Apidra Solostar
	Insulin lispro		Humalog
	Regular (soluble/neutral) insulin		Actrapid
	Regular (soluble/neutral) insulin		Humulin R
Long-acting insulins (Basal insulins)	Insulin degludec		Ryzodeg
	Insulin degludec		Tresiba
	Insulin detemir		Levemir
	Insulin glargine		Basalog one
			Lantus Solostar
			Semglee
			Toujeo Solostar
	Neutral Protamine Hagedorn (NPH)/isophane insulin		Humulin N
	Neutral Protamine Hagedorn (NPH)/isophane insulin		Insulatard
Mixed insulins	Insulin aspart and insulin aspart protamine crystals		Novomix
	Insulin lispro and lispro protamine		Humalog mix
	Regular insulin and insulin isophane		Humulin 30/70
	Regular insulin and isophane insulin		Mixtard
Combination Medications	Vildagliptin	Metformin	Galvus Met
	Empagliflozin	Linagliptin	Glyxambi
	Glibenclamide	Metformin HCL	Glucovance
	Metformin/Metformin XR	Sitagliptin	Janumet/Janumet XR
	Metformin XR	Saxagliptin	Kombiglyze
	Linagliptin	Metformin HCL	Trajenta Duo
	Insulin glargine	Lixisenatide	Soliqua
	Sitagliptin	Ertugliflozin	Steglujan
	Dapagliflozin	Metformin/Metformin XR	Xigduo/Xigduo XR

Table A6. Inter-annotator agreement between 15 adjudicators for establishing DM status.

Annotator ID	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
1		1	1	0.99	1	0.99	0.94	1	0.93	0.99	0.99	0.95	0.95	1	0.99
2	1		1	0.99	1	0.99	0.94	1	0.93	0.99	0.99	0.95	0.95	1	0.99
3	1	1		0.99	1	0.99	0.94	1	0.93	0.99	0.99	0.95	0.95	1	0.99
4	0.99	0.99	0.99		0.99	0.98	0.95	0.99	0.92	0.98	0.98	0.94	0.96	0.99	0.98
5	1	1	1	0.99		0.99	0.94	1	0.93	0.99	0.99	0.95	0.95	1	0.99
6	0.99	0.99	0.99	0.98	0.99		0.93	0.99	0.92	0.98	0.98	0.96	0.94	0.99	0.98
7	0.94	0.94	0.94	0.95	0.94	0.93		0.94	0.92	0.93	0.93	0.89	0.96	0.94	0.93
8	1	1	1	0.99	1	0.99	0.94		0.93	0.99	0.99	0.95	0.95	1	0.99
9	0.93	0.93	0.93	0.92	0.93	0.92	0.92	0.93		0.94	0.92	0.88	0.95	0.93	0.92
10	0.99	0.99	0.99	0.98	0.99	0.98	0.93	0.99	0.94		0.98	0.94	0.96	0.99	0.98
11	0.99	0.99	0.99	0.98	0.99	0.98	0.93	0.99	0.92	0.98		0.94	0.94	0.99	0.98
12	0.95	0.95	0.95	0.94	0.95	0.96	0.89	0.95	0.88	0.94	0.94		0.9	0.95	0.94
13	0.95	0.95	0.95	0.96	0.95	0.94	0.96	0.95	0.95	0.96	0.94	0.9		0.95	0.94
14	1	1	1	0.99	1	0.99	0.94	1	0.93	0.99	0.99	0.95	0.95		0.99
15	0.99	0.99	0.99	0.98	0.99	0.98	0.93	0.99	0.92	0.98	0.98	0.94	0.94	0.99

Table A7. Demographic profile of DM patients identified by each algorithm.

Atrial Fibrillation Cohort (n = 1194)
		Actual DM Group (n = 457)	Diagnosis Codes and/or Laboratory Tests and/or Medications (n = 597)	Diagnosis Codes and/or Laboratory Tests (n = 574)	Diagnosis Codes and/or Medications (n = 456)
Sex, n (%)	Male	247 (54.0%)	314 (52.6%)	303 (52.8%)	247 (54.2%)
Sex, n (%)	Female	210 (46.0%)	283 (47.4%)	271 (47.2%)	209 (45.8%)
Race, n (%)	Chinese	328 (71.8%)	436 (73.0%)	424 (73.9%)	333 (73.0%)
	Malay	83 (18.2%)	105 (17.6%)	96 (16.7%)	79 (17.3%)
	Indian	26 (5.7%)	34 (5.7%)	33 (5.7%)	25 (5.5%)
	Others	20 (4.4%)	22 (3.7%)	21 (3.7%)	19 (4.2%)
Age	Mean	72.3	73.6	73.7	72.3
	Standard deviation	11.2	11.1	11.2	11.2
	Median	73.0	74.0	75.0	73.0
	Interquartile range	16.0	16.0	15.0	16.0
Random hospitalized sample (n = 1619)
		Actual DM group (n = 367)	Diagnosis codes and/or laboratory tests and/or medications (n = 584)	Diagnosis codes and/or laboratory tests (n = 573)	Diagnosis codes and/or medications (n = 382)
Sex, n (%)	Male	197 (53.7%)	319 (54.6%)	315 (55.0%)	198 (51.8%)
	Female	170 (46.3%)	265 (45.4%)	258 (45.0%)	184 (48.2%)
Race, n (%)	Chinese	237 (64.6%)	404 (69.2%)	398 (69.5%)	251 (65.7%)
	Malay	55 (15.0%)	71 (12.2%)	71 (12.4%)	56 (14.7%)
	Indian	50 (13.6%)	73 (12.5%)	68 (11.9%)	54 (14.1%)
	Others	25 (6.8%)	36 (6.2%)	36 (6.3%)	21 (5.5%)
Age	Mean	67.7	66.2	66.6	67.1
	Standard deviation	13.8	17.1	16.9	15.0
	Median	69.0	68.0	68.0	68.5
	Interquartile range	17.0	20.0	20.0	17.0

References

1. Upadhyaya, S.G.; Murphree, D.H.; Ngufor, C.G.; Knight, A.M.; Cronk, D.J.; Cima, R.R.; Curry, T.B.; Pathak, J.; Carter, R.E.; Kor, D.J. Automated Diabetes Case Identification Using Electronic Health Record Data at a Tertiary Care Facility. Mayo Clin. Proc. Innov. Qual. Outcomes 2017, 1, 100–110.
2. Kagawa, R.; Kawazoe, Y.; Ida, Y.; Shinohara, E.; Tanaka, K.; Imai, T.; Ohe, K. Development of Type 2 Diabetes Mellitus Phenotyping Framework Using Expert Knowledge and Machine Learning Approach. J. Diabetes Sci. Technol. 2017, 11, 791–799.
3. Weerahandi, H.M.; Horwitz, L.I.; Blecker, S.B. Diabetes Phenotyping Using the Electronic Health Record. J. Gen. Intern. Med. 2020, 35, 3716–3718.
4. Spratt, S.E.; Pereira, K.; Granger, B.B.; Batch, B.C.; Phelan, M.; Pencina, M.; Miranda, M.L.; Boulware, E.; Lucas, J.E.; Nelson, C.L.; et al. Assessing electronic health record phenotypes against gold-standard diagnostic criteria for diabetes mellitus. J. Am. Med. Inform. Assoc. 2017, 24, e121–e128.
5. Richesson, R.L.; Rusincovitch, S.A.; Wixted, D.; Batch, B.C.; Feinglos, M.N.; Miranda, M.L.; Hammond, W.E.; Califf, R.M.; Spratt, S.E. A comparison of phenotype definitions for diabetes mellitus. J. Am. Med. Inform. Assoc. 2013, 20, e319–e326.
6. Psaty, B.M.; Breckenridge, A.M. Mini-Sentinel and regulatory science--big data rendered fit and functional. N. Engl. J. Med. 2014, 370, 2165–2167.
7. Voss, E.A.; Makadia, R.; Matcho, A.; Martijn, S.; Knoll, C.; Schuemie, M.; DeFalco, F.J.; Londhe, A.; Zhu, V.; Ryan, P.B. Feasibility and utility of applications of the common data model to multiple, disparate observational health databases. J. Am. Med. Inform. Assoc. 2015, 22, 553–564.
8. Klann, J.G.; Abend, A.; Raghavan, V.A.; Mandl, K.D.; Murphy, S.N. Data interchange using i2b2. J. Am. Med. Inform. Assoc. 2016, 23, 909–915.
9. Fleurence, R.L.; Curtis, L.H.; Califf, R.M.; Platt, R.; Selby, J.V.; Brown, J.S. Launching PCORnet, a national patient-centered clinical research network. J. Am. Med. Inform. Assoc. 2014, 21, 578–582.
10. Schneeweiss, S. Learning from big health care data. N. Engl. J. Med. 2014, 370, 2161–2163.
11. Bourke, A.; Bate, A.; Sauer, B.C.; Brown, J.S.; Hall, G.C. Evidence generation from healthcare databases: Recommendations for managing change. Pharmacoepidemiol. Drug Saf. 2016, 25, 749–754.
12. Tan, C.C.; Lam, C.S.P.; Matchar, D.B.; Zee, Y.K.; Wong, J.E.L. Singapore’s health-care system: Key features, challenges, and shifts. Lancet 2021, 398, 1091–1104.
13. Christiansen, C.B.; Gerds, T.A.; Olesen, J.B.; Kristensen, S.L.; Lamberts, M.; Lip, G.Y.; Gislason, G.H.; Køber, L.; Torp-Pedersen, C. Atrial fibrillation and risk of stroke: A nationwide cohort study. Europace 2016, 18, 1689–1697.
14. Chao, T.F.; Lip, G.Y.; Liu, C.J.; Tuan, T.C.; Chen, S.J.; Wang, K.L.; Lin, Y.J.; Chang, S.L.; Lo, L.W.; Hu, Y.F.; et al. Validation of a Modified CHA2DS2-VASc Score for Stroke Risk Stratification in Asian Patients with Atrial Fibrillation: A Nationwide Cohort Study. Stroke 2016, 47, 2462–2469.
15. TRUST. Improving Health Outcomes through Trusted Data Exchange. Available online: https://trustplatform.sg/ (accessed on 2 February 2023).
16. Lash, T.L.; Olshan, A.F. EPIDEMIOLOGY Announces the “Validation Study” Submission Category. Epidemiology 2016, 27, 613–614.
17. Marshall, R.J. Validation study methods for estimating exposure proportions and odds ratios with misclassified data. J. Clin. Epidemiol. 1990, 43, 941–947.
18. Lo-Ciganic, W.; Zgibor, J.C.; Ruppert, K.; Arena, V.C.; Stone, R.A. Identifying type 1 and type 2 diabetic cases using administrative data: A tree-structured model. J. Diabetes Sci. Technol. 2011, 5, 486–493.
19. Lipscombe, L.L.; Hwee, J.; Webster, L.; Shah, B.R.; Booth, G.L.; Tu, K. Identifying diabetes cases from administrative data: A population-based validation study. BMC Health Serv. Res. 2018, 18, 316.
20. Bao, Y.K.; Ma, J.; Ganesan, V.C.; McGill, J.B. Mistaken Identity: Missed Diagnosis of Type 1 Diabetes in an Older Adult. Med. Res. Arch. 2019, 7, 1962.
21. Thomas, N.J.; Lynam, A.L.; Hill, A.V.; Weedon, M.N.; Shields, B.M.; Oram, R.A.; McDonald, T.J.; Hattersley, A.T.; Jones, A.G. Type 1 diabetes defined by severe insulin deficiency occurs after 30 years of age and is commonly treated as type 2 diabetes. Diabetologia 2019, 62, 1167–1172.

Figure 1. Algorithm flowchart applied to the combined 2019 and 2020 AF cohort (n = 1194). The value at each data element checkpoint is the total number of patients identified at that checkpoint.

Figure 2. Flowchart used to phenotype patients with diabetes mellitus.

Table 2. Performance of the algorithm on the AF cohort and random sample of hospitalized patients.

	Atrial Fibrillation Cohort (n = 1194)		Random Hospitalized Sample (n = 1619)
	2019 (n = 608)	2020 (n = 586)	2019 (n = 808)	2020 (n = 811)
Diabetes, yes, n	228	229	198	169
Sensitivity, %	97.8	96.9	98.0	97.6
Specificity, %	81.1	77.6	80.3	83.6
Positive predictive value, %	75.6	73.5	61.8	61.1
Negative predictive value, %	98.4	97.5	99.2	99.3

Table 3. Cumulative sensitivity at each data element checkpoint for the AF cohort and the random hospitalized sample.

	Data Element Checkpoint	Predicted to Have DM	Gold Standard: DM	Gold Standard: No DM (False-Positive)	Cumulative Sensitivity (%)
Combined atrial fibrillation cohort (n = 1194)
With DM (457)	Diagnosis codes	385	352	33	77.0
	Diagnosis codes and/or laboratory tests	574	422	152	92.3
	Diagnosis codes and/or laboratory tests and/or medications	597	445	152	97.4
Combined random hospitalized sample (n = 1619)
With DM (367)	Diagnosis codes	329	306	23	83.4
	Diagnosis codes and/or laboratory tests	573	355	218	96.7
	Diagnosis codes and/or laboratory tests and/or medications	584	359	225	97.8

DM: Diabetes mellitus.
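
The cumulative sensitivities in Table 3 follow directly from the counts in the table: at each checkpoint, the number of gold-standard DM patients flagged so far is divided by the total number of gold-standard DM patients in the cohort (457 in the AF cohort, 367 in the random hospitalized sample). The sketch below reproduces that arithmetic in Python. The function and variable names are illustrative assumptions and are not taken from the study.

```python
def cumulative_sensitivity(true_positives_by_checkpoint, total_with_dm):
    """Return the cumulative sensitivity (%) at each checkpoint, to 1 decimal place."""
    return [round(100 * tp / total_with_dm, 1) for tp in true_positives_by_checkpoint]


# Gold-standard DM patients flagged after each checkpoint (Table 3, DM column):
# diagnosis codes -> plus laboratory tests -> plus medications.
af_tp = [352, 422, 445]
random_tp = [306, 355, 359]

af_sens = cumulative_sensitivity(af_tp, total_with_dm=457)
random_sens = cumulative_sensitivity(random_tp, total_with_dm=367)

print(af_sens)
print(random_sens)
```

The denominator stays fixed across checkpoints because sensitivity is always measured against all chart-confirmed DM patients, not against those flagged at the previous step.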

Table 4. Stratified performance of the algorithm in different age and sex subgroups.

	Combined Atrial Fibrillation Cohort (n = 1194)				Combined Random Hospitalized Sample (n = 1619)
	Sensitivity (%)	Specificity (%)	PPV (%)	NPV (%)	Sensitivity (%)	Specificity (%)	PPV (%)	NPV (%)
Sex
Female	97.1	78.6	72.1	98.0	97.6	85.0	62.6	99.3
Male	97.6	80.2	76.8	98.0	98.0	78.8	60.5	99.2
Age group
64 years and below	96.5	90.8	87.3	97.5	98.0	88.8	57.6	99.6
65 years and above	97.7	75.8	71.1	98.2	97.7	60.6	64.4	97.3

PPV: Positive predictive value; NPV: Negative predictive value.

Table 5. Reasons for algorithmic misclassification in both cohorts.

	Number in Combined Atrial Fibrillation Cohort (FP = 152, FN = 12)	Number in Combined Random Hospitalized Sample (FP = 225, FN = 8)
Reasons for false-positive classification
DM not mentioned in unstructured clinical notes (e.g., discharge summary), but diagnosis, laboratory tests or medications fit the DM criteria	38	50
Impaired fasting glucose or HbA1c in prediabetic range	18	0
Hyperglycemia (due to other reasons)	0	3
Impaired glucose tolerance	0	3
Gestational diabetes	0	2
DM on diet control	1	2
Total FP sampled for review	57	60
Reason for false-negative classification
DM mentioned in discharge summary, but no diagnosis, laboratory tests or medications fit the DM criteria	5	4
Total FN sampled for review	5	4

DM: Diabetes mellitus; FP: False-positive; FN: False-negative.

Table 6. Algorithm performance in the absence of laboratory tests or medication data in both cohorts.

	TP	FP	TN	FN	Sensitivity (%)	Specificity (%)	PPV (%)	NPV (%)
Combined atrial fibrillation cohort (n = 1194)
Diagnosis codes and/or laboratory tests and/or medications	445	152	585	12	97.4	79.4	74.5	98.0
Diagnosis codes and/or laboratory tests	422	152	585	35	92.3	79.4	73.5	94.4
Diagnosis codes and/or medications	420	36	701	37	91.9	95.1	92.1	95.0
Combined random hospitalized sample (n = 1619)
Diagnosis codes and/or laboratory tests and/or medications	359	225	1027	8	97.8	82.0	61.5	99.2
Diagnosis codes and/or laboratory tests	355	218	1034	12	96.7	82.6	62.0	98.9
Diagnosis codes and/or medications	345	37	1215	22	94.0	97.0	90.3	98.2

TP: True-positive, FP: False-positive, TN: True-negative, FN: False-negative.
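
Each row of Table 6 can be checked against its own confusion-matrix counts. The following is a minimal sketch, assuming the standard definitions of sensitivity, specificity, PPV and NPV. The `ConfusionCounts` class is a hypothetical helper introduced here for illustration and is not part of the study's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of algorithm predictions against the chart-review gold standard."""
    tp: int
    fp: int
    tn: int
    fn: int

    def metrics(self):
        """Return the four performance measures as percentages (1 decimal place)."""
        return {
            "sensitivity": round(100 * self.tp / (self.tp + self.fn), 1),
            "specificity": round(100 * self.tn / (self.tn + self.fp), 1),
            "ppv": round(100 * self.tp / (self.tp + self.fp), 1),
            "npv": round(100 * self.tn / (self.tn + self.fn), 1),
        }


# Combined AF cohort, full algorithm (codes and/or laboratory tests and/or medications)
full_af = ConfusionCounts(tp=445, fp=152, tn=585, fn=12).metrics()

# Combined AF cohort, laboratory data unavailable (codes and/or medications)
no_labs_af = ConfusionCounts(tp=420, fp=36, tn=701, fn=37).metrics()

print(full_af)
print(no_labs_af)
```

Comparing the two results shows the trade-off visible in the table: dropping laboratory tests removes most false positives (152 to 36), which raises specificity and PPV, at the cost of more missed DM patients (12 to 37).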

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
