Wrangling Real-World Data: Optimizing Clinical Research Through Factor Selection with LASSO Regression
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Source
2.2. Data Analysis
3. Results
3.1. Data Quality
3.2. Data Utility and Variable Selection
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Factors * | Level or statistic | Value |
|---|---|---|
| Demographics | | |
| Age | | 66 (52–77) |
| Race and ethnicity | White | 33,986 (64.6) |
| | Black | 8172 (15.6) |
| | Hispanic | 2040 (3.9) |
| | Other | 8392 (15.9) |
| Sex | Male | 27,560 (52.4) |
| | Female | 25,030 (47.6) |
| Body mass index | | 29 (24.5–34.7) |
| Comorbidities | Human immunodeficiency virus | 388 (0.73) |
| | Chronic lung disease | 15,655 (29.7) |
| | Chronic kidney disease | 14,898 (28.3) |
| | Cardiovascular disease | 14,769 (28) |
| | Diabetes | 21,529 (40.9) |
| Indicators of Disease Severity | | |
| Oxygen support | No oxygen | 33,369 (63.5) |
| | Oxygen only | 16,761 (31.8) |
| | Ventilation | 2460 (4.7) |
| Oxygen saturation (%) | Minimum | 90.0 (86.0–93.0) |
| | Median | 96.0 (94.0–95.8) |
| | Mean | 95.5 (94.1–97.0) |
| | Maximum | 100.0 (98.0–100.0) |
| Respiratory rate | Minimum | 16.0 (14.0–16.0) |
| | Median | 18.0 (18.0–20.0) |
| | Mean | 19.0 (17.8–21.1) |
| | Maximum | 25.0 (21.0–31.0) |
| Heart rate | Minimum | 64.0 (57.0–72.0) |
| | Median | 81.0 (72.0–90.5) |
| | Mean | 81.7 (73.1–91.0) |
| | Maximum | 103.0 (92.0–117.0) |
| Temperature (F) | Minimum | 97.3 (96.8–97.6) |
| | Median | 98.1 (97.8–98.5) |
| | Mean | 98.2 (97.8–98.6) |
| | Maximum | 99.5 (98.8–100.8) |
| Laboratory Measurements | | |
| Leukocyte count (×10⁹/L) | Minimum | 5.7 (4.1–8.0) |
| | Median | 6.9 (5.0–9.5) |
| | Mean | 7.0 (5.1–9.7) |
| | Maximum | 8.2 (6.0–11.6) |
| Monocyte count (×10⁹/L) | Minimum | 0.4 (0.2–0.6) |
| | Median | 0.5 (0.4–0.8) |
| | Mean | 0.5 (0.1–0.9) |
| | Maximum | 0.6 (0.4–0.9) |
| Lymphocyte count (×10⁹/L) | Minimum | 0.8 (0.5–1.2) |
| | Median | 1.0 (0.7–1.4) |
| | Mean | 1.0 (0.7–1.4) |
| | Maximum | 1.16 (0.8–1.69) |
| Eosinophil count (×10⁹/L) | Minimum | 0 (0–0.01) |
| | Median | 0 (0–0.05) |
| | Mean | 0 (0–0.06) |
| | Maximum | 0 (0–0.07) |
| Basophil count (×10⁹/L) | Minimum | 0 (0–0.01) |
| | Median | 0 (0–0.03) |
| | Mean | 0.01 (0–0.06) |
| | Maximum | 0.01 (0–0.19) |
| Hematocrit level (%) | Minimum | 36.1 (31.5–40.0) |
| | Median | 37.8 (33.3–41.6) |
| | Mean | 38.0 (33.5–41.7) |
| | Maximum | 40.0 (35.5–43.8) |
| Hemoglobin level (g/L) | Minimum | 11.8 (10.1–13.2) |
| | Median | 12.4 (10.7–13.7) |
| | Mean | 12.4 (10.8–13.8) |
| | Maximum | 13.1 (11.5–14.5) |
| Platelet count (×10⁹/L) | Minimum | 190.0 (144.0–248.0) |
| | Median | 206.0 (157.0–268.0) |
| | Mean | 208.3 (159.0–270.7) |
| | Maximum | 228.0 (174.0–298.0) |
| Total bilirubin | Minimum | 0.4 (0.3–0.6) |
| | Median | 0.5 (0.4–0.7) |
| | Mean | 0.5 (0.2–0.9) |
| | Maximum | 0.6 (0.4–0.8) |
| Aspartate aminotransferase level (U/L) | Minimum | 29.0 (20.0–44.0) |
| | Median | 33.0 (23.0–52.0) |
| | Mean | 33.5 (23.0–53.0) |
| | Maximum | 38.0 (25.0–61.0) |
| Alanine aminotransferase level (U/L) | Minimum | 25.0 (16.0–40.0) |
| | Median | 27.0 (18.0–45.0) |
| | Mean | 27.5 (18.0–45.7) |
| | Maximum | 30.0 (19.0–51.0) |
| Serum creatinine level (mg/dL) | Minimum | 0.9 (0.7–1.2) |
| | Median | 0.9 (0.7–1.3) |
| | Mean | 0.9 (0.7–1.3) |
| | Maximum | 1.1 (0.8–1.5) |
| Estimated glomerular filtration rate | Minimum | 65.0 (39.4–88.4) |
| | Median | 75.0 (47.0–95.1) |
| | Mean | 74.3 (47.3–94.1) |
| | Maximum | 82.5 (54.3–76.9) |
| Outcome | | |
| Mortality | Alive | 48,673 (92.6) |
| | Deceased | 3917 (7.4) |
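
A quick consistency check on the categorical counts in the table above can be sketched as follows. The idea is that every categorical factor partitions the same cohort, so the level counts should add up to a single total. The dictionary layout and variable names are illustrative assumptions, not part of the original analysis; the counts themselves are copied from the table.

```python
# Counts copied from the characteristics table; the nesting and names
# below are hypothetical and chosen only for this illustration.
counts = {
    "Race and ethnicity": {"White": 33986, "Black": 8172, "Hispanic": 2040, "Other": 8392},
    "Sex": {"Male": 27560, "Female": 25030},
    "Oxygen support": {"No oxygen": 33369, "Oxygen only": 16761, "Ventilation": 2460},
    "Mortality": {"Alive": 48673, "Deceased": 3917},
}

# Sum the level counts for each factor.
totals = {factor: sum(levels.values()) for factor, levels in counts.items()}

# Use the outcome total as the cohort size and compare every factor to it.
cohort_size = totals["Mortality"]
consistent = all(total == cohort_size for total in totals.values())

# Recompute the in-hospital mortality percentage from the raw counts.
mortality_pct = round(100 * counts["Mortality"]["Deceased"] / cohort_size, 1)

print(f"cohort size: {cohort_size}, totals agree: {consistent}, mortality: {mortality_pct}%")
```

A check of this kind only confirms that no level is missing records; it says nothing about whether individual values were recorded correctly.
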

| Variable (reference level) | Level or statistic | Coefficient |
|---|---|---|
| Demographics | | |
| Age (18–39 years) | 40–49 years | 0.283 |
| | 50–59 years | 0.803 |
| | 60–69 years | 1.619 |
| | 70–79 years | 2.167 |
| | 80+ years | 2.821 |
| Race and ethnicity (White) | Black | −0.685 |
| | Hispanic | −0.032 |
| | Other | 0.021 |
| Sex (female) | | 0.252 |
| Body mass index | | −0.214 |
| Comorbidities (lack of presence of condition) | Human immunodeficiency virus | 0.276 |
| | Chronic lung disease | 0.350 |
| | Chronic kidney disease | 0.850 |
| | Cardiovascular disease | 0.139 |
| | Diabetes | 0.093 |
| Indicators of Disease Severity | | |
| Oxygen support (no oxygen) | Oxygen only | 0.802 |
| | Ventilation | 2.054 |
| Oxygen saturation (%) | Minimum | −0.046 |
| | Median | 0.153 |
| | Mean | −0.745 |
| | Maximum | 0.292 |
| Respiratory rate | Minimum | −0.123 |
| | Median | 0.076 |
| | Mean | 0.351 |
| | Maximum | −0.019 |
| Heart rate | Minimum | −0.075 |
| | Median | −0.065 |
| | Mean | 0.356 |
| | Maximum | −0.083 |
| Temperature (F) | Minimum | −0.128 |
| | Median | 0.077 |
| | Mean | |
| | Maximum | 0.014 |
| Laboratory Measurements | | |
| Leukocyte count | Minimum | 0.348 |
| | Median | |
| | Mean | |
| | Maximum | 0.001 |
| Monocyte count | Minimum | 0.064 |
| | Median | −0.115 |
| | Mean | |
| | Maximum | −0.064 |
| Lymphocyte count | Minimum | −0.045 |
| | Median | −0.079 |
| | Mean | −0.037 |
| | Maximum | |
| Eosinophil count | Minimum | −0.107 |
| | Median | 0.202 |
| | Mean | −0.448 |
| | Maximum | −0.089 |
| Basophil count | Minimum | −0.098 |
| | Median | 0.054 |
| | Mean | −0.081 |
| | Maximum | −0.111 |
| Hematocrit level | Minimum | 0.482 |
| | Median | 0.055 |
| | Mean | |
| | Maximum | 0.040 |
| Hemoglobin level | Minimum | −0.302 |
| | Median | 0.070 |
| | Mean | −0.484 |
| | Maximum | |
| Platelet count | Minimum | −0.418 |
| | Median | 0.075 |
| | Mean | |
| | Maximum | |
| Total bilirubin | Minimum | 0.236 |
| | Median | |
| | Mean | |
| | Maximum | −0.091 |
| Aspartate aminotransferase level (U/L) | Minimum | 0.315 |
| | Median | |
| | Mean | 0.087 |
| | Maximum | 0.066 |
| Alanine aminotransferase level (U/L) | Minimum | −0.173 |
| | Median | −0.225 |
| | Mean | |
| | Maximum | 0.081 |
| Serum creatinine level (mg/dL) | Minimum | 0.125 |
| | Median | |
| | Mean | |
| | Maximum | −0.188 |
| Estimated glomerular filtration rate | Minimum | 0.136 |
| | Median | 0.129 |
| | Mean | |
| | Maximum | −0.416 |
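
One way to read the coefficients above is to exponentiate them into odds ratios relative to the reference level. The sketch below does this for the age and oxygen-support rows. It rests on two assumptions that the table itself does not state: that the coefficients are log-odds from an L1-penalized logistic model of mortality, and that a blank cell marks a coefficient the penalty did not retain. The names `coefficients`, `odds_ratio`, and `temperature` are hypothetical.

```python
import math

# Coefficients copied from the table above.
# ASSUMPTION: these are log-odds from a penalized logistic regression;
# the table does not state the link function or any standardization.
coefficients = {
    "Age 40–49 years": 0.283,
    "Age 50–59 years": 0.803,
    "Age 60–69 years": 1.619,
    "Age 70–79 years": 2.167,
    "Age 80+ years": 2.821,
    "Oxygen only": 0.802,
    "Ventilation": 2.054,
}


def odds_ratio(log_odds: float) -> float:
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(log_odds)


odds_ratios = {name: round(odds_ratio(beta), 2) for name, beta in coefficients.items()}

# Temperature row of the table, with the blank cell encoded as None.
# ASSUMPTION: a blank cell means the summary statistic was not retained.
temperature = {"Minimum": -0.128, "Median": 0.077, "Mean": None, "Maximum": 0.014}
retained = [stat for stat, beta in temperature.items() if beta is not None]

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio} versus reference")
print("Temperature statistics retained:", retained)
```

Under these assumptions, the age coefficients translate into steadily increasing odds of death relative to the 18–39 group. Coefficients for continuous factors are harder to interpret this way without knowing how the inputs were scaled.
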
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Howard, K.A.; Anderson, W.; Podichetty, J.T.; Gould, R.; Boyce, D.; Dasher, P.; Evans, L.; Kao, C.; Kumar, V.K.; Hamilton, C.; et al. Wrangling Real-World Data: Optimizing Clinical Research Through Factor Selection with LASSO Regression. Int. J. Environ. Res. Public Health 2025, 22, 464. https://doi.org/10.3390/ijerph22040464
APA Style: Howard, K. A., Anderson, W., Podichetty, J. T., Gould, R., Boyce, D., Dasher, P., Evans, L., Kao, C., Kumar, V. K., Hamilton, C., Mathé, E., Guerin, P. J., Dodd, K., Mehta, A. K., Ortman, C., Patil, N., Rhodes, J., Robinson, M., Stone, H., & Heavner, S. F. (2025). Wrangling Real-World Data: Optimizing Clinical Research Through Factor Selection with LASSO Regression. International Journal of Environmental Research and Public Health, 22(4), 464. https://doi.org/10.3390/ijerph22040464