Predictive Models for Early Infection Detection in Nursing Home Residents: Evaluation of Imputation Techniques and Complementary Data Sources
Abstract
1. Introduction
2. Materials and Methods
2.1. Study Design and Multi-Source Data Integration
- Physiological Data (Internal Source): Measurements were collected daily using wearable biosensors recording Body Temperature (TEMP), Electrodermal Activity (EDA), Oxygen Saturation (SpO2), and Heart Rate (BPM). Trained medical personnel oversaw collection according to established protocols described in the SPIDEP Project [5]. These were complemented by demographic data (age, sex) and clinical characteristics, specifically the Barthel Index, which measures the patient’s performance in activities of daily living and functional independence.
- Environmental Data (External Source): Recognizing the influence of environmental factors on respiratory health, daily meteorological variables (mean temperature, pressure, insolation) and air pollution metrics (e.g., Nitrogen Dioxide (NO2), Particulate Matter (PM10)) were retrieved from public repositories and temporally matched with physiological data.
- Digital Epidemiology Data (External Source): To capture social context and potential early warnings of community transmission, we integrated search volume data from Google Trends. We selected specific keywords in Spanish related to respiratory symptoms (e.g., “flu”, “fever”) as well as urinary tract symptoms (e.g., “cystitis”, “dysuria”) to analyze local search behaviors during the study period. The complete list of Spanish keywords used for the Google Trends data extraction, including the comparison between raw and refined sets, is provided in Supplementary Table S1.
- Basic Model: Utilizing exclusively demographic data and physiological vital signs (internal source).
- Air Pollution Model: Incorporating pollutant variables (internal + environmental sources).
- Social Media Model: Integrating digital epidemiology metrics from Google Trends (internal + digital sources).
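The temporal matching of external sources and the three model variants listed above can be sketched as a same-day join followed by a column selection. This is only an illustration: every column name (`barthel`, `trend_flu`, and so on), the helper names `merge_by_day` and `select_features`, and all sample values are hypothetical, since the actual schema is not given here.

```python
from datetime import date

# Hypothetical column names; the actual dataset schema is an assumption.
BASIC = ["age", "sex", "barthel", "TEMP", "EDA", "SpO2", "BPM"]
ENVIRONMENT = ["mean_temp", "pressure", "insolation", "NO2", "PM10"]
TRENDS = ["trend_flu", "trend_fever", "trend_cystitis", "trend_dysuria"]

# Each model variant is the internal source plus at most one external source.
FEATURE_SETS = {
    "basic": BASIC,
    "air_pollution": BASIC + ENVIRONMENT,
    "social_media": BASIC + TRENDS,
}


def merge_by_day(physio_rows, *daily_sources):
    """Attach same-day external values to each physiological record."""
    merged = []
    for row in physio_rows:
        combined = dict(row)
        for source in daily_sources:
            # A day missing from an external source simply adds no columns.
            combined.update(source.get(row["date"], {}))
        merged.append(combined)
    return merged


def select_features(row, model_name):
    """Keep only the columns used by the named model; absent values become None."""
    return {name: row.get(name) for name in FEATURE_SETS[model_name]}


# Illustrative values only, not taken from the study data.
physio = [{"date": date(2018, 4, 4), "age": 88, "sex": "F", "barthel": 60,
           "TEMP": 36.5, "EDA": 0.4, "SpO2": 96, "BPM": 72}]
environment = {date(2018, 4, 4): {"mean_temp": 12.1, "pressure": 940.0,
                                  "insolation": 7.5, "NO2": 41.0, "PM10": 18.0}}
trends = {date(2018, 4, 4): {"trend_flu": 35, "trend_fever": 20,
                             "trend_cystitis": 12, "trend_dysuria": 3}}

rows = merge_by_day(physio, environment, trends)
air_row = select_features(rows[0], "air_pollution")
```

Keeping the feature lists in one dictionary makes it easy to train the same classifier on each variant and compare them on equal terms.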
2.2. Data Preprocessing and Missing Data Imputation
- Mean Imputation: Missing values were replaced with the average value of each feature.
- kNN: Missing data were estimated using similarity-based predictions from the closest neighbors in the dataset, preserving relationships between variables and capturing local data patterns.
- MICE: Multiple imputed datasets were generated iteratively to address uncertainty and variability in missing values, enhancing the robustness and reliability of the subsequent analyses.
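To make the difference between the first two strategies concrete, the following is a minimal standard-library sketch of mean and kNN imputation on a tiny matrix with missing entries. It is not the study's implementation: the function names, the choice of `k`, the distance definition (root mean squared difference over columns observed in both rows), and the sample values are all assumptions. MICE is omitted because a faithful version needs iterative regression models; in practice a library routine such as scikit-learn's `IterativeImputer` would typically be used.

```python
from math import sqrt
from statistics import mean


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = [mean(r[j] for r in rows if r[j] is not None) for j in range(n_cols)]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def knn_impute(rows, k=2):
    """Replace each None with the column mean over the k most similar rows."""

    def dist(a, b):
        # Compare only the columns that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                # Candidate donors are the other rows with column j observed.
                donors = sorted(
                    (dist(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                )[:k]
                new_row[j] = mean(v for _, v in donors)
        imputed.append(new_row)
    return imputed


# Illustrative readings (TEMP, SpO2); the second row is missing SpO2.
readings = [[36.5, 97.0], [36.6, None], [38.2, 91.0], [36.4, 96.0]]
by_mean = mean_impute(readings)
by_knn = knn_impute(readings, k=2)
```

In this toy example mean imputation pulls the missing SpO2 toward the febrile row's low value, while kNN borrows only from the two rows with a similar temperature. With real data the features would normally be scaled before computing distances.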
2.3. Basal Module
- Categorical State: A discretized variable whose value indicates which historical quartile the current observation falls into.
- Dynamic Variation: Defined as the first-order difference between consecutive observations, capturing sudden shifts in the patient’s physiological state.
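The two derived features above can be sketched as follows. The coding of the categorical state as 1 to 4, the use of a resident's own past readings as the "history", the inclusive quantile method, and the function names are all assumptions made for illustration; the original notation is not recoverable here.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_state(history, value):
    """Return 1..4 according to the historical quartile that contains `value`.

    The 1..4 coding is a hypothetical choice, not taken from the source.
    """
    cut_points = quantiles(history, n=4, method="inclusive")  # [Q1, Q2, Q3]
    # Count how many cut points lie strictly below the value.
    return 1 + bisect_left(cut_points, value)


def dynamic_variation(series):
    """First-order difference: each observation minus the previous one."""
    return [current - previous for previous, current in zip(series, series[1:])]


# Illustrative temperature history for one resident.
history = [36.0, 36.2, 36.4, 36.6, 36.8]
states = [quartile_state(history, v) for v in (36.1, 36.5, 37.5)]
deltas = dynamic_variation([36.5, 36.5, 37.5])
```

Here a reading of 37.5 falls above the third quartile of the history (state 4), and the jump from 36.5 to 37.5 appears as a difference of 1.0, which is the kind of sudden shift the Dynamic Variation feature is meant to expose.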
2.4. System Workflow and Decision Architecture
2.5. Temporal Framework: Lead, Lag, and Labeling Strategy
- Lag (Anticipation Horizon): The Lag is the number of days before the Diagnosis Date by which the system aims to anticipate the infection. This creates a “Prediction Point” located Lag days before the Diagnosis Date. To guarantee that the model does not “peek” into the future, all physiological data recorded between the Prediction Point and the Diagnosis Date are strictly excluded from feature extraction. This ensures the model reacts only to early, subtle variations rather than the overt clinical deterioration that immediately precedes diagnosis.
- Lead (Historical Observation Window): The Lead defines the window of historical context used as input for the model, looking backward from the Prediction Point. The feature vector for a sample is derived exclusively from the Lead days that end at the Prediction Point. Because the observation window ends exactly at the Prediction Point, causal integrity is preserved: no information from the period between the Prediction Point and the Diagnosis Date is encoded in the inputs.
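The Lead and Lag definitions above amount to simple date arithmetic, sketched below. The function names are hypothetical, and two boundary conventions are assumptions: the Prediction Point day itself is treated as the last day of the window, and the window spans exactly Lead days.

```python
from datetime import date, timedelta


def prediction_point(diagnosis_date, lag):
    """The day, `lag` days before diagnosis, at which the prediction is made."""
    return diagnosis_date - timedelta(days=lag)


def observation_window(diagnosis_date, lag, lead):
    """Return (first_day, last_day) of the Lead window, both inclusive."""
    last_day = prediction_point(diagnosis_date, lag)
    first_day = last_day - timedelta(days=lead - 1)
    return first_day, last_day


def window_readings(readings, diagnosis_date, lag, lead):
    """Keep only readings inside the Lead window; later days are never used."""
    first_day, last_day = observation_window(diagnosis_date, lag, lead)
    return {day: v for day, v in readings.items() if first_day <= day <= last_day}


# Illustrative daily temperatures for 1 to 10 June 2018.
diagnosis = date(2018, 6, 10)
readings = {date(2018, 6, d): 36.0 + d / 10 for d in range(1, 11)}
window = window_readings(readings, diagnosis, lag=2, lead=3)
```

With a diagnosis on 10 June, Lag 2 and Lead 3, the features come from 6 to 8 June only; the readings from 9 and 10 June are discarded, so nothing recorded after the Prediction Point can leak into the inputs.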
2.6. Model Training and Evaluation
- Binary classification: Differentiating between “healthy” and “infected” states without specifying infection types.
- Multiclass classification: Differentiating specific infection types including “ARI”, “UTI” and “Other Infections”.
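The relation between the two tasks, and the per-class metrics reported for them, can be sketched as follows. This assumes the metrics are computed one-vs-rest from the predicted and true labels; the label strings, function names, and the toy predictions are hypothetical.

```python
def to_binary(label):
    """Collapse the multiclass labels into the binary task."""
    return "Healthy" if label == "Healthy" else "Infected"


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Precision, recall, specificity and F1 for one class against all others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}


# Toy labels for illustration only.
y_true = ["Healthy", "ARI", "UTI", "Healthy", "ARI", "Other"]
y_pred = ["Healthy", "ARI", "UTI", "ARI", "ARI", "Healthy"]

ari = one_vs_rest_metrics(y_true, y_pred, "ARI")

bin_true = [to_binary(t) for t in y_true]
bin_pred = [to_binary(p) for p in y_pred]
infected = one_vs_rest_metrics(bin_true, bin_pred, "Infected")
healthy = one_vs_rest_metrics(bin_true, bin_pred, "Healthy")
```

In the binary case the specificity of the "Infected" class equals the recall of the "Healthy" class, which is the pattern visible in the binary results table (for example, 0.73 for the Basic dataset).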
2.7. Feature Importance and Explainable AI with SHAP
3. Results
3.1. Binary Classification: Early Detection Capacity
3.1.1. Impact of Multi-Source Integration
3.1.2. Model Robustness and Specificity
3.1.3. Feature Selection and Efficiency
3.2. Multiclass Diagnosis: Etiological Differentiation
3.2.1. The Limitations of Vital Signs Alone
3.2.2. Impact of External Data
- Social Media Model: Achieved the highest consistency, with a Recall of 0.94 for ARI and 0.97 for UTI. This suggests that community search patterns (e.g., keywords like “dysuria” vs. “cough”) provide the specific semantic labels needed to disambiguate the physiological signal.
- Air Pollution Model: Also demonstrated exceptional performance (Recall > 0.95 for both ARI and UTI), likely leveraging meteorological correlations (e.g., temperature drops associated with respiratory outbreaks).
3.2.3. Challenge of Heterogeneous Classes
3.3. Temporal Analysis: Impact of Lead and Lag
3.3.1. Binary Dynamics
3.3.2. Multiclass Dynamics
3.4. Model Interpretability and Feature Importance
3.5. Summary of Main Findings
4. Discussion
4.1. Comparison of Data Sources
- Physiological Limits: The Basic Model proved effective for immediate detection but degraded rapidly as the prediction horizon extended. This suggests that vital signs act as indicators of current systemic stress rather than predictors of future risk; without immediate updates, their diagnostic relevance decays.
- Environmental Nowcasting: Air pollution data consistently enhanced performance at short lags. The strong correlation found at Lag 0–2 supports the biological plausibility of environmental exposure, such as spikes in NO2, acting as an immediate trigger for respiratory or systemic inflammation in vulnerable elderly populations.
- Social Forecasting: Contrary to the initial assumption that social media might add noise, our analysis revealed its critical role as a long-range sensor, peaking at a 6-day anticipation horizon (F1 = 0.97). This aligns with the “digital epidemiology” hypothesis: community search patterns regarding symptoms precede clinical saturation, reflecting the behavioral latency between feeling the first symptoms and seeking medical attention.
4.2. Multiclass Classification Dynamics
4.3. Impact of Feature Selection
4.4. Model Explainability and Feature Contribution
4.5. Handling Missing Data
4.6. Broader Implications
4.7. Limitations and Future Work
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Keating, N. A Research Framework for the United Nations Decade of Healthy Ageing (2021–2030). Eur. J. Ageing 2022, 19, 775–787.
- Anderson, R.M.; May, R.M. Infectious Diseases of Humans: Dynamics and Control; Reprinted; Oxford University Press: Oxford, UK, 2010; ISBN 978-0-19-854040-3.
- Bauer, T.T.; Ewig, S.; Rodloff, A.C.; Muller, E.E. Acute Respiratory Distress Syndrome and Pneumonia: A Comprehensive Review of Clinical Data. Clin. Infect. Dis. 2006, 43, 748–756.
- Foxman, B. Urinary Tract Infection Syndromes. Infect. Dis. Clin. N. Am. 2014, 28, 1–13.
- Garcés-Jiménez, A.; Polo-Luque, M.-L.; Gómez-Pulido, J.A.; Rodríguez-Puyol, D.; Gómez-Pulido, J.M. Predictive Health Monitoring: Leveraging Artificial Intelligence for Early Detection of Infectious Diseases in Nursing Home Residents through Discontinuous Vital Signs Analysis. Comput. Biol. Med. 2024, 174, 108469.
- Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 1st ed.; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2002; ISBN 978-0-471-18386-0.
- Rubin, D.B. Inference and Missing Data. Biometrika 1976, 63, 581–592.
- Molenberghs, G.; Kenward, M.G. Missing Data in Clinical Studies, 1st ed.; Wiley: Hoboken, NJ, USA, 2007; ISBN 978-0-470-84981-1.
- Donders, A.R.T.; Van Der Heijden, G.J.M.G.; Stijnen, T.; Moons, K.G.M. Review: A Gentle Introduction to Imputation of Missing Values. J. Clin. Epidemiol. 2006, 59, 1087–1091.
- Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A Survey on Missing Data in Machine Learning. J. Big Data 2021, 8, 140.
- Buuren, S.V.; Groothuis-Oudshoorn, K. Mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 1–67.
- Yoon, J.; Jordon, J.; van der Schaar, M. GAIN: Missing Data Imputation Using Generative Adversarial Nets. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- Imberti, L.; Tiecco, G.; Logiudice, J.; Castelli, F.; Quiros-Roldan, E. Effects of Climate Change on the Immune System: A Narrative Review. Health Sci. Rep. 2025, 8, e70627.
- Dominici, F.; Peng, R.D.; Bell, M.L.; Pham, L.; McDermott, A.; Zeger, S.L.; Samet, J.M. Fine Particulate Air Pollution and Hospital Admission for Cardiovascular and Respiratory Diseases. JAMA 2006, 295, 1127.
- Zanobetti, A.; Schwartz, J.; Samoli, E.; Gryparis, A.; Touloumi, G.; Atkinson, R.; Le Tertre, A.; Bobros, J.; Celko, M.; Goren, A.; et al. The Temporal Pattern of Mortality Responses to Air Pollution: A Multicity Assessment of Mortality Displacement. Epidemiology 2002, 13, 87–93.
- Lei, J.; Chen, R.; Liu, C.; Zhu, Y.; Xue, X.; Jiang, Y.; Shi, S.; Gao, Y.; Kan, H.; Xuan, J. Fine and Coarse Particulate Air Pollution and Hospital Admissions for a Wide Range of Respiratory Diseases: A Nationwide Case-Crossover Study. Int. J. Epidemiol. 2023, 52, 715–726.
- Lampos, V.; Miller, A.C.; Crossan, S.; Stefansen, C. Advances in Nowcasting Influenza-like Illness Rates Using Search Query Logs. Sci. Rep. 2015, 5, 12760.
- Paul, M.J.; Dredze, M. Discovering Health Topics in Social Media Using Topic Models. PLoS ONE 2014, 9, e103408.
- Milinovich, G.J.; Williams, G.M.; Clements, A.C.A.; Hu, W. Internet-Based Surveillance Systems for Monitoring Emerging Infectious Diseases. Lancet Infect. Dis. 2014, 14, 160–168.
- Caballé-Cervigón, N.; Castillo-Sequera, J.L.; Gómez-Pulido, J.A.; Gómez-Pulido, J.M.; Polo-Luque, M.L. Machine Learning Applied to Diagnosis of Human Diseases: A Systematic Review. Appl. Sci. 2020, 10, 5135.
- Baldominos, A.; Ogul, H.; Colomo-Palacios, R.; Sanz-Moreno, J.; Gómez-Pulido, J.M. Infection Prediction Using Physiological and Social Data in Social Environments. Inf. Process. Manag. 2020, 57, 102213.
- Aramaki, E.; Maskawa, S.; Morita, M. Twitter Catches the Flu: Detecting Influenza Epidemics Using Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–29 July 2011; pp. 1568–1576.

| Population Metrics | Cardenal Cisneros | Francisco de Vitoria | Total |
|---|---|---|---|
| Residents | 127 | 316 | 443 |
| Participants | 20 | 40 | 60 |
| Infected Residents | 7 | 33 | 40 |
| Participation Rate (%) | 16% | 13% | 14% |
| Medical Staff | 4 | 14 | 18 |
| Average Age | 87.7 | 87.6 | 87.6 |
| Start Date | 24 March 2018 | 4 April 2018 | 24 March 2018 |
| End Date | 11 March 2019 | 11 March 2019 | 11 March 2019 |

| Category | Count |
|---|---|
| Acute Respiratory Infections | 48 |
| Urinary Tract Infections | 54 |
| Other Infections | 50 |
| Total Infections | 152 |
| Residents with Infections | 43 |
| Average Infections per Resident | 3.53 |

| Vital Sign | Expected Values | Actual Values | Missing Values | Missing Values (%) |
|---|---|---|---|---|
| Body temperature | 21,002 | 4167 | 16,835 | 80.16 |
| SpO2 | 21,002 | 4177 | 16,825 | 80.11 |
| BPM | 21,002 | 4178 | 16,824 | 80.11 |
| EDA | 21,002 | 4165 | 16,837 | 80.16 |

With Feature Selection–Mean Imputation

| Dataset | Class | Precision | Recall | Specificity | F1-Score |
|---|---|---|---|---|---|
| Basic | Healthy | 0.81 | 0.73 | - | 0.77 |
| Basic | Infected | 0.77 | 0.84 | 0.73 | 0.80 |
| Social Media | Healthy | 0.83 | 0.72 | - | 0.77 |
| Social Media | Infected | 0.76 | 0.86 | 0.72 | 0.81 |
| Air Pollution | Healthy | 0.85 | 0.77 | - | 0.81 |
| Air Pollution | Infected | 0.78 | 0.86 | 0.77 | 0.82 |

With Feature Selection–Mean Imputation

| Dataset | Class | Precision | Recall | Specificity | F1-Score |
|---|---|---|---|---|---|
| Basic | Healthy | 0.37 | 0.52 | 0.59 | 0.43 |
| Basic | ARI | 0.33 | 0.13 | 0.95 | 0.18 |
| Basic | UTI | 0.45 | 0.53 | 0.66 | 0.49 |
| Basic | Other | 0.22 | 0.10 | 0.92 | 0.13 |
| Social Media | Healthy | 0.81 | 0.90 | 0.90 | 0.85 |
| Social Media | ARI | 0.89 | 0.94 | 0.98 | 0.92 |
| Social Media | UTI | 0.88 | 0.97 | 0.93 | 0.92 |
| Social Media | Other | 0.87 | 0.51 | 0.98 | 0.63 |
| Air Pollution | Healthy | 0.81 | 0.94 | 0.89 | 0.87 |
| Air Pollution | ARI | 0.88 | 0.96 | 0.98 | 0.91 |
| Air Pollution | UTI | 0.90 | 0.96 | 0.94 | 0.93 |
| Air Pollution | Other | 0.86 | 0.49 | 0.98 | 0.61 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Granda, M.; Santamera-Lastras, M.; Garcés-Jiménez, A.; Bueno-Guillén, F.J.; Rodríguez-Puyol, D.M.; Gómez-Pulido, J.M. Predictive Models for Early Infection Detection in Nursing Home Residents: Evaluation of Imputation Techniques and Complementary Data Sources. Healthcare 2026, 14, 166. https://doi.org/10.3390/healthcare14020166

