A Novel Method for Medical Predictive Models in Small Data Using Out-of-Distribution Data and Transfer Learning

Abstract: Applying deep learning to medical research with limited data is challenging. This study addresses this difficulty through a case study: predicting acute respiratory failure (ARF) in patients with acute pesticide poisoning. Out-of-distribution (OOD) data are commonly overlooked during model training in the medical field. Our approach integrates OOD data and transfer learning (TL) to enhance model performance with limited data. We fine-tuned a pre-trained multi-layer perceptron model using OOD data, outperforming baseline models. Shapley additive explanation (SHAP) values were employed for model interpretation, revealing the key factors associated with ARF. Our study pioneers the application of OOD and TL techniques to electronic health records to achieve better model performance in scenarios with limited data. Our research highlights the potential benefits of using OOD data for initializing weights and demonstrates that TL can significantly improve model performance, even in medical data with limited samples. Our findings emphasize the significance of utilizing context-specific information in TL to achieve better results. Our work has practical implications for addressing challenges in rare diseases and other scenarios with limited data, thereby contributing to the development of machine-learning techniques within the medical field, especially regarding health inequities.


Introduction
Machine learning has emerged as a prominent field in the current medical research landscape. However, constructing effective machine-learning models, particularly with deep-learning (DL) techniques, often requires large amounts of data [1]. Acquiring the necessary volume of data can be time consuming and resource intensive, involving significant costs, including financial expenses. This challenge is especially pronounced in specialized medical fields, where data acquisition may be hindered by low-prevalence diseases, regional inequalities or standardization challenges [2]. When the ratio of training samples to the Vapnik-Chervonenkis (VC) dimension of a learning machine is less than 20, the sample size is considered small [3][4][5][6]. The VC dimension measures the capacity of a classifier, representing the cardinality of the largest set of points that the classifier can shatter [4]. However, these theories do not transfer directly to real-world scenarios with limited datasets, as they primarily address generic machine learning with a large number of training samples [7]. Using limited datasets risks inadequate model training, potentially reducing the likelihood of reaching global minima [8]. Additionally, random weight initialization in machine-learning models may introduce further uncertainty, highlighting the necessity for meticulous weight initialization in scenarios with limited data [9][10][11][12]. Moreover, small datasets pose various challenges, including overfitting, the outsized impact of noise components, missing values, outliers and sharp fluctuations in variables within the dataset, resulting in low generalization ability [13].
Despite these challenges, small datasets possess intrinsic value, and ongoing research endeavors are addressing the issues mentioned above [14,15]. Traditionally, augmentation methods, such as flipping or rotating images, have been prevalent in the image domain [16]. Additionally, oversampling strategies are commonly employed in tabular data to increase the number of samples, especially for minority groups [17]. Recent approaches utilizing generative adversarial networks or diffusion models for data synthesis aim to overcome these issues [18][19][20]. However, these strategies have yet to resolve trust issues related to generated data, require substantial resources, and remain predominantly limited to the image domain [21,22]. Various other algorithms, such as ensemble learning and the input-doubling method [13,23-26], have also been actively researched. However, they mostly face similar challenges, and there is no clear consensus on technically feasible solutions, necessitating further research [13,27,28].
Furthermore, prior knowledge has been shown to improve predictive accuracy over random weight initialization in data-deficient contexts. As a result, the medical field has been actively exploring transfer learning (TL) to exploit this [29]. TL refines a model pre-trained on extensive datasets by adapting insights from one related task to another. TL can produce robust models capable of operating with sparse data, shortening training durations and improving model generalization [30]. However, its primary use in the medical sector is limited to imaging tasks, while electronic health record (EHR) data are often structured in tabular form, making it challenging to acquire compatible large-scale datasets for TL applications [29][30][31][32].
Meanwhile, the medical field frequently encounters diverse instances of out-of-distribution (OOD) data [33]. OOD data are generated from a distribution that deviates from the one on which the model was initially trained [34]. In medical practice, OOD data are common in various scenarios, e.g., when data from different hospitals are used for external validation, when training and evaluation data are segregated, or when integrating data reflective of varying patient conditions or environments, even within the same medical condition [35]. OOD data often follow a distinct statistical distribution compared to in-distribution data and may exhibit contextual disparities [34]. The utilization of OOD data has enhanced machine-learning models' overall performance and robustness when employed in the right context [9,10,36].
This study presents an innovative method that uses OOD data and TL to overcome data scarcity in machine-learning model development. We propose a machine-learning model for predicting acute respiratory failure (ARF) in patients with acute pesticide poisoning, incorporating a unique model development approach. Acute pesticide poisoning is a worldwide public health concern, and it is often accompanied by fatal outcomes [37]. ARF is a major cause of mortality in patients with acute pesticide poisoning, and the clinical course is known to differ based on the category of pesticide, ingestion amount and underlying disease [38]. Since acute pesticide poisoning is rare, a lack of clinical experience can make it difficult to predict a patient's prognosis. Therefore, an ARF prediction model is essential for the timely treatment of patients with acute pesticide poisoning [39,40]. Acute pesticide poisoning is more prevalent in rural areas than in urban areas, leading to variations in data based on the location of medical institutions. Collecting data is exceptionally challenging due to the infrequency of cases, making our novel approach well suited to this problem [30].
The main contributions of this paper can be summarized as follows:
1. Introducing a highly intuitive and simple idea and assessing the potential utility of OOD data in creating pre-trained models for TL.
2. Experimentally validating the effectiveness of OOD and TL in small medical datasets while minimizing artificial data manipulations, such as data generation.
3. Developing a predictive model for ARF in patients with acute pesticide poisoning using the proposed method, showcasing low bias and high performance.
The remainder of the paper is organized as follows. Section 2 encompasses the study population, labeling, feature selection, handling of outliers and missing values, and modeling. Section 3 presents the study participants' characteristics, model performance and model interpretation. Section 4 engages in a thorough analysis of the outcomes, addressing limitations and future research. Lastly, Section 5 provides a summary of the study's contributions and implications for the field. Additionally, in Appendix A, we present abbreviation descriptions (Table A1) along with tables and figures that may aid in the understanding of the paper. In Appendix B, we offer additional experiments that support and reinforce the experiments conducted in the manuscript.

Study Population
The study was conducted on 129,953 patients aged 19 years and older admitted to the general ward at Korea University Anam Hospital between January 2015 and December 2021. To cleanly distinguish patients who experienced ARF from those who did not, we excluded 1508 patients who experienced ARF but had unclear onset times. These patients had not ingested pesticides and were considered OOD data collected from a different region and hospital.
A retrospective observational cohort study was conducted on 1081 patients with acute pesticide poisoning who were admitted to Soonchunhyang University Cheonan Hospital between January 2015 and December 2020. To ensure reliable results, exclusion criteria were established based on previous studies [37,41]. First, patients under the age of 19 were excluded, as were those who had been poisoned by paraquat-based pesticides, which are known to cause ARF within a short time. Considering the pattern of ARF occurrence and the study design, patients who were diagnosed with ARF within 1 h of admission or more than 72 h after admission were also excluded, as were patients with a "Do Not Resuscitate" status due to mechanical ventilator refusal (Figure 1). The final study cohort included 803 patients with acute pesticide poisoning.
The Institutional Review Boards (IRBs) of Korea University Anam Hospital (IRB number: 2023AN0145) and Soonchunhyang University Cheonan Hospital (IRB number: 2020-02-016) reviewed and approved this study. The study was conducted following the principles outlined in the Declaration of Helsinki.

Labeling
We considered the time of receiving mechanical ventilation as the onset of ARF. The study utilized data from two hospitals, each exhibiting distinct characteristics. As a result, specific research designs were implemented tailored to each dataset. The Korea University Anam Hospital data showed a notable scarcity of ARF cases, leading to a data imbalance issue.

The specific approach adopted to address this issue is illustrated in Figure 2. Patients who experienced ARF were labeled "1", with a prediction time of 1-72 h before the onset of the condition. Patients who did not experience ARF were labeled "0", with a prediction timeframe of 143-72 h before discharge to mitigate the uncertainty associated with a potential later onset after discharge. Data points outside the defined prediction timeframes were excluded, effectively rectifying the data imbalance issue.
Conversely, Soonchunhyang University Cheonan Hospital patients exhibited a different pattern, with ARF cases being more prevalent within 72 h of admission. This resulted in a less severe data imbalance. To maintain rigorous evaluation criteria and account for this pattern, a prediction timeframe of 1-72 h after admission was applied. Patients who experienced ARF were labeled "1", while those who did not experience this condition were labeled "0".
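The Cheonan cohort's labeling rule (exclude ARF occurring before 1 h or after 72 h of admission; label the remaining patients by whether ARF occurred in the window) could be sketched with pandas as follows; the column names and toy values are hypothetical, not the study's actual schema.

```python
import pandas as pd

# Toy admissions: hours from admission to mechanical ventilation (NaN = no ARF).
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "hours_to_mv": [30.0, 0.5, 80.0, None],
})

# Cheonan cohort rule: exclude ARF diagnosed before 1 h or after 72 h,
# then label "1" for ARF within the 1-72 h window and "0" otherwise.
in_window = patients["hours_to_mv"].between(1, 72)
excluded = patients["hours_to_mv"].notna() & ~in_window
cohort = patients[~excluded].copy()
cohort["label"] = in_window[~excluded].astype(int)
print(cohort)  # patients 2 and 3 are excluded; patient 1 -> 1, patient 4 -> 0
```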

Figure 2. The red and blue dashed lines indicate the prediction time points labeled as "1" and "0", respectively. The red star-shaped symbols signify the occurrence of mechanical ventilation.

Feature Selection
The feature selection process was informed by prior studies [15,19]. We additionally consulted with experts to determine relevant features. We then excluded features that were not commonly applicable to TL. For example, the Glasgow Coma Scale (GCS), considered a critical feature in past research, was excluded from our study due to its high rate of missing data in the Korea University Anam Hospital dataset. Considering the limited sample size and the need for rapid data assessment, we prioritized features that are typically measured in the majority of patients within an hour, ensuring minimal missing data. As a result, we only selected variables missing from <5% of the Soonchunhyang University Cheonan Hospital data. The selected features included age, sex, systolic blood pressure (SBP), diastolic blood pressure (DBP), respiratory rate, body temperature, serum creatinine, hemoglobin, total carbon dioxide (Total CO2), pH, pCO2, pO2, base excess (BE), lactate, category of pesticide and amount of ingestion. In this context, "sex" refers to the sex assigned at birth.

Handling of Outliers and Missing Values
To tackle potential outliers, values falling below the 2.5th percentile or exceeding the 97.5th percentile of each attribute were considered outliers and treated as missing values to eliminate their potential influence on the whole dataset. Subsequently, the multiple imputation by chained equations (MICE) algorithm was used to impute the missing data. MICE is widely used to generate imputations that closely resemble true distributions when the rate of missing values is low [42]. Following this, robust scaling was applied. Notably, MICE and robust scaling computations were exclusively performed on the training data throughout all phases of the learning process.
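The percentile-based outlier masking, MICE-style imputation and robust scaling described above could be sketched with scikit-learn as follows. The column names, percentile bounds and toy data are illustrative, and `IterativeImputer` is used here as scikit-learn's chained-equations (MICE-style) imputer; this is a minimal sketch, not the study's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import RobustScaler

def mask_outliers(train: pd.DataFrame, df: pd.DataFrame) -> pd.DataFrame:
    """Treat values outside the training 2.5th-97.5th percentiles as missing."""
    out = df.copy()
    for col in train.columns:
        lo, hi = train[col].quantile([0.025, 0.975])
        out.loc[(out[col] < lo) | (out[col] > hi), col] = np.nan
    return out

rng = np.random.default_rng(0)
train = pd.DataFrame({"sbp": rng.normal(120, 15, 300),
                      "lactate": rng.normal(2, 1, 300)})
test = pd.DataFrame({"sbp": [118.0, 400.0], "lactate": [1.8, np.nan]})

train_m = mask_outliers(train, train)
test_m = mask_outliers(train, test)     # bounds come from the training data

# Fit the imputer and scaler on the training data only, then apply to both.
imputer = IterativeImputer(random_state=0)
scaler = RobustScaler()
train_proc = scaler.fit_transform(imputer.fit_transform(train_m))
test_proc = scaler.transform(imputer.transform(test_m))
print(test_proc.shape)  # (2, 2); the implausible SBP of 400 was masked and imputed
```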
Before performing the above pre-processing, the data obtained from Korea University Anam Hospital contained numerous missing values, necessitating additional pre-processing. First, we organized the features daily. Systolic blood pressure (SBP), diastolic blood pressure (DBP), respiratory rate and body temperature were arranged daily using the highest recorded values. The remaining attributes were assigned the last recorded values. Despite these efforts, any remaining missing data were imputed by referencing the most recent observations. Furthermore, the pesticide category and ingestion amount were not available in the Korea University Anam Hospital dataset and were uniformly set to zero.
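The daily organization step (highest value per day for vitals, last value per day for labs, then forward-filling from the most recent prior observation) could be sketched with pandas as follows; the chart layout and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy chart: two days of measurements for one admission (illustrative columns).
chart = pd.DataFrame({
    "patient_id": [1, 1, 1, 1],
    "time": pd.to_datetime(["2020-01-01 08:00", "2020-01-01 20:00",
                            "2020-01-02 09:00", "2020-01-02 21:00"]),
    "sbp": [130.0, 145.0, 120.0, np.nan],
    "creatinine": [1.1, np.nan, np.nan, 1.4],
})
chart["day"] = chart["time"].dt.date
grouped = chart.groupby(["patient_id", "day"])

# Vitals: keep the highest value recorded each day.
daily = grouped[["sbp"]].max()
# Labs: keep the last recorded value each day, then forward-fill remaining
# gaps from the most recent prior observation within each patient.
daily["creatinine"] = grouped["creatinine"].last()
daily["creatinine"] = daily.groupby(level="patient_id")["creatinine"].ffill()
print(daily)
```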


Modeling and Performance Evaluation
Predicting ARF in patients with acute pesticide poisoning is a challenging task due to the limited availability of such cases. This study used a large-scale OOD dataset of patients without acute pesticide poisoning to overcome this challenge and enhance ARF prediction. The study employs various machine-learning models, including logistic regression (LR), random forest (RF), extreme gradient boosting (XGB), light gradient-boosting machine (LGBM) and a multi-layer perceptron (MLP). The regression, ensemble and neural network models considered in our study are among the most commonly utilized models in the development of clinical prediction models [43,44]. Furthermore, since the data we are working with are not time series or images, we did not consider models such as recurrent neural networks or convolutional neural networks. Our novel approach to TL using OOD data is illustrated in Figure 3. The first step of the approach is to develop an MLP model that can predict ARF in patients without acute pesticide poisoning. This MLP model is the pre-trained model, serving as a foundation for fine-tuning using data from patients with acute pesticide poisoning. During the fine-tuning process, the number of trainable layers is adjusted, and each model variant is evaluated systematically. The initial MLP model consists of five dense layers, including the output layer. The number of trainable dense layers is varied systematically to create models ranging from TL1 to TL5. The specific model architecture can be examined in Figure A3.
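The TL1-TL5 idea (copy the pre-trained weights, then let only the last k dense layers update during fine-tuning) can be illustrated with a minimal NumPy sketch. The layer sizes, the trainable-mask mechanism and the omitted training loop are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random (weight, bias) pairs for consecutive dense layers."""
    return [(rng.normal(0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)           # ReLU on hidden layers
    return 1.0 / (1.0 + np.exp(-x))           # sigmoid output for binary ARF

def fine_tune_variant(pretrained, n_trainable):
    """TL-k: copy pre-trained weights; only the last k layers stay trainable."""
    params = [(W.copy(), b.copy()) for W, b in pretrained]
    frozen = len(params) - n_trainable
    trainable_mask = [i >= frozen for i in range(len(params))]
    return params, trainable_mask

# Pre-train on the OOD data (weight fitting omitted), then build TL1..TL5.
pretrained = init_mlp([16, 32, 32, 32, 32, 1])   # five dense layers incl. output
for k in range(1, 6):
    params, mask = fine_tune_variant(pretrained, n_trainable=k)
    # A fine-tuning loop would now update only layers where mask[i] is True.
    print(f"TL{k}: trainable layers = {sum(mask)} / {len(mask)}")
```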
To ensure the reliability of the models, cross-validation is carried out due to the limited number of patients in the cohort. The dataset is divided into five groups, with the ratio of patients with ARF to patients without ARF maintained in each group. Group 5 was used for early stopping in DL, while the remaining groups (Groups 1-4) were used for four-fold cross-validation, as shown in Figure A4.
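The split described above could be produced with scikit-learn's `StratifiedKFold`, which preserves the ARF ratio in each group; the labels here are synthetic and the group sizes are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, 200)                 # toy ARF labels (~20% positive)

# Split into five groups that each preserve the ARF ratio.
five = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
groups = [test_idx for _, test_idx in five.split(np.zeros((len(y), 1)), y)]

early_stop_idx = groups[4]                    # Group 5: early stopping in DL
cv_groups = groups[:4]                        # Groups 1-4: four-fold CV
for i, val_idx in enumerate(cv_groups):
    train_idx = np.concatenate([g for j, g in enumerate(cv_groups) if j != i])
    print(f"fold {i + 1}: train={len(train_idx)}, val={len(val_idx)}, "
          f"ARF ratio={y[val_idx].mean():.2f}")
```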
We considered key performance metrics during model evaluation, such as the area under the receiver operating characteristic curve (AUROC) and the F1 score. For the final evaluation, a comprehensive range of performance metrics is considered for the best-performing model, including accuracy, precision, recall, F1 score, negative predictive value (NPV), Matthews correlation coefficient (MCC), AUROC and the area under the precision-recall curve (AUPRC). In addition to these quantitative metrics, a visual assessment is conducted to comprehensively understand the model's performance. This visual inspection involves the examination of confusion matrices, AUROC curves and AUPRC curves. It provides valuable insights into the model's performance, behavior and strengths.
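The full metric set could be computed with scikit-learn roughly as follows; the labels and predicted probabilities are synthetic, NPV is derived from the confusion matrix since scikit-learn has no direct NPV function, and average precision is used as the AUPRC estimate.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score,
                             average_precision_score, confusion_matrix)

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.7, 0.4, 0.2, 0.9, 0.1])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "npv": tn / (tn + fn),                             # negative predictive value
    "mcc": matthews_corrcoef(y_true, y_pred),
    "auroc": roc_auc_score(y_true, y_prob),
    "auprc": average_precision_score(y_true, y_prob),  # AUPRC (average precision)
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```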

Statistical Analysis and Model Interpretation
The basic statistics of the datasets from both hospitals were thoroughly examined. A t-test was conducted at a significance level of 0.05 to determine potential between-hospital differences. Each hospital's cases were designated as "1" (indicating ARF) and "0" (indicating non-ARF) and examined separately. Subsequent t-tests were performed at a significance level of 0.05 on these data subsets to determine the significance of the observed differences.
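Between-hospital comparisons of this kind, along with the chi-squared test used for the sex feature, could be run with SciPy as follows; the feature choice and all values are synthetic stand-ins.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy age distributions for the two hospitals (synthetic values).
hospital_a = rng.normal(55, 12, 500)
hospital_b = rng.normal(60, 14, 300)

# Welch's t-test for a continuous feature at the 0.05 significance level.
t_stat, p_value = stats.ttest_ind(hospital_a, hospital_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}, significant = {p_value < 0.05}")

# Chi-squared test for a binary feature such as sex (counts are synthetic).
table = np.array([[260, 240],    # hospital A: male, female
                  [175, 125]])   # hospital B: male, female
chi2, p_sex, dof, _ = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_sex:.3g}")
```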
To better understand how the model works and which features are most important, we identified the features that received high importance from the model. We also used Shapley additive explanation (SHAP) values to confirm the clinical significance of the model's learning process. These SHAP values help us understand how each feature contributes to the model's predictions, making it easier to interpret model-guided decisions. This step is important in determining the practical relevance of the model's findings in a clinical setting.
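For a linear model with independent features, SHAP values have a closed form, phi_i = w_i (x_i − E[x_i]), which makes it easy to illustrate the additivity property the interpretation relies on: per-feature contributions sum to the gap between the prediction and the average prediction. The toy model below is an assumption for illustration; the study's MLP would instead be explained with the `shap` package.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 3))                # background data, 3 features
w, b = np.array([1.5, -2.0, 0.5]), 0.1        # toy linear model f(x) = w.x + b

def linear_shap(x):
    """Exact SHAP values for a linear model with independent features."""
    return w * (x - X.mean(axis=0))

x = np.array([0.8, -0.3, 1.2])
phi = linear_shap(x)
f_x = w @ x + b
base = w @ X.mean(axis=0) + b                 # expected model output E[f(X)]

# Additivity: the contributions sum to f(x) minus the baseline.
print(phi, np.isclose(phi.sum(), f_x - base))
```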

Study Participants' Characteristics
After addressing the outliers, we provide in Table A2 the missing values for each feature and their corresponding proportions. Our study approach comprised selecting features with clinically significant relevance while ensuring that the proportion of missing values did not exceed 5%. For the Soonchunhyang University Cheonan Hospital dataset, features with less than 5% missing data were initially selected. However, additional missing values were introduced after handling the outliers, causing some features to exceed the 5% threshold. Table A2 includes the GCS, which was otherwise excluded from feature selection, for comparative purposes.
Table 1 includes each feature's mean and standard deviation. A statistical analysis was conducted using p-values obtained from the t-tests to verify between-hospital differences. The t-test results indicated significant differences for all features, suggesting the two datasets were OOD. For the sex feature, the table presents the number and percentage of males, and a chi-squared test confirmed the differences detailed in the table. This table offers a comprehensive overview of the statistical differences between the datasets, emphasizing the OOD relationship and the specific comparison for the sex feature. Tables 2 and 3 present key statistics, such as means, standard deviations and statistical significance, highlighting the differences between patients with ARF and patients without ARF in both hospital datasets. For the Korea University Anam Hospital dataset, data pre-processing intentionally introduced distinctions between patients with ARF and patients without ARF, resulting in significant disparities in all features. However, in the Soonchunhyang University Cheonan Hospital dataset, some features were not significantly different between patients with and without ARF. Table A3 provides an insightful overview of the distribution of pesticide-related features in the Soonchunhyang University Cheonan Hospital dataset, offering a better understanding of the features' characteristics. These tables comprehensively illustrate the differences between patients with ARF and patients without ARF and provide insights into the distribution of pesticide-related features in the dataset.

Model Performance
Each model's major performance metrics, including AUROC and F1 scores, are included in Table 4. The models generally have high AUROC values, but the wide confidence intervals raise concerns about the reliability of their performance. This variability can be attributed to differences between the training and test sets, especially in limited datasets. The RF model has the highest AUROC among the traditional models, while the LR model has the narrowest confidence interval. The MLP model has the lowest mean and the widest confidence interval. Notably, the TL model, which has the same structure as the MLP model, significantly outperforms the existing models. The TL approach remarkably narrows the confidence interval, substantially enhancing overall model performance. Figure 4 visually illustrates the comparative performance of each model. Table A4 illustrates the performance when GCS is included as a feature. Incorporating GCS significantly enhances the performance of all models, confirming the crucial importance of GCS as a feature in patients with acute pesticide poisoning.

Table 5 provides detailed performance metrics for the MLP model, which has the same structure as the high-performing TL5 model and uses Group 4 evaluation data. These metrics offer a more detailed view of the model's performance and effectiveness in the specific evaluation context. The table includes a comprehensive set of performance metrics, such as accuracy, precision, recall, F1 score, NPV, MCC, AUROC and AUPRC. Figure A1 visually compares the models' performance, highlighting the confusion matrix, AUROC curve and AUPRC curves. Notably, there is a significant improvement in precision and recall for the TL5 model, highlighting substantial enhancements in its overall performance.

Model Interpretation
Figure A2 displays cases where there is a probability difference of 0.1 or more between the MLP and TL5 models. The red area represents patients with ARF, while the blue area represents patients without ARF. Therefore, if the model's performance is high, the predicted probability should be higher in the red area and lower in the blue area. The observed trend suggests that, except for some cases, the probability increases for patients with ARF and decreases for patients without ARF. Ultimately, the model predicts with greater confidence that patients who experience ARF are likely to do so, and that patients who do not experience ARF are likely to remain free from it. This indicates an increased discriminative ability of the model.
Figure 5 displays the model's SHAP values, highlighting the significant factors contributing to the development of ARF. The analysis provides the following insights:
1. High Cr, low TCO2 and low DBP significantly contributed to the development of ARF.
2. Older age, low BE, high pCO2 and high SBP may contribute to the development of ARF.
3. Glufosinate and organophosphates were more likely to contribute to the development of ARF than other pesticides.
4. Ingesting less than 100 cc carried a lower likelihood of developing ARF, while those who ingested 100-200 cc showed a higher likelihood.


Discussion
Our study is a pioneering effort to apply OOD and TL techniques, commonly used in the image domain, to EHRs, aiming to improve the performance of models with small sample sizes. Specifically, we developed a model for predicting ARF in patients with acute pesticide poisoning with minimized bias [37,41]. In cases of acute pesticide poisoning, MLP models face limitations due to insufficient data, resulting in low performance and wide confidence intervals [45]. In contrast, the newly proposed approach outperforms the MLP model and exhibits narrower confidence intervals. Additionally, the highest performance was achieved when the number of trainable layers was maximized. Maximizing the utilization of information tailored to the intended purpose is more advantageous than simply reusing information from a pre-trained model. The study also confirms the potential benefits of initializing weights using OOD data, particularly in cases of limited data, instead of commonly used initialization schemes [9][10][11][12]. Moreover, transfer learning showed its ability to enhance performance, even when data are scarce, as illustrated in Figure A1.
Simply examining retrospective data does not allow for a discussion of the mechanisms of ARF in patients with acute pesticide poisoning versus those without. However, by leveraging OOD data, the model may have learned more generalized, rough patterns regarding the deteriorating respiratory condition of patients. Weights configured in this manner are expected to facilitate TL effectively. To assess the importance of features, SHAP values were employed. Most of the results were consistent with the trends of feature importance identified in previous research. By combining the importance of individual features as indicated by SHAP values with factors such as pesticide category and ingestion amount, future research can contribute to a better understanding of the mechanisms underlying ARF resulting from acute pesticide poisoning.
However, this study has some limitations. First, it is retrospective and based on data from a single institution. Future studies should address these limitations and expand the scope of data collection. Second, further study is needed to examine the best ways to use OOD data, investigate various TL application methods and develop strategies for handling differing features in different application contexts. For instance, there is a need for discussion on how to address challenges when important features, such as GCS in this study, are mostly missing, making them difficult to leverage in pre-training. Despite its limitations, this study contributes valuable new methodologies for managing limited data in studying rare diseases and comparable conditions, highlighting the significant promise of machine-learning techniques in advancing medical research.

Conclusions
This study pioneers the application of OOD and TL techniques in the EHR domain, particularly in scenarios characterized by limited data. We conducted this research with a focus on predicting acute respiratory failure in patients with acute pesticide poisoning. Our proposed approach surpasses conventional predictive models by leveraging OOD data in conjunction with pre-trained models, highlighting the substantial benefits of OOD data for weight initialization in settings where data are scarce. The outcomes of our experimentation suggest that our method holds promise as a viable alternative for effectively training models with limited data. When an appropriate OOD dataset is adeptly utilized, it introduces a compelling methodology for addressing data limitations in rare diseases and analogous scenarios. Future research should expand beyond these preliminary findings to refine transfer learning applications and formulate strategies for handling diverse data attributes across various medical scenarios. A key emphasis should be placed on addressing the challenge of managing crucial yet dissimilar features in prediction. In conclusion, our study makes a significant contribution by presenting innovative methodologies to navigate challenges posed by limited data in the study of rare diseases and similar conditions. We will conduct additional research to overcome the limitations discussed in this paper. Results from the validation of our approach using open datasets are available in Appendix B and the same repository.

Appendix B
This appendix presents the results of additional experiments conducted to indirectly validate the utility of out-of-distribution (OOD) data in transfer learning (TL) using diverse datasets from the UCI Machine Learning Repository (https://archive.ics.uci.edu/datasets, accessed on 1 January 2024). Acknowledging potential limitations in the experimental design, we emphasize that the identification of appropriate OOD datasets is crucial. The outcomes of TL are explored across various datasets, recognizing that performance improvements may vary based on context. The utilized datasets are detailed below, and further information can be found in the UCI Machine Learning Repository:
We conducted multiple repetitions of the same experimental procedure across various datasets to validate the effectiveness of the proposed method. Initially, minimal data pre-processing, including handling missing values, was performed for each dataset. Subsequently, the datasets were divided into two groups to establish an OOD relationship between the majority and minority classes. One group was utilized for pre-training, while the other was employed to evaluate the proposed method. All models, including the pre-trained model, shared identical structures. Each model comprised two dense layers, with each layer incorporating batch normalization and a dropout layer with a dropout rate of 0.3. The output layer was adjusted with an appropriate activation function for regression and classification tasks. Considering data imbalance, weights were assigned to the minority class during training. For evaluation metrics, the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used for classification tasks, while the mean squared error (MSE) and R-squared (R²) were employed for regression. Additionally, to account for potential variations in performance based on the method of splitting the minority-class data into training and test sets, we varied the random seed and repeated the process of dividing the training and test sets 300 times (7:3). The results were then averaged, and 95% confidence intervals were examined.
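The repeated-split evaluation described above can be sketched in a few lines. The snippet below is a minimal, self-contained illustration using NumPy only: `nearest_mean_score` is a hypothetical stand-in scorer (the study trained an MLP), the data are synthetic, and the 95% confidence interval uses a normal approximation over the repeated splits.

```python
import numpy as np

def auroc(y_true, y_score):
    # Rank-based AUROC, equivalent to the normalized Mann-Whitney U statistic.
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def nearest_mean_score(X_tr, y_tr, X_te):
    # Hypothetical stand-in classifier (the study used an MLP):
    # score test points along the direction between the two class means.
    return X_te @ (X_tr[y_tr == 1].mean(0) - X_tr[y_tr == 0].mean(0))

def repeated_split_ci(X, y, fit_predict, n_repeats=300, test_frac=0.3):
    # Vary the random seed, re-split 7:3 each time, then report the mean
    # AUROC with a normal-approximation 95% confidence interval.
    scores = []
    for seed in range(n_repeats):
        idx = np.random.default_rng(seed).permutation(len(y))
        n_test = int(len(y) * test_frac)
        test, train = idx[:n_test], idx[n_test:]
        if len(np.unique(y[test])) < 2:  # AUROC needs both classes present
            continue
        scores.append(auroc(y[test], fit_predict(X[train], y[train], X[test])))
    scores = np.asarray(scores)
    half = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean(), (scores.mean() - half, scores.mean() + half)

# Toy demonstration on linearly separable synthetic data.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
X[:150] += 1.0
y = np.array([1] * 150 + [0] * 150)
mean_auc, (ci_lo, ci_hi) = repeated_split_ci(X, y, nearest_mean_score, n_repeats=50)
```

Averaging over many random splits, rather than reporting a single split, is what makes the confidence intervals in Table A12 meaningful for these small minority groups.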
In the Pima Indian dataset, we stratified participants into two groups based on body mass index (BMI). The overweight group, defined as a BMI of 25 or above, comprised 259 individuals (40%) with diabetes, while the group with a BMI below 25 included 9 individuals (7%) with diabetes (Table A5). We hypothesized a scenario where diabetes identification is targeted in the non-overweight group. We leveraged the overweight group as a pre-training model and conducted transfer learning. A comparison of the means is presented in Figure A5, and the mean values along with 95% confidence intervals for all datasets are provided in Table A12. In the Cirrhosis Patient Survival Prediction dataset, we considered three scenarios. First, we observed a significantly higher proportion of females in the dataset. Therefore, we predicted the severity of cirrhosis in male patients. Among male patients, 127 individuals (35%) were labeled as 4, while among female patients, 7 individuals (39%) were labeled as 4. Second, we predicted the severity of cirrhosis in elderly patients aged 60 and above. Among patients under 60, 97 individuals (30%) were labeled as 4, while among patients aged 60 and above, 41 individuals (47%) were labeled as 4. Third, we predicted the severity of cirrhosis in patients who took D-penicillamine. Among patients who did not take D-penicillamine and were labeled as 4, 89 individuals (35%) were identified, while among those who took it, 55 individuals (35%) were labeled as 4. For detailed information, please refer to Table A6.
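The pre-train-then-fine-tune procedure used in these scenarios can be illustrated with a deliberately simplified model. The sketch below replaces the paper's MLP with a plain logistic regression so that warm-starting from OOD-pretrained weights fits in a few lines; the synthetic data, group sizes, and hyperparameters are all hypothetical and only mimic the "large OOD group, small target group" setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def train_logreg(X, y, w=None, lr=0.1, epochs=200):
    # Gradient-descent logistic regression; passing `w` warm-starts training
    # from pre-trained weights, mirroring weight initialization from OOD data.
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    if w is None:
        w = np.zeros(Xb.shape[1])
    pos = max(y.mean(), 1e-9)                  # up-weight the minority class
    sw = np.where(y == 1, 0.5 / pos, 0.5 / max(1 - pos, 1e-9))
    for _ in range(epochs):
        grad = Xb.T @ (sw * (sigmoid(Xb @ w) - y)) / len(y)
        w -= lr * grad
    return w

def predict(X, w):
    return sigmoid(np.hstack([X, np.ones((len(X), 1))]) @ w)

# Hypothetical BMI-style scenario on synthetic data: a large "OOD" group for
# pre-training and a small target group for fine-tuning.
rng = np.random.default_rng(0)
X_ood = rng.normal(size=(600, 4))
y_ood = (X_ood[:, 0] + rng.normal(0, 1, 600) > 0).astype(float)
X_small = rng.normal(size=(60, 4)) + 0.5
y_small = (X_small[:, 0] + rng.normal(0, 1, 60) > 0).astype(float)

w_pre = train_logreg(X_ood, y_ood)                                # pre-train on OOD data
w_tl = train_logreg(X_small, y_small, w=w_pre.copy(), epochs=50)  # fine-tune (TL)
w_scratch = train_logreg(X_small, y_small, epochs=50)             # baseline from scratch
probs = predict(X_small, w_tl)
```

The key design choice is that `w_tl` starts from weights already shaped by the related OOD group, whereas `w_scratch` starts from zeros with only the small sample to learn from.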
Comparisons of the means are illustrated in Figure A6. Mean values, accompanied by 95% confidence intervals for all datasets, are also detailed in Table A12. In the NHANES dataset, elderly and non-elderly individuals are labeled as 1 and 0, respectively. In this analysis, we further categorized patients into two groups: those without diabetes and those with diabetes or deemed to be in a pre-diabetic state. Among the former, 338 individuals (15%) were elderly, while among the latter, 26 individuals (33%) were elderly (Table A7). We employed the same methodology as before to predict the elderly group within the latter category. The results can be observed in Figure A7, and the 95% confidence intervals are detailed in Table A12. In the Wisconsin Breast Cancer dataset, excluding the target variable, we applied the K-means algorithm to divide the data into two groups. In Cluster 1, there were 90 individuals (20%) diagnosed with malignant tumors. In contrast, Cluster 0 contained only two individuals (0.4%), resulting in a dataset with severe class imbalance (Table A8). We undertook the task of identifying malignant tumor patients in Cluster 0.
The results can be observed in Figure A8. Additionally, a comprehensive performance comparison is available in Table A12. In the Parkinson's Telemonitoring dataset, there are two Unified Parkinson's Disease Rating Scale (UPDRS) metrics. The first is the motor UPDRS, which is also utilized as a feature, and the second is the total UPDRS. The total UPDRS is determined by considering various indicators along with the motor UPDRS. The dataset encompasses diverse data, including voice recordings, collected over six months from 44 Parkinson's patients. Drawing inspiration from the degenerative nature of Parkinson's disease, we assumed a scenario of
predicting the total UPDRS in patients under the age of 60 (Table A9). To prevent the mixing of data from the same patients between the training and testing sets, we divided the data based on patients. The regression results are presented using MSE and R². The results are depicted in Table A10. Finally, the CDC Diabetes Health Indicators dataset includes variables with various operational definitions, and specific information can be found on the respective website. The overarching task is predicting diabetes status. We assumed three scenarios. The first scenario involved the prediction of diabetes in individuals who have experienced a stroke. For those without a stroke, 32,078 individuals (13%) had diabetes or pre-diabetes, while for those who had a stroke, 3268 individuals (32%) had diabetes or pre-diabetes. The second scenario involved the prediction of diabetes in individuals with coronary artery disease or heart disease. For those without the disease, 27,468 individuals (12%) had diabetes or pre-diabetes, while for those with the disease, 7878 individuals (33%) had diabetes or pre-diabetes. The third scenario involved predicting diabetes in binge drinkers. In this dataset, adult males are defined as binge drinkers if they consume 14 or more drinks per week, and adult females are defined as binge drinkers if they consume 7 or more drinks per week. For non-binge drinkers, 34,514 individuals (14%) had diabetes or pre-diabetes, while among binge drinkers, 832 individuals (6%) had diabetes or pre-diabetes. For detailed information, please refer to Table A11. Each result can be verified in Figure A9, and the comprehensive results, including 95% confidence intervals, are available in Table A12.
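The K-means split used above to form the two Wisconsin Breast Cancer groups can be reproduced with a minimal Lloyd's-algorithm sketch. This is an illustrative reimplementation run on synthetic blobs; the original experiments presumably used a library implementation such as scikit-learn's `KMeans`.

```python
import numpy as np

def kmeans(X, k=2, iters=100, seed=0):
    # Minimal Lloyd's algorithm: alternate between assigning points to the
    # nearest center and recomputing each center as its cluster mean.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated synthetic "feature" blobs (the target variable is
# excluded before clustering, as in the appendix experiment).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 3)),
               rng.normal(10.0, 0.5, size=(50, 3))])
labels, centers = kmeans(X, k=2)
# Cluster membership then defines the pre-training group and the target group.
group_sizes = np.bincount(labels, minlength=2)
```

Clustering on the features alone is one way to manufacture an OOD relationship when no natural stratifying variable (such as BMI or sex) is available.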

Figure 2. Study design for each hospital. "MV" refers to mechanical ventilation. Cases 1 and 2 pertain to patients receiving mechanical ventilation, with Case 2 subsequently excluded based on the exclusion criteria. Case 3 corresponds to patients not on mechanical ventilation. (A,B) represent Korea University Anam Hospital and Soonchunhyang University Cheonan Hospital, respectively. The red and blue dashed lines indicate the prediction time points labeled as "1" and "0", respectively. The red star-shaped symbols signify the occurrence of mechanical ventilation.


Figure 3. Transfer learning process. We used a model pre-trained on OOD data to initialize the initial weights and adjust the number of trainable layers. Red represents the data from Hospital A and the corresponding training results of the model using that data. Blue signifies the data from Hospital B and the model's training outcomes based on it. During the learning process, layers frozen during training by Hospital B maintain their red color, indicating that training by Hospital B did not influence them. In contrast, layers that underwent training exhibit a mix of red and blue, resulting in a purple hue.


Mathematics 2024, 12.

Notably, in the AUROC and AUPRC curves, there is a significant improvement in precision and recall for the TL5 model, highlighting substantial enhancements in its overall performance.

Figure 4. Presentation of AUROC and F1 scores for each model with 95% confidence intervals. Red dots denote the performance of the model reported in prior studies, while blue dots represent the performance of the best model derived from the newly proposed approach.


Figure A2. Comparison of probabilities between the MLP and TL5 models. Only cases with a probability difference of 0.1 or more are displayed in Group 4. The red area represents patients with ARF, while the blue area represents patients without ARF.


Figure A3. The structure of the MLP and TL models.


Figure A6. Comparison with and without transfer learning in the Cirrhosis Patient Survival Prediction dataset: (A) Male, (B) Elderly, (C) D-penicillamine.


Figure A7. Comparison with and without transfer learning in the NHANES dataset.

Figure A8. Comparison with and without transfer learning in the Wisconsin Breast Cancer dataset.

Table 1. Mean and standard deviation according to the feature. "*" means statistically significant under a significance level of 0.05.

Table 2. Differences between patients with ARF and patients without ARF at Korea University Anam Hospital. "*" means statistically significant under a significance level of 0.05.

Table 3. Differences between patients with ARF and patients without ARF at Soonchunhyang University Cheonan Hospital. "*" means statistically significant under a significance level of 0.05.

Table 4. Model performance. In transfer learning, the term "numbers" refers to the number of trainable dense layers.

Table 5. Model performance for Group 4.


Table A2. Number and proportion of missing values by feature.

Table A3. Characteristics of pesticide exposure at Soonchunhyang University Cheonan Hospital.

Table A4. Model performance with Glasgow Coma Scale.

Table A5. Statistics of the Pima Indian dataset. "*" means statistically significant under a significance level of 0.05.
Figure A5. Comparison with and without transfer learning in the Pima Indian dataset.

Table A6. Statistics of the Cirrhosis Patient Survival Prediction dataset. "*" means statistically significant under a significance level of 0.05.

Table A7. Statistics of the NHANES dataset. "*" means statistically significant under a significance level of 0.05.


Table A8. Statistics of the Wisconsin Breast Cancer dataset. "*" means statistically significant under a significance level of 0.05.


Table A9. Statistics of the Parkinson's Telemonitoring dataset. "*" means statistically significant under a significance level of 0.05.


Table A10. Comparison with and without transfer learning in the Parkinson's Telemonitoring dataset.

Table A11. Statistics of the CDC Diabetes Health Indicators dataset. "*" means statistically significant under a significance level of 0.05.