1. Introduction
Very preterm infants are at high risk of developing neurodevelopmental impairments (NDI), which include a range of language, cognitive, sensory, and motor impairments [1]. Those born at or before 32 weeks of gestational age are at a particularly high risk of NDI. Research studies have identified clinical conditions associated with adverse outcomes, including prenatal factors, perinatal factors, and comorbidities. Prenatal risk factors include infections [2], hypertension [3], and malnutrition [4]. Perinatal risk factors include gestational age [5], sex [6], and Apgar score [7]. Moreover, comorbidities such as necrotizing enterocolitis [8], sepsis [9], and bronchopulmonary dysplasia [10], among others, may also impact long-term neurodevelopmental outcome. These risk factors can serve as prospective predictors of NDI and provide a better understanding of the potential pathways to adverse outcomes in preterm infants.
Previous models using traditional statistical methods have been developed to predict NDI in preterm infants from antenatal and neonatal clinical data [11,12,13]. These studies have demonstrated the value of predictive models that can assist neonatologists in early diagnosis and decision-making. Despite these efforts, several drawbacks have been noted, such as a lack of diversity in the variables, treating the variables as independent, and measuring a combined risk of NDI and/or death, among others. Nowadays, owing to the exponential increase in available data, machine learning (ML) models can be applied to complement these statistical models. Previous studies have used clinical features as predictors of NDI in preterm infants and extremely low birth weight infants using ML models. Research performed by Ambalavanan et al. [14] used antenatal and perinatal variables to predict neurodevelopmental outcomes at 18 months using neural networks. Some of the limitations of this model were the small sample size and the selection of variables, as only variables with a previously reported risk were used.
Furthermore, Ambalavanan et al. [15] used antenatal and postnatal data to predict NDI using classification tree models; among the drawbacks they considered is that the model predicted a combined outcome of NDI or death. Moreover, some of the classification tree nodes were based on very small subsets of the original dataset (e.g., a specific treatment) and were less accurate than those based on larger subsets. Another study, performed by Juul et al. [16], aimed to predict NDI outcomes at 2 years using Bayesian Additive Regression Trees with three subsets of selected variables. This research provided meaningful conclusions, such as that a dichotomous version of the NDI outcome performed better than its original version and that total transfusion volume was the most important predictor for their cohort.
Despite these efforts, to the best of our knowledge there is still no ML-based solution that performs well on this task for small cohorts and highly imbalanced datasets using clinical data from very preterm infants. This study aims to evaluate the application of ML techniques to predict NDI in very preterm infants using clinical data from a local cohort acquired at the Hospital Puerta del Mar, in Cádiz, Spain, which includes prenatal, perinatal, and comorbidity records. The authors’ main goal is to determine the set of ML techniques, together with the set of clinical variables, that best predicts NDI outcomes in very preterm infants. The main contributions of this work are summarized as follows:
Develop supervised ML models to predict NDI in very preterm infants.
Analyze the performance of these ML models when using all clinical features available, a subset of them guided by experts in this field, and mutual information-selected features.
Apply a commonly used data augmentation technique to deal with data scarcity and class imbalance issues.
The rest of the paper is organized as follows. Section 2 describes the study design and data collection. Section 3 explains the feature selection methods and ML classifiers implemented in this research. Section 4 describes the strategies for the generation of synthetic data and the evaluation metrics considered. Section 5 details and discusses the results obtained with the experimental design and the chosen methods. Finally, Section 6 gives an overview of the conclusions of this study and the approaches to be considered in future work.
2. Dataset
2.1. Study Design and Participants
Data were prospectively collected in a cohort study including very preterm infants from May 2018 to January 2021 at Puerta del Mar University Hospital, Cádiz, Spain. Research and Ethics Committee approval and informed consent of participants were obtained. Inclusion criteria were very preterm birth at ≤32 weeks of gestation and/or very low birth weight (≤1500 g).
A total of 52 clinical features were used in this study, comprising prenatal, perinatal, and comorbidity features. Prenatal variables refer to the mother’s health records at the time of pregnancy, such as age, hypertension, hypothyroidism, chorioamnionitis, gestational diabetes, preeclampsia, cesarean delivery, in vitro fertilization, etc. Perinatal variables refer to clinical records of the preterm infant during delivery, such as gestational age, sex, Apgar score at 1 and 5 min of life, Clinical Risk Index for Babies (CRIB index), intubation in the delivery room, head circumference at birth, and small for gestational age (birth weight below the 10th centile), etc. Comorbidities include patent ductus arteriosus, bronchopulmonary dysplasia, days of oxygen therapy, and mechanical ventilation, among others.
2.2. Neurodevelopmental Assessments at 2 Years Corrected Age
In this study, we analyzed a dataset of 180 very preterm infants, each assessed for neurodevelopmental outcomes at 2 years of corrected age. The corrected age is the chronological age reduced by the number of weeks born before 40 weeks of gestation. Assessments were conducted using the Bayley Scales of Infant Development, 3rd Edition (Bayley III) [17]. This test evaluates three different areas of neurodevelopment: motor, cognitive, and language. The scores are independent for each area, and the evaluation is performed by a qualified clinical psychologist. The Bayley score is a quantitative variable that is categorized using the following thresholds: ≥85 normal neurodevelopment, <85 mild impairment, and <70 severe impairment.
To address the imbalance in our dataset, we simplified the problem to binary classification: values ≥ 85 were considered normal neurodevelopment and values < 85 mild to severe impairment. For cognitive impairment, 155 patients had a normal neurodevelopmental outcome, while 25 had mild to severe impairment. For motor impairment, 156 patients had a normal neurodevelopmental outcome, while 24 had mild to severe impairment. For language impairment, 139 patients had a normal neurodevelopmental outcome, while 41 had mild to severe impairment.
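For illustration, the following is a minimal sketch of this categorization and binarization, assuming NumPy; the score values and variable names are hypothetical placeholders, not data from the cohort.

```python
import numpy as np

# Hypothetical Bayley-III composite scores for one domain (e.g., cognitive)
bayley_scores = np.array([102, 88, 67, 79, 95, 110, 84])

# Three-level categorization described above
categories = np.where(bayley_scores >= 85, "normal",
                      np.where(bayley_scores >= 70, "mild", "severe"))

# Binary outcome used in this study:
# 0 = normal (score >= 85), 1 = mild to severe impairment (score < 85)
impairment = (bayley_scores < 85).astype(int)

print(categories)  # ['normal' 'normal' 'severe' 'mild' 'normal' 'normal' 'mild']
print(impairment)  # [0 0 1 1 0 0 1]
```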
2.3. Data Curation and Pre-Processing
Of the 52 features considered in this study, 32 had less than 6% missing values, while 6 features had around 30% to 40% missing values. Features with missing values were imputed using two simple methods, according to the type of feature. The imputation approaches used in this study are simple and fast; however, given the complexity of clinical features, suggestions from the clinicians were taken into account when applying these methods.
Numerical features: imputation was performed using the mean value, where missing values were replaced with the mean of all known values of each feature. Moreover, normalization was applied, where each feature was scaled to the range 0–1. Scaling was fitted on the training dataset, and the test set was normalized using the training statistics.
Categorical features: most frequent value imputation was performed, where missing values were substituted with the most frequent value of each feature. Subsequently, these features were transformed using dummy encoding, where each feature with k categories was converted into k − 1 binary indicator variables [18], as illustrated in the sketch below.
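A minimal sketch of this preprocessing, assuming scikit-learn and hypothetical column names (the real dataset has 52 clinical features), following the description above: mean/most-frequent imputation, 0–1 scaling fitted on the training split only, and k − 1 dummy encoding.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical column names standing in for the 52 clinical features
numerical = ["gestational_age", "birth_weight", "apgar_5min"]
categorical = ["sex", "sepsis", "bronchopulmonary_dysplasia"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # mean imputation
    ("scale", MinMaxScaler()),                   # scale each feature to 0-1
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
    ("encode", OneHotEncoder(drop="first")),              # k - 1 dummy variables
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numerical),
    ("cat", categorical_pipe, categorical),
])

# Fit on the training split only, then apply to the test split, so that
# imputation and scaling statistics come from the training data alone:
# X_train_t = preprocess.fit_transform(X_train)
# X_test_t = preprocess.transform(X_test)
```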
5. Results and Discussion
In this study, we worked with a small and class-imbalanced dataset (approximately 80% vs. 20%) and evaluated the performance of different models. The results indicate that applying an oversampling method such as SMOTE-NC yields better results for cognitive and motor impairments when using all features (Table A1 and Table A2), clinician-selected features (Table A3 and Table A4), and mutual information-selected features (Table A5 and Table A6). Given our aim to predict the positive class (mild/severe impairment), our primary focus was on models that achieved better recall, as shown in Figure 1. We are particularly interested in a model that can correctly identify the true positives, meaning that patients who actually have an impairment should be detected by the model as often as possible.
Moreover, for the best-performing model in motor impairment prediction (Gaussian Naive Bayes), we present the ROC curve in Figure 2, which achieved an AUC value of 0.69.
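As an illustration of this evaluation setting, the following is a minimal sketch assuming scikit-learn and imbalanced-learn, with synthetic data standing in for the preprocessed cohort (sample sizes, feature counts, and variable names are placeholders); SMOTE-NC is applied to the training split only, and recall and ROC AUC are computed on the held-out test split.

```python
import numpy as np
from imblearn.over_sampling import SMOTENC
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in: 180 infants, mixed numerical/categorical features,
# roughly 80/20 class imbalance (0 = normal, 1 = mild/severe impairment)
n = 180
X_num = rng.normal(size=(n, 4))          # e.g., scaled numerical clinical values
X_cat = rng.integers(0, 2, size=(n, 2))  # e.g., binary comorbidity indicators
X = np.hstack([X_num, X_cat])
y = (rng.random(n) < 0.2).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Oversample the minority class in the training split only;
# columns 4 and 5 are the categorical features
smote = SMOTENC(categorical_features=[4, 5], random_state=0)
X_res, y_res = smote.fit_resample(X_train, y_train)

model = GaussianNB().fit(X_res, y_res)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Recall (impairment class):", recall_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
```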
Clinician-selected features performed better based on the recall metric than all features for the three types of impairment prediction. However, combining mutual information-selected features with SMOTE-NC outperformed the other settings in cognitive and motor impairment prediction, whereas for language impairment prediction the best setting was clinician-selected features. In this sense, the best-performing model according to recall was Gaussian Naive Bayes for all types of neurodevelopmental impairment prediction. Moreover, applying SMOTE-NC improved the prediction of cognitive and motor impairments across all settings compared with no oversampling, while in language impairment prediction SMOTE-NC yielded notable results in the mutual information-selected features setting. No single feature selection strategy employed in this work was consistent across the three cases, indicating that this step should be re-evaluated, as recent studies have proposed more sophisticated approaches for addressing this task [16].
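For reference, the following is a minimal sketch of mutual information-based feature selection, assuming scikit-learn; the placeholder data and the number of retained features are illustrative and not the values used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Placeholder data standing in for the 52 preprocessed clinical features
X, y = make_classification(n_samples=180, n_features=52, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

# Keep the k features with the highest mutual information with the outcome;
# k = 10 is an arbitrary placeholder
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                    # (180, 10)
print(selector.get_support(indices=True))  # indices of the retained features
```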
Furthermore, it is important to mention that, even though prediction of the mild/severe class was the goal of this analysis, merging both classes due to the lack of patients may imply a loss of clinical information regarding the state and prognosis of each patient; therefore, future analyses should consider these classes independently. Moreover, it has been previously stated by Juul et al. [16] that, even with advanced methods, the field is still not able to predict complex long-term outcomes such as the Bayley score. In future work, we will therefore explore additional features that have previously been reported as predictors of NDI, such as socioeconomic features [25] and image-based features such as brain volumes [26,27]. Moreover, multimodal data integration appears to be a promising alternative for predicting NDI outcomes with machine learning methods; research in this area has focused on integrating clinical and image-based data to predict NDI in preterm infants [28,29].
6. Conclusions and Future Work
In this work, we assessed a small and class-imbalanced dataset to predict neurodevelopmental impairments at two years in very preterm infants, using six different classification models, employing all features, features selected by an expert, and features selected by mutual information, and applying SMOTE-NC oversampling. In addition, this evaluation was performed using clinical data from a local cohort acquired at a Spanish public hospital. The results indicate that using mutual information-selected features together with SMOTE-NC was the best setting for cognitive and motor impairment prediction, while for language impairment prediction the best setting was clinician-selected features. The feature selection performed by experts in this field should be reconsidered in further studies, implementing a more complex approach, as recent studies have suggested.
To this end, this work could be extended by applying further methods, adding more heterogeneous features such as image-based features, and reframing the prediction as a regression problem so that the Bayley score can be predicted in its original form. Moreover, it is important to note that this is a small dataset; we expect to include more patients and to validate the models on external cohorts.