A Machine Learning Algorithm for Quantitatively Diagnosing Oxidative Stress Risks in Healthy Adult Individuals Based on Health Space Methodology: A Proof-of-Concept Study Using Korean Cross-Sectional Cohort Data

Oxidative stress aggravates the progression of lifestyle-related chronic diseases. However, knowledge and practices that enable quantifying oxidative stress are still lacking. Here, we performed a proof-of-concept study to predict the oxidative stress status in a healthy population using retrospective cohort data from Boramae Medical Center in Korea (n = 1328). To obtain binary performance measures, we selected healthy controls versus oxidative disease cases based on the “health space” statistical methodology. We then developed a machine learning algorithm for discriminating oxidative stress status using least absolute shrinkage and selection operator (LASSO)/elastic net regression with 10-fold cross-validation. The proposed fine-tuned model included 16 features selected from the full spectrum of diverse and complex data. The predictive performance was externally evaluated by generating receiver operating characteristic curves, which gave an area under the curve of 0.949 (CI 0.925 to 0.974), a sensitivity of 0.923 (CI 0.879 to 0.967), and a specificity of 0.855 (CI 0.795 to 0.915). Moreover, the discrimination power was confirmed by applying the proposed diagnostic model to the full dataset consisting of subjects with various degrees of oxidative stress. The results provide a feasible approach for stratifying oxidative stress risks in the healthy population and selecting appropriate strategies for individual subjects toward implementing data-driven precision nutrition.


Introduction
In general, aging comprises interconnected physiological processes accompanied by unfavorable changes in body composition and metabolic dysfunction [1]. These changes may cause persistent oxidative stress, leading to a state of chronic low-grade inflammation [2]. If not adequately controlled, oxidative stress and chronic low-grade inflammation may result in structural and functional abnormalities, leading to diet- and lifestyle-related chronic diseases, such as metabolic syndrome, cardiovascular disease, neurodegenerative disease, and certain cancers [1][2][3]. Therefore, diagnosing and identifying oxidative stress risks at an early stage is essential to optimize health and wellbeing and alleviate the increasing burden of diet- and lifestyle-related chronic diseases [4].
Many studies have identified variables, including cigarette smoking [5], aging [3], the lipid peroxidation product malondialdehyde (MDA) [6,7], and body mass index (BMI) [8,9], as factors associated with oxidative stress. However, in the diagnostic setting, the response cannot be based on a single variable because of the multifactorial components found within a person [10]. According to Califf (2018), a composite biomarker may instead help capture the subtle changes that describe these processes and their interactions. Moreover, each of the multiple variables may play a critical role in the summative outcome of oxidative stress, allowing monitoring of disease progression at an early stage [11]. Researchers have used many statistical or graphical techniques, such as pattern recognition, principal component analysis, and partial least squares discriminant analysis, to analyze and visualize multiple variables at a time [12]. However, these techniques often define the axes based on the variation or the line that best separates the defined groups, so the axes have no biological meaning [13]. Bouwman et al. (2012) recently proposed a statistical visualization method, named 'health space', that addresses this issue by projecting individual subjects' health status onto predefined biological processes determined by multivariate parameterization.
The rapid development in big data analysis and artificial intelligence has recently begun to unlock clinically relevant information hidden in large volumes of cohort data, journals, and clinical records, assisting clinical practice by providing up-to-date medical information or reducing diagnostic and therapeutic errors [14][15][16]. However, to the best of our knowledge, no studies have applied machine learning (ML) algorithms as a starting point for reducing the risk of oxidative stress-related chronic diseases in advance and for addressing questions concerning data-driven precision nutrition. We tackle this problem with a proof-of-concept study that tests the utility of the health space methodology for extensive population studies. Furthermore, we used ML approaches to develop a model that discriminates oxidative stress risks with strong prediction power and interpretability by selecting valuable features from the full spectrum of diverse and complex data. Subsequently, we validated the proposed diagnostic model in a separate hold-out test set of subjects. This study may initiate and facilitate evidence-based, data-driven precision nutrition.

Study Population
This study used a retrospective cohort design to analyze 2454 subjects who first came to Seoul Metropolitan Government-Seoul National University Boramae Medical Center (Seoul, Republic of Korea) for regular general health check-ups between 1 April 2015 and 31 August 2018. A total of 1328 subjects were included after excluding (1) subjects with missing data on any independent variables selected for analysis (n = 618), (2) subjects aged under 20 years (n = 5), (3) repeat visits by the same person (n = 117), and (4) patients diagnosed with diseases other than oxidative diseases (n = 386). The Institutional Review Boards of Seoul National University Boramae Medical Center (approval number 20140929/26-2014-118/102) and Ewha Womans University (approval number 86-8) approved this study. We received written informed consent from all subjects and performed all procedures following the relevant guidelines and regulations.
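The exclusion flow above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the counts are those reported in the text, while the dictionary labels and the function name are illustrative and not part of the original analysis (which was carried out in R).

```python
# Counts are taken from the text; the labels and function name are illustrative.
screened = 2454
exclusions = {
    "missing data on selected variables": 618,
    "aged under 20 years": 5,
    "repeat visit by the same person": 117,
    "diagnosed with diseases other than oxidative diseases": 386,
}


def remaining_after_exclusions(total, excluded):
    """Return the number of subjects left after applying all exclusions."""
    return total - sum(excluded.values())


included = remaining_after_exclusions(screened, exclusions)
print(included)  # 1328
```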

Development and Validation of ML Algorithm
A schematic of the analysis pipeline is presented in Figure 1. We first split the eligible dataset (n = 1328) into training (n = 911) and hold-out (n = 417) datasets after removing the duplicate samples. The training dataset consisted of subjects who received a regular general health check-up at the Seoul Metropolitan Government-Seoul National University Boramae Medical Center in 2015-2016. The most recent dataset, from the 2017-2018 period, was used as the hold-out test set. Notably, the separate and independent hold-out dataset was used only for validating the final model, so that it could not bias model development. To obtain binary performance measures, we formed two reference groups in the training and hold-out datasets by adopting a health space model [13]. The healthy controls were defined as subjects having neither metabolic syndrome nor its components nor physician-diagnosed medical conditions. Extensive phenotyping of oxidative stress-related diseases was performed based on previously published data. As a result, the oxidative disease cases were defined as subjects with metabolic syndrome [1], dyslipidemia [9], hypertension [9], intermediate coronary syndrome [5], stroke [7], diabetes mellitus [23], or liver cirrhosis [24], among all subjects diagnosed as having diseases (Supplementary Materials Figure S1). We used the disease terminology presented by the Human Phenotype Ontology [25].
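The temporal split and the reference-group definition described above can be sketched as follows. This is a simplified illustration rather than the authors' code: the `Subject` record and all of its field names are hypothetical, and the oxidative disease flag is assumed to already cover the listed diagnoses, including metabolic syndrome.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    # All field names are hypothetical and used only for illustration.
    visit_year: int
    n_mets_components: int       # number of metabolic syndrome components present
    has_oxidative_disease: bool  # any listed oxidative disease, incl. metabolic syndrome
    has_other_diagnosis: bool    # any other physician-diagnosed condition


def split_by_period(subjects):
    """Temporal split: 2015-2016 for training, 2017-2018 as hold-out."""
    train = [s for s in subjects if s.visit_year in (2015, 2016)]
    holdout = [s for s in subjects if s.visit_year in (2017, 2018)]
    return train, holdout


def reference_label(s):
    """Return 1 for an oxidative disease case, 0 for a healthy control,
    and None for subjects who belong to neither reference group."""
    if s.has_oxidative_disease:
        return 1
    if s.n_mets_components == 0 and not s.has_other_diagnosis:
        return 0
    return None
```

Subjects with one or two metabolic syndrome components receive no reference label under this sketch, so they are left out of the binary training target but remain available for later stratified comparison.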
Considering the high-dimensional and collinear characteristics of the data, we employed the regularized generalized linear model (GLM) with the least absolute shrinkage and selection operator (LASSO) and elastic net penalties using the R package "glmnet" [26]. For the elastic net, the mixing parameter alpha was set to 0.5, giving equal weight to the ridge and LASSO penalties [27]. The resulting multivariate regression model was validated by 10-fold stratified cross-validation (CV) to avoid overfitting [28]. This procedure was repeated 100 times to improve stability by increasing the number of evaluations from 10 to 1000 [29]. Then, based on the one-standard-error rule, we set the tuning parameter lambda (λ) to lambda.1se, the largest λ whose cross-validated misclassification error lies within one standard error of the minimum, to tune the elastic net and LASSO regressions [30,31]. The best-performing model was identified by comparing the best candidates derived from the elastic net and LASSO regressions using the area under the receiver operating characteristic curve (AUC). Finally, external validation of the best-performing model was carried out by calculating AUC, specificity, sensitivity, accuracy, negative predictive value (NPV), positive predictive value (PPV), positive clinical utility index (CUI+), and negative clinical utility index (CUI-) [32,33].
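The threshold-based validation measures listed above all derive from a 2 x 2 confusion matrix. The sketch below shows the standard definitions, with CUI+ taken as sensitivity x PPV and CUI- as specificity x NPV. The function name and the example counts are illustrative assumptions and are not results from this study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute confusion-matrix-based measures for a binary classifier.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "cui_pos": sensitivity * ppv,  # positive clinical utility index
        "cui_neg": specificity * npv,  # negative clinical utility index
    }


# Hypothetical counts, chosen only to illustrate the calculation.
example = diagnostic_metrics(tp=90, fp=20, tn=80, fn=10)
```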
Furthermore, we investigated calibration to confirm whether the predicted probabilities agree with the observed probabilities [34]. To this end, the final prediction algorithm was applied to the entire dataset (n = 1328), stratified into four categories according to the number of metabolic syndrome risk factors (0, 1, and 2) and the presence of oxidative stress diseases. Next, we applied our final prediction algorithm to the subjects excluded from this study due to other conditions unrelated to oxidative stress. Then, Duncan's post hoc test was used to delineate group differences further.

Statistics
All statistical analyses were performed using the R software (version 3.6.1; R Foundation for Statistical Computing) [35]. The results were expressed descriptively as means and standard deviations (SD) for continuous variables and as numbers (percentages) for categorical variables. Differences between healthy controls and oxidative disease cases were compared by Student's t-test for continuous variables and the Chi-square test for categorical variables. Statistical significance was set at p < 0.05.

Characteristics of the Reference Groups
Out of 1328 subjects, 884 samples were extracted as healthy controls (n = 379, 28.5%) and oxidative disease cases (n = 505, 38%) to develop and validate the ML algorithm discriminating oxidative stress risks in the healthy population. Table 1 compares the 43 features derived from these two reference groups in the training and hold-out datasets, respectively. The results indicate the separable features of the groups. In both datasets, the subjects in the oxidative disease cases were older, with a higher percentage of male subjects and smokers than the healthy controls. The oxidative disease cases had significantly higher BMI and body fat percentages than the healthy controls. This trend was similar for the ALP, BUN, creatinine, CRP, GGT, GPT, HbA1c, and UA levels. In contrast, albumin, bilirubin, LDL-C, TC, and total protein levels did not differ between the two groups. For CBC data, Hb, Hct, MCHC, RBC, and WBC were significantly higher in the oxidative disease cases than in the healthy controls.
Table 1. Comparison between healthy controls and oxidative disease cases included in the training and hold-out datasets for the machine learning algorithm discriminating oxidative stress risks in the healthy population.

Developing an ML Algorithm for Discriminating Oxidative Stress Risks in the Healthy Population
The heatmap presented in Figure 2 shows all features and corresponding coefficients obtained from the training dataset using logistic regression with elastic net (upper) and LASSO (lower) penalties. Ten-fold CV was run 100 times for each penalty. The color scale beneath the heatmap represents the range of coefficient values, where blue is negative and red is positive. Twenty-one out of a total of 43 features were extracted at least once within the 200 individual replications. Age, BMI, GGT, GPT, Hb, HbA1c, and WBC were the most consistently extracted features across all 200 models. The CV results are presented in Figure 3A, depicting the mean squared prediction error against log λ with one-standard-error bars. The vertical dashed lines indicate the λ value selected by the "one-standard-error" rule (lambda.1se), that is, the largest λ whose CV error lies within one standard error of the minimum. The elastic net regression gave lambda.1se = 0.0253 and the LASSO regression gave lambda.1se = 0.0167; at these values, LASSO produced the more parsimonious model of the two. Figure 3B compares the performance of the elastic net and LASSO models using AUC, showing that the LASSO model (AUC 0.949, 95% confidence interval [CI] 0.925-0.974) performed slightly better than the elastic net model (AUC 0.948, 95% CI 0.924-0.973). The best LASSO regression model contains 16 features (age, plasma MDA, BMI, RFS, HbA1c, GPT, GGT, bilirubin, albumin, WBC, RBC, Hb, RDW, monocytes, basophils, and MCHC). The feature with the most negative coefficient was bilirubin, while those with the largest positive coefficients were HbA1c and albumin. In contrast, the best elastic net regression model contains four more features: creatinine, UA, body fat percentage, and LDL-C. (Table S1).
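The one-standard-error rule used to select lambda.1se can be sketched as follows. This mirrors the documented behaviour of the glmnet cross-validation routine, which returns the largest λ whose mean CV error is within one standard error of the minimum. The function name and the example grid of λ values and errors are illustrative assumptions and are not the values from this study.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    lambdas : candidate penalty values
    cv_mean : mean CV error for each candidate
    cv_se   : standard error of the CV error for each candidate
    """
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    threshold = cv_mean[i_min] + cv_se[i_min]
    # Among all candidates within one SE of the minimum, take the largest
    # lambda, i.e. the most heavily regularized (most parsimonious) model.
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return lambdas[i_min], max(eligible)


# Hypothetical CV curve, chosen only to illustrate the rule.
grid = [0.1, 0.05, 0.025, 0.0125, 0.006]
mean_err = [0.20, 0.15, 0.125, 0.12, 0.14]
se_err = [0.02, 0.015, 0.01, 0.01, 0.01]
lam_min, lam_1se = select_lambdas(grid, mean_err, se_err)
```

In this toy curve the minimum error occurs at λ = 0.0125, but λ = 0.025 is still within one standard error of it, so the rule prefers the larger penalty.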
Figure 5A is a violin plot illustrating the diagnostic values for oxidative stress of all individuals stratified into four categories based on the number of metabolic syndrome risk factors and the presence of oxidative stress diseases. The results confirmed that our final model can suitably define healthy and oxidative disease categories, which seems to be better than the traditional separation based on clinical diagnosis or the metabolic syndrome definition. Moreover, our final model can identify individuals with higher metabolic risks as having higher oxidative stress risks, approaching those of the oxidative disease cases (p for trend < 0.001). Figure 5B shows the diagnostic values for oxidative stress of the subjects excluded from this study due to the presence of other diseases unrelated to oxidative stress. The other disease groups, colored in purple, were significantly different or at least tended to differ from either the healthy controls or the oxidative disease cases, but it was challenging to calibrate the apparent magnitude of their oxidative stress risks.
Figure 5. (A) Calibration of oxidative stress risks of subjects having zero (n = 379), one (n = 282), and two metabolic risk factors (n = 162) and oxidative disease cases (n = 505). (B) Application of oxidative stress risks of subjects with other diseases not related to oxidative stress (n = 386) in comparison to the healthy controls (n = 379) and oxidative disease cases (n = 505). The violin plot represents the distribution of data by using the kernel density function, and the width of the violin plot represents the sample size at this level. The red dot on each box plot represents the mean value. The box in the violin plot represents the median and quartiles; the extension from the thin black line represents the 95% confidence interval. Different lowercase letters indicate a significant difference between health statuses by Duncan's test (p < 0.05).

Discussion
Given the need to better discriminate high-risk individuals for diet- and lifestyle-related chronic diseases in the general population, we developed and validated an ML model to diagnose individuals' oxidative stress risks at an early stage. Many prediction models for disease diagnosis or prognosis have emerged [10,36], but they do not enable individuals to monitor disease deterioration. The present study was initiated as a proof of concept of the health space model, a statistical model for analyzing and visualizing the treatment effects of functional foods in individual subjects based on predefined biological processes [13]. The health space model was applied to depict subjects' oxidative stress status by evaluating the summative outcomes of biological processes, thus avoiding erroneous conclusions [11]. Furthermore, it was expanded to derive binary samples of healthy controls and oxidative disease cases from the Boramae cohort data.
As oxidative stress is complex and multifactorial, a composite biomarker might facilitate diagnosing and risk-stratifying subjects with high oxidative stress levels better than any single biomarker [6][37][38][39][40][41]. The current study used ML models for the multivariate statistical analysis rather than traditional statistical approaches. The advantages of ML models over traditional statistical approaches include their ability to consider interactions between features and explore combinations that might not be apparent [42]. Among many algorithms, the GLM-based technique has mainly been used for the text mining of electronic health reports and for developing prediction models in health care, providing a simple and interpretable description [43]. However, the performance of GLMs has been reported as unsatisfactory because they are prone to overfitting and sensitive to outliers [44]. Alternatively, glmnet, which fits a GLM via penalized maximum likelihood, can be used to overcome the limitations of naïve GLMs [28,45,46]. In the current study, we applied two representative regularization methods, LASSO and elastic net. Our results showed that the LASSO regression predicted oxidative stress risks better than the elastic net regression.
In a data mining algorithm, proper internal validation is essential to prevent model overfitting and reduce potential false-positive findings [47]. The split-sample approach is a straightforward and popular method in which the training data are randomly split into two parts, one for developing a model and one for measuring its performance. CV is a more sophisticated approach that can intuitively be regarded as an extension of the split-sample method [48]. In the current study, we applied 10-fold CV for internal validation, in which a model developed on 90% of the sample is tested on the remaining 10%. This procedure was repeated 100 times in the training dataset to improve CV stability. Ideally, a test dataset is kept intact and brought out only at the end of the data analysis. However, because the CV test folds were used repeatedly in this study, we determined the regularization parameters so as to choose the most parsimonious model whose error is no more than one standard error above the minimum [31,48]. As a result, we obtained a final transparent and easy-to-interpret diagnostic model composed of a combination of 16 features representing a composite of anthropometrical, biochemical, and clinical data.
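The repeated, stratified 10-fold CV scheme described above can be sketched in plain Python. This is an illustrative partitioning routine only, not the authors' R implementation: the function names and the use of the repeat index as the random seed are assumptions, and no model fitting is shown.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def repeated_stratified_cv(labels, k=10, repeats=100):
    """Yield (train_idx, test_idx) pairs: k per repeat, k * repeats in total."""
    for r in range(repeats):
        folds = stratified_folds(labels, k=k, seed=r)
        for f in range(k):
            test_idx = folds[f]
            train_idx = [i for g in range(k) if g != f for i in folds[g]]
            yield train_idx, test_idx
```

With k = 10 and 100 repeats, the generator produces 1000 train/test pairs, each testing on roughly 10% of the training data and fitting on the remaining 90%.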
We note that we reserved a separate and independent hold-out dataset for external validation of the final model. AUC analysis was performed to assess the final model performance, giving an AUC value of 0.949 in the validation dataset versus 0.935 in the training dataset. In the context of discrimination, AUC values of 1.0, 0.9-0.99, 0.8-0.89, 0.7-0.79, 0.51-0.69, and 0.5 can be interpreted as perfect, excellent, good, fair, poor, and of no value, respectively [49]. Therefore, our model may provide appropriate selection criteria for individuals to implement data-driven precision nutrition. In contrast to our result, some studies that explored a global oxidative stress index (Oxidative-INDEX) did not reach beyond the good prediction level. Park et al. [50] created an oxidative stress score consisting of MDA, oxidized low-density lipoprotein, and 8-epi-prostaglandin F2α. The AUC value was 0.75 when the score was combined with single nucleotide polymorphisms to predict obesity in the Korean population. In another study, the Oxidative-INDEX was calculated in patients with coronary artery disease by measuring the overall balance of pro- and antioxidant exposure after z-score standardization [51]. The score showed significant associations with diabetes, smoking habit, hypercholesterolemia, aging, and CRP in the multivariate regression analysis, suggesting the potential use of the Oxidative-INDEX in preventing, diagnosing, and treating coronary artery disease. However, the score was not validated, and further investigation was needed in the general population.
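The AUC and its qualitative interpretation can be illustrated with a short sketch. The AUC is computed here in its rank-based form, as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The function names are illustrative, and because the bands quoted above leave small gaps (for example, between 0.89 and 0.9), the lower bound of each band is treated as inclusive as a simplifying assumption.

```python
def auc(scores_cases, scores_controls):
    """Rank-based AUC: fraction of case/control pairs in which the case
    has the higher score, with ties counted as one half."""
    wins = 0.0
    for c in scores_cases:
        for h in scores_controls:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(scores_cases) * len(scores_controls))


def interpret_auc(value):
    """Map an AUC value to the qualitative bands quoted in the text."""
    if value >= 1.0:
        return "perfect"
    if value >= 0.9:
        return "excellent"
    if value >= 0.8:
        return "good"
    if value >= 0.7:
        return "fair"
    if value > 0.5:
        return "poor"
    return "no value"
```

Under this mapping, the reported hold-out AUC of 0.949 falls in the "excellent" band.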
There is increasing interest in the application of ML in nutritional science to prevent and improve diet- and lifestyle-related chronic diseases. However, to our knowledge, no prior published studies have applied ML algorithms to quantify health conditions and provide appropriate strategies to individuals for implementing data-driven precision nutrition in the context of our study. Our model was able to quantitatively stratify oxidative stress burden, as expected. It could even distinguish healthy individuals from most individuals with diseases unrelated to oxidative stress, although with relatively lower precision. However, this study had several limitations. First, even though the model performance was excellent, as indicated by CV discrimination and external validation, there was partial overlap between the groups categorized by the number of metabolic syndrome risk factors (0, 1, and 2) and the presence of oxidative stress diseases. Other statistical approaches may have to be employed to solve this problem. Second, this study was conducted using a dataset obtained from adults residing in the Republic of Korea. Thus, the results of this study may not be generalizable to all ethnic populations. Further investigations are underway to develop and validate an oxidative stress predictor in a more extensive and diverse multi-ethnic population using the same ML techniques. Third, in this study, we excluded subjects with missing observations rather than replacing missing values with imputation methods. This approach may reduce statistical power or yield biased estimates, as in traditional epidemiology and clinical research [52]. Therefore, we performed multiple imputation by chained equations on the partially observed cases using the R package mice [53]. Although the number of selected features increased from 16 to 22 after imputation, most of them overlapped with the features of the model presented here. In addition, the diagnostic performance of the two models was comparable (Figure S2). Last, the primary aim of creating the composite predictor was to facilitate identifying an individual's health condition and predicting longitudinal outcomes of an individual's health risks. However, because we used a cross-sectional dataset, we could develop only a diagnostic model for distinguishing the oxidative stress levels of each individual. We may need a population-based longitudinal cohort dataset that follows individuals and tracks the progression of chronic diseases related to oxidative stress in order to develop a prognostic risk model and use it as a practical reference for subsequent personalized health decisions. The prognostic model is likely to be conveyable to the general public if appropriately validated.

Conclusions
The proposed ML algorithm is based on cross-sectional data obtained from the Boramae Medical Center cohort in Korea. It allowed the development of a composite diagnostic model for oxidative stress at the individual level. The resulting predictor comprised 16 features and showed excellent performance in quantifying oxidative stress risks. Considering the importance of these findings in the context of precision nutrition, the limitations of this study create opportunities for further research to enhance and expand our current understanding. The present study is the first to report a feasible approach for stratifying oxidative stress risks in a healthy population, providing appropriate strategies to engage in proactive health management to prevent diet- and lifestyle-related chronic diseases.

Patents
The application numbers of the patent related to this work are 10-2020-0071809 and PCT/KR2021/007312.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/antiox10071132/s1: Figure S1, Classification of oxidative stress-related disease; Figure S2, Analysis of missing data; and Table S1, Comparison of healthy controls in the training set and those in the hold-out set.