A Chemical Structure and Machine Learning Approach to Assess the Potential Bioactivity of Endogenous Metabolites and Their Association with Early Childhood Systemic Inflammation

Metabolomics has gained much attention due to its potential to reveal molecular disease mechanisms and present viable biomarkers. This work uses a panel of untargeted serum metabolomes from 602 children from the COPSAC2010 mother–child cohort. The annotated part of the metabolome consists of 517 chemical compounds curated using automated procedures. We created a filtering method for the quantified metabolites using predicted quantitative structure–bioactivity relationships for the Tox21 database on nuclear receptors and stress response in cell lines. The metabolites measured in the children’s serums are predicted to affect specific targeted models, known for their significance in inflammation, immune function, and health outcomes. The targets from Tox21 have been used as targets with quantitative structure–activity relationships (QSARs). They were trained for ~7000 structures, saved as models, and then applied to the annotated metabolites to predict their potential bioactivities. The models were selected based on strict accuracy criteria surpassing random effects. After application, 52 metabolites showed potential bioactivity based on structural similarity with known active compounds from the Tox21 set. The filtered compounds were subsequently used and weighted by their bioactive potential to show an association with early childhood hs-CRP levels at six months in a linear model supporting a physiological adverse effect on systemic low-grade inflammation.


Introduction
Metabolomics aims to capture a broad range of small molecules, either quantitatively by measuring their concentration or qualitatively by identifying their presence and structure, which are essential intermediates and end products of metabolism in organisms.One can analyze pathological processes underlying a disease or physiological state of interest using rich metabolomics data, e.g., obtaining biomarkers [1] or predicting responses to therapy [2,3].Furthermore, one can use the metabolome for phenotyping organisms or diseases [4].With such methods, one can also delve into a mechanistic information layer for understanding modes of action regarding exposures such as diet, exercise, and pollutants on disease progression [5].Metabolomic instruments provide rapid, cost-effective, and sensitive results from many possible biological samples.Yet, with an appropriate analysis, the data they produce are meaningful.It is essential to interpret these data to build biochemical pathways and comprehend their interactions in healthy and diseased conditions [6].Liquid chromatography-mass spectrometry (LC-MS), one of the most commonly used approaches for metabolomics studies, can achieve a high number of annotations [7], which was also essential to this study.

hs-CRP as a Target for Physiological Conditions
Another significant predictor of health is the assessment of levels of C-reactive protein with high-sensitivity methods (hs-CRP), which is generally considered a marker of systemic low-grade inflammation.Elevated hs-CRP has been shown in early-onset diseases and has been associated with an increased prevalence of overweight and obesity in children [8,9].Further, studies have shown that elevated hs-CRP is a biomarker of childhood asthma [10] and allergy [11] but is also associated with Attention deficit hyperactivity disorder (ADHD) development [12].Hence, hs-CRP is a valuable tool for understanding childhood obesity risk, asthma, allergy, and overall health status.Elevated hs-CRP is hence interpreted as a physiological state of systemic low-grade inflammation in this study.The expectation is, therefore, that inflammation is related to metabolic processes and should hence be reflected in the metabolome by, e.g., other biomarkers like the previously presented GlycA [13].

Signals in the Metabolome
The leading assumption in this work is that one can use chemical information of endogenous or exogenous metabolites to relate them to their biological activity and consequently with physiological conditions.This can be processed using quantitative structure-activity relationships (QSARs).QSARs are based on mapping a feature space (X), calculated from chemical structures, on a target chemical or biological activity (y) [14,15].Hence, given a chemical structure, one should be able to understand the compound's biological activity, a concept relevant in drug discovery and toxicology.In reverse, after developing a QSAR, one can predict biological activity for a biological target (e.g., an enzyme or receptor) for a new chemical structure, given some chemical similarity to previously utilized chemical space.To achieve this with metabolites, one must know their structures.Herein, the biological activities of interest are the activation or inhibition of specific biochemical pathways from the Tox21 listed targets [16,17].The selected Tox21 data set for modeling includes 12 bioactivity endpoints, 5 related to stress response (SR) and 7 to nuclear receptor (NR) panels.NRs are a class of transcription factors involved in regulating gene expression.Once the models are trained, and the relationship of chemical structure and bioactivity is set, they can be used to assess bioactivities of other chemical structures, i.e., metabolites.

Aim of This Work
This work aims to filter metabolites from many compounds to inspect associations with potential biological activities and hence with physiological conditions.Filtering is based on structural similarities of metabolites to known bioactive compounds against the Tox21 targets, which can be potentially bioactive in unspecified pathways.Hence, dysregulated, endogenous compounds could be associated with physiological states.Even though there is not much research on the endobiotic effects and associations, there are indications for such.Elevated cortisol concentrations have been associated with psychiatric diseases [18,19].Other metabolites like phenylalanine are known toxicants in conditions such as phenylketonuria [20].Research also suggests that toxic intermediates in metabolomic pathways [21] can contain reactive functional groups and be leveraged, e.g., for cancer therapy.A set of QSAR models is developed in this work against the Tox21 biological target and used to predict potential outcomes for each of the quantified metabolites.The expected outcomes are then used as filters to select potentially bioactive compounds that could trigger the given Tox21 pathways given their chemical structure.The filtered metabolites were evaluated based on their abundance and bioactivity against levels of hs-CRP in the children at six months of age.

Cohort
The COPSAC2010 cohort comprises a non-selected group of mother-child pairs, including 738 pregnant women and their 700 children.The women were recruited between gestational weeks 22 and 26 during their first pregnancy examinations.Of the 738 pregnant women (with an average age of 32.3 ± 4.3 years at the time of their child's birth), 700 children were enrolled in the study.Since then, the children have been visiting the clinic regularly.Gestational age was determined using routine pregnancy care ultrasonography.The COPSAC2010 study participants included infants delivered pre-term and post-term (30-42 weeks).During the third trimester, the women took part in a double-blind, randomized controlled trial with a factorial design, receiving either high-dose (2800 IU/day) or standard-dose (400 IU/day) [22] vitamin D and either 2.4 g n-3 long-chain polyunsaturated fatty acid (LCPUFA, 55% (w/w) 20:5 (n-3) eicosapentaenoic acid (EPA) and 37% (w/w) 22:6 (n-3) docosahexaenoic acid (DHA)) or placebo (72% (w/w) n-9 oleic acid and 12% (w/w) n-9 linoleic acid) [23].Women with endocrine, heart, or kidney diseases or a daily vitamin D intake above 600 IU/day were excluded.Children with a gestational age of less than 32 weeks were also excluded.The trial received approval from the National Committee on Health Research Ethics (H-B-2008-093) and the Danish Data Protection Agency (2015-41-3696).Both parents provided oral and written informed consent before enrolment.The full protocol for the recruitment, information on withdrawals from the study, and a flowchart of the study were previously presented in the Supplementary Materials (Protocol and Supplementary Appendix) of this group's previous work [23].Figure 1 depicts the metabolomic part of this study.2.1.2.Blood Collection and Metabolite Quantification Blood samples were taken from children at the age of six months.Blood was collected in an ethylenediaminetetraacetic acid (EDTA) tube and left at room temperature for 30 min before being centrifuged for 10 min at 4000 rpm.The supernatant was collected and stored at −80 °C for future analysis [24].The methods for sample preparation, UHPLC-MS/MS analysis, and quality control are comprehensively described in reference [25].In

Blood Collection and Metabolite Quantification
Blood samples were taken from children at the age of six months.Blood was collected in an ethylenediaminetetraacetic acid (EDTA) tube and left at room temperature for 30 min before being centrifuged for 10 min at 4000 rpm.The supernatant was collected and stored at −80 • C for future analysis [24].The methods for sample preparation, UHPLC-MS/MS analysis, and quality control are comprehensively described in reference [25].In this study, an untargeted metabolomic analysis of plasma from mothers and their children was conducted at Metabolon, Inc. using a Waters ACQUITY UHPLC (Milford, MA, USA) and a ThermoFisher Scientific (Waltham, MA, USA) QExactive™ Hybrid Quadrupole-Orbitrap™ mass spectrometer with a heated electrospray ionization source, operating at a resolution of 35,000 mass units.The processed samples underwent analysis on four specialized platforms tailored for different classes of compounds: UHPLC-ESI(+)MS/MS for hydrophilic compounds, UHPLC-ESI(+)MS/MS for hydrophobic compounds, reverse-phase UHPLC-ESI(−)MS/MS optimized for basic conditions, and HILIC/UHPLC-ESI(−)MS/MS.The identification of metabolites was based on retention time or index range, a mass accuracy within ±10 ppm, and MS/MS spectra.Compound identification was based on the following criteria: (1) compounds labeled with "*" have identification level 2; (2) compounds labeled with "**" have level 3 (since no standards or matching spectra are available); (3) compounds named with "X-" are unknown and therefore have level 4; and (4) if no label is applied, the identification level is 1 [26].The analyses were conducted in Metabolon, Inc. in Morrisville, NC, USA.Additional details on the analysis were previously published in [27].

Assessment of hs-CRP Levels
Children at the age of six months had blood drawn from a cubital vein into an EDTA tube.The samples were centrifuged to separate plasma and cells and stored at −80 • C until analysis.After the samples were thawed, the hs-CRP levels were measured using a highsensitivity electrochemiluminescence-based assay from Meso Scale Discovery.Duplicate measurements were taken and analyzed using the Sector Image 2400 A from Meso Scale Discovery in Gaithersburg, MD.The lower limit of detection for CRP was 0.007 ng/mL [28].

Data Preparation and Model Building 2.2.1. Data Preparation for Metabolites
The metabolites with more than 33% missing values (67% values per feature present) were removed from the analysis, referring to our previous works to keep methodological consistency.Some literature sources recommend imputing features above 70% [29], which is close to the threshold in this work.The variables that passed this threshold, i.e., with less than 33% missing values, were then imputed with 1/10 of the minimum concentration per metabolite (under the assumption to correspond to the detection limit).The metabolites were then scaled to 0-1 (min-max scaling).A further cleaning step was the removal of low-variance metabolites, i.e., the lowest 10% of metabolites by variance [30].The last step was removing highly correlated metabolites with above 90% Pearson correlation.Finally, 517 compounds were successfully converted to simplified molecular-input lineentry system (SMILES) encodings of molecules, resulting in a final data set for the analysis.

Bioactivity Assessment Pipeline
The model training and application pipeline is presented in Figure 2. The pipeline starts with data extraction and preparation for building QSAR models (blue and yellow rectangles in the figures), described in Sections 2.2.2 and 2.2.3.Once the data are prepared, models are built using Random Forests and hyperparameter optimization (Section 2.2.4) on features generated from chemical structures.These models can predict bioactivity against biological targets such as the 12 given in Section 2.2.2.Once the models are generated and validated, the final step is to predict bioactivities for the 12 targets per quantified metabolite and patient.The predictions then enter further statistical analysis to reveal an association with hs-CRP levels.

Bioactivity Data
The chosen bioactivities for this task stem from the Tox21 compound library [16,17] provided by the U.S. Environmental Protection Agency (EPA).The Tox21 program, which was also part of the Tox21 challenge [16], is also where data can be downloaded from; this includes data on the androgen receptor (AR), estrogen receptor (ER), progesterone receptor (PR), aromatase receptor, peroxisome proliferator-activated receptor gamma (PPAR), and the aryl hydrocarbon receptor (AhR).The compounds in the library were tested against their ability to interact with various SR pathways, including antioxidant-responsive elements (ARE), p53 tumor proteins, mitochondrial membrane potential (MMP), proteins involved in DNA damage repair (ATAD5), and heat shock factor response elements (HSE).Numerous studies have been conducted on this data set, and the outcomes have been reported in various reports [31][32][33][34].Hence, it represents a baseline data set for building bioactivity QSARs since it is imbalanced, chemically diverse, and one of the more significant publicly available data sets (~10 k compounds).Because of the challenges presented by this data set, it has been extensively studied in cheminformatics research in machine learning methods [31,34], class balancing methods [33,35], the testing of different chemical representations [36], and multitasking/deep learning [34].The raw data were pre-processed, as they contain duplicate structures that consider the active or organic part of the molecules.This was also reported earlier [33].Molecules with invalid structural identifiers were removed, and those that were valid were converted to their canonical SMILES [37].Duplicates were either removed by IDs or SMILES.Our methodology employed a comprehensive standardization protocol to ensure the uniformity and accuracy of chemical structure representations before the computational analysis.The structure preparation pipeline inspired by procedures [33,38]  Once the structures were ready, the molecular descriptors Morgan fingerprints (FPR) [39], MACC keys, and 200 molecular descriptors were calculated for the 8314 structures using the RDKit library [40].To reduce possible bit collision in the fingerprints [41,42], the fingerprint vectors were set to 5120 bits and a radius of 2. The Python scripts used for this work were previously published [43].The data sets are available in the Supplementary Materials (https://zenodo.org/records/10888738,accessed on 25 April 2024).In the final prepared data sets, the SR-HSE activator (PubChem AID: 743228) assay recorded 6402 observations with 340 positives, while the NR-AR agonist assay observed 7060 instances, identifying 306 positives.

Bioactivity Data
The chosen bioactivities for this task stem from the Tox21 compound library [16,17] provided by the U.S. Environmental Protection Agency (EPA).The Tox21 program, which was also part of the Tox21 challenge [16], is also where data can be downloaded from; this includes data on the androgen receptor (AR), estrogen receptor (ER), progesterone receptor (PR), aromatase receptor, peroxisome proliferator-activated receptor gamma (PPAR), and the aryl hydrocarbon receptor (AhR).The compounds in the library were tested against their ability to interact with various SR pathways, including antioxidant-responsive elements (ARE), p53 tumor proteins, mitochondrial membrane potential (MMP), proteins involved in DNA damage repair (ATAD5), and heat shock factor response elements (HSE).Numerous studies have been conducted on this data set, and the outcomes have been reported in various reports [31][32][33][34].Hence, it represents a baseline data set for building bioactivity QSARs since it is imbalanced, chemically diverse, and one of the more significant publicly available data sets (~10 k compounds).Because of the challenges presented by this data set, it has been extensively studied in cheminformatics research in machine learning methods [31,34], class balancing methods [33,35], the testing of different chemical representations [36], and multitasking/deep learning [34].The raw data were pre-processed, as they contain duplicate structures that consider the active or organic part of the molecules.This was also reported earlier [33].Molecules with invalid structural identifiers were removed, and those that were valid were converted to their canonical SMILES [37].Duplicates were either removed by IDs or SMILES.Our methodology employed a comprehensive standardization protocol to ensure the uniformity and accuracy of chemical structure representations before the computational analysis.The structure preparation pipeline inspired by procedures [33,38] [39], MACC keys, and 200 molecular descriptors were calculated for the 8314 structures using the RDKit library [40].To reduce possible bit collision in the fingerprints [41,42], the fingerprint vectors were set to 5120 bits and a radius of 2. The Python scripts used for this work were previously published [43].The data sets are available in the Supplementary Materials (https://zenodo.org/records/10888738,accessed on 25 April 2024).In the final prepared data sets, the SR-HSE activator (PubChem AID: 743228) assay recorded 6402 observations with 340 positives, while the NR-AR agonist assay observed 7060 instances, identifying 306 positives.The SR-ARE agonist (AID: 743219) assay demonstrated a higher activity with 945 positives out of 5758 observations.The NR-Aromatase antagonist (AID: 743139) and NR-ER-LBD (AID: 743077) assays reported 304 and 343 positives from 5652 and 6817 observations, respectively.Notably, the NR-AhR agonist (AID: 743122) assay detected 756 positives among 6459 observations, and the SR-MMP disruptor (AID: 720637) assay found 903 positives in 5715 cases.The NR-ER agonist assay revealed 806 positives from 6056 observations, and the NR-PPAR-gamma assay agonist (AID: 743140) had 197 positives out of 6355.The SR-p53 agonist (AID: 720552) assay identified 423 positives among 6643 observations, the SR-ATAD5 inducer (AID: 720516) assay found 275 positives in 6926 cases, and the NR-AR-LBD agonist (AID: 743053) assay reported 234 positives from 6617 observations.

Machine Learning Models
Machine learning models were trained using Python (www.python.org,v3.9.1.)and its library sci-kit-learn [44] based on previous works [36,43].Due to class imbalance in the Tox21 set, model penalization and optimization techniques were used to improve classification quality.Before model training, the data with the 12 bioactivity endpoints were split into two random subsets of 80% and 20% per endpoint individually.The penalty in scoring during hyperparameter optimization was based on the Matthews Correlation Coefficient (MCC) [45,46] defined by Equation ( 1), where TP (True Positive), TN (True Negative), FN (False Negative), and FP (False Positive) are the elements of the confusion matrix (https://en.wikipedia.org/wiki/Confusion_matrix,accessed on 25 April 2024).Another important metric used in this work is Balanced Accuracy, shown in Equation (2).A critical assessment of metrics for imbalanced classification QSAR models was given in previous works [46,47].
Balanced Accuracy = 1 2 Bayesian hyperparameter optimization (BO) was used for hyperparameter optimization [46,48] through a 10-fold cross-validation (CV) and penalizing the procedure by the MCC then iterating it 20 times and returning a set of "optimal" parameters.Hence, the MCC CV value was calculated as an average of outer folds during CV.Once the CV optimized model was generated, the model was applied on the train set each time, and for each metric, a "Train" value was generated.One of the essential model parameters was "class weight", which contributed to classifying this imbalanced data set.During model training on the train set, feature selection using permutation importance was applied in the following steps in previous work [46].The final model's test set was evaluated for each target, resulting in each metric in a "Test value".

Association Testing and Toxic Unit Approach
To evaluate the association with physiological conditions, the toxic unit (TU) approach was utilized [49].TU is defined as the ratio of chemical compound concentration (ci) and the selected exposure-based toxicity value (e.g., LC 50 ).Herein, a new approach is derived, namely the bioactive potential (BiP) in Equation (3).BiP is hence a product of the metabolite's concentration (i) (relative peak area) and the probability of it being active (0-1) against a target (t).The higher the concentration and the likelihood of being bioactive, the more potent we assume the compound to be against the given targets.
For each child (c) in the cohort, BiP was calculated per metabolite and target and then summed and log-transformed sBIP using Equation ( 4), where t is the number of targets, yielding one value per child.
The associations were inspected by means of linear regression using the pingouin library for Python [50].

QSAR Model Results
Eight out of twelve models yielded results above random classification (MCC CV and MCC Test > 0.3), namely for the following targets: SR-ATAD5, NR-AR-LBD, NR-AR, NR-ER-LBD, NR-ER, SR-ARE, NR-AhR, and SR-MMP (marked with an asterisk in Table 1).The threshold of 0.30 is an arbitrary choice to be in the MCC "fair agreement" regime (>0.20).Further, it reduces the risk of having imbalanced classifiers [46] while retaining a reasonable number of models.These models were then selected for further processing.While the algorithm trained for both fingerprints and descriptor sets separately, each time, the descriptors yielded better results in this setting over fingerprints with 13-38 descriptors per model (Table 1).The complete results of the models are given in the Supplementary Materials [51].
Table 1.Model results for the 12 targets from the Tox21 data set.Models with an asterisk (*) were kept for further processing based on our selection criteria of the MCC CV and MCC Test being > 0.30.The abbreviation BACC is Balanced Accuracy.N(1)/N(0,1) is the imbalance ratio of the positive and total compounds per target.Based on the MCC results, one can observe (Figure 3) that each model's CV and test set assessments are well aligned and do not overfit.Our previous study on the Tox21 data set showed that these model results are aligned with models known in the literature [44], as observable in the Supplementary Materials of the cited work [44].1.

Predicting Metabolite Target Activity
The selected eight models were saved to persistent storage and applied to yield bioactivity for the 517 annotated metabolites.This resulted in a table of 517 metabolites x eight columns, namely SR-ATAD5, NR-AR-LBD, NR-AR, NR-ER-LBD, NR-ER, SR-ARE, NR-AhR, and SR-MMP.A snippet of the results is given in Table 2 to ease understanding of the model results.The results in the table are presented as probabilities instead of binary results (bioactive or not), where the convention is that a probability above 0.5 (p > 0.5) is considered bioactive.One can observe from the table that the probabilities show a broad range across targets.Those marked with an asterisk would be deemed bioactive towards the target since their values are above 0.5.

Predicting Metabolite Target Activity
The selected eight models were saved to persistent storage and applied to yield bioactivity for the 517 annotated metabolites.This resulted in a table of 517 metabolites x eight columns, namely SR-ATAD5, NR-AR-LBD, NR-AR, NR-ER-LBD, NR-ER, SR-ARE, NR-AhR, and SR-MMP.A snippet of the results is given in Table 2 to ease understanding of the model results.The results in the table are presented as probabilities instead of binary results (bioactive or not), where the convention is that a probability above 0.5 (p > 0.5) is considered bioactive.One can observe from the table that the probabilities show a broad range across targets.Those marked with an asterisk would be deemed bioactive towards the target since their values are above 0.5.A mask was applied to filter out all "nonactive" results with values below 0.5 in the full results.In total, 51 metabolites showed at least one probability in one model being active above 0.5.A complete list of active metabolites is given in Table 3.

Association with hs-CRP
For each child (c) in the cohort, BiP was calculated per metabolite and target and then summed and log-transformed sBIP using Equation (2).Once a table was obtained with sBiP per child and metabolite, the potentially bioactive part of the metabolome was associated with levels of hs-CRP.The table was scaled priorly, and metabolites were decorrelated, reducing their number from 51 to 40.The hs-CRP levels were logarithmed, and missing values were imputed by median values.The associations were then evaluated in a linear regression.Metabolites amongst the 40 metabolites with an association with hs-CRP filtered for a regression coefficient (estimate) with a p-value below 0.001, chosen due to a high number of predictors to avoid false discovery, are shown in Table 4.The results show negative significant coefficients for cortisone, retinol, and 3beta-hydroxy-5-cholestenoate, while the cortisol and glycocholenate sulfate resulted in positive coefficients.

Discussion
hs-CRP is a well-known biomarker in clinical medicine and diagnostics, serving as a powerful tool for assessing low-grade or chronic systemic inflammation [52,53].We have previously shown that systemic low-grade inflammation assessed by hs-CRP levels in pregnant mothers and their children are correlated independently of anthropometrics and environmental exposures [28].Elevated levels of hs-CRP are associated with an increased risk of cardiovascular disease [54], inflammatory bowel disease (IBD) [55], depression [56], ADHD [12], decreased lung function in childhood [57], allergic sensitization [11], the composition of the early-life airway microbiota [58], and the risk of childhood asthma [59].Metabolites serve as the ultimate effectors of cellular processes and represent the penultimate step in the progression to the phenotype.This underscores the importance of investigating the association between CRP and metabolites to understand better how systemic low-grade inflammation is reflected in the metabolome.
In our research, we developed a series of QSAR models targeting the Tox21 cellular endpoints.These models were employed to forecast potential outcomes against such targets for each of the quantified metabolites within the COPSAC2010 cohort.The selection of these specific metabolites was based on their abundance and bioactivity in relation to high-sensitivity C-reactive protein (hs-CRP) levels in children at six months of age.First and foremost, the model we developed has achieved reasonable generalization, as indicated by the prediction results.The Balanced Accuracy of our models for the selected eight endpoints consistently exceeded 0.6 in both the training and test sets, and the Matthews Correlation Coefficients (MCC) consistently exceeded 0.3.These metrics demonstrate the robustness of our models for making accurate predictions in the following sections.Comparing the results here with those of others in the literature is difficult since the data might have been processed differently, and the train test set can be different, too.A comparison with [60], who performed thorough modeling of the same set, results in the following observations.For the SR-ATAD5 assay, our model achieved an MCC score of 0.31, improving the literature values of 0.272 and 0.395 under different criteria.In the case of NR-AR, our model significantly outperformed the literature-reported scores for the related assay with our score of 0.7 compared to 0.481 and 0.357.The SR-ARE assay saw our model achieving a 0.34 MCC score, which is competitive with the literature values of 0.317 and 0.461.For NR-Aromatase, our model's score of 0.15 was compared against the reported literature scores of 0.186 and 0.429.Our NR-ER-LBD model's performance, with an MCC score of 0.56, was robust against the literature scores of 0.457 and 0.362.Similarly, for NR-AhR, our model achieved a 0.45 MCC score, slightly below the literature value of 0.513 and closely matching the second criterion score of 0.359.The SR-MMP assay model demonstrated a strong performance with an MCC score of 0.58, comparing favorably against the literature scores of 0.501 and 0.475.In the NR-ER assay, our model's 0.41 MCC score was evaluated against literature scores of 0.235 and 0.555.For the NR-PPAR-Gamma assay, our model scored 0.09, which was weaker than the literature scores of 0.238 and 0.457.The SR-p53 assay model yielded a 0.24 MCC score compared to the literature scores of 0.356 and 0.458.Our NR-AR-LBD model's MCC score of 0.55 was robust compared to the literature scores of 0.481 and 0.357.The results appear in similar ranges while not using the same test set and processing steps.A similar comparison of our previous models with the literature was presented in the Supplements of [46], yielding similar conclusions.
Using the developed models in this work, we identified 40 out of the 517 metabolites in our cohort that showed bioactivity in at least one of the eight models.We tested the BiP of these 40 metabolites in a linear model against the levels of hs-CRP in children at six months of age.We identified five metabolites of particular interest, each demonstrating regression coefficients (estimates) with a p-value below 0.001.
Among these five metabolites, cortisone and cortisol both belong to corticosteroid hormones, and cortisone is released by the adrenal gland in response to stress [61].We observed a positive association between cortisol and hs-CRP and a negative association between CRP and cortisone.One of cortisone's effects on the body is the suppression of the immune system due to a decrease in the function of the lymphatic system [62], which is consistent with our findings.It was proven in Shimba's study [63] that corticosteroids are potent inhibitors of inflammatory corneal lymphangiogenesis, with significant differences between various corticosteroids in terms of their antilymphangiogenic potency.The primary mechanism seems to be through the suppression of macrophage infiltration, proinflammatory cytokine expression, and inhibition of the proliferation of lymphatic endothelial cells.A study by Rueggeberg et al. observed that 2-year increases in diurnal cortisol secretion were significantly associated with higher levels of CRP at a 6-year follow-up [64].These findings may be difficult to reconcile because cortisol generally has anti-inflammatory properties.However, sustained exposure to high cortisol levels may render innate immune cells partially resistant to glucocorticoid inhibition, allowing inflammation to escape normal regulatory controls [65,66].This could explain the apparent positive correlation between high CRP and cortisol.
The coefficients of cortisol and cortisone show an opposite direction (1.026 vs. −0.91),both being significant.The balance between serum cortisol and cortisone is a known phenomenon, and hence often a ratio is used (cortisol/cortisone), with cortisol being the more active one.Examples of altered rations compared to normal can occur during acute-phase responses [67], in obesity-related issues [68], and in tuberculosis [69].The relationship in this work may be subject to various factors, including chronic stress, individual differences, and the balance between pro-inflammatory and anti-inflammatory processes in the body.A positive association between cortisol and CRP and a negative association between cortisone and CRP can be explained by the complex and dynamic nature of the body's stress and inflammatory responses.In some situations, cortisol can have pro-inflammatory effects under chronic stress [70].Chronic stress can lead to dysregulation of the hypothalamic-pituitary-adrenal axis, resulting in prolonged elevated cortisol levels [68].These protracted high cortisol levels can induce inflammation and stimulate the production of inflammatory mediators like CRP.A positive association between cortisol and CRP can be observed in this scenario.
Retinol (Vitamin A) exhibited the most robust negative association with hs-CRP among these five notable metabolites in our predictive analysis.In the study by Dios et al. [67], it was investigated that 6-8-year-old children in the highest hs-CRP group (hs-CRP ≥ 0.60 mg/dL) displayed significantly lower levels of retinol compared to those in the lower hs-CRP groups.It is hypothesized that retinol binding to retinol-binding protein (RBP) plays a role in the association between retinol and CRP in children.From a biological perspective, retinol is transported in the blood while bound to RBP, and its concentration is tightly regulated by RBP [68].Additionally, a correlation between RBP4 and CRP levels has been described in school-aged obese children [71].Our findings align with previous research and further confirmed the strong correlation between retinol and CRP in a larger, controlled cohort of 6-month-old children.3Beta-hydroxy-5-cholestenoate (3-BH5C) belongs to the class of monohydroxy bile acids, alcohols, and derivatives.This compound is a component of the primary bile acid biosynthesis pathway and has been recognized as a metabolite capable of predicting gut microbiome Shannon diversity [72].Shannon diversity serves as a potential marker for overall microbiome health.Our results identified a robust negative association between 3-BH5C and CRP.To the best of our knowledge, this is the first time a negative correlation between 3-BH5C and CRP has been observed, suggesting that the microbiome health of children, as indicated by the marker 3-BH5C, may be linked to inflammation.A prior study conducted in the Northern Finland Birth Cohort 1966 (NFBC1966) and the TwinsUK cohort demonstrated that higher CRP levels were associated with lower alpha diversity of gut microbiome measures [73].This provides further support for the credibility of our findings.
Another bile acid metabolite, glycocholenate sulfate, classified as a conjugated bile acid (CBA), has been identified as a significant biomarker for CVD risk in a cohort that included 1919 African American participants in the Atherosclerosis Risk study [74].CBA profiles are increasingly recognized as crucial signaling molecules intricately involved in mammalian cholesterol and lipid metabolism, glucose homeostasis, thermogenesis, inflammation, and intestinal function [75].Our study further confirmed that CBA, represented by glycocholenate sulfate, is strongly associated with CRP levels in children at six months.

Limitations and Future Perspective
There are several limitations to consider when interpreting the results.First, the metabolite concentrations were measured semi-quantitatively, which may introduce measurement errors.Secondly, the bioactivity model-based approach used in this study may introduce error propagation and be limited by noise [76] since (a) bioactivity data have a measurement error and (b) the models come with limited accuracy.Furthermore, it is essential to note that the ligand-receptor interactions are complex and may not be fully captured by the model-based approach, a known limitation of QSARs [77].While efforts were made to incorporate as much information as possible into the models, the true nature of these interactions may be more nuanced than what was captured in this study.There is also no consensus on how to clean data for modeling; hence, the data-cleaning process used in this study may have needed to be more rigorous and lost some information, which is further driven by the unannotated metabolites.Another limitation comes from compounds being excluded in the data processing; however, their effects might be more complex and have a synergy.
A concept worth mentioning is causality, which cannot be assessed using such methods.The exact drivers of metabolic pathways are a question of causality and still need to be completely revealed.Future work should address some of these limitations.
Even though this work is closer to methodological approaches than clinical application, with further validation, it might find its way to the clinics.The hope of metabolomics is, among other things, to support biomarker identification.Such a filtration approach, which can also be perceived as a metabolite selection method, can speed up the identification of biomarkers.Future research might reveal that biomarkers are rather linear and non-linear combinations of metabolites instead of single metabolites.This is also the power of machine learning, which helps generate such combinations in a data-driven manner.
We are also aware that regardless of the children being observed in the clinic and a lot of information being available (see https://copsac.com/home/copsac-cohorts/copsac2010-cohort/, accessed 25 April 2024) that this is only a glimpse of the underlying biology and environmental factors in such processes.In addition to this, both the metabolome and hs-CRP are snapshots and are affected by other processes.More research work is hence needed to understand their variability, as well as single biomarkers and combinations of biomarkers related to physiological conditions.

Conclusions
We developed a series of QSAR models targeting the Tox21 biological endpoint and used them to predict potential outcomes for each quantified metabolite within the COP-SAC2010 cohort.Our analysis unveiled five metabolites of particular interest due to their robust association with hs-CRP levels.We observed the significant influence of CRP on corticosteroid hormones, bile acid metabolites, and vitamin A. The association with corticosteroid hormones is supported by the literature and hence shows this data-driven and chemistry-informed approach to be meaningful, as it leads to known findings without the addition of prior medical knowledge.Our conclusions regarding bile acid metabolites are novel to our understanding.Still, they are supported by the known findings regarding the relationship between microbiome-driven gut health and the level of CRP.Even though causality has not been researched in this paper, we show that there is a relationship between metabolites and CRP.This has the potential to help to understand biological pathways but also for use in clinical settings where supplementation with, e.g., vitamin A and the improvement of gut health might play a role in early childhood.
Supplementary Materials: The modeling data and auxiliary files are published in (https://zenodo.org/records/10888738, accessed on 25 April 2024).approved 22 October 2013, and enrollment required both oral and written informed consent from both parents.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Metabolites 2024 , 17 Figure 1 .
Figure 1.Metabolome sampling and chemical categories in the COPSAC2010 cohort selected for this work.

Figure 1 .
Figure 1.Metabolome sampling and chemical categories in the COPSAC2010 cohort selected for this work.

Figure 2 .
Figure 2. The modeling pipeline in this study.

Figure 2 .
Figure 2. The modeling pipeline in this study.

Figure 3 .
Figure 3. MCC CV and test set results for the classification models in Table1.

Table 2 .
A snippet of probabilities for the first five metabolites being active towards the eight models (columns).

Table 3 .
A list of active metabolites given the pre-trained models.
Figure 3. MCC CV and test set results for the classification models in Table1.

Table 3 .
A list of active metabolites given the pre-trained models.

Table 4 .
Results of the association testing using linear regression against hs-CRP.