2.1. Initial Data
The study material was blood serum from apparently healthy children aged 12–14 years (pubertal period) with no chronic diseases at the time of examination, living in areas with different levels of anthropogenic load of the urban environment. This age group was chosen as the target and most sensitive part of the population. Information on the absence of chronic diseases and acute infectious conditions at the time of examination was obtained from available medical documentation and participant health questionnaires collected during the study. A total of 105 children participated in the study: 72 from the city of Kazan (a territory with a high level of polymetallic pollution) and 33 from rural areas (the Vysokogorsky and Arsky districts of the Republic of Tatarstan), considered conditionally unpolluted territories. The classification of urban and rural territories was based on official regional environmental monitoring data reported in the State Report on the State of Natural Resources and Environmental Protection of the Republic of Tatarstan (Ministry of Ecology of the Republic of Tatarstan) [33].
Serum samples were collected in the morning from the antecubital vein, strictly under fasting conditions (8–12 h of fasting). The exploratory relational dataset contained 200 records derived from the data of 105 biological participants. The relational modelling framework was used to analyse nonlinear associations between demographic, physiological, environmental and biochemical variables within the machine learning pipeline. During preliminary model construction and comparison, the framework could retain several alternative representations of the same participant, one for each exploratory modelling context, which resulted in partial relational replication within the dataset structure. Consequently, some relational rows represent partially replicated participant-derived structures and should not be interpreted as fully independent biological observations. No synthetic biological values were generated, and no artificial augmentation of measured biochemical parameters was performed. Because such relational structures may introduce statistical dependencies between rows, additional leakage-aware validation was subsequently performed on independently reconstructed datasets without repeated observations.
All analysed biological and clinical data were fully anonymized prior to machine learning analysis, and no personally identifiable participant information was available to the investigators during data processing. According to the decision of the Local Ethics Committee, separate written informed consent for the analytical use of anonymized data was waived within the approved study framework.
In the present study, a generally accepted method for determining MDA concentration based on its reaction with thiobarbituric acid (TBA) was used [34,35]. TBA was dissolved in the presence of Triton X-100 to prevent its precipitation. To stabilize the trimethine complex, Trilon B was added, and denatured serum proteins were dissolved in a mixture of ethanol and chloroform (7:3), as described by [36].
At high temperature and in an acidic medium, the reaction between MDA and TBA proceeds with the formation of a pink-colored trimethine complex containing one MDA molecule and two TBA molecules. The absorption maximum of the complex is at 532 nm.
The working TBA solution was prepared by dissolving 864 mg of TBA in 100 mL of a mixture containing 1% Triton X-100 and 8.2 M ethanol. All other solutions were prepared using bidistilled water.
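As a quick consistency check on the stated quantities, the molarity implied by 864 mg of TBA in 100 mL can be computed directly. The sketch below assumes a molar mass of 144.15 g/mol for thiobarbituric acid (C4H4N2O2S), a value that is not given in the text; the function and constant names are illustrative only.

```python
# Assumed molar mass of thiobarbituric acid, C4H4N2O2S (not stated in the text).
TBA_MOLAR_MASS_G_PER_MOL = 144.15

def molarity_mol_per_l(mass_mg, volume_ml, molar_mass_g_per_mol):
    """Return the molar concentration (mol/L) for a weighed mass in a final volume."""
    if volume_ml <= 0 or molar_mass_g_per_mol <= 0:
        raise ValueError("volume and molar mass must be positive")
    moles = (mass_mg / 1000.0) / molar_mass_g_per_mol  # mg -> g -> mol
    return moles / (volume_ml / 1000.0)                # mL -> L

tba_molarity = molarity_mol_per_l(864, 100, TBA_MOLAR_MASS_G_PER_MOL)
print(f"{tba_molarity:.3f} M")
```

Under this assumption the result rounds to 0.06 M, which agrees with the nominal concentration of the working TBA solution.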
To 1.5 mL of serum, 0.5 mL of a 1% Triton X-100 solution, 0.2 mL of a 0.6 M HCl solution, and 0.8 mL of a 0.06 M working TBA solution were added sequentially. The mixture was heated in a boiling water bath for 10 min and then cooled to 15 °C for 30 min. To stabilize the color, 0.2 mL of a 5 mM Trilon B solution was added, and the volume was adjusted to 10 mL with a mixture of ethanol and chloroform (7:3). Optical density was measured at 532 nm using an SF-46 spectrophotometer in a 1 cm glass cuvette. A blank sample, in which bidistilled water was added instead of serum, served as the control. To minimize matrix interference, additional blank samples subjected to the complete protein precipitation procedure were analysed. The purity of the absorption peak at 532 nm was verified by spectrophotometric scanning and by the absence of secondary shoulders in the 550–560 nm region characteristic of nonspecific sugar-related reactions. In the MDA calculations, a molar extinction coefficient of 0.156 µM⁻¹·cm⁻¹ was used.
The concentration of malondialdehyde in blood serum was calculated using the formula:

$$C_{\mathrm{MDA}} = \frac{(D_1 - D_2)\,U_2}{\varepsilon\,L\,U_1}$$

where
$D_1$ is the optical density of the serum sample;
$D_2$ is the optical density of the control;
$U_1$ is the volume of serum taken for analysis (1.5 mL);
$U_2$ is the final volume of the mixture (10 mL);
$L$ is the cuvette path length (1 cm); and
$\varepsilon$ = 0.156 L·µmol⁻¹·cm⁻¹ is the molar extinction coefficient of the MDA–TBA complex.
The result is expressed in µmol/L.
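The quantities defined above can be combined in a short calculation. The following is a minimal sketch, assuming the computation takes the usual Beer-Lambert form with a correction for the dilution of serum into the final volume. The function name and the example absorbance readings are hypothetical and are not study data.

```python
def mda_concentration_umol_per_l(d_sample, d_blank,
                                 serum_volume_ml=1.5,
                                 final_volume_ml=10.0,
                                 path_length_cm=1.0,
                                 epsilon_l_per_umol_cm=0.156):
    """Estimate serum MDA (umol/L) from blank-corrected absorbance at 532 nm."""
    if min(serum_volume_ml, final_volume_ml, path_length_cm, epsilon_l_per_umol_cm) <= 0:
        raise ValueError("volumes, path length and epsilon must be positive")
    net_absorbance = d_sample - d_blank
    # Beer-Lambert: concentration of the MDA-TBA complex in the measured mixture.
    conc_in_cuvette = net_absorbance / (epsilon_l_per_umol_cm * path_length_cm)
    # Scale back to the undiluted serum (10 mL final volume from 1.5 mL serum).
    dilution_factor = final_volume_ml / serum_volume_ml
    return conc_in_cuvette * dilution_factor

# Hypothetical readings, for illustration only.
mda = mda_concentration_umol_per_l(d_sample=0.120, d_blank=0.020)
print(f"{mda:.2f} umol/L")
```

With these illustrative readings the sketch gives about 4.27 µmol/L. A sample that reads the same as the blank yields zero.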
In the obtained blood serum, the contents of iron (Fe), copper (Cu), zinc (Zn), strontium (Sr), and lead (Pb) were determined by atomic absorption spectrometry using an AAnalyst 400 instrument (PerkinElmer, Shelton, CT, USA).
Zinc was determined at the resonance line of 213.9 nm with a detection limit of 1.5 µg/L; copper at 324.8 nm with a detection limit of 1 µg/L; iron at 248.3 nm with a detection limit of 5 µg/L; strontium at 460.7 nm with a detection limit of 3 µg/L; and lead at 283.3 nm with a detection limit of 7 µg/L. Lead was determined using an electrodeless high-frequency lamp, while the other metals were measured with hollow-cathode lamps.
Calibration solutions were prepared by appropriate dilution of certified reference materials. The concentrations of Zn, Cu, Fe, and Sr were measured directly in blood serum diluted 1:2 with bidistilled water. Lead was determined after protein precipitation: 0.75 mL of a 1.5% HCl solution was added to 1.5 mL of blood serum, and the mixture was incubated for 1 h at 37 °C. After this acid hydrolysis step, proteins were precipitated with 0.75 mL of 20% trichloroacetic acid (TCA) and the mixture was centrifuged for 10 min at 1500 rpm, so the dilution ratio was again 1:2. The supernatant was used for analysis. The results are expressed in mg/L. Because serum samples were analysed, the measured metal concentrations primarily reflect circulating mobile fractions associated with transport proteins rather than total body metal burden. Formal LOD and LOQ values for the TBA-based MDA assay were not separately estimated because the method was applied in a comparative mode aimed at identifying relative intergroup differences rather than absolute analytical quantification.
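The conversion from an instrument reading on the diluted sample to a serum concentration can be sketched as follows. This assumes that the 1:2 ratio means a two-fold dilution, which is consistent with the lead procedure (1.5 mL of serum in 3 mL total). It also assumes that the stated detection limits apply to the diluted solution as measured. The function name and the example reading are hypothetical.

```python
# Detection limits in ug/L as stated for each resonance line.
DETECTION_LIMIT_UG_L = {"Zn": 1.5, "Cu": 1.0, "Fe": 5.0, "Sr": 3.0, "Pb": 7.0}

def serum_concentration_mg_l(element, measured_ug_l, dilution_factor=2.0):
    """Convert a reading on the diluted sample (ug/L) to serum concentration (mg/L).

    Returns None when the reading is below the detection limit, because such a
    value should not be reported as a number (assumed handling).
    """
    if measured_ug_l < DETECTION_LIMIT_UG_L[element]:
        return None
    return measured_ug_l * dilution_factor / 1000.0  # undo dilution, ug/L -> mg/L

# Hypothetical reading, for illustration only.
zn_serum = serum_concentration_mg_l("Zn", 450.0)
print(zn_serum)
```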
Since the dependence of MDA on the eleven identified predictors is markedly nonlinear, it is appropriate to use machine learning methods to model MDA concentration. In particular, multilayer perceptrons have proven effective in similar tasks, serving as universal approximators [37]. Similar exploratory applications of machine learning methods for oxidative stress and environmental exposure assessment have been reported in recent biomedical studies; however, most previous works focused primarily on predictive performance rather than interpretability of nonlinear biological relationships. In the present study, additional SHAP analysis was incorporated to improve transparency of the neural network model and reduce its “black-box” character.
2.4. Model Validation and Overfitting Control
To evaluate model reliability for a limited biomedical dataset, additional generalization control procedures were applied. The ratio between model complexity and dataset size was assessed by comparing training and test errors and analysing bias–variance behaviour. Overfitting was assumed when training accuracy significantly exceeded test accuracy. Predictor reduction was performed using correlation analysis and multicollinearity diagnostics to reduce model capacity and improve generalization.
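One simple way to carry out correlation-based predictor reduction is a greedy filter on pairwise Pearson correlations. The sketch below is an illustration under stated assumptions, not the procedure used in the study: the 0.9 threshold, the rule of keeping the first predictor of a correlated pair, and the toy columns are all hypothetical. Multicollinearity diagnostics such as variance inflation factors are not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear association
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Keep predictors in insertion order, skipping any whose |r| with an
    already kept predictor exceeds the threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy columns, for illustration only (not study data).
toy = {
    "pred_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "pred_b": [2.0, 4.0, 6.0, 8.0, 10.1],  # nearly a multiple of pred_a
    "pred_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(toy)
print(kept)
```

In this toy case `pred_b` is removed because it is almost perfectly correlated with `pred_a`, while `pred_c` is retained.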
To improve validation reliability for the reduced neural network model, repeated cross-validation was additionally performed. The validation procedure included repeated 5-fold cross-validation with 20 repetitions using different random partitions of the dataset (100 validation runs in total). For each iteration, the Mean Absolute Error (MAE) and coefficient of determination (R²) were calculated.
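The repeated 5-fold scheme with 20 repetitions can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the mean-value predictor is only a stand-in for the neural network, the toy data are not study data, and all function names are hypothetical.

```python
import random

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return float("nan") if ss_tot == 0 else 1.0 - ss_res / ss_tot

def repeated_kfold(n_samples, k=5, repeats=20, seed=0):
    """Yield (train, test) index lists: k folds for each reshuffled repetition."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a new random partition for every repetition
        for fold in range(k):
            test = order[fold::k]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test

def cross_validate(X, y, fit, predict, k=5, repeats=20, seed=0):
    """Return one (MAE, R2) pair per validation run."""
    scores = []
    for train, test in repeated_kfold(len(y), k, repeats, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        y_pred = [predict(model, X[i]) for i in test]
        y_true = [y[i] for i in test]
        scores.append((mae(y_true, y_pred), r2(y_true, y_pred)))
    return scores

# Stand-in model (assumption): always predicts the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x):
    return model

# Toy data, for illustration only.
X_toy = [[float(i)] for i in range(30)]
y_toy = [float(i % 7) for i in range(30)]
scores = cross_validate(X_toy, y_toy, fit_mean, predict_mean)
print(len(scores))  # 5 folds x 20 repetitions
```

Any model exposing a fit and a predict step can replace the stand-in. The resulting 100 score pairs can then be summarised, for example by their mean and spread.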
Because the relational dataset contained partially replicated participant-derived relational structures, additional robustness-oriented validation procedures were performed. Two independently reconstructed datasets without repeated observations were generated from the original dataset. In these supplementary datasets, duplicated or partially replicated relational structures were removed, so that each reconstructed dataset contained only one retained relational representation per participant. Repeated 5-fold cross-validation was then applied independently to each reconstructed dataset to evaluate the reproducibility and stability of model performance under stricter independence conditions and to reduce the risk of intra-participant information leakage during model validation.
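Reducing a replicated dataset to one row per participant can be sketched as below. How the retained representation was actually selected is not stated, so random selection with two different seeds is an assumption made here for illustration. The `participant_id` and `context` field names and the example rows are likewise hypothetical.

```python
import random
from collections import defaultdict

def one_row_per_participant(rows, id_key="participant_id", seed=0):
    """Return a list holding exactly one randomly chosen row per participant."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[id_key]].append(row)
    rng = random.Random(seed)
    # Sorting the IDs keeps the output order reproducible for a given seed.
    return [rng.choice(groups[pid]) for pid in sorted(groups)]

# Hypothetical rows: participant P1 appears in two modelling contexts.
rows = [
    {"participant_id": "P1", "context": "a"},
    {"participant_id": "P1", "context": "b"},
    {"participant_id": "P2", "context": "a"},
    {"participant_id": "P3", "context": "b"},
]
dataset_a = one_row_per_participant(rows, seed=1)
dataset_b = one_row_per_participant(rows, seed=2)
print(len(dataset_a), len(dataset_b))
```

Because each participant contributes a single row, no participant can appear in both the training and test parts of a fold when cross-validation is run on either reconstructed dataset.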
The extended robustness-oriented validation procedures were primarily focused on the reduced neural network model because the reduced architecture demonstrated improved generalization ability and lower overfitting compared with the full model.