Multivariate and Machine Learning-Based Assessment of Soil Elemental Composition and Pollution Analysis

Badawy, Wael M.; El-Agawany, Fouad I.; Blokhin, Maksim G.; Mohamed, Elsayed S.; Uzhinskiy, Alexander; Morsi, Tarek M.

doi:10.3390/environments12080289

by Wael M. Badawy ¹,²,*, Fouad I. El-Agawany ³, Maksim G. Blokhin ⁴, Elsayed S. Mohamed ⁵, Alexander Uzhinskiy ⁶ and Tarek M. Morsi ¹

¹ Radiation Protection and Civil Defense Department, Nuclear Research Center, Egyptian Atomic Energy Authority, Cairo 13759, Egypt

² Frank Laboratory of Neutron Physics, Joint Institute for Nuclear Research, 141980 Dubna, Russia

³ Department of Physics, Faculty of Science, Menoufia University, Shebin El-Koom 32511, Egypt

⁴ Far East Geological Institute of the Far Eastern Branch of the Russian Academy of Sciences, 690022 Vladivostok, Russia

⁵ National Authority for Remote Sensing and Space Sciences (NARSS), Cairo 1564, Egypt

⁶ Laboratory of Information Technology, Joint Institute for Nuclear Research, 141980 Dubna, Russia

* Author to whom correspondence should be addressed.

Environments 2025, 12(8), 289; https://doi.org/10.3390/environments12080289

Submission received: 17 July 2025 / Revised: 9 August 2025 / Accepted: 18 August 2025 / Published: 21 August 2025

Abstract

The present study provides a comprehensive characterization of soil elemental composition in the Nile Delta, Egypt. The soil samples were analyzed using Inductively Coupled Plasma Atomic Emission Spectrometry (ICP-AES), which is well suited to major element determination, and Inductively Coupled Plasma Mass Spectrometry (ICP-MS), which is well suited to trace element analysis. A total of 55 elements were measured across 53 soil samples. A variety of statistical and analytical techniques, including both descriptive and inferential methods, were employed to assess the elemental composition of the soil. Bivariate and multivariate statistical analyses, discriminative ternary diagrams, ratio biplots, and unsupervised machine learning algorithms, such as Principal Component Analysis (PCA), t-Distributed Stochastic Neighbour Embedding (t-SNE), and Hierarchical Agglomerative Clustering (HAC), were used to explore the geochemical similarities between elements in the soil. The application of t-SNE to soil geochemistry is still emerging; compared to PCA, it preserves the local structure of the data and reveals non-linear relationships. Geochemical background levels were estimated using Bayesian inference, and the impact of outliers was analyzed. Pollution indices were subsequently calculated to assess potential contamination. The findings suggest that the studied areas do not exhibit significant pollution. Variations in background levels were primarily attributed to the presence of outliers. The clustering results from PCA and t-SNE were consistent in terms of accuracy and the number of identified groups. Four distinct groups were identified, with soil samples in each group sharing similar geochemical properties. While PCA is effective for linear data, t-SNE proved more suitable for nonlinear dimensionality reduction. These results provide valuable baseline data for future research on the studied areas and for evaluating their environmental situation.

Keywords:

elemental composition; soil; ICP-AES; ICP–MS; statistical analysis; machine learning

1. Introduction

The Nile Delta, covering approximately 22,000 km², contributes two-thirds of Egypt’s agricultural output and is the country’s main hub for gas and oil production. Its formation began in the late Pliocene, with major development during the Pleistocene due to fluctuating sea levels during ice ages. The chemical industry is the largest source of hazardous waste in the region, while water pollution primarily arises from raw sewage, industrial runoff, and agricultural pesticides [1]. The Delta also faces serious challenges, including land reclamation, coastal erosion, soil salinization, pollution, and the impacts of climate change, such as rising sea levels [1,2].

The Nile Delta, home to over half of Egypt’s 80 million people, supports 63% of the country’s agricultural land. Its coastline features the Rosetta and Damietta headlands, coastal flats, dunes, and four brackish lakes: Mariut, Edku, Borullus, and Manzalah [2,3]. As the global population grows and per capita resource consumption continues to rise, the demand for food, resources, and arable land has increased [4,5]. This demand, in turn, drives the rapid expansion of agricultural areas worldwide. Furthermore, great efforts are being made to transform desert into agricultural land [4].

The expansion of agricultural land, coupled with a growing population and rapid urbanization, has placed significant pressure on soil quality in the Nile Delta. The intensive use of fertilizers further exacerbates the problem, degrading soil health and increasing the accumulation of waste. This excessive agricultural activity and urban sprawl not only strain natural resources but also contribute to higher levels of pollution, including runoff of chemicals and improper waste disposal, ultimately threatening the long-term sustainability of the region’s ecosystem [6,7].

Currently, the monitoring of the environmental situation in the Nile Delta, particularly concerning the elemental composition of agricultural soil, remains limited and outdated. Most existing studies focus on specific, localized areas, with little comprehensive or recent data available. While there is considerable interest among researchers in examining the elemental content of the Delta’s soils, these efforts have largely been fragmented, lacking widespread coverage or a coordinated approach. As a result, there is a need for more systematic, up-to-date, and region-wide studies to better understand the evolving environmental challenges and soil health in this critical region. This study aims to provide a comprehensive overview of the environmental situation in the Nile Delta, especially along the two branches of the Nile, by employing integrated analytical methods. This study fills a critical gap by utilizing advanced analytical techniques (ICP-MS and ICP-AES) and sophisticated statistical and machine learning approaches to provide an integrated, region-wide assessment of soil elemental composition. The establishment of baseline data and identification of pollution sources will directly support environmental management strategies, inform policy decisions, and promote sustainable development and public health protection efforts in Egypt.

The elemental composition of various components within ecosystems is investigated through a range of advanced analytical techniques. Common methods include neutron activation analysis (NAA), which detects and quantifies elements by measuring gamma rays emitted after neutron capture; X-ray fluorescence (XRF), which utilizes X-rays to excite atoms and measure their characteristic fluorescent emissions; inductively coupled plasma mass spectrometry (ICP-MS), a highly sensitive method that ionizes samples and detects trace and ultra-trace elements with precision; inductively coupled plasma atomic-emission spectrometry (ICP-AES); and atomic absorption spectroscopy (AAS). These techniques, alongside others, allow for comprehensive multi-elemental analysis, providing critical insights into the distribution and behavior of elements in environmental studies [3,8,9,10]. In this study, inductively coupled plasma mass spectrometry (ICP-MS) and inductively coupled plasma atomic-emission spectrometry (ICP-AES) were employed to quantify the elemental mass fractions in soil samples.

Throughout this research, the following objectives have been pursued: (i) characterize the soil samples in terms of their elemental composition; (ii) apply bivariate and multivariate statistical analyses, discriminative ternary diagrams and ratio biplots, and unsupervised machine learning algorithms (Principal Component Analysis, PCA; t-distributed Stochastic Neighbour Embedding, t-SNE; and Hierarchical Agglomerative Clustering, HAC) to clarify the geochemical similarity of the soil; (iii) analyze the geochemical background of the determined elements using Bayesian inference methods; (iv) assess the degree of pollution and the extent of contamination using various indices and elucidate the common pollution sources; and (v) establish baseline data on geochemical elements that can provide a reference point for monitoring potential future changes.

2. Materials and Methods

2.1. Sampling Strategy

A total of 53 agricultural soil samples were collected from the Nile Delta region of Egypt, primarily from areas under cultivation. The specific sampling locations, along with the number of samples collected, are illustrated on the map in Figure 1. Geographically, the sampling area is bordered by the Mediterranean Sea to the north, with the southern boundary marking the edge of the cultivated land in the Delta. The soil texture across the region ranges from clayey to loamy. Samples were spaced at intervals of approximately 15 to 20 km between consecutive locations. All the samples were gathered following IAEA TECDOC-1415 [11] and recommendations of the Egyptian Geological Survey and Mining Authority [12]. The samples were collected at a depth of 10–15 cm, using Polyvinyl Chloride (PVC, Misr El-Hegaz Group, Cairo, Egypt) pipes to minimize the risk of contamination. Prior to sampling, surface grass and vegetation were cleared. The soil samples were then air-dried and gently mixed to ensure homogeneity. Later, the samples were transported to the laboratory for analysis using inductively coupled plasma mass spectrometry (ICP-MS, Agilent Technologies Inc., Tokyo, Japan) and inductively coupled plasma atomic emission spectrometry (ICP-AES, Thermo Scientific Inc., Waltham, MA, USA).

2.2. Sample Preparation for Analysis Using ICP-MS and ICP-AES

Elemental composition analysis of the soil samples was conducted in the Analytical Chemistry Laboratory at the Centre for Collective Use, “Primorsky Centre for Local Elemental and Isotopic Analysis”, part of the Far East Geological Institute in Vladivostok, Russia. To remove plant debris and rock particles, air-dried soil samples were pre-sifted through a sieve with a pore diameter of 2 mm and then ground to a powder in an agate mortar.

An open acid digestion method, as described by NSAM [13] with minor laboratory modifications, was employed. Samples of 0.05 g, calcined to constant weight at 105 °C, were digested step by step using a mixture of HNO₃, HF, and HClO₄ (“suprapur”, Merck).

To prevent possible hydrolysis and polymerization of highly charged ions (such as Zr, Nb, Hf, Ta, Mo, and W), traces of HF were added. The final dilution factor was 1:1000. Before ICP-MS analysis, the solutions were further diluted with 2% HNO₃ to achieve a final dilution factor of 1:5000.

High-purity nitric acid was distilled using a Milestone dry distillation unit (Millipore, Burlington, MA, USA). Major elemental composition was analyzed with an iCAP 7600 Duo ICP-AES (Thermo Scientific Inc, Waltham, MA, USA), operating in radial plasma observation mode. Trace element concentrations were determined using an Agilent 8800 quadrupole ICP-MS (Agilent Technologies Inc, Tokyo, Japan), enabling the detection of a broad spectrum of trace elements. Further details on the ICP analytical methodology can be found in NSAM [13] and Badawy et al. [14]. The Si content was determined gravimetrically after decomposition of the samples (0.2 g) by fusion with anhydrous sodium carbonate (Na₂CO₃), subsequent treatment with hydrofluoric acid, and distillation of the volatile silicon compounds.

2.3. Quality Control of (ICP-MS and ICP-AES)

Certified reference materials (CRMs) SCHT-3 (GSO 2509-83, typical black soil, Russian CRM) and JA-3 (andesite powder, Geological Survey of Japan) were used to assess the repeatability and accuracy of the obtained results [15]. The measurements on the CRMs showed an average accuracy of 10–15% (2σ) for trace elements and a reproducibility better than 5% (2σ) for major elements. The amount of material analyzed for both the studied soil specimens and the CRMs was sufficient to prepare representative samples of homogeneous composition.

3. Statistical Data Analysis

The statistical analysis of the data was performed with R (version 4.3.1) [16] and Python (version 3.12) [17], and the data were structured in MS Excel. The data obtained for the soil samples were examined for normality at the 95% confidence level (p ≤ 0.05) using the Shapiro–Wilk test [18]. A normality test is useful when the results must be statistically evaluated prior to further analysis. The p-values for all the elements were calculated, and the elements whose data follow a normal distribution were identified. Outliers were identified and set aside based on the Mahalanobis distance probability [19,20]. A Pearson correlation matrix was computed to explore the relationships between variables.
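The screening steps described here can be sketched in Python as follows. This is a minimal illustration on synthetic data, not the authors' script: the helper names, the chi-square conversion of the squared Mahalanobis distance into a probability, and the cutoff `alpha = 0.001` are all assumptions, since the text does not state the threshold that was used.

```python
import numpy as np
from scipy import stats

def shapiro_p_values(X):
    """Return the Shapiro-Wilk p-value for each column (element) of X."""
    p_values = []
    for j in range(X.shape[1]):
        _, p = stats.shapiro(X[:, j])
        p_values.append(p)
    return np.array(p_values)

def mahalanobis_outliers(X, alpha=0.001):
    """Flag rows whose squared Mahalanobis distance is improbable.

    The probability comes from a chi-square distribution with as many
    degrees of freedom as there are variables (an assumed convention).
    """
    mean = X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mean
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    p = stats.chi2.sf(d2, df=X.shape[1])
    return d2, p, p < alpha

# Synthetic stand-in for a 53-sample table with three elements.
rng = np.random.default_rng(0)
X = rng.normal(size=(53, 3))
X[0] = [8.0, -8.0, 8.0]  # plant one obvious outlier for the demo

normality_p = shapiro_p_values(X)
d2, prob, is_outlier = mahalanobis_outliers(X)
X_clean = X[~is_outlier]
```

Columns with a Shapiro–Wilk p-value at or below 0.05 would be treated as non-normal, and the rows flagged in `is_outlier` would be set aside before the correlation analysis.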

Despite the removal of outliers, the data distribution remained non-normal. Therefore, before performing multivariate statistical analysis, all variables in the dataset were scaled using the min–max method to mitigate large discrepancies in mass fractions, particularly for major, trace, and rare earth elements (REE).
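As a brief illustration of min–max scaling, the sketch below maps every column of a table onto the interval [0, 1]. The function name and the three-row example table are hypothetical and serve only to show how a major element (values in the hundreds of thousands of mg/kg) and a trace element (values below 1 mg/kg) end up on the same scale.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns are mapped to 0
    return (X - col_min) / col_range

# Hypothetical mass fractions in mg/kg: one major and one trace element.
table = [
    [245000.0, 0.1],
    [221000.0, 0.3],
    [309000.0, 0.2],
]
scaled = min_max_scale(table)
```

After scaling, both columns span exactly 0 to 1, so neither dominates a distance-based method purely because of its units or magnitude.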

Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour Embedding (t-SNE), both widely recognized and effective dimensionality reduction techniques, were applied to the data to assess sample similarities and identify potential common geochemical characteristics [21,22,23]. PCA provides a linear dimensionality reduction, whereas t-SNE performs a non-linear one. PCA is explained in detail in our previous work [14,24]. In this research, t-SNE was additionally used because it minimizes the Kullback–Leibler divergence between the pairwise similarity distributions in the high- and low-dimensional spaces and therefore fits complex, non-linear data better [25].
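The two reductions can be compared side by side with scikit-learn, as in the following sketch. The data are synthetic, and the t-SNE settings (`perplexity=15`, PCA initialization, fixed random seed) are illustrative assumptions; the paper does not report the hyperparameters that were used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in: 53 samples, 10 "elements" with skewed distributions.
rng = np.random.default_rng(42)
X = rng.lognormal(size=(53, 10))
X_scaled = MinMaxScaler().fit_transform(X)

# Linear reduction: project onto the two directions of largest variance.
pca = PCA(n_components=2)
pca_embedding = pca.fit_transform(X_scaled)

# Non-linear reduction: preserve local neighbourhoods in two dimensions.
# The perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=42)
tsne_embedding = tsne.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
```

Unlike PCA, a t-SNE embedding has no loadings or explained-variance figure, and distances between well-separated groups in the embedding should not be interpreted quantitatively.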

After the dimensionality reduction was performed using PCA and t-SNE, unsupervised machine learning (ML) methods were employed to gain more insight into the similarities of the samples examined, and the better representation was used for further clustering. In the present work, Hierarchical Cluster Analysis (HCA), specifically the agglomerative method, was performed to check the similarities between observations. This approach, also known as Hierarchical Agglomerative Clustering (HAC), begins with each sample in its own cluster. As the algorithm progresses, pairs of clusters are merged step by step, moving upward in the hierarchy, which is why it is often referred to as “bottom-up” clustering [26,27,28]. HCA was performed using the Euclidean distance as the metric and Ward's method as the linkage criterion; a more detailed description of HCA can be found in Müllner [29] and Bar-Joseph et al. [30]. The results of the model were evaluated using the Silhouette and Davies–Bouldin scores.
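A minimal sketch of this clustering step with scikit-learn is given below. The two-dimensional embedding is synthetic (four artificial, well-separated groups), and choosing `n_clusters=4` up front is an assumption made for the demonstration rather than a description of how the number of groups was selected in the study.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic 2-D embedding with four artificial groups of 13 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
embedding = np.vstack([c + 0.3 * rng.normal(size=(13, 2)) for c in centers])

# Ward linkage merges the pair of clusters that gives the smallest
# increase in within-cluster variance; it operates on Euclidean distances.
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = model.fit_predict(embedding)

# Higher Silhouette (maximum 1) and lower Davies-Bouldin (minimum 0)
# both indicate compact, well-separated clusters.
silhouette = silhouette_score(embedding, labels)
davies_bouldin = davies_bouldin_score(embedding, labels)
```

Running the same scoring on the PCA and t-SNE embeddings of one dataset gives a simple way to compare which representation yields the cleaner grouping.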

4. Background and Pollution Analysis

To assess metal pollution and distinguish between anthropogenic and geogenic influences, geochemical background levels must be considered [31,32]. The background refers to the natural mass fractions of metals in sediments unaffected by human activities, while “enrichment” measures how much metal concentrations exceed natural levels. The geochemical background can be determined either by directly using regional or global reference values for the elemental mass fractions or by employing statistical methods to estimate it indirectly. In the current work, the indirect route was taken: Bayesian inference (BIM) was applied using Just Another Gibbs Sampler (JAGS), which draws Markov Chain Monte Carlo (MCMC) samples by Gibbs sampling [33]. Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations. In Bayesian statistics, the aim is to estimate the posterior distribution of parameters based on observed data and a specified model. This involves updating prior beliefs about the parameters with observed data to derive the posterior distribution. However, deriving the exact posterior distribution is often computationally infeasible for complex models, which is why MCMC is used [33,34,35]. More in-depth information on Bayesian inference can be found in Kruschke [36], Park et al. [37], and Plummer et al. [38].

In this study, the background values computed with the BIM method were used. To avoid overestimating or underestimating pollution indices, individual indices (SPI) below one were excluded from the calculations [14,39].

4.1. Single Pollution Index (SPI)

The single pollution index is used to estimate the ratio of pollution by dividing the mass fractions obtained for each element by the corresponding background value. It can be expressed as follows:

SPI = \frac{C_{\mathrm{sample}}}{C_{\mathrm{background}}}

(1)

where C_sample is the mass fraction of the element in the sample and C_background is the mass fraction of the corresponding element in the background.
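Equation (1) amounts to an element-wise division, as the short sketch below shows. The element names and all mass fractions in it are hypothetical values chosen for illustration; they are not taken from the study's tables.

```python
def single_pollution_index(sample, background):
    """Return SPI = C_sample / C_background for every element in `sample`."""
    return {element: sample[element] / background[element] for element in sample}

# Hypothetical mass fractions in mg/kg (illustrative only).
sample = {"Cu": 30.0, "Ni": 45.0, "Cd": 0.12}
background = {"Cu": 25.0, "Ni": 50.0, "Cd": 0.10}

spi = single_pollution_index(sample, background)
elevated = sorted(element for element, value in spi.items() if value > 1.0)
```

With these inputs, Cu and Cd lie above their background values (SPI of about 1.2), while Ni lies below it (SPI of about 0.9).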

4.2. Enrichment Factor (EF)

The enrichment factor (EF), which describes the characteristics of the elements in terms of enrichment or depletion in the samples studied, can be used to distinguish between the extent of anthropogenic and natural metal contamination [31,32,39,40,41,42,43,44]. EF is calculated using Equation (2):

EF = \frac{(C_{x} / C_{\mathrm{Fe}})_{\mathrm{soil}}}{(C_{x} / C_{\mathrm{Fe}})_{\mathrm{background}}}

(2)

where (C_x/C_Fe)_soil is the ratio of the mass fraction of element x to the mass fraction of Fe in the sample, while (C_x/C_Fe)_background is the same ratio in the computed background values. Fe thus serves as the reference element.
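The double ratio in Equation (2) can be written as a small helper function. The sketch below uses hypothetical mass fractions; the function and argument names are illustrative and do not come from the study.

```python
def enrichment_factor(c_x_soil, c_fe_soil, c_x_background, c_fe_background):
    """Return EF: the Fe-normalized ratio in soil over that in the background."""
    soil_ratio = c_x_soil / c_fe_soil
    background_ratio = c_x_background / c_fe_background
    return soil_ratio / background_ratio

# Hypothetical mass fractions in mg/kg (illustrative only).
ef_enriched = enrichment_factor(60.0, 50000.0, 50.0, 50000.0)
ef_depleted = enrichment_factor(45.0, 50000.0, 50.0, 50000.0)
# Scaling both soil values by the same factor leaves EF unchanged,
# which is the point of normalizing to a reference element.
ef_diluted = enrichment_factor(30.0, 25000.0, 50.0, 50000.0)
```

An EF near 1 indicates that the element follows Fe in the same proportion as in the background, whereas values clearly above 1 point to enrichment relative to the reference element.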

4.3. Pollution Load Index (PLI)

The Pollution Load Index (PLI) serves as a reliable indicator of soil degradation caused by the accumulation of heavy metals. This index provides a simple yet informative method for evaluating the pollution load from accumulated metals, offering a comprehensive assessment of contamination levels across the studied sites. The PLI is calculated as the nth root of the product of n individual pollution indices (SPI), that is, their geometric mean [45].

PLI = \sqrt[n]{\prod_{i = 1}^{n} SPI_{i}}

(3)

In this context, the value n refers to the total number of metals being assessed. When (PLI) > 1, it means that pollution exists. Otherwise, if (PLI) < 1, there is no metal pollution.
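Because Equation (3) is a geometric mean, it can be evaluated through logarithms, which avoids overflow or underflow when many indices are multiplied. The sketch below uses made-up SPI values; the function name is illustrative.

```python
import math

def pollution_load_index(spi_values):
    """Return the geometric mean of the single pollution indices."""
    spi_values = list(spi_values)
    if not spi_values:
        raise ValueError("at least one SPI value is required")
    log_mean = sum(math.log(value) for value in spi_values) / len(spi_values)
    return math.exp(log_mean)

# Made-up SPI values (illustrative only).
pli_two = pollution_load_index([1.0, 4.0])        # sqrt(1 * 4)
pli_mixed = pollution_load_index([0.5, 2.0, 1.0])  # cube root of 1
```

Note that a single high SPI can be offset by low ones, so the PLI summarizes the overall load rather than flagging individual elements.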

4.4. Total Pollution Index TPI (Zc)

The evaluation of chemical pollution levels in soil serves as an indicator of potential adverse effects on public health. This assessment is conducted using indicators developed through integrated geochemical and geohygienic studies in urban areas with active pollution sources. The Total Pollution Index (TPI) approach closely resembles the Pollution Load Index (PLI) and can be calculated by comparing the mass fraction of an element in soil, denoted as C_i (mg/kg), to its regional background mass fraction, C_bi. The resulting concentration coefficient of each determined element is K_c, given as:

K_{c} = C_{i} / C_{bi}

(4)

The TPI is obtained from the sum of the concentration coefficients of the pollutants and is given as:

TPI = \sum_{i = 1}^{n} K_{ci} - (n - 1)

(5)

where n is the number of the determined elements and K_ci is the concentration coefficient of the i-th element; only coefficients higher than 1 are included in the sum [46,47,48].
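Equations (4) and (5) can be sketched as follows. One interpretive assumption is made explicit here: n is taken as the number of elements whose concentration coefficient exceeds 1, since only those coefficients enter the sum. The input values are hypothetical.

```python
def total_pollution_index(concentrations, backgrounds):
    """Return TPI = sum(K_c) - (n - 1) over elements with K_c > 1.

    Assumption: n counts only the elements whose coefficient exceeds 1.
    """
    coefficients = [c / b for c, b in zip(concentrations, backgrounds)]
    included = [k for k in coefficients if k > 1.0]
    return sum(included) - (len(included) - 1)

# Hypothetical mass fractions in mg/kg (illustrative only).
concentrations = [15.0, 40.0, 8.0, 12.0]
backgrounds = [10.0, 20.0, 10.0, 10.0]
# K_c = 1.5, 2.0, 0.8, 1.2 -> included: 1.5, 2.0, 1.2 (n = 3)
tpi = total_pollution_index(concentrations, backgrounds)
```

Subtracting (n − 1) means that a set of elements that all sit exactly at background level yields a TPI of 1 rather than n, so the index measures the excess over background.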

5. Results and Discussion

5.1. Elemental Abundances

A total of 53 agricultural soil samples were collected from the Nile Delta, Egypt and analyzed using ICP-MS (Inductively Coupled Plasma-Mass Spectrometry) and ICP-AES (Inductively Coupled Plasma-Atomic Emission Spectroscopy) to determine elemental content in mg/kg. The analysis revealed the presence of 55 different elements.

After applying the Mahalanobis distance to remove outliers and reduce the noise in the dataset, four elements were identified as outliers and discussed separately. Descriptive statistical analysis (mean, median, minimum, maximum, coefficient of variation, skewness, kurtosis) was performed and is presented in Table 1.

The results obtained show that Si has the highest mass fraction, with an average value of 245,000 ± 2640 mg/kg and a range from 221,000 to 309,000 mg/kg for locations #51 and #41, respectively. On the other hand, the lowest values were measured for Bi, with an average value of 0.112 ± 0.006 mg/kg and mass fractions ranging from 0.1 to 0.3 mg/kg for locations #51 and #19, respectively. The obtained results were compared with the corresponding values for the upper continental crust (UCC) reported by Rudnick and Gao [49]. The logarithmic results normalized to UCC were calculated and are illustrated in Figure 2. The figure shows that some elements are slightly higher than the values reported for UCC by Rudnick and Gao [49]. The elements whose mass fractions are slightly above the UCC can be presented in descending order as follows: Cd (4.4%) > Ti (4.1%) > Cu (3.7%) > P (3.2%) > Eu (2.8%) > V (2.7%) > Fe (2.7%) > Ni (2.7%) > Ca (2.6%). The comparison of the mass fractions with the equivalent values from the UCC suggests an anthropogenic influence on the soil, which is understandable as these are agricultural and densely populated regions. These peculiarities can be attributed to the excessive use of fertilizers, domestic waste disposal, and the rapid growth of roads and urbanization [31]. The results obtained in this study were also compared with those reported by Arafa et al. [3]. Although different analytical techniques were used, the findings are consistent and demonstrate strong alignment between the two datasets.

The coefficient of variation (CV%) reflects the variability of the data, which falls within the range of 7% to 38%, well below the threshold of 75%. Since CV% is the ratio of the standard deviation to the mean, a higher CV% indicates a significant level of dispersion in the data, suggesting that a statistical approach is essential for analysis. In contrast, a lower CV% signifies minimal variability. Additionally, we assessed the distribution patterns of the results for each element, revealing that in most instances the distributions are negatively skewed. This indicates that the mean is situated below the median, with extreme values occurring more frequently on the lower end than on the upper end of the distribution. A distribution is considered nearly symmetrical when the skewness falls between −0.5 and 0.5; however, in the present study, skewness ranges from −2.0 to 3.7. As a result, the mean value alone may not be representative, and the data should be examined statistically (for example, for normality and outliers) before inferential statistics are applied. Similarly, kurtosis values higher than 3 indicate heavy-tailed distributions, that is, more frequent extreme values than in a normal distribution [14,50,51].
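The three descriptors discussed above can be computed with the Python standard library, as in the sketch below. It assumes moment-based (population) skewness and non-excess kurtosis, for which a normal distribution gives 3; the text does not state which estimator was used, and the example data are artificial.

```python
import statistics

def describe(values):
    """Return CV%, skewness, and non-excess kurtosis for one element."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "cv_percent": 100.0 * sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,  # equals 3 for a normal distribution
    }

# Artificial data: one symmetric set and one with a long upper tail.
symmetric = describe([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = describe([1.0, 1.0, 1.0, 1.0, 10.0])
```

The symmetric set has zero skewness, while the single large value in the second set produces a positive skewness and pulls the mean above the median.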

In the same way, the wt% of the oxides were normalized to the corresponding values of UCC and the results showed that TiO₂ has the highest mean value of 2.47 wt% and the lowest value was calculated for Na₂O with an average value of 0.32 wt%. The results of the normalized values for soil oxides are presented and shown in Figure S1.

The mass fractions of the rare earth elements (REEs) were normalized to the corresponding values for UCC by Rudnick and Gao [49], Post Archean Australian Shale (PAAS) by Taylor and McLennan [52], and Chondrites by Taylor and McLennan [52], as clearly shown in Figure 3. The data clearly indicate that the Chondrites normalizer exhibits the highest values compared to the other two normalizers, PAAS and UCC, which generally align closely with one another. Notably, a negative europium (Eu) anomaly was observed in the Chondrites normalizer, indicating lower concentrations of Eu relative to other rare earth elements (REEs). This negative anomaly may be attributed to specific agricultural practices that could cause Eu depletion through runoff or erosion. Understanding these patterns is crucial for assessing soil quality, particularly concerning erosion and weathering processes. In contrast, when comparing the mass fractions of REEs to UCC and PAAS, the results demonstrate a positive Eu anomaly, suggesting that the soil contains higher levels of Eu relative to these two normalizers [31,53].

5.2. Normality Test and Intercorrelation

After the removal of the outliers using Mahalanobis distance probability, the new data were subjected to the assessment of normality using a Shapiro–Wilk normality test [18]. The results showed that the mass fractions of certain elements deviated from a normal distribution. A correlation matrix was generated, and correlation coefficients were calculated using the Pearson method to identify key trends and relationships among the elements, as illustrated in Figure S2.

At a glance, it is evident that Si, P, K, Ca, Sr, and Sb exhibit negative correlations with other elements. The strongest correlations were observed among the rare earth elements (REE), with coefficients nearing 1. No correlation patterns were identified for the following pairs: Co vs. U, Sb vs. Lu, and Sn vs. U. Notably, a strong negative correlation was found between Ca and each of Cr, Sc, Fe, and Ni (r = −0.7). The correlation matrix provides a detailed overview of the uncorrelated features, which should be taken into account in the subsequent machine learning analysis, namely dimensionality reduction with PCA and t-SNE followed by hierarchical agglomerative clustering (HAC).

5.3. Geochemical Provenance of Elements in Soil

The geochemical origin of soil elements can be elucidated through effective methods that accurately represent their compositional variations. Among these methods, ternary diagrams stand out, offering valuable insights into soil characterization. Two ternary diagrams, La/10–Y/15–Nb/8 (A) and Th–La–Sc (B), are plotted in Figure 4. The elements in both ternary diagrams are compared with the corresponding values from the literature as follows: UCC by Rudnick and Gao [54], PAAS by Taylor and McLennan [52], sediments by Viers et al. [55], and the North American shale composite (NASC) by Gromet et al. [56]. The samples match the literature values well. Specifically, in both ternary diagrams A and B, the samples plot in the vicinity of UCC. These peculiarities can be attributed to the association of crustal elements in the soil samples.

To gain further insights into sedimentary recycling, we analyzed the ratio of Zr/Sc versus Th/Sc, as shown in Figure 5. The figure clearly indicates minimal contributions from Th and Zr, suggesting a reduced sedimentary recycling process. This reduction can be attributed to the influence of the High Dam in Aswan, which traps sediments behind it. The obtained results are consistent with our previous work along the two banks of the Nile River in the Egyptian sector [3,8].

In the same manner, the distribution patterns of Th and U were examined in the studied soil samples, as shown in Figure 6. Figure 6A shows that the Th/U ratio is in line with the corresponding values for UCC, NASC, PAAS, and the sediments. In addition, the plot shows that the Th and U results in the soil are distributed around the line with a slope of Th/U ≈ 3.9. These peculiarities indicate that Th and U originate mainly from the crustal association. Figure 6B shows the distribution of Th and U in the soil samples; it is clear from the figure that the amounts of Th and U are lower than those of UCC, PAAS, NASC, and the sediments.

The relationship between Co and Th offers valuable insights into the weathering of rocks and their relationship to agricultural soils. Furthermore, it serves as an effective tool for discerning the different origins of the elements present in the soil, as illustrated in Figure 7. Most of the samples examined plot in the immediate vicinity of the fields characterized by basalt, andesite, and dacite–rhyolite. These results can be attributed to weathering products of the Ethiopian highlands that accumulated in the Nile Delta [3,8].

5.4. Findings of the Background and Pollution Analysis

In order to assess the environmental situation with regard to metal pollution, it is important to obtain the background values for the specific elements for which the assessment is carried out. As already mentioned, the background can be regional or global and can be calculated directly or indirectly. In the present work, the indirect regional background calculation was implemented. Based on the assumption that the mass fractions of the determined elements are normally distributed, the Just Another Gibbs Sampler (JAGS) is used in combination with Markov Chain Monte Carlo (MCMC) methods to apply Bayesian inference (BIM).

In this study, the results obtained were utilized as prior data for a model in which the mean (µ) follows a normal distribution, and the standard deviation (σ) follows a Gamma distribution. These hyperparameters were selected because they yielded the most accurate results for µ and σ, respectively. Specifically, the normal distribution was applied over the range defined by the minimum and maximum values of the data. The standard deviation σ was calculated using a Gamma distribution with a shape and rate parameter of 0.001. The model was designed to run for 50,000 iterations, with samples processed using JAGS and MCMC methods. The results of posteriors of the dataset (µ ± σ) with and without outliers are tabulated and given in Table 2.
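The sampling scheme can be illustrated with a hand-written Gibbs sampler in NumPy. This is a simplified stand-in, not the JAGS model used in the study: to obtain closed-form conditional distributions, the Gamma(0.001, 0.001) prior is placed on the precision τ = 1/σ² rather than on σ itself, the prior on µ is a very diffuse normal centred on the sample mean, and the iteration count is reduced. All function and parameter names are hypothetical, and the data are synthetic.

```python
import numpy as np

def gibbs_normal(x, n_iter=5000, burn_in=1000, p0=1e-6, a=0.001, b=0.001, seed=0):
    """Gibbs sampler for x_i ~ Normal(mu, 1/tau).

    Assumed priors: mu ~ Normal(m0, 1/p0) with m0 = sample mean,
    tau ~ Gamma(shape=a, rate=b).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    m0 = xbar
    tau = 1.0 / x.var(ddof=1)  # starting value for the precision
    mu_draws = np.empty(n_iter)
    sigma_draws = np.empty(n_iter)
    for t in range(n_iter):
        # Draw mu from its conditional: a precision-weighted normal.
        precision = p0 + n * tau
        mean = (p0 * m0 + tau * n * xbar) / precision
        mu = rng.normal(mean, 1.0 / np.sqrt(precision))
        # Draw tau from its conditional Gamma (NumPy uses scale = 1/rate).
        rate = b + 0.5 * np.sum((x - mu) ** 2)
        tau = rng.gamma(a + 0.5 * n, 1.0 / rate)
        mu_draws[t] = mu
        sigma_draws[t] = 1.0 / np.sqrt(tau)
    return mu_draws[burn_in:], sigma_draws[burn_in:]

# Synthetic mass fractions for one element across 53 samples.
data = np.random.default_rng(1).normal(loc=50.0, scale=5.0, size=53)
mu_post, sigma_post = gibbs_normal(data)
background_estimate = mu_post.mean()
```

With such diffuse priors the posterior mean of µ stays close to the sample mean, which matches the paper's observation that prior and posterior means nearly coincide once outliers are removed.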

At first glance, there are slight differences between the model results for the raw data and for the data after removing the outliers. These differences reflect the influence of the outliers on the distribution patterns of the mass fractions of the determined elements.

Comparing the mean values of prior data, as seen in Table 1, it was discovered that the mean values of prior data are nearly identical to the posterior ones after eliminating the effect of the outliers. The close alignment of the mean values in the prior and posterior data can be attributed to the normal distribution of most mass fractions of the elements. This distribution limits the information available to significantly shift the posterior means away from the prior means. Additionally, removing outliers improves the similarity of these mean values, while the normal distribution of the prior data supports the tendency for the posterior means to closely resemble the prior values. Therefore, it is highly recommended to preprocess and treat the data before making hypotheses and conclusions. In particular, determining background levels on the basis of empirical data from the region under investigation can help to correctly assess the environmental situation with regard to pollution or contamination with elements. Otherwise, using background values from regions with different geological features could lead to a significant miscalculation of pollution levels, resulting in either overestimation or underestimation of the environmental situation. Based on the calculated results for posterior mean values without outliers from Table 2, the pollution indices and contamination factor were computed.

The single pollution index (SPI) was calculated and is visualized in Figure 8. The figure indicates that a small number of elements exhibit slightly elevated levels compared with the background values. However, these slight SPI elevations are well within the normalization line and do not pose any significant risk to human health or the surrounding environment.

The enrichment factor (EF) was calculated, and the results show that all elements fall into the class of no to minor enrichment, with EF ranging from 0.9 for Ni to 1.08 for Ca. Similarly, the pollution load index (PLI) was calculated; although it is higher than unity, it does not indicate any significant hazard to humans or the environment. Finally, the total pollution index (TPI) was calculated to be lower than 11 (permissible level: < 16) for the locations investigated. Based on the criteria and classes of PLI and TPI reported by Badawy et al. [14], Abrahim and Parker [40], Andreev and Dzyuba [46], Saet et al. [48], and Kowalska et al. [58], the locations studied do not pose significant hazards. Overall, the data suggest that the studied areas are safe for both people and the environment.
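The indices discussed above can be illustrated with a minimal Python sketch. The formulas are the commonly used forms (SPI as the ratio to background, EF as a double ratio against a reference element, PLI as the geometric mean of the SPI values, and TPI as the sum of the SPI values minus n − 1) and are assumed to match the definitions used in this study. The choice of Al as the reference element is an assumption, and the medians from Table 1 are used only as a stand-in for a single sample.

```python
import math

# Background: posterior means without outliers from Table 2 (mg/kg).
background = {"Al": 74059.1, "Ni": 75.5, "Cu": 62.3, "Zn": 93.8, "Pb": 14.1}
# Stand-in "sample": medians from Table 1 (mg/kg), for illustration only.
sample = {"Al": 77200.0, "Ni": 79.4, "Cu": 62.9, "Zn": 91.8, "Pb": 12.7}


def single_pollution_index(sample, background):
    """SPI_i = C_i / B_i for every element."""
    return {el: sample[el] / background[el] for el in sample}


def enrichment_factor(sample, background, ref="Al"):
    """EF_i = (C_i / C_ref) / (B_i / B_ref); the reference element is assumed."""
    return {
        el: (sample[el] / sample[ref]) / (background[el] / background[ref])
        for el in sample
        if el != ref
    }


def pollution_load_index(spi):
    """PLI = geometric mean of the SPI values."""
    return math.prod(spi.values()) ** (1.0 / len(spi))


def total_pollution_index(spi):
    """TPI = sum of the SPI values minus (n - 1)."""
    return sum(spi.values()) - (len(spi) - 1)


spi = single_pollution_index(sample, background)
ef = enrichment_factor(sample, background)
pli = pollution_load_index(spi)
tpi = total_pollution_index(spi)
for element, value in spi.items():
    print(f"SPI({element}) = {value:.2f}")
print(f"PLI = {pli:.2f}, TPI = {tpi:.2f}")
```

Because every index is a ratio to the background, the result depends directly on which background values are chosen, which is why the text stresses region-specific backgrounds.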

5.5. Findings of the Unsupervised Learning HCA, PCA, t–SNE, and HAC

In this study, Hierarchical Cluster Analysis (HCA) was employed, an algorithm designed to categorize similar observations into groups known as clusters. Hierarchical clustering organizes data into a multilevel hierarchy, where clusters at one level merge to form clusters at the next. The algorithm groups objects based on their shared attributes, ultimately producing a set of distinct clusters. Within each cluster, the objects exhibit similarities, while differences exist between the clusters themselves [29,59,60]. Prior to the application of these algorithms, the data were scaled using the min–max normalization method to address discrepancies between the mass fractions of major, trace, and rare earth elements. This preprocessing step ensures that each feature contributes equally to the analysis.

To gain insight into the similarity of the observations, a hierarchical dendrogram was plotted (Figure 9). The cluster analysis used Ward’s minimum variance method to group the observations according to their geochemical characteristics, with the Euclidean distance, i.e., the straight-line distance between two points, quantifying the dissimilarity between observations. The quality of the clustering was tested using the cophenetic correlation coefficient, which yielded 0.7, suggesting moderate clustering with slight overlaps [28].
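The steps above (min–max scaling, Ward linkage on Euclidean distances, and the cophenetic correlation coefficient) can be sketched with SciPy as follows. The data matrix is synthetic and only mimics the shape of the dataset (49 samples, 55 features); the function names are hypothetical, and this is not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist


def min_max_scale(X):
    """Rescale every column to the interval [0, 1]."""
    X = np.asarray(X, dtype=float)
    low, span = X.min(axis=0), X.max(axis=0) - X.min(axis=0)
    return (X - low) / np.where(span == 0, 1.0, span)  # guard constant columns


def ward_hca(X):
    """Ward linkage on Euclidean distances plus the cophenetic correlation."""
    scaled = min_max_scale(X)
    distances = pdist(scaled, metric="euclidean")  # condensed distance matrix
    tree = linkage(distances, method="ward")
    coph_corr, _ = cophenet(tree, distances)
    return tree, coph_corr


# Synthetic stand-in data: two groups of samples with different mean levels.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(50.0, 5.0, size=(25, 55)),
    rng.normal(80.0, 5.0, size=(24, 55)),
])
tree, coph_corr = ward_hca(X)
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into two clusters
print(f"cophenetic correlation = {coph_corr:.2f}")
```

Passing `tree` to `scipy.cluster.hierarchy.dendrogram` would draw a plot of the kind shown in Figure 9.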

Figure 9 shows that the soil samples are grouped into two primary clusters, each with identifiable subgroups sharing common geochemical characteristics. However, determining the exact number of groups requires further analysis. To address this, the unsupervised machine learning algorithm HAC was applied to the output of two widely used dimensionality reduction techniques, PCA and t–SNE. PCA is an unsupervised linear technique for reducing the dimensionality of high-dimensional data and enhancing data visualization. In this study, the dataset consists of 49 samples and 55 features, making it appropriate to apply PCA before conducting HAC.

Principal Component Analysis (PCA) revealed that the first two components capture about 70% of the total variance, with individual contributions of 63.7% and 6.5%, respectively. The scree plot supports this, as it shows a notable elbow at the second component, suggesting that the first two components carry the most significant information. Based on this insight, the first two principal components were selected for further analysis using Hierarchical Agglomerative Clustering (HAC), and this reduced data matrix was subsequently used to visualize the clustering outcomes. An agglomerative clustering model was fitted to the data and evaluated by calculating the Silhouette and Davies–Bouldin Scores [61]. The number of clusters was then varied to find the best scores, and the results are tabulated in Table S1.
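The workflow described above can be sketched with scikit-learn: scale the data, keep the first two principal components, fit agglomerative clustering for several cluster counts, and record both scores. The input matrix is synthetic, the Ward linkage is scikit-learn's default and an assumption here (the linkage used for HAC is not stated), and the function name is hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.metrics import davies_bouldin_score, silhouette_score
from sklearn.preprocessing import MinMaxScaler


def score_cluster_counts(X, k_values=range(2, 7), n_components=2):
    """Return the explained variance ratios and (k, silhouette, Davies-Bouldin) rows."""
    scaled = MinMaxScaler().fit_transform(X)
    pca = PCA(n_components=n_components)
    components = pca.fit_transform(scaled)  # reduced data matrix
    rows = []
    for k in k_values:
        model = AgglomerativeClustering(n_clusters=k, linkage="ward")
        labels = model.fit_predict(components)
        rows.append((
            k,
            silhouette_score(components, labels),      # higher is better
            davies_bouldin_score(components, labels),  # lower is better
        ))
    return pca.explained_variance_ratio_, rows


# Synthetic stand-in data: 49 samples, 55 features, four loosely separated groups.
rng = np.random.default_rng(0)
centres = rng.uniform(0.0, 100.0, size=(4, 55))
groups = np.repeat(np.arange(4), [20, 6, 11, 12])
X = centres[groups] + rng.normal(0.0, 5.0, size=(49, 55))

variance_ratio, rows = score_cluster_counts(X)
print("explained variance ratio:", np.round(variance_ratio, 3))
for k, sil, dbi in rows:
    print(f"k = {k}: silhouette = {sil:.2f}, Davies-Bouldin = {dbi:.2f}")
```

The two scores need not favour the same cluster count, so a table like Table S1 is typically read alongside domain knowledge rather than used as a single decision rule.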

The highest Silhouette Score and the lowest Davies–Bouldin Score, 0.59 and 0.48, respectively, were achieved for two clusters, which suggests that two is the optimum number of clusters for the investigated data. However, given prior knowledge of the investigated areas and potential pollution sources, it was considered preferable to use four clusters, especially because the scores for four clusters do not differ significantly from those for two. These scores suggest moderate clustering quality; although they are not particularly high, they indicate that the samples are reasonably well-clustered, with some overlap between clusters. In clustering analysis, a Silhouette Score closer to 1 signifies better-defined and more distinct clusters, while a lower Davies–Bouldin Score reflects higher clustering quality, as it indicates that observations within clusters are tightly grouped and well-separated from other clusters [61]. Based on the agglomerative clustering, the scatter plot of the first two principal components is illustrated in Figure 10.

Figure 10 depicts four distinct groups of locations, which can be summarized as follows:

Cluster 1 includes the largest number of locations (20), namely 7, 10, 11, 14, 16, 18, 23, 26, 27, 29, 30, 31, 32, 34, 35, 36, 37, 38, 40, and 48.
Cluster 2 has the smallest number of samples, namely locations 19, 41, 44, 46, 49, and 50.
Cluster 3 contains 11 locations: 1, 3, 4, 5, 6, 8, 12, 13, 15, 22, and 39.
Cluster 4 includes 11 locations: 2, 9, 17, 21, 25, 28, 33, 47, 51, 52, and 53.

The studied locations are grouped based on shared sources of elemental composition in agricultural soil. PCA revealed overlaps between clusters, limiting its effectiveness in the clear separation of geochemical patterns.

t-SNE, a widely used nonlinear dimensionality reduction technique, was also applied to reduce the dimensionality of the data. The same approach was followed: the data were first scaled, and t-SNE was then run with a perplexity of 30 and an automatically selected learning rate. Following dimensionality reduction, hierarchical agglomerative clustering was applied as the unsupervised clustering method. The results of HAC based on t-SNE are illustrated in Figure 11. The clustering evaluation yielded a Silhouette score of 0.49 and a Davies–Bouldin score of 0.63, indicating that the clusters are moderately well-defined.

With slight exceptions, the samples are grouped in almost the same way in both figures (Figures 10 and 11). The following discussion and interpretation, therefore, apply to both. Although the clustering is moderate, Figure 11 shows four clusters, which can be summarized as follows:

Cluster 1 includes the soil samples labeled 2, 9, 17, 21, 25, 28, 32, 33, 35, 36, 47, 48, 51, 52, and 53. Although these samples were collected at different locations in the Nile Delta, they are grouped together because they were collected in agricultural areas close to highways. It can, therefore, be assumed that crustal material in the dust along the highways is the common source of these elements.
Cluster 2 contains samples 1, 7, 10, 11, 14, 16, 18, 20, 26, 27, 29, 30, 31, 34, 37, 38, and 40. The samples were taken from agricultural land near rural areas, and the influences and pressures from these areas most likely affect the elemental mass fractions in the adjacent soil. This grouping supports the accuracy of the sample analysis and the soundness of the statistical treatment and hypotheses.
Cluster 3 has 10 samples, namely 3, 4, 5, 6, 8, 12, 13, 15, 22, and 39. The samples were taken near large cities and industries; for example, some were collected near Tanta and Banha. It is, therefore, hypothesized that the common geochemical features that cluster these samples together are due to domestic activities and industry.
Cluster 4 has the smallest number of soil samples, namely 19, 41, 44, 46, 49, and 50. These samples come from different places but fall into one group, which can be explained by the excessive use of fertilizers and the proximity to Lake Burullus [31].

Overall, the soil samples are well clustered according to their common geochemical characteristics, which supports the precision and accuracy of the analysis. The analysis provides a comprehensive picture of the elemental composition of agricultural soils and an acceptable segmentation of the samples in terms of elemental abundances. For example, the clusters distinguish samples collected near highways, near rural areas, and near mixed potential pollution sources. Furthermore, the cluster assignments obtained with PCA and t-SNE showed satisfactory agreement: the clusters are similar with respect to the common sources of pollution, as they contain almost the same locations.
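One way to put a number on the agreement between the two assignments is the adjusted Rand index, which is not used in this study and is shown here only as an illustrative sketch. The memberships are copied from the two cluster lists above; sample 23 appears only in the PCA-based list and sample 20 only in the t-SNE-based list, so the comparison is restricted to the 47 samples present in both. The index ignores cluster numbering, so the different numbering of the two lists does not matter.

```python
from collections import Counter
from math import comb

# Cluster memberships as listed in the text (sample labels).
pca_clusters = {
    1: [7, 10, 11, 14, 16, 18, 23, 26, 27, 29, 30, 31, 32, 34, 35, 36, 37, 38, 40, 48],
    2: [19, 41, 44, 46, 49, 50],
    3: [1, 3, 4, 5, 6, 8, 12, 13, 15, 22, 39],
    4: [2, 9, 17, 21, 25, 28, 33, 47, 51, 52, 53],
}
tsne_clusters = {
    1: [2, 9, 17, 21, 25, 28, 32, 33, 35, 36, 47, 48, 51, 52, 53],
    2: [1, 7, 10, 11, 14, 16, 18, 20, 26, 27, 29, 30, 31, 34, 37, 38, 40],
    3: [3, 4, 5, 6, 8, 12, 13, 15, 22, 39],
    4: [19, 41, 44, 46, 49, 50],
}


def to_labels(clusters):
    """Map each sample label to its cluster id."""
    return {s: cid for cid, members in clusters.items() for s in members}


def pairs(counts):
    """Number of sample pairs that fall inside the same group."""
    return sum(comb(c, 2) for c in counts.values())


def adjusted_rand_index(a, b):
    """Adjusted Rand index of two label sequences of equal length."""
    same_both = pairs(Counter(zip(a, b)))        # pairs together in both partitions
    same_a, same_b = pairs(Counter(a)), pairs(Counter(b))
    expected = same_a * same_b / comb(len(a), 2)  # value expected by chance
    return (same_both - expected) / ((same_a + same_b) / 2 - expected)


pca_labels, tsne_labels = to_labels(pca_clusters), to_labels(tsne_clusters)
common = sorted(pca_labels.keys() & tsne_labels.keys())
ari = adjusted_rand_index(
    [pca_labels[s] for s in common],
    [tsne_labels[s] for s in common],
)
print(f"{len(common)} shared samples, adjusted Rand index = {ari:.2f}")
```

A value of 1 would mean identical partitions and a value near 0 agreement no better than chance.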

6. Conclusions

This study confirms the hypothesis that the agricultural soils in the Nile Delta region are not polluted and are mainly characterized by natural inputs from the earth’s crust. Comprehensive elemental analysis shows a slight enrichment of Cd, Ti, Cu, P, Eu, V, Fe, Ni and Ca, which currently pose no significant risks to the environment or human health. By integrating statistical approaches with advanced machine learning methods, in particular hierarchical clustering analysis and t-distributed Stochastic Neighbor Embedding, the study provides an innovative approach for environmental monitoring where sources related to human activities such as road traffic, agriculture and mixed land use can be effectively discriminated. The determination of empirically derived elemental background concentrations significantly improves the accuracy of future pollution assessments and reduces the risk of overestimating or underestimating environmental hazards. The present research combines soil geochemistry with machine learning models to create a novel methodological framework applicable to broader environmental management, policymaking and land-use planning. Policy makers can use these insights to develop targeted strategies to mitigate potential environmental impacts, while managers in agriculture and industry could use this approach to continuously monitor and control sources of contamination. The presented methodological approach and its results not only enrich the understanding of soil geochemistry in the Nile Delta but also hold promise for application in similar ecological assessments worldwide, particularly in regions experiencing rapid agricultural expansion and urbanization.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/environments12080289/s1, Figure S1: Boxplot shows the distribution patterns of the major elements as oxides in the soil samples; Figure S2: A correlation matrix based on Pearson method; Table S1: values of the accuracy scores.

Author Contributions

Conceptualization, W.M.B. and E.S.M.; Methodology, M.G.B.; Software, W.M.B.; Validation, W.M.B., M.G.B., E.S.M. and A.U.; Formal analysis, W.M.B.; Investigation, W.M.B., E.S.M. and T.M.M.; Data curation, W.M.B., F.I.E.-A. and T.M.M.; Writing—original draft, W.M.B. and A.U.; Writing—review & editing, F.I.E.-A., M.G.B. and T.M.M.; Visualization, F.I.E.-A. and A.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data is presented in this study.

Acknowledgments

The authors acknowledge the joint project between the Academy of Scientific Research and Technology (Egypt) and the Joint Institute for Nuclear Research (Dubna, Russia) ASRT–JINR collaboration.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Hamza, W. The Nile Delta. In The Nile; Dumont, H.J., Ed.; Monographiae Biologicae; Springer: Dordrecht, The Netherlands, 2009; pp. 75–94.
2. Fishar, M.R. Nile Delta (Egypt). In The Wetland Book: II: Distribution, Description and Conservation; Finlayson, C.M., Milton, G.R., Prentice, R.C., Davidson, N.C., Eds.; Springer: Dordrecht, The Netherlands, 2016; pp. 1–10.
3. Arafa, W.M.; Badawy, W.M.; Fahmi, N.M.; Ali, K.; Gad, M.S.; Duliu, O.G.; Frontasyeva, M.V.; Steinnes, E. Geochemistry of sediments and surface soils from the Nile Delta and lower Nile valley studied by epithermal neutron activation analysis. J. Afr. Earth Sci. 2015, 107, 57–64.
4. Bratley, K.H.; Woodcock, C.E. Estimating the expansion and reduction of agricultural extent in Egypt using Landsat time series. Int. J. Appl. Earth Obs. 2024, 133, 104141.
5. Fritz, S.; See, L.; McCallum, I.; You, L.; Bun, A.; Moltchanova, E.; Duerauer, M.; Albrecht, F.; Schill, C.; Perger, C.; et al. Mapping global cropland and field size. Glob. Change Biol. 2015, 21, 1980–1992.
6. Badawy, W.; Frontasyeva, M.V.; Ibrahim, M. Vertical Distribution of Major and Trace Elements in a Soil Profile from the Nile Delta, Egypt. Ecol. Chem. Eng. S 2020, 27, 281–294.
7. Stanley, D.J. Nile delta: Extreme case of sediment entrapment on a delta plain and consequent coastal land loss. Mar. Geol. 1996, 129, 189–195.
8. Badawy, W.M.; Ghanim, E.H.; Duliu, O.G.; El Samman, H.; Frontasyeva, M.V. Major and trace element distribution in soil and sediments from the Egyptian central Nile Valley. J. Afr. Earth Sci. 2017, 131, 53–61.
9. Puthusseri, R.M.; Nair, H.P.; Johny, T.K.; Bhat, S.G. Insights into the response of mangrove sediment microbiomes to heavy metal pollution: Ecological risk assessment and metagenomics perspectives. J. Environ. Manag. 2021, 298, 113492.
10. Arbuzov, S.I.; Chekryzhov, I.Y.; Verkhoturov, A.A.; Spears, D.A.; Melkiy, V.A.; Zarubina, N.V.; Blokhin, M.G. Geochemistry and rare-metal potential of coals of the Sakhalin coal basin, Sakhalin island, Russia. Int. J. Coal Geol. 2023, 268, 104197.
11. IAEA. Soil Sampling for Environmental Contaminants; International Atomic Energy Agency: Vienna, Austria, 2004.
12. EGSMA. Geologic Map of Um Safi Area; EGSMA: Cairo, Egypt, 1989.
13. NSAM, N.-A.M. Determination of Element Composition of Rocks, Soils, Grounds, and Sediments by ICP-AE and ICP-MS Methods. All-Russian Scientific-Research Institute Of Mineral Resources named after N.M. Fedorovsky: Moscow, Russia, 2015.
14. Badawy, W.M.; Dmitriev, A.Y.; El Samman, H.; El-Taher, A.; Blokhin, M.G.; Rammah, Y.S.; Madkour, H.A.; Salama, S.; Budnitskiy, S.Y. Elemental composition and metal pollution in Egyptian Red Sea mangrove sediments: Characterization and origin. Mar. Pollut. Bull. 2024, 198, 115830.
15. Josso, P.; Rushton, J.; Lusty, P.; Matthews, A.; Chenery, S.; Holwell, D.; Kemp, S.J.; Murton, B. Late Cretaceous and Cenozoic paleoceanography from north-east Atlantic ferromanganese crust microstratigraphy. Mar. Geol. 2020, 422, 106122.
16. R Core Team. R: A Language and Environment for Statistical Computing, 4.2.1; R Foundation for Statistical Computing: Vienna, Austria, 2022.
17. van Rossum, G. Python Tutorial, Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), 3.10; Python Software Foundation: Amsterdam, The Netherlands, 1995.
18. Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611.
19. Filzmoser, P.; Hron, K. Outlier Detection for Compositional Data Using Robust Methods. Math. Geosci. 2008, 40, 233–248.
20. Filzmoser, P.; Hron, K.; Reimann, C. Principal component analysis for compositional data with outliers. Environmetrics 2009, 20, 621–632.
21. Lê, S.; Josse, J.; Husson, F. FactoMineR: An R package for multivariate analysis. J. Stat. Softw. 2008, 25, 18.
22. Van Der Maaten, L. Accelerating t-SNE using Tree-Based Algorithms. J. Mach. Learn. Res. 2014, 15, 3221–3245.
23. Keim, D.A.; Kohlhammer, J.; Ellis, G.P.; Mansmann, F. Mastering the Information Age—Solving Problems with Visual Analytics; Eurographics Association: Eindhoven, The Netherlands, 2010.
24. Badawy, W.; Silachyov, I.; Dmitriev, A.; Lennik, S.; Saleh, G.; Mitwalli, M.; El-Farrash, A.; Sallah, M. Elemental distribution patterns in rock samples from Egypt using neutron activation and complementary X-ray fluorescence analyses. Appl. Radiat. Isot. 2023, 202, 111063.
25. Hinton, G.; Roweis, S. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2003; Volume 15, pp. 857–864.
26. Baxter, M.J. Exploratory Multivariate Analysis in Archaeology; Percheron Press, a division of Eliot Werner Publications, Inc.: Clinton Corners, NY, USA, 2015.
27. Carlson, D.L. Quantitative Methods in Archaeology Using R; University of Cambridge: Cambridge, UK, 2017.
28. Badawy, W.M.; Dmitriev, A.Y.; Koval, V.Y.; Smirnova, V.S.; Chepurchenko, O.E.; Lobachev, V.V.; Belova, M.O.; Galushko, A.M. Formation of reference groups for archaeological pottery using neutron activation and multivariate statistical analyses. Archaeometry 2022, 64, 1377–1393.
29. Müllner, D. Modern hierarchical, agglomerative clustering algorithms. arXiv 2011.
30. Bar-Joseph, Z.; Gifford, D.K.; Jaakkola, T.S. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 2001, 17, S22–S29.
31. Badawy, W.; Elsenbawy, A.; Dmitriev, A.; El Samman, H.; Shcheglov, A.; El-Gamal, A.; Kamel, N.H.M.; Mekewi, M. Characterization of major and trace elements in coastal sediments along the Egyptian Mediterranean Sea. Mar. Pollut. Bull. 2022, 177, 113526.
32. Badawy, W.M.; Duliu, O.G.; El Samman, H.; El-Taher, A.; Frontasyeva, M.V. A review of major and trace elements in Nile River and Western Red Sea sediments: An approach of geochemistry, pollution, and associated hazards. Appl. Radiat. Isot. 2021, 170, 109595.
33. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 2013.
34. Lunn, D.J.; Spiegelhalter, D.J.; Thomas, A.; Best, N. The BUGS project: Evolution, critique and future directions. Stat. Med. 2009, 28, 3049–3067.
35. Plummer, M. JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling. In Proceedings of the Third International Workshop on ‘Distributed Statistical Computing’ (DSC 2003), Vienna, Austria, 20–22 March 2003; Technische Universität Wien: Vienna, Austria, 2003.
36. Kruschke, J.K. Bayesian estimation supersedes the t test. J. Exp. Psychol. Gen. 2013, 142, 573–603.
37. Park, J.H.; Quinn, K.; Martin, A. MCMCpack: Markov chain Monte Carlo in R. J. Stat. Softw. 2011, 42, 1–21.
38. Plummer, M.; Best, N.; Cowles, K.; Vines, K. Output Analysis and Diagnostics for MCMC (CODA). R News 2006, 6, 7–11.
39. Kamanina, I.Z.; Badawy, W.M.; Kaplina, S.P.; Makarov, O.A.; Mamikhin, S.V. Assessment of Soil Potentially Toxic Metal Pollution in Kolchugino Town, Russia: Characteristics and Pollution. Land 2023, 12, 439.
40. Abrahim, G.M.S.; Parker, R.J. Assessment of heavy metal enrichment factors and the degree of contamination in marine sediments from Tamaki Estuary, Auckland, New Zealand. Environ. Monit. Assess. 2008, 136, 227–238.
41. Lv, J.; Liu, Y.; Zhang, Z.; Zhou, R.; Zhu, Y. Distinguishing anthropogenic and natural sources of trace elements in soils undergoing recent 10-year rapid urbanization: A case of Donggang, Eastern China. Environ. Sci. Pollut. Res. 2015, 22, 10539–10550.
42. Reimann, C.; de Caritat, P. Distinguishing between natural and anthropogenic sources for elements in the environment: Regional geochemical surveys versus enrichment factors. Sci. Total Environ. 2005, 337, 91–107.
43. Ergin, M.; Saydam, C.; Baştürk, Ö.; Erdem, E.; Yörük, R. Heavy metal concentrations in surface sediments from the two coastal inlets (Golden Horn Estuary and İzmit Bay) of the northeastern Sea of Marmara. Chem. Geol. 1991, 91, 269–285.
44. Chen, C.W.; Kao, C.M.; Chen, C.F.; Di Dong, C. Distribution and accumulation of heavy metals in the sediments of Kaohsiung Harbor, Taiwan. Chemosphere 2007, 66, 1431–1440.
45. Varol, M. Assessment of heavy metal contamination in sediments of the Tigris River (Turkey) using pollution indices and multivariate statistical techniques. J. Hazard. Mater. 2011, 195, 355–364.
46. Andreev, D.N.; Dzyuba, E.A. Total soil heavy metal contamination in various biotops at the territory of Vishersky reserve. Gen. Biol. 2016, 64, 5.
47. Shaykhutdinova, A.N. Assessment of the degree of contamination agricultural soils Kuzbass mobile forms of heavy metals. In Proceedings of the V International Scientific Conference Dedicated to the 85th Anniversary of the Department of Soil Science and Soil Ecology of TSU, Tomsk, Russia, 10 August 2015; p. 5.
48. Saet, Y.E.; Revich, B.A.; Yanin, E.P.; Smirnova, R.S.; Basharkevich, I.L.; Onishchenko, T.L.; Pavlova, L.N.; Trefilova, N.Y.; Achkasov, A.I.; Sarkisyan, S.S. Geochemistry of the Environment; Nedra: Moscow, Russia, 1990.
49. Rudnick, R.L.; Gao, S. Composition of the Continental Crust. In Treatise on Geochemistry; Turekian, K.K., Ed.; Elsevier: Oxford, UK, 2014; pp. 1–51.
50. Hair, J.F.; Sarstedt, M.; Pieper, T.M.; Ringle, C.M. The use of partial least squares structural equation modeling in strategic management research: A review of past practices and recommendations for future applications. Long Range Plan. 2012, 45, 320–340.
51. Zhou, X.; Chen, Q.; Liu, C.; Fang, Y. Using moss to assess airborne heavy metal pollution in Taizhou, China. Int. J. Environ. Res. Public Health 2017, 14, 430.
52. Taylor, S.R.; McLennan, S.M. The Continental Crust: Its Composition and Evolution; Blackwell Scientific Publications: Oxford, UK, 1985; p. 312.
53. Fowler, A.D.; Doig, R. The significance of europium anomalies in the REE spectra of granites and pegmatites, Mont Laurier, Quebec. Geochim. Cosmochim. Acta 1983, 47, 1131–1137.
54. Rudnick, R.L.; Gao, S. Composition of the continental crust. Treatise Geochem. 2003, 3, 1–64.
55. Viers, J.; Dupre, B.; Gaillardet, G. Chemical composition of suspended sediments in World Rivers: New insights from a new database. Sci. Total Environ. 2009, 407, 853–868.
56. Gromet, L.P.; Haskin, L.A.; Korotev, R.L.; Dymek, R.F. The “North American shale composite”: Its compilation, major and trace element characteristics. Geochim. Cosmochim. Acta 1984, 48, 2469–2482.
57. Bhatia, M.R.; Crook, K.A.W. Trace element characteristics of graywackes and tectonic setting discrimination of sedimentary basins. Contrib. Mineral. Petrol. 1986, 92, 181–193.
58. Kowalska, J.B.; Mazurek, R.; Gąsiorek, M.; Zaleski, T. Pollution indices as useful tools for the comprehensive evaluation of the degree of soil contamination–A review. Environ. Geochem. Health 2018, 40, 2395–2420.
59. Ranjbarzadeh, R.; Caputo, A.; Tirkolaee, E.B.; Jafarzadeh Ghoushchi, S.; Bendechache, M. Brain tumor segmentation of MRI images: A comprehensive review on the application of artificial intelligence tools. Comput. Biol. Med. 2023, 152, 106405.
60. Nielsen, F. Hierarchical Clustering. In Introduction to HPC with MPI for Data Science; Nielsen, F., Ed.; Springer International Publishing: Cham, Switzerland, 2016; pp. 195–211.
61. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. 1979, PAMI-1, 224–227.

Figure 1. A map showing the sampling locations of the soil.

Figure 2. Boxplot shows the main descriptive statistical patterns of the determined elements in the soil samples.

Figure 3. A plot shows the distribution patterns of the rare earth elements normalized to UCC, PAAS, and Chondrites.

Figure 4. Ternary diagram shows the relationship between (A) La/10–Y/15–Nb/8 and (B) Th–La-Sc [57].

Figure 5. Biplot shows the ratio indicators of Zr/Sc vs. Th/Sc for the studied soil compared with the corresponding values from the literature.

Figure 6. Distribution patterns of Th and U in the soil samples compared with the literature (A) Th/U vs U and (B) Th vs U. The dashed red line stands for normalization.

Figure 7. Biplot shows the relationship between immobile trace elements, Co vs. Th.

Figure 8. Boxplot shows the logarithmic values of the single pollution index SPI.

Figure 9. Hierarchical cluster analysis of the soil samples.

Figure 10. Hierarchical agglomerative clustering based on PCA.

Figure 11. Hierarchical agglomerative clustering based on t–SNE.

Table 1. Descriptive statistics of the studied soil samples from the Nile Delta. The mass fractions of the elements are provided in mg/kg and compared with the upper continental crust (UCC) values of Rudnick and Gao [49].

Element	Mean ± SE	Median ± MAD *	Min–Max	CV%	Skewness	Kurtosis	W-Statistic	p-Value	UCC	Element	Mean ± SE	Median ± MAD *	Min–Max	CV%	Skewness	Kurtosis	W-Statistic	p-Value	UCC
Li	14.1 ± 0.407	13.7 ± 1.6	7–23.8	20.3	0.527	1.91	0.961	0.101	24	Cd	0.237 ± 0.008	0.2 ± 0	0.2–0.4	22.3	0.974	−0.197	0.645	0.000	0.09
Be	1.32 ± 0.033	1.3 ± 0.1	0.7–1.7	17.4	−0.516	−0.054	0.957	0.074	2.1	Sn	3.12 ± 0.132	3 ± 0.4	1.5–7.2	29.7	2.28	7.82	0.787	0.000	2.1
Na	7850 ± 172	7980 ± 707	4260–10,500	15.4	−0.504	0.817	0.973	0.317	24,258.57	Sb	0.396 ± 0.021	0.4 ± 0.1	0.2–1	37.5	2	6.07	0.775	0.000	0.4
Mg	17,700 ± 291	18,000 ± 1280	13,300–24,100	11.5	−0.007	0.923	0.95	0.036	14,953.85	Cs	1.23 ± 0.031	1.2 ± 0.1	0.6–1.7	17.9	−0.374	0.451	0.969	0.222	4.9
Al	74,100 ± 1650	77,200 ± 5610	38,000–91,200	15.6	−1.14	0.852	0.902	0.001	81,510.71	Ba	365 ± 8.11	379 ± 37.7	223–453	15.5	−0.644	−0.119	0.957	0.070	628
Si	245,000 ± 2640	241,000 ± 10,100	221,000–309,000	7.56	1.61	3.11	0.857	0.000	311,405.14	La	27.6 ± 0.635	28.9 ± 3	18.1–34.7	16.1	−0.452	−0.765	0.949	0.034	31
P	1260 ± 61	1160 ± 169	707–3280	33.8	2.42	8.58	0.795	0.000	654.57	Ce	62.4 ± 1.29	63.8 ± 5.5	40.8–78.8	14.4	−0.399	−0.393	0.971	0.262	63
K	9840 ± 177	9760 ± 888	7840–12,700	12.6	0.46	−0.455	0.968	0.195	23,244.16	Pr	6.62 ± 0.138	6.8 ± 0.6	4.3–8.1	14.6	−0.495	−0.349	0.958	0.077	7.1
Ca	40,000 ± 1860	35,700 ± 4020	24,200–80,100	32.7	1.75	2.45	0.781	0.000	25,657.49	Nd	30.2 ± 0.677	31.2 ± 3.3	19.5–38.8	15.7	−0.372	−0.371	0.97	0.240	27
Sc	20.7 ± 0.572	21.8 ± 2.2	10.8–26.7	19.3	−0.926	0.235	0.915	0.002	14	Sm	6.46 ± 0.149	6.6 ± 0.6	4.1–8.2	16.2	−0.427	−0.303	0.961	0.101	4.7
Ti	9460 ± 273	9570 ± 1160	4390–12,300	20.2	−0.812	0.272	0.935	0.009	3835.79	Eu	1.71 ± 0.04	1.8 ± 0.2	1.1–2.2	16.5	−0.445	−0.278	0.956	0.063	1
V	157 ± 3.72	164 ± 11.8	83.9–200	16.6	−1.16	0.772	0.89	0.000	97	Gd	6.01 ± 0.137	6.2 ± 0.6	3.8–7.6	16	−0.426	−0.408	0.964	0.132	4
Cr	116 ± 2.11	120 ± 7	70.7–135	12.7	−1.34	1.39	0.867	0.000	92	Tb	0.857 ± 0.02	0.9 ± 0.1	0.6–1.1	16	−0.372	−0.433	0.917	0.002	0.7
Mn	1040 ± 26.4	1060 ± 112	536–1500	17.8	−0.351	0.343	0.979	0.538	774.46	Dy	4.96 ± 0.117	5.1 ± 0.5	3.1–6.4	16.5	−0.434	−0.371	0.962	0.111	3.9
Fe	63,300 ± 1490	65,900 ± 5850	36,300–77,700	16.4	−0.987	0.255	0.908	0.001	39,175.06	Ho	0.896 ± 0.02	0.9 ± 0.1	0.6–1.2	15.6	−0.342	−0.057	0.935	0.009	0.83
Co	22.3 ± 0.619	23 ± 2.6	11.6–28.8	19.4	−0.638	−0.024	0.949	0.032	17.3	Er	2.39 ± 0.059	2.4 ± 0.3	1.5–3.1	17.2	−0.392	−0.457	0.965	0.154	2.3
Ni	75.5 ± 2.16	79.4 ± 5.9	39.1–116	20	−0.602	0.836	0.921	0.003	47	Tm	0.359 ± 0.01	0.4 ± 0	0.2–0.5	18.8	−0.956	0.296	0.732	0.000	0.3
Cu	62.3 ± 1.48	62.9 ± 3.7	32.3–82.5	16.6	−0.851	1.37	0.916	0.002	28	Yb	2.58 ± 0.059	2.6 ± 0.3	1.7–3.3	16	−0.313	−0.468	0.959	0.082	1.96
Zn	93.8 ± 2.08	91.8 ± 6.3	53.1–123	15.5	−0.259	0.622	0.958	0.077	67	Lu	0.367 ± 0.008	0.4 ± 0	0.2–0.5	16.1	−1	0.801	0.701	0.000	0.31
Ga	15.6 ± 0.439	15.8 ± 2.1	7.4–20.8	19.7	−0.548	−0.069	0.967	0.184	17.5	Hf	4.65 ± 0.144	4.7 ± 0.7	2.2–6.6	21.6	−0.175	−0.452	0.984	0.751	5.3
Ge	1.33 ± 0.032	1.3 ± 0.1	0.7–1.9	17.1	0.023	0.589	0.965	0.155	1.4	Ta	1.11 ± 0.036	1.1 ± 0.2	0.6–1.6	22.4	−0.108	−0.685	0.966	0.170	0.9
As	2.99 ± 0.103	2.9 ± 0.3	2.1–7.1	24.2	3.75	19.4	0.651	0.000	4.8	W	0.68 ± 0.025	0.6 ± 0.1	0.4–1.3	25.5	0.975	1.95	0.911	0.001	1.9
Rb	33.7 ± 0.7	33.9 ± 3.4	20.5–43.7	14.5	−0.188	−0.009	0.99	0.955	84	Tl	0.186 ± 0.005	0.2 ± 0	0.1–0.2	19	−2.04	2.17	0.417	0.000	0.9
Sr	261 ± 10.9	240 ± 16.8	171–594	29.2	2.57	7.18	0.695	0.000	320	Pb	14.1 ± 0.766	12.7 ± 2.6	8.3–35.6	37.9	1.86	4.1	0.822	0.000	17
Y	22.6 ± 0.565	22.9 ± 2.8	14.3–29.3	17.5	−0.296	−0.581	0.965	0.151	21	Bi	0.112 ± 0.006	0.1 ± 0	0.1–0.3	34.7	3.3	10.7	0.354	0.000	0.16
Zr	161 ± 4.99	164 ± 21.3	76.8–226	21.7	−0.112	−0.488	0.978	0.466	193	Th	4.91 ± 0.106	5 ± 0.5	2.7–6.5	15.1	−0.484	0.29	0.978	0.468	10.5
Nb	17.5 ± 0.669	16.9 ± 3.3	8.3–26.3	26.7	0.095	−0.848	0.972	0.290	12	U	1.31 ± 0.025	1.3 ± 0.1	0.9–1.7	13.3	0.074	−0.352	0.962	0.113	2.7
Mo	0.724 ± 0.024	0.7 ± 0.1	0.4–1	22.8	−0.089	−0.68	0.95	0.036	1.1

* MAD = median absolute deviation.

Table 2. The µ ± σ of the determined elements with and without outliers using Bayesian inference method.

Element	µ ± σ with Outliers	µ ± σ Without Outliers	UCC	Element	µ ± σ with Outliers	µ ± σ Without Outliers	UCC
Li	13.3 ± 4	14.1 ± 2.9	24	Cd	0.2 ± 0.1	0.2 ± 0.1	0.09
Be	1.2 ± 0.3	1.3 ± 0.2	2.1	Sn	2.9 ± 1.1	3.1 ± 1	2.1
Na	7664.5 ± 1607.7	7851.2 ± 1237.4	24,258.57	Sb	0.4 ± 0.2	0.4 ± 0.2	0.4
Mg	16,807.2 ± 3785.8	17,679.9 ± 2094.1	14,953.85	Cs	1.2 ± 0.3	1.2 ± 0.2	4.9
Al	70,235.1 ± 17,993.7	74,059.1 ± 11,849.1	81,510.71	Ba	353.4 ± 74.5	365.5 ± 58.3	628
Si	255,903.7 ± 44,770.4	244,630 ± 18,993.4	311,405.14	La	26.3 ± 6.4	27.6 ± 4.6	31
P	1211.5 ± 462.7	1263.4 ± 438.8	654.57	Ce	59.5 ± 14	62.4 ± 9.3	63
K	9561.1 ± 1754.8	9837.9 ± 1276.5	23,244.16	Pr	6.3 ± 1.5	6.6 ± 1	7.1
Ca	38,189 ± 14,395.4	39,952.4 ± 13,406.1	25,657.49	Nd	28.8 ± 6.9	30.2 ± 4.9	27
Sc	19.6 ± 5.6	20.7 ± 4.1	14	Sm	6.1 ± 1.5	6.5 ± 1.1	4.7
Ti	8998.2 ± 2561	9463.1 ± 1960.8	3835.79	Eu	1.6 ± 0.4	1.7 ± 0.3	1
V	148.2 ± 41.1	157 ± 26.8	97	Gd	5.7 ± 1.4	6 ± 1	4
Cr	111.6 ± 22.9	116.2 ± 15.1	92	Tb	0.8 ± 0.2	0.9 ± 0.1	0.7
Mn	988.3 ± 262.2	1039.9 ± 190	774.46	Dy	4.7 ± 1.2	5 ± 0.8	3.9
Fe	59,670.8 ± 16,718.8	63,318.3 ± 10,687.5	39,175.06	Ho	0.9 ± 0.2	0.9 ± 0.1	0.83
Co	21 ± 6.4	22.3 ± 4.5	17.3	Er	2.3 ± 0.6	2.4 ± 0.4	2.3
Ni	71 ± 22.2	75.5 ± 15.5	47	Tm	0.3 ± 0.1	0.4 ± 0.1	0.3
Cu	59.1 ± 15.8	62.3 ± 10.7	28	Yb	2.5 ± 0.6	2.6 ± 0.4	1.96
Zn	88.8 ± 23	93.8 ± 15	67	Lu	0.3 ± 0.1	0.4 ± 0.1	0.31
Ga	14.7 ± 4.4	15.6 ± 3.2	17.5	Hf	4.4 ± 1.4	4.6 ± 1	5.3
Ge	1.3 ± 0.3	1.3 ± 0.2	1.4	Ta	1.1 ± 0.3	1.1 ± 0.3	0.9
As	2.8 ± 0.9	3 ± 0.7	4.8	W	0.6 ± 0.2	0.7 ± 0.2	1.9
Rb	32.1 ± 7.7	33.7 ± 5	84	Tl	0.2 ± 0	0.2 ± 0	0.9
Sr	251.6 ± 84.2	261.4 ± 78.5	320	Pb	13.5 ± 5.8	14.1 ± 5.5	17
Y	21.5 ± 5.7	22.6 ± 4.1	21	Bi	0.1 ± 0	0.1 ± 0	0.16
Zr	151.8 ± 48.5	161.2 ± 35.8	193	Th	4.7 ± 1.1	4.9 ± 0.8	10.5
Nb	16.5 ± 5.9	17.5 ± 4.8	12	U	1.2 ± 0.3	1.3 ± 0.2	2.7
Mo	0.7 ± 0.2	0.7 ± 0.2	1.1
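
One quick way to read Table 2 is to divide each background estimate by its UCC value: a ratio above 1 means the element sits above the crustal reference, and a ratio below 1 means it sits below. The sketch below does this for a handful of elements using the "without outliers" means copied from the table. The dictionary name `TABLE2` and the helper `ucc_ratio` are hypothetical names introduced for illustration, and this plain ratio is only a reading aid; it is not claimed to match the definitions of the pollution indices used in the paper.

```python
# Illustrative only: plain ratio of background mean to the UCC reference.
# Values are copied from Table 2 as (mean without outliers, UCC).
# TABLE2 and ucc_ratio are hypothetical names, not from the paper.
TABLE2 = {
    "Cu": (62.3, 28.0),
    "Ni": (75.5, 47.0),
    "Pb": (14.1, 17.0),
    "Ti": (9463.1, 3835.79),
    "Rb": (33.7, 84.0),
}


def ucc_ratio(mean: float, ucc: float) -> float:
    """Return the background mean divided by the UCC reference value."""
    if ucc <= 0:
        raise ValueError("UCC reference must be positive")
    return mean / ucc


# Compute the ratio for every element, rounded to two decimals.
ratios = {el: round(ucc_ratio(m, u), 2) for el, (m, u) in TABLE2.items()}

# List elements from most above to most below the crustal reference.
for element, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    side = "above" if ratio > 1 else "at or below"
    print(f"{element}: {ratio} ({side} UCC)")
```

A ratio above 1 is not by itself evidence of contamination, since it can also reflect the natural composition of the parent material.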

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
