1. Introduction
Natural or white hydrogen has gained increasing interest as a clean and carbon-free energy source. Its generation can occur through various mechanisms depending on the geological setting, but in some natural emanations, subcircular surface patterns, known as fairy circles (FC), have been identified [1,2,3,4,5,6,7,8,9,10,11,12,13,14]. These FCs documented in the literature exhibit a range of sizes, often large enough to be detected using satellite imagery. Their identification has incorporated local geological information, terrain topography, and the computation of spectral indices derived from satellite data [14,15,16,17,18,19,20,21].
In studies focused on FC-related structures, satellite imagery constitutes the most commonly used input, from which a variety of spectral indices are computed. Previous works relying primarily on optical satellite data have considered indices such as the Normalised Difference Vegetation Index (NDVI), Atmospherically Resistant Vegetation Index (ARVI), Green Normalised Difference Vegetation Index (GNDVI), Near-Infrared and Shortwave Infrared Burn Ratio Index (NBRI), Green Leaf Index (GLI), Enhanced Vegetation Index (EVI), and the Normalised Difference Water Index (NDWI), as used in the works of [15,18]. In addition to these indices, other studies have incorporated Digital Elevation Models (DEMs) and topographic inputs [15,18,20,21], geophysical information [17], as well as statistical and morphological analyses of the subcircular structures [19,20]. All of these works emphasise the importance of understanding the geological context, including the identification of source rocks, potential reservoir rocks, and faulting, in a manner analogous to petroleum exploration. Among the key morphometric characteristics, the slope of the structures is notable, as it tends to be lower than that of depressions associated with karst features [20]. Technologies such as Light Detection and Ranging (LiDAR) are highly useful for identifying areas with potential FCs, as they enable improved geomorphological characterisation relative to DEMs [21]. Field measurements have shown that hydrogen concentration varies within FC structures, with localised zones of higher concentration inside the surface area of each feature [7,18,21].
Despite their widespread use, spectral indices are inherently dependent on local environmental conditions [22]. Consequently, although spectral indices have been applied in FC exploration, their individual contribution to FC identification has rarely been isolated or systematically assessed. Vegetation stress or anomalous spectral responses may arise from multiple factors unrelated to hydrogen emissions, such as soil moisture conditions or precipitation variability, complicating the interpretation of index-based signals.
In recent years, with the aim of automating analytical workflows, machine learning (ML) tools have been increasingly integrated with satellite imagery for various classification tasks. These include land-cover classification using random forests [23], pixel-based convolutional neural networks (CNNs) for the extraction of water bodies [24], and comparative evaluations of deep learning models against traditional approaches such as random forests and decision trees for soil classification [25]. Regarding the prediction of fairy circles, Nigar et al. [26] implemented a U-Net model for FC identification, using WorldView-2 image data. Across these studies, in addition to the spectral bands of the respective satellite platforms, researchers have incorporated spectral indices, elevation information, terrain slope, image texture metrics, and other ancillary variables. Satellite imagery provides continuous temporal coverage and allows for the analysis of large territorial extents, making it one of the primary inputs used in exploration workflows [27,28,29,30].
McMahon et al. [7] compiled global locations where natural hydrogen emissions have been identified but are not necessarily associated with FCs. Among regions where FC-like surface expressions have been reported, the Carolina Bays on the eastern United States coast stand out for the abundance and clear manifestation of subcircular structures, motivating their selection as the study area. Although previous research has demonstrated the utility of satellite imagery for FC-related studies, a systematic evaluation of how individual spectral bands and derived indices contribute to FC detection remains limited, particularly when using multispectral data alone. This lack of systematic assessment hinders a clear understanding of which spectral inputs are most informative and how they influence model performance in exploratory workflows for natural hydrogen.
To address this gap, this study proposes a progressive machine learning framework to evaluate the contribution of Landsat-8 spectral bands and normalised indices for FC detection in the Carolina Bays. The approach begins with traditional pixel-based classifiers, which enable variable-importance analysis, and subsequently uses the most informative inputs to guide experiments with deep learning models. Leveraging recent advances in ML and computational tools, we assess and compare the performance of logistic regression, random forest, multilayer perceptron, CNN, and U-Net architectures for identifying FC-associated areas of interest using multispectral satellite imagery as the sole input. The purpose of this work is to present a reproducible and extensible workflow that can be complemented with non-spectral variables in future studies in order to support future natural hydrogen exploration strategies through the prioritisation of predictors according to local conditions and ML model responses.
2. Study Area
Carolina Bays correspond to a set of shallow depressions 1 to 3 m deep, with elliptical to ovoid shapes, an average width-to-length ratio of approximately 0.58, elevated sandy rims, and muddy or organic infill, distributed along the Atlantic Coastal Plain from New Jersey to Georgia, United States (Figure 1). Their origin has been attributed to potential meteorite impacts as well as to wind- and wave-driven processes within the Earth system [31,32,33]. The bays are characterised by their high surface density, preferential NW-SE orientation, and smooth geometry, making them one of the most representative examples worldwide of subcircular patterns observable in remote sensing data [12,21,34]. The elliptical depressions exhibit a similarly well-defined geometric precision, suggesting that they may have formed contemporaneously during a common formative event. Comparable terrestrial landforms with the same characteristics are not widely reported in other regions of the world [35].
Zgonnik et al. [21] measured natural hydrogen concentrations along profiles that crossed each structure, starting outside their margins at a distance equivalent to half the diameter of each feature. Along these profiles, they identified the areas where hydrogen concentrations began to increase. Due to the nature of these natural depressions, commonly filled with water and organic deposits (peat), it has been suggested that the hydrogen source could be related to biological activity [36]. However, Zgonnik et al. [21] found that the highest hydrogen concentrations were located along the rims of the structures and at depths of up to 5 m, reducing the likelihood of a purely biogenic origin.
To analyse both areas with and without subcircular structures, the Landsat-8 scene was subdivided into four Areas of Interest (AOIs). AOIs 01 and 02 are located in regions with a high density of subcircular depressions and therefore represent positive scenarios with a clear surface expression of FCs. In contrast, AOIs 03 and 04 are situated toward the northwest portion of the Landsat-8 scene, in areas where no subcircular structures have been mapped. These two AOIs act as negative or control areas, enabling an evaluation of the model’s ability to minimise false detections in environments lacking any surface evidence of FCs.
3. Results
3.1. Exploratory Data Analysis
The training data for AOI 01 and AOI 03 revealed the presence of outliers in the box-and-whisker plots for both classes (Appendix A). In the histograms, it was also observed that pixels associated with FCs did not exhibit a strong spectral distinction relative to non-associated pixels (Appendix B). In both types of graphical representations, the two classes shared similar value ranges across the variables, with the primary difference being the imbalance in sample counts; Class 0 (non-FC) contained substantially more samples than Class 1 (FC). The total number of pixels considered for training the traditional models was approximately 103.3 million, of which about 91.7 million correspond to Class 0 (non-FC) and 11.6 million to Class 1 (FC). This represents a class imbalance in which Class 0 contains approximately 7.9 times more samples than Class 1.
An examination of the descriptive statistics (Table 1) further illustrates the strong overlap between classes. For example, Band 3 shows a mean value of 0.13 for Class 0 and 0.10 for Class 1; for Band 4, mean values are 0.12 and 0.09 (Class 0/Class 1); and for Band 6, 0.24 and 0.21 (Class 0/Class 1). For the NUI B5-B4, which is analogous to the traditional NDVI as it uses the same spectral bands but rescales values to the 0–1 range, the mean values are 0.82 for Class 0 and 0.84 for Class 1. Similarly, the NUI B6-B4 presents mean values of 0.75 and 0.77 for Class 0 and Class 1, respectively. Overall, the statistical measures exhibit comparable behaviour, reflecting similar value distributions across both classes for the 28 available variables derived from the seven Landsat 8 bands. This overlap in feature distributions helps explain the limited separability observed between FC and non-FC classes in the traditional ML models.
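The exact NUI formula is not restated here, but given that it uses the same bands as NDVI and rescales values to the 0–1 range, a plausible NumPy sketch is the normalised difference mapped linearly from [−1, 1] to [0, 1]; the epsilon guard against division by zero is an assumption of this sketch.

```python
import numpy as np

def nui(band_a: np.ndarray, band_b: np.ndarray) -> np.ndarray:
    """Normalised difference of two bands, rescaled from [-1, 1] to [0, 1]."""
    nd = (band_a - band_b) / (band_a + band_b + 1e-12)  # epsilon: assumed guard
    return (nd + 1.0) / 2.0

# Illustrative reflectance values for Bands 5 (NIR) and 4 (red)
b5 = np.array([0.30, 0.40])
b4 = np.array([0.05, 0.08])
print(nui(b5, b4))  # values remain within [0, 1]
```

Under this assumed formulation, NUI B5-B4 values around 0.82–0.84, as reported in Table 1, correspond to conventional NDVI values around 0.64–0.68.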
In the analysis of Pearson’s linear correlation (Appendix C) between variables (spectral bands and NUI), strong relationships were observed among some variables that share a common band used in the NUI formulation. For example, Bands 3 and 4 exhibit a correlation coefficient of 1.00, while Bands 6 and 4 show a coefficient of 0.84. The NUI B5-B4 displays a correlation of −0.82 with Band 4 and −0.39 with Band 5. Relying solely on the quantitative Pearson coefficient can be misleading; therefore, it should be complemented with scatterplot representations to verify the presence of linear relationships between variables. As in the histograms, in the scatterplots (Appendix D), Class 1 (associated with FCs) falls within the distribution of Class 0 (absence of FCs), indicating that both classes share similar characteristics within certain ranges of the variables.
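A minimal example of why the coefficient alone can mislead: a perfect quadratic dependence between two variables yields a Pearson coefficient of zero, a relationship that only a scatterplot would reveal.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2                    # perfect, but non-linear, dependence
r = np.corrcoef(x, y)[0, 1]   # Pearson captures only the linear component
print(r)                      # 0.0
```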
Since no clear separation between classes was achieved using the original variables, a PCA (Appendix E) was performed to reduce the number of variables or dimensions. The PCA results yield new components, and it was found that 90% of the variance of the original data can be explained using the first four principal components. The first principal component alone accounts for about 68% of the variance. An examination of the absolute values of the variable loadings for this component indicates that the six most influential variables are Bands 1, 2, 3, and 4, together with the NUIs B6-B1 and B7-B1. In contrast, the four variables contributing least to this component are the NUIs B2-B1, B4-B3, B6-B5, and B7-B6. The NUI B5-B4 ranks eleventh in terms of contribution to the first component, followed by B6-B4 in twelfth position.
However, when these new components were plotted in scatterplots and coloured by class, it was still not possible to distinguish clear boundaries between them (Appendix F); what was achieved instead was a partial removal of the linear dependence among the original variables. Although PCA effectively reduced the number of variables for training, these new components were not used because an additional objective of this study was to facilitate the interpretation of how the original input variables influence model performance during training. PCA generates new components that combine information from multiple original variables [43]. In remote sensing, it is commonly used to reduce the dimensionality of multispectral and hyperspectral imagery, condensing the most relevant information into a few components [44,45].
3.2. Training
A total of 75 traditional models were trained to evaluate the presence of FCs at the pixel level. These comprised 30 logistic regression models, 30 random forest models, and 15 MLP classifier models. All traditional models were implemented using the Scikit-Learn library [46]. For each algorithm, the official documentation was consulted, and a small set of hyperparameters was adjusted, while the remaining settings were kept at their default values. For LR, only the solver was modified (solver = ‘saga’). For the RF classifier, the following parameters were used: n_estimators = 30, max_depth = 10, min_samples_split = 0.05, min_samples_leaf = 0.01, bootstrap = True, and max_samples = 0.7. Of the 15 MLP classifier models, 10 used a hidden-layer architecture of {28, 14, 7, 3}, while the remaining 5 employed {112, 56, 28, 14, 7, 3}. For both architectures, the remaining parameters were consistent: activation = ‘relu’, solver = ‘adam’, max_iter = 100, learning_rate = ‘adaptive’, learning_rate_init = 0.001, tol = 0.001, early_stopping = True, shuffle = True, n_iter_no_change = 10, and validation_fraction = 0.2.
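For reproducibility, the reported settings map directly onto Scikit-Learn constructors. The sketch below instantiates the three traditional models with the hyperparameters listed above, leaving all other arguments at their defaults (the simpler MLP architecture is shown).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Only the hyperparameters reported above are changed; everything else
# remains at the library defaults.
lr = LogisticRegression(solver="saga")
rf = RandomForestClassifier(n_estimators=30, max_depth=10,
                            min_samples_split=0.05, min_samples_leaf=0.01,
                            bootstrap=True, max_samples=0.7)
mlp = MLPClassifier(hidden_layer_sizes=(28, 14, 7, 3), activation="relu",
                    solver="adam", max_iter=100, learning_rate="adaptive",
                    learning_rate_init=0.001, tol=0.001, early_stopping=True,
                    shuffle=True, n_iter_no_change=10, validation_fraction=0.2)
```

The fractional values of min_samples_split and min_samples_leaf are interpreted by Scikit-Learn as proportions of the training samples, and max_samples = 0.7 draws 70% of the samples for each bootstrap tree.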
For the deep learning experiments, the CNN architecture consisted of a convolutional feature extractor followed by a fully connected classifier. The network receives a seven-channel input tile and applies successive 3 × 3 convolutions with ReLU activations, using 32, 64, and 128 feature maps. Two 2 × 2 max-pooling operations are employed to reduce spatial dimensionality. The resulting feature maps are then flattened and passed through a dense layer of 128 neurons with ReLU activation, followed by an output layer that produces class scores for the two target categories (FC vs. non-FC).
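Tracing tensor shapes through this architecture clarifies the dimensionality reduction. The sketch below assumes 'same' padding for the 3 × 3 convolutions and a conv–pool–conv–pool–conv ordering; neither detail is stated explicitly above, so the trace is an illustration rather than the exact network.

```python
def conv_same(c_in, h, w, c_out):
    # 3x3 convolution with assumed 'same' padding: spatial size preserved
    return c_out, h, w

def pool(c, h, w):
    # 2x2 max-pooling halves each spatial dimension
    return c, h // 2, w // 2

shape = (7, 64, 64)              # seven-channel input tile
shape = conv_same(*shape, 32)    # 32 feature maps
shape = pool(*shape)
shape = conv_same(*shape, 64)    # 64 feature maps
shape = pool(*shape)
shape = conv_same(*shape, 128)   # 128 feature maps

flat = shape[0] * shape[1] * shape[2]
print(shape, flat)               # (128, 16, 16) 32768 -> dense(128) -> 2 scores
```

Under these assumptions, the flattened feature vector feeding the 128-neuron dense layer has 32,768 elements.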
For the 64 × 64-pixel U-Net, the model follows an encoder–decoder scheme based on repeated convolutional blocks. Each encoder block comprises two 3 × 3 convolutions with ReLU activation, followed by a 2 × 2 max-pooling operation. The number of channels increases progressively through the network (from the input depth of 7 or 12, then projected to 16, 32, and 64). The decoder reconstructs the output by upsampling and concatenating, at each level, the corresponding encoder feature maps via skip connections. The output layer is a 1 × 1 convolution that produces a binary segmentation mask (FC vs. non-FC). The second U-Net configuration, using 256 × 256-pixel tiles, takes a seven-channel input tensor but employs a deeper and wider encoder. It comprises four hierarchical levels with progressively increasing channel depths (64, 128, 256, and 512), followed by a bottleneck block with 1024 channels. At each level, two 3 × 3 convolutions with ReLU activation are applied, followed by 2 × 2 max-pooling. The decoder mirrors this process using 2 × 2 transposed convolutions for upsampling and concatenates encoder feature maps via skip connections. The final layer is a 1 × 1 convolution that outputs the full-resolution binary segmentation mask.
The training time per cycle (including dataset creation and model training) for the LR models ranged from 5.4 min (323 s) to 6.4 min (383 s), with an average of 5.9 min (351 s) per model. For the RF models, training cycles ranged from 7.8 min (467 s) to 8.5 min (508 s), with an average of 8.0 min (482 s) per model. The simpler MLP architecture required between 22.9 min (1376 s) and 113.5 min (6807 s) per cycle, with an average of 60.6 min (3633 s). The more complex MLP architecture required between 52.2 min (3133 s) and 105.0 min (6298 s) per cycle, with an average of 84.5 min (5071 s).
As increasingly complex models were implemented, training times increased and the performance metrics improved; this is evident in the fact that the MLP models required more time to train but achieved better performance on both the training and test sets. Considering that training times grew with ML model complexity, and given the preliminary runtimes observed during the initial deep learning experiments, only a limited number of CNN and U-Net models were ultimately trained for the analysis (Appendix G), as will be presented later.
The Scikit-Learn library used for traditional models provides a score value, which corresponds to the accuracy metric. Accuracy ranges from 0.0 to 1.0, where 1.0 indicates perfect classification. The LR models achieved a mean training accuracy of 0.65 (the 25th, 50th, and 75th percentiles were also 0.65) and a mean test accuracy of 0.59 (with the 25th, 50th, and 75th percentiles likewise equal to 0.59). The RF models achieved a mean training accuracy of 0.66 (25th, 50th, and 75th percentiles all 0.66) and a mean test accuracy of 0.61, with test set percentiles of 0.60, 0.61, and 0.61, respectively. The MLP classifier models with the simpler architecture produced a mean training accuracy of 0.66 (25th, 50th, and 75th percentiles of 0.66, 0.67, and 0.67, respectively) and a mean test accuracy of 0.57 (25th, 50th, and 75th percentiles of 0.56, 0.57, and 0.57, respectively).
For the CNN models, the training curves indicate loss values below approximately 0.2 from around epoch 15 onwards for both the training and validation datasets, except for one model that reached this threshold at approximately epoch 30. Accuracy values exceeded 0.90 in all cases, with no systematic divergence between the training and validation curves. This behaviour suggests that neither overfitting nor underfitting occurred during training. Although greater variability is observed in the validation curves, particularly during the early epochs, this is expected, as these data were not seen by the models during training.
The U-Net models generally exhibit similar behaviour in their loss curves, achieving values below 0.2 for both the training and validation datasets, albeit with noticeable variability in the validation loss. For models trained over a larger number of epochs, the loss curves progressively decreased to values below 0.10, in some cases approaching 0.05. However, these longer training runs also display occasional sharp increases or spikes in the loss values, suggesting that further fine-tuning of the training hyperparameters would be required to stabilise convergence.
Nevertheless, considering the objectives of this study (to evaluate the performance of different classification and segmentation models using satellite imagery as an initial exploratory input, rather than to identify and optimise a single “best” model), it is evident that deep learning approaches outperform traditional ML models in terms of predictive performance. This improvement, however, comes at the cost of substantially longer training times and the need for more complex model architectures.
3.3. Feature Importance: Logistic Regression and Random Forest Classification
The logistic regression and random forest classification algorithms allow an approximation of variable importance during model training (Appendix H). In the case of logistic regression, this is expressed through the weights or coefficients assigned to each variable, while for random forest classification, it is given by the feature importance values. Because LR coefficients can be either positive or negative, we used their absolute values to make feature contributions comparable across variables. The most influential predictor was Band 3, with a mean absolute coefficient of 57.55 (25th, 50th, and 75th percentiles: 57.49, 57.58, 57.63). This was followed by the NUI B5-B2 (mean absolute coef. = 35.48; percentiles 35.43, 35.49, 35.53) and B7-B4 (mean absolute coef. = 35.18; percentiles 35.14, 35.19, 35.23). For RF, the mean feature importance values do not show a marked contrast among predictors; however, their distributions are noticeably more dispersed than in LR, indicating stronger variability across model realisations. The three highest-ranked predictors were all NUIs: B4-B2 (mean importance = 0.12; percentiles 0.10, 0.12, 0.17), B6-B4 (mean = 0.10; percentiles 0.08, 0.10, 0.12), and B4-B3 (mean = 0.09; percentiles 0.08, 0.09, 0.11). When jointly ranking variable importance from both algorithms (Table 2), ordered from highest to lowest importance, it is observed that, within the top half of the training variables, Landsat 8 Band 3 and Band 6 are consistently among the most relevant.
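Both importance measures can be extracted with a few lines of Scikit-Learn. The toy example below, in which only the third feature is informative, illustrates the absolute-coefficient and feature-importance mechanics rather than the study's models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = (X[:, 2] > 0).astype(int)                # only feature 2 is informative

lr = LogisticRegression(solver="saga", max_iter=2000).fit(X, y)
lr_rank = np.abs(lr.coef_).ravel()           # absolute coefficients
rf = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
rf_rank = rf.feature_importances_            # impurity-based importances

print(lr_rank.argmax(), rf_rank.argmax())    # both point to feature 2
```

Note that impurity-based RF importances and LR coefficient magnitudes measure different things, which is one reason the two rankings in Table 2 only partially agree.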
Band 3 corresponds to the green portion of the spectrum, and Band 6 to the first shortwave infrared band (SWIR1), which may serve as a criterion for future selection of spectral indices. For example, NDVI is computed using Bands 5 and 4 of Landsat 8, and while NDVI is one of the most commonly used indices for FC identification in the literature, the results here suggest that it may not be the most relevant for ML model performance. The NUI B5-B4, which is analogous to NDVI, ranks 19th out of 28 variables in the LR analysis, with a mean absolute coefficient of 13.46 (25th, 50th, and 75th percentiles: 13.40, 13.46, 13.51). Similarly, in the RF feature importance ranking, B5-B4 occupies the 21st position out of 28, with a mean importance value of 0.01 (25th, 50th, and 75th percentiles: 0.01, 0.02, 0.02). These results indicate that, despite its widespread use, NDVI-like indices contribute relatively little to class discrimination in this specific ML framework. Bands 3 and 6 also contribute to several of the NUI combinations calculated as complementary training variables. The five NUIs that appear as important in both the logistic regression and random forest models correspond to the relationships between bands B6-B4, B4-B3, B6-B5, B4-B1, and B7-B3.
Based on the joint ranking of the 28 training variables, we selected the variables that were common to both algorithms and located within the upper half of the ranked list for training the deep learning architectures (CNNs and U-Net). In total, seven shared variables were identified: Bands B3 and B6, and the five NUI combinations B6-B4, B4-B3, B6-B5, B4-B1, and B7-B3.
3.4. Predictions from Traditional Models
The precision, recall, F1, and AUC metrics for the test data were computed using the pixels associated with AOI 02 and AOI 04. Precision is defined as the proportion of true positives relative to the sum of true positives and false positives. Recall (or sensitivity) is defined as the proportion of true positives relative to the sum of true positives and false negatives, while the F1 score represents the balance between precision and recall. The AUC corresponds to the area under the Receiver Operating Characteristic (ROC) curve, which is constructed by plotting the true positive rate against the false positive rate.
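These definitions can be checked on a toy confusion matrix (TP = 2, FP = 1, FN = 1, TN = 2) using Scikit-Learn's metric functions; the labels and scores below are illustrative only.

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true  = [1, 1, 1, 0, 0, 0]
y_pred  = [1, 1, 0, 1, 0, 0]            # TP=2, FP=1, FN=1, TN=2
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
rec  = recall_score(y_true, y_pred)     # TP / (TP + FN) = 2/3
f1   = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
auc  = roc_auc_score(y_true, y_score)   # 8 of 9 positive/negative pairs ranked correctly
print(prec, rec, f1, auc)
```

The AUC equals the fraction of positive–negative pairs in which the positive sample receives the higher score, here 8/9 ≈ 0.89.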
Across the test set, performance was broadly similar among the traditional ML approaches. Precision remained low for all methods, with mean values of 0.12 for LR (25th/50th/75th percentiles = 0.12/0.12/0.12), 0.13 for RF (0.13/0.13/0.13), 0.12 for the simpler MLP (0.12/0.12/0.12), and 0.12 for the more complex MLP (0.11/0.12/0.12). Recall showed variation among models, with LR achieving the highest mean recall of 0.69 (0.68/0.69/0.69), followed by the more complex MLP with 0.65 (0.60/0.67/0.68), RF with 0.61 (0.61/0.61/0.62), and the simpler MLP with 0.60 (0.56/0.60/0.62). F1 scores were consistently low and nearly identical across methods, with mean values of 0.21 for LR (0.21/0.21/0.21) and RF (0.21/0.21/0.21), 0.20 for the simpler MLP (0.20/0.20/0.21), and 0.20 for the more complex MLP (0.19/0.20/0.20). AUC values were also comparable, with LR slightly higher at a mean of 0.65 (0.65/0.65/0.65), followed by RF at 0.64 (0.64/0.64/0.65), the simpler MLP at 0.64 (0.63/0.64/0.65), and the more complex MLP at 0.63 (0.63/0.63/0.63).
The predictions generated using the AOI 02 image set, an area where FCs are present, do not show a clear delineation of the FCs, but they do succeed in highlighting zones of potential interest that correspond to the target layer. This limited ability to outline their shape is because, in this first stage, the models make predictions based solely on pixel-level probabilities. The mean values and standard deviations (with standard deviations close to zero) obtained from the 30 logistic regression models (Figure 2) indicate highly consistent behaviour among all models. The coefficients of these 30 models for each training variable show negligible variation, reflecting strong similarity among the logistic regression models, despite efforts to introduce randomness during the construction of the training subsets.
The predictions generated using the 30 random forest models do show noticeable differences among them, reflected in higher standard deviation values, reaching up to 5% deviation in their predictions. The spatial distribution of the mean probability is similar to that observed in the logistic regression models: although the FC boundaries cannot be clearly delineated, areas of potential interest are consistently highlighted (Figure 3).
For the 15 MLP models, differences in predictions are also observed, similar to the random forest results, with standard deviation values reaching up to 20%, although values around 5% are more typical. It should be noted that, due to the relatively small number of MLP models, outliers can have a stronger effect on the reliability of the mean and standard deviation estimates. For both hidden-layer architectures, the shallower MLPs (Figure 4) and those with a greater number of hidden layers (Figure 5), the predicted probabilities tend to be higher, indicating greater confidence in identifying pixels that may belong to an FC.
When integrating the predictions from all 75 traditional ML models, the standard deviations generally remain below 5%, although some pixels reach values up to 20%. The mean probability maps show a diffuse distribution, limiting the ability to clearly delineate FCs but still highlighting potential areas of interest (Figure 6). When predictions are generated for tiled subsets of AOI 04 (Figure 7), which contains no FCs, the models occasionally highlight a few pixels as potential FC candidates. However, in all cases, these predictions appear more as noise than as meaningful signals, and they do not highlight coherent zones of interest as observed in the AOI 02 predictions.
3.5. CNN Models
A total of four models were trained using 64 × 64-pixel tiles and the 7 Landsat 8 bands. An initial resolution of 32 × 32 pixels had been considered, but the models were unable to generate accurate predictions, leading to an increase in tile size. This difficulty is likely related to the satellite’s spatial resolution and the initial tile size, which may have been too small to capture sufficient information to determine whether an FC structure was present within the tile. Only a limited number of models were trained because CNNs predict only the probability that the input tile does or does not contain an FC, without providing a visual representation of its spatial location (Figure 8). For this reason, only the influence of tile size was considered for future training stages. Nevertheless, the results represent an improvement compared with the predictions produced by the traditional pixel-based models used in the first stage.
3.6. U-Net Models
Deep learning models can be applied to both image classification (CNN models) and image segmentation (U-Net models). Classification assigns a single label to the entire input image, without explicitly accounting for the spatial extent of the labelled object within that image. By contrast, segmentation estimates the object’s extent by performing pixel-wise classification, thereby enabling the delineation of the object and its internal structure within the image [47].
Given that the CNN results showed improved predictions, largely because CNNs process entire images rather than individual pixels and are designed to extract spatial features for classification rather than segmentation, a U-Net architecture was adopted for the next stage. Since U-Net models can generate pixel-level segmentation masks of subcircular structures associated with FCs, a total of eleven models were trained. Five of these models used 64 × 64-pixel input images, while the remaining six used 256 × 256-pixel images. The number of training variables varied among models. For the 64 × 64-pixel models, some were trained using only the seven Landsat 8 bands. Others used seven variables corresponding to those identified as important in the first stage (Bands B3 and B6, and the five NUI combinations B6-B4, B4-B3, B6-B5, B4-B1, and B7-B3). A third group was trained with twelve variables: the seven Landsat bands plus the five NUI variables identified in the first stage.
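Tile preparation of this kind can be sketched as a simple windowing over the band stack. The function below is an illustrative assumption (non-overlapping tiles, partial tiles at the edges dropped), not the study's preprocessing code.

```python
import numpy as np

def tile_stack(stack: np.ndarray, tile: int):
    """Split a (bands, H, W) array into non-overlapping (bands, tile, tile)
    windows, dropping incomplete tiles at the right/bottom edges."""
    _, h, w = stack.shape
    return [stack[:, r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

scene = np.zeros((7, 256, 256))       # e.g. a 7-band Landsat 8 subset
tiles = tile_stack(scene, 64)
print(len(tiles), tiles[0].shape)     # 16 tiles of shape (7, 64, 64)
```

The same function yields 256 × 256-pixel tiles directly when `tile=256`, matching the second U-Net configuration.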
Predictions for AOI 02 (Figure 9), where FCs are present, and AOI 04 (Figure 10), where no FCs occur, show clear improvements, with more continuous areas being delineated and sharper boundaries around potential structures. In AOI 02, the predictions resemble subcircular FC-like patterns, especially when using the seven variables that include the NUI combinations. In AOI 04, the models produce scattered areas of potential interest, but without subcircular patterns, consistent with the expectation that AOI 04 contains no FC structures.
These results, obtained using 64 × 64-pixel input images, motivated the training of models with a resolution of 256 × 256 pixels, grouped into two configurations. The first 256 × 256-pixel group used only the seven Landsat 8 bands, while the second group used seven variables corresponding to Bands B3 and B6, together with the five NUI combinations B6-B4, B4-B3, B6-B5, B4-B1, and B7-B3. For both groups, the predictions already exhibit structures resembling FCs; however, in AOI 02, where FCs are present, the models using Bands B3 and B6 plus the five NUIs produce more conservative predictions, with a spatial distribution more similar to the target layer (Figure 11). In AOI 04, where FCs are absent, the models that incorporate the NUIs also yield more conservative outputs, with fewer areas predicted as FCs (Figure 12). Nonetheless, in both model groups, FCs are still predicted in areas where they do not occur (AOI 04), likely owing to the presence of shadows and clouds in the imagery.
4. Discussion
Previous studies [26,48,49,50] have commonly applied ML approaches using predefined spectral inputs. By contrast, our study explicitly evaluates the contribution and interpretability of individual spectral bands and indices, derived from traditional ML analyses, to inform the subsequent design of deep learning models, with segmentation-based architectures providing the most informative spatial context. Recent segmentation-oriented ML architectures increasingly prioritise structural coherence, which is crucial for delineating geomorphological features, as demonstrated in applications such as alluvial mapping [51], rock glacier monitoring [52], and flood mapping [53].
Within this context, the progressive methodology adopted in this study provides a structured way to link variable interpretability and model complexity. By first analysing pixel-based classifiers, it becomes possible to assess the relative contribution of individual spectral bands and indices under a controlled and interpretable framework, and then to transfer this knowledge to deep learning architectures that better preserve spatial relationships.
The exploratory data analysis stratified by class revealed that both FC and non-FC pixels occupy largely overlapping ranges in the distributions of the training variables, which substantially hinders class separability. Traditional statistical methods may struggle as the number of variables or data dimensionality increases, leading to mathematical challenges, since not all measured variables necessarily contribute to understanding the underlying phenomena of interest [54]. Although PCA reduced linear dependence among some variables, the overlap between classes persisted. While no models were trained using dimensionality reduction, it is recommended that future studies consider this approach, as applying PCA reduced the original 28 variables to 5 components explaining at least 90% of the data variance, which could help decrease computational time during model training.
At the pixel level, the class overlap observed in the descriptive statistics persists even after applying PCA. For example, the NUI B5-B4 (analogous to NDVI) shows only a 0.05 difference in the first quartile, with higher values for Class 1, while the third quartile is identical for both classes (
Table 1). Similarly, the NUI B6-B4, identified as relevant in the variable-importance analysis, exhibits only a 0.04 difference in the first quartile (higher for Class 1) and a 0.01 difference in the third quartile (higher for Class 0). Overall, these small shifts imply that for roughly half of the samples the two classes are nearly indistinguishable based on these predictors and have very similar distributional summaries. Although PCA was used for dimensionality reduction, neither B5-B4 nor B6-B4 contributed strongly to the leading components; B5-B4 and B6-B4 rank eleventh and twelfth, respectively, in their contribution to the first principal component. This persistent overlap helps explain why linear projections such as PCA do not produce clear class separation and underscores the intrinsic difficulty of discriminating FC from non-FC pixels using spectral information alone.
Class imbalance is a well-known issue in machine learning, often causing models to primarily learn the majority class. This imbalance leads to biased decision thresholds in classification algorithms, resulting in poorly defined decision boundaries and misleading performance metrics. For this reason, balancing strategies such as downsampling the majority class, upsampling the minority class, or using class weights are recommended [
55,
56,
57] so that models pay equal attention to all classes, including, in this case, the minority class of subcircular structures associated with potential natural hydrogen sources. In this study, class imbalance was addressed through downsampling and the training of multiple LR, RF, and MLP models, which yielded favourable results when evaluating the study areas at the pixel level. This approach also allowed the computation of additional statistics, such as the standard deviation and mean of the predictions across the 75 models.
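A minimal sketch of this balancing-and-repetition scheme, with `fit_fn` and `predict_fn` standing in for any of the LR, RF, or MLP learners (hypothetical names, not the study's actual code):

```python
import numpy as np

def balanced_downsample(X, y, rng):
    """Randomly downsample the majority class to the minority-class size."""
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    minority, majority = (idx1, idx0) if len(idx1) < len(idx0) else (idx0, idx1)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

def ensemble_predict(fit_fn, predict_fn, X, y, X_new, n_models=75, seed=0):
    """Train `n_models` classifiers on independent balanced subsets and
    return the per-pixel mean and standard deviation of their predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        Xb, yb = balanced_downsample(X, y, rng)
        model = fit_fn(Xb, yb)
        preds.append(predict_fn(model, X_new))
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

The mean map summarises the ensemble prediction, while the standard-deviation map highlights pixels on which the repeated models disagree.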
For LR and RF, it is possible to derive information on variable importance. In the case of LR, the estimated coefficients showed little to no variation among models, reflecting the simplicity and linear nature of the algorithm, which limits its ability to capture complex relationships. In contrast, RF models exhibited variation in the ranking of feature importance while still revealing a consistent pattern in terms of which variables were more relevant than others. Although variable importance measures in RF are useful for feature selection, they may be affected by differences in measurement scales or in the number of categories [
58]. In this study, all input variables were normalised to a 0–1 range, and bootstrap sampling prior to training was used to mitigate these issues.
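A simple way to summarise the between-model variation in feature importance is to aggregate the per-model importance vectors, ranking features by their mean and reporting the standard deviation as a measure of stability; an illustrative sketch (the function name is hypothetical):

```python
import numpy as np

def rank_importances(importance_matrix, names):
    """Aggregate per-model feature importances (rows = models) and rank
    features by mean importance; std captures between-model variation."""
    imp = np.asarray(importance_matrix, dtype=float)
    mean, std = imp.mean(axis=0), imp.std(axis=0)
    order = np.argsort(mean)[::-1]  # descending by mean importance
    return [(names[i], float(mean[i]), float(std[i])) for i in order]
```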
Band 3 and Band 6 provided the largest information gain in the classical models, suggesting that spectral indices incorporating these bands may be particularly informative. At this initial stage, the objective of the pixel-based classifiers was to assign each pixel to the presence or absence of FC-like structures, and vegetation-based indices are commonly used in the literature for FC detection. However, the contribution of any given index is likely to be site-dependent: local factors such as climatic variability, soil properties, and their temporal dynamics can amplify or constrain its predictive value [
22]. Consistent with this observation, ANH and UPTC [
59] performed pixel-level mapping of targets for natural hydrogen prospecting using satellite imagery and included terrain slope as an additional predictor; in their random forest training, slope ranked among the most influential variables for distinguishing FC pixels.
In the Carolina Bays region, Lundine et al. [
34] applied deep learning models to morphometric analyses based on DEM data. They also evaluated pixel-based machine learning algorithms and reported limitations in consistently detecting subcircular structures. In particular, some classifiers were unable to clearly distinguish these structures from a stream segment present in the DEM. Moreover, several methods produced a salt-and-pepper classification pattern, in which pixels within the bays were incorrectly labelled as non-bay areas, a behaviour typical of pixel-based approaches. By contrast, deep learning models trained on LiDAR-derived elevation data (using elevation information only) yielded a more coherent identification of these landforms [
34]. These findings suggest that topography, and more broadly terrain-derived variables, can provide complementary information for representing the geometry of subcircular structures, potentially supporting their delineation and interpretation in studies aiming to identify potential FC.
Most of these traditional spectral indices are vegetation-related, and in many cases, they must be adjusted or recalibrated for a new study area [
60,
61,
62]. In this context, a decrease in vegetation response does not necessarily imply the presence of hydrogen emissions. Likewise, an increase or decrease in hydrogen emissions would not immediately translate into a direct response in vegetation; such effects may take some time to become apparent. Establishing a robust relationship between these processes therefore requires dedicated field investigations and multitemporal monitoring. Accordingly, this study considered a broad temporal window of satellite imagery (March 2013 to January 2025) to maximise the amount of available data and to allow the models to learn which spectral variables contribute most to the classification task. In addition, we trained 60 classical models (30 LR and 30 RF models) for which feature importance can be readily extracted, enabling us to quantify the variability in the estimated importance of the training variables.
When performance was compared across the two test areas (AOI 02 and AOI 04), the traditional ML models exhibited only modest differences in their mean metrics. Precision remained consistently low and broadly similar across methods. Recall showed greater variation: logistic regression achieved the highest average recall (0.69), followed by the more complex MLP models (0.65), whereas the random forest and simpler MLP architectures yielded slightly lower values. F1 scores remained uniformly low (mean 0.20–0.21), indicating a limited balance between precision and recall. AUC values were moderate and comparable across approaches (mean 0.63–0.65), suggesting that no traditional model clearly outperforms the others. Overall, these results indicate a high false-positive rate, driven largely by pronounced class imbalance. The limited performance is also consistent with the substantial class overlap observed in the exploratory analysis, which makes discrimination based on pixel-level spectral predictors particularly challenging for these models.
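The pattern of high recall combined with low precision yielding a low F1 follows directly from the metric definitions, as the following sketch illustrates for the positive (FC) class:

```python
import numpy as np

def prf1(y_true, y_pred):
    """Precision, recall and F1 for the positive (FC) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, a classifier that finds all 10 FC pixels in a 100-pixel tile but also flags 30 non-FC pixels attains perfect recall yet a precision of only 0.25 and an F1 of 0.4.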
However, qualitative visual inspection of the spatial predictions reveals a clear mismatch between the metric-based evaluation and the models’ practical behaviour. Although LR achieves marginally higher average scores, its spatial outputs, together with the mean probability and standard deviation maps, are comparable to those of the other traditional models and still fail to delineate FC-like subcircular structures. The predicted probabilities are diffuse and show little correspondence with the expected edges or geometry of these features. By contrast, the deep learning models produce markedly more coherent spatial representations. In particular, the U-Net improves boundary delineation of FC-like structures, even when the subcircular geometry is not recovered perfectly. CNN models, while often outperforming traditional approaches at the tile (image) level, remain largely limited to indicating the presence or absence of FC-like patterns and do not provide explicit localisation.
The probability of recognising FC-like structures is strongly influenced by tile size, that is, by the spatial context of the image, as a larger spatial extent allows the geometry of the object to be described in greater detail. In a complementary manner, the spatial resolution of the image or scene directly affects the ability to characterise its extent, smooth edges, and closed contours [
26,
34,
48,
63,
64,
65]. When analyses are performed at the pixel level or using small tiles, the available information is essentially local, which limits spatial context and can lead to misclassifications, as well as to diffuse probability maps and fragmented detections. By contrast, the use of larger tiles enables a more complete representation of FC structures, preserving boundary continuity and the contrast between the interior and exterior of the object. Consequently, for future studies or applications in other regions where FCs are present, it is essential to select tile sizes and spatial resolutions that preserve the key geometric characteristics of these structures.
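Tile extraction itself is straightforward; the following sketch (assuming non-overlapping tiles, with incomplete border tiles discarded) makes explicit how the tile size controls the spatial context each sample carries:

```python
import numpy as np

def tile_image(img, tile):
    """Split an (H, W, C) array into non-overlapping tile x tile patches,
    discarding any incomplete tiles along the borders."""
    n_rows = img.shape[0] // tile
    n_cols = img.shape[1] // tile
    tiles = [img[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]
             for i in range(n_rows) for j in range(n_cols)]
    return np.stack(tiles)  # (n_tiles, tile, tile, C)
```

With 30 m Landsat 8 pixels, a 64 × 64 tile covers roughly 1.9 km on a side, whereas a 256 × 256 tile covers roughly 7.7 km, enough to contain an entire FC together with its surroundings.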
The objective of this research was not to optimise either the traditional models or the deep learning models (CNNs and U-Nets), but rather to evaluate the influence of satellite image data on the identification of potential FCs. As expected, and consistent with previous studies [
26,
48,
65], the best results were obtained using the more complex models, specifically U-Net architectures with an input resolution of 256 × 256 pixels, integrating Bands B3 and B6 with the five NUIs (B6-B4, B4-B3, B6-B5, B4-B1, and B7-B3). The use of a higher input resolution led to improvements in the detection of potential FCs, yielding structures with a more clearly subcircular geometry. The size of the input image directly controls the amount of spatial context available to the model by expanding the neighbourhood around each pixel, thereby improving the delineation of extended structures and reducing ambiguity in areas with diffuse boundaries [
63,
64].
Despite these improvements, U-Net models still predicted FC-like structures in areas where no FCs are present. This behaviour may be partly explained by the fact that, although cloud and shadow masks were applied during preprocessing, they were intentionally not reapplied to the model outputs during prediction. This decision allowed us to assess model robustness to residual cloud and shadow contamination and to explore whether FCs could be detected without an additional postprocessing step, such as reapplying the cloud and shadow masks to the predicted outputs. Reintroducing these masks during postprocessing could help mitigate false positives. More broadly, these challenges, together with the artefacts caused by noise and by cloud- and shadow-related errors in satellite imagery, underscore the importance of appropriate masking strategies to minimise their impact on model performance [
66,
67,
68].
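Reapplying the masks at the postprocessing stage could be as simple as zeroing the predicted probabilities over contaminated pixels; a minimal sketch (the boolean mask convention, True for cloud or shadow, is an assumption):

```python
import numpy as np

def mask_predictions(prob_map, cloud_shadow_mask):
    """Zero out predicted FC probabilities under clouds and shadows.
    `cloud_shadow_mask` is True where a pixel is contaminated."""
    return np.where(cloud_shadow_mask, 0.0, prob_map)
```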
By contrast, the models trained in the initial stage, although unable to clearly delineate FC structures, were comparatively more stable in areas without FCs and avoided widespread false predictions. Consequently, the use of multiple complementary models, i.e., ensemble learning [
69,
70,
71,
72], should be considered for future investigations.
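One simple ensemble scheme consistent with this recommendation is a hard majority vote across the probability maps of complementary models; a minimal sketch:

```python
import numpy as np

def majority_vote(prob_maps, threshold=0.5):
    """Hard majority vote across probability maps from complementary models:
    a pixel is flagged as FC only when at least half of the models agree."""
    votes = (np.asarray(prob_maps) >= threshold).astype(int)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Such a vote would let the more conservative pixel-level models veto the widespread false positives produced by the segmentation models in FC-free areas.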
Hydrogen occurrence is not exclusively associated with FCs, as evidenced by the compilation presented in McMahon et al. [
7]. It is therefore important to incorporate complementary information into the models, such as morphometric and regional characteristics of the study area. These variables may differ in spatial resolution from satellite imagery or may sometimes be difficult to obtain. In this study, such additional potential input variables were not included; instead, satellite images were prioritised because they exhibit higher temporal variability and capture changes in vegetation and soil [
73,
74], whereas variables such as topography and geology do not show significant temporal variation.
Our methodology, iteratively training multiple models for variable selection, is consistent with practical workflows in supervised classification, where different predictor subsets are evaluated using a representative, class-balanced dataset to identify informative feature combinations [
75]. Dimensionality reduction techniques can also help extract salient information without substantially compromising classification performance. In this context, supervised dimensionality-reduction methods often achieve higher accuracy than unsupervised alternatives, although unsupervised feature extraction can still provide satisfactory results in some settings [
76].
Satellite imagery plays a central role in exploration, planning, and monitoring of renewable energy projects, including environmental impact assessment and verification of emission reductions [
30]. These satellite data also contribute to characterising the geological features of a region [
77], although such characteristics do not vary significantly over time. Nonetheless, there are still technical, social, and structural barriers that hinder their use and the understanding of study areas [
30]. Specifically in hydrogen exploration, previous studies [
20,
21,
59] have shown that incorporating terrain morphometric variables, such as slope, provides additional information on the shape of the structures and helps to contextualise and identify FCs [
26].
Potential future work includes exploring recent techniques that optimise these models for specific image-processing tasks. These include transfer learning, deep residual networks, attention mechanisms, transformers, generative and adversarial networks, and multimodal models. Such approaches could be applied, for example, to enhance image resolution in order to better capture relevant details in tasks such as semantic segmentation [
49,
50,
78,
79,
80].
5. Conclusions
The variable importance analysis highlighted the relevance of Landsat 8 Bands B3 (green) and B6 (shortwave infrared 1), together with several Normalised Unit Indices derived from them (B6-B4, B4-B3, B6-B5, B4-B1, and B7-B3). Incorporating indices based on these bands can enhance the sensitivity of the models for detecting subcircular structures associated with potential hydrogen emissions. In this regard, future research should incorporate morphometric variables of the subcircular structures, since terrain geometry and slope can provide key complementary information for differentiating patterns associated with hydrogen emanations linked to fairy circles.
The proposed feature selection strategy, based on a progression from traditional models to more complex architectures, demonstrates that input selection should not be limited to a few traditional indices commonly reported in the literature. Instead, it requires a more robust criterion tailored to the local context. By integrating additional variables, it is possible to perform progressive training, moving from simple to more complex models, thereby identifying the most representative variables for each specific case study.
Traditional ML models, including logistic regression, random forest, and MLP, provided stable initial estimates at the pixel level, whereas more complex architectures such as CNNs and U-Nets showed a greater ability to represent subcircular structures. Nevertheless, limitations were observed both in the precise delineation of these structures and in the occurrence of false detections in areas without fairy circles. These findings suggest that combining complementary models within ensemble learning schemes may be a promising strategy to improve the reliability of predictions. Finally, it is essential to complement these approaches with targeted field campaigns to acquire measurements and spectral signatures directly linked to hydrogen occurrence, thereby strengthening interpretation and supporting the verification of both the models and their input data.