2.1. Study Area
The focal point of our study area is the Vălioara Valley and its broader surroundings, where a geological mapping project was recently launched with the aim of gaining a better understanding of the Late Cretaceous sediments found in the area [44,45]. The bounding coordinates of the 70 km² study area in the UTM34N metric cartographic coordinate system are as follows: easting min. 635000, max. 645000; northing min. 5048000, max. 5055000. The maps presenting the results of the analysis also use the UTM34N projection, interpreted on the WGS84 datum surface.
The Vălioara Valley is located within the Hațeg Basin, one of the largest intramontane basins of the Southern Carpathians. The altitude in the center of the basin is 350 m. Owing to its basin morphology, its central regions are filled with Quaternary fluvial sediments, accompanied by proluvial and colluvial deposits. The basin is bordered by the Southern Carpathians to the south, the Rusca Montană mountain group (part of the Banat Mountains) to the west and northwest, and the Sebeș Mountains to the east. The altitude of the mountain ranges exceeds 2400 m in the south, while typical altitudes are around 1500 m in the east and around 1000 m in the west (Figure 1A). The vegetation cover is mixed, characterized by forests, pastures, plantations, and open agricultural areas (Figure 1B).
Figure 1. Location and physiography of the study area: (A) geography, (B) true-color satellite image of the study area, and (C) general geology redrawn after the 1:200,000 geological map of Romania and other sources [45,46,47].
The Hațeg Basin is a good example of a complex landscape shaped by the geological evolution of the terrain. The geological structure of the region (Figure 1C) is dominated by large-scale over-thrusting of crystalline basement rocks linked to the Late Cretaceous phases of the Alpine orogeny [48,49,50]. During the early Late Cretaceous, a marine sedimentary basin, the precursor to the Hațeg Basin, developed and accumulated primarily deep-marine siliciclastic turbiditic deposits, commonly referred to as “flysch” in the literature [46,50,51]. Overlying these marine deposits, the Hațeg Basin is filled with two main uppermost Cretaceous sedimentary units, the Densuș-Ciula and Sînpetru formations, both notable for their continental vertebrate fossils [44,47,52]. In the western part of the Hațeg Basin, where the study area is situated, the lower part of the Densuș-Ciula Formation comprises coarse conglomerates interbedded with minor sandstones and mudstones. This lower sequence also includes tuff and volcaniclastic material derived from contemporaneous volcanic activity in the Banatitic Province to the west [53]. The middle section of the Densuș-Ciula succession features fluvial and proximal alluvial fan deposits, while the upper section, formed during the early to late Maastrichtian, also reflects predominantly alluvial depositional environments [54,55].
The study area is centered on the Vălioara Valley and the Densuș-Ciula Formation. In the northern part of the area, the terrain is higher, marking the outcrops of crystalline rocks, while to the south, the Upper Cretaceous sediments form the ridges (Figure 2A). Quaternary formations are typically found on the valley floors and the gentler slopes. It is a mixed landscape, including several small valleys such as Boita, Vălioara, and Rachitova, and covers a 7 by 10 km (70 km²) area. The area includes elevated parts such as the eastern reaches of the Rusca Montană, with Curatului peak (Vârful Curatului) reaching a height of 939 m in the northern section, and Mount Fata (Dealul Fata), standing at 637 m, in the central to southern part. A local valley is also present in the southern region, beginning at the settlement of Densuș and extending in a west–east orientation. The southeastern portion of the study area constitutes the plain area of the Hațeg Basin.
A wide variety of vegetation types, determined by land use and natural distribution, is present in the study area. No single vegetation type dominates the region: meadows, farmland, and forests are roughly equally represented in the CORINE land-use statistics [56] (Figure 2C).
Figure 2. Detailed maps of the study area: (A) geology based on 1:50,000 scale geological maps [57,58]; (B) the NDVI map of the study area (SD: 2023.01.08), and (C) the CORINE land cover [56] (IDs are also listed in Table 1). The coordinate system and projection used is UTM34N.
The NDVI analysis revealed variations in the density and health of the vegetation, with darker green areas denoting strong vegetation, primarily seen in the vast forest sections, as also indicated by the CORINE land-use data (Figure 2B,C). The NDVI was created from a satellite image taken in the fall of 2023, so the forested areas reflect a lower density than during the growing season. Despite this, the forested areas can still be identified through visual evaluation.
Meadows, interspersed between forested regions, can also be delineated on the NDVI, occupying the middle of the NDVI value range and represented in white to light green colors. Although meadows do not grow tall, they can be dense, preventing light from reaching and reflecting from the soil. Grass also falls into this reflection category; it overlaps significantly with meadows but can still be slightly distinguished from them on the NDVI (Table 1; NDVI values). Farmland exhibits a wide range of NDVI values, because most of the fields are harvested in the autumn, while some crops are still present in certain areas. The farmlands, being partly prepared for winter, appear in vigorous red, representing one extreme of the vegetation range with high soil reflectance. The parts of the cultivated area that still carry crops fall into the mid-density range of vegetation. Residential areas exhibit higher reflectance values for two main reasons. First, human-built structures reflect more light because of the materials used in roofs and pavements. Second, the vegetation there is managed: households maintain their land through activities like mowing the grass. Orchards are also found next to residential areas; in orchards, the trees are planted in a regular pattern, which results in a more homogeneous NDVI distribution. Water coverage is not significant (it is hidden by vegetation) in the study area, and the satellite images used are cloudless, so the NDVI values fall within the 0 to 1 range.
Table 1 presents the NDVI ranges for the various CORINE land cover classes, revealing patterns and land-use characteristics. Discontinuous urban fabric and non-irrigated arable land share similar NDVI averages (~0.42), indicating moderate vegetation cover. Permanently irrigated land, pastures, and complex cultivation patterns show slightly higher average NDVI values (0.45–0.46), reflecting consistent agricultural activity. Land principally occupied by agriculture with significant areas of natural vegetation, together with broad-leaved forests, stands out with the highest average NDVI values (0.493 and 0.559, respectively), indicating robust canopy cover. Natural grasslands and transitional woodland/shrub categories also show high averages; the grasslands fall between the pastures and cultivated areas, while the transitional lands are closer to the broad-leaved forests [56].
2.2. Data Collection
On the input side, we used three different kinds of datasets: Sentinel-2 multispectral time-series data as the base, SRTM elevation data, and lithological information derived from published geological maps as ground truth [57,58]. The multispectral images were in L2A format, indicating that radiometric, atmospheric, and geometric corrections had already been applied. In the preprocessing, the original 12 bands were used with their original spatial resolution (Table 2). To maintain the same pixel size in both the geological data and the Sentinel-2 data, only a spatial resampling was applied (to the Sentinel-2 imagery), so the spectral values remain unchanged within the area of the original pixel size.
The time-series collection consists of six different images. This collection covers the entire vegetation period of the study area, containing images from each season.
The acquisition dates of the images are as follows:
2021-02-25;
2023-01-08;
2023-04-28;
2023-11-02;
2023-12-19;
2023-12-27.
There is one image from 2021, and the others are from 2023. The aim of this multi-season collection was to gather as much information as possible from the vegetation cover and from the mixed reflectance of vegetation and soil (winter). The images were downloaded from the Copernicus Datahub using the Semi-Automatic Classification plugin in QGIS. All the acquired images are cloudless and contain minimal to no distortions over the study area.
The SRTM dataset gives additional topographic context to the classifier and supports a deeper understanding of the study area. The SRTM data were also resampled to match the corresponding input datasets.
Lithological maps were used to train and validate the classification (supervised classification). We used 1:50,000 scale maps published by the Romanian Geological Institute as the geological basis. The research area spans two sheets, which were produced in two different mapping periods, resulting in differences in key elements. The greater part of the defined area is covered by the Hațeg sheet [57], while the western area extends into the Băuțar sheet [58]. Both sheets were compiled from several geological surveys conducted in the second half of the 20th century. The 1:200,000 scale maps of the area are available online in vector form on the website of the Romanian Geological Institute; however, the 1:50,000 sheets cannot be accessed in this form. For this reason, the processing of the lithological maps involved vectorizing the raster-format maps and unifying the geological units by creating categories based on lithological similarities (Table 3). The resulting categories were used as classes for classification; each class has a distinct label (1–14).
Sampling was performed on the digitized geological map, but since the processing was pixel-based, the vector format was re-rasterized. For the lithological maps, the pixel size was chosen to match the Sentinel-2 datasets (10 m spatial resolution). Each lithological unit is represented by a different number of pixels. There are dominant classes with large coverage, such as class 1 (218,576 pixels), which includes slope debris, alluvial fan debris, and various terrace deposits. This is followed by Cretaceous units, such as class 6 (129,925 pixels) and class 5 (119,728 pixels), indicating the dominance of sedimentary rocks. The least represented features in the examined area are class 8 (distal flysch facies) and class 14 (basic metatuff), with only 320 and 455 pixels, respectively. The other classes show moderate representation, with pixel counts ranging from 4400 (class 4) to 78,561 (class 12).
Figure 3 shows the number of pixels for each defined class. Classes 8 and 14 have significantly fewer pixels than the average number per class (~50,000 = 700,000/14). Classes 3, 4, 9, and 10 have solid coverage, an order of magnitude higher than classes 8 and 14, but are still below average. Class 2 has a sample size of the same order of magnitude as the average, but it is still only a fraction of it. The NDVI values for each class are shown in Figure 4A. Concerning the trend, the lowest median NDVI is observed on Quaternary deposits (class 1). Values generally increase and peak on the Cretaceous flysch and conglomerate units (classes 6 and 7) before remaining consistently high across the metamorphic rocks (classes 9–13). Concerning the distributions, the lower-indexed classes, like class 1 and class 5, cover a larger extent of the NDVI scale, and their shapes are more elongated. In contrast, most of the higher-indexed classes (>8) show distributions that are more tightly clustered at the higher end of the NDVI scale. Class 8 is a notable outlier, displaying a median NDVI that is lower than that of any other class and a very narrow distribution.
For classes with a relatively low number of samples and higher vegetation values, the risk of low accuracy is high. In these cases, the classification model’s ability to distinguish lithology while accounting for vegetation has a greater impact on performance. In other words, for these classes, the right choice of method can lead to significant improvements.
The spatial distribution of lithological units in the study area is closely related to topography. A clear altitude-based differentiation of rock types is observed (Figure 4B), which is a major geo-environmental feature of the region and of other areas in general [60,61,62]. The low-elevation zones, primarily below 500 m, are dominated by Cenozoic sedimentary rocks. These include Quaternary deposits such as alluvium and terrace deposits (class 1), as well as Neogene marls, sands, and claystones (classes 2–4). The violin plots for these rock types are compact and concentrated at lower altitudes, indicating that they occupy a narrow and distinct elevation range. At intermediate and higher elevations, the lithology is more varied. Cretaceous units, including conglomerates, sandstones, and flysch facies (classes 5–9), are distributed across a broad elevation range, typically between 400 and 700 m. The highest elevations in the study area, generally above 600 m, are composed of Palaeozoic metamorphic rocks. Specifically, carbonate schist (class 10), gneiss (class 11), and other schists (classes 12–13) are found at these higher altitudes. The plot for gneiss (class 11) indicates that it forms the highest peaks. The less represented classes (8, 14) have a very narrow elevation range, resulting in a unique violin shape.
2.3. Preprocessing and Feature Extraction
In this section, we outline the key steps of data processing, from data loading to full-image prediction, including a description of the Forced Invariance Method. The data pipeline was separated into two parts: the preprocessing part and the classification (classifier) part (Figure 5).
In the preprocessing part, the main goal was to create a normalized, unbiased, outlier- and noise-free dataset in a specific shape the classifier can work with. Data normalization and concatenation are the first steps of data processing, ensuring the images are combined and interpreted on the same scale (including the handling of outliers).
All Sentinel-2 bands with native 20 m and 60 m resolutions were included in the analysis. The 60 m bands (B1, B9, and B10), which were designed for atmospheric correction, were also incorporated. As part of the preprocessing, these bands were resampled to a 10 m spatial resolution to match the highest-resolution bands. To achieve this, the nearest neighbor interpolation method was used, a frequently applied technique for bringing bands of different spatial resolutions onto the same pixel grid without calculating new, interpolated values [63,64,65].
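As an illustration, the following is a minimal Python/GDAL sketch of this resampling step; the file names and the use of gdal.Warp are our assumptions, not the authors' exact code:

```python
from osgeo import gdal

# Resample a hypothetical 60 m Sentinel-2 band to the 10 m grid using
# nearest-neighbour interpolation, so no new spectral values are computed;
# existing DNs are simply copied onto the finer grid.
src = gdal.Open("band_60m.tif")           # placeholder input file
out = gdal.Warp(
    "band_10m.tif",                       # placeholder output file
    src,
    xRes=10, yRes=10,                     # target 10 m pixel size
    resampleAlg=gdal.GRA_NearestNeighbour,
)
out = None  # close the dataset to flush it to disk
```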
The SRTM elevation data went through the same normalization and outlier filtering before being combined with the satellite imagery collections after the Forced Invariance Method was applied (dataset-wise). The Forced Invariance Method (FIM) was one of our main preprocessing steps, included particularly to test its capability of suppressing vegetation and revealing underlying lithological features. After the FIM, each subsequent step on the flowchart (Figure 5) is performed with both the original dataset and the FIM-modified dataset. Principal Component Analysis (PCA) is included as a noise reduction and data compaction technique. The final step of the preprocessing pipeline is patch extraction. This step arranges the data so that there is no overlap between the training and validation datasets, and it also forms the input shape that fits the classifier.
The second part of the methodology is where the classifier frameworks are defined. Three different model definitions were created: an Artificial Neural Network (ANN)-based Multi-Layer Perceptron (MLP), a convolutional layer-centered model (Conv2D), and a Vision Transformer-based one (ViT). Each model handles the dataset differently, so the total number of parameters and the processing steps differ during training, though the models share the same training parameters (outside the model definitions). As part of the training, we validated the datasets and models by monitoring the training and validation loss. The trained models were then used to predict over the whole input satellite images. The results are evaluated with common measures, like the F1-score and Overall Accuracy (OA), and practical measures, like vegetation category-based F1-scores (as trend charts) and the vegetation index–F1-score correlation (bar plot). Maps were also created for visual interpretation.
2.3.1. Z-Score Outlier Filtering and Normalization
The initial preprocessing stage involved outlier removal and data normalization to ensure all input features are interpreted on the same scale. First, the time-series data were loaded using the Python GDAL library, an efficient module with tools for handling the geospatial meta-information of raster-type datasets. After the time-series data were loaded (each instance individually), the geospatial metadata (e.g., projection and geotransform parameters) from each source file were preserved to ensure the final outputs could be geolocated correctly. Next, a z-score-based outlier filter was applied to the time series to identify and remove statistical outliers. Outliers are values that do not fit the distribution of the dataset and cause performance loss later. This step was critical, as certain downstream algorithms in the processing pipeline, particularly the Forced Invariance Method, are sensitive to extreme values, which can lead to analytical instability and degraded performance [66].
Following outlier removal, a simple min–max normalization technique was used, and the data were scaled to a uniform range of 0–1. The SRTM elevation data also went through the same outlier filtering, but it was processed separately. After this step, the instances were in the same shape and on the same scale, and the normalized time-series layers were stacked to form a single, multi-dimensional data array.
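A minimal NumPy sketch of this filtering and scaling step follows; the z-score threshold of 3.0 and the replacement of outliers with the band median are our assumptions, as the text does not specify these details:

```python
import numpy as np

def zscore_filter(band: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    """Replace pixels whose z-score exceeds the threshold with the band
    median. Threshold and replacement strategy are assumed values."""
    z = np.abs((band - band.mean()) / band.std())
    cleaned = band.copy()
    cleaned[z > z_thresh] = np.median(band)
    return cleaned

def minmax_normalize(band: np.ndarray) -> np.ndarray:
    """Scale a band to the uniform 0-1 range."""
    bmin, bmax = band.min(), band.max()
    return (band - bmin) / (bmax - bmin)

# Stack the cleaned, normalized time-series layers into one array of shape
# (bands, rows, cols); `layers` is a hypothetical list of 2D band arrays.
layers = [np.random.rand(100, 100) for _ in range(3)]  # placeholder data
stacked = np.stack([minmax_normalize(zscore_filter(b)) for b in layers])
```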
2.3.2. Vegetation Suppression (FIM)
The Forced Invariance Method begins by preprocessing the data with a dark pixel correction, subtracting the smallest DN (Digital Number) value from all pixels in each band to correct for atmospheric path radiance and sensor calibration offsets. The vegetation index calculation is the second step; the NDVI (Normalized Difference Vegetation Index) is used by default and plotted against each band on a scatter plot (Figure 6). Average DN values represent the trend over the NDVI categories, which is described by the best-fitting curve. Median and mean filters are suggested by the authors of the FIM study [28] for the curve fitting. With the connection between the band values and the NDVI trend known, the curve can be flattened with the following equation [28]:

$$\mathrm{DN}_{\mathrm{out}} = \mathrm{DN}_{\mathrm{in}} \times \frac{\mathrm{TargetDN}}{\mathrm{CurveDN}},$$

where $\mathrm{CurveDN}$ is the value of the fitted curve at the pixel’s NDVI. The $\mathrm{TargetDN}$ is an adjustable variable, corresponding by default to the average DN value of each band. Following these steps, the vegetation-related contrast is neutralized, and only the lithological patterns and the other terrain-specific information remain. Additional steps, such as radiometric outlier masking to exclude anomalous radiometric features, are advised. Post-processing, like contrast stretching and color balancing, is also included in the original study [28].
After this step, the workflow continued with two different datasets: one modified with this method (the FIM-modified dataset) and one that skipped this step (the original dataset). The SRTM elevation data were concatenated with the resulting flattened dataset; the original dataset was also extended with the same SRTM elevation data.
For each satellite image, the corresponding NDVI was calculated and used for the decorrelation. Additionally, the NDVI values were downscaled to the corresponding band’s resolution, ensuring there was no distortion in the average NDVI and band DN values.
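The following is a simplified sketch of the flattening step, assuming a binned-median approximation of the best-fitting curve; ref. [28] fits a smooth curve, and the bin count here is an arbitrary choice:

```python
import numpy as np

def forced_invariance(band: np.ndarray, ndvi: np.ndarray,
                      n_bins: int = 100) -> np.ndarray:
    """Sketch of the FIM flattening: estimate the band-DN vs. NDVI trend
    with binned medians, then rescale each pixel so the trend becomes
    flat at TargetDN (the band average, as described in the text)."""
    band = band - band.min()               # dark-pixel correction
    target_dn = band.mean()                # default TargetDN

    # Median DN per NDVI bin approximates the best-fitting curve.
    edges = np.linspace(ndvi.min(), ndvi.max(), n_bins + 1)
    idx = np.clip(np.digitize(ndvi, edges) - 1, 0, n_bins - 1)
    curve = np.array([np.median(band[idx == i]) if np.any(idx == i)
                      else np.nan for i in range(n_bins)])
    ok = ~np.isnan(curve)                  # interpolate across empty bins
    curve = np.interp(np.arange(n_bins), np.flatnonzero(ok), curve[ok])

    # Flatten: DN_out = DN_in * TargetDN / CurveDN(NDVI)
    curve_dn = np.maximum(curve[idx], 1e-6)
    return band * target_dn / curve_dn
```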
2.3.3. Dimensionality Reduction
During the Principal Component Analysis of multiband rasters, the dataset size is reduced, while most of the variability within it is retained. The process involves the calculation of the covariance matrix, where each element represents the relationship between two bands. This covariance matrix undergoes eigendecomposition to yield its eigenvalues and eigenvectors. Each eigenvalue represents the variance explained by a single component, and the eigenvectors define the linear combinations of the original bands. The eigenvalues are then ranked in descending order, and the principal components are generated by projecting the original data onto the axes defined by these ranked eigenvectors [67].
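A compact NumPy sketch of these steps is given below; it is illustrative only, and any standard PCA implementation (e.g., scikit-learn's) would be equivalent:

```python
import numpy as np

def pca_components(data: np.ndarray, n_components: int = 12):
    """data: (n_pixels, n_bands) array of the stacked features.
    Returns the projected scores and the cumulative explained variance,
    mirroring the covariance/eigendecomposition steps in the text."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)           # band-to-band covariance
    eigvals, eigvecs = np.linalg.eigh(cov)         # symmetric matrix
    order = np.argsort(eigvals)[::-1]              # rank by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs[:, :n_components]  # principal components
    cumvar = np.cumsum(eigvals) / eigvals.sum()    # cumulative explained var.
    return scores, cumvar[:n_components]
```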
Dimensionality reduction involved 73 input features: 72 bands from the satellite time series (6 dates × 12 bands per date) and one band from the SRTM. The criterion for selecting the number of principal components (PCs) was the cumulative explained variance (Figure 7), which quantifies the variability retained as components are added. The first 12 principal components were selected for both the original and the decorrelated (FIM) datasets. This number of components accounts for a very high percentage of the total data variability in both scenarios: 95.98% for the original dataset and 96.72% for the FIM-processed dataset.
The FIM-processed data (Figure 7B) show a marginally higher cumulative variance than the original data (Figure 7A). This is likely because the modification reduces the overall data complexity, allowing more of the total variance to be captured in fewer components. To ensure consistency between the two processing workflows, we used the 12 output bands for both datasets; these are presented in Figure 8.
The comparison of band-wise contributions to the PCA output revealed notable shifts between the original and the decorrelated datasets (Table 4). In both cases, bands B8, B12, and B7 were the most informative. In the modified dataset, the contribution of band B8 increased from 0.179 to 0.221; the contributions of bands B12 and B7 also increased. The contributions of the other bands were less significant, though their order changed between the two datasets. Every other band lost significance when the dataset was modified, except B3, B2, and B1. This suggests that the decorrelation suppresses some redundant spectral information while enhancing the relevance of other bands, with the overall order remaining almost the same. The low contribution of the SRTM band is expected in both cases, though it is nonetheless present in the results.
The distribution of information content across the satellite scenes, considered by acquisition date, showed a major reorganization in the modified dataset compared to the original (Table 5). The scene acquired on 28 April 2023 retained the highest variance; the significance of this scene almost doubled with the FIM preprocessing, and it was the only scene that did not change position in terms of significance. This increase may indicate a clean or vegetation-contrasting scene, which the FIM preserves more effectively. After preprocessing, the 8 January 2023 image lost second place, even though its extracted information content grew. The importance of the other scenes decreased.
2.3.4. Patch-Based Sampling
The last step of preprocessing incorporates spatial and spectral context for the classifier. To provide contextual information for each training sample, a square window of pixels was created around each central pixel. The mask creation for the training, validation, and hold-out datasets followed strict rules. The hold-out set is preserved for final validation, as there must be an unseen portion of the dataset; it is used to evaluate the robustness and generalization capabilities of the models.
Pixel-Based Image Analysis with the extraction of contextual patches can involve data leakage. If the training and validation patches are selected too close to one another, their areas can overlap. This allows the same pixel data to be present in both the training and validation sets, leading to artificially inflated performance metrics and a model that cannot generalize well to truly unseen data [68].
To prevent this, a spatially disjointed sampling strategy was implemented. During mask creation, the distribution of the pixels was random, stratified by lithology to ensure all classes were represented (Figure 9A). A buffer zone was applied around each validation pixel to define the validation patch and create a “keep-out” zone for training samples (Figure 9B). Training samples were then selected from the remaining areas, guaranteeing they were spatially separate from any validation patch (Figure 9C). In this study, the validation and hold-out split masks were defined as a 5 × 5 pixel window, created by using a 2-pixel buffer around the central pixels of the samples (Figure 9).
Firstly, the hold-out dataset’s central pixels were chosen, and a 2-pixel buffer was applied around them. Approximately 1.6–2.2% of the pixels were chosen for hold-out within each lithological class. For the validation set, additional pixels were selected from the remaining pool, also stratified by lithological class. All remaining available pixels became part of the training set. Masks were created for each dataset. The sample ratio of the training, validation, and hold-out datasets is approximately 84:8.5:7.5, respectively (Table 6).
This approach preserves the spectral profile of the central pixel and provides additional spatial information on spectral and topographic properties.
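A simplified sketch of this stratified, spatially disjoint mask generation is shown below; the per-class fractions and the handling of overlapping keep-out zones are our assumptions, loosely based on the ratios reported above:

```python
import numpy as np

def build_masks(labels: np.ndarray, holdout_frac: float = 0.02,
                val_frac: float = 0.085, buffer: int = 2, seed: int = 42):
    """labels: 2D class raster (0 = unlabeled). For each class, random
    central pixels are drawn for the hold-out and validation sets, and a
    `buffer`-pixel keep-out zone (a 5x5 window for buffer=2) is blocked
    around each; all remaining labeled pixels become training samples."""
    rng = np.random.default_rng(seed)
    mask = np.zeros_like(labels, dtype=np.uint8)  # 0=free,1=train,2=val,3=hold

    def block(r, c):
        # Mark the whole window as keep-out so no training patch overlaps it.
        mask[max(r - buffer, 0):r + buffer + 1,
             max(c - buffer, 0):c + buffer + 1] = 255

    for cls in np.unique(labels[labels > 0]):
        r_idx, c_idx = np.where(labels == cls)
        picks = rng.permutation(len(r_idx))
        n_hold = int(len(picks) * holdout_frac)
        n_val = int(len(picks) * val_frac)
        for code, sel in ((3, picks[:n_hold]),
                          (2, picks[n_hold:n_hold + n_val])):
            for i in sel:
                r, c = r_idx[i], c_idx[i]
                if mask[r, c] == 0:     # skip pixels in existing keep-out zones
                    block(r, c)
                    mask[r, c] = code
    mask[(labels > 0) & (mask == 0)] = 1  # the rest is training
    return mask
```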
During sample generation, the edge pixels were also used as central pixels. To overcome the edge effect and to be able to predict near the edges during inference, we used reflect padding. Reflect padding is a widely used technique that extends images beyond their edges by mirroring the pixel values. This method improves classification accuracy by preserving the local texture and spatial consistency near the image boundaries [69].
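For example, with NumPy (the array shape here is a placeholder):

```python
import numpy as np

# Hypothetical (features, rows, cols) stack; pad 1 pixel on each spatial
# edge by mirroring, so border pixels can act as 3 x 3 patch centers.
stack = np.zeros((12, 700, 1000), dtype=np.float32)  # placeholder data
padded = np.pad(stack, pad_width=((0, 0), (1, 1), (1, 1)), mode="reflect")
```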
2.4. Model Architectures
In this study, three different deep learning classifiers were tested: a Multi-Layer Perceptron (MLP) with dense layers for processing, a Conv2D model that utilizes only convolutional layers for processing, and an implementation of the Vision Transformer (ViT) architecture. In each model, batch normalization and dropout layers were also used to stabilize training and improve the generalization capability on unseen data.
The MLP is the simplest of the classifiers tested. It used three hidden dense layers (192 nodes each) and an output dense layer (14 nodes) with SoftMax activation, which converts the outputs into class probabilities [70]. Each hidden layer uses the ReLU (Rectified Linear Unit) activation function to introduce non-linearity into the model (Figure 10). During training, the information of every pixel in the patch is considered, so the MLP benefits from the additional contextual information introduced by the window-based approach. However, it requires a flattened input, so the positions of the pixels are not preserved across the layers. The purpose of this model was to serve as a performance baseline for comparison.
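A PyTorch sketch of this baseline follows; the dropout rate and the exact placement of the batch normalization and dropout layers are assumptions, as the text does not specify them:

```python
import torch.nn as nn

class MLP(nn.Module):
    """Baseline MLP: flattened 12 x 3 x 3 patch in, 14 class logits out.
    Layer widths follow the text; the dropout rate is an assumed value."""
    def __init__(self, in_features: int = 108, n_classes: int = 14):
        super().__init__()
        layers, width = [], 192
        for _ in range(3):                       # three hidden dense layers
            layers += [nn.Linear(in_features, width),
                       nn.BatchNorm1d(width),
                       nn.ReLU(),
                       nn.Dropout(0.2)]          # assumed rate
            in_features = width
        # Logits out; SoftMax is applied by the loss (or at inference).
        layers.append(nn.Linear(width, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):                        # x: (batch, 108)
        return self.net(x)
```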
The second model was a Convolutional Neural Network (CNN), which used three two-dimensional convolutional layers for processing. Each layer consisted of 192 filters (feature maps), and “same” padding was applied (not to be confused with the reflect padding used during patch extraction). Using this padding preserves the spatial dimensions of the feature maps throughout the convolutional blocks, thus avoiding data loss [71]. The filter size was 2 × 2 in all convolutional layers, which also ensures that no data are lost. Before the output layer, the feature maps were flattened (Figure 11).
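A PyTorch sketch of this architecture under the stated parameters; the use and placement of batch normalization are our assumptions:

```python
import torch.nn as nn

class Conv2DNet(nn.Module):
    """Three Conv2D layers with 192 filters, 2x2 kernels, and 'same'
    padding, operating on (batch, 12, 3, 3) patches, per the text."""
    def __init__(self, in_channels: int = 12, n_classes: int = 14):
        super().__init__()
        blocks = []
        for _ in range(3):
            blocks += [nn.Conv2d(in_channels, 192, kernel_size=2,
                                 padding="same"),   # keeps 3x3 spatial size
                       nn.BatchNorm2d(192),
                       nn.ReLU()]
            in_channels = 192
        self.features = nn.Sequential(*blocks)
        self.head = nn.Linear(192 * 3 * 3, n_classes)

    def forward(self, x):                 # x: (batch, 12, 3, 3)
        z = self.features(x)              # (batch, 192, 3, 3)
        return self.head(z.flatten(1))    # flatten before the output layer
```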
The most complex model we tested was the Vision Transformer [72]. The model processes the input 3 × 3 patch by treating each of the 9 pixels individually and applying positional encoding to each pixel in each layer. Two transformer encoder blocks were used, with 8 attention heads each. The base architecture is shown in Figure 12. The final classification head of the ViT, an MLP, had parameters similar to the dense layers of the MLP model, with 192 nodes and a ReLU activation function.
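A PyTorch sketch of such a model is shown below; the token embedding width of 192 and the single application of the positional encoding are simplifying assumptions (the text applies positional encoding in each layer):

```python
import torch
import torch.nn as nn

class PixelViT(nn.Module):
    """ViT-style classifier over a 3x3 patch: each of the 9 pixels is a
    token, with a learned positional encoding, two transformer encoder
    blocks with 8 heads, and a 192-node MLP head (per the text)."""
    def __init__(self, n_features: int = 12, n_classes: int = 14,
                 dim: int = 192):
        super().__init__()
        self.embed = nn.Linear(n_features, dim)          # per-pixel token
        self.pos = nn.Parameter(torch.zeros(1, 9, dim))  # positional encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Sequential(nn.Linear(dim, 192), nn.ReLU(),
                                  nn.Linear(192, n_classes))

    def forward(self, x):                       # x: (batch, 12, 3, 3)
        tokens = x.flatten(2).transpose(1, 2)   # (batch, 9, 12) pixel tokens
        z = self.encoder(self.embed(tokens) + self.pos)
        return self.head(z.mean(dim=1))         # average-pool the 9 tokens
```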
2.5. Training and Validation
To fit our datasets to the model architectures, we formed two different input shapes from the same datasets. The MLP requires a flattened input (DSFlat), while the Conv2D and ViT models require a multidimensional input array (DSImg). The flattened input had a batch shape of (512, 108) per iteration, where 108 is the number of individual feature values per sample (12 features × 3 × 3 pixels). For the multidimensional input, we used the shape (512, 12, 3, 3), representing a 3 × 3-patch context with 12 input features.
The data loaders worked with a batch size of 512 and incorporated a shuffle mechanism so that the order of the training samples changed in each epoch. In a GPU environment, we used pinned memory and multiple workers (1–4), allowing data loading to occur in parallel on the CPU. With this parallelization, the training can be GPU-optimized [73].
The classification loss was calculated using a weighted focal loss, which is known to improve learning in datasets with a severe class imbalance [74]. The core formulation of the focal loss is

$$\mathrm{FL}(p_t) = -\alpha_c \, (1 - p_t)^{\gamma} \log(p_t),$$

where $p_t$ denotes the model’s estimated probability for the correct class, $\gamma$ is the focus parameter (set to 2.0) that down-weights easy samples, and $\alpha_c$ is a class-specific weight. These weights, derived using the “Effective Number of Samples” method [75], are defined as follows:

$$\alpha_c = \frac{1 - \beta}{1 - \beta^{\,n_c}},$$

where $\beta$ is the smoothing parameter (typically close to 1) and $n_c$ is the number of samples for class $c$. This formula assigns a higher weight to underrepresented classes, compensating for the class imbalance. The combination of the focal loss and the effective number of samples approach enhances training stability, especially in cases of extreme class imbalance.
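A PyTorch sketch of this loss and weighting scheme follows; the $\beta$ value of 0.999 is an assumed example consistent with "typically close to 1", not a value stated in the text:

```python
import torch
import torch.nn.functional as F

def effective_number_weights(counts: torch.Tensor,
                             beta: float = 0.999) -> torch.Tensor:
    """Class weights per the 'Effective Number of Samples' scheme [75]:
    alpha_c = (1 - beta) / (1 - beta**n_c); beta here is an assumption."""
    weights = (1.0 - beta) / (1.0 - beta ** counts.float())
    return weights / weights.sum() * len(counts)   # normalize around 1

def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               alpha: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Weighted focal loss: FL = -alpha_c * (1 - p_t)^gamma * log(p_t)."""
    log_p = F.log_softmax(logits, dim=1)
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    return (-alpha[target] * (1.0 - pt) ** gamma * log_pt).mean()
```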
During the optimization process, the AdamW algorithm was used; compared to other adaptive methods, AdamW incorporates an effective, decoupled weight decay [76]. The initial learning rate and the weight decay were chosen to balance fast convergence with protection against overfitting. Weight decay is especially important for models with high parameter counts, because it helps to restrict excessive parameter growth.
The learning rate was fine-tuned with a “reduce on plateau” schedule: if, at the end of a complete epoch, the macro F1-score did not improve, the learning rate was reduced by a factor of 0.5. This adaptive decay allows the model to make larger parameter updates in the early stages of training and progressively shift to smaller, more precise steps as the learning curve begins to plateau. With the learning rate decay, the optimization process can converge more smoothly and with higher precision, potentially improving the chances of finding a better (possibly global) minimum [77].
To ensure the model’s robustness and generalization capability on unseen data, we used k-fold cross-validation with 4 folds. Each fold had the same sample-number ratio across the classes [78]. The number of epochs per fold was set to 50, and an early stopping criterion was also implemented: if the model was unable to reach a better macro F1-score for 10 consecutive epochs, the training was stopped [79]. The model with the best macro F1-score was chosen for each fold. To ensure reproducibility, the random seed was fixed at 42.
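A skeleton of this training configuration in PyTorch; the learning rate and weight decay values are placeholders (the exact values are not reproduced here), and `PixelViT` refers to the model sketch above:

```python
import torch

torch.manual_seed(42)  # fixed random seed for reproducibility

model = PixelViT()     # any of the three architectures sketched above
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-3, weight_decay=1e-4)  # placeholder values
# Halve the learning rate when the validation macro F1-score plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5)

best_f1, patience, wait = -1.0, 10, 0
for epoch in range(50):                   # 50 epochs per fold
    # ... training and validation passes producing `val_macro_f1` ...
    val_macro_f1 = 0.0                    # placeholder for the real metric
    scheduler.step(val_macro_f1)
    if val_macro_f1 > best_f1:
        best_f1, wait = val_macro_f1, 0   # checkpoint the best model here
    else:
        wait += 1
        if wait >= patience:              # early stopping after 10 epochs
            break
```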
2.6. Evaluation
Stability and robustness were evaluated with a stratified k-fold cross-validation [80]. In each fold, the F1-scores, weighted F1-scores, accuracy, and macro F1-scores were calculated, along with their averages and standard deviations (Supplementary Tables S1–S6). This approach excludes the possibility that one portion of the data might distort the results. The macro F1-score-based evaluation is also an effective way to monitor the performance of the rare classes. The validation macro F1-score-based checkpointing ensured that the best model was saved in every fold.
The saved folds were then ensembled into one model by averaging the outputs of the models from each fold. This ensembled model was tested on the yet unseen hold-out dataset as the final evaluation of generalizability [81]. The hold-out validation avoids the optimistic distortion of cross-validation and is in line with remote sensing evaluation practice. The final metrics are based on the hold-out dataset and the ensembled model. To compare the different model architectures, Area Under the Curve (AUC) values and Receiver Operating Characteristic (ROC) curves were calculated (Supplementary Figure S1); the ROC curve plots the true-positive rate against the false-positive rate [82].
To calibrate the confidence of the probabilistic outputs (i.e., to align confidence with accuracy), post hoc temperature scaling was also used. This one-parameter correction keeps the ranking of the predictions while it sharpens the SoftMax outputs and decreases the ECE (Expected Calibration Error) [83]. Brier and ECE scores were also calculated for each fold and used as calibration evaluation metrics (Supplementary Tables S1–S6).
The Brier score averages the squared difference between the predicted probability distribution $\hat{p}_{i,c}$ and the ground truth $y_{i,c}$, where $y_i$ is represented as a one-hot encoded vector; the averaging is performed over all classes $c = 1, \dots, C$ and over all samples $i = 1, \dots, N$:

$$\mathrm{Brier} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left( \hat{p}_{i,c} - y_{i,c} \right)^2.$$
The ECE measures the calibration deviation: how well the average confidence of the model matches the actual hit rate at the binned confidence levels. Predictions are grouped into $M$ confidence bins $B_m$ based on their maximum predicted probability, and within each bin, the mean confidence $\mathrm{conf}(B_m)$ is compared to the fraction of correct predictions $\mathrm{acc}(B_m)$, with weights proportional to the bin sizes $|B_m|$:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|.$$
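A NumPy sketch of both calibration steps; the bin count of 15 is an assumed value, and fitting the temperature $T$ on the validation set (e.g., by minimizing the negative log-likelihood, per [83]) is left out for brevity:

```python
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray,
                               n_bins: int = 15) -> float:
    """ECE over confidence bins: probs is (n_samples, n_classes) after
    SoftMax, labels is (n_samples,). Bin count is an assumed value."""
    conf = probs.max(axis=1)                  # maximum predicted probability
    correct = probs.argmax(axis=1) == labels
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |accuracy - mean confidence|, weighted by relative bin size
            ece += in_bin.mean() * abs(correct[in_bin].mean()
                                       - conf[in_bin].mean())
    return ece

def temperature_scale(logits: np.ndarray, T: float) -> np.ndarray:
    """Single-parameter post hoc calibration: SoftMax(logits / T),
    where T is fitted beforehand on the validation set."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)
```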
The logit outputs were calibrated based on the validation dataset; the temperature parameter and the before-and-after calibration confusion matrices are included in the Supplementary Material, along with the per-fold metrics and confusion matrices (Supplementary Figures S2–S7).
To compare the models, the Overall Accuracy (OA) was also calculated for each trained model. The OA is the proportion of samples for which the predicted label $\hat{y}_i$ matches the ground-truth label $y_i$; it is computed as the average of the indicator function that is 1 when $\hat{y}_i = y_i$ and 0 otherwise:

$$\mathrm{OA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( \hat{y}_i = y_i \right).$$
For the class-by-class evaluation, we used scikit-learn’s classification report function. The AUC differences between the model pairs were analyzed with the DeLong test; the p-values for each class were summarized using Fisher’s method (Supplementary Tables S7 and S8). The report provides the F1-score for each individual class (in addition to the aggregated micro, macro, and weighted scores; see Supplementary Tables S9–S14), which is the harmonic mean of precision and recall [84]. Precision is the ratio of correctly classified positive items (true positives, TP) to all items classified as positive (TP + false positives, FP). Recall is the ratio of the correctly classified positive items to all relevant items (TP + false negatives, FN). This metric was chosen because the classification models were trained on classes with varying sample sizes, which can lead to misleading statistics. The F1-score addresses this issue by considering both precision and recall, providing a balanced evaluation.
During training, to ensure that classes with smaller sample counts also received enough attention, the macro F1-scores were calculated and used to evaluate the results as well.
Measuring the effect of vegetation on the predicted raster is challenging. To do this, we used two different metrics to compare the models, both of which use the NDVI to represent vegetation. The first is the trend of the F1-scores over defined NDVI classes; the second shows how the accuracy of a class correlates with the vegetation. For the trend analysis, we defined four NDVI categories (Table 7).
For the class-by-class analysis, we used specific correlation-based statistics, with the NDVI serving as an indirect vegetation descriptor. Four major NDVI groups (major bins) were formulated based on Table 7, and each pixel is classified into a major group $G_K$ ($K = 1, \dots, 4$) according to its NDVI value $v$. Every pixel group $G_K$, with $N_K$ pixels, is divided into ten subgroups (sub-bins): the pixels of $G_K$ are sorted by NDVI and split at the deciles, ensuring identical sub-bin pixel counts of $N_K / 10$. The sub-bin-level true-positive ($TP_b$), false-positive ($FP_b$), and false-negative ($FN_b$) counts, where $\hat{y}$ is the predicted class and $y$ is the ground truth, served as input for the accuracy measure (the macro F1-score). For each sub-bin $b$, the macro F1-score $F1_b$ and the average NDVI $\bar{v}_b$ are formulated. The robustness of the $K$-th major bin is measured by the relationship between the ten ($\bar{v}_b$, $F1_b$) pairs, expressed with the Pearson correlation $r_K$:

$$r_K = \frac{\sum_{b=1}^{10} \left( \bar{v}_b - \bar{v} \right)\left( F1_b - \overline{F1} \right)}{\sqrt{\sum_{b=1}^{10} \left( \bar{v}_b - \bar{v} \right)^2} \, \sqrt{\sum_{b=1}^{10} \left( F1_b - \overline{F1} \right)^2}}.$$

The Pearson correlation coefficient $r_K$ is thus calculated for each of the four main vegetation groups defined in Table 7; each coefficient results from a 10-point measure consisting of the average NDVI and macro F1-score pairs derived from the ten sub-bins within that group. In this analysis, the goal is for the correlation coefficient to be around zero in as many groups as possible. This would indicate that vegetation does not significantly affect classification accuracy, suggesting that the model can be effectively applied in other vegetated areas.
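A sketch of this procedure for one major bin; the function and variable names are illustrative, and scikit-learn supplies the macro F1 computation:

```python
import numpy as np
from sklearn.metrics import f1_score

def ndvi_f1_correlation(ndvi, y_true, y_pred, lo, hi):
    """Pearson r between mean NDVI and macro F1 across ten equal-count
    sub-bins of one major NDVI group [lo, hi)."""
    sel = (ndvi >= lo) & (ndvi < hi)            # pixels of the major bin G_K
    order = np.argsort(ndvi[sel])               # sort group pixels by NDVI
    v, t, p = ndvi[sel][order], y_true[sel][order], y_pred[sel][order]
    mean_ndvi, macro_f1 = [], []
    for sub in np.array_split(np.arange(len(v)), 10):   # decile sub-bins
        mean_ndvi.append(v[sub].mean())
        macro_f1.append(f1_score(t[sub], p[sub], average="macro"))
    return np.corrcoef(mean_ndvi, macro_f1)[0, 1]       # Pearson r_K
```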
The NDVI–macro F1-score correlation graph shows significant fluctuations. For interpretation, we defined four groups of correlation values (Table 8). Based on this, the better-performing model is the one whose average correlation does not deviate significantly towards either extreme.