Article

A Systematic Evaluation of CNN Configurations for Multiclass Oil Spill Classification in Hyperspectral Images

by
María Gema Carrasco-García
1,*,
Javier González-Enrique
2,
Juan Jesús Ruiz-Aguilar
1,
Alberto Camarero-Orive
3,
David Elizondo
4 and
Ignacio J. Turias Domínguez
2,*
1
Department of Industrial and Civil Engineering, Algeciras School of Engineering and Technology (ASET), University of Cádiz, 11202 Algeciras, Spain
2
Department of Computer Science Engineering, Algeciras School of Engineering and Technology (ASET), University of Cádiz, 11202 Algeciras, Spain
3
Department of Transport Engineering, Universidad Politécnica de Madrid (UPM), 28040 Madrid, Spain
4
School of Computer Science and Informatics, Faculty of Computing, Engineering and Media, De Montfort University (DMU), Leicester LE1 9BH, UK
*
Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(4), 383; https://doi.org/10.3390/jmse14040383
Submission received: 31 December 2025 / Revised: 23 January 2026 / Accepted: 16 February 2026 / Published: 18 February 2026
(This article belongs to the Special Issue Oil Spills in the Marine Environment)

Abstract

Oil spills represent a severe threat to aquatic ecosystems, requiring rapid and reliable detection methods to support environmental response. Hyperspectral imaging (HSI) offers high spectral resolution for distinguishing hydrocarbon types, but its effective use depends on the performance and robustness of deep learning (DL) models, especially under data-limited conditions. This study presents a systematic evaluation of convolutional neural network (CNN) configurations for oil spill classification in visible-near-infrared (VNIR) hyperspectral data, examining the influence of architectural depth and hyperparameters such as the number of convolutional kernels, neuron density, and dropout rate. Two architectures were tested across 54 configurations and two training set sizes (259 and 518 samples). Results show that a compact architecture with an additional max pooling layer achieved near-perfect accuracy (>0.99) with reduced complexity and greater robustness, outperforming its deeper counterpart. Importantly, this study reveals that under small-sample scenarios, optimal performance can still be achieved by carefully balancing model capacity, favouring moderate convolutional depth and high neuron density, while avoiding over-regularisation. These findings provide practical guidance for designing efficient CNNs for UAV-based oil spill monitoring and lay the groundwork for future integration into local real-time processing pipelines and transfer learning applications.

1. Introduction

Marine and freshwater pollution is one of the most important environmental challenges of the 21st century. It is explicitly addressed in the 2030 Agenda through Sustainable Development Goals (SDGs) 6 (Clean Water and Sanitation) and 14 (Life Below Water), which jointly emphasise the need to improve water quality, reduce marine pollution, and improve monitoring and management systems [1]. This concern stems from the growing pressure that water pollution exerts on ecological integrity, public health, and economic activities [2,3,4,5,6]. This is particularly evident in the case of oil spills, which represent a critical and persistent form of aquatic pollution due to their rapid dispersion on the surface, the toxicity of their chemical components, the complexity of remediation, and their persistent environmental effects [7,8,9]. The consequences of large-scale disasters, such as the Prestige in 2002 or the Deepwater Horizon spill in 2010, are well-documented. In the case of the Prestige, approximately 64,000 tons of heavy fuel oil were released, contaminating 900 km of coastline across Northern Spain, Portugal, and Southwestern France [10]. Similarly, the explosion of the Deepwater Horizon offshore drilling platform resulted in the release of millions of barrels of oil into the Gulf of Mexico, contaminating an estimated area of 10,000 square kilometres [11,12]. However, such catastrophic events are not the only pathways through which hydrocarbons enter aquatic systems. Additional sources include land-based runoff, accidental discharges during maritime transport, oil pipeline leaks, exploration and production activities, and operational procedures such as bunkering [13].
While medium- and large-scale spills (>7 tons) have decreased by approximately 90% according to the International Maritime Organization (IMO) and the International Tanker Owners Pollution Federation (ITOPF), small-scale spills remain frequent and are often underreported or poorly documented [14,15]. This situation is further compounded by the substantial increase in global seaborne traffic in recent years [14], which has led to higher vessel density in coastal and port areas. These high-traffic zones, often hosting additional hydrocarbon-related activities such as cargo handling, refuelling, and maintenance, are particularly vulnerable to routine low-volume discharges. Although these incidents are often minor and individually undetected, their cumulative effects can lead to significant and persistent degradation of water quality and marine ecosystems [16,17]. These scenarios underscore the necessity of timely and effective local water quality monitoring as a prerequisite for environmental protection and sustainable management.
Remote sensing technology has been increasingly explored as a promising solution for the detection and monitoring of oil spills due to its ability to provide rapid, synoptic, and wide-area observations. A wide variety of remote sensing techniques have been investigated, including Synthetic Aperture Radar (SAR), laser fluorescence, thermal infrared, and optical remote sensors (ORSs) operating in the visible (VIS), near-infrared (NIR), and shortwave infrared (SWIR) spectral ranges [18,19,20]. SAR has been one of the most widely used techniques in oil spill monitoring due to its ability to obtain images in all weather conditions, both day and night [18,19,21,22]. However, SAR also presents limitations: the presence of similar natural phenomena, such as internal waves, areas of light winds, or biogenic slicks, can lead to high false positive rates. In this context, multispectral imaging (MSI) sensors such as Landsat MSS/OLI, MODIS, and Sentinel-2 MSI have been widely used to detect and monitor oil spills [23,24,25,26]. However, when atmospheric, illumination, or sea surface conditions are not optimal, the limited number of relatively broad spectral bands in MSI can significantly reduce its effectiveness, especially in the classification of materials with similar spectral signatures or when spectral responses are altered by external factors such as the effects of weathering or sun glint [27,28]. These limitations have led to growing interest in hyperspectral imaging (HSI), which captures hundreds of contiguous and narrow spectral bands, typically under 20 nm, across the VIS to SWIR range, offering a significantly higher spectral resolution than MSI [29]. Hyperspectral sensors generate a three-dimensional data structure known as a hypercube, where each spatial pixel contains a complete reflectance spectrum (Figure 1a).
This pixel-level spectral information captures the interaction of light with matter at fine wavelength intervals, encoding the presence of specific chemical bonds through their characteristic absorption features [30,31]. In this context, HSI provides a chemically informative map of the scene, often referred to as a chemical image [32]. Each pixel effectively acts as a localised spectral signature (Figure 1b), allowing for the identification and classification of compounds with higher precision and speed than conventional spectral methods [33]. These capabilities, combined with continuous technological advances in sensor design, have consolidated hyperspectral imaging as a powerful tool in remote sensing since its inception for Earth-surface observation [34].
Different satellite-based HSI missions have been deployed in recent years, such as PRISMA (Agenzia Spaziale Italiana, ASI), EnMAP (Germany), and HISUI (Japan Aerospace Exploration Agency, JAXA). These platforms have supported a wide range of environmental remote sensing applications, such as geological and soil characterisation, vegetation and ecosystem monitoring, inland and coastal water quality assessment, and natural and anthropogenic hazard observation. A comprehensive review by Chabrillat et al. [35] highlights the broad spectrum of environmental applications in which EnMAP has been applied, including soil mapping and monitoring, raw materials and mining, vegetation dynamics, agriculture, forestry, inland and coastal water analysis, cryosphere monitoring, anthropogenic greenhouse emissions, and both natural and anthropogenic hazards. Despite the considerable advantages of satellite hyperspectral imaging, such as data availability, systematic global coverage, and low-cost acquisition, these platforms face operational limitations. First, the revisit cycle under nadir viewing conditions is generally long: Hyperion operated with a 16-day repeat interval, while PRISMA and EnMAP have revisit cycles of 29 and 27 days, respectively [36,37,38,39]. For the monitoring of localised pollution events, such as oil spills of small dimensions, these revisit times are often impractical for effective detection and timely response. Second, spatial resolution also poses a significant constraint. Most current missions operate with a ground sampling distance (GSD) of 30 × 30 m [37,38,40,41]. Previous studies have already pointed out that this spatial resolution is often insufficient for fine-scale oil spill detection in confined or small regions [20]. Additionally, ORS imagery acquired from spaceborne platforms is highly affected by atmospheric conditions, particularly cloud cover. As highlighted by Hu et al.
[13], up to two-thirds of the global ocean may be cloud-covered at any given time, significantly limiting the availability of usable optical data. These temporal, spatial, and atmospheric limitations reduce the suitability of satellite hyperspectral systems for continuous or near-real-time monitoring of localised contamination, such as oil spills in local areas.
In this context, hyperspectral sensors mounted on unmanned aerial vehicles (UAVs) effectively overcome many of the limitations inherent to satellite-based and manned airborne platforms. These UAV-HSI systems combine high spectral resolution with ultra-high spatial resolution, with pixel sizes decreasing from metres to just a few centimetres depending on flight altitude. In addition, they offer low operational costs, flexible deployment, and the capacity for on-demand missions.
Nevertheless, regardless of the sensing platform, HSI inherently involves extremely high data dimensionality, strong inter-band correlation, and significant spectral redundancy, factors that together give rise to the so-called curse of dimensionality. This phenomenon, also known as the Hughes effect [42], poses major challenges for traditional processing techniques in the classification of hyperspectral images [43], requiring more sophisticated approaches capable of extracting meaningful patterns and reducing data complexity.
In recent years, the most significant advance in HSI, and the main reason behind its consolidation as a leading remote sensing technology, has been the ability to directly process complete hyperspectral images, eliminating the need to pre-identify spectral features. The rise of deep learning (DL) techniques [44] has been a key factor in this change, as it enables accurate classification and detection tasks to be performed on HSI data. Among these techniques, convolutional neural networks (CNNs) [45] have become particularly relevant due to their ability to extract deep semantic features from the three-dimensional spectral–spatial structure of hyperspectral data [46,47]. For instance, Hu et al. [48] proposed a 1DCNN focused exclusively on the spectral domain, which treats each spectral pixel as a one-dimensional curve. To incorporate spatial information, Yang et al. [49] applied a 2DCNN-based framework enhanced with multiscale feature extraction. Wang et al. [50] proposed a hybrid model integrating both 1DCNN and 2DCNN to extract spectral and spatial features simultaneously. Additionally, Zhu et al. [51] and Kang et al. [52] evaluated CNNs for oil film classification under challenging conditions such as cloud shadows, high glint, and varying oil thicknesses. Although paradigms beyond CNNs, such as attention-based (transformer) [53,54,55] and graph-based [56] approaches, have emerged and attracted increasing interest in recent years, CNNs remain a widely adopted and operationally attractive family.
However, the performance of CNN-based classifiers is highly influenced by the definition of the architecture and the selection of hyperparameters, as noted by Kang et al. [52] and Hu et al. [48] in their work. Despite their acknowledged importance, such design aspects remain insufficiently explored in the context of hyperspectral image analysis. This gap is particularly relevant in oil spill detection, where variations in spatial–spectral patterns demand robust and well-optimised models. Addressing this need, the present study proposes a systematic evaluation of 3DCNN configurations to quantify how architectural choices, hyperparameter settings, and training-set size influence performance in a multi-class classification task for oil-contaminated water in hyperspectral imagery and to identify performance-improving directions across this combined space. The analysis focuses on key architectural components known to influence learning dynamics and generalisation in high-dimensional data domains, such as HSI. First, the effect of applying dimensionality reduction through convolutional operations versus downsampling through max pooling is examined in order to assess how different feature compression strategies affect the ability of the network to retain meaningful spectral–spatial information. Second, the number of convolutional kernels is varied to investigate how feature extraction capacity adapts to model depth. Third, the neuron density in the fully connected layers is adjusted to explore its influence on high-level feature integration and classification decision boundaries. Finally, different dropout rates are applied after the dense layers to evaluate their role in controlling overfitting and improving model robustness in scenarios with limited data.
This specific analysis provides insight into how each architectural element contributes to the overall effectiveness of the model and its resilience to overfitting, factors that are especially relevant in high-dimensional domains such as HSI. In addition to network architecture, the study explores how model performance is affected by the amount of training data, evaluating each configuration under two dataset size scenarios. Using a lightweight VNIR hyperspectral camera suitable for use on UAVs, these datasets were generated as a first step towards the development of a real-time local monitoring system for environmental protection. Assessing CNN performance under different data availability conditions is particularly relevant in remote sensing applications, where environmental, technical, or economic limitations often constrain data acquisition. Furthermore, it provides insight into how CNN architecture and hyperparameters scale with dataset size, offering practical guidance for optimising model design. This dual evaluation ultimately contributes to identifying model designs that are both effective and adaptable, supporting future implementations of real-time local hyperspectral imaging systems for oil spill detection and broader environmental monitoring, particularly in the context of smart and sustainable cities [57].
The remainder of this paper is organised as follows. Section 2 presents the materials used in this study, including a description of the acquisition setup and the datasets generated. Section 3 describes the methods, detailing the CNN architectures, hyperparameter configurations, performance metrics, and the experimental procedure. Section 4 reports the results obtained and provides a comparative analysis across scenarios, followed by their discussion. Finally, Section 5 summarises the main conclusions and outlines directions for future research.

2. Materials

This section presents the equipment, datasets, and preprocessing steps involved in the development of this research. The study used a compact snapshot-based hyperspectral camera capable of acquiring spectral data with spatial and spectral fidelity. Hyperspectral images of clean water and water contaminated with different hydrocarbons were acquired under controlled outdoor conditions and preprocessed to generate datasets suitable for training and evaluating CNN models. An overview of the data acquisition and preprocessing workflow is provided in Figure 2.

2.1. Hyperspectral Camera and Data Acquisition Format

The hyperspectral images used in this study were acquired using the Cubert ULTRIS X20 Plus (Cubert Hyperspectral, Ulm, Germany), a compact, lightweight, and snapshot-based hyperspectral camera optimised for integration into UAVs. As a snapshot system, the camera captures the full hyperspectral cube in a single frame, without the need for line scanning or mechanical movement, making it particularly well-suited for dynamic scenarios such as UAV-based environmental monitoring.
The camera integrates two independent image sensors: a panchromatic sensor with a fine spatial resolution of 1886 × 1886 pixels, used for pansharpening to improve spatial detail, and a spectral sensor with a native resolution of 410 × 410 pixels that acquires 164 contiguous spectral bands in the VNIR range (400–1000 nm).
The 35° diagonal field of view (FOV) of the spectral sensor covers an area of approximately 26.5 × 26.5 m when operated at 60 m above ground level (AGL), a typical altitude for local environmental monitoring missions using UAVs. This corresponds to a ground sampling distance (GSD) of approximately 6.5 cm/pixel, resulting in a spatial coverage of 41.8 cm² per pixel. These characteristics provide a favourable balance between spatial resolution and coverage area, making the system highly suitable for the detection of localised oil spills. Furthermore, the pixel size can be adapted to specific monitoring needs by varying the flight altitude, down to a few centimetres, which is ideal for local applications in coastal zones, port areas, and other sensitive aquatic environments.
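As a cross-check of these figures, the footprint and GSD can be derived from the stated flight altitude and diagonal FOV. The sketch below assumes a square sensor imaged at nadir; the helper name `footprint_and_gsd` and these geometric assumptions are ours, not taken from the paper:

```python
import math

def footprint_and_gsd(altitude_m, diag_fov_deg, px_across=410):
    """Ground footprint side (m) and GSD (m/pixel) at nadir for a
    square sensor with the given diagonal field of view."""
    diag = 2.0 * altitude_m * math.tan(math.radians(diag_fov_deg / 2.0))
    side = diag / math.sqrt(2.0)  # side of a square with that diagonal
    return side, side / px_across

side, gsd = footprint_and_gsd(60.0, 35.0)
# side comes out near 26.5 m and gsd near 6.5 cm, matching the text
```

Under these assumptions, the computed values agree with the quoted 26.5 × 26.5 m footprint and ~6.5 cm/pixel GSD to within rounding.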
Regarding the acquisition format, the camera initially captures data in digital number (DN) format. The conversion to reflectance is automatically handled by the associated Cubert software, which applies factory calibration parameters together with user-acquired reference measurements taken under the same light conditions as the hyperspectral images. This conversion is performed according to Equation (1), where I_raw is the raw intensity measured from the scene, I_white is the white reference image, and I_dark is the dark reference image. Since the solar intensity varies across wavelengths [22], a Lambertian reflectance standard (Spectralon panel) was used as a white reference to correct the acquired spectra and standardise reflectance across the entire spectral range. On the other hand, the dark reference is captured with the lens covered to eliminate dark current and electronic noise (offset), thereby correcting the signal baseline. This calibration process is critical for ensuring the radiometric accuracy and comparability of the hyperspectral data, particularly when operating under field conditions where lighting and environmental parameters may vary.
Reflectance = (I_raw − I_dark) / (I_white − I_dark)        (1)
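In practice this conversion is handled by the Cubert software, but Equation (1) is straightforward to apply per pixel and per band. A minimal NumPy sketch (the helper name `to_reflectance` and the `eps` guard against division by zero are our additions):

```python
import numpy as np

def to_reflectance(raw, white, dark, eps=1e-9):
    """Apply Equation (1) element-wise.

    raw, white, dark: arrays of identical shape (rows, cols, bands),
    holding the scene DNs, the white reference, and the dark reference.
    eps guards against division by zero in saturated or dead pixels.
    """
    return (raw - dark) / (white - dark + eps)
```

A pixel matching the white reference maps to approximately 1 and one matching the dark reference maps to 0, so well-exposed scenes fall in the [0, 1] range.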

2.2. Hyperspectral Dataset Acquisition and Preprocessing

A custom-built hyperspectral dataset was acquired under controlled outdoor conditions to develop and evaluate the proposed classification framework. The experimental setup simulated four representative scenarios: clean water, water mixed with gasoil, water mixed with C10, and water mixed with fuel oil. These hydrocarbons were selected due to their frequent occurrence in port environments and coastal areas associated with refining activities, maritime transport, and bunkering operations.
Each class was physically modelled using a dedicated container, resulting in four spectrally distinct targets. The experimental design and the dimensions of the containers (2 m × 1 m × 0.5 m) ensured that, in each container, the hydrocarbon was heterogeneously distributed throughout the water volume, generating different concentration levels representative of realistic spill conditions. Contamination of the entire water surface was ensured by manually spreading the hydrocarbon. To limit evaporation and wind effects that could alter highly diluted hydrocarbon regions, a controlled time window was maintained during the acquisition process. Figure 3 provides an RGB illustration of the four experimental scenarios. It is included as a visual reference of the laboratory setup and the macroscopic appearance of each class; however, it should be noted that highly diluted hydrocarbon regions may not be readily distinguishable to the naked eye, which motivates the use of hyperspectral information for reliable discrimination.
Data acquisition was conducted outdoors under natural sunlight, requiring a careful calibration procedure to ensure radiometric consistency and spectral reliability across all hyperspectral scenes. Prior to image acquisition, a series of calibration and configuration steps were performed, including the adjustment of the integration time and the acquisition of white and dark reference images, necessary to convert raw DN into reflectance values according to Equation (1). The integration time was manually configured to maximise the signal strength while avoiding sensor saturation, thereby ensuring optimal use of the sensor’s dynamic range under ambient lighting conditions. This step is particularly critical, as the integration time is the most sensitive parameter when operating the sensor under ambient light. Once the optimal settings were established, a white reference was acquired using a Spectralon panel under direct sunlight in order to normalise the reflectance of the scene and compensate for the spectral distribution of the light source, and a dark reference was captured with the lens cap on to remove electronic noise and dark current from the signal. In future UAV-based deployments, intermittent landings will need to be programmed to reacquire the white reference, especially in long-duration or multi-flight missions where lighting conditions may change, in order to ensure consistent and accurate reflectance measurements throughout the monitoring operation.
Once the calibration was complete, the hyperspectral data were acquired under stable ambient lighting conditions. A total of 20 hyperspectral images were acquired for each class: clean water, water mixed with gasoil, water mixed with C10, and water mixed with fuel oil, resulting in a total of 80 raw hypercubes representative of clean water and hydrocarbon-contaminated water (Figure 2). To generate a usable dataset suitable for model training and evaluation, each hyperspectral image was subsequently divided into 81 patches. This patch-based strategy increases the number of training samples while preserving the spatial–spectral structure of the data and ensuring sufficient intra-class variability, which favours robust model generalisation.
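The patching step above can be sketched with plain NumPy reshapes. Since the paper does not specify the exact patch geometry, the sketch below assumes non-overlapping patches on a 9 × 9 grid (81 patches per image), with the cube cropped to the nearest multiple of the grid size; `extract_patches` is illustrative only:

```python
import numpy as np

def extract_patches(cube, grid=9):
    """Split an (H, W, B) hypercube into grid*grid non-overlapping
    patches, cropping H and W down to the nearest multiple of `grid`.
    Returns an array of shape (grid*grid, H//grid, W//grid, B)."""
    h, w, b = cube.shape
    ph, pw = h // grid, w // grid
    cube = cube[: ph * grid, : pw * grid, :]
    return (cube.reshape(grid, ph, grid, pw, b)
                .transpose(0, 2, 1, 3, 4)
                .reshape(grid * grid, ph, pw, b))
```

For the 410 × 410 spectral frames described in Section 2.1, this yields 81 patches of 45 × 45 pixels per hypercube, i.e., 81 × 80 = 6480 patches in the full pool.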

3. Methods

The methodological framework of this study is designed to evaluate the effectiveness of deep learning paradigms in identifying oil spills within HSI data. Specifically, the experimental design focuses on the systematic analysis of CNN architectural configurations and hyperparameter settings, exploring how these factors influence classification performance under different data availability scenarios.
CNNs have become the cornerstone of image analysis due to their ability to extract hierarchical feature representations directly from raw data. This paradigm shift in computer vision is deeply rooted in the seminal contribution of [45], whose pioneering research on backpropagation and multi-layered convolutional architectures established the fundamental principles of modern DL, effectively revolutionising the automated analysis of remote sensing imagery. The success of CNNs lies in the convolutional layers, which function as trainable kernels that scan the input data to detect local patterns. At the initial stages of the network, these kernels are capable of capturing basic structures such as edges, gradients, or textures. As depth increases, the network composes these simple elements into increasingly abstract and complex representations that are relevant to the target classification task. This ability to automatically extract and refine features across multiple layers has made CNNs particularly effective in domains where high-dimensional data prevails, such as HSI. In this context, the spectral dimension adds an additional level of complexity, making the combined extraction of spatial and spectral patterns critical for accurate classification. Three-dimensional CNNs (3DCNNs) address this challenge by performing spatial and spectral feature extraction across the entire hypercube, allowing correlations between bands to be preserved and more discriminative representations to be obtained.
Based on this theoretical foundation, the methodological implementation of this study is structured around four key components, as illustrated in Figure 4. First, the architectural design of the 3DCNN models is presented, detailing the structural differences explored in terms of depth, convolutional strategy, and dimensionality reduction (Section 3.1). Next, the systematic grid search for hyperparameter optimisation is described (Section 3.2), focusing on parameters that critically affect learning capacity and regularisation. The training protocol, including dataset partitioning, evaluation strategy, and statistical replication, is described in Section 3.3 to ensure robustness and comparability. Finally, Section 3.4 presents the performance metrics used to evaluate classification accuracy and operational reliability.

3.1. CNN Architectural Design

Two distinct 3DCNN architectures were designed to evaluate how network depth and feature abstraction influence the classification of oil-contaminated water in hyperspectral images. The use of 3D convolutions allows for the simultaneous extraction of spatial and spectral features, preserving the intrinsic correlations in the three dimensions of the hyperspectral cube.
  • Architecture 1—Arc1 (Deep Configuration): This model consists of four sequential 3D convolutional layers, each followed by a ReLU activation function [4 × 3DCNN + ReLU]. The spatial kernel size is fixed at 3 × 3 across all layers, while the spectral depth progressively decreases throughout the network: [130, 30, 5, 1]. This design allows for progressive compression and hierarchical feature extraction in the spectral domain, while maintaining spatial resolution. Following the convolutional blocks, the architecture includes two fully connected layers: FC1 is followed by both ReLU activation and dropout regularisation, while FC2 only includes dropout. A final fully connected layer (FC3) is used for classification, with a softmax layer producing the class probabilities. This deep configuration aims to evaluate the advantages of deeper spectral feature extraction on model performance. The overall structure can be summarised as: [4 × (3DCNN + ReLU)] − [FC1 + ReLU + DO1] − [FC2 + DO2] − [FC3 + Softmax].
  • Architecture 2—Arc2 (Pooling Configuration): This alternative design replaces the fourth convolutional layer with a 3D max pooling operation to reduce spatial–spectral resolution. The three convolutional layers again use 3 × 3 spatial kernels, with spectral depths of [100, 30, 30], followed by a max pooling layer with a kernel size of [2, 2, 2] and a stride of [3, 3, 3]. The objective of this configuration is to evaluate whether spatial–spectral downsampling through pooling can approximate the feature abstraction achieved in deeper convolutional models while reducing complexity. The fully connected stage is identical to Arc1: FC1 includes ReLU activation and dropout, and FC2 applies dropout only, followed by the classification layer and softmax. The configuration can be summarised as: [3 × (3DCNN + ReLU)] − [MP] − [FC1 + ReLU + DO1] − [FC2 + DO2] − [FC3 + Softmax].
In both architectures, ReLU activations are employed after each convolutional layer and after the first FC layer to introduce non-linearity into the network and avoid vanishing gradients, thus improving convergence and enabling the learning of complex spectral and spatial features. Additionally, dropout is applied after each fully connected layer to mitigate overfitting and encourage the learning of robust and generalisable features. This design allows for a systematic comparison of deep versus shallow convolutional strategies, with or without pooling, and their effectiveness in extracting meaningful patterns from high-dimensional hyperspectral data.
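The effect of the two dimensionality-reduction strategies on the spectral axis can be traced with simple output-size arithmetic, starting from the 164 VNIR bands of the ULTRIS X20 Plus. The sketch assumes stride-1 'valid' convolutions (the paper does not state the stride or padding, so this is our assumption):

```python
def conv_out(size, kernel, stride=1):
    """Output length along one dimension for a 'valid' convolution
    (or pooling window) of the given kernel size and stride."""
    return (size - kernel) // stride + 1

# Arc1: four convolutions with spectral kernel depths [130, 30, 5, 1]
arc1 = 164
for k in [130, 30, 5, 1]:
    arc1 = conv_out(arc1, k)            # 164 -> 35 -> 6 -> 2 -> 2

# Arc2: three convolutions [100, 30, 30], then max pooling
# with kernel 2 and stride 3 along the spectral axis
arc2 = 164
for k in [100, 30, 30]:
    arc2 = conv_out(arc2, k)            # 164 -> 65 -> 36 -> 7
arc2 = conv_out(arc2, 2, stride=3)      # pooling: 7 -> 2
```

Under these assumptions, both designs compress the spectral axis to the same final length, illustrating how pooling can stand in for a fourth convolutional layer at lower computational cost.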

3.2. Hyperparameter Optimisation and Grid Search

To identify the optimal hyperparameter configurations for CNNs in the context of oil spill detection, a comprehensive grid search was performed across a multidimensional hyperparameter space. The evaluation focused on key architectural elements known to significantly affect model performance, especially in high-dimensional domains such as HSI. Specifically, the grid search considered:
  • Number of convolutional kernels (NKs): Three configurations (NK1, NK2, and NK3) were defined to examine how varying the number of filters in each convolutional layer affects the capacity of the network to extract meaningful spatial–spectral features. A higher number of kernels increases the representational power of the network, which could improve its ability to detect subtle variations in oil spill patterns, but also increases computational complexity and risk of overfitting.
  • Neuron density of the fully connected layer (N): Three levels of neuron density (N1, N2, N3) were tested in the dense layers after the convolutional block in order to evaluate how it influences the integration and abstraction of features extracted in the convolutional stages.
  • Dropout rate (DO): The regularisation strength was varied across three dropout intensities (DO1, DO2, DO3), applied after the fully connected layers to control overfitting and improve generalisation.
The complete set of hyperparameter combinations evaluated for each of the two CNN architectures (Arc1 and Arc2) is summarised in Table 1. For each architecture, a total of 27 configurations (3 kernel counts × 3 neuron densities × 3 dropout rates) were tested, resulting in 54 unique models across both architectures. This comprehensive evaluation provides a solid foundation for understanding how individual hyperparameters and their interactions influence classification performance under different network designs.
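The grid itself is straightforward to enumerate. The sketch below uses the symbolic level names from the text (NK1–NK3, N1–N3, DO1–DO3), since the concrete values from Table 1 are not reproduced here:

```python
from itertools import product

kernels = ["NK1", "NK2", "NK3"]      # number of convolutional kernels
neurons = ["N1", "N2", "N3"]         # fully connected neuron density
dropouts = ["DO1", "DO2", "DO3"]     # dropout rate
architectures = ["Arc1", "Arc2"]

# Cartesian product: 2 architectures x 3 x 3 x 3 = 54 configurations
configs = [
    {"arch": a, "nk": k, "n": n, "do": d}
    for a, k, n, d in product(architectures, kernels, neurons, dropouts)
]
```

Each entry in `configs` corresponds to one of the 54 unique models evaluated in the study, 27 per architecture.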

3.3. Training Protocol and Statistical Robustness

One of the main challenges of HSI analysis is the limited availability of labelled samples, particularly in application-driven contexts such as oil spill monitoring. To explicitly evaluate the robustness of the proposed CNN configurations under limited data conditions, model training was performed using two training set sizes: a low-sample regime of 259 samples and an expanded regime of 518 samples. These values were intentionally selected to represent difficult learning conditions where DL models typically exhibit lower classification performance. In such scenarios, architectural and hyperparameter design choices can exert a greater influence on model outcomes. Preliminary experiments showed that using larger training sets led to consistently high accuracy across configurations, making it harder to discriminate between their relative effectiveness.
For each case, patches were randomly sampled from the extracted pool to build the training, validation, and test sets; therefore, the evaluation reflects performance within the same acquisition domain. The training sets were constructed in a class-balanced manner to ensure equitable representation of all four classes. Internal validation was performed using an additional set of the same size as the training set, which was also balanced. The remaining available samples from the complete dataset were reserved for external testing of the trained models. This three-way data partitioning allowed for the evaluation of generalisation both within (through validation) and beyond (through test set) the training regime.
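The class-balanced three-way partitioning can be sketched as follows. This is a minimal illustration under stated assumptions: the per-class counts and pool size are toy values, not the study's 259/518-sample regimes, and `balanced_split` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def balanced_split(labels, n_train_per_class, n_val_per_class, rng):
    """Draw class-balanced train/validation index sets; the remaining
    samples form the external test set.

    `labels` is a 1-D array of integer class ids (e.g. 0..3 for water
    and the three hydrocarbon types).
    """
    train, val = [], []
    for c in np.unique(labels):
        # Shuffle the indices of class c, then take disjoint slices.
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[:n_train_per_class])
        val.extend(idx[n_train_per_class:n_train_per_class + n_val_per_class])
    test = np.setdiff1d(np.arange(len(labels)), np.concatenate([train, val]))
    return np.array(train), np.array(val), test

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 200)  # toy pool: 200 patches per class
train_idx, val_idx, test_idx = balanced_split(labels, 65, 65, rng)
```

The validation set is drawn with the same per-class count as the training set, mirroring the equal-size balanced validation described above.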
To mitigate the influence of random effects associated with weight initialisation and batch composition, each of the 54 distinct CNN configurations resulting from the grid search was trained using five independent runs with different random seeds. This procedure allows the average performance metrics and their variability to be computed for each configuration, providing insight into both accuracy and stability. Experimental observations revealed low variability across runs for the tested configurations. Therefore, five repetitions were considered sufficient to obtain reliable and statistically consistent performance estimates while maintaining the computational feasibility of the exhaustive hyperparameter evaluation. This strategy ensures a balance between statistical robustness and experimental tractability while preserving comparability across all CNN configurations and dataset scenarios.
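The repetition protocol amounts to a small loop over seeds followed by summary statistics. The sketch below uses a stand-in evaluation function (`fake_run`, purely synthetic) in place of an actual CNN training run; only the aggregation logic reflects the protocol described above.

```python
import numpy as np

def repeated_accuracy(train_and_eval, seeds):
    # Train and evaluate once per seed, then summarise the runs:
    # mean accuracy (AA), its standard deviation across runs, and the
    # best single run (the maximum OA reported per configuration).
    accs = np.array([train_and_eval(seed) for seed in seeds])
    return accs.mean(), accs.std(ddof=0), accs.max()

def fake_run(seed):
    # Synthetic stand-in for "train a CNN with this seed and return its
    # test accuracy"; deterministic given the seed.
    return 0.97 + 0.005 * np.random.default_rng(seed).random()

aa, sd, oa = repeated_accuracy(fake_run, seeds=range(5))
```

Reporting both the mean and the standard deviation separates the "typical" performance of a configuration from its sensitivity to stochastic training effects.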

3.4. Performance Evaluation Metrics

The performance of the model was evaluated using a set of complementary metrics focused on assessing both classification accuracy and robustness. For each CNN configuration, the average classification accuracy (AA) and its associated standard deviation were computed across the five independent training runs. This dual analysis allows not only the identification of high-performance configurations but also the evaluation of their stability under stochastic training effects. In addition, the maximum overall accuracy (OA) achieved across the repetitions was also reported, as it provides insight into the best achievable performance of each architectural configuration in each dataset size scenario.
Complementary to the AA and OA, a detailed evaluation of the classification performance was carried out using class-wise performance metrics derived from confusion matrices. Given the safety-critical nature of oil spill monitoring, it is essential to balance high detection capability with the prevention of false alarms. Therefore, the discriminatory performance of each configuration was quantified using sensitivity (recall), specificity, and precision:
  • Sensitivity (Recall) measures the ability of the model to correctly identify oil-contaminated pixels, representing the true positive rate. High sensitivity is crucial in environmental monitoring to ensure that oil spills are not overlooked (Equation (2)).
Sensitivity = TP / (TP + FN)    (2)
  • Specificity assesses the accuracy in identifying clean water surfaces, which corresponds to the true negative rate. This metric is particularly important for reducing false positives that could trigger unnecessary mitigation actions (Equation (3)).
Specificity = TN / (TN + FP)    (3)
  • Precision quantifies the proportion of oil-contaminated pixels correctly identified among all pixels predicted as oil (Equation (4)).
Precision = TP / (TP + FP)    (4)
Equations (2) to (4) describe the calculation of these metrics, where true positives (TP) and true negatives (TN) are correctly classified cases, while false negatives (FN) and false positives (FP) represent the two types of classification errors.
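Equations (2) to (4) extend to the four-class case through one-vs-rest counts taken directly from the confusion matrix. The sketch below assumes rows index true classes and columns predicted classes; the example matrix is illustrative (it mimics the fuel-oil-misclassified-as-diesel pattern discussed in the Results section, not actual reported counts).

```python
import numpy as np

def classwise_metrics(cm):
    """Per-class sensitivity, specificity, and precision from a confusion
    matrix `cm` (rows = true classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # instances of the class predicted as something else
    fp = cm.sum(axis=0) - tp   # other classes predicted as this one
    tn = cm.sum() - tp - fn - fp
    return {
        "sensitivity": tp / (tp + fn),  # Equation (2)
        "specificity": tn / (tn + fp),  # Equation (3)
        "precision":   tp / (tp + fp),  # Equation (4)
    }

# Illustrative 4-class matrix: water, crude oil, diesel, fuel oil.
cm = np.array([[50,  0,  0,  0],
               [ 0, 48,  2,  0],
               [ 0,  1, 49,  0],
               [ 0,  0,  3, 47]])
m = classwise_metrics(cm)
```

In this toy matrix, water (class 0) reaches sensitivity 1.0, whereas fuel oil (class 3) loses three instances to diesel, lowering its sensitivity to 0.94.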
All metrics were calculated for each of the five training repetitions and then averaged to obtain a representative and robust estimate of the performance of each of the 54 CNN configurations under both dataset size scenarios. The combined analysis of AA, variability, OA, and class-wise metrics allows for a comprehensive comparison of how CNN depth, filter size, dropout rate, and fully connected layer size influence trade-offs between detection sensitivity, classification reliability, and robustness. A detailed analysis of these results and the implications of the observed trade-offs is presented in the Results and Discussion section.

4. Results and Discussion

The proposed methodology for classifying water and the three hydrocarbon types, in which two architectures were trained across 54 configurations and two training set sizes, yielded promising results in terms of classification metrics and generalisation capability. In this section, the main outcomes are presented, with emphasis on the most relevant findings.
The best-performing replicate evaluated on the test set for each configuration, architecture, and training set size is presented in Figure 5, based on OA. This allows for a comparison of the models that achieve the highest and lowest accuracy in all experimental conditions, underlining the effect of both the network design and the amount of training data on the classification performance. Figure 5 displays four 3D plots, each corresponding to a specific combination of network architecture and training set size, as follows: (a) Architecture 1 with 259 samples, (b) Architecture 1 with 518 samples, (c) Architecture 2 with 259 samples, and (d) Architecture 2 with 518 samples. The number of neurons in the fully connected layers, the number of convolutional kernels, and the dropout percentages applied are represented along the x-axis, y-axis, and z-axis, respectively, in Figure 5. This arrangement results in 27 different configurations for each architecture. For each configuration, the accuracy value of the best-performing model in the five replicates (OAs) is represented by a sphere. The size and colour of each sphere, according to the colourbar, reflect the level of accuracy: larger spheres with warmer colours mean higher accuracy, while smaller spheres with cooler colours represent lower performance. To facilitate the presentation and interpretation of the results, the hyperparameter options are codified as indicated in Table 1: N1, N2, and N3 correspond to the number of neurons in the fully connected layer; NK1, NK2, and NK3 refer to the number of convolutional kernels; and DO1, DO2, and DO3 denote the percentage of dropout in the dropout layers.
At first glance, two key conclusions emerge. First, for a given network architecture, increasing the size of the training dataset results in significantly higher accuracy. For Arc1, OA rose from a range of 0.70–0.95 with 259 training samples (Figure 5a) to 0.84–0.98 with 518 samples (Figure 5b). The same pattern holds for Arc2, where accuracy increased from 0.80–0.98 with the smaller dataset (Figure 5c) to 0.97–0.99 with the larger one (Figure 5d). This outcome was to be expected and reaffirms a fundamental principle of deep learning: the more (high-quality) examples, the better the performance.
Second, based on the above-mentioned accuracy ranges, the next main conclusion can be drawn: Arc2 consistently outperformed Arc1. Remarkably, the accuracy range achieved by Arc2 with the smaller training set (259) (Figure 5c) was the same as that achieved by Arc1 with the larger set (518) (Figure 5b). This suggests that replacing a convolutional layer with a max pooling layer, as implemented in Arc2, contributes to more efficient feature extraction than maintaining four convolutional layers, as in Arc1. This modification improves overall performance, even with fewer training examples. Furthermore, this finding highlights the fact that the additional convolutional filtering in Arc1 may not only fail to improve feature extraction but may also introduce unnecessary complexity, capturing redundant or irrelevant patterns that hinder the generalisation capability of the model. In contrast, max pooling decreases dimensionality, emphasising the most important features and allowing for a more robust and efficient learning process, especially under limited training data conditions.
As a result of both observations, these findings explain why Arc2, trained with the larger dataset, produced the best-performing models, achieving near-perfect performance with OA values exceeding 0.99 in the classification of water and the three hydrocarbon types (Figure 5d). Specifically, the highest-performing model corresponds to the [N1-NK1-DO1] configuration, followed closely by [N2-NK3-DO2] and [N1-NK1-DO3] (Figure 5d), with only marginal differences in OA, on the order of ten-thousandths.
These results suggest that a lower number of kernels in convolutional layers (NK1) can lead to optimal performance when combined with a higher number of neurons in the fully connected layers (N1). In this case, the reduced number of feature maps extracted by the convolutional blocks appears to demand greater representational capacity in the classification stage, allowing the fully connected layers to effectively capture the underlying spatial–spectral relationships. In contrast, increasing the number of convolutional kernels (NK3) enables a reduction in the size of the fully connected layers (N2) without compromising classification performance. Notably, the larger number of convolutional kernels did not lead to remarkable performance. This finding indicates a trade-off between the representational capacity of the convolutional and dense layers: when the feature extractor is more compact, a richer classification stage is needed to compensate; when feature extraction is more expressive, fewer neurons may suffice to maintain high accuracy.
This pattern is also observed in the opposite direction when analysing the configurations that yielded the lowest performance. In particular, the poorest results, OA values of 0.971 and 0.974, were obtained when a small number of convolutional kernels (NK1) was combined with a low number of neurons in the fully connected layers (N2 and N3). This outcome suggests a limited capacity of the classification stage to process the extracted spatial–spectral features effectively.
A similar drop in accuracy was found when higher numbers of convolutional kernels (NK2 or NK3) were combined with the largest number of neurons (N1). In these cases, the increased complexity of the network does not translate into better performance, potentially due to redundant or overly detailed feature representations, which may lead to overfitting or ineffective learning. Despite these declines, it is important to emphasise the overall robustness of the proposed methodology, as even the lowest-performing configurations achieved relatively high classification accuracy.
Regarding the dropout rate, no definitive conclusion can be reached from the analysis of high-performing models. Both high and low dropout settings yielded high accuracies, suggesting that, under the tested conditions, its influence may not be particularly critical. However, an examination of the poorest-performing models reveals that higher dropout percentages tend to alleviate performance issues, likely by mitigating overfitting when the architecture is not well aligned with the complexity of the data. This implies that, although dropout may not be a critical factor in optimal models, it plays a more significant role in protecting against performance degradation when the network design is not ideal. This effect is exemplified by the [N1-NK2] configuration, which achieved an accuracy of 0.988 when combining the largest neuron density and number of kernels with the highest dropout rate (DO1). In contrast, accuracy dropped by nearly two percentage points when lower dropout rates (DO2 and DO3) were applied.
This individual analysis of hyperparameters also offers valuable insights into the role of neuron density in the fully connected layers. A clear trend is observed in the distribution of the best-performing models: architectures with a higher number of neurons (N1) consistently outperformed those with lower neuron counts, indicating that greater representational capacity in the classification stage contributes positively to overall performance.
The findings from the analysis of Arc2 models trained with the largest dataset become even more pronounced when the training set is reduced to 259 samples (Figure 5c), where two distinct trends can be observed. The first corresponds to the best-performing models, which are consistently associated with higher neuron densities in the fully connected layers and higher dropout rates, suggesting that increased representational capacity and regularisation are key to maintaining accuracy under data-scarce conditions. The second trend, affecting the poorest-performing models, follows the opposite pattern: configurations with fewer neurons and lower dropout rates exhibit significantly reduced classification accuracy. The limited training data also reinforces the representational limitations of the N3 configuration, as all models using the lowest number of neurons in the dense layers produced accuracies below 0.915, regardless of the other hyperparameter values. Similarly, the use of the largest number of kernels (NK2) also led to poor-performing configurations. This indicates that, under limited data availability, the extraction of an excessive number of feature maps may introduce noise or redundancy, overwhelming the capacity of the network to generalise effectively. In such cases, the model may struggle to discern meaningful patterns, leading to reduced accuracy and increased sensitivity to overfitting.
Beyond the general trends, the analysis reveals that classification performance is not solely dictated by extreme values of individual hyperparameters, but rather by the balance and interaction among them. For example, one of the highest-performing configurations in Arc2 trained on the larger dataset was [N2-NK3-DO2] (Figure 5d), which combined a moderate number of convolutional kernels and neurons with an intermediate dropout rate. In contrast, when this balance is disrupted, performance tends to deteriorate. This is exemplified by the [N2-NK1-DO2] configuration (Figure 5d), which shares the same neuron density and dropout rate as the previous case, but replaces the moderate number of kernels (NK3) with the smallest tested setting (NK1). This reduction in convolutional kernels likely limited the richness of the extracted feature maps, weakening the ability of the model to generalise and leading to a significant drop in classification accuracy. This underscores the importance of synergy between architectural components, rather than relying on strong individual settings.
Finally, Arc1 trained with 259 samples yielded the most modest results, with a maximum OA of only 0.958, reached under the [N1-NK2-DO2] configuration. This highlights the combined effect of limited training data and a less efficient architectural design, reinforcing the comparative advantage of Arc2 in both performance and robustness across dataset scenarios.
This analysis is further complemented by the results shown in Figure 6, which report the average accuracy and the standard deviation obtained from the five replicates of each configuration. Together, these findings provide a comprehensive overview of the individual and combined influence of the hyperparameters, as well as the consistency and stability of model performance across repetitions. Figure 6 also presents four 3D plots, although in this case, the models are grouped according to dropout rate: blue corresponds to 60% dropout, purple to 40%, and orange to 20%. The x-axes and y-axes remain the same as in Figure 5, representing the number of neurons in the fully connected layers and the number of kernels in the convolutional layers, respectively. The z-axis displays the value of the average accuracy (solid line) and standard deviation (dashed line) obtained across the five training replicates for each configuration.
Several insights emerge from Figure 6. First, the standard deviation remains relatively low across all configurations, not exceeding 0.2 in any case. This reflects a high level of stability in the training process and suggests that the influence of weight initialisation or batch variations is minimal, even under constrained data regimes. Moreover, the variability tends to decrease as the number of training samples increases (right panels), confirming that data availability not only boosts performance but also enhances convergence reliability. Second, when comparing architectures, Arc2 (bottom row) again demonstrates more consistent and higher AA than Arc1 (top row), especially when trained with 518 samples (Figure 6d). This supports the conclusion that the pooling-based reduction strategy implemented in Arc2 provides a more efficient feature abstraction pathway for this HSI classification task. In addition, the plots reveal that excessive regularisation, specifically the 60% dropout rate (blue), can significantly impair model learning under low-sample conditions, as seen in several configurations in Figure 6a,c. This suggests the existence of a threshold beyond which dropout ceases to improve generalisation and instead suppresses the learning of minority-class patterns such as oil slicks.
Overall, this illustration not only reinforces the trends observed in Figure 5 but also confirms the robustness of the proposed framework. The combination of high AA, low standard deviation, and consistent performance across diverse hyperparameter settings and training sizes underscores the effectiveness and stability of the models, particularly those based on Arc2.
Focusing on the best-performing configurations identified in the previous analysis, Figure 7 provides a class-level assessment of the classification behaviour through confusion matrices and associated performance metrics. While OA and AA offer a global measure of model performance, they do not fully capture how effectively each class is identified, nor the nature of misclassifications. In the context of oil spill monitoring, it is particularly important to understand whether high accuracy is achieved at the expense of missed detections or false alarms. For this reason, the models are further analysed in terms of their ability to correctly detect contaminated water, avoid false positives over clean water, and reliably distinguish between different hydrocarbon types. This detailed evaluation is carried out using sensitivity (recall), specificity, and precision, enabling a deeper interpretation of the operational reliability of the selected configurations.
The results confirm that Arc2, especially when trained with the larger dataset, achieves near-perfect classification across all classes, with all three metrics consistently at or above 0.98, including for fuel oil, which is typically the most challenging to detect due to its heterogeneous spectral signature (depending on its concentration). This high sensitivity ensures minimal risk of missing contaminated areas (false negatives), while the strong specificity and precision reduce the likelihood of false alarms.
In contrast, Arc1, though also achieving high values overall, shows slightly reduced sensitivity and precision for fuel oil under the smaller training set, where performance drops to 0.87 and 0.97, respectively. These errors are reflected in the confusion matrix, where a notable portion of fuel oil instances are misclassified as diesel. This highlights the difficulty in differentiating spectrally similar hydrocarbons under data-constrained conditions, and the added value of a more effective feature extraction and regularisation strategy, as implemented in Arc2.
Notably, for both architectures, the clean water class was always detected with 100% sensitivity, reinforcing the robustness of the model in identifying the absence of contamination. This is a key operational requirement in surveillance systems, ensuring that clean areas are not misclassified, which could otherwise lead to inefficient or unnecessary mitigation responses.
Taken together, these class-wise results support the conclusion that the performance advantage of Arc2 is not only reflected in OA, but also in its balanced and reliable per-class behaviour, particularly under varying training data conditions. The ability to maintain high sensitivity, specificity, and precision across all hydrocarbon types makes it a robust candidate for real-world deployment in oil spill scenarios.
The experimental results presented in this study offer a comprehensive understanding of how architectural depth and hyperparameter configurations influence oil spill detection performance in hyperspectral images. Beyond the high OA achieved, several deeper insights emerge that are critical for designing robust and efficient classification systems:
  • Trade-off in feature extraction capacity: The analysis reveals that increasing the number of convolutional kernels does not necessarily enhance model performance. In fact, the use of the highest kernel configuration (NK2) consistently resulted in suboptimal accuracy, suggesting that an excessive number of feature maps may introduce redundancy or noise that hinders generalisation. Conversely, both the lowest (NK1) and moderate (NK3) kernel configurations yielded the best-performing models, particularly when paired with higher neuron densities in the fully connected layers. These findings underscore the importance of a balanced feature extraction process, rather than simply increasing depth or complexity.
  • Model parsimony and architectural efficiency: The comparison between Arc1 (four convolutional blocks) and Arc2 (three convolutional blocks with pooling) highlights the advantages of streamlined design. In the high-sample regime, both architectures converged to near-perfect accuracy; however, Arc2 achieved this with fewer layers, reduced complexity, and greater consistency across configurations. This aligns with the principle of parsimony, supporting the selection of more compact architectures when they yield equivalent or superior results, especially for real-time or resource-constrained applications.
  • Sensitivity to regularisation and data availability: The experiments reveal a non-linear relationship between dropout rate and model performance. While moderate dropout (DO2 and DO3) proved beneficial in mitigating overfitting, particularly in low-sample regimes, excessive regularisation (60% dropout) significantly impaired learning, especially in Arc1. This was particularly problematic when combined with smaller kernel configurations, suggesting that under-constrained models struggle to consolidate meaningful spatial–spectral patterns from limited data. Thus, dropout must be carefully tuned in relation to both the network complexity and the size of the training set.
  • Operational reliability and class-wise performance: From a practical standpoint, especially in maritime surveillance, models must not only be accurate but also reliable across all target classes. The confusion matrices and class-wise metrics confirm that Arc2 consistently delivers superior sensitivity, specificity, and precision, particularly for the most challenging class (fuel oil), which often shows spectral overlap with other hydrocarbons in low concentrations. Notably, clean water was detected with 100% sensitivity across all configurations, reinforcing the robustness of the model in avoiding false alarms that could trigger unnecessary responses. Moreover, Arc2 demonstrated high reproducibility across training runs, with near-zero standard deviation in the best-performing configurations, indicating stability against stochastic effects such as weight initialisation.

5. Conclusions

This study presented a systematic evaluation of CNN configurations for the classification of oil-contaminated water in hyperspectral imagery. The analysis demonstrates that network architecture, in combination with specific hyperparameter tuning and training-set size, significantly influences classification performance, particularly in data-limited scenarios. From the analysis of 54 3D CNN configurations (two architectures, three kernel settings, three neuron-density settings, and three dropout rates) under two training regimes (259 and 518 samples), with five independent runs per configuration, three main conclusions can be drawn:
  • Pooling-based dimensionality reduction is preferable to increased convolutional depth in this task. Across both data regimes, the architecture incorporating a max pooling layer (Arc2) consistently outperformed its deeper alternative (Arc1), achieving higher OA with lower architectural complexity. In particular, Arc2 achieved a near-perfect performance when trained with 518 samples (OA > 0.99), while Arc1 remained below that range, and Arc2, with 259 samples, matched the accuracy range of Arc1 trained with 518 samples. This result highlights the effectiveness of max pooling as a dimensionality reduction technique that preserves essential spatial–spectral information, reducing unnecessary model complexity.
  • Performance is driven by the interaction between feature-extraction capacity and the classifier. While the best overall results were obtained when training with a larger dataset, the study reveals that carefully balanced configurations, particularly those using fewer kernels and higher neuron densities, can also deliver high performance with reduced training data. In contrast, the largest kernel setting (NK2) systematically led to suboptimal behaviour, indicating that increasing the number of feature maps can introduce redundancy/noise and degrade generalisation under the acquisition conditions considered.
  • Regularisation and data availability must be co-tuned under scarce data scenarios. The analysis indicates that excessive dropout (0.6) impaired learning in the low sample regime in several configurations, whereas moderate dropout (0.2 and 0.4) preserved both accuracy and stability.
This knowledge, especially conclusions (ii) and (iii), is particularly relevant in real-world oil spill detection scenarios, where large, labelled datasets may not be available.
Moreover, the analysis of class-wise metrics confirmed that top-performing configurations maintain high sensitivity and specificity, which is critical in safety-related applications such as maritime surveillance. The robustness observed across replicate training further supports the operational reliability of the conclusions reached.
Limitations and future work. Because the dataset was acquired under controlled laboratory conditions, the reported results primarily reflect within-domain performance. Future research will therefore focus on two main directions: (i) integrating the optimised CNN models into real-time processing pipelines for UAV-based environmental monitoring, and (ii) exploring transfer learning techniques to adapt the classifiers to different locations, hydrocarbon types, and sea conditions, thus enhancing the scalability of the proposed methodology for global maritime safety applications. Extending the evaluation to more diverse acquisition conditions and real marine scenarios is the next step of this research. Different oil-thickness cases and biofilm conditions will also be considered, to assess detectability as a function of thickness and to improve robustness in operational scenarios. Furthermore, the configuration-sensitivity analysis will be extended to emerging HSI-classification paradigms beyond CNNs, such as transformer-based and graph-based models.

Author Contributions

Conceptualisation, M.G.C.-G., I.J.T.D. and J.J.R.-A.; data curation, M.G.C.-G. and J.G.-E.; formal analysis, M.G.C.-G. and I.J.T.D.; funding acquisition, I.J.T.D. and J.J.R.-A.; investigation, M.G.C.-G. and I.J.T.D.; methodology, M.G.C.-G., J.G.-E. and I.J.T.D.; project administration, J.J.R.-A. and I.J.T.D.; software, M.G.C.-G., J.G.-E., D.E. and I.J.T.D.; resources M.G.C.-G. and I.J.T.D.; supervision J.J.R.-A., D.E., A.C.-O. and I.J.T.D.; validation, M.G.C.-G.; visualisation, M.G.C.-G.; writing—original draft, M.G.C.-G. and I.J.T.D.; writing—review and editing, M.G.C.-G., J.J.R.-A., D.E., A.C.-O. and I.J.T.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

This work was partially supported by the European Regional Development Fund (ERDF/FEDER) through the project FEDER-UCA-2024-B2-07 “Spectral imaging technology and deep learning for environmental protection: Detection of water and air pollution (DEEPENVSI)”, under the Andalusia ERDF Operational Programme 2021–2027.

Conflicts of Interest

The authors do not have any relevant conflicts of interest to declare regarding the content of this article.

  18. Fingas, M.; Brown, C. Review of oil spill remote sensing. Mar. Pollut. Bull. 2014, 83, 9–23. [Google Scholar] [CrossRef] [PubMed]
  19. Fingas, M.; Brown, C.E. A review of oil spill remote sensing. Sensors 2018, 18, 91. [Google Scholar] [CrossRef]
  20. Duan, P.; Kang, X.; Ghamisi, P.; Li, S. Hyperspectral Remote Sensing Benchmark Database for Oil Spill Detection with an Isolation Forest-Guided Unsupervised Detector. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5509711. [Google Scholar] [CrossRef]
  21. Li, X.; Liu, B.; Zheng, G.; Ren, Y.; Zhang, S.; Liu, Y.; Gao, L.; Liu, Y.; Zhang, B.; Wang, F. Deep-learning-based information mining from ocean remote-sensing imagery. Natl. Sci. Rev. 2021, 7, 1584–1605. [Google Scholar] [CrossRef]
  22. Tsokas, A.; Rysz, M.; Pardalos, P.M.; Dipple, K. SAR data applications in earth observation: An overview. Expert Syst. Appl. 2022, 205, 117342. [Google Scholar] [CrossRef]
  23. Hong, X.; Chen, L.; Sun, S.; Sun, Z.; Chen, Y.; Mei, Q.; Chen, Z. Detection of Oil Spills in the Northern South China Sea Using Landsat-8 OLI. Remote Sens. 2022, 14, 3966. [Google Scholar] [CrossRef]
  24. Lei, F.; Wang, W.; Zhang, W.; Li, K.; Xu, Z. Oil Spills Tracking Through Texture Analysis from MODIS Imagery. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan; IEEE: New York, NY, USA, 2019; pp. 9768–9771. [Google Scholar] [CrossRef]
  25. Kolokoussis, P.; Karathanassi, V. Oil spill detection and mapping using sentinel 2 imagery. J. Mar. Sci. Eng. 2018, 6, 4. [Google Scholar] [CrossRef]
  26. Zakzouk, M.; El-Magd, I.A.; Ali, E.M.; Abdulaziz, A.M.; Rehman, A.; Saba, T. Novel oil spill indices for sentinel-2 imagery: A case study of natural seepage in Qaruh island, Kuwait. MethodsX 2024, 12, 102520. [Google Scholar] [CrossRef] [PubMed]
  27. Trujillo-Acatitla, R.; Tuxpan-Vargas, J.; Ovando-Vázquez, C. Oil spills: Detection and concentration estimation in satellite imagery, a machine learning approach. Mar. Pollut. Bull. 2022, 184, 114132. [Google Scholar] [CrossRef] [PubMed]
  28. Ivshina, I.B.; Kuyukina, M.S.; Krivoruchko, A.V.; Elkin, A.A.; Makarov, S.O.; Cunningham, C.J.; Peshkur, T.A.; Atlas, R.M.; Philp, J.C. Oil spill problems and sustainable response strategies through new technologies. Environ. Sci. Process Impacts 2015, 17, 1201–1219. [Google Scholar] [CrossRef] [PubMed]
  29. Goetz, A.F.H. Three decades of hyperspectral remote sensing of the Earth: A personal view. Remote Sens. Environ. 2009, 113, S5–S16. [Google Scholar] [CrossRef]
  30. Faltynkova, A.; Johnsen, G.; Wagner, M. Hyperspectral imaging as an emerging tool to analyze microplastics: A systematic review and recommendations for future development. Microplast. Nanoplast. 2021, 1, 13. [Google Scholar] [CrossRef]
  31. Sridhar, A.; Kannan, D.; Kapoor, A.; Prabhakar, S. Extraction and detection methods of microplastics in food and marine systems: A critical review. Chemosphere 2022, 286, 131653. [Google Scholar] [CrossRef]
  32. Vidal, C.; Pasquini, C. A comprehensive and fast microplastics identification based on near-infrared hyperspectral imaging (HSI-NIR) and chemometrics. Environ. Pollut. 2021, 285, 117251. [Google Scholar] [CrossRef]
  33. Shan, J.; Zhao, J.; Liu, L.; Zhang, Y.; Wang, X.; Wu, F. A novel way to rapidly monitor microplastics in soil by hyperspectral imaging technology and chemometrics. Environ. Pollut. 2018, 238, 121–129. [Google Scholar] [CrossRef]
  34. Goetz, A.F.H.; Vane, G.; Solomon, J.E.; Rock, B.N. Imaging spectrometry for earth remote sensing. Science 1985, 228, 1147–1153. [Google Scholar] [CrossRef]
  35. Chabrillat, S.; Foerster, S.; Segl, K.; Beamish, A.; Brell, M.; Asadzadeh, S.; Milewski, R.; Ward, K.J.; Brosinsky, A.; Koch, K.; et al. The EnMAP spaceborne imaging spectroscopy mission: Initial scientific results two years after launch. Remote Sens. Environ. 2024, 315, 114379. [Google Scholar] [CrossRef]
  36. eoPortal. Available online: https://www.eoportal.org/satellite-missions/eo-1#eo-1-earth-observing-1 (accessed on 24 December 2025).
  37. enMap. Available online: https://www.enmap.org/mission/ (accessed on 24 December 2025).
  38. ASI. Available online: https://www.asi.it/en/earth-science/prisma/ (accessed on 24 December 2025).
  39. Guanter, L.; Kaufmann, H.; Segl, K.; Foerster, S.; Rogass, C.; Chabrillat, S.; Kuester, T.; Hollstein, A.; Rossner, G.; Chlebek, C.; et al. The EnMAP spaceborne imaging spectroscopy mission for earth observation. Remote Sens. 2015, 7, 8830–8857. [Google Scholar] [CrossRef]
  40. HISUI. Available online: https://www.hisui.go.jp/en/sensors/index.html (accessed on 24 December 2025).
  41. Natale, V.G.; Kafri, A.; Tidhar, G.A.; Chen, M.; Feingersh, T.; Sagi, E.; Cisbani, A.; Baroni, M.; Labate, D.; Nadler, R.; et al. SHALOM—Space-borne hyperspectral applicative land and ocean mission. In Proceedings of the 2013 5th Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), Gainesville, FL, USA, 26–28 June 2013; p. 1. [Google Scholar] [CrossRef]
  42. Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63. [Google Scholar] [CrossRef]
  43. Shan, J.; Zhao, J.; Zhang, Y.; Liu, L.; Wu, F.; Wang, X. Simple and rapid detection of microplastics in seawater using hyperspectral imaging technology. Anal. Chim. Acta 2019, 1050, 161–168. [Google Scholar] [CrossRef]
  44. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  45. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Handwritten Digit Recognition with a Back-Propagation Network. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1990; pp. 396–404. [Google Scholar]
  46. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  47. Zhou, D.X. Theory of deep convolutional neural networks: Downsampling. Neural Netw. 2020, 124, 319–327. [Google Scholar] [CrossRef] [PubMed]
  48. Hu, T.; Yuan, J.; Wang, X.; Yan, C.; Ju, X. Spectral-Spatial Features Extraction of Hyperspectral Remote Sensing Oil Spill Imagery Based on Convolutional Neural Networks. IEEE Access 2022, 10, 127969–127983. [Google Scholar] [CrossRef]
  49. Yang, J.F.; Wan, J.H.; Ma, Y.; Zhang, J.; Bin Hu, Y.; Jiang, Z.C. Oil spill hyperspectral remote sensing detection based on DCNN with multi-scale features. J. Coast. Res. 2019, 90, 332–339. [Google Scholar] [CrossRef]
  50. Wang, B.; Shao, Q.; Song, D.; Li, Z.; Tang, Y.; Yang, C.; Wang, M. A spectral-spatial features integrated network for hyperspectral detection of marine oil spill. Remote Sens. 2021, 13, 1568. [Google Scholar] [CrossRef]
  51. Zhu, X.; Li, Y.; Zhang, Q.; Liu, B. Oil film classification using deep learning-based hyperspectral remote sensing technology. ISPRS Int. J. Geo-Inf. 2019, 8, 181. [Google Scholar] [CrossRef]
  52. Kang, X.; Wang, Z.; Duan, P.; Wei, X. The Potential of Hyperspectral Image Classification for Oil Spill Mapping. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5538415. [Google Scholar] [CrossRef]
  53. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A.; Chanussot, J. SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5518615. [Google Scholar] [CrossRef]
  54. Kang, X.; Deng, B.; Duan, P.; Wei, X.; Li, S. Self-Supervised Spectral-Spatial Transformer Network for Hyperspectral Oil Spill Mapping. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5507410. [Google Scholar] [CrossRef]
  55. Kang, J.; Yang, C.; Yi, J.; Lee, Y. Detection of Marine Oil Spill from PlanetScope Images Using CNN and Transformer Models. J. Mar. Sci. Eng. 2024, 12, 2095. [Google Scholar] [CrossRef]
  56. Nadipelli, N.K.; Sarma, T.H.; Reddy, R.D.; Rao, K.R.M.; Mrudula, K.; Kanthi, M. Efficient Hyperspectral Band Selection Using GA-SR-NMI-VI: A Hybrid Similarity and Evolutionary-Based Approach. IEEE Geosci. Remote Sens. Lett. 2025, 22, 5507105. [Google Scholar] [CrossRef]
  57. Mukundan, A.; Karmakar, R.; Jouhar, J.; Valappil, M.A.E.; Wang, H.C. Advancing Urban Development: Applications of Hyperspectral Imaging in Smart City Innovations and Sustainable Solutions. Smart Cities 2025, 8, 51. [Google Scholar] [CrossRef]
Figure 1. (a) Schematic representation of a hyperspectral image as a three-dimensional data cube composed of spatial dimensions (x–y) and spectral dimension (λ). Each pixel contains a full reflectance spectrum, known as a hyperpixel. (b) An example of hyperspectral signatures of water and fuel oil, highlighting their spectral differences across the VIS-NIR range.
Figure 2. Data acquisition and preprocessing workflow. The diagram illustrates the experimental setup, from image capture with the hyperspectral camera to the generation of preprocessed patches. Four target classes were recorded under sunlight: clean water, gasoil on water, C10 on water, and fuel oil on water. Integration time calibration and white/dark reference acquisition were performed before collecting 20 hyperspectral images per class. Each hypercube was subsequently divided into 81 patches.
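The patch-generation step described in Figure 2 (each hypercube divided into 81 patches) can be sketched as below. This is a minimal illustration only: the spatial size (90 × 90 pixels), the 9 × 9 patch grid, and the band count (100) are assumptions for the example, not values reported in the paper.

```python
import numpy as np

def split_into_patches(cube, grid=9):
    """Split a hypercube (H, W, bands) into a grid x grid set of
    non-overlapping spatial patches, keeping all spectral bands."""
    h, w, bands = cube.shape
    ph, pw = h // grid, w // grid
    patches = [
        cube[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw, :]
        for r in range(grid)
        for c in range(grid)
    ]
    return np.stack(patches)  # shape: (grid*grid, ph, pw, bands)

# Illustrative hypercube: 90 x 90 pixels, 100 spectral bands (assumed sizes)
cube = np.zeros((90, 90, 100))
patches = split_into_patches(cube)
print(patches.shape)  # (81, 10, 10, 100) -> 81 patches per hypercube
```

With 20 hypercubes per class and four classes, this grid split yields the per-class patch counts from which the 259- and 518-sample training sets were drawn.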
Figure 3. RGB photographs of the four experimental scenarios used to build the hyperspectral dataset: water, gasoil, C10, and fuel oil. Highly diluted hydrocarbon regions and areas of hydrocarbon–water mixing may not be clearly visible to the naked eye.
Figure 4. Overview of the methodological scheme. The process includes four main stages: (i) design of two CNN architectures with different depth and complexity; (ii) systematic grid search to evaluate the influence of convolutional kernels, neuron density, and dropout rate; (iii) model training using two dataset sizes under a repeated evaluation protocol; and (iv) performance assessment based on overall accuracy, statistical robustness, and class-wise metrics.
Figure 5. Best performance in terms of OA resulting from the combination of the different hyperparameters: neuron density in the fully connected layers (N1, N2, N3), number of convolutional kernels (NK1, NK2, NK3), and dropout rate (DO1, DO2, DO3), for the two architectures (Arc1, Arc2) and the two training set sizes. (a) Architecture 1 trained on 259 samples, (b) Architecture 1 trained on 518 samples, (c) Architecture 2 trained on 259 samples, and (d) Architecture 2 trained on 518 samples.
Figure 6. Average accuracy (solid lines) and standard deviation (dashed lines) obtained from five independent training runs for each CNN configuration, grouped by dropout rate: 60% (DO1, blue), 40% (DO2, purple), and 20% (DO3, orange). The x- and y-axes represent the number of neurons in the fully connected layers and the number of convolutional kernels, respectively. Subfigures correspond to: (a) Arc1 with 259 training samples; (b) Arc1 with 518 training samples; (c) Arc2 with 259 training samples; and (d) Arc2 with 518 training samples.
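The per-configuration statistics plotted in Figure 6 are the mean accuracy and standard deviation over five independent training runs. A minimal sketch of that aggregation follows; the five accuracy values are illustrative placeholders, not results from the paper.

```python
import numpy as np

# Accuracies of five independent training runs for one configuration
# (illustrative values only)
run_accuracies = np.array([0.991, 0.987, 0.993, 0.989, 0.992])

mean_acc = run_accuracies.mean()
std_acc = run_accuracies.std(ddof=1)  # sample standard deviation

print(f"{mean_acc:.4f} +/- {std_acc:.4f}")
```

Reporting the standard deviation alongside the mean is what allows the robustness comparison between configurations, not just their peak accuracy.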
Figure 7. Confusion matrices and class-wise performance metrics (sensitivity, specificity, and precision) for the best-performing configurations of both architectures under two training regimes (259 and 518 samples). In the confusion matrices, the green diagonal indicates correctly classified samples (counts and percentages), whereas off-diagonal cells in red denote misclassifications.
Table 1. Hyperparameter values evaluated for each proposed architecture: number of neurons in the fully connected layers (N1, N2, N3), number of convolutional kernels (NK1, NK2, NK3), and dropout percentage (DO1, DO2, DO3).

Hyperparameter        Codification   ARC1               ARC2
Neurons FC layer      N1             [256, 128]         [256, 128]
                      N2             [128, 64]          [128, 64]
                      N3             [64, 32]           [64, 32]
Number CNN kernels    NK1            [2, 4, 8, 4]       [2, 4, 8]
                      NK2            [16, 32, 64, 32]   [16, 32, 64]
                      NK3            [8, 16, 32, 16]    [8, 16, 32]
% Dropout             DO1            0.6                0.6
                      DO2            0.4                0.4
                      DO3            0.2                0.2
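The grid search implied by Table 1 can be sketched as below: three neuron settings, three kernel settings, and three dropout rates per architecture give 27 configurations for each of the two architectures, i.e., the 54 configurations evaluated in the study. The values are taken from Table 1; the data-structure layout is only an illustrative assumption.

```python
from itertools import product

# Hyperparameter levels from Table 1
NEURONS = {"N1": [256, 128], "N2": [128, 64], "N3": [64, 32]}
KERNELS = {
    "ARC1": {"NK1": [2, 4, 8, 4], "NK2": [16, 32, 64, 32], "NK3": [8, 16, 32, 16]},
    "ARC2": {"NK1": [2, 4, 8],    "NK2": [16, 32, 64],     "NK3": [8, 16, 32]},
}
DROPOUT = {"DO1": 0.6, "DO2": 0.4, "DO3": 0.2}

# Enumerate every (architecture, neurons, kernels, dropout) combination
configs = [
    (arch, n, k, d)
    for arch in ("ARC1", "ARC2")
    for n, k, d in product(NEURONS, KERNELS[arch], DROPOUT)
]
print(len(configs))  # 54 configurations in total (27 per architecture)
```

Each of these configurations was then trained with both dataset sizes (259 and 518 samples) and evaluated over repeated runs, as described in Figure 4.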
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
