1. Introduction
The detection of Unexploded Ordnance (UXO) in underwater environments poses a significant challenge both from the perspective of human safety and environmental protection. Following World War II, large amounts of ammunition and chemical weapons were deliberately disposed of in seas and lakes, especially in the European region. Progressive corrosion of these submerged munitions poses an increasing risk of release of toxic substances into aquatic ecosystems. Current estimates indicate that the Baltic Sea alone harbors several tens of thousands of tons of wartime remnants, a considerable fraction of which remain undocumented and unexplored [
1]. With the development of marine infrastructure and the intensification of installation works (wind farms, underwater cables), the need for effective methods to identify and classify hazardous objects is becoming increasingly urgent. Traditional UXO detection methods rely on sensors such as magnetometers, sonars, or video cameras, but in underwater environments, magnetometers are particularly important, as they record disturbances in the Earth’s magnetic field caused by the presence of ferromagnetic objects, even those that are buried. However, the analysis of data obtained in this way is time-consuming, costly, and requires the involvement of highly qualified specialists. Automating the object classification process using machine learning methods can significantly increase the efficiency and accuracy of UXO detection, and may even surpass the effectiveness of human experts in distinguishing dangerous objects from harmless ones [
2].
One of the key challenges in applying machine learning methods to underwater UXO classification is the limited availability of representative and high-quality training datasets. Acquiring empirical data in marine environments is associated with high costs, considerable labor, and technical limitations, such as changes in magnetic inclination depending on geographic latitude or differences resulting from the orientation of the object relative to the measurement trajectory. The literature [
3] highlights that the lack of appropriate datasets is a significant barrier to the development of effective AI-based systems.
In response to these challenges, the concept of digital twins is gaining increasing popularity. Digital twins enable the simulation of real-world conditions and the generation of synthetic data for training machine learning models. This approach allows for flexible modeling of both UXO and non-UXO objects under controlled conditions of the Earth’s magnetic field, which significantly accelerates and facilitates the process of building datasets and, consequently, machine learning tools for automating field studies. Previous studies have shown that the use of digital twins makes it possible to achieve a high degree of consistency between simulation results and real measurements, with agreement levels reaching approximately 95% [
3], and allows for testing various neural network architectures in terms of their effectiveness in classifying UXO based on raw magnetometer data.
Building on the foundation of high-fidelity simulated data, this paper addresses the existing gap in comprehensive analyses of neural networks for UXO classification from magnetometer signals. We present a systematic design process for a bespoke Convolutional Neural Network (CNN) architecture and conduct a thorough evaluation of its practical application for underwater UXO detection. Our work analyzes the influence of network depth and width, as well as the impact of various architectural components and meta-features, on classification performance. The proposed CNN is benchmarked against a selection of top-performing classical machine learning models. Finally, we perform a detailed resilience analysis to assess the network’s robustness to common signal distortions, including additive noise, linear drift, and time warping.
The primary contributions of this work are as follows:
We introduce and utilize a novel dataset for model training, featuring simulated data from three parallel magnetometer sensors that includes both UXO and non-UXO objects with remanent magnetization.
We make this dataset publicly available.
We provide a comparative analysis of classical machine learning methods, establishing a robust performance baseline for the UXO classification task.
We present a detailed, systematic design process for a CNN architecture tailored to underwater UXO classification and quantify the performance impact of its key components through extensive ablation studies.
We deliver a comprehensive resilience analysis of the final model, evaluating its robustness against realistic signal distortions.
We derive actionable guidelines for designing field data acquisition protocols, informed by our model’s resilience analysis, to mitigate the impact of common signal distortions.
The remainder of this article is organized as follows:
Section 2 provides an overview of the relevant literature on UXO classification, with a focus on methods applied in underwater environments.
Section 3 describes the dataset used in our experiments.
Section 4 details the experimental setup, which is divided into four stages: classical model evaluation, neural network development, ablation studies, and resilience analysis. The results of these experiments are presented and discussed in
Section 5. Finally,
Section 6 provides concluding remarks, summarizes our findings, and outlines potential directions for future research.
2. Recent Advances in UXO Classification
The field of Unexploded Ordnance (UXO) classification has rapidly evolved with the emergence of advanced sensor technologies and sophisticated data analysis methods. While magnetometers have long been a mainstay for detecting ferromagnetic UXO underwater, recent years have seen the increasing integration of electromagnetic induction (EMI) sensors, which are particularly effective for detecting both ferrous and non-ferrous metallic objects. EMI sensors induce secondary electromagnetic fields in conductive materials, enabling discrimination between UXO and benign clutter, even in challenging environments where magnetic signatures alone may be ambiguous [
4,
5]. However, while EMI sensors provide valuable information by actively probing conductive targets and can detect both ferrous and non-ferrous metals, they generally exhibit lower signal-to-noise ratios, shorter effective detection ranges, higher power demands, and greater sensitivity to environmental factors than magnetometers, which remain more robust and efficient for passive detection of ferromagnetic UXO in underwater environments.
Quantum magnetometry is an emerging technology for enhancing Unexploded Ordnance (UXO) detection in underwater environments. By exploiting quantum mechanical phenomena such as spin coherence and quantum superposition, quantum magnetometers—particularly those based on nitrogen vacancy (NV) centers in diamond—offer sensitivity and precision far surpassing traditional sensors, enabling the detection of even weak magnetic anomalies associated with UXO. Recent field trials and industrial initiatives, including the integration of NV magnetometers into autonomous underwater vehicles (AUVs), demonstrate the feasibility of compact, low-power, and robust quantum sensors for real-time mapping and localization tasks [
6]. Despite challenges related to miniaturization, environmental robustness, and power management, advances in nanofabrication and quantum optics are paving the way for deploying these sensors in operational settings, promising a significant leap in the accuracy and efficiency of underwater UXO classification.
In the context of autonomous data acquisition, Seidel et al. [
7] describe the integration of multiple submersible fluxgate magnetometers with a hovering autonomous underwater vehicle (AUV) as part of the EU-funded BASTA project. Their system enables high-resolution magnetic surveys at low altitudes and velocities, improving the detection and localization of munitions on the seafloor. The AUV-mounted magnetometers, arranged in a vertical triangle, allow for the calculation of spatial magnetic gradients, which enhance the discrimination between UXO and non-UXO objects. Field trials conducted at munitions dumpsites in the Baltic Sea and controlled tests with surrogate objects demonstrate the system's capability to reliably detect munitions shells as small as 81 mm from a distance of 1 m above the seafloor. The authors also emphasize the importance of integrating high-resolution camera systems for ground-truthing and final confirmation of UXO.
One recent attempt to automate underwater Unexploded Ordnance (UXO) detection involves the application of inverse modeling techniques to magnetic data analysis. Brighouse et al. [
8] proposed a framework that fits measured magnetic anomalies to physically based parametric models, specifically employing the magnetic dipole model to represent UXO signatures. This approach enables the extraction of intrinsic object parameters such as magnetic polarizabilities, orientation, and burial depth, which are essential for accurate classification. By grounding classification algorithms in these physically meaningful parameters, the method improves discrimination between UXO and non-UXO objects. Validation on complex marine datasets with multiple targets and varying burial conditions demonstrated enhanced detection confidence and a reduction in false positives. The study underscores the importance of integrating dipole-based physical modeling with statistical classification methods to advance the automation and reliability of marine UXO remediation.
On the data analysis front, machine learning and deep learning have transformed the UXO classification process. Traditional manual interpretation of sensor data is labor-intensive and subjective, but automated classification using Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and hybrid models such as Bi-LSTM-RBM has demonstrated superior accuracy and efficiency [
9,
10]. Advanced deep learning models are capable of extracting subtle, high-dimensional features from raw sensor data, enabling the reliable differentiation of UXO from non-hazardous objects. Notably, transformer-based architectures and Deep Neural Networks have shown promise in handling the temporal and spatial complexities inherent in various types of underwater data used in UXO detection, for example, visual or sonar data [
11,
12].
A persistent challenge in developing robust machine learning models is the scarcity of high-quality labeled training data. To address this, the use of synthetic datasets generated with digital twin frameworks and numerical modeling has become increasingly prevalent. Digital twins simulate realistic sensor responses to UXO and other non-dangerous objects under varying environmental and operational conditions, facilitating the creation of diverse and balanced datasets for model training and validation [
3]. This approach not only accelerates the development of effective classifiers, but also allows for systematic evaluations of model performance across a range of scenarios.
Recent advances in Unexploded Ordnance (UXO) classification have increasingly focused on solving inverse problems and applying machine learning techniques to improve detection accuracy and discrimination capabilities. Inversion approaches, such as cooperative and joint inversion, have been explored to simultaneously utilize multiple datasets for more robust parameter estimation, enhancing the physical characterization of targets [
13]. Techniques leveraging the magnetic gradient tensor, including eigenvector decomposition and singular value decomposition (SVD), have demonstrated efficient and automatic localization of dipole-like magnetic sources without requiring prior knowledge of magnetic moment orientation [
14,
15].
Complementing these physics-based methods, machine learning models have gained prominence, particularly when applied to electromagnetic induction (EMI) data. Early implementations employed probabilistic neural networks trained on physics-based model outputs to classify UXO [
16]. Subsequent research expanded the use of classifiers such as support vector machines (SVMs), Random Forests, and neural networks, often combined with feature extraction techniques that capture intrinsic target properties from EMI decay curves [
17,
18]. Hybrid approaches integrating supervised and unsupervised learning methods, including Gaussian mixture models, have shown promise in reducing training data requirements while providing probabilistic interpretations of classification outcomes [
19].
In summary, recent advances in UXO classification focus on enhancing sensing technologies, including the introduction of electromagnetic induction (EMI) and quantum magnetometry sensors, as well as the development of measurement platforms such as Autonomous Underwater Vehicles equipped with gradiometers for conducting surveys. Another major area of research involves modeling UXO objects and developing machine learning algorithms to improve classification accuracy. Overall, the integration of advanced sensors, deep learning techniques, and digital twin technologies is significantly advancing UXO classification accuracy and operational efficiency in complex marine environments.
3. Dataset Description
The dataset utilized in this study is an extension of the synthetic magnetometer-based data framework originally developed in our previous work [
3]. The dataset was generated using a digital twin approach employing finite element method (FEM) simulations to replicate the magnetic signatures of underwater objects under controlled environmental conditions, thereby overcoming the challenges associated with the scarcity and high cost of empirical data collection in underwater UXO detection.
3.1. Data Generation Methodology
The synthetic dataset was produced by simulating the magnetic field disturbances caused by various underwater objects using a high-fidelity FEM model. The simulation environment accounted for the Earth’s geomagnetic field parameters, including inclination, as well as the magnetic properties and geometric configurations of the objects. To enhance computational efficiency without compromising accuracy, model simplifications were applied based on sensitivity analyses, reducing simulation time by approximately threefold compared to the full-scale model.
The dataset comprises time-series magnetometer readings generated by simulating a sensor trajectory over the objects at a fixed altitude and velocity, replicating realistic survey conditions. Multiple orientations and positions of each object relative to the sensor path were included to capture variability in the magnetic signatures. Noise consistent with typical underwater magnetometer measurements was incorporated to improve the dataset’s representativeness of real-world conditions. For building the data generator, a computational model was developed using the Gmsh (version 4.8.4)-GetDP (version 3.3.0) software environment, a versatile and open numerical modeling platform that enables the creation of three-dimensional models of various physical fields. In this study, Gmsh-GetDP was employed specifically to model the static magnetic field around the objects (
Figure 1) incorporating the effects of remanent magnetization.
The dataset comprises magnetometer signal samples acquired through simulated measurements replicating the operational parameters of real-world underwater platforms, including Remotely Operated Vehicles (ROVs), Autonomous Underwater Vehicles (AUVs), and tracked submersible platforms (bottom crawlers). A horizontally aligned sensor array configuration was implemented, featuring three magnetometers spaced 1 m apart to capture spatial magnetic field gradients. For each target object, data acquisition covered a 10 m trajectory, emulating a platform transiting over the object, with the central magnetometer passing directly above the target. The total magnetic field intensity was sampled at 10 cm intervals, yielding a data tensor of size 3 sensors × 101 samples per measurement instance. Trajectory parameters, including altitude and azimuthal direction relative to the target, were recorded as metadata.
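As an illustration of this data layout, consider the following minimal sketch; the array contents and metadata values are placeholders, not actual simulation output:

```python
import numpy as np

# One measurement instance: 3 parallel sensors, 101 samples along a 10 m
# trajectory sampled every 10 cm (positions 0.0, 0.1, ..., 10.0 m).
n_sensors, n_samples = 3, 101
sample = np.zeros((n_sensors, n_samples))  # total-field magnitudes (placeholder)

# Per-instance metadata recorded alongside the signal tensor
# (values here are illustrative only).
metadata = {
    "altitude_m": 1.0,    # platform altitude above the seabed
    "azimuth_deg": 45.0,  # trajectory direction relative to the target
}

trajectory_positions = np.linspace(0.0, 10.0, n_samples)  # metres along the track
```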
The virtual environment model used for synthetic data generation was rigorously verified against physical trials, demonstrating high compliance between simulated and experimental magnetometer signatures [
3]. The comparison results previously presented in our aforementioned article indicate that the numerical model based on the finite element method achieved over 95% correct representation of real-world measurements, confirming the reliability of the digital twin approach for dataset generation (
Table 1).
The table shows that, in nearly all cases, the fit-quality values reach approximately 0.95–0.98, indicating an almost perfect fit, while the PSNR values generally exceed 20 dB. Additionally, a MAX-MIN column was included to represent the signal amplitude, revealing that error metrics such as RMSE, ME, and MAE are one to two orders of magnitude smaller than this amplitude, further confirming the high accuracy of the results.
3.2. Description of UXO and Non-UXO Objects
The dataset includes a diverse set of objects categorized into two classes: UXO (Unexploded Ordnance, positive class) and non-UXO (negative class).
UXO (positive class) This class consists of cylindrical projectiles representative of typical underwater Unexploded Ordnance. The cylinders have diameters ranging from 60 mm to 200 mm and projectile lengths between 240 mm and 650 mm. These dimensions correspond to commonly encountered UXO types such as artillery shells and mortar rounds. The magnetic signatures of these objects are influenced by their ferromagnetic material properties, shape, and orientation relative to the Earth’s magnetic field, resulting in complex and distinctive magnetic anomalies.
Non-UXO (negative class) The negative class comprises objects that may produce magnetic signals but are not considered hazardous UXO. This class includes the following:
Too-small projectiles: Cylindrical objects with diameters ranging from 30 mm to 50 mm and lengths between 100 mm and 240 mm, representing metallic debris or small fragments unlikely to pose a threat.
Too-large projectiles or pipes: Cylindrical objects with diameters between 240 mm and 400 mm and lengths from 750 mm to 1200 mm, which are too large to be classified as UXO and typically represent pipes or large metallic structures.
Low-height cylinders: Objects simulating manhole covers or similar structures, characterized by diameters ranging from 110 mm to 400 mm and a thickness of approximately 20 mm.
Rectangular prisms: Objects with lengths between 400 mm and 1000 mm, widths from 50 mm to 200 mm, and thicknesses ranging from 20 mm to 500 mm, representing various metallic clutter shapes.
Rectangular cubes: Larger block-shaped objects with lengths between 900 mm and 2000 mm, widths from 50 mm to 200 mm, and thicknesses between 50 mm and 200 mm, simulating bulky debris or structural elements.
This diverse representation of non-UXO objects ensures that the dataset challenges classification algorithms to distinguish hazardous ordnance from a wide variety of non-dangerous metallic clutter commonly found in underwater environments.
4. Experimental Setup
The experimental evaluation was divided into four main stages. In the first stage, we investigated the application of classical machine learning techniques for UXO/non-UXO classification. These results served as a baseline for the subsequent development of a neural network model. The second stage focused on the design and development of a neural network architecture tailored for UXO/non-UXO classification, with emphasis on network structure. The third stage, the ablation studies, focused on dataset composition and the impact of other network parameters. Particular attention was given to the impact of meta-features, such as altitude above the seabed and sensor movement direction. The fourth stage comprised an assessment of the model's robustness to signal distortions. The following subsections detail each experimental phase.
4.1. Classical Machine Learning Methods
Classical machine learning techniques offer a straightforward approach to model development, often requiring minimal hyperparameter tuning. In this study, we evaluated representative models from three algorithmic families: tree-based, distance-based, and shallow neural networks. Specifically, we selected a Random Forest classifier, a k-Nearest Neighbors (kNN) classifier, and a simple multilayer perceptron (MLP).
A preliminary analysis of classical machine learning methods for UXO classification was presented in [
9]. Unfortunately, that study utilized an extended feature set, including all three components of the magnetic field vector (X, Y, Z—three orthogonal components, measured along the three axes of a 3D Cartesian coordinate system), and assumed the absence of remanent magnetization in UXO objects. This simplification—particularly the omission of remanent magnetization—can lead to an overestimation of classification accuracy. In contrast, the present study focuses on an updated dataset that accounts for remanent magnetization, resulting in reported accuracies that are significantly lower than those in [
9]. Moreover, since UXO surveys often employ cesium vapor-based magnetometer sensors that provide only the scalar magnitude of the magnetic field, defined as $M = \sqrt{B_x^2 + B_y^2 + B_z^2}$, our evaluation is constrained to using this scalar value alone.
All classical models were implemented using the scikit-learn library (version 1.6.1).
Table 2 outlines the parameter settings that differ from default values. Model evaluation was conducted with five-fold cross-validation, as detailed in
Section 4.5.
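The evaluation protocol for the classical baselines can be sketched with scikit-learn as follows. The data below is a random synthetic stand-in, and the model parameters are library defaults rather than the settings from Table 2:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(23)
X = rng.normal(size=(200, 3 * 101))  # flattened (3 sensors x 101 samples), stand-in
y = rng.integers(0, 2, size=200)     # UXO / non-UXO labels, stand-in

# The three model families evaluated in this study (default hyperparameters).
models = {
    "RandomForest": RandomForestClassifier(random_state=23),
    "kNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=200, random_state=23),
}

# Five-fold cross-validation with balanced accuracy as the scoring metric.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
    for name, model in models.items()
}
```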
4.2. Neural Network Design
The second phase of experimentation focused on neural network architectures. Starting from the best-performing classical neural network model, we developed several neural networks, with an emphasis on Convolutional Neural Networks (CNNs) for effective temporal pattern extraction. We excluded transformer-based models from our evaluation, since they require very large datasets and, as reported in [
20], other architectures often outperform transformer models, whose positional encoding and tokenization of input segments may cause information loss.
The input data for the models was formatted as tensors of shape $3 \times 101$, representing 101 temporal samples from three parallel sensors. As only the magnetometer magnitude was available, a single measurement from a sensor was represented as a single value (the magnitude, without the three spatial components of the magnetic field vector). This data was fed into a CNN comprising multiple convolutional blocks.
Inspired by ResNet [
21], our architecture adopted a modular design wherein each building block consists of two convolutional layers, element-wise batch normalization layers, and a skip connection that facilitates gradient flow and mitigates vanishing gradients. Unlike traditional batch normalization [
22], element-wise normalization was applied independently to each element of an input tensor. The element-wise normalization was implemented as follows:
$$\hat{X}_{b,c,t} = \gamma_{c,t} \frac{X_{b,c,t} - \mu_{c,t}}{\sqrt{\sigma_{c,t}^2 + \epsilon}} + \beta_{c,t}, \qquad \mu_{c,t} = \frac{1}{B}\sum_{b=1}^{B} X_{b,c,t}, \qquad \sigma_{c,t}^2 = \frac{1}{B}\sum_{b=1}^{B} \left(X_{b,c,t} - \mu_{c,t}\right)^2.$$
Here, the input data $X$ is assumed to be of size $B \times C \times W$, where $B$ denotes the batch size and $C \times W$ represents the dimensions of a single input tensor (in our specific case, $C = 3$ and $W = 101$). The normalization is applied independently for each $(c, t)$ coordinate across the batch size dimension. $\gamma_{c,t}$ and $\beta_{c,t}$ are learnable parameters that allow for better adjustment to the input data's profile, and $\epsilon$ is a small constant to avoid dividing by zero.
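This element-wise normalization can be sketched as a minimal PyTorch module. The module name is ours, and the sketch uses training-mode batch statistics only; how running statistics are handled at inference is an assumption not taken from the paper's implementation:

```python
import torch
import torch.nn as nn

class ElementwiseBatchNorm(nn.Module):
    """Normalizes each (channel, time) element independently across the batch.

    Unlike nn.BatchNorm1d, which shares statistics along the temporal axis,
    the mean and variance here are computed per element of the C x W tensor.
    """
    def __init__(self, shape, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(shape))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(shape))   # learnable shift
        self.eps = eps

    def forward(self, x):  # x: (B, C, W)
        mean = x.mean(dim=0, keepdim=True)
        var = x.var(dim=0, unbiased=False, keepdim=True)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

norm = ElementwiseBatchNorm((3, 101))
out = norm(torch.randn(8, 3, 101))
```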
An overview of the network architecture is provided in
Figure 2. It presents (a) the design of a single building block and (b) the full network architecture. The network begins with an initial convolutional layer, followed by one or more building blocks, and concludes with an adaptive max-pooling layer of output size 1, which reduces overfitting by condensing each filter to a single maximum activation. A fully connected layer follows the convolutional part, integrating meta-features such as seabed distance and sensor orientation.
Meta-features are processed via an additional fully connected layer with four neurons before being concatenated with the output of the CNN.
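The overall structure can be sketched as follows. The class names, filter count, kernel sizes, block count, and the use of standard `BatchNorm1d` in place of the element-wise variant are illustrative assumptions, not the tuned configuration from Table 3:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Residual building block: two conv layers plus a skip connection."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + x)  # skip connection aids gradient flow

class UxoCnn(nn.Module):
    def __init__(self, n_filters=32, n_blocks=2, n_meta=2):
        super().__init__()
        self.stem = nn.Conv1d(3, n_filters, kernel_size=3, padding=1)
        self.blocks = nn.Sequential(*[ConvBlock(n_filters) for _ in range(n_blocks)])
        self.pool = nn.AdaptiveMaxPool1d(1)       # one max activation per filter
        self.meta = nn.Sequential(nn.Linear(n_meta, 4), nn.ReLU())  # meta-feature branch
        self.head = nn.Linear(n_filters + 4, 2)   # UXO vs non-UXO logits

    def forward(self, x, meta):
        h = self.pool(self.blocks(self.stem(x))).flatten(1)
        return self.head(torch.cat([h, self.meta(meta)], dim=1))

model = UxoCnn()
logits = model(torch.randn(8, 3, 101), torch.randn(8, 2))
```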
The network was implemented in PyTorch (version 2.5.1+cu121). Hyperparameters, summarized in
Table 3, were selected based on convergence behavior. Early stopping was applied after six iterations without training loss improvement. Learning rate was reduced by a factor of 0.1 if no improvement was observed for three consecutive iterations. A batch size of 512 was used to stabilize the batch normalization estimates, as smaller batches yielded unstable mean and variance estimates due to high signal variance. The influence of the network hyperparameters on prediction performance are shown in
Figure 3.
All networks in this stage were evaluated using five-fold cross-validation.
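The training protocol described above (early stopping after six stale epochs, learning rate reduced by a factor of 0.1 after three) can be sketched as follows, shown on a toy model and random data rather than the actual CNN and dataset:

```python
import torch
import torch.nn as nn

def train(model, loss_fn, optimizer, loader, max_epochs=200,
          patience=6, lr_patience=3, lr_factor=0.1):
    """Training-loss-driven early stopping with LR reduction on plateau."""
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, factor=lr_factor, patience=lr_patience)
    best_loss, stale, epochs_run = float("inf"), 0, 0
    for epoch in range(max_epochs):
        epoch_loss = 0.0
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        scheduler.step(epoch_loss)   # reduce LR after lr_patience stale epochs
        epochs_run = epoch + 1
        if epoch_loss < best_loss - 1e-8:
            best_loss, stale = epoch_loss, 0
        else:
            stale += 1
            if stale >= patience:    # early stopping after six stale epochs
                break
    return epochs_run

# Toy usage (model and data are illustrative stand-ins).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 101, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batches = [(torch.randn(16, 3, 101), torch.randint(0, 2, (16,))) for _ in range(4)]
epochs_run = train(model, nn.CrossEntropyLoss(), optimizer, batches, max_epochs=10)
```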
4.3. Ablation Studies
Ablation studies were conducted to isolate the contributions of specific components to model performance.
First, we examined the effect of removing skip connections and batch normalization layers. The resulting architecture resembled a conventional CNN: a sequence of convolutional layers followed by adaptive max pooling and a fully connected layer with 100 ReLU-activated neurons. The remaining architecture was unchanged, except for the removed components.
Second, we evaluated the impact of meta-features by excluding them entirely. The green-shaded metadata-processing subnetwork (see
Figure 2b) was removed, and the model was retrained from scratch. Performance was again assessed using five-fold cross-validation.
Lastly, we extended the input data to include all magnetometer components (X, Y, Z) along with the scalar magnitude M, resulting in four-channel input tensors ($B$ again denoting the batch size).
4.4. Resilience Analysis
The final experimental stage focused on model resilience to input signal distortions, simulating real-world sensor anomalies.
Three perturbations were introduced to the test set:
Additive noise: Independent and identically distributed Gaussian noise was added as $\tilde{X} = X + \mathcal{N}(0, \sigma^2)$, at a range of noise levels $\sigma$, where $X$ refers to all signals in the input data. Examples are provided in
Figure 4a–e.
Time warping: Signals were divided into three random segments, each stretched or compressed according to a warp parameter. The $t$-th sample of an output signal was generated as $y_t = \mathrm{Interpolate}(x, t')$, where $\mathrm{Interpolate}()$ is a function returning an interpolation of the signal $x$ around the warped time index $t'$. For interpolation, the Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) method was used, which ensures shape-preserving interpolation. Example distortions for several warp levels are shown in
Figure 5a.
Signal drift: Drift was introduced into the signals at various amplitude levels $A$ (
Figure 6a). The distortion was performed additively according to the formula $\tilde{X} = X + A \cdot D$, where the nonlinear drift component is generated as $D = \mathrm{CumSum}(N)$. Here, $\mathrm{CumSum}$ denotes the cumulative sum function applied to a sequence and normalized to have a maximum equal to 1. $N$ represents a sequence of samples drawn from a standard normal distribution (mean zero, standard deviation one), with the number of samples $W$ equal to that of the original signal. The amplitude factor $A$ controls the overall magnitude of the added drift relative to the original signal's peak-to-peak range.
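The three distortions can be re-implemented as a NumPy/SciPy sketch for illustration. This is a stand-in, not the exact TsAug behavior: the three-segment logic of the time warp is simplified to a smooth random warp, and the drift normalization detail is an assumption:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

rng = np.random.default_rng(23)

def add_noise(x, sigma):
    """Additive i.i.d. Gaussian noise at level sigma."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def time_warp(x):
    """Shape-preserving warp: resample x at a monotone random time grid
    using PCHIP interpolation (segment scheme simplified)."""
    n = len(x)
    t = np.arange(n)
    warp = np.sort(rng.uniform(0, n - 1, size=n))
    warp[0], warp[-1] = 0, n - 1          # keep the endpoints fixed
    return PchipInterpolator(t, x)(warp)

def add_drift(x, amplitude):
    """Additive nonlinear drift: normalized cumulative sum of standard
    normal samples, scaled relative to the signal's peak-to-peak range."""
    d = np.cumsum(rng.standard_normal(len(x)))
    d /= np.abs(d).max()                  # normalize to maximum magnitude 1
    return x + amplitude * np.ptp(x) * d

signal = np.sin(np.linspace(0, 2 * np.pi, 101))
noisy = add_noise(signal, 0.05)
warped = time_warp(signal)
drifted = add_drift(signal, 0.1)
```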
The perturbed dataset was evaluated using the best model obtained from classical machine learning methods, as well as the developed neural network. All data transformations were implemented using the TsAug library [
23]. For the experiments, a simple 80/20 train/test split was applied. To evaluate the impact of input signal distortions, the network’s prediction performance was measured for increasing levels of input signal perturbations. In these experiments, balanced accuracy (detailed in the following section) was utilized as the primary performance indicator.
4.5. Evaluation Criteria
Two evaluation procedures were used. Model comparison relied on five-fold cross-validation, while hyperparameter tuning and the resilience analysis employed a fixed 80/20 train/test split. In cross-validation, the dataset was partitioned into five subsets, with each subset used once as a test set and the remainder for training, yielding mean and standard deviation metrics.
Performance was assessed using accuracy, balanced accuracy, and macro-averaged F1 score ($F_1$). Balanced accuracy was the primary metric due to class imbalance, as it averages recall and specificity and mitigates bias toward dominant classes. In contrast to standard accuracy—which is defined as the proportion of correctly predicted samples, and may be misleading when one class dominates—balanced accuracy ensures equal consideration for both classes by computing the average recall for each. It is formally defined as
$$\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right),$$
where $TP$ (true positives), $TN$ (true negatives), $FP$ (false positives), and $FN$ (false negatives) are elements of the confusion matrix. This makes balanced accuracy particularly suitable for our imbalanced dataset, in which UXO samples constitute approximately 28% of the total.
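As a worked example of why balanced accuracy is the primary metric here, consider a degenerate classifier that labels every sample non-UXO on a 28%/72% class split (counts below are illustrative, per 100 samples):

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Mean of recall (sensitivity) and specificity, per the definition above."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 0.5 * (recall + specificity)

# Majority-class predictor on a 28/72 split: plain accuracy looks decent
# (0.72) while balanced accuracy exposes the failure (0.5).
acc = (0 + 72) / 100
bal = balanced_accuracy(tp=0, tn=72, fp=0, fn=28)
```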
4.6. Computational Environment
All computations were performed on a dedicated server equipped with three GPUs: two NVIDIA RTX A6000 and one NVIDIA RTX A4000, alongside two AMD EPYC 7282 CPUs. Each experiment utilized a single GPU. Due to significant differences in processing speed between the A4000 and A6000, execution time comparisons were omitted.
Experiments were conducted in Python 3.11.9 using the IPython interactive environment. To ensure reproducibility, random seeds were fixed to 23 across PyTorch, NumPy, and Python's built-in random module. The source scripts are available on GitHub
https://github.com/piotres/uxo_classification_pub, accessed on 16 June 2025.
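The seeding scheme can be sketched as follows, with a quick check that repeated draws match after re-seeding (a minimal illustration, not the project's exact setup code):

```python
import random
import numpy as np
import torch

SEED = 23
random.seed(SEED)       # Python's built-in random module
np.random.seed(SEED)    # NumPy global RNG
torch.manual_seed(SEED) # PyTorch RNGs (CPU, and CUDA if available)

# Re-seeding reproduces the same draws.
a = torch.randn(3)
torch.manual_seed(SEED)
b = torch.randn(3)
```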
5. Results
The obtained results are organized into four stages, corresponding to those described in
Section 4. These stages aim to evaluate and compare the performance of the proposed neural network against classical machine learning methods, which serve as reference points for validating performance improvements. Subsequently, the architecture of the neural network is optimized and analyzed in terms of depth and layer width. The optimal configuration is further investigated by isolating and evaluating specific architectural components. Finally, a resilience analysis is conducted to assess the sensitivity of classification accuracy to variations in input data quality.
5.1. Classical Machine Learning Methods for UXO/non-UXO Classification
Experiments were conducted to establish reference performance benchmarks for classical machine learning methods. The results are presented in
Table 4.
Among the evaluated models, the best performance was achieved by the Random Forest classifier, which attained an accuracy of 86.39%, a balanced accuracy of 83.34%, and an $F_1$ score of 83.38%. The second-best model was the MLP neural network, with 85.26% accuracy, 80.94% balanced accuracy, and an $F_1$ of 81.55%. The kNN classifier performed the worst, reaching 83.51% accuracy, 79.31% balanced accuracy, and a 79.59% $F_1$ score. All differences between models were statistically significant according to a two-tailed, two-sample t-test.
These results differ substantially from those reported in [
9] due to two key distinctions between the datasets. As already mentioned, first, the earlier dataset included all three components of the magnetic field vector (X, Y, Z), while the current study relies solely on the magnitude of the magnetic field, a value typically returned by many commercial magnetometers such as optically pumped magnetometers (OPMs) [
24]. More importantly, the previous dataset did not account for remanent magnetization, which plays a significant role in shaping the magnetic field distribution around ferromagnetic objects [
25].
5.2. Neural Network Design
Initial results from [
9] highlighted the advantages of using neural networks for UXO classification. However, when applied to the updated dataset—which includes the effects of remanent magnetization—the performance of the original network dropped significantly. This performance degradation is likely due to the increased complexity of the magnetic field distribution, which may exceed the capacity of simpler network architectures.
To address these limitations, we transitioned from a basic MLP to a more expressive Convolutional Neural Network (CNN). Our proposed architecture is built upon modular building blocks, each comprising two convolutional layers, element-wise batch normalization, and residual skip-connections to facilitate effective gradient flow.
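The building block described above can be sketched as follows. This is a minimal NumPy forward-pass illustration of the stated pattern (two convolutional layers, normalization, and a residual skip connection), not the authors' exact implementation; in particular, per-channel normalization stands in here for batch normalization at inference, and all layer sizes are illustrative.

```python
import numpy as np

def conv1d_same(x, w):
    """'Same'-padded 1-D convolution: x is (C_in, L), w is (C_out, C_in, K), K odd."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, x.shape[1]))
    for o in range(c_out):
        for i in range(c_in):
            # flip the kernel so np.convolve performs cross-correlation
            out[o] += np.convolve(xp[i], w[o, i][::-1], mode="valid")
    return out

def channel_norm(h, eps=1e-5):
    """Per-channel normalization (a stand-in for batch norm at inference time)."""
    return (h - h.mean(axis=1, keepdims=True)) / (h.std(axis=1, keepdims=True) + eps)

def residual_block(x, w1, w2, w_proj=None):
    """conv -> norm -> ReLU -> conv -> norm, then add the (projected) input.

    A 1x1 projection `w_proj` is used when the input and output channel
    counts differ, so the skip connection remains shape-compatible.
    """
    h = np.maximum(channel_norm(conv1d_same(x, w1)), 0.0)
    h = channel_norm(conv1d_same(h, w2))
    skip = x if w_proj is None else conv1d_same(x, w_proj)
    return np.maximum(h + skip, 0.0)
```

The additive skip path is what lets gradients bypass the convolutions during backpropagation, which is the property the ablation study in Section 5.3 probes.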
The network was trained using the hyperparameters detailed in
Table 3. The selection of these parameters was guided by empirical analysis. For instance, the choice of training duration and the implementation of an early stopping protocol were informed by the learning curve shown in
Figure 3a. As the figure illustrates, the model’s validation performance typically plateaus after approximately 100 epochs. This observation justified setting a fixed training length while using early stopping to prevent overfitting and improve efficiency.
Similarly, the choice of a batch size of 512 is supported by the analysis in
Figure 3b. While the performance was comparable for batch sizes of 256 and 512, the larger value was selected for its theoretical benefits. A larger batch size provides more stable estimates of the mean and standard deviation for the batch normalization layers, which is particularly important given the high variance observed across the input signals in our dataset.
Figure 3.
The influence of the network hyperparameters on prediction performance. (a) Learning curve. (b) Batch size.
Table 5 summarizes the effects of varying the width (number of filters per block) and depth (number of blocks) of the network on classification performance.
The best-performing architecture consists of the following: an initial convolutional block with 32 filters; two building blocks with 128 and 512 filters, respectively; a fully connected layer with 100 neurons; and 4 neurons for processing meta-features. This model achieved 87.72% accuracy, 84.78% balanced accuracy, and an F1 score of 84.89%, with a total of 3,256,202 trainable parameters.
The second-best configuration included three building blocks and yielded similar but slightly less stable results. While its performance was not significantly different statistically, it required substantially more computational resources and had higher variance, with 4,870,862 parameters—about 1.5 times more than the optimal two-block model. These findings highlight the importance of selecting an appropriately sized network, balancing performance with model complexity.
Comparing the best neural network model to the top classical method (Random Forest), the CNN achieved approximately 1.5 percentage points higher accuracy and up to 3.8 percentage points improvement in balanced accuracy. This demonstrates that the CNN architecture provides a measurable performance advantage for this classification task, especially when dealing with more complex input data reflecting real-world conditions. Additionally, these differences were statistically significant according to a t-test.
5.3. Ablation Studies
To quantify the impact of individual architectural components on the overall model performance, we conducted a series of ablation studies.
First, we analyzed the influence of the meta-features. The results, presented in
Table 6, demonstrate that incorporating metadata improves the balanced accuracy by approximately 1 percentage point. These features provide crucial context that aids the network in interpreting the raw magnetometer signals. For instance, the sensor’s height above the seabed directly correlates with the signal amplitude, which decreases rapidly with increasing distance. Similarly, the sensor’s orientation (cardinal direction) provides contextual information that can help the model better characterize the shape of magnetic anomalies.
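The fusion of such meta-features with the learned signal representation can be illustrated as follows. The feature names (`height_m`, `heading_deg`) and the sin/cos encoding of the heading are our illustrative assumptions, not necessarily the paper's exact encoding; the sin/cos pair simply ensures that headings of 0 and 360 degrees map to the same input.

```python
import numpy as np

def fuse_meta(conv_features, height_m, heading_deg):
    """Append meta-features to the flattened convolutional features.

    `height_m` (sensor height above the seabed) and `heading_deg`
    (cardinal direction of the survey pass) are hypothetical names
    for the metadata described in the text.
    """
    theta = np.deg2rad(heading_deg)
    # encode the heading as sin/cos so the representation is continuous
    meta = np.array([height_m, np.sin(theta), np.cos(theta)])
    return np.concatenate([conv_features, meta])
```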
Second, we investigated the effect of input data dimensionality. Our baseline model used only the scalar magnitude of the magnetic field. A key question was to determine the performance loss resulting from this simplification. We compared the baseline against a model trained on all four magnetic field components (X, Y, Z, and M). As shown in
Table 7, providing the full vector data improves the balanced accuracy by over 2.3 percentage points (from 84.78% to 87.10%). This finding suggests that, when available, using the complete magnetic field vector from sensors like fluxgates [
26] is significantly more effective.
The final ablation experiment analyzed the contribution of the skip connections (SC) and element-wise batch normalization (BN) layers. The results, presented in
Table 8, reveal a dramatic decrease in performance when both mechanisms are removed: the balanced accuracy and F1 score drop by over 13 percentage points. This significant degradation highlights the critical role of these components. The additive nature of the skip connections, combined with the regularizing effect of batch normalization, effectively mitigates the vanishing/exploding gradient problem and stabilizes the training of deep networks.
5.4. Resilience Analysis
The final part of our study focused on the resilience of the proposed neural network compared to the Random Forest classifier. We evaluated model performance against three types of signal distortions—additive noise, time warping, and linear drift—as described in
Section 4.4. It is important to note that the results reported in this section were obtained on a single fixed test set to ensure a consistent comparison across different distortion levels, rather than through cross-validation.
Figure 4 illustrates the impact of additive noise on model performance. On the clean, unaltered test set, the Random Forest classifier slightly outperforms the neural network. However, its performance collapses precipitously with the introduction of even a small amount of noise, quickly approaching a balanced accuracy of 50% (the equivalent of random guessing). In stark contrast, the neural network demonstrates significantly greater stability. Its balanced accuracy decreases by only two percentage points at a noise level of 0.01 and remains relatively stable thereafter, dropping only by another two points as the noise level increases to 10%. This robust behavior indicates that the proposed neural network is well-suited for real-world applications where signal noise is inevitable. In practice, the impact of such noise could be further mitigated by applying a pre-processing step, such as a moving average filter.
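The additive-noise distortion and the moving-average pre-filter mentioned above can be sketched as follows. This is a minimal NumPy illustration; scaling the noise by the signal's peak-to-peak amplitude is our assumption about how the noise level is interpreted.

```python
import numpy as np

def add_noise(signal, level, rng):
    """Add Gaussian noise with a standard deviation of `level` times
    the signal's peak-to-peak amplitude (an assumed scaling)."""
    amp = signal.max() - signal.min()
    return signal + rng.normal(0.0, level * amp, size=signal.shape)

def moving_average(signal, window=5):
    """Moving-average filter that preserves length (odd window, edge padding)."""
    pad = window // 2
    padded = np.pad(signal, pad, mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")
```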
Figure 4.
The impact of additive noise on model performance. Subfigures (a–e) show example signals with increasing levels of noise. Subfigure (f) plots the balanced accuracy of the neural network (blue) and Random Forest (orange) as a function of the noise amplitude.
Next, we evaluated resilience to time warping, which simulates variations in survey speed by altering the signal’s shape without changing its amplitude (
Figure 5a). The results, plotted in
Figure 5b, again show an immediate performance collapse for the Random Forest model. Even the slightest time warp causes its balanced accuracy to fall below 50%, indicating that its predictions become worse than random. The neural network, however, remains robust. Its performance is largely unaffected by a speed ratio of 1.1 and only degrades by approximately 6 percentage points at a more significant ratio of 1.5. This demonstrates the CNN’s ability to learn scale-invariant features.
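One way to realize such a time-warp distortion is to resample the signal along a varying local speed profile. The sketch below uses a sinusoidal speed profile purely as an illustrative assumption; the key property is that the amplitude is untouched while the time axis is stretched and compressed.

```python
import numpy as np

def time_warp(signal, max_speed_ratio):
    """Warp the time axis: local speed varies sinusoidally between 1 and
    max_speed_ratio, and sample positions are rescaled so that the
    warped signal keeps its original length."""
    n = len(signal)
    phase = np.linspace(0, 2 * np.pi, n)
    speed = 1.0 + (max_speed_ratio - 1.0) * 0.5 * (1.0 + np.sin(phase))
    pos = np.cumsum(speed)
    pos = (pos - pos[0]) / (pos[-1] - pos[0]) * (n - 1)  # rescale to [0, n-1]
    return np.interp(pos, np.arange(n), signal)
```

A speed ratio of 1.0 leaves the signal unchanged, matching the baseline point of the resilience curves.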
Figure 5.
The impact of time warp augmentation on model performance. (a) Examples of signals distorted by different max speed ratios. (b) Balanced accuracy as a function of the max speed ratio.
Finally, we tested the models’ resilience to a linear signal drift, with results shown in
Figure 6. Consistent with the previous experiments, the Random Forest classifier’s performance degrades sharply with even a minimal drift, while the neural network maintains its predictive power. For the neural network, a drift of 0.01 causes a minor decrease of about one percentage point in balanced accuracy. Even with a substantial drift of 0.1, the total performance drop is only about three percentage points. This underscores the robustness of the designed neural network, particularly in comparison with the classical machine learning methods.
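The linear-drift distortion itself is straightforward to reproduce. In the sketch below we assume, as an illustration, that the drift parameter scales a linear ramp relative to the signal's peak-to-peak amplitude.

```python
import numpy as np

def add_drift(signal, drift_level):
    """Add a linear ramp rising from 0 to `drift_level` times the signal's
    peak-to-peak amplitude (our reading of the drift parameter)."""
    amp = signal.max() - signal.min()
    return signal + np.linspace(0.0, drift_level * amp, len(signal))
```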
Figure 6.
The impact of linear drift augmentation on model performance. (a) Examples of signals with different levels of added drift. (b) Balanced accuracy as a function of the drift amplitude.
An important question arises regarding the potential for further mitigating the impact of signal perturbations on neural network performance. One promising approach is data augmentation, specifically generating additional synthetic data samples. This strategy is particularly apt given the nonlinear dependence of magnetic field amplitude on sensor trajectory parameters, including altitude and azimuthal direction relative to the target object. A complementary approach involves including such augmented data (the same distorted data used for the resilience evaluation) within the training set. However, careful consideration is required to prevent performance degradation, especially when the perturbations become excessively large. While a comprehensive exploration of these augmentation strategies falls outside the scope of this article, it constitutes a direction for our future research.
6. Conclusions
This study demonstrates that classifying UXO from magnetometer data, particularly when the data is generated from complex digital twins incorporating remanent magnetization, presents significant challenges that expose the limitations of classical machine learning models. We found that popular models such as Random Forest can easily overfit, which we attribute to subtle artifacts within the simulated data. These artifacts, potentially arising at the boundaries of the different mesh densities used to keep computation times feasible in the digital twin, are visually indistinguishable yet can be exploited by tree-based models. This overfitting leads to a drastic degradation in performance when the model is faced with even minor variations in the input data.
Fortunately, the proposed Convolutional Neural Network (CNN) architecture proves to be highly resilient to these variations. Moreover, it significantly outperforms the classical models, achieving superior predictive accuracy. Our ablation studies confirmed that architectural components such as element-wise batch normalization and skip connections are critical to this stability and performance. Furthermore, our results show that a more compact “two-block” model offers a compelling trade-off, achieving comparable performance to a deeper “three-block” model while having 1.5 times fewer learnable parameters. This finding is particularly important for developing computationally efficient models suitable for on-site data processing during field studies. The inclusion of metadata was also shown to provide a consistent performance improvement.
The resilience analysis yields critical insights for the design of practical field surveys. Our results indicate that signal drift and time-warping are the most detrimental distortions, causing the steepest decline in model performance. This observation has direct implications for AUV survey design, underscoring the danger of electromagnetic interference—for instance, from one vehicle’s propulsion system causing signal drift on a nearby vehicle’s sensor. It also highlights the importance of precise vehicle positioning and maintaining a constant survey speed, as variations in velocity are equivalent to time-warping distortions that can severely impact classification accuracy.
In conclusion, this work not only presents a robust deep learning architecture for UXO classification, but also highlights the practical challenges and considerations for deploying such models. We believe these findings advance the understanding of neural network applications in this domain and provide actionable guidelines for improving both model design and data acquisition protocols. For future research, we anticipate that significant performance gains could be achieved by fusing magnetometry data with complementary sensor modalities, such as sonar imagery [
12], to create a more comprehensive and reliable classification system.