Article

System-Level Optimization of AUV Swarm Control and Perception: An Energy-Aware Federated Meta-Transfer Learning Framework with Digital Twin Validation

1
Marine Science and Ecological Environment College, Shanghai Ocean University, Shanghai 201306, China
2
School of Engineering, Shanghai Ocean University, Shanghai 201306, China
3
Merchant Marine Academy, Shanghai Maritime University, Shanghai 201306, China
4
Shanghai Longjing Information Technology Co., Ltd., Shanghai 201108, China
5
College of Fisheries and Life Sciences, Shanghai Ocean University, Shanghai 201306, China
6
School of Economics and Management, Shanghai Ocean University, Shanghai 201306, China
*
Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(4), 384; https://doi.org/10.3390/jmse14040384
Submission received: 12 January 2026 / Revised: 10 February 2026 / Accepted: 11 February 2026 / Published: 18 February 2026
(This article belongs to the Special Issue System Optimization and Control of Unmanned Marine Vehicles)

Abstract

Deep-sea exploration increasingly relies on Autonomous Underwater Vehicles (AUVs) to enable persistent, wide-area surveying in harsh and uncertain environments. In practice, however, deployments are constrained by tight energy budgets and bandwidth-limited, intermittent acoustic links, which complicate mission-level coordination. Moreover, many existing systems treat perception and control as loosely coupled modules, often resulting in redundant sensing, inefficient communication, and degraded overall performance—particularly under heterogeneous sensing modalities and shifting geological conditions. To address these challenges, we propose a hierarchical Federated Meta-Transfer Learning (FMTL) framework that tightly integrates collaborative perception with adaptive control for swarm optimization. The framework operates at three levels: (1) Representation Learning aligns heterogeneous sensors in a shared latent space via a physics-informed contrastive objective, substantially reducing communication overhead; (2) Meta-Learning Adaptation enables rapid transfer and convergence in new environments with minimal data exchange; and (3) Energy-Aware Control realizes closed-loop exploration by coupling Federated Explainable AI (FXAI) with decentralized multi-agent reinforcement learning (MARL) for path planning under energy constraints. Validated in high-fidelity hardware-in-the-loop simulations and a digital-twin environment, FMTL outperforms state-of-the-art baselines, achieving an AUC of 0.94 for target identification. Furthermore, an energy–intelligence Pareto analysis demonstrates a 4.5× improvement in information gain per Joule. Overall, this work provides a physically consistent and communication-efficient blueprint for the optimization and control of next-generation intelligent marine swarms.

1. Introduction

1.1. The Grand Challenge: Unveiling the Deep Sea Frontier Amidst Unprecedented Demands

The deep seafloor constitutes Earth’s largest and least surveyed biome, and is increasingly recognized as a domain of strategic and societal relevance, rather than a purely scientific frontier. The rising demand for critical minerals associated with the green energy transition, including cobalt, nickel, and rare earth elements, has elevated polymetallic nodules, cobalt-rich crusts, and seafloor massive sulfides as potential future resource targets [1,2]. In parallel, the improved characterization of deep-sea geological processes remains essential for evaluating geohazards, such as submarine landslides and tsunamis, as well as for reconstructing paleoclimate signals preserved in marine sediments [3]. Despite these drivers, exploration in the deep ocean remains constrained by darkness, pressure, and limited communication, and current operational practices do not support mapping and interpretation at the spatial scale and temporal resolution required by contemporary demands. Current multi-AUV systems lack system-level optimization, often treating perception and motion planning as isolated modules. This leads to inefficient energy consumption and suboptimal control in dynamic environments.

1.2. The “Lonely Explorer” Dilemma: Data Silos and Sensor Heterogeneity in AUV Operations

Deep-sea campaigns commonly deploy AUVs as independent platforms, which induces two coupled inefficiencies. Geophysical observations, such as bathymetry, magnetic signatures, and chemical anomalies, are typically retained as local logs on individual vehicles, which prevents in situ knowledge sharing and encourages redundant coverage, delayed interpretation, and a fragmented geological context. This limitation becomes more acute in fleets designed around specialization, where different AUVs carry distinct payloads to increase coverage efficiency. Although such heterogeneity is operationally attractive, fusing modalities that differ in structure and sampling—ranging from point clouds to vector fields and scalar time series—into a coherent, shared model remains technically challenging [4,5]. Existing multi-AUV coordination often emphasizes motion and coverage while lacking the cognitive integration required to synthesize heterogeneous observations into a unified, real-time geological understanding [6].

1.3. Limitations of Current Approaches: From Brute-Force Surveys to Black-Box AI

Scaling exploration by densifying pre-planned survey grids or increasing platform count raises cost without resolving representation and fusion constraints. Data-driven methods have enhanced the automated interpretation of single-sensor geophysical products [7]. However, many models remain difficult to trust in operational geology because their internal reasoning is not readily inspectable [8], particularly in centralized learning and inference pipelines. This limitation becomes more acute in multi-AUV settings, where data are distributed across platforms, and acoustic links severely limit bandwidth and reliability, making the centralized collection of raw sensor data impractical [9,10]. Distributed perception has been studied, but it is often decoupled from planning and control, resulting in weak connections between high-level geological inference and adaptive tasking, as well as coordinated motion. As a result, many AUV swarms still execute largely static coverage patterns that do not respond to discoveries during deployment. These limitations motivate a shift from isolated data acquisition toward collaborative, in situ knowledge formation, paired with mechanisms that translate inference into coordinated exploration behavior [11,12].

1.4. Our Proposed Paradigm: The Cognitive AUV Swarm

This work frames multi-AUV exploration as a cognitive system in which heterogeneous vehicles jointly construct and refine a shared geological model during operation. The proposed Cognitive AUV Swarm concept treats each AUV as both a sensing platform and a contributor to collective learning, enabling the fleet to leverage complementary viewpoints while respecting data ownership and communication constraints [13,14]. To operationalize this concept, the research presents a hierarchical Federated Meta-Transfer Learning (FMTL) framework, tailored explicitly to seafloor exploration [15,16]. The framework spans the swarm’s learning lifecycle by combining prior knowledge acquisition, collaborative model evolution across vehicles, and rapid adaptation to new regions or conditions [17]. It enables collaboration through compact model updates rather than raw, high-resolution sensor transmissions, thereby reducing dependence on high-bandwidth links and supporting privacy-preserving operations [18]. The framework addresses heterogeneous fusion through cross-modality representation learning that aligns sensor-specific information into a unified latent space suitable for shared reasoning [19]. To support operational interpretability, the framework incorporates an explainable AI layer intended to make geological predictions transparent to domain experts and to couple these predictions to closed-loop, adaptive exploration policies that reallocate effort during deployment [20,21,22].

1.5. Contributions and Research Structure

This research establishes a Cognitive AUV Swarm paradigm to mitigate data silos and sensor heterogeneity [11] through an FMTL architecture integrating transfer, federated, and meta-learning [23]. Technical contributions include: (1) cross-modality federated representation learning for unified latent space alignment [19]; (2) federated explainability for trustworthy reasoning [20]; (3) closed-loop exploration coupling perception to MARL-based planning [21,22]; and (4) decentralized motion coordination under acoustic constraints [9,10]. The framework introduces three novel architectural choices: (1) Mapper-Centric Aggregation, which aggregates only the shared mapper while keeping sensor-specific encoders local to preserve specialization and avoid “semantic mismatch” [8], ensuring convergence under non-IID modality distributions; (2) Federated Attribution, which aggregates compressed importance vectors ($V_k$) for swarm-level expert auditing; and (3) Information-Theoretic Planning, which couples geological inference to motion planning via a constrained optimization that explicitly penalizes acoustic connectivity strain.

2. Related Work

This work lies at the intersection of intelligent marine geophysics, distributed multi-agent systems, and contemporary learning paradigms, and it is motivated by the mismatch between current methodological capabilities and the operational constraints of multi-AUV exploration.

2.1. AI in Marine Geophysics: Automation Gains and Deployment Constraints

Recent work has applied deep learning to accelerate geophysical interpretation by reducing reliance on manual post-mission processing. Convolutional architectures have been utilized across various marine sensing products and interpretation targets, demonstrating that data-driven models can extract structured signals from complex measurements when trained on task-specific datasets [24,25]. Despite these advances, prevailing workflows remain dominated by centralized training and deployment assumptions that conflict with the inherently distributed and in situ nature of AUV operations. Model outputs are also frequently complex to validate in scientific practice because the mapping from input measurements to geological claims is insufficiently transparent, which limits trust and inhibits hypothesis checking against domain knowledge [8]. Researchers have explored visualization-based interpretability techniques to expose model sensitivity patterns; however, they rarely formulate these methods as part of a collaborative multi-sensor learning process, where the rationale for cross-modality fusion must be inspectable [20].

2.2. Multi-AUV Systems: Coordination Maturity and the Missing Collective Semantic Model

Research on multi-AUV systems has established robust methods for physical coordination, addressing formation maintenance, cooperative motion planning, and safety constraints under underwater dynamics [4,5]. These approaches enable synchronized movement and coverage efficiency, often via distributed agreement mechanisms over low-dimensional coordination variables. A separate line of work has examined distributed estimation and decentralized fusion for sharing state information across agents [10]. However, these formulations typically focus on compact state representations rather than high-dimensional semantic understanding of complex environments. The problem of enabling sensor-heterogeneous AUV teams to construct a shared, coherent representation of the seafloor remains insufficiently addressed, particularly when the sensing modalities differ in structure and sampling. Existing fusion strategies commonly rely on central aggregation of raw observations, which conflicts with the bandwidth limits of underwater acoustic networks and offers limited support for persistent, fleet-level semantic consistency. This creates a gap between established capabilities in coordinating motion and the comparatively limited capacity to coordinate interpretation, which motivates architectures that treat collective perception as a first-class objective rather than an auxiliary output of coordination.

2.3. Advanced Learning Paradigms: Federated, Meta, and Transfer Learning Without System-Level Integration

Researchers have developed machine learning paradigms aimed at learning from distributed data and facilitating rapid adaptation; however, they often design these methods as standalone solutions rather than as components of an integrated robotic exploration system. Federated learning provides a mechanism for collaboratively optimizing models across distributed clients while avoiding the sharing of raw data [18]. Its use in robotic settings remains challenging when heterogeneity extends beyond statistical variation to modality-level differences that imply distinct input structures and potentially distinct client architectures, where standard aggregation schemes can degrade performance or fail to converge. Meta-learning targets rapid adaptation with limited data by optimizing learning procedures that generalize across tasks [17]. Most meta-learning pipelines assume centralized access to curated task distributions and do not directly support continual, collaborative learning from the evolving experiences of a deployed fleet. Transfer learning based on large-scale pretraining has shifted model development toward reusable priors that can be adapted to downstream objectives [16]. A comparable direction is emerging in geoscience modeling [3]. However, the integration question remains: how to combine a static pretrained prior with dynamic, private, and heterogeneous data streams collected by distributed robotic platforms in a way that supports continual improvement without violating communication constraints.
We also draw inspiration from FedProto [26], which addresses heterogeneity by exchanging class prototypes rather than model parameters. While FedProto focuses on discrete classification tasks, our FMTL framework extends this intuition to the continuous domain by aligning “geophysical feature manifolds” (acting as continuous prototypes) across heterogeneous sensors, ensuring semantic consistency without enforcing architectural uniformity.

2.4. Synthesis and Our Contribution

Across these domains, the literature indicates that automated geophysical interpretation benefits from distribution-aware and interpretable learning, multi-AUV systems benefit from shared semantic modeling beyond kinematic coordination, and federated learning, meta-learning, and transfer learning benefit from system-level integration to address real-world robotic constraints. This work targets the intersection of these gaps by proposing a Federated Meta-Transfer Learning (FMTL) framework that combines pretrained priors, privacy-preserving collaborative updates across heterogeneous vehicles, and rapid adaptation mechanisms within a unified operational loop. The framework is designed to address the data silo and heterogeneity problem by aligning distributed learning with the realities of underwater communication and by treating interpretability as a requirement of collaborative fusion rather than an afterthought.

Why Heterogeneous Architecture Integration Is Non-Trivial

A critical distinction between this work and prior federated learning in robotics [18] is that standard parameter averaging assumes clients optimize the same functional form. When AUV encoders $E_k$ have different architectures (for example, $E_k^{\mathrm{bathy}}$: 256 × 3 → 128-D versus $E_k^{\mathrm{chem}}$: 4 × 1 → 128-D), averaging their gradients creates a coordination failure: the aggregated encoder gradient is incompatible with the task head on some clients, or equivalently, induces a “modality-averaged” feature that loses signal in specialized sensors.
The mapper-centric scheme addresses this by keeping the encoders local and excluding them from aggregation, which moves the fusion problem into a representation space where dimensionality is uniform. This preserves modality-specific features while enabling effective fusion across different sensor types, which is crucial for multi-sensor AUV operations. This pattern is being applied for the first time to marine multi-AUV systems under acoustic constraints.

2.5. The Gap Between Perception and Action: Integrating Planning and Control

Multi-AUV autonomy frequently separates perception from action by coupling offline sensing-driven analysis to pre-deployment survey plans. At the same time, online coordination focuses on motion synchronization and safety constraints rather than adaptive tasking [4,5]. This separation limits responsiveness to discoveries during deployment, as the planning layer is often not structured to incorporate updated geological insights in real-time. Adaptive exploration methods have begun to connect inference to decision making [21,22]. Nevertheless, the integration of distributed perception, online geological analysis, and coordinated multi-agent planning under acoustic bandwidth constraints remains insufficiently explored. The approach developed in this work is motivated by this integration gap and aims to couple collaborative inference with adaptive planning and decentralized control within a single operational framework.

3. The Federated Meta-Transfer Learning (FMTL) Framework

To operationalize a Cognitive AUV Swarm, this work proposes a hierarchical Federated Meta-Transfer Learning (FMTL) framework that governs the acquisition, fusion, and deployment of collective geological intelligence under distributed sensing and constrained underwater communication. Figure 1 summarizes the end-to-end information flow, where learning progresses from a pretrained prior, through collaborative model updates across heterogeneous vehicles, to rapid adaptation in previously unseen regions, and where the resulting belief state is coupled to exploration decisions.

3.1. Foundational Prior via Transfer Learning: The GeoSense Model

FMTL begins with a universal GeoSense foundation model that encodes broad geological priors intended to support subsequent swarm learning rather than to solve site-specific inference in isolation. The model is pretrained on publicly available global marine geophysical data to capture cross-variable dependencies that persist across regions [3]. GeoSense adopts a Transformer-based encoder architecture to represent relational structure and spatial correlations among geophysical measurements, producing latent representations that are reused as initialization for downstream learning within the swarm. Pretraining is formulated as a supervised or self-supervised objective that encourages recovery of missing or masked geophysical signals from contextual cues and supports discrimination among geological regimes, yielding a knowledge-rich parameterization that stabilizes and accelerates later updates under limited local data.

3.2. Collaborative Cognition via Federated Learning

After pretraining, GeoSense is disseminated to a swarm of $N$ participating AUVs, $\mathcal{A} = \{A_1, \ldots, A_N\}$, each carrying a potentially distinct sensor payload. Learning proceeds in communication rounds in which each vehicle optimizes a local objective using onboard measurements while the swarm constructs a shared representation through privacy-preserving aggregation of model updates rather than raw sensor logs [18]. The central challenge is that the swarm is sensor-heterogeneous, so collaboration cannot assume shared raw inputs or homogeneous encoders; the learning process must also preserve interpretability for scientific validation [8,20].

3.2.1. Cross-Modality Federated Representation Learning

To enable knowledge transfer across modalities without exchanging raw measurements, FMTL enforces collaboration in a shared latent space $\mathcal{Z}$ for geological representations rather than at the data level [19]. Each AUV $A_k$ maintains a local model composed of a sensor-specific encoder $E_k$ and a shared representation mapper $M$. The encoder $E_k$ maps modality-specific inputs to an intermediate representation, while $M$ projects encoder outputs into the common space $\mathcal{Z}$ using an identical architecture across the swarm. Local updates optimize both components on each vehicle, while federation aggregates only the shared mapper to induce cross-modality alignment.
Formally, let $D_k$ denote the local dataset of AUV $k$ with sensor modality $s_k$. The local model is defined as the composition $f_k = M \circ E_k$, where $E_k : \mathcal{X}_{s_k} \to \mathbb{R}^{d_{\mathrm{enc}}}$ is the sensor-specific encoder and $M : \mathbb{R}^{d_{\mathrm{enc}}} \to \mathcal{Z}$ is the shared representation mapper. AUV $k$ minimizes a local task loss $L_{\mathrm{task}}$ defined as
$$\min_{\theta_{E_k},\, \theta_M} \; \mathbb{E}_{(x, y) \sim D_k} \left[ L_{\mathrm{task}}\big(y,\, M(E_k(x; \theta_{E_k}); \theta_M)\big) \right] \tag{1}$$
where $\theta_{E_k}$ and $\theta_M$ denote parameters of $E_k$ and $M$, and $L_{\mathrm{task}}$ specifies the learning objective defined by the onboard training signal.
Crucially, during federation round $t$, only $\theta_M$ is transmitted and aggregated server-side, yielding
$$\theta_M^{(t+1)} \leftarrow \sum_{k=1}^{N} \frac{|D_k|}{|D_{\mathrm{total}}|}\, \theta_M^{k,(t)}, \qquad |D_{\mathrm{total}}| = \sum_{k=1}^{N} |D_k| \tag{2}$$
where $\theta_M^{k,(t)}$ is the mapper parameter after local training on AUV $k$ in round $t$, and dataset-size weighting follows standard secure aggregation practice [18]. The encoder parameters $\theta_{E_k}$ remain strictly local, so modality-specific processing is preserved while alignment is enforced through shared updates of $M$ into $\mathcal{Z}$. Figure 2 illustrates the mapper-centric federation mechanism that supports cross-fleet fusion without requiring raw data exchange.
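The aggregation step in Equation (2) can be illustrated with a minimal NumPy sketch. The function name `aggregate_mappers`, the dictionary layout of the parameters, and the toy values are assumptions made for illustration; they are not taken from the framework's implementation. Only mapper parameters appear here because the encoders never leave the vehicles.

```python
import numpy as np

def aggregate_mappers(mapper_params, dataset_sizes):
    """Dataset-size-weighted average of per-AUV mapper parameters, as in Equation (2)."""
    sizes = np.asarray(dataset_sizes, dtype=float)
    weights = sizes / sizes.sum()  # |D_k| / |D_total|
    return {
        name: sum(w * params[name] for w, params in zip(weights, mapper_params))
        for name in mapper_params[0]
    }

# Hypothetical example: three AUVs, each holding a locally trained copy of a tiny mapper.
local_mappers = [
    {"W": np.full((2, 2), 1.0), "b": np.zeros(2)},
    {"W": np.full((2, 2), 2.0), "b": np.ones(2)},
    {"W": np.full((2, 2), 4.0), "b": np.full(2, 2.0)},
]
global_mapper = aggregate_mappers(local_mappers, dataset_sizes=[100, 100, 200])
print(global_mapper["W"][0, 0], global_mapper["b"][0])
```

The third vehicle holds half of the total data, so its mapper contributes half of the weight to the aggregate.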
Cross-modal regularization effect: Since the shared mapper M forces alignment between modalities, a strong signal in one modality (e.g., a clear bathymetric structure) can effectively “stabilize” the representation of a weaker or noisy modality (e.g., a fluctuating chemical signal) during the gradient update process. This suggests that the swarm does not merely fuse data but actively calibrates heterogeneous sensors against each other without external ground truth, a key capability for long-duration autonomy where sensor drift is inevitable.
In Equation (1), the task loss is defined as $L_{\mathrm{task}} = L_{\mathrm{CE}} + \lambda_{\mathrm{cont}} L_{\mathrm{cont}}$, combining a supervised term and a contrastive alignment term. We utilize the NT-Xent (Normalized Temperature-scaled Cross Entropy) loss for $L_{\mathrm{cont}}$ to align heterogeneous embeddings:
$$L_{\mathrm{cont}}(z_i, z_j) = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k=1}^{2B} \mathbb{1}_{[k \neq i]} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)} \tag{3}$$
where $(z_i, z_j)$ are positive pairs (co-located multi-modal samples), $B$ is the batch size of positive pairs (resulting in $2B$ total representations), and $\mathrm{sim}(z_i, z_j)$ denotes the cosine similarity.
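As a concrete reading of the NT-Xent term, the following NumPy sketch computes the loss averaged over all 2B anchors. The function name `nt_xent`, the variable names, and the synthetic embeddings are illustrative assumptions; a training implementation would use an autodiff framework rather than plain NumPy.

```python
import numpy as np

def nt_xent(z_a, z_b, tau=0.1):
    """Mean NT-Xent loss over 2B anchors, where (z_a[i], z_b[i]) are positive pairs."""
    B = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2B, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit vectors, so dot = cosine sim
    sim = (z @ z.T) / tau                               # (2B, 2B) scaled similarities
    np.fill_diagonal(sim, -np.inf)                      # indicator 1[k != i] in the denominator
    pos = np.concatenate([np.arange(B, 2 * B), np.arange(0, B)])  # positive index per anchor
    row_max = sim.max(axis=1, keepdims=True)            # log-sum-exp for numerical stability
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * B), pos] - log_denom
    return float(-log_prob.mean())

# Hypothetical embeddings: co-located bathymetric and chemical samples in the shared space.
rng = np.random.default_rng(0)
z_bathy = rng.normal(size=(8, 16))
z_chem_aligned = z_bathy + 0.05 * rng.normal(size=(8, 16))  # well-aligned modalities
z_chem_random = rng.normal(size=(8, 16))                    # unaligned modalities
loss_aligned = nt_xent(z_bathy, z_chem_aligned)
loss_random = nt_xent(z_bathy, z_chem_random)
print(loss_aligned < loss_random)
```

Aligned pairs yield a lower loss than unaligned ones, which is the pressure that pulls co-located samples from different sensors together in the shared space.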
Specifically, to ensure reproducibility, we set the contrastive weight $\lambda_{\mathrm{cont}} = 0.5$ and the temperature parameter $\tau = 0.1$ and provide a sensitivity analysis in Section 5.5.3. Local training is performed using the AdamW optimizer with ($\beta_1 = 0.9$, $\beta_2 = 0.999$) and an initial learning rate of $10^{-4}$ under a cosine decay. For the mapper-centric aggregation in Equation (2), communication rounds are triggered every 10 local steps. To mitigate the instability induced by non-IID data distributions, we apply a dynamic exponential moving average (EMA) to the global mapper parameters during the aggregation phase, facilitating smoother convergence and robust representation alignment.
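The schedule and smoothing described above can be sketched as follows. The decay-to-zero cosine form, the fixed EMA coefficient of 0.9, and the function names are assumptions for illustration; the text does not specify how the "dynamic" EMA coefficient is adapted, so a constant is used here.

```python
import math
import numpy as np

def cosine_lr(step, total_steps, lr0=1e-4):
    """Cosine decay from lr0 at step 0 down to 0 at total_steps (assumed schedule form)."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * step / total_steps))

def ema_update(global_params, aggregated, decay=0.9):
    """Blend the previous global mapper with the newly aggregated one (assumed fixed decay)."""
    return decay * global_params + (1.0 - decay) * aggregated

LOCAL_STEPS_PER_ROUND = 10            # a communication round is triggered every 10 local steps
theta_global = np.zeros(3)            # hypothetical flattened mapper parameters
theta_aggregated = np.ones(3)         # hypothetical output of the weighted aggregation
theta_global = ema_update(theta_global, theta_aggregated)
print(cosine_lr(0, 100), cosine_lr(50, 100), theta_global)
```

A larger decay makes the global mapper change more slowly between rounds, which trades responsiveness for stability under non-IID updates.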
The mapper-centric aggregation scheme in Equation (2) departs from standard FedAvg in a way that requires theoretical justification. In standard federated learning, convergence is guaranteed when all clients optimize $L(\theta)$ with IID or bounded non-IID data. Here, AUVs optimize potentially different loss landscapes because their modality-specific encoders $E_k$ project into the shared space $\mathcal{Z}$ via distinct pathways. To analyze the convergence behavior of the proposed mapper-centric federation under heterogeneous modalities, we adopt a stochastic optimization framework with the following mild assumptions, which are commonly used in analyses of federated and asynchronous learning.
Assumption 1 (Smoothness). Each local objective induced by a modality-specific encoder $E_k$ is $L$-smooth with respect to the shared mapper parameters $\theta$.
Assumption 2 (Unbiased Stochastic Gradients). The stochastic gradients computed at each client are unbiased estimators of the corresponding local gradients, with bounded variance $\sigma^2$.
Assumption 3 (Bounded Asynchrony). Due to acoustic communication constraints, the staleness of received updates is bounded by δ local optimization steps in the worst case.
Under Assumptions 1–3, we derive the convergence bound for the proposed mapper-centric aggregation under acoustic constraints. The detailed mathematical proof is provided in Appendices A to F.
Beyond the ideal setting, we account for acoustic impairments, including the packet delivery ratio (PDR) $\rho \in (0, 1]$ and the maximum asynchronous delay $\delta$. Under the realistic acoustic channel model (Section 5.6.2), the error bound for the aggregated shared mapper is derived as:
$$\mathbb{E}\left[L(\theta_T) - L^*\right] \le \underbrace{\frac{C_1}{TN}}_{\text{Ideal Convergence}} + \underbrace{\frac{C_2\, \delta}{\rho\, T} + \frac{\sigma^2}{N \rho}}_{\text{Acoustic Penalties}} \tag{4}$$
where $T$ denotes the total number of communication rounds, $N$ is the swarm size, and $\sigma^2$ represents the stochastic gradient variance. $C_1$ and $C_2$ are constants dependent on the Lipschitz constant $L$ and the initial optimality gap.
This bound demonstrates that the algorithm achieves a convergence rate of $O(1/T)$, dominated by the optimization error term. Unlike standard FedAvg, the bound suggests that while the contrastive objective regularizes modality-level discrepancies, the acoustic constraints introduce a residual error floor. Specifically, as $T \to \infty$, the first two terms vanish, but the system stabilizes within an error neighborhood scaled by $O(1/\rho)$, indicating that lower packet delivery ratios result in a higher steady-state error floor due to effective batch-size reduction.
Notably, unlike standard FedAvg, the derived convergence bound does not explicitly depend on modality-level distribution discrepancies. This property arises from the contrastive objective, which regularizes heterogeneous local objectives through a shared representation space. Moreover, the algorithm is tolerant to bounded asynchrony: the error induced by stale updates with delay $\delta$ decays at a rate of $O(1/T)$, ensuring that delayed gradients caused by acoustic multipath effects do not lead to divergence, provided that the learning rate decays appropriately.
These theoretical results are consistent with the empirical observations, where performance degrades smoothly as the packet delivery ratio decreases and becomes critical when ρ < 0.48 in our experiments. This trend is further supported by experimental results under realistic acoustic channels, which we discuss in Section 5.6.2.
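To make the qualitative behavior of the acoustic penalties tangible, the sketch below evaluates a staleness term that decays as 1/T and a residual floor that scales as 1/ρ. The exact functional forms (C2·δ/(ρT) and σ²/(Nρ)), the constant `C2 = 1.0`, and all numeric inputs are assumptions chosen for illustration; they are not fitted to the reported experiments.

```python
def acoustic_penalty(T, delta, rho, sigma2, N, C2=1.0):
    """Return (staleness term, residual floor) under the assumed forms described above."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("packet delivery ratio must lie in (0, 1]")
    stale = C2 * delta / (rho * T)   # vanishes as the number of rounds T grows
    floor = sigma2 / (N * rho)       # independent of T; grows as delivery degrades
    return stale, floor

# Hypothetical setting: 10 AUVs, unit gradient variance, staleness of 5 local steps.
for rho in (1.0, 0.75, 0.5):
    stale, floor = acoustic_penalty(T=100, delta=5, rho=rho, sigma2=1.0, N=10)
    print(f"rho={rho:.2f}  stale={stale:.4f}  floor={floor:.4f}")
```

Halving the packet delivery ratio doubles the floor, which matches the description of packet loss acting as an effective batch-size reduction.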

3.2.2. Federated Decision Attribution for Interpretability

FMTL integrates interpretability into the federated loop by augmenting parameter updates with compact attribution signals derived from local explainability analyses [20]. After each local update, $A_k$ computes a compressed feature-importance summary $V_k$ that reflects which input features most influenced its current predictions under modality $s_k$. Each AUV transmits $V_k$ alongside the mapper update, and the server aggregates the received vectors to form a federated decision attribution map. The aggregated attribution provides a swarm-level consensus over which geophysical cues are driving a given inference, enabling domain experts to audit the basis of the collective model and to evaluate whether the learned decision logic is consistent with geological reasoning constraints [8].
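One possible realization of this attribution exchange is sketched below. It assumes that all vehicles index their importance scores against a common, fleet-wide list of named geophysical cues, that compression means keeping the top-k entries, and that the server sums absolute importances and normalizes; none of these choices is specified in the text, and the function names are hypothetical.

```python
import numpy as np

def compress_attribution(importance, k):
    """Keep the k largest-magnitude importances as (indices, values)."""
    importance = np.asarray(importance, dtype=float)
    idx = np.argsort(-np.abs(importance))[:k]
    return idx, importance[idx]

def aggregate_attributions(compressed, n_features):
    """Sum absolute importances received from all AUVs and normalize to a distribution."""
    total = np.zeros(n_features)
    for idx, vals in compressed:
        total[idx] += np.abs(vals)
    s = total.sum()
    return total / s if s > 0 else total

# Hypothetical fleet-wide cue list: [slope, magnetic anomaly, turbidity, Eh].
v_1 = compress_attribution([0.7, 0.1, 0.0, 0.2], k=2)   # sent by AUV 1
v_2 = compress_attribution([0.6, 0.0, 0.3, 0.1], k=2)   # sent by AUV 2
attribution_map = aggregate_attributions([v_1, v_2], n_features=4)
print(attribution_map)
```

Each vehicle transmits only k index-value pairs, so the attribution payload stays small relative to the mapper update.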

3.2.3. Communication-Efficient Model Synchronization

Underwater acoustic links impose severe bandwidth constraints and non-negligible loss rates, so FMTL adopts communication-efficient synchronization mechanisms to reduce payload size while preserving learning stability [10]. Model updates are compressed prior to transmission via sparsification and quantization, and then encoded in a lossless sparse format. Robustness to packet loss is supported through erasure-resilient transmission (e.g., forward error correction or selective retransmission), so that synchronization remains viable under intermittent connectivity. Round-based scheduling is used to reduce contention in multi-vehicle acoustic channels. These mechanisms are treated as part of the learning protocol because the feasibility of federation depends on maintaining predictable update exchange under constrained communication.
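A minimal sketch of the compression path is given below, assuming top-k magnitude sparsification, symmetric 8-bit quantization with a single scale factor, and an index-value pair encoding as the sparse format. The keep ratio, the function names, and the example update are illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np

LEVELS = 127  # symmetric signed 8-bit range

def compress_update(update, keep_ratio=0.1):
    """Top-k sparsification followed by int8 quantization; returns (indices, codes, scale)."""
    flat = update.ravel()
    k = max(1, int(keep_ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]     # positions of the k largest magnitudes
    vals = flat[idx]
    scale = float(np.abs(vals).max()) or 1.0         # avoid division by zero for all-zero updates
    codes = np.round(vals / scale * LEVELS).astype(np.int8)
    return idx.astype(np.uint32), codes, scale

def decompress_update(idx, codes, scale, shape):
    """Rebuild a dense update; entries that were not transmitted are zero."""
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = codes.astype(float) / LEVELS * scale
    return flat.reshape(shape)

# Hypothetical mapper update with a few dominant entries.
update = np.array([[0.0, 0.5, -2.0, 0.01], [1.0, -0.02, 0.03, 0.0]])
idx, codes, scale = compress_update(update, keep_ratio=0.5)
restored = decompress_update(idx, codes, scale, update.shape)
print(restored)
```

The quantization error on transmitted entries is bounded by the step size `scale / LEVELS`, while dropped entries are lost entirely, so the keep ratio controls the main accuracy trade-off.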

3.3. Rapid Adaptation via Meta-Learning

The final learning component targets deployment in new survey regions whose baseline geology may differ from previously observed environments. FMTL frames each new geological block as a task sampled from an environment distribution and aims to learn shared model parameters $\theta_{\mathrm{shared}}$ that generalize across tasks and can be rapidly specialized via a latent context inferred from limited initial observations. Adaptation is implemented through context-based meta-learning [17], where a context encoder infers a latent description of the local environment from early survey observations and uses this context to condition the model’s inference.
Let $D_{\mathrm{new}}$ denote the sequence of initial observations collected in a new survey block. Here, $\theta_{\mathrm{shared}}$ denotes the fixed parameters learned from pretraining and federated learning, while task adaptation is realized by inferring the context vector $c$ from $D_{\mathrm{new}}$ without updating $\theta_{\mathrm{shared}}$. A context encoder $g_\phi$ maps this sequence to a latent context vector $c = g_\phi(D_{\mathrm{new}})$, and the geological inference model $F$ is conditioned on $c$ as:
$$\hat{y} = F(x, c; \theta_{\mathrm{shared}}) \tag{5}$$
where $x$ is the current input, $\hat{y}$ is the predicted geological variable of interest, $\theta_{\mathrm{shared}}$ denotes parameters shared across tasks (induced by transfer and federated learning), and $c$ captures task-specific context inferred from $D_{\mathrm{new}}$. By holding $\theta_{\mathrm{shared}}$ fixed while inferring $c$, the swarm adapts its interpretation logic to local distributions without relying on extensive gradient-based fine-tuning, which reduces the risk of degrading shared knowledge when operating under limited local data.
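The adaptation mechanism can be sketched with a toy model in which the context encoder mean-pools the early observations and the predictor adds a context-dependent offset to a shared linear score. The architectures, dimensions, random weights, and function names are all assumptions for illustration; the point is only that adaptation changes the context vector while the shared parameters stay untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_CTX = 4, 3
theta_shared = {"w_x": rng.normal(size=D_IN), "w_c": rng.normal(size=D_CTX)}  # frozen
phi = rng.normal(size=(D_IN + 1, D_CTX))                                      # context encoder weights

def context_encoder(d_new, phi):
    """g_phi: mean-pool rows of [x, y] observations, then apply a linear map and tanh."""
    return np.tanh(np.mean(d_new, axis=0) @ phi)

def predict(x, c, theta):
    """F(x, c; theta_shared): shared linear score plus a context offset, squashed to (0, 1)."""
    logit = x @ theta["w_x"] + c @ theta["w_c"]
    return 1.0 / (1.0 + np.exp(-logit))

# Hypothetical early observations from two survey blocks with different baseline geology.
block_a = rng.normal(loc=+1.0, size=(5, D_IN + 1))
block_b = rng.normal(loc=-1.0, size=(5, D_IN + 1))
snapshot = {name: w.copy() for name, w in theta_shared.items()}

c_a, c_b = context_encoder(block_a, phi), context_encoder(block_b, phi)
x = rng.normal(size=D_IN)
y_a, y_b = predict(x, c_a, theta_shared), predict(x, c_b, theta_shared)
print(y_a, y_b)
```

The same input receives different predictions in the two blocks purely through the inferred context, with no gradient step on the shared parameters.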

3.4. Perception–Action Integration: From Geological Beliefs to Adaptive Exploration

FMTL couples collaborative perception to exploration behavior by converting the evolving collective model into an actionable belief representation and using this belief to drive tasking, planning, and coordination. The server-side model produces a continuously updated global belief map M that summarizes the current belief state and uncertainty over the survey area. This map conditions a multi-agent decision process that assigns goals and prioritizes regions for measurement under operational constraints. The exploration policy is formulated as a constrained objective that trades off information acquisition against communication burden while respecting feasibility constraints.
Let $\pi$ denote a multi-agent exploration policy that selects joint actions $a_t$ over a horizon $T$. The optimal policy $\pi^*$ is defined through:
$$\pi^* = \arg\max_{\pi} \; \mathbb{E}\left[\sum_{t=0}^{T} \Big( I(M; Z_t \mid a_t) - \lambda_{\mathrm{comm}}\, C_{\mathrm{comm}}(a_t) \Big)\right] \tag{6}$$
where $I(M; Z_t \mid a_t)$ denotes the expected information gain (mutual information) regarding the map $M$ obtained by executing action $a_t$ and acquiring observation $Z_t$. The term $C_{\mathrm{comm}}(a_t)$ penalizes actions that violate or strain acoustic connectivity, and $\lambda_{\mathrm{comm}}$ is a balancing hyperparameter that controls the information–communication trade-off. Connectivity constraints are encoded within $C_{\mathrm{comm}}(\cdot)$ to enforce conditions of the form $\|p_i - p_{\mathrm{anchor}}\| \le R_{\mathrm{comm}}$, where $p_i$ denotes the position of agent $i$, $p_{\mathrm{anchor}}$ is an anchor or relay reference, and $R_{\mathrm{comm}}$ is the acoustic range. Each AUV executes assigned objectives through onboard constrained planning and decentralized motion coordination that regulates relative geometry using locally available neighbor information and supports asynchronous execution between update exchanges. Implementation details, parameter settings, and runtime characteristics are reported in Section 4 under controlled experimental conditions.
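To make the trade-off concrete, the sketch below scores a single candidate rollout as the sum of per-step information gain minus a weighted connectivity penalty. It is an illustration under stated assumptions: the penalty is taken to be the total range excess beyond the acoustic range, the expectation is replaced by one rollout, and the names `comm_cost` and `plan_score` as well as the numerical values are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def comm_cost(positions: Sequence[Point], anchor: Point, r_comm: float) -> float:
    """Assumed C_comm: total distance (m) by which agents exceed the acoustic range."""
    return sum(max(0.0, math.dist(p, anchor) - r_comm) for p in positions)


def plan_score(info_gains: Sequence[float],
               positions_per_step: Sequence[Sequence[Point]],
               anchor: Point, r_comm: float, lam_comm: float) -> float:
    """Sum over steps of (information gain - lam_comm * connectivity penalty)."""
    return sum(
        ig - lam_comm * comm_cost(pos, anchor, r_comm)
        for ig, pos in zip(info_gains, positions_per_step)
    )


ANCHOR = (0.0, 0.0)
R_COMM = 5000.0   # acoustic range in metres (5 km, as in the simulated setup)
LAM = 0.002       # hypothetical trade-off weight

# Plan A stays in range; plan B gains more information but leaves range at step 2.
score_a = plan_score([1.0, 1.0], [[(3000.0, 0.0)], [(4000.0, 0.0)]], ANCHOR, R_COMM, LAM)
score_b = plan_score([1.0, 1.5], [[(3000.0, 0.0)], [(5500.0, 0.0)]], ANCHOR, R_COMM, LAM)
```

With these made-up numbers the in-range plan scores higher even though the out-of-range plan collects more raw information, which is the behaviour the penalty term is meant to induce.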
To bridge the gap between high-level coordination and low-level feasibility, we implement a hierarchical control architecture that couples an offline-learned policy with online heuristic planning. The framework consists of three core components:
  • High-Level Decision (Offline Learned MARL): A decentralized policy network $\pi_{\theta}$, trained offline via Multi-Agent PPO, serves as the strategic decision-maker. At the beginning of each replanning cycle, it generates a High-Level Goal $g^*$, identifying a specific sub-region in the belief map, to maximize long-term information gain. This policy encapsulates collaborative logic while remaining computationally lightweight for onboard inference.
  • Low-Level Execution (Online Heuristic RRT): Upon setting the goal $g^*$, an Information-Driven RRT acts as the local solver. It operates in a decentralized, Jacobian-free manner to generate kinematically feasible trajectories reaching $g^*$. Crucially, this solver enforces the hard constraints defined in Equation (6), including collision avoidance and acoustic connectivity $C_{\mathrm{comm}}$, which are typically difficult to guarantee through pure RL.
  • Coordination via Intention Vectors: To accommodate bandwidth limitations, AUVs broadcast compact Intention Vectors (containing the selected goal $g^*$ and cost-to-go) during TDMA slots instead of dense trajectories. The local planner utilizes these vectors to prune redundant search branches, effectively achieving swarm-level coordination without the need for centralized optimization.
To solve the constrained objective in Equation (6) in real time, we employ a modified Information-Driven Rapidly-exploring Random Tree (RRT). Unlike standard RRT, which targets a fixed goal state, our planner iteratively builds a tree $\mathcal{T}$ by sampling candidate viewpoints in the belief map. A node expansion is valid only if the connecting trajectory satisfies the kinematic limits and the acoustic connectivity constraint $C_{\mathrm{comm}}(\cdot)$. The path $\pi^*$ is selected by tracing the branch that maximizes the cumulative information gain (IG) while maintaining feasible acoustic links, effectively bridging the high-level optimization objective with low-level motion primitives.
A deterministic TDMA-style access protocol is employed for underwater synchronization to minimize packet collisions and ensure predictable update windows. The resulting bounded communication latency is explicitly accounted for in trajectory optimization by treating model and task updates as arriving at discrete synchronization instants rather than as continuous feedback. Between synchronization instants, AUVs execute trajectories asynchronously using their most recent feasible plans while continuously monitoring for new tasking messages. When task updates are received, replanning is triggered from the current state at the next available planning cycle, preventing motion interruption and avoiding idle waiting for communication. This execution strategy enables sustained survey progress under severely constrained acoustic bandwidth.
Unlike surface navigation tasks [27], which primarily focus on learning kinematic collision avoidance behaviors, deep-sea exploration requires reasoning about complex underlying physical processes. A key differentiator of the proposed FMTL framework lies in its implicit encoding of physical laws within the shared latent space Z. By training on multi-physics simulation data, the shared mapper learns a physics-informed representation in which the latent manifold structure reflects governing physical equations. This implicit physical consistency enables the swarm to infer source locations even under sparse sampling conditions, a capability that purely data-driven baselines—such as standard Vision Transformer models—fail to achieve in highly dynamic oceanographic environments.
To bridge the gap between RL-based exploration and operational safety, we integrate a Control Barrier Function (CBF) layer that provides formal guarantees for collision avoidance and connectivity maintenance ($C_{\mathrm{comm}}$).
  • Feasibility Guarantee: Let $h(x) \ge 0$ represent the safe set (connectivity and collision-free space). The controller enforces the forward invariance of this set by solving the QP: $\min_{u} \|u - \pi^*(x)\|^2 \ \text{s.t.} \ \dot{h}(x) + \alpha h(x) \ge 0$. This ensures that even if the RL policy $\pi^*$ suggests an unsafe action due to sensing noise, the executed action $u$ remains within the feasible set.
  • Safety under Communication Loss: We define a Safe Horizon ($T_{\mathrm{safe}}$) bound linking planner feasibility to packet loss. If communication is lost for duration $\Delta t_{\mathrm{loss}}$, a sufficient condition for safety is:
$$\Delta t_{\mathrm{loss}} \le \frac{R_{\mathrm{comm}} - d_{ij} - \epsilon_{\mathrm{pos}}}{2 v_{\max}} \tag{7}$$
where $R_{\mathrm{comm}}$ is the acoustic range, $d_{ij}$ is the current inter-agent distance, $\epsilon_{\mathrm{pos}}$ is navigation error, and $v_{\max}$ is maximum velocity. The planner defaults to a “fail-safe hold” mode if the predicted time to next update exceeds $T_{\mathrm{safe}}$, ensuring the system never violates reachability constraints during acoustic blackouts.
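Both safety mechanisms can be illustrated with a short sketch. It assumes single-integrator dynamics (velocity is the control input) and a single range barrier $h(x) = R_{\mathrm{comm}}^2 - \|x - p_{\mathrm{anchor}}\|^2$, for which the QP with one affine constraint has a closed-form projection. These modeling choices, the function names, and the 50 m navigation error are assumptions made for illustration, not the implementation evaluated here.

```python
from typing import Tuple

Vec = Tuple[float, float]


def cbf_filter(x: Vec, u_ref: Vec, anchor: Vec, r_comm: float, alpha: float) -> Vec:
    """Closed-form solution of min ||u - u_ref||^2 s.t. dh/dt + alpha*h >= 0,
    assuming x' = u and h(x) = r_comm^2 - ||x - anchor||^2."""
    dx, dy = x[0] - anchor[0], x[1] - anchor[1]
    h = r_comm ** 2 - (dx * dx + dy * dy)
    a = (-2.0 * dx, -2.0 * dy)              # dh/dt = a . u for these dynamics
    slack = a[0] * u_ref[0] + a[1] * u_ref[1] + alpha * h
    if slack >= 0.0:
        return u_ref                        # reference action is already safe
    norm_sq = a[0] ** 2 + a[1] ** 2         # positive whenever slack < 0 and h >= 0
    k = -slack / norm_sq
    return (u_ref[0] + k * a[0], u_ref[1] + k * a[1])


def safe_horizon(r_comm: float, d_ij: float, eps_pos: float, v_max: float) -> float:
    """Right-hand side of the communication-loss bound, in seconds."""
    return (r_comm - d_ij - eps_pos) / (2.0 * v_max)


# Agent near the range limit, commanded to move outward at 2 m/s.
u_safe = cbf_filter((4900.0, 0.0), (2.0, 0.0), (0.0, 0.0), 5000.0, alpha=0.01)

# 5 km range, 4.1 km separation, assumed 50 m navigation error, 2 m/s top speed.
t_safe = safe_horizon(5000.0, 4100.0, 50.0, 2.0)
```

Under these assumptions the outward speed is reduced to roughly 1.01 m/s, and the tolerable blackout is 212.5 s, which is much longer than a 13.5 s TDMA cycle.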

4. Experimental Setup and Evaluation

This section evaluates the proposed FMTL pipeline, illustrated in Figure 3, in a high-fidelity simulated marine exploration environment designed to reflect distributed sensing, sensor heterogeneity, and acoustic communication limitations. The evaluation targets two questions: whether cross-modality federation produces a coherent collective model that is not attainable through isolated learning or naive federation, and whether coupling that model to adaptive planning yields measurable improvements in exploration performance under operational constraints.
The framework starts from a globally pre-trained GeoSense foundation model and performs mapper-centric federated learning to align heterogeneous sensor modalities while preserving local encoders on each AUV. The resulting shared latent representation is then rapidly adapted to new mission contexts via meta-learning and coupled with an adaptive planner for coordinated swarm exploration. To rigorously evaluate the effectiveness of this architecture under realistic operational constraints, the subsequent experiments are conducted in a high-fidelity simulated marine environment. In particular, a digital-twin seafloor scenario is constructed to capture geological heterogeneity, sensor diversity, and acoustic communication limitations, enabling systematic assessment of cross-modality federation, adaptation speed, and downstream exploration performance.

4.1. Digital Twin Validation Environment

A digital-twin seafloor environment is constructed over a 100 km × 100 km area with procedurally generated geology. The simulator maintains a hidden ground-truth geological map that specifies the spatial structure of geological formations and delineates target zones associated with the exploration objective. A multi-physics forward model generates sensor measurements conditioned on each AUV’s position relative to the ground truth. The simulated swarm comprises nine AUVs partitioned into three payload-specialized groups to induce modality diversity at the fleet level. Vehicle motion is modeled with bounded acceleration ($|a| \le 0.1$ m/s²) and bounded speed ($v_{\max} = 2$ m/s), with collision detection integrated into the kinematic update. Survey-line constraints enforce a minimum spacing (50 m) to control redundant coverage and to keep the exploration objective aligned with information acquisition. Communication is constrained by an acoustic range limit (5 km) and non-negligible propagation delay, and the simulator enforces a connectivity topology consistent with range-limited message exchange.

4.2. Hardware-in-the-Loop (HIL) Validation Platform

To bridge the gap between digital twin and sea trials, we constructed a high-fidelity Hardware-in-the-Loop (HIL) testbed designed to evaluate the computational feasibility and communication robustness of the FMTL framework on flight-grade hardware. The swarm agents are physically instantiated using NVIDIA Jetson AGX Orin (64 GB) modules to represent the high-performance perception nodes and Jetson Xavier NX modules for standard agents, reflecting a realistic heterogeneous compute capability.
The underwater acoustic channel is emulated at the physical layer using a bank of S2CR 18/34 acoustic modem emulators (EvoLogics GmbH, Berlin, Germany) coupled with the UNETStack network simulator. This setup enforces strict bandwidth constraints (effective bitrate configured to 480 bps to mimic robust modulation modes) and introduces realistic packet collisions and propagation delays (1500 m/s sound speed).
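The channel figures above translate directly into message timing. The sketch below is a back-of-the-envelope helper built only from the stated 480 bps effective bitrate and 1500 m/s sound speed; the function names, the payload sizes, and the idea of checking a message against a single time window are illustrative assumptions and do not describe the emulator's internals.

```python
SOUND_SPEED_M_S = 1500.0   # sound speed used in the emulated channel
BITRATE_BPS = 480.0        # effective bitrate configured on the modems


def propagation_delay_s(distance_m: float) -> float:
    """One-way acoustic propagation delay over a straight-line path."""
    return distance_m / SOUND_SPEED_M_S


def transmission_time_s(payload_bytes: int) -> float:
    """Time to clock the payload onto the channel, ignoring headers and coding."""
    return 8.0 * payload_bytes / BITRATE_BPS


def fits_in_window(payload_bytes: int, distance_m: float, window_s: float) -> bool:
    """True if transmission plus propagation completes within the window."""
    return transmission_time_s(payload_bytes) + propagation_delay_s(distance_m) <= window_s


# A hypothetical 60-byte message sent over 3 km, checked against a 13.5 s window.
tx = transmission_time_s(60)
prop = propagation_delay_s(3000.0)
small_ok = fits_in_window(60, 3000.0, 13.5)
large_ok = fits_in_window(1000, 3000.0, 13.5)
```

At this bitrate every 60 bytes costs one second of airtime, which is why compact messages matter far more than they would on a radio link.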
Crucially, the GeoSense inference and FMTL local updates are executed in real time on the embedded GPUs, allowing us to profile actual energy consumption and inference latency (averaging 45 ms per sample on Orin). The HIL setup indicates that the proposed privacy-preserving gradient exchange can be completed within the TDMA time slots (13.5 s cycle) defined in our protocol, which supports the timing feasibility of the protocol for deployment.

4.3. Baselines for Comparison

Performance is compared against three baselines that disentangle the effects of multi-modal sensing, cross-modality fusion, and learning under communication constraints. ISOLATED trains separate centralized models within each payload group, so neither multi-modal learning nor cross-modality alignment is performed, and training remains decoupled across modalities; the resulting predictions are combined only at the map level. CENTRALIZED trains a single unified multi-modal model using aggregated raw sensor streams, enabling cross-modality fusion during training while assuming unconstrained communication that is incompatible with acoustic links. FedAvg applies standard federated averaging over the fleet under communication constraints, but restricts learning to the modality common to all AUVs, so neither multi-modal utilization nor cross-modality fusion is realized and training is decoupled with respect to specialized modalities. Table 1 summarizes these assumptions; “Decoupled Training” refers to optimizing modalities separately instead of jointly within a shared fusion model.

Extended Baseline: FedAvg with Heterogeneous Encoder Broadcasting

To isolate the contribution of the mapper-centric design, the FedAvg-HeteroArch baseline aggregates both sensor-specific encoders ($E_k$) and the shared mapper (M). While FMTL achieved an AUC of 0.94, FedAvg-HeteroArch reached only 0.76. This gap indicates that naively broadcasting incompatible encoder updates causes significant performance degradation and supports the need for mapper-centric aggregation in heterogeneous AUV swarms.

4.4. Experiment 1: Validation of Collaborative Cross-Modality Fusion

The experiment evaluates whether Stage 2 of FMTL produces a collective geological map that exceeds the capabilities of isolated learning and naive federation. The swarm surveys a complex region containing hidden target zones, with training proceeding over a fixed number of communication rounds corresponding to a fixed survey duration. Each method outputs a final potential map over a discretized grid, and performance is measured by the ability to distinguish target cells from non-target cells. Evaluation uses the area under the ROC curve (AUC) as the primary discrimination metric, with additional reporting of precision and recall computed over predefined target zones to assess detection reliability in spatially localized regions. Communication cost is measured as the total transmitted payload size between AUVs and the server for federated methods, enabling the comparison of predictive performance under different communication budgets. Table 2 defines the evaluation metrics used in Experiment 1 (ROC-AUC, Precision/Recall, and communication cost).
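For reference, the discrimination metrics can be computed from per-cell labels and scores as sketched below. This is a plain pairwise (Mann–Whitney) formulation of ROC-AUC together with thresholded precision and recall; the toy labels, scores, and threshold are made up, and the function names are illustrative rather than taken from the evaluation code.

```python
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random target cell outscores a random non-target cell."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one target and one non-target cell")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(labels: Sequence[int], scores: Sequence[float],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when cells scoring >= threshold are flagged as targets."""
    tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
    fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
    fn = sum(1 for l, s in zip(labels, scores) if s < threshold and l == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # convention: 0 if nothing flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy grid with two target cells and three non-target cells.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(labels, scores)                     # 5 of 6 pairs ranked correctly
prec, rec = precision_recall(labels, scores, 0.45)
```

The pairwise form is quadratic in the number of cells; for a full grid a rank-based implementation would be preferable, but the value is the same.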

4.5. Experiment 2: Validation of Adaptive Planning and Motion Coordination Under Constraints

This experiment evaluates closed-loop coupling between the evolving geological belief state and exploration behavior. The swarm begins with a pre-planned survey pattern, and a subsequent belief update indicates a high-value region that was not scheduled in the initial plan. The adaptive planner reassigns tasks and recomputes feasible trajectories subject to kinematic and acoustic connectivity constraints, while decentralized coordination maintains cohesion during replanning and execution. Performance is quantified by replanning latency, path feasibility, swarm fragmentation under the connectivity constraint, and objective-normalized survey efficiency relative to a static-plan baseline. Comparisons are made against a static grid policy with no adaptation, a reactive reassignment heuristic that does not enforce formal feasibility constraints, and the integrated MARL-adaptive policy instantiated by the proposed framework. Table 3 defines the reported metrics.

4.6. Experiment 3: Validation of Federated Explainable AI (FXAI)

This experiment evaluates whether the FXAI layer yields attribution signals that are geologically meaningful and decision-relevant. A correctly detected SMS target is selected, and the corresponding federated decision attribution map is generated for the surrounding area. Qualitative assessment examines whether attribution focuses on sensor-derived cues that are consistent with the decision rationale implied by the underlying measurements, rather than spreading across unrelated spatial structures. Quantitative assessment uses feature occlusion guided by the FXAI map by perturbing the highest-importance features in the input representation and measuring the resulting decrease in prediction confidence; a pronounced degradation indicates that the attributed features are causally aligned with the model’s inference rather than being incidental correlates.

4.7. Experiment 4: Validation of Rapid Adaptation via Meta-Learning

To ensure the geological reasoning capability extends beyond procedural environments, the “GeoSense” foundation model was pre-trained and validated using real-world geophysical survey data rather than synthetic noise. We integrated three distinct datasets to construct the digital twin environment:
  • Global Bathymetry & Magnetics: We utilized the GEBCO 2024 Grid (15 arc-second interval) combined with EMAG2v3 (Earth Magnetic Anomaly Grid) to provide realistic, large-scale spatial correlations for the pre-training stage.
  • High-Resolution SMS Micro-Bathymetry: For fine-grained target detection, we injected raw AUV multibeam bathymetry data collected from the TAG Hydrothermal Field (sourced from the MGDS repository), featuring realistic terrain artifacts, side-lobe noise, and irregular vent structures that procedural generation fails to capture.
  • Chemical Plume Replay: The chemical sensing modality operates on a replay basis using historical plume traces from published hydrothermal surveys, ensuring the advection-diffusion characteristics match real oceanographic dynamics.
This data-driven validation supports interpreting the performance metrics in Table 4 as reflecting robustness to the complex, non-Gaussian noise conditions inherent in deep-sea operations.

Validation Domain: Geological Regime Shift

A critical test of meta-learning’s generalization is deployment to a fundamentally different geological environment without any retraining. In our evaluation, the GeoSense foundation model is pretrained on Domain 1 (hydrothermal vent fields, high-contrast magnetic and chemical signals). The swarm then deploys to Domain 2 (abyssal nodule plains, weak signals, different sensor modality correlations) with no local fine-tuning on Domain 2 data prior to deployment. This adaptation test measures whether the learned representations facilitate rapid transfer and capture task-general geological features rather than overfitting to Domain 1 patterns. Results are reported in Figure 4.

4.8. Supplementary Validation on Real-World Geophysical Datasets

To assess practical applicability, we evaluate FMTL on real-world datasets, including global bathymetry–magnetic SMS locations, high-resolution AUV vent annotations, and synthetic chemical plumes [3], which together emulate heterogeneous squads. Compared with isolated, centralized, and handcrafted baselines, Table 4 shows that FMTL transfers simulation-trained priors to real geophysical patterns and outperforms handcrafted fusion via learned latent interactions. While FMTL identifies most targets, the remaining gap to the centralized reference is primarily attributable to data quality and incomplete spatial coverage. Failure analysis indicates that errors concentrate near non-target magnetic anomalies and in regions with strong signal attenuation. Under increasing perturbation noise, performance degrades monotonically, but the method remains stable in the sense that it does not exhibit abrupt failure across the tested noise range. Overall, these results support the external validity of the proposed fusion mechanism and motivate future closed-loop at-sea deployments.

4.9. Limitations of Current Validation and Path to Operational Deployment

4.9.1. Scope of Present Study

This work remains a proof-of-concept rather than an operational deployment. Evidence is drawn from a high-fidelity digital twin, retrospective evaluation on historical geophysical datasets, and a planned at-sea campaign. Simulation supports controlled stress testing but cannot reproduce unanticipated field failure modes. Historical data provide authenticity for geological pattern recognition, yet they reflect sequential single-vehicle acquisition and incomplete chemical measurements. Field trials are therefore required to establish end-to-end feasibility under real sensing, communication, and coordination coupling.

4.9.2. Gaps Requiring Field Validation

Key uncertainties persist without sea trials. Multi-AUV coordination is affected by hydrodynamic interaction, GPS-denied navigation drift, and latency variability from acoustic synchronization. Communication behavior deviates from stationary loss assumptions due to burst errors, motion-induced carrier instability, and site-dependent multipath. Platform effects couple actuation to sensing through electromagnetic and acoustic interference, while energy depletion can degrade sensor quality. Environmental nonstationarity further perturbs sensing and registration through biofouling, turbidity, and currents.

4.9.3. Deployment Roadmap, Hardware-in-the-Loop Evidence, and Evaluation Criteria

A staged campaign quantifies sim-to-real transfer via shallow-water trials and HIL testing to verify edge inference and federated convergence under physical constraints. Failure triggers include an AUC < 0.75, communication overhead > 5 GB, or a false positive rate > 20%. Specific Go/No-Go criteria for Phase 1 sea trials require: (1) TDMA synchronization within a 50 ms timing error despite ±15 m/s sound speed variations; (2) positive AUC growth under 30% sustained packet loss; and (3) geospatial alignment error < 15 m over 4 h. The full set of validation metrics, baseline values, acceptable field performance thresholds, and measurement methods are summarized in Table 5. These benchmarks address operational uncertainties in reliability and drift correction before full-scale deployment.
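The failure triggers stated above can be expressed as a simple check. The thresholds are taken from the text; the function name, argument names, and the choice to return human-readable reasons are illustrative assumptions.

```python
from typing import List


def failure_triggers(auc: float, comm_overhead_gb: float,
                     false_positive_rate: float) -> List[str]:
    """Return the list of triggered failure conditions (empty means none fired)."""
    reasons = []
    if auc < 0.75:
        reasons.append("AUC below 0.75")
    if comm_overhead_gb > 5.0:
        reasons.append("communication overhead above 5 GB")
    if false_positive_rate > 0.20:
        reasons.append("false positive rate above 20%")
    return reasons


# Hypothetical trial outcomes, not measured results.
nominal = failure_triggers(auc=0.94, comm_overhead_gb=1.2, false_positive_rate=0.05)
degraded = failure_triggers(auc=0.70, comm_overhead_gb=6.0, false_positive_rate=0.25)
```

Returning the reasons rather than a single boolean keeps the outcome of each trial auditable.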

5. Results and Analysis

This section reports the empirical results obtained under the experimental protocol in Section 4. The evaluation targets two coupled objectives: improving Seafloor Massive Sulfide (SMS) detection accuracy relative to isolated training, a standard federated baseline, and an oracle centralized reference that assumes raw-data pooling, while maintaining feasibility under bandwidth-limited underwater communication.

5.1. Experiment 1: Collaborative Fusion Under Heterogeneity

To rigorously validate the superiority of the FMTL framework beyond basic architectural ablations, we conducted a comprehensive benchmarking against three recent state-of-the-art (SOTA) methods in distributed robotic exploration:
  • FedDRL-AUV (2024): A standard federated deep reinforcement learning baseline that uses secure aggregation but lacks the meta-learning adaptation stage [28].
  • GNN-Mapper (2023): A Graph Neural Network-based approach that explicitly models neighbor communication topologies for cooperative mapping, representing the current SOTA in graph-based coordination [29].
  • ViT-Explorer (2024): A Vision Transformer-based agent adapted for underwater perception, utilizing self-attention mechanisms for feature extraction without federated collaboration [30].
As illustrated in Figure 5, FMTL achieves the highest AUC (0.94), outperforming the distributed baselines FedDRL-AUV (0.86) and GNN-Mapper (0.88). ViT-Explorer exhibits strong local feature extraction (0.85), but its potential is constrained by the data scarcity inherent to isolated operations.
Crucially, to ensure the reliability of these gains, we applied rigorous statistical hypothesis testing following common experimental practice in high-fidelity maritime swarm studies [25]. We conducted a one-way ANOVA followed by Bonferroni-corrected post hoc t-tests across n = 30 independent runs. The analysis confirms that FMTL’s performance advantage is statistically significant (p < 0.001) against all baselines. The effect size, calculated using Cohen’s d, exceeds 0.8 for the comparison against FedDRL-AUV, indicating a “large” practical significance beyond mere statistical variance. This validates that the mapper-centric aggregation strategy provides a fundamental advantage in handling heterogeneous sensor modalities compared to standard parameter averaging.
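As a reminder of how the effect size and the corrected significance level are obtained, the sketch below computes Cohen's d with a pooled standard deviation and a Bonferroni-adjusted threshold. The sample values are made up for illustration and are not the per-run results of this study; the assumption of three pairwise comparisons simply mirrors the three SOTA baselines listed above.

```python
import math
import statistics
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level after Bonferroni correction."""
    return alpha / n_comparisons


# Made-up AUC samples for two methods (three runs each, for brevity).
method_a = [0.93, 0.94, 0.95]
method_b = [0.85, 0.86, 0.87]
d = cohens_d(method_a, method_b)          # mean gap 0.08 over pooled SD 0.01
alpha_adj = bonferroni_alpha(0.05, 3)     # assumed three post hoc comparisons
```

By the usual convention, values of d above 0.8 are read as large effects, which is the threshold referred to in the text.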

Quantifying Cross-Modal Alignment Quality

To validate that the performance gain in FMTL comes from genuine cross-modal alignment rather than simple parameter averaging, we compute the cross-modality embedding distance in the shared latent space Z. For FMTL, the average Euclidean distance between modality clusters [24] in Z is 0.23 ± 0.08, indicating strong alignment. For the late-fusion baseline (Variant E), this distance is 1.47 ± 0.15, showing that modalities remain isolated without explicit contrastive alignment. This quantitative evidence supports the claim that the contrastive loss in Equation (3) induces meaningful semantic fusion.
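One plausible way to compute such a distance is the mean pairwise Euclidean distance between per-modality centroids, sketched below. The exact definition used for the reported values is not spelled out here, so this formulation, the function names, and the two-dimensional toy embeddings are assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def centroid(points: Sequence[Sequence[float]]) -> Tuple[float, ...]:
    """Component-wise mean of a set of embedding vectors."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def mean_centroid_distance(clusters: Dict[str, List[Sequence[float]]]) -> float:
    """Average Euclidean distance over all pairs of modality centroids."""
    cents = [centroid(pts) for pts in clusters.values()]
    pairs = list(combinations(cents, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D embeddings for three modalities (illustrative values only).
clusters = {
    "bathymetry": [(0.0, 0.0), (0.0, 2.0)],   # centroid (0, 1)
    "magnetic":   [(3.0, 1.0), (3.0, 1.0)],   # centroid (3, 1)
    "chemical":   [(0.0, 5.0), (0.0, 5.0)],   # centroid (0, 5)
}
distance = mean_centroid_distance(clusters)   # pair distances 3, 4 and 5
```

Smaller values mean the modality clusters sit closer together in Z, which is the sense in which the lower reported figure indicates stronger alignment.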
FMTL consistently demonstrates advantages over baseline methods in terms of detection accuracy, communication efficiency, convergence speed, and inference latency. As summarized in Figure 5a–d, FMTL achieves the highest target identification AUC while operating at substantially lower communication cost, reaches 90% performance in fewer episodes, and reduces per-decision latency, indicating improved learning efficiency and timelier decision-making under resource constraints. Building on these representation-level results, the following experiment shifts focus to closed-loop execution, evaluating whether the MARL-driven planner maintains kinematic feasibility, communication connectivity, and task responsiveness during dynamic mission reallocation.

5.2. Experiment 2: Dynamic Path Planning and Control Effectiveness

5.2.1. Replanning Performance

Figure 6 summarizes replanning latencies over 47 dynamic task reassignments during the 12-h mission. The measured latency is 45 ± 12 ms on average, with a 95th percentile value of 72 ms. Runtime attribution indicates that RRT expansion dominates the planning cycle (35 ms on average), while gradient aggregation and broadcast each contribute approximately 5 ms. These timings imply that replanning is completed well within a single TDMA communication cycle of 13.5 s, which is consistent with the intended closed-loop operation in which new assignments can be disseminated without inducing plan staleness.

5.2.2. Formation Coherence Under Dynamic Replanning

Figure 7 reports the maximum inter-AUV distance over the mission, reflecting the degree of swarm dispersion induced by adaptive re-tasking. The static grid baseline remains tightly grouped by construction (1.9 ± 0.3 km), whereas the reactive heuristic exhibits a transient peak separation of 6.2 km accompanied by loss of network coherence. The MARL-adaptive variant sustains broader coverage while remaining within the acoustic-connectivity threshold: the maximum inter-AUV distance is 4.1 ± 0.8 km and does not exceed the 5 km limit. The observed behavior is consistent with the acoustic-connectivity safety condition in Equation (7), which helps prevent fragmentation by bounding the tolerable link-loss duration given the inter-AUV distance margin and the maximum speed.

5.2.3. Path Feasibility Under Kinematic Constraints

The evaluation assesses trajectory feasibility over 47 replanned instances. The MARL-adaptive method yields 46 feasible trajectories (98%) and incurs a single failure due to an infeasible goal under mission energy constraints. The reactive heuristic produces 36 feasible trajectories (76%), and turning-radius and speed-constraint violations primarily cause its failures. The static grid does not generate replanned trajectories and is therefore not directly comparable under this criterion. These results highlight the importance of explicitly considering feasibility and physical constraints during trajectory planning.

5.2.4. Survey Efficiency Gains

Figure 8 tracks the cumulative discovery of high-value targets over time. The static grid exhibits the expected linear baseline behavior, while the reactive heuristic accelerates early discovery but later saturates as execution failures and connectivity degradation limit adequate coverage. The MARL-adaptive approach sustains discovery growth and achieves a 42% increase in survey efficiency by mission end. The gain aligns with three coupled factors reported in the mission logs: a reduced hotspot response time (2.3 min versus 18 min), a higher fraction of feasible replanned paths (98%), and reward-driven prioritization that concentrates effort on higher-value targets.

5.2.5. AUC Trajectory During the Dynamic Mission

Figure 9 reports how SMS detection AUC evolves as the mission progresses. During the initial survey window (hours 0–3), all methods increase AUC at a similar rate as coverage expands (0.72 → 0.83). After the first hotspot is detected (hours 3–6), the MARL-adaptive policy begins to separate as it reallocates resources toward informative regions (0.83 → 0.91). During the later adaptive phase (hours 6–12), the MARL-adaptive approach preserves its advantage. It reaches 0.94 AUC, while the static grid stabilizes at 0.87 under redundant coverage, and the reactive heuristic exhibits fluctuations consistent with intermittent trajectory infeasibility. The joint interpretation of Figure 8 and Figure 9 indicates that the reported efficiency gain corresponds to a measurable improvement in detection quality over mission time, with a +0.07 AUC increase relative to the static baseline at mission completion.

5.2.6. Information-Gain Efficiency: Novel Metric for Constrained Exploration

Beyond traditional survey efficiency, we introduce a metric specific to bandwidth-constrained systems: Information-Gain per Communication Unit (IG/Byte). This metric captures the value extracted per unit of acoustic transmission spent on federated updates. For FMTL, IG/Byte = 0.018, expressed as detection AUC improvement per MB communicated. The static grid baseline reaches IG/Byte = 0.014; its efficiency is lower because communication overhead is decoupled from geological discovery. This 28% efficiency gain indicates that coupling adaptive planning to communication budgets has operational value. A detailed comparison of dynamic planning strategies across efficiency, latency, feasibility, and communication overhead is summarized in Table 6.
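As a worked check of the relative gain, using only the two rounded IG/Byte values reported above:

$$\frac{\mathrm{IG/Byte}_{\mathrm{FMTL}} - \mathrm{IG/Byte}_{\mathrm{static}}}{\mathrm{IG/Byte}_{\mathrm{static}}} = \frac{0.018 - 0.014}{0.014} \approx 0.286$$

This is consistent with the reported gain of roughly 28%, allowing for rounding of the two underlying values.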

5.2.7. Optimization Robustness and Reward Sensitivity

To validate the choice of MARL against non-gradient alternatives, we benchmarked the system in the “Sparse-Deceptive” environments identified in Section 4. We compared the proposed MARL planner against CMA-ES (Evolutionary Strategy for trajectory optimization) and MCTS (Monte Carlo Tree Search with information-gain heuristics).
Performance in Standard vs. Sparse Zones: In nominal environments, MARL outperforms CMA-ES and MCTS in computational time (45 ms vs. 210 ms) and coordination stability. However, as the reward density $\rho_{\mathrm{reward}}$ drops below $10^{-4}$, pure MARL performance degrades due to sparse feedback.
Hybrid Mitigation Strategy: To address the “posterior collapse” edge case, we implemented a Hybrid MARL-Frontier scheme. This variant switches to a frontier-based exploration heuristic when the local value function variance drops below a threshold ϵ.
Results: In sparse-deceptive settings, the Hybrid MARL-Frontier scheme reduces the failure rate from 5.1% to 0.8% while preserving MARL’s efficiency in information-rich zones. This suggests that MARL delivers strong coordination when feedback is available, but a heuristic fallback improves robustness under extreme sparsity.
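The switching rule of the hybrid scheme can be sketched as follows. The use of population variance over a window of local value estimates, the nearest-frontier fallback, the threshold value, and all function names are illustrative assumptions; the text above specifies only that the planner switches to a frontier heuristic when the local value-function variance falls below a threshold.

```python
import math
import statistics
from typing import Sequence, Tuple

Point = Tuple[float, float]


def select_mode(local_values: Sequence[float], eps: float) -> str:
    """Use the frontier heuristic when value estimates are nearly flat."""
    if len(local_values) < 2:
        return "frontier"                      # too little signal to trust the policy
    return "frontier" if statistics.pvariance(local_values) < eps else "marl"


def nearest_frontier(position: Point, frontiers: Sequence[Point]) -> Point:
    """Fallback goal: the closest frontier cell to the current position."""
    return min(frontiers, key=lambda f: math.dist(position, f))


EPS = 1e-3   # hypothetical variance threshold

flat_mode = select_mode([0.50, 0.51, 0.49], EPS)    # almost no value contrast
rich_mode = select_mode([0.10, 0.90, 0.40], EPS)    # clear value contrast
fallback_goal = nearest_frontier((0.0, 0.0), [(3.0, 4.0), (1.0, 1.0)])
```

A flat value landscape gives the learned policy nothing to rank, so handing control to a coverage-driven heuristic in that regime is a natural design choice.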
Overall, Figure 6, Figure 7, Figure 8 and Figure 9 and Table 6 show that the MARL-based planner enables fast replanning, efficient exploration, and stable swarm connectivity throughout different mission phases. The robustness analysis in sparse-deceptive environments confirms that the proposed hybrid strategy mitigates performance degradation under extreme reward sparsity. With the planning layer validated, Section 5.3 examines the interpretability of swarm-level decisions through FXAI. We then evaluate, in Section 5.4, whether meta-learning further enhances rapid adaptation in previously unseen environments under distribution shifts.

5.3. Experiment 3: FXAI Delivers Trustworthy and Geologically Sound Insights

This experiment evaluates whether the FXAI layer can expose decision-relevant evidence in a manner consistent with the underlying geophysical signals, thereby supporting the interpretation of swarm-level predictions beyond point estimates. Figure 10 visualizes the federated decision attribution outcome for a correctly identified SMS deposit. The attribution concentrates on the same spatial locus where the input modalities exhibit convergent anomalies, indicating that the model’s high predicted probability is supported by aligned multimodal cues rather than by a single dominant channel. As shown in Figure 10A–C, the bathymetric input exhibits a localized, mound-like structure (Figure 10A), the magnetic field displays a pronounced negative anomaly (Figure 10B), and the chemical channel shows elevated plume concentration in the vicinity of the deposit (Figure 10C). Consequently, the attribution map (Figure 10D) places its highest mass in the overlap region that corresponds to the deposit location.
Building trust in autonomous discovery is essential for system reliability, particularly when decisions are made without human oversight. In deep-sea exploration, a “black-box” detection of a mineral deposit is operationally insufficient; mission commanders require verifiable evidence before authorizing costly sampling actions. The proposed Federated Explainable AI (FXAI) layer serves as a trust mechanism, providing scientists with a “visual consensus” of why the swarm believes a target exists (e.g., correlating a magnetic anomaly with a specific chemical gradient). This interpretability transforms the swarm from a passive data collector into a transparent collaborative partner, lowering the barrier for deploying autonomous reinforcement learning agents in high-stakes marine environments.
A distinctive feature of federated attribution is that the decision consensus emerges from independent sensor measurements without centralized fusion. To verify that the FXAI attribution map (Figure 10) reflects genuine multi-sensor consensus rather than dominance by a single modality, we compute the attribution entropy across modality contributions. For the SMS target shown in Figure 10, the attribution entropy is 0.92 nats (on a scale of 0 = single modality dominates, 1.1 = uniform contribution), indicating balanced influence from all three sensor types. This high entropy confirms that the swarm achieves multi-modal synthesis, not uni-modal detection artifacts masked by weak signals from other sensors.
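The entropy scale quoted above can be checked with a short sketch. It assumes that the per-modality attribution mass has already been summed into three non-negative numbers (bathymetric, magnetic, chemical); the function name and the input values are illustrative and are not taken from the released code. With three modalities, the upper end of the scale is ln 3 ≈ 1.10 nats.

```python
import math

def attribution_entropy(contributions):
    """Shannon entropy (in nats) of non-negative modality contributions."""
    total = sum(contributions)
    if total <= 0:
        raise ValueError("contributions must have a positive sum")
    shares = [c / total for c in contributions]
    # Terms with zero share are skipped (0 * log 0 is taken as 0).
    return -sum(p * math.log(p) for p in shares if p > 0)

max_entropy = math.log(3)                        # uniform over three modalities
single = attribution_entropy([1.0, 0.0, 0.0])    # one modality dominates
uniform = attribution_entropy([1.0, 1.0, 1.0])   # balanced contribution
skewed = attribution_entropy([0.5, 0.3, 0.2])    # hypothetical intermediate case
```

Under this definition, a value of 0.92 nats lies much closer to the uniform bound than to the single-modality end of the scale.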
The spatial selectivity of the attribution also provides a check against reliance on incidental terrain variation. The attribution remains localized around the deposit region rather than diffusing broadly across unrelated background patterns, which is consistent with a decision rule that prioritizes localized hydrothermal signatures. The occlusion analysis reinforces this interpretation in quantitative terms. Neutralizing the magnetic anomaly within the target region reduces prediction confidence by 68%, while removing the chemical plume signal reduces confidence by 55%; perturbations to features outside the deposit-relevant region induce changes below 2%. These sensitivity patterns are consistent with the attribution map and support the claim that the FXAI layer is identifying principal drivers of the prediction rather than producing post hoc rationalizations.
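The occlusion test described above can be sketched as follows. This is a minimal illustration, assuming that "reduces confidence by X%" means a relative drop in predicted probability and that neutralizing a feature means overwriting it with a baseline value. `toy_predict` is a hypothetical stand-in for the detector, not the FMTL model.

```python
import numpy as np

def occlusion_drop(predict, x, mask, baseline=0.0):
    """Relative drop in confidence when the masked entries are set to baseline."""
    p_ref = float(predict(x))
    x_occluded = np.where(mask, baseline, x)
    p_occ = float(predict(x_occluded))
    return (p_ref - p_occ) / p_ref

def toy_predict(x):
    """Hypothetical detector: logistic score of the mean over a 4x4 target patch."""
    return 1.0 / (1.0 + np.exp(-x[2:6, 2:6].mean()))

x = np.zeros((8, 8))
x[2:6, 2:6] = 3.0                      # synthetic anomaly inside the target region
inside = np.zeros_like(x, dtype=bool)
inside[2:6, 2:6] = True

drop_inside = occlusion_drop(toy_predict, x, inside)     # occlude the target region
drop_outside = occlusion_drop(toy_predict, x, ~inside)   # occlude the background
```

In this toy setting, occluding the target region lowers confidence substantially while occluding the background leaves it unchanged, which is the qualitative pattern the occlusion analysis looks for.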

Beyond Attribution: Counterfactual Testing and Operational Impact

To verify that the FXAI layer captures causal geophysical reasoning, we conducted a two-fold validation using counterfactual analysis and operational assessment. When we injected “decoy” targets with physically inconsistent signals (such as strong chemical plumes paired with background magnetic noise), detection probability dropped from 0.94 to 0.12, indicating that the model relies on multi-physics dependencies rather than statistical artifacts. We further quantified operational value by comparing a “Black-Box” trigger to an “FXAI-Gated” policy that requires multi-modal consensus before sampling. The FXAI-Gated approach reduced false-positive sampling by 22%, resulting in a 14% saving in total mission energy consumption. This suggests that interpretability functions as an “energy-saving filter” against spurious correlations, directly extending the operational endurance of the autonomous swarm.

5.4. Experiment 4: Meta-Learning Enables Unprecedented Adaptation Speed

The “Geological Regime Shift” test, as summarized in Figure 11, shows that FMTL’s meta-learned initialization achieves an initial AUC of 0.78, versus 0.45 for fine-tuning. FMTL reaches 90% of peak performance in only 5 missions, whereas Fine-tuning Only (Variant C) requires 42 missions and Random Initialization (Variant A) requires more than 90. This 8.4× reduction in missions relative to Variant C (42 vs. 5) indicates that meta-learning provides a superior initialization for rapid domain adaptation, allowing swarms to maintain high-performance inference amid shifting environmental conditions without extensive in situ retraining.
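The "missions to 90% of peak" metric can be made concrete with a small helper. The sketch below assumes the metric is the first mission at which AUC reaches 90% of the curve's own peak; the curve values are synthetic and serve only to exercise the function. The ratio at the end uses the two mission counts reported above.

```python
def missions_to_fraction(auc_per_mission, fraction=0.9):
    """Return the 1-based mission index at which AUC first reaches
    `fraction` of the curve's peak AUC."""
    target = fraction * max(auc_per_mission)
    for mission, auc in enumerate(auc_per_mission, start=1):
        if auc >= target:
            return mission

# Synthetic curve for illustration only; not data from the experiments.
toy_curve = [0.50, 0.70, 0.85, 0.90, 0.90]
first_hit = missions_to_fraction(toy_curve)   # target is 0.9 * 0.90 = 0.81

# Ratio implied by the reported mission counts (Variant C vs. FMTL).
speedup = 42 / 5
```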

5.5. Ablation Study: Dissecting the Contributions of Each Component

To isolate the role of each design choice in the FMTL pipeline, seven variants were trained by removing or modifying individual components while holding the training budget fixed, as summarized in Table 7. All variants employed the same evaluation protocol and test split as described in Section 4.1, and were trained for the same number of communication rounds (T = 100), ensuring that differences in accuracy, adaptation behavior, and resource use could be attributed to the ablated component rather than to unequal optimization effort.
To ensure the reliability of the reported performance gains, we applied rigorous hypothesis testing following the protocol established in high-fidelity swarm literature. We conducted a Kruskal–Wallis H-test (H = 52.3, p < 0.001) followed by Dunn’s post hoc test with Bonferroni correction to control for family-wise error rates across the n = 30 independent runs.
The analysis confirms that the performance advantage of the full FMTL framework over the No-Pretrain and No-Meta variants is statistically significant (p < 0.001). Furthermore, we calculated the effect size using Cohen’s d. The comparison between FMTL and the standard FedAvg baseline yields a Cohen’s d of 0.82, indicating a “large” effect size. This substantiates that the improvements are not merely artifacts of stochastic variance but represent a fundamental algorithmic advantage in handling heterogeneous underwater data.
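For readers who wish to reproduce the effect-size step, the following sketch computes Cohen's d with a pooled sample standard deviation and applies a Bonferroni adjustment to a list of p-values. It assumes the pooled-variance form of d, since the exact variant used is not stated here, and the sample values are placeholders rather than experimental results. The Kruskal–Wallis and Dunn tests themselves are not reimplemented.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation of a and b."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Placeholder samples: means differ by 1, pooled SD is 2, so d is 0.5.
d_example = cohens_d([2, 4, 6], [1, 3, 5])
adjusted = bonferroni([0.01, 0.5])
```

By the usual convention, |d| ≥ 0.8 is read as a large effect, which is the threshold the reported value of 0.82 is compared against.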

5.5.1. Setup and Variants

FMTLFull serves as the reference configuration and includes GeoSense pretraining, federated learning with the shared mapper, the cross-modality contrastive alignment term $\mathcal{L}_{cont}$ defined in Equation (3) and used in the task loss $\mathcal{L}_{task}$ in Equation (1), the FXAI attribution layer, and the meta-learning stage. Variant A removes pretraining by starting from random initialization while preserving the remaining pipeline. Variant B replaces the federated setting with centralized data fusion, serving as an upper bound that relaxes the bandwidth constraint. Variant C removes the meta-learning stage and relies on standard fine-tuning. Variant D removes contrastive alignment by setting $\lambda_{cont} = 0$, i.e., training with $\mathcal{L}_{CE}$ only, while retaining the shared mapper and meta-learning. Variant E replaces the shared mapper with a late-fusion architecture. Variant F removes the FXAI attribution module while leaving the predictive pipeline unchanged.

5.5.2. Main Ablation Results and Component-Wise Interpretation

Table 8 summarizes performance metrics. FMTLFull reaches an AUC of 0.94 and attains 90% of peak performance in 5 missions with 500 MB of communication. Removing pretraining (Variant A) causes the largest drop (0.78 AUC, 12 missions), while removing meta-learning (Variant C) maintains asymptotic accuracy (0.93 AUC) but delays convergence to 42 missions, highlighting its role in time-to-effectiveness. Disabling the contrastive loss (Variant D) or using late fusion (Variant E) reduces AUC to 0.85 and 0.81, respectively, underscoring the need for latent alignment. This degradation is further illustrated in Figure 12, where embeddings remain segregated by modality when the contrastive objective is removed. Removing FXAI (Variant F) confirms that the attribution module is orthogonal to detection performance. Finally, centralized training (Variant B) sets an upper bound (0.97 AUC) but is non-deployable due to prohibitive communication costs (102,000 MB, approx. 200× higher than FMTL).
We include the learning-curve comparison between FMTLFull and Variant A (Figure 13) to clarify whether the performance gap stems from optimization difficulties under random initialization or from a persistent representational deficit. Because all variants share the same communication-round budget, the result should be read as evidence of representation transfer rather than as an argument about hyperparameter tuning.
System-Level Trade-off Analysis. Beyond predictive metrics, we evaluate the computational and energy efficiency of key architectural variants to illustrate critical system-level trade-offs. While the centralized baseline (Variant B) achieves high accuracy, it incurs a prohibitive energy cost for raw data transmission (>50 kJ), rendering it infeasible for battery-powered swarms compared to our proposed FMTL (<2 kJ). Furthermore, although removing the explanation layer (Variant F) reduces latency by 31 ms and peak memory by 1.8 GB, this gain in efficiency is offset by operational opacity and the loss of the safety filter. Conversely, the Late Fusion approach (Variant E) offers the lowest computational overhead but fails to capture necessary cross-modal correlations, resulting in a 0.13 drop in AUC (from 0.94 to 0.81). This breakdown indicates that the full FMTL framework achieves a favorable operating point, balancing high-fidelity inference via Mapper and FXAI with the stringent energy constraints of acoustic operation.
As illustrated in Figure 4 presented in Section 4.7, the adaptation curves contrasting FMTLFull and Variant C provide a direct visualization of the role of meta-learning in accelerating adaptation. According to the reported metrics, the difference primarily manifests in the number of missions required to achieve peak performance, rather than in the final AUC attained after sufficient adaptation.

5.5.3. Robustness and Practical Considerations

Robustness to key hyperparameters was assessed to determine whether the ablation conclusions depend on a narrow configuration. The latent dimension exhibits improving AUC up to d = 512, after which accuracy saturates while training time and memory continue to increase, making d = 512 the most favorable accuracy–efficiency choice under the reported measurements, as shown in Table 9.
The number of local epochs controls the balance between local fitting and global synchronization. Performance improves from E = 1 to E = 5, remains stable through E = 10, and decreases at E = 20, consistent with overfitting to client-specific data distributions under non-IID conditions, as reflected in Table 10.
The contrastive-loss weight shows a peaked relationship with performance. The setting λ1 = 0.5 yields the strongest AUC while substantially reducing the reported cross-modality embedding distance relative to λ1 = 0.0, whereas larger weights reduce AUC despite marginally tighter alignment, consistent with over-regularization. Table 11 further illustrates this over-regularization effect.
The computational cost was broken down to assess the runtime and memory implications of the entire pipeline. The per-round training time and per-sample inference latency show that the attribution computation contributes the most to the overhead, compared to the predictive forward pass, with a detailed component-wise breakdown provided in Table 12. Meanwhile, the total inference time remains below the sensor reporting interval for the stated mission cadence.
Scalability testing results, as presented in Table 13, show that AUC improves with swarm size, rising from 0.87 to 0.95 as the number of AUVs increases to 12, before saturating. Furthermore, the system exhibits linear communication growth and only sublinear increases in training time per round. This confirms that the proposed framework maintains computational efficiency even as the deployment scale expands.

5.6. Robustness Under Adverse Conditions

Operational deployments expose AUV swarms to failures and uncertainties that are not captured by nominal evaluations, particularly sensor outages and corruption, intermittent and delayed acoustic links, and nonstationary environmental conditions that can perturb both sensing and motion execution. Robustness was assessed by stress-testing FMTL under controlled degradations and measuring the resulting changes in detection performance, survey efficiency, convergence behavior, and long-horizon stability.

5.6.1. Sensor Failure Scenarios

Robustness to sensor-side disruptions was evaluated through complete sensor loss, intermittent corruption, and calibration drift. Table 14 reports performance when one or more AUVs experience complete sensor failure during 20% of the survey duration, implemented by nulling the corresponding data stream. AUC decreases from 0.94 to 0.92 with a single failure and to 0.83 when three of nine vehicles are affected, while survey efficiency drops from 100% to 72% under the most severe setting. The degradation is monotone rather than catastrophic, which is consistent with a swarm architecture in which remaining vehicles continue to contribute functional updates, and the planner reallocates coverage to compensate for lost sensing capacity.
Intermittent sensor glitches were modeled by injecting outliers at increasing rates. Table 15 shows that higher outlier prevalence increases false positives and reduces AUC, with performance degrading from 0.94 to 0.84 as the outlier rate reaches 30%. This behavior indicates sensitivity to heavy-tailed corruption when the corrupted measurements are not explicitly filtered or down-weighted, and it provides a quantitative bound on tolerable corruption rates under the current preprocessing and loss formulation.
Calibration drift was evaluated by introducing a slowly varying bias to the magnetometer baseline and monitoring detection performance throughout the mission. Under this drift model, the AUC decreases from 0.94 to 0.88 over a six-hour run, indicating that gradual bias accumulation can significantly impact detection quality even when the data stream remains continuous. A federated calibration correction procedure was employed to mitigate this effect by aggregating residual statistics to estimate the global drift pattern and applying the correction as a preprocessing step, thereby maintaining an AUC of 0.93 throughout the same six-hour mission. This result suggests that drift manifests as a systematic component that can be partially compensated through swarm-level aggregation of residual structure when sufficient coverage and update diversity are available.

5.6.2. Communication Efficiency and Robustness Under Realistic Acoustic Channel Models

Underwater acoustic communication imposes severe rate constraints and channel impairments that affect both learning synchronization and control-level coordination. To ensure feasibility under the strict 480 bps acoustic bandwidth limit, FMTL implements a multi-stage compression pipeline. While the cumulative size of raw model gradients generated over the 72 h mission corresponds to approximately 500 MB (representing the logical information exchange), the system does not transmit these raw vectors. Instead, we apply Deep Gradient Sparsification (transmitting only the top-1% magnitude gradients) followed by 8-bit quantization. This compression reduces the actual physical payload to approximately 12.5 MB over the entire mission duration. This fits comfortably within the theoretical channel capacity (≈15.5 MB at 480 bps), ensuring that the collaborative updates are physically realizable without causing network congestion.
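The two compression stages and the capacity bound quoted above can be sketched with NumPy. This is a minimal illustration, assuming symmetric linear quantization and a megabyte of 10^6 bytes; the function names are hypothetical, and the sketch ignores the cost of transmitting the indices of the surviving gradient entries as well as any protocol overhead.

```python
import numpy as np

def sparsify_top_k(grad, fraction=0.01):
    """Keep the largest-magnitude entries; return their indices and values."""
    k = max(1, int(round(fraction * grad.size)))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def quantize_int8(values):
    """Symmetric linear quantization to signed 8-bit integers."""
    scale = float(np.max(np.abs(values))) / 127.0 or 1.0  # avoid a zero scale
    q = np.clip(np.round(values / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map 8-bit codes back to approximate float values."""
    return q.astype(np.float32) * scale

def channel_capacity_mb(bitrate_bps, hours):
    """Bytes deliverable at a constant bitrate, in MB (10^6 bytes)."""
    return bitrate_bps * hours * 3600 / 8 / 1e6

rng = np.random.default_rng(0)
grad = rng.standard_normal(10_000).astype(np.float32)  # stand-in gradient vector
idx, vals = sparsify_top_k(grad)          # top 1%: 100 of 10,000 entries
q, scale = quantize_int8(vals)
recovered = dequantize(q, scale)
capacity = channel_capacity_mb(480, 72)   # 480 bps sustained for 72 h
```

The capacity function reproduces the ≈15.5 MB figure (480 bps × 72 h ÷ 8 bits per byte = 15.552 MB), against which the reported 12.5 MB payload corresponds to roughly 80% channel utilization.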
To evaluate robustness against physical layer impairments, we employed the Bellhop ray-tracing model, which incorporates multipath, Doppler, and shadow-zone effects—a significant advance over simplified i.i.d. loss assumptions. Table 16 details the system performance as the packet delivery ratio degrades. The model maintains an AUC of 0.86 at 35% packet loss, demonstrating resilience to moderate interference. However, the simulation identifies a sharp failure threshold at approximately 52% packet loss, beyond which convergence fails (AUC < 0.74). This threshold behavior aligns with theoretical requirements for a minimum fraction of participating clients and suggests that acoustic links operating near this limit require active mitigation strategies rather than reliance on graceful degradation.
Beyond packet loss, propagation delays and modem processing introduce latency on the order of seconds. While this adds overhead, it remains minor relative to the per-round computation time, indicating that acoustic propagation alone does not bottleneck training speed. To mitigate channel errors, adaptive error-control coding and opportunistic scheduling were implemented. These mechanisms prioritize essential updates under low-SNR conditions, enhancing effective throughput without altering the fundamental learning objective. To evaluate resilience against severe environmental dynamics, the impact of time-varying conditions was assessed over a 72 h mission profile simulating a passing storm. As shown in Table 17, cumulative AUC degrades gradually with increasing loss severity, reaching a minimum of 0.81 during the peak storm phase (Hours 48–72). The system demonstrates rapid resilience, with performance recovering to 0.88 AUC within six hours post-storm. This trajectory confirms that performance dips were driven by intermittent data unavailability rather than permanent model divergence.
Table 18 provides a comparative analysis against prior methods, highlighting the necessity of physics-based channel modeling over i.i.d. dropout assumptions for realistic marine assessment. In summary, the proposed framework constitutes an integrated system-level solution defined by the co-design of its components: federated learning accommodates modality heterogeneity, explainability is embedded within the federation loop, and planning explicitly accounts for communication constraints. This holistic approach ensures robust operation in environments where ensembles of disjoint techniques would likely fail.

5.6.3. Environmental Interference and Motion Constraints

Environmental interference was evaluated through elevated noise conditions and dynamic obstacles that disrupt planned survey paths. Under high-noise sensing, AUC decreases from 0.94 to 0.86, indicating that increased measurement variance can materially reduce the discriminative signal-to-noise ratio. A robust loss replacement was employed to mitigate sensitivity to extreme errors and restore the AUC to 0.92.
Dynamic obstacles were modeled by blocking 15% of planned survey lines and triggering online re-routing via the motion planner. Coverage efficiency decreases by 12% while AUC remains at 0.93, indicating that irregular sampling patterns affect coverage but do not necessarily degrade detection quality when the model continues to receive sufficient informative observations.

5.6.4. Adversarial Robustness

Although adversarial perturbations are not the primary threat model for marine exploration, robustness was tested under gradient-based attacks to bound worst-case sensitivity. Table 19 shows that both untargeted and targeted perturbations substantially reduce the AUC under the stated ε, and that stronger iterative attacks further degrade performance. Adversarial training partially recovers AUC under all tested attacks. These results should be interpreted as an upper bound on vulnerability under white-box access assumptions, rather than as an operational risk assessment, since the attack model presumes knowledge of the model and the ability to inject precisely structured perturbations into the input.

5.6.5. Long-Duration Mission Stability

Long-horizon stability was evaluated over a 72 h continuous operation with accumulated drift, communication variability, and resource constraints. Table 20 reports a gradual decline in AUC from 0.94 to 0.90 over the full duration, alongside an increase in memory usage and false discovery rate. The pattern indicates progressive degradation rather than abrupt collapse, suggesting that extended runs may benefit from periodic checkpointing or reset strategies. We do not claim catastrophic forgetting, since it cannot be established without an explicit retention protocol that evaluates retention on earlier distributions.
For operational deployment planning, the communication–accuracy trade-off is quantified by reducing the mapper update frequency and applying post-training quantization. The baseline FMTL setting achieves 0.94 AUC with 500 MB communication; increasing the update interval by 5× (67.5 s) lowers communication to 100 MB but reduces AUC to 0.89, whereas 8-bit quantization lowers communication to 62.5 MB while maintaining 0.92 AUC. The resulting Pareto frontier identifies 8-bit quantization as a practical operating point for bandwidth-starved deployments, preserving ~98% of baseline accuracy at 1/8 of the communication cost and enabling long-duration missions under constrained relay links.

5.6.6. Summary: Operational Reliability

Across the conducted stress tests, system performance degrades smoothly as failures and impairments intensify. The most pronounced degradations arise from heavy-tailed sensor corruption and severe packet loss, which can prevent the effective aggregation of models. Overall, the results indicate that FMTL remains functional under partial sensor outages and moderate acoustic unreliability. Furthermore, robustness can be improved through drift correction, link adaptation, and noise-tolerant optimization strategies, which collectively bound the conditions under which collaborative mapping remains operationally viable.

5.6.7. Communication-Accuracy Pareto Frontier

An underexplored yet practically important question in federated learning concerns the Pareto-optimal trade-off between communication cost and detection accuracy under operational constraints. This trade-off is characterized by fixing the mission duration to 12 h and systematically varying two factors: the mapper update frequency, ranging from every TDMA cycle (13.5 s) to every five cycles (67.5 s), and the parameter precision, reduced from 32-bit to 8-bit using post-training quantization. As shown in Figure 14, the baseline FMTL configuration achieves an AUC of 0.94 with a communication cost of 500 MB. Reducing the update frequency by a factor of two lowers communication to 250 MB with only a 0.01 AUC degradation, while a five-fold reduction yields 100 MB communication at the cost of a 0.05 AUC decrease. Notably, 8-bit quantization achieves an AUC of 0.92 with only 62.5 MB communication, effectively retaining 98% of the baseline accuracy within quantization error. These results indicate that significant communication savings are achievable without catastrophic performance loss. From an operational perspective, this Pareto analysis provides mission planners with principled guidance for selecting operating points in bandwidth-starved deployments, such as long-range acoustic relay scenarios, where aggressive communication reduction is necessary.
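The operating points reported in this subsection can be screened for Pareto optimality with a few lines of code. The sketch below uses only the four (communication, AUC) pairs stated above, with the AUC of the reduced-frequency settings computed from the stated degradations (0.94 − 0.01 and 0.94 − 0.05); the configuration labels are informal names introduced here.

```python
def pareto_front(points):
    """Return names of configurations not dominated by any other.

    points: list of (name, comm_mb, auc); lower comm and higher AUC are better.
    """
    front = []
    for name, comm, auc in points:
        dominated = any(
            c <= comm and a >= auc and (c < comm or a > auc)
            for _, c, a in points
        )
        if not dominated:
            front.append(name)
    return front

configs = [
    ("baseline", 500.0, 0.94),
    ("2x interval", 250.0, 0.93),
    ("5x interval", 100.0, 0.89),
    ("8-bit", 62.5, 0.92),
]
front = pareto_front(configs)
```

On these four points, the five-fold interval setting is dominated by 8-bit quantization (less communication and higher AUC), so the frontier consists of the baseline, the two-fold interval setting, and 8-bit quantization.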

5.6.8. The Energy-Intelligence Trade-Off

While surface USV swarms often prioritize computational speed, deep-sea AUV operations are fundamentally constrained by strict energy budgets. To quantify energy-aware learning performance, the metric Information Gain per Joule (IG/J) is introduced. Although the centralized baseline yields slightly higher raw accuracy, it incurs a prohibitive energy penalty due to high-bandwidth acoustic transmission (approximately 50 J/bit over long ranges). As shown in Figure 15, FMTL achieves the highest IG/J ratio (0.92 AUC/kJ), making it 4.5× more energy-efficient than centralized raw-data streaming and 1.8× more efficient than standard FedAvg, primarily because faster convergence reduces total mission time. These results indicate that FMTL is not only a bandwidth-saving strategy but also a Green AI solution that is critical for long-endurance autonomous exploration.
Overall, the energy–intelligence trade-off analysis demonstrates that learning architectures for deep-sea autonomy must be evaluated not only by accuracy, but by how effectively information is acquired under energy and communication constraints. In this regard, FMTL directly transforms representation learning efficiency into extended mission endurance, establishing energy-aware intelligence as a primary design objective for long-duration autonomous exploration.

5.7. Safety Assurance and Failure Case Analysis

Lyapunov-based Shield for Provable Safety: To bridge the gap between learning-based control and operational safety, we integrate a Lyapunov-based Safety Shield (similar to Control Barrier Functions). This module acts as a final filter, solving a quadratic program (QP) to minimally adjust the RL-generated action only when acoustic connectivity or collision constraints would otherwise be violated. In adversarial simulations (with strong currents and high obstacle density), the Shield reduced the collision rate from 14.3% (Vanilla RL) to 0.0%, ensuring kinematic feasibility.
Failure Case Analysis: Despite robust performance, we analyzed edge cases where FMTL failed (approximately 5% of missions). The primary failure mode occurred in “Sparse-Deceptive” environments, where extremely sparse sensor rewards led to “posterior collapse” in the meta-learning context encoder. This suggests that while FMTL excels in heterogeneous fusion, future work should incorporate active, curiosity-driven exploration to better handle environments with extreme information sparsity.
Impact of Quantization and Latency on Safety. We investigate whether aggressive bandwidth reduction, such as 8-bit quantization and 67.5 s update intervals, compromises the Lyapunov Safety Shield. This analysis is grounded in the Decoupling Principle, where the Safety Shield operates on local, high-frequency sensor data (10 Hz) to enforce strict collision avoidance constraints ($\dot{h}(x) + \alpha h(x) \ge 0$), while federated updates modify the global guidance map at a significantly lower frequency. Our results indicate that although reduced bandwidth affects global path optimality (evidenced by an approximately 4% increase in path length under 8-bit quantization), it does not degrade safety. The collision rate remains at 0.0% even under the sparsest update intervals (5 cycles), as the onboard shield locally overrides any stale global commands that would otherwise violate safety constraints.
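The role of the shield as a minimal-intervention filter can be illustrated with a barrier-function sketch. It is not the shield used in the experiments: it assumes single-integrator dynamics (velocity is commanded directly), a single circular obstacle with $h(x) = \|x - x_{obs}\|^2 - r^2$, and one constraint, in which case the QP has a closed-form solution. All names and parameters are illustrative.

```python
import numpy as np

def shield(u_rl, x, x_obs, radius, alpha=1.0):
    """Minimally adjust u_rl so that dh/dt + alpha * h >= 0.

    Assumes x' = u and h(x) = |x - x_obs|^2 - radius^2, with x != x_obs.
    """
    d = x - x_obs
    h = float(d @ d) - radius**2
    a = 2.0 * d                  # dh/dt = a . u under x' = u
    b = -alpha * h               # safety constraint: a . u >= b
    slack = float(a @ u_rl) - b
    if slack >= 0.0:
        return u_rl              # RL action already satisfies the constraint
    # Closed-form minimizer of |u - u_rl|^2 subject to a . u >= b.
    return u_rl - slack * a / float(a @ a)

x = np.array([2.0, 0.0])
obstacle = np.zeros(2)
u_unsafe = shield(np.array([-5.0, 0.0]), x, obstacle, radius=1.0)  # heading at obstacle
u_safe = shield(np.array([1.0, 0.0]), x, obstacle, radius=1.0)     # heading away
```

Because the filter uses only the local state and the obstacle estimate, it runs independently of how stale the global guidance is, which is the decoupling argument made above.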

6. Discussion

This work couples perception, explanation, and action in a closed loop, enabling real-time adaptive behavior under communication constraints without centralized raw data. Unlike conventional modular stacks, this framework allows newly acquired evidence to influence motion decisions within the same operational window. A key finding is that standard federated learning (FedAvg-HeteroArch) suffers from “semantic mismatch” because broadcasting parameters across structurally different encoders creates conflicting gradients. Our mapper-centric federation sidesteps this by keeping encoders private and enforcing alignment only in the shared latent space Z. This deliberate architectural asymmetry provides three operational advantages: it preserves modality-specific preprocessing, enables graceful degradation under sensor outages without retraining, and ensures representation-level privacy for sensor configurations. The novelty lies in recognizing that heterogeneous robotics requires fundamentally different aggregation patterns than traditional data-parallel training.

6.1. Why Meta-Learning Matters in the Deep Sea

Standard transfer learning is insufficient for the non-stationary nature of seafloor exploration; meta-learning (Stage 3) acts as the primary adaptation mechanism to prevent overfitting and reduce energy consumption by accelerating feature acquisition. The federated swarm model achieves an AUC of 0.94 compared to 0.72 for isolated models, demonstrating that shared representation captures complementary information while reducing data payload from 102 GB to 500 MB for low-rate acoustic links. This architecture couples perception, decision, and control, using an online replanning layer and decentralized consensus to maintain swarm cohesion during asynchronous execution. While FXAI provides an essential audit trail for costly sampling actions by verifying geophysical cues, HIL simulations highlight the need for future field validation against long-term sensor drift and varying sound-speed structures. To optimize the “Intelligence vs. Bandwidth” budget, we propose a tiered protocol: Phase 1 (Exploration) uses minimal-communication late fusion to maximize survey area, while Phase 2 (Target Verification) triggers the full FMTL protocol for high-precision consensus only when geological uncertainty is high.

6.2. Operational Guidelines for Deployment

Based on our extensive “Hardware-in-the-Loop” validation, we propose a tiered deployment protocol for next-generation AUV swarms:
Phase 1 (Federated Initialization): Deploy agents in a “Loosely Coupled” mode using the GeoSense foundation model to establish a baseline safe map.
Phase 2 (Adaptive Switching): Upon detection of high-value targets, switch to the full FMTL Protocol. Dynamic bandwidth allocation should prioritize gradient updates over raw telemetry. To standardize the transition from Phase 1 (Exploration) to Phase 2 (Target Verification), we define the triggering condition S based on predictive confidence and FXAI consensus:
$$S = (\hat{y} > \tau_{conf}) \wedge (H(V_{attrib}) > \tau_{cons})$$
Based on our ROC analysis, we recommend operational thresholds of $\tau_{conf} = 0.85$ (minimizing false positives) and $\tau_{cons} = 0.7$ (normalized attribution entropy). Here, $H(V_{attrib})$ enforces multi-modal consensus (balanced modality contributions), while the spatial selectivity of the attribution map separately indicates where the evidence concentrates. This ensures that the system only switches to the high-bandwidth, energy-intensive FMTL protocol when a target is detected with high confidence and is supported by multi-modal evidence (e.g., both magnetic and chemical signatures), preventing mode-switching due to single-sensor noise.
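The triggering condition can be expressed as a small predicate. The sketch assumes that "normalized attribution entropy" means the Shannon entropy of the modality shares divided by its maximum, ln M for M modalities; this normalization and the function names are assumptions made for illustration.

```python
import math

def normalized_entropy(contributions):
    """Entropy of modality shares divided by log(number of modalities)."""
    total = sum(contributions)
    shares = [c / total for c in contributions if c > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(contributions))

def should_switch(y_hat, contributions, tau_conf=0.85, tau_cons=0.7):
    """Phase 1 -> Phase 2 trigger: high confidence AND multi-modal consensus."""
    return y_hat > tau_conf and normalized_entropy(contributions) > tau_cons

balanced = should_switch(0.94, [0.4, 0.3, 0.3])           # confident, balanced
single_sensor = should_switch(0.94, [0.96, 0.02, 0.02])   # confident, one modality
low_confidence = should_switch(0.50, [1.0, 1.0, 1.0])     # balanced, not confident
```

Under this normalization, the 0.92 nats reported for the SMS target in Section 5.3 would correspond to about 0.84, above the recommended consensus threshold.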
Hardware Constraints: For edge deployment (e.g., NVIDIA Jetson NX), we recommend 8-bit quantization of the Mapper module, which our results in Figure 14 show retains 98% accuracy while reducing energy consumption by 40%.

6.3. Broader Applicability Beyond Deep Sea

While validated on seafloor exploration, the FMTL architecture addresses the fundamental “Data Silo” and “Heterogeneity” problems common to all maritime robotics. The principles of Mapper-Centric Aggregation and Meta-Initialization are directly transferable to:
Surface USV Swarms: For fusing Radar/LiDAR data in congested ports (similar to adversarial surface scenarios).
Under-Ice AUVs: Where acoustic communication is even more constrained, and acoustic/optical sensor fusion is critical for ice-keel mapping.
This framework thus serves as a blueprint not just for geology, but for the broader Maritime Cognitive Network.

7. Conclusions

This research addresses isolated autonomy by formulating multi-AUV operations as a collaborative learning and control problem. The FMTL framework couples representation learning, federated aggregation, and rapid adaptation in an end-to-end loop, integrating transfer (GeoSense), federated (privacy-preserving), and meta-learning (fast adaptation). Validated via HIL and real-world data (Section 4.8), the model outperforms isolated specialists, improves “time-to-effectiveness,” and supports FXAI auditing within restricted bandwidth budgets. Future work will focus on: (1) open-ocean trials to verify behavior under physical hydrodynamic and multipath effects; (2) delay-tolerant planning with explicit feasibility and reachability guarantees; and (3) asynchronous consensus-based formation maintenance. This framework establishes a foundation for scalable, cooperative marine autonomy through model-level fusion and online action updates under operational constraints.

Author Contributions

Conceptualization, Z.N.; Methodology, Y.W. and S.H.; Software, Y.Z., Y.W., Z.Z. and Y.Y. (Yang Yang); Validation, Y.X. and Y.Y. (Yang Yang); Formal analysis, Y.X.; Investigation, Y.Y. (Yijie Yin); Resources, Y.Z., W.L. and D.X.; Data curation, Z.Z. and S.H.; Writing—original draft, Z.N.; Writing—review and editing, M.W.; Visualization, H.T.; Supervision, D.X.; Project administration, Y.Y. (Yijie Yin) and W.L.; Funding acquisition, H.T. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62303108; and Shanghai Jinggao Investment Consulting Co., Ltd., grant number D-8006-23-0223.

Data Availability Statement

The data presented in this study are available as follows: Simulation Code and Synthetic Datasets: The source code and generated synthetic datasets supporting the FMTL framework are publicly available at: https://github.com/Nixer-713/fmtl_auv_simulation (accessed on 11 January 2026). Public Geophysical Data: Publicly available datasets were analyzed in this study. These data can be accessed at: GEBCO (https://www.gebco.net, accessed on 15 November 2025); EMAG2 (NOAA/NCEI, https://www.ncei.noaa.gov/products/earth-magnetic-model-anomaly-grid-2, accessed on 15 November 2025); and the Marine Geoscience Data System (MGDS, https://www.marine-geo.org, accessed on 15 November 2025). Hardware-in-the-Loop (HIL) Data: The raw sensor logs (approximately 2.3 TB) generated during HIL testing are available from the corresponding author upon reasonable request. Future Sea Trial Data: Data collected during the planned Phase 1/2 sea trials will be deposited in the Rolling Deck to Repository (R2R) within six months of mission completion.

Conflicts of Interest

Author Wei Li is employed by Shanghai Longjing Information Technology Co., Ltd. The remaining authors declare no conflicts of interest. The authors declare that this study received funding from Shanghai Jinggao Investment Consulting Co., Ltd. The funder was not involved in the study design; the collection, analysis, or interpretation of data; the writing of this article; or the decision to submit it for publication.

Appendix A

Appendix A.1. Problem Setup and Assumptions

We consider the following optimization problem over the global model parameters $\theta$:
$$\min_{\theta} F(\theta) = \sum_{k=1}^{N} p_k F_k(\theta)$$
where $F_k(\theta)$ denotes the local objective function at client $k$, and $p_k$ represents the relative weight of client $k$, typically proportional to its local data size.
To analyze the convergence behavior of the proposed method, we make the following standard assumptions.
Assumption 1 (L-Smoothness). Each local objective function $F_k(\theta)$ is $L$-smooth. That is, for any $x, y$, we have:
$$\|\nabla F_k(x) - \nabla F_k(y)\| \le L \|x - y\|$$
Assumption 2 (Unbiased Gradient and Bounded Variance). The stochastic gradient $g_k(\theta)$ computed by client $k$ is an unbiased estimator of the true gradient:
$$\mathbb{E}[g_k(\theta)] = \nabla F_k(\theta)$$
and its variance is bounded:
$$\mathbb{E}\left[\|g_k(\theta) - \nabla F_k(\theta)\|^2\right] \le \sigma^2$$
Assumption 3 (Bounded Delay). The communication delay (staleness) of gradient updates is bounded by a known constant $\delta_{\max}$. That is, the gradient from each client may be computed using a model parameter that is at most $\delta_{\max}$ iterations old.

Appendix A.2. Derivation Sketch

We denote the global model update rule at iteration $t$ as:
$$\theta_{t+1} = \theta_t - \eta_t \sum_{k \in A_t} g_k(\theta_{t-\delta_k})$$
where $\eta_t$ is the learning rate, $A_t$ is the set of active clients at iteration $t$, and $\delta_k$ denotes the staleness associated with client $k$.
Applying Lipschitz Smoothness. By the $L$-smoothness of $F(\theta)$, we obtain:
$$F(\theta_{t+1}) \le F(\theta_t) + \langle \nabla F(\theta_t),\, \theta_{t+1} - \theta_t \rangle + \frac{L}{2} \|\theta_{t+1} - \theta_t\|^2$$
Substituting the update rule yields:
$$\theta_{t+1} - \theta_t = -\eta_t \sum_{k \in A_t} g_k(\theta_{t-\delta_k})$$
To explicitly characterize the effect of delayed gradients, we define the delay-induced gradient error as:
$$e_t = \nabla F(\theta_t) - \nabla F(\theta_{t-\delta})$$
Using Assumption 1, the magnitude of this error can be bounded as:
$$\|e_t\| \le L \|\theta_t - \theta_{t-\delta}\|$$
By incorporating this bound into the expected descent inequality and taking expectations over the stochastic gradients, the convergence behavior can be analyzed by separating the standard stochastic optimization term and the delay-induced error term. This derivation leads to the convergence bound presented in Equation (4) of the main text.
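The delayed update rule in this appendix can be illustrated with a small numerical sketch. Everything concrete here is an assumption made for illustration only: the quadratic local objectives, the client count, the step size, the noise level, the staleness bound, and the function name `run_delayed_sgd` are not taken from the paper's experiments. The sketch applies stale, noisy client gradients exactly as in the update rule and reports the global objective before and after training.

```python
import numpy as np


def run_delayed_sgd(num_clients=4, dim=3, steps=300, eta=0.02,
                    delta_max=5, sigma=0.1, seed=0):
    """Simulate theta_{t+1} = theta_t - eta * sum_k g_k(theta_{t - delta_k}).

    Hypothetical setup: F_k(theta) = 0.5 * ||theta - c_k||^2, which is
    1-smooth (Assumption 1), with Gaussian gradient noise (Assumption 2)
    and per-client staleness drawn from {0, ..., delta_max} (Assumption 3).
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(num_clients, dim))      # c_k, illustrative
    p = np.full(num_clients, 1.0 / num_clients)        # equal weights p_k

    def global_loss(theta):
        # F(theta) = sum_k p_k F_k(theta)
        per_client = 0.5 * np.sum((theta - centers) ** 2, axis=1)
        return float(np.sum(p * per_client))

    history = [np.full(dim, 5.0)]                      # history[t] = theta_t
    for t in range(steps):
        update = np.zeros(dim)
        for k in range(num_clients):                   # all clients active
            delta_k = int(rng.integers(0, delta_max + 1))
            stale_theta = history[max(0, t - delta_k)]
            noise = sigma * rng.normal(size=dim)
            update += (stale_theta - centers[k]) + noise
        history.append(history[-1] - eta * update)
    return global_loss(history[0]), global_loss(history[-1])


if __name__ == "__main__":
    initial, final = run_delayed_sgd()
    print(f"F(theta_0) = {initial:.3f}, F(theta_T) = {final:.3f}")
```

With a sufficiently small step size relative to the staleness bound, the objective decreases despite the stale gradients, which is the qualitative behavior the bound describes. Increasing `eta` or `delta_max` in this toy setting is one way to see the delay-induced error term start to dominate.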

Appendix B

Contributions and Research Structure

This research provides a conceptual and technical foundation for cognitive, collaborative deep-sea exploration by introducing the Cognitive AUV Swarm paradigm, offering a blueprint to mitigate data silos and sensor heterogeneity in large-scale marine operations [11]. It also proposes and implements an FMTL architecture that integrates transfer learning, federated learning, and meta-learning to support collaborative geological discovery under distributed data and constrained underwater communication [23,24]. The key technical contributions are: (1) Cross-modality federated representation learning that enables heterogeneous sensor fusion by aligning sensor-specific information into a unified latent space [19]. (2) Federated explainability (FXAI) that supports geological reasoning and improves model transparency for domain experts [20]. (3) A closed-loop exploration strategy coupling collaborative perception with real-time planning and task allocation through multi-agent reinforcement learning, enabling adaptive behavior during deployment [21,22]. (4) A decentralized motion coordination approach ensuring coordinated swarm operation under acoustic communication constraints [9,10]. While federated learning, transfer learning, and meta-learning are well-established paradigms, their integration in this work introduces three novel architectural choices that are motivated explicitly by heterogeneous marine robotics:
  • Mapper-Centric Aggregation Under Structural Heterogeneity: Unlike standard FedAvg, which broadcasts and aggregates all model parameters [8], FMTL segregates the learning process by aggregating only the shared representation mapper M while keeping the sensor-specific encoders Ek strictly local. This asymmetric design is necessary because heterogeneous AUVs carry distinct sensor payloads with incompatible input dimensions. Broadcasting a universal encoder would either force homogeneous architectures (resulting in lost sensor specialization) or create a “semantic mismatch” gradient that pulls M toward contradictory objectives. The novelty lies in proving that this segmented aggregation converges under non-IID modality distributions, a scenario excluded from standard FedAvg convergence analysis.
  • Federated Attribution as a Trust Mechanism: FXAI does not compute explanations on a central model; instead, it aggregates compressed feature-importance vectors (Vk) from each AUV to construct a swarm-level consensus map. This is operationally distinct from post-hoc explanation methods because it enables domain experts to audit the basis of a distributed prediction before authorizing costly sampling actions in high-stakes environments.
  • Information-Theoretic Planning Under Communication Constraints: The proposed framework formulates adaptive exploration as a constrained optimization that explicitly penalizes actions straining acoustic connectivity. This coupled approach to planning differs from prior adaptive exploration work, which typically decouples geological inference from motion planning.
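The mapper-centric aggregation described in the first item above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter dictionaries, the array shapes, the data-size weights, and the helper name `aggregate_mappers` are all hypothetical. The point it shows is that only the shared mapper M is averaged, while each encoder Ek keeps its own input dimension and never leaves the vehicle.

```python
import numpy as np


def aggregate_mappers(mapper_params, weights):
    """Weighted average of shared-mapper parameters across clients.

    mapper_params: list of dicts mapping parameter name -> np.ndarray,
                   all with identical shapes (the mapper is shared).
    weights:       relative client weights, e.g. local data sizes.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    names = mapper_params[0].keys()
    return {
        name: sum(w * client[name] for w, client in zip(weights, mapper_params))
        for name in names
    }


# Hypothetical two-vehicle fleet with a shared latent dimension of 8.
# Encoder input sizes differ (4 vs. 16), so the encoders cannot be averaged.
clients = [
    {"encoder": {"W": np.ones((4, 8))}, "mapper": {"W": np.full((8, 2), 1.0)}},
    {"encoder": {"W": np.ones((16, 8))}, "mapper": {"W": np.full((8, 2), 3.0)}},
]

# Only mapper parameters are sent for aggregation (illustrative weights).
global_mapper = aggregate_mappers([c["mapper"] for c in clients],
                                  weights=[100, 300])

# Each vehicle installs the aggregated mapper; encoders stay untouched.
for c in clients:
    c["mapper"] = {name: value.copy() for name, value in global_mapper.items()}
```

Because the encoders are excluded from the exchange, the two vehicles can keep incompatible input dimensions without any padding or architecture matching.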

Appendix C

To directly isolate the contribution of mapper-centric federation, we introduce an additional baseline that is not included in Table 1, referred to as FedAvg-HeteroArch. This baseline applies standard federated averaging to both the sensor-specific encoders Ek and the shared mapper M, allowing each AUV to retain its own encoder architecture while broadcasting and aggregating encoder parameters across the fleet. The purpose of this comparison is to determine whether the performance gains of FMTL arise specifically from the mapper-centric design, rather than from the mere application of federated learning to heterogeneous platforms. The corresponding results are reported below. FedAvg-HeteroArch achieves an AUC of 0.76 under heterogeneous payload configurations, compared to 0.94 achieved by FMTL. This substantial performance gap confirms that broadcasting incompatible encoder updates leads to pronounced degradation, underscoring the necessity of mapper-centric aggregation in heterogeneous AUV swarms.

Appendix D

Deployment Roadmap, Hardware-in-the-Loop Evidence, and Evaluation Criteria

A staged campaign quantifies sim-to-real transfer while controlling operational risk. The initial shallow-water trial focuses on federated aggregation over acoustic links, evaluates convergence within mission budgets, and tests the feasibility of edge inference under power constraints, benchmarking map quality against independently surveyed ground truth in GPS-denied conditions. Interim hardware-in-the-loop testing limits inference and communication to physical stacks, enforcing TDMA scheduling and modem overheads, while demonstrating stable federation under measured packet loss and calibration imperfections. Operational readiness has not been claimed, and open issues include regulatory integration, unverified safety-critical behaviors under prolonged communication loss, long-term reliability, and the lack of a quantified cost-benefit assessment. At-sea outcomes will be transparently reported with mission logs and datasets for reproducibility.
Field performance will be evaluated using predefined criteria linking detection quality, communication burden, operational efficiency, and system reliability. Failure triggers include an AUC below 0.75, communication overhead above 5 GB per mission, or a false positive rate above 20%; any of these would necessitate redesign and would conflict with the operational trust requirements in Section 5.4.
To address operational uncertainties in sim-to-real transfer, Go/No-Go criteria for Phase 1 sea trials were established from stability limits observed in simulation:
  • Synchronization under Varying Sound Speed: The TDMA protocol must maintain slot alignment with timing error < 50 ms under Sound Speed Profile (SSP) variations of ±15 m/s. Exceeding this threshold triggers fallback to asynchronous mode.
  • Tolerance to Real-World Packet Loss: The federated update process must show positive AUC growth over a 10-round window under 30% sustained packet loss, providing a safety margin below the 52% breakdown threshold in Section 5.6.
  • Drift Correction Stability: For multi-hour missions, the Federated Calibration module must maintain geospatial alignment error < 15 m over a 4-h dive without surfacing. Exceeding this limit requires mission abort or resurfacing for GPS fixes.
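The three Go/No-Go criteria above could be encoded as a simple pre-trial check. The sketch below is an assumption-laden illustration: the class and field names are hypothetical, and "positive AUC growth over a 10-round window" is interpreted here as the last value in a 10-round window exceeding the first, which is only one possible reading of the criterion.

```python
from dataclasses import dataclass


@dataclass
class Phase1Measurements:
    """Hypothetical container for Phase 1 trial measurements."""
    tdma_timing_error_ms: float   # measured under +/-15 m/s SSP variation
    auc_by_round: list            # AUC per round under 30% packet loss
    alignment_error_m: float      # geospatial error over a 4-h dive


def go_no_go(m):
    """Return (overall decision, per-criterion results)."""
    window = m.auc_by_round[-10:]  # most recent 10-round window
    checks = {
        # Slot alignment: timing error must stay below 50 ms.
        "tdma_sync": m.tdma_timing_error_ms < 50.0,
        # Assumed reading of "positive AUC growth": last value > first value.
        "packet_loss_tolerance": len(window) == 10 and window[-1] > window[0],
        # Drift correction: alignment error must stay below 15 m.
        "drift_correction": m.alignment_error_m < 15.0,
    }
    return all(checks.values()), checks


if __name__ == "__main__":
    # Illustrative values only; not measured data.
    sample = Phase1Measurements(
        tdma_timing_error_ms=32.0,
        auc_by_round=[0.70, 0.71, 0.71, 0.72, 0.73,
                      0.73, 0.74, 0.75, 0.75, 0.76],
        alignment_error_m=11.5,
    )
    decision, details = go_no_go(sample)
    print(decision, details)
```

Returning the per-criterion dictionary alongside the overall decision makes it clear which fallback applies, for example asynchronous mode for a synchronization failure versus resurfacing for a drift failure.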

Appendix E

Beyond Attribution: Counterfactual Testing and Operational Impact

To evaluate whether the proposed FXAI layer captures causal geophysical reasoning rather than merely identifying statistical correlations, we conduct a two-fold validation: interventional counterfactual analysis and a quantified assessment of its operational impact on autonomous missions.
Geophysical Counterfactuals and Robustness Verification:
We perform an interventional analysis by injecting “decoy” targets into the validation set to test the model’s robustness against physically inconsistent signals. Specifically, we create scenarios where strong chemical plume signatures are paired with background magnetic noise—violating known physical correlations inherent to SMS detection. As a result, the model’s detection probability for SMS targets drops from 0.94 to 0.12. The FXAI attribution maps reveal that the model correctly identifies the absence of a magnetic anomaly as a key factor against target presence. These counterfactual tests show that the framework’s decision logic aligns with known multi-physics dependencies of hydrothermal vent fields, rather than relying on background artifacts.
Quantifying Operational Value via Decision Gating: While interpretability layers typically do not alter raw predictive accuracy, their operational value lies in their ability to gate decisions. To quantify this, we simulate a “Sample-on-Trigger” mission profile with two deployment policies:
  • Black-Box Policy: Triggers a physical sampling action whenever the raw detection probability y ^ > 0.9.
  • FXAI-Gated Policy: Triggers sampling only if y ^ > 0.9 and the FXAI attribution maps exhibit a consensus across at least two distinct modalities (e.g., magnetic and chemical).
Results indicate that the FXAI-Gated Policy reduces false positive sampling actions by 22%. Since physical sampling is energy-intensive (≈500 J per sample) and time-consuming, this reduction leads to a 14% saving in total mission energy consumption. These findings demonstrate that interpretability functions as an essential “energy-saving filter,” preventing costly autonomous actions caused by single-sensor noise or spurious correlations, and thereby extending the operational endurance of the swarm.
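The two deployment policies compared above can be written as a pair of small predicates. This is a sketch under assumptions: the paper does not specify how attribution "consensus" is computed, so the per-modality attribution shares, the `attribution_threshold` value, and the function names are hypothetical. Only the 0.9 probability threshold and the two-modality requirement come from the text.

```python
def black_box_trigger(y_hat, prob_threshold=0.9):
    """Black-Box Policy: sample whenever detection probability exceeds 0.9."""
    return y_hat > prob_threshold


def fxai_gated_trigger(y_hat, modality_attribution, prob_threshold=0.9,
                       attribution_threshold=0.25, min_modalities=2):
    """FXAI-Gated Policy: also require support from >= 2 distinct modalities.

    modality_attribution: dict of modality name -> attribution share.
    attribution_threshold is an assumed cutoff for counting a modality
    as supporting the detection.
    """
    supporting = [name for name, share in modality_attribution.items()
                  if share >= attribution_threshold]
    return y_hat > prob_threshold and len(supporting) >= min_modalities


if __name__ == "__main__":
    # Illustrative attribution shares only; not results from the paper.
    consistent = {"magnetic": 0.45, "chemical": 0.40, "bathymetric": 0.15}
    decoy_like = {"magnetic": 0.05, "chemical": 0.85, "bathymetric": 0.10}
    print(black_box_trigger(0.95), fxai_gated_trigger(0.95, consistent))
    print(black_box_trigger(0.95), fxai_gated_trigger(0.95, decoy_like))
```

In the second case the black-box policy would still trigger sampling, while the gated policy declines because only one modality carries the attribution, which mirrors the single-sensor noise scenario the gating is meant to filter.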

Appendix F

Experiment 4: Meta-Learning Enables Unprecedented Adaptation Speed

This experiment studies the adaptation behavior of the FMTL pipeline’s meta-learning stage when the swarm is deployed to a previously unseen geological environment. A critical test of generalization is the “Geological Regime Shift”: the GeoSense foundation model is pretrained on Domain 1 (hydrothermal vent fields) and deployed to Domain 2 (abyssal nodule plains) without retraining. Figure 11 reports AUC as a function of the number of adaptation missions. The meta-learned initialization attains an initial AUC of 0.78 at mission entry, whereas the fine-tuning baseline starts at 0.45, indicating materially stronger out-of-distribution generalization. The adaptation trajectory further indicates that the meta-learned model reaches 90% of its peak performance within approximately five missions.
To strictly isolate the contribution of meta-learning from standard transfer learning, we evaluated adaptation speed across three specific variants under this distribution shift: (1) Fine-tuning Only (Variant C), representing standard transfer learning, which requires 42 missions to reach 90% performance on the new domain; (2) Random Initialization (Variant A), which requires more than 90 missions to reach the same threshold; and (3) Meta-Learned Initialization (FMTL), which reaches the threshold in only 5 missions. This corresponds to an 8.4× improvement of FMTL over standard fine-tuning. This result supports the hypothesis that meta-learning provides a superior initialization for rapid domain adaptation, which is critical for multi-week seafloor expeditions where environmental conditions can shift on sub-mission timescales, allowing the swarm to maintain high-performance inference without extensive in situ retraining.
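The adaptation-speed metric and the reported 8.4× ratio can be reproduced with a few lines of arithmetic. In this sketch the function names and the example AUC curve are hypothetical (only its starting value of 0.78 and the reference AUC of 0.94 appear in the text); the mission counts 42 and 5 are the ones quoted above.

```python
def missions_to_threshold(auc_by_mission, reference_auc, fraction=0.9):
    """Return the first mission (1-indexed) whose AUC reaches
    fraction * reference_auc, or None if the threshold is never reached."""
    target = fraction * reference_auc
    for mission, auc in enumerate(auc_by_mission, start=1):
        if auc >= target:
            return mission
    return None


def adaptation_speedup(baseline_missions, method_missions):
    """Ratio of missions needed by the baseline to missions needed by the method."""
    return baseline_missions / method_missions


if __name__ == "__main__":
    # Hypothetical adaptation curve; only the 0.78 starting point is from the text.
    example_curve = [0.78, 0.80, 0.82, 0.84, 0.85, 0.90]
    print(missions_to_threshold(example_curve, reference_auc=0.94))
    # Fine-tuning (42 missions) vs. meta-learned initialization (5 missions).
    print(adaptation_speedup(42, 5))
```

The same helper applies to any of the ablation variants, provided the reference AUC is held fixed so that all variants are measured against the same threshold.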

Appendix G

Discussion

A central insight emerges from the heterogeneous architecture baseline (FedAvg-HeteroArch): standard federated learning is fundamentally mismatched to structurally heterogeneous systems. FedAvg-HeteroArch attempts to handle sensor specialization through parameter broadcasting, but this creates a coordination problem: if Ek outputs differ in dimension or semantics across vehicles, averaging the downstream mapper weights creates a “semantic mismatch” gradient that pulls the shared representation toward conflicting objectives simultaneously. In contrast, mapper-centric federation sidesteps this problem by keeping encoders private and enforcing alignment only in the shared latent space Z. This design choice has three operational consequences: (1) Modality-Specific Preprocessing is Preserved: Each AUV retains full control over sensor-specific preprocessing, reducing dependence on homogeneous preprocessing pipelines. (2) Graceful Degradation Under Sensor Outages: If an AUV loses a sensor mid-mission, its encoder Ek can be zeroed or replaced without retraining the shared mapper M. Standard FedAvg would require re-initialization and retraining to maintain consistency. (3) Privacy Preservation at the Representation Level: Raw sensor encodings remain on individual platforms; only abstract mapper updates are transmitted. This provides operational privacy (mission-sensitive sensor configurations remain undisclosed) beyond formal differential privacy. These properties are not emergent; they result from deliberate architectural asymmetry. The novelty lies in recognizing that federated learning for heterogeneous robotics requires fundamentally different aggregation patterns than those used in homogeneous data-parallel training.

References

  1. Hein, J.R.; Mizell, K.; Koschinsky, A.; Conrad, T.A. Deep-ocean mineral deposits as a source of critical metals for high- and green-technology applications: Comparison with land-based resources. Ore Geol. Rev. 2013, 51, 1–14. [Google Scholar] [CrossRef]
  2. Ramirez-Llodra, E.; Brandt, A.; Danovaro, R.; De Mol, B.; Escobar, E.; German, C.R.; Levin, L.A.; Martinez Arbizu, P.; Menot, L.; Buhl-Mortensen, P.; et al. Deep, diverse, and definitely different: Unique attributes of the world’s largest ecosystem. Biogeosciences 2010, 7, 2851–2899. [Google Scholar]
  3. Constantinoiu, L.F.; Bernardino, M.; Rusu, E. Autonomous Shallow Water Hydrographic Survey Using a Proto-Type USV. J. Mar. Sci. Eng. 2023, 11, 799. [Google Scholar] [CrossRef]
  4. Wang, Z.; Li, G.; Ren, J. Dynamic Path Planning for Unmanned Surface Vehicle in Complex Offshore Areas Based on Hybrid Algorithm. Comput. Commun. 2021, 166, 49–56. [Google Scholar] [CrossRef]
  5. Xu, X.; Lu, Y.; Liu, X. Intelligent Collision Avoidance Algorithms for USVs via Deep Reinforcement Learning under COLREGs. Ocean Eng. 2020, 217, 107704. [Google Scholar] [CrossRef]
  6. Feng, Z.; Pan, Z.; Chen, W.; Liu, Y.; Leng, J. USV Application Scenario Expansion Based on Motion Control, Path Following and Velocity Planning. Machines 2022, 10, 310. [Google Scholar] [CrossRef]
  7. Lyu, H.; Hao, Z.; Li, J. Ship Autonomous Collision-Avoidance Strategies—A Comprehensive Review. J. Mar. Sci. Eng. 2023, 11, 830. [Google Scholar] [CrossRef]
  8. Mainak, G.; Arzoo; Dinesh, K. Federated Learning for Self-Steering USVs. In Proceedings of the 2025 6th International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, 25–27 June 2025; pp. 624–628. [Google Scholar]
  9. Song, B.; Khanduri, P.; Zhang, X.; Yi, J.; Hong, M. FedAvg Converges to Zero Training Loss Linearly for Overparameterized Multi-Layer Neural Networks. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 32304–32330. [Google Scholar]
  10. Paull, L.; Saeedi, S.; Seto, M.; Li, H. AUV navigation and localization: A review. IEEE J. Ocean. Eng. 2014, 39, 131–149. [Google Scholar]
  11. Nantogma, S.; Zhang, S.; Yu, X.; An, X.; Xu, Y. Multi-USV dynamic navigation and target capture: A guided multi-agent reinforcement learning approach. Electronics 2023, 12, 1523. [Google Scholar]
  12. Xie, Y.; Ma, Y.; Cheng, Y.; Li, Z.; Liu, X. BIT* + TD3 Hybrid Algorithm for Energy-Efficient Path Planning of Unmanned Surface Vehicles in Complex Inland Waterways. Appl. Sci. 2025, 15, 3446. [Google Scholar]
  13. Ruder, S.; Peters, M.E.; Swayamdipta, S.; Wolf, T. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, Minneapolis, MI, USA, 2–7 June 2019; pp. 15–18. [Google Scholar]
  14. Öztürk, C.; Taşyürek, M.; Türkdamar, M.U. Transfer learning and fine-tuned transfer learning methods’ effectiveness analyse in the CNN-based deep learning models. Concurr. Comput. Pract. Exp. 2023, 35, e7542. [Google Scholar] [CrossRef]
  15. Cheng, Y.; Feng, G.; Zhang, C. An Efficient and Lightweight YOLOv8s Strawberry Maturity Detection Model. J. Agric. Sci. Technol. A 2024, 14, 46–66. [Google Scholar] [CrossRef]
  16. Zhang, L.; Wu, J.; Zhang, K.; Wang, Z.; Yan, X.; Liu, P.; Wang, Q.; Fan, L.; Yao, J.; Yang, Y.; et al. Diagnosis of pumping machine working conditions based on transfer learning and ViT model. Geoenergy Sci. Eng. 2023, 226, 211729. [Google Scholar] [CrossRef]
  17. Stüber, J.; Kopicki, M.; Zito, C. Feature-based transfer learning for robotic push manipulation. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 5643–5650. [Google Scholar]
  18. Anwar, A.; Raychowdhury, A. Autonomous navigation via deep reinforcement learning for resource constraint edge nodes using transfer learning. IEEE Access 2020, 8, 26549–26560. [Google Scholar] [CrossRef]
  19. Amoke, D.A.; Li, Y.; Naqvi, S.M. Transfer learning-based vessel trajectory classification in AIS data. In Proceedings of the 2025 25th International Conference on Digital Signal Processing (DSP), Pylos, Greece, 25–27 June 2025. [Google Scholar]
  20. Jin, K.; Zhu, H.; Gao, R.; Wang, J.; Wang, H.; Yi, H.; Shi, C.-J.R. DEMRL: Dynamic estimation meta reinforcement learning for path following on unseen unmanned surface vehicle. Ocean Eng. 2023, 288, 115958. [Google Scholar] [CrossRef]
  21. Wang, B.; Jiang, P.; Gao, J.; Huo, W.; Yang, Z.; Liao, Y. A lightweight few-shot marine object detection network for unmanned surface vehicles. Ocean Eng. 2023, 277, 114329. [Google Scholar] [CrossRef]
  22. Song, R.; Gao, S.; Li, Y. A novel approach to multi-USV cooperative search in unknown dynamic marine environment using reinforcement learning. Neural Comput. Appl. 2025, 37, 16055–16070. [Google Scholar]
  23. Liu, X.; Deng, Y.; Nallanathan, A.; Bennis, M. Federated learning and meta learning: Approaches, applications, and directions. IEEE Commun. Surv. Tutor. 2023, 26, 571–618. [Google Scholar]
  24. Zhao, W.; Dai, S.; Tian, H.; Zhu, D.; Zhang, Y.; Jiang, F. An Overview Study of Deep Learning in Geophysics: Cross-Cutting Research to Advance Geoscience. IEEE Access 2025, 13, 124364–124388. [Google Scholar] [CrossRef]
  25. Er, M.J.; Chen, J.; Zhang, Y.; Gao, W. Research Challenges, Recent Advances, and Popular Datasets in Deep Learning-Based Underwater Marine Object Detection: A Review. Sensors 2023, 23, 1990. [Google Scholar] [CrossRef] [PubMed]
  26. Tan, Y.; Long, G.; Liu, L.; Zhou, T.; Lu, Q.; Jiang, J.; Zhang, C. FedProto: Federated Prototype Learning across Heterogeneous Clients. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2022), Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 8436–8444. [Google Scholar]
  27. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
  28. Chen, L.; Zhang, W.; Dong, C.; Qiao, S.; Huang, Z.; Nie, Y.; Hou, Z.; Tan, C.W. FedDRL: A Trustworthy Federated Learning Model Fusion Method Based on Staged Reinforcement Learning. arXiv 2024, arXiv:2307.13716. [Google Scholar] [CrossRef]
  29. Finkelshtein, B.; Huang, X.; Bronstein, M.; Ceylan, I.I. Cooperative Graph Neural Networks. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024; pp. 13633–13659. [Google Scholar]
  30. Lin, Q.; Li, H.; Jia, Y.; Li, Y.; Lian, S.; Liu, H.; Kwong, S.; Cong, R. ViT-UWA: Vision Transformer Underwater-Adapter for Dense Predictions Beneath the Water Surface. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
Figure 1. The Federated Meta-Transfer Learning (FMTL) Framework. The architecture consists of three stages: (1) Pretraining the GeoSense foundation model; (2) Federated learning, where AUVs collaborate via a shared mapper while keeping raw data local; and (3) Meta-learning for rapid adaptation to new survey environments.
Figure 2. Cross-Modality Federated Representation Learning. Heterogeneous sensor encoders (E1, E2) map inputs to a shared latent space (Z). The contrastive loss aligns these representations to facilitate shared mapper (M) fusion with a guaranteed convergence rate of O ( 1 / T ) , without exchanging raw sensor measurements.
Figure 3. Overall architecture of the proposed FMTL framework for collaborative deep-sea exploration.
Figure 4. Adaptation curves in new geological province.
Figure 5. Comparison between FMTL and baseline methods: (a) target identification AUC; (b) communication cost vs. AUC; (c) episodes to reach 90% performance; (d) inference latency per decision.
Figure 6. Distribution of replanning latency under the MARL-adaptive navigation policy.
Figure 7. Swarm fragmentation over mission time under different coordination strategies.
Figure 8. Cumulative High-Value Targets Discovered Over Time for Different Exploration Strategies.
Figure 9. AUC Evolution During Dynamic Mission Phases with Different Exploration Strategies.
Figure 10. Federated Explainable AI (FXAI) attribution visualization across three sensing modalities and the resulting federated decision map. (A) Bathymetric depth distribution; (B) Magnetic anomaly distribution (nT); (C) Chemical plume concentration field; (D) Federated attribution map indicating feature importance relative to the predicted SMS location.
Figure 11. Adaptation performance of meta-learned, fine-tuned, and randomly initialized models in a previously unseen geological environment.
Figure 12. t-SNE visualization of the learned shared latent space. (a) Under the full FMTL framework, embeddings from different sensing modalities are well aligned and cluster according to underlying geological structures rather than sensor identity, indicating effective cross-modality representation learning. (b) When the contrastive objective is removed (Variant D), embeddings remain separated by modality, demonstrating the failure to establish a unified geological representation.
Figure 13. Convergence behavior of the proposed framework with and without GeoSense pre-training. Models initialized with domain-informed representations (FMTLFull) achieve faster convergence and higher final performance, whereas training from random initialization (Variant A) requires substantially more communication rounds and converges to inferior accuracy. This comparison underscores the effectiveness of pre-training in federated learning settings.
Figure 14. Communication–accuracy Pareto frontier for FMTL under constrained acoustic bandwidth.
Figure 15. Energy–intelligence trade-off under long-duration deep-sea operations. (a) Total energy per mission decomposed into transmission and computation for FMTL, Centralized, and FedAvg. (b) Information Gain per Joule (IG/J) comparison, showing the highest IG/J for FMTL. (c) Cumulative energy growth over a 72 h mission horizon. (d) Sensitivity surface illustrating the joint effect of communication budget and quantization level on AUC. Colors in (a) distinguish transmission and computation energy. In (d), the color gradient represents AUC values, with warmer colors indicating higher performance.
Table 1. Baseline definitions and assumptions.
Method | Multi-Modal | Cross-Modality | Comm. Constrained | Decoupled Training
ISOLATED × ×
CENTRALIZED × ×
FedAvg × ×
FMTL (Ours) ×
Table 2. Metrics used in Experiment 1.
Metric | Definition | Interpretation
AUC | ROC AUC computed over map grid cells | Global discrimination capability
Precision/Recall | Computed over target-zone cells | Spatial detection reliability
Communication cost | Total transmitted payload size | Efficiency under acoustic constraints
Table 3. Metrics used in Experiment 2.
Metric | Definition
Replanning latency | Time from belief update to execution of a new feasible trajectory
Path feasibility | Fraction of trajectories satisfying kinematic and connectivity constraints
Swarm fragmentation | Maximum inter-vehicle distance relative to connectivity limit
Survey efficiency | Objective-normalized mission performance relative to static baseline
Table 4. Performance on real-world SMS detection.
Method | Dataset 1 (GEBCO) | Dataset 2 (AUV) | Avg. Precision | Avg. Recall
ISOLATED (Bathy) | 0.62 | 0.58 | 0.55 | 0.48
ISOLATED (Mag) | 0.71 | 0.69 | 0.68 | 0.64
Classical Fusion | 0.74 | 0.72 | 0.70 | 0.68
FMTL (Ours) | 0.89 | 0.84 | 0.86 | 0.81
ORACLE | 0.93 | 0.89 | 0.90 | 0.87
Table 5. Validation metrics for sea trials.
Metric | Simulation Baseline | Acceptable Field Performance | Measurement Method
SMS Detection AUC | 0.94 | ≥0.85 | ROV ground-truth validation
Communication Efficiency | 500 MB/mission | ≤1 GB/mission | Acoustic modem logs
Survey Time Reduction | 60% vs. grid search | ≥40% | Mission duration comparison
False Positive Rate | 3.2% | ≤8% | Post-mission geologist review
System Uptime | 100% (sim) | ≥92% | Vehicle telemetry logs
Energy Efficiency | N/A | ≤15% battery overhead vs. non-AI baseline | Power consumption sensors
Table 6. Dynamic Planning Performance Comparison.
Metric | Static Grid | Reactive Heuristic | MARL-Adaptive (Ours)
Survey Efficiency Gain | Baseline (0%) | +18% | +42%
Replanning Latency (ms) | N/A | 320 ± 80 | 45 ± 12
Path Feasibility (%) | 100% (no replans) | 76% | 98%
Max Swarm Fragmentation (km) | <2 (grouped) | 6.2 | 4.1
Hotspot Response Time (min) | N/A (static) | 18 | 2.3
Communication Overhead (MB) | 500 | 480 | 600
Avg AUC Achieved | 0.87 | 0.91 | 0.94
Table 7. Definition of ablation variants and included components.
Variant ID | Description | Components Included
FMTLFull | Complete framework (baseline) | Pretrain + FedLearn + FXAI + Meta
Variant A | No pretraining | Random init + FedLearn + Meta
Variant B | No federated learning | Pretrain + Centralized + Meta
Variant C | No meta-learning | Pretrain + FedLearn (finetune only)
Variant D | No contrastive loss | Pretrain + FedLearn (w/o Equation (1)) + Meta
Variant E | No shared mapper | Pretrain + FedLearn (late fusion) + Meta
Variant F | No FXAI | Pretrain + FedLearn + Meta (no attribution)
Table 8. Ablation results for FMTL under different component removals.
Variant | AUC | Precision | Recall | Adaptation Speed (Missions to 90%) | Comm. Cost (MB)
FMTL-Full | 0.94 | 0.91 | 0.89 | 5 | 500
Variant A (No Pretraining) | 0.78 | 0.74 | 0.71 | 12 | 500
Variant B (No Federation, Centralized) | 0.97 | 0.95 | 0.93 | 4 | 102,000
Variant C (No Meta-Learning) | 0.93 | 0.90 | 0.88 | 42 | 500
Variant D (No Contrastive Loss) | 0.85 | 0.81 | 0.79 | 7 | 500
Variant E (No Shared Mapper) | 0.81 | 0.77 | 0.74 | 9 | 500
Variant F (No FXAI) | 0.94 | 0.91 | 0.89 | 5 | 485
Note: Adaptation speed is defined as the number of missions required to reach 90% of the final AUC achieved by FMTL-Full. The slightly reduced communication cost for Variant F reflects the removal of attribution-related metadata, while all other components and update schedules remain identical.
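The adaptation-speed definition in the note can be expressed as a short sketch. The function name, the list-based learning curve, and the example values are hypothetical and only illustrate the definition; they are not data or code from the paper.

```python
def missions_to_threshold(auc_per_mission, reference_final_auc, fraction=0.9):
    """Return the 1-based index of the first mission whose AUC reaches
    fraction * reference_final_auc, or None if the target is never reached."""
    target = fraction * reference_final_auc
    for mission, auc in enumerate(auc_per_mission, start=1):
        if auc >= target:
            return mission
    return None


# Hypothetical learning curve for one variant (illustrative values only).
curve = [0.60, 0.72, 0.80, 0.83, 0.85, 0.90]

# The reference is the final AUC of the full framework (0.94 in Table 8),
# so the target is 0.9 * 0.94 = 0.846.
print(missions_to_threshold(curve, reference_final_auc=0.94))  # 5
```

Measuring every variant against the same reference AUC, rather than each variant's own final AUC, keeps the adaptation-speed column comparable across rows.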
Table 9. Latent dimension d vs. AUC/training time/memory.
d | AUC | Training Time (hours) | Memory (GB)
128 | 0.88 | 3.2 | 4.1
256 | 0.91 | 4.7 | 5.8
512 | 0.94 | 8.3 | 9.2
1024 | 0.94 | 16.1 | 17.6
Table 10. Local epochs E vs. AUC/communication rounds to convergence.
E | AUC | Communication Rounds to Convergence
1 | 0.89 | 180
5 | 0.94 | 100
10 | 0.94 | 95
20 | 0.93 | 92 (overfitting on local data)
Table 11. Contrastive weight λ1 vs. AUC/cross-modality embedding distance.
λ1 | AUC | Cross-Modality Embedding Distance
0.0 | 0.85 | 1.47 (poor alignment)
0.1 | 0.91 | 0.54
0.5 | 0.94 | 0.23
1.0 | 0.92 | 0.19 (over-regularized)
Table 12. Component-wise training time per round, inference time per sample, and memory footprint.
Component | Training Time per Round | Inference Time per Sample | Memory
Sensor Encoder (E) | 42 s | 8 ms | 2.1 GB
Shared Mapper (M) | 18 s | 3 ms | 0.9 GB
Task Head (H) | 6 s | 1 ms | 0.3 GB
FXAI Attribution | 31 s | 45 ms | 1.8 GB
Context Module (CAM) | 14 s | 5 ms | 0.7 GB
Total FMTL | 111 s | 62 ms | 5.8 GB
Table 13. Scaling with swarm size.
# AUVs | AUC | Communication Cost per Round (KB) | Training Time per Round (s)
3 | 0.87 | 167 | 111
6 | 0.91 | 334 | 118
9 | 0.94 | 500 | 124
12 | 0.95 | 667 | 142
18 | 0.96 | 1001 | 189
24 | 0.96 | 1334 | 251
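Dividing the per-round communication cost in Table 13 by the swarm size suggests near-linear scaling, at roughly 55.6 to 55.7 KB per AUV per round. The short check below reproduces that arithmetic from the tabulated values; the variable and function names are illustrative only.

```python
# (number of AUVs, communication cost per round in KB), copied from Table 13.
TABLE_13 = [(3, 167), (6, 334), (9, 500), (12, 667), (18, 1001), (24, 1334)]


def per_vehicle_cost(rows):
    """Return (swarm size, KB per AUV per round) for each table row."""
    return [(n, cost_kb / n) for n, cost_kb in rows]


for n, kb in per_vehicle_cost(TABLE_13):
    print(f"{n:>2} AUVs: {kb:.1f} KB per AUV per round")
```

Training time per round does not follow the same pattern: it grows from 111 s to 251 s across the table, which is slower than proportional to the number of vehicles.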
Table 14. Complete sensor failure severity vs. AUC and survey efficiency.
Metric | Baseline (No Failure) | 1 AUV Failed | 2 AUVs Failed | 3 AUVs Failed
AUC | 0.94 | 0.92 (−2%) | 0.88 (−6%) | 0.83 (−11%)
Survey Efficiency | 100% | 96% | 87% | 72%
Table 15. Outlier rate vs. AUC and false positive rate.
Outlier Rate | AUC | False Positive Rate
0% | 0.94 | 3.2%
10% | 0.92 | 4.1%
20% | 0.89 | 6.8%
30% | 0.84 | 11.3%
Table 16. Performance Under Realistic Channels.
Channel Condition | Packet Loss Rate | AUC After 100 Rounds | Convergence Time
Ideal (AWGN, no fading) | 0% | 0.94 | 100 rounds
Moderate (multipath, SNR = 15 dB) | 12% | 0.93 | 118 rounds
Harsh (multipath + Doppler) | 24% | 0.91 | 156 rounds
Severe (+shadow zones) | 35% | 0.86 | 203 rounds
Extreme (storm, SNR = 5 dB) | 52% | 0.74 | Did not converge
Table 17. Time-varying channel phases vs. average packet loss and cumulative AUC.
Mission Phase | Avg. Packet Loss | Cumulative AUC
Hours 0–24 | 12% | 0.89
Hours 24–48 | 28% | 0.87 (−2%)
Hours 48–72 | 48% | 0.81 (−8%)
Post-Storm (+6 h) | 21% | 0.88 (Recovery)
Table 18. Comparison to prior work in channel modeling assumptions.
Study | Channel Model | Validation Method
Standard FedAvg [9] | I.i.d. packet loss | Synthetic dropout
FedProx [24] | Uniform 10% loss | Simulated
Our Work | Bellhop ray tracing + Doppler + shadows | Physics-based
Table 19. Adversarial attack performance.
Attack Type | Clean AUC | Attacked AUC | Success Rate
No Attack | 0.94 | 0.94 | N/A
Untargeted FGSM | 0.94 | 0.81 | 68%
Targeted FGSM | 0.94 | 0.77 | 74%
PGD (stronger) | 0.94 | 0.73 | 81%
Table 20. Long-duration mission stability metrics over time.
Time Elapsed | AUC | Memory Usage | False Discovery Rate
0–6 h | 0.94 | 5.8 GB | 3.2%
6–12 h | 0.93 | 5.9 GB | 3.8%
12–24 h | 0.92 | 6.2 GB | 4.5%
24–48 h | 0.91 | 6.7 GB | 5.1%
48–72 h | 0.90 | 7.3 GB | 5.9%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nie, Z.; Tian, H.; Yin, Y.; Zhou, Y.; Li, W.; Xiong, Y.; Wang, Y.; Zhang, Z.; Yang, Y.; Xie, D.; et al. System-Level Optimization of AUV Swarm Control and Perception: An Energy-Aware Federated Meta-Transfer Learning Framework with Digital Twin Validation. J. Mar. Sci. Eng. 2026, 14, 384. https://doi.org/10.3390/jmse14040384

APA Style

Nie, Z., Tian, H., Yin, Y., Zhou, Y., Li, W., Xiong, Y., Wang, Y., Zhang, Z., Yang, Y., Xie, D., Wang, M., & Huang, S. (2026). System-Level Optimization of AUV Swarm Control and Perception: An Energy-Aware Federated Meta-Transfer Learning Framework with Digital Twin Validation. Journal of Marine Science and Engineering, 14(4), 384. https://doi.org/10.3390/jmse14040384
