Article

Beyond RGB: Early Stage Fusion of Thermal and Visual Modalities for Robust Maritime Perception

SEA.AI GmbH, Peter-Behrens-Platz 4, 4020 Linz, Austria
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4746; https://doi.org/10.3390/electronics14234746
Submission received: 15 October 2025 / Revised: 23 November 2025 / Accepted: 29 November 2025 / Published: 2 December 2025

Abstract

In maritime environments, reliable object detection and semantic segmentation are essential for navigation and collision avoidance, especially under adverse conditions. This paper benchmarks early stage RGB–thermal (RGBT) fusion architectures for these tasks using a novel, pixel-aligned maritime dataset. We evaluate transformer-based, attention-driven, and lightweight convolutional models, analyzing trade-offs between accuracy and efficiency for edge deployment. Our results show that RGBT fusion significantly improved detection robustness, with transformer models achieving the top accuracy and lightweight models like WNet-S offering strong performance with lower computational costs. We also introduce a modular, open-source fusion framework to support reproducible research and practical deployment in maritime and other safety-critical domains.

1. Introduction

Safe navigation and collision avoidance in the maritime domain are heavily dependent on automatic identification system (AIS) and radar technologies. The AIS is a vessel tracking system that broadcasts a ship’s position, identity, and navigation status through VHF radio to improve situational awareness [1]. However, the AIS is only mandatory for vessels exceeding a gross tonnage of 300 and for all passenger ships, which means that many smaller boats remain untracked. Radar is another widely used modality, but its performance is sensitive to meteorological conditions, as well as the shape, size, and material of targets. As a result, radar systems may fail to detect small floating objects or small boats. Effective collision avoidance is therefore critical in maritime environments, where high traffic density, limited maneuverability, and poor visibility can significantly increase the risk of accidents. This creates a pressing need for automated, vision-based perception systems capable of operating reliably across diverse and adverse environmental conditions.
Traditionally, human vision has been used to augment radar and AIS, but it presents inherent limitations. Humans cannot continuously monitor the entire environment, and their perception degrades under fatigue, darkness, fog, heavy rain, or even strong sunlight. Small vessels and floating debris often go unnoticed during nighttime navigation or in dense fog, while heavy rain can obscure even large objects. These limitations underscore the need for automated perception systems that provide reliable detection across diverse and adverse conditions. Consequently, maritime safety increasingly relies on computer vision technologies capable of automatically detecting and segmenting obstacles, regardless of visibility or weather.
Over the past decade, deep learning has transformed computer vision, enabling remarkable advances in image classification, semantic segmentation, and object detection. Convolutional neural networks (CNNs) such as U-Net [2] and, more recently, transformer-based models [3,4] have become the state of the art across domains ranging from autonomous driving to medical imaging [5]. However, models trained solely on RGB imagery often fail under poor illumination or adverse weather, making them insufficient for safety-critical maritime applications.
Multimodal fusion is increasingly vital in safety-critical domains such as autonomous driving, surveillance, and maritime robotics, where combining complementary sensor modalities enhances robustness under adverse conditions. RGB–thermal (RGBT) fusion leverages the strengths of both modalities: thermal cameras, operating in the long-wave infrared (LWIR) spectrum (8–12 µm), detect heat signatures regardless of lighting, while RGB sensors provide rich spatial and color detail [6]. Prior works in autonomous driving and surveillance have demonstrated the benefits of RGBT fusion.
For instance, in autonomous driving, RGBT fusion has been shown to enhance pedestrian detection in low-light environments [7], as well as enhancing perception during nighttime and in adverse weather conditions [8]. Similarly, in surveillance, RGBT fusion improved intruder recognition and activity monitoring [9]. Recently, Ying et al. introduced RGBT-Tiny, a large-scale benchmark for visible-thermal tiny object detection [10]. While their focus was on specialized object detection algorithms, our work emphasizes semantic segmentation as the primary task, with object detection only considered as a secondary post-processing step. Similarly, Zhou et al. presented MaDiNet, a Gamma diffusion-based approach for SAR target detection using RGB and radar modalities [11]. Although the sensing modalities differ from our RGBT set-up, this work highlights the growing interest in multimodal fusion for safety-critical environments. These insights underscore the importance of robust multimodal research for small object detection, which is directly relevant to maritime collision avoidance, characterized by small, distant targets.
Despite its success in other domains, RGBT fusion remains underexplored in maritime environments, where variable lighting, reflective surfaces, and limited annotated data pose unique challenges. In maritime perception, recent datasets such as those in [12,13] have enabled benchmarking of multimodal fusion approaches using RGB, thermal, radar, and AIS data. However, systematic evaluation of fusion architectures tailored to maritime conditions—characterized by small, distant objects and dynamic backgrounds—remains limited.
Within computer vision, semantic segmentation and object detection are key tasks enabling detailed scene understanding. A pivotal milestone in RGBT segmentation research was MFNet, which introduced a benchmark dataset of aligned street scenes and a symmetric double-encoder fusion architecture [8]. Subsequent works, such as RTFNet [14] and FuseSeg [15], refined feature fusion strategies, while GMNet introduced graded shallow and deep fusion modules with multiscale supervision [16]. These architectures share many features with other multimodal domains, such as RGB and depth (RGBD). The RGBD architectures follow a similar double-encoder paradigm, which can be seen in FuseNet [17] or ACNet [18]. More recently, attention-based models such as SA-Gate [19] and transformer-based approaches like CMX [20] have pushed the boundaries of accuracy, with CMX achieving state-of-the-art results across multiple modality pairs. At the same time, lightweight architectures with efficient encoders and separable convolutions [21] have emerged to address real-time deployment needs on resource-constrained platforms.
Despite these advances, several challenges remain. First, most multimodal fusion methods have been developed for autonomous driving or indoor robotics and have rarely been evaluated in maritime settings, where object scales, environmental variability, and sensor perspectives differ substantially. Second, a persistent trade-off exists between accuracy and deployability; transformer-based approaches deliver excellent accuracy but incur high computational costs, making them impractical for edge devices commonly used onboard vessels. This trade-off is particularly evident when detecting small objects such as buoys or wooden debris. Transformer-based models like CMX achieve superior accuracy in these cases due to their global attention mechanisms, which capture long-range dependencies and subtle thermal cues. However, their computational demands limit real-time deployment. In contrast, lightweight CNN architectures offer competitive performance for small-object detection while maintaining low latency and minimal parameter counts, making them suitable for resource-constrained onboard systems. This balance remains insufficiently benchmarked in maritime contexts. Finally, while multimodal datasets for driving and surveillance exist (e.g., PST900 [22]), aligned RGBT maritime datasets suitable for segmentation and detection have only recently become available [23].
To address these gaps, our work focuses on challenges unique to maritime perception and builds upon prior advances in multimodal fusion. Building on our previous work that created the RGBT maritime dataset [23], we expanded its size and diversity and improved its alignment quality to support robust benchmarking. Using this enhanced dataset, we trained and evaluated a diverse set of models, including CNN baselines, transformer-based networks, and novel fusion designs. Central to this study is the introduction of WNet, a modular double-encoder architecture that allows systematic benchmarking of different encoder and decoder configurations as well as alternative fusion strategies. This modularity enabled the development of hybrid variants such as WNet-FFM and WNet-CMX, combining elements from state-of-the-art fusion paradigms. Models were assessed not only for segmentation accuracy but also for computational efficiency and inference speed, providing practical insights into optimal design choices for real-time deployment in embedded maritime systems.
To further close these gaps in maritime perception, this paper investigates how multimodal fusion architectures can be optimized to balance accuracy, robustness, and computational efficiency under real-world conditions. Specifically, our contributions are as follows:
  • Comprehensive Benchmarking: We present a systematic evaluation of multimodal (RGBT) early fusion architectures for maritime semantic segmentation and object detection. The benchmark spans transformer-based, attention-based, and lightweight CNN models, all retrained on a rigorously aligned RGBT maritime dataset.
  • Modular Fusion Architecture: We introduce the WNet architecture, a flexible fusion framework that enables interchangeable encoders, fusion modules, and decoders. This modular design supports systematic exploration of architecture trade-offs and achieves a superior performance-to-cost ratio, making it suitable for edge deployment.
  • Performance Efficiency Analysis: We provide practical insights into fusion depth, model complexity, and real-time deployability through a detailed analysis of the inference speed and parameter count. This helps identify architectures that are both accurate and resource-efficient.
Finally, we unveil WNet, an open-source multimodal fusion framework that promotes reproducibility and transparent benchmarking. The framework supports modular experimentation and is available at https://github.com/SEA-AI/beyond-RGB (accessed on 28 November 2025).

2. Materials and Methods

2.1. Dataset

The dataset used in this study consists of precisely aligned RGBT image pairs based on a recently published pixel-aligned maritime panoptic dataset collected in open-water scenarios using SEA.AI devices capable of streaming both modalities. Alignment was performed as described in [23] using a homography-based pipeline built on machine learning-driven correspondence search, enabling accurate annotation propagation and a unified ground truth for masks and bounding boxes.
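To make the annotation propagation step concrete, the following minimal sketch (using OpenCV) shows how an estimated homography can warp thermal-frame masks and polygon vertices into the RGB frame. The matched keypoints are assumed to come from the upstream learned correspondence search; this is an illustration under those assumptions, not the exact published pipeline of [23].

```python
import cv2
import numpy as np

def propagate_annotation(thermal_pts, rgb_pts, thermal_mask, polygon, rgb_shape):
    """Warp thermal-frame annotations into the RGB frame via a homography.

    thermal_pts / rgb_pts: (N, 2) float32 arrays of matched keypoints, assumed
    to come from the upstream learned correspondence search.
    thermal_mask: uint8 instance mask drawn on the thermal image.
    polygon: (M, 2) float32 vertices of one annotation in the thermal frame.
    rgb_shape: (height, width) of the RGB image.
    """
    # Robustly estimate the thermal-to-RGB homography from the matches.
    H, _ = cv2.findHomography(thermal_pts, rgb_pts, cv2.RANSAC, 3.0)

    # Dense masks are warped so they overlay the RGB image pixel for pixel.
    h, w = rgb_shape
    warped_mask = cv2.warpPerspective(thermal_mask, H, (w, h), flags=cv2.INTER_NEAREST)

    # Sparse annotation vertices (polygons, box corners) are propagated with
    # the same homography; cv2.perspectiveTransform expects shape (M, 1, 2).
    warped_polygon = cv2.perspectiveTransform(polygon.reshape(-1, 1, 2), H).reshape(-1, 2)
    return warped_mask, warped_polygon
```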
A key advantage of this dataset is its panoptic nature, which combines semantic and instance-level annotations to support benchmarking of both segmentation and object detection tasks within a single framework. This ensures a consistent evaluation of multimodal fusion architectures in complementary perception objectives.
The collection spans diverse geographic regions throughout the Atlantic Ocean (e.g., Greenland, the Caribbean, La Manche, Gibraltar, and the Bay of Biscay), varying illumination (daytime and nighttime), and multiple weather conditions, including clear skies, fog, rain, and overcast scenarios. These conditions significantly impact RGB image quality—fog and rain reduce contrast, and nighttime introduces severe illumination challenges—while thermal imagery remains largely unaffected. Representative examples of these scenarios are shown in Figure 1. The dataset also covers a range of maritime environments, from open ocean to harbors and near-shore scenes, with variable traffic density and background complexity. It includes 18 annotated object categories, such as (1) vessel classes (e.g., sailboats and cargo ships), (2) navigation objects (e.g., buoys), and (3) other floating objects (e.g., containers and wooden debris). Following the original protocol, augmented training data were generated to reach approximately 6000 training and 2000 validation samples using semi-automated annotation. A statistical summary of the dataset is provided in Table 1.

2.2. Architectures and Fusion Strategies

This work explores a range of early stage fusion architectures for multimodal semantic segmentation and object detection, building on both well-established convolutional neural network (CNN) models and more recent transformer-based approaches.
Early fusion strategies have been widely adopted in multimodal research because they allow for feature representations from RGB and thermal (or, for example, depth) modalities to be combined at multiple stages of the network, rather than relying solely on late decision-level fusion. This is particularly beneficial in maritime perception, where small objects must be detected under challenging conditions, and complementary cues from both modalities can improve feature robustness.
In general, three principal fusion paradigms are used in multimodal computer vision: early fusion, late fusion, and proposal- or decision-level fusion. Late fusion combines the outputs of independent unimodal networks (typically by merging or weighting their detection or segmentation outputs, where the weighting can even be learned [24]), which offers modularity and robustness to sensor failure but limits fine-grained feature interaction [25]. Proposal-level fusion, commonly employed in multimodal detection frameworks such as AVOD [26] or MMF [27], merges modality-specific region proposals, improving object-level confidence estimation while remaining limited in pixel-level precision. By contrast, early fusion enables rich cross-modal feature interactions throughout the encoder-decoder hierarchy, which is particularly advantageous for dense prediction tasks such as semantic segmentation or maritime obstacle detection. This justifies our focus on early fusion architectures in this study, as they best capture the complementary nature of RGB and thermal information at multiple feature scales.
A recurring design pattern in multimodal segmentation research is the double-encoder architecture. Consequently, most existing RGBT segmentation models can be abstracted into a generalized form, which is illustrated schematically in Figure 2. In this framework, RGB and thermal inputs are processed by separate encoders, whose feature representations are subsequently combined through a multilevel fusion strategy (denoted as the F block). Similar to the original single-encoder U-Net segmentation architecture [2], skip connections propagate fused encoder features to the decoder at each depth level, facilitating fine-grained spatial reconstruction and preserving detailed information across scales.
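The following minimal PyTorch sketch illustrates this generalized double-encoder scheme. The channel widths, the simple 1 × 1 convolution standing in for the fusion block F, and the bilinear decoder are illustrative assumptions rather than the exact WNet implementation:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with BatchNorm and ReLU (illustrative encoder stage)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class DoubleEncoderNet(nn.Module):
    """Generalized double-encoder fusion network (cf. Figure 2).

    RGB and thermal streams are encoded separately, fused at every depth level
    (here by a 1x1-conv fusion block standing in for "F"), and the fused
    features feed the decoder through skip connections.
    """
    def __init__(self, channels=(32, 64, 64, 128), n_classes=1):
        super().__init__()
        chans = list(channels)
        self.rgb_enc, self.thr_enc, self.fuse = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        c_rgb, c_thr = 3, 1
        for c in chans:
            self.rgb_enc.append(ConvBlock(c_rgb, c))
            self.thr_enc.append(ConvBlock(c_thr, c))
            self.fuse.append(nn.Conv2d(2 * c, c, 1))  # the interchangeable F block
            c_rgb = c_thr = c
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = nn.ModuleList(
            ConvBlock(chans[i] + chans[i + 1], chans[i]) for i in range(len(chans) - 1)
        )
        self.head = nn.Conv2d(chans[0], n_classes, 1)

    def forward(self, rgb, thermal):
        skips, x_rgb, x_thr = [], rgb, thermal
        for i, (er, et, f) in enumerate(zip(self.rgb_enc, self.thr_enc, self.fuse)):
            x_rgb, x_thr = er(x_rgb), et(x_thr)
            skips.append(f(torch.cat([x_rgb, x_thr], dim=1)))  # fused skip feature
            if i < len(self.rgb_enc) - 1:
                x_rgb, x_thr = self.pool(x_rgb), self.pool(x_thr)
        x = skips[-1]
        for i in reversed(range(len(self.dec))):
            x = self.up(x)
            x = self.dec[i](torch.cat([skips[i], x], dim=1))
        return self.head(x)  # per-pixel logits
```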
This general formulation is highly flexible; the encoders may be based on standard CNN backbones such as ResNet [28], DenseNet [29], or even visual transformers [3,4]. Likewise, the decoder can vary in depth and resolution, while the specific contribution of each architecture typically lies in its chosen fusion strategy. For example, GMNet [16] employs multi-level supervision, CAINet [30] integrates context-aware interactions, and transformer-based CMX [20] introduces cross-modal attention blocks. Thus, the schematic provides a unifying framework for comparing both classical and state-of-the-art RGBT fusion architectures. We benchmarked a diverse set of multimodal architectures, covering both convolutional and transformer-based approaches.
The architectures in the published framework, which were trained and benchmarked on the maritime dataset, include the following:
  • Single-modality baselines: RGB-only and thermal-only U-Net models are used to quantify the contribution of each modality individually. Additionally, U-Net variants with separable convolutions, such as those proposed in MobileNet [31], were benchmarked, as the ultimate goal is deployment on edge devices.
  • Early fusion baseline: A four-channel U-Net (RGB + thermal concatenated at the input) served as the simplest early fusion strategy; a minimal sketch of this input-level concatenation is shown after this list.
  • Established RGBT double-encoder fusion networks: (1) RTFNet [14], one of the first architectures to introduce a multilevel fusion strategy for RGB and thermal images, performs simple addition-based one-way feature propagation between modalities. (2) GMNet [16] integrates both shallow and deep feature fusion, depending on the encoder depth. It also employs multiple training losses, including boundary loss, binary mask loss, and semantic mask loss.
  • Attention-based fusion: SA-Gate [19], which has shown promising results in the RGB-D domain, was benchmarked here as well. It employs bidirectional feature propagation with attention gates to selectively combine features from both modalities.
  • Transformer-based fusion: CMX [20], which leverages cross-modal attention mechanisms to fuse features effectively across modalities, was evaluated together with its SegFormer MiT backbone [4].
  • Proposed modular network: WNet corresponds to the general schematic shown in Figure 2. It was implemented as a double-encoder U-Net baseline, but it was designed modularly to allow configuration of the encoder (e.g., SegFormer or ResNet), fusion strategy, decoder structure, and depth. This modular design enables systematic experiments across numerous architecture variants, allowing for a balance between feature interaction, depth, and computational cost.
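Returning to the early fusion baseline listed above, the simplest variant reduces to a channel-wise concatenation in front of an otherwise unchanged single-encoder U-Net. The sketch below illustrates this; the tensor shapes follow the dataset resolution, and the 32-channel first convolution is an assumption:

```python
import torch
import torch.nn as nn

# Minimal sketch of the four-channel early fusion baseline (assumed set-up):
# RGB (3 channels) and thermal (1 channel) are concatenated at the input and
# passed to a standard single-encoder U-Net whose first convolution simply
# accepts four channels.
rgb = torch.rand(1, 3, 512, 640)       # batch of aligned RGB frames
thermal = torch.rand(1, 1, 512, 640)   # pixel-aligned thermal frames
x = torch.cat([rgb, thermal], dim=1)   # -> shape (1, 4, 512, 640)

first_conv = nn.Conv2d(in_channels=4, out_channels=32, kernel_size=3, padding=1)
features = first_conv(x)               # the rest of the U-Net is unchanged
```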

2.3. Training Set-Up

All experiments were implemented in PyTorch (2.2.2+cu121) and executed on an NVIDIA RTX 4070 Ti GPU (Nvidia Corporation, Santa Clara, CA, USA). The models were trained for 60 epochs with a batch size of 6, being constrained by the GPU memory. The Adam optimizer was used with an initial learning rate of 5 × 10⁻⁴, which decayed stepwise at epochs 40 and 50 to 1 × 10⁻⁴ and 5 × 10⁻⁵, respectively.
The training/validation split was performed with entire trips (images collected throughout one voyage) rather than individual frames to avoid data leakage. Data augmentation was applied using the Albumentations library [32], including horizontal flips and affine transformations.
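A minimal Albumentations sketch of this augmentation set-up is given below; the specific transform parameters are assumptions, and `rgb_image`, `thermal_image`, and `label_mask` are placeholder NumPy arrays. The `additional_targets` mechanism applies the same spatial transform to the thermal image as to the RGB image, preserving pixel alignment across modalities:

```python
import albumentations as A

# Assumed augmentation pipeline: horizontal flips and mild affine transforms,
# applied identically to the RGB image, the thermal image, and the mask.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.Affine(scale=(0.9, 1.1), translate_percent=0.05, rotate=(-5, 5), p=0.5),
    ],
    additional_targets={"thermal": "image"},  # thermal gets the same spatial transform
)

augmented = transform(image=rgb_image, thermal=thermal_image, mask=label_mask)
rgb_aug, thermal_aug, mask_aug = augmented["image"], augmented["thermal"], augmented["mask"]
```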
The loss functions varied depending on the architecture. Binary cross-entropy with logits was used for most models due to its numerical stability compared with plain BCE. For the selected networks, Lovász-Softmax [33] and Lovász hinge [34] losses were retained in order to be consistent with their original publications.
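Putting these pieces together, a minimal training-loop sketch under the stated schedule could look as follows. The model, data loader, and LambdaLR-based decay are illustrative assumptions, not the exact published code:

```python
import torch
import torch.nn as nn

# Sketch of the training wiring: Adam at 5e-4, dropped to 1e-4 at epoch 40 and
# to 5e-5 at epoch 50, with BCE-with-logits as the default loss.
model = DoubleEncoderNet()                      # e.g., the illustrative sketch from Section 2.2
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: 1.0 if epoch < 40 else (0.2 if epoch < 50 else 0.1),
)
criterion = nn.BCEWithLogitsLoss()              # numerically stable binary loss

for epoch in range(60):
    for rgb, thermal, mask in train_loader:     # assumed DataLoader of aligned pairs
        optimizer.zero_grad()
        logits = model(rgb, thermal)
        loss = criterion(logits, mask.float())  # mask assumed shaped like the logits
        loss.backward()
        optimizer.step()
    scheduler.step()                            # apply the stepwise decay once per epoch
```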

2.4. Evaluation Metrics

Performance was evaluated using established metrics for semantic segmentation and object detection, as both pixel-level accuracy and detection reliability are critical in maritime perception tasks. All evaluations were conducted in a class-agnostic manner, since the primary objective was robust detection and accurate object masks, while class differentiation played a secondary role. The following metrics were used:
  • Segmentation metrics: (1) The intersection over union (IoU) [35] measures the overlap between the predicted and ground truth masks. (2) The Dice similarity coefficient (equivalent to the F1 score) [36] emphasizes the balance between precision and recall. (3) The Matthews correlation coefficient [37] provides a balanced evaluation, even under class imbalance. The binarization threshold was selected on the validation set to maximize the mean IoU.
  • Detection: Bounding boxes were derived from segmentation masks and evaluated using precision, recall, and the F1 score [38] (a minimal sketch of this mask-to-box conversion and matching follows this list). True positives, false positives, and false negatives were determined using an IoU threshold of 0.5 between the predicted and ground truth boxes [35]. This threshold aligns with widely adopted standards (e.g., PASCAL VOC and COCO) and offers a practical balance between localization precision and recall. Higher thresholds would disproportionately penalize small object detections, which are critical in maritime scenarios. To further analyze robustness, the results were stratified by object size into four categories: small, medium, large, and valid. The valid category included all objects larger than 1.75 × 1.75 px, sharing its lower bound with the small category, while the thresholds for the medium and large categories were set at 8 × 8 px and 20 × 20 px, respectively. This stratification highlights performance differences across scales, which is particularly relevant for small and distant objects that are common in maritime environments.
  • Efficiency: The inference time (milliseconds per frame) and model parameter count were measured to quantify the trade-off between accuracy and practical deployability on resource-constrained edge devices. These metrics provide insight into whether an architecture can support real-time operation onboard vessels without exceeding hardware limitations.
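As noted in the detection bullet above, bounding boxes were obtained from segmentation masks and matched at an IoU threshold of 0.5. One plausible implementation of this mask-to-box evaluation is sketched below; the connected-component extraction and greedy matching are assumptions rather than the authors' exact procedure:

```python
import cv2
import numpy as np

def masks_to_boxes(binary_mask):
    """Extract axis-aligned bounding boxes (x, y, w, h) from a binary mask."""
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary_mask.astype(np.uint8))
    return [tuple(stats[i, :4]) for i in range(1, n)]  # skip label 0 (background)

def box_iou(a, b):
    """IoU of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / float(aw * ah + bw * bh - inter) if inter else 0.0

def detection_f1(pred_boxes, gt_boxes, iou_thr=0.5):
    """Greedy one-to-one matching at the IoU threshold used in the paper."""
    matched_gt, tp = set(), 0
    for p in pred_boxes:
        best, best_iou = None, iou_thr
        for j, g in enumerate(gt_boxes):
            iou = box_iou(p, g)
            if j not in matched_gt and iou >= best_iou:
                best, best_iou = j, iou
        if best is not None:
            matched_gt.add(best)
            tp += 1
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```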

3. Results

3.1. Semantic Segmentation

We evaluated the model variants described in the previous section, and the corresponding results are summarized in Table 2. The confidence thresholds for segmentation mask binarization were determined on a validation set by maximizing the mean intersection over union (mIoU). The selected threshold for each model is reported in Table 2.
The benchmarked architectures included double-encoder models, single-encoder models with four-channel input (denoted by the suffix -4i), and single-encoder models using only thermal (THR in the table) or RGB input. When a ResNet backbone was employed, the suffix R18 or R34 indicates the specific version of the ResNet encoder used. For the transformer-based CMX architectures, the suffixes b0 and b2 denote the transformer backbones SegFormer MiT-b0 and MiT-b2, respectively. The default U-Net and WNet configurations use encoder channel dimensions of 32, 64, 64, 128, and 256 to keep the parameter count low. UNetS is a slightly simplified variant that uses bilinear interpolation instead of learned convolutional upsampling in the final stage.
Additionally, we included several hybrid architectures that combine elements of WNet and CMX; WNet-MLP adopts CMX’s multilayer perceptron (MLP) decoder, WNet-FFM incorporates CMX’s feature fusion module, and WNet-CMX integrates both components.
Multimodal fusion consistently outperformed the single-modality baselines across the IoU, Dice, and Matthews coefficient values. The RGB-only and thermal-only U-Net variants achieved limited performance, with the thermal-only models performing more robustly.
Transformer-based CMX achieved the highest values for all metrics across the dataset, confirming the strong capacity of transformer-based feature aggregation. However, this came at a significant computational cost. Lightweight networks such as WNet-S or SAGate-R18 reached competitive IoU values while maintaining far lower parameter counts, demonstrating the potential of efficiency-focused designs.
To complement these quantitative results, Figure 3 presents the representative segmentation outputs for three architectures across different environmental conditions: clear daytime, nighttime, and overcast scenes. Transformer-based CMX-b2 achieved the most accurate segmentation, closely matching the ground truth even in challenging nighttime scenarios. WNet-FFM demonstrated competitive performance with fewer parameters, successfully detecting small objects in low-contrast conditions. In contrast, the thermal-only UNet-S-THR baseline model struggled with spatial detail and occasionally missed distant objects, such as two light-emitting targets in the nighttime example that were visible only in the RGB image, underscoring the importance of RGBT fusion for robust maritime perception. These qualitative examples reinforce the quantitative findings in Table 2, where multimodal fusion consistently outperformed the single-modality baselines, particularly for small-object detection under adverse conditions.

3.2. Object Detection

The detection results, derived from the segmentation masks, mirrored the segmentation findings, as shown in Table 3. We did not compare against dedicated detection architectures, as this work focuses on segmentation; the detection metrics simply provide a complementary view of model performance.
Across all evaluated object size categories, CMX consistently achieved the highest detection F1 scores, confirming its strong performance in multimodal feature integration. The CMX-b2 variant led the benchmark tests, with superior results in the valid, small, and large object categories, reflecting the benefits of transformer-based cross-modal attention for robust object localization. WNet-S and SA-Gate [19] followed closely, delivering competitive accuracy with significantly lower computational complexity. These findings reinforce the hypothesis that multimodal fusion architectures outperform single-modality baselines, highlighting the advantage of combining RGB and thermal modalities for comprehensive detection in challenging maritime environments.

3.3. Efficiency Analysis

To evaluate the trade-off between accuracy and deployability, we analyzed the inference time and model size alongside detection performance. Figure 4 summarizes these relationships across three subplots.
  • (a) Inference time vs. F1 detection score: This plot illustrates the balance between computational speed and detection accuracy. The transformer-based CMX-b2 achieved the highest F1 score but at the cost of significantly longer inference times, making it less suitable for real-time edge deployment. In contrast, lightweight architectures such as WNet-S and SA-Gate delivered competitive F1 scores at a fraction of the inference time (below 3 ms per frame for WNet-S), highlighting their suitability for embedded systems.
  • (b) Inference time vs. false positives: This subplot shows how efficiency correlates with detection reliability. The models with longer inference times (e.g., CMX-b2) generally exhibited fewer false positives, while faster models like WNet-S maintained a reasonable balance, suggesting that efficiency-focused designs do not necessarily compromise robustness.
  • (c) Inference time vs. Matthews correlation coefficient: This metric provides a holistic view of the binary classification quality. CMX-b2 again led in overall performance, but WNet-S and SA-Gate achieved strong Matthews coefficients with minimal computational overhead, reinforcing their practicality for real-time maritime applications.
Overall, these results confirm that while transformer-based models offer state-of-the-art accuracy, lightweight CNN-based architectures provide an optimal compromise between performance and efficiency, making them ideal for deployment on resource-constrained platforms.

4. Discussion

The results presented in this work provide clear evidence of the benefits of multimodal fusion for object detection and semantic segmentation in maritime environments. Across all evaluated architectures, introducing thermal and RGB modalities jointly led to consistent performance gains compared with single-modality models. Even the simple early fusion variants, such as four-channel U-Net models, outperformed their single-input counterparts. However, the best results were achieved with double-encoder fusion networks, confirming that modality-specific feature extraction followed by learned fusion enables more robust multimodal representations.
These findings align with prior work in terrestrial RGBT segmentation [8,14,16,19,20], where multimodal architectures consistently outperformed the unimodal baselines. The present results extend these observations to the maritime domain, which is characterized by high scene variability, reflections, and frequent low-contrast conditions. The improvement due to multimodal fusion highlights that thermal imagery compensates for illumination-related weaknesses of RGB, while RGB contributes spatial and color information absent in thermal channels.
Among the tested models, transformer-based CMX [20] achieved the highest overall segmentation and detection accuracy, confirming (1) the strong capability of cross-modal attention mechanisms to integrate features effectively and (2) transformers’ capability of nuanced feature extraction. However, this came at a considerable computational cost, as CMX exhibited the slowest inference times and highest parameter count. Its smaller variant, CMX-b0, demonstrated a more favorable balance between accuracy and efficiency, suggesting that transformer fusion can scale down effectively while retaining competitive performance.
In contrast, lightweight architectures such as WNet-S and SA-Gate provided strong alternatives for edge deployment. Both achieved high IoU values and F1 scores while maintaining significantly lower inference times. The proposed modular WNet implementation demonstrated that the general double-encoder U-Net scheme is flexible enough to accommodate different backbones, fusion modules, and decoder configurations while maintaining high performance. The results further indicate that the model complexity and parameter count are not always reliable predictors of performance. For instance, larger architectures such as GMNet [16] and RTFNet [14] did not outperform simpler designs like WNet-S in either the segmentation or detection tasks.
Similar conclusions can be drawn from the object detection results, which were derived by extracting bounding boxes from the segmentation masks. The models leveraging both modalities consistently achieved higher precision and recall values. False positives and false negatives were concentrated primarily in the small object category, which remains a persistent challenge in maritime vision tasks.
Beyond maritime applications, the insights from this study can be generalized to other safety-critical domains where multimodal fusion is essential, such as autonomous driving, intelligent video surveillance, and industrial robotics. These fields share common challenges, including variable illumination, adverse weather, and the need for real-time inference on constrained hardware platforms. The modular fusion strategies proposed here, along with the accompanying performance-efficiency analysis, offer a transferable foundation for designing deployable multimodal perception systems in these domains.
For example, early fusion techniques that improve low-light visibility in thermal RGB maritime imagery can similarly improve pedestrian detection in nighttime driving or intruder recognition in infrared surveillance systems. By demonstrating how lightweight fusion mechanisms and dual-encoder architectures balance accuracy with computational cost, this work contributes generalizable design principles for sensor fusion. These principles are adaptable to the differing technical constraints of maritime, automotive, and surveillance platforms—such as sensor placement, latency requirements, and object scale variability—making the proposed framework broadly applicable.
In the maritime context specifically, improved multimodal perception directly supports collision avoidance by enhancing detection reliability under poor visibility, fog, or nighttime conditions, situations where radar or an AIS often underperform. Integrating RGBT fusion models into onboard vision systems can therefore significantly strengthen situational awareness, enabling earlier and more dependable identification of nearby vessels and floating obstacles and ultimately reducing the risk of maritime accidents.
Limitations and Future Work. Despite these advances, several limitations remain, and thus there is significant potential for future research. First, this study focused solely on RGB and thermal modalities without considering other potential sources of information such as radar or LiDAR. Second, all object detection results were derived from segmentation masks rather than using dedicated detection networks such as YOLO [39] or Mask R-CNN [40], since our evaluation focused on segmentation-derived bounding boxes to maintain consistency across architectures. In future work, we plan to train and evaluate multimodal object detection architectures, such as YOLO and Mask R-CNN variants, to directly compare their performance with segmentation-derived detection and assess the impact of multi-modality on end-to-end detection accuracy.
Additionally, while the inference time and model size were measured on a standard GPU set-up, no hardware-optimized or compiled inference benchmarking (e.g., TensorRT or ONNX graph optimization) was performed in this study. Such optimizations can significantly accelerate inference by leveraging graph-level transformations, kernel fusion, and precision reduction (e.g., FP16 or INT8 quantization). Beyond TensorRT and ONNX, other techniques such as pruning, weight clustering, and mixed-precision training are widely used to reduce latency and the memory footprint without substantial accuracy loss. Architectural simplifications also offer potential gains. For example, Su et al. proposed the Pixel Difference Network (PiDiNet), which leverages pixel difference convolutions to capture high-order local information while maintaining low computational costs. Future work will include a systematic evaluation of these deployment-oriented optimizations to quantify their impact on the speed-accuracy trade-off, particularly for transformer-based architectures, where the computational overhead is most pronounced. This analysis will help determine whether lightweight CNN models remain the most practical choice after optimization or whether transformer models can achieve real-time performance under hardware-specific acceleration.
Another finding was that precise RGBT alignment is critical for effective multimodal fusion, particularly for early stage architectures that rely on pixel-level correspondence. While we did not conduct a formal sensitivity analysis in this study, initial experiments using unaligned image pairs showed no improvement over single-modality baselines, indicating that misalignment can negate the benefits of fusion. Future work will include a systematic evaluation of alignment errors and their impact on segmentation and detection performance, with a focus on small objects, where pixel-level accuracy is most influential. We also plan to provide visual qualitative comparisons and integrate recently published architectures to strengthen benchmarking.
Finally, while the dataset covers a wide range of marine conditions, it is still limited in size compared with large-scale terrestrial benchmarks such as COCO [41], and studying environmental effects such as day-versus-nighttime conditions or varying weather scenarios on model robustness would be a valuable direction for future work.
Generalization to Other Domains. While this study focused on maritime perception, the proposed modular fusion framework and benchmarking methodology have strong potential for adaptation to other safety-critical domains such as autonomous driving, aerial surveillance, and industrial robotics. These domains share similar challenges, including adverse weather, low visibility, and hardware constraints. Future research will explore how the insights gained here—particularly regarding lightweight architectures and multimodal fusion strategies—can be transferred and validated across these applications.

5. Conclusions

This study presents a comprehensive benchmark of early stage RGB–thermal fusion architectures tailored for maritime perception, addressing the challenges of object detection and semantic segmentation under adverse conditions. By leveraging a rigorously aligned multimodal dataset and evaluating a diverse set of architectures, including transformer-based, attention-driven, and lightweight CNN models, we demonstrated that multimodal fusion significantly enhances detection robustness and segmentation accuracy compared with single-modality baselines. For example, transformer-based CMX-b2 achieved the highest segmentation IoU value (36.67%) and detection F1 score (81.94%) but required 11.68 ms per frame and 66.56 M parameters, highlighting its computational cost. In contrast, lightweight designs such as WNet-S delivered a competitive IoU value (33.13%) and F1 score (70.75%) with only 0.62 M parameters and 2.52 ms for the inference time, offering a practical solution for real-time edge deployment. The proposed modular fusion framework further enables flexible experimentation and rapid prototyping, contributing a valuable tool for the research community.
Beyond maritime applications, the insights gained here can be generalized to other safety-critical domains such as autonomous driving and surveillance, where multimodal perception under hardware constraints is essential. By linking performance metrics to practical deployment considerations, this work establishes design guidelines for balancing accuracy and efficiency and provides a framework to accelerate future research in deployable multimodal vision systems.

Author Contributions

Conceptualization, O.K. and D.M.; methodology, O.K. and D.M.; software, O.K.; validation, O.K. and D.M.; formal analysis, O.K.; investigation, O.K.; resources, D.M.; data curation, O.K.; writing—original draft preparation, O.K.; writing—review and editing, C.R.; visualization, O.K.; supervision, D.M.; project administration, C.R.; funding acquisition, C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Austrian Research Promotion Agency (FFG) with the research projects “Unsupervised AI” (#915995) and “MS-ParadiseAI” (#921378).

Data Availability Statement

The full dataset used in this study is not publicly available due to company confidentiality constraints. However, the previously published subset of the data, which is also used in this work, is publicly accessible at https://huggingface.co/datasets/SEA-AI/SEANet (accessed on 28 November 2025).

Acknowledgments

AI and AI-assisted tools were used for language and grammar editing, as well as for clearer formulation of some sentences or paragraphs.

Conflicts of Interest

Authors Ondrej Kafka, Christian Rankl and David Moser are employed by the company SEA.AI GmbH. The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. International Maritime Organization. Guidelines for the Onboard operational Use of Shipborne Automatic Identification Systems (AIS), 2nd ed.; Number A.1106(29) in IMO Resolutions; IMO: London, UK, 2014. [Google Scholar]
  2. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015, Proceedings, Part III 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  3. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  4. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  5. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef]
  6. Brenner, M.; Reyes, N.H.; Susnjak, T.; Barczak, A.L. Rgb-d and thermal sensor fusion: A systematic literature review. IEEE Access 2023, 11, 82410–82442. [Google Scholar] [CrossRef]
  7. El Ahmar, W.; Massoud, Y.; Kolhatkar, D.; AlGhamdi, H.; Alja’Afreh, M.; Laganiere, R.; Hammoud, R. Enhanced Thermal-RGB Fusion for Robust Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; pp. 365–374. [Google Scholar] [CrossRef]
  8. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: New York, NY, USA, 2017; pp. 5108–5115. [Google Scholar]
  9. Yang, Z.; Li, Y.; Tang, X.; Xie, M. MGFusion: A multimodal large language model-guided information perception for infrared and visible image fusion. Front. Neurorobotics 2024, 18, 1521603. [Google Scholar] [CrossRef] [PubMed]
  10. Ying, X.; Xiao, C.; An, W.; Li, R.; He, X.; Li, B.; Cao, X.; Li, Z.; Wang, Y.; Hu, M.; et al. Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6088–6096. [Google Scholar] [CrossRef]
  11. Zhou, J.; Liu, Y.; Peng, B.; Liu, L.; Li, X. MaDiNet: Mamba Diffusion Network for SAR Target Detection. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 10787–10800. [Google Scholar] [CrossRef]
  12. Guo, Y.; Liu, R.W.; Qu, J.; Lu, Y.; Zhu, F.; Lv, Y. Asynchronous trajectory matching-based multimodal maritime data fusion for vessel traffic surveillance in inland waterways. IEEE Trans. Intell. Transp. Syst. 2023, 24, 12779–12792. [Google Scholar] [CrossRef]
  13. Yao, S.; Guan, R.; Wu, Z.; Ni, Y.; Huang, Z.; Wen Liu, R.; Yue, Y.; Ding, W.; Gee Lim, E.; Seo, H.; et al. WaterScenes: A Multi-Task 4D Radar-Camera Fusion Dataset and Benchmarks for Autonomous Driving on Water Surfaces. IEEE Trans. Intell. Transp. Syst. 2024, 25, 16584–16598. [Google Scholar] [CrossRef]
  14. Sun, Y.; Zuo, W.; Liu, M. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. IEEE Robot. Autom. Lett. 2019, 4, 2576–2583. [Google Scholar] [CrossRef]
  15. Sun, Y.; Zuo, W.; Yun, P.; Wang, H.; Liu, M. FuseSeg: Semantic segmentation of urban scenes based on RGB and thermal data fusion. IEEE Trans. Autom. Sci. Eng. 2020, 18, 1000–1011. [Google Scholar] [CrossRef]
  16. Zhou, W.; Liu, J.; Lei, J.; Yu, L.; Hwang, J.N. GMNet: Graded-feature multilabel-learning network for RGB-thermal urban scene semantic segmentation. IEEE Trans. Image Process. 2021, 30, 7790–7802. [Google Scholar] [CrossRef]
  17. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. In Proceedings of the Asian Conference on Computer Vision (ACCV), Taipei, Taiwan, 20–24 November 2016; pp. 213–228. [Google Scholar]
  18. Hu, X.; Yang, K.; Fei, L.; Wang, K. Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; IEEE: New York, NY, USA, 2019; pp. 1440–1444. [Google Scholar]
  19. Chen, X.; Lin, K.Y.; Wang, J.; Wu, W.; Qian, C.; Li, H.; Zeng, G. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 561–577. [Google Scholar]
  20. Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694. [Google Scholar] [CrossRef]
  21. Zhou, W.; Zhang, H.; Yan, W.; Lin, W. MMSMCNet: Modal memory sharing and morphological complementary networks for RGB-T urban scene semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7096–7108. [Google Scholar] [CrossRef]
  22. Shivakumar, S.S.; Rodrigues, N.; Zhou, A.; Miller, I.D.; Kumar, V.; Taylor, C.J. Pst900: Rgb-thermal calibration, dataset and segmentation network. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Virtual, 31 May–31 August 2020; IEEE: New York, NY, USA, 2020; pp. 9441–9447. [Google Scholar]
  23. Kafka, O.; Kaufmann, J.; Rankl, C. SEANet: RGB and Thermal Maritime Panoptic Dataset and Intermodal Alignment Procedure. In Proceedings of the 2025 IEEE 9th Forum on Research and Technologies for Society and Industry (RTSI), Tunis, Tunisia, 24–26 August 2025; pp. 202–207. [Google Scholar] [CrossRef]
  24. Santos, C.E.; Bhanu, B. DyFusion: Dynamic IR/RGB Fusion for Maritime Vessel Recognition. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 1328–1332. [Google Scholar] [CrossRef]
  25. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  26. Ku, J.; Mozifian, M.; Lee, J.; Harakeh, A.; Waslander, S.L. Joint 3D Proposal Generation and Object Detection from View Aggregation. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: New York, NY, USA, 2018; pp. 1–8. [Google Scholar] [CrossRef]
  27. Liang, M.; Yang, B.; Chen, Y.; Hu, R.; Urtasun, R. Multi-Task Multi-Sensor Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7345–7353. [Google Scholar] [CrossRef]
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June–1 July 2016; pp. 770–778. [Google Scholar]
  29. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar] [CrossRef]
  30. Lv, Y.; Liu, Z.; Li, G. Context-Aware Interaction Network for RGB-T Semantic Segmentation. IEEE Trans. Multimed. 2024, 26, 6348–6360. [Google Scholar] [CrossRef]
  31. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  32. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and flexible image augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
  33. Berman, M.; Triki, A.R.; Blaschko, M.B. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4413–4421. [Google Scholar]
  34. Yu, J.; Blaschko, M. Learning submodular losses with the Lovász hinge. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1623–1631. [Google Scholar]
  35. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  36. Dice, L.R. Measures of the Amount of Ecologic Association Between Species. Ecology 1945, 26, 297–302. [Google Scholar] [CrossRef]
  37. Matthews, B.W. Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochim. Biophys. Acta (BBA) Protein Struct. 1975, 405, 442–451. [Google Scholar] [CrossRef]
  38. Powers, D.M.W. Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation. J. Mach. Learn. Technol. 2011, 2, 37–63. [Google Scholar]
  39. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June–1 July 2016; pp. 779–788. [Google Scholar]
  40. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  41. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V 13; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
Figure 1. Dataset visualization. (a) Representative samples of the multimodal dataset used in training for benchmarking in this publication. Two columns are shown, where each row consists of aligned RGB, thermal (rendered using the viridis colormap from Matplotlib 3.9.2, where dark purple indicates cooler regions and yellow indicates warmer regions), and panoptic ground truth images from left to right. The panoptic segmentation masks use distinct colors to represent different object instances and categories, where each instance is assigned a unique color for clear visual separation. Various locations and weather conditions (day, night, sunny, rainy, and overcast) are also shown. (b) Heatmap of data collection locations, where color intensity ranges from blue (low density) to red (high density), indicating the spatial distribution of captured sequences.
Figure 2. Generalized schematic of multimodal fusion architectures for maritime perception. The diagram illustrates the double-encoder design, where RGB and thermal inputs are processed by separate encoders, fused at multiple levels, and propagated through skip connections to the decoder. This flexible formulation supports semantic segmentation as the primary task, with object detection performed as a post-processing step on segmentation outputs. This framework (“WNet”) enables systematic benchmarking of different encoder backbones, fusion strategies, and decoder configurations. Architectures such as RTFNet, GMNet, SA-Gate, CMX, and our modular WNet variants (e.g., WNet-FFM or WNet-CMX) can all be abstracted into this framework.
Figure 3. Representative semantic segmentation results under diverse maritime conditions (top = daytime clear, middle = nighttime, bottom = overcast). Columns show RGB and thermal inputs, ground truth masks, and predictions from three architectures: CMX-b2 (transformer-based), WNet-FFM (hybrid fusion), and UNet-S-THR (thermal-only baseline). Panoptic segmentation results use distinct colors to represent different object instances and categories, where each instance is assigned a unique color for clear visual separation. The examples illustrate the advantage of multimodal fusion in detecting small and distant objects, particularly under low-light and adverse conditions. Notably, in the nighttime row, UNet-S-THR failed to detect two distant light-emitting objects visible in the RGB image, highlighting the importance of fusion with RGB.
Figure 4. Trade-offs between accuracy and efficiency across multimodal fusion architectures. (a) Inference time vs. F1 detection score (“valid” size category). (b) Inference time vs. detection false positive counts (“valid” size category). (c) Inference time vs. Matthews coefficient. These plots illustrate how the transformer-based models (e.g., CMX-b2) achieved the highest accuracy at the cost of longer inference times, while WNet-S offers competitive performance with minimal computational overhead, making it suitable for real-time maritime deployment.
Table 1. Summary statistics of the RGBT maritime panoptic dataset.
| Property | Count or Range | Description |
|---|---|---|
| Total image pairs | ∼8000 | 6000 training, 2000 validation |
| Image resolution | 640 × 512 px | Pixel-aligned RGB and thermal pairs |
| Geographic regions | 20+ | Throughout the Atlantic Ocean |
| Illumination conditions | 2 | Daytime or nighttime (≈80:20 split) |
| Weather conditions | 4 | Clear, foggy, rainy, overcast |
| Object categories | 18 | Vessels, buoys, containers, wooden logs, etc. |
| Sensor platform | SEA.AI embedded system | RGB + LWIR cameras |
Table 2. Segmentation benchmark results. The time is in milliseconds, the size is in millions of parameters, and the binarization threshold is the selected confidence threshold. Bold values highlight the best performance within each metric category (e.g., lowest inference time, smallest model size, highest IoU, Dice, and Matthews coefficients), indicating architectures that achieve optimal trade-offs between accuracy and efficiency. Models are grouped by sensor fusion approaches and U-Net-based baseline architectures, as indicated by the horizontal lines, to facilitate comparison within similar design families.
| Model | Time | Size | Thr | IoU | Dice | Matthews |
|---|---|---|---|---|---|---|
| RTFNet-R18 | 4.13 | 31.00 | 0.28 | 31.74 | 39.76 | 41.18 |
| GMNet-R18 | 6.70 | 31.98 | 0.25 | 33.17 | 41.11 | 42.39 |
| GMNet-R18-1L * | 6.66 | 31.87 | 0.28 | 32.76 | 40.57 | 41.81 |
| GMNet-R34 | 8.14 | 52.20 | 0.22 | 32.04 | 39.88 | 41.16 |
| SAGate-R18 | 4.28 | 9.82 | 0.35 | 32.93 | 40.80 | 42.11 |
| SAGate-R18-1L | 4.38 | 9.67 | 0.30 | 32.99 | 40.95 | 42.33 |
| SAGate-R34 | 5.71 | 14.99 | 0.32 | 34.46 | 42.46 | 43.74 |
| WNet | 2.94 | 5.17 | 0.32 | 32.69 | 40.57 | 41.97 |
| WNet-S ** | 2.52 | 0.62 | 0.28 | 33.13 | 41.12 | 42.46 |
| WNet-S-Deeper | 3.67 | 2.25 | 0.28 | 32.82 | 40.89 | 42.29 |
| WNet-MLP | 2.24 | 0.88 | 0.32 | 32.96 | 41.07 | 42.41 |
| WNet-CMX | 5.06 | 4.95 | 0.25 | 32.41 | 40.51 | 41.97 |
| WNet-FFM | 5.18 | 4.51 | 0.25 | 32.67 | 40.46 | 41.76 |
| CMX-b0 | 6.98 | 12.11 | 0.30 | 33.99 | 41.84 | 43.05 |
| CMX-b2 | 11.68 | 66.56 | 0.38 | 36.67 | 44.61 | 45.76 |
| UNetS-S-4i | 1.99 | 0.35 | 0.28 | 31.99 | 39.89 | 41.29 |
| UNet-S-4i | 2.40 | 0.29 | 0.25 | 31.68 | 39.68 | 41.19 |
| UNetS-S-THR | 2.02 | 0.35 | 0.22 | 30.67 | 38.36 | 39.87 |
| UNet-S-THR | 2.31 | 0.29 | 0.22 | 30.40 | 38.20 | 39.58 |
| UNetS-S-RGB | 2.04 | 0.35 | 0.15 | 16.52 | 23.00 | 24.57 |
| UNet-S-RGB | 2.28 | 0.29 | 0.10 | 16.17 | 22.57 | 24.28 |
* 1L = only one loss used instead of the original. ** S = separable convolution was used.
Table 3. Detection benchmark precision, recall, and F1 score results for various detection size classes, all expressed as percentages. Bold values highlight the best performance within each metric and size category. Models are grouped by sensor fusion approaches and U-Net-based baseline architectures, as indicated by the horizontal lines, to facilitate comparison within similar design families.
| Model | Valid prec | Valid rec | Valid F1 | Small prec | Small rec | Small F1 | Large prec | Large rec | Large F1 |
|---|---|---|---|---|---|---|---|---|---|
| RTFNet-R18 | 45.41 | 42.88 | 44.11 | 28.62 | 23.54 | 25.83 | 78.36 | 76.06 | 77.19 |
| GMNet-R18 | 46.47 | 44.77 | 45.60 | 28.77 | 24.56 | 26.50 | 76.07 | 75.57 | 75.82 |
| GMNet-R18-1L | 49.80 | 45.19 | 47.39 | 31.64 | 24.02 | 27.31 | 76.82 | 75.57 | 76.19 |
| GMNet-R34 | 46.87 | 43.63 | 45.19 | 28.86 | 23.18 | 25.71 | 78.31 | 74.10 | 76.15 |
| SAGate-R18 | 50.56 | 47.44 | 48.95 | 35.06 | 29.32 | 31.93 | 69.67 | 76.71 | 73.02 |
| SAGate-R18-1L | 49.18 | 47.70 | 48.43 | 34.51 | 30.52 | 32.40 | 74.38 | 78.01 | 76.15 |
| SAGate-R34 | 50.82 | 50.70 | 50.76 | 34.89 | 31.91 | 33.33 | 72.96 | 83.06 | 77.68 |
| WNet-S | 50.66 | 51.61 | 51.13 | 37.27 | 36.48 | 36.87 | 70.02 | 71.50 | 70.75 |
| WNet | 49.82 | 49.95 | 49.89 | 38.39 | 36.85 | 37.60 | 62.95 | 68.89 | 65.79 |
| WNet-S-Deeper | 48.28 | 51.58 | 49.87 | 35.32 | 38.11 | 36.66 | 69.42 | 72.48 | 70.92 |
| WNet-MLP | 44.39 | 50.64 | 47.31 | 32.96 | 37.09 | 34.90 | 61.24 | 72.31 | 66.32 |
| WNet-CMX | 48.86 | 50.11 | 49.48 | 35.98 | 36.30 | 36.14 | 67.37 | 73.29 | 70.20 |
| WNet-FFM | 51.65 | 51.12 | 51.38 | 37.63 | 35.82 | 36.71 | 69.27 | 73.78 | 71.45 |
| CMX-b0 | 53.47 | 50.67 | 52.03 | 37.70 | 32.39 | 34.84 | 76.12 | 80.46 | 78.23 |
| CMX-b2 | 56.06 | 54.22 | 55.13 | 40.01 | 35.34 | 37.53 | 81.48 | 82.41 | 81.94 |
| UNetS-S-4i | 48.38 | 50.64 | 49.48 | 35.11 | 36.54 | 35.81 | 68.95 | 70.52 | 69.73 |
| UNet-S-4i | 48.19 | 49.79 | 48.97 | 36.42 | 35.76 | 36.09 | 61.73 | 68.57 | 64.97 |
| UNetS-S-THR | 47.84 | 48.26 | 48.05 | 35.99 | 35.34 | 35.66 | 66.49 | 61.40 | 63.84 |
| UNet-S-THR | 45.91 | 48.06 | 46.96 | 34.35 | 36.54 | 35.41 | 61.97 | 62.38 | 62.18 |