Effectiveness of Attention Mechanisms in YOLOv8 for Maritime Vessel Detection

Lee, Changui; Lee, Seojeong

doi:10.3390/jmse14050433

by Changui Lee and Seojeong Lee *

Division of Marine System Engineering, National Korea Maritime and Ocean University, Busan 49112, Republic of Korea

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(5), 433; https://doi.org/10.3390/jmse14050433

Submission received: 5 February 2026 / Revised: 16 February 2026 / Accepted: 25 February 2026 / Published: 26 February 2026

(This article belongs to the Special Issue Artificial Intelligence Technology and Application in Marine Science and Engineering)

Abstract

Maritime vessel detection in nearshore waters is a fundamental capability for artificial intelligence (AI)-enabled maritime transportation systems, including coastal monitoring, traffic management, and digital maritime services. Although attention mechanisms are widely incorporated into YOLO-based detectors, their relative effectiveness in marine environments under strictly controlled experimental conditions remains insufficiently clarified. This study presents a systematic comparison of Coordinate Attention (CA), Convolutional Block Attention Module (CBAM), and CLIP-based semantic fusion within a unified YOLOv8n framework for binary discrimination between ships and fishing boats in cluttered coastal imagery. All model variants were trained under identical data partitions and optimization settings to isolate architectural effects. The experimental results show that CA achieves the highest localization robustness (mAP@0.5:0.95 = 0.6127) and substantially improves precision (+7.13% over baseline), while CBAM provides the most balanced performance with the highest F1-score. In contrast, CLIP-based semantic fusion consistently degrades detection reliability, indicating limitations of global vision–language representations in small-scale maritime datasets. Precision–Recall and F1 analyses further reveal architecture-specific confidence calibration behaviors relevant to deployment-sensitive maritime applications. The findings provide practical guidance for selecting attention mechanisms in AI-driven maritime perception systems and support reliable AI integration in marine science and engineering applications.

Keywords: maritime vessel detection; YOLOv8; attention mechanisms; marine artificial intelligence; fishing boat detection; maritime imagery

1. Introduction

Maritime vessel detection is a fundamental task in a wide range of marine science and engineering applications, including maritime surveillance, collision avoidance, traffic monitoring, and port operations [1,2]. With the rapid adoption of artificial intelligence technologies in the maritime domain, vision-based perception systems have become an essential component for extracting situational information from marine environments. Deep learning-based object detection models are increasingly applied to maritime imagery to support operational awareness in coastal and near-shore waters [1,2,3].

Among recent object detection approaches, the YOLO (You Only Look Once) family has been widely adopted because of its real-time inference capability and favorable balance between detection accuracy and computational efficiency [4,5]. As YOLO-based models are increasingly deployed in practical maritime applications, improving their robustness and reliability under marine environmental conditions has become an important research topic [2,3]. Attention mechanisms have emerged as a promising technique for enhancing feature representation by selectively emphasizing informative components of convolutional feature maps, and their integration into object detection networks has demonstrated notable performance improvements across various application domains [6,7,8]. Several attention mechanisms, including channel-based, spatial, and hybrid designs, have been proposed and incorporated into convolutional neural networks [6,7,8]. In the maritime context, recent studies have reported performance gains by applying attention modules—most notably composite structures such as the Convolutional Block Attention Module (CBAM)—to YOLO-based vessel detection models [9,10,11]. However, existing studies typically focus on demonstrating performance improvements using a single attention mechanism, often under different datasets and experimental settings. As a result, the relative effectiveness of different attention designs in maritime environments remains insufficiently understood, particularly when assessed in a like-for-like configuration across attention designs.

Maritime imagery presents unique characteristics that distinguish it from terrestrial visual environments. Factors such as sea surface reflections, variable illumination, scale ambiguity, and visually homogeneous backgrounds can significantly influence feature extraction and attention behavior [2,3,12]. These characteristics suggest that the performance of attention mechanisms in maritime applications may not directly follow trends observed in general-purpose computer vision tasks. It remains an open question whether more complex attention structures that jointly model channel and spatial information consistently outperform simpler attention mechanisms in marine environments, especially when training data are limited.

Beyond detecting fixed maritime objects such as lighthouses or buoys, distinguishing between vessel types poses an additional challenge in maritime object detection. In coastal waters and port-adjacent areas, small fishing boats frequently operate in close proximity to general ships, often exhibiting similar visual appearances in terms of shape, color, and scale [12,13]. Moreover, many fishing boats in such environments either do not continuously transmit Automatic Identification System (AIS) signals or carry no AIS equipment at all, making vision-based detection and classification a critical means of situational awareness [14]. From a marine operational perspective, accurately distinguishing fishing boats from general ships is therefore both challenging and important, as fishing vessels are commonly associated with irregular maneuvers and increased navigational risk in congested waters [1,2,13].

Accordingly, this study focuses on a binary vessel classification task involving general ships and fishing boats, representing a practically relevant and technically challenging scenario in maritime vessel detection. Within a unified YOLOv8-based framework [5], multiple representative attention mechanisms—namely Coordinate Attention (CA) [7], Convolutional Block Attention Module (CBAM) [6], and semantic fusion based on Contrastive Language–Image Pre-training (CLIP) [15]—are systematically integrated and evaluated under identical experimental conditions. Using a balanced maritime image dataset, controlled experiments are conducted to analyze detection performance in terms of precision, recall, F1-score, and mean average precision (mAP). The results demonstrate that a relatively simple channel-based attention mechanism can outperform more complex attention designs in maritime environments, while also revealing limitations of semantic fusion based on large pre-trained vision–language models when applied to small-scale maritime datasets. Through these findings, this study provides practical insights into the application of attention mechanisms for vision-based artificial intelligence in marine science and engineering.

2. Background

2.1. Vision-Based Maritime Vessel Detection in Coastal and Near-Shore Environments

Vision-based vessel detection has become an essential component of maritime monitoring systems, particularly in coastal and near-shore environments where vessel density is high and navigational complexity is increased. Recent advances in deep learning have enabled the use of optical imagery from shore-based cameras, satellites, and unmanned platforms to detect and classify vessels under diverse operational conditions [1,2,3,16]. Compared with radar- or AIS-based monitoring systems, vision-based approaches provide complementary situational awareness by directly capturing vessel appearance, size, and relative motion, which is especially important for detecting small or non-cooperative vessels [14,16,17].

However, maritime visual environments present unique challenges that differ substantially from terrestrial scenarios. Sea surface reflections, dynamic backgrounds, atmospheric effects, and large variations in object scale caused by long observation distances can significantly degrade detection performance [2,3,12,16]. These challenges have been systematically discussed in recent maritime computer vision studies, which report that conventional object detection models trained on terrestrial datasets often struggle when applied directly to maritime imagery, particularly in cluttered coastal scenes and port-adjacent waters [12,16,18]. Public benchmarks and challenge reports further reinforce that nearshore conditions (e.g., clutter, glint, small targets, and platform motion) can drive substantial performance degradation, highlighting the importance of evaluation under realistic conditions [17,19,20].

To address these issues, recent studies published since 2022 have focused on adapting deep learning-based detection models to maritime-specific scenarios. These efforts include the use of domain-specific datasets, multi-scale feature extraction strategies, and architectural modifications tailored to maritime imagery [9,10,11,16,21]. Empirical results from recent studies indicate that such adaptations can improve detection accuracy and robustness in complex marine environments, but also highlight the need for careful evaluation under consistent experimental conditions [9,10,11,19].

2.2. YOLO-Based Detection Frameworks for Maritime Applications

Among various object detection architectures, the YOLO family has been widely adopted in maritime applications due to its real-time inference capability and favorable balance between detection accuracy and computational efficiency [4,5]. Recent YOLO variants, including YOLOv5, YOLOv7, and YOLOv8, have been increasingly applied to maritime vessel detection tasks using optical and remote sensing imagery [9,10,11,21,22,23]. YOLOv8, released in 2023, introduces architectural refinements, including an anchor-free detection head, that make it suitable for time-sensitive maritime applications [5].

Recent maritime studies demonstrate that YOLOv8-based frameworks can effectively detect vessels of varying sizes in complex environments such as ports, coastal waters, and inland waterways [10,11,21]. Nevertheless, these studies also report persistent challenges in detecting small vessels and distinguishing visually similar vessel types, particularly under conditions of limited training data or strong background interference [9,10,11,16,19]. Moreover, comparative analyses indicate that reported performance improvements often depend on dataset characteristics, evaluation protocols, and model configurations, which complicates cross-paper comparison [16,19]. Consequently, experiments that keep data splits and training protocols consistent are necessary to objectively assess the contribution of individual architectural components (e.g., attention modules) in maritime contexts.

2.3. Generic and Domain-Specific Attention Mechanisms for Maritime Object Detection

Attention mechanisms have been widely introduced into deep learning models to enhance feature representation by selectively emphasizing informative features while suppressing less relevant information. In computer vision, attention mechanisms are commonly categorized into channel-based attention, spatial attention, and hybrid approaches that combine both [6,7,8]. These mechanisms aim to improve the model’s ability to focus on salient features, which is particularly important for detecting small or visually ambiguous targets.

In maritime vessel detection, attention mechanisms have gained traction in recent years as a means of addressing challenges related to scale variation, background clutter, and small object representation. Recent studies published since 2022 report that integrating attention modules into YOLO-based detectors can improve detection performance, particularly for small vessels and targets operating in complex maritime environments [9,10,11,21,24]. Composite attention structures such as CBAM are frequently adopted because they jointly model channel-wise and spatial feature dependencies [6,10,11]. In contrast, lighter channel-based attention mechanisms, including Coordinate Attention, have been explored to improve computational efficiency while maintaining competitive detection performance [7,9,24].

Figure 1 illustrates representative attention mechanisms commonly employed in convolutional neural networks, highlighting conceptual differences between composite and channel-based designs. As shown in Figure 1, CBAM sequentially applies channel attention and spatial attention to refine intermediate feature maps by explicitly modeling inter-channel relationships and spatial saliency, respectively [6]. By comparison, Figure 1 also depicts the structure of Coordinate Attention, which embeds positional information into channel attention by decomposing global pooling into two one-dimensional encoding processes along horizontal and vertical directions [7]. These architectural differences suggest that the effectiveness of attention mechanisms may depend on environmental characteristics and dataset properties, and the relative benefits of complex versus lightweight designs remain unclear under maritime constraints such as limited data and visually homogeneous backgrounds [16,19].

In addition to generic attention modules designed for general-purpose object detection, several studies have proposed domain-specific attention mechanisms tailored specifically to maritime and SAR ship detection scenarios. For example, scale-aware dimension-wise attention networks have been developed to address the challenges of small ship instance segmentation in synthetic aperture radar images [25]. These approaches adaptively reweight feature responses according to object scale and spatial resolution, enhancing sensitivity to small and weak targets that occupy limited pixel regions. By explicitly modeling scale imbalance and resolution-dependent feature saliency, such methods improve small-target discrimination under SAR imaging constraints.

More recently, SLA-Net introduced a hierarchical sea–land-aware attention mechanism that explicitly models contextual differences between sea surfaces and coastal land regions to reduce false alarms caused by shoreline clutter and complex backgrounds [26]. By embedding environmental priors into the attention process, the network improves discrimination between ship targets and land–sea boundary artifacts in SAR imagery.

These domain-adaptive strategies highlight an important distinction: while composite attention modules such as CBAM and lightweight designs such as Coordinate Attention provide general feature refinement capabilities, maritime-specific attention mechanisms attempt to encode contextual priors related to sea–land boundaries, scale imbalance, and clutter characteristics. The present study differs from these works in that it does not introduce new domain-adaptive priors, but instead evaluates whether general-purpose attention modules exhibit consistent and interpretable behavioral differences under strictly controlled maritime experimental conditions.

2.4. Related Work on Attention Mechanisms and Vision–Language Models in Maritime Detection

Recent maritime vessel-detection research has largely converged on YOLO-family detectors, which were originally validated on large-scale natural-image benchmarks such as COCO [27] and have been widely adopted in maritime scenarios for their favorable speed–accuracy trade-off. Within this line of work, a common theme is to improve baseline YOLO performance under maritime-specific visual conditions (such as sea-surface reflections, haze, background clutter near shorelines, and strong scale variation) by modifying the backbone/neck, introducing lightweight feature-fusion designs, or refining the loss and post-processing steps. For example, Jiang et al. proposed a lightweight YOLOv7-based model for complex marine environments, emphasizing the need to improve detection robustness without sacrificing speed for real-time use cases [11]. Similarly, Gao et al. introduced an improved YOLOv8n-based lightweight ship detector and demonstrated that targeted architectural adjustments (e.g., lightweight convolutional components, neck redesign, and attention-based enhancement) can improve ship detection performance while controlling computation cost [21]. Beyond optical imagery, SAR-based ship detection studies also report that “inshore” conditions remain challenging due to strong land–sea clutter and diverse interference; Yu and Shin presented a YOLOv8-based approach that combines efficient convolutional reparameterization and head design to improve SAR ship detection performance, illustrating that architectural efficiency and feature representation remain key issues in maritime contexts [22].

A second stream of studies explicitly addresses inshore/nearshore complexity, where the background may include port facilities, buildings, and coastline structures that can produce false alarms or missed detections. Guo and Gu proposed a bi-directional attention feature pyramid network for closely arranged inshore ships, showing that attention-informed multi-scale fusion is useful when ships are dense and confusable with shore structures [13]. From a complementary perspective, Chen et al. investigated inshore SAR ship detection using a multi-modality saliency strategy and emphasized that inshore scenes present distinct difficulty compared to offshore scenes due to cluttered backgrounds and sea–land boundary ambiguity [18]. These works collectively suggest that, while YOLO variants provide a strong baseline, feature enhancement and attention-guided fusion are often adopted to mitigate the inshore “clutter + small targets + scale variation” problem, though the specific attention design choices vary widely across studies.

At the same time, the broader maritime computer vision community has increasingly emphasized benchmarking and evaluation under realistic operational conditions, including sun glitter, small craft at long range, and dynamic camera motion. The MaCVi 2023 workshop challenge report highlights practical perception difficulties in maritime environments (e.g., small targets, reflection artifacts, and hardware constraints) and illustrates how detection performance can degrade substantially under realistic UAV/USV viewpoints, motivating careful evaluation protocols and controlled comparisons across methods rather than isolated improvements on a single setting [19]. This trend aligns with the methodological direction of the present work—namely, isolating the effect of architectural components (attention modules) under unified experimental conditions—because many published improvements are otherwise difficult to compare due to different datasets, viewpoints, and evaluation pipelines.

More recently, a third stream of research explores whether vision–language models (VLMs) can contribute to maritime or remote-sensing perception tasks by providing semantic priors or open-vocabulary capabilities. Remote-sensing VLM efforts such as RemoteCLIP demonstrate that domain-adapted pretraining can improve transfer performance relative to generic CLIP baselines across multiple downstream remote-sensing tasks [28]. In addition, GRAFT shows that aligning remote-sensing imagery with CLIP-style representations using geo-located ground imagery can enable strong zero-shot and open-vocabulary behaviors without textual annotations, indicating rapid progress in VLM adaptation strategies for remote sensing [29]. However, several studies also stress that VLM success is strongly coupled to data volume and domain alignment: Cha et al. explicitly note that remote-sensing vision–language datasets are often smaller than those in natural-image domains, motivating new approaches for large-scale dataset curation to support robust VLMs [30]. In the maritime domain specifically, Lorencin et al. report promising zero-shot classification behavior using CLIP on a curated maritime object dataset, while also noting limitations related to dataset diversity and potential biases from web-scraped imagery that may not fully represent real operational conditions [31]. Taken together, these findings motivate a cautious and empirical stance when applying VLM-based semantic fusion to small-scale, domain-specific maritime datasets, especially for fine-grained class discrimination under nearshore conditions. Despite growing interest in vision–language models, most maritime vessel detection studies do not directly adopt CLIP as a backbone, primarily because its representations are optimized for image-level classification rather than spatially localized detection tasks. This gap motivates controlled experimental validation rather than direct architectural substitution.

Recent work on prompt adaptation further indicates that effective deployment of vision–language models often requires task-specific textual optimization rather than direct use of frozen generic prompts. For example, conditional prompt learning (CoOp) demonstrates that adapting text embeddings to downstream data distributions can substantially improve classification alignment in domain-shift scenarios [32]. These findings suggest that performance limitations observed in naïve CLIP integration may stem not only from architectural mismatch but also from insufficient semantic adaptation.

In summary, existing studies consistently identify YOLO-family detectors as a practical and effective baseline for maritime vessel detection, particularly in applications requiring real-time performance. To address the visual complexity of nearshore and coastal environments, many works introduce attention mechanisms or feature-fusion strategies to enhance feature representation and improve detection robustness. At the same time, recent research exploring vision–language models suggests potential benefits from semantic priors, while also highlighting strong dependencies on data scale and domain alignment that limit their direct applicability to small or specialized maritime datasets. Despite these advances, most prior studies evaluate a single architectural enhancement under a specific dataset and experimental configuration. Consequently, the relative effectiveness of different attention mechanisms in maritime environments remains insufficiently clarified, especially for binary classification scenarios in which visually similar vessel types must be distinguished reliably under realistic nearshore conditions. This limitation motivates the systematic and controlled comparative analysis presented in this study.

Although numerous vessel detection studies report performance gains through architectural refinement or attention integration, direct numerical comparison across studies remains difficult due to variations in datasets, image modalities (optical vs. SAR), resolution ranges, class definitions, and evaluation protocols [33]. Reported mAP or F1-score values are often dataset-specific and influenced by scene composition, object scale distribution, and annotation policies. Therefore, rather than performing cross-dataset leaderboard-style comparison, the present study adopts a strictly controlled experimental framework in which all architectural variants are evaluated under identical training conditions. This design enables fair isolation of structural effects while situating the results within broader methodological trends observed in maritime detection research.

3. Methodology

The objective of this study is to isolate and evaluate the influence of attention mechanisms and semantic fusion on maritime vessel detection using a single-factor experimental design, where only the feature-enhancement component is varied and the remaining pipeline is kept unchanged. Rather than proposing a new detection architecture, the study adopts a comparative methodology in which architectural components are modified while all other variables are held constant. This design enables a systematic investigation of how feature-enhancement strategies behave in a realistic maritime classification scenario. Figure 2 presents the overall experimental workflow. The pipeline begins with dataset preparation and annotation, followed by a fixed train–validation–test split shared across all model variants. A unified YOLOv8-based framework is then extended with different attention mechanisms and optional semantic fusion [5,15]. Each model is trained using identical procedures and evaluated on the same test set using standardized detection metrics. By enforcing consistency in data partitioning, optimization settings, and evaluation protocols, the methodology ensures that observed performance differences arise from the architectural components under study rather than from experimental variability.

This controlled comparison framework is particularly important in maritime computer vision, where dataset size is often limited and evaluation results can be sensitive to data partitioning [16,19].
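The single-factor design described above can be illustrated with a minimal sketch in which every run shares one configuration and only the feature-enhancement field changes. The class name `RunConfig`, the variant labels, and the `seed` field are hypothetical and introduced only for illustration; the base model, input size, and split counts are taken from the text, and no optimizer settings are shown because the section does not state them.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """One training run; every field except `enhancement` is shared."""
    enhancement: str                     # the single factor under study
    base_model: str = "yolov8n"          # stated in the text
    image_size: int = 640                # stated in the text
    split_counts: tuple = (294, 63, 63)  # train/val/test images, stated in the text
    seed: int = 0                        # hypothetical: the source gives no seed


# Illustrative labels for the baseline and the three enhanced variants.
ENHANCEMENTS = ("none", "ca", "cbam", "clip_fusion")


def build_runs():
    """Derive every variant from the baseline by changing one field only."""
    base = RunConfig(enhancement="none")
    return [replace(base, enhancement=name) for name in ENHANCEMENTS]


def differing_fields(a, b):
    """Names of the configuration fields on which two runs disagree."""
    da, db = asdict(a), asdict(b)
    return sorted(key for key in da if da[key] != db[key])


runs = build_runs()
for run in runs[1:]:
    # Each variant may differ from the baseline only in the enhancement.
    assert differing_fields(runs[0], run) == ["enhancement"]
```

Deriving each variant from a frozen baseline makes an accidental change to a shared setting fail the check instead of silently confounding the comparison.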

3.1. Dataset Construction and Annotation Protocol

The dataset used in this study consists of 420 RGB maritime images collected from publicly available maritime image repositories and operational coastal monitoring sources. The imagery primarily represents nearshore and port-adjacent environments, where visual complexity is typically high due to background clutter, sea surface reflections, and coastal infrastructure. The dataset is intentionally structured as a balanced binary classification problem, comprising 210 images labeled as ship and 210 labeled as fishing boat. This class balance eliminates bias arising from skewed class distributions and enables direct comparison of class-specific detection performance.

The images were acquired using optical RGB cameras deployed in shore-based surveillance systems and small unmanned aerial platforms operating in coastal waters. The spatial resolution of the images ranges from 640 × 480 to 1280 × 720 pixels. The dataset includes scenes captured under diverse environmental conditions, including clear and overcast weather, varying illumination levels (morning, midday, and late afternoon), and moderate sea states. These variations were intentionally preserved to reflect realistic operational conditions rather than curated laboratory scenarios.

Representative examples of the dataset are shown in Figure 3. The examples provide visual insight into typical nearshore operational complexity, including small-scale fishing vessels, mixed sea–land backgrounds, and viewpoint variation between aerial and shore-based perspectives. In particular, fishing boats frequently appear with limited spatial footprint and partial background interference, illustrating the intrinsic difficulty of fine-grained vessel discrimination in cluttered coastal environments.

All images were manually annotated using axis-aligned bounding boxes surrounding the target vessels. Annotation was performed using a standardized labeling protocol to ensure consistency across classes. A quality-control stage was conducted to review ambiguous cases, particularly for small craft exhibiting limited distinguishing visual features. Ambiguous samples were re-examined to maintain class definition integrity and minimize labeling noise. Because the primary objective of this study is architectural comparison rather than dataset development, the dataset functions as a controlled experimental benchmark representative of nearshore visual complexity.

Prior to training, all images were resized to a unified input resolution of 640 × 640 pixels, consistent with the default YOLOv8n configuration. Aspect ratio was preserved through padding to avoid geometric distortion. Pixel values were normalized following the standard YOLO preprocessing pipeline. No class-specific preprocessing was applied.
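The aspect-preserving resize can be sketched as a small geometry calculation. This assumes the padding is split evenly between opposite sides, which the text does not specify; the function names `letterbox_geometry` and `map_box` are hypothetical and do not refer to any library API.

```python
def letterbox_geometry(width, height, target=640):
    """Return (scale, new_w, new_h, pad_left, pad_top, pad_right, pad_bottom)
    for fitting a width x height image into a target x target canvas."""
    scale = min(target / width, target / height)  # keep the aspect ratio
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_w = target - new_w
    pad_h = target - new_h
    pad_left, pad_top = pad_w // 2, pad_h // 2    # assumption: centered padding
    return scale, new_w, new_h, pad_left, pad_top, pad_w - pad_left, pad_h - pad_top


def map_box(box, scale, pad_left, pad_top):
    """Map an (x1, y1, x2, y2) box from source pixels to the padded canvas."""
    x1, y1, x2, y2 = box
    return (x1 * scale + pad_left, y1 * scale + pad_top,
            x2 * scale + pad_left, y2 * scale + pad_top)


# The two ends of the resolution range reported for the dataset.
print(letterbox_geometry(1280, 720))  # (0.5, 640, 360, 0, 140, 0, 140)
print(letterbox_geometry(640, 480))   # (1.0, 640, 480, 0, 80, 0, 80)
print(map_box((100, 200, 300, 400), 0.5, 0, 140))  # (50.0, 240.0, 150.0, 340.0)
```

At 1280 × 720 the image is halved and padded by 140 pixels above and below, so vessels in the larger frames occupy fewer input pixels than those in 640 × 480 frames, which are padded without downscaling.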

Data augmentation was performed during training using the default YOLOv8 augmentation strategy. This includes random horizontal flipping, scaling, translation, and mosaic augmentation. No additional augmentation techniques were selectively applied to any individual model variant. The same augmentation configuration was maintained across all experiments to ensure strict experimental fairness and eliminate confounding effects.

The dataset was partitioned into training, validation, and test subsets using a fixed split ratio of 70%/15%/15%. Specifically, 294 images were used for training, 63 images for validation, and 63 images for testing. Class balance was preserved within each subset to maintain equal proportions of ship and fishing boat instances. The split was performed once and reused identically across all model variants to prevent sampling variability and ensure reproducibility. No cross-validation or re-sampling was applied.
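A class-stratified split that reproduces the 294/63/63 counts can be sketched as follows. The file names, the seed, and the function name `stratified_split` are hypothetical. Because 63 is odd, the two classes cannot contribute identical image counts to the validation and test subsets; the sketch assumes the extra image alternates between them, a detail the text does not describe.

```python
import random


def stratified_split(items_by_class, train_pct=70, seed=0):
    """Split each class separately so every subset stays class-balanced.

    The remainder after the training share is halved between validation and
    test; when it is odd, the extra item alternates between the two subsets
    from one class to the next.
    """
    rng = random.Random(seed)  # fixed seed: the split is made once and reused
    splits = {"train": [], "val": [], "test": []}
    for index, label in enumerate(sorted(items_by_class)):
        items = list(items_by_class[label])
        rng.shuffle(items)
        n_train = len(items) * train_pct // 100
        rest = len(items) - n_train
        n_val = (rest + index % 2) // 2
        splits["train"] += [(name, label) for name in items[:n_train]]
        splits["val"] += [(name, label) for name in items[n_train:n_train + n_val]]
        splits["test"] += [(name, label) for name in items[n_train + n_val:]]
    return splits


# Placeholder file names standing in for the 210 + 210 images.
dataset = {
    "ship": [f"ship_{i:03d}.jpg" for i in range(210)],
    "fishing_boat": [f"fishing_boat_{i:03d}.jpg" for i in range(210)],
}
splits = stratified_split(dataset)
sizes = {name: len(rows) for name, rows in splits.items()}
print(sizes)  # {'train': 294, 'val': 63, 'test': 63}
```

Each class contributes 147 training images, with 31 and 32 images going to the two held-out subsets in alternating order.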

3.2. Baseline Detection Framework

All experiments are built on YOLOv8 nano (YOLOv8n), a lightweight single-stage detector designed for efficient real-time inference [5]. YOLO-based detectors are widely used in practical maritime monitoring systems because they provide a favorable balance between computational efficiency and detection accuracy. The nano variant is selected to represent a deployment-relevant baseline where architectural overhead must remain constrained.

The baseline model remains unchanged across experiments except for the insertion of attention modules and semantic fusion components. This approach allows the study to evaluate architectural enhancements in isolation without confounding effects from backbone replacement or large-scale structural redesign. The baseline therefore serves as a stable reference point against which all enhancements are measured.

3.3. Integration of Attention Mechanisms

Two representative attention mechanisms—CA and CBAM—were selected to represent distinct architectural design philosophies: lightweight positional channel attention versus composite channel–spatial attention [6,7]. Rather than proposing a new detection architecture, this study evaluates how these mechanisms influence maritime vessel detection when integrated into a unified YOLOv8n framework under strictly controlled experimental conditions.

To ensure architectural comparability and reproducibility, the insertion locations of the attention modules were explicitly defined. Both CA and CBAM were inserted at the same structural position within the YOLOv8n pipeline. Specifically, the attention module was applied after the final C2f block of the backbone and immediately before the feature aggregation stage in the PAN-FPN neck. This location corresponds to the highest-level semantic feature representation prior to multi-scale fusion.

The rationale for this placement is twofold. First, high-level backbone features contain rich contextual information relevant for vessel-type discrimination, particularly in visually ambiguous nearshore environments. Second, inserting the attention module before feature pyramid aggregation allows the refined features to propagate consistently across all detection scales without altering the downstream detection head.

No modifications were applied to the detection head, anchor configuration, or loss formulation. The YOLOv8n backbone, neck, and head structures were preserved to maintain computational fairness across model variants. The nano variant was selected to reflect deployment-relevant conditions in maritime monitoring systems, where real-time performance and limited onboard computational resources are critical considerations.

Coordinate Attention encodes directional spatial information into channel attention through separate global pooling operations along horizontal and vertical directions. This mechanism introduces positional awareness while maintaining lightweight computational overhead. In contrast, CBAM sequentially applies channel attention followed by spatial attention, enabling the network to refine both feature importance and spatial saliency. Although more expressive, CBAM introduces additional parameters and computational complexity.
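The difference between the two mechanisms can be illustrated by the pooled descriptors each one starts from. The following is a minimal sketch, not the implementation used in this study: the function names and the toy feature map are hypothetical, and the learned convolutions, shared MLP, and sigmoid gating of both modules are omitted. In CBAM proper, the spatial descriptors are computed after channel reweighting; here both are taken from the raw map for brevity.

```python
# Illustrative only: shows which pooled descriptors CA and CBAM start from.
# Learned layers and sigmoid gating are deliberately omitted.

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def ca_descriptors(fmap):
    """Coordinate Attention: one 1D descriptor per direction, per channel."""
    # Pool along the width  -> one value per row    (C x H)
    pooled_h = [[mean(row) for row in channel] for channel in fmap]
    # Pool along the height -> one value per column (C x W)
    pooled_w = [[mean(col) for col in zip(*channel)] for channel in fmap]
    return pooled_h, pooled_w

def cbam_descriptors(fmap):
    """CBAM: per-channel descriptors, then per-location descriptors."""
    flat = [[v for row in channel for v in row] for channel in fmap]
    channel_avg = [mean(c) for c in flat]   # global average pooling (C)
    channel_max = [max(c) for c in flat]    # global max pooling (C)
    height, width = len(fmap[0]), len(fmap[0][0])
    # Pool across channels at every location (H x W)
    spatial_avg = [[mean(ch[i][j] for ch in fmap) for j in range(width)]
                   for i in range(height)]
    spatial_max = [[max(ch[i][j] for ch in fmap) for j in range(width)]
                   for i in range(height)]
    return channel_avg, channel_max, spatial_avg, spatial_max

# Toy feature map: 2 channels, 2 rows, 3 columns (hypothetical values).
fmap = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.0, 0.0, 6.0], [2.0, 2.0, 2.0]],
]
pooled_h, pooled_w = ca_descriptors(fmap)
channel_avg, channel_max, spatial_avg, spatial_max = cbam_descriptors(fmap)
```

CA keeps one axis of position in each descriptor, so row and column location survive the pooling. CBAM's channel descriptors discard position entirely and recover it only in the subsequent spatial step.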

By fixing the insertion location and avoiding additional structural tuning, this study isolates the intrinsic behavioral differences between CA and CBAM. Therefore, any observed performance differences can be attributed to the internal feature refinement strategies of the attention modules rather than to architectural placement optimization. Figure 4 illustrates the architectural insertion point of the attention modules within the YOLOv8n backbone–neck pipeline.

3.4. CLIP-Based Semantic Fusion

To explore whether semantic priors derived from large-scale vision–language pre-training can enhance maritime vessel classification, a CLIP-based semantic fusion mechanism was integrated into the YOLOv8 framework. Unlike the attention mechanisms, which operate directly on convolutional feature maps, CLIP introduces language-aligned global semantic representations.

The motivation for incorporating CLIP stems from recent research suggesting that vision–language models may provide complementary semantic context, particularly in cases where visual distinctions between classes are subtle [15]. In nearshore maritime environments, general ships and fishing boats may share similar geometric structures and color patterns. Integrating semantic embeddings was therefore hypothesized to improve fine-grained discrimination.

However, it should be clarified that CLIP was not introduced as a presumed superior detection backbone. Most maritime detection frameworks rely exclusively on convolutional feature extractors, as CLIP is pre-trained for image-level image–text alignment rather than spatial localization tasks. The objective of incorporating CLIP in this study is therefore exploratory and comparative: to empirically assess whether global semantic priors learned from large-scale image–text corpora can provide complementary benefits in a domain-specific, small-scale maritime detection setting. By maintaining a frozen configuration and a simple additive fusion strategy, the experiment isolates the effect of semantic logits without introducing additional adaptive mechanisms.

Importantly, CLIP was not used as a backbone replacement. Instead, it was incorporated as an auxiliary semantic branch. For each input image, a global image embedding was generated using the pre-trained CLIP image encoder. Simultaneously, textual prompts corresponding to the class labels were encoded using the CLIP text encoder. The textual prompts were defined in a simple descriptive format (“a photo of a ship” and “a photo of a fishing boat”) to maintain semantic clarity.

Both the CLIP image encoder and text encoder were frozen during training. This decision was made to prevent instability caused by fine-tuning a large pre-trained model on a relatively small maritime dataset. The cosine similarity between image embeddings and text embeddings was computed to generate semantic logits representing similarity-based class confidence.

Fusion was performed at the classification stage using an additive fusion strategy. Specifically, the semantic logits derived from CLIP were linearly combined with the YOLOv8 detector’s classification logits prior to softmax normalization. No gating mechanism or feature concatenation was applied to avoid introducing additional trainable parameters that could confound the controlled comparison framework.
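The additive fusion step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the study's implementation: the embeddings are toy two-dimensional vectors standing in for frozen CLIP outputs, and the `weight` and `scale` parameters are hypothetical, since the text states only that the logits were linearly combined.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(detector_logits, image_emb, text_embs, weight=1.0, scale=1.0):
    """Add similarity-based semantic logits to detector logits, then normalize.

    `weight` and `scale` are assumed parameters, not values from the paper.
    """
    semantic_logits = [scale * cosine_similarity(image_emb, t) for t in text_embs]
    fused = [d + weight * s for d, s in zip(detector_logits, semantic_logits)]
    return softmax(fused), semantic_logits

# Toy stand-ins for the frozen CLIP embeddings (hypothetical values).
image_emb = [1.0, 0.0]                  # global image embedding
text_embs = [[1.0, 0.0], [0.0, 1.0]]    # "a photo of a ship", "a photo of a fishing boat"
detector_logits = [0.0, 0.0]            # an undecided detector prediction

probs, semantic_logits = fuse(detector_logits, image_emb, text_embs)
```

Because the image embedding is global, this reading implies that every candidate box in an image receives the same semantic offset, independent of which vessel the box actually contains.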

The spatial localization branch of YOLOv8 remained unaffected. Thus, CLIP contributes only to classification confidence refinement and does not influence bounding-box regression. Figure 5 presents the overall CLIP-based semantic fusion architecture.

By structuring the experiment in this manner, the study evaluates whether large-scale semantic representations provide complementary benefits in a data-constrained maritime detection scenario, without altering the underlying spatial detection architecture. The results therefore offer empirical evidence regarding the practical utility and limitations of naïve semantic fusion in maritime detection systems.

3.5. Training Protocol and Experimental Control

To ensure strict experimental fairness and reproducibility, all model variants were trained under an identical configuration. Only the architectural component under investigation (attention module or semantic fusion mechanism) was modified, while all other parameters were fixed across experiments.

All experiments were conducted using an NVIDIA GeForce RTX 3080 Ti GPU (12 GB VRAM). The nano variant of YOLOv8 (YOLOv8n) was selected to reflect deployment-relevant computational constraints in maritime perception systems. The complete training configuration is summarized in Table 1.

No hyperparameter retuning was performed for individual model variants. The same dataset split, augmentation policy, optimization settings, and training schedule were strictly reused across the baseline, CA, CBAM, and CLIP-based models. This design follows a single-variable control principle and ensures that observed performance differences originate from architectural modifications rather than optimization advantages.

For the CLIP-enhanced variants, both the image encoder and text encoder were kept frozen during training. Only the YOLO detection layers and the fusion module were updated. This setting follows common practice in small-scale downstream adaptation scenarios where full fine-tuning may lead to overfitting. The design choice was made to maintain optimization stability under the limited dataset size and to isolate the effect of semantic feature integration without introducing additional fine-tuning variables. By explicitly reporting the experimental configuration and computational environment, this study enhances methodological transparency and facilitates reproducibility.

3.6. Evaluation Metrics and Comparison Strategy

Model performance was evaluated using standard object detection metrics: Precision, Recall, F1-score, mAP@0.5, and mAP@0.5:0.95. Precision measures the proportion of predicted detections that are correct, while Recall measures the proportion of ground-truth objects that are successfully detected. The F1-score, the harmonic mean of Precision and Recall, summarizes the balance between these two measures. Unlike these point metrics, mAP does not depend on a single confidence threshold, because it averages precision over the full recall range. mAP@0.5 evaluates detection accuracy at an IoU threshold of 0.5, while mAP@0.5:0.95 averages results across IoU thresholds from 0.5 to 0.95 in steps of 0.05, following COCO-style evaluation practice [27]. Using both metrics allows the study to assess not only detection presence but also localization quality.
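For reference, the point metrics and the COCO-style averaging over IoU thresholds can be written compactly. The sketch below assumes that true-positive, false-positive, and false-negative counts are already available; the function names and the example values are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Point metrics from detection counts at one confidence threshold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def mean_over_iou(ap_by_threshold):
    """Average per-threshold AP values, as in mAP@0.5:0.95 for one class."""
    return sum(ap_by_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)

# Hypothetical counts and AP values, for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
toy_ap = {t: 1.0 - t for t in IOU_THRESHOLDS}
toy_map = mean_over_iou(toy_ap)
```

The stricter thresholds in `IOU_THRESHOLDS` are what make mAP@0.5:0.95 sensitive to bounding-box tightness, whereas mAP@0.5 mainly rewards finding the object at all.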

Performance is reported per model variant under identical evaluation conditions. The comparison emphasizes relative behavior across architectural designs rather than absolute accuracy claims, enabling clear interpretation of attention effectiveness in maritime classification.

4. Results and Analysis

This section presents a comprehensive analysis of the experimental results obtained from a controlled evaluation of attention mechanisms and semantic fusion strategies within a unified YOLOv8-based maritime vessel detection framework. Rather than limiting the discussion to aggregate accuracy values, the analysis focuses on explaining how different architectural choices influence detection behavior, class discrimination, localization robustness, and confidence calibration in realistic nearshore maritime environments. All experiments were conducted under identical training conditions, dataset partitions, and evaluation protocols. This unified configuration ensures that performance differences arise from architectural design rather than training variability.

4.1. Overall Detection Performance

The overall detection performance reveals clear differences in how architectural enhancements influence maritime vessel detection under controlled experimental conditions. Although the baseline YOLOv8n model already demonstrates reasonable performance across all evaluation metrics, the introduction of attention mechanisms and semantic fusion leads to systematic and interpretable changes in detection behavior.

Attention-based feature refinement consistently affects both classification reliability and localization accuracy. This trend is evident when comparing Precision, Recall, F1-score, and mAP metrics across all evaluated models, as shown in Table 2. The results indicate that architectural modifications inside the detector backbone and neck play a decisive role in shaping detection performance in visually challenging maritime environments.

Among the evaluated variants, YOLO + Coordinate Attention (CA) achieves the highest localization performance, as reflected by its superior mAP@0.5 and mAP@0.5:0.95 scores. The improvement under the stricter IoU-based metric indicates enhanced bounding-box precision rather than merely increased detection confidence. In maritime imagery, where vessels are often partially occluded by waves, wakes, or reflections, such localization robustness is essential for reliable situational awareness.

The CA-enhanced model also achieves the highest Precision, suggesting a strong ability to suppress false-positive detections originating from background clutter such as sea surface texture, coastal structures, or harbor facilities. However, this improvement is accompanied by a reduction in Recall, indicating that the model adopts a more conservative detection strategy. From an operational perspective, this behavior reflects a trade-off between detection completeness and confidence reliability.

In contrast, YOLO + CBAM exhibits a more balanced performance profile. Although its mAP improvements are modest compared to CA, it achieves the highest F1-score among all models. This result suggests that CBAM improves feature discrimination while preserving detection completeness, maintaining Recall at a level comparable to the baseline. Such balanced behavior may be advantageous in applications where missing vessels are unacceptable and comprehensive situational awareness is required.

The CLIP-enhanced variants perform consistently worse than their attention-only counterparts and, in several metrics, fall below the baseline model. Despite their increased architectural complexity, these models fail to exploit semantic information effectively, indicating that semantic fusion does not provide complementary benefits under the current dataset scale and task formulation.

It is important to clarify that each model variant was trained under a single controlled run with a fixed random seed. While repeated multi-seed experiments and formal statistical hypothesis testing would provide additional quantitative confidence intervals, the objective of this study is not leaderboard-style optimization but controlled architectural comparison. All models were trained under strictly identical configurations, including dataset split, augmentation policy, optimization schedule, and computational environment. Relative performance differences can therefore be attributed primarily to architectural modifications rather than to differences in training configuration, although residual single-run variance cannot be excluded. Furthermore, the observed performance shifts exhibit consistent directional behavior across multiple evaluation metrics. For example, Coordinate Attention systematically increases precision and mAP@0.5:0.95 while slightly reducing recall, whereas CBAM maintains recall stability and achieves the highest F1-score. These consistent directional tendencies across classification and localization metrics reduce the likelihood that the differences are attributable solely to stochastic training variance. Instead, they are interpreted as behavioral characteristics associated with each attention mechanism.

It should also be emphasized that the objective of this experimental section is not to claim superiority over previously published vessel detection models evaluated on heterogeneous datasets, but to examine architectural behavior under a unified and controlled framework. Many existing maritime detection studies report performance on different benchmarks, sensor modalities (optical or SAR), resolution settings, and evaluation thresholds, making direct numerical comparison across papers potentially misleading. Rather than conducting cross-dataset leaderboard-style comparison, this study focuses on internally consistent evaluation, isolating architectural effects under identical experimental conditions. Future work may extend this analysis by benchmarking the evaluated modules on additional public maritime datasets under harmonized protocols to enable broader cross-method comparison.

4.2. Relative Performance Improvement over the Baseline

While absolute performance metrics provide an overall comparison, analyzing relative changes with respect to the baseline detector offers clearer insight into the net impact of each architectural modification. This perspective is particularly useful for isolating how attention mechanisms and semantic fusion alter detection behavior, independent of baseline performance levels. The relative performance changes across all metrics are summarized in Table 3. These values highlight both the magnitude and direction of performance shifts introduced by each architectural enhancement.
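This section does not state whether the changes are expressed as percentage-point differences or as percent changes relative to the baseline value, so the sketch below shows both conventions. The function names and the metric values are hypothetical and are not taken from Table 3.

```python
def absolute_change(variant, baseline):
    """Difference in percentage points for metrics stored in [0, 1]."""
    return (variant - baseline) * 100.0

def relative_change(variant, baseline):
    """Percent change relative to the baseline value."""
    return (variant - baseline) / baseline * 100.0

# Hypothetical precision values, for illustration only.
baseline_precision = 0.80
variant_precision = 0.84

points = absolute_change(variant_precision, baseline_precision)     # about 4 points
percent = relative_change(variant_precision, baseline_precision)    # about 5 percent
```

The two conventions diverge more as the baseline value decreases, so the convention in use matters when comparing gains across metrics with different baseline levels.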

The relative improvement analysis clearly identifies Coordinate Attention as the most impactful architectural enhancement. The substantial increase in Precision (+7.13%) indicates a marked reduction in false-positive detections, while the strong improvement in mAP@0.5:0.95 (+4.65%) confirms enhanced localization robustness under strict IoU evaluation. These gains demonstrate that CA primarily improves the quality and spatial consistency of detections rather than simply increasing detection quantity.

The observed reduction in Recall for YOLO + CA further clarifies the nature of this improvement. Rather than indicating inferior detection capability, the decrease reflects a shift toward conservative detection behavior, where only high-confidence predictions are retained. In safety-critical maritime applications, such behavior may be preferable, as it reduces the likelihood of false alarms that could distract operators or trigger unnecessary responses.

YOLO + CBAM shows a contrasting pattern. Its relative changes are smaller in magnitude, but Recall remains essentially unchanged relative to the baseline. This stability, combined with moderate Precision gains, explains why CBAM achieves the highest F1-score. From a design standpoint, this suggests that CBAM enhances discrimination without aggressively filtering marginal detections, resulting in a more balanced detection profile.

The semantic fusion variants exhibit consistent negative changes across all metrics, with particularly severe degradation in localization-related measures. The large decrease in mAP@0.5:0.95 for YOLO + CBAM + CLIP (−7.90%) indicates that semantic embeddings introduce instability into spatial feature learning. This systematic degradation suggests a fundamental misalignment between semantic priors and the visual representations learned from a small maritime dataset.

Although the absolute numerical differences between CA and CBAM appear moderate (within a few percentage points), their behavioral profiles are structurally distinct. CA consistently improves localization robustness (mAP@0.5:0.95) and precision while reducing recall, indicating a conservative confidence calibration strategy. In contrast, CBAM preserves recall stability and achieves the highest F1-score, suggesting balanced feature refinement. If the performance differences were driven purely by stochastic training noise, one would expect inconsistent or contradictory shifts across metrics. However, the directionality of changes remains stable across precision, recall, F1-score, and both mAP metrics. Moreover, the CLIP-enhanced variants exhibit systematic degradation across all metrics rather than random fluctuation, further supporting the interpretation that architectural design—not random variance—is the dominant factor shaping performance behavior. Therefore, while the magnitude of improvement is not large, the consistency of metric trends supports the conclusion that CA and CBAM introduce distinguishable and reproducible behavioral characteristics in maritime vessel detection.

4.3. Class-Level Analysis Using Confusion Matrices

While aggregate metrics provide a concise summary of overall detection performance, they can obscure class-specific behaviors that are particularly important in maritime vessel detection. This limitation is especially relevant in nearshore environments, where different vessel types exhibit distinct visual characteristics, operational patterns, and levels of detection difficulty. To address this issue, class-level performance is analyzed using confusion matrix-derived statistics, with separate evaluations for the fishing boat and ship classes. For the confusion-matrix statistics, a prediction is counted as a true positive when a detected box matches a ground-truth vessel of the corresponding class with IoU ≥ 0.5 at the selected confidence threshold; unmatched detections are counted as false positives, and missed ground-truth instances are counted as false negatives (true negatives are computed with respect to non-target cases under the same decision rule).
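The counting rule above can be sketched as a matching procedure. The version below is a minimal illustration under assumptions not stated in the text: detections are matched greedily in descending confidence order, each ground-truth box can be matched at most once, and the default confidence threshold of 0.25 is a placeholder. Boxes are given as (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def count_matches(detections, ground_truths, conf_threshold=0.25, iou_threshold=0.5):
    """detections: (box, class, score) tuples; ground_truths: (box, class) tuples."""
    kept = sorted((d for d in detections if d[2] >= conf_threshold),
                  key=lambda d: d[2], reverse=True)
    used = set()   # indices of ground-truth boxes already matched
    tp = fp = 0
    for box, cls, _score in kept:
        best_iou, best_idx = 0.0, None
        for idx, (gt_box, gt_cls) in enumerate(ground_truths):
            if idx in used or gt_cls != cls:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            used.add(best_idx)
            tp += 1
        else:
            fp += 1   # unmatched detection
    fn = len(ground_truths) - len(used)   # missed ground-truth instances
    return tp, fp, fn

# Hypothetical scene: one ship found, one spurious fishing boat, one missed.
ground_truths = [((0, 0, 10, 10), "ship"), ((20, 20, 30, 30), "fishing boat")]
detections = [
    ((0, 0, 10, 10), "ship", 0.9),
    ((50, 50, 60, 60), "fishing boat", 0.8),
    ((20, 20, 30, 30), "fishing boat", 0.1),   # below the confidence threshold
]
counts = count_matches(detections, ground_truths)
```

In this toy scene the correct fishing boat detection is discarded by the confidence threshold, which illustrates how the selected threshold alone can turn a potential true positive into a false negative.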

The class-level performance for fishing boats is summarized in Table 4. Across all evaluated models, a consistent pattern of higher Recall than Precision is observed. This imbalance reflects the inherent difficulty of accurately distinguishing small fishing vessels from background structures and other maritime objects. Fishing boats frequently operate close to coastlines and ports, where visual clutter is abundant and background elements such as buoys, docks, breakwaters, and small service craft are common. In addition, fishing boats lack standardized visual features and vary widely in size, shape, and color, further complicating reliable classification. The difficulty observed in fishing boat detection is also consistent with prior maritime and remote-sensing studies reporting that small-object recognition in cluttered coastal environments remains a persistent challenge for convolutional detectors [25,26,31]. Small vessels often occupy only a limited number of pixels and exhibit weak contrast against dynamic sea–land backgrounds, which amplifies feature ambiguity during multi-scale aggregation. The present results therefore align with established findings that architectural refinements tend to exert stronger effects on small, visually ambiguous targets than on large, structurally distinct objects.

As a result, detection models tend to identify a large number of candidate fishing boat instances, which increases Recall but also leads to a higher rate of false positives. This behavior is observed across all architectural variants and highlights the intrinsic challenge of fishing boat detection rather than a limitation of any single model design.

Among the evaluated models, YOLO + Coordinate Attention (CA) achieves the highest Recall for the fishing boat class. This result indicates improved sensitivity to small and visually ambiguous targets, suggesting that CA enhances the representation of subtle vessel-related features that might otherwise be overlooked. However, this increased sensitivity is accompanied by a reduction in Precision, reflecting a higher incidence of false-positive detections. This trade-off runs in the opposite direction to the conservative detection profile observed at the aggregate level: for fishing boats, CA prioritizes capturing vessel-like patterns but does not fully suppress ambiguous background cues that resemble small vessels.

In contrast, YOLO + CBAM demonstrates a more balanced fishing boat detection behavior. Although its Recall is slightly lower than that of CA, its Precision is marginally higher, resulting in a more stable F1-score. This suggests that CBAM’s sequential channel–spatial refinement helps suppress some background-induced false detections while preserving sensitivity to true fishing boat instances. The inclusion of spatial attention likely contributes to filtering out background patterns that are spatially inconsistent with vessel structures, thereby improving classification reliability without excessively sacrificing Recall.

Overall, the fishing boat results indicate that architectural enhancements primarily influence sensitivity–specificity trade-offs for visually challenging, small-scale targets, and that different attention mechanisms favor different operational priorities.

The class-level performance for the ship class is summarized separately in Table 5. In contrast to fishing boats, performance differences among models are considerably smaller. All evaluated models achieve relatively high Precision and Recall, indicating that larger vessels are visually easier to detect and classify.

Ships typically exhibit more consistent geometric structures, larger spatial footprints, and clearer separation from background elements such as sea surface textures and coastal infrastructure. As a result, their detection performance is less sensitive to architectural refinements, and improvements introduced by attention mechanisms are comparatively modest at the class level.

This contrast between fishing boat and ship detection highlights an important insight: architectural enhancements primarily affect challenging, small-scale targets rather than well-defined large objects. Consequently, many of the improvements observed in aggregate performance metrics are largely driven by changes in fishing boat detection behavior rather than by improvements in ship detection.

This finding underscores the importance of class-level analysis when evaluating maritime detection systems. Relying solely on aggregate metrics may obscure meaningful performance differences that are operationally significant, particularly in nearshore environments where small vessels such as fishing boats play a disproportionate role in navigational risk.

In addition to qualitative error inspection, the confusion patterns suggest a mechanism-level interpretation. CA tends to amplify channel-wise discriminative cues, which increases sensitivity to vessel-like edge and texture patterns. While this enhances detection of small fishing boats, it also increases susceptibility to visually similar background structures, thereby explaining the precision–recall trade-off observed in Section 4.1. In contrast, CBAM’s sequential spatial refinement appears to suppress certain background activations, resulting in fewer extreme confidence shifts and more stable recall behavior. These observations indicate that the observed class-level differences are not incidental, but are structurally linked to how each attention module redistributes feature saliency across spatial and channel dimensions.

4.4. Precision–Recall and F1-Score Curve Analysis

Point-based evaluation metrics such as Precision, Recall, and F1-score provide useful summaries of detection performance at a fixed confidence threshold. However, these metrics do not fully capture how detector behavior evolves as the confidence threshold varies. In practical maritime applications, detection thresholds are frequently adjusted in response to changing environmental conditions, sensor quality, and operational requirements. For this reason, curve-based evaluation using Precision–Recall (PR) and F1-score curves provides deeper insight into confidence calibration, robustness, and operational flexibility.
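The curves discussed in this section can be obtained by sweeping the confidence threshold over a set of scored detections. The sketch below is a simplified illustration with hypothetical data: each detection carries a fixed true-positive flag (assumed to come from IoU ≥ 0.5 matching), and precision is set to 1.0 by convention when no detection passes the threshold.

```python
def sweep_thresholds(scored_hits, num_ground_truth, thresholds):
    """scored_hits: list of (confidence, is_true_positive) pairs.

    Returns one (threshold, precision, recall, f1) tuple per threshold.
    """
    curve = []
    for t in thresholds:
        tp = sum(1 for score, hit in scored_hits if score >= t and hit)
        fp = sum(1 for score, hit in scored_hits if score >= t and not hit)
        precision = tp / (tp + fp) if tp + fp else 1.0   # assumed convention
        recall = tp / num_ground_truth if num_ground_truth else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        curve.append((t, precision, recall, f1))
    return curve

# Hypothetical detections: (confidence, matched a ground-truth box?)
scored_hits = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
curve = sweep_thresholds(scored_hits, num_ground_truth=4, thresholds=[0.1, 0.5, 0.85])
```

Plotting precision against recall from `curve` gives a PR curve, and plotting F1 against the threshold gives an F1-score curve. Lowering the threshold moves the operating point toward higher recall and, typically, lower precision.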

Figure 6 shows the Precision–Recall curves of all evaluated models at the end of training (200 epochs), plotted within a single coordinate system to facilitate direct comparison across architectural variants. The PR curves reveal clear and systematic differences in confidence behavior among the evaluated models. The baseline YOLO detector exhibits a typical trade-off pattern, in which Precision decreases steadily as Recall increases. This behavior indicates a moderate separation between true-positive and false-positive confidence distributions and is consistent with the baseline performance observed in Section 4.1. As Recall approaches higher values, the baseline model gradually admits more detections at the cost of increasing false positives, reflecting limited confidence discrimination.

The YOLO + Coordinate Attention (CA) model demonstrates a noticeably different PR profile. Precision remains comparatively high across a wider range of Recall values, particularly in the mid-to-high Recall region. This indicates that detections produced by the CA-enhanced model retain higher confidence even as Recall increases. In practical terms, this suggests that CA improves the separability of vessel-related features from background clutter, allowing the detector to operate reliably under more permissive operating points. Such behavior is advantageous in maritime surveillance scenarios, where missing vessels—especially small or visually ambiguous ones—may pose a greater risk than generating occasional false alarms.

However, the PR curve of YOLO + CA also shows a steeper decline in Precision near the upper Recall range compared to the baseline. Once Recall is pushed beyond an optimal region, Precision degrades rapidly, indicating a sudden increase in false positives. This behavior is consistent with the conservative detection profile observed in aggregate metrics, where CA achieves high Precision but slightly reduced Recall. Together, these observations indicate that CA provides strong confidence discrimination within a well-defined operating region, but exhibits increased sensitivity to threshold selection outside that region.

In contrast, the YOLO + CBAM model exhibits a smoother PR curve, with Precision decreasing more gradually as Recall increases. This smoother transition suggests a more balanced confidence distribution, in which marginal detections are incorporated without abrupt degradation in Precision. Such behavior aligns with the relatively high and stable F1-score reported for CBAM-based models in Section 4.1 and Section 4.2. From an operational perspective, this indicates that CBAM-enhanced detectors may be more robust to threshold misconfiguration, which is particularly valuable in dynamic nearshore environments where optimal operating points can vary over time.

The CLIP-enhanced variants display markedly different PR behavior. Their curves show a more pronounced drop in Precision as Recall increases, indicating that lowering the effective detection threshold rapidly introduces false positives. This pattern suggests unstable confidence calibration, where semantic fusion introduces uncertainty into the decision process rather than reinforcing vessel-related confidence separation. Even at full convergence, the PR performance of CLIP-based models remains inferior to that of attention-only variants, reinforcing the conclusion that semantic priors are not effectively exploited under the current dataset scale and domain characteristics.

These observations are consistent with recent findings on the limitations of vision–language models in domain-specific detection tasks. Prior studies in remote sensing and specialized visual domains report that CLIP-style representations may suffer from reduced transferability when applied to small-scale datasets or fine-grained object discrimination problems without domain-adaptive pretraining [28,29,30,31]. In such cases, global semantic embeddings trained on large natural-image corpora may not align well with structured environmental features or subtle intra-class differences. Given that the present study focuses on a relatively small maritime dataset and a binary vessel discrimination task with visually overlapping categories, the observed degradation of CLIP-enhanced variants is therefore consistent with known domain adaptation challenges rather than an isolated empirical anomaly.

From a representational perspective, the observed degradation can be interpreted as a calibration mismatch between global semantic embeddings and spatially localized detection features. CLIP embeddings encode holistic image–text alignment learned from large-scale natural image corpora, whereas the detection task requires fine-grained spatial discrimination between visually overlapping maritime categories. When fused additively without domain-adaptive reweighting, global semantic logits may perturb classification confidence without improving localization consistency. The systematic degradation observed across both PR and F1 curves suggests that the fusion introduces structural bias rather than random noise.

Beyond point-wise metric comparisons, the structural consistency of the PR curves provides additional qualitative evidence regarding performance stability. If the observed differences were driven primarily by stochastic training variance, one would expect irregular crossings or inconsistent ordering among model variants across confidence thresholds. However, the relative ordering of the models remains largely stable throughout the recall spectrum. CA maintains higher precision in the mid-recall region, CBAM exhibits smoother degradation behavior, and CLIP-based variants consistently underperform across operating regions. Such coherent and architecture-specific curve morphology suggests that the performance differences are systematic rather than incidental.

Moreover, the curve shapes themselves reveal distinct confidence calibration behaviors. CA demonstrates sharper precision transitions near high-recall regions, whereas CBAM presents a broader and more gradual trade-off profile. These structural characteristics persist across the entire threshold range and are unlikely to emerge purely from single-run randomness under a strictly controlled experimental setting. Therefore, curve-based analysis complements point-based metrics by providing additional inferential support for architectural effect differentiation.

Complementary insight is provided by the F1-score curves shown in Figure 7, which plot the F1-score as a function of the confidence threshold for all evaluated models. These curves highlight the operating regions in which Precision and Recall are most effectively balanced. The YOLO + CBAM model exhibits the broadest and most stable F1-score plateau, indicating consistent performance across a wide range of confidence thresholds. This stability confirms that CBAM achieves a robust balance between Precision and Recall and is less sensitive to precise threshold tuning.
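One way to make the notion of a plateau concrete is to measure the range of confidence thresholds over which F1 stays within a fixed tolerance of its peak. This definition is an illustrative assumption and not a measure reported in this study. The curves below are hypothetical and do not reproduce Figure 7.

```python
def f1_plateau(curve, tolerance=0.05):
    """curve: list of (threshold, f1) pairs.

    Returns (peak_f1, lowest_threshold, highest_threshold), where the two
    thresholds bound the points whose F1 is within `tolerance` (relative)
    of the peak. Contiguity of the region is not checked.
    """
    peak = max(f1 for _, f1 in curve)
    near_peak = [t for t, f1 in curve if f1 >= peak * (1.0 - tolerance)]
    return peak, min(near_peak), max(near_peak)

# Hypothetical F1 curves: one broad plateau, one sharp peak.
broad = [(0.1, 0.70), (0.2, 0.78), (0.3, 0.80), (0.4, 0.79), (0.5, 0.77), (0.6, 0.70)]
sharp = [(0.1, 0.60), (0.2, 0.70), (0.3, 0.81), (0.4, 0.72), (0.5, 0.60), (0.6, 0.50)]

broad_summary = f1_plateau(broad)
sharp_summary = f1_plateau(sharp)
```

In this toy example the sharp curve has the higher peak, yet only a single threshold keeps it near that peak. The broad curve tolerates a much wider range of thresholds, which is the kind of robustness to threshold selection discussed in this section.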

The YOLO + CA model, in contrast, shows a sharper and slightly higher F1-score peak concentrated around a narrower confidence interval. This pattern reflects its high-Precision but conservative detection strategy. While the peak F1-score is competitive, the narrower optimal region indicates greater sensitivity to threshold selection. In practice, this suggests that CA-based detectors can achieve excellent performance when properly calibrated, but may require more careful threshold management to maintain optimal operation.

The baseline YOLO model exhibits a moderate F1-score peak with a narrower plateau, consistent with its intermediate performance across evaluation metrics. The CLIP-enhanced variants show lower and less stable F1-score profiles, indicating limited robustness and reduced effectiveness across operating points. This behavior further supports the observation that semantic fusion complicates confidence calibration without delivering corresponding performance benefits in the current experimental setting.

Taken together, the PR and F1 curve analyses demonstrate that attention mechanisms influence not only absolute detection accuracy, but also confidence calibration and operational robustness. Coordinate Attention enhances confidence separation at the cost of increased threshold sensitivity, whereas CBAM provides more stable performance across a wider operating range. Semantic fusion, despite its theoretical appeal, introduces instability in confidence estimation under small-data maritime conditions. These findings highlight the importance of curve-based evaluation for understanding detector behavior in real-world maritime applications, where robustness and adaptability are as critical as peak accuracy.

4.5. Contributions of This Study

This study makes the following contributions to maritime vessel detection research and the application of artificial intelligence in marine environments:

A strictly controlled comparative evaluation framework is established to isolate the architectural effects of attention mechanisms and semantic fusion within a unified YOLOv8n-based maritime detection pipeline. All model variants share identical data partitions, optimization settings, and computational configurations, enabling fair and interpretable comparison.
Representative attention mechanisms with distinct design philosophies—CA and CBAM—are systematically compared under identical maritime conditions, revealing architecture-specific behavioral differences in precision–recall trade-offs, localization robustness, and confidence calibration.
A CLIP-based semantic fusion strategy is evaluated as an auxiliary branch within the detection framework, providing empirical evidence regarding the limitations of naïve vision–language integration in small-scale, fine-grained maritime classification tasks.
Multi-level performance analysis is conducted, including aggregate metrics, relative performance changes, class-level confusion matrices, and curve-based confidence calibration assessment. This layered evaluation strategy provides deeper insight into how architectural modifications influence operational behavior in nearshore maritime environments.
The study offers practical guidance for selecting lightweight attention mechanisms in deployment-oriented maritime perception systems, particularly in applications where small-vessel detection and threshold robustness are critical.

5. Discussion and Limitations

5.1. Effectiveness of Attention Mechanisms in Nearshore Maritime Environments

The experimental results consistently demonstrate that attention mechanisms can improve maritime vessel detection performance when integrated into a lightweight YOLOv8-based framework. However, the nature and magnitude of these improvements depend strongly on the specific attention design employed.

CA and CBAM exhibit distinct behavioral characteristics that reflect their underlying design philosophies. CA enhances channel-wise feature discrimination while embedding directional spatial information through separate horizontal and vertical encoding. This design appears particularly effective for detecting small and visually ambiguous vessels, such as fishing boats operating in cluttered nearshore environments. Although CA lowers aggregate Recall relative to the baseline (Table 2), its Recall for the fishing boat class is marginally higher (0.851 versus 0.847 in Table 4), which suggests that this mechanism may strengthen sensitivity to subtle vessel-related cues that might otherwise be suppressed in standard convolutional pipelines.
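For reference, the directional encoding described above can be written compactly. The equations below follow the original CA formulation [7]; the notation comes from that paper, not from this article. Here $x_c$ is channel $c$ of an input of height $H$ and width $W$, $F_1$, $F_h$, and $F_w$ are learned $1 \times 1$ convolutions, $\delta$ is a non-linear activation, and $\sigma$ is the sigmoid function.

```latex
\begin{align}
z_c^{h}(h) &= \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), &
z_c^{w}(w) &= \frac{1}{H} \sum_{0 \le j < H} x_c(j, w), \\
f &= \delta\!\left(F_1\!\left([z^{h}, z^{w}]\right)\right), &
[f^{h}, f^{w}] &= \operatorname{split}(f), \\
g^{h} &= \sigma\!\left(F_h(f^{h})\right), &
g^{w} &= \sigma\!\left(F_w(f^{w})\right), \\
y_c(i, j) &= x_c(i, j)\, g_c^{h}(i)\, g_c^{w}(j).
\end{align}
```

Because each position is rescaled by one gate indexed by its row and another indexed by its column, the module retains positional information along both axes, whereas global average pooling would discard it.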

CBAM, on the other hand, applies sequential channel and spatial attention, allowing the network to jointly model what features are important and where they are located. The smoother Precision–Recall and F1-score curves observed for CBAM-based models indicate that this composite refinement yields more balanced confidence distributions. From an operational standpoint, this balance translates into greater robustness to threshold selection, which is advantageous in maritime monitoring systems where operating conditions can change dynamically.
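The sequential refinement can likewise be summarized in equation form. The expressions below follow the original CBAM formulation [6] and are given for reference only. $F$ is the input feature map, $\otimes$ is element-wise multiplication with broadcasting, MLP is a shared two-layer perceptron, and $f^{7 \times 7}$ is a convolution with a $7 \times 7$ kernel.

```latex
\begin{align}
M_c(F) &= \sigma\!\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right), &
F' &= M_c(F) \otimes F, \\
M_s(F') &= \sigma\!\left(f^{7 \times 7}\!\left([\mathrm{AvgPool}(F');\, \mathrm{MaxPool}(F')]\right)\right), &
F'' &= M_s(F') \otimes F'.
\end{align}
```

The channel map $M_c$ addresses which features matter, and the spatial map $M_s$ addresses where they are located, matching the "what" and "where" roles described above.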

Importantly, the results suggest that increased architectural complexity does not automatically translate into improved detection performance in maritime environments. Although CBAM introduces additional spatial refinement capability, CA achieves competitive—and in some cases superior—performance with lower architectural overhead. This observation underscores the importance of environment-specific architectural evaluation rather than assuming monotonic gains from increased model complexity.

To provide qualitative insight into the detection behavior of the evaluated models, representative detection results are illustrated in Figure 8. The examples include both large merchant ships and small fishing boats under diverse nearshore conditions, including multi-object scenes, aerial viewpoints, and background clutter near harbor facilities.

The visualizations indicate that large ships are generally detected with stable bounding-box localization across model variants, whereas fishing boats exhibit greater sensitivity to feature refinement mechanisms due to their smaller spatial footprint and background ambiguity. Fishing boat instances appearing near coastal structures or partially occluded by environmental elements illustrate the inherent challenges of fine-grained vessel discrimination. These qualitative observations are consistent with the quantitative trends reported in Table 2, where architectural modifications primarily influence precision–recall balance and localization robustness for small and visually ambiguous targets.

5.2. Class-Specific Implications: Fishing Boats Versus General Ships

A key insight emerging from this study is that architectural enhancements primarily influence the detection of challenging vessel classes rather than well-defined ones. Across all experiments, performance differences among models are substantially more pronounced for fishing boats than for general ships.

Fishing boats present a uniquely difficult detection problem due to their small size, diverse appearances, and frequent operation in visually cluttered coastal and port-adjacent areas. Moreover, many fishing boats operate without continuous AIS transmission or lack AIS equipment altogether, increasing reliance on vision-based perception for situational awareness. In this context, improvements in fishing boat detection carry greater operational significance than equivalent improvements in detecting larger ships.

The confusion-matrix analysis shows that attention mechanisms affect the balance between Precision and Recall for fishing boats in different ways. CA prioritizes sensitivity, capturing a larger proportion of fishing boat instances at the expense of increased false positives, whereas CBAM achieves a more conservative balance. These behaviors align with the curve-based analyses and suggest that attention mechanisms influence not only detection accuracy but also risk profiles in operational settings.
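The class-level metrics behind this comparison follow directly from the confusion-matrix counts. The short Python sketch below recomputes Precision, Recall, and F1-score for the fishing boat class from the TP, FP, and FN values listed in Table 4. The helper name `prf` is an illustrative choice.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# (TP, FP, FN) for the fishing boat class, copied from Table 4.
FISHING_BOAT = {
    "YOLO (Baseline)": (155, 51, 28),
    "YOLO + CA": (149, 57, 26),
    "YOLO + CBAM": (154, 52, 36),
    "YOLO + CA + CLIP": (148, 58, 33),
    "YOLO + CBAM + CLIP": (137, 69, 46),
}

for model, (tp, fp, fn) in FISHING_BOAT.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{model:20s} P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Read this way, CA's higher fishing-boat Recall comes from fewer missed instances (FN = 26 versus 28 for the baseline), while its lower Precision reflects the additional false positives (FP = 57 versus 51).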

In contrast, detection performance for general ships remains relatively stable across architectural variants. Ships typically exhibit larger spatial footprints, more consistent geometric structures, and clearer separation from background elements, making them less sensitive to feature-enhancement strategies. This disparity underscores the importance of class-level analysis in maritime vision studies, as aggregate metrics alone may obscure meaningful improvements for operationally critical vessel types.

A brief qualitative inspection of representative misdetections further clarifies this class sensitivity. False negatives for fishing boats often occur in small-scale or partially occluded instances near harbor structures, where vessel contours are weakly separated from background textures. Conversely, certain false positives arise from background elements such as small docked crafts, buoys, or wave patterns that share superficial geometric characteristics with fishing vessels. These recurring error patterns indicate that fishing boat detection difficulty is primarily driven by scale ambiguity and background similarity rather than by general detector instability. Attention mechanisms modify how strongly such ambiguous features are amplified, thereby altering the sensitivity–specificity balance observed in Section 4.

5.3. Limitations of Semantic Fusion with Vision–Language Models

One limitation of the present semantic fusion experiment is that CLIP was integrated in a frozen configuration using a simple additive fusion strategy. Although this design isolates the architectural effect without introducing additional fine-tuning variables, it does not explore alternative adaptation strategies such as prompt refinement, gated fusion, or partial encoder updating. Therefore, the reported performance degradation should be interpreted as evidence that naïve semantic fusion, without domain-adaptive optimization, may not reliably improve fine-grained maritime vessel discrimination under limited data conditions. It should be emphasized that the objective was not to replace convolutional backbones with CLIP, but to test whether frozen global semantic embeddings can provide complementary calibration effects under controlled maritime conditions.
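A minimal sketch of additive fusion at the classification-logit level is given below, assuming the arrangement described in the Figure 5 caption: similarity logits between a frozen global image embedding and per-class text embeddings are added to the detector's class logits. The fusion weight `alpha`, the scale factor `scale`, the toy embeddings, and all function names are hypothetical. The sketch uses plain Python lists in place of CLIP outputs and is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_logits(image_emb, text_embs, scale=10.0):
    """One similarity logit per class; `scale` is an assumed constant."""
    return [scale * cosine(image_emb, t) for t in text_embs]


def fuse_logits(det_logits, sem_logits, alpha=0.1):
    """Additive fusion: detector class logits plus weighted semantic logits."""
    return [d + alpha * s for d, s in zip(det_logits, sem_logits)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Toy stand-ins for frozen embeddings (hypothetical values), classes: ship, fishing boat.
image_emb = [0.2, 0.9, 0.1]
text_embs = [[0.1, 0.8, 0.3], [0.7, 0.2, 0.1]]

sem = semantic_logits(image_emb, text_embs)
for box_logits in ([1.5, -0.5], [-1.0, 0.8]):  # class logits of two boxes in one image
    fused = fuse_logits(box_logits, sem)
    print([round(sigmoid(z), 3) for z in fused])
```

If the image embedding is global, as the Figure 5 caption indicates, the same semantic offset is added to every box in the image regardless of that box's content. This is one plausible reason why such fusion could shift confidence scores without improving per-object discrimination.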

Importantly, the degradation appears consistently across evaluation metrics and confidence-threshold analyses rather than in isolated measurements. This consistency suggests that the limitation is structural rather than incidental, reinforcing the calibration analysis presented in Section 4.4. Future research may investigate whether domain-adaptive pretraining or spatially aware semantic integration mechanisms can mitigate this calibration gap.

5.4. Operational Implications for Maritime Perception Systems

From a practical perspective, the findings of this study offer several implications for the design of maritime perception systems. First, attention mechanisms provide a lightweight and effective means of improving detection performance without requiring extensive architectural redesign. In particular, CA and CBAM can be integrated into existing YOLO-based pipelines with minimal overhead, making them suitable for real-time deployment on resource-constrained platforms.

Second, curve-based evaluation reveals behavioral characteristics that are not captured by point-based metrics alone. The width and stability of F1-score plateaus, as well as the shape of PR curves, provide valuable insight into threshold sensitivity and operational robustness. For maritime systems operating in dynamic environments, such robustness may be more important than marginal gains in peak accuracy.

Third, the results suggest that semantic fusion using large pre-trained models should not be adopted indiscriminately. Without sufficient domain-specific data or adaptation mechanisms, semantic priors may degrade rather than enhance detection performance. Attention-based feature refinement appears to offer a more reliable and interpretable improvement pathway under current maritime data constraints.

5.5. Limitations and Future Research Directions

A limitation of this study is that each architectural variant was evaluated using a single controlled training run with a fixed random seed. While repeated randomized trials and formal statistical significance testing (e.g., confidence intervals or hypothesis tests across multiple seeds) would provide a stronger quantitative assessment of variance, such multi-seed experimentation was beyond the scope of the present work.

Nevertheless, several methodological considerations reduce the likelihood that the reported performance differences are driven primarily by stochastic training noise. First, the study strictly follows a single-variable control principle: only the architectural component (attention module or semantic fusion mechanism) was modified, while dataset split, augmentation policy, optimizer configuration, training schedule, and computational environment were held constant across all experiments. This deterministic setup minimizes uncontrolled variance sources.

Second, the observed trends are not inferred from a single scalar metric. Instead, consistent directional behavior is observed across multiple evaluation dimensions, including mAP@0.5, mAP@0.5:0.95, Precision–Recall curves, F1-score curves, and class-level confusion matrices. The relative ordering among CA, CBAM, baseline, and CLIP-based variants remains stable across these complementary criteria.

Third, curve-based analyses reveal architecture-specific confidence calibration patterns across the full confidence-threshold spectrum. If the differences were dominated by random initialization effects, irregular crossings or inconsistent ordering among models would be expected. However, the structured morphology of the curves suggests systematic architectural influence rather than incidental fluctuation.

Future research may extend this work in two complementary directions. First, multi-seed evaluations and formal statistical testing may be conducted to further quantify effect size and performance variance across random initializations. Second, benchmarking the evaluated architectural modules on additional public maritime datasets under harmonized evaluation protocols would enable broader cross-method comparison beyond the controlled framework adopted in this study. Despite these limitations, the consistent multi-metric trends observed under strictly controlled experimental conditions provide informative and interpretable evidence regarding the relative behavioral characteristics of the evaluated attention mechanisms.
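As an indication of what the proposed multi-seed analysis could look like, the following Python sketch computes a mean and a 95% Student-t confidence interval for per-seed scores, and for the paired difference between two model variants trained with the same seeds. The per-seed values are made-up placeholders, not experimental results, and the function names are illustrative.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def mean_ci(values):
    """Return (mean, lower, upper) of a 95% t-interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    half_width = T_CRIT_95[n - 1] * sem
    return mean, mean - half_width, mean + half_width


def paired_difference_ci(scores_a, scores_b):
    """95% t-interval for the mean of per-seed differences (a - b)."""
    return mean_ci([a - b for a, b in zip(scores_a, scores_b)])


# PLACEHOLDER per-seed scores (hypothetical, five seeds each).
PLACEHOLDER_A = [0.50, 0.52, 0.51, 0.49, 0.53]  # e.g., a baseline variant
PLACEHOLDER_B = [0.53, 0.54, 0.52, 0.53, 0.55]  # e.g., an attention variant

diff, lo, hi = paired_difference_ci(PLACEHOLDER_B, PLACEHOLDER_A)
print(f"mean difference = {diff:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

A paired interval that excludes zero would indicate that the difference between two variants is larger than the seed-to-seed variation, which the single-run design of the present study cannot establish.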

6. Conclusions

This study presented a systematic and controlled evaluation of attention mechanisms and semantic fusion strategies for maritime vessel detection within a unified YOLOv8-based framework. Rather than proposing a new detection architecture, the research focused on isolating the effects of representative feature-enhancement techniques under identical training, data, and evaluation conditions. This design enabled a clear and interpretable analysis of how architectural choices influence detection behavior in realistic nearshore maritime environments.

The experimental results indicate that attention mechanisms can improve maritime vessel detection performance, particularly for visually challenging targets such as fishing boats operating in coastal and port-adjacent areas, although the aggregate gains are modest (for example, F1-score changes below 1% relative to the baseline in Table 3). Among the evaluated approaches, lightweight attention designs proved especially effective. Coordinate Attention enhanced sensitivity to small and ambiguous vessels by strengthening directional feature encoding, while the Convolutional Block Attention Module achieved a more balanced trade-off between Precision and Recall through sequential channel–spatial refinement. These improvements were reflected not only in aggregate metrics but also in curve-based analyses, which revealed meaningful differences in confidence calibration and operational robustness.

In contrast, semantic fusion using a large pre-trained vision–language model did not yield performance gains under the evaluated conditions. CLIP-enhanced variants exhibited unstable confidence behavior and reduced detection effectiveness, particularly at lower confidence thresholds. These findings suggest that semantic priors alone are insufficient to improve maritime vessel detection in data-constrained settings and may introduce additional uncertainty without careful domain adaptation.

A key contribution of this study lies in its emphasis on behavioral analysis beyond point-based metrics. By examining Precision–Recall and F1-score curves alongside class-level performance, the study highlights how attention mechanisms influence not only detection accuracy but also threshold sensitivity and robustness—factors that are critical for real-world maritime applications. The results further show that architectural enhancements primarily affect difficult, small-scale vessel classes, underscoring the importance of class-aware evaluation in maritime computer vision research.

Despite these contributions, the study is limited by its dataset size, binary classification scope, and reliance on a single baseline detector. Future research should extend this analysis to larger and more diverse maritime datasets, explore multi-class vessel detection scenarios, and investigate alternative semantic fusion strategies with stronger domain alignment. Additionally, integrating attention mechanisms within multi-modal maritime perception systems represents a promising direction for enhancing situational awareness in complex marine environments.

Overall, this work provides practical insights into the selection and evaluation of attention mechanisms for vision-based maritime vessel detection. The findings suggest that lightweight attention-based feature refinement offers a reliable and deployment-friendly pathway for improving detection performance in nearshore maritime settings, while also highlighting the limitations of semantic fusion approaches under current data constraints.

Author Contributions

Conceptualization, C.L. and S.L.; methodology, C.L.; software, C.L.; validation, C.L. and S.L.; formal analysis, C.L.; investigation, C.L. and S.L.; resources, C.L.; data curation, C.L.; writing—original draft preparation, C.L.; writing—review and editing, S.L.; visualization, C.L.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Ministry of Trade, Industry and Energy (MOTIE) and Korea Institute for Advancement of Technology (KIAT) through the International Cooperative R&D program (P0028528_Maritime Single Window for Harbour Clearance and Compliance (2024)).

Data Availability Statement

https://github.com/culee-kmou/YOLOv8withAttention (accessed on 5 February 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kanjir, U.; Greidanus, H.; Ostir, K. Vessel Detection and Classification from Spaceborne Optical Images: A Literature Survey. Remote Sens. Environ. 2018, 207, 1–26.
2. Li, J.; Xu, C.; Su, H.; Gao, L.; Wang, T. Deep Learning for SAR Ship Detection: Past, Present and Future. Remote Sens. 2022, 14, 2712.
3. Prasad, D.K.; Prasath, C.K.; Rajan, D.; Rachmawati, L.; Rajabally, E.; Quek, C. Challenges in Video-Based Object Detection in Maritime Scenarios Using Computer Vision. arXiv 2016, arXiv:1608.01079.
4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
5. Ultralytics. YOLOv8 Models Documentation. 2023. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 5 February 2026).
6. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
7. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
8. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
9. Xie, F.; Lin, B.; Liu, Y. Research on the Coordinate Attention Mechanism Fuse in a YOLOv5 Deep Learning Detector for the SAR Ship Detection Task. Sensors 2022, 22, 3370.
10. Zhang, J.; Wang, Y.; Li, H.; Liu, X. Ship Target Detection Based on CBAM-YOLOv8. Proc. SPIE 2024, 13071, 3025482.
11. Jiang, Z.; Su, L.; Sun, Y. YOLOv7-Ship: A Lightweight Algorithm for Ship Object Detection in Complex Marine Environments. J. Mar. Sci. Eng. 2024, 12, 190.
12. Trinh, L.; Mercelis, S.; Anwar, A. A Comprehensive Review of Datasets and Deep Learning Techniques for Vision in Unmanned Surface Vehicles. arXiv 2024, arXiv:2412.01461.
13. Guo, H.; Gu, D. Closely Arranged Inshore Ship Detection Using a Bi-Directional Attention Feature Pyramid Network. Int. J. Remote Sens. 2023, 44, 7106–7125.
14. International Maritime Organization (IMO). AIS Transponders. Available online: https://www.imo.org/en/ourwork/safety/pages/ais.aspx (accessed on 5 February 2026).
15. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. arXiv 2021, arXiv:2103.00020.
16. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 3686–3696.
17. Hoehner, F.; Langenohl, V.; Akyol, S.; el Moctar, O.; Schellin, T.E. Object Detection and Tracking in Maritime Environments in Case of Person-Overboard Scenarios: An Overview. J. Mar. Sci. Eng. 2024, 12, 2038.
18. Chen, Z.; Ding, Z.; Zhang, X.; Wang, X.; Zhou, Y. Inshore Ship Detection Based on Multi-Modality Saliency for Synthetic Aperture Radar Images. Remote Sens. 2023, 15, 3868.
19. Kiefer, B.; Kristan, M.; Perš, J.; Žust, L.; Poiesi, F.; Andrade, F.A.; Bernardino, A.; Dawkins, M.; Raitoharju, J.; Quan, Y.; et al. 1st Workshop on Maritime Computer Vision (MaCVi) 2023: Challenge Results. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, Waikoloa, HI, USA, 3–7 January 2023; pp. 265–302.
20. Bovcon, B.; Muhovič, J.; Perš, J.; Kristan, M. The MaSTr1325 Dataset for Training Deep USV Obstacle Detection Models. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 3431–3438.
21. Gao, Z.; Yu, X.; Rong, X.; Wang, W. Improved YOLOv8n for Lightweight Ship Detection. J. Mar. Sci. Eng. 2024, 12, 1774.
22. Yu, C.; Shin, Y. An Efficient YOLO for Ship Detection in SAR Images via Channel Shuffled Reparameterized Convolution Blocks and Dynamic Head. ICT Express 2024, 10, 673–679.
23. Zhao, T.; Zhang, Z.; Li, X. Ship Detection with Deep Learning in Optical Remote Sensing Images. Remote Sens. 2024, 16, 1145.
24. Žust, L.; Kristan, M. Learning with Weak Annotations for Robust Maritime Obstacle Detection. Sensors 2022, 22, 9139.
25. Ke, X.; Zhang, T.; Shao, Z. Scale-aware Dimension-wise Attention Network for Small Ship Instance Segmentation in Synthetic Aperture Radar Images. J. Appl. Remote Sens. 2023, 17, 046504.
26. Ke, X.; Cao, J.; Zhang, T.; Shao, Z. SLA-Net: A Novel Sea–Land Aware Network for Accurate SAR Ship Detection Guided by Hierarchical Attention Mechanism. Remote Sens. 2025, 17, 3576.
27. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
28. Mall, U.; Phoo, C.P.; Liu, M.K.; Vondrick, C.; Hariharan, B.; Bala, K. Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment. arXiv 2023, arXiv:2312.06960.
29. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A Vision–Language Foundation Model for Remote Sensing. arXiv 2023, arXiv:2306.11029.
30. Cha, K.; Yu, D.; Seo, J. Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations. arXiv 2024, arXiv:2409.07048.
31. Lorencin, I.; Frank, D.; Vusić, D. Zero-Shot Learning in Maritime Domain: Classification of Marine Objects using CLIP. Pomorstvo 2024, 38, 239–249.
32. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Conditional Prompt Learning for Vision-Language Models. arXiv 2022, arXiv:2203.05557.
33. Leng, J.; Ye, Y.; Mo, M.; Gao, C.; Gan, J.; Xiao, B.; Gao, X. Recent Advances for Aerial Object Detection: A Survey. ACM Comput. Surv. 2024, 56, 1–36.

Figure 1. Conceptual illustration of representative attention mechanisms integrated into convolutional neural networks. (a) CBAM: Convolutional Block Attention Module [6]; (b) Coordinate Attention (CA) [7]. Diagrams are redrawn by the authors for conceptual comparison.

Figure 2. Overall experimental workflow. A fixed maritime dataset is partitioned into training, validation, and test subsets (70/15/15 split). Attention mechanisms and semantic fusion modules are inserted into a unified YOLOv8n framework. All variants are trained under identical optimization settings and evaluated using standardized detection metrics.

Figure 3. Representative samples from the maritime vessel dataset used in this study.

Figure 4. Architectural insertion points of CA and CBAM within the YOLOv8n framework used in all experiments. The attention module is applied after the final backbone feature block and before PAN-FPN multi-scale aggregation under the fixed training protocol described in Section 3.5.

Figure 5. CLIP-based semantic fusion framework. Global image embeddings and textual class embeddings are projected into a shared representation space, and similarity-based semantic logits are combined with YOLOv8 classification outputs prior to final decision computation. The CLIP branch was frozen during training, and fusion was performed additively at the classification logit level.

Figure 6. Precision–Recall (PR) curves of all evaluated YOLOv8n-based model variants after 200 training epochs. All models were trained under identical dataset partitions (70/15/15 split) and optimization settings as described in Section 3.5. Evaluation was performed on the held-out test set using IoU = 0.5.

Figure 7. F1-score curves of all evaluated YOLOv8n-based model variants as a function of confidence threshold after 200 training epochs. Curves were computed on the fixed test set under identical experimental conditions.

Figure 8. Representative detection results of YOLOv8-based model variants after 200 training epochs on the held-out test set.

Table 1. Training Configuration and Hyperparameter Settings.

Parameter	Setting
Input Resolution	640 × 640
Epochs	200
Batch Size	16
Optimizer	SGD
Initial Learning Rate	0.01
Momentum	0.937
Weight Decay	0.0005
Learning Rate Scheduler	Cosine decay
Warmup Epochs	3
Data Augmentation	Mosaic (1.0), horizontal flip (0.5), scale (0.5), translate (0.1), HSV (h = 0.015, s = 0.7, v = 0.4)
Mixed Precision	FP16 enabled
Random Seed	Fixed
Loss Functions	Box: CIoU + DFL; Classification: BCE; Objectness: BCE
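
For readers who wish to reproduce a comparable setup, the settings in Table 1 map onto training arguments of the Ultralytics Python package roughly as sketched below. This mapping is an assumption rather than the authors' script. The dataset file `vessels.yaml` and the model definition `yolov8n.yaml` are hypothetical placeholders, and `seed=0` is an arbitrary stand-in because Table 1 states only that the seed was fixed. The loss functions are set by the framework and are not configured here.

```python
# Training arguments mirroring Table 1 (argument names follow the Ultralytics package).
TRAIN_ARGS = dict(
    imgsz=640,            # input resolution 640 x 640
    epochs=200,
    batch=16,
    optimizer="SGD",
    lr0=0.01,             # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    cos_lr=True,          # cosine learning-rate decay
    warmup_epochs=3,
    mosaic=1.0,
    fliplr=0.5,           # horizontal flip probability
    scale=0.5,
    translate=0.1,
    hsv_h=0.015,
    hsv_s=0.7,
    hsv_v=0.4,
    amp=True,             # mixed precision
    seed=0,               # ASSUMED value; the paper reports only "Fixed"
    deterministic=True,
)


def train(data_yaml="vessels.yaml", model_cfg="yolov8n.yaml"):
    """Launch training; both default paths are hypothetical placeholders."""
    from ultralytics import YOLO  # imported lazily so the config can be inspected without it

    model = YOLO(model_cfg)
    return model.train(data=data_yaml, **TRAIN_ARGS)
```

Keeping every argument in one dictionary and changing only the model definition between runs is one straightforward way to enforce the single-variable control principle described in Section 5.5.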

Table 2. Overall detection performance comparison (200 epochs).

Model	Precision	Recall	F1-Score	mAP@0.5	mAP@0.5:0.95
YOLO (Baseline)	0.8725	0.7387	0.8000	0.8036	0.5855
YOLO + CA	0.9347	0.7040	0.8032	0.8261	0.6127
YOLO + CBAM	0.8842	0.7383	0.8047	0.8051	0.5780
YOLO + CA + CLIP	0.8365	0.7243	0.7763	0.7742	0.5702
YOLO + CBAM + CLIP	0.8252	0.7020	0.7586	0.7512	0.5393

Table 3. Performance changes relative to YOLO baseline (%).

Model	Precision	Recall	F1-Score	mAP@0.5	mAP@0.5:0.95
YOLO + CA	+7.13	−4.69	+0.39	+2.80	+4.65
YOLO + CBAM	+1.34	−0.05	+0.58	+0.19	−1.29
YOLO + CA + CLIP	−4.13	−1.95	−2.96	−3.67	−2.61
YOLO + CBAM + CLIP	−5.42	−4.97	−5.18	−6.53	−7.90
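
The relative changes in Table 3 can be recomputed from the rounded values in Table 2 as (value − baseline) / baseline × 100. The Python sketch below does so; the dictionary layout and the function name `relative_change` are illustrative choices.

```python
METRICS = ("precision", "recall", "f1", "map50", "map50_95")

# Values copied from Table 2, in the order given by METRICS.
TABLE2 = {
    "YOLO (Baseline)": (0.8725, 0.7387, 0.8000, 0.8036, 0.5855),
    "YOLO + CA": (0.9347, 0.7040, 0.8032, 0.8261, 0.6127),
    "YOLO + CBAM": (0.8842, 0.7383, 0.8047, 0.8051, 0.5780),
    "YOLO + CA + CLIP": (0.8365, 0.7243, 0.7763, 0.7742, 0.5702),
    "YOLO + CBAM + CLIP": (0.8252, 0.7020, 0.7586, 0.7512, 0.5393),
}


def relative_change(value, baseline):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


baseline = TABLE2["YOLO (Baseline)"]
for model, values in TABLE2.items():
    if model == "YOLO (Baseline)":
        continue
    changes = [relative_change(v, b) for v, b in zip(values, baseline)]
    print(model, " ".join(f"{name}={c:+.2f}%" for name, c in zip(METRICS, changes)))
```

Recomputing from the four-decimal values in Table 2 reproduces Table 3 to within about 0.01 percentage points. The small last-digit differences in a few cells are what would be expected if Table 3 was derived from unrounded metrics.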

Table 4. Confusion matrix statistics for the fishing boat class.

Model	TP	FP	TN	FN	Precision	Recall	F1-Score
YOLO (Baseline)	155	51	61	28	0.752	0.847	0.797
YOLO + CA	149	57	63	26	0.723	0.851	0.782
YOLO + CBAM	154	52	62	36	0.748	0.811	0.778
YOLO + CA + CLIP	148	58	60	33	0.718	0.818	0.765
YOLO + CBAM + CLIP	137	69	65	46	0.665	0.749	0.704

Table 5. Confusion matrix statistics for the ship class.

Model	TP	FP	TN	FN	Precision	Recall	F1-Score
YOLO (Baseline)	44	10	233	8	0.815	0.846	0.830
YOLO + CA	42	12	232	9	0.778	0.824	0.800
YOLO + CBAM	43	11	241	9	0.796	0.827	0.811
YOLO + CA + CLIP	41	13	236	9	0.759	0.820	0.789
YOLO + CBAM + CLIP	43	11	251	12	0.796	0.782	0.790


Share and Cite

MDPI and ACS Style

Lee, C.; Lee, S. Effectiveness of Attention Mechanisms in YOLOv8 for Maritime Vessel Detection. J. Mar. Sci. Eng. 2026, 14, 433. https://doi.org/10.3390/jmse14050433

