1. Introduction
Maritime vessel detection is a fundamental task in a wide range of marine science and engineering applications, including maritime surveillance, collision avoidance, traffic monitoring, and port operations [1,2]. With the rapid adoption of artificial intelligence technologies in the maritime domain, vision-based perception systems have become an essential component for extracting situational information from marine environments. Deep learning-based object detection models are increasingly applied to maritime imagery to support operational awareness in coastal and near-shore waters [1,2,3].
Among recent object detection approaches, the YOLO (You Only Look Once) family has been widely adopted because of its real-time inference capability and favorable balance between detection accuracy and computational efficiency [4,5]. As YOLO-based models are increasingly deployed in practical maritime applications, improving their robustness and reliability under marine environmental conditions has become an important research topic [2,3]. Attention mechanisms have emerged as a promising technique for enhancing feature representation by selectively emphasizing informative components of convolutional feature maps, and their integration into object detection networks has demonstrated notable performance improvements across various application domains [6,7,8]. Several attention mechanisms, including channel-based, spatial, and hybrid designs, have been proposed and incorporated into convolutional neural networks [6,7,8]. In the maritime context, recent studies have reported performance gains by applying attention modules, most notably composite structures such as the Convolutional Block Attention Module (CBAM), to YOLO-based vessel detection models [9,10,11]. However, existing studies typically focus on demonstrating performance improvements using a single attention mechanism, often under different datasets and experimental settings. As a result, the relative effectiveness of different attention designs in maritime environments remains insufficiently understood, since the designs have rarely been compared in a like-for-like configuration.
Maritime imagery presents unique characteristics that distinguish it from terrestrial visual environments. Factors such as sea surface reflections, variable illumination, scale ambiguity, and visually homogeneous backgrounds can significantly influence feature extraction and attention behavior [2,3,12]. These characteristics suggest that the performance of attention mechanisms in maritime applications may not directly follow trends observed in general-purpose computer vision tasks. It remains an open question whether more complex attention structures that jointly model channel and spatial information consistently outperform simpler attention mechanisms in marine environments, especially when training data are limited. Beyond detecting fixed maritime objects such as lighthouses or buoys, distinguishing between vessel types poses an additional challenge in maritime object detection. In coastal waters and port-adjacent areas, small fishing boats frequently operate in close proximity to general ships, often exhibiting similar visual appearances in terms of shape, color, and scale [12,13]. Moreover, many fishing boats in such environments either do not continuously transmit Automatic Identification System (AIS) signals or operate without AIS equipment altogether, making vision-based detection and classification a critical means of situational awareness [14]. From a marine operational perspective, accurately distinguishing fishing boats from general ships is therefore both challenging and important, as fishing vessels are commonly associated with irregular maneuvers and increased navigational risk in congested waters [1,2,13].
Accordingly, this study focuses on a binary vessel classification task involving general ships and fishing boats, representing a practically relevant and technically challenging scenario in maritime vessel detection. Within a unified YOLOv8-based framework [5], two representative attention mechanisms, Coordinate Attention (CA) [7] and the Convolutional Block Attention Module (CBAM) [6], together with semantic fusion based on Contrastive Language–Image Pre-training (CLIP) [15], are systematically integrated and evaluated under identical experimental conditions. Using a balanced maritime image dataset, controlled experiments are conducted to analyze detection performance in terms of precision, recall, F1-score, and mean average precision (mAP). The results demonstrate that a relatively simple channel-based attention mechanism can outperform more complex attention designs in maritime environments, while also revealing limitations of semantic fusion based on large pre-trained vision–language models when applied to small-scale maritime datasets. Through these findings, this study provides practical insights into the application of attention mechanisms for vision-based artificial intelligence in marine science and engineering.
3. Methodology
The objective of this study is to isolate and evaluate the influence of attention mechanisms and semantic fusion on maritime vessel detection using a single-factor experimental design, where only the feature-enhancement component is varied and the remaining pipeline is kept unchanged. Rather than proposing a new detection architecture, the study adopts a comparative methodology in which architectural components are modified while all other variables are held constant. This design enables a systematic investigation of how feature-enhancement strategies behave in a realistic maritime classification scenario.
Figure 2 presents the overall experimental workflow. The pipeline begins with dataset preparation and annotation, followed by a fixed train–validation–test split shared across all model variants. A unified YOLOv8-based framework is then extended with different attention mechanisms and optional semantic fusion [5,15]. Each model is trained using identical procedures and evaluated on the same test set using standardized detection metrics. By enforcing consistency in data partitioning, optimization settings, and evaluation protocols, the methodology ensures that observed performance differences arise from the architectural components under study rather than from experimental variability.
This controlled comparison framework is particularly important in maritime computer vision, where dataset size is often limited and evaluation results can be sensitive to data partitioning [16,19].
3.1. Dataset Construction and Annotation Protocol
The dataset used in this study consists of 420 RGB maritime images collected from publicly available maritime image repositories and operational coastal monitoring sources. The imagery primarily represents nearshore and port-adjacent environments, where visual complexity is typically high due to background clutter, sea surface reflections, and coastal infrastructure. The dataset is intentionally structured as a balanced binary classification problem, comprising 210 images labeled as ship and 210 labeled as fishing boat. This class balance eliminates bias arising from skewed class distributions and enables direct comparison of class-specific detection performance.
The images were acquired using optical RGB cameras deployed in shore-based surveillance systems and small unmanned aerial platforms operating in coastal waters. The spatial resolution of the images ranges from 640 × 480 to 1280 × 720 pixels. The dataset includes scenes captured under diverse environmental conditions, including clear and overcast weather, varying illumination levels (morning, midday, and late afternoon), and moderate sea states. These variations were intentionally preserved to reflect realistic operational conditions rather than curated laboratory scenarios.
Representative examples of the dataset are shown in Figure 3. The examples provide visual insight into typical nearshore operational complexity, including small-scale fishing vessels, mixed sea–land backgrounds, and viewpoint variation between aerial and shore-based perspectives. In particular, fishing boats frequently appear with limited spatial footprint and partial background interference, illustrating the intrinsic difficulty of fine-grained vessel discrimination in cluttered coastal environments.
All images were manually annotated using axis-aligned bounding boxes surrounding the target vessels. Annotation was performed using a standardized labeling protocol to ensure consistency across classes. A quality-control stage was conducted to review ambiguous cases, particularly for small craft exhibiting limited distinguishing visual features. Ambiguous samples were re-examined to maintain class definition integrity and minimize labeling noise. Because the primary objective of this study is architectural comparison rather than dataset development, the dataset functions as a controlled experimental benchmark representative of nearshore visual complexity.
Prior to training, all images were resized to a unified input resolution of 640 × 640 pixels, consistent with the default YOLOv8n configuration. Aspect ratio was preserved through padding to avoid geometric distortion. Pixel values were normalized following the standard YOLO preprocessing pipeline. No class-specific preprocessing was applied.
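The aspect-preserving resize described above can be illustrated with a short geometry calculation. The following is a minimal sketch, assuming symmetric padding around the resized image; the function name letterbox_geometry is hypothetical, and details of the actual YOLOv8 preprocessing (such as the padding value or interpolation method) are not reproduced here.

```python
def letterbox_geometry(width, height, target=640):
    """Return (new_w, new_h, pad_left, pad_top, pad_right, pad_bottom).

    The image is scaled by a single factor so that its longer side fits the
    target size, then padded to a square. Symmetric padding is an assumption.
    """
    scale = min(target / width, target / height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_w = target - new_w
    pad_h = target - new_h
    pad_left = pad_w // 2
    pad_top = pad_h // 2
    return new_w, new_h, pad_left, pad_top, pad_w - pad_left, pad_h - pad_top


# The two resolution extremes reported for the dataset.
for w, h in [(640, 480), (1280, 720)]:
    print((w, h), "->", letterbox_geometry(w, h))
```

Under these assumptions, a 640 × 480 image is left unscaled and padded vertically, whereas a 1280 × 720 image is halved to 640 × 360 before padding, so vessels in the larger images occupy fewer pixels after preprocessing.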
Data augmentation was performed during training using the default YOLOv8 augmentation strategy. This includes random horizontal flipping, scaling, translation, and mosaic augmentation. No additional augmentation techniques were selectively applied to any individual model variant. The same augmentation configuration was maintained across all experiments to ensure strict experimental fairness and eliminate confounding effects.
The dataset was partitioned into training, validation, and test subsets using a fixed split ratio of 70%/15%/15%. Specifically, 294 images were used for training, 63 images for validation, and 63 images for testing. Class balance was preserved within each subset to maintain equal proportions of ship and fishing boat instances. The split was performed once and reused identically across all model variants to prevent sampling variability and ensure reproducibility. No cross-validation or re-sampling was applied.
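A fixed, class-stratified split of this kind can be sketched as follows. This is an illustrative example only: the file names, the seed value, and the function stratified_split are hypothetical and do not describe the tooling actually used. Because each class leaves 63 images after the training portion is removed, an exact per-class 50/50 division between validation and test is not possible at the image level; the sketch alternates the extra image between classes so that both subsets contain 63 images.

```python
import random


def stratified_split(items_by_class, percents=(70, 15, 15), seed=0):
    """Split each class separately so every subset stays close to balanced."""
    rng = random.Random(seed)  # fixed seed: the same split on every run
    splits = {"train": [], "val": [], "test": []}
    for index, label in enumerate(sorted(items_by_class)):
        items = sorted(items_by_class[label])
        rng.shuffle(items)
        n_train = len(items) * percents[0] // 100
        rest = len(items) - n_train
        n_val = rest * percents[1] // (percents[1] + percents[2])
        # If the leftover count is odd, alternate which subset receives the
        # extra image (assumes equal validation and test percentages).
        if rest % 2 == 1 and index % 2 == 0:
            n_val += 1
        splits["train"] += [(name, label) for name in items[:n_train]]
        splits["val"] += [(name, label) for name in items[n_train:n_train + n_val]]
        splits["test"] += [(name, label) for name in items[n_train + n_val:]]
    return splits


# Hypothetical file names standing in for the 210 + 210 images.
dataset = {
    "ship": [f"ship_{i:03d}.jpg" for i in range(210)],
    "fishing_boat": [f"fishing_boat_{i:03d}.jpg" for i in range(210)],
}
splits = stratified_split(dataset)
print({name: len(items) for name, items in splits.items()})
```

Saving the resulting file lists once and reusing them for every model variant is what keeps the comparison free of sampling variability.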
3.2. Baseline Detection Framework
All experiments are built on YOLOv8 nano (YOLOv8n), a lightweight single-stage detector designed for efficient real-time inference [5]. YOLO-based detectors are widely used in practical maritime monitoring systems because they provide a favorable balance between computational efficiency and detection accuracy. The nano variant is selected to represent a deployment-relevant baseline where architectural overhead must remain constrained.
The baseline model remains unchanged across experiments except for the insertion of attention modules and semantic fusion components. This approach allows the study to evaluate architectural enhancements in isolation without confounding effects from backbone replacement or large-scale structural redesign. The baseline therefore serves as a stable reference point against which all enhancements are measured.
3.3. Integration of Attention Mechanisms
Two representative attention mechanisms, CA and CBAM, were selected to represent distinct architectural design philosophies: lightweight positional channel attention versus composite channel–spatial attention [6,7]. Rather than proposing a new detection architecture, this study evaluates how these mechanisms influence maritime vessel detection when integrated into a unified YOLOv8n framework under strictly controlled experimental conditions.
To ensure architectural comparability and reproducibility, the insertion locations of the attention modules were explicitly defined. Both CA and CBAM were inserted at the same structural position within the YOLOv8n pipeline. Specifically, the attention module was applied after the final C2f block of the backbone and immediately before the feature aggregation stage in the PAN-FPN neck. This location corresponds to the highest-level semantic feature representation prior to multi-scale fusion.
The rationale for this placement is twofold. First, high-level backbone features contain rich contextual information relevant for vessel-type discrimination, particularly in visually ambiguous nearshore environments. Second, inserting the attention module before feature pyramid aggregation allows the refined features to propagate consistently across all detection scales without altering the downstream detection head.
No modifications were applied to the detection head or loss formulation; because YOLOv8 uses an anchor-free head, no anchor configuration is involved. The YOLOv8n backbone, neck, and head structures were preserved to maintain computational fairness across model variants. The nano variant was selected to reflect deployment-relevant conditions in maritime monitoring systems, where real-time performance and limited onboard computational resources are critical considerations.
Coordinate Attention encodes directional spatial information into channel attention through separate global pooling operations along horizontal and vertical directions. This mechanism introduces positional awareness while maintaining lightweight computational overhead. In contrast, CBAM sequentially applies channel attention followed by spatial attention, enabling the network to refine both feature importance and spatial saliency. Although more expressive, CBAM introduces additional parameters and computational complexity.
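The directional pooling that distinguishes Coordinate Attention from purely global channel pooling can be illustrated on a single-channel toy feature map. The sketch below is deliberately simplified and is an assumption-laden illustration, not the module used in the experiments: the learned 1 × 1 convolutions, normalization, and non-linearity that sit between pooling and gating in the published CA design are omitted, and the gates are obtained by applying a sigmoid directly to the pooled values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def directional_pooling(feature_map):
    """Average along each row and each column of one channel (H x W list)."""
    height = len(feature_map)
    width = len(feature_map[0])
    row_means = [sum(row) / width for row in feature_map]
    col_means = [
        sum(feature_map[r][c] for r in range(height)) / height
        for c in range(width)
    ]
    return row_means, col_means


def simplified_coordinate_gate(feature_map):
    """Re-weight each position by a row gate and a column gate.

    Simplification: the real module transforms the pooled vectors with
    learned layers before the sigmoid; that step is skipped here.
    """
    row_means, col_means = directional_pooling(feature_map)
    row_gate = [sigmoid(v) for v in row_means]
    col_gate = [sigmoid(v) for v in col_means]
    return [
        [feature_map[r][c] * row_gate[r] * col_gate[c] for c in range(len(col_gate))]
        for r in range(len(row_gate))
    ]


# Toy feature map with one strong response at row 1, column 2.
toy_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
rows, cols = directional_pooling(toy_map)
print("row means:", rows)
print("column means:", cols)
```

The row and column means retain where the response occurred, whereas a single global average over the whole map would discard that positional information.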
By fixing the insertion location and avoiding additional structural tuning, this study isolates the intrinsic behavioral differences between CA and CBAM. Therefore, any observed performance differences can be attributed to the internal feature refinement strategies of the attention modules rather than to architectural placement optimization.
Figure 4 illustrates the architectural insertion point of the attention modules within the YOLOv8n backbone–neck pipeline.
3.4. CLIP-Based Semantic Fusion
To explore whether semantic priors derived from large-scale vision–language pre-training can enhance maritime vessel classification, a CLIP-based semantic fusion mechanism was integrated into the YOLOv8 framework. Unlike the attention mechanisms, which operate directly on convolutional feature maps, CLIP introduces language-aligned global semantic representations.
The motivation for incorporating CLIP stems from recent research suggesting that vision–language models may provide complementary semantic context, particularly in cases where visual distinctions between classes are subtle [15]. In nearshore maritime environments, general ships and fishing boats may share similar geometric structures and color patterns. Therefore, integrating semantic embeddings was hypothesized to improve fine-grained discrimination.
However, it should be clarified that CLIP was not introduced as a presumed superior detection backbone. Most maritime detection frameworks rely exclusively on convolutional feature extractors, because CLIP is pre-trained for image-level image–text alignment rather than for spatial localization. The objective of incorporating CLIP in this study is therefore exploratory and comparative: to empirically assess whether global semantic priors learned from large-scale image–text corpora can provide complementary benefits in a domain-specific, small-scale maritime detection setting. By maintaining a frozen configuration and a simple additive fusion strategy, the experiment isolates the effect of semantic logits without introducing additional adaptive mechanisms.
Importantly, CLIP was not used as a backbone replacement. Instead, it was incorporated as an auxiliary semantic branch. For each input image, a global image embedding was generated using the pre-trained CLIP image encoder. Simultaneously, textual prompts corresponding to the class labels were encoded using the CLIP text encoder. The textual prompts were defined in a simple descriptive format (“a photo of a ship” and “a photo of a fishing boat”) to maintain semantic clarity.
Both the CLIP image encoder and text encoder were frozen during training. This decision was made to prevent instability caused by fine-tuning a large pre-trained model on a relatively small maritime dataset. The cosine similarity between image embeddings and text embeddings was computed to generate semantic logits representing similarity-based class confidence.
Fusion was performed at the classification stage using an additive fusion strategy. Specifically, the semantic logits derived from CLIP were linearly combined with the YOLOv8 detector’s classification logits prior to softmax normalization. No gating mechanism or feature concatenation was applied to avoid introducing additional trainable parameters that could confound the controlled comparison framework.
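The additive fusion step can be sketched in a few lines. The example below is illustrative only: the embeddings are two-dimensional toy vectors rather than real CLIP outputs, and the fusion weight and similarity scale are hypothetical parameters, since the text does not state the coefficients of the linear combination.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def fuse_logits(detector_logits, image_embedding, text_embeddings,
                weight=1.0, scale=1.0):
    """Add similarity-based semantic logits to detector logits, then softmax.

    `weight` and `scale` are assumed values for illustration.
    """
    semantic_logits = [
        scale * cosine_similarity(image_embedding, text)
        for text in text_embeddings
    ]
    fused = [d + weight * s for d, s in zip(detector_logits, semantic_logits)]
    return softmax(fused), semantic_logits


# Toy stand-ins: index 0 = "a photo of a ship", index 1 = "a photo of a fishing boat".
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_embedding = [1.0, 0.0]
detector_logits = [0.0, 0.0]  # detector undecided between the two classes

probabilities, semantic = fuse_logits(detector_logits, image_embedding, text_embeddings)
print("semantic logits:", semantic)
print("fused probabilities:", probabilities)
```

Because the semantic logits are computed from a single global image embedding, the same offset is applied to every detection in an image, which is one reason such a fusion may offer little help when several vessel types appear in one scene.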
The spatial localization branch of YOLOv8 remained unaffected. Thus, CLIP contributes only to classification confidence refinement and does not influence bounding-box regression.
Figure 5 presents the overall CLIP-based semantic fusion architecture.
By structuring the experiment in this manner, the study evaluates whether large-scale semantic representations provide complementary benefits in a data-constrained maritime detection scenario, without altering the underlying spatial detection architecture. The results therefore offer empirical evidence regarding the practical utility and limitations of naïve semantic fusion in maritime detection systems.
3.5. Training Protocol and Experimental Control
To ensure strict experimental fairness and reproducibility, all model variants were trained under an identical configuration. Only the architectural component under investigation (attention module or semantic fusion mechanism) was modified, while all other parameters were fixed across experiments.
All experiments were conducted using an NVIDIA GeForce RTX 3080 Ti GPU (12 GB VRAM). The nano variant of YOLOv8 (YOLOv8n) was selected to reflect deployment-relevant computational constraints in maritime perception systems. The complete training configuration is summarized in Table 1.
No hyperparameter retuning was performed for individual model variants. The same dataset split, augmentation policy, optimization settings, and training schedule were strictly reused across the baseline, CA, CBAM, and CLIP-based models. This design follows a single-variable control principle and ensures that observed performance differences originate from architectural modifications rather than optimization advantages.
For the CLIP-enhanced variants, both the image encoder and text encoder were kept frozen during training. Only the YOLO detection layers and the fusion module were updated. This setting follows common practice in small-scale downstream adaptation scenarios where full fine-tuning may lead to overfitting. The design choice was made to maintain optimization stability under the limited dataset size and to isolate the effect of semantic feature integration without introducing additional fine-tuning variables. By explicitly reporting the experimental configuration and computational environment, this study enhances methodological transparency and facilitates reproducibility.
3.6. Evaluation Metrics and Comparison Strategy
Model performance was evaluated using standard object detection metrics: Precision, Recall, F1-score, mAP@0.5, and mAP@0.5:0.95. Precision measures the proportion of predicted detections that are correct, while Recall measures the proportion of ground-truth objects that are successfully detected. The F1-score, the harmonic mean of Precision and Recall, summarizes the balance between these two measures. mAP summarizes the precision–recall curve across confidence thresholds and is computed at a specified IoU threshold. mAP@0.5 evaluates detection accuracy at an IoU threshold of 0.5, while mAP@0.5:0.95 averages results across IoU thresholds from 0.5 to 0.95 in steps of 0.05, following COCO-style evaluation practice [27]. Using both metrics allows the study to assess not only detection presence but also localization quality.
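For reference, the count-based metrics and the IoU threshold grid can be written out as a short sketch. The counts in the example are hypothetical and are not results from this study; the helper names are likewise illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP@0.5:0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_over_thresholds(ap_per_threshold):
    """Average AP values that were computed at each IoU threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


# Hypothetical counts for illustration only.
print(precision_recall_f1(tp=8, fp=2, fn=2))
print(coco_iou_thresholds())
```

A model can therefore score well on mAP@0.5 while scoring lower on mAP@0.5:0.95 if its boxes overlap the targets only loosely, since the stricter thresholds penalize imprecise localization.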
Performance is reported per model variant under identical evaluation conditions. The comparison emphasizes relative behavior across architectural designs rather than absolute accuracy claims, enabling clear interpretation of attention effectiveness in maritime classification.
4. Results and Analysis
This section presents a comprehensive analysis of the experimental results obtained from a controlled evaluation of attention mechanisms and semantic fusion strategies within a unified YOLOv8-based maritime vessel detection framework. Rather than limiting the discussion to aggregate accuracy values, the analysis focuses on explaining how different architectural choices influence detection behavior, class discrimination, localization robustness, and confidence calibration in realistic nearshore maritime environments. All experiments were conducted under identical training conditions, dataset partitions, and evaluation protocols. This unified configuration ensures that performance differences arise from architectural design rather than training variability.
4.1. Overall Detection Performance
The overall detection performance reveals clear differences in how architectural enhancements influence maritime vessel detection under controlled experimental conditions. Although the baseline YOLOv8n model already demonstrates reasonable performance across all evaluation metrics, the introduction of attention mechanisms and semantic fusion leads to systematic and interpretable changes in detection behavior.
Attention-based feature refinement consistently affects both classification reliability and localization accuracy. This trend is evident when comparing Precision, Recall, F1-score, and mAP metrics across all evaluated models, as shown in Table 2. The results indicate that architectural modifications inside the detector backbone and neck play a decisive role in shaping detection performance in visually challenging maritime environments.
Among the evaluated variants, YOLO + Coordinate Attention (CA) achieves the highest localization performance, as reflected by its superior mAP@0.5 and mAP@0.5:0.95 scores. The improvement under the stricter IoU-based metric indicates enhanced bounding-box precision rather than merely increased detection confidence. In maritime imagery, where vessels are often partially occluded by waves, wakes, or reflections, such localization robustness is essential for reliable situational awareness.
The CA-enhanced model also achieves the highest Precision, suggesting a strong ability to suppress false-positive detections originating from background clutter such as sea surface texture, coastal structures, or harbor facilities. However, this improvement is accompanied by a reduction in Recall, indicating that the model adopts a more conservative detection strategy. From an operational perspective, this behavior reflects a trade-off between detection completeness and confidence reliability.
In contrast, YOLO + CBAM exhibits a more balanced performance profile. Although its mAP improvements are modest compared to CA, it achieves the highest F1-score among all models. This result suggests that CBAM improves feature discrimination while preserving detection completeness, maintaining Recall at a level comparable to the baseline. Such balanced behavior may be advantageous in applications where missing vessels are unacceptable and comprehensive situational awareness is required.
The CLIP-enhanced variants perform consistently worse than their attention-only counterparts and, in several metrics, fall below the baseline model. Despite their increased architectural complexity, these models fail to exploit semantic information effectively, indicating that semantic fusion does not provide complementary benefits under the current dataset scale and task formulation.
It is important to clarify that each model variant was trained under a single controlled run with a fixed random seed. While repeated multi-seed experiments and formal statistical hypothesis testing would provide additional quantitative confidence intervals, the objective of this study is not leaderboard-style optimization but controlled architectural comparison. All models were trained under strictly identical configurations, including dataset split, augmentation policy, optimization schedule, and computational environment. Therefore, relative performance differences cannot be attributed to differences in data, optimization settings, or hardware, although a single run per variant cannot fully exclude seed-dependent effects. Furthermore, the observed performance shifts exhibit consistent directional behavior across multiple evaluation metrics. For example, Coordinate Attention systematically increases precision and mAP@0.5:0.95 while slightly reducing recall, whereas CBAM maintains recall stability and achieves the highest F1-score. These consistent directional tendencies across classification and localization metrics reduce the likelihood that the differences are attributable solely to stochastic training variance. Instead, they suggest stable behavioral characteristics associated with each attention mechanism.
It should also be emphasized that the objective of this experimental section is not to claim superiority over previously published vessel detection models evaluated on heterogeneous datasets, but to examine architectural behavior under a unified and controlled framework. Many existing maritime detection studies report performance on different benchmarks, sensor modalities (optical or SAR), resolution settings, and evaluation thresholds, making direct numerical comparison across papers potentially misleading. Rather than conducting cross-dataset leaderboard-style comparison, this study focuses on internally consistent evaluation, isolating architectural effects under identical experimental conditions. Future work may extend this analysis by benchmarking the evaluated modules on additional public maritime datasets under harmonized protocols to enable broader cross-method comparison.
4.2. Relative Performance Improvement over the Baseline
While absolute performance metrics provide an overall comparison, analyzing relative changes with respect to the baseline detector offers clearer insight into the net impact of each architectural modification. This perspective is particularly useful for isolating how attention mechanisms and semantic fusion alter detection behavior, independent of baseline performance levels. The relative performance changes across all metrics are summarized in Table 3. These values highlight both the magnitude and direction of performance shifts introduced by each architectural enhancement.
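A change with respect to the baseline can be expressed either as a relative percentage or as an absolute difference in percentage points, and the two can differ noticeably. The sketch below shows both conventions under the assumption that metrics are stored as fractions in [0, 1]; the numerical values are hypothetical and are not taken from Table 3, and the text does not state which convention was used.

```python
def relative_change_percent(baseline, variant):
    """Change relative to the baseline value, in percent."""
    return (variant - baseline) / baseline * 100.0


def absolute_change_points(baseline, variant):
    """Absolute difference, in percentage points."""
    return (variant - baseline) * 100.0


# Hypothetical metric values for illustration only.
baseline_precision = 0.80
variant_precision = 0.84

print(f"relative change: {relative_change_percent(baseline_precision, variant_precision):+.2f}%")
print(f"absolute change: {absolute_change_points(baseline_precision, variant_precision):+.2f} points")
```

In this toy case the same improvement reads as +5.00% in relative terms and +4.00 percentage points in absolute terms, so the convention should be kept in mind when comparing values across tables.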
The relative improvement analysis clearly identifies Coordinate Attention as the most impactful architectural enhancement. The substantial increase in Precision (+7.13%) indicates a marked reduction in false-positive detections, while the strong improvement in mAP@0.5:0.95 (+4.65%) confirms enhanced localization robustness under strict IoU evaluation. These gains demonstrate that CA primarily improves the quality and spatial consistency of detections rather than simply increasing detection quantity.
The observed reduction in Recall for YOLO + CA further clarifies the nature of this improvement. Rather than indicating inferior detection capability, the decrease reflects a shift toward conservative detection behavior, where only high-confidence predictions are retained. In safety-critical maritime applications, such behavior may be preferable, as it reduces the likelihood of false alarms that could distract operators or trigger unnecessary responses.
YOLO + CBAM shows a contrasting pattern. Its relative changes are smaller in magnitude, but Recall remains essentially unchanged relative to the baseline. This stability, combined with moderate Precision gains, explains why CBAM achieves the highest F1-score. From a design standpoint, this suggests that CBAM enhances discrimination without aggressively filtering marginal detections, resulting in a more balanced detection profile.
The semantic fusion variants exhibit consistent negative changes across all metrics, with particularly severe degradation in localization-related measures. The large decrease in mAP@0.5:0.95 for YOLO + CBAM + CLIP (−7.90%) suggests that semantic embeddings introduce instability into spatial feature learning. This systematic degradation suggests a fundamental misalignment between semantic priors and the visual representations learned from a small maritime dataset.
Although the absolute numerical differences between CA and CBAM appear moderate (within a few percentage points), their behavioral profiles are structurally distinct. CA consistently improves localization robustness (mAP@0.5:0.95) and precision while reducing recall, indicating a conservative confidence calibration strategy. In contrast, CBAM preserves recall stability and achieves the highest F1-score, suggesting balanced feature refinement. If the performance differences were driven purely by stochastic training noise, one would expect inconsistent or contradictory shifts across metrics. However, the directionality of changes remains stable across precision, recall, F1-score, and both mAP metrics. Moreover, the CLIP-enhanced variants exhibit systematic degradation across all metrics rather than random fluctuation, further supporting the interpretation that architectural design—not random variance—is the dominant factor shaping performance behavior. Therefore, while the magnitude of improvement is not large, the consistency of metric trends supports the conclusion that CA and CBAM introduce distinguishable and reproducible behavioral characteristics in maritime vessel detection.
4.3. Class-Level Analysis Using Confusion Matrices
While aggregate metrics provide a concise summary of overall detection performance, they can obscure class-specific behaviors that are particularly important in maritime vessel detection. This limitation is especially relevant in nearshore environments, where different vessel types exhibit distinct visual characteristics, operational patterns, and levels of detection difficulty. To address this issue, class-level performance is analyzed using confusion matrix-derived statistics, with separate evaluations for the fishing boat and ship classes. For the confusion-matrix statistics, a prediction is counted as a true positive when a detected box matches a ground-truth vessel of the corresponding class with IoU ≥ 0.5 at the selected confidence threshold; unmatched detections are counted as false positives, and missed ground-truth instances are counted as false negatives (true negatives are computed with respect to non-target cases under the same decision rule).
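The counting rule described above can be sketched as a simple matching procedure. The example is a minimal illustration under stated assumptions: detections are matched greedily in descending confidence order, each ground-truth box can be matched at most once, and the confidence threshold of 0.25 is a hypothetical value because the selected threshold is not stated. The boxes and scores are invented for demonstration.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def count_matches(detections, ground_truths, target_class,
                  iou_threshold=0.5, conf_threshold=0.25):
    """Return (TP, FP, FN) for one class using greedy one-to-one matching."""
    preds = sorted(
        (d for d in detections
         if d["label"] == target_class and d["score"] >= conf_threshold),
        key=lambda d: d["score"],
        reverse=True,
    )
    targets = [g for g in ground_truths if g["label"] == target_class]
    matched = set()
    tp = fp = 0
    for pred in preds:
        best_index, best_iou = None, 0.0
        for index, target in enumerate(targets):
            if index in matched:
                continue
            overlap = iou(pred["box"], target["box"])
            if overlap > best_iou:
                best_index, best_iou = index, overlap
        if best_index is not None and best_iou >= iou_threshold:
            matched.add(best_index)
            tp += 1
        else:
            fp += 1  # unmatched detection
    fn = len(targets) - len(matched)  # missed ground-truth instances
    return tp, fp, fn


# Invented example data.
ground_truths = [
    {"label": "fishing_boat", "box": (10, 10, 50, 50)},
    {"label": "fishing_boat", "box": (100, 100, 140, 140)},
    {"label": "ship", "box": (200, 200, 300, 260)},
]
detections = [
    {"label": "fishing_boat", "box": (12, 12, 50, 50), "score": 0.90},
    {"label": "fishing_boat", "box": (300, 300, 340, 340), "score": 0.60},
    {"label": "fishing_boat", "box": (100, 100, 140, 140), "score": 0.10},
]
print(count_matches(detections, ground_truths, "fishing_boat"))
```

In this toy case the low-confidence third detection is discarded by the threshold, so its ground-truth box counts as a miss; raising or lowering the threshold shifts counts between false positives and false negatives, which is the trade-off examined in the class-level analysis.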
The class-level performance for fishing boats is summarized in Table 4. Across all evaluated models, a consistent pattern of higher Recall than Precision is observed. This imbalance reflects the inherent difficulty of accurately distinguishing small fishing vessels from background structures and other maritime objects. Fishing boats frequently operate close to coastlines and ports, where visual clutter is abundant and background elements such as buoys, docks, breakwaters, and small service craft are common. In addition, fishing boats lack standardized visual features and vary widely in size, shape, and color, further complicating reliable classification. The difficulty observed in fishing boat detection is also consistent with prior maritime and remote-sensing studies reporting that small-object recognition in cluttered coastal environments remains a persistent challenge for convolutional detectors [25,26,31]. Small vessels often occupy only a limited number of pixels and exhibit weak contrast against dynamic sea–land backgrounds, which amplifies feature ambiguity during multi-scale aggregation. The present results therefore align with established findings that architectural refinements tend to exert stronger effects on small, visually ambiguous targets than on large, structurally distinct objects.
As a result, detection models tend to identify a large number of candidate fishing boat instances, which increases Recall but also leads to a higher rate of false positives. This behavior is observed across all architectural variants and highlights the intrinsic challenge of fishing boat detection rather than a limitation of any single model design.
Among the evaluated models, YOLO + Coordinate Attention (CA) achieves the highest Recall for the fishing boat class. This result indicates improved sensitivity to small and visually ambiguous targets, suggesting that CA enhances the representation of subtle vessel-related features that might otherwise be overlooked. However, this increased sensitivity is accompanied by a reduction in Precision, reflecting a higher incidence of false-positive detections. This class-level trade-off should be read alongside the aggregate-level profile, in which CA appears conservative (high Precision, slightly lower Recall): for the fishing boat class specifically, CA prioritizes capturing vessel-like patterns but does not fully suppress ambiguous background cues that resemble small vessels.
In contrast, YOLO + CBAM demonstrates a more balanced fishing boat detection behavior. Although its Recall is slightly lower than that of CA, its Precision is marginally higher, resulting in a more stable F1-score. This suggests that CBAM’s sequential channel–spatial refinement helps suppress some background-induced false detections while preserving sensitivity to true fishing boat instances. The inclusion of spatial attention likely contributes to filtering out background patterns that are spatially inconsistent with vessel structures, thereby improving classification reliability without excessively sacrificing Recall.
Overall, the fishing boat results indicate that architectural enhancements primarily influence sensitivity–specificity trade-offs for visually challenging, small-scale targets, and that different attention mechanisms favor different operational priorities.
The class-level performance for the ship class is summarized separately in Table 5. In contrast to fishing boats, performance differences among models are considerably smaller. All evaluated models achieve relatively high Precision and Recall, indicating that larger vessels are visually easier to detect and classify.
Ships typically exhibit more consistent geometric structures, larger spatial footprints, and clearer separation from background elements such as sea surface textures and coastal infrastructure. As a result, their detection performance is less sensitive to architectural refinements, and improvements introduced by attention mechanisms are comparatively modest at the class level.
This contrast between fishing boat and ship detection highlights an important insight: architectural enhancements primarily affect challenging, small-scale targets rather than well-defined large objects. Consequently, many of the improvements observed in aggregate performance metrics are largely driven by changes in fishing boat detection behavior rather than by improvements in ship detection.
This finding underscores the importance of class-level analysis when evaluating maritime detection systems. Relying solely on aggregate metrics may obscure meaningful performance differences that are operationally significant, particularly in nearshore environments where small vessels such as fishing boats play a disproportionate role in navigational risk.
In addition to qualitative error inspection, the confusion patterns suggest a mechanism-level interpretation. CA tends to amplify channel-wise discriminative cues, which increases sensitivity to vessel-like edge and texture patterns. While this enhances detection of small fishing boats, it also increases susceptibility to visually similar background structures, which may explain the precision–recall trade-off observed in Section 4.1. In contrast, CBAM’s sequential channel–spatial refinement appears to suppress certain background activations, resulting in fewer extreme confidence shifts and more stable recall behavior. These observations suggest that the class-level differences are not incidental, but are structurally linked to how each attention module redistributes feature saliency across spatial and channel dimensions.
4.4. Precision–Recall and F1-Score Curve Analysis
Point-based evaluation metrics such as Precision, Recall, and F1-score provide useful summaries of detection performance at a fixed confidence threshold. However, these metrics do not fully capture how detector behavior evolves as the confidence threshold varies. In practical maritime applications, detection thresholds are frequently adjusted in response to changing environmental conditions, sensor quality, and operational requirements. For this reason, curve-based evaluation using Precision–Recall (PR) and F1-score curves provides deeper insight into confidence calibration, robustness, and operational flexibility.
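As a rough illustration of how a PR curve is traced by sweeping the confidence threshold, the sketch below accumulates true and false positives over detections sorted by descending confidence. It assumes each detection has already been labeled as a true or false positive (for example at IoU ≥ 0.5); the function name and input format are hypothetical and do not reproduce the plotting pipeline used in the study.

```python
def pr_curve(scored, n_gt):
    """Trace (threshold, precision, recall) points.

    scored: iterable of (confidence, is_true_positive) pairs, one per detection
    n_gt:   number of ground-truth instances of the class
    Each returned point corresponds to accepting every detection whose
    confidence is at least the listed threshold.
    """
    points = []
    tp = fp = 0
    for conf, is_tp in sorted(scored, key=lambda s: -s[0]):
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_gt if n_gt else 0.0
        points.append((conf, precision, recall))
    return points
```

Lowering the threshold moves along the returned list: Recall can only stay flat or rise, while Precision falls each time a false positive is admitted, which is the trade-off that the curve shape summarizes.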
Figure 6 shows the Precision–Recall curves of all evaluated models at the end of training (200 epochs), plotted within a single coordinate system to facilitate direct comparison across architectural variants. The PR curves reveal clear and systematic differences in confidence behavior among the evaluated models. The baseline YOLO detector exhibits a typical trade-off pattern, in which Precision decreases steadily as Recall increases. This behavior indicates a moderate separation between true-positive and false-positive confidence distributions and is consistent with the baseline performance observed in Section 4.1. As Recall approaches higher values, the baseline model gradually admits more detections at the cost of increasing false positives, reflecting limited confidence discrimination.
The YOLO + Coordinate Attention (CA) model demonstrates a noticeably different PR profile. Precision remains comparatively high across a wider range of Recall values, particularly in the mid-to-high Recall region. This indicates that detections produced by the CA-enhanced model retain higher confidence even as Recall increases. In practical terms, this suggests that CA improves the separability of vessel-related features from background clutter, allowing the detector to operate reliably under more permissive operating points. Such behavior is advantageous in maritime surveillance scenarios, where missing vessels—especially small or visually ambiguous ones—may pose a greater risk than generating occasional false alarms.
However, the PR curve of YOLO + CA also shows a steeper decline in Precision near the upper Recall range compared to the baseline. Once Recall is pushed beyond an optimal region, Precision degrades rapidly, indicating a sudden increase in false positives. This behavior is consistent with the conservative detection profile observed in aggregate metrics, where CA achieves high Precision but slightly reduced Recall. Together, these observations indicate that CA provides strong confidence discrimination within a well-defined operating region, but exhibits increased sensitivity to threshold selection outside that region.
In contrast, the YOLO + CBAM model exhibits a smoother PR curve, with Precision decreasing more gradually as Recall increases. This smoother transition suggests a more balanced confidence distribution, in which marginal detections are incorporated without abrupt degradation in Precision. Such behavior aligns with the relatively high and stable F1-score reported for CBAM-based models in Section 4.1 and Section 4.2. From an operational perspective, this indicates that CBAM-enhanced detectors may be more robust to threshold misconfiguration, which is particularly valuable in dynamic nearshore environments where optimal operating points can vary over time.
The CLIP-enhanced variants display markedly different PR behavior. Their curves show a more pronounced drop in Precision as Recall increases, indicating that lowering the effective detection threshold rapidly introduces false positives. This pattern suggests unstable confidence calibration, where semantic fusion introduces uncertainty into the decision process rather than reinforcing vessel-related confidence separation. Even at full convergence, the PR performance of CLIP-based models remains inferior to that of attention-only variants, reinforcing the conclusion that semantic priors are not effectively exploited under the current dataset scale and domain characteristics.
These observations are consistent with recent findings on the limitations of vision–language models in domain-specific detection tasks. Prior studies in remote sensing and specialized visual domains report that CLIP-style representations may suffer from reduced transferability when applied to small-scale datasets or fine-grained object discrimination problems without domain-adaptive pretraining [28,29,30,31]. In such cases, global semantic embeddings trained on large natural-image corpora may not align well with structured environmental features or subtle intra-class differences. Given that the present study focuses on a relatively small maritime dataset and a binary vessel discrimination task with visually overlapping categories, the observed degradation of CLIP-enhanced variants is therefore consistent with known domain adaptation challenges rather than an isolated empirical anomaly.
From a representational perspective, the observed degradation can be interpreted as a calibration mismatch between global semantic embeddings and spatially localized detection features. CLIP embeddings encode holistic image–text alignment learned from large-scale natural image corpora, whereas the detection task requires fine-grained spatial discrimination between visually overlapping maritime categories. When fused additively without domain-adaptive reweighting, global semantic logits may perturb classification confidence without improving localization consistency. The systematic degradation observed across both PR and F1 curves suggests that the fusion introduces structural bias rather than random noise.
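The calibration argument can be made concrete with a toy example. The sketch below is a hypothetical form of additive fusion, not the fusion layer used in the study: it assumes per-box class logits from the detector, one image-level logit per class from the frozen semantic branch, and a scalar fusion weight, none of which are specified in the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_additively(box_logits, global_logits, weight):
    """Add the same image-level logit to the matching class logit of every box.

    box_logits:    list of per-box lists, one logit per class (detector head)
    global_logits: one logit per class for the whole image (semantic branch)
    weight:        scalar fusion weight (hypothetical parameter)
    """
    return [
        [b + weight * g for b, g in zip(box, global_logits)]
        for box in box_logits
    ]


def confidences(logits):
    """Per-box, per-class confidence after the sigmoid."""
    return [[sigmoid(z) for z in box] for box in logits]
```

Because the offset `weight * g` is identical for every box in the image, the logit gap between a well-localized vessel and a background candidate of the same class is unchanged, while both confidences shift together. Under these assumptions the fusion moves boxes across a fixed threshold without adding any location-specific evidence, which is one way to read the systematic shift in the PR and F1 curves.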
Beyond point-wise metric comparisons, the structural consistency of the PR curves provides additional qualitative evidence regarding performance stability. If the observed differences were driven primarily by stochastic training variance, one would expect irregular crossings or inconsistent ordering among model variants across confidence thresholds. However, the relative ordering of the models remains largely stable throughout the recall spectrum. CA maintains higher precision in the mid-recall region, CBAM exhibits smoother degradation behavior, and CLIP-based variants consistently underperform across operating regions. Such coherent and architecture-specific curve morphology suggests that the performance differences are systematic rather than incidental.
Moreover, the curve shapes themselves reveal distinct confidence calibration behaviors. CA demonstrates sharper precision transitions near high-recall regions, whereas CBAM presents a broader and more gradual trade-off profile. These structural characteristics persist across the entire threshold range and are unlikely to emerge purely from single-run randomness under a strictly controlled experimental setting. Therefore, curve-based analysis complements point-based metrics by providing additional qualitative support for differentiating architectural effects, although it does not replace multi-seed statistical testing.
Complementary insight is provided by the F1-score curves shown in Figure 7, which plot the F1-score as a function of the confidence threshold for all evaluated models. These curves highlight the operating regions in which Precision and Recall are most effectively balanced. The YOLO + CBAM model exhibits the broadest and most stable F1-score plateau, indicating consistent performance across a wide range of confidence thresholds. This stability indicates that CBAM achieves a robust balance between Precision and Recall and is less sensitive to precise threshold tuning.
The YOLO + CA model, in contrast, shows a sharper and slightly higher F1-score peak concentrated around a narrower confidence interval. This pattern reflects its high-Precision but conservative detection strategy. While the peak F1-score is competitive, the narrower optimal region indicates greater sensitivity to threshold selection. In practice, this suggests that CA-based detectors can achieve excellent performance when properly calibrated, but may require more careful threshold management to maintain optimal operation.
The baseline YOLO model exhibits a moderate F1-score peak with a narrower plateau than that of CBAM, consistent with its intermediate performance across evaluation metrics. The CLIP-enhanced variants show lower and less stable F1-score profiles, indicating limited robustness and reduced effectiveness across operating points. This behavior further supports the observation that semantic fusion complicates confidence calibration without delivering corresponding performance benefits in the current experimental setting.
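The notion of a "broad" versus "narrow" F1-score plateau can be quantified in several ways; the text does not define one. The sketch below uses one possible definition, chosen here purely for illustration: the contiguous range of confidence thresholds around the peak where the F1-score stays within a relative tolerance (5% by default, an assumed value) of its maximum.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; zero when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def plateau_width(thresholds, f1_values, tolerance=0.05):
    """Contiguous threshold range around the F1 peak.

    thresholds: list of confidence thresholds in ascending order
    f1_values:  list of F1-scores at those thresholds (same length)
    tolerance:  relative drop from the peak still counted as plateau
    Returns (lower_threshold, upper_threshold, width).
    """
    peak = max(f1_values)
    peak_idx = f1_values.index(peak)
    floor = (1 - tolerance) * peak
    lo = hi = peak_idx
    while lo > 0 and f1_values[lo - 1] >= floor:
        lo -= 1
    while hi < len(f1_values) - 1 and f1_values[hi + 1] >= floor:
        hi += 1
    return thresholds[lo], thresholds[hi], thresholds[hi] - thresholds[lo]
```

Under this definition, a detector with a wide plateau tolerates a larger error in the chosen operating threshold before its F1-score drops noticeably, which is the sense of threshold robustness discussed above.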
Taken together, the PR and F1 curve analyses demonstrate that attention mechanisms influence not only absolute detection accuracy, but also confidence calibration and operational robustness. Coordinate Attention enhances confidence separation at the cost of increased threshold sensitivity, whereas CBAM provides more stable performance across a wider operating range. Semantic fusion, despite its theoretical appeal, introduces instability in confidence estimation under small-data maritime conditions. These findings highlight the importance of curve-based evaluation for understanding detector behavior in real-world maritime applications, where robustness and adaptability are as critical as peak accuracy.
4.5. Contributions of This Study
This study makes the following contributions to maritime vessel detection research and the application of artificial intelligence in marine environments:
A strictly controlled comparative evaluation framework is established to isolate the architectural effects of attention mechanisms and semantic fusion within a unified YOLOv8n-based maritime detection pipeline. All model variants share identical data partitions, optimization settings, and computational configurations, enabling fair and interpretable comparison.
Representative attention mechanisms with distinct design philosophies—CA and CBAM—are systematically compared under identical maritime conditions, revealing architecture-specific behavioral differences in precision–recall trade-offs, localization robustness, and confidence calibration.
A CLIP-based semantic fusion strategy is evaluated as an auxiliary branch within the detection framework, providing empirical evidence regarding the limitations of naïve vision–language integration in small-scale, fine-grained maritime classification tasks.
Multi-level performance analysis is conducted, including aggregate metrics, relative performance changes, class-level confusion matrices, and curve-based confidence calibration assessment. This layered evaluation strategy provides deeper insight into how architectural modifications influence operational behavior in nearshore maritime environments.
The study offers practical guidance for selecting lightweight attention mechanisms in deployment-oriented maritime perception systems, particularly in applications where small-vessel detection and threshold robustness are critical.
5. Discussion and Limitations
5.1. Effectiveness of Attention Mechanisms in Nearshore Maritime Environments
The experimental results consistently demonstrate that attention mechanisms can improve maritime vessel detection performance when integrated into a lightweight YOLOv8-based framework. However, the nature and magnitude of these improvements depend strongly on the specific attention design employed.
CA and CBAM exhibit distinct behavioral characteristics that reflect their underlying design philosophies. CA enhances channel-wise feature discrimination while embedding directional spatial information through separate horizontal and vertical encoding. This design appears particularly effective for detecting small and visually ambiguous vessels, such as fishing boats operating in cluttered nearshore environments. The improved Recall observed for CA-based models suggests that this mechanism strengthens sensitivity to subtle vessel-related cues that might otherwise be suppressed in standard convolutional pipelines.
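The separate horizontal and vertical encoding can be illustrated with a small NumPy sketch of a coordinate-attention forward pass on a single feature map. It is a simplified illustration rather than the module used in the study: batch normalization and bias terms are omitted, ReLU stands in for the nonlinearity, the 1×1 convolutions are written as plain matrix products over the channel axis, and all weight names and shapes are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def coordinate_attention(x, w_reduce, w_h, w_w):
    """Simplified coordinate attention on one feature map.

    x:        (C, H, W) feature map
    w_reduce: (M, C) shared channel-reduction weights (1x1 convolution)
    w_h, w_w: (C, M) weights producing the height and width gates
    Returns the reweighted feature map with shape (C, H, W).
    """
    _, h, _ = x.shape
    pooled_h = x.mean(axis=2)   # (C, H): averaged over width, keeps row position
    pooled_w = x.mean(axis=1)   # (C, W): averaged over height, keeps column position
    joint = np.concatenate([pooled_h, pooled_w], axis=1)   # (C, H + W)
    mid = np.maximum(w_reduce @ joint, 0.0)                # (M, H + W)
    mid_h, mid_w = mid[:, :h], mid[:, h:]
    gate_h = sigmoid(w_h @ mid_h)   # (C, H) per-row gate for each channel
    gate_w = sigmoid(w_w @ mid_w)   # (C, W) per-column gate for each channel
    return x * gate_h[:, :, None] * gate_w[:, None, :]
```

Each output location is scaled by the product of a row gate and a column gate, so position along both axes is retained in the attention weights while the cost stays close to that of channel attention.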
CBAM, on the other hand, applies sequential channel and spatial attention, allowing the network to jointly model what features are important and where they are located. The smoother Precision–Recall and F1-score curves observed for CBAM-based models indicate that this composite refinement yields more balanced confidence distributions. From an operational standpoint, this balance translates into greater robustness to threshold selection, which is advantageous in maritime monitoring systems where operating conditions can change dynamically.
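The sequential "what, then where" structure can be sketched in NumPy as follows. This is a simplified, assumption-laden illustration and not the implementation evaluated in the study: bias terms are omitted, the shared MLP and the spatial kernel are passed in as plain arrays with hypothetical names, and the spatial convolution is written as an explicit loop for clarity rather than speed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """Channel gate from average- and max-pooled descriptors.

    x: (C, H, W); w1: (M, C) and w2: (C, M) form a shared two-layer MLP.
    Returns a gate of shape (C,).
    """
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(x.mean(axis=(1, 2))) + mlp(x.max(axis=(1, 2))))


def spatial_attention(x, kernel):
    """Spatial gate from channel-wise average and max maps.

    kernel: (2, k, k) with odd k, applied as a zero-padded cross-correlation.
    Returns a gate of shape (H, W).
    """
    desc = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(desc, ((0, 0), (pad, pad), (pad, pad)))
    h, w = desc.shape[1:]
    out = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            # Weighted sum of the two descriptor maps at this kernel offset.
            out += np.tensordot(kernel[:, i, j], padded[:, i:i + h, j:j + w], axes=1)
    return sigmoid(out)


def cbam(x, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]
```

The spatial gate is computed from the channel-refined features, so the second stage can down-weight locations that remain weakly activated after channel selection, in line with the background-suppression behavior discussed in the text.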
Importantly, the results suggest that increased architectural complexity does not automatically translate into improved detection performance in maritime environments. Although CBAM introduces additional spatial refinement capability, CA achieves competitive—and in some cases superior—performance with lower architectural overhead. This observation underscores the importance of environment-specific architectural evaluation rather than assuming monotonic gains from increased model complexity.
To provide qualitative insight into the detection behavior of the evaluated models, representative detection results are illustrated in Figure 8. The examples include both large merchant ships and small fishing boats under diverse nearshore conditions, including multi-object scenes, aerial viewpoints, and background clutter near harbor facilities.
The visualizations indicate that large ships are generally detected with stable bounding-box localization across model variants, whereas fishing boats exhibit greater sensitivity to feature refinement mechanisms due to their smaller spatial footprint and background ambiguity. Fishing boat instances appearing near coastal structures or partially occluded by environmental elements illustrate the inherent challenges of fine-grained vessel discrimination. These qualitative observations are consistent with the quantitative trends reported in Table 2, where architectural modifications primarily influence precision–recall balance and localization robustness for small and visually ambiguous targets.
5.2. Class-Specific Implications: Fishing Boats Versus General Ships
A key insight emerging from this study is that architectural enhancements primarily influence the detection of challenging vessel classes rather than well-defined ones. Across all experiments, performance differences among models are substantially more pronounced for fishing boats than for general ships.
Fishing boats present a uniquely difficult detection problem due to their small size, diverse appearances, and frequent operation in visually cluttered coastal and port-adjacent areas. Moreover, many fishing boats operate without continuous AIS transmission or lack AIS equipment altogether, increasing reliance on vision-based perception for situational awareness. In this context, improvements in fishing boat detection carry greater operational significance than equivalent improvements in detecting larger ships.
The confusion-matrix analysis shows that attention mechanisms affect the balance between Precision and Recall for fishing boats in different ways. CA prioritizes sensitivity, capturing a larger proportion of fishing boat instances at the expense of increased false positives, whereas CBAM achieves a more conservative balance. These behaviors align with the curve-based analyses and suggest that attention mechanisms influence not only detection accuracy but also risk profiles in operational settings.
In contrast, detection performance for general ships remains relatively stable across architectural variants. Ships typically exhibit larger spatial footprints, more consistent geometric structures, and clearer separation from background elements, making them less sensitive to feature-enhancement strategies. This disparity underscores the importance of class-level analysis in maritime vision studies, as aggregate metrics alone may obscure meaningful improvements for operationally critical vessel types.
A brief qualitative inspection of representative misdetections further clarifies this class sensitivity. False negatives for fishing boats often occur in small-scale or partially occluded instances near harbor structures, where vessel contours are weakly separated from background textures. Conversely, certain false positives arise from background elements such as small docked craft, buoys, or wave patterns that share superficial geometric characteristics with fishing vessels. These recurring error patterns indicate that fishing boat detection difficulty is primarily driven by scale ambiguity and background similarity rather than by general detector instability. Attention mechanisms modify how strongly such ambiguous features are amplified, thereby altering the sensitivity–specificity balance observed in Section 4.
5.3. Limitations of Semantic Fusion with Vision–Language Models
One limitation of the present semantic fusion experiment is that CLIP was integrated in a frozen configuration using a simple additive fusion strategy. Although this design isolates the architectural effect without introducing additional fine-tuning variables, it does not explore alternative adaptation strategies such as prompt refinement, gated fusion, or partial encoder updating. Therefore, the reported performance degradation should be interpreted as evidence that naïve semantic fusion, without domain-adaptive optimization, may not reliably improve fine-grained maritime vessel discrimination under limited data conditions. It should be emphasized that the objective was not to replace convolutional backbones with CLIP, but to test whether frozen global semantic embeddings can provide complementary calibration effects under controlled maritime conditions.
Importantly, the degradation is consistent across evaluation metrics and confidence-threshold analyses, which suggests that the limitation is structural rather than incidental and reinforces the calibration analysis presented in Section 4.4. Future research may investigate whether domain-adaptive pretraining or spatially aware semantic integration mechanisms can mitigate this calibration gap.
5.4. Operational Implications for Maritime Perception Systems
From a practical perspective, the findings of this study offer several implications for the design of maritime perception systems. First, attention mechanisms provide a lightweight and effective means of improving detection performance without requiring extensive architectural redesign. In particular, CA and CBAM can be integrated into existing YOLO-based pipelines with minimal overhead, making them suitable for real-time deployment on resource-constrained platforms.
Second, curve-based evaluation reveals behavioral characteristics that are not captured by point-based metrics alone. The width and stability of F1-score plateaus, as well as the shape of PR curves, provide valuable insight into threshold sensitivity and operational robustness. For maritime systems operating in dynamic environments, such robustness may be more important than marginal gains in peak accuracy.
Third, the results suggest that semantic fusion using large pre-trained models should not be adopted indiscriminately. Without sufficient domain-specific data or adaptation mechanisms, semantic priors may degrade rather than enhance detection performance. Attention-based feature refinement appears to offer a more reliable and interpretable improvement pathway under current maritime data constraints.
5.5. Limitations and Future Research Directions
A limitation of this study is that each architectural variant was evaluated using a single controlled training run with a fixed random seed. While repeated randomized trials and formal statistical significance testing (e.g., confidence intervals or hypothesis tests across multiple seeds) would provide a stronger quantitative assessment of variance, such multi-seed experimentation was beyond the scope of the present work.
Nevertheless, several methodological considerations reduce the likelihood that the reported performance differences are driven primarily by stochastic training noise. First, the study strictly follows a single-variable control principle: only the architectural component (attention module or semantic fusion mechanism) was modified, while dataset split, augmentation policy, optimizer configuration, training schedule, and computational environment were held constant across all experiments. This deterministic setup minimizes uncontrolled variance sources.
Second, the observed trends are not inferred from a single scalar metric. Instead, consistent directional behavior is observed across multiple evaluation dimensions, including mAP@0.5, mAP@0.5:0.95, Precision–Recall curves, F1-score curves, and class-level confusion matrices. The relative ordering among CA, CBAM, baseline, and CLIP-based variants remains stable across these complementary criteria.
Third, curve-based analyses reveal architecture-specific confidence calibration patterns across the full confidence-threshold spectrum. If the differences were dominated by random initialization effects, irregular crossings or inconsistent ordering among models would be expected. However, the structured morphology of the curves suggests systematic architectural influence rather than incidental fluctuation.
Future research may extend this work in two complementary directions. First, multi-seed evaluations and formal statistical testing may be conducted to further quantify effect size and performance variance across random initializations. Second, benchmarking the evaluated architectural modules on additional public maritime datasets under harmonized evaluation protocols would enable broader cross-method comparison beyond the controlled framework adopted in this study. Despite these limitations, the consistent multi-metric trends observed under strictly controlled experimental conditions provide informative and interpretable evidence regarding the relative behavioral characteristics of the evaluated attention mechanisms.
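For the multi-seed direction mentioned above, a per-variant summary could be computed along the following lines. This is only a sketch of one conventional approach (a t-based confidence interval on a metric across seeds, plus paired per-seed differences between two variants); the function names are hypothetical, the critical value is supplied by the caller to avoid a dependency on a statistics package, and no such analysis was performed in the present study.

```python
import math
import statistics


def mean_and_ci(values, t_critical):
    """Mean and confidence-interval half-width across seeds.

    values:     one metric value per seed (e.g. mAP@0.5), at least two seeds
    t_critical: two-sided Student-t critical value for len(values) - 1
                degrees of freedom (2.776 for five seeds at 95%)
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two seeds are required")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)   # standard error of the mean
    return mean, t_critical * sem


def paired_differences(variant_a, variant_b):
    """Per-seed metric differences for two variants trained with matching seeds."""
    if len(variant_a) != len(variant_b):
        raise ValueError("both variants need one value per shared seed")
    return [a - b for a, b in zip(variant_a, variant_b)]
```

Passing the output of `paired_differences` to `mean_and_ci` gives an interval on the mean difference between two variants; an interval that excludes zero would support a claim that the difference is not explained by seed-to-seed variance alone.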
6. Conclusions
This study presented a systematic and controlled evaluation of attention mechanisms and semantic fusion strategies for maritime vessel detection within a unified YOLOv8-based framework. Rather than proposing a new detection architecture, the research focused on isolating the effects of representative feature-enhancement techniques under identical training, data, and evaluation conditions. This design enabled a clear and interpretable analysis of how architectural choices influence detection behavior in realistic nearshore maritime environments.
The experimental results indicate that attention mechanisms can improve maritime vessel detection performance, particularly for visually challenging targets such as fishing boats operating in coastal and port-adjacent areas. Because each variant was evaluated in a single training run, these improvements are reported as consistent trends rather than statistically tested effects. Among the evaluated approaches, lightweight attention designs proved especially effective. Coordinate Attention (CA) enhanced sensitivity to small and ambiguous vessels by strengthening directional feature encoding, while the Convolutional Block Attention Module (CBAM) achieved a more balanced trade-off between Precision and Recall through sequential channel–spatial refinement. These improvements were reflected not only in aggregate metrics but also in curve-based analyses, which revealed meaningful differences in confidence calibration and operational robustness.
In contrast, semantic fusion using a large pre-trained vision–language model did not yield performance gains under the evaluated conditions. CLIP-enhanced variants exhibited unstable confidence behavior and reduced detection effectiveness, particularly at lower confidence thresholds. These findings suggest that semantic priors alone are insufficient to improve maritime vessel detection in data-constrained settings and may introduce additional uncertainty without careful domain adaptation.
A key contribution of this study lies in its emphasis on behavioral analysis beyond point-based metrics. By examining Precision–Recall and F1-score curves alongside class-level performance, the study highlights how attention mechanisms influence not only detection accuracy but also threshold sensitivity and robustness—factors that are critical for real-world maritime applications. The results further show that architectural enhancements primarily affect difficult, small-scale vessel classes, underscoring the importance of class-aware evaluation in maritime computer vision research.
Despite these contributions, the study is limited by its dataset size, binary classification scope, and reliance on a single baseline detector. Future research should extend this analysis to larger and more diverse maritime datasets, explore multi-class vessel detection scenarios, and investigate alternative semantic fusion strategies with stronger domain alignment. Additionally, integrating attention mechanisms within multi-modal maritime perception systems represents a promising direction for enhancing situational awareness in complex marine environments.
Overall, this work provides practical insights into the selection and evaluation of attention mechanisms for vision-based maritime vessel detection. The findings suggest that lightweight attention-based feature refinement offers a reliable and deployment-friendly pathway for improving detection performance in nearshore maritime settings, while also highlighting the limitations of semantic fusion approaches under current data constraints.