1. Introduction
The global shipbuilding industry, a crucial supporting sector for international trade, is undergoing transformation and upgrading toward intelligent manufacturing. According to statistics from the United Nations Conference on Trade and Development, over 80% of global trade goods are transported by sea, making the maritime industry a vital pillar of international trade. In this context, the level of intelligence in the shipbuilding industry [1] directly affects the efficiency and cost control of global maritime logistics. However, because shipbuilding follows an order-based production mode, the sorting work after ship component cutting has become a key bottleneck that constrains production efficiency.
Ship component sorting, as the core process connecting cutting and assembly, directly affects the overall production cycle [2]. Current sorting modes rely on Quick Response (QR) code recognition to achieve semi-automated operation but still require manual classification and stacking according to component patterns, and this heavy dependence on manual work severely constrains the transition toward fully intelligent shipbuilding. Unlike standardized manufacturing industries, ship component manufacturing features diverse patterns, complex and varied geometric shapes, different size specifications, and widespread stacking occlusion among components, all of which limit the effectiveness of traditional intelligent recognition technologies.
For target recognition and classification tasks in industrial scenarios, existing research is primarily based on deep learning methods. Single-stage detection networks represented by the You Only Look Once (YOLO) series have gained widespread attention in industrial applications due to their efficiency but typically require large amounts of training data to achieve ideal recognition performance. When facing complex scenarios in ship component sorting with limited dataset samples and target stacking occlusion, existing methods still encounter problems such as inaccurate recognition and missed detection.
To address the problems of small samples and target occlusion, researchers mainly adopt two improvement approaches. The first is improved network structure design, such as Residual Network (ResNet) and Densely Connected Convolutional Network (DenseNet), which use residual connections and dense connectivity, respectively. These designs achieve lightweight networks through improved connection methods but have limited effectiveness in handling stacking occlusion [3]. The second is the introduction of attention mechanisms. Although attention mechanisms show potential for improving detection performance, in the industrial scenario of ship component recognition they still face key challenges, including high computational complexity, insufficient small-sample adaptability, and limited capability for processing redundant spatial information [4].
Current attention mechanism research demonstrates promising results in general computer vision tasks, yet effective solutions remain needed for the specific requirements of ship component sorting, including small sample learning, target stacking recognition, and feature redundancy suppression. In ship manufacturing environments, existing attention mechanisms encounter difficulties distinguishing between diverse component patterns and complex geometric shapes, while manual markings and surface scratches create interference that conventional spatial attention cannot effectively filter, and component stacking occlusion produces spatial relationships that current channel attention mechanisms fail to properly address. Therefore, a Triplet Spatial Reconstruction Attention (TSA) mechanism is proposed, addressing these limitations through threshold-based feature separation combined with triplet parallel processing to simultaneously achieve high accuracy in small-sample scenarios while maintaining computational efficiency for industrial deployment.
The proposed methodology encompasses dataset construction for ship component recognition, systematic analysis of existing attention mechanisms, design of the TSA mechanism featuring dual-branch parallel processing architecture, and integration with YOLOv8n to develop the YOLO-Ship detection network. The TSA mechanism establishes feature extraction in different spatial directions within sub-networks, achieving cross-dimensional feature fusion through channel dimensions while performing separation and reconstruction of spatial semantic information through weight discrimination, suppressing redundant information in spatial dimensions and enhancing semantic feature expression capability. Experimental validation demonstrates substantial performance improvements with a mean Average Precision reaching 88.7%, precision achieving 84.2%, and recall attaining 87.1%, while maintaining computational efficiency with 2.6 M parameters and 7.5 GFLOPs. The network fully leverages the feature separation reconstruction and parallel processing advantages of the TSA mechanism, achieving an effective balance between accuracy and efficiency in small-sample ship component recognition tasks. The following sections provide a comprehensive analysis of related work, detailed methodology description, experimental validation, and discussion of the proposed approach’s performance and implications.
2. Related Work
2.1. Attention Mechanisms in Object Detection
Attention mechanisms such as Squeeze-and-Excitation Network (SENet), which reweights channels, and Convolutional Block Attention Module (CBAM), which combines channel and spatial attention, effectively improve network performance. Among them, CBAM is particularly effective at integrating attention weights across dimensions [5]. With the development of attention mechanisms, researchers have proposed improvement schemes from different perspectives. Efficient Multi-scale Attention (EMA), proposed by Ouyang et al., is a multi-scale attention mechanism that adopts a three-way parallel architecture combining spatial scale with channel information to enhance target feature information and achieve cross-dimensional interactive fusion [6]. Lau et al. proposed Large Separable Kernel Attention (LSKAttention), a separable kernel attention mechanism that decomposes 2D convolution kernels into cascaded 1D kernels, reducing computational overhead [7]. Despite these advances, existing attention mechanisms exhibit fundamental limitations when applied to industrial scenarios: sequential processing architectures create computational bottlenecks unsuitable for real-time deployment, uniform spatial treatment fails to distinguish meaningful signals from industrial noise such as surface markings and lighting variations, and spatial reconstruction is inadequately integrated with channel attention for robust feature representation under challenging industrial conditions.
2.2. Spatial Reconstruction and Feature Separation Methods
Spatial reconstruction approaches aim to address feature redundancy through separation and reorganization strategies. The Spatial and Channel Reconstruction Convolution (SCConv) demonstrates effectiveness in handling spatial and channel redundancy through separation–reconstruction strategies, utilizing dedicated units for spatial and channel processing. Coordinate Attention (CA) performs cross-dimensional fusion embedding of spatial channels, showing excellent performance in feature interaction across different dimensional spaces. Meanwhile, lightweight network architectures such as Residual Network (ResNet) and Densely Connected Convolutional Network (DenseNet) employ residual connections and dense connectivity, respectively, to achieve computational efficiency through improved connection methods, yet these approaches show limited effectiveness in handling complex spatial relationships and feature separation requirements characteristic of industrial detection scenarios.
2.3. Industrial Scene Small-Sample Detection Challenges
Industrial manufacturing environments present distinct detection challenges due to inherent complexities in object characteristics and data acquisition constraints. In aluminum casting inspection, deep object detection methods face significant obstacles when dealing with simulated defects and complex geometric variations under industrial conditions [8]. Recent research has examined difficulties in processing small-scale components, occlusion patterns, and constrained training datasets within manufacturing facilities [9]. Industrial stamping processes demonstrate similar constraints where deep metric learning approaches must operate effectively with limited sample sizes for tool condition diagnosis [10].
Ship component detection encounters specific small-sample challenges due to the diverse patterns, complex geometric shapes, and varying size specifications inherent in maritime manufacturing. The acquisition of defective assembly samples requires extended production periods, as assembly errors occur infrequently during normal operations. Traditional few-shot learning approaches face difficulties in ship component recognition due to high visual similarity between correctly and incorrectly assembled components, where subtle assembly differences must be detected reliably [11]. Advanced prototype-based methods show effectiveness in laboratory conditions yet struggle when applied to ship manufacturing environments where lighting variations, component wear, and assembly contexts affect detection accuracy [12]. These limitations highlight the necessity for robust detection mechanisms specifically designed for industrial small-sample scenarios.
3. Materials and Methods
The methodology encompasses four key components: dataset construction for ship component recognition, analysis of existing attention mechanisms, design of the proposed TSA triplet spatial reconstruction attention mechanism, and development of the YOLO-Ship detection network.
3.1. Dataset Construction
Because shipbuilding adopts an order-based customized production mode and component geometries are complex and diverse, no open-source ship component dataset is currently available internationally. The dataset was therefore constructed through a combination of web collection and on-site acquisition, and 160 original images were expanded to 792 images through rotation and flipping operations. The image size was uniformly set to 640 × 640. Given its sample count, this is a small-sample dataset. To improve training effectiveness, homomorphic filtering [13] was applied to the dataset images, using an improved Gaussian function to filter noise and the effects of diffuse light reflection from steel plate surfaces. The improved homomorphic filtering function is shown as follows:
[Equation (1): improved homomorphic filtering transfer function]
The quantities in Equation (1) are the improved homomorphic filtering transfer function, the steepness of the function, the Euclidean distance from a frequency-domain point to the center, the adjustment parameter, half the horizontal length of the image after the Fourier transform, and the low-frequency gain coefficient.
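For orientation only, the widely used textbook form of the Gaussian homomorphic filter can be written as below. It is not the improved function of Equation (1); the symbols $\gamma_H$, $\gamma_L$, $c$, $D_0$, $P$, and $Q$ follow common textbook notation and are assumptions here, not the authors' notation.

```latex
H(u,v) = (\gamma_H - \gamma_L)\left[1 - e^{-c\,D^{2}(u,v)/D_0^{2}}\right] + \gamma_L,
\qquad
D(u,v) = \sqrt{\left(u - \tfrac{P}{2}\right)^{2} + \left(v - \tfrac{Q}{2}\right)^{2}}
```

In this standard form, choosing $\gamma_L < 1$ attenuates the low-frequency illumination component and $\gamma_H > 1$ emphasizes the high-frequency reflectance component, $c$ controls the steepness of the transition around the cutoff $D_0$, and $P \times Q$ is the size of the frequency-domain array.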
Regarding the determination of sample categories, layout planning of component positions is performed before steel plate cutting to maximize steel plate utilization, so component styles and their distribution on a plate are diverse and effectively random at recognition time. To address this issue, the complex target recognition and classification task was decomposed. After components are identified and assigned to their corresponding ship sections through QR code scanning, the component styles on the target steel plate are classified by geometric features into five categories: rectangle, circle, rectangle-circle (linear features dominate over curved features), circle-rectangle (curved features dominate over linear features), and others. The five classification labels were set as rectangle, circle, rec-cir, cir-rec, and other, respectively, and label files were generated to complete dataset construction. This sorting and classification method is applicable to the sorting of ship steel plate cutting components. Partial component image data are shown in Figure 1.
To ensure reproducibility and facilitate future research, the complete dataset will be publicly released upon publication. The dataset includes all processed images in 640 × 640 resolution with corresponding annotation files in standard format for all five component categories. This release aims to establish a benchmark for ship component recognition research and enable direct comparison with future methods.
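As an illustration of what an annotation for the five categories might look like, the sketch below converts a pixel-space bounding box into a YOLO-style label line (class index followed by normalized center coordinates, width, and height). The text only states that annotations use a standard format, so the YOLO layout, the class index order, and the helper name `to_yolo_line` are assumptions made for this example.

```python
# Hypothetical class order; the text lists the five labels but not their indices.
CLASS_NAMES = ["rectangle", "circle", "rec-cir", "cir-rec", "other"]
IMAGE_SIZE = 640  # images are uniformly 640 x 640 in the dataset


def to_yolo_line(label, x_min, y_min, x_max, y_max, size=IMAGE_SIZE):
    """Format one box as 'class x_center y_center width height' (normalized)."""
    class_id = CLASS_NAMES.index(label)  # raises ValueError for unknown labels
    x_center = (x_min + x_max) / 2 / size
    y_center = (y_min + y_max) / 2 / size
    width = (x_max - x_min) / size
    height = (y_max - y_min) / size
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


if __name__ == "__main__":
    print(to_yolo_line("rec-cir", 160, 160, 480, 320))
```

Normalizing by the image size keeps the labels valid if the images are later resized with their aspect ratio unchanged.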
3.2. Attention Mechanism
The production environment for ship components is harsh, and component surfaces carry numerous sources of interference such as manual markings, scratches, and stacking, which leads to substantial redundant information in the spatial dimensions. A spatial separation-reconstruction approach was therefore adopted to divide feature map data into two categories, information-rich and information-sparse, separated using thresholds. The Spatial and Channel Reconstruction Convolution (SCConv) module proposed by Li et al. effectively handles spatial and channel redundancy through its Spatial Reconstruction Unit (SRU) and Channel Reconstruction Unit (CRU) [14], providing useful ideas for spatial reconstruction, but the overall architecture has high computational complexity and a large parameter count. Coordinate Attention (CA) performs cross-dimensional fusion embedding of spatial channels and shows excellent performance in feature interaction, but it is less effective at identifying and filtering redundant feature information in multi-scale feature fusion [15].
Inspired by the design concepts of SCConv spatial reconstruction strategy and CA cross-dimensional fusion, addressing the problems of high computational complexity in the former and insufficient redundant information identification capability in the latter, TSA is constructed as a triplet spatial reconstruction attention mechanism. TSA achieves a balance between performance enhancement and computational efficiency in small-sample data scenarios through innovative triplet parallel processing and spatial separation-reconstruction strategies.
3.2.1. SRU
The SRU is the spatial reconstruction component of the Spatial and Channel Reconstruction Convolution (SCConv) module proposed in 2023. It adopts a separation–reconstruction approach to effectively reduce redundant features in spatial and channel dimensions, decrease model computational complexity, and enhance semantic feature information. Group normalization is applied to input feature maps, dividing channels equally into 16 groups, with the normalization principle shown in Equation (2).
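$$X_{out} = \mathrm{GN}(X) = \gamma \frac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta \tag{2}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the input feature map $X$, $\varepsilon$ is the stability coefficient, and $\gamma$ and $\beta$ are trainable affine transformation values.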
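The group normalization step of Equation (2) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each channel is stored as a flat list of values, the statistics are computed per channel group, and the names `group_norm`, `gamma`, `beta`, and `eps` (the trainable scale, trainable shift, and stability coefficient) are chosen for this example.

```python
import math


def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Group-normalize x, a list of C channels that are flat lists of floats.

    gamma and beta hold one trainable scale and shift per channel.
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)  # eps keeps the division stable
        for k, channel in enumerate(group):
            idx = g * per_group + k
            out.append([gamma[idx] * (v - mean) * inv_std + beta[idx]
                        for v in channel])
    return out


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
    print(group_norm(x, num_groups=2, gamma=[1.0] * 4, beta=[0.0] * 4))
```

The toy input uses two groups for readability; the text above divides the channels into 16 groups, which only changes the `num_groups` argument.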
Weight coefficients are obtained through normalization processing, as shown in Equation (3).
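$$W_{\gamma} = \{w_i\} = \frac{\gamma_i}{\sum_{j=1}^{C} \gamma_j}, \quad i, j = 1, 2, \ldots, C \tag{3}$$
where $W_{\gamma}$ represents the normalized weight coefficient vector, $w_i$ represents the weight value of the $i$-th channel, $\gamma_i$ represents the scaling parameter of the corresponding channel, and $C$ represents the total number of feature map channels.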
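The weight computation of Equation (3) amounts to dividing each channel's scaling parameter by the sum over all channels. A minimal sketch, assuming the scaling parameters are available as a list of positive floats (the function name `normalized_weights` is hypothetical):

```python
def normalized_weights(gamma):
    """Return one weight per channel: its scaling parameter over the total."""
    total = sum(gamma)
    if total == 0:
        raise ValueError("scaling parameters must not sum to zero")
    return [g / total for g in gamma]


if __name__ == "__main__":
    # A channel with a larger scaling parameter receives a larger weight.
    print(normalized_weights([1.0, 3.0]))
```

Because the weights sum to one, they express the relative importance of each channel rather than an absolute scale.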
Based on the weight coefficients $W_{\gamma}$, feature separation is achieved using the Sigmoid activation function and a gating mechanism, as shown in Equation (4).
$$W = \mathrm{Gate}(\mathrm{Sigmoid}(W_{\gamma}(\mathrm{GN}(X)))) \tag{4}$$
where the $\mathrm{Gate}$ function converts continuous weights to binary masks through preset thresholds, achieving the separation of information-rich features from redundant features. Finally, based on Equation (5), the separated features are reorganized through cross-reconstruction operations, suppressing spatial redundant information while enhancing effective feature expression.
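$$\begin{cases} X_1^{w} = W_1 \otimes X, \\ X_2^{w} = W_2 \otimes X, \\ X_{11}^{w} \oplus X_{22}^{w} = X^{w1}, \\ X_{21}^{w} \oplus X_{12}^{w} = X^{w2}, \\ X^{w1} \cup X^{w2} = X^{w} \end{cases} \tag{5}$$
where $X_1^{w}$ and $X_2^{w}$ represent the feature maps separated by weights $W_1$ and $W_2$, respectively, $W_1$ and $W_2$ are the binary masks obtained from Equation (4), $X_{11}^{w}$, $X_{12}^{w}$, $X_{21}^{w}$, and $X_{22}^{w}$ represent the subdivided feature groups for cross-reconstruction, $\otimes$ denotes element-wise multiplication, $\oplus$ denotes element-wise addition, and $\cup$ denotes concatenation.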
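The separation and cross-reconstruction described by Equations (4) and (5) can be sketched as follows. This is an illustrative simplification under several assumptions: the gate is a hard threshold at 0.5 (the text only says the thresholds are preset), the masks are applied to the un-normalized features, the two halves are taken along the channel axis, and each channel is a flat list of values. The function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def separate_and_reconstruct(x, gn_x, w_gamma, threshold=0.5):
    """x, gn_x: lists of C channels (C even); w_gamma: one weight per channel."""
    channels = len(x)
    if channels % 2 != 0:
        raise ValueError("channel count must be even for the half split")
    rich, sparse = [], []
    for i in range(channels):
        rich_ch, sparse_ch = [], []
        for raw, normed in zip(x[i], gn_x[i]):
            keep = sigmoid(w_gamma[i] * normed) >= threshold  # binary gate
            rich_ch.append(raw if keep else 0.0)     # information-rich part
            sparse_ch.append(0.0 if keep else raw)   # information-sparse part
        rich.append(rich_ch)
        sparse.append(sparse_ch)
    half = channels // 2
    x11, x12 = rich[:half], rich[half:]
    x21, x22 = sparse[:half], sparse[half:]

    def add(a, b):
        return [[p + q for p, q in zip(ca, cb)] for ca, cb in zip(a, b)]

    # Cross-reconstruction: add crossed halves, then concatenate along channels.
    return add(x11, x22) + add(x12, x21)


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]
    gn_x = [[1.0, -1.0], [-1.0, 1.0]]
    print(separate_and_reconstruct(x, gn_x, w_gamma=[0.5, 0.5]))
```

The output has the same number of channels and positions as the input, so the unit can be dropped into a network without changing tensor shapes.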
3.2.2. CA Mechanism
Coordinate Attention (CA) is a lightweight attention module that considers the channel dimension while incorporating spatial position information, extending ordinary cross-dimensional fusion to the relationship between channels and spatial positions [16]. The structure is shown in Figure 2.
The CA mechanism performs global average pooling operations along vertical and horizontal directions on the input tensor, generating two one-dimensional feature vectors to capture spatial feature information from different directions. The pooled features from both directions are concatenated to construct a feature representation containing complete spatial positional information. The concatenated features undergo convolution, batch normalization, and Rectified Linear Unit (ReLU) activation processing to further enhance feature representation capability. The processed feature maps are split along horizontal and vertical directions, with each direction activated through sigmoid functions to generate direction-aware attention weights, which are then applied to the original input features through element-wise multiplication, resulting in enhanced feature outputs that fuse spatial features from different directions with channel information.
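The directional pooling and re-weighting steps described above can be illustrated for a single channel. This sketch covers only the pooling and the final element-wise multiplication; the convolution, batch normalization, and ReLU stages between them are omitted, and the attention weights are passed in directly. The function names are hypothetical.

```python
def directional_pool(channel):
    """Average-pool an H x W channel (nested lists) along each direction."""
    height, width = len(channel), len(channel[0])
    pooled_rows = [sum(row) / width for row in channel]  # one value per row
    pooled_cols = [sum(channel[r][c] for r in range(height)) / height
                   for c in range(width)]                # one value per column
    return pooled_rows, pooled_cols


def apply_direction_weights(channel, row_weights, col_weights):
    """Scale each position by the weight of its row and of its column."""
    return [[value * row_weights[r] * col_weights[c]
             for c, value in enumerate(row)]
            for r, row in enumerate(channel)]


if __name__ == "__main__":
    channel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    rows, cols = directional_pool(channel)
    print(rows + cols)  # concatenated descriptor of length H + W
```

Pooling along one direction at a time keeps positional information in the other direction, which is what distinguishes this scheme from a single global average over the whole feature map.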
3.3. TSA Triplet Spatial Reconstruction Attention Mechanism
The proposed TSA triplet spatial reconstruction attention mechanism adopts horizontal spatial pooling, vertical spatial pooling, and global feature enhancement as a triplet processing strategy, dividing input image features into two branches for parallel processing, achieving reduced computational overhead while enhancing recognition effectiveness. The horizontal and vertical pooling operations extract directional spatial dependencies while preserving channel information, and the global enhancement component applies semantic-driven threshold discrimination to distinguish meaningful features from redundant information, enabling adaptive feature reconstruction through cross-reorganization operations.
The first branch performs horizontal and vertical pooling on the input feature values, retaining channel information while combining the values from the two directions for batch normalization, which yields target feature weight coefficients that combine directional spatial information with the channel dimension. After the weights are multiplied with the feature map values themselves, sigmoid activation mapping is performed. These values are then compared against a threshold to divide the features into a set with higher semantic information and a set with relatively lower semantic information. Each of the two sets is then split in two, and the resulting four groups undergo pairwise cross-reorganization and addition to suppress redundant spatial features, after which the two sets are combined into features of the original size.
The other branch performs batch normalization on the feature values and, after activation mapping, adds them to the first-branch output with weight coefficient readjustment so that the output feature values do not overfit, yielding the final output. The entire TSA triplet spatial reconstruction attention mechanism is shown in Figure 3.
The computational efficiency of TSA stems from its parallel processing architecture and linear complexity design. The horizontal and vertical pooling operations maintain O(C × H × W) complexity for spatial feature extraction, while the dual-branch structure avoids the quadratic channel dependencies inherent in traditional attention mechanisms. In contrast to SCConv’s O(C² × H × W) complexity arising from its sequential spatial-channel processing, TSA achieves feature separation and reconstruction through threshold-based operations that scale linearly with input dimensions. Similarly, while CA mechanisms typically require O(C × H × W + C²) operations for coordinate embedding and channel-wise attention computation, TSA’s triplet processing strategy eliminates the need for expensive channel-wise matrix operations through its parallel pooling and reconstruction approach. The threshold separation and cross-reorganization operations maintain constant computational overhead regardless of channel depth, contributing to the overall efficiency gain.
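To make the scaling behaviour of the complexity terms quoted above concrete, the sketch below evaluates the linear term and the channel-squared term for one feature map size. These are growth terms only, with constant factors ignored, and are not measured FLOPs; the feature map size is an arbitrary example.

```python
def linear_term(channels, height, width):
    """Growth term that is linear in the channel count: C * H * W."""
    return channels * height * width


def channel_squared_term(channels, height, width):
    """Growth term that is quadratic in the channel count: C^2 * H * W."""
    return channels * channels * height * width


if __name__ == "__main__":
    c, h, w = 64, 80, 80  # illustrative feature map size
    lin = linear_term(c, h, w)
    quad = channel_squared_term(c, h, w)
    print(lin, quad, quad // lin)
```

The ratio between the two terms equals the channel count, so the gap widens in deeper layers where feature maps have more channels.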
3.4. YOLO-Ship Network
Targeting the characteristics of small samples, diversity, and stacking occlusion in ship component recognition, YOLOv8n was selected as the base network. Compared to newer YOLO versions, YOLOv8 achieves a better balance between accuracy and computational efficiency, with good maturity and stability in industrial applications [17], making it more suitable for the lightweight objectives of this research.
The TSA mechanism described above is combined with the C2f module to construct the C2f-TSA module, whose structure is shown in Figure 4.
The improved C2f-TSA module replaces the original C2f modules in the Neck section of the YOLOv8n network, forming the YOLO-Ship detection network. Through the spatial reconstruction and cross-dimensional feature fusion capabilities of the TSA mechanism, it better handles small target detection and stacking occlusion problems in ship component images. Compared to the original network, the lightweight design of TSA ensures that the improved network reduces model parameters while improving convergence speed and detection accuracy, meeting the real-time requirements of industrial applications. The complete structure of the YOLO-Ship network is shown in Figure 5.
4. Results
Comprehensive experimental validation was conducted to evaluate the performance of the proposed YOLO-Ship model in ship component detection tasks.
4.1. Experimental Configuration
The experimental platform used the Ubuntu 18.04 LTS operating system, the PyTorch deep learning framework version 1.9.0 with Compute Unified Device Architecture (CUDA) 11.1, and an RTX 3090 graphics card with 24 GB of memory. The dataset was divided into training (554 images), test (79 images), and validation (159 images) sets following a 7:1:2 ratio to ensure balanced category representation. The training set was used for model parameter optimization, the validation set for hyperparameter tuning and the early stopping strategy, and the test set for final performance evaluation. The Adaptive Moment Estimation with Weight Decay (AdamW) optimizer was selected with a learning rate of 0.001, a batch size of 8, 300 total training epochs, and a network input image size of 640 × 640; no pre-trained weights were loaded. Model performance evaluation adopted three metrics: Precision (P), Recall (R), and mean Average Precision (mAP) [18]. These metrics comprehensively evaluate detection performance, with Precision addressing false positive control crucial for industrial applications, Recall ensuring complete detection coverage, and mAP providing integrated assessment across all confidence thresholds and categories. All comparative experiments were conducted under identical hardware, software, and parameter configurations to ensure fair comparison across different methods.
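The split sizes and the two count-based metrics above can be checked with a few lines of Python. The rounding rule (floor for the training and test shares, remainder to validation) is an assumption that happens to reproduce the reported 554/79/159 split; the text does not state how rounding was handled. Precision and recall use their standard definitions from true positive, false positive, and false negative counts.

```python
def split_counts(total):
    """Split total images 7:1:2 into (train, test, validation) counts."""
    train = total * 7 // 10
    test = total * 1 // 10
    validation = total - train - test  # remainder goes to validation
    return train, test, validation


def precision_recall(tp, fp, fn):
    """Standard precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    print(split_counts(792))
    print(precision_recall(tp=8, fp=2, fn=2))  # illustrative counts only
```

Mean Average Precision additionally requires ranking detections by confidence and integrating the precision-recall curve per class, which is omitted here for brevity.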
4.2. Experimental Results and Analysis
The experimental results demonstrate the effectiveness of the TSA mechanism through comprehensive performance analysis and comparative studies with existing methods.
4.2.1. Model Performance Analysis
The YOLO-Ship network with integrated TSA mechanism optimizes model parameters to 2,625,035 (a reduction of 0.6 M compared to the original model) with a computational cost of 7.5 GFLOPs. The experimental results validate the theoretical advantages of TSA analyzed in Section 3.3, confirming that the mechanism achieves significant computational efficiency improvements while maintaining detection accuracy. The actual detection results are shown in Figure 6, and the model’s performance on the test set is shown in Figure 7 and Figure 8.
4.2.2. Comparison of Different Attention Mechanisms
To verify that the proposed TSA mechanism improves accuracy and precision in component recognition, ablation experiments were conducted with four control groups: no attention mechanism, SCConv attention, CA, and the proposed TSA mechanism under the same training and testing conditions [19]. The evaluation was based on Precision (P), mean Average Precision (mAP), and Recall (R) [20], with experimental results and conclusions shown in Table 1.
4.2.3. Comparison with Other Mainstream Models
To validate and evaluate YOLO-Ship model performance, multiple current mainstream target detection network models were trained on the ship component dataset under the same experimental environment and parameter configuration [21], providing comprehensive comparison across different architectural approaches. The baseline selection encompasses classical convolutional architectures and YOLO variants to represent both traditional feature extraction approaches and modern single-stage detection frameworks, ensuring fair comparison across different architectural paradigms. Performance comparison was conducted [22], as shown in Table 2.
The comparative analysis demonstrates that YOLO-Ship achieves superior performance across all evaluation metrics. Notably, YOLO-Ship outperforms recent YOLO variants, achieving 88.7% mAP, which represents significant improvements over the compared models.
5. Discussion
The proposed study introduces a Triplet Spatial Reconstruction Attention (TSA) mechanism integrated with YOLOv8n to address ship component detection challenges in small-sample industrial manufacturing scenarios. The approach achieves 88.7% mean Average Precision while reducing parameters to 2.6 M and maintaining linear computational complexity. This discussion examines the technical innovations compared to existing methods, practical implications for industrial deployment, potential application prospects, current limitations, and future research directions.
Traditional attention mechanisms exhibit fundamental limitations when applied to industrial small-sample detection scenarios. Approaches like SCConv and CA, while effective in standard computer vision tasks, demonstrate inherent design constraints in industrial environments. SCConv’s sequential processing of spatial and channel information creates computational bottlenecks that limit real-time deployment capabilities, while CA’s coordinate-based attention fails to adequately address feature redundancy prevalent in cluttered manufacturing scenes.
Recent advances in object detection increasingly explore transformer-based attention mechanisms and sophisticated multi-scale architectures. While transformer-based approaches demonstrate powerful global modeling capabilities through self-attention operations, they typically require substantial computational resources and extensive training data to establish effective attention patterns, presenting challenges for resource-constrained industrial applications. Furthermore, complex multi-scale feature fusion frameworks, despite achieving impressive accuracy in general scenarios, often introduce computational overhead that may not align optimally with the geometric patterns and spatial constraints characteristic of manufacturing environments. The threshold-based separation strategy employed in TSA addresses these limitations through targeted spatial-semantic processing that enables more granular feature discrimination while maintaining computational efficiency for industrial small-sample scenarios.
The TSA mechanism shows potential for extension to manufacturing contexts that share characteristics with ship component detection, particularly scenarios involving geometric component classification, small-sample constraints, and computational efficiency requirements. The lightweight architecture and parallel processing design may prove applicable to manufacturing environments where edge computing deployment offers advantages. The performance improvements achieved by YOLO-Ship are practically significant in shipbuilding applications where component recognition accuracy affects production efficiency. The 3.5% mAP improvement can translate into meaningful cost reductions in large-scale manufacturing operations, where error minimization helps prevent material waste and rework cycles. The 12.8% parameter reduction, achieved alongside improved performance, suggests that TSA offers an effective balance between model complexity and accuracy, which is relevant for edge computing applications in industrial Internet of Things deployments.
The current implementation has several limitations. The dataset scale, while representative of typical small-sample industrial scenarios, limits comprehensive evaluation across diverse shipbuilding environments and component types. The current evaluation focuses primarily on geometric component classification, and performance in surface defect detection remains unexplored. Practical deployment of TSA also involves several implementation considerations. Although the triplet parallel processing architecture reduces parameter count and theoretical computational complexity, it may exhibit varying inference performance across different hardware platforms due to memory bandwidth constraints and differences in parallel processing efficiency. The mechanism incorporates architectural parameters, such as group normalization settings, that are optimized for the characteristics of the current dataset. While these settings perform robustly in the current evaluation, their optimal configuration may vary in other industrial contexts with different data distributions or challenging visual conditions, such as extreme metallic reflections under high-intensity lighting, where the feature discrimination capability of the separation mechanism could be compromised.
Future research directions should prioritize addressing current limitations and expanding the method’s robustness and applicability. Adaptive threshold learning mechanisms warrant investigation to enable automatic parameter optimization across different manufacturing environments. Robust preprocessing strategies for challenging illumination conditions require development to address metallic surface reflections and varying lighting scenarios common in industrial settings. Validation across additional small-sample industrial detection tasks would help establish the broader applicability of TSA beyond ship component recognition. Integration with emerging vision–language models presents opportunities for enhanced semantic understanding in manufacturing quality control applications, potentially enabling more sophisticated defect classification and process optimization capabilities.
6. Conclusions
A lightweight detection network YOLO-Ship has been presented, integrating the novel TSA mechanism to address ship component recognition challenges in intelligent manufacturing. The main contributions include the development of the triplet spatial reconstruction attention mechanism and its successful integration with YOLOv8n to create an effective detection network optimized for industrial scenarios.
Experimental validation demonstrates substantial performance improvements: mAP reaches 88.7% (3.5% increase), precision achieves 84.2% (2.2% increase), and recall attains 87.1% (3.8% increase), while maintaining computational efficiency with only 2.6 M parameters and 7.5 GFLOPs. These results validate the practical value of TSA in addressing small-sample learning and component occlusion challenges in industrial environments.
Limitations include dataset scale constraints and domain-specific validation that may affect broader applicability. Future research should prioritize cross-domain validation across diverse manufacturing sectors including automotive and heavy machinery, to demonstrate TSA’s broader industrial applicability, focus on developing adaptive threshold mechanisms for TSA, and investigate the method’s general effectiveness across different intelligent manufacturing systems.
Author Contributions
Conceptualization, B.F. and Z.Y.; methodology, B.F.; software, B.F.; validation, B.F., Z.Y. and C.F.; formal analysis, B.F.; investigation, B.F. and C.F.; resources, Z.Y.; data curation, B.F.; writing—original draft preparation, B.F.; writing—review and editing, B.F., Z.Y. and C.F.; visualization, B.F.; supervision, Z.Y.; project administration, Z.Y.; funding acquisition, Z.Y. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The ship component dataset constructed in this study, including all 792 images with corresponding annotations, will be made publicly available through a GitHub repository upon publication. The dataset includes detailed geometric classification labels for all five component categories in a standard annotation format. The implementation was developed using Python 3.8.10, PyTorch 1.8.1+cu111, and CUDA 11.1 for GPU acceleration. The source code and detailed implementation of the proposed TSA mechanism and YOLO-Ship network are not publicly released due to ongoing related research; however, researchers may obtain the relevant code and technical details by submitting a reasonable academic request to the corresponding author.
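As a convenience for reproduction, the following minimal sketch compares a local environment against the software versions stated above. The helper names `installed_versions` and `mismatches` are hypothetical and are not part of the authors' code. The sketch assumes only that PyTorch, if installed, exposes `torch.__version__` and `torch.version.cuda`; when PyTorch is absent, the corresponding entries are reported as `None`.

```python
import platform

# Versions reported in this statement.
REPORTED = {"python": "3.8.10", "torch": "1.8.1+cu111", "cuda": "11.1"}


def installed_versions() -> dict:
    """Collect local versions; torch and cuda stay None if PyTorch is absent."""
    versions = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        import torch  # optional dependency for this check

        versions["torch"] = torch.__version__
        versions["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return versions


def mismatches(installed: dict, reported: dict = REPORTED) -> dict:
    """Return {name: (installed, reported)} for every version that differs."""
    return {
        name: (installed.get(name), want)
        for name, want in reported.items()
        if installed.get(name) != want
    }


if __name__ == "__main__":
    for name, (have, want) in mismatches(installed_versions()).items():
        print(f"{name}: installed {have}, reported {want}")
```

An exact match is not necessarily required to use the dataset; the check only flags differences that may be worth noting when comparing results.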
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
YOLO | You Only Look Once |
QR | Quick Response |
ResNet | Residual Network |
DenseNet | Densely Connected Convolutional Network |
SENet | Squeeze-and-Excitation Network |
CBAM | Convolutional Block Attention Module |
EMA | Efficient Multi-scale Attention |
LSKAttention | Large Separable Kernel Attention |
TSA | Triplet Spatial Reconstruction Attention |
C2f | Cross Stage Partial with two convolutions and feature fusion |
SCConv | Spatial and Channel reconstruction Convolution |
SRU | Spatial Reconstruction Unit |
CRU | Channel Reconstruction Unit |
CA | Coordinate Attention |
GFLOPs | Giga Floating-point Operations |
mAP | mean Average Precision |
P | Precision |
R | Recall |
AdamW | Adaptive Moment Estimation with Weight Decay |
ReLU | Rectified Linear Unit |
CUDA | Compute Unified Device Architecture |
References
- Zhang, X.; Chen, D. Shipbuilding 4.0: A Systematic Literature Review. Appl. Sci. 2024, 14, 6363.
- Jebbor, I.; Benmamoun, Z.; Hachimi, H. Optimizing Manufacturing Cycles to Improve Production: Application in the Traditional Shipyard Industry. Processes 2023, 11, 3136.
- Ahmed, M.; Hashmi, K.A.; Pagani, A.; Liwicki, M.; Stricker, D.; Afzal, M.Z. Survey and Performance Analysis of Deep Learning Based Object Detection in Challenging Environments. Sensors 2021, 21, 5116.
- Xu, W.; Wan, Y.; Zhao, D. SFA: Efficient Attention Mechanism for Superior CNN Performance. Neural Process. Lett. 2025, 57, 38.
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
- Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023.
- Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large separable kernel attention: Rethinking the large kernel attention design in CNN. Expert Syst. Appl. 2024, 236, 121352.
- Mery, D. Aluminum Casting Inspection using Deep Object Detection Methods and Simulated Ellipsoidal Defects. Mach. Vis. Appl. 2021, 32, 72.
- Tang, J.; Lu, H.; Xu, X.; Wu, R.; Hu, S.; Zhang, T.; Cheng, T.W.; Ge, M.; Chen, Y.C.; Tsung, F. An Incremental Unified Framework for Small Defect Inspection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024.
- Dzulfikri, Z.; Su, P.-W.; Huang, C.-Y. Stamping Tool Conditions Diagnosis: A Deep Metric Learning Approach. Appl. Sci. 2021, 11, 6959.
- Kim, S.; An, S.; Chikontwe, P.; Kang, M.; Adeli, E.; Pohl, K.M.; Park, S.H. Few Shot Part Segmentation Reveals Compositional Logic for Industrial Anomaly Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, 20–27 February 2024.
- Jaykumar P, J.; Palanisamy, K.; Chao, Y.W.; Du, X.; Xiang, Y. Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Abu Dhabi, United Arab Emirates, 14–18 October 2024.
- Karnati, M.; Seal, A.; Yazidi, A.; Krejcar, O. FLEPNet: Feature level ensemble parallel network for facial expression recognition. IEEE Trans. Affect. Comput. 2022, 13, 2058–2070.
- Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023.
- Bai, L.; Zhi, J.S. Omni-dimensional dynamic convolution with coordinate attention detection scheme. Sci. Prog. 2025, 108, 00368504251336695.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
- Ali, M.L.; Zhang, Z. The YOLO framework: A comprehensive review of evolution, applications, and benchmarks in object detection. Computers 2024, 13, 336.
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023.
- Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
- Zhang, D.; Zheng, Z.; Li, M.; Liu, R. CSART: Channel and spatial attention-guided residual learning for real-time object tracking. Neurocomputing 2021, 436, 260–272.
- Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).