MTP-YOLO: You Only Look Once Based Maritime Tiny Person Detector for Emergency Rescue

Abstract: Tiny person detection based on computer vision technology is critical for maritime emergency rescue. However, humans appear very small on the vast sea surface, which makes them very difficult to identify. In this study, a single-stage tiny person detector, namely the “You only look once”-based Maritime Tiny Person detector (MTP-YOLO), is proposed for detecting maritime tiny persons. Specifically, we designed the cross-stage partial layer with two convolutions Efficient Layer Aggregation Networks (C2fELAN) by drawing on the Generalized Efficient Layer Aggregation Networks (GELAN) of the latest YOLOv9, which preserves the key features of a tiny person during the calculations. Meanwhile, in order to accurately detect tiny persons in complex backgrounds, we adopted a Multi-level Cascaded Enhanced Convolutional Block Attention Module (MCE-CBAM) to make the network attach importance to the area where the object is located. Finally, by analyzing the sensitivity of tiny objects to position and scale deviation, we proposed a new object position regression cost function called Weighted Efficient Intersection over Union (W-EIoU) Loss. We verified our proposed MTP-YOLO on the TinyPersonv2 dataset. The results confirm that this method significantly improves model performance while maintaining a low number of parameters and can therefore be applied to maritime emergency rescue missions.


Introduction
With increasing global maritime activities, the complexity and urgency of maritime rescue missions have also increased. In this context, quickly and accurately locating people in distress is the key to improving rescue efficiency and minimizing casualties and property losses. However, existing methods such as manual observation or satellite positioning still face many challenges in accurately locating victims. Manual observation can easily cause personnel fatigue and distraction, thereby increasing the risk of missing search and rescue targets. Satellite positioning cannot be performed when the signal quality is poor or the victim's communication equipment fails. Therefore, existing detection technology cannot meet the requirements of modern maritime emergency rescue.
Lately, there has been a growing emphasis among scholars on techniques for object detection through visual data analysis. Because this approach uses high-resolution cameras and image processing algorithms to identify objects in specific scenes, it is not affected by visual fatigue or signal quality when used for maritime emergency rescue. Vision-based object detection algorithms mainly include the following categories: two-stage object detectors represented by the RCNN series, one-stage detectors represented by the YOLO series [1], and the Transformer-based DETR series [2]. Although they have all achieved impressive performance on natural images, detecting tiny objects at sea remains a challenge [3].
On the one hand, maritime tiny object detection shares several difficulties with small object detection in general. Firstly, tiny objects in sea surface images occupy only a small part of the image and therefore carry little information. Secondly, this limited information may disappear during forward propagation, so the model may fail to capture the key features of tiny objects, leading to detection errors. Furthermore, tiny objects might overlap with each other, challenging the detector's ability to differentiate between objects that are close together.
On the other hand, detecting small objects at sea has its own unique characteristics. For example, the lighting conditions at sea may be complex and variable due to the presence of more specular reflections [4]. The fluctuation in lighting intensity can significantly impact the imaging effect of the camera, which complicates the process of discerning the characteristics of the object. Moreover, due to the lack of additional light sources, the lighting conditions largely depend on sunlight, resulting in a large number of backlit scenes. In such scenes, the object is poorly lit while the background is very bright, and both factors can degrade the efficacy of object detection.
To tackle these particular hurdles, we designed an innovative architecture named MTP-YOLO for tiny person detection in maritime emergency rescue missions. We trained and evaluated MTP-YOLO on the TinyPersonv2 [5] dataset, which contains sea surface images annotated with tiny person labels. The results indicate that MTP-YOLO can improve the detection ability of tiny objects compared to the most advanced methods currently available. Our contributions are summarized as follows:

1. We designed a new feature extraction module called C2fELAN to better retain tiny object information and reduce information loss during forward propagation, allowing the model to use this information to detect tiny objects and overcome the challenges of tiny object detection.

2. We adopted the Multi-level Cascaded Enhanced CBAM to obtain a more focused attention distribution, allowing the model to attach importance to areas where the important features of tiny objects exist and learn more useful information.

3. We proposed a new bounding box regression loss function called Weighted EIoU Loss to solve the problem of tiny objects having different sensitivities to position and scale deviation and boost the model's performance in identifying tiny persons.

Object Detection
At present, the popular object detection algorithms mainly include the following categories: two-stage object detectors represented by the RCNN series, one-stage detectors represented by the YOLO series [6], and the DETR series based on the Transformer. After integrating the RPN [7] structure, the RCNN series greatly improves detection accuracy but is slow and cannot meet the demands of real-time detection in most applications. Algorithms in the YOLO series approach the task of object detection as a problem of spatial regression. They use CSPNet [8], PAN [9], FPN [10], and a series of their variants as the basic building blocks of the network. While fulfilling real-time detection requirements, their accuracy reaches the same level as the RCNN series. The DETR series builds on the Transformer architecture [11] from the domain of natural language processing; however, it is difficult to apply to new fields without a pre-trained model in the corresponding field. Therefore, the YOLO series currently remains the most widely used family of algorithms. YOLOv8 [12] was chosen as the basis of this article, as this method has proven to be powerful in a large number of computer vision tasks.

Tiny Object Detection
Despite significant advancements in the field of object detection algorithms, research on tiny object detection still faces great challenges, including the following: (1) the tiny object itself occupies a small area of the image and carries limited information; (2) the features of tiny objects may disappear during the forward propagation process of the network, which makes detection difficult; (3) tiny objects in complex environments are interfered with by factors like lighting, occlusion, and aggregation, which complicates their differentiation from the background or from similar items. In response to these difficulties, researchers have made numerous improvements to mainstream object detection algorithms. Their improvement methods can be divided as follows: context information learning methods [13] that address the limited feature information carried by tiny objects, multi-scale feature fusion methods [14] that integrate multiple feature layers to improve the representation ability of tiny objects, and attention mechanism methods [15][16][17][18] that improve the model's attention to tiny object features. Although these works have improved tiny object detection performance in their respective scenarios, they may not be applicable when the scene changes. Considering this, we propose the MTP-YOLO algorithm, which is designed for the maritime scenarios considered in this paper.

Overview of MTP-YOLO
MTP-YOLO is built upon YOLOv8 and comprises three primary parts, as depicted in Figure 1. The backbone of MTP-YOLO includes the following four main components: CBS (conv2d, batch normalization, sigmoid linear unit), C2fELAN, MCE-CBAM, and SPPF (Spatial Pyramid Pooling Fast). These are utilized to extract pertinent attributes of the object from an input image. The neck architecture still adopts the FPN (Feature Pyramid Network) and PAN (Path Aggregation Network) structures for feature fusion at different scales, and is mainly composed of C2fELAN, Concat, Upsample, and CBS components. This enhances the model's ability to recognize objects of various sizes by integrating localization information and semantic information. The head module adopts the current mainstream decoupled head structure, dividing the detection head into a regression branch and a classification branch. The regression branch uses DFL (Distribution Focal Loss) and Weighted EIoU Loss, while the classification branch uses BCE (Binary Cross Entropy) loss.

MTP-YOLO retains the overall style of YOLOv8. However, the original YOLOv8 network was not designed for tiny object detection tasks, which reduces its applicability in tiny person detection tasks at sea. Therefore, we improved the network structure to enhance its performance in detecting tiny objects. Firstly, we drew inspiration from the GELAN structure of the latest object detector, YOLOv9 [19], and designed the C2fELAN module to replace the C2f modules in the original YOLOv8 backbone and neck. This allowed us to preserve the key features of the tiny objects and obtain richer gradient information during the calculation process, thereby achieving a higher detection accuracy. Secondly, we inserted our proposed MCE-CBAM module before the SPPF layer in the YOLOv8 architecture to boost the feature extraction ability of the backbone and make the model pay attention to key details that are conducive to identifying tiny objects. In addition, by analyzing the sensitivity of tiny objects to position and scale deviation, we designed a new bounding box regression loss called Weighted EIoU Loss, which replaces CIoU in the cost function to alleviate the substantial influence of tiny object position deviation on detection performance.

C2fELAN Module
By combining two neural network architectures designed using gradient path planning, CSPNet (Cross Stage Partial Network) and ELAN [20], the authors of YOLOv9 designed a Generalized Efficient Layer Aggregation Network (GELAN) that considers weight, inference speed, and accuracy. The design purpose of CSPNet is to enable the network to obtain richer gradient fusion information while reducing computational complexity. The method divides the tensor of the base layer into two segments, which are then merged through a cross-stage hierarchical approach. By separating the gradient flows, they can propagate on different network paths. In addition, CSPNet can greatly reduce computational complexity and improve inference speed and accuracy. The main purpose of designing ELAN is to address the issue of the gradually deteriorating convergence of deep models during model scaling. Comparing VoVNet (Variety of View Network) and ResNet (Residual Network), VoVNet performs worse than ResNet when stacking more blocks. The authors attribute this to the large number of transition layers in the VoVNet structure, which leads to an increasing number of shortest gradient paths when stacking blocks, making training more difficult as the number of blocks increases. Therefore, by appropriately deleting the transition layer, network performance can be improved, and the shortest gradient path of the entire network can be quickly lengthened. With this design strategy, ELAN can still be trained successfully when the network is stacked deeper. The authors of YOLOv9 extended ELAN, which initially only used convolutional layers for stacking, to a new architecture that can accommodate any kind of computational block.
Taking inspiration from the GELAN module proposed by the authors of YOLOv9, we combined the C2f and ELAN neural network modules with gradient path planning to design the feature extraction module, C2fELAN, used in this paper. This structure can retain relatively complete feature information of small objects and provide reliable gradient information that can be used to determine the objective function. The comprehensive layout is depicted in Figure 2. Specifically, we replaced the stacking of convolution modules in ELAN modules with the stacking of RepNC2f (re-parameterization cross stage partial layer with two convolutions without identity connection) modules. RepNC2f modifies the convolution in the bottleneck structure of the C2f module to RepConvN (re-parameterization convolution without identity connection), which is the structure of RepConv (re-parameterization convolution) after removing the identity mapping. The RepConv idea is to reparameterize the RepVGG block used during training, converting the 1 × 1 convolution and unprocessed identity maps in RepVGG into a 3 × 3 convolution, and then fusing them. By applying this RepConv to RepNC2f and RepNBottleneck, the inference efficiency of the network can be greatly improved.

As shown in Figure 3, when the model is in the training phase, the RepConvN module has two different convolution kernels: 3 × 3 and 1 × 1. When the model is in the inference stage, the 1 × 1 and 3 × 3 convolution kernels can be combined into a single 3 × 3 kernel through structural reparameterization. Specifically, the 1 × 1 kernel is padded on all sides into a 3 × 3 form. Based on the additivity principle of convolution kernels of the same size, the padded kernel is added to the original 3 × 3 convolution kernel to form a single 3 × 3 convolution kernel for inference.
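The kernel fusion described above can be checked numerically. The following is a minimal sketch, assuming a single input and output channel and ignoring batch normalization and bias terms, which a full RepConvN block would also need to fold in. The helper names `conv2d_same` and `fuse_kernels` are hypothetical and not taken from the authors' code.

```python
def conv2d_same(image, kernel):
    """Cross-correlate a 2D image with a square, odd-sized kernel using zero padding."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:  # positions outside the image count as zero
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def fuse_kernels(k3, k1):
    """Zero-pad the 1x1 kernel to 3x3 and add it to the 3x3 kernel."""
    fused = [row[:] for row in k3]  # copy so the training-time kernel is left untouched
    fused[1][1] += k1[0][0]         # the padded 1x1 kernel is non-zero only at the centre
    return fused


image = [[1.0, 2.0, 0.0, -1.0],
         [0.5, -2.0, 3.0, 1.0],
         [4.0, 0.0, 1.5, 2.0]]
k3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
k1 = [[0.7]]

# Training-time form: two parallel branches whose outputs are summed.
branch3 = conv2d_same(image, k3)
branch1 = conv2d_same(image, k1)
two_branch = [[a + b for a, b in zip(r3, r1)] for r3, r1 in zip(branch3, branch1)]

# Inference-time form: one convolution with the fused kernel.
single = conv2d_same(image, fuse_kernels(k3, k1))

max_diff = max(abs(a - b) for ra, rb in zip(two_branch, single) for a, b in zip(ra, rb))
print(max_diff)  # difference is at floating-point rounding level
```

Because convolution is linear in the kernel, the two forms agree up to rounding, so the inference network needs only one 3 × 3 convolution per block.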

Multi-Level Cascaded Enhanced CBAM Module
In recent years, various object detection architectures have adopted attention mechanisms to optimize their models and have achieved good results. Research on the combination of deep learning and visual attention mechanisms mostly focuses on using masks to form attention mechanisms. The principle of masking is to identify key features in image data through another layer of new weights. Through learning and training, deep neural networks learn the areas that need attention in each new image, forming the necessary attention. Among them, the most typical attention mechanisms include the self-attention, spatial attention, and temporal attention mechanisms. These attention mechanisms allow the model to assign different weights to different positions of the input sequence in order to focus on the most relevant part when processing each sequence element.
Therefore, MTP-YOLO was also designed with a Multi-level Cascaded Enhanced CBAM, targeted at improving the tiny person detection effect, as depicted in Figure 4. Considering that the original CBAM can enhance the model's ability to focus on key features, we stacked and cascaded the spatial attention module and channel attention module in the CBAM to further enhance the model's ability to focus on crucial attributes and improve its detection performance.
The traditional CBAM aims to enhance the network's representation ability by introducing attention mechanisms, and includes the following two submodules: a channel attention module and a spatial attention module. Through the adaptive refinement of intermediate feature representations in each convolutional block of the deep network, CBAM attends to key information and suppresses unnecessary information. Using both average pooling and max pooling, the channel attention mechanism aggregates the spatial information of the input feature maps into two feature descriptors. Each descriptor is fed into a shared multi-layer perceptron, the two outputs are added element by element, and the channel attention map is generated through a sigmoid activation function. The spatial attention mechanism first conducts channel-wise global maximum pooling and global average pooling on the input feature map, yielding a pair of feature maps. Next, these two feature maps are concatenated along the channel axis and a convolution is applied to reduce them to a single-channel map. Subsequently, spatial attention features are generated through a sigmoid operation.
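To make the two submodules of the traditional CBAM concrete, the following is a minimal sketch on nested Python lists. It covers only the original CBAM (channel attention followed by spatial attention), not the multi-level cascade of MCE-CBAM, whose exact wiring is given only in Figure 4. The weights, the 3 × 3 spatial kernel size, and all function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU; w1 has shape (hidden, C), w2 has shape (C, hidden)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(fmap, w1, w2):
    """fmap is a list of C channels, each H x W. Returns one weight in (0, 1) per channel."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]  # global average pooling
    mx = [max(map(max, ch)) for ch in fmap]                            # global max pooling
    a, m = shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2)             # same MLP for both
    return [sigmoid(x + y) for x, y in zip(a, m)]


def spatial_attention(fmap, kernel):
    """kernel has shape (2, k, k) and acts on the stacked [max, mean] maps. Returns H x W weights."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    mx = [[max(fmap[ch][i][j] for ch in range(c)) for j in range(w)] for i in range(h)]
    mean = [[sum(fmap[ch][i][j] for ch in range(c)) / c for j in range(w)] for i in range(h)]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for plane, weights in zip((mx, mean), kernel):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < h and 0 <= x < w:  # zero padding at the borders
                            acc += plane[y][x] * weights[di][dj]
            out[i][j] = sigmoid(acc)
    return out


def cbam(fmap, w1, w2, kernel):
    """Apply channel attention, then spatial attention, to a C x H x W feature map."""
    cw = channel_attention(fmap, w1, w2)
    refined = [[[v * cw[ch] for v in row] for row in fmap[ch]] for ch in range(len(fmap))]
    sw = spatial_attention(refined, kernel)
    return [[[refined[ch][i][j] * sw[i][j] for j in range(len(sw[0]))]
             for i in range(len(sw))] for ch in range(len(refined))]


# Toy input: 2 channels of size 2 x 2, with hand-picked (not learned) weights.
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, -1.0], [2.0, 0.5]]]
w1 = [[0.5, -0.25]]    # C=2 -> hidden=1
w2 = [[1.0], [-1.0]]   # hidden=1 -> C=2
kernel = [[[0.1] * 3 for _ in range(3)] for _ in range(2)]

channel_weights = channel_attention(fmap, w1, w2)
out = cbam(fmap, w1, w2, kernel)
```

Both attention maps lie in (0, 1), so the module only rescales features; it never changes the shape of the feature map, which is what allows it to be inserted before the SPPF layer without altering the rest of the backbone.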


Weighted-EIoU Loss
The loss associated with object position regression is a vital part of the loss function used in object detection, and currently the mainstream bounding box regression loss is the IoU series [21][22][23][24]. Although this family has undergone multiple evolutions, we found that all of its members overlook one problem: tiny objects differ in their sensitivity to positional and scale deviations of the detection box, as shown in Figure 5. When the detection box is offset by one box width in the horizontal direction, the tiny object disappears from the detection box, resulting in a missed detection, even though the offset distance is very small. When the size of the box doubles, the tiny object is still in the detection box, and the model can still recognize it without missing the detection.
To address this problem, we propose a novel bounding box loss function called Weighted-EIoU Loss. We apply different weights to the center point distance deviation term and the detection box scale deviation term, respectively, with the aim of boosting the detection efficacy of tiny objects:

$$L_{W\text{-}EIoU} = 1 - IoU + \alpha \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \beta \left( \frac{\rho^{2}(w, w^{gt})}{C_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_{h}^{2}} \right) \quad (1)$$

$$\alpha + 2\beta = 3 \quad (2)$$

where $IoU$ represents the intersection over union of the detection box and the ground truth, $c$ is the diagonal length of the minimum bounding rectangle, $C_{w}$ and $C_{h}$ are the width and height of the minimum bounding rectangle, $b$ and $b^{gt}$ represent the center points of the detection box and the ground truth, $w$ and $h$ represent the width and height of the detection box, $w^{gt}$ and $h^{gt}$ represent the width and height of the ground truth, $\rho$ stands for the Euclidean distance, and $\alpha$ and $\beta$ represent the weights applied. Equation (1) takes into account the overlapping area, center point distance, and differences in width and height between the predicted and actual boxes simultaneously. The constraint $\alpha + 2\beta = 3$ in Equation (2) is meant to adjust only the weights within the bounding box loss, thereby avoiding implicitly imposing additional weights between the bounding box loss and the classification loss. In addition, considering that tiny objects are more sensitive to center point distance offset, we set $\alpha > \beta$ in the experiment.
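The effect of the weighting can be illustrated on the two cases discussed above: a box shifted by one width and a box doubled in size. The following is a minimal sketch that assumes W-EIoU takes the standard EIoU form with weight α on the center distance term and weight β on the width and height terms, and that boxes are given as (x1, y1, x2, y2) corners. The function name, the `eps` stabilizer, and the example boxes are illustrative assumptions.

```python
def weighted_eiou_loss(pred, gt, alpha=3.0, eps=1e-9):
    """Weighted EIoU loss for two boxes given as (x1, y1, x2, y2), with beta = (3 - alpha) / 2."""
    beta = (3.0 - alpha) / 2.0
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Overlap term: intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Minimum bounding rectangle enclosing both boxes.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)

    # Squared distance between box centres, normalised by the squared diagonal.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    center_term = rho2 / (cw ** 2 + ch ** 2 + eps)

    # Squared width and height differences, normalised by the enclosing box.
    scale_term = (pw - gw) ** 2 / (cw ** 2 + eps) + (ph - gh) ** 2 / (ch ** 2 + eps)

    return 1.0 - iou + alpha * center_term + beta * scale_term


gt = (10.0, 10.0, 14.0, 14.0)       # a 4 x 4 ground-truth box
shifted = (14.0, 10.0, 18.0, 14.0)  # same size, offset by one width: object no longer inside
doubled = (8.0, 8.0, 16.0, 16.0)    # same centre, twice the size: object still inside

for alpha in (1.0, 3.0):
    print(alpha, weighted_eiou_loss(shifted, gt, alpha), weighted_eiou_loss(doubled, gt, alpha))
```

In this toy case, with α = 1 (plain EIoU) the doubled box receives a slightly larger loss (about 1.25) than the shifted box (about 1.2), even though only the shifted box misses the object. With α = 3 the ordering is reversed (about 1.6 for the shifted box versus 0.75 for the doubled box), which matches the motivation given for the weighting.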


Experiment

Datasets and Experimental Settings
The TinyPersonv2 dataset used in this paper includes 6278 images, which are taken from Internet platforms such as Baidu, YouTube, and Bing, as well as from cameras, and are specially designed for tiny person detection. We randomly divided it into a training set and a validation set with a ratio of 8:2. In order to further enrich the dataset and enhance the generalization ability of the model, we also utilized multiple data augmentation methods, including HSV transformation, shifting, and mosaic augmentation.
We trained our MTP-YOLO on an NVIDIA RTX3060 GPU using the PyTorch 2.0.1 framework. None of the networks mentioned used pretrained weights; all were trained from scratch. Training ran for 500 epochs with the SGD optimizer. The starting learning rate was 0.01, with a momentum of 0.937 and a weight decay coefficient of 0.0005. Like YOLOv8, we turned mosaic augmentation off during the final 10 epochs. The input image size was set to 640 × 640 pixels and the batch size to 8.

Comparison of Different Weight Values for Weighted EIoU
In order to determine how to set the weights of the Weighted EIoU to achieve the optimal model performance, we started from α = 1 and increased the value of α with a step size of 0.5. We trained the model under each α value and evaluated its performance. Table 1 shows the experimental results. When α = 1, W-EIoU degenerates into EIoU. When 1 < α < 3, the model performance improves to varying degrees, and the best performance is achieved when α = 3.0. This is because, as α approaches 3.0, the width- and height-related terms in W-EIoU are suppressed, which increases the attention paid to positional deviation and allows more tiny objects to be detected. Therefore, we set α to 3.0, so that β is 0.0. At this setting the terms related to width and height are eliminated and the mAP score reaches its maximum value, confirming that tiny objects have a high sensitivity to positional discrepancies. This means that for small targets, the position of the center point should be considered the main factor, while width and height become irrelevant. At this point, Weighted EIoU is somewhat similar to DIoU, but unlike DIoU, a weight of α = 3 is applied to the term related to the center point distance.
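The weight pairs explored in this sweep follow directly from the constraint α + 2β = 3. The short sketch below enumerates them, assuming the sweep covers α from 1.0 to 3.0 inclusive in steps of 0.5. The helper name `beta_from_alpha` and the range check are illustrative assumptions.

```python
def beta_from_alpha(alpha):
    """Solve alpha + 2 * beta = 3 for beta, rejecting values that would make beta negative."""
    if not 0.0 <= alpha <= 3.0:
        raise ValueError("alpha must lie in [0, 3] so that beta stays non-negative")
    return (3.0 - alpha) / 2.0


alphas = [1.0 + 0.5 * i for i in range(5)]              # 1.0, 1.5, 2.0, 2.5, 3.0
schedule = [(a, beta_from_alpha(a)) for a in alphas]

for a, b in schedule:
    print(f"alpha={a:.1f}  beta={b:.2f}  alpha+2*beta={a + 2 * b:.1f}")
```

The first pair (1.0, 1.0) reproduces EIoU and the last pair (3.0, 0.0) removes the width and height terms entirely, while the total weight of the three penalty terms stays at 3 throughout.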

Algorithm Comparison
To demonstrate the effectiveness of our proposed network, we compared it with nine other state-of-the-art (SOTA) methods, including five anchor-based object detection methods (i.e., Faster RCNN, YOLOv5, YOLOv6 [25], YOLOv7, and SSD [26]) and four anchor-free object detection methods (i.e., FCOS [27], YOLOv8, YOLOv9, and DETR [28]). For a fair comparison, all comparison results were generated from the source code provided by the respective authors. All methods were retrained on the same dataset as the approach introduced in this paper, using the original parameter settings of the corresponding methods.
Table 2 and Figure 6 show the comparison of the four indicators and visualization results of different methods, respectively. Table 2 shows the evaluation scores of our method and other SOTA methods on precision, recall, and mAP metrics. Among them, precision represents the proportion of true positives among all the positive samples detected, recall represents how many positives in the total sample were predicted correctly, and mAP signifies the mean AP across all classes, where AP denotes the area enclosed by the precision-recall curve and the coordinate axes. The larger the three indicators, the better. In addition, the parameters of these models are shown in Table 3. As indicated in Table 2, the proposed MTP-YOLO achieves the highest mAP value, indicating the necessity of a network model specifically designed for detecting tiny persons at sea. For example, compared to the latest YOLOv9 method, our method achieved gains of 0.9% and 0.1% in precision and mAP, respectively. In Table 2, the precision and recall of YOLOv7 are better than those of MTP-YOLO. These two metrics were obtained at a specified confidence threshold, so YOLOv7 has a higher precision and recall only at that threshold. However, YOLOv7's mAP is lower than that of MTP-YOLO. Because mAP is obtained across different confidence thresholds, this indicates that MTP-YOLO adapts better to different confidence thresholds and is more robust. Furthermore, it should be highlighted that the model size of our method is much smaller than that of the best competing method, YOLOv9. Across all indicators, our method achieved 77.6% precision, 56.9% recall, and 69.1% mAP on the TinyPersonv2 dataset using only 48.7% of the parameters of YOLOv9. The above results clearly indicate that our model has achieved accuracy comparable to state-of-the-art object detection methods.
In Figure 6, we provide a visual (qualitative) comparison to demonstrate the superiority of the proposed MTP-YOLO. The first row comprises the original images, and the second row is our result, followed by YOLOv5, YOLOv6, YOLOv7, YOLOv8, and YOLOv9. Our method can accurately identify the position of tiny persons in the image. Compared with other SOTA methods, our proposed method has a lower missed detection rate and is more suitable for detecting tiny people at sea in emergency rescue missions.
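The distinction drawn above between threshold-dependent precision and recall and the threshold-independent AP can be sketched as follows. The detections are made-up illustrative values, and the all-point interpolation of the precision-recall curve is an assumption, since the text does not state which AP variant was used.

```python
def precision_recall(detections, num_gt, conf_threshold):
    """detections: list of (confidence, is_true_positive). Metrics at one confidence threshold."""
    kept = [is_tp for conf, is_tp in detections if conf >= conf_threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 0.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall


def average_precision(detections, num_gt):
    """Area under the precision-recall curve, sweeping the threshold over all confidences."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += is_tp
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Replace each precision by the best precision at any higher recall (monotone envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical detections for one class with 4 ground-truth persons.
detections = [(0.9, True), (0.8, False), (0.7, True), (0.4, True), (0.3, False)]
num_gt = 4

p, r = precision_recall(detections, num_gt, conf_threshold=0.5)
ap = average_precision(detections, num_gt)
print(p, r, ap)
```

For a single class, mAP reduces to this AP value. Two detectors can therefore rank differently on precision and recall at one fixed threshold than they do on AP, which integrates over all thresholds.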

Ablation Study
To validate the effectiveness of each enhancement strategy for tiny person detection, we conducted ablation experiments by adding the optimization measures one at a time. Table 4 presents the detailed results of these experiments.

Analysis of C2fELAN. C2fELAN increased the number of network layers from 225 to 548 and the computational cost from 28.8 GFLOPs to 78.7 GFLOPs. However, as shown in Table 4, integrating the C2fELAN module raised the model's precision over the baseline from 0.758 to 0.771, a gain of 1.3 percentage points. The recall increased from 0.578 to 0.597, a gain of 1.9 percentage points. In addition, the mAP of the model increased from 0.674 to 0.689. These results indicate the effectiveness of the C2fELAN module.

Analysis of MCE-CBAM. In Table 4, we also demonstrate the effectiveness of the Multi-level Cascaded Enhanced CBAM. This module brought a gain of 0.4 percentage points in precision and 0.1 percentage points in mAP over the baseline. The Multi-level Cascaded Enhanced CBAM enables the network to focus on important regions that are conducive to detecting tiny objects, thereby improving the detection performance for tiny objects.

Analysis of W-EIoU. Table 4 shows that adopting the refined Weighted EIoU led to an improvement of 0.1 percentage points in both the model's precision and its mAP. The experimental results demonstrate that the Weighted EIoU loss accounts for the sensitivity of tiny objects to position deviation by applying a larger weight to the position deviation term. This makes the model focus more on predicting the center point position, thereby improving its ability to recognize tiny objects.
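The idea of weighting the position deviation term can be sketched as follows. The exact W-EIoU formula is not reproduced in this section, so the sketch assumes the standard EIoU loss with a single hypothetical weight `alpha` multiplying its center-distance term. The function name, the `(x1, y1, x2, y2)` box format, and the default value of `alpha` are illustrative assumptions and not the authors' implementation.

```python
def weighted_eiou_loss(pred, target, alpha=2.0, eps=1e-9):
    """EIoU-style loss with a hypothetical weight on the position term.

    Boxes are (x1, y1, x2, y2). Setting alpha = 1 gives the standard
    EIoU loss; alpha > 1 penalizes center-point deviation more strongly.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Squared distance between the two box centers.
    center_dist2 = ((px1 + px2 - tx1 - tx2) ** 2
                    + (py1 + py2 - ty1 - ty2) ** 2) / 4.0

    position_term = center_dist2 / (cw ** 2 + ch ** 2 + eps)
    width_term = (pw - tw) ** 2 / (cw ** 2 + eps)
    height_term = (ph - th) ** 2 / (ch ** 2 + eps)
    return 1.0 - iou + alpha * position_term + width_term + height_term

# Hypothetical boxes: a 10 x 10 ground truth and a prediction shifted by 2.
gt_box = (2.0, 0.0, 12.0, 10.0)
shifted = (0.0, 0.0, 10.0, 10.0)
loss_plain = weighted_eiou_loss(shifted, gt_box, alpha=1.0)
loss_weighted = weighted_eiou_loss(shifted, gt_box, alpha=2.0)
print(loss_plain, loss_weighted)
```

In this sketch, a pure shift of the predicted box leaves the width and height terms at zero, so the extra penalty for `alpha > 1` comes entirely from the center-distance term.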

Conclusions
This paper proposes an end-to-end object detection network specifically designed for detecting maritime tiny persons, called MTP-YOLO. Benefiting from the proposed C2fELAN feature extraction module, our network can fully capture key features related to tiny objects and locate them accurately. We integrated the designed Multi-level Cascaded Enhanced CBAM into our model, improving its capacity to focus on crucial details of tiny objects. In addition, replacing the bounding box regression loss function with our proposed Weighted EIoU further enhances the model's capacity to pinpoint the location of tiny objects, reducing the rate at which tiny objects go undetected. However, this method is mainly suitable for conditions with good lighting, and its performance may degrade as night approaches. Moving forward, we intend to obtain images of night or dark scenes through data augmentation or camera shooting so that the model is suitable for emergency rescue at night, to apply model lightweighting methods to enhance the detection efficiency of our model, and to evaluate the practical deployment efficacy of the proposed approach.

Figure 2. The structure of C2fELAN and its components. (a) The details of C2fELAN; (b) The details of RepNC2f; (c) The details of RepNBottleneck. RepNC2f: re-parameterization cross stage partial layer with two convolutions without identity connection. RepNBottleneck: re-parameterization bottleneck without identity connection. RepConvN: re-parameterization convolution without identity connection.

As shown in Figure 3, when the model is in the training phase, the RepConvN module has two different convolution kernels: 3 × 3 and 1 × 1. When the model is in the inference stage, the 1 × 1 and 3 × 3 convolution kernels can be combined into a single 3 × 3 kernel through structural reparameterization. Specifically, the 1 × 1 kernel is padded around its border to a 3 × 3 form. Based on the additivity of convolution kernels of the same size, the padded kernel is then added to the original 3 × 3 convolution kernel to form a single 3 × 3 convolution kernel for inference.
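The kernel merge described for RepConvN can be checked numerically with a minimal sketch. It assumes a single channel, zero padding, and no bias or batch normalization, all of which a real re-parameterized layer would also have to fold in. The helper names `conv2d_same` and `merge_kernels` are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Single-channel 2-D cross-correlation with zero padding ('same' size)."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, pad)
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * kernel)
    return out

def merge_kernels(k3, k1):
    """Zero-pad the 1 x 1 kernel to 3 x 3 and add it to the 3 x 3 kernel."""
    k1_padded = np.pad(k1, 1)  # the 1 x 1 weight ends up at the center
    return k3 + k1_padded

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))    # toy feature map
k3 = rng.normal(size=(3, 3))   # training-time 3 x 3 branch
k1 = rng.normal(size=(1, 1))   # training-time 1 x 1 branch

two_branch = conv2d_same(x, k3) + conv2d_same(x, k1)  # training-time output
merged = conv2d_same(x, merge_kernels(k3, k1))        # inference-time output
print(np.allclose(two_branch, merged))
```

The two outputs agree because convolution is linear in the kernel, so summing the branch outputs is equivalent to convolving once with the summed kernels.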

Figure 4. The structure of MCE-CBAM and its components. (a) The structure of MCE-CBAM; (b) The details of channel attention mechanism; (c) The details of spatial attention mechanism. CAM: channel attention module. SAM: spatial attention module. FC: fully connected layer.

Figure 5. Comparison of the sensitivity of small objects to the position and scale deviation of bounding boxes. The red box represents the ground truth, and the yellow dashed box represents the detection box.

Figure 6. Visual comparison between MTP-YOLO and other networks.

Table 1. The results when taking different values of α.

Table 2. Comparison of the MTP-YOLO with other networks. RCNN: region-convolutional neural network. YOLO: You only look once. SSD: Single Shot Multi-Box Detector. FCOS: Fully Convolutional One-Stage object detection. DETR: Detection Transformer. MTP-YOLO: "You only look once"-based Maritime Tiny Person detector. Bold represents the maximum value of the column.

Table 4. Results of ablation study. C2fELAN: cross stage partial layer with two convolutions efficient layer aggregation networks. MCE-CBAM: Multi-level Cascaded Enhanced Convolutional Block Attention Module. W-EIoU: Weighted Efficient Intersection over Union. Bold represents the maximum value of the column.