An Enhanced Target Detection Algorithm for Maritime Search and Rescue Based on Aerial Images

Abstract: Unmanned aerial vehicles (UAVs), renowned for their rapid deployment, extensive data collection, and high spatial resolution, are crucial in locating distressed individuals during search and rescue (SAR) operations. Maritime search and rescue, however, suffers from missed detections caused by factors such as sunlight reflection. In this study, we propose an enhanced algorithm, ABT-YOLOv7, for detecting persons in the water. The algorithm integrates an asymptotic feature pyramid network (AFPN) to preserve target feature information. The BiFormer module enhances the model's perception of small-scale targets, whereas the task-specific context decoupling (TSCODE) mechanism resolves conflicts between localization and classification. In quantitative experiments on a curated dataset, our model outperformed methods such as YOLOv3, YOLOv4, YOLOv5, YOLOv8, Faster R-CNN, Cascade R-CNN, and FCOS. Compared with YOLOv7, our approach raises the mean average precision (mAP) from 87.1% to 91.6%. The approach also reduces the sensitivity of the detection model to low-light conditions and sunlight reflection, demonstrating enhanced robustness. These improvements advance the use of UAV technology in maritime search and rescue.


Introduction
Unmanned aerial vehicles (UAVs) are lightweight, user-friendly, maneuverable, and budget-friendly, and these attributes have made them crucial for capturing video remote sensing data. They have widespread applications in various sectors, including civilian, military, and scientific research domains [1,2]. Among the various UAV applications, search and rescue (SAR) missions are particularly well matched to these characteristics: UAVs offer rapid deployment, high data capacity, and excellent spatial resolution. During search and rescue missions, UAVs are crucial in locating individuals who have fallen into the water and provide invaluable on-site overviews. Aerial support is especially valuable in maritime search and rescue scenarios because it enables rapid monitoring and extensive searches. However, the detection and identification of individuals in open water remains a challenging aspect of this application. Even experienced search and rescue operators have difficulty identifying individuals in images manually, which indicates the need for computer-assisted object detection.
Traditional algorithms for object detection can be categorized into two main types: those based on handcrafted feature extraction and those based on automatic feature extraction. Representative detectors based on handcrafted feature extraction include the Viola-Jones detector [3], the histogram of oriented gradient (HOG) detector [4], and deformable parts model (DPM) detectors [5]. Algorithms based on automatic feature extraction, such as frame differencing, detect moving objects by computing the differences between adjacent frames in a video sequence and extracting the contours of the objects. Baykara et al. utilized frame differencing for moving-object detection and further improved detection accuracy by applying morphological dilation to individual targets [6,7]. However, these methods lack robustness and fail to achieve the accuracy required for man-overboard detection using UAV images.
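The frame-differencing idea described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in [6,7]: frames are assumed to be grayscale images stored as nested lists, and the function names, the threshold of 25, and the square dilation window are hypothetical choices.

```python
from typing import List

Frame = List[List[int]]  # grayscale frame: rows of 0-255 intensities (assumed format)


def frame_difference_mask(prev: Frame, curr: Frame, threshold: int = 25) -> Frame:
    """Mark pixels whose intensity changed by more than `threshold` between frames."""
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def dilate(mask: Frame, radius: int = 1) -> Frame:
    """Morphological dilation with a (2*radius+1) x (2*radius+1) square window."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w and mask[yy][xx]:
                        out[y][x] = 1
    return out
```

Dilation grows each changed region so that fragmented responses from one moving target merge into a single blob before contour extraction.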
Advancements in computational capabilities have catalyzed the rapid evolution of deep learning techniques. With the emergence of detection algorithms rooted in deep learning [8], traditional methods have become less applicable, and deep learning detectors have found applications in diverse fields such as autonomous driving [9], robotics [10], surveillance [11], and agriculture [12]. However, the use of UAVs to detect individuals in distress at sea still presents several challenges. These challenges primarily include the following: (1) In UAV-captured images, individuals in maritime distress occupy very few pixels, which makes detection difficult. (2) Emergencies can occur at any time of day; therefore, variations in lighting conditions can affect the performance of detectors. (3) During maritime search and rescue operations, the water surface can reflect sunlight, degrading the quality of the captured images. These complexities make target recognition and localization exceptionally challenging and can lead to missed detections and false alarms in real-world scenarios. To address these challenges, this study introduces an enhanced target detection algorithm for detecting individuals in maritime distress in UAV images, with the aim of ensuring precise detection and improving rescue efficiency.
The primary contributions of this study are as follows:

• The model's feature extraction capability is enhanced by integrating an asymptotic feature pyramid network (AFPN). This architecture facilitates direct interaction between adjacent hierarchical levels, addresses semantic gaps, and mitigates information loss in the target features. As a result, the model retains detailed feature information even in low-light and high-contrast environments.

• To enhance the perception of small-scale targets in UAV image data, we introduce an attention module called BiFormer. This module leverages adaptive computation allocation and content awareness, allowing it to prioritize image regions relevant to the targets and thus improving the ability of the model to perceive the characteristics of individuals in maritime distress.

• To optimize the execution of the localization and classification tasks and to resolve the conflicts between them, we employ a decoupled detection head known as task-specific context decoupling (TSCODE). This head replaces the coupled detection head, enabling the localization and classification tasks to be executed separately. Consequently, the accuracy and performance of the model in object detection are improved.
The remainder of this paper is organized as follows: Section 2 reviews related studies. Section 3 presents the specialized dataset and the UAV-based man-overboard detection algorithm. Detailed information regarding the experiments and analysis is provided in Section 4. In Section 5, conclusions are drawn, and future research directions are outlined.

Object Detection
In the early days of computer vision, traditional object-detection algorithms relied on meticulously handcrafted feature engineering. In an era in which effective image representations were lacking, these algorithms required intricate feature representations and various acceleration techniques. In 2001, Viola and Jones [3] introduced the Viola-Jones detector, which slides a window across all possible positions and scales in an image to detect the presence of faces. Dalal and Triggs [4] proposed the HOG detector, balancing feature invariance and non-linearity by calculating features on a uniform grid and employing overlapped local contrast normalization (within "blocks"). Felzenszwalb and Girshick [5] introduced the DPM, treating inference as the collective detection of different object parts. Subsequently, as the performance of handcrafted features began to saturate, progress in object detection research slowed until 2014, when convolutional neural networks (CNNs) were introduced as a solution to the problem [13]. CNN-based object detection methods are categorized into one- and two-stage detectors.

Two-stage algorithms first select candidate regions within an image and then classify and refine the target position within these regions. Girshick et al. [13] introduced deep learning for object detection and proposed the region-based CNN (R-CNN) algorithm, which laid the foundation for subsequent CNN-based object detection methods. He et al. [14] enhanced the CNN architecture by introducing spatial pyramid pooling (SPP) layers, resulting in the SPP-Net detection algorithm, which is faster than R-CNN. Girshick [15] introduced the Fast R-CNN model, which distinguishes all potential candidate boxes from the extracted feature maps, thus improving the training and detection speeds compared to R-CNN. Ren et al. [16] presented the Faster R-CNN algorithm, which generates candidate regions with a region proposal network and improves the training and detection speed compared to the Fast R-CNN model. Cascade R-CNN [17] was developed by Cai and Vasconcelos to combat overfitting using a series of detectors with increasing intersection over union (IoU) thresholds.
One-stage detectors are based on global regression and classification and directly predict the position and category of the target. Compared with two-stage methods, they are better suited for real-time detection. Redmon et al. [18] introduced the you only look once (YOLO) detection algorithm, which employs a single neural network to predict target positions and categories directly from images, although its detection accuracy is low. Liu et al. [19] introduced the single-shot multibox detector (SSD) algorithm, which employs the VGG-16 [20] deep convolutional neural network to extract multiscale feature maps and directly output target positions. Redmon et al. [21] introduced YOLOv3, a high-accuracy, high-speed object-detection algorithm. YOLOv4, released in 2020, employs CSPDarkNet-53 as the backbone network, reducing redundant gradient calculations and integrating various advanced techniques to enhance object detection performance. YOLOv5 offers five models of different sizes and complexities (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x) to accommodate various hardware and computational capabilities. YOLOv5 emphasizes engineering considerations, prioritizing model flexibility and ease of deployment, albeit with performance tradeoffs. Li et al. [22] introduced YOLOv6, which uses the RepBlock module, inspired by the RepVGG network [23], in the backbone to enhance the training and inference speeds. YOLOv7 [24] introduces a trainable bag-of-freebies, reparameterizes the model structure, and integrates "extend" and "compound scaling" techniques to enhance inference speed and accuracy. Tian et al. proposed a fully convolutional one-stage object detector (FCOS) [25], which effectively filters out low-scoring boxes that have very low intersection-over-union ratios.

Object Detection Based on UAV Images
The use of advanced object detection methods in conjunction with UAVs has gained significant attention because of the effective and flexible data acquisition capabilities of UAVs. However, detecting individuals in maritime distress in UAV imagery differs from conventional target detection in natural images. In deep learning-based object detection, detectors rely on features extracted by the backbone of the network to identify targets. In deep CNNs, shallow layers involve fewer convolution operations and capture texture information, whereas deeper layers, after more convolution and downsampling operations, focus on extracting semantic information. UAV imagery differs from natural images in that the targets are often very small. After multiple convolution and downsampling operations on such images, the desired information is difficult to retain, and target feature information is lost to a significant degree in the presence of intense reflections, which reduces detection accuracy. Furthermore, UAVs primarily capture images from a top-down perspective, resulting in significant differences in target features compared with natural images. Researchers have made concerted efforts to address these issues. Zhang et al. [26] introduced an architecture that interlaces bounding boxes and masks by employing pixel-level algorithms to enhance the model's capability to detect dense and small objects, and used variable convolutions to adjust the receptive fields dynamically, improve the transferability of learned features, and mitigate the impact of viewpoint changes. Liu et al. [27] optimized the Resblocks in Darknet by connecting two ResNet units with the same width and height and enhanced the entire Darknet structure by increasing the convolution operations in the early layers, which enriches spatial information and improves the model's ability to detect small objects. Ye et al. [28] employed a multiscale feature-fusion module to enhance the detection of numerous small objects in UAV imagery. Moreover, preliminary results on the Manipal-UAV [29] dataset, which is designed for person detection, offer a foundation for exploring other UAV applications. These applications encompass crowd detection and counting, action recognition, and person tracking, indicating the versatility and potential impact of research in this field.
Researchers have applied these methods to various tasks involving UAVs. Wu et al. [30] employed object detection algorithms on UAV images for tree-crown detection. Qiu et al. [31] conducted experiments to identify an optimal method for detecting tile cracks on sidewalks using UAVs. Shao et al. [32] utilized an improved version of YOLOv3 for pedestrian detection in UAV-IR images. Qin et al. [33] employed UAVs for intelligent and precise spraying on areca nut trees. Safonova et al. [34] utilized UAVs to search for and identify trees affected by pests at an early stage, thus enabling timely protective measures. Kainz et al. [35] leveraged UAVs for traffic monitoring, whereas Souza et al. [36] used UAVs to inspect power transmission lines.
Maritime object detection is a practical scenario in UAV applications. Tran et al. [37] integrated an improved version of YOLOv2 with UAVs to detect bottled marine debris (BMD). YOLO-D, proposed by Wang et al. [38], is a variant of YOLOv3 that incorporates a dual attention mechanism, allocates higher weights to small targets in the loss function, and utilizes a secondary recursive feature pyramid network (SR-FPN). Lu et al. [39] modified the YOLOv5 algorithm to better suit UAV applications in maritime and fishery enforcement. Zhao et al. [40] combined the YOLOv4 algorithm with DeepSORT for multi-boat speed measurements using UAV images. Bai et al. [41] deployed an enhanced YOLOv5s algorithm on UAVs to search for individuals who have fallen overboard. Despite the widespread application of object-detection techniques to UAV images, challenges persist when employing UAVs for man-overboard search and rescue operations, including small target sizes and interference from intricate backgrounds.
In summary, deep-learning-based object detection algorithms, which are the predominant approaches in this domain, complement the convenience and efficiency of UAVs. The YOLO series has been extensively studied and implemented in UAV visual systems. In this study, we explore the suitability of YOLOv7 as a visual model for SAR applications using UAVs. The enhanced object-detection model shows potential for delivering improved performance in UAV-based man-overboard detection.

YOLOv7
YOLOv7 adheres to the overarching structure of the YOLO series, as shown in Figure 1, and can be divided into three main components: backbone, neck, and head. The input images are first processed by the backbone network, which extracts the image features. Next, the extracted features are fused and processed by the neck module. Finally, detection results are obtained from the head module. The neck module, which includes modules such as SPPCSPC and ELAN-Y, performs feature fusion via upsampling, enabling top-down information propagation and leveraging both high- and low-level feature information. The SPPCSPC module improves computation speed and accuracy via max-pooling operations, enhancing the resilience of the model to images of varying resolutions.
The head module consists of detection heads of different scales: large, medium, and small. It acts as the classifier and regressor of the network, handling the classification and regression tasks to achieve object classification and localization. The RepConv module, comprising three branches with convolutional and batch normalization layers, is reparameterized into a single convolution for inference.
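The reparameterization of a multi-branch block can be illustrated with a small, single-channel sketch. It assumes a RepVGG-style layout (a 3×3 convolution, a 1×1 convolution, and an identity branch, each followed by batch normalization in inference mode); the exact branch layout in YOLOv7, the function names, and the scalar treatment of channels are simplifying assumptions made for illustration.

```python
import math
from typing import List, Tuple

Map2D = List[List[float]]
BN = Tuple[float, float, float, float]  # (gamma, beta, running_mean, running_var)
EPS = 1e-5


def bn_scale(bn: BN) -> float:
    gamma, _, _, var = bn
    return gamma / math.sqrt(var + EPS)


def apply_bn(x: Map2D, bn: BN) -> Map2D:
    """Inference-mode batch normalization using the stored running statistics."""
    _, beta, mean, _ = bn
    s = bn_scale(bn)
    return [[(v - mean) * s + beta for v in row] for row in x]


def conv3x3(x: Map2D, kernel: Map2D, bias: float = 0.0) -> Map2D:
    """Naive 3x3 cross-correlation with zero padding of 1 (stride 1)."""
    h, w = len(x), len(x[0])
    out = [[bias] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += kernel[i][j] * x[rr][cc]
    return out


def centre_kernel(weight: float) -> Map2D:
    """Embed a 1x1 kernel (or the identity, weight=1) at the centre of a 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, weight, 0.0], [0.0, 0.0, 0.0]]


def fold_bn(kernel: Map2D, bn: BN) -> Tuple[Map2D, float]:
    """Fold BN into the preceding convolution: returns (scaled kernel, bias)."""
    _, beta, mean, _ = bn
    s = bn_scale(bn)
    return [[w * s for w in row] for row in kernel], beta - mean * s


def merge_branches(k3: Map2D, bn3: BN, w1: float, bn1: BN, bn_id: BN) -> Tuple[Map2D, float]:
    """Collapse the three branches into one 3x3 kernel and one bias."""
    folded = [fold_bn(k3, bn3), fold_bn(centre_kernel(w1), bn1),
              fold_bn(centre_kernel(1.0), bn_id)]
    kernel = [[sum(k[i][j] for k, _ in folded) for j in range(3)] for i in range(3)]
    return kernel, sum(b for _, b in folded)


def multi_branch_forward(x: Map2D, k3: Map2D, bn3: BN, w1: float, bn1: BN, bn_id: BN) -> Map2D:
    """Reference output of the unfused block: sum of the three branch outputs."""
    a = apply_bn(conv3x3(x, k3), bn3)
    b = apply_bn(conv3x3(x, centre_kernel(w1)), bn1)
    c = apply_bn(x, bn_id)
    return [[a[r][col] + b[r][col] + c[r][col] for col in range(len(x[0]))]
            for r in range(len(x))]
```

Because every branch is linear once batch normalization is frozen, `conv3x3(x, *merge_branches(...))` reproduces `multi_branch_forward(...)` up to floating-point error, which is why the fused block costs a single convolution at inference.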

Improvement
Although YOLOv7 shows excellent performance in general object detection tasks, it still faces challenges in man-overboard detection scenarios, such as small target scales and complex backgrounds. Therefore, we propose ABT-YOLOv7, a modified version of YOLOv7 designed to detect individuals in maritime distress using UAV images.

Improvement Based on AFPN
In YOLOv7, the multiscale feature extraction strategy utilizes classical top-down and bottom-up feature pyramid networks. However, these methods lose or degrade feature information, which impairs the fusion of nonadjacent hierarchical layers. Therefore, we adopt the AFPN [42] structure, which is engineered to foster direct interactions among nonadjacent layers.
As shown in Figure 2, the AFPN first fuses two adjacent low-level features and gradually incorporates high-level features in subsequent steps, which helps avoid large semantic gaps between nonadjacent hierarchical layers. To address potential conflicts arising from multi-object information fusion at each spatial location, we employ adaptive spatial fusion operations. By progressively integrating features from different hierarchical levels and weighting them adaptively at each location, the AFPN narrows the semantic gap between nonadjacent layers and mitigates information conflicts during feature fusion. This approach preserves features from each level, thus enhancing the extraction of target features.
When detecting individuals in distress in aerial imagery, strong reflections from extensive bodies of seawater often impair image quality significantly. To address this issue, we employ the AFPN, whose adaptive spatial fusion mitigates the information conflicts that may arise during feature integration.
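The adaptive spatial fusion step can be sketched as a per-location softmax over hierarchy levels. This is an illustrative simplification: the feature maps are assumed to be already resized to a common resolution and reduced to one channel, and the weight logits, which a real network would predict with small convolutions, are passed in directly. The function name is hypothetical.

```python
import math
from typing import List

Map2D = List[List[float]]


def adaptive_spatial_fusion(features: List[Map2D], logits: List[Map2D]) -> Map2D:
    """Fuse same-sized feature maps with per-location softmax weights across levels."""
    h, w = len(features[0]), len(features[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            scores = [level[y][x] for level in logits]
            peak = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - peak) for s in scores]
            total = sum(exps)
            # The weights sum to 1 at every location, so levels compete for influence.
            fused[y][x] = sum(e / total * f[y][x] for e, f in zip(exps, features))
    return fused
```

With equal logits the result is a plain average of the levels; a large logit for one level at a given location lets that level dominate there, which is how conflicting information from other levels is suppressed.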

Improvement Based on BiFormer
The concept of attention closely mirrors human cognitive focus, enabling the extraction of key focal points to emphasize pertinent information while minimizing interference from irrelevant data. As pivotal components of visual transformers, attention mechanisms are instrumental in capturing long-range contextual dependencies. However, the inherent characteristics of attention mechanisms result in a surplus of unproductive computations. Therefore, this study introduces the BiFormer [43] attention module, which harnesses sparsity to enhance the salience of target-relevant information and curtail redundant computations associated with inconsequential data.
The BiFormer module employs a two-step strategy. First, irrelevant key-value pairs are filtered out at the coarse region level. Subsequently, fine-grained token-to-token attention is applied within the union of the remaining candidate regions, referred to as routed regions. Through this query-adaptive strategy, BiFormer attends only to a small subset of tokens pertinent to each query, sidestepping nonessential tokens and their potential interference. Collecting key-value pairs only from the top-k relevant windows, as depicted in the figure, exploits sparsity to bypass computations in the least relevant areas. Figure 3 illustrates the sequential process of the BiFormer block.
When drone imagery is used to detect individuals in distress, the subjects are often far smaller than the background. BiFormer addresses this challenge by leveraging sparsity to enhance the saliency of target-relevant information.
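The two-step routing described above can be sketched on a toy input. This is a simplified illustration rather than the BiFormer implementation: tokens serve directly as queries, keys, and values (no learned projections), regions are given as lists of token vectors, and region summaries are plain token means. All function names are hypothetical.

```python
import math
from typing import List

Vec = List[float]
Region = List[Vec]  # tokens belonging to one coarse region


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vec(tokens: Region) -> Vec:
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_regions(regions: List[Region], top_k: int) -> List[List[int]]:
    """Coarse step: for each region, keep the indices of its top_k most related regions."""
    summaries = [mean_vec(r) for r in regions]
    routes = []
    for query in summaries:
        affinity = [dot(query, key) for key in summaries]
        ranked = sorted(range(len(regions)), key=lambda j: affinity[j], reverse=True)
        routes.append(ranked[:top_k])
    return routes


def bi_level_routing_attention(regions: List[Region], top_k: int) -> List[Region]:
    """Fine step: token-to-token attention restricted to the routed regions."""
    routes = route_regions(regions, top_k)
    output = []
    for region, routed in zip(regions, routes):
        gathered = [tok for j in routed for tok in regions[j]]  # keys == values here
        scale = math.sqrt(len(gathered[0]))
        attended = []
        for query in region:
            weights = softmax([dot(query, key) / scale for key in gathered])
            attended.append([sum(w * v[d] for w, v in zip(weights, gathered))
                             for d in range(len(query))])
        output.append(attended)
    return output
```

Each query token attends to `top_k` regions' worth of tokens instead of the whole image, so the cost of the fine step scales with the routed subset rather than with all tokens.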

Improvement Based on Decoupled Detection Head
The detection head of YOLOv7 performs the classification and regression tasks concurrently. However, in object detection, these tasks diverge in focus: classification emphasizes coarser semantic information, whereas regression relies on finer pixel-level details. To address this intrinsic dichotomy, YOLOX [44] introduced a decoupled detection head, in which the feature output from the neck is split into two branches, one dedicated to classification and one to localization, and distinct operations are performed in each task-specific branch. However, this straightforward approach does not fully resolve the issue because the two branches still share the same input, and features from different levels carry different degrees of semantic and spatial detail. Lower-level features abound with detailed information but lack semantic context, whereas higher-level features exhibit the opposite characteristics. This unavoidably limits the benefit that can be obtained from a decoupled head.
To optimize the performance of the decoupled head, we replaced the YOLOv7 detection head with the TSCODE head [45], which aimed to further enhance the detection of individuals in maritime distress scenarios.
The TSCODE head leverages intermediate feature maps extracted from different layers to generate $G_l^{cls}$ features for classification via downsampling, concatenation, and convolution operations. For regression, TSCODE integrates multiple feature levels via upsampling, aggregation, and convolutional layer operations to create $G_l^{loc}$ features. In this formulation, Concat(·) represents channel merging within feature maps, Conv(·) denotes the downsampling convolutional layer, and µ(·) signifies the upsampling operation. The $G_l^{cls}$ and $G_l^{loc}$ features are then fed into their respective task branches, so that each branch can be optimized independently. This design improves the performance of the decoupled head in both classification and regression by giving each task the feature information it needs. The structure of the classification branch is illustrated in Figure 4, and that of the localization branch is depicted in Figure 5. In the detection of individuals in maritime distress, classification relies on coarser semantic information, whereas regression places greater emphasis on finer pixel-level details. TSCODE addresses this difference by drawing on intermediate feature maps from different layers for the regression and classification operations.
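For reference, the two aggregations can be written compactly as below. This is a sketch that follows the TSCODE formulation [45] as we understand it, not a reproduction of the original equations; the symbol $P_l$ for the neck feature map at pyramid level $l$ is an assumption that does not appear in the text above.

```latex
\begin{aligned}
G_l^{cls} &= \mathrm{Concat}\big(\mathrm{Conv}(P_l),\; P_{l+1}\big), \\
G_l^{loc} &= P_l + \mu(P_{l+1}) + \mathrm{Conv}(P_{l-1}).
\end{aligned}
```

Under this reading, classification at level $l$ sees downsampled, semantically richer context, whereas localization at level $l$ combines the upsampled coarser level with the downsampled finer level to retain boundary detail.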
To address the challenges inherent in search-and-rescue scenarios, this section proposes three enhancement strategies. The AFPN is employed to augment the model's capability to extract features from smaller-scale targets, such as individuals in maritime distress. The integration of the BiFormer attention mechanism increases the model's focus on target regions, ameliorating the imbalance between positive and negative samples in search and rescue contexts. The inclusion of a decoupled detection head enhances both the detection precision and training speed of the model. Hereafter, we refer to our method as "ABT-YOLOv7", after the initials of AFPN, BiFormer, and TSCODE. The revised ABT-YOLOv7 model, tailored for man-overboard detection using UAV images, is shown in Figure 6.

Experiments

Dataset
To validate the enhanced performance of ABT-YOLOv7, we conducted experiments using a meticulously curated dataset composed of data selected from the MOBDrone [46] and SeaDronesSee [47] datasets.
The MOBDrone dataset comprises 49 high-resolution (4K) videos captured using a DJI FC6310 camera on a Phantom 4 Pro V2 drone. These videos depict a range of scenarios simulating individuals falling into the water, including conscious and unconscious individuals, along with other objects. The dataset comprises 66 videos with resolutions post-processed to 1080p, resulting in 126,170 images. Professional annotators manually labeled the bounding boxes for objects across five categories (person, boat, surfboard, wood, and lifebuoy), yielding 181,689 annotations.
The SeaDronesSee dataset comprises 14,227 RGB images (training set: 8930; validation set: 1547; test set: 3750). These images capture a diverse array of situations, spanning heights from 5 to 260 m and viewing angles from 0° to 90° (gimbal tilt angle). Each frame is accompanied by the corresponding height, angle, and other metadata. The dataset was captured using multiple cameras, and the annotations cover various categories, including swimmers, boats, jet skis, life-saving equipment, and buoys.
While MOBDrone concentrates on individuals without life jackets in maritime man-overboard scenarios, SeaDronesSee covers a broad spectrum of scenes related to the entire rescue process. We therefore combined the processed datasets and used them jointly to validate the proposed approach.

Experimental Setup
This study was conducted using the Linux operating system. The hardware configuration of the algorithmic environment comprises an Intel(R) Xeon(R) Bronze 3204 central processing unit (CPU) with a clock frequency of 1.90 GHz, coupled with 64 GB of memory. The graphics processing unit (GPU) is an NVIDIA A100-PCIE-40GB, featuring a 40 GB memory capacity. To leverage GPU acceleration, the system ran on CUDA 11.7. The programming language selected was Python 3.8.16, managed by Anaconda 2.3.1.0. The primary deep learning framework employed was PyTorch 2.0.1.
The setting of hyperparameters plays a key role in our experiments. The image size determines the resolution of the input data, which can significantly impact both the performance and speed of a model. The learning rate dictates the step size of each parameter update and requires adjustment based on the specific problem and model at hand. The learning rate decay frequency entails periodic adjustments of the learning rate during training to facilitate improved model convergence. Batch size refers to the number of samples processed together during each model update, influencing training speed and memory requirements. The number of training workers can expedite the data preparation process. Lastly, the maximum number of training epochs sets the overall duration of model training and should be tailored to the complexity of the task and model under consideration. The experimental setup is presented in Table 1.

Evaluation Metrics
The model is evaluated on man-overboard target recognition using six metrics: precision, recall, confidence, average precision (AP), the precision-recall (PR) curve, and IoU. These indicators are extracted from the output results.
The formulas for precision and recall are as follows:

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{TP}{n}$$

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{m}$$

where True Positive (TP) represents accurate predictions made by the model, False Positive (FP) indicates incorrect predictions, and False Negative (FN) represents instances the model failed to detect. Here, n is the total number of samples predicted as positive by the model, and m is the total number of actual positive samples. Precision and recall are selected as the evaluation metrics for the model algorithm to assess its performance. Precision measures the proportion of predicted positive samples that are correct, while recall gauges the model's ability to identify all positive instances in the dataset. The formulas for confidence and average precision are as follows:

$$\text{Confidence} = P_r(\text{Object}) \times \text{IoU}^{truth}_{pred}$$

$$AP = \int_0^1 P(R)\, dR$$

where $P_r(\text{Object})$ represents the probability of the current anchor box containing an object, and $\text{IoU}^{truth}_{pred}$ signifies the IoU value between the predicted anchor box and the actual target anchor box when the current anchor box contains an object. AP denotes the average precision across different recall levels, that is, the area under the PR curve with precision P expressed as a function of recall R. mAP is the mean of the AP values for the various classes.
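The metric definitions above can be made concrete with a short sketch. The paper does not state which interpolation scheme is used for AP, so the all-point interpolation below (a monotone precision envelope integrated over recall) is an assumption, as are the function names.

```python
from typing import List, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp: List[bool], num_gt: int) -> float:
    """Area under the PR curve for one class.

    `is_tp` flags each detection as TP or FP, sorted by descending confidence;
    `num_gt` is the number of ground-truth objects of that class.
    """
    precisions, recalls = [], []
    tp = 0
    for rank, hit in enumerate(is_tp, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)
    # Make precision non-increasing in recall (all-point interpolation, assumed).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class: List[float]) -> float:
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, three detections ranked TP, FP, TP against two ground-truth objects trace a PR curve whose area is 5/6; averaging such per-class values gives the mAP reported in the tables.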
IoU represents the overlap between the predicted bounding box and the actual bounding box. It is calculated by dividing the intersection area of the two bounding boxes by their union area and ranges from 0 to 1, where 0 indicates no overlap and 1 indicates a perfect match between the predicted and actual bounding boxes.
Mathematically, the IoU is calculated using the following formula:

$$\text{IoU} = \frac{|B_p \cap B_{GT}|}{|B_p \cup B_{GT}|}$$

Here, $B_p$ stands for the predicted frame, and $B_{GT}$ represents the ground truth frame. If the IoU of a predicted frame is greater than the IoU threshold, the prediction is classified as a True Positive (TP); if it is smaller, the prediction is classified as a False Positive (FP).
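A minimal sketch of the IoU computation and the TP/FP decision follows. Boxes are assumed to be axis-aligned and given as corner coordinates, and the default threshold of 0.5 is an illustrative assumption rather than a value taken from the experiments.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def iou(box_p: Box, box_gt: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_p[2], box_gt[2]) - max(box_p[0], box_gt[0]))
    inter_h = max(0.0, min(box_p[3], box_gt[3]) - max(box_p[1], box_gt[1]))
    inter = inter_w * inter_h
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(box_p: Box, box_gt: Box, threshold: float = 0.5) -> bool:
    """A prediction counts as TP when its IoU with the ground truth exceeds the threshold."""
    return iou(box_p, box_gt) > threshold
```

Two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, giving an IoU of 1/7, which would be counted as a false positive at the assumed threshold.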

Fusion Attention Mechanism Comparison Test
The BiFormer module, as an attention mechanism, can be incorporated into three key components of YOLOv7: the backbone, neck, and head. The impact of adding attention modules at different positions varies, and strategically placing the BiFormer module in the model is key to maximizing its effectiveness. To achieve the optimal integration of the BiFormer module with the model, its impact on each component was empirically analyzed. Table 2 indicates that, when integrated into the backbone, the BiFormer module significantly enhanced the recognition accuracy of the model, with both precision and recall surpassing other modifications. Lower-level semantic information is diluted through the backbone and neck, making it challenging to introduce substantial changes via the further attention weighting of fewer features. Consequently, the metrics show marginal improvements in this context.
However, the most substantial gain is observed when the BiFormer Module is incorporated into the neck.The attention weighting of feature maps using different dimensions was effective in preserving fine-grained information, yielding the most favorable results.
The most significant improvements were achieved by integrating the BiFormer Module into the backbone because the BiFormer Module enables the network to focus on crucial features while processing the input data.This selectivity allows the network to enhance its attention towards specific regions or features, thus improving the feature representation.This capability is valuable for handling complex scenes and tasks, such as object detection, because it facilitates better capture of essential information.

Decoupled Detection Head Comparison Test
We also replaced the original detection head of YOLOv7 with different decoupled heads. As illustrated in Table 3, the TSCODE decoupled head outperformed the decoupled head used in YOLOX. This superiority can be attributed to TSCODE's dedicated focus on enhancing the contextual understanding of both the classification and localization tasks. By considering the relationships between objects and their surrounding environments, TSCODE can provide robust and accurate predictions for complex maritime hazard scenarios.
YOLOv7 was used as the baseline for the ablation experiments. Subsequent experiments introduced the following enhancements: YOLOv7-A incorporated AFPN, YOLOv7-B integrated BiFormer, YOLOv7-C was combined with TSCODE, YOLOv7-D integrated both AFPN and TSCODE, YOLOv7-E combined BiFormer and TSCODE, and YOLOv7-F utilized all three methods.
Figure 7 and Table 4 indicate that each modular method yields improved experimental results compared with the original YOLOv7. This underscores the effectiveness of the enhancement techniques employed in this study for man-overboard detection.

Building upon the summarized optimization methods, we developed an enhanced search and rescue algorithm tailored for man-overboard scenarios. This algorithm incorporates AFPN for neck feature fusion, integrates the attention mechanism from BiFormer, and leverages TSCODE's decoupled detection head to generate the final output. Compared with the original YOLOv7 model, our approach increased the mean average precision (mAP) by 4.5 percentage points, from 87.1% to 91.6%.

Comparison of Different Object Detection Models
To establish the effectiveness of the proposed approach, we conducted experiments on a man-overboard dataset using seven classic object detection algorithms: Faster R-CNN, Cascade R-CNN, FCOS, YOLOv3, YOLOv4, YOLOv5, and YOLOv8. For YOLOv5, we selected YOLOv5m because it outperformed the other YOLOv5 versions (Table 5). The Faster R-CNN model exhibited a decrease of 35.5% in accuracy, 48.3% in recall, and 43.1% in mAP. The Cascade R-CNN model exhibited a decrease of 4.4% in accuracy, 5.4% in recall, and 5.2% in mAP. The FCOS model exhibited a decrease of 4.9% in accuracy, 11.5% in recall, and 5.9% in mAP. The YOLOv3 model exhibited a decrease of 6.2% in accuracy, 9.2% in recall, and 6.5% in mAP. The YOLOv4 model demonstrated an improvement of 4.7% in accuracy, a decrease of 12% in recall, and a 7% drop in mAP. The YOLOv5 model experienced a slight accuracy reduction of 3.6%, a decrease of 6.1% in recall, and a 5.5% drop in mAP. The YOLOv8 model exhibited an accuracy increase of 5.3%, a decrease of 11.6% in recall, and a 7.2% drop in mAP.
For man-overboard detection, our proposed ABT-YOLOv7 method surpasses all other algorithms in terms of detection accuracy. The results highlight the effectiveness of the enhancement techniques employed in this study, which can be harnessed to improve man-overboard detection tasks based on UAV images.

Results and Visualization
To assess the generalization performance of the proposed algorithm, we conducted evaluations using diverse sets of complex scenes extracted from the dataset. The first row in Figure 8 shows the efficacy of the proposed method under low-light conditions. Whereas YOLOv7 exhibits instances of missed detection, our method accurately identifies all the objects within the scene. In the second row of Figure 8, where the UAV images are subjected to intense sunlight reflections, YOLOv7's effectiveness diminishes, whereas our method continues to perform effectively. This resilience is attributed to the Adaptive Feature Fusion Network, which enables our algorithm to suppress background noise and extract target features, even in high-contrast images. As shown in the third row of Figure 8, changes in the perspective of the UAV result in variations in the object shapes within the image. In such cases, our algorithm relies on its attention mechanisms and feature extraction capabilities to identify targets precisely. Figure 8 shows the ability of the proposed algorithm to detect small-scale underwater entities in images captured by UAVs. ABT-YOLOv7 consistently and accurately detected all designated objects in these scenes. Our model demonstrated reduced sensitivity to environmental factors, such as changes in perspective and variations in water surface reflections, which enhanced its robustness under adverse conditions.
To enhance transparency and facilitate a more intuitive evaluation and comparison of the feature extraction capabilities of the proposed small-object detection methods, we employed the Grad-CAM (Gradient-weighted Class Activation Mapping) technique [48]. This approach enabled the visualization of the heat maps of the detected objects. The Grad-CAM algorithm calculates the gradients of the target class output with respect to the feature maps of the final convolutional layer. Subsequently, these gradients are used as weights in a weighted sum of the feature maps to generate the activation map, which highlights the regions of interest. By utilizing class gradients, Grad-CAM aids in analyzing the attention of networks relevant to specific categories. Visualizing these attention regions provides insights into whether the network has effectively learned the features or information pertinent to image classification.
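The Grad-CAM computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes that the feature maps and the class-score gradients of the chosen convolutional layer have already been extracted from the network. The function name `grad_cam`, the toy shapes, and the final rescaling to [0, 1] for display are assumptions.

```python
import numpy as np


def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM heat map for one image and one target class.

    feature_maps: activations of the chosen convolutional layer, shape (K, H, W).
    gradients:    gradients of the target class score with respect to those
                  activations, same shape (K, H, W).
    """
    # Channel weights: average each channel's gradient over the spatial dims.
    weights = gradients.mean(axis=(1, 2))                    # shape (K,)
    # Weighted sum of the feature maps over channels.
    weighted_sum = np.tensordot(weights, feature_maps, axes=1)  # shape (H, W)
    # ReLU keeps only regions with a positive influence on the class score.
    cam = np.maximum(weighted_sum, 0.0)
    # Rescale to [0, 1] for display (an assumption made for visualization).
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy example with random activations and gradients (hypothetical data).
rng = np.random.default_rng(0)
activations = rng.random((4, 8, 8))
grads = rng.standard_normal((4, 8, 8))
heatmap = grad_cam(activations, grads)
print(heatmap.shape)
```

In practice, the resulting map is upsampled to the input image size and overlaid on the image, so that brighter regions mark the areas that contributed most to the prediction.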
Figure 9 shows Grad-CAM images of both YOLOv7 and ABT-YOLOv7. In these images, the brighter regions indicate the specific areas on which the networks focused their attention. The enhancements proposed within the model architecture improve the capability of the model to identify individuals in maritime distress within UAV images. The improved model exhibited superior feature extraction and noise resistance in the context of identifying a man overboard. Consequently, the ABT-YOLOv7 model performed well in the task of searching for and rescuing submerged individuals. This establishes a viable solution for detecting individuals in water using UAV-captured imagery.

Conclusions
This study presents ABT-YOLOv7, a novel algorithm designed to detect individuals in maritime distress who have fallen overboard. Building on the YOLOv7 framework, this algorithm aims to overcome challenges such as small target sizes and sunlight reflection, which lead to missed detections and false alarms. The proposed ABT-YOLOv7 integrates an AFPN module, which fuses features from various levels, alleviating potential information conflicts and bolstering detection performance. The model also includes a BiFormer module to enhance its ability to perceive small-scale targets. In addition, the TSCODE decoupled detection head harmonizes the classification and localization tasks, improving detection accuracy and convergence speed. To validate the effectiveness of ABT-YOLOv7, we rigorously tested it on data obtained from the MOBDrone and SeaDronesSee datasets. The testing included ablation experiments and comparative trials encompassing a variety of scenarios associated with water rescue operations. In conclusion, the ABT-YOLOv7 model shows promise in scenarios involving the search and rescue of individuals in maritime distress. Nonetheless, there remains significant room for improvement in the speed of our detector. Additionally, given that maritime accidents commonly occur in complex and adverse environments, comprehensive data collected under such challenging conditions are essential. Future research will focus on collecting more comprehensive data encompassing complex and adverse sea conditions, enhancing the detector's speed, refining its architecture, and expanding its capabilities to establish a more robust solution.

Figure 1. General framework of YOLOv7. The backbone of the YOLOv7 model includes CBS, MP, and ELAN modules. The CBS module comprises convolutional layers and batch normalization layers and utilizes the sigmoid linear unit (SiLU) activation function for nonlinearity. The ELAN module contains two branches, one for channel transformation and the other for feature extraction. The ELAN module enhances the generalization capability of the model by regulating the shortest and longest gradient paths. The MP module conducts down-sampling operations using convolution and max-pooling techniques. The neck module, which includes modules such as SPPCSPC and ELAN-Y, performs feature fusion via upsampling, enabling top-down information propagation and leveraging both high- and low-level feature information. The SPPCSPC module enhances the computation speed and accuracy via max-pooling operations, improving the resilience of the model to images of varying resolutions. The head module consists of detection heads of different scales, including large, medium, and small. The head module handles the classification and regression tasks as the classifier and regressor of the network, respectively, to achieve object classification and localization for object detection. The RepConv module, comprising three branches with convolutional and batch normalization layers, undergoes reparameterization during computation.

Figure 3. The structure and details of BiFormer.

Figure 4. The details of the classification branch.

Figure 5. The details of the localization branch.

Figure 7. The average precision across various experiments.

Figure 8. Detection results of YOLOv7 and ABT-YOLOv7 in various scenarios.

Table 2. Results of fusion attention mechanism comparison test.

Table 3. Results of decoupled detection head comparison test.

Table 4. Results of ablation experiments with different methods.

Table 5. Comparison of detection performance for different methods.