Article

Hierarchical Dual-Model Detection Framework for Spotted Seals Using Deep Learning on UAVs

1 College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266590, China
2 Shandong Engineering Research Center for Beidou Navigation and Intelligent Spatial Information Technology Application, Qingdao 266590, China
3 Qingdao Key Laboratory of Beidou Navigation and Intelligent Spatial Information Technology Application, Qingdao 266590, China
4 North China Sea Ecological Center of the Ministry of Natural Resources, Qingdao 266033, China
* Author to whom correspondence should be addressed.
Animals 2025, 15(21), 3100; https://doi.org/10.3390/ani15213100
Submission received: 15 September 2025 / Revised: 16 October 2025 / Accepted: 22 October 2025 / Published: 25 October 2025
(This article belongs to the Section Aquatic Animals)

Simple Summary

This study addresses the challenges encountered by Unmanned Aerial Vehicles in monitoring the spotted seal population in the Liaohe River estuary, including the weak visibility of small targets, significant background interference and limited edge computing resources. To tackle these issues, a dual-model layered detection framework is proposed. The framework involves deploying a lightweight object detection model on an Unmanned Aerial Vehicle for the initial screening of spotted seals, followed by precise detection of the images transmitted back from the Unmanned Aerial Vehicle using a more accurate detection model on a ground-based workstation. Experimental results demonstrate that this approach not only enhances image processing speed but also significantly improves detection accuracy, reducing both false positives and false negatives. This study contributes to more accurate population size estimations and dynamic distribution assessments, offering an efficient and reliable method for monitoring endangered species.

Abstract

This study introduces a hierarchical dual-model detection framework for accurately monitoring spotted seals (Phoca largha) in the Liaohe River estuary by using deep learning on Unmanned Aerial Vehicles (UAVs). To address challenges such as weak target features, background interference and limited edge computing capacity, this study deploys an optimized FF-YOLOv10 lightweight model on UAVs for rapid target localization, followed by an enhanced PP-YOLOv7 model on ground stations for precise detection. The FF-YOLOv10 model reduces computational complexity by 24.2% and increases inference speed by 33.3%, while the PP-YOLOv7 model achieves 94.2% precision with a 1.9% increase in recall rate. This framework provides an efficient and precise technical solution for the long-term ecological monitoring of marine endangered species, supporting habitat conservation policy formulation and ecosystem health assessments.

1. Introduction

Population size is a critical determinant of long-term survival of a species in natural ecosystems and serves as a key indicator of regional biodiversity [1]. The spotted seal (Phoca largha), recognized as a keystone indicator of ecosystem health, has gained significant scientific attention due to extensive spatial overlap with human activities across aquatic and terrestrial domains [2,3]. As apex predators, seal populations exert cascading effects on ecosystem dynamics [4]. For instance, in the North Atlantic, robust seal populations may mitigate interspecific competition with commercially vital fish species like flounders, thereby influencing the equilibrium of ecologically and economically significant fish stocks [5]. The Liaohe River estuary, a critical habitat for seals in China, has recently faced growing threats from environmental changes and human activities [6]. Accurately estimating the population size of spotted seals is essential to formulating effective conservation and management strategies to safeguard their habitat and maintain ecosystem stability [7]. In animal population assessments, weak target features, background interference and the constraints of edge computing capabilities significantly influence target recognition accuracy. Studies have indicated that weak target features lead to decreased detection accuracy and increased missed detection rates; background complexity makes target recognition more difficult, and limited edge computing capabilities may lead to real-time response delays, thereby affecting the accuracy and timeliness of monitoring tasks [8,9]. Consequently, the development of efficient and sustainable intelligent monitoring techniques has become a central focus in research on spotted seal monitoring.
Traditional wildlife monitoring methodologies have predominantly relied on invasive techniques, primarily manual field surveys and GPS collar technology [10]; these approaches depend heavily on domain knowledge and experience and are poorly suited to processing complex data patterns and high-dimensional data [11]. While these approaches played a pivotal role in early ecological studies, they exhibit inherent limitations, such as high labor intensity, restricted spatiotemporal coverage and substantial invasiveness [12], thereby failing to meet the requirements for large-scale, high-frequency monitoring operations. Unmanned Aerial Vehicle (UAV) technology, characterized by its flexible deployment, high-resolution imaging and multimodal data collection capabilities [13], offers innovative tools for wildlife monitoring [14,15]. Hodgson et al. [16] deployed UAVs equipped with DSLR cameras across varying altitudes, acquiring 6243 high-resolution images from which dugongs were manually identified with a 95% sighting rate, alongside cetaceans and marine turtles. Kiszka et al. [17] demonstrated UAVs’ utility in shallow coral reef ecosystems by estimating densities of reef-associated elasmobranchs, thereby validating their capacity to deliver critical fishery-independent data in photic zone habitats. In avian ecology, Hodgson et al. [18] also applied drones to estimate the number of bird nests, improving data quality and counting accuracy while showcasing their utility in surveying populations and locations that are otherwise difficult to access. Beaver et al. [19] further advanced this paradigm by integrating thermal infrared sensors on UAVs to survey white-tailed deer populations, where independent observers achieved significantly higher detection probabilities compared with manned aerial surveys. Notwithstanding the demonstrated reliability of manual animal identification in UAV imagery, the exponentially growing volume of image datasets necessitates labor-intensive interpretation, resulting in suboptimal efficiency and operator-dependent biases [20]. Consequently, the development of intelligent wildlife auto-recognition algorithms has become essential to overcoming the technical challenges in monitoring [21,22].
Deep learning is a machine learning technique based on artificial neural networks. Its core principle lies in progressive feature extraction and nonlinear transformation, which enables multi-level feature learning and automatic extraction of high-level semantic features from data [23]. In recent years, with the advancement of artificial intelligence technologies, deep learning-based object detection models have provided valuable technical support for wildlife monitoring. Peng et al. [24] improved the Faster R-CNN model by optimizing anchor scales and addressing difficult negative samples, resulting in an increase in the F1 score for detecting Tibetan wild donkeys from 44% to 86%, significantly reducing the need for manual labor. Gray et al. [25] applied a convolutional neural network (CNN) to detect sea turtles in drone images, finding that the model detected 8% more turtles compared with manual counting, greatly enhancing detection accuracy. Tripathi et al. [26] employed single-stage detection models, including YOLO and DETR networks, for non-invasive, real-time detection of swamp deer to estimate their population size. Jiang et al. [27] proposed an enhanced wilDT-YOLOv8n, which integrates deformable convolution and multimodal attention mechanisms, boosting the mean average precision for wildlife detection to 88.54% and achieving a tracking accuracy of 40.35%, thus reducing target loss caused by obstacles. Wu et al. [28] utilized the improved InceptionResNetV2 model, incorporating dual attention mechanisms and Dropout optimization, to achieve 99.37% identification accuracy for individual Amur tiger stripes, providing algorithmic support for the precise conservation of endangered species.
Although deep learning and drone technology hold significant promise for the future of wildlife monitoring—with UAV-based counts of colony-nesting birds showing greater precision than traditional ground counts [18]—they still face ecological and technical limitations. Lightweight models are essential to meeting the endurance and computational constraints of UAVs; for example, the YOLOv8-E model incorporates edge-sensitive Sobel-based modules to significantly reduce computational cost while maintaining detection accuracy [29]. Similarly, LPS-YOLO has demonstrated improved detection accuracy for small targets in UAV imagery despite parameter reduction [30]. However, animal detection from drone images remains challenging due to small target size, complex backgrounds and the limited distinguishable features inherent in such data [31]. Moreover, balancing wide-area coverage with precise species identification remains unresolved. The MDTS framework addresses this by combining thermal detection for broad coverage with high-resolution RGB zoom for accurate identification, significantly reducing data volume while enhancing ecological survey efficiency [32]. On the other hand, systems like WildLive achieve near real-time detection and tracking onboard UAVs—processing HD and 4K video at frame rates of 17 FPS and 7 FPS or greater, respectively, demonstrating feasibility but also highlighting the resource limits even for advanced onboard hardware [33]. To address these challenges, we propose a dual-model hierarchical framework that integrates lightweight UAV-based detection with high-precision ground-station verification. Unlike existing approaches that emphasize either lightweight efficiency or high-accuracy detection alone, our method synergistically bridges both, achieving real-time aerial monitoring with precise post-verification. This framework offers a practical, energy-efficient solution for small marine mammal ecological monitoring, enhancing drone operational endurance and complementing the current literature on UAV-aided conservation.
In summary, UAV and deep learning technologies provide promising tools for wildlife monitoring but face persistent challenges, such as weak target features, background interference and limited onboard computing resources. To address these issues, this study proposes a dual-model hierarchical detection framework that combines UAV-based lightweight detection with high-precision ground-station verification, thereby balancing efficiency and accuracy in spotted seal monitoring. The main contributions of this study are as follows:
(1)
A dual-model hierarchical detection framework is developed, integrating UAV-based lightweight detection with high-precision ground-station verification to achieve the real-time monitoring and accurate population estimation of spotted seals.
(2)
A lightweight YOLOv10 [34] variant optimized for edge deployment is constructed, incorporating focal modulation networks (FocalNets) to enhance the detection of hard-to-recognize targets under limited onboard resources.
(3)
The ground-based YOLOv7 [35] model is enhanced with multi-scale feature pyramids and partial convolution, strengthening small-target representation and suppressing background interference, thereby achieving a practical balance between accuracy and efficiency.

2. Materials and Methods

2.1. Study Area

Liaohekou, as shown in Figure 1, is the southernmost breeding area for the spotted seal in the western Pacific and is home to a coastal wetland ecosystem of considerable ecological importance [36]. This unique ecotone is characterized by distinctive habitat conditions, including brackish salt marshes and mudflats formed where seawater and freshwater converge, along with seasonal sea ice coverage during the winter months [6]. These environmental features support a food chain for the spotted seal consisting primarily of benthic organisms and fish. The region is situated at the junction of Bohai Bay and the Liaohe Alluvial Plain, functioning as an essential stopover on the East Asian–Australasian Flyway for migratory birds. It is also the sole breeding habitat of the spotted seal in Chinese waters, offering significant ecological connectivity and conservation value.

2.2. Data Acquisition

To obtain images of seals in their natural habitat, photographs were sourced from various natural environments, and the DJI Mavic 3E drone (Shenzhen, China) was utilized to capture images at the Shuangtaizi estuary, with flight altitudes set at 15 m and 20 m. Given the challenges in acquiring spotted seal image data and the relatively limited number of individuals inhabiting coastal areas, this study employed data augmentation techniques—including spatial geometric transformations, edge padding and complex background scaling—to expand the collected spotted seal imagery. These methods were applied to increase both the quantity and diversity of training samples, thereby improving model performance during training and enhancing the generalization capability of the computational framework. Typical collected images are shown in Figure 2, which illustrate the raw dataset diversity and field conditions prior to annotation. These examples provide an overview of the visual complexity encountered in natural habitats and highlight the challenges of data acquisition.
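For illustration, the following minimal sketch shows how an image-level augmentation pipeline of this kind (geometric transformations, edge padding and rescaling) can be assembled with torchvision; the specific operations, parameter values and the file name seal_sample.jpg are assumptions for illustration only and do not reproduce the exact pipeline used in this study. Because annotation was performed after augmentation, only the images themselves need to be transformed at this stage.

```python
# Illustrative augmentation sketch: geometric transformations, edge padding and
# scale jitter applied to a single image. Parameters are assumptions, not the
# settings used in this study.
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                      # spatial geometric transformation
    transforms.RandomRotation(degrees=15),                       # small random rotation
    transforms.Pad(padding=32, padding_mode="edge"),             # edge padding
    transforms.RandomResizedCrop(size=480, scale=(0.6, 1.0)),    # rescaling / background variation
])

img = Image.open("seal_sample.jpg")                  # hypothetical file name
augmented = [augment(img) for _ in range(4)]         # several augmented variants per image
```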
The augmented images were annotated using the LabelImg tool, with minimum enclosing rectangles drawn around the seals to ensure minimal background inclusion within each rectangle. The class attribute of each rectangle was designated as “Phoca largha”. After annotation, an XML-format label file was generated for each image, containing the height, width and class information of every rectangle. The annotated images are shown in Figure 3, serving as examples of the labeling strategy applied in this study. These figures demonstrate how bounding boxes were drawn to capture the seals while minimizing background noise, thereby establishing the ground-truth labels used for subsequent model training and evaluation.
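LabelImg writes its annotations in the PASCAL VOC XML layout (image size plus one object element per bounding box). The short sketch below, which assumes that layout and a hypothetical file name, shows how such a label file can be read back for training or evaluation.

```python
# Minimal sketch for reading a LabelImg (PASCAL VOC-style) XML annotation file and
# extracting the image size, class name and bounding boxes.
import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    root = ET.parse(xml_path).getroot()
    size = root.find("size")
    width, height = int(size.find("width").text), int(size.find("height").text)
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text                 # e.g. "Phoca largha"
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text)))
    return width, height, boxes

# Example usage with a hypothetical file:
# w, h, boxes = read_annotation("seal_0001.xml")
```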

2.3. Real-Time Spotted Seal Detection Model Based on Enhanced YOLOv10

YOLOv10 is a real-time, end-to-end object detection model based on the YOLOv8 architecture [37]. It employs a consistent dual-assignment strategy, incorporating dual-label assignment and consistency matching metrics, eliminating the need for non-maximum suppression (NMS) and effectively addressing redundant predictions during post-processing. Based on this model, this paper proposes the FF-YOLOv10 model for real-time detection of spotted seals by drones. The structure of this model is shown in Figure 4. The FF-YOLOv10 model incorporates FasterNet [38] to streamline the C2f module, optimizing the network structure and reducing the number of parameters, thereby meeting the requirements for efficient deployment on mobile devices. Given the complex habitat of seals, some targets are challenging to detect due to issues such as positional overlap and image blurriness. To address these challenges, the SPPF module is replaced by FocalNets, which improves the model’s ability to focus on hard-to-detect targets by adjusting feature map responses, thus enhancing overall detection performance.

2.3.1. Lightweight C2f Module

The FasterNet module, known for its efficiency, is widely adopted in object detection algorithms due to its exceptional speed and effectiveness across various visual tasks. This architecture improves feature representation and expands the receptive field while maintaining a lightweight design and high processing speed. Additionally, it emphasizes the importance of simplifying the computation process by minimizing redundant elements. This approach not only optimizes computational resource utilization but also significantly enhances the network’s processing efficiency.
The FasterNet architecture, as illustrated in Figure 5, adopts a modular design to ensure both flexibility and scalability. It begins with PConv, which selectively applies spatial convolutions to specific input channels, reducing redundant computation and floating-point operations while preserving key spatial features. This mitigates the accuracy loss often associated with depthwise convolutions. Subsequently, PWConv enables channel expansion and local cross-channel interaction, forming a T-shaped structure whose weight distribution aligns with pre-trained network statistics, enhancing central receptive field focus. FasterNet adopts a four-stage hierarchical structure in which the shallow stage utilizes high-resolution features and dense PConv modules to extract fine-grained details, while the deeper stages progressively increase channel dimensions to enhance semantic abstraction. Spatial downsampling and channel expansion are simultaneously optimized through embedding or merging layers. Normalization and activation are introduced only after the intermediate PWConv, and residual connections are employed to maintain feature diversity and gradient stability, thereby avoiding information loss caused by excessive normalization. For efficient hardware deployment, the architecture minimizes memory access conflicts by simplifying branches and reduces data transfer via cross-channel feature reuse. A series of scaled model variants offers a tunable balance between accuracy and speed. The overall design ensures low latency, high accuracy, and cross-platform adaptability, making it effective for visual tasks in resource-constrained environments.
To address the challenge of large model sizes impacting detection speed and hindering drone deployment in seal detection tasks, this paper incorporates the fast convolutional structure of FasterNet into the C2f module, as illustrated in Figure 4. Upon inputting a feature map with dimensions h (height), w (width) and c (channel) into the C2f-Faster module, the input feature map is first processed by convolution and then split into two parts: one is passed through directly, while the other undergoes additional processing through multiple FasterNet modules. The concatenated feature map is subsequently processed by another convolutional layer to generate the final output feature map [39]. In the FasterNet module, the feature map is partitioned into several subregions, with some undergoing convolution operations, which are then combined with the unprocessed subregions. This design enables the seal detection model to operate efficiently on drones with limited memory, facilitating rapid detection while reducing model complexity and significantly improving the speed of object detection tasks.
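To make the structure described above concrete, the following PyTorch sketch implements a partial convolution, a FasterNet block (PConv followed by a two-layer point-wise projection with a residual connection) and a simplified C2f-Faster module. Channel counts, the split ratio and the number of blocks are illustrative assumptions rather than the exact configuration of FF-YOLOv10.

```python
# Minimal PyTorch sketch of PConv, a FasterNet block and a simplified C2f-Faster module.
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a 3x3 conv over the first 1/4 of the channels, identity elsewhere."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.cp = channels // ratio
        self.conv = nn.Conv2d(self.cp, self.cp, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.cp], x[:, self.cp:]
        return torch.cat((self.conv(x1), x2), dim=1)

class FasterNetBlock(nn.Module):
    """PConv followed by a two-layer point-wise projection (T-shaped), with a residual path."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        return x + self.mlp(self.pconv(x))

class C2fFaster(nn.Module):
    """Simplified C2f-Faster: 1x1 conv, split into two halves, pass one half through n
    FasterNet blocks, concatenate all intermediate maps and fuse with a final 1x1 conv."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1, bias=False)
        self.blocks = nn.ModuleList(FasterNetBlock(self.c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1, bias=False)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.blocks:
            y.append(block(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

# Shape check with a dummy feature map
print(C2fFaster(64, 128)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 80, 80])
```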

2.3.2. Focal Modulation Networks

To effectively integrate contextual information at different granularities for visual representation learning, this study adopted the Focal Modulation Networks architecture, as illustrated in Figure 6. Moreover, by selectively emphasizing relevant features, FocalNets effectively mitigates background noise, ensuring precise recognition and localization of spotted seals. The implementation of the algorithm is detailed below.
(1)
Focal Modulation
Given an input visual feature map $X \in \mathbb{R}^{C \times H \times W}$, a linear projection is first applied to obtain a set of query features $q$. Meanwhile, contextual information is aggregated using a contextual encoder $\mathcal{M}_2$, and a lightweight interaction function $\mathcal{T}_2$ is applied to modulate the query with the contextual signal. The output representation at each spatial location $i$ is computed as
$y_i = \mathcal{T}_2\left(\mathcal{M}_2(i, X), x_i\right)$
Here, $\mathcal{M}_2$ captures multi-scale contextual cues from the surrounding neighborhood, and $\mathcal{T}_2$ denotes an element-wise interaction operator, such as addition or modulation, that fuses the context with the query in a content-adaptive manner.
(2)
Hierarchical Contextualization
To capture semantic context at multiple scales, we adopt a hierarchical representation strategy. First, the input $X$ is projected onto a new feature space via a linear transformation:
$Z^{(0)} = f_z(X) \in \mathbb{R}^{H \times W \times C}$
Subsequently, we apply $L$ layers of depth-wise convolution (DWConv) with GeLU activations to progressively expand the receptive field and encode increasingly global context:
$Z^{(\ell)} = \mathrm{GeLU}\left(\mathrm{DWConv}\left(Z^{(\ell-1)}\right)\right), \quad \ell = 1, \ldots, L$
Each layer $Z^{(\ell)}$ captures context at a specific spatial scale. To include global semantics, we apply average pooling to the final output:
$Z^{(L+1)} = \mathrm{AvgPool}\left(Z^{(L)}\right)$
This results in $L+1$ contextual feature maps, which together span a continuum from fine-grained local to coarse global representations.
(3)
Gated Aggregation
A gated aggregation mechanism is introduced to adaptively fuse multi-scale contextual features. Specifically, a linear layer is employed to obtain spatially and hierarchically aware gating weights $G = f_g(X) \in \mathbb{R}^{H \times W \times (L+1)}$. A weighted sum is then computed via element-wise multiplication to produce a single feature map $Z^{\mathrm{out}}$, maintaining the same spatial dimensions as the input $X$:
$Z^{\mathrm{out}} = \sum_{\ell=1}^{L+1} G^{\ell} \odot Z^{(\ell)} \in \mathbb{R}^{H \times W \times C}$
where $G^{\ell}$ denotes the slice of $G$ corresponding to the $\ell$-th focal level. This design allows FocalNets to adaptively learn and integrate information from different focal depths. To facilitate information propagation across different channels, another linear layer is applied to generate the modulator $M = h\left(Z^{\mathrm{out}}\right) \in \mathbb{R}^{H \times W \times C}$. The final focal modulation operation is formulated as
$y_i = q\left(x_i\right) \odot h\left(\sum_{\ell=1}^{L+1} g_i^{\ell} \cdot z_i^{\ell}\right)$
where $g_i^{\ell}$ and $z_i^{\ell}$ denote the gating value and the visual feature at the $i$-th spatial location of $G^{\ell}$ and $Z^{(\ell)}$, respectively.
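A compact PyTorch sketch of the focal modulation operation defined by the equations above is given below; the number of focal levels, the kernel sizes and the use of a single projection for $f_q$, $f_z$ and $f_g$ are illustrative assumptions and do not reproduce the exact FocalNets configuration used in this study.

```python
# Sketch of focal modulation: hierarchical contextualization (stacked depth-wise
# convolutions with GeLU) followed by gated aggregation and query modulation.
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    def __init__(self, dim, levels=3, kernel=3):
        super().__init__()
        self.levels = levels
        # f_q, f_z and f_g realized jointly by one 1x1 projection
        self.proj_in = nn.Conv2d(dim, 2 * dim + (levels + 1), 1)
        self.dwconvs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel + 2 * l, padding=(kernel + 2 * l) // 2,
                          groups=dim, bias=False),   # depth-wise conv, growing receptive field
                nn.GELU(),
            ) for l in range(levels)
        )
        self.h = nn.Conv2d(dim, dim, 1)               # modulator projection h(.)
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):                              # x: (B, C, H, W)
        q, z, gates = torch.split(
            self.proj_in(x), (x.shape[1], x.shape[1], self.levels + 1), dim=1)
        ctx = 0
        for l, dwconv in enumerate(self.dwconvs):      # hierarchical contextualization
            z = dwconv(z)
            ctx = ctx + z * gates[:, l:l + 1]          # gated aggregation at level l
        ctx = ctx + z.mean(dim=(2, 3), keepdim=True) * gates[:, self.levels:]  # global level
        return self.proj_out(q * self.h(ctx))          # y = q(x) (*) h(sum_l g^l * z^l)

# Shape check
print(FocalModulation(dim=64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```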

2.4. Improved YOLOv7-Based Precision Detection Model for Spotted Seals

The YOLOv7 model is primarily composed of the backbone network, the neck network and the detection layer. The backbone network utilizes the ELAN structure for feature extraction, enhancing feature representation capabilities while maintaining consistent input and output feature sizes. Additionally, max pooling (MP) is applied for downsampling, using both max-pooling and convolution operations while adjusting the number of channels. The neck network integrates the CBS, SPPCSPC, MP and ELAN modules, adhering to the traditional PAFPN architecture. This network extracts multi-scale features from the backbone for comprehensive feature fusion. Finally, the REPConv architecture is employed to design reparameterized convolutions, balancing network complexity during training while reducing parameters and computational cost during inference, without compromising accuracy. To mitigate false positives and missed detections of small targets in UAV images caused by low resolution and sparse feature information, an improved YOLOv7 model is deployed at the ground workstation. As illustrated in Figure 7, the PP-YOLOv7 architecture incorporates a small-object detection module, which enhances the representation of small targets by capturing spatial details and contextual information more effectively. In addition, the use of partial convolution focuses computation on the regions of interest within the input image, thereby suppressing interference caused by background regions with similar colors. This design significantly improves the model’s generalization capability and robustness.

2.4.1. Small-Target Detection Layer

With input images of 480 × 480 pixel resolution, the original YOLOv7 network employs three distinct detection scales (20 × 20, 40 × 40 and 80 × 80) for feature map analysis to accommodate multi-scale targets. However, in UAV-based wildlife monitoring scenarios, particularly for protected species like spotted seals, image acquisition requires maintaining sufficient shooting distances to avoid disturbing natural behaviors, inevitably introducing complex backgrounds and multi-scale targets within captured imagery. Small targets, characterized by low pixel occupancy and limited visual features, often suffer from ineffective recognition in such contexts. The original model demonstrates suboptimal performance in small-target detection because of the restricted receptive field of its 80 × 80 scale feature maps, leading to compromised precision through either false positives or missed detections.
To address this limitation, we introduce a dedicated small-target detection layer that preserves and constructs higher-resolution shallow feature maps for direct participation in detection predictions. These high-spatial-resolution features enhance the model’s representational capacity by capturing subtle textures and contour features of small targets, thereby significantly reducing both missed detection rates and false detection rates. Specifically, our implementation first upsamples the 80 × 80 scale feature maps generated by the FPN module to obtain 160 × 160 resolution representations. Subsequently, these upsampled features are fused with shallow-layer features extracted from the backbone network, forming enhanced 160 × 160 feature maps that are directly fed into the prediction module. This integration strengthens the model’s capability to detect small targets through multi-scale feature preservation. Representative examples of output feature maps across four detection scales are illustrated in Figure 8.
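The sketch below illustrates, under assumed channel numbers, how the extra 160 × 160 branch can be formed: the 80 × 80 FPN map is upsampled by a factor of two, concatenated with a shallow backbone feature of matching resolution and passed to an additional prediction head. The output channel count assumes a YOLOv7-style head with three anchors and one class; it is an illustration, not the exact implementation used here.

```python
# Schematic small-target detection branch: upsample the 80x80 FPN map to 160x160,
# fuse it with a shallow backbone feature and predict from the fused map.
import torch
import torch.nn as nn

class SmallTargetBranch(nn.Module):
    def __init__(self, fpn_ch=128, shallow_ch=64, out_ch=128, num_outputs=18):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")       # 80x80 -> 160x160
        self.fuse = nn.Sequential(
            nn.Conv2d(fpn_ch + shallow_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(inplace=True),
        )
        self.head = nn.Conv2d(out_ch, num_outputs, 1)   # extra 160x160 prediction head

    def forward(self, p3, shallow):
        """p3: FPN map (B, fpn_ch, 80, 80); shallow: backbone map (B, shallow_ch, 160, 160)."""
        fused = self.fuse(torch.cat((self.upsample(p3), shallow), dim=1))
        return self.head(fused)

pred = SmallTargetBranch()(torch.randn(1, 128, 80, 80), torch.randn(1, 64, 160, 160))
print(pred.shape)  # torch.Size([1, 18, 160, 160])
```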

2.4.2. Partial Convolution

In contrast to conventional convolution (Conv) modules, PConv operates exclusively on a specific fraction of input feature map channels rather than simultaneously processing all channels. As shown in Figure 9, this selective processing mechanism applies convolution operations to strategically chosen channel subsets while preserving the integrity of unmodified channels. The processed outputs are subsequently concatenated with bypassed channels through residual connections, maintaining comprehensive feature representation in final outputs. By optimizing both computational channels and memory access operations, this architecture achieves significant reductions in FLOPs and memory usage compared with standard Conv modules while maintaining equivalent model performance metrics.
The high-resolution imagery acquired by UAVs for monitoring Phoca largha generates substantially increased pixel processing demands during analytical workflows. Traditional convolutional neural networks, typically constructed as multi-layered hierarchical architectures, involve successive convolution operations and feature abstraction through depth-wise progression. While this layered configuration effectively captures high-level semantic features, it imposes significant computational burden and memory footprint—particularly evident in the memory access complexity defined by the following operational formula:
$h \times w \times 2c + k^2 \times c^2 \approx h \times w \times 2c$
PConv minimizes computational redundancy through optimized convolutional operator design, achieving a dual reduction in arithmetic complexity and memory consumption. When employing a standard channel selection ratio of $r = c_p / c = 1/4$, the memory access of PConv can be expressed as
$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$
This amounts to only about one-quarter of the memory access of a standard convolution.
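For concreteness, substituting $r = c_p / c = 1/4$ into the expressions above yields the memory access ratio of PConv to a standard convolution; the accompanying FLOPs ratio additionally assumes the standard convolution cost $h \times w \times k^2 \times c^2$, which is not written out in the text:
$\frac{h \times w \times 2c_p}{h \times w \times 2c} = \frac{c_p}{c} = \frac{1}{4}, \qquad \frac{h \times w \times k^2 \times c_p^2}{h \times w \times k^2 \times c^2} = \left(\frac{c_p}{c}\right)^2 = \frac{1}{16}$
That is, PConv requires roughly one-quarter of the memory access of a full convolution, while the convolution applied to the selected channels costs roughly one-sixteenth of the FLOPs.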
In seal detection tasks, the high similarity between the gray-black speckled patterns on the seals’ dorsal regions and the muddy tidal flat backgrounds often introduces background interference, leading to missed detections or false positives; therefore, it is crucial to effectively extract local features within the region of interest and integrate them into global features [40]. To address this, PConv employs masks to focus computational resources on regions of interest within the input image, thereby enabling more precise extraction of seal-related features. This process effectively suppresses background noise, emphasizes the morphological and textural characteristics of spotted seals, and significantly enhances detection accuracy and robustness. Furthermore, spotted seal detection faces challenges arising from variations in individual appearance, posture and behavioral patterns. By leveraging PConv to improve the model’s generalization capability, the network adaptively captures discriminative features across diverse scenarios, ensuring robust performance under variations in seal appearance and behavior. Consequently, replacing the CBS convolutional layer in the second branch of the ELAN structure with the improved PConv architecture yields superior performance. Compared with traditional convolution, PConv demonstrates enhanced proficiency in processing image edges and fine-grained details, which facilitates more accurate boundary localization in subsequent detection tasks.

3. Results

3.1. Environmental Configuration and Evaluation Metrics

To ensure a fair comparison, all models in this paper were trained with identical settings: 300 epochs, a batch size of 32 and an input image size of 480 × 480 pixels, with all other parameters left at their default values. Both the final iteration’s model weights and the optimal weights were saved for subsequent analysis. The augmented dataset comprises 3036 images, which were split into training and validation sets at a 9:1 ratio with no overlap between the two sets, thereby mitigating the risk of model overfitting. The experimental hardware and software environment is detailed in Table 1.
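As a minimal illustration of the 9:1 split described above, the following sketch shuffles the image list with a fixed seed and partitions it into disjoint training and validation subsets; the directory layout and seed are hypothetical.

```python
# Sketch of a reproducible 9:1 train/validation split with no overlap between the sets.
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("augmented_dataset/images").glob("*.jpg"))  # hypothetical path; 3036 images expected
random.shuffle(images)

split = int(0.9 * len(images))
train_set, val_set = images[:split], images[split:]
print(len(train_set), len(val_set))   # 2732 and 304 for 3036 images
```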
To systematically evaluate the performance improvements of the proposed model, this study adopts the following metrics: precision (P), recall (R), mean average precision (mAP), frames per second (FPS) and model weight file size (MB). The model weight file size (MB) serves as an indicator of model complexity, where a smaller file size corresponds to reduced computational demands. FPS quantifies the real-time capability and efficiency of the object detection algorithm by measuring the number of image frames processed per second. The mathematical formulations for precision, recall and mAP are provided below:
$\mathrm{Precision} = \frac{TP}{TP + FP}$
$\mathrm{Recall} = \frac{TP}{TP + FN}$
$\mathrm{mAP} = \frac{1}{N} \sum_{n=1}^{N} AP(n)$
In the equations, $TP$ denotes the number of correctly identified positive samples; $FP$ represents the number of negative samples erroneously classified as positive instances, i.e., erroneous detections; $FN$ corresponds to the number of positive samples incorrectly predicted as negative instances, reflecting missed detections; $AP(n)$ is defined as the area under the PR curve for the $n$-th category of detection targets; and $N$ indicates the total number of categories in the evaluation.
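A small numerical illustration of these definitions is given below; the TP/FP/FN counts and the PR-curve points are made-up values used only to demonstrate the formulas (practical mAP implementations additionally use interpolated precision, which is omitted here for brevity).

```python
# Toy illustration of precision, recall and AP (area under a PR curve, trapezoidal rule).
def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)

def average_precision(recalls, precisions):
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return area

print(precision(tp=87, fp=6))    # ~0.935
print(recall(tp=87, fn=13))      # 0.87
print(average_precision([0.0, 0.5, 0.9, 1.0], [1.0, 0.95, 0.90, 0.60]))  # ~0.93
# With a single class (spotted seal), mAP equals this single-class AP.
```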

3.2. Selection of Baseline Models for UAVs and Ground Stations

Given the varying recognition efficacy of different algorithms on the spotted seal dataset, this study conducted systematic comparisons between the proposed algorithm and mainstream object detection algorithms to justify the selection rationale of the baseline algorithm and demonstrate the superiority of the improved method. As evidenced by the experimental results in Table 2, YOLOv11 [41] achieves higher FPS than YOLOv10, albeit with marginally reduced precision and recall rates. Compared with the YOLOv8 model, YOLOv10 delivers superior precision and recall with lower computational load, while its reduced memory footprint and high frame rate prove particularly advantageous for video surveillance and mobile device applications. The rapid processing capability enables timely detection and response to dynamic targets in video streams. Among all models, YOLOv7 demonstrates the highest precision and recall rates, making it suitable for scenarios requiring extreme accuracy. However, despite its detection accuracy advantages, YOLOv7 exhibits inferior detection speed compared with YOLOv10. Under resource-constrained conditions, YOLOv10 maintains real-time detection performance with lower computational demands, enabling effective deployment on UAV mobile platforms while preserving practical detection accuracy. This comparison aligns with recent findings, which indicate that YOLOv10 outperforms YOLOv7 in terms of speed, size, and latency, while maintaining comparable accuracy. This makes YOLOv10 particularly advantageous for UAV inference tasks in resource-constrained environments [42]. In contrast, YOLOv7 continues to demonstrate strong baseline precision, especially in complex scenarios such as UAV imaging over water surfaces, which makes it highly suitable for secondary verification tasks in ground-station applications [43,44]. Consequently, for the spotted seal dataset, YOLOv7 was selected as the baseline model for subsequent experiments to meet the ground workstation’s stringent accuracy requirements, whereas YOLOv10 serves as the foundational model for real-time UAV deployment applications.

3.3. Comparison and Ablation Experiments of UAV End Models

3.3.1. Comparison Experiment of Backbone Network

To validate the performance of different lightweight models in YOLOv10, this study selected widely adopted lightweight architectures, including ShuffleNetV2 [45], the MobileNet series [46,47] and FasterNet, for a series of comparative experiments. Figure 10 presents evaluation metrics across different networks, comparing the performance impact of the improved C2f-Faster module and existing lightweight networks on the spotted seal dataset. As demonstrated in Figure 10, the proposed method achieves higher inference speed and accuracy compared with networks with fewer parameters. Furthermore, when benchmarked against networks maintaining acceptable accuracy ranges, C2f-Faster effectively reduces parameter counts and computational complexity while sustaining high inference speeds to ensure processing timeliness.

3.3.2. Comparison with Other UAV End Models

To validate the superiority of the proposed FF-YOLOv10 network over conventional object detection algorithms, we trained multiple state-of-the-art models on a custom dataset. For experimental reliability, Spike-YOLO [48], YOLOv7-tiny, RT-DETR [49] and SSD [50] were trained using identical hyperparameters to their original unmodified versions, as shown in Table 3. Unlike traditional accuracy-focused evaluation, our emphasis here is on FPS, MB and parameter count, which are critical to real-time onboard deployment on UAV platforms. Comparative results with other YOLO algorithms from Table 2 demonstrate that the FF-YOLOv10 model outperforms nine competing models in lightweight metrics for spotted seal detection tasks, specifically regarding parameter count and weight file size, although its weight file size remains larger than that of YOLOv10. Additionally, with minimal loss in AP and recall, the proposed model achieves significant detection speed improvements through its optimized network architecture, effectively meeting real-time detection requirements. Alternative methods designed for rapid UAV-based spotted seal detection and data acquisition exhibit limited generalizability, struggling with insufficient detection accuracy and suboptimal inference speeds. Consequently, the selection of FF-YOLOv10 as the onboard detection algorithm for UAV systems demonstrates strong justification. This choice not only ensures high detection accuracy and rapid inference capabilities but also substantially enhances practical applicability and operational effectiveness in field deployments.

3.3.3. Ablation Experiment of UAV End Models

To evaluate the individual impact of each improved module on FF-YOLOv10 (a lightweight model designed for UAV deployment), we conducted a series of ablation experiments on the spotted seal dataset. The final ablation results are presented in Table 4. As shown in the table, after replacing the original C2f module, the modified model demonstrates a 28.6% reduction in parameters and an 11.1% decrease in weight file size compared with the baseline YOLOv10. Concurrently, it achieves a 14.3% improvement in inference speed and a 0.9% enhancement in mAP. For single-class detection on the spotted seal dataset, this mAP improvement directly corresponds to a 0.9% increase in seal detection accuracy. These results indicate that while the C2f-Faster module causes a 1.2% reduction in recall rate, it effectively reduces model size, accelerates detection speed and enhances feature extraction capabilities. The final FF-YOLOv10 model, incorporating FocalNets into F-YOLOv10, further improves recall rate and inference speed. Compared with YOLOv10, its inference speed has improved by 33.3%. Meanwhile, the parameter count and weight file size are reduced to 75.8% and 92.6% of the original values, respectively. This demonstrates that UAVs equipped with the optimized model can process imagery more rapidly in real-time operations. Through higher frame rates, such systems capture more critical frames while mitigating false negatives or positives caused by motion blur or fast-moving targets. To intuitively demonstrate the detection capability of the FF-YOLOv10 model onboard UAVs, Figure 11 presents visualization results of spotted seal detection at different flight altitudes. The images were captured at varying UAV heights to reflect typical aerial monitoring scenarios. As shown in the figure, FF-YOLOv10 effectively identifies seals across scales, maintaining reliable localization performance even when targets appear small due to high-altitude imaging. This highlights the model’s suitability for fast onboard inference during large-area search operations.

3.4. Comparison and Ablation Experiments of Ground-Station Models

3.4.1. Comparison with Other Ground-Station Models

This study conducts a comprehensive performance comparison between the proposed PP-YOLOv7 algorithm and state-of-the-art object detection methods for spotted seal detection tasks. The evaluation metrics primarily include mean average precision, precision and recall. To intuitively visualize the performance disparities across algorithms, a scatter plot is employed for visualization, as illustrated in Figure 12. The improved model achieves 94.2% precision and 86.6% recall, surpassing a series of benchmark algorithms. This demonstrates that PP-YOLOv7 achieves an optimal balance between detection accuracy and robustness. Compared with the original YOLOv7 model, the proposed method improves precision and recall by 1.2% and 1.9%, respectively, in spotted seal detection. For single-class spotted seal detection tasks, the model attains an mAP of 92.4%, highlighting its exceptional detection performance in significantly enhancing the reliability and efficiency of target recognition. These advancements are critical to the large-scale ecological monitoring of spotted seals. The high precision effectively reduces false positives in ground workstation-based monitoring, ensuring that only genuine targets are identified and recorded. This minimizes interference from invalid data and enhances the overall efficiency of the monitoring system. Furthermore, ground-station monitoring tasks typically require extensive spatial coverage, where any missed detection of individual seals could adversely impact population assessments, conservation policy formulation and environmental impact evaluations. The high recall of PP-YOLOv7 guarantees comprehensive detection coverage, thereby providing reliable data support for the long-term monitoring of spotted seal population dynamics.
The above analysis provides an objective evaluation of the model’s improvement effects. To more intuitively demonstrate the superior performance of the enhanced model, selected detection results are visualized in Figure 13. The RT-DETR model fails to extract sufficient semantic information to distinguish background elements, resulting in a detected spotted seal count that far exceeds the actual number and a significantly higher false positive rate compared with other algorithms. Due to variations in the proportion of spotted seals within the frame and resolution limitations, both YOLOv5 and YOLOv7 exhibit varying degrees of missed detections and false positives. In contrast, the proposed PP-YOLOv7 algorithm, equipped with a small-target detection layer, achieves precise identification of spotted seals of varying sizes in complex environments. Additionally, its prediction bounding boxes align more accurately with actual targets. As a ground workstation detection model, PP-YOLOv7 effectively reduces both missed detection rates and false positive rates, demonstrating marked improvements in scenarios involving overlapping seals and background interference.

3.4.2. Ablation Experiment of Ground-Station Models

To objectively evaluate the precision of the improved YOLOv7 model in detecting multi-scale spotted seal targets within complex UAV scenarios using real-time transmitted data, we systematically integrated enhancement modules into the baseline YOLOv7 and assessed their individual impacts through ablation studies, with results detailed in Table 5. The incorporation of a small-target detection layer into the baseline model increased accuracy, recall and mAP by 0.7%, 3.2% and 2.9%, respectively, while reducing model complexity and decreasing weight file size by 3.1%. These results demonstrate that the small-target detection layer significantly enhances holistic detection performance, particularly improving recognition capability for small objects in complex backgrounds. Building upon this, the integration of PConv further elevated detection accuracy and mAP to 0.942 and 0.924, respectively, without increasing parameter count or weight file size. Although the recall rate slightly decreased compared with the model with only the small-target detection layer, it remained 2.2% higher than the baseline. This indicates that the optimized model strengthens detection robustness across diverse environmental conditions. By enabling accurate quantification of seal populations and reliable assessment of their ecological status, the enhanced system ensures dependable monitoring performance.

3.5. Comparison of Detection Results Under Different Weather Conditions

To comprehensively evaluate the robustness and generalizability of the proposed dual-model hierarchical detection framework in real-world marine environments, we conducted tests under four representative weather conditions: sunny, reflective, foggy and overcast. Five mainstream detection models were compared, and performance was assessed in terms of detection completeness, false positives, and missed detections. From the evaluation results presented in Figure 14, it is evident that PP-YOLOv7 consistently demonstrates high detection performance across all weather conditions, particularly excelling in challenging environments. FF-YOLOv10, a lightweight model, delivers faster processing and lower power consumption, making it ideal for real-time detection applications. The visualized detection outcomes under each condition are shown in Figure 15. As shown, our framework demonstrates consistently high detection completeness across varying seal sizes, even with low-contrast or occluded backgrounds. Furthermore, it substantially reduces both false positives and missed detections compared with baseline methods. These results confirm the framework’s strong adaptability to challenging marine conditions, reinforcing its practical applicability for UAV-based ecological monitoring tasks.

4. Discussion

This study introduces a dual-model hierarchical detection framework that combines a lightweight model (FF-YOLOv10) with a high-precision model (PP-YOLOv7) to enhance the efficiency and accuracy of spotted seal monitoring in natural environments. The framework effectively addresses key challenges in small-object detection, including feature degradation, the low pixel resolution of spotted seals in UAV imagery, and the computational constraints of edge devices. Experimental results show that FF-YOLOv10 achieves an inference speed of 833.3 FPS while maintaining detection accuracy, and reduces the number of parameters by 24.2% compared with the original YOLOv10. In addition, PP-YOLOv7 increases detection accuracy by 1.2% without adding computational complexity. By balancing real-time responsiveness with detection completeness and accuracy, the proposed framework fulfills the core requirements of UAV-based wildlife monitoring under field conditions.
The framework’s robustness was further validated by testing it under various weather conditions, including sunny, reflective, foggy, and overcast environments. The results demonstrated that the proposed model effectively handles low-contrast backgrounds and varying environmental factors, with FF-YOLOv10 excelling in real-time detection and PP-YOLOv7 ensuring high detection accuracy, even under challenging conditions. These results confirm the framework’s practical applicability and adaptability in diverse field settings, where environmental factors such as lighting, background interference, and weather conditions often vary.
Compared with conventional object detection methods such as YOLOv11 and RT-DETR, FF-YOLOv10 achieves a substantial reduction in model complexity while maintaining detection accuracy. This improvement primarily benefits from the introduction of the C2f-Faster module, which reduces redundant channel connections and incorporates cross-layer feature fusion to achieve a more lightweight network structure. Additionally, the Focal Modulation mechanism enhances the model’s ability to represent small objects and complex backgrounds by integrating spatial attention guidance and content-adaptive modulation. PP-YOLOv7 further contributes to performance gains by introducing a dedicated small-object detection head and partial convolution modules, markedly improving detection accuracy for spotted seals—targets that are small in scale and often embedded in visually similar backgrounds within UAV imagery. This approach shares conceptual similarities with the feature enhancement strategy proposed by Liu et al. [51], which improves salient region extraction through multi-scale and edge-aware attention in marine species recognition. Such alignment underscores the significance of enhancing localized perceptual sensitivity for detecting small targets under background similarity interference. Similarly, Zhang et al. [52] highlighted that in environments with substantial background interference, the incorporation of a dedicated small-object detection head significantly enhances detection performance.
The dataset used in this study was collected from a relatively limited geographic location and primarily focuses on a single species—the spotted seal. Therefore, the application scope of the proposed approach still holds considerable potential for expansion. Future research could explore validation across diverse geographic regions and species contexts to enhance the generalization and adaptability of the model. The integration of infrared imaging technology into the proposed dual-model hierarchical detection framework warrants further investigation, as it enables continuous monitoring under low-light or nighttime conditions. This approach is expected to preserve detection accuracy while minimizing system energy consumption and data redundancy, thereby advancing the practical deployment of this technology in ecological conservation and biodiversity management.

5. Conclusions

This study presents a dual-model hierarchical detection framework for the sustainable, real-time monitoring of spotted seals. By deploying two enhanced object detection models—PP-YOLOv7 on ground-based workstations and FF-YOLOv10 on UAV platforms—the framework enables rapid detection and accurate identification of seals without disrupting their natural behavior. Experimental results show that the lightweight design significantly improves computational efficiency through staged feature allocation and selective computation, resulting in a 14.3% increase in FPS while maintaining stable detection accuracy. This design facilitates real-time onboard inference on UAVs and reduces the impact of data transmission and storage on flight endurance. The precision-focused detection model incorporates multi-scale feature maps and hierarchical convolution operations to capture fine-grained details of small targets. It achieves 94.2% detection accuracy and a 1.9% improvement in recall, all without increasing computational complexity. The model also enhances sensitivity to small objects, especially under conditions of background–target color similarity. In comparison with traditional monitoring approaches, the proposed framework offers superior detection precision, better real-time responsiveness, and higher automation. Future research will focus on further optimizing the lightweight design for UAV models to extend flight endurance and adapt to complex environmental conditions, thereby improving its performance in field-based monitoring applications.

Author Contributions

Conceptualization, J.L. and M.J.; methodology, J.L.; validation, J.L. and C.W.; formal analysis, J.L.; investigation, J.L., L.Q. and J.W.; resources, F.J., M.J. and J.W.; data curation, J.L. and L.Q.; writing—original draft preparation, J.L.; writing—review and editing, J.L., F.J. and M.J.; visualization, J.L. and C.W.; supervision, F.J. and M.J.; project administration, M.J., L.Q. and J.W.; funding acquisition, M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Open Fund Project of Key Laboratory of Bohai Sea Ecological Early Warning and Protection and Restoration, Ministry of Natural Resources (grant number SKDZK20230464).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon reasonable request from the corresponding author. Due to restrictions imposed by the ecological reserve, only partial datasets can be provided.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. McNeely, J.A.; Miller, K.R.; Reid, W.V.; Mittermeier, R.A.; Werner, T.B. Conserving the World’s Biological Diversity; IUCN: Gland, Switzerland; WRI: Washington, DC, USA; CI: Crystal City, VA, USA; WWF-US: Washington, DC, USA; World Bank: Washington, DC, USA, 1990. [Google Scholar]
  2. Allen, S.G.; Ainley, D.G.; Page, G.W. The effect of disturbance on harbor seal haul out patterns at Bolinas Lagoon, California. Fish. Bull. 1984, 82, 493. [Google Scholar]
  3. Zhuang, H.; Hou, L.; Wang, S.; Gao, Y.; Zhang, C.; Wang, Z.; Zhao, L.; He, Y.; Zhou, Q.; Lu, Z.; et al. Facial feature-based individual identification of spotted seals (Phoca largha). Acta Ecol. Sin. 2025, 45, 6586–6599. (In Chinese) [Google Scholar] [CrossRef]
  4. Birenbaum, Z.; Do, H.; Horstmyer, L.; Orff, H.; Ingram, K.; Ay, A. SEALNET: Facial recognition software for ecological studies of harbor seals. Ecol. Evol. 2022, 12, e8851. [Google Scholar] [CrossRef]
  5. Aarts, G.; Brasseur, S.; Poos, J.J.; Schop, J.; Kirkwood, R.; Van Kooten, T.; Mul, E.; Reijnders, P.; Rijnsdorp, A.D.; Tulp, I. Top-down pressure on a coastal ecosystem by harbor seals. Ecosphere 2019, 10, e02538. [Google Scholar] [CrossRef]
  6. Zhang, J.; Song, W. Construction and Practice of Marine Ecological Protection Importance Evaluation System–Taking Dalian Sea Area as an Example. Nat. Resour. Inf. 2025, 06, 47–54. (In Chinese) [Google Scholar]
  7. Wang, N.; Ding, K. Effects of the marine environment on spotted seals survival (Phoca largha) Bohai Sea. Mar. Sci. Bull. 2019, 38, 202–209. [Google Scholar] [CrossRef]
  8. Mulero-Pázmány, M.; Hurtado, S.; Barba-González, C.; Antequera-Gómez, M.L.; Díaz-Ruiz, F.; Real, R.; Navas-Delgado, I.; Aldana-Montes, J.F. Addressing significant challenges for animal detection in camera trap images: A novel deep learning-based approach. Sci. Rep. 2025, 15, 16191. [Google Scholar] [CrossRef] [PubMed]
  9. Cunha, F.; dos Santos, E.M.; Barreto, R.; Colonna, J.G. Filtering empty camera trap images in embedded systems. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2438–2446. [Google Scholar]
Figure 1. Overview of the study area. The area within the red rectangle represents the region captured by the drone imagery.
Figure 2. Raw images used for annotation and augmentation.
Figure 3. Labeled images with bounding boxes around spotted seals for model training.
Figure 4. Structure of FF-YOLOv10.
Figure 5. Overall architecture of FasterNet. The top shows the network pipeline, while the bottom illustrates the structure of the FasterNet Block and two convolution strategies.
Figure 6. Focal modulation network architecture.
Figure 7. Structure of PP-YOLOv7.
Figure 8. Schematic of the small-target detection layer. The input stage receives the image data, the FPN performs multi-scale feature extraction, and the PAN enhances features and propagates information across scales.
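To make the data flow described in Figure 8 concrete, the following minimal PyTorch sketch wires a toy top-down (FPN) pathway and a bottom-up (PAN) pathway over four backbone feature maps, including a high-resolution scale for small targets. It is an illustration only, not the PP-YOLOv7 implementation; the class name TinyFpnPan, the channel widths and the input sizes are assumptions.

# Minimal, illustrative FPN + PAN sketch (assumed names and sizes; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFpnPan(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), width=64):
        super().__init__()
        # 1x1 convs project the backbone maps (high-res P2 ... low-res P5) to a common width
        self.lateral = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])
        # 3x3 convs smooth each merged map
        self.smooth = nn.ModuleList([nn.Conv2d(width, width, 3, padding=1) for _ in in_channels])
        # stride-2 convs drive the bottom-up (PAN) path
        self.down = nn.ModuleList([nn.Conv2d(width, width, 3, stride=2, padding=1)
                                   for _ in in_channels[:-1]])

    def forward(self, feats):
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # FPN: top-down pathway, coarse semantics flow into the finer maps
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        fpn_outs = [s(x) for s, x in zip(self.smooth, laterals)]
        # PAN: bottom-up pathway, precise localization flows back toward coarser maps
        pan_outs = [fpn_outs[0]]
        for i in range(len(fpn_outs) - 1):
            pan_outs.append(fpn_outs[i + 1] + self.down[i](pan_outs[-1]))
        return pan_outs  # one feature map per detection scale, including the small-target scale

# Usage with dummy backbone maps for a 640 x 640 input
feats = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 256, 512), (160, 80, 40, 20))]
print([o.shape for o in TinyFpnPan()(feats)])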
Figure 9. Conv and PConv structures. * denotes the convolution operation.
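As a compact illustration of the PConv structure in Figure 9 (the partial convolution used in the FasterNet block of Figure 5), the sketch below convolves only a fraction of the input channels and passes the remaining channels through unchanged. The 1/4 split ratio, 3 x 3 kernel and class name are assumptions for illustration, not necessarily the settings used in this study.

# Illustrative partial convolution (PConv): convolve a subset of channels, pass the rest through.
import torch
import torch.nn as nn

class PConv(nn.Module):
    def __init__(self, channels: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.conv_ch = channels // n_div              # channels that are actually convolved
        self.pass_ch = channels - self.conv_ch        # channels copied through untouched
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.conv_ch, self.pass_ch], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# Usage: a 64-channel feature map, of which only 16 channels pass through the 3 x 3 convolution
y = PConv(64)(torch.randn(1, 64, 80, 80))
print(y.shape)  # torch.Size([1, 64, 80, 80])

Because only a fraction of the channels is convolved, the layer's FLOPs and memory accesses drop accordingly, which is the motivation for replacing standard convolutions with PConv in lightweight variants.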
Figure 10. Performance comparison of different networks integrated into YOLOv10. (a) mAP versus FPS. (b) Model size (MB) and parameter count versus FPS.
Figure 11. Visualization results of the FF-YOLOv10 model detecting spotted seals in UAV images captured from different altitudes. The model demonstrates robust detection performance across varying scales and resolutions.
Figure 12. Experimental comparison of classical object detection networks. (a) Precision comparison based on mAP. (b) Recall comparison based on mAP.
Figure 13. Visualization of model detection results. The yellow dashed boxes mark regions where the other models struggle to detect targets, while the red dashed boxes mark the corresponding improvement achieved by PP-YOLOv7. The red, blue and green rectangles indicate each model's detections.
Figure 14. The number of detections, false positive (FP) rate and false negative (FN) rate of each model under different weather conditions.
Figure 15. Visualization of detection results of five models under varying weather conditions.
Table 1. Experimental configuration environment.
Configuration              Parameter
Programming language       Python 3.8.18
Deep learning framework    PyTorch 1.8.0
Operating system           Windows 10 x64
CPU                        Intel i9-10980XE
Host memory                64 GB
GPU                        NVIDIA GeForce RTX 3080 Ti
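For readers reproducing the environment in Table 1, a short sanity check such as the following (a generic snippet, not part of the study's pipeline) can confirm the Python, PyTorch and GPU configuration before training:

# Generic environment check for the configuration listed in Table 1.
import platform
import torch

print("Python        :", platform.python_version())   # expected 3.8.x
print("PyTorch       :", torch.__version__)           # expected 1.8.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU           :", torch.cuda.get_device_name(0))  # expected an RTX 3080 Ti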
Table 2. Comparison of baseline model detection results. All YOLO-series models use the n variant.
Model      Precision   Recall   mAP     GFLOPs   Size (MB)   Parameters   FPS
YOLOv11    0.857       0.654    0.735   6.3      5.5         2,582,347    666.7
YOLOv10    0.867       0.672    0.741   1.2      5.4         2,492,822    625.0
YOLOv8     0.856       0.668    0.742   8.1      6.3         3,005,843    769.2
YOLOv7     0.930       0.847    0.901   103.2    74.8        36,479,926   166.7
YOLOv5     0.920       0.726    0.810   15.8     14.4        7,012,822    108.1
Table 3. Comparative performance of object detection algorithms on the seal dataset.
Model         Recall   mAP     Size (MB)   Parameters   FPS
Spike-YOLO    0.609    0.696   27.1        13,248,643   196.1
YOLOv7-tiny   0.688    0.793   12.3        6,006,646    212.8
RT-DETR       0.683    0.753   66.2        31,985,795   81.3
SSD           0.465    0.691   90.6        23,612,246   25.2
FF-YOLOv10    0.665    0.742   5.0         1,888,742    833.3
Table 4. Ablation experiment results of FF-YOLOv10.
Model        Recall   mAP     Size (MB)   Parameters   FPS
YOLOv10      0.672    0.741   5.4         2,492,822    625.0
F-YOLOv10    0.664    0.748   4.8         1,780,323    212.8
FF-YOLOv10   0.665    0.742   5.0         1,888,742    833.3 (↑33.3%)
↑ indicates an increase in FPS.
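The relative changes reported for FF-YOLOv10 in Table 4 follow directly from the tabulated values: the FPS gain is (833.3 - 625) / 625, roughly +33.3%, and the parameter count falls by (2,492,822 - 1,888,742) / 2,492,822, roughly 24.2%, relative to the YOLOv10 baseline. A short check, illustrative only:

# Verify the relative changes in Table 4 from the tabulated values.
def pct_change(new, old):
    return (new - old) / old * 100

print(f"FPS change        : {pct_change(833.3, 625):+.1f}%")            # ~ +33.3%
print(f"Parameter change  : {pct_change(1_888_742, 2_492_822):+.1f}%")  # ~ -24.2%
print(f"Model size change : {pct_change(5.0, 5.4):+.1f}%")              # ~ -7.4%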
Table 5. Ablation experiment results of the improved YOLOv7 model.
Model        Precision   Recall   mAP     Size (MB)   Parameters
YOLOv7       0.930       0.847    0.901   74.8        36,479,926
YOLOv7+A     0.937       0.874    0.928   72.5        37,023,248
YOLOv7+A+B   0.942       0.866    0.924   72.5        37,023,184
A represents the small-object detection layer, and B represents partial convolution.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
