Article

YOLIP: An Enhanced Framework for UAV-Assisted Wildlife Monitoring Based on YOLO Integrated with the CLIP Model

1
Portland College, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2
School of Communications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(11), 3436; https://doi.org/10.3390/s26113436
Submission received: 5 May 2026 / Revised: 22 May 2026 / Accepted: 25 May 2026 / Published: 29 May 2026
(This article belongs to the Special Issue AI-Based Visual Sensing for Object Detection)

Abstract

UAV-based wildlife monitoring faces substantial challenges posed by complex environments, such as the extremely low proportion of effective targets in aerial images and variations in remote sensing scale. This paper presents a novel fusion framework named YOLIP, which integrates a detection head with semantic perception capabilities and an implicit feature modulation module to boost detection accuracy and feature representation ability. Specifically, the detection head is redesigned to learn spatial positioning and semantic features simultaneously, thereby achieving more reliable extraction of regional features. The implicit feature modulation module introduces a dual-path fusion mechanism that improves feature quality through geometric–semantic fusion, thereby improving the consistency and robustness of detection. Furthermore, an asynchronous scheduling strategy is developed that selectively executes computationally intensive operations, enabling the framework to adapt to practical UAV-based detection scenarios. Extensive experiments were conducted on a self-built drone wildlife dataset as well as a publicly available aerial wildlife dataset. The results demonstrate that, compared with existing detection models, YOLIP improves mAP@0.5 by 11.6% while maintaining an efficient inference speed. In addition, cross-dataset evaluation verifies the stable performance and generalization capability of the proposed method across multiple real-world scenarios.

1. Introduction

Effective monitoring of endangered wild animals in complex natural environments is a major challenge in the field of ecological protection. Because these animals are sparsely distributed over wide areas with markedly different habitats, traditional monitoring methods that rely on manual patrols or fixed camera systems are limited in real-time monitoring, with restricted coverage and insufficient temporal continuity [1]. With the rapid development of drone technology, large-scale dynamic monitoring based on continuous video streams has gradually become possible, providing new opportunities for real-time ecological observation [2].
However, accurately detecting and identifying wildlife targets from the perspective of drones remains a significant challenge. Animal targets captured from an aerial viewpoint typically occupy only a small number of pixels in the image, making it difficult for models to extract key features from them [3]. Complex environmental factors such as changes in lighting, occlusion, and background noise can also significantly degrade detection performance. Most existing methods still rely on closed-set training, limiting the model’s ability to generalize to unseen species in real-world scenarios. To address these challenges, researchers have adapted the YOLO series of models for UAV target detection, achieving a good balance between detection speed and accuracy [4,5,6]. However, in UAV-assisted wildlife monitoring scenarios, the performance of such methods remains limited in terms of small-target detection and adaptation to dynamic environments [7,8,9]. In particular, as the ground sampling distance (GSD) increases with flight altitude, each animal target occupies fewer pixels, making it difficult to distinguish texture, outline, and category-specific details. In this situation, the degradation of early-stage features may lead to detection errors that cannot be compensated for by the subsequent recognition module.
In recent years, vision-language models represented by CLIP have demonstrated strong zero-shot and open-vocabulary recognition capabilities by aligning visual and textual representations in a shared embedding space. These characteristics make CLIP well suited for recognition tasks involving unseen categories [10,11,12,13]. However, CLIP itself lacks precise object localization capabilities and often incurs high computational costs when directly applied to dense visual scenes [14,15,16,17,18].
To overcome these limitations, recent studies have explored combining YOLO with CLIP, leveraging the precise localization capability of YOLO and the semantic generalization ability of CLIP to improve object detection performance. Among these studies, YOLO-World stands out as a representative work. This method introduces a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and a region-text contrastive learning mechanism, achieving fundamental alignment of cross-modal features. The resulting model combines precise localization with generalization capability [19]. Driven by this research paradigm, subsequent methods have further explored cross-modal representation enhancement strategies. YOLOE extends prompt-based detection capabilities while maintaining the existing inference overhead [20]. Uni-YOLO enhances the model’s robustness in cluttered backgrounds by establishing a CLIP-guided feature alignment mechanism [21]. CLIP-YOLO, on the other hand, replaces the traditional classification head with semantic embeddings and combines attention mechanisms to strengthen the expressive power of visual features [22]. Mamba-YOLO-World further proposes a fusion mechanism based on the State Space Model (SSM). This mechanism combines linear computational complexity with a global receptive field, effectively enhancing feature interactions and improving the model’s generalization performance [23]. Furthermore, research has shown that incorporating CLIP information into YOLOv8 during training can enhance the model’s data efficiency and stability, with significant advantages when the training data are limited [24].
Despite these advancements, existing drone wildlife detection methods and YOLO–CLIP fusion approaches still have the following limitations. Existing YOLO–CLIP fusion methods mainly focus on general visual–semantic alignment or open-category recognition, whereas the geometric mismatch between the proposals generated by the detector and the CLIP-style visual input has not been fully resolved. Furthermore, most existing YOLO–CLIP fusion methods are designed for image-level object detection or general open-category recognition tasks. In contrast, drone-based wildlife monitoring typically relies on continuous video streams, where changes in object appearance between adjacent frames are relatively slow. In this scenario, performing computationally intensive semantic recognition on every frame would inevitably result in redundant computations and reduce the model’s practical deployment efficiency on resource-constrained drone platforms. Therefore, a perception framework for UAV applications should not only focus on improving detection and recognition accuracy but also incorporate frame-level inference efficiency into the model design.
Furthermore, wildlife monitoring places higher demands on the synergy between high-recall target localization and semantic refinement. In aerial images, small-scale targets, partial occlusions, and complex background interference can easily lead to missed detections; once candidate regions are not generated correctly, it is typically difficult to compensate for this error in the subsequent semantic recognition stage. Therefore, reliable candidate region generation should be prioritized before semantic classification; simultaneously, the generated candidate regions need to be further refined in both geometric and semantic spaces to reinforce the stability of the final recognition.
To address these limitations, this paper proposes a hierarchical perception framework, YOLIP. This framework includes a high-recall YOLO foreground extraction branch and integrates YOLOv11 and CLIP through a unified cross-modal alignment process. The proposed YOLIP Head converts the detector features into language-aligned proposal embeddings, thereby strengthening the connection between YOLO-based localization and CLIP-based semantic representation. The Interaction Fusion Module (IFM) further refines the candidate regions through geometric normalization and semantic alignment, reducing proposal distortion and the mismatch of cross-modal representations. Furthermore, a frame-level asynchronous scheduling strategy is introduced, which decouples high-frequency localization from low-frequency semantic recognition, thereby reducing redundant computations in continuous drone video streams.
The key contributions of this research are presented as follows:
  • We propose a structured YOLOv11–CLIP model that separates object detection from semantic understanding. This framework can run in real time and handle open-vocabulary tasks.
  • We design a YOLIP Head with semantic perception capabilities, which generates language-aligned proposal embeddings from the features of the detector. This enables a more consistent interface between YOLO-based localization and CLIP-based recognition.
  • We introduce an Interaction Fusion Module (IFM), which refines the candidate regions through geometric normalization and semantic alignment. This helps to reduce proposal distortion and the mismatch of cross-modal representations.
  • We develop an asynchronous scheduling method that reduces redundant semantic inference in continuous drone video streams, thereby increasing the throughput of the video processing pipeline.

2. Hierarchical Perception Framework

This study proposes a hierarchical perception framework that is suitable for wildlife monitoring in complex natural environments. By integrating detection, recognition, and decision-making into a unified workflow, the framework enables accurate semantic mapping of visual observations to category-level predictions.
Specifically, this system combines the real-time detection capability of YOLOv11 with the open vocabulary recognition capability of the CLIP model. By combining these complementary strengths, the framework achieves robust performance under small-sample and zero-shot scenarios. As shown in Figure 1, the proposed system consists of three core elements:
(1) A real-time object detection module based on YOLOv11, responsible for efficiently locating candidate regions in complex backgrounds;
(2) A semantic recognition module based on CLIP, which performs zero-shot classification via the alignment of visual features and text embeddings;
(3) A multi-modal fusion and decision module, which integrates detection and semantic information to generate the final predictions.
Figure 1. Hierarchical perception framework diagram.
To bridge the representational gap between YOLOv11 and CLIP, we introduce a semantic-aware detection head, termed the YOLIP Head, along with an Interaction Fusion Module (IFM). The YOLIP Head acts as a semantic projection interface, transforming detection features into language-aligned proposal embeddings and establishing an initial alignment of visual features with the shared semantic space. On top of this, the IFM performs further cross-modal alignment through a contrastive learning objective, refining the coherence between visual and textual representations.
Together, these two stages form a hierarchical alignment mechanism that enables robust cross-modal representation and notably improves generalization in limited-data and zero-shot settings.
Furthermore, the framework incorporates prompt engineering and feature normalization to promote semantic coherence and stabilize training. Overall, the proposed framework establishes an efficient and scalable paradigm for multi-modal perception in real-world scenarios.

2.1. YOLOv11-Based Object Detection

In the first stage, this paper employs the lightweight YOLOv11n as the base detector to efficiently locate candidate regions from the drone images. Compared with earlier versions of YOLO, YOLOv11 introduces a more powerful design for feature extraction and spatial feature aggregation.
This is particularly important for UAV wildlife monitoring. The C3K2 module enhances the local feature extraction capability with lower computational overhead [6]. It introduces C3k blocks to better preserve local spatial details and multi-scale features. Its structure is shown in Figure 2.
These characteristics are highly relevant to UAV wildlife monitoring tasks, as the ground sample distance affected by the flight altitude means that animal targets are typically represented as small-scale targets in the images. Therefore, in this paper, YOLOv11n is selected as the location branch of YOLIP to achieve a good balance between detection accuracy and computational efficiency. At the same time, a relatively low confidence threshold is adopted to maintain a high recall rate, ensuring that sufficient foreground animal candidate regions are reserved for subsequent semantic recognition based on CLIP and IFM feature refinement.

2.2. CLIP Semantic Classification

In the recognition stage, the multi-modal pre-trained model CLIP is employed to perform semantic classification of candidate regions. CLIP maps visual and textual inputs into a shared embedding space through contrastive learning, enabling open-vocabulary recognition.
Specifically, each candidate region is encoded by the CLIP visual encoder, while category descriptions are converted into text embeddings using prompt templates. The semantic similarity between image and text embeddings is computed via cosine similarity, followed by softmax normalization to obtain classification probabilities.
This mechanism allows the system to generalize to unseen categories without additional training, making it particularly suitable for wildlife monitoring scenarios.
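The classification step described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: the three-dimensional vectors stand in for real CLIP embeddings, the prompt strings are hypothetical, and `logit_scale` is an assumed multiplier applied before the softmax rather than a value reported in this article.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_region(image_embedding, text_embeddings, logit_scale=100.0):
    """Return (best_label, probabilities) for one candidate region.

    text_embeddings maps a prompt to its embedding; logit_scale is an
    assumed multiplier, not a value taken from the article.
    """
    labels = list(text_embeddings)
    logits = [logit_scale * cosine_similarity(image_embedding, text_embeddings[k])
              for k in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))

# Toy 3-dimensional embeddings standing in for real CLIP outputs.
prototypes = {
    "a photo of a zebra": [1.0, 0.0, 0.0],
    "a photo of a giraffe": [0.0, 1.0, 0.0],
    "a photo of an elephant": [0.0, 0.0, 1.0],
}
label, probs = classify_region([0.9, 0.1, 0.2], prototypes)
print(label)
```

Adding a new species in this setting only requires adding a new prompt and its text embedding, which is what makes the recognition stage open-vocabulary.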

2.3. Overview: A Semantic-Aware Proposal Framework via Contrastive Alignment

YOLIP is a semantic-aware detection framework that transforms dense visual features into language-aligned proposal embeddings via contrastive learning. The study does not merely connect a detector to a classifier; it instead reformulates detection as semantic proposal generation. The overall process can be formulated as follows.
Let $I \in \mathbb{R}^{H \times W \times 3}$ denote an input frame. A YOLOv11 backbone extracts multi-scale feature maps $F^{(s)} \in \mathbb{R}^{H_s \times W_s \times C}$ at strides $s \in \{8, 16, 32\}$, where the number of channels is compressed to $C = 256$.
These features are fed into a novel YOLIP Head, which projects every spatial cell into a CLIP-aligned semantic embedding field $E_s$ via a shared function $g_\phi$.
From this field, we extract a set of semantic-aware proposals $P = \{(B_i, e_i)\}_{i=1}^{N_p}$, where each bounding box $B_i$ is paired with a pre-aligned embedding $e_i \in \mathbb{R}^d$. A dual-pathway Interaction Fusion Module (IFM) then refines these proposals. Its geometric pathway normalizes each region to a fixed canonical size, while its semantic pathway produces a refined embedding $\tilde{e}_i$.
Finally, $\tilde{e}_i$ is matched against frozen CLIP text prototypes $t_c$ via cosine similarity. The entire framework is driven by a unified InfoNCE contrastive objective:
$$L_{con} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(\tilde{e}_i, t_{y_i}) / \tau\right)}{\sum_{c=1}^{C} \exp\left(\mathrm{sim}(\tilde{e}_i, t_c) / \tau\right)},$$
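where $N$ is the number of valid proposals in a mini-batch, $C$ here denotes the number of text prototypes, $y_i$ is the category index of the $i$-th proposal, $\mathrm{sim}(\cdot, \cdot)$ is the cosine similarity, and $\tau$ is the temperature coefficient.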
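As a rough numerical illustration of this contrastive objective, the sketch below evaluates the loss for a handful of toy embeddings. It assumes non-zero vectors, uses plain Python lists in place of tensors, and picks a temperature of 0.07 purely for illustration; none of these choices are taken from the article.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def info_nce(embeddings, labels, prototypes, tau=0.07):
    """Mean InfoNCE loss over N proposals.

    embeddings: N proposal embeddings; labels: N indices into prototypes;
    prototypes: C text prototypes; tau: temperature (assumed value).
    """
    total = 0.0
    for e, y in zip(embeddings, labels):
        logits = [cosine(e, t) / tau for t in prototypes]
        m = max(logits)
        # log of the softmax denominator, computed stably
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[y]  # equals -log softmax(logits)[y]
    return total / len(embeddings)

prototypes = [[1.0, 0.0], [0.0, 1.0]]
proposals = [[0.9, 0.1], [0.2, 0.8]]
aligned = info_nce(proposals, [0, 1], prototypes)      # labels match the nearest prototype
mismatched = info_nce(proposals, [1, 0], prototypes)   # labels deliberately swapped
print(aligned < mismatched)
```

The loss is small when each proposal embedding is closest to the prototype of its own category and grows when the assignment is wrong, which is the behavior the alignment objective relies on.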
In order to jointly optimize the foreground positioning and semantic alignment, the overall training objective of YOLIP is set as:
$$L_{total} = \lambda_{box} L_{box} + \lambda_{dfl} L_{dfl} + \lambda_{cls} L_{cls} + \lambda_{con} L_{con}$$
Here, $L_{box}$ represents the bounding box regression loss, $L_{dfl}$ is the distribution focal loss used for bounding box refinement, and $L_{cls}$ supervises the classification of foreground proposals. These detection-related losses optimize the localization branch of YOLOv11, thereby generating accurate, high-recall animal proposals. In contrast, $L_{con}$ aligns the generated proposal embeddings with the CLIP-based semantic prototypes. The weight coefficients $\lambda_{box}$, $\lambda_{dfl}$, $\lambda_{cls}$, and $\lambda_{con}$ respectively determine the relative importance of localization learning, bounding box refinement, foreground proposal classification, and semantic alignment.
Overall, YOLIP can be viewed as a semantic proposal generation framework: detection provides spatial grounding, while contrastive alignment embeds language-level semantics directly into each proposal.

2.3.1. YOLIP Head (YH): Generating Language-Aligned Proposal Embedding

The YOLIP head extends the traditional YOLO detection head by introducing a semantic projection branch. Unlike standard YOLO heads that terminate in closed-set class logits, our design outputs dense semantic embeddings that are natively comparable with CLIP text prototypes. The structure diagram is shown in Figure 3.
Formally, let $F_s \in \mathbb{R}^{H_s \times W_s \times C}$ denote the feature map at scale $s \in \{P3, P4, P5\}$, compressed to $C = 256$ via a $3 \times 3$ convolution. A shared projection function $g_\phi : \mathbb{R}^C \to \mathbb{R}^d$ is applied densely across all scales and spatial positions:
$$E_s = g_\phi(F_s) \in \mathbb{R}^{H_s \times W_s \times d},$$
where $g_\phi$ is a two-layer MLP:
$$g_\phi(f) = W_2 \cdot \sigma(W_1 \cdot f + b_1) + b_2,$$
with $W_1 \in \mathbb{R}^{512 \times 256}$, $W_2 \in \mathbb{R}^{512 \times 512}$, ReLU activation $\sigma$, and output dimension $d = 512$, which matches CLIP ViT-B/32. The semantic embedding field is denoted as $E_s$.
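To make the dense, weight-shared projection concrete, the following sketch applies a single two-layer MLP to every cell of three toy feature maps. The sizes are deliberately tiny (the article uses $C = 256$ and $d = 512$), the weights are random placeholders, and the helper names `make_projection` and `project_feature_map` are hypothetical.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(weight, bias, x):
    """Compute weight @ x + bias for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def make_projection(c_in, hidden, d_out, seed=0):
    """Build g_phi(f) = W2 * relu(W1 * f + b1) + b2 with random toy weights."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(c_in)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(d_out)]
    b1, b2 = [0.0] * hidden, [0.0] * d_out
    return lambda f: linear(w2, b2, relu(linear(w1, b1, f)))

def project_feature_map(g_phi, feature_map):
    """Apply the same g_phi to every spatial cell of an H x W x C map."""
    return [[g_phi(cell) for cell in row] for row in feature_map]

# Toy sizes for illustration; one g_phi instance serves all three scales.
C, HIDDEN, D = 4, 8, 8
g_phi = make_projection(C, HIDDEN, D)
pyramid = {"P3": (4, 4), "P4": (2, 2), "P5": (1, 1)}
fields = {}
for name, (h, w) in pyramid.items():
    fmap = [[[0.5] * C for _ in range(w)] for _ in range(h)]
    fields[name] = project_feature_map(g_phi, fmap)
print({name: (len(e), len(e[0]), len(e[0][0])) for name, e in fields.items()})
```

Because the same parameters are reused at P3, P4, and P5, identical input features map to identical embeddings regardless of scale, which is the property the weight sharing is meant to enforce.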
On this field, two lightweight predictors operate. A foreground heatmap head produces a class-agnostic object score via a 1 × 1 convolution followed by a sigmoid:
$$H^{(s)} = \sigma\left(\mathrm{Conv}_{1 \times 1}\left(E_s\right)\right) \in [0, 1]^{H_s \times W_s \times 1}$$
A parallel bounding box regression head (anchor-free) predicts the coordinates.
For each high-confidence foreground location, we extract a proposal $B_i = (x_i, y_i, w_i, h_i)$ and pool its embedding from $E_s$ using RoIAlign:
$$e_i = \mathrm{RoIAlign}(E_s, B_i) \in \mathbb{R}^d$$
The output is a set of semantic-aware proposals:
$$P = \{(B_i, e_i)\}_{i=1}^{N_p}$$
Crucially, the projection parameters $\phi$ are shared across all scales (P3–P5) and all spatial positions. This weight sharing enforces scale-invariant semantics and reduces overfitting, a critical advantage in data-limited regimes.
In essence, the YOLIP Head transforms the detection backbone’s visual features into a language-ready proposal set, where each proposal already carries a CLIP-aligned semantic embedding.

2.3.2. IFM: Interaction Fusion Module

Directly feeding YOLO proposals into CLIP leads to a dual domain gap: geometric (arbitrary region sizes vs. fixed 224 × 224 input) and semantic (detection-optimized local features vs. globally aligned CLIP embeddings).
The Interaction Fusion Module (IFM) bridges these two gaps through a dual-path architecture. The overall architecture of the module is shown in Figure 4.
Geometric Normalization Pathway. Given each candidate region $B_i$, a learnable spatial transformer estimates an affine transformation parameter $\Theta_i$. A grid generator and a differentiable sampler with reflection padding then warp the region to a canonical resolution:
$$P_i = \mathrm{Warp}(B_i, \Theta_i) \in \mathbb{R}^{224 \times 224 \times 3}$$
Unlike naive resizing, this learned transformation adapts to object geometry and reduces spatial distortion.
Semantic Refinement Pathway. In parallel, the pre-aligned embedding $e_i$ is further refined through a projection head whose parameters are tied to those of $g_\phi$ in the YOLIP Head:
$$\tilde{e}_i = g_\phi(e_i) \in \mathbb{R}^d$$
Weight-tying ensures that $\tilde{e}_i$ inhabits the exact same representational space as $e_i$, preventing feature drift and stabilizing cross-module training. The outputs of both pathways are jointly forwarded. The normalized patch $P_i$ enters CLIP’s visual encoder $f_{vis}$, while $\tilde{e}_i$ is compared against frozen text prototypes $t_c$:
$$s_{i,c} = \cos(\tilde{e}_i, t_c) = \frac{\tilde{e}_i \cdot t_c}{\lVert \tilde{e}_i \rVert \, \lVert t_c \rVert}$$
During training, $\tilde{e}_i$ is supervised by the same InfoNCE loss $L_{con}$ applied to $e_i$, creating a cascaded alignment mechanism: the YOLIP Head provides coarse semantic initialization, and the IFM performs fine-grained refinement. Geometric normalization ensures spatial compatibility, while semantic refinement preserves the accuracy of the representation.
Combining these two paths can ensure that the visual markers entering CLIP not only have the correct spatial form, but also occupy the appropriate area in the semantic representation space.

3. Execution Framework and Throughput Analysis

3.1. Resource-Constrained Dual-Model Execution Mode

To maximize the efficiency of the YOLIP framework under limited computing resources, two execution modes were designed: the serial mode and the asynchronous batch processing mode. As shown in Table 1, the asynchronous mode increases throughput by overlapping computations between different stages. In the serial mode, each input frame successively goes through four steps: YOLOv11 detection, region proposal extraction, CLIP encoding, and result fusion. The total latency per frame is therefore the sum of these stages. Accordingly, the end-to-end frame rate (FPS) can be defined as follows:
$$FPS_{e2e} = \frac{1}{T_{yolo} + T_{clip} + n \, T_{post}}$$
where $T_{yolo}$ denotes the inference time of YOLOv11, $T_{clip}$ is the total CLIP encoding time over all proposals, $T_{post}$ represents the average post-processing time per proposal, and $n$ is the number of proposals generated by the YOLIP Head.
To improve inference efficiency, we further propose an asynchronous batch execution architecture. In this design, the detection model (YOLOv11) and the semantic encoder (CLIP) execute independently in separate CUDA streams, thereby achieving parallel computing and reducing the idle time caused by the high latency of CLIP.
Meanwhile, proposals from consecutive frames are dynamically aggregated into batches of size B before being processed by CLIP. This batching strategy significantly strengthens GPU utilization.
Under this mechanism, the average CLIP processing time per frame is reduced to $T_{clip}(B) / B$, where $T_{clip}(B)$ denotes the total processing time for a batch. The overall system throughput is determined by the slowest stage in the pipeline:
$$Throughput = \frac{1}{\max\left(T_{yolo}, \dfrac{T_{clip}(B)}{B}\right)}$$
This asynchronous design improves hardware utilization by overlapping computation across stages and is particularly suitable for latency-insensitive scenarios such as offline video analysis.
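The two execution modes can be compared numerically by evaluating the serial frame rate and the pipelined throughput side by side. The timings in the sketch below are hypothetical values chosen only to exercise the formulas; they are not measurements from this article.

```python
def serial_fps(t_yolo, t_clip, t_post, n):
    """End-to-end FPS in serial mode: 1 / (T_yolo + T_clip + n * T_post)."""
    return 1.0 / (t_yolo + t_clip + n * t_post)

def async_throughput(t_yolo, t_clip_batch, batch_size):
    """Pipelined throughput: 1 / max(T_yolo, T_clip(B) / B)."""
    return 1.0 / max(t_yolo, t_clip_batch / batch_size)

# Hypothetical timings in seconds, used purely for illustration.
T_YOLO, T_CLIP, T_POST, N = 0.010, 0.040, 0.001, 5
T_CLIP_BATCH, B = 0.080, 8

print(round(serial_fps(T_YOLO, T_CLIP, T_POST, N), 1))
print(round(async_throughput(T_YOLO, T_CLIP_BATCH, B), 1))
```

With these assumed numbers the pipelined mode is bounded by the detector alone, because the amortized CLIP cost per frame no longer exceeds the YOLO inference time.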

3.2. Frame-Level Scheduling and Asynchronous Inference Optimization for Video Streaming

For the continuous video streams generated by drone patrols, we designed a frame-level adaptive scheduling strategy. We separate the high-frequency lightweight target localization from the low-frequency candidate region semantic recognition. The lightweight detection branch runs at the original video frame rate and is used for real-time monitoring and short-term tracking. The semantic recognition branch based on CLIP is activated in an event-triggered manner. Specifically, semantic recognition is triggered only when a target appears, disappears, or changes beyond a predefined threshold.
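One possible way to realize such an event trigger is sketched below. It assumes that a tracker supplies a stable identifier for each target and measures "change" as the IoU between the current box and the box seen at the last semantic recognition; the class name `SemanticTrigger`, the IoU criterion, and the 0.5 threshold are illustrative assumptions rather than details given in this article.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

class SemanticTrigger:
    """Decide per frame whether the CLIP branch should run."""

    def __init__(self, iou_threshold=0.5):
        self.iou_threshold = iou_threshold  # assumed change threshold
        self.reference = {}                 # track id -> box at last CLIP run

    def should_run_clip(self, tracks):
        """tracks maps a track id to its current (x1, y1, x2, y2) box."""
        appeared_or_gone = set(tracks) != set(self.reference)
        changed = any(
            iou(box, self.reference[tid]) < self.iou_threshold
            for tid, box in tracks.items() if tid in self.reference
        )
        if appeared_or_gone or changed:
            self.reference = dict(tracks)   # remember the state just recognized
            return True
        return False

trigger = SemanticTrigger()
frames = [
    {},                          # empty scene
    {1: (10, 10, 50, 50)},       # target appears
    {1: (12, 10, 52, 50)},       # small motion
    {1: (60, 60, 100, 100)},     # large change
    {},                          # target disappears
]
decisions = [trigger.should_run_clip(f) for f in frames]
print(decisions)
```

The detector still runs on every frame in this scheme; only the expensive semantic branch is gated, so slow drift accumulates against the stored reference until it crosses the threshold.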
To quantitatively evaluate the performance improvement brought by the scheduling strategy, we define a scheduling factor $K$, which represents the number of consecutive frames processed by YOLO between two CLIP-based semantic recognition operations. Based on the typical UAV video frame rates (30 and 60 frames per second), we implemented a $K + 1$ scheduling scheme, such as the 29 + 1 and 59 + 1 schemes, in which CLIP-based recognition is activated after every 29 or 59 YOLO detection frames. Under this strategy, the average per-frame processing latency $T_{avg}$ and the achievable scheduled frame rate $FPS_{sched}$ can be calculated as follows.
$$T_{avg} = T_{yolo} + \frac{1}{K} T_{clip}$$
$$FPS_{sched} = \frac{1}{T_{avg}} = \frac{1}{T_{yolo} + \frac{1}{K} T_{clip}}$$
As $K$ increases, the computational overhead of the CLIP model is amortized over more frames, and the overall frame rate of the system approaches the theoretical upper limit of pure YOLO detection, $1 / T_{yolo}$.
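The effect of the scheduling factor can be checked with a short calculation. The per-frame timings below are hypothetical and serve only to show how the scheduled frame rate approaches the detector-only limit as $K$ grows.

```python
def scheduled_fps(t_yolo, t_clip, k):
    """FPS_sched = 1 / (T_yolo + T_clip / K)."""
    return 1.0 / (t_yolo + t_clip / k)

T_YOLO, T_CLIP = 0.010, 0.040  # hypothetical timings in seconds

for k in (1, 29, 59, 1000):
    print(k, round(scheduled_fps(T_YOLO, T_CLIP, k), 1))
print("upper bound", round(1.0 / T_YOLO, 1))
```

The gain from raising $K$ flattens quickly once the amortized CLIP cost falls well below the detector latency, so very large values mainly trade semantic refresh rate for marginal speed.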
It should be noted that in actual scenarios such as vast grasslands and areas with few wild animals, the changes in visual content are relatively slow and the target objects are scarce. Therefore, it is usually advisable to set the K value to be greater than 30. This not only reduces the system load but also significantly improves the response speed, while not affecting the integrity of semantic perception.
The frame-level adaptive scheduling mechanism is the core way to realize the efficient deployment of multi-modality models on the edge side, and its dynamic on-demand computing paradigm provides an important reference for similar computing-intensive applications on embedded platforms.

4. Experimental Simulation Analysis

4.1. Datasets Construction

4.1.1. Construction

The dataset used in this study was collected using a self-built UAV-assisted acquisition platform. The platform consists of a quadcopter equipped with a global-shutter high-definition camera, an onboard Jetson AGX Orin processor, and a LiDAR sensor for autonomous navigation and spatial perception, as illustrated in Figure 5.
A total of 6700 images were collected in real-world field conditions. The training set consists of 5500 images, including 500 negative samples containing only background; the validation set contains 1200 images. Negative samples were included in the training set to enhance background recognition and reduce false positives. The dataset covers common African savanna wildlife species, including elephants, zebras, antelopes, wildebeests, giraffes, and buffaloes. Table 2 summarizes the category distribution of the constructed dataset.
All images were manually annotated with bounding boxes and corresponding category labels to ensure annotation accuracy and consistency. During data collection and selection, real application scenarios in UAV-assisted wildlife monitoring were fully considered, allowing the dataset to cover various typical field environments.
To fully demonstrate the recognition performance of the proposed method in complex environments, such as those with small targets and occlusions, two additional metrics, APsl and APot, are introduced. Specifically, APsl represents the detection performance for small-scale targets, i.e., objects whose bounding-box area is below 32 × 32 pixels. APot represents the detection performance under occlusion. In this study, an object is considered to be occluded when more than 30% of its visible area is covered by vegetation, terrain, shadows, or overlaps with other animals.
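The two subset definitions translate directly into a pair of threshold checks. In the sketch below, the function name and the `occluded_fraction` input are hypothetical; in particular, it is an assumption that each annotation carries an estimate of the covered fraction of the object.

```python
SMALL_AREA_LIMIT = 32 * 32   # square pixels, from the APsl definition
OCCLUSION_LIMIT = 0.30       # covered fraction, from the APot definition

def subset_flags(width, height, occluded_fraction):
    """Return which evaluation subsets an annotated object belongs to."""
    return {
        "small": width * height < SMALL_AREA_LIMIT,
        "occluded": occluded_fraction > OCCLUSION_LIMIT,
    }

print(subset_flags(20, 30, 0.10))   # a small, mostly visible animal
print(subset_flags(64, 64, 0.45))   # a larger animal partly hidden by vegetation
```

An object can fall into both subsets at once, in which case it would contribute to both APsl and APot.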
Overall, the dataset is suitable not only for assessing detection accuracy but also for analyzing environmental adaptability and generalization capability in complex scenarios. The dataset will be made publicly available upon acceptance.
In addition, to further verify the generalization capability of the proposed method, experiments are also conducted on the publicly available UAV-assisted wildlife dataset WAID [25], as described in Section 4.3. Unless otherwise specified, all experiments are conducted on the self-constructed UAV dataset.

4.1.2. Evaluation Metrics

In this study, the mean average precision (mAP) is adopted as the primary evaluation metric, including mAP@0.5 and mAP@0.5:0.95, following standard object detection protocols. Precision ($P$) and recall ($R$) are defined as:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
For each category, the Average Precision (AP) is computed as the area under the Precision–Recall curve, and mAP is obtained by averaging AP over all $N$ categories.
$$AP = \int_0^1 P(R) \, dR$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
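These definitions can be illustrated with a small reference computation. The sketch assumes that detections have already been matched to ground truth at a fixed IoU threshold and approximates the integral with all-point interpolation (precision made non-increasing in recall); that interpolation convention is an assumption, since the article does not state which one was used.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts; an empty denominator yields 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(scored_matches, num_ground_truth):
    """Area under the precision-recall curve for one category.

    scored_matches: (confidence, is_true_positive) per detection.
    num_ground_truth: number of annotated objects (must be positive).
    """
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for _, is_tp in ranked:
        tp += is_tp
        fp += not is_tp
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Monotone envelope: each precision becomes the max of itself and all later ones.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((recalls[i] - recalls[i - 1]) * precisions[i]
               for i in range(1, len(recalls)))

def mean_average_precision(ap_per_class):
    """Average the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)

detections = [(0.9, True), (0.8, False), (0.7, True)]
ap = average_precision(detections, num_ground_truth=2)
print(round(ap, 4))
print(precision_recall(tp=8, fp=2, fn=2))
```

mAP@0.5:0.95 follows the same procedure, repeated over a range of IoU matching thresholds and averaged.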
In addition to accuracy metrics, we further evaluate computational performance using Frames Per Second (FPS) to measure inference speed, as well as Floating Point Operations (FLOPs) and the number of parameters to assess model complexity.

4.1.3. Hyperparameter Tuning for Foreground Proposal Generation

All experiments were conducted on a platform equipped with an Intel Core i9-12900KF CPU and an NVIDIA Tesla T10 GPU (16 GB VRAM). The models were implemented using PyTorch 2.5.0 with CUDA 12.4.
To determine optimal training configurations, we performed systematic hyperparameter tuning on key factors, including learning rate, input resolution, batch size, and loss weights. Detailed hardware and software configurations are summarized in Table 3.
This section describes the process of adjusting the hyperparameters of the localization branch in YOLOv11. At this stage, the detector was optimized to extract candidate objects of foreground animals from the drone images, while the semantic recognition module based on CLIP was not involved. The experimental results are summarized in Table 3 and visually compared in Figure 6. As shown in Table 4 and Figure 6, the learning rate significantly affects convergence stability, where 0.001 achieves the best balance between accuracy and stable performance.
Increasing the input resolution refines detection performance, particularly for small objects, but introduces higher computational cost. A resolution of 960 × 960 provides a favorable trade-off between accuracy and efficiency. Batch size variations indicate that moderate settings (batch size = 8) yield more stable performance compared to smaller or larger values. Based on comprehensive evaluation, configuration No. 8 achieves the best overall performance, with mAP@0.5 of 0.97. Therefore, the final model adopts the following settings: learning rate = 0.001, input size = 960 × 960, batch size = 8, with loss weights $\lambda_{box} = 7.5$, $\lambda_{cls} = 0.5$, and $\lambda_{dfl} = 1.5$, while the IoU threshold is set to 0.5. It should be noted that the mAP@0.5 values reported in Table 3 only reflect the localization performance of the candidate boxes proposed by the detector, and do not represent the final end-to-end performance of the entire YOLIP framework.

4.2. Comparison Experiments

As shown in Table 5, the proposed YOLIP method achieves the highest accuracy among all compared methods, indicating a clear advantage in the drone-assisted wildlife detection task.
Specifically, YOLIP attains a mAP@0.5 of 86.6%, outperforming YOLOv8n, YOLOv11n, and YOLO26n by 16.6, 11.6, and 8.3 percentage points, respectively. Under the stricter mAP@0.5:0.95 metric, it also achieves consistent improvements (60.7% versus 47.0%, 49.4%, and 50.1%). These results support the effectiveness of the proposed cross-modal alignment mechanism.
To further explain the source of the performance improvement, this paper conducts a gain decomposition analysis based on the results in Table 5. The comparison between YOLOv8n and YOLOv11n (70.0% versus 75.0% mAP@0.5) reflects the contribution of the detector backbone upgrade. The complete YOLIP framework further increases mAP@0.5 to 86.6%. This additional improvement mainly comes from the proposed YOLIP Head and IFM module, which enhance the region-level candidate representation and the cross-modal feature refinement, respectively.
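The decomposition can be written out as a worked sum. This is only an illustration using the mAP@0.5 values from Table 5, and it assumes that all three models were evaluated on the same test split; the symbols $\Delta_{\text{backbone}}$ and $\Delta_{\text{YOLIP}}$ are introduced here for convenience and do not appear elsewhere in the paper.

```latex
\begin{align}
\Delta_{\text{total}}    &= 86.6 - 70.0 = 16.6 \\
\Delta_{\text{backbone}} &= 75.0 - 70.0 = 5.0  \\
\Delta_{\text{YOLIP}}    &= 86.6 - 75.0 = 11.6 \\
\Delta_{\text{total}}    &= \Delta_{\text{backbone}} + \Delta_{\text{YOLIP}} = 5.0 + 11.6
\end{align}
```

Read this way, roughly 30% of the total gain over YOLOv8n is attributable to the detector upgrade and the remaining 70% to the YOLIP components.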
The convergence behaviors of the different models are illustrated in Figure 7. The detection accuracy of YOLIP is clearly higher than that of the other model families, and its training remains stable. Moreover, the qualitative comparison in complex transfer scenarios (Figure 8) further highlights its robustness and semantic understanding ability.
To further investigate the visual attention behavior of the proposed framework, Grad-CAM visualization was applied to representative UAV images of wild animals, as shown in Figure 9.
The visualization results show that YOLIP consistently focuses on the biologically significant areas of the animals rather than on irrelevant background areas, even in complex aerial scenes involving small targets, occlusions, and complex environmental textures. Note that Grad-CAM reflects the spatial response of the YOLIP foreground localization stage to the target area, not the semantic alignment process as a whole.

4.3. Generalization on Public Datasets (WAID)

To further evaluate the generalization capability of the proposed YOLIP framework beyond the self-constructed UAV dataset, additional experiments are conducted on a publicly available aerial wildlife dataset, namely WAID [25].
The WAID dataset consists of 14,375 aerial images captured by unmanned aerial vehicles (UAVs) in real-world wildlife monitoring scenarios. It contains six common large-animal categories, including cattle, sheep, and zebras, with full bounding box annotations provided for object detection tasks. Compared with our self-constructed dataset, WAID exhibits distinct characteristics in terms of object scale, viewpoint variation, and environmental diversity, making it a suitable benchmark for cross-dataset evaluation.
In this experiment, the proposed YOLIP model is directly applied to the WAID dataset with minimal fine-tuning to ensure fair comparison. No architectural modifications are introduced. Standard evaluation metrics, including mAP@0.5 and mAP@0.5:0.95, are adopted for performance assessment.
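The difference between the two metrics comes down to the IoU threshold at which a predicted box counts as a match. The sketch below assumes the usual COCO-style convention for mAP@0.5:0.95 (ten thresholds from 0.50 to 0.95 in steps of 0.05); the paper does not spell out its evaluation code, and the helper names `iou` and `matched_thresholds` as well as the example boxes are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Assumed COCO-style thresholds: 0.50, 0.55, ..., 0.95.
THRESHOLDS: List[float] = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def matched_thresholds(pred: Box, gt: Box) -> List[float]:
    """Thresholds at which the prediction would count as a true positive."""
    overlap = iou(pred, gt)
    return [t for t in THRESHOLDS if overlap >= t]


pred = (10.0, 10.0, 50.0, 50.0)  # illustrative predicted box
gt = (12.0, 12.0, 52.0, 52.0)    # illustrative ground-truth box
print(round(iou(pred, gt), 3), matched_thresholds(pred, gt))
```

A prediction shifted by two pixels on a 40-pixel box is a match at 0.5 but fails the strictest thresholds, which is why mAP@0.5:0.95 is more sensitive to localization quality on small aerial targets.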
As shown in Table 6, YOLIP reaches 72.1% mAP@0.5 and 56.7% mAP@0.5:0.95 on WAID, exceeding both YOLOv11n (61.4% and 42.8%) and YOLO26n (64.2% and 45.6%). This indicates that the proposed feature alignment mechanism is not limited to a specific dataset and generalizes to different UAV-assisted wildlife monitoring scenarios.

4.4. Ablation Study

4.4.1. Ablation Study on IFM Dual-Pathway Design

To evaluate the contribution of the proposed dual-pathway IFM module, we conduct an ablation study by incrementally introducing its components. The results are summarized in Table 7.
As shown in Table 7, the naive YOLO-CLIP baseline, which directly resizes detected regions to 224 × 224 without any feature alignment, achieves 82.5% mAP@0.5. Introducing the geometric pathway alone (a learnable spatial transformer with reflection padding) improves mAP@0.5 to 83.2%, a gain of 0.7 percentage points. The enhancement is particularly notable on small objects, where APsl rises from 29.2% to 30.3%, indicating that learned geometric normalization preserves discriminative features that would otherwise be lost in naive resizing. The corresponding effect is illustrated in Figure 10.
Adding the semantic pathway alone (the weight-tied projection head $g_\phi$, without geometric alignment) yields 84.3% mAP@0.5, an improvement of 1.8 percentage points over the naive baseline. This shows that contrastive semantic refinement alone provides meaningful gains, even when regions are not geometrically optimal.
The full IFM module, combining both geometric and semantic pathways, achieves 86.8% mAP@0.5, a gain of 4.3 percentage points over the naive baseline. On the two challenging subsets, APsl rises from 29.2% to 35.0% and APot from 34.3% to 42.1%. For mAP@0.5, the combined gain (4.3 points) exceeds the sum of the individual gains (0.7 + 1.8 = 2.5 points), which indicates that geometric normalization and semantic refinement are complementary: geometric alignment provides well-conditioned inputs for CLIP’s patch embedding, while semantic alignment ensures representational fidelity in the shared embedding space.
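The synergy claim can be checked directly from the Table 7 values. The sketch below computes, for each metric, the difference between the full-IFM gain and the sum of the two single-pathway gains; the names `TABLE7` and `synergy` are hypothetical and the calculation is offered as an illustration, not as part of the authors' analysis.

```python
from typing import Dict, Tuple

# (baseline, geometric only, semantic only, full IFM), values from Table 7.
TABLE7: Dict[str, Tuple[float, float, float, float]] = {
    "mAP@0.5": (82.5, 83.2, 84.3, 86.8),
    "APsl": (29.2, 30.3, 32.2, 35.0),
    "APot": (34.3, 36.8, 39.7, 42.1),
}


def synergy(base: float, geo: float, sem: float, full: float) -> float:
    """Full gain minus the sum of the single-pathway gains, in points."""
    additive = (geo - base) + (sem - base)
    return round((full - base) - additive, 1)


surplus = {name: synergy(*row) for name, row in TABLE7.items()}
for name, value in surplus.items():
    print(f"{name}: {value:+.1f} points beyond the additive expectation")
```

On these numbers the surplus is positive for mAP@0.5 and APsl, while APot is close to additive, so the complementarity is most visible on the overall metric and on small objects.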
These results support the dual-pathway design of the IFM module. By jointly addressing the geometric and semantic domain gaps, the proposed alignment interface enables more effective cross-modal transfer between YOLO’s detection features and CLIP’s semantic space.

4.4.2. Ablation Study on YH, IFM, and FS Modules

To evaluate the contribution of each component in the proposed framework, we conduct a stepwise ablation study by incrementally introducing key modules into the baseline model. The results are summarized in Table 8 and Table 9. The baseline model (Model 1) adopts a conventional serial pipeline, where YOLOv11 is used for detection and CLIP is applied independently for recognition, achieving a mAP@0.5 of 82.5%. Model 2 introduces the YOLIP Head (YH), which generates language-aligned proposal embeddings and enhances feature representation at the detection stage. This results in an improvement of 3.4 percentage points in mAP@0.5, demonstrating the effectiveness of early-stage semantic alignment. Model 3 further incorporates the IFM module, enabling cross-modal alignment between visual features and CLIP embeddings. This increases mAP@0.5:0.95 from 57.6% to 60.7%, indicating enhanced semantic consistency and generalization capability.
Finally, Model 4 integrates a frame-adaptive scheduling (FS) mechanism that decouples detection and semantic recognition through asynchronous inference scheduling. While maintaining comparable detection accuracy, the scheduling strategy raises the stream-level processing throughput to 57.8 FPS under asynchronous deployment, compared with 18.8 FPS for the serial YOLIP pipeline (Table 9) and 11.2 FPS for the serial Model 1 baseline (Table 8). These results show that each module contributes positively to the overall performance: the YOLIP Head improves detection quality, the IFM module enhances semantic alignment, and the FS strategy boosts computational efficiency.
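The idea behind frame-level scheduling can be sketched as follows. This is a deliberately simplified, synchronous stand-in and not the authors' implementation: detection runs on every frame, while the expensive recognition step runs only every k-th frame (k = 30 as in Table 9) or when the number of detected boxes changes. The box-count trigger, the function `run_stream`, and the stub detector and recognizer are all assumptions; a real asynchronous deployment would typically hand recognition to a separate worker.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def run_stream(
    frames: Sequence[object],
    detect: Callable[[object], List[Box]],
    recognize: Callable[[object, List[Box]], List[str]],
    k: int = 30,
) -> List[List[Tuple[Box, str]]]:
    """Detect on every frame; refresh labels every k-th frame or on a
    change in box count (hypothetical trigger), else reuse cached labels."""
    cached: List[str] = []
    outputs: List[List[Tuple[Box, str]]] = []
    for idx, frame in enumerate(frames):
        boxes = detect(frame)
        if idx % k == 0 or len(boxes) != len(cached):
            cached = recognize(frame, boxes)
        outputs.append(list(zip(boxes, cached)))
    return outputs


# Stub models that only count how often they are invoked.
calls: Dict[str, int] = {"detect": 0, "recognize": 0}


def fake_detect(frame: object) -> List[Box]:
    calls["detect"] += 1
    return [(0, 0, 10, 10)]


def fake_recognize(frame: object, boxes: List[Box]) -> List[str]:
    calls["recognize"] += 1
    return ["zebra"] * len(boxes)


results = run_stream(range(90), fake_detect, fake_recognize, k=30)
print(calls)
```

With a static scene, 90 frames trigger 90 detector calls but only 3 recognizer calls, which illustrates where the throughput gain comes from; the cost is that labels can be up to k − 1 frames stale.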

4.5. Comprehensive Comparison with YOLO-World

To further evaluate the effectiveness of the proposed framework, we conduct a comprehensive comparison with YOLO-World, a representative open-vocabulary detection paradigm based on YOLO and CLIP. The comparison focuses on architecture design, training paradigm, efficiency, and real-world applicability.
As summarized in Table 10, the two methods differ fundamentally in design philosophy. YOLO-World follows a “semantics-first” paradigm by aligning detector features with text embeddings, while YOLIP adopts a “detection-first, semantics-refined” strategy, prioritizing high-recall proposal generation followed by semantic refinement. The quantitative results in Table 11 show that YOLIP achieves superior closed-set detection performance, reaching 86.6% mAP@0.5 versus 82.1% for YOLO-World, a gain of 4.5 percentage points (5.48% in relative terms). This supports the effectiveness of the forward pipeline design in practical detection tasks.
Although YOLO-World exhibits stronger zero-shot generalization due to its end-to-end alignment training, YOLIP provides a more stable and efficient solution for real-world deployment. Furthermore, the asynchronous scheduling strategy enables YOLIP to decouple detection and recognition processes, reducing redundant computation in video streams. This design makes it particularly suitable for resource-constrained edge environments and real-time applications.
In general, while YOLO-World emphasizes semantic generalization, YOLIP offers a better balance between accuracy, efficiency, and deployability, making it more practical for UAV-assisted wildlife monitoring scenarios.

5. Conclusions

This paper presents YOLIP, a hierarchical perception framework that effectively integrates YOLO-based detection with CLIP-based semantic recognition through a unified cross-modal alignment pipeline. By introducing the YOLIP Head and the Interaction Fusion Module (IFM), the proposed method enables the generation of language-aligned proposals and fine-grained semantic refinements, achieving robust performance in both detection accuracy and open-vocabulary recognition.
In addition, the proposed frame-level scheduling and asynchronous inference strategy significantly improves computational efficiency in continuous video streams, making the framework well suited for real-time deployment in UAV-assisted wildlife monitoring scenarios.
Despite these advantages, the framework has several limitations. First, the semantic recognition ability of YOLIP relies on the quality of the proposals generated by the detector, so foreground regions missed during the localization stage propagate errors to the subsequent semantic refinement stage. Second, although CLIP enhances the semantic representation capability, its computational cost is still relatively high compared to lightweight detection models, which may limit the deployment efficiency on low-power embedded hardware. Third, the current asynchronous scheduling strategy relies on manually predefined scheduling intervals and trigger thresholds, which may not generalize well across different video dynamics and environmental conditions.
Finally, although the proposed framework employs CLIP-based semantic alignment, this study primarily focuses on closed-set wildlife monitoring scenarios rather than a fully open-vocabulary detection benchmark. The generalization ability of the framework on novel and intricate object categories still requires further investigation.

Author Contributions

Conceptualization, X.L. and K.X.; methodology, R.H.; software, R.H.; validation, Y.C. and L.Z.; formal analysis, K.X. and C.Y.; investigation, Y.C.; resources, X.L.; data curation, L.Z. and X.C.; writing—original draft preparation, R.H.; writing—review and editing, X.L. and Y.C.; visualization, L.Z. and H.P.; supervision, X.L.; project administration, R.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program of China (grant no. 2021ZD0140405), the Natural Science Foundation of Jiangsu Province of China (grant no. BK20241885), and the Science and Technology Innovation Training Program (STITP) of Jiangsu Province of China (grant nos. 202510293079Z and 202510293108Y).

Data Availability Statement

The publicly available dataset is obtained from the website https://doi.org/10.3390/app131810397. Other data used to support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The conference version of this paper has been accepted by the 6th International Conference on Computer Communication and Artificial Intelligence (CCAI2026, Nanjing, China), which provides a discussion of the original basis of the YOLIP framework. Following further simulations and more in-depth research, we present this article as the complete account of the project.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
YOLO: You Only Look Once
YH: YOLIP Head
CLIP: Contrastive Language-Image Pre-training
UAV: Unmanned Aerial Vehicle
RoIAlign: Region of Interest Align
IFM: Interaction Fusion Module
mAP: mean Average Precision
AP: Average Precision
FPS: Frames Per Second
FLOPs: Floating Point Operations
TP: True Positives
FP: False Positives

References

  1. Wang, S.; Chen, K.; Wei, Z.; Yang, L.; Wang, Q.; Liu, M.; Cao, K.; Zhao, C.; Chang, R.; Wang, Z.; et al. UAV-based deep learning for biodiversity monitoring: Advances, applications, and future directions. Ecol. Inform. 2026, 95, 103710.
  2. Lee, S.; Song, Y.; Kil, S.-H. Feasibility Analyses of Real-Time Detection of Wildlife Using UAV-Derived Thermal and RGB Images. Remote Sens. 2021, 13, 2169.
  3. Tian, J.; Gao, Y.; Xia, X.; Ju, G.; Ye, P.; Tang, S.; Wang, H.; Wang, X. MVDFusion: Multimodal Vehicle Detection in Foggy Weather Using LiDAR and Radar Fusion. Sensors 2026, 26, 2663.
  4. Yu, H.; Li, G.; Zhang, W.; Huang, Q.; Du, D.; Tian, Q.; Sebe, N. The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline. Int. J. Comput. Vis. 2020, 128, 1141–1159.
  5. Tang, Y.; Wang, J.; Sheng, W.; Bian, J. EP-YOLO: An Enhanced Lightweight Model for Micro-Pest Detection in Agricultural Light-Trap Environments. Sensors 2026, 26, 2607.
  6. Wu, P.; Xu, Y.; Ma, Y.; Zhang, Y.; Xu, Y. LYA-YOLO: A lightweight and accurate YOLO model in drone aerial image scenes. Expert Syst. Appl. 2026, 321, 132166.
  7. Sulake, N.R. YOLOv11 Demystified: A Practical Guide to High-Performance Object Detection. arXiv 2026.
  8. Yu, H.; Liu, J.; Lin, M. A Comprehensive Literature Review on YOLO-Based Small Object Detection: Methods, Challenges, and Future Trends. Comput. Mater. Contin. 2026, 87, 7.
  9. Li, Y.; Wang, T.; Li, T.; Yang, X. LWU-YOLO: A lightweight algorithm for small object detection in UAV applications. J. Vis. Commun. Image Represent. 2026, 117, 104791.
  10. Fan, R.; Jiao, R.; Nan, W.; Meng, H.; Jiang, A.; Yang, X.; Zhao, Z.; Dang, J.; Wang, Z.; Tian, Y.; et al. CS-YOLO: A small object detection model based on YOLO for UAV aerial photography. Signal Process. Image Commun. 2026, 142, 117460.
  11. Lu, Q.; Xie, Y.; Zhang, J.; Guo, Y.; Wei, Y.; Jiang, J.; Luan, X. CLIP-Driven with Dynamic Feature Selection and Alignment Network for Referring Remote Sensing Image Segmentation. Remote Sens. 2025, 17, 3675.
  12. Chen, Z.; Deng, Y.; Li, Y.; Gu, Q. Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP. arXiv 2023.
  13. Niu, X.; Zhao, M.; Jiang, D.; Wu, Y.; Su, B. ReAttnCLIP: Training-free open-vocabulary remote sensing image segmentation via re-defined attention in CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Denver, CO, USA, 3–7 June 2026; IEEE: New York, NY, USA, 2026.
  14. Sapkota, R.; Karkee, M. Object detection with multimodal large vision-language models: An in-depth review. Inf. Fusion 2026, 126, 103575.
  15. Zhao, Y.; Gong, F.; Du, C.; Ji, X.; Li, D.; Yan, X.; Xu, J. Run as one: CLIP-based semantic fusion hashing for multi-modal retrieval. Displays 2026, 93, 103383.
  16. Zohra, F.; Zhao, C.; Liu, S.; Ghanem, B. Effectiveness of Max-Pooling for Fine-Tuning CLIP on Videos. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 11–12 June 2025; IEEE: New York, NY, USA, 2025; pp. 3282–3291.
  17. Han, Z.; Luo, G.; Sun, H.; Li, Y.; Han, B.; Gong, M.; Zhang, K.; Liu, T. Alignclip: Navigating the misalignments for robust vision-language generalization. Mach. Learn. 2025, 114, 58.
  18. Guo, Y.C.; Xu, T.Y.; Liu, Y.; Zhang, L. DSM-CLIP: A hard negative sampling framework for generalizable person re-identification. Neurocomputing 2025, 653, 130920.
  19. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-Time Open-Vocabulary Object Detection. arXiv 2024.
  20. Wang, A.; Liu, L.; Chen, H.; Lin, Z.; Han, J.; Ding, G. YOLOE: Real-Time Seeing Anything. arXiv 2025.
  21. Wang, X.; Ren, W.; Chen, X.; Fan, H.; Tang, Y.; Han, Z. Uni-YOLO: Vision-Language Model-Guided YOLO for Robust and Fast Universal Detection in the Open World. In Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24); ACM: New York, NY, USA, 2024.
  22. Li, J.; Sun, S.; Zhang, K.; Zhang, J.; Zhuo, L. Single-stage zero-shot object detection network based on CLIP and pseudo-labeling. Int. J. Mach. Learn. Cybern. 2025, 16, 1055–1070.
  23. Wang, H.; He, Q.; Peng, J.; Yang, H.; Chi, M.; Wang, Y. Mamba-YOLO-World: Marrying YOLO-World with Mamba for Open-Vocabulary Detection. arXiv 2024.
  24. Saeed, F.; Aldera, S.; Al-Shamma’a, A.A.; Farh, H.M.H. Rapid Adaptation in Photovoltaic Defect Detection: Integrating CLIP with YOLOv8n for Efficient Learning. Energy Rep. 2024, 12, 5383–5395.
  25. Mou, C.; Liu, T.; Zhu, C.; Cui, X. WAID: A Large-Scale Dataset for Wildlife Detection with Drones. Appl. Sci. 2023, 13, 10397.
Figure 2. Schematic diagram of key architectural improvements in YOLOv11: C3k2 and the C2PSA attention module.
Figure 3. Generating language-aligned proposal embedding structure.
Figure 4. Interaction Fusion Module (IFM) architecture.
Figure 5. UAV-assisted data collection platform used for wildlife monitoring.
Figure 6. Graph of hyperparameter adjustments for the branch positioning only.
Figure 7. Visual model comparison chart.
Figure 8. Migration scene comparison chart.
Figure 9. Grad-CAM visualization of YOLIP attention.
Figure 10. The geometric pathway effect figure.
Table 1. Comparison of execution modes in the YOLIP framework.

| Mode | Parallelism | Latency Characteristic | Throughput | Suitable Scenarios |
| --- | --- | --- | --- | --- |
| Serial Mode | Fully sequential | Accumulated across all stages | Limited | Real-time, low-latency systems |
| Asynchronous Mode | Pipeline parallel | Dominated by slowest stage | High | Video streaming, offline analysis |

Table 2. Category distribution of the self-built UAV wildlife dataset.

| Category | Training Images | Validation Images | Instances |
| --- | --- | --- | --- |
| Elephant | 1080 | 230 | 3125 |
| Zebra | 980 | 210 | 4287 |
| Antelope | 920 | 195 | 3654 |
| Wildebeest | 760 | 170 | 2813 |
| Buffalo | 640 | 145 | 2146 |
| Giraffe | 620 | 140 | 1879 |
| Background-only | 500 | 110 | |
| Total | 5500 | 1200 | 17,904 |

Table 3. Experimental environment and training configuration.

| Category | Item | Specification |
| --- | --- | --- |
| Hardware | CPU | Intel Core i9-12900KF |
| Hardware | GPU | NVIDIA Tesla T10 (16 GB VRAM) |
| Hardware | RAM | 32 GB DDR5 |
| Software | Operating System | Ubuntu 22.04 LTS |
| Software | Deep Learning Framework | PyTorch 2.5.0 |

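Table 4. Detector-only hyperparameter tuning for the YOLOv11 localization branch.

| Exp. | LR | Input Size | Batch | IoU | Box | Cls | DFL | mAP@0.5 | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.0003 | 960 × 960 | 8 | 0.50 | 7.5 | 0.5 | 1.5 | 0.88 | 64 |
| 2 | 0.001 | 960 × 960 | 8 | 0.50 | 15 | 0.5 | 1.5 | 0.96 | 68 |
| 3 | 0.003 | 960 × 960 | 8 | 0.50 | 7.5 | 0.5 | 1.5 | 0.93 | 61 |
| 4 | 0.001 | 640 × 640 | 8 | 0.50 | 7.5 | 0.5 | 1.5 | 0.91 | 84 |
| 5 | 0.001 | 1280 × 1280 | 8 | 0.50 | 7.5 | 0.5 | 1.5 | 0.97 | 39 |
| 6 | 0.001 | 960 × 960 | 4 | 0.50 | 7.5 | 0.5 | 1.5 | 0.94 | 65 |
| 7 | 0.001 | 960 × 960 | 16 | 0.50 | 7.5 | 0.5 | 1.5 | 0.92 | 66 |
| 8 | 0.001 | 960 × 960 | 8 | 0.50 | 7.5 | 0.5 | 1.5 | 0.97 | 67 |
| 9 | 0.001 | 960 × 960 | 8 | 0.50 | 7.5 | 1.0 | 1.5 | 0.96 | 67 |
| 10 | 0.001 | 960 × 960 | 8 | 0.55 | 15 | 0.5 | 1.5 | 0.95 | 67 |
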
Table 5. Comparative analysis of the performance of the YOLIP model and mainstream detectors.

| Model | mAP@0.5 | mAP@0.5:0.95 | FPS * | Params (M) | APsl |
| --- | --- | --- | --- | --- | --- |
| YOLOv8n | 70.0 | 47.0 | 18.2 | 7.6 | 0.31 |
| YOLOv11n | 75.0 | 49.4 | 30.3 | 8.8 | 0.34 |
| Faster R-CNN | 67.0 | 42.0 | 6.5 | 41.5 | 0.23 |
| DETR | 63.0 | 41.0 | 5 | 41.0 | 0.20 |
| YOLO26n | 78.3 | 50.1 | 42.0 | 8.4 | 0.34 |
| YOLIP | 86.6 | 60.7 | 57.8 | 153.8 | 0.35 |

* The FPS of YOLIP is measured under asynchronous scheduling mode rather than pure serial inference.

Table 6. Generalization performance of YOLIP on the WAID dataset.

| Model | mAP@0.5 | mAP@0.5:0.95 |
| --- | --- | --- |
| YOLOv11n | 61.4 | 42.8 |
| YOLO26n | 64.2 | 45.6 |
| YOLIP | 72.1 | 56.7 |

Table 7. Ablation study on the geometric and semantic pathways of IFM.

| Model | mAP@0.5 (%) | APsl | APot |
| --- | --- | --- | --- |
| YOLO-CLIP | 82.5 | 29.2 | 34.3 |
| +Geometric Pathway only | 83.2 | 30.3 | 36.8 |
| +Semantic Pathway only | 84.3 | 32.2 | 39.7 |
| +IFM (Geometric + Semantic) | 86.8 | 35.0 | 42.1 |

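Table 8. Ablation study on YH and IFM.

| Model | YH | IFM | FLOPs/G | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | | | 156.8 | 80.3 | 78.1 | 82.5 | 57.6 | 11.2 |
| 2 | √ * | | 160.3 | 83.4 | 80.7 | 85.9 | 59.5 | 19.5 |
| 3 | √ | √ | 163.2 | 84.1 | 81.5 | 86.8 | 60.1 | 18.8 |

* √ indicates that the module is used.
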
Table 9. Throughput analysis under frame-adaptive scheduling.

| Scheduling Mode | Effective Throughput (FPS) |
| --- | --- |
| Full serial pipeline | 18.8 |
| Asynchronous scheduling (k = 30) | 57.8 |

Table 10. A multi-dimensional comparison between YOLIP and YOLO-World.

| Dimensions | YOLO-World | YOLIP |
| --- | --- | --- |
| Visual Branch | Open-vocabulary detection via region-to-text alignment in a shared vision-language embedding space | Forward semantic recognition using CLIP visual encoder for fine-grained classification of candidate regions |
| Semantic Branch | Open-vocabulary via reverse matching | Fine-grained via forward recognition |
| Training Paradigm | End-to-end joint optimization with cross-modal alignment loss for zero-shot generalization | Two-stage training: supervised detection training followed by frozen CLIP-based semantic refinement |
| Design Philosophy | Semantics-driven detection: adapts detector features to align with semantic concepts | Detection-driven semantic refinement: prioritizes high-recall proposal generation and refines them via semantic alignment |

Table 11. YOLIP vs. YOLO-World comparison.

| Model | mAP@0.5 | mAP@0.5:0.95 | APsl | APot | FPS (Serial) * | FPS (Async) * | FLOPs (G) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLO-World | 82.1 | 48.9 | 0.47 | 0.64 | 41.5 | | 160.0 |
| YOLIP (ours) | 86.6 | 60.7 | 0.45 | 0.64 | 18.8 | 57.8 | 163.4 |

* FPS (serial): serial mode. FPS (async): asynchronous mode (k = 30).

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, R.; Chen, Y.; Xu, K.; Zhang, L.; Yue, C.; Pi, H.; Chen, X.; Lin, X. YOLIP: An Enhanced Framework for UAV-Assisted Wildlife Monitoring Based on YOLO Integrated with the CLIP Model. Sensors 2026, 26, 3436. https://doi.org/10.3390/s26113436
