Article

Semantic-Guided Mamba Fusion for Robust Object Detection of Tibetan Plateau Wildlife

1 School of Information Science and Technology, Tibet University, Lhasa 850000, China
2 School of Humanities, Tibet University, Lhasa 850000, China
3 College of Computer Science, Sichuan University, Chengdu 610065, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(22), 4549; https://doi.org/10.3390/electronics14224549
Submission received: 29 October 2025 / Revised: 15 November 2025 / Accepted: 19 November 2025 / Published: 20 November 2025

Abstract

Accurate detection of wildlife on the Tibetan Plateau is particularly challenging due to complex natural environments, significant scale variations, and the limited availability of annotated data. To address these issues, we propose a semantic-guided multimodal feature fusion framework that incorporates visual semantics, structural hierarchies, and contextual priors. Our model integrates CLIP and DINO tokenizers to extract both high-level semantic features and fine-grained structural representations, while a Spatial Pyramid Convolution (SPC) Adapter is employed to capture explicit multi-scale spatial cues. In addition, we introduce two state-space modules based on the Mamba architecture: the Focus Mamba Block (FMB), which strengthens the alignment between semantic and structural features, and the Bridge Mamba Block (BMB), which enables effective fusion across different scales. Furthermore, a text-guided semantic branch leverages knowledge from large language models to provide contextual information about species and environmental conditions, enhancing the consistency and robustness of detection. Experiments conducted on the Tibetan wildlife dataset demonstrate that our framework outperforms existing baseline methods, achieving 70.2% AP, 88.7% AP50, and 76.8% AP75. Notably, it achieves significant improvements in detecting small objects and fine-grained species. These results highlight the effectiveness of the proposed semantic-guided Mamba fusion approach in tackling the unique challenges of wildlife detection in the complex conditions of the Tibetan Plateau.

1. Introduction

Automated animal detection underpins ecological conservation, wildlife monitoring, and livestock management. With modern computer vision, detection now spans camera traps, UAV/drone surveys, satellite imagery, and videos, greatly reducing manual labeling effort and enabling large-scale, evidence-based decisions.
The development of object detection has been shaped by several influential architectures that progressively improved both accuracy and efficiency. R-CNN [1] established region-based classification with deep features. Faster R-CNN [2] introduced region proposal networks, unifying proposal generation and detection for substantial speedups. The YOLO series [3] reframed detection as single-shot regression to enable real-time performance. SSD [4] leveraged multi-scale anchors and feature pyramids to better capture small objects within a single-stage pipeline.
Several recent methods have advanced detection capabilities by introducing novel attention mechanisms and more efficient architectures. Deformable DETR [5] employed sparse, multi-scale deformable attention to improve convergence and small-object accuracy. Sparse R-CNN [6] learned a compact set of proposals, avoiding dense anchors while maintaining high quality. The Swin Transformer [7] offered a hierarchical backbone with shifted windows for scalable multi-scale representation. DINO [8] strengthened DETR training with contrastive denoising of anchor boxes for better convergence and accuracy. RT-DETR [9] optimized the DETR family for real-time inference.
In the wildlife detection domain, researchers have developed specialized models that address the unique challenges posed by natural environments and diverse species. Mulero-Pázmány et al. [10] proposed a two-stage framework for large-scale camera trap analysis. Their approach combines global and expert models to reduce class imbalance and background confusion. Kim et al. [11] introduced hierarchical transfer to improve fine-grained recognition among visually similar species. Li et al. [12] developed TMS-YOLO to balance accuracy and efficiency for rare and occluded wildlife. Ma et al. [13] designed WL-YOLO with attention mechanisms for challenging forest scenes. Xian et al. [14] further enhanced YOLOv8 by integrating multi-branch feature fusion and adaptive convolutional modules, demonstrating improved robustness on complex artistic imagery with fine-scale structures. Ke et al. [15] improved aerial wildlife detection with IECA-YOLOv7, a lightweight model that integrates enhanced attention and optimized loss functions for drone-based monitoring. Gong et al. [16] proposed GFI-YOLOv8 for sika deer posture recognition, illustrating the potential of fine-grained pose-aware detection for behavioral analysis. Chen et al. [17] advanced YOLO-SAG with efficient intra-scale feature interaction and lightweight convolutions. In a related study, Xian et al. [18] incorporated Gabor, wavelet, and color priors into a YOLO-based detector, highlighting the potential of explicit multi-modal feature cues for boosting small-object recognition. Guo et al. [19] integrated Swin Transformer with Mask R-CNN for robust detection in complex urban settings.
The integration of diverse data modalities beyond traditional RGB imagery has proven valuable for expanding the scope and reliability of ecological monitoring applications. Jenkins et al. [20] showed that temporal cues in time-lapse imagery improve seabird detection. Wu et al. [21] used satellite remote sensing for large-area monitoring. Delplanque et al. [22] leveraged aerial orthomosaics to quantify populations at scale. Brack et al. [23] demonstrated high-throughput counting from overhead imagery. Krishnan et al. [24] and Backman et al. [25] showed that combining thermal and visible UAV imagery enhances the detectability of cryptic species.
Various evaluation frameworks and learning strategies have been proposed to bridge the gap between research innovations and practical deployment requirements. Korkmaz et al. [26] benchmarked YOLOv8 [27], Yolo-NAS, and Fast-RNN for farm animal safety, quantifying trade-offs across modern detectors. Mou et al. [28] presented KI-CLIP, which couples foundation models with expert text for few-shot learning and unknown species discovery. Simoes et al. [29] built DeepWILD to automate detection, classification, and counting in camera trap videos. In addition, recent studies such as Wang [30] have shown that traditional augmentation techniques remain highly effective for improving robustness in UAV and transformer-based detection pipelines, further motivating the use of targeted data augmentation in our approach.
Selective state space models (SSMs) have recently emerged as an efficient alternative for long-range modeling, offering linear computational complexity while maintaining strong representational capacity. Mamba [31] achieves linear-time sequence processing with data-dependent parameters while preserving strong context propagation. In 3D detection, 3DET-Mamba [32] pioneered a fully Mamba-based point-cloud pipeline. UniMamba [33] combined multi-head Mamba with local 3D convolutions to scale to LiDAR scenes. For 2D and temporal settings, MambaDETR [34] and MambaBEV [35] modeled global context in end-to-end pipelines. Mamba YOLO [36] replaced classic backbones for real-time accuracy, and MambaVision [37] hybridized attention and SSMs. Samba [38] demonstrated benefits for dense, multi-modal salient detection.
Despite these advances, Tibetan wildlife detection remains a challenging problem due to the inherent complexity of high-altitude ecosystems and the specific characteristics of the target species. The data exhibit a long-tail distribution with scarce categories, and targets are often small or distant. Frequent occlusion and background clutter hinder localization. Closely related species such as canids and ungulates present high visual similarity, leading to confusion when distinguishing cues are weak. We address these issues with a semantic-guided multimodal detector that integrates complementary visual representations and efficient feature propagation mechanisms. Our approach fuses CLIP [39] semantics, DINOv3 [40] hierarchical features, and explicit multi-scale convolution via a Spatial Pyramid Convolution (SPC) adapter. Two Mamba-based modules provide selective feature propagation. The Focus Mamba Block (FMB) propagates semantic guidance, while the Bridge Mamba Block (BMB) connects hierarchical and global features with multi-scale local detail. A Deformable DETR decoder performs final semantic–visual joint localization. This design enhances small-object recall, reduces fine-grained confusions, and remains computationally efficient.
Our main contributions are as follows: (i) We propose a semantic-guided multimodal fusion detector that unifies CLIP [39] and DINOv3 [40] features with an SPC adapter and Mamba-based FMB and BMB modules, decoded by a Deformable DETR head. (ii) We curate a Tibetan wildlife dataset with realistic long-tail distributions and challenging natural backgrounds. (iii) We conduct extensive experiments and ablations that establish state-of-the-art accuracy on our dataset, demonstrate solid generalization to COCO, and show clear gains on small and rare classes alongside favorable efficiency–accuracy trade-offs.

2. Method

To address the challenges of limited wildlife image data and the difficulty of group detection in Tibetan animal imagery, we propose a semantic-guided multimodal feature fusion framework for object detection. As illustrated in Figure 1, the overall architecture consists of three stages: Feature Extraction, Feature Enhancement, and Feature-Guided Detection.
Unlike existing DETR-based, YOLO-based, or Mamba-based detectors, our framework introduces a three-module enhancement design tailored to the characteristics of Plateau wildlife imagery. The SPC Adapter strengthens the model’s multi-scale perception, which is essential for detecting distant and small animals commonly appearing in long-range observations. The Focus Mamba Block (FMB) establishes semantic–structural alignment by injecting CLIP semantic priors into the hierarchical DINOv3 features, enabling more reliable discrimination of fine-grained species with subtle visual differences. The Bridge Mamba Block (BMB) selectively propagates information across scales and mitigates long-tail imbalance by exposing rare categories to richer hierarchical cues. Together, these modules form a unified and specialized representation space that addresses the key challenges of small-object detection, fine-grained recognition, and long-tail distributions in Tibetan wildlife datasets.
In the feature extraction stage, the SPC Adapter captures multi-scale structural details from the input image, enhancing the model’s ability to perceive objects of varying sizes. At the same time, the CLIP [39] Image Tokenizer and DINOv3 [40] Tokenizer extract semantically rich and structurally diverse visual embeddings, respectively. In parallel, the textual modality is encoded by the CLIP [39] Text Encoder, while a large language model (LLM) is employed to semantically enrich the scene descriptions by extracting structured knowledge such as species names, attributes, and living environments. Together, these processes yield knowledge-augmented multimodal representations, which serve as the foundation for subsequent feature fusion and detection.
In the feature enhancement stage, two modules based on the Mamba state space model (SSM) are introduced. The Focus Mamba Block facilitates inter-layer interactions between CLIP and DINOv3, enhancing the semantic coherence and detailed representation of visual features. The Bridge Mamba Block fuses DINOv3’s hierarchical features with multi-scale representations, compensating for DINOv3’s inherent limitation in scale modeling and improving detection robustness across object sizes. Through the collaboration of these two modules, the model acquires high-quality visual features that integrate both global semantic and local structural information.
In the feature-guided detection stage, a Deformable DETR decoder is employed to perform final predictions based on the enhanced and fused features. This stage jointly processes multimodal visual embeddings, cross-modal semantic features, and multi-granularity visual representations. Using Cross-Attention, the decoder enables semantically guided object queries, while Multi-Head Self-Attention and MSDeform Attention jointly model multi-scale spatial dependencies. Ultimately, the unified decoding framework achieves object classification and bounding-box regression through semantic–visual joint localization.

2.1. Feature Extraction

To address data scarcity, scale variation, and complex backgrounds in Tibetan wildlife imagery, we extract complementary representations along three parallel branches: CLIP for semantic visual encoding, DINOv3 for hierarchical structural features, and an SPC Adapter for explicit multi-scale details. We also incorporate the textual modality via the CLIP text encoder and a large language model to obtain knowledge-augmented text embeddings that later guide detection.
In the visual encoding pathway, the input image I is separately fed into the CLIP Image Tokenizer and the DINOv3 Tokenizer to obtain semantically rich and structurally detailed embeddings:
$$E_{\mathrm{clip}} = f_{\mathrm{CLIP}}(I), \qquad E_{\mathrm{dino}} = f_{\mathrm{DINOv3}}(I)$$
where $E_{\mathrm{clip}}$ captures high-level semantic consistency, while $E_{\mathrm{dino}}$ focuses on structural information and fine-grained visual details. To preserve the hierarchical representation within DINOv3, three feature layers are extracted from different Transformer blocks:
$$E_{\mathrm{dino}}^{(8)},\, E_{\mathrm{dino}}^{(16)},\, E_{\mathrm{dino}}^{(32)} = \mathrm{Blocks}_{\{5,8,11\}}\big(f_{\mathrm{DINOv3}}(I)\big)$$
corresponding to feature maps at downsampling ratios of 1/8, 1/16, and 1/32, respectively. These features are later fused with multi-scale convolutional representations to enhance scale robustness.
To explicitly capture fine-grained and multi-scale information, we introduce an SPC Adapter, as shown in Figure 2. Given an input feature $X \in \mathbb{R}^{H \times W \times C}$, we first apply local convolution and pooling to achieve spatial downsampling:
$$\tilde{X} = \mathrm{MaxPool}_{3\times3}\big(\mathrm{Conv}_{3\times3}(X)\big)$$
Then, a spatial pyramid convolution unit $S(\cdot)$ aggregates multi-receptive-field responses through multiple depthwise convolution (DWConv) branches:
$$F_k = \mathrm{DWConv}_k\big(\mathrm{Conv}_{1\times1}(\tilde{X})\big), \quad k \in \{3, 5, 7\}$$
The outputs from different receptive fields are fused and pooled to form a unified multi-scale representation:
$$F_{\mathrm{spc}} = \mathrm{Conv}_{3\times3}\Big(\mathrm{MaxPool}_{3\times3}\big(\textstyle\sum_{k} F_k\big)\Big)$$
To enrich scale perception, the SPC module is recursively applied three times, where each output serves as the input to the next stage:
$$Z^{(t)} = S\Big(\mathrm{MaxPool}_{3\times3}\big(\mathrm{Conv}_{3\times3}(Z^{(t-1)})\big)\Big), \qquad Z^{(0)} = I$$
Finally, we obtain three multi-granularity features representing shallow texture, mid-level structure, and high-level semantics:
$$F^{(8)} = Z^{(1)}, \qquad F^{(16)} = Z^{(2)}, \qquad F^{(32)} = Z^{(3)}$$
These scale-aware convolutional features effectively complement the semantic embeddings from CLIP and DINOv3, enabling the model to adapt to objects of different sizes.
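To make the SPC computation concrete, the following PyTorch sketch implements one possible version of the adapter under the formulas above. The channel width, the stem that brings the input to 1/4 resolution before the three recursive stages, and the use of summation to fuse the depthwise branches are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class SPCUnit(nn.Module):
    """One spatial pyramid convolution unit S(.) with depthwise branches k in {3, 5, 7}."""
    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)          # Conv_1x1
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)  # DWConv_k
            for k in (3, 5, 7)
        ])
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        x = self.reduce(x)
        multi = sum(branch(x) for branch in self.branches)   # aggregate the F_k responses
        return self.fuse(self.pool(multi))                   # F_spc

class SPCAdapter(nn.Module):
    """Recursive application of S(.) three times; outputs land at 1/8, 1/16, 1/32 resolution."""
    def __init__(self, in_channels: int = 3, channels: int = 256):
        super().__init__()
        # assumed stem: bring the image to 1/4 resolution before the three stages
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, channels, 3, stride=2, padding=1),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
        )
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),   # Conv_3x3
                nn.MaxPool2d(3, stride=2, padding=1),          # MaxPool_3x3 (stride 2)
                SPCUnit(channels),                             # S(.)
            )
            for _ in range(3)
        ])

    def forward(self, image):
        z = self.stem(image)
        outputs = []
        for stage in self.stages:
            z = stage(z)              # Z^(t) = S(MaxPool(Conv(Z^(t-1))))
            outputs.append(z)
        return outputs                # [F^(8), F^(16), F^(32)]
```

For a 640 × 640 input, the three outputs have spatial sizes 80, 40, and 20, matching the stated 1/8, 1/16, and 1/32 strides.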
For the textual modality, we use a knowledge-enhanced text encoding mechanism that combines visual language model (VLM) generation and human refinement. For each training image, an initial description $\hat{c} = g_{\mathrm{VLM}}(I)$ is produced and manually corrected to yield a refined caption $c$. We then extract key semantic attributes, including species name, visual appearance, and living environment, through an attribute extractor $h(\cdot)$:
$$(s, a, e) = h(c)$$
These elements are formatted into a template-based textual representation:
$$\tilde{c} = T(s, a, e)$$
which is encoded by the CLIP text encoder to obtain the final text embedding:
$$E_{\mathrm{text}} = f_{\mathrm{text}}(\tilde{c})$$
During training, this branch provides semantic priors that enhance the alignment between text and vision; during inference, it can be optionally omitted, allowing the model to rely solely on visual cues.
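As a concrete illustration of the template step $T(s, a, e)$ and the text encoding $f_{\mathrm{text}}$, the sketch below builds a caption from an assumed attribute triplet and encodes it with a Hugging Face CLIP text model. The checkpoint name, the template wording, and the example attributes are placeholders, not the exact choices used in this work.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

def build_template(species: str, appearance: str, environment: str) -> str:
    # T(s, a, e): a simple template; the exact phrasing used in the paper is not specified
    return f"a photo of a {species}, {appearance}, living in {environment}"

# assumed attribute triplet (s, a, e) extracted from a refined caption by h(c)
caption = build_template(
    "Tibetan antelope",
    "slender body with long thin horns and a pale brown coat",
    "open alpine grassland on the Tibetan Plateau",
)

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(caption, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    E_text = text_encoder(**tokens).pooler_output   # (1, hidden_dim) text embedding
```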
Through this stage, we obtain three complementary types of features, including semantic visual features $E_{\mathrm{clip}}$, structural and hierarchical features $E_{\mathrm{dino}}^{(\cdot)}$, and explicit multi-scale convolutional features $F^{(\cdot)}$, together with knowledge-augmented textual embeddings $E_{\mathrm{text}}$, forming a unified foundation for multimodal fusion and detection.

2.2. Feature Enhancement

Although the feature extraction stage yields complementary representations—semantic embeddings from CLIP, hierarchical structural features from DINOv3, and multi-scale convolutional responses from the SPC Adapter—these features remain independently formed, showing weak cross-modal and cross-scale correlations. Specifically, CLIP excels at capturing global semantic concepts but lacks detailed spatial awareness, while DINOv3 focuses on visual structures without semantic alignment to textual priors. The SPC features provide localized detail but are isolated from higher-level contextual understanding. Such heterogeneity limits the model’s ability to construct a unified, discriminative feature space for downstream detection.
To address these issues, we introduce a unified Feature Enhancement framework that establishes semantic–structural alignment and cross-scale interaction among the extracted features. This framework comprises two complementary modules: the Focus Mamba Block (FMB) and the Bridge Mamba Block (BMB). The FMB selectively propagates semantic guidance from CLIP into the DINOv3 hierarchy to enhance semantic–structural consistency, while the BMB bridges DINOv3’s global transformer semantics with SPC’s local multi-scale details through selective state-space fusion. Together, these modules refine multi-level features and form a unified representation space optimized for object detection in complex and domain-specific scenarios such as Tibetan wildlife imagery.
In multimodal representation learning, features from CLIP and DINOv3 are naturally complementary: CLIP provides high-level semantic alignment between vision and language, while DINOv3 captures detailed structural and spatial cues. However, their cross-layer correspondence is typically weak, limiting effective semantic–structural synergy. To address this, we propose the FMB, which selectively aggregates cross-layer dependencies guided by CLIP semantics. The core idea is to refine the DINOv3 feature hierarchy by injecting CLIP-informed contextual cues, thereby enhancing the semantic–structural consistency of multimodal representations.
As illustrated in Figure 3, the FMB anchors one feature sequence (e.g., CLIP embeddings) and inserts the complementary representation (e.g., DINOv3 embeddings) between two duplicated anchor segments. This composite sequence is processed through an SSM to capture long-range dependencies and selective contextual interactions. After processing, the middle inserted part is discarded, and the two outer segments are aggregated and projected via an MLP to form the enhanced representation. For clarity, we describe here the procedure that updates CLIP features using DINOv3 guidance as a representative example; the reverse process follows the same principle.
Given the semantic visual embedding $E_{\mathrm{clip}}^{(i)} \in \mathbb{R}^{N_c \times C}$ and the structural embedding $E_{\mathrm{dino}}^{(i)} \in \mathbb{R}^{N_d \times C}$ from the feature extraction stage, we first perform a self-attention refinement on the CLIP feature to strengthen its intra-layer semantic dependencies:
$$\hat{E}_{\mathrm{clip}}^{(i)} = \mathrm{SelfAttn}\big(E_{\mathrm{clip}}^{(i)}\big).$$
Next, we construct an extended sequence by duplicating the refined CLIP features and inserting the DINOv3 features in between. This composite sequence enables the model to establish semantic–structural interactions across heterogeneous representations:
$$T^{(i)} = \big[\hat{E}_{\mathrm{clip}}^{(i)};\; E_{\mathrm{dino}}^{(i)};\; \hat{E}_{\mathrm{clip}}^{(i)}\big].$$
The concatenated sequence $T^{(i)}$ is then processed by the state space model $\mathrm{Mamba}(\cdot)$, which captures both global context and selective long-range dependencies through recurrent-like dynamic filtering:
$$Z^{(i)} = \mathrm{Mamba}\big(T^{(i)}\big).$$
After passing through the Mamba layer, the resulting sequence is divided into three segments corresponding to the two duplicated anchors and the inserted complementary section:
$$Z_L^{(i)},\, Z_M^{(i)},\, Z_R^{(i)} = \mathrm{Split}\big(Z^{(i)};\, N_c, N_d, N_c\big).$$
We then discard the middle section $Z_M^{(i)}$, which primarily serves as a contextual bridge during propagation, and aggregate the two outer segments through element-wise addition to obtain the fused output:
$$\tilde{E}_{\mathrm{clip}}^{(i)} = Z_L^{(i)} \oplus Z_R^{(i)}.$$
Finally, the fused feature $\tilde{E}_{\mathrm{clip}}^{(i)}$ is projected through an MLP to enhance channel-wise representation and restore dimensional consistency:
$$E_{\mathrm{fmb}}^{(i)} = \mathrm{MLP}\big(\tilde{E}_{\mathrm{clip}}^{(i)}\big).$$
Through this selective aggregation mechanism, the Focus Mamba Block enables semantic guidance from CLIP to propagate across DINOv3’s structural hierarchy, yielding enhanced multimodal features that preserve both global semantics and fine-grained spatial coherence.
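The anchor–insert–anchor procedure above can be written compactly as follows. This is a sketch rather than the authors' implementation: `mamba_layer` stands in for any selective state-space sequence module (for example, a Mamba block from the mamba_ssm package), and the attention and MLP widths are illustrative choices.

```python
import torch
import torch.nn as nn

class FocusMambaBlock(nn.Module):
    """Sketch of the FMB update of CLIP tokens under DINOv3 guidance."""
    def __init__(self, dim: int, mamba_layer: nn.Module, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mamba = mamba_layer   # assumed: maps (B, L, C) -> (B, L, C)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, e_clip, e_dino):
        # e_clip: (B, Nc, C) semantic tokens; e_dino: (B, Nd, C) structural tokens
        nc = e_clip.shape[1]
        e_hat, _ = self.self_attn(e_clip, e_clip, e_clip)   # intra-layer refinement
        seq = torch.cat([e_hat, e_dino, e_hat], dim=1)      # anchor-insert-anchor sequence
        z = self.mamba(seq)                                 # selective SSM over the sequence
        z_left, z_right = z[:, :nc], z[:, -nc:]             # keep the two outer segments
        return self.mlp(z_left + z_right)                   # discard the middle, fuse, project
```

The symmetric update of DINOv3 features under CLIP anchoring follows the same pattern with the roles of the two token sets exchanged.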
Although DINOv3 provides a strong hierarchical representation for visual structure, its Transformer backbone lacks explicit multi-scale modeling, which limits its ability to capture small objects and fine-grained local details. In contrast, the multi-scale convolutional features from the SPC Adapter contain rich spatial information but lack sufficient global contextual reasoning. To bridge this gap, we propose a BMB, designed to selectively connect hierarchical DINOv3 features with multi-granularity convolutional representations. As shown in Figure 4, this block establishes bidirectional interaction between global transformer semantics and local convolutional cues through a selective state-space modeling mechanism.
The BMB operates on two input feature streams: a DINOv3 feature x and a corresponding SPC-derived feature y from the same scale level. It consists of three stages: intra-modal enhancement, cross-feature alignment, and selective state-space fusion.
We first perform intra-modal enhancement on x to enrich its contextual representation. The feature is normalized and projected through linear and convolutional transformations, followed by nonlinear activation and a selective state-space model (SSM). Formally, this process is expressed as:
$$h_1 = \mathrm{LN}(x),$$
where $\mathrm{LN}(\cdot)$ denotes Layer Normalization used to stabilize the feature distribution and normalize across channels.
The normalized feature is then passed through a linear layer and convolutional operator to extract complementary channel–spatial responses. These responses are activated by the SiLU function and further processed by a selective state-space model:
$$o_1 = \mathrm{LN}\Big(\mathrm{SSM}\big(\mathrm{SiLU}(\mathrm{Conv}(\mathrm{Linear}(h_1)))\big)\Big),$$
where $\mathrm{SSM}(\cdot)$ represents the selective state-space mechanism that captures long-range dependencies along the sequence dimension. Given an input sequence $X \in \mathbb{R}^{L \times d}$, a general SSM is formulated as:
$$Y(t) = C\,x(t), \qquad \dot{x}(t) = A\,x(t) + B\,u(t),$$
where A, B, and C are state matrices that govern the propagation and selection of contextual information, enabling efficient long-sequence modeling while preserving spatial coherence.
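For intuition, the continuous formulation above is applied in practice through a discretized recurrent scan. The sketch below shows a single-channel, zero-order-hold discretization purely for illustration; Mamba additionally makes $B$, $C$, and the step size input-dependent and runs the scan in parallel, which is omitted here.

```python
import torch

def linear_ssm_scan(u, A, B, C, dt: float = 1.0):
    """Discretized linear state-space scan for one channel.
    u: (L,) input sequence; A: (N, N); B: (N,); C: (N,)."""
    N = A.shape[0]
    A_bar = torch.matrix_exp(A * dt)                 # zero-order-hold discretization of A
    B_bar = torch.linalg.solve(A, (A_bar - torch.eye(N)) @ B.unsqueeze(-1)).squeeze(-1)
    x = torch.zeros(N)
    outputs = []
    for u_t in u:                                    # linear-time recurrent scan
        x = A_bar @ x + B_bar * u_t                  # state update
        outputs.append(C @ x)                        # readout y_t = C x_t
    return torch.stack(outputs)                      # (L,) output sequence

# usage sketch: a random stable system processing a length-64 sequence
A = -torch.eye(4) + 0.1 * torch.randn(4, 4)
y = linear_ssm_scan(torch.randn(64), A, torch.randn(4), torch.randn(4))
```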
In parallel, a lightweight linear projection branch complements the SSM-based representation, forming a gated dual-branch pathway:
$$o_2 = \mathrm{SiLU}\big(\mathrm{Linear}(h_1)\big).$$
The outputs of the two branches are combined through an element-wise gating operation to capture nonlinear feature dependencies, followed by a linear projection and residual aggregation:
$$\mathrm{output}_1 = \mathrm{Linear}(o_1 \odot o_2) + x.$$
This operation enhances DINOv3’s hierarchical semantics by injecting adaptive contextual modulation, producing the transformer-enhanced output $\mathrm{output}_1$.
For the convolutional feature stream y, we perform a symmetric enhancement to adapt it for fusion. After normalization, three parallel transformation paths are applied: linear, convolutional, and linear–SiLU, which jointly encode channel, spatial, and nonlinear dependencies. The operations are defined as follows:
$$h_2 = \mathrm{LN}(y),$$
$$f_1 = \mathrm{Linear}(h_2), \qquad f_2 = \mathrm{Conv}(h_2), \qquad f_3 = \mathrm{SiLU}\big(\mathrm{Linear}(h_2)\big),$$
and the combined output is computed as follows:
$$\mathrm{output}_2 = (f_1 + f_2) + f_3 + y.$$
This formulation adaptively captures multi-scale detail information and maintains the locality of the SPC features while aligning their feature space with the DINOv3 representation.
After obtaining the two enhanced streams $\mathrm{output}_1$ and $\mathrm{output}_2$, we perform cross-feature alignment using DWConv. Each feature passes through DWConv for receptive field harmonization, and an interaction matrix $M$ is formed by element-wise multiplication:
$$g_1 = \mathrm{DWConv}(\mathrm{output}_1), \qquad g_2 = \mathrm{DWConv}(\mathrm{output}_2), \qquad M = g_1 \odot g_2.$$
The fused representation aggregates both self and mutual information as follows:
$$f_{\mathrm{fused}} = M + g_1 + g_2.$$
To capture long-range correlations between these cross-modality features, a selective state-space model is again applied:
$$s = \mathrm{SSM}(f_{\mathrm{fused}}).$$
Finally, s is used as a dynamic gating mask to selectively integrate both streams, producing the final bridged representation:
$$f_{\mathrm{final}} = (s \odot \mathrm{output}_1) + (s \odot \mathrm{output}_2).$$
Through this process, the Bridge Mamba Block unifies DINOv3’s global structural hierarchy with the SPC Adapter’s local spatial cues. The resulting feature $f_{\mathrm{final}}$ serves as a scale-aware, semantically rich representation that preserves both global coherence and fine-grained discriminability, thereby improving detection robustness under complex scenes and varying object scales.
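The cross-feature alignment and gated fusion stage of the BMB can be sketched as below. The token layout (B, L, C), the depthwise kernel size, and the generic `ssm` module are assumptions; the intra-modal enhancement of the two input streams is omitted for brevity.

```python
import torch
import torch.nn as nn

class BridgeFusion(nn.Module):
    """Sketch of the BMB cross-feature alignment and selective gated fusion."""
    def __init__(self, dim: int, ssm: nn.Module, kernel: int = 3):
        super().__init__()
        # depthwise 1D convs over the token axis for receptive-field harmonization
        self.dw1 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.dw2 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.ssm = ssm   # assumed selective state-space layer: (B, L, C) -> (B, L, C)

    def forward(self, output1, output2):
        # output1: enhanced DINOv3 stream; output2: enhanced SPC stream, both (B, L, C)
        g1 = self.dw1(output1.transpose(1, 2)).transpose(1, 2)   # DWConv(output_1)
        g2 = self.dw2(output2.transpose(1, 2)).transpose(1, 2)   # DWConv(output_2)
        m = g1 * g2                                              # interaction matrix M
        fused = m + g1 + g2                                      # self + mutual information
        s = self.ssm(fused)                                      # dynamic gating mask
        return s * output1 + s * output2                         # f_final
```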

2.3. Feature-Guided Detection

To transform the enhanced multimodal representations into actionable detections, we adopt a feature-guided detection head inspired by Deformable DETR, while introducing multi-source feature integration to achieve semantic–structural consistency. As illustrated in Figure 1, the process begins by constructing three types of input features: the fused visual feature $x_1$, the text-guided enhanced feature $x_2$, and the multi-granularity feature $x_3$ obtained from the Feature Enhancement stage.
Specifically, $x_1$ is produced by concatenating the CLIP and DINOv3 feature outputs, followed by an MLP projection and layer normalization:
$$x_1 = \mathrm{LN}\big(\mathrm{MLP}([F_{\mathrm{clip}};\, F_{\mathrm{dino}}])\big).$$
This step fuses global semantics from CLIP and hierarchical structures from DINOv3 into a unified visual embedding.
Next, we obtain the text–visual interactive representation $x_2$ by combining the text features $T$ with the enhanced visual feature $F_{\mathrm{enh}}$, where $F_{\mathrm{enh}}$ is the normalized sum of the CLIP feature and the last-layer DINOv3 feature:
$$F_{\mathrm{enh}} = \mathrm{LN}(F_{\mathrm{clip}} + F_{\mathrm{dino}}).$$
The text and enhanced visual features are then processed by a cross-attention mechanism, followed by MLP and layer normalization, to yield the cross-modal representation:
$$x_2 = \mathrm{LN}\big(\mathrm{MLP}(\mathrm{CrossAttn}(T, F_{\mathrm{enh}}))\big).$$
Meanwhile, $x_3$ represents the multi-granularity visual feature obtained from the Feature Enhancement stage, which integrates outputs from different scales and Mamba blocks. To maintain consistency across scales, each feature map is first flattened and linearly projected into a shared embedding space before concatenation:
$$x_3 = \big[\mathrm{Linear}(\mathrm{Flatten}(F_8));\; \mathrm{Linear}(\mathrm{Flatten}(F_{16}));\; \mathrm{Linear}(\mathrm{Flatten}(F_{32}))\big],$$
where each $F_s$ denotes the feature from scale $s$.
We then establish three interaction pathways among $x_1$, $x_2$, and $x_3$. The first path fuses $x_3$ with $x_2$, the second combines $x_3$ with $x_1$, and the third retains $x_3$ itself for structural consistency. These three streams are input together into the multi-head self-attention module to capture both intra-scale and cross-scale dependencies:
$$x_4 = \mathrm{MHSA}\big([x_3 + x_2;\; x_3 + x_1;\; x_3]\big).$$
The output $x_4$ and the baseline feature $x_3$ are then aggregated through an Add&Norm operation to form $x_5$:
$$x_5 = \mathrm{AddNorm}(x_4, x_3).$$
To enhance spatial precision and contextual adaptation, we further introduce deformable attention over multi-scale visual features. The MSDeform Attention module takes $x_5$ as the query and the fused features $x_5 + x_1$, $x_1$, and $x_2$ as its key–value inputs:
$$z = \mathrm{MSDeformAttn}\big(x_5 + x_1,\; x_1,\; x_2\big).$$
The deformable attention samples a sparse but informative set of key locations across scales to adaptively focus on semantically relevant regions:
$$z_q = \sum_{m=1}^{M} W_m \sum_{k=1}^{K} A_{mqk}\, X\big(p_q + \Delta p_{mqk}\big),$$
where $A_{mqk}$ are attention weights, $\Delta p_{mqk}$ are learned offsets, and $X(\cdot)$ denotes the feature sampling function.
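For intuition, the sampling formula can be reduced to a single feature level and a single head, with bilinear sampling at the offset locations handled by `grid_sample`. This simplified sketch omits the per-head projections $W_m$ and the multi-level aggregation of the full MSDeform Attention module.

```python
import torch
import torch.nn.functional as F

def deform_attn_single_level(value, ref_points, offsets, attn_weights):
    """Simplified single-level, single-head deformable attention.
    value:        (B, C, H, W) feature map
    ref_points:   (B, Q, 2) reference points p_q, normalized (x, y) in [0, 1]
    offsets:      (B, Q, K, 2) learned offsets delta_p, in normalized units
    attn_weights: (B, Q, K) weights A, softmax-normalized over the K samples"""
    # sampling locations in [0, 1], mapped to grid_sample's [-1, 1] convention
    loc = (ref_points[:, :, None, :] + offsets).clamp(0, 1) * 2 - 1                # (B, Q, K, 2)
    sampled = F.grid_sample(value, loc, mode="bilinear", align_corners=False)      # (B, C, Q, K)
    out = (sampled * attn_weights[:, None, :, :]).sum(dim=-1)                      # weighted sum over K
    return out.transpose(1, 2)                                                     # (B, Q, C)
```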
The resulting feature $z$ and the previous output $x_5$ are combined and normalized:
$$x_6 = \mathrm{AddNorm}(z, x_5).$$
This representation is subsequently refined through a feed-forward network (FFN) and residual normalization:
$$x_7 = \mathrm{AddNorm}\big(x_6, \mathrm{FFN}(x_6)\big).$$
The above sequence, from multi-head self-attention to FFN, is repeated N times to progressively refine the multi-scale contextual representation.
Finally, two independent FFN heads predict the object class and bounding box for each query embedding:
$$\hat{y} = \mathrm{FFN}_{\mathrm{cls}}(x_7), \qquad \hat{b} = \sigma\big(\mathrm{FFN}_{\mathrm{box}}(x_7)\big).$$
For optimization, we apply the Hungarian matching strategy between predictions and ground-truth objects. The matching cost for the i-th ground-truth and j-th prediction is computed as follows:
$$\mathcal{C}_{i,j} = \lambda_{\mathrm{cls}}\,\mathrm{CE}(y_i, \hat{y}_j) + \lambda_{L_1}\,\lVert b_i - \hat{b}_j \rVert_1 + \lambda_{\mathrm{giou}}\,\mathcal{L}_{\mathrm{GIoU}}(b_i, \hat{b}_j).$$
The total detection loss combines classification, regression, and IoU terms:
$$\mathcal{L}_{\mathrm{det}} = \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{CE}} + \lambda_{L_1}\,\mathcal{L}_{1} + \lambda_{\mathrm{giou}}\,\mathcal{L}_{\mathrm{GIoU}}.$$
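A minimal sketch of the Hungarian matching step for a single image is given below. It follows the common DETR practice of using the negative class probability in place of the full cross-entropy term inside the cost, and the loss weights shown are illustrative defaults rather than the values used in this paper.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    w_cls=2.0, w_l1=5.0, w_giou=2.0):
    """pred_logits: (Q, K) class logits; pred_boxes / gt_boxes: (Q, 4) / (G, 4)
    in normalized (x1, y1, x2, y2) format; gt_labels: (G,) class indices."""
    prob = pred_logits.softmax(-1)
    cost_cls = -prob[:, gt_labels]                          # (Q, G) classification cost
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)        # (Q, G) L1 box cost
    cost_giou = -generalized_box_iou(pred_boxes, gt_boxes)  # (Q, G) GIoU cost
    cost = w_cls * cost_cls + w_l1 * cost_l1 + w_giou * cost_giou
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(q_idx), torch.as_tensor(g_idx)   # matched query / ground-truth indices
```

The matched pairs are then fed to the classification, L1, and GIoU loss terms with the corresponding weights to form the total detection loss.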
Through this process, the feature-guided detection head unifies visual–textual alignment and multi-scale deformable reasoning. The integration of $x_1$, $x_2$, and $x_3$ allows the network to capture both semantic priors and fine-grained spatial cues, enabling robust detection across diverse Tibetan wildlife scenes with varying object sizes and complex backgrounds.

3. Experiments

3.1. Dataset

In this study, we constructed a multi-source object detection dataset focused on thirteen representative animal species of the Tibetan Plateau. The dataset covers the following classes: Argali sheep, Equus kiang, Gyps himalayensis, Himalayan rabbit, Lynx, Snow Leopard, Tibetan Crane, Tibetan eared-pheasant, Tibetan red deer, Tibetan antelope, Vulpes ferrilata, wolf, and yak. The images were collected from diverse sources, including field photography on the Tibetan Plateau, UAV/drone and wildlife monitoring, and publicly available datasets.
The number of images for each category is as follows: yak (827), Tibetan antelope (763), Tibetan red deer (352), Snow Leopard (449), Argali sheep (578), wolf (243), Lynx (219), Himalayan rabbit (231), Equus kiang (409), Gyps himalayensis (302), Vulpes ferrilata (278), Tibetan Crane (336), and Tibetan eared-pheasant (217). This distribution exhibits a pronounced long-tail characteristic, with categories such as yak and Tibetan antelope having abundant samples, while rare species like Lynx, Tibetan eared-pheasant, and wolf have relatively fewer images. All images have been meticulously annotated with bounding boxes at the individual level to ensure high-quality training and evaluation, as illustrated in Figure 5.
To address data imbalance, we employed class-balanced sampling during training to ensure that each batch contains a more uniform distribution of classes. For under-represented categories, we increased the frequency of data augmentation, including random cropping, rotation, color jittering, affine transformations, and other techniques. These augmentations are essential for Tibetan wildlife images, as they simulate long-distance observations, viewpoint and pose variations, and the strong illumination changes common at high altitude.
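A minimal version of the class-balanced sampling described above can be built with PyTorch's WeightedRandomSampler, weighting each image by the inverse frequency of its primary class. The dataset object, per-image label list, and collate function are placeholders for our detection data pipeline.

```python
from collections import Counter

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, primary_labels, batch_size=8, collate_fn=None):
    """Class-balanced sampling: weight each image by the inverse frequency of its
    primary class so rare species (e.g. Lynx, Tibetan eared-pheasant) are drawn
    more often per epoch."""
    counts = Counter(primary_labels)
    weights = torch.tensor([1.0 / counts[c] for c in primary_labels], dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(primary_labels), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, collate_fn=collate_fn)
```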
The entire dataset was randomly divided into training, validation, and test sets at a ratio of 7:2:1, respectively. This ensures that each class is well represented and balanced across all subsets, facilitating robust model training, hyperparameter tuning, and fair evaluation.

3.2. Experimental Setup

All experiments were conducted using the PyTorch (version 1.12.1) deep learning framework. We evaluated our method against several detection models, including both YOLO-series detectors (YOLOv8-S [27], YOLO11-S [41]) and DETR-based transformer models (Deformable-DETR [5], RT-DETR [9], DEIMv2-S [42], D-FINE-S [43]). Model training was performed on a workstation equipped with two NVIDIA RTX 4090 GPUs.
For the backbone encoders (CLIP and DINOv3), we used pretrained weights and froze their parameters during training. Newly introduced modules, including the SPC Adapter, Focus Mamba Block, and Bridge Mamba Block, were initialized using Kaiming initialization [44]. The number of training epochs was set to 250, with a batch size of 8 (4 per GPU). The AdamW optimizer was employed with an initial learning rate of $1 \times 10^{-4}$, weight decay of 0.05, and a cosine annealing schedule for stable convergence.
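This optimization setup translates into a few lines of standard PyTorch; the attribute names `model.clip`, `model.dino`, and `model.new_modules()` are placeholders for illustration only.

```python
import torch
import torch.nn as nn

def configure_training(model, epochs=250):
    """Sketch of the training configuration described above."""
    # freeze the pretrained CLIP and DINOv3 backbones
    for p in list(model.clip.parameters()) + list(model.dino.parameters()):
        p.requires_grad = False
    # Kaiming initialization for the newly introduced SPC Adapter, FMB, and BMB
    for m in model.new_modules():
        if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.05)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```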
For fair comparison, all baseline detectors were trained end-to-end following the official training configurations of their respective repositories. To ensure consistent experimental conditions, we kept the data splits, input resolutions, and augmentation pipeline aligned with our settings whenever supported by the baseline framework. Other training hyperparameters—including the number of epochs, optimizer settings, and backbone update strategy—followed the official recommendations of each baseline; using their default configurations is generally considered the most reliable and reproducible practice. Notably, all baseline models were fine-tuned end-to-end, whereas our framework freezes the CLIP and DINOv3 backbones by design to stabilize multimodal alignment.
For our Tibetan wildlife dataset, input images were resized to 640 × 640 pixels, while for COCO val2017 [45], the default input resolutions of each baseline model were retained to ensure fair comparison. Comprehensive data augmentation was applied, including random horizontal flipping, multi-scale training, color jittering, mosaic augmentation, and affine transformations.
Model performance was evaluated using COCO-style metrics, including AP, AP50, AP75, APS, APM, and APL. Model complexity was measured by the number of parameters (Params) and computational cost (GFLOPs). All reported results correspond to the test set of the respective datasets unless otherwise specified.
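COCO-style metrics can be computed with the standard pycocotools evaluator; the sketch below shows the usual evaluation loop, with the annotation and result file paths as placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def evaluate_coco_style(gt_annotations_json, detections_json):
    """Standard COCO-style bbox evaluation; both file paths are placeholders.
    Prints and returns AP, AP50, AP75, APS, APM, APL (among the 12 COCO metrics)."""
    coco_gt = COCO(gt_annotations_json)
    coco_dt = coco_gt.loadRes(detections_json)
    evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()
    return evaluator.stats
```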

3.3. Comparative Experiments

We benchmark our approach against several state-of-the-art object detection models, including YOLO-series detectors (YOLOv8-S [27], YOLO11-S [41]) and DETR-based transformer models (Deformable-DETR [5], DEIMv2-S [42], RT-DETR [9], D-FINE-S [43]).
As summarized in Table 1, our method achieves the best overall performance across all metrics while maintaining reasonable computational cost (28 M parameters, 102 GFLOPs). In terms of efficiency, the model runs at approximately 68 FPS, which is slower than lightweight YOLO variants but remains competitive among DETR-style detectors, achieving higher throughput than Deformable-DETR while preserving substantially better accuracy. Specifically, our approach reaches 70.2% AP, outperforming the strongest baseline RT-DETR [9] (64.9% AP) by 5.3 percentage points. The improvements are particularly pronounced for small objects, which are common in distant or partially occluded wildlife scenes on the Plateau. We attribute these gains to the synergy between multi-scale deformable reasoning and text-guided semantics, which strengthens localization under cluttered backgrounds and mitigates inter-species visual confusion.
Per-class results in Table 2 reveal consistent improvements across all 13 species categories. Our method achieves substantial gains for rare and visually challenging species, including Lynx (+6.4% over RT-DETR [9]), Snow Leopard (+6.7%), Tibetan Eared Pheasant (+5.2%), Vulpes Ferrilata (+5.8%), and Wolf (+6.0%). These results demonstrate improved robustness under long-tail distributions and enhanced discrimination among visually similar species. Qualitative examples in Figure 6 further illustrate fewer missed detections and tighter bounding boxes in crowded, low-contrast, and cluttered backgrounds. The normalized confusion matrix in Figure 7 shows substantially reduced cross-class confusion, particularly among taxonomically related categories (e.g., canids: wolf and Vulpes ferrilata; ungulates: Tibetan antelope, Argali sheep, and Equus kiang), confirming that text-guided semantic features effectively enhance fine-grained species discrimination.
To evaluate the broader applicability of our framework, we conduct experiments on the COCO val2017 benchmark [45]. As shown in Table 3, our method achieves 55.8% AP, consistently outperforming all baseline models including RT-DETR [9] (53.1% AP). Notably, our approach demonstrates strong performance on small objects (APS: 38.2% vs. 34.8%), validating the effectiveness of our multi-scale feature fusion strategy across different datasets and domains. These results confirm that our semantic-guided Mamba fusion framework generalizes well beyond domain-specific wildlife scenarios to general object detection tasks.
We further examined the model’s robustness under different input resolutions. As presented in Table 4, reducing the image size to 512 × 512 results in a moderate performance drop, with the most noticeable decrease observed in AP S . This is expected, as lower-resolution inputs inevitably weaken fine-grained cues required for detecting distant or small wildlife targets. Nevertheless, the overall decline remains limited, indicating that the model still preserves competitive accuracy even under reduced image quality. In contrast, increasing the resolution to 800 × 800 brings a slight improvement, particularly for small objects, due to the availability of richer spatial details and sharper local structures. The gain in overall AP, however, is marginal, suggesting that the model already captures most useful information at the default resolution. Overall, these results demonstrate that the proposed approach maintains stable and reliable performance across a range of input scales.

3.4. Ablation Study

To validate the contribution of each component, we conducted comprehensive ablation studies on the Tibetan wildlife dataset. Table 5 presents the results of progressively adding the SPC Adapter, FMB, BMB, and Text-Guide module.
Starting from the baseline (56.2% AP), each module contributes positively when added individually: SPC (+3.6%), FMB (+4.0%), and BMB (+4.3%). The BMB shows slightly higher gains, suggesting that bridging hierarchical transformer features with explicit multi-scale representations is particularly effective for handling scale variation. Pairwise combinations reveal strong complementarity: SPC+FMB achieves 63.2% AP (+7.0% over baseline), SPC+BMB reaches 63.7% (+7.5%), and FMB+BMB attains 63.4% (+7.2%). When all three visual modules are combined, performance jumps to 67.1% AP (+10.9%), demonstrating effective integration of multi-scale spatial details, semantic guidance, and hierarchical feature fusion. Adding the Text-Guide to the full visual framework provides an additional 3.1% gain (67.1% → 70.2%), indicating that semantic priors from language models help disambiguate visually similar species and stabilize localization under background clutter. It is worth noting that the 67.1% AP corresponds to the visual-only setting used at inference, while the 70.2% result reflects the optional use of textual priors during training. This distinction highlights that our model maintains strong performance even without language inputs, while benefiting further from semantic cues—particularly for rare species in long-tailed distributions.
Figure 8 examines the impact of decoder depth. Performance improves steadily from 1 to 4 layers, with AP increasing from 56.2% to 68.9%. Beyond 4 layers, the gains diminish (70.2% at 6 layers), indicating that our feature fusion already provides strong context and that excessive depth yields diminishing returns. We adopt 6 decoder layers to balance performance and efficiency.
From a design perspective, SPC enriches multi-scale context for small targets, while Mamba-based blocks expand the effective receptive field with linear-time sequence modeling, improving both recall and box quality. The Text-Guide injects class-level semantics resilient to sample sparsity, particularly beneficial for rare species in our long-tailed dataset.

4. Discussion

Our proposed framework integrates visual semantics, structural hierarchies, and contextual priors through a combination of CLIP, DINOv3, SPC, and Mamba-based modules. The results demonstrate that each component contributes meaningfully to detection performance, especially in challenging scenarios involving small objects and fine-grained species. The SPC Adapter effectively enhances multi-scale perception, while the Focus Mamba Block improves the alignment between high-level semantics and structural features. The Bridge Mamba Block complements this by facilitating cross-scale information exchange, which proves beneficial in complex natural scenes with high visual variance. Furthermore, the text-guided semantic branch introduces external contextual knowledge via large language models, providing prior information that improves category-level consistency and reduces confusion in visually similar species. This is particularly helpful in the Tibetan wildlife dataset, where species often share close visual attributes.
However, several limitations remain, and addressing them provides a meaningful direction for future research. Although CLIP and DINOv3 offer strong semantic and structural priors, their computational cost may restrict deployment in real-time field monitoring or edge-computing scenarios. To mitigate this, future work can incorporate lightweight state-space backbones, model compression strategies such as pruning, quantization, and distillation, or hybrid designs that preserve semantic structural benefits while reducing complexity. In addition, the effectiveness of the text-guided semantic branch is influenced by the quality and specificity of the prompts generated by the language model. Ambiguous prompts may lead to semantic drift, especially for visually similar species. This issue could be alleviated through automatic prompt refinement, confidence-based prompt filtering, or domain-adapted language models that better align with ecological terminology.
Furthermore, although the proposed method performs well on the Tibetan Plateau dataset, its generalizability to unfamiliar ecological environments remains to be fully assessed. Domain adaptation techniques, cross-region fine-tuning, and self-supervised pretraining on broader wildlife corpora may help extend the framework to diverse habitats and imaging conditions. Strengthening these aspects will enhance the robustness, transferability, and practical relevance of the proposed framework in wider wildlife monitoring applications. In addition, LLM-generated textual prompts may introduce bias or unintended leakage, which could be addressed in future work through prompt refinement or domain-specific language adaptation. All wildlife imagery used in this study complies with open-source licensing or institutional permissions, and we plan to release the dataset through a request-based access procedure to ensure responsible use.

Author Contributions

Conceptualization, Y.X. and P.L.; methodology, Y.X., Q.Z. and T.S.; software, Y.X. and T.S.; validation, Y.L. and P.L.; formal analysis, Y.X. and Q.Z.; investigation, Y.X., Q.Z. and Y.L.; resources, P.L.; data curation, Y.L. and P.L.; writing—original draft preparation, Y.X.; writing—review and editing, P.L. and Q.Z.; visualization, Y.L.; supervision, P.L. and Q.Z.; project administration, P.L.; funding acquisition, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (Grant No. 2023YFC3321705).

Data Availability Statement

The datasets generated, used, and/or analyzed during the current study will be available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BMB: Bridge Mamba Block
DWConv: Depthwise Convolution
FFN: Feed-Forward Network
FMB: Focus Mamba Block
SPC: Spatial Pyramid Convolution
SSM: State Space Model
VLM: Visual Language Model

References

  1. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 580–587. [Google Scholar]
  2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  3. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27 June–2 July 2016; pp. 779–788. [Google Scholar]
  4. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  5. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  6. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14454–14463. [Google Scholar]
  7. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  8. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  9. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  10. Mulero-Pázmány, M.; Hurtado, S.; Barba-González, C.; Antequera-Gómez, M.L.; Díaz-Ruiz, F.; Real, R.; Navas-Delgado, I.; Aldana-Montes, J.F. Addressing significant challenges for animal detection in camera trap images: A novel deep learning-based approach. Sci. Rep. 2025, 15, 16191. [Google Scholar] [CrossRef]
  11. Kim, J.I.; Baek, J.W.; Kim, C.B. Hierarchical image classification using transfer learning to improve deep learning model performance for amazon parrots. Sci. Rep. 2025, 15, 3790. [Google Scholar] [CrossRef] [PubMed]
  12. Li, S.; Zhang, H.; Xu, F. Intelligent detection method for wildlife based on deep learning. Sensors 2023, 23, 9669. [Google Scholar] [CrossRef]
  13. Ma, Z.; Dong, Y.; Xia, Y.; Xu, D.; Xu, F.; Chen, F. Wildlife real-time detection in complex forest scenes based on YOLOv5s deep learning network. Remote Sens. 2024, 16, 1350. [Google Scholar] [CrossRef]
  14. Xian, Y.; Zhao, Q.; Yang, X.; Gao, D. Thangka image object detection method based on improved YOLOv8. In Proceedings of the 2024 13th International Conference on Computing and Pattern Recognition, Tianjin, China, 25–27 October 2024; pp. 214–219. [Google Scholar]
  15. Ke, W.; Liu, T.; Cui, X. IECA-YOLOv7: A Lightweight Model with Enhanced Attention and Loss for Aerial Wildlife Detection. Animals 2025, 15, 2743. [Google Scholar] [CrossRef]
  16. Gong, H.; Liu, J.; Li, Z.; Zhu, H.; Luo, L.; Li, H.; Hu, T.; Guo, Y.; Mu, Y. GFI-YOLOv8: Sika deer posture recognition target detection method based on YOLOv8. Animals 2024, 14, 2640. [Google Scholar] [CrossRef]
  17. Chen, L.; Li, G.; Zhang, S.; Mao, W.; Zhang, M. YOLO-SAG: An improved wildlife object detection algorithm based on YOLOv8n. Ecol. Inform. 2024, 83, 102791. [Google Scholar] [CrossRef]
  18. Xian, Y.; Lee, Y.; Shen, T.; Lan, P.; Zhao, Q.; Yan, L. Enhanced object detection in thangka images using Gabor, wavelet, and color feature fusion. Sensors 2025, 25, 3565. [Google Scholar] [CrossRef]
  19. Guo, Z.; He, Z.; Lyu, L.; Mao, A.; Huang, E.; Liu, K. Automatic detection of feral pigeons in urban environments using deep learning. Animals 2024, 14, 159. [Google Scholar] [CrossRef]
  20. Jenkins, M.; Franklin, K.A.; Nicoll, M.A.; Cole, N.C.; Ruhomaun, K.; Tatayah, V.; Mackiewicz, M. Improving object detection for time-lapse imagery using temporal features in wildlife monitoring. Sensors 2024, 24, 8002. [Google Scholar] [CrossRef] [PubMed]
  21. Wu, Z.; Zhang, C.; Gu, X.; Duporge, I.; Hughey, L.F.; Stabach, J.A.; Skidmore, A.K.; Hopcraft, J.G.C.; Lee, S.J.; Atkinson, P.M.; et al. Deep learning enables satellite-based monitoring of large populations of terrestrial mammals across heterogeneous landscape. Nat. Commun. 2023, 14, 3072. [Google Scholar] [CrossRef] [PubMed]
  22. Delplanque, A.; Théau, J.; Foucher, S.; Serati, G.; Durand, S.; Lejeune, P. Wildlife detection, counting and survey using satellite imagery: Are we there yet? GIScience Remote Sens. 2024, 61, 2348863. [Google Scholar] [CrossRef]
  23. Brack, I.V.; Ferrara, C.; Forero-Medina, G.; Domic-Rivadeneira, E.; Torrico, O.; Wanovich, K.T.; Wilkinson, B.; Valle, D. Counting animals in orthomosaics from aerial imagery: Challenges and future directions. Methods Ecol. Evol. 2025, 16, 1051–1060. [Google Scholar] [CrossRef]
  24. Krishnan, B.S.; Jones, L.R.; Elmore, J.A.; Samiappan, S.; Evans, K.O.; Pfeiffer, M.B.; Blackwell, B.F.; Iglay, R.B. Fusion of visible and thermal images improves automated detection and classification of animals for drone surveys. Sci. Rep. 2023, 13, 10385. [Google Scholar] [CrossRef]
  25. Backman, K.; Wood, J.; Brandimarti, M.; Beranek, C.T.; Roff, A. Human inspired deep learning to locate and classify terrestrial and arboreal animals in thermal drone surveys. Methods Ecol. Evol. 2025, 16, 1239–1254. [Google Scholar] [CrossRef]
  26. Korkmaz, A.; Agdas, M.T.; Kosunalp, S.; Iliev, T.; Stoyanov, I. Detection of Threats to Farm Animals Using Deep Learning Models: A Comparative Study. Appl. Sci. 2024, 14, 6098. [Google Scholar] [CrossRef]
  27. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8; Ultralytics: Frederick, MD, USA, 2023. [Google Scholar]
  28. Mou, C.; Liang, A.; Hu, C.; Meng, F.; Han, B.; Xu, F. Monitoring endangered and rare wildlife in the field: A foundation deep learning model integrating human knowledge for incremental recognition with few data and low cost. Animals 2023, 13, 3168. [Google Scholar] [CrossRef] [PubMed]
  29. Simões, F.; Bouveyron, C.; Precioso, F. DeepWILD: Wildlife Identification, Localisation and estimation on camera trap videos using Deep learning. Ecol. Inform. 2023, 75, 102095. [Google Scholar] [CrossRef]
  30. Wang, S. Effectiveness of traditional augmentation methods for rebar counting using UAV imagery with Faster R-CNN and YOLOv10-based transformer architectures. Sci. Rep. 2025, 15, 33702. [Google Scholar] [CrossRef] [PubMed]
  31. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  32. Li, M.; Yuan, J.; Chen, S.; Zhang, L.; Zhu, A.; Chen, X.; Chen, T. 3DET-mamba: State space model for end-to-end 3D object detection. In Proceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 9–15 December 2024; pp. 47242–47260. [Google Scholar]
  33. Jin, X.; Su, H.; Liu, K.; Ma, C.; Wu, W.; Hui, F.; Yan, J. UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 1407–1417. [Google Scholar]
  34. Ning, T.; Lu, K.; Jiang, X.; Xue, J. MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection. arXiv 2024, arXiv:2411.13628. [Google Scholar]
  35. You, Z.; Wang, N.; Wang, H.; Zhao, Q.; Wang, J. MambaBEV: An efficient 3D detection model with Mamba2. arXiv 2024, arXiv:2410.12673. [Google Scholar] [CrossRef]
  36. Wang, Z.; Li, C.; Xu, H.; Zhu, X.; Li, H. Mamba yolo: A simple baseline for object detection with state space model. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8205–8213. [Google Scholar]
  37. Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 25261–25270. [Google Scholar]
  38. He, J.; Fu, K.; Liu, X.; Zhao, Q. Samba: A Unified Mamba-based Framework for General Salient Object Detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 25314–25324. [Google Scholar]
  39. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  40. Siméoni, O.; Vo, H.V.; Seitzer, M.; Baldassarre, F.; Oquab, M.; Jose, C.; Khalidov, V.; Szafraniec, M.; Yi, S.; Ramamonjisoa, M.; et al. Dinov3. arXiv 2025, arXiv:2508.10104. [Google Scholar]
  41. Jocher, G.; Qiu, J. Ultralytics YOLO11; Ultralytics: Frederick, MD, USA, 2024. [Google Scholar]
  42. Huang, S.; Hou, Y.; Liu, L.; Yu, X.; Shen, X. Real-Time Object Detection Meets DINOv3. arXiv 2025, arXiv:2509.20787. [Google Scholar] [CrossRef]
  43. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression task in DETRs as fine-grained distribution refinement. arXiv 2024, arXiv:2410.13842. [Google Scholar]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  45. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Figure 1. Overall framework. The method extracts multi-scale and multimodal features, performs Focus Mamba and Bridge Mamba-based enhancement and fusion, and decodes with a Deformable DETR head to realize semantic visual joint localization. Here, the snowflake icon represents frozen parameters, and the fire icon indicates trainable components.
Figure 2. Spatial Pyramid Convolution adapter. Parallel convolutions at multiple receptive fields capture fine to coarse details which are aggregated by a point-wise projection.
Figure 3. Focus Mamba Block. A selective state space module strengthens each layer by aggregating neighborhood layers with CLIP guidance.
Figure 4. Bridge Mamba Block. A selective state space fusion bridges DINOv3 with multi-granularity features to form enhanced multi-scale representations.
Figure 5. Category distribution of the Tibetan wildlife dataset. Each color represents a distinct animal species, revealing a pronounced long-tail distribution across the 13 categories.
Figure 6. Qualitative detection results across diverse scenes in the Tibetan wildlife dataset, including crowded, low-contrast, and cluttered backgrounds. Our method produces tighter boxes and fewer misses, especially for small and partially occluded animals.
Figure 7. Normalized confusion matrix on the test set. Our approach reduces cross-class confusion among visually similar species, reflecting stronger semantic discrimination via text-guided features.
Figure 8. Ablation on the number of decoder layers. Increasing depth improves AP up to saturation, reflecting a balance between contextual refinement and computational cost.
Table 1. Performance comparison of various object detection models on the Tibetan wildlife dataset.
| Model | Params (M) | GFLOPs | FPS | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv8-S [27] | 11 | 29 | 247 | 55.0 | 72.9 | 59.8 | 37.7 | 60.0 | 70.1 |
| YOLO11-S [41] | 9 | 22 | 233 | 57.8 | 76.2 | 63.0 | 40.5 | 63.2 | 73.1 |
| D-FINE-S [43] | 10 | 25 | 113 | 60.9 | 79.8 | 66.2 | 44.1 | 66.4 | 76.7 |
| Deformable-DETR [5] | 40 | 173 | 32 | 59.1 | 78.3 | 64.0 | 42.5 | 64.2 | 75.4 |
| DEIMv2-S [42] | 10 | 26 | 77 | 63.4 | 82.2 | 69.3 | 46.2 | 69.1 | 79.5 |
| RT-DETR [9] | 42 | 136 | 121 | 64.9 | 84.0 | 71.2 | 48.4 | 70.8 | 81.1 |
| Ours | 28 | 102 | 68 | 70.2 | 88.7 | 76.8 | 53.1 | 76.4 | 86.5 |
Table 2. Per-class Average Precision on the Tibetan wildlife dataset. Abbreviations: TA—Tibetan Antelope; AS—Argali Sheep; TR—Tibetan Red Deer; L—Lynx; EK—Equus Kiang; GH—Gyps Himalayensis; HR—Himalayan Rabbit; Y—Yak; SL—Snow Leopard; TC—Tibetan Crane; TE—Tibetan Eared Pheasant; VF—Vulpes Ferrilata; W—Wolf.
| Model | TA | AS | TR | L | EK | GH | HR | Y | SL | TC | TE | VF | W |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv8-S [27] | 59.3 | 60.9 | 57.1 | 49.7 | 61.8 | 54.6 | 58.5 | 60.0 | 47.8 | 52.6 | 55.1 | 48.7 | 50.7 |
| YOLO11-S [41] | 61.9 | 63.6 | 60.0 | 52.3 | 64.4 | 57.1 | 61.2 | 62.7 | 50.9 | 55.3 | 57.7 | 50.9 | 52.6 |
| D-FINE-S [43] | 65.1 | 67.0 | 63.3 | 55.2 | 67.9 | 60.2 | 64.3 | 65.9 | 53.7 | 58.4 | 61.1 | 54.2 | 56.0 |
| Deformable-DETR [5] | 63.3 | 64.9 | 61.5 | 53.6 | 66.0 | 58.4 | 62.6 | 64.0 | 52.0 | 56.7 | 59.3 | 52.6 | 54.4 |
| DEIMv2-S [42] | 66.3 | 68.3 | 64.6 | 57.5 | 69.9 | 62.3 | 66.8 | 68.1 | 56.4 | 61.0 | 64.2 | 56.7 | 59.2 |
| RT-DETR [9] | 69.0 | 70.8 | 67.3 | 57.1 | 71.7 | 62.2 | 69.2 | 70.0 | 56.1 | 60.6 | 67.0 | 55.9 | 58.2 |
| Ours | 74.3 | 76.7 | 72.5 | 63.5 | 78.4 | 68.4 | 75.2 | 75.8 | 62.8 | 66.9 | 72.2 | 61.7 | 64.2 |
Table 3. Performance comparison of various object detection models on COCO val2017 [45].
| Model | Params (M) | GFLOPs | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| YOLOv8-S [27] | 11 | 29 | 44.9 | 61.8 | 48.6 | 25.7 | 49.9 | 61.0 |
| YOLO11-S [41] | 9 | 22 | 46.6 | 63.4 | 50.3 | 28.7 | 51.3 | 64.1 |
| D-FINE-S [43] | 10 | 25 | 48.5 | 65.6 | 52.6 | 29.1 | 52.2 | 65.7 |
| Deformable-DETR [5] | 40 | 173 | 46.9 | 65.6 | 51.0 | 29.6 | 50.1 | 61.6 |
| DEIMv2-S [42] | 10 | 26 | 50.9 | 68.3 | 55.1 | 31.4 | 55.3 | 70.3 |
| RT-DETR [9] | 42 | 136 | 53.1 | 71.3 | 57.7 | 34.8 | 58.0 | 70.0 |
| Ours | 28 | 102 | 55.8 | 74.0 | 61.0 | 38.2 | 60.5 | 73.5 |
Table 4. Performance of our model under different input resolutions on the Tibetan wildlife dataset.
| Resolution | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| 512 × 512 | 67.8 | 86.0 | 73.4 | 49.2 | 73.1 | 84.7 |
| 640 × 640 | 70.2 | 88.7 | 76.8 | 53.1 | 76.4 | 86.5 |
| 800 × 800 | 70.9 | 89.2 | 77.4 | 55.0 | 76.9 | 87.2 |
Table 5. Ablation study on SPC, FMB, BMB, and Text-Guide. (✓) indicates the component is enabled.
| SPC | FMB | BMB | Text-Guide | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
|  |  |  |  | 56.2 | 75.0 | 61.4 |
| ✓ |  |  |  | 59.8 | 79.1 | 64.5 |
|  | ✓ |  |  | 60.2 | 79.4 | 64.8 |
|  |  | ✓ |  | 60.5 | 79.8 | 65.1 |
| ✓ | ✓ |  |  | 63.2 | 82.5 | 68.0 |
| ✓ |  | ✓ |  | 63.7 | 83.0 | 68.4 |
|  | ✓ | ✓ |  | 63.4 | 82.8 | 68.1 |
| ✓ | ✓ | ✓ |  | 67.1 | 86.4 | 73.3 |
| ✓ | ✓ | ✓ | ✓ | 70.2 | 88.7 | 76.8 |