2.1. Data
Pet911 Dataset. We constructed a dataset through automated web scraping of the
pet911.ru platform (accessed on 1 November 2025), a Russian service for lost and found pet announcements. The parsing implementation employs BeautifulSoup for Hypertext Markup Language (HTML) processing and the requests library for HyperText Transfer Protocol (HTTP) communication with error handling. The system navigates catalog pages using pagination detection algorithms to identify available content. For each listing, we extracted animal metadata including species classification, descriptive text, and associated photographs. We filtered listings to retain only animals with at least two photographs per individual. Downloaded images underwent validation for format consistency, with automatic conversion of WebP formats to Joint Photographic Experts Group (JPEG) for standardization [
7]. The Pet911 dataset yielded 65,961 photographs representing 22,050 unique animals. Each animal record includes species classification for cats or dogs, textual description, and between 2 and 8 associated photographs. The dataset captures real-world variability in image quality, lighting conditions, and animal poses representative of lost pet scenarios [
3].
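The filtering and format-normalization steps described above can be sketched as follows; the record layout and function names are illustrative stand-ins, not the actual scraper code:

```python
# Keep only animals with at least two photographs, and map downloaded
# WebP file names to their converted JPEG names. The dict-based record
# layout here is hypothetical.

def filter_listings(listings, min_photos=2):
    """Return listings that have at least `min_photos` photographs."""
    return [l for l in listings if len(l.get("photos", [])) >= min_photos]

def normalize_photo_name(filename):
    """Map a downloaded WebP file name to its converted JPEG name."""
    if filename.lower().endswith(".webp"):
        return filename[: -len(".webp")] + ".jpg"
    return filename

listings = [
    {"id": "a1", "species": "cat", "photos": ["1.webp", "2.jpg"]},
    {"id": "a2", "species": "dog", "photos": ["3.jpg"]},
]
kept = filter_listings(listings)
names = [normalize_photo_name(p) for p in kept[0]["photos"]]
```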
Telegram Channel Dataset. The Telegram dataset construction utilized the Telethon library to access public animal-related channels through the Telegram Application Programming Interface (API). The system processes message streams from targeted public channels, using keyword matching to identify animal-related content [
1]. Media processing handles both individual photos and grouped albums. The system automatically detects grouped messages and downloads all associated images while maintaining proper file organization. The Telegram dataset contributed 131,698 photographs from 73,101 unique animals. This source provides complementary data characteristics, including casual photography styles, varied backgrounds, and diverse animal representations not captured in formal lost pet platforms [
11].
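The album handling can be illustrated with Telethon's `grouped_id` convention: messages belonging to the same media album share a `grouped_id`, so they are collected into one bucket before their images are downloaded. The dictionary messages below stand in for Telethon `Message` objects:

```python
from collections import defaultdict

# Group message records by album: messages sharing a grouped_id form one
# album, while singletons (grouped_id is None) keep their own bucket.

def group_album_messages(messages):
    albums = defaultdict(list)
    for msg in messages:
        key = msg.get("grouped_id") or ("single", msg["id"])
        albums[key].append(msg)
    return list(albums.values())

messages = [
    {"id": 1, "grouped_id": 77},
    {"id": 2, "grouped_id": 77},
    {"id": 3, "grouped_id": None},
]
groups = group_album_messages(messages)
```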
Existing Datasets. Beyond our constructed datasets, we incorporated established benchmarks that represent diverse data collection methodologies and real-world scenarios [
2]. The Dogs-World dataset [
58] provides 301,342 photographs from 200,458 unique dogs, capturing variations in controlled and semi-controlled environments. The LCW dataset [
59] contributes 381,267 photographs representing 140,732 individual animals, expanding the diversity of acquisition conditions and animal populations [
21]. PetFace, the largest benchmark in our evaluation, contains 1,001,532 photographs representing 257,349 unique animals across multiple species [
35]. For evaluation purposes, we utilized Cat Individual Images [
60], which provides 13,542 photographs of 518 individual cats, and DogFaceNet [
17], consisting of 8,363 photographs from 2,483 unique dogs, both serving as controlled test sets for assessing model generalization across different animal populations. These established datasets have been evaluated in prior work and demonstrate the trade-off between dataset scale and annotation quality that characterizes recent progress in animal identification research.
Combined Dataset Composition. The training corpus combines our constructed datasets with established datasets, leveraging comprehensive scale and diversity across multiple data sources [
29]. As presented in
Table 2, our complete dataset contains 1,904,157 total photographs representing 695,091 unique animals across cats and dogs. The combination of constructed and established datasets provides a robust foundation for model training and evaluation across diverse scenarios and animal types [
9,
23]. In addition to the total number of identities and photos, we report descriptive statistics of the number of photos per identity:
min and
max denote the minimum and maximum number of images available for a single identity, while
mean,
med (median), and
std denote the average, median, and standard deviation of images per identity, respectively. These statistics quantify the per-identity sample-count distribution and highlight differences in data balance across sources. Training uses balanced sampling to ensure equal representation of identities within each batch, addressing class imbalance issues where some animals have significantly more photos than others [
4,
18].
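The per-identity statistics reported alongside the totals can be computed directly from the list of photo counts per identity; a minimal sketch on toy counts:

```python
import statistics

# Compute the per-identity sample-count statistics (min, max, mean, med,
# std of photos per identity) reported for each data source.

def per_identity_stats(photo_counts):
    return {
        "min": min(photo_counts),
        "max": max(photo_counts),
        "mean": statistics.mean(photo_counts),
        "med": statistics.median(photo_counts),
        "std": statistics.stdev(photo_counts),  # sample standard deviation
    }

stats = per_identity_stats([2, 3, 2, 8, 5])
```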
Data Preprocessing. We evaluated the impact of automated animal detection preprocessing on model performance [
19,
20]. The baseline experiment uses the original dataset without additional preprocessing. A second configuration incorporates YOLO12 [
61] object detection to crop animal regions before feature extraction, testing whether explicit localization improves identification performance.
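After the detector returns a bounding box, the crop itself reduces to box arithmetic. The sketch below expands a detector box by a relative margin and clamps it to the image bounds; the 10% margin is an illustrative assumption, not a value from our pipeline:

```python
# Expand a detector box (x1, y1, x2, y2) in pixels by a relative margin
# and clamp it to the image bounds before cropping the animal region.

def expand_and_clamp(box, img_w, img_h, margin=0.1):
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    return (
        max(0, int(x1 - dx)),
        max(0, int(y1 - dy)),
        min(img_w, int(x2 + dx)),
        min(img_h, int(y2 + dy)),
    )

crop_box = expand_and_clamp((10, 10, 110, 60), img_w=120, img_h=80)
```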
Data Composition Ablation Experiments. We systematically evaluated how different data sources impact model performance through controlled ablation studies. Our primary investigation examined whether incorporating our newly collected Pet911 and Telegram datasets improves identification accuracy compared to training solely on established benchmarks. The PetFace [
35] dataset presents a specific methodological challenge: all images underwent automated face detection, precise alignment, and manual filtering, resulting in a highly controlled distribution that differs substantially from real-world deployment scenarios. To quantify this distribution mismatch effect, we designed three experimental configurations: training without PetFace [
35] to assess performance on unfiltered data, training with the PetFace [
35] training split only following standard protocols, and training with the complete PetFace dataset to examine whether scale compensates for domain shift.
Training and Test Split. Our experimental framework employs stratified splits maintaining animal identity separation between training and test sets to ensure valid evaluation of model generalization [
27]. The training set comprises five different datasets, while the test set contains Cat Individual Images [
60] and DogFaceNet [
17]. This configuration prevents data leakage and enables proper assessment of individual identification performance on previously unseen animals [
16].
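The identity-separation requirement can be expressed as a simple leakage check run over the split; the identifier names below are hypothetical:

```python
# Verify that no animal identity appears in both the training and test
# splits, so every test-time query animal is previously unseen.

def check_identity_disjoint(train_ids, test_ids):
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"identity leakage: {sorted(overlap)}")
    return True

ok = check_identity_disjoint(["pet911_001", "tg_042"], ["cat_ind_007"])
```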
2.2. Vision Encoder Experiments
Vision Encoder Selection. We evaluated six pre-trained vision encoders that represent different architectural approaches and pre-training objectives relevant to animal identification tasks.
CLIP-ViT-Base [
39] combines vision transformer architecture with language-image contrastive learning, enabling models to leverage semantic relationships between visual and textual information.
SigLIP-Base [
43] employs sigmoid loss for contrastive learning, offering improved training stability and convergence properties compared to standard softmax-based approaches.
SigLIP2-Base [
54] represents an updated version of SigLIP with refined training procedures and architectural improvements.
SigLIP2-Giant is a scaled-up variant of the SigLIP2 [
54] architecture with optimized training and a higher input resolution, providing visual representations comparable to the state of the art through increased model capacity and enhanced fine-grained detail capture.
DINOv2-Small [
53] uses self-supervised learning on diverse image collections without language supervision, enabling the discovery of task-agnostic visual features that generalize across domains.
Zer0int CLIP-L provides a large-scale CLIP variant with geometric mean pooling aggregation, offering increased model capacity and refined feature aggregation strategies. These diverse encoders enable systematic investigation of how different pre-training objectives influence feature quality for individual animal identification.
Table 3 further summarizes the computational characteristics of each vision backbone, including the number of parameters, the number of multiply–add operations (Mult-Adds), the peak inference VRAM footprint in megabytes (batch size = 1), the average training time per epoch in seconds, and the inference throughput measured in images per second.
Training Configuration. All vision encoders undergo identical training procedures to ensure fair comparison across different architectural approaches. Training employs a batch size of 116 samples structured as 58 unique animal identities with 2 photographs each, ensuring balanced identity representation within every training iteration. The learning rate is fixed at 1 × 10⁻⁴ with Adam optimization using default parameters, providing consistent gradient updates across all encoders. Training proceeds for 10 epochs across all experiments, establishing a standardized training duration that allows sufficient convergence while maintaining consistent computational requirements. This configuration enables assessment of encoder performance under identical learning conditions, revealing which architectural choices and pre-training objectives produce superior feature representations for animal identification.
Transfer Learning Strategy. All vision encoders utilize transfer learning by freezing lower layers while unfreezing only the final five transformer blocks during training. This approach preserves general-purpose visual features learned during large-scale pre-training on diverse image datasets, while enabling adaptation to animal identification tasks through fine-tuning higher-level features. Freezing early layers maintains foundational feature patterns that remain useful across different domains, reducing catastrophic forgetting and improving convergence speed. Unfrozen final blocks allow the model to learn animal-specific feature representations that discriminate between individual subjects. This balance between preservation and adaptation leverages the benefits of pre-trained models while enabling task-specific optimization without requiring extensive training resources.
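The freezing scheme reduces to a mask over transformer block indices; in a framework such as PyTorch one would then set `requires_grad` on each block's parameters accordingly. A framework-agnostic sketch:

```python
# With N transformer blocks, freeze all but the final `unfrozen_tail`
# blocks (five in our configuration): True marks a trainable block.

def trainable_block_mask(num_blocks, unfrozen_tail=5):
    cutoff = max(0, num_blocks - unfrozen_tail)
    return [i >= cutoff for i in range(num_blocks)]

mask = trainable_block_mask(12)  # e.g. a 12-block ViT-Base encoder
```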
Sampling. Training employs a balanced identity sampler that ensures equal representation of animal identities within each batch. This sampling strategy guarantees that each of the 58 identities appears exactly twice per batch, regardless of how many total photographs each identity possesses. This approach directly addresses class imbalance issues inherent in animal identification datasets, where some animals have many photographs while others have few. Balanced sampling improves convergence by preventing the model from biasing toward frequently represented identities and ensures that less-represented animals contribute equally to gradient updates.
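The sampler can be sketched as follows, assuming a mapping from identity to photo paths; names and the random seed are illustrative:

```python
import random

# Build one balanced batch: `ids_per_batch` identities with exactly
# `photos_per_id` images each (58 x 2 = 116 in our configuration),
# regardless of how many photos an identity has in total.

def balanced_batch(photos_by_id, ids_per_batch=58, photos_per_id=2, rng=None):
    rng = rng or random.Random(0)
    eligible = [i for i, ps in photos_by_id.items() if len(ps) >= photos_per_id]
    chosen = rng.sample(eligible, ids_per_batch)
    batch = []
    for ident in chosen:
        batch.extend(rng.sample(photos_by_id[ident], photos_per_id))
    return batch

photos_by_id = {f"id{i}": [f"id{i}_{j}.jpg" for j in range(4)] for i in range(80)}
batch = balanced_batch(photos_by_id)
```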
2.4. Text Encoder Experiments
Ablation studies of text encoder architectures are conducted to ensure methodological consistency and control across modalities. We evaluate
E5-Base [
63], a transformer-based model tailored for semantic retrieval, as well as
E5-Small [
63] and their respective v2 versions (
E5-Small-v2 [
63] and
E5-Base-v2 [
63]), which provide improved computational efficiency and representational accuracy via updated training procedures. Additionally, our experiments include
BERT [
64], the standard backbone for general-purpose language modeling. All text encoder experiments are trained under identical configurations that mirror those of the vision encoder experiments (
Section 2.2), including batch size, learning rate, optimization protocol, balanced sampling, and a transfer learning strategy based on partial layer freezing. This unified protocol enables fair cross-modality comparisons and isolates the impact of each text encoder on downstream verification performance.
Table 4 further summarizes the computational characteristics of each text backbone, including the number of parameters, the number of multiply–add operations (Mult-Adds), the peak inference VRAM footprint in megabytes (batch size = 1), the average training time per epoch in seconds, and the inference throughput measured in samples per second.
2.5. Multimodal Experiments
In three dual-encoder baselines (
CLIP-ViT-Base + E5-Base-v2,
CLIP-ViT-Base + E5-Small-v2 and
SigLIP2-Giant + E5-Small-v2), image and text embeddings are first projected into a shared space and then concatenated to form a joint representation. Specifically, we pair
CLIP-ViT-Base with either
E5-Base-v2 or
E5-Small-v2, and
SigLIP2-Giant with
E5-Small-v2, comparing the impact of different vision and text encoders under the same fusion scheme. We use the second version of the small text encoder (
E5-Small-v2) as it provides a better efficiency–quality trade-off in our setting. Empirically, the small text encoder variants achieve higher retrieval performance than the base counterpart, so subsequent experiments focus on E5-Small-v2 as the default text encoder, while BERT-based baselines are omitted due to clearly inferior results discussed in
Section 4.3.
In the cross-attention variants, CLIP-ViT-Base + E5-Small-v2 + cross-attention and SigLIP2-Giant + E5-Small-v2 + cross-attention first produce image patch embeddings and text token embeddings, which are then fused by an attention module where text features query the image features. The attended text representations, enriched with information from the corresponding image features, are pooled into a single multimodal embedding that replaces simple concatenation and is used for downstream retrieval.
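This text-queries-image fusion can be sketched with plain NumPy as single-head attention followed by mean pooling; the dimensions, the absence of learned projections, and the pooling choice are all simplifications of the actual module:

```python
import numpy as np

# Text token embeddings attend over image patch embeddings; the attended
# tokens are mean-pooled into one multimodal vector.

def cross_attend_pool(text_tokens, image_patches):
    d = text_tokens.shape[-1]
    scores = text_tokens @ image_patches.T / np.sqrt(d)  # (T, P) logits
    scores -= scores.max(axis=-1, keepdims=True)          # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # rows sum to 1
    attended = attn @ image_patches                       # (T, d)
    return attended.mean(axis=0)                          # pooled (d,)

rng = np.random.default_rng(0)
fused = cross_attend_pool(rng.normal(size=(6, 16)), rng.normal(size=(10, 16)))
```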
In the weighted-text variants, CLIP-ViT-Base + E5-Small-v2 and SigLIP2-Giant + E5-Small-v2 use the same dual-encoder and concatenation scheme as the baselines, but apply a learnable scalar weight to the text embedding before fusion. Image and text features are projected into a shared space, the text embedding is rescaled by this trainable factor, and then concatenated with the image embedding to form the final multimodal representation used for retrieval.
For the gated fusion variant SigLIP2-Giant + E5-Small-v2 + gating, image and text embeddings are first projected into a shared space with separate linear layers and then concatenated. This concatenated vector is passed through a small MLP with softmax over two outputs, yielding normalized weights for the text and image embeddings, which are combined as a weighted sum to obtain the final multimodal representation used for retrieval.
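The gating head can be sketched with NumPy as follows; here the MLP weights are random stand-ins for the learned parameters, and the hidden size is illustrative:

```python
import numpy as np

# Concatenate projected image/text embeddings, pass through a small MLP
# with two softmax outputs, and combine the embeddings as a weighted sum.

def gated_fuse(img, txt, w1, b1, w2, b2):
    h = np.tanh(np.concatenate([img, txt]) @ w1 + b1)  # hidden layer
    logits = h @ w2 + b2                                # two gate logits
    g = np.exp(logits - logits.max())
    g /= g.sum()                                        # softmax -> gates
    return g[0] * img + g[1] * txt, g

rng = np.random.default_rng(1)
d, hdim = 8, 4
img, txt = rng.normal(size=d), rng.normal(size=d)
fused, gates = gated_fuse(img, txt,
                          rng.normal(size=(2 * d, hdim)), rng.normal(size=hdim),
                          rng.normal(size=(hdim, 2)), rng.normal(size=2))
```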
Table 5 further summarizes the computational characteristics of each multimodal configuration, reporting the total number of trainable parameters across both encoders and fusion module, the aggregate number of multiply–add operations (Mult-Adds), the peak inference VRAM footprint in megabytes, the average training time per epoch in seconds, and the inference throughput measured in samples processed per second.
2.6. Comparison Methods
To assess the performance of the proposed approach, we compare it against several strong models pre-trained for wildlife re-identification and biological taxonomy. These models are utilized as fixed feature extractors without any additional fine-tuning on the target datasets, ensuring that the comparison focuses on the generalizability of their learned representations. Inference is conducted using the same evaluation protocol as employed for our main method to guarantee a fair assessment.
The comparison includes
MiewID-msv3 [
65], a specialized feature extractor trained using contrastive learning on a high-quality dataset covering 64 different wildlife species, ranging from terrestrial mammals to aquatic animals.
Additionally, we evaluate three distinct architectures from the MegaDescriptor family [
24], which are designed as foundation models for individual animal re-identification:
MD-T-CNN-288, which is based on the EfficientNet-B3 convolutional neural network [
66];
MD-CLIP-336, which adapts a large Vision Transformer initially pre-trained with CLIP [
39]; and
MD-L-384, which leverages a Swin Transformer Large backbone [
67]. Finally, we include
BioCLIP [
68], a biology-focused vision foundation model based on the CLIP ViT [
39] architecture.
Table 6 summarizes the computational characteristics of the pre-trained comparison models, including the number of parameters, the number of multiply–add operations (Mult-Adds), the peak inference VRAM footprint in megabytes (batch size = 1), and the inference throughput measured in images per second, while omitting training time since these models are used only in frozen, inference-only mode.
2.7. Loss Function Design
The training objective combines two complementary components that jointly optimize the feature space for individual animal identification.
Triplet Loss. We employ triplet loss [37] with margin $m$ to encourage separation between different animal identities by penalizing cases where different animals produce similar embeddings. For a triplet consisting of an anchor image $x_a$, a positive image $x_p$ (same animal), and a negative image $x_n$ (different animal), the triplet loss is defined as
$$\mathcal{L}_{\mathrm{triplet}} = \max\bigl(0,\; d\bigl(f(x_a), f(x_p)\bigr) - d\bigl(f(x_a), f(x_n)\bigr) + m\bigr),$$
where $f$ represents the neural network, $d(\cdot,\cdot)$ denotes the Euclidean distance, and $m$ establishes the minimum required distance between features from different animals.
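For a single triplet of embeddings the loss can be computed directly; the margin value used here is illustrative:

```python
import math

# Hinge on the gap between anchor-positive and anchor-negative
# Euclidean distances, offset by the margin.

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Well-separated triplet: d(a, p) = 1, d(a, n) = 5 -> zero loss.
loss = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
```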
Intra-Pair Variance Regularization. We apply intra-pair variance regularization [69] to promote consistency across multiple photographs of the same animal. This loss minimizes the variance of similarity scores within both positive pairs (same identity) and negative pairs (different identities), encouraging tighter clustering and more stable decision boundaries.
For positive pairs with cosine similarity scores $\{s_i^{+}\}_{i=1}^{N_{+}}$ and negative pairs with similarity scores $\{s_j^{-}\}_{j=1}^{N_{-}}$, the intra-pair variance loss is computed as
$$\mathcal{L}_{\mathrm{var}}^{+} = \frac{1}{N_{+}} \sum_{i=1}^{N_{+}} \max\bigl(0,\; (\mu_{+} - \varepsilon_{+}) - s_i^{+}\bigr)^{2}, \qquad \mathcal{L}_{\mathrm{var}}^{-} = \frac{1}{N_{-}} \sum_{j=1}^{N_{-}} \max\bigl(0,\; s_j^{-} - (\mu_{-} + \varepsilon_{-})\bigr)^{2},$$
where $\mu_{+}$ and $\mu_{-}$ represent the mean positive and negative similarity scores, respectively, and $\varepsilon_{+}, \varepsilon_{-}$ are small epsilon values that define tolerance margins. The total variance loss is
$$\mathcal{L}_{\mathrm{var}} = \mathcal{L}_{\mathrm{var}}^{+} + \mathcal{L}_{\mathrm{var}}^{-}.$$
This formulation penalizes positive pairs with similarity below $\mu_{+} - \varepsilon_{+}$ and negative pairs with similarity above $\mu_{-} + \varepsilon_{-}$, thereby reducing intra-class variance and increasing inter-class separation.
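A minimal sketch of the two variance terms, assuming the hinge-at-tolerance form just described; the epsilon values here are illustrative:

```python
# Positives are penalized for falling below the positive mean minus a
# tolerance; negatives for rising above the negative mean plus a tolerance.

def variance_loss(pos_sims, neg_sims, eps_pos=0.01, eps_neg=0.01):
    mu_pos = sum(pos_sims) / len(pos_sims)
    mu_neg = sum(neg_sims) / len(neg_sims)
    l_pos = sum(max(0.0, (mu_pos - eps_pos) - s) ** 2 for s in pos_sims) / len(pos_sims)
    l_neg = sum(max(0.0, s - (mu_neg + eps_neg)) ** 2 for s in neg_sims) / len(neg_sims)
    return l_pos + l_neg

# Perfectly consistent scores incur no penalty.
loss = variance_loss([0.9, 0.9, 0.9], [0.1, 0.1, 0.1])
```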
Combined Loss Function. The overall training objective combines both loss components with respective weight coefficients:
$$\mathcal{L} = \lambda_{\mathrm{triplet}} \mathcal{L}_{\mathrm{triplet}} + \lambda_{\mathrm{var}} \mathcal{L}_{\mathrm{var}},$$
where $\lambda_{\mathrm{triplet}} > \lambda_{\mathrm{var}}$, indicating that identity separation receives higher priority than intra-identity consistency. Together, these components optimize the feature space to produce compact clusters for each animal while maintaining large separation between different identities.
2.9. Inference and Evaluation
Inference Configuration. Inference uses a batch size of 128 and 8 workers for data loading. Embeddings are extracted from the vision and text encoders and stored in pickle format for efficient retrieval during evaluation.
Evaluation Protocol. Positive pairs (same animal) are generated with constraints: no single image appears more than 5 times across all pairs, and each identity has a maximum of 15 pairs. Negative pairs (different animals) are generated with the same constraints while accounting for image usage in positive pairs. This controlled generation ensures consistent evaluation across all methods. These constraints are necessary to ensure that pair-based verification metrics (Equal Error Rate, Receiver Operating Characteristic Area Under the Curve) accurately reflect model performance rather than artifacts of data imbalance or repeated imagery.
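The constrained positive-pair generation can be sketched as follows; the greedy enumeration order is an illustrative choice, and the limits mirror the protocol above (5 uses per image, 15 pairs per identity):

```python
from collections import Counter
from itertools import combinations

# Enumerate same-identity photo pairs, skipping any pair that would push
# an image past `max_use` occurrences, and capping pairs per identity.

def positive_pairs(photos_by_id, max_use=5, max_pairs=15):
    usage, pairs = Counter(), []
    for ident, photos in photos_by_id.items():
        count = 0
        for a, b in combinations(photos, 2):
            if count >= max_pairs:
                break
            if usage[a] < max_use and usage[b] < max_use:
                pairs.append((a, b))
                usage[a] += 1
                usage[b] += 1
                count += 1
    return pairs

pairs = positive_pairs({"cat1": [f"c{i}" for i in range(8)]})
```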
Evaluation Metrics. We report three metrics that assess different aspects of identification performance.
Top-k Accuracy measures the percentage of queries where the correct identity appears in the top $k$ predictions:
$$\mathrm{Top}\text{-}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\bigl[y_i \in \mathcal{R}_k(q_i)\bigr],$$
where $N$ is the number of queries, $y_i$ is the true identity of query $q_i$, and $\mathcal{R}_k(q_i)$ is the set of the $k$ highest-ranked identities for that query. We report Top-1, Top-5, and Top-10 accuracy.
ROC AUC (Receiver Operating Characteristic Area Under the Curve) measures overall separability of same-animal and different-animal pairs across all decision thresholds:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\; d\,\mathrm{FPR},$$
where TPR is the true positive rate, and FPR is the false positive rate.
EER (Equal Error Rate) represents the threshold where the false positive rate equals the false negative rate:
$$\mathrm{EER} = \mathrm{FPR}(\tau^{*}) = \mathrm{FNR}(\tau^{*}),$$
where FPR is the false positive rate, FNR is the false negative rate, and $\tau^{*}$ is the decision threshold at which the two rates coincide.
Lower EER indicates better decision boundary calibration. ROC AUC and EER together provide both discrimination and calibration perspectives on model performance.
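EER can be computed from raw verification scores by sweeping candidate thresholds and taking the point where FPR and FNR are closest (equal up to the resolution of the score grid); a minimal sketch:

```python
# Sweep every observed score as a threshold; report the midpoint of FPR
# and FNR at the threshold where their gap is smallest.

def equal_error_rate(pos_scores, neg_scores):
    best = (1.0, None)
    for t in sorted(set(pos_scores + neg_scores)):
        fnr = sum(s < t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        gap = abs(fpr - fnr)
        if gap < best[0]:
            best = (gap, (fpr + fnr) / 2)
    return best[1]

eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.2, 0.1, 0.05])
```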
The McNemar test is used to compare two models evaluated on the same test set by checking whether their proportions of correct predictions differ significantly. For each sample, the outcome of model A (correct/incorrect) and model B (correct/incorrect) forms a contingency table with counts a (both correct), b (A correct, B incorrect), c (A incorrect, B correct), and d (both incorrect). The test focuses on the discordant pairs b and c; under the null hypothesis that both models have the same accuracy, these two counts should be similar.
For sufficiently large b + c, the McNemar statistic is
$$\chi^{2} = \frac{(b - c)^{2}}{b + c},$$
which approximately follows a chi-squared distribution with 1 degree of freedom, and the corresponding p-value is obtained from this distribution.
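The statistic and its p-value can be computed without external dependencies, using the identity P(χ²₁ > x) = erfc(√(x/2)) for one degree of freedom:

```python
import math

# McNemar statistic from the discordant counts b (A correct, B wrong)
# and c (A wrong, B correct), with its chi-squared(1) p-value.

def mcnemar(b, c):
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # survival fn of chi2(1)
    return stat, p_value

stat, p = mcnemar(30, 10)
```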