2.1. Data
Pet911 Dataset. We constructed a dataset through automated web scraping of the
pet911.ru platform (accessed on 1 November 2025), a Russian service for lost and found pet announcements. The parsing implementation employs BeautifulSoup for Hypertext Markup Language (HTML) processing and the requests library for HyperText Transfer Protocol (HTTP) communication with error handling. The system navigates catalog pages using pagination detection algorithms to identify available content. For each listing, we extracted animal metadata including species classification, descriptive text, and associated photographs. We filtered listings to retain only animals with at least two photographs per individual. Downloaded images underwent validation for format consistency, with automatic conversion of WebP formats to Joint Photographic Experts Group (JPEG) for standardization [
7]. The Pet911 dataset yielded 65,961 photographs representing 22,050 unique animals. Each animal record includes species classification for cats or dogs, textual description, and between 2 and 8 associated photographs. The dataset captures real-world variability in image quality, lighting conditions, and animal poses representative of lost pet scenarios [
3].
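The filtering and format-normalization steps described above can be sketched as follows; the record layout and function names are illustrative stand-ins, not the actual scraper code:

```python
# Keep only animals with at least two photographs, and map downloaded
# WebP file names to their converted JPEG names. The dict-based record
# layout here is hypothetical.

def filter_listings(listings, min_photos=2):
    """Return listings that have at least `min_photos` photographs."""
    return [l for l in listings if len(l.get("photos", [])) >= min_photos]

def normalize_photo_name(filename):
    """Map a downloaded WebP file name to its converted JPEG name."""
    if filename.lower().endswith(".webp"):
        return filename[: -len(".webp")] + ".jpg"
    return filename

listings = [
    {"id": "a1", "species": "cat", "photos": ["1.webp", "2.jpg"]},
    {"id": "a2", "species": "dog", "photos": ["3.jpg"]},
]
kept = filter_listings(listings)
names = [normalize_photo_name(p) for p in kept[0]["photos"]]
```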
Telegram Channel Dataset. The Telegram dataset construction utilized the Telethon library to access public animal-related channels through the Telegram Application Programming Interface (API). The system processes message streams from targeted public channels, using keyword matching to identify animal-related content [
1]. Media processing handles both individual photos and grouped albums. The system automatically detects grouped messages and downloads all associated images while maintaining proper file organization. The Telegram dataset contributed 131,698 photographs from 73,101 unique animals. This source provides complementary data characteristics, including casual photography styles, varied backgrounds, and diverse animal representations not captured in formal lost pet platforms [
11].
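The album handling can be illustrated with Telethon's `grouped_id` convention: messages belonging to the same media album share a `grouped_id`, so they are collected into one bucket before their images are downloaded. The dictionary messages below stand in for Telethon `Message` objects:

```python
from collections import defaultdict

# Group message records by album: messages sharing a grouped_id form one
# album, while singletons (grouped_id is None) keep their own bucket.

def group_album_messages(messages):
    albums = defaultdict(list)
    for msg in messages:
        key = msg.get("grouped_id") or ("single", msg["id"])
        albums[key].append(msg)
    return list(albums.values())

messages = [
    {"id": 1, "grouped_id": 77},
    {"id": 2, "grouped_id": 77},
    {"id": 3, "grouped_id": None},
]
groups = group_album_messages(messages)
```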
Existing Datasets. Beyond our constructed datasets, we incorporated established benchmarks that represent diverse data collection methodologies and real-world scenarios [
2]. The Dogs-World dataset [
58] provides 301,342 photographs from 200,458 unique dogs, capturing variations in controlled and semi-controlled environments. The LCW dataset [
59] contributes 381,267 photographs representing 140,732 individual animals, expanding the diversity of acquisition conditions and animal populations [
21]. PetFace, the largest benchmark in our evaluation, contains 1,001,532 photographs representing 257,349 unique animals across multiple species [
35]. For evaluation purposes, we utilized Cat Individual Images [
60], which provides 13,542 photographs of 518 individual cats, and DogFaceNet [
17], consisting of 8,363 photographs from 2,483 unique dogs, both serving as controlled test sets for assessing model generalization across different animal populations. These established datasets have been evaluated in prior work and demonstrate the trade-off between dataset scale and annotation quality that characterizes recent progress in animal identification research.
Combined Dataset Composition. The training corpus combines our constructed datasets with established datasets, leveraging comprehensive scale and diversity across multiple data sources [
29]. As presented in
Table 2, our complete dataset contains 1,904,157 total photographs representing 695,091 unique animals across cats and dogs. The combination of constructed and established datasets provides a robust foundation for model training and evaluation across diverse scenarios and animal types [
9,
23]. In addition to the total number of identities and photos, we report descriptive statistics of the number of photos per identity:
min and
max denote the minimum and maximum number of images available for a single identity, while
mean,
med (median), and
std denote the average, median, and standard deviation of images per identity, respectively. These statistics quantify the per-identity sample-count distribution and highlight differences in data balance across sources. Training uses balanced sampling to ensure equal representation of identities within each batch, addressing class imbalance issues where some animals have significantly more photos than others [
4,
18].
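The per-identity statistics reported alongside the totals can be computed directly from the list of photo counts per identity; a minimal sketch on toy counts:

```python
import statistics

# Compute the per-identity sample-count statistics (min, max, mean, med,
# std of photos per identity) reported for each data source.

def per_identity_stats(photo_counts):
    return {
        "min": min(photo_counts),
        "max": max(photo_counts),
        "mean": statistics.mean(photo_counts),
        "med": statistics.median(photo_counts),
        "std": statistics.stdev(photo_counts),  # sample standard deviation
    }

stats = per_identity_stats([2, 3, 2, 8, 5])
```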
Data Preprocessing. We evaluated the impact of automated animal detection preprocessing on model performance [
19,
20]. The baseline experiment uses the original dataset without additional preprocessing. A second configuration incorporates YOLO12 [
61] object detection to crop animal regions before feature extraction, testing whether explicit localization improves identification performance.
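After the detector returns a bounding box, the crop itself reduces to box arithmetic. The sketch below expands a detector box by a relative margin and clamps it to the image bounds; the 10% margin is an illustrative assumption, not a value from our pipeline:

```python
# Expand a detector box (x1, y1, x2, y2) in pixels by a relative margin
# and clamp it to the image bounds before cropping the animal region.

def expand_and_clamp(box, img_w, img_h, margin=0.1):
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * margin
    dy = (y2 - y1) * margin
    return (
        max(0, int(x1 - dx)),
        max(0, int(y1 - dy)),
        min(img_w, int(x2 + dx)),
        min(img_h, int(y2 + dy)),
    )

crop_box = expand_and_clamp((10, 10, 110, 60), img_w=120, img_h=80)
```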
Data Composition Ablation Experiments. We systematically evaluated how different data sources impact model performance through controlled ablation studies. Our primary investigation examined whether incorporating our newly collected Pet911 and Telegram datasets improves identification accuracy compared to training solely on established benchmarks. The PetFace [
35] dataset presents a specific methodological challenge: all images underwent automated face detection, precise alignment, and manual filtering, resulting in a highly controlled distribution that differs substantially from real-world deployment scenarios. To quantify this distribution mismatch effect, we designed three experimental configurations: training without PetFace [
35] to assess performance on unfiltered data, training with the PetFace [
35] training split only following standard protocols, and training with the complete PetFace dataset to examine whether scale compensates for domain shift.
Training and Test Split. Our experimental framework employs stratified splits maintaining animal identity separation between training and test sets to ensure valid evaluation of model generalization [
27]. The training set comprises five different datasets, while the test set contains Cat Individual Images [
60] and DogFaceNet [
17]. This configuration prevents data leakage and enables proper assessment of individual identification performance on previously unseen animals [
16].
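The identity-separation requirement can be expressed as a simple leakage check run over the split; the identifier names below are hypothetical:

```python
# Verify that no animal identity appears in both the training and test
# splits, so every test-time query animal is previously unseen.

def check_identity_disjoint(train_ids, test_ids):
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"identity leakage: {sorted(overlap)}")
    return True

ok = check_identity_disjoint(["pet911_001", "tg_042"], ["cat_ind_007"])
```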
2.2. Vision Encoder Experiments
Vision Encoder Selection. We evaluated six pre-trained vision encoders that represent different architectural approaches and pre-training objectives relevant to animal identification tasks.
CLIP-ViT-Base [
39] combines vision transformer architecture with language-image contrastive learning, enabling models to leverage semantic relationships between visual and textual information.
SigLIP-Base [
43] employs sigmoid loss for contrastive learning, offering improved training stability and convergence properties compared to standard softmax-based approaches.
SigLIP2-Base [
54] represents an updated version of SigLIP with refined training procedures and architectural improvements.
SigLIP2-Giant is a scaled-up variant of the SigLIP2 [
54] architecture with optimized training and a higher input resolution, providing visual representations comparable to the state of the art through increased model capacity and enhanced fine-grained detail capture.
DINOv2-Small [
53] uses self-supervised learning on diverse image collections without language supervision, enabling the discovery of task-agnostic visual features that generalize across domains.
Zer0int CLIP-L provides a large-scale CLIP variant with geometric mean pooling aggregation, offering increased model capacity and refined feature aggregation strategies. These diverse encoders enable systematic investigation of how different pre-training objectives influence feature quality for individual animal identification.
Table 3 further summarizes the computational characteristics of each vision backbone, including the number of parameters, the number of multiply–add operations (Mult-Adds), the peak inference VRAM footprint in megabytes (batch size = 1), the average training time per epoch in seconds, and the inference throughput measured in images per second.
Training Configuration. All vision encoders undergo identical training procedures to ensure fair comparison across different architectural approaches. Training employs a batch size of 116 samples structured as 58 unique animal identities with 2 photographs each, ensuring balanced identity representation within every training iteration. The learning rate is fixed at 1 × 10⁻⁴ with Adam optimization using default parameters, providing consistent gradient updates across all encoders. Training proceeds for 10 epochs across all experiments, establishing a standardized training duration that allows sufficient convergence while maintaining consistent computational requirements. This configuration enables assessment of encoder performance under identical learning conditions, revealing which architectural choices and pre-training objectives produce superior feature representations for animal identification.
Transfer Learning Strategy. All vision encoders utilize transfer learning by freezing lower layers while unfreezing only the final five transformer blocks during training. This approach preserves general-purpose visual features learned during large-scale pre-training on diverse image datasets, while enabling adaptation to animal identification tasks through fine-tuning higher-level features. Freezing early layers maintains foundational feature patterns that remain useful across different domains, reducing catastrophic forgetting and improving convergence speed. Unfrozen final blocks allow the model to learn animal-specific feature representations that discriminate between individual subjects. This balance between preservation and adaptation leverages the benefits of pre-trained models while enabling task-specific optimization without requiring extensive training resources.
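The freezing scheme reduces to a mask over transformer block indices; in a framework such as PyTorch one would then set `requires_grad` on each block's parameters accordingly. A framework-agnostic sketch:

```python
# With N transformer blocks, freeze all but the final `unfrozen_tail`
# blocks (five in our configuration): True marks a trainable block.

def trainable_block_mask(num_blocks, unfrozen_tail=5):
    cutoff = max(0, num_blocks - unfrozen_tail)
    return [i >= cutoff for i in range(num_blocks)]

mask = trainable_block_mask(12)  # e.g. a 12-block ViT-Base encoder
```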
Sampling. Training employs a balanced identity sampler that ensures equal representation of animal identities within each batch. This sampling strategy guarantees that each of the 58 identities appears exactly twice per batch, regardless of how many total photographs each identity possesses. This approach directly addresses class imbalance issues inherent in animal identification datasets, where some animals have many photographs while others have few. Balanced sampling improves convergence by preventing the model from biasing toward frequently represented identities and ensures that less-represented animals contribute equally to gradient updates.
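The sampler can be sketched as follows, assuming a mapping from identity to photo paths; names and the random seed are illustrative:

```python
import random

# Build one balanced batch: `ids_per_batch` identities with exactly
# `photos_per_id` images each (58 x 2 = 116 in our configuration),
# regardless of how many photos an identity has in total.

def balanced_batch(photos_by_id, ids_per_batch=58, photos_per_id=2, rng=None):
    rng = rng or random.Random(0)
    eligible = [i for i, ps in photos_by_id.items() if len(ps) >= photos_per_id]
    chosen = rng.sample(eligible, ids_per_batch)
    batch = []
    for ident in chosen:
        batch.extend(rng.sample(photos_by_id[ident], photos_per_id))
    return batch

photos_by_id = {f"id{i}": [f"id{i}_{j}.jpg" for j in range(4)] for i in range(80)}
batch = balanced_batch(photos_by_id)
```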
2.4. Text Encoder Experiments
Ablation studies of text encoder architectures are conducted to ensure methodological consistency and control across modalities. We evaluate
E5-Base [
63], a transformer-based model tailored for semantic retrieval, as well as
E5-Small [
63] and their respective v2 versions (
E5-Small-v2 [
63] and
E5-Base-v2 [
63]), which provide improved computational efficiency and representational accuracy via updated training procedures. Additionally, our experiments include
BERT [
64], the standard backbone for general-purpose language modeling. All text encoder experiments are trained under identical configurations that mirror those of the vision encoder experiments (
Section 2.2), including batch size, learning rate, optimization protocol, balanced sampling, and a transfer learning strategy based on partial layer freezing. This unified protocol enables fair cross-modality comparisons and isolates the impact of each text encoder on downstream verification performance.
Table 4 further summarizes the computational characteristics of each text backbone, including the number of parameters, the number of multiply–add operations (Mult-Adds), the peak inference VRAM footprint in megabytes (batch size = 1), the average training time per epoch in seconds, and the inference throughput measured in samples per second.
2.5. Multimodal Experiments
In three dual-encoder baselines (
CLIP-ViT-Base + E5-Base-v2,
CLIP-ViT-Base + E5-Small-v2 and
SigLIP2-Giant + E5-Small-v2), image and text embeddings are first projected into a shared space and then concatenated to form a joint representation. Specifically, we pair
CLIP-ViT-Base with either
E5-Base-v2 or
E5-Small-v2, and
SigLIP2-Giant with
E5-Small-v2, comparing the impact of different vision and text encoders under the same fusion scheme. We use the second version of the small text encoder (
E5-Small-v2) as it provides a better efficiency–quality trade-off in our setting. Empirically, the small text encoder variants achieve higher retrieval performance than the base counterpart, so subsequent experiments focus on E5-Small-v2 as the default text encoder, while BERT-based baselines are omitted due to clearly inferior results discussed in
Section 4.3.
In the cross-attention variants, CLIP-ViT-Base + E5-Small-v2 + cross-attention and SigLIP2-Giant + E5-Small-v2 + cross-attention first produce image patch embeddings and text token embeddings, which are then fused by an attention module where text features query the image features. The attended text representations, enriched with information from the corresponding image features, are pooled into a single multimodal embedding that replaces simple concatenation and is used for downstream retrieval.
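This text-queries-image fusion can be sketched with plain NumPy as single-head attention followed by mean pooling; the dimensions, the absence of learned projections, and the pooling choice are all simplifications of the actual module:

```python
import numpy as np

# Text token embeddings attend over image patch embeddings; the attended
# tokens are mean-pooled into one multimodal vector.

def cross_attend_pool(text_tokens, image_patches):
    d = text_tokens.shape[-1]
    scores = text_tokens @ image_patches.T / np.sqrt(d)  # (T, P) logits
    scores -= scores.max(axis=-1, keepdims=True)          # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)              # rows sum to 1
    attended = attn @ image_patches                       # (T, d)
    return attended.mean(axis=0)                          # pooled (d,)

rng = np.random.default_rng(0)
fused = cross_attend_pool(rng.normal(size=(6, 16)), rng.normal(size=(10, 16)))
```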
In the weighted-text variants, CLIP-ViT-Base + E5-Small-v2 and SigLIP2-Giant + E5-Small-v2 use the same dual-encoder and concatenation scheme as the baselines, but apply a learnable scalar weight to the text embedding before fusion. Image and text features are projected into a shared space, the text embedding is rescaled by this trainable factor, and then concatenated with the image embedding to form the final multimodal representation used for retrieval.
For the gated fusion variant SigLIP2-Giant + E5-Small-v2 + gating, image and text embeddings are first projected into a shared space with separate linear layers and then concatenated. This concatenated vector is passed through a small MLP with softmax over two outputs, yielding normalized weights for the text and image embeddings, which are combined as a weighted sum to obtain the final multimodal representation used for retrieval.
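The gating head can be sketched with NumPy as follows; here the MLP weights are random stand-ins for the learned parameters, and the hidden size is illustrative:

```python
import numpy as np

# Concatenate projected image/text embeddings, pass through a small MLP
# with two softmax outputs, and combine the embeddings as a weighted sum.

def gated_fuse(img, txt, w1, b1, w2, b2):
    h = np.tanh(np.concatenate([img, txt]) @ w1 + b1)  # hidden layer
    logits = h @ w2 + b2                                # two gate logits
    g = np.exp(logits - logits.max())
    g /= g.sum()                                        # softmax -> gates
    return g[0] * img + g[1] * txt, g

rng = np.random.default_rng(1)
d, hdim = 8, 4
img, txt = rng.normal(size=d), rng.normal(size=d)
fused, gates = gated_fuse(img, txt,
                          rng.normal(size=(2 * d, hdim)), rng.normal(size=hdim),
                          rng.normal(size=(hdim, 2)), rng.normal(size=2))
```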
Table 5 further summarizes the computational characteristics of each multimodal configuration, reporting the total number of trainable parameters across both encoders and fusion module, the aggregate number of multiply–add operations (Mult-Adds), the peak inference VRAM footprint in megabytes, the average training time per epoch in seconds, and the inference throughput measured in samples processed per second.
2.6. Comparison Methods
To assess the performance of the proposed approach, we compare it against several strong models pre-trained for wildlife re-identification and biological taxonomy. These models are utilized as fixed feature extractors without any additional fine-tuning on the target datasets, ensuring that the comparison focuses on the generalizability of their learned representations. Inference is conducted using the same evaluation protocol as employed for our main method to guarantee a fair assessment.
The comparison includes
MiewID-msv3 [
65], a specialized feature extractor trained using contrastive learning on a high-quality dataset covering 64 different wildlife species, ranging from terrestrial mammals to aquatic animals.
Additionally, we evaluate three distinct architectures from the MegaDescriptor family [
24], which are designed as foundation models for individual animal re-identification:
MD-T-CNN-288, which is based on the EfficientNet-B3 convolutional neural network [
66];
MD-CLIP-336, which adapts a large Vision Transformer initially pre-trained with CLIP [
39]; and
MD-L-384, which leverages a Swin Transformer Large backbone [
67]. Finally, we include
BioCLIP [
68], a biology-focused vision foundation model based on the CLIP ViT [
39] architecture.
Table 6 summarizes the computational characteristics of the pre-trained comparison models, including the number of parameters, the number of multiply–add operations (Mult-Adds), the peak inference VRAM footprint in megabytes (batch size = 1), and the inference throughput measured in images per second, while omitting training time since these models are used only in frozen, inference-only mode.
2.7. Loss Function Design
The training objective combines two complementary components that jointly optimize the feature space for individual animal identification.
Triplet Loss. We employ triplet loss [37] with margin $m$ to encourage separation between different animal identities by penalizing cases where different animals produce similar embeddings. For a triplet consisting of an anchor image $x_a$, a positive image $x_p$ (same animal), and a negative image $x_n$ (different animal), the triplet loss is defined as
$$\mathcal{L}_{\mathrm{triplet}} = \max\bigl(0,\; d\bigl(f(x_a), f(x_p)\bigr) - d\bigl(f(x_a), f(x_n)\bigr) + m\bigr),$$
where $f$ represents the neural network, $d(\cdot,\cdot)$ denotes the Euclidean distance, and $m$ establishes the minimum required distance between features from different animals.
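For a single triplet of embeddings the loss can be computed directly; the margin value used here is illustrative:

```python
import math

# Hinge on the gap between anchor-positive and anchor-negative
# Euclidean distances, offset by the margin.

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Well-separated triplet: d(a, p) = 1, d(a, n) = 5 -> zero loss.
loss = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
```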
Intra-Pair Variance Regularization. We apply intra-pair variance regularization [69] to promote consistency across multiple photographs of the same animal. This loss minimizes the variance of similarity scores within both positive pairs (same identity) and negative pairs (different identities), encouraging tighter clustering and more stable decision boundaries.
For positive pairs with cosine similarity scores $\{s_i^{+}\}_{i=1}^{N_{+}}$ and negative pairs with similarity scores $\{s_j^{-}\}_{j=1}^{N_{-}}$, the intra-pair variance loss is computed as
$$\mathcal{L}_{\mathrm{var}}^{+} = \frac{1}{N_{+}} \sum_{i=1}^{N_{+}} \max\bigl(0,\; (\mu_{+} - \varepsilon_{+}) - s_i^{+}\bigr)^{2}, \qquad \mathcal{L}_{\mathrm{var}}^{-} = \frac{1}{N_{-}} \sum_{j=1}^{N_{-}} \max\bigl(0,\; s_j^{-} - (\mu_{-} + \varepsilon_{-})\bigr)^{2},$$
where $\mu_{+}$ and $\mu_{-}$ represent the mean positive and negative similarity scores, respectively, and $\varepsilon_{+}, \varepsilon_{-}$ are small epsilon values that define tolerance margins. The total variance loss is
$$\mathcal{L}_{\mathrm{var}} = \mathcal{L}_{\mathrm{var}}^{+} + \mathcal{L}_{\mathrm{var}}^{-}.$$
This formulation penalizes positive pairs with similarity below $\mu_{+} - \varepsilon_{+}$ and negative pairs with similarity above $\mu_{-} + \varepsilon_{-}$, thereby reducing intra-class variance and increasing inter-class separation.
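A minimal sketch of the two variance terms, assuming the hinge-at-tolerance form just described; the epsilon values here are illustrative:

```python
# Positives are penalized for falling below the positive mean minus a
# tolerance; negatives for rising above the negative mean plus a tolerance.

def variance_loss(pos_sims, neg_sims, eps_pos=0.01, eps_neg=0.01):
    mu_pos = sum(pos_sims) / len(pos_sims)
    mu_neg = sum(neg_sims) / len(neg_sims)
    l_pos = sum(max(0.0, (mu_pos - eps_pos) - s) ** 2 for s in pos_sims) / len(pos_sims)
    l_neg = sum(max(0.0, s - (mu_neg + eps_neg)) ** 2 for s in neg_sims) / len(neg_sims)
    return l_pos + l_neg

# Perfectly consistent scores incur no penalty.
loss = variance_loss([0.9, 0.9, 0.9], [0.1, 0.1, 0.1])
```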
Combined Loss Function. The overall training objective combines both loss components with respective weight coefficients:
$$\mathcal{L} = \lambda_{\mathrm{triplet}} \mathcal{L}_{\mathrm{triplet}} + \lambda_{\mathrm{var}} \mathcal{L}_{\mathrm{var}},$$
where $\lambda_{\mathrm{triplet}} > \lambda_{\mathrm{var}}$, indicating that identity separation receives higher priority than intra-identity consistency. Together, these components optimize the feature space to produce compact clusters for each animal while maintaining large separation between different identities.
2.9. Inference and Evaluation
Inference Configuration. Inference uses a batch size of 128 and 8 workers for data loading. Embeddings are extracted from the vision and text encoders and stored in pickle format for efficient retrieval during evaluation.
Evaluation Protocol. Positive pairs (same animal) are generated with constraints: no single image appears more than 5 times across all pairs, and each identity has a maximum of 15 pairs. Negative pairs (different animals) are generated with the same constraints while accounting for image usage in positive pairs. This controlled generation ensures consistent evaluation across all methods. These constraints are necessary to ensure that pair-based verification metrics (Equal Error Rate, Receiver Operating Characteristic Area Under the Curve) accurately reflect model performance rather than artifacts of data imbalance or repeated imagery.
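The constrained positive-pair generation can be sketched as follows; the greedy enumeration order is an illustrative choice, and the limits mirror the protocol above (5 uses per image, 15 pairs per identity):

```python
from collections import Counter
from itertools import combinations

# Enumerate same-identity photo pairs, skipping any pair that would push
# an image past `max_use` occurrences, and capping pairs per identity.

def positive_pairs(photos_by_id, max_use=5, max_pairs=15):
    usage, pairs = Counter(), []
    for ident, photos in photos_by_id.items():
        count = 0
        for a, b in combinations(photos, 2):
            if count >= max_pairs:
                break
            if usage[a] < max_use and usage[b] < max_use:
                pairs.append((a, b))
                usage[a] += 1
                usage[b] += 1
                count += 1
    return pairs

pairs = positive_pairs({"cat1": [f"c{i}" for i in range(8)]})
```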
Evaluation Metrics. We report three metrics that assess different aspects of identification performance.
Top-k Accuracy measures the percentage of queries where the correct identity appears in the top $k$ predictions:
$$\mathrm{Top}\text{-}k = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\bigl[y_i \in \mathcal{R}_k(q_i)\bigr],$$
where $N$ is the number of queries, $y_i$ is the true identity of query $q_i$, and $\mathcal{R}_k(q_i)$ is the set of the $k$ highest-ranked identities for that query. We report Top-1, Top-5, and Top-10 accuracy.
ROC AUC (Receiver Operating Characteristic Area Under the Curve) measures overall separability of same-animal and different-animal pairs across all decision thresholds:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\; d\,\mathrm{FPR},$$
where TPR is the true positive rate, and FPR is the false positive rate.
EER (Equal Error Rate) represents the threshold where the false positive rate equals the false negative rate:
$$\mathrm{EER} = \mathrm{FPR}(\tau^{*}) = \mathrm{FNR}(\tau^{*}),$$
where FPR is the false positive rate, FNR is the false negative rate, and $\tau^{*}$ is the decision threshold at which the two rates coincide.
Lower EER indicates better decision boundary calibration. ROC AUC and EER together provide both discrimination and calibration perspectives on model performance.
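EER can be computed from raw verification scores by sweeping candidate thresholds and taking the point where FPR and FNR are closest (equal up to the resolution of the score grid); a minimal sketch:

```python
# Sweep every observed score as a threshold; report the midpoint of FPR
# and FNR at the threshold where their gap is smallest.

def equal_error_rate(pos_scores, neg_scores):
    best = (1.0, None)
    for t in sorted(set(pos_scores + neg_scores)):
        fnr = sum(s < t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        gap = abs(fpr - fnr)
        if gap < best[0]:
            best = (gap, (fpr + fnr) / 2)
    return best[1]

eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.2, 0.1, 0.05])
```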
The McNemar test is used to compare two models evaluated on the same test set by checking whether their proportions of correct predictions differ significantly. For each sample, the outcome of model A (correct/incorrect) and model B (correct/incorrect) forms a contingency table with counts a (both correct), b (A correct, B incorrect), c (A incorrect, B correct), and d (both incorrect). The test focuses on the discordant pairs b and c; under the null hypothesis that both models have the same accuracy, these two counts should be similar.
For sufficiently large b + c, the McNemar statistic is
$$\chi^{2} = \frac{(b - c)^{2}}{b + c},$$
which approximately follows a chi-squared distribution with 1 degree of freedom, and the corresponding p-value is obtained from this distribution.
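The statistic and its p-value can be computed without external dependencies, using the identity P(χ²₁ > x) = erfc(√(x/2)) for one degree of freedom:

```python
import math

# McNemar statistic from the discordant counts b (A correct, B wrong)
# and c (A wrong, B correct), with its chi-squared(1) p-value.

def mcnemar(b, c):
    stat = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # survival fn of chi2(1)
    return stat, p_value

stat, p = mcnemar(30, 10)
```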