1. Introduction
In smart animal husbandry, accurate and convenient individual identification is a prerequisite for achieving large-scale and intelligent management of farms. As the core breed of the global dairy industry, the precise identification of Holstein cattle holds significant importance for precision feeding, health monitoring, genetic breeding, and production management. Currently, the individual identification of Holstein cattle primarily relies on ear tags, which are invasive and prone to detachment. Once a tag falls off, individual identities can be confused, production traceability interrupted, and genetic records lost. In recent years, machine vision-based individual identification technology has offered new solutions for the meticulous management of farms owing to its non-invasive nature, sustainable monitoring capabilities, and ease of implementation [1].
The unique distribution of hair color patches in Holstein cattle serves as a natural marker for visual individual identification [2,3,4]. However, existing identification methods face several bottlenecks. First, most studies rely on facial and lateral body features for individual identification, which are difficult to extract reliably in practical production environments owing to lens occlusion and motion blur [5,6,7]. Second, many studies operate within a closed-set identification framework that can only recognize individuals present in the training set, which hampers adaptability to open scenarios where new individuals continuously enter the herd [8,9]. Additionally, while current mainstream convolutional neural networks can capture local texture details, they are insufficient at representing the overall structure of the hair patches; consequently, model performance degrades significantly under abnormal conditions such as local occlusion, stains on the cattle’s back, or random production markers [10,11]. Furthermore, existing open-set identification methods lack the stability required to reliably discriminate new individuals in real production environments and have limited capabilities for online incremental identification [12,13,14,15]. These limitations hinder the application of vision-based individual identification techniques for Holstein cattle in practical farming scenarios.
To address the above challenges, this study proposes an incremental identification framework for Holstein cattle based on a top-view perspective. First, we design a novel feature extraction network, ResWTA, which integrates wavelet convolution and a spatial attention mechanism to enhance the feature expression capability of the overall patch structure of the back by expanding the convolutional receptive field. Second, we construct a few-shot augmented prototype network along with a distance metric module to facilitate the online incremental identification of new individuals. Finally, we systematically evaluate the impacts of various loss functions, prototype computation strategies, and distance metric methods on incremental identification performance to verify the robustness and practicality of the proposed framework in complex scenarios. Experimental results indicate that the proposed method significantly enhances the performance of incremental identification of Holstein cattle individuals from a top-view perspective, providing an efficient and reliable solution for open individual management in pastureland.
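For concreteness, the following minimal PyTorch sketch illustrates one plausible form of a WTAConv-style block as described above: a single-level Haar wavelet convolution that enlarges the effective receptive field at low cost, gated by CBAM-style spatial attention and fused through a residual connection. This is an illustrative sketch under stated assumptions, not the released implementation; all module and parameter names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarWaveletConv(nn.Module):
    """Single-level Haar low-pass decomposition followed by a 3x3 conv."""
    def __init__(self, channels):
        super().__init__()
        ll = torch.full((2, 2), 0.5)  # Haar low-frequency (LL) kernel
        self.register_buffer("ll", ll.expand(channels, 1, 2, 2).contiguous())
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        low = F.conv2d(x, self.ll, stride=2, groups=x.shape[1])  # depthwise DWT step
        low = self.conv(low)  # conv at half resolution: cheap receptive-field growth
        return F.interpolate(low, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool over channels, 7x7 conv, sigmoid gate."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class WTABlock(nn.Module):
    """Wavelet branch plus spatial attention, fused on a residual path."""
    def __init__(self, channels):
        super().__init__()
        self.wavelet = HaarWaveletConv(channels)
        self.attn = SpatialAttention()

    def forward(self, x):
        return x + self.attn(self.wavelet(x))  # residual fusion keeps training stable

if __name__ == "__main__":
    feat = torch.randn(2, 256, 56, 56)  # e.g., a ResNet stage output
    print(WTABlock(256)(feat).shape)    # torch.Size([2, 256, 56, 56])
```

The residual fusion here is a design assumption: it lets the low-frequency wavelet branch refine, rather than replace, the backbone features.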
The primary contributions of this study are as follows:
- (1)
We propose a novel feature extraction network, ResWTA, which fuses wavelet convolution with a spatial attention mechanism. This design significantly enhances the model’s ability to extract features from the cattle’s back pattern structure.
- (2)
We construct a few-shot augmented prototype network aimed at improving the incremental identification performance of the framework in production scenarios.
- (3)
We introduce an incremental identification framework that integrates the feature extraction network, the prototype network, and a distance metric module. Through evaluating the impact of each component on the incremental identification performance of Holstein cattle, we identify the optimal incremental identification framework, thereby providing an efficient and practical individual identification solution for the smart animal husbandry industry.
2. Related Work
Currently, deep learning-based individual identification methods for Holstein cattle primarily rely on natural biometric features such as the face [5,16,17], muzzle [6,7,18,19], flank [20,21,22,23], and rump [24] for characterization and matching. However, identification based on the cattle’s face requires high image quality and is adversely affected by the cattle’s frequent head movements, complicating data acquisition. Identification based on muzzle texture requires clear images taken at close range, which may induce stress reactions in the cattle. While flank and rump images contain rich individual features, they must be captured from a horizontal perspective, making them susceptible to occlusion, motion blur, and coat contamination, resulting in unstable features. In contrast, dorsal images of Holstein cattle are particularly suitable for individual identification in production environments: they carry distinctive patterns, receive stable exposure, suffer minimal occlusion, and can be easily obtained from a top-view perspective.
In recent years, scholars have conducted research on the individual identification of Holstein cattle from a top-view perspective. Zhao et al. (2019) proposed a visual identification system for Holstein cattle utilizing top-view images, wherein features are extracted and matched from the dorsal image using the scale-invariant feature transform (SIFT), features from accelerated segment test (FAST), and binary robust independent elementary features (BRIEF), achieving an identification accuracy of 96.72% [25]. However, this system relied on manually designed feature point selection algorithms, which may detect feature points inaccurately when images are blurred or individuals are occluded, subsequently affecting identification accuracy. Bello et al. (2020) introduced an individual identification scheme based on the back pattern of cattle, using convolutional neural networks (CNNs) and deep belief networks (DBNs) for feature extraction and classification, and achieved an identification accuracy of 89.95% on a multi-breed dataset [26]. Ferreira et al. (2022) evaluated the capability of a deep learning model to identify individual Holstein calves across various postures and growth cycles using 3D images of cattle backs, attaining F1-scores as high as 95.9% in closed-set evaluations [27]. Ma et al. (2025) developed a lightweight feature extraction network, CowBackNet, by integrating depthwise separable convolution, inverted residual structures, and multiple attention mechanisms, achieving an identification accuracy of 95.86% on the Cows2021 dataset [28]. Although the aforementioned methods have demonstrated strong performance under closed-set conditions, they struggle in open breeding scenarios where new individuals are continuously introduced. Furthermore, most of these approaches are based on convolutional neural networks, which primarily focus on local detail information and often fail to adequately represent the overall structure of the mottled pattern. Consequently, their performance declines significantly under abnormal conditions such as local occlusion, dirt on the cattle’s back, and production markers.
In the field of open-set identification, several scholars have proposed open-set identification frameworks for Holstein cattle based on the principles of metric learning. Andrew et al. (2021) utilized ResNet to extract dorsal mottled features of cattle and constructed a discriminative embedding space, achieving an open-set identification accuracy of 93.75% [12]. Gao et al. (2022) integrated self-supervised learning, metric learning, and active learning, raising open-set identification accuracy from 74.9% to 92.44% on the Cows2021 dataset [13]. Wang et al. (2024) attained an identification accuracy of 94.26% by merging a lightweight feature extraction network with a distance metric [14]. Additionally, Wang et al. (2024) introduced a spatial feature transformation module to address the impact of variations in cattle posture and orientation on individual identification in a production environment, achieving an accuracy of 94.31% on the Cows2021 dataset [15]. Despite these advancements, the aforementioned studies highlight persistent issues, such as instability in category discrimination and limited online incremental identification capabilities, which hinder the ability to maintain stable identification of new individuals continuously introduced in actual production environments.
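The open-set decision rule shared by these metric-learning approaches can be summarized in a short sketch: a query embedding is matched to the nearest per-individual prototype, and a distance threshold rejects queries that belong to no enrolled individual. The cosine metric, the threshold value, and the function names below are illustrative assumptions, not the cited papers’ exact settings.

```python
import torch
import torch.nn.functional as F

def identify(query_emb, prototypes, ids, threshold=0.35):
    """query_emb: (D,), prototypes: (K, D), ids: list of K individual IDs."""
    q = F.normalize(query_emb, dim=0)
    p = F.normalize(prototypes, dim=1)
    dist = 1.0 - p @ q                  # cosine distance to each prototype
    best = torch.argmin(dist)
    if dist[best] > threshold:          # too far from every known prototype
        return "new_individual", dist[best].item()
    return ids[best], dist[best].item()

# Usage: a new individual is enrolled online by simply appending its prototype.
protos = torch.randn(10, 128)           # 10 known cows, 128-D embeddings
label, d = identify(torch.randn(128), protos, [f"cow_{i}" for i in range(10)])
```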
In recent years, image enhancement techniques have been widely applied in deep vision tasks. Methods such as denoising, rotation, cropping, blurring, brightness adjustment, and noise injection have been employed to improve image quality or augment the sample space, thereby effectively enhancing model robustness and generalization [29,30,31]. However, existing studies have mainly focused on the role of image enhancement in improving model generalization during the training phase, while the feasibility of applying image enhancement to enhance prototype discriminability during the prototype construction stage of incremental identification has not yet been investigated.
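As a sketch of this direction, the prototype for a newly enrolled individual could be built from augmented views of its few enrollment images. The snippet below assumes a trained embedding network `net` that maps image batches to (N, D) embeddings; the specific transforms and the element-wise median are illustrative of the strategy evaluated later in this paper, not a definitive recipe.

```python
import torch
from torchvision import transforms

# Light augmentations applied to float tensor images in [0, 1].
augment = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.GaussianBlur(kernel_size=5),
])

@torch.no_grad()
def build_prototype(net, images, n_aug=4):
    """images: list of a few (C, H, W) tensors for one individual."""
    embs = []
    for img in images:
        views = [img] + [augment(img) for _ in range(n_aug)]
        embs.append(net(torch.stack(views)))   # (n_aug + 1, D) per image
    embs = torch.cat(embs)                     # all views of this individual
    return embs.median(dim=0).values           # outlier-robust prototype
```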
In summary, existing studies exhibit significant deficiencies in feature expression, adaptability to open scenes, and stability in incremental identification. To address these issues, this study proposes an incremental identification framework utilizing the feature extraction network ResWTA and a few-shot augmented prototype network. This framework enhances the low-frequency feature expression of the overall dorsal patch structure of cattle while achieving robust online incremental identification of new individuals, thereby filling the research gap concerning the incremental identification of Holstein cattle in open, top-view scenarios.
5. Discussion
Table 3 indicates that ResWTA achieves a 3.02% improvement in top-1 accuracy over the baseline model, and that ResWTA-101 and ResWTA-X101 improve top-1 accuracy to differing degrees relative to their unmodified counterparts. This gain is ascribed to the wavelet convolution of the WTAConv module, which delineates the low-frequency contours and overall pattern topology of the cattle’s back, while the spatial attention mechanism emphasizes critical pattern details and mitigates background interference. This aligns with the findings of Finder et al. (2024), who showed that wavelet convolution can enlarge the effective receptive field of the network and improve feature extraction capabilities [39].
Analysis of Table 3 indicates that as the number of model parameters increases, the identification performance of the network gradually improves. However, the identification accuracy of deep networks such as ResNet101 and ResNeXt101 is lower than that of ResNet50, and the classification accuracy of ResWTA-101 and ResWTA-X101 is likewise lower than that of ResWTA. This phenomenon, often described as ‘larger model but lower performance’, is closely related to overfitting. The backs of Holstein cattle targeted in this study possess relatively simple and well-defined texture features; consequently, the high capacity of deep networks may lead them to fit random noise or chance patterns from a limited number of samples, ultimately diminishing generalization performance. This observation aligns with Zhang et al. (2016), who found that high-capacity deep networks are prone to fitting noise in datasets characterized by simple feature structures or low sample complexity, resulting in degraded generalization [40]. Furthermore, Janik et al. (2021) demonstrated, through effective-dimensionality and complexity-entropy metrics, that shallow or moderately deep networks capture primary feature patterns more efficiently, while indiscriminately increasing model depth significantly heightens the risk of overfitting [41]. Similarly, Basha et al. (2019) found that for datasets with low feature dimensions and strong structural regularities, shallow or medium-depth networks outperform deeper models in classification performance [42].
In the field of individual identification based on cattle body patterns, several empirical results align with the findings of this paper. Wang et al. (2023) [43], Hou et al. (2021) [24], and Ma et al. (2025) [28] demonstrated in their studies on individual cattle identification that a lower number of parameters and moderate network depth tend to yield better generalization. Wang et al. (2024) [14] showed that ResNet18 surpassed ResNet50 in the specific task of individual cattle back identification. This further substantiates that in low-complexity visual classification tasks, such as cattle back patterns, increasing the model’s parameters and depth does not inherently improve performance.
This study utilizes t-SNE dimensionality reduction to visualize the feature distribution of the feature extraction network before and after training, as illustrated in Figure 8. The feature points extracted by the untrained ResWTA network are distributed in a disordered manner, making it challenging to differentiate between individuals. In contrast, after training, the sample categories in the feature space demonstrate distinct clustering, with samples of the same class converging closely and those from different classes being effectively separated with clear boundaries. This observation aligns with the quantitative results presented in Table 3, indicating that the ResWTA network is capable of extracting more discriminative features of the cattle’s back.
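For reference, this style of visualization can be reproduced with scikit-learn’s t-SNE and matplotlib; the perplexity value and plot styling below are illustrative choices rather than the exact settings used in this study.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """features: (N, D) array of embeddings; labels: (N,) integer individual IDs."""
    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap="tab20", s=8)
    plt.title(title)
    plt.savefig(f"{title}.png", dpi=200)

# e.g., plot_tsne(untrained_feats, ids, "before_training") versus after training.
```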
This research employs the LayerCAM algorithm to visualize heatmaps of the Stage 4 outputs of ResNet50, ResNet101, ResNeXt101, and the enhanced models, as illustrated in Figure 9. The analysis indicates that the models incorporating the WTAConv module exhibit stronger feature activation in the critical texture regions of the cattle’s back than the original models. Furthermore, the activated regions align more closely with the actual pattern boundaries, reflecting a more robust overall focus. This demonstrates that WTAConv can effectively enhance the representation of cattle back patterns.
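A sketch of this procedure is given below, assuming the open-source pytorch-grad-cam package (rather than the authors’ own implementation) and a torchvision-style ResNet whose `layer4` corresponds to the Stage 4 outputs visualized in Figure 9.

```python
from pytorch_grad_cam import LayerCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image

def stage4_heatmap(model, input_tensor, rgb_img, class_idx):
    """rgb_img: float32 HxWx3 in [0, 1]; input_tensor: (1, 3, H, W)."""
    cam = LayerCAM(model=model, target_layers=[model.layer4])  # Stage 4 features
    mask = cam(input_tensor=input_tensor,
               targets=[ClassifierOutputTarget(class_idx)])[0]  # (H, W) in [0, 1]
    return show_cam_on_image(rgb_img, mask, use_rgb=True)       # overlaid heatmap
```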
Further analysis indicates that the identification performance of ResNet101, ResNeXt101, and their variants is inferior to that of ResNet50 and ResWTA despite their larger parameter counts. The primary reason is that the feature structure of cattle back patterns is straightforward and predominantly depends on low-frequency global information for individual identification, so the increased model capacity easily leads to overfitting to non-discriminative noise, as evidenced by the first, third, and fifth images in Figure 9d and the first, fourth, and sixth images in Figure 9f, which show attention to irrelevant or noisy regions, consequently impairing the model’s accuracy. This further confirms that increasing the model’s parameters and depth does not inherently improve performance in low-complexity visual classification tasks such as cattle back pattern identification.
To evaluate the model’s feature extraction capabilities in complex environments, this study first applies four types of enhancement to the original images: grayscaling, blurring, brightness and contrast adjustment, and random occlusion of small regions. These enhancements replicate production-environment interferences, including nighttime infrared illumination, lens defocus, natural light variations, and contamination on the cattle’s body surface and the lens. Subsequently, LayerCAM is used to generate Stage 4 output heatmaps of the ResNet50 and ResWTA models on the enhanced data, as illustrated in Figure 10. A comparison of Figure 10c,d indicates that ResWTA maintains superior focus on the overall pattern structure of the cattle’s back under the aforementioned interference conditions, demonstrating significantly better anti-interference capability than ResNet50. This verifies that the combination of low-frequency wavelet convolution and the spatial attention mechanism effectively enhances the model’s ability to extract overall pattern features and improves local anti-interference performance.
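The four perturbations can be approximated with standard torchvision transforms applied to float tensor images, as in the hedged sketch below; the exact parameter values used in this study are not reproduced here, and the ones shown are illustrative.

```python
from torchvision import transforms

perturbations = {
    "grayscale": transforms.Grayscale(num_output_channels=3),       # IR-like imaging
    "blur": transforms.GaussianBlur(kernel_size=9, sigma=3.0),      # lens defocus
    "light": transforms.ColorJitter(brightness=0.5, contrast=0.5),  # light variation
    "occlusion": transforms.RandomErasing(p=1.0, scale=(0.02, 0.08)),  # dirt/specks
}
# Apply each to tensor test images, then re-run LayerCAM as sketched above.
```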
This study utilizes t-SNE dimensionality reduction to assess the impact of Triplet loss on the final discriminative features, employing one set of prototype samples (five samples) for each individual in the test set; Figure 11 presents the visualization results. Analysis of Figure 11a,d reveals that the features extracted by ResWTA prior to training are scattered randomly in the low-dimensional space, with unclear category boundaries. After training ResWTA with cross-entropy loss, the clustering in the feature space becomes progressively clearer, as illustrated in Figure 11b,e, which aligns closely with the model’s visualization on the validation set. Replacing cross-entropy loss with Triplet loss further enhances the discriminative nature of the generated features, as depicted in Figure 11c,f: the features are more tightly clustered within classes and more distinctly separated between classes. This improvement can be attributed to the ability of Triplet loss to pull similar samples closer together while pushing dissimilar samples apart, thereby enhancing classification performance, which aligns with existing research conclusions [35,44]. Additionally, same-colored triangles mark the prototype positions of the five samples in Figure 11c, indicating that the prototypes are highly representative of the samples and exhibit strong discriminability.
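For clarity, the triplet objective discussed here corresponds to PyTorch’s built-in TripletMarginLoss; the margin value in the sketch below is an illustrative assumption rather than the setting used in this study.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3)
anchor, positive, negative = (torch.randn(32, 128, requires_grad=True)
                              for _ in range(3))
loss = triplet(anchor, positive, negative)  # pulls same-cow embeddings together,
loss.backward()                             # pushes different-cow embeddings apart
```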
The results of the ablation experiments presented in Table 5 indicate that employing ResWTA for feature extraction improves the model’s top-1 accuracy by 2.33 percentage points compared to ResNet50. Introducing the median prototype mechanism further increases accuracy by 3.63 percentage points, while the image enhancement strategy contributes an additional 1.14 percentage points of top-1 accuracy. These findings substantiate that improvements to the feature extraction network, the prototype mechanism, and the image enhancement strategy collectively enhance incremental identification performance. Specifically, WTAConv significantly strengthens the model’s capability to characterize the overall mottled contour of the cattle’s back by amplifying low-frequency and global features. The median mechanism is more robust to outlier samples during prototype construction, effectively mitigating noise interference and thereby stabilizing category discrimination. Moreover, the image enhancement strategy bolsters the model’s adaptability to environmental interferences, such as light fluctuations and occlusion, by expanding the training sample space. The synergistic effect of these three strategies systematically improves the model’s generalization ability and robustness in open incremental identification scenarios.
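The robustness of the median prototype to outlier samples can be seen in a toy example: a single corrupted enrollment embedding drags the mean prototype away while leaving the median essentially unchanged. The numbers are synthetic, for intuition only.

```python
import torch

embs = torch.ones(5, 4)            # five well-behaved sample embeddings
embs[0] += 10.0                    # one outlier (e.g., an occluded sample)
print(embs.mean(dim=0))            # tensor([3., 3., 3., 3.]) - dragged away
print(embs.median(dim=0).values)   # tensor([1., 1., 1., 1.]) - stable
```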
Table 7 summarizes the principal types of identification errors observed in the test set during the experiments, including high pattern similarity, inconspicuous features, and extensive occlusion.
Table 7(a) shows errors caused by high pattern similarity. The low-frequency wavelet convolution (WTAConv) adopted in this study effectively enhances low-frequency features and improves the overall representation of pattern structure. However, as the model does not explicitly model high-frequency details, some critical high-frequency cues may not be fully utilized, resulting in certain limitations when distinguishing individuals with highly similar patterns. This phenomenon is consistent with the perspective of Fan et al. (2023), who emphasized that only by fully leveraging the complementarity of high- and low-frequency information can the discrimination accuracy of object edges and details be significantly improved [45].
Table 7(b),(d),(e) show errors caused by inconspicuous features (i.e., sparse patterns or nearly solid coloration), where the root cause is the limited discriminative features extractable by the model from such individuals. Previous studies have demonstrated that the distinctive black-and-white patterns of Holstein cattle are the primary visual basis for individual identification, and mainstream feature extraction and modeling methods are highly dependent on this information [1,15]. Therefore, when the pattern features of the target individual are insufficiently prominent, identification performance is significantly affected.
Table 7(c) presents errors caused by extensive occlusion, where the absence of key feature regions prevents the model from acquiring sufficient discriminative information, leading to identification failure.
In conclusion, while the proposed model in this study effectively enhances the extraction and differentiation of Holstein cattle back features, there remains a risk of misidentification for individuals that exhibit high similarity, sparse features, or significant occlusion. The model enhances identification accuracy through the incorporation of multi-level wavelet transforms and spatial attention modules; however, this improvement is accompanied by an increase in training and inference time costs, resulting in a relatively high deployment cost on edge devices in agricultural settings. Future research will concentrate on the integration of RGB-D multimodal image data of Holstein cattle, examining identification techniques in scenarios characterized by high similarity, sparse features, and extensive occlusion, while also pursuing lightweight model development to improve applicability in actual farm settings.