Communication

Impact of Background Removal on Cow Identification with Convolutional Neural Networks

by Gergana Balieva, Alexander Marazov, Dimitar Tanchev, Ivanka Lazarova and Ralitsa Rankova
1 Veterinary Legislation Unit, Faculty of Veterinary Medicine, Trakia University, 6000 Stara Zagora, Bulgaria
2 Institute of Biophysics and Biomedical Engineering, Bulgarian Academy of Sciences, 1000 Sofia, Bulgaria
* Author to whom correspondence should be addressed.
Technologies 2026, 14(1), 50; https://doi.org/10.3390/technologies14010050
Submission received: 30 November 2025 / Revised: 7 January 2026 / Accepted: 8 January 2026 / Published: 9 January 2026
(This article belongs to the Special Issue Image Analysis and Processing)

Abstract

Individual animal identification is a cornerstone of animal welfare practices and is of crucial importance for food safety and the protection of humans from zoonotic diseases. It is also a key prerequisite for enabling automated processes in modern dairy farming. With newly emerging technologies, visual animal identification based on machine learning offers a more efficient and non-invasive method with high automation potential, accuracy, and practical applicability. However, a common challenge is the limited variability of training datasets, as images are typically captured in controlled environments with uniform backgrounds and fixed poses. This study investigates the impact of foreground segmentation and background removal on the performance of convolutional neural networks (CNNs) for cow identification. A dataset was created in which training images of dairy cows exhibited low variability in pose and background for each individual, whereas the test dataset introduced significant variation in both pose and environment. Both a fine-tuned CNN backbone and a model trained from scratch were evaluated using images with and without background information. The results demonstrate that although training on segmented foregrounds extracts intrinsic biometric features, background cues carry more information for individual recognition.

1. Introduction

The globally recognized One Health concept emphasizes the importance of animal management focused on controlling zoonotic diseases and ensuring food safety through risk management at the animal source, while also improving animal welfare and productivity [1]. The basis for efficient livestock management lies in the effective control of animal movements [2], which in turn ensures traceability of animal products and supports the monitoring of animal behavior and herd health. Individual animal identification plays a crucial role in modern livestock management, with ear tags and RFID (radio-frequency identification) transponders being the most widely used methods [3]. However, these methods are physically invasive, susceptible to loss or fraud, and prone to errors; therefore, new automated and non-invasive alternatives are being sought [4].
In recent years, computer vision and machine learning have emerged as promising approaches for visual biometric identification of animals. Convolutional neural networks (CNNs) have achieved high accuracy in recognizing various animal species, including sheep, pigs, cows, and even wild species [5,6]. They are capable of automatically extracting complex visual features and distinguishing individual animals even with small differences in appearance [7].
Nonetheless, numerous studies have shown that models trained in controlled environments often lose accuracy when applied to new or more diverse settings. This occurs because CNN models often learn background features (e.g., walls, lighting, equipment) rather than the intrinsic biometric features of the animal itself. This leads to overfitting to a specific background, which compromises the model’s generalization ability [4,8].
Analogous trends have been observed in general classification tasks, where the background acts as a strong predictive signal for the model. Removing or neutralizing background information can significantly improve the robustness of neural networks to unfamiliar data [9].
Segmentation of image foreground and background has proven to be an effective method for reducing background interference [10]. Researchers have developed algorithms for separating foreground and background regions, showing that independent modeling of the two components can improve object recognition accuracy [11].
This study explores how background removal influences cow identification with neural networks by evaluating the performance of networks trained on segmented foreground images (cow only), full images (cow plus background), and background-only images. We also compared training strategies, using both pre-trained and fine-tuned models, across two architectures: ResNet-50 and the Swin Transformer. The presence of constant backgrounds (e.g., walls and pens) in our dataset may facilitate recognition; however, this reflects the natural conditions of most livestock environments. Unlike artificially created datasets, in which subjects are isolated or backgrounds are randomized, farm environments are inherently repetitive, which can inadvertently train models to rely on contextual features. Accordingly, the experimental goal was to quantify whether constraining the model to rely on foreground features improves robustness to variations in pose and background. Importantly, isolating the foreground tests the model's ability to focus on the animal's biometric cues rather than its environmental context, thereby addressing a previously understudied axis of variation in cattle identification. Over-reliance on these static cues, as demonstrated in our background-only experiments, poses a risk to generalization: if an animal is moved to a different stall or farm, a background-dependent model may fail to identify it correctly. Therefore, the goal of this study is not simply to assess foreground performance, but also to quantify the extent to which background replacement with white noise can support the development of models in realistic farm settings.
The paper is organized as follows: Section 2 describes the Materials and Methods, including the dataset composition, data collection protocols, segmentation procedures, and training setup for both ResNet-50 and Swin Transformer models. It also outlines the evaluation strategy using Top-k accuracy and cross-validation. Section 3 presents the experimental results across different background conditions and model architectures, supported by statistical analysis. Finally, Section 4 concludes the paper by summarizing the main findings and outlining future directions for improving biometric consistency and reducing contextual bias in livestock identification systems.

2. Materials and Methods

The present study represents the second-stage experiment of an ongoing research project on visual cow face identification [4]. The dataset of cow images prepared during the first phase of the project was expanded with new photographs taken on several farms in southern Bulgaria between August and September 2025. Images were captured in daylight at the farm sites with a Panasonic Lumix DMC-GX8 digital camera (Panasonic Corporation, Osaka, Japan) equipped with a 15 mm f/1.7 lens, resulting in a total of 599 images of 166 dairy cows (all adult individuals with typical breed characteristics). At least three to four images were captured per animal, with records maintained for each cow to ensure accurate traceability and identification. After the expanded dataset was created, it was deduplicated as described below. The dataset represents images that can be easily obtained under typical farm conditions. Images of cows were captured sequentially, with limited variability in pose and background for each individual animal. This uniformity may lead a neural network to rely disproportionately on background similarity rather than on the distinctive biometric features of each animal.
Annotation and segmentation were performed using the computer vision platform Roboflow Pro (version October 2025), in which veterinarians manually annotated pixel-accurate masks for two classes: foreground (the cow) and background (the surrounding environment) [12,13]. Roboflow's Smart Polygon tool leverages AI, including the Segment Anything Model, to enable rapid and precise instance segmentation annotation by generating detailed polygon masks. The tool provides several refinement options to improve annotation accuracy. Users can click inside the generated mask to remove unwanted regions or outside it to expand the selection, allowing for precise adjustments. The tool also supports click-and-drag to define a region of interest, which is especially useful for ensuring precise separation between the animal and surrounding objects such as walls, feeding structures, and equipment. Additionally, users can adjust the polygon's complexity using the Convex Hull, Smooth, or Complex settings to control vertex density. Vertices can be manually edited, and points can be undone if needed, ensuring full control over the final segmentation mask. The resulting masks were exported in the COCO format [14] for the training workflows.
To generate the foreground-only variant, background pixels were replaced with independently sampled white noise, thereby eliminating environmental cues while preserving the original features of the animal. The background-only images were produced in the same way, by replacing the foreground pixels with white noise instead.
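For illustration, a minimal NumPy sketch of this masking step is shown below. The function name is illustrative, and it assumes the COCO mask has already been decoded into a boolean array; it is a sketch of the described procedure, not the original pipeline.

```python
import numpy as np
from PIL import Image


def replace_region_with_noise(image_path, mask, keep_foreground=True, seed=None):
    """Replace either the background or the foreground of an image with
    independently sampled white noise.

    mask: boolean array of shape (H, W), True where the cow (foreground) is located.
    """
    rng = np.random.default_rng(seed)
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.uint8)

    # Uniform white noise over the full 8-bit range, sampled per pixel and channel.
    noise = rng.integers(0, 256, size=img.shape, dtype=np.uint8)

    if keep_foreground:
        # Foreground-only variant: keep cow pixels, overwrite the background.
        out = np.where(mask[..., None], img, noise)
    else:
        # Background-only variant: keep environment pixels, overwrite the cow.
        out = np.where(mask[..., None], noise, img)
    return Image.fromarray(out)
```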
A ResNet-50 [15] and a Swin Transformer [16] were chosen as model architectures. Both were initialized with ImageNet pre-trained weights, their final classification layers were removed to extract embeddings, and the networks were then fine-tuned on an Nvidia B200 (Nvidia, Santa Clara, CA, USA) graphics processing unit with 180 GB of RAM.
As the focus of the project was the task of identification of cow images, the triplet loss function was used [17,18]. The loss function uses triplets of samples—each containing an anchor (the reference example), a positive sample (similar to the anchor), and a negative sample (dissimilar to the anchor). The model was trained to bring similar items closer together in the embedding space while pushing dissimilar ones farther apart.
As described in [4], the triplet loss function can be expressed mathematically as
L(a, p, n) = max(d(a, p) − d(a, n) + α, 0)
where d(a, p) is the L2 distance between the anchor, a, and the positive sample, p; d(a, n) is the distance between the anchor and the negative sample, n; and α is a positive margin that ensures a gap between the distances of positive and negative pairs.
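For reference, a minimal PyTorch sketch of this loss is given below; the margin value of 0.2 is an illustrative assumption, since the value used in the experiments is not restated here.

```python
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss: max(d(a, p) - d(a, n) + margin, 0),
    with d(., .) the L2 distance between embeddings.

    anchor, positive, negative: tensors of shape (batch, embedding_dim).
    """
    d_ap = F.pairwise_distance(anchor, positive, p=2)
    d_an = F.pairwise_distance(anchor, negative, p=2)
    return F.relu(d_ap - d_an + margin).mean()


# PyTorch's built-in equivalent:
# loss_fn = torch.nn.TripletMarginLoss(margin=0.2, p=2)
```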

2.1. Deduplication

To identify near-duplicate images, perceptual hashing (pHash) was used. Images whose hashes differed by no more than a selected Hamming-distance threshold of 10 were grouped as near-duplicates, and only one sample from each group was retained. Figure 1 shows an example of images that exceed the similarity threshold and are therefore considered near-duplicates.
The perceptual hash produced a 64-bit code by resizing an image to 32 × 32 pixels, converting it to grayscale, applying a 2D Discrete Cosine Transform (DCT), and retaining the top-left 8 × 8 block of low-frequency coefficients. The median of these 64 values was computed, and each coefficient was compared to it: if greater, the corresponding bit was set to 1; otherwise, to 0. This resulted in a balanced hash that was robust to minor visual distortions. Similar images were identified using the Hamming distance between hashes [19]. Examples of images from the dataset can be found in Figure 2.
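A minimal NumPy/SciPy sketch of the hashing and comparison steps described above is given below; it follows the published description rather than the exact implementation used in the study.

```python
import numpy as np
from PIL import Image
from scipy.fft import dctn


def phash(image_path, hash_size=8, resize_to=32):
    """64-bit perceptual hash: resize to 32x32 grayscale, apply a 2D DCT,
    keep the top-left 8x8 low-frequency block, threshold at the median."""
    img = Image.open(image_path).convert("L").resize((resize_to, resize_to))
    coeffs = dctn(np.asarray(img, dtype=np.float64), norm="ortho")
    low_freq = coeffs[:hash_size, :hash_size]
    return (low_freq > np.median(low_freq)).flatten()


def hamming_distance(h1, h2):
    """Number of differing bits between two boolean hash arrays."""
    return int(np.count_nonzero(h1 != h2))


# Images whose hashes differ by no more than the chosen threshold (10 in this
# study) are treated as near-duplicates, and only one of them is retained.
```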

2.2. Data Augmentation and Preprocessing

To enhance model robustness and generalization, we applied distinct transformation pipelines to the training and testing datasets. Training transformations incorporated data augmentation, whereas testing focused on standardization. For training, images were first converted to tensors and then randomly cropped and resized to 1000 × 1000 pixels using a scale between 0.9 and 1.0. Horizontal flips were applied with a 50% probability, followed by random rotations of up to 15 degrees. Small affine transformations were introduced, including rotations of up to 5 degrees, translations of up to 5%, and scaling between 0.95 and 1.05. Finally, tensors were normalized using ImageNet mean and standard deviation values. These augmentations increased dataset diversity and helped reduce overfitting. For testing, images were resized to 1000 × 1000 pixels, converted to tensors, and normalized using the same statistics. This ensured consistent preprocessing without augmentation, enabling fair evaluation.
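A torchvision sketch of the two pipelines, assuming the parameters listed above, could look as follows; the exact transform ordering and library version in the original code may differ.

```python
from torchvision import transforms

# ImageNet normalization statistics.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training: augmentation for diversity and reduced overfitting.
train_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomResizedCrop(1000, scale=(0.9, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05)),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Testing: deterministic resizing and normalization only.
test_transform = transforms.Compose([
    transforms.Resize((1000, 1000)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```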

2.3. Cross-Validation Methodology

To robustly evaluate model performance, three-fold cross-validation on a dataset of 125 individuals was used, dividing the data into three approximately equal groups. In each fold, one group served as the training set, another as the validation set for early stopping to prevent overfitting, and the third as the test set for final evaluation. Specifically, for Fold 1, the groups were assigned as training, validation, and test, respectively; for Fold 2, the assignments cycled such that the original test group became training, training became validation, and validation became test; for Fold 3, the cycle continued with validation serving as training, test as validation, and training as test. This ensured that each group was used for testing exactly once, providing unbiased performance estimates across the entire dataset.
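The cyclic role assignment can be summarized by the following sketch, in which the group labels are placeholders for the three disjoint sets of individuals.

```python
# Three disjoint groups of individual cows; the labels are placeholders.
group_names = ["A", "B", "C"]

# Roles rotate so that each group serves as the test set exactly once.
fold_assignments = []
for fold in range(3):
    fold_assignments.append({
        "train": group_names[(0 - fold) % 3],
        "val":   group_names[(1 - fold) % 3],
        "test":  group_names[(2 - fold) % 3],
    })

# fold_assignments == [
#     {'train': 'A', 'val': 'B', 'test': 'C'},
#     {'train': 'C', 'val': 'A', 'test': 'B'},
#     {'train': 'B', 'val': 'C', 'test': 'A'},
# ]
```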

2.4. Model Architecture

ResNet-50 and Swin Transformer (Swin-T) were selected as backbones due to their established status in computer vision research and extensive validation across numerous benchmarks and real-world applications [20]. Despite this commonality, the two architectures represent fundamentally different design paradigms: ResNet-50 is a convolutional neural network that emphasizes local feature hierarchies through residual connections, whereas Swin-T is a vision transformer that prioritizes global context via self-attention mechanisms.
Swin Transformer was used as a backbone for the embeddings by first initializing with ImageNet pre-trained weights. Then, the final classification head was removed to extract pooled features directly. The output dimension was 768, representing the embedding space for image representations.
For ResNet-50 as the embedding backbone, the weights were likewise initialized with ImageNet pre-trained weights. The final classification layer was removed so that features were extracted from the average pooling layer. The output dimension was 2048, representing the embedding space for image representations. During the forward pass, input images were processed through the backbone to obtain feature maps, which were then flattened into vectors. These vectors served as embeddings, capturing semantic information from the images for similarity comparison using cosine similarity.
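A minimal torchvision sketch of this backbone construction is shown below; the specific pretrained weight enums are illustrative assumptions, not a statement of the exact checkpoints used.

```python
import torch.nn as nn
from torchvision import models


def build_backbone(name="resnet50"):
    """Return an ImageNet-pretrained backbone with its classification head
    removed, so it outputs embeddings (2048-d for ResNet-50, 768-d for Swin-T)."""
    if name == "resnet50":
        net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        net.fc = nn.Identity()    # keep the global average-pooled 2048-d features
    elif name == "swin_t":
        net = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)
        net.head = nn.Identity()  # keep the pooled 768-d features
    else:
        raise ValueError(name)
    return net


# At retrieval time, embeddings are compared with cosine similarity:
# sim = torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=-1)
```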

2.5. Training Methodology

The training process utilized the PyTorch Lightning trainer framework [21], configured with a maximum of 100 epochs to allow sufficient iterations for convergence. An early-stopping callback monitored the validation loss, halting training if no improvement occurred for 20 consecutive epochs to prevent overfitting and optimize computational efficiency. This approach ensures the model trains until either convergence or the patience threshold is reached, balancing thorough learning with resource constraints. After each epoch, the trainer evaluated the triplet loss on the validation set to track generalization.
Optimization employed the Adam optimizer with a learning rate of 1 × 10⁻⁵ and default values for the remaining hyperparameters. Model checkpointing saved the best-performing version based on the validation triplet loss. This setup promotes stable and efficient training across the cross-validation folds and the two backbone architectures.
Triplets were sampled dynamically during training. For each mini-batch, an anchor-positive pair was selected from the same class, while a hard negative was selected from a different class with the closest embedding distance, based on current model predictions. This semi-hard mining strategy encouraged the model to learn more discriminative embeddings.
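A simplified sketch of such batch-wise negative mining is given below; it selects the hardest in-batch negative for each anchor-positive pair and is an illustration of the idea rather than the exact sampler used in training.

```python
import torch


def mine_triplets(embeddings, labels):
    """Batch-wise hard-negative mining: for every anchor-positive pair of the
    same class, pick the closest embedding of a different class as negative.

    embeddings: (batch, dim) tensor; labels: (batch,) tensor of identity IDs.
    Returns index tensors (anchors, positives, negatives).
    """
    dists = torch.cdist(embeddings, embeddings, p=2)  # pairwise L2 distances
    idx = torch.arange(len(labels), device=labels.device)
    anchors, positives, negatives = [], [], []
    for a in range(len(labels)):
        same = (labels == labels[a]) & (idx != a)
        diff = labels != labels[a]
        if not same.any() or not diff.any():
            continue
        for p in torch.nonzero(same, as_tuple=False).flatten():
            # Hardest negative: different class, smallest distance to the anchor.
            neg_dists = dists[a].clone()
            neg_dists[~diff] = float("inf")
            n = torch.argmin(neg_dists)
            anchors.append(a)
            positives.append(p.item())
            negatives.append(n.item())
    return torch.tensor(anchors), torch.tensor(positives), torch.tensor(negatives)
```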
Training was conducted using PyTorch Lightning (v2.6.0) and PyTorch (v2.8.0) [22].

2.6. Evaluation Metrics

Model performance in identification tasks was evaluated using multiple metrics to assess ranking accuracy and precision. Top-k accuracy (k = 1, 3, and 5) measures the proportion of tests where the true identity appears among the k most probable predictions, suitable for tasks with high inter-class similarity. For retrieval-based evaluation, Mean Average Precision (mAP) computes the average precision across ranked reference sets, by sorting references by distance to test embeddings and averaging precision at correct identity positions. Cumulative Match Characteristic (CMC) reports the cumulative accuracy at specific ranks (e.g., rank-1, rank-5), indicating the fraction of tests with correct identity within top-k. Confidence intervals (CI) at 95% for top-k and mAP were calculated.
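For clarity, a minimal NumPy sketch of the Top-k accuracy and mAP computations over cosine-similarity rankings is given below; variable names are illustrative, and Top-k accuracy here coincides with the CMC value at rank k.

```python
import numpy as np


def topk_accuracy(test_emb, test_ids, ref_emb, ref_ids, k=1):
    """Fraction of test embeddings whose true identity appears among the
    k most similar reference embeddings (cosine similarity). Inputs are NumPy arrays."""
    test_n = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    ref_n = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = test_n @ ref_n.T                     # (n_test, n_ref) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k most similar references
    hits = [test_ids[i] in ref_ids[topk[i]] for i in range(len(test_ids))]
    return float(np.mean(hits))


def mean_average_precision(test_emb, test_ids, ref_emb, ref_ids):
    """mAP: rank references by similarity and average the precision at the
    positions where the correct identity appears."""
    test_n = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    ref_n = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = test_n @ ref_n.T
    aps = []
    for i in range(len(test_ids)):
        order = np.argsort(-sims[i])
        correct = (ref_ids[order] == test_ids[i]).astype(float)
        if correct.sum() == 0:
            continue
        precision_at_hits = np.cumsum(correct) / np.arange(1, len(correct) + 1)
        aps.append((precision_at_hits * correct).sum() / correct.sum())
    return float(np.mean(aps))
```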

3. Results and Discussion

The experimental framework aims to evaluate how the removal of background information influences the performance of neural networks in the context of visual cow identification. Four distinct training configurations were implemented for each of the two architectures.
Three variants of the dataset were subjected to the experimental analysis:
  • original images;
  • foreground only (after segmentation), in which the background was replaced with white noise;
  • background only, in which the foreground was replaced with white noise.
Results from the models pretrained only on ImageNet are also reported.
For each configuration, performance was assessed through the cross-validation procedure described in Section 2.3.
Table 1 and Table 2 show the results of the ResNet-50 and the Swin-T model, respectively. The 95% confidence intervals for the top-1, top-3, and top-5 accuracy estimates were calculated using the ordinary Wald interval method, a standard normal approximation for binomial proportions. For each accuracy proportion p derived from 125 test examples, the interval is computed as p ± z·√(p(1 − p)/n), where z = 1.96 (the z-score for 95% confidence) and n = 125. The values after the ± signs in the tables are these margins of error, expressed as percentages.
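A short sketch of this interval computation is given below; it reproduces, for example, the ±7.01% margin reported for a top-1 accuracy of 80.00% on 125 test examples.

```python
import math


def wald_interval(p, n=125, z=1.96):
    """95% Wald confidence interval for a binomial proportion p
    estimated from n test examples."""
    margin = z * math.sqrt(p * (1.0 - p) / n)
    return p - margin, p + margin


# Example: top-1 accuracy of 0.80 on 125 tests -> roughly 0.80 +/- 0.0701.
lo, hi = wald_interval(0.80)
```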
The ResNet-50 model exhibited top-1 accuracies ranging from 78.40% to 80.00% when evaluated on full images, with corresponding top-3 and top-5 accuracies reaching 82.40% to 86.40%. Similar accuracy levels have been reported in recent livestock biometrics research [23,24]. Performance on isolated backgrounds showed a decline, with top-1 accuracies dropping to 69.60–76.00%, while foreground-only tests yielded even lower results (54.40–60.00%). Notably, the ImageNet-pretrained variant performed comparably to fine-tuned models on full images but struggled more with segmented components.
In contrast, the Swin-T model demonstrated superior performance, achieving top-1 accuracies of 80.00–84.80% across all configurations, with top-5 accuracies occasionally exceeding 88.00%. On full images and backgrounds, accuracies remained in similar ranges (80.00–84.80% top-1), while foreground-only tests showed 78.40–84.00% top-1 accuracy. The ImageNet-pretrained Swin-T model maintained strong baseline performance, suggesting enhanced capability in handling spatial hierarchies for cow identification tasks.
In Table 3, mean Average Precision (mAP) was used to evaluate identification performance. For each test image, the reference set was ranked by increasing distance to the test embedding, obtained from a triplet loss-trained network. Average Precision (AP) per test was calculated as the mean precision at the ranks where correct identities appeared, accounting for multiple correct matches. mAP was then computed as the average AP across all tests. Confidence intervals (95%) were estimated using bootstrap resampling with 20,000 iterations and the Bias-Corrected and Accelerated (BCa) method, which corrects for bias and skewness in the sampling distribution; this is essential for mAP given the bounded [0, 1] range and potential non-normality of AP values [25].
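A sketch of this bootstrap procedure using SciPy is shown below; it assumes the per-test AP values are available as a list or array and illustrates the method rather than the exact script used.

```python
import numpy as np
from scipy.stats import bootstrap


def map_with_bca_ci(ap_values, n_resamples=20_000, confidence_level=0.95, seed=0):
    """Mean Average Precision with a BCa bootstrap confidence interval.

    ap_values: per-test Average Precision scores (one value per test image).
    """
    ap_values = np.asarray(ap_values, dtype=float)
    res = bootstrap(
        (ap_values,),              # one-sample data tuple
        np.mean,                   # the statistic: mAP is the mean of the AP values
        n_resamples=n_resamples,
        confidence_level=confidence_level,
        method="BCa",              # bias-corrected and accelerated intervals
        random_state=seed,
    )
    ci = res.confidence_interval
    return float(ap_values.mean()), (float(ci.low), float(ci.high))
```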
The mAP results demonstrate that Swin-T consistently outperforms ResNet-50 across all training and test set combinations, with improvements ranging from 1.9% to 20.3%. Notably, training on foreground regions yields the highest mAP for full-image tests with Swin-T (70.90%, 95% CI: 65.00–76.06%), suggesting that focusing on cow-specific features enhances generalization. Conversely, for foreground-only tests, ImageNet-pretrained models achieve peak performance (71.28%, 95% CI: 65.50–76.45% for Swin-T), indicating that pretraining on large-scale, diverse datasets can yield competitive identification performance on segment-specific regions. Overall, the confidence intervals confirm statistically significant performance differences in several comparisons, while the remaining differences are inconclusive.
The Cumulative Match Characteristic (CMC) plots in Figure 3 illustrate the ranking performance of ResNet-50 and Swin-T models across different training and test sets, with each subplot representing a test set (full images, backgrounds, or foregrounds). CMC measures retrieval precision in identification tasks: for a given rank k, it is the fraction of test images for which the correct identity appears within the top-k ranked predictions. Curves were computed by ranking predicted labels by confidence for each test image and then calculating the cumulative accuracy at ranks 1 through 10. Specifically, at rank 1, CMC equals top-1 accuracy. The results show that Swin-T consistently outperforms ResNet-50, with the highest CMC at rank 1 (e.g., 84% for the foreground-trained Swin-T on the full-image test set vs. 79.2% for ResNet-50).
Overall, Swin-T outperformed ResNet-50 in most test scenarios, particularly on segmented datasets, highlighting the advantages of transformer-based architectures over convolutional networks for this application. However, the reported confidence intervals overlap in several configurations, indicating that the observed differences may not be statistically significant. Thus, the results are indicative of potential trends rather than conclusive evidence of superiority. Both models showed reduced accuracy on foreground-only inputs, indicating that background substitution with white noise may not be sufficient to overcome the challenges of identification in farm environments. This finding corroborates earlier studies on the attribution of cow identification models [4] and multi-species identification tasks [8], which observed that background information in animal identification datasets often exhibits a strong relation with identity labels.
In the survey in [26], environmental factors and background patterns were likewise discussed as factors that can influence visual animal analysis.
Comparable patterns of background dependence have been reported in both wildlife re-identification [27] and cattle identification research [28,29]. However, the variability of performance highlights the importance of greater diversity in imaging conditions—such as pose, lighting, and multi-environment sampling—as emphasized in recent livestock biometrics reviews [23,24,30].
Ref. [9] found similar trends, showing that removing background often reduces accuracy but improves robustness to novel backgrounds. Ref. [11] likewise demonstrated that foreground-background separation enhances reliability by forcing networks to rely on object-relevant cues rather than environmental artifacts.
The observation that the ImageNet-pretrained model achieves accuracies between 59.2% and 84% suggests that such networks act as strong general-purpose feature extractors. Texture-bias studies [31] have shown that CNNs trained on natural images preferentially learn textural patterns, which can transfer to farm settings and allow identification based on visual context rather than biometric identity. This behavior matches findings from explainable AI analyses in livestock identification, where CNN attention maps often highlight large background regions [4]. These results also reinforce the conclusions of [4] that CNN-based cow identification often relies on background regions even when the animal's face is prominent. Similar phenomena were documented by [8], who observed that models trained without foreground-background separation in wildlife datasets exhibited substantial contextual overfitting.
In the present experiment, models trained exclusively on background images outperform those trained on foreground images when tested on full images, although this improvement is not statistically significant as the confidence intervals overlap. This suggests that the neural networks can implicitly use background characteristics as a proxy for identity, especially when background conditions vary across individuals but remain consistent for the same individual. Similar implications have been documented in multi-feature cattle identification systems [29].

4. Conclusions

The experimental results indicate that background information may influence the performance of neural-network-based cow identification models and, in certain settings, may contribute to confounding effects. The higher accuracy observed in background-inclusive configurations suggests that contextual cues can affect identity recognition, pointing to a potential tendency toward contextual reliance in visual animal recognition systems. In contrast, foreground-only segmentation can be viewed as a useful analytical tool for examining the contribution of intrinsic biometric features, rather than as a definitive or deployable preprocessing strategy.
It should be emphasized that replacing the background with white noise produces images that differ from those encountered in natural farm environments. Accordingly, this approach was used solely as a controlled ablation to suppress structured background information, and not as a realistic data representation. Alternative techniques that preserve isolated foregrounds while remaining closer to the natural image manifold—such as blurring, padding, or inpainting—may offer more practical avenues for future investigation, particularly with respect to model interpretability and robustness.
The comparative evaluation of CNN-based (ResNet-50) and transformer-based (Swin-T) architectures under multiple training configurations provides additional perspective on these observations. The stronger performance observed for Swin-T relative to ResNet-50 indicates that architectural choices may affect how biometric and contextual information are utilized.
Overall, this study points to the challenges associated with background uniformity in farm-based image datasets. While background-related features may contribute to improved performance within a single environment, they may also constrain generalization across different farms or camera configurations. Given the limited dataset size and the exploratory scope of the analysis, the findings should be considered indicative rather than conclusive. Future work should focus on larger and more diverse datasets and training strategies designed to reduce sensitivity to background variation, particularly in cross-environment deployment scenarios.

Author Contributions

Conceptualization, D.T. and A.M.; methodology, D.T. and A.M.; coding and machine learning, A.M.; validation, A.M., D.T. and I.L.; formal analysis, A.M.; investigation, A.M. and G.B.; resources, D.T. and I.L.; data curation, D.T. and A.M.; writing—original draft preparation, A.M. and G.B.; writing—review and editing, A.M., G.B., D.T. and R.R.; visualization, A.M.; supervision, A.M. and G.B.; project administration, G.B.; funding acquisition, G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Education and Science in Bulgaria within the framework of the Bulgarian National Recovery and Resilience Plan, Component “Innovative Bulgaria”, Project No. BG-RRP-2.004-0006-C02 “Development of research and innovation at Trakia University in the service of health and sustainable well-being”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available from the corresponding author upon reasonable request due to legal restrictions.

Acknowledgments

The authors express their gratitude to all farm owners who willingly cooperated during data collection and permitted the capture and subsequent use of digital images of their cattle for research purposes.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the study design; data collection, analysis, or interpretation; manuscript preparation; or the decision to publish the results.

References

  1. Danasekaran, R. One Health: A Holistic Approach to Tackling Global Health Issues. Indian J. Community Med. 2024, 49, 260–263. [Google Scholar] [CrossRef] [PubMed]
  2. Paudel, S.; Brown-Brandl, T. Advancements in Individual Animal Identification: A Historical Perspective from Prehistoric Times to the Present. Animals 2025, 15, 2514. [Google Scholar] [CrossRef] [PubMed]
  3. Peng, W.; Liu, Z.; Cai, J.; Zhao, Y. Research and application progress of electronic ear tags as infrastructure for precision livestock industry: A review. Intell. Robot. 2025, 5, 433–449. [Google Scholar] [CrossRef]
  4. Tanchev, D.; Marazov, A.; Balieva, G.; Lazarova, I.; Rankova, R. Exploring Attributions in Convolutional Neural Networks for Cow Identification. Appl. Sci. 2025, 15, 3622. [Google Scholar] [CrossRef]
  5. Kumar, S.; Singh, S.K. Cattle Recognition: A New Frontier in Visual Animal Biometrics Research. Proc. Natl. Acad. Sci. India Sect. A Phys. Sci. 2020, 90, 689–708. [Google Scholar] [CrossRef]
  6. Hansen, M.F.; Smith, M.L.; Smith, L.N.; Salter, M.G.; Baxter, E.M.; Farish, M.; Grieve, B. Towards on-farm pig face recognition using convolutional neural networks. Comput. Ind. 2018, 98, 145–152. [Google Scholar] [CrossRef]
  7. Kabuga, E.; Langley, I.; Arso Civil, M.; Measey, J.; Bah, B.; Durbach, I. Similarity learning networks uniquely identify individuals of four marine and terrestrial species. Ecosphere 2024, 15, e70012. [Google Scholar] [CrossRef]
  8. Picek, L.; Neumann, L.; Matas, J. Animal Identification with Independent Foreground and Background Modeling. arXiv 2024, arXiv:2408.12930. [Google Scholar] [CrossRef]
  9. Xiao, K.; Engstrom, L.; Ilyas, A.; Madry, A. Noise or Signal: The Role of Image Backgrounds in Object Recognition. arXiv 2020, arXiv:2006.09994. [Google Scholar] [CrossRef]
  10. Bello, R.W.; Mohamed, A.S.A.; Talib, A.Z. Cow Image Segmentation Using Mask R-CNN Integrated with Grabcut. In Proceedings of the International Conference on Emerging Technologies and Intelligent Systems (ICETIS 2021), Al Buraimi, Oman, 25–26 June 2021; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2022; Volume 322, pp. 23–32. [Google Scholar] [CrossRef]
  11. Mopaka, P.R. A Foreground-And-Background Region Segmentation Approach to Image Segmentation. Master’s Thesis, University of Johannesburg, Johannesburg, South Africa, 2024. [Google Scholar]
  12. Yadav, C.S.; Peixoto, A.A.T.; Rufino, L.A.L.; Silveira, A.B.; Alexandria, A.R.d. Intelligent Classifier for Identifying and Managing Sheep and Goat Faces Using Deep Learning. Agriengineering 2024, 6, 3586–3601. [Google Scholar] [CrossRef]
  13. Pavithra, M.; Karthikesh, P.S.; Buchammagari, J.; Mayakuntla, N.; Raja, C.; Krishna, K.L. Implementation of Enhanced Security System using Roboflow. In Proceedings of the 2024 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 14–15 March 2024; pp. 1–5. [Google Scholar] [CrossRef]
  14. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar] [CrossRef]
  15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  17. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar] [CrossRef]
  18. Schneider, S.; Taylor, G.W.; Linquist, S.; Kremer, S.C. Similarity Learning Networks for Animal Individual Re-Identification—Beyond the Capabilities of a Human Observer. arXiv 2019, arXiv:1902.09324. [Google Scholar] [CrossRef]
  19. Khanam, Z.; Ahsan, M.N. Implementation of pHash algorithm for face recognition in secured remote online examination system. Int. J. Adv. Sci. Res. Eng. 2018, 4, 1–5. [Google Scholar] [CrossRef]
  20. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar] [CrossRef]
  21. Falcon, W.; Borovec, J.; Wälchli, A.; Eggert, N.; Schock, J.; Jordan, J.; Skafte, N.; Bereznyuk, V.; Harris, E.; Murrell, T.; et al. PyTorchLightning/Pytorch-Lightning: 0.7.6 Release (0.7.6). 2020. Available online: https://zenodo.org/records/3828935 (accessed on 15 October 2025).
  22. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  23. Meng, H.; Zhang, L.; Yang, F.; Hai, L.; Wei, Y.; Zhu, L.; Zhang, J. Livestock Biometrics Identification Using Computer Vision Approaches: A Review. Agriculture 2025, 15, 102. [Google Scholar] [CrossRef]
  24. Kang, M.H.; Oh, S.H. Research trends in livestock facial identification: A review. J. Anim. Sci. Technol. 2025, 67, 43–55. [Google Scholar] [CrossRef] [PubMed]
  25. Efron, B. Better bootstrap confidence intervals. J. Am. Stat. Assoc. 1987, 82, 171–185. [Google Scholar] [CrossRef]
  26. Zeppelzauer, M. Automated detection of elephants in wildlife video. J. Image Video Process. 2013, 2013, 46. [Google Scholar] [CrossRef] [PubMed]
  27. Qiao, Y.; Clark, C.; Lomax, S.; Kong, H.; Su, D.; Sukkarieh, S. Automated individual cattle identification using video data: A unified deep-learning architecture approach. Front. Anim. Sci. 2021, 2, 759147. [Google Scholar] [CrossRef]
  28. Smink, M.; Liu, H.; Döpfer, D.; Lee, Y.J. Computer Vision on the Edge: Individual Cattle Identification in Real-time with ReadMyCow System. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 7056–7065. [Google Scholar] [CrossRef]
  29. Li, D.; Li, B.; Li, Q.; Wang, Y.; Yang, M.; Han, M. Cattle identification based on multiple feature decision layer fusion. Sci. Rep. 2024, 14, 26631. [Google Scholar] [CrossRef] [PubMed]
  30. Tassinari, P.; Bovo, M.; Benni, S.; Franzoni, S.; Poggi, M.; Mammi, L.M.E.; Mattoccia, S.; Stefano, L.D.; Bonora, F.; Barbaresi, A.; et al. A computer vision approach based on deep learning for the detection of dairy cows in free-stall barn. Comput. Electron. Agric. 2021, 182, 106030. [Google Scholar] [CrossRef]
  31. Geirhos, R.; Rubisch, P.; Michaelis, C.; Bethge, M.; Wichmann, F.A.; Brendel, W. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv 2018, arXiv:1811.12231. [Google Scholar] [CrossRef]
Figure 1. Example of images that were considered near-duplicates as measured by the pHash function. Only one example of a set of near duplicates is in the final dataset.
Figure 2. Both the (top) and (bottom) rows show the same animal. The images are below the pHash similarity threshold and hence both are included in the dataset. The dataset variants are: (a) original image; (b) foreground; (c) background; (d) original image; (e) foreground; (f) background.
Figure 3. Cumulative Match Characteristic (CMC) curves showing the percentage of correct identifications within the top-k ranks for different models, training sets, and test sets. ResNet-50 and Swin-T models are evaluated across ImageNet-pretrained and dataset-specific training on full images, backgrounds, or foregrounds, tested on corresponding sets. Higher curves indicate better ranking performance at lower ranks.
Table 1. Summary of ResNet-50 accuracy under different training and test configurations. Here, 'imagenet' refers to a model trained only on the ImageNet dataset, 'images' refers to the original images, 'background' refers to only the background, and 'foreground' refers to only the cow.
Training Set | Test Set   | Top-1 (%)    | Top-3 (%)    | Top-5 (%)
imagenet     | images     | 80.00 ± 7.01 | 83.20 ± 6.55 | 86.40 ± 6.01
images       | images     | 78.40 ± 7.21 | 84.00 ± 6.43 | 85.60 ± 6.15
background   | images     | 78.40 ± 7.21 | 82.40 ± 6.68 | 85.60 ± 6.15
foreground   | images     | 79.20 ± 7.12 | 82.40 ± 6.68 | 85.60 ± 6.15
imagenet     | background | 69.60 ± 8.06 | 77.60 ± 7.31 | 83.20 ± 6.55
images       | background | 76.00 ± 7.49 | 79.20 ± 7.12 | 79.20 ± 7.12
background   | background | 74.40 ± 7.65 | 78.40 ± 7.21 | 83.20 ± 6.55
foreground   | background | 70.40 ± 8.00 | 79.20 ± 7.12 | 82.40 ± 6.68
imagenet     | foreground | 59.20 ± 8.62 | 72.00 ± 7.87 | 80.80 ± 6.90
images       | foreground | 60.00 ± 8.59 | 73.60 ± 7.73 | 77.60 ± 7.31
background   | foreground | 54.40 ± 8.73 | 66.40 ± 8.28 | 73.60 ± 7.73
foreground   | foreground | 57.60 ± 8.66 | 70.40 ± 8.00 | 76.80 ± 7.40
Table 2. Summary of Swin-T accuracy under different training and test configurations. Here, 'imagenet' refers to a model trained only on the ImageNet dataset, 'images' refers to the original images, 'background' refers to only the background, and 'foreground' refers to only the cow.
Training Set | Test Set   | Top-1 (%)    | Top-3 (%)    | Top-5 (%)
imagenet     | images     | 84.00 ± 6.43 | 84.80 ± 6.29 | 88.00 ± 5.70
images       | images     | 84.00 ± 6.43 | 86.40 ± 6.01 | 88.80 ± 5.53
background   | images     | 84.80 ± 6.29 | 85.60 ± 6.15 | 88.80 ± 5.53
foreground   | images     | 84.00 ± 6.43 | 86.40 ± 6.01 | 88.80 ± 5.53
imagenet     | background | 80.00 ± 7.01 | 84.00 ± 6.43 | 86.40 ± 6.01
images       | background | 78.40 ± 7.21 | 84.00 ± 6.43 | 84.80 ± 6.29
background   | background | 80.80 ± 6.90 | 84.00 ± 6.43 | 84.80 ± 6.29
foreground   | background | 81.60 ± 6.79 | 84.00 ± 6.43 | 85.60 ± 6.15
imagenet     | foreground | 84.00 ± 6.43 | 85.60 ± 6.15 | 89.60 ± 5.35
images       | foreground | 80.00 ± 7.01 | 85.60 ± 6.15 | 88.00 ± 5.70
background   | foreground | 81.60 ± 6.79 | 86.40 ± 6.01 | 88.00 ± 5.70
foreground   | foreground | 78.40 ± 7.21 | 85.60 ± 6.15 | 88.00 ± 5.70
Table 3. Mean Average Precision (mAP) results for cow identification across different models, training sets, and test sets, with 95% confidence intervals (CI) computed via bootstrap resampling. ‘imagenet’ denotes a model trained solely on the ImageNet dataset; ‘images’ refers to original full images; ‘background’ refers to only the background regions; ‘foreground’ refers to only the cow regions.
Model     | Training Set | Test Set   | mAP (%) | Lower CI (%) | Upper CI (%)
ResNet-50 | imagenet     | images     | 64.21   | 58.41        | 69.46
ResNet-50 | images       | images     | 64.17   | 58.51        | 69.47
ResNet-50 | background   | images     | 63.72   | 57.96        | 69.12
ResNet-50 | foreground   | images     | 61.23   | 55.55        | 66.65
ResNet-50 | imagenet     | background | 56.37   | 50.81        | 61.77
ResNet-50 | images       | background | 60.41   | 54.57        | 65.88
ResNet-50 | background   | background | 59.79   | 53.79        | 65.37
ResNet-50 | foreground   | background | 56.34   | 50.64        | 61.80
ResNet-50 | imagenet     | foreground | 50.98   | 45.26        | 56.67
ResNet-50 | images       | foreground | 49.29   | 43.68        | 54.95
ResNet-50 | background   | foreground | 45.36   | 39.83        | 50.96
ResNet-50 | foreground   | foreground | 50.17   | 44.41        | 55.99
Swin-T    | imagenet     | images     | 68.00   | 62.18        | 73.12
Swin-T    | images       | images     | 70.08   | 64.31        | 75.16
Swin-T    | background   | images     | 70.32   | 64.47        | 75.39
Swin-T    | foreground   | images     | 70.90   | 65.00        | 76.06
Swin-T    | imagenet     | background | 65.90   | 60.05        | 71.41
Swin-T    | images       | background | 65.60   | 59.74        | 71.09
Swin-T    | background   | background | 67.07   | 61.13        | 72.46
Swin-T    | foreground   | background | 66.27   | 60.29        | 71.75
Swin-T    | imagenet     | foreground | 71.28   | 65.50        | 76.45
Swin-T    | images       | foreground | 66.03   | 60.06        | 71.47
Swin-T    | background   | foreground | 65.40   | 59.55        | 70.75
Swin-T    | foreground   | foreground | 68.18   | 62.15        | 73.55
