Article

YOLOX-S-TKECB: A Holstein Cow Identification Detection Algorithm

College of Electrical Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450045, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(11), 1982; https://doi.org/10.3390/agriculture14111982
Submission received: 2 August 2024 / Revised: 20 October 2024 / Accepted: 4 November 2024 / Published: 5 November 2024
(This article belongs to the Section Farm Animal Production)

Abstract
Accurate identification of individual cows is a prerequisite for the construction of digital farms and serves as the basis for optimized feeding, disease prevention and control, breed improvement, and product quality traceability. Currently, cow identification faces challenges such as poor recognition accuracy, large data volumes, weak model generalization ability, and low recognition speed. Therefore, this paper proposes a cow identification method based on YOLOX-S-TKECB. (1) Based on the characteristics of Holstein cows and their breeding practices, we constructed a real-time acquisition and preprocessing platform for two-dimensional Holstein cow images and built a cow identification model based on YOLOX-S-TKECB. (2) Transfer learning was introduced to improve the convergence speed and generalization ability of the cow identification model. (3) The CBAM attention mechanism module was added to enhance the model’s ability to extract features from cow torso patterns. (4) The alignment between the a priori anchor boxes and the target sizes was improved by optimizing the clustering algorithm and the multi-scale feature fusion method, thereby enhancing object detection performance at different scales. The experimental results demonstrate that, compared to the traditional YOLOX-S model, the improved model achieves a 15.31% increase in mean average precision (mAP) and a 32-frame increase in frames per second (FPS). This validates the feasibility and effectiveness of the proposed YOLOX-S-TKECB-based cow identification algorithm, providing valuable technical support for the application of dairy cow identification in farms.

1. Introduction

Animal husbandry is a pivotal industry and a primary source of nutrition, and demand for its products is consistently escalating. In recent years, the Chinese government has released several policy documents, including the Digital Agriculture and Rural Development Plan (2019–2025) and the Opinions on Promoting High-Quality Development of Animal Husbandry. These policies underscore the significance of digital farming in driving high-quality progress within the animal husbandry sector. Presently, the dairy farming industry holds a strategic position within animal husbandry [1]. It is imperative to intensify research on and implementation of advanced technologies such as computer vision, big data, and automation in dairy farming. This will elevate the level of intelligence in animal farming, facilitate the advancement of precision animal husbandry, and establish a contemporary animal husbandry farming system [2].
In the initial phases, cow identification primarily relied on numbered plastic ear tags, neck collars, or leg bands as markers, followed by manual recognition and documentation. However, with the onset of digital farming, farms shifted to electronic ear tags for cow management. Despite its convenience, this method posed issues like easy detachment and the potential to harm the cows. To address these concerns, various non-contact cow identification technologies have surfaced in recent times [3]. Amongst these, the cow identification technique utilizing two-dimensional images has garnered significant attention and research from scholars globally. This approach not only mitigates the downsides of electronic ear tags but also opens up new avenues for digital farm management. Image-based cow identification can be categorized into four distinct methods [4,5]: identification through cow eyes, cow muzzle patterns, cow faces, and cow body patterns. While eye-based and muzzle pattern-based techniques are susceptible to interference from cow movements and demand high-quality images, face-based methods encounter challenges during on-site data collection. Factors like hair, texture variations, and unpredictable image acquisition complicate the process, augmenting the dataset construction workload and hindering real-time identification in farming environments. Alternatively, the body pattern emerges as a distinctive and easily observable identifier for cows [6,7].
In 2005, Kim et al. [8] from Kyoto University captured cow body patterns using a charge-coupled device (CCD). They enhanced recognition accuracy through techniques such as edge detection and fine noise component analysis; however, the equipment’s complexity made it impractical for modern cattle farms. A decade later, in 2015, Ahmed [9] introduced a method that extracted features from cow body images using the Speeded-Up Robust Features (SURF) algorithm, coupled with an SVM for cow classification and recognition. This approach relied on minimal image features, leading to limited recognition performance. That same year, Zhao Kaixuan [10] developed a Holstein cow identification model based on video analysis and convolutional neural networks. By precisely extracting and analyzing cow body image features and designing a corresponding neural network structure, this method achieved a recognition rate of 90.55% in experiments, with room for further optimization. In 2021, Li S et al. [11] improved AlexNet by incorporating multiple multi-scale convolutions, using a shortcut-connected BasicBlock to preserve feature values and prevent gradient problems. They added an enhanced inception module and an attention mechanism to capture features at various scales, boosting feature point detection. That year, Liu J et al. [12] introduced transfer learning, using the AlexNet network to extract blurred contours and side body features; they then employed an SVM to identify 35 cows, achieving an accuracy of 98.91%. Also in 2021, Xing Y et al. [13] proposed an enhanced SSD algorithm that fused features from different images, improving detection efficiency and accuracy, especially for overlapping cows in complex farming environments. In 2023, Zhang Yu [14] presented a recognition method based on the TLAINS-InceptionV3 model, which established an experimental dataset using side body images of 39 beef cattle to capture target features and develop an individual cow identification model.
Cow identification serves as the cornerstone for attaining digital and intelligent management in cattle farms. Scholars worldwide have conducted relevant research utilizing computer vision technology. However, there is still room for improvement in recognition accuracy, particularly in scenarios involving complex backgrounds, partial occlusions, and unconstrained environments. Additionally, there is a need to augment data volumes to bolster the generalization capabilities of algorithm models and devise more efficient and precise cow identification algorithms. These advancements will pave the way for intelligent and automated cow farming. The methodology outlined in this paper, and illustrated in Figure 1, comprises the following steps:
(1)
Establishing an image preprocessing platform for acquiring Holstein cow side body images
By conducting a comprehensive analysis of cow characteristics, including size, weight, and temperament, as well as fence attributes such as height, strength, and background wall materials, we determine the optimal width of the cow image acquisition channel and the positioning of infrared sensors. Given that cameras are prone to environmental influences, it is imperative to assess factors that impact imaging quality. These factors encompass the relative positioning and height of depth and color cameras, imaging distance, light source distribution and intensity, fence dimensions, and imaging background. The ultimate goal is to construct an image acquisition platform that minimally disrupts the cows while capturing high-quality side body images.
(2)
Constructing a deep convolutional neural network model utilizing transfer learning and attention mechanism for cow identification
By analyzing cow body texture distribution patterns, a recognition model is established using a large dataset containing similar texture data. This model acts as a pre-trained foundation for cow identification and is transferred to enhance the cow identification process. Additionally, an attention mechanism is employed to eliminate irrelevant features. Given the intricate nature of the cattle farm environment, it is imperative to accurately capture cow target information and differentiate between cows and background elements. Hence, it is crucial to devise efficient target detection algorithms, assess the evaluation methods and performance metrics of the deep convolutional neural network model, investigate and pinpoint the optimal network configuration and parameters, refine the network structure, and ultimately construct a tailored deep convolutional neural network model for cow identification within this specific background.

2. Materials and Methods

This study aims to establish a comprehensive Holstein cow identification dataset, accompanied by a corresponding 3D point cloud dataset. While existing research primarily focuses on identifying Holstein cows through facial features, there is a notable lack of studies exploring body pattern-based identification. Collecting body images of Holstein cows poses significant challenges, as these animals often exhibit low cooperation, are prone to fright, and tend to gather closely together. To overcome these obstacles, we adopted a non-intrusive and stress-free approach for data collection, successfully compiling a dataset featuring 150 Holstein cows.
Furthermore, we employed advanced image enhancement techniques to enrich the dataset, ensuring it accurately reflects real-world natural conditions. Recognizing the synergy between this study and research on cow body measurements, we used the Helios camera (LUCID Helios, model HLT0035-001, version 204900009, manufactured in Canada) to develop a cow image acquisition system. This system is designed to efficiently capture two-dimensional body images of cows, providing robust technical support for cow farming and related studies. The resulting two-dimensional image dataset, named “Holstein_cow_RGB-150”, stands as a valuable resource for future research and applications.

2.1. Data Acquisition

2.1.1. Collection Equipment

The Helios camera combines a distinctive design with strong performance. As illustrated in Figure 2, the Triton color camera occupies the upper portion of the camera body. This component operates much like a standard two-dimensional RGB image collector, capturing vivid color information and intricate texture details that provide a valuable reference for subsequent depth image analysis. While the Triton color camera is an integral part of the Helios camera, the core of its technology is the TOF (Time of Flight) camera housed in the lower half. By analyzing the phase difference between emitted and reflected light, the TOF camera precisely determines the distance between the camera and its target, producing depth images of the scene.

2.1.2. Real-Time Data Acquisition

The field test site for this project is situated at the pasture of Henan Agricultural Science and Technology Co., Ltd. (Xinxiang, China). To facilitate real-time data acquisition, we established an on-site data collection platform. Before commencing data collection, we gained a comprehensive understanding of the behavioral patterns and time allocation of Holstein cows, as well as the structure of large-scale farms. Table 1 presents the behavioral characteristics of Holstein cows, and Table 2 presents the farming time allocation details.
On large-scale farms, the Helios camera is strategically positioned on the exterior wall adjacent to the milking hall passageway. This placement enables the camera to capture the body posture of cows as they pass through. The decision to install the camera at this location was guided by three key factors: (i) the ability to fully document the cows’ body posture; (ii) the slower gait of the cows, which aids in clearer imaging; and (iii) the prevention of direct contact between the camera and cows, thus preserving equipment integrity. After meticulous on-site measurements, the camera was installed at a precise distance of 2.8 m from the wall and 1.2 m above the ground, which ensures optimal image clarity and accuracy. Prior to installation, simulation tests and debugging were performed to guarantee comprehensive cow coverage and precise side-image captures. To facilitate automatic real-time photography, a specular reflective infrared sensor is employed, connected to the Helios camera. When a cow enters the milking hall, the infrared beam is interrupted, triggering the sensor to send a signal to the camera, which then captures an image. To accommodate cows of varying sizes, a dual infrared sensor system was implemented, broadening the detection range and ensuring that any interruption prompts the camera to take a picture. Figure 3 illustrates the conceptual framework and practical application of our experimental data collection setup.
To guarantee the absence of any detrimental impact on the cows and to adhere to the stringent quality standards for image acquisition, we have deliberately selected a specific timeframe spanning from 6 January to 6 February 2024. Specifically, we have captured two-dimensional images of 150 Holstein cows during the milking procedure through the designated platform, with sessions scheduled between 8:30 and 9:30 in the morning and from 11:40 to 12:40 at noon.
The selection of the period from 6 January to 6 February 2024 reflects several considerations. First, it captures the cows in a relatively stable physiological state, minimizing behavioral or physical changes caused by seasonal transitions or fluctuations in the reproductive cycle, and thereby ensuring the consistency and reliability of the image data. Second, environmental conditions during this winter period are relatively stable, with minimal variation in temperature, humidity, and lighting, which reduces external interference with image quality. Third, a fixed timeline simplifies variable control, helping the acquisition meet stringent quality standards and keeping data quality uniform. The health and safety of the cows was also a key factor in this decision: data collection was designed not to adversely affect the animals, reflecting respect for animal welfare in research. Finally, collecting data within a short time frame promotes consistency and comparability, laying a solid foundation for subsequent processing and analysis and enhancing the precision and scientific rigor of the experiment.

2.1.3. Preprocessing of the Holstein_cow_RGB-150 Dataset

Over a 30-day data collection period, we successfully captured 9000 two-dimensional images of 150 Holstein cows. After a careful screening process, we ended up with 8309 high-quality Holstein cow images, as illustrated in Figure 4a. Training a deep learning model for cow identification demands robust data support. Prior to commencing the training, it is imperative to rigorously screen and evaluate both the quantity and quality of the samples. The quality of the data significantly influences the model’s overall performance. Hence, in deep learning training, the processes of data gathering, preparation, and augmentation are of utmost importance. These vital steps are instrumental in enhancing the model’s recognition accuracy, thereby ensuring precise and reliable model outputs.
This study begins by screening the collected data to eliminate unqualified images, specifically those with blurred bodies or areas with significant overlap or occlusion. To bolster the dataset’s diversity, the compilation built from these filtered images incorporates samples featuring single individuals, multiple individuals, various body types, and a wide range of ages. In addition, random sampling was executed on-site, yielding 500 randomly selected images, as exhibited in Figure 4b. For the 8809 valid two-dimensional Holstein cow images that emerged from the screening, we used the OpenCV image tool, adhering to a stringent data processing workflow to thoroughly enrich the image data. By implementing image processing algorithms, we applied seven data enhancement techniques: brightness enhancement, contrast improvement, rotation angle calibration, image flipping, affine transformation, shear transformation, and HSV data enhancement. These methods guaranteed the precision and completeness of the image data. Figure 5 illustrates a comparative analysis of data samples before and after enhancement. Following these procedures, we amassed a total of 61,663 two-dimensional target images of Holstein cows, offering a comprehensive and varied dataset for subsequent model training.
In the process of image processing and data augmentation, brightness enhancement allows users to adjust based on the original brightness value range of the image (e.g., 0–255), optimizing the image by increasing or decreasing the percentage of each pixel’s brightness value (e.g., ±20%). Contrast improvement, on the other hand, focuses on adjusting the difference between the brightest and darkest pixel values in the image, which can be achieved through methods such as linear stretching or gamma correction. Additionally, rotation angle calibration permits the image to be rotated around its center point by any angle, with the rotation angle being either fixed or randomly set. Image flipping, as a simple enhancement technique, encompasses horizontal and vertical flipping. Affine transformation, a more complex two-dimensional linear transformation combined with translation, enables random adjustments to the image’s rotation, scaling, skewing, and translation parameters. Cropping transformation involves cutting out a rectangular area from the image, with the selection of this area being either random or based on a strategy, and its size and position being flexible and variable. Lastly, HSV data augmentation adjusts the image within the HSV color space, allowing for intuitive control over the image’s color and brightness characteristics by altering hue, saturation, or brightness parameters.
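To make these operations concrete, the following is a minimal sketch of the seven intra-frame enhancements using OpenCV and NumPy. The parameter ranges (for example, ±20% brightness and ±15° rotation) are illustrative assumptions, not the exact values used in this study.

```python
import cv2
import numpy as np

def augment_once(img: np.ndarray, rng: np.random.Generator) -> dict:
    """Apply the seven intra-frame enhancements to one BGR image (illustrative ranges)."""
    h, w = img.shape[:2]
    out = {}

    # Brightness enhancement: scale pixel values by up to +/-20%.
    out["brightness"] = cv2.convertScaleAbs(img, alpha=1.0 + rng.uniform(-0.2, 0.2), beta=0)

    # Contrast improvement via gamma correction (lookup-table approach).
    gamma = rng.uniform(0.7, 1.5)
    table = (((np.arange(256) / 255.0) ** gamma) * 255).astype(np.uint8)
    out["contrast"] = cv2.LUT(img, table)

    # Rotation angle calibration: rotate about the image centre.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), rng.uniform(-15, 15), 1.0)
    out["rotation"] = cv2.warpAffine(img, M, (w, h))

    # Image flipping: horizontal mirror.
    out["flip"] = cv2.flip(img, 1)

    # Affine transformation: jitter three anchor points, then warp.
    src = np.float32([[0, 0], [w - 1, 0], [0, h - 1]])
    dst = (src + rng.uniform(-0.05, 0.05, src.shape) * [w, h]).astype(np.float32)
    out["affine"] = cv2.warpAffine(img, cv2.getAffineTransform(src, dst), (w, h))

    # Shear transformation along the x axis.
    Ms = np.float32([[1, rng.uniform(-0.2, 0.2), 0], [0, 1, 0]])
    out["shear"] = cv2.warpAffine(img, Ms, (w, h))

    # HSV enhancement: shift hue, saturation, and value channels.
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.int16)
    hsv += rng.integers(-10, 11, size=3).astype(np.int16)
    out["hsv"] = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    return out
```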
In the realm of image processing, inter-frame algorithms complement intra-frame algorithms as efficient tools for data augmentation. The Mixup method and the Mosaic algorithm are two renowned inter-frame enhancement techniques, both of which aim to improve the generalization ability and detection performance of deep learning models through different strategies. As depicted in Figure 6a, the Mixup method generates new training samples by randomly blending two images; merging the images in equal proportions (0.5:0.5) increases the diversity and complexity of the dataset and helps the model adapt to unseen target data. Conversely, the Mosaic algorithm, as illustrated in Figure 6b, stitches together different parts of multiple images through randomized cropping and splicing, creating a composite image containing multiple scene elements. This amplifies image background details, boosts sample diversity within each training batch, and enables the model to learn a more comprehensive feature representation, which in turn fosters network generalization. Both approaches are potent methods for enhancing the performance of deep learning models.
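As a reference sketch (not the YOLOX internals), the two inter-frame operations can be written as follows; the equal 0.5:0.5 Mixup weighting matches the description above, while the fixed 2×2 Mosaic grid is a simplification of the randomized cropping and splicing used in practice.

```python
import cv2
import numpy as np

def mixup(img_a: np.ndarray, img_b: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Blend two same-sized images pixel-wise; labels are mixed with the same weights."""
    mixed = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return mixed.astype(np.uint8)

def mosaic(imgs: list, size: int = 640) -> np.ndarray:
    """Tile four images into one 2x2 composite; bounding boxes must be shifted to match."""
    half = size // 2
    canvas = np.zeros((size, size, 3), dtype=np.uint8)
    for i, im in enumerate(imgs[:4]):
        r, c = divmod(i, 2)
        canvas[r * half:(r + 1) * half, c * half:(c + 1) * half] = cv2.resize(im, (half, half))
    return canvas
```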

2.2. Test Method

2.2.1. Model Evaluation Metrics

Precision: this measures the accuracy of the positive predictions made by the model. It is the ratio of true positive predictions to the total number of positive predictions (both true and false). In other words, it indicates how many of the predicted positive cases were actually correct, as exemplified in Equation (1).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$
where TP represents the true positive, and FP represents the false positive.
Recall is the ratio of correctly predicted positive samples to the total number of positive samples, as shown in Equation (2).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$
where FN represents the false negative.
Average Precision (AP) assesses the performance of the algorithm across diverse data types. To compute the Average Precision, the Precision-Recall curve (PR curve) is plotted initially, and subsequently, the area under the curve (AUC) is determined. This area, enclosed by the curve and the axes, represents the Average Precision. The entire calculation procedure is illustrated in Equation (3).
$$AP = \int_{0}^{1} P(r)\,\mathrm{d}r \tag{3}$$
Mean Average Precision (MAP) is obtained by averaging the AP values over all categories, as outlined in Formula (4).
$$\mathrm{MAP} = \frac{1}{b}\sum_{i=1}^{b} AP(b_i) \tag{4}$$
where $b$ represents the number of categories and $b_i$ denotes the $i$-th category.
FPS (Frames Per Second) is an evaluation metric used to measure the speed of object detection; it represents the number of images that can be processed per second. For instance, an FPS of 50 indicates that 50 images can be processed every second, equating to a processing time of 0.02 s per image. FPS is calculated as shown in Equation (5).
$$\mathrm{FPS} = \frac{1}{\mathit{time}} \tag{5}$$
where $\mathit{time}$ represents the time required to detect one image.
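For illustration, Equations (1)–(5) can be computed as follows; the AP integral is approximated numerically with the trapezoid rule, and the function names are our own.

```python
import numpy as np

def precision(tp: int, fp: int) -> float:
    """Equation (1): fraction of positive predictions that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Equation (2): fraction of actual positives that are detected."""
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Equation (3): area under the precision-recall curve."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

def mean_average_precision(ap_per_class: list) -> float:
    """Equation (4): mean of the per-category AP values."""
    return float(np.mean(ap_per_class))

def fps(time_per_image: float) -> float:
    """Equation (5): images processed per second (e.g., 0.02 s/image -> 50 FPS)."""
    return 1.0 / time_per_image
```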

2.2.2. Training of the Network Model

The training platform runs on the Windows 10 operating system and uses an NVIDIA GeForce RTX 3080 graphics card with 10 GB of video memory for computation and acceleration, an Intel Core i7-11700KF CPU with a peak frequency of 3.60 GHz, and 32 GB of memory. The software environment comprises PyCharm 2021.2, Python 3.9, the PyTorch 1.9.1 deep learning framework, and the CUDA 9.0 programming environment. This configuration constitutes a robust, high-performance deep learning platform, providing powerful computational support for the experiments [16].

3. Network Model and Improvements

3.1. YOLOX-S Algorithm

The YOLO algorithm excels at efficient, real-time object recognition, offering superior speed and accuracy compared to other algorithms. Its main advantage lies in using a single CNN model, which avoids the complexity of traditional methods that rely on multiple models or components, thereby boosting processing speed. YOLO is of significant importance in object detection and scales well, accommodating images of various sizes and resolutions. Among its iterations, YOLOX is one of the newer algorithms; with its exceptional real-time detection speed, precise performance, and innovative decoupled head design, it distinguishes itself within the YOLO family [17]. Compared to earlier YOLO versions, YOLOX introduces advances in candidate box extraction, reduced model size and computational demand, and augmentation techniques. Unlike many other object detection methods, YOLOX employs an anchor-free approach, obviating the need for predefined anchors; this simplifies the detection process and reduces parameter-tuning complexity. Its multi-scale feature fusion approach integrates features of different scales, enhancing the detection of targets of differing sizes and the algorithm’s versatility. The YOLOX series comprises four primary models: YOLOX-S, M, L, and X. Among them, YOLOX-S, with its minimal parameters and shallow network structure, is ideal for scenarios with high real-time demands, offering the fastest inference speed and superior adaptability. The other three models build on YOLOX-S but increase depth and width, reducing inference speed; this can prevent them from meeting stringent real-time requirements such as cow identification on farming pastures.
The network structure of YOLOX-S is illustrated in Figure 7 [18]. When handling dataset images, YOLOX-S initially utilizes adaptive image scaling technology to consistently modify the images’ resolution. To improve the data’s quality and diversity, the network integrates two cutting-edge data augmentation techniques: Mosaic and Mixup. Furthermore, YOLOX-S adopts a unique adaptive anchor box computation approach. This approach recalculates anchor boxes in every training iteration, which guarantees the utilization of optimal anchor box values alongside adaptive image scaling. This scaling not only modifies the image dimensions but also enhances the model’s inference speed by reducing black border padding, thus attaining superior processing efficiency.
In the backbone feature extraction network module of this architecture, YOLOX-S utilizes the Darknet53 network structure, which is integrated with the Focus module and the CSP module. Although the Darknet53 network demonstrates comparable detection accuracy and other metrics to the ResNet series of algorithms, it doubles the frames per second (FPS), justifying its selection as the backbone network structure. Refer to Figure 8, where the Focus module slices and stacks feature layers by sampling alternate pixels, merging width and height data into channel information, and increasing the channel count from three to twelve. A crucial aspect of the YOLOX-S model is the CSP module, built on residual convolutions and renowned for its ease of optimization. As the network depth increases, this module significantly improves detection accuracy, effectively tackling gradient vanishing issues and considerably enhancing the model’s feature extraction capabilities.
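The pixel-slicing step of the Focus module can be expressed in a few lines of PyTorch. The sketch below follows the standard pattern; a plain convolution stands in for the full convolution block used in YOLOX, and the output channel count is a placeholder.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice alternate pixels into channels: (B, 3, H, W) -> (B, 12, H/2, W/2) -> conv."""
    def __init__(self, in_ch: int = 3, out_ch: int = 32):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Four interleaved sub-images, stacked along the channel axis (3 -> 12 channels).
        patches = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(patches)
```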
With the steady progress in feature extraction techniques, feature representations and semantic information have seen remarkable enhancements in their expressive powers. Yet, this evolution often comes with a decrease in resolution, potentially leading to a loss of crucial local details in images. To rectify this, we have incorporated the Neck module. This module’s primary function is to merge multi-level feature maps, ensuring the capture of both spatial data from shallow maps and semantic data from deeper ones during extraction. Building upon the initial FPN framework, the Neck module has undergone refinement, evolving into a novel FPN + PAN structure. This advanced design enables bidirectional feature transmission, allowing for the flow of high-level semantic data via a top-down pathway and precise localization details through a bottom-up route. This bidirectional mechanism vastly improves the integration of deep and shallow features, enhancing the overall efficiency of image feature extraction.
After careful processing and integration of features at various levels, the Neck network transmits key information to the output end for precise predictions. This process has the capability to generate both category and precise location information of the target object. At the output stage, the algorithm incorporates the state-of-the-art CIOU (Complete Intersection over Union) loss function. CIOU is a specifically tailored loss function for object detection tasks. It leverages the strengths of both IOU and GIOU, enabling a more precise evaluation of the similarity between target bounding boxes, thereby boosting the accuracy of object detection.
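For reference, the standard CIOU loss formulation (not reproduced in the original text) is

$$\mathcal{L}_{\mathrm{CIOU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v, \qquad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2},$$

where $b$ and $b^{gt}$ denote the centers of the predicted and ground-truth boxes, $\rho(\cdot)$ is the Euclidean distance, $c$ is the diagonal length of the smallest box enclosing both, and $\alpha = v/(1 - \mathrm{IoU} + v)$ is a trade-off weight.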

3.2. Model Optimization and Improvement of the YOLOX-S Algorithm

3.2.1. Introduction of Transfer Learning

Transfer Learning strives to leverage knowledge gained from a source domain to enhance learning outcomes in a target domain. Given that most deep learning tasks exhibit some degree of correlation, transferring pre-trained parameters from the source to the target domain and making necessary adjustments can significantly boost the convergence rate of the target model, reduce training time, and mitigate challenges like overfitting [19]. In the context of classifying images of cattle breeds, the process involves initial image processing through convolutional layers, subsequent max pooling, four residual learning blocks, another pooling layer, and a fully connected layer yielding 1024 neurons. Ultimately, the images are passed through a Softmax layer, which produces the final classification results. The transfer learning model is depicted in Figure 9.
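A minimal PyTorch sketch of this transfer step: pre-trained weights are loaded, every tensor whose name and shape match the target network is reused, and the remaining layers (such as the re-sized classification head) keep their fresh initialization. The checkpoint path is a placeholder.

```python
import torch

def load_pretrained(model: torch.nn.Module,
                    ckpt_path: str = "pretrained_cattle.pth") -> torch.nn.Module:
    """Copy every pre-trained tensor whose name and shape match the target model."""
    source = torch.load(ckpt_path, map_location="cpu")
    target = model.state_dict()
    matched = {k: v for k, v in source.items()
               if k in target and v.shape == target[k].shape}
    target.update(matched)           # reuse matching weights, keep the rest as-is
    model.load_state_dict(target)
    print(f"transferred {len(matched)}/{len(target)} tensors")
    return model
```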

3.2.2. Improvement of Clustering Algorithms

The K-means algorithm used in YOLOX-S can require many iterations and incur high time complexity. When applied to cow identity datasets, K-means risks converging to a local optimum, meaning that the clustering centers may only be optimized within a confined range and miss the globally optimal solution. Additionally, when YOLOX-S employs K-means to select anchor box clustering centers, the algorithm’s inherent randomness can produce unsatisfactory or inadequate results, particularly in scenarios involving numerous cows with minor individual variations. To address these issues, we adopt K-means++ [20]. This enhanced algorithm refines the selection of initial cluster centers, significantly improving clustering and convergence performance [21]. The detailed algorithm flowchart is depicted in Figure 10; the roulette wheel selection principle is given in Formula (6).
$$P(x_i) = \frac{D(x_i)}{\sum_{j=1}^{n} D(x_j)} \tag{6}$$
where $D(x_i)$ represents the shortest distance from data point $x_i$ to the nearest existing cluster center, and $P(x_i)$ represents the probability of $x_i$ being selected as the next center.
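A compact sketch of the K-means++ seeding procedure of Figure 10, using roulette-wheel selection per Formula (6). Here the points would be the (width, height) pairs of the labelled bounding boxes; plain Euclidean distance is used for simplicity (1 − IoU is a common alternative for anchor clustering).

```python
import numpy as np

def kmeanspp_init(points: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Choose k initial cluster centres with probability proportional to D(x_i)."""
    centers = [points[rng.integers(len(points))]]        # first centre: uniform draw
    for _ in range(k - 1):
        # D(x_i): distance from each point to its nearest existing centre.
        d = np.min(np.linalg.norm(points[:, None] - np.array(centers)[None], axis=2),
                   axis=1)
        probs = d / d.sum()                              # Formula (6), roulette wheel
        centers.append(points[rng.choice(len(points), p=probs)])
    return np.array(centers)
```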

3.2.3. Differences in Bounding Box Regression Loss Functions

Fluctuations in dataset quality adversely affect the model’s convergence, potentially resulting in underwhelming performance, especially in tasks such as anchor box detection for cow identity, thereby compromising detection accuracy. To counteract this, the YOLOX-S model employs CIOU Loss as its loss function. This function goes beyond the Euclidean distance of the center point; it also integrates an aspect ratio factor for a more holistic loss evaluation. It is important to note that the aspect ratio referred to here does not directly represent actual width and height values but rather relative values derived through specific processing and calculations. As a result, the aspect ratio term cannot precisely reflect the true differences in width and height, which hampers the model’s convergence. Consequently, this particular loss function may not be ideal for the subject of our study. Table 3 presents a comparative analysis of commonly used loss functions.
Upon conducting a comparative analysis, it was discovered that utilizing EIOU Loss to compute the difference in width and height provides a more precise representation of the deformation degree of the target prediction box in cow images than traditional aspect ratio methods. This approach not only significantly accelerates the convergence speed of the model but also elevates the regression accuracy of the prediction box, thereby notably boosting the algorithm’s performance in managing multiple cow localization tasks [22]. To refine the training process and address the challenge of sample imbalance in the dataset, an optimization strategy combining Focal Loss and EIOU Loss has been introduced for the loss function. This method mitigates the negative influence of inferior cow image data on model accuracy while amplifying the beneficial impact of superior cow image data, enabling the model to prioritize the generation of accurate cow image detection anchor boxes and consequently enhancing detection precision. For clarity, the loss function that incorporates Focal Loss is designated as F_EIOU_Loss [23].
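For reference, the standard EIOU and Focal-EIOU definitions, which the F_EIOU_Loss designation corresponds to, are

$$\mathcal{L}_{\mathrm{EIOU}} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{c_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{c_{h}^{2}}, \qquad \mathcal{L}_{F\_\mathrm{EIOU}} = \mathrm{IoU}^{\gamma}\,\mathcal{L}_{\mathrm{EIOU}},$$

where $c_w$ and $c_h$ are the width and height of the smallest enclosing box, and the focusing parameter $\gamma$ down-weights low-overlap (low-quality) samples so that high-quality anchor boxes dominate the regression.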

3.2.4. Introduction of Mixed Attention Mechanism

The Convolutional Block Attention Module (CBAM) is an innovative algorithm designed to enhance the performance of convolutional neural networks. The module integrates seamlessly with existing convolutional neural networks, dynamically adjusting and optimizing input features by multiplying them with learned attention maps. The cornerstone of CBAM, the mixed attention mechanism, lies in the combination of attention models in both the spatial and channel dimensions, enabling a comprehensive analysis and refinement of features.
The structure of CBAM is shown in Figure 11, and the module includes the following two key components:
Channel Attention: This component pinpoints the significance of the various channels, determining which feature channels hold the richest information within the overall feature map. By assessing the informational value of each channel and allocating distinct weights, the model gains the ability to refine the feature distribution, emphasizing the essential channels while mitigating the influence of less important ones, thus optimizing the overall feature set.
Spatial Attention: Different from channel attention, which focuses on specific channels, spatial attention targets specific locations. In other words, spatial attention pinpoints the areas within the spatial dimension of the feature map that contain crucial information. This process involves evaluating the significance of features at various locations and assigning distinct weights, ultimately emphasizing the characteristics of key regions.
In practice, the mixed attention mechanism first employs the channel attention mechanism to scrutinize the significance among channels. It then pinpoints critical regions by concentrating on the spatial distribution of features via the spatial attention mechanism. This integration enables CBAM to capture feature dependencies in both the channel and spatial dimensions simultaneously, offering a more comprehensive feature representation. Compared to attention mechanisms that concentrate solely on the channel dimension, such as SE-Net, CBAM extracts more comprehensive feature information through its dual attention approach, thereby elevating network performance. CBAM not only enhances the feature extraction process but also adaptively fine-tunes input features, leading to notable improvements in information utilization and recognition accuracy [24,25,26].
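A condensed PyTorch sketch of the two CBAM components described above; the reduction ratio (16) and spatial kernel size (7) follow the common defaults from the CBAM literature rather than values reported in this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        # Spatial attention: 7x7 conv over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: weight each channel by its pooled importance.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: weight each location by its cross-channel statistics.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```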

3.2.5. Multi-Scale Feature Fusion Optimization

Multi-scale features, encompassing both scale and depth levels of the feature map, are influenced by the image size and network depth. To overcome the constraints posed by image size, we employ a multi-scale feature extraction approach. Simultaneously, to blend deep and shallow feature maps efficiently, we utilize a feature fusion method. Specifically, shallow feature maps facilitate target localization owing to their high resolution, whereas deep feature maps excel in classification tasks due to their robust feature representation. To optimize information capture and utilization across all scales, while mitigating the increase in parameters and computational load potentially caused by the path aggregation network in the original model, we adopted a Weighted Bidirectional Feature Pyramid Network (BiFPN) [27]. As illustrated in Figure 12, this network seamlessly integrates upper and lower source paths, significantly reducing training time by simplifying the model and enhancing its stability.
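Each fusion node in BiFPN combines its inputs using fast normalized, learnable weights; the standard formulation from EfficientDet [27] is

$$O = \sum_{i} \frac{w_{i}}{\epsilon + \sum_{j} w_{j}}\, I_{i},$$

where $I_i$ are the input feature maps of a node, $w_i \geq 0$ are the learned fusion weights (the $w$ values shown in Figure 12, kept non-negative via a ReLU), and $\epsilon$ is a small constant for numerical stability.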
Considering the aforementioned reasons, this study opts for the BiFPN algorithm as the feature fusion framework for YOLOX-S. The objective is to enhance YOLOX-S’s ability in the feature fusion of cow torso images, facilitating deeper optimization [28,29]. Through this methodology, we anticipate improving the network’s fusion efficiency for recognizing cow identity traits, enabling high-accuracy cow identity detection and recognition.

4. Results and Analysis

4.1. Comparison of Ablation Experiments

The primary aim of this paper is to systematically study, evaluate, and compare various optimization algorithms through rigorous experimental methods. Our goal is to ascertain the practical applicability of these optimization techniques. We are committed to conducting comprehensive research on diverse optimization strategies, aiming to explore their potential to enhance the detection efficiency of algorithms. Taking into account the unique attributes of our research subject and its operational setting, we will thoroughly assess the efficacy of each optimization measure in improving the cow identification algorithm, thereby ensuring its precision and dependability in real-world scenarios. For the purposes of this experiment, YOLOX-S has been chosen as the baseline network. Detailed experimental outcomes are documented in Table 4.
From the experimental results of the ablation comparison in Table 4, we can draw the following conclusions:
The YOLOX-S cow identification model enhanced with transfer learning (dubbed YOLOX-S-Transfer) demonstrates a notable 2.56% boost in MAP value and a 7-frame elevation in FPS. This result clearly demonstrates that utilizing the cattle species transfer model can effectively accelerate the improvement of accuracy in cow identification and significantly reduce model training time, showcasing the efficiency of transfer learning in cross-domain tasks.
The YOLOX-S cow identification model combined with the K_means++ algorithm (dubbed YOLOX-S-K_means++) exhibits notable advancements across all metrics when compared to the original clustering approach in target detection box clustering on the Holstein_cow_RGB-150 dataset. Specifically, it achieves a 5.16% boost in MAP value and a 9-frame FPS increase. When benchmarked against the YOLOX-S-Transfer model, the YOLOX-S-TK variant outperforms it by a 6.14% increase in accuracy and a 5-frame boost in FPS. The K_means++ algorithm effectively avoids the local optimum problem by optimizing the selection of clustering centers, significantly enhancing the detection performance of the model and demonstrating its positive contribution to clustering optimization for the cow identification task.
After substituting the original loss function of the YOLOX-S cow identification model with F_EIOU_Loss, the updated YOLOX-S-F_EIOU_Loss model exhibited a notable 4.76% boost in MAP value and a 10-frame gain in FPS. In comparison to the YOLOX-S-TK model, the YOLOX-S-TKE variant showcased a 2.95% MAP value increase and a 9-frame FPS improvement. The introduction of F_EIOU_Loss demonstrates its effectiveness in mitigating the adverse effects of low-quality data, thereby strengthening the stability and precision of cow identification across diverse real-world applications.
The integration of the hybrid attention mechanism module (CBAM) has led to a notable enhancement in the performance of the YOLOX-S cow identification model, now known as YOLOX-S-CBAM. Specifically, the Mean Average Precision (MAP) value has increased by 2.13%, and the Frames Per Second (FPS) has improved by 4 frames. Furthermore, this integration enhances the feature extraction capabilities of the YOLOX-S-TKE model, resulting in the YOLOX-S-TKEC model with further improvements. As a result, the YOLOX-S-TKEC model demonstrates a 1.25% boost in recognition accuracy, accompanied by a 4-frame increase in FPS. This indicates that CBAM plays a key role in enhancing the feature extraction capability of the model, which enhances the model’s ability to capture and process complex feature information.
The YOLOX-S cow identification model, enhanced with the BiFPN feature fusion pyramid structure (referred to as YOLOX-S-BiFPN), shows significant improvements. Specifically, it boasts a 2.56% increase in MAP value and a 7-frame boost in FPS. Furthermore, when compared to the YOLOX-S-TKEC model, it elevates the MAP value by 2.43% and FPS by 7 frames. These results suggest that the integration of the BiFPN feature fusion module into the YOLOX-S-TKECB algorithm not only strengthens the feature fusion ability of the model, but also improves the system efficiency and performance by optimizing the feature fusion path, which showcases the significant advantage of an efficient feature fusion strategy in enhancing the model’s inference speed.

4.2. Analysis of Results

Through ablation experiments, we have drawn the following conclusions: After implementing all the optimization measures outlined in this paper on YOLOX-S, we observed notable enhancements in terms of the overall MAP value and FPS (Frames Per Second). Specifically, the YOLOX-S-TKECB model pushes the overall MAP value to 97.87%, an improvement of 15.31% over the original YOLOX-S model, and achieves a high frame rate of 108 frames per second, an increase of 32 frames compared to the original. This significant improvement fully validates the effectiveness and efficiency of the optimization measures implemented. In practical application scenarios, the optimized cow identification model demonstrates high accuracy and real-time performance, providing strong technical support for related tasks. In particular, the model exhibits excellent generalization ability and robust performance when dealing with diverse cow identification datasets.
As shown in Figure 13 and Figure 14, these two sets of comparison graphs visually reflect the marked difference in model performance before and after optimization. Figure 13a,b reveal the limitations of the original model in single-target and multi-target recognition, such as low detection accuracy, inaccurate positioning of bounding boxes, and recognition failure due to target occlusion. Meanwhile, Figure 13c,d show the changes in the average MAP value and loss value of the original model during training, respectively, indicating that the average MAP value of the original model is only 82.56% and that the loss increases significantly during fitting. This reflects the instability and inefficiency of the original model during training.
In contrast, Figure 14a,b demonstrate the excellent performance of the optimized model in single- and multi-target recognition, which not only dramatically improves detection accuracy but also effectively handles complex situations such as target occlusion. Figure 14c,d detail the changes in average MAP value and loss value during the improved model’s training. Notably, after approximately 60 training rounds, the improved model maintains its accuracy, and the loss neither rises nor falls significantly, showcasing remarkable stability. This underscores that the optimized algorithm significantly boosts recognition accuracy and speed while also excelling in training stability and efficiency.

5. Discussion

In this paper, a cow identification algorithm based on YOLOX-S-TKECB is proposed, aiming to achieve high-precision identification of Holstein cows in an efficient and interference-free manner. The core of the research encompasses the construction of a data acquisition platform, image preprocessing techniques, and the development and optimization of a convolutional neural network model that integrates transfer learning and attention mechanisms.

5.1. Developing a Platform for Real-Time RGB-D Image Acquisition and Cow-Specific Preprocessing

Based on our in-depth understanding of Holstein cows’ living habits, we designed and implemented a customized RGB-D image real-time acquisition and preprocessing platform. This platform leverages the high performance of the Helios camera to accurately capture 2D images of cows within their feeding environment. After 30 days of continuous data acquisition, we successfully constructed the Holstein_cow_RGB-150 dataset, which comprises images of 150 Holstein cows. To enhance the generalization ability of our model, we employed a diversified data enhancement strategy to effectively expand the diversity of the dataset.

5.2. Establishing a Holstein Cow Identification Model with the YOLOX-S-TKECB Framework

In light of the YOLO algorithm’s strengths in balancing efficient real-time performance with accuracy, this study introduces several innovations to the YOLOX-S model. By incorporating transfer learning techniques and leveraging pre-training weights, we accelerate model convergence and enhance generalization capabilities. Optimizing the clustering algorithm ensures that the anchor frames better align with the varying body sizes of cows, thereby improving the detection of targets of different sizes. The introduction of the EIOU and Focal Loss functions significantly improves the accuracy of detecting small and occluded targets. Integrating a hybrid attention mechanism module sharpens the model’s focus on key information. Additionally, adopting a multi-scale feature fusion strategy ensures comprehensive detection capabilities for both large and small targets. These optimization measures result in notable improvements in the YOLOX-S-TKECB model during ablation experiments and interpretability analyses, with a 15.31% increase in the mAP value and a 32 frames-per-second (FPS) boost, fully validating its feasibility and application potential in cow identity recognition.
Although the YOLOX-S-TKECB algorithm has made significant progress in cow identification, it still faces several challenges and limitations. Firstly, expanding the scale and diversity of the dataset is necessary. Although the current Holstein_cow_RGB-150 dataset provides a preliminary foundation, it still struggles to comprehensively cover all characteristics of dairy cows and complex environmental backgrounds, limiting the model’s wide adaptability and ability to recognize new images. Secondly, the stability and recognition accuracy of the model in complex agricultural environments, particularly under varying light conditions and heavy occlusions, need to be further improved to ensure high performance in actual feeding environments. Additionally, the trade-off between real-time performance and accuracy remains challenging. Although existing optimizations have significantly improved this, the extreme demand for real-time processing is still not fully met, limiting potential applications in large-scale ranching and automated breeding systems. Finally, considering technology diffusion, reducing the cost of high-end equipment will also be a focus of subsequent work.
In view of these limitations, this study will focus on the following aspects in the future: first, continue to expand the scale and diversity of the dataset by integrating and enhancing multi-source data to construct a more comprehensive and representative dairy cow image dataset; second, deeply optimize the model architecture and explore more efficient network structures and feature extraction methods to further improve recognition accuracy and real-time performance; third, develop a version of the algorithm suitable for low-cost hardware to lower the threshold of technology application and promote the popularization and wide adoption of cow recognition technology; fourth, strengthen research on the model’s adaptability in complex environments and improve its robustness and stability in actual feeding environments through simulation training and adaptive adjustment.
Through these efforts, it is expected that the YOLOX-S-TKECB cow recognition algorithm will play a greater role in the intelligentization of dairy farming and promote technological progress, transformation, and upgrading of the industry.

Author Contributions

Conceptualization and supervision, H.Z.; methodology and writing—draft preparation, L.Z.; formal analysis, L.T.; data curation, J.G.; software, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R & D and Promotion Projects in Henan Province, China (No. 232102110265).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy concerns.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, C.; Yang, Y. Development and effectiveness of China’s dairy industry policy. China Dairy Cattle. 2017, 10, 58–64. [Google Scholar] [CrossRef]
  2. Liu, D. The current situation and healthy development strategy of China’s modern livestock industry. Heilongjiang Anim. Sci. Vet. Med. 2016, 87–89. [Google Scholar] [CrossRef]
  3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  4. Song, X.; Bokkers, E.; Mourik, S.; Groot Koerkamp, P.; van der Tol, P. Automated body condition scoring of dairy cows using 3-dimensional feature extraction from multiple body regions. J. Dairy Sci. 2019, 102, 4294–4308. [Google Scholar] [CrossRef]
  5. Ma, L.; Dong, B.; Yan, J.; Li, X. Matting Enhanced Mask R-CNN. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
  6. Huang, L.; Li, S.; Zhu, A.; Fan, X.; Zhang, C.; Wang, H. Non-contact body measurement for Qinchuan cattle with lidar sensor. Sensors 2018, 18, 107–118. [Google Scholar] [CrossRef]
  7. Scheila, G.; Elton, F.; Luciano, B.; Laurimar, G.; Isabella, C. Application of depth sensor to estimate body mass and morphometric assessment in Nellore heifers. Livest. Sci. 2021, 245, 104442. [Google Scholar] [CrossRef]
  8. Kim, H.; Choi, H.; Lee, D.; Yoon, Y. Recognition of individual Holstein cattle by imaging body patterns. Asian-Australas. J. Anim. Sci. 2005, 18, 1194–1198. [Google Scholar] [CrossRef]
  9. Ahmed, S.; Gaber, T.; Tharwat, A.; Hassanien, A.E.; Snáel, V. Muzzle-based cattle identification using speed up robust feature approach. In Proceedings of the 2015 International Conference on Intelligent Networking and Collaborative Systems, IEEE Computer Society, Taipei, Taiwan, 2–4 September 2015; pp. 99–104. [Google Scholar] [CrossRef]
  10. Zhao, K.; He, D. Recognition of individual dairy cattle based on convolutional neural networks. Trans. Chin. Soc. Agric. Eng. 2015, 31, 181–187. [Google Scholar] [CrossRef]
  11. Li, S.; Fu, L.; Sun, Y.; Mu, Y.; Chen, L.; Li, J.; Gong, H. Individual dairy cow identification based on lightweight convolutional neural network. PLoS ONE 2021, 16, e0260510. [Google Scholar] [CrossRef]
  12. Liu, J.; Jiang, B.; He, J.; Song, H. Individual recognition of dairy cattle based on gaussian mixture model and cnn. Comput. Appl. Softw. 2018, 35, 159–164. [Google Scholar] [CrossRef]
  13. Xing, Y.; Wu, B.; Wu, S.; Wang, T. Individual Cow Recognition Based on Convolution Neural Network and Transfer Learning. Laser Optoelectron. Progress. 2021, 58, 503–511. [Google Scholar] [CrossRef]
14. Zhang, Y. Research on Beef Cattle Body Side Recognition Method Based on Deep Learning. Inner Mongolia University of Science & Technology, Nei Mongol, China, 2023. Available online: https://www.chndoi.org/Resolution/Handler?doi=10.27724/d.cnki.gnmgk.2023.000604 (accessed on 1 February 2024).
  15. Huang, H. Lifestyle and comfort of dairy cows. N. Anim. Husb. 2014, 25. Available online: https://d.wanfangdata.com.cn/periodical/bfmy201413028 (accessed on 1 February 2024).
  16. Yang, S. Research on Steel Surface Defect Detection based on Deep Learning. South China University of Technology, Guangzhou, China. 2021. Available online: https://kns.cnki.net/kcms2/article/abstract?v=OJyTKzW6FepCe81C5SxLMKpx7Tc-nNhV8oRcfJY8XvtKyxJoRDNKGHUGhjvbfW_43kShTeutpTo6fCg58CklNJvmn3NKRjHdIVF8n8J6IE5PIT7Xklhj-pkJli6_Qdq9hVPdHVZv84O5z9ySbe9uJOUBxNeLRZ6tZrWWud5u2DdQHEOsq3PMOeBq0z84v6eG&uniplatform=NZKPT&language=CHS (accessed on 1 February 2024).
  17. Shao, Y.; Zhang, D.; Chu, H.; Zhang, X.; Rao, Y. A Review of YOLO Object Detection Based on Deep Learning. J. Electron. Inf. Technol. 2022, 44, 3697–3708. [Google Scholar] [CrossRef]
  18. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  19. Zhang, H.; Luo, Y.; Tan, L.; Yang, J.; Wang, Y. Research on Millet Disease Identification Based on Transfer Learning and Residual Network. J. Henan Agric. Sci. 2023, 52, 162–171. [Google Scholar] [CrossRef]
  20. Wang, R.; Bai, Q.; Gao, R.; Li, Q.; Zhao, C.; Li, S.; Zhang, H. Oestrus detection in dairy cows by using atrous spatial pyramid and attention mechanism. Biosyst. Eng. 2022, 223, 259–276. [Google Scholar] [CrossRef]
  21. Guo, X.; Yang, J.; Li, Z.; Wang, Y.; Cheng, S.; Li, J. Waterlogging Risk Assessment Based on Subjective and Objective Combination Weight-TOPSIS-k-means++. China Water Wastewater 2024, 40, 130–136. [Google Scholar] [CrossRef]
  22. Tan, M.; Le, Q. Efficientnet v2: Smaller models and faster training. arXiv 2021, arXiv:2104.00298. [Google Scholar] [CrossRef]
  23. Liu, Q.; Guo, X.; Li, C.; Yang, D. An Improved YOLOv5-based Method for Image Recognition of Cattle Individual. Softw. Eng. 2023, 26, 42–47+58. [Google Scholar] [CrossRef]
  24. Chen, J.; Ye, Y.; Kang, M. Image Super-Resolution Using Hybrid Attention Mechanism. Assoc. Comput. Mach. 2021, 62–67. [Google Scholar] [CrossRef]
  25. Wang, S. A Review of Gradient-Based and Edge-Based Feature Extraction Methods for Object Detection. In Proceedings of the 2011 IEEE 11th International Conference on Computer and Information Technology, Paphos, Cyprus, 31 August–2 September 2011; pp. 277–282. [Google Scholar] [CrossRef]
  26. Vasconez, J.; Delpiano, J.; Vougioukas, S.; Auat Cheein, F. Comparison of convolutional neural networks in fruit detection and counting: A comprehensive evaluation. Comput. Electron. Agric. 2020, 173, 105348. [Google Scholar] [CrossRef]
  27. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar] [CrossRef]
  28. Deng, M.; Gong, J.; Zheng, P.; Ma, C.; Yin, Y. Lightweight Target Detection Method for Group-raised Pigs Based on Improved YOLOX. Trans. Chin. Soc. Agric. Mach. 2023, 54, 277–285. [Google Scholar]
  29. Hadji, I.; Wildes, R. What do we understand about convolutional networks? arXiv 2018, arXiv:1803.08834. [Google Scholar] [CrossRef]
Figure 1. Technology roadmap of the YOLOX-S-TKECB-based cow identification algorithm.
Figure 2. Helios camera.
Figure 3. Dairy cattle 3D data acquisition channel. (a) Setup schematic; (b) actual collection scene.
Figure 4. Original Holstein cow images. (a) Example images captured by the Helios camera; (b) randomly collected sample images.
Figure 5. Images processed by the intra-frame augmentation algorithm. (a) Original image; (b) brightness enhancement; (c) contrast enhancement; (d) rotation angle calibration; (e) flip; (f) affine transformation; (g) shear transformation; (h) HSV data augmentation.
Figure 6. Images processed by the inter-frame augmentation algorithm. (a) Mixup augmentation; (b) Mosaic augmentation.
Figure 7. YOLOX-S network structure diagram.
Figure 8. Focus module operation diagram.
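For reference, a minimal sketch of the slicing operation the Focus module performs (the function name focus_slice is illustrative; the full module additionally applies a convolution to the concatenated result):

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Slicing step of the Focus module: sample every other pixel along both
    spatial axes and stack the four offset grids along the channel axis,
    turning (B, C, H, W) into (B, 4C, H/2, W/2) with no information loss."""
    return torch.cat(
        (x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]),
        dim=1,
    )

# e.g., a 640x640 RGB input becomes a 12-channel 320x320 tensor:
# focus_slice(torch.randn(1, 3, 640, 640)).shape == (1, 12, 320, 320)
```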
Figure 9. Transfer learning model structure diagram.
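As an illustration of the transfer-learning setup the diagram describes, a minimal sketch using a torchvision backbone as a stand-in (the backbone choice, class count, and optimizer settings are assumptions, not the paper's exact configuration): pretrained weights are loaded, the feature extractor is frozen, and only the new head is trained on the target identities.

```python
import torch
from torchvision import models

# Load ImageNet-pretrained weights, freeze the feature extractor, and
# replace the classification head so only the head is trained on the new task.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False          # freeze pretrained layers
model.fc = torch.nn.Linear(model.fc.in_features, 20)  # e.g., 20 cow identities
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
```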
Figure 10. Flowchart of the K-means++ algorithm.
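A minimal sketch of the K-means++ seeding step the flowchart depicts (the function name is illustrative, and the subsequent assignment/update iterations of standard K-means are omitted). For prior-box clustering, points would hold the (width, height) pairs of the labeled ground-truth boxes.

```python
import numpy as np

def kmeans_pp_init(points: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """K-means++ seeding: pick the first center uniformly at random, then draw
    each further center with probability proportional to its squared distance
    from the nearest center chosen so far."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.stack(centers)

# e.g., seed 9 prior-box clusters from (width, height) pairs of labeled boxes:
# seeds = kmeans_pp_init(box_wh, k=9, rng=np.random.default_rng(0))
```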
Figure 11. Diagram of the CBAM mixed attention mechanism.
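A minimal sketch of the CBAM module shown in the figure, assuming the standard formulation: channel attention from average- and max-pooled features passed through a shared MLP, followed by spatial attention from a 7×7 convolution over pooled channel maps.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM: channel attention, then spatial attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: shared MLP on global avg- and max-pooled features.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: 7x7 conv over channel-wise avg and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```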
Figure 12. BiFPN structure diagram. P3, P4, P5, P6, and P7 represent the output layers of the backbone network, each with its corresponding output features (number of channels, feature size, etc.). Uncolored circles represent features, while colored circles represent operators. The connections carry weights (w), and both upward and downward connections involve resize operations, i.e., upsampling or downsampling.
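A minimal sketch of the weighted fusion that the connection weights (w) in the diagram denote, following the fast normalized fusion of EfficientDet [27]; the inputs are assumed to have already been resized to a common shape.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion: one learnable non-negative scalar weight per
    input feature map, normalized so the weights sum to (almost) one."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        # feats: list of tensors already resized to a common shape
        w = F.relu(self.w)
        w = w / (w.sum() + self.eps)
        return sum(wi * f for wi, f in zip(w, feats))
```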
Figure 13. Detection results of the YOLOX-S recognition model before improvement. (a) Single-object identification; (b) multi-target identification; (c) training accuracy curve; (d) training loss curve.
Figure 14. Detection results of the improved YOLOX-S-TKECB recognition model. (a) Single-object identification; (b) multi-target identification; (c) training accuracy curve; (d) training loss curve.
Table 1. Behavioral habits of Holstein cows.
Habits of Holstein Cows | Characteristic Description
Personality | Curious, gentle, friendly, and adaptable; easily startled.
Behavior | Prefers a quiet environment; strong external stimuli can trigger stress reactions.
Sociality | A social animal that lives in groups and experiences significant stress when separated from the herd.
Activity Range | Prefers to move freely in pastures and requires ample space for walking, running, and resting.
Table 2. Time allocation for Holstein dairy farming [15].
Holstein Cow Farming | Time Allocation
Feeding | 3 to 5 h (9 to 14 meals per day)
Lying Down/Resting | 12 to 14 h
Social Behavior | 2 to 3 h
Rumination | 7 to 10 h
Drinking | 30 min
Milking | 2.5 to 3.5 h (8:30–9:30, 11:40–12:40, 5:30–6:30)
Table 3. Loss function comparative analysis table.
IOU Loss
Advantages: Scale invariance, with non-negativity, identity, symmetry, and triangle-inequality properties.
Disadvantages: (1) When the two boxes do not intersect, the loss cannot reflect their distance. (2) It cannot accurately reflect how the two boxes overlap.
GIOU Loss
Advantages: Introduces a minimum enclosing box to solve the problem that the loss equals 0 when the detection box and the ground-truth box do not overlap.
Disadvantages: (1) When the detection box and the ground-truth box contain each other, GIOU degenerates into IOU. (2) When the two boxes intersect, convergence in the horizontal and vertical directions is slow.
DIOU Loss
Advantages: Directly calculates the Euclidean distance between the centers of the two bounding boxes to accelerate convergence.
Disadvantages: The aspect ratio of the bounding box is not considered in the regression, so accuracy still needs improvement.
CIOU Loss
Advantages: Adds losses on the scales (length and width) of the detection boxes so that the predicted boxes better match the ground-truth boxes.
Disadvantages: (1) The aspect-ratio term is ambiguous because it is a relative value. (2) The balance between easy and hard samples is not considered.
EIOU Loss
Advantages: Replaces the aspect ratio with the width-height difference and introduces Focal Loss to address sample imbalance.
Disadvantages: Undetermined.
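A minimal sketch of a Focal-EIOU loss under the standard formulation summarized in the table (the paper's exact F_EIOU_Loss implementation may differ): the EIOU term adds a center-distance penalty and separate width- and height-difference penalties to the IoU loss, and the IoU^γ factor re-weights samples to address imbalance.

```python
import torch

def focal_eiou_loss(pred: torch.Tensor, target: torch.Tensor,
                    gamma: float = 0.5, eps: float = 1e-7) -> torch.Tensor:
    """Focal-EIOU loss for boxes in (x1, y1, x2, y2) format."""
    # IoU of predicted and ground-truth boxes
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Smallest enclosing box
    ewh = torch.max(pred[..., 2:], target[..., 2:]) - torch.min(pred[..., :2], target[..., :2])
    cw, ch = ewh[..., 0].clamp(min=eps), ewh[..., 1].clamp(min=eps)
    # Center-distance penalty, normalized by the enclosing-box diagonal
    center_p = (pred[..., :2] + pred[..., 2:]) / 2
    center_t = (target[..., :2] + target[..., 2:]) / 2
    dist = ((center_p - center_t) ** 2).sum(-1) / (cw ** 2 + ch ** 2)
    # Width/height-difference penalties (EIOU's replacement for CIOU's
    # aspect-ratio term)
    pw, ph = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    tw, th = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    eiou = 1 - iou + dist + (pw - tw) ** 2 / cw ** 2 + (ph - th) ** 2 / ch ** 2
    # Focal re-weighting: better-aligned (higher-IoU) boxes contribute more
    return (iou.detach() ** gamma) * eiou
```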
Table 4. Ablation comparison experiment results. √ indicates the improvement is included; × indicates it is not.
Algorithm | Transfer | K-means++ | F_EIOU_Loss | CBAM | BiFPN | mAP | FPS
YOLOX-S | × | × | × | × | × | 82.56% | 76
YOLOX-S-Transfer | √ | × | × | × | × | 85.12% | 83
YOLOX-S-K_means++ | × | √ | × | × | × | 87.72% | 85
YOLOX-S-F_EIOU_Loss | × | × | √ | × | × | 87.32% | 86
YOLOX-S-CBAM | × | × | × | √ | × | 84.68% | 80
YOLOX-S-BiFPN | × | × | × | × | √ | 83.76% | 82
YOLOX-S-TK | √ | √ | × | × | × | 91.26% | 88
YOLOX-S-TKE | √ | √ | √ | × | × | 94.21% | 97
YOLOX-S-TKEC | √ | √ | √ | √ | × | 95.46% | 101
YOLOX-S-TKECB | √ | √ | √ | √ | √ | 97.87% | 108