1. Introduction
With the advancement of information technology, sheep rearing methods are evolving towards scalability and intelligence. Internet of Things (IoT) technology has been applied effectively to the collection of livestock environmental information, enabling the creation of remote monitoring systems for livestock [1]. Rising production levels have increased the size of livestock herds, but data collection and monitoring of livestock present new challenges. Currently, phenotypic assessment of sheep relies mainly on manual measurement. The development of Precision Livestock Farming (PLF) systems, such as automated weighing systems, RFID sensors, and temperature monitoring, has gradually enhanced operational efficiency and farm animal welfare [2].
Research on sheep posture recognition primarily involves accelerometer and visual recognition technologies. Alvarenga et al. classified five mutually exclusive behaviors of grazing sheep using accelerometers: grazing, lying down, running, standing, and walking [3]. Radeski et al. presented an optimized method for identifying sheep gaits and postures using acceleration values recorded by a triaxial accelerometer [4]. He et al. utilized detection and semantic segmentation to improve sheep weight estimation [5]. However, attaching sensors to sheep's bodies often induces stress reactions [6]. Non-contact, low-cost, simple, and effective computer vision techniques have therefore been widely applied in animal monitoring, contributing significantly to the evaluation of animal behavior [7].
Mask R-CNN, a versatile model widely used in image segmentation and object detection, offers the advantage of instance segmentation, which is particularly beneficial in scenarios requiring the precise delineation of individual objects within an image. In sheep farming, computer vision technology, and the Mask R-CNN model in particular, has shown potential for monitoring sheep behavior and well-being [5,8,9,10,11]. An accurate assessment of sheep body dimensions is vital for evaluating growth status, productivity, and welfare [12]. Prior studies have suggested employing machine vision and deep learning techniques to measure sheep body dimensions, particularly in the standing posture [13,14]. Unlike larger livestock, sheep display a wider range of postures owing to distinct characteristics such as additional joints, agility, and complex behaviors [15]. Consequently, effectively capturing and analyzing the various posture states of sheep, including standing, walking, and jumping, with visual methods remains challenging.
This study utilizes a real-world dataset of segmented sheep from a breeding farm as the foundation for our research. Leveraging the Mask R-CNN convolutional neural network, the study distinguishes between head-down, head-up, and jumping postures of Ujumqin sheep using contour points. It also extracts body size parameters in the head-down and head-up postures during the walking state of Ujumqin sheep. This automated approach reduces the labor required for livestock monitoring and provides continuous, real-time data crucial for informed management decisions. It significantly reduces stress in sheep and the risk of zoonotic diseases, thus supporting the implementation of farm animal welfare practices. This study presents an economical and effective method for collecting data in animal husbandry research.
2. Materials and Methods
2.1. Experimental Animal
The data for this study were collected from Heshig Animal Husbandry Development Co., Ltd., situated in the East Ujumqin Banner, Inner Mongolia Autonomous Region, China. Established in 1999, the farm is located behind the Urias Mountains, covering an area of 55,000 acres where sheep are raised under grazing conditions. The predominant plant species on the grassland include Aneurolepidium chinense, Artemisia frigida, Stipa grandis, Stipa krylovii, and Xyris pauciflora Willd. The breeding season for Ujumqin sheep occurs annually in October, with lambing following in February and shearing in June. Data collection was carried out in August 2023, taking advantage of the mild climate, which facilitated sheep production and measurement tasks and ensured more reliable production data. To cover different groups effectively, two rounds of herd gatherings were conducted across the region. The herds were randomized and divided into four manageable subgroups to facilitate image acquisition while preventing large-scale crowding and trampling among the sheep. Dynamic images were captured of 1100 Ujumqin sheep aged between 6 months and 5 years, with 43 sheep excluded due to poor image quality. The images were categorized into two groups: data A and data B. Data A consisted of 548 sheep (241 rams and 307 ewes) and was utilized for posture training, classification parameter adjustment, and constructing the sheep recognition neural network. Each sheep in data A was represented by its best recognition images in the head-down, head-up, and jumping poses to facilitate posture classification. The images were further split into training, verification, and test sets in a 7:2:1 ratio, resulting in 7371, 2106, and 1053 images, respectively. Data B included 509 Ujumqin sheep (211 rams and 298 ewes); the best identification photos of these animals were utilized as a validation dataset to assess the pose analysis model. Furthermore, within data B, the accuracy of manual and machine measurements of body slanting length, withers height, hip height, and chest depth at different postures was evaluated in a subgroup of 285 sheep, comprising 119 rams and 166 ewes.
2.2. Collection Process
RFID high-frequency electronic ear tags were affixed to the right ears of lambs at three weeks of age for identification and record-keeping purposes. The sheep were guided through an image data collection channel consisting of a plastic background panel and a rectangular metal frame. Each sheep passed through the channel only once. Two 8-megapixel autofocus high-definition cameras were positioned above and to the right of the channel, and lighting during image capture was sufficient for clear observation. An industrial computer behind the background panel stored and identified the ear tags. An automated shooting program captured images of the sheep's back and side at 30 frames per second, organizing them into folders named after the ear tag numbers. We manually collected data on gender, ear tag number, and body measurements and compiled them into an Excel spreadsheet for comparative analysis. During manual body measurements, one person restrained the sheep while another collected the data. The sheep were positioned on a flat surface in a natural upright posture, with the head and neck extended and limbs standing upright. Manual measurements were taken for body slanting length, withers height, hip height, and chest depth of the Ujumqin sheep. The measurement standards were as follows:
Body slanting length: The distance from the front edge of the shoulder to the rear edge of the ischial tuberosity.
Withers height: The vertical distance from the withers to the ground.
Hip height: The vertical distance from the highest point of the hip joint to the ground.
Chest depth: The straight-line distance from the withers to the lower edge of the sternum.
2.3. Experimental Environment
The study utilized KS12A884 cameras equipped with the Sony IMX377 sensor, procured from Shenzhen Kingsen Technology Co., Ltd., located in Shenzhen, China. To balance data storage and transmission demands, a resolution of 640 × 480 was adopted. The analysis server was equipped with a central processing unit featuring 36 cores and 72 threads, operating at a base frequency of 2.3 GHz and a boost frequency of 3.7 GHz. Additionally, it included a 32 GB registered error-correcting code (RECC) memory module sourced from Samsung Electronics, headquartered in Suwon, South Korea, and a 2 TB server hard disk sourced from Western Digital, headquartered in San Jose, California, USA. The system operated on Windows 10, with the PyTorch framework and Python 3.6 in the software stack. Notably, the learning rate plays a crucial role in the training process, impacting speed and convergence [16]. The network parameters were set to a learning rate of 0.002, a batch size of 8, and 512 training epochs to optimize speed and convergence.
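For concreteness, this configuration can be sketched with the torchvision implementation of Mask R-CNN; the SGD optimizer and the `train_loader` are assumptions, as the text specifies only the learning rate, batch size, and epoch count.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Two classes: background and sheep.
model = maskrcnn_resnet50_fpn(num_classes=2)
model.train()

# Hyperparameters from Section 2.3; the optimizer type is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=0.002)
num_epochs, batch_size = 512, 8

# train_loader is a hypothetical DataLoader yielding (images, targets)
# batches in torchvision detection format:
# for epoch in range(num_epochs):
#     for images, targets in train_loader:
#         loss_dict = model(images, targets)   # dict of Mask R-CNN losses
#         loss = sum(loss_dict.values())
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```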
2.4. Image Preprocessing
This study utilized Mask R-CNN for sheep image recognition, specifically identifying sheep images captured by both the back and side cameras. All images underwent manual annotation using the open-source labeling software Labelme version 5.1.1. The quantity of images in the dataset played a crucial role in the detection outcomes. To augment the dataset, several techniques were implemented, including horizontal and vertical flipping as well as brightness enhancement and reduction.
Figure 1a illustrates the various modes of transformation applied. Ultimately, 10,530 images were collected to train the model, comprising 7470 side images and 3060 back images of Ujumqin sheep. These images constituted the Ujumqin sheep dataset, which was further divided into training, validation, and test sets.
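A minimal sketch of these augmentations with OpenCV follows; the ±40 brightness offsets are illustrative, since the adjustment magnitude is not specified in the text.

```python
import cv2
import numpy as np

def augment(image: np.ndarray) -> list:
    """Produce the four augmented variants described in Section 2.4."""
    h_flip = cv2.flip(image, 1)                                # horizontal flip
    v_flip = cv2.flip(image, 0)                                # vertical flip
    brighter = cv2.convertScaleAbs(image, alpha=1.0, beta=40)  # brightness up (offset assumed)
    darker = cv2.convertScaleAbs(image, alpha=1.0, beta=-40)   # brightness down (offset assumed)
    return [h_flip, v_flip, brighter, darker]
```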
2.5. Sheep Recognition Model
The Mask R-CNN model comprises five key components: Input, Backbone network, Region Proposal Network (RPN), RoI Align, and Output. The Input component receives preprocessed image and label data for model training. The Backbone network utilizes ResNet50 as the deep convolutional neural network to extract features from 640 × 480 images, with anchor sizes set to 512 × 512, 256 × 256, 128 × 128, and 64 × 64. This configuration enables the network to recognize low-level details such as wool color and texture as well as high-level features such as sheep position. The feature pyramid network structure combines spatial and semantic information for feature extraction. The RPN generates candidate boxes of varying sizes and aspect ratios over the feature maps. RoI Align resamples the feature maps to a uniform size for the classification and regression tasks. Ultimately, the Output component yields the class, sheep position, and sheep mask.
Figure 1b presents a diagrammatic representation of the sheep recognition model architecture, grounded in the Mask R-CNN framework.
2.6. Key Frame Screening
To identify key frames of sheep passing through the channel, this study utilized the Mask R-CNN approach for sheep object detection, recognizing sheep positions in the images by overlaying a mask on them. Data on sheep areas could then be extracted by applying binary thresholding to the processed images. A multitude of images were captured as each sheep passed; for effective recognition, it was imperative to select images displaying clear and complete contours of the sheep, ideally with the entire body centered in the frame. This study therefore employed region division and the number of image mask pixels to identify key frames.
Figure 2a presents a diagram of the key frame screening used for the movement of Ujumqin sheep.
where R1 is located at (1/3x–2/3x, 0.375y–0.625y), R2 is located at (1/6x–5/6x, 1/4y–3/4y), and R3 covers the full frame (x, y); x represents the X-axis resolution and y represents the Y-axis resolution. The weighting coefficients ni were assigned as follows: n1 = 10, n2 = 0.5, and n3 = 0.01.
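The text does not state how the region pixel counts and weights are combined; a natural reading, sketched below, is a weighted sum of mask pixels over the three nested regions.

```python
import numpy as np

def keyframe_score(mask: np.ndarray) -> float:
    """Weighted sum of mask pixels over regions R1-R3 (the weighted-sum
    form is an assumption); mask is a binary (0/1) array of shape (y, x)."""
    y, x = mask.shape
    r1 = mask[int(0.375 * y):int(0.625 * y), x // 3:2 * x // 3].sum()
    r2 = mask[y // 4:3 * y // 4, x // 6:5 * x // 6].sum()
    r3 = mask.sum()                                 # R3 is the whole frame
    return 10.0 * r1 + 0.5 * r2 + 0.01 * r3

# The frame maximizing the score is kept as the key frame, e.g.:
# key_frame = max(frames, key=lambda f: keyframe_score(binarize(f)))
# where binarize() is a hypothetical thresholding step on the mask output.
```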
2.7. Sheep Posture Recognition
During the study, it was observed that sheep exhibit a specific posture only when jumping, characterized by bent legs and an upward head movement. This produces a distinct angle of the minimum bounding rectangle that is not seen during normal movement. The diagonal angle between the minimum-area (rotated) rectangle and the upright bounding rectangle effectively captured the sheep's body curvature and leg extension, enhancing the accuracy of posture information during jumping. The study utilized Python to calculate the contour points of both rectangles, followed by statistical analysis using SPSS version 22 to compare posture variations. A decision tree method was also employed to identify key parameters for classifying sheep jumping behavior. A neural network was then utilized to detect the sheep mask, enabling assessment of the sheep's highest point and head position for posture analysis. Sheep images were segmented into four regions (top-left, bottom-left, top-right, and bottom-right) based on the highest point and head position. For instance, if the highest point is in the top-right region and the head is above the hip, the sheep is categorized as looking upward, whereas if the highest point is in the top-right region but the head is below the hip, the sheep is classified as being in a lowered position.
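A sketch of both cues with OpenCV follows, reading the two rectangles as cv2.minAreaRect (rotated) and cv2.boundingRect (upright); this mapping, and the precomputed head and hip coordinates, are assumptions.

```python
import cv2
import numpy as np

def jump_angle_feature(contour: np.ndarray) -> float:
    """Angle difference between the diagonals of the rotated minimum-area
    rectangle and the upright bounding rectangle; jumping postures yield
    noticeably larger values than normal walking."""
    (_, (w, h), theta) = cv2.minAreaRect(contour)   # rotated rectangle
    _, _, bw, bh = cv2.boundingRect(contour)        # axis-aligned rectangle
    diag_rotated = np.degrees(np.arctan2(h, w)) + theta
    diag_upright = np.degrees(np.arctan2(bh, bw))
    return abs(diag_rotated - diag_upright)

def head_posture(contour: np.ndarray, head_y: float, hip_y: float) -> str:
    """Head-up vs. head-down from the highest contour point's region and
    the head's height relative to the hip (image y grows downward)."""
    xs, ys = contour[:, 0, 0], contour[:, 0, 1]
    top_x = xs[ys.argmin()]                         # x of the highest point
    side = "right" if top_x > xs.mean() else "left"
    posture = "head-up" if head_y < hip_y else "head-down"
    return f"{posture} (highest point in top-{side} region)"
```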
2.8. Sheep Body Sizes Recognition
Because sheep skeletons can be unstable during jumping, body measurements were not computed for the jumping state. For postures with lowered and raised heads, the convexHull function of OpenCV was utilized to pinpoint contour points. The calculation methods for body slanting length and chest depth were consistent across the walking postures, whereas those for withers height and hip height varied.
The computation methodology for body slanting length and chest depth was as follows. The study employed a recursive algorithm-based approach to extract the maximum inscribed rectangle within a spatial domain. Specifically, the maximum circumscribed rectangle was first identified to define the spatial region, and recursive algorithms were then utilized to enumerate multiple inscribed rectangles within this region. The areas of these inscribed rectangles were calculated and stored in a list, and the maximum rectangular area was identified using a stack. The diagonal positions of the maximum inscribed rectangle served as the feature points for body slanting length, and chest depth was taken as the distance between the withers height feature point and the right-hand feature point of the body slanting length. The recognition of the different postures is shown in Figure 2b.
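One stack-based way to realize this step is the classic largest-rectangle-in-a-binary-mask algorithm sketched below; it finds the maximum axis-aligned inscribed rectangle directly and stands in for the recursive enumeration described above.

```python
import numpy as np

def largest_inscribed_rectangle(mask: np.ndarray):
    """Return (area, (top, left, bottom, right)) of the largest axis-aligned
    rectangle of foreground pixels in a binary mask, using per-row column
    heights and a monotonic stack."""
    rows, cols = mask.shape
    heights = np.zeros(cols, dtype=int)
    best = (0, (0, 0, 0, 0))
    for r in range(rows):
        # Running height of consecutive foreground pixels in each column.
        heights = np.where(mask[r] > 0, heights + 1, 0)
        stack = []                              # column indices, heights increasing
        for c in range(cols + 1):
            h = heights[c] if c < cols else 0   # zero sentinel flushes the stack
            while stack and heights[stack[-1]] >= h:
                rect_h = heights[stack.pop()]
                left = stack[-1] + 1 if stack else 0
                area = rect_h * (c - left)
                if area > best[0]:
                    best = (area, (r - rect_h + 1, left, r, c - 1))
            stack.append(c)
    return best

# The diagonal corners (top, left) and (bottom, right) of the best rectangle
# serve as the feature points for body slanting length.
```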
Hip height in raised head posture: The convexHull function was utilized to calculate the convex hull points of the sheep contour, and the rightmost overlapping point within the sheep contour was identified as the feature point for hip height.
Withers height in raised head posture: Line A is defined as the line connecting the top of the sheep's head to the shoulder point. All contour points between these two endpoints were selected, and the perpendicular distance between each contour point and line A was computed. The point with the greatest distance was selected as the feature point for withers height.
Withers height and hip height in lowered head posture: In the lowered head posture, the sheep's scapula and hip bones protrude, marking the feature point locations for withers height and hip height. The distance between the left and right boundary points of the sheep's body is denoted as S. The withers height range is defined as 3/8S–4/8S, and the hip height range as 6/8S–7/8S. The U-shaped chord length curvature algorithm demonstrates strong noise resistance and rotational invariance, meeting these requirements; it was therefore employed to calculate the curvature. Within each range, the point with the highest U-shaped chord length curvature was taken as the feature point for withers height or hip height in the lowered head posture.
Figure 2c shows the neighborhood diagram underlying the U-shaped chord length curvature,
where p_b and p_f are the contour points whose chord distances from p_i first reach the preset constant U in the backward and forward directions, and (x_b, y_b), (x_i, y_i), and (x_f, y_f) are the coordinates of p_b, p_i, and p_f, respectively.
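Because the curvature formula itself did not survive typesetting, the sketch below uses a common U-chord formulation: p_b and p_f are found by walking along the contour until the chord distance from p_i reaches U, and curvature is estimated from the bending angle at p_i.

```python
import numpy as np

def u_chord_curvature(points: np.ndarray, i: int, U: float) -> float:
    """Curvature estimate at contour point p_i via the U-chord method.
    points: (N, 2) array of ordered contour coordinates. The angle-based
    estimate is one common formulation; the paper's exact formula was not
    recoverable from the text."""
    n = len(points)
    p_i = points[i]

    def chord_point(step: int) -> np.ndarray:
        j = i
        for _ in range(n - 1):                     # at most one full loop
            j = (j + step) % n
            if np.linalg.norm(points[j] - p_i) >= U:
                break
        return points[j]

    p_b, p_f = chord_point(-1), chord_point(+1)    # backward / forward chords
    v1, v2 = p_b - p_i, p_f - p_i
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # pi minus the included angle: ~0 on a straight segment, large at sharp bends.
    return np.pi - np.arccos(np.clip(cos_a, -1.0, 1.0))

# Evaluating this over the 3/8S-4/8S and 6/8S-7/8S bands and taking the
# maximum-curvature point yields the withers and hip feature points.
```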
2.9. Loss Function
The loss function of Mask R-CNN comprises three main components, combined into the comprehensive loss used during model training:

L = L_cls + L_box + L_mask

where L_cls is the classification loss, which penalizes the model for errors in category prediction; L_box is the bounding box regression loss, which measures the difference between predicted bounding boxes and ground truth boxes; and L_mask is the mask segmentation loss, which quantifies the discrepancy between predicted masks and ground truth masks. This study assessed the outcomes of the Mask R-CNN model on the validation set. Once the loss function ceased to decrease significantly, the model was considered adequately trained and identified as the best model. Subsequently, the precision and recall metrics were calculated:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

where TP is the true positive count, FP is the false positive count, and FN is the false negative count. Precision denotes the proportion of predicted positive samples that are truly positive, while recall indicates the proportion of all positive samples that are correctly identified.
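As a worked instance of the two formulas (assuming detections have already been matched against ground truth):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: 950 correct detections, 40 false alarms, 60 misses.
# precision_recall(950, 40, 60) -> (0.9596..., 0.9405...)
```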
2.10. Actual Distance Conversion
Accurately measuring sheep body size required converting pixel distances into actual distances: because the camera coordinate system differs from the real-world coordinate system, pixel distances had to be converted into centimeters for precise body dimensions. The Euclidean distance formula was utilized to measure pixel distances. To ensure accuracy, the back height of Ujumqin sheep was limited to 90 cm, with the back calibration plate moving within a range of 0 to 90 cm and the side calibration plate within a range of 0 to 50 cm, at an image acquisition interval of 5 cm. To address the varying proportional relationships at different camera distances, this study adopted a multivariate regression approach for the pixel-to-centimeter conversion. Using the known actual length of the calibration plate, the ratio of pixel values to actual centimeters was treated as the dependent variable, while the distance from the camera to the sheep and the distance from the sheep to the channel were treated as independent variables for polynomial regression.
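The conversion step might look like the following scikit-learn sketch; the calibration arrays and the quadratic degree are placeholders standing in for the plate measurements described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Placeholder calibration data: pixel length of the known plate and the
# two distances (cm) recorded for each calibration image.
PLATE_LENGTH_CM = 30.0
plate_pixels = np.array([210.0, 180.0, 155.0, 135.0])
camera_dist = np.array([60.0, 70.0, 80.0, 90.0])    # camera to target
channel_dist = np.array([10.0, 20.0, 30.0, 40.0])   # target to channel side

ratio = plate_pixels / PLATE_LENGTH_CM               # pixels per cm (dependent)
X = np.column_stack([camera_dist, channel_dist])     # independent variables

poly = PolynomialFeatures(degree=2)                  # degree is illustrative
model = LinearRegression().fit(poly.fit_transform(X), ratio)

def pixels_to_cm(pixel_dist: float, cam_d: float, chan_d: float) -> float:
    """Convert a measured pixel distance to centimeters at given distances."""
    r = model.predict(poly.transform([[cam_d, chan_d]]))[0]
    return pixel_dist / r
```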
4. Discussion
The present study focused on analyzing the movement patterns of Ujumqin sheep. The Mask R-CNN convolutional neural network model was utilized to detect the contour points of sheep, allowing body size traits, including body slanting length, withers height, hip height, and chest depth, to be analyzed under various postures. Previous research by Bene et al. has shown that sheep exhibit different postures during exercise, emphasizing the impact of exercise conditions on posture [17]. Current studies on sheep behavior primarily involve accelerometers and visual recognition technology. For example, Alvarenga et al. used accelerometers to categorize five distinct behaviors of sheep: grazing, lying down, running, standing, and walking [3]. Radeski et al. proposed an optimized method for identifying sheep gait and posture by analyzing acceleration values from a triaxial accelerometer [4]. Gu et al. introduced a deep learning-based approach for detecting sheep behaviors, including standing, eating, and lying down [18]. Nonetheless, these methods encounter challenges related to sheep wearables and computational detection performance limitations.
Zhang et al. highlight the significance of analyzing the minimum-area rectangle by investigating the directional characteristics of a patch obtained from the rectangle's aspect ratio [19]. Chaudhuri et al. utilized minimum boundary rectangles to extract various features, such as the aspect ratios of the longer and shorter axes, offering a method to scrutinize sheep's jumping posture using minimum-area and minimum rectangles [20]. Similarly, Sant'ana et al. employed the minimum rectangular area approach to assess the physical indicators of sheep [21]. This study analyzed sheep posture as the animals moved through a channel by inputting contour point data, utilizing the angle between the two rectangles and the key point positions during the walking state, and achieved an overall accuracy rate of 94.70%. The recognition accuracy for the head-down and head-up states exceeded 90%, while that for the jumping posture surpassed 85%. Xu et al. utilized Mask R-CNN to detect two typical behaviors, standing and lying down, in groups of varying sizes, achieving an accuracy of 94% on the validation set [8], similar to the results obtained in this study. Polk et al.'s research showed that sheep exercising on an inclined treadmill exhibited a more bent knee posture than those on a horizontal treadmill, indicating that sheep bend their knees to move swiftly during fast walking [22]. This study likewise noted underestimations of withers height and hip height in the head-down state.
Utilizing advanced technologies such as machine learning and visual image analysis is crucial for improving body size measurement and posture analysis in animals [23]. Witte et al. highlighted the significance of visual assessment in discerning the quality disparities between segmentation masks [24]. Zhao et al. employed Mask R-CNN to achieve 93.7% accuracy in detecting Hu sheep at an IoU threshold of 0.5 [25]. Xu et al. investigated the behavior of sheep standing and lying down in pens of varying sizes, achieving over 94% accuracy on the validation set [8]. The model's classification accuracy in this study was high, possibly attributable to the custom dataset used: the side and back images of sheep were clearly distinct, and the large sample size contributed to the elevated accuracy.
Zhang et al. proposed a non-contact method using machine vision to measure the body size of small-tailed Han sheep, aiming to overcome the limitations of manual measurement [13]. A study on Alpagota goats demonstrated that a dual-camera recognition system accurately predicted withers height, chest depth, and body length with a 3.5% error rate [26]. Similarly, in Ujumqin sheep, visual image measurements exhibited a 5% error in predicting body slanting length, withers height, and hip height compared to manual measurements, with a 10% error in chest depth [14]. Various factors contribute to errors in body size calculations, including differences between machine-generated image data and human measurements. Animals must be in a standardized posture for accurate measurement, and manual measurements also introduce variability due to factors such as the experience level of personnel, procedural deviations, and fatigue from prolonged work. The presence of long chest hair during movement can lead neural networks to misinterpret hair as part of the body, resulting in significant errors in chest depth calculations. Lina et al. emphasized the challenges of accurately measuring animal body size parameters, highlighting issues such as postural changes and the impact of features like chest hair [27]. This was also noted by Mathis et al. [28], illustrating that challenges remain for current vision technology. While neural networks excel at processing visible images, they face limitations when dealing with coat and tail fat interference. Future studies should investigate the use of multiple cameras to address these challenges. Moreover, developing a user-friendly interface and integrating it with existing farm management tools is crucial to promoting farmers' adoption of such systems.
As the agricultural landscape continues to evolve globally, the integration of automation and intelligent technologies becomes increasingly crucial. This study utilized a neural network model to calculate the contour points of Ujumqin sheep, enabling the estimation of pose and body size parameters. The automated system developed in this research reduces the possibility of human oversight and significantly enhances the efficiency of farm management. Moreover, the real-time body size monitoring feature offers data that empower farmers to make swift decisions regarding feeding and health management, ultimately enhancing the overall well-being of the animals. This study accurately identified sheep’s posture and body size and laid the groundwork for the development of strategies for predicting health status. This research aligns with the growing emphasis on precision and intelligence in modern agriculture, presenting a cost-effective method for collecting data to investigate various animal husbandry practices across different postures.