1. Simple Overview
Monitoring the body temperature of dairy cows is essential for assessing their health and welfare. Traditional rectal temperature measurement, although accurate, is time-consuming and can cause discomfort and stress in animals. To overcome these limitations, this study developed a contactless and intelligent method for estimating cow body temperature using thermal infrared imaging combined with environmental information. A deep learning model was applied to automatically detect and extract the eye socket region from thermal images, which closely reflects the cow’s internal temperature. By integrating ambient temperature, humidity, wind speed, and light intensity, a random forest model was built to predict the rectal temperature. The results showed that the predicted temperatures were strongly correlated with rectal measurements, confirming the reliability of the method. This approach offers a practical and animal-friendly tool for continuous temperature monitoring in modern dairy farms, contributing to precision livestock farming and improved animal welfare.
2. Introduction
Measuring physiological indicators in animals plays a crucial role in monitoring their welfare and health [1]. Various physiological parameters of dairy cows typically reflect their physiological status and health condition, with core body temperature being the most representative [2]. Cows are homeothermic animals whose normal physiological functions depend on a relatively constant core body temperature [3]. Changes in core body temperature directly or indirectly reflect the cow’s physical condition. Rectal temperature is recognized as the primary physiological indicator for assessing an animal’s thermal balance and is clinically used to represent the cow’s core body temperature [4,5]. The normal rectal temperature range for dairy cows is 37.5–39.5 °C, showing regular variations associated with physiological activities such as estrus, ovulation, pregnancy, and parturition. Convenient, accurate, and effective monitoring of temperature changes not only aids in estrus detection, pregnancy diagnosis, and parturition prediction but also enables proactive disease monitoring, prevention, and control [6].
Traditional body temperature monitoring in livestock is primarily conducted manually, which is time-consuming and poses potential disease transmission risks [7]. Moreover, this approach requires directly contacting or physically restraining the animals, which may induce stress responses, thereby affecting the stability and accuracy of physiological measurements [8].
Temperature monitoring methods are generally classified into contact and non-contact approaches. Contact-based monitoring in cattle mainly relies on high-precision sensors to capture temperature from specific body regions [9]. Common implementations include subcutaneous or intramuscular implantation or the insertion of temperature loggers into the rectum [10], vagina [11], subcutaneous tissue near the neck [12], or behind the ear [13]. When implanted beneath the skin or within the vagina, these devices typically provide higher measurement accuracy and sensitivity to minor temperature changes. For instance, Kou et al. [14] designed a device using a thermistor sensor enclosed in a protective casing, which was attached to the metatarsal region of dairy cows to enable automatic surface temperature monitoring.
However, contact-based approaches are often limited by poor animal compliance and the risk of stress responses during use. In addition, complex rearing environments and external disturbances can damage the sensors and lead to abnormal readings. Consequently, developing non-contact and intelligent temperature detection technologies has gained interest for achieving accurate and stress-free monitoring in precision livestock farming.
Infrared thermal imaging (IRT) is emerging as the primary method for measuring an animal’s body surface temperature because it is non-contact, rapid, and capable of real-time monitoring [15]. M. Z. Lu et al. [16] collected 600 datasets from 20 piglets using IRT. They employed a Support Vector Machine (SVM) classifier combined with contour features to identify the ear base region and extracted the highest temperature within this region as the ear base temperature, enabling automated measurement from top-view thermal images. Similarly, D. He et al. [17] captured lateral thermal images of dairy cows and proposed an automated eye temperature detection method based on a skeletal tree model, achieving a mean error of 0.35 °C in estimating eye socket temperature. These studies demonstrate the potential of IRT for non-invasive and automated body temperature monitoring in livestock.
Nevertheless, IRT measurements are highly susceptible to environmental factors. Gloster et al. [18] showed that ambient temperature affected infrared readings of cattle hooves, with the largest variation occurring under lower temperatures. Church et al. [19] found that humidity had minimal impact at a 1 m distance but became significant at greater distances and higher temperatures. Additionally, wind speeds of 12 km/h introduced an error of 0.78 °C, while direct solar radiation caused a 0.6 °C difference between eyes over 30 min compared to shaded conditions. These findings indicate that airflow, solar radiation, and other environmental factors can directly influence surface temperature measurements.
Accurate assessment of core body temperature relies on reliable reference indicators. Rectal temperature correlates closely with core body temperature [20]. To enhance the accuracy and convenience of non-contact rectal temperature detection, this study proposes a method for detecting cow rectal temperature by integrating thermal imaging with environmental factor compensation. Thermal infrared images of the facial region are acquired, and deep learning techniques are employed to localize and segment the eye region. By combining the segmented regions with a temperature matrix and accounting for environmental factors (ambient temperature, humidity, wind speed, and light intensity), this approach enables automatic extraction of eye temperature and precise prediction of cattle rectal temperature. The contributions of this paper are as follows.
- (1) A thermal imaging and environmental parameter acquisition platform for dairy cow heads has been established, laying the foundation for subsequent body temperature prediction research.
- (2) A cascaded deep learning approach for segmenting the cow’s eye region was proposed to reduce the influence of environmental conditions and animal movement on eye socket localization.
- (3) A non-contact method for predicting rectal temperature has been developed by fusing thermal imaging with environmental data, aimed at achieving precise estimation of dairy cows’ rectal temperature under real-world farming conditions.
3. Materials and Methods
3.1. Data Acquisition
The experimental data were collected on 10 July 2025 at Shengsheng Ranch in Luoyang City, a region experiencing a temperate continental monsoon climate. The annual average wind speed excluding calm periods is 3.2 m per second, annual average temperature ranges from 12.2 to 24.6 °C, annual precipitation is 528–800 mm, and annual sunshine duration reaches 2200–2300 h, with annual relative humidity maintained between 60 and 70%. The farm employs open-style barns with barred-style rearing. The experimental barn is a covered free-stall structure, featuring a fully roofed design with open sides that allow natural ventilation. This semi-enclosed configuration effectively shields the animals from direct solar radiation; therefore, solar radiation was not included as an independent environmental variable in this study. The ambient temperature, relative humidity, wind speed, and light intensity were continuously monitored to characterize the microclimatic conditions within the barn. A comprehensive spray system combining fans and sprinklers was installed in the feeding area, cycling every 4–5 min. Each cycle included approximately 30 s of spraying to fully wet the cows’ backs, followed by fan drying to achieve temperature reduction.
The test subjects were 28 Holstein dairy cows, aged 3.25–7.33 years (mean 4.67 ± 1.18 years), with body weights ranging from 728 to 832 kg (mean 788.35 ± 31.88 kg). At the time of the study, 21 cows were in lactation and 7 cows were in the dry period. When cows extended their heads outside the stall bars to feed, a headlock feeding system restrained them to prevent excessive head movement. A MAG32 thermal camera (Shanghai Magnity Technologies Co., Ltd., Shanghai, China) was used to capture thermal imaging videos of the dairy cows. An STM32 development board (STMicroelectronics N.V., Plan-les-Ouates, Switzerland), an MB016 temperature/humidity sensor (Guangzhou Xingyi Electronic Technology Co., Ltd., Guangzhou, China), a BH1750 light sensor (Rohm Semiconductor, Kyoto, Japan), and a WS3054 ultrasonic anemometer (Chengdu SenTec Technology Co., Ltd., Chengdu, China) were combined to form an environmental parameter acquisition device. This device recorded barn conditions, including temperature, humidity, light intensity, and wind speed, during data collection. A computer collected the sensor data, while farm staff measured rectal temperature using Youmu veterinary electronic thermometers (Zhengzhou Youmu Agricultural Technology Co., Ltd., Zhengzhou, China) inserted into the cows’ rectums. The basic parameters of the sensors are shown in Table 1.
This experiment treated the collection of thermal infrared video, rectal temperature, and concurrent environmental data—including environmental temperature, relative humidity, light intensity, and wind speed—from a cow as a single data acquisition task. The thermal camera and environmental data logger were mounted 2 m from the stall rail and 0.8 m above the ground for data collection. During the cow’s feeding period, a thermal imaging video of the animal’s face was recorded using the thermal camera, with the imager’s emissivity set to 0.98. Environmental data was recorded and rectal temperature measurements were taken concurrently with the thermal imaging video capture. The ear tag number for each cow was recorded so that every dataset could be accurately linked to its specific cow ID. The data acquisition device is shown in
Figure 1. In the experiment, 28 datasets were collected. The thermal imaging video for each cow lasted approximately 1 min, recorded at 25 frames per second (fps), totaling 42,000 frames, and with a video resolution of 384 pixels (horizontal) × 288 pixels (vertical). The environmental parameter collection device operated at 6 fps, gathering a total of 10,080 environmental data points.
3.2. Dataset Construction
The thermal camera was connected to the computer through Ethernet, transmitting and storing the captured thermal imaging video on the computer. Meanwhile, the environmental parameter acquisition device transmitted environmental data to the computer through a serial port controlled by a microcontroller.
3.2.1. Preprocessing of Thermal Imaging Data
The captured thermal infrared video was analyzed using ThermoX (v2.5.4), the accompanying software for the thermal imaging cameras. It was discovered that the pseudo-color temperature scale on the right side of the video employs a dynamic adaptive mapping method, in which its displayed temperature range adjusts in real time based on the highest and lowest temperatures within the current frame, lacking a fixed temperature range. To eliminate visual errors caused by this, we fixed the temperature range of the color temperature scale to ensure that color changes only reflected actual temperature variations, enhancing the quantitative interpretability of images and the stability of data analysis. The specific steps of this process are shown in
Figure 2, and temperature thermal imaging remapping of the cow’s head region is shown in
Figure 3.
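The difference between dynamic and fixed-range mapping can be illustrated with a minimal NumPy sketch. The function names and the 20–40 °C range are hypothetical choices for illustration only; the fixed range actually used in this study is not stated here, and an 8-bit gray level stands in for the pseudo-color index.

```python
import numpy as np

def remap_fixed_range(temp_c, t_min=20.0, t_max=40.0):
    """Map a temperature matrix (deg C) to 8-bit levels using a fixed range.

    t_min and t_max are illustrative assumptions, not values from the study.
    """
    temp_c = np.asarray(temp_c, dtype=np.float64)
    clipped = np.clip(temp_c, t_min, t_max)
    scaled = (clipped - t_min) / (t_max - t_min)
    return np.round(scaled * 255).astype(np.uint8)

def remap_dynamic(temp_c):
    """Map using the per-frame minimum and maximum (the adaptive behaviour)."""
    temp_c = np.asarray(temp_c, dtype=np.float64)
    lo, hi = temp_c.min(), temp_c.max()
    if hi == lo:
        return np.zeros(temp_c.shape, dtype=np.uint8)
    return np.round((temp_c - lo) / (hi - lo) * 255).astype(np.uint8)

# Two frames that share a 35 deg C pixel but have different extremes.
frame_a = np.array([[30.0, 35.0], [36.0, 38.0]])
frame_b = np.array([[25.0, 35.0], [36.0, 40.0]])

same_level_fixed = remap_fixed_range(frame_a)[0, 1] == remap_fixed_range(frame_b)[0, 1]
same_level_dynamic = remap_dynamic(frame_a)[0, 1] == remap_dynamic(frame_b)[0, 1]
```

With the fixed range, the shared 35 °C pixel receives the same level in both frames, whereas the dynamic mapping assigns it different levels because the frame extremes differ.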
3.2.2. Dataset
When creating the thermal imaging dataset, the remapped pseudo-color images were first manually inspected to remove those where the left or right eye socket area was not visible due to head movement. This process yielded a final dataset of 33,450 thermal infrared images. The head and eye socket regions of cows were manually annotated in thermal infrared images; examples of annotated images are shown in
Figure 4. The dataset was then randomly divided into training, validation, and test sets at a 7:2:1 ratio.
During the construction of the rectal temperature prediction dataset, the mismatch in frame rates between the environmental monitoring device and the thermal infrared camera made frame-by-frame alignment infeasible. Thermal infrared data were therefore synchronized with environmental parameters at one-second intervals. The environmental monitoring device recorded ambient temperature, relative humidity, wind speed, and light intensity at 6 fps, and the per-second averages of these parameters were used for synchronization. For thermal data, the first frame within each second was selected, and the mean eye-region temperature in that frame was extracted as the model input variable, with the rectal temperature as the output target. Each dataset record thus represented one second of data, including the averaged environmental parameters, the mean eye-region temperature, and the corresponding rectal temperature. The final dataset, collected from 28 Holstein dairy cows, comprised 1471 records and was divided by cow identity into training and validation subsets at an 8:2 ratio to ensure individual independence.
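The one-second synchronization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layouts, function names, and sample values are hypothetical, timestamps are taken to be seconds from the start of a recording, and frames are assumed to arrive in time order.

```python
from collections import defaultdict
from statistics import mean

def per_second_env_means(env_samples):
    """Average environmental samples within each whole second.

    env_samples: iterable of (t_seconds, temp, humidity, wind, light).
    """
    buckets = defaultdict(list)
    for t, *values in env_samples:
        buckets[int(t)].append(values)
    return {sec: tuple(mean(col) for col in zip(*rows))
            for sec, rows in buckets.items()}

def first_frame_per_second(frames):
    """Keep the eye temperature of the first frame in each second.

    frames: iterable of (t_seconds, mean_eye_temp), sorted by time.
    """
    first = {}
    for t, eye_temp in frames:
        first.setdefault(int(t), eye_temp)
    return first

def build_records(env_samples, frames, rectal_temp):
    """Join both streams on the second index; one record per second."""
    env = per_second_env_means(env_samples)
    eye = first_frame_per_second(frames)
    return [(sec, *env[sec], eye[sec], rectal_temp)
            for sec in sorted(env.keys() & eye.keys())]

# Synthetic two-second example: 6 environmental samples/s and 25 frames/s.
env_demo = [(i / 6, 30.0 + (i // 6), 60.0, 1.0, 100.0) for i in range(12)]
frames_demo = [(i / 25, 36.0 + 0.01 * i) for i in range(50)]
records = build_records(env_demo, frames_demo, rectal_temp=38.6)
```

Each resulting tuple holds the second index, the four averaged environmental values, the eye temperature from the first frame of that second, and the rectal temperature target.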
To provide a clear overview of the complete methodology, including data acquisition, eye socket localization, eye-region temperature extraction, and rectal temperature prediction using the random forest model, an overall flow diagram is presented in
Figure 5.
3.3. Eye Socket Detection Based on Cascade Deep Learning
Deep learning is a branch of artificial intelligence that learns high-level features from large datasets, which enables it to surpass traditional machine learning on many tasks and has led to widespread application in agricultural breeding [21]. In this study, infrared thermal imaging was combined with deep learning to automatically segment key temperature measurement regions in dairy cows and extract corresponding surface temperatures. Considering the uncertainty of cows’ feeding positions, thermal infrared images often captured multiple individuals; therefore, direct eye segmentation could lead to mismatches between eye temperature and individual identity. To ensure accurate temperature extraction, a cascading strategy was implemented, in which the cow head region was first detected and the eye socket area was subsequently segmented within the detected head image.
3.3.1. Cow Head Detection Based on YOLO
The YOLO (You Only Look Once) series of algorithms comprises end-to-end object detection models based on deep convolutional neural networks [22]. By transforming the detection task into a single regression problem, these models achieve fast and accurate object localization and classification [23]. The YOLOv8 network, released by Ultralytics (Frederick, MD, USA), is the first to integrate object detection, instance segmentation, and image classification tasks. Building upon YOLOv5, it introduces the C2f module to replace the C3 structure and employs a multi-scale feature fusion mechanism to significantly enhance detection performance for objects of varying sizes. During training, the model incorporates diverse data augmentation strategies, including random scaling, cropping, flipping, and color perturbations. It also adopts the approach of deactivating Mosaic augmentation in later stages, as observed in YOLOX, to improve accuracy [24]. Weighing detection accuracy, inference speed, and generalization capability together, we selected YOLOv8 as the detection model in this study.
The YOLOv8 network was then trained on the annotated dataset with a batch size of 16 for 240 epochs, using 8 parallel data-loading workers. To ensure detection performance and convergence stability, parameter updates utilized the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01 and weight decay of 0.0005.
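For readers reproducing this setup, the reported hyperparameters map onto the keyword arguments of the Ultralytics `train()` method roughly as sketched below. The dataset configuration path `cow_head.yaml` and the `yolov8n.pt` starting weights are placeholders assumed for illustration; the text does not state which YOLOv8 variant or file layout was used for head detection.

```python
# Hyperparameters reported in the text, written as Ultralytics train() arguments.
TRAIN_ARGS = {
    "data": "cow_head.yaml",   # hypothetical dataset config path (assumption)
    "epochs": 240,             # number of training epochs
    "batch": 16,               # batch size
    "workers": 8,              # parallel data-loading workers
    "optimizer": "SGD",        # stochastic gradient descent
    "lr0": 0.01,               # initial learning rate
    "weight_decay": 0.0005,    # weight decay
}

def train_head_detector(weights="yolov8n.pt", **overrides):
    """Train a YOLOv8 detector with the settings above.

    The default weights file is an assumed placeholder. The import is done
    lazily so that the configuration can be inspected without the package.
    """
    from ultralytics import YOLO
    model = YOLO(weights)
    return model.train(**{**TRAIN_ARGS, **overrides})
```

Calling `train_head_detector()` requires the `ultralytics` package and a dataset file at the assumed path; any argument can be overridden per call, for example `train_head_detector(epochs=10)` for a quick check.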
3.3.2. Segmentation of the Eye Socket Based on OH-YOLO
Thermal infrared images were first processed using the trained detection model to locate and crop the cow head region based on the predicted bounding boxes, generating images containing only the head. Cropped head images were then manually screened to remove accidental inclusions of other cows, ensuring that the extracted eye regions corresponded accurately to individual identities.
Eye socket segmentation was performed using an optimized and lightweight model, OH-YOLO, developed based on the YOLOv8n-seg framework. In OH-YOLO, conventional convolutional blocks in the backbone were replaced with Online Convolutional Re-parameterization (OREPA) modules to reduce parameter redundancy and improve efficiency, while the High-level Screening-feature Fusion Pyramid Network (HSFPN) was introduced in the neck to enhance multi-scale feature fusion and segmentation robustness. The overall network architecture of OH-YOLO is shown in
Figure 6.
- 1. OREPA Module
OREPA (Online Convolutional Re-parameterization; structure shown in Figure 7) is a two-stage framework for structural re-parameterization models that collapses complex multi-branch modules into a single convolutional operation during training, significantly reducing computational and memory overhead [25].
The overall OREPA process consists of two phases: the first is Block Linearization, which utilizes a specialized linear scaling layer to optimize the performance of online blocks; the second is Block Squeezing, which leverages the linear additivity and associativity of convolutions to merge multi-layer, multi-branch linear structures into a single equivalent convolutional kernel (OREPA Conv). This significantly reduces feature-level computations and memory consumption.
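The linear additivity that Block Squeezing relies on can be checked numerically. The sketch below is a simplified single-channel illustration, not the OREPA implementation: it uses scalar scaling factors in place of the per-channel linear scaling layers, three hypothetical branches (two 3 × 3 and one 1 × 1), and a plain NumPy convolution helper.

```python
import numpy as np

def conv2d_valid(x, k):
    """Single-channel 'valid' cross-correlation, as computed by CNN conv layers."""
    kh, kw = k.shape
    h, w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def pad_1x1_to_3x3(k1):
    """Embed a 1x1 kernel at the centre of a 3x3 kernel of zeros."""
    k3 = np.zeros((3, 3))
    k3[1, 1] = k1[0, 0]
    return k3

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))
k_a, k_b = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
k_c = rng.normal(size=(1, 1))
s_a, s_b, s_c = 0.7, -1.3, 0.5   # scalar stand-ins for linear scaling layers

# Multi-branch form: three convolutions, each scaled, then summed.
multi_branch = (s_a * conv2d_valid(x, k_a)
                + s_b * conv2d_valid(x, k_b)
                + s_c * conv2d_valid(x[1:-1, 1:-1], k_c))

# Squeezed form: one equivalent kernel, one convolution.
merged_kernel = s_a * k_a + s_b * k_b + s_c * pad_1x1_to_3x3(k_c)
single = conv2d_valid(x, merged_kernel)

max_abs_diff = float(np.max(np.abs(multi_branch - single)))
```

Both forms produce the same feature map up to floating-point error, while the squeezed form needs only one convolution and stores only one intermediate result.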
- 2. HS-FPN Module
HSFPN [26], illustrated in Figure 8, consists of two components: feature selection and feature fusion. The feature selection module comprises two key components: Channel Attention (CA) and Dimension Matching (DM). The CA module processes the input feature map $f_{in} \in \mathbb{R}^{C \times H \times W}$ (where $C$, $H$, and $W$ denote the number of channels, height, and width, respectively) by employing both global average pooling and global max pooling to compute the average and maximum values for each channel. DM performs channel compression using a 1 × 1 convolution to align the channel dimension of multi-scale feature maps.
The feature fusion module primarily comprises a Selective Feature Fusion (SFF) mechanism, which employs high-level features as weights to select the important semantic information embedded within low-level features, thereby enabling strategic feature fusion [27]. Given high-level feature $f_{high}$ and low-level feature $f_{low}$, the former is first up-sampled and aligned via transposed convolution and bilinear interpolation to produce $f_{att}$. The aligned high-level feature is converted into attention weights through the CA module, which are then applied to filter low-level features. The refined low-level features are fused with the high-level features, yielding the output feature $f_{out}$. The fusion process for feature selection can be formulated as follows:
$$f_{att} = \mathrm{BL}\left(\text{T-Conv}\left(f_{high}\right)\right) \tag{1}$$
$$f_{out} = f_{low} \cdot \mathrm{CA}\left(f_{att}\right) + f_{att} \tag{2}$$
In the equations, $f_{high}$ and $f_{low}$ denote the input high-level and low-level features; $\text{T-Conv}$ represents the transposed convolution operation; $\mathrm{BL}$ indicates the bilinear interpolation; $f_{att}$ corresponds to the transformed high-level feature after processing; and $f_{out}$ denotes the output feature map after fusion.
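The selective fusion step can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: the high-level feature is taken to be already up-sampled and aligned (the transposed convolution and bilinear interpolation are omitted), the learned layers inside channel attention are replaced by a plain sigmoid over the sum of the pooled values, fusion is taken to be element-wise addition, and the names `f_att` and `f_low` are illustrative.

```python
import numpy as np

def channel_attention(f):
    """Per-channel weights from global average and global max pooling.

    f has shape (C, H, W). Learned layers are omitted in this sketch; the two
    pooled values are summed and squashed with a sigmoid.
    """
    avg = f.mean(axis=(1, 2))
    mx = f.max(axis=(1, 2))
    weights = 1.0 / (1.0 + np.exp(-(avg + mx)))
    return weights[:, None, None]   # shape (C, 1, 1) for broadcasting

def selective_feature_fusion(f_att, f_low):
    """Weight the low-level feature by attention, then add the high-level one."""
    return f_low * channel_attention(f_att) + f_att

rng = np.random.default_rng(0)
f_low = rng.normal(size=(4, 16, 16))     # low-level feature (C, H, W)
f_att = rng.normal(size=(4, 16, 16))     # aligned high-level feature
f_out = selective_feature_fusion(f_att, f_low)

# Sanity case: a zero high-level feature gives weights of exactly 0.5.
f_out_zero = selective_feature_fusion(np.zeros_like(f_low), f_low)
```

Because the weights have shape (C, 1, 1), every spatial position in a channel of the low-level feature is scaled by the same factor derived from the high-level feature.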
3.3.3. Detection Model Evaluation Metrics
To comprehensively evaluate the performance of these models, this study analyzes them along two dimensions: accuracy metrics and efficiency metrics. Accuracy was assessed with $AP$ (average precision) and $mAP$ (mean average precision). $AP$ measures detection accuracy for a single class as the area under the precision–recall curve, while $mAP$ averages $AP$ across all classes. $mAP_{50}$ considers a detection correct when the Intersection over Union (IoU) exceeds 0.5, and $mAP_{50\text{–}95}$ averages AP across IoU thresholds from 0.50 to 0.95 in steps of 0.05, reflecting detection and localization performance under varying overlap requirements. The two indicators are calculated as shown in Equations (3) and (4).
$$AP = \int_{0}^{1} P(R)\,dR \tag{3}$$
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{4}$$
In the formulas, $P$ represents the precision value of the detection result, $R$ denotes the recall rate, $N$ is the total number of categories in the dataset, and $AP_i$ is the average precision for category $i$.
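As a worked illustration of these accuracy metrics, the sketch below computes the IoU of two boxes, the area under a precision–recall curve, and the mean over classes. The function names and example values are hypothetical, and the monotone precision envelope used for the area is one common convention that may differ from the evaluation code used in this study.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (points sorted by recall).

    Precision is first replaced by its monotone non-increasing envelope.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))            # partially overlapping boxes
ap_perfect = average_precision([0.5, 1.0], [1.0, 1.0])
ap_degraded = average_precision([0.5, 1.0], [1.0, 0.5])
map_demo = mean_average_precision([ap_perfect, ap_degraded])
```

A detection counts as correct at the 0.5 threshold only if its IoU with the ground-truth box exceeds 0.5, so the two example boxes above would not match each other.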
Efficiency was evaluated using the number of parameters (Params), computational complexity (GFLOPs), model size, and image processing time per frame. Params indicate model structural complexity, GFLOPs quantify floating-point operations during inference, model size reflects storage demands, and processing time measures real-time capability. Together with accuracy metrics, these indicators provide a comprehensive assessment of the model’s detection performance, computational efficiency, and deployment feasibility.
3.4. Temperature Extraction in the Eye Socket Area
In Section 3.2.1, during the thermal imaging preprocessing stage, we obtained the raw temperature matrix for each frame. Each element of this matrix corresponds one-to-one with the temperature value of a single pixel in the image. Therefore, when extracting temperature data from the eye socket region of dairy cows, the coordinates $(x_s, y_s)$ obtained from the segmentation of the eye socket region can be mapped to the indices in the temperature matrix. By extracting the corresponding temperature values, the average temperature of the eye region can be obtained.
However, prior to locating the eye socket region, to ensure segmentation accuracy, we performed image cropping based on the head region bounding box information predicted by the detection model, specifically the top-left corner coordinates $(x_1, y_1)$ and bottom-right corner coordinates $(x_2, y_2)$. Subsequently, eye socket segmentation was performed on the cropped head region image, which resulted in segmentation coordinates based on the cropped image coordinate system. To achieve a one-to-one correspondence between segmented coordinates and temperature matrix indices, the corresponding cropping offset was added to convert the coordinates back to the original image coordinate system, as calculated using Equations (5) and (6). Finally, using the restored original coordinates $(x_o, y_o)$ of the eye socket region, the corresponding pixel temperature values were extracted from the temperature matrix. Their mean value was then calculated to determine the temperature of the eye socket region.
$$x_o = x_s + x_1 \tag{5}$$
$$y_o = y_s + y_1 \tag{6}$$
In the equations, $(x_s, y_s)$ represents the coordinates of the segmented eye socket, $(x_1, y_1)$ denotes the upper-left corner coordinates of the head detection box, and $(x_o, y_o)$ indicates the coordinates of the eye socket region within the original thermal infrared image.
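The coordinate restoration and mean-temperature extraction can be sketched as follows. This is a minimal illustration assuming that the segmentation result is available as a boolean mask in the cropped head image and that the temperature matrix is indexed as [row, column], that is, [y, x]; the function name and the synthetic values are hypothetical.

```python
import numpy as np

def eye_mean_temperature(temp_matrix, mask_crop, head_box):
    """Mean temperature of the segmented eye socket in the full frame.

    temp_matrix: (H, W) array of per-pixel temperatures for the full frame.
    mask_crop:   boolean eye-socket mask in the cropped head image.
    head_box:    (x1, y1, x2, y2) head bounding box in the full frame.
    """
    x1, y1, _, _ = head_box
    ys, xs = np.nonzero(mask_crop)   # row (y) and column (x) indices in the crop
    xs_full = xs + x1                # add the cropping offset in x
    ys_full = ys + y1                # add the cropping offset in y
    return float(temp_matrix[ys_full, xs_full].mean())

# Synthetic 288 x 384 frame with a warm patch where the eye socket lies.
temp_matrix = np.full((288, 384), 30.0)
temp_matrix[110:112, 205:208] = 36.0

head_box = (200, 100, 260, 160)            # hypothetical detected head box
mask_crop = np.zeros((60, 60), dtype=bool)
mask_crop[10:12, 5:8] = True               # eye socket inside the crop

eye_temp = eye_mean_temperature(temp_matrix, mask_crop, head_box)
```

Omitting the offset would read the temperatures at rows 10–11 and columns 5–7 of the full frame instead, which in this example lie in the 30 °C background rather than on the eye.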
3.5. Established Rectal Temperature Prediction Model
Since thermal imaging technology is based on the radiative transfer relationship between the infrared emission intensity of an object and its surface temperature, its measurements are inevitably influenced by environmental conditions [
28]. Specifically, ambient temperature and wind speed affect the surface heat exchange between the animal and its surroundings, while humidity and light intensity influence the transmission of infrared radiation and image quality. Therefore, based on the average eye socket temperature extracted from the thermal matrix, we further incorporated environmental parameters (including ambient temperature, relative humidity, wind speed, and light intensity) as additional input features into the random forest model to minimize environmental interferences. By learning the relationships between environmental variations and eye socket temperature, the model compensates for measurement deviations caused by fluctuating environmental conditions, thereby improving the accuracy and stability of body temperature prediction.
To further quantify the relationships between each environmental parameter and eye socket temperature, a correlation analysis was conducted using Statistical Package for the Social Sciences (SPSS) software (v26.0). Pearson’s correlation coefficients were calculated for each pair of variables to assess their linear associations [
29]. This analysis provides additional statistical evidence of how environmental factors may influence eye socket temperature measurements and their potential contribution to the random forest prediction model.
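For reference, Pearson’s correlation coefficient can be computed directly from its definition, as in the short sketch below. The study itself used SPSS; the function name and the sample values here are purely illustrative and are not measurements from the experiment.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Illustrative values only: ambient temperature vs. eye socket temperature.
ambient = [28.0, 29.0, 30.0, 31.0, 32.0]
eye = [35.9, 36.1, 36.2, 36.5, 36.6]
r_demo = pearson_r(ambient, eye)
```

A value near +1 or −1 indicates a strong linear association, whereas a value near 0 indicates little linear association; the coefficient does not capture non-linear relationships, which the random forest model can still exploit.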
3.5.1. Random Forest Model
Given that rectal temperature prediction is a regression problem, we employ random forest (RF) as the predictive model. RF is a supervised ensemble learning method whose basic unit is a decision tree, a simple predictive model that partitions the input data space into output regions [30]. The prediction for an output region of a decision tree is the average value of the training responses that fall within that region. RF is a forest composed of multiple decision trees. It draws multiple bootstrap training subsets through random sampling and independently constructs a decision tree on each. For regression, it then averages the predictions of the individual trees to generate the final outcome. Taking advantage of the diversity among the decision trees, RF demonstrates remarkable robustness and generalizability [31]. A basic schematic of the RF model for rectal temperature prediction is shown in Figure 9.
3.5.2. Model Evaluation Metrics
After establishing the predictive model, the mean square error (MSE), mean absolute error (MAE), and coefficient of determination ($R^2$) were adopted as metrics to evaluate its performance. MSE represents the mean squared error between predicted values and actual values; MAE measures the average absolute error level of the model predictions; and $R^2$ reflects the degree of fit between the predicted values and the actual values, with values closer to 1 indicating a better fit.
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{7}$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{8}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{9}$$
In the formulas, $n$ denotes the number of rectal temperatures in the test set; $y_i$ represents the true value of the i-th rectal temperature in the test set, $\hat{y}_i$ denotes the predicted value of the i-th rectal temperature in the test set, and $\bar{y}$ is the average of the true temperature values in the test set.
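The three metrics can be computed in a few lines, as in the sketch below. The function name and the three sample temperatures are hypothetical and serve only to make the arithmetic concrete.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2

# Illustrative rectal temperatures (deg C), not data from the study.
y_true = [38.0, 38.5, 39.0]
y_pred = [38.1, 38.5, 38.9]
mse, mae, r2 = regression_metrics(y_true, y_pred)
```

In this example the errors are 0.1, 0, and −0.1 °C, so the residual sum of squares is 0.02 and the total sum of squares is 0.5, which gives an $R^2$ of 0.96. Note that $R^2$ is undefined when all true values are identical, because the total sum of squares is then zero.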
4. Results
4.1. Cow Head Object Detection Results
The test set was fed into the trained cow head detection model to evaluate its performance in locating cow head regions. For thermal infrared images with a resolution of 384 × 288, the average precision (AP) for detecting the cow head region reached 100%, 99.01%, and 98.35% under the three IoU threshold settings evaluated, demonstrating the model’s high accuracy and stability in target localization.
4.2. Eye Socket Recognition Results
The eye socket area was further segmented among the detected cow head targets. The overall detection results of the cow head and eye socket are shown in
Figure 10.
4.2.1. Segmentation Results of Different Models
To evaluate the segmentation performance of the proposed model in this study, comparative experiments were conducted between the improved model and other mainstream segmentation models, including Mask-RCNN and the YOLO series, which are commonly used. All models were tested in the same dataset, and the experimental results are shown in
Table 2.
As shown in the table, the OH-YOLO model demonstrates strong performance in segmentation tasks, achieving values of 99.5%, 98.99%, and 86.59% on the three accuracy metrics. It maintains high detection accuracy across different IoU thresholds. In terms of model complexity, OH-YOLO has 2.178 million parameters, representing reductions of approximately 95.0%, 94.2%, and 33.1% compared to Mask R-CNN, YOLOv7, and YOLOv8, respectively. Its computational load is 9.7 GFLOPs, representing reductions of 92.8%, 93.2%, and 19.2% compared to Mask R-CNN, YOLOv7, and YOLOv8, respectively. The model size is 4.36 MB, which is 330.7 MB, 285.6 MB, and 2.13 MB smaller than Mask R-CNN, YOLOv7, and YOLOv8, respectively, significantly lowering storage requirements. Regarding inference speed, OH-YOLO processes each image in 2.57 ms, approximately 77.6% faster than Mask R-CNN, 53.4% faster than YOLOv7, and 9.5% faster than YOLOv8. It therefore achieves real-time performance gains while maintaining high accuracy. In summary, the improved OH-YOLO shows clear advantages over Mask R-CNN, YOLOv7, and YOLOv8 in terms of computational complexity, model size, and inference efficiency while retaining comparable accuracy, demonstrating its overall strengths in practical segmentation tasks.
To visually demonstrate the segmentation results of each model on actual images, images randomly selected from the test set were used to test the models, the results of which are shown in
Figure 11. It can be observed that the Mask-RCNN model performed poorly in this task, showing frequent missed and false detections. In contrast, the YOLO series achieved a higher segmentation accuracy and stability. YOLOv7 and YOLOv8 produced clearer eye-socket boundaries, whereas the proposed OH-YOLO maintained a comparable accuracy with improved robustness under minor head movements and lighting variations. As supported by
Table 2, the lightweight OH-YOLO architecture sustains high-quality segmentation while offering faster inference and greater suitability for edge deployment.
4.2.2. Eye Socket Model Ablation Test
Ablation studies were conducted to investigate the impact of innovative modules on the performance of eye region segmentation. To validate the effectiveness of improvements in individual modules, a series of ablation tests were conducted on the dataset. This study extends the YOLOv8 architecture by integrating the OREPA module and the HS-FPN module, thereby implementing a series of targeted enhancements designed to improve the accuracy and robustness of eye socket segmentation. The performance comparison results between the new model and the original model are shown in
Table 3.
In Table 3, “✗” indicates that a module is not used, and “✓” indicates that it is used. The original model (Experiment 1) exhibits a high segmentation performance, with its three accuracy metrics reaching 99.5%, 98.99%, and 86.85%, respectively. However, its drawbacks include its numerous parameters, high computational complexity, and substantial model memory requirements, which pose challenges for deployment on resource-constrained devices. After replacing the original backbone network’s convolutional module with the OREPA module (Experiment 2), the model maintained detection accuracy while reducing the computational complexity by 9.2% and shortening the inference time from 2.84 ms to 2.60 ms, demonstrating higher structural efficiency. When replacing the original PANet in the neck structure with the HS-FPN module (Experiment 3), the model maintained a stable detection performance while reducing parameters by 28.2%, decreasing model size by 25.1%, and slightly shortening the image inference time. With both the OREPA and HS-FPN modules integrated (Experiment 4), the fused model maintained the high accuracy of the original model while reducing the parameters, floating-point operations, and memory consumption by 33.1%, 19.2%, and 32.8%, respectively. The improved model also achieved a 9.5% increase in inference speed.
To further illustrate the performance variation across different ablation settings, the precision (P),
mAP50, and
mAP50–95 metrics were visualized, as shown in
Figure 12. As observed, these indicators remain consistently high across all experiments, indicating that the proposed structural modifications do not compromise segmentation accuracy. The highly overlapping curves further demonstrate that the integration of the OREPA and HS-FPN modules enables the model to maintain high precision while achieving improved computational efficiency and structural compactness. Taken together with
Table 3, the results of the ablation experiments verify the reliability of the proposed lightweight optimization strategy, achieving a balanced trade-off between structural simplification and segmentation quality.
4.3. Eye Socket Temperature Extraction Results
In this study, the average eye temperature of each cow was extracted using a method combining a temperature matrix derived from thermal infrared images with eye region segmentation and localization. The results showed that the infrared thermography system could stably identify the eye region of dairy cows and output reliable temperature information. The extracted eye temperature values exhibited noticeable inter-individual variation.
Figure 13 presents the infrared eye temperature and rectal temperature of 28 dairy cows.
Overall, the eye temperature was lower than the rectal temperature; however, both measurements displayed a generally consistent variation pattern among individuals, indicating that cows with higher eye temperatures tended to have higher rectal temperatures as well. These results demonstrate that the proposed infrared thermography-based eye temperature extraction method can effectively reflect both individual differences and overall variation trends in body temperature, providing a useful reference for subsequent temperature monitoring in dairy cows.
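The extraction step described above, averaging the temperature matrix over the segmented eye region, can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function name, the tiny temperature matrix, and the binary mask are made-up placeholders, and a real mask would come from the segmentation model.

```python
def mean_masked_temperature(temps, mask):
    """Average the temperatures of all pixels where the mask is truthy."""
    total, count = 0.0, 0
    for temp_row, mask_row in zip(temps, mask):
        for temperature, selected in zip(temp_row, mask_row):
            if selected:
                total += temperature
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


# Made-up 3x4 temperature matrix in degrees Celsius (not measured data).
temps = [
    [30.1, 30.4, 31.0, 30.2],
    [30.3, 36.8, 37.0, 30.5],
    [30.0, 36.6, 37.2, 30.1],
]
# Made-up binary mask: 1 marks pixels inside the segmented eye region.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

eye_temp = mean_masked_temperature(temps, mask)
print(f"{eye_temp:.2f}")  # 36.90
```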
4.4. Rectal Temperature Prediction Results
A random forest model was constructed using the following inputs: the average temperature in the eye socket region, environmental temperature and humidity, wind speed, and light intensity data. The recorded rectal temperature was utilized as the model’s output variable. To achieve optimal predictive performance for the random forest model, we employed a trial-and-error approach to determine the model’s best hyperparameter settings. The number of decision trees (Ntrees) was tested over a range of 1 to 400, and the minimum leaf size (Nleafs) was evaluated across the values [5, 10, 20, 50].
Figure 14 shows the mean squared error between the predicted and actual rectal temperature for different Ntrees and Nleafs values.
The results suggest that models with smaller minimum leaf sizes and more trees exhibit lower prediction errors; however, model complexity and computational time must also be taken into account. Combining a small Nleafs value with a large Ntrees value incurs significant computational time and may lead to overfitting. When Ntrees exceeds 200, the prediction accuracy does not improve significantly. Across the tested hyperparameter settings, the model performs best when the number of decision trees is set to 200 and the minimum number of observations per leaf node is set to 5; therefore, the final model consists of 200 decision trees with a minimum of 5 observations per leaf node.
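The reasoning above, choosing the smallest number of trees beyond which the error no longer improves meaningfully, can be expressed as a simple plateau rule. The sketch below is one possible formalization and not the trial-and-error procedure used in the study. The function name, the tolerance, and the MSE values are hypothetical placeholders that do not come from Figure 14.

```python
def smallest_sufficient_ntrees(mse_by_ntrees, tolerance):
    """Return the smallest tree count whose MSE is within `tolerance`
    of the best MSE observed over the whole sweep."""
    best = min(mse_by_ntrees.values())
    for ntrees in sorted(mse_by_ntrees):
        if mse_by_ntrees[ntrees] - best <= tolerance:
            return ntrees


# Placeholder error curve for one fixed minimum leaf size (illustrative only).
mse_curve = {10: 0.30, 50: 0.22, 100: 0.18, 200: 0.150, 300: 0.149, 400: 0.148}

chosen = smallest_sufficient_ntrees(mse_curve, tolerance=0.005)
print(chosen)  # 200
```

A rule of this kind makes the trade-off explicit: larger forests are accepted only if they lower the error by more than the chosen tolerance.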
A random forest model using this structure was trained on the constructed dataset to obtain a temperature prediction model, which was then applied to analyze the test dataset and evaluate its performance. Subsequently, the predictions were compared with the actual rectal temperatures, with the results shown in
Figure 15, which shows that the predicted temperature values align closely with the actual temperature readings. The established rectal temperature prediction model achieved an MSE of 0.117, MAE of 0.058, and R² of 0.852. These metrics indicate that the model possesses a strong fitting capability and excellent generalization performance for rectal temperature prediction tasks, enabling high-precision non-contact estimation of rectal temperature under complex environmental conditions.
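For reference, the three evaluation metrics can be computed from paired actual and predicted values as in the sketch below, which uses their standard definitions. The function name and the four sample temperatures are made-up placeholders, not data from this study.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R^2) for paired actual and predicted values."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, mae, r2


# Made-up rectal temperatures in degrees Celsius (illustrative only).
actual = [38.0, 38.5, 39.0, 39.5]
predicted = [38.1, 38.4, 39.2, 39.4]

mse, mae, r2 = regression_metrics(actual, predicted)
print(f"MSE={mse:.4f}, MAE={mae:.3f}, R2={r2:.3f}")  # MSE=0.0175, MAE=0.125, R2=0.944
```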
4.5. Comparison of Different Prediction Algorithms
To validate the reliability and stability of the random forest method in establishing rectal temperature prediction models, we selected Multi-Layer Perceptron (MLP), Artificial Neural Network (ANN), decision trees (DTs), and the Adaptive Boosting algorithm (AdaBoost) as comparative models for performance analysis.
Figure 16 presents a comparison of the performance of different models using three evaluation metrics: the mean squared error (MSE), mean absolute error (MAE), and coefficient of determination (R²). As shown in the figure, the random forest-based model achieves the lowest MSE and MAE values and the highest R² value among all of the compared models. This result demonstrates that the random forest method provides superior accuracy, stronger fitting capability, and greater stability in rectal temperature estimation, delivering the most reliable predictive performance in this study.
5. Discussion
5.1. Analysis of Eye Socket Segmentation Results
From the comparative analysis of different models, the YOLO series demonstrated markedly superior accuracy and computational efficiency in eye socket segmentation compared to Mask-RCNN. Owing to its two-stage detection and segmentation architecture, Mask-RCNN is less capable of accurately delineating small thermal targets with fine structural boundaries, such as cow eye sockets. In contrast, the one-stage YOLO framework unifies detection and segmentation into a single process, enabling a favorable balance between precision and real-time performance. Among these models, YOLOv8 served as a robust baseline exhibiting high segmentation accuracy, while the optimized OH-YOLO further enhanced structural compactness and inference efficiency without compromising accuracy, making it more suitable for deployment in edge-computing or resource-limited livestock environments.
In the ablation experiments, each incorporated module contributed targeted functional improvements. The OREPA module effectively reduced redundant convolutional computations and enhanced backbone feature extraction efficiency, thereby producing cleaner and more discriminative feature representations for subsequent layers. The HS-FPN module strengthened multi-scale feature fusion, improving the model’s robustness to variations in head orientation and eye socket size. When integrated, these modules formed the OH-YOLO architecture, which achieved a refined balance between segmentation precision and computational efficiency. Although the accuracy indicators exhibited only marginal changes across experiments, this is primarily attributed to the characteristics of the task: eye socket regions in thermal infrared images present high contrast, clearly defined boundaries, and minimal background interference. Furthermore, head cropping and pseudo-color remapping in preprocessing enhanced edge distinctness, leading to a saturation effect in high-accuracy metrics.
Overall, the results of the ablation and comparative analyses confirm that the proposed optimization strategy significantly improves computational efficiency and structural robustness while maintaining segmentation accuracy. The reductions in parameter count, computational complexity, and inference time collectively demonstrate the practical potential of the proposed approach for intelligent livestock management and real-world edge deployment.
5.2. Impact of Environmental Factors
A Pearson correlation analysis was conducted to evaluate the influence of ambient temperature, relative humidity, wind speed, and light intensity on eye socket temperature. The results indicated that ambient temperature and wind speed were the primary environmental factors affecting eye socket temperature, showing statistically significant correlations, whereas the effects of light intensity and humidity were relatively minor. Detailed correlation coefficients and significance levels are presented in
Table 4.
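As a reminder of what the Pearson analysis measures, the sketch below computes the correlation coefficient between one environmental variable and eye socket temperature from its standard definition. The function name and the sample values are made-up placeholders, not the data behind Table 4, and the significance test is not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up readings in degrees Celsius (illustrative only).
ambient_temp = [10, 15, 20, 25, 30]
eye_temp = [34.0, 34.6, 35.1, 35.9, 36.4]

r = pearson_r(ambient_temp, eye_temp)
print(f"r = {r:.3f}")  # r = 0.997
```

A value near +1 or -1 indicates a strong linear association, while a value near 0 indicates little linear association.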
These findings are consistent with the physical principles of thermal infrared thermography. In animals, higher ambient temperatures increase the surface temperature, which could have elevated the measured eye socket temperature in this study. In contrast, stronger wind speeds accelerate convective heat exchange between the animal’s body surface and the surrounding air, which could have decreased the eye socket temperature in this study. Light intensity may slightly influence thermal measurements through localized reflection or surface heating, but this effect was limited under the experimental conditions. The relative humidity showed no significant effect on infrared radiation transmission within the tested range and thus had a negligible impact on the temperature measurement.
In constructing the random forest model for body temperature prediction, ambient temperature, wind speed, humidity, and light intensity were incorporated as additional input features. By learning the relationships between eye socket temperature and these environmental parameters, the model effectively compensated for temperature deviations induced by environmental fluctuations. Consequently, the model maintained a high prediction accuracy and stability even under varying thermal and illumination conditions.
The rectal temperature prediction approach presented in this paper was compared with other prediction models that combine thermal imaging with environmental factor compensation, as detailed in
Table 5, in which “—” indicates no numerical value. F. K. Wang et al. [
32] implemented a multi-sensor architecture with signal processing to correct for certain environmental influences, yet region of interest selection relied on manual localization. This approach not only increases the labor required but also introduces potential human error, which may compromise model stability. In contrast, the present study utilizes an improved OH-YOLO model for automatic eye socket detection, ensuring precise localization. By incorporating a more comprehensive set of environmental factors, the robustness and predictive accuracy of the model are substantially enhanced. A. K. Balhara et al. [
33] employed a regression model to predict body temperature by integrating average eye temperature with ambient temperature. V. M. Pacheco et al. [
34] and R. V. de Sousa et al. [
35] applied ANNs to model multiple body regions or integrate environmental factors, achieving some level of thermal stress assessment; however, regional localization and temperature extraction still required manual operation, and the computational complexity limited the efficiency of large-scale monitoring. By contrast, the present study combines deep learning-based automatic eye socket localization with a random forest model, resulting in improved prediction accuracy and stability while enabling high-throughput monitoring and minimizing manual intervention.
In all the aforementioned studies, the eye socket region was chosen as the primary temperature measurement area because it provides the most reliable indication of rectal temperature. The region is particularly suitable for non-contact temperature measurement due to its dense capillary network and proximity to the brain, making the ocular surface temperature highly representative of systemic thermal status [
8]. While maintaining the core principle of temperature measurement, consistent with previous studies, this work achieves a significant improvement in predictive performance through automated eye socket detection, comprehensive environmental factor integration, and lightweight modeling, substantially reducing both human effort and computational burden.
5.3. Study Limitations
Cows exhibit autonomous behaviors, such as head swinging, turning, and feeding, which can cause variations in the eye region captured by thermal infrared sensors. These natural movements, combined with the strict quality requirements applied during dataset construction, result in uneven data availability across individuals. Such imbalances may limit the model’s ability to fully capture individual-specific temperature patterns, particularly for cows with fewer samples. Despite this limitation, the proposed method maintained a relatively stable predictive performance across most individuals, indicating a degree of robustness to data sparsity and feature variability induced by natural motion. Nevertheless, addressing these challenges is essential for further enhancing the reliability and generalizability of non-contact rectal temperature estimation under dynamic and complex farm conditions.
5.4. Further Study
Building on the identified limitations, future research could implement tracking cameras combined with real-time image display and processing platforms to enable continuous and adaptive monitoring of the eye region. Extending the study from single- to multi-cow scenarios will also be a priority, as cows naturally congregate, creating complex interactions and variable postures that challenge current detection algorithms. Integrating individual identification methods, such as ear tags or facial recognition, with spatially aware target detection algorithms would allow accurate labeling and continuous thermal monitoring of multiple cows simultaneously. Such advancements would enhance the model’s adaptability to dynamic behaviors and improve the practicality of automated thermal monitoring systems, ultimately contributing to both animal welfare and operational efficiency in commercial dairy farming.