Article

ID-APM: Inverse Disparity-Guided Annealing Point Matching Approach for Robust ROI Localization in Blurred Thermal Images of Sika Deer

1 College of Information Technology, Jilin Agricultural University, Changchun 130118, China
2 Jilin Province Agricultural Internet of Things Technology Collaborative Innovation Center, Changchun 130118, China
3 Jilin Province Intelligent Environmental Engineering Research Center, Changchun 130118, China
4 School of Electronics and Information Engineering, Wuzhou University, Wuzhou 543002, China
5 Guangxi Key Laboratory of Machine Vision and Intelligent Control, Wuzhou 543002, China
6 College of Animal Science and Technology, Jilin Agricultural University, Changchun 130118, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(19), 2018; https://doi.org/10.3390/agriculture15192018
Submission received: 2 September 2025 / Revised: 22 September 2025 / Accepted: 24 September 2025 / Published: 26 September 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Non-contact, automated health monitoring is a cornerstone of modern precision livestock farming, crucial for enhancing animal welfare and productivity. Infrared thermography (IRT) offers a powerful, non-invasive means to assess physiological status. However, its practical use on farms is limited by a key challenge: accurately locating regions of interest (ROIs), like the eyes and face, in the blurry, low-resolution thermal images common in farm settings. To solve this, we developed a new framework called ID-APM, which is designed for robust ROI registration in agriculture. Our method uses a trinocular system and our RAP-CPD algorithm to robustly match features and accurately calculate the target’s 3D position. This 3D information then enables the precise projection of the ROI’s location onto the ambiguous thermal image through inverse disparity estimation, effectively overcoming errors caused by image blur and spectral inconsistencies. Validated on a self-built dataset of farmed sika deer, the ID-APM framework demonstrated exceptional performance. It achieved a remarkable overall accuracy of 96.95% and a Correct Matching Ratio (CMR) of 99.93%. This research provides a robust and automated solution that effectively bypasses the limitations of low-resolution thermal sensors, offering a promising and practical tool for precision health monitoring, early disease detection, and enhanced management of semi-wild farmed animals like sika deer.

1. Introduction

Precision Livestock Farming (PLF) is revolutionizing modern agriculture by integrating advanced technologies to monitor animal health and behavior automatically, thereby enhancing productivity and animal welfare [1,2]. Within the PLF paradigm, non-contact monitoring systems are particularly valuable for high-value, stress-sensitive species such as farmed sika deer. Traditional health assessment methods, such as manual inspection or the use of rectal thermometers, are often impractical for these semi-wild animals, as they induce significant stress that can compromise their health and reduce economic returns [3]. Infrared Thermography (IRT) has emerged as a powerful, non-invasive alternative, enabling remote and continuous measurement of body surface temperature, which serves as a critical indicator of physiological status, complementing other metrics such as heart rate [4] and respiratory rate [5].
Despite its promise, the widespread adoption of IRT in real-world agricultural settings is hindered by a confluence of formidable challenges centered on the robust and automated localization of physiologically significant Regions of Interest (ROIs), such as the ocular area [6]. The primary obstacle is the inherently low resolution of the thermal sensors typically employed in agricultural applications for economic feasibility and large-scale deployment (e.g., 256 × 192 pixels in this study). This hardware limitation is the root cause of significant detail loss and visual ambiguity. Compounding this issue, the complex and often cluttered backgrounds of farm environments, coupled with the non-rigid nature of the animal subjects, introduce severe non-linear distortions between the visible and thermal image pairs. As illustrated in Figure 1, the thermal image of a sika deer’s eye appears highly blurred and pixelated, with its precise boundaries almost indistinguishable. This degradation is a pervasive consequence of low-resolution sensing, often exacerbated by animal motion, and makes the registration task exceptionally difficult. The combined effects of low resolution, non-linear distortions, and the lack of clear features in the thermal modality pose a formidable challenge for conventional cross-modal image registration algorithms.
Existing infrared-based animal monitoring approaches can be broadly distinguished by the nature of the animal’s skin and coat, which fundamentally dictates the task’s difficulty. For species with sparse or no fur, such as pigs, the thermal signatures of key physiological areas are often distinct and well-defined against the body. This allows for the direct application of object detection algorithms [7] on the thermal image itself to automate ROI localization with considerable success. The challenge becomes significantly more complex for species with dense fur, such as cattle [8], sheep, and the sika deer in our study. The coat’s insulating properties obscure the underlying thermal patterns, making small thermal windows like the eyes extremely difficult to distinguish automatically [9] from the surrounding fur. Consequently, many studies on furred animals still rely on manual or semi-automated methods for ROI selection, an approach that is impractical for large-scale, continuous monitoring. To address this, some researchers have proposed cross-modal methods that introduce information from the visible light spectrum to aid localization. These methods typically involve image fusion [10] or feature matching between the visible and thermal domains. However, they are still fundamentally limited by the quality of the thermal image. If the thermal ROI lacks distinct features due to low resolution and blur, matching algorithms (e.g., SIFT [11], SURF [12]) are prone to failure.
To bridge this gap, this paper introduces a trinocular vision framework, termed Inverse Disparity-guided Annealing Point Matching (ID-APM), designed explicitly for robust cross-modal registration in these demanding agricultural scenarios. A hardware rig implementing this method was also designed; its physical layout is shown in Figure 2. For clarity of presentation, the two visible cameras are collectively referred to as the stereo camera, while the infrared-visible binocular system refers to the pairing of the infrared thermal camera with a visible camera. Our primary contributions are threefold:
(1) We propose a framework that leverages the high-resolution visible spectrum from a stereo camera system to guide the localization of ROIs within the ambiguous, low-resolution thermal image, effectively decoupling the registration task from the poor image quality of the thermal modality.
(2) We introduce an enhanced point set matching algorithm, Random Annealing Pruning CPD (RAP-CPD), which demonstrates superior robustness against the noisy, sparse feature sets and non-rigid transformations typical of this application.
(3) We validate our method on a real-world dataset of farmed sika deer, achieving an exceptional overall accuracy of 96.95% and a Correct Matching Ratio (CMR) of 99.93%, demonstrating its immense potential as a practical tool for precision livestock farming. By fusing 3D spatial coordinates with temperature data, our system provides a rich, four-dimensional dataset (X, Y, Z, T) that paves the way for advanced health diagnostics and management of farmed animals.

2. Materials and Methods

As shown in the implementation pipeline of the proposed matching method in Figure 3, the thermal and visible images first undergo calibration operations, yielding two aligned image pairs. For the visible images, the disparity of the target object is computed via the object detection model and the detection box matching algorithm; the infrared mapping region is then obtained through disparity estimation and region mapping operations.

2.1. Experimental Data

The dataset for this study was collected at the Dong’ao Sika Deer Breeding Base in Shuangyang District, Jilin Province, China. It comprises 263 sets of original trinocular images, each captured simultaneously by a coaxial camera array. The array consists of two identical visible-light cameras and a thermal infrared camera with a sensor resolution of 256 × 192 pixels. The visible-light camera is the HBVCAM-12M2353 model manufactured by Shenzhen Shitong Yuntian Technology Co., Ltd. (Shenzhen, China), while the thermal infrared camera is the C256 model produced by Yantai Airui Optoelectronics Technology Co., Ltd. (Yantai, China). This setup was specifically designed to replicate typical monitoring scenarios in agricultural environments where non-contact observation is required.
Manual annotations were created for two key Regions of Interest (ROIs) on the sika deer: the face and the eyes. The annotation process was tailored to the unique challenges of each modality. For the thermal images, the “deer face” labels were meticulously delineated based on the discernible thermal gradients and edge information present in the low-resolution infrared data. In contrast, annotating the “deer eye” in the thermal domain was significantly more challenging due to its small size and poor thermal contrast. Consequently, the “deer eye” labels were generated through a semi-automated process: they were primarily guided by the thermal edges, but their final positions were refined by referencing the relative location of the corresponding, clearly visible eye in the high-resolution visible-light images. This reliance on cross-modal reference introduces a degree of subjective judgment. To mitigate this subjectivity and ensure a robust ground truth, each ‘deer eye’ was independently annotated by five individuals. The final ground-truth coordinate was then established by averaging the coordinates from these five separate annotations. This multi-annotator averaging process ensures the resulting labels are highly objective and serve as a reliable benchmark for our evaluation.
To enhance the robustness of our evaluation and simulate a wider range of real-world farm conditions, the original dataset was augmented. We applied various types of noise, including Gaussian and salt-and-pepper noise, to the images to simulate environmental factors such as sensor noise, dust, and minor occlusions. This augmentation process expanded the original 263 image sets into a comprehensive dataset of 1315 sets, allowing for a more thorough validation of the algorithm’s performance under diverse and challenging scenarios. The distribution of labels across the augmented dataset is detailed in Table 1.
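For illustration, the following minimal sketch shows one way such Gaussian and salt-and-pepper corruption can be applied; the function names and noise parameters are ours, not values prescribed by the paper:

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0):
    """Additive Gaussian noise to simulate sensor noise."""
    noisy = img.astype(np.float32) + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_and_pepper(img, amount=0.01):
    """Salt-and-pepper noise to simulate dust and minor occlusions."""
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < amount / 2] = 0          # pepper
    noisy[mask > 1 - amount / 2] = 255    # salt
    return noisy
```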
To ensure a fair and standardized comparison across different algorithms, all images provided to the baseline methods were pre-calibrated. This preprocessing step geometrically aligns the images, thereby eliminating confounding factors such as scale and rotation. This allows the evaluation to focus purely on the core matching performance of each algorithm, rather than on their sensitivity to parameter tuning for geometric transformations [13].

2.2. Image Size Adaptation and Rectification

Due to the inherent differences in field of view, focal length, and pixel size between infrared thermal and visible cameras, their spatial resolutions are significantly inconsistent. This mismatch directly affects the accuracy and consistency of the heterogeneous images in the subsequent registration process. It is therefore necessary to perform scale normalization on the images, including scaling and cropping operations, so that they meet a unified standard resolution. Specifically, by computing the scaling parameters in the homography matrix, the relative spatial resolution relationship between the two cameras can be determined. This initial step corrects geometric distortions and normalizes the image scales, a critical prerequisite for all subsequent matching and mapping operations. The spatial resolution of the multimodal imaging device before scaling is shown in Figure 4.
In Figure 4, $OM_t$ and $OM_v$ respectively represent the focal lengths of the infrared thermal camera and the visible camera. Referring to the image scale normalization method proposed by Xiang [14], this paper further optimizes and enhances it. The original method calculates the scaling factor from the ratio of the physical focal length to the pixel size, where the pixel size is defined as the physical size of a single pixel; the result is therefore the ratio of the pixel focal lengths of the infrared and visible cameras under ideal conditions. However, due to non-ideal factors such as lens distortion, optical characteristics, and sensor differences, there is a significant deviation between the ideal pixel focal length and the actual pixel focal length obtained through calibration. Furthermore, the original method did not fully consider the influence of the maximum pixel size on the pixel focal length, reducing the accuracy of the computed scaling factor and, in turn, the actual registration quality. To improve the accuracy of the scaling factor, this paper directly adopts the calibrated actual pixel focal lengths. The formula is as follows:
$$s = \frac{f_{VIS}}{f_{IR}} \tag{1}$$
Among them, $s$ is the scaling factor used to normalize the image sizes, $f_{VIS}$ is the calibrated focal length of the visible camera in pixels, and $f_{IR}$ is the calibrated focal length of the infrared camera, also in pixels. Computing the scaling factor from the measured pixel focal lengths reflects the scale difference between the infrared and visible cameras more accurately. Next, the infrared image is magnified by the scaling factor, and the central area of the visible image is cropped, yielding the initially adapted infrared and visible images. Subsequently, using these adapted images, the extrinsic parameters between the infrared camera and the visible camera are calculated through Zhang’s calibration method, thereby achieving accurate geometric calibration. Specifically, Zhang’s method requires calibration images acquired in advance, using standard checkerboard images as the reference to obtain each camera’s intrinsic parameters and the extrinsic parameters between cameras [15]. However, ensuring the consistency of the checkerboard across a heterologous imaging system is a major challenge.
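As a concrete illustration of this adaptation step, the sketch below scales the infrared image by $s = f_{VIS}/f_{IR}$ and center-crops the visible image. The helper name, the OpenCV usage, and the assumption that the scaled IR frame fits inside the visible frame are ours:

```python
import cv2

def adapt_scales(ir_img, vis_img, f_vis_px, f_ir_px):
    """Magnify the IR image by s = f_VIS / f_IR (calibrated pixel focal
    lengths), then centre-crop the visible image to the same size.
    Assumes the scaled IR frame is no larger than the visible frame."""
    s = f_vis_px / f_ir_px
    h, w = ir_img.shape[:2]
    ir_scaled = cv2.resize(ir_img, (round(w * s), round(h * s)),
                           interpolation=cv2.INTER_CUBIC)
    H, W = vis_img.shape[:2]
    h2, w2 = ir_scaled.shape[:2]
    y0, x0 = (H - h2) // 2, (W - w2) // 2
    vis_cropped = vis_img[y0:y0 + h2, x0:x0 + w2]
    return ir_scaled, vis_cropped
```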
To solve this problem, referring to methods such as Nakagawa’s use of a hot air gun [16] and Song’s creation of a temperature difference through an electric heating module [17], this paper proposes a flexible calibration scheme. Specifically, a checkerboard printed on A4 paper and fixed to a rigid support plate is illuminated by a high-intensity light source; the differing light absorption of the black and white blocks creates a temperature difference between them, producing the calibration images. The advantage of this scheme is that the differential light absorption creates a significant temperature gradient, giving the infrared and visible images highly consistent features without being limited by other factors, such as the thickness of an electric heating module or checkerboard distortion caused by hot air. Figure 5 shows the image processing effect before and after calibration. The results show that, after this geometric calibration, the IRT image is not only free of distortion but also substantially satisfies the same geometric constraints as the visible image.
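For readers wishing to reproduce the calibration step, the following is a minimal single-camera sketch of Zhang’s method using OpenCV; the board geometry and variable names are illustrative, and the extrinsics between cameras would additionally require `cv2.stereoCalibrate` on corner sets detected in both modalities:

```python
import cv2
import numpy as np

def calibrate_camera(images, board=(9, 6), square=25.0):
    """Zhang-style calibration from checkerboard views of one camera.
    `board` is the inner-corner grid; `square` is the cell size in mm."""
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts = [], []
    for img in images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        found, corners = cv2.findChessboardCorners(gray, board)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    # returns the intrinsic matrix K, distortion coefficients, and
    # per-view extrinsics (rotation / translation vectors)
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_pts, img_pts, gray.shape[::-1], None, None)
    return K, dist, rvecs, tvecs
```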

2.3. Target Object Detection

Once the image data has been calibrated, the subsequent step is to extract external features suitable for effective image matching. In this study, the binocular ranging method is combined with the target detection model to obtain the relative distance information of the target object, thereby achieving the image matching task.
The object detection model selected for this study is YOLOv11. While newer versions of the YOLO [18] series are available, the choice of YOLOv11 was a deliberate decision based on the specific requirements of this research. Our primary objective is to develop a practical and deployable system for real-world animal monitoring applications. In this context, YOLOv11 offers an optimal balance of performance, stability, and ease of deployment. It is a mature and extensively validated model within the industrial and academic communities, with robust community support and well-documented deployment pipelines. YOLOv11 was therefore deemed the most suitable choice as a reliable and efficient component within our larger methodological framework. To ensure the generalizability of the detector and to mitigate the risk of overfitting, the YOLOv11 model was trained not on the 263 image sets used for our final evaluation, but on a separate, extensive internal dataset. This training dataset comprises 4325 images of sika deer, capturing a wide variety of individuals, backgrounds, lighting conditions, and poses. This pre-training process yields a robust and generalized deer detection model. The strict annotation standards and IoU loss adjustments mentioned in this study were applied during this extensive training phase. The resulting trained model is then applied to our dataset to perform inference and extract the bounding boxes for the subsequent matching task.
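A minimal inference sketch is given below, assuming the ultralytics Python package and a hypothetical fine-tuned weights file; only the bounding boxes and confidences are passed on to the matching stage:

```python
from ultralytics import YOLO  # assumes the ultralytics package is installed

model = YOLO("deer_yolo11.pt")  # hypothetical fine-tuned weights file
results = model(["left.jpg", "right.jpg"], conf=0.25)
for r in results:
    # xyxy boxes and confidences consumed by the detection box matching step
    boxes = r.boxes.xyxy.cpu().numpy()
    scores = r.boxes.conf.cpu().numpy()
```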

2.4. Detection Box Matching

Since the center point of a YOLO detection box is constrained by the grid-cell layout [19], its performance on small targets is poor; missing boxes, duplicate labels, and localization errors may occur. Moreover, in a real production environment, similar targets may be partially occluded or closely stacked, so the detection boxes of the same object cannot be matched simply by sorting the box center points in the two visible images, $Img_L$ and $Img_R$, captured by the stereo camera. In addition, because the stereo camera captures the scene coaxially, the positions of objects in the left and right images differ due to disparity, which depends on distance. This disparity is the core principle by which binocular vision estimates object distance, exploiting the slight differences between the images projected onto the left and right retinas, a phenomenon also known as binocular disparity. Under idealized conditions, the magnitude of this disparity is inversely related to the viewing distance [20]; that is, as an object moves farther away, the disparity decreases because the baseline between the two viewpoints subtends a smaller angle at the object. The detection boxes of a binocular image pair are shown in Figure 6, illustrating the need for non-rigid registration; matching the two point sets therefore cannot be achieved by a single global translation. To this end, this paper proposes a CPD [21] point set matching algorithm based on random annealing pruning (RAP-CPD), which strengthens the handling of global information on top of the CPD algorithm.
First of all, we need to establish the Gaussian Mixture Model (GMM) between the target point set X and the source point set Y :
$$p(x_n) = v\,p_{out}(x_n) + (1 - v)\sum_{m=1}^{M}\frac{1}{M}\,p(x_n \mid m) \tag{2}$$
Among them, $p(x_n \mid m) = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\left(-\frac{\left\| x_n - T(y_m) \right\|^2}{2\sigma^2}\right)$, $v$ is the outlier probability, $M$ is the number of points in the point set $Y$, $\sigma^2$ is the variance between the point sets $X$ and $Y$, and $T(y_m)$ is the position of the source point $y_m$ after transformation.
Because the CPD algorithm relies on the differences between the source and target point sets to determine whether their distributions are similar, when a local area is more densely distributed, larger weights are assigned to these denser regions, overfitting to local detail. This degrades the representation of the macroscopic shape, especially when a non-rigid registration is required between the two point sets. To address this, we introduce a random pruning operation that dynamically sparsifies the point set at each iteration. The design of the pruning strategy is crucial for balancing global exploration and local refinement. In the initial stages, aggressive pruning is needed to force the algorithm to focus on the overall shape, making it robust to outliers and noise. As a coarse alignment is achieved, more conservative pruning is required to retain points for fine-grained adjustments. To achieve this adaptive behavior, we propose a non-linear annealing schedule for the point survival probability $p_s(t)$, defined as:
$$p_s(t) = 1 - \frac{p_r}{\ln(t + e)} \tag{3}$$
Among them, $p_r$ is the initial discard rate, and $t$ is the iteration number. The constant $e$ ensures the argument of the logarithm is always positive. The motivation for this logarithmic form is its desirable decay characteristic: the discard probability $\frac{p_r}{\ln(t + e)}$ decreases sharply at the beginning and then slows down, perfectly matching our requirement for an aggressive-to-conservative pruning transition. In our experiments, the total number of iterations is set to 50.
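A minimal sketch of this pruning step is shown below; the initial discard rate and the random generator are illustrative choices, not values prescribed by the paper (beyond the 50-iteration budget noted above):

```python
import numpy as np

def prune(Y, t, p_r=0.3, rng=None):
    """Randomly sparsify the source point set Y at iteration t using the
    annealed survival probability p_s(t) = 1 - p_r / ln(t + e)."""
    rng = rng or np.random.default_rng()
    p_s = 1.0 - p_r / np.log(t + np.e)   # sharp early decay, then slow
    keep = rng.random(len(Y)) < p_s      # Bernoulli survival mask
    return Y[keep]
```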
Then, the EM algorithm is adopted to optimize the parameter estimation of the probability model with hidden variables, maximizing the log-likelihood of the Gaussian Mixture Model (GMM). Convergence is ensured through the iterative computation of the objective function $Q$ and the Bayesian matching probability $p_{mn}$ between the point sets:
$$Q(T, \sigma^2) = \frac{N}{2}\ln\sigma^2 + \frac{1}{2\sigma^2}\sum_{n=1}^{N}\sum_{m=1}^{M} p_{mn}\left\| x_n - T(y_m) \right\|^2 \tag{4}$$
$$p_{mn} = \frac{\exp\left(-\frac{1}{2\sigma^2}\left\| x_n - T(y_m) \right\|^2\right)}{\sum_{k=1}^{M}\exp\left(-\frac{1}{2\sigma^2}\left\| x_n - T(y_k) \right\|^2\right) + c} \tag{5}$$
Among them, $c = (2\pi\sigma^2)^{D/2}\,\frac{v}{1-v}\,\frac{M}{N}$. By minimizing the objective function $Q$, the negative log-likelihood $E$ decreases until it reaches a local minimum.
$$E(T, \sigma^2) = -\sum_{n=1}^{N}\log\sum_{m=1}^{M}\frac{p(x_n \mid m)}{M} \tag{6}$$
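To make the E-step concrete, the following sketch computes the responsibilities $p_{mn}$ and the outlier constant $c$ defined above; the array shapes and the function name are our own conventions:

```python
import numpy as np

def e_step(X, TY, sigma2, v):
    """E-step responsibilities p_mn of Equation (5).
    X: (N, D) target points; TY: (M, D) transformed source points."""
    N, D = X.shape
    M = TY.shape[0]
    # squared distances ||x_n - T(y_m)||^2, shape (M, N)
    d2 = ((TY[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    num = np.exp(-d2 / (2.0 * sigma2))
    c = (2.0 * np.pi * sigma2) ** (D / 2.0) * v / (1.0 - v) * M / N
    return num / (num.sum(axis=0, keepdims=True) + c)
```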
Meanwhile, to avoid overfitting to noise and ensure a smooth and coherent transformation, a regularization term is incorporated into the objective function. This term penalizes complex, non-rigid deformations. It effectively acts as a low-pass filter on the displacement field, preserving the global, macroscopic shape of the point set while suppressing high-frequency, localized jitter. This regularization is a key component of the CPD algorithm that ensures physically plausible transformations.
Through the continuous iteration of the EM algorithm, the variance $\sigma^2$ typically decreases. However, our random pruning mechanism introduces stochasticity, which can occasionally increase the variance. The standard CPD algorithm lacks a mechanism to escape local optima. To address this, we introduce a probabilistic acceptance criterion inspired by the principles of simulated annealing. The core idea is to allow the algorithm to occasionally accept a “worse” solution (one that increases variance) to explore the solution space more thoroughly. While any step that decreases the variance is always accepted, a step that increases the variance is accepted with a conditional probability $P$, defined as:
$$P\left(\sigma^2_{t}, \sigma^2_{t+1}\right) = \begin{cases} 1, & \sigma^2_{t+1} - \sigma^2_{t} \le 0 \\ \exp\left(-\dfrac{\sigma^2_{t+1} - \sigma^2_{t}}{\sigma^2_{t}\,\ln\sigma^2_{t} \,/\, \left(\ln\sigma^2_{t} - \ln\sigma^2_{t+1}\right)}\right), & \sigma^2_{t+1} - \sigma^2_{t} > 0 \end{cases} \tag{7}$$
While this formulation shares the philosophical goal of traditional simulated annealing, its mechanism is distinct. The motivation behind this specific formulation is to create an adaptive exploration strategy, rather than relying on a predefined cooling schedule. The denominator term in the exponent acts as an “effective temperature” that is not externally scheduled but is dynamically dependent on the magnitude of the variance change itself. When the algorithm is making large, unstable jumps ($\sigma^2_{t+1}$ much larger than $\sigma^2_{t}$), the acceptance probability for this “bad” move is low, preventing divergence. Conversely, when the algorithm has stabilized in a region where variance changes are small, the denominator term becomes very large, causing the acceptance probability to increase. This encourages the algorithm to “jump out” of potentially shallow local minima when it is close to convergence, promoting a more exhaustive search of the solution space. This dynamic, self-regulating acceptance mechanism enhances the algorithm’s ability to find a superior global solution. To maintain a balance between exploration and stability, we accept an inferior solution only if the calculated probability $P$ exceeds a threshold of 0.8.
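The acceptance rule can be sketched as follows, transcribing our reading of the piecewise probability in Equation (7) together with the 0.8 threshold stated above; the function name and the use of NumPy are ours:

```python
import numpy as np

def accept(sig2_old, sig2_new, threshold=0.8):
    """Annealing-style acceptance of a variance update (Equation (7));
    the 0.8 acceptance threshold follows the text."""
    if sig2_new <= sig2_old:
        return True  # variance-decreasing steps are always accepted
    # effective temperature, driven by the magnitude of the variance change
    temp = sig2_old * np.log(sig2_old) / (np.log(sig2_old) - np.log(sig2_new))
    p = np.exp(-(sig2_new - sig2_old) / temp)
    return p > threshold
```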

2.5. Infrared Region Mapping

The disparity of the visible stereo camera system, obtained as the difference between the center points of the matched boxes in the two visible images, is fed into triangulation to calculate the distance of the target object relative to the coaxial plane of the stereo camera. The coaxial alignment of the three cameras ensures that this derived distance is equally valid for the target’s position relative to the coaxial plane of the encompassing infrared-visible binocular system. The ranging schematic is shown in Figure 7, where three cameras simultaneously locate a point in space.
The following formula for depth estimation can be derived from the principle of similar triangles formed by the camera centers, the image planes, and the target object in 3D space:
$$\frac{z}{f} = \frac{b_{L\text{-}IR}}{x_L - x_{IR}} = \frac{b_{LR}}{x_L - x_R} \tag{8}$$
Among them, $x_L$, $x_{IR}$, and $x_R$ represent the horizontal coordinate offsets of the target object relative to the image center in the three views; $b_{L\text{-}IR}$ and $b_{LR}$ represent the baseline lengths of the infrared-visible binocular system and the stereo camera, respectively; $z$ represents the distance of the target object relative to the coaxial plane of the trinocular system; and $f$ represents the pixel focal length of the camera. Since the two visible cameras are of the same model, and the thermal infrared image has already been adapted to the visible images through the scaling operation, the pixel focal lengths of the three cameras can be regarded as consistent.
Once the depth $z$ is obtained, we can algebraically rearrange Equation (8) to solve for the unknown horizontal position of the target in the infrared image, $x_{IR}$. This inverse operation allows us to project the known location from the visible domain onto the thermal domain.
$$x_{IR} = x_L - \frac{f}{z} \times b_{L\text{-}IR} \tag{9}$$
Finally, based on the center position of the eye detection box in $Img_L$ and the size of the detection box, the region mapping in the infrared thermal image is carried out. This final mapping step represents the culmination of our entire framework. It elegantly solves the initial problem by completely bypassing the need for direct feature detection or segmentation in the noisy, low-resolution thermal domain.
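For clarity, the two geometric relations above, Equations (8) and (9), can be transcribed directly into code; the function and parameter names below are illustrative:

```python
def depth_from_disparity(x_L, x_R, f, b_LR):
    """Depth of the target from the stereo pair: z = f * b_LR / (x_L - x_R),
    per Equation (8)."""
    return f * b_LR / (x_L - x_R)

def project_to_ir(x_L, z, f, b_L_IR):
    """Inverse disparity estimation (Equation (9)): predicted horizontal
    position of the target in the scale-adapted infrared image."""
    return x_L - f * b_L_IR / z
```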

3. Results

To verify the superiority of ID-APM in locating animal ROIs in infrared thermal images, we compare it with different types of methods [22,23]. Our selection strategy was twofold: since ID-APM is primarily based on traditional computational principles, we first compared it against leading traditional algorithms, including feature-based (MS-PIIFD, MS-HLMO, POS-GIFT) and area-based (CFOG) approaches. To provide a more comprehensive comparison against the current state of the art, we also included several powerful deep learning-based methods (DISK, ReDFeat, and XoFTR). This diverse benchmark allows for a rigorous assessment of our method’s performance against both its peers and leading deep learning paradigms.

3.1. Validation of the Calibration Method

To quantitatively validate the accuracy and robustness of our proposed flexible calibration method, we evaluated its performance using the standard metric of average reprojection error. A set of 248 calibration images was captured from various angles and distances. Using the camera parameters, we reprojected the 3D world coordinates of the checkerboard corners back onto the 2D image planes. The average reprojection error, which measures the distance between the reprojected points and the originally detected corner points, was calculated across all images. Our method achieved an average reprojection error of 0.9 pixels, which is well within the acceptable range for high-precision applications. This result confirms that our low-cost, light-based calibration approach provides accurate and reliable camera parameters, establishing a solid foundation for the subsequent 3D reconstruction and registration tasks.
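A standard way to compute this metric with OpenCV is sketched below; it assumes the per-view corner lists and calibration outputs described in Section 2.2, with our own variable names:

```python
import cv2
import numpy as np

def mean_reprojection_error(obj_pts, img_pts, K, dist, rvecs, tvecs):
    """Average reprojection error (pixels) over all calibration views."""
    total, count = 0.0, 0
    for op, ip, rv, tv in zip(obj_pts, img_pts, rvecs, tvecs):
        proj, _ = cv2.projectPoints(op, rv, tv, K, dist)
        total += np.sum(np.linalg.norm(
            ip.reshape(-1, 2) - proj.reshape(-1, 2), axis=1))
        count += len(op)
    return total / count
```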

3.2. Effect of Image Scaling on Registration Quality

Before evaluating the end-to-end performance of our method, we first conducted a preliminary experiment to validate the effectiveness of the proposed image scaling step. This step is crucial for harmonizing the different fields of view and resolutions between the visible and infrared cameras. To quantify its impact, we compared the registration quality of 248 image pairs with and without applying the scaling factor, using Mutual Information (MI) as the metric where a higher value indicates better alignment. The results, presented in Figure 8, clearly demonstrate that applying the scaling factor consistently leads to higher MI values, confirming the significant benefit of this preprocessing step for improving the baseline registration quality.
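As a reference for how MI can be computed, the following sketch uses a joint histogram over pixel intensities; the bin count is an illustrative choice, and the natural logarithm is used (absolute MI values, such as those reported in Figure 8, depend on such conventions):

```python
import numpy as np

def mutual_information(img_a, img_b, bins=64):
    """Histogram-based mutual information between two registered images."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)       # marginal of img_b
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```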

3.3. Evaluation Metrics

The root mean square error (RMSE), best root mean square error (Best RMSE), accuracy, class accuracy, and correct matching ratio (CMR) are introduced to evaluate the compared methods. Best RMSE is computed by preferentially selecting, from the multiple matched feature points within each region, the one with the smallest Euclidean distance. The expressions for the metrics are as follows:
$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\| S_i - O_i \right\|^2} \tag{10}$$
$$Best\ RMSE = \sqrt{\frac{1}{M}\sum_{i=1}^{M} D_{i,\min}^2} \tag{11}$$
Among them, $O_i$ represents the true coordinate point, $S_i$ is the coordinate point derived by the method, $N$ is the number of feature points, $M$ is the number of detection areas, and $D_{i,\min}$ represents the distance of the closest matching feature point in the $i$-th area. To prevent erroneous feature point correspondences from degrading the metrics [24], a filtering process, guided by the established correct regional mapping, is applied to the feature point matching results: the distance between the derived coordinates and the actual coordinates may not exceed 80 pixels. The reported metrics of the comparison algorithms are therefore better than their unfiltered values.
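A direct transcription of these two metrics into code might look as follows; the input conventions (pixel-coordinate arrays and per-region distance lists) are our own assumptions:

```python
import numpy as np

def rmse(S, O):
    """RMSE over N matched points; S and O are (N, 2) pixel coordinates."""
    return float(np.sqrt(np.mean(np.sum((S - O) ** 2, axis=1))))

def best_rmse(dists_per_region):
    """Best RMSE: keep only the smallest-distance match in each of the
    M regions, then aggregate (Equation (11))."""
    d_min = np.array([np.min(d) for d in dists_per_region])
    return float(np.sqrt(np.mean(d_min ** 2)))
```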

3.4. Experimental Results and Analysis

To evaluate the performance of our proposed ID-APM, we conducted a series of comparative experiments. However, before presenting the results, it is crucial to understand the unique challenges posed by our dataset, which significantly impact the performance of many image matching methods. Our benchmark is characterized by four primary difficulties: a drastic disparity in spatial resolution between the high-resolution visible (1600 × 1200) and very low-resolution thermal (256 × 192) images; the indistinct and textureless nature of the target ROI in the thermal modality; a label category consisting of small targets; and complex, cluttered backgrounds that introduce severe non-linear distortions.
To visually demonstrate the superiority of the algorithm in different scenarios, two sets of infrared-visible image data of different difficulty were selected from the dataset, as shown in Figure 9. One set contains a single sika deer, while the other contains a herd of sika deer.
The mapping relationships between visible and infrared images under different methods are visually presented in Figure 9. The selected baseline methods represent three distinct categories: feature-based matching (FBMs) including MS-PIIFD, MS-HLMO, and POS-GIFT; the area-based method (ABM) CFOG; and learning-based methods, ReDFeat, DISK and XoFTR.
In the simpler single-deer scenario, most methods demonstrate some capability. However, the performance distinctions become starkly evident in the more realistic and challenging multi-deer scene. The traditional feature-based methods exhibit divergent and unreliable behaviors. MS-PIIFD struggles to extract robust features from the low-resolution thermal image, leading to erroneous many-to-one matches. POS-GIFT and XoFTR successfully identify a few high-confidence correspondences, but these are exclusively concentrated on the nearest, most detailed sika deer. Conversely, MS-HLMO manages to find some matches on the strong contour information of the more distant deer but fails to detect any features on the closer, more textured animal. This fragmented performance underscores the unreliability of FBMs in complex scenes with significant scale and feature-quality variations.
The area-based method, CFOG, fails to yield many meaningful correspondences for the targeted ROIs in this challenging context. Among the learning-based approaches, ReDFeat and XoFTR, having been trained on multi-modal data, provide a stable but sparse set of correct matches, primarily on the nearest deer, outperforming the general-purpose DISK model. DISK establishes a much denser set of correspondences, but many are chaotically distributed across the background, lacking the specificity required for our task. Ultimately, all baseline methods demonstrate a critical failure: at best, they partially register a few deer while completely ignoring others in the herd.
In striking contrast, our proposed ID-APM framework demonstrates superior performance by successfully and comprehensively locating the ROIs of all sika deer present in the scene, including those that are distant and highly blurred. This remarkable robustness stems from its core design principle: by leveraging external geometric cues from the high-resolution stereo system rather than relying on fickle internal spectral information from the thermal image, our method is largely immune to the challenges of low resolution and thermal ambiguity. While the final coordinate precision, influenced by the inherent jitter of the YOLO detection boxes, may be marginally less exact than the few correct matches found by methods like POS-GIFT or ReDFeat, ID-APM’s ability to provide reliable, holistic localization across all targets represents a paradigm shift in utility. For practical agricultural monitoring, achieving near-perfect recall and robust localization for every animal in a herd is profoundly more valuable than achieving sub-pixel precision on a single, ideally positioned animal while completely missing others. Our method provides reference localizations for every target, which is more than sufficient for reliable temperature extraction and health assessment.
The comprehensive quantitative performance of all methods on the augmented sika deer dataset is summarized in Table 2. The results unequivocally demonstrate the superior efficacy of our proposed ID-APM framework, particularly in the metrics most critical for practical agricultural applications: accuracy, Correct Matching Ratio (CMR), and computational efficiency.
A detailed analysis reveals several key insights. The performance of MS-PIIFD, CFOG, and DISK is severely limited in this challenging scenario. Their low Accuracy (<65%) and poor Correct Matching Ratio (especially for MS-PIIFD and DISK) indicate a fundamental failure to establish reliable correspondences in the face of low-resolution thermal imagery and complex scene dynamics. These methods are ill-suited for the demands of this application.
At first glance, MS-HLMO, ReDFeat, POS-GIFT and XoFTR appear to be strong competitors due to their exceptionally low RMSE values, suggesting high precision on the matches they find. This precision, however, is profoundly misleading when viewed alongside their abysmal accuracy scores, which languish between 50% and 65%. This stark discrepancy tells a critical story: these methods are only capable of “cherry-picking” a small number of high-confidence matches on the easiest, most feature-rich targets (typically the closest deer), as was observed in our qualitative analysis. While the few matches they establish are indeed precise (hence the low RMSE), and their CMR is high, their inability to detect and register the majority of the animals in the scene renders them practically ineffective for holistic herd monitoring. In an agricultural context, a system that misses nearly half of the animals is fundamentally unreliable, regardless of the correctness of its few successful matches.
In stark contrast, our ID-APM framework achieves a near-perfect overall accuracy of 96.95% and a CMR of 99.93%. This combination of high accuracy and exceptional matching correctness demonstrates the framework’s overall robustness: it not only correctly matches identified targets but, crucially, finds almost every target in every scene. This holistic superiority is visually encapsulated in the radar chart presented in Figure 10, where the overall performance of each method is represented by the area of its polygon. ID-APM’s polygon occupies a significantly larger area than any competitor, signifying its dominant and well-balanced performance profile. Moreover, the slightly higher RMSE of ID-APM (21.25) is an expected and acceptable trade-off. Because our method does not reference feature information within the infrared image, its final precision is susceptible to the cumulative errors from its prior knowledge (such as detector jitter), which explains why its RMSE is higher than that of the top-performing feature matching algorithms. However, this minor compromise in precision is vastly outweighed by the monumental gain in holistic scene understanding. While it is not faster than XoFTR, which leverages GPU acceleration for its operations, ID-APM is significantly more efficient than all other baseline methods, making it a highly viable solution for scalable, near-real-time monitoring.
Although ID-APM does not perform as well as several other methods in terms of RMSE, the distribution statistics in Figure 11 show that the deviations of the horizontal and vertical coordinates are concentrated near the origin, roughly following a quadratic decay: the further from the origin, the lower the distribution density. The obtained mapping regions therefore have high reference value and, contrary to what a raw average might suggest, the results do not diverge without bound.

4. Discussion

This study was motivated by a critical challenge in Precision Livestock Farming: the reliable localization of physiological ROIs in the inherently low-resolution and blurred thermal images typical of real-world agricultural settings. Our experimental results demonstrate that the proposed ID-APM framework not only successfully addresses this challenge but also significantly outperforms existing state-of-the-art methods in terms of overall accuracy, matching robustness, and computational efficiency. The core strength of our approach lies in its strategic decoupling of the geometric localization task from the problematic thermal modality. By leveraging the rich, high-resolution information from a stereo visible system to establish a robust 3D geometric prior, ID-APM can accurately map ROIs onto the thermal image, a task where methods reliant on thermal features consistently fail. The near-perfect accuracy (96.95%) and CMR (99.93%) achieved on our challenging sika deer dataset provide strong evidence for the efficacy of this paradigm.
A key finding of our comparative analysis is the critical distinction between raw precision (RMSE) and practical utility (Accuracy and CMR) in the context of herd monitoring. While several baseline methods (e.g., MS-HLMO, ReDFeat, POS-GIFT and XoFTR) exhibited lower RMSE values, their performance was confined to a small fraction of “easy” targets, leading to an overall accuracy of less than 60%. This reveals a fundamental flaw: these algorithms are precise but not comprehensive. In a practical agricultural application, the ability to monitor every animal in a herd is paramount; a system that misses half the population is functionally useless, regardless of its precision on the few it detects. Our ID-APM framework, conversely, prioritizes holistic scene understanding. Its slightly higher RMSE is a direct and acceptable consequence of its design philosophy. Because our method intentionally avoids relying on ambiguous features within the thermal image, its final precision is instead determined by the geometric accuracy of its prior stages. This makes it susceptible to minor cumulative errors (such as detector jitter), which explains why its RMSE is higher than that of feature-matching algorithms that can achieve sub-pixel precision on the rare occasions they find a perfect match. This trade-off is fundamental to our contribution: by accepting a minor compromise in precision, we achieve a monumental gain in comprehensiveness. This represents a paradigm shift from a “precision on a few” model to a “reliable monitoring for all” model, which is far more aligned with the actual needs of modern livestock management.
However, ID-APM has several limitations. The primary constraint stems from its sequential pipeline, which makes the system susceptible to the propagation and accumulation of errors; inaccuracies in early stages can cascade and amplify. A key source of this risk is the system’s fundamental reliance on precise initial calibration. This strict prerequisite, a step not required by most image matching algorithms, can easily introduce systematic geometric shifts. Errors from YOLO localization and RAP-CPD matching also contribute, with mismatches being particularly critical as they can lead to catastrophic failures in ROI mapping.
The framework’s reliance on a deep learning model for initial detection also imposes certain hardware requirements for deployment. While our experiments show high efficiency, deploying the system on low-power edge devices, such as a Raspberry Pi or NVIDIA Jetson, would incur an additional computational cost of approximately 0.1–0.5 s per frame for YOLO inference. This latency can be partially offset by reducing the number of iterations in the RAP-CPD algorithm, but we found that maintaining a high level of reliability requires a minimum processing time of around 0.83 s per frame, making it suitable for near-real-time but not high-frame-rate applications.
Furthermore, the system’s performance is fundamentally dependent on the quality of the 3D positional data provided by the visible-light stereo cameras. Consequently, its robustness would be significantly compromised in environments with poor illumination or in the event of hardware failure of one of the visible cameras.
On the other hand, the framework’s intentional independence from thermal features presents a significant advantage for species generalization. Unlike methods tied to specific thermal patterns, the core logic of ID-APM is largely species-agnostic. Its adaptability to new types of livestock, such as cattle or swine, is primarily a matter of retraining the upstream YOLO detector with species-specific data, rather than re-engineering the entire registration pipeline. This suggests a clear and promising pathway for extending the framework’s applicability across a wide range of precision livestock farming scenarios.
Future research will focus on addressing the discussed limitations, such as by developing a more end-to-end framework to enhance the system’s applicability. Building on our robust ROI localization, the immediate next step is to enable continuous physiological status monitoring and the generation of automated health alerts. However, the ultimate goal is to conduct the crucial downstream clinical validation of this technology. This will involve long-term farm deployment to correlate our data with veterinary diagnoses, thereby quantitatively demonstrating how our system contributes to early disease detection and advances animal welfare studies.

5. Conclusions

In this study, we addressed the critical challenge of robust ROI localization in low-resolution thermal images for agricultural animal monitoring. We introduced ID-APM, a trinocular vision framework that decouples geometric localization from the ambiguous thermal modality by leveraging guidance from high-resolution stereo vision. Comprehensive experiments on a challenging dataset of farmed sika deer demonstrated the superiority of our approach. Achieving an overall accuracy of 96.95% and a Correct Matching Ratio of 99.93%, ID-APM provides the holistic and reliable herd-wide monitoring that eludes other methods. The primary contribution of this work is a paradigm shift from achieving high precision on a few targets to ensuring comprehensive localization for all animals, a more practical goal for livestock management. Coupled with its high computational efficiency, the ID-APM framework establishes a powerful and effective methodology for advancing automated health monitoring in precision livestock farming. Future work will aim to enhance end-to-end robustness, extend the framework to other species and enable physiological status monitoring through detection within regions of interest (ROI), thereby advancing animal welfare studies.

Author Contributions

C.Z.: Writing—review and editing, Writing—original draft, Validation, Software, Methodology, Formal analysis; Y.M.: Writing—review and editing, Methodology, Formal analysis; Y.S.: Writing—review and editing, Supervision; H.G.: Writing—review and editing, Supervision, Conceptualization; Y.G.: Supervision, Conceptualization; J.F.: Supervision, Funding acquisition; S.L.: Supervision, Conceptualization; Z.L.: Writing—review and editing, Conceptualization, Funding acquisition, Supervision; T.H.: Writing—review and editing, Methodology, Supervision, Funding acquisition, Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Technologies Research and Development Program, funding number [2023YFD1302000 https://service.most.gov.cn (accessed on 4 April 2025)]; the Department of Science and Technology of Jilin Province, funding number [YDZJ202501ZYTS581 http://kjt.jl.gov.cn (accessed on 4 April 2025)]; and the Department of Education of Jilin Province, funding number [JJKH20240467KJ http://jyt.jl.gov.cn (accessed on 4 April 2025)].

Data Availability Statement

The data presented in this study is available on request from the corresponding author. Due to ongoing research projects, only partial data is publicly available at: https://github.com/mozahal/ID-APM (accessed on 19 September 2025).

Acknowledgments

The authors acknowledge the staff at the Dong’ao Sika Deer Breeding Base for their assistance in data collection. Additionally, the authors extend their appreciation to all participants involved in the design of the ID-APM.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bianchi, M.C.; Bava, L.; Sandrucci, A.; Tangorra, F.M.; Tamburini, A.; Gislon, G.; Zucali, M. Diffusion of Precision Livestock Farming Technologies in Dairy Cattle Farms. Animal 2022, 16, 100650. [Google Scholar] [CrossRef]
  2. Tuyttens, F.A.M.; Molento, C.F.M.; Benaissa, S. Twelve Threats of Precision Livestock Farming (PLF) for Animal Welfare. Front. Vet. Sci. 2022, 9, 889623. [Google Scholar] [CrossRef]
  3. Tablado, Z.; Jenni, L. Determinants of Uncertainty in Wildlife Responses to Human Disturbance. Biol. Rev. 2017, 92, 216–233. [Google Scholar] [CrossRef]
  4. Di Credico, A.; Perpetuini, D.; Izzicupo, P.; Gaggi, G.; Cardone, D.; Filippini, C.; Merla, A.; Ghinassi, B.; Di Baldassarre, A. Estimation of Heart Rate Variability Parameters by Machine Learning Approaches Applied to Facial Infrared Thermal Imaging. Front. Cardiovasc. Med. 2022, 9, 893374. [Google Scholar] [CrossRef]
  5. Takahashi, Y.; Gu, Y.; Nakada, T.; Abe, R.; Nakaguchi, T. Estimation of Respiratory Rate from Thermography Using Respiratory Likelihood Index. Sensors 2021, 21, 4406. [Google Scholar] [CrossRef]
  6. Wang, F.-K.; Shih, J.-Y.; Juan, P.-H.; Su, Y.-C.; Wang, Y.-C. Non-Invasive Cattle Body Temperature Measurement Using Infrared Thermography and Auxiliary Sensors. Sensors 2021, 21, 2425. [Google Scholar] [CrossRef]
  7. Wang, G.; Ma, Y.; Huang, J.; Fan, F.; Wang, Z. Measurement of Pig Body Temperature Based on Ear Segmentation and Multifactor Infrared Temperature Compensation. IEEE Trans. Instrum. Meas. 2025, 74, 5008415. [Google Scholar] [CrossRef]
  8. Ma, S.; Yao, Q.; Masuda, T.; Higaki, S.; Yoshioka, K.; Arai, S.; Takamatsu, S.; Itoh, T. Development of Noncontact Body Temperature Monitoring and Prediction System for Livestock Cattle. IEEE Sens. J. 2021, 21, 9367–9376. [Google Scholar] [CrossRef]
  9. Dan, B.; Zhu, Z.; Wei, Y.; Liu, D.; Li, M.; Tang, T. Infrared Dim-Small Target Detection via Chessboard Topology. Opt. Laser Technol. 2025, 181, 111867. [Google Scholar] [CrossRef]
  10. Xie, Q.; Wu, M.; Yang, M.; Bao, J.; Li, X.; Liu, H.; Yu, H.; Zheng, P. A Deep Learning-Based Fusion Method of Infrared Thermography and Visible Image for Pig Body Temperature Detection. In Proceedings of the 2021 International Symposium on Animal Environment and Welfare, Chongqing, China, 21–23 October 2021. [Google Scholar]
  11. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  12. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up Robust Features. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Part I 9. pp. 404–417. [Google Scholar] [CrossRef]
  13. Gao, C.; Li, W. An Invariant Feature Extraction for Multi-Modal Images Matching. arXiv 2023, arXiv:2311.02842. [Google Scholar] [CrossRef]
  14. Xiang, L.; Zhao, L.; Chen, S.; Li, X. Infrared and Visible Image Registration in UAV Inspection. In Proceedings of the 2022 6th International Conference on Video and Image Processing, Shanghai, China, 23–26 December 2022; Association for Computing Machinery: New York, NY, USA, 2023; pp. 67–71. [Google Scholar] [CrossRef]
  15. Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  16. Nakagawa, W.; Matsumoto, K.; de Sorbier, F.; Sugimoto, M.; Saito, H.; Senda, S.; Shibata, T.; Iketani, A. Visualization of Temperature Change Using RGB-D Camera and Thermal Camera. In Proceedings of the Computer Vision-ECCV 2014 Workshops, Zurich, Switzerland, 6–7 and 12 September 2014; Part I 13. pp. 386–400. [Google Scholar] [CrossRef]
  17. Song, K.; Wang, J.; Bao, Y.; Huang, L.; Yan, Y. A Novel Visible-Depth-Thermal Image Dataset of Salient Object Detection for Robotic Visual Perception. IEEE ASME Trans. Mechatron. 2023, 28, 1558–1569. [Google Scholar] [CrossRef]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  19. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A Survey and Performance Evaluation of Deep Learning Methods for Small Object Detection. Expert Syst. Appl. 2021, 172, 114602. [Google Scholar] [CrossRef]
  20. Wanner, S.; Goldluecke, B. Variational Light Field Analysis for Disparity Estimation and Super-Resolution. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 606–619. [Google Scholar] [CrossRef]
  21. Fan, A.; Ma, J.; Tian, X.; Mei, X.; Lin, W. Coherent Point Drift Revisited for Non-Rigid Shape Matching and Registration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 1414–1424. [Google Scholar] [CrossRef]
  22. Brenner, M.; Reyes, N.H.; Susnjak, T.; Barczak, A.L.C. RGB-D and Thermal Sensor Fusion: A Systematic Literature Review. IEEE Access 2023, 11, 82410–82442. [Google Scholar] [CrossRef]
  23. Jiang, X.; Ma, J.; Xiao, G.; Shao, Z.; Guo, X. A Review of Multimodal Image Matching: Methods and Applications. Inf. Fusion 2021, 73, 22–71. [Google Scholar] [CrossRef]
  24. Chai, T.; Draxler, R.R. Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)?—Arguments against Avoiding RMSE in the Literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
  25. Gao, C.; Li, W. Multi-Scale PIIFD for Registration of Multi-Source Remote Sensing Images. arXiv 2021, arXiv:2104.12572. [Google Scholar]
  26. Ye, Y.; Bruzzone, L.; Shan, J.; Bovolo, F.; Zhu, Q. Fast and Robust Matching for Multimodal Remote Sensing Image Registration. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9059–9070. [Google Scholar] [CrossRef]
  27. Tyszkiewicz, M.; Fua, P.; Trulls, E. DISK: Learning Local Features with Policy Gradient. In Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 14254–14265. [Google Scholar]
  28. Deng, Y.; Ma, J. ReDFeat: Recoupling Detection and Description for Multimodal Feature Learning. IEEE Trans. Image Process. 2023, 32, 591–602. [Google Scholar] [CrossRef] [PubMed]
  29. Hou, Z.; Liu, Y.; Zhang, L. POS-GIFT: A Geometric and Intensity-Invariant Feature Transformation for Multimodal Images. Inf. Fusion 2024, 102, 102027. [Google Scholar] [CrossRef]
  30. Tuzcuoğlu, Ö.; Köksal, A.; Sofu, B.; Kalkan, S.; Alatan, A.A. XoFTR: Cross-Modal Feature Matching Transformer. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–21 June 2024; pp. 4275–4286. [Google Scholar] [CrossRef]
Figure 1. Visible and infrared thermal images taken through coaxial shooting. The white area (the area with a relatively high temperature) of the sika deer’s face in the infrared thermal image is used as the eye area for intuitive observation.
Figure 2. The physical equipment of this method, which consists of two visible cameras, one infrared thermal imaging device and a fixed bracket.
Figure 3. Schematic diagram of the specified location obtained by the registration of infrared and visible images.
Figure 4. The spatial resolution of the infrared thermal camera and the visible camera when observing from a consistent position and orientation. O represents the spatial origin coordinate, and M_t and M_v represent the imaging plane centers of the infrared thermal and the visible camera, respectively.
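As a rough illustration of the resolution gap depicted in Figure 4, the pinhole model gives the ground footprint of a single pixel as 2·Z·tan(HFOV/2)/N. The sketch below uses hypothetical sensor parameters (the fields of view and pixel counts are assumptions for illustration, not the paper's hardware specifications):

```python
import math

def pixel_footprint_mm(distance_m, hfov_deg, width_px):
    """Approximate horizontal size (mm) covered by one pixel at a given
    distance, using the pinhole model: footprint = 2*Z*tan(HFOV/2) / N."""
    scene_width_m = 2.0 * distance_m * math.tan(math.radians(hfov_deg) / 2.0)
    return scene_width_m / width_px * 1000.0

# Hypothetical sensor parameters, NOT the paper's actual hardware specs.
Z = 5.0  # observation distance in metres
vis = pixel_footprint_mm(Z, hfov_deg=60.0, width_px=1920)  # visible camera
ir = pixel_footprint_mm(Z, hfov_deg=50.0, width_px=384)    # thermal camera
print(f"visible: {vis:.1f} mm/px, thermal: {ir:.1f} mm/px, ratio: {ir/vis:.1f}x")
```

With these assumed numbers, each thermal pixel covers several times the ground area of a visible pixel, which is the coarseness that makes direct ROI localization in the thermal image unreliable.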
Figure 5. Images of the calibration plate before and after calibration under the visible and infrared thermal imaging cameras.
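For readers unfamiliar with the calibration step behind Figure 5, a minimal OpenCV sketch for a visible camera follows. The checkerboard geometry and file paths are placeholders, and a thermal camera would additionally require a target with thermal contrast (e.g., a heated board) for its corners to be detectable:

```python
import glob
import cv2
import numpy as np

# Checkerboard geometry is an assumption for illustration
# (inner-corner grid and square size in millimetres).
ROWS, COLS, SQUARE_MM = 6, 9, 30.0
objp = np.zeros((ROWS * COLS, 3), np.float32)
objp[:, :2] = np.mgrid[0:COLS, 0:ROWS].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
for path in glob.glob("calib/visible/*.png"):  # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, (COLS, ROWS))
    if found:
        # Refine corner locations to sub-pixel precision.
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Solve for the intrinsic matrix and distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS (px):", rms)
```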
Figure 6. Illustration of the bounding box matching challenge between stereo images. The figure shows the ground-truth bounding boxes (Real Labels of Left Image) and the detection results from the YOLO model on both the left (Yolo Labels of Left Image) and right (Yolo Labels of Right Image) views. The same number indicates the same target object, while ‘F’ denotes a false positive detection by YOLO. This visualization highlights the key challenges our registration algorithm must address: non-rigid spatial distribution of objects due to binocular disparity, potential missed detections, and false positive detections, making a simple one-to-one matching based on coordinates infeasible.
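To make the infeasibility of naive coordinate matching concrete, the toy baseline below pairs box centres with a gated Hungarian assignment. This is not the paper's RAP-CPD algorithm: a fixed Euclidean gate cannot model the depth-dependent horizontal shift introduced by binocular disparity, which is precisely the failure mode that motivates a non-rigid point-matching approach.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_boxes(centers_left, centers_right, gate=80.0):
    """Toy one-to-one matching of detection box centres between stereo views.
    Pairs costlier than `gate` (e.g. false positives or missed detections)
    are left unmatched. The gate value is an arbitrary illustration."""
    # Pairwise Euclidean distances between all left/right centres.
    cost = np.linalg.norm(
        centers_left[:, None, :] - centers_right[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= gate]
```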
Figure 7. The ranging schematic diagram of the images of the same object by three multimodal coaxial cameras, where P and Q represent the spatial positions of the corners of the eyes.
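The ranging in Figure 7 rests on standard rectified-stereo triangulation, Z = f·B/d, where f is the focal length in pixels, B the baseline, and d the disparity. A minimal sketch with illustrative numbers (not measured values from the paper):

```python
def depth_from_disparity(f_px, baseline_m, x_left_px, x_right_px):
    """Classic rectified-stereo triangulation: Z = f * B / d.
    The two x coordinates are the projections of the same physical point
    (e.g. an eye corner such as P or Q) in the left and right views."""
    d = x_left_px - x_right_px  # disparity in pixels
    if d <= 0:
        raise ValueError("non-positive disparity: point at infinity or bad match")
    return f_px * baseline_m / d

# Illustrative numbers only: 1200 * 0.12 / 30 = 4.80 m
Z = depth_from_disparity(f_px=1200.0, baseline_m=0.12,
                         x_left_px=640.0, x_right_px=610.0)
print(f"estimated depth: {Z:.2f} m")
```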
Figure 8. The histogram statistics of the MI values of the images calibrated with and without the scaling factor. The distribution for the scaled images is skewed towards the right, with a majority of MI values concentrated above 1.03, whereas the unscaled images show a wider, more varied distribution with a peak at a lower MI value. This result empirically confirms that our scaling preprocessing step is highly effective, significantly improving the baseline alignment between the visible and infrared images and thus providing a better foundation for the subsequent fine-grained registration.
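The MI values summarized in Figure 8 can be estimated from a joint intensity histogram; higher MI indicates stronger statistical dependence between the two images, i.e., better alignment. A self-contained sketch follows (the bin count is an arbitrary choice, not a parameter stated in the paper):

```python
import numpy as np

def mutual_information(img_a, img_b, bins=64):
    """Mutual information (in nats) between two equally sized grayscale
    images, estimated from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()              # joint probability p(a, b)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(a)
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(b)
    nz = pxy > 0                           # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```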
Figure 9. The result images of different methods. (a) A single-deer scene with relatively clear features. (b) A more complex multi-deer scene, characterized by significant variations in distance, scale, and severe blurring in the thermal modality for distant animals. The green line indicates the correspondence of unlabeled features, the yellow line denotes the correspondence of mislabeled features, and the red line represents the correspondence of correctly labeled features. The green and red label boxes represent the predicted position and the actual position, respectively.
Figure 10. The five main indicators of different methods. For each axis, values closer to the outer edge represent better performance, so a larger enclosed area indicates better overall performance. To provide a clear visual hierarchy, the methods listed in the legend are intentionally sorted by overall effectiveness, from best to worst, allowing for a more intuitive comparison.
Figure 11. The distribution statistics of the horizontal and vertical coordinate deviation values between the predicted coordinates and the true coordinates. The x-axis and y-axis represent the horizontal and vertical deviations of the predicted coordinate points relative to the actual coordinate points, respectively, while the z-axis and color represent the density of these deviation values. As can be seen from the figure, the error distribution exhibits a two-dimensional Gaussian distribution centered at (0, 0), with the vast majority of predicted points having very small deviations, demonstrating that the prediction model has high accuracy and stability.
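The centering claim in Figure 11 is straightforward to verify numerically: given the per-point deviations, the sample mean should sit near (0, 0) for an unbiased predictor, and the covariance describes the spread of the error cloud. A sketch with synthetic deviations standing in for the paper's measured values:

```python
import numpy as np

# dx, dy deviations (predicted minus ground-truth coordinates), in pixels.
# Random data here stands in for the paper's measured deviations.
rng = np.random.default_rng(0)
dev = rng.normal(0.0, 2.0, size=(1000, 2))

mean = dev.mean(axis=0)           # near (0, 0) if the predictor is unbiased
cov = np.cov(dev, rowvar=False)   # 2x2 covariance of the error cloud
within_5px = np.mean(np.linalg.norm(dev, axis=1) <= 5.0)
print(f"mean offset: {mean}, covariance:\n{cov}")
print(f"share of points within 5 px: {within_5px:.1%}")
```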
Table 1. Distribution of the number of different types of labels in different images.

Picture Type             Deer Face   Deer Eye
Left visible image       4555        3865
Infrared thermal image   4490        3880
Right visible image      4605        3975
Table 2. The evaluation indicators of each method. The bold red font indicates the optimal value of this indicator, the blue font represents the second, and the green font represents the third.

Method           RMSE (Faces)   RMSE (Eyes)   RMSE    Best RMSE   Accuracy (Faces)   Accuracy (Eyes)   Accuracy   CMR      Time
MS-PIIFD [25]    54.26          58.27         55.74   47.45       39.87%             36.60%            38.35%     4.16%    6.38 s
CFOG [26]        53.20          50.72         52.07   46.55       10.36%             11.34%            10.81%     30.76%   4.31 s
DISK [27]        40.96          36.95         39.08   30.23       64.70%             64.69%            64.69%     10.81%   76.43 s
MS-HLMO [13]     39.33          50.99         45.77   18.78       56.34%             59.54%            57.83%     99.95%   390.29 s
ReDFeat [28]     17.65          16.88         17.36   13.03       58.13%             61.21%            59.56%     81.92%   4.43 s
POS-GIFT [29]    22.38          22.24         22.32   13.92       54.34%             59.92%            56.93%     93.89%   16.57 s
XoFTR [30]       23.77          18.68         21.36   12.85       47.88%             52.45%            50.01%     77.33%   0.09 s
ID-APM           21.90          20.45         21.25   21.25       97.11%             96.78%            96.95%     99.93%   1.26 s
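As a rough guide to how the indicators in Table 2 can be computed, a minimal sketch follows. The 20 px accuracy tolerance and the CMR definition (correct matches over all reported matches) are assumptions consistent with common usage in the matching literature, not criteria confirmed by the paper:

```python
import numpy as np

def rmse(pred, gt):
    """Root mean square error between predicted and ground-truth point
    coordinates, as discussed in [24]. Inputs are (N, 2) pixel arrays."""
    return float(np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=1))))

def accuracy(pred, gt, tol_px=20.0):
    """Share of predictions falling within tol_px of the ground truth.
    The tolerance is an assumed example value."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1) <= tol_px))

def cmr(n_correct, n_total):
    """Correct Matching Ratio: correct matches over all reported matches
    (assumed definition)."""
    return n_correct / n_total if n_total else 0.0
```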