Visual Saliency Detection for Over-Temperature Regions in 3D Space via Dual-Source Images

To allow mobile robots to visually observe the temperature of equipment in complex industrial environments and work on temperature anomalies in time, it is necessary to accurately find the coordinates of temperature anomalies and obtain information on the surrounding obstacles. This paper proposes a visual saliency detection method for over-temperature regions in three-dimensional space using dual-source images. The key novelty of this method is that it achieves accurate salient object detection without relying on high-performance hardware. First, redundant points are removed from the point cloud through adaptive sampling to reduce the memory required for computation. Second, the original images are fused with infrared images and the dense point cloud is surface-mapped, so that the temperature is displayed visually on the reconstructed surface, and infrared imaging characteristics are used to detect the plane coordinates of temperature anomalies. Finally, coordinate transformation mapping is performed according to the pose relationship to obtain the spatial position. Experimental results show that this method not only displays the temperature of the device directly but also accurately obtains the spatial coordinates of the heat source without relying on a high-performance computing platform.


Introduction
In the path planning of mobile robots, it is common to construct a map using the dynamic vision fusion of cameras and multiple sensors [1][2][3]. In a specific industrial environment, the robot needs to monitor the temperature of the equipment and work in the area of abnormal temperature points. Existing neural network control methods show high stability [4][5][6], but they also need the location of the abnormal temperature point to be found accurately. Reconstructing the target with a conventional visible-light binocular camera is not sufficient, because it cannot accurately operate on the abnormal temperature point area [7][8][9][10]. At present, the most commonly used temperature detection methods rely on contact sensor measurements [11][12][13]. However, these raise installation and usage problems in engineering applications, which non-contact spatial measurement avoids. Visual target detection is one such non-contact approach.
In the field of target detection, deep learning is a commonly used technology. At present, in 2D target detection, many methods that optimize the structure of deep convolutional neural networks improve detection accuracy [14][15][16], such as fully convolutional networks (FCN), progressive fusion [17], multi-scale depth encoding [18], and data set balancing and smearing methods [19,20]. In mobile robot navigation, precise positioning of the target often requires spatial coordinates. A depth camera can be used to obtain depth information for 2.5D target positioning. Deep networks also play an important role in this field: the variational autoencoder [21,22], the adaptive window and weight matching algorithm [23], the deep purifier, and the feature learning unit greatly improve detection accuracy. However, deep learning requires more sophisticated hardware and relies on a large number of training samples [24][25][26][27].
With the development of 3D reconstruction technology, its application in real life has become extensive, attracting the attention of many experts and scholars [28,29]. A typical visual 3D reconstruction pipeline includes feature extraction and matching, sparse point cloud reconstruction, camera pose solution, dense point cloud reconstruction, and surface reconstruction [30][31][32][33]. Related technologies such as feature matching, depth calculation, and mesh texture reconstruction have made great breakthroughs, which have resulted in higher fidelity in visual 3D reconstruction [34][35][36].
The method proposed in this paper uses an ordinary camera and an infrared camera to photograph the target. Sparse point cloud reconstruction is performed on the ordinary pictures to obtain the camera pose at the time of imaging. Then, image fusion is performed on the ordinary and infrared pictures. Because the internal and external parameters of the original camera do not change, the original image can be replaced with the fused image to surface-map the dense point cloud and generate a three-dimensional surface. In this way, a three-dimensional reconstruction of the target that visually displays the surface temperature is obtained [37][38][39]. This paper uses an adaptive random sampling algorithm to retain the main texture features and remove redundant points, and finally uses depth confidence to filter out erroneous points [40][41][42].
To reduce the calculation cost and dependence on training samples, this paper mainly uses the characteristics of infrared images to detect the center coordinates of the heat source. First, the infrared images are pre-processed by channel extraction and image segmentation. Then, the position of the two-dimensional plane temperature abnormal points is detected. Finally, the coordinate transformation is calculated based on the camera's imaging pose relationship in order to calculate its spatial coordinates [43][44][45]. Therefore, it is possible to use the reconstructed target as an obstacle to plan the movement path of the robot and to work on the temperature abnormal point area according to the obtained spatial coordinate information. The schematic diagram is shown in Figure 1. The robot rotates around the target center once to reconstruct a complete target and quickly finds the center position of the heat source that needs to be operated using the above method.

Materials and Methods
The process of sparse point cloud reconstruction is as follows: feature extraction, feature matching, elimination of mismatched pairs, 3D point cloud initialization, and camera pose calculation. Among these steps, mismatch elimination and pose solution have the greatest impact on the quality of the sparse point cloud reconstruction. This paper uses the random sample consensus (RANSAC) algorithm to remove false matches and bundle adjustment to recalculate the camera pose. The visible light camera used in this article is a 2-megapixel (200W pixel) POE DS-2CD3T25-I3 with a focal length of 4 mm, manufactured by HIKVISION in Hangzhou, China.
To realize 3D reconstruction, the feature points of the picture first need to be extracted. The scale-invariant feature transform (SIFT) algorithm is a computer vision algorithm that detects and describes local image features: it finds extreme points in scale space and extracts their position, scale, and rotation invariants. It is divided into the following four steps:
• Multi-scale spatial extreme point detection: This searches image locations on all scales and uses difference-of-Gaussian functions to identify candidate points that are invariant to scale and rotation.
• Accurate positioning of key points: After determining candidate positions, a high-precision model is fitted to determine the scale and position. The stability of key points is used as the basis for selection.
• Calculation of the main direction of key points: Based on the local gradient direction of the image, each key point is assigned one or more directions. All subsequent image processing is performed relative to the key-point scale, direction, and position to ensure invariance to these transformations.
• Descriptor construction: In the neighborhood of each key point, the direction of local gradients is measured at the scale selected above, and these gradients are transformed into another representation.
The effect of feature point extraction is shown in Figure 2.
The reconstruction test uses a potted plant and runs on a desktop computer with a 3.0 GHz CPU, using 30 consecutive shots at a resolution of 4000 × 3000 pixels. The maximum memory required during the reconstruction process is 5.3 GB before the adaptive sampling algorithm is applied and 3.2 GB after it is applied, which indicates that the algorithm effectively reduces the memory required for calculation.

Error Matching Elimination Based on the RANSAC Algorithm
There will be matching errors after feature matching. RANSAC is a commonly used algorithm for eliminating them. The grid-based motion statistics (GMS) [46] algorithm, recently proposed by scholars, can match features in a short time and is very robust. It can remove wrong matches to a certain extent. However, the original author notes that the GMS algorithm is suitable for supplementing the RANSAC algorithm but not for replacing it. Therefore, this article mainly uses the RANSAC algorithm to eliminate wrong feature matches. The algorithm works by using Equation (2) as the cost function to iteratively update the sample set.
In the above formula, $(x, y)$ represents the corner position in the target image, $(x', y')$ is the corner position in the scene image, $s$ is the scale parameter, and $H$ is a 3 × 3 homography matrix.
Error matching elimination based on RANSAC is shown in Figure 3.
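The RANSAC loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual homography relation $s\,[x', y', 1]^T = H\,[x, y, 1]^T$ behind the cost function, and the function names, the iteration count `n_iter`, and the pixel tolerance `tol` are hypothetical choices made for the example.

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: least-squares H with dst ~ H @ src (>= 4 pairs)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)  # null-space vector, defined up to scale

def transfer_error(H, src, dst):
    """Pixel distance between the H-projected src points and the dst points."""
    pts = np.hstack([src, np.ones((len(src), 1))]) @ H.T
    with np.errstate(divide="ignore", invalid="ignore"):
        proj = pts[:, :2] / pts[:, 2:3]  # divide out the scale s
    err = np.linalg.norm(proj - dst, axis=1)
    return np.where(np.isfinite(err), err, np.inf)

def ransac_homography(src, dst, n_iter=500, tol=2.0, seed=0):
    """Return (H, inlier_mask); n_iter and tol are illustrative values."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(src), size=4, replace=False)  # minimal sample
        H = fit_homography(src[idx], dst[idx])
        inliers = transfer_error(H, src, dst) < tol
        if inliers.sum() > best.sum():  # keep the largest consensus set
            best = inliers
    if best.sum() < 4:
        raise ValueError("no consistent homography found")
    return fit_homography(src[best], dst[best]), best  # refit on all inliers
```

Matches flagged `False` in the returned mask are the ones that would be discarded before the 3D point cloud is initialized.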
Sensors 2020, 20, x FOR PEER REVIEW 4 of 14

Solving the Camera Pose by Bundle Adjustment
After image alignment, the 3D point cloud and camera pose can be obtained. However, noise interferes with the calculation of the pose and the 3D points, and it causes significant error in the subsequent calculation. Therefore, bundle adjustment is used to reduce the error [9], and the corrected P matrix and F matrix of each picture can be obtained. The reprojection error is defined as:
$E = \sum_j \rho_j\left(\left\lVert \pi(P_C, X_k) - x_j \right\rVert_2^2\right)$ (3)
where $\pi$ is the function that projects from three dimensions to two dimensions, $\rho_j$ is a kernel function, and $\lVert \pi(P_C, X_k) - x_j \rVert_2^2$ is the cost function. Figure 4 shows the sparse point cloud obtained after the bundle adjustment.
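To make the reprojection error concrete, the sketch below evaluates it for one camera. It assumes a pinhole model with intrinsics `K`, rotation `R`, and translation `t` as the contents of $P_C$, and it uses a Huber function as the kernel $\rho_j$; the paper does not state which kernel is used, so that choice and all function names are assumptions.

```python
import numpy as np

def project(K, R, t, X):
    """pi(P_C, X_k): pinhole projection of Nx3 world points to Nx2 pixels."""
    cam = X @ R.T + t   # world -> camera coordinates
    uvw = cam @ K.T     # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]

def huber(sq_residual, delta=1.0):
    """Assumed kernel rho: quadratic near zero, linear in the tails."""
    r = np.sqrt(sq_residual)
    return np.where(r <= delta, sq_residual, 2.0 * delta * r - delta ** 2)

def reprojection_error(K, R, t, X, x_obs, kernel=huber):
    """Sum over observations of rho(||pi(P_C, X_k) - x_j||^2)."""
    sq = np.sum((project(K, R, t, X) - x_obs) ** 2, axis=1)
    return float(np.sum(kernel(sq)))
```

Bundle adjustment then minimizes this quantity jointly over the camera parameters and the 3D points, typically with a nonlinear least-squares solver.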

Adaptive Random Sampling
The specific steps are as follows:
• A pixel point $\hat{x}_i$ is randomly selected from the obtained point cloud image. $D_i(\hat{x}_i)$ is the depth value of the pixel point, which is inversely mapped into three-dimensional space according to Equation (4), where $K_i$ is the camera internal parameter matrix, $R_i$ is the rotation matrix, and $T_i$ is the translation vector. The tangent plane at $P(\hat{x}_i)$ is obtained according to the normal direction.
• Expand outwards with $\hat{x}_i$ as the center, increasing the radius $r$ one pixel at a time, and calculate the three-dimensional coordinates $P(x_i)$ of each pixel $x_i$ in the expansion range.
• Calculate the distance $d_i$ of each pixel $x_i$ to the tangent plane within the current expansion range, and set a threshold $t_d$. If $d_i \le t_d$, the pixel point is considered to be in a smooth area, and the point can be removed.
• When the expansion radius $r$ is larger than the maximum expansion radius $r_{max}$, or a proportion $p_i$ of the points in the expansion range has been removed, the expansion stops. $r_{max}$ and $p_i$ are tunable parameters that can be determined according to the point cloud redundancy. If debugging shows that many redundant points remain after culling, $r_{max}$ can be increased and $p_i$ can be decreased. If the point cloud is over-eliminated, the parameters are adjusted in the opposite direction.
• Then, randomly select another pixel point and repeat the above steps until all the sampling points in the current 3D point cloud image have been sampled.
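The expansion around a single seed pixel can be sketched as below. Several points are assumptions made for illustration: the back-projection uses the common form $X = R^T(D\,K^{-1}[u, v, 1]^T - T)$ in place of Equation (4), the expansion range is a square window, the seed normal is supplied by the caller, and the names `back_project` and `cull_around_seed` are hypothetical.

```python
import numpy as np

def back_project(u, v, depth, K, R, T):
    """Assumed form of Equation (4): X = R^T (depth * K^-1 [u, v, 1]^T - T)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return R.T @ (depth * ray - T)

def cull_around_seed(depth, keep, seed, normal, K, R, T, t_d, r_max, p_i):
    """Clear `keep` for pixels whose 3D point lies within t_d of the seed's
    tangent plane, growing the window one pixel at a time."""
    h, w = depth.shape
    su, sv = seed  # seed pixel as (column, row)
    origin = back_project(su, sv, depth[sv, su], K, R, T)
    n = normal / np.linalg.norm(normal)
    for r in range(1, r_max + 1):
        removed = total = 0
        for v in range(max(0, sv - r), min(h, sv + r + 1)):
            for u in range(max(0, su - r), min(w, su + r + 1)):
                if (u, v) == (su, sv):
                    continue  # the seed itself is always kept
                total += 1
                point = back_project(u, v, depth[v, u], K, R, T)
                if abs(n @ (point - origin)) <= t_d:
                    keep[v, u] = False  # smooth area: redundant point
                removed += int(not keep[v, u])
        if total and removed / total >= p_i:
            break  # enough of the window has been removed
    return keep
```

Calling this for randomly chosen seeds until every remaining pixel has been visited gives the full sampling pass; the nested loops are written for clarity, not speed.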

Removing Erroneous Points by Depth Confidence
The above formula is the depth value estimation of the point cloud, i.e., the larger the estimated value, the smaller the error and the higher the reliability. Here, $E_d(P(x_i))$ is the depth value estimation of two adjacent frames, $\hat{x}_{i \to i}$ represents the projection point of the $i$ pixel projected by the current pixel, and $N(i)$ represents the number of frames taken. The specific steps are as follows:
• The point cloud for the current frame $k$ is sorted from high to low according to the estimated value, and a confidence threshold $\varepsilon_d$ is set. Starting from the point with the smallest estimated value, a point is eliminated if $E_d(P(x_i)) < \varepsilon_d$; the process stops at the first point with $E_d(P(x_i)) > \varepsilon_d$, and the remaining points are stored in the sequence $S_k$. The same calculation is then performed on the next frame until every point cloud image has been processed and the sequence set $S = \{S_k \mid k = 1, \cdots, n\}$ is obtained.
• Starting from the depth map of frame $k$, all three-dimensional points $\hat{x}_i$ are mapped to $\hat{x}_{i+1}$ on frame $k + 1$. The estimated values of the two points are compared, the point with the smaller value is discarded, and the three-dimensional coordinates of the point with the larger value are kept, and so on, until all depth maps are completed.
• The three-dimensional sampling points of all depth maps are intersected to obtain the final three-dimensional point cloud image. Then, mesh reconstruction and mesh texture generation are performed on the filtered dense point cloud. The effect before and after filtering is shown in Figure 5.
The reconstruction details are shown in Figure 6.
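The first two filtering steps can be sketched as follows. The confidence values $E_d$ are taken as given inputs because the estimation formula is not reproduced here; the function names and the `pairs` correspondence list are hypothetical, and keeping points whose confidence equals the threshold is an assumption.

```python
import numpy as np

def filter_frame(points, confidence, eps_d):
    """Step 1: sort one frame's points by confidence (high to low) and
    drop every point whose confidence is below eps_d."""
    order = np.argsort(-confidence)
    pts, conf = points[order], confidence[order]
    n_keep = int(np.sum(conf >= eps_d))  # sorted, so survivors form a prefix
    return pts[:n_keep], conf[:n_keep]

def resolve_pairs(conf_k, conf_next, pairs):
    """Step 2: for each match (i in frame k, j in frame k+1), mark the
    lower-confidence point of the two for removal."""
    drop_k, drop_next = set(), set()
    for i, j in pairs:
        if conf_k[i] >= conf_next[j]:
            drop_next.add(j)
        else:
            drop_k.add(i)
    return drop_k, drop_next
```

For example, with confidences `[0.2, 0.9, 0.5, 0.7]` and `eps_d = 0.5`, `filter_frame` keeps three points in the order 0.9, 0.7, 0.5.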

Image Fusion
After reconstructing the sparse point cloud, the camera parameters are obtained, and the original image can be corrected for distortion. The infrared image is calibrated and corrected separately. The image registration error is shown in the following formula, where $f$ is the focal length, $l_{pix}$ is the pixel size, and $d_x$ is the baseline length. $D_{optimal}$ is the target distance at which the alignment error of the image is zero. Only objects that are far away from the camera will be precisely aligned.

Calculate Scale Factor
As the focal length and resolution of infrared and visible images are different, the imaging size of objects in space from the two camera types is not consistent. At the same time, the optical center of the hardware systems of the two camera types deviates in the Y direction. Therefore, it is not easy to scale the image by focal length.
The method adopted in this paper calculates the pixel difference between two corner points in infrared and visible images by using the checkerboard calibration board to obtain the image scale.
Assume that the checkerboard calibration board has $k$ rows and $l$ columns of corners, $kl$ in total. $n$ is the corner number on the checkerboard: the upper left corner has the minimum value 1, the lower right corner has the maximum value $kl$, and the values increase from left to right and from top to bottom. $infrared_n$ is the X or Y coordinate of corner $n$ on the infrared image, and $visible_n$ is the X or Y coordinate of corner $n$ on the visible light image.

Relative Offset of the Image
The scale factor is used to bring objects in space to the same size in the infrared and visible images. Then, the same corner point on the checkerboard is selected to calculate the relative offset of the infrared and visible images,
where $X_{diff}$ and $Y_{diff}$ are the offsets required for each pixel in the infrared image. The RGB color model is a color standard in the industrial world. It obtains various colors by varying the three color channels of red (R), green (G), and blue (B) and superimposing them on each other. After each pixel has been offset, the RGB channel values of the infrared and visible pixel pairs at the same coordinate can be fused. The fusion effect is shown in Figure 7, which shows a heating plate placed in a carton. An infrared camera with a resolution of 384 × 288 pixels is used. The infrared camera and visible light camera take pictures at the same time.
The camera pose is calculated based on the reconstructed sparse point cloud, and all the fused pictures are surface-reconstructed. The 3D reconstruction effect of the temperature display is shown in Figure 8.
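A possible way to compute the scale factor and relative offset from checkerboard corners is sketched below. The exact formulas are not reproduced in this text, so the sketch assumes that the scale is the mean ratio of the horizontal spacing between neighbouring corners (visible over infrared) and that the offset is the visible corner position minus the scaled infrared corner position. Corners are numbered row by row from the upper left, as described above; the function names are hypothetical.

```python
import numpy as np

def scale_factor(ir_corners, vis_corners, cols):
    """Assumed definition: mean ratio of the horizontal spacing between
    neighbouring corners in the visible and infrared images."""
    ir = np.asarray(ir_corners, dtype=float).reshape(-1, cols, 2)
    vis = np.asarray(vis_corners, dtype=float).reshape(-1, cols, 2)
    ir_dx = np.diff(ir[..., 0], axis=1)    # spacing along each row
    vis_dx = np.diff(vis[..., 0], axis=1)
    return float(np.mean(vis_dx / ir_dx))

def relative_offset(ir_corner, vis_corner, scale):
    """Assumed (X_diff, Y_diff): shift that moves the scaled infrared
    corner onto the same corner in the visible image."""
    diff = np.asarray(vis_corner, dtype=float) - scale * np.asarray(ir_corner, dtype=float)
    return float(diff[0]), float(diff[1])
```

Once every infrared pixel has been scaled and shifted by these values, the RGB values at matching coordinates can be blended to produce the fused image.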

3D Target Detection
As shown in Figure 9, in this experiment, a high-temperature bottle is used as the temperature abnormal region of the overall device, and its spatial coordinates need to be calculated.

Target Detection of the Heat Source
In the infrared picture, the pixel temperature generated by the detection is proportional to the R channel value, so the image can be preprocessed first. The R channel value size of the original image is extracted, and all pixels are sorted according to the R value. However, noise in the image is unavoidable and will interfere with the sorting results. To avoid incorrect sorting, the extracted image can be cut and divided into sub-regions. The size of the region can be determined according to the input original image size. Then, the average value of the R channel in each area is calculated, and the area is sorted according to the average value to obtain the R channel size set of each area = { , , , ⋯ }, assuming is its maximum value. After the infrared image preprocessing is complete, the R channel value of each small area can be obtained. To allow the detection frame to be adaptively scaled, the size of the heat source needs to be calculated, so small squares (that meet the conditions) can be calculated and recorded for each small area location. The criteria are:

3D Target Detection
As shown in Figure 9, in this experiment, a high-temperature bottle is used as the temperature abnormal region of the overall device, and its spatial coordinates need to be calculated. The camera pose is calculated based on the reconstructed sparse point cloud, and all the fused pictures are surface-reconstructed. The 3D reconstruction effect of the temperature display is shown in Figure 8.

3D Target Detection
As shown in Figure 9, in this experiment, a high-temperature bottle is used as the temperature abnormal region of the overall device, and its spatial coordinates need to be calculated.

Target Detection of the Heat Source
In the infrared picture, the pixel temperature generated by the detection is proportional to the R channel value, so the image can be preprocessed first. The R channel value size of the original image is extracted, and all pixels are sorted according to the R value. However, noise in the image is unavoidable and will interfere with the sorting results. To avoid incorrect sorting, the extracted image can be cut and divided into sub-regions. The size of the region can be determined according to the input original image size. Then, the average value of the R channel in each area is calculated, and the area is sorted according to the average value to obtain the R channel size set of each area = { , , , ⋯ }, assuming is its maximum value. After the infrared image preprocessing is complete, the R channel value of each small area can be obtained. To allow the detection frame to be adaptively scaled, the size of the heat source needs to be calculated, so small squares (that meet the conditions) can be calculated and recorded for each small area location. The criteria are: > * (10) Figure 9. Detection target.

Target Detection of the Heat Source
In the infrared picture, the pixel temperature generated by the detection is proportional to the R channel value, so the image can be preprocessed first. The R channel value size of the original image is extracted, and all pixels are sorted according to the R value. However, noise in the image is unavoidable and will interfere with the sorting results. To avoid incorrect sorting, the extracted image can be cut and divided into sub-regions. The size of the region can be determined according to the input original image size. Then, the average value of the R channel in each area is calculated, and the area is sorted according to the average value to obtain the R channel size set of each area R agg = {R 1 , R 2 , R 3 , · · · R n }, assuming R max is its maximum value.
After the infrared image preprocessing is complete, the R channel value of each sub-region is available. To allow the detection frame to scale adaptively, the size of the heat source must be calculated, so the sub-regions that meet the conditions are identified and their locations recorded. The criteria are:
$$size_r = size_p \cdot p_r \quad (11)$$
Here, $R_i$ is the R channel value of the $i$-th sub-region, and $k$ is a proportionality coefficient that must be adjusted to the specific conditions. After the value of each sub-region is calculated, the sub-regions are assessed in order from left to right and top to bottom. Each sub-region is square, and its size $size_r$ is determined by Equation (11), where $size_p$ is the size of the infrared image used for detection and $p_r$, the proportion of the image taken by a sub-region, is an adjustable parameter. If four of the eight sub-regions surrounding a sub-region meet the conditions, that sub-region lies within the heat source range, and its position coordinate is recorded and evaluated. Finally, the size of the heat source border is obtained from the recorded coordinate positions. The effect is shown in Figure 10.
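The neighborhood test and the adaptive border can be illustrated with the following Python sketch. It rests on several assumptions that are not stated in the text: a sub-region is taken to "meet the conditions" when its mean R value is at least `k` times the maximum, the sub-region itself must also meet that condition, and the function name `heat_source_box` and its parameters are hypothetical.

```python
import numpy as np

def heat_source_box(means, k, region_size, min_neighbors=4):
    """Bounding box (x0, y0, x1, y1) in pixels of the heat source, or None.

    means: 2D grid of per-region mean R values.
    Assumed criterion: a region is "hot" if means >= k * R_max.
    """
    hot = means >= k * means.max()
    rows, cols = hot.shape
    # Pad with False so border regions simply have fewer hot neighbors.
    padded = np.pad(hot, 1, constant_values=False).astype(int)
    # Count hot regions among the 8 neighbors of every region.
    neighbors = sum(
        padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    inside = hot & (neighbors >= min_neighbors)
    if not inside.any():
        return None
    rr, cc = np.nonzero(inside)
    # Convert grid indices to pixel coordinates of the enclosing border.
    return (int(cc.min() * region_size), int(rr.min() * region_size),
            int((cc.max() + 1) * region_size), int((rr.max() + 1) * region_size))
```

Requiring at least four hot neighbors discards isolated bright sub-regions, so a single noisy block does not produce a detection frame.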

Coordinate Transformation Mapping in 3D Space
After the heat source target is detected, the coordinates of the heat source center in each infrared picture are known. Because the pictures are shot in a head-up relationship, the horizontal deviation and the height deviation can also be obtained. The steps are as follows: take the center of the first picture as the center point of the space and choose another angle during the shoot as the second position. For the two positions shown in Figure 11, calculate the deviation between the actual heat source and the ideal heat source. Several situations can occur, and Figure 12 is a top view of these situations.
Figure 11. Camera imaging pose.
Taking Figure 12a as an example, cam1_center and cam2_center are the imaging center points of the camera at the two positions, "ideal" is the most central position of the heat source in the experiment and is the intersection of the two imaging centers, and "real" is the actual position of the heat source. When the heat source is projected onto the imaging plane, its distances from the center of the camera at the two positions are $bias_1$ and $bias_2$, and $\alpha$ is the angle of rotation of the second position relative to the first position. The equal angles marked in the figure follow from the geometric relationship.
$$
\begin{aligned}
x &= bias_1 \\
z &= (z_1 + z_2)/2 \\
light_1 &= bias_2 / \cos\alpha \\
light_2 &= light_1 - bias_1 \\
depth &= light_2 / \tan\alpha \\
y &= depth
\end{aligned}
\qquad (12)
$$
In the above formula, $z$ is the height position of the heat source, and $z_1$ and $z_2$ are the height deviations from the origin of the space coordinates measured at the two positions. To reduce the error, the average of the two is taken as the height deviation. $light_1$ and $light_2$ are intermediate distances in the geometric calculation. From the above formula, the head-up deviation $x$, depth deviation $y$, and height deviation $z$ are obtained. Because the coordinates of the ideal position in actual space are already known, the space coordinates of the actual heat source can then be calculated.
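Equation (12) translates directly into a few lines of Python. The sketch below is illustrative only: the function name `heat_source_offset` is hypothetical, the angle is assumed to be given in radians, and the case where $\alpha$ is zero or a right angle (where the tangent or cosine makes the formula degenerate) is rejected rather than handled.

```python
import math

def heat_source_offset(bias1, bias2, z1, z2, alpha):
    """Deviation (x, y, z) of the real heat source from the ideal position.

    bias1, bias2: distances from the image center at the two camera positions.
    z1, z2: height deviations observed at the two positions.
    alpha: rotation of the second position relative to the first, in radians.
    """
    if math.isclose(math.sin(alpha), 0.0, abs_tol=1e-12) or \
       math.isclose(math.cos(alpha), 0.0, abs_tol=1e-12):
        raise ValueError("alpha must not be a multiple of 90 degrees")
    x = bias1                          # head-up deviation
    z = (z1 + z2) / 2.0                # averaged height deviation
    light1 = bias2 / math.cos(alpha)   # intermediate distance
    light2 = light1 - bias1            # intermediate distance
    depth = light2 / math.tan(alpha)
    y = depth                          # depth deviation
    return x, y, z
```

Adding the returned deviation to the known coordinates of the ideal position would then give the space coordinates of the actual heat source, as described above.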
Although enhanced convolutional neural network structures have greatly improved detection speed, they still cannot provide high-precision results and rely on high-performance GPUs. The method in this paper was tested in 15 experiments, running only on a 3.0 GHz desktop computer, using the thermos randomly placed in the above figure as a simulated heat source. The camera was 10 m away from the ideal heat source. The error values were obtained by comparing the actual measured coordinates with the calculated coordinates. The error results of the experiment are shown in Figure 13. The experimental results show that the error is within ±20 mm and the calculation time is 20 ms, which meets the detection requirements of industrial equipment.

Conclusions
The experimental results demonstrate that the method proposed in this paper can fuse target surface temperature information captured by infrared cameras into a three-dimensional point cloud while ensuring the accuracy and speed of the reconstruction and that the reconstructed object can intuitively display its surface temperature. The spatial coordinates of the heat source are calculated using the spatial transformation mapping relationship of the infrared picture. The experimental results demonstrate that the algorithm is highly accurate and meets the requirements of robot navigation and positioning.

Patents
A 3D reconstruction method based on point cloud optimization sampling; a 3D surface temperature display method based on infrared and visible image fusion; a method for detecting the heat source center in three-dimensional space.