A Dual-Tree–Complex Wavelet Transform-Based Infrared and Visible Image Fusion Technique and Its Application in Tunnel Crack Detection

Abstract: Computer vision methods have been widely used in recent years for the detection of structural cracks. To address the issues of poor image quality and the inadequate performance of semantic segmentation networks under low-light conditions in tunnels, in this paper, infrared images are used, and a preprocessing method based on image fusion technology is developed. First, the DAISY descriptor and the perspective transform are applied for image alignment. Then, the source image is decomposed into high- and low-frequency components of different scales and directions using DT-CWT, and high- and low-frequency subband fusion rules are designed according to the characteristics of infrared and visible images. Finally, a fused image is reconstructed from the processed coefficients, and the fusion results are evaluated using the improved semantic segmentation network. The results show that preprocessing images with the proposed fusion method leads to a lower false alarm rate and a lower missed detection rate than using the source image directly or using the classical fusion algorithm.


Introduction
During their operational phase, tunnels are prone to various issues such as cracking, water seepage, and other hazards that impact their lining surfaces. These problems arise as a result of factors like concrete aging, temperature effects, and pressure from surrounding rocks. If left unattended, these issues can escalate, jeopardizing the stability and safety of the tunnel structure. Hence, it is crucial to conduct structural monitoring of tunnel linings to ensure transportation safety.
Crack identification is a well-known issue in the field of non-destructive testing, and it can be addressed by deploying distributed sensors to monitor the mechanical behavior of concrete. This enables the detection of surface or internal damage in the concrete. Grosse [1] utilized acoustic emission technology to identify and locate defects by monitoring stress waves released from concrete cracks. Kocherla et al. [2] embedded PZT inside the concrete and assessed the crack opening state by active acoustic measurements. Kim et al. [3] estimated crack locations by monitoring the natural frequency and mode shape of concrete beams. However, a tunnel has a large area to be inspected, and elastic wave-based detection methods require a dense deployment of sensors to achieve the desired detection accuracy, which in practice can mean lower detection accuracy and poor cost-effectiveness. At present, the most common solution for wide-range, high-precision inspection is to use a mechanical device to perform point-by-point sweeps. Haddar used eddy current technology to sweep for cracks [4] and deposits [5] in pipelines through a linear sampling method. Gong et al. designed an on-board image acquisition system that utilizes multiple line-scan cameras to capture full cross-section images of the tunnel surface for detecting tunnel surface cracks. In recent years, with the rapid development of computer technology, image detection methods based on deep learning have gradually been applied to the identification of tunnel lining cracks [6,7]. Semantic segmentation networks such as U-Net are capable of achieving pixel-level crack recognition, which yields more refined results than bounding-box (frame) outputs. Lau et al. [8] replaced the encoder in U-Net with a pre-trained ResNet-34 block, achieving good recognition performance in tests using the CFD and Crack500 datasets. Li et al. [9] optimized U-Net by introducing a clique block and an attention mechanism, improving the accuracy of detecting cracks in tunnels. However, due to the limited tunnel lighting conditions and the complexity of the lining surface, the quality of input visible images used for detection is usually poor. This limits the achievable detection accuracy of a semantic segmentation network, so efforts to optimize the network alone may not lead to the desired detection results.
In computer vision detection techniques, there are many ways to improve the detection effect for different problems. The most discussed approach is to improve the network structure of recognition models, and current research focuses on model optimization for deep learning. Meanwhile, in the whole detection process, sample preprocessing is a necessary task. Common image preprocessing operations include filtering, histogram transformation, and morphological operations, which all operate on the visible image itself and are easily limited by the quality of the source image. The image fusion technique is a common preprocessing method that is widely used in medical imaging [10], remote sensing imaging [11], and target detection [12,13]. Visible images have better resolution and texture details. The differences and complementarities between infrared and visible images make them common data sources in image fusion work. In recent years, fusion strategies based on multi-scale decomposition have been developed rapidly, and many scholars have developed fusion methods for different scenarios. Zhang et al. [14] proposed a multi-scale decomposition image fusion method based on a local edge-preserving (LEP) filter and saliency detection; the fusion process eliminates artifacts and halos in visible images under low illumination by preserving the brightness of the infrared image. Han et al. [15] use the discrete wavelet transform to implement multi-scale decomposition and enhance the spatial resolution of infrared spectra by extracting visual details from visible images. In addition, decomposition methods based on image pyramids [16] and NSCT [17] have been applied to fusion algorithms. It can be seen that the multi-scale transform separates the feature coefficients of the infrared image and the visible image, after which a fusion rule is designed to transform and combine the discrete coefficients at each scale level. Therefore, the design of fusion rules is the key to the multi-scale fusion algorithm. The formulation of fusion rules depends on the image features and application scenarios. Gao et al. [18] fuse high-frequency subbands using large neighboring pixel differences while combining weighted average and absolute value extraction to fuse low-frequency subbands. Adu et al. [19] proposed a minimum regional cross-gradient method for bandpass direction subband coefficient selection. Zhang et al. [20] use the non-subsampled shearlet transform (NSST) to select the fusion coefficients of the background in order to capture the details of the visible image. In the multi-scale decomposition method based on dual-tree complex wavelet transform (DT-CWT), Madheswari et al. [21] use swarm intelligence based on particle swarm optimization to find the best weights. Saeedi et al. [22] utilized fuzzy logic to integrate the outputs of three different fusion rules and proposed a new method based on population optimization for the design of low-frequency fusion rules. The above scholars have designed different fusion rules based on the decomposition method, but the core idea is roughly the same, which is to extract the brightness of the infrared image and the details of the visible image for fusion processing. Infrared images reflect the thermal radiation properties of an object and are not affected by environmental factors such as lighting, and their ability to highlight a target is better.
In the field of crack detection studied in this paper, preprocessing methods based on image fusion techniques have been found to be effective in improving the performance of recognition models [23]. Liang et al. [24] used an infrared image fusion technique for pavement crack detection based on deep learning to solve the problem of uneven illumination as well as shadows on the pavement. Su et al. [25] and Pozzer et al. [26] enhanced image information by fusing thermal infrared (IRT) images and visible images and applied this information to crack recognition on concrete surfaces. The above scholars' work validates the feasibility of using image fusion techniques for feature enhancement. The shooting conditions inside tunnels are more complex compared to those of pavement and concrete surfaces. First, the light inside tunnels is darker, resulting in a low contrast at the crack locations in visible images [27]. In addition, the flatness of the lining surface and the ambient light result in more prominent details in the image background. Directly using LED or similar components for supplementary lighting can result in uneven brightness distribution in the image and introduce shadow interference. When using semantic segmentation networks for crack recognition, all of these features lead to the degradation of the imaging quality and affect the accuracy of the model recognition. Therefore, the development of an image preprocessing method based on image fusion techniques for tunnel environments is of great significance for computer vision detection.
In this paper, an image preprocessing method is developed using image fusion techniques that introduce infrared images to enhance the crack target information. First, the DAISY descriptor is used to match the image feature points and then combined with the perspective transform to achieve image alignment. Then, the multiscale fusion method is developed based on the dual-tree complex wavelet transform (DT-CWT), and the fusion rules of low-frequency subbands and high-frequency subbands are designed according to the image characteristics in the tunnel environment. Finally, the fusion results are input to the semantic segmentation network for performance evaluation.
The remainder of this paper is organized as follows. Section 2 describes the feature extraction and alignment process of the image. In Section 3, the high- and low-frequency fusion rules designed based on DT-CWT are presented, and in Section 4, the fusion results are evaluated and analyzed using a semantic segmentation network.

Matching Feature Points
In the image acquisition process, due to the inconsistency of the position, attitude, and acquisition parameters of infrared and visible cameras, it is necessary to align the images captured by each camera in pixel-level image fusion. The alignment process includes three steps: feature point extraction, feature point matching, and coordinate mapping.
In this paper, the features from accelerated segment test (FAST) algorithm [28] is used to extract the key points of the image, and the DAISY descriptor [29] is used to quantify the features of the key points. The geometric structure of the DAISY descriptor is a set of multilayer concentric circles. In each layer, eight sampling points are evenly distributed at 45° intervals, and a Gaussian convolution is applied to compute the relationship between each point and its neighboring pixels, as shown in Figure 1. Winder [30] found that the DAISY descriptor performs optimally in different datasets by comparing several feature aggregation strategies for chunking in Cartesian and polar coordinate systems. The computation process of the DAISY descriptor is specified as follows:
(1) The gradient of the input image $I$ in direction $d$ is calculated. Only those values with gradients greater than zero are retained in the result. The gradient matrix is denoted as $G_d = (\partial I/\partial d)^{+}$, where the operator $(\cdot)^{+} = \max(\cdot, 0)$. The gradient calculation is decomposed into the x direction and the y direction separately. Let the angle between the gradient direction and the positive x direction be $\theta$. The formula for the gradient matrix is
$$G_d = \left(\cos\theta\,\frac{\partial I}{\partial x} + \sin\theta\,\frac{\partial I}{\partial y}\right)^{+} \tag{1}$$
where the convolutions in the x and y directions are applied using the one-dimensional convolution kernels $[1, -1]$ and $[1, -1]^{\top}$, respectively.
(2) We use Gaussian convolution kernels with different scales to perform separate convolution operations on the gradient matrix $G_d$ to form different scaled convolution matrices $G_d^{\Sigma}$, which are calculated as follows:
$$G_d^{\Sigma} = G_{\Sigma} * G_d \tag{2}$$
where $G_{\Sigma}$ denotes a Gaussian convolution kernel with standard deviation $\Sigma$.
(3) Let $h_{\Sigma}(u, v) = [G_1^{\Sigma}(u, v), \ldots, G_H^{\Sigma}(u, v)]^{\top}$ denote the feature vector of location $(u, v)$ after convolution by a Gaussian kernel of standard deviation $\Sigma$. $H$ denotes the number of gradient directions in the feature vector. Let $\tilde{h}_{\Sigma}(u, v)$ be the normalized feature vector. The DAISY descriptor $D$ of the feature point location $(u, v)$ is denoted as
$$D(u, v) = \left[\tilde{h}_{\Sigma_1}^{\top}(u, v),\ \tilde{h}_{\Sigma_1}^{\top}(l_1(u, v, R_1)), \ldots, \tilde{h}_{\Sigma_1}^{\top}(l_T(u, v, R_1)),\ \ldots,\ \tilde{h}_{\Sigma_Q}^{\top}(l_1(u, v, R_Q)), \ldots, \tilde{h}_{\Sigma_Q}^{\top}(l_T(u, v, R_Q))\right]^{\top} \tag{3}$$
where $Q$ is the number of layers, and $T$ is the number of sampling points in each layer. $l_i(u, v, R_j)$ denotes the location of the $i$th sampling point on the $j$th concentric ring centered at the point $(u, v)$. $R_j$ is the distance between the sampling point and the center point $(u, v)$.
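Steps (1) and (2) and the normalized vector of step (3) can be illustrated with a small pure-Python sketch. It is only an illustration under stated assumptions, not the implementation used in this paper (which was written in MATLAB): the function names, the border handling (clamped indices, zero difference at the first row and column), and the kernel radius of about three standard deviations are all choices made for this example.

```python
import math


def oriented_gradient(image, theta):
    """Rectified directional derivative G_d = max(cos(t)*dI/dx + sin(t)*dI/dy, 0).

    Differences of neighbouring pixels mimic convolution with [1, -1];
    using a zero difference at the first row/column is an assumption.
    """
    rows, cols = len(image), len(image[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            dx = image[y][x] - image[y][x - 1] if x > 0 else 0.0
            dy = image[y][x] - image[y - 1][x] if y > 0 else 0.0
            grad[y][x] = max(math.cos(theta) * dx + math.sin(theta) * dy, 0.0)
    return grad


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(grid, sigma):
    """Separable Gaussian smoothing with clamped borders (an assumption)."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    rows, cols = len(grid), len(grid[0])

    def clamp(value, size):
        return min(max(value, 0), size - 1)

    horizontal = [[sum(kernel[k + radius] * grid[y][clamp(x + k, cols)]
                       for k in range(-radius, radius + 1))
                   for x in range(cols)] for y in range(rows)]
    return [[sum(kernel[k + radius] * horizontal[clamp(y + k, rows)][x]
                 for k in range(-radius, radius + 1))
             for x in range(cols)] for y in range(rows)]


def orientation_histogram(image, u, v, sigma, num_directions=8):
    """Unit-norm vector of smoothed oriented gradients at column u, row v."""
    values = []
    for d in range(num_directions):
        theta = 2.0 * math.pi * d / num_directions
        values.append(smooth(oriented_gradient(image, theta), sigma)[v][u])
    norm = math.sqrt(sum(x * x for x in values))
    return [x / norm for x in values] if norm > 0 else values
```

A full descriptor would concatenate such vectors sampled at the center point and at the T points on each of the Q rings, with a larger standard deviation on the outer rings.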
The nearest neighbor (NN) algorithm is used to calculate the Euclidean distance $e$ between DAISY descriptors to match the key points of an image. Let $A$ be any key point of the source image, and let $B_1$ and $B_2$ denote the two points on the target image whose DAISY descriptors have the smallest Euclidean distances to that of point $A$, with $e(A, B_2) > e(A, B_1)$. The keypoint matching process for infrared and visible images is shown in Figure 2. Let $A$ and $B$ denote the key point sets of the source and target images, respectively, with set size $N$. $A_i \in A$, $B_i \in B$, and there exists a unique matching $A_i \to B_i$. Inevitably, there will be wrong matching pairs due to image quality or algorithm performance limitations. Assume that all target positions lie in the same plane and the camera distortion is small. Then, when the shooting viewpoint changes, the relative positions of the feature points should remain approximately the same, whereas the relative position between a feature point belonging to a wrong matching pair and the other points is obviously different. To eliminate false matches, a filtering process is designed based on the random sample consensus (RANSAC) algorithm. First, calculate the distances between $A_i$ and $B_i$ and the other elements of their respective sets, $A_j$ and $B_j$ ($j = 1, \ldots, N$, $j \neq i$), within the image, where the distance between $A_i$ and $A_j$ is denoted as $a_{ij}$, and the distance between $B_i$ and $B_j$ is denoted as $b_{ij}$. If both $A_i \to B_i$ and $A_j \to B_j$ are matched correctly, then $K_{ij} = b_{ij}/a_{ij}$ is an approximately fixed value. Otherwise, the value of $K_{ij}$ differs from that of the others. Then, the set of points $(a_{ij}, b_{ij})$ is input into the RANSAC model to find outliers that do not fit the model. For specific information on the algorithm, refer to reference [31]. Finally, count the number of occurrences of each associated point $A_k$ among the outliers, i.e., the number of outliers whose abscissa is $a_{*k}$ or $a_{k*}$; if subscript $k$ occurs more than $N/2$ times, then $A_k \to B_k$ is determined to be a false match (Algorithm 1).
Algorithm 1 (only part of the listing is available here):
for (j = 1 to N) do
    if (i <= j) then
        break;
...
if (sk(i) > N/2) then
    K.append(i) // Put index i into the array K.
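The screening idea described above (pairwise distance ratios, a consensus model, then a vote per key point) can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the function name, the relative tolerance `tol`, the iteration count, and the very small one-parameter RANSAC loop for the model b = K * a are all assumptions introduced for this example.

```python
import math
import random


def find_false_matches(src_pts, dst_pts, tol=0.05, iterations=200, seed=0):
    """Return indices k whose match src_pts[k] -> dst_pts[k] looks wrong.

    Each pair of matches (i, j) yields a point (a_ij, b_ij). Correct matches
    share a common ratio K = b_ij / a_ij; a key point that appears in more
    than N / 2 outlier pairs is reported as a false match.
    """
    n = len(src_pts)
    pairs = []  # entries are (i, j, a_ij, b_ij) with j < i
    for i in range(n):
        for j in range(i):
            a = math.dist(src_pts[i], src_pts[j])
            b = math.dist(dst_pts[i], dst_pts[j])
            pairs.append((i, j, a, b))

    # Minimal RANSAC: one sample fixes K, inliers satisfy |b - K*a| <= tol*b.
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iterations):
        _, _, a, b = rng.choice(pairs)
        if a == 0:
            continue
        k = b / a
        inliers = [p for p in pairs
                   if abs(p[3] - k * p[2]) <= tol * max(p[3], 1e-12)]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers

    inlier_ids = {(i, j) for i, j, _, _ in best_inliers}
    votes = [0] * n
    for i, j, _, _ in pairs:
        if (i, j) not in inlier_ids:
            votes[i] += 1
            votes[j] += 1
    return [k for k in range(n) if votes[k] > n / 2]
```

With N matches, a single wrong match takes part in N - 1 inconsistent pairs, while each correct match is involved in only one of them, which is why the N/2 vote threshold separates the two cases.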

Perspective Transform
Perspective transforms project a picture onto a new plane of view and are commonly used for distortion correction of images. The alignment of an image is essentially a distortion-correcting process for specified feature points. The lining surface studied in this paper can be approximated as a plane, and the captured image can be regarded as a single depth-of-field image. Therefore, the pixel alignment of the infrared image and the visible image is realized using the perspective transform. We select four mapping pairs to compute the perspective transform matrix $T$. To decrease the error, the four points with the largest encircled area are selected for the computation. Let the coordinates of the source and target images in the four selected matching pairs be denoted as $(x_k, y_k)$ and $(x'_k, y'_k)$, respectively, with $k = 1, 2, 3, 4$. According to the mapping relation of the perspective transformation, the position of the target image can be calculated by the following matrix equation:
$$\begin{bmatrix} x'_1 \\ y'_1 \\ \vdots \\ x'_4 \\ y'_4 \end{bmatrix} = \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 x'_1 & -y_1 x'_1 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 y'_1 & -y_1 y'_1 \\ \vdots & & & & & & & \vdots \\ x_4 & y_4 & 1 & 0 & 0 & 0 & -x_4 x'_4 & -y_4 x'_4 \\ 0 & 0 & 0 & x_4 & y_4 & 1 & -x_4 y'_4 & -y_4 y'_4 \end{bmatrix} \begin{bmatrix} t_{11} \\ t_{12} \\ t_{13} \\ t_{21} \\ t_{22} \\ t_{23} \\ t_{31} \\ t_{32} \end{bmatrix} \tag{4}$$
where the left-hand side of the equation and the first term on the right-hand side of the equation are known matrices, so the second term on the right-hand side of the equation can be solved using Gaussian elimination or LU decomposition. Form the transformation matrix $T$ according to the solution of Equation (4):
$$T = \begin{bmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \\ t_{31} & t_{32} & t_{33} \end{bmatrix}$$
where $t_{ij}$ is the element of the $i$th row and $j$th column of the transformation matrix $T$, and $t_{33} = 1$. Using the transformation matrix $T$, any point $(x, y)$ in the source image can be mapped to the intermediate variables $(x_t, y_t, z_t)$, which represent the position in the target image in homogeneous form. The transformation equation is
$$\begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix} = T \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{5}$$
Then, the normalization operation $x' = x_t / z_t$, $y' = y_t / z_t$ is performed to obtain the mapped position coordinates $(x', y')$, and the pixel values at the integer position coordinates are obtained by bilinear interpolation.
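As a hedged illustration of this step, the sketch below builds the 8 x 8 linear system from four point pairs, solves it by Gaussian elimination with partial pivoting, and applies the resulting matrix with the final normalization. The function names are hypothetical, and the column-vector convention (the homogeneous source point multiplied on the left by T, with t33 fixed to 1) is an assumption of this example.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - partial) / aug[r][r]
    return solution


def perspective_matrix(src, dst):
    """3x3 matrix T (t33 = 1) mapping four source points onto four targets."""
    rows, rhs = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -x * xp, -y * xp])
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -x * yp, -y * yp])
        rhs.extend([float(xp), float(yp)])
    t = solve_linear_system(rows, rhs)
    return [t[0:3], t[3:6], [t[6], t[7], 1.0]]


def apply_perspective(T, x, y):
    """Map (x, y) through T and normalize by the third homogeneous coordinate."""
    xt = T[0][0] * x + T[0][1] * y + T[0][2]
    yt = T[1][0] * x + T[1][1] * y + T[1][2]
    zt = T[2][0] * x + T[2][1] * y + T[2][2]
    return xt / zt, yt / zt
```

For example, `perspective_matrix([(0, 0), (1, 0), (1, 1), (0, 1)], [(0, 0), (2, 0), (3, 2), (-1, 2)])` yields a matrix that sends each corner of the unit square to the corresponding corner of the trapezoid. Resampling a whole image would additionally require the bilinear interpolation mentioned above, which is omitted here.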

Infrared and Visible Image Fusion
The process of multiscale image fusion is organized into three steps. First, the image is transformed at multiple scales to obtain decomposed images with different resolution levels. Then, the fusion rules at each level are designed to obtain the transform coefficients used for reconstruction. Finally, the fused image is obtained using a multiscale inverse transform. In this paper, we use the dual-tree complex wavelet transform to perform the multi-scale decomposition of the image and design the fusion rules for the low-frequency subbands and high-frequency subbands of each level according to the characteristics of the image at different scales.

DT-CWT Multiscale Decomposition
The dual-tree complex wavelet transform (DT-CWT) is an improved form of the discrete wavelet transform (DWT) that mitigates the frequency aliasing problem of the DWT and is approximately translation-invariant. The DT-CWT is constructed from two parallel DWTs, a real tree and an imaginary tree, each realized with a low-pass filter and a high-pass filter. The two wavelet functions $\psi_h(t)$ and $\psi_g(t)$ are used as the real and imaginary parts of the complex wavelet, respectively, as expressed in Equation (6).
$$\psi(t) = \psi_h(t) + i\,\psi_g(t) \tag{6}$$
When transforming the original signal $s(t)$, the wavelet coefficients $d$ and scale coefficients $c$ in the real and imaginary parts are calculated by Equation (7) and Equation (8), respectively. In these equations, $j$ is the scale factor, $J$ is the number of decomposition layers, and $k$ is the filter length. $\varphi_h$ and $\varphi_g$ are the scale functions of the real and imaginary trees, respectively. The scale function in the context of wavelet analysis represents the low-frequency components of the original signal, while the wavelet function captures the high-frequency components. By performing multiscale decomposition on the signal, the scale coefficients from each scale form the low-frequency subbands, while the wavelet coefficients form the high-frequency subbands. These coefficients are computed and combined to yield the wavelet coefficients $d$ and scale coefficients $c$ required for signal reconstruction. The reconstructed signal can be mathematically expressed using Equation (9).
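The roles of the scale (low-frequency) and wavelet (high-frequency) coefficients, and the fact that the signal can be rebuilt from them, can be seen in a much simpler setting. The sketch below uses a single-tree, real-valued Haar DWT on a 1-D signal as a stand-in; it is not the DT-CWT, which runs two such filter-bank trees in parallel with specially designed filters and operates on 2-D images here. The function names are hypothetical.

```python
import math


def haar_decompose(signal, levels):
    """Return (c_J, [d_1, ..., d_J]) for a length divisible by 2 ** levels.

    c_J holds the coarsest scale coefficients (low-frequency subband) and
    d_j holds the wavelet coefficients of level j (high-frequency subbands).
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose, working from the coarsest level to the finest."""
    current = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for c, d in zip(current, detail):
            rebuilt.extend([(c + d) / math.sqrt(2), (c - d) / math.sqrt(2)])
        current = rebuilt
    return current
```

In a fusion pipeline, the coefficients of the two source images would be combined subband by subband between the decomposition and the reconstruction calls.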
The visible and infrared images are independently decomposed using DT-CWT to obtain their respective low-frequency and high-frequency components at different scales.Typically, the low-frequency component of an image contains the primary target information, while the high-frequency component captures the edge and texture details of the target.However, in the case of a crack target image, the low-frequency component primarily represents the background information, while the high-frequency component contains both the target (crack) and background texture information, owing to the small scale of the crack.The presence of high-contrast texture details can potentially lead to misclassification by the semantic segmentation network.
Hence, a crucial objective in image fusion is to enhance the crack features and suppress the background texture.In this research paper, multiscale fusion rules are specifically designed based on the characteristics of crack images, aiming to preserve the high contrast from the infrared images and the high resolution from the visible images.The fusion process is visually illustrated in Figure 3.

Fusion Rules for Low-Frequency Subbands
The low-frequency component of the image carries crucial information about the structure of the lining and the location, as well as the direction of crack expansion.When allocating weights to the background component, priority should be given to the infrared image in order to enhance the brightness of the lining area and improve the contrast of the cracks.Since the brightness of the crack location differs significantly from that of the background, the saliency of each pixel can be computed to predict the crack location.The saliency of a pixel is determined by taking the difference between the brightness of the pixel itself and the average brightness of its surrounding neighborhood, as calculated using Equation (10).
In Equation (10), $I_{IR}$ and $I_{VIS}$ represent the brightness values of the infrared and visible images, respectively, and $\mu$ represents the average brightness within a neighborhood $\delta$ of radius $n$. At background locations, the difference between $S_{IR}$ and $S_{VIS}$ is relatively small, whereas at crack locations, $S_{IR}$ is significantly larger than $S_{VIS}$. Based on this analysis, the low-frequency coefficients $L_F$ obtained after fusion at each scale are given by Equation (11), in which $L_{IR}$ and $L_{VIS}$ represent the low-frequency coefficients obtained from the DT-CWT decomposition of the infrared and visible images, respectively, and $w_{IR}$ and $w_{VIS}$ denote the weight coefficients assigned to the two images. In this paper, the threshold value $S_{th}$ is selected as $5\sigma^2(e^{0.5} - 1)$, where $\sigma^2$ is the brightness variance in neighborhood $\delta$. The weights are adjusted based on the difference in saliency according to Equation (12).
$w_{VIS}$ is positively correlated with the saliency difference, meaning that visible images are assigned higher weights when the difference in saliency for a pixel exceeds the threshold $S_{th}$ and the brightness is lower than the average value within the neighborhood. Conversely, $w_{IR}$ is negatively correlated with the saliency difference, indicating that infrared images are assigned higher weights when the saliency difference is below the $S_{th}$ threshold. In cases where neither condition is met, both infrared and visible images are assigned equal weights of 0.5 for the fusion process.
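The saliency measure described above (pixel brightness compared with the mean of its neighborhood) can be sketched as follows. The sketch rests on several assumptions: the absolute value of the difference is used, the neighborhood is a square window of side 2n + 1 clipped at the image border, and the threshold is passed in by the caller instead of being derived from the local variance. The function names are hypothetical, and the weight formula of Equation (12) is deliberately not reproduced.

```python
def local_saliency(image, radius):
    """|I(x, y) - mean of the surrounding window| for every pixel."""
    rows, cols = len(image), len(image[0])
    saliency = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = [image[yy][xx]
                      for yy in range(max(0, y - radius), min(rows, y + radius + 1))
                      for xx in range(max(0, x - radius), min(cols, x + radius + 1))]
            saliency[y][x] = abs(image[y][x] - sum(window) / len(window))
    return saliency


def crack_candidate_mask(ir, vis, radius, threshold):
    """True where the infrared saliency exceeds the visible one by > threshold."""
    s_ir = local_saliency(ir, radius)
    s_vis = local_saliency(vis, radius)
    return [[(a - b) > threshold for a, b in zip(row_ir, row_vis)]
            for row_ir, row_vis in zip(s_ir, s_vis)]
```

The resulting mask only marks where the two saliency maps disagree strongly; turning it into the weights $w_{IR}$ and $w_{VIS}$ requires the rule of Equation (12).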

Fusion Rules for High-Frequency Subbands
The high-frequency component of an image contains important details such as target edges and texture information.The magnitude of the high-frequency coefficients reflects the richness of the detailed information in the image.In traditional multiscale fusion algorithms, the high-frequency coefficients with the maximum absolute values from both images are typically retained to preserve the detailed information from both sources.However, in the specific application scenario addressed in this paper, the visible image already contains nearly all valuable high-frequency information, including the details of the crack location.Additionally, the visible image also contains high-frequency disruption information, such as white noise on the left side of the crack and uneven structures on the lining surface.If the approach of directly retaining the high-frequency coefficients with the maximum absolute value from both images is followed, the fused image may exhibit high-frequency disruption information in the background region.This disruption information, due to its similarity in scale and brightness to the crack features, can potentially interfere with crack recognition.
Similar to the fusion process for low-frequency subbands, it is essential to differentiate between cracks and background regions when dealing with high-frequency components. In this regard, the gradients of the decomposed image in the x- and y-directions are represented as $G_x(x, y)$ and $G_y(x, y)$, respectively. The fusion rule devised for the high-frequency components at each scale is expressed in terms of $H_{*,d}(x, y)$, the high-frequency coefficient of point $(x, y)$ in direction $d$. A position where the difference between the gradients of the two images is less than $G_{th}$ and the brightness is less than the mean value of the neighborhood is recognized as a crack, and $H_{VIS}$ is directly used as the fused coefficient.
The threshold value $G_{th}$ plays a crucial role in detail restoration, considering the distinct original resolutions of the infrared and visible images, which results in nonoverlapping pixels at the crack location. Selecting an appropriate $G_{th}$ is essential to strike a balance between retaining image details and accurately recognizing the crack location. When $G_{th}$ is excessively large, the background section retains an excessive amount of visible image information. Conversely, when $G_{th}$ is too small, the crack location may not be accurately identified. A larger $G_{th}$ results in more visible image components being retained, preserving image details to a greater extent. However, this can lead to high-frequency interference in the background region. On the other hand, a smaller $G_{th}$ preserves more infrared image components, resulting in a purer background. However, an excessively small $G_{th}$ may cause the fusion rule to be overly strict, potentially diminishing the importance of high-frequency fusion. In this paper, due to significant resolution and crack contrast differences between the two images, as well as the high degree of irregularity on the lining surface, a more conservative strategy is employed. Specifically, a larger $G_{th}$ is selected to minimize high-frequency interference in the fused image. In this study, $G_{th,d}$ is set as 0.7 times the maximum absolute difference between $G_{VIS,d}$ and $G_{IR,d}$.
The pseudo-code corresponding to the implementation process of the above image fusion algorithm is as follows (Algorithm 2):
G IR,x = ∂I IR,x /∂x; G IR,y = ∂I IR,y /∂y;
G VIS,x = ∂I VIS,x /∂x; G VIS,y = ∂I VIS,y /∂y; // Calculate the gradient in the x-direction and y-direction.
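The high-frequency selection rule described in this subsection can be sketched for a single subband and direction as below. Several points are assumptions made for illustration: the infrared coefficient is used wherever a position is not classified as a crack (the text above only states the crack case explicitly), the brightness and neighborhood-mean maps are supplied by the caller, and all inputs are equally sized nested lists. The function name is hypothetical.

```python
def fuse_highpass(h_vis, h_ir, g_vis, g_ir, brightness, local_mean, ratio=0.7):
    """Select high-frequency coefficients for one subband and direction.

    G_th is ratio times the largest absolute gradient difference. A position
    counts as a crack when the gradient difference is below G_th and the
    brightness is below the neighbourhood mean; it then takes the visible
    coefficient. Falling back to the infrared coefficient elsewhere is an
    assumption of this sketch.
    """
    diffs = [[abs(gv - gi) for gv, gi in zip(row_v, row_i)]
             for row_v, row_i in zip(g_vis, g_ir)]
    g_th = ratio * max(max(row) for row in diffs)
    fused = []
    for y, row in enumerate(diffs):
        fused_row = []
        for x, diff in enumerate(row):
            is_crack = diff < g_th and brightness[y][x] < local_mean[y][x]
            fused_row.append(h_vis[y][x] if is_crack else h_ir[y][x])
        fused.append(fused_row)
    return fused
```

Because the rule only selects between coefficients, it works unchanged for the complex-valued coefficients produced by the DT-CWT.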

Experiment and Evaluation
The visible and infrared images used for algorithm evaluation were collected from a highway tunnel located in Dalian, Liaoning Province, China. The image alignment and fusion algorithm discussed in this section were implemented using MATLAB 2018a. To illustrate the process, a sample image depicted in Figure 4 was chosen. A total of 24 matching pairs were obtained through feature matching between the infrared and visible images. In Figure 4, the green lines represent correct matching pairs, while the red lines denote incorrect matching pairs. The coordinates of each feature point were input into Algorithm 1 for screening, and the results obtained from the RANSAC algorithm are displayed in Figure 5. The red points in Figure 5 represent outliers that do not conform to the model. The indexes of the feature points corresponding to these statistical outliers are shown in Figure 6 for the given sample. In this particular sample, with N = 24, it can be determined that the matching pairs with indexes k = 3, 8, 9, 14, 17, and 22 are identified as incorrect matches.
After obtaining the feature point matching pairs of the infrared image and the visible image, we selected four points with the largest encircled area among them and performed the perspective transformation according to Equations (4) and (5). In this paper, the infrared image is used as the source image, and the visible image is used as the target image. The result of the transformation is shown in Figure 7.
The fused lining surface image data are ultimately input into a semantic segmentation network for recognition. In this context, objective indicators and subjective evaluations based on conventional images have limited significance. Therefore, we assess the fusion algorithms by employing two enhanced U-Net networks and replicating other multiscale fusion algorithms for comparison purposes. The semantic segmentation network is implemented using Python 3.8.
In this paper, we use two semantic segmentation networks, ResU-Net [32] and R2U-Net [33], which are both improved U-Net networks, to recognize cracks in images.ResU-Net adds a residual connection to the convolutional layer in U-Net and adds an attention mechanism; R2U-Net adds a recurrent module based on a residual connection.ResU-Net and R2U-Net improve the resolution recognition compared to that of U-Net.The recognition accuracy is significantly improved in images with low contrast, insufficient light, and overexposure.The purpose of using these two improved networks is to exclude the effect of the poor performance of semantic segmentation networks in crack recognition and to facilitate a side-by-side comparison of image fusion algorithms.
The evaluation process uses the collected public dataset as a training sample for the semantic segmentation model. This dataset consists of 3100 images. Each image used for training is resized to 512 × 512 and converted to grayscale. To avoid overfitting, the original images and fused images used for evaluation are not added to the training set.
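The training-sample preprocessing described above can be sketched as follows. The source does not state the grayscale weights or the interpolation method, so the BT.601 luma weights and nearest-neighbour sampling used here are assumptions, and the function names are hypothetical. An image library such as OpenCV or Pillow would normally be used in practice; plain nested lists are used only to keep the sketch self-contained.

```python
def to_gray(rgb):
    """Convert an H x W image of (R, G, B) tuples to grayscale (assumed BT.601 weights)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb]


def resize_nearest(img, size=512):
    """Resize a 2-D list to size x size by nearest-neighbour sampling (assumed method)."""
    h, w = len(img), len(img[0])
    return [[img[i * h // size][j * w // size] for j in range(size)]
            for i in range(size)]


# Hypothetical 2 x 2 colour image standing in for a training sample.
sample = [[(255, 255, 255), (0, 0, 0)],
          [(255, 0, 0), (0, 0, 255)]]
prepared = resize_nearest(to_gray(sample), size=512)
```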
Figure 8 shows the crack recognition results for each type of sample. The recognition results obtained by directly inputting visible or infrared images to the semantic segmentation model are poor. In the infrared images, gullies on the lining surface give parts of the background a brightness similar to that of cracks, so some background areas are recognized as cracks. In addition, the lack of contrast leads to the incomplete recognition of some small cracks. Visible images also contain misrecognized areas because of the high level of texture detail on the lining surface, and in low-intensity lighting environments, the gullies formed by surface concavities have characteristics similar to those of cracks. Algorithm A [34] is the classical low-pass pyramid fusion algorithm, and Algorithm B [35] applies the absolute maximum rule to the coefficients after DT-CWT decomposition and then reconstructs the image. These two algorithms preserve the high-resolution features of cracks in the visible image and enhance crack resolution. However, their recognition results show that both Algorithm A and Algorithm B still struggle to eliminate the influence of gullies on the lining surface and exhibit a high false recognition rate. In contrast, the fused image produced using the proposed method maintains the high resolution of the visible image at crack locations. It also enhances the background brightness, improves the contrast of crack positions, and suppresses texture details such as pits and gullies on the lining surface.
The results of the semantic segmentation evaluation for each image category are presented in Table 1. The proposed method achieves significantly higher accuracy than the other methods. This indicates that the proposed method achieves a high correct detection rate for cracks while also minimizing false detections.
In Figure 9, the false-positive (FP) pixels in the results of the proposed method are mainly located around the actual cracks. This can be attributed to resolution inconsistencies, which cause slight errors in crack width estimation. However, these FP pixels do not significantly affect the recognition of the crack skeleton. Conversely, in the original images and those generated by the comparison algorithms, the FP pixels are predominantly distributed in the background, leading to significant interference in crack recognition. As indicated by the objective indexes, the proposed method exhibits the highest specificity, reflecting a low rate of false recognition in the background. Both ResU-Net and R2U-Net demonstrate high recall for all segmented images. This, along with the observations in Figure 9, suggests that the predicted crack positions closely align with the ground truth, resulting in a low missed detection rate for the model. The high resolution of the proposed method, combined with the visible image information at crack locations, results in a small number of missed pixels at the crack edges. However, this has no significant impact on the overall detection results, as there are no substantial instances of continuous or extensive missed pixels.
In summary, without altering the structure of the semantic segmentation network model or the training samples, using images fused by the proposed method as input samples effectively enhances the crack recognition performance of the semantic segmentation network. Compared with the original images and the images generated by the comparison methods, samples preprocessed using the proposed method offer high detection precision, a low missed detection rate, and a low false detection rate. These advantages significantly improve the accuracy and adaptability of the semantic segmentation network.
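The objective indexes discussed above can be computed from pixel-wise confusion counts. The following is a minimal sketch using the standard definitions of precision, recall, specificity, and accuracy; the exact formulas behind Table 1 are not reproduced here, so these definitions are an assumption, and the masks and function names are hypothetical.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two binary masks (1 = crack, 0 = background)."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1  # falsely detected pixel
            elif t:
                fn += 1  # missed pixel
            else:
                tn += 1
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard pixel-level indexes; a zero denominator yields 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),        # 1 - missed detection rate
        "specificity": ratio(tn, tn + fp),   # 1 - false alarm rate
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical 3 x 4 masks: a vertical crack, one false pixel, one missed pixel.
truth = [[0, 1, 0, 0],
         [0, 1, 0, 0],
         [0, 1, 0, 0]]
pred = [[0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
scores = metrics(*confusion_counts(pred, truth))
```

Because crack pixels occupy only a small fraction of a lining image, specificity and accuracy are dominated by the background, which is why FP pixels scattered in the background lower specificity more visibly than FP pixels clustered along crack edges.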
It should be emphasized that the results presented in Table 1 indicate only that using samples produced by infrared image fusion in a specific scenario enhances the crack prediction performance of U-Net. This does not imply that combining the proposed method with R2U-Net or ResU-Net is the optimal solution for this problem. Improving U-Net or employing other semantic segmentation models may achieve better classification results, but exploring such possibilities is beyond the scope of this paper.
Using infrared images for compensation avoids the shadow interference that can arise from direct LED lighting. In practical engineering, mechanical devices or vehicle-mounted equipment can be designed to accommodate both visible and infrared cameras for inspections. This setup enables the direct acquisition of high-quality detection samples under low-light conditions. The method proposed in this paper therefore has practical application potential in the field of tunnel inspection.

Conclusions
To address the challenges posed by low-light conditions in tunnels, including low contrast and poor crack recognition accuracy, this paper presents a preprocessing method that uses both infrared and visible images as detection samples. The aim is to enhance the performance of semantic segmentation networks in such scenarios. First, the DAISY descriptor is employed to match feature points in the two images so that they can be aligned. Subsequently, a multiscale fusion method based on DT-CWT (dual-tree complex wavelet transform) is developed to effectively combine the features extracted from the infrared and visible images. The preprocessed images are then fed into a semantic segmentation network for crack recognition, and a classical fusion method is introduced for evaluation purposes. The proposed method is validated in a highway tunnel under low-light conditions, leading to the following specific conclusions:
(1) A multiscale fusion method based on DT-CWT is developed, which incorporates separate fusion rules for low-frequency and high-frequency subbands. The fusion rules for low-frequency subbands utilize pixel saliency, while the fusion rules for high-frequency subbands utilize gradient difference. As a result, the fused image retains the high crack resolution observed in visible images while incorporating the background blurring effect from infrared images. This approach effectively enhances the contrast of the crack location.
(2) In scenarios where there is a difference in brightness range and resolution between the infrared and visible images, both the DAISY descriptor and SIFT demonstrate accurate feature point-matching capabilities between the two images. When used in conjunction with perspective transformation, these techniques enable successful alignment of the images.
(3) Two semantic segmentation models, ResU-Net and R2U-Net, were used to evaluate the effect of image fusion. The crack recognition results and objective metrics such as precision, recall, and specificity show that images preprocessed using the proposed method have a lower false detection rate and missed detection rate than the original images and images produced by other classical fusion algorithms. This makes them more suitable as detection samples for semantic segmentation networks.

Figure 1. DAISY descriptor, where "+" is the position of a sampling point and the size of each circle is positively correlated with the scale of the Gaussian convolution kernel.

Figure 2. Schematic of keypoint matching results, calculating the distance of each keypoint from other keypoints.

Figure 3. The proposed method for infrared image and visible image fusion.

Figure 4. Keypoint matching result: red lines are wrong match pairs, and green lines are correct match pairs.

Figure 5. RANSAC calculation results, where outliers are points that do not fit the model.

Figure 6. Wrong match finding, where the orange dotted line is the judgment threshold (N/2), and the red bars are the false match pairs.

Figure 7. Perspective transform result, where A1~A4 and B1~B4 are the reference points for calculating the transformation matrix T.

Figure 9. Visualization of R2U-Net recognition results; the gray areas are missed pixels, and the red areas are falsely detected pixels.

Table 1. The semantic segmentation network evaluation results.