Article

Bridging Data Distribution Gaps: Test-Time Adaptation for Enhancing Cross-Scenario Pavement Distress Detection

1 Logistics Engineering College, Shanghai Maritime University, Shanghai 201306, China
2 Transportation Infrastructure Digital Research Center, Tongji University, Shanghai 201804, China
3 The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University, Shanghai 201804, China
4 Chongqing Traffic Engineering Quality Inspection Co., Ltd., Chongqing 400000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(24), 11974; https://doi.org/10.3390/app142411974
Submission received: 15 November 2024 / Revised: 12 December 2024 / Accepted: 19 December 2024 / Published: 20 December 2024
(This article belongs to the Section Transportation and Future Mobility)

Abstract

Automatic pavement distress detection using deep learning has revolutionized maintenance efficiency, but deploying models in new, unseen scenarios presents significant challenges due to shifts in data distribution. Traditional transfer learning requires extensive labeled data from the new domain, which is both time-consuming and costly. This paper proposes a test-time adaptation (TTA) framework that addresses feature distribution biases across different scenes, including differences in background, perspective, and environmental conditions. It adapts models at inference time without requiring additional labeled data, making it a promising solution for cross-scenario applications. The framework dynamically adapts the model to these biases by generating domain-specific prior knowledge, applying perspective correction, and generating global attention maps to reduce focus on irrelevant elements. We evaluate the framework on a cross-scene dataset that includes pavement images from three countries and four perspectives. In unsupervised settings, the TTA framework improves detection accuracy by 20.6%, achieving 93.09% of the accuracy obtained through transfer learning with 10,000 labeled images. Compared to traditional transfer learning, our framework reduces the reliance on high-quality labeled data while achieving similar performance gains. Experimental results also demonstrate the framework’s adaptability across various deep learning detection models, offering a scalable solution for rapid deployment and cross-scenario application of pavement distress detection systems.

1. Introduction

Computer vision has made significant strides in pavement distress detection, particularly in vehicle-mounted camera [1] and smartphone [2] scenarios (as shown in Figure 1a–c). While advanced object detection models have been proposed [3,4,5], their performance often degrades when applied to new, unseen scenarios (Figure 1d–f), such as roadside monitoring [6], drones [7], and motorbikes [8]. This cross-scenario generalization gap arises from variations in lighting, weather, and road types, and is compounded by the significant effort required for data annotation and model adaptation. These limitations hinder the widespread deployment of existing models in diverse real-world applications.
Numerous studies have investigated enhancing cross-scene detection performance, primarily from the data and model perspectives. On the model side, transfer learning is a widely used technique for improving cross-scene detection by adapting pre-trained models to new domains [9,10]. Ranjbar et al. [11] proposed a transfer learning method based on multiple pre-trained models to achieve pavement crack segmentation. Liu et al. [12] applied transfer learning to extend pavement damage detection to street-view images, requiring over 20,000 labeled samples for fine-tuning. Similarly, Liu et al. [13] created a dataset of 18,700 labeled images and transferred pavement distress detection to vehicle-mounted motion cameras. Relying on 26,620 annotated images, Deeksha et al. [14] explored the performance of transfer learning algorithms in multi-country cross-scenario detection tasks. Peraka et al. [15] utilized transfer learning to recognize pavement distresses from images sourced from three different datasets, achieving high detection accuracy across multiple distress categories. However, the model struggled when transitioning from perpendicular to parallel camera angles, often misclassifying roadside vehicles as pavement distresses. Transfer learning methods thus still face significant challenges, including negative transfer, domain shift, and the requirement for substantial labeled data [16]. Negative transfer occurs when the source domain’s knowledge is misapplied to the target domain, leading to performance degradation. Domain shift, a mismatch in data distribution between domains, can hinder effective knowledge transfer, even in highly related scenarios. Moreover, the need for extensive labeled data for fine-tuning limits the rapid deployment of models in new environments.
To address data scarcity in new scenarios, generative adversarial networks (GANs) have been employed to synthesize realistic pavement images. By learning the distribution of the training data, GANs can generate diverse images with varying lighting, weather, and viewpoint conditions. This approach has shown promise in improving detection accuracy, particularly for tasks such as crack [17,18,19] and pothole [20,21,22] detection. Liu et al. [23] proposed a lightweight GAN structure for automatic recognition of pavement distress, which enhances computational efficiency and reduces costs by reducing the number of model parameters. However, GANs are computationally expensive to train and can produce artifacts in generated images, which may negatively impact model performance. Additionally, the limited diversity of generated data can hinder the model’s generalization to real-world scenarios. Thus, there remains a strong demand for efficient and effective methods to enhance cross-scenario detection.
In summary, current methods for improving cross-scenario generalization primarily rely on training-phase techniques, such as transfer learning and GAN-based data augmentation. Transfer learning can leverage knowledge from a source domain to a target domain, but it heavily relies on the availability of sufficient labeled data in the target domain. This can be a significant limitation, especially in scenarios where data is scarce or the target domain differs substantially from the source domain. GAN-based methods can generate synthetic data to augment training sets, but they often struggle with mode collapse and require careful tuning to produce realistic and diverse samples. Additionally, these methods typically focus on improving the model’s ability to generalize to new data distributions during training, but they may not be as effective in adapting to dynamic changes in the environment during testing. This limitation can hinder the deployment of these models in real-world applications where conditions can change rapidly.
Unlike traditional approaches, test-time adaptation (TTA) offers a more flexible and efficient solution for cross-scenario adaptation. By dynamically adjusting model parameters or structure during the testing phase, TTA can adapt to new data distributions without requiring extensive re-training or additional labeled data (as shown in Figure 2). This makes TTA particularly well-suited for scenarios like pavement distress detection, where data is limited and conditions can vary widely. TTA’s ability to adapt in real time to changing environments and its independence from labeled data make it a promising approach for enhancing the generalization and performance of deep learning models in cross-scenario applications. Li et al. [24] proposed a test-time domain adaptation framework for monocular depth estimation, which aligned input features between source and target data with a scale alignment scheme, correcting the absolute scale inference on the target domain. Segu [25] evaluated a holistic test-time adaptation framework for multiple object tracking on a variety of domain shifts, including simulation-to-real, outdoor-to-indoor, and indoor-to-outdoor transitions; experimental results demonstrated significant improvements over the original model on all metrics. Evidently, TTA delivers strong performance gains in scenes with significant changes in lighting, backgrounds, or other visual conditions. This flexibility and immediacy make TTA an effective solution for cross-source pavement distress detection [26].
Therefore, this study proposes a TTA-based enhancement framework for cross-scene pavement distress detection. The framework first introduces a domain prior knowledge generator to enable unsupervised feature transfer for images from new scenes. Building on this, three key modules—viewpoint conversion, foreground focus, and adaptive normalization—are developed to address the challenges posed by background variations, viewpoint discrepancies, and distribution shifts across different scenes. These modules are integrated into an unsupervised TTA-based cross-scene detection enhancement framework, designed to be seamlessly integrated with any existing pavement detection model. With this proposed framework, existing distress detection models can be dynamically adapted during the prediction process, eliminating the need for retraining or additional labeled data. The effectiveness of each module, as well as the entire framework, has been validated through ablation and effectiveness experiments. By employing unsupervised methods to mitigate domain bias, this framework offers a robust solution to the challenges associated with cross-scene pavement distress detection.
The remainder of this paper is organized as follows. Section 2 presents the details of the proposed framework and modules. The dataset used in this study is introduced in Section 3. Section 4 illustrates the results of three experiments. Finally, Section 5 offers a summary of this research.

2. Methodology

In pavement distress detection, differences between the source domain and the target domain can be summarized into three types: background differences, viewpoint differences, and distribution biases. Background differences refer to the variability in the non-road surface elements of the images, caused by a variety of factors, including distress collection locations, road conditions, and types of roads. Viewpoint differences arise from the different angles between the camera orientation and the road surface. Existing pavement image viewpoints primarily encompass wide view, oblique view, and top-down view. These varying viewpoints dictate differences in the road surface coverage, details of distresses, and the extent of perspective distortion, where generally, a top-down view is advantageous for capturing a comprehensive view of pavement distresses. Furthermore, even within the same data source, external interferences such as weather, lighting, and shadows can introduce systematic biases during the collection process. For instance, rain can cause image noise, while shadows from buildings or other obstacles can affect the images, constituting a distribution bias. This distribution bias results in the deviation of the dataset’s distribution from the real distribution and can also impact the generalization ability of detection algorithms in practical applications.
To tackle these three differences, three modules and a domain prior generator were developed to improve the model’s performance in the target domain: (1) Domain Prior Generator: This module uses a convolutional neural network for self-supervised learning on pavement distress data, extracting features from both the source domain and the target domain. (2) Viewpoint Transformation: This module employs the domain prior generator to produce perspective information for the target domain images and transforms the images to a unified viewpoint for recognition. (3) Foreground Focus: This module uses the domain prior generator to produce attention weight information for the target domain images. With these attention weights, pavement distress detection can focus on the foreground of the image and emphasize foreground distress information while ignoring inconsistent background information. (4) Adaptive Normalization: This module utilizes the domain prior generator to generate the mean and variance of the target domain image’s distribution and then normalizes the image and its features to adapt to the distribution of the target domain. By utilizing the domain prior generator and the three modules, the gap between the target domain and the source domain can be reduced, and the pavement distress detection model can better adapt to the distribution of test data in the target domain, thereby improving its detection results in the target domain. The framework of this paper is illustrated in Figure 3.

2.1. Domain Prior Knowledge Generator

To enable the pavement distress detection model to adapt autonomously to target domains and achieve better detection performance, it is first necessary to summarize and synthesize the characteristics of the data in those domains. The domain prior knowledge generator extracts general features from the target domain data. This enables the pavement distress detection model to adapt autonomously to distribution changes from the source domain to the target domain and improves detection performance within the target domain. Considering that the differences in pavement distress image data between the source domain and the target domain mainly manifest in viewpoint, background, and lighting/shadow conditions, this paper uses a convolutional neural network to extract features along these three aspects. At the same time, the source domain usually contains a number of well-annotated samples, whereas the target domain contains only unlabeled data. Self-supervised learning [27] is a type of machine learning in which a model is trained using labels automatically derived from the data itself, rather than relying on manual annotation. This paper therefore utilizes self-supervised learning to train on unlabeled samples from both the source domain and the target domain, acquiring a domain prior knowledge generator for the target domain. Its architecture is illustrated in Figure 4.
Images from the target (test) domain are input into the domain prior knowledge generator, undergo feature extraction through convolutional neural networks, and are then integrated through fully connected layers. This adapts the feature representation so that the resulting prior knowledge can subsequently be encoded into the depth-estimation-based viewpoint transformation module, the attention-based foreground focus module, and the adaptive normalization module. Compared to typical convolutional neural networks, the goal of the domain prior knowledge generator is to learn and characterize the data characteristics of both the source and target domains and then generate prior knowledge tailored to the input image from the current target domain. By learning from both the source domain and the target domain through self-supervision, the model can be guided to adapt to changes in data distribution from the source domain to the target domain, thereby improving the generalization of subsequent models.
The training process of the domain prior knowledge generator is also illustrated in Figure 4. It adopts the architecture of a Generative Adversarial Network (GAN) and constructs a domain prior knowledge discriminator paired with the domain prior knowledge generator. The discriminator restores features from each image or batch of images processed by the generator, performing up-sampling and reconstruction of the pavement distress images. A cross-entropy loss function then calculates the loss between the original pavement distress images and those reconstructed from the domain prior knowledge. Through backpropagation and optimization of this loss, the parameters of the domain prior knowledge generator and the discriminator are updated continuously. This completes the training of the domain prior knowledge generator, yielding a generator capable of extracting features from pavement distress images in both the source domain and the target domain. The pseudocode for training the domain prior knowledge generator is shown in Algorithm 1.
Algorithm 1: Training of domain prior knowledge generator
Input: The dataset X composed of data from the source domain and the target domain; initial Domain Prior Knowledge Generator G; Domain Prior Knowledge Discriminator D; training epochs n; cross-entropy loss function L
Output: Trained Domain Prior Knowledge Generator G′; domain prior knowledge Φ_X
 1: for epoch in {1, 2, …, n} do
 2:   for x_i in X do
 3:     φ̃_x = G(x_i)
 4:     pred = D(φ̃_x)
 5:     loss = L(pred, x_i)
 6:     loss.backward()
 7:     G′ = G.optimize()
 8:     D′ = D.optimize()
 9:   end
10: end
11: Φ_X = G(X)
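For concreteness, the sketch below shows one way this adversarial reconstruction loop could be written in PyTorch. It is a minimal illustration only: the encoder/decoder layer sizes, the image size (assumed 3 × 64 × 64, scaled to [0, 1]), the latent dimension, and the single shared optimizer are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PriorGenerator(nn.Module):
    """G: maps a pavement image to a domain prior knowledge vector (illustrative layers)."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc = nn.Linear(64, dim)            # fully connected integration of features

    def forward(self, x):
        return self.fc(self.conv(x))

class PriorDiscriminator(nn.Module):
    """D: up-samples/reconstructs the pavement image from the prior knowledge vector."""
    def __init__(self, dim=128, size=64):
        super().__init__()
        self.size = size
        self.fc = nn.Linear(dim, 3 * size * size)

    def forward(self, phi):
        return torch.sigmoid(self.fc(phi)).view(-1, 3, self.size, self.size)

def train_prior_generator(loader, epochs=200, device="cpu"):
    """Algorithm 1 as a PyTorch loop over unlabeled source + target images."""
    G, D = PriorGenerator().to(device), PriorDiscriminator().to(device)
    opt = torch.optim.Adam(list(G.parameters()) + list(D.parameters()), lr=1e-3)
    loss_fn = nn.BCELoss()                      # reconstruction loss in cross-entropy form
    for _ in range(epochs):
        for (x,) in loader:                     # e.g., a DataLoader over a TensorDataset of images in [0, 1]
            x = x.to(device)
            phi = G(x)                          # domain prior knowledge for this batch
            recon = D(phi)                      # reconstructed pavement image
            loss = loss_fn(recon, x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return G
```

After convergence, only the generator G is retained; it supplies the prior knowledge consumed by the three modules described next.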

2.2. Viewpoint Transformation

The viewpoint transformation module adjusts and corrects the viewpoints of pavement distress images from different target domains, aligning images from various viewpoints to a unified bird’s-eye view. This eliminates irrelevant factors caused by perspective variations and improves the accuracy of pavement distress detection. When a well-trained pavement distress detection model is transferred from the source domain to the target domain, differences in image perspective between the two domains significantly impact detection accuracy: under a slanted viewpoint, pavement distress features are often obscured by occlusion and distorted by perspective. It is therefore necessary to adjust the viewpoint of the collected pavement images when transferring the model.
The viewpoint transformation module allows the pavement distress detection model to detect distress from a unified bird’s-eye view, ensuring that distress features can be extracted as comprehensively as possible. When pavement conditions are captured in images, the pavement can be regarded as a plane, with distress appearing as irregular patterns on that plane. When images are captured from different viewpoints, the pavement plane is projected onto the image plane of each viewpoint, leading to occlusion and deformation of pavement distress. Adjusting the viewpoint of pavement images to a unified perspective can therefore be seen as recovering the position and angle of the camera relative to the pavement, and thus restoring the perspective-distorted image to the original bird’s-eye view. Equivalently, the task can be viewed as finding, for each point in the pavement image, its position and distance relative to the camera, which allows the image to be restored.
Specifically, the domain prior knowledge generator first generates a large number of proposal line points. This is because, after extensive self-supervised training on numerous target domain images, the domain prior knowledge generator can effectively extract key information from any target domain image, which is stored in these key points. Next, the Hough transform is employed to detect lines formed by these proposal line points. Typically, in a bird’s-eye view, lines such as lane markings are parallel to each other; however, from different viewpoints, these parallel lines may intersect. Therefore, the lines detected using the Hough transform can extract viewpoint information. The Hough Transform is a technique for identifying collections of points in the parameter space that represent specific geometric shapes, commonly used for detecting shapes such as lines and circles in images.
Based on this, any pavement images captured from an arbitrary angle can be transformed into a pavement image in space according to the following equation:
$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} m \\ n \\ 1 \end{bmatrix}$$

Transformed into a bird’s-eye view:

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} x/z \\ y/z \\ z/z \end{bmatrix} = \begin{bmatrix} \dfrac{a_{11}m + a_{12}n + a_{13}}{a_{31}m + a_{32}n + a_{33}} \\[2ex] \dfrac{a_{21}m + a_{22}n + a_{23}}{a_{31}m + a_{32}n + a_{33}} \\[2ex] 1 \end{bmatrix}$$

where $(m, n)$ is a point in the pavement image captured from an arbitrary viewpoint, $(x', y')$ is the corresponding point under the bird’s-eye view perspective, and the matrix $\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$ is determined by the position of the camera and the viewpoint. The architecture for viewpoint correction of pavement images, combined with depth estimation, is illustrated in Figure 5.
Before inputting pavement images into the pavement distress recognition model, the images first undergo preprocessing in a pre-trained domain prior knowledge generator. This generator generates prior knowledge specific to the current image, obtains some proposal line points, and then detects lines in the images. Based on the detected lines, the module utilizes perspective transformation algorithms to restore the image to a bird’s-eye view perspective. The pseudocode of the viewpoint transformation module is presented as Algorithm 2. Under a unified and fixed bird’s-eye view perspective, the features of pavement distress are more effectively presented. A fixed viewpoint is advantageous for the model to eliminate irrelevant interference factors, improving the accuracy of pavement distress recognition.
Algorithm 2: Viewpoint transformation
Input: The proposal line points P; the input image X.
Output: The bird’s-eye view image X′.
 1: for p_i in P do
 2:   l_i = HoughTransform(p_i)
 3:   L.append(l_i)
 4: for l_i in L do
 5:   for l_j in L do
 6:     p = line_intersection(l_i, l_j)
 7:     P′.append(p)
 8: k = KMeans(P′)
 9: l_1, l_2 = HoughTransform(P, k)
10: p_1, p_2 = line_intersection(l_1, x = 0), line_intersection(l_2, x = 0)
11: p_3, p_4 = line_intersection(l_1, x = height(X)), line_intersection(l_2, x = height(X))
12: X′ = PerspectiveTransformation(X)
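A minimal OpenCV sketch of the geometric steps is given below. It assumes the four road-region corner points (the p_1…p_4 of Algorithm 2) are already available, and shows how candidate lines could be found with a probabilistic Hough transform; the Canny edge step and the thresholds are illustrative assumptions, since the paper derives its proposal line points from the domain prior knowledge generator rather than from edges.

```python
import cv2
import numpy as np

def to_birds_eye(image, src_pts):
    """Warp a pavement image to a bird's-eye view.

    src_pts: four (x, y) road-region corners in the order
    top-left, top-right, bottom-left, bottom-right (e.g., p1..p4 from Algorithm 2).
    """
    h, w = image.shape[:2]
    dst_pts = np.float32([[0, 0], [w, 0], [0, h], [w, h]])
    H = cv2.getPerspectiveTransform(np.float32(src_pts), dst_pts)  # 3x3 homography (a11..a33)
    return cv2.warpPerspective(image, H, (w, h))

def dominant_lines(gray):
    """Detect candidate lane-marking lines with the probabilistic Hough transform."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                            threshold=80, minLineLength=60, maxLineGap=10)
    return [] if lines is None else lines[:, 0]    # each row: (x1, y1, x2, y2)
```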

2.3. Foreground Focus Module

The foreground focus module highlights regions in the image that are likely to contain pavement distress. It directs the model’s attention to areas with higher probabilities of distress while ignoring background regions. The background elements of the collected road images typically exhibit significant variation. These differences can often confuse the model’s understanding of pavement distress, affecting the transfer performance of the pavement distress detection model in the target domain. Additionally, the location distribution of pavement distress on images in the target domain may also differ from that in the source domain. This can lead to the pavement distress detection model trained in the source domain ignoring some pavement distress in the target domain as invalid information, because the model has never encountered pavement distress appearing in certain locations in the source domain. Therefore, when applying a well-trained model from the source domain to the target domain, it is necessary to adjust the model’s attention to focus on regions where pavement distress is more likely to occur.
The attention mechanism is a deep learning module that mimics how humans process external information. It allows the model to focus on prominent features in input images for key calculations and predictions, enabling the computer to focus on the characteristics of the key regions quickly and efficiently. The attention mechanism can be abstracted into the following expression:
Attention = f (g(x), x)
where x represents the input features; g represents the processing of input features to generate attention; f represents the processing of input features combined with attention. This attention mechanism can be seen as a dynamic weight adjustment process based on input image features.
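As a concrete reading of this abstraction, the sketch below implements a minimal spatial attention layer in PyTorch, where g is a 1 × 1 convolution producing a weight map and f is element-wise re-weighting; this particular gate is an illustrative assumption, not the paper's module.

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Attention = f(g(x), x): g produces per-pixel weights, f applies them."""
    def __init__(self, channels):
        super().__init__()
        self.g = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        weights = self.g(x)        # g(x): attention map in [0, 1]
        return x * weights         # f(g(x), x): re-weighted features
```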
Wang et al. [28] first introduced the attention mechanism into computer vision and proposed a novel non-local network, greatly improving the accuracy of object detection. Furthermore, Dosovitskiy et al. [29] introduced the transformer architecture, which achieved tremendous success in natural language processing, into computer vision. Leveraging the powerful self-attention capability of the transformer, they developed the Vision Transformer model, which possesses enhanced image recognition capabilities.
Cracks in pavement often appear as elongated shapes, with their effective features concentrated along the crack lines, especially for alligator cracks. Other distress types, such as potholes, are typically smaller, with features concentrated locally. Therefore, introducing an attention mechanism is an important step in further enhancing the cross-scene transfer ability of pavement distress recognition models, as it focuses the model’s attention on prominently featured pavement distress. To achieve this, this study employs the domain prior knowledge generator to extract attention points from the current images in the target domain. These attention points represent areas identified by the domain prior knowledge generator as requiring significant focus, which may indicate the presence of pavement distress. Since the domain prior knowledge generator contains inherent randomness, not all attention points are useful. Therefore, k-means clustering is applied to these attention points, yielding several cluster centers that serve as highly reliable attention points for pavement distress identification. Finally, the average distance from each pixel in the image to these cluster centers is calculated and used to derive the attention weight for that pixel. These weights are then multiplied by the original image to obtain the foreground-focused target domain image, as illustrated in Figure 6. The pseudocode of the foreground focus module is presented as Algorithm 3.
Algorithm 3: Foreground focus
Input: The attention points P; the input image X.
Output: The foreground focus image X′.
 1: k = KMeans(P)
 2: for i in range(height) do
 3:   for j in range(width) do
 4:     w_{i,j} = distance((i, j), k)
 5: X′ = X .* w
In the foreground focus module, images and features are multiplied by their respective attention matrices before being passed into each convolutional layer. This strengthens the more prominent foreground features belonging to distress while weakening features related to irrelevant backgrounds. As a result, when transferring from the source domain to the testing domain, the model can ignore interference from background factors and concentrate on the main pavement distress, thus enhancing its performance in the testing domain.
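The weighting step itself can be sketched in a few lines of NumPy and scikit-learn, as below. The number of clusters and the inversion of the distance (so that pixels near attention centers receive weights close to 1) are assumptions about the weighting function, which the paper does not spell out in closed form.

```python
import numpy as np
from sklearn.cluster import KMeans

def foreground_focus(image, attention_points, n_clusters=5):
    """Weight an H x W x 3 image toward likely-distress regions (sketch of Algorithm 3).

    attention_points: (N, 2) array of (row, col) points which, in the paper, come from
    the domain prior knowledge generator; here they are simply an input (N >= n_clusters).
    """
    h, w = image.shape[:2]
    centers = KMeans(n_clusters=n_clusters, n_init=10).fit(attention_points).cluster_centers_
    rows, cols = np.mgrid[0:h, 0:w]
    grid = np.stack([rows, cols], axis=-1).reshape(-1, 2).astype(float)
    # average distance from every pixel to the cluster centers
    dists = np.linalg.norm(grid[:, None, :] - centers[None, :, :], axis=-1).mean(axis=1)
    weights = 1.0 - dists / dists.max()            # near centers -> weight close to 1
    weights = weights.reshape(h, w)
    return image * weights[..., None]              # broadcast over colour channels
```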

2.4. Adaptive Distribution Normalization

The adaptive distribution normalization module enables the pavement distress detection model to better adapt to data from different target domains, further enhancing its generalization ability across domains. Beyond differences in viewpoint and background, pavement distress datasets from different domains are also affected by factors such as varying lighting conditions, shadows, and other disturbances. These factors can likewise degrade performance when object detection models are transferred across domains. It is therefore necessary to normalize the distribution of images between different domains.
Feature normalization is an important step to ensure model stability and convergence for deep learning models. It helps prevent gradient explosions during backpropagation and optimization. Additionally, normalization allows the model to adapt to data of different scales and ranges, thus enhancing its generalization capability. Generally, the formula for normalizing the same batch {x1, x2…xn} during training is as follows:
$$\mu_B = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\sigma_B^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \mu_B\right)^2$$

$$\tilde{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$y_i = \gamma \tilde{x}_i + \beta$$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the batch, $\epsilon$ is a small constant for numerical stability, $\tilde{x}_i$ is the normalized result, $y_i$ is the output of the normalization layer, and $\gamma$ and $\beta$ are trainable neural network parameters. After passing through the normalization layer, the training data are standardized to a uniform range, which reduces data divergence and lowers the learning difficulty of the network.
However, when applying a trained model to real-world testing scenarios, images are typically input one by one rather than in batches. Additionally, the fixed parameters of normalization make it difficult for the model to adjust and adapt to real-world scenarios during the testing phase. Therefore, this paper proposes an adaptive normalization layer, the output of which is as follows:
$$y_i = \gamma\left(\gamma_X \frac{x_i - \mu_X}{\sigma_X} + \beta_X\right) + \beta$$

where $x_i$ represents the test data and $y_i$ the result of adaptive normalization for the test data; $\mu_X$ and $\sigma_X$ are the overall mean and standard deviation of the test-domain data, generated by the domain prior knowledge generator; $\gamma_X$ and $\beta_X$ are the normalization parameters of the test-domain data, also generated by the domain prior knowledge generator; and $\gamma$ and $\beta$ are trainable neural network parameters determined during training. This approach allows the domain prior knowledge generator to perform self-supervised learning and induction on the test data, obtaining prior information about the target domain distribution. This prior information is then fed into the adaptive normalization layer, normalizing test data from different source domains to a unified distribution. This effectively addresses the model transfer issue in pavement distress recognition across different source domains.
In the adaptive normalization module, normalization operations adjust the data distribution, thereby addressing issues related to varying lighting and style when transferring pavement distress detection models from the source domain to the target domain. By enabling the model to adapt to images in the target domain, better pavement distress detection performance can be achieved. In terms of data processing, before target-domain images are input into the pavement distress detection model for distress recognition, they are first fed into the domain prior knowledge generator. The generator produces adaptive normalization parameters specific to the current test image, which are passed to the adaptive normalization layer of the pavement distress detection model. The image is then input into the detection model equipped with the adaptive normalization layer, where its features are normalized using the parameters generated for that image. The pseudocode of the adaptive distribution normalization module is presented as Algorithm 4. The adaptive and variable normalization layer ensures that the normalization of the current image features is always appropriate, thereby removing interference factors such as lighting and shadows and enhancing the model’s generalization ability during cross-domain transfer.
Algorithm 4: Adaptive distribution normalization
Input: The distribution parameters μ_X, σ_X, γ_X, β_X; the input features X; distress detection model M.
Output: The adaptive distribution normalization features X′.
 1: X′ = γ(γ_X (X − μ_X) / σ_X + β_X) + β
 2: O = M(X′)
 3: M.optimize()
 4: γ.update(), β.update()
 5: X′ = γ(γ_X (X − μ_X) / σ_X + β_X) + β
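A minimal PyTorch sketch of such a layer is shown below: gamma and beta are the parameters learned on the source domain, while the per-image statistics and parameters produced by the domain prior knowledge generator are passed in at inference time. The tensor shapes and the epsilon term are assumptions made for the illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AdaptiveNorm(nn.Module):
    """Adaptive distribution normalization layer (sketch of the equation above)."""
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_features))   # learned on the source domain
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.eps = eps

    def forward(self, x, mu_X, sigma_X, gamma_X, beta_X):
        # x: (N, C, H, W); mu_X, sigma_X, gamma_X, beta_X: per-channel tensors of shape (C,)
        # produced for the current test image by the domain prior knowledge generator
        x_hat = (x - mu_X[None, :, None, None]) / (sigma_X[None, :, None, None] + self.eps)
        x_hat = gamma_X[None, :, None, None] * x_hat + beta_X[None, :, None, None]
        return self.gamma[None, :, None, None] * x_hat + self.beta[None, :, None, None]
```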

2.5. Framework Deployment and Application

In practice, when transferring a pavement distress detection model M trained on a source domain X to a target domain Y, we typically start by collecting a large amount of unlabeled image data y from Y, which is cost-effective and straightforward. These images are then fed into a domain prior knowledge generator for self-supervised learning, enabling the extraction of viewpoint features V, foreground features F, and normalization features N from the target domain.
To address viewpoint variation, a viewpoint transformation is applied to standardize the images to a unified perspective. A foreground focus module directs attention to the pavement distress targets, while an adaptive distribution normalization layer ensures feature consistency with the source domain during extraction.
The outputs of these modules are combined, and the aggregated features are input into the model M, originally trained on the source domain, for effective deployment in the target domain. The pseudocode of the whole framework is presented as Algorithm 5.
Algorithm 5: Whole framework of the test-time adaptation
Input: The pavement distress detection model M (trained on the source domain X); a large amount of unlabeled image data y (collected from the target domain Y); Domain Prior Knowledge Generator; Viewpoint Transformation module; Foreground Focus module; Adaptive Distribution Normalization module.
Output: The pavement distress detection model M (adapted to the target domain Y).
 1: V, F, N = DomainPriorKnowledgeGenerator(y)
 2: y_V = ViewpointTransformation(y, V)
 3: y_F = ForegroundFocus(y, F)
 4: M = M(y_V, y_F).AdaptiveDistributionNormalization(N)
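Putting the pieces together, a deployment loop over target-domain images might look like the sketch below, reusing the hypothetical helpers from the previous sections; the interfaces (a generator returning the three priors, a detector accepting normalization parameters) are assumptions made for illustration rather than the authors' code.

```python
def adapt_and_detect(model, images, prior_generator):
    """Sketch of Algorithm 5: run a source-trained detector on unlabeled target images.

    `model` is any trained detector whose normalization layers have been replaced by
    AdaptiveNorm; `prior_generator`, `to_birds_eye`, and `foreground_focus` are the
    illustrative helpers sketched in the previous sections.
    """
    results = []
    for img in images:
        V, F, N = prior_generator(img)                # viewpoint, foreground, normalization priors
        img_v = to_birds_eye(img, V)                  # unify the viewpoint
        img_f = foreground_focus(img_v, F)            # emphasize likely-distress regions
        results.append(model(img_f, norm_params=N))   # detect with adaptive normalization
    return results
```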

3. Data Description and Model Evaluation Index

3.1. Data Description

To evaluate the performance of the proposed framework in enhancing cross-scene detection, a cross-scene pavement distress detection dataset was established, comprising pavement images from six countries, four different viewpoints, and multiple scenes. This dataset is divided into two parts: the standard dataset and the comprehensive dataset. The standard dataset, labeled as CHN_C, consists of images captured in simpler scenes, focusing on distress with a direct view of the pavement surface, and is used for model training. The CHN_C dataset was collected from over 3000 km of asphalt roads in Shanghai, using a high-definition (HD) camera mounted on the rear of a vehicle to capture road images at a resolution of 1628 × 1236. All images were taken on sunny days with dry road surfaces. The comprehensive dataset comprises a diverse collection of pavement images from multiple scenes, sourced from the Crowdsensing-based Road Damage Detection Challenge (CRDDC2022) [30]. The CRDDC2022 dataset encompasses pavement images from six countries: Norway, Japan, India, the Czech Republic, the United States, and China. Images are derived from a variety of sources, including vehicle-mounted high-definition cameras, vehicle-mounted smartphones, Google Street View images, motorbikes, and drones. Due to the diversity in collection equipment, locations, and shooting perspectives, there are observable differences in background, viewpoint, and scene elements among the images across these datasets, as demonstrated in Figure 7. Table 1 presents the details of each pavement image dataset used in this study.
The CHN_C dataset was used as the source domain data for training the detection model due to its consistent and simplified image scenes. Vehicle-mounted cameras capture images at an oblique view, closely positioned to the pavement, contributing to small perspective distortion and high resolution. Furthermore, collection conditions on sunny days without water accumulation can reduce the impact of external interferences such as weather, lighting, and shadows, which enhance the model’s accuracy in capturing pavement distress features. To balance the number of images in the training and test sets, we selected 10,000 images to serve as the training set, as detailed in Table 2.
To validate the generalizability of the proposed model in cross-source pavement images, images from Norway, Japan, the United States, and China within the CRDDC2022 dataset were selected as the test set. Data from India and the Czech Republic were not chosen due to their similarity to the Japan dataset. In NO_C, JPN_S, and US_G datasets, pavement images were captured parallel to the road surface using vehicle-mounted cameras. Compared to the training set (CHN_C), their primary differences lie in the image background. Owing to variations in the construction of roads, buildings, and roadside facilities across these countries, images in each dataset have a unique background. This diversity allows us to verify the generalization performance of the proposed model in the presence of background differences. On the other hand, the differences between CHN_M and CHN_D compared to CHN_C mainly lie in the viewpoint, which are influenced by the camera’s shooting angles. In CHN_M, the camera angle deviates more from the vertical direction of the road surface, incorporating a portion of the roadside scene into the images. Although CHN_D employs a top-down view, the drone’s higher altitude above the ground allows for a broader coverage but results in a reduced resolution. Therefore, these two types of pavement images with different viewpoints can serve to validate the proposed model’s ability to adapt to viewpoint differences. Moreover, since the CRDDC2022 dataset includes pavement images captured under varying external interferences, such as different weather and lighting, these data can be utilized to assess the model’s generalizability to distribution biases.
Given the variations in sample sizes and image resolutions across the datasets, we selected 2000 images from each of the five datasets within CRDDC2022 to ensure the diversity and comparability of the test set. These include various types of pavement distresses and images affected by external interferences, as illustrated in Table 2. Notably, since the CHN_M dataset contains only 1977 annotated images, all of them were included in the test set. Furthermore, to standardize the image resolution across both the training and test sets, images were cropped to a uniform resolution of 512 × 512. At this resolution, little detail is lost: the majority of pavement distress features are preserved, images of different sizes from different datasets are presented to the model in a consistent form, and the detection model can still fully extract distress features, which helps guarantee detection accuracy.
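As a simple illustration, one way to bring images of different sizes to this resolution is a center crop followed by a resize, as sketched below with OpenCV; the exact cropping procedure is not specified in the paper, so this is only an assumption.

```python
import cv2

def crop_to_512(path):
    """Center-crop to a square and resize to 512 x 512 (one plausible preprocessing step)."""
    img = cv2.imread(path)
    h, w = img.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = img[top:top + side, left:left + side]
    return cv2.resize(square, (512, 512), interpolation=cv2.INTER_AREA)
```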

3.2. Model Evaluation Index

To assess and compare the detection performance of the improved model, this paper introduces five commonly used evaluation metrics: Precision, Recall, Average Precision for a single class (AP), mean Average Precision across all classes (mAP), and F1-score. These metrics are used to evaluate the performance of the pavement distress detection model trained on the source domain when transferred to the testing domain. Their calculation formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} \mathrm{Precision}\;\mathrm{d}\,\mathrm{Recall}$$

$$mAP = \frac{1}{k}\sum_{i=1}^{k} AP_i$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP represents the number of correctly identified pavement distresses; FP represents the number of other targets incorrectly identified as pavement distresses; FN represents the number of pavement distresses missed in detection; and k represents the number of categories of pavement distresses.
The pavement distress detection model provides a confidence score for each target during distress identification, indicating the degree to which the target is considered a positive sample. Determining predicted positive samples involves comparing the confidence scores with a set confidence threshold. Therefore, the values of TP, FP, and FN are related to the confidence threshold setting. Precision represents the model’s ability to correctly identify distresses, while Recall measures the model’s capability to successfully detect pavement distresses. Both are related to the confidence threshold and exhibit a trade-off relationship: increasing the confidence threshold typically leads to higher Precision but lower Recall. Typically, the confidence threshold is set at 0.5. AP, mAP, and F1-score are determined based on Precision and Recall, providing a comprehensive evaluation of model accuracy.
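For reference, the short sketch below computes Precision, Recall, and F1 from TP/FP/FN counts obtained at a fixed confidence threshold (e.g., 0.5); the counts in the usage example are made up purely for illustration.

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall and F1 from counts at a fixed confidence threshold."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example with invented counts: 80 distresses found correctly, 20 false alarms, 30 missed
# -> precision 0.80, recall ~0.727, F1 ~0.762
print(detection_metrics(80, 20, 30))
```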

4. Results and Analysis

This paper designs three sets of experiments based on two baseline object detection models (YOLO v8 and Faster RCNN) to demonstrate the accuracy enhancement effect of the proposed method on pavement distress detection. (1) A comparison with the traditional transfer learning fine-tuning method verifies the proposed framework’s ability to enhance cross-scene distress detection accuracy. (2) An ablation study is conducted to assess the individual contributions of the three modules—viewpoint transformation, foreground focus, and adaptive normalization—toward accuracy improvement. (3) The superiority of the proposed framework is further validated through comparisons with current methods, evaluating multiple dimensions such as accuracy improvement, deployment complexity, and computational efficiency. Across all experiments, changes in training loss are used to assess model convergence, while metrics such as F1-score, Precision, Recall, and mAP (introduced in detail in Section 3.2) are employed to evaluate detection accuracy.

4.1. Validity Experiments

To assess the efficacy of the proposed test-time adaptation framework, a set of validation experiments was conducted. The high-quality pavement distress data from CHN_C, collected by ourselves, were compiled into a standard dataset, which serves as the source domain. Images from CRDDC were selected as a more complex comprehensive dataset, which serves as the target domain to which the model must be transferred. The comprehensive dataset was further divided into a comprehensive train dataset and a comprehensive test dataset in an 8:2 ratio.
Initially, the YOLOv8 and FasterRCNN models were trained on the standard dataset and subsequently tested on the comprehensive dataset to evaluate the detection accuracy of the control group. Thereafter, both the unlabeled standard dataset and the comprehensive dataset were employed to train the domain prior knowledge generator. Following the convergence of the domain prior knowledge generator’s training, the adaptive pavement distress detection models, specifically TTA-YOLO and TTA-FasterRCNN, were trained using the standard dataset. These models were then tested on the comprehensive test dataset to evaluate the experimental group’s accuracy. Additionally, transfer learning was implemented with the trained YOLOv8 and Faster R-CNN models using a comprehensive training set. The resulting models, Transfer-YOLO and Transfer-Faster R-CNN, were tested on the comprehensive test dataset to compare the target domain adaptation effect provided by the TTA module. Both the comprehensive training set and the test set included data from the target domain, ensuring consistent data distribution across these datasets. The datasets used for model training and testing are detailed in Table 3. In this way, YOLO/FasterRCNN represents a model trained on the source domain and evaluated for its performance in the target domain; Transfer-YOLO/Transfer-FasterRCNN represents a model trained on both the source and target domains and evaluated for its performance in the target domain. TTA-YOLO/TTA-FasterRCNN represents our proposed framework, which only requires image data in the target domain without labels.
The experimental platform was based on the Ubuntu 22.04 operating system and two servers equipped with NVIDIA GeForce RTX 4090 GPUs. The experiments utilized the PyTorch deep learning framework, with the CUDA and cuDNN libraries employed to accelerate model training. To ensure robust model performance and avoid overfitting or underfitting, several strategies were incorporated throughout the training process. Specifically, the YOLOv8, FasterRCNN, Transfer-YOLO, and Transfer-FasterRCNN models were trained for 100 epochs, while the self-supervised training of the domain prior knowledge generator was run for 200 epochs. The adaptive pavement distress detection models trained with the domain prior knowledge generator were likewise trained for 100 epochs. Model weights were updated using the Adam optimizer with a cosine annealing learning rate decay schedule and an initial learning rate of 0.01.
To mitigate the risk of overfitting, a combination of techniques was applied. First, the model weights were updated using the Adam optimizer, known for its adaptive learning rate capabilities, which helped to stabilize training and prevent both overfitting and underfitting. Second, a cosine annealing learning rate decay algorithm was used to gradually decrease the learning rate over time, promoting smoother convergence and avoiding sharp updates that could lead to overfitting. The initial learning rate was set to 0.01, which was optimized through empirical testing. Additionally, the model performance was closely monitored during training by observing the changes in loss function values, as shown in Figure 8, for all models: YOLO, FasterRCNN, TTA-YOLO, TTA-FasterRCNN, Transfer-YOLO, and Transfer-FasterRCNN.
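A configuration along these lines could be expressed in PyTorch as follows; the placeholder module and the loop body are illustrative, and only the optimizer choice, the initial learning rate of 0.01, the cosine annealing schedule, and the 100-epoch budget come from the text.

```python
import torch

# Placeholder module standing in for one of the detectors (illustrative only)
model = torch.nn.Conv2d(3, 16, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... forward pass, loss.backward(), and optimizer.step() for each batch would go here ...
    optimizer.step()     # placeholder step so the scheduler has an optimizer step to follow
    scheduler.step()     # decay the learning rate along a cosine curve over the 100 epochs
```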
By using this combination of techniques, including careful adjustment of training epochs, learning rate scheduling, and optimization strategies, the models were trained effectively, striking a balance between underfitting and overfitting, ensuring good generalization to new data. Regular validation and testing on separate datasets further ensured the robustness of the models and their ability to adapt to unseen scenarios.
From Figure 8, it is evident that the YOLO, FasterRCNN, TTA-YOLO, TTA-FasterRCNN, Transfer-YOLO, and Transfer-FasterRCNN models exhibit a gradual decrease in their total loss function values. As the number of iterations approaches 100 epochs, the loss function values converge, while simultaneously, the models’ detection results steadily improve. Additionally, comparing the changes in loss functions during the training processes of the YOLO model and the TTA-YOLO model reveals that initially, the TTA-YOLO model incurs a higher loss function on the test data. This is primarily because during the early stages of training, the TTA-YOLO model attempts to adapt to the test data. Due to the significant differences between the test and training datasets, the TTA-YOLO model requires larger adjustments on the test dataset early on, resulting in poorer performance. However, as the training epochs increase, the TTA-YOLO model’s loss function on the test dataset becomes closer to that on the training dataset. This phenomenon is also observed in Transfer-YOLO. During the training process of Transfer-YOLO, the loss function on the test data becomes increasingly closer to that on the training data, indicating that the Transfer-YOLO model is gradually adapting from the training data to the test data. As shown in Figure 8, the loss function values on the test data for both TTA-YOLO and Transfer-YOLO ultimately align with those on the training data, demonstrating that these two models are eventually able to effectively adapt to the distribution of the test data. In contrast, the loss function of the YOLO model on the test data consistently remains discrepant from that on the training data, indicating an inability to adapt well to the test data. A similar trend is observed in the training process of the FasterRCNN, TTA-FasterRCNN, and Transfer-FasterRCNN models.
In order to quantitatively evaluate the test-time adaptation framework for pavement distress detection, this study utilized the pre-trained YOLO, FasterRCNN, TTA-YOLO, TTA-FasterRCNN, Transfer-YOLO and Transfer-FasterRCNN models to assess and calculate detection accuracy on the test dataset, as depicted in Table 4. From the table, it is evident that Transfer-YOLO outperforms the original YOLO object detection results across various metrics in the target domain. This improvement is because transfer learning involves further training with annotated data from the target domain, building upon the source domain. The transfer learning model not only encounters various pavement distress data from the target domain but also learns how to identify and detect these distresses from the target domain annotations. However, obtaining labels for pavement distress in the target domain is not always easy or inexpensive. Annotating pavement distress requires substantial human and material resources, especially since data must be labeled for each different target domain. The framework proposed in this paper effectively addresses this issue.
Compared with the original pavement distress detection models, the test-time adaptation framework proposed in this paper improves every metric reported in Table 4. Particularly noteworthy is the nearly 20% increase in AP for pothole identification. The analysis indicates that, because potholes account for a relatively small share of the data, they are susceptible to disturbances from perspective, background, and other factors, resulting in compromised detection accuracy during transfer. The proposed adaptive pavement distress detection models effectively mitigate this issue, leading to enhanced pothole identification accuracy. Overall, by employing the proposed framework to transfer a pavement distress detection model trained on the source domain to the target domain, Precision improved by about 28%, Recall by about 20%, mAP by about 20%, and F1-score by about 23%. In the unsupervised setting, the proposed framework achieves 93.09% of the performance attained by transfer learning in the supervised setting. These results affirm the efficacy of the proposed framework.
In summary, the test-time adaptive framework for pavement distress detection proposed in this paper achieves performance close to that of transfer learning without the need for target-domain data annotation. Although transfer learning can significantly enhance a model’s performance in the target domain by using target-domain data and labels to teach the model how to detect pavement distress in that domain, it requires extensive data annotation in the target domain. By employing self-supervised learning, the proposed framework enables the original pavement distress detection model to adapt from the source domain to the target domain and achieve better pavement distress detection results.

4.2. Ablation Experiments

To validate the effectiveness of the three modules proposed in this Adaptive Pavement Distress Detection Framework, this study conducted an ablation experiment. Figure 9 illustrates the recognition results of the original YOLO model, YOLO + viewpoint transformation (VT), YOLO + foreground focus (FF), YOLO + adaptive normalization (AN), YOLO + VT + FF, YOLO + VT + AN, YOLO + FF + AN, YOLO + VT + FF + AN, FasterRCNN model, FasterRCNN + VT, FasterRCNN + FF, FasterRCNN + AN, FasterRCNN + VT + FF, FasterRCNN + VT + AN, FasterRCNN + FF + AN, FasterRCNN + VT + FF + AN.
Overall, the individual use of each module can highlight the features of cracks in the image to a certain extent, thereby improving the model’s accuracy in recognizing pavement distress. Furthermore, the combination of these modules enhances the representation features of distress in the images. This indicates that the stacking of these modules contributes to improving the model’s generalization ability in cross-domain recognition, consequently enhancing the model’s detection accuracy.
To quantitatively assess the contributions of each module to cross-domain recognition of pavement distress, this study trained various models on standard datasets and performed predictions on a comprehensive test dataset, as shown in Table 5.
From Table 5, it can be observed that integrating the viewpoint transformation module into the original object detection models helps the models adapt to pavement distress images from different viewpoints in the test dataset, thereby enhancing the models’ detection capabilities. This resulted in an increase in mAP by 8.59%, Recall by 7.06%, and F1-score by 9.64%. Incorporating the foreground focus attention module allows the model to focus more on distinct features of distress while ignoring background interference, leading to improved detection capabilities. This led to an increase in mAP by 5.41%, Recall by 6.56%, and F1-score by 7.42%. The addition of the adaptive normalization module weakened the distribution bias between the train and test datasets to some extent, ensuring better adaptation of the model to the test dataset. This resulted in an increase in mAP by 4.4%, Recall by 6.6%, and F1-score by 7.4%.
Moreover, the combination of these modules further enhanced the model’s detection capabilities. These three modules address different adaptation challenges encountered when the model is transferred to the test dataset, proposing targeted improvements from three aspects, thereby further enhancing the models’ pavement distress detection capabilities. Compared to the original object detection model, the test-time adaptation model showed improvements in Precision by 28.0%, mAP by 20.38%, Recall by 18.23%, and F1-score by 22.92%, demonstrating superior performance in all aspects.
However, it must be acknowledged that, despite the various modules proposed to address generalization issues during model transfer, the performance is still inferior to training and testing directly on the test dataset. This is because, when trained and tested on the test set, the object detection model encounters data during evaluation that it has already seen and fully adapted to, resulting in higher detection accuracy. However, obtaining data and labels from the target domain is not always feasible and requires significant human and material resources for data collection and labeling. Therefore, it is not always practical to retrain and fine-tune the pavement distress detection model on data from each new target domain it is transferred to.
In contrast, the framework proposed in this study requires only unlabeled images from the target domain to adapt the model, which can significantly improve the accuracy of pavement distress detection models. The framework therefore has significant practical value: it avoids resource-intensive retraining on annotated target-domain data while still achieving notable improvements in model accuracy.

4.3. Comparative Experiment

To validate the superiority of the model proposed in this study, comparative experiments were conducted with methods such as CycleGAN [31], EasyTL [32], DDTCDR [33], and GraftNet [34]. Figure 10 presents the visual results of the comparative experiments. When using the CycleGAN algorithm, images from the test dataset are stylized to resemble the style of the training data, indirectly adapting the model to the distribution of the test data. However, this approach struggles to address differences in viewpoint between the test and training data. The EasyTL method, on the other hand, aims to align the distribution of test data with that of the training data, helping the model better adapt to the test domain. DDTCDR (Domain Distribution Transfer for Cross-Domain Recognition) improves the model’s generalization ability by considering the proximity of the distributions between the training and test datasets, effectively bridging the gap between them. GraftNet enhances model performance on test data by pre-training to extract common features shared between the training and test data. It then customizes a divergent branch based on the unique distribution characteristics of the test data.
However, most of the above methods mainly focus on the distribution differences between the test and train datasets, without adequately addressing differences in viewpoint and background. Consequently, their ability to improve the recognition capability of pavement distress detection models is limited. In contrast, the method proposed in this study comprehensively considers differences in three aspects and proposes targeted solutions. Its detection results are more accurate, further demonstrating the superiority of the proposed method.
This study quantitatively analyzed the impact of various methods on the recognition accuracy of pavement distress detection, as shown in Table 6. From the table, it is evident that the method proposed in this study outperforms other methods in various indicators of pavement distress detection. In particular, the DDTCDR and GraftNet methods exhibit detection accuracy even lower than that of the original YOLO and FasterRCNN models. This is primarily because these methods, when transferring the pavement distress detection model to the target domain, fail to effectively promote the model’s adaptation to the target domain’s distribution. Instead, they interfere with the model’s recognition of target domain images, leading to a decrease in accuracy.
The runtime of each method was also measured. The TTA-YOLO model takes 0.023 s to process an image, while TTA-FasterRCNN takes 0.43 s, the longest among all models; this is mainly attributable to the inherently slower inference of the FasterRCNN model itself. At 0.023 s per frame, TTA-YOLO remains real-time and meets practical engineering requirements. Moreover, in terms of accuracy, the proposed framework far exceeds the other methods, further demonstrating its superiority.
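The per-image times reported here can be reproduced with a simple wall-clock measurement of the forward pass, as sketched below. This is a generic recipe with `model` and `sample_image` as placeholders; CUDA synchronization is needed so asynchronous GPU kernels are included in the timing.

```python
# Sketch of per-image latency measurement: average wall-clock time over
# repeated forward passes, with warm-up iterations excluded from the timing.
import time
import torch

@torch.no_grad()
def mean_latency_seconds(model, image: torch.Tensor, warmup: int = 10, runs: int = 100) -> float:
    model.eval()
    for _ in range(warmup):                 # warm-up passes are not timed
        model(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()            # finish pending GPU work before timing

    start = time.perf_counter()
    for _ in range(runs):
        model(image)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

# Example: latency = mean_latency_seconds(tta_yolo, sample_image)  # ~0.023 s/frame reported
```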
On the other hand, TTA models do not require additional annotated data from the target domain and can achieve performance close to models obtained through transfer learning. As shown in Table 6, transfer learning models can achieve the best detection performance by using annotated data from the target domain for training. This is mainly because the pavement distress detection model trained in the source domain has already learned some features of pavement distress. When transfer learning is applied in the target domain, the model further learns the specific features of pavement distress in the target domain and adapts to its characteristics, such as distress background, image lighting intensity, and shooting angles. Finally, Transfer-YOLO achieves a Precision of 75.34%, and Transfer-FasterRCNN achieves a Precision of 76.26%.
However, obtaining a large amount of annotated pavement distress data in the target domain is not always easy or cost-effective. Typically, annotating a pavement distress dataset containing 10,000 images requires approximately 10 days of continuous annotation by one operator. In contrast, using the proposed test-time adaptive framework requires only the collection of images from the target domain without the need for time-consuming and labor-intensive annotation work. This framework enables the transfer of pavement distress detection models from the source domain to the target domain with a comprehensive detection precision close to that obtained through transfer learning, and significantly higher than other domain transfer methods.
Overall, the proposed framework bridges the gap between the training and test domains, enabling the transfer of various pavement distress detection models to the test domain at minimal cost. It allows any trained pavement distress detection model to perform well under various conditions, such as different countries and regions, different pavement colors and types, different collection devices, different shooting angles, and even different lighting conditions, among others.

5. Conclusions

This study proposes a TTA-based framework to enhance cross-scene pavement distress detection. The framework leverages a domain prior knowledge generator to facilitate unsupervised feature transfer from the source to target domains. To address challenges like background variations, viewpoint discrepancies, and distribution shifts, the framework incorporates three key modules: viewpoint conversion, foreground focus, and adaptive normalization. By enabling real-time adaptation to new data without requiring additional labeled data, this framework significantly improves the model’s generalization ability and robustness in diverse scenarios.
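As a concrete illustration of the viewpoint conversion idea summarized above, the sketch below warps an oblique-view image toward a rectified, top-down-like reference view with a single homography. The corner correspondences are assumed values chosen for illustration; in the proposed framework the transformation would instead be driven by the generated domain prior knowledge.

```python
# Minimal sketch of homography-based perspective correction, the kind of
# operation underlying viewpoint conversion. Corner points are illustrative.
import cv2
import numpy as np

def warp_to_reference_view(image: np.ndarray,
                           src_corners: np.ndarray,
                           out_size: tuple = (640, 640)) -> np.ndarray:
    """Map a quadrilateral pavement region (src_corners: 4x2 float32, ordered
    top-left, top-right, bottom-right, bottom-left) onto a rectangular view."""
    w, h = out_size
    dst_corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    H = cv2.getPerspectiveTransform(np.float32(src_corners), dst_corners)
    return cv2.warpPerspective(image, H, (w, h))

# Example with assumed corner points of the road region in an oblique image:
# img = cv2.imread("oblique_view.jpg")
# corners = np.float32([[420, 300], [1200, 300], [1600, 1200], [30, 1200]])
# rectified = warp_to_reference_view(img, corners)
```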
With the framework proposed in this paper, the detection accuracy of the two baseline models improved by 20.38% and 21.08% in terms of mAP, respectively, and the framework maintained strong performance, particularly in detecting small-sized potholes. Importantly, these improvements were achieved without any annotation of the new scenes. The ablation results further show that the three modules (viewpoint transformation, foreground focus, and adaptive normalization) each contributed significantly, increasing mAP by 8.59%, 5.41%, and 4.4%, respectively. Viewpoint transformation provided the largest improvement, indicating that viewpoint differences are a key challenge in cross-scene applications. Compared with the supervised transfer learning method, the unsupervised TTA approach achieves comprehensive detection accuracy only about 6% lower while eliminating the need for 9977 high-quality annotated images.
Future research could incorporate additional distress types and more challenging datasets to further test and enhance the robustness of the model. Exploring the model's applicability beyond pavement distress detection, such as other types of infrastructure condition assessment, could also broaden its utility and impact. The effect of image resolution on pavement damage detection accuracy will likewise be examined. Finally, exploring more effective ways to integrate detection models from the source and target domains, rather than relying solely on data collection, represents a promising direction for future work.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L. and D.W.; validation, M.D. and J.Y.; data curation, Y.H., M.D. and J.Y.; writing—original draft preparation, Y.L., Y.H. and L.L.; writing—review and editing, Y.L.; visualization, L.L.; supervision, Y.H.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Young Scientists Fund of the National Natural Science Foundation of China (Grant No. 52402387) and the China Postdoctoral Science Foundation (Certificate Number: 2024M752425).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

Authors Mengyun Du and Jiang Yu were employed by the company Chongqing Traffic Engineering Quality Inspection Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Li, Y.; Liu, C.; Shen, Y.; Cao, J.; Yu, S.; Du, Y. RoadID: A Dedicated Deep Convolutional Neural Network for Multipavement Distress Detection. J. Transp. Eng. Part B Pavements 2021, 147, 04021057. [Google Scholar] [CrossRef]
  2. Maeda, H.; Sekimoto, Y.; Seto, T.; Kashiyama, T.; Omata, H. Road Damage Detection and Classification Using Deep Neural Networks with Smartphone Images: Road Damage Detection and Classification. Comput.-Aided Civ. Infrastruct. Eng. 2018, 33, 1127–1141. [Google Scholar] [CrossRef]
  3. Coenen, T.B.J.; Golroo, A. A Review on Automated Pavement Distress Detection Methods. Cogent Eng. 2017, 4, 1374822. [Google Scholar] [CrossRef]
  4. Cha, Y.-J.; Choi, W.; Büyüköztürk, O. Deep Learning-Based Crack Damage Detection Using Convolutional Neural Networks: Deep Learning-Based Crack Damage Detection Using CNNs. Comput.-Aided Civ. Infrastruct. Eng. 2017, 32, 361–378. [Google Scholar] [CrossRef]
  5. Zakeri, H.; Nejad, F.M.; Fahimifar, A. Image Based Techniques for Crack Detection, Classification and Quantification in Asphalt Pavement: A Review. Arch. Comput. Methods Eng. 2017, 24, 935–977. [Google Scholar] [CrossRef]
  6. Bai, J.; Li, S.; Zhang, H.; Huang, L.; Wang, P. Robust Target Detection and Tracking Algorithm Based on Roadside Radar and Camera. Sensors 2021, 21, 1116. [Google Scholar] [CrossRef] [PubMed]
  7. Zhu, J.; Zhong, J.; Ma, T.; Huang, X.; Zhang, W.; Zhou, Y. Pavement Distress Detection Using Convolutional Neural Networks with Images Captured via UAV. Autom. Constr. 2022, 133, 103991. [Google Scholar] [CrossRef]
  8. Gagliardi, V.; Giammorcaro, B.; Bella, F.; Sansonetti, G. Deep Neural Networks for Asphalt Pavement Distress Detection and Condition Assessment. In Proceedings of the Earth Resources and Environmental Remote Sensing/GIS Applications XIV, Amsterdam, The Netherlands, 3–7 September 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12734, pp. 251–262. [Google Scholar]
  9. Li, Y.; Che, P.; Liu, C.; Wu, D.; Du, Y. Cross-Scene Pavement Distress Detection by a Novel Transfer Learning Framework. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 1398–1415. [Google Scholar] [CrossRef]
  10. Lin, C.; Tian, D.; Duan, X.; Zhou, J.; Zhao, D.; Cao, D. DA-RDD: Toward Domain Adaptive Road Damage Detection Across Different Countries. IEEE Trans. Intell. Transp. Syst. 2023, 24, 3091–3103. [Google Scholar] [CrossRef]
  11. Ranjbar, S.; Nejad, F.M.; Zakeri, H. An Image-Based System for Pavement Crack Evaluation Using Transfer Learning and Wavelet Transform. Int. J. Pavement Res. Technol. 2021, 14, 437–449. [Google Scholar] [CrossRef]
  12. Lei, X.; Liu, C.; Li, L.; Wang, G. Automated Pavement Distress Detection and Deterioration Analysis Using Street View Map. IEEE Access 2020, 8, 76163–76172. [Google Scholar] [CrossRef]
  13. Liu, Y.; Liu, F.; Liu, W.; Huang, Y. Pavement Distress Detection Using Street View Images Captured via Action Camera. IEEE Trans. Intell. Transp. Syst. 2024, 25, 738–747. [Google Scholar] [CrossRef]
  14. Arya, D.; Maeda, H.; Ghosh, S.K.; Toshniwal, D.; Mraz, A.; Kashiyama, T.; Sekimoto, Y. Transfer Learning-Based Road Damage Detection for Multiple Countries. arXiv 2020, arXiv:2008.13101. [Google Scholar]
  15. Peraka, N.S.P.; Biligiri, K.P.; Kalidindi, S.N. Development of a Multi-Distress Detection System for Asphalt Pavements: Transfer Learning-Based Approach. Transp. Res. Rec. 2021, 2675, 538–553. [Google Scholar] [CrossRef]
  16. Chen, Y.; Li, W.; Sakaridis, C.; Dai, D.; Van Gool, L. Domain Adaptive Faster R-Cnn for Object Detection in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3339–3348. [Google Scholar]
  17. Zhang, Y.; Zhang, L. A Generative Adversarial Network Approach for Removing Motion Blur in the Automatic Detection of Pavement Cracks. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 3412–3434. [Google Scholar] [CrossRef]
  18. Chen, T.; Ren, J. Integrating GAN and Texture Synthesis for Enhanced Road Damage Detection. IEEE Trans. Intell. Transp. Syst. 2024, 25, 12361–12371. [Google Scholar] [CrossRef]
  19. Ren, R.; Shi, P.; Jia, P.; Xu, X. A Semi-Supervised Learning Approach for Pixel-Level Pavement Anomaly Detection. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10099–10107. [Google Scholar]
  20. Maeda, H.; Kashiyama, T.; Sekimoto, Y.; Seto, T.; Omata, H. Generative Adversarial Network for Road Damage Detection. Comput.-Aided Civ. Infrastruct. Eng. 2021, 36, 47–60. [Google Scholar] [CrossRef]
  21. Salaudeen, H.; Çelebi, E. Pothole Detection Using Image Enhancement GAN and Object Detection Network. Electronics 2022, 11, 1882. [Google Scholar] [CrossRef]
  22. Fan, R.; Wang, H.; Bocus, M.J.; Liu, M. We Learn Better Road Pothole Detection: From Attention Aggregation to Adversarial Domain Adaptation. In Computer Vision–ECCV 2020 Workshops; Bartoli, A., Fusiello, A., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 12538, pp. 285–300. ISBN 978-3-030-66822-8. [Google Scholar]
  23. Liu, Z.; Pan, S.; Gao, Z.; Chen, N.; Li, F.; Wang, L.; Hou, Y. Automatic Intelligent Recognition of Pavement Distresses with Limited Dataset Using Generative Adversarial Networks. Autom. Constr. 2023, 146, 104674. [Google Scholar] [CrossRef]
  24. Li, Z.; Shi, S.; Schiele, B.; Dai, D. Test-Time Domain Adaptation for Monocular Depth Estimation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 4873–4879. [Google Scholar]
  25. Segu, M.; Schiele, B.; Yu, F. Darth: Holistic Test-Time Adaptation for Multiple Object Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 9717–9727. [Google Scholar]
  26. Ahmed, S.M.; Niloy, F.F.; Raychaudhuri, D.S.; Oymak, S.; Roy-Chowdhury, A.K. MeTA: Multi-Source Test Time Adaptation. arXiv 2024, arXiv:2401.02561. [Google Scholar]
  27. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A Survey on Contrastive Self-Supervised Learning. Technologies 2021, 9, 2. [Google Scholar] [CrossRef]
  28. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  29. Dosovitskiy, A. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  30. Arya, D.; Maeda, H.; Sekimoto, Y.; Omata, H.; Ghosh, S.K.; Toshniwal, D.; Sharma, M.; Pham, V.V.; Zhong, J.; Al-Hammadi, M. RDD2022-The Multi-National Road Damage Dataset Released through CRDDC’2022. figshare 2022, 10, m9. [Google Scholar]
  31. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  32. Wang, J.; Chen, Y.; Yu, H.; Huang, M.; Yang, Q. Easy Transfer Learning by Exploiting Intra-Domain Structures. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1210–1215. [Google Scholar]
  33. Li, P.; Tuzhilin, A. DDTCDR: Deep Dual Transfer Cross Domain Recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining, Houston, TX, USA, 3–7 February 2020; ACM: Houston, TX, USA, 2020; pp. 331–339. [Google Scholar]
  34. Liu, B.; Yu, H.; Qi, G. Graftnet: Towards Domain Generalized Stereo Matching with a Broad-Spectrum and Task-Oriented Feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13012–13021. [Google Scholar]
Figure 1. Pavement distress images under different scenes. (a) Vehicle-mounted cameras. (b) Dashcams. (c) Smartphones. (d) Roadside monitoring. (e) Drones. (f) Motorbikes.
Figure 2. Transfer learning versus test-time adaptation.
Figure 3. The framework of the proposed test-time adaptive pavement distress detection model.
Figure 4. The network structure of domain prior knowledge generator.
Figure 5. The illustration of the viewpoint transformation module.
Figure 6. Illustration of the foreground focus module based on attention.
Figure 7. Pavement sample images of each dataset. (a) CHN_C. (b) NO_C. (c) JPN_S. (d) US_G. (e) CHN_M. (f) CHN_D.
Figure 8. Change in loss function value during the training process.
Figure 9. Detection result of ablation experiments.
Figure 10. Detection result of comparative experiments.
Table 1. Introduction to pavement image datasets.

| Type | Country | Image Source | Code | Number of Labeled Images | Image Resolution | Viewpoint | Road Type |
|---|---|---|---|---|---|---|---|
| Source domain / Model training | China | Vehicles with HD camera | CHN_C | 65,748 | 1628 × 1236 | Oblique view | Urban road |
| Target domain / Cross-scene transfer performance evaluation | Norway | Vehicles with HD camera | NO_C | 8161 | 3650 × 2044 | Extra-wide view | Expressways and county roads |
| | Japan | Vehicles with smartphone | JPN_S | 10,506 | 600 × 600 | Wide view | Urban road and country roads |
| | United States | Google Street View images | US_G | 4805 | 640 × 640 | Wide view | Urban road and highway |
| | China | Motorbikes with camera | CHN_M | 1977 | 512 × 512 | Oblique view | Urban road |
| | China | Drones with camera | CHN_D | 2401 | 512 × 512 | Top-down view | Urban road |
Table 2. Sample statistics selected from each dataset.

| Dataset | Number of Longitudinal Cracks | Number of Transverse Cracks | Number of Alligator Cracks | Number of Potholes | Number of Images Selected |
|---|---|---|---|---|---|
| CHN_C | 10,463 | 8782 | 6472 | 3247 | 10,000 |
| NO_C | 2147 | 849 | 289 | 271 | 2000 |
| JPN_S | 1275 | 1142 | 1052 | 863 | 2000 |
| US_G | 2753 | 1475 | 357 | 58 | 2000 |
| CHN_M | 2678 | 1096 | 641 | 235 | 1977 |
| CHN_D | 1204 | 943 | 251 | 70 | 2000 |
Table 3. Dataset for model training and testing.

| | YOLO / FasterRCNN | TTA-YOLO / TTA-FasterRCNN | Transfer-YOLO / Transfer-FasterRCNN |
|---|---|---|---|
| Train Dataset | Labeled Standard Dataset | Labeled Standard Dataset + Unlabeled Comprehensive Dataset | Labeled Standard Dataset + Labeled Comprehensive Dataset |
| Test Dataset | Comprehensive Test Dataset | Comprehensive Test Dataset | Comprehensive Test Dataset |
Table 4. Results of validity experiments.

| Distress Type | YOLO Precision/% | YOLO Recall/% | YOLO AP/% | YOLO F1/% | FasterRCNN Precision/% | FasterRCNN Recall/% | FasterRCNN AP/% | FasterRCNN F1/% |
|---|---|---|---|---|---|---|---|---|
| Longitudinal Crack | 35.14 | 38.26 | 30.25 | 36.63 | 40.18 | 45.46 | 40.27 | 42.66 |
| Transverse Crack | 32.23 | 37.18 | 30.85 | 34.53 | 42.34 | 45.24 | 40.64 | 45.02 |
| Alligator Crack | 70.28 | 77.23 | 72.43 | 73.59 | 55.78 | 60.28 | 55.98 | 59.27 |
| Pothole | 30.91 | 24.29 | 25.91 | 27.20 | 39.24 | 33.23 | 35.17 | 30.36 |
| Average | 42.14 | 44.24 | 39.86 | 43.16 | 44.39 | 46.05 | 43.02 | 44.51 |

| Distress Type | TTA-YOLO Precision/% | TTA-YOLO Recall/% | TTA-YOLO AP/% | TTA-YOLO F1/% | TTA-FasterRCNN Precision/% | TTA-FasterRCNN Recall/% | TTA-FasterRCNN AP/% | TTA-FasterRCNN F1/% |
|---|---|---|---|---|---|---|---|---|
| Longitudinal Crack | 66.24 | 59.28 | 57.32 | 62.57 | 67.13 | 58.37 | 64.39 | 62.44 |
| Transverse Crack | 60.18 | 58.46 | 57.80 | 59.31 | 63.74 | 62.31 | 60.42 | 63.02 |
| Alligator Crack | 83.12 | 80.54 | 80.88 | 81.81 | 80.26 | 82.54 | 77.35 | 81.38 |
| Pothole | 71.02 | 51.60 | 44.96 | 59.77 | 73.27 | 65.46 | 54.23 | 69.15 |
| Average | 70.14 | 62.47 | 60.24 | 66.08 | 71.10 | 67.17 | 64.10 | 69.08 |

| Distress Type | Transfer-YOLO Precision/% | Transfer-YOLO Recall/% | Transfer-YOLO AP/% | Transfer-YOLO F1/% | Transfer-FasterRCNN Precision/% | Transfer-FasterRCNN Recall/% | Transfer-FasterRCNN AP/% | Transfer-FasterRCNN F1/% |
|---|---|---|---|---|---|---|---|---|
| Longitudinal Crack | 70.48 | 63.18 | 60.52 | 66.63 | 71.23 | 59.85 | 65.41 | 65.05 |
| Transverse Crack | 67.36 | 65.24 | 62.84 | 66.28 | 68.52 | 65.42 | 68.50 | 66.93 |
| Alligator Crack | 85.38 | 82.62 | 86.31 | 83.98 | 85.94 | 84.37 | 88.93 | 85.15 |
| Pothole | 78.46 | 54.08 | 45.81 | 64.03 | 79.35 | 75.28 | 51.44 | 77.26 |
| Average | 75.34 | 68.89 | 70.35 | 71.97 | 76.26 | 71.23 | 68.57 | 73.66 |
Table 5. The evaluation results of ablation experiments.

| Model | Precision/% | Recall/% | mAP/% | F1/% |
|---|---|---|---|---|
| YOLO | 42.14 | 44.24 | 39.86 | 43.16 |
| YOLO + VT | 54.38 | 51.30 | 48.45 | 52.80 |
| YOLO + FF | 50.36 | 50.80 | 45.27 | 50.58 |
| YOLO + AN | 50.29 | 50.84 | 44.26 | 50.56 |
| YOLO + VT + FF | 65.82 | 57.02 | 55.35 | 61.10 |
| YOLO + VT + AN | 66.26 | 53.78 | 49.34 | 59.37 |
| YOLO + FF + AN | 62.27 | 54.85 | 48.30 | 58.32 |
| YOLO + VT + FF + AN | 70.14 | 62.47 | 60.24 | 66.08 |
| Transfer-YOLO | 75.34 | 68.89 | 70.35 | 71.97 |
| FasterRCNN | 44.39 | 46.05 | 43.02 | 45.20 |
| FasterRCNN + VT | 54.61 | 57.47 | 52.27 | 56.00 |
| FasterRCNN + FF | 49.93 | 52.85 | 47.53 | 51.35 |
| FasterRCNN + AN | 50.34 | 51.37 | 47.02 | 50.85 |
| FasterRCNN + VT + FF | 65.86 | 63.29 | 58.68 | 64.55 |
| FasterRCNN + VT + AN | 66.37 | 64.63 | 55.26 | 65.49 |
| FasterRCNN + FF + AN | 63.65 | 60.26 | 54.18 | 61.91 |
| FasterRCNN + VT + FF + AN | 71.10 | 67.17 | 64.10 | 69.08 |
| Transfer-FasterRCNN | 76.26 | 71.23 | 68.57 | 73.66 |
Table 6. The evaluation results of comparative experiments.

| Model | Precision/% | Recall/% | mAP/% | F1/% | Time/s |
|---|---|---|---|---|---|
| YOLO | 42.14 | 44.24 | 39.86 | 43.16 | 0.0003 |
| FasterRCNN | 44.39 | 46.05 | 43.02 | 45.20 | 0.2500 |
| CycleGAN | 52.16 | 53.34 | 52.17 | 52.74 | 0.0150 |
| EasyTL | 50.56 | 50.19 | 48.27 | 50.37 | 0.0082 |
| DDTCDR | 38.27 | 36.34 | 34.29 | 37.28 | 0.0075 |
| GraftNet | 33.95 | 33.64 | 34.25 | 33.79 | 0.0086 |
| TTA-YOLO | 70.14 | 62.47 | 60.24 | 66.08 | 0.0230 |
| TTA-FasterRCNN | 71.10 | 67.17 | 64.10 | 69.08 | 0.4300 |
| Transfer-YOLO | 75.34 | 68.89 | 70.35 | 71.97 | 0.0003 |
| Transfer-FasterRCNN | 76.26 | 71.23 | 68.57 | 73.66 | 0.2500 |
