An Investigation on Prediction of Infrastructure Asset Defect with CNN and ViT Algorithms

Lethanh, Nam; Trinh, Tu Anh; Hossain, Mir Tahmid

doi:10.3390/infrastructures10050125


by Nam Lethanh ¹,*, Tu Anh Trinh ¹ and Mir Tahmid Hossain ²

¹ Institute of Smart Cities & Management, University of Economics, Ho Chi Minh City 722700, Vietnam

² Department of Computer Science & Engineering, East West University, Dhaka 1212, Bangladesh

* Author to whom correspondence should be addressed.

Infrastructures 2025, 10(5), 125; https://doi.org/10.3390/infrastructures10050125

Submission received: 17 March 2025 / Revised: 28 April 2025 / Accepted: 6 May 2025 / Published: 20 May 2025

(This article belongs to the Section Infrastructures Inspection and Maintenance)


Abstract

Convolutional Neural Networks (CNNs) are among the most powerful methods for image recognition and are applied in many fields, including civil and structural health monitoring in infrastructure asset management. Current State-of-the-Art CNN models are available as open-source software on several Artificial Intelligence (AI) platforms, with TensorFlow being widely used. Besides CNN models, Vision Transformers (ViTs) have recently emerged as a competitive alternative. Several demonstrations have indicated that ViT models, in many instances, outperform the current CNNs by almost four times in terms of computational efficiency and accuracy. This paper presents an investigation into defect detection for civil and structural components using CNN and ViT models available on TensorFlow. An empirical study was conducted using a database of cracks. Crack condition is categorized into two states: “with crack” and “without crack”. The results confirm that the accuracies of both CNN and ViT models exceed 95% after 100 epochs of training, with no significant difference observed between them for binary classification. Notably, the cost of this AI-based approach, with images taken by lightweight and low-cost drones, is considerably lower than that of high-speed inspection cars, while still delivering the expected level of predictive accuracy.

Keywords:

condition assessment; infrastructure asset management; machine learning; TensorFlow; CNN; ViT

1. Introduction

Monitoring and predicting defects in civil and infrastructure systems plays a crucial role in asset management. The current version of ISO 55000 (Asset Management) and the guidelines on condition assessment issued by the Asset Management Institute (AIM) highlight that, alongside conventional visual inspection and detection methods, Artificial Intelligence (AI) with modern machine-learning algorithms should be applied where possible to reduce the cost of monitoring activities and to increase the level of accuracy.

As components of infrastructure systems deteriorate over time, the severity of deterioration should be measured and captured. The severity is often expressed as a condition state, using either a binary state or multiple condition states, depending on the type of component [1,2]. By employing smart technologies for data collection and analysis, asset managers can detect deterioration early, enabling prompt maintenance and preventing further damage. This proactive strategy prolongs the lifespan of the city’s infrastructure, lowers overall maintenance expenses, and ensures a smooth and safe transportation network for all users [3,4].

Over the last few decades, within the context of transportation and road and bridge management, the monitoring of defects has relied heavily on manual visual inspection or high-speed inspection cars [3,5]. While these methods provided some level of information, they were both time-consuming and expensive. This inefficiency led to challenges in pinpointing the emergence and severity of defects, hindering timely interventions. As a result, repairs often occurred when problems became severe, leading to higher overall costs and disruptions to traffic flow [6].

Recently, the field of defect monitoring has been revolutionized by the application of Artificial Intelligence (AI) and machine-learning techniques. TensorFlow (https://www.tensorflow.org/, accessed on 1 December 2024), a popular open-source software library, is being used to develop powerful Neural Network algorithms. These algorithms are trained on vast datasets of images containing various defect types, such as cracks. By employing statistical modeling techniques, the AI learns to identify and classify these defects with high accuracy. This paves the way for automated inspections, significantly reducing the need for manual labor and high-speed inspection cars.

In developing countries, a shortage of investment and, sometimes, a lack of proper asset management make defects a significant challenge. Because of budget limitations, defects are left untreated and deteriorate quickly, creating a vicious cycle. Reports published by the World Bank [7,8] and the Asian Development Bank [9] state that 30–50% of paved roads in low- and middle-income countries are classified as being in poor condition. Maintenance backlogs are due to chronic underinvestment: many developing nations spend less than 1% of GDP on road maintenance, whereas the recommended value in developed nations is between 2 and 3%. These reports also emphasize that delays in maintenance increase future repair costs by 4–5 times compared to yearly interventions. For example, in pavement management, cracks, potholes, and uneven surfaces emerge rapidly on roads subjected to heavy traffic loads. Unfortunately, addressing these issues promptly is often hindered by limited funds for repairs and maintenance [7,8,9]. This quick deterioration of pavements not only creates a bumpy and unsafe ride for commuters but also shortens the lifespan of roads, ultimately leading to even greater costs down the line. Finding innovative solutions for cost-effective pavement maintenance and exploring alternative funding mechanisms are crucial steps for developing countries to overcome this hurdle.

Convolutional Neural Networks (CNNs) have proven to be among the most powerful methods for image recognition, with applications in various fields, including civil and structural health monitoring within infrastructure asset management [10,11]. State-of-the-Art (SOTA) CNN models are now available as open-source models and can be accessed on several Artificial Intelligence (AI) platforms, with TensorFlow being particularly popular [12]. In addition to CNN models, Vision Transformers (ViTs) have recently emerged as a strong alternative. Numerous studies have shown that ViT models often outperform current SOTA CNNs by nearly four times in both computational efficiency and accuracy [13,14].

This study explores defect detection in civil and structural components using both a CNN model and a ViT model accessible via TensorFlow. An empirical analysis utilized a database of cracks, categorizing them based on severity (e.g., width and pattern) to create distinct subsets for accuracy testing. Results confirmed that when cracks are classified in binary terms (crack or no crack), CNN and ViT models show similar accuracies. However, with multiple classification states, the CNN model exhibits greater sensitivity, but significantly lower accuracy compared to the ViT model.

2. Literature Review

Over the last few decades, within the context of transportation, road, and bridge management, the monitoring of defects has heavily relied on manual visual inspections and high-speed inspection vehicles (HSVs). Although these methods have provided valuable information, they exhibit significant limitations in terms of time, cost, and accuracy. Manual inspections are notably labor-intensive and subjective, often missing between 20 and 30% of surface defects depending on inspector experience and environmental conditions [15]. Furthermore, rating inconsistencies between inspectors can vary by as much as 25%, undermining the reliability of condition assessments [3].

While HSVs offer improvements by inspecting approximately 100–150 km per day compared to only 5–10 km/day with manual inspections [5], they come with substantial costs, averaging between USD 1000 and 2000 per lane-mile [16]. Moreover, HSVs primarily detect macro-surface defects and are often unable to capture micro-cracks below 1 mm in width, which are critical indicators of early-stage deterioration [3]. Their operation also heavily depends on sophisticated technology typically manufactured in developed countries, making maintenance and calibration challenging for infrastructure agencies in developing nations [8]. This inefficiency in conventional methods leads to delays in defect identification and intervention, often resulting in repairs being carried out only after significant deterioration has occurred, causing higher overall costs and major disruptions to traffic flow.

Traditional pavement defect detection methods pose significant challenges for developing countries, primarily due to substantial costs [7,8,9]. High-tech inspection systems are often exclusive to affluent nations with multi-million-dollar road maintenance budgets. Developing countries struggle to afford these technologies, causing delays in defect identification and remediation. This results in accelerated deterioration, increased maintenance costs, and unsafe road conditions. A cost-effective, efficient alternative is crucial for developing nations to address this challenge and extend their road networks’ lifespan.

Machine learning has significantly transformed the field of engineering, particularly in structural health monitoring and infrastructure asset management [11,12]. Traditional condition assessment methods, such as manual visual inspections and expensive dense sensor networks, were time-consuming, costly, and often limited in spatial or temporal resolution [3,5,16]. Today, machine-learning algorithms can efficiently analyze vast datasets collected from cost-effective sources like wireless sensor networks, UAV-mounted cameras, and IoT devices. Applications are widespread: vibration-based anomaly detection models, using supervised and unsupervised learning, help identify early-stage damage in bridges and tall structures [17,18]; deep-learning methods such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) are extensively used for crack detection, corrosion assessment, and pavement distress classification [19,20]; and predictive maintenance systems powered by Recurrent Neural Networks (RNNs) and long short-term memory (LSTM) architectures are employed to forecast the remaining useful life of infrastructure assets, enabling proactive maintenance scheduling [21]. These technologies enable early defect detection, optimize maintenance interventions, reduce lifecycle costs, and improve the overall safety and resilience of civil infrastructure systems.

This enables early detection of subtle changes, facilitating preventative maintenance and extending infrastructure lifespan. Machine learning also predicts future issues based on historical data and trends, allowing proactive asset management and resource allocation. This shift from reactive to predictive maintenance optimizes costs, ensures safety and functionality, and transforms engineering practices [12].

The emergence of AI, particularly machine learning, is revolutionizing pavement defect detection. Trained on extensive datasets of pavement images, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) accurately identify and classify defects such as cracks, potholes, and roughness. This technology offers numerous benefits, including automation, objectivity, and scalability. AI-powered systems automate inspections, reducing manual labor while providing consistent and objective assessments, thus minimizing human error. Additionally, they efficiently analyze large pavement areas captured by drone imagery, making them an ideal solution for efficient and accurate defect detection [5,6,20].

In recent years, defect detection research has been dominated by the use of Convolutional Neural Networks (CNNs). However, a key limitation of this research has been its focus on proprietary, non-open-source algorithms. While CNNs have achieved impressive results in identifying cracks, the emphasis has primarily been on this specific defect type [4,19,20]. This narrow focus leaves other critical distresses, such as potholes, rutting, and edge raveling, under-represented in the research landscape. To achieve a more comprehensive approach to pavement health assessment, further research needs to explore the potential of CNNs, along with open-source alternatives, for the detection of a wider range of defects. Aside from CNN algorithms, the Vision Transformer (ViT) model is another powerful and prominent alternative that has emerged recently and has been found to be competitive with CNN models.

The surge in open-source Artificial Intelligence (AI) algorithms, particularly those built on the user-friendly TensorFlow framework, presents a game-changer for pavement defect prediction. These algorithms offer several advantages over traditional, non-open-source methods. Firstly, their open-source nature fosters collaboration and innovation among researchers, accelerating the development and refinement of defect detection models. Secondly, the affordability of open-source tools makes AI-powered infrastructure management more accessible, especially for developing countries with limited budgets.

The following two subsections provide a brief overview of the concepts and modeling approaches of CNN and ViT models. A detailed description of these models is beyond the scope of this paper, as the focus is on their applications using pre-trained CNN and ViT models available through open-source platforms like TensorFlow. Readers interested in exploring the architectural structures and inner workings of these models are encouraged to consult the extensive literature and TensorFlow documentation.

2.1. Brief on CNN Model

Convolutional Neural Networks (CNNs) are a class of deep-learning models specifically designed for processing and analyzing visual data, such as images and videos [22]. They are structured to automatically and adaptively learn spatial hierarchies of features from the input data through the application of convolutional layers [10]. CNNs typically consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers.

The core concept behind CNNs is to extract meaningful features from input images through convolutions with learnable filters or kernels [11]. These filters capture various aspects of the image, such as edges, textures, and more complex patterns, in a hierarchical manner. Pooling layers, often using max or average pooling techniques, reduce the spatial dimensions of the feature maps, retaining important information while minimizing computational complexity [23]. Finally, fully connected layers aggregate these features and classify the input into different categories based on learned representations.
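The convolution and pooling operations described above can be illustrated with a minimal, dependency-free sketch. It is only an illustration under stated assumptions: the 6 × 6 toy image, the hand-written vertical-edge kernel, and the helper names `conv2d_valid` and `max_pool` are hypothetical and are not taken from the study, where the filters are learned during training.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide a square kernel over a square image (lists of lists), no padding."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    acc += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(max(acc, 0.0))  # ReLU activation
        out.append(row)
    return out


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    n = len(feature_map) // size
    return [
        [
            max(
                feature_map[i * size + a][j * size + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy 6 x 6 "image": dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hand-written vertical-edge filter (hypothetical; a CNN would learn its filters).
kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

features = conv2d_valid(image, kernel)  # 4 x 4 feature map, strongest at the edge
pooled = max_pool(features)             # 2 x 2 map after pooling
print(features[0])  # [0.0, 3.0, 3.0, 0.0]
print(pooled)       # [[3.0, 3.0], [3.0, 3.0]]
```

The feature map responds only where the dark-to-bright transition falls inside the kernel window, and pooling halves each spatial dimension while keeping that response.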

CNNs have achieved remarkable success in numerous computer vision tasks, including image classification, object detection, and image segmentation [12]. Their ability to automatically learn hierarchical representations directly from raw pixel data has significantly advanced the State of the Art in visual recognition systems.

2.2. Brief on ViT Model

Vision Transformers (ViTs) represent a recent breakthrough in computer vision, challenging the traditional dominance of Convolutional Neural Networks (CNNs). Unlike CNNs, which process images using sequential convolutional operations, ViT approaches image understanding through a transformer architecture originally developed for natural language processing tasks [13]. ViT models split an image into fixed-size patches, linearly embedding each patch into a sequence of tokens, which are then processed by transformer layers. This enables ViT to capture global dependencies and long-range interactions within images, facilitating effective feature learning across the entire input space.

The primary innovation of ViT lies in its ability to replace the spatial hierarchy traditionally learned by CNNs with a self-attention mechanism. This mechanism allows ViT to attend to relationships between all pairs of tokens, thereby capturing both local and global context in the image [14]. Despite its initial adaptation from language models, ViT has shown impressive performance in image classification and other vision tasks, often achieving competitive or superior results compared to CNNs, particularly on datasets with complex visual patterns or large-scale images. As research continues to refine ViT architectures and explore their applications, they hold promise for advancing the State of the Art in computer vision tasks beyond traditional CNN-based approaches.
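As a rough illustration of the self-attention mechanism, the sketch below computes single-head scaled dot-product attention over a few toy tokens. It is a simplification under stated assumptions: the query, key, and value projections are taken to be the identity (real ViT layers learn separate projection matrices and use several heads), and the token values and function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    output = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        output.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return output


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy patch embeddings
attended = self_attention(tokens)
```

Because every token attends to every other token, each output row mixes information from the whole sequence, which is how a ViT captures long-range context in a single layer.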

3. Methodologies

The overarching methodology of this study can be briefly explained with the workflow shown in Figure 1.

Step 1: Define Condition States for Pavement Defects

Each defect type, whose severity is continuous in nature, is mapped onto a discrete scale of condition states. The severity of a defect (its condition state) can be defined in two ways: with binary condition states or with multiple condition states. Asset managers can define their preferred scale, which in many practical situations must also comply with existing legislation and norms.

Step 2: Collect/Filter/Transform Data

Depending on the type of drone and its mounted camera, asset managers and drone pilots need to define appropriate height of flying paths considering local constraints such as trees, buildings, and weather conditions. This also includes data filtering to eliminate possible effects and obstacles such as shadows, tree leaves, and other items.

This step also involves transformation of the image to proper segmentation with a uniform dimension.

Step 3: Annotate/Label Images with Attributes

Appropriate software can be used for image annotation and labeling. This is a manual task, but it can be automated if a supervised model is used. A simpler method is to save images with the same attributes in a common folder and use the folder names to define the attributes.

Step 4: Write a Program to Connect and Interact with TensorFlow

TensorFlow provides a systematic way to interact with its algorithms through high-level interfaces such as Keras. This can be conveniently performed with Python, R, KNIME, or other data analytics platforms.

Step 5: Run the Program

Run the program on both training data and test data to evaluate and validate the effectiveness of the model.

Readers who are interested in learning machine-learning algorithms available with the TensorFlow platform can read and refer to numerous works from the literature offering a description of mathematical and Neural Network models, as well as coding examples utilizing Keras CNN algorithms and ViT algorithms.

The above steps present a generic methodological framework. The specific methodologies regarding the architectures of CNN and ViT models applied to empirical datasets are further detailed in the following section (“Empirical Example and Discussion”), as different datasets with their own characteristics and attributes may require customized model parameter configurations.

4. Empirical Example and Discussion

4.1. Step 1: Define Condition States for Pavement Defects

For the purpose of this study, the authors focus on using AI models to detect whether an image contains cracks. In other words, the crack is categorized into two states: “with crack” and “without crack”. This simplification is, in many cases, acceptable depending on the expectations of asset managers. The definition of multiple condition states—involving precise measurements of crack width, length, and types—will be addressed in a subsequent study and discussed in a separate paper. Examples of images with cracks and without cracks are shown in Table 1.
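The difference between a binary scale and a multiple-state scale can be sketched as a simple mapping from a continuous measurement to a discrete state. This is a hypothetical illustration only: the study labels whole images and does not measure crack width, and the thresholds below (0, 1, and 3 mm) and function names are assumptions chosen for the example, not values from any norm or from the authors.

```python
def binary_state(crack_width_mm):
    """Map a measured crack width to the two states used in this study."""
    return "with crack" if crack_width_mm > 0 else "without crack"


def multi_state(crack_width_mm, thresholds_mm=(0.0, 1.0, 3.0)):
    """Map a crack width to a discrete state 1..4 (thresholds are hypothetical)."""
    state = 1
    for limit in thresholds_mm:
        if crack_width_mm > limit:
            state += 1
    return state


print(binary_state(0.0), "|", binary_state(0.5))  # without crack | with crack
print([multi_state(w) for w in (0.0, 0.5, 2.0, 5.0)])  # [1, 2, 3, 4]
```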

4.2. Step 2: Collect/Filter/Transform Data

We applied the methodology to a dataset comprising 15,000 images showcasing cracks. The dataset is stored in RGB format as JPEG files and was resized to dimensions of 240 × 240 pixels.

This dataset was collected using the DJI Mini 4 Pro, a drone priced at approximately USD 1000 in 2024. The DJI Mini 4 Pro is primarily designed for personal use, weighs only 249 g, and is equipped with a high-definition camera capable of capturing 48 MP photos and recording video in 4K resolution. For this study, the research team programmed the drone to automatically capture photos of target road sections using a non-DJI third-party drone deployment application that supports pre-programmed route selection using GPS and Google Maps. To maintain consistent image resolution, the drone’s altitude was kept at 10 m above the road surface.

Since the captured images often included extraneous elements such as roadsides, curbs, or unrelated objects, a post-processing step was necessary. This involved dividing each image into equal squares of 100 cm × 100 cm. The image slicing was carried out using GIMP (GNU Image Manipulation Program, https://www.gimp.org/), though other image-processing applications could also be used. The entire process, from capturing images with the drone to slicing the images, spanned one month and covered a total of 100 km of road sections. The mission’s total cost, excluding the reusable drone, was under USD 2000. This translates to an approximate cost of USD 20 per kilometer for drone operations and image processing (about USD 0.20 (~5000 VND) per 10 m of road section, and less than USD 0.008 (~200 VND) per image). This cost is minimal compared with high-speed inspection cars, which require a significant upfront investment of around USD 1.5 million for the vehicle, sensors, and associated equipment alone.
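The unit costs can be checked with a few lines of arithmetic, working in US dollars. The total cost and road length are the figures stated above; the exchange rate of 25,000 VND per USD is an assumption (it is roughly the rate implied by the VND amounts quoted in the text), and the number of image tiles per 10 m is derived from the quoted figures rather than reported by the authors.

```python
total_cost_usd = 2000.0   # upper bound stated in the text
road_length_km = 100.0    # surveyed length stated in the text
vnd_per_usd = 25_000.0    # assumed exchange rate, not from the source

cost_per_km = total_cost_usd / road_length_km    # USD per kilometer
cost_per_10m = cost_per_km / 100.0               # 100 ten-meter sections per km
cost_per_10m_vnd = cost_per_10m * vnd_per_usd

per_image_vnd = 200.0                            # per-image figure quoted in the text
tiles_per_10m = cost_per_10m_vnd / per_image_vnd # implied tiles per 10 m of road

print(f"USD {cost_per_km:.2f} per km")
print(f"USD {cost_per_10m:.2f} (~{cost_per_10m_vnd:.0f} VND) per 10 m")
print(f"about {tiles_per_10m:.0f} image tiles per 10 m implied")
```

Under these assumptions the figures are mutually consistent: USD 20 per kilometer corresponds to USD 0.20 per 10 m, and a per-image cost of about 200 VND implies roughly 25 one-meter tiles per 10 m of road.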

4.3. Step 3: Annotate/Label Images with Attributes

With a binary classification of “with crack” and “without crack”, annotation and labeling can be performed easily by organizing images into two folders: “Crack” and “Without-Crack”. Attributes of each image, such as pixel values, are defined in Step 4 by applying an image-resizing function so that all images are loaded onto the GPU as squares of 240 × 240 pixels.
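The folder-based labeling can be sketched as follows. This is a minimal illustration, assuming a hypothetical directory layout `data/Crack/...` and `data/Without-Crack/...`; the file names and the helper `label_from_path` are invented for the example. Keras also offers `tf.keras.utils.image_dataset_from_directory`, which infers labels from subfolder names and resizes images on loading, although the text does not state whether the authors used it.

```python
from pathlib import PurePosixPath

# Index in this list is the integer label (0 = no crack, 1 = crack).
CLASS_NAMES = ["Without-Crack", "Crack"]


def label_from_path(path):
    """Return the integer label encoded by the image's parent folder name."""
    folder = PurePosixPath(path).parent.name
    return CLASS_NAMES.index(folder)


# Hypothetical file paths, for illustration only.
paths = ["data/Crack/img_0001.jpg", "data/Without-Crack/img_0002.jpg"]
labels = [label_from_path(p) for p in paths]
print(labels)  # [1, 0]
```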

4.4. Step 4: Write a Program to Connect and Interact with TensorFlow

The program is written in Python, using standard syntax to call the TensorFlow Keras CNN and ViT algorithms. It was executed in the Google Colab environment on an L4 GPU, which allows faster computation than running the code on a laptop.

For the CNN model, the following layers and parameters are defined:

import tensorflow as tf
from tensorflow.keras import regularizers

image_size = 240  # images are resized to 240 x 240 pixels

model = tf.keras.Sequential([
    # Two convolution + max-pooling stages extract and downsample features.
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation="relu", padding="valid", input_shape=(image_size, image_size, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2), activation="relu", padding="valid"),
    tf.keras.layers.MaxPooling2D((2, 2)),

    # Regularized dense head with dropout, ending in a sigmoid for binary output.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(units=64, activation="relu",
                          kernel_regularizer=regularizers.L1L2(l1=1e-3, l2=1e-3),
                          bias_regularizer=regularizers.L2(1e-2),
                          activity_regularizer=regularizers.L2(1e-3)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(units=1, activation="sigmoid"),
])
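To see how the spatial dimensions shrink through this architecture, the layer output sizes can be traced by hand. The sketch below applies the standard output-size formulas for unpadded ("valid") convolution and non-overlapping pooling to a 240 × 240 input; the helper names are hypothetical, and the numbers are derived from the layer parameters above rather than copied from the model summary in Figure 2.

```python
def conv_out(size, kernel=3, stride=2):
    """Output width of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2):
    """Output width of non-overlapping max pooling (stride equals pool size)."""
    return size // pool


size = 240
trace = [size]
for layer in (conv_out, pool_out, conv_out, pool_out):
    size = layer(size)
    trace.append(size)

flat_features = size * size * 32         # 32 filters in the last Conv2D
dense_params = flat_features * 64 + 64   # weights + biases of Dense(64)

print(trace)          # [240, 119, 59, 29, 14]
print(flat_features)  # 6272
print(dense_params)   # 401472
```

Most of the trainable parameters therefore sit in the first dense layer, which is also where the L1/L2 regularization and dropout are applied.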

With the ViT model, the following parameters are used:

learning_rate = 0.001
weight_decay = 0.0001
batch_size = 256
num_epochs = 100
image_size = 240  # We will resize input images to this size
patch_size = 20  # Size of the patches to be extracted from the input images
num_patches = (image_size // patch_size) ** 2
projection_dim = 64
num_heads = 4
transformer_units = [
    projection_dim * 2,
    projection_dim,
]  # Size of the transformer layers
transformer_layers = 8
mlp_head_units = [2048, 1024]  # Size of the dense layers of the final classifier

Here, each 240 × 240 image is sliced into patches of 20 × 20 pixels, giving (240 // 20)² = 144 patches per image; each flattened RGB patch contains 20 × 20 × 3 = 1200 values.
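The patch arithmetic can be verified with a small, dependency-free sketch that slices a dummy image using the `image_size` and `patch_size` values listed above. The all-zero image and the helper `extract_patches` are hypothetical stand-ins; an actual ViT implementation would perform this step with tensor operations before the linear projection.

```python
def extract_patches(image, patch_size):
    """Split a square H x W x C image (nested lists) into flattened patches."""
    side = len(image)
    patches = []
    for top in range(0, side - patch_size + 1, patch_size):
        for left in range(0, side - patch_size + 1, patch_size):
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


image_size, patch_size = 240, 20
# Dummy all-zero RGB image standing in for a resized road photo.
image = [[[0, 0, 0] for _ in range(image_size)] for _ in range(image_size)]
patches = extract_patches(image, patch_size)
print(len(patches), len(patches[0]))  # 144 1200
```

With these settings the transformer receives a sequence of 144 tokens per image, each projected from a 1200-value patch to `projection_dim` = 64 dimensions.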

For both CNN and ViT, the models are compiled with the Adam optimizer (https://keras.io/api/optimizers/adam/) [24] from Keras (https://keras.io/), a stochastic gradient descent method based on adaptive estimation of first-order and second-order moments.
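The Adam update rule of Kingma and Ba [24] can be sketched for a single scalar parameter as follows. This is an illustration, not the Keras implementation: the toy quadratic objective, the learning rate of 0.01, and the function name `adam_minimize` are assumptions made for the example, while `beta1` = 0.9 and `beta2` = 0.999 follow the values suggested in the Adam paper.

```python
import math


def adam_minimize(grad_fn, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """Run `steps` Adam updates on a single scalar parameter."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3); not the paper's loss.
result = adam_minimize(lambda x: 2.0 * (x - 3.0), theta=0.0, steps=5000, lr=0.01)
print(round(result, 2))
```

Dividing by the square root of the second-moment estimate makes the step size roughly independent of the gradient scale, which is why a single learning rate works across layers with very different gradient magnitudes.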

4.5. Step 5: Run the Program

4.5.1. Results of CNN Model

The model was set up with convolutional layers (applying filters to the input image to generate feature maps), max pooling layers (reducing the dimensions of the feature maps while retaining and summarizing the key features generated by a convolutional layer), a flatten layer followed by fully connected layers (connecting every neuron in one layer to every neuron in the next), and a dropout layer (randomly deactivating a fraction of neurons during training to prevent overfitting). Figure 2 presents a summary of the input and output of the respective layers of the Neural Network.

The model reaches approximately 95% accuracy (shown alongside the loss), as depicted in Figure 3 and Table 2. Notably, the accuracy exceeds 90% after only the first five epochs.

4.5.2. Results of ViT Model

The model reaches approximately 98% accuracy (shown alongside the loss), as depicted in Figure 4 and Table 3. Notably, the accuracy exceeds 90% after only the first three epochs.
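For readers unfamiliar with how the accuracy and loss curves are obtained, the sketch below computes both metrics for a handful of made-up predictions. It assumes a sigmoid output thresholded at 0.5 and a binary cross-entropy loss; the loss function is not stated in the text, so this is a common choice for a binary classifier rather than a confirmed detail of the study, and the labels and probabilities are invented.

```python
import math


def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the labels under the predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 1]          # made-up labels (1 = with crack)
y_prob = [0.9, 0.2, 0.6, 0.4]  # made-up sigmoid outputs
acc = binary_accuracy(y_true, y_prob)
loss = binary_cross_entropy(y_true, y_prob)
print(acc, round(loss, 3))  # 0.75 0.439
```

Accuracy only counts whether each prediction falls on the correct side of the threshold, whereas the loss also penalizes low confidence, so the two curves can move at different rates during training.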

5. Conclusions

This research investigates defect detection in civil and structural components using both a CNN model and a ViT model available through TensorFlow. The study conducted an empirical analysis using a database of cracks. The findings indicate that when cracks are classified simply as binary (crack or no crack), both CNN and ViT models demonstrate comparable levels of accuracy.

Although the methodology and models have been shown to detect cracks with a high level of accuracy in this example, the example is relatively simple, with only two condition states defined: one without cracks and another with cracks. In practical situations, asset managers might wish to define more than two condition states to capture the severity and damage level of cracks in order to select a proper preventive or corrective intervention. This limitation will be addressed by the team in upcoming research that covers multiple condition states of pavement defects, including cracks and other types of distress, such as potholes, roughness, and rutting.

This work confirms that the open-source approach using TensorFlow algorithms is effective and significantly reduces the cost of implementation while still maintaining a good level of accuracy. In addition, the method can be extended to other types of defects on infrastructure systems.

Author Contributions

Conceptualization, N.L.; methodology, N.L.; software, N.L. and M.T.H.; validation, N.L. and M.T.H.; formal analysis, N.L. and M.T.H.; investigation, N.L. and M.T.H.; resources, N.L.; data curation, N.L. and M.T.H.; writing—original draft preparation, N.L.; writing—review and editing, N.L.; visualization, N.L. and M.T.H.; supervision, N.L.; project administration, N.L. and T.A.T.; funding acquisition, N.L. and T.A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research is the outcome of the ministerial-level research project “Application of Drones and GIS for Predicting Degradation of Urban Infrastructure Systems in Ho Chi Minh City”, with project code B2023-KSA-03.

Data Availability Statement

The dataset presented in this article is not readily available because its use is restricted to this study. Requests to access the dataset should be directed to the first author of the paper.

Acknowledgments

The authors would like to express gratitude toward the Ministry of Education and Training (MOET) and the University of Economics Ho Chi Minh City for providing funding to conduct the study.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Le Thanh, N. Stochastic Optimization Methods for Infrastructure Management with Incomplete Monitoring Data. Ph.D. Dissertation, Kyoto University, Kyoto, Japan, 2009.
2. Lethanh, N.; Kaito, K.; Kobayashi, K. Infrastructure Deterioration Prediction with a Poisson Hidden Markov Model on Time Series Data. J. Infrastruct. Syst. 2015, 21, 04014051.
3. Shtayat, A.; Moridpour, S.; Best, B.; Shroff, A.; Raol, D. A review of monitoring systems of pavement condition in paved and unpaved roads. J. Traffic Transp. Eng. (Engl. Ed.) 2020, 7, 629–638.
4. Fan, Z.; Lin, H.; Li, C.; Su, J.; Bruno, S.; Loprencipe, G. Use of Parallel ResNet for High-Performance Pavement Crack Detection and Measurement. Sustainability 2022, 14, 1825.
5. Ma, N.; Fan, J.; Wang, W.; Wu, J.; Jiang, Y.; Xie, L.; Fan, R. Computer vision for road imaging and pothole detection: A state-of-the-art review of systems and algorithms. Transp. Saf. Environ. 2022, 4, tdac026.
6. Hoang, N.-D.; Nguyen, Q.-L. Computer Vision-Based Recognition of Pavement Crack Patterns Using Light Gradient Boosting Machine, Deep Neural Network, and Convolutional Neural Network. J. Soft Comput. Civ. Eng. 2023, 7, 21–51. Available online: https://www.jsoftcivil.com/article_168894.html (accessed on 20 March 2025).
7. World Bank. Transport Infrastructure: Sector Results Profile. 2014. Available online: https://www.worldbank.org/en/results/2013/04/14/transport-results-profile (accessed on 20 March 2025).
8. OECD. Road Maintenance and Investment Strategies: International Best Practices. OECD Publishing. 2021. Available online: https://www.oecd-ilibrary.org/transport/road-maintenance-and-investment-strategies_02b0c8d2-en (accessed on 20 March 2025).
9. ADB 2017, Infrastructure Financing Modalities in Asia and the Pacific: Strengths and Limitations. Available online: https://www.adb.org/publications/infrastructure-financing-modalities-asia-and-pacific (accessed on 20 March 2025).
10. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Available online: http://code.google.com/p/cuda-convnet/ (accessed on 20 March 2025).
11. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021. https://arxiv.org/abs/2010.11929.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2023. https://arxiv.org/abs/1706.03762.
15. FHWA. Practical Guide for Quality Management of Pavement Condition Data Collection; U.S. Department of Transportation: Washington, DC, USA, 2013.
16. World Bank. Asset Management for Sustainable Road Maintenance: Guidelines for Low- and Middle-Income Countries; The World Bank: Washington, DC, USA, 2018.
17. Kong, X.; Li, J.; James, G. Structural health monitoring of bridges using wireless sensor networks and machine learning. Struct. Control Health Monit. 2017, 24, e1920.
18. Sohn, H.; Farrar, C.R.; Hemez, F.M.; Czarnecki, J.J. A Review of Structural Health Monitoring Literature: 1996–2001; Los Alamos National Laboratory Report LA-13976-MS; Los Alamos National Laboratory: Los Alamos, NM, USA, 2004.
19. Zhang, H.; Qian, Z.; Tan, Y.; Xie, Y.; Li, M. Investigation of pavement crack detection based on deep learning method using weakly supervised instance segmentation framework. Constr. Build. Mater. 2022, 358, 129117.
20. Lee, T.; Kim, J.; Lee, Y. CNN-Based Road-Surface Crack Detection Model That Responds to Brightness Changes. Electronics 2021, 10, 1402.
21. Zhao, R.; Yan, R.; Chen, Z.; Mao, K.; Wang, P.; Gao, R.X. Deep learning and its applications to machine health monitoring. Mech. Syst. Signal Process. 2019, 115, 213–237.
22. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
23. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015. https://arxiv.org/abs/1409.1556.
24. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017. https://arxiv.org/abs/1412.6980.

Figure 1. Workflow and methodologies.

Figure 2. Model summary.

Figure 3. Loss and accuracy level (CNN).

Figure 4. Loss and accuracy level (ViT).

Table 1. Pavement segments with crack and no crack (binary state).

Binary Condition State	Graphical Representation
1	(image not reproduced)
2	(image not reproduced)

Table 2. Loss and accuracy level corresponding to each epoch (CNN model).

Epoch	Training Loss	Training Acc.	Validation Loss	Validation Acc.	Epoch	Training Loss	Training Acc.	Validation Loss	Validation Acc.	Epoch	Training Loss	Training Acc.	Validation Loss	Validation Acc.
1	3.4755	0.5391	2.0235	0.7212	34	0.4478	0.9403	0.4549	0.9458	67	0.4895	0.9262	0.4297	0.9502
2	1.5043	0.8413	1.167	0.8735	35	0.5035	0.936	0.3964	0.9503	68	0.4426	0.9455	0.4419	0.9485
3	1.0026	0.872	0.8312	0.888	36	0.4369	0.9387	0.4069	0.943	69	0.4488	0.9434	0.4056	0.9553
4	0.7182	0.8936	0.8128	0.7758	37	0.43	0.9412	0.4349	0.9447	70	0.4639	0.9393	0.4398	0.9493
5	0.5648	0.8942	0.4732	0.9148	38	0.4378	0.9385	0.4328	0.9465	71	0.4543	0.9413	0.4519	0.9468
6	0.4547	0.9056	0.3859	0.9262	39	0.4264	0.9411	0.4556	0.9345	72	0.4391	0.9463	0.4211	0.9545
7	0.4211	0.9123	0.3559	0.9307	40	0.4258	0.9418	0.4438	0.9452	73	0.4473	0.9439	0.4466	0.9482
8	0.4475	0.8924	0.4407	0.902	41	0.4469	0.9384	0.3874	0.953	74	0.4524	0.9425	0.4324	0.9521
9	0.4216	0.907	0.6006	0.8828	42	0.505	0.9243	0.5629	0.9332	75	0.4706	0.9364	0.4629	0.9434
10	0.4655	0.9092	0.5016	0.8933	43	0.4638	0.9393	0.5035	0.9253	76	0.4519	0.9426	0.4407	0.9496
11	0.3975	0.9148	0.3453	0.9267	44	0.4695	0.9335	0.4429	0.9432	77	0.4334	0.9474	0.4159	0.9551
12	0.6355	0.865	0.5894	0.8783	45	0.4629	0.9392	0.4922	0.9318	78	0.4432	0.9453	0.4283	0.9515
13	0.5431	0.8827	0.4996	0.8882	46	0.4503	0.94	0.4613	0.944	79	0.4584	0.9395	0.4563	0.9447
14	0.4963	0.8933	0.3971	0.9243	47	0.4396	0.9439	0.442	0.941	80	0.463	0.9385	0.4439	0.9483
15	0.4439	0.9133	0.4778	0.898	48	0.4459	0.9406	0.4701	0.933	81	0.4429	0.9458	0.4193	0.9549
16	0.4305	0.9234	0.4257	0.9212	49	0.4547	0.9396	0.4404	0.9522	82	0.4548	0.9408	0.4513	0.9463
17	0.4381	0.9258	0.4207	0.934	50	0.4675	0.9404	0.4658	0.945	83	0.4463	0.9436	0.4264	0.9525
18	0.4411	0.9269	0.4241	0.9365	51	0.4539	0.9411	0.4471	0.943	84	0.4534	0.9414	0.4384	0.9503
19	0.5891	0.9202	0.9021	0.9078	52	0.4603	0.939	0.4894	0.9402	85	0.4455	0.9439	0.4103	0.9573
20	0.6218	0.9162	0.4871	0.9322	53	0.4853	0.942	0.4064	0.955	86	0.4489	0.9429	0.4456	0.9475
21	0.4965	0.9324	0.4699	0.9282	54	0.4499	0.944	0.4218	0.9548	87	0.4592	0.9393	0.463	0.9439
22	0.4943	0.9226	0.3954	0.9495	55	0.4648	0.9418	0.44	0.952	88	0.4373	0.9462	0.4232	0.9534
23	0.4387	0.9378	0.4571	0.932	56	0.4462	0.9441	0.436	0.9547	89	0.4525	0.9417	0.4395	0.9497
24	0.4636	0.9319	0.465	0.925	57	0.4395	0.9452	0.4098	0.956	90	0.4691	0.9366	0.4586	0.9442
25	0.4787	0.9291	0.4353	0.9467	58	0.455	0.9405	0.5553	0.9008	91	0.4424	0.9456	0.4155	0.9555
26	0.4372	0.9372	0.4449	0.9468	59	0.451	0.9403	0.4461	0.9473	92	0.4569	0.9398	0.4481	0.9467
27	0.469	0.936	0.4118	0.9547	60	0.5096	0.9292	0.434	0.9472	93	0.4474	0.9433	0.4334	0.9514
28	0.4484	0.9379	0.4313	0.9493	61	0.4478	0.9437	0.4473	0.9473	94	0.4513	0.9422	0.4426	0.9488
29	0.441	0.9385	0.4292	0.9375	62	0.4408	0.9456	0.4523	0.9467	95	0.4451	0.9442	0.4199	0.9543
30	0.4882	0.9272	0.4986	0.9272	63	0.4521	0.9434	0.4322	0.9523	96	0.4595	0.9389	0.4639	0.9433
31	0.4617	0.9372	0.4906	0.9343	64	0.4511	0.9445	0.5569	0.9025	97	0.4382	0.9465	0.4241	0.9531
32	0.4739	0.937	0.4531	0.9475	65	0.4828	0.9347	0.4574	0.9448	98	0.4537	0.9413	0.4404	0.9495
33	0.4764	0.9346	0.4867	0.9385	66	0.4305	0.9493	0.4191	0.9538	99	0.4693	0.9363	0.4594	0.9441
										100	0.4419	0.9459	0.4163	0.9552

Table 3. Loss and accuracy level corresponding to each epoch (ViT model).

Epoch	Training Loss	Training Acc.	Training Top-5 Acc.	Validation Loss	Validation Acc.	Validation Top-5 Acc.
1	1.2525	0.6316	1	0.4728	0.7598	1
2	0.3942	0.8138	1	0.2533	0.8887	1
3	0.2137	0.9158	1	0.1629	0.9392	1
4	0.146	0.9429	1	0.1029	0.964	1
5	0.1138	0.9547	1	0.0847	0.9698	1
6	0.0955	0.9653	1	0.0727	0.9755	1
7	0.0877	0.9682	1	0.0809	0.9743	1
8	0.0725	0.9721	1	0.0576	0.9807	1
9	0.0676	0.975	1	0.0465	0.9852	1
10	0.0663	0.976	1	0.0494	0.9802	1
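
The validation columns of Table 3 can be summarised with a few lines of Python. The sketch below is illustrative only and is not part of the authors' program: the values are copied from Table 3, while the names `VIT_VALIDATION`, `best_epoch`, and `first_epoch_at_or_above` are hypothetical helpers introduced here. It reports the epoch with the highest validation accuracy and the first epoch at which validation accuracy reaches a chosen threshold.

```python
# Validation metrics copied from Table 3 (ViT model): epoch -> (val_loss, val_acc).
# The dictionary and helper names below are hypothetical, introduced for illustration.
VIT_VALIDATION = {
    1: (0.4728, 0.7598),
    2: (0.2533, 0.8887),
    3: (0.1629, 0.9392),
    4: (0.1029, 0.9640),
    5: (0.0847, 0.9698),
    6: (0.0727, 0.9755),
    7: (0.0809, 0.9743),
    8: (0.0576, 0.9807),
    9: (0.0465, 0.9852),
    10: (0.0494, 0.9802),
}


def best_epoch(history):
    """Return (epoch, val_loss, val_acc) for the highest validation accuracy."""
    epoch = max(history, key=lambda e: history[e][1])
    loss, acc = history[epoch]
    return epoch, loss, acc


def first_epoch_at_or_above(history, threshold):
    """Return the first epoch whose validation accuracy reaches the threshold, or None."""
    for epoch in sorted(history):
        if history[epoch][1] >= threshold:
            return epoch
    return None


if __name__ == "__main__":
    epoch, loss, acc = best_epoch(VIT_VALIDATION)
    print(f"Best validation accuracy {acc:.4f} (loss {loss:.4f}) at epoch {epoch}")
    print("First epoch at or above 95%:", first_epoch_at_or_above(VIT_VALIDATION, 0.95))
```

For the values in Table 3, the highest validation accuracy occurs at epoch 9, and the 95% level is first reached at epoch 4. The same helpers could be applied to the validation columns of Table 2 if those values were entered in the same form.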

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lethanh, N.; Trinh, T.A.; Hossain, M.T. An Investigation on Prediction of Infrastructure Asset Defect with CNN and ViT Algorithms. Infrastructures 2025, 10, 125. https://doi.org/10.3390/infrastructures10050125

