Article

Channel Segmentation Proofreading Network for Crack Counting with Imbalanced Samples

1 College of Computer Science and Technology, Changchun University, Changchun 130022, China
2 China Mobile Communications Group Jilin Co., Ltd., Changchun 130061, China
3 Longwang Township Comprehensive Service Center, Nongan County, Changchun 130218, China
4 College of Computer Science and Technology, Jilin University, Changchun 130012, China
5 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(3), 236; https://doi.org/10.3390/a19030236
Submission received: 27 February 2026 / Revised: 16 March 2026 / Accepted: 18 March 2026 / Published: 22 March 2026

Abstract

This paper presents a channel segmentation proofreading network for crack counting with imbalanced samples. The network is built by stacking basic blocks called channel segmentation proofreading blocks, each composed of an Approximate Overlapping Window Transformer and a Counting Proofreading Module. The former is designed to extract sufficient high-level semantic information, enhancing the network's ability to judge crack quantities. Guided by the calculation results of the self-attention mechanism in the classical Transformer, the Approximate Overlapping Window Transformer employs distinct computation steps to obtain approximately the same results. By confining the computation to overlapping windows, we iteratively adjust the feature extraction process and internal structure to best suit crack counting. Furthermore, to prevent multiple cracks from being misidentified as a single crack due to incorrect connection predictions of crack regions, the Counting Proofreading Module employs channel separation techniques. Following the concept of splitting positive and negative weights, it constructs positive and negative values with different characteristics, further confirming crack regions. Through the combined action of both components, our network achieves optimal results across all metrics when trained and tested on the crack counting dataset.

1. Introduction

Crack detection typically refers to the task of employing computer vision algorithms to identify crack regions in images and prominently indicate those areas. Currently, there are various methods for crack detection. Some use supervised learning to enhance the perception of cracks [1,2,3,4], some specifically consider the impact of road noise on crack detection [5], some utilize weak supervision to alleviate the annotation difficulty of crack regions and reduce associated time costs [6], and others employ semi-supervised learning to facilitate the development of crack detection through the combined effects of the Segment Anything Model [7] and additional samples [8]. While the current methods for crack detection can meet the basic requirements of road maintenance, independently determining whether each pixel or patch belongs to a crack can negatively impact detection accuracy. Conversely, adding supervision on the quantity of cracks can prompt the model to make further judgments about the connectivity of cracks, thereby eliminating erroneously identified crack regions that do not align with the expected number of cracks.
Supervision for crack quantity can indeed provide assistance in crack detection and further improve its accuracy [9]. However, counting cracks is a challenging task. Firstly, there is currently a lack of relevant datasets for crack counting. Previous efforts did not fully recognize the benefits of crack counting for crack detection, resulting in a lack of dedicated datasets available for crack counting. Transferring from other counting tasks to crack counting through techniques like transfer learning does not yield the same level of supervised learning effectiveness. Therefore, annotating a crack counting dataset is a practical and urgently needed task. Secondly, even in the scenario where the crack sample of an image contains only two crack regions, accurately discerning the connectivity between crack regions poses a challenge for network design, especially when these cracks are in close proximity. For instance, when the road region between two nearby cracks is erroneously predicted as a crack, it becomes difficult for the crack counting network to accurately determine the number of cracks. Therefore, while the network extracts rich high-level contextual semantic information from cracks, it should also possess additional proofreading capabilities to obtain more accurate crack counting results.
To address the challenges posed by crack counting, we propose a meticulously annotated and expert-reviewed crack counting dataset, along with a Channel Segmentation Proofreading Network (CSPNet) designed for crack counting with imbalanced samples. Specifically, we selected several crack detection benchmark datasets as the foundation for the crack counting dataset. Since most crack detection datasets are relatively small, we initially merge these benchmark datasets into a large dataset without distinguishing between training and testing sets. Subsequently, we assess the quantity of cracks based on the connectivity of cracks in the ground truth of crack detection datasets and assign them to their respective quantity categories. Finally, the annotated crack counting dataset undergoes thorough verification and is split into training and testing sets in preparation for training.
We train the proposed Channel Segmentation Proofreading Network (CSPNet) on the pre-divided crack counting training set. This network prioritizes the extraction of high-level semantic features related to cracks. To extract crack feature clues as comprehensively as possible while preserving real-time crack counting performance, we optimize the computation process of the traditional Transformer. In the traditional Transformer, there are Queries (Q), Keys (K), and Values (V). The similarity matrix is calculated by performing matrix multiplication between Q and K, and then the similarity matrix is further multiplied with the matrix V to obtain a tensor of weighted sums. This tensor undergoes further learning through a feed-forward network. This entire process places a significant computational burden on the network, and Transformers often require additional positional encodings due to the lack of positional information. To address these limitations of the traditional Transformer, we analyzed its computation results and propose a novel calculation method that prioritizes calculating the dot product of K and V. We then combine the pixel windows using a linear convolutional structure and multiply the result by Q to obtain an approximation of the self-attention mechanism in Transformers. We refer to this structure as the Approximate Overlapping Window Transformer (AowFormer). It has lower computational complexity than the traditional Transformer while retaining the ability to encode dependencies between tokens, making it more suitable for crack counting tasks. With AowFormer, the extraction of rich high-level semantic crack features facilitates the determination of crack quantity. However, to further obtain reliable crack quantity predictions, an additional proofreading module is necessary in the network. Hence, we propose a Counting Proofreading Module (CPM).
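The computational saving behind computing the K–V product first can be sketched with plain matrices: with the softmax removed, associativity gives (QKᵀ)V = Q(KᵀV), replacing the N × N attention matrix with a d × d intermediate. A minimal NumPy check (the sizes below are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 64, 8          # N tokens (e.g. H*W pixels), d channels; illustrative sizes
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Standard order: the attention matrix Q @ K.T is N x N -- quadratic in N.
S_standard = (Q @ K.T) @ V

# Reordered: K.T @ V is only d x d, so memory and compute scale linearly in N.
S_reordered = Q @ (K.T @ V)

assert np.allclose(S_standard, S_reordered)
```

This identity only holds exactly without the softmax, which is why AowFormer approximates rather than reproduces the classical attention result.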
The CPM exploits the fact that crack counting must consider the image as a whole: at each stage of AowFormer, it generates a unified small-scale copy of V, in which high-level semantic information related to the number of cracks can be captured. This process focuses on capturing high-level semantic features from both the positive and negative aspects of cracks. Firstly, we use a structure similar to SENet (Squeeze-and-Excitation Network) [10] to obtain positive and negative channel-related weights. Then, we perform channel segmentation operations to obtain the other factor, the positive and negative feature maps, which will be multiplied by the positive and negative weights. This operation allows positive information to be widely distributed across channels and undergo a more comprehensive information-gathering process, while minimizing the collection intensity of negative information and retaining a degree of the original feature maps. As a result, we obtain enhanced positive feature maps that are more effective in anchoring crack positions and determining crack quantity, while the negative feature maps primarily serve to complement contextual information and balance the features. Finally, the positive and negative weights are multiplied with their corresponding feature maps and aggregated. The aggregated feature maps are restored to the size of the current stage and then used as proofreading weights to rectify the dot product results of K and V.
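The exact internal wiring of the CPM is given in Section 4; as a shape-level illustration only, the NumPy sketch below shows the general pattern of SE-style weight generation followed by channel segmentation and weighted aggregation. All names, the projection matrices `w_pos`/`w_neg`, and the fusion rule are assumptions for illustration, not the paper's definition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cpm_sketch(x, w_pos, w_neg):
    """Hypothetical CPM-style weighting (wiring assumed, not the paper's).

    x: feature map of shape (C, H, W).
    w_pos, w_neg: (C//2, C) matrices standing in for the learned
    SENet-style excitation layers producing positive/negative weights.
    """
    C, H, W = x.shape
    squeeze = x.mean(axis=(1, 2))              # global average pool -> (C,)
    pos_w = sigmoid(w_pos @ squeeze)           # positive channel weights
    neg_w = sigmoid(w_neg @ squeeze)           # negative channel weights
    x_pos, x_neg = x[: C // 2], x[C // 2 :]    # channel segmentation
    # Weighted aggregation of the two halves into one proofreading map.
    fused = pos_w[:, None, None] * x_pos + neg_w[:, None, None] * x_neg
    return fused                               # (C//2, H, W)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4, 4))
out = cpm_sketch(x, rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
assert out.shape == (4, 4, 4)
```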
CSPNet is constructed by stacking Channel Segmentation Proofreading Blocks (CSPBs), where a CSPB is a combination of the CPM and AowFormer. CSPNet can be regarded as a backbone. Since there has been no detailed research on crack counting prior to ours, for a fair comparison we select the best backbones and compare them with our proposed CSPNet. Experimental hyperparameters and the environment are kept consistent throughout the comparison. Our method achieves optimal results on every metric, and extensive ablation experiments demonstrate the effectiveness and rationality of our AowFormer and CPM. We believe that the proposed method will contribute to the development of crack counting and related fields, and it is likely to promote advances in crack detection.
The main contributions of this paper are as follows:
(1)
This paper pioneers the task of crack counting and proposes the first solution for this task, CSPNet.
(2)
The crack counting dataset annotated and verified by us will benefit the progress and development of crack counting, as well as other counting and multi-task crack detection domains.
(3)
The proposed AowFormer introduces a new solution to optimize the Transformer computation process, while the channel segmentation-based positive and negative feature construction of the CPM makes the proofreading module novel and provides accurate weight information for proofreading crack regions, ultimately positively affecting crack quantity predictions.
(4)
Extensive experiments confirm the effectiveness of the proposed CSPNet and the rationality of its internal component design.

2. Related Work

In our pioneering work on crack counting tasks, there are no existing references in the field of crack counting to provide guidance. Therefore, this section will introduce other related work from the perspectives of crack detection and excellent backbones to help readers understand the significance and importance of the crack counting task from these fields.
Crack Detection: Due to the significant advancement in the field of crack detection facilitated by deep learning, crack detection methods can be categorized into traditional crack detection methods and deep learning-based crack detection methods based on whether deep learning frameworks are used.
(1)
Traditional Crack Detection Methods: Subirats et al. [11] apply separable 2D continuous wavelet transform at different scales, analyze the maximum wavelet coefficients, and utilize post-processing to obtain a binary map indicating the presence of cracks. Refs. [12,13,14,15] are all methods based on image thresholds. Among them, in order to address the issue of gaps in cracks after image preprocessing, Huang et al. [12] propose an algorithm to connect these cracks, achieving accurate surface crack detection. Xu et al. [13] present an unsupervised crack detection method based on saliency and statistical features to address the problem of complex and diverse noise in large image regions. Other methods [16,17,18] rely on manually designed features and classification. Marcos [17] employs shoulder detection, unit candidate proposals, and crack classification in a three-step process for crack detection. Hamzeh [18] combines wavelet modulus and 3DRT for knowledge generation, training and testing an artificial neural network classifier using peak features and parameters. Yan et al. [19] enhance grayscale road surface images by redesigning a median filtering algorithm with four structural elements, and combine morphological gradient operators and morphological closure operators to extract crack edges and fill crack gaps. Rabih et al. [20] propose an automatic crack detection algorithm highly dependent on the localization of the minimum path formed by a series of adjacent pixels within each image, introducing two post-processing steps to enhance detection quality.
(2)
Deep Learning-Based Crack Detection Methods: Deep learning-based crack detection methods can be categorized into block-level [8,21,22,23,24,25,26,27] and pixel-level crack detection [1,2,3,4,5,6,9,28,29,30,31,32,33,34]. Generally, patch-level crack detection methods are inferior to pixel-level crack detection methods in terms of detection accuracy and task difficulty. Leo et al. [22] use a deeper neural network to differentiate between crack and non-crack patches, demonstrating the superior performance of deep neural networks. MOD-YOLO [26] enhances crack detection by introducing MODSConv for better channel interaction, Global Receptive Field-Space Pooling Pyramid-Fast for scale handling, and DAF-CA for precise feature extraction, all while maintaining dimensional integrity. Li et al. [8] propose a semi-supervised method for road defect detection based on deep transfer learning, which achieves performance comparable to that of supervised learning with fewer annotated data and accurately determines crack dimensions under different scenarios. Considering the accuracy limitations of patch-level crack detection, which are not conducive to crack counting tasks, the field more relevant to crack counting is pixel-level crack detection. DeepCrack [1] designs a model that aggregates multi-scale and multi-level features, applies deep supervision directly to features at each stage, and optimizes the final prediction using guided filtering and conditional random field methods. FPHBN [3] incorporates a feature pyramid into the edge detection algorithm for crack detection, also using hierarchical nested sample reweighting to balance the contribution of hard samples to the loss. Yuki et al. [6] formulate the crack problem as a weakly supervised problem and propose a two-branch framework that maintains high detection accuracy even with low-quality labeling.
TCDNet [30] captures the long-range dependence of crack features by effectively embedding channel and position information in a mixed attention module, and embeds this module into a traditional U-shaped network using multi-scale feature fusion to construct the network. DcsNet [31], combining a morphology branch and a shallow detail branch, achieves a balance between crack detection speed and accuracy. Luo et al. [32] combine the advantages of traditional machine vision detection methods and semantic segmentation detection methods to improve the accuracy of pavement crack detection. For long and complex pavement cracks, CT [4] utilizes Swin Transformer [35] as the encoder and decoder, combined with all multi-layer perceptron (MLP) layers, thereby forming an innovative solution. Based on the design tenet of simultaneously learning cracks and crack-related information, Sun et al. [9] construct a multi-task semi-supervised learning framework consisting of crack region detection, crack and noise edge classification, and crack counting.
Excellent Backbone: The backbone generally refers to a deep neural network formed by stacking repeated basic blocks, and its applicability greatly influences the performance of the entire network. The proposed CSPNet can be considered a backbone, so understanding other excellent backbones will further promote the understanding of the proposed method. Excellent backbones can be divided into two categories: those based on convolutional neural networks [36,37,38,39,40,41,42] and those based on Transformers [35,43,44,45,46,47,48,49,50,51]. ResNet [37] constructs a deeper neural network using a residual architecture. EfficientNet [39] scales the network uniformly using efficient compound coefficients and uses neural architecture search methods to design the baseline network. ConvNext [40] attributes the effectiveness of hybrid convolution and Transformer methods mainly to the inherent superiority of the Transformer; based on this view, the standard ResNet [37] is gradually optimized toward the design of vision Transformers, using the key components found during the optimization process to build models. Given that modern ConvNets are still limited to pyramidal structures, OverLoCK [42] designs a model architecture composed of a base network, an overview network, and a focus network by leveraging characteristics of the human visual system. Swin Transformer [35] is a hierarchical Transformer that restricts self-attention computation to non-overlapping local windows and allows cross-window connections through a window shifting scheme, achieving higher efficiency. MobileViT [45] combines the advantages of CNNs (Convolutional Neural Networks) and ViT (Vision Transformer), constructing a lightweight, low-latency neural network model for machine vision tasks on mobile devices.
RMT [49] mainly extends the temporal decay mechanism to the spatial domain and proposes a spatial decay matrix based on Manhattan distance, so as to introduce the explicit spatial prior into Self-Attention. vHeat [51] utilizes the discrete cosine transformation to introduce a heat conduction operator based on the physical heat conduction principle, thus alleviating the challenge of significant computational overhead faced by visual representation models of the attention mechanism.
Cross-Domain-Related Methods: Crack counting is inherently intertwined with the research fields of object counting, instance segmentation and advanced crack analysis, and the latest research results in these fields provide important technical insights for solving the core challenges of crack counting. In the field of object counting, ZSC [52] explores counting arbitrary object categories without annotated exemplars by automatically selecting representative patches from images based solely on class-name supervision. VA-Count [53] introduces a visual association-based framework for zero-shot object counting that enhances exemplar discovery using vision–language models and suppresses noisy exemplars via contrastive learning to improve counting performance without manual annotations. A dynamic example network [54] was proposed for class-agnostic object counting and localization, which iteratively refines predictions and incorporates negative example mining to improve generalization under few-shot supervision. For instance segmentation related to crack tasks, FastInst [55] proposes a query-based framework for real-time instance segmentation that improves efficiency and accuracy through instance activation-guided queries, a dual-path update strategy, and mask-guided learning within a lightweight Mask2Former-style architecture. SAPNet [56] integrates SAM (Segment Anything Model) [7] with multiple instance learning to perform point-supervised category-specific instance segmentation, selecting representative mask proposals while leveraging point distance guidance and box mining strategies to improve semantic consistency and segmentation performance. 
In the research of advanced crack analysis methods, the modified Forman model and Paris law [57] have been studied for fatigue crack analysis and growth prediction using computational fracture mechanics with uncertainty quantification, demonstrating improved crack propagation modeling capability of the modified Forman model across broader growth regions. DDTL [58] employs deep dense transfer learning with pre-trained models to perform automatic crack detection and classification in concrete dam borehole images, demonstrating effective crack analysis and localization performance.
Through the field of crack detection, it is possible to understand which regions in an image belong to cracks and how cracks are typically distinguished. By studying past excellent backbones and cross-domain-related methods, researchers can grasp the basic ideas and different characteristics of model design and implementation. Combining insights from these three domains provides a deeper understanding of the crack counting task and network proposed in this paper.

3. Crack Counting Dataset

As this paper is the first work addressing the crack counting task, there is no readily available crack counting dataset for direct use in training and testing. Therefore, we commence the annotation work for our crack counting dataset on existing crack detection datasets and have the annotations validated by experts.
We select images from six crack detection datasets, namely, Crack500 [3], CFD [59], GAPS384 [24], cracktree200 [14], AEL [20], and NCD [5], to form the basis of the crack counting dataset. Since most benchmark datasets for crack detection contain a limited number of images, we merge the original images from all the above datasets into a large dataset totaling 3040 images. Subsequently, we determine the number of cracks in each image based on the connectivity of the corresponding ground truth, utilizing a connectivity judgment tool with human verification. Images are then sorted into folders based on the number of cracks. Because the majority of crack images contain only one to three cracks, counting every quantity category exactly would leave insufficient training and testing data, ultimately leading to extreme sample imbalance and network underfitting. Hence, we record images with four or more cracks as having four cracks. Finally, we invite domain experts to verify and adjust the dataset annotated by crack quantity.
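The connectivity-based counting with the "four or more" cap can be approximated by a plain connected-component pass over a binary ground-truth mask. The minimal sketch below uses 8-connectivity and is only a stand-in for the human-verified judgment tool used by the authors:

```python
from collections import deque

def count_cracks(mask, cap=4):
    """Count connected crack regions in a binary mask (8-connectivity),
    capping the label at `cap`, mirroring the "four or more" category."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if mask[i][j] and not seen[i][j]:
                count += 1                      # new region found
                q = deque([(i, j)])
                seen[i][j] = True
                while q:                        # flood-fill the region
                    y, x = q.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < h and 0 <= nx < w \
                                    and mask[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                q.append((ny, nx))
    return min(count, cap)

# Two diagonally touching pixels merge under 8-connectivity; the isolated
# pixel is a separate region, giving 2 cracks in total.
mask = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
assert count_cracks(mask) == 2
```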
We randomly split the entire dataset into training and testing sets in a 9:1 ratio. Since the number of crack images varies for each counting category, we prioritize reserving 76 images from each category to serve as its test set. The remaining training images for the four crack counting categories number 589, 2435, 618, and 143, respectively. To address prediction biases due to imbalanced training samples, we take the number of images with two cracks as the quantity reference and apply data augmentation to the other categories. The augmentation includes operations such as flipping and rotating. Because the number of training images differs across categories, the augmentation factor also differs. After data augmentation, the final numbers of training images are 2356, 2435, 2472, and 2288. Examples of images for each crack counting category in the dataset are shown in Figure 1. Data augmentation alleviates the imbalance in crack quantities to some extent. However, even when the numbers of images in each category become close, the information carried by an augmented dataset is not as rich as that of a dataset whose category counts are naturally balanced. Therefore, predicting crack quantities on this dataset remains a fairly challenging task.
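The reported sizes imply roughly a fourfold augmentation for the one- and three-crack classes (589 × 4 = 2356, 618 × 4 = 2472) and a sixteenfold augmentation for the four-crack class (143 × 16 = 2288); sixteen variants exceed the eight flip/rotate symmetries of an image, so the four-crack class presumably also varied other parameters. A sketch of generating flip/rotation variants (function name and cap `k` are illustrative, not from the paper):

```python
import numpy as np

def dihedral_variants(img, k):
    """Yield up to k augmented copies of img using only flips and 90-degree
    rotations (the operation family mentioned for the dataset); k <= 8."""
    out = []
    for rot in range(4):
        for flip in (False, True):
            v = np.rot90(img, rot)
            if flip:
                v = np.fliplr(v)
            out.append(v.copy())
            if len(out) == k:
                return out
    return out

img = np.arange(9).reshape(3, 3)
variants = dihedral_variants(img, 4)
assert len(variants) == 4

# Arithmetic from the reported dataset sizes:
assert 589 * 4 == 2356 and 618 * 4 == 2472 and 143 * 16 == 2288
```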
In this study, supplementary details concerning the annotation workflow, validation of inter-annotator agreement, and potential biases associated with the crack counting dataset are provided as follows:
(1)
Annotation workflow: Six public benchmark datasets for crack detection—Crack500, CFD, GAPS384, cracktree200, AEL, and NCD—were integrated to generate 3040 raw images. A connectivity-based judgment algorithm was adopted to preliminarily count cracks in the ground-truth annotations, and manual verification was implemented to finish the primary annotation. Subsequently, joint examination and annotation correction were performed by specialists in road engineering and computer vision. The dataset was then randomly partitioned into training and test subsets at a 9:1 ratio. In response to the imbalanced distribution of samples, data augmentation operations including flipping and rotation were implemented on sample categories other than the two-crack class, thereby balancing the quantity of each category in the training set.
(2)
Validation of inter-annotator agreement: Three professional researchers with at least two years of relevant experience in crack detection were selected to conduct independent blind annotation. Post hoc verification demonstrated that all individual annotations satisfied the criterion of high consistency. For samples with inconsistent annotations, two senior domain experts conducted collective arbitration to confirm the ultimate annotation labels, so as to ensure the accuracy and consistency of the whole annotation process.
(3)
Potential biases in the dataset: The dataset suffers from three categories of inherent biases. First, original sample distribution bias: samples with one to three cracks account for more than 85% of the dataset. Although data augmentation achieved sample size balance, the feature diversity of augmented samples is slightly inferior to that of original samples. Second, scene coverage bias: all samples are restricted to pavement cracks, with no crack images from other civil engineering scenarios such as bridges and tunnels included, which limits the scene generalization ability. Third, annotation subjectivity bias: the connectivity judgment of some low-contrast and narrow cracks is affected by subjective factors, which may introduce minor uncertainty into the annotation of a small number of samples.

4. Method

This section will introduce the proposed method from three aspects. Firstly, we will introduce the design concept and structure of AowFormer. Secondly, we will showcase the structure of CPM and analyze the proofreading function. Finally, we will introduce the construction process and details of CSPNet.

4.1. AowFormer

The Approximate Overlapping Window Transformer (AowFormer) structure borrows and refines the design principles and computational processes of the Transformer while relying on a convolutional framework in its implementation. Its overall architecture is shown in Figure 2, which is composed of an attention computation module and an FFN module connected in series. The left part is the core of the approximate self-attention computation, while the two FFN branches on the right respectively perform channel dimension expansion and dimension preservation, so as to adapt to the feature extraction requirements of crack counting. Traditional ViTs [43], owing to their ability to encode long-range dependencies, have propelled progress in various visual tasks. However, due to the extremely high temporal and spatial complexity of ViT, the images processed by ViT are often reduced in resolution by large factors (such as reducing their height and width to 1/16 of the original). This compromise makes ViT less effective in fine-grained visual tasks and deviates from the established experience in constructing multi-scale and multi-level backbones. Although hierarchical ViTs [35,44] have been proposed to mitigate the substantial memory and time consumption of ViT, we believe that these approaches still have shortcomings.
Firstly, many hierarchical ViTs still employ a downsampling module at the starting position, which, though not as drastic as the downsizing of traditional ViTs, is unacceptable for tasks that require fine-grained crack region locations for crack recognition and counting. As demonstrated in our experiments, backbones with a downsampling module at their starting positions often perform poorly, such as Vision Transformer [43], Swin Transformer [35], and ConvNext [40]. Secondly, many methods reduce computational complexity by confining the self-attention computation to non-overlapping finite windows. However, this non-overlapping window segmentation, even with window shift operations, inevitably leads to a lack of information interaction. For example, when two relevant features are segmented into different windows, there is no communication between them until the next window shift operation, potentially causing the gradual isolation of originally related features and resulting in erroneous feature representations. Considering the effectiveness of the Transformer architecture in information extraction and encoding dependencies, we plan to improve existing Transformer architectures. The improvement goals include being sufficiently lightweight, encoding without any downsampling, and implementing a self-attention mechanism with overlapping windows to avoid the issues associated with non-overlapping windows.
For better understanding, the computation process of the self-attention mechanism in an ordinary ViT is first presented as
$$A = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right); \qquad S = AV, \tag{1}$$
where $Q, K, V \in \mathbb{R}^{B \times C \times HW}$ and $A \in \mathbb{R}^{B \times HW \times HW}$. $B$ represents the batch size, $C$ the number of channels, and $H$ and $W$ the height and width of the image. $A$ is an attention matrix whose size is quadratic in the product of the image height and width, which is the reason for the high temporal and spatial complexity of ViT. Assuming the input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, for simplicity we omit the $B$ and $C$ dimensions in the following analysis. If we label the element in the first row and first column of $X$ as $x_{11}$, then $s_{11}$ can be approximated as
$$s_{11} = x_{11}\left(x_{11}^{2} + x_{12}^{2} + \cdots + x_{HW}^{2}\right), \tag{2}$$
where $s_{11}$ represents the value of the element in the first row and first column of the result of the self-attention computation. This can be understood as follows: taking the point in the first row and first column as a reference, when $Q$ and $K$ undergo the matrix multiplication shown in Equation (1), $a_{11}$ can be approximated as $a_{11} = x_{11}\left(x_{11} + x_{12} + \cdots + x_{HW}\right)$. When $A$ is then multiplied with $V$, the result in Equation (2) is obtained. To conserve computational costs as much as possible while maintaining performance not inferior to, and in certain aspects even surpassing, the self-attention mechanism, we propose our Approximate Overlapping Window Transformer (AowFormer) structure based on the analysis in Equation (2).
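The pattern in Equation (2) can be checked numerically in a stripped-down setting: drop the softmax and scaling, use a single channel, and set Q = K = V = x. The attention output at the first position then equals $x_{11}$ times the sum of squares, matching the equation (a sanity-check sketch, not the full AowFormer computation):

```python
import numpy as np

# Single-channel token vector x standing in for the flattened image.
rng = np.random.default_rng(2)
x = rng.standard_normal(16)            # HW = 16 flattened pixels, C = 1
Q = K = V = x.reshape(-1, 1)

A = Q @ K.T                            # a_1j = x_1 * x_j   (softmax omitted)
S = A @ V                              # s_1  = x_1 * sum_j x_j^2

# Matches Equation (2): s_11 = x_11 * (x_11^2 + x_12^2 + ... + x_HW^2).
assert np.isclose(S[0, 0], x[0] * np.sum(x ** 2))
```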
AowFormer combines the advantageous structures of convolution and the Transformer, constructing a novel, rational, and lightweight model architecture, as illustrated in Figure 2. Specifically, unlike the classical Transformer, when generating the $Q$, $K$, and $V$ samples from $X$, we do not utilize tight channel connections. We employ a $1 \times 1$ depthwise convolution, effectively applying only a scaling factor to the existing channels. This approach aims to retain the independence among different channels as much as possible while also conserving parameters to a certain extent. Following this, based on the computational process of the self-attention mechanism detailed in Equation (2), we designed and built an approximately overlapping window structure. In the typical Transformer flow, computing $a_{11} = x_{11}\left(x_{11} + x_{12} + \cdots + x_{HW}\right)$ involves maintaining a matrix in $\mathbb{R}^{B \times HW \times HW}$, consuming a substantial amount of storage space. Shifting our perspective, we no longer follow the standard Transformer flow; instead, we start directly from the result and construct $s_{11}$ from two components, $x_{11}$ and $x_{11}^{2} + x_{12}^{2} + \cdots + x_{HW}^{2}$. We prioritize constructing the second component. Neglecting $Q$, this component can be seen as the result of element-wise multiplication and summation across $K$ and $V$ over the entire image. We take $Q$ as one part and the multiplication of $K$ and $V$ as the other. Convolution is utilized to handle the element-wise addition after multiplying $K$ and $V$. As illustrated by the operations in the $K$/$V$ branch in Figure 2, $K$ and $V$ are first multiplied element-wise and then aggregated via convolution with $\mathrm{Convs}$ to obtain $\mathrm{Part}_2$. This step constitutes the core of constructing the $7 \times 7$ overlapping window attention, corresponding to the convolutional operation stage in the figure. This process can also be formulated as
$Part_2 = Conv_s(K \odot V)$.
Ideally, $Conv_s$ should adopt large kernels covering the entire geometric scale of the image. However, this would burden the network even more than the conventional self-attention mechanism. One strategy to avoid large kernels is to stack multiple small convolutional kernels to approximate a large one, as is common in contemporary CNN backbones. We adopt a similar strategy, progressively increasing the number of 3 × 3 convolutions to expand the receptive field, which corresponds to the window size in a Transformer. Through experimentation, we derive two critical conclusions: (1) excessively large or small overlapping windows are detrimental to crack counting; when the number of 3 × 3 convolutions is 3, our network achieves optimal effectiveness; (2) contrary to mainstream methods that introduce non-linearity between 3 × 3 convolutions, through theoretical analysis and experimental verification, we determine that, from the perspective of simulating Transformers, non-linearity should not be introduced between the 3 × 3 convolutions. CNNs often employ ReLU activation and batch normalization to process feature maps after convolutions, but these non-linear operations can lead to inconsistent inter-layer information. If AowFormer applied non-linearity, the overlapping window would still be implemented automatically during convolutional stacking, but, unlike the Transformer window, it would fail to ensure that information between different points within the window originates from the same feature map. Introducing non-linearity during the accumulation of small kernels would disrupt this information and impair the effectiveness of the simulated overlapping windows. Hence, we refrain from incorporating any non-linear operations in this process.
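The stacking strategy above can be checked numerically. The NumPy sketch below (our own illustration, with all-ones kernels as a hypothetical stand-in for the learned $Conv_s$ weights) stacks three 3 × 3 convolutions with no non-linearity in between and verifies that the effective receptive field of an impulse is exactly 7 × 7:

```python
import numpy as np

def conv2d_same(x, k):
    """Single-channel 2-D convolution with zero padding ('same' size)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def stacked_window_sum(y, num_convs=3):
    """Stack `num_convs` 3x3 convolutions with NO non-linearity in
    between (all-ones kernels, a hypothetical stand-in for Conv_s)."""
    k3 = np.ones((3, 3))
    for _ in range(num_convs):
        y = conv2d_same(y, k3)
    return y

# The effective receptive field of three stacked 3x3 convolutions is 7x7:
impulse = np.zeros((15, 15))
impulse[7, 7] = 1.0
resp = stacked_window_sum(impulse)
footprint = np.argwhere(resp > 0)
span = footprint.max(axis=0) - footprint.min(axis=0) + 1  # → [7, 7]
```

Note that three stacked 3 × 3 box kernels are not numerically identical to one 7 × 7 box kernel (the stacked version weights the window centre more heavily), but the window extent is the same, which is the property the text relies on.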
Following the processing in Equation (3), the element-wise multiplication of K and V generates a feature map in which each point is the square of $x_{ij}$. Subsequently, the feature map $Part_2$ obtained after the convolutional layers contains, at each point, the sum of its surrounding points. As these points heavily overlap, $Part_2$ maintains the same size as X. Since no additional processing is needed for the first part, Q can directly serve as it. Multiplying Q with $Part_2$ yields the approximate attention result $SA_{ij}$, expressed as
$SA_{ij} = x_{ij} \sum_{m=i-\omega}^{i+\omega} \sum_{n=j-\omega}^{j+\omega} kv_{mn}, \quad kv_{mn} = x_{mn}^2$,
where $\omega$ denotes the half-width of the accumulated overlapping window. The key distinction between $SA$ and $S$ is that $S$ captures relationships between every pair of pixels in the image to obtain weighted results, while $SA$ captures relationships between pixel pairs within the overlapping window, providing an approximate weighted result. $SA$ only imitates $S$ in terms of the computed result without being entirely equivalent in the computation region and approach. It exhibits better performance in the field of crack counting while maintaining a more lightweight structure. Considering the other dimensions of $S$, the batch dimension B can be neglected in most cases; however, the channel dimension C differs between $SA$ and $S$. Traditional Transformers first aggregate channels between each pixel pair, preserving only the geometric dimension, and then restore channels through V. In contrast, AowFormer maintains the channels throughout. Additionally, $Conv_s$ aggregates information in the channel dimension, which is further enriched through multiplication with Q. Explicitly preserving channel features might be the reason why, even when computing attention within smaller overlapping regions, AowFormer can surpass the classic Transformer and maintain superior performance in the field of crack counting.
In the specific implementation, the two element-wise multiplications in AowFormer may cause problems if performed without additional processing. Points in non-crack regions, whose confidence in the original feature map takes very small negative values, may, once squared by the multiplication, be erroneously treated by the network as points within crack regions in subsequent operations. Moreover, since squared values are typically larger, this can hinder the convergence of network learning. Additionally, after the second element-wise multiplication with Q, the values at each point can reach cubic magnitudes, leading to extreme differentiation of the information stored in the feature map. If these phenomena occur frequently in the information flow, they inevitably introduce errors and negatively impact the crack counting task. To avoid these issues, we apply normalization during the element-wise multiplications in AowFormer. Specifically, we apply the sigmoid activation function to K before the first multiplication and to Q before the second, constraining their values to the range 0 to 1 before subsequent operations. This prevents incorrect region classification and addresses the problem of excessively large values in the feature map. Furthermore, to prevent excessive cumulative values caused by $Conv_s$, we retain the scaling design from the Transformer, multiplying the accumulated result by a scaling factor corresponding to $\sqrt{d}$ in Equation (1). Since we do not compute self-attention globally, we set the scaling factor to 1/7, corresponding to the square root of the total number of pixels in the accumulated 7 × 7 window, significantly reducing the likelihood of value spikes.
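The normalized computation just described can be sketched for a single channel as follows. This is our own NumPy illustration: a plain 7 × 7 windowed sum stands in for $Conv_s$, and the function names are ours, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def window_sum7(x):
    """Stand-in for Conv_s: sum over a 7x7 window with zero padding."""
    xp = np.pad(x, 3)
    H, W = x.shape
    return np.array([[xp[i:i + 7, j:j + 7].sum() for j in range(W)]
                     for i in range(H)])

def aow_attention(Q, K, V, scale=1.0 / 7.0):
    """Single-channel sketch of the normalised approximate attention:
    sigmoid bounds K before the first element-wise product and Q before
    the second; the 1/7 scale is the square root of the 7x7 window size."""
    part2 = window_sum7(sigmoid(K) * V)   # Conv_s(K ⊙ V)
    return sigmoid(Q) * part2 * scale     # Q ⊙ Part2, scaled

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 8)) for _ in range(3))
S_A = aow_attention(Q, K, V)
```

Because both sigmoids lie in (0, 1) and the window holds 49 points, the output magnitude is bounded by 7 times the largest value in V, which illustrates how the design suppresses value spikes.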
We also replace the FFN (feedforward network) structure of the traditional Transformer with a simple 1 × 1 Conv + BN + ReLU and eliminate the residual connection between the approximate overlapping window attention mechanism and the input. Instead, we retain the residual connection between the output of the FFN and the input to self-attention, along with the layer normalization after the residual connection or before entering the structure. The differences between AowFormer and the typical Transformer can also be understood more clearly through Table 1. The design merits of AowFormer illustrated in Figure 2, namely the absence of downsampling, the overlapping window attention, and the convolution-augmented core, represent the critical differences between the proposed model and the canonical Transformer. These designs form the foundation for the comparative dimensions listed in Table 1 and successfully strike a balance between a lightweight model and effective feature extraction.
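The block wiring we infer from this description can be summarized as a tiny sketch (our reading of the text, with toy callables standing in for the learned sub-modules):

```python
def cspb_attention_block(x, attn, ffn, norm):
    """Block wiring assumed from the description: normalisation at entry,
    NO residual around the attention itself, and a single residual from
    the block input to the FFN output (FFN = 1x1 Conv + BN + ReLU)."""
    y = attn(norm(x))    # approximate overlapping-window attention
    return ffn(y) + x    # residual connects block input to FFN output

# toy callables standing in for the learned sub-modules
out = cspb_attention_block(3.0,
                           attn=lambda z: 2.0 * z,
                           ffn=lambda z: z + 1.0,
                           norm=lambda z: z)
# out = (2*3 + 1) + 3 = 10.0
```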
AowFormer, starting from the results of the Transformer, undergoes different calculation steps to obtain approximate weighted calculation results within overlapping windows and fine-tunes the other structures in Transformer blocks, achieving excellent performance and a more lightweight architecture. Experiments indicate its outstanding performance in the field of crack counting. However, the improved performance may be attributed to factors such as rich channel information, the applicability of overlapping window attention to crack counting, and appropriate window size selection, rather than the impact of a correction structure specifically targeting crack numbers. To further enhance the discriminative ability for the number of cracks in each image, we propose a Counting Proofreading Module (CPM), which will be introduced in Section 4.2.

4.2. CPM

To accurately determine the number of cracks in each image, additional verification of the counted crack number is required. Specifically, since crack counting is not a fine-grained task, and the network's prediction is ultimately just a scalar, low-level detailed information, while not to be ignored, plays only a supplementary role; the focus should be on high-level semantic information. Feature maps encompassing high-level semantic information filter out a large amount of irrelevant and inaccurate information, retaining only the crucial crack region information. Although using only high-level semantic information for crack region detection may cause discrepancies, fuzzy boundaries, and so on, these issues do not affect the statistical counting of cracks once the task shifts to the crack counting domain. Therefore, we first scale the feature maps input to each block to a uniform size of 20 × 20, regardless of their original scale. The scaled feature map contains multiple channels with mixed positive and negative information. The overall architecture of CPM is illustrated in Figure 3, which adopts a dual-branch design. The upper branch generates positive and negative channel weights through an SE-like structure, while the lower branch obtains positive and negative feature maps via channel splitting. They are finally fused into calibration weights to accomplish crack region verification. To highlight the crack region and achieve counting proofreading, we construct the proposed CPM as follows.
Firstly, we judge the positivity and negativity of the channels. Not all feature maps containing high-level semantic information are beneficial for crack counting. To discern this, we divide them along the channel dimension. We adopt average pooling to aggregate the information of each channel, then utilize an SE-like [10] structure to enlarge the parameter space, aiding in determining the positive and negative character of the channels. Finally, we employ the sigmoid function to obtain a score for each channel. To distinguish positive from negative channels, rather than merely obtaining normalized weights, we set a threshold of 0.5: channels with scores greater than 0.5 are marked as positive, and channels with scores less than 0.5 are marked as negative. Subsequently, for the purpose of proofreading, information within positive channels should undergo comprehensive extraction and optimization, while information within negative channels should be readjusted to weaken its proportion. To this end, we propose a channel segmentation structure.
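The channel-scoring step can be sketched as follows. This is a minimal NumPy illustration assuming a two-layer ReLU bottleneck for the SE-like structure; `W1` and `W2` stand in for learned weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_scores(feat, W1, W2):
    """SE-like channel scoring (sketch). feat: (C, H, W); W1, W2 are
    assumed learned bottleneck weights. Average-pool each channel, pass
    through a two-layer ReLU bottleneck, squash with sigmoid."""
    pooled = feat.mean(axis=(1, 2))           # (C,) channel descriptors
    hidden = np.maximum(W1 @ pooled, 0.0)     # ReLU bottleneck
    return sigmoid(W2 @ hidden)               # per-channel scores in (0, 1)

def split_channels(scores, threshold=0.5):
    """Mark channels above the 0.5 threshold positive, the rest negative."""
    pos = np.flatnonzero(scores > threshold)
    neg = np.flatnonzero(scores <= threshold)
    return pos, neg

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 4, 4))
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 4))
scores = channel_scores(feat, W1, W2)
pos, neg = split_channels(scores)
```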
We do not use the predetermined positive and negative channels to first split and then construct the feature maps. Instead, we initiate construction directly from the entire feature map before differentiation. Specifically, the feature map entering the channel segmentation is the same one that enters the average pooling. To weaken negative information, we set the channel segmentation hyperparameters as large for positive and small for negative, specifically 2/3 for positive and 1/3 for negative. In the initial stage after channel segmentation, positive information is therefore more abundant, while negative information is weaker. We then process the segmented positive and negative initial information. For positive information, we enrich its content by aggregating group information and local information, further extracting information from these two perspectives. This step can be formulated as
$F_{pos}' = GWC(F_{pos}) + LWC(F_{pos})$,
where GWC denotes the group information structure, consisting of a 3 × 3 group convolution, a ReLU activation, and a 1 × 1 convolution; its first convolution expands the channel number from the segmented positive channel number to the overall channel number. Similarly, LWC denotes the local information structure, consisting of a 3 × 3 convolution, batch normalization (BN), ReLU activation, and a 3 × 3 convolution; it likewise expands the channel number to the overall channel number in its first convolution. The number of groups in the group information structure is chosen as 2, introducing different channel combinations in the feature map. Adding any normalization operation to the group information structure does not improve the results, which may be because the chosen structure is simple and normalization is unnecessary for simple modules. For negative information, we adopt a very simple structure without any information extraction process, primarily aiming to restore its channel number. The structure can be formulated as follows:
$F_{neg}' = Concat(F_{neg}, RWC(F_{neg}))$,
where RWC is a 1 × 1 convolution whose output channel number equals the number of channels assigned to the positive side. When its output is concatenated with the original negative feature map, the newly generated feature map $F_{neg}'$ has its channel number restored. After obtaining $F_{pos}'$ and $F_{neg}'$, we multiply them by the weights distinguishing the positive and negative channels and then add the products. This corresponds to the core fusion step in Figure 3: the $F_{pos}'$/$F_{neg}'$ generated by the lower branch and the $W_{pos}$/$W_{neg}$ from the upper branch are multiplied element-wise and then aggregated, constituting the feature and weight fusion module of the dual-branch structure in the figure. This process can also be formulated as follows:
$F_{pro} = W_{pos} \odot F_{pos}' + W_{neg} \odot F_{neg}'$.
This equation can be seen as preserving the enhanced positive feature map at its corresponding positions while filling the remaining negative positions with the unenhanced negative feature map. From the perspective of the implied information, the initial positive and negative feature maps after channel segmentation are likely to contain both the positive and negative information originally split by the channel weights. Positive information, after additional extraction, transmission, and optimization, further weakens internal negative information; during learning, it also tends to mask negative information as much as possible, making the information stored at the original positive positions more reliable and accurate than before. The negative feature map undergoes less processing, preserving its original information composition. Moreover, the negative feature map also contains positive information, and during channel expansion this positive information is inserted into the positions of negative information in the original feature map, contributing to the enhancement of information. According to the design philosophy of SCConv [60], both positive and negative information are needed; therefore, retaining some negative information at negative channel positions benefits network training and prevents extreme prediction phenomena. If negative information is considered as noise, CPM can be seen as adaptively adding noise of varying degree and channel position each time a feature map is input. Moderate noise not only does not affect the convergence of the network but also enhances its robustness, making it more adaptable to extreme situations. Since the proposed crack counting dataset is highly imbalanced, this negative information, also referred to as noise, can significantly enhance the robustness of the network.
Even when facing images from crack-count categories with few samples, the network can still achieve good crack count predictions. Based on the above analysis, we believe that after processing by CPM, the obtained feature map ($F_{pro}$) distinguishes the crack region well and contains abundant positive information alongside some limited negative information.
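To make the channel-segmentation fusion concrete at the shape level, here is a NumPy sketch. The GWC/LWC and RWC branches are replaced by random per-pixel linear maps (our stand-ins, not the paper's learned modules), so only the channel bookkeeping follows the description: a 2/3 vs. 1/3 split, channel restoration on the negative side, and weighted fusion:

```python
import numpy as np

def cpm_fuse(feat, w_pos, w_neg, rng=None):
    """Channel-segmentation fusion sketch with random stand-in weights
    for GWC/LWC and RWC (only the shapes follow the paper).
    feat: (C, H, W); w_pos / w_neg: (C,) channel weights."""
    C, H, W = feat.shape
    c_pos = 2 * C // 3                    # positive split gets 2C/3 channels
    c_neg = C - c_pos                     # negative split gets C/3
    rng = rng or np.random.default_rng(0)
    f_pos_in, f_neg_in = feat[:c_pos], feat[c_pos:]
    # GWC/LWC stand-in: a per-pixel linear map restoring C channels
    A = rng.standard_normal((C, c_pos))
    F_pos = np.tensordot(A, f_pos_in, axes=1)                # (C, H, W)
    # RWC stand-in: 1x1 conv to c_pos channels, concatenated with input
    B = rng.standard_normal((c_pos, c_neg))
    F_neg = np.concatenate([f_neg_in,
                            np.tensordot(B, f_neg_in, axes=1)])  # (C, H, W)
    # fuse positive/negative features with their channel weights
    return w_pos[:, None, None] * F_pos + w_neg[:, None, None] * F_neg

feat = np.random.default_rng(2).standard_normal((6, 4, 4))
w = np.array([1.0, 1, 0, 1, 0, 0])        # hypothetical 0/1 channel marks
F_pro = cpm_fuse(feat, w, 1.0 - w)
```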
The current feature maps, owing to their well-designed nature, can serve as weights to rectify the feature maps. We upsample $F_{pro}$ to the size of the input feature map and apply the sigmoid activation function to generate the weight $W_{pro}$. Where to place the proofreading structure and where to apply these proofreading weights remain points worth considering. Given that AowFormer belongs to the Transformer family, V is a critical element in the final weighted summation, and numerous additional operations, such as extra depthwise convolutions, are typically applied to it. Hence, we similarly take V as the input to CPM to generate count-based rectifying weights. Regarding where to apply the proofreading weights, there are generally two choices: the location where the similarity map is generated, or the position after computing the weighted sum. Because our computation flow differs from the original from the outset, if the first choice is viewed as acting on an intermediate state of the computation, the application position should be where the second component, namely $x_{11}^2 + x_{12}^2 + \cdots + x_{HW}^2$, has been computed. We choose the first scheme because effects applied at intermediate states directly influence the final result, whereas application at the final result can only influence intermediate results through backpropagation. Upon multiplying the second component by the counting-based proofreading weights, the crack regions are rectified, enabling a stronger refinement of crack region extraction when subsequently multiplied with Q. Conversely, applying rectification at the weighted-sum position would make the outcome conform entirely to the proofreading weights, losing the advantages of AowFormer's structural design and computational step optimization.
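The weight-generation and application step can be sketched as follows; nearest-neighbour upsampling is our simplifying assumption (the paper does not specify the interpolation mode), and the function names are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nearest_upsample(x, H, W):
    """Nearest-neighbour upsampling of an (h, w) map to (H, W), a simple
    stand-in for the upsampling used before weight generation."""
    h, w = x.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return x[np.ix_(rows, cols)]

def proofread_part2(part2, f_pro):
    """Apply counting-proofreading weights at the intermediate state:
    W_pro = sigmoid(upsample(F_pro)) multiplies Part2 before the final
    product with Q."""
    H, W = part2.shape
    w_pro = sigmoid(nearest_upsample(f_pro, H, W))
    return part2 * w_pro

rng = np.random.default_rng(3)
part2 = rng.standard_normal((40, 40))     # intermediate Part2 map
f_pro = rng.standard_normal((20, 20))     # CPM output at the 20x20 scale
rectified = proofread_part2(part2, f_pro)
```

Because $W_{pro}$ lies in (0, 1), the rectification can only attenuate $Part_2$ values, never amplify them, which matches its role as a confirmation of crack regions.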
Once the input and output positions of CPM are established, their integration with AowFormer forms the foundational block of the proposed CSPNet, detailed in Section 4.3.
Moreover, a potential question about the structure is why positive and negative weights are not differentiated at the level of individual pixels. The CPM depicted in Figure 3 is grounded entirely in the channel dimension for the construction of positive and negative features and weights, and it does not incorporate any pixel-level discrimination. When distinguishing at each point, every pixel of each feature map would be divided by the weight calculation into crack and non-crack points. During the early stages of training, numerous false positive and false negative points exist, and, due to insufficient learning, many pixel predictions hover around the threshold without confident discrimination. Applying extreme weights would forcibly divide them, undermining the network's ability to adjust erroneous predictions and ultimately affecting the discrimination and counting of crack regions.

4.3. Structure Design of CSPNet

The foundational block formed by the combination of AowFormer and CPM is referred to as the Channel Segmentation Proofreading Block (CSPB). Building upon CSPB, we construct CSPNet, as depicted in Figure 4. With an input channel count of 32, each block with a stride of 2, except in the final stage, doubles the channel count. A CSPB with a stride of 2 uses max pooling for 2× downsampling. Given the discrepancy between input and output channel numbers, an additional 1 × 1 convolution expands the channels for the residual connection, and within the main path, the convolution at the FFN stage transforms the channels. After stacking multiple stages of CSPBs, we compress the geometric dimensions of the feature map using global average pooling. A linear layer then reduces the channels to the number of crack-quantity categories, and a softmax activation yields the predictions; the position with the highest activation corresponds to the predicted crack counting category. To ensure sufficient supervision and correct inaccurate crack counting predictions, we use a balanced cross-entropy loss computed against the ground truth of the crack counting dataset to guide the network toward more accurate quantity prediction.
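The counting head at the end of the network can be sketched as follows (a NumPy illustration of the pooling–linear–softmax–argmax chain; `W` and `b` stand in for learned parameters):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def count_head(feat, W, b):
    """Counting head sketch: global average pooling over (H, W), a linear
    layer to the crack-count categories, softmax, then argmax.
    feat: (C, H, W); W: (num_classes, C); b: (num_classes,)."""
    pooled = feat.mean(axis=(1, 2))         # global average pooling
    probs = softmax(W @ pooled + b)
    return int(np.argmax(probs)), probs

feat = np.ones((4, 2, 2))                   # toy feature map
W = np.array([[0.0, 0, 0, 0],
              [1.0, 1, 1, 1],
              [0.0, 0, 0, 0]])
pred, probs = count_head(feat, W, np.zeros(3))
# pred = 1 (the class whose logit is largest)
```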
Unlike other backbones, we do not design multiple model sizes; for convenience, we only employ the structure with a base channel count of 32. By stacking CSPBs across multiple scales and layers, CSPNet extracts crack features at different levels and predicts the crack quantity on the smallest feature map, which contains the most advanced semantic information. This smallest-scale feature map has more channels and richer contextual semantic information related to cracks, making it suitable for predicting the crack quantity. CSPNet optimizes the information extraction process through AowFormer and further refines crack information using CPM; through repeated stacking, the crack counting effect improves step by step. Supervised training against the ground truth then progressively optimizes the network through backpropagation, attaining more accurate crack prediction in counting scenarios with highly imbalanced samples.

5. Experiment

The main purpose of this paper remains focused on the crack counting task, rather than serving as a general backbone. Therefore, unlike other backbones, it does not undergo validation across multiple tasks, such as image classification, object detection, and image segmentation. Validation is solely conducted within the field of crack counting, and specific details regarding experimental settings, evaluation criteria, comparison results, and ablation results can be found in the following text.

5.1. Experimental Settings

We train CSPNet under the PyTorch 1.10 framework with the following hyperparameter settings: batch size 32, SGD optimizer, learning rate 0.0125, momentum 0.9, weight decay $1 \times 10^{-4}$, 100 maximum epochs, and the learning rate multiplied by 1/10 at epochs 30, 60, and 90. The input crack images undergo various preprocessing operations such as random cropping, random flipping, random erasing, and normalization. To ensure a fair comparison, the compared methods are initialized with the same operations and hyperparameter settings as ours. All methods are trained on the constructed crack counting dataset in the same environment using a single NVIDIA RTX 3090 GPU, manufactured by NVIDIA Corporation (Santa Clara, CA, USA). The loss function used is identical to that of CSPNet.
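The step schedule above can be expressed as a small helper (the function name `lr_at` is ours; the values follow the stated settings):

```python
def lr_at(epoch, base_lr=0.0125, milestones=(30, 60, 90), gamma=0.1):
    """Step schedule from the stated settings: the learning rate is
    multiplied by 1/10 at epochs 30, 60 and 90."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```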

5.2. Evaluation Criteria

Due to the imbalance in crack quantities, the evaluation criteria are arranged in order of how well they reflect prediction effectiveness under imbalanced conditions: mean Average Precision (mAP), mean Recall (mRecall), mean F1-score (mF1-score), and Top1 accuracy. Top1 accuracy refers to selecting the element with the maximum probability as the prediction and calculating how often each prediction matches the ground truth. For binary classification, the equation is given as
$Accuracy = \frac{TP + TN}{TP + FP + TN + FN}$,
where T P , F P , T N , and F N represent true positive, false positive, true negative, and false negative, respectively. While accuracy can assess overall correctness, it may not be a good metric in the presence of sample imbalances. Mean average precision, mean recall, and mean F1-score represent the average values of precision, recall, and F1-score, respectively. These metrics have a better ability to evaluate the predictive performance of models under imbalanced samples compared to Top1 accuracy. The equations for precision, recall, and F1-score are as follows:
$Precision = \frac{TP}{TP + FP}$;
$Recall = \frac{TP}{TP + FN}$;
$F1\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$.
Precision represents the proportion of samples predicted as positive that are actually positive, while Recall represents the proportion of actual positive samples that are correctly predicted as positive. The F1-score takes the harmonic mean of Precision and Recall, reaching its peak only when both are high simultaneously. In the multi-category context, the positive samples for these metrics are those belonging to the current category, while the negative samples are those from the remaining categories. The mean of each metric refers to the result averaged over all categories.
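The macro averaging described above can be computed as follows; this is a standard one-vs-rest implementation (our own sketch, not code from the paper), where the unweighted mean over classes gives minority count categories the same weight as majority ones:

```python
def macro_metrics(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over crack-count classes:
    one-vs-rest counts per class, then an unweighted mean."""
    classes = sorted(set(y_true) | set(y_pred))
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# toy example: two count categories, one sample misclassified
mP, mR, mF1 = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1])
```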

5.3. Comparison Experiment

In this study, state-of-the-art (SOTA) backbones are selected for comparison with the proposed CSPNet. The chosen backbones include ViT [43], MobileNetv3 [61], MobileViT [45], ResNet [37], Swin Transformer [35], ConvNext [40], DeiT [47], ConvMixer [41], MixMim [46], EfficientNet [39], TransNext [48], DaViT [62], RepLKNet [63], Rev-Vit [64], RMT [49], SparseViT [50], vHeat [51], OverLoCK [42], CA-Deit [65], and ViT-5 [66]. For a fair comparison, the output results of the optimal model size for each backbone in crack counting are selected as the final experimental results. All methods are trained in the same experimental environment as CSPNet, with identical hyperparameter settings. Each backbone is trained until convergence, and the results from the best-performing round are chosen for presentation. The results of the comparison experiment are shown in Table 2.
From the experimental results, it can be observed that our method achieves optimal results on all evaluated metrics. Making progress on imbalanced crack counting categories is challenging, and our method outperforms the second-ranked method by 1.88%, 1.34%, 2.32%, and 2.71% on the four metrics, demonstrating the superior performance of CSPNet. Additionally, it is noticeable that the Transformer backbones with modern ViT architectures [35,43,46] and the CNN backbones with Transformer-oriented modern improvements [40,63] perform at the lowest level. We attribute this phenomenon to two main factors: (1) these methods perform several rounds of resolution downsampling in the initial stages (e.g., reducing both height and width to 1/4 of the original size), which filters out a considerable amount of detailed information, leading to a lack of fine details for determining crack contours and ultimately affecting the accuracy of crack counting; (2) the long-range dependencies of Transformer encoding, and the large kernels of modern CNNs, are not suitable for the crack counting task. Crack images often contain noise, and cracks tend to occupy a low pixel ratio. Encoding long-range dependencies not only fails to introduce information helpful for determining crack contours but may also introduce a large number of irrelevant features, ultimately degrading the prediction of crack quantities.
ResNet [37], DeiT [47], and ConvMixer [41] all achieve moderate-level performance. They are all significantly associated with traditional convolutional neural networks: ResNet [37] is a classic network built entirely on convolutional layers; DeiT [47] uses a CNN as a teacher network, employing a distillation strategy to assist the Transformer architecture in learning; and ConvMixer [41] uses depthwise convolution within the Transformer architecture to extract inter-token information and pointwise convolution to extract inter-channel information. All three incorporate some degree of knowledge from convolutional networks. According to the comparison of the methods discussed so far, it is evident that, in the field of crack counting, methods combining Transformer and CNN (DeiT [47], ConvMixer [41]) are better than pure Transformer architectures and even better than conventional CNN networks.
The performances of MobileNetv3 [61], MobileViT [45], EfficientNet [39], TransNext [48], DaViT [62], RepLKNet [63], Rev-Vit [64], RMT [49], SparseViT [50], vHeat [51], OverLoCK [42], CA-Deit [65], and ViT-5 [66] are all commendable. MobileNetv3 [61] and EfficientNet [39] are lightweight networks under a pure CNN architecture, while MobileViT [45], TransNext [48], and SparseViT [50] are lightweight networks under a hybrid architecture. Furthermore, DaViT [62], RepLKNet [63], Rev-Vit [64], RMT [49], vHeat [51], OverLoCK [42], CA-Deit [65], and ViT-5 [66] incorporate advanced knowledge from other fields to improve model performance. The experimental results of these methods suggest that lightweight models are suitable for the crack counting task, with CNNs and out-of-domain knowledge playing a crucial role in the crack counting process. The adaptability of lightweight networks to crack counting may stem from their better generalization and robustness under certain conditions, allowing them to handle noise, variations, and limited samples more effectively. Given the complex road environments and highly imbalanced training categories in crack counting, lightweight networks appear to be more suitable.
CSPNet exhibits a significant advantage on all metrics, attributable to several factors: (1) a relatively lightweight design: using depthwise convolution to generate Q, K, and V, together with channel-scaling structures, enables the network to compute efficiently at a larger scale than Transformer methods; (2) a convolutional foundation: previous experiments demonstrate the superior performance of CNN architectures in crack counting, and adopting a CNN architecture instead of the standard Transformer architecture is an important factor in CSPNet's effectiveness; (3) the introduction of Transformer information: AowFormer mimics the Transformer from the perspective of result approximation within a pure convolutional framework, avoiding the performance decline associated with the standard Transformer structure while incorporating long-range features beneficial for the crack counting task; (4) the role of CPM: on top of the high-level semantic information extracted by AowFormer, CPM further proofreads crack quantities through positive and negative aspects, enhancing the overall performance of CSPNet. It is worth clarifying that even though CSPNet attains state-of-the-art performance in crack counting, its result is only slightly beyond 50%, and numerous other approaches cannot reach 50% on all evaluation indicators. This validates that the crack counting dataset we constructed is a challenging benchmark. The challenge arises not only from the elusive continuity of cracks, which is difficult to identify, but also from the negative effects of imbalanced positive and negative samples and uneven crack types on the network's robustness.
Given the complex nature of the dataset and the fact that the networks do not incorporate excessive parameters or computational complexity, the evaluation performance achieved by all models is reasonable. We believe that with the development of more applicable, task-specialized frameworks or loss functions, significant improvements can be obtained on all evaluation metrics; such progress, however, relies on the sustained efforts of researchers. The effectiveness, rationality, and impact of AowFormer and CPM on crack counting are extensively studied and analyzed in Section 5.4. Notably, although we only test CSPNet on one dataset, this dataset is composed of several crack detection datasets, featuring complex scenarios, diverse crack types, and high difficulty, and is thus sufficient to demonstrate the excellent performance of CSPNet.

5.4. Ablation Study

In this section, we conduct further research on the key components, AowFormer and CPM, which constitute CSPB. The experimental environment, hyperparameter settings, and dataset used in the ablation experiments are identical to those in the comparison experiment.
We first explore whether the combination of AowFormer and CPM will yield better results than using AowFormer alone, as shown in Table 3.
According to the data in Table 3, constructing CSPB and CSPNet using only AowFormer already achieves first-tier performance, slightly below the second-best crack counting results highlighted in blue in Table 2. When the CPM block is inserted into AowFormer, the results become optimal. By achieving results equivalent to the conventional self-attention mechanism on a pure convolutional architecture, AowFormer approximately encodes the dependency relations of all pixels within the overlapping windows, resulting in effective information extraction and filtering. This provides an initial solution for combining convolution and Transformer in crack counting tasks. CPM first utilizes characteristics from both positive and negative aspects to construct positive and negative weights as well as positive and negative feature maps. The positive and negative weights are then multiplied by the corresponding feature maps and added to generate proofreading weights, a process that verifies the accuracy of the crack regions. When connected with AowFormer, it corrects erroneously connected crack regions, ultimately leading to superior crack counting predictions. We do not present data for CPM used alone because CPM is an insertion block which cannot independently form a foundational block. In the following, we further explore the internal structures of AowFormer and CPM.
The experimental results of the internal structure exploration of AowFormer are presented in Table 4. All experiments are conducted on a constructed CSPB, with adjustments focusing on the non-linear construction of Convs, the selection of window sizes, and how windows are constructed in AowFormer. We also investigate whether the lightweight designs for Q, K, and V are reasonable. According to the first two rows of Table 4, as the non-linearity of Convs in AowFormer increases, the prediction performance of crack counting gradually declines. This validates our previous assertion that non-linearity disrupts the consistency of the feature maps, preventing AowFormer from obtaining results that approximate the self-attention calculation. We also explore the number of 3 × 3 convolutions in Convs, where Num 1 and Num 5 denote using one and five 3 × 3 convolutions, respectively, corresponding to accumulated window sizes of 3 × 3 and 11 × 11. The results indicate that neither of these cases performs as well as the 7 × 7 window, and the 11 × 11 window is even less effective than the 3 × 3 window. This confirms our claim that the crack counting task does not require encoding a wide range of dependencies. The fifth row explores how windows are constructed: a 7 × 7 window can be built either by stacking small convolution kernels or by using a single large kernel. As the results show, directly using a large kernel is not as effective as stacking small kernels. However, the fifth row still outperforms the third and fourth rows, indicating that our choice of window size is sound: even an alternative implementation of the 7 × 7 window is better than choosing a different window size. Finally, we investigate whether the lightweight generation of Q, K, and V is meaningful.
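The correspondence between the number of stacked 3 × 3 convolutions and the accumulated window size follows the standard receptive-field rule for stride-1 convolutions; a minimal check (generic formula, not code from the paper):

```python
def accumulated_window(num_convs: int, kernel: int = 3) -> int:
    """Receptive field (window size) of `num_convs` stacked stride-1
    convolutions with the given kernel: each layer adds (kernel - 1)."""
    size = 1
    for _ in range(num_convs):
        size += kernel - 1
    return size

# One, three, and five stacked 3x3 convolutions accumulate
# 3x3, 7x7, and 11x11 windows, matching Num 1 / default / Num 5.
print([accumulated_window(n) for n in (1, 3, 5)])  # [3, 7, 11]
```

Stacking three 3 × 3 kernels reaches the same 7 × 7 window with 27 weights per channel instead of the 49 of a single 7 × 7 kernel, one reason the stacked construction can differ in behavior from the "Direct 7 × 7" variant in Table 4.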
We compare conventional convolution, group convolution, and the depthwise convolution we adopt. As the number of groups increases (i.e., the design becomes more lightweight), the crack counting performance improves. In depthwise convolution, the number of groups equals the number of channels, so using depthwise convolution to generate Q, K, and V saves parameters compared to conventional convolution and makes the computation more efficient. The lighter design achieves better results, confirming the rationality of our choice.
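The parameter savings behind this comparison can be made concrete with the standard weight-count formula for (grouped) 2D convolutions; a small sketch assuming a 64-channel feature map (the channel width is illustrative, not taken from the paper):

```python
def conv_params(in_ch, out_ch, kernel, groups=1, bias=False):
    """Weight count of a 2D convolution: out_ch * (in_ch / groups) * k * k,
    plus one bias term per output channel if bias is used."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    params = out_ch * (in_ch // groups) * kernel * kernel
    return params + (out_ch if bias else 0)

C, K = 64, 3
standard  = conv_params(C, C, K, groups=1)   # conventional convolution
grouped   = conv_params(C, C, K, groups=4)   # group convolution (G = 4)
depthwise = conv_params(C, C, K, groups=C)   # depthwise: one filter per channel
print(standard, grouped, depthwise)  # 36864 9216 576
```

Each increase in the group count divides the weight count accordingly, so the depthwise variant used for Q, K, and V carries 1/64 of the parameters of a conventional 3 × 3 convolution at this width.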
The main purpose of further exploring CPM is to investigate the impact of channel segmentation ratios on the results; the experimental results are shown in Table 5. The first column gives the proportion of positive to negative information in the channels during channel segmentation. The results indicate that when the proportion of negative channels is greater than or equal to that of positive channels (first and second rows of the table), the results under all metrics are even weaker than the case with only AowFormer (first row in Table 3). This is because the segmented positive channels no longer contain sufficient information, and expanding and mining information from so few channels is not conducive to optimizing the positive information. At the same time, the negative information occupies a large number of channels without effective extraction; when multiplied by the negative weights, these channels contribute no more than a smaller number of channels would, leading to wasted information. The 2:1 ratio (ultimately adopted) is more effective than the 3:1 ratio, indicating that excessive compression of the negative channels should be avoided during channel segmentation. The 3:1 results are still better than the first two rows of the table, indicating that a higher proportion of positive channels is generally preferable to a higher proportion of negative channels, which supports our analysis of the poor performance in the first two rows.
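For a given channel count, the positive/negative split implied by each ratio in Table 5 can be computed directly; a small sketch with a hypothetical 48-channel feature map (the floor-rounding convention is our assumption):

```python
def split_channels(c: int, pos: int, neg: int):
    """Number of positive/negative channels for a pos:neg segmentation
    ratio; the positive group gets the rounded-down share, the negative
    group the remainder."""
    p = c * pos // (pos + neg)
    return p, c - p

# ratios explored in Table 5, for a hypothetical 48-channel feature map
for pos, neg in [(1, 2), (1, 1), (2, 1), (3, 1)]:
    print(f"{pos}:{neg} ->", split_channels(48, pos, neg))
```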
Having explored the combination of components and the internals of each component, the rationality and effectiveness of the proposed components are fully validated. Furthermore, to investigate the impact of different crack image ratios, i.e., different degrees of sample imbalance, on the crack counting performance of CSPNet, we conduct the experiments presented in Table 6. From top to bottom, the experiments respectively represent training on the raw dataset without any data augmentation, using flip augmentation to increase the number of other crack images to half that of images with two cracks, using rotation augmentation to achieve the same sample ratio as the previous experiment, and adopting the balanced image ratio used in CSPNet. From the experimental results, we draw the following conclusions: (1) CSPNet achieves the worst crack counting performance without any data augmentation. This is because the number of other crack images differs significantly from that of images with two cracks; such an imbalanced sample ratio causes the model to focus on fitting the feature expression of two cracks during training, which adversely affects the prediction of crack numbers in images of the other categories. (2) Rotation-based data augmentation yields better crack counting than flipping. Although the category proportions in Experiments 2 and 3 are still not completely balanced, the samples are more balanced to a certain degree, since the quantity of two-crack images is only double that of each of the other categories, resulting in more favorable crack counting outcomes. In the horizontal comparison between Experiments 2 and 3, Experiment 2 obtains an exceptionally high score only on the Mean AP indicator, whereas its scores on all the remaining indicators are inferior to those of Experiment 3.
This demonstrates that rotation can supply the model with more abundant crack clues than flipping for the crack counting task. (3) The current image ratio adopted in CSPNet achieves the optimal comprehensive performance. Although the quantities of different image types are not perfectly equal, the small disparity indicates that the sample ratio employed by CSPNet can be regarded as balanced. In comparison with Experiments 2 and 3, CSPNet with the balanced sample ratio significantly outperforms the other two experiments in the Top-1 Accuracy, Mean Recall, and Mean F1 metrics. Particularly in the Mean F1 metric, it surpasses Experiments 2 and 3 by 22.23% and 10.51%, respectively, which verifies the effectiveness and rationality of balanced samples. It is worth noting that both Experiments 2 and 3 outperform Experiment 4 in the Mean AP metric, with Experiment 2 achieving a substantial advantage. We attribute such an advantage in Mean AP to sacrifices made in other metrics, and excellent performance solely in the Mean AP metric is insufficient to justify the correctness of the adopted image ratio. In summary, after careful comparison and evaluation, we finally decide to use the balanced image ratio for training, validation, and testing of CSPNet.
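The metrics reported in Tables 2–6 are standard multi-class measures; below is a sketch of Top-1 accuracy and the macro-averaged ("Mean") recall and F1, under the assumption that the paper's "Mean" metrics are macro averages over the four count classes (Mean AP additionally requires per-class confidence scores and is omitted here):

```python
import numpy as np

def counting_metrics(y_true, y_pred, num_classes=4):
    """Top-1 accuracy plus macro-averaged recall and F1 over crack-count
    classes. Standard multi-class definitions; assumed to match the
    paper's 'Mean' (macro) metrics."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    top1 = float((y_true == y_pred).mean())
    recalls, f1s = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        recalls.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return top1, float(np.mean(recalls)), float(np.mean(f1s))

# classes 0..3 encode one to four cracks per image
t = [0, 0, 1, 1, 2, 2, 3, 3]
p = [0, 1, 1, 1, 2, 3, 3, 3]
print(counting_metrics(t, p))
```

Because macro averaging weights every class equally regardless of its sample count, these metrics are exactly the ones that suffer when an imbalanced training set neglects the rare count classes, which is consistent with the pattern in Table 6.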

6. Conclusions

Based on the practical requirements of road maintenance, and drawing on fields that can assist crack detection, we propose the crack counting task, which involves judging, predicting, and outputting the quantity of road cracks. Owing to the slender geometry of cracks and their close physical proximity, this task is fairly challenging. According to the characteristics of the task, we propose that the network should extract rich high-level semantic features and perform additional proofreading to enhance crack counting. To achieve this, we construct the Approximate Overlapping Window Transformer (AowFormer) and the Counting Proofreading Module (CPM). Specifically, AowFormer optimizes the calculation process of the Transformer by approximating self-attention results within overlapping window areas, achieving more efficient and appropriate semantic information extraction. CPM is a plug-and-play module that builds positive and negative weights over the channels, allocates the information coverage of the positive and negative feature maps through channel segmentation, and obtains positive and negative feature maps with different emphases through structural design; after multiplication and aggregation with the positive and negative weights, it achieves more accurate crack region localization, ultimately aiding crack counting. The combination of AowFormer and CPM forms CSPB, and multiple stacked CSPBs form the CSPNet proposed in this paper. In extensive validation on the crack counting dataset annotated in this paper, CSPNet demonstrates leading advantages on multiple metrics, proving the rationality of its structural design. Additionally, the effectiveness and rationality of AowFormer and CPM are thoroughly explored and confirmed.

Author Contributions

Conceptualization, M.S. and H.Z.; methodology, M.S.; software, F.X.; validation, F.Z.; formal analysis, F.X.; investigation, J.Z.; resources, F.Z.; data curation, H.Z.; writing—original draft preparation, M.S.; writing—review and editing, M.S.; visualization, F.X. and H.Z.; supervision, J.Z.; project administration, F.Z. and H.Z.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jilin Provincial Department of Education Science and Technology Research Project, grant number JJKH20251376KJ, and the Jilin Provincial Natural Science Foundation Project, grant number YDZJ202501ZYTS589 and YDZJ202501ZYTS619.

Data Availability Statement

Data will be made available upon request.

Acknowledgments

The authors are grateful to the anonymous reviewers for their insightful comments which have certainly improved this paper.

Conflicts of Interest

Author Fangai Xu was employed by China Mobile Communications Group Jilin Co., Ltd., and author Fachao Zhang was employed by the Longwang Township Comprehensive Service Center, Nongan County. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A Deep Hierarchical Feature Learning Architecture for Crack Segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  2. Schmugge, S.J.; Rice, L.; Lindberg, J.; Grizziy, R.; Joffey, C.; Shin, M.C. Crack segmentation by leveraging multiple frames of varying illumination. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; IEEE: New York, NY, USA, 2017; pp. 1045–1053. [Google Scholar]
  3. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1525–1535. [Google Scholar] [CrossRef]
  4. Guo, F.; Qian, Y.; Liu, J.; Yu, H. Pavement crack detection based on transformer network. Autom. Constr. 2023, 145, 104646. [Google Scholar] [CrossRef]
  5. Sun, M.; Zhao, H.; Li, J. Road crack detection network under noise based on feature pyramid structure with feature enhancement (road crack detection under noise). IET Image Process. 2022, 16, 809–822. [Google Scholar] [CrossRef]
  6. Inoue, Y.; Nagayoshi, H. Crack detection as a weakly-supervised problem: Towards achieving less annotation-intensive crack detectors. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 65–72. [Google Scholar]
  7. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  8. Li, J.; Yuan, C.; Wang, X.; Chen, G.; Ma, G. Semi-supervised crack detection using segment anything model and deep transfer learning. Autom. Constr. 2025, 170, 105899. [Google Scholar] [CrossRef]
  9. Sun, M.; Zhao, H.; Liu, P.; Zhou, J. A multi-task mean teacher with two stage decoder for semi-supervised crack detection. Multimed. Tools Appl. 2024, 83, 59519–59536. [Google Scholar] [CrossRef]
  10. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  11. Subirats, P.; Dumoulin, J.; Legeay, V.; Barba, D. Automation of pavement surface crack detection using the continuous wavelet transform. In Proceedings of the 2006 International Conference on Image Processing, Las Vegas, NV, USA, 26–29 June 2006; IEEE: New York, NY, USA, 2006; pp. 3037–3040. [Google Scholar]
  12. Huang, W.; Zhang, N. A novel road crack detection and identification method using digital image processing techniques. In Proceedings of the 2012 7th International Conference on Computing and Convergence Technology (ICCCT), Seoul, Republic of Korea, 3–5 December 2012; IEEE: New York, NY, USA, 2012; pp. 397–400. [Google Scholar]
  13. Xu, W.; Tang, Z.; Zhou, J.; Ding, J. Pavement crack detection based on saliency and statistical features. In Proceedings of the 2013 IEEE International Conference on Image Processing, Melbourne, Australia, 15–18 September 2013; IEEE: New York, NY, USA, 2013; pp. 4093–4097. [Google Scholar]
  14. Zou, Q.; Cao, Y.; Li, Q.; Mao, Q.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett. 2012, 33, 227–238. [Google Scholar] [CrossRef]
  15. Tang, J.; Gu, Y. Automatic crack detection and segmentation using a hybrid algorithm for road distress analysis. In Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics, Manchester, UK, 13–16 October 2013; IEEE: New York, NY, USA, 2013; pp. 3026–3030. [Google Scholar]
  16. Kapela, R.; Śniatała, P.; Turkot, A.; Rybarczyk, A.; Pożarycki, A.; Rydzewski, P.; Wyczałek, M.; Błoch, A. Asphalt surfaced pavement cracks detection based on histograms of oriented gradients. In Proceedings of the 2015 22nd International Conference Mixed Design of Integrated Circuits & Systems (MIXDES), Toruń, Poland, 25–27 June 2015; IEEE: New York, NY, USA, 2015; pp. 579–584. [Google Scholar]
  17. Quintana, M.; Torres, J.; Menéndez, J.M. A simplified computer vision system for road surface inspection and maintenance. IEEE Trans. Intell. Transp. Syst. 2015, 17, 608–619. [Google Scholar] [CrossRef]
  18. Zakeri, H.; Nejad, F.M.; Fahimifar, A.; Torshizi, A.D.; Zarandi, M.F. A multi-stage expert system for classification of pavement cracking. In Proceedings of the 2013 Joint IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS), Edmonton, AB, Canada, 24–28 June 2013; IEEE: New York, NY, USA, 2013; pp. 1125–1130. [Google Scholar]
  19. Maode, Y.; Shaobo, B.; Kun, X.; Yuyao, H. Pavement crack detection and analysis for high-grade highway. In Proceedings of the 2007 8th International Conference on Electronic Measurement and Instruments, Xi’an, China, 16–18 August 2007; IEEE: New York, NY, USA, 2007; pp. 4–548. [Google Scholar]
  20. Amhaz, R.; Chambon, S.; Idier, J.; Baltazart, V. Automatic crack detection on two-dimensional pavement images: An algorithm based on minimal path selection. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2718–2729. [Google Scholar] [CrossRef]
  21. Zhang, L.; Yang, F.; Zhang, Y.D.; Zhu, Y.J. Road crack detection using deep convolutional neural network. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: New York, NY, USA, 2016; pp. 3708–3712. [Google Scholar]
  22. Pauly, L.; Hogg, D.; Fuentes, R.; Peel, H. Deeper networks for pavement crack detection. In Proceedings of the 34th ISARC, Taipei, Taiwan, 28 June–1 July 2017; IAARC: Oulu, Finland, 2017; pp. 479–485. [Google Scholar]
  23. Feng, C.; Liu, M.Y.; Kao, C.C.; Lee, T.Y. Deep active learning for civil infrastructure defect detection and classification. Comput. Civ. Eng. 2017, 2017, 298–306. [Google Scholar]
  24. Eisenbach, M.; Stricker, R.; Seichter, D.; Amende, K.; Debes, K.; Sesselmann, M.; Ebersbach, D.; Stoeckert, U.; Gross, H.M. How to get pavement distress detection ready for deep learning? A systematic approach. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: New York, NY, USA, 2017; pp. 2039–2047. [Google Scholar]
  25. Chen, Z.; Zhang, J.; Lai, Z.; Zhu, G.; Liu, Z.; Chen, J.; Li, J. The devil is in the crack orientation: A new perspective for crack detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6653–6663. [Google Scholar]
  26. Su, P.; Han, H.; Liu, M.; Yang, T.; Liu, S. MOD-YOLO: Rethinking the YOLO architecture at the level of feature information and applying it to crack detection. Expert Syst. Appl. 2024, 237, 121346. [Google Scholar] [CrossRef]
  27. Dong, X.; Liu, Y.; Dai, J. Concrete surface crack detection algorithm based on improved YOLOv8. Sensors 2024, 24, 5252. [Google Scholar] [CrossRef] [PubMed]
  28. Lau, S.L.; Chong, E.K.; Yang, X.; Wang, X. Automated pavement crack segmentation using u-net-based convolutional neural network. IEEE Access 2020, 8, 114892–114899. [Google Scholar] [CrossRef]
  29. Xu, H.; Su, X.; Wang, Y.; Cai, H.; Cui, K.; Chen, X. Automatic bridge crack detection using a convolutional neural network. Appl. Sci. 2019, 9, 2867. [Google Scholar] [CrossRef]
  30. Zhou, Q.; Qu, Z.; Li, Y.X.; Ju, F.R. Tunnel crack detection with linear seam based on mixed attention and multiscale feature fusion. IEEE Trans. Instrum. Meas. 2022, 71, 1–11. [Google Scholar] [CrossRef]
  31. Pang, J.; Zhang, H.; Zhao, H.; Li, L. DcsNet: A real-time deep network for crack segmentation. Signal Image Video Process. 2022, 16, 911–919. [Google Scholar] [CrossRef]
  32. Luo, J.; Lin, H.; Wei, X.; Wang, Y. Adaptive Canny and Semantic Segmentation Networks Based on Feature Fusion for Road Crack Detection. IEEE Access 2023, 11, 51740–51753. [Google Scholar] [CrossRef]
  33. Khan, M.A.M.; Kee, S.H.; Nahid, A.A. Vision-based concrete-crack detection on railway sleepers using dense U-Net model. Algorithms 2023, 16, 568. [Google Scholar] [CrossRef]
  34. Ai, W.; Zou, J.; Liu, Z.; Wang, S.; Teng, S. Light propagation and multi-scale enhanced DeepLabV3+ for underwater crack detection. Algorithms 2025, 18, 462. [Google Scholar] [CrossRef]
  35. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  36. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  38. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  39. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: London, UK, 2019; pp. 6105–6114. [Google Scholar]
  40. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 11976–11986. [Google Scholar]
  41. Trockman, A.; Kolter, J.Z. Patches are all you need? arXiv 2022, arXiv:2201.09792. [Google Scholar]
  42. Lou, M.; Yu, Y. OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 128–138. [Google Scholar]
  43. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  44. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  45. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  46. Liu, J.; Huang, X.; Liu, Y.; Li, H. Mixmim: Mixed and masked image modeling for efficient visual representation learning. arXiv 2022, arXiv:2205.13137. [Google Scholar]
  47. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: London, UK, 2021; pp. 10347–10357. [Google Scholar]
  48. Shi, D. Transnext: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 17773–17783. [Google Scholar]
  49. Fan, Q.; Huang, H.; Chen, M.; Liu, H.; He, R. Rmt: Retentive networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 5641–5651. [Google Scholar]
  50. Su, L.; Ma, X.; Zhu, X.; Niu, C.; Lei, Z.; Zhou, J.Z. Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics-centered, parameter-efficient image manipulation localization through spare-coding transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 3 March 2025; Volume 39, pp. 7024–7032. [Google Scholar]
  51. Wang, Z.; Liu, Y.; Tian, Y.; Liu, Y.; Wang, Y.; Ye, Q. Building Vision Models upon Heat Conduction. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 9707–9717. [Google Scholar]
  52. Xu, J.; Le, H.; Nguyen, V.; Ranjan, V.; Samaras, D. Zero-shot object counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15548–15557. [Google Scholar]
  53. Zhu, H.; Yuan, J.; Yang, Z.; Guo, Y.; Wang, Z.; Zhong, X.; He, S. Zero-shot object counting with good exemplars. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 368–385. [Google Scholar]
  54. Liu, X.; Li, G.; Qi, Y.; Yan, Z.; Zhang, W.; Qing, L.; Huang, Q. Dynamic example network for class-agnostic object counting. Pattern Recognit. 2026, 170, 111998. [Google Scholar] [CrossRef]
  55. He, J.; Li, P.; Geng, Y.; Xie, X. Fastinst: A simple query-based model for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 23663–23672. [Google Scholar]
  56. Wei, Z.; Chen, P.; Yu, X.; Li, G.; Jiao, J.; Han, Z. Semantic-aware sam for point-prompted instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 3585–3594. [Google Scholar]
  57. Akramin, M.; Marizi, M.; Husnain, M.; Shamil Shaari, M. Analysis of surface crack using various crack growth models. J. Phys. Conf. Ser. 2020, 1529, 042074. [Google Scholar] [CrossRef]
  58. Dai, Q.; Ishfaque, M.; Khan, S.U.R.; Luo, Y.L.; Lei, Y.; Zhang, B.; Zhou, W. Image classification for sub-surface crack identification in concrete dam based on borehole CCTV images using deep dense hybrid model. Stoch. Environ. Res. Risk Assess. 2025, 39, 4637–4654. [Google Scholar] [CrossRef]
  59. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  60. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6153–6162. [Google Scholar]
  61. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1314–1324. [Google Scholar]
  62. Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; Yuan, L. Davit: Dual attention vision transformers. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXIV; Springer: Cham, Switzerland, 2022; pp. 74–92. [Google Scholar]
  63. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  64. Mangalam, K.; Fan, H.; Li, Y.; Wu, C.Y.; Xiong, B.; Feichtenhofer, C.; Malik, J. Reversible vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10830–10840. [Google Scholar]
  65. Han, D.; Li, T.; Wang, Z.; Huang, G. Vision Transformers are Circulant Attention Learners. arXiv 2025, arXiv:2512.21542. [Google Scholar] [CrossRef]
  66. Wang, F.; Ren, S.; Zhang, T.; Neskovic, P.; Bhattad, A.; Xie, C.; Yuille, A. ViT-5: Vision Transformers for The Mid-2020s. arXiv 2026, arXiv:2602.08071. [Google Scholar] [CrossRef]
Figure 1. Examples of images with different crack quantities in our proposed crack counting dataset. From left to right, each column corresponds to images containing one, two, three, and four cracks, respectively. The quantity of cracks in each image is determined through a connectivity judgment tool applied to the original ground truth of each image, followed by human verification. Images containing four or more cracks are recorded as containing four cracks.
Figure 2. The architecture of our proposed AowFormer. AowFormer is constructed by connecting Attention Module and FFN. On the left is the AowFormer attention computation process, and on the right are two types of FFNs. The upper FFN-CE expands the channel quantity of the output features, while the lower FFN maintains consistency between the input and output feature channel quantities.
Figure 3. The architecture of CPM. The downsampled input feature X is fed into different branches. The upper branch employs an SE-like structure to obtain positive and negative weights. The lower branch utilizes a channel segmentation structure to obtain positive and negative features. Subsequently, the positive and negative weights are multiplied by the corresponding positive and negative features and added to obtain the calibration feature. The calibration feature undergoes upsampling and is then processed by the Sigmoid function to obtain the calibration weight W_pro.
Figure 4. The overall architecture of CSPNet. Our CSPNet consists of four stages, each comprising several CSPBs. The first CSPB in Stage 2 and Stage 3 expands the channels to twice that of the previous stage and reduces the height and width of the feature map by half. The first CSPB in Stage 4 only reduces the feature map size. The specific structures of different CSPBs are illustrated at the bottom of the image.
Table 1. The main differences between the classical Transformer and the proposed AowFormer point by point.
Comparison Dimension | Typical Transformer | AowFormer
Core Design Concept | Self-attention core | Self-attention approximation
Self-Attention Computation | Calculate weights then feature weighting | Feature fusion + query vector
Attention Scope | ViT: Global; Swin: Windowed with feature isolation | 7 × 7 overlapping window
Convolution Usage | Auxiliary to self-attention | Core of self-attention approximation
Nonlinear Operations | FFN uses ReLU/BN | No nonlinearity in Conv stages
Channel Dimension | Aggregate then restore | Preserved throughout; richer feature expression
Complexity and Parameters | Quadratic complexity; high parameters | Conv-level complexity; far fewer parameters
Feature Extraction Focus | Global dependencies; noise-prone | Local overlapping windows; noise-suppressed
Table 2. Comparison results on the crack counting dataset. For each metric, the top two best-performing results are highlighted in red and blue, respectively.
MethodTop1 Acc.Mean APMean RecallMean F1
ViT [43]45.8350.8245.7043.99
MobileNetv3 [61]55.2155.6054.4154.54
MobileViT [45]52.4355.7351.7850.19
ResNet [37]48.2648.9447.7448.25
Swin Transformer [35]43.0647.9140.9632.24
ConvNext [40]37.8551.2936.9129.50
DeiT [47]51.0451.0751.5150.12
ConvMixer [41]51.7451.3952.0851.22
MixMim [46]35.4238.6935.8332.31
EfficientNet [39]54.5155.9553.7554.15
TransNext [48]53.4753.0454.5251.49
DaViT [62]54.5154.4854.6354.48
RepLKNet [63]53.1253.9851.9152.25
Rev-Vit [64]53.1253.1652.6152.80
RMT [49]54.1752.8054.6550.15
SparseViT [50]54.1754.3853.8653.89
vHeat [51]53.8253.3953.4453.42
OverLoCK [42]54.1754.0654.3053.92
CA-Deit [65]52.7852.6852.6352.61
ViT-5 [66]52.7852.6952.5452.14
CSPNet (ours)56.2556.7055.9256.02
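For reference, metrics of the kind reported in Tables 2–6 can be computed from per-class counts. The sketch below assumes "Mean AP", "Mean Recall", and "Mean F1" denote macro averages of per-class precision, recall, and F1 over the crack-count classes; the paper's exact metric definitions may differ.

```python
def macro_metrics(y_true, y_pred, classes):
    """Top-1 accuracy and macro-averaged precision/recall/F1, in %."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls, f1s = [], [], []
    for c in classes:
        # Per-class confusion counts, treating class c as "positive".
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return (100 * acc, 100 * sum(precisions) / n,
            100 * sum(recalls) / n, 100 * sum(f1s) / n)


# Toy example with three hypothetical crack-count classes.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
print(macro_metrics(y_true, y_pred, classes=[1, 2, 3]))
```

Macro averaging gives every class equal weight regardless of its sample count, which is why these metrics are informative under the imbalanced-sample conditions studied in Table 6.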
Table 3. Ablation study of AowFormer and CPM.
| Components | Top1 Acc. | Mean AP | Mean Recall | Mean F1 |
| --- | --- | --- | --- | --- |
| AowFormer | 54.51 | 55.12 | 54.10 | 54.37 |
| CPM | 56.25 | 56.70 | 55.92 | 56.02 |
Table 4. Further exploration of the internal structure of AowFormer.
| Operations | Top1 Acc. | Mean AP | Mean Recall | Mean F1 |
| --- | --- | --- | --- | --- |
| +ReLU | 55.21 | 55.93 | 55.20 | 55.25 |
| +BN and ReLU | 54.17 | 54.87 | 53.60 | 54.03 |
| Num 1 | 54.17 | 55.11 | 54.04 | 54.49 |
| Num 5 | 53.47 | 53.66 | 53.55 | 53.52 |
| Direct 7 × 7 | 54.86 | 55.44 | 55.39 | 54.63 |
| QKV G = 1 | 54.51 | 54.45 | 54.10 | 54.26 |
| QKV G = 4 | 55.21 | 55.91 | 54.93 | 55.34 |
Table 5. Further exploration of the internal structure of CPM.
| Ratio | Top1 Acc. | Mean AP | Mean Recall | Mean F1 |
| --- | --- | --- | --- | --- |
| 1:2 | 54.51 | 53.95 | 54.45 | 53.88 |
| 1:1 | 53.82 | 54.41 | 53.79 | 53.72 |
| 2:1 | 56.25 | 56.70 | 55.92 | 56.02 |
| 3:1 | 55.56 | 56.32 | 55.18 | 55.60 |
Table 6. Exploration of the impact of different sample ratios on the crack counting performance of CSPNet.
| Imbalance Conditions | Top1 Acc. | Mean AP | Mean Recall | Mean F1 |
| --- | --- | --- | --- | --- |
| Raw ratio | 43.06 | 29.58 | 40.79 | 30.86 |
| Imbalanced Sample 1 | 50.69 | 69.96 | 50.22 | 45.83 |
| Imbalanced Sample 2 | 52.43 | 58.88 | 51.78 | 50.69 |
| Balanced ratio | 56.25 | 56.70 | 55.92 | 56.02 |
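One simple way to obtain a "balanced ratio" setting like the last row of Table 6 is to randomly oversample minority classes until every class matches the largest one. This is a hedged sketch: the paper does not specify its resampling scheme, and the function and data names here are illustrative.

```python
import random
from collections import Counter


def oversample_to_balance(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    reaches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical imbalanced split: 6/3/1 images across three classes.
samples = [f"img_{i}" for i in range(10)]
labels = [1] * 6 + [2] * 3 + [3] * 1
bal_samples, bal_labels = oversample_to_balance(samples, labels)
print(Counter(bal_labels))  # every class now has the same count
```

Oversampling keeps all majority-class data at the cost of duplicated minority samples; undersampling the majority classes is the complementary option when duplication risks overfitting.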

Share and Cite

MDPI and ACS Style

Sun, M.; Xu, F.; Zhang, F.; Zhao, J.; Zhao, H. Channel Segmentation Proofreading Network for Crack Counting with Imbalanced Samples. Algorithms 2026, 19, 236. https://doi.org/10.3390/a19030236

