1. Introduction
Concrete is one of the most widely used engineering materials in modern infrastructure, forming the physical basis of bridges, tunnels, buildings, rail systems, marine facilities, and many other mission-critical assets [1]. Despite its excellent compressive capacity and broad constructability, concrete remains a quasi-brittle material whose tensile resistance is comparatively weak [2]. Under long-term service conditions, the combined effects of mechanical loading, shrinkage, thermal gradients, carbonation, chloride ingress, alkali-aggregate reaction, and other environmental actions make surface cracking nearly unavoidable [3]. Once cracks appear and propagate, they do not merely affect visual appearance; more importantly, they create direct pathways for water, oxygen, chlorides, and carbon dioxide to reach reinforcing steel [4]. This accelerates corrosion, induces cover delamination and spalling, reduces effective load-carrying area, and may eventually compromise structural safety and durability [5].
From the perspectives of macroeconomics and public safety, the aging of infrastructure has become a severe global challenge. According to the Global Infrastructure Health White Paper released in 2025, approximately 35% of the bridges in use worldwide have exceeded their design lifespan and require frequent, high-precision monitoring and maintenance [6]. The annual investment in repairing defects in transportation infrastructure in the United States and China alone is estimated at hundreds of billions of dollars, and this figure is still growing at a compound annual growth rate of about 5% [7]. This situation requires engineers to shift their focus from “damage repair” to “damage prediction”. In this transition, the precise extraction and quantitative assessment of crack characteristics (such as position, direction, length, and width) have become the core basis for judging the safety level of a structure and formulating maintenance plans.
In current engineering operation and maintenance practice, crack monitoring still relies largely on traditional manual visual inspection. This approach has played an important role over the past half century, but it has significant limitations when applied to modern, large-scale infrastructure. First, manual inspection is highly subjective: the results often vary with the inspector's professional experience, visual acuity, and even physical and psychological state on the day, resulting in poor consistency and repeatability of the data [8]. Second, for high-risk locations at height, such as the underside of long-span bridges and tall piers, manual inspection often requires erecting scaffolding or renting aerial inspection vehicles. This not only poses significant operational risks but also incurs very high logistics and time costs [9]. Third, it is difficult to achieve high-frequency coverage of large structures through manual methods, and the resulting lag in data collection may cause managers to miss the optimal time for strengthening [10].
To address these problems, automated crack identification technologies based on computer vision (CV) have been investigated. Deep learning approaches, such as convolutional neural networks (CNNs) and Vision Transformers (ViTs), have achieved significant breakthroughs in crack classification, object detection, and pixel-level semantic segmentation [11]. Compared with manual methods, vision-based inspection offers inherent advantages, including non-contact operation, low cost, high efficiency, and digital traceability, which make the digital inspection of infrastructure possible [12].
Despite these advances, CV methods face a significant challenge in precise quantification because image data are inherently 2D. In ideal laboratory tests, the camera is typically positioned directly in front of the specimen, with the optical axis orthogonal to the structural surface. Under this viewpoint, the spatial resolution of the pixels is uniform, and pixel widths can be converted directly to physical widths once the scale factor is known. In actual inspections, however, crack images are often captured at an oblique angle because of the physical limitations of the on-site working space, the geometric irregularity of structural surfaces, and random deviations in the flight attitude of unmanned aerial vehicles [7]. This non-orthogonal viewpoint causes severe perspective distortion, so that pixels no longer sample the surface uniformly and the apparent pixel width of a crack is related to its true physical width through a non-linear mapping. Furthermore, variable lighting and shadows, dirt, water seepage, and the texture of the concrete itself in outdoor environments all pose significant challenges for reliably extracting sub-millimeter cracks under varying viewpoints [13].
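To make this non-linear mapping concrete, the following minimal Python sketch uses a first-order pinhole approximation, which is an illustrative assumption rather than the model used in this study: for a crack near the principal point whose width lies along the tilt direction of a surface inclined by θ at depth Z, the image width is roughly f·w·cos θ / Z pixels. Converting that back with a single scale factor calibrated in a fronto-parallel view then misestimates the width. The focal length, depths, and function names are hypothetical.

```python
import math


def apparent_width_px(width_mm, depth_mm, tilt_deg, focal_px):
    """First-order image width (pixels) of a crack of physical width width_mm.

    Assumes the crack lies near the principal point and its width runs along
    the tilt direction, so the width is foreshortened by cos(tilt).
    """
    return focal_px * width_mm * math.cos(math.radians(tilt_deg)) / depth_mm


def naive_width_mm(width_px, mm_per_px):
    """Pixel-to-mm conversion with one scale factor calibrated fronto-parallel."""
    return width_px * mm_per_px


# Hypothetical setup: f = 600 px, scale calibrated at 300 mm facing the surface.
FOCAL_PX = 600.0
MM_PER_PX = 300.0 / FOCAL_PX  # 0.5 mm per pixel at the calibration depth

for depth, tilt in [(300.0, 0.0), (300.0, 45.0), (400.0, 45.0)]:
    px = apparent_width_px(0.3, depth, tilt, FOCAL_PX)
    print(f"Z={depth:.0f} mm, tilt={tilt:.0f} deg -> "
          f"{px:.3f} px -> naive {naive_width_mm(px, MM_PER_PX):.3f} mm")
```

Under these assumptions, a 0.3 mm crack is recovered correctly only in the calibration pose; at 45° it is reported as about 0.21 mm, and at 45° and 400 mm as about 0.16 mm, illustrating why a single pixel-to-millimeter factor is not sufficient under oblique views.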
To improve the robustness of crack quantification under various viewpoints, researchers have introduced stereo vision, laser-assisted ranging, RGB-D sensors, and various three-dimensional reconstruction strategies [14]. These methods are promising because they provide geometric information that can support plane estimation and scale recovery. However, current solutions still face several practical challenges. Many systems depend on expensive cameras, long-focal-length optics, rigid multi-sensor mounting, and frequent calibration. In outdoor or large-span inspection settings, these requirements increase cost, reduce portability, and complicate deployment. Even when depth sensing is available, strong sunlight, long measurement ranges, multipath interference, and noisy point clouds can degrade geometric accuracy. Furthermore, complex backgrounds in real structures, such as stair edges, construction joints, stains, and surface texture, often mimic the appearance of cracks and create false positives during segmentation.
This study proposes a method and an evaluation framework for crack width measurement under non-orthogonal imaging conditions using RGB-D sensors. The method first estimates a dominant plane in the captured RGB-D image using a plane-fitting algorithm based on Random Sample Consensus (RANSAC) [15]. From the equation of the fitted plane, a homography matrix is computed to rectify the plane to a standard view, in which the transformed plane is parallel to the image plane and located at a predetermined distance from it. In parallel, the RGB image is binarized by combining multiple thresholding mechanisms to obtain crack masks. The regions of interest (ROIs) for crack quantification are then extracted by combining the binarized images with the inliers of the RANSAC plane fit, a process termed dynamic masking in this research.
The novelty of the proposed crack width measurement is that it can be applied in scenes containing multiple planes. In conventional image-based approaches, crack width is usually calculated directly from pixel distances, which makes the result sensitive to viewpoint changes and imaging distance; moreover, the crack must lie in a single plane. In this study, the crack-bearing surface is first estimated from RGB-D data and then used as a geometric reference for rectification and width reconstruction. This allows the measured width to be related to the physical crack plane rather than to the distorted image plane. In addition, the evaluation is designed to compare the same physical crack regions across different viewpoints, which provides a more consistent basis for assessing measurement robustness.
This research evaluates the robustness of existing binarization-based methods and the proposed method using an RGB-D image dataset capturing a crack from different perspectives. The research aligns planes in different images in the dataset to enable the consistent evaluation of the same physical parts of the crack under different perspectives. The evaluation showed that the proposed approach reduces the overall error by 52.3% compared with the baseline method and provides better robustness under varying distance and viewing-angle conditions, with greater improvements observed at larger distances and wider viewing angles. The contributions of this research can be summarized into the following two points:
An RGB-D-assisted geometric rectification for crack width measurement under non-orthogonal imaging conditions: This study develops a geometric rectification approach to reduce the sensitivity of conventional image-based crack width measurement to viewing angle and imaging distance. Planes are fitted to RGB-D images, and the estimated plane equations are used to analytically derive homography transformations from the observed plane regions to standard views with orthogonal viewing angles and controlled imaging distances. This rectification standardizes the crack detection and width estimation problem, improving the robustness and accuracy of subsequent image processing algorithms.
An evaluation framework for crack width measurement methods under different viewing angles and imaging distances: this study quantifies the variations in crack width measurement accuracy under different imaging conditions by establishing correspondences between physical locations along the same crack across RGB-D images captured under different imaging conditions. The evaluation framework enables the quick quantification and visualization of the viewpoint dependence of different crack width measurement algorithms, facilitating the development, selection, and deployment of crack width measurement in the field.
Section 2 discusses related work. Section 3 describes the methodology in detail, including plane fitting, geometric rectification, hybrid global–local binarization, dynamic masking, crack width calculation, and evaluation methodology. Section 4 presents the experiment and results. Section 5 presents the discussion, followed by the conclusions in Section 6.
3. Methodology
3.1. Overview
The proposed approach estimates crack width from an RGB-D image captured under various viewpoint conditions through the following five steps (Figure 1): plane fitting, geometric rectification, hybrid global–local binarization, dynamic masking, and crack width calculation. These steps are designed to maximize the performance of relatively simple classical binarization algorithms under challenging viewpoints. The following sections detail these steps, as well as the methodology used to evaluate the performance of the proposed and existing RGB-D-image-based crack width estimation methods.
3.2. Plane Fitting
The proposed method begins with fitting a plane to the RGB-D image data. First, the 3D coordinates $(X, Y, Z)$ of an image point $(u, v)$ near the crack can be calculated based on the projection Equation (1), where $u$ and $v$ represent the pixel coordinates of the crack surface in the RGB image; $s$ is the scale factor; $\gamma$ is the skew coefficient; $f_x$ and $f_y$ are the focal lengths of the camera; and $c_x$ and $c_y$ are the coordinates of the principal point. According to the intrinsic parameters, $s$ is 1, $\gamma$ is zero, and $f_x$, $f_y$, $c_x$, and $c_y$ are given by the calibrated intrinsics of the sensor. Because $Z$ is known as the depth value of point $(u, v)$, $X$ and $Y$ can be derived from Equation (1):

$$X = \frac{(u - c_x)\,Z}{f_x} \tag{2}$$

$$Y = \frac{(v - c_y)\,Z}{f_y} \tag{3}$$

In this case, the three-dimensional coordinates of image pixels can be determined.
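Equations (2) and (3) can be sketched in Python as follows. This is a minimal illustration assuming NumPy, zero skew, and a unit scale factor as stated above; the function names are hypothetical, and the intrinsic values passed in must come from the sensor calibration, which is not reproduced here.

```python
import numpy as np


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel coordinates plus depth -> camera-frame 3D points (Eqs. (2)-(3)).

    Assumes zero skew and unit scale factor. u, v, depth may be scalars or
    NumPy arrays of the same shape; the result has a trailing axis of size 3.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    Z = np.asarray(depth, dtype=float)
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    return np.stack([X, Y, Z], axis=-1)


def depth_image_to_points(depth, fx, fy, cx, cy):
    """Back-project a whole H x W depth map into an H x W x 3 point array."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # row index is v, column index is u
    return backproject(u, v, depth, fx, fy, cx, cy)
```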
The plane fitting in this research is based on the RANSAC algorithm [14]. Minimal sets of points (three in this research) are sampled to hypothesize planes in the following form:

$$aX + bY + cZ + d = 0 \tag{4}$$

$$a^2 + b^2 + c^2 = 1 \tag{5}$$

where $(a, b, c, d)$ are the parameters of the fitted plane. Equation (5) enforces the normalization condition on the plane coefficients, such that the normal vector has unit length. Under this normalization, $d$ represents the signed distance parameter of the plane, and $|d|$ corresponds to the perpendicular distance from the camera center to the plane. The distances from all candidate points to each hypothesized plane are evaluated, and inliers are determined. The hypothesis with the largest number of inliers and the lowest residual is selected, and the final plane equation is determined using those inliers.
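A minimal RANSAC plane-fitting sketch is given below, assuming NumPy, points already back-projected to an N × 3 array, and a distance threshold in the same unit as the points. The iteration count, the tie-breaking rule (more inliers first, then lower mean residual), and the SVD-based least-squares refit on the final inliers are illustrative choices, not settings reported in this study.

```python
import numpy as np


def fit_plane_lstsq(points):
    """Total-least-squares plane through points; returns (a, b, c, d) with unit normal."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                      # direction of smallest variance
    d = -normal @ centroid
    return np.append(normal, d)


def ransac_plane(points, threshold, iterations=500, seed=0):
    """RANSAC plane fit with 3-point minimal samples and a^2 + b^2 + c^2 = 1.

    Returns the refined plane (a, b, c, d) and a boolean inlier mask.
    """
    rng = np.random.default_rng(seed)
    best_count, best_resid, best_mask = -1, np.inf, None
    for _ in range(iterations):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-12:                 # degenerate (collinear) sample
            continue
        normal /= norm                   # enforce the unit-normal condition
        d = -normal @ p0
        dist = np.abs(points @ normal + d)
        mask = dist < threshold
        count = int(mask.sum())
        resid = dist[mask].mean() if count else np.inf
        # Prefer more inliers; break ties with the lower mean residual.
        if count > best_count or (count == best_count and resid < best_resid):
            best_count, best_resid, best_mask = count, resid, mask
    if best_mask is None:
        raise ValueError("no valid plane hypothesis was found")
    plane = fit_plane_lstsq(points[best_mask])
    return plane, best_mask
```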
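According to the coordinate Equations (2) and (3) and the plane Equation (4), the depth of any point $(u, v)$ in the image can be calculated as follows:

$$Z = \frac{-d}{a\,\dfrac{u - c_x}{f_x} + b\,\dfrac{v - c_y}{f_y} + c} \tag{6}$$

The three-dimensional coordinates $X$ and $Y$ can then be calculated by Equations (2) and (3). After obtaining the plane equation, the normalized plane normal vector $\mathbf{n}$ is obtained as follows:

$$\mathbf{n} = \frac{(a, b, c)^{\mathsf{T}}}{\sqrt{a^2 + b^2 + c^2}} \tag{7}$$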
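Equations (6) and (7) can be sketched as below (NumPy assumed; names hypothetical). The sketch returns NaN when the denominator of Equation (6) approaches zero, that is, when the viewing ray is nearly parallel to the plane. This is one possible way to flag the numerical instability of the pixel-to-plane projection mentioned in Section 3.6, and the tolerance `eps` is an arbitrary choice.

```python
import numpy as np


def plane_depth(u, v, plane, fx, fy, cx, cy, eps=1e-9):
    """Depth Z where the viewing ray of pixel (u, v) meets the plane (Eq. (6)).

    Returns NaN where the ray is (nearly) parallel to the plane, i.e. where
    the denominator vanishes and the projection is numerically unstable.
    """
    a, b, c, d = plane
    denom = (a * (np.asarray(u, dtype=float) - cx) / fx
             + b * (np.asarray(v, dtype=float) - cy) / fy + c)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(np.abs(denom) > eps, -d / denom, np.nan)


def unit_normal(plane):
    """Normalized plane normal (Eq. (7))."""
    n = np.asarray(plane[:3], dtype=float)
    return n / np.linalg.norm(n)
```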
3.3. Geometric Rectification
After obtaining the fitted plane and its normal vector, the crack surface observed under an oblique viewpoint is transformed into an orthogonal view with a consistent geometric scale. For points lying on the same 3D plane, the mapping between the baseline image and a virtual rectified image can be described by a plane-induced homography [36]. Let the pixel coordinate in the baseline RGB image be written in homogeneous form as $\mathbf{x} = (u, v, 1)^{\mathsf{T}}$ and the corresponding point in the rectified image as $\mathbf{x}' = (u', v', 1)^{\mathsf{T}}$. Then, the geometric relationship between the two images can be expressed as follows:

$$\mathbf{x}' \simeq \mathbf{H}\,\mathbf{x}$$

where $\mathbf{H}$ is the $3 \times 3$ homography matrix and $\simeq$ denotes equality up to a nonzero scale factor.
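To construct this mapping, a virtual camera is defined such that the fitted crack plane becomes parallel to the image plane after transformation. Let the target optical axis of the virtual camera be $\mathbf{e}_z = (0, 0, 1)^{\mathsf{T}}$. A rotation matrix $\mathbf{R}$ is then introduced to align the fitted plane normal vector $\mathbf{n}$ with $\mathbf{e}_z$:

$$\mathbf{R}\,\mathbf{n} = \mathbf{e}_z$$

The corresponding rotation angle $\theta$ and rotation axis $\mathbf{k}$ are given by the following:

$$\theta = \arccos(\mathbf{n} \cdot \mathbf{e}_z), \qquad \mathbf{k} = \frac{\mathbf{n} \times \mathbf{e}_z}{\lVert \mathbf{n} \times \mathbf{e}_z \rVert}$$

where $\mathbf{k}$ is the unit rotation axis. The rotation matrix $\mathbf{R}$ can then be obtained by the Rodrigues formula:

$$\mathbf{R} = \mathbf{I} + \sin\theta\,[\mathbf{k}]_\times + (1 - \cos\theta)\,[\mathbf{k}]_\times^2$$

where $\mathbf{I}$ is the $3 \times 3$ identity matrix and $[\mathbf{k}]_\times$ is the skew-symmetric matrix of $\mathbf{k} = (k_x, k_y, k_z)^{\mathsf{T}}$, defined as follows:

$$[\mathbf{k}]_\times = \begin{bmatrix} 0 & -k_z & k_y \\ k_z & 0 & -k_x \\ -k_y & k_x & 0 \end{bmatrix}$$

This formulation converts the axis–angle representation into a rotation matrix and is the standard Rodrigues rotation formula used in geometric vision and in OpenCV implementations. If $\theta$ is sufficiently small, the plane normal is already aligned with the target optical axis, and $\mathbf{R}$ degenerates to the identity matrix [37,38].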
Accordingly, the rotation matrix can be obtained from the axis–angle representation through the Rodrigues formula. This rotation reorients the inclined crack surface so that it is orthogonal to the optical axis of the virtual view.
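The alignment rotation can be sketched with NumPy as follows. The sketch assumes that the unit normal has been oriented with a positive z-component (pointing away from the camera); this orientation convention and the degeneracy tolerance are assumptions of the sketch rather than details given in the text. OpenCV's `cv2.Rodrigues` offers an equivalent conversion from a rotation vector θk to a rotation matrix.

```python
import numpy as np


def skew(k):
    """Skew-symmetric matrix [k]_x such that skew(k) @ x == np.cross(k, x)."""
    kx, ky, kz = k
    return np.array([[0.0, -kz, ky],
                     [kz, 0.0, -kx],
                     [-ky, kx, 0.0]])


def rotation_aligning_normal(n, eps=1e-8):
    """Rotation R with R @ n == e_z, built with the Rodrigues formula.

    n is assumed to be a unit normal with n[2] > 0 (pointing away from the
    camera); otherwise theta approaches pi and the virtual camera would look
    at the back of the plane.
    """
    n = np.asarray(n, dtype=float)
    e_z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(n, e_z)
    s = np.linalg.norm(axis)                       # equals sin(theta)
    if s < eps:                                    # already aligned: R = I
        return np.eye(3)
    theta = np.arccos(np.clip(n @ e_z, -1.0, 1.0))
    K = skew(axis / s)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```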
To further unify the image scale after rectification, a target rectification depth $Z_t$ is introduced. This parameter specifies the desired distance between the fitted plane and the virtual camera. The corresponding translation vector $\mathbf{t}$ shifts the rotated plane along the virtual optical axis so that it lies at depth $Z_t$. When $Z_t$ is selected close to the true physical depth of the crack surface, the resampled image can better preserve the actual geometric scale of the crack pattern and reduce the apparent compression caused by perspective projection.
Let $\mathbf{K}$ and $\mathbf{K}_v$ denote the intrinsic matrices of the baseline camera and the virtual rectified camera, respectively. With the virtual camera coordinates defined as $\mathbf{X}' = \mathbf{R}\mathbf{X} + \mathbf{t}$ and the fitted plane written as $\mathbf{n}^{\mathsf{T}}\mathbf{X} + d = 0$, the plane-induced homography is then written as follows:

$$\mathbf{H} = \mathbf{K}_v \left( \mathbf{R} - \frac{\mathbf{t}\,\mathbf{n}^{\mathsf{T}}}{d} \right) \mathbf{K}^{-1}$$

Using this homography matrix $\mathbf{H}$, the baseline oblique image is warped into a new rectified image through inverse perspective mapping. This rectified image is the direct output of the geometric rectification step and represents the crack surface as seen from an orthogonal viewpoint with a unified geometric scale. Consequently, crack features that are compressed in the baseline oblique view are restored to proportions closer to their true physical geometry. Such pre-rectification is particularly important for sub-millimeter cracks, as it improves the effective sampling density of fine crack features prior to binarization and segmentation, thereby providing a geometrically consistent image for subsequent crack width estimation.
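The plane-induced homography and the inverse warp can be sketched as below. The sketch assumes NumPy, a plane written as n·X + d = 0 with unit normal n oriented so that d < 0, virtual-camera coordinates X′ = RX + t with Rn = e_z, and the hypothetical choice t = (0, 0, Z_t + d), which places the rotated plane at the target rectification depth Z_t. Nearest-neighbour sampling is used only to keep the example short; a practical implementation would typically interpolate bilinearly (for example, with OpenCV's `cv2.warpPerspective`).

```python
import numpy as np


def plane_homography(K, K_v, R, n, d, z_target):
    """Plane-induced homography H = K_v (R - t n^T / d) K^{-1}.

    Assumes plane n^T X + d = 0 with unit n, n[2] > 0 (so d < 0), virtual
    coordinates X' = R X + t and R @ n == e_z. The rotated plane then lies at
    depth -d, and t = (0, 0, z_target + d) moves it to depth z_target.
    """
    n = np.asarray(n, dtype=float).reshape(3, 1)
    t = np.array([[0.0], [0.0], [z_target + d]])
    H = K_v @ (R - t @ n.T / d) @ np.linalg.inv(K)
    return H / H[2, 2]


def warp_inverse(image, H, out_shape):
    """Inverse perspective mapping with nearest-neighbour sampling.

    Each output pixel x' is pulled from x = H^{-1} x'. Output pixels that map
    outside the source image stay 0 and are False in the returned validity
    mask, i.e. the kind of invalid border that the dynamic mask later removes.
    """
    h_out, w_out = out_shape
    vs, us = np.mgrid[0:h_out, 0:w_out]
    dst = np.stack([us.ravel(), vs.ravel(), np.ones(us.size)])
    src = np.linalg.inv(H) @ dst
    ui = np.rint(src[0] / src[2]).astype(int)
    vi = np.rint(src[1] / src[2]).astype(int)
    valid = (ui >= 0) & (ui < image.shape[1]) & (vi >= 0) & (vi < image.shape[0])
    out = np.zeros((h_out * w_out,) + image.shape[2:], dtype=image.dtype)
    out[valid] = image[vi[valid], ui[valid]]
    return out.reshape((h_out, w_out) + image.shape[2:]), valid.reshape(h_out, w_out)
```

The design keeps the homography and the resampling separate so that the same H can also be used to map individual points, for example edge or sample points, between the oblique and rectified views.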
3.4. Dynamic Masking
Since geometric rectification and homography transformation may introduce unmapped pixels and irregular invalid regions near the image boundary, a dynamic masking strategy was further employed to constrain subsequent crack segmentation to the valid surface region. Without such a constraint, black borders, empty pixels, and boundary artifacts may contaminate local statistics and lead to erroneous crack responses.
In the proposed method, a dense validity mask was generated from the full-resolution depth image and the fitted plane model. For each pixel, its three-dimensional coordinates $(X, Y, Z)$ were first recovered from the depth value and the camera intrinsics. The point-to-plane distance was then computed as follows:

$$D(u, v) = \frac{\left| aX + bY + cZ + d \right|}{\sqrt{a^2 + b^2 + c^2}}$$

where $(a, b, c, d)$ are the parameters of the fitted plane. A raw mask $M_0$ was subsequently defined by the following:

$$M_0(u, v) = \begin{cases} 1, & Z > Z_{\min} \ \text{and} \ D(u, v) \le \tau \\ 0, & \text{otherwise} \end{cases}$$

where $Z_{\min}$ (0 in this research) denotes the minimum valid depth threshold and $\tau$ (3 mm in this research) is the allowable point-to-plane distance. In this way, only pixels with reliable depth observations and sufficient geometric consistency with the target surface were retained as valid candidates.
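The raw mask can be sketched as follows (NumPy assumed; names hypothetical). The defaults mirror the thresholds stated above, and the depth image is assumed to be in millimeters so that the 3 mm tolerance applies directly.

```python
import numpy as np


def raw_validity_mask(depth, plane, fx, fy, cx, cy, z_min=0.0, tau=3.0):
    """Raw mask M0: valid depth and within tau of the fitted plane.

    depth is an H x W depth image in the same unit as tau (mm assumed);
    plane = (a, b, c, d). Defaults follow the text: Z_min = 0, tau = 3 mm.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    Z = depth.astype(float)
    X = (u - cx) * Z / fx                 # back-projection, Eq. (2)
    Y = (v - cy) * Z / fy                 # back-projection, Eq. (3)
    a, b, c, d = plane
    dist = np.abs(a * X + b * Y + c * Z + d) / np.sqrt(a * a + b * b + c * c)
    return (Z > z_min) & (dist <= tau)
```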
To improve spatial continuity and suppress isolated noisy regions, the raw mask was further refined by morphological opening and closing operations. After that, connected-component analysis was applied, and the principal surface region was preserved as the final valid mask. This refinement step helps remove small spurious regions and yields a more compact and stable support region for crack extraction.
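The refinement step can be sketched in plain NumPy as below. The 3 × 3 square structuring element, edge-replicated padding, and 4-connectivity are assumptions chosen for illustration, since the text does not specify them; libraries such as OpenCV or scikit-image provide faster equivalents.

```python
from collections import deque

import numpy as np


def _neighbourhood(mask, r):
    """Shifted copies of mask covering a (2r+1) x (2r+1) square (edge-padded)."""
    p = np.pad(mask, r, mode="edge")
    h, w = mask.shape
    return [p[dy:dy + h, dx:dx + w] for dy in range(2 * r + 1) for dx in range(2 * r + 1)]


def erode(mask, r=1):
    return np.logical_and.reduce(_neighbourhood(mask, r))


def dilate(mask, r=1):
    return np.logical_or.reduce(_neighbourhood(mask, r))


def largest_component(mask):
    """Keep only the largest 4-connected foreground component."""
    labels = np.zeros(mask.shape, dtype=int)
    best_label, best_size, current = 0, 0, 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        current += 1
        labels[start] = current
        queue, size = deque([start]), 0
        while queue:                      # breadth-first flood fill
            y, x = queue.popleft()
            size += 1
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    queue.append((ny, nx))
        if size > best_size:
            best_label, best_size = current, size
    return labels == best_label if best_size else np.zeros(mask.shape, dtype=bool)


def refine_mask(raw, r=1):
    opened = dilate(erode(raw, r), r)     # opening removes small isolated specks
    closed = erode(dilate(opened, r), r)  # closing fills small holes
    return largest_component(closed)
```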
The final dynamic mask was then used to restrict the statistical domain of the subsequent binarization process. Only pixels within the valid region participated in threshold estimation, while pixels outside the mask were excluded from crack classification. Therefore, the proposed masking strategy effectively suppresses artifacts introduced by rectification, reduces the influence of invalid boundary regions, and improves the reliability of crack segmentation under skewed viewing conditions.
3.5. Hybrid Global–Local Binarization
To improve crack segmentation under nonuniform illumination and complex surface texture, a hybrid global–local binarization strategy was adopted in the proposed method. Specifically, Sauvola thresholding was used to preserve fine crack boundaries under spatially varying brightness, whereas Otsu thresholding was introduced to provide a global intensity reference for suppressing large-scale non-crack dark regions. By combining these two thresholding mechanisms, the proposed method improves robustness against background interference while maintaining sensitivity to narrow cracks [27,29]. For each pixel $(x, y)$, the Sauvola local threshold is defined as follows:

$$T_s(x, y) = m(x, y)\left[1 + k\left(\frac{s(x, y)}{R} - 1\right)\right]$$

where $m(x, y)$ and $s(x, y)$ are the local mean and standard deviation computed within a rectangular window centered at $(x, y)$, respectively; $k$ is a sensitivity parameter; and $R$ is the normalization factor for the standard deviation.
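Sauvola's threshold can be computed efficiently with integral images, as in the following NumPy sketch. Edge-replicated padding at the image border is an assumption of the sketch; the default window size, k, and R follow the values reported later in this section.

```python
import numpy as np


def local_mean_std(img, window=21):
    """Mean and std over an odd window x window neighbourhood via integral images."""
    r = window // 2
    f = np.pad(img.astype(float), r, mode="edge")
    # Integral images with a leading row/column of zeros.
    s1 = np.pad(f.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    s2 = np.pad((f * f).cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = img.shape

    def box(s):
        return (s[window:window + h, window:window + w] - s[:h, window:window + w]
                - s[window:window + h, :w] + s[:h, :w])

    n = window * window
    mean = box(s1) / n
    var = np.maximum(box(s2) / n - mean ** 2, 0.0)   # clip round-off below zero
    return mean, np.sqrt(var)


def sauvola_threshold(img, window=21, k=0.25, R=128.0):
    """T_s = m * (1 + k * (s / R - 1)) evaluated at every pixel."""
    m, s = local_mean_std(img, window)
    return m * (1.0 + k * (s / R - 1.0))
```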
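In addition, a global threshold $T_o$ was determined using Otsu’s method, which selects the threshold that maximizes the between-class variance of the grayscale histogram:

$$T_o = \arg\max_{t}\; \omega_0(t)\,\omega_1(t)\,\left[\mu_0(t) - \mu_1(t)\right]^2$$

where $\omega_0(t)$ and $\omega_1(t)$ denote the probabilities of the two classes separated by threshold $t$, and $\mu_0(t)$ and $\mu_1(t)$ are the corresponding class mean intensities.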
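A NumPy sketch of Otsu's method is given below, assuming 8-bit grayscale input. The optional mask argument reflects the statement in Section 3.4 that only pixels inside the valid region participate in threshold estimation; the function name and the convention that the lower class is {I ≤ t} are assumptions.

```python
import numpy as np


def otsu_threshold(gray, mask=None):
    """Otsu's threshold on 8-bit intensities, optionally over masked pixels only.

    Returns t such that the classes {I <= t} and {I > t} maximize
    w0 * w1 * (mu0 - mu1)^2.
    """
    values = gray[mask] if mask is not None else gray.ravel()
    hist = np.bincount(values.astype(np.uint8).ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(p)                      # probability of the lower class
    w1 = 1.0 - w0                          # probability of the upper class
    cum_mean = np.cumsum(p * levels)
    total_mean = cum_mean[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = cum_mean / w0
        mu1 = (total_mean - cum_mean) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(between))
```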
The final threshold $T(x, y)$ was then formulated as a linear combination of the local Sauvola threshold $T_s(x, y)$ and the global Otsu threshold $T_o$, where $\alpha$ (= 0.6 in this study) is the fusion weight. The Sauvola threshold $T_s$ was calculated using a window size of 21 × 21 pixels, with $k = 0.25$ and $R = 128$. These parameters were kept constant across all tested images to ensure reproducibility. They were selected based on preliminary trials to balance local crack-detail preservation and background-noise suppression.

The binarized image $B$ is finally obtained by the following:

$$B(x, y) = \begin{cases} 1, & I(x, y) < T(x, y) \\ 0, & \text{otherwise} \end{cases}$$

where $I(x, y)$ denotes the grayscale intensity at pixel $(x, y)$. Since cracks generally appear as dark, line-like structures in the image, pixels with intensities lower than the threshold were classified as crack pixels. In this way, the proposed hybrid strategy can suppress irrelevant dark patterns at the global scale while preserving thin crack details at the local scale.
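The fusion and binarization steps can be sketched as below. The text gives the fusion weight α = 0.6 but not the exact form of the combination, so the sketch assumes T = α·T_s + (1 − α)·T_o, with α weighting the Sauvola term; this convention is a hypothesis. Pixels outside an optional validity mask are never labeled as crack, following the dynamic masking step.

```python
import numpy as np


def hybrid_binarize(gray, t_sauvola, t_otsu, alpha=0.6, valid=None):
    """Crack mask B = 1 where I < T, with T a linear blend of both thresholds.

    Assumption: T = alpha * T_s + (1 - alpha) * T_o (alpha weights Sauvola).
    """
    T = alpha * np.asarray(t_sauvola, dtype=float) + (1.0 - alpha) * float(t_otsu)
    crack = np.asarray(gray, dtype=float) < T   # cracks are darker than T
    if valid is not None:
        crack &= valid                          # exclude pixels outside the mask
    return crack
```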
3.6. Crack Width Calculation
After geometric rectification and binarization, crack width is obtained by combining image-domain edge localization with metric reconstruction in physical space. First, the binary crack region is skeletonized using a thinning operation based on the Zhang–Suen algorithm [39], and each sampled point is assigned to its nearest skeleton pixel, which serves as the local center point. Then, the local tangent direction is estimated from neighboring skeleton points, and the corresponding normal direction is used as the measurement direction so that the width is measured perpendicular to the local crack trend.
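For readers who want to reproduce the skeletonization step, the following is a plain NumPy sketch of Zhang–Suen thinning, assuming a boolean crack mask as input. It follows the classic two-subiteration rules; image-processing libraries such as scikit-image provide optimized skeletonization routines.

```python
import numpy as np


def zhang_suen_thinning(binary):
    """Zhang–Suen thinning of a boolean mask to a one-pixel-wide skeleton."""
    img = np.pad(binary.astype(np.uint8), 1)
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            P = img
            # Neighbours P2..P9, clockwise from north, as views of the interior.
            p2, p3, p4 = P[:-2, 1:-1], P[:-2, 2:], P[1:-1, 2:]
            p5, p6, p7 = P[2:, 2:], P[2:, 1:-1], P[2:, :-2]
            p8, p9 = P[1:-1, :-2], P[:-2, :-2]
            nbrs = [p2, p3, p4, p5, p6, p7, p8, p9]
            B = sum(n.astype(int) for n in nbrs)          # number of foreground neighbours
            seq = nbrs + [p2]
            A = sum(((seq[i] == 0) & (seq[i + 1] == 1)).astype(int) for i in range(8))
            if step == 0:
                c1 = (p2 * p4 * p6) == 0
                c2 = (p4 * p6 * p8) == 0
            else:
                c1 = (p2 * p4 * p8) == 0
                c2 = (p2 * p6 * p8) == 0
            remove = (P[1:-1, 1:-1] == 1) & (B >= 2) & (B <= 6) & (A == 1) & c1 & c2
            if remove.any():
                img[1:-1, 1:-1][remove] = 0               # parallel deletion
                changed = True
    return img[1:-1, 1:-1].astype(bool)
```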
Starting from the local center point, the two crack edges are identified by searching along the positive and negative normal directions within the binary crack region. For the detected edge points, their 3D coordinates can be reconstructed using the pixel-to-coordinate relations in Equations (2) and (3) together with the plane-constrained depth formulation in Equation (6), under the fitted plane given in Equation (4).
The crack-width sampling points are determined from the extracted crack centerline within the selected crack region. Specifically, the crack region of interest is first selected, and the centerline pixels are sampled at a fixed interval. For each sampled centerline point, the local normal direction is estimated using neighboring centerline pixels, and the two crack edge points are then searched along the positive and negative normal directions.
The physical crack width is finally defined as the Euclidean distance between the reconstructed 3D coordinates of the two edge points. To ensure robustness, samples for which a reliable local measurement cannot be established are excluded from the width statistics. These invalid samples include cases where the local normal direction cannot be stably estimated from the skeleton neighborhood, where valid edge points cannot be identified along the normal direction, or where numerical instability occurs during pixel-to-plane projection. In this way, the crack width is converted from an apparent pixel distance into a physically meaningful metric distance on the fitted plane, improving measurement reliability under varying viewpoints and imaging distances.
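The per-sample width computation can be sketched as follows, assuming NumPy, a boolean crack mask, its skeleton, a center point on the skeleton, and the fitted plane. Several details are assumptions of this sketch rather than settings reported in the study: the tangent is estimated by PCA over a 7 × 7 skeleton neighborhood, each edge point is placed half a pixel beyond the last crack pixel along the normal, and a sample is returned as invalid (`None`) when the tangent, either edge, or the plane projection cannot be established.

```python
import numpy as np


def local_normal(skeleton, center, radius=3):
    """Unit normal (dy, dx) at center from PCA of nearby skeleton pixels, or None."""
    cy, cx = center
    y0, x0 = max(cy - radius, 0), max(cx - radius, 0)
    ys, xs = np.nonzero(skeleton[y0:cy + radius + 1, x0:cx + radius + 1])
    if len(ys) < 3:
        return None                                   # tangent cannot be estimated
    pts = np.column_stack([ys, xs]).astype(float)
    pts -= pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    ty, tx = vt[0]                                    # dominant (tangent) direction
    return np.array([-tx, ty])                        # tangent rotated by 90 degrees


def find_edge(mask, center, direction, max_steps=50):
    """Boundary point half a step beyond the last crack pixel, or None."""
    cy, cx = center
    for step in range(1, max_steps + 1):
        iy = int(round(cy + step * direction[0]))
        ix = int(round(cx + step * direction[1]))
        if not (0 <= iy < mask.shape[0] and 0 <= ix < mask.shape[1]):
            return None                               # hit the image border
        if not mask[iy, ix]:
            s = step - 0.5
            return (cy + s * direction[0], cx + s * direction[1])
    return None


def width_3d(mask, skeleton, center, plane, fx, fy, cx, cy, eps=1e-9):
    """Metric crack width at one sample, or None if the sample is invalid."""
    normal = local_normal(skeleton, center)
    if normal is None:
        return None
    e1, e2 = find_edge(mask, center, normal), find_edge(mask, center, -normal)
    if e1 is None or e2 is None:
        return None
    a, b, c, d = plane
    pts = []
    for v, u in (e1, e2):
        denom = a * (u - cx) / fx + b * (v - cy) / fy + c
        if abs(denom) < eps:
            return None                               # unstable pixel-to-plane projection
        Z = -d / denom                                # Eq. (6)
        pts.append(np.array([(u - cx) * Z / fx, (v - cy) * Z / fy, Z]))  # Eqs. (2)-(3)
    return float(np.linalg.norm(pts[0] - pts[1]))
```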
Figure 2 presents two representative invalid-sample cases that can be illustrated visually. Figure 2a,b show overviews of two different samples, crack 1 and crack 2; the yellow and blue labels assist in locating the ROIs of the binarized images. Figure 2e shows an invalid sample caused by crack branching, and Figure 2f shows an invalid sample caused by a failure of edge detection. Numerical instability during pixel-to-plane projection is described in the text but is difficult to visualize in the image domain.
3.7. Viewpoint-Consistent Evaluation Methodology
To enable crack width comparison at the same physical crack location under viewpoint variability, a reference-image-based correspondence transfer strategy is designed. First, one image is selected from the rectified image set as the standard reference image, typically the one with the smallest viewpoint deviation. Because this image suffers the least geometric distortion, it serves as a stable evaluation baseline. A region of interest containing the target crack is manually defined on the reference image, while the crack center points within this region are automatically extracted as evaluation samples. In this way, manual intervention is used only to delimit the evaluation region, whereas the measurement points themselves are generated automatically. This procedure is shown in Figure 3.
Next, MASt3R is employed to establish dense correspondences between the reference image and the target images acquired at different depths and viewpoint angles. Instead of estimating correspondences over the entire scene indiscriminately, the proposed method introduces a plane-constrained strategy so that only points belonging to the crack plane are retained for homography estimation. Specifically, only correspondences whose MASt3R-recovered 3D coordinates lie on or close to the fitted crack plane are retained, and points from background planes, irrelevant objects, or out-of-plane regions are discarded. This filtering reduces multi-plane interference and mismatches, improves the stability and geometric consistency of point transfer across viewpoints, and helps ensure that the transferred sample points correspond to the same physical crack locations in different views.
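The plane-constrained filtering, homography estimation, and point transfer can be sketched as below. The dense correspondences and their 3D coordinates are assumed to be already available as NumPy arrays (for example, exported from MASt3R; no MASt3R API is used or implied here), with the 3D points expressed in the same frame as the fitted plane. The homography is estimated with the normalized direct linear transform (DLT) and no further outlier rejection, which is an illustrative choice; `transfer_points` maps the reference sample points into a target image.

```python
import numpy as np


def filter_on_plane(pts_ref, pts_tgt, xyz_ref, plane, tau):
    """Keep correspondences whose 3D point lies within tau of the fitted plane."""
    a, b, c, d = plane
    dist = np.abs(xyz_ref @ np.array([a, b, c]) + d) / np.linalg.norm([a, b, c])
    keep = dist <= tau
    return pts_ref[keep], pts_tgt[keep]


def _normalize(pts):
    """Hartley normalization: centroid to origin, mean distance sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0.0, -scale * mean[0]],
                  [0.0, scale, -scale * mean[1]],
                  [0.0, 0.0, 1.0]])
    homog = np.column_stack([pts, np.ones(len(pts))])
    return (T @ homog.T).T, T


def estimate_homography(src, dst):
    """Normalized DLT: H with dst ~ H src, from at least 4 correspondences."""
    s, Ts = _normalize(src)
    t, Tt = _normalize(dst)
    rows = []
    for (x, y, _), (u, v, _) in zip(s, t):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    Hn = vt[-1].reshape(3, 3)              # null-space vector of the DLT system
    H = np.linalg.inv(Tt) @ Hn @ Ts        # undo the normalizations
    return H / H[2, 2]


def transfer_points(H, pts):
    """Map N x 2 pixel coordinates through H."""
    homog = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]
```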
Based on the estimated homography, the sampled points in the reference image are transferred to two groups of target images: (1) baseline images captured under different depths and viewpoint angles, and (2) the corresponding rectified images acquired under the same conditions. Since both measurements originate from the same set of sampled points in the same reference image, crack width comparison between the baseline and rectified images is conducted at identical physical crack locations rather than at independently selected positions in each image. Therefore, the resulting difference more faithfully reflects the influence of viewpoint variation and the effectiveness of geometric rectification in improving measurement consistency.
This evaluation strategy partly relies on the quality of the MASt3R-derived correspondences. Poor correspondences may shift the transferred sample points away from the true crack locations and introduce local errors in the relative error calculation. To reduce this effect, only correspondences located on the fitted crack-bearing plane and within the valid evaluation region were retained. This limitation is acknowledged, and the reported results should be interpreted as measurement accuracy under the retained valid correspondences.
5. Discussion
The improved performance of the proposed method can be mainly attributed to the introduction of geometry-aware rectification before crack extraction and width estimation. In conventional image-domain measurement, crack morphology is directly affected by viewpoint-dependent perspective distortion, especially under oblique imaging conditions. By incorporating RGB-D-based surface information and performing homography rectification on the crack plane, the proposed framework provides a more geometrically consistent representation of the crack region. This geometric normalization improves the reliability of subsequent binarization, edge localization, and width calculation, thereby reducing the sensitivity of the overall measurement process to viewpoint changes.
The experimental results further indicate that both imaging depth and viewing angle influence crack-width measurement accuracy, but their effects are not equally significant. Compared with depth variation, the influence of viewing angle is more pronounced because angle directly alters the degree of geometric compression and perspective distortion in the observed crack pattern. This explains why the baseline method shows a clearer increase in relative error as the angle becomes larger. In contrast, the proposed method maintains lower error levels under all tested grouping conditions, indicating that the rectification step effectively compensates for distortion induced by non-orthogonal viewpoints. These results confirm that explicit geometric correction is particularly important for robust crack measurement in multi-view inspection scenarios.
The findings are consistent with previous RGB-D and three-dimensional reconstruction studies showing that geometric information is important for physical crack measurement [14]. The results also support previous rectification studies showing that perspective correction is necessary for quantitative inspection under non-orthogonal views [9,21,24]. Building on the importance of geometric information and the potential of rectification for crack detection and quantification, the proposed framework uses depth information to transform the problem into a simpler, standardized form and quantifies the viewpoint dependence of different crack-width estimation algorithms, enabling better control of the crack-width measurement process.
From an application perspective, the proposed framework is meaningful for practical infrastructure inspection, where crack images are often captured under non-ideal viewpoints and varying acquisition distances. In such cases, purely image-based measurement methods may suffer from unstable accuracy due to uncontrolled pose changes. By integrating geometric correction, crack extraction, and metric width estimation into a unified pipeline, the proposed method improves not only average accuracy but also measurement stability, which is essential for reliable structural condition assessment.
Although the proposed method shows improved accuracy and robustness, several limitations should be acknowledged. First, the current validation is based on a limited number of crack specimens and surface conditions. This limits the generalizability of the observed performance improvement patterns across different viewpoint conditions. Second, the method relies on the assumption that the crack-bearing region can be represented by a dominant plane. For rough, curved, or multi-plane surfaces, plane fitting and homography rectification may become less reliable. Third, the crack-width calculation still depends on the quality of binarization, skeleton extraction, local normal estimation, and edge-point detection. Errors in any of these steps may lead to invalid samples or biased width estimation.
These limitations indicate that further study is needed. Future work should test the method on more crack specimens, different surface textures, wider depth–angle ranges, and more complex field environments. Adaptive selection of the target rectification depth should also be investigated to balance geometric correction and crack-shape preservation. Further extensions to curved or multi-plane surfaces would also improve the practical applicability of the proposed framework.
6. Conclusions
This study proposed an RGB-D-assisted crack-width measurement framework to improve measurement accuracy under varying viewpoint conditions. The method integrates plane fitting, geometric rectification, hybrid global–local binarization, dynamic masking, and crack-width calculation into a unified pipeline. By introducing surface-based homography rectification before crack extraction, the proposed framework reduces viewpoint-induced perspective distortion and provides a more consistent geometric basis for quantitative crack-width estimation.
The experimental results showed that the proposed method consistently achieved lower relative error than the baseline image-based measurement method across different depth and viewing-angle settings. With the target rectification depth set to 150 mm, the proposed method reduced the overall fitted-surface error by 19.3–52.3% compared with the baseline method. The improvement was more evident under larger viewing angles and longer imaging distances, where perspective distortion had a stronger influence on image-domain measurement.
These findings indicate that incorporating surface geometry into crack-width measurement is effective for improving robustness under non-orthogonal imaging conditions. The study also demonstrates that viewpoint-consistent evaluation is important for assessing measurement reliability, because the same physical crack regions can be compared across different imaging geometries. Therefore, the proposed framework provides a practical, geometry-based solution for more reliable quantitative crack assessment in infrastructure inspection.
Several limitations remain. The current validation is based on a limited number of crack specimens and relatively controlled imaging conditions. The method also relies on the assumption that the crack region can be approximated by a dominant plane. Future work should further validate the framework using more crack types, different surface textures, wider depth–angle ranges, and more complex field environments. Adaptive selection of the rectification depth and uncertainty quantification for crack-width measurement should also be investigated to improve the generalizability of the proposed method.