2.1. Exploratory Data Analysis and Defect Consistency
This section presents an exploratory data analysis of H_RSM signals and quantifies the consistency of differences in these signals across various defective road segments. The data were collected in Orlando, Florida, using an SUV equipped with six-degree-of-freedom accelerometers and a camera, along with proprietary software (Figure 1) developed by Honda and i-Probe specifically for Honda’s Road Surface Monitoring (H_RSM) applications. The proprietary software logs the accelerometer signals, vehicle coordinates, speed, and accompanying camera images for each second of data collection. Although the method is intended to operate independently of visual data, the synchronized camera images were used by a researcher to manually label the segments as either defective (potholes or patching) or intact. Potholes were identified as localized pavement depressions or cavities visible in the camera imagery, whereas patching referred to repaired pavement regions with visually distinguishable surface texture or elevation differences relative to the surrounding roadway. Labeling was performed by a single researcher as part of this exploratory study. This approach provided ground-truth labels for training and evaluation.
The sampling frequency during data collection was set to 10 Hz. This frequency was chosen to align with both the capabilities of the prototype vehicle setup and anticipated future hardware deployments from the automotive manufacturer. Typical consumer vehicles currently transmit sensor signals at 1 Hz, but higher frequencies such as 10 Hz may be implemented in future applications. Depending on future system requirements and data handling capabilities, higher sampling rates could be explored if additional resolution becomes necessary.
The dataset consists of measurements collected from a single instrumented vehicle operating within a specific geographic region. The sample size is limited, with fewer observations for potholes than for patching. Accordingly, the analysis is focused on evaluating signal-level consistency and separability rather than establishing generalizable performance across vehicles, environments, or defect populations.
As part of the initial exploratory analysis, Figure 2 presents three H_RSM signals collected from different road segments, along with the labeled defect points. Several observations can be made. First, Figure 2a,b show that H_RSM signal amplitudes at potholes are usually higher than those at patching locations. However, patching amplitudes can reach levels similar to those of potholes, as seen at 70 s and 165 s in Figure 2b. Second, Figure 2c reveals a smaller peak at 40 s following a larger peak at 36 s, which may indicate an after-shock effect associated with the vehicle’s response after crossing a pothole, likely arising from suspension dynamics and tire–road interaction following the initial impact.
As the second part of the exploratory analysis, three H_RSM signals collected from the same road segment were used to evaluate the consistency of differences in these signals across various defective road sections, as shown in Figure 3. This figure also shows the selected segments corresponding to patching and potholes, alongside the originally labeled data points. The selected segments do not exactly match the labeled points for several reasons. First, the effect of passing over a pothole can continue beyond the defect, appearing as an after-shock, as seen in Figure 2. Second, to capture this effect, the segments for both patching and potholes were extended up to a point that does not overlap with the next segment. Third, the segments were slightly extended before the labeled points to account for the initiation of a peak caused by a defect and to minimize potential human error in labeling. Fourth, the segment length was kept constant so that the consistency of defect differences could be evaluated quantitatively with the evaluation metrics. Finally, although the selected segments were placed on the exact location of defects, some defects in certain signals were mislabeled, as noted in Figure 3a.
Consequently, each dataset includes two selected segments corresponding to patching and one segment corresponding to a pothole. These segments are labeled as “PatX” and “PothX”, where X represents the segment number. Specifically, Pat1, Pat3, and Pat5 correspond to passages over the same patching, while Pat2, Pat4, and Pat6 correspond to passages over another patching. Poth1, Poth2, and Poth3 indicate passages over the same pothole.
Two methods were adopted to assess the consistency of differences in defect signals. The first is the Pearson correlation coefficient [11], which quantifies the linear relationship between two time series signals, $x$ and $y$, each with $n$ data points. The coefficient, denoted as $r$, is calculated using the sample means of the signals, $\bar{x}$ and $\bar{y}$. The value of $r$ ranges from −1 to 1, where 1 represents a perfect positive linear relationship and −1 represents a perfect negative linear relationship. The coefficient is computed as follows:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (1)$$
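As an illustration of the coefficient described above, the following is a minimal sketch using only the Python standard library. The function name `pearson_r` and the error handling are assumptions made for this example and are not taken from the study's software.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("signals must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the sample means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations for each signal.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_x * var_y)
```

Because the coefficient compares samples at matching indices, both segments must have the same length, which is consistent with the constant segment length used in the consistency analysis.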
The second metric used is Dynamic Time Warping (DTW) [12], an algorithm that measures the similarity between two time series signals even when there are variations in timing or local distortions. DTW can be thought of as a specialized “distance” measure, similar in spirit to Euclidean distance, but adapted for time series. While Euclidean distance simply compares values at corresponding time points, it is sensitive to small shifts or differences in timing. In contrast, DTW compensates for these local timing shifts by finding an optimal alignment between the two signals, effectively “warping” the time axis to minimize overall differences. In this way, DTW produces a distance-like value that quantifies the dissimilarity between signals while being robust to small variations in timing or shape. Figure 4a illustrates this concept by showing how DTW aligns points between two signals, a classical depiction often used to introduce this method. This approach is well-suited for comparing H_RSM signals for the same type of defect, as slight differences in signal shapes can arise due to variations in wheel positioning, timing, or vehicle speed when traversing defects. For this application, low DTW values between segments of the same defect type suggest similar responses in the H_RSM signals, which may support classification. Conversely, high DTW values across different defect types indicate differences in signal patterns that may assist separation in machine learning models.
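The DTW algorithm operates as follows. Consider $x$ and $y$ as two signals with $n$ and $m$ data points, respectively. A cost matrix $D$ is initialized, where D[i,j] represents the cumulative distance of the best alignment up to points $x_i$ and $y_j$. The matrix has dimensions $(n+1) \times (m+1)$, with $D[0,0] = 0$, $D[i,0] = \infty$ for $i = 1, \dots, n$, and $D[0,j] = \infty$ for $j = 1, \dots, m$. The cost matrix is then computed using the local Euclidean distance between each pair of points, $d(x_i, y_j) = |x_i - y_j|$, and the recursive relation:
$$D[i,j] = d(x_i, y_j) + \min\{D[i-1,j],\; D[i,j-1],\; D[i-1,j-1]\} \qquad (2)$$
for $i = 1, \dots, n$ and $j = 1, \dots, m$, and the DTW distance between the two signals is given by the value $D[n,m]$.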
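The dynamic-programming procedure described above can be sketched as follows. This is a minimal illustration, assuming the absolute difference as the local distance between scalar samples and no warping-window constraint; the function name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(x, y):
    """Classic DTW distance between two 1-D sequences."""
    n, m = len(x), len(y)
    # (n+1) x (m+1) cumulative cost matrix; the border is infinite
    # except for the origin, which is zero.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local distance
            # Extend the cheapest of the three admissible predecessors.
            D[i][j] = local + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

For example, `dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])` is zero because the second signal is only a delayed copy of the first, whereas a point-by-point comparison would report a difference. The sketch runs in O(nm) time, which is acceptable for short defect segments.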
The Pearson correlation coefficient captures linear relationships between signal amplitudes, while DTW captures similarity in signal shape under temporal misalignment. Using both metrics provides a complementary assessment of defect signatures in both amplitude and time-warped domains. Accordingly, the Pearson correlation coefficient ($r$) and Dynamic Time Warping (DTW) values were calculated separately for each pair of selected segments and presented in matrix form in Figure 4b,c. These matrices are divided into four sub-matrices. The upper left (UL) sub-matrix represents comparisons of H_RSM segments for patching, while the lower right (LR) sub-matrix represents comparisons for potholes. The upper right (UR) and lower left (LL) sub-matrices, which are transposes of each other, represent comparisons between patching and pothole segments. The average $r$ and DTW values were then computed for each of the UL, UR, and LR sub-matrices.
For patching, DTW shows a much lower within-patching distance (32,436) compared with cross-defect distances (92,756), meaning patching segments are more similar to each other under DTW. For potholes, Pearson correlation shows a high within-pothole similarity (0.92) compared with patching and cross-defect correlations around 0.75–0.76. These results suggest that DTW was more sensitive to similarities among patching segments, whereas the Pearson correlation coefficient more clearly captured similarities among pothole segments. This difference is expected because Pearson correlation measures linear similarity in signal shape, whereas DTW accounts for temporal misalignment between signal segments. These observations support the use of both metrics to evaluate the consistency and separability of defect-related H_RSM signals.
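The sub-matrix averaging can be sketched as below. The sketch assumes that the pairwise matrix is ordered with all patching segments first and all pothole segments last, and that self-comparisons on the diagonal are excluded from the within-type averages; the text does not state how the diagonal was handled, so `include_diagonal` is a hypothetical switch.

```python
def block_means(matrix, n_first, include_diagonal=False):
    """Average the UL, UR, and LR blocks of a symmetric pairwise matrix.

    matrix  : square list of lists (e.g. pairwise r or DTW values)
    n_first : number of rows/columns belonging to the first group
    """
    size = len(matrix)
    first = range(n_first)
    second = range(n_first, size)

    def mean_of(rows, cols, skip_diagonal):
        values = [matrix[i][j] for i in rows for j in cols
                  if not (skip_diagonal and i == j)]
        if not values:
            raise ValueError("block has no entries to average")
        return sum(values) / len(values)

    skip = not include_diagonal
    return {
        "UL": mean_of(first, first, skip),    # within the first group
        "UR": mean_of(first, second, False),  # across the two groups
        "LR": mean_of(second, second, skip),  # within the second group
    }
```

With six patching segments and three pothole segments, the call would be `block_means(matrix, n_first=6)` on a 9 × 9 matrix.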
We introduce the DCI to quantify the relative distinguishability of pothole and patching vibration signatures. Since Pearson correlation values have a bounded interpretation as measures of linear correlation, while DTW values are distance-based and depend on factors such as signal length, amplitude scaling, and preprocessing, the DCI does not directly combine raw Pearson and DTW values. Instead, both metrics are used as relative comparative measures within a consistent normalization. Accordingly, the DCI compares within-defect and cross-defect relationships using both Pearson correlation and Dynamic Time Warping. Equal weighting was selected to provide a simple and interpretable combined indicator for this exploratory analysis, although alternative weighting strategies could be considered in future work. The resulting DCI is therefore interpreted as a relative exploratory indicator of defect distinguishability and does not represent a universal threshold for classification.
The DCI was calculated using the average $r$ and DTW values from the sub-matrices shown in Figure 4b,c. The resulting DCI was 29%, indicating that the signatures of patching and potholes in the H_RSM signals differ by 29% on average. This value indicates a measurable level of separation between defect types within the dataset and supports the feasibility of applying machine learning models that rely on consistent differences in signal patterns to classify these defects within the evaluated dataset.
2.2. Defect Identification Algorithm
To demonstrate that defects can also be identified relative to intact road sections, we design a simple and interpretable first-stage detector, the Defect Identification Algorithm, based on the Hampel Identifier [13], to separate large defect peaks from intact segments. The Hampel Identifier is a robust outlier detection method that identifies data points whose deviation from the local median exceeds a specified threshold. It is formulated as:
$$|x_k - m_k| > c\,S_k, \qquad m_k = \operatorname{median}\{x_{k-(K-1)/2}, \dots, x_{k+(K-1)/2}\} \qquad (4)$$
$$S_k = 1.4826 \cdot \operatorname{median}\{|x_{k-(K-1)/2} - m_k|, \dots, |x_{k+(K-1)/2} - m_k|\} \qquad (5)$$
where $x$ is a signal containing outliers, such as peaks in the H_RSM signal caused by defects, referred to here as defect peaks. The coefficient $c$ in Equation (4) can be determined based on the expected percentage of outliers (defect peaks) and the length of the moving window [14]. The parameter $K$ in Equations (4) and (5) is an odd number that adjusts the moving window’s length and can be selected as twice the maximum length of outliers plus one [15]. However, determining $c$ in advance is challenging because the expected percentage of outliers is not known [16].
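A Hampel-type identifier of the kind described above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the scale factor 1.4826 (the usual constant that makes the median absolute deviation comparable to a standard deviation for Gaussian data), the truncation of the window at the signal edges, and the names `hampel_flags`, `window_length`, and `scale` are all choices made for this example.

```python
import statistics


def hampel_flags(x, window_length, c, scale=1.4826):
    """Flag samples whose deviation from the local median exceeds c * MAD.

    window_length : odd number of samples in the centered moving window
    c             : threshold coefficient
    scale         : assumed MAD scale factor (1.4826 for Gaussian consistency)
    """
    if window_length < 3 or window_length % 2 == 0:
        raise ValueError("window_length must be an odd number >= 3")
    half = window_length // 2
    flags = []
    for k, value in enumerate(x):
        # Centered window, truncated at the signal edges (assumption).
        window = x[max(0, k - half): k + half + 1]
        local_median = statistics.median(window)
        mad = scale * statistics.median(
            [abs(v - local_median) for v in window]
        )
        flags.append(abs(value - local_median) > c * mad)
    return flags
```

Because both the center and the spread are medians, a single large defect peak barely changes the threshold in its own window, which is why the peak stands out against it.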
Furthermore, this study applies an exponent to the signal to amplify high-magnitude defect-related peaks relative to lower-magnitude background fluctuations. This transformation can improve the separation between defect-related peaks and intact road segments. However, larger exponents may also over-amplify isolated noise or non-defect fluctuations. Therefore, a two-step Defect Identification Algorithm (DIA) is proposed in this study. The first step of DIA determines suitable values for the Hampel coefficient $c$ and the exponent using a grid-search procedure. The search range for $c$ was set from 3 to 12 to cover both sensitive and conservative Hampel thresholds. Values near the lower end of this range allow smaller deviations from the local median to be detected, increasing sensitivity to potential defect peaks but also increasing false detections in intact road segments. Values near the upper end impose a stricter threshold, reducing false detections but potentially missing weaker defect responses. The search range for the exponent was set from 1 to 4 to evaluate moderate levels of peak amplification. An exponent of 1 preserves the original signal, while larger values progressively emphasize high-magnitude peaks relative to lower-magnitude fluctuations. The upper limit of the exponent was selected to avoid excessive distortion of the signal and over-amplification of isolated noise.
Accordingly, a grid search was conducted over all combinations of $c$ and the exponent. Each combination was evaluated using the recorded signals by assessing whether defect-related peaks were consistently detected while intact road segments remained largely below the adaptive threshold. The selected values of $c$ and the exponent provided a stable balance between identifying major defect-related peaks and limiting excessive false detections in intact regions. Because this study focuses on an exploratory evaluation using a limited dataset, these parameters should be interpreted as empirically tuned values rather than universally optimal constants. The window length was selected based on the observed span of consecutive defect-related peaks in the recorded signals. After these parameters were determined, the second step applied the Hampel identifier to the H_RSM signals using the same parameter settings throughout the dataset.
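The grid search can be sketched as follows. The text describes the evaluation criterion only qualitatively, so the score used here (flagged defect samples minus flagged intact samples) is a hypothetical stand-in, as are the function names and the per-sample label list `is_defect`. The detector is a compact Hampel-type rule using an unscaled median absolute deviation.

```python
import statistics


def flag_outliers(x, window_length, c):
    """Compact Hampel-type rule: |x_k - median| > c * MAD in a moving window."""
    half = window_length // 2
    flags = []
    for k, value in enumerate(x):
        window = x[max(0, k - half): k + half + 1]
        med = statistics.median(window)
        mad = statistics.median([abs(v - med) for v in window])
        flags.append(abs(value - med) > c * mad)
    return flags


def grid_search(signal, is_defect, c_values, exponents, window_length):
    """Return (score, c, exponent) for the best-scoring combination.

    The score is a hypothetical criterion: detections on labeled defect
    samples minus detections on intact samples.
    """
    best = None
    for c in c_values:
        for p in exponents:
            # Amplify large peaks; abs() keeps the result real for any p.
            powered = [abs(v) ** p for v in signal]
            flags = flag_outliers(powered, window_length, c)
            hits = sum(f and d for f, d in zip(flags, is_defect))
            false_alarms = sum(f and not d for f, d in zip(flags, is_defect))
            score = hits - false_alarms
            if best is None or score > best[0]:
                best = (score, c, p)
    return best
```

A call matching the ranges in the text would be `grid_search(signal, is_defect, range(3, 13), range(1, 5), window_length)`. Ties are resolved in favor of the first combination visited, i.e. the smallest $c$ and exponent.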
2.4. Data Preparation
In the preprocessing stage, labeled H_RSM segments were modified to account for the continued signal response following the passage over a pothole, which can appear as an after-shock (as seen in Figure 3). To capture this extended effect, all labeled segments, regardless of defect type, were lengthened to twice their original size. This extension ensures that input segments reflect the full defect signature during inference, where the defect type is not known in advance. Additionally, segments were extended by six data points before the original start point to capture the initial rise in the signal in cases where the original label did not include the full defect onset. This approach is similar to the extension method used in the defect consistency analysis. In this case, however, the doubling of segment length occasionally resulted in merged segments. The original labeled positions were not adjusted to preserve possible uncertainty due to manual annotation. As described previously, labeling was performed by a researcher using synchronized camera images. Slight temporal mismatch may exist between when a defect is visible in the camera frame and when the vibration response is recorded by the sensors. Adjusting for this offset without precise calibration could introduce additional bias. This step is referred to as preprocessing in the pipeline.
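The extension step can be sketched as below. The sketch assumes that segments are half-open sample-index intervals `(start, end)`, that the doubling is applied after the original end point (to cover the after-shock), and that overlapping extended segments are merged; the function name and these conventions are assumptions for illustration.

```python
def extend_segments(segments, signal_length, pre_points=6):
    """Extend labeled segments and merge any that overlap afterwards.

    segments      : iterable of (start, end) sample indices, end exclusive
    signal_length : number of samples in the full signal
    pre_points    : samples added before the labeled start
    """
    extended = []
    for start, end in sorted(segments):
        length = end - start
        new_start = max(0, start - pre_points)
        # Doubling: append one more segment length after the original end.
        new_end = min(signal_length, end + length)
        if extended and new_start <= extended[-1][1]:
            # Overlap with the previous extended segment: merge the two.
            previous_start, previous_end = extended[-1]
            extended[-1] = (previous_start, max(previous_end, new_end))
        else:
            extended.append((new_start, new_end))
    return extended
```

Two closely spaced labels, such as `(20, 30)` and `(38, 48)`, collapse into the single segment `(14, 58)`, which illustrates how merged segments can arise from the doubling.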
In the second step of data preparation, referred to as transformation, merged labeled segments containing more than two peaks were subdivided into smaller segments, each with at most two peaks. This was done using local minima and maxima to isolate appropriate boundaries. Based on field observations, a single defect, particularly a pothole, often produces two consecutive peaks in H_RSM signals (see Figure 3). Limiting segments in this way ensures consistency across the dataset and improves interpretability. In the final stage of data preparation, referred to as data cleaning, segments that deviated from the typical signal shape were removed. These atypical shapes often resulted from the transformation of long segments or inaccuracies in labeling. Segments were excluded if they (i) decreased continuously at the beginning instead of rising to a peak, (ii) increased at the end rather than stabilizing or declining, (iii) lacked any clear peaks, or (iv) were substantially shorter than average. Figure 6a,b show examples of raw segments before preparation, while Figure 6c,d show examples after preprocessing, transformation, and cleaning. These segmentation and cleaning rules were applied consistently across all labeled signals. The resulting processed segments exhibited recurring structural characteristics, including dominant impact peaks, asymmetric decay behavior, and localized secondary responses. These observed patterns motivated the shape-based feature characterization and classification framework presented in the following section.
2.5. Feature Construction for Defect Classification
The processed defect segments exhibited recurring structural characteristics that could be used to distinguish different defect types. To quantitatively encode these patterns, four distinct feature sets were designed to describe the shape and temporal structure of the H_RSM signals.
Figure 7 illustrates a typical labeled segment. In this figure, point P marks the peak of the largest signal amplitude, observed early in the segment. The amplitude at point P is denoted as 2A. Points S and D correspond to the signal values at half the peak amplitude (i.e., A), indicating the onset and offset of the main peak, respectively. These define two subregions of the signal: Zone 1 spans from S to D, and Zone 2 extends from D to E, the end of the segment. Each zone has an associated duration: the time from S to D for Zone 1 and the time from D to E for Zone 2. For finer resolution, internal points I1 and I2 divide the intervals from S to P and from P to D into equal halves, while points I3, I4, and I5 divide the interval from D to E into four equal parts. These reference points serve as anchors for computing shape-descriptive features in the subsequent analysis. This segmentation into zones and internal points captures both the global and the localized signal characteristics needed to distinguish defect types.
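The reference points can be located with a sketch such as the following. It assumes a non-negative segment with a single dominant peak and takes S and D as the outermost samples around the peak that remain at or above half the peak amplitude, without interpolation between samples; these rules and the function name are assumptions, since the text does not specify how the half-amplitude crossings were computed.

```python
def segment_landmarks(signal):
    """Return sample positions of S, I1, P, I2, D, I3, I4, I5, and E.

    Positions are sample indices (fractional for the internal points).
    Zone durations are returned in samples.
    """
    if not signal:
        raise ValueError("segment is empty")
    p = max(range(len(signal)), key=lambda i: signal[i])  # peak P (2A)
    half = signal[p] / 2.0                                 # amplitude A
    s = p
    while s > 0 and signal[s - 1] >= half:                 # walk left to S
        s -= 1
    d = p
    while d < len(signal) - 1 and signal[d + 1] >= half:   # walk right to D
        d += 1
    e = len(signal) - 1                                    # end point E
    quarter = (e - d) / 4.0
    return {
        "S": s, "I1": (s + p) / 2.0, "P": p, "I2": (p + d) / 2.0, "D": d,
        "I3": d + quarter, "I4": d + 2 * quarter, "I5": d + 3 * quarter,
        "E": e,
        "zone1_duration": d - s,
        "zone2_duration": e - d,
    }
```

Dividing the durations by the sampling frequency (10 Hz in this study) converts them from samples to seconds.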
Using the characteristics of a typical labeled H_RSM segment, four distinct feature datasets were generated. The main goal was to encode the shape of the segments rather than their amplitudes since no direct correlation was observed between the amplitudes of H_RSM segments corresponding to different defects. Accordingly, the feature sets are as follows:
The 1st Feature Set: This set consists of 11 parameters. Two parameters are the slopes of the lines from S to P and from P to D. Two further parameters are the ratio of the Zone 1 duration to the Zone 2 duration and the ratio of the areas of the H_RSM signal in Zone 1 and Zone 2. Three parameters are the standard deviation, skewness, and kurtosis of the signal segment located between points S and D [17]. The final four parameters are the parameters of the generalized exponential Gaussian distribution (GEGD) [18], which were computed for the H_RSM signal portion in Zone 2. These four parameters are as follows. The location parameter shifts the entire distribution along the x-axis, similar to how the mean determines the central tendency in a normal distribution. The scale parameter affects the spread of the data by stretching or compressing the distribution, analogous to the role of standard deviation in a normal distribution. The shape parameter introduces asymmetry to the distribution by controlling its skewness, thus enabling the model to better fit data that are not symmetrically distributed. Lastly, the kurtosis parameter adjusts the peakiness and tail weight of the distribution, allowing it to represent data with heavy tails or varying degrees of peakiness, which can differ significantly from those of a normal distribution. Given its flexibility in modeling diverse data behaviors beyond the normal distribution, the GEGD was chosen to represent the shape of the H_RSM data segments in Zone 2. This choice is particularly suitable because the GEGD can model a segment as resembling a Gaussian distribution when a peak is present, or an exponential distribution when a peak is absent.
The 2nd Feature Set: This set contains 59 features: the 11 parameters of the 1st feature set plus 48 additional parameters. The first 24 additional parameters are the coefficients of the 3rd-order polynomials fitted between the point pairs (S, P), (P, D), (D, I3), (I3, I4), (I4, I5), and (I5, E). The polynomials were fitted to normalized values of the signal portions to exclude the effect of signal amplitude. The other 24 additional parameters are the pairwise ratios of the areas between the point pairs (S, I1), (I1, P), (P, I2), and (I2, D), and between the point pairs (D, I3), (I3, I4), (I4, I5), and (I5, E).
The 3rd Feature Set: This set contains 118 features. The additional 59 features are the same parameters used in the 2nd feature set but obtained from the squared magnitude of each labeled segment. The motivation behind this approach is that squaring widens the gap between large peaks and small peaks or low-valued signal points, which might help encode the differences between H_RSM signal segments corresponding to different defect types. As an example, Figure 8a shows an H_RSM segment corresponding to a pothole, while Figure 8b shows the squared magnitude of this signal. As these figures show, a small peak located in Zone 2 of this segment (see 4–5 s in Figure 8a) is less apparent relative to the main peak at 1–2 s in the squared-magnitude signal. In addition, Figure 8c shows the polynomial fitted to the normalized values of the segment located between points S and P. It should be noted that the feature sets are constructed from a limited number of training samples, which may increase the risk of overfitting. This is particularly relevant for the third feature set with 118 features. The use of PCA in the fourth feature set partially addresses this issue by reducing dimensionality while retaining most of the variance in the data. However, the reported classification performance should be interpreted with this limitation in mind.
The 4th Feature Set: This set contains 18 features, which were obtained from the 3rd feature set through Principal Component Analysis (PCA), a dimensionality reduction method [19]. The goal of this subset is to avoid the curse of dimensionality, which refers to the various issues that arise in high-dimensional spaces, such as increased computational complexity, data sparsity, reduced effectiveness of distance metrics, and risk of overfitting. Accordingly, PCA transforms a feature set into a set of orthogonal (uncorrelated) components, ordered by the amount of variance they explain, thereby simplifying the complexity of high-dimensional data while retaining its essential patterns. Afterwards, as a rule of thumb, the components that together explain 95% of the total variance are used as the new features. Figure 8d shows that the first 18 components reached 95% of the total explained variance.
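The 95% rule of thumb can be sketched as a small helper that operates on the variances explained by the components (for example, the eigenvalues of the feature covariance matrix). Computing those variances is left to a PCA routine; the function name and the example values in the usage note are illustrative and are not taken from the study's data.

```python
def n_components_for_variance(component_variances, target=0.95):
    """Smallest number of leading components whose cumulative explained
    variance reaches the target fraction of the total variance."""
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative >= target * total:
            return count
    return len(ordered)
```

For instance, with illustrative variances `[4, 3, 2, 1]`, three components explain 90% of the total, so all four are needed to reach 95%.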
The feature sets were split into training and testing subsets, and four training subsets were fed into a Random Forest machine learning model. Each training subset included 28 data points for patching and 11 data points for pothole, while each testing subset contained 7 data points for patching and 3 data points for pothole. Random Forest was selected because it builds multiple decision trees during training and combines their outputs to improve predictive performance [19]. Each tree is trained on a random subset of data (bagging) and uses a random subset of features for splitting nodes, which enhances model robustness and accuracy while mitigating overfitting. By averaging the predictions of these trees, Random Forest provides a more stable and accurate prediction than individual decision trees. The Random Forest hyperparameters were selected using a grid search procedure within the training subset. The tuned parameters included the number of trees, maximum tree depth, and minimum number of samples required to split an internal node. Five-fold stratified cross-validation was used during model training to reduce sensitivity to class imbalance in the small dataset. The final model performance was then evaluated using the held-out test subset.
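The reported subset sizes are consistent with a class-stratified split that holds out roughly 20% of each class. The sketch below illustrates such a split; the 20% fraction, the random seed, and the function name are assumptions, since the text reports only the resulting counts.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test while preserving class ratios."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out the rounded share of each class.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Applied to 35 patching and 14 pothole segments, this yields 28 and 11 training samples and 7 and 3 test samples, matching the counts above. With only three pothole samples in the test subset, a single misclassification changes the pothole recall by a third, which is one reason the results are framed as exploratory.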