1. Introduction
Road roughness is a critical indicator for assessing pavement quality, as it directly impacts vehicle dynamics and passenger ride comfort. Irregularities in the road surface induce vertical vibrations that are transmitted to the vehicle body, which can deteriorate driving stability and increase passenger fatigue [
1]. Beyond comfort, advanced AI-powered detection systems have been increasingly recognized for their potential to enhance road safety and provide life protection by identifying hazardous damage early [
2]. Therefore, accurate identification and classification of road surface conditions are vital. The International Standards Organization (ISO) classifies road roughness into classes A through F based on Power Spectral Density (PSD), a standard widely adopted in simulation and engineering practices [
3].
Traditional methods for measuring road roughness are generally categorized into contact-based and non-contact-based approaches. Contact-based methods, such as profilers, offer high accuracy but are limited in scalability and high-speed applicability [
4]. In the realm of non-contact approaches, recent studies have successfully employed Convolutional Neural Networks (CNNs) to classify road environments using visual data [
5] and assess terrain roughness based on ISO standards [
6]. While these visual and sensor-based machine learning techniques show promise [
7], traditional feature engineering relying on handcrafted statistical indicators, such as RMS and kurtosis, often fails to capture the complex nonlinear mapping between road excitation and vehicle response [
8].
In accelerometer-based approaches, sensor placement is a pivotal design choice. Early studies typically placed sensors at the wheel center (unsprung mass) to classify roughness conditions using methods such as sliding time windows [
9,
10]. While this configuration captures raw road profiles directly, it faces severe engineering challenges due to the harsh working environment. Furthermore, raw wheel-end signals contain excessive high-frequency noise. In contrast, reusing existing sprung mass accelerometers offers a robust alternative. Not only is this approach cost-effective by leveraging standard onboard sensors or low-cost MEMS units, a direction validated by recent monitoring studies [
11], but the suspension system also acts as a natural filter for irrelevant high-frequency noise, aligning better with ride comfort assessment [
1].
Given the costs associated with real-world data collection, simulation has become a pivotal approach for generating training datasets. However, widely utilized linear quarter-car models often oversimplify suspension components. To address these deficiencies and capture nonlinear dynamic characteristics, recent research has increasingly relied on joint simulation environments and real-vehicle tests to validate speed-adaptive classification frameworks [
8]. This study establishes a high-fidelity full-vehicle model to generate realistic vibration feature signals essential for training robust classifiers.
To effectively process these complex signals and overcome the limitations of manual feature extraction, Deep Learning (DL) models such as CNNs [
12], RNNs [
13], and Transformers have been explored. Specifically, Long Short-Term Memory (LSTM) networks [
14] were introduced to capture temporal dependencies, while Transformer architectures [
15] and their time-series variants like Informer [
16] have shown promise in modeling long-range correlations. However, existing architectures face limitations in vehicular edge computing scenarios. LSTM suffer from serial execution bottlenecks, preventing parallel computation and limiting inference speed on embedded chips. Transformers, while capable of modeling long-range dependencies, incur heavy computational costs due to self-attention mechanisms. Moreover, they treat continuous physical signals as discrete tokens, which may result in the loss of phase continuity information.
To overcome these challenges in both hardware deployment and algorithm efficiency, the main contributions of this paper are as follows:
We propose a road roughness classification framework utilizing a single sprung mass accelerometer. The framework is primarily validated on high-fidelity synthetic data generated from a joint CarSim–Simulink platform covering ISO A–F roughness levels. Furthermore, to verify the robustness and generalization capability of our model under realistic engineering constraints, we conduct comprehensive stress tests across varying environmental noise levels, vehicle mass configurations, and driving speeds, confirming its reliability for real-world deployment.
We conduct a comprehensive comparison of various deep learning models, systematically analyzing the trade-offs between classification accuracy and computational efficiency in the context of vehicular edge computing.
We propose DB-MLP, a novel dual-branch architecture based purely on MLPs. Unlike Transformers, it eliminates complex tokenization and heavy attention mechanisms; unlike LSTMs, it supports fully parallel inference. The proposed model achieves superior accuracy with minimal parameters, making it an ideal candidate for real-time deployment.
3. Methodology
3.1. Overall Framework
The proposed framework performs integrated road roughness classification based on simulated vibration data, combining physical modeling, signal preprocessing, and a neural network classifier (
Figure 1).
Road surface profiles corresponding to ISO 8608 roughness levels (A–F) are generated and used as excitation inputs in a full-vehicle dynamics simulation. The resulting vertical acceleration of the sprung mass is recorded through a virtual accelerometer mounted at the vehicle’s center of gravity. The acceleration signals are sampled at 1000 Hz, segmented into 1 s windows, normalized, and labeled according to the corresponding road class, forming a physically consistent training dataset with representative vibration characteristics.
The classification model is implemented using the proposed DB-MLP architecture, a lightweight multi-branch MLP framework specifically designed for vibration-based classification. DB-MLP extracts multi-scale temporal patterns through nonlinear temporal projection and a generic classification head, without relying on decomposition or attention mechanisms.
Finally, the trained DB-MLP network outputs one of six ISO-defined roughness levels. This framework integrates physical simulation and data-driven modeling to achieve efficient and accurate road surface classification using a single accelerometer configuration.
3.2. Simulation-Based Data Generation and Preprocessing
To obtain labeled, high-quality sensor data for surface classification, we simulate road-induced vehicle responses based on the ISO 8608 standard. This standard defines road roughness as a stochastic process characterized by its Power Spectral Density (PSD). Each roughness class (A–F) corresponds to a different PSD level.
3.2.1. Road Surface Roughness Model
The road surface elevation is modeled as a zero-mean stationary Gaussian process. Its spatial frequency characteristics are described by the PSD function:
Here, n is the spatial frequency , and is the reference frequency. denotes the road roughness coefficient, which varies with ISO-defined classes (A–F). is the typical frequency index that controls how roughness energy distributes over frequency. This model allows generating synthetic road profiles with specific statistical properties.
To analyze road profiles in the temporal domain, the PSD can be transformed as:
where
is the angular frequency,
u is vehicle speed, and
is the cutoff frequency.
This temporal representation enables simulation via linear shaping filters.
The corresponding shaping filter transfer function is:
The road roughness evolution can also be expressed as a stochastic differential equation:
where
represents the road surface roughness displacement
;
= is the lower cutoff spatial frequency and is a zero-mean white noise signal with a unit power spectral density.
Equation (
4) characterizes the road surface as a first-order stochastic process driven by filtered white noise, effectively capturing the statistical properties of road roughness for simulation and vibration analysis.
3.2.2. Simulink-Based Profile Generation
The road profile generation model was implemented within the MATLAB/Simulink environment, grounded in the mathematical formulation of Equation (
4). A standard Gaussian white noise module served as the stochastic source
, configured with sufficient sampling resolution to accurately capture high-frequency road features. This noise signal acts as the excitation input to a continuous transfer function block, which shapes the spectral content to produce the time-domain road elevation
.
To simulate a comprehensive range of driving scenarios, independent road profiles corresponding to ISO 8608 roughness classes A through F were generated. The roughness coefficient
for each class is constrained within the lower and upper limits specified in
Table 1, with the geometric mean serving as the reference baseline. The vehicle speed
u within the model was fixed at 10 m/s to ensure kinematic consistency with the subsequent full-vehicle dynamics simulation.
The resulting time-domain roughness signals at a speed of 10 m/s are illustrated in
Figure 2.
Figure 2a displays the relatively smooth profiles for classes A and C, while
Figure 2b depicts the significantly higher amplitudes associated with the rougher conditions of classes D and F.
3.3. Road Profile Generation and Validation
Road roughness signals are generated in Simulink by filtering white noise to obtain road surface profiles that follow the target power spectral density (PSD). Each simulated profile is evaluated by comparing its PSD with the ISO 8608 reference spectrum. The lower and upper tolerance limits are defined as one-half and twice the standard PSD, respectively, as summarized in
Table 1. As shown in
Figure 3, the generated profile falls within the D-class region, confirming the accuracy of the road roughness modeling. The validated elevation data are then exported via the
To Workspace block and imported into CarSim for full-vehicle simulation.
3.4. Vehicle Response Simulation and Labeling
To generate a high-fidelity labeled dataset, a production-level mid-size SUV model was established within the CarSim environment. The vehicle simulated driving over generated ISO 8608 road profiles (Classes A–F) at a constant speed of 10 m/s. The suspension system employed a double-wishbone (SLA) configuration at the front and a Hotchkiss solid axle at the rear.
Crucially, to replicate realistic driving conditions and induce authentic vehicle roll dynamics, distinct road profiles were generated for the left and right wheels. While both tracks correspond to the same ISO roughness class, they were synthesized using independent random seeds and slightly varying roughness coefficients () within the standardized limits. This spatial asymmetry ensures that the model captures complex vehicle body motions beyond simple vertical heave.
The primary feature for classification is the sprung-mass vertical acceleration, recorded by a virtual sensor at the vehicle’s center of gravity with a sampling rate of 1000 Hz. Physically, the vehicle suspension acts as a mechanical low-pass filter, effectively attenuating high-frequency disturbances from tire-road interactions and engine vibrations while preserving the low-to-mid frequency components that characterize road roughness. The continuous acceleration signals were then segmented into 1-s windows and labeled according to the corresponding road roughness class. The key physical parameters of the vehicle model are detailed in
Table 2.
The simulated acceleration responses are illustrated in
Figure 4. While a global trend of increasing signal amplitude is observable from Class A to F, amplitude alone is often insufficient for precise classification due to the significant overlap between adjacent classes.
Classes A–C (
Figure 4a): Exhibit continuous, stationary random vibrations with relatively low amplitudes.
Classes D–F (
Figure 4b): Induce significantly higher vertical excitations. Notably, these rougher roads trigger nonlinear suspension dynamics, characterized by transient high-energy impacts and distinct decay patterns.
These temporal and spectral variations, rather than simple energy thresholds, provide the critical discriminative information for the proposed model.
3.5. DB-MLP Architecture
3.5.1. Motivation and Background
Sprung-mass acceleration signals generated by road excitation exhibit strong non-stationarity, broadband frequency content, and short-duration high-energy transients. These characteristics differ from the long-term temporal dependencies commonly targeted by forecasting-oriented architectures such as RNNs and Transformers. In the present road-surface classification setting, the sequence length is moderate (1 s, 1000 samples) and the discriminative information is mainly encoded in local vibration patterns rather than very long-range dependencies.
Transformer-based models and other deep attention architectures have demonstrated strong performance on generic time-series benchmarks and can also achieve competitive accuracy in our setting. However, they typically require substantially more parameters, higher computational cost, and increased inference latency, which limits their suitability for real-time, embedded deployment with low-cost onboard hardware.
Conversely, recent linear time-series models show that simple temporal projections can be surprisingly effective for forecasting, but their purely linear mappings lack the nonlinear capacity needed to fully separate road-induced vibration patterns, especially in high-frequency bands.
To balance accuracy and efficiency, we propose a lightweight DB-MLP (Dual-Branch MLP) architecture. The design philosophy is to use nonlinear MLP projectors to extract discriminative vibration features, while a multi-branch structure captures multiple temporal resolutions without resorting to complex attention mechanisms. This yields a compact model that aligns well with the characteristics of high-frequency sprung-mass acceleration and is better suited to real-time deployment on resource-constrained platforms. The overall architecture is illustrated in
Figure 5, and the detailed specifications are listed in
Table 3.
3.5.2. Dual-Domain MLP Block
The core processing unit of DB-MLP is the Dual-Domain MLP Block. To efficiently model the dynamic characteristics inherent in vibration signals, this block employs a decoupled design that sequentially applies temporal mixing and position-wise feature transformation. This separation significantly reduces the parameter complexity compared to standard dense layers.
First, a time-mixing MLP captures global temporal dependencies. We reinterpret the time dimension
as the feature axis, projecting it into a latent space of dimension
and then restoring it. Omitting bias terms for brevity, the operation is defined as:
where
denotes the GELU activation. Crucially, this bottleneck structure minimizes redundancy:
compresses the temporal sequence, and
reconstructs it. The total parameters for this module scale linearly with
, avoiding the quadratic complexity of full attention mechanisms.
Subsequent to the temporal projection, the second stage concentrates on the local feature domain. Whereas conventional Transformer architectures typically utilize a Position-wise Feed-Forward Network to project features into a high-dimensional latent space via an expansion-reduction mechanism, our analysis indicates that such dimensional expansion introduces unnecessary redundancy and heightens the risk of overfitting for univariate inputs possessing a single channel.
To mitigate this issue, we substitute the standard FFN with a streamlined Position-wise Feature Transformation block. This module is instantiated as a learnable Adaptive Scaling Layer that performs a non-linear transformation independently at each time step to calibrate the signal amplitude:
In this formulation, and denote learnable scalar weights within , and represents the GELU activation function. By eschewing the extensive parameterization characteristic of standard FFNs, this module functions as a critical dynamic gain controller. It operates orthogonally to the Time Mixing layer by selectively amplifying significant resonance patterns or attenuating noise based on signal intensity, thereby maximizing parameter efficiency for resource-constrained edge devices.
3.5.3. Feature Fusion and Classification
The final stage integrates the learned representations from all temporal branches to form a holistic description of the road surface. Since each branch outputs a univariate feature sequence
, we flatten and concatenate them to construct a unified global feature vector:
The dimensionality of this fused vector is the sum of temporal lengths across all scales, . In our specific configuration (), this results in a comprehensive feature vector , preserving both fine-grained high-frequency details and coarse-grained low-frequency trends.
Finally, a lightweight MLP projection head maps this high-dimensional representation to the semantic label space. Consistent with previous blocks, we simplify the notation by omitting bias terms:
where
projects the fused features into the latent hidden space, and
generates the final logits for the
road surface classes.
3.5.4. Model Complexity and Efficiency
To demonstrate the lightweight nature of DB-MLP, we analyze its parameter complexity () and computational complexity (). Let denote the cumulative sequence length across all branches. The parameter distribution is analyzed as follows:
Time Mixing Layers: As the primary feature extractors, these layers perform global projection across time. Their parameter count scales linearly with the sequence length: .
Classification Head: The first projection layer of the MLP head maps the fused temporal vector to the hidden space, contributing .
Position-wise Transformation: Due to the univariate input design (), the adaptive scaling layers contribute a constant and negligible amount of parameters relative to the sequence length.
Aggregating these components, the total parameter count is dominated by the linear terms:
For our experimental configuration (), the model contains approximately 0.58 million parameters.
In terms of computational complexity, DB-MLP achieves linear complexity with respect to the sequence length, i.e., . This represents a substantial efficiency advantage over Transformer-based architectures, whose self-attention mechanism incurs a quadratic complexity of .
3.5.5. Design Summary
In summary, the proposed DB-MLP architecture integrates three distinguishing characteristics tailored for vibration signal analysis:
Physics-Aware Multi-Resolution Modeling: By employing a dual-branch structure, the model simultaneously captures high-frequency transient impacts and low-frequency suspension trends, addressing the broadband nature of road excitations.
Linear-Complexity Feature Extraction: The decoupled Time-Mixing and Position-wise Transformation design avoids the quadratic bottleneck of self-attention mechanisms, ensuring computational efficiency.
Holistic Representation Integration: The model fuses global temporal context across scales into a unified vector, creating a highly separable latent space that simplifies the downstream classification task.
Collectively, these design choices establish DB-MLP as a superior trade-off solution. By delivering highly competitive accuracy with a minimal parameter footprint, the architecture demonstrates strong theoretical feasibility for future deployment on resource-constrained automotive control units, see
Table 4.
3.6. Training Strategy
The training process involves a physics-preserving preprocessing step followed by a supervised classification objective.
3.6.1. Physics-Aware Normalization
Standard time-series normalization often employs instance-wise scaling, such as normalizing each sample independently to zero mean and unit variance. However, in road surface classification, the absolute amplitude of the acceleration signal carries critical physical information. Specifically, rough roads like Class E or F are characterized by high-energy impacts where the suspension often hits its limit stops (bump stops), generating peak values significantly higher than those of smooth roads.
If instance-wise normalization were applied, the low-amplitude noise of Class A roads would be rescaled to the same numerical range as the high-impact signals of Class F, thereby destroying the relative energy features between classes. To preserve the physical consistency of vibration magnitudes, we adopt a Global Normalization strategy. The normalized input
is computed as:
where
and
are the global mean and standard deviation calculated over the entire training dataset. This ensures that the distinct signal signatures of suspension limit impacts are retained in the input feature space.
3.6.2. Optimization Objective
We formulate the road surface recognition task as a multi-class classification problem. Given the normalized input sequence
and its corresponding ground-truth label
, the model predicts a probability distribution
via a softmax classifier. The model parameters are optimized by minimizing the categorical cross-entropy loss:
where
corresponds to the ISO 8608 road classes (A–F),
is the one-hot encoded target label, and
is the predicted probability for class
k.
4. Experiments
4.1. Dataset Construction and Preprocessing
To evaluate the performance of the proposed method in a controlled yet realistic environment, we constructed a high-fidelity vehicle vibration dataset. To ensure the representativeness of the data regarding real-world operating conditions, the road elevation profiles were synthesized strictly following the ISO 8608 international standard for road roughness classification. This standardization guarantees that the statistical properties of the generated profiles align with physical road surfaces found in engineering practice.
Data Generation. As detailed in
Section 3, the road profiles for classes A–F were generated based on their power spectral density (PSD) characteristics. These profiles were then processed through a high-fidelity vehicle dynamics model in CarSim software (version 2021.0). Unlike simplified quarter-car models, this simulation environment captures the complex non-linear dynamic interactions between the tire, suspension, and vehicle body. The simulation collected vertical acceleration signals at a sampling rate of 1000 Hz for a duration of 300 s per road class, ensuring sufficient temporal diversity to cover stochastic road features.
Data Splitting and Preprocessing. To generate input samples for the deep learning model, a non-overlapping sliding window of 1 s () was applied to the continuous time-series data. Consequently, the standard dataset comprises 1800 samples (300 samples × 6 classes). The dataset was randomly partitioned into training (70%), validation (20%), and testing (10%) sets.
Auxiliary Robustness Test Sets. In addition to the standard dataset collected under nominal conditions, we generated distinct auxiliary test sets to evaluate the model’s generalization capability against domain shifts. These sets include data under varying vehicle speeds (20 km/h), increased sprung mass (+480 kg), and environmental noise injection, which are specifically analyzed in
Section 4.7.
4.2. Baselines and Metrics
We compare DB-MLP with a comprehensive set of baselines:
RNN/CNN-based: LSTM, InceptionTime [
25].
Transformer-based: Transformer [
15], Crossformer, Informer [
16].
MLP-based: TimeMixer [
24], LightTS [
23].
Evaluation metrics include Accuracy, Macro F1-Score, Parameter Count (Params), and Inference Time (averaged per sample).
4.3. Implementation Details
All models were implemented in PyTorch 1.7.1 on an NVIDIA RTX 4090 GPU.
Training Protocol: We use the Adam optimizer () with an Early Stopping mechanism (patience = 60) to prevent overfitting.
Batch Size: A batch size of 128 is used for MLP-based models, while it is adjusted to 32 for Transformer variants to accommodate GPU memory constraints.
Unified Setting for Generalization: We adopted a unified protocol (, max epochs = 300) to ensure fair comparison without dataset-specific tuning.
4.4. Performance Evaluation
4.4.1. Comparative Analysis with Baselines
Table 5 presents the quantitative comparison results. The proposed DB-MLP outperforms all baselines, achieving the highest accuracy (0.985) and F1-score (0.984).
Accuracy vs. Efficiency: While recent MLP-based models demonstrate strong performance, DB-MLP achieves a superior balance between accuracy and efficiency.
Specifically, compared to the second-best model TimeMixer (Acc. 0.965), DB-MLP improves accuracy by 2.0%. Remarkably, this gain is achieved with significantly lower resource consumption: DB-MLP requires only 6.7% of TimeMixer’s parameters (0.58 million vs. 8.55 million) and roughly 1% of its GPU memory (7.4 MB vs. 744.1 MB).
Compared to the lightweight baseline LightTS, DB-MLP not only improves accuracy by 3.0% (0.985 vs. 0.955) but also reduces inference latency by roughly a factor of three (0.007 ms vs. 0.024 ms). Overall, DB-MLP offers the fastest inference speed among all models, being approximately 20 times faster than InceptionTime and TimeMixer, making it the ideal candidate for real-time deployment on edge devices.
Architecture Analysis: The results highlight a clear trend: MLP-based architectures (TimeMixer, LightTS, DB-MLP) generally outperform RNN, CNN, and Transformer-based models in this task. This suggests that for fixed-length vehicle vibration signals, global temporal mixing is more effective than local convolution or recurrent processing. However, standard Attention-based models (e.g., Informer at 0.835) struggle significantly, likely due to the mismatch between sparse attention mechanisms and the need for capturing high-frequency fine-grained features. DB-MLP’s dual-branch design successfully overcomes these limitations by explicitly modeling both local details and global trends with linear complexity, leading to superior performance.
4.4.2. Confusion Matrix Analysis
Figure 6 presents the confusion matrix derived from the testing set. The pronounced diagonal dominance corroborates the exceptional precision of the model. The limited number of misclassifications are strictly confined to adjacent categories, a pattern that reflects the inherent physical continuity of ISO 8608 road profiles. To effectively manage these boundary ambiguities in a real-world deployment setting, we propose a robust post-processing strategy. Given that the physical overlap between classes precludes absolute frame-level separation, the control logic necessitates the incorporation of a confidence margin defined by a minimum probability difference
between the predicted and secondary classes. Furthermore, a temporal consistency check is employed to ensure that the minor stochastic fluctuations observed at class boundaries do not propagate instability to the downstream suspension control system.
4.5. Visual Analysis
To interpret the learned representations, we visualize the feature manifold of the test set using t-SNE (
Figure 7). The resulting topological structure is highly consistent with the vehicle dynamic responses illustrated in
Figure 4. The model effectively distinguishes three physical regimes:
Tire-Filtering Stage (Classes A and B): The distinct isolation of these classes corresponds to the suspension-quiescent regime. On smooth roads, micro-texture excitations are primarily absorbed by tires, resulting in minimal sprung mass response. The model correctly clusters these as a distinct low-energy state.
Effective Stroke Regime (Classes C and D): As roughness increases, the suspension operates continuously within its effective design stroke. Although the damper exhibits non-linear force-velocity characteristics, the system dynamics remain smooth and bounded. The model maps Classes C and D as adjacent clusters with touching boundaries, accurately reflecting the continuous ordinal nature of the ISO 8608 spectrum before physical saturation occurs.
Saturation and Impact Point (Separation of D and E): A significant gap separates Class D from Class E. This structural discontinuity aligns with the transition from continuous oscillation to mechanical saturation (e.g., bump stop engagement). As observed in the severe regime responses (Classes E–F in
Figure 4), the acceleration signals exhibit spikes characteristic of hard non-linear impacts. The model successfully detects this dynamics shift, separating the high-impact group from the nominal effective stroke group.
4.6. Ablation Study
To systematically evaluate the contribution of each component in the proposed DB-MLP architecture, we conducted two groups of ablation experiments: (i) Module Importance Analysis, employing a subtractive strategy, and (ii) Multi-Branch Depth Analysis, investigating the optimal number of temporal scales. All models share the same training settings to ensure a fair comparison.
4.6.1. Ablation Settings
To quantify the contribution of specific components, we configure five variants based on the subtractive “drop-out” principle:
M0 (DB-MLP Ours): The complete architecture with dual-branch inputs, Time-Mixing, and Position-wise Feature Transformation enabled.
M1 (w/o Multi-Scale): Disables the multi-branch structure. The model processes only the single-scale original sequence () while retaining the Time-Mixing and Position-wise modules.
M2 (w/o Position-wise Trans.): Removes the Adaptive Scaling block. The model retains the dual-branch inputs and Time-Mixing MLP for feature extraction.
M3 (w/o Time-Mixing): Removes the global Time-Mixing MLP. The model retains the dual-branch inputs but relies on the Position-wise Feature Transformation for feature processing.
M4 (MLP Projection Head): Removes all feature extraction modules. This baseline consists solely of the final MLP projection head applied directly to the flattened raw input, serving as a lower-bound reference.
We investigate the impact of temporal scale depth by varying the number of down-sampling branches:
L0 (2 Branches): Combines the original sequence () and the down-sampled sequence. This is the default setting (Ours).
L1 (3 Branches): Adds a coarser down-sampled branch to L0.
L2 (4 Branches): Further adds an down-sampled branch to L1.
Consistent with the main experiments, all ablation variants are trained with a learning rate of and early stopping with a patience of 60 epochs.
4.6.2. Results and Analysis
Table 6 summarizes the results. The full DB-MLP model achieves the highest accuracy of 98.5%.
Dominance of Time-Mixing: The most significant performance drop occurs when removing the Time-Mixing module (M3), with accuracy plummeting to 81.7% (). This drastic degradation confirms that the model’s primary capability stems from modeling global temporal dependencies via the Time-Mixing MLP.
Importance of Non-Linearity: Removing the Position-wise Transformation (M2) results in a 2.3% drop in accuracy. This indicates that the non-linear feature recalibration is essential for refining the features extracted by the temporal mixer.
Benefit of Multi-Scale: Simplifying the model to a single scale (M1) leads to a 1.3% decrease in accuracy. The dual-branch structure provides consistent gains by capturing patterns at different resolutions.
Superiority over Baseline: The standalone MLP Projection Head(M4) achieves only 71.7% accuracy. The DB-MLP architecture yields a massive improvement of +26.8%, validating that the proposed dual-domain feature extraction is far more effective than a direct linear projection.
Table 7 illustrates the effect of varying the number of temporal scales.
Optimal Configuration (L0): The dual-branch structure achieves the peak accuracy of 98.5%.
Feature Degradation in Deeper Branches (L1 and L2): Notably, adding a third branch (L1, down-sampling) causes the accuracy to drop back to 97.2%, identical to the single-scale baseline (M1). This suggests that the aggressive down-sampling filters out critical high-frequency vibration components (10–50 Hz), producing over-smoothed features that cancel out the benefits of the branch. Further increasing the depth (L2) exacerbates this information loss (95.8%).
4.6.3. Discussion
The ablation study leads to three key conclusions:
Time-Mixing is the Core Driver: Global temporal modeling is the cornerstone of the proposed method. Removing the Time-Mixing module (M2) causes the most drastic performance drop (), confirming that the model relies on capturing long-range dependencies rather than simple statistical patterns.
Shallow Multi-Scale achieves the Best Balance: Our experiments identify the dual-branch structure (L0) as the optimal balance. While adding one down-sampled branch improves robustness, further increasing the depth (L1, L2) degrades performance due to spectral information loss and over-smoothing, rendering deeper branches ineffective.
Efficiency of Structured Feature Extraction: The DB-MLP achieves superior accuracy (98.5%) with 0.58 million parameters. In contrast, the standalone MLP Projection Head (M4) fails to capture complex patterns (71.7%), demonstrating that the performance gain comes from the specific dual-domain architecture rather than the classification head itself.
4.7. Robustness and Generalization Analysis
To validate the reliability of the proposed DB-MLP for real-world deployment, this section evaluates the model’s performance under three severe domain shifts: environmental sensor noise causing signal degradation, changes in vehicle mass resulting in parameter variations, and variations in driving speed causing frequency shifts.
4.7.1. Resilience to Environmental Sensor Noise
To verify the model’s immunity to real-world interference, we conducted a stress test by injecting Additive White Gaussian Noise into the standardized sensor data strictly during the inference phase. The model was evaluated at Signal-to-Noise Ratios ranging from 30 dB down to 10 dB. This broad spectrum approximates unstructured environmental disturbances ranging from inherent sensor fluctuations to high-intensity mechanical vibrations.
As illustrated in
Figure 8, the results demonstrate strong robustness. Under typical noise conditions where SNR ≥ 20 dB, the accuracy at 20 dB showed a negligible drop compared to the clean baseline, indicating that standard sensor noise has virtually no effect. Even under severe interference at 10 dB, the model maintained a high accuracy of 94.3%. Notably, the F1-Score of 0.945 remained closely aligned with the accuracy, confirming balanced classification capabilities without bias.
We attribute this robustness to the dual-branch architecture, where the Average Pooling branch effectively functions as a low-pass filter, suppressing high-frequency noise while preserving essential road features.
4.7.2. Adaptability to Vehicle Mass Variations
To further evaluate the model’s adaptability to vehicle parameter changes, a sensitivity test on the sprung mass was conducted. In real-world driving, passenger occupancy significantly alters the vehicle’s natural frequency according to , which can potentially affect signal patterns.
Based on the simulation configuration, the baseline vehicle comprising only the driver has a total sprung mass of 2337 kg. A fully-loaded test set was generated by increasing the occupancy to 7 passengers, raising the total sprung mass to 2817 kg. This corresponds to a net mass increase of 480 kg, representing approximately a 20.5% variation. The model was trained solely on the baseline dataset and directly tested on this heavy-load dataset without fine-tuning.
Despite the significant shift in vehicle dynamics caused by the mass increase, the model maintained a high accuracy of 96.7%. This represents only a marginal decline of 1.8% compared to the baseline of 98.5%. This result confirms that the DB-MLP captures intrinsic road roughness features rather than overfitting to specific vehicle dynamic parameters.
4.7.3. Generalization Across Driving Speeds via Physics-Guided Augmentation
Finally, this subsection validates the capability of the model to handle severe frequency shifts by testing on unseen 20 km/h data dominated by resonance using a model trained only on 36 km/h data dominated by isolation.
Vehicle speed v linearly scales the signal frequency f according to . Consequently, lower speeds result in temporally stretched waveforms. Specifically, 20 km/h represents a worst-case scenario because it shifts road excitation frequencies into the vehicle’s resonance band (1–2 Hz). This creates high-amplitude body motion noise that makes classification significantly harder than at the nominal training speed of 36 km/h. To address this, we adopted a Single-Speed Training, Cross-Speed Testing protocol. A physics-guided resampling augmentation was applied to the 36 km/h training data to mathematically simulate the temporal patterns covering the range of 20–50 km/h, forcing the model to learn invariant spatial features rather than speed-dependent patterns.
As presented in
Table 8, the frequency mismatch caused the Baseline (without augmentation) to drop to 84.4%. Our approach recovered performance to 90.0%, achieving an improvement of 5.6%. The remaining gap to the Oracle upper bound (95.3%) is expected and physically explainable: linear resampling cannot fully replicate the non-linear amplitude amplification caused by suspension resonance at specific frequencies.
Physical Consistency Analysis. The comparison of confusion matrices in
Figure 9 reveals that the proposed method captures the true physical distribution of the target domain, despite the domain gap. Two key observations confirm this:
Resonance Amplitude Bias: We observed a slight tendency to underestimate roughness (e.g., Class D misclassified as C). This occurs because the mathematically resampled signals exhibit lower dynamic intensity than the actual resonating vehicle body dynamics at 20 km/h, leading to conservative predictions.
Identical Limit Behavior (Class F ≈ E): Remarkably, our accuracy on the most severe Class F roads is 0.89, which is identical to the Oracle baseline. Both models exhibit an 11% misclassification into Class E. This reflects a physical reality: at low speeds, the impact energy is insufficient to fully engage suspension stops, making Class F and E physically indistinguishable. This shared error pattern confirms the model has learned the genuine road dynamics rather than statistical artifacts.
4.7.4. Summary and Discussion
In summary, the systematic evaluations in
Section 4.7 demonstrate that DB-MLP maintains high precision across diverse domain shifts. Specifically, the model achieved 94.3% accuracy under environmental noise at 10 dB SNR, maintained 96.7% accuracy despite vehicle mass variations, and reached 90.0% accuracy under speed-induced frequency mismatches when utilizing PGA. These results confirm that the dual-branch architecture effectively captures intrinsic vibration features rather than over-fitting to specific vehicle parameters.
The 5.3% performance gap observed in the speed generalization test described in
Section 4.7.3 warrants a deeper physical analysis. This discrepancy is primarily attributed to the frequency-selective gain of the suspension system. While linear resampling aligns the spectral components in the frequency domain, it does not account for the nonlinear magnitude amplification caused by resonance. Specifically, at 20 km/h, the excitation shifts into the primary resonance band of 1 to 2 Hz, where the system’s transmissibility
is significantly higher than in the isolation zone at 36 km/h. To theoretically bridge this gap, a Magnitude Correction Factor
can be defined to adjust the amplitude of resampled signals:
where the parameters are defined as follows:
f: The vibration frequency component in Hz.
: The magnitude of the suspension’s Frequency Response Function at frequency f.
and : The target driving speed of 20 km/h and the source training speed of 36 km/h, respectively.
: The frequency-dependent correction factor used to compensate for the gain difference between resonance and isolation states.
Regarding other engineering constraints, the model exhibits inherent structural immunity. Sensor bias, which manifests as a direct current offset, does not alter the alternating current patterns representing road roughness and is mitigated by our Global Normalization and Time-Mixing layers. Furthermore, minor sampling rate fluctuations, typically characterized as high-frequency jitter, are effectively smoothed by the Average Pooling branch which functions as a structural low-pass filter. Collectively, these design choices ensure that DB-MLP is not only computationally efficient but also robust against the non-ideal signal conditions typical of real-world vehicular environments. It is, however, essential to distinguish the physical operating domain of this data-driven approach from traditional geometric profiling. Since the proposed method fundamentally relies on dynamic vehicle-road interaction, intrinsic limitations exist under static conditions or negligible velocities where excitation is insufficient. Yet, this constraint is operationally acceptable for suspension control, as such systems predominately function during normal driving speeds where vehicle dynamics are significant.
5. Conclusions
This paper presents an effective and lightweight framework for road roughness classification using sprung mass vertical acceleration signals. By addressing the limitations of traditional linear simulation models and computationally intensive deep learning architectures, we achieved a favorable balance between classification accuracy and real-time inference efficiency.
First, we established a high-fidelity co-simulation platform combining CarSim and Simulink. Unlike simplified quarter-car models, this platform captures non-linear suspension dynamics and vehicle body coupling effects, generating a physically consistent dataset covering ISO A–F roughness levels. Second, we proposed DB-MLP, a novel dual-branch architecture tailored for vibration signal analysis. By employing multi-resolution temporal projection and dual-domain (time and position-wise) mixing, the model effectively extracts both fine-grained vibration textures and global motion trends without relying on heavy self-attention mechanisms. Crucially, this design explicitly decouples high-frequency road textures from low-frequency body motions, aligning feature extraction with physical vibration characteristics.
Experimental results demonstrate that DB-MLP achieves superior accuracy (98.5%) compared to established baselines on the specific road classification task. In terms of efficiency, it requires only 0.58 million parameters and achieves an ultra-low inference latency of 0.007 ms per sample, which is approximately 20 times faster than InceptionTime and TimeMixer. Such computational efficiency suggests that the model is highly suitable for deployment on resource-constrained embedded ECUs. Moreover, extensive robustness analyses validated the model’s stability under challenging conditions, including sensor noise, vehicle mass variations, and speed changes.
Future work will focus on two main directions: (1) Bridging the Sim-to-Real gap by validating the model on real-world vehicle test data and investigating transfer learning techniques to mitigate domain shifts caused by environmental noise and sensor heterogeneity; (2) Integrating the proposed classifier into closed-loop semi-active suspension control systems to evaluate its impact on improving not only ride comfort but also driving safety in extreme road conditions.