4.1. Brake Force Identification Based on BP Neural Networks
Compared with traditional hydraulic braking systems, in which the brake pedal is mechanically coupled to the brake actuators, the EMB system achieves physical decoupling between the brake pedal and the wheel actuators. This system requires real-time recognition of the driver’s braking intent to precisely regulate clamping force at each wheel through electromechanical actuators. Within the vehicle stability control strategy described herein, the braking intensity Z plays a critical role in clamping force distribution and is therefore of particular importance.
There exists a strong nonlinear mapping relationship between braking intent and driver input, making it difficult to establish an accurate mathematical model for direct representation. As a typical type of multilayer feedforward network, BP neural networks are widely applied in this field because of their ability to approximate nonlinear functions and their good generalization performance [20,21]. They have been successfully applied to the recognition of vehicle driving intentions (such as lane changes and braking scenarios). This paper employs a three-layer BP network architecture (consisting of an input layer, a hidden layer, and an output layer), as shown in Figure 4. Vehicle speed, brake pedal displacement, and the rate of change of pedal displacement are used as input features, with the output layer mapped to the braking intensity Z [22].
Definition: The input layer has three neurons (vehicle speed, brake pedal displacement, and the rate of change of pedal displacement); the number of neurons in the hidden layer is M = 5; the output layer has one neuron (braking intensity Z). Weights and biases: $w_{jk}^{l}$ denotes the weight of the connection from the kth neuron in layer $l-1$ to the jth neuron in layer $l$. It is also the element in the jth row and kth column of the lth-layer weight matrix $W^{l}$. Similarly, $b_{j}^{l}$ denotes the bias of the jth neuron in layer $l$, which is also the jth element of the bias vector $b^{l}$ for layer $l$. Let $z_{j}^{l}$ denote the linear output of the jth neuron in the lth layer, and let $a_{j}^{l}$ denote the output of the activation function of the jth neuron in the lth layer. In this context, the activation function is denoted by the symbol $\sigma$, and the activation of the jth neuron in the lth layer is as follows:
$$a_{j}^{l} = \sigma\left(z_{j}^{l}\right) = \sigma\left(\sum_{k} w_{jk}^{l}\, a_{k}^{l-1} + b_{j}^{l}\right)$$
The forward propagation process can be expressed as follows:
Loss function: To measure the difference between the predicted value $\hat{y}$ and the actual value $y$, the commonly used mean squared error is employed.
Backpropagation:
Rate of change of parameters:
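For reference, a standard textbook form of these four steps is sketched below. The notation is an assumption chosen for illustration: $W^{l}$, $b^{l}$, $z^{l}$, and $a^{l}$ are the weight matrix, bias vector, linear output, and activation of layer $l$; $\sigma$ is the activation function; $L$ is the output layer, so that $\hat{y} = a^{L}$; $E$ is the single-sample loss and $J$ its mean over $N$ samples; $\eta$ is the learning rate; and $\odot$ is the element-wise product. The factor 1/2 in the loss and the plain gradient-descent update are likewise assumptions, not details confirmed by this paper.

```latex
\begin{aligned}
\text{Forward propagation:}\quad & z^{l} = W^{l} a^{l-1} + b^{l}, \qquad a^{l} = \sigma\!\left(z^{l}\right) \\
\text{Loss function:}\quad & E = \tfrac{1}{2}\left\lVert \hat{y} - y \right\rVert^{2}, \qquad J = \frac{1}{N}\sum_{n=1}^{N} E_{n} \\
\text{Backpropagation:}\quad & \delta^{L} = \left(a^{L} - y\right) \odot \sigma'\!\left(z^{L}\right), \qquad
\delta^{l} = \left(\left(W^{l+1}\right)^{T} \delta^{l+1}\right) \odot \sigma'\!\left(z^{l}\right) \\
\text{Parameter gradients:}\quad & \frac{\partial E}{\partial W^{l}} = \delta^{l} \left(a^{l-1}\right)^{T}, \qquad
\frac{\partial E}{\partial b^{l}} = \delta^{l} \\
\text{Update:}\quad & W^{l} \leftarrow W^{l} - \eta\, \frac{\partial J}{\partial W^{l}}, \qquad
b^{l} \leftarrow b^{l} - \eta\, \frac{\partial J}{\partial b^{l}}
\end{aligned}
```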
During the braking intensity recognition process, braking intensity simulation parameters exported from CarSim were used to construct the BP neural network model. Specifically, 70% of the data were used to train the neural network, 15% were used as the validation set, and the remaining 15% were used for testing. When determining the number of hidden layer neurons, both model recognition accuracy and computational speed were taken into account. Based on the evaluation of accuracy and training time across different neuron configurations (as shown in Figure 5), the number of hidden layer neurons was set to 5, with the maximum number of iterations capped at 1000. To prevent overfitting, an early stopping mechanism was employed, which halted the training process when the validation loss ceased to decrease.
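The data partition and stopping rule described above can be sketched as follows. This is a minimal illustration rather than the training script used in this work: the function names, the random shuffle, and the `patience` parameter (how many non-improving epochs are tolerated) are assumptions introduced here.

```python
import numpy as np

def split_indices(n_samples, train=0.70, val=0.15, seed=0):
    """Shuffle sample indices and split them 70/15/15 into train/val/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(round(train * n_samples))
    n_val = int(round(val * n_samples))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def early_stopping_epoch(val_losses, patience=10, max_epochs=1000):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs (hypothetical parameter), or when the
    list of losses or the iteration cap is exhausted.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return min(len(val_losses), max_epochs) - 1
```

In practice, the weights stored at the epoch with the lowest validation loss would be restored after stopping.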
To ensure the generalization, approximation capability, and convergence speed of the neural network, the tanh(x) activation function was selected for the hidden layer, and weights were initialized using the Xavier initialization method. Similarly, the Leaky-ReLU activation function was selected for the output layer, and the weights were initialized using the He initialization method.
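A minimal NumPy sketch of such a 3-5-1 network is given below for illustration. It assumes a Xavier uniform initialization for the tanh hidden layer, a He normal initialization for the Leaky-ReLU output layer, a Leaky-ReLU slope of 0.01, a loss of one half of the mean squared error, and plain full-batch gradient descent; none of these specifics are confirmed by this paper, and all function names are hypothetical.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

def init_params(n_in=3, n_hidden=5, n_out=1, seed=0):
    rng = np.random.default_rng(seed)
    # Xavier (Glorot) uniform initialization for the tanh hidden layer.
    limit = np.sqrt(6.0 / (n_in + n_hidden))
    W1 = rng.uniform(-limit, limit, size=(n_hidden, n_in))
    # He normal initialization for the Leaky-ReLU output layer.
    W2 = rng.normal(0.0, np.sqrt(2.0 / n_hidden), size=(n_out, n_hidden))
    return {"W1": W1, "b1": np.zeros((n_hidden, 1)),
            "W2": W2, "b2": np.zeros((n_out, 1))}

def forward(p, X):
    """Forward pass; X has shape (n_in, N), one column per sample."""
    z1 = p["W1"] @ X + p["b1"]
    a1 = np.tanh(z1)
    z2 = p["W2"] @ a1 + p["b2"]
    a2 = leaky_relu(z2)
    return z1, a1, z2, a2

def loss_and_grads(p, X, Y):
    """Half mean squared error and its gradients via backpropagation."""
    z1, a1, z2, a2 = forward(p, X)
    loss = 0.5 * np.mean((a2 - Y) ** 2)
    d2 = (a2 - Y) * leaky_relu_grad(z2) / a2.size  # output-layer error
    d1 = (p["W2"].T @ d2) * (1.0 - a1 ** 2)        # hidden-layer error (tanh')
    grads = {"W2": d2 @ a1.T, "b2": d2.sum(axis=1, keepdims=True),
             "W1": d1 @ X.T, "b1": d1.sum(axis=1, keepdims=True)}
    return loss, grads

def sgd_step(p, grads, lr=0.05):
    """One gradient-descent update of all parameters."""
    return {k: p[k] - lr * grads[k] for k in p}
```

The three input rows would hold vehicle speed, pedal displacement, and pedal displacement rate (suitably normalized), and the single output row the braking intensity Z.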
Unseen samples were selected to test the predictive accuracy of the BP neural network. During the braking process, the vehicle gradually decelerated from a maximum speed of 100 km/h until coming to a complete stop. To simulate driver braking behavior, a step-input braking maneuver was employed, in which the brake pedal force rapidly increased to 100 N within 1 s and then remained constant. The model recognition accuracy is shown in Figure 6. The overall prediction error is less than 5%, meeting the accuracy requirements.
4.2. Upper-Level Additional Yaw Moment Control
When implementing vehicle stability control, the vehicle’s sideslip angle and yaw rate are used as reference targets. Therefore, a two-degree-of-freedom dynamic model involving lateral and yaw motions is established, with the additional yaw moment included, as shown in the following equation:
When the vehicle is traveling steadily, i.e., the time derivatives of the sideslip angle and the yaw rate are both zero, the expected sideslip angle and expected yaw rate can be obtained from the above equation:
where $K$ is the stability coefficient.
Taking into account the constraints imposed by the road surface coefficient of friction $\mu$, the modified expected sideslip angle and yaw rate are obtained as follows:
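As an illustration of how a friction limit constrains the reference signals, the sketch below uses a form that is common in the vehicle dynamics literature: the steady-state yaw rate of the linear two-degree-of-freedom model, capped at a fraction of $\mu g / v_x$, and an empirical sideslip bound. The 0.85 safety factor, the 0.02 coefficient, and all function and parameter names are assumptions for this example and may differ from the expressions used in this paper.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def reference_yaw_rate(vx, delta, wheelbase, K, mu, safety=0.85):
    """Steady-state yaw rate (rad/s) of the linear bicycle model, friction-capped.

    vx: longitudinal speed in m/s (must be > 0); delta: front-wheel steering
    angle in rad; wheelbase in m; K: stability coefficient in s^2/m^2;
    mu: road friction coefficient. `safety` is an assumed empirical factor.
    """
    steady = vx * delta / (wheelbase * (1.0 + K * vx ** 2))
    limit = safety * mu * G / vx
    return math.copysign(min(abs(steady), limit), steady)

def sideslip_limit(mu):
    """Assumed empirical upper bound on the sideslip angle, in rad."""
    return math.atan(0.02 * mu * G)
```

On a high-friction road the cap is inactive and the steady-state value is returned unchanged; on a low-friction road the reference is clipped so that the controller does not request a yaw rate the tires cannot sustain.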
In the calculation of yaw moments in the upper-layer controller, PID control, MPC and SMC are commonly used. PID control relies heavily on empirical tuning, while MPC, although more accurate, involves a significant computational burden, which can affect the system’s real-time performance [23,24,25]. Therefore, SMC was selected as the method for calculating the vehicle’s additional yaw moment.
Based on the nonlinear characteristics of the vehicle dynamics model, SMC is employed to calculate the yaw moment to be applied to the vehicle [26,27,28]. To eliminate the chattering phenomenon during the switching process of SMC, an adaptive gain dynamic weighting reaching law is introduced in this paper [29].
Define the sliding surface as follows:
In Formula (28), , , , , , and .
The reaching law is as follows:
In Equation (29), the adaptive gain term is as follows:
Here, the gain coefficients are positive constants. $k_0$ is the base gain used to overcome inherent system uncertainties and disturbances. The exponential adaptive term provides nonlinear gain adjustment: when the system’s state deviates significantly from the sliding surface, the gain increases rapidly to accelerate convergence, and as the system approaches the sliding surface, the gain decreases to suppress chattering. The saturation adaptive term accelerates convergence when the system is far from the equilibrium point and decelerates convergence near the equilibrium point, which also suppresses chattering.
In Equation (30), the dynamic weighting factor is defined as follows:
where the coefficients are positive constants. The error-driving term approaches 1 when the error is large, enabling rapid system convergence, and approaches 0 when the error is close to zero, suppressing system chattering. The time decay term gradually attenuates over time to enable a smoothing mode.
As shown in Figure 7, compared with the traditional exponential reaching law, the adaptive reaching law proposed in this paper exhibits a faster convergence rate and shorter dynamic response time. Furthermore, the chattering phenomenon is significantly suppressed, indicating that this method can effectively improve transient performance and enhance control smoothness.
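The mechanism behind chattering suppression can be illustrated with a generic simulation. The sketch below is not the reaching law proposed in this paper: it simply compares a constant-gain discontinuous law, ds/dt = -k sign(s), with a smooth law, ds/dt = -k tanh(s/Δ), under forward-Euler integration. The gain, step size, boundary-layer width, and initial value are all assumed values.

```python
import numpy as np

def simulate(switch, s0=1.02, k=5.0, dt=0.01, steps=400):
    """Euler-integrate ds/dt = -k * switch(s) and return the trajectory."""
    s = np.empty(steps + 1)
    s[0] = s0
    for i in range(steps):
        s[i + 1] = s[i] - dt * k * switch(s[i])
    return s

def sign_changes(s):
    """Count zero crossings of the trajectory, a crude chattering indicator."""
    signs = np.sign(s)
    return int(np.sum(signs[1:] * signs[:-1] < 0))

# Discontinuous switching: the state overshoots zero at every step once it arrives.
discontinuous = simulate(np.sign)
# Smooth switching with an assumed boundary-layer width of 0.1.
smooth = simulate(lambda s: np.tanh(s / 0.1))
```

With these settings the discontinuous law keeps crossing zero after reaching the surface, while the smooth law approaches zero monotonically; an adaptive gain additionally shortens the reaching phase when the state is far from the surface.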
(1) Proof of stability:
Select the Lyapunov function:
According to the Lyapunov stability criterion, if $\dot{V} \le 0$, the system is stable. From Equation (32), we know that $\dot{V} = s\dot{s}$. When s > 0, ϕ(s) > 0, and when s < 0, ϕ(s) < 0; therefore, sϕ(s) ≥ 0. Furthermore, from Equation (31), we know that k(s) > 0; thus, the following equation holds true:
Therefore, based on the Lyapunov stability criterion, the system is stable.
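The argument can be summarized compactly under two assumptions made here for illustration: a quadratic Lyapunov candidate, and a disturbance-free reaching law of the form $\dot{s} = -k(s)\,\phi(s)$, where $k(s) > 0$ is the adaptive gain and $\phi(s)$ is a switching function with the same sign as $s$. The exact expressions in this paper may contain additional terms.

```latex
V = \tfrac{1}{2}s^{2}
\quad\Longrightarrow\quad
\dot{V} = s\,\dot{s} = -k(s)\,s\,\phi(s) \le 0,
\qquad \text{with equality only if } s\,\phi(s) = 0 .
```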
The desired additional yaw moment is obtained by combining Equations (22), (28) and (29).
Define the system tracking error state vector, whose two components are the sideslip angle tracking error and the yaw rate tracking error. Combining this with Equation (22), the open-loop error dynamics equation including disturbances can be expressed as follows:
Taking the derivative of Equation (28) yields the following:
Substitute Equation (36) into Equation (37):
This can be simplified to the following:
This yields the following complete closed-loop error dynamics matrix equation:
where
=
,
,
,
, and
(2) Proof of reachability:
In real-world physical systems, two-degree-of-freedom vehicle models must account for external disturbances and parameter uncertainties. Let the derivative of the actual sliding surface be
where d(t) is the total disturbance term, which is assumed to be bounded.
Define a small sliding mode boundary layer of width ∆, where ∆ > 0.
When the system’s state lies outside the boundary layer, i.e., |s| > ∆, the properties of the function ϕ(s) guarantee that there exists a constant strictly greater than zero such that the following equation holds true:
Furthermore, as shown above, s and ϕ(s) always have the same sign; therefore, when |s| > ∆, the following holds:
Substitute Equation (42) into Equation (34):
Substitute the upper bound of the disturbance:
From Equation (31), it follows that the adaptive gain is no smaller than the base gain; therefore, the base gain $k_0$ must satisfy the following condition:
Introducing a positive constant and substituting it into Equation (42) gives the following:
Since
, the above equation can be rewritten as follows:
(3) Proof of system state boundedness:
where
,
, and
.
Substituting the tracking errors
and
yields the following equation:
where
From Equations (26) and (27), it follows that the expected sideslip angle and the expected yaw rate are strictly constrained by the road surface coefficient of friction $\mu$, and the front-wheel steering angle is determined by the driver’s physical input; therefore, this term must be bounded (i.e., there exists a constant that bounds its norm from above).
Compute the time derivative of the designed integral sliding surface (Equation (28)):
Since the additional yaw moment appears only in the yaw rate error dynamics as a control input, we rearrange the terms to express the controlled term in terms of the uncontrolled term and other bounded variables:
Substituting Equation (53) into the above equation yields the following:
Combine Equations (48) and (51) and write them in standard state-space matrix form:
The above equations constitute the error dynamic model of the system constrained near the sliding surface. During controller parameter design, the positive design constants are chosen so that the error system matrix satisfies the Hurwitz stability criterion.
As shown in Equations (33)–(35), the sliding mode variables remain within a finite boundary layer; therefore, the external excitation vector is strictly bounded. According to the Input-to-State Stability (ISS) theorem in nonlinear control theory, the states of a strictly Hurwitz-stable linear system are necessarily bounded when driven by a bounded external input.
Therefore, the tracking error vector is bounded. Meanwhile, Equations (26) and (27) show that the reference trajectory is bounded by physical constraints. Because the sum of bounded signals is bounded, the actual physical state variables of the closed-loop system, namely the sideslip angle and the yaw rate, are both bounded. Thus, in addition to ensuring error convergence, the controller keeps the vehicle’s chassis dynamics bounded and stable.
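The Hurwitz condition used in the boundedness argument can be checked numerically for any candidate set of design constants. The helper below is a minimal sketch; the function name and the example matrices are hypothetical and are not the error system matrix of this paper.

```python
import numpy as np

def is_hurwitz(A, tol=0.0):
    """Return True if every eigenvalue of A has a real part below -tol."""
    eigenvalues = np.linalg.eigvals(np.asarray(A, dtype=float))
    return bool(np.all(eigenvalues.real < -tol))

# Hypothetical 2x2 examples: eigenvalues {-1, -2} and {1, -2}, respectively.
stable_example = [[0.0, 1.0], [-2.0, -3.0]]
unstable_example = [[0.0, 1.0], [2.0, -1.0]]
```

A positive `tol` can be used to require a stability margin rather than mere negativity of the real parts.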
4.3. Parameter Optimization Based on the TD3 Reinforcement Learning Algorithm
TD3 is a model-free reinforcement learning (RL) method. It learns optimal policies through direct interaction with the environment (input states, output actions/parameters). Consequently, it can learn directly from data how to tune controller parameters for an uncertain or complex system without requiring knowledge of the system’s exact differential equations. This is of significant value for real-world systems where accurate modeling is notoriously challenging.
To address the strong nonlinearity, parameter perturbations, and position disturbances in electro-hydraulic servo systems, which traditional control algorithms struggle to handle, Reference [30] proposed an intelligent composite control strategy that integrates the TD3 deep reinforcement learning algorithm with adaptive fractional-order sliding mode control. A fractional-order sliding surface was designed, along with a reaching law function based on the sliding surface. Adaptive control laws were designed based on Lyapunov stability theory, and the gain parameters for sliding mode switching were optimized online using the TD3 deep reinforcement learning algorithm. Furthermore, in response to the excessive reliance on manual experience, low efficiency, and cumbersome procedures in the selection of PID controller parameters, Reference [31] proposed a PID parameter optimization method based on the TD3 algorithm.
The adaptive variable-gain sliding mode controller design involves numerous parameters, and the sliding surface coefficients and controller gains typically rely on prior knowledge of the system’s dynamic model. For complex nonlinear, time-varying, or partially modeled systems, manual or model-based tuning of these parameters is extremely challenging and often requires a difficult trade-off between robustness and chattering. Relying solely on manual adjustment results in a heavy workload and often leads to suboptimal performance. To address these limitations, this paper incorporates the TD3 algorithm from reinforcement learning to achieve autonomous parameter optimization [32,33].
TD3 (Twin Delayed Deep Deterministic Policy Gradient) is an enhanced actor–critic algorithm developed to address the limitations of DDPG. Its three core mechanisms (twin critics, delayed policy updates, and target policy smoothing) are designed to suppress Q-value overestimation, stabilize the training process, and enhance policy generalization. These mechanisms mitigate common issues encountered in DDPG during continuous control tasks, such as training oscillations, premature convergence to local optima, and potential divergence. The specific implementation details are as follows:
(1) Twin critics: mitigating overestimation of Q-values at the source
When DDPG uses a single Q-network, environmental noise, sample bias, and function approximation errors cause the Q-values to systematically overestimate the true action values (i.e., certain actions are treated as better than they actually are). This overestimation is propagated through gradients to the policy network, causing the policy to keep optimizing toward overestimated poor actions and ultimately to converge to a local optimum or even diverge.
The core principle of TD3 is to simultaneously train two independent value networks ($Q_1$, $Q_2$), which share the same policy network but do not interfere with one another. Meanwhile, two target value networks ($Q_1'$, $Q_2'$) are also maintained. When calculating the target Q-value, the minimum of the outputs from the two target Q-networks is taken:
Substituting this into the Bellman equation gives the critic loss function:
Here, M represents the number of samples drawn from the experience replay buffer, and the target Q-value is evaluated with the parameters of the target critic networks.
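The clipped double-Q target and the resulting critic loss can be sketched as follows. This follows the generic TD3 formulation rather than the specific implementation of this paper; the discount factor value, the `done` termination flag, and the function names are assumptions.

```python
import numpy as np

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: y = r + gamma * (1 - done) * min(Q1', Q2').

    q1_next and q2_next are the two target critics evaluated at the next
    state and the (smoothed) target action; all inputs are arrays of length M.
    """
    q_min = np.minimum(q1_next, q2_next)
    return reward + gamma * (1.0 - done) * q_min

def critic_loss(q_pred, target):
    """Mean squared Bellman error over the sampled mini-batch."""
    return float(np.mean((q_pred - target) ** 2))
```

Both critics are regressed toward the same target, so a single optimistic critic cannot inflate the value estimate on its own.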
(2) Delayed policy updates:
Unlike the synchronous update mechanism for the policy network and Q-network in DDPG, where even minor fluctuations in Q-values are passed directly to the policy network and cause policy oscillations, the TD3 algorithm introduces a delayed update strategy. Specifically, the Q-network is first updated d times so that it settles into a relatively stable state, and only then is the policy network updated. This approach establishes a more stable mapping from the continuous state space to action values. During the parameter iteration process, the algorithm uses backpropagation to optimize the parameters of the actor network, as mathematically expressed below:
The symbols in the equation denote, in turn, the loss gradient, the gradient operator, the parameters of the actor network, and the actor network itself.
At the same time, the target networks are updated as follows:
Here, $\tau$ is the soft update parameter, with a value between 0 and 1.
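The delayed actor update and the soft (Polyak) update of the target networks can be sketched as below. Parameters are represented as dictionaries of NumPy arrays for simplicity; the default values of the soft update parameter `tau` and of `policy_delay` (the number d of critic updates per actor update) are assumptions, as are the function names.

```python
import numpy as np

def soft_update(target, online, tau=0.005):
    """Return new target parameters: theta' <- tau * theta + (1 - tau) * theta'."""
    return {k: tau * online[k] + (1.0 - tau) * target[k] for k in target}

def should_update_actor(step, policy_delay=2):
    """The actor and the target networks are updated once every `policy_delay` critic updates."""
    return step % policy_delay == 0
```

In a training loop, the critics would be updated at every step, while the actor update and `soft_update` would run only when `should_update_actor(step)` is true.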
(3) Target policy smoothing:
In DDPG, the target policy directly outputs deterministic actions, and the Q-values are highly sensitive to even minor changes in these actions. TD3 smooths the target policy by adding normally distributed noise to the target actions, thereby smoothing the value estimates across the action space and improving the policy’s generalization to environmental disturbances and parameter perturbations, as shown below:
Here, the noise term follows a normal distribution with the stated variance, and c represents the clipping factor that bounds the noise.
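Target policy smoothing can be sketched as follows, using the generic TD3 formulation. The noise scale, clipping bound, and action limits are assumed values, and `sigma` is passed to NumPy as a standard deviation (the square root of the variance).

```python
import numpy as np

def smoothed_target_action(target_action, sigma=0.2, c=0.5,
                           low=-1.0, high=1.0, rng=None):
    """Add clipped Gaussian noise to the target action, then clip to the action bounds."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=np.shape(target_action))
    noise = np.clip(noise, -c, c)  # limit the perturbation to [-c, c]
    return np.clip(np.asarray(target_action) + noise, low, high)
```

Because the critics are trained on targets evaluated at slightly perturbed actions, sharp peaks in the value estimate are averaged out over neighboring actions.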
For detailed parameter settings, see Appendix A, Table A2; for the pseudocode of the training process, see Appendix B, Algorithm A1.
Figure 8 illustrates the structure of the TD3 algorithm.
Figure 9 shows the training curve of the TD3 algorithm.
The agent module encompasses the primary execution process of the decision-making algorithm and is responsible for receiving inputs and making decisions. The input state information includes the sideslip angle, yaw rate, and lateral offset y, as follows:
The action space of the agent consists of the sliding surface parameters and the reaching law parameters of the sliding mode controller, expressed as follows:
The reward function is crucial for optimizing sliding mode control. To enable the agent to achieve multi-objective cooperative optimization, the following objective rewards are designed:
In the equation, the three error terms are the differences between the ideal and actual sideslip angle, between the ideal and actual yaw rate, and between the ideal and actual lateral offset, respectively. The remaining symbols denote the sideslip angle, yaw rate, and lateral offset, respectively.
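A common way to combine such tracking objectives into a single scalar reward is a negative weighted sum of squared errors. The sketch below illustrates that pattern only: the quadratic form, the weights, and the function name are assumptions and should not be read as the reward function used in this paper.

```python
def tracking_reward(e_beta, e_yaw, e_y, weights=(1.0, 1.0, 1.0)):
    """Negative weighted sum of squared tracking errors (assumed form).

    e_beta, e_yaw, e_y: errors in sideslip angle, yaw rate, and lateral offset.
    weights: hypothetical relative importance of the three objectives.
    """
    w_beta, w_yaw, w_y = weights
    return -(w_beta * e_beta ** 2 + w_yaw * e_yaw ** 2 + w_y * e_y ** 2)
```

The reward is zero for perfect tracking and decreases as any of the three errors grows, so maximizing the return drives all three errors down together; the weights set the trade-off among them.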