1. Introduction
The optimal design of complex mechanical systems typically involves a large number of design variables [
1]. The advent of computer experiments [
2], from the second half of the 20th century onwards, has greatly expanded the possibility for designers to explore a wide range of system configurations. However, the number of design variables often becomes so large that an exhaustive exploration of all the possible design solutions is not feasible, even with the most powerful computer available to designers. To overcome this limitation, two complementary strategies can be pursued: reducing the computational time needed to evaluate the system behaviour or reducing the number of evaluations needed to identify an optimal solution.
The first strategy can be addressed by adopting a surrogate model of the system [
3]. In this approach, the original validated model based on physical laws is replaced with a black box that approximates the relationship between the design variables and the objective functions of the physical system. The black box, referred to in this paper as the surrogate model or approximator, is calibrated on a limited number of simulations of the original physical model, which are employed to tune the model’s parameters. Several types of approximators exist, such as polynomial models [
4], Kriging techniques [
2,
5], and artificial neural networks. Thanks to their ability to capture the performance of the system even in the presence of highly non-linear or multi-modal responses [
1], artificial neural networks have been widely employed in the context of optimisation problems, particularly in fields such as structural optimisation [
6], vehicle dynamics [
7], and vehicle suspension optimisation [
8].
Regarding the reduction in the number of objective function evaluations required to solve an optimisation problem, a key idea is to sample the design domain in regions where a high density of optimal solutions is expected. This can be achieved by focusing on a subregion of the whole design domain and progressively moving to other subregions on the basis of the optimal solutions identified in previous iterations [
9,
10]. Another option is to prioritise the solutions with higher fitness levels according to a prescribed optimality metric, as is commonly realised with evolutionary algorithms [
11,
12] and other meta-heuristic algorithms [
13], such as Tabu Search [
14], Particle Swarm Optimisation [
15], and Ant Colony Optimisation [
16]. However, all these multi-objective optimisation techniques lack a specific learning process based on the design solutions found [
17], relying instead on stochastic rules to generate new design solutions [
18]. Indeed, no inferential process is employed to estimate the optimality of new solutions [
19]; rather, they are simply generated by recombining previously analysed solutions with the addition of a random component.
Reinforcement Learning (RL) algorithms, on the other hand, are specifically designed to learn a policy that maximises a reward on the basis of a set of prior observations. Despite the potential demonstrated by RL techniques in the context of Multi-objective Optimisation (MO), relatively few studies have been published in this area [
20]. Zeng et al. [
21] proposed two Multi-Objective Deep Reinforcement Learning models, the Multi-Objective Deep Deterministic Policy Gradient (MO-DDPG) and the Multi-Objective Deep Stochastic Policy Gradient (MO-DSPG). They applied these to the optimisation of a wind turbine blade profile and to several ZDT test problems [
22], showing superior performance compared to well-established Non-dominated Sorting Genetic Algorithm (NSGA) methods [
23]. Similarly, Yonekura et al. [
24] addressed the optimal design of turbine blades adopting a Proximal Policy Optimization (PPO) algorithm based on the hyper-volume associated with the optimal design solutions [
25].
Yang et al. [
26] developed a Deep Q-Network (DQN) algorithm [
27] grounded in Pareto optimality to address the cart pole control problem [
28]. Seurin and Shirvan [
29] applied several variants of their Pareto Envelope Augmented with Reinforcement Learning (PEARL) algorithm to DTLZ test problems [
30], as well as to constrained test problems such as CDTLZ [
31] and CPT [
32]. They compared their performance with NSGA-II [
33], NSGA-III [
34], Tabu Search [
14], and Simulated Annealing [
35]. After demonstrating the advantages of their method over these state-of-the-art algorithms, the authors applied PEARL to the constrained multi-objective optimisation of a pressurised water reactor.
Turning to different techniques, Ebrie and Kim [
36] adopted a multi-agent DQN algorithm for the optimal scheduling of a power plant. Similarly, Kupwiwat et al. [
37] employed a multi-agent DDPG algorithm to address the optimal design of a truss structure. The work of Somayaji NS and Li [
38] introduced a DQN algorithm with a modified version of the Bellman equation [
39] to handle multi-objective optimisation problems, applying it to the optimal design of electronic circuits. Importantly, this study also highlights the use of the trained agent in an inference phase, where it can be queried after training to identify new optimal solutions. Finally, Sato and Watanabe [
40] addressed a mixed-integer problem concerning the optimal multi-objective design of induction motors, proposing a method called the Design Search Tree-based Automatic design Method (DeSearTAM).
Besides the dimensionality of the design space, challenges in the optimisation of complex mechanical systems also arise from the need to consider a large number of objective functions. As this number increases, the fraction of non-dominated entries within the solution set increases rapidly [
41,
42], making the Pareto optimality criterion alone insufficient to select a preferred design. To address this issue, Levi et al. proposed the concepts of
k-optimality [
43] and
-optimality [
44], which enable a further ranking among Pareto-optimal solutions.
Building on these challenges, this work explores an RL-based method for the multi-objective optimisation of complex mechanical systems with a high number of objective functions and compares its performance against state-of-the-art approaches. The specific method considered, called Multi-Objective Reinforcement Learning–Dominance-Based (MORL–DB) [
20], employs a DDPG algorithm to explore the design space with the aim of increasing the
k-optimality level of the solutions. Unlike DQN-based algorithms, designed for discrete action spaces [
26,
36,
38], the adoption of a DDPG agent allows the MORL–DB method to effectively explore high-dimensional continuous design spaces [
21]. In addition, differently from other multi-objective RL approaches in the literature [
21,
24,
26,
29,
36,
37,
38,
40], the MORL–DB method defines the reward function based on the
k-optimality metric. This feature enables the method to successfully address many-objective optimisation problems, where Pareto optimality alone is not an effective selection criterion.
The effectiveness of the MORL–DB method has already been demonstrated by De Santanna et al. [
45], who applied the algorithm to the optimisation of the elasto-kinematic behaviour of a MacPherson suspension system. In the present work, the MORL–DB method was applied to the optimal elasto-kinematic design of a double wishbone suspension system, modelled as a multi-body system in the ADAMS Car suite. The optimisation problem involves 30 design variables, including the coordinates of 8 suspension hard points and 6 scaling factors acting on the elastic characteristics of 2 of the suspension’s rubber bushings. These variables are constrained by the available three-dimensional design space for the hard point locations and by upper and lower bounds on the scaling factors. A total of 14 objective functions are considered.
This paper is structured as follows. The formulation of the multi-objective optimisation problem is presented in
Section 1.1, together with the optimality metric adopted. In particular, the limitations of Pareto optimality in many-objective scenarios are discussed; then, the
k-optimality metric is introduced as a more effective selection criterion.
Section 1.2 provides a brief introduction to Reinforcement Learning, with a specific focus on the algorithm adopted in this study.
Section 2 describes the MORL–DB method, while
Section 3 illustrates the case study to which it is applied.
Section 4 presents the results of the optimisation procedure, alongside those obtained through the MS method [
9] and KEMOGA [
44], which are compared in
Section 5. Finally,
Section 6 highlights the main findings of the present work and outlines possible directions for further research.
1.1. Multi-Objective Optimisation
Most engineering design problems require the minimisation of a set of objective functions. As a result, an optimal design problem is usually a multi-objective optimisation problem (MOP). A multi-objective optimisation problem with
m objective functions and
n design variables is defined as [
1]
$$\min_{\mathbf{x}} \ \mathbf{f}(\mathbf{x}) = \big[ f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_m(\mathbf{x}) \big]^{T} \quad \text{s.t.} \quad g_j(\mathbf{x}) \le 0, \ \ j = 1, \ldots, q, \quad \mathbf{x}_{lb} \le \mathbf{x} \le \mathbf{x}_{ub} \qquad (1)$$
where $\mathbf{x} = \big[ x_1, \ldots, x_n \big]^{T}$ is the vector of the design variables (bounded between $\mathbf{x}_{lb}$ and $\mathbf{x}_{ub}$), $\mathbf{f}(\mathbf{x})$ is the vector of the objective functions, and $g_j(\mathbf{x})$ is one of the q design constraints. A feasible solution of the MOP is defined as a set of objective function values $\mathbf{f}(\bar{\mathbf{x}})$, obtained by evaluating the function $\mathbf{f}$ at a vector of design variables $\bar{\mathbf{x}}$ that satisfies the given constraints. When compared with other feasible solutions, it may either dominate or be dominated [
1]. Referring to a minimisation problem, a solution $\mathbf{f}(\mathbf{x}^{*})$ of the MOP is Pareto-optimal if there is no other feasible vector $\mathbf{x}$ such that [
1]
$$f_i(\mathbf{x}) \le f_i(\mathbf{x}^{*}) \ \ \forall\, i \in \{1, \ldots, m\} \quad \text{and} \quad f_j(\mathbf{x}) < f_j(\mathbf{x}^{*}) \ \ \text{for at least one } j \qquad (2)$$
In other words, given a Pareto-optimal solution, it is not possible to find another solution that improves at least one objective function without worsening at least another one. The set of Pareto-optimal solutions is called the Pareto-optimal set or Pareto front.
k-Optimality
Pareto optimality is easily achieved by design solutions that perform well in at least one objective function. As the number of objective functions increases, a growing fraction of solutions excels in at least one objective and is therefore non-dominated, leading to an inflation of the Pareto front. In practice, this means that in high-dimensional objective spaces, most solutions become Pareto-optimal, thereby reducing the effectiveness of Pareto dominance as a selection criterion.
To illustrate this phenomenon, a randomised experimental setup was implemented to evaluate how the fraction of Pareto-optimal solutions evolves with the number of objectives. Each experiment is represented by a 10,000 × m matrix, where each row corresponds to a randomly generated design solution and each of the m columns corresponds to an objective function. The entries of each matrix were independently sampled from a uniform distribution. For each value of m, ranging from 2 to 30, 1000 independent matrices were generated. For each matrix, the set of Pareto-optimal solutions was computed, and the fraction of these solutions relative to the total number of solutions in the matrix was calculated.
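For illustration, the following Python sketch reproduces a scaled-down version of this experiment (fewer solutions and repetitions than in the study, with entries drawn from a unit-interval uniform distribution as an assumption); the pairwise dominance test simply mirrors the Pareto-optimality definition given above.

```python
import numpy as np

def pareto_mask(F):
    """Boolean mask of the non-dominated rows of F (all objectives minimised)."""
    n = F.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # Row j dominates row i if it is no worse in every column and better in at least one.
        dominated = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominated.any():
            mask[i] = False
    return mask

rng = np.random.default_rng(0)
n_solutions, n_repeats = 500, 10   # reduced with respect to the 10,000 x 1000 setup of the study
for m in (2, 5, 10, 20, 30):
    fractions = [pareto_mask(rng.uniform(size=(n_solutions, m))).mean()
                 for _ in range(n_repeats)]
    print(f"m = {m:2d}: mean Pareto-optimal fraction = {np.mean(fractions):.3f}")
```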
Figure 1 presents the results of this analysis. In
Figure 1a, the average fraction of Pareto-optimal solutions is plotted against the number of objective functions
m, with vertical segments indicating the corresponding standard deviation. The results clearly show that as the number of objective functions increases, the relative size of the Pareto-optimal set approaches unity. This confirms that Pareto optimality alone, although useful to characterise non-dominated solutions, is insufficient to support the designer in selecting a final solution when dealing with many-objective optimisation problems [
41].
To overcome this limitation, Levi et al. [
43] proposed the concept of
k-optimality to rank Pareto-optimal solutions. Given the MOP of Equation (
1), a solution is
k-optimal if it is Pareto-optimal for any subset of m − k objective functions. It follows that the optimality level
k of a solution can vary from 0, for a simple Pareto-optimal solution, to m − 1,
for a solution that optimises all the objective functions of the problem. This latter case is the Utopia solution, which cannot be achieved in any well-posed MOP.
By definition, the optimality level
k of a solution represents the maximum number of objective functions that can be removed from the MOP without violating its Pareto optimality condition. In contrast, a simple Pareto-optimal solution (i.e.,
k = 0) may lose its dominance even if a single objective function is neglected [
9]. Top-ranked
k-optimal solutions are therefore characterised by the highest robustness in their Pareto optimality, as they are less sensitive to changes in the set of objective functions of the MOP. Moreover, as shown in
Figure 1b, considering the fraction of top
k-optimal solutions for the same experimental setup as in
Figure 1a enables a much stricter selection of the final design solution. Unlike Pareto optimality, the proportion of top-ranked
k-optimal solutions remains consistently low across all problem dimensions, which significantly reduces the set of alternatives and thus assists the designer, even in many-objective scenarios.
For an operative definition of the optimality level
k of a solution [
9], it is possible to consider a corollary [
43] of the definition, which states that a
k-optimal solution improves at least k + 1
objectives over any other solution in the MOP. This allows for an efficient implementation in computer programming.
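For illustration, a minimal Python sketch of this computation is given below; the function name and the simple pairwise implementation are illustrative assumptions, not the code used by KEMOGA or the MORL–DB method.

```python
import numpy as np

def k_optimality_levels(F):
    """
    Optimality level k of every row of F (all objectives minimised), using the
    corollary that a k-optimal solution improves at least k + 1 objectives over
    any other solution: k = min over the other rows of (strict improvements) - 1.
    Dominated rows end up with k = -1.
    """
    n, m = F.shape
    k = np.full(n, m - 1)                 # upper bound: the Utopia solution would improve all m objectives
    for i in range(n):
        for j in range(n):
            if i != j:
                k[i] = min(k[i], np.count_nonzero(F[i] < F[j]) - 1)
    return k

rng = np.random.default_rng(1)
F = rng.uniform(size=(200, 14))           # 200 random solutions, 14 objectives as in the case study
levels = k_optimality_levels(F)
print("highest optimality level:", levels.max(),
      "| number of top-ranked solutions:", int(np.sum(levels == levels.max())))
```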
1.2. Reinforcement Learning
This section outlines the general framework of Reinforcement Learning (RL) and its adaptation to the context of multi-objective optimisation. Reinforcement Learning is an approach used to improve the performance of an artificial machine that interacts with an environment by performing actions. The core of RL lies in the feedback information passed through the reward function, a scalar function that measures the goodness of the new state of the environment after the machine has perturbed it with its action. The artificial machine establishes the action to perform according to an inner policy that uses the state of the environment as information. The reward feedback mechanism constantly informs the machine of how good its past actions were. A training algorithm updates the policy of the machine to maximise the expected future rewards, that is, the rewards that the machine expects to receive for future actions.
Figure 2 shows the general framework for RL, in which
n is a generic step within the evolution of the environment E. The environment, described by its state $s_n$, is perturbed by the action $a_n$ and reaches the state $s_{n+1}$ because of it. Agent A is rewarded with the scalar function $r_n$, which is a judgment, in numerical terms, of the performed action. Then, the policy of the agent is updated using an internal training algorithm that aims to maximise the expected reward of future actions.
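For illustration, the interaction loop of Figure 2 can be sketched as follows; the environment, agent, reward, and update rule are toy placeholders, not the ones used in this work.

```python
class Environment:
    """Toy environment: the state is a scalar, and being close to 1.0 is rewarded."""
    def reset(self):
        return 0.0                                   # initial state s_0
    def step(self, action):
        next_state = action                          # the action directly sets the next state
        reward = -abs(next_state - 1.0)              # scalar reward r_n
        return next_state, reward

class Agent:
    """Toy agent with a fixed policy and an empty update rule."""
    def act(self, state):
        return state + 0.1                           # policy: a_n from s_n
    def update(self, state, action, reward, next_state):
        pass                                         # the training algorithm would adjust the policy here

env, agent = Environment(), Agent()
state = env.reset()
for n in range(20):
    action = agent.act(state)                        # agent chooses a_n from s_n
    next_state, reward = env.step(action)            # environment returns s_{n+1} and r_n
    agent.update(state, action, reward, next_state)  # policy update from the reward feedback
    state = next_state
```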
Deep Deterministic Policy Gradient
The Deep Deterministic Policy Gradient (DDPG) is a model-free, off-policy, actor–critic algorithm that exploits deep neural networks to learn control strategies in high-dimensional, continuous action domains. This means that the DDPG operates without a predefined model of the environment (model-free), learns a target policy different from the one used to generate the actor’s behaviour (off-policy), and relies on the interplay between two neural networks typically referred to as the actor and the critic. The DDPG combines ideas from the Deep Q-learning method [
27], which is restricted to discrete action spaces, and the Deterministic Policy Gradient (DPG) method [
20], which performs well generally but struggles with more complex control challenges. A summary of the algorithmic flow of the DDPG is presented in
Figure 3.
The agent follows the results of the cooperation between two artificial neural networks, the actor and the critic. The actor is a neural network tasked with choosing an action based on the current environment state $s_n$. The function governing this decision is the policy $\mu$, which maps inputs from the state space to outputs in the action space, so that $a_n = \mu(s_n)$. Given a state $s_n$, an action $a_n$, and a reward $r_n = r(s_n, a_n)$, the cumulative expected reward, discounted over future steps and obtained by following the policy $\mu$, is referred to as the expected return:
$$R_n = \mathbb{E}\!\left[ \sum_{i=n}^{N} \gamma^{\,i-n}\, r(s_i, a_i) \right] \qquad (3)$$
Here, N denotes the number of future steps considered, and $\gamma$ is a discount factor that adjusts the weight of future rewards. The objective in the DDPG is to derive an optimal deterministic policy $\mu^{*}$ that maximises the expected return from each state $s_n$.
The critic network estimates the action-value function $Q(s_n, a_n)$, commonly referred to as the Q-function. This function evaluates the quality of an action taken in a given state by calculating the expected cumulative reward starting from $(s_n, a_n)$ and continuing under policy $\mu$. For the deterministic policies used in the DDPG, the Q-function satisfies the following Bellman equation:
$$Q^{\mu}(s_n, a_n) = \mathbb{E}\!\left[ r(s_n, a_n) + \gamma\, Q^{\mu}\!\big(s_{n+1}, \mu(s_{n+1})\big) \right] \qquad (4)$$
The critic, described by the parameters $\theta^{Q}$, is trained to estimate the optimal Q-function corresponding to the optimal policy $\mu^{*}$. Its training objective is to minimise the Mean Squared Bellman Error (MSBE), given by
$$L(\theta^{Q}) = \mathbb{E}\!\left[ \big( Q(s_n, a_n \mid \theta^{Q}) - y_n \big)^{2} \right] \qquad (5)$$
The target values $y_n$ are obtained from the right-hand side of Equation (4) and are computed using a target actor $\mu'(\cdot \mid \theta^{\mu'})$ and a target critic $Q'(\cdot \mid \theta^{Q'})$, which enhance training stability:
$$y_n = r(s_n, a_n) + \gamma\, Q'\!\big(s_{n+1}, \mu'(s_{n+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big) \qquad (6)$$
The actor is optimised to learn the policy $\mu(s \mid \theta^{\mu})$. Its parameters $\theta^{\mu}$ are updated through the deterministic policy gradient, using mini-batches of randomly selected transitions:
$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\!\left[ \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_n,\, a = \mu(s_n)} \; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_n} \right] \qquad (7)$$
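For illustration, a minimal PyTorch sketch of the critic and actor updates of Equations (5)–(7) is given below; the network sizes, learning rates, and the randomly generated mini-batch are placeholder assumptions rather than the settings used in this work.

```python
from copy import deepcopy
import torch
import torch.nn as nn

state_dim, action_dim, gamma = 30, 30, 0.99            # illustrative dimensions and discount factor
actor  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_t, critic_t = deepcopy(actor), deepcopy(critic)  # target actor and target critic
actor_opt  = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# Mini-batch of transitions (s, a, r, s'); random placeholders instead of a replay buffer.
s, a  = torch.randn(64, state_dim), torch.randn(64, action_dim)
r, s2 = torch.randn(64, 1), torch.randn(64, state_dim)

# Critic update: minimise the MSBE of Eq. (5) against the targets of Eq. (6).
with torch.no_grad():
    y = r + gamma * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor update: deterministic policy gradient of Eq. (7), i.e. descend -Q(s, mu(s)).
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```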
Rather than fully transferring the main networks’ weights to the target networks, the DDPG performs soft updates to gradually adjust the target parameters:
$$\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\, \theta^{\mu} + (1 - \tau)\, \theta^{\mu'} \qquad (8)$$
This mechanism ensures more stable learning by preventing abrupt changes in target values, especially when the coefficient $\tau$ is small.
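Equation (8) translates into a few lines of code, as in the following sketch; the stand-in networks and the value of $\tau$ are illustrative placeholders.

```python
import torch
import torch.nn as nn

tau = 0.005                                            # illustrative soft-update coefficient
critic, critic_t = nn.Linear(4, 1), nn.Linear(4, 1)    # stand-ins for a main network and its target

def soft_update(target_net, main_net, tau):
    """Move each target parameter a small step toward the main parameter, as in Eq. (8)."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), main_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

soft_update(critic_t, critic, tau)   # applied to both the critic and the actor after each update
```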
To promote exploration and increase the probability of discovering high-performing policies, random noise sampled from a stochastic process $\mathcal{N}$ is added to the deterministic actor output. The resulting exploration policy $\tilde{\mu}$ is
$$\tilde{\mu}(s_n) = \mu(s_n \mid \theta^{\mu}) + \mathcal{N}_n \qquad (9)$$
Over the training episodes, the added noise is gradually reduced, making the policy more deterministic. In this study, uncorrelated, mean-zero Gaussian noise is used [
46], with its standard deviation updated after each training episode as
$$\sigma_{e+1} = D\, \sigma_{e} \qquad (10)$$
where $\sigma_{e+1}$ is the standard deviation for the upcoming episode, $\sigma_{e}$ is the value for the current episode, and D is a predefined decay factor. Training does not begin immediately: a warm-up phase is included, during which the agent selects random actions to populate the transitions buffer and explore the design space.
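The exploration mechanism described above can be sketched as follows; the noise level, decay factor, and warm-up length are illustrative values rather than the settings adopted in this study.

```python
import numpy as np

rng = np.random.default_rng(0)
action_dim, warmup_steps = 30, 500     # illustrative values
sigma, decay = 0.2, 0.99               # initial standard deviation and decay factor D
step_count = 0

def select_action(state, policy, sigma):
    """Exploration policy of Eq. (9): deterministic actor output plus Gaussian noise."""
    global step_count
    step_count += 1
    if step_count <= warmup_steps:                        # warm-up: purely random actions
        return rng.uniform(-1.0, 1.0, size=action_dim)
    return policy(state) + rng.normal(0.0, sigma, size=action_dim)

# After each training episode the noise is shrunk: sigma_{e+1} = D * sigma_e (Eq. 10).
sigma *= decay
```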
3. Case Study: Multi-Objective Optimisation of a Double Wishbone Suspension
Multi-objective optimisation of suspension systems presents significant challenges due to the combination of high-dimensional design spaces and multiple objective functions [
9]. In the context of elasto-kinematic optimisation using multi-body models, several contributions have been published. For instance, Țoțu and Cătălin [
50] and Avi et al. [
51] applied genetic algorithms to suspension optimisation. The former investigated a design space with 12 variables, corresponding to the coordinates of four suspension joints, and addressed three objectives: wheelbase variation, wheel track variation, and bump steer during vertical wheel travel. The latter extended the dimensionality to 24 design variables by considering eight hard points and optimised nine performance indices, combined into a single scalar objective function through a weighted sum approach. Bagheri et al. [
52] also employed a genetic algorithm (NSGA-II [
33]), acting on the lateral and vertical positions of three hard points, to optimise three kinematic objectives under static conditions.
Among more recent contributions, Magri et al. [
53] proposed a novel modelling strategy and compared its performance against Simscape multi-body models. Olschewski and Fang [
54] instead introduced a kinematic equivalent model to investigate the influence of bushing stiffness on suspension behaviour. Finally, De Santanna et al. addressed the optimisation of a MacPherson suspension system with 9 objective functions and 21 design variables, adopting a structured strategy for exploring the design space [
9] and considering an application of the MORL–DB method [
45]. To the best of the authors’ knowledge, this is the first application of Reinforcement Learning techniques to the optimal design of suspension systems [
55].
In this work, the Multi-Objective Reinforcement Learning–Dominance-Based (MORL–DB) method is applied to a case study related to the optimal elasto-kinematic design of a double wishbone suspension system, using a multi-body model available in the ADAMS Car suite (part of the
acar_shared database, see
Figure 6). The model is symmetric with respect to the vehicle’s vertical longitudinal plane, i.e., the plane normal to the
y-axis of the reference system shown. All data presented in this work refer to the right-hand suspension; the left-hand suspension points share the same coordinates, except for a sign reversal in the
y-coordinate. The system is optimised by acting on 30 design variables: the three-dimensional coordinates of eight suspension hard points (HPs) and the values of six scaling factors (SFs), as summarised in
Table 1. The SFs modify the elastic behaviour of the rubber bushings mounted at HPs 1 and 3 by scaling their force–displacement curves along the force axis. Each bushing is associated with three independent SFs, scaling the elastic characteristics along three orthonormal directions, for a total of six design variables ascribed to the bushings. A scaling factor greater than one increases stiffness, while a value below one reduces it.
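For illustration, the following sketch applies a scaling factor to a hypothetical bushing force–displacement curve; the tabulated curve values are invented placeholders, not the characteristics of the ADAMS model.

```python
import numpy as np

# Hypothetical force-displacement curve of a bushing along one direction
# (displacement in mm, nominal reaction force in N; values invented for illustration).
displacement  = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
force_nominal = np.array([-900.0, -400.0, 0.0, 400.0, 900.0])

def scale_bushing_curve(force, scaling_factor):
    """Scale the elastic characteristic along the force axis: SF > 1 stiffens, SF < 1 softens."""
    return scaling_factor * force

force_stiffer = scale_bushing_curve(force_nominal, 1.2)   # 20% stiffer bushing
force_softer  = scale_bushing_curve(force_nominal, 0.8)   # 20% softer bushing
```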
All SFs have a nominal value of 1. Each hard point is allowed to move within a cubic design space with a side of 80 mm, centred on its nominal position (see
Table 2). The scaling factors vary within the interval [0.8, 1.2], corresponding to a maximum deviation of 20% from the nominal force–displacement curve.
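The resulting design space can be encoded as simple box bounds on the 30 design variables, as in the sketch below; the nominal hard-point coordinates are random placeholders standing in for the values of Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hardpoints, half_side_mm = 8, 40.0                               # 80 mm cube centred on the nominal position
nominal_hp = rng.uniform(-500.0, 500.0, size=(n_hardpoints, 3))    # placeholder for the Table 2 coordinates

# 24 coordinate bounds (nominal +/- 40 mm) followed by 6 scaling-factor bounds.
lower = np.concatenate([(nominal_hp - half_side_mm).ravel(), np.full(6, 0.8)])
upper = np.concatenate([(nominal_hp + half_side_mm).ravel(), np.full(6, 1.2)])
assert lower.shape == upper.shape == (30,)

def random_design(rng):
    """Sample one feasible design vector inside the box-constrained design space."""
    return rng.uniform(lower, upper)
```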
The behaviour of the suspension system is evaluated through the ADAMS Car suite. The analysis consists of a sequence of quasi-static load cases (LCs) in which a force or displacement input is applied at the wheel. The resulting output curves describe key suspension characteristics. Four LCs are considered: parallel wheel travel (PWT), opposite wheel travel (OWT), traction and braking (TB), and lateral force (FY) (see
Table 3). The first two are mainly kinematic, in which a vertical displacement is applied at the wheel contact patch: symmetrically in PWT, to reproduce chassis heave motion, and anti-symmetrically in OWT, to reproduce body roll. TB and FY focus on suspension compliance by applying a force at the tyre contact patch, longitudinally in TB (to replicate traction and braking actions) and laterally in FY (accounting for the cornering force). The output curves for each load case are summarised in
Table 4.
4. Results
This section presents the results of the multi-objective optimisation of the double wishbone suspension. The outcomes of the MORL–DB method are compared with those of two other approaches: the Moving Spheres (MS, [
9]) method and a multi-objective genetic algorithm with sorting (KEMOGA, [
44]). The Moving Spheres method is designed for optimal design tasks where the design variables are primarily spatial coordinates, performing a structured exploration of the design space. Instead of searching the entire domain at once, new configurations of the system are generated within spherical neighbourhoods of a reference one, which is iteratively updated. The
multi-objective genetic algorithm (KEMOGA) used in this study is a standard binary-encoded genetic algorithm that leverages
k-optimality to guide the search. Fitness is based on the
k-optimality level of each solution, promoting diverse and well-performing configurations along the Pareto front. It is acknowledged that, given the stochastic nature of all the algorithms considered, slight variations may occur across different trials. Nevertheless, the results obtained by each method proved consistent across the trials performed. To enable a fair comparison, for each method, the solutions from the trial that produced the highest number of solutions with the best
k-optimality level were considered.
Figure 7 reports the number of optimal design solutions by method and optimality level (see
Section 1.1 for the definition of
k-optimality). The optimality level of each solution is computed relative to the combined solution sets of all three methods. This enables an objective comparison of method performance in approximating the Pareto front. Most MS solutions have a low optimality level, indicating good performance in only one objective function. The number of MS optimal design solutions decreases monotonically with the optimality level, up to a threshold beyond which no solutions are found. In contrast, KEMOGA and the MORL–DB method produce solutions with a higher optimality level. The MORL–DB method identifies the solution with the highest optimality level, which remains Pareto-optimal for the corresponding subsets of objective functions and is therefore closer to Utopia than any other solution. The results obtained by querying the trained DDPG agent, reported in
Figure 7b, are consistent with those produced during training, shown in
Figure 7a. This confirms the agent’s capability to identify optimal configurations in an inference phase, i.e., a post-training stage in which the knowledge acquired during learning is exploited [
38].
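As an illustration of such an inference phase, the following sketch queries a trained actor network; the network, its weights, and the starting configurations are placeholders, not the agent trained in this work.

```python
import torch
import torch.nn as nn

state_dim = action_dim = 30                       # illustrative dimensions
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
# In practice, the weights learned during training would be loaded here (e.g. via load_state_dict).

actor.eval()
with torch.no_grad():                             # inference only: no exploration noise, no training
    start_configs = torch.rand(100, state_dim)    # placeholder starting configurations
    proposed = actor(start_configs)               # candidate designs suggested by the trained policy
# The proposed designs would then be evaluated (surrogate or multi-body model) and ranked by k-optimality.
```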
Figure 8 shows the multi-body output curves for the suspension system designed with the MORL–DB method, alongside representative curves from two optimised suspension systems obtained with the MS method and KEMOGA. In each case, the single solution displayed is selected among those with the highest optimality level achieved by the respective method. All output quantities considered are highly specific to suspension systems [
56]. Toe, camber, and caster are three key suspension angles; wheelbase and track are the longitudinal and lateral distances between wheels, respectively. Dive is the tendency of a vehicle to lower its front end when braking; anti-dive is the suspension’s ability to counteract this unwanted motion. Wheel rate is the suspension spring force transmitted to the contact patch per unit vertical wheel displacement. Roll centre denotes the point around which the body of the car rolls in a wheel-fixed frame.
Figure 9 shows the location of the hard points for the optimal design solutions of all three methods inside the design space. Despite the unconstrained exploration performed by KEMOGA and the MORL–DB method, all three methods show high consistency, as they identify the same optimal regions of the design space.
5. Discussion
All three methods use surrogate models to approximate the physical model, as each objective function is predicted by a feed-forward neural network rather than computed through a multi-body simulation. The setup parameters for calibrating the feed-forward networks are reported in
Table A1. In the same table, “Genetic algorithm parameters” refers to the algorithm used to optimise the surrogate model, as discussed in phase (A4) of
Figure 5. The “local” and “global” attributes specify the extension of the surrogate model approximation: the MORL–DB method and KEMOGA use a global surrogate model; the MS method uses a local one. The local surrogate approximates the physical model only over a sub-portion of the design space, whereas the global surrogate covers the entire design space. Training the global surrogate model can be regarded as an overhead time cost, as this surrogate model substitutes the original multi-body environment for the whole optimisation procedure without requiring retraining. By contrast, the local surrogate model must be retrained at each iteration of the MS method [
9], since the portion of the design space being approximated is updated over the iterations of the method. The validation metrics for both surrogate models are reported in
Table A1 and in
Figure A1.
Setup parameters for the MS method, KEMOGA, and the MORL–DB method are reported in
Table A2,
Table A3, and
Table A4, respectively. Each method was executed until its convergence criterion was satisfied: the MS method converged when the design space became an active constraint, while KEMOGA and the MORL–DB method were stopped after a prescribed number of epochs or episodes, beyond which no sensible improvement in the quality of the solution was observed. The number of configurations generated by the trained DDPG agent was set so that the total configurations processed by the MORL–DB method—considering both training and inference phases—equalled those required by KEMOGA.
Table 5 reports the computational times of the three methods. The MORL–DB inference phase, reported here for completeness, required approximately 1 h 30 min. “Training set generation” refers to the time required to run multi-body simulations in the ADAMS environment in order to generate data for surrogate model training. “FNN training” refers to the time required to train the feed-forward neural networks. This breakdown reflects the different surrogate setups: while the MS method requires repeated training of local surrogate models, the MORL–DB method and KEMOGA rely on a global surrogate model, trained once. In this way, the computational implications of local versus global modelling are explicitly accounted for, ensuring a consistent comparison among the different approaches. “Method-specific computations” denotes the time required by calculations that are specific to the optimisation method. “Optimal design solution validation” corresponds to the time required to validate the optimal solutions identified by each method in the ADAMS multi-body environment. Note that all times except the method-specific ones arise because of the surrogate model. In principle, all three methods could employ the physical model directly; in this case, most of the surrogate-specific times would be replaced with simulation times.
Table 5 also reports the number of design solutions processed by the feed-forward neural networks. The MORL–DB method processes the fewest solutions, whereas the MS method and KEMOGA process at least one order of magnitude more. The comparison confirms the higher efficiency of the MORL–DB method over the other state-of-the-art methods. The MORL–DB method can handle high-dimensional problems without any scalarisation of the objective functions and requires fewer objective function evaluations. Its performance in settings relying directly on physical models rather than surrogates remains an important direction for further investigation.
Finally, it is worth noting that, at the end of training, the agent has built-in knowledge of the state transitions, enabling it to move from configurations in the neighbourhood of a reference design toward optimal configurations [
38]. In this work, this capability was effectively exploited by querying the trained agent in an inference phase, thereby refining the Pareto-optimal set with additional solutions not identified during training. It is clear that this inference phase cannot be performed with meta-heuristic methods such as KEMOGA, which inherently lacks a learning process.