Machine Learning Based Surrogate Models for the Thermal Behavior of Multi-Plate Clutches

Abstract: Multi-plate clutches play safety-critical roles in many applications. For this reason, correct functioning and safe operation are essential. Spontaneous damage is particularly critical because the failure of the clutch can lead to a failure of the system. Such damage mainly occurs due to very high loads and ultimately very high temperatures. Finite Element Analysis (FEA) enables simulation and prediction of these temperatures, but it is very time-consuming and costly. In order to reduce this computational effort, surrogate models can be created using machine learning (ML) methods, which reproduce the input and output behavior. In this study, various ML methods (polynomial regression, decision tree, support vector regressor, Gaussian process and neural networks) are evaluated with respect to their ability to predict the maximum clutch temperature based on the loads of a slip cycle. The models are examined based on two use cases. In the first use case, the axial force and the speed are varied. In the second use case, the lining thickness is additionally modified. We show that ML approaches fundamentally achieve good results for both use cases. Furthermore, we show that the Gaussian process and the backpropagation neural network provide the best results in both cases and that the requirement to generate predictions during operation is fulfilled.


Motivation
Wet multi-plate clutches are machine elements extensively employed in drive technology [1,2]. Applications range from starting and powershifting elements in dual-clutch and automatic transmissions, differential locks and torque converter clutches to serving as braking components in construction machinery or marine transmissions [3,4]. Given the nature of these applications and the high stresses that such components experience, multi-plate clutches fulfill safety-relevant roles, so safe and reliable operation must be ensured [5].
The damage and failures encountered during operation of multi-plate clutches can fundamentally be divided into two categories. Long-term damage refers to deterioration caused by variations in the friction system throughout a large number of shifting cycles. The number of shifting cycles until damage or failure can amount to several tens of thousands. The second category consists of spontaneous failures. These occur in individual shifting operations due to very high thermal and mechanical loads [4]. Since this phenomenon does not build up over time and therefore occurs unexpectedly, it is particularly critical and safety-relevant, because a single shift can jeopardize the safety of the entire system [6,7]. When considering damage to clutch systems, one key parameter is the maximum temperature within the plates and the friction lining, as these temperatures are ultimately responsible for the damage mechanisms. The temperatures during operation can in principle be measured experimentally via sensors, but this is often impossible due to the given installation space.

Research Objectives
Considering the limitations of FE simulations, experimental temperature measurement and analytical solutions of the problem mentioned above, the use of machine learning methods to develop a surrogate model presents a good alternative. As shown in the introduction, the use of ML-based surrogate models in the context of FE simulations is already being heavily researched. Nevertheless, a large portion of the applications exists in structural mechanics and, to the best of our knowledge, no application exists in multi-plate clutch development and operation. The aim and main contribution of this study is to investigate the development of a surrogate model for predicting the maximum temperature within a clutch system during operation, based on the thermomechanical FE simulation of the clutch. This paper is structured as follows. Section 2 first describes the general procedure. The FE model being replaced by surrogate models and the procedure for generating the datasets are then described, and the individual ML models used are briefly explained. Section 3 presents the research results, which are then discussed in Section 4. Finally, Section 5 summarizes the findings and discusses possible future work.

Methodology
This study deals with the creation of surrogate models of a thermoelastic finite element simulation. The overall procedure is illustrated in Figure 1. The procedure can be roughly subdivided into four stages. The first step consists of the generation of the required data from the FE model. These data are then processed in order to facilitate the creation of the surrogate models. In the third step, the surrogate models are generated from the processed data. In the last step, the generated models are subjected to a parameter study to investigate the influence of the sample size on the model performance. The individual steps are described in more detail in the following sections.

FE-Model and Use Cases
The model developed and presented by Schneider et al. [20] was used as a basis for the research presented in this paper. The model is a parameterizable two-dimensional model of a multi-plate clutch in transient operation. The geometry of the illustrated clutch and the corresponding FE model are shown in Figure 2. The clutch pack consists of 6 steel plates and 5 carrier plates with linings on both sides and comprises 10 friction surfaces. Furthermore, the reaction and pressure plates are placed on the left and right. The simulation model can be divided into two distinct parts. In the first part, the mechanical aspects of the simulation are considered. The pressures and strains due to mechanical and thermal loads within the components are calculated based on the current axial force, speed and temperature distribution. The thermal aspects of the simulation are considered in the second part. Based on the pressure distribution calculated in the first part, the heat flows generated at the friction surfaces can be calculated. These are applied as loads for the thermal simulation. Subsequently, the transient thermal simulation is performed, and it provides the temperature distribution at the coupling as a result. After the two simulation stages have been carried out, the operating conditions of the clutch (pressure and temperature distribution) are updated. This procedure represents 1 time step.
This procedure is performed for the defined number of time steps, each with the updated operating conditions as the initial condition. The complete process flow can be seen in Figure A1 in Appendix A. A time step size of 0.5 s and an element size of 0.0002 m were adopted. The simulation was carried out for 28 steps (14 s). The initial temperature of the clutch is 80 °C or 353.15 K. Figure 3 is a comparison of the experimentally determined data and the results of the FE simulation. The experimental temperatures were measured at a point inside the third steel plate. These were compared with the associated temperature determined by the simulation. Figure 3a shows that the FE simulation reproduced the temporal temperature variation with good precision. Figure 3b shows the measured and simulated temperature increases, along with the identity line. There was a strong correlation between the measured and the simulated data.
The model enables the variation of a large number of parameters that influence the experienced load, the geometry of the clutch system and the material data of the installed components. Given that the study focused on creating a model suitable for operation, parameters that may vary during operation, such as load (axial force and speed) and lining thickness (due to wear of the lining), were considered. Other design and material parameters, such as the thickness of the steel plate or the Young's modulus, do not change during operation and were therefore not considered in the modeling. Two different use cases were considered and are presented in Table 1. Use Case 1 only covered operations under varying loads. The axial force and the speed were varied. The profiles of the axial force applied and the differential rotational speed are shown by way of example in Figure 4. In this research, the clutch is analyzed in transient slip conditions. The axial force is applied and the differential speed is increased and then reduced to zero again. In Use Case 2, the wear of the lining during operation was taken into account in addition to the load parameters.

Dataset Generation
The data employed to generate the surrogate models originate from the FE model described above. For each combination of input parameters, the simulation outputs the temperature distribution of the clutch at each time step. Since the maximum temperature is of interest concerning the damage mechanisms, the maximum value from the temperature distribution at each time step was selected. An illustrative temperature distribution with the corresponding maximum temperature is shown in Figure 5. The parameter combinations used for the simulations are listed in Table 2. Each simulation yields a tuple $(p_i, T_i)$, where $p_i \in \mathbb{R}^m$ is an $m$-dimensional vector representing the $m$ varied parameters of the respective use case ($m = 2$ for Use Case 1 and $m = 3$ for Use Case 2) and $T_i \in \mathbb{R}^{29}$ is a vector with the maximum temperature $T_{max}$ at each of the 29 time steps (see Table 3). Since there was a large difference in the magnitude of the individual input data, the latter were scaled to obey a standard normal distribution after scaling. Furthermore, no outliers within the data were expected given the FE simulation's deterministic nature. The complete data set, consisting of 200 simulations (5800 data points), was divided into a training set and a test set. The test set comprised the data from 25 simulations (725 data points), while the remaining data points were assigned to the training sets. The goal of this split was to evaluate the quality of the predictions of the trained models on cases not yet seen. Different subsets of the training sets of different sizes were created to investigate the influence of the size of the data set and the required number of simulations for adequate surrogate model generation.
Training sets comprising 25, 50, 75, 100, 125, 150, and 175 simulations were examined. The data set with 100 simulations (2900 data points) was used as a baseline for this research.
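The split and scaling described above can be sketched as follows. This is a minimal illustration on synthetic placeholder data; the array names and the use of the time step index as a model input are our assumptions, not the authors' code:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the FE outputs of Use Case 1 (2 varied parameters,
# 29 time steps per simulation); values are placeholders, not simulation data.
rng = np.random.default_rng(0)
n_sims, n_steps, m = 200, 29, 2
P = rng.uniform(size=(n_sims, m))                      # parameter vectors p_i
T = rng.uniform(350.0, 600.0, size=(n_sims, n_steps))  # max. temperatures T_i

# Hold out 25 complete simulations (25 x 29 = 725 data points) so that no
# time step of a held-out slip cycle leaks into the training set.
test_mask = np.zeros(n_sims, dtype=bool)
test_mask[rng.choice(n_sims, size=25, replace=False)] = True

def flatten(P_sub, T_sub):
    """One sample per (parameter vector, time step) pair."""
    steps = np.tile(np.arange(n_steps), len(P_sub))
    X = np.column_stack([np.repeat(P_sub, n_steps, axis=0), steps])
    return X, T_sub.ravel()

X_train, y_train = flatten(P[~test_mask], T[~test_mask])
X_test, y_test = flatten(P[test_mask], T[test_mask])

# Scale inputs to zero mean / unit variance, fitting only on the training data
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)
```

Fitting the scaler on the training data only avoids information from the test simulations leaking into the preprocessing step.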

Model Development
Five different machine learning methods are investigated to construct the surrogate model. These are explained in greater depth below. Unless otherwise noted, more information on each algorithm can be found in the work by Murphy [21].

1.
Polynomial Regression (PR): Polynomial regression is a subclass of linear regression, in which a basis function expansion is performed with polynomial functions [8,21]. The use of higher-order polynomial functions enables non-linear relationships to be modeled. The PR model is given by:

$$\hat{y}(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d,$$

where $w$ is the vector with the weight factors and $d$ is the selected polynomial order. Although PR models are very popular due to their simple calculation and the option of deriving conclusions about the influences of the individual input parameters [8], these models suffer from the disadvantage of being prone to overfitting for high polynomial degrees [21].
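A minimal sketch of such a model with scikit-learn, the package used later in this paper; the toy data and polynomial degree are illustrative only:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy 1-D data following an exact cubic law (stands in for the clutch data)
rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 2.0, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 3

# PR = polynomial basis expansion (degree d) followed by linear least squares
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)

pred = model.predict(np.array([[1.0]]))  # close to 1 + 2 - 0.5 = 2.5
```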

2.
Decision Tree (DT): Decision trees are methods that divide the input space into several regions. These subdivisions take place along the individual axes of the input space by means of the CART algorithm and can be represented as a tree. The mean value $w_i$ is calculated for each of the regions $R_i$ created after the subdivision. The output value of the model is given by the following expression:

$$f(x) = \sum_{i} w_i \, \mathbb{I}(x \in R_i),$$

where $\mathbb{I}$ is the indicator function selecting the region containing $x$.

3.

Support Vector Regression (SVR): Support vector regression is a parametric model that uses kernels and considers only a portion of the training dataset to generate predictions. SVR models for a predefined kernel $\kappa$ can generically be defined by the following equation:

$$f(x) = \sum_{i} w_i \, \kappa(x, x_i),$$

where $w$ are the parameters to be calculated.
A combination of $\ell_2$ regularization and an $\epsilon$-insensitive loss function is used as the cost function. With the $\epsilon$-insensitive loss function, all data points located within a band of width $\epsilon$ are not penalized. As a result, the model can be obtained by solving the constrained optimization problem given by Equation (5) [21]:

$$\min_{w,\,\xi^+,\,\xi^-} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left(\xi_i^+ + \xi_i^-\right)$$
$$\text{subject to} \quad y_i - f(x_i) \le \epsilon + \xi_i^+, \quad f(x_i) - y_i \le \epsilon + \xi_i^-, \quad \xi_i^+, \xi_i^- \ge 0,$$

where $\xi_i^+$ and $\xi_i^-$ are introduced as slack variables and indicate the extent to which a data point lies outside the $\epsilon$-band. The hyperparameter $C$ regulates the tradeoff between the flatness of the model and the degree to which deviations larger than $\epsilon$ are tolerated. One of the drawbacks of these methods is the high computational cost of constructing the models. Interested readers can refer to Smola and Schölkopf [22] for more information on SVR.
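As a sketch, an RBF-kernel SVR on toy data, with C and epsilon playing the roles described above (the kernel, data and hyperparameter values are illustrative, not the paper's configuration):

```python
import numpy as np
from sklearn.svm import SVR

# Toy smooth target standing in for the temperature response
rng = np.random.default_rng(2)
X = rng.uniform(size=(150, 2))
y = np.sin(2.0 * np.pi * X[:, 0]) + X[:, 1]

# C trades off model flatness against tolerated deviations; epsilon sets the
# width of the penalty-free band around the fitted function.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.01)
svr.fit(X, y)
train_mae = float(np.mean(np.abs(svr.predict(X) - y)))
```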

4.
Gaussian Process (GP): Gaussian processes are non-parametric methods having the inference of distributions over functions as a basic principle [21]. If the data is noise-free, then GPs have the capability to interpolate the data points exactly. This is advantageous when creating surrogate models of deterministic FE models [23]. For the measured values $f$ at the sample points $X$ and the values $f_*$ being predicted at the points $X_*$, the joint probability distribution is given by:

$$\begin{pmatrix} f \\ f_* \end{pmatrix} \sim \mathcal{N}\!\left( 0, \begin{pmatrix} \kappa(X, X) & \kappa(X, X_*) \\ \kappa(X_*, X) & \kappa(X_*, X_*) \end{pmatrix} \right),$$

where $\kappa$ is the selected kernel or covariance function. The posterior probability distribution $p$ of the functions can be used to determine the predictions at the selected points $X_*$ and is given by:

$$p(f_* \mid X_*, X, f) = \mathcal{N}\!\left( \kappa(X_*, X)\kappa(X, X)^{-1} f, \;\; \kappa(X_*, X_*) - \kappa(X_*, X)\kappa(X, X)^{-1}\kappa(X, X_*) \right).$$

For further background on Gaussian processes, the reader is referred to the work of Rasmussen and Williams [24].
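The interpolation property for noise-free data can be seen in a small sketch; the kernel choice and data are illustrative, and the kernel hyperparameters are kept fixed here rather than optimized:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Noise-free "simulation" outputs at 8 sample points
X = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
f = np.sin(2.0 * np.pi * X).ravel()

# With (near-)zero observation noise (alpha), the posterior mean interpolates
# the sample points exactly, which suits deterministic FE surrogates.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2),
                              alpha=1e-10, optimizer=None)
gp.fit(X, f)
mean, std = gp.predict(X, return_std=True)
# mean matches f at the training points; std is close to zero there
```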

5.
Backpropagation Neural Network (BPNN): Neural networks are a set of models based on the structure and function of biological neurons and consist of a number of layered and interconnected units (neurons) [8]. The output $h$ of a neuron layer consists of a linear combination of the inputs, which is then subjected to a nonlinear activation function:

$$h = \varphi(XW + b),$$

where $X$ is the matrix with the layer inputs, $W$ is the weight matrix, $b$ is the bias vector and $\varphi$ is the selected activation function. Given a network architecture, the network can be trained using a backpropagation algorithm to solve a specific problem [8,25].
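The layer equation above can be written directly in NumPy. This is a sketch of the forward pass only; training via backpropagation is handled by Keras in this paper:

```python
import numpy as np

def layer_forward(X, W, b, phi=np.tanh):
    """h = phi(XW + b): linear combination of the inputs, then a nonlinearity."""
    return phi(X @ W + b)

# 3 samples with 2 inputs each, mapped to 4 hidden units
rng = np.random.default_rng(3)
X = rng.normal(size=(3, 2))
W = rng.normal(size=(2, 4))
b = np.zeros(4)
h = layer_forward(X, W, b)   # shape (3, 4); tanh outputs lie in (-1, 1)
```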
Each algorithm was subjected to hyperparameter optimization. All of the hyperparameters examined are listed in Table 4. All of the models were implemented in Python. The algorithms and implementations available in the Scikit-Learn package [26] were used for the PR, DT, SVR, and the GP. The Tensorflow package [27] with the Keras API was also employed for the BPNN.

Model Evaluation
In order to investigate the generalization capabilities of the models and to ensure independence of the results from the randomly selected dataset, the models were subjected to a nested cross-validation procedure. The process consists of two nested loops. The inner loop is dedicated to determining the optimal hyperparameters for a given fold of the data. For this purpose, a grid search (PR, DT, SVR, GP) or a randomized search (BPNN) with 3-fold cross-validation was performed. The generalization capabilities of the best model determined in the inner loop are examined in the outer loop (see Figure 6). The models were additionally evaluated on the test set to further investigate their performance.
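The nested procedure can be sketched with scikit-learn by placing a grid search inside an outer cross-validation loop; the estimator, parameter grid and toy data here are illustrative, not the paper's exact configuration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

# Toy regression data standing in for the clutch dataset
rng = np.random.default_rng(4)
X = rng.uniform(size=(120, 2))
y = np.sin(2.0 * np.pi * X[:, 0]) + X[:, 1]

# Inner loop: 3-fold grid search selects the hyperparameters within each fold
inner = GridSearchCV(SVR(), {"C": [1.0, 10.0], "epsilon": [0.01, 0.1]}, cv=3)

# Outer loop: estimates the generalization error of the tuned model
outer = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(inner, X, y, cv=outer,
                         scoring="neg_root_mean_squared_error")
mean_rmse = float(-scores.mean())
```

Because the hyperparameter search is repeated inside every outer fold, the outer-loop score is not biased by the tuning step.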

The mean squared error (MSE), the root mean squared error (RMSE) and the mean absolute percentage error (MAPE) were employed as metrics to measure the quality of the models. These can be calculated according to Equations (12)-(14):

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
$$\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where $y_i$ are the true values, $\hat{y}_i$ are the predicted values and $n$ is the number of samples.
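The three metrics are straightforward to implement; a short NumPy version with a hand-checked example:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    return float(np.mean((y - y_hat) ** 2))

def rmse(y, y_hat):
    """Root mean squared error."""
    return float(np.sqrt(mse(y, y_hat)))

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))

y_true = np.array([400.0, 500.0])   # e.g. temperatures in K
y_pred = np.array([390.0, 510.0])
print(rmse(y_true, y_pred))         # 10.0
print(mape(y_true, y_pred))         # (2.5% + 2.0%) / 2, i.e. about 2.25
```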

Results
The following section presents the results of the investigations for both use cases. The first part presents the performance of each model during the training process. The FE simulations were performed on a workstation with 96 GB (12 × 8 GB) of RAM, an NVIDIA Quadro P2000 GPU (5 GB) and two Intel Xeon 6154 3.0 GHz processors. The development of the surrogate models and the computation of the inference times were performed on a machine with 16 GB of RAM and an Intel i7 processor.

Use Case 1-Axial Force + Rotational Speed
The RMSE scores of the 10-fold cross-validation for the polynomial regression, decision tree, support vector regression, and Gaussian process are shown in Figure 7. For the PR model, except for the third fold, low RMSE values with a mean of 8.12 were evident without considerable variance. In the case of the decision tree model, an average RMSE score of 17.60 was obtained, with the values of the individual folds ranging from 10 to 23. In the case of the SVR, all scores were above 20, and an average RMSE score of 25.97 was achieved. The RMSE values for the cross-validation of the Gaussian process model exhibited the lowest mean value, at 7.41. However, the scores of the individual folds demonstrated a strong variance. As was the case with the PR, the third fold yielded an RMSE score of over 20, while the first fold, for example, had an RMSE score of only 1.93.
Figure 8 presents the training history for the BPNN. The training was performed for 10,000 epochs. Early stopping was used to stop the training if the validation loss did not improve for 100 epochs. The learning rate was adjusted during training so that if the validation loss reached a plateau and did not improve for 50 epochs, the learning rate was halved. The plotted curve shows that the training process converged to an RMSE of 3.45 after approximately 1200 epochs. The RMSE for the validation set amounted to 3.61, which indicates that no overfitting occurred.
Predictions on the test set (725 data points) were performed to further verify the ability of the models to generate accurate predictions on unseen data. The RMSE and the R-squared values for each model for the test dataset are listed in Table 5. The computation time required for inference is provided as well. As during the training procedure, the BPNN achieved the best RMSE score of 4.29, followed by the GP model, with 6.772. The PR model also achieved a comparatively low RMSE of 9.535, while the DT and SVR models performed the poorest, with higher RMSE scores of 17.564 and 28.22, respectively. All models yielded MAPE values between 0.51% (BPNN) and 3.69% (PR). The order of performance of the models was analogous to the RMSE values. Concerning the required computing time, the individual algorithms differed slightly. Given its higher complexity, the BPNN required the highest computing time. PR and DT were the most efficient algorithms, with very short computation times of less than 0.1 s. All inference times fell below 1.0 s and were virtually negligible compared to the more than 1000 s required for the FE simulation.
To further illustrate the results and explore the application of the models, the predictions for a selected slip cycle are shown in Figure 9. Overall, all models except the SVR model provided good predictions for the selected cycle. In this particular instance, the GP model almost exactly reproduced the data, whereas the PR and the BPNN exhibited only weak deviations at certain points. In the case of DT, the shape of the curve was correctly mapped, but a slight general shift of the curve toward higher temperatures was evident. The model thus predicted a higher maximum temperature than the simulated values. Notably, the SVR model had difficulty reproducing the oscillations in temperature. As a result, this model exhibited the worst performance.
To investigate the influence of the volume of data, Figure 10 illustrates the obtained RMSE for each model as a function of the number of simulations used. Overall, it became evident that for all algorithms an increase in the amount of data also led to an improvement in performance (even if only a slight one in some cases). In the case of the SVR, a slight improvement was seen up to 50 simulations, whereas there was practically no variation in the results thereafter. In the case of the PR model, a strong improvement was present up to 90 simulations. For both DT and BPNN, the performance of the models improved with larger datasets, but the trend was noisy. In the case of the GP model, a very smooth curve was evident, in which performance improved continuously with higher data volumes.
Figure 10. RMSE scores for the test sets as a function of the data volume for Use Case 1.

Use Case 2-Axial Force + Rotational Speed + Lining Thickness
The results of the 10-fold cross-validation for Use Case 2 are shown in Figure 11. The PR and SVR showed very similar behavior and performance. The RMSE values of the individual folds for both models were similar, and the average RMSE values amounted to 26.67 and 27.59, respectively. The DT performed better, with an average RMSE score of 20.20. However, there was a large variance between the individual fold values, which ranged from 13 to 27. This higher variance was also evident in the case of the Gaussian process, in which the values varied between 11 and 26. However, among these four models, the GP achieved the best performance during cross-validation, with an average RMSE score of 18.03.
Figure 12 presents the training history of the BPNN for Use Case 2. The training was performed with the same settings as in Use Case 1. As in the first use case, the RMSE for both the training set and the validation set converged. After about 850 epochs, an RMSE of 6.94 was achieved for the training set. The validation loss (RMSE) amounted to 6.25. As in Use Case 1, the BPNN did not demonstrate any overfitting issues.
Table 6 shows the performance of the created models on the test set. The performance of the models follows the order observed in cross-validation on the training dataset.
Both polynomial regression and SVR showed the worst results, with RMSE values of approximately 24. The decision tree model performed slightly better, with an RMSE score of 19.71. Similar to the first use case, the Gaussian process and backpropagation neural network performed best, with RMSE values of 14.43 and 6.31, respectively. Whereas the SVR had the worst RMSE value, a better MAPE value than that of the PR was achieved. Although the GP performed better than the DT in terms of RMSE, both achieved similar MAPE values of about 2.2%. In terms of both RMSE and MAPE, the BPNN achieved the best performance. Figure 11. RMSE for 10-fold cross-validation of the PR, DT, SVR and GP models for Use Case 2. Figure 12 presents the training history of the BPNN for Use Case 2. The training was performed with the same settings as in Use Case 1. As in the first use case, the RMSE for both the training set and the validation set converged. After about 850 epochs, an RMSE of 6.94 was achieved for the training set. The validation loss (RMSE) amounted to 6.25. As was similar to Use Case 1, the BPNN did not demonstrate any overfitting issue.  Table 6 shows the performance of the created methods on the test set. The performance of the models follows the order in cross-validation on the training dataset. Both polynomial regression and SVR showed the worst results, with RMSE values of approximately 24. The decision tree model performed slightly better, with an RMSE score of 19.71. Similar to the first use case, the Gaussian process and backpropagation neural network performed best, with RMSE values of 14.43 and 6.31, respectively. Whereas the SVR had  In the example of individual slip cycles in Figure 13, the same behavior was seen for the PR and the SVR as for the SVR in Use Case 1. The general increase in temperature was modeled, but the temperature fluctuations were not captured. This effect was also present in a weakened form in the case of the GP. 
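The 10-fold cross-validation described above can be sketched with scikit-learn, assuming the PR, DT, SVR and GP models are built from the library's standard estimators. The data here are synthetic stand-ins for the FE simulation results (axial force, rotational speed and lining thickness as inputs, maximum temperature as output), and the study's actual hyperparameters are not reproduced:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical training data: one row per slip cycle, columns are axial
# force, rotational speed and lining thickness; y is the max. temperature.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 350.0 + 500.0 * X[:, 0] * X[:, 1]

models = {
    "PR": make_pipeline(PolynomialFeatures(degree=3), LinearRegression()),
    "DT": DecisionTreeRegressor(random_state=0),
    "SVR": make_pipeline(StandardScaler(), SVR()),
    "GP": GaussianProcessRegressor(),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    rmse = -scores  # one RMSE value per fold
    print(f"{name}: mean RMSE {rmse.mean():.2f}, "
          f"fold range {rmse.min():.2f} to {rmse.max():.2f}")
```

Reporting the per-fold range alongside the mean mirrors the fold-to-fold variance observed for the DT and GP in Figure 11.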
The DT properly modeled the oscillatory temperature rises but exhibited a deviation in the range between 5 and 10 s. As in the first use case, the predicted and true temperatures matched best in the case of the BPNN.

Figure 14 illustrates the performance of the models as a function of the amount of data for Use Case 2. The performance of the PR model showed almost no correlation with the data volume; only small fluctuations between RMSE scores of 25 and 30 occurred. The SVR model performed worse than the PR model for up to 50 simulations; above 50 simulations, the performance of the two models was very similar. The decision tree model showed the worst performance at a low number of simulations but improved continuously as the amount of data increased. From about 100 simulations, an RMSE score of about 20 was achieved, and larger amounts of data did not yield any further significant improvement. As in Use Case 1, the Gaussian process showed a high dependence on the amount of data, with a steady improvement of the results as the data volume increased. The BPNN models consistently achieved the best performance for all data volumes: better results than those of the PR and SVR were obtained with only 10 simulations, a further significant improvement occurred from about 60 simulations onwards, and beyond 100 simulations only small additional gains took place.
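The data-volume study behind Figure 14 amounts to a learning-curve evaluation: each model is retrained on an increasing number of simulations and scored on a fixed test set. A minimal sketch, using a Gaussian process as a stand-in for any of the five models and synthetic data in place of the FE results:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical data: each row summarizes one FE simulation.
rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 3))
y = 350.0 + 500.0 * X[:, 0] * X[:, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

results = {}
for n_sim in (10, 20, 50, 100, 200):
    model = GaussianProcessRegressor()            # stand-in surrogate model
    model.fit(X_train[:n_sim], y_train[:n_sim])   # first n_sim simulations only
    pred = model.predict(X_test)
    results[n_sim] = float(np.sqrt(mean_squared_error(y_test, pred)))  # RMSE

for n_sim, rmse in results.items():
    print(f"{n_sim:>3} simulations: RMSE {rmse:.2f}")
```

Keeping the test set fixed across all training-set sizes is what makes the resulting RMSE curve comparable point to point, as in Figure 14.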

Discussion
Considering the magnitude of the RMSE values (between 4.29 and 28.22) in relation to the magnitude of the output variable (temperature ranging between 350 K and 1000 K), none of the five models considered turned out to be completely unsuitable for temperature predictions based on the axial force and the rotational speed. This outcome was also confirmed by the MAPE scores: average absolute deviations of no more than 4.02% were achieved by all models in both use cases. Nevertheless, the GP and the BPNN stood out as the best candidates for the surrogate models. The disadvantages of these two models were a significantly higher level of complexity than the PR, DT, and SVR, as well as a greater training time. Furthermore, all the models met the requirement of providing nearly real-time temperature predictions during operation; the inference times obtained were negligible compared to the simulation time and the duration of the switching process. The PR model, which is characterized by its simplicity, also achieved acceptable results in the first use case. In the second use case, which additionally considered the variation of the lining thickness, the GP and the BPNN provided the best results, whereas the PR models, which competed with these two best models in Use Case 1, performed significantly worse.
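The two metrics used throughout, RMSE and MAPE, capture the absolute error in kelvin and the relative error in percent, respectively. A small self-contained sketch (the temperature values are illustrative, not taken from the study):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: absolute deviation in the unit of y (here K)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error: relative deviation in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Hypothetical clutch temperatures in kelvin with a uniform 2% overestimate
y_true = np.array([400.0, 500.0, 600.0, 700.0])
y_pred = y_true * 1.02

print(rmse(y_true, y_pred))  # absolute error in K
print(mape(y_true, y_pred))  # relative error, about 2%
```

Because temperatures here are several hundred kelvin, an RMSE in the tens can still correspond to a MAPE of only a few percent, which is why both metrics are reported side by side.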
If not only the RMSE and MAPE scores, which indicate the average performance of the models, are taken into account, but individual slip cycles are analyzed as well, the SVR can be directly discarded as a good surrogate model. In both use cases, the SVR did not manage to reproduce the oscillations caused by the three slip cycles in the temperature profile and only mapped the basic increase in temperature. Regarding Use Case 2, all models except the DT and BPNN exhibited the behavior of the SVR model in the first use case and demonstrated weaknesses in replicating the temperature fluctuations. It is also important to note that Figures 9 and 13 show only individual slip cycles as examples and that these results cannot be generalized to all other slip cycles.
Regarding the investigation of the influence of the data volume, both use cases showed that an increase in data volume fundamentally leads to improved performance. The range between 80 and 100 simulations can be considered preferable: for the two best models (GP and BPNN), acceptable performance was reached within this range while the quantity of data to be generated remained manageable. Larger data volumes achieved better results, but the improvement was not substantial.
In future work, we plan to extend the models to include more variables. For instance, the temperature of the material at any given x and y coordinates can be predicted in addition to the maximum temperature. Other geometry parameters (e.g., steel plate thickness) and material parameters (Young's modulus, heat capacity, etc.), as well as the initial temperature distribution and the oil temperature, can also be taken into account. Based on these extended surrogate models, a sensitivity analysis can also be performed in order to further investigate the influences of the individual parameters on the temperature behavior. Furthermore, other approaches such as physics-informed neural networks can be considered, which utilize not only the data but also the existing physical knowledge about the domain to generate the models.

Conclusions
This paper examined the potential of constructing surrogate models based on machine learning methods for a two-dimensional thermo-mechanical finite element model of a multi-plate clutch. Based on the existing FE model, datasets of different sizes were generated and five machine learning algorithms were investigated: polynomial regression, decision tree, support vector regression, Gaussian process and backpropagation neural networks. The Gaussian process and the backpropagation neural networks emerged as the best models in both use cases. Polynomial regression showed good results in the first use case (axial force and speed as inputs) but underperformed when the lining thickness was additionally varied and considered as an input. Regarding the amount of data required, the performance of both the GP and the BPNN improved as the amount of data was increased, and acceptable results were achieved starting at about 60 FE simulations. In the context of the application, it was shown that ML-based surrogate models are able to adequately reproduce the thermo-mechanical behavior of the clutch system during operation.