Training Data Selection for Machine Learning-Enhanced Monte Carlo Simulations in Structural Dynamics

The evaluation of structural response constitutes a fundamental task in the design of ground-excited structures. In this context, the Monte Carlo simulation is a powerful tool to estimate the response statistics of nonlinear systems, which cannot be represented analytically. Unfortunately, the number of samples required for high-confidence estimates grows disproportionately when low-probability events must be estimated reliably. As a consequence, the Monte Carlo simulation becomes computationally infeasible. We show that the application of machine learning algorithms significantly lowers the computational burden of the Monte Carlo method. We use artificial neural networks to predict structural response behavior using supervised learning. One shortcoming of supervised learning, however, is its limited accuracy when extrapolating to data the neural network has not yet seen. In this paper, neural networks predict the response of structures subjected to non-stationary ground excitations. In doing so, we propose a novel selection process for the training data to provide the required samples to reliably predict rare events. Finally, we demonstrate that the new strategy results in a significant improvement of the prediction of the response statistics in the tail end of the distribution.


Introduction
As engineers, we are responsible for the reliable design of structures and infrastructures. This task turns out to be challenging, especially when structures are supposed to withstand natural hazards, such as earthquakes. A powerful tool to evaluate the response statistics of structures is the statistical investigation of a set of experiments, commonly known as the Monte Carlo method.
Structural failure should occur very rarely in order to protect infrastructure and, consequently, human life; therefore, the number of samples of the crude Monte Carlo method must be chosen to be very large. Generally, experimental setups on full-scale structures are the most reliable way to collect data on the response behavior [1][2][3]. However, on the one hand, measured earthquake acceleration data are limited and, on the other hand, experimental campaigns with such a high number of samples are not realizable. One strategy to overcome this problem is the realization of hybrid simulations [4]. Although this approach significantly decreases the costs, it is still inefficient for collecting enough data for reliable Monte Carlo predictions of low failure probabilities.
In order to obtain a practically realizable framework, purely numerical models are used to simulate the statistical response behavior of structures. Especially when inducing structural failure, numerical simulations are much easier to realize in everyday engineering practice. Nevertheless, for complex structures modeled by high-dimensional finite element models, the computational cost of the crude Monte Carlo simulation becomes excessive.
There are two major options to deal with this issue:
• Reduce the number of required samples by employing advanced Monte Carlo strategies to estimate the probability of failure, such as importance and asymptotic sampling [5,6] or subset simulation techniques [7].
• Reduce the effort of each simulation; in this regard, model order reduction techniques have shown significant improvement of the computational burden [8][9][10].
Furthermore, the application of machine learning has received attention in the field of structural mechanics and has shown promising results [11][12][13][14].
Several studies have used neural networks for the design in earthquake engineering [15] and for the investigation of hazard events [16]. Researchers used machine learning approaches for structural response prediction, health monitoring, post-earthquake assessment, and interpretation of experimental data [17][18][19][20][21][22][23][24][25]. Particular focus has been placed on the design of structures to estimate the response or classify the damage of reinforced concrete frames and masonry shear walls [26,27]. Furthermore, in-plane failure modes have been investigated using machine learning algorithms [28]. With particular emphasis on the estimation of the tail end probability, i.e., the occurrence of very extreme events, extended data sets were used to predict the probability of failure using various neural network architectures [29,30]. To this end, the earthquake samples for the training sets of the neural network were generated using random factorization. This leads to sample sets with higher variance and, therefore, a larger portion of extreme events. However, these strategies require additional, extended data sets for the training procedure to obtain a suitable neural network prediction in the case of very rare events, a strategy that requires a careful choice of additional intensity factors when generating earthquake training sets.
In order to deal with this issue, this paper proposes a training data selection approach that enables a reliable prediction within the entire domain without extending the training data set. The proposed training data selection process is demonstrated on three different structures using two-, three-, and six-dimensional selection processes. This strategy improves the prediction of the response statistics in the case of rare events, i.e., in the tail region of the distribution, and turns out to result in more stable response distributions.

Estimation of Failure Probability
To determine whether structural failure occurs, one must define a limit state based on engineering decisions. Using a limit state function g(x), failure occurs if this function is less than or equal to zero, i.e., g(x) ≤ 0. The probability of failure can then be evaluated in terms of a multi-dimensional integral of the joint probability density function f_X(x) over the failure domain [5]:

P_f = ∫_{g(x) ≤ 0} f_X(x_1, ..., x_n) dx_1 ... dx_n. (1)

Introducing an indicator function I_g(x_1, ..., x_n), which is 1 if g(x_1, ..., x_n) ≤ 0 and 0 otherwise, Equation (1) can be reformulated as an expected value:

P_f = ∫ I_g(x_1, ..., x_n) f_X(x_1, ..., x_n) dx_1 ... dx_n = E[I_g(X)]. (2)

Generating a set of N random samples x^(k) and evaluating the corresponding indicator functions I_g(x^(k)), the Monte Carlo method allows one to estimate this expected value by the sample mean:

P_f ≈ (1/N) Σ_{k=1}^{N} I_g(x^(k)). (3)

This approach works well for response estimations around the mean of the distribution. Its general drawback, however, is the low confidence of this estimator in the tail end of the distribution, i.e., for the estimation of low failure probabilities, since the coefficient of variation of the estimator grows as P_f decreases:

CoV[P_f] ≈ sqrt((1 − P_f) / (N P_f)). (4)

Therefore, a disproportionately high number of samples must be evaluated in order to estimate small probabilities of failure with high confidence.
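The estimator and the sample-size issue it raises can be illustrated with a minimal Monte Carlo sketch; the limit state, the input distribution, and the sample size below are illustrative choices, not taken from the study:

```python
import numpy as np

def crude_monte_carlo_pf(limit_state, sampler, n_samples, seed=0):
    """Crude Monte Carlo estimate of P_f = E[I_g(X)].

    limit_state: the function g(x); failure is g(x) <= 0.
    sampler: draws one realization of the random inputs.
    """
    rng = np.random.default_rng(seed)
    failures = 0
    for _ in range(n_samples):
        x = sampler(rng)
        failures += limit_state(x) <= 0.0   # indicator function I_g(x)
    p_f = failures / n_samples
    # Coefficient of variation of the estimator: it grows as p_f shrinks,
    # which is why rare events need disproportionately many samples.
    cov = float(np.sqrt((1.0 - p_f) / (n_samples * p_f))) if p_f > 0 else float("inf")
    return p_f, cov

# Toy limit state: capacity R ~ N(5, 1) against a fixed demand of 2,
# i.e., g(r) = r - 2 (an illustrative example, not from the study)
p_f, cov = crude_monte_carlo_pf(
    limit_state=lambda r: r - 2.0,
    sampler=lambda rng: rng.normal(5.0, 1.0),
    n_samples=100_000,
)
```

Even with 10^5 samples, a failure probability on the order of 10^-3 is estimated with a coefficient of variation of roughly 8%, which illustrates why the crude method becomes infeasible for rare events.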

Artificial Non-Stationary Seismic Excitation Using the Kanai-Tajimi Filter
During the design process, earthquake excitations are employed to perform nonlinear dynamic time history analyses. In this context, the Kanai-Tajimi [31,32] filter is used to model the movement of the ground in terms of a single degree of freedom oscillator, which is subjected to a random white noise excitation. In doing so, site dependencies are included by using the natural frequency and the damping ratio of the ground. However, the conventional Kanai-Tajimi filter models earthquakes as stationary random processes and does not account for temporal changes in the properties of the excitation history. To this end, non-stationary Kanai-Tajimi models have been proposed to create more realistic excitations [33,34]. In this way, the time-dependent frequency content ω_g(t) and the time-dependent intensity parameter e(t) are extracted from a measured benchmark record.
We adopt this procedure to provide many realistic and site-dependent artificial ground motions based on one recorded accelerogram chosen from the region around the building site. As an example, we use the Northridge earthquake accelerogram, recorded in California in 1994. In particular, the chosen time history is the 360° component recorded at ground level at a USGS station (Sepulveda VA Hospital, CA: VA Hospital Bldg 40) [35].
A time window that moves from t_0 to t_end of the accelerogram is defined to capture the time-varying features, as shown in Figure 1. In this study, we chose a time window of size t_w = 1.5 s, which results in a smooth extraction of the parameters while still covering the frequency oscillations. Decreasing the time window increases the deviation from window to window. By contrast, large time windows flatten the curves of the extracted parameters. The intensity of the acceleration at time t within the time window t − t_w/2 < τ < t + t_w/2 is calculated as the root mean square of the acceleration over the window:

e(t) = sqrt( (1/t_w) ∫_{t − t_w/2}^{t + t_w/2} ẍ_b(τ)² dτ ).

By counting the number of zero crossings n_0(t) in the interval [t − t_w/2, t + t_w/2], the time-dependent frequency at time t is evaluated:

ω_g(t) = π n_0(t) / t_w.

We fit polynomial functions to the extracted properties, as shown in Figure 2. Therefore, next to the size of the scanning window, the order of the polynomial functions must be chosen carefully to subsequently generate the non-stationary excitation. Next to these time-varying parameters, other ground properties must be chosen, such as the damping of the ground. This parameter may be found through soil tests or trial-and-error methods. We chose a damping ratio of ζ_g = 0.3 based on numerical fitting methods [36,37] and visual inspection. The non-stationary Kanai-Tajimi filter is represented as the solution of the following time-variant single degree of freedom system [33]:

ẍ_f + 2 ζ_g ω_g(t) ẋ_f + ω_g²(t) x_f = −w(t),

where w(t) is a Gaussian white noise process. To obtain the filter response, x_f and ẋ_f, the single degree of freedom system is solved by the explicit central difference integration scheme [38,39]. The filter acceleration is written as:

ẍ_g(t) = 2 ζ_g ω_g(t) ẋ_f + ω_g²(t) x_f.

The extracted envelope function e(t) magnifies this response based on the extracted intensity parameter over the excitation time history.
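The moving-window extraction can be sketched as follows; the RMS intensity and zero-crossing frequency estimators and the polynomial orders are plausible assumptions for illustration, since the exact estimators are not fully specified here:

```python
import numpy as np

def extract_kt_parameters(acc, dt, t_w=1.5, deg_e=6, deg_w=4):
    """Window-based estimates of the intensity e(t) and frequency w_g(t)
    of a recorded accelerogram, smoothed by polynomial fits (sketch)."""
    half = int(round(t_w / (2.0 * dt)))
    times, e_raw, wg_raw = [], [], []
    for i in range(half, len(acc) - half):
        win = acc[i - half:i + half]
        e_raw.append(np.sqrt(np.mean(win ** 2)))    # RMS intensity in the window
        n0 = np.sum(np.diff(np.sign(win)) != 0)     # zero crossings in the window
        wg_raw.append(np.pi * n0 / t_w)             # frequency from crossing count
        times.append(i * dt)
    times = np.asarray(times)
    # Low-order polynomial fits smooth the window-to-window deviations
    e_poly = np.polynomial.Polynomial.fit(times, e_raw, deg=deg_e)
    wg_poly = np.polynomial.Polynomial.fit(times, wg_raw, deg=deg_w)
    return times, e_poly, wg_poly

# Check on a synthetic 4 Hz sine: w_g should be close to 2*pi*4 rad/s
dt = 0.005
t = np.arange(0.0, 20.0, dt)
record = np.sin(2.0 * np.pi * 4.0 * t)
times, e_poly, wg_poly = extract_kt_parameters(record, dt)
```

Shrinking t_w makes the raw estimates noisier from window to window, while a larger t_w flattens the extracted curves, as discussed above.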
Finally, one obtains the synthetic accelerogram, which includes the essential physical properties of the chosen benchmark excitation:

ẍ_a(t) = e(t) [2 ζ_g ω_g(t) ẋ_f + ω_g²(t) x_f].

Using a value of S_0 = 1.8 for the power spectral density of the white noise, we generated 10^4 samples for the numerical demonstration of this study, which we discuss in more detail in Section 3. One real and one synthetic accelerogram are presented in Figure 3a,b, respectively. As can be observed from these figures, the artificial ground excitation preserves the essential properties of the benchmark accelerogram while revealing the required level of randomness for our Monte Carlo simulation approaches.
This procedure was repeated 10^4 times to generate the sample set used in the numerical experiments of this paper. The range of the acceleration response spectra of all generated ground excitations is shown in Figure 4. Furthermore, the 25th percentile, mean, and 75th percentile of the spectra are highlighted in this visualization. The acceleration response spectrum of the benchmark earthquake is also plotted and is embedded within the envelope defined by the upper and lower bounds of all generated spectra.
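A minimal sketch of the generation step, integrating the filter with the explicit central difference scheme; the discrete white-noise scaling and the constant parameters in the usage example are illustrative assumptions:

```python
import numpy as np

def kanai_tajimi_accelerogram(e_fun, wg_fun, zeta_g=0.3, S0=1.8,
                              duration=20.0, dt=0.005, seed=0):
    """One synthetic non-stationary accelerogram from the Kanai-Tajimi
    filter, integrated by the explicit central difference scheme."""
    rng = np.random.default_rng(seed)
    n = int(duration / dt)
    # Discrete Gaussian white noise consistent with a (two-sided) PSD S0
    w = rng.normal(0.0, np.sqrt(2.0 * np.pi * S0 / dt), n)
    x_prev, x = 0.0, 0.0                      # filter displacement history
    acc_g = np.zeros(n)
    for k in range(n):
        t = k * dt
        wg = wg_fun(t)
        c = zeta_g * wg * dt
        # Central difference step for x'' + 2 zeta_g wg x' + wg^2 x = -w
        x_next = ((2.0 - (wg * dt) ** 2) * x - (1.0 - c) * x_prev
                  - dt ** 2 * w[k]) / (1.0 + c)
        x_dot = (x_next - x_prev) / (2.0 * dt)
        # Filter acceleration, magnified by the envelope e(t)
        acc_g[k] = e_fun(t) * (2.0 * zeta_g * wg * x_dot + wg ** 2 * x)
        x_prev, x = x, x_next
    return acc_g

# Constant envelope and frequency reduce the model to the stationary case
acc = kanai_tajimi_accelerogram(e_fun=lambda t: 1.0,
                                wg_fun=lambda t: 2.0 * np.pi * 2.5)
```

Passing the fitted polynomials for e_fun and wg_fun, with a fresh noise seed per call, yields one non-stationary sample per call.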

Feedforward Neural Networks
Machine learning has become an indispensable tool in many applications. The incorporation of artificial neural networks into complex problems of mechanics has proven useful for structural response prediction. In this section, we provide the basic idea and key mathematical expressions, summarizing the fundamentals of feedforward neural networks, which can be found in detail in the literature [40,41]. The term feedforward neural network already indicates the data flow of the network, as the information is transported forward from layer to layer. These architectures comprise at least two layers: an input and an output layer. In this case, they are called single-layer perceptrons. For this type, the output is calculated by the weighted sum of all inputs. However, this architecture is limited to the classification of separable patterns [40]. By adding so-called hidden layers between the input and output layer, one overcomes these limitations. Such architectures are called multi-layer or deep feedforward neural networks, or multi-layer perceptrons. The simplest architecture has one hidden layer, as depicted in Figure 5. First, one chooses appropriate information parameters for the input layer and the output parameters as the desired targets. In our case, the input parameters are chosen intensity measures, while the output parameter is the failure criterion. The data are passed forward from layer to layer via matrix multiplications, vector additions, and activations, where, in the case of supervised learning, the coefficients of these matrices and vectors are the neural network parameters to be found through an optimization process. The procedure of one forward pass is expressed mathematically as follows. The first layer (l = 1) is provided with normalized input data, using, for example, min-max feature scaling or standard score normalization, cf. [41]. The following layers (l > 1) are fed by the output of the previous layer, ŷ^(l−1).
The outputs are weighted by a matrix W^(l), and a bias vector b^(l) is added, i.e., z^(l) = W^(l) x^(l) + b^(l). Additionally, each neuron within each layer applies an activation function f(z^(l)). The most common activation functions are the sigmoid function f(z) = e^z / (1 + e^z), the hyperbolic tangent function f(z) = (e^z − e^{−z}) / (e^z + e^{−z}), and the rectified linear unit f(z) = max(0, z). The output of every hidden layer is written as ŷ^(l) = f(z^(l)).
In this manner, the information is transported through all the layers until the output layer, ŷ^(l=L) = ŷ. During parameter optimization, the output layer is provided with the targets y of the sample. The error is calculated by comparing this target with the output of the last layer, e.g., by the mean squared error E = (1/N_y) Σ (y − ŷ)². The weights and biases can then be optimized by back-propagation of the error [42]. Hereby, the gradient ∂E/∂ŷ^(l) of the last layer (l = L) leads to (2/N_y)(ŷ − y); for all previous layers, the gradients follow from the chain rule through the weights and activation functions of the subsequent layers. The forward computation, the back-propagation of the error, and the respective updates of the weights W and biases b are performed in each training iteration. This procedure is repeated until the error falls below a desired magnitude. Implementation-wise, it has proven useful to update the neural network parameters after a certain training subset, a so-called batch. After the neural network has seen every sample of the entire training data set, one epoch is finished. Generally, it takes many epochs to sufficiently train a neural network on the data set. In order to obtain the best results, the neural network architecture needs to be modified iteratively. These adjustments include the number of layers, the number of neurons in each layer, the activation function, the optimizer, and the input features.
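The forward pass and the error back-propagation described above can be sketched for a single hidden layer; the toy regression task and the hyperparameters are illustrative choices, not taken from the study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: learn y = sin(x) on [-pi, pi]
X = rng.uniform(-np.pi, np.pi, (256, 1))
Y = np.sin(X)

# One hidden ReLU layer (a minimal instance of the multi-layer perceptron)
W1 = rng.normal(0.0, 0.5, (1, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)

def forward(X):
    Z1 = X @ W1 + b1              # z^(1) = W^(1) x + b^(1)
    A1 = np.maximum(0.0, Z1)      # ReLU activation
    return Z1, A1, A1 @ W2 + b2   # linear output layer

_, _, Yhat0 = forward(X)
mse0 = float(np.mean((Yhat0 - Y) ** 2))   # error before training

lr = 0.05
for epoch in range(3000):
    Z1, A1, Yhat = forward(X)
    dY = 2.0 * (Yhat - Y) / len(X)        # gradient of the MSE at the output
    dW2 = A1.T @ dY;  db2 = dY.sum(0)     # back-propagate: output layer
    dZ1 = (dY @ W2.T) * (Z1 > 0)          # chain rule through the ReLU
    dW1 = X.T @ dZ1;  db1 = dZ1.sum(0)    # back-propagate: hidden layer
    W1 -= lr * dW1; b1 -= lr * db1        # gradient-descent update
    W2 -= lr * dW2; b2 -= lr * db2

mse = float(np.mean((forward(X)[2] - Y) ** 2))
```

Here every iteration uses the full data set; with mini-batches, the same update is applied per batch, and one pass over all batches constitutes one epoch.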

Artificial Neural Network Strategy for Structural Response Prediction
The main goal of this study is to use artificial neural networks to decrease the computational effort and, therefore, to make the Monte Carlo method attractive for real-life applications. We focus on predicting the peak story drift ratio (PSDR) as the output instead of predicting the complete response time history. The latter would require more complex neural network architectures, such as recurrent neural networks. Using feedforward architectures, there are several options, e.g., convolutions, to extract the features from the accelerogram [30]. However, the failure prediction is more accurate when a relatively small artificial neural network with fewer training samples is used. Therefore, this strategy is expected to yield the highest speed-up compared to the full simulation procedure [43].
Carefully preprocessing the input data allows us to use a simple artificial neural network architecture. The challenge of so-called feature engineering is the generation of physically meaningful input parameters. In this context, in earthquake engineering, extensive research has been carried out to characterize seismic motion in terms of single magnitudes, such as earthquake intensities [44]. These intensity measures are used as input parameters for the neural networks in this paper. Hereby, the Arias intensity I_A, the characteristic intensity I_c, and the cumulative absolute velocity (CAV) have already been used for response predictions [45,46]. Besides using features directly from the accelerogram, one may extract additional spectral features. Therefore, the displacement, velocity, and acceleration response spectra, as well as the spectral values at the first eigenfrequency of the structure, are used to evaluate additional relevant input features. As a result, meaningful combinations of several intensity measures as input parameters have been studied extensively [47]. For a steel frame structure, the combination of the effective peak ground acceleration (EPA), the spectral acceleration at the natural period S_a(T_1), the velocity spectrum intensity ∫_{0.1}^{2.5} S_v dT, the spectral displacement at the natural period S_d(T_1), the cumulative absolute velocity (CAV), and the peak ground acceleration (PGA) revealed promising prediction results [29]. A set of relevant input parameters that can be calculated from an accelerogram is summarized in Table 1.
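A few of the scalar intensity measures listed in Table 1 can be computed directly from an accelerogram; the sketch below follows the standard definitions of PGA, PGV, Arias intensity, and CAV (unit conventions vary across references), with an illustrative synthetic pulse as input:

```python
import numpy as np

def intensity_measures(acc, dt, g=9.81):
    """Scalar intensity measures of an accelerogram acc sampled at dt
    (a sketch; sign and unit conventions vary across references)."""
    vel = np.cumsum(acc) * dt                        # ground velocity by integration
    pga = float(np.max(np.abs(acc)))                 # peak ground acceleration
    pgv = float(np.max(np.abs(vel)))                 # peak ground velocity
    arias = np.pi / (2.0 * g) * np.sum(acc**2) * dt  # Arias intensity I_A
    cav = np.sum(np.abs(acc)) * dt                   # cumulative absolute velocity
    return {"PGA": pga, "PGV": pgv, "I_A": float(arias), "CAV": float(cav)}

# Example on a short synthetic, exponentially decaying pulse
dt = 0.01
t = np.arange(0.0, 5.0, dt)
acc = 2.0 * np.exp(-t) * np.sin(2.0 * np.pi * 1.5 * t)
ims = intensity_measures(acc, dt)
```

Spectral quantities such as S_a(T_1), S_v, and S_d(T_1) additionally require solving a single degree of freedom oscillator per period and are omitted here for brevity.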
We first calculate the set of artificial earthquake records S, from which we extract the corresponding intensity set I containing the intensity vectors i^(k) of all samples. Applying supervised learning, the neural network learns from training data. Therefore, at first, a subset S_t ⊂ S must be evaluated by the finite element model, which yields the response set R_t. The neural network learns the damage indicator directly, which is, in our case, the PSDR. The target set for the neural network training, O_t, is therefore calculated from the response set R_t. Using a neural network, the predictions of the sets are written as Ô_t = NN(I_t) and Ô = NN(I). Once the neural network is trained and optimized in such a way that the predictions converge to the targets, it can be used to predict the individual response values and, consequently, the full set Ô.

The New Training Data Selection Method Based on Sample Intensities
Artificial neural networks can solve a wide variety of tasks in several fields of application. Once a neural network is properly trained, it can make predictions that are both accurate and efficient. In general, neural networks predict reliably as long as they have seen similar patterns before. Thus, one may expect that the mean of the data set will be predicted very well. However, in this study, we want to estimate the response of events near the failure region of the distribution, i.e., extreme events that happen very rarely. Therefore, we provide a new strategy to cover the entire domain of possible events during the training procedure. The major steps of the machine learning procedure are shown in Figure 6, while the appropriate choice of the training data is the focus of this paper.

Figure 6. Flowchart of machine learning enhanced structural response statistics using the proposed strategy for the selection of the training samples during supervised learning.
In particular, Figure 7 shows all the steps of the new selection process. Instead of using an arbitrary training data set I_t,rand, we select a training data set I_t,sel based on the intensities of the generated artificial earthquake samples. Therefore, we split each intensity range into evenly distributed intervals. Depending on the number of intensity measures considered for the selection, we choose from an i-dimensional grid. If the selection is based on one intensity only, we pick one sample from each bin. In this case, the number of selected samples is equal to the number of bins, or slightly smaller if empty bins exist. This procedure will likely give a better-distributed training set than a randomly picked set, especially if the chosen intensity correlates with the output quantity. However, it will probably not cover the whole domain of all intensity measures chosen as input parameters. One may interpret this strategy as an adaptation of the Latin hypercube sampling approach. The difference is that we perform the sampling beforehand based on non-stationary Gaussian white noise realizations and, subsequently, filter the data.

The domain is represented better when two intensities are considered. The procedure can easily be illustrated for this setup. In Figure 8, we present the correlation between the velocity spectrum intensity ∫_{0.1}^{2.5} S_v dT and the spectral displacement S_d(T_1). In particular, Figure 8a shows all samples from set I, while Figure 8b shows only 450 samples from this set, in other words, a randomly chosen subset used for the training, I_t,rand. As one can see, most samples with high intensities are not included in this set, although they are the most important samples that allow the neural network to learn about the tail region of the distribution, as the intensities have a certain correlation with the structural response. In Figure 8c, we also present the samples chosen from I using the proposed strategy.
Both intensities were divided into 34 intervals, which resulted in the selection of 438 samples. In doing so, we ensure that the selected training set I_t,sel covers the whole parameter space of both intensities. Therefore, the neural network will learn from very frequent and from very rare events to the same extent. The crucial part of this strategy is a meaningful choice of the number and type of intensities considered in the selection procedure. Furthermore, the bin size must be adjusted to obtain a certain number of samples; reducing the bin size results in a higher number of training samples. In particular, if several intensity measures are considered, the bins must be chosen large enough to select only a small share of the set.
The selection process shown in Figure 9 is based on three intensities: the velocity spectrum intensity ∫_{0.1}^{2.5} S_v dT, the spectral displacement S_d(T_1), and the spectral acceleration at the natural period S_a(T_1). Each intensity is split into 19 bins, resulting in a selection of 565 samples. Figure 9 shows that the selected data points for the three intensities spread evenly over the entire domain of interest.
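The bin-based selection described above can be sketched as follows; the function name and the synthetic correlated intensities in the usage example are illustrative assumptions:

```python
import numpy as np

def select_by_intensity_grid(intensities, n_bins, seed=0):
    """Select at most one sample per occupied cell of an evenly spaced
    i-dimensional intensity grid (a sketch of the proposed selection).

    intensities: array of shape (n_samples, n_intensity_measures)
    n_bins: number of evenly distributed intervals per intensity
    Returns the indices of the selected training samples.
    """
    rng = np.random.default_rng(seed)
    intensities = np.asarray(intensities, dtype=float)
    lo = intensities.min(axis=0)
    hi = intensities.max(axis=0)
    # Bin index of every sample along every intensity axis
    scaled = (intensities - lo) / (hi - lo)
    bins = np.minimum((scaled * n_bins).astype(int), n_bins - 1)
    # Group the samples by grid cell, then draw one sample per occupied cell
    cells = {}
    for idx, cell in enumerate(map(tuple, bins)):
        cells.setdefault(cell, []).append(idx)
    selected = [rng.choice(members) for members in cells.values()]
    return np.sort(np.array(selected))

# Two correlated synthetic intensities for 10^4 samples, 34 bins each
# (the sample set and its correlation model are illustrative)
rng = np.random.default_rng(1)
base = rng.lognormal(0.0, 0.5, 10_000)
intensities = np.column_stack([base, base * rng.lognormal(0.0, 0.2, 10_000)])
chosen = select_by_intensity_grid(intensities, n_bins=34)
```

Because only occupied cells contribute one sample each, the selected set covers the whole intensity range while its size stays far below the full sample count, matching the behavior reported for the 34-bin and 19-bin selections.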

Numerical Example
In this section, we provide a numerical demonstration of the new strategy. We chose the peak story drift ratio (PSDR) as the damage indication criterion. Thus, the evaluation of the PSDR was used for the entire training data set and the reference solution using the crude Monte Carlo simulation approach. We used an in-house finite element tool written in C++ and Python to calculate the nonlinear structural response. This tool has been used in several previous studies and has been validated using commercial finite element codes [48]. For the implementation of artificial neural network architectures, we used the Python library TensorFlow [49]. Three different structures are subjected to synthetic seismic excitations to present the new strategy. Furthermore, we provide the neural network predictions based on different numbers of intensities chosen to select the training data.

Nonlinear Frame Structures
The three structures are modeled by fiber beam elements using an elastoplastic material law with kinematic hardening, as shown in Figure 10. We chose a Young's modulus of E_1 = 210 GPa and a material density of ρ = 7850 kg/m³. The yielding limit of the material was set to σ_y = 235 MPa. The post-yielding stiffness was chosen as E_2 = 21 GPa, which is 10% of the initial Young's modulus. The beams and columns of the frames were modeled using the same beam element formulation, shown in Figure 10. For the fiber beam elements, we used hollow cross sections with different widths, heights, and thicknesses; the dimensions are summarized in Tables 2 and 3. Additionally, the structures are presented in Figure 11. To account for a realistic mass distribution, all structures carry additional point masses. The first natural periods of structures A, B, and C are T_1,A = 0.74 s, T_1,B = 0.40 s, and T_1,C = 0.88 s, respectively. The PSDR for the first two stories is depicted in Figure 11 for structure C. For these particular problems, we observed that a story drift ratio of ∼4.5% corresponds to a full plastification of the frame corners in the first stories. These assumptions agree with collapse estimates for steel structures [50]. However, lower story drift ratios can already cause damage to the structure, i.e., story drift ratios above 2% already lead to plastification of the frame corners. More elaborate models could also include damage and softening effects and may improve the failure criterion [51], which is a subject for future research. However, in this study, we focus on the general approach of the strategy with the PSDR as representative output quantity and an elastoplastic material law in the underlying finite element model.

Figure 11. Numerical example to demonstrate the strategy: (A) two-bay-three-story frame structure; (B) three-bay-two-story frame structure; (C) two-bay-five-story frame structure.
Figure 12 reveals the distribution of the intensities of the training set using the violin plot format. To compare the statistics of each set, all intensities are scaled between zero and one by min-max normalization based on the entire set. In Figure 12a, the violin plots of the full set are shown. Comparing this set with a randomly chosen sample set, shown in Figure 12b, the issue becomes clearer. Observing the randomly chosen training distribution, one immediately realizes that the most severe earthquakes are not covered. Using our new selection strategy, by contrast, enables covering a broad range of data that includes a large share of high- and low-intensity values, as shown in Figure 12c. Therefore, these events are more likely to be covered in the training set. The distributions presented in Figure 12c are based on the selection using two intensities, and those in Figure 12d are based on the selection using three intensities. The strategy clearly shows the anticipated effect on the training set. However, the distributions based on two or three dimensions do not differ significantly.

Training Data Selection from Generated Earthquake Samples
When the number of intensities contributing to the selection process is increased, the number of bins must be decreased. Otherwise, the number of training samples increases and the computational benefit of the whole strategy vanishes. Using six intensities for the selection process, the probability density function of each intensity shows an uneven distribution, as shown in Figure 13. The peaks of these distributions correlate with the number of selected bins, and the share of the higher selected intensities decreases, except for the peak ground acceleration. Therefore, we propose to select training data from a relatively small number of intensities, i.e., two or three different features.

Figure 12. Violin plots [52,53] show the distribution of the intensities, which are used as inputs for the neural network; (a) full set, 10,000 samples; (b) 500 randomly chosen samples from the full set; (c) 438 samples chosen as training data based on the proposed strategy using two intensities (the velocity spectrum intensity ∫_{0.1}^{2.5} S_v dT and the spectral displacement S_d(T_1)) with 34 bins each; (d) 565 samples chosen as training data based on the proposed strategy using three intensities (the velocity spectrum intensity ∫_{0.1}^{2.5} S_v dT, the spectral displacement S_d(T_1), and the spectral acceleration S_a(T_1)) with 19 bins each.

Figure 13. Violin plots [52,53] show the distribution of the intensities, which are used as inputs for the neural network; 291 samples are chosen as training data based on the proposed strategy using the shown six intensities with 5 bins each.

Predicted Response Statistics
The response statistics using the crude Monte Carlo method were obtained by calculating l = 10^4 artificially generated earthquakes, which constitute the entire benchmark set S for all simulations. The peak story drift ratio set O was calculated using the finite element method by evaluating the reference solutions through numerical time integration using the Newmark method [54]. We applied the new strategy to structures A, B, and C (see Figure 11) using the proposed sample selection for two, three, and six intensities, respectively.
For each selection, an artificial neural network was trained. The input vector X consists of six features; therefore, the input layer also had six neurons. We used three hidden layers, consisting of between 11 and 14 neurons each, with rectified linear activation functions. The output layer consisted of one neuron only, which produced the PSDR prediction. We used the Adam algorithm, which adapts the learning rates during training [55]. This configuration was used for all neural network architectures presented in this study.
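In TensorFlow, the described architecture can be sketched as follows; the exact hidden layer widths within the stated 11 to 14 neuron range are illustrative picks:

```python
import numpy as np
import tensorflow as tf

# Six intensity inputs, three hidden ReLU layers (the widths 14, 12, and
# 11 are illustrative picks from the stated range), and one linear output
# neuron for the PSDR, trained with the Adam optimizer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(6,)),
    tf.keras.layers.Dense(14, activation="relu"),
    tf.keras.layers.Dense(12, activation="relu"),
    tf.keras.layers.Dense(11, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

# One prediction for a dummy (normalized) six-feature intensity vector
pred = model.predict(np.zeros((1, 6)), verbose=0)
```

Training then amounts to model.fit on the normalized intensity vectors of the selected set and the corresponding PSDR targets.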
The statistics of the predictions are shown in Figure 14. For the structures, A, B, and C, the prediction of the PSDR is shown in the top, middle, and bottom plots of Figure 14, respectively. The probability density function (PDF) is shown on the left-hand side, while the complementary cumulative distribution function (CCDF) is shown on the right-hand side.
For the two-dimensional selection process, the samples were selected based on the velocity spectrum intensity ∫_{0.1}^{2.5} S_v dT and the spectral displacement S_d(T_1,A) using 29 bins each. The selected set consists of n = 333 samples, which were used for the training. The validation data were taken from the full set and have a size of 20% of the entire training set. Furthermore, considering the spectral acceleration S_a(T_1,A), the 3D selection method used 15 bins, which resulted in 336 samples for the training. The six-dimensional selection method used five bins, which resulted in 291 samples. The PDF of the predicted PSDR of structure A is shown in Figure 14a and shows the expected agreement with the finite element solution. In Figure 14b, the corresponding CCDF is shown. On the one hand, we can see that the network trained on randomly selected data cannot predict the tail region correctly. On the other hand, all neural networks trained with the newly selected data predict this region well.
We performed the same procedure for structure B. The two-dimensional selection process was based on ∫_{0.1}^{2.5} S_v dT and S_d(T_1,B); using 27 bins each resulted in n = 318 training data samples. Considering the spectral acceleration S_a(T_1,B) for the 3D selection, we used 15 bins, which resulted in n = 325 samples for the training set S_t,sel. We used only five bins for the six-dimensional selection, which resulted in n = 274 samples. Independent of the training data, the PDF of the predicted PSDR of structure B, shown in Figure 14c, matches the finite element solution. Figure 14d shows the corresponding CCDF. Compared to the previous example using structure A, we observe that the prediction based on randomly selected data performs better in the tail region. This structure shows higher resistance to earthquakes, and the neural network can predict the extreme events better in this case, although the prediction slightly overshoots the finite element solution. The neural networks trained with the selected data predict this region well.
Structure C shows the highest share of samples exceeding a PSDR of 4.5%. The two-dimensional selection process based on ∫_{0.1}^{2.5} S_v dT and S_d(T_1,C), with 28 bins each, resulted in n = 331 selected samples for training. Considering S_a(T_1,C) for the three-dimensional selection and using 14 bins resulted in n = 309 selected samples. The six-dimensional selection method resulted in n = 274 samples when five bins were used for each intensity. The PDF of the PSDR of structure C is shown in Figure 14e. All artificial neural networks predict the statistics in the mean region well. Figure 14f shows the CCDF. However, the artificial neural network trained on randomly selected data fails to predict the statistics in the tail region of the distribution. The results of the selected training procedure show that it is beneficial to use two or three intensities during the selection instead of all six intensities. Generally, the extreme events are not covered properly if large bin sizes are chosen.
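The PDF and CCDF comparisons above rely on empirical distribution estimates; a minimal sketch of the empirical CCDF, with synthetic PSDR-like data standing in for the actual finite element results:

```python
import numpy as np

def empirical_ccdf(samples):
    """Empirical complementary CDF, P(X > x), of a sample set; this is
    the quantity plotted on the right-hand side of Figure 14."""
    x = np.sort(np.asarray(samples))
    ccdf = 1.0 - np.arange(1, len(x) + 1) / len(x)
    return x, ccdf

# Synthetic PSDR-like data (in %) standing in for finite element results
rng = np.random.default_rng(0)
psdr = rng.lognormal(mean=0.0, sigma=0.6, size=10_000)
x, ccdf = empirical_ccdf(psdr)
# Tail probability of exceeding a 4.5% drift-ratio threshold
p_exceed = float(np.mean(psdr > 4.5))
```

Plotting ccdf over x on a logarithmic probability axis makes the tail disagreement between the randomly trained and the selection-trained networks directly visible.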

Discussion
Our three numerical studies reveal that neural networks trained with the new data selection can predict the tail region. In particular, the response of structure B is predicted well by every neural network. For the taller structures A and C, the neural network trained on randomly selected data fails to predict rare events, while selected training data yield accurate statistics. It turns out that the selection strategy using two or three intensities, namely, the velocity spectrum intensity ∫_{0.1}^{2.5} S_v dT, the spectral displacement S_d(T_{1,A}), and the spectral acceleration S_a(T_{1,A}), provides the most stable solution. The selection process using all six intensities does not ensure the inclusion of extreme events, as the bin size must then be chosen rather large.
The predictions of artificial neural networks trained on randomly selected data are unstable. Even though the mean region is predicted well, the results in the tail region are inconsistent. The easiest way to tackle this problem is to enlarge the training data set: the predictions then become less sensitive to each randomly selected point, and the network is likely to perform better on the task. However, the computational benefit of the entire strategy would suffer. The proposed selection strategy reduces this instability while preserving the computational efficiency. To keep the procedure efficient, we propose to use two or three intensity measures for the selection. With the new selection strategy, the predicted response statistics improve significantly in the tail region of the distribution. However, predictions of neural networks trained on the same data can still vary slightly if sources of randomness, such as the initialization of the weights and biases, are not removed.
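A standard remedy for this remaining seed-to-seed scatter, not part of the paper's procedure but worth noting, is to train an ensemble of independently initialized networks and average their predictions. The toy sketch below only models the effect: the K predictions are treated as the true response plus initialization noise, so the ensemble mean scatters roughly 1/sqrt(K) as much as a single network (all numbers are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
true_psdr = 2.0   # hypothetical peak story drift ratio in %
sigma = 0.1       # assumed seed-to-seed scatter of a single trained network
K = 16            # number of independently initialized networks

# Model: each network's prediction = true response + initialization noise.
preds = true_psdr + sigma * rng.standard_normal(K)
ensemble_mean = preds.mean()  # scatter of the mean is roughly sigma / sqrt(K)
```

In practice this multiplies the training cost by K, but since training is cheap compared to evaluating the finite element samples, the overall speed-up of the method is barely affected.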
The proposed procedure increases the share of rare events in the training data without increasing the computational cost. Another strategy to provide extreme events in the training data is to apply scaling methods during the generation of the synthetic earthquakes [29]. Generating an additional set of earthquakes, however, results in extra computational effort. Furthermore, such excitations must be scaled carefully; otherwise, predictions on a differently generated set can be unsatisfactory. The newly proposed strategy does not need to create extended training data; instead, the data are taken from the already generated set. Compared with the strategy of extended sets, this selection process thus benefits from its simple applicability.
The prediction of the structural response to a single earthquake might be improved by more complex neural network architectures, such as convolutional neural networks. Such advanced machine learning algorithms would learn from the unprocessed input data, i.e., the entire time history of the generated earthquakes. The downside of more sophisticated architectures is the higher computational effort during training: the network must learn more parameters, and the number of samples that must be evaluated for the training data set increases significantly.
The computational cost of training the network parameters and of the prediction itself is small compared to the cost of evaluating the training data. Therefore, the computational speed-up can be roughly estimated by dividing the total number of samples l of the entire set S by the number of samples n of the training set S_t. Using 333 samples for training plus 20% for validation leads to a speed-up of around 25. Taking the calculation of the intensities into account, this speed-up decreases slightly. Overall, the numerical experiments show that the new procedure results in a speed-up factor above 20.
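As a back-of-the-envelope check of this estimate: only the n = 333 training samples and the 20% validation share are stated above, so the total set size l = 10,000 used below is a hypothetical value chosen for illustration.

```python
# Rough speed-up estimate: total samples avoided vs. samples actually
# evaluated with the finite element model (training + validation data).
l = 10_000                   # hypothetical total number of Monte Carlo samples
n_train = 333                # samples used for training
n_evaluated = n_train / 0.8  # training set is 80%; the other 20% is validation
speedup = l / n_evaluated
print(round(speedup, 1))     # → 24.0 under the assumed l
```

The estimate ignores the (comparatively cheap) cost of training the network and computing the intensity measures, which is why the realized speed-up reported above is slightly lower than this idealized ratio.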

Limitations and Potential Advancements
The approach provides a reliable prediction of the tail end of the response statistics. Yet, we tested this strategy on two-dimensional models only. Three-dimensional structures are of high interest for this strategy, as the number of degrees of freedom, and consequently the computational cost, increases drastically; the accuracy of the proposed approach must be tested in this case. Furthermore, we used a rather simple criterion for structural failure. Incorporating more elaborate material models that include material damage and softening would result in a significantly higher computational cost. Evaluating the computational speed-up in this case is of interest, since we expect that the neural network prediction will not be affected by this extension. In this study, however, we used a bi-linear material model, analyzed the behavior of the cross section with the highest stresses, and focused on the level of plastification. Evaluating static cyclic test calculations, one observes that, if the story drift ratio exceeds the chosen failure criterion of a maximum PSDR, the most affected cross section, i.e., the cross section in the first frame corner, fully plastifies.

Conclusions
This paper proposes a new training data selection strategy for neural-network-enhanced Monte Carlo simulations. The training data selection enables one to reliably predict the response statistics over their whole domain while maintaining a high level of computational efficiency. Based on a set of synthetic earthquakes generated to reflect the properties of a specific region, this strategy provides a powerful technique to predict the probability of failure, even in the tail end of the distribution.
Although the application of the proposed selection strategy reduces the sensitivity of classically trained neural network architectures, the prediction quality remains sensitive to the neural network parameters. Thus, future studies are necessary to find the best neural network architecture and stabilizing measures for consistent output quality. Nevertheless, we found an effective way to choose training data so that artificial neural networks can be applied to seismic response statistics.
Incorporating more than one benchmark earthquake record into the data generation process is of high interest for future research. Furthermore, the application to more realistic structures could reveal even greater benefits of the new strategy, as more complicated structures do not necessarily require more sophisticated neural network architectures, which would ultimately yield an even higher speed-up in computation time.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:

CAV    Cumulative absolute velocity
CCDF   Complementary cumulative distribution function
CDF    Cumulative distribution function
EPA    Effective peak ground acceleration
PDF    Probability density function
PGA    Peak ground acceleration
PSDR   Peak story drift ratio
SDOF   Single degree of freedom system