Applying Heuristics to Generate Test Cases for Automated Driving Safety Evaluation

Comprehensive safety evaluation methodologies for automated driving systems that account for the large complexity of real traffic are currently being developed. This work adopts a scenario-based safety evaluation approach and aims at investigating an advanced methodology to generate test cases by applying heuristics to naturalistic driving data. The targeted requirements of the generated test cases are severity, exposure, and realism. The methodology starts with the extraction of scenarios from the data and their split into two subsets, containing the relatively more critical scenarios and the normal driving scenarios, respectively. Each subset is analysed separately with regard to the parameter value distributions and the occurrence of dependencies. Subsequently, a heuristic search-based approach is applied to generate test cases. The resulting test cases clearly discriminate between safety-critical and normal driving scenarios, with the latter covering a wider spectrum than the former. The verification of the generated test cases proves that the proposed methodology properly accounts for both severity and exposure in the test case generation process. Overall, the current study contributes to filling a gap concerning the specific applicable methodologies capable of accounting for both severity and exposure, and calls for further research to prove its applicability in more complex environments and scenarios.


Motivation and Aim
As automated and autonomous driving (AD) systems get ready to penetrate the market, their safety evaluation and approval for public roads demand standardized and harmonized safety evaluation methodologies. To fulfil this demand, several international efforts are being undertaken by large-scale research projects, such as PEGASUS [1][2][3], SAKURA [4][5][6], Ko-HAF [7,8], Catapult [9], and Streetwise [10]. All these projects have adopted a scenario-based safety evaluation approach, which relies on a clear description of the operational design domain (ODD), in which the systems are meant to be used, as well as on well-structured sets of functional scenarios that the systems will need to address to evaluate their safety [2]. These scenarios are subsequently parameterized and concretized to define finite sets of test cases (hereafter referred to as a test suite), upon which the AD-system safety is evaluated physically on proving grounds and/or virtually based on simulations.
The scenario-based approach needs to be contextualized within comprehensive safety strategies. Related work on test case generation includes, among others, the approaches in [29,30], the minimization of the required test cases by introducing a cost function [31], and the clustering of scenarios from real traffic data [32].
To the best knowledge of the authors of this paper, neither search-based techniques for explicit scenario parameterization that addresses all the above limitations nor the sub-category of heuristics have been applied to generate test cases for automated vehicle safety evaluation.

Main Contribution
This paper proposes a novel methodology to generate safety-relevant test cases by applying heuristics to naturalistic driving data. The proposed methodology can contribute to overcoming several of the unresolved challenges in the field of test case generation.
In contrast with [26][27][28], the distinction between normal and critical driving scenarios is addressed before starting the test case generation, which ensures that only parameter ranges and dependencies that are known to be relevant in advance are accounted for in the process. By defining criticality measurements in advance, specific subsets can be extracted from the real-world data. Consequently, test cases to address severity are generated based on the critical subset, whereas test cases to address exposure are derived from the normal driving subset. This can be a significant improvement over previous studies, as the parameter dependencies between critical and normal driving scenarios have been shown to differ [26].
Another contribution of the proposed approach is an improved compromise between the incorporation of corner cases from real driving scenarios and the robustness in considering outliers. In [26], outlying scenario data has a significant impact on the resulting dependencies, but the robustness is relatively low. On the other hand, the multi-dimensional fitting in [28] often neglects corner cases. The proposed methodology considers multidimensional and polynomial dependencies between parameters through the incorporation of regression models for the test case assignment. This enables parameter dependencies to be appropriately accounted for in the test case generation.
Finally, the proposed search-based methodology constantly verifies the safety relevance of the generated test cases in two dimensions during the iterative process, which improves the quality of the final test suite. The generated test suite, therefore, does not require additional steps to evaluate the safety relevance or redundancy of the test cases, allowing for a direct application to AD safety evaluation. Overall, the adoption of search-based scenario parameterization enhances the current state of the art in explicit scenario parameterization.

Paper Structure
Hereafter, the methodology, results, discussion, and conclusions of the current paper are elaborated in detail. In Section 2, the heuristics methodology proposed to generate test cases is described generically. In Section 3, the proposed heuristics methodology is prototypically applied to a set of cut-in scenarios extracted from a previously collected, unpublished traffic data set. The generated cut-in test cases are assessed in Section 4 from the perspectives of severity and exposure, by means of independent metrics. Finally, a general discussion is presented in Section 5 and the paper is concluded in Section 6.

Heuristics Methodology
The proposed methodology requires the availability of a driving data set that can be assumed representative of the scenario and the traffic environments targeted. First, the scenarios and their corresponding parameters are extracted from the driving data and split into two subsets that represent critical and normal driving scenarios. The scenario parameters are analysed by fitting distributions and by modelling their dependencies through regression analysis. The results obtained from the analysis are then processed by heuristics to iteratively determine, by evaluating the severity and the exposure of each test case candidate, which ones are included in the final test suites.

Evaluation of Severity and Exposure
The safety relevance of a scenario was considered according to its potential consequences in the case of failure (severity) and its likelihood to occur in real traffic (exposure), as introduced in [33]. Criticality is, therefore, associated with both severity and exposure, implying that it can result either from severe consequences in the case of failure or unfavorable handling at some point of the system's lifetime, or from a scenario frequently occurring in real traffic. Consequently, safety-relevant scenarios are severe, frequent, or a combination of both. The test cases towards severity aim to meet the former, while the test cases towards exposure aim to meet the latter. By explicitly targeting frequently occurring scenarios, the necessary breadth of testing is reached, and it is ensured that the system is capable of dealing with all occurring scenarios.
Severity was evaluated by means of the risk potential field (RPF), in accordance with [34,35]. The RPF is a Bayesian network-based trajectory planner that calculates the overall risk potential of the vehicle's possible paths. For the required application to the database and the corresponding evaluation of test cases, the RPF evaluation was limited to the path associated with the analysed test case, the driving state of the vehicle under test (VuT) (as defined in [26]), and its interaction with the challenging vehicle (Chall) (represented by the parameters characterizing the scenario, as in [36]).
Exposure was addressed by targeting the full coverage of the patterns found in the data set (the parameters' individual distribution functions as well as the parameter dependencies from the regression models). Therefore, an evaluation was conducted to compare the previously assigned test cases with the newness of the parameter combination of a test case candidate. In order to avoid redundancy between different test cases and to optimize the coverage of the data, an objective measure of newness was incorporated.
The newness evaluation of a test case candidate was conducted in several steps, by comparing each test case candidate with all previously assigned test cases (e.g., to define the fourth test case, a test case candidate was compared with the three test cases previously assigned to the test suite). For each parameter (Param), the value difference between the test case candidate (cand) and the compared test case (comp) was divided by the parameter's distribution range over the total data set:

newness_Param = |x_Param,cand − x_Param,comp| / (max_Param − min_Param)    (1)

Therefore, Equation (1) provides values close to zero for similar candidate and compared parameter values, and values close to one for very different ones.
Based on the newness value for each individual parameter, the average newness value of the parameters that defined a test case was calculated. The minimal average newness of all parameters that defined a test case candidate was regarded as relevant, in comparison to all previously assigned test cases (e.g., the relevance of a candidate to become the fourth test case in the final test suite is judged based on the minimum newness value out from the comparison with each of the three previously assigned test cases).
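The newness evaluation described above can be sketched as follows. This is a minimal illustrative implementation of Equation (1) and the minimum-of-averages aggregation; the parameter names and ranges are invented for the example and are not taken from the paper's data set.

```python
# Hedged sketch of the newness evaluation (Equation (1)) and its aggregation.
# Parameter names and ranges below are illustrative, not from the study data.

def parameter_newness(cand_value, comp_value, param_range):
    """Per-parameter newness: value difference normalized by the
    parameter's distribution range in the total data set."""
    return abs(cand_value - comp_value) / param_range

def candidate_newness(candidate, assigned_cases, param_ranges):
    """Average the per-parameter newness against each previously assigned
    test case and keep the minimum, i.e. the distance to the closest one."""
    min_avg = float("inf")
    for case in assigned_cases:
        avg = sum(
            parameter_newness(candidate[p], case[p], param_ranges[p])
            for p in param_ranges
        ) / len(param_ranges)
        min_avg = min(min_avg, avg)
    return min_avg

ranges = {"vx": 100.0, "dx": 200.0}       # total-data-set ranges (illustrative)
assigned = [{"vx": 100.0, "dx": 50.0}]    # one previously assigned test case
candidate = {"vx": 150.0, "dx": 150.0}
print(candidate_newness(candidate, assigned, ranges))  # 0.5
```

A candidate identical to an assigned test case yields a newness of zero and would be rejected as redundant.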

Data Split, Distribution Fitting, and Regression Analysis
The scenario data set was split into two subsets, representative of critical and normal driving scenarios, respectively. Thereafter, for each subset, distributions were fitted to the data and analysed through regression models. These distributions and models served as input for the application of the heuristics methodology. The resulting regression models allowed us to consider parameter dependencies in the test case generation and to reflect the observed data patterns of the underlying naturalistic driving data. In that way, realism in the test case generation is accounted for in two manners. First, as naturalistic driving data build the basis for the test case generation procedure, it is ensured that no synthetic parameter values are applied; moreover, by abstracting the original parameter values with distribution functions, no outlier parameter values are used in the test case generation. Second, current state-of-the-art approaches [24,27] set parameter values in the test case generation independently, neglecting the fact that dependencies exist. The importance of such dependencies has already been shown in [26] and is now considered in the proposed methodology.
With the severity evaluation of each scenario, by means of RPF, a relative split of the data set was conducted. For the current study, a 10% threshold was applied to split the relatively more severe data from the normal driving data. The threshold was determined iteratively during the methodology development, as a trade-off between the large number of scenarios in the transition from critical to normal driving scenarios, as well as the limited data sample size that can affect the statistical reliability of the results. Future work should address the threshold's value and kind (relative or absolute split), by analyzing multiple data sets and scenario types.
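The relative 10% split can be sketched as a simple quantile cut on the per-scenario severity values. The risk potential values below are random stand-ins; only the splitting logic reflects the text.

```python
# Illustrative sketch of the relative 10% severity split described above.
# The RPF values are random stand-ins, one per detected scenario.
import numpy as np

rng = np.random.default_rng(0)
rpf = rng.random(1000)                    # stand-in risk potential values

threshold = np.quantile(rpf, 0.90)        # relative split: top 10% most severe
critical = rpf[rpf >= threshold]          # relatively more severe subset
normal = rpf[rpf < threshold]             # normal driving subset
print(len(critical), len(normal))         # 100 900
```

An absolute threshold would replace the quantile with a fixed RPF value, which is exactly the open question the text defers to future work.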
From the distribution fitting, the resulting range, as well as the frequency of occurrence, was derived for each parameter. Univariate generalized extreme value distributions were applied to capture both the skewness and the sharpness at the parameter range boundaries. This resulted in values for the expected value (µ), the standard deviation (σ), and the skewness (k) of each parameter. The minimum and maximum of each individual parameter range during the test case generation were determined as the 0.1 and 99.9 percentiles of the corresponding distribution, as in [37].
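A sketch of this fitting step, assuming SciPy's generalized extreme value implementation (`scipy.stats.genextreme`, whose shape parameter `c` follows SciPy's sign convention rather than necessarily the paper's k); the sample data are synthetic.

```python
# Sketch: fitting a generalized extreme value (GEV) distribution to one
# parameter and bounding its range at the 0.1 and 99.9 percentiles.
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(1)
samples = genextreme.rvs(c=-0.1, loc=100.0, scale=10.0,
                         size=2000, random_state=rng)  # synthetic parameter data

k, mu, sigma = genextreme.fit(samples)    # shape (skewness), location, scale
lo, hi = genextreme.ppf([0.001, 0.999], k, loc=mu, scale=sigma)
print(lo < mu < hi)                       # percentile bounds bracket the location
```

The `[lo, hi]` interval then serves as the admissible parameter range during test case generation, which is how the 0.1/99.9 percentile rule excludes outliers.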
For the regression analysis, four types of independent parameter relations (constant, linear, pairwise-linear, and quadratic) were considered. The selection of these relations was based on physical plausibility. For example, a second-order dependency between distances and accelerations is physically plausible, therefore justifying the incorporation of quadratic relationships in the analysis. By this, a particularity of AD systems is addressed, as there are physical dependencies between parameters, which is not generally the case in the context of software testing. Only the terms that improved the R²-value of the model by at least 0.1 were incorporated in the final regression models. The implementation was conducted as a stepwise regression, assessing all potential dependencies while removing those that did not meet the pre-defined criteria.
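The term-selection rule can be illustrated with a minimal forward stepwise regression: a candidate term enters the model only if it improves R² by at least 0.1. This is a pedagogical sketch on synthetic data, not the authors' exact stepwise implementation.

```python
# Minimal forward stepwise regression sketch: a candidate term is kept only
# if it improves R^2 by at least 0.1, mirroring the selection rule above.
import numpy as np

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
# True model: strong linear and quadratic x1 terms, negligible x2 term.
y = 2.0 * x1 + x1**2 + 0.01 * x2 + rng.normal(scale=0.1, size=200)

ones = np.ones_like(y)
terms = {"x1": x1, "x1^2": x1**2, "x2": x2}   # candidates (constant always kept)
selected, r2, kept = [ones], 0.0, []
for name, col in terms.items():
    candidate = np.column_stack(selected + [col])
    r2_new = r_squared(candidate, y)
    if r2_new - r2 >= 0.1:                    # keep only terms with dR^2 >= 0.1
        selected.append(col)
        r2 = r2_new
        kept.append(name)
print(kept)
```

Here the weak `x2` term fails the 0.1 improvement threshold and is dropped, while the physically meaningful linear and quadratic terms survive.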

Heuristic Generation of Test Cases
This section presents the actual application of a heuristics search-based methodology to generate the test cases, based on the fitted distributions and the regression models previously developed. Since the objective function differs between severity and exposure test cases, an explanation focused on severity is provided first, followed by clarifications concerning exposure.
The proposed heuristics methodology is adapted from the software testing related work described in [38] and summarized in Figure 1. The severity test cases were derived from the fitted distributions and the regression models obtained from the critical subset of scenarios. The generation of a single test case comprised three main steps ("Test case initialization", "Test case candidate generation", and "Selection of best candidate"), from which the latter two were part of an iteration loop. This procedure accounts for the coverage of the search space for the methodology, as well as for the representativeness, by considering the frequency of occurrence of the parameters.
The "Test case candidate generation" step, applied the initialized random test case In the "Test case initialization" step, an initial solution was obtained by randomization, based on each parameter's fitted distribution. To ensure a wide spread of cases, the initial values for each parameter from the randomization were set to differ at least 20% from the corresponding value of the previous assigned test case.
The "Test case candidate generation" step, applied the initialized random test case from the previous step to generate a pre-determined number of test case candidates. For each candidate, each parameter value from the initial random solution was varied, within a range of 20% around the first obtained randomized solution, as well as based on the fitted distribution. Hereafter, each of these values were assigned the influencing variables, following the regression models that had a minimum R 2 -value of 0.7, and the current values were overwritten, according to the following rules:

1. Set parameters only influenced by independent parameters;
2. Set parameters influenced by either parameters set according to rule 1 or by independent parameters;
3. Set all remaining dependent parameters according to their regression models, from lower to higher R²-value.
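The three rules define an assignment order over the dependent parameters, which can be sketched as a small dependency resolution. The parameter names, influence sets, and R²-values below are illustrative, chosen to mirror the mutual v_x dependency discussed later in the paper.

```python
# Sketch of the three assignment rules: parameters whose regression models
# depend only on already-set parameters are assigned first (rules 1-2), and
# mutually dependent ones follow in ascending R^2 order (rule 3).
# Names, influence sets, and R^2 values are illustrative.

independent = {"ax_vut", "vy_chall", "ax_chall"}
# regression models: dependent parameter -> (influencing parameters, R^2)
models = {
    "dx": ({"ax_chall"}, 0.75),        # rule 1: influenced by independents only
    "vx_vut": ({"vx_chall"}, 0.85),    # rule 3: mutual dependency
    "vx_chall": ({"vx_vut"}, 0.78),    # rule 3: applied first (lower R^2)
}

order, assigned = [], set(independent)
changed = True
while changed:                          # rules 1-2: resolvable dependencies
    changed = False
    for p, (infl, _) in models.items():
        if p not in assigned and infl <= assigned:
            order.append(p)
            assigned.add(p)
            changed = True
# rule 3: remaining (mutually dependent) parameters, lower R^2 first
order += sorted((p for p in models if p not in assigned),
                key=lambda p: models[p][1])
print(order)  # ['dx', 'vx_chall', 'vx_vut']
```

With these illustrative values, the mutually dependent velocities are resolved last, with the lower-R² model applied first, matching the ordering described for the severe test cases in Section 3.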
In the current study, to provide a high coverage when analyzing the parameter combinations as potential test cases, an arbitrary number of 1000 test case candidates was defined.
As previously mentioned, the "Test case candidate generation" and "Selection of best candidate" steps were part of an iteration loop. For the assignment of a test case, the iteration loop was repeated until no better candidate test case could be identified or until an arbitrary number of 100 iteration steps was reached. In either case, the current solution was assigned to the final test suite as the best test candidate. To select the best test case candidate, both safety dimensions were assessed, either being optimized or meeting a pre-defined level. Specifically, for the severe test cases, the RPF was optimized, while a certain newness value was ensured. Therefore, severe test cases that covered the relevant parameter space were derived by this verification-in-the-loop procedure. Subsequently, the process was restarted from the test case initialization step for the next test case.
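The overall iteration loop can be condensed into the following sketch, with a toy severity objective standing in for the RPF and a toy newness constraint; the parameters are normalized to [0, 1] for the example. The constants follow the text (1000 candidates per step, at most 100 iterations, termination when no better candidate is found).

```python
# Condensed sketch of the "candidate generation / best selection" loop, with
# a toy severity objective in place of the RPF evaluation. The single scalar
# "test case" and the peak at 0.8 are illustrative assumptions.
import random

random.seed(3)

def severity(case):                    # toy stand-in for the RPF evaluation
    return -(case - 0.8) ** 2          # most severe scenario at 0.8

def newness_ok(case, suite, min_newness=0.05):
    return all(abs(case - t) >= min_newness for t in suite)

def generate_test_case(suite, n_candidates=1000, max_iter=100):
    current = random.random()                      # test case initialization
    for _ in range(max_iter):
        candidates = [
            min(1.0, max(0.0, current + random.uniform(-0.2, 0.2)))
            for _ in range(n_candidates)           # vary within a 20% band
        ]
        valid = [c for c in candidates if newness_ok(c, suite)]
        best = max(valid, key=severity, default=current)
        if severity(best) <= severity(current):    # no better candidate: stop
            break
        current = best
    return current

suite = []
for _ in range(3):                                 # assign three test cases
    suite.append(generate_test_case(suite))
print(suite)
```

For the exposure test cases, the roles would be swapped: the newness value becomes the optimized objective and a maximum RPF value the constraint.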
The process to assign exposure test cases to the final test suite is analogous to the process to assign the severity test cases, with the following differences. First, the normal driving subset of scenarios was used to fit the distributions and develop the regression models, instead of the critical subset. Second, in the "Selection of best candidate" step, a maximum RPF value was pre-defined to ensure a focus on normal driving scenarios. Furthermore, the newness value was optimized, to reach the goal of covering the parameter space of normal driving to the best achievable level.
Regarding the test suite size, a trade-off between an accurate coverage of the data pattern (by a large amount of test cases) and an efficient safety evaluation process (by limiting the test suite to the most relevant test cases) needs to be defined. In the current study, a test suite of 20 severity test cases and 20 exposure test cases was pre-defined.

Data Set
In this section, the heuristics methodology proposed was prototypically applied to a set of cut-in scenarios, extracted from a previously collected, yet unpublished, traffic data set.

Cut-In Scenario Definition, Detection, and Grouping
The data set incorporated in this study was collected on German highways with four different instrumented vehicles. The vehicles were equipped with a mid-range front radar, a mono camera, and measurement hardware to continuously record the naturalistic driving behaviour of both the measurement vehicle and the surrounding objects. As the measurement devices are not recognizable from outside of the vehicles, the collected surrounding object and vehicle data can be regarded as non-biased. The data comprised a total of 123,225 driven kilometres and 1159 h, with neither an automation function activated nor instructions for the drivers.
Following data post-processing to generate clear lane-related object tracks, cut-in scenarios from the right were detected, following the definitions above and rule-based detection algorithms, extended by a dual approach for lane change detection, as elaborated in detail in [39]. For the current study, the beginning of a cut-in scenario was set when the challenging vehicle's lateral distance to its left lane marking became lower than 1.5 m, with a positive lateral velocity. The end of the cut-in was set when the lateral distance from the challenging vehicle to its right lane marking (after crossing it) exceeded 1 m. The application of the detection algorithms and further data set cleaning steps resulted in a total of 2294 clear cut-in scenarios that formed the basis of the current study. For each of the detected scenarios, the required VuT and challenging vehicle parameters, evolving over the scenario duration, were saved as illustrated in Figure 2. The detected time series of the evolving scenario parameters were abstracted by taking the characteristic values of the relevant descriptive parameters, as summarized in Table 1. Subsequently, the risk potential value for each scenario was calculated. Based on these results, the detected scenarios were grouped into a relatively more severe subset and an exposure subset, which comprised 229 and 2065 cut-in scenarios, respectively.
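The start/end conditions above can be sketched as a simple rule-based check over a lateral track. This is a simplified single-marking sketch: the sign convention (negative distance = challenger still right of the marking) and the track format are assumptions, and the real detection in [39] operates on both lane markings with a dual approach.

```python
# Rule-based sketch of the cut-in start/end conditions: start when the
# challenger comes within 1.5 m of the lane marking with positive lateral
# velocity; end when it is more than 1 m past the crossed marking.
# Sign convention (negative = right of the marking) is an assumption.

def detect_cut_in(track):
    """track: list of (lateral_distance_to_marking_m, lateral_velocity_mps)."""
    start = end = None
    for i, (dist, vy) in enumerate(track):
        if start is None and abs(dist) < 1.5 and vy > 0:
            start = i                  # within 1.5 m, moving towards the lane
        elif start is not None and dist > 1.0:
            end = i                    # more than 1 m past the crossed marking
            break
    return start, end

track = [(-2.0, 0.5), (-1.2, 0.6), (-0.4, 0.6), (0.3, 0.5), (1.2, 0.4)]
print(detect_cut_in(track))  # (1, 4)
```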

Table 1. Relevant descriptive scenario parameters.

| Vehicle under Test (VuT) | Challenging Vehicle (Chall) | Interaction Parameters |
|---|---|---|
| vx (initial value) | vx (initial value) | Longitudinal distance dx (initial value) |
| ax (initial value) | vy (max. value) | |
| | ax (mean value) | |

Data Distribution and Regression Analysis
The resulting fitted distributions for the severe and the exposure data subsets are provided in Tables 2 and 3, respectively. A comparison of the fitted distributions shows that severe cut-in scenarios tend to have a lower longitudinal velocity of both vehicles (v_x,VuT = 98.2 km/h and v_x,chall = 95.9 km/h), a lower longitudinal distance (d_x = 14.4 m), and more frequent braking by the challenging vehicle than the exposure cut-in scenarios.

Table 4 shows the resulting regression models that were identified as relevant (R² higher than 0.7). For each regression model, its R²-value and the influencing parameters, together with their standardized coefficients, are provided. Negative values indicate that the corresponding dependent parameter decreases with an increase of the associated influencing parameter. For each regression model, a significant constant term was identified. Moreover, the normal driving data set shows a strong cross-correlation between the two longitudinal velocities.

Table 5 summarizes the severity and the exposure test suites, obtained as a result of applying the heuristics (Section 2.3 and Figure 1) to the fitted distributions (Tables 2 and 3) and regression models (Table 4). Each generated test suite comprises 20 test cases, as pre-determined.

Concerning the application of the regression models within the test case candidate generation, the following order was applied. For both test suites, the two parameters v_x,VuT and v_x,chall were mutually influencing; therefore, rule three (Section 2.3 above) applies. For the severe test cases, the regression model for v_x,chall is applied first, due to its lower R²-value. The same rule, applied to the exposure test cases, leads to the opposite order: v_x,VuT, followed by v_x,chall.

Evaluation of Criticality of Severity and Exposure Test Suites
The evaluation of criticality for both test suites was conducted based on two different metrics: the RPF (Figure 3a) and a newly developed independent criticality metric, denoted as the required deceleration a_x (Figure 3b). For both metrics, the generated severity and exposure test suites are compared against the original scenario database. Accordingly, the ability of the heuristics to create critical test cases out of the original database can be evaluated. Figure 3a shows the comparison of the RPF values from the original database with the RPF values from the severity and the exposure test cases, respectively. The comparison reveals that the RPF values of the generated severity test cases are at the edge of the RPF values from the original data set. Accordingly, more critical test cases could be synthetically generated. Contrary to that, the values of the exposure test cases show, as expected, a comparable criticality to the original scenario data set.
To validate the severity of both test suites, a newly defined, independent criticality measure (the required deceleration a_x) was incorporated. This metric was selected because established metrics, such as time to collision or time headway, do not account for criticality in both the longitudinal and the lateral direction. The required deceleration metric is defined as the mean deceleration the VuT requires to overcome the negative relative velocity with respect to the challenging cut-in vehicle and to achieve a longitudinal distance > 10 m before the end of the cut-in scenario.
Hence, a simulation framework is established, where the generated cut-in test cases are re-modelled by applying their generated characteristic parameter values.
To calculate the required deceleration of the VuT for a scenario, the scenario is remodelled in simulation, according to its specified parameter characteristics. The following assumptions apply: the cut-in object is assumed to be detected by the VuT as soon as the object's left side crosses the left lane marking.
The movement of the VuT is characterised by the initial velocity from the test case, influenced by the initial acceleration value. The cut-in object is initially positioned and parametrized according to the initial longitudinal and lateral distance, as well as the initial velocity in the test case. The object's acceleration is modelled as a mean value over the lane change maneuver. The lateral movement for the lane change is abstracted as a second-degree polynomial function of the lateral velocity, with a maximum according to the maximum lateral velocity in the test case. As the test cases were assessed in a simplified simulation environment, not the absolute but the relative values were relevant for evaluation purposes.

Figure 3b shows the comparison of the VuT's required deceleration values between the original scenario database and the severity and exposure test cases, respectively. Even with an independently designed criticality measure, the severity test suite incorporated significantly higher risk, interpreted as a higher required deceleration for the VuT, than the original database. Only a few test cases show a comparable criticality to the original data. For the exposure test suite, a risk evaluation similar to the original scenario data can be observed.

Evaluation of Coverage of Severity and Exposure Test Suites
To evaluate how well the generated severity and exposure test suites cover the parameter values from the data set, the coverage ratio between each parameter's value range in each test suite and the corresponding parameter value range from the fitted distribution of the underlying data set was calculated. To this end, the parameters' value ranges in the test cases were compared against the 0.1 to 99.9 percentile range of the corresponding fitted distribution. Table 6 summarizes the coverage ratios for each parameter. The values for the parameters set by regression models are denoted in brackets, as the use of a regression model highly restricts the achievable coverage of a parameter's range. For the exposure test cases, the independent parameters a_x,VuT, v_y,chall, and a_x,chall were perfectly covered (100%) by the generated test suite. For the two dependent parameters, v_x,VuT and v_x,chall, the coverage was lower, at 46.0% and 37.5%, respectively. The coverage of d_x was slightly lower than for the other independent parameters, due to its appearance in the regression models; in this specific case, it led to a restriction of the influencing parameter's value range. For the severity test cases, comparatively low coverage ratios were obtained, ranging from 30.5% for v_x,chall to 98.2% for a_x,chall.
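The coverage ratio computation reduces to a one-line span comparison, sketched below with illustrative numbers (the test-suite values and percentile bounds are invented for the example).

```python
# Sketch of the coverage-ratio computation: the parameter's value span in
# the generated test suite divided by the 0.1-99.9 percentile span of the
# fitted distribution of the underlying data set. Numbers are illustrative.

def coverage_ratio(test_values, p_001, p_999):
    """Ratio of the span covered by the test suite to the distribution span."""
    return (max(test_values) - min(test_values)) / (p_999 - p_001)

# Suite covers 60-150 of a 50-250 distribution span: ratio 90/200 = 0.45.
ratio = coverage_ratio([60.0, 95.0, 150.0], p_001=50.0, p_999=250.0)
print(round(ratio, 3))  # 0.45
```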

Comparison to Monte Carlo-Based Test Case Generation
In this subsection, a Monte Carlo-based test case generation was applied to the cut-in data set, to enable an additional evaluation of the methodology proposed in the current paper, by comparing it with a methodology typically applied to generate test cases in the field. As indicated in the related work, no method is available that accounts for both multi-dimensional parameter dependencies and objective functions. A Monte Carlo simulation that considers multi-dimensional dependencies proves to be a suitable comparison method in terms of replicability and comparability.
To ensure comparability, the data subsets for severity and exposure (see Section 2.2) were reused, resulting in one test suite for severity and one test suite for exposure. First, distributions were fitted to the parameters independently for each subset, as follows: the independent parameters were approximated by a normal distribution; these were a_x,VuT, v_y,chall, and a_x,chall for the severe data subset, and a_x,VuT, v_y,chall, and d_x for the normal driving subset. The remaining dependent parameters (v_x,VuT and v_x,chall in both cases, plus d_x for the severe subset) were fitted with a multivariate Gaussian distribution. Thereby, the parameter dependencies were incorporated, which accounts for the comparability of the methodologies.
Second, sampling was applied by drawing randomized values, or jointly drawn value tuples for the dependent parameters, from the parameter distributions of each scenario subset and assigning them to the test cases.
The number of test cases for each test suite was limited to the size of the heuristics test suites. The generated test suites for severity and exposure of the Monte Carlo simulation are provided in the Appendix A Table A1.
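The two-stage sampling scheme described above can be sketched as follows. This is a hedged illustration under assumed, hypothetical fit values (the means, standard deviations, and covariance matrix are invented for the example, not taken from the paper): independent parameters are drawn from univariate normals, while the dependent parameters are drawn jointly from one multivariate Gaussian so that their dependencies are preserved.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical fits for one subset: independent parameters as univariate
# normals (mean, std); dependent parameters as one multivariate Gaussian.
independent = {
    "a_x_VuT":  (0.0, 0.4),
    "v_y_chall": (0.8, 0.3),
}
dep_mean = np.array([25.0, 27.0, 30.0])        # v_x_VuT, v_x_chall, d_x
dep_cov = np.array([[4.0, 3.0, 2.0],
                    [3.0, 4.0, 2.5],
                    [2.0, 2.5, 9.0]])           # encodes the dependencies

def sample_test_case():
    # Independent parameters: one univariate draw each
    case = {p: rng.normal(mu, sd) for p, (mu, sd) in independent.items()}
    # Dependent parameters: a single joint multivariate draw
    v_vut, v_chall, d_x = rng.multivariate_normal(dep_mean, dep_cov)
    case.update({"v_x_VuT": v_vut, "v_x_chall": v_chall, "d_x": d_x})
    return case

# Suite size limited to that of the heuristics test suite (assumed 50 here)
suite = [sample_test_case() for _ in range(50)]
```

The joint draw is what distinguishes this from naive per-parameter sampling: correlated parameters such as the two longitudinal velocities stay physically plausible together.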
In the following, the same evaluation methods as in Sections A and B were applied, comparing the heuristics test suites with the Monte Carlo test suites. For criticality, the test suites were assessed by their RPF values and by the introduced independent criticality metric, the required deceleration a x , in Figure 4. For both criticality metrics, the heuristics severity test suite showed significantly higher incorporated criticality than the generated Monte Carlo severity test suite. For exposure, a comparable distribution of the RPF values and required a x can be observed for the heuristics and the Monte Carlo test suites. While the heuristics aim to find distinct test cases (via the newness evaluation), the Monte Carlo methodology is driven by the frequency of occurrence. Therefore, corner cases tend to be neglected by native Monte Carlo sampling.
Regarding coverage, Table 7 presents the coverage ratios for the Monte Carlo test suites of severity and exposure. The comparison with the coverage results of the heuristics test suites reveals a slightly higher parameter coverage for the severe Monte Carlo test suite than for the severe heuristics test suite. The opposite tendency becomes visible for the exposure test cases, with a higher coverage of the exposure heuristics test suite. [ ]: Parameters were set by their regression models and not independently.

Efficiency of the Test Case Generation Process
The heuristics approach, applied to the cut-in dataset, proved valid and efficient in identifying both severe and normal driving test cases (Table 5).
The required computational power can be considered low, as it took less than 30 min of run time on an Intel Dual-Core i5-7200u with 8 GB RAM to generate the test cases, starting from the original data set. Nevertheless, the algorithm was more efficient in identifying severe cases than normal driving cases. Specifically, the majority of severity test cases were determined within 20 iterations, whereas many of the exposure test cases reached the preset maximum of 100 iterations. This difference can be explained by the differences between the parameter spaces and the objective functions applied to define the two test suites. The severe data subset was limited to narrow parameter ranges that tend to concentrate near the edges of the parameter ranges. In addition, the objective function applied to the severe data subset, based on the RPF, imposed a restrictive combination of parameter values (e.g., short distances and high lateral velocity). In contrast, the normal driving data subset tended to cover wider parameter ranges, and the restrictions imposed by the newness value were looser than those of the RPF. When the proposed methodology is further developed and applied to larger data sets and more complex scenarios, the balance between the size of the data set, the number of test case candidates generated, and the preset maximum number of iterations needs further evaluation.

Results from the Criticality Assessment
The criticality evaluation of the heuristics test suites based on RPF (Figure 3a) indicated the correct application of the test case generation process on the prototypical dataset, as the severity test cases were at the edge of the dataset's RPF distribution. The 10% threshold was set as a trade-off between the relatively small overall data base and the sample size requirements, in order to enable statistically significant results. However, the incorporation of an independent criticality metric highlights both the potential of the proposed methodology and the importance of criticality metrics. Even with the independent criticality measure, the severity test cases showed significantly higher criticality values. However, on the one hand, there were test cases that incorporated a comparably low criticality, similar to the original dataset. On the other hand, a few test cases showed extremely high incorporated risk, with required deceleration values > 10 m/s². Moreover, the severity test cases, indicated in Table 5, show that comparably similar test cases were created with high lateral velocity v y,chall and low cut-in distances d x but only small relative velocities. Previous studies, e.g., [26], showed that high criticality is also connected to high relative velocities in combination with greater cut-in distances, which is not properly captured by the RPF criticality metric. Furthermore, the severity test cases show a certain convergence towards low longitudinal distances and similar longitudinal velocities. This highlights the relevance of well-suited criticality metrics to the test case generation process, which should represent the risk sustained by the VuT as realistically as possible. This is equally valid for the division of the scenario data into the severe and normal driving subsets, based on the 10% highest RPF values.
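The required-deceleration metric referred to above can be illustrated with simple constant-deceleration kinematics. This is a generic sketch, not the paper's exact formulation: it assumes the challenger neither brakes nor accelerates after cutting in at distance d x, and asks what constant deceleration the VuT needs to match the challenger's speed before closing that gap.

```python
def required_deceleration(v_vut, v_chall, d_x):
    """Deceleration (m/s^2) the VuT needs to match the challenger's speed
    before closing the gap d_x, assuming constant deceleration and a
    challenger that keeps its speed. Returns 0 if the VuT is not closing in."""
    dv = v_vut - v_chall          # closing speed
    if dv <= 0 or d_x <= 0:
        return 0.0
    # Relative motion comes to rest after travelling dv^2 / (2 a);
    # require that this happens within the available gap d_x.
    return dv ** 2 / (2.0 * d_x)

# A challenger cutting in 10 m ahead, 5 m/s slower than the VuT:
print(required_deceleration(30.0, 25.0, 10.0))  # 1.25 m/s^2
```

Under this simple model, values above roughly 10 m/s² exceed what typical tyre-road friction allows, which is why such test cases are flagged as extremely high risk.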
As soon as an internationally accepted criticality measure is available, the threshold should be set to a fixed value, in order to be independent of the dataset. Nonetheless, setting a definite threshold can lead to further challenges regarding measurement errors of parameters or the handling of scenarios from naturalistic driving data that fall below the criticality threshold but are still nearly severe. Accordingly, the evaluation of a possible enlarged parameter band, instead of one concrete threshold value for the data split, remains for future work.

Results from the Coverage Assessment
The coverage evaluation confirms a good coverage of the exposure data and an expected limited coverage of the critical test cases. Furthermore, as all coverage values were lower than 100%, the distribution functions used were suitable. The comparably low coverage ratio values of the exposure test suite for v x,VuT and v x,chall (Table 6) can be explained by the relatively low R²-values (0.79 and 0.86, respectively) of the associated regression models (Table 4). Although the regression models were valid according to the set rules, the value range of the parameter was not entirely explained by the regression model, resulting in a reduced applied parameter range. This leads to a trade-off between the consideration of even weaker dependencies and the maximisation of the parameter coverage. Another trade-off regarding coverage becomes visible for the set minimum newness threshold for severe test cases: a high threshold value leads to a higher coverage but a lower criticality performance of the resulting test cases. Both trade-offs are subject to future work.
Independent of the concrete results, the proposed exposure evaluation by means of relative newness provides the following advantages: first, each parameter is treated equally in the newness evaluation. Second, normalizing by the actual value range of each parameter ensures comparability across different data sets. Third, the holistic comparison of a test case candidate with all previously assigned test cases avoids the assignment of similar test cases, which increases the variety of the assigned test cases.
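The three properties listed above can be combined into one small function. This is a hedged sketch of the relative-newness idea, not the authors' exact formula: each parameter difference is normalized by that parameter's value range (equal treatment, cross-dataset comparability), and the candidate is compared against all previously assigned test cases, with the smallest distance taken as its newness.

```python
import numpy as np

def relative_newness(candidate, assigned, ranges):
    """Newness of a candidate test case relative to already assigned ones.
    Differences are normalized per parameter by its value range; the
    candidate is compared holistically against *all* prior test cases,
    and the minimum normalized distance is returned."""
    if not assigned:
        return float("inf")          # the first test case is maximally new
    cand = np.asarray(candidate, dtype=float)
    span = np.asarray(ranges, dtype=float)
    dists = [np.linalg.norm((cand - np.asarray(tc)) / span) for tc in assigned]
    return min(dists)

# Hypothetical three-parameter example (ranges taken from the data)
ranges = [10.0, 2.0, 40.0]
assigned = [[25.0, 0.5, 30.0], [20.0, 1.0, 15.0]]
print(relative_newness([24.0, 0.6, 28.0], assigned, ranges))
```

A candidate would then be accepted only if its newness exceeds a preset minimum threshold, which is exactly the threshold whose coverage trade-off is discussed above.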

Results from the Comparison with Monte Carlo Sampling
The comparison with Monte Carlo sampling made the advantages of the proposed methodology explicit. Although the Monte Carlo simulation was executed with the two data subsets for severity and normal driving, the method was not able to create test cases beyond the criticality of the original dataset. More critical test cases might be generated by extending the size of the test suite, which goes along with an increase in computational time, or by including the mentioned importance-based sampling techniques, as in [27]. However, to the best of our knowledge, no work is available that combines importance-based sampling with multi-dimensional dependencies, as in our proposed methodology. Regarding the results for coverage, the coverage ratios obtained for the Monte Carlo test suites underline the importance of suitable distribution functions. For scenario-related parameter ranges, the proposed univariate generalized extreme value distribution, which captures both skewness and sharpness, can therefore be recommended.
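Fitting such a generalized extreme value distribution to a scenario parameter can be sketched with SciPy. This is an illustration on synthetic stand-in data (the Gumbel sample below is invented, not from the paper's dataset); the GEV's shape parameter is what lets it capture both skewness and tail sharpness.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in sample for a scenario parameter (e.g., a distance in m);
# Gumbel data is a GEV special case with shape parameter 0.
data = rng.gumbel(loc=20.0, scale=5.0, size=1000)

# Maximum-likelihood fit of the univariate generalized extreme value
# distribution: returns shape, location, and scale
shape, loc, scale = stats.genextreme.fit(data)
gev = stats.genextreme(shape, loc=loc, scale=scale)

# The 0.1 and 99.9 percentiles of the fitted distribution, i.e., the
# reference range used in the coverage evaluation above
print(gev.ppf([0.001, 0.999]))
```

Because the shape parameter is fitted rather than fixed, the same code covers symmetric, skewed, and heavy-tailed parameter distributions without switching distribution families.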

General Implications of the Methodology
In general, the methodology proposed in this study relies on the existence of a data set from which two different test suites are generated, providing an acceptable coverage of both severity and exposure. The most basic underlying assumption of the methodology is that a data set will be available that is representative of the scenarios that need to be covered to ensure the safety of the various automated driving systems. This assumption is feasible for relatively simple scenarios, such as the cut-in involving only two vehicles applied in the current study. However, as the field evolves towards more complex scenarios that incorporate road geometry, several traffic participants, and urban environments, the requirements for naturalistic traffic data will grow substantially. In addition, the presented methodology might need to be enhanced by the incorporation of discrete parameters. This highlights the need for international collaborative efforts to continue developing safety assurance methodologies and to share data for automated driving safety evaluation purposes.

Conclusions
This study proposes a novel methodology to generate scenario-specific test suites that account for realism, severity, and exposure requirements. Based on two scenario subsets (one covering severe conditions and one covering normal driving), the scenario parameters were abstracted by distribution functions and regression models. Thereafter, test cases were generated by optimizing the objective functions for severity and exposure while considering the identified data patterns. The severity and the exposure were optimized via the risk potential field and a newly proposed newness criterion, respectively. To the best of our knowledge, the proposed methodology is unique in considering both multi-dimensional parameter dependencies and optimization by an objective function.
The applicability of the proposed methodology was demonstrated by applying it to generate test cases for cut-in scenarios from a naturalistic driving data set collected on German highways. The generated test cases reflect the real traffic data patterns and successfully discriminate between safety-critical and normal driving scenarios. The applicability of the methodology to larger data sets and different scenarios remains future work.

Acknowledgments: The Ministry of Economy, Trade and Industry of Japan is acknowledged for supporting this research.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.