5.1. RQ1—Feature Selection with Optimization Results
Bearing in mind RQ1, we constructed two evaluation tracks. First, the proposed optimization approach (ELSGA) was evaluated over several generations and compared with the standard genetic algorithm (without the IC component). Second, the features selected by ELSGA were used to build models through a type of time-series k-fold cross-validation called walk-forward cross-validation (WFCV) [77] and compared with other machine learning, deep learning, and ensemble-based approaches.
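For readers unfamiliar with WFCV, a minimal sketch of the splitting logic is shown below; the expanding-window form and the equal fold sizes are illustrative assumptions, not the exact folds used in our experiments.

```python
import numpy as np

def walk_forward_splits(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs for walk-forward cross-validation.

    The data are assumed to be ordered in time; each fold trains on all
    samples observed so far and tests on the next contiguous block, so the
    model never sees future readouts during training.
    """
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = np.arange(0, k * fold_size)
        test_idx = np.arange(k * fold_size, (k + 1) * fold_size)
        yield train_idx, test_idx
```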
The optimization process was carried out as feature selection, identifying the most informative parameters for predicting failures of a vehicle component related to the turbocharger. We built a forecasting system in which a supervised ML algorithm examined the contribution of the parameters to the claims. Drawing on results from our previous study [18], in which vehicle usage changed over time, we segmented the data into different parts. This segmentation assumed that the distribution of the vehicle usage logged in the same season across the years was relatively similar compared with the other seasons. Thus, in each segment there might be a set of parameters that had more impact on the failures and, accordingly, were more pertinent to failure prediction. We designed the optimization process to be run in each segment in order to identify the most informative features. In the second tier of the evaluation, however, the union of the selected features (from all segments) was used to build the predictive models, which were compared with other algorithms.
Since there is no universal configuration of such a heuristic optimization algorithm that can be considered the best setting, the GA parameters, including population size, mutation rate, number of parents in the mating pool, and number of elements to mutate, were themselves selected through an optimization procedure. The implementation code was therefore written in a parameterized fashion to search for the best configuration of the GA settings.
Table 3 lists the parameters that our optimization system tuned to obtain the best performance.
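As a rough illustration of this parameterized setup, the sketch below enumerates candidate GA configurations and keeps the best one; the parameter names follow Table 3, but the candidate values and the `evaluate` callback are hypothetical placeholders rather than the grid actually searched.

```python
import itertools

# Illustrative grid over GA settings (values are examples, not the paper's).
ga_grid = {
    "population_size": [50, 100, 200],
    "mutation_rate": [0.01, 0.05, 0.1],
    "n_parents_mating": [10, 20],
    "n_genes_to_mutate": [1, 3, 5],
}

def best_ga_config(evaluate):
    """Return the configuration maximizing `evaluate(config)`, where
    `evaluate` runs one full GA with the given settings and reports the
    validation AUC of the selected feature subset."""
    configs = [dict(zip(ga_grid, values))
               for values in itertools.product(*ga_grid.values())]
    return max(configs, key=evaluate)
```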
Table 4 describes the output of the IC when initiating the first population on the whole and on the segmented data. The hyperparameters of the elastic model were selected through the WFCV as the optimal values for training it.
The whole data set, with the lowest feature-reduction rate, resulted in the highest number of predictors in the first population, at 410. Among the segments, 313 predictors were selected with coefficients larger than zero in one segment, which also represented the lowest reduction proportion within the segments; likewise, in another segment the IC initiated 80 predictors out of 577 as suitable parents for further generations. The figures indicate that the IC delivered a different number of predictors to the GA operators in each segment/season, meaning that different parameters had a different impact on failures in each segment. It was also observed that the mixing value was close to zero in all cases, which meant that the elastic penalty almost turned into the ridge method when initiating the first generation.
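A minimal sketch of the IC step described above is given below, assuming the elastic model is an elastic-net-penalized logistic regression from scikit-learn; the actual implementation and hyperparameter search in our system may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ic_initial_population(X, y, l1_ratio=0.01, C=1.0):
    """Initialization Component (IC) sketch: fit an elastic-net model and
    keep predictors with non-zero coefficients to seed the first GA
    population.

    An l1_ratio close to 0 corresponds to the near-ridge behaviour
    reported above; in the paper both hyperparameters are tuned via WFCV.
    """
    model = LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=l1_ratio, C=C, max_iter=5000)
    model.fit(X, y)
    # Indices of predictors with non-zero coefficients.
    return np.flatnonzero(np.abs(model.coef_.ravel()) > 0)
```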
To evaluate the performance of the selected features, we formulated the objective function as a classification task, in which the function seeks the optimal performance of breakdown forecasting given vehicle usage. Thus, for vehicle usage with n readout samples and the m selected predictors/features injected by the GA operators, the function predicted in each generation whether the usage led to a failure or not.
In this stage of the experiment, the data were partitioned into training and test data. Specifically, in each data segment, and throughout the optimization process, three months of vehicle operation (including one month from the previous season) were used to train the model, and the last month was used to test it. The statistics of the training and test data in each segment are as follows:
Segment 1 contained 86,768 readouts, of which 52,060 were used for training and 34,708 for testing the model.
Segment 2 contained 90,709 readouts, of which 54,421 were used for training and 36,288 for testing the model.
Segment 3 contained 61,364 readouts, of which 38,045 were used for training and 23,319 for testing the model.
Segment 4 contained 98,732 readouts, of which 75,036 were used for training and 23,696 for testing the model.
As mentioned in the Segmentation section, in each segment the readout time-series samples were placed in time order, so that past data were used to build the model and future data to test it. Since these data were cumulative, and the scale of the readouts differed from year to year, data normalization using Equation (19) was applied to bring the data onto the same scale.
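Equation (19) is given earlier in the paper and is not repeated here; assuming it is a standard min-max scaling fitted on the training period only, the normalization step can be sketched as follows.

```python
import numpy as np

def min_max_normalize(X_train, X_test):
    """Scale features to [0, 1] using statistics of the training data only,
    so that no information from the future (test) period leaks into
    training."""
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid division by zero
    return (X_train - x_min) / span, (X_test - x_min) / span
```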
The three criteria for recalling the IC or terminating the optimization process, so as to deliver the best predictors, were configured accordingly, with the maximum number of generations set to 100.
The plots in Figure 8 illustrate the superiority of the ELSGA approach over the GA without the IC component in all segments throughout the optimization process. We noticed a considerable jump in performance provided by ELSGA, which shows how well the IC adapted to the optimization process by generating good individuals in the first population. This early increase meant that the whole optimization process could be terminated before the generation-limit criterion was met (a maximum of 100 generations). However, the target performance criterion was not surpassed, owing to the complexity of the problem. ELSGA performed better, reporting around 80% AUC in all segments, about 3% higher than the GA over the process. It was also noticeable that the AUC obtained by ELSGA in Segment 4 was the highest among the segments, at 82% vs. 79% for the basic optimization. The statistical assessment of the data revealed that Segment 4 contained vehicles with more claims than the vehicles in the other segments of this highly imbalanced data. This may be the reason why our model performed better there, since more data and more positive samples were available to build the predictive model.
The overall results suggest that the segmentation formulation is compatible with, and positively contributes to, better claim-prediction performance under the optimization approaches, particularly ELSGA.
In the second track of the evaluation, the output of ELSGA was used to build the model with the XGBoost classifier [68]. In this experiment, the predictors selected by ELSGA were exploited to build the predictive models and compared with models trained by several classifiers on the same data. The results of multiple classifiers (including deep learning and ensemble approaches) were selected as the baseline for comparison with ELSGA. The same WFCV method was used to build and evaluate the models on various portions of the data in a time-series fashion; this decision was made because our data were time series, so the folds could not be selected at random as in ordinary k-fold cross-validation.
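A minimal sketch of this evaluation step is given below, assuming default XGBoost hyperparameters (the tuned settings used in the experiments are not reproduced here) and the walk-forward splits sketched earlier.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

def evaluate_selected_features(X, y, selected_idx, splits):
    """Train XGBoost on the ELSGA-selected predictors for each
    walk-forward fold and report the mean test AUC."""
    aucs = []
    for train_idx, test_idx in splits:
        model = XGBClassifier(eval_metric="logloss")
        model.fit(X[train_idx][:, selected_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx][:, selected_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], proba))
    return float(np.mean(aucs))
```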
Table 5 compares our proposed ELSGA method with multiple classifiers. The figures obtained from the different predictive models demonstrate how complex the task of predicting claims from vehicle usage is. As can be seen, most of the classifiers performed poorly, providing low AUC values. In practice, these results showed that most of the linear classifiers had no discriminatory capacity for this complex problem and could not map usage to breakdowns. In contrast, deep learning models (such as CNN, LSTM, and biLSTM) showed considerably better performance than the linear classifiers.
Among the examined predictive models, Boosting and Stacking performed closest to the proposed approach when the total data was considered. Concerning the segmentation, we can clearly observe that the proposed approach (Module2-ELSGA) significantly outperformed the other classifiers, with the exception of Bagging. It is worth noting that, among the deep learning approaches, CNN worked well on the last segment. To go beyond this performance assessment, we applied a paired t-test to evaluate how significant the differences between the classifiers' results were. In almost all cases the p-values were smaller than the critical value, rejecting the null hypothesis and indicating a significant difference between the outcomes. The statistical tests showed that only the performance of the ensemble-based approaches was relatively close to that of the proposed approach in most cases. It is also fair to remark that the statistical test on the performance of the CNN model in the last segment indicated that the difference was not significant. In one case, Bagging even performed slightly better than the proposed approach (Segment 3); however, the difference was not significant. In Segment 1, the figures indicate that Stacking performed very similarly to our proposed ELSGA approach.
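The significance testing can be reproduced with a standard paired t-test on the per-fold AUC values; the numbers below are purely illustrative, and the 0.05 significance level is an assumption rather than the value reported in the paper.

```python
from scipy.stats import ttest_rel

# Paired t-test on per-fold AUC values of two models (illustrative numbers).
auc_elsga    = [0.81, 0.80, 0.82, 0.79, 0.80]
auc_baseline = [0.77, 0.76, 0.79, 0.75, 0.78]

t_stat, p_value = ttest_rel(auc_elsga, auc_baseline)
# Reject the null hypothesis (no difference) when p falls below the level.
significant = p_value < 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {significant}")
```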
In addition, we conducted an A/B test comparing the performance of models trained on the original predictors (MORGP) with models built on the extracted predictors (MEXP). Both groups of predictors (original and extracted) were selected by the optimization approach. Our aim was to reveal the respective contributions of the extracted and the original features to the predictive models.
From Figure 9, we can observe that more than 50% of the selected predictors (except in the last segment) came from the extraction process, which points to the importance of such predictors in the prediction task (orange vs. purple bars). In fact, we regarded this proportional comparison mainly as a sanity check. To quantify the impact of the extracted features, several models were built and evaluated through WFCV. The figures suggest that MEXP outperformed MORGP by 3%, 1%, 7%, and 10% in segments one, two, three, and four, respectively (green vs. pink bars). The statistical tests also indicated that the differences in segments two, three, and four were significant, rejecting the null hypothesis at the chosen critical value. Although the test failed to reject the null hypothesis in the first segment, concluding that the difference between the two models was not significant there (green and pink bars), the extracted predictors (green) still showed their value by providing almost the same performance as the models trained on the mixed predictors (blue bar). Overall, the information extracted and used as extracted predictors had a greater, and more significant, impact on the predictive models' performance than the original predictors.
5.2. RQ2—Snapshot-Stacked Ensemble Results
To answer RQ2, we took the output of the optimization process as the input of the ensemble module. We aimed to build a general model that could predict claims over the whole year, given vehicle usage. Thus, the identical features from all segments were combined and fed into the first deep network to build and generate several diverse snapshot models. In addition, the season was added as one extra feature alongside the identical features to support generalization.
Given Equation (13), the number of cycles was set so as to generate 20 different snapshot models over the 400 epochs. Of the data, 60% was used to train the snapshots and 30% served as the validation set for obtaining the hard labels and soft predictions. Accordingly, the remaining 10% of the data was held out to test the meta-learning model at the final stage (see Figure 7).
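Equation (13) is defined in the methods section; assuming it follows the common cyclic cosine-annealing form used for snapshot ensembles (consistent with the CCAS training mentioned below), the learning-rate schedule can be sketched as follows, with the maximum learning rate as an illustrative value.

```python
import math

def snapshot_lr(epoch, total_epochs=400, n_cycles=20, lr_max=0.01):
    """Cyclic cosine-annealing learning rate of the kind commonly used for
    snapshot ensembles. The rate restarts at lr_max at the beginning of
    each cycle and decays towards zero at its end, where one snapshot
    model is saved."""
    epochs_per_cycle = total_epochs // n_cycles  # 400 / 20 = 20 epochs per cycle
    t = epoch % epochs_per_cycle
    return lr_max / 2.0 * (math.cos(math.pi * t / epochs_per_cycle) + 1.0)
```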
To quantify the diversity of the generated models, we used the disagreement measure defined in Equation (20). This metric relates the number of instances on which two base classifiers predict different labels to the total number of instances [78].
where $L_a(x_i)$ refers to the label assigned by snapshot $a$ to readout $x_i$, and the counting term $N^{10}$ (respectively $N^{01}$) is a truth predicate counting the cases where the first snapshot was correct and the second was wrong, and vice versa.
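Assuming Equation (20) is the standard pairwise disagreement measure, a minimal implementation is:

```python
import numpy as np

def disagreement(pred_a, pred_b, y_true):
    """Pairwise disagreement between two snapshot models: the fraction of
    instances on which exactly one of the two classifiers is correct
    (cf. Equation (20))."""
    correct_a = np.asarray(pred_a) == np.asarray(y_true)
    correct_b = np.asarray(pred_b) == np.asarray(y_true)
    n_10 = np.sum(correct_a & ~correct_b)   # a right, b wrong
    n_01 = np.sum(~correct_a & correct_b)   # a wrong, b right
    return (n_10 + n_01) / len(y_true)
```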
Figure 10 illustrates the disagreement between the 20 snapshots. In the first phase, the hard labels predicted by the 20 snapshot models (here we took the average of the models' performances) were compared with the performance of the same algorithms used in the second-module (Module2) evaluation, reported in Table 6. The AUC values obtained over the 5-fold WFCV showed that, in almost all cases, our approach in the first phase (Module3, first phase: Snapshots; see Table 6) was superior to the other approaches. This was not the case when the ensemble-based and Module2 performances were compared. The classification results from all four segments, and from the union of the features, suggested that the first module performed better than the first deep net (first phase). A similar observation was made when the first phase was compared with the other ensemble approaches, such as Bagging, Boosting, and Stacking. This meant we did not see any improvement from the snapshot models of the first deep neural network alone.
This motivated us to construct a meta-learning stage with the aim of learning from the errors observed in the first phase and improving the prediction in the second phase, that is, the final prediction. The outputs of the snapshot models were horizontally appended to the data set (the validation set of the first phase) to be trained and tested again in the second phase. This resulted in 60 extra features in the data set: the data used as the validation set in the first deep network (first phase/layer) became a new data set carrying 60 additional features in the second phase. Accordingly, the held-out test set (10% of the whole data) was used to assess the model. It is worth noting that, in each phase, we trained the deep neural network separately in one training run: in the first phase we constructed the 20 models in one training process using CCAS, and in the second phase we built a meta-learning model through one training process.
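A minimal sketch of how the second-phase (meta-learning) input can be assembled is shown below; the composition of the 60 extra columns as one hard label plus two class probabilities per snapshot, and the Keras-style `predict` interface, are assumptions for illustration rather than the exact implementation.

```python
import numpy as np

def build_meta_features(X_val, snapshots):
    """Append each snapshot's hard label and class probabilities to the
    validation features, forming the input of the second-phase
    meta-learning model. With 20 snapshots this yields 60 extra columns."""
    extra = []
    for model in snapshots:                           # 20 snapshot models
        proba = model.predict(X_val)                  # shape (n, 2): soft predictions
        hard = proba.argmax(axis=1).reshape(-1, 1)    # shape (n, 1): hard labels
        extra.append(np.hstack([hard, proba]))
    return np.hstack([X_val] + extra)                 # original + 60 new features
```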
The result of the second phase is shown in Table 6 (last row) and compared with the other approaches. The figures in the table confirm a significant improvement due to stacking the snapshot predictions generated in the first phase; high AUC values were obtained in almost all experiments, with the exception of Segment 2. The meta-learning model built on the snapshot models could learn from the errors made in the previous training and prediction phases. This resulted in a general predictive model that could potentially be used for claim/fault prediction in all seasons. Indeed, the diversity of the snapshots, illustrated in Figure 10, strongly supported the second phase in building a generalized meta-learning model for the final prediction. Taking into account the results of the above experiments, the ensemble approaches performed well compared with the other classifiers, including the linear and deep networks; nevertheless, the superiority of our approach on this complex problem was evident. This led us to examine these approaches in a different context. We therefore ran the proposed approach and the other three ensemble approaches on several different datasets to assess generality and to ascertain whether SSED performed equally well, or better, in other application domains. This is an important consideration, since it assesses the generality of the approach when dealing with data from different contexts (see Table 7 for detailed information on the data sets).
The observations from Table 8 show that SSED, in most cases, ranked first in terms of accuracy and outperformed the other ensemble approaches. An individual comparison of SSED with Bagging and Boosting indicated that SSED outperformed both approaches on almost all datasets; this consistency was broken only on dataset 7, where Boosting provided slightly better results. In contrast, the comparison between SSED and Stacking suggests that Stacking generalized better on some of the datasets. However, the statistical tests on those results (datasets 3, 7, 9, and 10) showed that the differences were not statistically significant at the chosen critical value. For the same comparison, the t-tests on the figures obtained on datasets 1, 4, 6, 10, and 11, where SSED performed better, confirmed that the differences were statistically significant, leading to the conclusion that SSED outperformed the Stacking approach.