Impact of Regressand Stratification in Dataset Shift Caused by Cross-Validation

Abstract: Data that have not been modeled cannot be correctly predicted. Under this assumption, this research studies how k-fold cross-validation can introduce dataset shift in regression problems. This implies that the data distributions in the training and test sets differ and, therefore, that the model performance estimation deteriorates. Even though the stratification of the output variable is widely used in the field of classification to reduce the impacts of dataset shift induced by cross-validation, its use in regression is not widespread in the literature. This paper analyzes the consequences for dataset shift of including different regressand stratification schemes in cross-validation with regression data. The results obtained show that these schemes allow for creating more similar training and test sets, reducing the presence of dataset shift related to cross-validation. The bias and deviation of the performance estimation results obtained by regression algorithms are improved using the highest amounts of strata, as is the number of cross-validation repetitions necessary to obtain these better results.


Introduction
Knowing the performance of different models over a dataset or determining their best parameter setup are common tasks when facing a new problem in data science [1,2]. In order to address these aspects, k-fold cross-validation (k-fcv) [3,4] is one of the simplest and most frequently used approaches (for a survey on cross-validation procedures, the reader may consult the work of Arlot and Celisse [5]). k-fcv creates k pairs of training and test sets from the original dataset, in such a way that the models are built from the training sets and validated on the test sets. After this, the mean of the test results is taken as the model performance estimate.
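As a concrete reference, the following minimal sketch illustrates this estimation procedure; it assumes scikit-learn is available and uses an arbitrary regressor, so it is an illustration of the general scheme rather than the exact setup used in this paper:

```python
# Minimal sketch of k-fcv performance estimation, assuming numpy arrays
# X (inputs) and y (outputs) and an arbitrary scikit-learn regressor.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def kfcv_estimate(X, y, k=10, seed=0):
    """Average test error of a model over k cross-validation folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in kf.split(X):
        model = DecisionTreeRegressor(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        errors.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return np.mean(errors)  # mean of the k test results as the estimate
```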
Even though k-fcv offers several advantages for performance estimation, such as a reduced computation time compared to leave-one-out [6], its application is not totally risk-free [7,8]. It may cause dataset shift [9,10], in which the data used to build and evaluate the model do not follow the same distribution. This fact usually involves wrong predictions when testing the system, implying an underestimation of the model performance [10].
There are different types of dataset shift that have been studied in the specialized literature, such as covariate shift [11] (which affects the distributions of the input variables), conditional shift [12] (which also affects the conditional distributions of the output variable given an input) and posterior shift [13] (which is produced when the conditional distributions of the output given an input vary but the input distributions do not). Among them, a common type of dataset shift, known as target shift or prior probability shift [11,14], occurs in the output variable. The problem of target shift has been widely studied in classification [15,16]. In this context, stratification [6] is employed to reduce target shift related to k-fcv. It consists of having the same proportion of samples of each class in the training and test sets. This approach has provided successful results when creating cross-validation folds for both model selection and evaluation in classification [6].
Nevertheless, in regression problems [17], the most common k-fcv scheme for performance estimation is standard cross-validation (CV) [18][19][20], in which training and test sets are randomly built. Since the distribution of the output variable is not considered when partitioning a dataset, this approach has the drawback that it can potentially introduce target shift. Despite this, there are works that have applied stratification on the output variable to build more similar training and test sets [7,21,22]. These are mainly based on ordering the samples according to their regressand values, creating different strata of samples and evenly distributing the samples of each stratum among all the folds. Krstajic et al. [7] noted that there appears to be no clear consensus regarding the application of stratified cross-validation. For instance, Breiman and Spector [21] compared several partitioning approaches with regression datasets and concluded that there were no significant improvements when using stratification. On the other hand, Baxter et al. [22] used stratification in the context of water treatment data as an effective alternative to make training and test sets illustrative of the problem domain. Their approach first determined the proportion of samples contained in each set and then iteratively assigned the previously ordered samples to each set based on such proportions. Other works fall somewhere in between, concluding that stratification is not particularly useful when a large number of repeated k-fcv runs is used in model selection, whereas it is recommended for model assessment [7]. These facts highlight the importance of further studying the dataset shift induced by k-fcv and the usage of stratification in the field of regression to better understand their implications.
This paper deepens the understanding of the impacts of target shift induced by k-fcv in regression datasets. It analyzes the influence of different stratification schemes in k-fcv on target shift and its consequences, with respect to CV. These schemes include a series of artificial strata of samples according to the values of the output variable, aiming to minimize the target shift between the distributions of the output variable in training and test sets. Thus, inspired by previous works [7,21,22], several stratification schemes have been designed and compared against CV, considering different amounts of strata: from the minimum amount of two strata to the maximum possible amount of strata, equivalent to the number of samples in the dataset. The most common values of k in k-fcv in the literature (2, 5 and 10) have been considered along with these stratification schemes to study the effect of target shift in 28 real-world regression datasets using five algorithms belonging to several regression paradigms (such as decision trees and neural networks, among others [23,24]). The partitionings have been repeated thousands of times with each approach, leading to a total of more than 4 M results to analyze. The statistical tests recommended in the literature have been employed to contrast the conclusions derived from the analysis of test performance results and target shift between the training and test sets [25]. A webpage with the details of the experimentation, datasets, additional results and plots can be accessed at https://joseasaezm.github.io/scvreg/ (accessed on 29 June 2022). In summary, the main contributions of this paper are the following:
• Delving into the use of regressand stratification in k-fcv and analyzing whether, despite not being generalized, it should be recommended when dealing with regression data.
• Establishing a direct comparison between k-fcv with and without stratification at three levels (amount of dataset shift introduced, quality of performance estimation and convergence speed) to determine in which aspects stratification offers advantages and the degree of improvement in each of them.
• Studying different amounts of strata in the output variable in order to check if they significantly affect the results obtained and recommend the most appropriate values.
• Analyzing whether the effects of stratification on the results depend on the number of folds k in k-fcv, through the study of the values of k commonly used in the literature (2, 5 and 10).
• Drawing conclusions through experimentation with different regression paradigms, both classic and more recent, including decision trees, extreme learning machines and ensembles, among others.
Note that, even though there are works in the literature dealing with regressand stratification, most of the research in this field has considered the distributions in the input space, thus addressing the presence of covariate shift in the data [26][27][28]. Some of these methods, such as representative splitting cross-validation (RSCV) [26], are based on the DUPLEX [29] algorithm to create partitions with k-fcv. Other works [27,30] are based on clustering to create training, validation and test sets for use with neural networks. For example, May et al. [27] created groups of samples using self-organizing maps, which were then distributed among the three sets. The proposal of Diamantidis et al. [28] is also based on clustering and one-center strategies, creating the folds deterministically using the distributions in the input space. A different process is followed by SPlit [31], which is based on the usage of support points and a sequential nearest neighbor method.
Unlike the above approaches, this paper focuses exclusively on target shift to delve into its impacts when partitioning with k-fcv. This makes it possible to determine the degree of improvement in performance solely attributable to regressand stratification (considering different amounts of strata). Even though there are other works that have applied regressand stratification when using k-fcv (mainly to develop hydrological models [32][33][34]), to the best of our knowledge, this paper is the first to combine a comprehensive study of the impacts of regressand stratification with the simultaneous consideration of different regression paradigms, dozens of datasets, several stratification levels and numbers of folds k in k-fcv, and a numerical study of the amount of target shift, the performance and the convergence speed of each of the k-fcv approaches studied when dealing with regression problems.
The remainder of this manuscript is organized as follows. Section 2 explains how k-fcv can introduce dataset shift. Section 3 describes the partitioning methods employed in this paper and Section 4 details the experimental framework. Section 5 is devoted to the analysis of the results. Finally, Section 6 closes this work, summarizing the main findings.

On Dataset Shift Induced by Cross-Validation
Let x and y respectively be the input attributes and the output variable in a dataset, with X and Y their corresponding domains. In supervised learning, a function $f: X \to Y$ is usually estimated from a training set of $m$ samples $D_{tra} = \{(x_i, y_i) \in X \times Y\}$, $i = 1, \ldots, m$, in order to predict the output variable in a different test set of $m'$ samples $D_{tst} = \{(x_j, y_j) \in X \times Y\}$, $j = 1, \ldots, m'$. Commonly, it is assumed that the training and test sets have identical joint distributions, that is, $P_{tra}(x, y) = P_{tst}(x, y)$ [17]. However, if these sets are obtained by a k-fcv procedure without considering the distributions of the input and output variables, $P_{tra}(x, y) \neq P_{tst}(x, y)$ is likely to occur. This scenario is known as dataset shift [9,10] and occurs when the training and test sets follow different distributions [11]. In supervised data, either classification or regression, two main types of dataset shift are found:
1. Target shift [14,15], which affects the distributions of the output variable, $P_{tra}(y) \neq P_{tst}(y)$, but maintains the conditional distributions $P_{tra}(x|y) = P_{tst}(x|y)$;
2. Covariate shift [35,36], which affects the distributions of the input attributes, $P_{tra}(x) \neq P_{tst}(x)$, but maintains the conditional distributions $P_{tra}(y|x) = P_{tst}(y|x)$.
Among them, covariate shift has been widely studied in the specialized literature [26,27,37]. A common approach to reduce its negative impacts is to estimate a weight for each training sample relative to the test set [38], which is then used by learning algorithms. Examples of this strategy are the KLIEP [37] and uLSIF [39] methods. Other works are based on computing the weights by analyzing the means of the training and test sets in a kernel Hilbert space [40] or introducing a surrogate kernel matching [41]. There are also approaches to reduce covariate shift related to partitioning with k-fcv, such as the DB-SCV [42] and DOB-SCV [8] approaches in the field of classification or the aforementioned proposals for regression problems based on the DUPLEX algorithm [26], clustering [27] or support points [31]. These methods are generally based on creating training and test partitions by choosing close samples in the input space.
Even though it is recommended to address covariate and target shifts simultaneously in real-world applications, this research focuses on target shift, since the output variable usually has a strong influence on the building and evaluation processes of the models. Note that, although factors other than target shift (such as covariate shift) could affect the results obtained, they can be expected to affect all the partitioning schemes in this paper equally, since none of the k-fcv approaches studied deals with them specifically. Figure 1 shows a regression dataset with varying degrees of target shift, with x and y being the input and output variables, respectively. Figure 1b illustrates a high target shift, since the samples with y > t used to validate the model are not considered when building it and, thus, they are probably wrongly predicted. This situation is partially corrected in Figure 1c, which shows that both sets have samples along the whole domain of y. In real-world applications, target shift can occur because of the nature of the problem (e.g., when a model is built on past data and used to predict future data with different characteristics) or it can be unexpectedly introduced by cross-validation during the performance estimation of the models [8]. Using k-fcv, the data are divided into k separate folds. Then, each fold is used to test the model trained with the remaining k − 1 folds and, finally, the k evaluation results are averaged to obtain an individual estimation. If the training and test sets are obtained without taking into account the distribution of the output variable, the data used to build the model may differ from those used to validate it.
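The following toy sketch (entirely hypothetical data and threshold, not taken from Figure 1) illustrates how splitting at a threshold t on the regressand produces a far larger distribution gap than a random split:

```python
# Illustrative sketch of the target shift in Figure 1b: splitting a toy
# dataset at a threshold t on the regressand versus splitting at random.
# ks_2samp quantifies how far apart the train/test distributions of y are.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
y = rng.normal(size=1000)          # toy output variable
t = np.median(y)                   # illustrative threshold

# Shifted split: train only on y <= t, test on y > t (as in Figure 1b).
train_shift, test_shift = y[y <= t], y[y > t]
# Random split: shuffle and halve (closer to Figure 1c).
perm = rng.permutation(len(y))
train_rand, test_rand = y[perm[:500]], y[perm[500:]]

print(ks_2samp(train_shift, test_shift).statistic)  # near 1: strong shift
print(ks_2samp(train_rand, test_rand).statistic)    # small: similar sets
```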
In classification, target shift induced by k-fcv is commonly prevented by applying a stratification scheme [6]. However, the usage of k-fcv based on folds of random samples, usually employed in the field of regression, may imply that the impact of target shift is overlooked. This research focuses on the analysis of the presence and impact of target shift induced by k-fcv in regression datasets, studying how regressand stratification can help to reduce its negative consequences.

Cross-Validation in Regression Problems
This section describes the different k-fcv partitioning methods used in this research. They split a regression dataset D into k approximately equal-sized folds $(F_1, \ldots, F_k)$. Each fold is used in turn as a test set, with the union of the remaining folds forming the corresponding training set.

Standard Cross-Validation
Algorithm 1 shows the most widely used and simplest approach to partition a regression dataset with k-fcv: standard cross-validation (CV) [18,19]. First, it computes the number of samples per fold (line 2, with |D| being the number of samples). Then, the whole dataset D is split into k random folds of equal size (lines 3-6). Note that, since this partitioning scheme does not consider the distribution of the output variable when creating each fold, it can introduce target shift in an uncontrolled way.
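A minimal sketch of this scheme, written from the description above (the helper name and index-based representation are ours, not the paper's):

```python
# Sketch of standard cross-validation (CV, Algorithm 1): samples are
# represented by their indices 0..|D|-1 and dealt into k random folds.
import numpy as np

def cv_folds(n_samples, k, seed=0):
    """Randomly split sample indices into k folds of (almost) equal size."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # random order of samples
    return np.array_split(indices, k)      # k folds, sizes differ by <= 1
```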

Totally Stratified Cross-Validation
Contrary to CV, which does not use the information of the regressand to create the folds, totally stratified cross-validation (TSCV) is one of the approaches used in this research to reduce target shift to the maximum degree. Algorithm 2 shows its pseudocode.
It introduces as many strata as there are samples in the dataset. The strata are created according to the regressand distribution. First, it sorts the samples in D considering their output variable (line 2). Then, each sample is selected in order (line 3) and assigned to one of the folds with the fewest samples (lines 4-5). If several folds are tied with the fewest samples (line 4), one of them is arbitrarily chosen, which adds some randomness to the partitioning.
TSCV is based on the idea of assigning the closest samples (according to their output variables) to different folds. In this way, folds are intended to be as similar as possible to each other, while each of them contains the maximum possible diversity of values of the output variable. This fact finally implies that training and test sets have similar distributions of the output variables, reducing target shift.
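A possible sketch of TSCV, reconstructed from the description above rather than from the original pseudocode:

```python
# Sketch of totally stratified cross-validation (TSCV, Algorithm 2):
# sort by the regressand and deal samples into k folds, choosing at
# random among the currently smallest folds when there is a tie.
import numpy as np

def tscv_folds(y, k, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(y)                        # line 2: sort by output value
    folds = [[] for _ in range(k)]
    for idx in order:                            # line 3: take samples in order
        sizes = np.array([len(f) for f in folds])
        smallest = np.flatnonzero(sizes == sizes.min())
        folds[rng.choice(smallest)].append(idx)  # lines 4-5: random tie-break
    return [np.array(f) for f in folds]
```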

Stratified Cross-Validation
At an intermediate point between CV and TSCV, t-stratified cross-validation (SCV_t) is another approach used in this paper to introduce a variable degree of stratification in the k-fcv process with regression problems. It is presented in Algorithm 3.
This procedure allows for creating the desired number of strata t when building the k folds. First, SCV_t sorts all the samples according to the value of the output variable (line 2) and computes the number n of samples per stratum (line 3). Afterwards, it starts an iterative process to assign samples to each fold (lines 4-13): it selects blocks of n samples forming each stratum (lines 5-6) and, then, each sample of that block (line 8) is assigned to one of the folds with the fewest samples (line 9) until there are no more samples to assign. Thus, SCV_t considers the same number of samples from each stratum in each of the folds. SCV_t is a generalization of the previous k-fcv schemes, CV and TSCV: if t = 1, no stratification is considered and SCV_t is equivalent to CV; if t = |D|, the maximum number of strata is considered and SCV_t is equivalent to TSCV.
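A sketch of SCV_t under the same assumptions as the previous snippets; note how it reduces to the CV and TSCV behaviors at the extreme values of t:

```python
# Sketch of t-stratified cross-validation (SCV_t, Algorithm 3), based on
# the description above. With t = 1 the single shuffled stratum yields a
# random split (CV); with t = |D| every stratum has one sample (TSCV).
import numpy as np

def scv_folds(y, k, t, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(y)                  # line 2: sort by output value
    strata = np.array_split(order, t)      # line 3: ~n = |D|/t samples each
    folds = [[] for _ in range(k)]
    for stratum in strata:                 # lines 4-13: one stratum at a time
        for idx in rng.permutation(stratum):
            sizes = np.array([len(f) for f in folds])
            smallest = np.flatnonzero(sizes == sizes.min())
            folds[rng.choice(smallest)].append(idx)
    return [np.array(f) for f in folds]
```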

Real-World Datasets
This research considers 28 real-world regression datasets taken from the UCI machine learning and KEEL-dataset repositories (https://archive.ics.uci.edu/, http://www.keel.es; both accessed on 29 June 2022). In order to study the impact of stratification in k-fcv with regression problems regardless of the characteristics of the data, datasets belonging to different applications and areas (including fields such as biology, geology, chemistry and so on) and with different numbers of attributes and samples are selected. Table 1 presents them, along with their number of attributes (at) and samples (sa). Those samples containing missing values in these datasets are removed before their usage. Furthermore, both the input attributes and the output variables are normalized to the interval [0, 1].

Regression Algorithms
In order to build models on the above datasets, five algorithms belonging to different regression paradigms are chosen, including decision trees [23], distance-based models [43], neural networks [24], multiple linear regression [44] and ensemble-based models [45]. They are briefly described below along with their main parameters, which are shown in Table 2 (an illustrative code sketch follows the list). In order to delve into the specific characteristics of each algorithm, the reader can consult the reference associated with each of them.
1. Recursive partitioning and regression trees (RPART) [23]. It builds a decision tree from the dataset, in which the nodes are successively split into subnodes using a homogeneity-based threshold attribute value. The process stops when the last subset of samples is included in the tree or the maximum number of leaves is reached (known as tree pruning).
2. k-nearest neighbors (NN) [43]. To estimate the output value for a sample, it computes the distances between that sample and all the training samples. Then, it selects the k closest samples to the query and averages their regressand values to obtain a single prediction.
3. Extreme learning machine (ELM) [24]. It is a feedforward neural network with a hidden layer of nodes whose parameters do not need to be tuned. Its main advantage is that it produces good generalization performance in less time compared to traditional neural networks trained with backpropagation.
4. Multivariate adaptive regression splines (MARS) [44]. It is a non-parametric algorithm based on two main stages. In the forward stage, it splits the data into several subsets and runs a linear regression model on each partition. In the backward stage, the model is pruned to avoid overfitting by removing the functions that contribute the least to performance.
5. Generalized boosted regression modeling (GBM) [45]. It iteratively builds decision trees based on random subsets of the training samples using boosting. For each new tree, those samples poorly modeled by previous trees have a higher probability of being selected.
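Where rough open-source analogues exist, the paradigms above might be instantiated as in the hedged sketch below; these are scikit-learn estimators chosen for illustration, not the exact implementations or parameter setups of Table 2:

```python
# Hedged sketch: approximate scikit-learn analogues for three of the five
# paradigms. These are NOT the implementations or parameter setups of
# Table 2; ELM and MARS are omitted (no core scikit-learn counterpart).
from sklearn.tree import DecisionTreeRegressor          # RPART-like tree
from sklearn.neighbors import KNeighborsRegressor       # k-nearest neighbors
from sklearn.ensemble import GradientBoostingRegressor  # GBM-style boosting

models = {
    "RPART-like": DecisionTreeRegressor(random_state=0),
    "NN": KNeighborsRegressor(n_neighbors=5),
    "GBM-like": GradientBoostingRegressor(random_state=0),
}
```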

Methodology of Analysis
To study the effects of target shift induced by k-fcv in regression datasets and how stratification can help to reduce its impacts, the following experimental study is performed. Each dataset in Table 1 is partitioned using three different values of k in k-fcv (2, 5 and 10). These folds are obtained with eight partitioning schemes (see Section 3), each one with a different stratification degree:
• CV, which does not consider any stratification;
• TSCV, which considers a total stratification of the samples;
• SCV_t with six different values of t (2, 5, 10, 20, 50 and 100), which allows for controlling the stratification degree.
Once the datasets are split into k folds with the aforementioned schemes, the regression methods in Table 2 are evaluated over them, obtaining their test performance results with the RMSE metric:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$

with n the number of samples and $y_i$ and $\hat{y}_i$ the real and predicted regressand values for the i-th sample, respectively. Additionally, the Kolmogorov-Smirnov statistic ($D_n$) [46] is used to estimate the amount of target shift between the training and test sets, that is, the difference between the distributions of the regressand values in both sets. Given the samples of regressand values in the training and test sets, X and Y, and their empirical distribution functions $F_X$ and $F_Y$, $D_n$ is computed as:

$$D_n = \sup_x \left| F_X(x) - F_Y(x) \right|.$$

The above procedure is repeated 1000 times with different seeds to generate random numbers, thus obtaining different partitions in each run. Table 3 shows a summary of the experiment performed, in which # indicates the number of values of each variable. Thus, the experimentation of this research entails the analysis of more than 4 M results. The conclusions derived from them are contrasted using the statistical tests recommended in the specialized literature [25]. Specifically, Wilcoxon's test [25] is used for rejecting the null hypothesis of the equality of means in pairwise comparisons, implying the superiority of one of the methods. A significance level α = 0.1 is assumed in this paper.
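A minimal sketch of these two measurements, assuming SciPy for the two-sample Kolmogorov-Smirnov statistic:

```python
# Sketch of the two measurements above: RMSE of the predictions and the
# Kolmogorov-Smirnov statistic D_n between the regressand distributions
# of the training and test sets.
import numpy as np
from scipy.stats import ks_2samp

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def target_shift(y_train, y_test):
    """D_n = sup_x |F_X(x) - F_Y(x)| over the two empirical CDFs."""
    return ks_2samp(y_train, y_test).statistic
```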

Analysis of Results
The analysis of results is divided into three main parts. First, Section 5.1 analyzes the amount of target shift induced by the different k-fcv schemes. Then, Section 5.2 focuses on the effect of stratification on the error estimation of the regression methods with k-fcv. Finally, Section 5.3 studies the convergence speed of the stratification schemes with respect to the performance estimated using CV, that is, the number of repetitions each stratification approach needs to reach a stable (better) behavior with respect to CV.

Analysis of the Target Shift Induced by Cross-Validation Schemes
In order to measure the amount of target shift existing between the training and test sets created by each k-fcv approach, the Kolmogorov-Smirnov [46] non-parametric test is used. This test calculates a statistic $D_n \in [0, 1]$, which can be taken as an indicator of the difference between two samples. $D_n$ is measured considering the distributions of the output variable in the different training and test sets created by each k-fcv scheme for each dataset. The lower the value of $D_n$, the more similar the training and test distributions and the lower the amount of target shift introduced by k-fcv. Table 4 shows the averaged $D_n$ values when measuring target shift between training and test sets for all the datasets considering each one of the k-fcv schemes (with k = 2, 5 and 10). The results of each partitioning scheme are compared against two reference methods using Wilcoxon's test, obtaining their associated p-values:
1. CV, which does not consider any stratification (row vs. CV);
2. TSCV, which considers a maximum stratification (row vs. TSCV).
The p-value of Wilcoxon's test allows for rejecting the null hypothesis that the mean results of the two algorithms involved in the comparison are equal, that is, they have a similar behavior on average in all the datasets. Given that a significance level of 0.1 is considered and p-values are in scientific notation in Table 4, those with exponent −1 do not allow for rejecting the null hypothesis, whereas the rest do (indicating that there are differences between the behaviors of the two methods compared).
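As an illustration of how such a pairwise comparison might be computed, the following sketch (with hypothetical arrays holding one averaged result per dataset) obtains both the p-value and the two sums of ranks, whose role is explained next:

```python
# Sketch of a pairwise Wilcoxon signed-rank comparison over per-dataset
# results, plus the two sums of ranks (R+ assigned to the first method,
# R- to the second). Input arrays are hypothetical, one value per dataset.
import numpy as np
from scipy.stats import wilcoxon, rankdata

def compare(results_a, results_b, alpha=0.1):
    diff = np.asarray(results_a) - np.asarray(results_b)
    ranks = rankdata(np.abs(diff))     # rank the absolute differences
    r_plus = ranks[diff > 0].sum()     # rank sum of positive differences
    r_minus = ranks[diff < 0].sum()    # rank sum of negative differences
    p = wilcoxon(results_a, results_b).pvalue
    return p, p < alpha, r_plus, r_minus
```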
The best results in Table 4 are underlined, whereas p-values lower than α = 0.1 are remarked in bold. A darker background in the results indicates that these are better. In addition to the p-value, the sum of ranks [47] associated with each algorithm within Wilcoxon's test is calculated as a way of representing their effectiveness. In order to do this, the differences between both methods in the results of each dataset are computed and a rank is assigned to the absolute value of each difference. The sum of ranks associated with the positive differences is assigned to the first algorithm, whereas the sum of ranks of the negative differences is assigned to the second method. A higher sum of ranks represents a greater effectiveness of the corresponding algorithm. Finally, those cases in Table 4 in which the method of the row obtains a higher sum of ranks than that of the column in Wilcoxon's test are indicated with an asterisk.

Table 4. Induced target shift by different k-fcv schemes. A darker background in the results indicates that these are better. Those cases in which the method of the row obtains a higher sum of ranks than that of the column in Wilcoxon's test are indicated with an asterisk.

The above results show that, for each value of k in k-fcv, a total ordering according to the increasing number of strata is observed: from CV (with the highest target shift value) to TSCV (with the lowest target shift value). This ordering is also observed in the results of most of the individual datasets.
The comparisons among k-fcv schemes using Wilcoxon's test support the above conclusions. They show that the partitionings considering stratification provide better results than the one that does not (CV). Similarly, using the maximum number of strata (TSCV) implies an improvement compared to considering a smaller number of strata.
When comparing the results of each partitioning method for the different values of k, it is observed that a higher k involves an increase in the target shift injected by k-fcv. This may be because, the greater the number of folds, the more difficult it is for all of them to be similar.
The above results show the positive effects of the stratification schemes in reducing target shift in regression datasets compared to the traditional approach, which does not consider any stratification (CV). Specifically, TSCV is the stratification scheme that most reduces the differences between the distributions of the output variable in the training and test sets, with clear differences compared to the rest of the approaches. The next section analyzes whether this reduction in target shift leads to a better error estimation of the models, implying a lower bias in performance estimation.

Effect of Stratification on Error Bias Related to Target Shift
Table 5 shows the error estimation (RMSE) and standard deviation results of each partitioning scheme, considering all the runs with different numbers of folds k (2, 5 and 10). Additionally, the p-value associated with Wilcoxon's test after comparing the results of each partitioning method against the method with no stratification (CV) and the method with maximum stratification (TSCV) is computed. Due to the large amount of results obtained, only those for the RPART and NN regression methods are shown in this paper. The results for the rest of the regression techniques can be found on the webpage of this paper and lead to conclusions similar to those presented here.
Analogously to the amount of target shift in the previous section, the error estimation results for RPART and NN show that, for each value of k in k-fcv, an ordering according to the increasing number of strata is observed: from CV (with the highest error values) to TSCV (with the lowest error values). Although there are some exceptions, this is also true for the standard deviation results.
The statistical comparisons for the error results also support that all the methods using stratification are generally better than CV (although no differences are observed between CV and its closest stratification levels in some cases). The comparisons with TSCV show similar behavior: TSCV is usually better than the methods using a lower stratification, although with the approaches closest in number of strata (50 and 100), no differences are observed in some cases. The analysis of standard deviation provides similar results.
As a conclusion, it is observed that the reduction in target shift achieved by the stratification schemes in Section 5.1 is accompanied by a corresponding reduction in the estimated error of the models. This implies that applying stratification allows for obtaining better estimations of the performance of the models, so its usage can be recommended over not considering it.

Convergence Speed of Stratification Schemes against CV
Section 5.2 focuses on error estimation when a large number of k-fcv repetitions (1000) are performed. The analysis of the p-values of Wilcoxon's test shows that stratification generally improves error estimation compared to not considering it. This section studies the speed (number of k-fcv repetitions) required by each partitioning method to reach a p-value lower than 0.1 when compared to CV. To this end, the performance of each regression method using stratified k-fcv is compared against CV with Wilcoxon's test for increasing numbers of repetitions (from 5 to 150, in increments of 5), and the associated p-values are computed. This process is repeated 50 times to obtain more robust results. Figure 2 shows the results for RPART and NN; the results for the rest of the methods can be found on the webpage of this paper.
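A rough sketch of this convergence experiment is shown below; run_kfcv() is a hypothetical helper returning one averaged per-dataset estimate for the given scheme and repetitions, so the snippet only outlines the loop structure described above:

```python
# Sketch of the convergence-speed experiment: for increasing numbers of
# k-fcv repetitions, compare accumulated stratified vs. CV estimates with
# Wilcoxon's test. run_kfcv(dataset, scheme, reps) is hypothetical.
from scipy.stats import wilcoxon

def convergence_curve(run_kfcv, datasets, max_reps=150, step=5):
    pvalues = []
    for reps in range(step, max_reps + 1, step):
        strat = [run_kfcv(d, scheme="SCV", reps=reps) for d in datasets]
        plain = [run_kfcv(d, scheme="CV", reps=reps) for d in datasets]
        pvalues.append(wilcoxon(strat, plain).pvalue)
    return pvalues  # p < 0.1 marks when stratification is reliably better
```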
The analysis of Figure 2 shows a certain ordering in the results of the different stratification strategies. The highest p-values are usually related to those approaches that use a smaller number of strata, whereas an increase in the number of strata implies that the p-values are reduced. A higher variability in the results between consecutive k-fcv repetitions is also observed in those approaches that use a lower number of strata. Thus, these results show that the usage of higher amounts of strata in k-fcv usually involves lower numbers of k-fcv repetitions to obtain better and stable results compared to CV. Furthermore, Figure 2 shows that, if higher values of k in k-fcv are considered, the p-values obtained are generally higher with all the partitioning methods. A higher variability in the results between consecutive repetitions is also present with higher values of k in k-fcv.

Table 5. Effect of stratification on error estimation (RMSE) and standard deviation results. A darker background in the results indicates that these are better. Those cases in which the method of the row obtains a higher sum of ranks than that of the column in Wilcoxon's test are indicated with an asterisk.

Conclusions
This research has analyzed both how k-fcv can introduce target shift in regression datasets and its negative impact on performance estimation. Several stratification schemes have been analyzed to build more similar training and test distributions, reducing the target shift produced by cross-validation.
The experiments performed have shown that both dataset shift and bias are reduced by considering stratification. In general, the larger the number of strata, the lower the target shift and the error estimation results. The convergence speed of the different stratification schemes when compared to CV shows that a larger number of strata usually implies that cross-validation provides a stable and better performance estimation faster. Among the stratification schemes studied, the usage of TSCV can be recommended, since it is the one that generally provides the best results in terms of both the introduced target shift and the estimation of the error made by the models built. Despite this, it should be noted that other regressand stratification schemes using a smaller number of strata also obtain good results compared to CV. Furthermore, the usage of a lower stratification degree may imply some advantages in terms of computational cost when partitioning the dataset, particularly if the number of samples is high. Finally, even though this research has focused on the study of target shift in regression problems, it is also important to consider the presence of other types of dataset shift, such as that occurring in the input attributes. Their joint consideration may imply the need to further investigate the most appropriate synergy between the number of strata in the regressand and the different strategies to reduce other forms of dataset shift.
In future works, it is planned to study the behavior of other regression methods, such as Support Vector Regression [48] or XGBoost [49], with the proposed stratification schemes, as well as to use these algorithms along with other k-fcv approaches considering different types of dataset shift simultaneously, such as target and covariate shift.