Techniques for Robust Imputation in Incomplete Two-Way Tables

: We describe imputation strategies resistant to outliers, through modiﬁcations of the simple imputation method proposed by Krzanowski and assess their performance. The strategies use a robust singular value decomposition, do not depend on distributional or structural assumptions and have no restrictions as to the pattern or missing data mechanisms. They are tested through the simulation of contamination and unbalance, both in artiﬁcially generated matrices and in a matrix of real data from an experiment with genotype-by-environment interaction. Their performance is assessed by means of prediction errors, the squared cosine between matrices, and a quality coefﬁcient of ﬁt between imputations and true values. For small matrices, the best results are obtained by applying robust decomposition directly, while for larger matrices the highest quality is obtained by eliminating the singular values of the imputation equation.


Introduction
Complete data matrices are required for some statistical analysis techniques, making the imputation of missing values necessary in certain circumstances. For example, the biplot analysis used widely in multivariate exploratory analysis [1] and the additive main effects and multiplicative interaction models-AMMI [2]-used to describe the interaction between genotypes and environments, have as main tool the singular value decomposition-SVD [3]. However, this SVD is not directly calculable if there are missing values [4], and it is necessary to pre-process the information by first replacing the missing data with plausible values.
Modern literature on incomplete information analysis recommends completing matrices using methods that employ either maximum likelihood or multiple imputation [5]. However, these procedures can depend heavily not only on probability distributions (for example, multivariate normal), but also on the missing data mechanisms [6].
There are currently four missing data mechanisms [7]: Missing Completely at Random (MCAR), where the missing data does not depend on the feature variable being considered or any of the other feature variables in the dataset; Missing at Random (MAR), where the missing data in the feature variable is conditional on any other feature variable in the dataset but not on the one being considered; Missing not at Random (MNAR); when the possibility of a feature variable having a missing data entry depends on the value of the feature variable itself (irrespective of any alteration or modification to the values of other feature variables in the dataset); and Missing by Natural Design (MBND), where the missing data occurs because it cannot be measured physically but is relevant in the data analysis procedure [7].
To take into account both dependence on probability distributions and missing data mechanisms, a very useful option is non-parametric imputation [8,9]. A general method free of structural and distributional assumptions was originally proposed by Krzanowski [10] and recently generalized by Arciniegas-Alarcón et al. [11] to complete matrices from multi-environmental experiments. The method depends on the SVD, which is a least squares technique, but a disadvantage of these types of techniques is that they are highly influenced by untypical values or outliers [12].
To the best of our knowledge, this method has not yet been subjected to a robustness study, so our purpose here is to evaluate its performance in the face of outlier observations and to propose some strategies for robust imputations. We therefore first present an updated version of the SVD method, then consider robust alternatives for imputation, and finally describe numerical studies for the corresponding performance analysis.

Missing Value Imputation Using the SVD-Method SVD88
The method consists of an updated version of the imputation system of Krzanowski [10] with some minor changes reported in the literature that improve its performance [11]. Consider a matrix Y (n × p) with elements y ij (i = 1, . . . n; j = 1, . . . p) and p > n (if p < n the matrix should be first transposed). First, suppose there is just one missing value y ij in Y. Then, the i-th row from Y is deleted and the SVD for the ((n − 1) × p) resulting matrix is to delete the j-th column from Y and obtain the SVD for the (n × (p − 1)) matrix Y (−j) as The matrices U, V, U and V are orthonormal, while D and D are diagonal. Now, combining the two SVDs, Y (−i) and Y (−j) , the imputed value is given by: where H is the optimal number of components of the SVD that can be found by crossvalidation methods adapted for matrices with missing data and available in the R statistical environment [13]. In this work, the "bcv" package was used, which has implemented crossvalidation for incomplete matrices using an EM algorithm ( [13]; available from CRAN; see https://cran.r-project.org/web/packages/bcv/index.html, accessed on 25 June 2021). When there is more than one missing value, an iterative scheme is required as follows. Initially, all missing values are replaced by their respective column means, giving a completed matrix Y and then the columns are standardized by subtracting m j and dividing the result by s j (where m j and s j represent the mean and the standard deviation of the j-th column calculated only from the observed values). Using the standardized matrix, the imputation for each missing value is recalculated using the Equation (1). Finally, the matrix Y is returned to its original scale, y ij = m j + s jŷij . Then, the process is iterated until stability is achieved in the imputations (i.e., when the values in two successive iterations agree to within a desired level of accuracy). In order to avoid convergence problems, a parity check should be done in each iteration by matching the sign of u ih d h p/(p − 1) 1 2 v jh d h n/(n − 1) 1 2 in Equation (1) to the sign of u ih d h v jh obtained from the SVD of the Y matrix for each h = 1, . . . , H [11]. Pseudocode i. Y <-incomplete matrix. ii.
Return to step ii. until stability is achieved in the imputations.

Robust Singular Value Decomposition
The presence of outliers in a data set can reduce the effectiveness of least squares techniques [12], which in this case is the standard SVD. To avoid this behavior, robust lower-rank approximations or equivalent robust SVD could be used. Fortunately, the literature provides several alternatives that can be considered.
Recently, [14] presented a review of four options for producing robust SVD. The first is an approximation of alternating regressions based on a mixture of M estimators and trimmed least squares estimators. For more details on these types of regressions, see [15]. The second consists of a sub-sampling method whose objective is to find a sub-matrix of rows without outliers with high probability and from this sub-matrix to estimate the effects of rows and columns by means of a robust function. For more details see [16]. The third option is to use iterative reweighted least square algorithms to obtain robust estimators in principal component analysis (PCA). A complete description of this approach and extensions can be found in [17] and a computational implementation is available in the R RobRSVD package [13]. The last option is an optimization with restrictions in which the data matrix is assumed to be a sum of three matrices, one of low rank, a sparse matrix to represent the outliers, and a random noise matrix. Algorithms for such optimization and additional details can be found in [14].
Other very useful work on robust SVD has been carried out by [18], who used alternating robust fittings with trimmed least squares to find the SVD and apply it in microarray data analysis; Maronna and Yohai [19] who used alternating M regressions to obtain robust low-rank approximations; and Rodrigues et al. [20,21], who used the SVD proposed by Hawkins et al. [22] for robust singular spectral analysis and for a robust version of AMMI models. The computational implementation of this SVD can be found in the R pcaMethods package ( [13]; pcaMethods available from Bioconductor; https://bioconductor.org/packages/release/bioc/html/pcaMethods.html, accessed on 25 June 2021).
Given this literature review and taking our objectives into account, we delimit the research and choose an option to produce a robust SVD that can use the SVD88 method. To focus specifically on matrices with genotype-by-environment interaction, we adopt the suggestion of [20]. Thus, we use the Hawkins et al. [22] SVD which is based on an alternating L1 regression algorithm (rSVD01), and the generalization proposed by [23] using weighted least absolute deviation regression (rSVD10). The frequent use of alternating regressions in the literature cited above obliged us to also consider the classic reference of Gabriel and Odoroff [24], whose main objective was to produce resistant low-rank matrices (rSVD84).
We thus initially focused on comparing the performances of rSVD01, rSVD10, and rSVD84 when they are introduced into the SVD88 imputation system. However, their individual performance was also considered, i.e., without introducing them into the imputation system. This is because the SVD from alternating regressions does not need any data imputation as it uses only the observed information. Moreover, a robust lowrank approximation can provide estimates of the missing positions of an incomplete matrix [18] without any additional calculations. The three SVDs were tested, and in the case of rSVD01 and rSVD10, we also tested replacing L1 regressions with M regressions using the rlm function of R ( [13]; RobRSVD available from CRAN; see https: //cran.r-project.org/web/packages/RobRSVD/index.html, accessed on 25 June 2021). These preliminary tests used contaminated and incomplete matrices of different dimensions, starting with matrices of dimension 20 × 5 and continuing to matrices with many more elements, for example, dimension 100 × 8. In these initial experiments, the method that was statistically simplest, computationally most efficient, and with the best results when introduced into the SVD88 was always rSVD84, so it is the method described below.

rSVD84 (Procedure Based on García-Peña et al., 2021 and Gabriel and Odoroff 1984)
Consider a dimension Y matrix (n × p) with possible missing entries. (i) Using the observed information, calculate the vectors of trimmed means (i.e., the means when a certain percentage of the most extreme observations are ignored; we use 10% or 20% following [25]) by columns and by rows, b(1 × p) and a(n × 1) respectively. (ii) Determine the presence of outliers in vector a by any technique that is preferred for univariate outliers (for example, the quartile method) and if there is any discrepant data replace it with a trimmed mean of the elements of a. (iii) Update the elements of b and a as follows: b c = med{|y r,c /a r |; r = 1, . . . , n } and a r = med{|y r,c /b c |; c = 1, . . . , p }. Repeat steps (ii) and (iii) with the updated vectors b and a until some convergence criteria are attained. Once the stability of vectors b and a has been achieved, a robust low-rank approximation of Y is obtained by means of the ab T product whose element signals must be the same as those of Y. To obtain the first singular value and the first right and left singular vector of the robust SVD, the standard SVD over ab T is applied and both the first singular value and the first right and left singular vector are recorded. To obtain the second singular value and the second right and left vector, the same procedure described in this section is done, but replacing the matrix Y with Y-ab T . This cyclic strategy deflating the matrix continues until the desired number of robust SVD components-rSVD84 is obtained. In the Appendix A of this manuscript a function is provided in R [13] that calculates the rSVD84 for any complete or incomplete matrix. Pseudocode Calculate a robust singular value decomposition on Y* Section 2.3 (rSVD). iv.
The rSVD contains the require imputations.

Alternatives to Obtain Robust Imputation
The SVD88 algorithm depends directly on the standard SVD, so the presence of outliers can affect the performance and decrease the quality of the imputations. To avoid this behavior, we have three possible strategies: (i) Substitute in the SVD88 iterations the classic robust standardization, in which the column standardization is done by subtracting the median M j and dividing the result by the median absolute deviation-MAD j . Having thus standardized the matrix, use rSVD84 until stability in the imputations is obtained and then return the matrix to the original scale; (ii) Find a robust low-rank approximation to the incomplete matrix using rSVD84 and then take the values obtained in that approximation as imputations for those matrix positions that were not originally obtained. (iii) Perform the imputation in two stages, first obtaining a robust low-rank approximation of the incomplete matrix and then applying the SVD88 method to refine the imputations.
Some additional implementation details for these options are as follows. In option (i) we tried to obtain a robust version of SVD88 but using the robust standardization and rSVD84 together did not perform well and in preliminary tests it never achieved convergence. However, if we eliminated the singular values of Equation (1) of imputation then convergence was achieved. This robust version of the SVD88 is called M5RobSVD. Option (ii) or rSVD84 has always shown in preliminary tests to be the fastest but for both this and M5RobSVD an appropriate number of components to be used in the imputation must be specified. To do this, the robust SVD was calculated on the incomplete matrix and k was chosen such that the sum of the first k ordered singular squared values divided by the sum of all singular squared values was greater than 0.75 [11]. Finally, option (iii) depends on the sequential application of two procedures, rSVD84 and SVD88. This sequence of procedures in the preliminary tests has always reached convergence. In view of the original authors of the procedures, we call the imputation methods GOKimputation and M5GOKimputation when the singular values were disregarded.
Return to step ii. until stability is achieved in the imputations,

Numerical Example 1: Simulation Study
To evaluate the performance of the five imputation systems (SVD88, M5RobSVD, rSVD84, GOKimputation, and M5GOKimputation), we simulated eight incomplete and contaminated 100 × 8 size matrices ( Table 1) that had the structure of an AMMI model, using the steps given by [20,26] as follows: 1.
Create a matrix X with n = 100 rows and p = 8 columns with observations drawn from a uniform distribution [−0.5; 0.5].

2.
Compute the SVD of X to obtain the matrices U, V and D, containing, respectively, the left and right singular vectors and the singular values of X.
Simulate a matrix Y with AMMI2 structure (i.e., AMMI with the first two singular values/vectors): represents the element in the row i and column i of the matrix A, 1 n 1 T p µ is a matrix (n × p) with the grand mean µ in all positions, α n 1 T p is a matrix (n × p) of rows main effects (all matrix rows equal), 1 n β T p is a matrix (n × p) of columns main effects (all matrix columns equal). For each complete Y matrix, observations were then randomly deleted at the specified percentages shown in Table 1 and outliers were produced in the remaining data at the companion percentages in Table 1. The positions of the outliers were chosen randomly, and the outliers were generated from N µ j + 5σ 2 j , σ 2 j , where µ j and σ 2 j represent the mean and variance of the j-th column (or j-th environment) of the non-missing values. The five imputation procedures were applied to the eight incomplete and contaminated matrices, the five imputation procedures were applied, having previously recorded the true data that had been randomly removed. This made it possible to assess the performance of the imputations by calculating the prediction error P e defined by [4] as the square root of the mean squared error (MSE) between the true values and the corresponding imputations.

Numerical Example 2: Cross-Validation on Real Data from Experiments with Genotype-by-Environment Interaction
The Ontario winter wheat dataset with 18 genotypes in nine environments was published by [27] to show an application of biplot analysis, and recently these data were used to illustrate new proposals for distribution-free multiple imputation [9] and new bootstrap methods to determine the optimal number of multiplicative components in AMMI models [28].
A difficulty in evaluating the behavior of robust imputation systems in complete real data is the lack of a priori knowledge of what "good behavior" should be. For this reason and following the recommendation of [19], we consider adding a small contamination of the matrix elements and removing some elements. Initially, 10% of the elements of each environment were removed, but unlike the previous simulation study, we used an MNAR mechanism. To obtain non-random deletions, we obtained the 10th percentile of each column P10 j ; j = 1, . . . , 9 and any observation that was less than P10 j was eliminated to obtain an incomplete matrix. On the resulting matrix, 10% of positions chosen at random were contaminated following the same contamination process used in the simulation study mentioned above.
When the "Ontario" matrix was incomplete and contaminated, each of the values present in the matrix was eliminated and imputed with the robust strategies presented from the remaining elements. Thus, a complete matrix of imputations for each method was obtained by cross-validation and could be compared with the original matrix using P e [4]. In addition to the prediction error, we added two more statistics proposed by Gabriel [29] to identify the best method. One of these statistics was the coefficient . The GF 2 statistic will always be in the range [0, 1] and the closer it is to 1, the better are the imputations. On the other hand, GF 1 cannot be greater than GF 2 , but it can take negative values which indicate that the imputations are of lower quality than if matrix I were the zero matrix. This second numerical study was also used to compare our new robust proposals with a method that has had good results in data science, the missForest method [8]. missForest is an imputation method based on random forests (predictors that consist of a collection of randomized regression trees) that has a high versatility as it is nonparametric, it can be applied to continuous, categorized or mixed data (categorized and continuous) and it has a computational implementation in the statistical language R [30,31]. Table 2 shows the results of the first numerical example. When data were incomplete (10% and 20% of missing values), but without contamination, the SVD88 algorithm always obtained the smallest prediction errors. This behavior is as expected, since without outliers the procedures based on least squares generally perform well [12]. This situation changes completely when there is contamination of the data with 5, 10 and 15% of outliers. In these cases, it can be seen that as the number of outliers increases, the SVD88 prediction errors also increase. Thus, SVD88 is highly affected by contamination, producing low quality imputations. It can also be seen that in these contaminated scenarios the four proposed strategies of robust imputation always surpass SVD88 in terms of P e . It remains, then, to analyze in detail the performance of the robust imputations.

Results and Discussion
In general, using the mean and median of the prediction errors, M5RobSVD provides the best quality of imputations when there are outliers because it minimizes the values of such statistics. However, at the highest imputation and contamination percentages, (20% and 15%, respectively) the M5GOKimputation strategy marginally outperforms M5RobSVD. Moreover, the remaining two strategies, rSVD84 and GOKimputation, are also effective in the presence of outliers because they had lower P e values than SVD88, but in no case were they able to improve on M5RobSVD or M5GOKimputation. These results show that the elimination of the singular values of the imputation equation within a robust procedure favors convergence and improves the quality of the imputations. Table 3 shows the results of the second numerical example based on the "Ontario" matrix. It is noteworthy that the four strategies proposed in this article outperformed both the SVD88 and the missForest method, obtaining GF 1 and GF 2 values close to 1 and minimizing P e . In this cross-validation exercise on real data, SVD88 presented the worst performance with very poor-quality imputations, which can be concluded from the negative value in the GF 1 statistic. This situation can be explained by the lack of convergence in the imputation of some observations. On considering the GF 2 and GF 1 statistics, the four methods M5RobSVD, rSVD84, GOKimputation M5GOKimputation showed values of approximately 0.98 with very small differences only in the third and fourth decimal places. This indicates that the imputations are all high quality and the performance difference among these methods is very small. On the other hand, the prediction error was minimized using rSVD84, so taking into account the values of all three statistics then rSVD84 can be chosen in this case, because it is the fastest computationally robust method.
The numerical results presented so far show that our four robust imputation alternatives (M5RobSVD, rSVD84, GOKimputation and M5GOKimputation) work very well and in the case of data contamination the SVD88 algorithm should be replaced by one of them. The reader may therefore be left with the question: which method to choose according to the results obtained in this study? answer this question, outliers must first be detected and for this reason the "cellWise" package from R ( [13]; cellWise available from CRAN; https://cran.r-project.org/web/packages/cellWise/index.html, accessed on 25 June 2021) is recommended. Once outliers are detected [32], the dimension of the matrix under analysis must be taken into account. If the matrix is small (for example 18 × 9) rSVD84 can be applied, but if the matrix is larger (for example 100 × 8), the M5RobSVD and M5GOKimputation systems can be considered. In the latter case, if the imputation and contamination percentages were greater than 20 and 10% respectively, it is suggested to test both algorithms. From a computational point of view, in large matrices (100 × 8) M5GOKimputation presents greater speed if compared to M5RobSVD. All the proposals in this article are based on a robust SVD, for that reason in the Appendix A function of R is presented to calculate the rSVD84.
Finally, it is worth answering one more question that may arise for the applied researcher when analyzing contaminated and unbalanced data: what is the maximum percentage of contamination that can be tolerated if an effective imputation of missing values is to be made? There is a lot of literature on the two problems and very useful references can be found in the recent study by [32]. However, we suggest following a rationale derived from a simulation study described by [33] and from the results obtained by [4]. In a real situation, algorithms based on SVD can work with up to 60% of missing values [4], but if there is a high proportion of outliers in the remaining data, imputations will necessarily be affected by them and will produce low quality of results [33]. Taking this into account, we recommend only considering imputation on incomplete matrices that still have at least half of the observed data without outliers. For example, one rule of thumb might be to add the percentages of missing data and outliers, and only proceed if that sum is not greater than 40%. This will ensure that at least 60% of "clean" data can be effectively used for the purpose of completing the matrix. With less than 60% of "clean" data any imputation results should be treated with caution.

Conclusions
In this paper, we have focused on nonparametric imputation based on the singular value (SVD) decomposition [10,11]. SVD is a least-squares technique, and in the presence of outliers it is known that least-squares imputation methods can produce low quality imputations [12]. The main aims of our study were therefore:

1.
To consider robust versions of the method that will allow for outliers.

2.
To investigate the robustness of all the proposed methods, in what we believe to be the first extensive such robustness study.

3.
To provide advice as to which specific method to use in a given practical application requiring imputation.
Our conclusions regarding these aims can be summarized as follows. The basic method, denoted SVD88 in the text, performs well in the absence of outliers but performance deteriorates markedly as the number of outliers increases. Imputations in all cases are indeed of low quality. Four possible robust versions of SVD88 were identified and studied. All improved substantially on SVD88 so in the case of contaminated data one of them should be used instead of SVD88. In a practical application, if the matrix is small then rSVD84 is recommended, but for larger matrices either of M5RobSVD or M5GOK would be preferable. If computational speed is needed, then the latter should be chosen. As for the maximum number of missing or contaminated observations to allow, a suggested rule of thumb is to add the percentages of missing and outlier observations and to proceed only if the sum is not greater than 40%.