Evaluating the Performance of Multiple Imputation Methods for Handling Missing Values in Time Series Data: A Study Focused on East Africa, Soil-Carbonate-Stable Isotope Data

Abstract: In all fields of quantitative research, analysing data with missing values is an excruciating challenge. It should be no surprise that given the fragmentary nature of fossil records, the presence of missing values in geographical databases is unavoidable. As in such studies ignoring missing values may result in biased estimations or invalid conclusions, adopting a reliable imputation method should be regarded as an essential consideration. In this study, the performance of singular spectrum analysis (SSA) based on L1 norm was evaluated on the compiled δ13C data from East Africa soil carbonates, which is a world targeted historical geology data set. Results were compared with ten traditionally well-known imputation methods showing L1-SSA performs well in keeping the variability of the time series and providing estimations which are less affected by extreme values, suggesting the method introduced here deserves further consideration in practice.


Introduction
Handling missing values is a common challenge in almost all areas of study: from economics and social sciences to geology, archaeology and medicine [1][2][3]. As the primary aim of any data collection process is to obtain a more profound domain knowledge, the presence of missing values which causes failure in "complete-case" analysis is clearly undesirable.
Although almost all quantitative studies are affected by incomplete data, missing values are particularly prominent in longitudinal data. In this study, we introduce a new method of missing values imputation using a non-parametric time series analysis technique: singular spectrum analysis (SSA). Imputing missing values using this relatively new but powerful time series analysis method was expected to provide analytical improvements, which are discussed in the subsequent sections.
We used the East Africa soil-carbonate-stable isotope (δ 13 C) dataset, which is an excellent example of a widely used dataset with large proportions of missing data. East Africa soil-carbonate-stable isotope (δ 13 C) was collected from multiple sites of Ethiopia, Kenya and Tanzania, and later compiled by Levin in 2013 [4] (see Figure 1). The stable carbon isotopic has been widely used in archaeology to infer paleodiet, artefact provenance, and paleoenvironment [5]. The presence of considerable missing values in this data introduces an element of ambiguity into inferential analysis by affecting properties of statistical estimators such as variance or periodicity. Also, as discussed in several studies, ignoring missing values of the East Africa soil-carbonate-stable isotope (δ 13 C) or other similar datasets may lead to misinterpretation of macroevolutionary patterns [6], drawing inaccurate connections existed among lineages [7] or even invalid timing estimation of diversification events [6].
In this study, a comparison was performed on several imputation techniques and the estimated values produced by SSA. We first provide a brief introduction to SSA and its characteristics which make it a suitable candidate for imputing missing values in such a powerful and prominent archaeological time series data with the oldest observation related to four million years ago. We will then briefly discuss the selection of methods which have been applied to deal with this problem.
Our results suggest that the newly introduced method based on SSA technique produces a robust estimation of missing values and deserves further consideration in practice.
The remainder of this paper is organised such that Section 2 presents an outline of the steps underlying the SSA techniques and describes the newly introduced approach. Section 3 provides an overview of the other imputation techniques evaluated in this study. Section 4 explains the data under study, and in Section 5, the results obtained using the East Africa soil-carbonate-stable isotope data are discussed. The paper concludes with a concise summary in Section 6.

Review of SSA
The SSA technique includes two complementary stages: decomposition and reconstruction, each of which consists of two separate steps. The first stage decomposes a time series into several components that allows for signal extraction and noise reduction. The reconstruction stage leads to a less noisy series using the leading eigentriples of the trajectory matrix. The most common version of SSA is called basic SSA. It is noteworthy that the matrix norm used in basic SSA is the Frobenius norm or L 2 -norm. A newer version of SSA which is based on L 1 -norm, and is therefore, called L 1 -SSA, has been introduced and further explained in [8,9]; and it has been confirmed that L 1 -SSA is robust against outliers. In the following, the steps of these two versions of SSA are concisely presented and differences highlighted. The theory underlying basic SSA is explained in detail in [10]. For more detailed information on L 1 -SSA, see [8].

Stage 1: Decomposition (Embedding and Singular Value Decomposition)
In embedding step, the time series Y N = {y 1 , . . . , y N } is mapped into the vectors X 1 , . . . , X K where X i = (y i , . . . , y i+L−1 ) T and K = N − L + 1. The single choice of this step is the integer number L such that 2 ≤ L ≤ N − 1 called window length. The output of the embedding step is the trajectory matrix X = [X 1 : · · · : X K ] whose columns are the vectors X i . The trajectory matrix is a Hankel matrix in the sense that all elements on the anti-diagonal are equal.
In the singular value decomposition (SVD) step, the SVD of the trajectory matrix X is performed. The eigenvalues of XX T and corresponding eigenvectors are denoted by λ 1 , . . . , λ L (in decreasing order of magnitude) and (U 1 , . . . , U L ). If d = max{i, such that λ i > 0} = rank(X), then the SVD of the trajectory matrix in basic SSA can be written as These weights are diagonal elements of the diagonal weight matrix W = diag(w 1 , w 2 , . . . , w d , 0, 0, . . . , 0) and are computed such that and . L 1 is the L 1 norm of a matrix. For more information, see [8].
The matrix X I corresponding to the group I is defined as X I = X i 1 + · · · + X i p where I = {i 1 , . . . , i p }. For example, if I = {2, 5, 6}, then X I = X 2 + X 5 + X 6 . After computing that matrices for the groups I = I 1 , . . . , I m , the SVD of X can be written as (1) In Hankelization step, we seek to transform each matrix X I of the grouping step into a Hankel matrix so that these can subsequently be converted into a time series by combining the first column (row) and the last row (column) of the Hankel matrix. In basic SSA, Hankelization is obtained via diagonal averaging of the matrix elements over the anti-diagonals. Let A be an L × K matrix with elements a ij , 1 ≤ i ≤ L, 1 ≤ j ≤ K. By diagonal averaging, the matrix A is transferred into the Hankel matrix HA with the elements a s over the anti-diagonals (1 ≤ s ≤ N) using the following formula: where A s = {(l, k) : l + k = s + 1, 1 ≤ l ≤ L, 1 ≤ k ≤ K} and |A s | denotes the number of elements in the set A s . By applying diagonal averaging (2) to all the matrix components of (1), the following expansion is obtained: X = X I 1 + · · · + X I m , where X I j = HX I j , j = 1, . . . , m. In L 1 -SSA, Hankelisation corresponds to computing the median of the matrix elements over the anti-diagonals [8].

Imputation Based on SSA
Generally, in the iterative SSA imputation method, first, missing values are replaced by initial values, and then reconstructed repeatedly until convergence occurs [11]. The last reconstructed values are considered imputed values. This imputation algorithm contains the following steps: 1. Set suitable initial values in place of missing data (e.g., mean of the non missing data). It is noticeable that the SSA-based imputation can be performed via basic SSA or L 1 -SSA. We applied both basic SSA and L 1 -SSA to impute missing values in this investigation. It is worth mentioning that the mean of the non-missing data was utilised as an initial value for a missing data and not the final estimate. In the iterative SSA algorithm, initial values are replaced with reconstructed value until convergence occurs.

Other Imputation Methods
The other imputation algorithms of univariate time series which were used in this study are as follows: 1. Interpolation: linear, spline and Stineman interpolation. The average in this implementation is taken from an equal number of observations on either side of a missing value. For example, to impute a missing value at location i, the observations y i−2 , y i−1 , y i+1 , y i+2 , are used to calculate the mean for moving average window size 4 (2 left and 2 right). Whenever all observations in the current window are not available (NA), the window size is incrementally increased until there are at least 2 non-NA values present. The weighted moving average is used in the following three ways: The observations directly next to the ith missing value (y i−1 , y i+1 ) have weight 1/2, the observations one further away (y i−2 , y i+2 ) have weight 1/3, the next y i−3 , y i+3 have weight 1/4 and so on. This method is the variation of inverse distance weighting. • Exponential weighted moving average (EWMA): Weights decrease exponentially. The observations directly next to the ith missing value have weight 1 2 1 , the observations one further away have weight 1 2 2 , the next have weight 1 2 3 and so on. This method is also the variation of inverse distance weighting.
In this study, we use the moving average with a window of size 8 (4 left and 4 right). For SSA-based imputation methods, the R package Rssa was employed together with the R scripts generated by the authors. For more information on Rssa see [12][13][14]. All calculations of other imputation methods were done with the help of the R package of imputeTS [15]. More detailed information about the theoretical background of the algorithms such as interpolation and Kalman smoothing can be found in the imputeTS manual [16]. Kalman smoothing (or the Kalman filter) is a well-known method of time series analysis and it is not the smoothing part of interpolation.

Data
East Africa soil-carbonate-stable isotope data, compiled by Levin [4], has been widely used as a valuable source of information for various research communities [4,[17][18][19]. When compiling the data, the δ 13 C was measured against the Vienna Pee Dee Belemnite (VPDB) per millilitre (%) [4]. The compilation does not include data from non-pedogenic carbonates. Age is reported in Ma (millions of years ago) (For more information regarding the age calculation method see [4]). It is evident, while attempts have been made for this compilation to be as complete as possible, the published dataset contains large proportions of missing data. Also, the length of the original series was reduced from 1360 observations to 491 (including missing values) because multiple pedogenic carbonate nodules reported from a single soil outcrop (~1 m 2 ) were replaced by their average value. Figure 2 illustrates a plot of averaged δ 13 C values against age (black points). δ 13 C values range from −4.65 (%) to +6.23 (%) and ages are assigned as in [4]. To identify the location of missing values in the time series, a time interval of 0.02 Ma is considered and ages without a reported δ 13 C value are marked as missing values. The length of this data set is 491 and the number of NAs is 206. Hence, 42% of measurements are missing. Figure 3 shows the length of NA gaps (consecutive NAs) in the time series and presents a ranking of which gaps occur most often. The frequency of each gap and the associated number of NAs of that gap are also reported in that figure. For example, the gap of length two (2NAs) occurs 23 times, making up for overall 46 NAs; the gap of length three (3NAs) occurs eight times, making 24 NAs totally; and so on. The most frequent gap is of length one, occurring 40 times, and the longest gap has size 18.

Imputing Results
Figures 4-8 depict the application of imputation methods adopted in this study where the imputed values are shown in red. Figure 4 shows the results achieved by basic SSA and L 1 -SSA respectively. Note how the imputed values are not only consistent with the general pattern of the data, but also contain volatility with an amount similar to what is present in data without NAs, thereby providing the reader with a trusty outlook for the long-term prospects of the soil carbonate time series.
The window length (L) and the number of leading eigentriples (r) are two important parameters in SSA. It is well know that the performance of the imputation depends crucially on these parameters. In the case of no missing data, the general recommendation is to choose the window length close to half of the series length [20]. By replacing missing data with the mean of the non missing data and following the recommendations in [20], we chose L = 245. In order to choose r, the information contained in singular values and singular vectors of the trajectory matrix of the time series must be be used. In doing so, a scree plot of the singular values, one-dimensional and two-dimensional figures of the singular vectors, and the matrix of the absolute values of the weighted correlations can provide a visual method to select r [21,22]. The analysis of the eigenvectors showed that the eigenvectors with indices from 1 to 16 correspond to the signal and all the rest may be classified as being produced by noise. Therefore, we chose r = 16 in order to reconstruct the time series. More information on choosing parameters L and r can be found in [23][24][25][26].  In Figure 5, imputed values generate a smooth line. The interpolation methods appear to have difficulties in accurately capturing the variation when they are faced with a significant number of missing values, as is clearly visible within the last 30% of the data. The values imputed by Kalman Smoothing in Figure 6 are not significantly affected by extreme δ 13 C values; hence, leaving a series which appears to have a number of peak values. This is important as the complete-case analysis on the data may result in misleading conclusions by easily detecting those points as outliers. Thus, those points should not be considered outliers. It appears from Figure 7 that the LOCF and NOCB methods can be improved to provide better estimates, as all imputed values are equal in a particular gap when LOCF (or NOCB) is employed. We consider this aspect further in the discussion which follows. As visual inspections fall short of providing sound evidence, to compare the different imputing methods, statistical properties of the original and imputed time series are presented in Table 1.
To have a comprehensive view of different imputing methods employed here, the entire dataset was treated as the main source. Then, 10% to 40% of the dataset was randomly deleted and removed from the time series. Those missing data were then estimated with an imputing algorithm. The mean squared error (MSE) was utilised as the main criterion to compare the performances of the imputing algorithms. The mean values of MSEs are reported in Table 2, obtained from 1000 replications, for various levels of missing values (10% to 40%). The results confirm that the SSA-based approach works well. In addition, it can be concluded that the LOCF and NOCB methods have poor performance compared to other imputing techniques.

Conclusions
Following the recent theoretical and empirical success of L 1 -SSA at providing more reliable reconstructions and forecasts in comparison to basic SSA, in this study, the application of a number of imputation methods, including L 1 -SSA, was evaluated. We initially explained the SSA technique and produced an outline of the algorithms for imputing missing values based on L 1 -SSA. In brief, to estimate a missing value in a time series, the algorithm is optimised based on minimising the difference between consecutively replaced value and the attributed reconstructed one.
We compared the performances of SSA-based methods and interpolation, Kalman smoothing, LOCF, NOCB and weighted moving average approaches. It is noteworthy that we analysed them with 1D imputation. In addition to the descriptive analyses reported in the previous section, the results of cross-validation are also provided. The results yet again indicate the superiority of the SSA-based models over other methods considered here for various levels of missing values.
As confirmed by the measures of central tendency, the introduced approach of missing values processing is undoubtedly a practical benefit, in particular to those researchers working with time series datasets with significant missing values. As was mentioned before, SSA is a non-parametric time series analysis and signal processing technique which does not rely on any assumptions. Therefore, it can practically lend itself as an imputation method to other types of datasets.

Acknowledgments:
The authors would like to thank the editor and anonymous referees for their invaluable comments, feedback and suggestions which helped improve the quality of this manuscript. Our heartfelt gratitude to the referees whose comments were very helpful in contributing to a major improvement in the revised version.

Conflicts of Interest:
The authors declare that they have no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: