Comparing Single and Multiple Imputation Approaches for Missing Values in Univariate and Multivariate Water Level Data

: Missing values in water level data is a persistent problem in data modelling and especially common in developing countries. Data imputation has received considerable research attention, to raise the quality of data in the study of extreme events such as ﬂooding and droughts. This article evaluates single and multiple imputation methods used on monthly univariate and multivariate water level data from four water stations on the rivers Benue and Niger in Nigeria. The missing completely at random, missing at random and missing not at random data mechanisms were each considered. The best imputation method is identiﬁed using two error metrics: root mean square error and mean absolute percentage error. For the univariate case, the seasonal decomposition method is best for imputing missing values at various missingness levels for all three missing mechanisms, followed by Kalman smoothing, while random imputation is much poorer. For instance, for 5% missing data for the Kainji water station, missing completely at random, the Kalman smoothing, random and seasonal decomposition methods had average root mean square errors of 13.61, 102.60 and 10.46, respectively. For the multivariate case, missForest is best, closely followed by k nearest neighbour for the missing completely at random and missing at random mechanisms, and k nearest neighbour is best, followed by missForest, for the missing not at random mechanism. The random forest and predictive mean matching methods perform poorly in terms of the two metrics considered. For example, for 10% missing data missing completely at random for the Ibi water station, the average root mean square errors for random forest, k nearest neighbour, missForest and predictive mean matching were 22.51, 17.17, 14.60 and 25.98, respectively. The results indicate that the seasonal decomposition method, and missForest or k nearest neighbour methods, can impute univariate and multivariate water level missing data, respectively, with higher accuracy than the other methods considered.


Introduction
Water level is a measure of water depth in rivers/basin/lakes within a given place over time.Studying water level is important as it can provide warnings for flood risk, which helps to limit the impact of flood disasters on the local population and is also crucial for effective water resources management and for policy makers [1].The study of water level is also important for the health of a river, to determine the required level for plants and animals to survive at various times of year [2].
Khalifeloo [3] states that recent extreme events globally, such as flooding, drought and bush burning, among other natural disasters, were caused largely by climate change, and these have attracted the attention of researchers to try to provide solutions.Accurate prediction of such extreme events can allow mitigating measures to be implemented.However, much water level/hydrological data is complicated by missing observations, especially in developing countries [4,5], which makes accurate prediction very difficult.which are data deletion (complete-case analysis or available-case analysis) and single imputation [20].The former approach is criticised because it reduces sample size and can make statistical analysis difficult, especially for temporal data [7].It also ignores the causes of missingness [8].However, the latter approach maintains the original sample size and provides a basis for smooth statistical analyses [21].
Single imputation involves replacing each missing value with a single value [22], whereas the alternative, MI, generates two or more values for each missing value [23].A few single imputation methods are mean, median, mode and random imputations.Despite their usability, most single imputation methods underestimate variance or uncertainty about the missing values, which yields invalid tests and confidence intervals since the estimated values are derived from the ones present, and may also produce biased parameter estimates [10,24,25].They also ignore relationships between variables [26].Therefore, MI is preferred where applicable.The availability of these advanced methods in software (especially MI) enables researchers to readily replace missing values with imputed values [27].
The MI method was first proposed by Rubin [28] and simply replaces each missing value by a vector of D imputed values, D ≥ 2. The D values are ordered in the sense that D completed datasets can be created from the imputed vectors; replacing each missing value with the first component in its vector of imputations creates the first completed dataset; replacing each missing value with the second component in its vector creates the second completed dataset, and so on [29].In another definition [25], MI is said to be the act of replacing the nonresponse item in the dataset with more than one value; as a result of which, several datasets will be created from it.It has been argued that no single set of imputations or methods of imputation can satisfy all missing data [6], which implies that one method may be better than another method for a particular type of data.
Some more sophisticated single imputation methods for handling univariate time series, comprising the seasonal decomposition, random and Kalman smoothing methods, are available in the imputeTS package in the R software [30].The authors recommended Kalman smoothing and the seasonal decomposition methods for imputing complex univariate time series data.Wijesekara and Liyanage [31] found Kalman smoothing to be the best method for imputing air quality data.Kalman smoothing is robust for smaller datasets and recommended for imputing high-resolution data [32][33][34].The seasonal decomposition method was identified as the most effective for imputing univariate time series by Moritz et al. [35].
Other studies include Jadhav et al. [36], who compared imputation methods on five numeric datasets and based on the root mean square error (RMSE) statistic found that k nearest neighbour (kNN) imputation outperformed another six methods comprising mean, median, predictive mean matching (PMM), Bayesian linear regression, linear regression (non-Bayesian) and random methods.Alsaber et al. [37,38] identified missForest and kNN as appropriate to impute both continuous and categorical variables, compared to Bayesian principal component analysis, expectation maximisation with bootstrapping, PMM, kNN and random forest methods for imputing rheumatoid arthritis and air quality datasets, respectively, using RMSE and mean absolute error criteria.
In the last decade, the application of imputation techniques for missing data in hydrological studies has received increasing interest.Ben Aissia et al. [39] recommend multiple imputation rather than mean or interpolation imputation methods for multivariate missing data in hydrological frequency analysis.Various imputation methods which can be used for hydrological missing data, including the arithmetic mean, median and regressionbased methods, and imputation based on principal components and multiple imputation were reviewed in Gao et al. [10], who recommended the use of autoregressive conditional heteroscedasticity models for imputation, since these can produce accurate forecasts of non-constant volatility and incorporate heteroscedasticity, which are synonymous with hydrological data.Hamzah et al. [40] evaluated three methods for estimating missing values of daily streamflow from the Langat River basin, namely robust random regression imputation, kNN and classification and regression trees (CART); each of these methods was combined with multiple linear regression (MLR), and based on RMSE and mean absolute percentage error (MAPE) statistics, CART-MLR was said to be the best imputation method.A similar follow-up study found hybridised CART-MLR to be best for all considered missing data percentages, followed by PMM-MLR based on Adj R 2 , root square error and MAPE statistics [41].
To the best of our knowledge, these various approaches to imputation have not been fully exploited in imputing missing values for hydrological data such as water level discharge in Nigeria.However, the work in [13] for the first time introduced the concept of MI for imputing annual peak river discharge, which is vital for flood frequency estimation.The authors compared satellite radar altimetry and MI for five hydrological stations, namely Baro, Lokoja, Umaisha, Onitsha and Taoussa stations.It was concluded that for a dataset with 3 years or fewer missing values, both methods can be utilised.However, for data with more than 3 years missing, radar altimetry was better.Oyerinde et al. [42] used the PMM method to impute missing data from 22 water discharge stations with different missing percentages from 2-70% and recommended PMM for imputing data gaps in data from the Niger basin.However, these two studies failed to consider the missing data mechanism/pattern, which is important for imputing missing data.In addition, they failed to consider single imputation methods, which are frequently suitable for imputing complex univariate time series [30].
This paper presents a comparative study of single and multiple imputation methods for missing values in water level discharge time series data from four water stations on the major rivers Benue and Niger in Nigeria, assuming three missing data mechanisms in each case.The three single imputation methods used are Kalman smoothing and the random and seasonal decomposition methods for the Kainji water station on the river Niger.Four imputation methods comprising random forests, missForest, kNN and PMM were used to impute missing data from the Ibi, Makurdi and Umaisha water stations on the river Benue.The results should be helpful for selecting a suitable imputation approach in future water level studies where data are missing and the probable missing data mechanism can be identified.
A number of key abbreviations are used in this paper and these are summarised in Abbreviations, for reference.

Materials and Methods
This section will briefly describe the study area and the methodology for the analysis.

Study Area
The Kainji water station on river Niger, and Ibi, Makurdi and Umaisha water stations on the river Benue (Table 1, Figure 1) were chosen as the study area.The two rivers cut across various states in the northern part of Nigeria, with confluence at Lokoja in Kogi state.There are 194 water monitoring stations currently in the country, mostly located along the rivers and dams, to monitor the movements of water level discharges.From these 194 stations available, four water stations were selected as having more up-to-date data than the rest, as some stations have no records from 1980 to date and some have 2 years of records only, for example, as found from a summary of the Nigeria Hydrological Services Agency (NIHSA) records.The main reasons for the data gaps at these water stations are that some people in most communities close to the stations vandalised instruments and/or there were faulty recording instruments.Since the Ibi, Makurdi and Umaisha water stations are all located on the river Benue and the observations are recorded at the same time-points for all three stations, we combined these water level datasets to generate multivariate data, representing water levels at three locations on the river, and data deletion and imputation were conducted on these combined data.After imputation, the simulated (complete) data and the imputed data were compared for each water station separately to obtain the two performance The dataset from Kainji water station obtained from NIHSA is complete; hence, it was used for the univariate imputation directly.However, Ibi, Makurdi and Umaisha stations have 28, 49 and 23 missing observations in the monthly water levels, respectively, or 39%, 68% and 32% missing data percentages, and the predictive mean matching (PMM) imputation method was used first to impute these missing values for the purposes of this study.To avoid influence of this choice on the results of this study, time series models were fitted to these completed data and the datasets used for the multivariate imputation were simulated from these fitted models.The seasonal autoregressive integrated moving average models (4,0,1)(1,0,1); (2,0,2)(1,0,1); and (1,0,2)(1,0,1) were found to be the best fitting models for Ibi, Makurdi and Umaisha stations, respectively.
Since the Ibi, Makurdi and Umaisha water stations are all located on the river Benue and the observations are recorded at the same time-points for all three stations, we combined these water level datasets to generate multivariate data, representing water levels at three locations on the river, and data deletion and imputation were conducted on these combined data.After imputation, the simulated (complete) data and the imputed data were compared for each water station separately to obtain the two performance metrics.

Missing Data Mechanisms
Before imputing data, it is important to know the reason why the data are missing.The three missing data mechanisms under which missingness occurs comprise missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [43], reviewed briefly below.

Missing Completely at Random
A data value is said to be MCAR if the probability of missingness is the same for all units in the sample.This implies that the cause of missingness in the data is independent of the data itself.Following Santos et al. [44], suppose X is a data matrix of order n × p and x i,j represents the (i,j)th element in X, where i = 1, 2, • • • , n represents the cases and j = 1, 2, • • • , p represents the variables in the sample.Let X be divided into X obs and X mis representing the observed and missing values, respectively, and suppose we also have an indicator matrix R of order n × p, which indicates whether the element x i,j is missing or not: if r i,j = 0, then x i,j is missing, otherwise r i,j = 1 implies x i,j is observed.Then the probability distribution for MCAR can be written as in Equation (1): where ω is the parameters of the missing data in the model.

Missing at Random
Under the MAR mechanism, the probability of missingness may depend on the observed data but not on the value(s) missing, as in Equation ( 2): (2)

Missing Not at Random
The MNAR mechanism is present when the probability of missingness in a variable is said to be dependent on the observed and unknown data, which implies that the missing data may be associated with both X obs and X mis , as in Equation (3): (3)

Imputation Methods for Univariate and Multivariate Data
Imputation methods used here for univariate and multivariate data are briefly described below.

Imputation Methods for Univariate Water Level Time Series Data
Three single imputation methods are used, namely Kalman smoothing and the seasonal decomposition and random methods.These methods, especially Kalman smoothing and seasonal decomposition, were selected to impute univariate water level because they frequently produce best results for longer and complex time series data [30].
All three of these methods were implemented here using the R package imputeTS version 3.3, for univariate time series imputation.

Kalman Smoothing Method
The Kalman smoothing method operates on a basic structural model or the state space representation of an autoregressive integrated moving average (ARIMA) model [30].The origin of Kalman smoothing may be traced back to R.E.Kalman, who introduced a recursive solution to the discrete-data linear filtering problem in 1960, and over time the method has received much interest, particularly in autonomous and assisted navigation [45].A Kalman filter, as defined by Maybeck [46], is an optimal recursive data processing algorithm which utilises all available information, irrespective of precision, to estimate the variable of interest.Welch and Bishop [45] defined the Kalman filter and smoother as a set of mathematical equations which efficiently compute the posterior distribution over latent states of a linear state space model given some observed data, and these equations do not carry out any learning.
Kalman filters derive from Gaussian state space models [47].These models involve observation and state vectors, given in Equations ( 4) and ( 5), respectively.
where y t is the vector of observations and x t is the state vector at time t and Q t and R t are the process and measurement noise, respectively, x 1 ∼ N(µ 1 , Σ 1 ).The matrices A, C and D are assumed known, and v t and w t are assumed to be serially independent.After some derivations using Equations ( 4) and ( 5) for initial state x 1 , with known parameters, the Kalman filter is derived as in Equation ( 6): where K t is called Kalman gain and given by K t = AΣ t C F −1 t and where F t is a nonsingular matrix.
If µ t|t and Σ t|t are computed, µ t+1 = Cµ t and Σ t = CΣ t|t C + DQ t D can be used to predict the state vector (x t+1 and its variance matrix.
Implementing the Kalman filter on the available dataset, the optimal estimates of the states are obtained [48] and the data gap can be imputed using ŷt = Ax t . (7)

Seasonal Decomposition Method
The seasonal decomposition method removes (subtracts) the seasonal component from the time series via the Seasonal Trend decomposition of time series by the Loess filtering procedure and performs a chosen type of single imputation on the deseasonalised data, after which the seasonal component is added back [30,49].
If the time series, and its trend, seasonal and remainder components are denoted by The default linear interpolation algorithm was used here from the options (comprising mean, random, Kalman smoothing, weighted moving average, last observation carried forward, and interpolation) and implemented in the imputeTS package to fill in the missing data.

Random Method
The random method is a univariate imputation method where each missing value is replaced using a random sample from between two given bounds, where the default bounds are the minimum and maximum value from the observed time series and the method uses the uniform distribution to generate the random values to be selected [30].This approach is very common in survey practice but has very limited literature to support its use [29].

Imputation Methods for Multivariate Water Level
Four imputation methods were considered to impute water levels from Ibi, Makurdi, and Umaisha water stations on the river Benue.Two single imputation methods (kNN and missForest) and two multiple imputation methods (random forest and predictive mean matching) were used, and are briefly described below.
k Nearest Neighbour Method kNN imputation, or nearest neighbour imputation, is a donor-based learning algorithm, in which the imputed value is obtained either as an observed value for another variable from the record or as an average of the observed values [50,51].One main feature of this method [52], that is different from the other methods considered, is that the imputed values are actually observed values, not generated values, drawn to replace the data gap.kNN imputation is similar to hot-deck imputation, as data gaps are sorted and imputed sequentially, but also differs from hot-deck imputation as kNN computes k (number of neighbour) values which are the distances between the observed variables for all cases with missing values and the k nearest possible observed donors [51].The responses for these k neighbouring values are averaged to provide the imputed value.
To identify the optimal value of k, the value of k = 1, 3, 5, 7, 9, 11 and 15 were considered to implement the kNN imputation.It was evident that k = 7 and k = 15 consistently produced the best (lowest mean) results from either RMSE or MAPE to use in imputations for the five percentages missing.In general, k = 7 is a good choice for these datasets and it was used for imputation in this paper.
kNN imputation is implemented in the R VIM package [51] to find the distances to identify the nearest neighbours.The authors extended the Gower distance [53], a general coefficient to measure similarity between two sampling units and which can handle various data types.The distance between two observations a and b is given as where ω j is a weight which indicates the importance of the variable j and δ a,b,j is the contribution of the jth variable, and can be obtained for continuous variables as δ a,b,j = x a,j − x b,j τ j (10) where x a,j and x b,j are the values of the jth variable for the ath and bth observations, respectively, and τ j is the range of the jth variable.

Predictive Mean Matching Method
The name of this method was proposed by Little [6].However, the initial idea was conceived in the work of Rubin [54].The idea of PMM largely depends on linear model assumptions but with some modification on the residuals [6], as the method relaxes the normality assumptions.The author considered single and multiple nonresponses.For multivariate data, if y is the target variable to be imputed for a given case, the method generates plausible values for y using other variables in the data as follows.An imputation model is used to predict y from the other variables, for both complete and incomplete cases.These values are predicted means from a fitted regression model.A completely observed donor case is then identified for (matched to) each missing y value, as a case whose predicted y (predicted mean) value is closest to the predicted y for the missing value to be imputed.
That is, respondent l's missing y value is imputed as the observed value for that closest respondent as ŷi = y j (11) where μi − μj 2 ≤ μi − μq 2 for all q, μi is the predicted mean of Y for individual l, and y j is the observed value of Y for respondent j [6].This procedure is a way of adding noise into the imputation with the aim of preserving the distribution of the y values.For multiple imputation, several nearest matches are found and the observed values from a subset of those is randomly sampled with replacement to provide multiple imputed values of y.This procedure is repeated for all missing values.Bayesian implementation induces greater noise in the imputations by drawing μi in (11) from the posterior distribution of the predicted mean for the missing value [6].
Van Buuren and Groothuis-Oudshoorn [55] describe the PMM method as simple and widely used and highlight the procedures to follow when implementing PMM imputation with the R mice package, which is used here for PMM imputation, and Bayesian PMM is carried out via the Gibbs sampler.We use PMM for multiple imputation with five multiple imputed values (m = 5) generated for each missing value and one of the five imputed values was selected at random to replace the missing value.The method requires specification of the number of iterations (maxit), which is important for the approximation to the posterior distribution to converge.We used 1000 iterations, having observed no difference in the results using 1000 iterations or more than 1000.

Random Forests Method
A decision tree has a tree-like structure with three parts, namely decision nodes, leaf nodes and a root node (Figure 2).A decision tree algorithm divides a training dataset into branches, which further divide into other branches via these nodes.This sequence continues until a leaf node is attained and cannot be separated further.Decision nodes provide a link to the leaves and are used for predicting the outcome of an observation.

Random Forests Method
A decision tree has a tree-like structure with three parts, namely decision nodes, leaf nodes and a root node (Figure 2).A decision tree algorithm divides a training dataset into branches, which further divide into other branches via these nodes.This sequence continues until a leaf node is attained and cannot be separated further.Decision nodes provide a link to the leaves and are used for predicting the outcome of an observation.Random forests (RF) are a combination of tree predictors where each tree depends on the values of a random vector sampled independently for that tree from the available predictor variables and with the same distribution for all trees in the forest [56].RF is a machine learning algorithm utilising ensemble learning to provide solutions to complex Random forests (RF) are a combination of tree predictors where each tree depends on the values of a random vector sampled independently for that tree from the available predictor variables and with the same distribution for all trees in the forest [56].RF is a machine learning algorithm utilising ensemble learning to provide solutions to complex problems in regression and classification [57].RF is a powerful and flexible algorithm applicable in various sectors such as banking, stock exchange and health applications.It is also applied for imputing missing values, which is the focus here.Breiman [56] defines a RF mathematically as a classifier consisting of a collection of tree-structured classifiers {g(x, θ l ), l = 1, . ..}where the {θ l } are independent identically distributed random vectors.For input x, each tree casts a unit vote for the most popular class among training observations in the leaf node reached by that input vector.For prediction of a continuous outcome, each tree predicts the average value of the training observations in the leaf node reached by input vector x.The RF algorithm combines the decision trees to predict by taking the average value from all of these trees, and the prediction accuracy increases as the number of trees increases.
Several authors adopted RF to implement various packages for imputing missing values in the R software.For instance, Stekhoven and Buehlmann [58] implement the RF algorithm in the missForest package, and Doove et al. [59] implement RF in the mice package.Other R packages that implement RF for missing data imputation include the CALIBERrfimpute, randomForest, randomSurvivalForest and randomForestSRC packages [60,61].Several researchers compare various missing data imputation methods from these packages and conclude that missForest gives lower imputation error [61][62][63].
This work will use the mice and missForest packages to impute missing data using the RF algorithm.The missForest package incorporates interaction and nonlinearity in the model and can handle both continuous and categorical missing data with no reliance on any distribution, although it deviates from the implementation in Breiman [56] slightly.Its advantages over other algorithms make the RF method popular for missing data imputation [60].In the mice package univariate missing data is imputed using an RF algorithm based on Breiman [56].It is important to highlight that the mice package has more functions than the missForest package.
To identify the best number of trees to use in implementing RF and missForest for this study, 10, 50, 100 and 500 trees were considered.It was evident that 50 and 500 trees generally produced the best (lowest mean) results from either RMSE or MAPE criteria (defined immediately below) for the five percentages missing.Therefore, for missForest we use 500 trees as run time is very fast.However, for the random forest from the mice package, as the number of trees grows the run time becomes much higher and we do not recommend using a higher number of trees.In that case, we use 50 trees.

Evaluation Metrics
Two indicators, root mean square error (RMSE) and mean absolute percentage error (MAPE) were used to evaluate the performance of these imputation methods.The RMSE measure provides a broad representation of the error distribution from the method/ model [64] and MAPE gives an intuitive interpretation for the relative error [65].These statistics are calculated using Equations ( 12) and (13).
where y i and ŷi are the ith observation of the complete and imputed water level discharges, respectively, and n is the sample size.In general, the smaller the values from these two indicators, the better the performance of the imputation method.

Results
The evaluation and comparisons between imputation methods are presented below to identify the best method for imputing univariate and multivariate water level discharges.

Univariate Water Level Imputation
Evaluation of the three single imputation methods comprising Kalman smoothing (KS), random and seasonal decomposition (Sdec) is performed using monthly water level records from the Kainji water station covering years 2010-2016.Missing data at six levels of missingness, 5%, 10%, 20%, 30%, 40% and 50%, were created in the complete dataset assuming MCAR, MAR and MNAR mechanisms using the R missMethods package.These missing values were then replaced with new values generated from the three methods from the R imputeTS package.
The RMSE and MAPE measures were used to compare the imputation accuracies from these different methods of imputation for each missing value mechanism.To conduct this, 30 repeats of the data deletion and imputation were run for each imputation method to provide the results recorded.Finally, summary statistics for RMSE and MAPE from the 30 repeats were obtained and summarised using the mean and standard deviation.Lower values of RMSE and MAPE and low standard deviations are desirable for low error and low variability.Tables 2 and 3 show the results for RMSE and MAPE, respectively.Table 2 shows that the Sdec method is best for imputing missing values for each of the three missing data mechanisms, as it has the lowest (bolded) mean values in each case for all six missingness percentages, followed by the KS and random methods in that order.The random method is always much the worst.See also Figure 3 for boxplots of the results.It is important to note that at lower levels of missingness, especially at 5% and 10%, the KS method (with mean RMSE of 13.61 and 25.36, 16.35 and 22.44, and 15.61 and 25.42, for MCAR, MAR and MNAR mechanisms, respectively) is not much worse than Sdec (which has mean RMSE of 10.46 and 21.22, 13.53 and 19.12, and 13.76 and 22.33, respectively) and the random method comes last.This finding is consistent with the study of Moritz and Bartz-Beielstein [30].The Sdec method also in most cases has the lowest standard deviation, implying greater consistency than for the other methods; however, at 40% and 50%, the random method seems to be more consistent than the other methods.
The mean RMSE values range between 10.46 and 318.10 for MCAR, 13.53 and 320.50 for MAR and 13.76 and 319.90 for MNAR.These minimum and maximum values for the MCAR mechanism are slightly lower compared to MAR and MNAR.The mean RMSE also increases as the percentage of missingness increases for any method or missing data mechanism, which implies, not surprisingly, that these methods are better when dealing with fewer missing values than at higher levels of missingness.
deviation, implying greater consistency than for the other methods; however, at 40% and 50%, the random method seems to be more consistent than the other methods.Figure 3 clearly shows the level and spread of RMSE values at different percentages of missingness, assuming the MCAR, MAR and MNAR mechanisms.For all missingness rates, the values tend to cluster around the median RMSE for both KS and Sdec, especially from 5% to 30%, while the plots for the random method are located higher up and have a wider spread.Again, it is obvious that Sdec is best, based on smaller values of the median RMSE, closely followed by KS, and random is third by some distance.For the random method, the spread in RMSE values tends to decrease as the missing percentage increases, which is similar to the case for the standard deviation values; whereas, for the KS and Sdec methods at higher percentages of missingness, say 40% and 50%, the spread tends to widen and there are some outliers.
The summary of the MAPE statistic in Table 3 also suggests that the best method for imputing missing values at all levels for the Kainji water level discharge is Sdec, the KS method is next best and the random method is poorer.However, as the percentage of missingness in the data increases these differences between the methods become clearer.This can also be seen from the boxplots in Figure S1; however, at 50% missing, Sdec has a wider spread than the other methods, meaning that its MAPE result is more unpredictable at this higher percentage of missing values, similarly to Figure 3. Sdec also has the least variability for 5-30% missingness; whereas, for 40% and 50% missingness, KS generally has the least variability.In this case the MAPE values for the MAR mechanism are the lowest, followed by MCAR and then MNAR.As for RMSE, the mean values increase as the level of missingness increases.For MAPE, the standard deviation values also tend to increase with the level of missingness.Sdec is the best method at 50% missingness, based on the mean RMSE and MAPE, however, it is more variable than KS.Therefore, at that missingness level, KS may be preferred.

Multivariate Water Level Imputation
The performance of RF, kNN, missForest (MF) and PMM methods, i.e., two single imputation methods (kNN and MF) and two multiple imputation methods (RF and PMM), assuming MCAR, MAR and MNAR missing data mechanisms, were analysed using monthly simulated water level discharge from three water stations, namely Ibi, Makurdi and Umaisha on the river Benue, for the time period 2011-2016, as described above.Five rates of missing values (10%, 20%, 30%, 40% and 50%) were created using the ampute function from the R mice package (results for the 5% missing rate are not included here as it gave fewer points missing for any one station, since the 5% missing values were created across the whole dataset of values for the three stations, and as the results were very similar to those for 10% missingness.The 5% level was initially considered as a low percentage of missing values, for consistency with the univariate study).The missing values were replaced with new values generated from the four methods from the R mice package (RF and PMM), missForest package (MF) and VIM package (kNN).
The accuracies of these four imputation methods were assessed using RMSE and MAPE metrics, using 30 repeats, as above.The mean RMSE for each method at the five missingness levels for the three missingness mechanisms for the three water stations are presented in Table 4 and corresponding results for MAPE are shown in Table 5 (results for standard deviation are not shown).It was thought that a lower missingness level may have produced lower variability of results, and potentially lower (better) values of the accuracy measures.This is not the case, and from the results for all water stations the variability seems to be similar on the whole for all missing percentages for the methods considered.From Table 4, for the MCAR mechanism at 10% missingness, the methods with the lowest mean RMSE values are MF, kNN and kNN for the Ibi, Makurdi and Umaisha stations, respectively.At 20% missing, kNN, kNN and MF are best for Ibi, Makurdi and Umaisha, respectively.For 30% and 40% missingness, kNN is the best across all water stations.At 50% missing, the lowest mean RMSE values were for kNN, kNN and MF for Ibi, Makurdi and Umaisha, respectively (see also Figure S2, which shows that kNN and MF tend to have a lower spread around the median RMSE and are overall the best two methods, while PMM and RF tend to have a wider spread and/or be located higher up in the plots for all water stations).For the MAR mechanism at 10% and 20% missing, the best methods are MF, kNN and MF for Ibi, Makurdi and Umaisha, respectively.At 30% missing, kNN, MF and kNN for Ibi, Makurdi and Umaisha, respectively, have the lowest RMSE.For 40% missing, MF, MF and kNN are best, and at 50% missing, kNN, MF and MF are best for the Ibi, Makurdi and Umaisha stations, respectively (Table 4).(Figure S3 also tends to show a lower level and spread around the median RMSE values for MF and kNN, a wider spread for RF and overall PMM is worst).For the MNAR mechanism for 10% missing, kNN is best for all three water stations.At 20% missingness, MF, RF (which is similar to MF in this case) and kNN have the lowest RMSE values for the Ibi, Makurdi and Umaisha stations, respectively.For 30% and 40% missing, kNN is the best method across all water stations.At 50% missingness, RMSE is lowest for kNN, MF and MF for Ibi, Makurdi and Umaisha, respectively (Figure S4 also shows the best two methods as kNN and MF, in that order, and PMM and RF are the worst due to a higher level and spread).
From these results, for the MAR mechanism, MF is generally best, closely followed by kNN, while RF is slightly better overall than PMM, which is last.For the MCAR and MNAR mechanisms, kNN is best, closely followed by MF.There is a clear difference between the best two and the last two imputation methods.RF and PMM are worst.PMM and RF also tend to give a larger spread of RMSE values (Figures S2-S4).
The mean RMSE values range between 14.60 and 140.91 for MCAR, 15.39 and 150.09 for MAR and 19.11 to 147.40 for MNAR, depending on the imputation method and level of missingness, and these values are lowest for the MCAR mechanism compared to MAR and MNAR, which is similar to the conclusion from the univariate case above.
From the summary of the MAPE statistic (Table 5), the overall best method for imputing missing values at all levels of missing data for the three water stations along the river Benue assuming the MCAR mechanism is kNN, followed by MF.RF and PMM are not as good.For the MAR mechanism, MF is best, closely followed by kNN.Finally, MNAR has an interesting result with MF and kNN as the two best methods overall, in that order, while RF is sometimes not far behind, especially at 20% missing, and PMM is generally worst.
The corresponding boxplots for MAPE for the MCAR, MAR and MNAR mechanisms are presented in Figures S5-S7, respectively, and broadly confirm the results for RMSE.These plots do show many high outliers, indicating poor results in some instances, especially for the PMM and RF methods and for the Umaisha water station, but also in some cases for the kNN and MF methods for higher levels of missing data.

Discussion
Missing data, especially for hydrological data such as water level data, is a persistent problem for many developing countries for various reasons and ignoring missing data for analysis causes loss of information and potentially misleading conclusions, which in turn impacts mitigating measures taken by governments and other stakeholders.This study compares the performances of several single and multiple imputation methods, each on monthly water level data from the Kainji water station on the river Benue and the Ibi, Makurdi and Umaisha water stations on the river Niger in Nigeria.This aimed to identify the best method(s) for both single and multiple imputation approaches to avoid biased estimates and misleading conclusions [18].
Studies on imputation for missing data in hydrology-related areas have been carried out previously, such as in Gao et al. [10] and Hamzah et al. [40,41].In Nigeria for instance, using imputation in this area is a new practice.Ekeu-wei et al. [13] introduced the concept of multiple imputation to impute annual peak river discharge, which is vital for flood frequency estimation.Oyerinde et al. [42] used the PMM method to impute missing data from 22 water discharge stations with different missing data percentages.However, these two studies failed to consider the missing data mechanism/pattern, which is important for handling missing data.In addition, they failed to consider single imputation methods, which are frequently suitable for imputing complex univariate time series [30].
For both our univariate and multivariate data, performances of single and multiple imputation methods were compared using RMSE and MAPE statistics.Missingness was introduced in the data at several different rates in each case, using MCAR, MAR and MNAR mechanisms.
For the univariate water level data, of the three single imputation methods considered in this study, the Sdec method is best for imputing the missing data for the three missing data mechanisms.At lower levels of missingness, especially at 5% and 10%, KS is not much worse than Sdec but the random method was much poorer and is not recommended.Furthermore, Sdec in general has the lowest variability in RMSE or MAPE, except at the highest level of missingness.The results for MCAR data were better compared to MCAR or MNAR.However, for univariate time series imputation, assuming MCAR or MAR tends to give similar results [35].The mean RMSE increases as the percentage of missingness increases, which implies that these methods are better when dealing with fewer missing values, which is not surprising.
For the multivariate case, performances of single and multiple imputation methods were compared.We showed that for MCAR and MAR, missForest has the best results, closely followed by kNN; while, for the MNAR mechanism kNN is the best method, closely followed by missForest; whereas, RF and PMM have low accuracy in imputing data gaps, based on the two performance metrics considered.Coincidently, these two best methods are single imputation methods and are fast to execute, which is also consistent with our findings in the univariate case.
For RF, the random forest method, our study found no consistent improvement in the results as the number of trees increased using the random forest from the mice R package; but, it confirmed that using a large number of trees (say 500) is time consuming and would not be recommended in practice, which is consistent with the finding in Boehmke and Greenwell [66].For the kNN method, k = 15 and k = 7 consistently produced the best results from either RMSE or MAPE.This conclusion disagrees with the findings of Muinonen et al. [67], who say that imputation accuracy from kNN does not improve for k > 3 but confirm the findings of McRoberts et al. [68] that a higher value of k (k ≥ 7 in our case) may improve estimation accuracy and have a lower variability of results.
This article has examined several imputation methods with water level data for a range of missing value percentages.Clearly, any such study is limited and a wider range of techniques could be used.The focus here is on water levels, and different conclusions may have been drawn with different data.We also considered missing data percentages from very little, at 5%, to a substantial proportion, at 50%, and no higher than 50% as it has been reported that using imputation for datasets with more than 50% missing observations often produces biased results and high variability (unpredictability) in the imputed data [69].However, Madley-Dowd et al. [70] found that, for data where values are missing according to the missing at random mechanism, imputation can provide unbiased results even with a large percentage of missing data (up to 90% missing).The study in [42] also considered missing data from 2% up to 70% of the dataset size for water discharge data.Therefore, it would be worth examining this further in the context of water level imputation, especially as even in one of the raw datasets (from the Makurdi water station) used to construct the multivariate data 68% of the water levels were missing.

Figure 1 .
Figure 1.Map of Nigeria showing the four water stations (marked left to right as follows by symbols: Kainji (star), Umaisha (small square), Makurdi (diamond), Ibi (circle)) along the rivers Benue and Niger (source: Google Maps, modified by authors).

Figure 1 .
Figure 1.Map of Nigeria showing the four water stations (marked left to right as follows by symbols: Kainji (star), Umaisha (small square), Makurdi (diamond), Ibi (circle)) along the rivers Benue and Niger (source: Google Maps, modified by authors).

Figure 2 .
Figure 2. A decision tree and its three different types of nodes.

Figure 2 .
Figure 2. A decision tree and its three different types of nodes.

Table 1 .
Characteristics of the selected water stations.

Table 2 .
Comparison of the mean and standard deviation (in brackets) values of the RMSE statistic between the deleted original data and the imputed data for missing completely at random, missing at random and missing not at random mechanisms and the Kalman smoothing (KS), random and seasonal decomposition (Sdec) imputation methods.Values in bold show the best method in each case (with lowest mean or lowest standard deviation).

Table 3 .
Comparison of the mean and standard deviation (in brackets) values of the MAPE statistic between the deleted original data and the imputed data for missing completely at random, missing at random and missing not at random mechanisms and the Kalman smoothing (KS), random and seasonal decomposition (Sdec) imputation methods.Values in bold show the best method in each case (with lowest mean or lowest standard deviation).

Table 4 .
Comparison of the mean values of the RMSE statistic between the deleted original data and the imputed data for missing completely at random, missing at random and missing not at random mechanisms and the random forest (RF), k nearest neighbour (kNN), missForest (MF) and predictive mean matching (PMM) imputation methods.Values in bold show the best method in each case (with lowest mean).

Table 5 .
Comparison of the mean values of the MAPE statistic between the deleted original data and the imputed data for missing completely at random, missing at random and missing not at random mechanisms and the random forest (RF), k nearest neighbour (kNN), missForest (MF) and predictive mean matching (PMM) imputation methods.Values in bold show the best method in each case (with lowest mean).