Missing data in time series are common in many fields of earth science, and many studies have developed techniques to fill the resulting gaps [17]. Since no other attributes were available for estimating missing values in the lakes’ observational VTS, focus was placed on the Landsat-derived lake extent data, which have time intervals of about 16 days. The missing values were characterized as “missing at random” (MAR) [25], prompting the use of univariate methods to generate a complete VTS. Univariate techniques include deletion and substitution as well as interpolation, smoothing, and seasonal decomposition [26]. The consensus is that simple univariate imputation algorithms (such as deletion or substitution) yield inferior results, while more sophisticated approaches (such as interpolation, Kalman smoothing, and seasonal decomposition) are expected to yield better results [26]. Because the target time series exhibit a general trend, deletion and replacement methods were considered unsuitable. Inspection of the time series also showed that changes in volume for both lakes were generally smooth (monthly time scales). On this basis, interpolation, smoothing, and seasonal decomposition approaches were applied to construct a monthly VTS, which is well matched to the typical time increments (daily, weekly, or monthly) of atmospheric data sets. Note that extreme events such as hurricanes at times produced a very fast response, with time scales of just a few days, prompting the need to address these periods of rapid change separately. Toward this end, the lakes’ VTS first had to be constructed with an equal and small time step (16 days), which required a strategy to fill in missing values, after which the time series was resampled to a monthly time step.
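The two-stage procedure described above (fill the gaps on the 16-day grid, then resample to monthly steps) can be illustrated with a minimal sketch. The volume values are hypothetical, and plain time-weighted linear interpolation from pandas is used only as a stand-in for the interpolation, smoothing, and seasonal decomposition methods named in the text; it is not claimed to be the algorithm applied in this study.

```python
import numpy as np
import pandas as pd

# Hypothetical 16-day lake-volume series (arbitrary units); NaN marks a missing scene.
dates = pd.date_range("2001-02-22", periods=12, freq="16D")
volume = pd.Series(
    [10.0, np.nan, 10.4, np.nan, np.nan, 11.0, 11.2, np.nan, 11.6, 11.8, np.nan, 12.2],
    index=dates,
)

# Stage 1: fill the gaps on the 16-day grid.
# Assumption: linear interpolation in time stands in for the more
# sophisticated univariate methods discussed in the text.
filled_16d = volume.interpolate(method="time")

# Stage 2: resample the complete 16-day series to monthly mean volumes.
monthly = filled_16d.resample("MS").mean()

print(monthly)
```

Averaging within each calendar month is one possible resampling choice; taking the last value of each month would be an equally defensible alternative for a slowly varying storage variable.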
2.2.1. Alternative Observational ∆V Datasets and the Characteristics of Sudden Changes
The rate-of-change time series is derived by taking the difference of two consecutive values of the variable. While this is straightforward for an evenly spaced time series (∆T = constant), an irregularly spaced series requires a valid pair of consecutive points throughout the temporal domain, which is much more challenging when the spacing is uneven or data are missing over prolonged periods. Short of interpolating between observed data points so that equidistant values can be calculated, any choice of a constant ∆T therefore risks omitting crucial information, because data points are disqualified for not being “on the mark”. To preserve most of the data points in the lakes’ observational VTS, alternative datasets were constructed with ∆T equal to multiples of the 16-day interval, the highest frequency available. The resulting set of ∆T-datasets provided insight into the lakes’ characteristics, some of which might be present in one dataset but missing in another. The alternative datasets (16-day, 32-day, 48-day, 64-day, 80-day, 96-day, and 112-day) and their associated changes were examined to determine which of them captured the most observational changes, i.e., which was best able to represent all of the outliers in the variable. Preliminary statistical analysis showed that the distributions of the datasets are bell-shaped with only slight skewness. LE values appeared to be skewed to the right, indicating the presence of positive outliers in the data (Figure 3). For LA, outliers occurred on both sides, but with more positive than negative values, underscoring the importance of positive outliers in the datasets.
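As an illustration of how the alternative ∆T-datasets can be built without interpolation, the sketch below places hypothetical observations on a 16-day grid and computes ∆V for lags of 1 to 7 grid steps (16 to 112 days). A difference is kept only when both endpoints were actually observed. The helper name `delta_v` and the volume values are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical observations on a 16-day grid; NaN marks a date with no usable scene.
dates = pd.date_range("1996-06-10", periods=10, freq="16D")
volume = pd.Series(
    [5.0, np.nan, 5.2, 5.3, np.nan, np.nan, 5.9, 6.0, np.nan, 6.4], index=dates
)

def delta_v(series, k):
    """Volume change over k grid steps, kept only where both endpoints are observed."""
    return series.diff(periods=k).dropna()

# One dataset per time increment: 16, 32, ..., 112 days.
datasets = {16 * k: delta_v(volume, k) for k in range(1, 8)}

# Number of usable observation pairs retained by each choice of delta-T.
counts = {dt: len(dv) for dt, dv in datasets.items()}
print(counts)
```

Comparing the retained pair counts across increments shows directly how much observational information a given ∆T discards.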
Identifying the outliers and associating them with their dates for all datasets showed that the positive outliers of LE occurred in 1979, 1998, 2005, 2007, 2008, 2009, 2012, and 2013. For LA, the positive outliers occurred in 1988, 1999, 2000, 2005, 2007, 2008, 2009, 2010, and 2011, while the negative ones occurred in 1991, 1997, 1998, 1999, 2000, 2001, 2003, and 2010. The positive outliers of the LE dataset showed that extreme anomalies always caused lake growth and that no phenomenon ever caused the lake to shrink beyond its normal fluctuations. By contrast, LA both gained and lost water beyond its normal variations.
To distinguish how individual outliers changed the lake regime, the analysis focused on the volume change values immediately following each outlier. A regime shift was observed when a positive outlier was followed by another positive outlier. Conversely, when the outlier was followed by a value in the normal range, no regime shift was observed in the lakes’ behavior. No such pattern was observed for negative outliers: the subsequent volume change values were always in the normal range (defined as the range between the upper and lower end bars of the box plot).
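The outlier rule described above can be sketched as follows. The text defines the normal range by the end bars of the box plot; the sketch assumes the common Tukey convention of 1.5 times the interquartile range for those bars, which the source does not state explicitly. The function names and the ∆V values are hypothetical.

```python
import numpy as np

def whisker_bounds(values, k=1.5):
    """Lower and upper box-plot fences (assumed Tukey rule: quartiles -/+ k * IQR)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify_positive_outliers(dv):
    """Return (index, followed_by_positive_outlier) for each positive outlier."""
    dv = np.asarray(dv, dtype=float)
    _, high = whisker_bounds(dv)
    is_positive_outlier = dv > high
    results = []
    for i in np.flatnonzero(is_positive_outlier):
        # A positive outlier followed by another positive outlier is
        # treated as the signature of a regime shift.
        followed = i + 1 < len(dv) and is_positive_outlier[i + 1]
        results.append((int(i), bool(followed)))
    return results

# Hypothetical volume changes (arbitrary units).
dv = [0.1, -0.1, 0.0, 0.2, -0.2, 0.1, 3.0, 2.5, 0.0, -0.1, 0.1, 2.8, 0.0, 0.1, -0.1, 0.0]
flags = classify_positive_outliers(dv)
print(flags)
```

In this example the outlier at index 6 is followed by a second positive outlier and would be read as a regime shift, whereas the outliers at indices 7 and 11 are followed by values in the normal range.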
All outliers of LE were followed by positive changes that were higher than the normal range, except for those in 2009 and 2013. For LA, only the outliers occurring in 2007 and 2008 were followed by other positive outliers. The need to find additional underlying causes of the outliers prompted further examination of the watersheds’ physics in search of a trigger. Outliers can be a sign of measurement errors, or they can be the response signal to an actual event [28]. Precipitation is typically the main contributor to closed-basin lakes, both as direct deposition and as runoff collected from the watershed. These two contributions act on different time scales, however: direct rainfall can cause a sudden change in lake storage, while runoff from the watershed builds up gradually as it drains into the lake in the aftermath of a storm event. Hence, severe storm events can cause rapid lake responses, followed by the slower addition of runoff volumes. The only balancing process is lake surface evaporation, which, however, acts on even longer time scales. The combination of these processes (sudden strong rainfall, moderately fast runoff, and then slow evaporation) therefore yields sudden increases in water level that take months to several years to return to the original equilibrium state [32
The main source of extreme rainfall in the study area is the occurrence of tropical storms or hurricanes. Hispaniola is located within the typical corridor of North Atlantic cyclones, many of which affect the Caribbean islands during late summer and early fall. The impact varies, however: only cyclones that pass directly over or close to the lakes’ watersheds, identified as within a 50-mile distance, tend to produce a significant amount of precipitation.
The anomalies for both lakes suggested a strong correlation with cyclone activity in the years 1979 (Tropical Storm Claudette and Hurricane David), 1998 (Hurricane Georges), 2005 (Tropical Storm Alpha), 2007 (Tropical Storm Noel), 2008 (Tropical Storms Fay and Gustav), and 2012 (Tropical Storm Isaac), with each cyclone contributing to monthly rainfall rates higher than 87.6 mm over the lakes and their watersheds. Comparison with the observational VTS indicated that significant changes in lake volume occurred within less than two weeks of each cyclone, confirming their impact and the corresponding occurrence of outliers. These “cyclone singularities” in the VTS were used to split the time series into before- and after-sections so that the imputation algorithms would not be “distracted” by the singularity. Figure 4 shows an example of how a cyclone’s effect is integrated into the interpolation process. In this figure, the sub-series is split into two smaller parts, one before and one after Hurricane Georges (1998), so that the general characteristics of the lake remain disconnected from this one-time extreme event and the behavior of the lake after the storm does not affect the reconstruction of its behavior before it. The same procedure was followed for all other influential cyclones.
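The splitting strategy can be sketched as follows: the series is cut at each event date and every segment is interpolated on its own, so that no interpolated value bridges the jump. The event date and volumes below are illustrative placeholders, linear interpolation again stands in for the imputation methods of the study, and leaving the gaps directly adjacent to the event unfilled is an assumption of this sketch rather than a documented choice.

```python
import numpy as np
import pandas as pd

def interpolate_by_segment(series, event_dates):
    """Interpolate each segment between event dates independently."""
    # Segment label = number of events that occurred on or before each time stamp.
    segment_id = pd.Series(0, index=series.index)
    for d in event_dates:
        segment_id += (series.index >= pd.Timestamp(d)).astype(int)
    # limit_area="inside" fills only gaps bounded by observations within the
    # same segment, so nothing is carried across (or up to) the event.
    pieces = [
        seg.interpolate(method="time", limit_area="inside")
        for _, seg in series.groupby(segment_id)
    ]
    return pd.concat(pieces).sort_index()

# Hypothetical 16-day volumes with a sudden rise after an illustrative event date.
dates = pd.date_range("1998-07-01", periods=10, freq="16D")
volume = pd.Series(
    [8.0, np.nan, 8.2, np.nan, 8.4, np.nan, np.nan, 12.0, np.nan, 11.6], index=dates
)

split_fill = interpolate_by_segment(volume, ["1998-09-22"])
naive_fill = volume.interpolate(method="time")  # interpolates straight across the event

print(pd.DataFrame({"split": split_fill, "naive": naive_fill}))
```

The naive fill spreads the storm-driven jump over the preceding weeks, which is the distortion the before- and after-sections are meant to avoid.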
2.2.2. Evenly Spaced Time Series Construction
Due to the two 4-year gaps in the lake volume data (1974 to 1978 and 1992 to 1996), the observational VTS was split into two separate time regions, i.e., before and after 1996. Since the 16-day interval featured the most data points, it was used as the reference interval. The level of missingness was around 90% for both lakes in the 1972 to 1996 period, and 47% and 56% for LA and LE, respectively, in the 1996 to 2014 period. Because of the high level of missingness in 1972–1996 and in 2014–2017 (for which only a few data points were available), the analysis focused on the data available for 1996–2014.
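For reference, the level of missingness relative to a 16-day reference grid can be computed as the fraction of grid dates without an observation. The sketch below uses hypothetical observation dates, and the function name `missingness` is an assumption made for this example.

```python
import pandas as pd

def missingness(observed_dates, start, end, step_days=16):
    """Fraction of reference-grid dates between start and end with no observation."""
    grid = pd.date_range(start, end, freq=f"{step_days}D")
    observed = grid.isin(pd.DatetimeIndex(observed_dates))
    return 1.0 - observed.mean()

# Hypothetical example: 10 grid dates, of which 4 carry an observation.
full_grid = pd.date_range("1996-06-10", periods=10, freq="16D")
observed_dates = full_grid[[0, 2, 3, 7]]

level = missingness(observed_dates, full_grid[0], full_grid[-1])
print(f"missingness: {level:.0%}")
```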
Since the Landsat-5 TM and Landsat-7 ETM data products are offset by only 8 days, any start date for a time series based on a 16-day interval would necessarily exclude either all of the Landsat-5 TM or all of the Landsat-7 ETM data points (the data points alternate with an 8-day offset). To address this issue, one could either build an 8-day time series, which increases the level of missingness, or split the time series into smaller parts, analyze them separately, and then merge the results. The second option was deemed more appropriate because it retains all data from both Landsat satellites. For this purpose, the time series was split into three sections, which accommodated the 8-day shift. The first part (10 June 1996 to 26 December 1999) consisted of Landsat-5 TM data with a 16-day interval, the second part (January 2000 to January 2001) was a mix of Landsat-5 TM and Landsat-7 ETM data with an 8-day interval, and the third part (22 February 2001 to 21 August 2014) corresponded to values derived from Landsat-7 ETM, again with a 16-day interval. Note that the observational VTS was later divided into more subsections based on the dates of influential cyclones. Since the response of the lakes to forcing was slow (monthly time scales), the imputed time series (8- and 16-day intervals) were resampled to yield a VTS with monthly intervals.
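The three-section construction can be sketched as follows: each section is imputed on its own grid (16-day, 8-day, 16-day), the sections are concatenated, and the merged series is resampled to monthly steps. The section dates and volumes are shortened, hypothetical stand-ins for the actual periods listed above, `impute_section` is a name introduced for this example, and linear interpolation once more substitutes for the imputation methods of the study.

```python
import pandas as pd

def impute_section(obs, start, periods, step_days):
    """Place observations on a regular grid and fill interior gaps linearly in time."""
    grid = pd.date_range(start, periods=periods, freq=f"{step_days}D")
    return obs.reindex(grid).interpolate(method="time", limit_area="inside")

# Hypothetical observations (arbitrary units) spanning three short sections.
obs = pd.Series(
    [10.0, 10.4, 10.6, 10.8, 11.0, 11.1, 11.3, 11.4, 11.8, 12.0],
    index=pd.to_datetime([
        "1999-10-01", "1999-11-02", "1999-11-18",                # section 1 (16-day grid)
        "1999-12-04", "1999-12-20", "1999-12-28", "2000-01-13",  # section 2 (8-day grid)
        "2000-01-21", "2000-02-22", "2000-03-09",                # section 3 (16-day grid)
    ]),
)

sections = [
    impute_section(obs, "1999-10-01", periods=4, step_days=16),
    impute_section(obs, "1999-12-04", periods=6, step_days=8),
    impute_section(obs, "2000-01-21", periods=4, step_days=16),
]

# Merge the separately imputed sections and resample to monthly mean volumes.
merged = pd.concat(sections)
monthly = merged.resample("MS").mean()
print(monthly)
```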