Reconstruction of the Cloud-Free Time Series Satellite Observations of Land Surface Temperature ( LST ) Using Singular Spectrum Analysis ( SSA )

Land Surface Temperature (LST) is a basic parameter in energy exchange between the land and atmosphere and is frequently used in many sciences such as climatology, hydrology, agriculture, ecology, etc. LST time series data have usually deficient, missing and unacceptable data caused by the presence of clouds in images, presence of dust in atmosphere and sensor failure. In this study, Singular Spectrum Analysis (SSA) algorithm was used to resolve the problem of missing and outlier data caused by cloud cover. The region studied in the present research included an image frame of MODIS with horizontal number 22 and vertical number 05 (h22v05). This image involved a large part of Iran and Turkmenistan and Caspian Sea. In this study, MODIS LST sensor (MOD11A1) was used during 2015 with 1×1 Km spatial resolution and day/night LST data (daily temporal resolution). The results of the data quality showed that cloud cover caused 36.37% of missing data in the studied time series with 730 day/night LST images. Further, the results of SSA algorithm in reconstruction of LST images indicated the Root Mean Square Error (RMSE) of 2.95 K between the original and reconstructed data in LST time series in the study region. In general, the findings showed that SSA algorithm using spatio-temporal interpolation in LST time series can be effectively used to resolve the problem of missing data caused by cloud cover.

Missing values and/or outliers can be caused by cloud cover, sensor failure, aerosol, and inefficiency of cloud masking algorithms [1,2,13,14].Atmospheric dust, aerosol, gases and clouds, in particular, can dramatically affect the energy reflected from the surface and expose the reading of optical and thermal sensors to error [13].Clouds in satellite images are identified as phenomena that have higher reflection in visible bands of electromagnetic spectrum than other most terrestrial phenomena and have lower temperature than the below land surface in the thermal bands [15].In thermal remote sensing, clouds absorb part of the energy reflected from the earth.They also emit thermal infrared energy at much lower temperature so that they are not correctly diagnosed in cloud masking algorithm, and some negative outliers (lower temperature than other measurements in a time series) will remain in place of missing data in a time series [16].
The missing data in a time series, depending on size dispersion, temporal distribution and their continuity over time, can be very short, dispersed, or have a long gap.According to the missing data in time series, two terms can be defined: missing data with short gap and missing data with long gap.
When there are less than 50% of sampled data over time, they are faced with the missing data with short gap/size in time series.However, if more than half of the sampled data are lost over time, this time series will have a large gap [17,18].The outliers are defined as abnormal values whose deviation from normal changes in dataset is larger [16].The outliers can be classified into positive and negative e data.For example, values larger than 1 or smaller than -1 in a Normalized Difference Vegetation Index (NDVI) time series.Further, sharp increase or decrease of NDVI data in a time series of NDVI, compared to the data before and after it, can be identified as outliers.Time series can be periodic or non-periodic.Remote sensing LST time series are included in periodic time series because they reflect the seasonal and annual variations of the sun.
Several methods have been developed to identify cloud cover, which can be divided into two groups [13].The first group, which is a spectrum-based approach, uses all spectral data to determine whether a specific pixel includes cloud.The second group, which is a temporal-based approach, presents an evaluation of the missing values caused by cloud cover through temporal interpolation.
For example, Fast Fourier Transform (FFT) algorithm and Harmonic ANalysis of Time Series (HANTS) algorithm used to reconstruct the missing data and remove the outliers in the time series of NDVI are based on the second approach [14,[18][19][20][21][22].HANTS algorithm was first developed to reconstruct the NDVI time series and to extract phonological data.But since this algorithm reconstructs the signal of a time series based on periodic behavior, it was effectively used to reconstruct other time series with periodic behavior like LST [2,16,23,24].A disadvantage of HANTS is that in these methods only temporal correlation between observations is used to fill the gap of missing data in a time series.Singular Spectrum Analysis (SSA) is an advanced method to reconstruct the time series with missing data, especially with a large gap, that uses SSA (with only one time series) and Multi-channel Singular Spectrum Analysis (M-SSA for more than a time series).SSA was first developed by Broomhead & King (1986).It was then proposed by Vautard & Ghil (1992) and Kondrashov & Ghil (2006) to fill the gaps of missing data in a time series [25][26][27].The value of each pixel in an LST image over time is the result of interaction between radiation energy and ambient turbulent energy.presumption, SSA makes use of Empirical Orthogonal Functions (EOFs) and analysis of major components in temporal domain to extract data from short and noisy time series, without primary knowledge about dynamic processes that affect time series [27].SSA is an advanced method for analysis of time series that makes use of multivariate principles of statistics and geometry, linear algebra and signal processing [28].SSA analyses the time series to simpler and interpretable components such as trends, periodic and semi-periodic fluctuations and noises [29].Thus, several significant components achieve maximum variance and noises achieve minimum variance in observations of a time series.
SSA has been used as a standard tool in analysis of climatic time series, meteorology and geophysics [30][31][32][33][34].In a research, the hourly LST time series data were reconstructed by SSA and M-SSA.The results showed a mean absolute error of 2.25° K in an hourly time series with 63% missing data caused by cloud cover [1].In another study, SSA algorithm was used to reconstruct the missing data in Leaf Area Index (LAI) of MODIS data.The findings indicated M-SSA effectively reconstructed this index using spatio-temporal correlation [35].Moreover, another study showed that SSA could be effective in analysis of hydrologic time series, including reconstruction of time series, extraction of trends, fluctuations and prediction [36].
The current study was aimed to assess the performance of SSA and M-SSA algorithms in removing the clouds and outliers in daily MODIS LST time series.In most applied studies, satellite imageries of LST covered by cloud, has been a limitation in choosing the images.Therefore, the results of this study can be used in most applied studies using LST images according to the pure data generated by SSA algorithm.

Data
The study region in this research included an image frame in MODIS sensor with horizontal number 22 and vertical number 5 (h22v5) in MODIS sinusoidal tiling system.This image involved a large part of Iran and Turkmenistan and Caspian Sea.Where n'= n-m+1 and each lagged vector is defined as: This trajectory matrix contains the complete record of patterns presented within a window of size m.
By selecting a large number of window size, more information about the basic pattern of the time series will be captured.A small value of window size enhances the statistical confidence at the final Peer-reviewed version available at Atmosphere 2018, 9, 334; doi:10.3390/atmos9090334 results [39], since the structure of time series will be captured repeatedly [40].The final form of the trajectory matrix X is a rectangular matrix of the form: Step2: The next step is decomposition of trajectory matrix X of size m×n' using the Singular Value Decomposition (SVD) method which yields: Step 3: Partitioning d eigentriples into p distinct subsets.Then, summing all the components inside each subset such that; The matrices XIn' have the form of a Hankel matrix in an ideal case and consequently fit the trajectory matrices.
Step 4: Since the ideal case described in step 3 is not usually the case, the XIn' matrices should be transformed into the form of a Hankel matrix to fit the trajectory matrices.This step is known as diagonal averaging.In this sense, the original matrix can be reconstructed as the sum of these matrices.
Where for each p, the series xtn' is the result of the diagonal averaging of the matrix XIn'.
The SSA workflow for gap filling procedure consists of several steps which are explained as follows: This process repeats for the selected number of EOFs, each time keeping the previous ones as fixed.
To find the optimal value for the window width and the number of dominant SSA modes to fill the gaps, cross-validation is applied.It means that a portion of the available data (selected as random) is flagged as missing, and the RMSE error from the reconstruction is computed to find the best value for the window size and number of EOFs.The SSA technique can be generalized to be used for multivariate time series and gap-filling of missing values in those time series [41].
In this research, the spatio-temporal dispersion of the missing data in time series is shown.The spatial dispersion is the same missing data caused by cloud cover in each time series image, and temporal distribution is the missing data along 730 images over time.To display the spatial dispersion of the missing data, the image statistics in ENVI 4.8 software were used.Also, to show the temporal distribution of the missing data, the missing data caused by cloud cover along each pixel of this time series were counted and then divided by the total number of data over time.Hence, the map of missing data in this time series was prepared.
SSA-MTM software was sued to reconstruct the LST time series with 730 images during 2015.
This software is accessible via http://research.atmos.ucla.edu/tcd//ssa/form.htmlfor free.To reconstruct the time series in SSA software, it is necessary to determine the number of significant components and window size.To determine these parameters, first, a single time series without missing data was selected as a representative of the whole time series.Then, the effects of window size and number of components in reconstruction of this time series were assessed by SSA.Next, the significant components were determined by different tests in SSA software such as Monte Carlo test.
This will be explained in details in the results section.Having determined the window size and significant components, the time series of all images were reconstructed by M-SSA using these results.It should be pointed out that due to the lengthy discussion of SSA software tests such as Monte Carlo test as well as testing the null hypothesis of EOFs and theoretical discussion on the trends and periods, more information can be found in the study of Ghafarian Malamiri (2015) [16].
To evaluate the M-SSA performance in reconstructing the images of this time series, Root Mean Square Error (RMSE) was sued.RMSE is obtained from equation 7.In this equation, xi and yi represent original and reconstructed data, respectively.To calculate RMSE, first the cloud data, the data with zero value, were removed.Then, RMSE was computed between the true data and reconstructed data by SSA algorithm along each pixel of this time series and displayed as an RMSE map.

Spatio-temporal distribution of the missing data
To analyze the performance of SSA algorithm in reconstructing LST images, the distribution style and amount of missing data in LST time series were calculated temporally and spatially.For a better display, the time series with 730 images were depicted in LST time series during the day and night.The percentages of missing data in each LST time series image (spatial dispersion) are shown during the day and at night in figures 3 and 4. In horizontal axis (figure 3), number one is related to early January 2015.Based on figures 3 and 4, the amount of missing data is higher at the start and end of time series.The start and end of times series, depending on the winter and fall seasons, have the highest rate of missing data, which is possibly due to higher cloudiness in these seasons.

Determining the parameters of SSA algorithm
Determining the window size and number of components are two important parameters in reconstructing a time series by SSA algorithm.Therefore, it is necessary to select the window size and number of components optimally.Fig. 6 shows the effects of the number of components in a window size of 60 in reconstructing a time series with 730 images.By increasing the number of components from 1 to 2, the coefficient of determination value (R2) increased significantly between the original data and reconstructed signal by SSA algorithm.Then, by increasing the number of components from 2 to 8, R2 value was not significantly increased.Fig. 6 (right) illustrates the effect of different window size on reconstruction of the same time series with four components.By increasing the window size, the R2 value between the original data and reconstructed data was reduced.Using small window sizes to reconstruct a time series with long-distance missing data has no favorable results, and using large window sizes not only reduces the accuracy of reconstruction but also increases the time of processing.Thus, in the present study window size of 60 was used to process the time series.In the following, the accurate determination of the number of significant components and window sizes by SSA algorithm are fully described.

Data-basis and null-hypothesis-basis Monte Carlo test results
The results of data and null-hypothesis EOF-s for pure red noise hypothesis are shown in Figure 9.The graphs show the power in the data eigenvalues and surrogates data bars vs. the frequency associated to EOFs.Obtaining a single frequency for each EOF is not straightforward due to the fact Peer-reviewed version available at Atmosphere 2018, 9, 334; doi:10.3390/atmos9090334that the EOFs resulting from SSA for an intrinsically periodic time series are not purely sinusoid unless they are composed of completely purely periodic components (e.g.synthetic periodic time series).To do so, Allen and Smith (1996) associated a frequency to each EOFs by maximizing the Rsquared (R2) with a pure sinusoid for better visualization.The top graph of Figure 9 shows the result of the data-based Monte Carlo test.
Fig. 8 illustrates the eigenvalues versus their relevant frequencies.Fig. 9 shows that components 1, 2 and 3 are significantly major components of this time series with specific percentile of 97.5%.In this figure, the components are named according to the larger variance, respectively.The frequencies of components 1, 2 and 3 are 0.449, 0.002 and 0.006 (cycle/day) which belong to periods of 2 (2/2 = 1as daily component), 500 (500/2 = 250 almost one year) and 166 (166/2 = 88 day or almost 3 months as a trend) of the images.
The null-hypothesis EOFs test which is visible as lower graph of Figure 9 confirms the significance of these pairs.The null-hypothesis-based test has the advantage of having a lower probability of selecting a noise component as a signal component, which help to identify significant signal better and thus the result is more reliable [40].Peer-reviewed version available at Atmosphere 2018, 9, 334; doi:10.3390/atmos9090334 components can be differentiated in this time series.Therefore, three components, including 1, 2 and 3 were used to reconstruct the whole time series in window size 60 in M-SSA.

Conclusion
In the present study, SSA algorithm was used to reconstruct the missing data caused by cloud cover in daily LST time series of MODIS sensor during 2015.According to the results of accuracy evaluation of the SSA algorithm, the mean reconstruction error in the whole studied region was 2.95 °K.These results are promising owing to the amount of missing data caused by cloud cover in the time series with 730 LST images because these figures are in line with temperature measurement errors (±3 °K) by remote sensing [42].The findings showed the minimum reconstruction error on the Caspian Sea (<2.6 °K).This is due to the high thermal specific capacity of water which results in little and gradual water temperature changes per day and during the year.Hence, the reconstructed signal is more compatible with the original data.Ghafarian Malamiri and Zare Khormizie (2017) reconstructed a daily MODIS LST time series during a year by HANTS algorithm and reported that reconstruction error of LST time series was lower on the sea [43].Further, the results of Ghafarian Malamiri and Zare Khormizie (2017) showed that reconstruction error in annual LST time series during the day from MODIS sensor in this study region was 3.87 °K on average [43].Therefore, the reconstruction accuracy is higher in SSA algorithm than HANTS algorithm.This is because only temporal interpolation is used in HANTS algorithm; whereas, spatio-temporal interpolation is used in SSA algorithm in M-SSA to reconstruct the time series.Since the daily MODIS LST time series includes LST during the day and night, this time series can be processed in two different ways to reconstruct the effects of cloud cover.First, LST images during the day and at night can be reconstructed into two separate time series.Second, LST images can be put together as a day/night sequence and be reconstructed as a time series.In this study, the second method was applied.When daily MODIS LST time series is used in day/night sequence, a significant component (component 1 which indicates the day/night sequence) is created.When this component is used to reconstruct the time series, the accuracy will increase dramatically because the daily missing data caused by cloud cover certain points may be present during the night and vice versa.Reconstructing the hourly LST time series, Ghafarian Malamiri (2015) reported three significant periodic components 24, 12 and 8 hours were differentiated in the window size of 72 hours [16].
According to the findings, SSA algorithm can effectively resolve the missing data problem in MODIS LST time series.In a research using SSA and M-SSA algorithms, Ghafarian et al. (2012) reconstructed the time series of hourly LST data and showed an absolute mean error of 2.25 °K in an hourly time with 63% missing data caused by cloud cover [1].Another study indicated that M-SSA using spatio-temporal correlation effectively reconstructed the MODIS LAI [35].In general, the results showed that SSA algorithm could be effectively used to resolve the missing data problem caused by cloud cover in daily MODIS LST time series.

Fig. 1 Figure 1 .
Figure 1.An LST image of the studied region.

2. 1 . 1 .Figure 2 .
Figure 2. Time series with 730 images (A), trend of temperature changes along a pixel (B) and magnification of a part of time series (C) Where D and E are left and right singular vectors of X with m×m and n×n size, respectively, and L is a rectangular diagonal matrix of size m×n.The elements of L, called singular values, are the square roots of the eigenvalues of the lagged-covariance matrix S = XXT of size m×m.The columns of matrix D are the eigenvectors of S also known as Empirical Orthogonal Functions (EOFs).The rows of ET are eigenvectors of matrix XTX.If we plot the singular values in descending order, one can often distinguish between an initial steep slope, representing a signal, and a (more or less) flat floor, representing the noise level[27].Then any subset of d eigenvalues (EOFs), 1≤d≤ m, for which related eigenvalues are positive provides the best representation of the matrix X as a sum of matrices Xi, i = 1, 2, …, d[29].

Figure 3 .
Figure 3.The missing data of each image (Spatial dispersion) in annual LST time series during the day.

Figure 4 .
Figure 4.The missing data of each image (Spatial dispersion) in annual LST time series during the night.

and
Zagros mountain ranges and humidity barrier that causes the formation of cloud and prevents the penetration of humid air into the central regions of Iran.The lowest rate of missing data is observed in the central regions of Iran.Fig. 5 illustrates the histogram of temporal distribution map of the missing data.A large part of this time series has lost 20-60% of LST data as the result of the effects of cloud cover.On average, 36.37% of data have been lost due to cloud cover.

Figure 5 .
Figure 5. Temporal distribution map of the missing data in annual LST time series with 730 images (left) and histogram of temporal distribution map of missing data (right)

Figure 6 .Figure 7 .
Figure 6.R 2 values between the original data and SSA reconstruction in a time series with 730 data in different window sizes (right) and different number of components (left)

Figure 9 .
Figure 9. Monte Carlo SSA based on data (top) and null-hypothesis EOFs test (down).

1 2 3
Fig. 10 presents the temporal EOFs of the significant components in this time series based on the results of Monte Carlo test.Component 1 of the image in Monte Carlo test shows the same day/night temperature changes.As shown, the black lines in both image is repeated every day and night.The component 2 of the image indicates the annual temperature changes and component 3 almost shows the seasonal periods in this time series ( the components 2 and 3 are shown in red and blue colors in figure 10, up) which belongs to a significant trend.Component 2 is shown as a small part of the annual period and it is periodically repeated in approximately each 730 images.For a better display, component 2 is drawn in window size 730 (Fig. 10, bottom).Component 3 is only part of a periodic signal with 166 images, which is shown as a trend in window size of 60 images (Fig. 10, up).

Figure 10 .
Figure 10.SSA T-EOFs of components 1, 2 and 3 values with 60 image window size (top) and SSA T-EOFs of components 3 with 730 image window size (down).

Fig. 11
Fig. 11 displays the RMSE map between the original data and reconstructed data by SSA along annual LST time series with 730 images.As shown in Fig. 11 (left), the minimum RMSE is observed on the Caspian Sea.At other points, the RMSE is variable from 2 to 4 kelvin (K).Fig. 11 (right) shows the histogram of RMSE map of LST.Overall, the mean RMSE of the reconstructed LST by SSA algorithm in the studied time series with 730 images is 2.95 K.

Figure 11 .Figure 12 .Fig. 13 .
Figure 11.RMSE map between the original data and reconstructed data by SSA algorithm in annual LST time series with 730 images (left) and histogram of RMSE map (right)

Figure 13 .
Figure 13.A time series along a pixel together with reconstruction results of SSA (up) and magnification of a part of time series (down)

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 20 June 2018 doi:10.20944/preprints201806.0327.v1
Peer-reviewed version available at Atmosphere 2018, 9, 334; doi:10.3390/atmos9090334Fora given window width (m), the original time series is centered by computing the unbiased value of the mean, and the missing data is set to zero.The first leading EOF is found by an iterative procedure which applies the SSA algorithm on the zeroed and centered set.The missing values are updated based on the reconstructed component of the current EOF.Using this updated set the SSA algorithm is applied again, and the missing values are updated based on the result of this new iteration.The process repeats until convergence is achieved.Then the iteration starts for the second leading EOF (keeping the first one fixed) until convergence has been achieved for the second EOF.

preprints.org) | NOT PEER-REVIEWED | Posted: 20 June 2018 doi:10.20944/preprints201806.0327.v1
Peer-reviewed version available at Atmosphere 2018, 9, 334; doi:10.3390/atmos9090334Thepercentage of missing data caused by cloud cover in each pixel during one year (temporal dispersion) is shown in Fig.5.As shown, the highest rate of missing data is found in the areas around Caspian Sea in Turkmenistan (the northern areas of studied region).The minimum rate of missing data is found in the southern part of study areas (central regions of Iran).In general, the northern and western regions of Iran involve the highest rate of missing data, considering the location of Alborz Preprints (www.

preprints.org) | NOT PEER-REVIEWED | Posted: 20 June 2018 doi:10.20944/preprints201806.0327.v1
Three significant components were identified in daily MODIS LST time series with day/night sequence.The first component was the same day and night temperature changes, the second Peer-reviewed version available at Atmosphere 2018, 9, 334; doi:10.3390/atmos9090334component was annual temperature changes and the third component was seasonal temperature changes.The number of significant components (periods or trends) varies in different time series.