Fast Thick Cloud Removal for Multi-Temporal Remote Sensing Imagery via Representation Coefficient Total Variation

Abstract: Although thick cloud removal is a complex task, the past decades have witnessed the remarkable development of tensor-completion-based techniques. Nonetheless, they require substantial computational resources and may suffer from checkerboard artifacts. This study presents a novel technique to address this challenging task using representation coefficient total variation (RCTV), which imposes a total variation regularizer on decomposed data. The proposed approach enhances cloud removal performance while effectively preserving textures at high speed. The experimental results confirm the efficiency of our method in restoring image textures, demonstrating its superior performance compared to state-of-the-art techniques.


Introduction
Thick cloud removal [1], a prevalent challenge in remote sensing imagery, severely affects the quality and accuracy of information extraction [2,3]. While thin cloud removal can be addressed through image dehazing [4], interpolation [5] or machine learning [6] algorithms, thick clouds pose a more complex problem, rendering these algorithms inadequate [7]. Remote sensing images captured at different timestamps for the same scene potentially offer complementary information for reconstructing pixels obscured by thick clouds. Consequently, the effective removal of thick clouds from multi-temporal remote sensing images is a significant research endeavor in remote sensing image processing [8][9][10].
The elimination of thick clouds from multi-temporal remote sensing images is a daunting task that has seen various approaches in recent years. One promising technique is framing this issue as a data completion problem. In the context of multi-temporal thick cloud removal, given the c-channel image of h × w pixels, Y ∈ R^{h×w×c×t}, captured at t timestamps, Y is considered partially observed data, with pixels obscured by thick clouds being missing. Consequently, the index set Ω collects the indices of observed pixels (i.e., pixels not covered by clouds). The objective of thick cloud removal is to estimate the underlying signal by filling in the missing values caused by the clouds.
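As a concrete illustration, the index set Ω and the orthogonal projection P_Ω can be sketched in a few lines of NumPy. The array sizes and the square cloud mask below are hypothetical toy values, not taken from the paper's datasets:

```python
import numpy as np

# Hypothetical sizes: h x w pixels, c channels, t timestamps.
h, w, c, t = 64, 64, 3, 4
Y = np.random.rand(h, w, c, t)          # observed multi-temporal imagery

# A boolean mask: True where the pixel is visible (the index set Omega),
# False where a thick cloud hides it.
omega = np.ones((h, w, c, t), dtype=bool)
omega[10:30, 10:30, :, 0] = False       # a synthetic cloud on timestamp 0

# The orthogonal projection P_Omega keeps observed pixels, zeros the rest.
def P_omega(X, mask):
    return np.where(mask, X, 0.0)

observed = P_omega(Y, omega)
```

Completion then amounts to estimating the entries where `omega` is False while leaving the entries where it is True untouched.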
Some early methods (including information cloning [11,12], similar pixel replacement [13] and spatial-temporal weighted regression [14]) have made remarkable developments. Research on gap filling, e.g., multi-temporal single-polarization techniques, multi-frequency and multi-polarization techniques and repeat-pass interferometric techniques, also provides insightful ideas for cloud removal (please refer to [15] for more details). Recently, a prevalent approach exploits the low-rank structure inherent in remote sensing imagery time series, which can be effectively leveraged via tensor completion techniques. For instance, Liu et al. [16,17] introduced the sum of the nuclear norm (SNN) for general tensor completion problems, which later gained widespread adoption in multi-temporal imagery thick cloud removal. SNN involves computing the weighted sum of the nuclear norms imposed on matrices obtained by unfolding the original tensor along each mode. This approach yields a convex optimization problem that can be efficiently solved using the ADMM algorithm. In numerous studies, this algorithm is referred to as high-accuracy low-rank tensor completion (HaLRTC). The experimental results indicate the efficacy of HaLRTC in removing thick clouds; yet, it might not excel in reconstructing crucial details.
Following SNN, the tensor nuclear norm (TNN) [18][19][20] is one of the typical variants. Different from SNN, the TNN is formulated based on tensor singular value decomposition (tSVD), resulting in an exact recovery theory for the tensor completion problem. Note that tSVD requires an invertible transform, which has often been set as a Fourier transform or discrete cosine transform in earlier studies. This transform often plays a vital role in tensor completion. Recently, research has shown that a tight wavelet frame (a.k.a. framelet) [21,22] could better highlight the low-rank property, and the deduced framelet TNN (FTNN) is more promising in various visual data completion tasks [23].
Besides better modeling low-rank priors, combining other prior knowledge is another way to boost performance. Ji et al. [24] sought to enhance HaLRTC by incorporating nonlocal correlations in the spatial domain. This was achieved by searching and grouping similar image patches within a large search window, thereby promoting the low rankness of the constructed higher-order tensor. This potentially facilitates the reconstruction of underlying patterns. Extensive experiments have shown that the nonlocal prior significantly improves performance on synthetic and real-world time-series data. Chen et al. [25] framed cloud removal as a robust principal component analysis problem, wherein the clean time series of remote sensing images represents the low-rank component, and thick clouds/shadows are considered sparse noise. Observing the spatial-spectral local smoothness of thick clouds/shadows, they introduced a spatial-spectral total variation regularizer. In a similar vein, Duan et al. [26] proposed temporal smoothness and sparsity regularized tensor optimization. They leveraged the sparsity of cloud and cloud shadow pixels to enhance the overall sparsity of the tensor while ensuring smoothness in different directions using unidirectional total variation regularizers. Similarly, Dao et al. used a similar idea to model clouds as sparse noise and successfully applied burn scar detection algorithms to cloudy images [27].
The aforementioned methods are rooted in the tensor nuclear norm and its variants. Some researchers have also employed tensor factorization techniques to model low rankness. The classic models include CP factorization [28,29] and Tucker factorization [30]. For instance, He et al. [31] proposed tensor ring decomposition for time series remote sensing images, harnessing the low-rank property from diverse dimensions. Their model incorporated total variation to enhance spatial smoothness. Lin et al. [32] factored remote sensing images at each time node into an abundance tensor and a semiorthogonal basis, imposing the nuclear norm on the time series of abundance tensors. This separately models the low-rank property along the channel and temporal dimensions. Inspired by [33], Zheng et al. [34] introduced a novel factorization framework, termed spatial-spectral-temporal (SST) connective tensor network decomposition, effectively exploring the rich SST relationships in multi-temporal images. This framework essentially performs subspace representation along the spectral mode of the image at each time node and introduces tensor network decomposition to characterize the intrinsic relationship of the fourth-order tensor.
Apart from matrix/tensor completion methods, machine learning and deep learning [35,36] are also typical ways to remove clouds. For example, Singh and Komodakis proposed a cloud removal generative adversarial network (GAN) to learn the mapping between cloudy images and cloud-free images [37]. Ebel et al. followed the architecture of a cycle-consistent GAN to remove clouds from optical images with the aid of synthetic aperture radar images [38]. These methods are time-consuming in the training phase, thus preventing fast exploration for practitioners. In what follows, we focus on matrix/tensor completion methods.
A concise review of recent advancements reveals that the majority of existing methods necessitate additional regularizations [39] or more intricate factorization models [34]. These approaches inevitably introduce considerable computational overhead and increase the number of hyperparameters. Moreover, although tensor-completion-based methods are able to achieve good metrics, it has been observed that they may suffer from checkerboard artifacts. Consequently, the development of efficient cloud removal algorithms that maintain high performance emerges as a fascinating challenge.
Toward this goal, this paper presents a representation coefficient total variation method for cloud removal (RCTVCR). The core concept involves unfolding tensor data into a matrix format and performing matrix factorization to obtain a subspace representation. Here, the representation coefficient effectively captures the visual context of the original tensor data. To enhance detail reconstruction, total variation is employed for the representation coefficient. This method is applied to synthetic data, yielding rapid processing and slightly superior performance compared to state-of-the-art approaches. The experimental results on real-world data demonstrate that RCTVCR effectively recovers desirable textures.
The highlights of this paper can be briefly summarized as follows:
1. This paper demonstrates that the representative coefficients obtained via matrix factorization have sparse gradient maps, thus proposing a novel regularization, representation coefficient total variation (RCTV).
2. This paper formulates a novel model, RCTVCR, for multi-temporal imagery by combining RCTV and low-rank matrix factorization.
3. RCTVCR is faster than state-of-the-art methods while improving performance. For example, RCTVCR only takes 6 s to process imagery with a size of 512 × 512 × 3 × 7, while TNN and FTNN take 126 s and 1155 s, respectively.
The remainder of this article is structured as follows: Section 2 presents RCTVCR. Section 3 reports the outcomes of the numerical experiments. Finally, Section 4 summarizes the findings of this study. The code is available at https://github.com/shuangxu96/RCTVCR, accessed on 16 December 2023.

Preliminaries and Motivations
To start with, the low-rank matrix-factorization-based thick cloud removal problem is introduced, along with the motivation for representation coefficient total variation.
As previously discussed, multi-temporal remote sensing imagery is represented by a tensor Y ∈ R^{h×w×c×t}, where c denotes the number of channels, t represents the number of time nodes and h and w correspond to the height and width, respectively. We reshape the original tensor into a matrix Y ∈ R^{hw×ct}. Although each column in Y represents the scene of different bands at various time nodes, the columns essentially share similar visual contexts and textures. This observation suggests that Y is theoretically a low-rank matrix.
Figure 1a-d visualizes typical multi-temporal remote sensing imagery with three channels and four time nodes, named Munich. The cloudy images are then simulated by manually applying different cloud masks, as shown in Figure 1e-h. According to the theory of linear algebra, a matrix tends to be low-rank if its singular values quickly approach zero. The singular value curves for cloud-free and cloudy imagery are given in Figure 1i, where cloud-free imagery exhibits a rapid decay in singular values while cloudy imagery exhibits the opposite behavior. This implies that cloud-free imagery should be a low-rank matrix. Therefore, the thick cloud removal task can be formulated as a low-rank matrix/tensor completion problem, where the pixels covered by clouds are regarded as missing pixels. Existing algorithms (e.g., TNN) exhibit remarkable performance on low-rank tensor completion problems. However, they are computationally intensive. To develop a faster algorithm, this paper focuses on the matrix format. The most classical approach to exploring the low-rank property of matrix data is the low-rank matrix factorization technique. Given a predefined integer r, the recovered cloud-free data can be expressed as X = UV^⊤ [40], where U ∈ R^{hw×r} represents the representation coefficients and V ∈ R^{ct×r} denotes the basis of the subspace. Note that it is often assumed that r << min(hw, ct) so that the rank of UV^⊤ is r, leading to low-rank matrix factorization. This mathematical expression may be unfamiliar to readers outside low-rank data analysis. In fact, the formula X = UV^⊤ can be regarded as spectral unmixing [41], where U denotes the abundance matrix, V denotes the endmember matrix and r represents the number of endmembers.
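The reshaping between the tensor Y ∈ R^{h×w×c×t} and the matrix Y ∈ R^{hw×ct}, together with the rank-r structure of X = UV^⊤, can be checked with a short NumPy sketch; the sizes below are hypothetical toy values:

```python
import numpy as np

h, w, c, t, r = 32, 32, 3, 4, 5

# Build a synthetic rank-r "cloud-free" scene as an hw x ct matrix.
rng = np.random.default_rng(0)
U = rng.standard_normal((h * w, r))      # representation coefficients
V = rng.standard_normal((c * t, r))      # subspace basis
X = U @ V.T                              # unfolded imagery, shape (hw, ct)

# Round-trip between the matrix (hw, ct) and tensor (h, w, c, t) views.
X_tensor = X.reshape(h, w, c, t)
X_back = X_tensor.reshape(h * w, c * t)

# The singular values vanish beyond index r, confirming rank(UV^T) = r.
s = np.linalg.svd(X, compute_uv=False)
```

With r << min(hw, ct), storing and manipulating U and V is far cheaper than working with the full matrix X.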
Since the cloud-free image is inaccessible, X is an unknown variable that needs estimation. To this end, the cloud removal task can be formulated as:

min_{U,V,X} ∥X − UV^⊤∥_F^2, s.t. P_Ω(X) = P_Ω(Y),   (1)

where Ω is an index set collecting all the pixels that are not covered by clouds. Note that P_Ω(•) is an orthogonal projection operator. In other words, if a pixel (i, j) is not covered by cloud ((i, j) ∈ Ω), the projection will return its value; otherwise, the projection will return zero.
In Equation (1), the loss function ∥X − UV^⊤∥_F^2 encourages the recovered image X to be low-rank, and the constraint P_Ω(X) = P_Ω(Y) forces the recovered and observed images to exhibit the same pixel values in regions not covered by clouds. This optimization is typically solved via the ADMM algorithm, where the key step involves implementing singular value decomposition (SVD) with a computational complexity of O(hwc^2 t^2).
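A minimal sketch of this completion idea, assuming a plain alternating scheme (truncated SVD followed by re-imposing the observed pixels) rather than the paper's exact ADMM solver, with hypothetical toy sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
hw, ct, r = 200, 12, 3

# Ground-truth rank-r matrix and a random observation pattern Omega.
truth = rng.standard_normal((hw, r)) @ rng.standard_normal((ct, r)).T
mask = rng.random((hw, ct)) > 0.3          # ~70% of pixels observed
Y = np.where(mask, truth, 0.0)

# Alternate a best rank-r approximation with the constraint P_Omega(X) = P_Omega(Y).
X = Y.copy()
for _ in range(200):
    Ul, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = (Ul[:, :r] * s[:r]) @ Vt[:r]       # rank-r projection
    X[mask] = Y[mask]                       # re-impose observed pixels

err = np.linalg.norm((X - truth)[~mask]) / np.linalg.norm(truth[~mask])
print(f"relative error on missing entries: {err:.3f}")
```

Each iteration here costs a full SVD, which is exactly the expense the paper's factorized formulation avoids.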
The model in Equation (1) is typically fast but insufficient for recovering complex textures. To enhance texture quality, a common strategy is to incorporate the local smoothness prior (i.e., neighboring pixels tend to have similar values). In other words, the total variation is imposed on X, resulting in the following problem:

min_{U,V,X} ∥X − UV^⊤∥_F^2 + λ · TV(X), s.t. P_Ω(X) = P_Ω(Y).   (2)

Here, λ is a hyperparameter, and the total variation is defined as:

TV(X) = ∥D_h X∥_1 + ∥D_w X∥_1,   (3)

where X ∈ R^{h×w×c×t} represents the tensor version of X, and D_h and D_w denote the horizontal and vertical gradient operators, respectively. In other words, D_h X and D_w X represent the differences between neighboring pixels along the horizontal and vertical directions.
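The anisotropic TV of Equation (3) can be sketched as follows; periodic (circular) differences are assumed here because they are what makes FFT-based solvers applicable:

```python
import numpy as np

# Periodic finite-difference operators D_h and D_w.
def grad_h(x):                      # difference along the width axis
    return np.roll(x, -1, axis=1) - x

def grad_w(x):                      # difference along the height axis
    return np.roll(x, -1, axis=0) - x

def tv(x):
    # anisotropic TV: sum of absolute gradients over all channels/timestamps
    return np.abs(grad_h(x)).sum() + np.abs(grad_w(x)).sum()

X = np.zeros((8, 8, 3, 2))          # toy (h, w, c, t) tensor
X[2:6, 2:6] = 1.0                   # a flat square: gradients live on its border
flat = np.full_like(X, 0.7)         # a constant image has zero TV
```

Minimizing `tv` therefore penalizes only the transitions between regions, which is why it promotes piecewise-smooth reconstructions.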
Minimizing the TV regularizer encourages neighboring pixels to have similar values, potentially leading to better image quality. It is well known that the total variation minimization problem is often solved by using the fast Fourier transform (FFT) with a computational complexity of O(hwct log(hw)) [42]. However, despite the significant improvement brought by total variation, the high computational complexity is undesirable. This urges researchers to explore fast and high-performance algorithms.

Representation Coefficient Total Variation
To reduce computational complexity, this paper presents an investigation into a novel regularizer. We first demonstrate that the representation coefficient U (abundance matrix) is visually akin to images. This assertion is substantiated by the first row of Figure 2, where the Munich dataset is decomposed with r = 5, yielding five coefficients. It is evident that each coefficient conveys a similar context to the original images depicted in Figure 2. Next, we pose an intriguing question: do these coefficients exhibit local smoothness? To address this, we visualize the gradient intensity map in the second row of Figure 2. The gradient intensity of pixel (i, j) is defined as:

G(i, j) = √((D_h U)_{ij}^2 + (D_w U)_{ij}^2),

and its histogram is presented in the third row of Figure 2. The analysis reveals that the gradient intensities of most pixels approach zero, indicating that the coefficients U are locally smooth. Building upon this discovery, this paper introduces a novel regularizer, representation coefficient total variation (RCTV), characterized by the following definition:

RCTV(U) = ∥D_h U∥_1 + ∥D_w U∥_1,   (4)

where U ∈ R^{h×w×r} represents the tensor counterpart of U.
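The observation that coefficient maps have near-zero gradients almost everywhere can be reproduced on synthetic data. The sketch below obtains U from a truncated SVD (one possible choice for illustration; in the paper, U comes from the optimization itself) and measures the gradient intensities:

```python
import numpy as np

rng = np.random.default_rng(1)
h, w, c, t, r = 32, 32, 3, 4, 5

# A piecewise-smooth synthetic scene shared (up to scaling) by all bands.
base = np.zeros((h, w))
base[8:24, 8:24] = 1.0
Y = np.stack([base * (k + 1) + 0.01 * rng.standard_normal((h, w))
              for k in range(c * t)], axis=-1)        # (h, w, ct)
Ymat = Y.reshape(h * w, c * t)

# Truncated SVD gives Ymat ≈ U V^T with U = left vectors * singular values.
Ul, s, Vt = np.linalg.svd(Ymat, full_matrices=False)
U = Ul[:, :r] * s[:r]                                 # coefficients
Umaps = U.reshape(h, w, r)                            # tensor counterpart

# Gradient intensity of each coefficient map (periodic differences).
gh = np.roll(Umaps, -1, axis=1) - Umaps
gw = np.roll(Umaps, -1, axis=0) - Umaps
intensity = np.sqrt(gh ** 2 + gw ** 2)
frac_small = (intensity < 0.05 * intensity.max()).mean()
print(f"fraction of near-zero gradients: {frac_small:.2f}")
```

On such piecewise-smooth data, the overwhelming majority of gradient intensities are near zero, mirroring the histograms in Figure 2.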
In contrast to Equation (3), RCTV applies the ℓ1-norm to the gradients of the coefficients U rather than X. Although the difference between TV and RCTV may appear minimal, it should be emphasized that this variant is not trivial. Firstly, the RCTV minimization problem is addressed by FFT with a relatively lower computational complexity of O(hwr log(hw)). This is attributed to the assumption that r << min(hw, ct). This advantage becomes particularly significant when the number of channels or time nodes is substantial. More significantly, when generating cloud-free data under the constraint P_Ω(X) = P_Ω(Y), the algorithm essentially only estimates the pixels obscured by clouds, directly copying the values of cloud-free pixels. As TV is applied to the entire dataset [31,43], it excessively consumes computational resources on cloud-free pixels. RCTV mitigates this issue to a certain extent since it is imposed on the representation coefficients.
Secondly, the resulting data are ultimately reconstructed through the product of U and V^⊤, which potentially maintains gradient map consistency across different channels and time nodes. That is to say, different channels and time nodes will exhibit similar textures and structures, so RCTV can more efficiently exploit information from other channels and time nodes. In contrast, TV solely employs local smoothness within a single channel at each time node. In this regard, RCTV appears more promising in extracting additional texture information from different channels and time nodes.

RCTV-Regularized Cloud Removal
Subsequently, the RCTV-regularized cloud removal (RCTVCR) model is formulated as:

min_{U,V,X} ∥X − UV^⊤∥_F^2 + τ · RCTV(U), s.t. P_Ω(X) = P_Ω(Y), V^⊤V = I.   (5)

Compared with Equation (2), the above equation minimizes RCTV instead of TV, and it inserts a constraint V^⊤V = I for stability.
To solve this problem, two auxiliary variables G_h and G_w are introduced to decouple the ℓ1-norm and D_i U (i ∈ {h, w}), resulting in the following problem:

min_{U,V,X,G_h,G_w} ∥X − UV^⊤∥_F^2 + τ (∥G_h∥_1 + ∥G_w∥_1), s.t. G_i = D_i U (i ∈ {h, w}), P_Ω(X) = P_Ω(Y), V^⊤V = I.

The ADMM algorithm is employed to solve this problem, aiming to minimize the augmented Lagrangian function of the above problem. That is,

L(U, V, X, G_h, G_w) = ∥X − UV^⊤∥_F^2 + τ Σ_{i∈{h,w}} ∥G_i∥_1 + Σ_{i∈{h,w}} [⟨M_i, D_i U − G_i⟩ + (µ/2) ∥D_i U − G_i∥_F^2].

Note that M_i (i ∈ {h, w}) are the Lagrangian multipliers, and µ is the penalty parameter. We then update each variable alternately.
Update G_i: by fixing the other variables, we focus on optimizing G_i. The corresponding optimization problem is as follows:

min_{G_i} τ ∥G_i∥_1 + (µ/2) ∥D_i U − G_i + M_i/µ∥_F^2.

The solution is given by:

G_i = S_{τ/µ}(D_i U + M_i/µ),

where the soft-thresholding function S_γ(x) = sign(x) max(|x| − γ, 0).

Update U: this leads to the following optimization problem:

min_U ∥X − UV^⊤∥_F^2 + Σ_{i∈{h,w}} (µ/2) ∥D_i U − G_i + M_i/µ∥_F^2.

Setting the derivative of this objective function to zero yields the optimal solution equation:

2(UV^⊤ − X)V + Σ_{i∈{h,w}} D_i^⊤ (µ(D_i U − G_i) + M_i) = 0,

where D_i^⊤ denotes the transposed operator of D_i. After simplification (using V^⊤V = I), the equation becomes:

(2I + µ Σ_{i∈{h,w}} D_i^⊤ D_i) U = unfold(2XV) + Σ_{i∈{h,w}} D_i^⊤ (µ G_i − M_i),

where unfold(•) represents the unfolding operation. For clarity, the right-hand side of this equation is denoted by R. By applying the FFT on both sides and the convolution theorem, the closed-form solution can be derived as [44]:

U = F^{-1}( F(R) / (2 + µ(|F(D_h)|^2 + |F(D_w)|^2)) ).

Here, F(•) represents the FFT, and |•|^2 denotes the element-wise square operation.

Update V: the optimization problem concerning V is formulated as:

min_V ∥X − UV^⊤∥_F^2, s.t. V^⊤V = I.

After straightforward calculation, the problem can be rewritten as:

max_V tr(V^⊤ X^⊤ U), s.t. V^⊤V = I.

Drawing upon reference [45], the solution can be derived as:

V = A B^⊤, where X^⊤ U = A Σ B^⊤ is the SVD of X^⊤ U.

Update X: the problem is a constrained least-squares task:

min_X ∥X − UV^⊤∥_F^2, s.t. P_Ω(X) = P_Ω(Y).

The solution can be straightforwardly obtained as:

X = P_Ω(Y) + P_{Ω^c}(UV^⊤),

where Ω^c denotes the complement of Ω. Update multipliers: based on the general ADMM principle, the multipliers are further updated using the following equations:

M_i = M_i + µ(D_i U − G_i), i ∈ {h, w}.

Finally, Algorithm 1 summarizes the workflow.
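The two key subproblem solvers, the soft-thresholding step for G_i and the FFT-based solve for U, can be sketched as follows. This is a simplified, illustrative implementation assuming periodic boundary conditions; variable and function names are our own, not the paper's:

```python
import numpy as np

def soft_threshold(x, gamma):
    """S_gamma(x) = sign(x) * max(|x| - gamma, 0): the l1 proximal operator,
    used in the G_i update as G_i = S_{tau/mu}(D_i U + M_i / mu)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def solve_U(X, V, Gh, Gw, Mh, Mw, mu, h, w):
    """Closed-form U update via the FFT under periodic boundaries.

    Solves (2 I + mu * (Dh^T Dh + Dw^T Dw)) U = R, where R collects the data
    term 2 X V and the split variables; each of the r coefficient maps is
    handled independently in the Fourier domain.
    """
    r = V.shape[1]

    def adjoint(g, axis):  # D_i^T for the periodic forward difference
        return np.roll(g, 1, axis=axis) - g

    # Right-hand side R, folded into (h, w, r) coefficient maps.
    R = 2.0 * (X @ V).reshape(h, w, r)
    R += adjoint(mu * Gh - Mh, axis=1) + adjoint(mu * Gw - Mw, axis=0)

    # Eigenvalues of D_i^T D_i are |F(d_i)|^2, with d_i the difference kernel.
    dh = np.zeros((h, w)); dh[0, 0] = -1.0; dh[0, -1] = 1.0
    dw = np.zeros((h, w)); dw[0, 0] = -1.0; dw[-1, 0] = 1.0
    denom = 2.0 + mu * (np.abs(np.fft.fft2(dh)) ** 2
                        + np.abs(np.fft.fft2(dw)) ** 2)

    U = np.fft.ifft2(np.fft.fft2(R, axes=(0, 1)) / denom[..., None],
                     axes=(0, 1)).real
    return U.reshape(h * w, r)
```

In a full solver, these two steps would be interleaved with the Procrustes update of V, the projection update of X and the multiplier updates described above.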

Experiments
To assess the performance of the RCTVCR algorithm, experiments are conducted on both synthetic and real-world datasets. Peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and spectral angle mapper (SAM) are employed for evaluation. Higher PSNR and SSIM and lower SAM indicate better image quality. The compared methods include HaLRTC [16,17], SNNTV [46], SPCTV [47], TNN [18,19] and FTNN [23]. All experiments are conducted on a desktop running Windows 10, equipped with an Intel Core i7-12700F CPU at 2.10 GHz and 32 GB of memory.
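For reference, PSNR and SAM can be computed as below; these are the standard textbook definitions (with a hypothetical peak value of 1 for normalized images), not necessarily the exact implementations used in the paper's experiments:

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def sam(ref, est, eps=1e-12):
    """Mean spectral angle (degrees) between per-pixel spectra; lower is better."""
    ref2 = ref.reshape(-1, ref.shape[-1])
    est2 = est.reshape(-1, est.shape[-1])
    cos = np.sum(ref2 * est2, axis=1) / (
        np.linalg.norm(ref2, axis=1) * np.linalg.norm(est2, axis=1) + eps)
    return np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

ref = np.random.rand(16, 16, 3)   # toy reference image, last axis = bands
```

SSIM additionally involves local windowed statistics and is usually taken from an image-processing library rather than re-implemented.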

Experiment on Synthetic Datasets
In this subsection, a comprehensive series of synthetic experiments is performed to assess the effectiveness of RCTVCR. The synthetic datasets consist of Munich, with three channels and four time nodes, and Farmland, with eight channels and four time nodes. Figure 3 illustrates the images from the Farmland dataset. These datasets comprise 256 × 256 pixels and were captured via Landsat-8. To enhance realism, three real cloud masks with varying cloud amounts are chosen from the WHU cloud dataset, as presented in Figure 4.
Table 1 summarizes the performance comparison of all competing methods in terms of performance metrics (PSNR, SSIM and processing time) on the Munich dataset. As the cloud mask size increases from small to large, the PSNR and SSIM values for all methods generally decrease. Based on the results presented in the table, it can be inferred that RCTVCR emerges as the optimal method due to its consistent achievement of the highest PSNR and SSIM values across all cloud mask sizes, accompanied by the shortest processing time. Table 2 displays a comparative analysis of various methods in terms of their performance metrics on the Farmland dataset. Similar to the Munich dataset, a consistent pattern is observed: as the cloud mask size escalates from small to large, the PSNR and SSIM values for all methods generally decrease. When comparing RCTVCR with SPCTV and TNN, RCTVCR exhibits a higher PSNR value by 0.35 and 0.44, respectively, on the large cloud mask. Moreover, RCTVCR boasts the shortest processing time: only 0.2 s is required to process the Farmland dataset with a large mask, whereas SPCTV and TNN necessitate 394.3 s and 176.3 s, respectively. Based on these results, it can be concluded that RCTVCR emerges as the optimal method, consistently achieving the highest PSNR and SSIM values. To further emphasize the superiority of our method, we present the declouded images of Munich and Farmland in Figures 5 and 6. Our method effectively preserves intricate details and yields more visually appealing outcomes, whereas competing methods often result in blurred or artificial-looking artifacts. It should be noted that the compared methods based on tensor factorization or tensor nuclear norms exhibit varying degrees of checkerboard artifacts. RCTVCR, however, exhibits no such disadvantage.
To better illustrate the efficiency of RCTVCR, a scatter plot is visualized in Figure 7 for the Farmland dataset with a middle cloud mask. It shows that RCTVCR achieves the highest PSNR at the fastest speed. TNN and SPCTV obtain improvements over HaLRTC at the cost of a very long processing time. FTNN is the most computationally intensive but has an unsatisfactory PSNR value. The reason may be that FTNN is more suitable for randomly missing entries, whereas clouds correspond to continuously missing areas, a more challenging completion task.
In conclusion, the superior visual performance of our method relative to existing techniques is clearly demonstrated through the experimental outcomes. By effectively tackling cloud removal, structural preservation and overall image quality enhancement, the proposed approach represents an advancement in the field.

Experiments on Real-World Datasets
Here, we apply the competing methods to a real-world dataset released in [39]. It was captured via Sentinel-2 over Mechelen, Belgium. The original data size is very large, and thus a patch with a size of 512 × 512 is cropped. There are three channels and seven time nodes; the cloudy images and corresponding masks are displayed in Figure 8. Time nodes 2, 5 and 7 are polluted by severe clouds and shadows, making this a very difficult task. Previous investigations often select partial time nodes with fewer clouds to verify algorithm performance. In our study, all time nodes are simultaneously employed to test the algorithm's robustness to the amount of cloud coverage.
Declouded images are displayed in Figure 9. It is observed that HaLRTC and SNNTV output very blurry contexts. FTNN recovers a few more details, but it cannot accurately remove clouds and shadows. SPCTV and TNN exhibit heavy checkerboard effects and color distortion. Only our method faithfully recovers plenty of details and preserves color consistency. Additionally, the bar plot shown in Figure 10 reveals that RCTVCR, with a processing time of 6 s, is significantly faster than the other methods, which often take hundreds of seconds.

Influence of the Temporal Number
To assess the impact of the temporal number on the reconstruction effectiveness of our method, we conducted the following experiment. We maintain the cloudy image of the Munich dataset with a middle mask and gradually add reference images to vary the temporal number. Table 3 presents three metrics of the declouded images using reference images with varying temporal numbers. The data indicate that as the temporal number grows, the image quality also increases. However, Table 1 shows that RCTVCR with a smaller temporal number is still better than some existing methods with four time nodes, validating the effectiveness of RCTVCR.

Parameter Sensitivity
In RCTVCR, two parameters, r and τ, require tuning.To examine the impact of these parameters, we implemented RCTVCR with various parameter configurations on the Farmland dataset with a middle cloud mask.
Initially, we fixed τ at 4 × 10^−4 and varied r from 12 to 22. As depicted in Figure 11a, the PSNR curve rapidly increased before gradually descending as r increased, achieving optimal performance at r = 14. Figure 12 displays the declouded images obtained with r = 12, 14 and 22. It is evident that r = 14 yields superior details compared to r = 12. Notably, r = 22 preserves decent textures but exhibits color distortion. Subsequently, we fixed r = 14 and investigated τ values ranging from 1 × 10^−4 to 9 × 10^−4. As illustrated in Figure 11b, the PSNR curve incrementally increased before exhibiting a tendency to decline as τ increased, yielding the optimal performance at τ = 4 × 10^−4.

Conclusions
In conclusion, the proposed method for removing thick clouds from remote sensing images offers an effective and efficient solution. By employing matrix decomposition and regularization techniques, the method not only removes thick clouds but also preserves the textures in the images, resulting in improved visual quality. The sensitivity analysis and comparisons with other methods further validate the superiority of the proposed approach. This study contributes to the advancement of remote sensing image processing and has potential applications in various fields. Further research can be conducted to enhance the method's robustness and applicability in different scenarios. Due to the current limitations of our research scope and hardware resources, we are unable to incorporate deep learning techniques into our experimental comparison at this stage. Future work includes exploring possible avenues for incorporating deep learning methods into our research.

Figure 1.
Figure 1. (a-d) The images in the cloud-free Munich dataset at 4 time nodes. (e-h) The simulated cloudy Munich dataset with different cloud masks. (i) The singular value curves for the cloud-free and cloudy imagery. Note that for time nodes 1 to 4, the exact timestamps are 201501, 201503, 201504 and 201508, where "XXXXYY" means the image was taken in the YY-th month of year XXXX.

Figure 2.
Figure 2. (a-e) The visualization of 5 representation coefficients. (f-j) The gradient map of each representation coefficient. For easier inspection, the values of the gradient maps are amplified 2 times. (k-o) The histogram of each representation coefficient.

Figure 3.
Figure 3. The false-color images in the Farmland dataset. Note that for time nodes 1 to 4, the exact timestamps are 20130715, 20130901, 20131003 and 20131019, where "XXXXYYZZ" means the image was taken on the ZZ-th day of the YY-th month of year XXXX.

Figure 4.
Figure 4. The three cloud masks with different-sized clouds. The white pixels denote clouds.

Figure 5.
Figure 5. The false-color declouded images of all compared methods using the Munich dataset with a middle cloud mask.

Figure 6. Figure 7.
Figure 6. The false-color declouded images of all compared methods on the Farmland dataset with a large cloud mask.

Figure 8.
Figure 8. The false-color images and their masks in the real-world dataset. Note that for time nodes 1 to 7, the exact timestamps are 20180816, 20180831, 20180905, 20180915, 20180925, 20181015 and 20181020, where "XXXXYYZZ" means the image was taken on the ZZ-th day of the YY-th month of year XXXX.

Figure 9. Figure 10.
Figure 9. The false-color declouded images of all compared methods on the real-world dataset. From top to bottom, they correspond to time nodes 2, 5 and 7, respectively.

Figure 11.
Figure 11. The metrics with different hyperparameters.

The declouded images with τ = 1 × 10^−4, 4 × 10^−4 and 9 × 10^−4 are presented in Figure 13. It should be noted that τ governs the strength of the RCTV regularizer. A small value of τ makes the behavior of RCTVCR similar to that of methods that do not employ a local smoothness prior (e.g., HaLRTC). Conversely, a large value of τ results in a relatively blurry image.

Table 1.
Metrics on the Munich dataset. The best values are marked in bold.

Table 2.
Metrics on the Farmland dataset. The best values are marked in bold.

Table 3.
Metrics on the Munich dataset of RCTVCR with varying temporal numbers.