DeepInSAR—A Deep Learning Framework for SAR Interferometric Phase Restoration and Coherence Estimation



Introduction
Synthetic Aperture Radar (SAR) is a remote sensing technology that uses active microwaves to capture ground surface characteristics. An Interferometric SAR (InSAR) image, also known as an interferogram, is created from two temporally separated single look complex (SLC) SAR images via the point-wise product of one SLC image with the complex conjugate of the other. Each pixel in an interferogram thus indicates the phase difference between the two co-registered SLC images. This phase difference encodes useful information, including deformation of the earth's surface and topographical signals, and has been successfully used to obtain digital elevation models (DEMs). InSAR final products are widely used in civil engineering, topography mapping, infrastructure and oil/gas mining monitoring, natural hazard monitoring, and elevation change detection. In any SAR system, as the satellite circumnavigates the earth, the SAR sensor launches millions of radar signals toward the earth in the form of microwaves. The SAR image is represented as an SLC image, which is generated from the radar information echoed back from the ground surface. Early wavelet-domain work, such as the WInPF filter, showed that phase information and noise can be more easily separated in the wavelet domain. The success of WInPF was of great importance to much subsequent work. Reference [2] applied a wavelet-packet-based Wiener filter to further separate phase information in the wavelet packet domain, achieving superior performance compared to the WInPF filter. In Reference [19], Bian and Mercer proposed an undecimated wavelet transform approach that treats image filtering as an estimation problem. Overall, wavelet-domain filters tend to preserve spatial resolution better than other methods and have high computational efficiency. Xu et al. [20] introduced a joint denoising filter via simultaneous regularization in the wavelet domain; phase discontinuities are well preserved through this joint sparse constraint and its iterations.
The idea of non-local filtering is to exploit more information from the data itself. In general, images contain repetitive structures such as corners and lines; these redundant patterns can be analyzed and exploited to improve filtering performance. In recent years, many studies have deployed non-local techniques for SAR data filtering, from amplitude image despeckling [21-23] to interferometric phase denoising [3,24-27] and InSAR stack multi-temporal processing [28,29]. Compared to the aforementioned methods, non-local methods generally achieve state-of-the-art results. Non-local filtering adapts estimation to the local signal behaviour to deal with non-stationary images, as previous approaches do, but it also takes the entire image into consideration according to the image self-similarity property. The first non-local method applied to interferometric phase filtering was proposed by Deledalle et al. in Reference [21]. Both image intensities and interferometric phase information are used to build a non-local means model with a probability criterion for estimating pixels. NL-InSAR [3] is the first InSAR application to use a non-local approach for the joint estimation of the reflectivity, interferometric phase and coherence map from an interferogram. In References [24,30], researchers achieve better preservation of fine textural details by combining non-local filtering with pyramidal representation and singular value decomposition. A unified framework (NL-SAR) is proposed in Reference [27] as an extension of NL-InSAR, where an adaptive procedure is carried out to handle very high resolution images; it obtains the best non-local estimation with good quality on radar structures and discontinuity reconstruction. Recently, works extending and modifying existing image restoration algorithms to suit the interferometric phase domain have achieved very promising performance.
In Reference [10], a modified patch-based locally optimal Wiener (PLOW) method is proposed for interferometric phase filtering that achieves results on par with or better than non-local means. Another well-known algorithm, block-matching 3D (BM3D), also inspired researchers to propose InSAR-BM3D [26], which delivered state-of-the-art results for InSAR phase filtering. The method is not designed to do coherence estimation specifically: InSAR-BM3D computes the maximum likelihood estimates of coherence via stack-wise averages, and the estimated coherence is then used to determine the threshold at the collaborative filtering step. Hence, its performance is likely affected by the accuracy of the coherence estimation, which depends heavily on how stationary the whole stack is.
Milestone works using Convolutional Neural Networks (CNNs) have shown their ability to outperform almost all conventional algorithms on various vision-related tasks, including image restoration. Several recent SAR studies have also benefited from CNNs, including the Fuzzy superpixels based Semi-supervised Similarity-constrained CNN (FS-SCNN) model [31], which uses an ensemble learning technique to achieve superior prediction on the PolSAR image classification task. Ma et al. [32] proposed an attention-based graph CNN to improve SAR segmentation results. In Reference [33], DeepLabv3+ [34], a well-known image semantic segmentation CNN model, is adopted for oil spill identification on SAR images. A direct automatic target recognition (D-ATR) deep CNN based model is proposed in Reference [35] to obtain highly accurate and fast target recognition that outperforms all other conventional methods. These works benefit from CNNs as superior feature extractors on SAR images. Anantrasirichai et al. [36] apply CNNs to InSAR phase data for volcano deformation monitoring via transfer learning from optical images. In this work, we propose our DeepInSAR architecture, a new deep learning-based model for SAR interferometric phase restoration and coherence estimation. The model is empowered by a set of state-of-the-art deep learning techniques, relying on suitable phase-oriented solutions. We aim to design a more effective joint phase filter and coherence estimator by learning from pre-generated training data. We pre-process the InSAR data into a single tensor to perform a multi-modal fusion analysis of both phase and amplitude information. A densely connected feature extractor achieves multi-scale feature extraction and fusion, and two subsequent CNN sub-networks perform phase filtering and coherence estimation from the extracted features, respectively. InSAR phase noise can be considered zero-mean additive noise.
We therefore adopt the residual learning strategy, which has been proven in the literature as effective for removing this type of noise [37]. Meanwhile, pre-activation and bottleneck [38], as well as batch normalization [39], techniques are used to enhance training efficiency and boost the model's performance. The remainder of the paper is organized as follows. In Section 2, we briefly define our interferometric phase noise model and describe our proposed DeepInSAR architecture in detail, as well as our experimental setup. Section 3 presents quantitative and qualitative comparisons with three other established methods on both simulated and real data. Result analysis is presented in Section 4. Conclusions and future work are given in Section 5.

Phase Noise Model
Similar to the classical additive degradation model in the natural image restoration problem, an interferometric phase can also be characterized by

θ_y = θ_x + v,    (1)

which has been validated in Reference [5]. θ_y denotes the noisy observation, θ_x is the clean phase component and v is noise with zero mean and standard deviation σ; θ_x and σ are independent of each other. This follows the common assumption in natural image analysis that clean signals are independent of noise signals. Unfortunately, it is not feasible to directly apply natural image processing algorithms in the interferometric phase domain because of branch cuts. According to the SAR interferometric phase calculation, the range of the interferometric phase is within [−π, π), which means that the wrapped phase value can jump from negative to positive π or vice versa, and such jumps can represent high-frequency motion signals that should be well preserved. Therefore, in this work, we follow the strategy in References [10,18] and process the interferometric phase in the complex domain. In other words, the phase noise model can be represented by real and imaginary channels, which are continuous valued:

y_Real = cos(θ_y) = Q·cos(θ_x) + v_r,
y_Imag = sin(θ_y) = Q·sin(θ_x) + v_i.    (2)

The noisy phase observation θ_y is decomposed into two components, y_Real and y_Imag. v_r and v_i are zero-mean additive noise in the real and imaginary parts, and they are independent of the underlying clean phase signal θ_x. As analyzed in Reference [10], Q is a quality indicator, which changes monotonically with the coherence level. We designed our filtering network based on the above complex phase model. During training, the network learns to filter both real and imaginary parts, and the estimated clean phase θ̂_x can then be reconstructed from the filtered x̂_Real and x̂_Imag as

θ̂_x = arctan(x̂_Imag / x̂_Real).    (3)
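As a concrete illustration of the complex-domain strategy above, the following NumPy sketch (our own, not the authors' code) decomposes a wrapped phase into continuous real and imaginary channels and reconstructs the phase estimate with the four-quadrant arctangent, so that jumps across the ±π branch cut never appear in the filtered channels:

```python
import numpy as np

# Illustrative sketch: split a wrapped phase in [-pi, pi) into the two
# continuous channels of Equation (2), then recover the phase estimate.
def decompose(theta_y):
    """Return the cos/sin channels of the wrapped phase."""
    return np.cos(theta_y), np.sin(theta_y)

def reconstruct(x_real, x_imag):
    """Recover the phase from (filtered) real/imaginary channels."""
    return np.arctan2(x_imag, x_real)

# A jump across the +/-pi branch cut stays smooth in the two channels:
theta = np.array([-np.pi + 0.01, np.pi - 0.01])  # nearly identical directions
y_real, y_imag = decompose(theta)
theta_hat = reconstruct(y_real, y_imag)
```

In a real pipeline the filtering network would operate on `y_real` and `y_imag` before `reconstruct` is applied; here the channels are passed through unchanged purely to show the round trip.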

The Proposed DeepInSAR
In this section, we describe our proposed DeepInSAR in detail. The main goal is to establish and validate the idea of using a deep learning method to automate and accelerate both interferometric phase filtering and coherence estimation, which are conducted separately in most existing approaches. Recently, deep learning methods, especially CNNs, have been dominating various vision-related tasks. Generally, their excellent performance can be attributed to their powerful feature extraction and their ability to learn image priors during the training stage. The reasons we choose a CNN for InSAR filtering and coherence estimation are that (1) CNNs are effective at capturing spatial feature characterization with many trained parameters; (2) many achievements in deep learning can be borrowed to improve training and generalization, as well as to speed up processing and improve output data quality; and (3) powerful GPUs can speed up CNN training and runtime inference, as deep CNNs are well suited to parallel computation on modern GPUs. All these advantages make deep learning techniques promising for InSAR phase filtering and coherence estimation, where real-time processing and high-quality outcomes on large-resolution radar images are required. Figure 1 illustrates the architecture of the proposed DeepInSAR network. At a high level, our deep model includes multiple modules for handling different subtasks. The amplitudes of two SLC SAR images and their interferometric phase are concatenated into a single tensor during a preprocessing step. The output is subsequently fed into a densely connected feature extractor; dense connectivity helps extract useful features at different scales, and the composite multi-scale features are suitable for different end tasks [40]. Two feature-to-image transformations are achieved by sub-networks performing (1) phase filtering using the residual learning strategy [37] and (2) coherence estimation.
The model is expected to learn optimal discriminative functions, mapping from noisy observations to both the latent clean phase signals and the coherence, with a feed-forward neural network. Referring to our noise model in Equation (2), we propose to fully utilize all the information from the two SLCs rather than analyzing only the interferometric phase. As shown in the Preprocessing Module in Figure 1, the raw input contains two noisy co-registered SLC SAR images S_1 and S_2. The interferometric phase image I is calculated as

I = S_1 · S_2* = A_S1 A_S2 e^{i(φ_1 − φ_2)} = A_S1 A_S2 e^{iΔφ},    (4)

where A is amplitude and φ is phase. In fact, the phases in SLC images look like random noise from one pixel to another, because each pixel is a complicated function of the scattering features located on the ground surface. However, the interferometric phase Δφ represents phase-difference fringes illustrating changes in the distance between the ground and the satellite antenna, which are valuable information for InSAR-related applications but are often contaminated by noise. Intuitively, we want to incorporate amplitude images, because they usually show recognizable patterns like buildings, mountains, and valleys, which are useful spatial characterizations and hence informative for denoising and coherence estimation. For phase filtering, our proposed DeepInSAR aims to learn a mapping function F_oc : observation → clean. As shown in Equation (2), F_oc can include the noisy y_Real, y_Imag and Q as observations. In this work, we further use the two SLCs' amplitude values to replace Q in the observations, because we learn from Reference [41] that the coherence magnitude |γ| can be approximated from the two SLCs:

|γ̂| = |Σ_{m,n} S_1(m,n) S_2*(m,n)| / √( Σ_{m,n} |S_1(m,n)|² · Σ_{m,n} |S_2(m,n)|² ),    (5)

where the sums run over the estimator window and M, N represent the estimator window size. This widely used coherence estimator shows a potential mapping (A_S1, A_S2) → |γ|. Moreover, as mentioned in Section 2, Q is related to |γ|; we therefore hypothesize that there is a mapping chain (A_S1, A_S2) → |γ| → Q. Hence, no handcrafted sampling estimator is needed to estimate Q.
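The interferogram formation and the windowed maximum-likelihood coherence estimator discussed above can be sketched as follows. This is our own illustration, not the authors' implementation; the integral-image `boxcar` helper is simply one way to compute the window averages:

```python
import numpy as np

# Sketch of interferogram formation and the windowed ML coherence estimator
# (Equation (5)). boxcar computes a valid-mode m x n moving average.
def boxcar(a, m, n):
    """Valid-mode m x n moving average via 2-D cumulative sums."""
    c = np.pad(np.cumsum(np.cumsum(a, axis=0), axis=1), ((1, 0), (1, 0)))
    return (c[m:, n:] - c[:-m, n:] - c[m:, :-n] + c[:-m, :-n]) / (m * n)

def interferogram(s1, s2):
    """Point-wise product of one SLC with the conjugate of the other."""
    return s1 * np.conj(s2)

def ml_coherence(s1, s2, m=5, n=5):
    """|gamma| estimated over an m x n window."""
    num = boxcar(interferogram(s1, s2), m, n)
    den = np.sqrt(boxcar(np.abs(s1) ** 2, m, n) * boxcar(np.abs(s2) ** 2, m, n))
    return np.abs(num) / np.maximum(den, 1e-12)
```

With two identical SLCs the estimator returns coherence 1 everywhere, matching the intuition that a perfectly repeated acquisition is fully coherent.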
We propose to use a deep model to approximate the mapping function F_oc in a simplified end-to-end manner, by treating both SLC amplitudes, together with the interferometric phase, as the input observation to the network. Theoretically, sufficient and well-reasoned input helps the model learn a proper mapping function to estimate latent clean signals more precisely; the same should also support estimating the quality of the signals (coherence). Unfortunately, in real-world SAR images, the range of amplitude values can be extremely broad, from 0 to 1 × 10^6, and the scale of the values also varies across target sites and types of radar sensor. This is one of the reasons why learning-based studies have not been widely pursued for SAR analysis: training a deep discriminative model on uncontrolled amplitude values is not effective. In general, learning-based methods require each input dimension to have a similar distribution with low, controlled variance, as suggested by many deep learning studies [37,42]. Unnormalized input data can lead to an awkward loss function topology and place undue emphasis on certain parameter gradients, resulting in poor training. Hence, for a CNN layer, all input pixels should be on the same scale; the amplitude values in raw SAR images are not suitable as input data for a deep model. In this work, we introduce an adaptive method to normalize all amplitude values to lie between 0 and 1. The method saturates potential outliers while keeping most of the dynamic range of the original image, without destroying or cutting off essential ground characteristics.
If data roughly follow a normal distribution, the standard Z score of each data point can be calculated as the position of a raw score in terms of its distance from the mean, measured in standard deviation units [43]. However, SAR amplitude values follow a Rayleigh distribution [44] with potential extremes in the distribution tail. Hence, the mean is not statistically robust in our case and is easily influenced by outliers. In this study, we apply a modified Z score [45], which estimates the Z score based on the Median Absolute Deviation (MAD). The MAD value of an SLC amplitude image A is calculated as

MAD = median(|A − Ã|),    (6)

where Ã is the median of the data. Next, we transform the data into the modified Z score domain:

A_mz = 0.6745 (A − Ã) / MAD,    (7)

where A_mz represents each pixel's modified Z score and 0.6745 is the 0.75th quartile of the standard normal distribution, to which the MAD converges. For outlier detection, researchers commonly threshold the absolute values of modified Z scores, where data points with |Z| score greater than 3.5 are potential outliers and are ignored [45]. Figure 2 shows 6 SLC amplitude images selected from three real-world datasets captured by TerraSAR-X in StripMap mode [46], with 2 SLCs taken at different times for each stack. As the raw amplitude values and histograms in the 1st and 2nd rows of Figure 2 show, the data points are close to the Rayleigh distribution mentioned above, so simply cutting off values according to the modified Z score might lose information located in the right tail of high amplitude values. Although a logarithm transformation could help visualize the images better, there is no fixed base suitable for all images because they may differ by orders of magnitude. In our proposed normalization method, we first adopt the modified Z score as a transformation that forces most values close to 0, so that all potential outliers lie far from 0, with magnitudes greater than 3.5.
To give a standard input data distribution for training the neural network, we then apply the hyperbolic tangent non-linear function

Â = tanh(A_mz / 7),    (8)

to bind all input amplitudes with a controlled variance. A good property of the tanh(x) function is that input values between −1 and 1 are enhanced while others are saturated. In our case, we divide A_mz by 7 (two times 3.5) so that the majority of data points lie between −1 and 1; ground characteristics can thus be enhanced after the tanh operation. Data points with relatively high amplitudes are still kept on the right tail, and extremely high values, likely outliers, are saturated close to 1. Note that we further normalize the transformed data to the range [0, 1], because we use Rectified Linear Unit (ReLU) activations to introduce nonlinearity in the CNN; non-negative input is recommended to avoid saturated neurons at an early training stage when using ReLU activations in the early layers [47]. As shown in the 3rd row of Figure 2, after our proposed data normalization, all amplitude values lie in the range 0 to 1 and essential details are delivered without loss or distortion. One can also observe this in the 4th row of Figure 2. The final observation o is the tensor [y_Real, y_Imag, Â_S1, Â_S2], and is the input to the proposed DeepInSAR.
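The whole normalization chain can be sketched in a few lines. Note that the final `0.5 * (t + 1)` rescaling is our reading of the "further normalize to the range [0, 1]" step, not a formula stated in the text:

```python
import numpy as np

# Sketch of the amplitude normalization chain: MAD-based modified Z score,
# tanh squashing with divisor 7 (two times the 3.5 outlier threshold),
# then a shift into [0, 1]. The final rescaling is our own assumption.
def normalize_amplitude(a):
    med = np.median(a)
    mad = np.median(np.abs(a - med))     # median absolute deviation
    a_mz = 0.6745 * (a - med) / mad      # modified Z score
    t = np.tanh(a_mz / 7.0)              # bind values, saturate outliers
    return 0.5 * (t + 1.0)               # map (-1, 1) into (0, 1) for ReLU input
```

Even an extreme outlier (e.g., an amplitude of 10^6 in a scene whose median is a few hundred) is saturated smoothly toward 1 rather than clipped, which is the behaviour the text describes.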

Filtering with Residual Learning
Residual learning was designed to solve the performance degradation problem in very deep neural networks [48]. In our interferometric phase filtering, we apply a similar idea but without using many skip-connections within the network; we only create identity shortcuts for predicting the residuals of the real and imaginary channels. Instead of directly outputting the estimated clean components, the proposed model is trained to predict residuals; the model implicitly filters the latent clean signals through hidden operations within the deep neural network. For each channel c ∈ {real, imag}, we have the loss function

ℓ(W_fe, W_c) = (1/2N) Σ_{i=1}^{N} ||R_c(o_i; W_fe, W_c) − (y_i − x_i)||²,    (9)

where W_fe, W_real and W_imag are the trainable parameters of the model corresponding to the feature extractor and the real and imaginary channels, respectively, and (y_i, x_i) are noisy-clean training sample (patch) pairs. During the training iterations, for both real and imaginary channel filtering, our model aims to learn a residual mapping R(o) ≈ y − (y − v)/Q according to our noise model (Equation (2)); the clean components can then simply be recovered as x = y − R(o). A residual mapping is much easier to learn than the original unreferenced mapping, and it has been shown to produce excellent results in many low-level inverse restoration problems such as image super-resolution [49] and image denoising [37]. To the best of our knowledge, we are the first to use residual learning and CNNs for InSAR phase filtering. The model thus learns a residual mapping R : observations → residuals on the real and imaginary channels respectively. Furthermore, it is known that the phase noise variance σ_θ² can be approximated from the coherence magnitude |γ| [41]:

σ_θ² = π²/3 − π·arcsin(|γ|) + arcsin²(|γ|) − Li₂(|γ|²)/2,    (10)

where Li₂ is Euler's dilogarithm. Our input tensor for phase filtering includes the two SLCs' amplitudes, which are correlated with the coherence magnitude; hence, our designed observation input is well-reasoned for predicting phase residuals.
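The residual convention and the variance approximation above can be checked numerically. In the sketch below (our own code), the dilogarithm is evaluated by its power series as a stand-in for Euler's Li₂:

```python
import numpy as np

# Sketch of the residual-learning convention x = y - R(o), and of the
# single-look phase standard deviation implied by the variance formula.
# The power-series li2 is our own stand-in for Euler's dilogarithm.
def clean_from_residual(y, residual):
    """Recover the clean channel once the network has predicted the residual."""
    return y - residual

def li2(z, terms=4000):
    """Dilogarithm Li2(z) for 0 <= z <= 1 via its power series."""
    k = np.arange(1, terms + 1)
    return float(np.sum(z ** k / k ** 2))

def phase_std(coh):
    """Approximate phase standard deviation for coherence magnitude |gamma|."""
    a = np.arcsin(coh)
    var = np.pi ** 2 / 3.0 - np.pi * a + a ** 2 - 0.5 * li2(coh ** 2)
    return np.sqrt(max(var, 0.0))
```

The formula behaves as expected at the extremes: zero coherence gives the variance of a uniform phase on [−π, π), i.e., a standard deviation of π/√3, and the standard deviation shrinks toward 0 as coherence approaches 1.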

Coherence Estimation
A coherence map is estimated from two co-registered SAR images and is usually used as an indicator of phase quality. Demarcation of image regions based on the degree of contamination ("coherence") is an important component of the InSAR processing pipeline. A coherence of 0 denotes complete decorrelation; on the other hand, deformation can be measured successfully and accurately where coherence is high. Lower interferometric quality corresponds to a decreasing coherence level and an increasing level of phase noise. Interferometric fringes can only be observed where image coherence prevails. Filtered output is usually combined with the coherence map for further processing, because the coherence map indicates how much useful signal is potentially present in an area. Some filtering studies also require a coherence map within the filtering process. However, most of them use the Maximum Likelihood (ML) estimator (Equation (5)) or its extensions, which are usually significantly biased when using small window sizes; with large window sizes, these methods lose resolution and increase computational cost. Generally speaking, an area on the ground is treated as coherent when it appears to have similar surface characterization within all images under analysis. However, between two SAR acquisitions, subareas will decorrelate if the land surface is disturbed. Therefore, a CNN is a very good candidate to handle this spatial and non-local analysis, especially on our input o, where almost all necessary information is available for learning the features and capturing mapping functions. During training, the model learns to capture prior knowledge from all training samples and represents that knowledge as network weights. Intuitively, our method performs a more reliable and robust non-local analysis compared to conventional non-stack based work, which considers only one interferogram.
It is also more time-efficient than stack-based methods, because no heavy runtime analysis is required once training is done. Our model has a separate module in the proposed DeepInSAR for coherence estimation, which uses the same features extracted from the observation o, as shown in Figure 3. Because coherence lies in the range [0, 1], we calculate the sigmoid cross entropy loss, given the logits c = F_coh(o; W_fe, W_coh) obtained from the last convolution layer's output:

ℓ_coh = −(1/N) Σ_{i=1}^{N} [ z_i log σ(c_i) + (1 − z_i) log(1 − σ(c_i)) ],    (11)

where σ(·) is the sigmoid function and z is the reference coherence map, which can be pre-calculated by any existing coherence estimator in order to generate a training dataset for real images.
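The coherence loss above can be sketched in the numerically stable form used by common frameworks (e.g., TensorFlow's `sigmoid_cross_entropy_with_logits`); the formulation below is our own illustration, not the authors' code:

```python
import numpy as np

# Sigmoid cross entropy between logits c and a reference coherence map z
# in [0, 1], written in the numerically stable form
#   max(c, 0) - c*z + log(1 + exp(-|c|)),
# which is algebraically identical to -z*log(sig(c)) - (1-z)*log(1-sig(c)).
def sigmoid_xent(c, z):
    return float(np.mean(np.maximum(c, 0.0) - c * z
                         + np.log1p(np.exp(-np.abs(c)))))
```

The stable form avoids evaluating log(σ(c)) directly, which overflows for large negative logits.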

Shared Feature Extractor with Dense Connectivity
Natural images exhibit repetitive patterns, such as geometric and photometric similarities, which provide cues to improve filtering performance. This concept is also valid for InSAR interferometric phase and SAR amplitude images. However, though CNNs perform well on vision-related tasks, it is known that as CNNs become increasingly deep, both input and gradient information can vanish and "wash out." Recent works such as ResNet [48,50] have addressed this problem by building shorter connections between layers close to the input and those close to the output. By doing this, CNNs can be substantially deep while retaining accurate performance and efficient training. We adopt the densely connected CNN introduced in Reference [40] as a shared feature extractor before the real-imaginary filter and the coherence estimator. In the single-look interferometric phase, the latent noise level is related to the coherence magnitude [41]. A shared feature extractor for both the phase filter and coherence estimation modules is expected to capture this relationship in latent space, because the weights W_fe in the feature extractor are updated based on gradient feedback back-propagated from both the phase residual prediction and the coherence estimation, as shown in Figure 3. During training, the model can encode a non-local image prior by updating network parameters according to both the phase filter and coherence estimator losses. After training, the model directly produces filtering and coherence output with a learned discriminative network function, without any runtime non-local analysis.
Furthermore, because of the dense connectivity, our feature extractor follows a multi-supervision scheme that learns to extract common feature parameters for all related subsequent tasks [39]. With dense connectivity, each layer in the feature extractor is connected to every other layer in a feed-forward manner. During gradient back-propagation, each layer's weights are updated based on all subsequent layers' gradients [40]. As shown in Figure 1, the features extracted by each layer in the feature extractor module of DeepInSAR are based on all preceding layers' outputs; at the same time, each layer's own output is passed to all subsequent layers as input. In our network, all feature maps extracted at different depth levels are passed to both the phase filter and the coherence estimator as a single concatenated tensor. Note that, following deep CNNs' working mechanism, early layers extract the most detailed, low-complexity features with a small receptive field; with increasing depth, later layers in the feature extractor extract high-level, complex features with a larger receptive field. Therefore, a densely connected CNN feature extractor allows each sub-module to perform its own task with multi-scale and multi-complexity features. The proposed DeepInSAR also achieves deep supervision by allowing each layer in the feature extractor direct access to the gradients from both sub-modules. Dense connectivity gives the model better feature propagation and enables feature reuse and fusion, which is important for InSAR phase filtering and coherence estimation. In real-world images, ground sites contain characteristics at very different scale levels; this is why most existing methods require user-defined window sizes to extract image characteristics. All these methods therefore suffer from the inability to choose a generic optimal window size and fail to generalize automatically to different data sites.
In our case, we use a dense CNN based feature extractor to intelligently select the best multi-level features for the subsequent modules. The experiments in Section 4 show that our model is capable of generalizing phase filtering and coherence estimation across features of different scales within one image, as well as performing effectively on new site images.
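The dense wiring described above can be illustrated with a toy sketch. Each "layer" consumes the concatenation of all preceding feature maps, and the extractor emits every level concatenated for the two downstream heads; a random channel-mixing step with ReLU stands in for the learned convolutions, purely to show the connectivity pattern:

```python
import numpy as np

# Toy sketch of dense connectivity (our illustration, not the DeepInSAR code):
# every layer receives the concatenation of all preceding outputs, and the
# final fused tensor concatenates every level for the downstream heads.
def dense_extractor(x, num_layers=4, growth=8, seed=0):
    rng = np.random.default_rng(seed)
    features = [x]                                  # list of (H, W, C_i) maps
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=-1)     # all preceding outputs
        w = rng.standard_normal((inp.shape[-1], growth)) * 0.1
        features.append(np.maximum(inp @ w, 0.0))   # channel mix + ReLU stand-in
    return np.concatenate(features, axis=-1)        # multi-scale fusion tensor
```

With an input of C channels and L layers of growth rate k, the fused output has C + L·k channels, and the raw input itself is part of the fused tensor, which is what lets each head pick features of any depth.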

Teacher-Student Framework
Based on our findings, the main reason why deep learning techniques have not been widely pursued in InSAR filtering and coherence estimation so far is the lack of ground truth image data (noise-free references) for training such models. For training our proposed DeepInSAR model, we need image pairs as described in Section 3. However, there is no ground truth for real-world InSAR images. Therefore, we introduce a teacher-student framework to make it feasible to train DeepInSAR on real-world images. From the literature, stack-based methods like PtSel [51] generally give reliable results. PtSel is an industry-level algorithm for coherence estimation and interferometric phase filtering, which searches for similar pixels across a stack of interferograms in both spatial and temporal domains. The PtSel algorithm generates the coherence map for a stack of interferograms in three key steps (Figure 4). The filtering process then replaces each interferogram pixel with the weighted mean of the phase values of its neighbouring pixels, where the weights are the PtSel-generated coherence values. Despite their accuracy, stack-based methods require historical SLCs and intensive online parallel searching on a high-end GPU farm, which limits their integration into a time-critical InSAR processing chain. Stack-based methods also have to wait several months to collect sufficient data before they can start processing a new site. Although existing stack or non-stack based methods are powerful, most of them require a human expert to ensure intermediate output quality, because they are incapable of automatically detecting and removing all possible real-world noise patterns from InSAR data. We introduce a deep neural network to replace the manual pre-processing (feature extraction) and post-processing (quality inspection) with a single intelligent trainable model.
Similar to training an object classification neural network, a large human-labeled dataset is required in our approach; humans thus act as teachers, teaching the model how to classify objects by providing labeled data. For InSAR phase restoration and coherence estimation, we adopt the PtSel method, with human tuning and full stack processing, to create reference filtered phase images and coherence maps that are sufficiently reliable. The details of the PtSel algorithm and its GPU implementation can be found in References [51,52]. In our approach, PtSel with expert supervision becomes the teacher of the proposed DeepInSAR model, which is the student. We are able to demonstrate that, after training, (1) the student DeepInSAR can generate results on par with or even better than its teacher method, PtSel, on the same test data sets; (2) our model requires only feed-forward inference on a single pair of SLCs, while PtSel requires more than thirty SLCs; and (3) our model can output filtering and coherence results in one pass, while PtSel requires back-and-forth tuning and a time-consuming phase unwrapping step.
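The teacher's filtering step described above (replacing each pixel's phase by the coherence-weighted mean of its neighbours) can be sketched as follows. The window size, names, and the circular averaging are our own illustrative choices, not the actual PtSel implementation:

```python
import numpy as np

# Toy sketch of coherence-weighted phase filtering: each pixel's phase is
# replaced by the coherence-weighted mean of its neighbourhood, averaged on
# the unit circle so that phase wrapping is respected.
def weighted_phase_filter(phase, coherence, half=1):
    h, w = phase.shape
    unit = np.exp(1j * phase)               # work on the unit circle
    out = np.empty_like(phase)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - half), min(h, i + half + 1)
            j0, j1 = max(0, j - half), min(w, j + half + 1)
            wts = coherence[i0:i1, j0:j1]
            out[i, j] = np.angle(np.sum(wts * unit[i0:i1, j0:j1]))
    return out
```

Averaging the complex unit vectors rather than the raw phase values is what keeps the weighted mean correct near the ±π branch cut.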

Experimental Setup
We compared our method with a number of other non-stack based methods that can also perform both phase filtering and coherence estimation: (1) the BoxCar filter, (2) NL-SAR [27], and (3) NL-InSAR [3]. We used the publicly available implementations of these methods found at https://github.com/gbaier/despeckCL. Note that all parameters were set, where applicable, as suggested by the authors of the original papers, or else chosen to optimize performance. We implemented the proposed DeepInSAR using TensorFlow-GPU 1.10; the code is available at https://github.com/Lucklyric/DeepInSAR. In order to maximize the randomness of the training patch samples, for a given training dataset, the model was trained on randomly extracted image patches of size 128 × 128 on the fly [53]. Network parameters were updated using the Adam optimizer with a batch size of 64 and an initial learning rate of 0.001. The model was trained on two NVIDIA 1080 GPUs for 6 hours with 1.6 × 10^5 iterations. To fairly compare computational time, we executed all methods on the same GPU with an i7-8700K processor and 32 GB RAM. It is worth noting that we built and trained our model using common hyper-parameter settings, because the work presented in this paper mainly validates the feasibility of using deep learning techniques for InSAR phase filtering and coherence estimation. It is expected that more extensive hyper-parameter tuning would further improve the performance of our deep model, based on the findings in References [40,49]. We conducted our experiments using both simulated and real-world data to assess the effectiveness and robustness of the proposed model. In this section, we also discuss learning capacity and generalization ability, which are essential criteria for evaluating a learning model.
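The on-the-fly patch sampling described above can be sketched as follows; the function name and signature are ours, for illustration only:

```python
import numpy as np

# Sketch of on-the-fly training batch construction: each step draws random
# 128 x 128 crops from the preprocessed input tensors.
def sample_patches(tensors, batch_size=64, patch=128, seed=0):
    rng = np.random.default_rng(seed)
    batch = []
    for _ in range(batch_size):
        t = tensors[rng.integers(len(tensors))]        # random source tensor
        i = rng.integers(t.shape[0] - patch + 1)       # random top-left corner
        j = rng.integers(t.shape[1] - patch + 1)
        batch.append(t[i:i + patch, j:j + patch])
    return np.stack(batch)                             # (batch, patch, patch, C)
```

Sampling crops lazily per step, rather than pre-cutting a fixed patch dataset, is what maximizes the randomness of the training samples mentioned in the text.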

Results on Simulation Data
In this section, we present quantitative results using simulated data. Simulated data allows us to evaluate the filtered quality in a controlled environment by comparing with the simulated ground truth. Ground truth is treated as an optimal teacher for training our proposed DeepInSAR; we can objectively demonstrate our model's capability to learn proper phase filtering and coherence estimation for new simulated testing images, with ground truth available. The simulation strategy is similar to the work for generating the interferometric phase in Reference [26]. Instead of synthesizing a limited known patterns, the additional advantage is to extend the simulation for randomly generated irregular motion signals, ground reflective phenomena, as well as non-stationary noisy conditions. We designed a synthetic InSAR generator to randomly simulate a pair of SLC SAR images with the following procedure: • Generate first SLC image S 1 with 0 phase value. The amplitude value grows from 0.1 to 1 from the left-most column in the image to the right column following a Rayleigh distribution. This leads to a linearly growing of coherence from left to right.

• Generate the second SLC image S2 by adding random Gaussian bubbles to the phase as synthetic motion signals. Its amplitude is equal to that of S1.

• Add random low-value amplitude bands (less than 0.3) to S1 and S2 to simulate stripe-like low-amplitude incoherent areas.

• Generate the noisy SLCs S1_noisy and S2_noisy by adding independent additive Gaussian noise v to both the real and imaginary channels of S1 and S2.

• Calculate the clean and noisy interferometric phases I and I_noisy.

• Calculate the ground-truth coherence using the clean amplitude, the phase, and the standard deviation of the base noise v.
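The steps above can be sketched in NumPy as follows. Function names, parameter ranges, bubble counts, and image sizes here are illustrative only (and the low-amplitude stripe step is omitted for brevity); the released simulator at https://github.com/Lucklyric/InSAR-Simulator is authoritative:

```python
import numpy as np

def gaussian_bubble(h, w, cx, cy, sigma, amp):
    """Smooth 2-D Gaussian 'bubble' used as a synthetic motion signal (radians)."""
    y, x = np.mgrid[0:h, 0:w]
    return amp * np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))

def simulate_pair(h=256, w=256, noise_std=0.3, n_bubbles=3, seed=0):
    rng = np.random.default_rng(seed)
    # S1: zero phase; amplitude ramps from 0.1 to 1.0 left to right,
    # so coherence grows from left to right.
    amp = np.tile(np.linspace(0.1, 1.0, w), (h, 1))
    s1 = amp.astype(complex)
    # S2: same amplitude, phase perturbed by random Gaussian bubbles.
    phase = np.zeros((h, w))
    for _ in range(n_bubbles):
        phase += gaussian_bubble(h, w, rng.uniform(0, w), rng.uniform(0, h),
                                 rng.uniform(10, 40), rng.uniform(-6, 6))
    s2 = amp * np.exp(1j * phase)
    # Noisy SLCs: independent Gaussian noise on real and imaginary channels.
    noise = lambda: rng.normal(0, noise_std, (h, w)) + 1j * rng.normal(0, noise_std, (h, w))
    s1n, s2n = s1 + noise(), s2 + noise()
    # Clean and noisy interferograms: point-wise product with the conjugate.
    return s1 * np.conj(s2), s1n * np.conj(s2n)
```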
Our simulated image generator includes a set of parameters for controlling the complexity of the interferometric phase at different distortion levels. We generated 18 different configurations by combining (1) three base Additive White Gaussian Noise (AWGN) levels of v (S1, S2, S3), (2) three fringe frequency levels of the phase fringes (F1, F2, F3), and (3) with or without low-amplitude strips (S, NS). For example, the dataset with a relatively high level of base noise and low fringe frequency, with low-amplitude stripes, is denoted S3-F1-S. Sample images are shown in the first column of Figure 5. We generated 100 samples with 1000 × 1000 image resolution under each configuration; half of them were used for training and the rest for testing. In this experiment, in order to assess the learning capacity and generalization ability of our proposed DeepInSAR model, a single model was trained on all 18 datasets with the noise-free ground truth images (teacher). Because all amplitude stripes and motion signals are randomly generated, all images in the training and testing datasets are distinct. Figure 5 shows randomly selected samples from our simulation dataset. Our data generator is inspired by the noise simulation strategy described in Reference [54]. Basically, we simulate speckle noise by adding uncorrelated zero-mean Gaussian random variables to the real and imaginary parts of both synthetic SLCs before multiplying them for interferogram generation. To obtain the ground truth coherence for the simulated interferogram, we empirically map the standard deviation of those random variables and the ground truth amplitude to a coherence value, since increasing the noise decreases the coherence and decreasing the amplitude also decreases the coherence. In this case, each pixel in the generated interferogram is composed of 4 zero-mean Gaussian random variables with identical standard deviation.
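One standard closed form behind such a mapping: when independent zero-mean Gaussian noise of standard deviation σ is added to the real and imaginary parts of two SLCs whose underlying signal has amplitude A, the expected coherence magnitude is A²/(A² + 2σ²). The paper's exact empirical mapping may differ; this sketch only illustrates the monotonic behaviour described above:

```python
import numpy as np

def coherence_from_noise(amplitude, noise_std):
    """Expected coherence when zero-mean Gaussian noise of standard deviation
    `noise_std` is added independently to the real and imaginary parts of two
    SLCs whose underlying signal has amplitude `amplitude`:

        gamma = A**2 / (A**2 + 2 * sigma**2)

    Coherence falls as the noise grows and as the amplitude shrinks.
    """
    a2 = np.asarray(amplitude, dtype=float) ** 2
    return a2 / (a2 + 2.0 * noise_std ** 2)
```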
The source code of our simulator and the full-resolution simulated samples used in the experiments are available online at https://github.com/Lucklyric/InSAR-Simulator. Visual comparisons of the BoxCar, NL-InSAR, NL-SAR, and our proposed DeepInSAR methods are presented in Figure 6. Each pair of rows shows the phase filtering and coherence estimation for the three images in Figure 5 respectively, where (a-d) are filtering outputs and (e-h) are coherence estimations of S1-F3-NS, (i-l) are filtering outputs and (m-p) are coherence estimations of S2-F2-NS, and (q-t) are filtering outputs and (u-x) are coherence estimations of S3-F1-NS. Visual inspection of the filtered outputs against the ground-truth clean phase images in Figure 5 shows that our model preserves phase structural details better than the other methods as the base noise level (Figure 6q-t) and the fringe frequency (Figure 6a-d) increase. All methods work fairly well on the low-level noise (S1) and low-level fringe frequency (F1) cases. However, with increasing distortion level, the other methods perform rather poorly. The BoxCar filter loses resolution and produces noticeably squiggly artifacts (Figure 6j,r). In particular, with high base noise (S3) and high fringe frequency (F3), our model loses only insignificant detail, especially in the relatively low-coherence regions on the left (Figure 6a,q). Although NL-InSAR can guarantee strong noise suppression with detail preservation on high-frequency fringes (Figure 6c), it over-smooths the image as the phase distortion level keeps increasing (2nd row of Figure 5); fringe structures are washed out when both the distortion level and the fringe frequency are high (Figure 6k). For coherence estimation, our proposed DeepInSAR best matches the ground truth (Coherence row in Figure 5). BoxCar and NL-SAR tend to output low coherence on fast-moving areas (Figure 6f,h).
NL-InSAR and NL-SAR fail to compute correct coherence around low amplitude strips (Figure 6w,x). NL-InSAR also shows inaccurate coherence estimation between the phase jumps (Figure 6h,p).
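The BoxCar baseline discussed above is simple enough to sketch; its low-pass behaviour explains both the loss of resolution in the filtered phase and the low coherence it reports in fast-moving areas. A minimal version (the window size `k` is illustrative, and real implementations differ in boundary handling):

```python
import numpy as np

def box_mean(x, k):
    """k x k moving average (odd k) via 2-D cumulative sums, edge-padded."""
    p = k // 2
    xp = np.pad(x, p, mode='edge')
    c = np.cumsum(np.cumsum(xp, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # zero row/column so window sums are simple differences
    return (c[k:, k:] - c[:-k, k:] - c[k:, :-k] + c[:-k, :-k]) / (k * k)

def boxcar(s1, s2, k=5):
    """BoxCar filtering and coherence estimation from two SLCs: multilook the
    interferogram with a k x k window; coherence is the normalised magnitude
    of the local average."""
    num = box_mean(s1 * np.conj(s2), k)
    den = np.sqrt(box_mean(np.abs(s1) ** 2, k) * box_mean(np.abs(s2) ** 2, k))
    filtered_phase = np.angle(num)
    coherence = np.abs(num) / np.maximum(den.real, 1e-12)
    return filtered_phase, coherence
```

Because every pixel in the window is weighted equally, fringes narrower than the window are averaged away, which is exactly the resolution loss observed in Figure 6.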
We also use objective assessment to evaluate the performance of our method. Our test datasets include 18 × 50 = 900 simulated images with noisy and ground truth phase images, as well as the corresponding coherence indices. The results obtained from BoxCar, NL-InSAR, NL-SAR and our proposed DeepInSAR are compared. We computed both the Root Mean Square Error (RMSE) in radians (Table 1) and the mean Structural Similarity (SSIM) between the filtered phase image and the noise-free ground truth to quantitatively evaluate filtering performance (Table 2). RMSE and mean SSIM are also used to assess coherence estimation (Tables 3 and 4). The numerical results further confirm our observations: the proposed DeepInSAR significantly outperforms all other methods on most of the 18 distortion levels. From the simplest (S1-F1-NS) to the most challenging (S3-F3-S) simulation task, all methods show decreasing performance on both phase filtering and coherence estimation. However, the proposed DeepInSAR has the least performance degradation and consistently gives better results than the other methods, with a total mean RMSE of 0.8536 radians and a mean SSIM score of 0.8666 for phase filtering. This statistical analysis shows that our proposed model can effectively remove the noise while maintaining the structural information. Our coherence estimation also shows superior performance, with a total mean RMSE of 0.2167 and a mean SSIM score of 0.7984. The coherence computed by all other methods becomes biased as the data complexity increases, especially with dense phase fringes (F3) and low amplitude strips (S).
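The two metrics can be sketched as follows; `phase_rmse` wraps the difference so that errors across the ±π boundary are not overstated, and `global_ssim` is a single-window simplification of the windowed mean SSIM actually reported (both helper names are hypothetical):

```python
import numpy as np

def phase_rmse(filtered, reference):
    """RMSE in radians, with the difference wrapped to (-pi, pi] so that an
    error straddling the 2*pi phase jump is not counted as a large error."""
    d = np.angle(np.exp(1j * (filtered - reference)))
    return float(np.sqrt(np.mean(d ** 2)))

def global_ssim(x, y, data_range=2 * np.pi):
    """Single-window SSIM over the whole image (a simplification of the
    windowed mean SSIM used in the tables)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cov + c2) /
                 ((mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)))
```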

Results on Real Data
Real complex features and noise patterns cannot be fully replicated by simulated data. However, we can conclude from the simulation experiments that, given close-to-clean reference data for teaching DeepInSAR, the model can learn the latent mapping from training samples. As mentioned in Section 2.2.5, we use PtSel with expert supervision to generate clean reference phases and coherence maps for three real-world datasets captured by TerraSAR-X in StripMap mode [46]: (1) Site-A with 27 SLCs, (2) Site-B with 37 SLCs, and (3) Site-C with 103 SLCs. We used a cropped version of these datasets with a size of 1000 × 1000 pixels. For coherence estimation, because the window-based PtSel coherence estimator is biased [51], we applied a binary threshold of 0.5 to PtSel's coherence output to transform the original regression problem into a classification task. During inference, we use the coherence estimator's sigmoid output as a confidence level representing the final coherence magnitude. To demonstrate the generalization ability of the proposed DeepInSAR on real-world InSAR data, we trained the model using images from two sites and tested its robustness on the third site. Three representative interferograms selected from each of the three real datasets are shown in Figure 7.
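The reformulation described above amounts to the following sketch; the helper names are hypothetical, and the network and loss details follow the paper's architecture, which is not reproduced here:

```python
import numpy as np

def coherence_targets(ptsel_coherence, threshold=0.5):
    """Binary classification targets from PtSel's (biased) window-based
    coherence: 1 for coherent pixels, 0 for incoherent ones."""
    return (np.asarray(ptsel_coherence) >= threshold).astype(np.float32)

def sigmoid(logits):
    """At inference the estimator's sigmoid output is read directly as the
    coherence magnitude, i.e. a confidence in [0, 1]; it is not re-thresholded."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
```

Training against binary targets sidesteps the estimator bias of the teacher, while the continuous sigmoid confidence still yields a usable coherence magnitude at inference time.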
Filtered phases and estimated coherence obtained using BoxCar, NL-InSAR, NL-SAR, PtSel, and our trained DeepInSAR are shown in Figures 8-10, which are the outputs for the three real sites given in Figure 7. We use qualitative comparison because we do not have noise-free real images for quantitative evaluation. The BoxCar filter tends to blur fringe edges in all the visual samples, mainly because of its low-pass behaviour, and it under-filters near incoherent areas, which can be easily observed when zooming in. In Figure 8, there appears to be a minor loss of resolution in thin strips (when zoomed in) for the proposed method compared to PtSel (stack-based), but it is still much better than all other methods that use a single interferogram. NL-InSAR has more striping artifacts that cause streaks in the phase along incoherence boundaries, which also show up in its coherence output (Figure 9). It also produces artifacts that follow the benches rather than the fringe lines (Figure 10). NL-SAR can significantly remove the noise, but it also over-filters, breaking some fringes and merging small-scale signals with neighbouring fringes (Figure 8). Overall, though the non-local based NL-SAR and NL-InSAR can provide filtered phase as sharp and visually appealing as DeepInSAR in high coherence areas, in medium and low coherence areas they tend to flatten the phase and create artifacts in highly noisy areas (Figure 8). Both methods have lower overall variance and less blurring than the BoxCar filter, though NL-InSAR shows high variance in its estimates between the coherence/amplitude boundaries, with streaky artifacts. Our proposed DeepInSAR shows a good balance between noise removal and structural preservation. Regarding coherence estimation, the proposed DeepInSAR consistently gives better contrast and fewer spurious high coherence points within the low coherence areas in all the visual samples, and would therefore be easier to use as the weighting mask for subsequent InSAR processing, e.g., phase unwrapping, than the other methods. In NL-SAR's and NL-InSAR's coherence outputs, there are also artifacts showing high coherence dots in low coherence areas. This limitation is caused by the numerical instability of NL-InSAR's algorithm and its preferential treatment of amplitude when the amplitude similarities disagree with the phase similarities; NL-InSAR's weakness is also discussed in References [55,56]. Compared to these non-stack based methods, our DeepInSAR offers both strong noise suppression and detail preservation, and gives clear, high-contrast coherence estimation. It performs on par with and even better than its stack-based teacher method, PtSel. PtSel's coherence estimation is biased toward low coherence in the dynamic areas (Figures 8 and 10), because it requires the target to remain stable over a long period of time [51].

Discussion
High fringe frequency indicates fast-moving areas on the ground. These areas usually introduce many phase jumps (−π to +π) in the wrapped interferogram. As mentioned before, structural information is among the most important information that any phase filtering method should preserve, because the performance of subsequent InSAR processing, for example phase unwrapping, is heavily affected by distorted fringe structure. Many gradient-based phase-unwrapping methods rely on phase gradients and derivatives, which are types of structural information [10]. An effective InSAR phase filter should preserve structural details as much as possible [16], and our proposed method demonstrates this capability. For such an evaluation, SSIM is a better metric than RMSE for assessing how much structural information has been preserved after filtering. The mean SSIM scores (Table 2) indicate that our method preserves details excellently even on highly dense fringes (F3), whereas all reference methods show decreasing performance as the fringe density increases. Our model shows more noticeable improvement under the SSIM metric than under the RMSE metric. This is because RMSE measures absolute errors, while SSIM focuses on structural similarity. If a filter over-filters or breaks the boundary between phase jumps, it may show only insignificant RMSE changes but will introduce a significant SSIM degradation. Furthermore, if a filter fails to fully suppress the noise, the residual noise in the output image is also reflected more sensitively by the SSIM score, as in natural images [57]. This is the main reason why we use the SSIM metric in the comparisons. Note that the structural information of the coherence map is not as important as that of the filtered phase, because coherence values are mostly used as a thresholding or weighting metric for subsequent processing.
However, we still report the SSIM metric for coherence estimation to enrich the experimental analysis. Table 4 shows that the proposed DeepInSAR predicts the coherence map that most closely matches the ground truth. This explains why our method gives high contrast and clear boundaries between extremely low and high coherence areas in both the simulation and real-site outputs. We believe that a method which can precisely recover the structural information in the coherence map should also benefit subsequent processing with a more detailed and precise coherence indication.
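A toy numerical illustration of why SSIM is the more sensitive metric here. "Over-filtering" is simulated simply by attenuating fringe contrast, and the SSIM is a single-window form; both are illustrative simplifications, not the paper's exact evaluation code:

```python
import numpy as np

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def global_ssim(x, y, data_range=2 * np.pi):
    # Single-window SSIM: mean/variance/covariance over the whole image.
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cov + c2) /
                 ((mx ** 2 + my ** 2 + c1) * (x.var() + y.var() + c2)))

rng = np.random.default_rng(0)
# Reference: dense wrapped fringes in (-pi, pi].
ramp = np.linspace(0, 16 * np.pi, 256)
ref = np.angle(np.exp(1j * np.tile(ramp, (256, 1))))
# "Over-filtered": fringe contrast attenuated; "residual noise": contrast kept.
over_filtered = 0.5 * ref
noisy = ref + rng.normal(0, 0.5 * ref.std(), ref.shape)
# The two degradations have comparable RMSE, but SSIM penalises the
# structural (contrast) loss of over-filtering far more heavily.
```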
Moreover, in Figure 7, we used three very different real-site interferogram examples. Similarly, all test simulation data were generated randomly as described in Section 3.1. Both quantitative and qualitative results confirm that our trained DeepInSAR model generalizes well to new InSAR data without any human supervision or parameter adjustment, which the other methods require. For example, when we reduced the search window size, NL-SAR and NL-InSAR were able to filter well on highly dense fringes, but then under-filtered slow-motion areas. During the experiments, we had to manually tune the parameters of the reference methods in order to get reasonable results. Their coherence estimators have similar limitations: all three reference methods tend to give better results when using (1) a small window size on highly dense fringe areas but (2) a large window size on low-frequency motion. No fixed size works for all 18 simulated distortion levels. In comparison, our proposed model's coherence output is closest to the ground truth in all distortion cases, and our learning-based DeepInSAR works well for all 18 simulated datasets with a single trained model. It has successfully learned the mapping from noisy observations (18 different distortions) to latent clean signals and coherence magnitudes, given proper training samples to explore. Using a densely connected feature extractor gives DeepInSAR the ability to handle multi-scale signal characteristics intelligently with a single model. Since the simulated signal patterns are random, the simulated motion patterns, noise conditions, and low-reflectivity strips are irregular across all training and testing images.
The evaluation on the test dataset shows that our trained model does not suffer from over-fitting and exhibits only a small generalization error, which does not affect its superior performance. It learns well from the teacher and generalizes to new InSAR data. From an operational point of view, NL-InSAR produces a large number of artifacts in both the phase and the coherence. There are many instances where it does a good job, but in an industrial setting reliability is more important. NL-SAR is better in terms of reliability but much worse in terms of resolution, and is therefore also not an efficient option. The proposed DeepInSAR balances noise reduction and fringe preservation well. At the same time, it gives a high level of bi-modality in the coherence estimates between incoherent and coherent pixels.
Furthermore, besides its superior performance compared to other non-stack methods, under a teacher-student framework DeepInSAR can achieve results comparable to or better than its teacher method with a learned discriminating neural network. The PtSel algorithm (teacher) has several limitations: (1) It relies on temporal information, which means that non-linear motion can make it hard to pick a neighbourhood suitable for all interferograms, causing under-filtering in these areas; as a result, the algorithm has to wait many more days for sufficient data before starting the process. (2) Its filtering results are biased: PtSel looks for similar nearby pixels to perform filtering, and if it does not find enough such pixels, the filtering tends toward plain averaging, giving a worse result than for a pixel that can find many similar neighbours. PtSel's filtering and coherence output is regarded as state-of-the-art in the literature, but it fails to give optimal output across the test input image because of its biased adaptive kernel estimation. On the other hand, the proposed DeepInSAR successfully distills the knowledge from training samples and generalizes to new unseen InSAR images with a simple feed-forward inference, without any human expert supervision or intensive online searching over a stack of interferograms as required by PtSel. Our proposed DeepInSAR model captures coherence in the fast-moving areas even better than PtSel and produces excellent delineation in the coherence with better contrast, which helps subsequent stages in the InSAR processing pipeline, that is, when thresholding and weighting are required on the estimated coherence in the phase unwrapping stage. With respect to the average running time (T) in seconds, as seen from Table 5, the proposed method requires significantly less running time than the other non-stack methods because only a feed-forward computation is needed after training.
After testing different parameter settings (e.g., number of iterations and patch size), the reference methods sometimes obtain better results after running for a longer time. However, this is not always the case, which means that these methods have limited potential for full automation without human intervention. The proposed method gives better results with much faster processing. It is worth mentioning that the PtSel outputs used for training and visual comparison were generated on a Titan XP GPU farm, because PtSel requires high-end GPUs for intensive parallel searching over a stack of SLCs (>30). In comparison, our method can run on a consumer-level system and perform filtering and coherence estimation using only two SLCs. Taking filtering and coherence performance as well as flexibility into consideration, the proposed DeepInSAR is very competitive and suitable for real-world InSAR applications. Lastly, it is worth mentioning that, in this work, our InSAR simulator is mainly designed for quantitative evaluation and analysis, because there is no ground truth for real-world images. The proposed simulator can generate random composites of irregular motion signals, ground reflectivity phenomena, and non-stationary noise conditions under different controlled configurations. This is an ideal scenario for objectively assessing the proposed DeepInSAR's learning capacity and generalization ability. However, as a data-driven technique, when we want to apply the proposed DeepInSAR framework to real-world InSAR data, we need to make sure the training data distribution is similar to real-world scenarios. The existing simulator is designed to give controlled experimental environments for quantitative analysis, but it still cannot fully replicate real-world complex features and noise patterns.
This is also the reason why we propose the teacher-student framework, which has been validated as useful for adapting the proposed DeepInSAR to a real-world phase filtering and coherence estimation pipeline, and is one of the contributions we would like to highlight. We show a potential benefit to the InSAR industry: the proposed DeepInSAR framework can transform conventional methods, which may require higher computational resources, more input observations, and human supervision, into a differentiable deep neural network model by learning from their outputs. In future work, we plan to investigate a Generative Adversarial Network (GAN) [58] based InSAR simulator for generating more realistic synthetic data, which we believe will further strengthen the operational readiness of the proposed DeepInSAR.

Conclusions
In this paper, we propose a learning-based DeepInSAR framework to address two important research issues, InSAR phase filtering and coherence estimation, in a single process. Our model works well on both simulated and real data, under different synthetic distortion levels and real noise patterns. To quantitatively assess the proposed method, we designed an InSAR simulator that randomly generates motion and noise patterns. The proposed DeepInSAR outperforms existing non-stack based methods on both tasks, giving the filtered phase and coherence map that best match the ground truth data. SSIM scores (0.8666 for phase filtering and 0.7986 for coherence estimation) also show that DeepInSAR preserves the phase fringe structure well after filtering and at the same time gives a sharp and clear coherence map. Numerical results show that the proposed DeepInSAR generalizes well to new unseen images once trained, and can thus be applied in various real-world InSAR applications. We also presented a teacher-student training strategy, which allows the proposed DeepInSAR to augment, automate and accelerate existing non-differentiable methods using a differentiable deep neural network. Our trained model obtains the same or better filtering and coherence estimation results from only a single pair of SLC images compared to its teacher algorithm, which requires a stack of SLCs (>30), achieving significantly higher computational efficiency. Compared to other non-stack based methods, our model gives the most robust results on both filtering and coherence estimation (1) without any human supervision and (2) with real-time performance. In addition, the proposed DeepInSAR gives highly bi-modal coherence estimates that nicely distinguish incoherent from coherent pixels, which benefits the subsequent phase unwrapping.
To the best of our knowledge, the proposed DeepInSAR is the first work that uses a deep neural network to perform InSAR filtering and coherence estimation jointly, using both the amplitude and phase information of only two co-registered SLC SAR images. In future work, we will investigate how well the proposed DeepInSAR framework can benefit subsequent InSAR analytic stages along the processing pipeline.