TomoSAR 3D Reconstruction for Buildings Using Very Few Tracks of Observation: A Conditional Generative Adversarial Network Approach

SAR tomography (TomoSAR) is an important technology for three-dimensional (3D) reconstruction of buildings from multiple coherent SAR images. To obtain a sufficient signal-to-noise ratio (SNR), typical TomoSAR applications require dozens of SAR scenes. However, because of time and cost limits, often only 3–5 scenes are available in practice, and with so few scenes the traditional TomoSAR technique cannot achieve satisfactory SNR and elevation resolution. To tackle this problem, a conditional generative adversarial network (CGAN) is proposed that improves TomoSAR 3D reconstruction by learning prior information about buildings, which reduces the number of required tracks to three. First, a TomoSAR 3D super-resolution dataset is constructed using high-quality data from the airborne array and low-quality data obtained from a small number of tracks sampled from all observations. Then, the CGAN model is trained to estimate the corresponding high-quality result from the low-quality input. Airborne data experiments show that the reconstruction results improve, both qualitatively and quantitatively, in areas with and without overlap. Furthermore, the network pretrained on the airborne dataset is used directly, without any tuning, to process a spaceborne dataset and generates satisfactory results, demonstrating the effectiveness and robustness of our method. A comparative experiment with the nonlocal algorithm also shows that the proposed method achieves better height estimation and higher time efficiency.


Introduction
The capability of inverting the spatial distribution along the elevation direction in each SAR azimuth-range cell from multiple coherent SAR images makes TomoSAR an important technique for reconstructing 3D information of target objects [1,2]. Generally, numerous coherent SAR images (more than 20) [3] are required to obtain sufficient 3D reconstruction results. In practice, however, only 3–5 tracks are often available because of cost and time constraints. According to previous research, the accuracy of elevation inversion improves with the number of observation tracks and with the SNR [4]. With so few tracks, therefore, the accuracy of elevation inversion decreases severely, which degrades building height estimation and damages 3D architectural structures. Appropriate methods are thus needed for TomoSAR with very few tracks.
Estimating the number of scatterers and their corresponding elevations are the essential steps of a TomoSAR procedure. The more accurate the elevation inversion, the more refined the reconstructed building information, such as the overall architectural structure and the planar surfaces. However, the large elevation inversion errors that occur with very few tracks severely damage these characteristics. As shown in Figure 1, the surface of a building is no longer planar and cannot be inferred from the damaged blue points, the overall structure of the building becomes fuzzy, and the building height is estimated incorrectly. Some traditional research on TomoSAR 3D reconstruction of buildings using few tracks has been carried out. In 2015, XiaoXiang Zhu et al. used six-track coherent data to study building height estimation [5]. The geometric structures of the target buildings were obtained from a public optical building database to calculate the contour lines of the buildings in the SAR images; joint sparse estimation of the points on each contour line was then used to obtain the contour-line heights. In 2020, HongLiang Lu et al. obtained the contour lines of buildings directly from the SAR image with a contour line extraction (CLE) algorithm rather than from public databases [6]. However, both methods require the geometric structures of the buildings in advance, which is inconvenient and complicated. In 2020, Yilei Shi et al. proposed the nonlocal algorithm to estimate building heights with 3–5 tracks [4]. This algorithm performs patch-wise weighted filtering on the interferometric images, which effectively increases the SNR and yields large-scale 3D reconstructions of urban buildings. However, its track configuration is specially chosen so that the equivalent baseline is long enough to maintain the elevation resolution.
To our knowledge, no research has addressed the loss of elevation resolution caused by a shorter equivalent baseline. More generally, these studies lack automatic learning of the structural characteristics and high-dimensional features of buildings.
There are also some studies on SAR 3D imaging using deep learning algorithms, which can automatically learn high-dimensional features of targets from data. In 2019, Siyan Zhou et al. proposed a deep fully connected network to denoise the 3D point cloud of a single independent building and obtain a relatively flat surface [7]; however, this method cannot handle overlapped buildings and struggles to process point clouds of large-scale scenes. In 2021, Shihong Wang et al. proposed a 3D autoencoder network that filters the low-resolution 3D voxel data generated from three-track circular SAR data to approximate the high-resolution result obtained with all tracks [8]. However, this method requires substantial computation and memory to process 3D voxel data, which makes it inefficient. Overall, deep learning has brought new ideas to SAR 3D reconstruction research and suggests its potential for TomoSAR 3D reconstruction of buildings with very few tracks.
The development of the generative adversarial network (GAN) has brought a major revolution to deep learning research. In 2014, Goodfellow first proposed the GAN principle and its network model [9], which gave rise to research on generative networks. However, the original GAN is difficult to train and hard to converge. To deal with these problems, in 2015, Alec Radford et al. combined the GAN structure with deep convolutional networks and proposed the Deep Convolutional GAN (DCGAN) [10]. The DCGAN architecture alleviates unstable training and mode collapse, which made it the mainstream GAN architecture. However, DCGAN did not explain the problems of GANs theoretically. To address them fundamentally, in 2017, Martin Arjovsky et al. analyzed the original loss function in depth, theoretically explained the nonconvergence of GANs, and proposed the Wasserstein GAN (WGAN) [11]. Building on this work, GANs have found important applications in image generation, style transfer, image synthesis, anti-spoofing, image deblurring, etc. [12][13][14][15][16]. The traditional GAN, however, takes random noise as input. A GAN that generates a specific result conditioned on input content is called a conditional GAN (CGAN) and is used in tasks such as medical image segmentation and image deblurring [17][18][19]. To the best of our knowledge, there is no research on applying GANs to SAR 3D reconstruction. This article explores a method of applying a GAN in SAR 3D imaging to generate refined, high-quality 3D results.
Traditional research on TomoSAR 3D reconstruction of buildings focuses on improving the solution of the sparse equations, either through group sparsity based on contour lines or through the nonlocal algorithm, which increases the SNR. These methods do not automatically exploit the higher-dimensional features of buildings. In contrast, deep learning algorithms show great potential for exploiting such features. GANs are particularly effective at generating refined image results, which makes them well suited to the 3D reconstruction of architectural targets. The code of our network is available here (https://gitee.com/WshongCola/cgan_for_few_tracks accessed on 15 May 2021).
In summary, the key contributions of our work are as follows: (1) A conditional generative adversarial network (CGAN) is applied, for the first time, to generate high-quality TomoSAR 3D reconstructions of buildings from very few tracks by learning the high-dimensional features of architectural structures. (2) Instead of processing large 3D volumes directly, range-elevation 2D slices are processed, which reduces the network parameters and computational complexity and makes large-scale scenes tractable. To avoid possible misalignment of the generated results, a content loss between the generated result and the ground truth is included, so that the architectural structure is reconstructed at the correct position. (3) Overlap makes buildings in TomoSAR images appear fused together and hard to tell apart. The proposed method distinguishes overlapped buildings correctly and estimates their heights. Compared with the widely used nonlocal algorithm, our method estimates building heights more accurately and is far more time-efficient.

Materials and Methods
As shown in Figure 2, the proposed method consists of two connected modules: the data generation module and the CGAN module.
Data generation module:
Input: all coherent SAR images.
Output:
• Low-quality slice set: low-resolution, low-SNR range-elevation slices generated by the TomoSAR procedure using three tracks.
• High-quality slice set: high-resolution, high-SNR range-elevation slices generated using all tracks.
The TomoSAR procedure is described in detail in Section 2.1.2.

CGAN module:
Input: the paired low-quality and high-quality slices generated by the data generation module. The network parameters are updated iteratively to learn the mapping from low-quality slices to high-quality slices.

TomoSAR Principle
As shown in Figure 3, a conventional synthetic aperture radar projects the 3D spatial distribution of scatterers along the elevation direction onto the 2D azimuth-range plane, so the elevation distribution of the scatterers is lost. TomoSAR extends the observation capability with multiple coherent tracks, making it possible to retrieve the elevation distribution from coherent SAR images [20][21][22]. Assuming there are N tracks in total, the typical TomoSAR 3D imaging model can be formulated as follows [23]:
$$y_n = \int_{\Delta s} \sigma(s) \exp\left(-j\frac{4\pi b_n s}{\lambda R}\right) \mathrm{d}s, \quad n = 1, \dots, N \tag{1}$$
In the above equation, $y_n$ represents the complex-valued observation of the nth antenna phase center (APC), expressed as the integral of the scatterer spatial distribution $\sigma(s)$ along the elevation direction $s$ over the elevation extent $\Delta s$. Taking a certain APC as the reference, the baseline length of the nth APC is $b_n$; $\lambda$ is the wavelength, and $R$ is the distance between the scatterers and the APC. Discretizing the elevation range into M samples and considering noise, the continuous integral model can be approximated as
$$\mathbf{y} = \boldsymbol{\Phi}\boldsymbol{\sigma} + \boldsymbol{\varepsilon} \tag{2}$$
where $\mathbf{y}$ is the complex-valued observation vector of the N APCs in each azimuth-range cell, $\boldsymbol{\sigma}$ is the discretized elevation distribution, and $\boldsymbol{\varepsilon}$ is the noise. $\boldsymbol{\Phi}$ is the $N \times M$ observation matrix, which bridges the observation vector and the elevation distribution; its elements are $\Phi_{i,j} = \exp\left(-j\frac{4\pi b_i s_j}{\lambda R}\right)$, where $s_j$ is the jth discretized elevation resolution cell. Based on Equation (2), the elevation distribution $\sigma(s)$ in each azimuth-range cell can be inverted from $\mathbf{y}$.
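The discrete model $\mathbf{y} = \boldsymbol{\Phi}\boldsymbol{\sigma} + \boldsymbol{\varepsilon}$ can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the wavelength, slant range, baselines, and elevation grid are hypothetical values chosen only for demonstration, and `observation_matrix` / `simulate_observations` are illustrative helper names.

```python
import numpy as np


def observation_matrix(b, s, wavelength, R):
    """Phi[i, j] = exp(-j * 4*pi * b_i * s_j / (wavelength * R))."""
    return np.exp(-1j * 4 * np.pi * np.outer(b, s) / (wavelength * R))


def simulate_observations(b, s, sigma, wavelength, R, noise_std=0.0, seed=0):
    """Return (y, Phi) for y = Phi @ sigma + eps with circular complex Gaussian noise."""
    Phi = observation_matrix(b, s, wavelength, R)
    y = Phi @ sigma
    if noise_std > 0:
        rng = np.random.default_rng(seed)
        eps = rng.standard_normal(y.shape) + 1j * rng.standard_normal(y.shape)
        y = y + noise_std / np.sqrt(2) * eps
    return y, Phi


# Hypothetical geometry: 3 tracks, 0.03 m wavelength, 5 km slant range.
b = np.array([0.0, 0.5, 1.0])          # baselines relative to the reference APC (m)
s = np.linspace(-50.0, 50.0, 101)      # elevation grid s_j (m), M = 101
sigma = np.zeros(s.size, dtype=complex)
sigma[20] = 1.0                        # ground scatterer at s = -30 m
sigma[80] = 0.7                        # roof scatterer at s = +30 m (layover in one cell)
y, Phi = simulate_observations(b, s, sigma, wavelength=0.03, R=5000.0)
print(Phi.shape, y.shape)
```

With N = 3 observations and M = 101 unknowns, the system is heavily underdetermined, which is exactly the three-track situation the paper targets.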
According to Fourier analysis [24], if the maximum baseline length is B, the Rayleigh resolution in the elevation direction is $\rho_s = \frac{\lambda R}{2B}$. The elements of the observation matrix can be written as $\exp(-j2\pi\xi_i s_j)$, where $\xi_i = \frac{2b_i}{\lambda R}$ is the spatial frequency, so Equation (2) can be regarded as a Fourier transform of $\sigma$. The sampling interval of the spatial frequency is $\Delta\xi = \frac{2\Delta b}{\lambda R}$, where $\Delta b$ is the interval between two neighboring APCs. Because the spatial frequency is sampled at interval $\Delta\xi$, the maximum unambiguous range in the elevation direction is $s_{ua} = \frac{1}{\Delta\xi} = \frac{\lambda R}{2\Delta b}$. According to this analysis and the research in [4], the fewer tracks are available, the shorter the maximum baseline length becomes, which degrades the elevation resolution.
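The two formulas $\rho_s = \lambda R/(2B)$ and $s_{ua} = \lambda R/(2\Delta b)$ can be checked numerically. The sketch below assumes a hypothetical array of eight evenly spaced APCs 0.5 m apart (not the YunCheng geometry) and compares all tracks with three adjacent tracks; `elevation_geometry` is an illustrative name.

```python
import numpy as np


def elevation_geometry(b, wavelength, R):
    """Rayleigh resolution and maximum unambiguous range for baselines b (m)."""
    b = np.sort(np.asarray(b, dtype=float))
    B = b[-1] - b[0]                        # maximum baseline length
    delta_b = np.min(np.diff(b))            # interval between neighbouring APCs
    rho_s = wavelength * R / (2 * B)        # Rayleigh resolution in elevation
    s_ua = wavelength * R / (2 * delta_b)   # maximum unambiguous elevation range
    return rho_s, s_ua


# Hypothetical array: 8 evenly spaced APCs, 0.5 m apart.
all_tracks = np.arange(8) * 0.5
three_adjacent = all_tracks[:3]
rho_all, ua_all = elevation_geometry(all_tracks, wavelength=0.03, R=5000.0)
rho_3, ua_3 = elevation_geometry(three_adjacent, wavelength=0.03, R=5000.0)
print(f"all tracks: rho_s = {rho_all:.1f} m, s_ua = {ua_all:.1f} m")
print(f"3 adjacent: rho_s = {rho_3:.1f} m, s_ua = {ua_3:.1f} m")
```

Because the three adjacent tracks keep the same APC spacing, the unambiguous range is unchanged, but the maximum baseline shrinks from 3.5 m to 1.0 m, so the Rayleigh resolution becomes 3.5 times coarser.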
From linear algebra, when the number of elevation samples M is much larger than the number of coherent tracks N, Equation (2) is underdetermined and has a nonunique solution space. The general solution is to use compressed sensing (CS) [25][26][27], whose objective function with a sparse constraint term is
$$\hat{\boldsymbol{\sigma}} = \arg\min_{\boldsymbol{\sigma}} \left\{ \|\mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\sigma}\|_2^2 + \lambda \|\boldsymbol{\sigma}\|_1 \right\} \tag{3}$$
In the above equation, $\lambda$ represents the sparse coefficient (here a regularization weight, not the wavelength); the larger its value, the sparser the solution. $\|\boldsymbol{\sigma}\|_1$ is the sparse constraint term that limits the solution space.
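The paper does not state which sparse solver is used. As one standard option, the ℓ1-regularized objective above can be minimized with the iterative shrinkage-thresholding algorithm (ISTA). The sketch below solves the equivalent form ½‖y − Φσ‖²₂ + λ‖σ‖₁ (the factor ½ only rescales λ); the 20-track geometry, grid, and λ value are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Complex soft-thresholding: shrink each magnitude by t and keep the phase."""
    mag = np.abs(z)
    scale = np.maximum(1.0 - t / np.maximum(mag, 1e-12), 0.0)
    return z * scale


def ista(y, Phi, lam, n_iter=1000):
    """Minimise 0.5 * ||y - Phi @ sigma||_2^2 + lam * ||sigma||_1 by ISTA."""
    step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # 1 / Lipschitz constant of the gradient
    sigma = np.zeros(Phi.shape[1], dtype=complex)
    for _ in range(n_iter):
        grad = Phi.conj().T @ (Phi @ sigma - y)  # gradient of the data-fidelity term
        sigma = soft_threshold(sigma - step * grad, step * lam)
    return sigma


# Hypothetical 20-track geometry with one scatterer at s = +10 m.
wavelength, R = 0.03, 5000.0
b = np.arange(20) * 1.0                  # baselines (m)
s = np.linspace(-30.0, 30.0, 31)         # 2 m elevation grid, inside s_ua = 75 m
Phi = np.exp(-1j * 4 * np.pi * np.outer(b, s) / (wavelength * R))
sigma_true = np.zeros(s.size, dtype=complex)
sigma_true[20] = 1.0
y = Phi @ sigma_true
sigma_hat = ista(y, Phi, lam=0.05)
print(s[np.argmax(np.abs(sigma_hat))])   # elevation of the strongest recovered scatterer
```

ISTA is chosen here only for brevity; accelerated variants (FISTA) or greedy solvers are common alternatives, and the grid must stay within the unambiguous range to avoid aliased solutions.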

TomoSAR Procedure
As shown in Figure 4, the TomoSAR procedure has four steps. First, image registration ensures that the azimuth-range cells in different coherent SAR images correspond to the same scatterers [28]. Second, channel phase errors caused by positioning or system errors are compensated to focus the inversion results in the elevation direction; the minimum entropy method is applied to estimate these phase errors. Third, sparse recovery methods are used to estimate the elevation positions of the scatterers in each azimuth-range cell. Finally, the coordinates are transformed from the radar coordinate system into the geodetic coordinate system.
Figure 5 illustrates the procedure of data generation.
Figure 5. Flowchart of data generation.
The final output is a paired super-resolution dataset composed of low-quality and high-quality slice sets. The low-quality set contains low-SNR, low-resolution range-elevation 2D binary slices generated from three tracks, whereas the high-quality set is generated from all tracks and has high SNR and resolution.
Paired super-resolution dataset: slices of the two sets at the same azimuth position are paired as the network input and the corresponding ground truth.
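The paper names the minimum entropy method for channel phase calibration but does not detail it. The sketch below is one simple, hypothetical variant: it beamforms the elevation profile with $\boldsymbol{\Phi}^H$, measures the entropy of the normalized power, and runs a coordinate-wise grid search over per-channel phase corrections, accepting only changes that lower the entropy. The geometry and phase errors are synthetic, and the function names are illustrative.

```python
import numpy as np


def profile_entropy(y, Phi):
    """Entropy of the normalised beamformed elevation power |Phi^H y|^2."""
    p = np.abs(Phi.conj().T @ y) ** 2
    p = p / p.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def min_entropy_calibration(y, Phi, n_sweeps=3, n_grid=64):
    """Coordinate-wise grid search for channel phases that minimise profile entropy.

    Channel 0 is the reference and is never changed. Returns (phase, entropy).
    """
    phase = np.zeros(y.size)
    grid = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    best = profile_entropy(y, Phi)
    for _ in range(n_sweeps):
        for n in range(1, y.size):
            for g in grid:
                trial = phase.copy()
                trial[n] = g
                e = profile_entropy(y * np.exp(-1j * trial), Phi)
                if e < best:              # accept only strict improvements
                    best, phase = e, trial
    return phase, best


# Synthetic example: 10 channels, one scatterer, random channel phase errors.
wavelength, R = 0.03, 5000.0
b = np.arange(10) * 1.0
s = np.linspace(-30.0, 30.0, 121)
Phi = np.exp(-1j * 4 * np.pi * np.outer(b, s) / (wavelength * R))
rng = np.random.default_rng(0)
phase_error = rng.uniform(-np.pi, np.pi, b.size)
phase_error[0] = 0.0
y_obs = Phi[:, 80] * np.exp(1j * phase_error)
before = profile_entropy(y_obs, Phi)
phase_hat, after = min_entropy_calibration(y_obs, Phi)
print(f"entropy before: {before:.3f}, after: {after:.3f}")
```

Entropy is invariant to a linear phase ramp across channels (it only shifts the profile), so this kind of calibration focuses the profile but cannot fix its absolute elevation offset.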

Data Generation
Low-quality slice set: The low-quality slice set is composed of binarized 2D range-elevation slices generated from three adjacent tracks by the algorithms of the TomoSAR procedure, and therefore has low SNR and low elevation resolution. The equivalent baseline of three adjacent tracks is much shorter than that of all tracks, which severely decreases the elevation resolution. This low-quality set is the input of the network. Binarization sets a cell to 1 if a scatterer is estimated at that position.
High-quality slice set: The high-quality slice set is generated by the same procedure as the low-quality set, but uses all SAR images, yielding high SNR and high elevation resolution.
All slices are in the radar coordinate system.
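The slice pairing can be sketched as follows, under stated assumptions: the TomoSAR outputs are taken to be (azimuth, range, elevation) volumes of estimated scatterer strength in radar coordinates, binarized with a hypothetical threshold, and the random volumes below merely stand in for real reconstructions. `adjacent_tracks`, `binarize`, and `make_pairs` are illustrative helpers, not functions from the authors' code.

```python
import numpy as np


def adjacent_tracks(b, start, k=3):
    """Indices of k adjacent tracks (sorted by baseline) and their baseline span."""
    order = np.argsort(b)
    idx = order[start:start + k]
    return idx, float(b[idx].max() - b[idx].min())


def binarize(power, threshold):
    """1 where a scatterer is estimated in a (range, elevation) cell, else 0."""
    return (power >= threshold).astype(np.uint8)


def make_pairs(low_power, high_power, threshold):
    """Pair binary range-elevation slices taken at the same azimuth index.

    Both volumes are shaped (azimuth, range, elevation) in radar coordinates.
    """
    if low_power.shape != high_power.shape:
        raise ValueError("low- and high-quality volumes must share one grid")
    low, high = binarize(low_power, threshold), binarize(high_power, threshold)
    return [(low[a], high[a]) for a in range(low.shape[0])]


b = np.arange(8.0)                       # hypothetical baselines of all tracks (m)
idx, span = adjacent_tracks(b, start=2)  # three adjacent tracks for the low-quality set
rng = np.random.default_rng(1)
low_vol = rng.random((5, 40, 30))        # stand-ins for TomoSAR |sigma| volumes
high_vol = rng.random((5, 40, 30))
pairs = make_pairs(low_vol, high_vol, threshold=0.9)
print(idx, span, len(pairs), pairs[0][0].shape)
```

The three-track subset spans only 2 m of the 7 m full baseline, which is the shorter equivalent baseline that degrades the low-quality slices.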

CGAN Module
The CGAN consists of two models, the generator (G) and the discriminator (D), as shown in Figure 6. Based on the low-quality input, the generator produces a fake result as similar as possible to the high-quality ground truth, while the discriminator tries to distinguish the fake result from the ground truth. Through iterative optimization, the generator learns to produce results that the discriminator accepts as real, and the discriminator learns to separate fake results from the ground truth as well as possible. After enough iterations, the generator produces refined results that are hard to tell apart from the corresponding ground truth. In addition, a content loss between the generated result and the ground truth is included to avoid position bias.
Figure 6. Flowchart of the CGAN module.

Generator
The generator network structure, based on the autoencoder [19], is shown in Figure 7 and can be divided into three main parts: downsampling compression, feature extraction, and upsampling reconstruction. In the downsampling compression part, two convolution layers with a stride of two achieve 4× downsampling, and the feature dimension is expanded from 64 to 256, as indicated by the number on the left side of each block (e.g., n64). The feature extraction part extracts high-dimensional features from the downsampled data with nine stacked deep convolution blocks. The upsampling reconstruction part uses two transposed convolution (TransposedConv) layers with a stride of two to achieve 4× upsampling and reduce the feature dimension to one.
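The tensor shapes through such a generator can be traced with simple size arithmetic. Only the 64 → 256 channels, the two stride-2 down/up stages, the nine blocks, and the 1200 × 256 input size come from the text; the 7×7 stem/head convolutions, 3×3 stride-2 kernels, padding, and `output_padding=1` are assumptions borrowed from the common ResNet-style generator layout.

```python
def conv_out(n, k, s, p):
    """Output length of a convolution (kernel k, stride s, padding p)."""
    return (n + 2 * p - k) // s + 1


def tconv_out(n, k, s, p, output_padding=0):
    """Output length of a transposed convolution."""
    return (n - 1) * s - 2 * p + k + output_padding


def generator_shapes(h, w, n_res=9):
    """Trace (stage, channels, height, width) through an assumed ResNet-style generator."""
    shapes = [("input", 1, h, w)]
    h, w = conv_out(h, 7, 1, 3), conv_out(w, 7, 1, 3)       # assumed 7x7 stem conv -> n64
    shapes.append(("stem", 64, h, w))
    for c in (128, 256):                                    # two stride-2 convolutions
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        shapes.append(("down", c, h, w))
    for i in range(n_res):                                  # residual blocks keep the shape
        shapes.append((f"res{i + 1}", 256, h, w))
    for c in (128, 64):                                     # two stride-2 transposed convs
        h, w = tconv_out(h, 3, 2, 1, 1), tconv_out(w, 3, 2, 1, 1)
        shapes.append(("up", c, h, w))
    h, w = conv_out(h, 7, 1, 3), conv_out(w, 7, 1, 3)       # assumed 7x7 head conv -> 1 channel
    shapes.append(("head", 1, h, w))
    return shapes


for stage in generator_shapes(1200, 256):
    print(stage)
```

Under these assumptions the bottleneck is 256 × 300 × 64 and the output returns to 1 × 1200 × 256, which is what the global residual connection requires.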
To accelerate convergence, the generator adopts a residual structure: the input is added directly to the output, so the network learns only the difference between input and output. The nine feature extraction blocks are likewise ResNet-style residual blocks. In addition, InstanceNorm modules normalize each channel of the convolution outputs, which further accelerates convergence.
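Both ideas can be sketched in NumPy. This is illustrative only: the learnable affine parameters of InstanceNorm are omitted, and `body` is a hypothetical stand-in for the downsampling, residual, and upsampling layers.

```python
import numpy as np


def instance_norm(x, eps=1e-5):
    """Normalise each (sample, channel) map over its spatial dims; x is (N, C, H, W)."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def residual_generator(x, body):
    """Global residual learning: the body only predicts the correction to the input."""
    return x + body(x)


rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 16, 16))
y = instance_norm(x)
print(y.mean(axis=(2, 3)).round(6))    # approximately 0 for every (sample, channel)
print(y.std(axis=(2, 3)).round(3))     # approximately 1 for every (sample, channel)
```

Unlike BatchNorm, the statistics here never mix samples in a batch, so each slice is normalized independently of the others.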

Discriminator
A Markovian discriminator [29] is applied in the proposed method. Other discriminators output a single real number indicating the probability that the input is real; in contrast, the Markovian discriminator outputs a matrix in which each element represents the truth probability of one small patch of the input, namely its receptive field. The average of the matrix is then regarded as the probability that the entire image is real. The discriminator structure is shown in Figure 8.
The discriminator network consists of four convolutional layers, each containing a convolution, a BatchNorm module, and a LeakyReLU activation, which increase the feature dimension from 64 to 512; a final layer containing only a convolution with one filter then produces the truth-probability map. BatchNorm accelerates the convergence of the network, and LeakyReLU serves as the activation function.
Figure 8. Network structure of the discriminator.
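The paper does not list kernel sizes or strides. Assuming the common 70 × 70 PatchGAN layout (4×4 kernels, stride 2 for the first three layers and stride 1 for the last two, padding 1), the patch covered by each score and the size of the score map for a 1200 × 256 input can be computed as follows; `PATCHGAN_LAYERS` and the helper names are hypothetical.

```python
import numpy as np

# Assumed (kernel, stride) per conv layer, following the common 70x70 PatchGAN layout:
# n64, n128, n256 with stride 2, n512 with stride 1, then the 1-filter output conv.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def receptive_field(layers):
    """Input-pixel extent seen by one element of the output score map."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r


def score_map_size(n, layers, padding=1):
    """Length of the output score map along one axis for an input of length n."""
    for k, s in layers:
        n = (n + 2 * padding - k) // s + 1
    return n


def image_score(score_map):
    """Average the patch scores into one score for the whole slice."""
    return float(np.mean(score_map))


print(receptive_field(PATCHGAN_LAYERS))
print(score_map_size(1200, PATCHGAN_LAYERS), score_map_size(256, PATCHGAN_LAYERS))
```

With this layout every score judges a 70 × 70 patch, so the discriminator focuses on local structure such as roof and facade edges rather than on the whole slice at once.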

Loss Function
The total loss is formulated as follows:
$$\mathcal{L} = \mathcal{L}_{GAN} + \lambda \mathcal{L}_X \tag{4}$$
In Equation (4), the total loss is the sum of the adversarial loss $\mathcal{L}_{GAN}$ and the content loss $\mathcal{L}_X$, balanced by the parameter $\lambda$. The adversarial loss is formulated as follows:
$$\mathcal{L}_D = -\mathbb{E}_{x\sim P_{high\_res}}[D(x)] + \mathbb{E}_{z\sim P_{low\_res}}[D(G(z))] \tag{5}$$
$$\mathcal{L}_G = -\mathbb{E}_{z\sim P_{low\_res}}[D(G(z))] \tag{6}$$
In Equation (5), the loss function of the discriminator consists of two parts. The first part is $-\mathbb{E}_{x\sim P_{high\_res}}[D(x)]$, where $x$ is sampled from the high-resolution dataset and processed by the discriminator $D(\cdot)$. The second part is $\mathbb{E}_{z\sim P_{low\_res}}[D(G(z))]$, where $z$ is the corresponding sample from the low-resolution dataset. The input $z$ is processed by the generator $G(\cdot)$ to produce a fake result $G(z)$, which is then scored by the discriminator. The higher the score, the more the discriminator believes that its input is real. Therefore, to minimize its loss, the discriminator tries to give high scores to high-resolution inputs and low scores to generated results. In Equation (6), the generator tries to produce fake results similar to the high-resolution data, so that the discriminator gives them high scores.
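Equations (4)–(6) reduce to simple averages over a batch of critic scores. The sketch below uses hypothetical score values and a hypothetical weight λ = 100; in training, the scores come from D and the content loss from the perceptual network.

```python
import numpy as np


def critic_loss(d_real, d_fake):
    """Equation (5): -E[D(x)] + E[D(G(z))]; lower when real scores exceed fake scores."""
    return float(-np.mean(d_real) + np.mean(d_fake))


def generator_loss(d_fake, content_loss, lam):
    """Equations (4) and (6): adversarial term -E[D(G(z))] plus lam * content loss."""
    return float(-np.mean(d_fake) + lam * content_loss)


d_real = np.array([2.0, 1.5, 2.5])    # hypothetical critic scores on high-quality slices
d_fake = np.array([-1.0, 0.0, -0.5])  # hypothetical critic scores on G(z)
print(critic_loss(d_real, d_fake))
print(generator_loss(d_fake, content_loss=0.2, lam=100.0))
```

The two losses are minimized alternately: the critic lowers its loss by separating the score distributions, and the generator lowers its loss by raising the scores of its outputs while the content term keeps them anchored to the ground truth.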
To keep the generated results consistent with the content of the input data, a content loss is applied. However, minimizing the pixel-wise L1 or L2 distance between the generated result and the ground truth tends to produce blurry artifacts. Therefore, in this paper, a feature extraction network is adopted to measure the distance in the perceptual domain:
$$\mathcal{L}_X = \frac{1}{W_{i,j}H_{i,j}} \sum_{u=1}^{W_{i,j}} \sum_{v=1}^{H_{i,j}} \left( \phi_{i,j}(x)_{u,v} - \phi_{i,j}(G(z))_{u,v} \right)^2 \tag{7}$$
In Equation (7), $\phi_{i,j}$ denotes the feature map obtained after activation of the jth convolutional layer before the ith max-pooling layer of the feature extraction network, $W_{i,j}$ and $H_{i,j}$ are its width and height, and $x$ and $G(z)$ are the ground truth and the generated result. By constraining features at different levels, such as high-level semantic features and low-level details, the generated result becomes more refined and fits the ground truth better.
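The structure of Equation (7) can be illustrated without a deep network. In the sketch below, `toy_features` is a hypothetical stand-in for the VGG19 activation map $\phi_{i,j}$ (positive finite differences as a conv + ReLU analogue, then 2×2 max pooling), and the loss averages over all feature elements, including channels.

```python
import numpy as np


def toy_features(img):
    """Hypothetical stand-in for a VGG19 activation map phi_{i,j}.

    Positive finite differences along both axes (conv + ReLU analogue),
    followed by 2x2 max pooling; returns an array of shape (2, H', W').
    """
    gx = np.maximum(np.diff(img, axis=1)[:-1, :], 0.0)
    gy = np.maximum(np.diff(img, axis=0)[:, :-1], 0.0)
    f = np.stack([gx, gy])
    c, h, w = f.shape
    h2, w2 = h // 2, w // 2
    return f[:, :2 * h2, :2 * w2].reshape(c, h2, 2, w2, 2).max(axis=(2, 4))


def perceptual_loss(x, g_z, phi=toy_features):
    """Mean squared distance between feature maps of ground truth x and generation G(z)."""
    fx, fg = phi(x), phi(g_z)
    return float(np.mean((fx - fg) ** 2))


x = np.zeros((16, 16))
x[4:8, 4:8] = 1.0                       # one bright block in a binary slice
shifted = np.roll(x, 2, axis=0)          # the same block displaced by two rows
print(perceptual_loss(x, x), perceptual_loss(x, shifted))
```

The loss is zero only when the feature maps agree, so a structure generated at the wrong position is penalized even if it looks plausible, which is the position-bias role the content loss plays here.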

Network Training
Data augmentation: Training a CGAN requires a large amount of data. However, a TomoSAR scene provides too few slices of each building; only hundreds of slices are available, so the network tends to memorize these slices rather than learn deep features. To solve this problem, data augmentations including random combination, flipping, and translation are applied to the input slices. Random combination simulates scenes with many buildings by placing slices of one isolated building, taken at different azimuth positions, together into one fabricated slice. In addition, additive noise is used to improve the robustness of the network.
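A minimal sketch of these augmentations on binary slice pairs follows. It rests on several assumptions not stated in the paper: axis 0 is the range axis, flips and translations act along range and are applied identically to input and target, "random combination" overlays slices with a logical OR at random range offsets, and noise means randomly switching on spurious cells in the input only. All names and parameters are hypothetical.

```python
import numpy as np


def shift_rows(a, shift):
    """Translate a slice along axis 0 (assumed range axis), filling vacated rows with 0."""
    out = np.zeros_like(a)
    if shift >= 0:
        out[shift:] = a[:a.shape[0] - shift]
    else:
        out[:shift] = a[-shift:]
    return out


def augment_pair(low, high, rng, max_shift=8, noise_rate=0.01):
    """Apply one random flip + translation to both slices; add noise to the input only."""
    if rng.random() < 0.5:
        low, high = low[::-1], high[::-1]
    shift = int(rng.integers(-max_shift, max_shift + 1))
    low, high = shift_rows(low, shift), shift_rows(high, shift)
    spurious = rng.random(low.shape) < noise_rate   # hypothetical false scatterers
    low = np.logical_or(low, spurious).astype(high.dtype)
    return low, high


def random_combination(pairs, rng, k=2, max_shift=8):
    """Overlay k slices of one building (different azimuths) into one fabricated pair."""
    low = np.zeros_like(pairs[0][0])
    high = np.zeros_like(pairs[0][1])
    for i in rng.choice(len(pairs), size=k, replace=False):
        shift = int(rng.integers(-max_shift, max_shift + 1))
        low |= shift_rows(pairs[i][0], shift)
        high |= shift_rows(pairs[i][1], shift)
    return low, high


rng = np.random.default_rng(3)
slices = [(rng.random((64, 32)) < 0.05).astype(np.uint8) for _ in range(4)]
pairs = [(s_, s_.copy()) for s_ in slices]
low, high = random_combination(pairs, rng)
low_aug, high_aug = augment_pair(low, high, rng, noise_rate=0.0)
print(low_aug.shape, np.array_equal(low_aug, high_aug))
```

Applying one shared geometric transform to both members of a pair keeps input and ground truth aligned, while noise is injected only on the input side so the network learns to remove it.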
Training configurations: The network is trained on an AMD Ryzen 5 2600 six-core CPU and one NVIDIA GeForce GTX 1080 Ti GPU with the Adam optimizer. Training runs for 150 epochs: the learning rate is 0.0001 for the first 100 epochs and then decreases linearly to 0 over the last 50 epochs. Owing to GPU memory limits, the input size is set to 1200 × 256 and the batch size to 14. In our experiments, the network converges after 100 epochs. The GAN critic is trained with the Wasserstein GAN with gradient penalty (WGAN-GP) [11], which mitigates vanishing and exploding gradients. A VGG19 network [30] pretrained on ImageNet is used as the perceptual network to compute the difference between the feature maps of the fake result and the ground truth.
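The learning-rate schedule and the WGAN-GP penalty can be sketched as below. The schedule is one reading of "linearly reduced to 0 in the last 50 epochs" (reaching exactly 0 at the final epoch). The penalty weight of 10 is the WGAN-GP default and is an assumption here, since the paper does not give it; the gradients at interpolated samples would come from automatic differentiation in practice.

```python
import numpy as np


def learning_rate(epoch, base_lr=1e-4, n_const=100, n_decay=50):
    """Constant for the first n_const epochs, then linear decay reaching 0 at the last epoch."""
    if epoch < n_const:
        return base_lr
    frac = (n_const + n_decay - 1 - epoch) / (n_decay - 1)
    return base_lr * max(frac, 0.0)


def gradient_penalty(grads, weight=10.0):
    """WGAN-GP term weight * E[(||grad_x D(x_hat)||_2 - 1)^2].

    grads holds the critic gradients at interpolated samples x_hat, shape (batch, ...).
    """
    norms = np.sqrt(np.sum(grads.reshape(grads.shape[0], -1) ** 2, axis=1))
    return float(weight * np.mean((norms - 1.0) ** 2))


print([learning_rate(e) for e in (0, 99, 100, 124, 149)])
```

The penalty pushes the critic's gradient norm toward 1 along lines between real and generated slices, which is how WGAN-GP enforces the Lipschitz constraint without weight clipping.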

Airborne Dataset
The YunCheng airborne data is used in the experiment. The parameters are listed in Table 1.
There are nine buildings in Figure 9, indicated with rectangles and numbered correspondingly in both the optical and SAR intensity images. Buildings #1 and #2, marked in red, are selected as the training set, and the other buildings, marked in white, form the testing set. In the SAR image, three buildings (#3, #4, #5) are overlapped, while four buildings (#6, #7, #8, #9) are nonoverlapped. Both the overlapped and nonoverlapped objects are indicated with white dotted rectangles. To illustrate the reconstruction results of overlapped and nonoverlapped objects more intuitively, two slices are selected at the positions of slice 1 and slice 2 in Figure 9b. Slice 1 is the elevation-range image of buildings #6, #7, #8, and #9, while slice 2 is the corresponding view of buildings #1, #2, #3, and #4. Building #5 shares the same characteristics as building #4 and is not shown here.
In Figure 10, the 3D height and strength distributions of both the all-track and three-track TomoSAR results are shown. Figure 10b,d show the normalized strength distributions. In the all-track result, the dominant scatterers are mainly located on the building surfaces, and the structures are much more refined than in the three-track result. Figure 10a,c show that the quality of the three-track TomoSAR 3D result is much lower than that of the all-track result. This confirms the previous analysis: three-track TomoSAR, with its low SNR and resolution, introduces large errors in the elevation inversion, resulting in fuzzy architectural structures. Compared with the all-track result, the three-track strength distribution contains many strong artifacts and outliers, which blur the height distribution of the buildings and degrade the reconstruction quality.
To explore the ability of the network to handle both overlapped and nonoverlapped situations, the test buildings are divided into two testing sets.
In this experiment, the nonlocal algorithm [31][32][33][34][35] is configured as listed in Table 2, according to the recommendations in [32]. Multipath scattering [36], marked with red circles, also affects the structures. In addition, the heights of the buildings are estimated and labeled with orange lines in the strength maps. In Figure 12, the results of the nonlocal algorithm and the proposed CGAN method are compared. The normalized strength maps in Figure 12b,d show that the nonlocal algorithm can indeed remove artifacts and outliers by increasing the SNR, although some remain and its results are still affected by the multipath scattering. In contrast, the results of the proposed CGAN method are of higher quality, with fewer artifacts and outliers. The multipath scattering marked by red circles, which affects the structures in the nonlocal results, is also well suppressed, so the structures are much clearer.
Figure 13 shows the height and strength maps of buildings #3 and #4 reconstructed using all tracks and three tracks. In Figure 13a, the structures of the two overlapped buildings can be clearly recognized, and in Figure 13b, the dominant scatterers are mainly located on the building surfaces, with a large interval between the two buildings. However, Figure 13c shows that the two buildings are difficult to distinguish from each other, with no obvious boundary. The strength map in Figure 13d also shows that the scatterers of the two buildings almost merge, which severely damages both structures.
In the three-track TomoSAR results, the two overlapped buildings cannot be distinguished from each other; the structure of building #3 is too blurry to separate it from building #4. Moreover, the top of building #3 is hard to determine, so its height cannot be estimated reliably. The height estimation is labeled with orange lines.

Overlapped Buildings
In Figure 14, the results of the proposed CGAN method and the nonlocal algorithm are shown. Figure 14a is the height map of the nonlocal algorithm, and Figure 14b is its normalized strength map. The nonlocal algorithm clearly separates the two overlapped buildings with an obvious interval in between, although some artifacts and outliers remain. Figure 14c,d are the results of the proposed CGAN method, which likewise separates the two overlapped buildings with a large interval, and the roofs of both buildings are clear. The two methods differ in the height estimation of building #3: the height estimated by the nonlocal algorithm differs severely from the all-track result, probably because of multipath scattering, whereas the height estimated by the proposed CGAN method is closer to the all-track result. The height estimations are labeled with orange lines.
Figure 15 shows that although the nonlocal algorithm effectively improves the overall 3D imaging quality, the building structures are still blurry compared with the CGAN results. In other words, the imaging results of the proposed CGAN method are much clearer, and it suppresses artifacts, outliers, and multipath scattering more strongly. Building height is the main quantity of interest in the 3D reconstruction task. Therefore, the heights of buildings #3, #4, #6, #7, #8, and #9 are estimated from the strength maps produced by the different methods; Table 3 compares these estimates. Limited by the resolution and by the error of subjective judgment, errors below 2 m are considered negligible. The discussion therefore focuses on buildings #3, #4, and #7, whose errors cannot be ignored. For building #3, the three-track and nonlocal estimates show large errors compared with the all-track result, possibly because strong multipath scattering under the three-track condition damages the roof structure of the building. For building #4, the height is difficult to judge because the interval between buildings #3 and #4 is too small to separate them; furthermore, the roof of building #4 merges with the surface of building #3, which makes the roof harder to identify. For building #7, both the three-track TomoSAR and the nonlocal estimates have large errors. The respective intensity distribution maps show that the dominant cause is multipath scattering from the nearby building, which severely damages the roof of the target building. Table 4 compares the time consumption of the nonlocal algorithm and the proposed CGAN method in reconstructing the entire scene.
The nonlocal algorithm is accelerated with vectorized parallel computation, which occupies a large amount of memory, and takes about 4 h. In contrast, the noniterative, parallel inference of the proposed CGAN network takes about 10 s to process the entire scene. The proposed CGAN method is therefore far more time-efficient than the iterative nonlocal algorithm.

Spaceborne Dataset
We use a TerraSAR-X spaceborne dataset of Barcelona to demonstrate the effectiveness and robustness of our method. The parameters are listed in Table 5. Figure 16 shows the optical image of the target building and the corresponding SAR intensity image. Figure 17 shows 3D views of the CS method using all tracks and only three tracks. With only three tracks, the CS method introduces many outliers, whereas the all-track reconstruction shows few outliers and a refined building surface. In Figure 18, the CS method is applied to reconstruct the building surface using three tracks and all tracks. With three tracks, there are many strong-intensity outliers, marked by orange circles. In contrast, the all-track reconstruction contains fewer outliers; however, the building surfaces in all results are not smooth. Figure 19 compares the reconstruction results of the nonlocal algorithm and the CGAN method using only three tracks. In general, the nonlocal algorithm can suppress the outliers, while the proposed CGAN method not only suppresses them much more strongly but also generates a refined building surface. To evaluate the effectiveness and robustness of the proposed CGAN method, we directly use the network trained on the airborne dataset to process the spaceborne dataset. Surprisingly, the pretrained network, without any tuning on the spaceborne dataset, also produces reasonable results.
The building heights estimated by the different methods are labeled with orange lines in the strength maps. There is no significant difference among these estimates, indicating that the proposed CGAN method correctly reconstructs the building heights in the spaceborne dataset.
Figure 20 compares 3D views of the nonlocal algorithm and the proposed CGAN method. The three-track nonlocal results still contain many outliers, although the nonlocal algorithm suppresses them slightly compared with the CS method. In contrast, the CGAN method produces significantly fewer outliers, and the building surfaces it generates are smooth and refined. Importantly, the network pretrained on the airborne dataset is applied to the spaceborne dataset directly, without tuning. The satisfactory reconstruction results it yields demonstrate the effectiveness and robustness of our method.

Conclusions
In this article, we propose the CGAN model to generate satisfactory TomoSAR 3D reconstruction for buildings using three tracks.
Firstly, we introduced the principle of TomoSAR imaging and theoretically analyzed the consequences of using very few tracks. Secondly, we proposed the CGAN model and explained the network structures in detail. Besides this, we described the procedure of data generation to meet the requirement of a large amount of training data.
The experiments on the YunCheng airborne dataset and the TerraSAR-X spaceborne dataset demonstrate that the proposed CGAN method improves three-track TomoSAR 3D reconstruction both qualitatively and quantitatively. Furthermore, to demonstrate its efficiency and effectiveness, we compared it with the nonlocal algorithm used in previous research. The comparison indicates that the proposed CGAN method produces more refined building structures and more accurate height estimates. Moreover, the nonlocal algorithm takes over 4 h to process the entire scene, whereas the proposed method takes only about 10 s, demonstrating its time efficiency.
However, in this paper, only features in the 2D elevation-range slices are considered because of the computational complexity. We believe the reconstruction results could be further improved by considering features in all three dimensions (azimuth, range, elevation). In addition, the training dataset is limited, because TomoSAR data of buildings are hard to access and generating a super-resolution dataset is time-consuming. In future work, we will explore the application of 3D features in TomoSAR 3D reconstruction for buildings.

Data Availability Statement:
The spaceborne dataset used in this article is not publicly available, as it was purchased. A publicly available airborne dataset was analyzed in this study. This data can be found here: (http://radars.ie.ac.cn/web/data/getData?dataType=SARMV3D_en&pageType=en accessed on 15 May 2023)

Conflicts of Interest:
The authors declare no conflicts of interest.