Fast Terahertz Coded-Aperture Imaging Based on Convolutional Neural Network

: Terahertz coded-aperture imaging (TCAI) has many advantages such as forward-looking imaging, staring imaging and low cost and so forth. However, it is difﬁcult to resolve the target under low signal-to-noise ratio (SNR) and the imaging process is time-consuming. Here, we provide an efﬁcient solution to tackle this problem. A convolution neural network (CNN) is leveraged to develop an off-line end to end imaging network whose structure is highly parallel and free of iterations. And it can just act as a general and powerful mapping function. Once the network is well trained and adopted for TCAI signal processing, the target of interest can be recovered immediately from echo signal. Also, the method to generate training data is shown, and we ﬁnd that the imaging network trained with simulation data is of good robustness against noise and model errors. The feasibility of the proposed approach is veriﬁed by simulation experiments and the results show that it has a competitive performance with the state-of-the-art


Introduction
Since the terahertz wave (0.1-10 THz) lies between the visible and microwave frequencies, it has stronger penetration capability than light and higher resolution than microwave, allowing for visualization of hidden objects at the millimeter level. Moreover, it does little harm to the human body compared to X-rays. Therefore, THz technology has attracted increasing attention, and the generation and detection of THz have also been extensively researched utilizing various approaches. Generation by nonlinear optical effects such as optical parametric oscillation [1] and detection by GaSe electro-optic sensors [2] are one of the typical approaches. Solid-state electronic devices [3] and a lowtemperature-grown GaAs photoconductive antenna gated [4] are also used as emitters and detectors. In order to overcome the limitations of THz band, some practical methods have been proposed [5,6]. With the development of THz technology and radar imaging technology, great progress has been made in various industries and research fields. A graphene-based THz ring resonator is considered a potential application for label-free sensing [7]. The application of THz wave in modulation technology was also reported [8]. For nondestructive detection, a millimeter wave radar imaging method based on synthetic aperture radar was presented [9]. In addition, THz radar imaging technology is attractive for security screening [10,11].
As a promising THz radar imaging technology, Terahertz coded-aperture imaging (TCAI), is derived from optical coded-aperture imaging [12] and radar coincidence imaging [13], it utilizes an aperture coded antenna [14] to generate a spatiotemporal independent wave distribution. By modeling the imaging system, the matrix-imaging equation can be established. And the target of interest is reconstructed through computational imaging [15] method. TCAI has many advantages such as high resolution, all-time functionality, low complexity, and low cost and so forth. Besides, the forward-looking and staring imaging capability can be obtained without relying on any relative motion between imaging platform and target, which is different from synthetic aperture radar [16]. As a result, the TCAI technique has a potential application in security screening, battlefield reconnaissance, nondestructive detection and so on.
In recent years, many methods and systems have been proposed that promote the development of TCAI. The authors of Reference [17] designed a single-pixel pulsed terahertz camera, which utilizes random patterns for imaging and the theory of compression sensing (CS) [18] to solve the imaging equation. In 2013, metamaterial apertures that support custom-designed complex measurement modes were introduced into microwave imaging, which avoids mechanical scanning [15]. Xu et al. utilized randomly programmable metasurface to solve the inverse-scattering problem in the single-sensor imaging [19]. Furthermore, the matrix-imaging equation based on the theory of physical optics was derived and high-resolution TCAI was achieved [20]. Besides, the Bayesian estimation [21] and sparsity-driven methods [22] have greatly inspired TCAI research. Although these methods and systems are effective, there are still some great challenges in TCAI. First, the accuracy of system modeling cannot easily be guaranteed whether it is based on the time-delay signal model [23] or Fresnel diffraction theory [20], and some errors are introduced into the reference signal matrix, which leads to the ability to solve the target scattering coefficient is poor at low SNR. Second, the iterative algorithms are used to reconstruct the target, which are too time-consuming to high frame rate imaging, and they are not quite stable and robust to modeling errors and noise. Considering the importance of these problems in practical applications, a fast TCAI method needs to be devised.
Deep learning (DL) has greatly inspired the research in object detection, image classification, signal processing, and among many others. For the inverse problem community, learning-based methods have been successfully employed in multiple scattering media imaging [24], holographic image reconstruction [25], lensless computational imaging [26], computational ghost imaging [27,28] and so forth, but they usually take a lot of effort to collect data set, which is not easily affordable. To reduce the cost of training, some researchers have proposed training imaging network with simulation data set. In References [29] and [30], the practically usable networks that were trained using simulation data set show competitive imaging performance in real-world scenarios, and the simulation results of Reference [31] also demonstrate the effectiveness using simulation data set to train the network. Thus, we investigated the TCAI based on DL to tackle a series of problems mentioned above. In this paper, we design an end to end neural network, which is trained with simulation data. Once trained, the target of interest can be restored instantly by inputting echo signal into the imaging network, and the simulation experiment also proves that the imaging quality at low SNR superior to state-of-the-art iterative approaches for TCAI.

Signal Model and Learning-Based Approach
For the convenience, we take TCAI model based on single input single output technology as an example. The schematic diagram is shown in Figure 1 and it mainly contains a controlling and processing terminal, transmitter module, transmitter, coded aperture, receiver module, receiver. The transmitting module includes mixer and frequency multiplier, and the receiving module includes low noise amplifier and mixer. The THz wave transmitted from the transmitter, and then it propagates to the coded aperture. The coded aperture, controlled by the controlling and processing terminal, randomly or pseudo-randomly modulates the amplitude or phase of incident THz wave to produce a spatiotemporal-independent radiation field in the imaging plane, which is divided into M grid-cells. After being reflected from the imaging area, the pseudo-random signal is collected by receiver and then sent to the controlling and processing terminal for reconstruction the target. Suppose the transmitter emits a THz linear frequency modulation signal, the reference signal of the m-th grid cell at time t n can be deduced as where A is the amplitude, f c is the center frequency, K is the chirp rate, c is the speed of light, r represents the signal propagation distance, Q stands for the number of transmitting element of the coded aperture, ϕ t n ,q is the random phase modulation factor for the q-th coding aperture element at time t n . Then, the echo signal at time t n is expressed as where β m stands for the scattering coefficient corresponding to the m-th grid cell. Based on the time discretion of Equation (2), the matrix-imaging formula can be written as where ω = {ω n } n=N n=1 is the additive measurement noise, The row vector and column vector of S are the time-domain samples at {t n } n=N n=1 and the spatial-domain samples at {m} m=M m=1 , respectively.
For the previous TCAI [23,32], the compressed sensing [18] recovery algorithms are the standard method to solve β and the imaging problem is treated as an optimization problem after obtaining the evaluation of S where S b − Sβ 2 2 is the fitting error, σ 2 is usually the variance of ω and ψ represents the prior information on β. Typically, ψ(β) = β 1 , where · 1 is L 1 Norm that is used to constrain the imaging domain. For the solving (5), a large number of iterations are usually required which essentially limit the imaging speed. However, the method we presented breaks the bottleneck, it uses neural network to approximate function g so that target scattering coefficient can be estimated directly from echo signal.
where Θ is the set of all possible parameters. Here they can be learned from a training set each of which pairs up a known original target β d and the corresponding echo signal S d b , where d = 1, 2, . . . , D, enumerates the total D different training data pairs. Thus, this parametric reconstruction process can be expressed as where g θ (·) stands for the target of recovery from the imaging network under parameter θ, L(·) is the loss function.
As can be seen from Equation (4), the large-scale reference-signal matrix creates a heavy computational burden. Due to the short wavelength and precise resolving ability of THz waves, the imaging area is divided into smaller grids, which means that the more the number of elements in M and the more complicated calculation of reference-signal matrix. As a result, it is time-consuming to solve the target scattering coefficient through the algorithms consist of iterative-based processes as shown in Equation (5). Therefore, it is very necessary to use the parameter reconstruction algorithm as shown in shown Equation (6) to achieve fast TCAI. In contrast, the proposed approach is in significant ways different. It does not need some time-consuming operations like previous imaging technique. Instead, it demands plenty of data set which includes the groundtruth image and corresponding echo signal and spends some time in training network, but these can be done in advance. Once the training procedure is completed, the designed imaging network can blindly reconstruct target from echo signal.

Network Structure and Data Generation
Recently, convolution neural networks (CNN) have been extended and applied to solve inverse problems. The theoretical motivations for using CNNs as the learning architecture and the design strategies of CNN-based imaging framework have been discussed [33]. Moreover, the relationship between CNN and iterative optimization algorithms has also been surveyed [34]. Motivated by this research, we propose a neural network for fast TCAI which includes the nonlinear part of the encoded information and the linear part of the decoded information. Figure 2 shows the schematic diagram of the network architecture. We suppose that the input of the network is the echo signal with the length of (H × W) × 1. It is down-sampled by ×1, ×2, creating two flow structure, with spatial dimensions of H × W, H/2 × W/2, respectively. And the output is the expected target with different scales. The number of feature maps in each layer is 16. After the down-sample, the two tensors flow to the residue blocks, which is constructed by two convolutional layers with batch normalization and two rectified linear units (ReLU), that is, ReLU(x) = max(0, x). And a shortcut is utilized between the block's input and output, as indicated by the red arrows, which mitigates the divergent gradient problem and accelerates the convergence of the deep neural network. Following residue block, the spatial dimensions of this feature map from H/2 × W/2 to H × W via up-sampling block, each block includes one convolutional layer with batch normalization, a ReLU operation and one up-sampling layer that facilitates super-resolution. Finally, the fusion tensor can be obtained by a connection operation in third-dimension of the output tensor of each flow structure. Here, all nonlinear operations are completed, and the low-level and high-level semantic features are learned from the echo signal. For the linear part, which is made up of convolutional layers, it can transform these extracted features into the output in imaging domain. It is important to note that some tricks are adopted to reduce network parameters and achieve cross-channel information interaction. In testing, to avoid over-fitting, the neurons are randomly ignored.  In the developed network, the spatial correlations between convolution layers can be modeled as where r stands for the number of channels in i-1 layer, P and B are the size of the weight matrix, that is, convolutional kernels. In i-th convolutional layer, v x,y i,j is the value of x, y-th pixel in the j-th feature map and w p,b i,j,r is the weight of p, b-th position in the j-th convolutional kernels. These parameters are included in Θ, and they can be optimized by minimizing the loss function of the predicted target and the original target where · 2 refers to L 2 Norm, numel(·) indicates the quantity of pixels in the original target, and D * = 8 is the mini-batch size in the stochastic gradient descent (SGD) method [35]. And the Adam optimizer [36] was adopted to optimize imaging parameters. As mentioned above, the training of an imaging network usually requires a large amount of the original target and corresponding echo signal. For the generation of the original target data set, we first randomly generate the number of scattering points and their positions on the 60 × 60 imaging plane, which is designed to imitate real-target cases and guarantee the diversity and richness of the target. Then the scattering coefficients are generated randomly. Subsequently, the corresponding echo data set can be obtained by Equation (3), each of which is N = 3600 in length. In particular, the white Gaussian noise is added to the echo signal to acquire the data set with noise. Eventually, we generated two different data sets, one set of which corresponds to data without noise and another with SNR = 5 dB. Each data set containing D = 50, 000 training pairs (S d b , β d ) D d=1 . One tenth of this is used as a verification set and the rest as a training set. In the quantitative analysis, the testing data are generated in the same way as the training data.

Results
In this section, the feasibility of the DL based approach is verified by numerical experiments. Parameters in the imaging experiment are given in Table 1. The frequency range of the THz signal is from 330 GHz to 350 GHz. The 1-bit transmission-type coded aperture has 60 × 60 elements that are employed to modulate the phase of the transmitted signal by 0 0 or 180 0 . The imaging area is divided into M = 60 × 60, and it is easy to calculate that the pixel interval is no larger than 0.005 m. The size of the target (T) is expressed as the number of non-zero scattering coefficients in imaging area [37], the ratio between T and M is considered the complexity of the target. Also, it is easy to know from the table that the spatial dimension of S is 3600 × 3600, which creates high imaging complexity. In practice, the scale of S may be larger and this calculation is more complicated. Therefore, it is necessary to use a neural network to model S implicitly.
The network is implemented in Python version 3.6 using DL frameworsorFlow version 1.8 and trained on a desktop computer with GPU NVIDIA 2080 and the CUDA edition is 9.0. The training took approximately 10 h, which is time-consuming but this procedure can be done in advance. Once trained, the target of interest can be restored instantly by inputting echo signal into the imaging network. The Adam optimization algorithm is employed to optimize weights, and the initial learning is 10 −3 which decays with the factor of 0.98 after each epoch. The batch size is 8 and the training lasts for 500 epochs. Two typical reconstruction algorithms for TCAI, the Sparse Bayesian Learning (SBL) algorithm and TVAL3 algorithm, were chosen to compete with the presented approach. To analyze the imaging performance of various algorithms, the Mean Square Error (MSE) is used as the quantitative index. In this paper, these test targets are composed of scattering points, and the corresponding scattering coefficients are a random value between zero and unity. All reconstruction results and MSE calculations were done on the computer with Inter Xeon Silver 4116 CPU except as specifically stated, and each MSE represents the average results 100 Monte Carlo trials and the shape of the target changes randomly in each trial. To investigate the validity of the imaging network, we first carried out simulation experiments at various targets and the results with SNR = 20 dB are shown Figure 3, and the ratio between the target size and the number of grid cells is T/M = 11/3600, T/M =17/120, T/M =431/3600, T/M =731/3600, T/M =71/200, T/M =37/60, respectively. It can be seen that the target can be successfully reconstructed whether it is made up of unit ideal point scatters or multi-value point scatters. For further investigating the imaging performance of the proposed method, the SBL algorithm [38] and TVAL3 algorithm [39] were implemented as comparisons. Figure 4 shows the reconstruction results of the "smile" whose scattering coefficients of all the point scatters are a random value between zero and unity. One can clearly see that the object can be recovered from the echo signal through the deep network-based algorithm as low as SNR = −5 dB, despite the reconstruction results apparently being distorted and having many spurious scatterers just like other methods. However, the imaging quality gradually enhanced with the increase of SNR. Compared with SBL algorithm under all SNR levels, the proposed algorithm provides higher resolution results in which the scattering intensity more authentically and the target outline is clearer. For TVAL3 algorithm, the target can be perfectly reconstructed when the SNR is larger than 15 dB. However, its performance degraded dramatically as SNR go lower. In general, our performance is competitive with the typical optimization iterative algorithms when SNR is not larger than 10 dB. It is main reason that CS-based algorithms are not quite robust since some errors exist in the reference signal matrix S. Nonetheless, the developed network can automatically fix these errors during the training. With the increase of SNR, we also note that the imaging performance of the proposed method is ultimately bounded by the imaging network's representational error. However, as can be seen in the predicted target, the reconstructions are semantically correct.

Target
The proposed method  Quantitatively, the MSEs of different methods under different SNRs are calculated and the results are drawn in Figure 5. For each SNR, the shape of the target changes randomly in each trial. As expected, the SBL algorithm is worse than the proposed method no matter for which SNR. Although the TVAL3 algorithm shows good performance in high SNR, but it is sensitive to noise and takes more time than the presented algorithm. Table 2 shows the time cost of various algorithms, each of which is the average of 100 trials. Due to the neural network-based approach can be easily parallelized, we also recorded the reconstruction time of the presented method with GPU implementation. It is easy to calculate that the imaging frame rate of our method is no less than 280 Hz. From these results, we can see that the proposed algorithm has great superiority with imaging efficiency. One explanation is that an end-to-end network can directly transform the echo signal into the target, while classic imaging techniques require a large amount of iterations to estimate a satisfactory solution. Therefore, it is unsurprising that the neural network based method is much less time consuming. Thus, experimental results above fully illustrate that the proposed method is a promising tool for fast TCAI under low SNR. More evidence is shown in Figure 6. Again, we can see the superiority of deep learning-based approach clearly.

Discussion
Compared with the classic imaging techniques, the end-to-end neural network can adjust the imaging error adaptively, so its stability and robustness are easier to be guaranteed. Even though the reconstructed image is not as perfect as TVAL3 at high SNR, the time cost is encouraging. The dynamic response video was obtained with GPU implementation and included in Media. As aforementioned, the imaging frame rate is quite high. Therefore, we set the pause time of each frame to 0.008 s. It is less than the recovery time of a batch of images, so the video does not play very smoothly. Nonetheless, the superiority of the proposed approach on imaging efficiency can be clearly. A frame image from target video is shown in Figure 7a ( T/M = 367/3600), and the corresponding reconstruction result is shown in Figure 7b. To investigate the influence of noiseless and noisy training data on network imaging performance, we trained two networks using different data sets. In test, the target size is the same as the original target in Figures 4 and 8 shows the evidence that whether the training data is noisy or not has little impact on the target restoration accuracy. Also, it is worth noting that the method presented here in this manuscript is especially useful for TCAI when we get the empirical data set. The large-scale reference signal matrix does not need to be estimated in advance and the total imaging complexity will be reduced significantly. In the absence of an empirical data set, the simulation data can be employed to train a practically usable imaging network as in References [29] and [30]. In conclusion, the neural network based approach is a promising tool for TCAI.

Conclusions
In this paper, a fast TCAI method is presented. We first introduced the TCAI system and learning-based approach. Subsequently, an end-to-end network was developed and tens of thousands of training pairs were generated to learn the mapping relationship between the echo signal and original target. The developed network includes the nonlinear part of the encoded information and the linear part of the decoded information. The experimental results of the simulation data set demonstrated that the proposed method can outperform the state-of-the-art iterative algorithms in both accuracy and efficiency. Also, it can quickly reconstruct targets of different sizes and the pixel interval that can be resolved is no larger than 0.005 m. With the advantages of the neural network based approach, it has a potential application in security screening, battlefield reconnaissance, nondestructive detection and so on. In further work, we intend to improve the network performance and experimentally verify our method with training data sets and echo signals.

Abbreviations
The following abbreviations are used in this manuscript: