Combining the CSI technique with a deep learning approach is accomplished by learning a data-driven mapping, $\mathcal{G}$, from a CSI reconstruction to the true permittivity ($\mathcal{G}:\epsilon^{CSI}\to \epsilon^{true}$).

In this study, we learn a mapping from the real and imaginary parts of the permittivities in CSI reconstructions at several frequencies to a single real permittivity image. Thus, if the CSI complex permittivity map is an $L\times M\times N$ 3D image and reconstructions at five frequencies are utilized, then each of the learned functions maps the $5\times L\times M\times N$ complex domain to the $L\times M\times N$ real domain (e.g., ${\mathcal{G}}_{R}:{\mathbb{C}}^{5\times L\times M\times N}\mapsto {\mathbb{R}}^{L\times M\times N}$). By treating the real and imaginary parts as separate channels, the complex output of CSI at the five selected frequencies can be treated as a 10-channel real-valued image. We realized this mapping through a deep neural network as follows.
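As a concrete illustration, the conversion of five complex CSI volumes into a 10-channel real-valued network input can be sketched as follows (a minimal NumPy sketch; the array names and the channel ordering are assumptions for illustration, not taken from the paper):

```python
import numpy as np

# Five complex-valued CSI reconstructions, one per frequency,
# each an L x M x N volume (here L = M = N = 64).
L = M = N = 64
rng = np.random.default_rng(0)
csi_volumes = [rng.standard_normal((L, M, N)) + 1j * rng.standard_normal((L, M, N))
               for _ in range(5)]

# Split each complex volume into its real and imaginary parts and
# stack them along a trailing channel axis: shape (L, M, N, 10).
channels = []
for vol in csi_volumes:
    channels.append(vol.real)
    channels.append(vol.imag)
network_input = np.stack(channels, axis=-1)

print(network_input.shape)  # (64, 64, 64, 10)
```

The corresponding network output would then be a single-channel $L\times M\times N$ real volume, matching the real part of the true permittivity.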

The desired mapping for our task at hand is an image-to-image transformation, and multiple neural architectures can implement it. For instance, a naive choice would be a fully-connected single-layer neural network that takes the CSI reconstruction as input and is trained to output the ground-truth permittivity. However, such an architecture would be very prone to overfitting [12]. We therefore use a hierarchical convolutional neural network for our image-to-image transformation task. A good template for such a task is the U-Net architecture, one of the most successful deep neural networks for image segmentation and reconstruction problems [13]. The architecture consists of successive convolutional and downsampling layers, followed by successive deconvolutional and upsampling layers. Moreover, the skip connections between the corresponding contractive and expansive layers keep the gradients from vanishing, which helps the optimization process [13,43]. To use a U-Net for reconstruction, the original objective of the U-Net is replaced with the sum of pixelwise squared reconstruction errors between the true real part of the permittivity and the output of the U-Net [13]. In our problem, the network input is the 3D CSI-reconstructed complex images (after 500 iterations). Thus, there are two options for the U-Net architecture: a U-Net with complex weights and a U-Net with real weights. Very few studies have examined the training of U-Nets with complex weights, although very recently Trabelsi et al. trained convolutional architectures with complex weights [44]. In this paper, we decided to use a U-Net architecture with real-valued weights. A schematic representation of our architecture is shown in Figure 2.

The motivation for choosing the neural network parameters (the number of convolutional layers, and the size and number of filters) is as follows. In a hierarchical multi-scale CNN, the effective receptive field of the convolution filters varies at each layer, i.e., through successive sub-sampling it is possible to obtain a larger receptive field even with filters of smaller kernel size [12,45]. As mentioned above, the input to our neural network is $L\times M\times N\times 10$; in particular, for each frequency, the dimension of our input image volume is $64\times 64\times 64$ (i.e., $L=M=N=64$). If we start with a 3D receptive field of $3\times 3\times 3$, after four layers of successive convolutions and subsampling (by a factor of 1/2), the receptive field effectively spans the entire image volume. We therefore use four convolutional layers with a 3D filter kernel size of $3\times 3\times 3$. Since each convolutional layer reduces the size of the image volume, we can increase the number of filters at each successive layer to enhance the representational power of the neural network [12]. In particular, we start with 32 filters in the first layer and successively double the number of filters after each layer (the number of filters after the fourth layer is 512). This defines the encoder part of the U-Net, i.e., the part of the network consisting of contractive convolutions. For the decoder part, we follow a symmetric architecture consisting of expansive convolutions [13].
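The receptive-field growth described above can be checked numerically with the standard recurrence for the effective receptive field $r$ and the cumulative stride ("jump") $j$: a layer with kernel $k$ and stride $s$ updates $r \leftarrow r + (k-1)\,j$ and $j \leftarrow j\,s$. The sketch below assumes each encoder level is one $3\times3\times3$ convolution followed by a factor-2 downsampling, with one further convolution at the bottleneck after the fourth level (a standard U-Net detail that the text does not spell out); under these assumptions the receptive field reaches 78 voxels, exceeding the 64-voxel extent of the input volume:

```python
# Effective receptive field of a stack of 3x3x3 convolutions, each
# followed by a stride-2 downsampling, as in the encoder described
# in the text.  Recurrence per layer with kernel k and stride s:
#   r <- r + (k - 1) * j ;  j <- j * s

def apply_layer(r, j, kernel, stride):
    r = r + (kernel - 1) * j
    j = j * stride
    return r, j

r, j = 1, 1  # start from a single input voxel
for level in range(4):
    r, j = apply_layer(r, j, kernel=3, stride=1)  # 3x3x3 convolution
    r, j = apply_layer(r, j, kernel=2, stride=2)  # factor-2 downsampling
    print(f"after level {level + 1}: receptive field {r}, jump {j}")

# Assumed bottleneck convolution after the fourth downsampling.
r, j = apply_layer(r, j, kernel=3, stride=1)
print(f"bottleneck: receptive field {r}")  # 78 >= 64: spans the volume

# Filter counts double at each level, starting from 32 (as in the text):
filters = [32 * 2 ** i for i in range(5)]
print(filters)  # [32, 64, 128, 256, 512]
```

The same bookkeeping also confirms the filter progression 32, 64, 128, 256, 512 stated in the text for the contractive path.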