Efficient Tensor Sensing for RF Tomographic Imaging on GPUs

Radio-frequency (RF) tomographic imaging is a promising technique for inferring multi-dimensional physical space by processing RF signals traversed across a region of interest. Tensor-based approaches for tomographic imaging are superior at detecting the objects within higher dimensional spaces. The recently-proposed tensor sensing approach based on the transform tensor model achieves a lower error rate and faster speed than the previous tensor-based compress sensing approach. However, the running time of the tensor sensing approach increases exponentially with the dimension of tensors, thus not being very practical for big tensors. In this paper, we address this problem by exploiting massively-parallel GPUs. We design, implement, and optimize the tensor sensing approach on an NVIDIA Tesla GPU and evaluate the performance in terms of the running time and recovery error rate. Experimental results show that our GPU tensor sensing is as accurate as the CPU counterpart with an average of 44.79× and up to 84.70× speedups for varying-sized synthetic tensor data. For IKEA Model 3D model data of a smaller size, our GPU algorithm achieved 15.374× speedup over the CPU tensor sensing. We further encapsulate the GPU algorithm into an open-source library, called cuTensorSensing (CUDA Tensor Sensing), which can be used for efficient RF tomographic imaging.


Introduction
Radio frequency (RF) tomographic imaging, also called wireless tomography, is a technique for detecting objects within a specific region by analyzing radio frequency signals transmitted between wireless nodes, as shown in Figure 1.This technique uses radio reflections that bounce off the object body and does not require the object to carry any wireless device.This advantage has made it ideal for security checking, rescue operations, space applications, and smart buildings [1].
In general, wireless signal propagating on a link loses power due to shadowing, distance, and multipath fading.The task of RF tomographic imaging is to estimate a spatial distribution of the shadowing loss in a detecting region from the measured power of wireless signals.Classic radio tomography and vector-based methods have been proposed.The work in [2] presented a linear model for using received signal strength (RSS) measurements to obtain images of moving objects.The work in [3] addressed substantial issues for the practical implementation.A tensor-based compressed sensing method [1] was proposed for RF tomographic imaging in a three and higher dimensionally-structured region, which estimates the distribution by utilizing its low-rank property.However, the method requires computing tensor singular value decomposition (t-SVD) in each algorithmic iteration, which leads to high computational complexity.To address this problem, Deng et al. [4] proposed to use the transform-based tensor model to formulate the RF tomographic imaging as a tensor sensing problem, then used a fast iterative algorithm Alt-Min to solve the tensor sensing problem.Their method fully utilizes the geometric structure of the three-dimensional loss field tensor.Compared to the tensor-based compressed sensing method, Deng's method achieved a lower error rate and faster computation speed.However, the iterative algorithm Alt-Min proposed by Deng et al. [4] iteratively estimates a pair of tensors, which is a computationally-intensive process, thus not being very practical for RF tomographic imaging for big objects or regions, or real-time applications.In this paper, we address this problem by exploiting graphics processing units (GPUs).Because of massive hardware parallelism and high memory bandwidth, GPUs have been widely used in diverse applications including machine learning [5][6][7], graph processing [8][9][10], big data analytics [11,12], image processing [13], and fluid dynamics [14].In order to reap the power of GPUs, the algorithmic steps need to be mapped delicately onto the architecture of GPUs, especially the thread and memory hierarchy.In this work, we design, implement, and optimize the transform model-based tensor sensing method of Deng et al. [4] on an NVIDIA Tesla V100 GPU with CUDA (Compute Unified Device Architecture) and evaluate it in terms of the running time and recovery error.Experiment results show that the GPU algorithm achieves similar relative error as the CPU counterpart.Moreover, the GPU tensor sensing outperforms the CPU tensor sensing on all tensor sizes with 45.39× speedup on average and up to 84.70× speedup on bigger tensor sizes such as 120 × 120 × 6.We encapsulate this GPU tensor sensing algorithm into an open-source library called "cuTensorSensing" (CUDA Tensor Sensing), which is available at [15].
Our contributions are summarized as follows.First, we analyze the steps of the transform model-based tensor sensing using the Alt-Min algorithm and discuss how to map them onto the GPU architecture.Second, we design, implement, and optimize the tensor sensing on an NVIDIA Tesla V100 GPU.Third, we evaluate and compare the GPU tensor sensing and the CPU tensor sensing in terms of running time and relative error with both synthetic data and IKEA Model data.Fourth, we encapsulate the GPU tensor sensing implementation into an open-source library such that it can be used in diverse applications.
The remainder of the paper is organized as follows.In Section 2, we discuss the related works.Section 3 presents the notations and briefly summarizes the RF tomographic imaging task as a tensor sensing problem.Section 4 describes the design, implementation and optimizations of the tensor sensing on the GPU.In Section 5, we describe the experiment methodology.Section 6 evaluates the GPU tensor sensing with both synthetic data and IKEA Model data.The conclusions are given in Section 7.

Related Works
Existing works on RF tomographic imaging can be classified into vector-based and tensor-based approaches.The vector-based approaches [16,17] aim at estimating a spatial distribution of the shadowing loss in 2D regions of interest.They are not able to infer three-dimensional regions due to the fact that spatial structures of the signal data are ignored.Therefore, researchers have proposed the tensor-based approaches [1,4] to infer three-dimensional spaces.The tensor-based compressed sensing [1] uses the tensor nuclear norm (TNN) [18] to extend RF tomographic problems to the three-dimensional case.This approach has a high computational complexity and error rate.Deng et al. [4] exploited the transform-based tensor model [19] to explore three-dimensional spatial structures for higher accuracy and speed.Our work aims at accelerate Deng's work on GPUs to make it practical for real-time scenarios and bigger tensors.
GPUs have massive parallelism and high memory bandwidth, and many existing research works [20][21][22][23][24] have demonstrated the benefit of utilizing GPUs to accelerate general purpose computing.Due to the high dimensions of tensors, tensor computations are often computation-intensive and time consuming.Recently, GPUs have been increasingly adopted to accelerate diverse tensor computations.Some works focused on accelerating specific tensor operations including tensor contraction [25,26], factorization [27], transpose [28,29], and tensor-matrix multiplication [30].These works propose parallel tensor algorithms specifically optimized for the GPU architectures.GPUTENSOR [31] is a parallel tensor factorization algorithm that splits a tensor into smaller blocks and exploits the inherent parallelism and high-memory bandwidth of GPUs.To handle dynamic tensor data, GPUTENSOR updates its previously-factorized components instead of recomputing them from the raw data.Sparse tensor-times-dense matrix multiplication is a critical bottleneck in data analysis and mining applications, and [32] proposed an efficient primitive on CPU and GPU platforms.Different from these works, our work considers the tensor sensing based on the transform tensor model for RF tomographic imaging.

Description of the Tensor Sensing Problem
Deng et al. [4] proposed to use the transform-based tensor model to formulate the RF tomographic imaging as a tensor sensing problem.Then, they utilized a fast iterative algorithm Alt-Min to solve the tensor sensing problem.Here, we briefly summarize their approach including the Alt-Min algorithm in order to map it onto the GPU architecture.

Alt-Min Algorithm
The goal of tensor sensing is to recover the loss field tensor X from the linear map matrix A and the measurement vector y, which is formulated as follows: This method uses the Alt-Min algorithm (Algorithm 1) to estimate two low rank matrices iteratively whose matrix product is the squeezed matrix of the object tensor X .

Implementation of the Tensor Sensing on CPU
Algorithm 2 shows the implementation of tensor sensing on CPU.

Algorithm 2 Implementation of the tensor sensing on CPU.
Input: Randomly-initialized matrix U, measurement vector y, linear map matrix A Output: X  V ← transform vec(V) back to matrix form 6: use V to form a block diagonal matrix V t b 8: vec(U t ) ← perform least squares minimization on W and y 11: U t ← transform vec(U t ) back to matrix form Next, the least squares minimization problem is solved to estimate V of the next iteration (Algorithm 2, Line 3 and Line 4).The least minimization problem is formulated as follows: As the estimated V is in vector form vec(V), vec(V) is transformed back to matrix form V (Algorithm 2, Line 5).To estimate U, the least squares minimization problem in Equation (3) should be transposed: vec The process of estimating U is similar to estimating V, which is implemented with the corresponding transposed matrix (Algorithm 2 Lines 6-12).Next, the above process is repeated until the number of iterations reaches the set value.Lastly, the two estimated matrices are multiplied to get the final result matrix (Algorithm 2, Line 14).

The Implementation and Optimization of Efficient GPU Tensor Sensing
To achieve high performance on GPUs, we need to consider the data representation, the mappings from computations to GPU threads, and memory accesses.We first design a basic GPU tensor sensing implementation, then optimize the implementation to further improve performance.In Algorithm 2, after least squares minimization (Lines 4 and 10 in Algorithm 2), we get vectorized matrices.For a matrix A ∈ R m×n , the corresponding vectorized A is A v ∈ R mn×1 .The vectorized matrices are converted back to the original matrices (Lines 5 and 11 in Algorithm 2).In some scientific computing programming languages, such as MATLAB, this conversion must be done with the appropriate conversion function.We adopt the column-first storage format to store matrices and vectors, which not only ensures read and write continuity, but also avoids explicit vector-to-matrix conversions since vector A v and matrix A in memory are the same in this format, as shown in Figure 2.

Multiplication of Block Diagonal Matrices
Using the operational properties of the block matrix, we get: This shows that multiplication of block diagonal matrices can be transformed into a batch of small matrix multiplications.As we use column-first format to store A, the batch of A i is stored in constant stride.Let p indicate the location of the first element of A 0 , then the location of the first element of A i is p + i × N 1 N 3 .We utilize the gemmStridedBatchd() routine in the NVIDIA cuBLAS Library to compute a batch of matrix multiplications simultaneously to achieve better performance.Figure 3 shows how this process performs on GPU and Table 1 shows the parameters setting of this routine.

Eliminating Explicit Transpose Operations
After each least squares method, the transpose of the target matrix is obtained.The transpose operation of the matrix needs to be performed (Lines 6 and 12 in Algorithm 2).However, the transpose operation takes much computing time and resource.As the operation after transpose of the matrix is the multiplication of diagonal matrices, we eliminate the explicit matrix transpose by enabling the transpose option in the gemmStridedBatched() routine.In this way, the gemmStridedBatched() routine will perform matrix transpose implicitly and efficiently before the matrix multiplications.

Least Squares Minimization
As shown in Algorithm 2, least squares minimization (LSM) is the major step of the tensor sensing approach, which is the most time-consuming part of the entire approach.There are many approaches to perform least squares minimization.QR factorization is one of the most efficient approaches, which is well supported by CUDA.

Optimizations of the GPU Tensor Sensing
During the computation flow of the GPU tensor sensing, frequent data transfer between the CPU and GPU will significantly degrade system performance.Therefore, we design a data reusing strategy to reduce data transfer overhead and resource consumption.
In the entire tensor sensing flow, we invoke data transfer only twice at the beginning and at the end.At the beginning, the input data are transferred from the CPU to the GPU.At the end, the final result matrix is sent back to the CPU.We optimize the computations in the tensor sensing flow such that they all perform in-place calculations.For instance, in the QR decomposition to solve the least squares problem, the input vector y is overrode by the result vectors (vec(U) and vec(V)).Therefore, we need to reassign the vector y at the beginning of each least squares minimization iteration.However, it is an expensive operation to load the original data of vector y from the CPU memory every time.Instead, we pre-allocate a memory space named dyL on the GPU that stores the original data of the vector y.Every time we need to reassign the dy, we use the GPU device-to-device transfer routine cudaMemcpyDeviceToDevice() to copy the data from dyL to dy.Since the device-to-device bandwidth of 380 GB/s is much higher than the bandwidth between the CPU and GPU of 12 GB/s, this strategy significantly reduces data transfer overhead.Algorithm 3 and Figure 4 describe the computation flow and data organization of the tensor sensing on the GPU.

Experiment Methodology
In this section, we describe the experimental methodology including the hardware and software platform, testing data, testing process, and the comparison metrics.

Hardware and Software Platform
We use an NVIDIA Tesla V100 GPU to evaluate the performance of GPU tensor sensing.The GPU incorporates 5120 CUDA cores @ 1.53 GHz and 32 GB DDR5 memory.It is installed on a server with 128 GB DDR memory and an Intel i7-7820x CPU with 8 cores @ 3.60 GHz supporting 16 hardware threads with hyper-threading technology.The server is running Ubuntu 18.04 LTS with Kernel Version 4.15.0.The CPU and GPU tensor sensing is running with MATLAB R2017b and CUDA 10.0, respectively.

Testing Data
In the experiment, we used both IKEA model and synthetic data in the evaluation.For IKEA model data, we used the IKEA 3D data [1] that generated a ground truth tensors of size 60 × 60 × 15 with rank 6.Each 3D model was placed in the middle of the "tensor" and occupied a part of the space.In this task, we mainly focused on the location and outline information, while the texture and color information were ignored.The synthetic data were generated according to the compressed sensing model [1].The synthetic data did not correspond to a specific physical model.We used them to demonstrate the performance of our approach for different data sizes.

Testing Process
The synthetic and real data were processed by the CPU tensor sensing implementation in MATLAB and GPU tensor sensing implementation in CUDA, respectively.We evaluated and compared two versions of GPU tensor sensing: the unoptimized one and the optimized one.The unoptimized GPU tensor sensing adopted none of the optimization techniques in Section 4.2.We repeated each experiment five times and report the average results.

Comparison Metrics
We adopted two metrics for comparison: running time and relative error rate.

•
Running time: Varying the tensor size and fixing other parameters, we measured the execution time of the CPU tensor sensing, unoptimized GPU tensor sensing, and optimized GPU tensor sensing.Finally, we calculated speedups as the running time of the CPU tensor sensing divided by the running time of GPU tensor sensing.
Error rate: We adopted the metric relative square error, defined as RSE = X − X F / X F .

Running Time of Synthetic Data
Figure 5a shows that the running time of the CPU tensor sensing and two GPU tensor sensing implementations (unoptimized and optimized ones) for X of size n × n × 6 of rank one, where n varies from 40-120 at a step of 20.The sampling rate was set to 20%, and both CPU and GPU tensor sensing performed five iterations for completion.The detailed time value is listed in Table 2.While M = N 1 × N 2 × N 3 × sampling rate and A ∈ R M×N 1 N 2 N 3 , the scale of the main operation matrix A increased at a rate of four times as the increase rate of n.
We can see that the running time of the CPU tensor sensing was polynomial with the increase of n, while the running time of the GPU tensor sensing was linearly growing.The unoptimized and optimized GPU tensor sensing achieved an average of 4.24× and 44.79× and up to 9.77× and 84.70× speedups, respectively.This illustrates the effectiveness of the optimization methods proposed in Section 4.2.When the tensor size was small, the data transfer occupied a major portion of the entire execution time of the unoptimized GPU tensor sensing.As a result, its performance was even lower than the CPU tensor sensing.This experiment evaluated the error rate of the CPU and GPU tensor sensing under different iterations for X of size 60 × 60 × 15 with rank six.The sampling rate was set to 50%. Figure 6b shows the recovery results with 10 iterations.The running time of the CPU tensor sensing at five iterations was 14.91 s on average, while the running time of the GPU tensor sensing was 0.97 s on average; thus, the speedup is 15.37×.As shown in Figure 7, under increased iterations from 1-30, the RSEs dropped significantly.More importantly, the CPU tensor sensing and GPU implementation achieved almost the same RSEs at all iterations (the two curves in Figure 7 overlap with each other), which means that they achieved similar error rate performance in tensor sensing.

Conclusions
In this work, we present an open-source library named cuTensorSensing for efficient RF tomographic imaging on GPUs.The experiment evaluations show that the proposed GPU tensor sensing works effectively and accurately.Compared with the counterpart on CPU, the GPU tensor sensing achieved a similar error rate, but much faster speed.For synthetic data, the GPU tensor sensing achieved an average of 44.79× and up to 84.70× speedups versus the CPU tensor sensing for bigger tensors.For IKEA Model 3D objects' data of a smaller tensor size, the GPU tensor sensing achieved a 15.37× speedup over the CPU tensor sensing.The cuTensorSensing library is useful for efficient RF tomographic imaging.

Figure 1 .
Figure 1.An illustration of the RF tomographic imaging network.

2 :
use U to form a block diagonal matrix U b 3: W ← A * U b 4: vec(V) ← perform least square minimization on W and y 5:

12 :U
← transposeU t 13: end for 14: return X = UV First, U of the previous iteration (in the first iteration, U is initialized randomly) is used to form a block diagonal matrix U b (Algorithm 2, Line 2):

Figure 2 .
Figure 2. Vectorizing a matrix A into vec(A) in memory.

Figure 3 .
Figure 3. Multiplication of block diagonal matrices on GPU.

Table 1 .
The parameters of the gemmStridedBatchd() routine in the cuBLAS library.Parameters Meaning Value transA operation op(A) that is non-or transpose non-transpose transU operation op(U) that is non-or transpose transpose A pointer to the A matrix corresponding to the first instance of the batch d A U pointer to the U matrix d y W pointer to the W matrix d W strideA the address offset between A i and A i+1 M × N 1 N 3 strideU the address offset between U i and U i+1 0 strideW the address offset between W i and W i+1 M × N 3 batchNum number of gemm to perform in the batch N 2 4.1.

Figure 4 .
Figure 4. Memory organization in the calculation process.

Figure 5 .
Figure 5. (a) The running time of the CPU tensor sensing and GPU tensor sensing and (b) the speedups of the unoptimized and optimized GPU tensor sensing.

Figure 6 .
Figure 6.(a) The 3D visualization of the IKEA models and (b) the corresponding recovery results.

Figure 7 .
Figure 7. RSE of the CPU tensor sensing and GPU tensor sensing.
6.2.Error Rate and Running Time of IKEA Model Data