Communication

A Hybrid GPU and CPU Parallel Computing Method to Accelerate Millimeter-Wave Imaging

1 Shanghai Key Lab of Modern Optical System, University of Shanghai for Science and Technology, No. 516 JunGong Road, Shanghai 200093, China
2 School of Health Science and Engineering, University of Shanghai for Science and Technology, No. 516 JunGong Road, Shanghai 200093, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2023, 12(4), 840; https://doi.org/10.3390/electronics12040840
Submission received: 11 October 2022 / Revised: 28 January 2023 / Accepted: 1 February 2023 / Published: 7 February 2023
(This article belongs to the Special Issue High-Performance Computing and Its Applications)

Abstract: The range migration algorithm (RMA), based on the Fourier transform, is widely applied in millimeter-wave (MMW) close-range imaging because of its low computational cost and small approximation error. However, its interpolation stage is inefficient due to the intensive logic control involved, which limits its speed on a graphics processing unit (GPU) platform. Therefore, in this paper, we present an acceleration method based on hybrid GPU and central processing unit (CPU) parallel computation for implementing the RMA. The proposed method exploits the strong logic-control capability of the CPU to assist the GPU with the logic control of the interpolation stage. The common positions of the wavenumber-domain components to be interpolated are calculated by the CPU and stored in constant memory, ready for broadcast at any time. This avoids the repetitive computation incurred by a GPU-only scheme. The GPU is then responsible for the remaining matrix-related steps and outputs the needed wavenumber-domain values. Imaging experiments verify the acceleration efficiency of the proposed method and demonstrate that its speedup ratio is more than 15 times that of the CPU-only method and more than 2 times that of the GPU-only method.

1. Introduction

Recently, synthetic aperture radar (SAR) technology has been widely applied in millimeter-wave (MMW) imaging applications [1,2,3]. One of the most popular imaging algorithms at present is the range migration algorithm (RMA), which is well suited to MMW close-range imaging because of its low computational cost and small approximation error [4,5,6].
Practical radar close-range imaging suffers from a heavy computational burden. Therefore, almost all imaging systems spend great effort on the real-time performance of the algorithm on hardware with parallel processing capability. Since the beginning of this century, the graphics processing unit (GPU), with its powerful parallel processing capability, has received increasing attention. The introduction of the compute unified device architecture (CUDA) programming model by NVIDIA made general-purpose parallel computing available on the GPU [7].
Plenty of studies have explored parallel computing strategies on CPUs and GPUs to address imaging problems. Yin Q. [8] proposed a GPU-based framework of the parallel inversion method for polarimetric SAR imagery; it exploits the parallel computing advantage of the GPU to process computation-heavy imagery, improving the computational efficiency of the algorithm by about 100 times. Cui Z. [9] proposed a constant false alarm rate with convolution and pooling (CP-CFAR) method to improve the detection efficiency in airborne SAR images via GPU parallel acceleration, achieving a speedup of more than 18 times. Liu G. [10] proposed a parallel simulation system for multi-input multi-output (MIMO) radar based on the GPU architecture, whose simulation achieves a speedup of 130 times compared with the serial CPU method. Gou L. [11] proposed accelerating video SAR imaging using a GPU on the CUDA platform, which effectively solves the real-time problem. In general, previous work shows that GPUs significantly improve the computational efficiency of SAR imaging.
The primary stages of the RMA for SAR three-dimensional (3D) imaging are the fast Fourier transform (FFT) and interpolation. Because open-source libraries allow CPUs and GPUs to implement the FFT quickly, the computational complexity of the interpolation becomes the key factor limiting the implementation speed. There are a variety of interpolation methods to convert the non-uniform wavenumber domain to a uniform one, such as linear interpolation, Newton interpolation, and cubic spline interpolation. Compared to linear interpolation, cubic spline interpolation has a smaller approximation error and better continuity [12]. Although the Newton interpolation method can ensure the accuracy and the overall continuity of the interpolation function, its interpolation curve is not stable enough at the edges and not smooth enough away from the interpolation nodes [13]. Cubic spline interpolation is a segmental interpolation method that effectively avoids Runge's phenomenon [14], and thus maintains both the accuracy at the interpolation points and the smoothness of the interpolation curve. Therefore, the cubic spline interpolation algorithm is used in this paper.
When cubic spline interpolation is implemented on a GPU platform, a fixed number of scattering points is required to estimate the expected value at each desired position, which results in many repetitive calculations over the whole scattering-point data set. These heavy tasks not only take up more video memory but also increase the access time of the CUDA cores. Therefore, compared with the straightforward migration of the RMA onto a GPU-only platform, this paper proposes a hybrid CPU and GPU platform to accelerate the cubic spline interpolation in the RMA for MMW imaging using parallel computing. By decomposing and analyzing the cubic spline interpolation in detail, its steps are separated to better match the hardware according to the computational characteristics of each step. For the steps that involve logical judgments but require only simple computation, the CPU is adopted. For the steps that require large-scale matrix operations, the GPU is adopted as the host processor, while the CPU takes the auxiliary role of handling data transfer and calculating the variables related to the original positions in the wavenumber domain. In this way, the proposed method reduces both the response time and the waiting time of the GPU when performing the interpolation, thereby improving the speed of the RMA implementation. The experiments demonstrate that the proposed approach has high timeliness: it obtains a speedup ratio at least 2 times that of the traditional GPU-only acceleration method and at least 15 times that of the CPU-only method.

2. Acceleration Method

2.1. Three-Dimensional (3D) Range Migration Algorithm

The diagram of the considered MMW close-range imaging system is shown in Figure 1. The motion trajectory of the transceiver is linear. Let $(x, y, R_0)$ be a spatial sampling position of the transceiver, where $x \in [-L_x/2, L_x/2]$ and $y \in [-L_y/2, L_y/2]$; $L_x$ denotes the aperture length in the azimuth dimension (i.e., the X dimension), $L_y$ denotes the aperture length in the height dimension (i.e., the Y dimension), and $R_0$ indicates the distance between the observation plane and the target origin.
Assuming that the transceiver emits stepped-frequency (SF) signals, at the p-th frequency, the response measured at the transceiver is
$$ s(x, y, k_p) = \iiint_{x', y', z'} \sigma(x', y', z')\, e^{-j 2 k_p R}\, dx'\, dy'\, dz', $$
where $\sigma(x', y', z')$ denotes the reflectivity function of the scatterer at the position $(x', y', z')$, $k_p = 2\pi f_p / c$ denotes the wavenumber, $c$ denotes the speed of light, $f_p$ denotes the $p$-th operating frequency, $f_p = f_0 + (p-1)\Delta f$, $f_0$ and $\Delta f$ denote the starting frequency and the frequency step, respectively, $p = 1, 2, \ldots, P$, and $P$ is the number of transmitted stepped frequencies. $R$ denotes the distance between the target and the transceiver, i.e., $R = \sqrt{(x - x')^2 + (y - y')^2 + (z' - R_0)^2}$. The RMA imaging algorithm is shown in Algorithm 1 [15,16,17].
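As a quick numeric check, the stepped-frequency grid and its wavenumbers defined above can be sketched in a few lines of C++ (the routine and its parameter values are illustrative, not the paper's implementation):

```cpp
#include <cmath>
#include <vector>

// Wavenumbers of a stepped-frequency (SF) signal, following the text:
// f_p = f_0 + (p - 1) * df  and  k_p = 2 * pi * f_p / c,  p = 1..P.
std::vector<double> stepped_wavenumbers(double f0, double df, int P) {
    const double c = 299792458.0;          // speed of light [m/s]
    const double pi = std::acos(-1.0);
    std::vector<double> k(P);
    for (int p = 1; p <= P; ++p) {
        double fp = f0 + (p - 1) * df;     // p-th operating frequency
        k[p - 1] = 2.0 * pi * fp / c;      // wavenumber k_p
    }
    return k;
}
```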
Although $k_x$, $k_y$, and $k_p$ are uniform, the wavenumber component in the Z dimension, i.e., $k_z = \sqrt{4k_p^2 - k_x^2 - k_y^2}$, is non-uniform. The wavenumber domain for a certain height is shown in Figure 2. Therefore, the 3rd stage in Algorithm 1 is necessary to convert the non-uniform $k_z$ to a uniform one. As Algorithm 1 shows, the echo data $s(x, y, k_p)$ are transformed into the image data $\hat{\sigma}(x, y, z)$ through a series of stages, i.e., two-dimensional (2D) Fourier transforms (FT), phase compensation, interpolation, and a 3D inverse FT. Highly complete program libraries such as the fastest Fourier transform in the West (FFTW) and the CUDA fast Fourier transform (CUFFT) can efficiently perform FFT operations on the CPU and GPU, respectively. Although CUDA provides a ready-made FFT library to call, a zero-frequency component shift is required before and after the FT in the program. This operation can be considered a 3D data replication, and its time overhead is greater in video memory than in main memory. Moreover, the FFTW library has a very short running time on the CPU. In this paper, the FFTW library is therefore used to perform the Fourier transforms efficiently, and the acceleration of the interpolation becomes crucial for imaging.
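The non-uniformity of $k_z$ that forces the interpolation stage can be demonstrated directly (a hedged sketch; the sample values are arbitrary, not radar parameters):

```cpp
#include <cmath>
#include <vector>

// k_z = sqrt(4*k_p^2 - k_x^2 - k_y^2): even when the k_p samples are
// uniformly spaced, the resulting k_z samples are not, which is why
// stage 3 of Algorithm 1 resamples them onto a uniform grid.
std::vector<double> kz_samples(const std::vector<double>& kp,
                               double kx, double ky) {
    std::vector<double> kz;
    kz.reserve(kp.size());
    for (double k : kp)
        kz.push_back(std::sqrt(4.0 * k * k - kx * kx - ky * ky));
    return kz;
}
```

For a uniform $k_p$ grid the successive differences $k_{z,p+1} - k_{z,p}$ come out unequal, so a uniform inverse-FFT grid cannot be applied directly.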
Algorithm 1 3D RMA imaging
  • Input:
    • Echoes collected over the frequency band and the whole spatial observation plane, $s(x, y, k_p)$, $p = 1, 2, \ldots, P$, $x \in [-L_x/2, L_x/2]$, $y \in [-L_y/2, L_y/2]$;
  • Output: the imaging result $\hat{\sigma}(x, y, z)$.
    1: Take the spatially 2D FT of $s(x, y, k_p)$ along $x$ and $y$ to form the angular spectrum at each frequency,
    $$ S(k_x, k_y, k_p) = \iint_{x, y} s(x, y, k_p)\, e^{-j(k_x x + k_y y)}\, dx\, dy $$
    where $k_x$ and $k_y$ denote the wavenumber components corresponding to the X dimension and Y dimension, respectively.
    2: Apply phase compensation to the angular spectra based on the method of stationary phase,
    $$ S(k_x, k_y, k_z) = S(k_x, k_y, k_p)\, e^{j k_z R_0} $$
    where $k_z$ denotes the wavenumber component corresponding to the Z dimension, and $k_z = \sqrt{4k_p^2 - k_x^2 - k_y^2}$.
    3: Turn the non-uniform $S(k_x, k_y, k_z)$ into the uniform $\dot{S}(k_x, k_y, \dot{k}_z)$ by cubic spline interpolation, where $\dot{S}(k_x, k_y, \dot{k}_z)$ denotes the spectrum value at the position $(k_x, k_y, \dot{k}_z)$ and $\dot{k}_z$ denotes the desired uniform sampling position of the wavenumber component in the Z dimension.
    4: Take the 3D inverse FT of $\dot{S}(k_x, k_y, \dot{k}_z)$ to achieve the imaging,
    $$ \hat{\sigma}(x, y, z) = \frac{1}{(2\pi)^3} \iiint_{k_x, k_y, k_z} \dot{S}(k_x, k_y, \dot{k}_z)\, e^{j(k_x x + k_y y + k_z z)}\, dk_x\, dk_y\, dk_z $$
    5: return $\hat{\sigma}(x, y, z)$

2.2. Cubic Spline Interpolation

To discuss the interpolation, Algorithm 1 is discretized. The transceiver samples uniformly in the azimuth and height dimensions, and the $(m, n)$-th sampling position is denoted as $(x_m, y_n)$, where $x_m = -L_x/2 + (m-1)d_x$, $m = 1, 2, \ldots, M$, $x_m \in [-L_x/2, L_x/2]$; $y_n = -L_y/2 + (n-1)d_y$, $n = 1, 2, \ldots, N$, $y_n \in [-L_y/2, L_y/2]$; and $d_x$ and $d_y$ denote the sampling intervals in the azimuth and height dimensions, respectively. Then, stacking all the samples of $S(k_x, k_y, k_z)$ in Equation (3) gives
$$ \mathbf{S} = \begin{bmatrix} S(k_{x,1}, k_{y,1}, k_{z,(1,1,1)}) & \cdots & S(k_{x,1}, k_{y,1}, k_{z,(1,1,P)}) \\ \vdots & & \vdots \\ S(k_{x,M}, k_{y,1}, k_{z,(M,1,1)}) & \cdots & S(k_{x,M}, k_{y,1}, k_{z,(M,1,P)}) \\ \vdots & & \vdots \\ S(k_{x,M}, k_{y,N}, k_{z,(M,N,1)}) & \cdots & S(k_{x,M}, k_{y,N}, k_{z,(M,N,P)}) \end{bmatrix}_{MN \times P} $$
where $k_{z,(m,n,p)} = \sqrt{4k_p^2 - k_{x,m}^2 - k_{y,n}^2}$. The 1D interpolation of $\mathbf{S}$ is performed along the $k_z$ dimension. Let $\mathbf{s}_{m,n}$ be the column vector formed from the $(m + (n-1)M)$-th row, i.e., $\mathbf{s}_{m,n} = [S(k_{x,m}, k_{y,n}, k_{z,(m,n,1)}), \ldots, S(k_{x,m}, k_{y,n}, k_{z,(m,n,P)})]^T$. Since $k_x$ and $k_y$ are uniformly sampled and invariant within each row, they can be omitted from $\mathbf{s}_{m,n}$ for simplicity. Therefore, for each row, let $\mathbf{s} \in \mathbb{C}^{P \times 1}$ denote the column vector, $\mathbf{s} = [s_1, s_2, \ldots, s_P]^T$. The $p$-th element of $\mathbf{s}$ corresponds to the wavenumber-domain component at $k_{z,p}$, and $k_{z,p}$ is non-uniform. Let $\dot{\mathbf{s}} \in \mathbb{C}^{Q \times 1}$ denote the vector of points after interpolation, i.e., $\dot{\mathbf{s}} = [\dot{s}_1, \dot{s}_2, \ldots, \dot{s}_Q]^T$, where the $q$-th element of $\dot{\mathbf{s}}$ corresponds to the wavenumber-domain component at the desired $\dot{k}_{z,q}$, $\dot{k}_{z,q}$ is uniform, and $q = 1, 2, \ldots, Q$. It is worth noting that the original set $\{k_{z,p},\ p = 1, 2, \ldots, P\}$ differs from row to row, since $k_{z,p}$ depends on the particular $k_{x,m}$ and $k_{y,n}$ of the $(m + (n-1)M)$-th row; however, the positions to be interpolated are fixed and denoted by the common set $\{\dot{k}_{z,q},\ q = 1, 2, \ldots, Q\}$. The cubic-spline equation at the $q$-th point to be interpolated can be constructed in the following form [18,19]:
$$ \dot{s}_q = a_p + b_p(\dot{k}_{z,q} - k_{z,p}) + c_p(\dot{k}_{z,q} - k_{z,p})^2 + d_p(\dot{k}_{z,q} - k_{z,p})^3 $$
where $k_{z,p} < \dot{k}_{z,q} < k_{z,p+1}$, and $a_p$, $b_p$, $c_p$, and $d_p$ are the zeroth-, first-, second-, and third-order coefficients of $(\dot{k}_{z,q} - k_{z,p})$, respectively. The cubic spline function is shown in Algorithm 2 [18,19,20].
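The row layout described above can be captured by a one-line index map (our own helper, using the text's 1-based convention):

```cpp
// Row of the (m, n)-th spatial sample in the stacked MN x P matrix S:
// row = m + (n - 1) * M, with m = 1..M, n = 1..N.
int row_index(int m, int n, int M) { return m + (n - 1) * M; }
```

Each such row holds the P non-uniform $k_z$ samples that one spline interpolation operates on.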

2.3. GPU-Only Method

The scheme for computing the cubic spline interpolation in a GPU-only platform is shown in Figure 3.
In Algorithm 2, although the computations of all the $M \times N$ vectors $\mathbf{s}$ satisfy the requirements for program parallelization on the GPU platform, the straightforward migration of the cubic spline interpolation to the GPU costs a lot of time. This is because the kernel functions need to find the points $k_{z,i}$ adjacent to the interpolation point $\dot{k}_{z,q}$ and the corresponding spectrum value $s_i$. In particular, in step 7, the kernel functions need to find the address of each $\dot{s}_q$ in advance. This traditional parallel design requires the GPU kernel functions to run through the entire interpolation process and increases the workload of the video random-access memory (RAM). It also leaves the CPU standing by for a long time, which is not in line with efficient use of the hardware. Although GPUs have a large number of cores, each core's structure is too simple to match a CPU for single instruction single data (SISD) processing.

2.4. The Hybrid GPU and CPU Acceleration Method

Since the structural characteristics of the CPU make it better suited to SISD processing, we use the CPU to perform steps 1-3 of Algorithm 2, which fit the SISD computation model, and move the other steps onto the GPU. In step 7, when the kernel functions interpolate each row of the echo data $\mathbf{S}$ in a parallel scheme, the GPU-only method incurs additional waiting time for the kernel functions to find each $\dot{s}_q$. The proposed method instead exploits the sequential addressing of $\dot{s}_q$ by the uniform positions $\dot{k}_{z,q}$, which avoids the extra waiting time for different kernel functions to find $\dot{s}_q$ and thus speeds up the whole program. The optimized schematic block diagram is shown in Figure 4.
Algorithm 2 Cubic spline function
  • Input:
    • The originally non-uniform sampling positions in the $k_z$ dimension, i.e., $k_{z,p}$, $p = 1, 2, \ldots, P$; the 1D column vector to be interpolated, $\mathbf{s} \in \mathbb{C}^{P \times 1}$; the desired uniform positions $\dot{k}_{z,q}$, $q = 1, 2, \ldots, Q$;
  • Output: the interpolated vector $\dot{\mathbf{s}}$.
    1: Calculate the $k_z$-dimension differences, denoted $h_i$, $i = 1, 2, \ldots, P-1$:
    $$ h_i = k_{z,i+1} - k_{z,i} $$
    2: Construct the tridiagonal matrix $\mathbf{H}$ from the obtained $h_i$:
    $$ \mathbf{H} = \begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ h_1 & 2(h_1 + h_2) & h_2 & & \vdots \\ & \ddots & \ddots & \ddots & \\ \vdots & & h_{P-2} & 2(h_{P-2} + h_{P-1}) & h_{P-1} \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix} $$
    3: Decompose $\mathbf{H}$ using Gaussian elimination to obtain the lower triangular matrix $\mathbf{L}$ and the upper triangular matrix $\mathbf{U}$.
    4: Calculate the vector $\mathbf{g}$ based on the following formula:
    $$ \mathbf{g} = 6 \left[\, 0,\ \frac{s_3 - s_2}{h_2} - \frac{s_2 - s_1}{h_1},\ \frac{s_4 - s_3}{h_3} - \frac{s_3 - s_2}{h_2},\ \ldots,\ \frac{s_P - s_{P-1}}{h_{P-1}} - \frac{s_{P-1} - s_{P-2}}{h_{P-2}},\ 0 \,\right]^T $$
    5: Construct the vector $\mathbf{w} = [w_1, w_2, \ldots, w_P]^T$, where $w_p$ denotes the second derivative of $s_p$, i.e., $w_p = s_p''$:
    $$ \mathbf{w} = \mathbf{U}^{-1} \mathbf{L}^{-1} \mathbf{g} $$
    6: Determine the coefficients in Equation (6) from the obtained $h_i$ and $w_i$ based on the following equations:
    $$ a_i = s_i, \quad b_i = \frac{s_{i+1} - s_i}{h_i} - \frac{h_i w_i}{2} - \frac{h_i (w_{i+1} - w_i)}{6}, \quad c_i = \frac{w_i}{2}, \quad d_i = \frac{w_{i+1} - w_i}{6 h_i} $$
    7: Estimate the spectrum $\dot{s}_q$ corresponding to $\dot{k}_{z,q}$, based on Equation (6):
    $$ \dot{s}_q = a_i + b_i(\dot{k}_{z,q} - k_{z,i}) + c_i(\dot{k}_{z,q} - k_{z,i})^2 + d_i(\dot{k}_{z,q} - k_{z,i})^3 $$
    where $k_{z,i} < \dot{k}_{z,q} < k_{z,i+1}$.
    8: return $\dot{\mathbf{s}} = [\dot{s}_1, \dot{s}_2, \ldots, \dot{s}_Q]^T$
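As a reference point before any GPU mapping, Algorithm 2 can be sketched as a single-threaded C++ routine (a hedged, real-valued sketch under natural boundary conditions with 0-based arrays; a complex row would pass its real and imaginary parts through it separately):

```cpp
#include <cmath>
#include <vector>

// Natural cubic spline through (kz[p], s[p]), p = 0..P-1, evaluated at
// the positions kzq, following steps 1-7 of Algorithm 2. The tridiagonal
// system H w = g is solved by forward/backward substitution, i.e. the
// LU decomposition of steps 3 and 5 (Thomas algorithm).
std::vector<double> cubic_spline(const std::vector<double>& kz,
                                 const std::vector<double>& s,
                                 const std::vector<double>& kzq) {
    const int P = (int)kz.size();
    std::vector<double> h(P - 1);
    for (int i = 0; i + 1 < P; ++i) h[i] = kz[i + 1] - kz[i];   // step 1

    // Step 2: diagonals of H; rows 0 and P-1 encode w[0] = w[P-1] = 0.
    std::vector<double> diag(P, 1.0), sup(P, 0.0), g(P, 0.0);
    for (int i = 1; i + 1 < P; ++i) {
        diag[i] = 2.0 * (h[i - 1] + h[i]);
        sup[i] = h[i];
        g[i] = 6.0 * ((s[i + 1] - s[i]) / h[i] - (s[i] - s[i - 1]) / h[i - 1]);
    }

    // Steps 3 and 5: LU factorization and solve.
    std::vector<double> u(P), q(P), w(P, 0.0);
    u[0] = diag[0]; q[0] = 0.0;
    for (int i = 1; i + 1 < P; ++i) {
        double l = h[i - 1] / u[i - 1];      // sub-diagonal multiplier
        u[i] = diag[i] - l * sup[i - 1];
        q[i] = g[i] - l * q[i - 1];
    }
    for (int i = P - 2; i >= 1; --i)         // back substitution
        w[i] = (q[i] - sup[i] * w[i + 1]) / u[i];

    // Steps 6 and 7: coefficients, Eqs. (11)-(14), and evaluation, Eq. (15).
    std::vector<double> out(kzq.size());
    for (std::size_t j = 0; j < kzq.size(); ++j) {
        int i = 0;
        while (i + 2 < P && kzq[j] > kz[i + 1]) ++i;   // locate interval
        double a = s[i];
        double b = (s[i + 1] - s[i]) / h[i] - h[i] * w[i] / 2.0
                   - h[i] * (w[i + 1] - w[i]) / 6.0;
        double c = w[i] / 2.0;
        double d = (w[i + 1] - w[i]) / (6.0 * h[i]);
        double t = kzq[j] - kz[i];
        out[j] = a + b * t + c * t * t + d * t * t * t;
    }
    return out;
}
```

A natural spline reproduces linear data exactly, which gives a cheap sanity check on the whole pipeline.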
The pseudocode is shown in Algorithm 3:
Algorithm 3 Hybrid CPU-GPU pseudocode of cubic spline interpolation
  • Input:
    • The originally non-uniform sampling positions in the $k_z$ dimension, i.e., $k_{z,p}$, $p = 0, 1, \ldots, P-1$; the 1D column vector to be interpolated, $\mathbf{s} \in \mathbb{C}^{P \times 1}$; the desired uniform positions $\dot{k}_{z,q}$, $q = 0, 1, \ldots, Q-1$;
  • Output:
    1: for $p \in [0 : P-2]$ do
    2:   $h[p] = k_{z,p+1} - k_{z,p}$ ⟵ calculation of the wavenumber-domain steps, i.e., Equation (7)
    3: end for
    4: for $p \in [1 : P-3]$ do
    5:   $H_1[p] = h[p-1]$
    6:   $H_2[p] = 2 \cdot (h[p-1] + h[p])$
    7:   $H_3[p] = h[p]$ ⟵ calculation of the three diagonals of the tridiagonal matrix $\mathbf{H}$, i.e., Equation (8)
    8: end for
    9: $H_1[P-3] = 0.0$, $H_2[0] = H_2[P-1] = 1.0$, $H_3[0] = 0.0$
    10: $U[0] = H_2[0]$
    11: for $p \in [1 : P-4]$ do
    12:   $L[p] = H_1[p] / U[p-1]$
    13:   $U[p] = H_2[p] - H_3[p-1] \cdot L[p]$
    14: end for
    15: $L[0] = H_1[P-4] / U[P-5]$
    16: Use the function cudaMemcpyToSymbol to send $L$, $U$, $h$ to constant memory.
    17: $id \leftarrow blockDim.x \cdot blockIdx.x + threadIdx.x$ (GPU running part)
    18: $g[id] = 6 \cdot \left( \dfrac{s[id+1] - s[id]}{h[id]} - \dfrac{s[id] - s[id-1]}{h[id-1]} \right)$ ⟵ computing the array $\mathbf{g}$, i.e., Equation (9)
    19: $Uw[id] = g[id] - L[id] \cdot Uw[id-1]$ ⟵ computing the array $Uw$, an intermediate step in the computation of the array $\mathbf{w}$
    20: $w[P-4] = Uw[P-4] / U[P-4]$
    21: $w[id] = (Uw[id] - H_3[id] \cdot w[id+1]) / U[id]$ ⟵ computing the array $\mathbf{w}$, i.e., Equation (10)
    22: $w[id] = w[id-1]$, $w[0] = 0.0$, $w[P-2] = 0.0$ ⟵ pad both ends of the array $\mathbf{w}$ with 0
    23: $a[id] = s[id]$
    24: $b[id] = \dfrac{s[id+1] - s[id]}{h[id]} - \dfrac{h[id] \cdot w[id]}{2} - \dfrac{h[id] \cdot (w[id+1] - w[id])}{6}$
    25: $c[id] = w[id] / 2$
    26: $d[id] = \dfrac{w[id+1] - w[id]}{6 \cdot h[id]}$ ⟵ calculation of the spline-curve coefficients, i.e., Equations (11) to (14)
    27: if $k_{z,id} < \dot{k}_{z,q} < k_{z,id+1}$ then
    28:   $\dot{s}_q = a[id] + b[id] \cdot (\dot{k}_{z,q} - k_{z,id}) + c[id] \cdot (\dot{k}_{z,q} - k_{z,id})^2 + d[id] \cdot (\dot{k}_{z,q} - k_{z,id})^3$ ⟵ evaluation of the spline at the interpolation point, i.e., Equation (15)
    29: end if
    30: return $\dot{\mathbf{s}} = [\dot{s}_0, \dot{s}_1, \ldots, \dot{s}_{Q-1}]^T$
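The payoff of the CPU-side lines 10-15 is that the tridiagonal factors are computed once and then reused by every one of the $MN$ rows for its own right-hand side $\mathbf{g}$. A host-side C++ sketch of that factor-once, solve-many pattern (our own array layout; in the paper the factors would live in constant memory):

```cpp
#include <cmath>
#include <vector>

// One-time LU factorization of a tridiagonal matrix given by its three
// diagonals (sub, diag, sup), mirroring lines 10-15 of Algorithm 3.
struct TriLU {
    std::vector<double> L;    // sub-diagonal multipliers
    std::vector<double> U;    // pivots
    std::vector<double> sup;  // super-diagonal, needed for back substitution
};

TriLU factor_tridiag(const std::vector<double>& sub,
                     const std::vector<double>& diag,
                     const std::vector<double>& sup) {
    const int n = (int)diag.size();
    TriLU f{std::vector<double>(n, 0.0), std::vector<double>(n, 0.0), sup};
    f.U[0] = diag[0];
    for (int i = 1; i < n; ++i) {
        f.L[i] = sub[i] / f.U[i - 1];
        f.U[i] = diag[i] - f.L[i] * sup[i - 1];
    }
    return f;
}

// Per-row solve w = U^{-1} L^{-1} g (step 5 of Algorithm 2). Only this
// part runs once per row of S; the factors L and U are shared.
std::vector<double> solve_tridiag(const TriLU& f, std::vector<double> g) {
    const int n = (int)g.size();
    for (int i = 1; i < n; ++i) g[i] -= f.L[i] * g[i - 1];   // forward
    std::vector<double> w(n);
    w[n - 1] = g[n - 1] / f.U[n - 1];
    for (int i = n - 2; i >= 0; --i)
        w[i] = (g[i] - f.sup[i] * w[i + 1]) / f.U[i];        // backward
    return w;
}
```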
The pseudocode corresponds to Algorithm 2. Steps 1 to 3 of Algorithm 2 are calculated on the CPU, which transfers the matrices $\mathbf{L}$ and $\mathbf{U}$ to the video memory after completing step 3. In general, this task can be approached with different memory mechanisms: global memory, shared memory, texture memory, and constant memory. To avoid the time cost that access conflicts impose on the overall cubic spline interpolation, the matrices $\mathbf{L}$ and $\mathbf{U}$ are stored in constant memory.
Generally speaking, main memory is much larger than video memory. When CUDA kernel functions are executed, the data are first transferred from main memory to video memory, which takes data-transfer time; the data are then read from video memory and processed by multiple threads, which takes computation time. The total execution time of the algorithm is therefore the sum of the data-transfer time and the computation time, and it is necessary to reduce the data-transfer time before running the signal processing. For large amounts of data, it is not possible to transfer the contents of the CPU buffer to the GPU all at once, so the data must be transferred in chunks. If kernel-function computation on the GPU is to proceed at the same time as data transfer, streams can be introduced for asynchronous parallel processing of the data to improve computational performance.
Therefore, in our close-range 3D imaging, the proposed method in this paper utilizes asynchronous parallelism for data processing, which is different from the common serial execution method. Actually, the implementation of the asynchronous parallel scheme is shown in Figure 5, which compares the running time of both the serial execution and the 4-stream asynchronous parallel execution. As Figure 5 shows, 4 streams are created in the GPU, and the data transferred from the CPU is equally distributed to each stream. In such a way, the data transfer and processing among different streams would not interfere with each other. This allows the data interaction between memory and video memory to be executed in parallel with the computation of kernel functions. This approach ensures that the GPU core is busy most of the time while effectively alleviating the drawback of a long time for data transfer between the memory and video memory.
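On hardware without CUDA, the chunking logic of Figure 5 can still be sketched host-side; here `std::thread` stands in for the four streams and doubling each sample stands in for the kernel (purely illustrative; the real implementation would use cudaMemcpyAsync and kernel launches on per-stream queues):

```cpp
#include <thread>
#include <vector>

// Split the input into nchunks equal pieces and let each "stream"
// handle its piece independently, as in the 4-stream scheme of Figure 5.
std::vector<float> process_in_chunks(const std::vector<float>& in, int nchunks) {
    std::vector<float> out(in.size());
    std::vector<std::thread> streams;
    const std::size_t chunk = in.size() / nchunks;
    for (int sidx = 0; sidx < nchunks; ++sidx) {
        std::size_t lo = sidx * chunk;
        std::size_t hi = (sidx == nchunks - 1) ? in.size() : lo + chunk;
        streams.emplace_back([&in, &out, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                out[i] = 2.0f * in[i];   // stand-in for the per-chunk kernel
        });
    }
    for (auto& t : streams) t.join();    // wait for all chunks
    return out;
}
```

Because the chunks are disjoint, the four workers never touch the same elements, matching the observation above that transfers and kernels in different streams do not interfere with each other.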
Because constant memory is broadcast to a half-warp of threads, the GPU reads constant memory much faster than global memory: a single fetch can serve all 16 threads of a half-warp, saving a theoretical 15/16 = 93.75% of the read time. The remaining steps 4-7 are performed on the GPU. The 1D data in each row meet the parallelism requirement and can be processed in parallel by the GPU's single instruction multiple data (SIMD) processing capability.

3. Experimental Results and Analysis

Simulation experimental configuration: Intel(R) Core(TM) i5-7400 CPU @ 3.00 GHz; 64-bit operating system; 16 GB memory (RAM); one NVIDIA GTX 1050 discrete graphics card with 4 GB GDDR5; one PNA network analyzer. The measurement parameters used in the experiments are listed in Table 1. The single-point simulation scenarios and results are shown in Figure 6.
To verify the correctness of the results, the 3D RMA close-range imaging algorithm is implemented with the CPU-only method, the GPU-only method, and our proposed method, respectively. In this paper, MATLAB, a commercial mathematical software package from MathWorks, is used as the imaging display platform. According to the simulation results in Figure 6, the differences among these three methods are quite small. Next, actual imaging experiments are performed with the same parameters as the simulation model. The experimental scenarios and results are shown in Figure 7 and Figure 8. The comparison in Figure 8 clearly shows that the imaging algorithm can be implemented properly on the different platforms, which confirms the feasibility of our proposed approach.
The amplitude-level comparison among the methods is shown in Figure 9, where the 3D surfaces are created using MATLAB's built-in mesh function so that the energy distribution of the main view in Figure 8 can be visualized. The imaging results calculated by the methods show no significant differences; the absolute error is only about $1.11628 \times 10^{-6}$, which confirms that the proposed method meets the functional requirements.
Figure 10 depicts the time consumption by different platforms with various data sizes. Table 2 summarizes the speedup ratios of the conventional GPU-only method and the hybrid CPU+GPU method compared to the CPU-only method.
Since GPU programs have a start-up overhead, the benefits of GPU-based parallel acceleration are only realized when the amount of data is large enough. From Table 2, we can see that the traditional GPU-only parallel acceleration method achieves a speedup ratio of more than 8 times over the CPU-only method. The hybrid CPU + GPU method raises the speedup ratio to more than 15 times, and the results demonstrate that the larger the data size, the better the acceleration the method can achieve.

4. Conclusions

In this paper, we conduct an in-depth study of hardware optimization for the 3D RMA implementation on the CUDA platform with a programmable GPU and elaborate a hybrid GPU and CPU strategy to achieve parallel computing. Especially for the interpolation stage, which has the greatest influence on the imaging time efficiency, this paper uses four streams to optimize the data transfer, selects different kinds of video memory for storage according to the data characteristics, optimizes the matrix storage method, and accomplishes an effective execution of the RMA. The results obtained with an NVIDIA GTX 1050 graphics card demonstrate that the calculation speed of MMW 3D close-range imaging based on our acceleration optimization is greatly improved compared to both the CPU-only platform and the traditional GPU-only method. Lastly, we accompany our contribution with the full source code of a working prototype (the code and explanatory notes can be accessed via the following URL: https://github.com/miao3rd/miao_c.git accessed on 24 January 2023).

Author Contributions

Conceptualization, Z.D. and L.D.; methodology, Z.D. and L.D.; software, Z.D.; validation, Z.D. and H.H.; formal analysis, Z.D. and L.D.; investigation, Z.D.; resources, L.D. and Q.Z.; data curation, Z.D.; writing—original draft preparation, Z.D. and L.D.; writing—review and editing, H.H. and Q.Z.; visualization, Z.D.; supervision, Q.Z.; project administration, L.D. and Q.Z.; funding acquisition, L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shanghai Science and Technology Development Foundation (21ZR443600), the National Natural Science Foundation of China (12105177) and Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX), PR China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lorente, D.; Limbach, M.; Gabler, B.; Esteban, H.; Boria, V.E. Sequential 90° rotation of dual-polarized antenna elements in linear phased arrays with improved cross-polarization level for airborne synthetic aperture radar applications. Remote Sens. 2021, 13, 1430.
  2. Liu, C.; Chen, Z.; Yun, S.; Chen, J.; Hasi, T.; Pan, H. Research advances of SAR remote sensing for agriculture applications: A review. J. Integr. Agric. 2019, 18, 506–525.
  3. Alibakhshikenari, M.; Virdee, B.S.; Limiti, E. Wideband planar array antenna based on SCRLH-TL for airborne synthetic aperture radar application. J. Electromagn. Waves Appl. 2018, 32, 1586–1599.
  4. Li, J.; Song, L.; Liu, C. The cubic trigonometric automatic interpolation spline. IEEE/CAA J. Autom. Sin. 2017, 5, 1136–1141.
  5. Liu, J.; Qiu, X.; Huang, L.; Ding, C. Curved-path SAR geolocation error analysis based on BP algorithm. IEEE Access 2019, 7, 20337–20345.
  6. Miao, X.; Shan, Y. SAR target recognition via sparse representation of multi-view SAR images with correlation analysis. J. Electromagn. Waves Appl. 2019, 33, 897–910.
  7. Kim, B.; Yoon, K.S.; Kim, H.-J. GPU-Accelerated Laplace Equation Model Development Based on CUDA Fortran. Water 2021, 13, 3435.
  8. Yin, Q.; Wu, Y.; Zhang, F.; Zhou, Y. GPU-based soil parameter parallel inversion for PolSAR data. Remote Sens. 2020, 12, 415.
  9. Cui, Z.; Quan, H.; Cao, Z.; Xu, S.; Ding, C.; Wu, J. SAR target CFAR detection via GPU parallel operation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 4884–4894.
  10. Liu, G.; Yang, W.; Li, P.; Qin, G.; Cai, J.; Wang, Y.; Wang, S.; Yue, N.; Huang, D. MIMO Radar Parallel Simulation System Based on CPU/GPU Architecture. Sensors 2022, 22, 396.
  11. Gou, L.; Li, Y.; Zhu, D. A real-time algorithm for circular video SAR imaging based on GPU. Radar Sci. Technol. 2019, 17, 550–556.
  12. Liu, J.; Yang, B.; Su, Y.; Liu, P. Fast Context-Adaptive Bit-Depth Enhancement via Linear Interpolation. IEEE Access 2019, 7, 59403–59412.
  13. Carnicer, J.M.; Khiar, Y.; Peña, J.M. Inverse central ordering for the Newton interpolation formula. Numer. Algorithms 2022, 90, 1691–1713.
  14. Chand, A.; Kapoor, G. Cubic spline coalescence fractal interpolation through moments. Fractals 2007, 15, 41–53.
  15. Wang, Z.; Guo, Q.; Tian, X.; Chang, T.; Cui, H.-L. Near-Field 3-D Millimeter-Wave Imaging Using MIMO RMA with Range Compensation. IEEE Trans. Microw. Theory Tech. 2019, 67, 1157–1166.
  16. Sheen, D.; McMakin, D.; Hall, T. Near-field three-dimensional radar imaging techniques and applications. Appl. Opt. 2010, 49, E83–E93.
  17. Tan, W.; Huang, P.; Huang, Z.; Qi, Y.; Wang, W. Three-dimensional microwave imaging for concealed weapon detection using range stacking technique. Int. J. Antennas Propag. 2017, 2017, 1480623.
  18. Kapoor, G.P.; Prasad, S.A. Convergence of Cubic Spline Super Fractal Interpolation Functions. Fractals 2012, 22, 218–226.
  19. Abdulmohsin, H.A.; Wahab, H.B.A.; Hossen, A.M.J.A. A Novel Classification Method with Cubic Spline Interpolation. Intell. Autom. Soft Comput. 2022, 31, 339–355.
  20. Viswanathan, P.; Chand, A.; Agarwal, R.P. Preserving convexity through rational cubic spline fractal interpolation function. J. Comput. Appl. Math. 2014, 263, 262–276.
Figure 1. Close-range imaging.
Figure 2. The wavenumber domain at a certain height.
Figure 3. Cubic spline interpolation on the traditional GPU-only platform.
Figure 4. The proposed method.
Figure 4. The proposed method.
Figure 5. Asynchronous parallel execution and serial execution.
Figure 5. Asynchronous parallel execution and serial execution.
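Figure 5 contrasts asynchronous parallel execution with serial execution: the CPU-side preparation of the common interpolation positions can run concurrently with the GPU-side transfer and computation, so the total latency approaches the longer of the two stages rather than their sum. A minimal host-side analogue of this overlap, sketched with two Python worker threads standing in for the CPU task and the GPU stream (the function names and workloads below are illustrative, not the paper's code):

```python
from concurrent.futures import ThreadPoolExecutor

def prepare_positions(n):
    # CPU-side stage: the common interpolation positions shared by all rows
    return [i / n for i in range(n)]

def transfer_and_compute(data):
    # stand-in for the GPU-side transfer and kernel work
    return [2 * x for x in data]

def run_overlapped(data, n):
    # Launch both independent stages at once, mirroring the asynchronous
    # timeline of Figure 5; the serial alternative would run them
    # back-to-back and pay the sum of both stage times.
    with ThreadPoolExecutor(max_workers=2) as pool:
        positions = pool.submit(prepare_positions, n)
        results = pool.submit(transfer_and_compute, data)
        return positions.result(), results.result()
```

In the actual implementation this overlap is obtained with CUDA streams rather than host threads; the sketch only illustrates the scheduling idea.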
Figure 6. Azimuth amplitude results. (a) is the result of the CPU-only implementation; (b) is the result of the GPU-only implementation; (c) is the result of the proposed method.
Figure 7. Experimental scene.
Figure 7. Experimental scene.
Figure 8. Imaging results comparison. (a–c) are the front, top and side views obtained by the CPU-only method; (d–f) are those obtained by the GPU-only method; (g–i) are those obtained by the proposed method.
Figure 9. Amplitude-level comparison. (a–c) are the imaging target energy maps obtained by the CPU-only method, the GPU-only method and the proposed method, respectively.
Figure 10. Comparison of the time consumption by different platforms.
Figure 10. Comparison of the time consumption by different platforms.
Table 1. Measurement parameters used in the simulation and the experiment.

Parameter | Value
Center frequency | 92.5 GHz
Frequency bandwidth | 35 GHz
Sweeping frequency points | 201
Azimuth-dimension sample number | 161
Azimuth-dimension sampling interval | 1.5 mm
Azimuth-dimension aperture length | 0.24 m
Height-dimension sample number | 161
Height-dimension sampling interval | 1.5 mm
Height-dimension aperture length | 0.24 m
Antenna beamwidth | 30°
Antenna-to-target distance | 0.3 m
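As a quick sanity check on the parameters of Table 1, the textbook close-range SAR resolution formulas (standard results, not quoted from this excerpt) give a range resolution of roughly 4.3 mm and a cross-range resolution of roughly 2.0 mm:

```python
# Theoretical resolutions implied by Table 1's parameters.
# Formulas are the standard ones: delta_r = c / (2B) for range,
# delta_x ~ lambda * R / (2L) for cross-range.
C = 3e8             # speed of light, m/s
B = 35e9            # frequency bandwidth, Hz
FC = 92.5e9         # center frequency, Hz
R = 0.3             # antenna-to-target distance, m
L = 0.24            # azimuth aperture length, m

range_resolution = C / (2 * B)                   # ~4.3 mm
wavelength = C / FC                              # ~3.2 mm
azimuth_resolution = wavelength * R / (2 * L)    # ~2.0 mm
```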
Table 2. Speedup ratio of the traditional GPU-only method and the hybrid CPU + GPU method compared to the CPU-only method.

Method | 128 × 128 | 192 × 192 | 256 × 256 | 320 × 320
Traditional GPU-only method | 8.33 | 9.16 | 9.17 | 9.34
Hybrid CPU + GPU method | 15.20 | 17.02 | 17.57 | 18.30
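The interpolation stage being accelerated in both rows of Table 2 is the RMA's Stolt resampling: each wavenumber-domain row, sampled on a nonuniform kz grid, is cubic-spline interpolated onto one common uniform grid, which in the hybrid scheme is computed once by the CPU and broadcast to the GPU. A minimal NumPy/SciPy reference sketch of this step (illustrative names; not the paper's CUDA implementation):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def stolt_resample(rows, kz_grids, kz_uniform):
    """Cubic-spline resampling of the RMA's Stolt step.

    rows       : (N, M) wavenumber-domain samples, one row per kx
    kz_grids   : (N, M) nonuniform kz positions of those samples
    kz_uniform : (K,) common output grid -- in the hybrid scheme this
                 grid is precomputed once on the CPU and reused for
                 every row, avoiding repeated logic-heavy index searches
    """
    out = np.empty((rows.shape[0], kz_uniform.size), dtype=rows.dtype)
    for i in range(rows.shape[0]):
        # fit a cubic spline on the nonuniform grid, evaluate uniformly
        out[i] = CubicSpline(kz_grids[i], rows[i])(kz_uniform)
    return out
```

On the GPU, the per-row spline fits run in parallel while the shared kz_uniform grid sits in constant memory for broadcast, which is the division of labor the speedups in Table 2 reflect.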
Ding, L.; Dong, Z.; He, H.; Zheng, Q. A Hybrid GPU and CPU Parallel Computing Method to Accelerate Millimeter-Wave Imaging. Electronics 2023, 12, 840. https://doi.org/10.3390/electronics12040840
