GPU-Based Parallel Implementation of VLBI Correlator for Deep Space Exploration System

: Very Long Baseline Interferometry (VLBI) solution can yield accurate information of angular position, and has been successfully used in the ﬁeld of deep space exploration, such as astrophysics, imaging, detector positioning, and so on. The increase in VLBI data volume puts higher demands on efﬁcient processing. Essentially, the main step of VLBI is the correlation processing, through which the angular position can be calculated. Since the VLBI correlation processing is both computation-intensive and data-intensive, the CPU cluster is usually employed in practical application to perform complex distributed computation. In this paper, we propose a parallel implementation of VLBI correlator based on graphics processing unit (GPU) to realize a more efﬁcient and economical angular position calculation of deep space target. On the basis of massively GPU parallel computing, the coalesced access strategy and the parallel pipeline strategy are introduced to further accelerate the VLBI correlator. Experimental results show that the optimized GPU-based VLBI method can meet the real-time processing requirements of the received data stream. Compared with the sequential method, the proposed approach can reach a 224.1 × calculation speedup, and a 36.8 × application speedup. Compared with the multi-CPUs method, it can achieve 28.6 × calculation speedup and 4.7 × application speedup.


Introduction
The computational demands of scientific research are constantly increasing. In the field of radio astronomy, observation has evolved from the single telescope to the interferometer arrays, which is currently under development. At the same time, the Very Long Baseline Interferometry (VLBI) technology, which increases the accuracy of observation by extending the distance between the stations, has also been rapidly developed. They are all developed based on interferometric technology, but the data processing methods of them are different.
VLBI plays a key role in deep space exploration and astronomical observation due to the capability of high accurate angular measurement. The technology performs correlation calculations on the observation data of multiple radio telescopes, thus synthesizing multiple telescopes into a synthetic aperture telescope with an equivalent diameter of the longest baseline length [1,2]. Until now, the VLBI has been widely used in many fields, such as geodetic survey, astrophysics, detector positioning, and so on [3,4]. With the increase of VLBI stations and the observation bandwidth, there is a strong demand for fast correlation processing [5,6].
For the interferometric technology, the most important part is the design of correlators. In the past few decades, many correlators based on dedicated hardware have been devel-oped. Among various dedicated hardware, application specific integrated circuit (ASIC), digital signal processing (DSP) and field-programmable gate arrays (FPGA) are usually employed in the development of VLBI correlators. A. R. Whitney et al. [7] implemented a VLBI system based on the the custom-VLSI chip and the DSP. Compared with ASIC, FPGA has a shorter design cycle, lower development cost and more flexible design. Rurik A. Primiani et al. [8] designed a new VLBI correlator, which replaced the previously-coined ASIC correlator. Hitoshi Kiuchi et al. [9] proposed six preset correlation processing system on the basis of FPGA. In order to avoid programming, testing and debugging issues in the traditional FPGA development process, Gan et al. [10] adopted the OpenCL tool flow to convert the OpenCL kernel function into customized FPGA hardware accelerator files, automatically. DSP is more suitable for processing computing tasks, and the FPGA is better for logic operations. Shanghai Astronomical Observatory (SHAO) developed the third-generation Chinese VLBI Data Acquisition System (CDAS), and then equipped the DSP board with Xilinx XC7K480T FPGA for data processing [11]. The joint research activity in the RadioNet FP7 Programme designed a generic high-performance computing platform based on DSP and FPGA for radio astronomy, named UniBoard [12]. Although the methods based on dedicated chips can effectively implement the VLBI correlator, their economic costs and scalability are still insufficient.
A recent trend is to correlate in software instead of dedicated hardware. With the rapid development of high-performance computing technology [13][14][15][16], the correlators have begun to be developed on the general CPU platform. In the earlier work, VLBI software correlators are implemented based on a single CPU [17], e.g., the VLBI correlator from National Radio Astronomy Observatory (NRAO) was developed based on an IBM 360/50 [18]. Furthermore, the VLBI correlators are implemented on CPU clusters to achieve higher efficiency, such as the Chinese VLBI Network (CVN) Earth software developed by Shanghai Astronomical Observatory of the Chinese Academy of Sciences [19] and DiFX VLBI correlator proposed by Swinburne University of Technology [20]. Extensive experiments verify that the CPU cluster can realize real-time VLBI data processing at a medium data rate, and it is easier to design and develop compared with the dedicated chips.
The emergence of the graphics processing unit (GPU) and the compute unified device architecture (CUDA) provide new opportunities for accelerating VLBI correlators [21,22]. GPU has been developed as a high-performance device with the characteristics of high parallelism, multithreading, many-cores, huge bandwidth capacity, and hundreds of computing cells. But they only discussed how to deploy the algorithm on heterogeneous platforms and analyze the processing results of the software. In recent years, for the interferometric array observation technology, some correlators have been accomplished on the GPU. Thomas Hobiger et al. [23] discussed the feasibility of implementing correlator on GPU. The Cobalt project implemented a GPU-based correlator for LOFAR [24]. However, the GPU-based parallel acceleration method of VLBI algorithm has not been studied yet. Besides, GPU has been widely used in the field of remote sensing for big data processing [25][26][27][28]. Inspired by the remote sensing applications and correlators of interferometric array observation technology on GPU, we try to implement a high efficient GPU-based VLBI correlator considering hierarchical optimization strategies.
In order to meet real-time processing at high data rate, we propose a parallel implementation of VLBI algorithm on the CPU-GPU heterogeneous platform. To solve the computation-intensive issue, the VLBI algorithm is implemented in GPU massively parallel mode, and is designed to realize the fine-grained task partitioning and calculation. Further, to deal with the data-intensive problem, the multi-level GPU optimization strategies have been introduced to improve the efficiency of data access and processing pipeline. Finally, the proposed method has been evaluated by the actual observation data. It is proved that the GPU implementation can meet the demands of high data rate real-time calculation, and is superior to the serial CPU method and multi-core CPU method.
In all, compared with previous works, we make the following contributions.
(1) GPU parallelization of the VLBI algorithm is presented. According to the characteristics of the GPU, multi-point calculation tasks are allocated to GPU threads to make full use of the computing capability of the threads and the hardware resources of the GPU. (2) Shared memory is used to solve the uncoalescing problem encountered when accessing global memory. (3) According to the characteristics of VLBI data, CUDA stream is used to optimize the data transmission between CPU and GPU.
The remainder of this paper is organized as follows. Section II presents the calculation flow of the VLBI algorithm and analyzes the algorithm complexity. Section III gives the details of the proposed GPU-based parallel VLBI method. Experimental results are shown and analyzed in Section IV. Conclusions and perspectives are given in Section V.

Methodology
In this section, the principle and implementation of VLBI correlator are introduced specifically. Meanwhile, the analysis of the algorithm complexity is also briefly analyzed.

The Principle of VLBI
The physical process of VLBI system is shown in Figure 1. The two ground stations simultaneously receive signals from the same observation target. Then, each station sends the observation data to the VLBI correlator, which performs correlation calculations to obtain the residual delay. Finally, the residual delay and estimated delay are employed to calculate the angular position of the observed target (i.e., θ). Through the above accurate coherent processing, sub-millisecond spatial resolution and nanosecond delay estimation can be obtained.

The Physical Process of the VLBI
Because the distance between the observation target and the earth is long enough, the signals received by the station can be approximately regarded as the plane waves. In addition, due to the different geographic locations, the stations will receive the same wavefront signal at different time points. Therefore, there will be a delay between the two stations when receiving the same signal. According to the geometric relationship in Figure 1, the delay τ can be calculated as the following: where B is the baseline length between the two stations, c is the speed of light, and θ is the intersection angle between the direction of the observation target and the baseline, that is, the angular position information of the target. Accordingly, the orbital position of the target can be determined with θ. Therefore, τ needs to be calculated for obtaining the angular position of the observation target.
By calculating the derivative with respect to θ in Equation (1), the accuracy expression of the angular position θ can be obtained as below.
It can be seen from Equation (2) that the accuracy of the angular position θ is directly proportional to the delay accuracy ∂τ and inversely proportional to the baseline length B. Therefore, the higher angular position accuracy can be obtained by increasing the baseline length or the delay accuracy.

The Signal Model of VLBI Correlator
As aforementioned, there is a time delay between the signals of the two stations that receive the same target signal. To acquire a more accurate delay, the following two steps are performed. Firstly, some compensation methods are utilized to roughly compensate for the delay between the signals of the stations. Then, the residual delay can be obtained through the correlation operations of the compensated signal. Next, the single-frequency signal model is employed to introduce the principle of VLBI method.
First, the real delay between two stations is assumed to be τ g . The signals received by the two stations from the observation target are expressed as follows: where ω is the carrier frequency of the observation signal, and t is the time scale. After the carrier synchronization, the signals of the two stations can be expressed as: where ω 0 is the frequency of local oscillator. After the rough delay compensation of the signal, the signal of Station 2 can be obtained as follows: where τ g0 is the rough delay estimation of the actual delay compensation model. The rough delay value is usually calculated by the empirical formula of fifth degree polynomial. Then, based on the rough delay compensation, the result of correlation processing is as follows: where ∆τ g is the residual delay value. It can be seen from the correlation results that the phase contains a linear component that varies with frequency. The slop of the phasefrequency diagram is just the residual delay, which can be employed to calculate the angular position.
Based on the residual delay calculated from the above processing, the angular position information of the observation target can be obtained by calculation with Equation (1).

The Implementation of VLBI Correlator
As shown in Figure 2, the VLBI algorithm mainly includes four parts, respectively, rough delay calculation, delay compensation, correlation, and residual delay calculation. Through the above calculations, the real-time angular position can be acquired for the deep space exploration system.

Rough Delay Calculation
The orbit prediction [29,30] method is used to estimate the delay compensation τ g0 between the two stations. Generally, the empirical formula of fifth degree polynomial is accurate enough to calculate the rough delay of any sampling time t, where a i (i = 0, 1, 2, 3, 4, 5) is the empirical coefficients, τ g0 I and τ g0 F , respectively, indicate the integer and fractional delay, N i and N f represent the number of sampling periods for the integer and fractional delay, and T s is the sampling period.

Delay Compensation
Because the delay of the two stations is not always an integer multiple of the signal sampling period, the delay compensation can be divided into integer delay and fractional delay. Normally, the integer delay is compensated with the time domain shift, and the fractional delay is compensated through the frequency domain phase multiplication. Meanwhile, it is crucial to eliminate the influence of the Doppler effect in the delay compensation process, which corresponds to the fringe rotation part in the VLBI algorithm. As shown in Figure 2, the delay compensation includes three steps, including the integer delay compensation, the fringe rotation and the fractional delay compensation. The specific implementations are as follows.

(i) Integer delay compensation
The signals of one station cope with the integer delay as follows: where n indicates the discrete time scale, x(n) is the received signal, and I(n) indicates the processed signal after the integer delay compensation.
The processing step is introduced to counteract the influence of the Doppler effect, which is denoted as follows: where Fri(n) is the result after eliminating the Doppler effect, and f d (n) is the Doppler frequency employed in the estimation. (iii) Fractional delay compensation After carrying out the fractional delay compensation, the signal can presented below: where f indicates the discrete frequency scale, and FFT(·) means the Fourier transform operator.

Signal Correlation
Through the above operations, the compensation operations of the signals are accomplished. According to Equation (8), the correlation between the two stations can be executed as below: where s1 and s2 indicate the different stations that observing the same target.

Angular Position Calculation
After finishing the above operations, the residual delay can be obtained through calculating the slop of the phase spectrum in the correlation results.
After solving the two values τ g0 and ∆τ g , the angular position can be figured out by Equation (9).

The Analysis of Algorithm Complexity
According to the previous VLBI introduction, the complexity analysis of the four steps will be given successively [31]. As for the calculations of rough delay, it is independent of signal length L, and its computational complexity is O(1). Next, the integer delay compensation can be implemented through the data shift operation, which can be implemented along with the data reading process. Therefore, the computational complexity of this part can be seen as O(1). As Equation (12) described, the essence of fringe rotation includes the Doppler frequency calculation of L points and the multiplication of L points in time domain. In the next step, the Fourier transform of L points and the multiplication will be employed to finish the fractional delay compensation. Accordingly, the computational complexity can be expressed as O(LlogL). The algorithm complexities of the correlation calculation and the slop estimation are related to the signal length L.
As aforementioned, the calculation of VLBI mainly includes the multiplication and Fourier transform in the signal dimension. For the consideration of efficiency, Intel Integrated Performance Primitives (IPP) [32,33] is employed to implement these timeconsuming steps in a serial way. Figure 3 depicts the time-consuming ratio of the serial VLBI algorithm. It can be seen that the time-consuming situation of the fringe rotation and fractional delay are slightly different from the above complexity analysis. The main reasons include two aspects: first, due to the application of Intel IPP library library, the operations of FFT and sequence multiplication are optimized at the software and hardware level; second, although the complexity is the same, the amount of calculation is different between the fringe rotation and the fractional delay. Thus, the fringe rotation takes more time in the implementation. In general, the most time-consuming part is concentrated on the calculation of delay compensation and correlation, which accounting for about 99.98% of the total running time.
In practical applications, in order to improve the signal-to-noise ratio and acquire a more accurate delay, the signal is divided into multiple blocks. Concretely, the signalto-noise ratio will be enhanced by accumulating the results of these blocks. First of all, the pending signals of one station which contains C channels are divided into blocks as follows. There are Z signals in one channel. Each signal is divided into A integration time; an integration time includes F FFT time; and one FFT time has N discrete points. According to the above definitions of the blocks, there are M (i.e., M = Z × A) integration time in one channel. Algorithm 1 shows the pseudo-code of the serial VLBI algorithm, which processes the data of one channel. From Algorithm 1, the main time consumption comes from the two four-layer loops. Combine the analysis of time complexity and Algorithm 1, our experiment mainly optimizes the operations of delay compensation and correlation in parallel.

Proposed Method
It can be seen from the above analysis that the fringe rotation, fractional delay compensation, and signal correlation are the most time-consuming parts of the VLBI algorithm. In order to accelerate the VLBI algorithm, the three parts are parallelized in the GPU. The GPU parallel version of VLBI is displayed in Figures 4 and 5. According to the characteristics of the GPU, it has thousands of physical threads, so it is more suitable for performing data-intensive and computationally intensive operations. It can be seen from Figure 4 that the CPU-based parallel VLBI algorithm may require multiple devices to achieve the processing effect of the GPU-based parallel algorithm.  As shown in Figure 5, it can be seen that the calculations of rough delay, the integer delay, and the angular position are executed on the CPU. Besides, the operations of fringe rotation, fractional delay and correlation are executed on the GPU. After the calculation of correlation is completed, the results will be returned from the GPU to the CPU. Eventually, the phase and frequency in the correlation results are fitted to a straight line to obtain the residual delay on the CPU.

The CPU-GPU Data Transfer Optimization
According to the CUDA programming paradigm, the data transfer is necessary between CPU and GPU. Basically, the required input data is transferred from CPU to GPU for the parallel function. After the high performance computing, the results will be transferred from the GPU to the CPU. Since the peak bandwidth between the CPU memory and the GPU memory is relatively low, it means that the data transfers between the CPU and GPU may slow the whole application performance. The following are some general guidelines for the data transfer between them.
• Minimize the number of transfers between the host and the device. • Merge many small transfers into one.
In this paper, we will treat the data in an integration time as a basic data unit, and transfer it to the GPU, which will eliminate most of the per-transfer overhead.

The Solution of the Uncoalescing Problem
Since the data type is complex, when we operate on the real or imaginary part of the data on the global memory, the uncoalescing problem will be encountered. As shown in Figure 6, the discontinuous storage areas of global memory are accessed, and will lead to the performance decline. If the data access is continuous, one instruction can read more data. This means fewer instructions will be used to perform data access operations. Thus, the data reading time will be shortened. Memory coalescing technology can greatly improve the speed of global memory access. The hardware will detect whether the location accessed by the thread is an adjacent location in the global memory. When 16 threads of a warp are coalesced in memory transactions, the bandwidth of the global memory will be maximized. Therefore, shared memory is employed to solve the uncoalescing problem in this article, as shown in Figure 6. We employ a continuous method to access data from global memory to shared memory, firstly. Since there is no need to consider the order of access when using shared memory, the problem of uncoalescing problem can be avoided. When a block starts to execute, the GPU will allocate a certain amount of shared memory, and the address space of the shared memory will be shared by all threads in the block. Shared memory is allocated to all blocks in Streaming Multiprocessing (SM), which is also a scarce resource of the GPU. Therefore, the more shared memory is employed, the less active threads that can be parallelized.
The main operations of memory coalescing method is shown below. Firstly, adjacent threads read the neighboring real or imaginary number from global memory to shared memory. Then, in the kernel function, the data in the shared memory is converted into a complex type and calculated. Finally, the complex number is divided into the real and imaginary part for output.

The Thread Task Allocation
The main content of a parallel algorithm is the task partitioning and scheduling, namely, the task allocation on each thread. When formulating tasks to be executed on each thread, the following two aspects are mainly considered. First, the GPU thread has limited computing ability. Nevertheless, when designing the thread computing tasks, it must give full play to its capabilities within the scope of its computing power. Secondly, if more resources (such as shared memory, register, etc.) are used in a thread block, fewer active thread blocks that can be parallelized.
Based on the parallel design guidelines, the calculation of multiple data points will be considered to execute in one thread. To optimize the performance of the GPU implementation, the maximum number of threads, namely 1024, is set to each thread block. Accordingly, the block number can be calculated as (F × N/(1024 × N t )), where F and N t are the dimension of Fourier transform and the number of points computed in one thread, respectively.

The Cuda Stream
It is noted that the data calculation process of each station is independent in the serial algorithm. In order to optimize the execution pipeline, the CUDA stream technology is introduced to realize concurrent execution of parallel calculation and data transmission. Thus, the overall running time can be decreased greatly. From Figure 7, it can be seen that CUDA can start multiple streams, simultaneously. The data transmission and kernel calculation are performed in each stream, independently. In this way, when the data of current station performs the calculation, the data of next station can transmit data or execute the calculation operation. In the GPU parallel implementation, the parallel pipeline strategy outperforms the serial pipeline by hiding the data transfer overhead.

The Overall Approach of Gpu Parallel VLBI
A detailed step-by-step algorithm description of the parallel VLBI on the CPU-GPU heterogeneous platform is summarized in Algorithm 2. The calculations of fringe rotation, fractional delay, and correlation are denoted as kernel 1, kernel 2 and, kernel 3, which are executed on the GPU. The CUFFT library is used to perform FFT operations on the GPU. Besides, the CUDA stream is used to reduce the time of the data transmission.  31 Calculate the residual delay 32 end

Experimental Configuration
The hardware specification is described in Table 1. It also shows the bandwidth of data transfer between the CPU and the GPU. The software environment includes Windows 10 64 bit operating system, Visual Studio 2017, CUDA 9.2, and Intel IPP library. The experimental data are from Beijing Aerospace Control Center. The observation stations are Kashi Station and Jiamusi Station, respectively. The data of each station has 8 channels, and the data rate is 64 Mb/s. The sampling rate and quantization bit are 4 MHz and 1 bit. The observation time of the experimental data is 6 s.

Efficiency Analysis
In the experiments, the proposed GPU-based VLBI algorithm (denoted as GPU version) are compared with the corresponding serial version and the multi-CPU version. The serial version is implemented in C language based on the intel IPP library. The multi-CPU version of VLBI (denoted as multi-CPU version) is implemented with the message passing interface (MPI) and the Intel IPP library. We evaluate the performance improvement (acceleration factors) of the GPU implementation relative to the multi-CPU version and the serial version. Figure 8 depicts the calculation time of fractional delay and fringe rotation for one channel data of a station. Through shared memory bridging, the problem of discontinuous data access is solved. When reading data from global memory, the bandwidth of the global memory is fully utilized. After solving the uncoalescing problem, the execution time of the algorithm is significantly reduced. From Figure 8, it can be seen that the processing time of fringe rotation and fractional delay is reduced along with the decrease of the block number. If multi-point computing tasks are allocated on a thread, the computing resources of the thread can be fully utilized, and sufficient hardware resources can be allocated for each thread. The aforementioned optimization methods can reduce the time of processing data. Furthermore, the results verify the effectiveness of the method proposed. For the fractional delay, the time change is more obvious after optimization. According to the analysis of the algorithm in Section II, although the complexity is the same, the amount of calculation is different between the fringe rotation and the fractional delay. Therefore, the thread in Kernel1 has a larger amount of calculation and consumes more hardware resources. Thus, the repartitioning of GPU resources has less impact on edge rotation.  Figure 9 shows the percentages of the data transmission time and the calculation time. It can be observed that the data transfer time is reduced after applying the proposed parallel pipeline. In particular, after using CUDA stream for transmission optimization, the time percentage of data transfer has been reduced from 90.37% to 66.86%. However, the time for data transfer still accounts for a large part of the total time. That is because the CUDA stream operation only hides a part of the data transfer time between the CPU and GPU, but it does not shorten it, as shown in Figure 7. In addition, when optimizing without using CUDA streams, the data transmission time accounts for more than 90% of the total time. In view of the above two reasons, the proportion of data transmission time still accounts for a large proportion. Regardless, for the VLBI algorithm, the CUDA stream is capable of shortening the ratio of data transmission, and is essential to promote algorithm performance.  Table 2 indicates the time consumption and speedups of various optimization strategies applied in this article. The different approaches are employed to process the data in one channel. "Serial" refers to the serial algorithm running on the single CPU. Moreover, the model and frequency of the CPU are revealed in Table 1. In the version of "Initialization", the task performed on each thread is a single point calculation, where CUFFT is used to optimize the FFT. Based on the "Initialization" , "Coalesce" solves the uncoalescing problem when accessing the global memory. "Allocation" version redesigned the computing tasks performed on each thread in the GPU. The version of "CUDA stream" further optimize the data transmission between the CPU and the GPU.

The Analysis of the Optimization Strategies
According to the above results, the multilevel GPU optimization strategy is consistent with the design ideas. Based on the basic GPU parallel, the serial algorithm is accelerated by 18.1×. After solving the uncoalescing problem, the bandwidth of global memory is fully utilized. Hence, the "Coalesce" version attains the speedup of 19.1×. When making full use of thread computing resources and GPU hardware resources, the acceleration effect reaches 22.1×. Benefiting from the CUDA stream, multiple streams can be started at the same time to process data from different stations. Finally, compared with the serial version, the method proposed in this article can achieve 36.6×. It can be seen from the optimization results of the above acceleration method that data transmission has the most significant impact on the computational efficiency of the VLBI algorithm. Therefore, after using stream optimization, the acceleration effect is significantly improved.  Table 3 lists the specific time-consumption comparison of the serial version and the GPU version. The last two rows show that the time and speedups of "Calculation" (that is, only the calculation part, excluding data transmission time) and "Overall|" (including data transmission time), respectively. The part of fringe rotation achieves a high speedup of about 853.5×. And the fractional delay reaches about 98× speedup. The GPU parallel implementation of the correlation achieves a speedup of 22.7×. Besides, the speedups of the "Calculation" and "Overall|" are about 224.1× and 36.8×, separately. As can be seen from Table 1, the GPU used in this paper (NVIDIA RTX 2070) has 2304 CUDA cores. In theory, it should be 2304× the execution speed of the serial version. However, the computing power of CPU and GPU threads si different. Besides, the data transfer between CPU and GPU is inevitable. Therefore, compared with the serial version, the GPU-based parallel version cannot achieve an acceleration of 2304×.
It is obvious from Table 3 that the GPU version achieves remarkable acceleration factors on the platform as compared to the serial version. As for the practical data processing, the execution time of the serial version is about 39s, which is far from meeting the requirements of real-time processing. However, the GPU-based method can fulfill the data processing in about 1s in that it takes full advantages of the computational power of GPUs, the high bandwidth, as well as low latency of shared memory, and benefits from exploiting the massively parallel nature of GPUs. To further evaluate the efficiency of the proposed parallel approach, we compare the GPU implementation with the multi-CPU correlator developed by Beijing Aerospace Control Center ( denoted as the multi-CPU version in this article). The multi-CPU version runs on a workstation with two CPUs. The processor employed in the experiment is Intel(R) Xeon(R) CPU E5-2683 v3 with 28 threads (2 GHz). The workstation has 2 correlation computing nodes. Each node starts 20 threads to perform the computing tasks. Table 4 is consistent with the content in Table 3. Compared with the fringe rotation, fractional delay and correlation of the multi-CPU version, the acceleration ratio of the GPU version has reached 124.7×, 6.4×, and 7×, respectively. Specifically, the part of "Calculation" and "Overall|" achieves about 28.6× and 4.7× speedup. It can be seen from the results that the multi-CPU version requires about 5 s to process data of 6 s observation time. Although the multi-CPU version can achieve real-time processing, the GPU version only needs 1s, which can further improve the processing capability. This means that it may take 5 such machines to achieve the GPU computing effect, as shown in Figure 4. The details of the time consumption on the three versions are displayed in Figure 10. It can be seen, in Tables 3 and 4, that the acceleration effect of the fringe rotation is more obvious than the fractional delay. The main reasons include two sides: firstly, the fractional delay contains the FFT operation, and the FFT operation has the highest time complexity; secondly, Compared with the Intel IPP library, the optimization effect of CUDA CUFFT has the limited optimization capabilities. Therefore, the optimization effect of the fractional delay is not as good as the fringe rotation.

The Comparison of Data Processing Rate
In the practical application, the data processing rate needs to reach 128 Mbps to meet the requirements of the real-time processing. From Table 5, it can be seen that the multi-CPU version has achieved the requirements of real-time processing. With the increase of data volume and data transmission rate, it will be difficult to undertake the future data processing. As for the GPU-based method, it has enough computing power margin to deal with this situation. Combining Figure 4 and Table 5, it can be summarized that the GPUbased VLBI algorithm can complete signal processing with fewer hardware resources and higher data processing efficiency than the multi-CPU version. Therefore, the GPU-based VLBI correlator will play a greater role in deep space exploration system.

Accuracy Analysis
VLBI technology aims to achieve higher observation accuracy by increasing the distance among stations. Therefore, the parallel algorithm design of the VLBI algorithm should not only pay attention to the final acceleration effect, but also control the calculation error within an acceptable range.
Theoretically, there should be no difference between the results of the GPU-based method and the CPU-based method. In fact, the mathematical function libraries of the GPU and the CPU have different calculation accuracy. Furthermore, the instruction set optimization degree of CPUs and GPUs is different. Thus, their calculation results will exist a little difference in value.
As shown in Table 6, the value differences between CPU version and GPU version are listed for comparison. Mean absolute error (MAE) are employed to measure the difference of the two methods. It can be seen that the errors of FFT and fractional delay between two versions are both around 10 −6 . The MAE difference of the residual delay step is about 5.1 × 10 −15 . It can be observed from the overall MAE results that the calculation accuracy of the angular position can bee guaranteed.

Conclusions
In this paper, a GPU-based VLBI parallel correlator is proposed to meet the growing demands for high computing power. Through the algorithm complexity analysis, the targets of the parallel acceleration are determined, namely calculations of the fringe rotation, fractional delay and the correlation. By assigning the multi-point calculation as single thread task, the VLBI algorithm is massively parallelized. As a kind of computing-intensive and data-intensive issue, the GPU-based VLBI algorithm is further optimized in terms of data transfer. Through multi-level optimization strategies, namely data transmission rules, coalescing access and concurrent pipelines, the efficiency is increased by 2-fold. Under the premise of ensuring the calculation accuracy, compared with the serial version, the calculation part and the overall calculation have reached the speedups of 224.1× and 36.8×, respectively. Compared with the multi-CPU version, the corresponding parts have achieved speedups of 28.6× and 4.7×. In addition, the data processing rate of the GPU-based correlator can reach 712 Mbps, which can far meet the current ground station processing needs. The GPU-based correlator outperforms the actual multi-CPU-based correlator in terms of computing performance and data rate. In the future work, we will study the multi-GPU-based VLBI algorithm and deploy it to the better performance CPU cluster to further improve the data processing capability for deep space exploration.