Co-Processing Parallel Computation for Distributed Optical Fiber Vibration Sensing

Rapid data processing is crucial for distributed optical fiber vibration sensing systems based on a phase-sensitive optical time domain reflectometer (Φ-OTDR) due to the huge amount of continuously refreshed sensing data. The vibration sensing principle is analyzed to study the data flow of Rayleigh backscattered light among the different processing units. A field-programmable gate array (FPGA) is first chosen to synchronously implement pulse modulation, data acquisition and transmission in parallel. Due to the parallelism characteristics of numerous independent algorithm kernels, graphics processing units (GPU) can be used to execute the same computation instruction by the allocation of multiple threads. As a conventional data processing method for the sensing system, a differential accumulation algorithm using co-processing parallel computation is verified with 1.6 µs spent by the GPU, which is 21,250 times faster than a central processing unit (CPU) for a 2020 m length of optical fiber. Moreover, the cooperation of the CPU and GPU is realized for the spectrum analysis, which substantially shortens the time of fast Fourier transform analysis processing. The combination of FPGA, CPU and GPU can largely enhance the capacity of data acquisition and processing, and improve the real-time performance of distributed optical fiber vibration sensing systems.


Introduction
A distributed optical fiber vibration sensing system has the advantages of a simple structure, resistance to electromagnetic interference, adaptability in flammable environments and a wide detection range [1][2][3]. As one typical distributed optical fiber vibration sensing system, a phase-sensitive optical time domain reflectometer (Φ-OTDR) can simultaneously achieve multi-point vibration detection and location. Because of the benefits of its high sensitivity, good spatial resolution and long detecting distance [4,5], Φ-OTDR has broad application prospects in long-distance oil and gas pipelines [6], border security intrusion detection and smart grids [7,8]. The current research hotspot of Φ-OTDR is the improvement of its sensing performance, such as a sensing distance of up to hundreds of kilometers [9,10] and a spatial resolution of sub-meter magnitude [11], which leads to a huge amount of sensing data and a growing demand for computation resources. Consequently, signal processing becomes more time-consuming and the real-time performance degrades.

Vibration Sensing Principle
The vibration sensing principle of the Φ-OTDR system is based on the phenomenon of Rayleigh backscattering (RBS) light, which is produced by the elasto-optical effect within the optical fiber when the pulsed light is injected into and propagates along the optical fiber. The mutual interference of numerous Rayleigh scattering centers within the injected pulse duration leads to random up-and-down fluctuations in the RBS curve. When a light pulse with a frequency of w is injected into the sensing optical fiber at the moment t = 0, the amplitude of the backscattered signal without vibration is shown in the following equation [33]:

E(t) = Σ_{p=1}^{M} I_p exp(−βcτ_p/(2n)) exp[jw(t − τ_p)] F((t − τ_p)/L), (1)

where I_p and τ_p are respectively the amplitude and the time delay of the p-th RBS contribution; L means the pulse width of the incident pulse; β is the fiber attenuation constant; c is the velocity of light in a vacuum; n is the refractive index of the fiber; M is the number of scatterers within the sensing fiber. F(t/L) is a rectangular function, which can be expressed as:

F(t/L) = 1 for 0 ≤ t ≤ L, and 0 otherwise. (2)

When an external vibration is applied at certain positions in the sensing optical fiber, the light phase will be modulated, which can be collected by intensity detection. In general, the acquired data should be processed by a differential accumulation algorithm and an FFT algorithm in order to demodulate the vibration location and frequency information [34,35]. Eventually, the detection and location of external vibration can be achieved based on the RBS curves.
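The interference model above can be sketched numerically. The following Python simulation (our illustration, not the paper's code; all names and parameter values are assumptions) sums the fields of M random scattering centers inside the pulse window, which reproduces the random up-and-down fluctuations of an RBS curve; fiber attenuation is omitted for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbs_amplitude(t, w, L, I, tau):
    """Coherent sum of the scatterer fields observed at time t (Eq. (1), no attenuation)."""
    inside = (t - tau >= 0) & (t - tau <= L)   # rectangular window F((t - tau)/L)
    field = np.sum(I[inside] * np.exp(1j * w * (t - tau[inside])))
    return np.abs(field)

M = 2000                                   # number of scattering centers
I = rng.rayleigh(1.0, M)                   # scatterer amplitudes I_p
tau = np.sort(rng.uniform(0, 20e-6, M))    # round-trip delays tau_p over the fiber
w = 2 * np.pi * 193.4e12                   # optical angular frequency (1550 nm)
L = 200e-9                                 # 200 ns pulse width, as in the setup

t_axis = np.linspace(0, 20e-6, 1000)
curve = np.array([rbs_amplitude(t, w, L, I, tau) for t in t_axis])
print(curve.min(), curve.max())            # random fluctuations along the fiber
```

Because the scatterer positions are random, each run with a different seed yields a different jagged curve, which is exactly why a vibration is detected as a local change of this otherwise static random pattern.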


Experiment Setup
The experiment setup is illustrated in Figure 2 based on the principle of the Φ-OTDR sensing system. The laser source has an ultra-narrow linewidth of 100 Hz and a center wavelength of 1550 nm. The polarization state of the output laser is affected by the birefringence property of the optical fiber; therefore, a polarization controller is used to adjust the polarization state of the laser [36]. The FPGA module drives the acousto-optical modulator (AOM) to modulate the continuous light into pulsed light, with a repetition frequency of 8 kHz and a pulse width of 200 ns, which corresponds to a 20 m spatial resolution. More concretely, the FPGA board generates a series of clock sequences with an adjustable repetition frequency and pulse width in the form of a transistor-transistor logic (TTL) signal, which is amplified by a boost circuit to match the input voltage of the AOM. This electrical modulation signal acts on the electroacoustic transducer of the AOM and is transformed into an ultrasonic field. When the light wave passes through the acousto-optic medium, the continuous light is modulated by the acousto-optic effect into an intensity-modulated optical pulse. Then, the pulse is injected into an erbium-doped fiber amplifier (EDFA), which amplifies the power of the pulsed light. Following this, the pulsed light goes through a dense wavelength division multiplexer (DWDM), which can combine and transmit different wavelengths in the same fiber. Light at wavelengths other than 1550 nm is therefore filtered out by the DWDM, and a purer optical wave at 1550 nm is obtained at the output of the DWDM module. Then, the pulsed light is launched into the sensing optical fiber via an optical circulator (OC). A piezoelectric transducer (PZT) is placed at a certain location of the sensing optical fiber in order to produce a vibration signal.
Eventually, the coherent RBS light is monitored by the photodetector (PD) and then transformed into an electric signal. The FPGA board realizes the data acquisition with the Verilog HDL hardware description language and the Quartus II programming tool. In the host computer, the CPU is an Intel Core i7-8700K, programmed in C with the Visual Studio programming tool.



Data Flow Analysis
As illustrated in Figure 3, the complete Φ-OTDR system consists of a data generation unit, a data acquisition unit and a data processing unit. Firstly, the modulated pulse which is launched into the Φ-OTDR sensing system continuously generates RBS curves; the optical signal is then detected by the PD and converted to an electronic signal. Secondly, the electric signal is acquired by an analog-digital converter (ADC) with a 50 MSa/s sample rate, then stored in the on-chip RAM of the FPGA. Owing to the large amount of RBS data collected by the Φ-OTDR sensing system and the insufficient internal storage space of the FPGA, data overflows and even the collapse of the whole system will result if the data are not sent to the host computer in time. Therefore, the FPGA must finish preprocessing the old data stored in the on-chip RAM and transmitting them to the CPU before the arrival of new data. The FPGA module is connected to the CPU via a USB transmission interface, achieving fast data transmission. Eventually, when the amount of data reaches a certain level, the data can be transferred from the CPU to the GPU in order to realize fast algorithmic processing. Afterwards, the GPU sends data back to the CPU after dealing with the data through multiple threads, and the results can then be displayed on the screen.
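The overflow condition described above reduces to a simple rate comparison: each stage must drain data at least as fast as the previous stage produces it. A back-of-the-envelope check in Python (our sketch; constant names are ours, values from the text):

```python
# Condition that prevents the FPGA on-chip RAM overflow: the USB link must
# drain bytes at least as fast as the ADC produces them.
SAMPLE_RATE = 50e6   # ADC: 50 MSa/s, one byte per sample
PULSE_RATE = 8e3     # 8 kHz pulse repetition frequency
USB_SPEED = 100e6    # USB 3.0 link drained at 100 MB/s

samples_per_trace = SAMPLE_RATE / PULSE_RATE          # samples in one RBS curve
acquired_bytes_per_s = samples_per_trace * PULSE_RATE # = SAMPLE_RATE bytes/s

assert acquired_bytes_per_s <= USB_SPEED, "on-chip RAM would overflow"
print(f"{samples_per_trace:.0f} samples/trace, "
      f"{acquired_bytes_per_s/1e6:.0f} MB/s acquired vs {USB_SPEED/1e6:.0f} MB/s drained")
```

With the setup's parameters the acquisition rate (50 MB/s) is half the transmission rate, so the pipeline keeps up with a factor-of-two margin.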

Data Parallel Processing Based on FPGA
Owing to the large amount of RBS data collected by the Φ-OTDR sensing system and the demand for real-time data acquisition, the FPGA of Altera's Cyclone III series is chosen to implement pulse modulation, data acquisition and data transmission in parallel. The main functions of the FPGA in the Φ-OTDR sensing system are shown in Figure 4.
Firstly, the AOM controller generates a pulse signal whose repetition frequency and width are 8 kHz and 200 ns, respectively, through FPGA programming. Then, the pulse signal is transferred to the AOM via the Input/Output (I/O) interface in order to transform the continuous light into the expected pulsed light. Furthermore, the operation of the main module depends on the clock domain management, which is realized by a phase-locked loop (PLL). It receives a reference clock generated from a crystal oscillator and provides clocks of 50 and 100 MHz for the A/D controller and the USB controller, respectively.
Secondly, the Syn module synchronously controls the A/D controller to enable the acquisition state and govern the sampling rate. The operating principle of the A/D controller is based on a state machine. State 0 is the initial state and turns into State 1 when the synchronous trigger signal arrives. State 1 runs in a continuous loop until the data counter is full, then it changes into State 2. The additional data header is added to each RBS data string in State 2. State 3 means that the data is written into the internal memory.
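The four states above can be modeled compactly in software. A minimal Python model (our sketch of the behavior; the real controller is Verilog HDL, and the event names and counter size are assumptions):

```python
# Behavioral model of the four-state A/D controller: wait for the trigger,
# sample until the data counter is full, prepend the header, write to memory.
def adc_controller(events, counter_full=4):
    state, count, out = 0, 0, []
    for ev in events:
        if state == 0 and ev == "trigger":   # State 0: idle until sync trigger
            state = 1
        elif state == 1:                     # State 1: loop while counting samples
            count += 1
            if count == counter_full:
                state = 2
        elif state == 2:                     # State 2: add the data header
            out.append("HEADER")
            state = 3
        elif state == 3:                     # State 3: write data to internal memory
            out.append(f"DATA[{count}]")
            state, count = 0, 0
    return out

log = adc_controller(["trigger", "s", "s", "s", "s", "x", "x"])
print(log)
```

In hardware the same structure is typically one clocked `case` block over a state register; the Python dispatch mirrors those case branches one-to-one.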
Thirdly, a first-in first-out (FIFO) device is used as the memory module to temporarily store the collected data. The data are written to the FIFO and read out to the upper computer at speeds of 50 and 100 MB/s, respectively. This method is mainly used to achieve data storage, balance the speeds of data writing and reading, and prevent invalid data from being transmitted.
Afterwards, the USB controller is used to control the data transmission between the FIFO and the high-speed USB 3.0 transmission microcontroller (EZ-USB FX3). It is also a state machine which consists of three states. State 0 represents the start state, which changes into State 1 when the RBS data arrives from the FIFO. State 1 is the full flag judgment state. If the slave FIFO that exists in the EZ-USB FX3 is not full, it moves to State 2; otherwise, it remains in State 1. State 2 represents the data writing state. During this state, the RBS data can be read and then written into the slave FIFO through the synchronized slave FIFO device at a speed of 100 MB/s. Finally, the EZ-USB FX3 chip packages the received data and transmits the data to the host computer via the USB transmission bus.

In summary, the core functionality of the FPGA board is to generate a pulse that serves as a trigger signal, then to perform data acquisition, local storage and data transmission via the serial bus. These functions further determine the parameters of the Φ-OTDR system. For example, the data acquisition speed is 50 MSa/s and the data transmission speed is 100 MB/s in our experimental setup, which means that the real-time data throughput should be approximately equal to or less than 100 MB/s. On one hand, the data acquisition speed determines the resolution in distance. Supposing that the interval between the sending and receiving time of the sensing light is 1 s, the maximum propagation distance of light in the optical fiber is 10^8 m, meaning that a data acquisition speed of 50 MSa/s yields a distance resolution of 2 m between two sampling points. On the other hand, the real-time data throughput determines the requirements for real-time operation. The repetition frequency of optical pulses (8 kHz) determines the maximum sensing distance (12.5 km) of one RBS curve without signal aliasing, and also the total number of sampling points (6250) for one RBS curve. If each sampling point is represented by 8 bits (1 Byte), the data volume of one RBS curve is 6.25 kB. Therefore, a maximum real-time data throughput of 100 MB/s means that 16,000 RBS curves could be transmitted per second, which puts forward more challenges for the real-time operation of the next co-processing units.
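The parameter derivations in this paragraph can be reproduced from the setup values in a few lines of Python (our sketch; constant names are ours, with the fiber light speed taken as c/n ≈ 2 × 10^8 m/s for n = 1.5):

```python
# Deriving the system parameters quoted in the text from the setup values.
C_FIBER = 2e8          # light speed in fiber, c/n with n = 1.5 (m/s)
SAMPLE_RATE = 50e6     # Sa/s
PULSE_RATE = 8e3       # Hz
USB_SPEED = 100e6      # B/s, with 1 byte per sample

# One sample every 20 ns; the round trip halves the distance per sample.
distance_resolution = C_FIBER / SAMPLE_RATE / 2    # -> 2 m
# One trace may span at most one pulse period without aliasing (round trip).
max_distance = C_FIBER / PULSE_RATE / 2            # -> 12500 m
samples_per_trace = int(SAMPLE_RATE / PULSE_RATE)  # -> 6250 points, i.e. 6.25 kB
traces_per_second = int(USB_SPEED / samples_per_trace)  # -> 16000 RBS curves/s

print(distance_resolution, max_distance, samples_per_trace, traces_per_second)
```

Each printed value matches the corresponding figure in the paragraph (2 m, 12.5 km, 6250 points, 16,000 curves per second).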

Data Parallel Processing Based on GPU
The graphics card chosen was an NVIDIA GEFORCE series GTX 1060. It uses a GP106-300 chip based on NVIDIA's Pascal architecture, which has the properties of high performance and low cost. For the used GPU, the number of multi-processors is nine and the maximum number of threads per multi-processor is 2048, so the maximum number of available threads of the GPU is 9 × 2048 = 18,432. Compute unified device architecture (CUDA) is a parallel computing platform for NVIDIA GPUs, which contains an instruction set architecture (ISA) and a parallel computation engine. By using the CUDA technique, the stream processors can be mapped to thread processors to deal with the computation of large-scale dense data. CUDA application programs can use the GPU hardware for parallel computing by directly calling the underlying CUDA API, or by using the CUDA runtime library, which encapsulates the underlying CUDA API to simplify the programming process. The processing program then needs to be converted to an executable binary file by the compiler, and the CUDA programming code is divided into host code executed on the CPU and device code executed on the GPU. Therefore, the compiler should ensure that the two parts of code can be compiled into a binary file and executed on different devices.
The key to the rapid parallel data processing of the Φ-OTDR sensing system by the GPU lies in allocating multiple threads to execute the same instruction at the same time, together with the low context-switching frequency, which reduces the data processing time and improves the real-time performance of the system. The many threads in the same thread block share the same instruction address and can be executed in parallel. In addition, they can communicate indirectly through shared memory and synchronization primitives to maintain dependency. However, threads in different blocks of the same grid, or in different grids, are independent of each other: they do not need to communicate and have a highly independent, parallel relationship.
The processing architecture of CUDA is illustrated in Figure 5. It consists of threads, thread blocks, and thread grids. The global memory is shared by the thread grids, and these grids can call and handle data which come from the global memory. Every thread grid consists of numerous two-dimensional thread blocks. The blocks have an X-axis and a Y-axis, and they have different settable index value sizes in the different dimensions. Similarly, the thread blocks are constructed from large numbers of threads in the form of a two-dimensional data structure. Coordinates are adopted in Figure 5 in order to display this special two-dimensional data structure conveniently. The first parameter of the coordinate represents the index value in the direction of the X-axis, and the second parameter signifies the index value of the Y-axis.
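The mapping from the two-dimensional block and thread coordinates of Figure 5 to a flat data index follows the standard CUDA pattern `blockIdx * blockDim + threadIdx`. The paper's kernels are CUDA; the Python helper below (names ours) only mimics that index computation for illustration:

```python
# CUDA-style global index computation, mimicked in Python.
def global_index(block, thread, block_dim):
    """block, thread: (x, y) coordinates; block_dim: (dim_x, dim_y)."""
    bx, by = block
    tx, ty = thread
    dim_x, dim_y = block_dim
    x = bx * dim_x + tx    # blockIdx.x * blockDim.x + threadIdx.x
    y = by * dim_y + ty    # blockIdx.y * blockDim.y + threadIdx.y
    return x, y

# Thread (1, 2) of block (3, 0) with 16x16 blocks handles data element (49, 2).
print(global_index((3, 0), (1, 2), (16, 16)))
```

Because every thread derives its own (x, y) from its coordinates, all threads can execute the same instruction on different data elements without communicating, which is exactly the independence property described above.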


Experimental Results and Discussions
In our experiment, the external disturbance acting on the sensing optical fiber is simulated by a PZT phase modulator whose acting length is 1.5 m and driving voltage is 10 Vpp. The PZT phase modulator acts, at a position of 1000 m, as the single-point vibration source in sinusoidal form with a frequency of 100 Hz, while the total length of the optical fiber is 2020 m. Thus, the number of effective sampling points is reduced to 1010 for one RBS curve, and the data volume of each effective RBS curve is 1.01 kB. With the repetition frequency of optical pulses of 8 kHz, the total number of RBS curves is 8000 in 1 s, and the effective data volume is 8.08 MB during an acquisition time of 1 s. The data is then sent to the USB port of the upper computer via the USB 3.0 transmission bus at a speed of 100 MB/s, so the minimum data transmission time is about 80.8 ms. Moreover, in the interior of the upper computer, the USB port is connected with a south bridge USB I/O interface chip through the I/O bus, and the USB I/O interface chip is linked with a north bridge chipset via the Peripheral Component Interconnect (PCI) bus. The north bridge chipset is responsible for communication with the Dynamic Random Access Memory (DRAM), Cache and CPU through the memory bus and the front side bus. As a result, the data transmission time in the interior of the upper computer is hard to calculate accurately because of its complexity, but the data transmission time between the upper computer and the lower computer can be considered a fixed value if the experimental parameters do not change. Figure 6a shows the superposition of the first 100 consecutive RBS traces and the enlarged plot at the single vibration position of 1000 m. It can be clearly seen that the RBS curves fluctuate evidently in the range from 1000 to 1020 m, which corresponds to the spatial resolution of 20 m, whereas the RBS curves remain stable over the rest of the fiber. Figure 6b shows a two-dimensional time-distance RBS figure, with the X-axis being time and the Y-axis being distance. The evolution over time verifies that the RBS curves remain stable outside the vibration area. This result verifies the feasibility of vibration detection and location by the Φ-OTDR sensing system.
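The experiment's data-volume arithmetic follows directly from the setup parameters; a quick Python check (our sketch, names ours, values from the text):

```python
# Reproducing the effective data-volume figures of this experiment.
FIBER_LENGTH = 2020    # m
RESOLUTION = 2         # m between two sampling points (50 MSa/s)
PULSE_RATE = 8000      # Hz
USB_SPEED = 100e6      # B/s

points_per_trace = FIBER_LENGTH // RESOLUTION   # -> 1010 effective points
trace_bytes = points_per_trace                  # 1 byte/sample -> 1.01 kB
volume_per_second = PULSE_RATE * trace_bytes    # -> 8.08 MB acquired per second
transfer_time = volume_per_second / USB_SPEED   # -> ~80.8 ms minimum transfer time

print(points_per_trace, volume_per_second, round(transfer_time * 1e3, 1))
```

This confirms the quoted 1010 points per curve, 8.08 MB per second of acquisition, and the roughly 80.8 ms minimum USB transfer time.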

Differential Accumulation Algorithm
A differential accumulation algorithm is the conventional data processing method for a Φ-OTDR sensing system in order to realize the vibration location. It can remove the background noise and improve the signal-to-noise ratio (SNR). The cooperative data processing of differential accumulation algorithms with FPGA, CPU, and GPU is shown in Figure 7. The collected data in the storage unit goes through data alignment by the FPGA, which adds the data header before each RBS data string, and is then sent to the host computer via the USB transmission bus.
The received data is stored as an N × M matrix in the RAM of the host computer, which can be read and modified by the CPU. More concretely, the data matrix has M RBS curves, and there are N points in each RBS trace. The data is then transmitted to the GPU via the Peripheral Component Interconnect-Express (PCI-E) bus, and the differential accumulation processing of the RBS traces is executed in the form of multiple threads on the GPU. A differential accumulation algorithm involves only simple operations — addition, subtraction, multiplication and division — so its per-point complexity is O(1).
The algorithm with multiple threads is shown in the following equations:

ΔI_threadi(j, n) = |I_threadi(j + k, n) − I_threadi(j, n)|, (3)

A_threadi(n) = (1/(J − k)) Σ_{j=1}^{J−k} ΔI_threadi(j, n), (4)

where the order number of curves is j = 1, 2, 3, …, J − k; J is the number of curves in a thread, and k is the curve interval between the two subtracted curves. The order number of points in one curve is n = 1, 2, 3, …, N, and the length of each RBS curve is N. The order number of threads is marked as threadi = 1, 2, 3, …, Q, and the total number of threads is represented as Q. Therefore, Q multiplied by J is M.
Equation (3) performs the differential processing between two RBS curves separated by k within one thread. In Equation (4), the differential data in one thread is accumulated and averaged, and each thread performs this operation in parallel. Therefore, each thread generates an average curve which contains N points, and Q curves can be obtained in total. Then, these Q curves are accumulated and averaged again in one thread to obtain a final vibration positioning curve, which can be sent to the display unit controlled by the CPU.
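The per-thread differencing and averaging of Equations (3) and (4) can be sketched with NumPy (an illustrative CPU version with assumed test data; the paper's implementation is a CUDA kernel):

```python
import numpy as np

def differential_accumulation(data, Q, k):
    """data: M x N matrix of RBS curves (M = Q * J); returns length-N positioning curve."""
    M, N = data.shape
    J = M // Q
    per_thread = data.reshape(Q, J, N)                           # J consecutive curves per thread
    diff = np.abs(per_thread[:, k:, :] - per_thread[:, :-k, :])  # Eq. (3): curves k apart
    thread_avg = diff.mean(axis=1)                               # Eq. (4): Q average curves
    return thread_avg.mean(axis=0)                               # final accumulation over threads

# Synthetic test: static random traces with one fluctuating "vibration" point.
rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, (2000, 1010))
data[:, 500] += np.sin(np.arange(2000))   # curve-to-curve fluctuation at point 500
curve = differential_accumulation(data, Q=100, k=5)
print(int(curve.argmax()))                # peak marks the vibration position
```

The `reshape` assigns J = M/Q consecutive curves to each of the Q threads, matching the chunking in the text; on the GPU each thread computes its own slice of `diff` and `thread_avg` in parallel.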

The algorithm with multiple threads is shown in the following equations, where the order number of curves is j = 1, 2, 3, . . . , J − k, J is the number of curves in a thread, and k is the curve interval between two subtractive curves. The order number of points in one curve is n = 1, 2, 3, . . . , N, and the length of each RBS curve is N. The order number of threads is marked as threadi = 1, 2, 3, . . . , Q, and the total number of threads is represented as Q. Therefore, Q multiplied by J is M.

ΔI_threadi(j, n) = |I_threadi(j + k, n) − I_threadi(j, n)|, (3)

S_threadi(n) = (1/(J − k)) · Σ_{j=1}^{J−k} ΔI_threadi(j, n). (4)

Equation (3) takes the absolute difference of two RBS curves separated by k within a thread. In Equation (4), the differential data in one thread is accumulated and averaged, and each thread performs this operation in parallel. Therefore, each thread generates an average curve which contains N points, and Q curves can be obtained in total. These Q curves are then accumulated and averaged again in one thread to obtain a final vibration positioning curve, which can be sent to the display unit controlled by the CPU.

Figure 8 demonstrates the positioning results of the differential accumulation algorithm for 2000 RBS curves; the interval value k in Equations (3) and (4) is chosen as five in this experiment. Figure 8a shows the single-point vibration positioning result. The external disturbance is located at 1012 m, with a positioning error of 12 m and an SNR of 19.46 dB, while the total length of optical fiber is 2020 m.

Furthermore, Figure 9 illustrates the time consumption comparison results for the differential accumulation processing of single-point vibration positioning between the GPU and CPU. The total number of RBS curves, which is denoted as M, is chosen as 500 (marked in black), 1000 (marked in red), 1500 (marked in blue) and 2000 (marked in green) in the comparison analysis.
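As an illustration, the per-thread computation of Equations (3) and (4) can be emulated serially with NumPy, treating each GPU thread as one block of J curves. This is a minimal sketch; the function and variable names are ours, not from the original implementation.

```python
import numpy as np

def differential_accumulation(curves, Q, k):
    """Emulate the per-thread differential accumulation of Eqs. (3)-(4).

    curves: (M, N) array of M RBS traces with N points each, M = Q * J.
    Each of the Q 'threads' differences curves separated by k within its
    own block of J traces and averages them; the Q partial curves are
    then averaged again into one positioning curve.
    """
    M, N = curves.shape
    J = M // Q                        # curves handled by one thread
    blocks = curves.reshape(Q, J, N)  # thread i works on blocks[i]
    # Eq. (3): absolute difference of curves j and j + k inside a thread
    diffs = np.abs(blocks[:, k:, :] - blocks[:, :-k, :])  # (Q, J - k, N)
    # Eq. (4): accumulate and average inside each thread
    per_thread = diffs.mean(axis=1)                       # (Q, N)
    # final accumulation over the Q thread results
    return per_thread.mean(axis=0)                        # (N,)
```

On the GPU, each block would be processed by its own thread; because the expressions are plain element-wise array arithmetic, the same code runs on-device if the array library is swapped for a GPU-backed one.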
Figure 9a shows the variation in time consumption of the differential accumulation calculations with parallel computing based on the GPU.
It can be concluded that, when the value of J is less than 100 (which means that Q is more than 20), the time consumption is longer than 650 µs. This phenomenon may be explained by thread synchronization in the operation process of the GPU. A smaller value of J means a greater number of threads, which leads to frequent thread switching on the GPU. The process of thread switching also consumes time, thus slowing down the calculation of the threads.

When the value of J is greater than 100 (which means that Q is less than 20), the variation in time consumption is shown in detail in the enlarged plot. The time consumption is largely reduced, to less than 2 µs. As the number of curves in one thread increases, the time consumption shows a trend of irregular fluctuation. This demonstrates that the time consumption of the differential accumulation algorithm computed on the GPU may be limited by many factors, which are not only related to the number of RBS curves but may also be influenced by the thread cache or the thread switching process.
Moreover, a change in the GPU type may increase the number of threads, but the number of threads is not the only factor affecting the computing capability of the system. The computing capability of the Φ-OTDR system depends not only on the number of allocated threads but also on the amount of sensing data; the thread cache allocation and the thread switching process in the GPU also affect it. Therefore, the algorithm complexity, the amount of sensing data and the number of threads all need to be considered for system optimization.

Figure 9b illustrates the variation in time consumption of the differential accumulation algorithm with serial computing based on the CPU. When the total number of RBS curves is fixed, the time gradually increases with the rising number of curves in a serial computing process. In addition, when the total number of RBS curves increases, the average time rises correspondingly, which is also marked by the short straight lines.
By comparing Figure 9a with Figure 9b, when the total number of RBS curves is the same, the processing time of parallel operations based on the GPU is far less than that of serial computations based on the CPU. For the differential accumulation algorithm, when the length of optical fiber is 2020 m and the number of processed curves is 2000, the time spent on the GPU is only 1.6 µs. Under the same conditions, the CPU takes about 34 ms to realize the data processing. This demonstrates that the method of combining the GPU and CPU is faster than a serial computing architecture based on the CPU for the differential accumulation algorithm in a Φ-OTDR vibration sensing system. Therefore, the GPU can speed up the data processing of the differential accumulation algorithm and improve the real-time performance of a Φ-OTDR sensing system.

FFT Analysis Processing
Figure 10 shows the cooperation process of the CPU and GPU to realize the spectrum analysis, which could be important for the position location and frequency component analysis of the external disturbance. Firstly, the original Φ-OTDR data acquired by the FPGA is transferred into the RAM of the host computer via the USB bus, and can then be processed in the form of the Φ-OTDR data matrix (N*M), with N time traces and M sampling points in each time trace. Secondly, the CPU copies the time traces into graphics memory, which can be read and modified by the GPU, through the memory copy function. The GPU then simultaneously allocates numerous threads to achieve the one-dimensional FFT computation. In addition, n represents the number of time traces in a thread on the GPU.
Finally, the processed result is sent back to the host computer via the PCI-E bus and is displayed on the screen. Thus, the frequency spectrum of the temporal signal can be obtained, which is beneficial for identifying the external vibration position. In conclusion, this parallel operating method largely shortens the data processing time compared with the traditional serial computing method based on the CPU.

The process of FFT analysis based on the GPU mainly contains two parts: the parallel computation of multiple curves, and the butterfly operation, which relies on the butterfly-shaped arithmetic unit. The needed operands can be provided simultaneously to many arithmetic units, owing to the advantages of the parallel structure and in-place calculation. Thus, this parallel operation structure improves the speed of the FFT calculation and shortens the time noticeably. Because of the butterfly operation of the FFT, when the number of calculation points in one curve is M, the algorithm complexity is O(M/2 × log2(M)).
The process of the first part is shown in Figure 11: the RBS data is generated and stored by the CPU and transmitted to the GPU. Then, N threads are allocated for the time traces, which means each time curve is processed by one thread. It is necessary to verify whether the length of the time trace (M) is a power of two, owing to the existence of the butterfly operation. If the length is not a power of two, the data needs to be complemented by adding zero values at the end of the data string. Each curve is processed by these steps on the GPU at the same time.
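The power-of-two check and zero padding described above can be sketched as follows; this is a minimal NumPy illustration, and the helper name is ours.

```python
import numpy as np

def pad_to_pow2(trace):
    """Zero-pad a time trace so its length M is a power of two,
    as required by the radix-2 butterfly operation."""
    trace = np.asarray(trace, dtype=complex)
    M = len(trace)
    if M > 0 and M & (M - 1) == 0:   # M is already a power of two
        return trace
    M2 = 1 << (M - 1).bit_length()   # next power of two above M
    out = np.zeros(M2, dtype=complex)
    out[:M] = trace                  # original samples, then zeros
    return out
```

For example, a 1000-point trace would be padded to 1024 points before the butterfly stages.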
The second part is the butterfly operation of each time trace using the butterfly arithmetic unit. This FFT analysis is called decimation-in-time (DIT) FFT. The process is shown in the following equation:

X(k) = Σ_{n=0}^{M−1} x(n) W_M^{kn}, k = 0, 1, . . . , M − 1, with W_M = e^{−j2π/M}. (5)

Equation (5) represents the computation process of the conventional discrete Fourier transform (DFT). In Equation (5), W_M^{kn} is the rotation factor, which has the features of symmetry and periodicity. This characteristic can decrease the computation time of the FFT. The signal of the temporal domain (x(n)) can be decomposed into even and odd sequences, so the FFT of the even data sequence is called X1(k) and the FFT of the odd data sequence is called X2(k). It is shown in the equation as follows:

X1(k) = Σ_{r=0}^{M/2−1} x(2r) W_{M/2}^{kr}, X2(k) = Σ_{r=0}^{M/2−1} x(2r + 1) W_{M/2}^{kr}. (6)

Therefore, Equation (5) can be developed as Equation (7):

X_threadi(k) = X1(k) + W_M^k X2(k), X_threadi(k + M/2) = X1(k) − W_M^k X2(k), k = 0, 1, . . . , M/2 − 1. (7)
Equation (7) expresses the principle of the butterfly operation in one thread. X_threadi(k) indicates the frequency value of the first half of all points and X_threadi(k + M/2) indicates the frequency value of the second half of all points. Then, the order of threads is marked as threadi = 1, 2, 3, . . . , Q, and the total number of threads is represented as Q.
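The split in Equation (7) can be checked numerically: combining the half-length DFTs of the even and odd samples with the rotation factors reproduces the full M-point spectrum. The following NumPy sketch follows the same definitions and is not the paper's GPU code.

```python
import numpy as np

M = 8
x = np.arange(M, dtype=complex)        # any length-M test signal
X1 = np.fft.fft(x[0::2])               # DFT of the even samples, Eq. (6)
X2 = np.fft.fft(x[1::2])               # DFT of the odd samples,  Eq. (6)
k = np.arange(M // 2)
W = np.exp(-2j * np.pi * k / M)        # rotation (twiddle) factors W_M^k
# Eq. (7): one butterfly per k combines the half-length spectra
X_first = X1 + W * X2                  # X(k),        k = 0 .. M/2 - 1
X_second = X1 - W * X2                 # X(k + M/2)
X = np.concatenate([X_first, X_second])
assert np.allclose(X, np.fft.fft(x))   # matches the direct M-point DFT
```

The same two outputs share the operands X1(k) and W_M^k X2(k), which is why one arithmetic unit can produce both halves at once.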
The FFT analysis process is shown in Figure 12. The data values of each time trace need to be sorted before processing by the butterfly arithmetic unit. Firstly, the natural sequence number of x(n) in Equation (5) is expressed in binary form. Then, the corresponding binary number is flipped. Afterwards, these sequence numbers are converted back to decimal labels. Therefore, x(n) is sorted and stored in the memory in the new numerical order.
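The bit-reversal sorting described above can be sketched as follows (an illustrative helper, not taken from the paper):

```python
def bit_reverse_order(n_points):
    """Return the bit-reversed index permutation used to sort x(n)
    before the in-place butterfly stages (n_points must be a power
    of two)."""
    bits = n_points.bit_length() - 1        # R = log2(M) binary digits
    order = []
    for i in range(n_points):
        b = format(i, f'0{bits}b')          # sequence number in binary
        order.append(int(b[::-1], 2))       # flip the bits, back to decimal
    return order
```

For M = 8 this yields the familiar ordering [0, 4, 2, 6, 1, 5, 3, 7]: index 1 (001) maps to 4 (100), and so on.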
FFT computation of M data points is achieved by using an R-level butterfly operation with R = log2(M); each level has M/2 arithmetic units to perform the butterfly operations at the same time.
The current level number of butterfly computation is denoted as P, P = 0, 1, 2, . . . , R − 1.
From Figure 12, each butterfly computation is achieved by one thread, and M/2 threads are allocated to a time trace which contains M points. There are three kinds of exchange operations in the FFT process based on the principle of in-place computation: the address exchange, the operands exchange, and the results exchange after the butterfly operation. The zero-level butterfly computation performs only one exchange operation, which is the result exchange after a butterfly operation. When the value of P is not equal to zero, the obtained frequency values replace the previous frequency values. However, the (R − 1)-level butterfly computation only realizes two exchange operations, which are the address exchange and the operands exchange.
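A serial NumPy emulation of this in-place DIT FFT, with the bit-reversal sort followed by R levels of M/2 butterflies whose outputs overwrite their operands, might look like the following sketch; on the GPU, each butterfly in the inner loops would be executed by one thread.

```python
import numpy as np

def dit_fft(x):
    """In-place radix-2 DIT FFT: R = log2(M) levels, M/2 butterflies
    per level; each butterfly overwrites its two inputs, mirroring
    one GPU thread per butterfly arithmetic unit."""
    a = np.asarray(x, dtype=complex).copy()
    M = len(a)
    R = M.bit_length() - 1             # number of butterfly levels
    # bit-reversal reordering of the input samples
    for i in range(M):
        j = int(format(i, f'0{R}b')[::-1], 2)
        if i < j:
            a[i], a[j] = a[j], a[i]
    # butterfly levels P = 0 .. R - 1
    for P in range(R):
        half = 1 << P                  # butterfly span at this level
        W = np.exp(-2j * np.pi * np.arange(half) / (2 * half))
        for start in range(0, M, 2 * half):
            for k in range(half):      # one butterfly == one 'thread'
                u = a[start + k]
                v = W[k] * a[start + k + half]
                a[start + k] = u + v            # results replace the
                a[start + k + half] = u - v     # operands in place
    return a
```

The array `a` is the only storage used, which is the in-place property the text describes: no second buffer is needed between levels.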
DIT FFT has the characteristic of in-place computation. It means that the output data of the same butterfly arithmetic unit replaces the position of the input data in the memory space. In addition, every butterfly arithmetic unit of the same level operates independently, which provides the theoretical basis for the multi-thread processing on the GPU. Every thread executes a butterfly arithmetic unit, but the processing data in the thread are refreshed with the growing number of butterfly operator levels. Therefore, it can largely decrease the computational complexity and the storage space of the data. Essentially, applying DIT FFT on the GPU can speed up the data processing. Eventually, the real-time performance of the Φ-OTDR sensing system can be improved.

In our experimental system, a single-point vibration test at a position of 1000 m is applied, while the total length of optical fiber is 2020 m. A two-point vibration test at the positions of 995 m and 1513 m is applied, while the total length of optical fiber is 2036 m. Each time trace contains 1024 points. All time-series curves are dealt with by FFT processing with a frequency resolution of 7.80 Hz on the GPU, and the obtained frequency-space waterfall map is shown in Figure 13. From Figure 13a, the exerted external vibration is detected and located at a position of 1012 m, and the location error is 12 m. From Figure 13b, the vibration is exerted at two positions of 985 and 1505 m. The location errors are, thus, 10 and 8 m.

Figure 14 illustrates the computing time contrast of FFT analysis processing on the CPU and GPU.
The abscissae of Figure 14a,b represent the number of time curves, which positively correlates with the length of optical fiber. The ordinate represents the computing time of the data analysis.
From Figure 14a,b, when the length of optical fiber is unchanged, the time increases gradually with the growing number of points to be executed in the FFT analysis processing. When the number of points is constant, the time rises linearly with the growing length of optical fiber. Comparing Figure 14a with Figure 14b, it can be concluded that the computing time of the FFT operation based on the CPU is much longer than that based on the GPU when the point number of the FFT analysis and the length of optical fiber are fixed. When the length of optical fiber is 2020 m and the point number of the FFT is 8192, the time consumed is 1.17 ms with the FFT analysis performed on the GPU. Under the same experimental conditions, it takes about 119.5 ms with the FFT analysis performed on the CPU, so the GPU is approximately 140 times faster than traditional serial computation based on a CPU in FFT analysis processing. It can be noted that a parallel computation method based on the GPU can substantially shorten the time of FFT analysis processing and optimize the real-time performance of the Φ-OTDR sensing system.

Extended Comparison Experiments and Discussions
In addition, we also performed experiments under long-distance conditions to verify the feasibility and effectiveness of our parallel computation method based on the GPU. The lengths of optical fiber are 2020, 4430, 9832 and 14,130 m, respectively. The differential accumulation algorithm and the FFT processing are both executed on the CPU and GPU. The experimental results are shown in Figure 15.
The time contrast between the CPU and GPU for the differential accumulation algorithm is shown in Figure 15a when the number of RBS curves is 2000. The red bar chart shows the time consumed on the CPU: the longer the distance, the longer the time consumed. The blue bar chart shows the time consumed on the GPU. The changing trend of this time is not regular, but fluctuates at the microsecond level. This phenomenon demonstrates that the time consumption of the differential accumulation algorithm on the GPU may be limited by many factors; it is not only related to the optical fiber distance, but may also be influenced by the thread cache allocation or the thread switching process in the GPU, which likewise have an uncertain time consumption of less than or close to one microsecond. For the same amount of data, the time spent on the GPU is far less than that on the CPU. It can be demonstrated that the method of combining the GPU and CPU can realize fast data processing for the differential accumulation algorithm in a Φ-OTDR vibration sensing system.
The time contrast between the CPU and GPU for FFT processing is shown in Figure 15b when the point number of the FFT is 8192. The red and blue bar charts represent the time consumption of FFT processing on the CPU and GPU, respectively. With the increase in distance, both the CPU and GPU show the same trend of growth. Moreover, for the same amount of data, the time spent on the CPU is 100 times more than that on the GPU. It demonstrates that our parallel computation based on the GPU can substantially shorten the time of FFT analysis processing and optimize the real-time performance of the Φ-OTDR sensing system.

Evidently, the feasibility and validity of our parallel computation method based on the GPU are verified under long-distance and large-data-volume conditions for the differential accumulation algorithm and FFT processing. Therefore, this paper verifies that the distributed optical fiber sensor is another example of a GPU application. Because of the large amount of sensing data and the real-time performance requirements in practical applications of distributed optical fiber sensors, the GPU is an effective solution to improve the speed of data processing, with its strong parallel computing capability.
Based on the above discussion, the differential accumulation algorithm and FFT processing on the GPU lead to less time being consumed. However, the use of the GPU also increases the cost. Therefore, identifying when the GPU is necessary for real-time operation could be our next research direction. In this study, we only put forward a preliminary mathematical criterion, which could be optimized for real applications of the GPU.
In the experimental setup, the maximum real-time data throughput is 100 MB/s and the data acquisition speed is 50 MSa/s. If each sampling point is represented by 8 bits (1 byte), the acquisition time (Ta) and the transmission time (Tt) can be simplified as Ta = 2 s and Tt = 1 s for the same data volume of 100 MB. Thus, the time consumed in the lower system (Tls), including the FPGA board and the PCI-E bus, can be considered as Tls = max(Ta, Tt) = 2 s.

Meanwhile, considering the complexity of the algorithm, FFT processing consumes more time. Based on the linear relationship in Figure 14, the time consumed by FFT processing on the CPU (TCPU) and GPU (TGPU) is TCPU = 120 ms and TGPU = 1.2 ms when the point number of FFT processing is 8192 and the number of RBS curves is 1000. Thus, the time consumed can be calculated as TCPU = 1920 ms and TGPU = 19.2 ms if the number of RBS curves is NRBS = 16,000, with the repetition frequency of optical pulses of 8 kHz in our experimental setup. Considering that the CPU also needs memory allocation and reading and writing operations, the time consumed in the upper system (Tus), including the CPU and internal bus, should be greater than TCPU, which means that Tus = kTCPU with the coefficient k > 1.

As a result, if Tus > Tls, which means kTCPU > max(Ta, Tt), it is recommended to use the GPU to optimize the execution time of the processing algorithm. By contrast, if Tus < Tls, which means kTCPU < max(Ta, Tt), the use of the CPU may also be acceptable.
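The criterion can be captured in a few lines. This is a sketch using the numbers from the text; the value of the coefficient k used in the example call is illustrative, since the text only constrains k > 1.

```python
def gpu_recommended(t_cpu_s, k, t_acq_s, t_xfer_s):
    """Preliminary criterion from the text: use the GPU when the
    upper-system time T_us = k * T_CPU exceeds the lower-system time
    T_ls = max(T_a, T_t)."""
    t_us = k * t_cpu_s                # CPU processing plus overheads
    t_ls = max(t_acq_s, t_xfer_s)     # acquisition and transfer overlap
    return t_us > t_ls

# Numbers from the experimental setup: 100 MB at 50 MSa/s (1 byte per
# sample) gives T_a = 2 s and T_t = 1 s; 16,000 RBS curves give
# T_CPU = 1.92 s. With an assumed k = 1.5:
print(gpu_recommended(1.92, 1.5, 2.0, 1.0))   # 2.88 s > 2 s -> True
```

With these values, any k above roughly 1.04 already tips the balance toward the GPU.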

Conclusions
In this paper, the feasibility of speeding up the data processing rate of the Φ-OTDR sensing system by using the method of cooperation between CPU and GPU processing is discussed and experimentally verified. Two typical signal processing methods for the Φ-OTDR sensing system are realized in parallel operations based on the GPU and in traditional serial computations based on the CPU, respectively. For the differential accumulation algorithm, which can achieve vibration detection and location, serial computations based only on the CPU take about 21,250 times as much execution time as the parallel operations for the same data volume. Thus, the GPU greatly accelerates the processing speed of the differential accumulation algorithm for the Φ-OTDR sensing system. For the FFT analysis, which can precisely achieve vibration location and spectrum analysis, the computation with the GPU is approximately 140 times faster than serial computation under the same experimental conditions. This shows that the GPU can effectively improve the FFT processing speed of the Φ-OTDR sensing system. In addition, fast data collection and transmission are achieved by using a data acquisition system based on the FPGA. In conclusion, this novel data co-processing parallel computation method, which combines FPGA technology with the CPU and a GPU accelerator, can realize data acquisition, transmission, and processing in parallel, and eventually optimize the real-time performance of the Φ-OTDR sensing system.
In this paper, we use the underlying CUDA API to realize multi-thread parallel computing on the GPU for distributed optical fiber sensors; function encapsulation may become a research direction for embedded and portable application requirements, with the challenges of function optimization and algorithm simplification. In addition, in order to improve the processing speed of the whole system, collaborative research into parallel multi-processor operation will be a potential direction. For example, further research on the parallel operation of several CPUs or several GPUs may be conducted, which could promote the parallel operation of data processing algorithms in a Φ-OTDR system. Therefore, multi-processors may be a parallel operation solution for the academic and industrial communities.