# Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson’s Correlation Coefficients for Time Series Data—fMRI Study


## Abstract


## 1. Introduction

#### GPU Architecture, CUDA Programming Model and cuBLAS Library

## 2. Materials and Methods

#### 2.1. Space Storage Needed for Computing Correlations and Reordering Them

#### 2.2. Case 1: Correlation Computation Can Be Done in One Round

**Algorithm 1** Extracting ordered upper triangle part of correlation matrix

**Input:** $N\times N$ correlation matrix $S$

**Output:** Ordered correlation array $C$ of size $N(N-1)/2$

1: $idx=blockDim.x\ast blockIdx.x+threadIdx.x$

2: $i=idx/N$

3: $j=idx\%N$

4: **if** $i<j$ and $i<N$ and $j<N$ **then**

5: ${k}_{1}=i\times N-\frac{i\times (i+1)}{2}+j-i$

6: ${k}_{2}=j\times N+i$

7: $C[{k}_{1}-1]=S[{k}_{2}]$

8: **end if**
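The index arithmetic of Algorithm 1 can be checked on the CPU with a minimal sketch in plain Python, where each loop iteration stands in for one GPU thread. The function name `extract_upper_triangle`, the toy matrix, and the assumption that $S$ is stored as a flat column-major array (suggested by ${k}_{2}=j\times N+i$) are illustrative choices, not taken from the authors' code.

```python
def extract_upper_triangle(S, N):
    """Emulate Algorithm 1 sequentially; idx plays the role of the thread index.

    S is assumed to be an N x N matrix flattened in column-major order.
    """
    C = [0.0] * (N * (N - 1) // 2)
    for idx in range(N * N):
        i, j = idx // N, idx % N              # integer division and remainder
        if i < j and i < N and j < N:
            k1 = i * N - i * (i + 1) // 2 + j - i
            k2 = j * N + i
            C[k1 - 1] = S[k2]
    return C


N = 4
# Toy symmetric matrix: entry (i, j) encodes its own pair as 10*min + max.
S = [0.0] * (N * N)
for i in range(N):
    for j in range(N):
        S[j * N + i] = float(10 * min(i, j) + max(i, j))

C = extract_upper_triangle(S, N)
print(C)  # [1.0, 2.0, 3.0, 12.0, 13.0, 23.0]
```

The output lists the pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3) in that order, so the array holds the upper triangle row by row.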

#### 2.3. Case 2: Correlation Computation Needs to Be Performed in Multiple Rounds

**Algorithm 2** Extracting ordered upper triangle part of correlation matrix (case 2)

**Input:** $B\times {N}^{\prime}$ correlation matrix $S$

**Output:** Ordered correlation array $C$ of size ${N}^{\prime}B-B(B+1)/2$

1: $idx=blockDim.x\ast blockIdx.x+threadIdx.x$

2: $i=idx/{N}^{\prime}$

3: $j=idx\%{N}^{\prime}$

4: **if** $i<j$ and $i<B$ and $j<{N}^{\prime}$ **then**

5: ${k}_{1}=i\times {N}^{\prime}-\frac{i\times (i+1)}{2}+j-i$

6: ${k}_{2}=j\times B+i$

7: $C[{k}_{1}-1]=S[{k}_{2}]$

8: **end if**
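The same kind of check works for the rectangular case. The sketch below emulates Algorithm 2 sequentially in plain Python. The name `extract_block`, the toy values, and the assumption of a flat column-major $B\times {N}^{\prime}$ array (suggested by ${k}_{2}=j\times B+i$) are illustrative only.

```python
def extract_block(S, B, N_prime):
    """Emulate Algorithm 2: keep entries with column index above the row index.

    S is assumed to be a B x N' matrix flattened in column-major order.
    """
    C = [0.0] * (N_prime * B - B * (B + 1) // 2)
    for idx in range(B * N_prime):
        i, j = idx // N_prime, idx % N_prime
        if i < j and i < B and j < N_prime:
            k1 = i * N_prime - i * (i + 1) // 2 + j - i
            k2 = j * B + i
            C[k1 - 1] = S[k2]
    return C


B, N_prime = 2, 5
# Toy block: entry (i, j) is stored as 10*i + j so its origin stays visible.
S = [0.0] * (B * N_prime)
for i in range(B):
    for j in range(N_prime):
        S[j * B + i] = float(10 * i + j)

block = extract_block(S, B, N_prime)
print(block)  # [1.0, 2.0, 3.0, 4.0, 12.0, 13.0, 14.0]
```

With $B=2$ and ${N}^{\prime}=5$ the output has ${N}^{\prime}B-B(B+1)/2=7$ entries: four from the first row and three from the second.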

#### 2.4. Overall Algorithm

**Algorithm 3** Fast-GPU-PCC

**Input:** $N\times M$ matrix $U$ of time series data

**Output:** Correlation array $C$ of size $N(N-1)/2$

1: Preprocess the fMRI data using Equation (3)

2: Copy the normalized data to GPU global memory

3: $B=X/(2N)$

4: **if** $B>N$ **then**

5: Multiply matrix $U$ by its transpose ${U}^{T}$

6: Extract the upper triangle part of the matrix using Algorithm 1

7: Transfer the correlation array to the CPU

8: **else**

9: $i=0$, ${N}^{\prime}=N$

10: **while** $i<N$ **do**

11: Multiply rows $i$ to $i+B$ of matrix $U$ by columns $i$ to $N$ of ${U}^{T}$

12: Extract the upper triangle part of the correlation matrix using Algorithm 2

13: Transfer the extracted correlations to the CPU

14: $i=i+B$

15: ${N}^{\prime}={N}^{\prime}-B$

16: $B=X/(2{N}^{\prime})$

17: **if** $B>{N}^{\prime}$ **then**

18: $B={N}^{\prime}$

19: **end if**

20: **end while**

21: **end if**
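The control flow of Algorithm 3 can be illustrated with a small CPU-only sketch in plain Python. It is not the authors' implementation and uses neither CUDA nor cuBLAS. Several points are assumptions: Equation (3) is not reproduced here, so `normalize` uses mean-centering followed by unit-length scaling (which makes a dot product equal to Pearson's correlation). `X` is treated as a hypothetical budget counted in matrix elements. The names `normalize`, `dot`, `fast_pcc` and `n_rem` are invented for illustration. The sketch also folds the one-round case into the loop (a large `X` simply yields $B=N$) and clamps the block size to at least 1, which Algorithm 3 does not state.

```python
import math
import random


def normalize(row):
    """Center a series and scale it to unit length (assumed preprocessing),
    so the dot product of two normalized rows equals their Pearson correlation."""
    mean = sum(row) / len(row)
    centered = [v - mean for v in row]
    norm = math.sqrt(sum(v * v for v in centered))  # zero for a constant series
    return [v / norm for v in centered]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fast_pcc(data, X):
    """Block-wise pairwise correlations in row-major upper-triangle order.

    X is a hypothetical memory budget, counted in matrix elements.
    """
    U = [normalize(row) for row in data]
    N = len(U)
    C = []
    i, n_rem = 0, N                      # n_rem plays the role of N'
    while i < N:
        # Block size from the budget, clamped to [1, N']; the lower clamp is
        # an extra guard in this sketch, not part of Algorithm 3.
        B = max(1, min(X // (2 * n_rem), n_rem))
        # Rows i..i+B-1 of U times columns i..N-1 of U^T: a B x N' block.
        S = [[dot(U[i + r], U[i + c]) for c in range(n_rem)] for r in range(B)]
        # Keep only entries strictly right of the block diagonal.
        for r in range(B):
            C.extend(S[r][r + 1:])
        i += B
        n_rem -= B
    return C


random.seed(0)
data = [[random.random() for _ in range(6)] for _ in range(5)]
full = fast_pcc(data, X=10**6)     # budget large enough for a single round
blocked = fast_pcc(data, X=20)     # small budget forces two rounds (B = 2, then 3)
print(len(full), all(abs(a - b) < 1e-12 for a, b in zip(full, blocked)))
```

Both calls return the $N(N-1)/2=10$ coefficients in the same order, which is the property the block-wise scheme relies on: shrinking the budget changes the number of rounds but not the resulting array.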

## 3. Results

#### 3.1. Increasing Number of Voxels

#### 3.2. Increasing the Length of Time Series

#### 3.3. Experiments on Real Experimental Data

## 4. Discussion

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

1. Craddock, R.C.; Tungaraza, R.L.; Milham, M.P. Connectomics and new approaches for analyzing human brain functional connectivity. GigaScience **2015**, 4, 13.
2. Raschka, S.; Mirjalili, V. Python Machine Learning; Packt Publishing Ltd.: Birmingham, UK, 2017.
3. Hosseini-Asl, E.; Gimel’farb, G.; El-Baz, A. Alzheimer’s disease diagnostics by a deeply supervised adaptable 3D convolutional network. arXiv, 2016; arXiv:1607.00556.
4. Lindquist, M.A. The statistical analysis of fMRI data. Stat. Sci. **2008**, 23, 439–464.
5. Zhang, H.; Tian, J.; Zhen, Z. Direct measure of local region functional connectivity by multivariate correlation technique. In Proceedings of the 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Lyon, France, 22–26 August 2007; pp. 5231–5234.
6. Wang, Y.; Cohen, J.D.; Li, K.; Turk-Browne, N.B. Full Correlation Matrix Analysis of fMRI Data; Technical Report; Princeton Neuroscience Institute: Princeton, NJ, USA, 2014.
7. Chang, C.; Glover, G.H. Time-frequency dynamics of resting-state brain connectivity measured with fMRI. Neuroimage **2010**, 50, 81–98.
8. Lee Rodgers, J.; Nicewander, W.A. Thirteen ways to look at the correlation coefficient. Am. Stat. **1988**, 42, 59–66.
9. Jiao, Y.; Zhang, Y.; Wang, Y.; Wang, B.; Jin, J.; Wang, X. A novel multilayer correlation maximization model for improving CCA-based frequency recognition in SSVEP brain–computer interface. Int. J. Neural Syst. **2017**, 28, 1750039.
10. Eslami, T.; Saeed, F. Similarity based classification of ADHD using singular value decomposition. In Proceedings of the ACM International Conference on Computing Frontiers 2018, Ischia, Italy, 8–10 May 2018.
11. Liang, X.; Wang, J.; Yan, C.; Shu, N.; Xu, K.; Gong, G.; He, Y. Effects of different correlation metrics and preprocessing factors on small-world brain functional networks: A resting-state functional MRI study. PLoS ONE **2012**, 7, e32766.
12. Zhang, Y.; Zhang, H.; Chen, X.; Lee, S.W.; Shen, D. Hybrid high-order functional connectivity networks using resting-state functional MRI for mild cognitive impairment diagnosis. Sci. Rep. **2017**, 7, 6530.
13. Zhao, X.; Liu, Y.; Wang, X.; Liu, B.; Xi, Q.; Guo, Q.; Jiang, H.; Jiang, T.; Wang, P. Disrupted small-world brain networks in moderate Alzheimer’s disease: A resting-state fMRI study. PLoS ONE **2012**, 7, e33540.
14. Godwin, D.; Ji, A.; Kandala, S.; Mamah, D. Functional connectivity of cognitive brain networks in schizophrenia during a working memory task. Front. Psychiatry **2017**, 8, 294.
15. Baggio, H.C.; Sala-Llonch, R.; Segura, B.; Marti, M.J.; Valldeoriola, F.; Compta, Y.; Tolosa, E.; Junqué, C. Functional brain networks and cognitive deficits in Parkinson’s disease. Hum. Brain Mapp. **2014**, 35, 4620–4634.
16. Craddock, R.C.; James, G.A.; Holtzheimer, P.E.; Hu, X.P.; Mayberg, H.S. A whole brain fMRI atlas generated via spatially constrained spectral clustering. Hum. Brain Mapp. **2012**, 33, 1914–1928.
17. Gembris, D.; Neeb, M.; Gipp, M.; Kugel, A.; Männer, R. Correlation analysis on GPU systems using NVIDIA’s CUDA. J. Real-Time Image Process. **2011**, 6, 275–280.
18. Liu, Y.; Pan, T.; Aluru, S. Parallel pairwise correlation computation on intel xeon phi clusters. In Proceedings of the 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Los Angeles, CA, USA, 26–28 October 2016; pp. 141–149.
19. Liang, M.; Zhang, F.; Jin, G.; Zhu, J. FastGCN: A GPU accelerated tool for fast gene co-expression networks. PLoS ONE **2015**, 10, e0116776.
20. Wang, Y.; Du, H.; Xia, M.; Ren, L.; Xu, M.; Xie, T.; Gong, G.; Xu, N.; Yang, H.; He, Y. A hybrid CPU-GPU accelerated framework for fast mapping of high-resolution human brain connectome. PLoS ONE **2013**, 8, e62789.
21. Eslami, T.; Awan, M.G.; Saeed, F. GPU-PCC: A GPU-based Technique to Compute Pairwise Pearson’s Correlation Coefficients for Big fMRI Data. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Boston, MA, USA, 20–23 August 2017; pp. 723–728.
22. Luo, J.; Wu, M.; Gopukumar, D.; Zhao, Y. Big data application in biomedical research and health care: A literature review. Biomed. Inform. Insights **2016**, 8, BII-S31559.
23. Vargas-Perez, S.; Saeed, F. A hybrid MPI-OpenMP strategy to speedup the compression of big next-generation sequencing datasets. IEEE Trans. Parallel Distrib. Syst. **2017**, 28, 2760–2769.
24. Saeed, F.; Perez-Rathke, A.; Gwarnicki, J.; Berger-Wolf, T.; Khokhar, A. A high performance multiple sequence alignment system for pyrosequencing reads from multiple reference genomes. J. Parallel Distrib. Comput. **2012**, 72, 83–93.
25. Awan, M.G.; Saeed, F. An out-of-core GPU-based dimensionality reduction algorithm for big mass spectrometry data and its application in bottom-up proteomics. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Boston, MA, USA, 20–23 August 2017; pp. 550–555.
26. Saeed, F.; Hoffert, J.D.; Knepper, M.A. A high performance algorithm for clustering of large-scale protein mass spectrometry data using multi-core architectures. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Niagara, ON, Canada, 25–28 August 2013; pp. 923–930.
27. Schatz, M.C. CloudBurst: Highly sensitive read mapping with MapReduce. Bioinformatics **2009**, 25, 1363–1369.
28. Pandey, R.V.; Schlötterer, C. DistMap: A toolkit for distributed short read mapping on a Hadoop cluster. PLoS ONE **2013**, 8, e72614.
29. Lewis, S.; Csordas, A.; Killcoyne, S.; Hermjakob, H.; Hoopmann, M.R.; Moritz, R.L.; Deutsch, E.W.; Boyle, J. Hydra: A scalable proteomic search engine which utilizes the Hadoop distributed computing framework. BMC Bioinform. **2012**, 13, 324.
30. Wang, S.; Kim, J.; Jiang, X.; Brunner, S.F.; Ohno-Machado, L. GAMUT: GPU accelerated microRNA analysis to uncover target genes through CUDA-miRanda. BMC Med. Genom. **2014**, 7, S9.
31. Liu, Y.; Wirawan, A.; Schmidt, B. CUDASW++ 3.0: Accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions. BMC Bioinform. **2013**, 14, 117.
32. Eklund, A.; Andersson, M.; Knutsson, H. fMRI analysis on the GPU-possibilities and challenges. Comput. Methods Progr. Biomed. **2012**, 105, 145–161.
33. Sanders, J.; Kandrot, E. CUDA by Example: An Introduction to General-Purpose GPU Programming; Addison-Wesley Professional: Boston, MA, USA, 2010.
34. Awan, M.G.; Saeed, F. GPU-ArraySort: A parallel, in-place algorithm for sorting large number of arrays. In Proceedings of the 2016 45th International Conference on Parallel Processing Workshops (ICPPW), Philadelphia, PA, USA, 16–19 August 2016; pp. 78–87.
35. NVIDIA. cuBLAS. Available online: http://docs.nvidia.com/cuda/cublas/index.html#axzz4VJn7wpRs (accessed on 1 February 2017).
36. Fast-GPU-PCC, GitHub Repository. Available online: https://github.com/pcdslab/Fast-GPU-PCC (accessed on 28 February 2018).
37. Functional Connectomes Project. FCP Classic Data Sharing Samples. Available online: http://fcon_1000.projects.nitrc.org/fcpClassic/FcpTable.html (accessed on 1 February 2017).

**Figure 1.** (**a**,**b**) Examples of two possible orders for the PCC matrix and their resulting correlation arrays. In part (**a**), the first $N-1$ elements of the array are the Pearson’s correlations between the first variable and all other variables, the next $N-2$ elements are the correlations of the second variable with the remaining variables, and so on. In part (**b**), the last $N-1$ elements are the correlations of the last element with the rest of the elements, the $N-2$ elements before them are the correlations of the $(N-1)$th element with the rest of the elements, and so on.

**Figure 2.** Space needed for computing the correlations of the first B voxels with the rest of the voxels. Pairwise correlations are computed by multiplying a matrix containing the time series of B voxels by a matrix containing the time series of all voxels, which results in a matrix of $N\times B$ elements. This matrix has $NB-B(B+1)/2$ distinct correlation coefficients that need to be extracted and stored in the resulting correlation array.
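The count of distinct coefficients in this caption can be verified with a short derivation. Assuming a zero-based row index $i$ within the block, row $i$ of the $B\times N$ product contributes the $N-1-i$ entries to the right of its diagonal position, so

$$\sum_{i=0}^{B-1}(N-1-i)=B(N-1)-\frac{B(B-1)}{2}=NB-\frac{B(B+1)}{2}.$$

For $B=N$ this reduces to $N(N-1)/2$, the size of the full correlation array. The product matrix ($NB$ elements) and the extracted array together therefore need fewer than $2NB$ elements. If $X$ denotes the number of elements that fit in free GPU memory (an assumption, since its definition is not reproduced here), this bound is consistent with a block size of $B=X/(2N)$.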

**Figure 4.** Overall process of Fast-GPU-PCC. In part (**a**), the fMRI time series are normalized on the CPU and transferred to GPU memory, and the block size B is computed using Equation (6). If B is larger than N, the whole computation can be performed in one round, as shown in part (**b**): the whole normalized matrix is multiplied by its transpose, and the upper triangle is extracted and transferred back to the CPU. If the block size computed in part (**a**) is smaller than N, only the pairwise correlations of B voxels with the rest of the voxels can be computed in one round, as shown in part (**c**): after the correlations of the first B voxels with the rest of the voxels are computed and transferred back to the CPU, a new block size is computed, and this process is repeated until all pairwise correlations are computed.

| Number of Voxels (N) | GPU-PCC | Fast-GPU-PCC | Wang et al. [20] | CPU Version |
|---|---|---|---|---|
| 20,000 | 0.73 | 0.47 | 1.58 | 15.65 |
| 30,000 | 1.6 | 0.96 | 3.28 | 35.23 |
| 40,000 | 2.9 | 2.40 | 5.71 | 62.81 |
| 50,000 | 4.5 | 3.2 | 8.65 | 98.8 |
| 60,000 | 6.5 | 4.7 | 12.15 | 143 |
| 70,000 | 8.8 | 6.07 | 16.2 | 202 |
| 80,000 | 11.6 | 7.7 | 21.56 | 270 |
| 90,000 | 14.7 | 8.9 | 26.95 | 341 |
| 100,000 | 18.14 | 10.9 | 31.99 | 424 |

| Number of Voxels (N) | GPU-PCC | Wang et al. [20] | CPU Version |
|---|---|---|---|
| 20,000 | 1.55 | 3.36 | 33.29 |
| 30,000 | 1.66 | 3.41 | 36.69 |
| 40,000 | 1.2 | 2.37 | 26.17 |
| 50,000 | 1.4 | 2.7 | 30.8 |
| 60,000 | 1.38 | 2.58 | 30.42 |
| 70,000 | 1.4 | 2.66 | 33.27 |
| 80,000 | 1.5 | 2.8 | 35.06 |
| 90,000 | 1.65 | 3.02 | 38.31 |
| 100,000 | 1.66 | 2.93 | 38.89 |
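The values in the table above appear to be the ratio of each method's running time to that of Fast-GPU-PCC. That reading is inferred from the numbers rather than stated in a caption. The sketch below recomputes the $N=20{,}000$ row from the running times listed for that size; the dictionary and the helper `speedup_over` are illustrative names.

```python
# Running times for N = 20,000 as listed in the running-time table.
times = {
    "GPU-PCC": 0.73,
    "Fast-GPU-PCC": 0.47,
    "Wang et al. [20]": 1.58,
    "CPU Version": 15.65,
}


def speedup_over(times, baseline="Fast-GPU-PCC"):
    """Ratio of each method's running time to the baseline method's time."""
    base = times[baseline]
    return {name: t / base for name, t in times.items() if name != baseline}


speedups = speedup_over(times)
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}")
```

The computed ratios agree with the tabulated 1.55, 3.36 and 33.29 to within rounding in the last digit.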

| Length of Time Series (M) | GPU-PCC | Fast-GPU-PCC | Wang et al. [20] | CPU Version |
|---|---|---|---|---|
| 50 | 5.6 | 4.6 | 12.08 | 62 |
| 100 | 6.5 | 4.7 | 12.15 | 143 |
| 200 | 10.08 | 4.9 | 12.48 | 335.9 |
| 300 | 14.7 | 5.18 | 13.22 | 514 |
| 400 | 17.85 | 5.42 | 13.43 | 689.3 |
| 500 | 23.32 | 5.76 | 13.63 | 862.4 |

| Length of Time Series (M) | GPU-PCC | Wang et al. [20] | CPU Version |
|---|---|---|---|
| 50 | 1.21 | 2.62 | 13.47 |
| 100 | 1.38 | 2.58 | 30.4 |
| 200 | 2.05 | 2.54 | 68.5 |
| 300 | 2.83 | 2.55 | 99.22 |
| 400 | 3.29 | 2.47 | 127.12 |
| 500 | 4.04 | 2.36 | 149.72 |

| Fast-GPU-PCC | GPU-PCC | Wang et al. [20] | CPU Version |
|---|---|---|---|
| 9.26 | 20.83 | 27.46 | 577 |

| GPU-PCC | Wang et al. [20] | CPU Version |
|---|---|---|
| 2.24 | 2.96 | 62.3 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Eslami, T.; Saeed, F.
Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson’s Correlation Coefficients for Time Series Data—fMRI Study. *High-Throughput* **2018**, *7*, 11.
https://doi.org/10.3390/ht7020011
