# Parallelization Strategies for Graph-Code-Based Similarity Search


## Abstract


## 1. Introduction and Motivation

## 2. State of the Art and Related Work

#### 2.1. Information Retrieval

#### 2.2. Multimedia Features and Multimedia Feature Graphs

#### 2.3. Graph Codes and Algorithms

```text
for each GC in collection
    --- parallelize ---
    calculate the intersection matrices of GC_Query and GC
        --- parallelize each ---
        calculate M_F of GC_Query and GC
        calculate M_FR of GC_Query and GC
        calculate M_RT of GC_Query and GC
        --- end parallelize each ---
    compare
    --- end parallelize ---
order result list according to
    value of M_F
    value of M_FR where M_F is equal
    value of M_RT where M_F and M_FR are equal
return result list
```

#### 2.4. Parallel Computing

#### 2.5. Discussion and Open Challenges

## 3. Modeling and Design

#### 3.1. Parallel Graph Code Algorithms

- **Task 1:** Calculate the metrics for each pair of Graph Codes $GC_{i} \times GC_{query}$ (see Figure 4). The metric calculation itself can be decomposed by the data decomposition method.
- **Task 2:** For each element in the dictionary and the matrix of $GC_{query}$, find the matching elements in $GC_{i}$ and calculate the values according to the metrics (see Figure 5).
- **Task 3:** After the metrics are calculated, the result set should be ordered.

**Task 1** and **Task 2** are part of the sub-use case Parallel Graph Code Metric calculation of the use case similarity search. **Task 3** maps to the use case Parallel Ordering of the use case similarity search.

**Task 1**, the iteration over the collection, is an obvious candidate for parallel execution because the individual metric calculations have no interdependencies, as shown in the TDG in Figure 4. A metric calculation takes two Graph Codes as input. For a similarity search, one input is the reference Graph Code $GC_{query}$, and the other is an element $GC_{i}$ from the collection $GC_{coll}$. Each of these calculations can be executed independently and thus in parallel in any order. The calculated metric values are stored in a result array, where the position of each value corresponds to the position of the Graph Code in the collection. The parallelization can be realized at the thread level and can therefore run on multicore CPUs and GPUs, as well as with distributed processing. Listing 1 shows the start and end of the parallelization for **Task 1**.

Listing 1. Pseudocode of Task 1 parallelization.

```text
--- START TASK 1 -> TLP parallelize ---
for each GC in collection
    --- calculate the intersection of GC_Query and GC ---
    for each element in GC_Q->VM
        Check Intersection with GC
    Calculate M_F, M_FR, M_RT from m_res array values
--- END TASK 1 TLP parallelize ---
order result list according to
    value of M_F
    value of M_FR where M_F is equal
    value of M_RT where M_F and M_FR are equal
return result list
```
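The thread-level parallelization of Task 1 can be sketched on the CPU as follows. This is a minimal illustration, not the authors' implementation: the `GraphCode` struct (reduced to a dictionary of term ids), the simplified `featureMetric` stand-in for $M_F$, and the strided work split are all assumptions made for the example. It shows the property the text relies on: each worker writes only to the result slots that belong to its own Graph Codes, so no synchronization is needed beyond joining the threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical, simplified Graph Code: only the dictionary (feature term ids).
struct GraphCode {
    std::vector<unsigned int> dict;
};

// Simplified stand-in for M_F: fraction of query terms that also occur in gc.
float featureMetric(const GraphCode& query, const GraphCode& gc) {
    if (query.dict.empty()) return 0.0f;
    std::size_t common = 0;
    for (unsigned int term : query.dict) {
        if (std::find(gc.dict.begin(), gc.dict.end(), term) != gc.dict.end()) {
            ++common;
        }
    }
    return static_cast<float>(common) / static_cast<float>(query.dict.size());
}

// Task 1: every worker processes a strided share of the collection and writes
// to result positions that match the positions in the collection.
std::vector<float> parallelCompute(const GraphCode& query,
                                   const std::vector<GraphCode>& collection,
                                   unsigned int numThreads) {
    assert(numThreads > 0);
    std::vector<float> results(collection.size(), 0.0f);
    std::vector<std::thread> workers;
    for (unsigned int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&, t]() {
            for (std::size_t i = t; i < collection.size(); i += numThreads) {
                results[i] = featureMetric(query, collection[i]);
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return results;
}
```

Calling `parallelCompute(query, collection, 4)` returns one value per Graph Code in collection order; the ordering step would then run on that array.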

**Task 2**, finding the intersections of the two matrices, can be processed in parallel, but the matches, as well as the results of the checks for the feature relationship metric and the feature relationship type metric, need to be stored in intermediate storage, as shown in Figure 5. This part of **Task 2** is named **Task 2a**. In a subsequent step, the stored values are summed up and used to calculate the final metric values. This step is **Task 2b**. For large matrices with millions of elements, which result from a high level of detail (LOD), the summation can be done in parallel with reduction approaches [51]. Listing 2 shows the start and end of the parallelization for **Task 2**.

Listing 2. Pseudocode of Task 2 parallelization.

```text
for each GC in collection
    --- calculate the intersection of GC_Query and GC ---
    --- START TASK 2a -> SIMD parallelize ---
    for each element in GC_Q->VM
        Check Intersection with GC, store in m_res array
        ...
    --- END TASK 2a SIMD parallelize ---
    --- START TASK 2b -> TLP parallelize ---
    reduce m_res arrays
    --- END TASK 2b TLP parallelize ---
    Calculate M_F, M_FR, M_RT from m_res array values
order result list according to
    value of M_F
    value of M_FR where M_F is equal
    value of M_RT where M_F and M_FR are equal
return result list
```
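The split into Task 2a and Task 2b can be illustrated with a small host-side sketch. It is written sequentially for readability and is only an assumption about how the steps could look: `markIntersections` fills an intermediate `m_res`-style array in which every slot is written by exactly one iteration, and `treeReduce` sums that array in passes whose additions are mutually independent, which is the access pattern a parallel reduction exploits. The function names and the reduction to dictionary terms are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Task 2a: one independent check per query term. Slot i of the intermediate
// array is written only by iteration i, so iterations could run in parallel.
std::vector<int> markIntersections(const std::vector<unsigned int>& queryDict,
                                   const std::vector<unsigned int>& gcDict) {
    std::vector<int> mRes(queryDict.size(), 0);
    for (std::size_t i = 0; i < queryDict.size(); ++i) {
        for (unsigned int term : gcDict) {
            if (term == queryDict[i]) {
                mRes[i] = 1;
                break;
            }
        }
    }
    return mRes;
}

// Task 2b: tree-style reduction. Within one pass (fixed stride), the additions
// touch disjoint elements, so a pass could be executed as one parallel step.
int treeReduce(std::vector<int> values) {
    if (values.empty()) return 0;
    for (std::size_t stride = 1; stride < values.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < values.size(); i += 2 * stride) {
            values[i] += values[i + stride];
        }
    }
    return values[0];
}

// Simplified stand-in for M_F: share of query terms that were found.
float featureMetric(const std::vector<int>& mRes) {
    assert(!mRes.empty());
    return static_cast<float>(treeReduce(mRes)) / static_cast<float>(mRes.size());
}
```

For `n` intermediate values, the reduction needs about `log2(n)` passes instead of `n - 1` dependent additions, which is why it pays off mainly for large matrices.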

**Task 3**, the ordering of the metrics, has a dependency on all preceding tasks because it needs all the calculated metrics as input. The task itself can be done with parallel versions of QuickSort [52] or RadixSort [53].
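The ordering criterion of Task 3 can be written down as a comparison function; a parallel QuickSort or RadixSort would apply the same criterion. The sketch below uses a sequential `std::sort` and makes several assumptions: the field names are borrowed from the `Metrics` usage in Listing 4 (with `similarity`, `recommendation`, and `inferencing` read as $M_F$, $M_{FR}$, and $M_{RT}$), higher values are assumed to rank first, and the collection index is used as a final tie-breaker to keep the result deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical result record: collection index plus the three metric values.
struct Metrics {
    int idx;
    float similarity;      // read as M_F
    float recommendation;  // read as M_FR
    float inferencing;     // read as M_RT
};

// Orders descending by M_F, then by M_FR where M_F is equal, then by M_RT
// where both are equal; the index (ascending) breaks complete ties.
void orderResults(std::vector<Metrics>& results) {
    auto ranksBefore = [](const Metrics& a, const Metrics& b) {
        return std::tie(b.similarity, b.recommendation, b.inferencing, a.idx) <
               std::tie(a.similarity, a.recommendation, a.inferencing, b.idx);
    };
    std::sort(results.begin(), results.end(), ranksBefore);
    assert(std::is_sorted(results.begin(), results.end(), ranksBefore));
}
```

Comparing the three values as a tuple avoids packing them into a single floating-point key, so small differences in $M_F$ cannot be overridden by a lower-priority metric.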

**Task 1** can run in parallel with a sequential metric calculation or with a parallel metric calculation, as described for **Task 2** and shown in Figure 7. Executing **Task 1** sequentially and running only **Task 2** in parallel is also possible, as shown before in Figure 5. Listing 3 shows the start and end of each task and the type of parallelization for all tasks.

Listing 3. Pseudocode of Task 1, 2, and 3 parallelization.

```text
--- START TASK 1 -> TLP parallelize ---
for each GC in collection
    --- calculate the intersection of GC_Query and GC ---
    --- START TASK 2a -> SIMD parallelize ---
    for each element in GC_Q->VM
        Check Intersection with GC, store in m_res array
        ...
    --- END TASK 2a SIMD parallelize ---
    --- START TASK 2b -> TLP parallelize ---
    reduce m_res arrays
    --- END TASK 2b TLP parallelize ---
    Calculate M_F, M_FR, M_RT from m_res array values
--- END TASK 1 TLP parallelize ---
--- START TASK 3 -> TLP parallelize ---
order result list according to
    value of M_F
    value of M_FR where M_F is equal
    value of M_RT where M_F and M_FR are equal
--- END TASK 3 TLP parallelize ---
return result list
```

#### 3.2. Definitions

- For reference, we define the sequential algorithm without parallel steps as **Sequential** ($SEQ$).
- We define the (thread-level) parallelization of **Task 1** only as **Parallel GC Compute** ($PC$).
- The parallel computation of Graph Code metrics, described as **Task 2a**, is defined as **Parallel Metric Sequential Reduce** ($PM$).
- The combination of **Tasks 2a** and **2b** is defined as **Parallel Metric Parallel Reduce** ($PMPR$).
- For very large Graph Codes, it may be useful to use $PM$ alone, but it can be combined with $PC$, which we define as **Parallel GC Compute with Parallel Metric** ($PCPM$).
- Accordingly, $PC$ in combination with $PMPR$ is defined as **Parallel GC Compute with Parallel Metric Parallel Reduce** ($PCPMPR$).
- If parallel sorting ($PS$) is also applied, we define the combination with $PC$ as **Parallel Compute and Parallel Sort** ($PCPS$).
- Finally, the combination of all parallelized tasks, $PC$ with $PMPR$ and $PS$, is defined as **Parallel Compute Parallel Metric Parallel Reduce Parallel Sort** ($PCPMPRPS$).
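The naming scheme above is compositional: each abbreviation states which tasks run in parallel. A small sketch can make that explicit. The `Strategy` struct and the `abbreviation` function are hypothetical helpers introduced only for illustration; they encode the assumption that a parallel reduce (Task 2b) is only used together with a parallel metric calculation (Task 2a).

```cpp
#include <cassert>
#include <string>

// Hypothetical description of which tasks run in parallel.
struct Strategy {
    bool parallelCompute;  // Task 1  (PC)
    bool parallelMetric;   // Task 2a (PM)
    bool parallelReduce;   // Task 2b (PR), assumed to require Task 2a
    bool parallelSort;     // Task 3  (PS)
};

// Builds the abbreviation by concatenating the parts that are enabled;
// a strategy without any parallel task is the sequential algorithm.
std::string abbreviation(const Strategy& s) {
    assert(!s.parallelReduce || s.parallelMetric);
    std::string name;
    if (s.parallelCompute) name += "PC";
    if (s.parallelMetric) name += "PM";
    if (s.parallelReduce) name += "PR";
    if (s.parallelSort) name += "PS";
    return name.empty() ? "SEQ" : name;
}
```

For example, enabling Task 1 and Task 3 only yields `PCPS`, and enabling all four flags yields `PCPMPRPS`.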

#### 3.3. Potential Parallelization on Modern Processors

**Task 2** algorithms constitute data-level parallelization: the data are loaded once, and each item is processed. Data-level parallelization is more suitable for GPUs, but in the case of heterogeneous Graph Codes, thread divergence can limit its effectiveness. However, the low degree of dependency allows many calculations to be processed in parallel, employing all available cores. Given a processor with about 16,000 cores, such as the Nvidia H100 [54], about 16,000 Graph Code metric calculations can be performed in parallel.

**Task 2a** could replace the search for corresponding feature vocabulary terms in the Graph Code to compare with a lookup in a table, such as an inverted list [56]. This approach could be applied to both the sequential and the parallel versions of the algorithms.
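How such a lookup table might work can be sketched with an inverted list that maps each feature vocabulary term to the Graph Codes containing it. This is an illustrative assumption, not part of the described implementation: the `InvertedList` alias, the function names, and the representation of a dictionary as a vector of unique term ids are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical inverted list: term id -> positions of the Graph Codes
// in the collection whose dictionary contains that term.
using InvertedList = std::unordered_map<unsigned int, std::vector<std::size_t>>;

// Built once for the whole collection; terms are assumed unique per dictionary.
InvertedList buildInvertedList(const std::vector<std::vector<unsigned int>>& dicts) {
    InvertedList inverted;
    for (std::size_t gc = 0; gc < dicts.size(); ++gc) {
        for (unsigned int term : dicts[gc]) {
            inverted[term].push_back(gc);
        }
    }
    return inverted;
}

// For every Graph Code, counts how many query terms it shares, using one
// hash lookup per query term instead of a search through each dictionary.
std::vector<std::size_t> countCommonTerms(const std::vector<unsigned int>& queryDict,
                                          const InvertedList& inverted,
                                          std::size_t numberOfGcs) {
    std::vector<std::size_t> common(numberOfGcs, 0);
    for (unsigned int term : queryDict) {
        auto entry = inverted.find(term);
        if (entry == inverted.end()) continue;
        for (std::size_t gc : entry->second) {
            assert(gc < numberOfGcs);
            ++common[gc];
        }
    }
    return common;
}
```

The trade-off is additional memory and a build step for the table, in exchange for skipping Graph Codes that share no term with the query.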

#### 3.4. Theoretical Speedup

#### 3.5. Discussion

**Task 3** (ordering) needs to be computed on one node that holds all intermediate results. Implementing and evaluating this remains an open challenge.

**Task 3**) on the GPU and CPU, $PCPS$ was implemented for CUDA. Our implementation is discussed in Section 4, and the evaluation results are given in Section 5.

## 4. Implementation and Testing

**Parallel GC Compute** ($PC$). We implemented the algorithm in a sequential version, $CPU\ SEQ$; a CPU POSIX-Threads version, $CPU\ Parallel$; and a CUDA parallel version, $CUDA\ Parallel$. For the CPU parallel algorithm, the number of utilized cores can be set. The CUDA implementation is optimized for maximum GPU utilization. For $PC$ on CUDA, each Graph Code similarity metric calculation is packaged in a parallel executable unit, a kernel. Listing 4 shows parts of the kernel as a function declaration in lines 1–35. The listing demonstrates the process flow for a number of Graph Codes (numberOfGcs), stored at the pointers gcMatrixData and gcDictData (line 2) and accessible through the helper arrays gcMatrixOffsets, gcMatrixSizes, and gcDictOffsets (lines 3–4). First, the index of the Graph Code in the collection is determined with the CUDA thread model (line 6). Next, the values gcQuery (line 4) and index, which contain the positions of the two Graph Codes to compare, are used to access the data points in the corresponding arrays, for example, in line 11 or line 14. Lines 19–34 show the metric calculation and the storage of the values in the metrics array. After execution, the metrics array is transferred from the GPU memory to the main memory, as demonstrated in the function demoCalculateGCsOnCuda (lines 37–70), with the transfer in lines 65–67. This example also shows that most of the actual calculation is done in the CUDA kernel and, hence, in parallel. The sequential part of the algorithm is small, which is in line with the previously mentioned application of Gustafson’s law and the theoretical speedup calculation.

Listing 4. Partial code of the Parallel Graph Code metric calculation according to PC.

```text
 1  /* metric calculation */
 2  __global__ void cudaGcCompute(unsigned short *gcMatrixData, unsigned int *gcDictData,
 3          unsigned int *gcMatrixOffsets, unsigned int *gcMatrixSizes,
 4          unsigned int *gcDictOffsets, int gcQuery, int numberOfGcs, Metrics *metrics) {
 5
 6      unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
 7      if (index >= numberOfGcs)
 8          return;
 9
10      int sim = 0;
11      int elementsGc1 = sqrtf((float) gcMatrixSizes[gcQuery]);
12      int elementsGc2 = sqrtf((float) gcMatrixSizes[index]);
13
14      unsigned int off1 = gcDictOffsets[gcQuery];
15      unsigned int off2 = gcDictOffsets[index];
16
17      ... // Metric Calculation
18
19      metrics[index].similarity = 0.0;
20      metrics[index].recommendation = 0.0;
21      metrics[index].inferencing = 0.0;
22      metrics[index].similarity = (float) sim / (float) elementsGc1;
23      metrics[index].idx = index;
24      if (num_of_non_zero_edges > 0) {
25          /*edge_metric*/ metrics[index].recommendation =
26                  (float) edge_metric_count / (float) num_of_non_zero_edges;
27      }
28      if (edge_metric_count > 0) {
29          /*edge_type_metric*/ metrics[index].inferencing =
30                  (float) edge_type / (float) edge_metric_count;
31      }
32      metrics[index].compareValue = metrics[index].similarity
33              * 100000.0f + metrics[index].recommendation
34              * 100.0f + metrics[index].inferencing;
35  }
36
37  Metrics *demoCalculateGCsOnCuda(int numberOfGcs,
38                                  unsigned int dictCounter,
39                                  unsigned short *d_gcMatrixData,
40                                  unsigned int *d_gcDictData,
41                                  unsigned int *d_gcMatrixOffsets,
42                                  unsigned int *d_gcDictOffsets,
43                                  unsigned int *d_gcMatrixSizes,
44                                  int gcQueryPosition) {
45      Metrics *d_result;
46
47      HANDLE_ERROR(cudaMalloc((void **) &d_result, numberOfGcs * sizeof(Metrics)));
48
49      int gridDim = ceil((float) numberOfGcs / 1024.0);
50      int block = (numberOfGcs < 1024) ? numberOfGcs : 1024;
51
52      cudaGcCompute<<<gridDim, block>>>(d_gcMatrixData,
53                                        d_gcDictData,
54                                        d_gcMatrixOffsets,
55                                        d_gcMatrixSizes,
56                                        d_gcDictOffsets,
57                                        gcQueryPosition,
58                                        numberOfGcs,
59                                        d_result);
60
61      HANDLE_ERROR(cudaPeekAtLastError());
62      HANDLE_ERROR(cudaDeviceSynchronize());
63
64      Metrics *result = (Metrics *) malloc(numberOfGcs * sizeof(Metrics));
65      HANDLE_ERROR(cudaMemcpy(result, d_result,
66                              numberOfGcs * sizeof(Metrics), cudaMemcpyDeviceToHost));
67      HANDLE_ERROR(cudaFree(d_result));
68
69      return result;
70  }
```
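The launch geometry in lines 49–50 of Listing 4 assigns one thread per Graph Code and rounds the number of blocks up, so the last block may contain surplus threads that the guard in lines 7–8 discards. The following host-side sketch reproduces that calculation with integer arithmetic only. `LaunchConfig` and `launchConfigFor` are hypothetical names, and the block size limit of 1024 is taken from the listing rather than queried from a device.

```cpp
#include <cassert>

// Hypothetical container for the two launch parameters.
struct LaunchConfig {
    unsigned int blocks;
    unsigned int threadsPerBlock;
};

// One thread per Graph Code. The block count is rounded up with integer
// arithmetic so that every index below numberOfGcs is covered.
LaunchConfig launchConfigFor(unsigned int numberOfGcs,
                             unsigned int maxThreadsPerBlock = 1024) {
    assert(numberOfGcs > 0 && maxThreadsPerBlock > 0);
    unsigned int threads =
        numberOfGcs < maxThreadsPerBlock ? numberOfGcs : maxThreadsPerBlock;
    unsigned int blocks = (numberOfGcs + maxThreadsPerBlock - 1) / maxThreadsPerBlock;
    return {blocks, threads};
}
```

The integer form gives the same result as the `ceil` on a float quotient for the collection sizes discussed here and avoids the rounding that a conversion to `float` may introduce for very large counts.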

## 5. Evaluation

#### 5.1. Scalability of Parallel Graph Code Algorithms

#### 5.2. Efficiency for High LOD

#### 5.3. Impact of Graph Code Heterogeneity

#### 5.4. Summary

## 6. Conclusions and Future Work

## Author Contributions

## Funding

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Richter, F. Smartphones Cause Photography Boom. 2017. Available online: https://www.statista.com/chart/10913/number-of-photos-taken-worldwide/ (accessed on 29 January 2023).
2. DCTV Productions. Comparing Streaming Services. (1 January 2022). Available online: http://web.archive.org/web/20220101112312/https://dejaview.news/comparing-streaming-services/ (accessed on 21 January 2023).
3. DCTV Productions. Comparing Streaming Services. (21 August 2022). Available online: http://web.archive.org/web/20220828145828/https://dejaview.news/comparing-streaming-services/ (accessed on 21 January 2023).
4. Demner-Fushman, D.; Antani, S.; Simpson, M.; Thoma, G. Design and Development of a Multimodal Biomedical Information Retrieval System. J. Comput. Sci. Eng. **2012**, 6, 168–177.
5. National Library of Medicine. What Is Open-i? Available online: https://openi.nlm.nih.gov/faq#collection (accessed on 21 January 2023).
6. Jenik, C. A Minute on the Internet in 2021. Statista. 2022. Available online: https://www.statista.com/chart/25443/estimated-amount-of-data-created-on-the-internet-in-one-minute/ (accessed on 17 October 2022).
7. Meta Platforms Ireland Limited. Instagram Homepage. 2022. Available online: https://www.instagram.com/ (accessed on 21 January 2023).
8. Google. YouTube. Available online: http://www.youtube.com (accessed on 21 January 2023).
9. Cloud Computing. Wikipedia. Page Version ID: 1128212267. 2022. Available online: https://en.wikipedia.org/w/index.php?title=Cloud_computing&oldid=1128212267 (accessed on 26 December 2022).
10. Big Data. Wikipedia. Page Version ID: 1126395551. 2022. Available online: https://en.wikipedia.org/w/index.php?title=Big_data&oldid=1126395551 (accessed on 16 December 2022).
11. Machine Learning. Wikipedia. Page Version ID: 1128287216. 2022. Available online: https://en.wikipedia.org/w/index.php?title=Machine_learning&oldid=1128287216 (accessed on 19 December 2022).
12. Deep Learning. Wikipedia. Page Version ID: 1127713379. 2022. Available online: https://en.wikipedia.org/w/index.php?title=Deep_learning&oldid=1127713379 (accessed on 16 December 2022).
13. Dasiopoulou, S.; Mezaris, V.; Kompatsiaris, I.; Papastathis, V.; Strintzis, M. Knowledge-Assisted Semantic Video Object Detection. IEEE Trans. Circuits Syst. Video Technol. **2005**, 15, 1210–1224. Available online: http://ieeexplore.ieee.org/document/1512239/ (accessed on 21 January 2023).
14. Wagenpfeil, S.; Vu, B.; Mc Kevitt, P.; Hemmje, M. Fast and Effective Retrieval for Large Multimedia Collections. Big Data Cogn. Comput. **2021**, 5. Available online: https://www.mdpi.com/2504-2289/5/3/33 (accessed on 11 October 2022).
15. Raieli, R. Multimedia Information Retrieval: Theory and Techniques; Chandos Publishing: Cambridge, UK, 2013; ISBN 978-1843347224.
16. CXL.com. Reduce Your Server Response Time for Happy Users, Higher Rankings. 2021. Available online: https://cxl.com/blog/server-response-time/ (accessed on 12 October 2021).
17. Kirk, D.; Hwu, W. Programming Massively Parallel Processors: A Hands-On Approach; Elsevier, Morgan Kaufmann: Amsterdam, The Netherlands, 2013.
18. Apache™ Hadoop® Project. Apache Hadoop. Available online: https://hadoop.apache.org/ (accessed on 13 January 2022).
19. Singhal, A. Modern information retrieval: A brief overview. IEEE Data Eng. Bull. **2001**, 24, 35–43.
20. Davies, J.; Studer, R.; Warren, P. Semantic Web technologies: Trends and Research in Ontology-Based Systems; John Wiley & Sons: Hoboken, NJ, USA, 2006; OCLC: ocm64591941.
21. Wagenpfeil, S.; McKevitt, P.; Hemmje, M. AI-Based Semantic Multimedia Indexing and Retrieval for Social Media on Smartphones. Information **2021**, 12, 43.
22. Gurski, F.; Komander, D.; Rehs, C. On characterizations for subclasses of directed co-graphs. J. Comb. Optim. **2021**, 41, 234–266.
23. Wagenpfeil, S.; Hemmje, M. Towards AI-based Semantic Multimedia Indexing and Retrieval for Social Media on Smartphones. In Proceedings of the 15th International Workshop on Semantic and Social Media Adaptation And Personalization (SMA), Zakynthos, Greece, 29–30 October 2020; pp. 1–9.
24. Wagenpfeil, S.; Mc Kevitt, P.; Cheddad, A.; Hemmje, M. Explainable Multimedia Feature Fusion for Medical Applications. J. Imaging **2022**, 8, 104.
25. Wagenpfeil, S.; McKevitt, P.; Hemmje, M. Graph Codes-2D Projections of Multimedia Feature Graphs for Fast and Effective Retrieval. ICIVR. 2021. Available online: https://publications.waset.org/vol/180 (accessed on 2 February 2022).
26. Sciencedirect.com. Adjacency Matrix. 2020. Available online: https://www.sciencedirect.com/topics/mathematics/adjacency-matrix (accessed on 3 April 2023).
27. Wagenpfeil, S.; Mc Kevitt, P.; Hemmje, M. Towards Automated Semantic Explainability of Multimedia Feature Graphs. Information **2021**, 12, 502. Available online: https://www.mdpi.com/2078-2489/12/12/502 (accessed on 3 January 2023).
28. Asim, M.N.; Wasim, M.; Ghani Khan, M.U.; Mahmood, N.; Mahmood, W. The Use of Ontology in Retrieval: A Study on Textual. IEEE Access **2019**, 7, 21662–21686.
29. Domingue, J.; Fensel, D.; Hendler, J.A. (Eds.) Introduction to the Semantic Web Technologies. In Handbook of Semantic Web Technologies; SpringerLink: Berlin, Germany, 2011.
30. W3C. SKOS Simple Knowledge Organisation System. 2021. Available online: https://www.w3.org/2004/02/skos/ (accessed on 2 February 2022).
31. Silge, J.; Robinson, D. Text Mining with R: A Tidy Approach. (O’Reilly, 2017). OCLC: ocn993582128. Available online: https://www.tidytextmining.com/tfidf.html (accessed on 20 March 2023).
32. Wagenpfeil, S. Smart Multimedia Information Retrieval. (University of Hagen, 2022). Available online: https://nbn-resolving.org/urn:nbn:de:hbz:708-dh11994 (accessed on 9 February 2023).
33. Rauber, T.; Rünger, G. Parallel Programming; Springer: Berlin/Heidelberg, Germany, 2013; Section 3.3; pp. 98–112.
34. Tanenbaum, A. Structured Computer Organization; Pearson Prentice Hall: Hoboken, NJ, USA, 2006; OCLC: Ocm57506907.
35. Flynn, M. Very high-speed computing systems. Proc. IEEE **1966**, 54, 1901–1909. Available online: http://ieeexplore.ieee.org/document/1447203/ (accessed on 17 December 2021).
36. Keckler, S.; Hofstee, H.; Olukotun, K. Multicore Processors and Systems; Springer: Berlin/Heidelberg, Germany, 2009.
37. Apple Inc. M1 Pro and M1 Max. 2021. Available online: https://www.apple.com/newsroom/2021/10/introducing-m1-pro-and-m1-max-the-most-powerful-chips-apple-has-ever-built/ (accessed on 29 January 2023).
38. Wikipedia. Apple A14. 2022. Available online: https://en.wikipedia.org/wiki/Apple_A14 (accessed on 12 January 2022).
39. Intel Deutschland GmbH. Intel® Core™ i9-12900KF Processor. Available online: https://www.intel.de/content/www/de/de/products/sku/134600/intel-core-i912900kf-processor-30m-cache-up-to-5-20-ghz/specifications.html (accessed on 18 December 2021).
40. Harish, P.; Narayanan, P. Accelerating large graph algorithms on the GPU using CUDA. In Proceedings of the High Performance Computing (HiPC) 14th International Conference, Goa, India, 18–21 December 2007; pp. 197–208.
41. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. arXiv **2017**, arXiv:1702.08734.
42. Kusamura, Y.; Kozawa, Y.; Amagasa, T.; Kitagawa, H. GPU Acceleration of Content-Based Image Retrieval Based on SIFT Descriptors. In Proceedings of the 19th International Conference On Network-Based Information Systems (NBiS), Ostrava, Czech Republic, 7–9 September 2016; pp. 342–347. Available online: http://ieeexplore.ieee.org/document/7789781/ (accessed on 4 January 2022).
43. Google Ireland Limited. TensorFlow Home Page. 2022. Available online: https://www.tensorflow.org/ (accessed on 18 December 2022).
44. NVIDIA. CUDA-Enabled Products. CUDA Zone. Available online: https://developer.nvidia.com/cuda-gpus (accessed on 18 December 2022).
45. Grama, A. Introduction to Parallel Computing; Addison-Wesley: Boston, MA, USA, 2003.
46. Rauber, T.; Rünger, G. Parallel Programming; Springer: Berlin/Heidelberg, Germany, 2013; Section 4.2.1; pp. 162–164.
47. Amdahl, G. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities, Reprinted from the AFIPS Conference Proceedings, Vol. 30 (Atlantic City, N.J., Apr. 18–20), AFIPS Press, Reston, Va., 1967, pp. 483–485, When Dr. Amdahl Was at International Business Machines Corporation, Sunnyvale, California. IEEE Solid-State Circuits Newsl. **2007**, 12, 19–20. Available online: http://ieeexplore.ieee.org/document/4785615/ (accessed on 27 March 2022).
48. Gustafson, J. Reevaluating Amdahl’s law. Commun. ACM **1988**, 31, 532–533.
49. Norman, D.; Draper, S. User Centered System Design: New Perspectives on Human-Computer Interaction; L. Erlbaum Associates: Mahwah, NJ, USA, 1986.
50. Object Management Group. Unified Modeling Language. 2011. Available online: https://www.omg.org/spec/UML/2.4.1/ (accessed on 29 April 2022).
51. Harris, M. Optimizing parallel reduction in CUDA. Nvidia Dev. Technol. **2007**, 2, 70.
52. Manca, E.; Manconi, A.; Orro, A.; Armano, G.; Milanesi, L. CUDA-quicksort: An improved GPU-Based Implementation of Quicksort: CUDA-QUICKSORT. Concurr. Comput. Pract. Exp. **2016**, 28, 21–43.
53. Harris, M.; Owens, J.; Patel, R.; Aaldavid; Yan, E.; Zhangyaobit; Sengupta, S.; Dan; Ap1. cudpp 2.2. (Zenodo, 2014, 8, 31). Available online: https://zenodo.org/record/11548 (accessed on 17 September 2022).
54. NVIDIA. NVIDIA H100 Tensor Core GPU. Available online: https://www.nvidia.com/en-us/data-center/h100/ (accessed on 3 January 2023).
55. Sitaridi, E.; Ross, K. GPU-accelerated string matching for database applications. VLDB J. **2016**, 25, 719–740.
56. Wikipedia. Inverted index. Page Version ID: 1137401637. 2023. Available online: https://en.wikipedia.org/w/index.php?title=Inverted_index&oldid=1137401637 (accessed on 19 February 2023).
57. Chen, L.; Villa, O.; Krishnamoorthy, S.; Gao, G. Dynamic load balancing on single-and multi-GPU systems. In Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS), Atlanta, GA, USA, 19–23 April 2010.
58. NIST. TREC Washington Post Corpus. Available online: https://trec.nist.gov/data/wapost/ (accessed on 6 March 2022).
59. Steinert, P. marquies/gmaf-cuda. 2023. Available online: https://github.com/marquies/gmaf-cuda (accessed on 10 October 2022).

**Figure 1.** A Minute on the Internet in 2021 [6].

**Figure 6.** Comparison of (**a**) initial and (**b**) new parallel parts of the Graph Code algorithm in pseudocode. Colored items show tasks to decompose.

**Table 1.** Hardware configurations of Central Processing Units (CPUs) and Graphics Processing Units (GPUs) used for evaluation.

Device Class | Low-End CPU | Low-End GPU | Medium CPU | Medium GPU | High-End CPU | High-End Desktop GPU | Top-of-the-Line GPU for Reference |
---|---|---|---|---|---|---|---|
Model name | Jetson Nano | Jetson Nano | Intel Core i5 | Nvidia GeForce GTX 1060 3 GB | AWS c6a.48xlarge | Nvidia RTX 3060 | Nvidia H100 80 GB PCIe |
Processor | ARM Cortex-A57 MPCore | GM20B TM660M-A2 | i5-8500 | GP106/GP106-300-A1 | AMD EPYC | GA104 | H100 |
Architecture | ARMv8-A 64-bit | Maxwell | i686/Coffee Lake-S | Pascal | Zen | Ampere | Hopper |
DRAM | 4096 MB | 4096 MB (shared) | 16 GB | 3 GB | 384 GB | 12 GB | 80 GB |
CPU/GPU clock | 1430 MHz | 640 MHz (921 MHz Boost) | 3 GHz | 1506 MHz (1708 MHz Boost) | Turbo 3.6 GHz | 1320 MHz (1777 MHz Boost) | 1830 MHz (1980 MHz Boost) |
Cores | 4 CPU cores | 128 CUDA cores | 6 CPU cores | 1152 CUDA cores | 192 CPU cores | 3584 CUDA cores | 16,896 CUDA cores |

**Table 2.** Comparison of execution duration in seconds for Graph Code calculation and ordering for Sequential ($SEQ$), CPU Parallel ($PC$), and CUDA Parallel ($PC$) on subsets of the WaPo dataset. Each duration was measured with and without sorting.

# GCs | CPU SEQ without Sort [s] | CPU SEQ with Sort [s] | CPU Par without Sort [s] | CPU Par with Sort [s] | CUDA Par without Sort [s] | CUDA Par with Sort [s] |
---|---|---|---|---|---|---|
100 | 0.005 | 0.005 | 0.002 | 0.002 | 0.003 | 0.003 |
1000 | 0.045 | 0.045 | 0.011 | 0.011 | 0.006 | 0.006 |
10,000 | 0.462 | 0.465 | 0.100 | 0.102 | 0.012 | 0.013 |
100,000 | 4.756 | 4.787 | 1.021 | 1.052 | 0.079 | 0.092 |
250,000 | 10.421 | 10.502 | 2.560 | 2.642 | 0.163 | 0.202 |
500,000 | 18.078 | 18.252 | 4.917 | 5.091 | - * | - * |
728,618 | 21.942 | 22.210 | 7.195 | 7.462 | - * | - * |
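As a worked reading of Table 2, the speedup can be taken as the ratio of sequential to parallel execution time. This definition is an assumption made here for illustration, and the symbols $S$ and $T$ are not taken from the table. For the row with 100,000 Graph Codes, measured without sorting:

$$
S_{CPU} = \frac{T_{SEQ}}{T_{CPU\,Par}} = \frac{4.756}{1.021} \approx 4.66, \qquad
S_{CUDA} = \frac{T_{SEQ}}{T_{CUDA\,Par}} = \frac{4.756}{0.079} \approx 60.2
$$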

**Table 3.** Comparison of execution duration in seconds for Graph Code calculation and ordering for ($PC$) on different CPUs.

Device / # GCs | 10,000 | 50,000 | 100,000 | 200,000 | 400,000 |
---|---|---|---|---|---|
Low-Level CPU (4 core) | 0.461 s | 2.338 s | 4.793 s | 9.798 s | 19.770 s |
Medium CPU (6 core) | 0.102 s | 0.516 s | 1.053 s | 2.098 s | 4.088 s |
High-Level CPU (192 core) | 0.018 s | 0.053 s | 0.091 s | 0.171 s | 0.326 s |

**Table 4.** Comparison of execution duration in seconds for Graph Code calculation and ordering for ($PC$) on different GPUs.

Device / # GCs | 10,000 | 50,000 | 100,000 | 200,000 | 400,000 |
---|---|---|---|---|---|
Low-Level GPU (128 Core) | 0.144 s | 0.467 s | 0.923 s | - * | - * |
Medium GPU (1152 Core) | 0.013 s | 0.048 s | 0.092 s | 0.166 s | - * |
High-Level GPU (3584 Core) | 0.004 s | 0.015 s | 0.030 s | 0.053 s | 0.110 s |

**Table 5.** Comparison of execution duration in seconds for Graph Code calculation for artificial Graph Codes of dim = 40.

# GCs | CPU SEQ [s] | CUDA PC [s] | CUDA PM [s] | CUDA PMPR [s] |
---|---|---|---|---|
100 | 0.105 | 0.044 | 0.096 | 0.034 |
1000 | 1.050 | 0.043 | 0.888 | 0.247 |
10,000 | 10.512 | 0.096 | 8.905 | 2.372 |
100,000 | 105.956 | 0.562 | 88.529 | 23.286 |

**Table 6.** Comparison of execution duration in seconds for Graph Code calculation for artificial Graph Codes of dim = 1000.

# GCs | CUDA PC [s] | CUDA PM [s] | CUDA PMPR [s] |
---|---|---|---|
100 | 121.167 | 2.863 | 2.208 |
1000 | 387.565 | 28.334 | 21.502 |

**Table 7.** Comparison of execution duration in seconds for Graph Code calculation for artificial Graph Codes of dim = 39.

# GCs | CPU SEQ [s] | CPU PC [s] | CPU PC [s] |
---|---|---|---|
100 | 0.098 | 0.022 | 0.010 |
1000 | 0.983 | 0.222 | 0.017 |
10,000 | 9.796 | 2.248 | 0.052 |
100,000 | 98.389 | 22.080 | 0.437 |
250,000 | 244.893 | 54.606 | 1.095 |
500,000 | 490.977 | 108.641 | 2.187 |
728,618 | - | - | 3.194 |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Steinert, P.; Wagenpfeil, S.; Mc Kevitt, P.; Frommholz, I.; Hemmje, M.
Parallelization Strategies for Graph-Code-Based Similarity Search. *Big Data Cogn. Comput.* **2023**, *7*, 70.
https://doi.org/10.3390/bdcc7020070
