Effective On-Chip Communication for Message Passing Programs on Multi-Core Processors
Abstract
1. Introduction
- We design an infrastructure that allows us to investigate the algorithmic differences when an application is optimized for either shared memory or message passing. The NAS Parallel Benchmarks are used since they have been designed for both MPI and OpenMP. Here, we eliminate the library and system-level overheads and focus only on the characteristics of the algorithms. Using this system, we characterize the similarities and differences between applications optimized for each model. We find that the patterns of data movement between processes in a message passing architecture are similar to those between threads in a shared memory architecture. In other words, there is no penalty for using explicit data movement in a message passing architecture; this is not surprising when one considers that shared memory architectures tend to have implicit data movement that is carried out through the coherence protocol. We also see that the total volume of transactions on the interconnection is larger with the shared memory architecture, since coherence messages are issued whenever the status of a shared block is updated. However, payload data movements are issued to the interconnection only when blocks need to be transferred to another core or when a block is read from or written to its private address space. Finally, we find that the performance of our shared memory architecture is more sensitive to the size of the last level of the private memory hierarchy than that of the message passing architecture.
- Our analysis suggests that while behaviors are similar, there are opportunities to improve message passing performance using methods that are not available on current multi-core processors; in particular, support for the efficient bulk transfer of data between caches. As a result, we design messaging protocols that allow message passing programs to communicate efficiently either between memories or between the upper levels of the cache memory hierarchies. These protocols enable new capabilities in the interconnection, such as point-to-point messaging, which exploits higher transfer rates; they also improve hit rates and increase IPC (instructions per cycle) in MPI programs.
- We evaluate our multicore design with hardware support for message passing and show that it is competitive with a state-of-the-art shared memory multicore. In our experiments, message passing achieves up to a 34% speedup over its shared memory counterpart in the best case, and a 10% speedup on average. The advantages of message passing come from both efficient bulk transfer and the elimination of wasteful data movement.
- We also provide a characterization that offers insight into the scenarios under which message passing works better. In particular, message passing offers a greater advantage when the cache is a scarce resource. This is likely the result of the more efficient management of the cache hierarchy through explicit data movement. When private L2 caches are larger, in the 1 MB to 4 MB range, the message passing architecture is still better on average, but the performance gap between the architectures narrows.
2. Motivation
2.1. Assumptions and Definitions
2.2. Shared Memory vs. Message Passing
- What can be done in parallel?
- How are threads coordinated to achieve the goal?
- How can parallelism be expressed to maximally exploit available compute resources?
2.3. Key Architectural Differences
2.3.1. Private Data versus Shared Data
2.3.2. Bulk Transfer versus Fine-Grained Per-Block Transfer
2.3.3. Collective Transfer versus No Global Coordination
2.3.4. Explicit Overlap versus Implicit Overlap
2.4. Benchmarks
3. Message Passing Architecture
3.1. Overview
3.2. MP Runtime Layer
3.3. The Hardware–Software Interface
3.4. Messaging Architecture
3.4.1. Point-to-Point Communication
3.4.2. Collective Communication
4. Design and Implementation
4.1. Overview
4.2. Basic Protocol for Message Passing
- Invalidate: Invalidate the selected blocks in the message buffer, regardless of whether each block is dirty. This supports the receipt of messages, since the old contents of the buffer do not need to be preserved.
- Flush: Invalidate the selected blocks; dirty blocks are written back to the next level of the memory hierarchy. This supports message sending.
- Writeback: Blocks replaced due to write misses are written back to the next level of the memory hierarchy. This is the same as in a conventional cache hierarchy, except that it supports a write-back of multiple blocks.
- ReadyToSynchronize: Synchronize the send and receive sides. The receive side issues a synchronization message to the send side, and the send side waits until this message arrives.
- Transfer: Transfer the selected blocks from the source buffer to the destination buffer. The request is similar to a burst copy in Direct Memory Access (DMA).
- Send: (Cache-to-cache only) Read and send out the selected blocks from the designated message buffer. Blocks that miss in the cache are read from the next level of the memory hierarchy.
- Receive: (Cache-to-cache only) Receive the blocks and update the cache. Blocks replaced due to write misses are written back to the next level of the memory hierarchy.
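To make the roles of these primitives concrete, the sketch below enumerates them and walks through one plausible ordering for a blocking memory-to-memory transfer. The mp_op enum, the mp_issue() helper, and the exact sequencing are illustrative assumptions, not the authors' implementation.

```c
/* Illustrative sketch only: the operation names follow the list above, but the
 * enum, helper, and ordering are assumptions made for a self-contained example. */
#include <stdio.h>

typedef enum {
    OP_INVALIDATE,     /* drop the buffer's blocks, dirty or not (receive side) */
    OP_FLUSH,          /* invalidate; write dirty blocks back (send side)       */
    OP_WRITEBACK,      /* write back blocks evicted by write misses             */
    OP_READY_TO_SYNC,  /* receiver signals the sender that its buffer is ready  */
    OP_TRANSFER,       /* DMA-like burst copy from source to destination buffer */
    OP_SEND,           /* cache-to-cache: read and push the buffer's blocks     */
    OP_RECEIVE         /* cache-to-cache: accept blocks and update the cache    */
} mp_op;

static void mp_issue(mp_op op, const char *side)
{
    printf("%s issues operation %d\n", side, (int)op);
}

/* One plausible blocking memory-to-memory exchange built from the primitives. */
int main(void)
{
    mp_issue(OP_FLUSH, "sender");           /* push dirty send-buffer blocks to memory  */
    mp_issue(OP_INVALIDATE, "receiver");    /* old receive-buffer contents are not kept */
    mp_issue(OP_READY_TO_SYNC, "receiver"); /* sender waits for this before moving data */
    mp_issue(OP_TRANSFER, "controller");    /* burst copy between the two buffers       */
    mp_issue(OP_WRITEBACK, "receiver");     /* evicted blocks go to the next level      */
    return 0;
}
```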
4.3. Memory-to-Memory Communication
4.3.1. Blocking
4.3.2. Non-Blocking
4.4. Cache-to-Cache Communication
4.5. Collective Communication
5. Evaluation
5.1. Trace Generation
5.2. System Configuration
5.3. Overall Performance
5.4. Total Bytes of Messages on the Interconnection for All Benchmarks on Each Architecture with Different Configurations
5.5. Sensitivity Study
6. Related Work
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Giacomoni, J.; Moseley, T.; Vachharajani, M. FastForward for efficient pipeline parallelism: A cache-optimized concurrent lock-free queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, 20–23 February 2008; pp. 43–52.
- Friedley, A.; Bronevetsky, G.; Hoefler, T.; Lumsdaine, A. Hybrid MPI: Efficient message passing for multi-core systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Denver, CO, USA, 17–21 November 2013; pp. 1–11.
- Dupont de Dinechin, B.; Graillat, A. Feed-Forward Routing for the Wormhole Switching Network-on-Chip of the Kalray MPPA2 Processor. In Proceedings of the 10th International Workshop on Network on Chip Architectures, Boston, MA, USA, 14 October 2017; pp. 1–6.
- Ma, W.; Ao, Y.; Yang, C.; Williams, S. Solving a trillion unknowns per second with HPGMG on Sunway TaihuLight. Clust. Comput. 2020, 23, 493–507.
- Xu, Z.; Mauldin, T.; Yao, Z.; Pei, S.; Wei, T.; Yang, Q. A Bus Authentication and Anti-Probing Architecture Extending Hardware Trusted Computing Base Off CPU Chips and Beyond. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 30 May–3 June 2020; pp. 749–761.
- Yu, Z.; Xiao, R.; You, K.; Quan, H.; Ou, P.; Yu, Z.; He, M.; Zhang, J.; Ying, Y.; Yang, H.; et al. A 16-Core Processor With Shared-Memory and Message-Passing Communications. IEEE Trans. Circuits Syst. Regul. Pap. 2013, 61, 1081–1094.
- Baker, J.; Gold, B.; Bucciero, M.; Bennett, S.; Mahajan, R.; Ramachandran, P.; Shah, J. SCMP: A Single-Chip Message-Passing Parallel Computer. J. Supercomput. 2004, 30, 133–149.
- Wijngaart, R.; Mattson, T.; Haas, W. Light-weight communications on Intel's single-chip cloud computer processor. ACM SIGOPS Oper. Syst. Rev. 2011, 45, 73–83.
- Bailey, D.H.; Barszcz, E.; Barton, J.T.; Browning, D.S.; Carter, R.L.; Dagum, L.; Fatoohi, R.A.; Frederickson, P.O.; Lasinski, T.A.; Schreiber, R.S.; et al. The NAS parallel benchmarks. Int. J. High Perform. Comput. Appl. 1991, 5, 63–73.
- Loff, J.; Griebler, D.; Mencagli, G.; Araujo, G.; Torquati, M.; Danelutto, M.; Fernandes, L.G. The NAS parallel benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures. Future Gener. Comput. Syst. 2021, in press.
- Luk, C.-K.; Cohn, R.; Muth, R.; Patil, H.; Klauser, A.; Lowney, G.; Wallace, S.; Reddi, V.J.; Hazelwood, K. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, Chicago, IL, USA, 12–15 June 2005; pp. 190–200.
- Villa, O.; Stephenson, M.; Nellans, D.; Keckler, S. NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52), Columbus, OH, USA, 12–16 October 2019; pp. 372–383.
- Pavlovic, M.; Etsion, Y.; Ramirez, A. Can many cores support the memory requirements of scientific applications? In Proceedings of the 2010 International Conference on Computer Architecture (ISCA '10), Phoenix, AZ, USA, 22–26 June 2012; pp. 65–76.
- Chandra, S.; Larus, J.R.; Rogers, A. Where is time spent in message-passing and shared-memory programs? In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, USA, 5–7 October 1994; pp. 61–73.
- Tasoulas, Z.-G.; Anagnostopoulos, I.; Papadopoulos, L.; Soudris, D. A Message-Passing Microcoded Synchronization for Distributed Shared Memory Architectures. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2018, 38, 975–979.
- Chodnekar, S.; Srinivasan, V.; Vaidya, A.; Sivasubramaniam, A.; Das, C. Towards a communication characterization methodology for parallel applications. In Proceedings of the Third International Symposium on High-Performance Computer Architecture, San Antonio, TX, USA, 1–5 February 1997; pp. 310–319.
- Kubiatowicz, J.; Agarwal, A. Anatomy of a message in the Alewife multiprocessor. In Proceedings of the 7th International Conference on Supercomputing (ICS '93), Tokyo, Japan, 19–23 July 1993; pp. 195–206.
- Martin, M.M.K.; Hill, M.D.; Sorin, D.J. Why on-chip cache coherence is here to stay. Commun. ACM 2012, 55, 78–89.
- Daya, B.K.; Chen, C.H.O.; Subramanian, S.; Kwon, W.C.; Park, S.; Krishna, T.; Holt, J.; Chandrakasan, A.P.; Peh, L.S. SCORPIO: A 36-core research chip demonstrating snoopy coherence on a scalable mesh NoC with in-network ordering. In Proceedings of the 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA), Minneapolis, MN, USA, 14–18 June 2014; pp. 25–36.
- Sanchez, D.; Kozyrakis, C. SCD: A scalable coherence directory with flexible sharer set encoding. In Proceedings of the IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), New Orleans, LA, USA, 25–29 February 2012; pp. 1–12.
- Held, J. Single-chip cloud computer. Euro-Par Workshops 2010, 6586, 85.
- Mattson, T.G.; Riepen, M.; Lehnig, T.; Brett, P.; Haas, W.; Kennedy, P.; Howard, J.; Vangal, S.; Borkar, N.; Ruhl, G. The 48-core SCC processor: The programmer's view. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, USA, 13–19 November 2010; pp. 1–11.
- Ureña, I.A.C.; Riepen, M.; Konow, M. RCKMPI: Lightweight MPI implementation for Intel's single-chip cloud computer (SCC). Lect. Notes Comput. Sci. 2011, 6960, 208–217.
- Ureña, I.A.C.; Riepen, M.; Konow, M.; Gerndt, M. Invasive MPI on Intel's single-chip cloud computer. In Proceedings of the Architecture of Computing Systems (ARCS 2012), Lecture Notes in Computer Science, Munich, Germany, 28 February–2 March 2012; Volume 7179, pp. 74–85.
- Howard, J.; Dighe, S.; Hoskote, Y.; Vangal, S.; Finan, D.; Ruhl, G.; Jenkins, D.; Wilson, H.; Borkar, N.; Schrom, G.; et al. A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 7–11 February 2010; pp. 108–109.
Function | Description | Variants
---|---|---
send(msg, dest) | Calling thread sends msg to the dest thread | Blocking/non-blocking
recv(msg, src) | Calling thread receives msg from the src thread | Blocking/non-blocking
wait(msg) | Wait for a non-blocking message to complete | Non-blocking
all-to-all(msg, size, group) | Send part of msg to all threads in group; receive msg from all threads in group | Blocking
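For illustration, the sketch below shows one way the runtime calls in the table might be used from a message passing program. The mp_ prefixes, the mp_msg handle type, and the stub bodies are assumptions made so the example is self-contained; they are not the authors' actual runtime interface.

```c
/* Hypothetical sketch of the MP runtime calls listed in the table above. */
#include <stdio.h>
#include <stddef.h>

/* Message handle; fields and the mp_ prefixes are illustrative only. */
typedef struct { void *buf; size_t bytes; } mp_msg;

static void mp_send(mp_msg *m, int dest) { printf("send %zu B to thread %d\n", m->bytes, dest); }
static void mp_recv(mp_msg *m, int src)  { printf("recv %zu B from thread %d\n", m->bytes, src); }
static void mp_wait(mp_msg *m)           { (void)m; printf("wait for completion\n"); }
static void mp_all_to_all(mp_msg *m, size_t part, int group)
{
    (void)m;
    printf("all-to-all: %zu B exchanged with each thread in group %d\n", part, group);
}

int main(void)
{
    double data[256] = {0};
    mp_msg m = { data, sizeof data };

    mp_send(&m, 1);   /* a non-blocking variant would return immediately            */
    /* ...independent computation could overlap an in-flight non-blocking send...   */
    mp_wait(&m);      /* completes the non-blocking operation                       */

    mp_recv(&m, 1);                         /* blocking receive of a reply          */
    mp_all_to_all(&m, sizeof data / 16, 0); /* blocking collective across a group   */
    return 0;
}
```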
Component | Configuration
---|---
Number of cores | 16
L1 cache | 32 kB, 2-way set associative, 64 B blocks, 2-cycle latency, write buffer
L2 cache | 512 kB, 8-way set associative, 64 B blocks, 20-cycle latency, write buffer, fully pipelined (2 cycles)
Interconnect | 32 B/cycle bandwidth per router port, 2-cycle latency, input buffer, no contention model, fully pipelined (2 cycles)
Memory | Unlimited size, 64 B blocks, 102.4 B/cycle bandwidth (off-chip), 200-cycle latency, fully pipelined (10 cycles)
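As a compact restatement of the simulated 16-core configuration above, the sketch below captures the parameters in a C struct. The struct layout, field names, and the units encoded in the names are assumptions for illustration and are not part of the authors' simulator.

```c
/* Illustrative restatement of the evaluated system configuration. */
#include <stdio.h>

/* Per-cache parameters: capacity, associativity, block size, access latency. */
struct cache_cfg { int size_kb, ways, block_b, latency_cyc; };

struct system_cfg {
    int cores;
    struct cache_cfg l1, l2;
    int noc_bw_b_per_cyc, noc_latency_cyc;   /* per router port     */
    double mem_bw_b_per_cyc;                 /* off-chip bandwidth  */
    int mem_latency_cyc;
};

int main(void)
{
    struct system_cfg cfg = {
        .cores = 16,
        .l1 = { .size_kb = 32,  .ways = 2, .block_b = 64, .latency_cyc = 2  },
        .l2 = { .size_kb = 512, .ways = 8, .block_b = 64, .latency_cyc = 20 },
        .noc_bw_b_per_cyc = 32, .noc_latency_cyc = 2,
        .mem_bw_b_per_cyc = 102.4, .mem_latency_cyc = 200,
    };
    printf("%d cores, %d kB L1, %d kB L2, %d-cycle memory\n",
           cfg.cores, cfg.l1.size_kb, cfg.l2.size_kb, cfg.mem_latency_cyc);
    return 0;
}
```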
Component | Shared Memory | Message Passing
---|---|---
L1 cache | Write-through | Write-back
L2 cache | MESI protocol | Write-back
Coherence/communication | MESI directory (unlimited size), 2-cycle latency | Message passing controller, 2-cycle latency
Name | Description
---|---
OMP | 16-core OpenMP on SM
MtoM | 16-core MPI on MP with memory-to-memory communication
CtoC | 16-core MPI on MP with cache-to-cache communication