Mercury: Accelerating 3D Parallel Training with an AWGR-WSS-Based All-Optical Reconfigurable Network
Abstract
1. Introduction
- This work presents a detailed analysis of DDL training traffic models at the micro-batch level under 3D parallel strategies. Based on these models, we introduce a hybrid network architecture that integrates an OCS subnetwork with an OTS subnetwork. The optical network is further augmented with a two-tier scheduling algorithm, coupling periodic topology optimization for predictable 3D traffic patterns with distributed real-time adjustments.
- We develop a high-fidelity OMNeT++-based simulation framework that models 3D parallel training across 64 GPUs under Mercury, enabling large-scale validation of the hybrid architecture's scalability and demonstrating a substantial speedup over a representative reconfigurable datacenter network (RDCN) design. The scheme is also validated on a three-node FPGA-based hardware testbed, where experiments likewise show a significant reduction in DDL job training time compared with state-of-the-art solutions.
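The two-tier control idea in the first bullet can be sketched as follows. This is an illustrative assumption, not the paper's MEPC or S-VLB algorithms: a slow periodic tier hands dedicated OCS circuits to the heaviest predictable flows, and a fast distributed tier lets the residual traffic contend for OTS timeslots. All function names and the greedy circuit-assignment policy are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Demand:
    src: int
    dst: int
    volume: float  # bytes queued between src and dst

def periodic_ocs_plan(demands, num_circuits):
    """Slow tier: give dedicated OCS circuits to the heaviest (src, dst) pairs.

    Returns the set of pairs served by circuits; everything else falls
    through to the fast OTS timeslot tier.
    """
    ranked = sorted(demands, key=lambda d: d.volume, reverse=True)
    return {(d.src, d.dst) for d in ranked[:num_circuits]}

def realtime_ots_adjust(demands, circuit_pairs):
    """Fast tier: residual demands contend for OTS timeslots, largest first."""
    residual = [d for d in demands if (d.src, d.dst) not in circuit_pairs]
    return sorted(residual, key=lambda d: d.volume, reverse=True)

# Toy traffic: two heavy, predictable flows and two small residual ones.
demands = [Demand(0, 1, 8e9), Demand(1, 2, 8e9), Demand(2, 3, 1e6), Demand(3, 0, 5e5)]
circuits = periodic_ocs_plan(demands, num_circuits=2)
ots_queue = realtime_ots_adjust(demands, circuits)
```

In this sketch the heavy (0, 1) and (1, 2) flows get circuits, while the small flows queue for timeslots between reconfigurations.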
2. Network Architecture and Traffic Modeling
2.1. Network Architecture
2.2. Traffic Modeling
3. Control Mechanism
3.1. Mercury Scheduling Framework
3.2. Most Efficient Path Configuration (MEPC) Algorithm
3.3. Selective Valiant Load Balancing (S-VLB)
4. Performance Evaluation
4.1. Simulation Setup
4.2. Experimental Setup
4.3. Results and Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
| Layer | Matrix | Size |
|---|---|---|
| Attention | $X$, input | $b \times s \times h$ |
| | $W^Q, W^K, W^V$, projection matrices | $h \times h$ |
| | $Q, K, V$, weighted result | $b \times s \times h$ |
| | $\mathrm{head}_i$, output results of attention layer | $b \times s \times (h/a)$ |
| | $W^O$, output matrix of attention layer | $h \times h$ |
| MLP | $W_1$, hidden layer weight matrix | $h \times 4h$ |
| | $W_2$, output layer weight matrix | $4h \times h$ |
| | $X_1$, hidden layer output | $b \times s \times 4h$ |
| | $\sigma(X_1)$, activation output | $b \times s \times 4h$ |
| | $X_{\mathrm{out}}$, MLP output | $b \times s \times h$ |
| Layer | Operation | FLOPs |
|---|---|---|
| Attention | $Q$, $K$, $V$, and output projections | $8bsh^2$ |
| | $QK^T$ / $\mathrm{softmax}(\cdot)V$ | $4bs^2h$ |
| MLP | $XW_1$ and $\sigma(X_1)W_2$ | $16bsh^2$ |
| Layer | Operation | FLOPs |
|---|---|---|
| MLP | backward pass through $W_2$ and $W_1$ | $32bsh^2$ |
| Attention | backward pass through projections and attention scores | $16bsh^2 + 8bs^2h$ |
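The per-layer FLOPs in the tables above follow standard Transformer counting conventions (2 FLOPs per multiply-accumulate). A minimal sketch of that accounting, assuming the Appendix symbols $b$ (micro-batch size), $s$ (sequence length), and $h$ (hidden dimension) and a 4x MLP expansion; the paper's exact accounting may differ:

```python
def attention_flops(b, s, h):
    # Q, K, V and output projections: 4 matmuls of (b*s, h) x (h, h),
    # at 2 FLOPs per multiply-accumulate -> 8*b*s*h^2
    proj = 8 * b * s * h * h
    # QK^T scores plus score-weighted V: 2 matmuls touching an s x s map
    # -> 4*b*s^2*h
    scores = 4 * b * s * s * h
    return proj + scores

def mlp_flops(b, s, h, expansion=4):
    # Two matmuls h -> expansion*h -> h: 2 * (2*b*s*h*expansion*h) = 16*b*s*h^2
    return 2 * 2 * b * s * h * (expansion * h)

def layer_forward_flops(b, s, h):
    return attention_flops(b, s, h) + mlp_flops(b, s, h)

# Appendix values: b = 8, s = 512, h = 1024
fwd = layer_forward_flops(8, 512, 1024)  # forward FLOPs for one layer
```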
| Type | Times | Volume/Bytes |
|---|---|---|
| TP | $4l$ per micro-batch | $2bshe$ per all-reduce |
| DP | 1 per iteration | $2\Phi e$ |
| PP | 2 per micro-batch per stage boundary | $bshe$ |
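Per-type communication volume can be sketched as below, assuming Megatron-style conventions (4 activation all-reduces per layer per micro-batch for TP, one gradient all-reduce per iteration for DP, point-to-point activation handoffs for PP, and ring all-reduces moving roughly twice the payload). These conventions are assumptions; the original table's exact counts may differ.

```python
def tp_volume_bytes(b, s, h, e, layers):
    # TP: ~4 all-reduces per layer per micro-batch (2 forward, 2 backward),
    # each moving ~2x a (b, s, h) activation tensor of e-byte elements.
    return 4 * layers * 2 * b * s * h * e

def dp_volume_bytes(num_params, e):
    # DP: one gradient all-reduce per iteration; a ring all-reduce moves
    # ~2x the parameter data.
    return 2 * num_params * e

def pp_volume_bytes(b, s, h, e):
    # PP: one (b, s, h) activation handoff per micro-batch per stage
    # boundary (with a matching gradient in the backward direction).
    return b * s * h * e

# Appendix values: b = 8, s = 512, h = 1024, e = 2 bytes
pp_bytes = pp_volume_bytes(8, 512, 1024, 2)
```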
| Job Num | (p, t, d) Per Job | GPUs Per Job |
|---|---|---|
| 1 | (4, 4, 4) | 64 |
| 2 | (4, 4, 2) | 32 |
| 3 | (2, 5, 2) | 20 |
| 4 | (2, 4, 2) | 16 |
| 5 | (2, 3, 2) | 12 |
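A quick consistency check on the table: each job's GPU count is the product of its pipeline (p), tensor (t), and data (d) parallel degrees.

```python
# (p, t, d) per job: p pipeline stages, t tensor-parallel ranks,
# d data-parallel replicas; a job occupies p*t*d GPUs in total.
jobs = {1: (4, 4, 4), 2: (4, 4, 2), 3: (2, 5, 2), 4: (2, 4, 2), 5: (2, 3, 2)}

def gpus_needed(p, t, d):
    return p * t * d

sizes = {job: gpus_needed(*cfg) for job, cfg in jobs.items()}
total = sum(sizes.values())  # GPUs if all five jobs run concurrently
```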
| Parameters | Values |
|---|---|
| Number of transformer layers, $l$ | 12 |
| Hidden layer dimension, $h$ | 256/512/1024/2048 |
| Number of multi-heads, $a$ | 12 |
| Sequence length, $s$ | 512 tokens |
| Size of a micro-batch, $b$ | 8 |
| Storage size of a parameter, $e$ | 2 bytes |
| Number of model parameters, $\Phi$ | |
| Computation power of a GPU | 125 TFLOPS |
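From the table's values, a rough model size and step-time estimate can be sketched. The $12lh^2$ parameter count is the standard GPT-style approximation (attention ~$4h^2$ plus MLP ~$8h^2$ per layer, embeddings and layer norms ignored), an assumption rather than the paper's definition, and the utilization factor is hypothetical.

```python
def param_count(layers, h):
    # Standard GPT-style per-layer estimate: attention ~4h^2, MLP ~8h^2
    # (embeddings and layer norms ignored) -- an assumption, not the
    # paper's exact definition of the parameter count.
    return layers * 12 * h * h

def forward_time_s(flops, peak_flops=125e12, efficiency=0.5):
    # Naive roofline estimate: the table's 125 TFLOPS peak, derated by an
    # assumed utilization factor.
    return flops / (peak_flops * efficiency)

# With l = 12 layers and h = 1024 from the table:
phi = param_count(12, 1024)   # ~151M parameters
model_bytes = phi * 2         # e = 2 bytes per parameter
```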
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Feng, S.; Zhang, J.; Zhou, H.; Li, X.; Ji, Y. Mercury: Accelerating 3D Parallel Training with an AWGR-WSS-Based All-Optical Reconfigurable Network. Photonics 2026, 13, 286. https://doi.org/10.3390/photonics13030286
