# Communication Optimization Schemes for Accelerating Distributed Deep Learning Systems


## Abstract


## 1. Introduction

- First, we propose a new layer dropping scheme to reduce the amount of communicated data. Large-scale deep learning models have many parameters, so the number of gradients computed by each worker is correspondingly large, and comparing every individual gradient against a threshold is inefficient. Our scheme instead compresses communication by comparing a single representative value of each hidden layer against the threshold; we use the average of the absolute values of the layer's gradients as this representative value. The worker replaces any hidden layer whose representative value is below the threshold with a list of size one and sends that to the parameter server; likewise, the parameter server replaces hidden layers that have not been updated with a list of size one before sending them back to the worker. Gradients of layers that are not sent to the parameter server are accumulated in the worker's local cache, and once the accumulated value exceeds the threshold, the cached layer is sent to the parameter server to preserve training accuracy.
- Second, we propose an efficient method for selecting the threshold. The threshold is chosen so that the gradients a worker sends to the parameter server are reduced by a target ratio R of the total gradient size. Computing the threshold from every individual gradient would be costly, so we instead represent each hidden layer by the L1 norm of its gradients. Since the L1 norm is the sum of the absolute values of a vector's elements, the L1 norm of a layer's gradients reflects how strongly those gradients affect the model parameters. This reduces the number of values used in the threshold calculation from the total number of gradients to the number of hidden layers in the model. The threshold can either be fixed at the value computed at the start of training or recomputed at a specified interval; in our experiments, we recomputed it every 100 steps.
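The per-layer decision in the first scheme can be sketched in plain Python. This is a minimal illustration under our own assumptions; the function name `layer_dropping` and the dictionary-based gradient layout are hypothetical, not the paper's implementation:

```python
def layer_dropping(layer_grads, threshold, cache):
    """Decide per hidden layer whether to send its gradients.

    layer_grads: dict mapping layer name -> list of gradient values
    threshold:   representative-value cutoff for sending a layer
    cache:       worker's local accumulator for dropped layers (mutated)
    Returns a dict of what the worker would send to the parameter server.
    """
    to_send = {}
    for name, grad in layer_grads.items():
        # Add gradients accumulated from previously dropped steps.
        acc = cache.get(name, [0.0] * len(grad))
        grad = [g + a for g, a in zip(grad, acc)]
        # Representative value: mean of absolute gradient values.
        mean_abs = sum(abs(g) for g in grad) / len(grad)
        if mean_abs >= threshold:
            to_send[name] = grad               # transmit the full layer
            cache[name] = [0.0] * len(grad)    # reset the local cache
        else:
            to_send[name] = [0.0]              # placeholder list of size one
            cache[name] = grad                 # keep accumulating locally
    return to_send
```

A dropped layer is not lost: its gradients remain in the cache and are added to the next step's gradients, so repeated small updates eventually push the representative value over the threshold and the layer is transmitted.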

## 2. Related Works

## 3. Distributed Deep Learning Architecture

## 4. Communication Optimization Schemes

#### 4.1. Operation of the Worker for Communication Optimization

#### 4.2. Operation of the Parameter Server for Communication Optimization

#### 4.3. Gradient Accumulation

## 5. How to Calculate the Threshold

**Algorithm 1.** Threshold searching algorithm.
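A rough sketch of what such a threshold search could look like (the body of Algorithm 1 is not reproduced here; `pick_threshold` and its inputs are our own hypothetical illustration): each hidden layer is represented by the L1 norm of its gradients, and the threshold is the largest per-layer norm such that the layers falling at or below it account for at most a ratio R of all gradients.

```python
def pick_threshold(layer_l1_norms, layer_sizes, ratio):
    """Choose a threshold on per-layer L1 norms.

    layer_l1_norms: dict mapping layer name -> L1 norm of its gradients
    layer_sizes:    dict mapping layer name -> number of gradients in the layer
    ratio:          target fraction R of gradients to drop from communication
    """
    # Total number of gradients across all hidden layers.
    total = sum(layer_sizes.values())
    dropped = 0
    threshold = 0.0
    # Consider layers from least to most influential (smallest L1 norm first).
    for name in sorted(layer_l1_norms, key=layer_l1_norms.get):
        # Stop before the dropped share of gradients exceeds the target ratio.
        if (dropped + layer_sizes[name]) / total > ratio:
            break
        dropped += layer_sizes[name]
        threshold = layer_l1_norms[name]
    return threshold
```

Because only one value per hidden layer enters the search, its cost scales with the number of layers rather than the number of parameters, which is why recomputing the threshold periodically (e.g., every 100 steps) stays cheap.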

## 6. Evaluations

#### 6.1. Performance of Layer Dropping in a 1 Gbit Network Environment

#### 6.2. Performance of Layer Dropping in a 56 Gbit Network Environment

#### 6.3. Summary of the Experimental Results

## 7. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References


| Network | Layer Dropping | Worker Compute (sec) | Parameter Server Compute (sec) | Communication (sec) | Total (sec) |
|---|---|---|---|---|---|
| 1 Gb/s | No | 0.388 | 0.073 | 3.457 | 3.918 |
| 1 Gb/s | Yes | 0.408 | 0.085 | 0.182 | 0.675 |
| 56 Gb/s | No | 0.222 | 0.048 | 1.951 | 2.222 |
| 56 Gb/s | Yes | 0.230 | 0.060 | 0.375 | 0.665 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Lee, J.; Choi, H.; Jeong, H.; Noh, B.; Shin, J.S.
Communication Optimization Schemes for Accelerating Distributed Deep Learning Systems. *Appl. Sci.* **2020**, *10*, 8846.
https://doi.org/10.3390/app10248846
