Distributed deep learning (DDL), which trains deep learning models using multiple workers, is gaining attention because it reduces the total training time of deep learning and is easy to scale. Due to these advantages, DDL is widely supported by popular deep learning frameworks, including TensorFlow [1], MXNet [2], and PyTorch [3]. Although DDL implementations differ across frameworks, the fundamental structure is as follows. Each worker trains a copy (replica) of a deep learning model using local training data and synchronizes the model parameters at the end of each training iteration. For synchronization, there are two common methods: (1) using a parameter server (PS) to collect and distribute parameters (PS architecture [4]) and (2) using all-to-all communication among workers (all-reduce architecture [5]). In both cases, workers send gradients and gather the updated model parameters.
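The PS-based synchronization step described above can be sketched in plain Python (an illustrative stand-in, not any framework’s API): workers push gradients, the server averages them and applies an update, and workers pull the result. The learning rate `lr` is an assumption for the example.

```python
# Minimal sketch of one synchronization step in the PS architecture.
# Illustrative only: real frameworks use tensors and network transport.

def ps_sync_step(params, worker_grads, lr=0.1):
    """Average the workers' gradients and apply one SGD update on the PS."""
    n = len(worker_grads)
    # Gradient aggregation: element-wise average over all workers.
    avg = [sum(g[i] for g in worker_grads) / n for i in range(len(params))]
    # Parameter update on the PS; workers then pull the new parameters.
    return [p - lr * g for p, g in zip(params, avg)]
```

Each training iteration ends with such a step, which is what concentrates communication at iteration boundaries.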
Ideally, DDL should achieve a near-linear performance gain in proportion to the number of workers. However, because all workers must synchronize their trained parameters at the end of each iteration, the communication for parameter synchronization becomes a severe bottleneck in DDL training, known as “communication overhead” [6]. Since this communication can occupy up to 92.8% of the total training time [7], enhancing DDL communication is essential to accelerating the entire training process, and finding solutions to the communication overhead is an important research topic.
In an effort to overcome the communication overhead, we first analyze how each DDL worker generates traffic for synchronization (DDL traffic) and measure the traffic characteristics using image classification models (e.g., ResNet [9], AlexNet [10], and VGG16 [11]) with TensorFlow. In the PS architecture, communication between the workers and the PS occurs right after each worker finishes backpropagation. Specifically, each worker sends its gradients as training results to the PS, and the PS sends the updated model parameters back to the workers. This characteristic of the PS architecture causes network communication to be concentrated in a short period. Such communication can affect performance in two areas: (1) memory input/output (IO), i.e., data communication between GPU memory and main memory (including PCIe bottlenecks); and (2) the network between worker nodes and the PS node. In terms of memory IO, GPU-vendor-driven solutions, such as hardware-based improvements, are being delivered (e.g., NVIDIA GPUDirect [12] and NCCL [13]). This paper focuses on network bottlenecks, i.e., the performance impact between the workers’ NICs and the PS’s NIC, including the network switches between them.
To observe the network traffic characteristics, we conduct several measurements and observe that DDL traffic is periodically generated in bursts (Section 2.1). Specifically, the DDL system in our evaluation generates a large number of packets within very short intervals (e.g., milliseconds). This type of traffic burstiness is known to be the main cause of network congestion [14]. In other words, DDL burst traffic periodically congests the network, causing throughput degradation, high delay, and packet loss, thereby increasing the total training time. Because the burstiness results from the fundamental nature of DDL itself, DDL traffic inevitably causes network congestion. Recently, efforts [6] have been made to reduce the network communication bottleneck of DDL by changing the transmission order of deep neural network (DNN) layers to overlap and hide the communication overhead (Section 2.2). However, to the best of our knowledge, these efforts have not addressed the network congestion caused by the burstiness of DDL traffic.
In addition, existing congestion control schemes are insufficient to address this problem. In conventional congestion control, the transport protocol (e.g., TCP) detects congestion only after a packet is lost at the end host, so it takes long to react. To shorten the reaction time, explicit congestion notification (ECN) explicitly notifies hosts of congestion based on the switch queue status, without a packet drop [19]. Specifically, when the switch queue length exceeds a certain threshold, packets are marked with the congestion encountered (CE) codepoint in the IP header (congestion marking), and the host can slow down as soon as a congestion-marked packet arrives. This threshold is initially set by network operators and usually ranges up to 80, depending on the switch specification (Section 3.3). The end host receiving congestion-marked packets reduces its TCP congestion window and decreases its sending rate so that the congestion can be cleared [20]. However, ECN handles congestion only after the burst traffic has been generated; therefore, existing congestion control approaches are inefficient for DDL communication and do not help reduce the congestion caused by the burstiness of DDL traffic.
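The ECN-style marking behavior described above can be sketched as follows. The threshold (40) and queue capacity (60) follow the evaluation setup in Section 4; the packet representation is illustrative, not a real switch API.

```python
# Hedged sketch of threshold-based ECN marking at a switch queue.
# Values follow the paper's evaluation setup; names are illustrative.

ECN_THRESHOLD = 40   # marking threshold (packets)
QUEUE_CAPACITY = 60  # switch queue capacity (packets)

def enqueue(queue, packet):
    """Enqueue a packet, marking CE when the queue exceeds the threshold."""
    if len(queue) >= QUEUE_CAPACITY:
        return "dropped"          # tail drop: queue is full
    if len(queue) >= ECN_THRESHOLD:
        packet["ce"] = True       # congestion encountered mark
    queue.append(packet)
    return "marked" if packet.get("ce") else "enqueued"
```

Note that marking only begins once the queue has already built up, which is exactly why this reactive scheme lags behind a sudden DDL burst.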
As a solution, we propose proactive congestion notification (PCN), a novel congestion-avoidance technique. The key idea behind PCN is based on the following insight: the burstiness of DDL traffic can be known in advance, because DDL consists of training iterations and the time when DDL traffic is generated within each iteration is known. By anticipating DDL traffic, PCN regulates the traffic stacked in a switch in advance so that network switches do not become congested when DDL traffic arrives. Concretely, PCN temporarily modifies the congestion marking threshold, exploiting the fact that burst traffic occurs periodically (details in Section 3.2).
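As a rough sketch of this mechanism, the switch can be modeled as lowering its marking threshold when a burst is announced and restoring it afterward. The threshold values (40 and 10) follow the evaluation setup in Section 4; the event names here are illustrative, not the protocol’s actual messages.

```python
# Hedged model of the PCN idea: temporarily lower the ECN marking
# threshold before an anticipated DDL burst so that background flows
# back off early, leaving queue space for the burst.

class PcnSwitch:
    def __init__(self, initial_threshold=40, pcn_threshold=10):
        self.initial_threshold = initial_threshold  # operator-configured
        self.pcn_threshold = pcn_threshold          # temporary, pre-burst
        self.threshold = initial_threshold

    def on_pcn_start(self):
        # A worker announces an upcoming burst (PCN-START in the paper).
        self.threshold = self.pcn_threshold

    def on_burst_done(self):
        # Restore the initial threshold once the burst has passed.
        self.threshold = self.initial_threshold

    def should_mark(self, queue_length):
        return queue_length >= self.threshold
```

With the lowered threshold, a queue length of, say, 20 is already marked before the burst arrives, draining the queue in advance.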
To realize PCN, we use P4 [21] and programmable switches (Section 2.3) because traditional switches do not support the operations PCN requires. For instance, PCN needs a way to pass the new threshold (PCN threshold) to the switches, and a switch must be able to change its original threshold (initial threshold) to the new PCN threshold. However, traditional switches can only parse predefined types of network packets and cannot change the initial threshold. In contrast, with P4, a new packet header type can be defined, and the desired bits of the header can be parsed and used in a programmable switch. Furthermore, new switch operations, such as match-action rules, can be implemented as needed for PCN (see details in Section 2.3). Recently, several studies have tried to enhance DDL traffic by using in-network switches for parameter aggregation or flow scheduling [22]. However, to the best of our knowledge, this is the first study that proactively handles the burstiness of DDL traffic via P4. We implement the PCN functionalities in P4 along with a traffic generator that simulates the common DDL traffic pattern observed with VGG16, AlexNet, and ResNet (Section 2.1). Then, we run and evaluate PCN on BMv2 [24], a P4 software switch. The evaluation results show an average 72% improvement in throughput (Section 4).
The remainder of this paper is organized as follows. Section 2 explains the background, related work, and motivation. Section 3 details the PCN design, and Section 4 presents the evaluation results. Section 5 discusses the limitations and future work of PCN. Finally, Section 6 concludes the paper.
4. Evaluation
In this section, we present the evaluation results of our proof-of-concept implementation. The PCN mechanism is implemented in two parts: (1) a DDL traffic generator that sends PCN-START and simulates DDL traffic, and (2) a P4 switch that implements PCN.
We implement PCN in P4 and evaluate it using BMv2 [24], a P4 software switch. We set up a tree network topology with a single root and two leaf switches, a frequently used topology. As discussed in Section 2.1 and Figure 2, DDL traffic differs in amount and interval for each model, but the burst-idle traffic pattern is common to all models. Therefore, instead of using individual models, we implement a DDL traffic generator called DDLgen that simulates the burst-idle traffic pattern generated by a worker, based on tcpreplay [38] and iperf3 [39]. As a baseline, DDLgen generates burst traffic using the iperf3 burst mode and stays idle for the given network-idle period, repeating this pattern similarly to the burst-interval traffic pattern observed in Section 2.1. To evaluate the DDL traffic performance with the PCN design, we add PCN operations to DDLgen: DDLgen first sends the PCN-START packet, waits for the ACK packet, and then repeats the burst-interval pattern as explained in Figure 6.
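The burst-idle loop of DDLgen, with the optional PCN handshake, might look like the following minimal sketch. The `send_burst` and `pcn` hooks are illustrative placeholders; the real generator drives iperf3 and tcpreplay.

```python
# Illustrative sketch of DDLgen's burst-idle loop (not the authors' code).
# Defaults follow the paper's setup: six iterations, 7 s network-idle period.

import time

def ddlgen(iterations=6, idle_s=7.0, send_burst=lambda: None, pcn=None):
    """Repeat the periodic burst-idle pattern of DDL traffic."""
    for _ in range(iterations):
        if pcn is not None:
            pcn.start()       # send PCN-START and wait for the ACK (sketch)
        send_burst()          # e.g., drive iperf3 in burst mode
        time.sleep(idle_s)    # network-idle period between iterations
```

Running it with `pcn=None` gives the baseline; supplying a `pcn` object adds the PCN-START handshake before each burst.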
We run DDLgen with and without PCN, measure the performance improvement, and monitor the switch queue status. For each leaf switch, three hosts generate background traffic for 100 s using iperf3 (in datacenters, the number of concurrent long TCP connections to a server is generally two to three [3]). We start DDLgen 20 s after the background traffic begins so that the background traffic becomes stable (because of TCP’s congestion control (e.g., slow start, congestion window), TCP throughput fluctuates at the beginning and stabilizes later; in our environment, the background traffic stabilizes after 20 s). DDLgen simulates six iterations of the burst-idle traffic pattern with a 7 s network-idle period. We use initial and PCN thresholds of 40 and 10, respectively, following the policy stated in Section 3.3. The switch in our evaluations has a queue of length 60, so the queue length ranges from 0 to 59.
PCN is the first approach to mitigate the network IO bottleneck using programmable switches. Whereas existing studies (Section 2.2) deal with optimizations on the hosts (i.e., workers and the PS), PCN is implemented in network switches; thus, the techniques are complementary. This also makes it difficult to compare the existing studies directly with PCN (in terms of queueing latency or queue length metrics), so we evaluate PCN by turning it on and off.
4.1. Evaluation Metrics
We conduct evaluations in three key categories: (1) queue length change (Section 4.2), (2) performance improvement (Section 4.3), and (3) overheads (Section 4.4). The detailed measurements of the three categories are as follows:
Queue length change: We measure the queue length of a switch to assess how prepared the switch is for burst packets under PCN. Two values are measured: start queue length and maximum queue length. The start queue length is the queue length occupied by background packets right before the burst packets of an iteration arrive at the switch. The maximum queue length is the highest switch queue length during each burst interval.
Performance improvement: To evaluate the performance improvement by PCN, we use three metrics: average throughput, burst completion time, and queueing latency. The average throughput is the amount of network traffic transmitted per unit time, the burst completion time is the time taken for the network communication of burst packets to complete, and the queueing latency is the per-packet latency in a network switch caused by the packets already in the switch queue. These three metrics are measured and compared with and without PCN. To measure the queueing latency, we implement in-band telemetry [40] so that packets carry custom network statistics, such as the queueing latency per switch.
Overheads: PCN makes the switch ready for burst packets by reducing its queue threshold. Although this scheme is effective for DDL packets, it could reduce the throughput of other background traffic. Therefore, we measure the decreased throughput of the background traffic as the overhead of PCN.
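The performance metrics above could, for instance, be computed from per-packet records as in this sketch. The record fields (`sent`, `recv`, `size`, `queue_delay`) are assumptions for the example, not the paper’s actual telemetry format.

```python
# Hedged sketch: derive the three performance metrics from per-packet
# records (send time, receive time, payload size in bytes, and the
# queueing delay reported by in-band telemetry).

def burst_metrics(records):
    """Return (burst completion time, average throughput, avg queueing latency)."""
    start = min(r["sent"] for r in records)
    end = max(r["recv"] for r in records)
    bct = end - start                              # burst completion time (s)
    total_bits = sum(r["size"] for r in records) * 8
    throughput = total_bits / bct                  # average throughput (bit/s)
    avg_queueing = sum(r["queue_delay"] for r in records) / len(records)
    return bct, throughput, avg_queueing
```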
4.2. Queue Length Change
To see the improved burst tolerance with PCN, we evaluate the queue length change by measuring (1) the start queue length and (2) the maximum queue length. We repeat the experiment more than 20 times to obtain reliable results. Figure 9a shows the distribution of the queue lengths as box plots, with the median at the middle and the minimum and maximum values at the ends of the bars. Figure 9b,c show the start queue length and the maximum queue length, respectively, during the six burst iterations generated by DDLgen.
First, without PCN, the start queue length when the first packet of a burst is enqueued ranges from 26 to 48, whereas with PCN it ranges from 1 to 8. On average, the start queue length is 35 without PCN and 5 with PCN. With a queue capacity of 60, a switch queue without PCN has space for 25 incoming packets, while a switch queue with PCN has space for 55 incoming packets. Therefore, PCN enables the switch to handle more than twice as many burst packets.
In terms of the maximum queue length, the queue lengths without PCN range from 52 to 59, while those with PCN range from 42 to 47. Without PCN, the maximum queue length is 54.8 on average and frequently reaches 59, which means that the switch queue becomes full and incoming burst packets are dropped. With PCN, the maximum queue length is 44.1 on average and never exceeds 50, indicating that PCN causes neither packet loss nor congestion.
4.3. Performance Improvement
Figure 10a,b show the average burst completion time and the throughput of DDL traffic, respectively, with PCN on and off. The results indicate that PCN reduces the burst completion time by 39% (Figure 10a) and improves throughput by 72% (Figure 10b). These improvements result from PCN reserving a portion of the packet queue for the burst packets of the DDL workload in advance; with PCN, the sending rate of DDL traffic can increase properly thanks to the spacious switch queue, whereas without PCN, DDL traffic suffers network congestion due to the deficient queue space. In addition, we measure the per-packet queueing latency within a switch (Figure 10c). With PCN, the average queueing latency of packets processed during burst intervals is reduced by 13%. These improvements show that PCN appropriately mitigates the congestion caused by DDL traffic.
4.4. Overheads
Here, we investigate the overhead of the PCN design. We examine the throughput of the background traffic because PCN reduces the sending rates of all traffic before the DDL burst, which affects the background traffic throughput. In our results, the background traffic throughput decreases by approximately 5.7%. PCN does not preempt the background traffic but reduces its sending rate while the PCN threshold is applied; note that this overhead is transient and disappears once the initial threshold is restored. As additional overhead, a PCN-START packet is added to the ordinary PS architecture to deliver the PCN threshold to the switches. The traffic added by PCN-START is a single packet without any payload (approximately 64 bytes), which is negligible. In addition, to identify the PCN-START packet and deliver the PCN threshold, 6 bits of the DSCP field in the IP header are used.
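For illustration, identifying a PCN-START packet from the DSCP bits of an IPv4 header might look as follows. The specific codepoint (`0x2E`) is an assumption for the example; the paper only states that 6 DSCP bits are used.

```python
# Hedged sketch of PCN-START detection via the IPv4 DSCP field.
# The DSCP occupies the upper 6 bits of the ToS byte (the second byte
# of the IPv4 header); the chosen codepoint is illustrative only.

PCN_START_DSCP = 0x2E  # example codepoint, an assumption

def dscp_of(ipv4_header: bytes) -> int:
    """Extract the 6-bit DSCP value from a raw IPv4 header."""
    return ipv4_header[1] >> 2

def is_pcn_start(ipv4_header: bytes) -> bool:
    return dscp_of(ipv4_header) == PCN_START_DSCP
```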