Article

An m-RGA Scheduling Algorithm Based on High-Performance Switch System and Simulation Application

School of Electronic Information and Automation, Tianjin University of Science and Technology, Tianjin 300450, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(15), 2971; https://doi.org/10.3390/electronics14152971
Submission received: 13 June 2025 / Revised: 13 July 2025 / Accepted: 16 July 2025 / Published: 25 July 2025

Abstract

High-speed switching chips are key components of network core devices in the high-performance computing paradigm, and their scheduling algorithm performance directly influences the throughput, latency, and fairness within the system. However, traditional scheduling algorithms often encounter issues such as high implementation complexity and high communication overhead when dealing with bursty traffic. To address the bottlenecks in high-speed switching chip scheduling, we propose a low-complexity and high-performance scheduling algorithm called m-RGA, where m represents a priority mechanism. First, by monitoring the historical service time and load level of the VOQs at each port, the priority of the VOQs is dynamically adjusted to enhance the efficient matching and fair allocation of port resources. Additionally, we prove that an algorithm achieving a 2× speedup under a constant traffic model can simultaneously guarantee throughput and latency, making it theoretically comparable to any maximum matching algorithm. Through simulation, we demonstrate that m-RGA outperforms Highest Rank First (HRF) arbitration in terms of latency under non-uniform and bursty traffic patterns.

1. Introduction

With the transformation of data generation methods in data centers in recent years, artificial intelligence (AI) has increasingly become one of the core driving forces for the development of computationally intensive tasks and High-Performance Computing (HPC) [1,2]. Simultaneously, the scale of data has shown exponential growth [3] far exceeding the processing capabilities of traditional linear expansion, posing higher requirements for computing architectures and communication networks. HPC is a computing paradigm that combines multiple computing nodes (typically high-performance servers) into a cluster and enables them to work collaboratively to complete large-scale and complex tasks [4]. Communication bottlenecks may lead to high latency and increased idle time of nodes, resulting in reduced overall resource utilization [2].
As one of the core components of HPC interconnection networks, high-speed switches must be capable of providing extremely low latency and high bandwidth to support data transmission performance in any communication mode [5]. Therefore, although expansion can be achieved by increasing the number of ports in the switching structure, the problems of excessive expansion in terms of cost, power consumption, and congestion are also obvious [6]. Recent studies have shown that adopting scalable low-latency switching structures such as crossbar switches has become one of the mainstream directions. This structure not only has the advantages of strong parallelism and low latency but also makes it easy to achieve efficient resource utilization through scheduling algorithms, consequently becoming an important interconnection technology route in current HPC and AI data centers [7].
Switch types mainly include Output-Queued (OQ) [8] and Input-Queued (IQ) [9] switches. OQ switches are difficult to implement for high-speed traffic because they require hardware support (such as memory bandwidth and switch rate) running at a speedup of N. Thus, IQ switches are the preferred solution for high-speed systems due to their low speedup requirement (speedup of 1) and hardware scalability. However, the performance of IQ switches is highly dependent on the optimization of scheduling algorithms. Achieving efficient packet transmission in complex traffic scenarios remains a critical and unresolved challenge, especially with increasing port counts and stringent low-latency requirements.
Although some scheduling algorithms for IQ switches exhibit strong scheduling capabilities, several challenges remain that hinder their effectiveness in dynamic traffic scenarios. For instance, certain algorithms fail to efficiently utilize switching resources in bursty traffic under high-load conditions, leading to issues such as many-to-one port contention and port matching synchronization failures. Consequently, the design of iteration scheduling algorithms that offer high throughput, low complexity, and adaptability to dynamic traffic patterns remains a critical research focus in the field of IQ switches.
To solve the above problems, we propose a new dynamic adaptive priority iteration scheduling algorithm with single-bit requests, which we call m-RGA. Figure 1 illustrates an N × N centrally-scheduled input-queued switch, where N represents the number of ports of the switch. The rest of this paper is organized as follows: the next section discusses several iterative algorithms and summarizes their existing problems; in Section 3, we introduce the proposed m-RGA algorithm with examples; in Section 4, the performance of the proposed algorithm is evaluated and compared with existing schemes; Section 5 presents the conclusions of this paper; finally, Appendix A includes the stability proof of m-RGA, showing that the 2-speedup version of m-RGA satisfies the stability requirement under constant traffic.

2. Related Work

In order to solve the problem of head-of-line blocking [10] caused by FIFO queues in IQ switches, the VOQ structure was introduced. In this input buffer structure, multiple VOQ buffers are set up at each input port according to the destination port. If the number of ports is N, then the number of VOQs at each input port is also N, while the total number of queues across the input ports is N². In the switch fabric, providing buffering improves the throughput of the switch unit; however, it also increases the complexity of the crossbar hardware and of scheduling.
The introduction of the VOQ structure over the last three decades has led to many algorithms, all of which belong to the category of iterative scheduling algorithms. These algorithms differ in the information sent in the request phase as well as in the arbitration mechanisms used in the grant and accept phases. Owing to massive parallel processing, a set of widely accepted iterative scheduling algorithms compute matches in an iterative manner, with the scheduling decision considering only whether the VOQs of each input port are occupied. In general, each iterative scheduling algorithm consists of three steps, with only unmatched inputs and outputs considered. In the request phase, if an input has a cell waiting in the VOQ of an output, the input port sends a request to that output. In the grant phase, if an output port receives multiple requests from different input ports, it selects one of them, after which the output tells each input whether or not its request has been granted. In the accept phase, if an input receives more than one grant, it accepts one of them. The following are practical examples of iterative scheduling algorithms.
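The three-phase structure described above can be sketched in a few lines. The following is a minimal, hypothetical single-iteration round with PIM-style random arbitration (function and variable names are our own, not from the paper); real schedulers replace the random choices with the arbitration policies discussed below.

```python
import random

def iterate_match(voq_occupied, rng=random):
    """One request-grant-accept iteration over a set of unmatched ports.

    voq_occupied[i][j] is True when input i holds a cell destined for output j.
    Returns a dict mapping each matched input to its output (a partial matching).
    """
    n = len(voq_occupied)
    # Request phase: each input with a waiting cell requests the matching output.
    requests = {j: [i for i in range(n) if voq_occupied[i][j]] for j in range(n)}
    # Grant phase: each output picks one requesting input (randomly, as in PIM).
    grants = {}
    for j, reqs in requests.items():
        if reqs:
            grants.setdefault(rng.choice(reqs), []).append(j)
    # Accept phase: each input picks one granting output.
    return {i: rng.choice(js) for i, js in grants.items()}
```

Because each output grants at most one input and each input accepts at most one output, the result is always a valid (possibly partial) matching.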
PIM [11] adopts multi-round iteration; each iteration includes three phases: request, grant, and accept. Inputs and outputs request and respond randomly based on their current state. However, the random matching in each iteration leads to poor fairness, while random arbitration leads to higher hardware costs. iSLIP [12] uses a fixed pointer and a sliding-window mechanism to ensure that the input and output ports are selected fairly. Although iSLIP performs well under uniform traffic, it does not consider port weights, and its performance is poor under non-uniform and bursty traffic. iLQF [13] uses the VOQ queue length as a weight, but the actual length must be transmitted in the control phase, which increases the communication overhead. In addition, because iLQF relies on queue length alone, it ignores the waiting time of the head-of-line cell; this can leave a port blocked for a long time, which harms port fairness. DRRM [14] applies the same round-robin polling method to the arbitration pointers of both the input and output ports. The input and output ports in DRRM experience less contention, which reduces the communication and time cost of scheduling; however, DRRM is less sensitive to VOQs with many buffered packets. GA [15] is similar to DRRM and shares the same limitation.
SRR [16], RR/LQF [17], and HRF [11] adopt the longest VOQ length and a preferred input/output pair, and all share the same core calculation, Formula (1). SRR matches using the longest VOQ length at the input port, while RR/LQF arbitrates using the longest VOQ length at both the input and output ports. When multiple ports have VOQs of the same length, both methods fall back to polling. HRF proposes a highest-rank-first method, which provides a higher matching probability than a strategy relying only on the longest queue length, resulting in improved scheduling performance. Although HRF outperforms the other two algorithms, its performance degrades as the load increases. This degradation is primarily due to the additional overhead introduced by queue-length calculation and sorting during scheduling, which increases the waiting time of cells at the input ports and adversely affects overall transmission efficiency. Moreover, all three algorithms require sorting based on queue lengths, resulting in high time complexity, which is unfavorable for efficient scheduling in high-speed switching scenarios under heavy load. In contrast, Π-RGA [18] introduces the earliest activation time as its priority, demonstrating good performance under extremely high-load traffic. However, its performance deteriorates significantly under other traffic patterns, and its complex priority management mechanism makes it difficult to implement efficiently in high-speed switching environments. In the following, we summarize the key limitations of traditional iteration scheduling algorithms:
The first such limitation involves multiple iterations. To improve throughput during scheduling, it is desirable to maximize port matching through multiple iterations, that is, to reach a state in which no input or output remains unnecessarily idle. However, introducing multiple iterations increases the time duration of the control phase, which in turn extends the queuing delay of cells in the buffer and increases overall latency.
A second limitation concerns multi-bit communication overhead. In iterative scheduling algorithms, communication overhead primarily arises from the three phases, namely, request, grant, and accept. Among these, the request phase incurs the highest cost. While the grant and accept phases typically involve only a single-bit exchange per output or input port, the request phase can be more demanding; for example, in the request phase of iLQF, each input port sends a weight value based on the corresponding VOQ length, resulting in a time complexity of O(N). This significantly increases the overhead of the transmission process. A key objective is to reduce this overhead to O(1) in order to simplify the transfer control phase.
The third limitation concerns stability and fairness requirements. In addition to high throughput and low latency, scheduling algorithms must also ensure stability and fairness. The switching chip should remain stable under dynamic load conditions and guarantee that each non-empty VOQ has a fair opportunity to be served. This implies that as long as there is at least one cell ready for transmission and the corresponding VOQ is non-empty, the scheduling algorithm should avoid starvation and ensure timely servicing of all ports.

3. m-RGA Scheduling Algorithm

This paper proposes a dynamic adaptive priority iterative scheduling algorithm called m-RGA (where m is a priority matching scheme) which has good adaptive ability and stably matches as many ports as possible. In Equation (1), i, j, t, and N represent input port i, output port j, the current time slot t, and the number of ports N. The formula shows that over N time slots, each output port is answered exactly once by input port i in a round-robin fashion. This reflects the fact that in actual scheduling, the N output ports corresponding to input port i are each matched uniformly one time, which is the behavior of the global polling pointer. In the grant phase of the actual scheduling, output port j responds to input port i according to the time slot and Equation (1). Therefore, if output port j receives the request signal sent by VOQij, it responds to that VOQ first. Similarly, in the accept phase, input port i preferentially looks for the grant signal from output port j to respond; note that the request, grant, and accept phases all occur in the same time slot. During the response process, the arbitration is also affected by other scheduling priority mechanisms, such as the queue length and the waiting time of the head-of-line cell, which take the form of local polling pointers.
j = (i + t) mod N        (1)
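The round-robin property of Equation (1) — over N consecutive time slots, each input port is paired with every output port exactly once — can be checked directly (a minimal sketch; the function name is ours):

```python
def preferred_output(i, t, n):
    """Equation (1): the output j preferred by input i at time slot t."""
    return (i + t) % n

# Over N consecutive time slots, input i is paired with every output exactly once.
n = 4
for i in range(n):
    outputs = {preferred_output(i, t, n) for t in range(n)}
    assert outputs == set(range(n))
```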
m-RGA takes into account both the queue length and the historical service time of the queue, then dynamically adjusts the scheduling priority according to the port load in order to adapt to different traffic scenarios. On this basis, global and local polling pointers are introduced. While maintaining a stable matching, new connection possibilities are tried periodically by releasing connections, gradually increasing the number of matches and improving throughput. At time slot t, priority is given to serving the input port i that was matched in the previous timeslot. For output port j, if it was in the matching state in the last timeslot, the algorithm further checks whether the port satisfies the priority service mechanism; if so, the request sent by the VOQ is downgraded from a strong request to a weak request. In the scheduling phase, input port i maintains a historical service time counter P covering its N VOQs, which stores the timeslot in which each VOQ last actually transmitted. At output port j, N counters SC(i,j) are maintained to track in real time the queue lengths of the VOQs with the same destination port j but different input ports i, where i, j ∈ {0, …, N−1}. The detailed operation of the algorithm at time slot t is summarized as follows.
For requests (Figure 2), at time slot t, input port i first determines whether it was in the matching state in the previous time slot. Initially, every port issues a strong request (R = 2), representing the highest priority. If the port was matched, it then checks whether the matched output port satisfies the j value in Equation (1). If the condition is met, the request is changed to a weak request (R = 1), which means that output port j has already completed a response and can respond to other ports. If not, the request remains a strong request (R = 2) and the port continues to look for a port that satisfies the condition. When the earliest historical service time P of a VOQ exceeds the threshold A (where A represents the maximum delay of an arbitration response), the VOQ's R is set to 2 and the packets in that queue are preferentially served (the VOQ also sends a strong request signal). When another input port k is unconnected but holds non-empty VOQs, those VOQs also send weak request signals. Empty VOQs do not participate in further operations; in this case, R is set to 0, indicating the lowest priority (R = 0).
For the grant phase (Figure 3), if output port j receives VOQ requests from more than one input port, it selects among them based on the strength of the request and on SC(i,j), which represents the actual length of the VOQs destined for the same output port. A strong request always has higher priority than a weak request. If SC(i,j) > 0, output j is granted to the VOQ with the largest SC(i,j), where i, j ∈ {0, …, N−1}.
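The grant rule above — strong requests beat weak ones, the Equation (1) preference wins among equals, and remaining ties go to the largest SC(i,j) — can be sketched as follows. This is our own illustrative reading of the text, with hypothetical names; the hardware arbiter works on counters rather than Python dicts.

```python
def grant(requests, sc_row, preferred_i=None):
    """Pick one input for an output port in the grant phase.

    requests:    dict input_i -> R (2 = strong, 1 = weak request).
    sc_row:      dict input_i -> SC(i, j), the VOQ length seen by this output.
    preferred_i: the input satisfying Equation (1) this time slot, if any.
    """
    if not requests:
        return None
    # Strong requests outrank weak ones.
    strongest = max(requests.values())
    candidates = [i for i, r in requests.items() if r == strongest]
    # The Equation (1) preference wins outright when it is among the candidates.
    if preferred_i in candidates:
        return preferred_i
    # Otherwise grant the input with the longest backlog SC(i, j).
    return max(candidates, key=lambda i: sc_row.get(i, 0))
```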
For the accept phase (Figure 4), if input i receives multiple grant signals from the output ports, then according to the strong and weak grant signals, the VOQ with the oldest historical load value P among the strong grants is preferred. If P > 0, input port i selects the output port with the largest P, input i is assigned to that output port j, and the P value is updated to the current time slot value in order to complete the port matching, where i, j ∈ {0, …, N−1}. When the in-port match succeeds, the VOQ status during arbitration is recorded. When a packet is transmitted out of a VOQ, its historical load time is reset to 0. When the queue of a non-empty VOQ becomes empty during transmission, the connection is disconnected and recorded. Figure 3 shows the whole scheduling process within one time slot. Here, we only outline the core idea and significance of this research.
It is important to point out that A represents the maximum delay of an arbitration response when a VOQ in the input port has not been matched and cannot transmit its data packet. This value governs the change of the request signal in the request phase. When the earliest historical service time P of a VOQ exceeds the threshold A (in general, we set it to 50), the priority of that VOQ's request signal is forced to change from weak to strong. At that point, the VOQ has a high probability of being selected and transmitting data packets in the current time slot.
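The anti-starvation rule just described can be condensed into a small priority function. This is a simplified sketch of our reading of the request rules (downgrade to weak after a match, forced promotion once the wait exceeds A); the name and signature are hypothetical.

```python
def request_priority(matched_last_slot, waited_slots, threshold=50):
    """Request strength R for a non-empty VOQ (2 = strong, 1 = weak).

    A VOQ whose head cell has waited past the threshold A is forced back to
    a strong request so that it cannot starve; otherwise a VOQ whose port
    was matched in the last timeslot yields with a weak request.
    """
    if waited_slots > threshold:
        return 2  # aging: forced promotion to strong
    return 1 if matched_last_slot else 2
```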
In Figure 2, we provide a detailed example to explain the working mechanism of the m-RGA algorithm. Note that Figure 2 only shows the example analysis of strong signals. The figure shows the state (R, P, SC[i]) of all VOQs in a 4 × 4 input-queued switch chip in a given time slot. At this time, all of the input and output ports are idle; meanwhile, output port 1 satisfies Equation (1) with input port 2 as its matching input port, and output port 2 satisfies Equation (1) with input port 3. We take input port 2 as an example to analyze the overall flow of m-RGA scheduling. In the request phase, the non-empty VOQs in input port 2 send request signals to output ports O1, O2, and O3. In the grant phase (Figure 3), the request signals received by O1 are all strong and satisfy Equation (1); thus, O1 sends a grant signal to input port I2, O2 sends a grant signal to I3, O3 sends a grant signal to I2, and O4 sends a grant signal to I4. In the accept phase (Figure 4), I2 receives the grant signals sent by both O1 and O3. I2 then chooses to accept O3 according to the P value of the corresponding position, sends the accept signal, and completes one round of the scheduling process. The pointer then moves to the next position that matches the output port.

3.1. Analysis of Key Technologies

m-RGA reduces the time cost of multiple iterations in every timeslot and reduces the communication overhead from O(N) to O(1). In addition, for the case where idle ports may remain after a single iteration of m-RGA, global and local pointers are introduced to increase the probability of port matching. At the same time, the queue length and historical service time of the VOQs are recorded, and the ports matched in the last time slot are preferentially retrieved. On the premise of stable matching, this reduces matching fluctuation and increases the fairness of port service. The results show that the proposed algorithm achieves better performance under different traffic models. Below, we summarize the advantages of m-RGA.
First, m-RGA provides low communication overhead with single-iteration and single-bit requests. The m-RGA algorithm achieves port matching using only a single iteration per timeslot, thereby shortening the scheduling phase and minimizing latency. Despite the use of just one iteration, m-RGA delivers high throughput across diverse traffic conditions thanks to its tailored matching mechanism and dynamic responsiveness to port load variations. In addition, it operates using only a single-bit request per input port, significantly lowering communication overhead and achieving a favorable tradeoff between scheduling efficiency and implementation complexity.
Second, m-RGA incorporates a dual request mechanism with both strong and weak requests, providing enhanced port utilization and matching stability. Strong requests allow matched ports to be reused based on the results from the previous timeslot, maintaining existing matches and reducing redundant computation. Concurrently, weak requests increase the opportunities for new matches by avoiding direct contention with strong requests. This allows previously unmatched ports to compete more effectively based on port load, thereby improving the overall matching probability, enhancing throughput, and maintaining stable switching performance.
Third, m-RGA introduces scalable matching with global and local polling pointers. To steadily increase the number of input-output port matching, m-RGA employs both global and local polling pointer mechanisms. In a single iteration, some VOQs may fail to form point-to-point connections due to port contention. The global polling pointer periodically releases previously matched ports by clearing existing connections at each time slot, enabling re-matching and preventing the algorithm from converging to suboptimal local solutions. This promotes progressive optimization of matching quality and enhances system throughput. Simultaneously, the local polling pointer, in conjunction with historical service time constraints, disconnects point-to-point connections if a VOQ exceeds a predefined service delay threshold. These VOQs are then prioritized in subsequent scheduling, effectively mitigating starvation. Together, the global and local pointer mechanisms improve matching flexibility and fairness, enabling the algorithm to maintain stability and adaptability under various traffic patterns.
Fourth, m-RGA employs simplified hardware implementation in the form of a lightweight hardware structure consisting of a set of state counters at both input and output ports. In combination with the use of global and local pointers to guide efficient matching, these counters and control mechanisms are straightforward to implement on existing hardware platforms, allowing the algorithm to be practically deployed with minimal hardware overhead.

3.2. Time Complexity of m-RGA

In this paper, each non-empty VOQ has only two options when sending a request signal: a strong request or a weak request. In fact, the m-RGA algorithm does not sort the two kinds of signals; in the grant phase, the Arbiter component chooses according to the request signals that are sent, and the number of request signals at this point does not exceed N (where N is the number of ports). In the grant phase, the actual operation sequence of the Arbiter component is as follows: (1) the Arbiter receives the request signals and judges their priority (strong requests outrank weak requests); (2) the Arbiter determines whether input port i, output port j, and time slot t satisfy the relationship of Equation (1), considering strong requests first; if it is satisfied, that input port i is given priority; (3) if Equation (1) is not satisfied, the Arbiter selects which input port i to grant based on the value in each SC[i]. The arbiter in each port performs the above steps.
Returning to the time complexity of m-RGA, the choice between strong and weak request signals costs only O(1). However, a schedule contains three phases (request, grant, and accept), of which only the grant phase must respond to up to N² request signals in the worst case, i.e., O(N²). As the number of grant signals in the accept phase is further reduced (O(N) at most), we only consider the ordering of SC[i] in the grant phase. SC[i] tracks the length of each VOQ. Commonly used sorting algorithms such as quicksort run in O(n log n) time; in practice, however, the ranking process can be greatly simplified thanks to the property of input-queued switches that any input port receives and sends at most one packet per time slot.
Specifically, the ranked list can be stored in a balanced binary search tree. When a new packet arrives at VOQij, its queue size increases by one while the other VOQs remain unchanged; the new rank of VOQij therefore becomes higher or stays the same. Finding the new rank in a balanced binary tree takes O(log n) time, and the same procedure applies to packet departures. Therefore, the overall time complexity of m-RGA is reduced to O(log n). In summary, the time complexity of the m-RGA algorithm is O(log n) thanks to the use of a balanced binary search tree to maintain and update the queue ranking. This gives the algorithm a significant efficiency advantage, especially when dealing with large-scale data exchange.
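Because a queue length changes by at most one per time slot, the ranked list only needs a local update. The following sketch illustrates the idea using Python's `bisect` on a sorted list (names are ours): the position lookups are O(log n), as in the tree; in a balanced BST the removal and re-insertion would also be O(log n) rather than the O(n) element shift a flat list incurs.

```python
import bisect

def update_rank(sorted_lengths, old_len, new_len):
    """Move one VOQ's length within a sorted list after a +/-1 change.

    bisect locates both positions in O(log n); a balanced binary search
    tree would make the removal and insertion O(log n) as well.
    """
    pos = bisect.bisect_left(sorted_lengths, old_len)
    sorted_lengths.pop(pos)               # remove the stale length
    bisect.insort(sorted_lengths, new_len)  # re-insert at the new rank
    return sorted_lengths
```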

3.3. Pseudocode Description of the m-RGA Algorithm

This subsection explains the pseudocode of m-RGA (Algorithm 1). During the request phase, each non-empty VOQij sends a strong or weak request signal as appropriate. In the grant phase, the Arbiter component selects the strong/weak grant signal according to the strong/weak request signals and the longest SC[k]. In the accept phase, the Input Arbiter selects the final result based on the actTime[i][j] value of the grant signal.
  • SC[i][j]: Length of the i-th VOQ at output port j.
  • actTime[i][j]: Last activated time of VOQ i,j.
  • A: Activation time threshold for VOQ i,j.
  • Arbiter[j]: The j-th arbiter in the output port.
  • input_Arbiter[i]: The i-th arbiter in the input port.
  • busy[i][j]: A Boolean indicating whether VOQ i,j was successfully matched in the last timeslot.
Algorithm 1: m-RGA Algorithm

3.4. Comparative Analysis of Algorithm Performance

In this section, we technically compare our newly proposed m-RGA algorithm with the previously introduced algorithms; the summary results are recorded in Table 1. The m-RGA algorithm combines global and local polling mechanisms to ensure that the matching is stable while gradually giving other ports a fair opportunity to obtain priority. Because each port of m-RGA only sorts the state counters into strong and weak signals, the time complexity of m-RGA is O(log N). Here, m-RGA, SRR, RR/LQF, and HRF all belong to the category of single-iteration scheduling algorithms; these algorithms are compared and their differences summarized in Table 1. In addition, although we summarize the complexity of the other algorithms in Table 1, this does not fully represent their overall complexity; for example, Π-RGA and HRF are difficult to implement in practice due to their high complexity. Finally, both iSLIP and iLQF support single and multiple iterations; only their single-iteration cases are summarized in Table 1.
To accurately represent the earliest activation time in Π-RGA, Multi-Bit Requests (MBRS) must be used to store these timestamps, and multiple MBRS must be transmitted over the control link. Because only the earliest activation time and the state counter are used, Π-RGA is only suitable for non-uniform traffic scenarios; under high-load and uniform traffic, the priority differentiation between ports is low because the activation times of all VOQs are very close. In contrast, SRR, RR/LQF, HRF, and m-RGA all adopt preferential input–output pairs, that is, global polling. Both SRR and RR/LQF use the longest VOQ: SRR at the input only, RR/LQF at both the input and output ports, falling back to polling when more than one VOQ ties. HRF proposes a highest-VOQ-rank-first approach, which yields more matching possibilities than using the longest VOQ alone. Our m-RGA absorbs the advantages of Π-RGA while sending only a Single-Bit Request (SBR) when a VOQ is non-empty, and introduces the global and local polling methods, which remedies the poor performance of Π-RGA in uniform traffic scenarios.
The design of the m-RGA algorithm directly targets the limitations of traditional scheduling algorithms, namely multiple iterations and multi-bit communication overhead. Traditional scheduling algorithms generally take O(log N) iterations to complete the maximum matching of ports (maximal matching is achieved when no input or output remains unnecessarily idle), and the largest contribution to the communication overhead, in bits, comes from comparing queue lengths in the request phase. The m-RGA algorithm needs only one or two iterations to complete the matching between ports, reducing the iteration count to O(1). In addition, because only the 2-bit priority of the request signal needs to be transmitted in the request phase, the communication delay is greatly reduced compared to using the queue length as the priority, as in the iLQF algorithm, and the communication overhead of this phase is also O(1).
For the other counters (SC[i], R, and P), the design is tied to the state of the VOQs and the Arbiter component; thus, a group of VOQ strong/weak priority counters, a group of oldest activation time counters, and a group of VOQ queue length counters need to be added. These counters and state variables are convenient and simple to implement on existing hardware.

4. Performance Analysis

This section describes the design and testing of m-RGA. The simulation experiment used OMNeT++ 5.5.1 and was run on a Windows computer with an Intel(R) Core(TM) i5-10210U CPU and 12 GB RAM. We analyzed the performance of the new algorithm through simulation and compared it with other classical iterative scheduling algorithms, namely PIM, iSLIP, iLQF, SRR, RR/LQF, HRF, and Π-RGA. In Figure 5, Figure 6 and Figure 7 we consider three classical traffic patterns, namely, uniform, bursty, and hotspot traffic. Although we set the size of the switch to N = 32 in the simulation, the results also hold for other sizes. Finally, each data point in the figures is the result of running 10⁵ timeslots, where the initial 5 × 10³ timeslots are the warm-up phase and the historical load threshold is 50. The horizontal axis represents the input load p and the vertical axis represents the average queuing delay of data packets.

4.1. Uniform Traffic

In the uniform traffic model, during each time slot, a packet arrives at each input port with a probability p (i.e., the input load) and the destination output port is selected uniformly at random.
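A per-slot arrival generator for this model can be written in a few lines. This is a sketch of the model as stated (the function name and `None`-for-no-arrival convention are ours, not the paper's simulator):

```python
import random

def uniform_arrivals(n, p, rng):
    """One time slot of uniform traffic for an n-port switch.

    Each input port receives a cell with probability p; the destination
    output is chosen uniformly at random. None marks 'no arrival'.
    """
    return [rng.randrange(n) if rng.random() < p else None for _ in range(n)]
```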
In Figure 5, the performance of the PIM, iLQF, and Π-RGA algorithms remains significantly worse than that of SRR, iSLIP, RR/LQF, and m-RGA, and also weaker than that of HRF. When the input load is at a low or medium level, the gap between SRR, RR/LQF, and m-RGA is small. This is because the number of non-empty VOQs in the matching state is small and the activation times mostly fall in the same time slot; in this case, Equation (1) can be used to obtain the matching result directly. When the input load p = 0.7, a packet in m-RGA waits an average of 99.168 timeslots to complete scheduling, about half of the 213.83 timeslots otherwise required, while in the Π-RGA algorithm a packet needs 732.87 time slots, more than seven times the delay of m-RGA. When the input load p > 0.7, the delay performance of Π-RGA becomes sluggish, which may be because its activation time takes effect as the load increases. When the input load p > 0.8, m-RGA performs better than SRR and RR/LQF. This is because under the uniform traffic pattern, when choosing a VOQ, all arbiters consider not only the historical service time but also the load strength, which makes it easier to distinguish priorities and generate request signals of different strengths. In summary, this prioritization mechanism is more discriminating than the preferred-formula calculation alone, which is why its performance improves dramatically.

4.2. Burst Traffic

Bursty traffic follows the ON/OFF model, a two-state Markov process. In the ON state, one packet arrival is generated per time slot; in the OFF state, no packet arrives. Packets of the same burst share the same destination port, and the destination ports of different bursts are uniformly distributed. Given the average input load p and burst size s, the state transition probability from OFF to ON is p/[s(1 − p)] and the state transition probability from ON to OFF is 1/s. In our simulations, we set the burst size to s = 30 packets.
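The ON/OFF model above can be sketched directly from its transition probabilities (a simplified sketch, not the paper's simulator; class and method names are ours):

```python
import random

class OnOffSource:
    """Two-state Markov ON/OFF source for one input port.

    P(OFF -> ON) = p / (s * (1 - p)) and P(ON -> OFF) = 1 / s, where p
    is the average input load and s is the mean burst size. All packets
    of a burst share one destination, drawn uniformly per burst.
    """
    def __init__(self, n_ports, p, s, rng=random):
        self.n_ports, self.rng = n_ports, rng
        self.p_on = p / (s * (1 - p))   # OFF -> ON
        self.p_off = 1.0 / s            # ON -> OFF
        self.on = False
        self.dest = None

    def step(self):
        """Advance one time slot; return a destination port or None."""
        if self.on:
            if self.rng.random() < self.p_off:
                self.on = False
        else:
            if self.rng.random() < self.p_on:
                self.on = True
                self.dest = self.rng.randrange(self.n_ports)  # per-burst destination
        return self.dest if self.on else None
```

With these probabilities the stationary fraction of ON slots equals p and the mean burst length equals s, matching the stated model.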
In Figure 6, Π-RGA and m-RGA perform significantly better than HRF. When the input load p > 0.7, the average queuing delay of HRF increases sharply, whereas the delays of Π-RGA and m-RGA only rise significantly after p > 0.8. m-RGA uses the VOQ queue length only in the output accept phase and transmits only single-bit strong/weak signals in the communication phase. This differs from HRF, which first sorts the VOQ queue lengths and then encodes them, adding computational overhead and delay. m-RGA is also more responsive to the historical response/activation time and load intensity within the same time slot, because our implementation considers both the queue length and the activation time of each VOQ; it therefore consumes less time than HRF. m-RGA also outperforms Π-RGA because it uses only a single bit of communication overhead in the request phase and prioritizes serving the port matched in the previous time slot, whereas Π-RGA incurs multi-bit communication overhead in its control phase. As a result, m-RGA reduces the delay consumed in transmission.

4.3. Hotspot Traffic

In Figure 7, Π-RGA and m-RGA again perform significantly better than HRF. When the input load p > 0.7, the average queuing delay of HRF increases sharply, whereas the delays of Π-RGA and m-RGA only rise significantly after p > 0.8. Under hotspot traffic, one VOQ continuously receives packets, and granting that VOQ repeatedly would severely lengthen the queuing time of the other VOQs. We therefore introduce the historical activation time to provide port fairness: as the hotspot VOQ's load continues to rise, the priority of the other VOQs also increases.
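The hotspot pattern itself is not parameterized in this section. As a sketch under assumed parameters (hot_port and hot_fraction are illustrative choices, not the paper's settings), a common formulation sends a fixed fraction of traffic to one hot output and spreads the rest uniformly:

```python
import random

def hotspot_destination(n_ports, hot_port=0, hot_fraction=0.5, rng=random):
    """Pick a destination output under a generic hotspot model.

    With probability hot_fraction the packet targets the hot output;
    otherwise a uniformly random output is chosen. The parameters here
    are illustrative assumptions only.
    """
    if rng.random() < hot_fraction:
        return hot_port
    return rng.randrange(n_ports)
```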

4.4. Fairness Analysis

Table 2 shows the fairness of the m-RGA algorithm under high load (p ≥ 80%) for the bursty and hotspot traffic models. As shown in the table, the fairness of m-RGA remains above 97% even in high-load traffic scenarios. This is related to the priority given to the input–output pairs selected by Equation (1): when a port is not in a matching state and its load is high, the port preferentially executes this formula. Additionally, when the historical load of a VOQ exceeds 50, a VOQ that has not yet entered a "hungry" state will also be matched.
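The fairness metric behind Table 2 is defined in the paper's experimental setup. As one standard illustration only (not necessarily the metric used in Table 2), Jain's fairness index over per-VOQ service shares can be computed as:

```python
def jain_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1].

    Equals 1.0 when all flows receive identical service and decreases
    as the allocation becomes more uneven.
    """
    n = len(throughputs)
    s = sum(throughputs)
    sq = sum(x * x for x in throughputs)
    return (s * s) / (n * sq) if sq else 1.0
```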

5. Future Prospects

First, our experiments mainly discussed the ideal behavior of the m-RGA algorithm under uniform, bursty, and hotspot traffic scenarios. In actual hardware implementations, however, bursts also arise in other types of traffic, including extreme scenarios. In this study, we were limited by the available time frame and hardware resources; thus, some intended use cases have not yet been fully covered or explored in depth. Our experiments focused on key scenarios that effectively demonstrate the core advantages of the m-RGA algorithm. Despite our encouraging results in these critical scenarios, we have not yet fully evaluated other possible application scenarios, especially those requiring greater flexibility or adaptation to more complex network dynamics.
Second, the m-RGA algorithm relies on orderings of queue length and of the oldest activation time, which requires the state of each VOQ to be maintained precisely. This mechanism requires careful consideration during parameter setting; particularly in dynamic network environments, it is crucial to ensure that the ranking accurately reflects actual queue-state changes.
Third, regarding sensitivity to hardware constraints, the precise parameter settings of the m-RGA algorithm are an important consideration for modern high-speed switches that pursue high performance and low latency. In hardware implementations, they occupy additional resources, which limits the algorithm's applicability in resource-constrained environments.
Future work will be devoted to further expanding the application scope of the m-RGA algorithm, including but not limited to developing more elaborate parameter-adjustment mechanisms, exploring performance under additional traffic patterns, and optimizing the algorithm for different hardware configurations. We look forward to addressing the current limitations in subsequent research and to continuously improving the applicability and efficiency of the m-RGA algorithm.

6. Conclusions

In this paper, we propose a new single-iteration scheduling algorithm that reduces the limitations of traditional iterative algorithms by limiting the number of iterations to one and the request-phase overhead to a single bit. To ensure high throughput and low delay, the VOQ priority is dynamically adjusted according to the historical service time and load level of each VOQ at the port, so that every VOQ has a chance to be served and the starvation problem caused by some VOQs going unserved for long periods is avoided. Second, we provide a time-complexity analysis of the proposed m-RGA algorithm and compare it with other algorithms in terms of number of iterations, request type, input/output arbitration, and time complexity. Third, we discuss the shortcomings of the m-RGA algorithm, specifically its sensitivity to traffic patterns, parameter settings, and hardware limitations. We show that an algorithm achieving a 2× speedup under a constant traffic model can guarantee both throughput and delay performance, making it theoretically as good as any maximum matching algorithm. Simulation results show that the proposed algorithm achieves high throughput under both uniform and non-uniform traffic patterns.

Author Contributions

B.C. performed the writing, experimental design, and validation; W.Z. supervised the manuscript and managed the project. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

No external datasets were used in this study; the data presented were obtained from the experimental results.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Building the Traffic Model and 100% Throughput Proof

Appendix A.1. Building the Traffic Model

This appendix is devoted to analyzing the throughput of m-RGA with a single iteration and no speedup and providing the sufficient condition of stability for the single-iteration m-RGA algorithm. Unless otherwise specified, the switch buffer is assumed to be sufficiently large.
The stability proof in this section is mainly based on the approach adopted in [19,20]. We first construct a fluid model for m-RGA. Let $\lambda_{ij}$ be the mean packet arrival rate to VOQ(i,j) and let $Z_{ij}(n)$ be the number of packets in VOQ(i,j) at the beginning of time slot $n$. Further, let the cumulative numbers of packet arrivals and departures for VOQ(i,j) at the beginning of slot $n$ be $A_{ij}(n)$ and $D_{ij}(n)$, respectively. Then, we have
$$Z_{ij}(n) = Z_{ij}(0) + A_{ij}(n) - D_{ij}(n), \quad n \ge 0, \; i,j = 0,1,\ldots,N-1.$$
Without loss of generality, assume that the packet arrival process obeys the strong law of large numbers with probability one, i.e.,
$$\lim_{n \to \infty} \frac{A_{ij}(n)}{n} = \lambda_{ij}, \quad i,j = 0,\ldots,N-1.$$
By definition, the switch is rate-stable if
$$\lim_{n \to \infty} \frac{D_{ij}(n)}{n} = \lambda_{ij}, \quad i,j = 0,\ldots,N-1.$$
An admissible traffic matrix is defined as one that satisfies the following constraints:
$$\sum_{i} \lambda_{ij} \le 1, \qquad \sum_{j} \lambda_{ij} \le 1.$$
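These admissibility constraints can be checked mechanically for any rate matrix; the following sketch (our notation, not part of the proof) verifies the row and column conditions:

```python
def is_admissible(lam):
    """Check admissibility of an N x N arrival-rate matrix lam:
    every row sum (per input) and column sum (per output) must be <= 1."""
    n = len(lam)
    rows_ok = all(sum(lam[i][j] for j in range(n)) <= 1.0 for i in range(n))
    cols_ok = all(sum(lam[i][j] for i in range(n)) <= 1.0 for j in range(n))
    return rows_ok and cols_ok

# Uniform traffic at load p has lambda_ij = p / N, admissible for p <= 1.
```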
If a switch is rate-stable for an admissible traffic matrix, then the switch delivers 100% throughput. The fluid model is determined by a limiting procedure, illustrated below. First, the discrete functions are extended to right-continuous functions. For arbitrary time $t \in [n, n+1)$,
$$A_{ij}(t) = A_{ij}(n),$$
$$Z_{ij}(t) = Z_{ij}(n),$$
and
$$D_{ij}(t) = D_{ij}(n) + (t - n)\bigl(D_{ij}(n+1) - D_{ij}(n)\bigr).$$
Note that all functions are random elements taking values in $\{0, 1, 2, \ldots\}$. We shall sometimes use the notation $A_{ij}(\cdot,\omega)$, $Z_{ij}(\cdot,\omega)$, and $D_{ij}(\cdot,\omega)$ to explicitly denote the dependence on the sample path $\omega$. For a fixed $\omega$, at time $t$ we have [19]:
  • $A_{ij}(t,\omega)$: the cumulative number of arrivals to VOQ(i,j).
  • $Z_{ij}(t,\omega)$: the number of packets in VOQ(i,j).
  • $D_{ij}(t,\omega)$: the cumulative number of departures from VOQ(i,j).
For each $r > 0$, we define
$$\bar{A}^{\,r}_{ij}(t,\omega) = r^{-1} A_{ij}(rt,\omega),$$
$$\bar{Z}^{\,r}_{ij}(t,\omega) = r^{-1} Z_{ij}(rt,\omega),$$
and
$$\bar{D}^{\,r}_{ij}(t,\omega) = r^{-1} D_{ij}(rt,\omega).$$
It is shown in [19] that for each fixed $\omega$ satisfying (2) and any sequence $\{r_n\}$ with $r_n \to \infty$ as $n \to \infty$, there exists a subsequence $\{r_{n_k}\}$ and continuous functions $\bar{A}_{ij}(\cdot)$, $\bar{Z}_{ij}(\cdot)$, and $\bar{D}_{ij}(\cdot)$ such that, uniformly on compact sets as $k \to \infty$, for any $t \ge 0$,
$$\bar{A}^{\,r_{n_k}}_{ij}(t,\omega) \to \lambda_{ij} t,$$
$$\bar{Z}^{\,r_{n_k}}_{ij}(t,\omega) \to \bar{Z}_{ij}(t),$$
$$\bar{D}^{\,r_{n_k}}_{ij}(t,\omega) \to \bar{D}_{ij}(t).$$
Definition A1. 
Any function obtained through the limiting procedure in (3) is said to be a fluid limit of the switch. Thus, the fluid model equation under m-RGA is
$$\bar{Z}_{ij}(t) = \bar{Z}_{ij}(0) + \lambda_{ij} t - \bar{D}_{ij}(t), \quad t \ge 0.$$
Definition A2. 
The fluid model of a switch operating under a scheduling algorithm is said to be weakly stable if, for every fluid model solution $(\bar{D}, \bar{Z})$ with $\bar{Z}(0) = 0$, we have $\bar{Z}(t) = 0$ for almost every $t \ge 0$.

Appendix A.2. Proof of 100% Throughput

From [19], a switch is rate-stable if its corresponding fluid model is weakly stable. Our goal here is to prove that for every fluid model solution ( D ¯ , Z ¯ ) using m-RGA, Z ¯ ( t ) = 0 for almost every t 0 . Specifically, we will use Fact 1 from [20].
Fact 1. Let $f$ be a non-negative absolutely continuous function defined on $\mathbb{R}^+ \cup \{0\}$ with $f(0) = 0$. Assume that $\dot{f}(t) \le 0$ for almost every $t$ such that $f(t) > 0$. Then, $f(t) = 0$ for almost every $t \ge 0$.
Note that $\mathbb{R}^+$ is the set of positive real numbers and that $\dot{f}(t)$ denotes the derivative of $f$ at time $t$.
In the following theorem, we show the sufficient condition for 100% throughput of m-RGA.
Theorem A1. 
(Sufficiency) When $\lambda_{ij} \le 1/N$ for all $i,j \in \{0,1,\ldots,N-1\}$, m-RGA can achieve 100% throughput.
Proof. 
Define $B \triangleq \{m : \bar{Z}_{im}(t) > 0\}$. Let $G_i(t)$ denote the joint queue occupancy of all non-empty VOQs at input port $i$. Then, we have
$$G_i(t) = \sum_{m \in B} \bar{Z}_{im}(t).$$
Because $\bar{Z}(t)$ is a non-negative, absolutely continuous function, from (38), $G_i(t)$ is also non-negative and absolutely continuous. Without loss of generality, assume that all VOQs are initially empty, i.e., $\bar{Z}(0) = 0$. Then, $G_i(0) = 0$ and the derivative of $G_i(t)$ is
$$\dot{G}_i(t) = \sum_{m \in B} \dot{\bar{Z}}_{im}(t).$$
Combining the above equation with (17), we get
$$\dot{G}_i(t) = \sum_{m \in B} \lambda_{im} - \sum_{m \in B} \dot{\bar{D}}_{im}(t).$$
From the condition $0 \le \lambda_{ij} \le 1/N$ for all $i,j \in \{0,1,\ldots,N-1\}$,
$$\dot{G}_i(t) \le \frac{h}{N} - \sum_{m \in B} \dot{\bar{D}}_{im}(t),$$
where $h = |B| \neq 0$.
Suppose that $G_i(t) > 0$ for some $t > 0$. This implies that for any $m_1 \in B$ and $m_2 \notin B$, $\bar{Z}_{im_1}(t) > 0$ and $\bar{Z}_{im_2}(t) = 0$; then, $\bar{Z}_{im_1}(t) - \bar{Z}_{im_2}(t) > 0$. By the continuity of these functions, there exists $\delta$ such that
$$\min_{t' \in [t, t+\delta]} \bigl( \bar{Z}_{im_1}(t') - \bar{Z}_{im_2}(t') \bigr) > 0, \quad \forall m_1 \in B, \; m_2 \notin B.$$
Let
$$q = \min_{m_1 \in B} \; \min_{m_2 \notin B} \; \min_{t' \in [t, t+\delta]} \bigl\{ \bar{Z}_{im_1}(t') - \bar{Z}_{im_2}(t') \bigr\}.$$
Thus, for a large enough $k$, we have $\bar{Z}^{\,r_{n_k}}_{im_1}(t') - \bar{Z}^{\,r_{n_k}}_{im_2}(t') \ge q/2$ for all $m_1 \in B$, $m_2 \notin B$, and $t' \in [t, t+\delta]$. Also, for a large enough $k$, we have $r_{n_k} \cdot q/2 \ge 1$. Thus, $Z_{im_1}(t') - Z_{im_2}(t') \ge 1$ for all $m_1 \in B$, $m_2 \notin B$, and $t' \in [r_{n_k} t, r_{n_k}(t+\delta)]$. This means that in the long time interval $[r_{n_k} t, r_{n_k}(t+\delta)]$, any non-empty VOQ at input port $i$ belongs to the set $U \triangleq \{\mathrm{VOQ}(i,m) : m \in B\}$ and any VOQ belonging to $U$ is non-empty [20]. Because m-RGA always assigns the highest priority to the preferred input–output pairs calculated by (1), during this interval each non-empty VOQ sends at least one packet every $N$ slots. When the preference criteria are met, the input port preferentially matches the VOQ with the longest service history; if the preferred condition is not met, input port $i$ maintains the port matching of the previous time slot. Therefore, within this interval each non-empty VOQ is selected at least once, that is, a packet is sent. Then, for any $t' \in (t, t+\delta]$, over the long time interval $[r_{n_k} t, r_{n_k} t']$, input $i$ sends at least $h = |B|$ packets every $N$ slots; in other words,
$$\sum_{m \in B} \bigl[ D_{im}(r_{n_k} t') - D_{im}(r_{n_k} t) \bigr] \ge L h,$$
where $L \in \mathbb{Z}$ and $N L \le r_{n_k} t' - r_{n_k} t < N L + N$. Thus, we have
$$L > \frac{r_{n_k} (t' - t)}{N} - 1.$$
Combining (22) with (23), we have
$$\sum_{m \in B} \bigl[ D_{im}(r_{n_k} t') - D_{im}(r_{n_k} t) \bigr] > \frac{h \cdot r_{n_k} (t' - t)}{N} - h.$$
Because $h = |B|$ lies within $[0, N]$, its impact is insignificant in the fluid limit [20]. Dividing the above equation by $r_{n_k}$ and letting $k \to \infty$, the fluid limit is obtained as follows:
$$\sum_{m \in B} \bigl[ \bar{D}_{im}(t') - \bar{D}_{im}(t) \bigr] > \frac{h (t' - t)}{N}.$$
Further dividing the above equation by $(t' - t)$ and letting $t' \to t$, the derivative of the fluid limit is
$$\sum_{m \in B} \dot{\bar{D}}_{im}(t) > \frac{h}{N}.$$
Combining (24) and (26), we get
$$\dot{G}_i(t) < 0.$$
Based on Fact 1, $G_i(t) = 0$ for almost every $t \ge 0$. Due to (27), $\bar{Z}_{im}(t) = 0$ for almost every $t \ge 0$; hence, m-RGA is weakly stable. This proves Theorem A1: when $\lambda_{ij} \le 1/N$ for all $i,j \in \{0,1,\ldots,N-1\}$, m-RGA achieves 100% throughput.      ☐

References

  1. Priyadarshini, P.; Veeramanju, K.T. A systematic review of cloud storage services—A case study on Amazon Web Services. Int. J. Case Stud. Bus. IT Educ. 2022, 6, 124–140. [Google Scholar]
  2. Sapotnitska, N.; Ovander, N.; Harkava, V.; Kireeva, K.; Orlenko, O. Using big data to optimize economic processes in the digital age. Financ. Credit. Act. Probl. Theory Pract. 2023, 4, 164. [Google Scholar] [CrossRef]
  3. Junaid, M.; Ali, S.; Siddiqui, I.F.; Nam, C.; Qureshi, N.M.F.; Kim, J.; Shin, D.R. Correction to: Performance Evaluation of Data-driven Intelligent Algorithms for Big Data Ecosystem. Wirel. Pers. Commun. 2022, 127, 1827–1830. [Google Scholar] [CrossRef]
  4. Zhu, G.; Li, Y.; Chen, Y.; Chai, S.; Shi, Q.; Luo, Z. Global competitive situation of 6G key technology R&D and China’s countermeasures. Strateg. Study CAE 2023, 25, 9–17. [Google Scholar]
  5. Feng, Y.; Ma, K. Switch-Less Dragonfly on Wafers: A Scalable Interconnection Architecture Based on Wafer-Scale Integration. In Proceedings of the SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 17–22 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–17. [Google Scholar]
  6. Rocher-Gonzalez, J.; Escudero-Sahuquillo, J.; Garcia, P.J.; Lopez, P. Congestion Management in High-Performance Interconnection Networks Using Adaptive Routing Notifications. J. Supercomput. 2023, 79, 7804–7834. [Google Scholar] [CrossRef]
  7. Zyla, K.; Liess, M.; Wild, T.; Smith, J. FlexCross: High-Speed and Flexible Packet Processing via a Crosspoint-Queued Crossbar. In Proceedings of the 27th Euromicro Conference on Digital System Design (DSD), Paris, France, 28–30 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 98–105. [Google Scholar]
  8. Kesidis, G.; McKeown, N. Output-buffer ATM Packet Switching for Integrated-Services Communication Networks. In Proceedings of the ICC’97—International Conference on Communications, Montreal, QC, Canada, 12 June 1997; IEEE: Piscataway, NJ, USA, 1997; Volume 3, pp. 1684–1688. [Google Scholar]
  9. McKeown, N.; Varaiya, P.; Walrand, J. Scheduling Cells in an Input-Queued Switch. Electron. Lett. 1993, 29, 2174–2175. [Google Scholar] [CrossRef]
  10. Karol, M.J. Input versus output queueing on a space-division packet switch. IEEE Trans. Commun. 1987, 35, 1347–1356. [Google Scholar] [CrossRef]
  11. Anderson, T.E.; Owicki, S.S.; Saxe, J.B.; Thacker, C.P. High-Speed Switch Scheduling for Local-Area Networks. ACM Trans. Comput. Syst. 1993, 11, 319–352. [Google Scholar] [CrossRef]
  12. McKeown, N. The iSLIP Scheduling Algorithm for Input-Queued Switches. IEEE/ACM Trans. Netw. 1999, 7, 188–201. [Google Scholar] [CrossRef]
  13. McKeown, N.W. Scheduling Algorithms for Input-Queued Cell Switches. Ph.D. Dissertation, University of California, Berkeley, CA, USA, 1995. [Google Scholar]
  14. Chao, J. Saturn: A terabit packet switch using dual round robin. IEEE Commun. Mag. 2000, 38, 78–84. [Google Scholar] [CrossRef]
  15. Han, K.E.; Song, J.; Kim, D.U.; Youn, J.; Park, C.; Kim, K. Grant-Aware Scheduling Algorithm for VOQ-Based Input-Buffered Packet Switches. ETRI J. 2018, 40, 337–346. [Google Scholar] [CrossRef]
  16. Scicchitano, A.; Bianco, A.; Giaccone, P.; Leonardi, E.; Schiattarella, E. Distributed scheduling in input queued switches. In Proceedings of the IEEE International Conference on Communications (ICC 2007), Glasgow, UK, 24–28 June 2007; IEEE: Piscataway, NJ, USA, 2007. [Google Scholar] [CrossRef]
  17. Hu, B.; Yeung, K.L.; Zhang, Z. Minimizing the communication overhead of iterative scheduling algorithms for input-queued switches. In Proceedings of the 2011 IEEE Global Telecommunications Conference (GLOBECOM 2011), Houston, TX, USA, 5–9 December 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1–5. [Google Scholar]
  18. Mneimneh, S. Matching from the first iteration: An iterative switching algorithm for an input queued switch. IEEE/ACM Trans. Netw. 2008, 16, 206–217. [Google Scholar] [CrossRef]
  19. Berger, M.S. Delivering 100% throughput in a buffered crossbar with round robin scheduling. In Proceedings of the 2006 Workshop on High Performance Switching and Routing, Poznan, Poland, 7–9 June 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 403–407. [Google Scholar]
  20. Javidi, T.; Magill, R.; Hrabik, T. A high-throughput scheduling algorithm for a buffered crossbar switch fabric. In Proceedings of the 2001 IEEE International Conference on Communications (ICC 2001), Helsinki, Finland, 11–14 June 2001; IEEE: Piscataway, NJ, USA, 2001; pp. 1586–1591. [Google Scholar]
Figure 1. An N × N centralized input-queued switch.
Figure 2. A 4 × 4 Input-queued switch. Each input has four VOQs, which are set to R and P according to their request mechanism and historical service time counter. All empty VOQs are set to 0. The figure shows the detailed process of the request phases (a). The arrows represent strong signals.
Figure 3. Detailed process of the grant phases (b). The arrows represent strong signals.
Figure 3. Detailed process of the grant phases (b). The arrows represent strong signals.
Electronics 14 02971 g003
Figure 4. Detailed process of the confirmation phase (c). The arrows represent strong signals.
Figure 4. Detailedprocess of the confirmation phase (c). The arrows represent strong signals.
Electronics 14 02971 g004
Figure 5. Comparative analysis with uniform traffic.
Figure 6. Comparative analysis with bursty traffic.
Figure 7. Comparative analysis with hotspot traffic.
Table 1. Comparison of m-RGA with other algorithms.
Scheduling Algorithm | Number of Iterations | Request Type | Input/Output Arbiter | Time Complexity
iSLIP | Multiple iterations | Multiple SBRs | Local Round Robin | O(1)
PIM | Multiple iterations | Multiple SBRs | Random Selection | O(N log N)
iLQF | Multiple iterations | Multiple MBRs | Longest VOQ | O(N^2 log N)
Π-RGA | Single iteration | Multiple MBRs | Dynamic priority matching | O(log N)
SRR | Single iteration | Single SBR | Global Round Robin and longest VOQ (input only) | O(log N)
RR/LQF | Single iteration | Single SBR | Global Round Robin and longest VOQ | O(log N)
HRF | Single iteration | Multiple SBRs | Global Round Robin and Highest Rank | O(log N)
m-RGA | Single iteration | Multiple SBRs | Global and Local Round Robin | O(log N)
Table 2. The fairness of the m-RGA algorithm under high-load (p ≥ 80%) conditions.

Input Load p (%) | Bursty Traffic (%) | Hotspot Traffic (%)
80 | 97.21 | 98.32
85 | 97.64 | 98.42
90 | 98.92 | 98.20
95 | 96.57 | 97.68

