1. Introduction
A distributed storage system that supports the lower layer of cloud computing is a reliable platform for storing petabyte (PB)-level data. Due to the large scale of its nodes, it is prone to abnormal situations. To decrease the disadvantageous influence of abnormal situations, fault-tolerant mechanisms must be employed to enhance the reliability and availability of the system. Traditional distributed storage systems, represented by the Hadoop distributed file system (HDFS) [
1], ensure reliability through replication, which provides fast read speeds, but leads to low storage utilization. However, as the number of nodes grows and the amount of data increases, the cost of storage and operation becomes unacceptable, making replication impractical [
2]. Erasure codes, which have higher storage efficiency and the same fault-tolerant capability as replication [
3], can be used to address this issue. Erasure codes can encode multiple pieces of raw data in parallel and form a small amount of parity data, which can significantly save storage space. An increasing number of companies are adopting erasure codes for their products. Google applies the Reed–Solomon code (RS code) [
4] in its new file system Colossus [
5]. Facebook’s open-source solution HDFS-RAID introduces erasure code to HDFS clusters [
6]. The Local Reconstruction Codes (LRCs) storage system is used to back up data in Windows Azure Storage (WAS) [
7].
The traffic in data center networks is high and dynamic, with significant variation in each link [
8]. However, access to each node during data repair is not balanced, leading to an uneven link load. Bandwidth is wasted on some links, while continuous overload on other links eventually causes network congestion and further delays data repair. In addition, once link failures or node failures have occurred during data repair, network resources are consumed by large amounts of data, which is detrimental not only to the reliability of the storage system but also to other system applications. Therefore, reducing the volume of data being transmitted and the network latency caused by data repair is crucial for improving the performance of erasure codes and system reliability.
Existing research on reducing the cost of data repair based on erasure code can be broadly classified into three types: proposing new solutions with low complexity [
9,
10,
11], optimizing the repair algorithm of data transmission [
12,
13,
14], and modifying the deployment strategy of data blocks [
15,
16,
17]. Researchers have made great efforts to improve repair methods. Nevertheless, two factors currently limit the effectiveness of these repair methods: (1) most research work adjusts the data transmission route to transfer the data flow at the network bottleneck, which does not reduce the network burden; and (2) existing work does not fully consider the network load when allocating data transmission tasks, which is detrimental to active adaptation to network conditions. For these two factors, further research is needed.
Software-defined networks (SDNs) [
18] have attracted significant attention in recent years. Centralized network control is achieved by decoupling the control plane and the data plane through its switch protocol, and software-programmable interfaces are provided for network applications. The SDN controller can monitor and manage all network resources, obtain information such as network topology changes and link status, and execute efficient processing in calculations and traffic statistics based on this information [
19]. Building on these characteristics of SDNs, we propose a new data repair strategy, Software-Defined Network Controller Repair (SDNC-Repair), which aims to improve the repair throughput of the system and reduce the data repair latency. To support the strategy, we put forward a data source selection algorithm based on intelligent bandwidth measurement, design a transmission scheduling algorithm based on dynamic feedback, and develop a cooperative and efficient data repair method. Our experiments show that our approach achieves better repair performance and higher system throughput.
The contributions of this work can be summarized as follows:
Propose a method for improving the performance of erasure-code-based data repair called SDNC-Repair that optimizes the transmission of the data repair process using the measurement technology of SDN and creates a distributed pipeline data repair operation to achieve efficient repair.
Develop a data source selection algorithm based on intelligent bandwidth measurement and a transmission scheduling algorithm based on dynamic feedback. These algorithms provide node combinations and schedule data flow during data repair.
Present a cooperative and efficient data repair method that improves the efficiency of repair by using SDN to shorten the repair chain and by improving the transmission efficiency and the distribution of computation.
The remainder of this paper is structured as follows.
Section 2 provides the background and motivation of our research, and
Section 3 provides an overview of some related works.
Section 4 discusses the details of SDNC-Repair, including a data source selection algorithm based on intelligent bandwidth measurement, a transmission scheduling algorithm based on dynamic feedback, and a cooperative and efficient data repair method. Experiments and analyses are carried out in
Section 5. Finally, we draw conclusions and discuss future work in
Section 6.
2. Background and Motivation
2.1. Background
Generally, a storage system based on the RS (n, k) erasure code divides the original data into k data blocks $d_1, d_2, \ldots, d_k$ and stores them in data devices $D_1, D_2, \ldots, D_k$. These k data blocks form m (where m = n − k) parity blocks $p_1, p_2, \ldots, p_m$ through linear coding calculations:

$$p_i = \sum_{j=1}^{k} a_{i,j}\, d_j, \qquad i = 1, 2, \ldots, m \tag{1}$$

where $a_{i,j}$ represents an element in the coding matrix, which determines the coefficient of each data block in the encoding process. Parity blocks are stored in parity devices $P_1, P_2, \ldots, P_m$. When data blocks and parity blocks are combined, a stripe is formed and deployed to n different nodes to maximize system reliability. Figure 1 depicts an example of the structure of a storage system using RS (9,6) erasure code. The system can reconstruct the original data from any k available units; reconstruction fails only when fewer than k nodes are available.
When data stored in the nodes are lost because of abnormal situations, the system triggers a data repair operation to maintain the stability of the system. The traditional repair method replaces the failed node with a new node, also called the destination node. Then, k healthy nodes are selected in the same stripe as the failed node, and their data are copied to the destination node. The nodes that supply data for the repair are referred to as providing nodes or source nodes. Finally, the system determines whether the lost block is a data block or a parity block. If it is a data block, the destination node decodes the received data. If it is a parity block, the destination node re-encodes the data.
Data repair is executed in stripes, as depicted in Figure 2. The RS (9,6) method divides the original data into 6 data blocks $d_1, \ldots, d_6$ (which can be further divided into smaller sub-blocks) and stores them in data nodes $D_1, \ldots, D_6$. Parity blocks $p_1, p_2, p_3$ are formed from the data blocks and stored in parity nodes $P_1, P_2, P_3$. Stripe S consists of data nodes and parity nodes. Suppose the system is ready to access $d_2$, but $d_2$ is lost due to the failure of $D_2$, triggering data repair. The new node obtains $d_1, d_3, d_4, d_5, d_6, p_1$ from source nodes $D_1, D_3, D_4, D_5, D_6, P_1$, and then calculates the missing data through the inverse matrix operation in the decoding process. The result is saved in the new node, indicating the repair operation is complete. When multiple nodes fail, the system carries out the repair of these nodes in parallel.
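The single-block repair described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: arithmetic is done modulo a small prime instead of in a Galois field, the coding matrix is an assumed Vandermonde-style matrix that is not guaranteed to give an MDS code, and the names `encode` and `repair_one` are hypothetical. Each block is reduced to a single symbol for brevity.

```python
# Illustrative sketch only: real RS codes operate in GF(2^8), not modulo a prime.
P = 257  # assumed prime modulus, chosen only to keep the arithmetic simple


def encode(data, m):
    """Return the assumed coding matrix and m parity symbols p_i = sum_j a[i][j] * d_j (mod P)."""
    k = len(data)
    # Assumed Vandermonde-style coding matrix: a[i][j] = (j + 1) ** i mod P.
    a = [[pow(j + 1, i, P) for j in range(k)] for i in range(m)]
    parity = [sum(a[i][j] * data[j] for j in range(k)) % P for i in range(m)]
    return a, parity


def repair_one(a_row, parity_symbol, survivors, lost):
    """Recover the data symbol at index `lost` from one parity symbol and the other data symbols."""
    partial = sum(a_row[j] * d for j, d in survivors.items()) % P
    inverse = pow(a_row[lost], -1, P)  # modular inverse, requires Python 3.8+
    return (parity_symbol - partial) * inverse % P


data = [11, 22, 33, 44, 55, 66]      # d1..d6, one symbol per block
a, parity = encode(data, m=3)        # p1..p3 for an RS (9,6)-style layout
survivors = {j: d for j, d in enumerate(data) if j != 1}  # d2 (index 1) is lost
recovered = repair_one(a[0], parity[0], survivors, lost=1)  # uses d1, d3..d6 and p1
print(recovered)
```

The new node needs k = 6 symbols (five data symbols and one parity symbol) to rebuild a single lost symbol, which is the traffic amplification that motivates the rest of this paper.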
2.2. Motivation
Large-scale data centers are deployed in layers, owing to the need to accommodate large-scale server nodes and the limited scalability of the single-layer network. This situation results in complex network topologies within data centers, and communication latencies between nodes vary. Moreover, data and servers have unequal degrees of access frequency, which causes unbalanced burdens on the link [
8]. When data repair is needed, the existing erasure-code-based method usually randomly selects data from the available nodes in the same stripe (such as the first available
k nodes). However, this erasure code method does not consider the quality of bandwidth between nodes and the burdens on the link, which affects the transmission and read/write performance during data repair. In other words, this erasure code method of randomly selecting nodes does not optimize data repair latency. Furthermore, choosing providing nodes in a poor network or under high load has two disadvantageous effects. Firstly, it aggravates network congestion. Secondly, it places a greater burden on the CPU and memory.
Traditional erasure-code-based methods involve a large number of data transmissions, encoding and decoding calculations, and downloads of data blocks, in addition to their improper node selection. Repairing lost data requires transmitting k times the amount of lost data on average, and the data transmission is skewed and concentrated, which is detrimental to system load balancing. In addition, read/write operations are time consuming, which also adversely affects system performance. Therefore, a low-overhead and high-efficiency erasure-code-based data repair method is needed.
To address this problem, we introduce an SDN in our work. Firstly, we use the SDN to measure the network status of the system and select k available nodes with low load and high bandwidth. We then schedule transmission routes to reduce the network burden and shorten the transfer time during data repair. Finally, we utilize computation distribution and parallel repair to improve data repair performance.
3. Related Works
Many researchers have chosen to improve data repair performance by modifying erasure codes. In addition to Reed–Solomon codes, array codes adopt array layout coding, which is based on exclusive OR (XOR) rather than the Galois field operations, simplifying coding and decoding [
20]. Dimakis et al. proposed regenerating codes based on the concept of network coding, which can greatly reduce the network bandwidth consumed in the data repair process [
9]. Liang et al. used local regenerative code to repair and store data between failed nodes in industrial networks while ensuring user data privacy, indicating the extensive usability of regenerating codes [
10]. Shan et al. proposed Geometric Partitioning, which divides the regenerative code into blocks of different sizes to improve the repair performance of the regenerative code [
11].
Some literature focuses on optimizing the data repair process from the perspective of data transmission structure, as traditional data repair using star structure repair (SSR) is simple but inefficient. Zheng et al. [
21] introduced a traffic efficient repair scheme (TERS) to SSR, which saves considerable repair bandwidth. Tree structure repair (TSR) [
22] forms a tree structure based on the network distance of nodes. Huang et al. [
23] designed a tree-type repair scheme considering node selection, which includes algorithms to select nodes and establish the optimal repair tree. Zhou et al. [
24] proposed a tree-structured data placement scheme with cluster-aided top-down transmission, which improves the practicality and efficiency of data insertion. Repair pipelining (PR) [
12] transmits repair data by pipeline. The literature [
13] proposed partial parallel repair (PPR), which uses the divide-and-conquer method to decompose the repair operation into multiple nodes and uses a parallel pipeline to transmit calculation data until the repair is completed. Li et al. implemented a repair pipelining prototype, which improves the performance of degraded reads and full-node recovery over existing repair techniques [
14].
The evolution of the data repair transmission structure focuses on the scarce resource of network bandwidth, aiming to improve efficiency by reducing the network overhead introduced by data repair. In addition, some literature focuses on cross-rack networks and strives to reduce the transmission traffic of data repair on high-level links of network topology. For example, the Intra-Node Parity data reconstruction scheme proposed in the literature [
25] uses switch computing to realize traffic merging and forwarding, effectively reducing the amount of data transmitted on the network. Hou et al. [
26] proposed a cross-rack-aware regenerating code that achieves a balance between storage cost and cross-rack network repair bandwidth cost. Hu et al. [
15] proposed a hierarchical block placement strategy in DoubleR, which places multiple data blocks on each rack and aggregates data blocks by finding suitable relay nodes within the rack, minimizing cross-rack traffic. Xu et al. [
16] proposed rPDL, which effectively reduces cross-rack traffic and provides nearly balanced cross-rack traffic distribution by uniformly choosing replacement nodes and retrieving determined available blocks to recover the lost blocks. Liu et al. [
17] achieved low latency by deploying caching services at the edge servers close to end-users.
In conclusion, existing data repair strategies either require significant changes to the existing system architecture, or do not consider the differences in Quality of Service (QoS) among heterogeneous networks. SDN enables monitoring and management of all network resources. By utilizing SDN, it is possible to dynamically adjust network resources to adapt to changing network conditions and optimize the data repair process accordingly, balancing latency and link utilization in a more flexible way to improve the data repair efficiency.
4. SDNC-Repair
4.1. The SDNC-Repair Framework
The data repair process in erasure codes consists of two essential parts: data transmission and encoding/decoding calculations. The framework of SDNC-Repair and interactions between components are shown in
Figure 3. SDNC-Repair is implemented by the storage system with RS code, the SDN controller, and a network of switches. Storage nodes, which are deployed in racks, store data blocks and parity blocks. Information such as the location of the data and the location of the redundancy is stored in the metadata node. The network of switches is composed of SDN switches (such as OpenFlow switches) and the links between these switches. The top-of-rack software switch supports the XOR operation, which can reduce the amount of data transferred across racks. The SDN controller controls and monitors the switch group through the SDN switch protocol.
Figure 3 describes the basic principle of SDNC-Repair, which consists of two main phases: the transmission phase and the calculation phase.
Transmission phase (Steps 1–7 in
Figure 3): The controller selects the most suitable transmission routes according to the network topology and monitors the workload of the switches to control the repair rate.
Calculation phase (Steps 8–10 in
Figure 3): The switch delivers data to the top-of-rack switch based on the flow table and achieves efficient data repair through pipelining and parallelization.
In the aforementioned process, the OpenFlow protocol matches VLAN ID and VLAN priority to route the data flow through the path designated by the controller. SDNC-Repair provides three algorithms to improve data repair efficiency: an intelligent bandwidth measurement-based data source selection algorithm and a dynamic feedback-based transmission scheduling algorithm during the Transmission phase (shown in red in
Figure 3), and a cooperative and efficient data repair method during the Calculation phase (shown in blue in
Figure 3). These algorithms are discussed in
Section 4.2,
Section 4.3 and
Section 4.4, respectively.
Table 1 provides a summary of the notations used in this paper.
4.2. Data Source Selection Algorithm Based on Intelligent Bandwidth Measurement
To reconstruct missing data,
k providing nodes in the same stripe must be chosen, and they must provide data for
x new nodes. During this process, the repair speed is strongly associated with system reliability. Practical measurements show that network transit time accounts for 70–80% of the overall repair time. As shown in Figure 4, the x-axis label “RS(3,2)-1” denotes repairing one missing data block with RS (3,2), and the remaining labels are read in the same way. Network transmission is therefore a key factor affecting the performance of data repair.
The goal of the algorithm is to select
k nodes with high available bandwidth as providing nodes, shorten network transit time, and improve system reliability. Existing methods [
4,
22] assume that data repair occurs in a homogeneous network. However, the traffic in data centers is high and dynamic, and the burdens on links vary. Correspondingly, the available bandwidth between nodes also changes continuously. Therefore, simply treating the available bandwidth between nodes as a fixed value cannot achieve optimal transmission latency. Data repair involves data downloading, decoding, and uploading of the repaired data block. The amount of transmission traffic and the number of switches it flows through are decisive factors in how much of the system's resources a repair occupies. This algorithm optimizes the use of system resources and fundamentally improves data repair efficiency.
The algorithm uses SDN network virtualization technology to select k nodes with high available bandwidth and short network distance from the n − x surviving nodes, and repairs the data of the x new nodes in parallel. The algorithm takes advantage of the SDN controller's global control of the network, sorts the n − x surviving nodes according to the system load, and dynamically selects the top k nodes with low transmission latency. The details of the data source selection algorithm (Algorithm 1) are as follows.
Algorithm 1: Data source selection algorithm based on intelligent bandwidth measurement
Input: group of new nodes, group of surviving nodes, topology graph G = (V, E)
Output: group of providing nodes
for each node in the group of new nodes do
    for each node in the group of surviving nodes do
        if the two nodes are reachable from each other in G then // the new node and the surviving node are connected
            add the distance between the two nodes to the node distance set
            add the surviving node to the candidate data source node set
        end if
    end for
end for
for each node in the group of new nodes do
    for each node in the candidate data source node set do
        measure the available bandwidth between the two nodes and add it to the available bandwidth set
        calculate the latency from the distance and the available bandwidth, weighted by their decision parameters α and β
    end for
end for
sort the candidate data source node set in ascending order of latency
take the first k nodes as the low-latency providing node set
return the group of providing nodes
The inputs of the algorithm are a group of new nodes, a group of surviving nodes, and a topology graph G = (V, E) maintained by SDN controllers, where V represents the switches that participate in data repair and E represents all the links between nodes. The bandwidth information is the edge weight of the links in E. SDN controllers use the link discovery protocol described in the literature [27] to create and maintain G. The output of the algorithm is the set of providing nodes with low transmission latency.
The algorithm first calculates the distance between each new node and each surviving node in G, defined as the number of hops through switching devices, to determine whether the nodes are reachable.
The distances between reachable nodes are then added to the node distance set, and reachable nodes are added to the candidate data source node set.
The controller measures the remaining available bandwidth in the ports of the switch-connected nodes, which is the difference between the path bandwidth and the smallest background load of all links in the path.
Based on the node distance set and the available bandwidth set, along with their respective weight factors α and β, the transmission delay is calculated. The first k data sources with the lowest delay are then selected from the candidate set, which is sorted in ascending order of delay.
The flowchart of the algorithm is shown in
Figure 5.
The algorithm’s results are used to provide data to the new node during the repair operation. Data source nodes with low latency are obtained through intelligent measuring. The node connectivity and the load of the links during the process are actively detected, which makes the system highly adaptable to the network status.
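The selection procedure can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the latency score `alpha * hops + beta / bandwidth` is an assumed stand-in for the delay calculation, and the function names, the adjacency-list topology, and the bandwidth table are all hypothetical.

```python
from collections import deque


def hop_distance(adj, src, dst):
    """Breadth-first search over the topology; returns the number of links traversed, or None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def select_sources(adj, avail_bw, new_node, survivors, k, alpha=1.0, beta=1.0):
    """Rank reachable surviving nodes by an assumed latency score and keep the k lowest."""
    scored = []
    for s in survivors:
        dist = hop_distance(adj, new_node, s)
        if dist is None:  # the new node and the surviving node are not connected
            continue
        # Assumed scoring: latency grows with hop count and shrinks with available bandwidth.
        latency = alpha * dist + beta / avail_bw[s]
        scored.append((latency, s))
    scored.sort()  # ascending order of latency
    return [s for _, s in scored[:k]]


# Hypothetical two-rack topology: N is the new node, T1/T2 are ToR switches, C1 is a core switch.
adj = {
    "N": ["T1"], "T1": ["N", "A", "B", "C1"], "A": ["T1"], "B": ["T1"],
    "C1": ["T1", "T2"], "T2": ["C1", "C", "D"], "C": ["T2"], "D": ["T2"], "E": [],
}
avail_bw = {"A": 100.0, "B": 10.0, "C": 1000.0, "D": 500.0, "E": 1000.0}
chosen = select_sources(adj, avail_bw, "N", ["A", "B", "C", "D", "E"], k=3, alpha=1.0, beta=100.0)
print(chosen)
```

In this example the nearby but congested node B is passed over in favor of the more distant nodes C and D, and the unreachable node E is never considered, which is the behavior the algorithm aims for.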
4.3. Transmission Scheduling Algorithm Based on Dynamic Feedback
The transmission mode is crucial, since data transmission takes a long time in the repair process. Higher-bandwidth paths allow for faster transmission than lower-bandwidth paths. However, imbalances in repair tasks and access, as well as system management-related events such as repair latency, can lead to unequal link utilization in real situations. When this state accumulates and amplifies, it inevitably leads to link congestion, affecting system performance. System reliability can be critically damaged if data are permanently lost during repair. Based on dynamic feedback, SDNC-Repair puts forward a transmission scheduling algorithm that considers data flow credibility and latency requirements. To improve network throughput and avoid overloading switches and links, the algorithm selects low-cost routes between providing nodes and new nodes according to the global network view and link status, and schedules data blocks so as to avoid transmission congestion. Algorithm 2 describes the scheduling algorithm based on dynamic feedback.
Algorithm 2: Transmission scheduling algorithm based on dynamic feedback
Input: list of data to be transmitted, global topology graph G, switch port queue length threshold Q′
Output: flag // indicates whether the transmission is successful or not
initialize flag
for each block in the list of data to be transmitted do
    find the set of available paths for the block in G
    for each path in the set of available paths do
        get the load of each link in the path at time t
        calculate the utilization of the links in the path
        calculate the load on the path as the maximum utilization of its links
    end for
    find the path with the minimum load // select the route with the lowest background load
    obtain the queue length of the switch at time t
    if the queue length exceeds Q′ then
        send a congestion signal and notify other switches to reduce their sending rate
    end if
    transmit the block along the selected path
end for
return flag
The inputs of the algorithm are a list of data to be transmitted, a global topology graph G, and a switch port queue length threshold Q′. The output is a flag that indicates whether the transmission was successful.
The algorithm first discovers the underlying network topology through the controller to find the available path set for the data blocks in the list.
Then, the controller queries the switch port flow statistics through traffic monitoring components to calculate the link utilization and the path load, where the path load is the maximum utilization ratio of all links in the path.
To ensure that the transmission avoids bottleneck links, the path with the smallest background load is selected as the transmission path for the data block.
At the same time, the controller periodically checks the switch port queue length for the transmitted data block and dynamically adjusts the transmission rate by comparing it with the system’s set threshold Q′. If the queue length exceeds Q′, the switch sends a congestion notification message to the controller to reduce the transmission rate and avoid switch overload.
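The path selection and feedback steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the controller code of SDNC-Repair: the link utilization table, the multiplicative `backoff` factor, and all function names are hypothetical, and a real controller would obtain utilization and queue length from switch port statistics.

```python
def path_load(path, link_util):
    """Path load is the maximum utilization ratio over all links of the path."""
    return max(link_util[link] for link in path)


def choose_path(paths, link_util):
    """Select the candidate path with the smallest load, so the bottleneck link is avoided."""
    return min(paths, key=lambda p: path_load(p, link_util))


def next_rate(rate, queue_len, q_threshold, backoff=0.5):
    """Reduce the sending rate when the port queue exceeds Q'; the backoff factor is an assumption."""
    return rate * backoff if queue_len > q_threshold else rate


# Hypothetical utilization ratios per link, keyed by (from, to).
link_util = {
    ("s1", "s2"): 0.9, ("s2", "dst"): 0.2,
    ("s1", "s3"): 0.5, ("s3", "dst"): 0.6,
}
paths = [
    [("s1", "s2"), ("s2", "dst")],  # contains a heavily loaded link
    [("s1", "s3"), ("s3", "dst")],  # evenly loaded
]
best = choose_path(paths, link_util)
throttled = next_rate(100.0, queue_len=12, q_threshold=10)
unchanged = next_rate(100.0, queue_len=8, q_threshold=10)
print(best, throttled, unchanged)
```

Ranking paths by their most utilized link, rather than by average utilization, keeps a single congested hop from being hidden by lightly loaded neighbors.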
The purpose of setting Q′ is to detect the load of the switch and adjust the sending rate to a suitable value. A fixed value of Q′ cannot suit the QoS of all networks because each network has different features. Therefore, this article selects three block-level trace volumes, volume_0, volume_1, and volume_2, which contain data collected from practical applications. A description of the dataset is provided in Table 2. We test and analyze the influence of the Q′ value on link utilization and latency by simulating node failure through randomly erasing the data stored in data nodes.
Figure 6a shows that when Q′ is low, the link utilization and latency are low because the switch sends few data blocks during the repair. As the value of Q′ increases, link utilization increases with it. However, a high sending rate causes data blocks to accumulate in the link, which further increases the repair latency. Thus, it is essential to set the value of Q′ according to the actual size of data blocks and bandwidth. The value of Q′ is inversely proportional to the size of data blocks and directly proportional to bandwidth. An appropriate value of Q′ can effectively avoid network congestion, increase repair throughput, and improve data repair performance. The flowchart of Algorithm 2 is shown in Figure 7.
The algorithm uses software-defined centralized control technology to select the transmission route of the data block based on the effective link bandwidth and adjusts the flow rate based on the link load to avoid link overload during the repair. Notably, the algorithm focuses on improving the general throughput of the repair operation rather than shortening the execution time of specific repair tasks.
Section 4.4 discusses a cooperative and efficient data repair method.
4.4. Cooperative and Efficient Data Repair Method
Under the hierarchical network layout structure of the data center, the cross-rack network bandwidth between storage nodes is often limited, and the data repair performance is usually bottlenecked by the cross-rack bandwidth. The goals of this section are to reduce the usage of cross-rack bandwidth and improve the computational efficiency of decoding and reconstruction. Data blocks are first sent to a top-of-rack switch (ToR) before being transmitted across the rack to the target node. To decrease the cross-rack bandwidth, ToR is an appropriate place to aggregate data. SDNC-Repair introduces a cooperative and efficient data repair method that optimizes decoding by leveraging the characteristics of data block transmission. By deploying the ToR in the SDNC-Repair framework as a software switch and using its support for read/write operations and XOR operations of the specified memory address, part of the repair task can be completed in the ToR, thus reducing the amount of data transmission across the rack. Data exchange between racks through the cooperation of nodes can also decrease repair costs.
We formalize the data repair calculation problem as follows: Suppose that a stripe of n storage nodes comprises k data nodes $D_1, \ldots, D_k$ and m parity nodes $P_1, \ldots, P_m$, which respectively store the data blocks $d_1, \ldots, d_k$ and the parity blocks $p_1, \ldots, p_m$. The storage nodes are distributed across several racks, each of which has its own ToR. As the data repair processes of different stripes are independent of each other, this section focuses on the analysis of data repair on a single stripe. Assume that a data node $D_f$ fails and its data block $d_f$ is lost. As discussed in Section 2, any block can be expressed as a linear combination of k other blocks of the same stripe. Thus, $d_f$ can be repaired through the following formula:

$$d_f = \sum_{i=1}^{k} \lambda_i\, c_i \tag{2}$$

where $c_1, \ldots, c_k$ are the k surviving blocks (data or parity) selected for the repair and $\lambda_i$ is the coefficient of the decoding matrix.
The cooperative and efficient data repair method aims to parallelize the data reconstruction process by disassembling Equation (2) and distributing the data repair calculations. The specific steps can be broadly described as follows:
According to the algorithm in Section 4.2, the data nodes and parity nodes participating in the repair operation are determined. The data blocks and parity blocks stored in these nodes are multiplied by their respective decoding coefficients from the decoding inverse matrix, resulting in the encoded blocks.
The encoded blocks are then sent to the ToR of their own rack, where they are aggregated through a summation operation. Each ToR thereby obtains one intermediate calculation result of partial repair, and then delivers this result to the ToR of the rack where the new node is located.
At the ToR of the new node's rack, all the received results are summed, the lost data block is recovered, and it is sent to the new node. The new node then stores the recovered block, indicating the completion of the repair.
Taking the RS (9,6) code as an example, suppose a data node fails and its data block needs to be repaired. The repair process is shown in Figure 8.
The providing nodes multiply their blocks by the decoding coefficients in parallel and obtain encoded blocks. These encoded blocks are sent to the ToRs of their respective racks.
Each ToR sums the received data and obtains an intermediate calculation result. Data aggregation greatly reduces the amount of data transferred onward. The ToRs then send their results to the ToR of the rack where the new node is located.
Finally, that ToR sums the received intermediate calculation results to recover the lost block and sends it to the new node, which stores it and completes the data repair operation.
The cooperative and efficient data repair method utilizes the XOR operation mechanism of software switches to aggregate data within a rack before transmitting repair data across the rack. The intermediate calculation result formed by the encoded blocks has the same size as a data block, but with fewer blocks. Reducing the number of blocks decreases the amount of data that needs to be transmitted across the rack, resulting in improved network efficiency and reduced costs. When the number of racks is fixed, a larger value of
k in an
RS (n, k) code (which means that more data blocks are in the same stripe) leads to better performance for the cooperative and efficient data repair method. This is because more data can be aggregated within each rack, improving the efficiency of the repair process. Furthermore, distributing and parallelizing calculations can effectively utilize the computing power of each storage node and switch involved in the repair, enabling simultaneous transmission and computation. The experiments in
Section 5 show that the calculation efficiency of data repair can be improved by the method described above.
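The effect of aggregating at the ToR can be sketched as follows. This is a minimal illustration and not the SDNC-Repair implementation: the rack layout, block contents, and coefficients are arbitrary placeholders rather than real decoding coefficients, the field polynomial 0x11D is one common choice for GF(2^8), and the helper names are hypothetical. The sketch only shows that summing per rack first yields the same result as summing everything at the destination, while sending fewer blocks across racks.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8) by shift-and-add, reducing by the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def scale(coeff, block):
    """Multiply every byte of a block by a coefficient (the 'encoded block' of a providing node)."""
    return bytes(gf_mul(coeff, byte) for byte in block)


def xor_sum(blocks):
    """Add blocks in GF(2^8); addition in this field is bytewise XOR."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)


# Hypothetical layout: six providing nodes in three racks, new node in a fourth rack.
racks = {
    "rack1": [(3, b"\x01\x02\x03\x04"), (7, b"\x10\x20\x30\x40")],
    "rack2": [(5, b"\xaa\xbb\xcc\xdd"), (9, b"\x0f\x0e\x0d\x0c")],
    "rack3": [(2, b"\x80\x81\x82\x83"), (11, b"\xfe\xdc\xba\x98")],
}
encoded = {r: [scale(c, blk) for c, blk in items] for r, items in racks.items()}

# Without aggregation: every encoded block crosses racks and is summed at the destination.
direct = xor_sum([blk for blks in encoded.values() for blk in blks])
direct_cross_rack = sum(len(blks) for blks in encoded.values())

# With aggregation: each ToR sums its rack's blocks and forwards one intermediate result.
partials = [xor_sum(blks) for blks in encoded.values()]
cooperative = xor_sum(partials)
coop_cross_rack = len(partials)

print(cooperative == direct, direct_cross_rack, coop_cross_rack)
```

Because the repair formula is linear, the order of summation does not matter, so the number of blocks crossing racks drops from the number of providing nodes to the number of racks that hold them.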
Figure 8.
Collaborative data repair.
6. Conclusions
This paper presented a study on a data repair mechanism based on erasure codes. Existing repair schemes do not consider the dynamic nature of network traffic and the imbalance of user access, resulting in less-than-ideal performance under actual workloads. To address these issues, this paper proposed a cooperative repair strategy based on an SDN controller, SDNC-Repair, and described its framework. SDNC-Repair provides solutions in data source selection, transmission scheduling, and cooperative and efficient data repair. The simulation results showed that SDNC-Repair effectively improves system repair throughput and reduces average repair time.
There is still much room for improving SDNC-Repair. Future work will include adding a link weight calculation algorithm and cache mechanism to further reduce repair costs; constructing a transmission structure with minimum latency across data centers and a computation model with minimum redundancy to ensure data repair efficiency of erasure codes; and slicing data blocks, retaining necessary information fragments, and performing fine-grained scheduling and control. These improvements can further enhance the performance of SDNC-Repair and make it more adaptable to various network environments.