Construction of Fractional Repetition Codes with Variable Parameters for Distributed Storage Systems

In this paper, we propose a new class of regular fractional repetition (FR) codes constructed from perfect difference families and quasi-perfect difference families to store big data in distributed storage systems. The main advantage of the proposed construction method is that it supports a wider range of code parameter values than existing ones, which is an important feature for adoption in practical systems. When using one instance of the proposed codes for a given parameter set, we show that the amount of stored data is very close to that of an existing state-of-the-art optimal FR code.


Introduction
As users produce a tremendous amount of data every day and these big data need to be stored in cloud storage [1,2], the efficiency and reliability of distributed storage systems (DSSs) become more crucial. Rather than using a random protocol, we need to adopt some mathematical structure in storing big data to minimize the overall cost of DSSs. Recently, locally repairable codes (LRCs) [3–7] and regenerating codes (RCs) [8–11] have been proposed for DSSs to support cost-reducing repair against node failures as well as data reconstruction. For repair, LRCs aim to connect to a small number of other nodes, while RCs aim to use a small amount of network bandwidth. There exists a trade-off between storage and repair bandwidth for general RCs, and its two extreme points correspond to the minimum storage regenerating (MSR) codes and the minimum bandwidth regenerating (MBR) codes. If every repaired node contains exactly the data that were stored before failure, such an RC is said to be exact [12,13].
Fractional repetition (FR) codes were first defined in [14]. Data to be stored are split into several blocks, each block is repeated, and all repeated blocks are distributed to nodes based on a rule. FR codes can be seen as a variant of exact MBR codes, and they aim to minimize disk I/O as well as network bandwidth for repair. Various construction methods for FR codes have been proposed using graphs, design theory, algebraic structures, and so on [14–19]. Note that the existing construction methods result in FR codes with quite restricted code parameter values, which will be discussed in Section 2.
In this paper, we propose a new class of regular FR codes constructed from perfect difference families and quasi-perfect difference families. The proposed codes can support a wide range of code parameter values, in particular any integer-valued code length larger than a certain value. This property is valuable for practical DSSs where each piece of data has an arbitrary size and a different importance, so it may need to be repeated a different number of times or distributed over a different number of nodes. Also, we theoretically show that the proposed codes have a property that is favorable for storing a large amount of data. Finally, the proposed FR codes are compared with existing regular FR codes in terms of the constructible code parameters and the actual size of stored data for each number of connected nodes. This comparison shows that the amount of stored data based on one instance of the proposed codes is not far from the FR capacity bound [14] and very close to that of an existing state-of-the-art optimal FR code under the same parameter values.

Regenerating Codes and Fractional Repetition Codes
An (n, k, d, M, α, β)_q RC for k ≤ d ≤ n − 1 and β ≤ α is defined as follows. The number of nodes in the DSS is denoted by n, and the number of data symbols to be stored in the DSS is denoted by M, which is often called the file size. The alphabet of symbols is denoted by F_q, and each node stores α symbols. The parameter k is called the reconstruction degree, which means that a data collector can reconstruct all data symbols by connecting to any k nodes, downloading α symbols from each node, and decoding them. A failed node is repaired by connecting to any d other nodes, downloading β symbols from each, and decoding them. With this notation, the repair bandwidth becomes dβ. Without loss of generality, we assume β = 1 for the code construction throughout this paper [12]. For MBR codes, d is equal to α and the file size is given as
$$M = k\alpha - \binom{k}{2} \qquad (1)$$
for given k, α (= d), and β (= 1). This value is called the MBR capacity.
An (n, α, ρ) regular FR code C is defined as a collection of n subsets N_0, N_1, …, N_{n−1} of {0, 1, …, θ − 1}, where nα = ρθ, such that
• |N_i| = α for i = 0, 1, …, n − 1, and
• each symbol of {0, 1, …, θ − 1} belongs to exactly ρ subsets in C.
According to an FR code C, node i, i = 0, …, n − 1, stores the symbols in N_i. The parameter ρ is called the repetition degree of C. The incidence matrix of C, denoted by I(C), is defined as the n × θ binary matrix whose (i, j)-th element is 1 if the set N_i includes symbol j and 0 otherwise. It is noted that the row and column weights of I(C) are α and ρ, respectively. An FR code is said to be irregular when |N_i| is not constant or the number of appearances of each symbol in the subsets is not constant. An FR code is often used as an inner code together with an outer (θ, M) MDS code. This concatenated code is called a distributed replication based exact simple storage (DRESS) code [20] and is represented with parameters [(θ, M), k, (n, α, ρ)].
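As a small illustration of the incidence matrix and the regularity condition, the following sketch builds I(C) for a toy code and checks the row and column weights. The toy code (the seven lines of the Fano plane, i.e., n = 7, α = 3, ρ = 3, θ = 7) and the function names are chosen here for illustration and are not taken from the paper.

```python
def incidence_matrix(nodes, theta):
    """Return the n x theta binary matrix: entry (i, j) is 1 iff symbol j is in N_i."""
    return [[1 if j in node else 0 for j in range(theta)] for node in nodes]


def is_regular(nodes, theta, alpha, rho):
    """Check that every row weight is alpha and every column weight is rho."""
    matrix = incidence_matrix(nodes, theta)
    rows_ok = all(sum(row) == alpha for row in matrix)
    cols_ok = all(sum(row[j] for row in matrix) == rho for j in range(theta))
    return rows_ok and cols_ok


# Illustrative toy code (an assumption of this sketch): node l stores
# {l, l + 1, l + 3} mod 7.
toy_nodes = [{0, 1, 3}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6}, {4, 5, 0}, {5, 6, 1}, {6, 0, 2}]

print(is_regular(toy_nodes, theta=7, alpha=3, rho=3))  # True
```

Counting the ones of I(C) by rows and by columns gives nα = ρθ, which the toy parameters satisfy (7 · 3 = 3 · 7).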
FR codes can be seen as a variant of exact MBR codes. For repair, a failed node needs to connect to α nodes and download β = 1 symbol from each. This means d = α. Note that repair can be done without any computation, and the failed node gets to store exactly what it stored before failure. For reconstruction, a data collector needs to connect to any k nodes and download the α symbols from each. Up to this point, FR codes have the same properties as MBR codes; the only difference is that the constraint of connecting to "any" d nodes for repair is relaxed to "some" d nodes, and the selection of the d nodes is usually table-based.
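The table-based repair described above can be sketched as follows. This is a minimal illustration, not the paper's procedure: for each symbol of the failed node it simply picks the first surviving node that holds a copy, and the toy code and the name `repair_table` are assumptions of this sketch.

```python
def repair_table(nodes, failed):
    """Map each symbol of the failed node to one surviving node holding a copy."""
    table = {}
    for symbol in sorted(nodes[failed]):
        helpers = [i for i, node in enumerate(nodes) if i != failed and symbol in node]
        if not helpers:
            raise ValueError(f"symbol {symbol} has no surviving replica")
        table[symbol] = helpers[0]  # any replica would do; take the first one
    return table


# Illustrative toy code (n = 7, alpha = 3, rho = 3), not taken from the paper.
toy_nodes = [{0, 1, 3}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6}, {4, 5, 0}, {5, 6, 1}, {6, 0, 2}]

print(repair_table(toy_nodes, failed=0))  # {0: 4, 1: 1, 3: 2}
```

In this toy code the failed node contacts α = 3 distinct helpers and copies one symbol from each, so no decoding is needed.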
For a given FR code, the file size M is determined as a function of k as follows:
$$M(k) = \min_{I \subseteq \{0, 1, \ldots, n-1\},\ |I| = k} \left| \bigcup_{i \in I} N_i \right| \qquad (2)$$
It is shown in [14] that for a well-constructed FR code, M(k) can be larger than the MBR capacity (1), which is possible thanks to the relaxation of the repair constraint. For a given parameter set (n, k, α, ρ), the FR capacity, denoted by A(n, k, α, ρ), is defined as the maximum value of M(k) among all possible FR codes. Capacity-achieving FR codes are constructed for some parameters in [19], but the FR capacity is unknown in general. Instead, an upper bound on A(n, k, α, ρ), referred to as (3) in what follows, is derived in [14].
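For small codes, the file size can be evaluated by brute force, taking M(k) to be the smallest number of distinct symbols jointly held by any k nodes, and compared with the standard MBR expression kα − k(k − 1)/2 for β = 1 and d = α. The sketch below is illustrative only: the toy code and the function names are assumptions, and the exhaustive search is practical only for small n and k.

```python
from itertools import combinations


def file_size(nodes, k):
    """M(k): the fewest distinct symbols jointly stored by any k nodes."""
    return min(len(set().union(*group)) for group in combinations(nodes, k))


def mbr_capacity(k, alpha):
    """MBR file size for beta = 1 and d = alpha: alpha + (alpha - 1) + ... (k terms)."""
    return k * alpha - k * (k - 1) // 2


# Illustrative toy code (n = 7, alpha = 3, rho = 3), not taken from the paper.
toy_nodes = [{0, 1, 3}, {1, 2, 4}, {2, 3, 5}, {3, 4, 6}, {4, 5, 0}, {5, 6, 1}, {6, 0, 2}]

for k in range(1, 4):
    print(k, file_size(toy_nodes, k), mbr_capacity(k, alpha=3))
```

For this toy code the two columns coincide for k = 1, 2, 3 (3, 5, and 6 symbols), so it meets the MBR value but does not exceed it.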

Related Works
In Section 3, we will propose new classes of regular FR codes. It is known that regular FR codes are good for load balancing [15], while irregular FR codes are suitable for heterogeneous networks [21]. Though various irregular FR codes have been proposed, for example, in [17,18,20,21], we focus on introducing existing regular FR codes related to our proposed construction. Also, we consider only regular FR codes with 3 ≤ ρ ≤ α and β = 1. It is noted that ρ = 2 may not be enough for practical systems to have good reliability in general, and ρ ≤ α is a common scenario in DSSs [15].
In [14], Steiner systems are used as a class of regular FR codes, and a parameter set (v, (v − 1)/(k − 1), k) is constructible when a Steiner system S(2, k, v) exists. In [15], ((q^{n+2} − 1)/(q − 1), (q^{n+1} − 1)/(q − 1), q + 1) regular FR codes are constructed for a prime power q and n ≥ 1, based on projective geometry and Latin squares. In fact, these FR codes are a subclass of the FR codes from Steiner systems in [14], and they additionally have a scalable property. In [16], various regular FR codes are constructed. FR codes constructed from mutually orthogonal Latin squares have parameters (ρp^m, p^m, ρ ≤ p^m − 1) and (4a, a, 4) for a prime p, a positive integer m, and an integer a ≠ 2, 6. For β = q^{m−2}, (qρ, q^{m−1}, 1 ≤ ρ ≤ (q^m − 1)/(q − 1)) FR codes are constructed from affine resolvable designs and, for β = a, (8a − 2, 2a, 4a − 1) FR codes are constructed from Hadamard designs, where q is a prime power and 4a − 1 ≥ 7 is an odd prime power. Lastly, in [19], (ρα, α, ρ) regular FR codes are constructed from transversal designs for 3 ≤ ρ ≤ α + 1. Also, ((s + 1)(st + 1), t + 1, s + 1) regular FR codes are constructed from generalized quadrangles for 2 ≤ s ≤ t. These two classes of FR codes are optimal for selected parameters satisfying the conditions in [19]. In addition, a lower bound on the reconstruction degree is derived for all the other given parameters, and some constructions that attain this bound are presented. Also, FR batch codes are constructed based on complete bipartite graphs, graphs with large girth, transversal designs, and affine planes to additionally satisfy the property that any set of symbols of a prescribed size can be retrieved by downloading at most one symbol from each node. It is noted that this property is not considered in this paper.

Perfect Difference Families
In this subsection, definitions and terms are given mostly based on [22]. A cyclic difference family (CDF) is defined as follows.
In this paper, special classes of CDFs, called perfect difference family and quasi-perfect difference family, will be used to construct FR codes.
are called the positive differences of the B_i's over Z_v, and the other half of the differences are called the negative differences of the B_i's over Z_v.
min } is called normalization, and normalization does not change the property of the CDF. Using ordering together with normalization, any CDF can be represented as , i = 0, 1, …, t − 1, which will be assumed as the default throughout this paper.
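The difference property can be checked numerically. The sketch below assumes the standard design-theory condition for a (v, k, λ) CDF, namely that every nonzero residue modulo v occurs exactly λ times among the differences taken within the subsets, and applies it to the (19, 3, 1) family {0, 1, 6}, {0, 2, 10}, {0, 3, 7} quoted later in the paper. The function names are hypothetical.

```python
from collections import Counter


def difference_counts(blocks, v):
    """Count how often each residue (a - b) mod v occurs, over a != b within a block."""
    counts = Counter()
    for block in blocks:
        for a in block:
            for b in block:
                if a != b:
                    counts[(a - b) % v] += 1
    return counts


def is_cdf(blocks, v, lam=1):
    """Assumed CDF condition: every nonzero residue mod v occurs exactly lam times."""
    counts = difference_counts(blocks, v)
    return all(counts[d] == lam for d in range(1, v))


def positive_differences(blocks):
    """Integer differences a - b with a > b, taken inside each (normalized) block."""
    return sorted(a - b for block in blocks for a in block for b in block if a > b)


blocks = [(0, 1, 6), (0, 2, 10), (0, 3, 7)]
print(is_cdf(blocks, v=19))          # True
print(positive_differences(blocks))  # [1, 2, 3, 4, 5, 6, 7, 8, 10]
```

The positive differences of this family are 1 through 8 together with 10, and their negatives modulo 19 account for the remaining nonzero residues.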
The incidence matrix has size n × tn; in fact, each B_i forms an n × n circulant matrix, and the incidence matrix is a concatenation of t circulant matrices. A choice of ρ, t, t′, and n fully determines the resulting FR code. When t = t′, all subsets of a (ρ(ρ − 1)t′ + 1, ρ, 1) (Q)PDF are used to construct the proposed FR codes, but we can still flexibly select n as long as it is not less than ρ(ρ − 1)t′ + 1. It is noted that for the choice of t = t′ and n = ρ(ρ − 1)t′ + 1, the incidence matrix becomes that of the Steiner system S(2, ρ, ρ(ρ − 1)t′ + 1), but for every n > ρ(ρ − 1)t′ + 1, the incidence matrix is different from that of the Steiner system and it still guarantees that any two nodes do not contain a pair of symbols in common.
When t < t′, only t out of the t′ subsets in a (ρ(ρ − 1)t′ + 1, ρ, 1) (Q)PDF are used to construct the proposed FR code. Equivalently, the resulting incidence matrix is constructed by removing some circulant matrices from the incidence matrix in the case of t = t′. For example, assume that we choose ρ = 3, t = 2, t′ = 3, and n = 19. Then, we have the (19, 3, 1) QPDF B_0 = {0, 1, 6}, B_1 = {0, 2, 10}, B_2 = {0, 3, 7} given in Section 2.3. If we choose the first and the last subsets, B_0 = {0, 1, 6} and B_1 = {0, 3, 7} are used to construct the proposed (19, 6, 3) FR code whose incidence matrix is shown in Figure 1. With the proposed construction, we can achieve a wide range of code parameters, especially in terms of n; this is a main contribution of this work and differentiates it from the existing constructions of FR codes, as will be shown in Section 3.3. It is noted that the proposed FR codes are scalable in the number of stored data symbols because we can easily add or remove the symbols corresponding to a subset. However, they do not scale well in the number of storage nodes because adding or removing a node no longer preserves the structure of circulant matrices.
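The construction in this example can be sketched in a few lines. The sketch follows the symbol indexing used in the proof of Theorem 2, where node l stores symbol (b + l) mod n + n·i for each element b of the i-th selected subset; the function name `build_fr_code` is hypothetical.

```python
from collections import Counter


def build_fr_code(base_blocks, n):
    """Return the list of node contents N_0, ..., N_{n-1}.

    The i-th base block contributes one n x n circulant: node l receives
    the symbols (b + l) mod n + n * i for every b in that block.
    """
    nodes = []
    for l in range(n):
        node = set()
        for i, block in enumerate(base_blocks):
            node.update((b + l) % n + n * i for b in block)
        nodes.append(node)
    return nodes


# First and last subsets of the (19, 3, 1) family, with n = 19.
nodes = build_fr_code([(0, 1, 6), (0, 3, 7)], n=19)
replicas = Counter(symbol for node in nodes for symbol in node)

print(sorted(nodes[0]))                 # [0, 1, 6, 19, 22, 26]
print({len(node) for node in nodes})    # {6}  -> alpha = 6
print(set(replicas.values()))           # {3}  -> rho = 3
print(len(replicas))                    # 38   -> theta = t * n
```

The output matches the (19, 6, 3) parameters stated above: 19 nodes, 6 symbols per node, and 3 replicas of each of the 38 symbols.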

Analysis of the Proposed Construction
Equation (2) implies that any k nodes need to contain as few common symbols as possible to have a large file size M for given n, α, and ρ. Thus, allowing any two nodes to contain at most one common symbol can be a good strategy to make an (n, α, ρ) FR code achieve a large M, and the next theorem shows that the proposed FR codes satisfy this property.
Theorem 2. Any two nodes contain at most one common symbol when using the proposed FR codes.
Proof. Since an incidence matrix of FR codes with t < t′ can be regarded as being constructed by removing some circulant matrices from the one with t = t′, we need to prove this property only for t = t′. For a set S, we first define D_n(S) as the multiset of cardinality |S|(|S| − 1) such that (s − s′) mod n ∈ D_n(S) for all s, s′ ∈ S, s ≠ s′.
Assume that we have a (Q)PDF B_i = {b_{i0} = 0, b_{i1}, …, b_{i(ρ−1)}}, i = 0, 1, …, t − 1, and the corresponding n × tn incidence matrix. For nodes l and m, 0 ≤ l < m ≤ n − 1, node l has symbols (b_{ij} + l) mod n + ni and node m has symbols (b_{ij′} + m) mod n + ni for j, j′ = 0, …, ρ − 1. Nodes l and m have a common symbol in the i-th circulant matrix if (b_{ij} + l) mod n = (b_{ij′} + m) mod n for some j and j′. This condition is equivalent to
$$m - l \equiv b_{ij} - b_{ij'} \pmod{n}.$$
Thus, we need to check whether m − l appears at most once in D_n(B_i), i = 0, …, t − 1, to prove this theorem.
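The argument can be checked numerically on the two-subset example {0, 1, 6}, {0, 3, 7}. The sketch below is a brute-force illustration rather than a proof: for a range of n it computes the largest number of symbols shared by any two nodes and the largest multiplicity in the combined multiset of differences modulo n. The function names and the tested range 19 ≤ n ≤ 30 are choices made for this sketch.

```python
from collections import Counter
from itertools import combinations


def build_nodes(base_blocks, n):
    """Node l stores (b + l) mod n + n * i for every b in the i-th base block."""
    return [
        {(b + l) % n + n * i for i, block in enumerate(base_blocks) for b in block}
        for l in range(n)
    ]


def max_pairwise_overlap(nodes):
    """Largest number of symbols shared by two distinct nodes."""
    return max(len(a & b) for a, b in combinations(nodes, 2))


def difference_multiset(base_blocks, n):
    """All D_n(B_i) combined: the differences (s - s') mod n with s != s'."""
    return Counter(
        (s - t) % n for block in base_blocks for s in block for t in block if s != t
    )


blocks = [(0, 1, 6), (0, 3, 7)]
for n in range(19, 31):
    overlap = max_pairwise_overlap(build_nodes(blocks, n))
    repeats = max(difference_multiset(blocks, n).values())
    print(n, overlap, repeats)
```

For every tested n, no difference repeats and no two nodes share more than one symbol, which is the behavior the theorem describes for this example.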

Comparison with Existing Codes
First, in Figure 2, we plot all constructible pairs (n, α) with 5 ≤ n ≤ 50 and 3 ≤ α ≤ 16 for the proposed FR codes and the existing regular FR codes with β = 1 for ρ = 3, which is the most common repetition degree in DSSs. In the legend, PROP, SS, MOLS, TD, and GQ represent the proposed FR codes and the FR codes constructed from Steiner systems, mutually orthogonal Latin squares, transversal designs, and generalized quadrangles, respectively. Our construction gives abundant choices of n for α equal to a multiple of 3, while the other constructions give at most one value of n for each α. To compare the file size of the proposed and existing FR codes, we need to select parameter values for which several of these codes can be constructed in common. In Figure 2, we can see that every point of PROP with n = 2α + 1 overlaps a point of SS, and every point of MOLS with a prime power α overlaps a point of TD. In fact, these codes are equivalent to each other for those parameter values, which is well known in design theory [22]. Excluding parameter values that are too small or too large, the parameters (n, α, ρ) = (33, 6, 3) and (27, 9, 3) seem suitable for comparison, but a generalized quadrangle for (33, 6, 3) is not known [22]. Thus, for (27, 9, 3), we compare, via numerical analysis, the file size of PROP, TD, and the upper bound on the FR capacity in (3).
In Figure 3, PROP1 and PROP2 represent the proposed (27, 9, 3) FR codes with the choice of t = t′ = 3 and the choice of t = 3 and t′ = 4, respectively, and the FR bound represents the upper bound in (3). We can see that PROP1 has a slightly smaller file size than the others, but PROP2 has almost the same file size as TD. It is noted that TD is known to be an optimal code for some parameters, but in this case it may not be optimal because it follows the FR bound only up to k = 5 < α.

Concluding Remarks
In this paper, we propose a construction method for regular FR codes based on (Q)PDFs. The resulting FR codes support a wider variety of code parameter values than other existing regular FR codes, which is an important feature for practical system design. One instance of them, for some parameters, is shown via numerical analysis to have a file size close to that of the FR codes constructed from transversal designs. Consequently, we can say that the proposed construction enables a flexible choice of code parameter values at the cost of a slight loss in file size compared to the state-of-the-art FR codes.

Figure 2. Constructible n and α of the proposed and existing FR codes for ρ = 3.

Figure 3. File size of the proposed FR codes, the FR codes from transversal designs, and the FR bound for (27, 9, 3).