E ﬃ cient Systolic-Array Redundancy Architecture for O ﬄ ine / Online Repair

: Neural-network computing has revolutionized the ﬁeld of machine learning. The systolic-array architecture is a widely used architecture for neural-network computing acceleration that was adopted by Google in its Tensor Processing Unit (TPU). To ensure the correct operation of the neural network, the reliability of the systolic-array architecture should be guaranteed. This paper proposes an e ﬃ cient systolic-array redundancy architecture that is based on systolic-array partitioning and rearranging connections of the systolic-array elements. The proposed architecture allows both o ﬄ ine and online repair with an extended redundancy architecture and programmable fuses and can ensure reliability even in an online situation, for which the previous fault-tolerant schemes have not been considered.


Previous Works
In this section, the systolic-array architecture and its fault-tolerance schemes, which were previously proposed, are described.

Systolic-Array Architecture
Because matrix multiplication and convolution operations are independent for each piece of data, the calculations are optimized using parallel processing with as many processors as possible. However, if the number of processors exceeds several thousand, interconnection problems that transfer data from the memory increase rapidly. To solve these problems, a systolic-array structure was introduced in [5], as shown in Figure 1a.
In the systolic-array architecture, a MAC unit, which enables MAC functions, is used [10]. The MAC unit performs multiplication and accumulation processes repeatedly to perform continuous and complex operations in digital signal processing. A basic MAC architecture is illustrated in Figure 1b. In neural-network calculations, the convolution-layer operation accounts for most of the computation time. Therefore, by adopting this systolic-array structure, matrix multiplication and convolution calculations can be accelerated.

Previous Fault Repair Systolic-Array Designs
Fault-tolerant systolic-array designs were proposed by Kung et al. [11] and were later improved [12]. The basic concept of these schemes is considering the array as a small systolic-array when faulty MACs are detected in the array. Figure 2a shows the redundant MAC architecture for repairing a faulty MAC processor. MAC processors are arranged in a 4 × 4 systolic-array, and redundant MAC processors are placed at the end of each row. These redundant MAC processors can serve as substitutes when a fault occurs in each row. Furthermore, to replace the faulty MAC with the redundant MAC, the signal should be rerouted so that several MUXs are used. An example of a repair process in this redundancy architecture is presented in Figure 2b. According to the characteristics of the systolic-array, if MAC unit C11 has a defect, for normal operation, the role of the faulty core is performed by C12, and the role of C12 is performed by C13. Finally, a redundant core R1 performs the role of C13. Through this process, the repair process is

Previous Fault Repair Systolic-Array Designs
Fault-tolerant systolic-array designs were proposed by Kung et al. [11] and were later improved [12]. The basic concept of these schemes is considering the array as a small systolic-array when faulty MACs are detected in the array. Figure 2a shows the redundant MAC architecture for repairing a faulty MAC processor. MAC processors are arranged in a 4 × 4 systolic-array, and redundant MAC processors are placed at the end of each row. These redundant MAC processors can serve as substitutes when a fault occurs in each row. Furthermore, to replace the faulty MAC with the redundant MAC, the signal should be rerouted so that several MUXs are used.

Previous Fault Repair Systolic-Array Designs
Fault-tolerant systolic-array designs were proposed by Kung et al. [11] and were later improved [12]. The basic concept of these schemes is considering the array as a small systolic-array when faulty MACs are detected in the array. Figure 2a shows the redundant MAC architecture for repairing a faulty MAC processor. MAC processors are arranged in a 4 × 4 systolic-array, and redundant MAC processors are placed at the end of each row. These redundant MAC processors can serve as substitutes when a fault occurs in each row. Furthermore, to replace the faulty MAC with the redundant MAC, the signal should be rerouted so that several MUXs are used. An example of a repair process in this redundancy architecture is presented in Figure 2b. According to the characteristics of the systolic-array, if MAC unit C11 has a defect, for normal operation, the role of the faulty core is performed by C12, and the role of C12 is performed by C13. Finally, a redundant core R1 performs the role of C13. Through this process, the repair process is An example of a repair process in this redundancy architecture is presented in Figure 2b. According to the characteristics of the systolic-array, if MAC unit C11 has a defect, for normal operation, the role of the faulty core is performed by C12, and the role of C12 is performed by C13. Finally, a redundant core R1 performs the role of C13. Through this process, the repair process is completed in such a manner that the signals to be transmitted to the faulty MAC are shifted to the neighboring MAC and transmitted. Thus, the repair process is completed. Another repair process is proposed in [13]. In this study, the role of the faulty MAC is replaced by the neighboring core which is placed on a diagonal. Basically, switches are located at the intersection point among every four neighboring cores.
By utilizing switches, the faulty MAC can be bypassed in either horizontal or vertical ways and shifted by its neighboring core.
Recently, Zhang et al. proposed a fault-tolerant systolic-array design considering the impact of hardware faults in the application domain [14]. The method reduced the effect of the faulty MAC by calculating the weight in the CNN corresponding to the faulty MAC as zero. Although this method reduces the effect of hardware faults without additional redundant MACs, reliability issues still exist, because the systolic-array is comprised of the faulty MACs.

Proposed Redundancy Architecture
In this section, the proposed systolic-array redundancy architecture is described. The proposed method has two main features: the "redundancy architecture" and "online error repair."

Systolic-Array Redundancy Architecture
The previous redundancy architecture can repair all faulty MACs when only one faulty MAC occurs in each row. This means that faulty MACs cannot be repaired when two or more faulty MACs are exist in a row. Therefore, an extended redundancy architecture is considered in which configuring redundancies at the end of the row and column of the array as shown in Figure 3a. This redundancy architecture can repair faults for various patterns when a large number of MACs fail. The proposed method repairs faulty MACs by using the shift-based repair mechanism which is commonly used for its simple implementation. Figure 3b shows an example in which three faulty MACs are concentrated in an area. In this example, there are two faulty MACs in the 2nd row and the 3rd column. Nevertheless, the redundant MAC architecture structure can repair this faulty pattern with more additional hardware overhead.
Electronics 2020, 2, x FOR PEER REVIEW 4 of 14 completed in such a manner that the signals to be transmitted to the faulty MAC are shifted to the neighboring MAC and transmitted. Thus, the repair process is completed. Another repair process is proposed in [13]. In this study, the role of the faulty MAC is replaced by the neighboring core which is placed on a diagonal. Basically, switches are located at the intersection point among every four neighboring cores. By utilizing switches, the faulty MAC can be bypassed in either horizontal or vertical ways and shifted by its neighboring core. Recently, Zhang et al. proposed a fault-tolerant systolic-array design considering the impact of hardware faults in the application domain [14]. The method reduced the effect of the faulty MAC by calculating the weight in the CNN corresponding to the faulty MAC as zero. Although this method reduces the effect of hardware faults without additional redundant MACs, reliability issues still exist, because the systolic-array is comprised of the faulty MACs.

Proposed Redundancy Architecture
In this section, the proposed systolic-array redundancy architecture is described. The proposed method has two main features: the "redundancy architecture" and "online error repair."

Systolic-Array Redundancy Architecture
The previous redundancy architecture can repair all faulty MACs when only one faulty MAC occurs in each row. This means that faulty MACs cannot be repaired when two or more faulty MACs are exist in a row. Therefore, an extended redundancy architecture is considered in which configuring redundancies at the end of the row and column of the array as shown in Figure 3a. This redundancy architecture can repair faults for various patterns when a large number of MACs fail. The proposed method repairs faulty MACs by using the shift-based repair mechanism which is commonly used for its simple implementation. Figure 3b shows an example in which three faulty MACs are concentrated in an area. In this example, there are two faulty MACs in the 2nd row and the 3rd column. Nevertheless, the redundant MAC architecture structure can repair this faulty pattern with more additional hardware overhead.

Redundancy Configuration by Partitioning the Entire Array
In the event of MAC failure, to repair the fault, the signals that should be assigned to the faulty MAC are rerouted to the adjacent MAC, and the signals that should be conveyed to the last MAC in

Redundancy Configuration by Partitioning the Entire Array
In the event of MAC failure, to repair the fault, the signals that should be assigned to the faulty MAC are rerouted to the adjacent MAC, and the signals that should be conveyed to the last MAC in the rows or columns are assigned to the redundant MAC. However, when the number of faulty MACs is high, the redundancy architecture cannot repair the specific faulty MAC patterns. For example, if two or more faults occur in a systolic-array row, one redundant MAC at the end of the row can only repair one faulty MAC. In this case, the remaining unrepaired faulty MACs can be repaired by the redundant MAC located at the end of the column, but if there are other faults in this column, the fault pattern Electronics 2020, 9, 338 5 of 14 means that the faults cannot be repaired. Therefore, to reduce the probability of these fault patterns occurring, an architecture of configuring redundancies by partitioning a large size systolic-array can be considered. This partitioning architecture, which is called "array partitioning," is more efficient in terms of repair rate because it can repair more faulty MAC patterns. Increasing the partitioning level means partitioning the entire array into multiple small arrays. For example, if the entire systolic-array size is 128 × 128, setting the partitioning level to 4 will partition the entire array into four 64 × 64 arrays. Similarly, when the partitioning level is set to 16, an entire 128 × 128 array will be partitioned into 16 arrays with a size of 32 × 32. In this paper, the parameter PL (partitioning level) which indicates the number of parts the entire array will be partitioned into is used. Let N ROW and N COL are the number of rows and columns of the systolic-array, respectively. Then, if the PL is set to 4, it means that the entire N ROW × N COL size array is divided into four N ROW /2 × N COL /2 size arrays.
An advantage of the "array partitioning" feature of the proposed method is that it increases the repair rate by increasing PL; however, excessive partitioning of the entire array can increase the redundancies and possibly cause a timing problem due to the redundancy configuration. Therefore, it is important to determine the appropriate PL.

One Redundancy per Multiple Rows and Columns
A straightforward redundancy architecture for repairing faults in systolic-array partitions involves placing redundancies at the end of every row and column. However, this architecture is inefficient in terms of area overhead unless the probability of a fault occurring in the systolic-array is high. Therefore, to design a more efficient redundancy architecture in terms of area overhead, an architecture whereby a redundancy is placed per two or more rows and columns can be considered. We set the parameters "One redundancy per multiple Rows Level (ORL)" and "One redundancy per multiple Columns Level (OCL)". If the ORL is set to 2 and the OCL is set to 4, it means one redundancy is placed in two rows and four columns. In addition, when applying the same value to ORL and OCL, use the parameter OL. For example, if OL is set to 4, ORL and OCL have the same value 4. Figure 4a shows the proposed architecture with OL 2. Redundancy MAC RC0 can repair any one faulty MAC occurring in the first row and the second row. That is, if a faulty core is detected in the first systolic-array row, RC0 should act as the last core of the first row, and if a faulty core is detected in the second systolic row, RC0 should act as the last core in the second row. This extension architecture can reduce the total additional hardware overhead because it can reduce the number of redundant MACs. Similarly, Figure 4b shows the case where OL is 4. Based on this extension, it is also possible to extend the architecture of one redundant core per three or more rows and columns. However, if three or more rows and columns have only one redundant MAC, this redundancy architecture can be configured with less hardware overhead, but Based on this extension, it is also possible to extend the architecture of one redundant core per three or more rows and columns. However, if three or more rows and columns have only one redundant MAC, this redundancy architecture can be configured with less hardware overhead, but the repair rate will decrease; furthermore, a timing problem may occur because the distance between the redundancy and MAC will increase.

Repair Strategy for Offline/Online Repair
Typically, redundancies that are not used in offline repair are wasted. However, in the proposed method, the remaining redundancies are used to repair faults during normal operation. If at least one redundant MAC remains unused, the online repair process can be performed. Typically, programmable fuses are used to repair faulty memory cells by programming the address fuses of the spare decoder [15]. In order to utilize the unused redundant MACs, programmable fuses are adopted to enable the online repair process to repair additional erroneous MACs during the normal operation.
Basically, the online repair process is progressed with unused redundant MACs after the offline repair process. It means that if an error occurs in the MAC which cannot be covered by the remaining redundancies, it is impossible to repair the erroneous MAC with the proposed method. Therefore, it is best to leave unused redundant MACs as efficiently as possible during the offline repair process. This does not mean that the number of unused redundant MACs should be maximized because the area of repairable MACs in the online repair process is not fully determined by the quantity of unused redundant MACs. Besides, more unused redundant MACs means redundancies are over-assigned in the first place.
Unlike the previous redundancy architecture, the proposed method assigns redundancies to both the end of the row and column of the array. As mentioned in Section 2.2, each redundant MAC is specialized to repair the faulty MAC at the same line. So, several MACs share the same redundant resources depending on the redundancy architecture. Let's say that those several MACs belong to the same group. Then, each group has the same number of available redundant MACs to use before the offline repair process. As the repair process is progressed, the number of available redundancies of each group is decreased. The basic concept of the proposed repair strategy is spending redundancies first in the group which has the higher number of available redundant MACs. This prevents the specific group from having no available redundancy even though there is another option to utilize resources of other groups.
The example of the proposed repair strategy is depicted in Figure 5. In this example, there are two faulty MACs, C13 and C33, into a 4 × 4 MAC array as shown in Figure 5a. There are four groups in the MAC array and each group has two available redundant MACs. For example, Group 1 consists of MAC C00, C01, C10, and C11. If a fault occurs in Group 1, redundant MACs, RC0 and RR0, can be utilized. Therefore Group 1 has one available row redundant MAC and one column redundant MAC. Figure 5b   The overall flowchart of the proposed method is shown in Figure 6. First, redundant MACs should be tested before testing normal MACs because only healthy redundant MACs can be utilized in the repair process. And then, the offline test for normal MACs is progressed. When faults are detected in the offline test process, faulty information is collected in the fault list. As get the proper The overall flowchart of the proposed method is shown in Figure 6. First, redundant MACs should be tested before testing normal MACs because only healthy redundant MACs can be utilized in the repair process. And then, the offline test for normal MACs is progressed. When faults are detected in the offline test process, faulty information is collected in the fault list. As get the proper repair solution, the repair decision of the offline repair starts after all the faulty information is collected. Then, the proposed method investigates the number of available redundancies of each group and sorts the numbers in descending order. If there are available redundancies to use, the offline repair process is performed based on the repair strategy. The redundancy list is updated whenever the repair decision is making. If all faulty MACs are repaired, the offline test phase ends. Once the online test phase starts, the first process is to check the available resources through programmable fuses. In the online test phase, repairable MACs are determined by the information of remaining redundant MACs in the redundancy list. Therefore, if an error is detected during the normal operation, the proposed method can instantly find out whether the erroneous MAC is repairable or not. If the group which has the erroneous MAC can utilize two or more redundant MACs, the proposed method seeks the way to avoid the number of available redundant MACs of each group being 0. This decision making process can be easily performed with the simple binary search tree. Then, online repair step is repeatedly performed until there is no remaining redundancy left.
Once the online test phase starts, the first process is to check the available resources through programmable fuses. In the online test phase, repairable MACs are determined by the information of remaining redundant MACs in the redundancy list. Therefore, if an error is detected during the normal operation, the proposed method can instantly find out whether the erroneous MAC is repairable or not. If the group which has the erroneous MAC can utilize two or more redundant MACs, the proposed method seeks the way to avoid the number of available redundant MACs of each group being 0. This decision making process can be easily performed with the simple binary search tree. Then, online repair step is repeatedly performed until there is no remaining redundancy left.

Efficient Redundancy Configuration
To efficiently perform the convolution operation, it is effective to set the systolic-array to 128 × 128 or more. Therefore, a strategy for a systolic-array with a size of 128 × 128 or more should also be configured to construct redundancy. While it is possible to simply place redundancies all the way to the right and to the bottom side of the entire array, this is not an efficient solution in terms of repair rate and hardware overhead.
As mentioned in Sections 3.2 and 3.3, the proposed architecture can reduce the hardware overhead and can increase the repair rate when multiple faulty MACs are detected. Therefore, the redundancy architecture that can be applied as efficiently as possible in terms of hardware overhead and repair rate with a systolic-array size, can be effectively applied in real neural network operations.

Efficient Redundancy Configuration
To efficiently perform the convolution operation, it is effective to set the systolic-array to 128 × 128 or more. Therefore, a strategy for a systolic-array with a size of 128 × 128 or more should also be configured to construct redundancy. While it is possible to simply place redundancies all the way to the right and to the bottom side of the entire array, this is not an efficient solution in terms of repair rate and hardware overhead.
As mentioned in Sections 3.2 and 3.3, the proposed architecture can reduce the hardware overhead and can increase the repair rate when multiple faulty MACs are detected. Therefore, the redundancy architecture that can be applied as efficiently as possible in terms of hardware overhead and repair rate with a systolic-array size, can be effectively applied in real neural network operations.

Simulation Environment
To evaluate the performance of the proposed architecture, hardware overheads, and repair rates induced by the redundancy configuration were calculated and compared with those obtained by changing the parameter of the proposed architecture. For example, Figure 7 shows the redundancy architecture with PL 4 and OL 2. We experimented with large size of systolic-arrays such as 128 × 128, 256 × 256, and 512 × 512. Then, experiments were performed while varying the parameters, PL and OL, for each array size. These experiments were performed in a 3 GHz Linux environment with 256 GB memory. changing the parameter of the proposed architecture. For example, Figure 7 shows the redundancy architecture with PL 4 and OL 2. We experimented with large size of systolic-arrays such as 128 × 128, 256 × 256, and 512 × 512. Then, experiments were performed while varying the parameters, PL and OL, for each array size. These experiments were performed in a 3 GHz Linux environment with 256 GB memory.

Repair Rates
To evaluate the repair performance of the redundancy architectures, numerous random fault patterns were introduced into the NROW × NCOL systolic-array MAC architectures. These faulty MACs can be repaired using the redundant MACs that are located at the right and bottom sides. If all faulty MACs are repaired successfully after the repair process, this fault pattern is classified as a "repairable case." On the other hand, if one or more faulty MACs cannot be repaired, this case is classified as an "irreparable case." The repair rates are represented as follows:

Repair rate(%) = +
To investigate the effect of the PL and OL on repair rate, repair rates were measured while increasing PL and OL. In this experiment, the number of fault patterns was randomly generated 100,000 times.
First, the repair rate was measured while PL was increased when OL was fixed to 1. In this case, increasing PL while maintaining OL increases the number of redundancies and also the number of repairable faults. Therefore, as the PL increases, a higher repair rate can be achieved as shown in Figure  8a. In contrast, when the PL was fixed, the repair rate was measured while increasing the OL. The result is shown in Figure 8b. Increasing OL means decrease of the number of redundancies and the repair rate drops. As shown in Figure 8, the increase of PL means the increase in redundancy, which means an improvement in the repair rate. Similarly, the increase of OL causes the decrease of the repair rate

Repair Rates
To evaluate the repair performance of the redundancy architectures, numerous random fault patterns were introduced into the N ROW × N COL systolic-array MAC architectures. These faulty MACs can be repaired using the redundant MACs that are located at the right and bottom sides. If all faulty MACs are repaired successfully after the repair process, this fault pattern is classified as a "repairable case." On the other hand, if one or more faulty MACs cannot be repaired, this case is classified as an "irreparable case." The repair rates are represented as follows: Repair rate(%) = Repairable case Repairable case + Irreparable case To investigate the effect of the PL and OL on repair rate, repair rates were measured while increasing PL and OL. In this experiment, the number of fault patterns was randomly generated 100,000 times.
First, the repair rate was measured while PL was increased when OL was fixed to 1. In this case, increasing PL while maintaining OL increases the number of redundancies and also the number of repairable faults. Therefore, as the PL increases, a higher repair rate can be achieved as shown in Figure 8a. In contrast, when the PL was fixed, the repair rate was measured while increasing the OL. The result is shown in Figure 8b. Increasing OL means decrease of the number of redundancies and the repair rate drops. As shown in Figure 8, the increase of PL means the increase in redundancy, which means an improvement in the repair rate. Similarly, the increase of OL causes the decrease of the repair rate because it reduces the number of redundancies. Changing PL and OL is related to changing the number of redundancies and thus a change in the repair rate can be expected. Therefore, to proceed with the experiments with the same number of redundant MACs, experiments were conducted under a given number of PL and OL conditions to equalize the number of redundant MACs. For example, assuming a systolic-array of N × N, the number of redundancies when the proposed architecture is not applied is 2N, where N is placed on the right side and N is placed on the bottom side. Under these conditions, a fourfold increase in the PL will double the number of redundancies, and a doubling of the OL reduces the number of redundancies by half. Therefore, to have the same number of redundancies, which is 2N, when an N × N systolic-array is set, the repair rate was measured with the PL and OL as (1,1), (4,2), (16,4), (64,8), and (256,16), respectively. The results are shown in Figure 9. As can be seen from the results of this experiment, the repair rate is the highest in OL 2 condition. When OL is increased to 2 or more, the effect of reducing repair rate due to the increase of OL, becomes more dominant than the effect of increasing repair rate due to the increase of PL. There is one more important thing in these results. As mentioned in Section 3, the number of used redundant MACs in the offline repair process and the number of the faulty MACs are the same if the faulty systolic-array is repairable. Before the offline repair process, there are 256 redundant MACs in Figure 9a and 512 redundant MACs in Figure 9b. As can be seen from the graphs, the number of unused redundant MACs is much larger than the number of used redundant MACs even in the situations which have low repair rate. These results prove the need for the online repair process with unused redundant MACs.
Electronics 2020, 2, x FOR PEER REVIEW 10 of 14 because it reduces the number of redundancies. Changing PL and OL is related to changing the number of redundancies and thus a change in the repair rate can be expected. Therefore, to proceed with the experiments with the same number of redundant MACs, experiments were conducted under a given number of PL and OL conditions to equalize the number of redundant MACs. For example, assuming a systolic-array of N × N, the number of redundancies when the proposed architecture is not applied is 2N, where N is placed on the right side and N is placed on the bottom side. Under these conditions, a fourfold increase in the PL will double the number of redundancies, and a doubling of the OL reduces the number of redundancies by half. Therefore, to have the same number of redundancies, which is 2N, when an N × N systolic-array is set, the repair rate was measured with the PL and OL as (1,1), (4,2), (16,4), (64,8), and (256,16), respectively. The results are shown in Figure 9. As can be seen from the results of this experiment, the repair rate is the highest in OL 2 condition. When OL is increased to 2 or more, the effect of reducing repair rate due to the increase of OL, becomes more dominant than the effect of increasing repair rate due to the increase of PL. There is one more important thing in these results. As mentioned in Section 3, the number of used redundant MACs in the offline repair process and the number of the faulty MACs are the same if the faulty systolic-array is repairable. Before the offline repair process, there are 256 redundant MACs in Figure 9a and 512 redundant MACs in Figure  9b. As can be seen from the graphs, the number of unused redundant MACs is much larger than the number of used redundant MACs even in the situations which have low repair rate. These results prove the need for the online repair process with unused redundant MACs.  For comparison with the previous architecture [12], the experiments with the number of redundancies is set to N are conducted. Since the previous architecture places redundancy only on its right side, the number of redundancies in this architecture of N × N is determined as N. As can be For comparison with the previous architecture [12], the experiments with the number of redundancies is set to N are conducted. Since the previous architecture places redundancy only on its right side, the number of redundancies in this architecture of N × N is determined as N. As can be seen in Table 1, even with the same number of redundancies, the redundancy architecture where redundancies on the right and bottom side has a much higher repair rate than that of the previous architecture under the condition that N is 128, 256, and 512. In addition, it can be seen that, even in this case, as PL and OL increase, the repair rate decreases slightly even if the number of redundancies is the same. Finally, repair rates of the online repair process with various redundancy architectures are shown in Figure 10. In this experiment, the data from Figure 9b is utilized. It is assumed that 40 faulty MACs are already repaired in the offline repair process. In order to evaluate the performance of the online repair process, the concept of the error rate is defined as follows. An error rate of a single MAC is the probability that an error occurs in a MAC during the normal operation. Similar to the results of the offline repair process, redundancy architectures, which have high repair rates in the offline repair process, also show good results in the online repair process. In conclusion, the proposed method shows much more improved repair rates compared to the previous work. However, it is necessary to assign the proper value of PL and OL considering the number of faults, the error rate, and overall hardware overhead.
Electronics 2020, 2, x FOR PEER REVIEW 12 of 14 the offline repair process, redundancy architectures, which have high repair rates in the offline repair process, also show good results in the online repair process. In conclusion, the proposed method shows much more improved repair rates compared to the previous work. However, it is necessary to assign the proper value of PL and OL considering the number of faults, the error rate, and overall hardware overhead.

Hardware Overhead
To calculate the hardware overhead of the redundancy architectures, we employed the synthesis tool Synopsys Design Vision [16] and the Synopsys Armenia Educational Department (SAED) 32/28 nm Open Cell Library [17] technology parameters. The additional hardware consists of MUXs that

Hardware Overhead
To calculate the hardware overhead of the redundancy architectures, we employed the synthesis tool Synopsys Design Vision [16] and the Synopsys Armenia Educational Department (SAED) 32/28 nm Open Cell Library [17] technology parameters. The additional hardware consists of MUXs that allow signals to be sent to adjacent MACs to reroute the signal and redundant MACs. In this experiment, the additional area overheads were measured by changing PL and OL in a systolic-array having a large size of 128 × 128 or more. Figure 11 shows the ratio of the additional hardware size caused by the redundant configuration for the systolic-array compared to the structure without the redundancy. Increasing OL to 1, 2, 4, and 8 in the same PL condition will inevitably lead to hardware reduction because the number of redundant MACs is reduced. Conversely, when OL is fixed and PL is increased, as shown in Figure 12, the area overhead becomes larger because the redundancies are further allocated for each divided systolic-array partition. In addition, we measured the area overhead while adjusting OL and PL for the area comparison under the conditions of the same number of redundancies. As shown in Figure 13, when the number of redundancies is the same, it can be seen that even if PL and OL change, the overall area overhead is similar. However, owing to an increase in OL, the overall area overhead slightly increased because of the addition of a large size MUX for routing in multiple rows and columns. Although 8.9%, 5.8%, and 3.0% hardware overheads are consumed in 8-bit, 16-bit, and 32-bit, respectively, compared to the previous architecture, high repair rates can be achieved through the proposed architecture.

Hardware Overhead
To calculate the hardware overhead of the redundancy architectures, we employed the synthesis tool Synopsys Design Vision [16] and the Synopsys Armenia Educational Department (SAED) 32/28 nm Open Cell Library [17] technology parameters. The additional hardware consists of MUXs that allow signals to be sent to adjacent MACs to reroute the signal and redundant MACs. In this experiment, the additional area overheads were measured by changing PL and OL in a systolic-array having a large size of 128 × 128 or more. Figure 11 shows the ratio of the additional hardware size caused by the redundant configuration for the systolic-array compared to the structure without the redundancy. Increasing OL to 1, 2, 4, and 8 in the same PL condition will inevitably lead to hardware reduction because the number of redundant MACs is reduced. Conversely, when OL is fixed and PL is increased, as shown in Figure  12, the area overhead becomes larger because the redundancies are further allocated for each divided systolic-array partition. In addition, we measured the area overhead while adjusting OL and PL for the area comparison under the conditions of the same number of redundancies. As shown in Figure  13, when the number of redundancies is the same, it can be seen that even if PL and OL change, the overall area overhead is similar. However, owing to an increase in OL, the overall area overhead slightly increased because of the addition of a large size MUX for routing in multiple rows and columns. Although 8.9%, 5.8%, and 3.0% hardware overheads are consumed in 8-bit, 16-bit, and 32bit, respectively, compared to the previous architecture, high repair rates can be achieved through the proposed architecture.

Conclusion
To ensure the reliability of the systolic-array architecture, which is widely used in neuralnetwork computing, a new efficient redundancy architecture is proposed. The proposed architecture can repair a larger number of faults than the previous redundancy architecture, with reasonable hardware overhead. Moreover, unused redundant MACs are utilized in the online repair process through programmable fuses. For maximizing the efficiency of the online repair process, group-based repair strategy is adopted in the entire repair process. This provides maximum redundancies utilization by minimizing hardware overhead waste.

Conclusions
To ensure the reliability of the systolic-array architecture, which is widely used in neural-network computing, a new efficient redundancy architecture is proposed. The proposed architecture can repair a larger number of faults than the previous redundancy architecture, with reasonable hardware overhead. Moreover, unused redundant MACs are utilized in the online repair process through programmable fuses. For maximizing the efficiency of the online repair process, group-based repair strategy is adopted in the entire repair process. This provides maximum redundancies utilization by minimizing hardware overhead waste.