Highly Fault-Tolerant Systolic-Array-Based Matrix Multiplication

Abstract: Matrix multiplication plays a crucial role in various engineering and scientific applications. Cannon's algorithm, executed within two-dimensional systolic arrays, significantly enhances computational efficiency through parallel processing. However, as the matrix size increases, reliability issues become more prominent. Although previous work has proposed a fault-tolerant mechanism, it is only suitable for scenarios with a limited number of faulty processing elements (PEs). This paper introduces a pair-matching mechanism, assigning a fault-free PE as a proxy for each faulty PE to execute its tasks. Our fault-tolerant mechanism comprises two stages: in the first stage, each fault-free PE completes its designated computations; in the second stage, computations intended for each faulty PE are executed by its assigned fault-free PE proxy. The experimental results demonstrate that, compared to the previous work, our approach not only significantly improves the fault tolerance of systolic arrays (it is applicable to scenarios with a higher number of faulty PEs) but also reduces circuit area. Therefore, the proposed approach proves effective in practical applications.


Introduction
Matrix multiplication plays a vital role in diverse engineering and scientific applications. For example, in convolutional neural networks, convolutions can also be expressed as matrix multiplications. To fulfill the needs of high-speed processing, minimizing the computation time for matrix multiplication is crucial. Therefore, several algorithms have been introduced in the past to accelerate matrix multiplications, such as Cannon's algorithm [1] and Winograd's algorithm [2].
A systolic array [3][4][5][6][7] serves as a general matrix multiplication accelerator, achieving matrix multiplication in a regular and parallel manner via data transmission between processing elements (PEs). Each PE is primarily responsible for executing multiply-accumulate computations [8,9]. Cannon's algorithm can also be implemented through a systolic array. Taking a 4 × 4 systolic array as an example, if matrix operations are performed using the traditional algorithm, 10 clock cycles are required, whereas utilizing Cannon's algorithm requires only four clock cycles.
Although a systolic array reduces computation time through parallel processing, it also increases the likelihood of hardware failures. Note that these hardware failures are caused by latent manufacturing defects [10,11] or circuit aging effects [12,13], potentially resulting in functional errors. We can employ post-manufacturing testing mechanisms [14,15] to identify latent manufacturing defects. Additionally, run-time detection mechanisms [16,17] can be used to identify the effects of circuit aging.
As the size of the matrix increases, the probability of hardware failures also increases. Therefore, besides identifying hardware failures, it is imperative to incorporate fault-tolerance mechanisms to enhance the reliability of the systolic array for matrix multiplication. Especially in applications such as healthcare, aviation, autonomous driving, [...] controller design for the systolic array. Therefore, the proposed pair-matching method can simplify the design of both the controller and PEs.
The primary contributions of our work are elaborated below.

1. We introduce a fault-tolerant approach using pair-matching, with the analysis results showing a notable enhancement in fault-tolerance capability compared to the previous work [30].
2. We implemented the pair-matching approach in circuitry, with the experimental results indicating reduced areas for both the controller and PEs in comparison to the previous work [30].
Note that, in recent years, there have also been some studies [31][32][33][34] exploring fault-tolerant designs for systolic-array-based deep neural network (DNN) accelerators. Due to the inherent fault-tolerant nature of artificial intelligence applications, these studies' fault-tolerant designs may allow for some errors. Unlike these studies [31][32][33][34], we must ensure the correctness of the matrix multiplication results. Therefore, the fault-tolerance issues that we investigate in matrix multiplication differ fundamentally from those addressed in the literature on the fault tolerance of DNN accelerators.
The remainder of this paper is structured as follows. Section 2 offers background materials. Section 3 introduces the proposed fault-tolerance approach, encompassing one-dimensional pair-matching and two-dimensional pair-matching. Section 4 showcases our experimental results. Lastly, concluding remarks are provided in Section 5.

Preliminaries
In this section, we present background materials. Section 2.1 introduces the systolic array architecture. Section 2.2 presents Cannon's algorithm. Section 2.3 discusses the fault-tolerant mechanism proposed by Jan and Huang [30].

The Systolic Array Architecture
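Suppose that n × n matrices A and B are represented as Equations (1) and (2), respectively:

$$A = \begin{bmatrix} A_{0,0} & A_{0,1} & \cdots & A_{0,n-1} \\ A_{1,0} & A_{1,1} & \cdots & A_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n-1,0} & A_{n-1,1} & \cdots & A_{n-1,n-1} \end{bmatrix} \tag{1}$$

$$B = \begin{bmatrix} B_{0,0} & B_{0,1} & \cdots & B_{0,n-1} \\ B_{1,0} & B_{1,1} & \cdots & B_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ B_{n-1,0} & B_{n-1,1} & \cdots & B_{n-1,n-1} \end{bmatrix} \tag{2}$$

Then, the n × n matrix multiplication C = A × B can be expressed as Equation (3):

$$C_{i,j} = \sum_{k=0}^{n-1} A_{i,k} B_{k,j}, \qquad 0 \le i, j \le n-1 \tag{3}$$

A systolic array functions as a versatile accelerator for matrix multiplication, accomplishing the task in a structured and parallel fashion through data exchange among PEs. Each PE is primarily tasked with performing multiply-accumulate computations. Additionally, each PE also requires interconnection circuits for data transmission. Taking a 4 × 4 systolic array as an example, with each clock cycle, matrix A moves one step to the right, and matrix B moves one step downward, as shown in Figure 1. After 10 clock cycles, the matrix multiplication is complete.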

Cannon's Algorithm
Cannon's algorithm can be used to accelerate matrix multiplications. It primarily involves three steps:

1. Initial alignment: Row i of matrix A is shifted to the left by i steps (0 ≤ i ≤ n − 1), with the 0-th row shifted 0 steps, the 1-st row shifted 1 step, and so forth; column j of matrix B is shifted upward by j steps (0 ≤ j ≤ n − 1), with the 0-th column shifted 0 steps, the 1-st column shifted 1 step, and so forth.
2. Multiplication and accumulation: The two aligned matrices are overlapped, and multiply-accumulate operations are conducted on the PEs.
3. Global matrix shifting: All elements in matrix A are shifted one step to the left, and all elements in matrix B are shifted one step upward. Then, one proceeds to step 2.

In Cannon's algorithm, step 2 must be executed a total of n times, and step 3 must be executed n − 1 times in total. Then, the resulting matrix C is obtained.
In conjunction with Cannon's algorithm, each PE is connected to two registers, which store a value from the aligned matrix A and a value from the aligned matrix B, respectively. The PEs in the 0-th row are connected to those in the last row, and similarly, those in the 0-th column are connected to those in the last column, as illustrated in Figure 2.
Based on the initial alignment of matrix A and matrix B, corresponding values are simultaneously transmitted to each PE. Subsequently, multiply-accumulate operations are executed in all PEs. During each global matrix shifting, the values of matrix A are sent to the left neighboring PEs, while the values of matrix B are sent upward to the neighboring PEs. As a result, only n clock cycles are needed to complete the matrix multiplication.
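The three steps can be checked with a small software model. The following Python sketch is illustrative only: the function name cannon_multiply and the list-of-lists representation are assumptions made here, and the wrap-around (modulo n) indexing stands in for the connections between the first and last rows and columns of the array.

```python
def cannon_multiply(A, B):
    """Multiply two n x n matrices (lists of lists) following Cannon's algorithm."""
    n = len(A)
    # Step 1, initial alignment: row i of A shifts left by i steps and
    # column j of B shifts up by j steps (both with wrap-around).
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[0] * n for _ in range(n)]
    for step in range(n):
        # Step 2: every PE(i, j) performs one multiply-accumulate.
        for i in range(n):
            for j in range(n):
                C[i][j] += a[i][j] * b[i][j]
        if step < n - 1:
            # Step 3: shift all of A one step left and all of B one step up.
            a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
            b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return C
```

In this model, step 2 runs n times and step 3 runs n − 1 times, matching the counts stated above.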
We illustrate an example using 4 × 4 matrix multiplication. Figure 3a overlaps the aligned matrix A and the aligned matrix B. For example, at this point, the elements overlapping at PE 1,2 are A 1,3 and B 3,2. Next, we shift all elements of matrix A one step to the left and all elements of matrix B one step up. For clarity, in Figure 3a, we use the 0-th row as an example to indicate the shift direction of the elements in matrix A and the 0-th column as an example to indicate the shift direction of the elements in matrix B. Figure 3b gives the results after the first global matrix shifting is performed. As shown in Figure 3b, the elements overlapping at PE 1,2 become A 1,0 and B 0,2. This process is repeated 4 times for multiply-accumulate and 3 times for global matrix shifting to obtain the final result. For instance, at the position PE 1,2, we obtain the final result of C 1,2 as

$$C_{1,2} = A_{1,3} B_{3,2} + A_{1,0} B_{0,2} + A_{1,1} B_{1,2} + A_{1,2} B_{2,2}$$

Fault-Tolerant Matrix Multiplier Design
Jan and Huang [30] studied the fault-tolerant design of a systolic array executing Cannon's algorithm for matrix multiplications. Their approach requires identifying fault-free columns, including those in the form of twisted columns.
Figure 4 is taken as an example. The PEs shaded in gray indicate faulty PEs. There are two twisted columns. One twisted column comprises PE 0,1, PE 1,1, PE 2,2, and PE 3,1, while the other twisted column comprises PE 0,2, PE 1,3, PE 2,3, and PE 3,2. From Figure 4, it can be observed that PE 0,1 performs the tasks that originally belonged to PE 0,0 and PE 0,1, PE 1,1 performs the tasks that originally belonged to PE 1,0 and PE 1,1, PE 2,2 performs the tasks that originally belonged to PE 2,0 and PE 2,1, and so forth. In [30], it is assumed that there exists a controller, referred to as a host control circuit, capable of directly transmitting data to each PE. This assumption is necessary because the systolic array must have data that correspond to aligned matrices to execute Cannon's algorithm. Moreover, for each PE, the host control circuit needs to manage two multiplexers (in this PE) for data selection:

1. To circumvent faulty PEs when data flows from bottom to top, each PE can transmit data to the upper-left PE, the upper PE, and the upper-right PE. This means that in each PE, a multiplexer is required to select among the data coming from the bottom-left PE, the bottom PE, the bottom-right PE, and the host control circuit.
2. To bypass a faulty PE when data flows from right to left, the data will pass through the faulty PE (without executing any multiply-accumulate operations) and enter the next PE in the subsequent cycle. This implies that in each PE, another multiplexer is necessary to choose among the data arriving from the multiply-accumulate unit (MAC unit), the right PE, and the control circuit.

From the above discussion, we find that in the previous work, to support twisted columns, data may come from different directions, making the design of the host control circuit complex and adding extra circuits to each PE. Moreover, the previous work must find a fault-free column, which also limits the capability for fault tolerance.

The Proposed Approach
Through analysis, we observe that in [30], there already exists a data transmission path between the host control circuit and each PE. Therefore, we propose a new fault-tolerant approach that utilizes the existing data transmission path between the host control circuit and each PE for data transfer, eliminating the need to search for a fault-free column. This approach not only simplifies the host control circuit but also streamlines the data path design of PEs.
It should be noted that our objective is to ensure the accuracy of matrix multiplication results. Therefore, the fault-tolerance issues that we address in matrix multiplication are different from those in the literature on DNN accelerators [31][32][33][34], which may tolerate some errors.
In this section, we introduce the proposed approach. Section 3.1 covers our fault-tolerance mechanism while also discussing the corresponding PE architecture. Section 3.2 presents two pair-matching algorithms: one-dimensional pair-matching and two-dimensional pair-matching.

Fault-Tolerance Mechanism and Corresponding PE Architecture
Note that, after post-manufacturing testing [14,15], the host control circuit can establish information about faulty PEs. Then, throughout the circuit's lifespan, the number of faulty PEs may increase as the circuit ages. Hence, the host control circuit can incorporate run-time detection mechanisms [16,17] to dynamically update the information on faulty PEs.
The main drawback of the previous work [30] is the necessity to find at least one fault-free column (even in the form of a twisted column), which limits the potential for fault tolerance. Additionally, to support twisted columns, both the host control circuit and the data path of each PE incur significant additional circuitry, resulting in area overhead.
In contrast to the previous work, we adopt a pair-matching mechanism. For each faulty PE, we assign a fault-free PE to act as its proxy. In other words, this fault-free PE must complete not only its own task but also the task of its corresponding faulty PE. The proposed pair-matching mechanism does not require the identification of a fault-free column. Therefore, the proposed approach exhibits higher fault tolerance.
The proposed fault-tolerance mechanism follows the assumptions outlined in [30]. Firstly, we assume that faults in each PE can only occur in the MAC unit and not in the data transmission paths or storage components. Secondly, we assume the existence of data transmission paths between the host control circuit and each PE.
The proposed fault-tolerance mechanism consists of two stages:

1. In the first stage, Cannon's algorithm is executed in the systolic array. Each fault-free PE completes the computations that it should perform, whereas each faulty PE does not engage in computations. Nevertheless, during global matrix shifting, both faulty PEs and fault-free PEs engage in data transmissions.
2. In the second stage, the computations that each faulty PE should carry out are completed by the fault-free PE acting as its proxy. Since there are data transmission paths between the host control circuit and each PE, the data required by each proxy PE can be directly transmitted by the host control circuit.
To implement the proposed fault-tolerance mechanism, we need to identify a fault-free PE to replace each faulty PE. The pairing of faulty PEs with fault-free PEs is managed by the host control circuit, which takes the matrix size and the list of faulty PEs as input. It then executes the pair-matching algorithm to establish the corresponding pairs. In Section 3.2, we will present two pair-matching algorithms. These two algorithms offer a trade-off between fault tolerance and controller area overhead.
In [30], the authors do not discuss the bandwidth of data transfer between the host control circuit and PEs. Therefore, here, we also assume no limitation on the data transfer bandwidth. It is worth noting that even if the data transfer bandwidth is limited, it does not affect the fault tolerance of our approach. However, if the data transfer bandwidth is limited, the time that it takes for the host control circuit to transmit data to PEs will be longer. For example, if the host control circuit can only transmit data to a maximum of 4 PEs at a time, it will need to transmit data to 8 PEs in two separate transmissions.
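The example above can be stated as a simple formula. The symbols m and b are introduced here for illustration only: if m PEs must receive data from the host control circuit in a given step and at most b PEs can be served per transmission, the number of transmissions T is

$$T = \left\lceil \frac{m}{b} \right\rceil, \qquad \text{e.g., } m = 8,\ b = 4 \ \Rightarrow\ T = 2.$$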
We illustrate our fault-tolerance mechanism with an example. In the first stage, the systolic array executes Cannon's algorithm. Initially, based on the results of the initial alignment, the host control circuit sends data to each PE. Figure 5a depicts the overlap between the aligned matrix A and the aligned matrix B. For example, PE 0,0 receives data A 0,0 and B 0,0, PE 0,1 receives data A 0,1 and B 1,1, PE 0,2 receives data A 0,2 and B 2,2, and so on. It is worth noting that PEs shaded in gray are faulty PEs. In other words, PE 0,1, PE 0,3, and PE 3,3 are faulty PEs. Faulty PEs are unable to perform MAC operations but can still transmit data. All fault-free PEs simultaneously perform MAC operations. Then, we conduct the first global matrix shifting. To clarify, in Figure 5a, we use the 0-th row as an example to demonstrate the shift direction of the elements in matrix A and the 0-th column as an example to demonstrate the shift direction of the elements in matrix B. The results of data transmission to each PE are shown in Figure 5b. Subsequently, all fault-free PEs simultaneously perform MAC operations again. Then, we conduct the second global matrix shifting, and the results of data transmission to each PE are shown in Figure 5c. Then, all fault-free PEs simultaneously perform MAC operations again. Following this, we proceed with the third global matrix shifting, and the results of data transmission to each PE are shown in Figure 5d. Finally, all fault-free PEs simultaneously perform MAC operations. At this point, each fault-free PE has completed its originally assigned task. For example, the accumulation result of PE 0,0 corresponds to the value of C 0,0, where

$$C_{0,0} = A_{0,0} B_{0,0} + A_{0,1} B_{1,0} + A_{0,2} B_{2,0} + A_{0,3} B_{3,0}$$

Next, we proceed to the second stage. Based on the outcomes of the pair-matching algorithm, the host control circuit assigns a fault-free PE to manage the tasks of each faulty PE. In Section 3.2, we will introduce two pair-matching algorithms. Here, we assume that the matching pairs are as follows: <PE 0,1, PE 0,0>, <PE 0,3, PE 0,2>, and <PE 3,3, PE 3,0>, where PE 0,0 takes over the task of PE 0,1, PE 0,2 takes over the task of PE 0,3, and PE 3,0 takes over the task of PE 3,3.
In the second stage, only fault-free PEs serving as proxies perform MAC operations. Initially, the host control circuit sends data A 0,1 and B 1,1 to PE 0,0, data A 0,3 and B 3,3 to PE 0,2, and data A 3,2 and B 2,3 to PE 3,0. The data transmission results are shown in Figure 6a. Subsequently, PE 0,0, PE 0,2, and PE 3,0 simultaneously perform MAC operations. Then, the host control circuit sends data A 0,2 and B 2,1 to PE 0,0, data A 0,0 and B 0,3 to PE 0,2, and data A 3,3 and B 3,3 to PE 3,0. The data transmission results are shown in Figure 6b. Afterwards, PE 0,0, PE 0,2, and PE 3,0 simultaneously perform MAC operations. Next, the host control circuit sends data A 0,3 and B 3,1 to PE 0,0, data A 0,1 and B 1,3 to PE 0,2, and data A 3,0 and B 0,3 to PE 3,0. The data transmission results are shown in Figure 6c. Then, PE 0,0, PE 0,2, and PE 3,0 simultaneously perform MAC operations. Following that, the host control circuit sends data A 0,0 and B 0,1 to PE 0,0, data A 0,2 and B 2,3 to PE 0,2, and data A 3,1 and B 1,3 to PE 3,0. The data transmission results are shown in Figure 6d. Subsequently, PE 0,0, PE 0,2, and PE 3,0 simultaneously perform MAC operations. As a result, PE 0,0, PE 0,2, and PE 3,0 complete their proxy tasks. We illustrate with the pair <PE 0,1, PE 0,0>. PE 0,0 acts as the proxy for the tasks of PE 0,1. Therefore, in this stage, PE 0,0 computes the value of C 0,1, where

$$C_{0,1} = A_{0,1} B_{1,1} + A_{0,2} B_{2,1} + A_{0,3} B_{3,1} + A_{0,0} B_{0,1}$$

Finally, let us discuss the architecture design of each PE. Our PE design is depicted in Figure 7. Essentially, each PE primarily executes MAC operations. For each PE, data for matrix A can be sourced either from the host control circuit (referred to as Controller-A in Figure 7) or from the PE located to the right (referred to as the east PE in Figure 7) because global matrix shifting necessitates shifting each element of matrix A to the left. Therefore, a two-to-one multiplexer is required to select the data source for matrix A. Similarly, for each PE, data for matrix B can be sourced either from the host control circuit (denoted as Controller-B in Figure 7) or from the PE located below (referred to as the south PE in Figure 7) because global matrix shifting requires shifting each element of matrix B upward. Thus, a two-to-one multiplexer is also necessary to select the data source for matrix B.
Because global matrix shifting requires each element of matrix A to shift left by one position, each PE must transmit its matrix A data to the left-side PE (referred to as the west PE in Figure 7). Similarly, because global matrix shifting requires each element of matrix B to shift upward by one position, each PE must transmit its matrix B data to the PE above (referred to as the north PE in Figure 7). Upon completing all multiplications and accumulations, the final result is sent back to the host control circuit (referred to as Controller-R in Figure 7).
In comparison to the previous work [30], the data path of our PE is relatively simple, resulting in circuit area savings for the PE. Additionally, since our control logic is also relatively simple, the area of the controller (i.e., the host control circuit) is reduced as well.
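The two-stage mechanism can be modeled in software as follows. This Python sketch is a functional model under stated assumptions, not the circuit itself: the names operand_schedule and two_stage_multiply are hypothetical, shifting is abstracted into the index formula (i + j + t) mod n for the operands that PE i,j sees in cycle t, and the host control circuit is assumed to have unlimited bandwidth.

```python
def operand_schedule(i, j, n):
    """Operand indices seen by PE(i, j) over the n cycles of Cannon's algorithm.

    Each entry is ((row, col) of the A element, (row, col) of the B element).
    """
    return [((i, (i + j + t) % n), ((i + j + t) % n, j)) for t in range(n)]


def two_stage_multiply(A, B, pairs):
    """Compute C = A x B on an n x n array that contains faulty PEs.

    pairs maps each faulty PE (i, j) to its fault-free proxy PE.
    """
    n = len(A)
    faulty = set(pairs)
    proxies = list(pairs.values())
    # Each proxy must be fault-free and serve at most one faulty PE.
    assert not faulty & set(proxies) and len(set(proxies)) == len(proxies)
    C = [[None] * n for _ in range(n)]

    # Stage 1: every fault-free PE accumulates its own entry; faulty PEs
    # only forward data, so their entries remain empty.
    for i in range(n):
        for j in range(n):
            if (i, j) not in faulty:
                C[i][j] = sum(A[r][k] * B[k2][c]
                              for (r, k), (k2, c) in operand_schedule(i, j, n))

    # Stage 2: the host sends each proxy the operand stream of its faulty
    # partner, and the proxy's MAC unit produces the missing entry.
    for (fi, fj) in pairs:
        acc = 0
        for (r, k), (k2, c) in operand_schedule(fi, fj, n):
            acc += A[r][k] * B[k2][c]
        C[fi][fj] = acc
    return C
```

For the example above, operand_schedule(0, 1, 4) lists the operands (A 0,1, B 1,1), (A 0,2, B 2,1), (A 0,3, B 3,1), and (A 0,0, B 0,1), which is the sequence that the host control circuit sends to PE 0,0 in Figure 6.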

Proposed Pair-Matching Algorithms
In the proposed fault-tolerance mechanism, for each faulty PE, we must find a fault-free PE to act as its proxy in execution. Therefore, the host control circuit must employ a pair-matching algorithm. Considering hardware efficiency, we use the following two principles to develop the pair-matching algorithms:

1. Each fault-free PE can proxy for at most one faulty PE. Thus, all tasks of faulty PEs can be executed simultaneously by fault-free PEs acting as their proxies.
2. We perform pair-matching independently for each row (or each column). Hence, pair-matching can be conducted in parallel for all rows (or all columns).

We refer to the scheme of performing pair-matching independently for each row as row-based pair-matching. Similarly, the scheme of performing pair-matching independently for each column is referred to as column-based pair-matching. Essentially, the concepts of row-based pair-matching and column-based pair-matching are identical. Without loss of generality, we illustrate the row-based pair-matching scheme using Figure 8.
For each row, the number of matching pairs is the minimum of the number of faulty PEs and the number of fault-free PEs in that row. In Figure 8, we illustrate this using row 0. Since there are 3 faulty PEs (i.e., PE 0,0, PE 0,2, and PE 0,5) and 4 fault-free PEs (i.e., PE 0,1, PE 0,3, PE 0,4, and PE 0,6) in this row, the number of matching pairs is 3 (i.e., min(3,4) = 3). We scan from left to right in our row-based matching approach. The first faulty PE is PE 0,0, and the first fault-free PE is PE 0,1, so the first matching pair is <PE 0,0, PE 0,1>. The second faulty PE is PE 0,2, and the second fault-free PE is PE 0,3, so the second matching pair is <PE 0,2, PE 0,3>. The third faulty PE is PE 0,5, and the third fault-free PE is PE 0,4, so the third matching pair is <PE 0,5, PE 0,4>.
Based on the row-based pair-matching scheme and the column-based pair-matching scheme, we present two pair-matching algorithms:

1. One-dimensional pair-matching algorithm: This algorithm only executes the row-based pair-matching scheme.
2. Two-dimensional pair-matching algorithm: This algorithm first executes the row-based pair-matching scheme. If there are faulty PEs that fail to complete pairing, this algorithm then attempts pairing using the column-based pair-matching scheme.
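The row-based scheme that both algorithms build on can be sketched for a single row as follows. This is an illustrative Python model; the function name match_row and the boolean-list input are assumptions made here, and the seven-column row in the usage note is inferred from the 3 faulty and 4 fault-free PEs described for row 0 of Figure 8.

```python
def match_row(faulty_flags):
    """Pair faulty and fault-free PEs of one row, scanning left to right.

    faulty_flags[j] is True if the PE in column j is faulty.
    Returns a list of (faulty column, fault-free column) pairs.
    """
    faulty = [j for j, is_faulty in enumerate(faulty_flags) if is_faulty]
    fault_free = [j for j, is_faulty in enumerate(faulty_flags) if not is_faulty]
    # zip stops at the shorter list, which yields min(#faulty, #fault-free) pairs.
    return list(zip(faulty, fault_free))
```

Calling match_row([True, False, True, False, False, True, False]) returns [(0, 1), (2, 3), (5, 4)], which corresponds to the three matching pairs derived above for row 0.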
Clearly, the control logic of the one-dimensional pair-matching algorithm is simpler, but its fault tolerance is lower. Conversely, the control logic of the two-dimensional pair-matching algorithm is more complex, but its fault tolerance is higher.
Algorithm 1 displays the proposed one-dimensional pair-matching algorithm. For each row, the number of matching pairs is the minimum of the number of faulty PEs and the number of fault-free PEs in that row. We define rp[i] as the number of matching pairs in row i. Each matching pair is assigned a unique index (starting from 0). We define index[i] as the index of the first matching pair in row i. In other words, we have index[0] = 0. For i > 0, we have Equation (4) as follows:

$$\mathrm{index}[i] = \mathrm{index}[i-1] + \mathrm{rp}[i-1] \tag{4}$$

Algorithm 1 One-Dimensional Pair-Matching
for i from 0 to n − 1 do
    for j from 0 to n − 1 do
        [...]
        assign PE i,j as the faulty PE of pair[k + p];
        [...]
        assign PE i,j as the fault-free PE of pair[k + q];
        q = q + 1;
        [...]

The controller (i.e., the host control circuit) can easily determine the value of index[i] for each row i. Since each row knows the index of its first matching pair, in hardware implementation, each row can proceed in parallel using the row-based pair-matching scheme. In Algorithm 1, we employ an array named "pair" to store matching pairs. Each matching pair within this array comprises a faulty PE and a fault-free PE, organized based on their respective indices.
We illustrate the proposed one-dimensional pair-matching algorithm (i.e., Algorithm 1) using the PE array in Figure 9a. From Figure 9a, we have index[5] = 15 and index[6] = 18. Then, we can perform the row-based pair-matching scheme for each row. The results are displayed in Figure 9b.
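A software reading of the one-dimensional algorithm is sketched below in Python. It is a reconstruction under assumptions rather than a transcription of Algorithm 1: the counters p and q and the offset k are taken to mean "faulty PEs matched so far in this row", "fault-free PEs matched so far in this row", and index[i], and the function name and return values are hypothetical.

```python
def one_dimensional_pair_matching(faulty):
    """Row-based pair-matching over an n x n grid of booleans (True = faulty).

    Returns (pair, success): pair[k] is [faulty PE, fault-free PE] as (row, col)
    tuples, and success is True if every faulty PE obtained a proxy.
    """
    n = len(faulty)
    # rp[i]: number of matching pairs in row i.
    rp = [min(sum(row), n - sum(row)) for row in faulty]
    # index[i]: index of the first matching pair of row i (prefix sum of rp).
    index = [0] * n
    for i in range(1, n):
        index[i] = index[i - 1] + rp[i - 1]
    pair = [[None, None] for _ in range(index[-1] + rp[-1])]
    for i in range(n):  # rows are independent, so hardware can do this in parallel
        k, p, q = index[i], 0, 0
        for j in range(n):
            if faulty[i][j]:
                if p < rp[i]:
                    pair[k + p][0] = (i, j)
                    p += 1
            elif q < rp[i]:
                pair[k + q][1] = (i, j)
                q += 1
    # Success requires that no row has more faulty PEs than matching pairs.
    success = all(sum(faulty[i]) == rp[i] for i in range(n))
    return pair, success
```

Because index[i] is known before any row is scanned, each row writes to a disjoint slice of the pair array, which is what allows the rows to proceed in parallel.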
Next, we extend the proposed one-dimensional pair-matching algorithm to a two-dimensional context. Our two-dimensional pair-matching algorithm includes two phases. In the first phase, we apply the row-based pair-matching scheme for each row. The task of this phase is the same as that of our one-dimensional pair-matching algorithm. However, we must mark the PEs that have been paired. In the second phase, we apply the column-based pair-matching scheme for each column. In this phase, we only need to perform pair-matching for unmarked PEs (i.e., PEs that have not yet been paired).
In the second phase, since we only consider unmarked PEs, for each column, the number of matching pairs is the minimum of the number of unmarked faulty PEs and the number of unmarked fault-free PEs in this column. We define cp[j] as the number of matching pairs in column j. We also continue to assign a unique index to each matching pair (continuing from the numbering in the first phase). We define col_index[j] as the index of the first matching pair in column j. In other words, we have Equation (5) as follows:

$$\mathrm{col\_index}[0] = \mathrm{index}[n-1] + \mathrm{rp}[n-1] \tag{5}$$

For j > 0, we have Equation (6) as follows:

$$\mathrm{col\_index}[j] = \mathrm{col\_index}[j-1] + \mathrm{cp}[j-1] \tag{6}$$

We illustrate the proposed two-dimensional pair-matching algorithm using Figure 10 as an example. In the first phase, we apply the row-based pair-matching scheme. The results of the first phase are displayed in Figure 10a. Upon completion of the first phase, we have 21 matching pairs (numbered from 0 to 20). However, we find that in column 5, there are still 3 faulty PEs that have not been paired. Then, we apply the column-based pair-matching scheme for each column. After applying the column-based pair-matching scheme, the three faulty PEs in column 5 can be paired (the indices of these matching pairs are 21, 22, and 23). Upon completion of the second phase, we obtain the results displayed in Figure 10b. We find that all of the faulty PEs have been paired.
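The two phases can be sketched in Python as follows. This is an illustrative software model with a hypothetical function name; it appends pairs sequentially, so the position of a pair in the returned list plays the role of its index (row pairs first, then column pairs), and it scans each column from top to bottom, which is an assumption about the column-based scan order.

```python
def two_dimensional_pair_matching(faulty):
    """Row-based matching followed by column-based matching of leftovers.

    faulty is an n x n grid of booleans (True = faulty). Returns (pairs, success),
    where each pair is ((faulty row, col), (proxy row, col)).
    """
    n = len(faulty)
    marked = [[False] * n for _ in range(n)]
    pairs = []
    # Phase 1: row-based pair-matching; mark every PE that gets paired.
    for i in range(n):
        bad = [j for j in range(n) if faulty[i][j]]
        good = [j for j in range(n) if not faulty[i][j]]
        for fj, gj in zip(bad, good):
            pairs.append(((i, fj), (i, gj)))
            marked[i][fj] = marked[i][gj] = True
    # Phase 2: column-based pair-matching restricted to unmarked PEs.
    for j in range(n):
        bad = [i for i in range(n) if faulty[i][j] and not marked[i][j]]
        good = [i for i in range(n) if not faulty[i][j] and not marked[i][j]]
        for fi, gi in zip(bad, good):
            pairs.append(((fi, j), (gi, j)))
            marked[fi][j] = marked[gi][j] = True
    # Success means every faulty PE ended up marked (i.e., has a proxy).
    success = all(marked[i][j] for i in range(n) for j in range(n) if faulty[i][j])
    return pairs, success
```

In a 3 × 3 grid with faulty PEs at (0,0), (2,1), and (2,2), the row phase leaves PE 2,2 unpaired, and the column phase then pairs it with PE 0,2.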
It is noteworthy that the proposed pair-matching algorithms, including both our one-dimensional and two-dimensional approaches, can be implemented in either software or hardware. However, for real-time applications, it is essential to implement these algorithms in hardware circuits to achieve high speed.

Experimental Results
In the experiments, we address the fault-tolerance capability and circuit area overhead of the proposed approach. For comparison, we also implemented the approach presented by Jan and Huang [30].
Regarding the fault-tolerance capability, we conducted separate analyses on an 8 × 8 systolic array, a 16 × 16 systolic array, and a 32 × 32 systolic array. We randomly choose the positions of faulty PEs within the systolic arrays. Then, we can evaluate the fault-tolerance capabilities of different methods (including the proposed one-dimensional pair-matching algorithm, the proposed two-dimensional pair-matching algorithm, and the approach proposed by Jan and Huang [30]).
The detailed analysis methodology is as follows. Given a specific number of faulty PEs, we randomly generate the locations of these faulty PEs. For each specific number of faulty PEs, we generate 10,000 cases randomly. For each case, we analyze whether the various methods (including the proposed one-dimensional pair-matching algorithm, the proposed two-dimensional pair-matching algorithm, and the approach proposed by Jan and Huang [30]) can successfully tolerate faults. Subsequently, we can calculate the probability of success (i.e., the success rate) of each method for a specific number of faulty PEs. Figure 11, Figure 12, and Figure 13 depict our analysis results for the 8 × 8 systolic array, 16 × 16 systolic array, and 32 × 32 systolic array, respectively. In these figures, 1-D represents the proposed one-dimensional pair-matching algorithm, 2-D represents the proposed two-dimensional pair-matching algorithm, and Jan and Huang (2012) represents the method presented in [30]. Moreover, in these figures, the x-axis represents the probability of success (i.e., the success rate), and the y-axis represents the allowable quantity of faulty PEs. For instance, as shown in Figure 11, to attain a success rate of 90% in an 8 × 8 systolic array, the permissible numbers of faulty PEs for our one-dimensional pair-matching algorithm, our two-dimensional pair-matching algorithm, and the method proposed in [30] are 14, 21, and 4, respectively; to attain a success rate of 80% in an 8 × 8 systolic array, the permissible numbers are 16, 23, and 6, respectively.
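The random-case analysis described above can be sketched as a small Monte Carlo experiment. The Python code below is an illustrative model only: the function names, the default trial count, and the seeding are assumptions, the method of [30] is not modeled, and the rates it prints are not claimed to reproduce the values in Figures 11-13.

```python
import random


def rows_ok(faulty, n):
    """1-D criterion: no row has more faulty PEs than fault-free PEs."""
    return all(2 * sum(faulty[i]) <= n for i in range(n))


def two_d_ok(faulty, n):
    """2-D criterion: row-based matching, then column-based matching of leftovers."""
    marked = [[False] * n for _ in range(n)]
    for i in range(n):
        bad = [j for j in range(n) if faulty[i][j]]
        good = [j for j in range(n) if not faulty[i][j]]
        for fj, gj in zip(bad, good):
            marked[i][fj] = marked[i][gj] = True
    for j in range(n):
        bad = [i for i in range(n) if faulty[i][j] and not marked[i][j]]
        good = [i for i in range(n) if not faulty[i][j] and not marked[i][j]]
        for fi, gi in zip(bad, good):
            marked[fi][j] = marked[gi][j] = True
    return all(marked[i][j] for i in range(n) for j in range(n) if faulty[i][j])


def success_rate(n, num_faulty, criterion, trials=1000, seed=0):
    """Fraction of random fault patterns with num_faulty faults that are tolerated."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(n) for j in range(n)]
    hits = 0
    for _ in range(trials):
        faulty = [[False] * n for _ in range(n)]
        for i, j in rng.sample(cells, num_faulty):
            faulty[i][j] = True
        hits += criterion(faulty, n)
    return hits / trials


if __name__ == "__main__":
    for f in (8, 16, 24):
        print(f, success_rate(8, f, rows_ok), success_rate(8, f, two_d_ok))
```

With the same seed, both criteria are evaluated on the same fault patterns, so the 2-D rate can never fall below the 1-D rate: any pattern that the row phase resolves completely is also accepted by the two-phase check.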
From Figures 11-13, we observe that our two-dimensional pair-matching algorithm exhibits the highest fault-tolerance capability. Furthermore, both the proposed one-dimensional pair-matching algorithm and the proposed two-dimensional pair-matching algorithm outperform the approach presented in [30] in terms of fault tolerance. Essentially, the approach presented in [30] is only suitable for scenarios with a smaller number of faulty PEs. In contrast, both our one-dimensional pair-matching algorithm and our two-dimensional pair-matching algorithm can be applied in situations with a higher number of faulty PEs.
Next, we explore the circuit area overhead. For our implementations, we assume that the size of the systolic array is 8 × 8. The circuits are implemented using the TSMC 40 nm cell library. We begin by analyzing the area of the controller (i.e., the host control circuit). Table 1 presents the controller area for executing the proposed one-dimensional pair-matching algorithm (referred to as 1-D), the proposed two-dimensional pair-matching algorithm (referred to as 2-D), and the method proposed in [30]. As shown in Table 1, the controller area utilizing the proposed one-dimensional pair-matching algorithm is the smallest. Conversely, the controller area of the approach proposed in [30] is the largest. This is because their twisted column scheme is more complex than the proposed pair-matching scheme.
We also examine the areas occupied by PEs. Table 2 presents the areas of various PE designs, including the conventional PE design (i.e., without any fault-tolerance mechanism), our PE design, and the approach proposed in [30]. In comparison to the conventional PE design, our area overhead is only 6%. On the other hand, when compared with the conventional PE design, the area overhead of the approach proposed in [30] reaches 70% (owing to the twisted column scheme). Therefore, the area overhead of our PE design is small. Note that Table 2 refers to the area of a single PE. When considering a PE array, its total area can be determined by multiplying the area of a single PE by the number of PEs in the array. Therefore, for PE arrays of the same size, compared to utilizing the conventional PE design, the area overhead of employing our PE design remains at 6%, while the area overhead of utilizing the approach proposed in [30] remains at 70%.

Conclusions
This paper introduces a highly fault-tolerant approach designed for a systolic array executing Cannon's algorithm for matrix multiplication. Our core concept involves pairing each faulty PE with a corresponding fault-free PE to serve as its proxy. We propose two pair-matching algorithms: one-dimensional pair-matching and two-dimensional pair-matching. These two pair-matching algorithms offer a trade-off between fault-tolerance capability and circuit area overhead. We employed the TSMC 40 nm process technology to implement the proposed approach. The experimental results demonstrate that, compared to the previous work, our approach (whether employing our one-dimensional pair-matching algorithm or our two-dimensional pair-matching algorithm) not only improves fault tolerance but also reduces circuit area overhead. In certain application domains, such as space or deep-sea exploration, repairs are challenging even if many PEs fail. Therefore, the proposed approach is particularly well-suited for these applications.
Since our current pair-matching algorithms necessitate finding a dedicated fault-free PE to substitute for each faulty PE, the number of faulty PEs cannot exceed half of the total PE count. Our future work will concentrate on developing methods to overcome this limitation.
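This bound follows from a counting argument, sketched here with f denoting the number of faulty PEs in an n × n array (the symbol f is introduced for illustration). Because each fault-free PE serves as a proxy for at most one faulty PE, a complete matching requires

$$f \le n^2 - f \iff f \le \frac{n^2}{2},$$

so an 8 × 8 array can tolerate at most 32 faulty PEs under this scheme. For the one-dimensional algorithm, the same argument applies row by row: a row containing $f_i$ faulty PEs can be fully matched only if $f_i \le n/2$.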

Figure 2. The systolic array architecture applied to Cannon's algorithm.

Figure 3. (a) The overlap of aligned matrices A and B. (b) The results after the first global matrix shifting.

Figure 4. An example of twisted columns.

Figure 5. The first stage of our fault-tolerance mechanism.

Figure 6. The second stage of our fault-tolerance mechanism.

Figure 7. The architecture of our PE.

Figure 8. Illustration of the row-based pair-matching scheme.

Figure 9. An example of our one-dimensional pair-matching algorithm.

Figure 11. The fault-tolerance capabilities of different methods in an 8 × 8 systolic array. The bar Jan and Huang (2012) denotes the method proposed in [30].

Figure 12. The fault-tolerance capabilities of different methods in a 16 × 16 systolic array. The bar Jan and Huang (2012) denotes the method proposed in [30].

Figure 13. The fault-tolerance capabilities of different methods in a 32 × 32 systolic array. The bar Jan and Huang (2012) denotes the method proposed in [30].

Table 1. Comparison of the areas of controllers.

Table 2. Comparison of the areas of different PE designs.