The Algorithm and Structure for Digital Normalized Cross-Correlation by Using First-Order Moment

Normalized cross-correlation is an important mathematical tool in digital signal processing. This paper presents a new algorithm and its systolic structure for digital normalized cross-correlation, based on the statistical characteristics of the inner-product. We first introduce a relationship between the inner-product in cross-correlation and a first-order moment. Digital normalized cross-correlation is then transformed into a new calculation formula that mainly consists of a first-order moment. Finally, by using a fast algorithm for the first-order moment, we can compute the first-order moment in this new formula rapidly, and thus develop a fast algorithm for normalized cross-correlation that performs arbitrary-length digital normalized cross-correlation with a simple procedure and fewer multiplications. Furthermore, as the algorithm for the first-order moment can be implemented by a systolic structure, we design a systolic array for normalized cross-correlation with few multipliers for fast hardware implementation. The proposed algorithm and systolic array are also improved to reduce their addition complexity. Comparisons with several existing algorithms and structures demonstrate the performance of the proposed method.


Introduction
Normalized cross-correlation (NCC) is an important mathematical tool in signal and image processing for feature matching, similarity analysis, motion tracking, object recognition, and so on [1][2][3]. Because of its high computational complexity, fast algorithms and hardware structures have been suggested to improve the real-time performance and efficiency of digital NCC [4,5].
Since correlation and convolution have similar computational structures, three main kinds of fast convolution algorithms can be applied to fast NCC [6,7]: (1) the Fast Fourier Transform (FFT)-based algorithm, (2) the polynomial-based algorithm, and (3) the decomposition algorithm. However, to our knowledge, each of these algorithms has its limitations. The FFT-based algorithm is not well suited to the discrete domain and involves complex multiplications [8,9]. Both the polynomial-based algorithm and the decomposition algorithm require complex computational structures, and they often lack generality for arbitrary-length correlations [10,11].
Furthermore, some special algorithms for fast NCC have been presented [12,13]. The fast cross-correlation of binary sequences can be extended to other types of NCC sequences [14]. The estimation algorithm derives the scaling factor between the signal and the kernel, so it computes NCC using only additions at the cost of a small amount of noise [15]. Several methods, such as the pyramid method, have been used to assist NCC by reducing its searching and computing times in image matching [3,7]. In addition, many parallel algorithms for the inner-product have been published that can perform fast cross-correlation for NCC [16,17]; among them, Distributed Arithmetic (DA) with a look-up table needs no multiplication, but requires a large amount of Read-Only Memory (ROM) [18].
For the hardware implementation of fast NCC, Very-Large-Scale Integration (VLSI) circuits have been applied, where systolic structures are popular due to their regularity and modularity [19][20][21]. The integration of the systolic array and the DA technique leads to a more efficient VLSI implementation of cross-correlation, although it uses many ROMs and address decoders [22,23]. The Residue Number System-based DA can reduce the ROMs and enhance throughput, but requires extra encoding processes in the residue domain [24].
In this paper, we present a new algorithm and structure to implement digital NCC with a simple and fast procedure. The key idea is an NCC formula expressed in terms of a first-order moment, designed according to the relationship between the inner-product and the first-order moment, so that the computational complexity of NCC is transformed into that of a first-order moment. To perform an arbitrary-length digital NCC, our algorithm first establishes the NCC formula based on a first-order moment for the correlation sequences, and then uses a multiplication-free fast algorithm from [25,26] to compute the first-order moment in the new NCC formula rapidly. For the hardware implementation of NCC, we develop a simple and scalable systolic array derived from the proposed algorithm, since the fast algorithm for the first-order moment is easily performed by a systolic structure [27]. The proposed algorithm and systolic array are also improved to reduce their addition complexity, according to an even-odd relationship in the computation of the first-order moment.
The rest of the paper is organized as follows. Section 2 establishes the NCC formula based on a first-order moment. Section 3 introduces a fast algorithm and its systolic implementation for first-order moment. Sections 4 and 5 discuss the fast algorithm and the systolic array inspired by Section 3 to perform the NCC formula in Section 2 rapidly. Comparison and analysis are presented in Section 6 to demonstrate the feasibility of the proposed algorithm and structure. Finally, Section 7 gives the conclusion.

Normalized Cross-Correlation Based on First-Order Moment
The inner-product of two correlation sequences is the most complex operation in NCC, so it is transformed into a first-order moment to decrease the computational complexity of fast NCC. To do this, let us assume two N-point digital sequences $\{f(i)\}$ and $\{g(i)\}$, where $\{f(i)\}$ is an arbitrary input sequence and $\{g(i)\}$ is the fixed correlation kernel with the value range $g(i) \in \{0, 1, 2, \ldots, L\}$. This section establishes an NCC formula for these two sequences that mainly consists of a first-order and a zero-order moment. The aim is to replace the complex computation of cross-correlation in NCC with the easy computation of a first-order moment.

Cross-Correlation
Cross-correlation is an inner-product between two digital sequences. It is defined as

$$c(n) = \sum_{i=0}^{N-1} f(n+i)\,g(i). \qquad (1)$$

Equation (1) can be transformed into a first-order moment by means of the statistical characteristics of the inner-product operation. To do this, we define subsets $S_k$ ($k = 0, 1, 2, \ldots, L$) that partition the index set $i \in \{0, 1, \ldots, N-1\}$ into $L + 1$ subsets, where $L$ is the maximum value in the correlation kernel $\{g(i)\}$. Specifically,

$$S_k = \{\, i \mid g(i) = k,\ i \in \{0, 1, \ldots, N-1\} \,\}, \qquad (2)$$

where $k = 0, 1, 2, \ldots, L$. In other words, $S_k$ is the set of indices $i$ for which $g(i) = k$. A new $(L+1)$-point sequence $\{a_k(n)\}$ is then defined from the subsets $S_k$ [28]:

$$a_k(n) = \begin{cases} \sum_{i \in S_k} f(n+i), & S_k \neq \Phi \\ 0, & S_k = \Phi \end{cases} \qquad (3)$$

where $k = 0, 1, 2, \ldots, L$, and "Φ" denotes the empty set. Thus $a_k(n)$ is the sum of those elements of $\{f(n+i)\}$ whose index $i$ satisfies $g(i) = k$. Computing $\{a_k(n)\}$ is in effect a counting procedure that determines how much weight each value $k$ accumulates in the computation of $c(n)$. Therefore, the relationship between $\{f(n+i)\}$ and $\{a_k(n)\}$ can be described as

$$\sum_{i=0}^{N-1} f(n+i) = \sum_{k=0}^{L} a_k(n), \qquad (4a)$$

$$\sum_{i=0}^{N-1} f(n+i)\,g(i) = \sum_{k=0}^{L} a_k(n)\,k = \sum_{k=1}^{L} a_k(n)\,k, \qquad (4b)$$

where $\sum_k a_k(n)$ is a zero-order moment and $\sum_k a_k(n)\,k$ is a first-order moment of $\{a_k(n)\}$. As a result, Equation (1) can be transformed into

$$c(n) = \sum_{k=1}^{L} a_k(n)\,k. \qquad (5)$$

Equation (5) is a new calculation formula for cross-correlation based on a first-order moment.

Sensors 2020, 20, 1353
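The relationship between the inner-product and the first-order moment can be checked numerically. The following is a minimal Python sketch, not the authors' implementation; the function names, the non-cyclic indexing `f[n + i]`, and the sample data are assumptions made for illustration.

```python
def moment_sequence(f, g, n, L):
    """Return [a_0(n), ..., a_L(n)], where a_k(n) sums f[n + i] over all i with g[i] == k."""
    a = [0] * (L + 1)
    for i, gi in enumerate(g):
        a[gi] += f[n + i]  # one addition per sample, no multiplication
    return a


def cross_correlation_direct(f, g, n):
    """Inner-product form of the cross-correlation c(n)."""
    return sum(f[n + i] * g[i] for i in range(len(g)))


def cross_correlation_moment(f, g, n, L):
    """c(n) computed as the first-order moment of {a_k(n)}."""
    a = moment_sequence(f, g, n, L)
    return sum(k * a[k] for k in range(1, L + 1))


# Illustrative data (hypothetical): an 8-sample signal and a 4-point kernel.
f = [3, 1, 4, 1, 5, 9, 2, 6]
g = [2, 0, 2, 1]
L = max(g)
print(cross_correlation_direct(f, g, 0), cross_correlation_moment(f, g, 0, L))  # 15 15
```

The multiplications by `k` in `cross_correlation_moment` are kept here only for clarity; the point of the later sections is that this weighted sum can itself be formed with additions alone.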

Normalized Cross-Correlation
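Normalized cross-correlation is more complex than cross-correlation, because it involves an inner-product between two difference sequences, obtained by subtracting from $\{f(n+i)\}$ and $\{g(i)\}$ their respective mean values. It is defined as

$$\rho(n) = \frac{\sum_{i=0}^{N-1} [f(n+i) - \bar{f}(n)]\,[g(i) - \bar{g}]}{\sqrt{\sum_{i=0}^{N-1} [f(n+i) - \bar{f}(n)]^2 \, \sum_{i=0}^{N-1} [g(i) - \bar{g}]^2}}, \qquad (6)$$

where $\bar{f}(n) = \frac{1}{N}\sum_{i=0}^{N-1} f(n+i)$ and $\bar{g} = \frac{1}{N}\sum_{i=0}^{N-1} g(i)$. Equation (6) can be rewritten as

$$\rho(n) = \frac{\sum_{i=0}^{N-1} f(n+i)\,g(i) - \bar{g}\sum_{i=0}^{N-1} f(n+i)}{\sqrt{\left[\sum_{i=0}^{N-1} f^2(n+i) - \frac{1}{N}\left(\sum_{i=0}^{N-1} f(n+i)\right)^2\right] \sum_{i=0}^{N-1} [g(i) - \bar{g}]^2}}. \qquad (7)$$

Let

$$b(n) = \sum_{i=0}^{N-1} f^2(n+i), \qquad (8)$$

and substitute Equations (4a), (4b) and (8) into Equation (7). The NCC expressed by Equation (6) is then converted to

$$\rho(n) = \frac{\sum_{k=1}^{L} a_k(n)\,k - \bar{g}\sum_{k=0}^{L} a_k(n)}{\sqrt{\left[b(n) - \frac{1}{N}\left(\sum_{k=0}^{L} a_k(n)\right)^2\right] \sum_{i=0}^{N-1} [g(i) - \bar{g}]^2}}. \qquad (9)$$

Equation (9) is a new calculation formula for NCC based on a first-order moment $\sum_k a_k(n)\,k$ and a zero-order moment $\sum_k a_k(n)$. The computational complexity of this NCC formula depends heavily upon the complexity of $\sum_k a_k(n)\,k$ and $b(n)$. Therefore, for a fast implementation of Equation (9), we introduce a fast algorithm and structure for $\sum_k a_k(n)\,k$ in Section 3, and an optimization method for $b(n)$ in Section 4.1.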
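As a sanity check on the moment-based NCC formula, the sketch below compares the textbook definition of NCC with a version built from the zero-order moment, the first-order moment and $b(n)$. It is an illustration only: the function names and sample data are hypothetical, $b(n)$ is assumed to be the windowed sum of squares of $f$, and windows with zero variance (which would give a zero denominator) are not handled.

```python
import math


def ncc_direct(f, g, n):
    """NCC from its definition: correlation of the mean-removed window and kernel."""
    N = len(g)
    w = f[n:n + N]
    f_mean = sum(w) / N
    g_mean = sum(g) / N
    num = sum((w[i] - f_mean) * (g[i] - g_mean) for i in range(N))
    den = math.sqrt(sum((x - f_mean) ** 2 for x in w)
                    * sum((y - g_mean) ** 2 for y in g))
    return num / den


def ncc_moment(f, g, n, L):
    """NCC expressed through the moments of {a_k(n)} and b(n)."""
    N = len(g)
    a = [0] * (L + 1)
    for i, gi in enumerate(g):
        a[gi] += f[n + i]                        # a_k(n), additions only
    m0 = sum(a)                                  # zero-order moment (includes a_0)
    m1 = sum(k * a[k] for k in range(1, L + 1))  # first-order moment
    g_mean = sum(g) / N                          # fixed kernel: can be precomputed
    g_var = sum((y - g_mean) ** 2 for y in g)    # fixed kernel: can be precomputed
    b = sum(x * x for x in f[n:n + N])           # assumed form of b(n)
    return (m1 - g_mean * m0) / math.sqrt((b - m0 * m0 / N) * g_var)


# Illustrative data (hypothetical).
f = [3, 1, 4, 1, 5, 9, 2, 6]
g = [2, 0, 2, 1]
L = max(g)
```

Both functions should agree up to floating-point rounding for every valid shift `n`.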

The Fast Algorithm and Systolic Array for First-Order Moment
Liu et al. presented an algorithm and its systolic array for the first-order moment in [25][26][27]. Their method is suitable for rapidly computing the first-order and the zero-order moments in Equation (4). In this section, we introduce this algorithm and systolic array, with the aim of implementing fast NCC by using Equation (9). In addition, because the introduced algorithm and array require many additions as a result of removing all multiplications, we also improve them to lower their addition complexity.

The Fast Algorithm for First-Order Moment
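According to [25], we illustrate a simple 1-network, shown in Figure 1, that represents a map transforming the two-dimensional vector $(1, x)$ into the vector $(1, 1 + x)$. This map is denoted by $F$, that is,

$$F(1, x) = (1, 1 + x).$$

The 1-network realizes $F$ with a single addition, $F(u, v) = (u, u + v)$, so $F$ is linear. Some characteristic equations obtained from $F$ are

$$F(a\,\mathbf{x}) = a\,F(\mathbf{x}), \qquad F(\mathbf{x} + \mathbf{y}) = F(\mathbf{x}) + F(\mathbf{y}).$$

Also, $F^2(1, x) = F(1, 1 + x) = (1, 2 + x)$, and by induction

$$F^m(1, x) = (1, m + x). \qquad (10)$$

Hence, we have

$$a\,F^m(1, 1) = (a,\ a(m + 1)). \qquad (11)$$

To compute the first-order moment with this 1-network, let $\mathbf{a}_k = (a_k(n), a_k(n))$ ($k = 1, 2, \ldots, L$). Equations (10) and (11) then yield

$$F^{k-1}(\mathbf{a}_k) = (a_k(n),\ a_k(n)\,k).$$

Generally, summing over $k$ and using the linearity of $F$, the above equation is expanded into

$$F(\cdots F(F(\mathbf{a}_L) + \mathbf{a}_{L-1}) + \mathbf{a}_{L-2} \cdots) + \mathbf{a}_1 = \left( \sum_{k=1}^{L} a_k(n),\ \sum_{k=1}^{L} a_k(n)\,k \right). \qquad (12)$$

From Equation (12), $\sum_k a_k(n)$ in Equation (4a) (after adding $a_0(n)$) and $\sum_k a_k(n)\,k$ in Equation (4b) can both be obtained from an iterative implementation of the map $F$. This computational flow uses an $(L - 1)$-fold recursion of the map $F$ that involves 3L additions and 0 multiplications [26]. Therefore, the fast algorithm for the first-order moment by Equation (12) can be described in Algorithm 1 as a subroutine Moment [29]. Its computational structure is also shown in Figure 2, which is an iterative structure of a 1-network with six adders and three latches. The total number of additions needed to compute the N-point first-order moments $\sum_k a_k(n)\,k$ ($n = 0, 1, \ldots, N - 1$) is 3NL.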
Define the array a with two elements
Initial a ← ( a_L(n), a_L(n) )
for each k ∈ [2, L] do // Equation (
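The iteration of the map F in Equation (12) can be sketched in Python as follows. This is a hedged illustration rather than the published subroutine Moment: the names `apply_F` and `moment`, the realization of F as `(u, v) -> (u, u + v)`, and the descending loop order are assumptions chosen to match the nested form of Equation (12).

```python
def apply_F(vec):
    """Assumed 1-network: maps (u, v) to (u, u + v) with a single addition."""
    u, v = vec
    return (u, u + v)


def moment(a):
    """Given a = [a_0, a_1, ..., a_L] with L >= 1, return
    (sum of a_k for k >= 1, sum of k * a_k for k >= 1) using additions only."""
    L = len(a) - 1
    acc = (a[L], a[L])                 # initial vector (a_L, a_L)
    for k in range(L - 1, 0, -1):      # L - 1 applications of F
        u, v = apply_F(acc)            # 1 addition
        acc = (u + a[k], v + a[k])     # 2 additions
    return acc


# Example: a_0 = 1, a_1 = 1, a_2 = 7 gives (1 + 7, 1*1 + 2*7).
print(moment([1, 1, 7]))  # (8, 15)
```

Each loop iteration costs three additions and no multiplication, which matches the order of the 3L additions quoted for this algorithm. The zero-order moment of Equation (4a) additionally needs `a[0]` added to the first component.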

The Systolic Array for First-Order Moment
Equation (12) can be implemented by a systolic array that continuously generates $\sum_k a_k(n)$ and $\sum_k a_k(n)\,k$ in parallel [27]. This systolic array, shown in Figure 3, is a serial arrangement of (L − 1) 1-networks extended from Figure 2. It uses 3L − 2 adders, L + 2 latches, and 0 multipliers. In each clock cycle, a sequence $\{a_k(n)\}$ is input into the systolic array and a pair $\left(\sum_k a_k(n),\ \sum_k a_k(n)\,k\right)$ is output.

In particular, to keep the operations of this parallel structure synchronized, the (L − 1) values $a_k(n)$ ($k = 2, \ldots, L$) should be input into the (L − 1) 1-networks in a staggered manner rather than simultaneously. Generally, a single $a_k(n)$ ($k > 0$) is input into the (L − k)-th 1-network with a latency of $n + 2(L - 1 - k)$ clock cycles. Hence, in Figure 3, an extra latch array generates the latency for $a_k(n)$ before it is input into the corresponding 1-network. The number of latches and the latency time are shown in the note "[ ]", so that different $a_k(n)$ enter the different 1-networks at regular intervals. As a result, the total execution time of this systolic array to compute the N-point $\sum_k a_k(n)\,k$ ($n = 0, 1, \ldots, N - 1$) is governed by N together with the pipeline latency of the (L − 1) cascaded 1-networks.

The Improvement of the Fast Algorithm and Systolic Array for First-Order Moment
The algorithm in Section 3.1 requires many additions, which become computationally expensive as N grows. To reduce the number of additions, the algorithm is improved by means of an even-odd relationship that divides the first-order moment of the sequence $\{a_k(n)\}$ into two smaller moments. For even L, this even-odd relationship is

$$\sum_{k=1}^{L} a_k(n)\,k = \sum_{k=1}^{L/2} a_{2k-1}(n)\,(2k-1) + \sum_{k=1}^{L/2} a_{2k}(n)\,2k = 2\sum_{k=1}^{L/2} [a_{2k-1}(n) + a_{2k}(n)]\,k - \sum_{k=1}^{L/2} a_{2k-1}(n). \qquad (13)$$

According to Equation (13), the fast algorithm described by Figure 2 can be improved to the new structure shown in Figure 4. This improved algorithm first uses L/2 additions to obtain the sequence $\{a_{2k-1}(n) + a_{2k}(n)\}$, as well as L/2 − 1 additions to accumulate $\sum_k a_{2k-1}(n)$. Then each $a_{2k-1}(n) + a_{2k}(n)$ is input into the map F successively to perform L/2 − 1 iterations. Finally, a left-shift operation and 1 subtraction are applied to generate $\sum_k a_k(n)\,k$. The improved algorithm requires 5L/2 − 1 additions, fewer than the algorithm of Figure 2, although the saving of about L/2 additions comes at the cost of a more complex structure. The sequence $\{a_{2k-1}(n) + a_{2k}(n)\}$ could be divided again by the even-odd relationship to reduce the additions further, but the structure of the fast algorithm would then become too complex to be worthwhile.
Similarly, the systolic array in Figure 3 can be improved to the structure shown in Figure 5. This improved systolic array is a serial arrangement of L/2 − 1 1-networks extended from Figure 4. It requires 5L/2 − 3 adders and L/2 + 3 latches, fewer than the array in Figure 3, even though its structure is more complex. As a result, the total execution time of this systolic array to compute the N-point $\sum_k a_k(n)\,k$ ($n = 0, 1, \ldots, N - 1$) is decreased, because only L/2 − 1 1-networks are cascaded.
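The even-odd improvement can be illustrated with a short Python sketch. It is not the authors' code: the function name `moment_even_odd` is hypothetical, L is assumed to be even, the inputs are assumed to be integers (so the doubling can be a left shift), and the map F is again assumed to act as `(u, v) -> (u, u + v)`.

```python
def moment_even_odd(a):
    """Given integers a = [a_0, ..., a_L] with even L >= 2, return
    (sum of a_k for k >= 1, sum of k * a_k for k >= 1) via the even-odd split."""
    L = len(a) - 1
    assert L >= 2 and L % 2 == 0, "this sketch assumes an even L"
    half = L // 2

    # L/2 additions: pair sums p_m = a_{2m-1} + a_{2m}, m = 1..L/2.
    pairs = [a[2 * m - 1] + a[2 * m] for m in range(1, half + 1)]

    # L/2 - 1 additions: accumulate the odd-indexed terms.
    odd_sum = a[1]
    for m in range(2, half + 1):
        odd_sum += a[2 * m - 1]

    # L/2 - 1 iterations of the map F over the pair sums.
    u = v = pairs[-1]
    for m in range(half - 1, 0, -1):
        v = u + v                      # F: (u, v) -> (u, u + v)
        u, v = u + pairs[m - 1], v + pairs[m - 1]

    # Left shift (multiply by 2) and one subtraction.
    first_order = (v << 1) - odd_sum
    return u, first_order


print(moment_even_odd([0, 1, 7, 2, 5]))  # (15, 41)
```

Only half as many iterations of F are needed as in the basic algorithm, at the price of the extra pair sums and the odd-term accumulation.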

The Fast Algorithm for Normalized Cross-Correlation
We apply the improved fast algorithm in Section 3.3 to compute the first-order and the zero-order moments in Equation (9). The resulting fast algorithm for NCC removes most of the multiplications. We first introduce, in Section 4.1, some optimization methods that further reduce its additions.

The Optimization Methods
As the sequence $\{g(i)\}$ is in general a fixed correlation kernel, both $\bar{g} = \sum_i g(i)/N$ and $\sum_i [g(i) - \bar{g}]^2$ in Equation (9) can be pre-computed and reused to avoid repeated computation [30].
Although $b(n)$ in Equation (8) involves many additions and squares, it can also be computed by a simple function of the previous $b(n - 1)$:

$$b(n) = b(n-1) + [f(n+N-1) + f(n-1)]\,[f(n+N-1) - f(n-1)]. \qquad (14)$$

We only need to compute the first value $b(0)$ directly, using N multiplications and N − 1 additions, where each square is performed by a multiplication. Each following $b(n)$ ($n = 1, 2, \ldots, N - 1$) is then obtained from Equation (14) with only 1 multiplication, 2 additions and 1 subtraction.

The Step of the Fast Algorithm for NCC
The proposed fast algorithm for NCC includes five steps:
Step 1 Initializing all $a_k(n) = 0$ ($k = 0, 1, \ldots, L$), where $a_0(n)$ is indispensable for $\sum_k a_k(n)$.
Step 2 Implementing Equation (3) to acquire the sequence $\{a_k(n)\}$ using N additions.
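The steps can be put together in a compact Python sketch. Only Steps 1 and 2 are taken from the text above; the remaining stages (computing the moments with the map F, updating $b(n)$ recursively, and evaluating Equation (9)) are assumptions about how the pipeline is completed. The name `fast_ncc`, the non-cyclic indexing, and the use of the basic rather than the even-odd moment algorithm are likewise illustrative choices.

```python
import math


def fast_ncc(f, g):
    """Sketch of a moment-based NCC for every shift n with a full window.
    Assumes an integer kernel with max(g) >= 1 and windows of non-zero variance."""
    N, L = len(g), max(g)
    g_mean = sum(g) / N                          # precomputed once (fixed kernel)
    g_var = sum((y - g_mean) ** 2 for y in g)    # precomputed once (fixed kernel)
    b = sum(x * x for x in f[:N])                # b(0), computed directly
    out = []
    for n in range(len(f) - N + 1):
        if n > 0:                                # assumed recursion for b(n)
            new, old = f[n + N - 1], f[n - 1]
            b += (new + old) * (new - old)       # 1 multiplication
        a = [0] * (L + 1)                        # Step 1: clear all a_k(n)
        for i, gi in enumerate(g):               # Step 2: N additions
            a[gi] += f[n + i]
        u = v = a[L]                             # moments via L - 1 iterations of F
        for k in range(L - 1, 0, -1):
            v = u + v
            u, v = u + a[k], v + a[k]
        m0, m1 = u + a[0], v                     # zero-order and first-order moments
        out.append((m1 - g_mean * m0) / math.sqrt((b - m0 * m0 / N) * g_var))
    return out
```

In this sketch the only multiplications per output are the one in the $b(n)$ update and the few in the final normalization, which is the behaviour the algorithm is designed to achieve.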

The Systolic Array for Normalized Cross-Correlation
We apply the improved systolic array in Figure 5 to design a parallel hardware structure for fast NCC. Figure 6 shows this systolic structure, which mainly includes three parts: the module A to compute $\{a_{2k-1}(n) + a_{2k}(n)\}$, the module M to compute the first-order and zero-order moments of $\{a_k(n)\}$, and the module S to compute $b(n)$. In each cycle, the N-point $f(n + i)$ is input into this systolic array simultaneously and an NCC result $\rho(n)$ is output. Since the direct computation of $\{a_{2k-1}(n) + a_{2k}(n)\}$ needs many adders, a simplified structure for the module A is discussed first, in Section 5.1.

The Module A
The module A acquires the L/2-point sequence $\{a_{2k-1}(n) + a_{2k}(n)\}$ according to Equations (3) and (13) in every clock cycle. It includes L + 1 sub-modules $A_k$ ($k = 0, 1, 2, \ldots, L$) that first collect $\{f(n + i)\}$ to generate the corresponding $\{a_k(n)\}$, and then sum each two adjacent $a_k(n)$ to obtain $a_{2k-1}(n) + a_{2k}(n)$. We assume the execution time of the module A is $T_A$ clock cycles. The N-point $f(n + i)$ should be input into the sub-modules $\{A_k\}$ gradually.
Since the correlation kernel $\{g(i)\}$ is fixed, the computational strategy for Equations (3) and (13) is known in advance, so the structure of $A_k$ can be simplified to use fewer adders and less data transfer.
Therefore, the structure of the module A should not be fixed, but should change with the sequence $\{g(i)\}$ to reduce its hardware complexity. We also show the module A using the maximum number of adders when $\{g(i)\} = \{4, 4, 4, 4\}$ in Figure 9a, and the module A using 0 adders when $\{g(i)\} = \{2, 4, 6, 8\}$ in Figure 9b. From Figures 7-9, the adder number of the module A ranges from 0 to N − 1, and the latency $T_A$ ranges from 0 to $\log_2 N$.
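How the adder count of the module A depends on the kernel can be estimated with a few lines of Python. This is a rough sketch under stated assumptions, not a description of the actual circuit: it assumes that summing s inputs takes s − 1 two-input adders, that all samples whose kernel value falls in the same pair {2m − 1, 2m} share one adder chain, and it ignores $a_0(n)$ and the accumulation of the odd-indexed terms. The function name `module_a_adders` is hypothetical.

```python
from collections import Counter


def module_a_adders(g):
    """Estimate the adders needed to form every pair sum a_{2m-1}(n) + a_{2m}(n)
    for a fixed kernel g (assumption: s inputs need s - 1 two-input adders)."""
    # Pair index m for a kernel value in {2m - 1, 2m}; zeros contribute to no pair.
    groups = Counter((gi + 1) // 2 for gi in g if gi > 0)
    return sum(size - 1 for size in groups.values())


print(module_a_adders([4, 4, 4, 4]))  # 3
print(module_a_adders([2, 4, 6, 8]))  # 0
```

The two printed values reproduce the extremes quoted for a 4-point kernel: N − 1 = 3 adders when every sample lands in the same pair, and 0 adders when every pair receives a single sample.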

The Module P
The module P implements Equation (9) with 4 multipliers, 1 divider and 1 square root extractor. It receives $\sum_k a_k(n)$, $\sum_k a_k(n)\,k$ and $b(n)$, and outputs the corresponding $\rho(n)$ in each cycle. Fast methods can be applied for the square root operation. In addition, the fixed $\bar{g}$ and $\sum_i [g(i) - \bar{g}]^2$ are stored in advance to avoid repeated computation.

The Systolic Array
The systolic array in Figure 6 uses its modules to perform Equations (3), (9), (13) and (14) for NCC. Latches are needed to connect these modules and to keep their parallel operation synchronized. The number of latches is shown in the note "[ ]". The module M, taken from Figure 5, computes the first-order and zero-order moments based on Equation (13). The module S implements Equation (14) and generates $b(n)$ with 1 multiplier, 1 accumulator and 1 subtractor. Finally, the module P generates the NCC $\rho(n)$. The total adder number of the systolic array ranges from 2L − 2 to 2L + N − 3, and its multiplier number is 5.
The initial value of the accumulator in the module S is set to $b(0)$. In the n-th clock cycle, $f(n + N - 1)$ and $f(n - 1)$ are input into the module S to obtain $b(n)$ within three clock cycles. Then $b(n)$ is output from the module S to the module P with a latency of $T_A + L - 1$, so that $b(n)$, $\sum_k a_k(n)$ and $\sum_k a_k(n)\,k$ arrive at the module P at the same time.
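The behaviour of the module S can be mimicked in software with a sliding update of the window energy. The sketch below is illustrative only: the name `sliding_energy` is hypothetical, and the update is written in the factored form (new + old)(new − old), an assumption that is consistent with the stated cost of one multiplication per output.

```python
def sliding_energy(f, N):
    """Return [b(0), b(1), ...], where b(n) is the sum of squares of f[n:n + N]."""
    b = sum(x * x for x in f[:N])          # b(0): N multiplications, N - 1 additions
    out = [b]
    for n in range(1, len(f) - N + 1):
        new, old = f[n + N - 1], f[n - 1]  # the two samples fed to the module S
        b += (new + old) * (new - old)     # 1 multiplication, 2 additions, 1 subtraction
        out.append(b)
    return out


print(sliding_energy([3, 1, 4, 1, 5], 4))  # [27, 43]
```

Each step adds the square of the sample entering the window and removes the square of the sample leaving it, without recomputing the whole sum.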

Comparisons
The proposed algorithm and systolic structure are compared with some existing methods to verify their effectiveness. The compared methods also focus on reducing the number of multiplications.

Algorithm Comparison
Because correlation and convolution can share fast algorithms, we compare the proposed algorithm in Section 4 with some convolution algorithms, as well as a fast NCC algorithm, for computing an N-point cyclic NCC. The computational complexity of these algorithms is displayed in Table 1, where a complex multiplication is counted as three real multiplications and three real additions, an "AND" operation is counted as an addition [31], and a subtraction is also counted as an addition.
From Table 1, the multiplication and addition complexity of the FFT-based algorithm are both $O(N \log_2 N)$, the DA-based algorithm has the lowest addition complexity, and the fast NCC algorithm uses no multiplication. The proposed algorithm uses $O(N^2)$ additions, more than the FFT-based and the DA-based algorithms, and $O(N)$ multiplications, more than the fast NCC algorithm. However, the FFT-based algorithm needs floating-point additions and multiplications, which are more complex than integer operations; the DA-based algorithm requires tedious address decoding and very large memories; and the fast NCC algorithm has the highest addition complexity and is not suitable for high-precision matching [15]. Figure 10 shows how the multiplication and addition numbers of the four algorithms increase with N. The multiplication number of the proposed algorithm is lower than those of both the FFT-based and the DA-based algorithms, and its addition number is lower than that of the fast NCC algorithm when N > 320.

Table 1. Computational complexity of the algorithms for an N-point cyclic NCC.

| Algorithm | Multiplication | Addition |
|---|---|---|
| Direct calculation | $2N(N + 1)$ | $3N(N + 1)$ |
| FFT-based algorithm [8,9] | $(3/2)N\log_2 N - (3/2)N + 16$ | $(7/2)N\log_2 N - N/2 + 15$ |
| DA-based algorithm [22] | $7N - 1$ | $(5N - 2)\log_2 L + 8N - 1$ |
| Fast NCC algorithm [15] | 0 | |

Wireless sensing and communication is an important application field for the proposed algorithm. Therefore, we compare the execution time of the five algorithms from Table 1 on a mobile phone of type "HUAWEI nova 2s (HWI-AL00)" running the operating system "Android 9". Figure 11 shows the execution time of these algorithms for computing a cyclic NCC on the phone, with N from 100 to 6000. The growth curve of the FFT-based algorithm's time resembles a step curve, because the FFT length needs to be extended from N to $2^{\lceil \log_2 N \rceil}$. Although the DA-based algorithm uses the least time, it needs too much memory to be worthwhile. From Figure 11, the execution time of the proposed algorithm is less than that of the FFT-based algorithm when N < 5500, and is very close to that of the fast NCC algorithm, without introducing noise.
In addition, the proposed algorithm has five advantages:
(1) Fewer multiplications and less memory.
(2) A simple computational structure, owing to its simple implementation.
(3) Precision and suitability for the discrete domain, as it uses integer operations [32].
(4) No limitation on the length of the NCC.
(5) Implementation by a simple systolic structure.

Structure Comparison
We compare the proposed systolic array in Section 5 with some existing hardware structures. Table 2 shows the hardware complexity of these structures to implement an N-point cyclic NCC, where N = PM (P and M are two positive integers derived from [33]). Because the proposed array's adder number and latency are not fixed, but varied with the sequence { g(i) }, we only display their value range according to Section 5.1. The execution time of the model P is assumed as three clock cycles.
| | Proposed | Structure [22] | Structure [33] |
| … | … | … | $(P + 1)\log_2 L$ |
| Latency | $L + 5$ to $\log_2 N + L + 5$ | $2 \log_2 L$ | $\log_2 L + P$ |
| Throughput | 1 | $N/2 \log_2 L$ | 1 |
From Table 2, an advantage of the proposed systolic structure is that it does not need ROMs, while the other two structures use $O(2^N)$ ROMs, which are hardware-expensive when N > 16. The structure in [22] has the minimum latency, but its throughput is more than 1. The structure in [33] needs O(P) adders and O(P) latency, both of which increase rapidly with N.
The proposed structure's hardware complexity depends on L. Furthermore, for long NCCs or two-dimensional NCCs, where N and P are larger than L, the adder number of the proposed structure is lower than that of the structure in [22], and its latency is lower than that of the structure in [33]. Figure 12 shows how the adder number and latency of the three structures increase with N, where the proposed structure is evaluated at its maximum adder number and latency. The proposed structure has the fewest adders when N > 1800, and its latency is lower than that of the structure in [33] when N > 1500. Therefore, although an additional O(L) latches are required for data storage and transfer, the proposed systolic array could be more efficient in the digital signal and image domain, where the maximum value of L is generally less than 256 [34].

Conclusions
It is suggested that digital NCCs be implemented by efficient algorithms and hardware structures to reduce their high multiplication complexity [35]. With the aid of fast computation of the first-order moment, this paper presents an algorithm and a systolic array for fast NCC that aim to reduce multiplications as much as possible. The key is to transform the complex inner-product in the NCC into a simple first-order moment according to the statistical properties of the digital inner-product, which yields a new NCC formula based on a first-order moment that eliminates inner-product operations. By introducing a multiplication-free algorithm into the computation of the first-order moment in the NCC, we propose a fast NCC algorithm with the advantages of simple implementation, fewer multiplications, no length limitation, and so on. Because the introduced first-order moment algorithm requires many additions, we also improve it by means of an even-odd relationship to reduce its addition complexity and execution time. The introduced algorithm for the first-order moment can also be implemented by a systolic structure, so a systolic array composed of latches and adders is designed to implement fast NCC in parallel. This systolic array is hardware-efficient owing to its parallel operation, simple structure, and few multipliers. This paper analyzes the computational and hardware complexity of the proposed algorithm and systolic array, and compares them with some existing methods to demonstrate their efficiency. The proposed algorithm and array could also be applied to digital filters and various transforms [36].
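The idea of replacing an inner-product by a first-order moment can be illustrated with a minimal sketch. It assumes that one sequence takes integer values in 0, ..., L - 1 (the parameter `levels` below plays the role of L), and it uses a standard running-sum identity; it is not necessarily the exact fast algorithm or the even-odd improvement used in this paper. The function name `inner_product_via_moment` is hypothetical.

```python
from typing import Sequence


def inner_product_via_moment(x: Sequence[int], y: Sequence[int], levels: int) -> int:
    """Compute sum(x[i] * y[i]) using additions only.

    Assumption: every x[i] is an integer in range(levels).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Step 1: group the samples of y by the value of x.
    # a[k] is the sum of all y[i] with x[i] == k, so that
    # sum(x[i] * y[i]) == sum(k * a[k]), a first-order moment of a.
    a = [0] * levels
    for xi, yi in zip(x, y):
        if not 0 <= xi < levels:
            raise ValueError("x values must lie in range(levels)")
        a[xi] += yi

    # Step 2: evaluate the first-order moment without multiplication, using
    # sum_{k=1}^{L-1} k * a[k] == sum_{j=1}^{L-1} (a[j] + ... + a[L-1]).
    suffix = 0  # running sum a[k] + ... + a[L-1]
    moment = 0  # accumulated sum of the running sums
    for k in range(levels - 1, 0, -1):
        suffix += a[k]
        moment += suffix
    return moment


if __name__ == "__main__":
    x = [2, 0, 3, 1]
    y = [5, -1, 2, 4]
    print(inner_product_via_moment(x, y, levels=4))
    print(sum(xi * yi for xi, yi in zip(x, y)))
```

In this sketch the multiplications are replaced by N additions for the grouping step and 2(L - 1) additions for the moment, which shows why the addition count becomes the remaining cost.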
There are still many additions in the proposed algorithm and systolic structure. Future studies will focus on further reducing their additions.
