A Real-Time FPGA-Based Metaheuristic Processor to Efficiently Simulate a New Variant of the PSO Algorithm

Nowadays, high-performance audio communication devices demand superior audio quality. To improve audio quality, several authors have developed acoustic echo cancellers based on the particle swarm optimization (PSO) algorithm. However, their performance is significantly reduced because the PSO algorithm suffers from premature convergence. To overcome this issue, we propose a new variant of the PSO algorithm based on the Markovian switching technique. Furthermore, the proposed algorithm includes a mechanism to dynamically adjust the population size over the filtering process, which significantly reduces its computational cost while maintaining high performance. To implement the proposed algorithm in a Stratix IV GX EP4SGX530 FPGA, we present, for the first time, the development of a parallel metaheuristic processor in which each processing core simulates a variable number of particles by using the time-multiplexing technique. In this way, the variation of the population size becomes effective in hardware. The properties of the proposed algorithm, together with the proposed parallel hardware architecture, potentially allow the development of high-performance acoustic echo canceller (AEC) systems.


Introduction
Over the last ten years, tremendous efforts have been made to develop high-performance acoustic echo cancellers, which can offer the high-quality and realistic sound needed in the newest acoustic communications. In particular, some authors have used the PSO algorithm as an adaptive echo-canceling algorithm, since it can be easily implemented and exhibits a fast convergence rate [1]. For example, Mahbub et al. [1,2] presented AEC systems based on the PSO that compute the error minimization in the frequency domain and time domain, respectively. Pichardo et al. [3] introduced a convex combination to improve the performance of the digital filter at the cost of increasing the overall computational cost. Recently, Kimoto et al. [4] introduced a multichannel adaptive echo-canceling algorithm based on PSO; specifically, their technique considers a pre-processing of the input signals. Despite these advanced approaches, tasks still remain to significantly improve the performance of these systems in terms of echo return loss enhancement (ERLE) and convergence rate. Unfortunately, the PSO suffers from premature convergence, especially in the case of multi-modal optimization problems. This reduces its performance, since it loses the ability to find the optimal solution. To deal with this, several authors have proposed modifications to the conventional PSO [5][6][7][8][9][10]; however, few of these solutions have been applied to adaptive acoustic echo cancellers. On the other hand, the implementation of the PSO algorithm in FPGA devices to simulate AEC systems faces great challenges in building optimal hardware architectures. To date, several hardware architectures have been developed to implement the PSO algorithm for adaptive filtering [11][12][13]. However, none of these architectures has been proposed to simulate AEC systems. Here, we present, for the first time, the implementation of the PSO algorithm in an FPGA for AEC systems.
Specifically, we include the Markovian switching technique [14] in the conventional PSO algorithm to create a high-performance AEC system, since it promotes quick convergence to the global optimum while simultaneously preserving the swarm's global search capability by taking advantage of the current search information.
From the engineering point of view, the implementation of AEC systems based on the PSO algorithm requires a large number of particles and, as a consequence, a large amount of hardware area. To overcome this, we propose a criterion to dynamically decrease the number of particles over the filtering process. In addition, we use the block-processing scheme to easily implement the proposed algorithm in a parallel hardware architecture.

Proposed Markov Switching PSO Algorithm
In this section, we present a new variant of the PSO algorithm to improve its search performance. Specifically, we dynamically adjust the velocity of each particle according to an evolutionary factor. In this manner, the premature convergence of the PSO can be prevented, which is especially useful when dealing with multi-modal and high-dimensional problems. Figure 1 shows the structure of the proposed variant of the PSO algorithm applied to adaptive filtering, and Figure 2 shows the steps required to perform it:
• Specification of the control parameters. The proposed Markov switching PSO algorithm has a population matrix W with P adaptive filters, where each particle denotes an adaptive filter, as shown in Equation (1). The order N of each adaptive filter determines the dimension of each particle, so the whole population is defined as W(n) = [w_1(n), w_2(n), . . . , w_P(n)].
• Creation of the initial population. At the first iteration, n = 1, the position w_i(n) of each particle is initialized, where i = 1, 2, . . . , P, r denotes a Gaussian process of length N, lb is a lower bound and ub is an upper bound.
• Calculation of the signal filtering. The signal counteraction, also called the residual noise, of the i-th filter is computed as e_i(n) = d(n) − y_i(n), where d(n) denotes the desired signal and y_i(n) the output of the i-th filter.
• Evaluation of the fitness function. To compute the best position, the PSO algorithm uses the mean squared error (MSE) of each of the P error signals as the fitness function f_i[w_i(n)] of the corresponding adaptive filter.
• Calculation of the distance between particles and obtention of the value of the Markov chain. The velocity and position are updated as

v_i(n + 1) = φ v_i(n) + c_1(ξ(n)) r_1 ⊙ (w_pbest_i − w_i(n)) + c_2(ξ(n)) r_2 ⊙ (w_gbest − w_i(n)),   (5)
w_i(n + 1) = w_i(n) + v_i(n + 1),   (6)

where r_1 and r_2 are vectors of random numbers of length N defined in the interval [0, 1], w_pbest_i and w_gbest are the personal best position and the global best position, respectively, and φ is the inertia weight. c_1(ξ(n)) and c_2(ξ(n)) are the acceleration coefficients determined by a non-homogeneous Markov chain ξ(n) (n ≥ 0). The Markov chain takes values in a finite state space S = {1, 2, . . . , L} and has probability transition matrix Π(n) = (π_ij(n))_{L×L}, where π_ij(n) ≥ 0 (i, j ∈ S) and ∑_{j=1}^{L} π_ij(n) = 1. It is important to keep in mind that the matrix Π is dynamically adjusted by evaluating an evolutionary factor (E_f) [14] according to the population distribution properties [15]. Based on these characteristics, the E_f approach can be fully exploited to define four states: convergence, exploration, exploitation and jumping out, represented by ξ(n) = 1, ξ(n) = 2, ξ(n) = 3 and ξ(n) = 4 in the Markov chain, respectively. The average distance, d_i, between particle i and the other particles is computed as

d_i = (1/(P − 1)) ∑_{j=1, j≠i}^{P} sqrt( ∑_{k=1}^{N} (w_{i,k} − w_{j,k})² ),

where P and N denote the swarm size and the dimension of each particle, respectively. Hence, the evolutionary factor E_f can be obtained as [15]

E_f = (d_g − d_min) / (d_max − d_min),

where d_g represents the average distance of the globally best particle among the d_i, and d_max and d_min are the maximum and minimum distances among the d_i, respectively. The value of the Markov chain is then obtained from the value of the evolutionary factor E_f [14]. Based on the probability transition matrix Π, the Markov process may switch its state at the next iteration. To guarantee the classification accuracy and the search diversity, the probability χ is set to 0.9 [14].
Here, the initial values of the acceleration coefficients c_1 and c_2 are selected by trial and error for all states in order to guarantee the best performance of the proposed algorithm. Table 1 shows their values for each evolutionary state, which are automatically adjusted. Table 1. Strategies for selecting c_1 and c_2.

• Update of the personal and global best positions. To obtain the personal best w_pbest_i(n), the current value f_i[w_i(n)] is compared with f[w_pbest_i(n − 1)]: if f_i[w_i(n)] < f[w_pbest_i(n − 1)], then w_pbest_i(n) = w_i(n); otherwise, w_pbest_i(n) = w_pbest_i(n − 1). At the first iteration, w_i(1) is defined as w_pbest_i(1) and used to evaluate Equation (11). The global best position, w_gbest, is obtained as the personal best position with the lowest fitness value among all particles.
• Update of the population. Equations (5) and (6) are used to update the velocity and position of each particle, respectively, and Equation (13), which is a function of the power of the instantaneous error, is used to update the population size,
where P_max and P_min are the maximum and minimum numbers of particles, respectively.
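The update steps above can be sketched in software as follows. This is a minimal Python sketch, not the authors' implementation: the state thresholds on E_f, the per-state c_1/c_2 values and the population-shrinking criterion are illustrative assumptions (the paper takes them from [14], Table 1 and Equation (13), respectively).

```python
import numpy as np

def evolutionary_factor(W, g):
    """E_f from the average inter-particle distances.
    W: (P, N) particle positions; g: index of the globally best particle."""
    P = W.shape[0]
    diff = W[:, None, :] - W[None, :, :]        # (P, P, N) pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=2))     # (P, P) Euclidean distances
    d = dist.sum(axis=1) / (P - 1)              # self-distance is 0, so this is the mean over j != i
    d_min, d_max = d.min(), d.max()
    return (d[g] - d_min) / (d_max - d_min + 1e-12)

def classify_state(ef):
    """Hypothetical mapping of E_f to the four evolutionary states:
    1 convergence, 2 exploration, 3 exploitation, 4 jumping out."""
    if ef < 0.25:
        return 1
    if ef < 0.50:
        return 3
    if ef < 0.75:
        return 2
    return 4

def pso_step(W, V, pbest, gbest, state, rng, phi=0.7):
    """One velocity/position update (Equations (5) and (6)) with
    state-dependent acceleration coefficients (placeholder values)."""
    coeffs = {1: (2.0, 2.0), 2: (2.5, 1.5), 3: (2.0, 1.0), 4: (1.5, 2.5)}
    c1, c2 = coeffs[state]
    P, N = W.shape
    r1, r2 = rng.random((P, N)), rng.random((P, N))
    V = phi * V + c1 * r1 * (pbest - W) + c2 * r2 * (gbest - W)
    return W + V, V

def shrink_population(W, V, pbest, fitness, err_power, p_min):
    """Drop the worst particles once the instantaneous error power is small,
    emulating the dynamic population-size rule; the exact criterion
    (Equation (13)) is a function of the error power."""
    P = W.shape[0]
    if err_power < 1e-3 and P > p_min:          # assumed threshold
        keep = max(P // 2, p_min)
        order = np.argsort(fitness)[:keep]      # keep the best-fitness particles
        return W[order], V[order], pbest[order]
    return W, V, pbest
```

In a full filtering loop these functions would be called once per block: evaluate the fitness of each filter, update pbest/gbest, compute E_f, switch the Markov state, apply `pso_step` and finally `shrink_population`.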

Pure Software Implementation
Before implementing the proposed Markov switching PSO algorithm in parallel hardware architectures, we simulate it in Matlab for testing and comparison purposes. Here, we use the AEC structure shown in Figure 3, in which the existing approaches and the proposed adaptive filter are evaluated. As can be observed, x(n) is the far-end input signal, e(n) denotes the residual echo signal, and d(n) represents the sum of the echo signal, y(n), and the background noise, e_0(n).
To simulate the proposed Markov switching PSO algorithm, we consider the following conditions:
• We use an impulse response obtained from the ITU-T G.168 recommendation [16] as the echo path. This echo path is modeled using N = 500 coefficients, as shown in Figure 4.
• The maximum number of iterations is set to 4,000,000.
• We verify the performance of the proposed algorithm in terms of the echo return loss enhancement, ERLE = 10 log_10(d(n)² / e(n)²).
Here, we simulate the conventional PSO [17] and the proposed Markov switching PSO algorithm to compare their performance. As can be observed from Figure 5, the proposed algorithm shows better performance than the conventional PSO in terms of ERLE and convergence speed. To make a coherent comparison between the proposed algorithm and existing algorithms, we carried out an experiment in which the acoustic path is multiplied by −1 midway through the iterations.
The algorithms simulated for this comparison are grey wolf optimization (GWO), PSO, ABC, PSO-LMS and MABC [18][19][20][21][22][23]. As can be observed from Figure 6, the proposed algorithm shows the best performance in terms of convergence speed and ERLE level. In addition, the proposed algorithm requires a lower computational cost compared with the GWO, PSO, ABC, PSO-LMS and MABC algorithms, as shown in Table 2. To obtain these data, we consider a double-talk scenario, in which the proposed Markov switching PSO algorithm requires fewer multiplications and additions than most of the existing algorithms during the whole simulation. The reason for this is that the number of particles, P, used to model the proposed Markov switching PSO algorithm is reduced during the filtering process: we initially use 100 particles and, after some iterations, this number is reduced to 20 particles. In contrast, the existing approaches need a larger population to obtain acceptable performance, at the cost of increasing the computational burden. In AEC applications, this aspect is crucial, because one of the objectives of the proposed Markov switching PSO algorithm is to reduce the computational cost without losing performance, and this is possible by using the Markovian switching technique. Figure 6. ERLE of the proposed Markov switching PSO algorithm and existing approaches [18][19][20][21][22][23], computed with the AR(1) process input signal when the acoustic path is multiplied by −1 in the middle of the adaptive filtering process.
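The ERLE metric used in all of these comparisons can be computed as a simple power ratio in decibels. A minimal sketch (the `eps` guard against division by zero is our addition; in practice the ratio is usually evaluated over short windows of d(n) and e(n)):

```python
import numpy as np

def erle_db(d, e, eps=1e-12):
    """Echo return loss enhancement in dB:
    ERLE = 10 * log10( power of desired signal d(n) / power of residual e(n) )."""
    return 10.0 * np.log10((np.sum(d ** 2) + eps) / (np.sum(e ** 2) + eps))
```

For example, a residual echo with one-hundredth of the desired signal's power corresponds to an ERLE of 20 dB.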

Hardware Implementation
Once the performance of the proposed Markov-PSO adaptive filter was verified, we developed a parallel metaheuristic processor to simulate it at high processing speed while expending minimum area. To achieve low area consumption, first, we use the time-multiplexing technique to simulate particles virtually; second, we use optimized neural multipliers and adders, since multiplication is the most demanding arithmetic operation in terms of area and processing speed. In addition, the use of these circuits has allowed us to optimize the processing time, since each multiplication is performed in a single clock cycle [24]. As can be observed from Figure 7, the proposed parallel metaheuristic processor is mainly composed of the following components:
• Markov-PSO processing core, M-PSO PC. This represents the basic processing element that computes the signal-filtering process and updates the population. The proposed M-PSO PC mainly uses neural multipliers Π_mul [24] and adders Π_add [24]. Additionally, this circuit has a slave control unit, CU_s1, pseudo-random number generators, RNG, and a Markov processor core, MP (Figure 8). In particular, the MP core is in charge of computing the distance between particles by means of an optimized square-root circuit [25], as shown in Figure 9.
It should be noted that we presented the neural adder circuit, Π_add, and the neural multiplier, Π_mul, which perform addition and multiplication of fixed-point numbers, respectively, in [24]. To implement the AEC system in an FPGA device, fixed-point representation is highly desirable, since the simulation of metaheuristic algorithms requires high-precision calculations. Therefore, we created advanced neurons, based on spiking neural P (SN P) systems, by improving their structural and functional capabilities. In particular, we used cutting-edge variants of SN P systems, such as anti-spikes, dendritic trunks, dendritic delays and rules on the synapses.
As a result, we created high-precision neural adders and multipliers that employ a low number of synapses and neurons with simple spiking rules. In general, both circuits exhibit the following features:
• Scalability. These circuits can process numbers of any required length by only adding neurons in a regular and homogeneous neural structure.
• Compactness. To obtain a great improvement in terms of area, we designed the circuits using a low number of neurons and synapses. Specifically, we optimized the number of synapses, since the routing of a large number of synaptic connections creates place-and-route problems, especially when implemented in advanced FPGAs.
• High performance. In this application, real-time filtering is required. Therefore, the neural multiplier and adder perform their respective operations in a single clock cycle and ten clock cycles, respectively.
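The time-multiplexing idea underlying the processor can be illustrated with a short sketch: C physical processing cores simulate P virtual particles by processing P/C particles serially per core, so a shrinking population directly shortens each core's serial workload. The round-robin assignment below is our own illustrative scheduling, not the exact FPGA control scheme:

```python
def schedule(num_particles, num_cores):
    """Assign virtual particles to physical cores round-robin; each core
    then processes its assigned particles serially within one block period."""
    slots = [[] for _ in range(num_cores)]
    for p in range(num_particles):
        slots[p % num_cores].append(p)
    return slots
```

With 20 M-PSO PCs and 100 particles, each core handles five particles per block; once the population shrinks to 20 particles, each core handles only one, reducing the processing time proportionally.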
Once the proposed parallel metaheuristic processor was debugged, we integrated it into the structure of the AEC system to validate its performance, as shown in Figure 10. To demonstrate the computational capabilities of the metaheuristic processor, we developed two sets of experiments. In particular, we simulated single-talk and double-talk scenarios by employing two different input signals, x(n). In addition, under these scenarios, the proposed metaheuristic processor computes the AR(1) process and the speech signal by considering an under-modeling case: we employ 512 coefficients to model the adaptive filter, while the echo path [26] is configured using 1024 coefficients. In real-world echo noise applications, the performance of the AEC can be crucially decreased by variations of the background noise. For the single-talk scenario, we decreased the SNR from 20 to 10 dB in the middle of the iterations, as shown in Figure 11. It should be noted that this background noise variation does not affect the performance of the proposed algorithm. On the other hand, the proposed metaheuristic processor was implemented in a Stratix IV GX EP4SGX530 FPGA. This implementation, which involves the use of eight BRAMs and 20 M-PSO PCs, requires 384,748 LEs, representing 72.429% of the total area of the FPGA. In this way, we can simulate 100 particles virtually, since each M-PSO PC simulates five particles serially. The processing time to simulate all of these particles is 89.1 µs, obtained by multiplying the number of clock cycles (11,143), computed by means of Equation (14), by the system clock period (8 ns). It should be noted that the processing time in the FPGA device is lower than when this algorithm is simulated on a server with a Xeon E5-2630 processor working at 2.6 GHz and 64 GB of RAM, since the simulation of 100 particles on this computer requires 1.47 ms.
This can be considered the worst case, since the number of particles decreases over the filtering process and, as a consequence, the processing time also decreases. This factor is vital in the simulation of real-time AEC systems, since the maximum latency of the system is 125 µs, i.e., the input signal is sampled at 8 kHz. On the other hand, the simulation of the proposed Markov-PSO adaptive filter consumes up to 328 mW in the worst case (one hundred particles). These experiments prove that the metaheuristic processor is capable of processing a variable number of particles to perform the proposed Markov switching PSO algorithm at high processing speeds.
In Equation (14), y represents the number of coefficients and x the number of particles.
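The reported timing can be checked against the real-time budget with simple arithmetic. The sketch below takes the worst-case cycle count as a given constant, since Equation (14) (the cycle count as a function of the number of coefficients and particles) is not reproduced here:

```python
# Back-of-the-envelope check of the reported worst-case timing:
# 11,143 clock cycles at an 8 ns clock period must fit within the
# 125 us real-time budget imposed by the 8 kHz input sampling rate.

CLOCK_PERIOD_S = 8e-9        # 8 ns system clock period (125 MHz)
CYCLES_WORST_CASE = 11_143   # reported cycle count for 100 particles
SAMPLE_RATE_HZ = 8_000       # input signal sampling rate

processing_time = CYCLES_WORST_CASE * CLOCK_PERIOD_S  # ~89.1 us
latency_budget = 1.0 / SAMPLE_RATE_HZ                 # 125 us per sample period
headroom = latency_budget - processing_time           # ~35.9 us of slack
```

Since the population shrinks during filtering, the actual cycle count only decreases from this worst case, so the real-time constraint holds throughout.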

Conclusions
In this work, we present, for the first time, the development of a high-speed and compact FPGA-based parallel metaheuristic processor to efficiently simulate a new variant of the PSO algorithm based on the Markovian switching technique. Our contributions are grouped as follows:
• From the AEC model point of view. We made intensive efforts to reduce the computational cost of AEC systems so that they can be implemented in resource-constrained devices. In addition, we significantly improved the convergence properties of these systems by using an improved metaheuristic swarm intelligence method suitable for practical acoustic environments. Specifically, we presented a new variant of the PSO algorithm based on the Markovian switching technique, which guarantees a higher convergence rate and a higher ERLE in comparison with the conventional PSO algorithm. To make the implementation of the proposed variant feasible in embedded devices, we used the block-processing scheme; in this way, the proposed algorithm can be easily implemented in parallel hardware architectures and simulated at high processing speeds. In addition, we significantly reduced the computational cost of the proposed algorithm by dynamically decreasing its number of particles over the filtering process.
• From the digital point of view. We presented, for the first time, the development of a parallel hardware architecture that simulates a variable number of particles by using the proposed time-multiplexing control scheme. In this way, we properly implemented the proposed Markov switching PSO algorithm, in which the number of particles decreases according to the simulation needs, in a Stratix IV GX EP4SGX530 FPGA.
Finally, we carried out several experiments to prove that the proposed Markov switching PSO algorithm, along with these new techniques, potentially allows the creation of practical, real-time AEC processing tools.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.