Harris Hawks Sparse Auto-Encoder Networks for Automatic Speech Recognition System

Abstract: Automatic speech recognition (ASR) is an effective technique that can convert human speech into text format or computer actions. ASR systems are widely used in smart appliances, smart homes, and biometric systems. Signal processing and machine learning techniques are incorporated to recognize speech. However, traditional systems have low performance in noisy environments. In addition, accents and local differences negatively affect an ASR system's performance while analyzing speech signals. To overcome these issues, this paper develops a precise speech recognition system. Speech information from the jim-schwoebel voice datasets is processed by Mel-frequency cepstral coefficients (MFCCs). The MFCC algorithm extracts the valuable features that are used to recognize speech. Here, a sparse auto-encoder (SAE) neural network is used as the classification model, and a hidden Markov model (HMM) makes the final recognition decision. The network performance is optimized by applying the Harris Hawks optimization (HHO) algorithm to fine-tune the network parameters. The fine-tuned network can effectively recognize speech in a noisy environment.


Introduction
Artificial intelligence (AI) methods [1] are evolving rapidly and increasingly enable effective communication systems. AI can both analyze and recreate the human voice, and automatic speech recognition (ASR) systems [2] have been created to support communication and dialogue that resemble real human conversation. The ASR system combines the fields of linguistics, computer science, natural language processing (NLP), and computer engineering. A speaker-dependent system needs a training process to understand individual speakers and recognize their speech; here, speakers read texts and vocabularies so that the system can capture the speaker's characteristics. In contrast, most speaker-independent systems do not require this training process. Advances in machine and deep learning techniques are heavily involved in ASR, for example, to improve Persian speech classification efficiently [3]. However, ASR is affected negatively by loud and noisy environments and fuzzy phonemes [4], which create challenging issues and lead to ambiguous recognition.
In related work, the feature vector is shifted in measure and a self-organizing map (SOM) is used to select an appropriate feature vector length; Tamil numerals and words are then classified by a bidirectional recurrent neural network (BRNN) using the fixed-length feature vector from the SOM as input, known as BRNN-SOM. Ismail et al. [26] aimed to develop speech recognition systems and improve the interaction between home appliances and humans through voice commands: speech signals are processed with dynamic time warping (DTW) techniques, and an SVM recognizes the voice with up to 97% accuracy.
Hori et al. [14] used a deep convolution encoder and long short-term memory (LSTM) recurrent neural networks (RNN) to recognize end-to-end speech. This process uses the connectionist temporal classification procedure while investigating the audio signals. The convolution network uses the VGG neural network architecture, which works jointly with the encoder to investigate the speech signal. Then, the memory network stores every speech signal, which improves the system performance compared to existing methods. Finally, the introduced framework is applied to Chinese and Japanese datasets, and the system achieves a 5% to 10% error rate.
Neamah et al. [15] recommended continual learning algorithms, such as the hidden Markov model combined with deep learning algorithms, to perform automatic speech recognition. Here, a deep learning network learns the speech features derived from the Mel-frequency coefficient approach. The learning process minimizes the deviation between the original audio and the predicted audio. The trained features are further evaluated using the Markov model to improve the overall recognition accuracy in offline mode.
Khan et al. [27] selected a time-delayed neural network to reduce the problem of limited language analysis using a Hindi speech recognition system. The Hindi speech information is collected from speakers in Mumbai and processed using an i-vector adapted network. The network considers time factors when investigating speech characteristics. This process reduces training time because the delay network maintains all processed speech information. Furthermore, the effective utilization of the network parameters increases the recognition accuracy up to 89.9%, a 4% average improvement compared to existing methods.
Mao et al. [28] created a multispeaker diarization model to recognize long conversation-based speech. The method uses audio-lexical interdependency factors to train the model and improve the word diarization process. This learning process generates a separate training setup for the diarization and ASR systems. The training setup helps identify long conversational speech with minimum effort because the data augmentation and decoding algorithm recognizes the speech accurately.
Kawase et al. [18] suggested tuning speech enhancement parameters with a genetic algorithm to build an automatic speech recognition system. This study aims to improve the recognition accuracy while investigating noisy speech signals. Here, a genetic algorithm is applied to investigate the speech parameters, and the noise features are removed from the audio, which helps improve the overall ASR system.
Another stream of research on ASR focused on speech emotion recognition (SER) [29]. In the context of human-computer or human-human interaction applications, the challenge of identifying emotions in human speech signals is critical and extremely difficult [30]. Blockchain-based IoT devices and systems have also been developed [31]. For example, Khalil et al. [32] reviewed deep learning techniques for examining emotions in speech signals. Their review examines deep learning techniques, functions, and features for extracting human emotions from audio signals; this analysis helps to further improve the speech recognition process. Fahad et al. [33] created a deep-learning and hidden-Markov-model-based speech recognition system using epoch and MFCC features. First, the speech features are derived by computing the maximum likelihood regression value. Then, the derived features are processed in the testing and training phases to improve the overall prediction of speech emotions. The effectiveness of the system was measured using the Interactive Emotional Dyadic Motion Capture (IEMOCAP) emotional dataset, and the system achieves results up to ±7.13% better than existing methods.
Zhao et al. [34] created a merged convolutional neural network (CNN) with two branches, a one-dimensional (1D) CNN branch and a two-dimensional (2D) CNN branch, to learn high-level features from raw audio samples. First, a 1D CNN and a 2D CNN architecture were created and assessed; after the second dense layers were removed, the two CNN designs were fused. Transfer learning was used to speed up the training of the combined CNN. First, the 1D and 2D CNNs were trained; the learned characteristics of the 1D and 2D CNNs were then reused and transferred to the combined CNN. Finally, the merged deep CNN initialized with transferred features was fine-tuned. Two hyperparameters of the developed architectures were chosen using Bayesian optimization during training. Experiments on two benchmark datasets demonstrate that the merged deep CNN can improve emotion classification performance. In another paper, Zhao et al. [35] proposed learning local and global emotion-related characteristics from speech and log-Mel spectrograms using two CNN and LSTM models. The architectures of the two networks are identical, with four local feature learning blocks (LFLBs) and one LSTM layer each. The LFLB, which consists mostly of one convolutional layer and one maximum-pooling layer, is designed to learn local correlations and extract hierarchical correlations. The LSTM layer is used to learn long-term dependencies from locally learned characteristics. The developed models use the strengths of both networks while overcoming their drawbacks.
Finally, speech recognition methods have been extensively used for medical purposes and disease diagnostics, such as developing biosignal sensors to help people with disabilities speak [36] and analyzing fake news to manage sentiments [37]. The audio challenge recordings [38] were captured using two microphone channels, an acoustic cardioid and a smartphone, allowing the performance of different types of microphones to be evaluated. Polap et al. [39] suggested a paradigm for speech processing based on a decision support system that can be used in a variety of applications in which voice samples can be analyzed. The proposed method is based on an examination of the speech signal using an intelligent technique in which the signal is processed by the built mathematical transform in collaboration with a bioinspired heuristic algorithm and a spiking neural network to analyze voice impairments. Mohammed et al. [40] adopted a pretrained CNN for recognition of speech pathology and explored a distinctive training approach paired with multiple training methods to extend the suggested system to a wide variety of vocal-disorder-related difficulties. The suggested system was evaluated using the Saarbrücken Voice Database (SVD) for speech pathology identification, achieving an accuracy of 95.41%. Lauraitis et al. [41,42] developed a mobile application that can record and extract pitch contour features, MFCC, gammatone cepstral coefficients, Gabor (analytic Morlet) wavelets, and auditory spectrograms for speech analysis and recognition of speech impairments due to early-stage central nervous system disorders (CNSD), with up to 96.3% accuracy. The technology can be used for automated CNSD patient health status monitoring and clinical decision support systems, and as part of the Internet of Medical Things (IoMT).
In summary, speech recognition plays a vital role in different applications, and several intelligent techniques have been incorporated to improve speech recognition effectiveness. However, in loud and noisy environments, speech signals are difficult to recognize accurately. Therefore, metaheuristic optimization techniques, specifically the Harris Hawks (HH) heuristic optimization algorithm [43], are combined with traditional machine learning techniques to improve the overall recognition accuracy. HH has been successfully used before for various other applications, such as feature selection [44], big data techniques using Spark [45-49], pronunciation technology [50,51], image thresholding with chain-based optimizers [52,53], and deep learning [54,55]. However, traditional systems suffer from computational complexity in noisy environments. In addition, accents and local differences affect the performance of the ASR system, which degrades the system's reliability and flexibility while analyzing speech signals.
The detailed working process of the introduced ASR system is discussed in Section 3.

Data Set Description
This section examines the effectiveness of the proposed Harris Hawks sparse auto-encoder networks (HHSAE-ASR) framework. The jim-schwoebel voice datasets are applied in our experiments [56]. This collection consists of several voice datasets that are widely used to investigate the effectiveness of the introduced system.

Harris Hawks Sparse Auto-Encoder Networks (HHSAE)-ASR Framework
This system aims to reduce the computational complexity of investigating speech signals in loud and noisy environments. The HHSAE-ASR framework utilizes learning concepts that continuously train the system on speech patterns. Then, a metaheuristic technique, specifically the Harris Hawks (HH) algorithm, is applied to the encoder network to fine-tune the network parameters and minimize the error-rate classification problem. Here, the HH algorithm supports recognizing the sequence of speech patterns, the learning concepts, and the network parameter updating process, and improves the precision, robustness, and reliability of the ASR. The HHSAE-ASR framework is illustrated in Figure 1.
Figure 1. Outline of the HHSAE-ASR framework, which includes speech input, speech preprocessing, feature extraction, speech recognition, and speech-to-text modules.
The working process illustrated in Figure 1 consists of several stages: collection of speech signals, preprocessing, feature extraction, and recognition. The collected speech signals contain a lot of noisy and inconsistent information that directly affects the quality and precision of the ASR system. Therefore, modulations and changes at all frequencies should be inspected, and irrelevant details should be eliminated.

Speech Signal Preprocessing and Denoising
Here, the spectral subtraction approach is applied to the collected speech signal to purify the signal. The method estimates the spectra in one of the most straightforward ways, and under additive noise the spectrum is assumed not to vary over time. Every speech signal s(n) consists of a clean signal cs(n) and an additive noise signal ad(n); therefore, the original speech signal is written as Equation (1):

s(n) = cs(n) + ad(n). (1)
The clean signal cs(n) is obtained by applying the discrete Fourier transform, with its imaginary and real parts, which yields the noise-free speech output signal. The Fourier transform representation of the signal is defined in Equation (2):

s(w) = Σ_{n=0}^{N−1} s(n) e^{−jwn}. (2)
The Fourier transform of the signal s(n) is expressed through the spectrum magnitude |s(w)| and the phase spectrum ∅(w), and the phase spectrum of the noise signal is obtained using Equation (4).
The computed noise spectrum value |ad(w)| helps to identify the noisy information in the original speech signal. This noise occurs continuously in a loud and noisy environment, which completely corrupts the originality of the speech. Therefore, the noise component in s(w) should be replaced by the average noise spectrum value. This average value is computed from nonspeech activity (speech pauses), because it does not affect the speech quality or intelligibility. Therefore, the noise-free signal is computed as Equation (5):

|cs_e(w)| = |s(w)| − |ad_e(w)|, (5)

where the estimated clean signal cs_e(w) is obtained from the signal spectrum magnitude |s(w)|, the phase spectrum value, and the average noise spectrum of the noise signal |ad_e(w)|. The spectral magnitude is computed to clean the recorded speech signal.
The extracted features are used to train a Markov-model-based convolution network for resolving noisy and loud voice signals. Following the hawk's prey-finding behavior, the network parameters are fine-tuned and updated during this process. The system's robustness and availability are maintained by reducing the number of misclassification errors.
Then, Equation (6) is applied to identify the power spectrum of the speech signal s(w),

|cs_e(w)|^2 = |s(w)|^2 − |ad_e(w)|^2, (6)

to estimate the original noise-free signal. The computed spectral values cut off the noise information from the original signal s(n). Then, the inverse Fourier transform is applied to the signal magnitude |cs_e(w)| and the power spectrum |cs_e(w)|^2 to identify the noise-free speech signal.
The noise-free signal is computed from spectral subtraction with power exponent p, which controls the form of the subtraction. If p = 1, the magnitude spectrum affected by noise is subtracted from the signal; if p = 2, power spectral subtraction is applied to obtain the original noise-free signal. The noise removal of the speech signal is summarized in Figure 2.
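As an illustration of Equations (5) and (6), the following minimal NumPy sketch performs frame-wise magnitude (p = 1) or power (p = 2) spectral subtraction. The frame length and the assumption that the first frames contain only noise (a speech pause) are ours, not values taken from the paper.

```python
import numpy as np

def spectral_subtraction(s, frame=512, noise_frames=10, p=2):
    """Frame-wise spectral subtraction: |cs(w)|^p = |s(w)|^p - |ad(w)|^p.
    The average noise spectrum |ad_e(w)| is estimated from the first
    noise_frames frames, assumed to be a speech pause (our assumption)."""
    n_frames = len(s) // frame
    frames = s[:n_frames * frame].reshape(n_frames, frame)
    spectra = np.fft.rfft(frames, axis=1)
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)   # |ad_e(w)|
    mag, phase = np.abs(spectra), np.angle(spectra)
    # Subtract the noise estimate and floor at zero to avoid negative power.
    clean_mag = np.maximum(mag ** p - noise_mag ** p, 0.0) ** (1.0 / p)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame, axis=1)
    return clean.reshape(-1)
```

The noisy phase is reused when reconstructing the time-domain signal, a common simplification in spectral subtraction.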

Signal Decomposition and Feature Extraction
The extracted features help obtain the important details that improve the overall ASR system more precisely. The feature extraction process helps maintain the robustness of the ASR system because it supports investigating the signal s(n) from different aspects. The speech signal s(n) = cs(n) + ad(n) has length N. Once the noise has been eliminated, cs(n) is decomposed into trend and fluctuation components by applying the wavelet transform. Here, level-4 Daubechies wavelets are utilized with five wavelet bases: db14, db12, db10, db8, and db2. The level-by-level mapping of the speech signal is as follows.

In level 1, the speech signal cs(n) is divided according to the signal length N/2 into the trend I_1 and fluctuation f_1, defined as cs(n) → (I_1 | f_1).
In level 2, the trend is divided at length N/4 into the trend I_2 and fluctuation f_2, defined as I_1 → (I_2 | f_2).
In level 3, the signal is calculated by dividing into the I_3 and f_3 signals, defined as I_2 → (I_3 | f_3); here, the decomposition is carried out at length N/8.
In level 4, the decomposition is carried out at length N/16 and is obtained from the I_3 and f_3 signals, represented as I_3 → (I_4 | f_4).
According to the above wavelet process, 20 subsignals are obtained from the trends and fluctuations (see the sketch below).
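The following sketch shows how the five-wavelet, four-level decomposition could be realized with the PyWavelets library; the mapping of detail coefficients to the fluctuations f_1 to f_4 is our reading of the description above, not the authors' code.

```python
import pywt  # PyWavelets

def wavelet_subsignals(cs, wavelets=("db2", "db8", "db10", "db12", "db14")):
    """Four-level Daubechies decomposition of the denoised signal cs(n).
    pywt.wavedec returns [I4, f4, f3, f2, f1]; keeping the four
    fluctuation (detail) signals per wavelet yields the 20 subsignals
    mentioned in the text (5 wavelets x 4 levels)."""
    subsignals = []
    for w in wavelets:
        coeffs = pywt.wavedec(cs, w, level=4)  # [I4, f4, f3, f2, f1]
        subsignals.extend(coeffs[1:])          # keep the fluctuations
    return subsignals
```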
After that, the signal entropy value (ev) is estimated, which helps to determine the information presented in the decomposed signals. The entropy value is computed from the random phenomenon of the speech signal Q = {q_1, q_2, ..., q_n} and the probability value p(q_i) of Q, according to Equation (12):

ev(Q) = −Σ_{i=1}^{n} p(q_i) log₂ p(q_i). (12)

Then, according to Q, every subsignal entropy value is estimated using Equation (13). The subsignal entropy value is computed over m frames, with k = 1, 2, ..., m and j = 1, 2, 3. According to Equation (13), the entropy values are Ie_{4k} and fe_{jk}. These extracted frame entropy values characterize the speech based on emotions, because the fluctuations vary compared to the normal speaker emotion level.
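A minimal sketch of Equation (12) for one decomposed subsignal follows; the histogram-based estimate of the probabilities p(q_i) is our assumption, since the paper does not state how they are obtained.

```python
import numpy as np

def subsignal_entropy(x, bins=32):
    """Shannon entropy ev of one subsignal (Equation (12)).
    Probabilities p(q_i) are estimated from a value histogram."""
    hist, _ = np.histogram(x, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins to avoid log(0)
    return -np.sum(p * np.log2(p))
```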
Then, Mel-frequency coefficient features are derived to identify the characteristics of the speech signal. The Mel(f) value is obtained from the frequency value of every subsignal derived from the discrete wavelet transform process:

Mel(f) = 2595 log₁₀(1 + f/700). (14)

The extracted features are then trained and learned by the encoder convolution networks so that they perform in any situation. The process of feature extraction is summarized in Figure 3.
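The Mel mapping of Equation (14) is a one-line conversion; the sketch below assumes frequencies are given in Hz.

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale used by the MFCC features: Mel(f) = 2595 log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

# Example: hz_to_mel(1000.0) is approximately 1000 mel.
```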

Speech Recognition
The convolution network trains on the extracted features to recognize the speech signal in different noisy and loud environments. The learning process covers both the language and acoustic models because the introduced ASR framework should react perfectly in different speech environments; only then does the system ensure a higher recognition rate.
Consider that the extracted features have length T and are defined as X = {x_t ∈ R^D | t = 1, ..., T}. The word sequence of the spoken utterance is defined as W = {w_n ∈ V | n = 1, ..., N}, where n is the word position and V is the vocabulary; the features x_t are D-dimensional vectors derived per frame t. The derived features are further examined to find the most likely word sequence:

Ŵ = argmax_W P(W|X).

The posterior P(W|X) of the word sequence W given X is computed using Bayes' rule, defined in Equation (16). During the computation, P(X) is omitted because it is constant with respect to the word sequence W:

Ŵ = argmax_W P(X|W) P(W) / P(X). (16)

Then, the likelihood of the feature sequence P(X|W) is computed from the acoustic model, and the prior knowledge of the words P(W) is computed from the language model. The sequence of features and words, with the state sequence S, is analyzed using Equation (18):

Ŵ = argmax_{W∈V*} Σ_S P(X|S) P(S|W) P(W). (18)
P(X|S) is derived from the acoustic model, which makes the Markov assumption together with the probabilistic chain rule (Equation (19)):

P(X|S) = Π_{t=1}^{T} P(x_t | s_t). (19)
The convolution network replaces the frame-wise likelihood P(x_t|s_t) with the scaled frame-wise posterior P(s_t|x_t)/P(s_t). The frame-wise analysis helps to resolve decision-making issues, and the system's performance is improved by considering the lexicon model P(S|W). The lexicon model is factorized according to the Markov assumption and the probabilistic model.
The extracted phoneme features and the respective Markov probability values help to identify the lexicon information from the speech. Finally, the language model P(W) is computed using the Markov assumption and the probabilistic chain rule over the words in the speech.
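To make the noisy-channel decision rule of Equation (16) concrete, the toy sketch below scores a small set of candidate word sequences with hypothetical acoustic and language model log-probabilities; real systems replace these placeholders with the trained models and a proper search over V*.

```python
def decode(acoustic_loglik, lm_logprob, candidates):
    """Toy decoding: W_hat = argmax_W P(X|W) P(W), with P(X) dropped
    since it is constant over W. Scores are combined in log space."""
    scores = {w: acoustic_loglik(w) + lm_logprob(w) for w in candidates}
    return max(scores, key=scores.get)

# Example with hypothetical scores for two candidate word sequences:
W_hat = decode(lambda w: {"hello world": -12.0, "yellow word": -11.5}[w],
               lambda w: {"hello world": -2.0, "yellow word": -6.0}[w],
               ["hello world", "yellow word"])
print(W_hat)  # "hello world": the language model outweighs the acoustic gap
```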
Appendix A explains the sparse auto-encoder and model fine-tuning using Harris Hawks optimization.

Experiment Setup
The collected datasets are investigated, with 80% of the data utilized for training and 20% for testing. The system is developed in MATLAB (MathWorks Inc., Natick, MA, USA) and uses the acoustic and language models to train the networks. Here, people's speech information is investigated in terms of every word, phoneme, and fluctuation, which helps to identify speech in different environments. During the analysis, the Harris Hawks optimization process is utilized to update and fine-tune network parameters to reduce the maximum error-rate classification problem. Furthermore, the system's robustness and reliability are maintained by extracting the valuable features in all signal sub-bands and wavelets. The speech signal spectrum, power, and modulations are analyzed to remove the modulations and deviations in the captured speech signal.
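A minimal sketch of the 80/20 split, assuming the features and labels are NumPy arrays; the shuffling seed is an arbitrary choice for reproducibility and is not specified in the paper.

```python
import numpy as np

def split_80_20(features, labels, seed=0):
    """Shuffle the samples and return 80% for training, 20% for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    cut = int(0.8 * len(features))
    train, test = idx[:cut], idx[cut:]
    return features[train], labels[train], features[test], labels[test]
```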

Objective Performance Evaluation
This section determines how the proposed HHSAE-ASR framework obtains substantial results in the speech recognition process. The system effectiveness is evaluated using error rate values because they are most relevant to the maximum error-rate classification problem. The results of the HHSAE-ASR are compared with existing research works [12,14,15,18,20]. These methods, described in more detail in Section 2, were selected because they utilize optimization techniques and functions while analyzing the speech signal. Table 1 illustrates the error rate analysis of the proposed HHSAE-ASR framework compared with existing algorithms: the multiobjective evolutionary optimization algorithm [12], the deep convolution encoder with long short-term memory recurrent neural networks [14], continual learning algorithms [15], the enhancement parameter with a genetic algorithm [18], and MFCC with DTW [20]. Among these methods, the HHSAE-ASR algorithm attains the minimum error values (MSE 1.11, RMSE 1.087, and VUV 1.01). The training process uses different features, such as the acoustic, lexicon, and language models, together with the speech signal. These features help in making decisions according to the probability values and chain rules. Here, the set of speech features is analyzed by the encoder network, which applies different conditions while updating the network parameters. The error rate has been evaluated for different numbers of users. Figure 4 illustrates the error rate analysis for the different numbers of persons that participated during the speech analysis process. The effective utilization of the speech features and training parameters helps to reduce the classification error rate, and the minimum error rate directly indicates the maximum recognition accuracy in the objective analysis. The obtained results are illustrated in Figure 5. These results show that the proposed HHSAE-ASR framework attains effective results while investigating speech signals over different numbers of iterations and persons. The recognition system's effectiveness is further examined using the testing model for different numbers of persons and iterations in the subjective analysis.

Subjective Performance Evaluation
This section discusses the performance evaluation results of the HHSAE-ASR framework in a subjective manner. The dataset consists of many recordings from both male and female speakers. Therefore, the testing accuracy is determined using various numbers of persons and iterations. Figure 6 shows that the proposed HHSAE-ASR framework attains high accuracy (98.87%) while analyzing various people's signals over different numbers of iterations. The obtained results are compared to existing methods: the multiobjective evolutionary optimization algorithm [12] (66.76%), the deep convolution encoder with long short-term memory recurrent neural networks [14] (73.43%), continual learning algorithms [15] (78.31%), the enhancement parameter with a genetic algorithm [18] (81.34%), and MFCC with DTW [20] (93.23%). Table 2 illustrates the efficiency of the introduced system while investigating different numbers of participants. The system examined each person's speech signal, comparing the speech word, length, and sequence-related probability values. The Markov chain rules are developed according to the acoustic, lexicon, and language models, which helps to identify the speech relationships and their deviations in loud and noisy environments.
Thus, the proposed HHSAE-ASR system recognizes speech with 99.31% precision, 99.22% recall, 99.21% MCC, and a 99.18% F-measure value. Table 2 also illustrates the system's efficiency over different numbers of participants: each person's speech signal is analyzed based on word length, sequence-related probabilities, and chain rules, evaluated for between 100 and 1000 participants. The method predicts the sequence of features P(X|W), and the respective argmax_{W∈V*} Σ_S P(X|S) P(S|W) P(W) values help to match the training and testing features.
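For reference, the four reported scores can be computed as below with scikit-learn; macro averaging is our assumption, since the paper does not state the averaging mode.

```python
from sklearn.metrics import (precision_score, recall_score,
                             matthews_corrcoef, f1_score)

def report(y_true, y_pred):
    """Compute the four scores reported in Table 2 for label arrays."""
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "mcc":       matthews_corrcoef(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred, average="macro"),
    }
```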
In the HHSAE-ASR framework, speech patterns are continuously used to train the system. The encoder network is then fine-tuned using metaheuristic techniques to minimize the classification error rate. The ASR accuracy, robustness, and dependability are enhanced by using sequential speech patterns, learning concepts, and network parameter updating.

Data Accessing in HHSAE-ASR
The recognition and authentication of human speech use dynamic time warping (DTW). This technique extracts the distinctive aspects of human speech, and the derived features make it easier to authenticate users. Thus, the system's overall security and authentication efficiency are enhanced, achieving 91.8%. Data access in the proposed system is compared with other traditional approaches in Table 3. This kind of validation helps to reduce the classification error rate compared to other methods.
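A minimal implementation of the classic DTW recurrence used for template matching follows; the authentication threshold and the exact feature sequences compared are not specified in the paper and would be tuned in practice.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping between two 1-D feature sequences, as used
    to match a spoken utterance against an enrolled template."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of insertion, deletion, and match moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A user is accepted when the DTW distance to their enrolled template
# falls below a tuned threshold (threshold choice not given in the paper).
```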

Conclusions
This paper proposed the Harris Hawks sparse auto-encoder network (HHSAE-ASR) framework for automatic speech recognition. Initially, the human voice signal is collected and analyzed by using the spectrum decomposition approach. Here, spectrum deviations and fluctuations are analyzed to replace the noise signal with the average spectrum phase value. Then, different features are extracted from the signal by decomposing it into four levels. The decomposed signals are further investigated to get the Mel-frequency coefficient features, which are useful for creating the acoustic, lexicon, and language models. The extracted features are applied to the Markov-model-based convolution network to train the network to handle speech signal analysis in loud and noisy environments. During this process, the network is fine-tuned and the parameters are updated according to the Harris hawk's prey-searching behavior with certain updating conditions. This process reduces misclassification error-rate problems and maintains the robustness and availability of the system. Thus, the system ensures 99.18% accuracy, which outperforms the existing algorithms.
Natural language recognition is a challenging task, as dialects, speaking speeds, and traditions vary across actual applications. In the future, a relevant feature selection process will be incorporated to improve the overall effectiveness of the system, and by using Mel-frequency cepstral coefficients to express the characteristics, the correctness of the classification could be improved further.

Appendix A. Model Fine-Tuning Using Harris Hawks Optimization

The computed energy and jump strength values are updated for every jump and food-searching step using Equation (A4), as they are used to identify the best network parameter values.
The energy value is updated according to the prey's escaping energy E over the maximum number of iterations T with initial energy E_0:

E = 2 E_0 (1 − t/T). (A4)

The value E_0 is selected within (−1, 1), which determines the hawk's condition; when E_0 decreases from 0 to −1, the prey is losing energy. If |E| ≥ 1 (exploration phase), the hawk moves to a different location and updates continuously to select effective network parameters. If |E| < 1, the rabbit is in the neighborhood and the search for the solution enters the exploitation step. As stated, if |E| ≥ 1, the location vector is updated using Equation (A5).
X(t + 1) = { X_rand(t) − r_1 |X_rand(t) − 2 r_2 X(t)|, q ≥ 0.5; (X_rabbit(t) − X_m(t)) − r_3 (LB + r_4 (UB − LB)), q < 0.5. } (A5)

Under the |E| ≥ 1 condition, the hawk's position X(t + 1) in the next iteration is updated from the rabbit position X_rabbit(t), the hawk's current position vector X(t), and the random numbers r_1, r_2, r_3, r_4, and q in (0, 1). For every iteration, the lower (LB) and upper (UB) boundaries of the search region are considered along with a randomly selected hawk X_rand(t) from the current population and the mean hawk position X_m. If |E| < 1 with r ≥ 0.5 and |E| ≥ 0.5, the search enters the exploitation phase (soft besiege), and the position is updated using Equation (A6).
X(t + 1) = ΔX(t) − E |J X_rabbit(t) − X(t)|, ΔX(t) = X_rabbit(t) − X(t). (A6)

The update computes the difference ΔX(t) between the rabbit's location and the hawk's position vector in every iteration t. Here, the jump strength is estimated as J = 2(1 − r_5), with the random number r_5 computed in (0, 1). The jump value changes in every iteration because the rabbit moves randomly in the search space. If |E| < 1 with r ≥ 0.5 and |E| < 0.5 (hard besiege), the Harris hawk has a low escaping energy level and the current position is updated as Equation (A7):

X(t + 1) = X_rabbit(t) − E |ΔX(t)|. (A7)

If |E| < 1 with r < 0.5 and |E| ≥ 0.5 (soft besiege with progressive rapid dives), the location vector is updated using Equation (A8).
X(t + 1) = Y if F(Y) < F(X(t)), or Z if F(Z) < F(X(t)), (A8)

where Y and Z are computed as

Y = X_rabbit(t) − E |J X_rabbit(t) − X(t)|, Z = Y + S × LF(D).

Here, the Y and Z parameters are computed in dimension D, with the Levy flight function LF and a random vector S of size D. The LF is computed as

LF(x) = 0.01 × (u × σ) / |v|^{1/β}, σ = ( Γ(1 + β) sin(πβ/2) / ( Γ((1 + β)/2) β 2^{(β−1)/2} ) )^{1/β},

where random values in (0, 1) are selected for u and v, and β = 1.5 is a constant. Finally, if |E| < 1 with r < 0.5 and |E| < 0.5, the updating process is done using Equation (A12).
Here, Y and Z are computed as

Y = X_rabbit(t) − E |J X_rabbit(t) − X_m(t)|, Z = Y + S × LF(D). (A12)

According to this process, the network parameters are updated continuously, which reduces the recognition issues and addresses the existing research problems. Based on the encoder network performance, the convolution network identifies the speech by examining the acoustic, lexicon, and language models effectively. A minimal sketch of this update loop is given below.
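The sketch covers the energy schedule of Equation (A4) and the exploration rule of Equation (A5) for one hawk; the exploitation branches (A6) through (A12) follow the same pattern and are omitted for brevity. This is an illustration under our assumptions, not the authors' implementation.

```python
import numpy as np

def hho_step(X, i, X_rabbit, t, T, lb, ub, rng):
    """One exploration-phase HHO update for hawk i. X holds all hawk
    positions (candidate network parameter vectors); X_rabbit is the
    best solution found so far; lb/ub bound the search region."""
    E0 = rng.uniform(-1, 1)                      # initial escaping energy
    E = 2 * E0 * (1 - t / T)                     # energy schedule (A4)
    if abs(E) >= 1:                              # exploration (A5)
        q, r1, r2, r3, r4 = rng.uniform(size=5)
        X_rand = X[rng.integers(len(X))]         # a randomly chosen hawk
        if q >= 0.5:
            return X_rand - r1 * np.abs(X_rand - 2 * r2 * X[i])
        return (X_rabbit - X.mean(axis=0)) - r3 * (lb + r4 * (ub - lb))
    return X[i]                                  # exploitation not shown
```

In the HHSAE-ASR setting, each position vector would encode the sparse auto-encoder parameters, and the fitness F would be the classification error rate to be minimized.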