A Novel Pathological Voice Identification Technique through Simulated Cochlear Implant Processing Systems

Abstract: This paper presents a pathological voice identification system employing signal processing techniques through cochlear implant models. The fundamentals of the biological process for speech perception are investigated to develop this technique. Two cochlear implant models are considered in this work: one uses a conventional bank of bandpass filters, and the other one uses a bank of optimized gammatone filters. The critical center frequencies of those filters are selected to mimic the human cochlear vibration patterns caused by audio signals. The proposed system processes the speech samples and applies a CNN for final pathological voice identification. The results show that the two proposed models adopting bandpass and gammatone filterbanks can discriminate the pathological voices from healthy ones, resulting in F1 scores of 77.6% and 78.7%, respectively, with speech samples. The obtained results of this work are also compared with those of other related published works.


Introduction
Humans use speech to convey information in their daily life. A human speaker encodes information into a continuously time-varying waveform that can be stored, manipulated, and transmitted during speech production. Finally, the message is decoded by a listener. The whole human communication process can be broadly divided into four main parts: speech production, auditory feedback, sound wave transmission, and speech perception [1].
As illustrated in Figure 1, the human voice generation system consists of the lungs, larynx, and vocal tracts. The speech production process originates from the lungs. During the speech production process, humans inhale air and then expel it. The most critical components of the human voice generation system are the vocal folds. The larynx controls the vocal folds by using its ligaments, cartilages, and muscles. The vocal folds ultimately open the glottis (a slit between the vocal folds) depending on three conditions, namely breathing, unvoiced, and voiced [2]. The lips, tongue, palate, and cheek form the articulators. The primary function of articulators is to filter the sound emanating from the larynx to produce a highly intricate sound.
The human peripheral auditory system consists of three parts [3]: the outer ear, middle ear, and inner ear. The propagated sound enters the outer ear through the pinna, which helps to localize the sound. Afterward, it travels down the auditory canal and vibrates the eardrum. The middle ear consists of three bones: the malleus, incus, and stapes. These bones transport the vibration of the eardrum to the inner ear. The middle ear is connected to the inner ear by an oval window. The main component of the inner ear is the cochlea, which is a coiled, snail-shaped tube filled with fluid. A basilar membrane exists within the cochlear fluid, which is held to the cochlea with a bone. The vibration of the eardrum causes a movement of the oval window to generate a compressed sound wave in the cochlear fluid. This compressed wave causes a vibration in the basilar membrane. The basilar membrane is mechanically tuned to different frequencies, and it plays a vital role in distributing sound energy along the cochlea's length, as shown in Figure 2.
Figure 1. A human voice generation system [2].
Voice pathology occurs when one or more human voice generation system components malfunction. The specific causes for this kind of malfunction are still research issues. However, researchers have discovered that calluses and swelling on vocal cords, vocal cord paralysis, vocal cord shutting, and spasmodic dysphonia are the leading causes of voice pathology. Other causes include hearing loss, neurological disorder, brain injury, intellectual disability, and drug abuse. A comprehensive survey of the voice pathologies and their causes can be found in [5][6][7].
Appl. Sci. 2022, 12, 2398
Many voice pathologies have been reported in the literature. However, the American Speech-Language-Hearing Association (ASHA) has mentioned laryngitis as one of the three most common voice pathologies [8], which is investigated in this work. Laryngitis is caused by inflammation in the vocal folds [9]. This results in sounds being obstructed by the inflamed vocal folds as the air passes over them. Laryngitis can occur from voice overuse, smoking, and infection in the larynx [9]. The other reasons for laryngitis include excessive alcohol consumption and gastroesophageal reflux disease (GERD) [10]. Laryngitis makes the voice sound hoarse and weak.
Both invasive and non-invasive methods are used for detecting voice pathologies. In invasive methods, physicians insert probes into the mouth using an endoscopic procedure. Laryngoscopy [11], stroboscopy [12], and laryngeal electromyography [13] are examples of such practices. In non-invasive methods, voice pathology is detected using voice signal processing techniques [14,15]. These methods involve three significant steps, namely: (a) voice sample collection and analysis, (b) feature extraction, and (c) classification. Voice samples are collected in a sound environment. Then, the samples are analyzed, and voice features are extracted. The final step is to classify voice samples into control (i.e., healthy) and pathological. A classifier is commonly used for this purpose.
A literature survey shows that several classifier algorithms have been popularly used for voice pathology detection. The resulting accuracies demonstrate that the classification accuracy mainly depends on the classifier algorithms and voice features [16,17] used by the classifiers. Recently, deep learning algorithms have drawn considerable attention from researchers in this field. It has been shown in [18][19][20][21][22][23][24] that deep learning algorithms can play an essential role in voice pathology detection as they provide higher accuracies.
The goal of this work is to explore whether the existing technology of the cochlear simulation model can be used for the noninvasive detection of pathological voice. The clinical tools used by physicians rely on invasive technology that is unpleasant for patients. Additionally, they sometimes rely on subjective assessment, especially for voice pathologies that lack a structural abnormality. To overcome these limitations, we propose a signal processing and deep learning-based technique that can help clinicians perform a noninvasive, objective assessment of voice disorders, thereby sparing patients painful procedures and avoiding the misdiagnosis that may result from subjective assessment.
Many pathological voice detection systems have been published in the literature. However, to our knowledge, we are the first to use the cochlear simulation model to implement a pathological voice detection system. The voice samples are processed using a cochlear simulation model, and then the processed voice samples are applied to the input of a CNN for final classification.
Cochlear implants are sensory prosthetic devices. They are capable of establishing functional hearing for listeners with severe hearing loss. This is achieved by establishing direct electrical stimulation to the auditory nerves for the people with damaged hair cells in the basilar membrane. These hair cells are tuned at different frequencies to aid hearing perception for people with no hearing impairment [25]. A typical cochlear implant system includes several signal processing steps: removal of the D.C. component, pre-emphasis, division of the signal into a set of channels, rectification, and lowpass filtering. Among these signal processing steps, the most critical one is dividing the signal into several channels using a filterbank. The center frequencies and the bandwidths of these filters are determined based on the human cochlear vibration patterns caused by the audio samples. In this work, we consider two models for the filterbank. One model uses a bank of bandpass filters, and the other uses gammatone filters. A bank of bandpass filters is commonly used in commercially available cochlear implants. However, researchers have recently recommended using gammatone filters instead. The main advantages of the gammatone filters are that they: (a) provide an appropriate "pseudo-resonant" frequency transfer function, (b) demonstrate a simple impulse response, and (c) support efficient hardware implementation [26]. Finally, the processed audio features are applied to the input of a CNN for classification.
The main contributions of this work are as follows:
• It develops a novel, non-invasive pathological voice detection algorithm based on speech signal processing that mimics the biological process of speech perception and a deep learning approach.
• It extracts audio information using gammatone filters and conventional bandpass filters to examine their efficacy for pathological voice identification.
• It eliminates the necessity of choosing the suitable features from speech samples to aid the classification mechanism.
• It achieves a reasonably high classification accuracy without overwhelming the computation burden on the system.
• It provides a detailed performance analysis of the proposed system in terms of accuracy, precision, recall, NPV, and F1 score.
• It compares the performances of the proposed system with other related works to demonstrate its effectiveness.
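For readers less familiar with the five performance measures named in the contributions, the following minimal sketch computes them from the four counts of a binary confusion matrix. It assumes that the pathological class is treated as the positive class; the function name and the example counts are hypothetical and are not results of this work.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, NPV and F1 from confusion-matrix counts.

    Assumption: 'positive' means pathological and 'negative' means control.
    A ratio with a zero denominator is reported as 0.0.
    """
    def safe_div(num, den):
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)   # share of predicted positives that are correct
    recall = safe_div(tp, tp + fn)      # share of actual positives that are found
    npv = safe_div(tn, tn + fn)         # share of predicted negatives that are correct
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "npv": npv,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts, used only to exercise the formulas.
    for name, value in classification_metrics(tp=40, fp=10, tn=35, fn=15).items():
        print(f"{name}: {value:.4f}")
```

The F1 score is the harmonic mean of precision and recall, so it penalizes a classifier that achieves one at the expense of the other.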
The rest of the paper is organized as follows: some related works are presented in Section 2, the materials and methods are explained in Section 3, the results are analyzed in Section 4, and the paper is concluded with Section 5.

Related Works
A variety of voice pathology detection algorithms have been published in the literature. In this section, some of these algorithms that are closely related to our work are presented. Deep neural network (DNN)-based algorithms have been investigated in [23] to detect pathological voices. The authors investigated eight voice pathologies: vocal fold nodules, polyps, cysts, neoplasm, atrophy, vocal palsy, sulcus and spasmodic dysphonia. They have used several classification algorithms in their work. The results showed that the DNN-based classifier achieved the highest accuracy of 94.26% and 90.52% for male and female speakers, respectively.
Vocal disorders, namely neoplasm, phono-trauma, and vocal palsy, have been investigated in [27]. The authors have used a machine learning algorithm named dense net recurrent neural network (DNRNN) in their work. The results showed that the DNRNN algorithm achieved an accuracy of 71%.
Multiple neural networks have been used in [28] to detect voice pathology. The authors have used multilayer perceptron neural network (MLPNN), general regression neural network (GRNN), and probabilistic neural network (PNN) in their work. The authors achieved the highest accuracy of 100% with the MLPNN.
Some existing algorithms use multiple voice features to improve detection accuracy. For example, the researchers in [29] have used six voice features, namely jitter, shimmer, harmonic-to-noise ratio (HNR), soft phonation index (SPI), amplitude perturbation quotient (APQ), and relative average perturbation (RAP). In that study, the authors investigated several voice pathologies: cyst, edema, laryngitis, nodule, palsy, polyp, and glottis cancer. Several classifiers were used in their work, and the Gaussian mixture model (GMM)-based method provided the highest accuracy of 95.2%.
A similar voice pathology detection algorithm has been presented in [30]. In the first step, the voice features, namely Mel-frequency cepstral coefficients (MFCCs), linear frequency cepstral coefficients (LFCC), and zero-crossing rate (ZCR), are extracted from the voice samples. In the second step, the voice samples are classified by using an artificial neural network (ANN). The authors claimed that their proposed algorithm requires less computation compared to other similar existing algorithms. Support vector machine (SVM) and radial basis function neural network (RBFNN) have been used in [31] to detect voice pathology. In that work, the authors used several audio features: signal energy, pitch, formant frequencies, mean square residual signal, reflection coefficients, jitter, and shimmer. Then, they combined these voice features to form a feature vector. Finally, the feature set was used for classification. The results showed that RBFNN achieved an accuracy of 91%. On the other hand, the SVM attained an accuracy of 83%.
Stuttering has been addressed in [32]. In that work, the authors proposed a method to improve the performance of an automatic speech recognizer. Specifically, the authors developed a classifier that can better detect stuttering in speech signals. That work used ANN, hidden Markov model (HMM), and SVM as the classifiers. The results showed that these algorithms could detect stuttered voices with an accuracy of 85% and 78% for males and females, respectively.
Four voice attributes, namely roughness, breathiness, asthma, and strain, have been addressed in [33]. The authors have proposed a method that uses the higher-order local autocorrelation (HLAC) features extracted by an algorithm called automatic topology-generated autoregressive HMM (AR-HMM) analysis. The proposed algorithm used a feed-forward neural network (FFNN)-based classifier for voice pathology detection. The achieved accuracy was 87.75%.
Some researchers claimed that the spectrogram is the most suitable voice feature for pathological voice detection as it traces different frequencies and their occurrences in time. For example, spectrogram was used to detect pathological voice disorder due to vocal cord paralysis (Reinke's edema) in [21]. The authors used CNN as the classifier in their work. The spectrograms of pathological and control speech were applied to the input of a convolutional deep belief network (CDBN). The results showed that a small dataset was enough to train the CDBN and to achieve high accuracy. The authors achieved 77% and 71% accuracy for CNN and CDBN, respectively.
In [34], the authors have also used spectrograms. They argued that voice pathology detection using spectrograms is affected by jitter, shimmer, and HNR. However, the measurements of jitter, shimmer, and HNR are not independent and may provide ambiguous information. For example, the addition of random noise increases the jitter measurement, and the introduction of jitter leads to a reduction in HNR. The authors suggested removing the effects of jitter and shimmer on the speech spectrum to improve the detection accuracy of voice pathology by using spectrograms.
To identify dysphonic voice, a new marker, called the dysphonic marker index (DMI), has been introduced in [35]. This marker consists of four acoustic parameters. The authors have employed a regression algorithm to relate these features, and they defined a threshold value to discriminate pathological voices from healthy ones. They have achieved an accuracy of 82.2% for the classification task. A novel computer-aided pathological voice classification system was proposed in [36]. In that work, the authors used a deep-connected ResNet for classification. The model used two databases and achieved almost similar accuracies (81.6% and 82.2%) for both databases. Hence, the authors concluded that the proposed method is data-independent.
A long-short-term memory (LSTM) auto-encoder hybrid model with a multi-task learning solution was presented in [37]. The authors used dysphonia, depression, and Parkinson's voice samples. The spectrogram features were extracted from the voice features and applied to the classifiers' input for the voice pathology detection. The proposed method achieved an accuracy of 85% for all samples related to the dysphonia, depression, and Parkinson's.
The online sequential extreme learning machine (OSELM) was used in [38] to detect voice pathology. The authors used 600 vowel samples. They extracted MFCC features from the samples and applied them to the OSELM for classification. The algorithm in [38] achieved an accuracy of 85%, sensitivity of 87%, and specificity of 87%.
Deep learning algorithms, namely feed forward neural network (FNN) and CNN, were used in [39] to detect voice pathology. Three voice features, namely the MFCCs, linear prediction cepstral coefficients (LPCCs), and higher-order statistics (HOSs), of 518 vowel samples (259 healthy and 259 mixed pathologies) were used in that work. The authors used '/a/', '/i/', and '/u/' vowels at normal pitch in the work. They achieved the highest accuracy of 82.69% using the CNN with the LPCCs of the vowel sound '/u/' of male samples.
The CNN and MFCCs as features were used in [40] to discriminate pathological voices from normal voices. The work in [40] used the vowel sample '/a/' of 189 normal and 552 pathological samples. The work investigated four pathologies: vocal atrophy, unilateral vocal paralysis, organic vocal fold lesions, and adductor spasmodic dysphonia. The results showed an overall accuracy of 66.9%, a sensitivity of 66%, and a specificity of 91% with their algorithm.
A pre-trained network (ResNet34) was used in [41]. The authors used 150 healthy and 150 pathological '/a/' vowel samples. First, the vowel samples were framed and windowed. Then, the spectrograms were computed from the samples, and the pre-trained network was used for classification. The authors tested the proposed algorithm with 200 healthy and 874 pathological samples. The authors achieved an accuracy of 95.41%, an F1 score of 94.22%, and a recall of 96.13% in the work.
The performances of two algorithms, namely CNN and RNN, were compared in [42]. The authors used the vowel samples in that work. They used several voice features, namely, 13 MFCCs, pitch, roll-off, ZCR, energy entropy, spectral flux, spectral centroid, and energy. In the experiment, the authors used 10-fold validation techniques. The results show that the algorithm achieved 87.11% and 86.52% accuracy with the CNN and the RNN, respectively.
There are two significant limitations of the above-mentioned related works. One of them is that none of the investigations represent how the human auditory system responds to sounds. Another shortcoming is that the adopted classifiers overwhelmed the system with a substantial computational burden. The addressed limitations are overcome in this work by using a cochlear simulation model and a CNN.

Materials and Methods
In this investigation, control (i.e., normal) and laryngitis voice samples were collected from the Saarbrücken Voice Database (SVD) [43]. The SVD database is a collection of speech and electroglottography (EGG) signals of more than 2000 speakers. It contains recordings of 1002 speakers (454 male and 548 female), exhibiting a wide range of voice disorders. It also includes recordings of 851 control (423 male and 428 female) samples. The age of the speakers varies from 6 to 84 years [44]. All of these samples were collected in one session with the patients and the samples contain the recordings of the following components: (a) vowels '/i/', '/a/', and '/u/' produced at normal, high, and low pitch, (b) vowels '/i/', '/a/', and '/u/' with rising and falling pitch, and (c) sentence, "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). In this investigation, we chose the sentence, "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). The main reason is that the sentence speech samples contain both voiced and unvoiced components. On the other hand, the vowel speech samples contain only the voiced component. Moreover, the sentence speech samples contain articulatory and other linguistic confounds that often do not exist with the vowel samples. Figure 3 shows the time domain plots for control (i.e., healthy) and laryngitis voice samples, randomly collected from the SVD database. It is observed in the figure that the laryngitis voice sample suffers from irregular distortion in both magnitude and shape compared to that of the healthy sample. In addition, the laryngitis voice samples exhibit a more extended unvoiced segment compared to the vowel samples.
As shown in Figure 4, the system model can be broadly divided into three major sub-systems: (a) pre-processing, (b) cochlear modeling, and (c) classification. The pre-processing sub-system consists of three signal processing steps: down-sampling, D.C. removal, and pre-emphasis. In a Clarion processor, the acoustic signal is processed at the rate of 13,000 samples/s. The voice samples available in the SVD database have a sampling frequency of 50,000 samples/s. Hence, the voice signals were down-sampled to 13,000 samples/s using the MATLAB built-in function resample. The resample function utilizes a built-in anti-aliasing (lowpass) FIR filter to minimize the effects of aliasing that occur due to the down-sampling operation. Afterwards, the D.C. component of the speech signals was removed. Most of the energy in the speech signal is concentrated in the lower frequency components of its spectrum and, generally, the energy drops at a rate of 2.0 dB/kHz [48]. This rapid reduction in energy causes problems for the subsequent processing of the speech signals. To overcome this limitation, the high-frequency components of the speech signals were boosted by a pre-emphasis filter, which was designed based on the model presented in [49]. The magnitude response of the pre-emphasis filter is shown in Figure 5. This filter has a cut-off frequency of 2000 Hz and a roll-off of 3 dB/octave. It compensates for the rapid reduction of the energy in the high-frequency components of the audio signal. Additionally, it better optimizes the CPU consumption.
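The D.C. removal and pre-emphasis steps can be illustrated with a minimal Python sketch. It is not the pre-emphasis filter of [49]: as a stand-in, it uses the common first-order difference filter y[n] = x[n] - alpha * x[n-1], and the coefficient alpha = 0.95 and the function names are assumptions made for illustration. Down-sampling is omitted here; in Python, a rational-rate resampler such as scipy.signal.resample_poly(x, 13, 50) could play the role of the MATLAB resample call.

```python
def remove_dc(signal):
    """Subtract the mean so that the signal has no D.C. component."""
    if not signal:
        return []
    mean = sum(signal) / len(signal)
    return [s - mean for s in signal]


def pre_emphasis(signal, alpha=0.95):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    Assumption: this generic filter only stands in for the filter of [49].
    Its gain is (1 - alpha) at 0 Hz and (1 + alpha) at the Nyquist frequency,
    so high frequencies are boosted relative to low frequencies.
    """
    if not signal:
        return []
    out = [signal[0]]  # the first sample has no predecessor and is passed through
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


def pre_process(signal, alpha=0.95):
    """Apply D.C. removal followed by pre-emphasis."""
    return pre_emphasis(remove_dc(signal), alpha)


if __name__ == "__main__":
    # Hypothetical toy signal: a constant offset plus an alternating component.
    toy = [2.0 + (1.0 if n % 2 == 0 else -1.0) for n in range(8)]
    print(pre_process(toy))
```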
It is also shown in Figure 4 that the cochlear modeling sub-system consists of a bandpass filter, rectifier, lowpass filter, and a non-linear mapper. The pre-processed speech signals were divided into eight channels by using eight filters. These filters were designed based on the specifications mentioned in [49]. The center frequencies and the bandwidths of these eight filters are listed in Table 1. These eight filters were designed by using third-order Butterworth prototype filters. It is demonstrated in the table that the bandwidth of the filters is logarithmically spaced from 265 to 1136 Hz, mimicking the frequency response of the basilar membrane (see Figure 2). The lowest center frequency is 394 Hz (the center frequency of the first filter) and the highest center frequency is 4871 Hz (the center frequency of the eighth bandpass filter). The magnitude spectrum of these eight bandpass filters is shown in Figure 6.
The next signal processing steps included envelope detection and lowpass filtering. This work used a full-wave rectifier as an envelope detector, and an eighth-order finite impulse response (FIR) filter was used as a lowpass filter. This lowpass filter was designed by using the Hamming window function. Several window functions, namely Hanning, Blackman, Bartlett, and Hamming, have been investigated in this work. The main advantages of these window functions are that they taper at their ends and avoid unnatural discontinuity in the speech segment. They also minimize the distortion in the underlying spectrum. Finally, the Hamming window function was selected as it provided the minimum passband ripple and maximum stopband attenuation compared to the other investigated window functions [50].
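The envelope detection stage described above (full-wave rectification followed by an eighth-order Hamming-windowed FIR lowpass filter) can be sketched as follows. The text does not state the cut-off frequency of the lowpass filter, so the 400 Hz value below is an assumption, as are the function names; the filter is designed with the standard windowed-sinc method and may differ in detail from the authors' MATLAB design.

```python
import math


def hamming_lowpass(order, cutoff_hz, fs_hz):
    """Design a windowed-sinc FIR lowpass filter with (order + 1) taps."""
    fc = cutoff_hz / fs_hz          # cut-off in cycles per sample
    mid = order / 2.0
    taps = []
    for k in range(order + 1):
        x = k - mid
        # Ideal lowpass impulse response, with the x == 0 limit handled explicitly.
        ideal = 2.0 * fc if x == 0 else math.sin(2.0 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * k / order)  # Hamming window
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at 0 Hz


def full_wave_rectify(signal):
    """Full-wave rectification: take the absolute value of every sample."""
    return [abs(s) for s in signal]


def fir_filter(taps, signal):
    """Causal FIR filtering by direct convolution, assuming zero initial state."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * signal[n - k]
        out.append(acc)
    return out


def envelope(signal, fs_hz=13000.0, cutoff_hz=400.0, order=8):
    """Rectify one channel and smooth it (cut-off value is an assumption)."""
    taps = hamming_lowpass(order, cutoff_hz, fs_hz)
    return fir_filter(taps, full_wave_rectify(signal))


if __name__ == "__main__":
    fs = 13000.0
    # Hypothetical channel signal: a 1 kHz tone lasting 10 ms.
    tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(130)]
    print(envelope(tone, fs_hz=fs)[-5:])
```

With only nine taps, the filter smooths the rectified signal gently; a sharper envelope would require a higher order than the one stated in the text.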
Finally, the detected signal envelope in each channel was used to modulate a biphasic pulse train. A non-linear mapping technique was used to produce the biphasic pulse train so that the interferences of the pulses in different channels were minimized.
The eight filters (mentioned above) were replaced by eight gammatone filters in the second model, while the other components remained the same. The pre-processed audio signals were divided into eight channels by using these eight gammatone filters. The name gammatone comes from the fact that the envelope of the impulse response of those filters is similar to the gamma function. Moreover, the fine structure of the impulse response is a tone at the center frequency of the filter, $f_0$ [51,52]. Those gammatone filters perform spectral analysis and convert an acoustic wave into a multichannel representation by mimicking the basilar membrane motion [53]. The gammatone filter has an impulse response that is similar to that of a cat's cochlea [54], and it is defined by:

$$h(t) = c\,t^{n-1} e^{-2\pi b t} \cos(2\pi f_0 t + \phi)\,u(t) \tag{1}$$

where $c$ is a constant, $n$ is the filter order, $b$ is the temporal decay coefficient, $f_0$ is the center frequency of the filter, $\phi$ is the carrier phase, and $u(t)$ is the unit step function. The filter order, $n$, controls the relative shape of the envelope, which becomes less skewed when $n$ increases. The carrier phase, $\phi$, determines the relative position of the envelope. Let us assume that the carrier component is denoted by $s(t) = \cos(2\pi f_0 t + \phi)$ and the gammatone distribution function is defined by $r(t) = t^{n-1} e^{-2\pi b t} u(t)$. Hence, the impulse response of the gammatone filter can be expressed as $h(t) = c\,s(t)\,r(t)$. The parameter $b$ determines the duration of the impulse response and hence determines the bandwidth of the gammatone filters, and the parameter $n$ determines the tuning or quality factor (Q) of the filter. Figure 7 shows the impulse response of the gammatone filter with its constituent components. In the plot, the factor $c$ was set to $b^n/(n-1)!$ to make the area under the curve of the gamma distribution equal to one [26]. The temporal decay coefficient $b$ was set to 125, and the carrier frequency, $f_0$, was chosen to be 1000 Hz.
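To make the decomposition of the impulse response into a carrier and a gamma-shaped envelope concrete, the following minimal sketch evaluates both components in Python. The values b = 125 and f0 = 1000 Hz follow the settings quoted for Figure 7; the order n = 4, the unit scaling c = 1, the 13,000 samples/s rate, and the function names are assumptions made for illustration.

```python
import math


def gammatone_envelope(t, n=4, b=125.0):
    """Gamma-shaped envelope r(t) = t^(n-1) * exp(-2*pi*b*t) for t >= 0, else 0."""
    if t < 0:
        return 0.0
    return t ** (n - 1) * math.exp(-2.0 * math.pi * b * t)


def gammatone_carrier(t, f0=1000.0, phi=0.0):
    """Carrier s(t) = cos(2*pi*f0*t + phi)."""
    return math.cos(2.0 * math.pi * f0 * t + phi)


def gammatone_impulse_response(t, n=4, b=125.0, f0=1000.0, phi=0.0, c=1.0):
    """Impulse response h(t) = c * s(t) * r(t); c = 1 is an arbitrary scaling."""
    return c * gammatone_carrier(t, f0, phi) * gammatone_envelope(t, n, b)


def envelope_peak_time(n=4, b=125.0):
    """Time at which r(t) is largest, from dr/dt = 0: t = (n - 1) / (2*pi*b)."""
    return (n - 1) / (2.0 * math.pi * b)


if __name__ == "__main__":
    fs = 13000.0  # assumed sampling rate, matching the pre-processing stage
    samples = [gammatone_impulse_response(k / fs) for k in range(int(0.025 * fs))]
    print(f"{len(samples)} samples, envelope peaks near {envelope_peak_time() * 1e3:.2f} ms")
```

Increasing b shortens the envelope and therefore widens the filter, which is the trade-off between duration and bandwidth noted above.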
The shape of the magnitude characteristic of the gammatone filters with order 4 is very similar to that of the roex function [55] that is commonly used to represent the magnitude response of the human auditory filter [56,57]. The Fourier transform of $h(t)$ is given by $H(f)$ and can be expressed as:
$$H(f) = \frac{c\,(n-1)!}{2\,(2\pi b)^{n}} \left[ e^{j\phi} \left(1 + j\,\frac{f - f_0}{b}\right)^{-n} + e^{-j\phi} \left(1 + j\,\frac{f + f_0}{b}\right)^{-n} \right] \qquad (2)$$
A complete derivation of $H(f)$ can be found in Appendix A. The impulse response, $h(t)$, and the transfer function, $H(f)$, of the gammatone filter with various $f_0/b$ are plotted in Figure 8, which shows that the two frequency components of the gammatone filters do not interfere with each other when $f_0/b > 8$. In this work, we selected $f_0/b = 9$.
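Taking the Fourier transform of $h(t) = c\,s(t)\,r(t)$ yields two terms, one centered at $+f_0$ and one at $-f_0$, each proportional to $(1 + j(f \mp f_0)/b)^{-n}$. The following sketch shows one simple way to quantify how much the term centered at $-f_0$ contributes at $f = f_0$ for a given ratio $f_0/b$. The function names and the "leakage" measure are hypothetical illustrations, not the criterion used to produce Figure 8.

```python
import cmath
import math


def gammatone_tf_terms(f, f0, b, n=4, phi=0.0, c=1.0):
    """Return the two terms of H(f): the one centered at +f0 and the one at -f0."""
    scale = c * math.factorial(n - 1) / (2 * (2 * math.pi * b) ** n)
    pos = scale * cmath.exp(1j * phi) * (1 + 1j * (f - f0) / b) ** (-n)
    neg = scale * cmath.exp(-1j * phi) * (1 + 1j * (f + f0) / b) ** (-n)
    return pos, neg


def leakage_at_center(ratio, n=4):
    """|term at -f0| / |term at +f0|, evaluated at f = f0, for ratio = f0 / b.

    Hypothetical measure of interference between the two components.
    """
    b = 1.0          # only the ratio f0 / b matters for this quantity
    f0 = ratio * b
    pos, neg = gammatone_tf_terms(f0, f0, b, n)
    return abs(neg) / abs(pos)


for ratio in (1, 3, 9):
    print(ratio, leakage_at_center(ratio))
```

At $f = f_0$ this ratio reduces to $(1 + 4(f_0/b)^2)^{-n/2}$, so it falls rapidly as $f_0/b$ grows.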
Another advantage of selecting $f_0/b = 9$ is that the bandwidth becomes proportional to $b$, and it is claimed in [58] that the bandwidth (equivalent rectangular bandwidth) becomes independent of $f_0$ when $f_0/b > 3$. The detailed proof is shown in Appendix B. The center frequency and the bandwidth of the gammatone filters are listed in Table 2, while the magnitude spectrum of the gammatone filterbank is shown in Figure 9. The filters are logarithmically spaced in frequency resolution similar to the basilar membrane's motion, as shown in this figure.
Another main system component is the classifier, as shown in the proposed system's last sub-system presented in Figure 4. The processed signal from the cochlear model is applied to the input of a classifier for binary classification. In this work, a CNN was employed for this purpose. The CNN presented in [59] was adopted and optimized to implement the proposed system. The CNN includes feature extraction and classifier networks. The feature extractor produces a feature map based on the input data. The feature map accentuates the unique features from the original data. Consequently, the extracted feature map was applied to the classification neural network. The classification neural network operates on the feature map and performs classification functions.
The feature extractor network is a special kind of neural network whose synaptic weights are determined via the training process. Usually, the feature extractor network consists of stacks of convolutional layer and pooling layer pairs, as shown in Figure 4. It is widely accepted that pattern recognition algorithms perform better when the feature extractor network contains more layers. However, deeper networks are challenging to train because they impose a substantial computational burden on the system [60]. Considering this limitation, this work used one convolutional layer as the feature extractor network.
Unlike the layers of other conventional neural networks, the convolutional layer does not employ connection weights or a weighted sum. Instead, filters are applied to the input data to produce a feature map. In this work, 20 convolutional filters of size 11 × 11 were used. An activation function processes the feature map produced by the convolutional filters; in this work, the ReLU was used as the activation function. The output produced by the convolutional layer is then passed through the pooling layer, which reduces the data size by combining the neighboring data of a certain area into a single representative value. In this work, mean pooling over 2 × 2 regions was used. The data produced by the pooling layer enters the classifier network, which consists of a hidden layer and an output layer. A backpropagation algorithm was used to determine the weight vectors of this classification network. The hidden layer has 100 nodes that also use the ReLU activation function. The output layer of the CNN was constructed with a single node, as the decision made by the classifier is binary. The Softmax function was used at the output node.
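To give a sense of the size of the network described above, the following sketch computes the layer output shapes and the number of trainable parameters. The input size (64 × 64, single channel), the unpadded convolution with stride 1, and the non-overlapping pooling are assumptions made only for illustration, since these details are not stated here. The function name `cnn_layer_shapes` is likewise hypothetical.

```python
def cnn_layer_shapes(height, width, channels=1, n_filters=20, kernel=11,
                     pool=2, hidden=100, outputs=1):
    """Shapes and parameter counts for conv -> ReLU -> mean pool -> dense -> output.

    Assumes an unpadded, stride-1 convolution and non-overlapping pooling;
    these choices are illustrative and are not taken from the paper.
    """
    conv_h, conv_w = height - kernel + 1, width - kernel + 1
    pool_h, pool_w = conv_h // pool, conv_w // pool
    flat = pool_h * pool_w * n_filters
    params = {
        "conv": n_filters * (kernel * kernel * channels + 1),  # weights + biases
        "hidden": flat * hidden + hidden,
        "output": hidden * outputs + outputs,
    }
    return {
        "conv": (conv_h, conv_w, n_filters),
        "pool": (pool_h, pool_w, n_filters),
        "flat": flat,
        "params": params,
    }


# Hypothetical 64 x 64 single-channel input
info = cnn_layer_shapes(64, 64)
print(info)
```

Under these assumptions, almost all trainable parameters sit in the connection between the pooled feature map and the 100-node hidden layer, which illustrates why adding further layers increases the computational burden.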

Results
To evaluate the performance of the proposed system, the following metrics were used [61,62].
Accuracy is the ratio of the correctly predicted observations to the total observations, as defined by:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively, with pathological voices taken as the positive class. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The precision is defined by:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. It is formulated by:
$$\text{Recall} = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of precision and recall. Therefore, this score takes both false positives and false negatives into account. The F1 score is defined by:
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
The negative predictive value (NPV) is the fraction of negative test outcomes that correctly identify healthy individuals. The NPV is defined by:
$$\text{NPV} = \frac{TN}{TN + FN}$$
To investigate the performance of the proposed system, 10 simulations were conducted using the first investigated model consisting of bandpass filters. First, the CNN was trained with 100 control and 100 pathological samples. Five-fold cross-validation was used to validate the training. The simulations were run for enough epochs to achieve a training accuracy of 100%. Once the network was trained, another 100 control samples and 100 pathological samples were used to test its performance. The training, validation, and testing results of the proposed algorithm for the first model are listed in Table 3. The table shows that the proposed system's average training, validation, and testing accuracies are 100%, 85.96%, and 77.91%, respectively. The testing performances of the proposed system in terms of the true positive fraction (TPF), true negative fraction (TNF), false positive fraction (FPF), and false negative fraction (FNF) are listed in Table 4, and the corresponding classification matrix is shown in Table 5. Based on the data presented in Table 5, it can be concluded that the proposed system can correctly detect pathological voices with an accuracy of 76.67% using the first model. On the other hand, the system can detect control (i.e., normal) voices with an accuracy of 79.17%.
Table 3. Training and testing accuracies with bandpass filters.

Ten more simulations were conducted using the second model consisting of the gammatone filters with the same set of control and pathological samples that were used in the previous simulations. The proposed algorithm's training, validation, and testing results are listed in Table 6. This table shows that the average training, validation, and testing accuracies of the proposed system are 100%, 81.98%, and 77.50%, respectively. The testing performances of the proposed method in terms of TPF, TNF, FPF, and FNF are listed in Table 7 and the corresponding classification matrix is shown in Table 8. Based on the data presented in Tables 7 and 8, it can be concluded that the proposed system can correctly identify pathological voices with an accuracy of 83.30% adopting the second model. On the other hand, the system can detect control (i.e., normal) voices with an accuracy of 71.67%.
Table 6. Training and testing accuracies with gammatone filters.

The performance comparisons of the two investigated models are listed in Table 9. The proposed system performed almost equally in terms of accuracy for both models. The recall was considerably higher for the model with gammatone filters, whereas the precision was higher with bandpass filters. However, the F1 score, which considers both recall and precision, was higher for the gammatone filters, as was the NPV. These results support implementing a signal processing-based pathological voice detection system with gammatone filters, incorporating the functionality of an optimally simulated cochlear implant processing system.
Finally, the performance results of the proposed system were compared with those of other published works, and the comparison is presented in Table 10. As listed in this table, the spectrogram and mel-spectrogram audio features were used in [21,27], and the maximum accuracy achieved was 71% in both works. Compared to those works, the proposed system achieved higher accuracies (i.e., 77.9% and 77.5%) for the two studied models. Moreover, the achieved results are competitive with those of [31], where multiple features and mixed pathologies were considered with the children subgroup. Additionally, in [33], a high F-measure (87.75%) was achieved using vowel samples, but with a speaker-specific identification system.
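The relationship between the per-class detection rates quoted above and the summary metrics can be checked with a short calculation. The sketch below assumes a balanced test set (100 control and 100 pathological samples, as stated for testing) and treats the quoted percentages as counts per 100 samples of each class, with pathological voices as the positive class. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix entries."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "npv": tn / (tn + fn),
    }


# Per-class detection rates quoted in the text, per 100 samples of each class
# (assumption: balanced test set, pathological = positive class)
bandpass = classification_metrics(tp=76.67, tn=79.17, fp=100 - 79.17, fn=100 - 76.67)
gammatone = classification_metrics(tp=83.30, tn=71.67, fp=100 - 71.67, fn=100 - 83.30)

for name, m in (("bandpass", bandpass), ("gammatone", gammatone)):
    print(name, {k: round(100 * v, 1) for k, v in m.items()})
```

Under these assumptions, the computed F1 scores round to 77.6% and 78.7% for the bandpass and gammatone models, respectively. The gammatone model shows the higher recall and NPV, and the bandpass model the higher precision.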

Conclusions
This paper presented a novel, non-invasive pathological voice detection system based on a cochlear simulation model. Two models were considered in this work: one uses a bank of bandpass filters, and the other uses gammatone filters. It has been shown that the gammatone filter is more suitable for voice pathology identification through the signal processing steps involved in cochlear implants. It has also been demonstrated that the gammatone filters with $f_0/b = 9$ are the optimum choice for this purpose. The speech samples were processed using these two models, and the processed signals were applied to the input of a CNN, which acted as a binary classifier to detect pathological voices. Selecting suitable features to extract from the speech samples is a challenging issue, and, in general, no single feature or feature vector is widely accepted as providing the best accuracy. The proposed technique eliminates acoustic feature extraction from the speech samples before applying the classification algorithm. The simulation results presented in this paper have shown that the proposed system achieved almost equal accuracy with the two proposed models. However, the higher F1 score for the model with gammatone filters indicates its better applicability for pathological voice identification through the cochlear implant simulation model.
This work focused only on discriminating pathological voices from normal voices using a signal processing approach adopted from the simulated cochlear implant processing system. Achieving better accuracy and determining the progression level of voice pathology should be considered as the next challenges. In future work, a pre-trained deeper network with a transfer learning approach can be used to improve this system's classification accuracy. Additionally, to ensure the validity of the results for the proposed algorithm, the performance of the present system needs to be compared with that of popular machine learning algorithms such as SVM, KNN, and ANNs.