Behavioral Pattern Analysis between Bilingual and Monolingual Listeners’ Natural Speech Perception on Foreign-Accented English Language Using Different Machine Learning Approaches

Speech perception in an adverse or noisy environment is a complex and challenging human process, which is made even more complicated by foreign-accented language for both bilingual and monolingual individuals. Listeners who have difficulties in hearing are affected most by such situations. Despite considerable efforts, increasing speech intelligibility in noise remains elusive. Considering this opportunity, this study investigates the behavioral patterns of Bengali–English bilinguals and native American English monolinguals listening to foreign-accented English under bubble noise, Gaussian or white noise, and quiet conditions. Twelve normal-hearing participants (six Bengali–English bilinguals and six native American English monolinguals) took part in this study. Statistical computation shows that speech with different noise has a significant effect (p = 0.009) on listening for both bilinguals and monolinguals under different sound levels (e.g., 55 dB, 65 dB, and 75 dB). Six machine learning approaches (Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-nearest neighbors (KNN), Naïve Bayes (NB), Classification and regression trees (CART), and Support vector machine (SVM)) are tested and evaluated to differentiate between bilingual and monolingual individuals from their behavioral patterns in both noisy and quiet environments. Results show that the best performance was observed using LDA, which successfully differentiated between bilingual and monolingual individuals 60% of the time. A deep neural network-based model is proposed to improve this measure further and achieved an accuracy of nearly 100% in differentiating between bilingual and monolingual individuals.


Introduction
Listeners have demonstrated difficulties in understanding speech when exposed to various background noise and reverberation conditions [1,2]. Speech perception is a complex process during which the auditory system perceives sound and interprets it as linguistic information. Speech perception in noise relies on a complex interaction between the auditory system and the cognitive skills of a listener, who must alternate between the target speech and the competing noise [3]. Background noise, the air interface between speakers, poor room acoustics, foreign accents, and reverberation are common reasons for an individual listener's inability to recognize speech completely [4,5]. Researchers found that older listeners were negatively affected by time-compressed speech that simulates a rapid rate of speech [6,7]. Foreign-accented speech alters the temporal characteristics of speech in everyday conditions [1]. Changes in the rhythm and tonal patterns, as well as in the signal identity of consonants and vowels, can influence the timing structure of the whole utterance in accented speech [1,8,9]. In quiet conditions, researchers have investigated the ability of older listeners to understand accented English [10,11].
Supervised speech segregation (the learned separation of target speech from background interference) has shown significant enhancement of human speech intelligibility in noisy environments, with a clear indication that Deep Neural Network (DNN)-based supervised speech segregation is a promising approach to new acoustic environments [12]. Healy et al. [13,14] showed improvement in the intelligibility of noisy speech. The exploration of DNN-based speech separation in noisy environments has also been presented in the literature [15–18]. However, behavioral pattern recognition (the identification of recurring chains of behavior that characterize a particular group from input data such as images, speech, speech ratings, or text) for bilingual individuals' speech perception in noisy environments has not yet been evaluated or explored. Thus, there is a need to investigate both machine learning and DNN-based behavioral pattern recognition for the speech in noise (SIN) perception of Bengali-English bilingual and native English monolingual individuals listening to foreign-accented English in quiet and noisy environments.
Approximately 19.7% of the U.S. population speaks a language other than English at home, according to the U.S. Census Bureau [19], which projected that bilingualism would continue to rise in the United States in the near future. Previous study results showed that, in 2014, the U.S. Hispanic population reached 60 million, and this number is estimated to reach 106 million by 2050 [20]. This growing population diversity will lead to language diversity as well. Understanding foreign-accented speech in noise can also be more challenging for bilingual listeners. Additionally, different racial groups may be more prone to developing hearing deficiencies. For instance, non-Hispanic white male adults report more hearing loss than other racial adult groups [21]. The language background of listeners can play a vital role in speech perception in adverse acoustic conditions [22]. Therefore, studying the mechanism of speech perception by listeners of different language backgrounds may be of interest to auditory research. Research efforts show that listeners have the ability to quickly adapt to foreign-accented speech, which also improves over time [23]. Cristia et al. (2012) [23] compared the neuronal response (in the form of EEG) of normal-hearing individuals to both foreign-accented and native-accented speech. They monitored brain activity through the EEG and suggested that the brain may respond differently to different accents [24]. Tabri et al. conducted an experiment on English speech perception in quiet and at different noise levels (50, 55, 60, 65, and 70 dB) using the speech perception in noise (SPIN) test [25]. Their results showed that the bilingual and trilingual listeners performed similarly to monolinguals, but their performance declined rapidly at 65 and 70 dB SPL. Lotfi et al. [26] studied 92 individuals to evaluate the differences between Kurd-Persian bilingual and Persian monolingual speech perception in noise.
Their results demonstrated that Kurd-Persian bilinguals performed poorly in the quick speech in noise (Q-SIN) test; however, they performed better than monolingual Persians on the consonant-vowel in noise (CV) test. Krizman et al. [27] investigated linguistic processing demands in Spanish-English bilinguals and English monolinguals to identify their performance under different task demands. Skoe et al. [28] investigated the source of the difficulties experienced by English-proficient bilingual listeners while listening to English speech in noise and found that performance declined as the signal-to-noise ratio (SNR) dropped. Barbosa et al. found that, in the background noise condition, bilingual individuals make more errors than monolinguals; in addition, they found that individuals who learn English at an earlier age make fewer errors in a noisy situation [29].
Bidelman et al. [30] showed that bilinguals require around 10 dB more SNR to match monolingual listeners in adverse conditions; in addition, they found that Broca's area activity compensates for SIN perception in monolinguals but not in bilinguals. Other studies [31–35] investigated monolingual and bilingual listeners' speech-in-noise performance in everyday listening environments.
Human speech intelligibility is a key research topic that spans various subjects, such as acoustics engineering, audiometry, phonetics, and human factors. In the twenty-first century, with the increase in bilingualism, it is critical to assess the challenges faced by monolingual and bilingual individuals during communication in noisy acoustic environments in order to improve speech intelligibility. The development of fine-tuned, automatic Artificial Intelligence (AI)-based hearing aids for Hearing Impaired (HI) individuals would be a valuable contribution to increasing speech intelligibility in adverse acoustic conditions. To that end, a distinctive variety of elements (e.g., bilingualism, language, foreign accent, behavioral pattern) needs to be considered. This study investigates the effects of foreign accent on speech perception for Bengali-English bilingual and native American English listeners, specifically: (1) Does human behavior show any significant difference for foreign-accented language in quiet or adverse noisy environments? and (2) How do Bengali-English bilingual and native American English speakers differ in quiet and adverse conditions? The overall purpose of this study is to investigate whether the effects of a talker's accent differ significantly between Bengali-English bilinguals and native American English monolinguals in quiet and noisy listening conditions, and to recognize the corresponding behavioral patterns using Artificial Intelligence (AI).

Related Work
Intelligibility is determined by a listener's experience and accuracy in decoding the acoustic signal of a speaker. Assessing a listener's intelligibility has been practiced clinically over the years. To assess speech intelligibility reception, a handful of detection applications have already been introduced, such as automatic intelligibility detection systems. Assigning an object, described by a set of features or variables, to a class is known as a classification task [36]. The applications of classification in daily human activities are wide [37,38]. Classification methods have been used to classify speech intelligibility in the context of recognition or detection. This is a binary classification problem. Artificial intelligence, fuzzy logic, statistical, and formal classification methods have been used in many recognition or detection problems. Classification methods in speech recognition or intelligibility applications have been explored by many research groups. Fook et al. [39] carried out an experiment on the classification of prolongations and repetitions among speakers using the Support Vector Machine (SVM) algorithm. Classification of the speech intelligibility of Parkinsonian speakers using SVM has been explored by Khan et al. [38]. Using the NKI CCRT and TORGO databases with SVM, LDA, and k-NN classifiers, Kim et al. [40] classified pronunciation and voice quality in impaired speech. Elfahal et al. [41] examined an automatic recognition system for mixed Sudanese Arabic-English speech. For the Ngiemboon language, Yemmene et al. [42] explored various characteristics of a deep learning-based automatic speech recognition system. Automatic classification of speech intelligibility for listeners using a Long Short-Term Memory-based system was proposed by Miguel et al. [43]. Listening effort during sentence processing has been explored by Borghini et al. [44].
In addition, several research efforts have been published that apply Deep Neural Networks (DNN) to listeners' speech recognition and intelligibility applications [45–48]. Based on the available literature and to the best of the authors' knowledge, binary classification or recognition of bilingual and monolingual listeners' speech intelligibility reception has not been reported yet.

Data Acquisition
Data available from the literature [49] were used in this study. Data were collected at the Applied DSP Research Laboratory of Lamar University, Beaumont, Texas, USA. Participants included eighteen college student volunteers between the ages of 20 and 27. Six native English speakers and six Bengali-English bilinguals formed two mutually exclusive experimental groups. It was verified (confirmed by the LU Speech and Hearing clinic) that all subjects had normal hearing.
Short-duration (10-12 s) audio fragments spoken by adult British English speakers (male and female) were used as the speech stimuli. The recordings were obtained from the free online repository http://listentogenius.com/ (accessed on 19 January 2018). Speech fragments were delivered at three sound levels: 55 dB, 65 dB, and 75 dB. Some fragments were contaminated by either Gaussian or bubble noise at the same three sound levels to produce various signal-to-noise ratios of −10 dB, 0 dB, 10 dB, and infinity (no noise). Stimuli were delivered diotically to participants using Etymotic insert earphones at a variety of sound levels. One hundred twenty audio stimuli were presented in total in a randomized order with 2 s of silence between them.
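As an illustration of how a noise signal can be scaled to reach a target signal-to-noise ratio, the following minimal Python sketch mixes a toy "speech" signal with Gaussian noise at −10 dB, 0 dB, and 10 dB. It is not the procedure used to prepare the stimuli, which is not detailed here; the function names (`rms`, `mix_at_snr`, `achieved_snr_db`), the 220 Hz tone, and the 8 kHz sampling rate are all assumptions made for the example.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise RMS ratio equals snr_db."""
    if math.isinf(snr_db):
        return list(speech)  # infinite SNR: the clean, noise-free condition
    # SNR (dB) = 20 * log10(rms_speech / rms_noise), solved for the noise gain
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


def achieved_snr_db(speech, mixed):
    """Measure the SNR of a mixture by recovering the added noise."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 20 * math.log10(rms(speech) / rms(residual))


rng = random.Random(0)
# Toy stand-ins (assumptions): a 220 Hz tone for speech, Gaussian samples for noise
speech = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

for snr_db in (-10, 0, 10):
    mixed = mix_at_snr(speech, noise, snr_db)
    print(f"target {snr_db:+d} dB, achieved {achieved_snr_db(speech, mixed):+.2f} dB")
```

Passing `math.inf` as the SNR returns the speech unchanged, which corresponds to the quiet (no noise) condition.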
Experimental details consisted of continuous EEG recording and behavioral data. Additionally, participants were asked to provide their subjective evaluations regarding the quality of the audio fragments that they listened to. For that purpose, the same randomized sequence of 120 audio fragments was used. The quality was evaluated on a 1 to 10 scale where 1 corresponded to "inferior" and 10 represented "excellent" quality.
The primary purpose of the survey was to understand the participants' experience of different kinds of speech with different types of frequency under bubble noise, white noise, and quiet sound level environments.

Methodology
The experiment was conducted with 12 participants: native English-speaking monolinguals and Bengali-English bilinguals. Since the monolingual individuals represented a higher percentage of the participant population, systematic sampling was used to choose 6 participants from the total pool of participants at regular intervals. The survey contained a total of 120 questions, one for each of the 120 speech fragments presented under the different sound conditions (bubble noise, white noise, and quiet). However, among the 120 samples, only 25 speech samples were in the quiet condition, which was the lowest number of samples compared with the bubble and white noise conditions. Thus, 25 samples were chosen from each group to conduct further statistical analysis. Table 1 shows the demographic characteristics of the subjects with the mean value from the 120 audio fragments. MANOVA was used to test for group differences on two or more dependent variables. This experiment considered the following independent and dependent variables for the MANOVA analysis:
• Independent variable: Language (monolingual, bilingual), 2 levels
• Dependent variables: Speech sound (quiet, white noise, bubble noise), 3 variables
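The systematic sampling step described above can be sketched as follows. This is only an illustration: the participant IDs are hypothetical, and the pool size of 12 monolinguals is an assumption (eighteen volunteers minus six bilinguals) rather than a figure stated in the data description.

```python
def systematic_sample(pool, k, start=0):
    """Select k items from pool at a regular interval, beginning at index start."""
    step = len(pool) // k
    if step < 1 or not 0 <= start < step:
        raise ValueError("pool too small for k, or start outside the first interval")
    return [pool[start + i * step] for i in range(k)]


# Hypothetical IDs; a pool of 12 monolingual volunteers is an assumption
monolingual_pool = [f"M{i:02d}" for i in range(1, 13)]
selected = systematic_sample(monolingual_pool, k=6)
print(selected)
```

With a pool of 12 and a target of 6, the sampling interval is 2, so every second volunteer is retained.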

Results
There was a significant difference between bilinguals and monolinguals when considered jointly on the variables bubble noise, white noise, and quiet speech, Wilks' Λ = 0.256, F (3,8), p = 0.009, partial eta squared = 0.744. A separate ANOVA was conducted for each dependent variable, with each ANOVA evaluated at an alpha level of 0.016. These follow-up tests did not show any significant difference between monolingual and bilingual individuals on the individual variables (see Table 2).
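The reported multivariate statistics can be cross-checked with a short calculation. For a MANOVA with two groups, Wilks' lambda converts exactly to an F statistic with (p, N − p − 1) degrees of freedom, where p is the number of dependent variables and N the number of participants, and partial eta squared equals 1 − Λ. The sketch below applies these standard relations to the reported values; the function name `wilks_to_f` is an assumption, and the resulting F value is implied by the reported numbers rather than quoted from the analysis. The per-test alpha is taken here to be a Bonferroni correction of 0.05 across three ANOVAs, which is also an assumption.

```python
def wilks_to_f(wilks_lambda, n_total, n_dependent):
    """Exact F conversion of Wilks' lambda for a two-group MANOVA."""
    df1 = n_dependent
    df2 = n_total - n_dependent - 1
    f_value = (1 - wilks_lambda) / wilks_lambda * df2 / df1
    return f_value, df1, df2


# Reported values: Wilks' lambda = 0.256, 12 participants, 3 dependent variables
f_value, df1, df2 = wilks_to_f(0.256, n_total=12, n_dependent=3)
partial_eta_sq = 1 - 0.256      # valid when there is a single discriminant function
bonferroni_alpha = 0.05 / 3     # assumed origin of the 0.016 alpha level

print(f"F({df1},{df2}) = {f_value:.2f}")
print(f"partial eta squared = {partial_eta_sq:.3f}")
print(f"per-test alpha = {bonferroni_alpha:.4f}")
```

The degrees of freedom (3, 8) and the partial eta squared of 0.744 produced this way agree with the values reported above.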

Correlation Analysis
To determine whether there is any correlation among the three sound conditions based on the user experience, a correlation analysis was conducted using the Pearson correlation. Table 3 presents the correlation between bubble noise, white noise, and quiet speech sound. There was a significant negative relationship between bubble and white noise, r (10) = 0.883, p < 0.001, as shown in Table 3.
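For reference, the Pearson coefficient and the associated t statistic with n − 2 degrees of freedom can be computed as in the minimal sketch below. The two rating lists are hypothetical placeholders for the per-participant mean ratings, not the study data; only the sample size (12) and the magnitude of the reported coefficient (0.883) are taken from the text.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical mean ratings for 12 participants (placeholders, not study data)
bubble = [3.1, 4.0, 2.8, 5.2, 4.4, 3.9, 2.5, 4.8, 3.3, 4.1, 5.0, 3.6]
white = [3.4, 4.2, 3.0, 5.5, 4.1, 4.0, 2.9, 5.1, 3.2, 4.5, 5.3, 3.5]

r_demo = pearson_r(bubble, white)
dof = len(bubble) - 2                 # 10, matching the r (10) notation
t_reported = t_statistic(0.883, 12)   # t implied by the reported magnitude

print(f"demo r({dof}) = {r_demo:.3f}")
print(f"t for |r| = 0.883 with n = 12: {t_reported:.2f}")
```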

Machine Learning Algorithm
Another study was conducted using different machine learning techniques on the survey data of the 12 participants. The following machine learning algorithms were selected for the study: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K Nearest Neighbor (KNN), Gaussian Naïve Bayes (NB), Classification and Regression Trees (CART), and Support Vector Machine (SVM), all with default parameters. The whole experiment was carried out in Python using scikit-learn. To evaluate the performance, fivefold cross-validation was utilized, and the results are presented as the average (avg.) over the five folds. Table 4 presents a summary of the performance of all the algorithms on the survey data. Note that, in this first experiment, the performance of all the machine learning algorithms was low. To improve the performance, another experiment was carried out after standardizing the dataset. After standardization, some improvement was observed in the performance of those algorithms. Table 5 summarizes the overall performance of the algorithms after scaling the dataset. The most noticeable change was that the overall accuracy of LR increased from 30% (Table 4) to 60% (Table 5), which is still not satisfactory as a final result. In Figure 1, a clustered bar chart is used to compare the performance of the six machine learning algorithms before and after data standardization. The LR method showed the highest accuracy improvement among all the algorithms, from 30% to 60%. In contrast, the performance of the CART algorithm decreased from 30% to 20%, a relative drop of about 33%.
Since most of the machine learning algorithms performed poorly on the dataset, another experiment was conducted using a deep learning approach.
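The comparison described in this section can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the original experiment: the feature matrix is random placeholder data shaped like the survey (12 participants, 120 ratings on a 1 to 10 scale), the folds are stratified and shuffled with a fixed seed, and standardization is placed inside a pipeline so that it is fitted on each training fold only. None of these choices are specified in the text.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data (assumption): 12 participants x 120 ratings in 1..10
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(12, 120)).astype(float)
y = np.array([0] * 6 + [1] * 6)  # 0 = monolingual, 1 = bilingual

models = {
    "LR": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "CART": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
}

# Fivefold cross-validation; stratification and shuffling are assumptions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    raw = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    scaled = cross_val_score(
        make_pipeline(StandardScaler(), model), X, y, cv=cv, scoring="accuracy"
    ).mean()
    results[name] = (raw, scaled)
    print(f"{name}: mean accuracy {raw:.2f} (raw), {scaled:.2f} (standardized)")
```

Because the placeholder ratings are random, the printed accuracies carry no meaning; the sketch only shows the structure of the two experiments. With only 12 samples, each test fold holds two or three participants, so fold-level accuracies are coarse and the averages can vary considerably with the split.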

Behavioral Pattern Recognition Using a Deep Learning Approach
To develop the neural network model, the Keras Python library was used. It is a Python library that can run on top of Theano or TensorFlow.

Proposed Model
A sequential model was created, and additional layers were added until a significant amount of improvement was observed during the training phase. One hundred and twenty input variables were used, as the data set contained 120 input parameters. The most optimal network was chosen after several trials with random input features. Note that the defined neural network was fully connected, built using the Dense class. More details on how a deep learning architecture is developed can be found in [50]. Figure 2 shows the architecture of the network.
Figure 1. Algorithm performance before and after data standardization.

Figure 2 shows that the network was initiated by 120 inputs, and two hidden layers contained 60 and 30 neurons, respectively. The network required an activation function in each layer; here, it utilized the rectifier activation function on the first three layers and the sigmoid activation function on the output layer.
The sigmoid activation function was used to ensure the network output would remain between 0 and 1 since the network was designed for binary classification. Details regarding "Relu" and "Sigmoid" can be found at [51]. Note that training a network means finding the right set of weights to make a better prediction. Thus, it is necessary to specify the loss function to evaluate a set of weights. In this case, the logarithmic loss was used, which is defined in Keras as "binary_crossentropy". An adaptive learning rate optimization algorithm (Adam) was used as an optimization algorithm due to its robust performance on binary classification. More details regarding the 'Adam' optimizer can be found in [52].
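To make the roles of the activation functions and the loss concrete, the following dependency-free Python sketch runs one forward pass through a fully connected network and evaluates the logarithmic loss. It is an illustration only, not the Keras model itself: it assumes that the 120 inputs feed two rectifier hidden layers of 60 and 30 units followed by a single sigmoid output, uses small random weights instead of trained ones, and clips predictions with a small epsilon (an assumption) to keep the logarithm finite.

```python
import math
import random


def relu(x):
    """Rectifier activation: passes positive values, zeroes out negatives."""
    return max(0.0, x)


def sigmoid(x):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: weighted sum plus bias, then activation."""
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean logarithmic loss for binary labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)


rng = random.Random(0)
layer_sizes = [120, 60, 30, 1]      # assumed reading of the architecture
activations = [relu, relu, sigmoid]

# Random (untrained) weights, one weight row per output unit
layers = [
    ([[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
     [0.0] * n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]


def forward(x):
    """Propagate one 120-value input vector through the network."""
    for (weights, biases), activation in zip(layers, activations):
        x = dense(x, weights, biases, activation)
    return x[0]


n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
sample = [float(rng.randint(1, 10)) for _ in range(120)]  # hypothetical ratings
probability = forward(sample)

print(f"trainable parameters: {n_params}")
print(f"predicted probability: {probability:.3f}")
print(f"loss if the true label is 1: {binary_crossentropy([1], [probability]):.3f}")
```

Under this reading of the architecture, the network has far more trainable parameters than the 12 available participants, which is worth keeping in mind when interpreting the training accuracy.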
The training process runs for a fixed number of iterations through the dataset, called epochs, which needs to be specified when fitting the model. Here, 150 epochs were used with a batch size of 10. Note that the batch size and the number of epochs were chosen experimentally by trial and error. While training the model, the loss was updated from one epoch to the next. During this experiment, after 35 epochs, the accuracy reached 100%, while the loss recorded was only 76%. To understand the network performance, the training loss, validation loss, training accuracy, and validation accuracy were also calculated. Figure 3 shows the training and validation loss, as well as the accuracy. During the training phase, the training and validation loss curves touched at about 120 epochs, and it was decided that no further training was required after that point (Figure 3a). On the other hand, the training and validation accuracy curves in Figure 3b show some discrepancies during 35 epochs. Considering both Figure 3a,b, it is possible to assume that the proposed model performed well on "EEG_data_lamar" with an accuracy of 100%.

Discussion
As a means of understanding the effects of foreign accents on speech perception for Bengali-English bilingual and native American English listeners, the study observed 12 participants' behaviors under bubble noise, white noise, and quiet speech sound level environments. The results showed a significant difference (p = 0.009) between the behavior of the two groups (bilinguals and monolinguals) under various noisy conditions. Additionally, the behavioral performance was analyzed with different machine learning approaches, namely LR, LDA, KNN, CART, NB, and SVM. The resulting analysis showed that it was possible to differentiate between the two groups with 60% accuracy using LR. Hence, a small deep neural network (DNN) was proposed, which achieved 100% accuracy in differentiating between bilinguals and monolinguals. It is relevant to emphasize that none of the reference studies considered the effects of a noisy environment on two distinct groups (bilinguals and monolinguals) using a machine learning or deep learning-based approach, which hinders a direct comparison with the existing literature. Therefore, this study may help researchers and practitioners in the near future to evaluate the effect of noise on multilingual individuals. Apart from the aforementioned advantages, this study also has some limitations, which shall be addressed in future projects:
• During this study, only a limited number of individuals (12 participants) were considered.
• We did not consider other large bilingual populations, such as English-Arabic and Hindi-English speakers, who need to be taken into account for a proper evaluation of the effect of noise on bilingual people on a large scale.
• The performance of the proposed deep neural network may fluctuate when applied to a larger data set.

Conclusions
This study evaluated the participants' experience of foreign-accented speech with different types of frequency under bubble noise, white noise, and quiet speech sound levels (e.g., sound levels: 55 dB, 65 dB, and 75 dB) and with signal-to-noise ratios of −10 dB, 0 dB, 10 dB, and infinity (no noise), comparing bilingual and monolingual individuals. The study focused on young adults. The findings suggest that foreign-accented speech with different noise has a significant effect on listening regardless of whether the person is bilingual or monolingual. A significant difference was also observed between the two groups in quiet and white noise-contaminated speech; however, no such significant difference was measured under bubble noise-contaminated speech. This indicates that listening performance will be mostly similar regardless of one's multilingual capabilities. It seems that no additional advantages are enjoyed through the comprehension of multiple languages. Finally, we tested and evaluated six different machine learning algorithms on the 12 participants' dataset in terms of speech quality ratings in mild-to-moderately by listeners, and the highest accuracy, 60%, was achieved using LDA after data standardization. The speech quality ratings by monolingual and bilingual listeners were somewhat confounded by intelligibility. In addition to this, a deep neural network was developed that differentiated between bilingual and monolingual participants with an accuracy of 100%. Some of the limitations associated with this work can be addressed by conducting experiments with large and imbalanced datasets [53,54], comparing the performance of the proposed methods with other bilingual participants, and explaining the analytic results using explainable AI [55].

Acknowledgments:
The authors would like to acknowledge Gleb V. Tcheslavski and Applied DSP Research Laboratory of Lamar University, Beaumont, Texas, USA, for collecting the data and permission to share the data for this study.

Conflicts of Interest:
The authors declare no conflict of interest.