An Acoustic Way to Support Japanese Children’s Effective English Learning in School Classrooms

Abstract: In this paper, the importance of implementing good acoustic conditions in classrooms using sound amplification systems is investigated to support more effective English education for elementary school children. To date, the failure of teaching English as a second language at Japanese schools has been demonstrated by the poor English conversation ability of those who completed a compulsory six-year English language course at Japanese junior-high and high schools (age 12–18). To amend the situation, teaching English became compulsory at grade three (age 8–9) and above at most Japanese elementary schools in the 2020 academic year. We conducted acoustic measurements of two types of sound amplification systems, one with a pair of PC loudspeakers and another with a loudspeaker array, in a typical classroom at an elementary school in Japan. We also analysed the English listening test results of 216 Japanese native children (age 11–12) who were learning English in their usual classes in Japan, to compare the effects of those two systems. Results of logistic regression analysis adjusted for the discrimination difficulty of word pairs demonstrated a statistically significant association between the correct answer rate of the English tests and classroom acoustic factors. Although, on average, upgrading the sound amplification system had positive effects on the correct answer rate, it also had a negative impact when the word pairs had English phoneme contrasts that do not appear in the Japanese phoneme structure. Combined with the results of the acoustic measurements, it was also revealed that heterogeneous sound fields that depend on seat positions could be compensated for by using sound amplification systems with loudspeaker arrays. Our findings suggest that improvement of both acoustic quality and teaching methods is required for children to acquire English communication skills effectively in their classroom.
Author Contributions: Conceptualization, N.E.; methodology, N.E. and K.K.; measurement, N.E., M.K., I.S. and K.K.; analysis, N.E. and K.K.; writing (original draft preparation), N.E.; writing (review and editing), N.E., M.K., I.S., T.S. and K.K.; visualization, N.E. and K.K.; project administration,


Introduction
English language education at Japanese public schools has been criticized. The average score of the Test of English as a Foreign Language in Japan has been around the lowest among Organisation for Economic Co-operation and Development countries [1]. Many Japanese people cannot hold a simple everyday conversation in English after completing a compulsory six-year English language course in Japan [2]. In April 2020, the age to start teaching English at most Japanese elementary schools was officially lowered to resolve the matter. Grades three and four (age 8-10) have some English lessons to become familiar with the language, whereas grades five and six (age 10-12) learn English as a subject, thus their skills are evaluated through tests [3].
Previous research has shown evidence of poor acoustics found in school classrooms and their negative influence on children in various aspects [4][5][6][7][8][9][10]. Children are often too exhausted from trying to comprehend speech contents; subsequently they have little energy left to perform any cognitive tasks using the obtained information [11,12]. As a consequence, not only language-related skills, such as reading, writing, speaking and listening, but also others, including numeracy, working memory and motivation of children, deteriorated [13][14][15][16][17][18][19][20][21].
The further the distance between a speaker and a listener, the less speech is heard [22]. In a closed space like a classroom, sound reflection and absorption by the ceiling, floor, walls, furniture as well as people present, all integrate together with the sound directly emitted by a speaker to shape diverse characteristics of the sound field [23,24]. Consequently, the intelligibility of a teacher's speech signals fluctuates depending on where children are, even in the same classroom [25]. However, such effects on human perception in real rooms of sizes often seen in classrooms are not well understood [26], compared to those in large indoor spaces for performance (e.g., concert halls and auditoria).
The length of time in which children are required to listen to their teachers, peers, or audio materials accounts for the majority of learning time children spend in their classrooms [9], taking up 45-75% in total [7,27]. With the teaching style shift towards what is sometimes called "active learning" [28], children are required to listen more and more in classrooms that are noisier than ever. The introduction of more group work also leads to children's learning abilities being more closely linked with their hearing skills than ever [29].
It is also important to note that hearing in children is not as sophisticated as that in adults, especially in noisy environments [14,30]. Thus, adults are often unable to perceive the hearing difficulties experienced by young children even in the same classroom [7,31]. The way one speaks usually varies under noisy conditions. This, known as the "Lombard effect", is recognised not only to reduce intelligibility but also to strain speakers. Such deteriorated speech signals are inadequate for children to learn anything, especially languages [32,33]. Moreover, for those aged between 5 and 15 years in particular, it is common to contract otitis media, a middle ear infection that causes temporary hearing loss of up to 40 dB hearing level and often lasts more than one month [8].
Many teachers have also expressed their discontent with poor acoustics at schools [6,34,35]. Teachers are more likely to report voice disorders than people in any other profession [34]. Teacher absences due to voice-related health problems impose non-negligible costs. Since much information is delivered in speech form in classrooms, good acoustics there would benefit a much larger population than just children with already-diagnosed hearing impairments.
Although the above-mentioned problems have been identified, there are no clear guidelines regarding the quality of acoustic signals to be presented to children, to protect and support both children and teachers in Japanese school classrooms, other than the school building design guidelines [36] written from an architectural perspective. Northern and Downs stated [31] that solving such problems with acoustics in classrooms is not too hard. Rather than an absence of funding or of knowledge of the solution, the main issue is a lack of awareness of both the problem and the potentially achievable answers. In addition, the importance of evidence-based education research has been promoted in recent years [37,38]. Education practices should be based on the best available scientific evidence, rather than tradition, personal judgement, or other influences. However, evidence collected in real-life situations at real schools is still insufficient.
The aim of this study is to demonstrate the importance of good acoustic conditions in second and foreign language education. We present our findings using data collected in a real-life environment at a typical elementary school in Japan, with help from Japanese native children who were learning English as their second language. Two kinds of sound amplification systems are utilised; one is the system that had been used at the school and the other is one we prepared as an example alternative for a comparative study. By statistically analysing English listening test results, combined with some acoustic measurements, the effect of more adequately treated classroom acoustics on children's learning is illustrated.

Participants
In total, 216 children from six classrooms at a typical public mixed-gender elementary school in Japan, aged between 11 and 12 years (grade six), participated over two consecutive weeks (212 in week one and 201 in week two) in December 2019. The majority of Japanese elementary schools are public schools. The school was selected after consulting the local education authorities in a city while searching for a school that could cooperate in our study. The number of children who attended each lesson is presented in Table 1. No prior information was available regarding the children's English language skills, English word familiarity, or hearing level, apart from the following:
• At the school, the grade six children had been taking three English lessons every fortnight, taught mainly by a Japanese native teacher in the English learning room, dedicated to English lessons for grades five and six, for roughly 17 months before the current study began.
• They also had an English native teacher assisting in some lessons, typically once a week.
• Each lesson was 45 min long.
We also had no knowledge regarding their intellectual abilities or their socio-economic characteristics.

English Listening Test
English listening tests using recorded speech audio signals emitted by two kinds of sound amplification systems (Figure 1) were given to the participants. Participants were asked to indicate whether they thought the speech sounds of each word pair were the same or not by circling their choice between "the same" and "different" (both written in Japanese) for each question on the provided sheets. The task consists of 42 English word pairs, such as "walk-walk" and "ten-pen". The questions comprise clear speech recordings of either a minimal pair or the same word spoken twice with an interval of at least one second (s), voiced by several English native speakers in a randomised order. The speech audio materials were prepared by editing the sound source provided in the accompanying CDs of the two books [39,40] with permission of the licensor through PLSclear (Publishers' Licensing Services, London, UK), using WaveLab Pro 9.5 (Steinberg, Hamburg, Germany) [41]. Restrictions on letter and word selections were determined according to the availability of the sound source. The maximum amplitude for each word was adjusted to be the same. For the speech audio materials, the balance between vowels and consonants, minimal pairs and same-word pairs, and male and female voices was decided manually during the preparation period in an attempt to make the test as fair as possible within all the restrictions applied.
Each participant was given four tests in total, as illustrated in Figure 2. Each test had 21 questions with audio materials approximately four minutes long. Two tests were integrated into each lesson, one at the beginning and another at the end of each lesson. Two lessons were scheduled over two consecutive weeks, with a one-week interval in between. The first test was usually given after an introduction song and a small talk session. After another small talk, often followed by some group work, the second test was completed with some closing remarks added by their Japanese native teacher. All lessons were held during their regular teaching hours in their English room at the school. During the two-week period, lessons were scheduled according to a weekly timetable set by the school. Thus, for example, a class that had an English lesson on Monday starting at 9:30 in the first week had another one on Monday at 9:30 in the following week. Tests were designed so that the same word appeared twice, once a week, and it was played by both sound amplification systems, one after the other, over the two weeks. The order of the questions was randomised using a pseudo random number generator written in Python [42]. The sound amplification system was alternated after every four questions.
To avoid causing too much stress to the children, it was determined in advance that roughly 10 min is the maximum length of time allowed for the test procedures in each lesson in our study. To protect their privacy, we asked each participant to place a sticker on each answer sheet before starting a test. These stickers came in five colours, signifying their seat groups (Figure 3, left), determined in advance to be utilised in our analyses.
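The question ordering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_playlist`, the seed, and the placeholder word-pair labels are hypothetical and are not taken from the authors' script [42]. The sketch shuffles the questions of one test and switches the playback system after every four questions; it does not enforce the additional constraint that each word pair is played by both systems over the two weeks.

```python
import random


def build_playlist(word_pairs, block_size=4, first_system="A", seed=0):
    """Shuffle the questions, then alternate the playback system
    after every `block_size` questions (hypothetical helper)."""
    rng = random.Random(seed)      # seeded so the order is reproducible
    order = list(word_pairs)
    rng.shuffle(order)
    systems = ("A", "B") if first_system == "A" else ("B", "A")
    # Questions 0-3 use the first system, 4-7 the second, and so on.
    return [(pair, systems[(i // block_size) % 2])
            for i, pair in enumerate(order)]


if __name__ == "__main__":
    # 21 placeholder labels stand in for the 21 questions of one test.
    for pair, system in build_playlist([f"pair{i:02d}" for i in range(21)]):
        print(system, pair)
```

Passing `first_system="B"` would start a test with the other system, which is one simple way to vary the assignment between tests.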

Sound Amplification Systems for a Classroom
Two different sound amplification systems, "System A" and "System B", were used for a comparison. System A (Figure 1, left) was a pair of typical PC loudspeakers (S-0264C, Logitech, Lausanne, Switzerland) which had been used during regular English lessons at the school for a long time before the current study began. We kept System A where it had been placed for their previous English lessons, on the bottom shelf of a metal rack with a projector on the top (Figure 1, left), placed in the centre front of the room (shown as "System A" in Figure 3, right). This enabled us to compare "a regular system" found at a school with another brought in with a little more care. System A generates point-source-like signals, thus direct signals would spread and decay by 6 dB per doubling of distance in a free field. Sound emitted by such simple sound amplification systems is usually degraded by the characteristics of the systems themselves.
System B (Figure 1, right) consisted of a signal processor, an amplifier and a loudspeaker array (DP-SP3, DA-250F, and SR-H3L, respectively; TOA, Hyogo, Japan). System B was placed at the front corner of the room (shown as "System B" in Figure 3, right) so that the emitted signals could cover the whole room with little interference caused by sound reflected by the walls, whilst maintaining easier installation, smaller additional costs, and fewer interruptions to learning activities, compared with many other alternatives. System B is designed to realise sound fields that are more evenly spread in a room, and has a relatively flat response. The advantage of loudspeaker arrays is that, over a certain range of listening distances, the emitted acoustic signals behave like those of a line source; therefore, they would decay by only 3 dB per doubling of distance in a free field.
Levels of both systems were adjusted so that the A-weighted equivalent continuous sound pressure level over 10 s, denoted by LAeq,10s [43], was 75 dBA at position (0) (Figure 3, right), using pink noise. This level was determined based on the exposure limit for young children suggested by the World Health Organization, an 8-hour LEX of 75 dBA, in order to protect children from risks of noise-induced hearing loss [44].
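The two idealised free-field decay rates quoted for the systems (6 dB per doubling of distance for a point source, 3 dB for a line source) follow from spherical and cylindrical spreading, respectively. The sketch below only illustrates that arithmetic; the function name and the 1 m reference distance are assumptions made for the example, and a real classroom with reflections will not follow these curves exactly.

```python
import math


def level_drop_db(distance_m, ref_distance_m=1.0, source="point"):
    """Idealised free-field drop of the direct sound level (dB) relative
    to the level at `ref_distance_m`.

    point source: 20*log10(r/r0), i.e. about 6 dB per doubling
    line source:  10*log10(r/r0), i.e. about 3 dB per doubling
    """
    factor = {"point": 20.0, "line": 10.0}[source]
    return factor * math.log10(distance_m / ref_distance_m)


if __name__ == "__main__":
    for r in (1, 2, 4, 8):
        print(f"{r} m: point {level_drop_db(r, source='point'):.1f} dB, "
              f"line {level_drop_db(r, source='line'):.1f} dB")
```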
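For readers unfamiliar with equivalent continuous levels, the following sketch shows how such a level is obtained from sound pressure samples. It assumes that the samples are already calibrated in pascals and already A-weighted; the A-weighting filter itself, which a sound level meter applies internally, is deliberately omitted, and the function name is hypothetical.

```python
import math

P_REF = 20e-6  # reference sound pressure in Pa


def leq_db(pressure_pa):
    """Equivalent continuous level: 10*log10(mean(p^2) / p_ref^2).

    `pressure_pa` is assumed to be an A-weighted, calibrated pressure
    signal covering the whole averaging period (e.g., 10 s).
    """
    mean_square = sum(p * p for p in pressure_pa) / len(pressure_pa)
    return 10.0 * math.log10(mean_square / P_REF ** 2)


if __name__ == "__main__":
    fs = 48000
    # 1 kHz tone with an RMS pressure of 1 Pa (A-weighting is 0 dB at 1 kHz).
    tone = [math.sqrt(2.0) * math.sin(2.0 * math.pi * 1000.0 * n / fs)
            for n in range(fs)]
    print(f"{leq_db(tone):.2f} dB")
```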

Acoustic Measurements
Multiple impulse response measurements in the unoccupied furnished English room were carried out on an additional day, to gain a better understanding of the difference in the acoustic characteristics realised by the two sound amplification systems. Both systems were placed in the same positions as during the English tests (Figure 3, right). The measurements were carried out while no lessons were taking place in or around the school building.
The Exponential Sine Sweeps (ESS) method [45] was used. Signals were recorded by using a microphone of a calibrated class two sound level meter [46] (NL-42, RION, Tokyo, Japan) set at a height of 1 m with a portable recorder (PCM-D100, SONY, Tokyo, Japan), and sampled at 44.1 kHz, 16-bit. ESS signals (30 Hz to 22 kHz, also sampled at 44.1 kHz) were played and recorded several times at each of 10 positions ((0) to (9) in Figure 3, right). Albeit imperfect, the method with averaging and deconvolution in the frequency domain was adopted to obtain each impulse response, calculated with Python, so that non-linear and time-variant behaviours of both the room acoustics and the systems were compensated to a certain extent [47]. The transfer function as well as Reverberation Time (RT) for each octave band and Speech Transmission Index (STI) were estimated by Systune pro (AFMG, Berlin, Germany) [48] in accordance with ISO 3382-2 [49] and IEC 60268-16 [50], respectively. STI is often included in guidelines or standards for classroom acoustics [27], and it was developed as an objective method to predict speech intelligibility due to transmission channels [51]. T60 is estimated from the decay time found between −5 dB and −25 dB (T20), whereas STI is calculated as a weighted average of seven coefficients corresponding to seven octave bands, with centre frequencies from 125 Hz to 8 kHz [52]. Note that these measurement results including STI have known limitations and they only help prediction of true speech intelligibility, which should only be quantified via listening tests employing human perception [50].
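The averaging and frequency-domain deconvolution step can be sketched as follows. This is a minimal illustration assuming NumPy, not the authors' code: the function names, the relative regularisation constant `rel_eps`, and the zero-padding choice are all assumptions. Repeated recordings at one position are averaged, and the spectrum of the average is divided by the spectrum of the sweep, with a small regularisation term so that frequencies outside the sweep range do not blow up.

```python
import numpy as np


def exponential_sine_sweep(f_start, f_end, duration_s, fs):
    """Exponential sine sweep from f_start to f_end (Hz)."""
    t = np.arange(int(round(duration_s * fs))) / fs
    rate = duration_s / np.log(f_end / f_start)
    return np.sin(2.0 * np.pi * f_start * rate * (np.exp(t / rate) - 1.0))


def impulse_response(recordings, sweep, n_ir, rel_eps=1e-6):
    """Average equally long, time-aligned recordings, then deconvolve
    the sweep in the frequency domain (regularised spectral division)."""
    averaged = np.mean(np.asarray(recordings, dtype=float), axis=0)
    n_fft = len(averaged) + len(sweep) - 1   # long enough to avoid wrap-around
    Y = np.fft.rfft(averaged, n_fft)
    X = np.fft.rfft(sweep, n_fft)
    power = np.abs(X) ** 2
    H = Y * np.conj(X) / (power + rel_eps * power.max())
    return np.fft.irfft(H, n_fft)[:n_ir]


if __name__ == "__main__":
    fs = 44100
    sweep = exponential_sine_sweep(30.0, 22000.0, 2.0, fs)
    true_ir = np.zeros(2000)
    true_ir[100], true_ir[900] = 0.5, -0.2   # toy "room": two reflections
    recs = [np.convolve(sweep, true_ir) for _ in range(3)]
    ir = impulse_response(recs, sweep, n_ir=2000)
    print(int(np.argmax(np.abs(ir))))
```

Reverberation time and STI would then be derived from the resulting impulse response by dedicated software, as described in the text.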
The unoccupied noise level was measured in the furnished room, and calculated using the same sound level meter as above.

Statistical Analyses
The English test results were analysed after the following pre-processing steps. Firstly, data of children who were absent from any lesson were removed. Secondly, the week number assigned to data obtained in week two from children who had been absent in week one was replaced by that of week one, in order to separately examine the impact of children getting used to the type of tests after week one.
To evaluate the discrimination difficulty of the word pairs for Japanese native children, the correct answer rate p_c of each word pair was estimated as the percentage of correct answers. The question difficulty levels were defined as shown in Table 2 and were employed for adjustment in the following analyses.
Both univariate and multivariate logistic regression analyses [53] were used to evaluate the associations between the correct answer rate p_c and independent factors, such as question difficulty level, habituation effects (repetition of tasks), type of sound amplification system and seat group (Table 3). In the multivariate logistic regression analysis, the Akaike Information Criterion (AIC) was used for variable selection. To estimate the bias due to overfitting (called "optimism"), we also carried out bootstrap validation of the multivariate models using 200 repetitions of bootstrapping from the original data [54]. As a performance metric, we used the C-statistic, defined as the area under the receiver operating characteristic curve for the prediction of the correct answer of a task. The optimism for the C-statistic was estimated as the difference between the mean C-statistic of the bootstrap sample based models and that of the original model. The optimism-corrected C-statistic was given by subtracting the optimism from the original C-statistic.
The relationship between seat positions and acoustic characteristics realised by the two sound amplification systems (System A and System B) was explored using yet another sub data set containing difficulty level 1 to 3 words. In this analysis, the English test scores (expressed in percentages) of each participant who attended both lessons were used. Seat groups were categorised into two groups for simplicity: Front (Seat Group 1-3) and Back (similarly, Seat Group 4-5). Normality of the English test scores in each sub set was assessed using the Shapiro-Wilk test. Because the test score distributions showed non-normality, the Wilcoxon signed-rank test was applied for paired comparison of medians, and the Mann-Whitney U test for unpaired comparison. As a multiple comparison correction method, Bonferroni-corrected p-values were employed. p < 0.05 was considered to be statistically significant. All statistical analyses were performed by using R 3.6.0 [55].
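The bootstrap validation of the C-statistic can be illustrated with the sketch below. The analyses in this study were performed in R; this Python version, assuming NumPy, is only a sketch of the general technique. It uses the common Harrell-style estimate of optimism (performance of each bootstrap model on its own bootstrap sample minus its performance on the original data), which may differ in detail from the computation used here. All function names are hypothetical, the logistic model is fitted by plain Newton-Raphson, and the data in the usage example are synthetic.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Newton-Raphson fit of a logistic model; X must contain an
    intercept column. A tiny ridge term keeps the Hessian invertible."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        hessian += ridge * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta


def c_statistic(scores, y):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    diff = scores[y == 1][:, None] - scores[y == 0][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size


def optimism_corrected_c(X, y, n_boot=200, seed=0):
    """Return (apparent C, optimism-corrected C) using bootstrap resampling."""
    rng = np.random.default_rng(seed)
    apparent = c_statistic(X @ fit_logistic(X, y), y)
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))     # resample with replacement
        b = fit_logistic(X[idx], y[idx])
        optimism.append(c_statistic(X[idx] @ b, y[idx])
                        - c_statistic(X @ b, y))
    return apparent, apparent - float(np.mean(optimism))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=400)                      # synthetic predictor
    y = (rng.random(400) < 1 / (1 + np.exp(-(0.5 + x)))).astype(float)
    X = np.column_stack([np.ones_like(x), x])
    print(optimism_corrected_c(X, y, n_boot=50))
```

The pairwise C-statistic is quadratic in the number of observations, so a rank-based formula would be preferable for large data sets.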
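The Bonferroni correction mentioned above simply multiplies each raw p-value by the number of comparisons and caps the result at 1. A minimal sketch follows; the function name is hypothetical and the p-values in the usage example are placeholders chosen for illustration, not results of this study.

```python
def bonferroni(p_values):
    """Bonferroni-corrected p-values: p * m, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.5]            # placeholder p-values, not study data
    for p, p_adj in zip(raw, bonferroni(raw)):
        verdict = "significant" if p_adj < 0.05 else "not significant"
        print(f"raw {p:.3f} -> corrected {p_adj:.3f} ({verdict})")
```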

Table 2. Column headers: Question Difficulty Level; Correct Answer Rate p_c.
Table 3. Explanatory variables assumed in the study.

Ethical Notes
The English listening test study at the school was conducted with the approval of the Human Research Ethics Committee of the Graduate School of Engineering Science, Osaka University (protocol code R1-16, 3 December 2019). Participation in the study was agreed to by the school's head teacher, and the children's guardians were informed about the study in advance.

Acoustic Measurements
In the classroom, the measured unoccupied noise level, LAeq,10s, was 31.3 dBA with the air conditioners turned off. The occupied noise level varied between about 40 dBA and 50 dBA during the English tests, with no one speaking inside the room, although lessons in other rooms and activities in the playground continued. The American Speech-Language-Hearing Association recommends ensuring an unoccupied classroom noise level ≤35 dBA [56]. Correspondingly, the Japanese school building guidelines [36] recommend LAeq,T ≤ 40 dBA, where T indicates the measurement period, which can be determined depending on the situation. Table 4 shows the estimated RT and STI, calculated by the methods explained in Section 2.4. Likewise, the recommended RT for unoccupied furnished classrooms is 0.6 s at maximum, as the average over octave bands with centre frequencies of 500 Hz, 1 kHz, and 2 kHz (according to the American standard [56,57], for classrooms ≤ 283 m³) or similarly over octave bands with centre frequencies of 500 Hz and 1 kHz (according to the Japanese guidelines [36], for classrooms ≤ 200 m³ approximately). Unlike the common practice in architectural acoustics of simply measuring the room reverberation and STI, the results we obtained inform us of the combined characteristics of the unoccupied furnished room and each of the sound amplification systems used. As shown in Table 4, the RT values of both systems are almost within the recommended range, with little variation between the two sets, although System B is slightly better than System A. The STI values, on the other hand, are better with System B than with System A at all measurement positions.
According to the indication provided in the standard [50], STI > 0.76, observed at all positions with System B, implies "Excellent intelligibility", whereas STI > 0.66, observed at all positions except position (2) with System A, implies "High speech intelligibility", and STI > 0.62 at position (2) with System A still implies "Good speech intelligibility".
Table 4. RT (s) for octave bands with centre frequencies 500 Hz, 1 kHz, and 2 kHz (in accordance with the American standard [57]), and STI for the two systems. The measurement positions are indicated in Figure 3.
Compared with the RT values, the difference between the two systems was more clearly observed in the transfer functions, as depicted in Figure 4. The frequency bands between 125 Hz and 8 kHz are understood to contain most of the elements of English speech sound [31]. Therefore, signals over this range should be clearly audible (e.g., SNR ≥ +15 dB is recommended in [56]) when English speech stimuli are presented to children in their classrooms. The frequency responses of System A reveal that the signals arriving at the listening points differ from one another more than those of System B. This indicates the difficulty of using System A to ensure that everyone can hear speech sound well wherever they are in the room without harming anyone's hearing, including those close to the loudspeakers. System B exhibits a relatively flat response, at least over the frequency bands between 200 Hz and 10 kHz, at many measurement positions. The shortcomings observed at some positions below 500 Hz for both systems (e.g., System B at position (7)) are possibly due to the room modes, although the cause cannot be fully identified with the collected information alone.

English Test
The correct answer rate p_c is shown in Table 5, in which the list is sorted according to the difficulty levels in ascending order. The mean correct answer rates were 94.2% in Level 1, 73.7% in Level 2, 50.6% in Level 3 and 22.3% in Level 4. The word pairs categorized into Level 4 showed extremely low correct answer rates. Note that the chance level in a two-choice forced task, like our English test, is 50%. Nevertheless, the Japanese native children yielded much lower rates, p_c < 50%, with some word pairs. This finding would imply the existence of a bias effect caused by the difference between Japanese and English phonemes and/or a lack of English knowledge.
All explanatory variables in Table 3 were selected for the multivariate logistic regression model after applying variable selection based on AIC minimisation. The Odds Ratio (OR) and its 95% Confidence Interval (CI) of each explanatory variable in the univariate and multivariate logistic regression analyses are shown in Table 6. An OR greater than 1 implies a positive association with an increased correct answer rate, whereas an OR less than 1 implies a negative association. The bootstrap validation result is shown in Table 7. As expected, the most powerful predictor of the correct answer rate p_c was the difficulty level of the word pairs. In the multivariate analysis, the correct answer rates p_c of the second week were better than those of the first week, indicating that many children did indeed get used to the task over those two weeks. Additionally, the sound amplification system type also improved the correct answer rates p_c to a certain extent. With regard to seat positions, those who sat at the back of the classroom (Seat Group 4-5) performed worse than those whose seats were located in the Seat Group 1 area. The habituation effect within a single lesson remains uncertain in the results at this point.
Table 7. Bootstrap-validated C-statistic for the multivariate logistic regression model shown in Table 6.
Multivariate analysis excluding the effects of the difficulty level can be beneficial for evaluating the effects of other factors. Therefore, we divided the data into two subsets, without and with Level 4 words, and carried out logistic regression analyses for each subset. Results are shown in Tables 8 and 9, where only variables selected by AIC minimisation for the multivariate logistic regression models are included. Bootstrap validation results are shown in Table 10.
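The argument that a correct answer rate far below 50% points to a systematic bias rather than guessing can be made concrete with a one-sided binomial tail probability. The sketch below computes the probability of observing at most k correct answers out of n if every child simply guessed; the function name is hypothetical, and the numbers in the usage example (n = 200 answers, 34 correct, roughly 17%) are illustrative assumptions, since the number of answers per word pair is not restated here.

```python
from math import comb


def prob_at_most(k_correct, n_answers, p_chance=0.5):
    """P(X <= k_correct) for X ~ Binomial(n_answers, p_chance)."""
    return sum(comb(n_answers, k)
               * p_chance ** k * (1.0 - p_chance) ** (n_answers - k)
               for k in range(k_correct + 1))


if __name__ == "__main__":
    # Hypothetical example: 34 correct answers out of 200 (about 17%).
    print(prob_at_most(34, 200))
```

A vanishingly small value indicates that pure guessing is an implausible explanation for such a low rate.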
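As a reminder of how odds ratios and their confidence intervals relate to logistic regression output, the sketch below converts a coefficient and its standard error into an OR with a Wald-type 95% CI. This is an assumption made for illustration: the interval type used for the tables in this paper is not stated, and the function name and example values are hypothetical.

```python
import math


def odds_ratio_ci(beta, std_err, z=1.96):
    """Odds ratio exp(beta) with a Wald-type confidence interval
    exp(beta -/+ z * std_err); z = 1.96 gives roughly 95% coverage."""
    return (math.exp(beta),
            math.exp(beta - z * std_err),
            math.exp(beta + z * std_err))


if __name__ == "__main__":
    # Placeholder coefficient and standard error, not values from the study.
    odds_ratio, lower, upper = odds_ratio_ci(0.25, 0.10)
    print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that excludes 1 corresponds to an association that is significant at about the 5% level under this approximation.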

The sub-analysis using the subset containing only difficulty levels 1-3 words (Table 8) showed similar tendencies with our first analysis results (Table 6). On the contrary, the sub-analysis results for the Level 4 words (Table 9) appeared to be different, especially regarding the sound amplification system and the seat groups.
Speech recognition is known to be governed by acoustic signals as well as the language experience of listeners [31]. Generally speaking, for those learning second languages (L2), differentiating phonemes that do not exist in their first language (L1) is more challenging [58]. The English word pairs categorised into the difficulty level 4 have no difference in pronunciation when they are transcribed in Japanese characters like Katakana. Therefore, one of the causes for the extremely low correct answer rate would be due to lack of exposure to such phoneme contrasts. In addition, elementary school children may not have knowledge of English words in the test. Therefore, it is extremely difficult for such children to suppose that words in those pairs may be different.
Early studies suggested that such non-native phonemes may be incorporated into listeners' L1 categories, which typically had been developed before their first birthdays [59,60]. For example, the lowest correct answer rate, 17.2%, was for the word pair "light-right" (Table 5). Japanese speakers are particularly known to tend to confuse the /l/ and /r/ sounds [61,62]. The third formant (F3) [63] is reported to be the critical cue that English native listeners use when differentiating between /l/ and /r/, but many Japanese native listeners frequently fail to employ it [64]. Although speech perception is not just about simply detecting the formant, being able to hear such useful acoustic speech cues is helpful for listeners to infer phonemes contained in speech signals [63].
Special care or perhaps separate training is required when teaching L2 phonemes that do not appear in the learners' L1 phoneme structure. Research showed that mothers often communicate with their infants with a special form of speech, known as "parentese", which is spoken slowly with a higher pitch and an overemphasised intonation contour [65]. Additionally, many have attempted, albeit with incomplete success, to develop methods to train Japanese natives (including adults) to differentiate /l/ and /r/, as in [62], for example. The above results suggest that sound amplification systems equipped with functions allowing flexible control over the realised frequency response could be valuable in second and foreign language education.
Finally, to assess the association between the test score and the heterogeneous sound fields in the classroom, we divided the subset with difficulty levels 1-3 words further into four groups according to sound amplification systems (A or B) and seat locations (Front or Back), then compared the medians of test scores (38 questions per system). As shown in Figure 5, the lowest median of the test score is observed in the sub group "System A at the back seats", and it is significantly lower than that of "System B at the back seats". With System A, a difference between front and back seats is observed, indicating heterogeneous sound fields in the classroom realised by using such a system.
Collective findings from both the acoustic measurements (Section 3.1) and our statistical results showed that the difference in the acoustic quality realised by the two sound amplification systems had an impact on, at least, Japanese children's English learning, and on the inequality in their classroom.

Conclusions
The current study has demonstrated the importance of delivering good acoustic conditions in classrooms to help children learn English more effectively through a scientific approach, using data collected in a real-life school classroom in Japan.
The statistical results of English listening tests taken by 216 Japanese children (aged between 11 and 12 years) using two sound amplification systems revealed the following three points.
(1) There is an association between the acoustic conditions in classrooms and the correct answer rate of the English tests taken by children in the classroom.
(2) Using sound amplification systems with loudspeaker arrays can reduce the existing inequality in a classroom due to the heterogeneous sound fields that depend on the seat positions.
(3) English word combinations including phonemes that do not exist in L1 are particularly hard for Japanese children to discriminate, and they negatively affect the correct answer rate of the English tests. Hence, teaching those words requires particular care.
Furthermore, the joint results with the multiple acoustic measurements show that even when the difference in acoustic characteristics realised by the two systems is of the magnitude shown here, its impact on children's learning is still noteworthy. The observed differences in our acoustic measurements were not so noticeable in the RT values but they were evident in the transfer functions.
Although the English listening test materials could be improved for future studies of a similar kind, we illustrated that more adequately adjusted acoustics could play a significant role in supporting children's learning at their schools. More research should be encouraged, for example, to set guidelines regarding acoustic quality in school classrooms in countries that do not yet have them.
Funding: This work was partially supported by JSPS KAKENHI Grant No. 19H04282.

Institutional Review Board Statement:
The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Human Research Ethics Committee of the Graduate School of Engineering Science, Osaka University, Japan (protocol code R1-16, 3 December 2019).
Informed Consent Statement: Written informed consent was obtained from the principal of the participating school. In addition, an opportunity to reject participation or use of the children's data was provided to the children's legal guardians using a document specifically prepared for this purpose as an opt-out method.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to ethical issues.