Investigating Sibilant Fricative Representation in Bangla Telemedicine Speech: A Cost-Aware Sampling Rate Optimization Study

Paul, Prajat; Bouh, Mohamed Mehfoud; Shah, Manan Vinod; Hossain, Forhad; Ahmed, Ashir

doi:10.3390/signals7030044


by Prajat Paul ¹,*, Mohamed Mehfoud Bouh ¹, Manan Vinod Shah ¹, Forhad Hossain ² and Ashir Ahmed ¹

¹ Faculty of Information Science and Electrical Engineering, Kyushu University, Fukuoka 819-0395, Japan

² Faculty of Liberal Arts, Sophia University, Tokyo 102-8554, Japan

* Author to whom correspondence should be addressed.

Signals 2026, 7(3), 44; https://doi.org/10.3390/signals7030044

Submission received: 26 February 2026 / Revised: 2 April 2026 / Accepted: 9 April 2026 / Published: 7 May 2026


Abstract

Automatic speech recognition (ASR) has advanced rapidly for high-resource languages, yet performance remains limited for low-resource languages such as Bangla, particularly in telehealth settings. Most systems rely on a standardized 16 kHz sampling rate, a design choice that persists despite evidence that Bangla contains sibilant fricatives and other phonetic cues with substantial high-frequency energy that may be suppressed under bandwidth and latency constraints. This study evaluates audio sampling rate as a controllable signal-level parameter for Bangla telehealth ASR to identify an empirically grounded operating range balancing transcription accuracy, execution time, and network bandwidth. Twenty real-world Bangla doctor–patient consultations were deterministically resampled to 55 configurations between 8 kHz and 32 kHz and transcribed using a fixed cloud-based ASR system. Session-level word error rate (WER), execution latency, and payload bandwidth were analyzed alongside high-frequency phonetic content, which was quantified using a composite sibilant-likelihood score. WER decreased from 0.338 at 8 kHz to a local minimum of 0.232 at 18.75 kHz, with gains plateauing beyond this range despite substantial bandwidth increases. Elbow-point, Pareto frontier, weighted scoring, and Minimum Acceptable Trade-off analyses converged on an optimal region between 17.25 and 18.75 kHz, demonstrating that sampling rate optimization improves ASR accuracy without proportional resource costs in telehealth settings.

Keywords: automatic speech recognition (ASR); Bangla language; telehealth; low-resource language (LRL); sibilant fricatives; word error rate (WER); sampling rate optimization; acoustic analysis

1. Introduction

Speech recognition systems have achieved substantial progress in recent years, driven by advances in deep learning architectures and the availability of large, well-structured speech corpora for high-resource languages. These systems now power a wide range of applications, from digital assistants to automated documentation tools. However, this progress has not been uniform across the world’s languages. Many languages remain technologically disadvantaged due to limited digitized materials, lack of standardized phonetic resources, and insufficient annotated datasets. Such languages are commonly categorized as Low-Resource Languages (LRLs), reflecting their restricted representation in speech and language technologies [1]. The scarcity of high-quality linguistic resources for LRLs limits the reliability, adaptability, and scalability of automatic speech recognition (ASR) systems designed for them. Beyond data and model availability, many LRL ASR systems also inherit signal-processing assumptions, such as fixed sampling rates, that were optimized for high-resource languages and are rarely re-examined in low-resource deployment contexts.

Recent Bangla ASR research reflects ongoing efforts to address data scarcity, dialectal variation, and deployment robustness through complementary strategies. Foundational supervised corpora established speaker-diverse Bangladeshi Bangla resources but offered limited explicit modeling of regional variation [2]. Subsequent work introduced dialect-aware lexical datasets with division-level metadata, enabling evaluation beyond Dhaka-centric generalization [3]. Model-centric studies demonstrate that fine-tuning self-supervised architectures such as Wav2Vec 2.0 on larger curated corpora substantially reduces word error rate (WER) and character error rate (CER), albeit with increased computational demands [4]. To scale data, pseudo-labeling approaches have produced large domain-agnostic corpora (~20,000 h), while revealing persistent degradation on conversational and telephony speech [5]. These challenges are formalized by out-of-distribution benchmarks explicitly quantifying performance drops under domain and style shifts, including telemedicine [6]. Finally, decoding-level language model rescoring yields significant accuracy gains but introduces accuracy–latency trade-offs critical for real-world Bangla ASR deployment [7]. Collectively, these studies indicate that progress in Bangla ASR hinges on representative data, scalable annotation, and resource-aware modeling, as improvements on controlled speech often fail to generalize to real-world conversations. The lack of linguistic resources for such real-world speech leaves a noticeable gap in recognition accuracy. Notably, most prior Bangla ASR studies implicitly assume fixed front-end signal representations, with limited examination of how inherited sampling rate choices may interact with data scarcity and conversational speech characteristics.

Automatic speech recognition has been increasingly adopted in healthcare as a scalable mechanism for converting spoken clinical interactions into text, enabling downstream analysis, documentation, and decision support. In mental health research, ASR has been used to transcribe patient speech at scale to support Natural Language Processing (NLP)-based clinical assessment and longitudinal monitoring, particularly where manual transcription is impractical or privacy sensitive [8]. More broadly, multiple studies have positioned ASR as an upstream infrastructure for clinical language processing, enabling automated transcription of conversational medical speech for documentation, information extraction, and system-level learning across diverse clinical settings [9,10,11]. In the context of structured clinical documentation, ASR-driven systems have been proposed to assist with anamnesis creation by transforming doctor–patient conversations into editable medical records, improving efficiency and standardization [12]. Comparative evaluations of commercial ASR engines on conversational clinical speech have further established baseline performance characteristics to guide deployment in healthcare workflows [13]. More recently, specialty-specific studies, such as in dentistry and orthodontics, have demonstrated that ASR can support the generation of detailed clinical records and that coupling ASR with language models can further enhance the usability of transcribed medical text [14]. Clinical conversations are acoustically challenging due to overlapping speech, spontaneous disfluencies, and background noise, making recognition performance particularly sensitive to front-end signal representation choices such as sampling rate and bandwidth.

Most current ASR systems, including those developed for Bangla, continue to rely on the standardized 16 kHz sampling rate. While convenient for compatibility, this rate limits the captured spectrum to 8 kHz and fails to preserve the full acoustic detail of several Bangla phoneme classes, especially the language’s rich fricative inventory. Sibilant fricatives are a class of consonant sounds produced by directing turbulent airflow against the teeth or alveolar ridge, generating high-intensity, high-frequency noise concentrated in the upper spectral bands. Bangla includes multiple sibilant fricatives, /s/, /ʂ/, /ʃ/, and dialect-specific /ɕ/-like variants, whose acoustic signatures often extend beyond the 8 kHz boundary. In several regional accents, sibilants produce sharper turbulence and enhanced high-frequency energy, making them especially vulnerable under 16 kHz sampling. When these upper-band cues are truncated at recording, ASR systems struggle to separate acoustically similar fricatives in conversational and medical speech. These sibilants exhibit distinctive high-frequency spectral patterns that differ from non-sibilant frication and ambient noise. Unlike background noise, which distributes energy broadly and irregularly, sibilants contain structured cues such as narrow spectral peaks, high spectral moments, and stable turbulence patterns across speakers and tokens; acoustic-phonetic studies report higher spectral peaks, sharper energy concentrations, and greater spectral kurtosis for sibilants than for other fricatives and noise [15,16]. Because the 8 kHz Nyquist limit imposed by 16 kHz sampling removes much of this extended high-frequency structure, critical information is lost. Studies on extended high-frequency cues further show that preserving spectral information above 8 kHz improves phoneme separability under noise and reduces omission or substitution errors in ASR [17].
These observations are consistent with broader findings that sibilants concentrate substantial energy in higher frequencies and require spectral measures beyond narrowband limits; fricatives are turbulence-based but shaped by vocal tract filtering, producing wider spectral distributions than vowels or sonorants. Diagnostic measures such as FM, AmpD, Fh, AmpRange, and HighLevelD often require spectral content up to at least 15 kHz [18]. Some discrimination algorithms, such as DFT-slope separation of [s] and [ʃ], require analyzable ranges up to 8 kHz and therefore need at least a 16 kHz sampling rate [19]. Evidence from 44.1 kHz conversational corpora and laboratory or modeling studies spanning 15–20 kHz indicates that extended high-frequency information can alter cue weighting and improve discrimination or perceptual-model accuracy in a context-dependent manner, while population response modeling further shows that structured, multi-channel spectral profiles predict perceptual confusions more effectively than raw spectra, underscoring the importance of organized high-frequency acoustic cues [20,21,22,23]. Although fricatives are broadband and noise-like, their structure is not random. Vocal-tract filtering produces systematic spectral patterns that differ from unstructured noise. Spectral peak measures, amplitude differences, high-frequency maxima, and dynamic cues capture this structure and help separate sibilants from background noise [18]. ASR front-end strategies leverage this by separately processing low- and high-frequency bands; appending de-noised high-frequency filter-bank energies improves wideband ASR under noise. Frequency-filtered band energies and sampling rate conversion methods maintain performance across variable sampling conditions [24].

Speech recognition accuracy and robustness are strongly shaped by sampled bandwidth, front-end feature design, and training–test sampling mismatches, and prior studies have systematically quantified these dependencies while proposing practical remedies. Rather than a single universally optimal operating point, earlier work demonstrates that preferred sampling rates vary with task and feature representation; for instance, Linear Predictive Cepstral Coefficients (LPCCs) and Mel-Frequency Cepstral Coefficients (MFCCs) front ends were found to perform best at 12 kHz and 14 kHz, respectively, in a Hidden Markov Model (HMM)-based recognizer [25]. Nonetheless, 16 kHz wideband audio has become the dominant baseline in practice because it preserves spectral information up to 8 kHz and facilitates stable cross-device processing through subband-based descriptors [24]. Despite this convention, sampling rate is typically treated as a fixed design choice, and only coarse operating points are evaluated, leaving the effects of fine-grained rate variation, particularly above 16 kHz, largely unexplored in realistic deployment scenarios. Corpus-based evaluations show that sibilant discrimination remains robust in conversational speech only when sufficient high-frequency spectral detail is preserved, underscoring how bandwidth and sampling rate choices directly affect the recoverability of fricative cues critical for ASR [26]. Sampling rate transformation (SRT) techniques have shown that systems trained at 16 kHz can process 11 kHz test data with markedly reduced word error rates, e.g., from 29.89% to 18.17% without retraining, underscoring the contribution of preserved bandwidth to recognition accuracy [27]. In noisy real-world conditions such as in-car speech, 16 kHz consistently outperforms 8 kHz in robustness [28], while the explicit inclusion of de-noised high-frequency energy has been reported to yield up to 13.96% relative improvement for noisy wideband ASR [24]. 
The same work further indicates that frequency-filtered and subband energy representations allow high-rate systems to remain compatible with lower-rate inputs while retaining much of the performance benefit. Collectively, these findings support 16 kHz as a reliable and efficient baseline, yet they also point to a broader conclusion: higher sampling rates provide additional acoustic information that can be exploited when extended high-frequency cues, such as fricative turbulence, whispery articulation, or bandwidth extension effects, are relevant to the recognition task [29]. Notably, while prior work establishes the importance of bandwidth and high-frequency content, it largely treats sampling rate as a fixed design choice or evaluates only coarse operating points, leaving open the question of how incremental increases beyond 16 kHz at recording affect recognition quality under realistic deployment constraints.

Increasing the sampling rate beyond 16 kHz provides a direct acoustic advantage by capturing a wider range of high-frequency energy, thereby preserving sibilant-related cues that are critical for phoneme discrimination in Bangla speech. At the same time, higher sampling rates generate larger audio payloads, increasing network bandwidth demands and processing latency for cloud-based ASR services. These trade-offs are especially consequential in low-resource settings where infrastructure support is limited. In regions such as Bangladesh and India, variable network connectivity and constrained deployment environments amplify the practical impact of sampling rate decisions on real-time speech applications. While the expansion and diversification of Bangla speech corpora remain the most fundamental long-term pathway for improving ASR performance, this study argues that avoidable technical constraints, such as suboptimal sampling rate selection at recording due to operational constraints, should be systematically examined and optimized. By isolating signal-level limitations, the present work first examines whether Bangla telehealth speech contains high-frequency components that are likely to be clipped when recordings follow the standardized sampling rate in order to minimize resource use. It then seeks an optimal sampling rate, or range of rates, to maintain at recording time so that these components are retained and speech recognition accuracy improves.

2. Materials and Methods

2.1. Study Design

This study adopts a comparative experimental design to examine how audio sampling rate influences Bangla ASR performance under realistic telehealth constraints. The design explicitly considers three interrelated dimensions: transcription accuracy, computational execution time, and network bandwidth requirements. Rather than treating sampling rate as a fixed engineering parameter, the study models it as a controllable variable whose effects can be systematically evaluated across a dense range of configurations. This approach enables an analysis that connects acoustic signal fidelity with downstream recognition behavior and deployment-level resource costs. Figure 1 depicts the overall workflow, and the following subsections elaborate on each step.

2.2. Speech Data Collection and Recording Protocol

Speech data were collected through a collaboration with Grameen Communications, Bangladesh, as part of routine health checkup activities conducted using the Portable Health Clinic system [30]. The Portable Health Clinic provides an in-person health screening service in which non-communicable disease parameters are measured and, based on the results, a remote doctor consultation is arranged. As the system does not include a dedicated audio or video calling feature, consultations were conducted using Zoom (version 6.4.6) (Zoom Video Communications, Inc.) configured with high-fidelity audio settings on all connected terminals. This configuration disables Zoom’s automatic echo cancellation and post-processing and raises the audio codec quality to 48 kHz with a higher transmission bitrate. Both the doctor and the patient joined using laptop computers equipped with headset microphones and connected via broadband internet. Both participants were situated in environments with minimal ambient noise.

Twenty complete doctor–patient consultations were recorded in Dhaka, Bangladesh. Recording was performed locally on the patient-side terminal using Zoom’s built-in local recording function, which captures the mixed audio of both participants. The local recording was saved in M4A format (AAC-LC codec) at a native sampling rate of 48 kHz stereo. Prior to analysis, the M4A recordings were converted to single-channel WAV format using pydub and subsequently downsampled to 32 kHz using librosa to establish a standardized experimental baseline while retaining acoustic bandwidth sufficient for controlled resampling. The effective spectral bandwidth of the recordings, including structured energy above 8 kHz, was verified through spectrographic inspection.

Because the recording captures the mixed consultation, it includes both the patient’s locally captured speech and the doctor’s speech as received through Zoom’s real-time audio transmission pipeline. In the consultations used in this study, patient speech constituted the majority of the recorded dialogue. The doctor’s transmitted audio is subject to Zoom’s adaptive Opus codec for real-time delivery; however, with high-fidelity mode enabled on both terminals and broadband connectivity, the Opus codec operates in wideband to fullband mode, preserving audio bandwidth up to 20 kHz under favorable network conditions [31]. Although some spectral shaping of the doctor’s speech by the transmission codec cannot be excluded, the predominance of locally captured patient speech and the empirically confirmed presence of high-frequency content in the recordings support the integrity of the acoustic signal for the purposes of this study.

Recordings were selected from available consultations based on the presence of medical vocabulary containing sibilant fricatives as identified in Zoom-generated transcripts, ensuring that the dataset contains speech segments relevant to the study’s phonetic focus. The consultations ranged in duration from approximately 44 s to 7 min and 30 s and consisted primarily of Bangla speech along with medical terminology for which English equivalents are commonly used in clinical Bangla. Each interaction followed a typical outpatient consultation flow, including greetings, history-taking, symptom description, preliminary assessment, and clinical recommendations. The conversational and dialogic nature of the recordings introduces natural variability such as turn-taking, overlapping speech, and spontaneous disfluencies, which are essential for evaluating ASR performance under ecologically valid conditions.

The use of 32 kHz recordings as the source material for resampling experiments establishes an upper bound on available acoustic information within the confirmed effective bandwidth and allows all lower sampling rates to be derived through deterministic downsampling. By avoiding independent recordings at different sampling rates, the design ensures that linguistic content, speaker behavior, background noise, and conversational structure remain constant across conditions. This isolates sampling rate as the primary experimental factor and prevents confounding effects that could arise from differences in recording hardware or environment. Prior to analysis, segments containing personal identifiers or sensitive information were removed to ensure patient privacy and ethical compliance. No additional signal enhancement, denoising, or normalization was applied, allowing the ASR system to operate on audio that closely reflects real-world telehealth inputs.

2.3. Audio Preprocessing and ASR Evaluation Pipeline

Each recording was segmented into non-overlapping 15 s windows using a Python 3 procedure to satisfy the input constraints of Wit.ai, an ASR service from Meta, while preserving natural conversational dynamics, including overlapping speech and speaker interactions. Audio segments from the 32 kHz baseline recordings were resampled to target sampling rates between 8 kHz and 32 kHz using a non-uniform grid designed to balance analytical resolution and computational feasibility. Sampling rates from 8 to 12 kHz were evaluated at 500 Hz intervals to capture broad effects of high-frequency loss, while finer 250 Hz increments were applied in the 12–20 kHz range to resolve transitional behavior where performance sensitivity was expected to be highest. Above 20 kHz, sampling rates were evaluated at 1 kHz intervals to reduce redundancy in regions where theoretical gains are limited and resource costs increase. All resampling was performed using identical signal-processing routines to ensure consistency across conditions, enabling detailed characterization of sampling rate sensitivity while supporting practical evaluation of deployable operating ranges. Each resampled audio segment was submitted to the Wit.ai ASR service for transcription using a consistent API configuration. Segment-level transcripts were generated independently and later concatenated to reconstruct a full-session hypothesis transcript for each consultation. To mitigate the effects of transient network variability or service instability, the system architecture incorporated caching mechanisms that prevent redundant ASR calls for previously processed segments while preserving segment-level metadata such as processing time and payload size. Manual reference transcripts were prepared for each complete consultation by listening to the full audio recordings.
These transcripts serve as ground truth for evaluation and reflect the intended lexical content of the conversations rather than segment-level approximations. After ASR processing, reconstructed hypothesis transcripts were compared against the corresponding reference transcripts using WER as the primary accuracy metric. WER was computed using the jiwer Python library, which provides standardized alignment and scoring procedures suitable for conversational speech.
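The non-uniform grid and the 15 s segmentation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the study's pipeline: the function names are hypothetical, and the treatment of interval boundaries (12 kHz and 20 kHz each counted once) is an assumption. Under this reading the grid has 53 rates; the 55 configurations reported for the study suggest slightly different boundary handling, which this sketch does not attempt to reproduce.

```python
def sampling_rate_grid_hz():
    """Non-uniform grid: 500 Hz steps (8-12 kHz), 250 Hz (12-20 kHz), 1 kHz (20-32 kHz)."""
    coarse = range(8_000, 12_000, 500)      # 8.0 ... 11.5 kHz
    fine = range(12_000, 20_000, 250)       # 12.0 ... 19.75 kHz
    sparse = range(20_000, 32_001, 1_000)   # 20 ... 32 kHz
    return list(coarse) + list(fine) + list(sparse)


def window_bounds(n_samples, sample_rate, window_s=15.0):
    """(start, end) sample indices of non-overlapping windows; the last may be shorter."""
    step = int(round(window_s * sample_rate))
    return [(start, min(start + step, n_samples)) for start in range(0, n_samples, step)]


rates = sampling_rate_grid_hz()
# Example: a 44 s session at the 32 kHz baseline rate.
bounds = window_bounds(n_samples=44 * 32_000, sample_rate=32_000)
print(len(rates), "rates;", len(bounds), "segments")
```

Segmenting by sample index rather than by time keeps window boundaries identical across resampled versions of the same session, provided each target rate is segmented with its own `sample_rate`.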

2.4. Computational Latency and Bandwidth Cost Measurement

Execution time was measured for each automatic speech recognition request by recording the elapsed duration between submission of an audio segment to the ASR API and receipt of the corresponding transcription, thereby capturing end-to-end latency under realistic usage conditions, including local preprocessing, network transmission, remote inference, and response delivery. Timing was performed at the segment level to account for variability arising from segment duration, network conditions, and service-side processing behavior, and execution times were subsequently aggregated across segments and consultations using robust summary statistics to characterize the computational cost associated with each sampling rate configuration. Network bandwidth requirements were estimated by recording the file size of each uploaded uncompressed audio segment and normalizing it by segment duration to obtain an effective data transmission rate. This approach directly associates bandwidth consumption with sampling rate and segment length, enabling consistent comparison across configurations while providing a conservative approximation of communication cost relevant to network-constrained telehealth deployments, where high-fidelity audio transmission is often required.
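As a rough illustration of the bandwidth estimate, the sketch below computes the size of an uncompressed segment and normalizes it by duration. It assumes 16-bit mono PCM in a WAV container with a 44-byte header; the source states only that segments were uncompressed, so the bit depth and header size are assumptions, and the helper names are hypothetical.

```python
WAV_HEADER_BYTES = 44  # canonical PCM WAV header size (assumed)


def pcm_wav_bytes(sample_rate, duration_s, bytes_per_sample=2, channels=1):
    """Size in bytes of an uncompressed PCM WAV segment, header included."""
    n_frames = int(round(sample_rate * duration_s))
    return WAV_HEADER_BYTES + n_frames * bytes_per_sample * channels


def effective_kbps(file_bytes, duration_s):
    """File size normalized by segment duration, in kilobits per second."""
    return file_bytes * 8 / duration_s / 1000.0


for rate in (8_000, 16_000, 18_750, 32_000):
    size = pcm_wav_bytes(rate, duration_s=15.0)
    print(f"{rate / 1000:6.2f} kHz  {size:8d} bytes  {effective_kbps(size, 15.0):7.1f} kbit/s")
```

Under these assumptions the payload scales linearly with sampling rate: roughly 256 kbit/s at 16 kHz and roughly 300 kbit/s at 18.75 kHz, an increase of about 17%.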

2.5. Acoustic Analysis for Sibilant Characterization

To examine how audio sampling rate affects the preservation of high-frequency phonetic information, an acoustic analysis was conducted with a focus on sibilant fricatives. Sibilants are known to exhibit strong turbulent energy concentrated in the upper frequency range and are therefore particularly sensitive to bandwidth limitations imposed by lower sampling rates. Analyzing their acoustic properties provides a principled way to assess whether increases in sampling rate meaningfully preserve information that is otherwise truncated under standard configurations. For each original 32 kHz recording, frame-level spectral features were extracted using the Python-based audio analysis library librosa (version 0.11.0). Three summary metrics were first computed at the file level: mean spectral centroid, mean spectral flatness, and mean high-frequency (HF) energy ratio. Spectral centroid was used as an indicator of the distribution of energy along the frequency axis, with higher values corresponding to greater high-frequency concentration typically associated with fricative articulation. Spectral flatness quantified the degree to which the signal exhibits noise-like characteristics, which are prominent in sibilant turbulence compared to voiced or harmonic speech. The high-frequency energy ratio measured the proportion of signal energy above the 8 kHz boundary, corresponding to the Nyquist limit of 16 kHz sampling, thereby directly quantifying the extent of acoustic information that would be discarded under commonly used ASR settings. Together, these measures provide complementary views of sibilant-related acoustic structure and establish a descriptive basis for understanding how much potentially informative high-frequency content is present in the recordings prior to any downsampling.
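The three descriptors can be illustrated on a single frame with the standard-library sketch below. It uses a naive DFT and simplified, power-weighted definitions rather than the librosa routines used in the study, so absolute values will differ from librosa's; the function names and the synthetic test tones are assumptions for illustration only.

```python
import cmath
import math


def power_spectrum(frame, sample_rate):
    """Naive DFT power spectrum; returns (frequencies in Hz, power) for bins 0..N/2."""
    n = len(frame)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        power.append(abs(acc) ** 2)
    return freqs, power


def spectral_centroid(freqs, power):
    """Power-weighted mean frequency in Hz."""
    total = sum(power)
    return sum(f * p for f, p in zip(freqs, power)) / total if total > 0 else 0.0


def spectral_flatness(power, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (near 1 = noise-like)."""
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power) + eps)


def hf_energy_ratio(freqs, power, cutoff_hz=8_000.0):
    """Share of spectral energy above cutoff_hz (8 kHz = Nyquist limit of 16 kHz sampling)."""
    total = sum(power)
    return sum(p for f, p in zip(freqs, power) if f > cutoff_hz) / total if total > 0 else 0.0


SR, N = 32_000, 256  # 125 Hz bin spacing


def tone(hz):
    """Synthetic sine frame at the given frequency."""
    return [math.sin(2 * math.pi * hz * i / SR) for i in range(N)]


freqs, high = power_spectrum(tone(10_000), SR)  # energy entirely above 8 kHz
_, low = power_spectrum(tone(1_000), SR)        # energy entirely below 8 kHz
print(hf_energy_ratio(freqs, high), hf_energy_ratio(freqs, low))
```

A 10 kHz tone yields an HF ratio close to 1 and a 1 kHz tone a ratio close to 0, which is the sense in which the ratio measures content that 16 kHz sampling would discard.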

2.6. Frame-Level Sibilant Likelihood Estimation

The composite sibilant-likelihood score was computed at the frame level to enable fine-grained acoustic analysis beyond global summary statistics, using a weighted integration of seven established acoustic features: spectral centroid, spectral flatness, high-frequency energy ratio, sibilant-band energy ratio, zero-crossing rate, spectral skewness, and energy consistency. The weighting scheme was determined heuristically based on well-established acoustic–phonetic evidence for fricative discrimination. Spectral centroid and spectral flatness were assigned the highest weights (0.20 each), consistent with findings by Jongman et al. [15], who identified spectral peak location and noise-like energy distribution as primary discriminators of sibilant articulation. The sibilant-band energy ratio, computed over the 2.5–8 kHz range corresponding to the dominant turbulence region of Bangla sibilants (/s/, /ʃ/, /ɕ/), was also assigned a high weight (0.20). The high-frequency energy ratio received a slightly lower weight (0.15) to capture extended high-frequency content beyond 8 kHz, which Monson et al. [32] demonstrated to contain perceptually relevant information often lost in standard transmission. Zero-crossing rate and spectral skewness were assigned moderate weights (0.10 each); although Forrest et al. [33] confirmed their utility in distinguishing voiceless obstruents, Kong et al. [34] showed reduced reliability under low signal-to-noise conditions, motivating a limited contribution for robustness in real-world telehealth audio. Energy consistency was assigned a minimal weight (0.05) to suppress very low-energy frames and function primarily as a noise-gating mechanism [35]. All features were normalized to a common [0, 1] scale prior to weighting.
Frames were categorized based on confidence thresholds applied to the composite score, with high-confidence frames labeled as likely sibilant, intermediate-confidence frames labeled as possible sibilant, and remaining frames evaluated using an inverted spectral criterion to identify noise-dominant cases; frames not meeting either condition were labeled as unclear. This multi-feature, weighted formulation provides an interpretable and literature-aligned estimate of sibilant presence suitable for natural conversational speech without requiring manual phonetic annotation.

S(f) = 0.20 \tilde{C}(f) + 0.20 \tilde{F}(f) + 0.20 \tilde{E}_{\mathrm{sib}}(f) + 0.15 \tilde{E}_{\mathrm{HF}}(f) + 0.10 \tilde{Z}(f) + 0.10 \tilde{K}(f) + 0.05 \tilde{E}_{\mathrm{cons}}(f)    (1)

Feature definitions:

S(f) = composite sibilant-likelihood score;

\tilde{C}(f) = normalized spectral centroid;

\tilde{F}(f) = normalized spectral flatness;

\tilde{E}_{\mathrm{sib}}(f) = normalized sibilant-band energy ratio;

\tilde{E}_{\mathrm{HF}}(f) = normalized high-frequency energy ratio;

\tilde{Z}(f) = normalized zero-crossing rate;

\tilde{K}(f) = normalized spectral skewness;

\tilde{E}_{\mathrm{cons}}(f) = normalized energy consistency.
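Equation (1) and the confidence-based labeling can be sketched as below. The weights are those stated in the text. The two thresholds, the dictionary keys, and the function names are hypothetical, since the source does not report threshold values. The inverted spectral criterion for noise-dominant frames is not specified in the source and is therefore omitted; frames below both thresholds are simply returned as "unresolved".

```python
# Weights from Equation (1); keys are illustrative names for the seven features.
WEIGHTS = {
    "centroid": 0.20,
    "flatness": 0.20,
    "sib_band": 0.20,
    "hf_ratio": 0.15,
    "zcr": 0.10,
    "skewness": 0.10,
    "energy_consistency": 0.05,
}

HIGH_THRESHOLD = 0.60      # hypothetical: "likely sibilant" cutoff
POSSIBLE_THRESHOLD = 0.40  # hypothetical: "possible sibilant" cutoff


def sibilant_score(features):
    """Weighted sum of normalized features, each clipped to [0, 1]."""
    return sum(w * min(max(features[name], 0.0), 1.0) for name, w in WEIGHTS.items())


def label_frame(features):
    """Map the composite score to a coarse confidence label."""
    score = sibilant_score(features)
    if score >= HIGH_THRESHOLD:
        return "likely sibilant"
    if score >= POSSIBLE_THRESHOLD:
        return "possible sibilant"
    return "unresolved"  # noise-dominant vs. unclear split not modeled here


# Illustrative frame with already-normalized feature values (not measured data).
demo_frame = {
    "centroid": 0.8, "flatness": 0.7, "sib_band": 0.9, "hf_ratio": 0.5,
    "zcr": 0.6, "skewness": 0.4, "energy_consistency": 1.0,
}
print(round(sibilant_score(demo_frame), 3), label_frame(demo_frame))
```

Because the weights sum to 1 and each feature lies in [0, 1], the score itself is bounded by [0, 1], which makes fixed thresholds meaningful across recordings.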

2.7. Sampling Rate-Dependent ASR Performance Analysis

Sampling rate was treated as the primary independent variable in a structured evaluation of Bangla automatic speech recognition performance across three interrelated dimensions: transcription accuracy, computational execution time, and network bandwidth requirement. Accuracy was quantified using WER, a standard metric for speech recognition accuracy, defined as the ratio of the sum of substitution, deletion, and insertion errors to the total number of words in the reference transcript. Execution time and bandwidth served as resource-oriented constraints relevant to real-world deployment. Sampling rates between 8 kHz and 32 kHz were selected using a non-uniform grid to provide higher resolution in frequency ranges where performance transitions were expected, while avoiding redundant evaluation at higher rates with diminishing returns. This analytical framework enables explicit examination of accuracy–efficiency trade-offs and provides the foundation for subsequent analyses of optimal and near-optimal operating points under practical constraints.
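The WER definition, (S + D + I) / N with N the number of reference words, can be made concrete with a word-level edit distance. The sketch below is a plain dynamic-programming implementation that tokenizes on whitespace only; it is not the jiwer library used in the study, which applies its own text transforms, and the function name and placeholder tokens are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (b -> x) and one deletion (d) over 4 reference words.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors but do not enlarge the denominator, WER can exceed 1 for hypotheses much longer than the reference.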

2.8. Low-Pass Filtering Control Analysis

A low-pass filtering (LPF) control condition was introduced to isolate the contribution of extended high-frequency acoustic content from sampling rate-related effects. The original recordings, sampled at 32 kHz, were low-pass filtered at 8 kHz before segmentation and ASR processing, thereby preserving temporal resolution and payload characteristics while removing spectral content above the 8 kHz Nyquist limit of 16 kHz sampling. Comparing ASR performance under this condition with unfiltered higher-rate and native 16 kHz configurations enables attribution of accuracy differences specifically to preserved high-frequency phonetic cues rather than to data rate or processing artifacts. Execution time and bandwidth were measured identically to other conditions, allowing the LPF configuration to serve as a controlled baseline for interpreting sampling rate-dependent performance gains.
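The idea of the control condition, keeping the 32 kHz sample grid while removing content above 8 kHz, can be illustrated with a simple FIR filter. The source does not specify the filter design, so the Hamming-windowed sinc filter, the tap count, and all names below are assumptions chosen for a dependency-free sketch.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass filter, normalized to unity gain at DC."""
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]


def apply_fir(signal, taps):
    """Valid-mode convolution: only outputs where the filter fully overlaps the signal."""
    k = len(taps)
    return [
        sum(taps[j] * signal[i + k - 1 - j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


SR = 32_000


def tone(hz, n=1024):
    """Synthetic unit-amplitude sine at the given frequency."""
    return [math.sin(2 * math.pi * hz * i / SR) for i in range(n)]


taps = lowpass_fir(cutoff_hz=8_000, sample_rate=SR)
kept = rms(apply_fir(tone(2_000), taps))      # below the cutoff: passes
removed = rms(apply_fir(tone(12_000), taps))  # above the cutoff: strongly attenuated
print(f"2 kHz RMS {kept:.3f}, 12 kHz RMS {removed:.5f}")
```

The output stays at 32 kHz, so segment length and payload size are unchanged; only the spectral content above 8 kHz differs from the unfiltered condition.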

2.9. Elbow-Point Detection: Identifying Diminishing Returns

To identify sampling rate regions where further increases yield diminishing improvements in recognition accuracy, an elbow-point analysis was applied to the WER–sampling rate relationship. Rather than focusing on absolute performance, this analysis highlights inflection regions where gains attributable to increased acoustic bandwidth begin to taper, indicating a transition from signal-limited to model- or data-limited performance. This provides an interpretable criterion for distinguishing sampling rates that meaningfully preserve phonetic information from those offering marginal benefit.
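The source does not name a specific elbow-detection algorithm. One common choice, shown here purely as an assumed stand-in, normalizes both axes and picks the point farthest below the straight line joining the curve's endpoints. The function name and the synthetic curve are illustrative and are not study data.

```python
def elbow_index(x, y):
    """Index of the point farthest below the chord joining the first and last points.

    Assumes x is increasing and y decreases overall (e.g. WER vs. sampling rate).
    """
    x_span = x[-1] - x[0]
    y_span = y[0] - y[-1]
    if x_span <= 0 or y_span <= 0:
        raise ValueError("expected increasing x and an overall decrease in y")
    best_i, best_gap = 0, float("-inf")
    for i, (xi, yi) in enumerate(zip(x, y)):
        xn = (xi - x[0]) / x_span   # 0 at the first point, 1 at the last
        yn = (yi - y[-1]) / y_span  # 1 at the first point, 0 at the last
        gap = (1.0 - xn) - yn       # vertical distance below the chord
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


# Synthetic, steeply then slowly decreasing curve (not study data).
x_demo = [0, 1, 2, 3, 4, 5, 6]
y_demo = [10.0, 5.0, 2.5, 1.5, 1.2, 1.1, 1.0]
print("elbow at x =", x_demo[elbow_index(x_demo, y_demo)])
```

On noisy empirical curves, smoothing the WER values before applying such a criterion is usually advisable, since a single outlier can otherwise be selected as the elbow.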

2.10. Pareto Frontier Analysis: Balancing Accuracy and Bandwidth

To explicitly account for deployment constraints, a Pareto frontier analysis was conducted using WER and estimated bandwidth as competing objectives. This analysis identifies sampling rate configurations that achieve optimal trade-offs, in the sense that no alternative configuration simultaneously improves accuracy while reducing bandwidth cost. By isolating non-dominated operating points, the Pareto framework complements the elbow analysis by emphasizing efficiency under network-constrained conditions rather than accuracy trends alone.

2.11. Composite Scoring and Minimum Acceptable Trade-Off Selection

To support holistic comparison across all configurations, a composite scoring framework was employed that integrates normalized measures of transcription accuracy, execution time, and bandwidth into a single efficiency score, with accuracy assigned the highest weight. In addition, a minimum acceptable trade-off strategy was applied by first selecting configurations whose accuracy falls within a predefined tolerance of the best-performing condition and then identifying the lowest-cost option among them. Together, these analyses enable principled selection of sampling rates that balance recognition quality with practical deployment constraints when absolute optimal accuracy is not required.

2.12. Ethical Considerations

The audio recordings of doctor–patient consultations used in this study were obtained from Grameen Communications, a collaborative research partner, and were shared with the research team in de-identified form, containing no personally identifiable information. Informed consent was obtained from all participants prior to recording, and participation was voluntary with no incentives or coercive measures. All analyses conducted in this study are limited strictly to signal-level properties—including spectral characteristics, automatic speech recognition performance, execution latency, and network bandwidth—and do not involve interpretation, extraction, or reporting of any semantic, medical, or personal content from the consultations. The study does not generate diagnostic information, treatment recommendations, or outputs intended to influence clinical decision-making. As such, the research is non-interventional, presents minimal risk to participants, and is consistent with ethical guidelines for secondary analysis of anonymized clinical communication data.

3. Results

3.1. Sibilant-Related Acoustic Measures

Table 1 summarizes file-level acoustic characteristics and frame-level sibilant categorization across the 20 Bangla telehealth recordings. Mean spectral centroid values span a broad range (≈1848–3018 Hz), with higher centroid values generally co-occurring with increased spectral flatness (≈0.11–0.30) and elevated HF energy ratios above 8 kHz (≈0.05–0.12), indicating substantial variability in high-frequency acoustic content across sessions. Figure 2 presents the distributional structure of these three metrics. Spectral centroid values cluster primarily between 2046 and 2340 Hz (IQR), with a median of approximately 2208 Hz and two recordings exceeding 2900 Hz; spectral flatness and HF energy ratio exhibit similar right-skewed distributions (medians ≈ 0.169 and 0.065, respectively), with the same high-end recordings driving the upper tails across all three metrics. This co-occurrence of elevated values within the same files suggests that inter-session variability in high-frequency content is not random but reflects genuine differences in sibilant-related acoustic energy across consultations.

Correspondingly, the proportion of frames classified as likely sibilant ranges from ~8.5% to ~32.8% (mean ≈ 16.5%), while possible sibilant frames range from ~10.5% to ~27.7% (mean ≈ 16.6%). Noise-dominant frames are comparatively limited in the likely noise category (mean ≈0.7%, median ≈ 0.03%), though possible noise shows substantial variability (mean ≈16.9%, range ≈ 0.1–43.2%). Most frames fall into the unclear category (mean ≈ 66.1%), reflecting the conversational nature of the data with frequent phonetic transitions and mixed speech content. Figure 3 visualizes these classification distributions, revealing that the sibilant categories show considerable spread across recordings, particularly likely sibilant (IQR ≈ 11.7–21.2%), while likely noise remains near zero for most files. The high variability in possible noise is driven by a subset of recordings with elevated ambient spectral energy, whereas the majority of files exhibit minimal noise in the high-confidence category, indicating that the sibilant-likelihood estimates are not dominated by background noise misclassification. Collectively, these distributions indicate that sibilant-related acoustic cues are present at nontrivial levels across recordings, with measurable high-frequency energy above 8 kHz available prior to downsampling.

Table 2 summarizes file-level aggregates of the composite sibilant-likelihood score and its weighted feature components across the 20 Bangla telehealth recordings. Composite scores cluster within a narrow range (mean ≈ 0.59; range ≈ 0.56–0.65), indicating stable behavior of the scoring framework across heterogeneous conversational sessions. The largest contributions consistently arise from spectral centroid (mean ≈ 0.81), spectral flatness (mean ≈ 0.70), HF energy ratio (mean ≈ 0.74), and sibilant-band energy ratio (mean ≈ 0.60), which together dominate the composite score in accordance with their assigned weights and normalized scaling. In contrast, zero-crossing rate (mean ≈ 0.48), spectral skewness (mean ≈ 0.07), and energy consistency (mean ≈ 0.06) contribute at lower levels, with the latter two serving a primarily auxiliary role.

Figure 4 visualizes the distributional separation between these component groups: centroid, flatness, HF energy, and sibilant energy all score consistently above 0.5, while skewness and energy consistency remain below 0.12, with ZCR occupying an intermediate position near 0.48. The tight clustering of the composite score (IQR ≈ 0.57–0.61) despite wider variation in individual components, particularly in flatness (IQR ≈ 0.66–0.74) and sibilant energy (IQR ≈ 0.56–0.62), suggests that the multi-feature formulation provides robustness against session-level acoustic heterogeneity. The number of frames classified as sibilant varies substantially across recordings (≈1.8 k–19.1 k frames), yet this variation is not accompanied by large shifts in composite score magnitude, suggesting that the score reflects stable spectral characteristics rather than frame count alone. Across files, higher composite scores generally coincide with elevated HF energy and sibilant-band energy components, indicating the presence of nontrivial high-frequency acoustic content prior to downsampling.

To illustrate the phonetic basis of the high-frequency acoustic content identified in the preceding analyses, Figure 5 presents zoomed wideband and narrowband spectrograms for four representative segments containing sibilant-bearing medical terms drawn from the consultation recordings.

Table 3 summarizes the corresponding Bangla terms, their English equivalents, and the sibilant fricatives present in each. The selected terms, শুকনো কাশি (dry cough), ওষুধ (medicine), শারীরিক অসুস্থতা (physical discomfort), and প্রেসার বেশি (high pressure), represent common healthcare vocabulary encountered during telehealth consultations and contain instances of the Bangla sibilant fricatives শ (/ʃ/), স (/s/), and ষ (/ʂ/) across varying word positions and phonetic contexts.

In all four segments, the wideband spectrograms reveal intervals of broadband, aperiodic energy extending above the 8 kHz Nyquist boundary imposed by 16 kHz sampling. These high-frequency energy bursts are temporally aligned with speech activity and exhibit the diffuse, noise-like spectral signature characteristic of fricative turbulence, distinguishable from the harmonic structure of adjacent voiced segments visible in the narrowband spectrograms. The extent of high-frequency energy varies across segments: the শারীরিক অসুস্থতা segment (14.wav) exhibits the most pronounced energy above 8 kHz, while the ওষুধ segment (6.wav) and শুকনো কাশি segment (1.wav) show energy extending to approximately 10–12 kHz at multiple points. The প্রেসার বেশি segment (17.wav) displays recurring high-frequency activity above 8 kHz distributed across the 5 s window. In all cases, the narrowband spectrograms confirm that the energy above 8 kHz lacks harmonic structure, consistent with aperiodic fricative turbulence rather than voiced speech or background noise. Across the four segments, the three Bangla sibilant categories, postalveolar /ʃ/, alveolar /s/, and retroflex /ʂ/, each contribute high-frequency energy above the standard Nyquist limit, indicating that the bandwidth constraint imposed by 16 kHz sampling affects multiple fricative classes within healthcare vocabulary. These phoneme-level observations complement the corpus-wide acoustic statistics reported in Table 1 and Table 2 by grounding the summary measures in identifiable medical vocabulary.

3.2. Sampling Rate Optimization

Table 4 summarizes the ASR accuracy and resource metrics obtained by deterministically resampling the original 32 kHz recordings to selected target sampling rates and evaluating all segments under an identical cloud ASR configuration. Segment-level transcripts were concatenated to form session-level hypotheses and compared with manually prepared references to compute global WER using standardized alignment. Execution latency was measured per segment as end-to-end elapsed time (submission to response) and aggregated as median and IQR per sampling rate. Estimated bandwidth was computed from uncompressed segment file size normalized by segment duration (kbps) and summarized as the median per configuration. All processing, timing, and aggregation procedures were held constant across sampling rates to isolate sampling rate as the sole experimental variable.
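The resource metrics described above can be sketched in a few lines. The example assumes 16-bit mono PCM with the file header ignored, and the latency values are hypothetical placeholders rather than measurements from the study; `bandwidth_kbps` and `median_and_iqr` are illustrative names.

```python
import statistics


def bandwidth_kbps(file_size_bytes, duration_s):
    """Uncompressed payload rate: bits transferred per second, in kbps."""
    return file_size_bytes * 8 / duration_s / 1000


def median_and_iqr(values):
    """Median and interquartile range of per-segment measurements."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q3 - q1


# A 10 s segment at 17,250 Hz with 2 bytes per sample (assumed 16-bit mono).
segment_bytes = 17_250 * 2 * 10
segment_kbps = bandwidth_kbps(segment_bytes, 10.0)

# Hypothetical per-segment latencies in seconds, for illustration only.
latencies_s = [3.2, 3.4, 3.6, 3.8, 4.0]
latency_median, latency_iqr = median_and_iqr(latencies_s)
```

Under these assumptions the 17,250 Hz segment works out to 276 kbps, in line with the approximate figure reported for that configuration.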

WER decreases markedly from low rates (8 kHz: ≈0.338) into the mid-range, with notable improvements between ~15–19 kHz (e.g., 15.25 kHz: ≈0.257; 18.75 kHz: ≈0.232), after which gains plateau. At higher rates (≥20 kHz), WER remains within a narrow band (≈0.231–0.240) while bandwidth rises substantially (≈324 to >512 kbps) and median latency increases modestly (~3.6–4.2 s). Latency variability (IQR) is relatively stable across rates. Together, these results indicate a mid-range region where accuracy improves most relative to added bandwidth and latency, followed by diminishing returns as sampling rate continues to increase.

3.3. Low-Pass Filtered High-Rate Control Condition

Table 5 presents the LPF control in which 32 kHz recordings were filtered to remove spectral content above 8 kHz while retaining the original sampling rate and payload characteristics (≈512 kbps). Under this condition, global WER increases to ≈0.252 compared with unfiltered mid-to-high sampling rate configurations, despite comparable execution latency (median ≈ 3.41 s). Because computational and network costs remain unchanged, this degradation can be attributed specifically to the removal of extended high-frequency acoustic information. The LPF control therefore isolates the contribution of high-frequency cues and reinforces the interpretation that accuracy improvements at higher sampling rates arise from preserved spectral content rather than from sampling rate or bandwidth alone.

3.4. Elbow-Point Detection: Identifying Diminishing Returns

Figure 6 illustrates elbow-point detection applied to the WER–sampling rate curve using the maximum distance-to-endpoints method; the black curve shows WER across evaluated sampling rates, and red markers denote individual configurations. Sampling rates are first sorted in ascending order, and a straight reference line is constructed between the lowest and highest sampling rate configurations. For each intermediate point, the perpendicular distance to this line is computed, and the sampling rate corresponding to the maximum distance is identified as the elbow, which is indicated by the green dashed line. In this analysis, the elbow is detected at 17,250 Hz, indicating the point at which the rate of WER improvement begins to diminish relative to increases in sampling rate.
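The maximum distance-to-endpoints procedure can be sketched as below. Whether the axes were rescaled before computing distances is not stated, so min-max scaling of both axes is an assumption here (controlled by the hypothetical `normalize` flag), and the WER values in the example are toy numbers, not results from Table 4.

```python
import math


def elbow_point(rates_hz, wers, normalize=True):
    """Return the rate whose point lies farthest from the endpoint chord."""
    points = sorted(zip(rates_hz, wers))  # ascending sampling rate
    if len(points) < 3:
        raise ValueError("need at least three configurations")
    xs = [float(p[0]) for p in points]
    ys = [float(p[1]) for p in points]
    if normalize:
        # Assumption: scale both axes to [0, 1] so Hz and WER are comparable.
        x_lo, x_span = xs[0], (xs[-1] - xs[0]) or 1.0
        y_lo, y_span = min(ys), (max(ys) - min(ys)) or 1.0
        xs = [(x - x_lo) / x_span for x in xs]
        ys = [(y - y_lo) / y_span for y in ys]
    x1, y1, x2, y2 = xs[0], ys[0], xs[-1], ys[-1]
    chord = math.hypot(x2 - x1, y2 - y1)
    best_rate, best_dist = None, -1.0
    for (rate, _), x, y in zip(points[1:-1], xs[1:-1], ys[1:-1]):
        # Perpendicular distance from (x, y) to the line through the endpoints.
        dist = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / chord
        if dist > best_dist:
            best_rate, best_dist = rate, dist
    return best_rate


# Toy curve with a sharp drop followed by a plateau (illustrative values only).
toy_rates = [8_000, 12_000, 16_000, 20_000, 24_000, 28_000, 32_000]
toy_wers = [0.34, 0.28, 0.24, 0.235, 0.233, 0.232, 0.231]
toy_elbow = elbow_point(toy_rates, toy_wers)
```

The detected elbow depends on the axis scaling and on the density of the sampling grid, so the result should be read together with the other selection criteria rather than in isolation.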

3.5. Pareto Frontier Analysis: Balancing Accuracy and Bandwidth

The Pareto frontier between global WER and median payload bandwidth illustrates the trade-off across all sampling rate configurations, with grey circles representing all evaluated configurations and red circles denoting the Pareto-optimal sampling rates. Each red point corresponds to a configuration for which any further reduction in WER would require an increase in bandwidth. Along the frontier, WER decreases monotonically from approximately 0.34 at ~132 kbps to ~0.23 at ~300–400 kbps, corresponding to an absolute reduction of about 0.11. In the annotated mid-bandwidth region, 17,000 Hz (≈272 kbps) achieves a WER of ~0.242, 17,250 Hz (≈276 kbps) further reduces WER to ~0.234, and 18,750 Hz (≈297 kbps) reaches ~0.232, near the minimum observed along the frontier. Beyond this point, substantial bandwidth increases (≥100 kbps) yield only marginal additional WER reductions (≤0.002), indicating diminishing returns, as shown in Figure 7.
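Non-dominated filtering of this kind can be sketched as follows. The first four tuples use the approximate values quoted in the text; the last entry is a hypothetical dominated point added only to show the filter removing something, and `pareto_frontier` is an illustrative name.

```python
def pareto_frontier(configs):
    """Keep configs not dominated in (bandwidth_kbps, wer), both minimized.

    configs: list of (sampling_rate_hz, bandwidth_kbps, wer) tuples.
    A config is dominated if another is no worse on both objectives
    and strictly better on at least one.
    """
    frontier = []
    for rate, bw, wer in configs:
        dominated = any(
            (bw2 <= bw and wer2 <= wer) and (bw2 < bw or wer2 < wer)
            for _, bw2, wer2 in configs
        )
        if not dominated:
            frontier.append((rate, bw, wer))
    return sorted(frontier, key=lambda c: c[1])  # ascending bandwidth


configs = [
    (8_000, 132.0, 0.338),
    (17_000, 272.0, 0.242),
    (17_250, 276.0, 0.234),
    (18_750, 297.0, 0.232),
    (19_000, 304.0, 0.236),  # hypothetical point, not a reported result
]
frontier_rates = [rate for rate, _, _ in pareto_frontier(configs)]
```

The hypothetical 19,000 Hz entry is dropped because the 18,750 Hz configuration is better on both bandwidth and WER; the remaining four points form the frontier.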

3.6. Weighted Scoring Model: Composite Ranking of Configurations

The weighted scoring analysis integrates normalized recognition accuracy, latency, and bandwidth into a single composite metric, computed as a weighted sum of min–max-normalized variables (lower is better), with weights of 0.60 for global WER, 0.20 for median latency, and 0.20 for median payload bandwidth (Table 6). The lowest composite score is achieved at 18,750 Hz (weighted score = 0.2266), driven primarily by a near-zero normalized WER (0.0035) combined with moderate normalized latency (0.6883) and bandwidth (0.4342). The next two lowest scores occur at 17,250 Hz (0.2458) and 17,500 Hz (0.2526), which exhibit slightly higher normalized WERs (0.0225–0.0277) and higher normalized latency, but benefit from lower normalized bandwidth than 18,750 Hz. Configurations at 18,250 Hz and 17,000 Hz rank lower due to incremental increases in normalized WER and/or latency, despite similar bandwidth profiles. Overall, the ranking indicates that small differences in WER exert a dominant influence on the composite score relative to modest variations in latency and bandwidth.
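The composite score can be reproduced from the normalized values quoted above. This is a minimal sketch using the stated weights; the function names `min_max` and `weighted_score` are hypothetical.

```python
WEIGHTS = {"wer": 0.60, "latency": 0.20, "bandwidth": 0.20}


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def weighted_score(norm_wer, norm_latency, norm_bandwidth):
    """Weighted sum of normalized metrics; lower is better."""
    return (
        WEIGHTS["wer"] * norm_wer
        + WEIGHTS["latency"] * norm_latency
        + WEIGHTS["bandwidth"] * norm_bandwidth
    )


# Normalized values reported for 18,750 Hz.
score_18750 = weighted_score(0.0035, 0.6883, 0.4342)
```

With these inputs the sum is 0.6 × 0.0035 + 0.2 × 0.6883 + 0.2 × 0.4342 = 0.2266, matching the reported weighted score for 18,750 Hz.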

3.7. Minimum Acceptable Trade-Off (MAT): Cost-Efficient Near-Optimal Accuracy

The MAT analysis is based on a near-optimal WER criterion, where the gray curve shows WER across sampling rates, the green dotted line marks the near-optimal threshold defined as within 2% of the minimum WER, green crosses indicate configurations that satisfy this criterion, and the red star highlights the selected MAT operating point corresponding to the lowest sampling rate within the near-optimal region. The minimum observed WER across all sampling rates is 0.2316, yielding a near-optimal threshold of 0.236284, and multiple sampling rates beyond approximately 17 kHz fall below this threshold, forming a plateau of near-optimal performance, as illustrated in Figure 8. Among these, 17,250 Hz achieves a WER of 0.234054, satisfying the accuracy constraint while occurring at a substantially lower sampling rate than several higher-rate alternatives. The selected MAT point therefore represents the lowest sampling rate that meets the predefined accuracy tolerance, balancing recognition performance against increasing bandwidth and computational demands.
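The MAT selection rule can be sketched as below, assuming the 2% tolerance is relative to the minimum WER. The WER values are those quoted in the text, but the sampling rate attached to the minimum (24,000 Hz) is a placeholder because the rate at which the minimum occurs is not stated here; `select_mat` is a hypothetical name.

```python
def select_mat(results, tolerance=0.02):
    """Pick the lowest sampling rate whose WER is within tolerance of the best.

    results: list of (sampling_rate_hz, wer) tuples.
    Returns (selected_rate_hz, threshold).
    """
    best = min(wer for _, wer in results)
    threshold = best * (1 + tolerance)
    eligible = [rate for rate, wer in results if wer <= threshold]
    return min(eligible), threshold


results = [
    (8_000, 0.3383),
    (15_250, 0.257),
    (17_000, 0.242),
    (17_250, 0.234054),
    (18_750, 0.2320),
    (24_000, 0.2316),  # placeholder rate for the reported minimum WER
]
mat_rate, mat_threshold = select_mat(results)
```

Using the rounded minimum of 0.2316 gives a threshold of about 0.23623, marginally below the reported 0.236284; the difference is presumably due to rounding of the minimum, and the selected rate is 17,250 Hz either way.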

4. Discussion

This study underscores the importance of audio sampling rate as a tunable optimization variable for ASR, rather than a fixed preset. In contrast to conventional approaches that fix the sample rate (e.g., at 8 or 16 kHz) and potentially confound improvements with changes in model architecture or training data, our experiments isolate sampling rate as the only varying factor. By focusing on Bangla medical telehealth speech and keeping the model and dataset constant, we directly quantify the impact of sampling rate on recognition performance. This controlled approach reveals that significant accuracy gains can be achieved by optimizing the sampling rate for this specific ASR scenario.

4.1. Interpretation of Sibilant Acoustic Cues

The distributional analyses in Figure 2, Figure 3 and Figure 4 complement the summary statistics in Table 1 and Table 2 by revealing the underlying spread and structure of the acoustic measures. Spectral centroid, flatness, and HF energy ratio all exhibit right-skewed distributions, with a subset of recordings showing simultaneously elevated values across all three metrics—indicating that the acoustic cost of truncation at 8 kHz is not uniform but disproportionately affects consultations with the most sibilant-related high-frequency content. Frame classification distributions confirm that sibilant-related frames constitute a consistent proportion of the signal (likely sibilant median ≈ 15.7%, IQR ≈ 11.7–21.2%) while high-confidence noise frames remain negligible, supporting the interpretation that detected high-frequency content is speech-related. At the component level, the composite sibilant-likelihood score clusters narrowly (IQR ≈ 0.57–0.61) despite wider variation in individual features, suggesting the multi-feature formulation provides robustness against session-level heterogeneity. The clear separation between dominant and auxiliary components in Figure 4 provides empirical support for the heuristic weighting scheme in Equation (1).

The spectrograms in Figure 5 demonstrate that sibilant fricatives in common Bangla healthcare terms produce structured high-frequency energy that extends beyond the spectral ceiling of standard 16 kHz sampling, providing a concrete phonetic basis for the sampling rate-dependent ASR accuracy differences reported in subsequent sections.

4.2. Sampling Rate-Dependent Accuracy Gains and Diminishing Returns

Optimizing the sampling rate yields marked improvements in ASR accuracy. WER decreases from 0.3383 at 8 kHz to 0.2320 at 18.75 kHz (Table 4), a relative reduction of approximately 31%. Beyond approximately 19 kHz, accuracy gains plateau, while bandwidth and latency continue to increase. Median latency rises modestly from around 3.6 s to approximately 4.2 s, and estimated bandwidth increases from roughly 132 kbps to over 500 kbps. These trends identify the 15.25–18.75 kHz region as the most efficient zone, where accuracy improves rapidly relative to added resource cost.

4.3. Isolating the Contribution of Extended High-Frequency Information

The LPF control condition isolates the role of extended high-frequency acoustic content. When 32 kHz audio is filtered to remove spectral content above 8 kHz, global WER increases to 0.2516 (Table 5), despite maintaining identical bandwidth and comparable latency. This degradation confirms that the observed accuracy gains at higher sampling rates are attributable to preserved high-frequency information rather than sampling rate or payload size alone. The LPF control therefore provides direct evidence that extended high-frequency cues contribute meaningfully to recognition performance.

4.4. Elbow Point Identification of the Accuracy-Efficiency Trade-Off

Elbow-point detection applied to the WER–sampling rate curve identifies a clear inflection at 17,250 Hz (Figure 6). This point represents the transition beyond which incremental increases in sampling rate yield diminishing improvements in WER. The elbow therefore marks a practical threshold for cost-effective accuracy gains and provides a data-driven lower bound for selecting an optimal operating region.

4.5. Pareto-Optimal Balance Between Accuracy and Bandwidth

Pareto frontier analysis further refines the optimal operating region by jointly considering WER and bandwidth (Figure 7). Along the frontier, WER decreases from approximately 0.34 at 132 kbps to around 0.23 at 300–400 kbps. Within the mid-bandwidth region, 17,000 Hz (≈272 kbps) achieves a WER of approximately 0.242, 17,250 Hz (≈276 kbps) reduces WER to approximately 0.234, and 18,750 Hz (≈297 kbps) reaches near-minimal WER at approximately 0.232. Beyond this range, bandwidth increases of more than 100 kbps result in WER reductions of 0.002 or less, clearly indicating diminishing returns.

4.6. Composite Ranking and Minimum Acceptable Trade-Off Selection

The weighted scoring model integrates normalized WER, latency, and bandwidth into a single composite metric (Table 6). Under accuracy-prioritized weighting, 18,750 Hz achieves the lowest composite score (0.2266), driven primarily by its near-zero normalized WER. The next best configurations are 17,250 Hz (0.2458) and 17,500 Hz (0.2526), which offer slightly higher WER but reduced bandwidth demands. The minimum acceptable trade-off (MAT) analysis further selects 17,250 Hz as the lowest sampling rate that remains within 2% of the minimum observed WER, identifying it as the most cost-efficient near-optimal configuration.

4.7. Recommended Sampling Rate for Bangla Medical Telehealth ASR

Based on converging evidence from acoustic analysis, WER trends, LPF controls, elbow detection, Pareto optimization, and composite scoring, 18,750 Hz is recommended as the best overall sampling rate for Bangla medical telehealth ASR. It achieves near-minimal WER (0.2320, against an observed minimum of 0.2316) while maintaining manageable latency and bandwidth requirements. For deployments with stricter resource constraints, a sampling rate range of 17,250–18,750 Hz is recommended, as this interval captures most of the available accuracy gains while avoiding the inefficiencies associated with higher sampling rates.

4.8. Limitations

This study was designed to examine the effect of audio sampling rate on automatic speech recognition performance within a controlled, application-relevant telemedicine setting. All experiments employed a single cloud-based ASR system and a consistent segmentation and resampling pipeline, allowing sampling rate to be isolated as the primary variable while ensuring comparability across conditions. Although different ASR architectures or front-end configurations may yield different absolute performance levels, the present design enables robust analysis of relative trends associated with sampling rate variation in naturalistic doctor–patient speech. Sibilant presence was estimated using a multi-feature, frame-level acoustic formulation grounded in established phonetic correlates of fricative articulation, providing an interpretable proxy for extended high-frequency information without requiring manual phonetic annotation. Execution time and network bandwidth were measured end-to-end under realistic operating conditions, capturing practical deployment behavior rather than isolated computational components. The mixed recording additionally includes the doctor’s speech received through Zoom’s Opus transmission codec and is subject to AAC-LC compression at approximately 92 kbps stereo, imposing an effective bandwidth ceiling of approximately 15 kHz; both factors fall within acceptable bounds for the study’s critical 8–9.5 kHz frequency range, as confirmed empirically in Section 3.3 and Figure 5. These design choices define the scope within which the findings should be interpreted while supporting a focused evaluation of sampling rate trade-offs in comparable real-world ASR deployment contexts.

4.9. Comparison with Prior Work

Prior research in Bangla ASR has predominantly focused on data-centric and model-centric strategies, including corpus expansion, dialect-aware annotation, self-supervised model fine-tuning, pseudo-labeling, and decoding-level language-model rescoring [2,3,4,5,6,7]. In parallel, healthcare-oriented ASR studies have emphasized system usability, transcription accuracy, and downstream clinical applications, often benchmarking commercial engines under challenging conversational conditions [8,9,10,11,12,13,14]. While a substantial body of acoustic–phonetic work has demonstrated the importance of high-frequency cues—particularly for sibilant fricatives—and highlighted the limitations of narrowband representations [15,16,17,18,19,20,21,22,23,24], these insights have rarely been translated into deployment-level ASR optimization for specific domains or languages. Existing sampling rate studies typically treat sampling rate as a fixed or coarsely evaluated design parameter, inherited from telephony or benchmark conventions, with limited exploration beyond 16 kHz and little consideration of application-specific trade-offs [25,26,27,28,29]. In contrast, the present work complements prior efforts by isolating sampling rate as a controllable signal-level variable and systematically evaluating its fine-grained effects on recognition accuracy, latency, and bandwidth within the context of Bangla doctor–patient conversations. By grounding the analysis in domain-specific acoustic characteristics and explicit resource constraints, this study provides a practical signal-representation perspective that bridges acoustic theory and real-world ASR deployment in low-resource healthcare settings.

5. Conclusions

This study examined audio sampling rate as a controllable signal-level factor influencing automatic speech recognition performance for Bangla doctor–patient conversations under realistic telehealth constraints. Acoustic analyses confirmed the presence of nontrivial high-frequency sibilant-related information beyond the 8 kHz Nyquist limit of standard 16 kHz sampling, motivating systematic resampling experiments. Fine-grained evaluation across sampling rates demonstrated substantial reductions in word error rate as sampling increased into the mid-to-high range, followed by diminishing returns at higher rates despite increasing bandwidth and latency costs. Elbow-point detection, Pareto frontier analysis, weighted scoring, and minimum acceptable trade-off selection converged on a narrow operating region between 17,250 Hz and 18,750 Hz, within which most attainable accuracy gains were realized efficiently. Among these, 18,750 Hz consistently achieved near-minimal WER while maintaining moderate execution time and bandwidth, emerging as the most balanced single operating point when transcription accuracy is prioritized. These findings do not suggest a universal optimal sampling rate but demonstrate that inherited 16 kHz conventions can impose avoidable accuracy limitations in low-resource, conversational medical ASR. By isolating sampling rate from confounding factors and evaluating it under deployment-aware constraints, this work highlights signal-level optimization as a practical complement to data and model-centric advances in low-resource speech recognition.

Author Contributions

P.P. contributed to Conceptualization, Methodology, Investigation, Data curation, Formal analysis, Visualization, Writing—original draft, and Funding Acquisition. M.M.B. contributed to Conceptualization, Methodology, and Writing—review & editing, with primary involvement in the literature review and hypothesis development. M.V.S. contributed to Data curation, Formal analysis, and Visualization, supporting data preprocessing, analytical implementation, and graphical representation of results. F.H. contributed to Data curation, Resources, and Investigation. A.A. contributed to Conceptualization, Supervision, Methodology, Validation, Funding Acquisition, Writing—review & editing, and Project Administration, providing oversight and critical evaluation across all stages of the research workflow, from idea formulation through analytical interpretation. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by JST BOOST (Japan Grant Number JPMJBS2406) and the APC was funded by JST Startup Ecosystem Co-creation Program for New Industry University Startups under the PARKS Startup Creation Program Student Project Step-2 (Japan Grant Number JPMJSF2317).

Data Availability Statement

The audio recordings and transcripts analyzed in this study were collected by a collaborating organization during routine health checkup activities and shared with the authors under a data use agreement. Due to confidentiality obligations and the sensitive nature of doctor–patient conversations, these data are not publicly available and cannot be deposited in an open repository. Access may be considered on a case-by-case basis for non-commercial research, subject to approval by the data-owning organization and applicable ethical requirements.

Acknowledgments

The authors acknowledge the use of ChatGPT 5.2, an AI-based language model developed by OpenAI, as a supportive tool during the preparation of this manuscript. ChatGPT was used to assist with language refinement, structural organization, and iterative drafting of selected sections. The authors also gratefully acknowledge Grameen Communications for their contribution to the collection of the Bangla doctor–patient audio recordings during health checkup services, which formed the empirical basis of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ASR	Automatic Speech Recognition
WER	Word Error Rate
MAT	Minimum Acceptable Trade-off
LRL	Low-Resource Language
LPF	Low-Pass Filter
HF	High Frequency
IQR	Interquartile Range

References

1. Magueresse, A.; Carles, V.; Heetderks, E. Low-resource languages: A review of past work and future challenges. arXiv 2020, arXiv:2006.07264.
2. Kibria, S.; Samin, A.M.; Kobir, M.H.; Rahman, M.S.; Selim, M.R.; Iqbal, M.Z. Bangladeshi Bangla speech corpus for automatic speech recognition research. Speech Commun. 2022, 136, 84–97.
3. Aiman, U.; Islam, M.N.; Chowdhury, M.H.; Rahman, M.S.; Habib, M.T.; Hasan, M. BRADS and BRWDS: Multipurpose audio and text datasets for automatic Bangla regional speech recognition. Data Brief 2025, 63, 112177.
4. Hossain, S.; Rihan, M.R.; Imtiaz, A.; Boni, P.; Gomes, D. Enhancing Bangla local speech-to-text conversion using fine-tuning Wav2vec 2.0 with OpenSLR and self-compiled datasets through transfer learning. In 7th IEOM Bangladesh International Conference on Industrial Engineering and Operations Management; IEOM Society International: Southfield, MI, USA, 2024; Volume 20240161.
5. Nandi, R.N.; Menon, M.; Muntasir, T.; Sarker, S.; Muhtaseem, Q.S.; Islam, M.T.; Chowdhury, S.; Alam, F. Pseudo-labeling for domain-agnostic Bangla automatic speech recognition. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 152–162.
6. Rakib, F.R.; Dip, S.S.; Alam, S.; Tasnim, N.; Shihab, M.I.; Ansary, M.N.; Hossen, S.M.; Meghla, M.H.; Mamun, M.; Sadeque, F.; et al. Ood-speech: A large Bengali speech recognition dataset for out-of-distribution benchmarking. arXiv 2023, arXiv:2305.09688.
7. Rakib, M.; Hossain, M.I.; Mohammed, N.; Rahman, F. Bangla-wave: Improving Bangla automatic speech recognition utilizing n-gram language models. In Proceedings of the 12th International Conference on Software and Computer Applications; Association for Computing Machinery: New York, NY, USA, 2023; pp. 297–301.
8. Just, S.A.; Elvevåg, B.; Pandey, S.; Nenchev, I.; Bröcker, A.L.; Montag, C.; Morgan, S.E. Moving beyond word error rate to evaluate automatic speech recognition in clinical samples: Lessons from research into schizophrenia-spectrum disorders. Psychiatry Res. 2025, 352, 116690.
9. Mani, A.; Palaskar, S.; Konam, S. Towards understanding ASR error correction for medical conversations. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7–11.
10. Klusty, M.A.; Logan, W.V.; Armstrong, S.E.; Mullen, A.D.; Leach, C.N.; Calvert, K.; Talbert, J.; Bumgardner, V.C. Toward automated clinical transcriptions. AMIA Summits Transl. Sci. Proc. 2025, 2025, 235–241.
11. Salloum, W.; Edwards, E.; Ghaffarzadegan, S.; Suendermann-Oeft, D.; Miller, M. Crowdsourced continuous improvement of medical speech recognition. In AAAI Workshops; AAAI Press: Washington, DC, USA, 2017.
12. Gonçalves, Y.T.; Alves, J.V.; Sá, B.A.; da Silva, L.N.; de Macedo, J.A.; da Silva, T.L. MedTalkAI: Assisted anamnesis creation with automatic speech recognition. In Simpósio Brasileiro de Banco de Dados (SBBD); SBC: Porto Alegre, Brazil, 2024; pp. 83–88.
13. Kodish-Wachs, J.; Agassi, E.; Kenny, P., III; Overhage, J.M. A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech. AMIA Annu. Symp. Proc. 2018, 2018, 683.
14. O’Kane, R.; Stonehouse-Smith, D.; Ota, L.C.; Patel, R.; Johnson, N.; Slipper, C.; Seehra, J.; Papageorgiou, S.N.; Cobourne, M.T. Transcription accuracy of automatic speech recognition for orthodontic clinical records. J. Dent. Res. 2025, 00220345251382452.
15. Jongman, A.; Wayland, R.; Wong, S. Acoustic characteristics of English fricatives. J. Acoust. Soc. Am. 2000, 108, 1252–1263.
16. Maniwa, K.; Jongman, A.; Wade, T. Acoustic characteristics of clearly spoken English fricatives. J. Acoust. Soc. Am. 2009, 125, 3962–3973.
17. Guo, Z.C.; Chandrasekaran, B. Extended high-frequency cues to phoneme recognition: Insights from ASR. In Proceedings of the Interspeech 2025; International Speech Communication Association: Grenoble, France, 2025; pp. 1038–1042.
Kozierski, P.; Sadalla, T.; Drgas, S.; Dabrowski, A.; Giemacki, W. Polish Whispery Speech Recognition—Minimum Sampling Frequency; IEEE: New York, NY, USA, 2017; pp. 611–615. [Google Scholar] [CrossRef]
Hokking, R.; Woraratpanya, K. A hybrid of fractal code descriptor and harmonic pattern generator for improving speech recognition of different sampling rates. In Recent Advances in Information and Communication Technology 2017; IC2IT 2017; Advances in Intelligent Systems and Computing; Meesad, P., Sodsee, S., Unger, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2018; Volume 566. [Google Scholar] [CrossRef]
Bauerecker, H.; Nadeu, C.; Padrell, J. On the advantage of frequency-filtering features for speech recognition with variable sampling frequencies: Experiments with speechdatcar databases. In 8th European Conference on Speech Communication and Technology (EUROSPEECH 2003–INTERSPEECH 2003); International Speech Communication Association: Grenoble, France, 2003; pp. 869–872. [Google Scholar]
Liu, F.H.; Picheny, M. On variable sampling frequencies in speech recognition. In Proceedings of the 5th International Conference on Spoken Language Processing; International Speech Communication Association: Grenoble, France, 1998. [Google Scholar]
Nadeu, C.; Tolos, M. Recognition experiments with the SpeechDat-Car Aurora Spanish database using 8 kHz- and 16 kHz-sampled signals. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU’01); IEEE: New York, NY, USA, 2001; pp. 135–138. [Google Scholar]
Ssnderson, C.; Paliwal, K.K. Effect of different sampling rates and feature vector sizes on speech recognition performance. In Proceedings of the IEEE TENCON’97; IEEE: New York, NY, USA, 1997; Volume 1, pp. 161–164. [Google Scholar]
Hirsch, H.G.; Hellwig, K.; Dobler, S. Speech recognition at multiple sampling rates. In Proceedings of the EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event; International Speech Communication Association: Grenoble, France, 2001; pp. 1837–1840. [Google Scholar]
Guo, Z.C.; Chandrasekaran, B. Extended high frequencies improve phoneme recognition: Evidence from automatic speech recognition in spatial speech mixtures. J. Acoust. Soc. Am. 2025, 158, 3365–3377. [Google Scholar] [CrossRef] [PubMed]
Roberts, P.J.; Reetz, H.; Lahiri, A. Corpus-testing a fricative discriminator; or, just how invariant is this invariant? In 15th Annual Conference of the International Speech Communication Association; International Speech Communication Association: Grenoble, France, 2014; pp. 189–192. [Google Scholar]
Steiner, I.M.A. Observations on the Dynamic Control of an Articulatory Synthesizer Using Speech Production Data. Ph.D. Thesis, Karlsruhe Institute of Technology, Karlsruhe, Germany, 2010. [Google Scholar] [CrossRef]
Hokking, R.; Woraratpanya, K.; Kuroki, Y. Speech recognition of different sampling rates using fractal code descriptor. In 13th International Joint Conference on Computer Science and Software Engineering (JCSSE); IEEE: New York, NY, USA, 2016; pp. 1–5. [Google Scholar]
Shadle, C.H.; Chen, W.R.; Koenig, L.L.; Preston, J.L. Refining and extending measures for fricative spectra, with special attention to the high-frequency range. J. Acoust. Soc. Am. 2023, 154, 1932–1944. [Google Scholar] [CrossRef] [PubMed]
Ahmed, A.; Inoue, S.; Kai, E.; Nakashima, N.; Nohara, Y. Portable health clinic: A pervasive way to serve the unreached community for preventive healthcare. In International Conference on Distributed, Ambient, and Pervasive Interactions; Springer: Berlin/Heidelberg, Germany, 2013; pp. 265–274. [Google Scholar]
Valin, J.M.; Vos, K.; Terriberry, T.B. RFC 6716; Definition of the Opus Audio Codec; Internet Engineering Task Force: Wilmington, DE, USA, 2012; Available online: https://tools.ietf.org/html/rfc6716 (accessed on 25 February 2026).
Monson, B.B.; Hunter, E.J.; Lotto, A.J.; Story, B.H. The perceptual significance of high-frequency energy in the human voice. Front. Psychol. 2014, 5, 587. [Google Scholar] [CrossRef] [PubMed]
Forrest, K.; Weismer, G.; Milenkovic, P.; Dougall, R.N. Statistical analysis of word-initial voiceless obstruents: Preliminary data. J. Acoust. Soc. Am. 1988, 84, 115–123. [Google Scholar] [CrossRef] [PubMed]
Kong, Y.Y.; Mullangi, A.; Kokkinakis, K. Classification of fricative consonants for speech enhancement in hearing devices. PLoS ONE 2014, 9, e95001. [Google Scholar] [CrossRef] [PubMed]
Ramirez, J.; Górriz, J.M.; Segura, J.C. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness. In Robust Speech Recognition and Understanding; Grimm, M., Kroschel, K., Eds.; I-Tech Education and Publishing: Rijeka, Croatia, 2007; pp. 1–22. [Google Scholar]

Figure 1. Methodological framework for sampling rate-dependent ASR performance evaluation and acoustic analysis of Bangla telehealth speech.

Figure 2. Distribution of Spectral Characteristics across 20 recordings.

Figure 3. Distribution of Frame-level Classification across 20 recordings.

Figure 4. Distribution of Composite Sibilant-likelihood Score & Component Contributions across 20 recordings.

Figure 5. Wideband (5 ms) and narrowband (25 ms) spectrograms of four representative segments containing sibilant-bearing medical terms from Bangla telehealth consultations. Dashed lines indicate the 8 kHz Nyquist boundary of 16 kHz sampling and the 2.5 kHz lower bound of the sibilant band.

Figure 6. Elbow detection for WER as a function of sampling rate. Each red dot represents a discrete sampling rate configuration and its corresponding WER.
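
One common way to locate an elbow of this kind is to scale both axes to [0, 1] and pick the point farthest from the chord joining the first and last points. The sketch below illustrates that heuristic on made-up values. It is not necessarily the detection procedure used in this study, and `find_elbow` is a hypothetical helper name.

```python
def find_elbow(xs, ys):
    """Return the index of the point farthest from the chord joining the
    first and last points, after scaling both axes to [0, 1]."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) pairs")
    x_span = xs[-1] - xs[0]
    y_min, y_max = min(ys), max(ys)
    if x_span == 0 or y_max == y_min:
        raise ValueError("degenerate curve")
    xn = [(x - xs[0]) / x_span for x in xs]
    yn = [(y - y_min) / (y_max - y_min) for y in ys]
    # Direction of the chord from the first to the last normalized point.
    dx, dy = xn[-1] - xn[0], yn[-1] - yn[0]
    # The perpendicular distance is proportional to |cross product|.
    dists = [abs(dx * (yn[i] - yn[0]) - dy * (xn[i] - xn[0]))
             for i in range(len(xs))]
    return max(range(len(xs)), key=dists.__getitem__)


# Illustrative values only (sampling rate in kHz, WER); not study data.
rates_khz = [8, 10, 12, 14, 16]
wer = [0.40, 0.28, 0.25, 0.24, 0.235]
elbow_index = find_elbow(rates_khz, wer)
print(rates_khz[elbow_index])
```

On noisy curves the chord heuristic can be sensitive to single outliers, so smoothing the WER series first may be worth considering.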

Figure 7. Pareto frontier (red line) of WER vs. bandwidth.
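
A Pareto frontier over WER and bandwidth keeps only those configurations for which no cheaper configuration is at least as accurate. The following minimal sketch shows one way to compute it with a single sorted sweep. The rows are a subset copied from Table 4, and `pareto_frontier` is a hypothetical helper rather than the code behind the figure.

```python
def pareto_frontier(configs):
    """Return configs not dominated in (payload, WER); lower is better for both.

    configs: iterable of (sampling_rate_hz, wer, payload_kbps) tuples.
    """
    frontier = []
    best_wer = float("inf")
    # Sort by payload (ties broken by WER), then sweep once.
    for rate, wer, kbps in sorted(configs, key=lambda c: (c[2], c[1])):
        if wer < best_wer:  # strictly more accurate than every cheaper config
            frontier.append((rate, wer, kbps))
            best_wer = wer
    return frontier


# Subset of Table 4: (sampling rate in Hz, global WER, payload median in kbps).
rows = [
    (8000, 0.3383, 132.0235),
    (8500, 0.3036, 136.0235),
    (16000, 0.2505, 256.0235),
    (17250, 0.2341, 276.0235),
    (18000, 0.2444, 288.0235),
    (18750, 0.2320, 297.0235),
    (20000, 0.2342, 324.0235),
    (32000, 0.2312, 512.0235),
]
frontier_rates = [rate for rate, _, _ in pareto_frontier(rows)]
print(frontier_rates)
```

For this subset, 18,000 Hz and 20,000 Hz drop out because a cheaper configuration already achieves a lower WER.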

Figure 8. Near-optimal WER and MAT selection as a function of sampling rate.
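
One plausible reading of the selection shown in Figure 8 is: keep every configuration whose WER lies within a tolerance of the best observed WER, then choose the one with the smallest payload. The sketch below follows that reading. The 0.005 absolute WER tolerance and the name `select_mat` are assumptions made for illustration, and the rows are a subset copied from Table 4.

```python
def select_mat(configs, tolerance=0.005):
    """Pick the lowest-payload config whose WER is within `tolerance`
    (absolute, hypothetical value) of the best WER in `configs`.

    configs: list of (sampling_rate_hz, wer, payload_kbps) tuples.
    """
    if not configs:
        raise ValueError("no configurations supplied")
    best_wer = min(wer for _, wer, _ in configs)
    near_optimal = [c for c in configs if c[1] <= best_wer + tolerance]
    return min(near_optimal, key=lambda c: c[2])


# Subset of Table 4: (sampling rate in Hz, global WER, payload median in kbps).
rows = [
    (16000, 0.2505, 256.0235),
    (17250, 0.2341, 276.0235),
    (18000, 0.2444, 288.0235),
    (18750, 0.2320, 297.0235),
    (20000, 0.2342, 324.0235),
    (32000, 0.2312, 512.0235),
]
mat_rate, mat_wer, mat_kbps = select_mat(rows)
print(mat_rate, mat_wer, mat_kbps)
```

A tighter tolerance pushes the choice toward higher rates, so the value would need to be set according to how much accuracy loss is clinically acceptable.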

Table 1. Spectral Characteristics and Sibilant Classification Outcomes for 20 Audio Files.

File	Mean Spectral Centroid (Hz)	Mean Spectral Flatness	Mean HF Energy Ratio	Likely Sibilant (%)	Possible Sibilant (%)	Likely Noise (%)	Possible Noise (%)	Unclear (%)
1.wav	1848.223	0.131	0.053	13.381	11.053	0.003	4.961	75.495
2.wav	2052.420	0.159	0.065	15.334	13.676	0.024	13.064	70.772
3.wav	2455.571	0.223	0.086	21.260	18.793	0.123	29.383	59.295
4.wav	2279.628	0.212	0.079	22.857	14.090	0.002	0.647	63.049
5.wav	2251.891	0.191	0.070	16.377	18.411	0.004	27.373	64.520
6.wav	2006.801	0.146	0.063	14.819	12.467	0.009	0.983	72.714
7.wav	1961.270	0.178	0.067	21.365	10.505	0.003	0.107	68.130
8.wav	2182.456	0.194	0.067	17.379	18.413	0.001	24.211	63.946
9.wav	2295.956	0.184	0.064	17.086	16.954	0.028	26.026	64.986
10.wav	2089.816	0.160	0.059	13.879	15.507	0.168	20.544	69.885
11.wav	2560.026	0.200	0.075	16.072	22.242	1.646	26.995	60.316
12.wav	3018.359	0.287	0.109	25.231	27.668	0.019	43.224	45.737
13.wav	2459.080	0.217	0.078	21.203	18.540	0.001	31.632	59.265
14.wav	2945.454	0.304	0.122	32.794	22.905	0.001	4.490	44.283
15.wav	2208.959	0.147	0.058	11.967	16.210	0.173	19.653	70.579
16.wav	2028.531	0.127	0.052	10.604	13.692	0.029	15.575	74.709
17.wav	2300.926	0.139	0.053	11.054	16.990	6.150	11.756	70.871
18.wav	2156.108	0.119	0.052	9.706	15.578	1.534	13.286	73.008
19.wav	2206.857	0.126	0.046	9.906	16.646	3.797	12.088	72.233
20.wav	1905.613	0.105	0.049	8.470	11.743	0.073	11.358	78.530
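
The three spectral measures reported in Table 1 have conventional textbook definitions, which the following minimal sketch illustrates for a single frame. It assumes power-spectrum weighting and a hypothetical 4 kHz high-frequency cutoff (`hf_cutoff_hz`); neither choice is confirmed by the table, and the naive DFT is used only to keep the example free of dependencies.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum via a naive O(n^2) DFT (fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectral_features(frame, sample_rate_hz, hf_cutoff_hz=4000.0):
    """Return (centroid_hz, flatness, hf_energy_ratio) for one frame."""
    power = power_spectrum(frame)
    total = sum(power)
    if total == 0:
        raise ValueError("silent frame")
    n = len(frame)
    freqs = [k * sample_rate_hz / n for k in range(len(power))]
    # Centroid: power-weighted mean frequency.
    centroid = sum(f * p for f, p in zip(freqs, power)) / total
    # Flatness: geometric mean over arithmetic mean (eps avoids log(0)).
    eps = 1e-12
    geo_mean = math.exp(sum(math.log(p + eps) for p in power) / len(power))
    flatness = geo_mean / (total / len(power))
    # HF ratio: share of energy at or above the assumed cutoff.
    hf_ratio = sum(p for f, p in zip(freqs, power) if f >= hf_cutoff_hz) / total
    return centroid, flatness, hf_ratio


# A pure 2 kHz tone sampled at 16 kHz: narrow spectrum, no high-frequency energy.
sr, n = 16000, 64
tone = [math.sin(2 * math.pi * 2000 * t / sr) for t in range(n)]
centroid, flatness, hf_ratio = spectral_features(tone, sr)
print(round(centroid), flatness < 0.01, hf_ratio < 1e-6)
```

Noise-like sibilant frames would be expected to move all three values upward relative to this tonal example, which matches the direction of the higher-scoring files in the table.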

Table 2. Frame-Level Sibilant Likelihood Scores and Contributing Acoustic Components Across Recordings.

File Name	Sibilant Frames Count	Sibilant Score	Centroid Component	Flatness Component	HF Energy Component	Sibilant Energy Component	ZCR Component	Skewness Component	Energy Component
1.wav	6811	0.6096	0.8241	0.6871	0.7594	0.6769	0.5199	0.0377	0.0458
2.wav	6020	0.6037	0.8148	0.7227	0.7726	0.6262	0.4803	0.0550	0.0308
3.wav	6509	0.5974	0.8128	0.7558	0.7776	0.5747	0.4498	0.0615	0.0190
4.wav	9191	0.6268	0.8120	0.7771	0.7786	0.6678	0.5286	0.0479	0.0185
5.wav	2566	0.5889	0.8270	0.7375	0.7640	0.5498	0.4431	0.0577	0.0268
6.wav	4025	0.6092	0.7780	0.7178	0.7420	0.7073	0.4763	0.0791	0.0355
7.wav	1793	0.6533	0.8230	0.7834	0.8095	0.7819	0.5041	0.0300	0.0162
8.wav	5056	0.5971	0.8140	0.7219	0.7338	0.5948	0.5142	0.0494	0.0912
9.wav	19,148	0.5892	0.8197	0.7098	0.7292	0.5628	0.5090	0.0687	0.0725
10.wav	2094	0.5912	0.8188	0.7170	0.7524	0.5902	0.4507	0.0680	0.0247
11.wav	4167	0.5713	0.8149	0.6869	0.7278	0.5225	0.4434	0.0938	0.0710
12.wav	5621	0.5800	0.8033	0.7546	0.7577	0.5079	0.4170	0.0941	0.0426
13.wav	3925	0.5922	0.8209	0.7360	0.7567	0.5322	0.5096	0.0733	0.0518
14.wav	3064	0.6100	0.7594	0.7871	0.7784	0.6209	0.4811	0.1044	0.0235
15.wav	6023	0.5714	0.7957	0.6623	0.7185	0.5588	0.4675	0.0978	0.0740
16.wav	4981	0.5793	0.8046	0.6538	0.7164	0.6010	0.4765	0.0830	0.0793
17.wav	4838	0.5597	0.7885	0.6170	0.6873	0.5572	0.4716	0.1177	0.1029
18.wav	3066	0.5552	0.7783	0.5552	0.6474	0.6009	0.5221	0.1038	0.1718
19.wav	5377	0.5568	0.8035	0.5858	0.6386	0.5941	0.4899	0.0935	0.1193
20.wav	3310	0.5741	0.8025	0.6299	0.6969	0.6146	0.4990	0.0723	0.0608

Table 3. Medical terms containing sibilant fricatives identified in representative consultation segments, with corresponding Bangla script and fricative classifications.

Filename & Time Stamp	Bangla	English Meaning	Fricative(s) Present
1.wav (2 min 15 s–2 min 18 s)	শুকনো কাশি	Dry Cough	শ (/ʃ/)
6.wav (52 s–56 s)	ওষুধ	Medicine	ষ (/ʂ/)
14.wav (15 s–18 s)	শারীরিক অসুস্থতা	Physical Discomfort	শ (/ʃ/), স (/s/)
17.wav (24 s–29 s)	প্রেসার বেশি	High Pressure	স (/s/), শ (/ʃ/)

Table 4. Sampling rate–dependent ASR performance and resource metrics, reporting global WER, median execution latency, latency variability, and estimated payload bandwidth across selected sampling rate configurations.

Sampling Rate (Hz)	Global WER	Latency Median (s)	Latency IQR (s)	Payload Median (kbps)
8000	0.3383	2.0558	0.8758	132.0235
8500	0.3036	3.4907	1.4915	136.0235
15,250	0.2570	3.5705	1.5301	244.0235
15,500	0.2463	3.5877	1.5819	248.0235
15,750	0.2492	3.6042	1.5241	252.0235
16,000	0.2505	3.6562	1.5315	256.0235
16,250	0.2478	3.7584	1.5453	260.0235
16,500	0.2433	3.5727	1.4933	264.0235
16,750	0.2437	3.8609	1.4934	268.0235
17,000	0.2420	3.6006	1.5008	272.0235
17,250	0.2341	3.7912	1.5355	276.0235
17,500	0.2346	3.8089	1.5675	280.0235
17,750	0.2411	3.7927	1.4872	284.0235
18,000	0.2444	3.9073	1.5756	288.0235
18,250	0.2357	3.7327	1.4722	292.0235
18,500	0.2389	4.0662	1.6441	296.0235
18,750	0.2320	3.5826	1.4947	297.0235
19,000	0.2420	3.8670	1.5573	304.0235
19,250	0.2400	3.9875	1.5398	308.0235
19,500	0.2378	4.1998	1.4575	312.0235
19,750	0.2333	4.1171	1.4909	318.0235
20,000	0.2342	3.9819	1.5807	324.0235
21,000	0.2379	4.0645	1.4763	336.0235
25,000	0.2317	4.1946	1.4781	400.0235
26,000	0.2394	4.0909	1.5471	416.0235
27,000	0.2400	4.0993	1.5345	432.0235
31,000	0.2352	4.0316	1.5869	496.0235
32,000	0.2312	4.1119	1.5223	512.0235
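
The payload column grows roughly linearly with sampling rate, as would be expected for uncompressed audio. As a rough cross-check, the sketch below computes the raw bitrate under the assumption of 16-bit mono PCM, which is not stated in the table. Under that assumption 16,000 Hz gives 256 kbps, close to the tabulated 256.0235 kbps; the small residual and the few rows that deviate by several kbps are not explained by this simple model.

```python
def pcm_kbps(sample_rate_hz, bit_depth=16, channels=1):
    """Raw PCM bitrate in kbps (1 kbps = 1000 bit/s), excluding any
    container or transport overhead. Bit depth and channel count are
    assumptions, not values taken from the study."""
    return sample_rate_hz * bit_depth * channels / 1000.0


for rate_hz in (8500, 16000, 32000):
    print(rate_hz, pcm_kbps(rate_hz))
```

Because the cost is linear in the sampling rate, moving from 16,000 Hz to 18,750 Hz adds about 17% to the raw bitrate under the same assumption.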

Table 5. Low-pass–filtered control condition using 32 kHz audio with spectral content limited to ≤8 kHz, reporting word error rate, execution latency, and bandwidth to isolate the impact of extended high-frequency information on ASR performance.

Sampling Rate (Hz)	Effective Bandwidth	Global WER	Latency Median (s)	Latency IQR (s)	Payload Median (kbps)
32,000	≤8 kHz	0.2516	3.4115	1.6257	512.0235
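
The control condition is easiest to read next to the corresponding unfiltered rows of Table 4. The short calculation below, a sketch using only the tabulated WER values, compares the filtered 32 kHz result with full-band 32 kHz and with native 16 kHz audio. The helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, improved):
    """Fractional WER reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


wer_filtered_32k = 0.2516  # Table 5: 32 kHz audio, content limited to <= 8 kHz
wer_full_32k = 0.2312      # Table 4: 32,000 Hz row
wer_16k = 0.2505           # Table 4: 16,000 Hz row

gain = relative_reduction(wer_filtered_32k, wer_full_32k)
gap_to_16k = wer_filtered_32k - wer_16k

print(f"full-band vs filtered: {gain:.1%} relative WER reduction")
print(f"filtered vs native 16 kHz: {gap_to_16k:.4f} absolute WER difference")
```

The filtered condition lands within about 0.001 WER of native 16 kHz audio, while restoring content above 8 kHz lowers WER by roughly 8% in relative terms, which is consistent with the accuracy gain coming from the extended band rather than from the higher sample count alone.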

Table 6. Normalized performance metrics and composite weighted scores for near-optimal sampling rates, showing the relative trade-off between recognition accuracy (global WER), processing latency, and payload bandwidth under an accuracy-prioritized weighting scheme.

Sampling Rate (Hz)	Global WER	Latency Median (s)	Payload Median (kbps)	Weighted Score
18,750	0.2320	3.5826	297.0235	0.2266
17,250	0.2341	3.7912	276.0235	0.2458
17,500	0.2346	3.8089	280.0235	0.2526
18,250	0.2357	3.7327	292.0235	0.2583
17,000	0.2420	3.6006	272.0235	0.2712
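
A composite score of this kind is typically built by min-max normalizing each metric and taking a weighted sum, with lower values being better. The sketch below shows that construction on the five rows of Table 6. The weights (0.6 for WER, 0.2 each for latency and payload) are hypothetical, and normalization is done over these five rows only, so the absolute scores are not expected to reproduce the Weighted Score column.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def weighted_scores(rows, weights=(0.6, 0.2, 0.2)):
    """rows: list of (rate_hz, wer, latency_s, payload_kbps).
    Returns {rate_hz: score}; lower is better. Weights are illustrative."""
    wer_n, lat_n, bw_n = (min_max([r[i] for r in rows]) for i in (1, 2, 3))
    w_wer, w_lat, w_bw = weights
    return {
        r[0]: w_wer * wer_n[j] + w_lat * lat_n[j] + w_bw * bw_n[j]
        for j, r in enumerate(rows)
    }


# Rows copied from Table 6: (rate in Hz, global WER, latency median s, payload kbps).
table6 = [
    (18750, 0.2320, 3.5826, 297.0235),
    (17250, 0.2341, 3.7912, 276.0235),
    (17500, 0.2346, 3.8089, 280.0235),
    (18250, 0.2357, 3.7327, 292.0235),
    (17000, 0.2420, 3.6006, 272.0235),
]
scores = weighted_scores(table6)
ranking = sorted(scores, key=scores.get)
print(ranking)
```

With these illustrative weights the ordering of the five rates matches the order in which Table 6 lists them, although a different weighting could change it.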

