Article

Development and Comparison of Machine Learning and Deep Learning Models for Speech Audiometry Prediction

1 Department of Software Convergence, Soonchunhyang University, Asan 31538, Republic of Korea
2 Research Institute, National Cancer Center, Goyang 10245, Republic of Korea
3 Department of Otorhinolaryngology—Head and Neck Surgery, College of Medicine, Soonchunhyang University Cheonan Hospital, Cheonan 31151, Republic of Korea
4 Institute for Artificial Intelligence and Software, Soonchunhyang University, Asan 31538, Republic of Korea
5 Department of Computer Software Engineering, Soonchunhyang University, Asan 31538, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(6), 3071; https://doi.org/10.3390/app15063071
Submission received: 12 February 2025 / Revised: 6 March 2025 / Accepted: 10 March 2025 / Published: 12 March 2025
(This article belongs to the Special Issue Advances in Machine Learning for Healthcare Applications)

Abstract

Hearing loss significantly impacts daily communication, making accurate speech audiometry (SA) assessment essential for diagnosis and treatment. However, SA testing is time-consuming and resource-intensive, limiting its accessibility in clinical practice. This study aimed to develop a multi-class classification model that predicts SA results using pure-tone audiometry (PTA) data, enabling a more efficient and automated assessment. To achieve this, we implemented and compared MLP, RNN, gradient boosting, and XGBoost models, evaluating their performance using accuracy, F1 score, log loss, and confusion matrix analysis. Experimental results showed that gradient boosting achieved the highest accuracy, 86.22%, while XGBoost demonstrated a more balanced per-class classification performance. The MLP and RNN achieved relatively lower accuracies of 85.77% and 85.41%, respectively, with the RNN limited by the low temporal dependency of PTA data. Additionally, all models faced challenges predicting class 2 (the poorest hearing category) due to overlapping data distributions. These findings suggest that machine learning models, particularly gradient boosting and XGBoost, outperform deep learning models in SA prediction. Future research should focus on feature engineering, hyperparameter optimization, and ensemble approaches to enhance performance and validate real-world applicability. The proposed model could contribute to automating SA prediction and improving hearing assessment efficiency and patient care.

1. Introduction

In recent years, deep learning and machine learning have driven groundbreaking advancements across various fields. In particular, these technologies have demonstrated exceptional performance in learning and predicting complex data patterns in the medical domain, attracting significant attention from researchers and healthcare professionals. Medical data inherently contain vast amounts of intricate information, with various factors related to a patient’s health status often exhibiting interdependencies [1]. Deep learning and machine learning are powerful tools for efficiently processing such data and extracting meaningful insights, especially when modeling nonlinear and complex relationships [2].
Utilizing these technological advantages, deep learning has been actively applied in various medical applications, including medical image analysis, pathological data classification, and biometric signal processing [3]. Machine learning excels in analyzing structured data and making predictions, playing an equally vital role in medical data analysis [4]. Against this backdrop, the present study aimed to develop a deep learning- and machine learning-based multi-class classification model to predict speech audiometry results using pure-tone audiometry (PTA) data, a crucial diagnostic tool for evaluating auditory health. PTA assesses hearing by measuring the minimum sound intensity a patient can perceive at specific frequencies, thereby evaluating hearing across different frequency ranges [5]. In contrast, speech audiometry evaluates a patient’s ability to recognize and comprehend speech in everyday conversations, which is closely linked to real-world language perception abilities [6]. Although both tests serve complementary roles in diagnosing and managing hearing impairments, speech audiometry is often time-consuming and costly, making it impractical to administer to all patients uniformly [7].
Speech audiometry relies on headphones and other equipment for measurement, making it inherently dependent on the subjective assessment of both the patient and the audiologist. This dependency can lead to reduced accuracy in speech audiometry results, increasing the risk of diagnostic and measurement errors by clinicians or patients. Furthermore, various external factors, such as environmental noise, individual differences, dialects, and other human-related factors, may contribute to lower reliability in speech audiometry assessments.
The primary objective of this study was to develop and compare models that predict speech audiometry outcomes solely from PTA data, identifying the most effective model for accurate predictions. This approach could potentially reduce the need for additional testing, thereby enhancing diagnostic efficiency and the effectiveness of treatment. If speech audiometry results can be accurately predicted from pure-tone audiometry data, it would save medical resources and patient time and enable a more rapid and precise approach to diagnosing and managing hearing impairments [8]. Machine learning and deep learning algorithms are pivotal in developing such predictive models. While machine learning can classify or predict speech recognition ability based on learned relationships within auditory data, deep learning excels at handling large-scale data and capturing complex patterns to model nonlinear relationships. Specifically, as a fundamental hearing assessment, PTA consists of air conduction (AC) and bone conduction (BC) tests. Air conduction audiometry evaluates sound transmission through the external ear and middle ear, whereas bone conduction audiometry measures sound transmission through the skull [9]. Deep learning models can use these data to learn intricate patterns, improving predictive accuracy.
This study aimed to explore new possibilities in medical data analysis by integrating deep learning and machine learning, analyzing the effectiveness of different models in predicting speech audiometry outcomes from pure-tone audiometry data. By doing so, we sought to enhance the convenience and accuracy of the diagnostic process, ultimately maximizing the effectiveness of hearing impairment treatment and rehabilitation. Furthermore, this research is expected to contribute significantly to expanding the applicability of deep learning and machine learning in medical data analysis. The study’s key contributions are as follows:
  • Proposes a deep learning- and machine learning-based multi-class classification model for predicting speech audiometry results using pure-tone audiometry data and compares the performance of the multilayer perceptron (MLP), recurrent neural network (RNN), XGBoost, and gradient boosting models.
  • Demonstrates that deep learning and machine learning models can effectively learn the nonlinear patterns of pure-tone audiometry data.
  • The performance of each deep learning and machine learning model was evaluated using accuracy, loss, and a confusion matrix. The results indicate that the multilayer perceptron model slightly outperformed the recurrent neural network model in accuracy and that the gradient boosting model achieved higher accuracy than XGBoost, with XGBoost showing more balanced per-class performance.
This paper consists of the following sections. Section 2 discusses the relevant background research on PTA data, speech audiometry data, MLP, RNN, gradient boosting, XGBoost, and performance evaluation metrics. Section 3 describes the deep learning and machine learning models used for PTA data classification and speech audiometry prediction. Section 4 presents the dataset, experimental setup, evaluation metrics, and comparative analysis. Finally, Section 5 concludes the study and suggests directions for future research.

2. Related Work

2.1. Speech Audiometry

Speech audiometry is a critical hearing assessment used to evaluate an individual’s speech recognition ability, directly measuring their capacity to comprehend spoken language in real-life situations [10]. Unlike PTA, this test involves presenting speech sounds and assessing how accurately they can be understood. Speech audiometry evaluates a patient’s ability to perceive and distinguish speech sounds at various intensity levels. It plays a crucial role in determining their auditory condition and the necessity for hearing assistive devices such as hearing aids or cochlear implants [11].
The threshold values used in speech audiometry, namely the speech recognition threshold (SRT) and the speech detection threshold (SDT), are crucial indicators for auditory assessment, enabling the quantitative measurement of a patient’s speech recognition and detection abilities. SRT refers to the lowest sound intensity (dB HL) at which a patient can minimally recognize speech, and it is typically measured as the minimum intensity at which speech is correctly recognized 50% of the time (SRT 50) [12]. During the test, monosyllabic or disyllabic words are presented at varying intensity levels, and the lowest level at which the patient correctly recognizes a specified percentage of speech is recorded as the threshold [13]. SRT serves as an essential benchmark for evaluating a patient’s overall hearing status in comparison with PTA results.
Meanwhile, SDT measures the lowest intensity at which a patient can detect the presence of speech sounds without necessarily understanding their meaning. This threshold is determined at the point where the patient simply perceives the existence of an auditory stimulus. SDT values generally exhibit a strong correlation with the lowest threshold obtained in pure-tone audiometry and tend to be lower than SRT values [14].
Ristovska L. et al. [15] conducted a study analyzing the correlation between pure-tone audiometry (PTA) and speech audiometry as a cross-validation method to enhance the accuracy of hearing assessments. The study included 52 children aged 5 to 14 years (30 males, 22 females) with hearing loss, who underwent pure-tone audiometry and speech audiometry. Statistical analyses included the Pearson correlation coefficient and the chi-squared test, with a significance level of p < 0.05. The results showed that the speech detection threshold (SDT) strongly correlated with the best pure-tone threshold, with the highest correlation observed in the sloping audiometric configuration (r = 0.993). Additionally, the speech recognition threshold (SRT) demonstrated a strong correlation with PTA (500–2000 Hz), PTA (500–4000 Hz), and PTA (500–1000 Hz) (r = 0.978, r = 0.91, r = 0.909, respectively), with the highest correlation at 1000 Hz (r = 0.986). The difference between SDT and SRT was ≤12 dB in most cases (p = 0.033), supporting the reliability of these measures. This study confirmed the strong correlation between pure-tone audiometry and speech audiometry, highlighting the value of speech thresholds as an effective cross-check for pure-tone threshold measurements. These findings contribute to more reliable clinical evaluations of hearing loss and provide a solid foundation for improving diagnostic accuracy in audiology.
The speech discrimination score (SDS) is also used to assess a patient’s language comprehension ability at the optimal sound intensity level, where they can hear most effectively. This is a crucial reference for customizing hearing assistive device settings [16]. It plays a key role in precisely evaluating the degree and type of hearing impairment, thereby contributing to the formulation of treatment and rehabilitation plans. In particular, the speech audiometry test helps identify the listening difficulties patients experience in everyday conversations, allowing for the optimization of auditory treatment effectiveness [17].

2.2. Pure-Tone Audiometry

PTA is the most fundamental and widely used hearing assessment for evaluating the degree and type of hearing loss. This test generates pure tones at various frequency ranges and measures the hearing threshold—the minimum sound intensity a patient can perceive [18]. Commonly tested frequencies range from 125 Hz to 8000 Hz, including 250 Hz, 500 Hz, 1000 Hz, 2000 Hz, and 4000 Hz, allowing for a detailed assessment of hearing loss across different frequencies [19]. Using an audiometer, the examiner gradually adjusts the sound intensity, and the patient responds upon perceiving the sound, enabling the determination of the hearing threshold [20]. The test is conducted independently for each ear using either headphones or a BC vibrator, and the results are typically visualized in an audiogram.
PTA is categorized into AC testing and BC testing based on the sound transmission pathway. AC testing evaluates sound transmission through the external and middle ear, making it helpful in identifying conductive hearing loss caused by outer or middle ear issues [21]. In contrast, BC testing assesses sound transmission through the skull, allowing for the diagnosis of sensorineural hearing loss resulting from inner ear or auditory nerve damage. By comparing the results of these two tests, clinicians can determine whether hearing loss originates from the outer/middle ear or the inner ear/auditory nerve [22]. During the test, patients respond to sounds at different frequencies, enabling precise identification of hearing loss severity. If hearing loss occurs only in specific frequency ranges, it may indicate damage to a particular auditory pathway or an underlying disorder [23]. The results, recorded as an audiogram, comprehensively analyze hearing loss severity and type. This information is crucial for assessing the suitability of hearing assistive devices such as hearing aids or cochlear implants and for designing auditory rehabilitation programs. PTA is a relatively simple, yet essential method for the early detection of hearing loss and plays a critical role in evaluating auditory health and formulating treatment plans [24].
Kim, H. et al. [25] evaluated a deep learning-based approach for classifying hearing loss types using PTA data. The study utilized 4007 PTA records analyzed and labeled by medical experts between 2017 and 2021, comparing various deep learning models. This study employed RNN, LSTM, and GRU, which are well suited for learning sequential data patterns, to analyze PTA data. Additionally, the impact of different normalization techniques, including Z score, MinMaxScaler, and MaxAbsScaler, on model performance was examined. The experimental results demonstrated that the LSTM model achieved the highest classification accuracy of 97.56%, indicating that deep learning models capable of capturing sequential patterns in PTA data are effective for hearing loss classification. Furthermore, this study explored the potential of CGAN-based data augmentation to enhance model performance. By incorporating generated data into the training process, the generalization performance of the models improved, with LSTM and GRU models exhibiting the most significant enhancements. These findings highlight that CGAN-based data augmentation can serve as an effective method to enhance deep learning model training, particularly in medical domains with limited data availability.
Liu, X. et al. [26] developed a machine learning model to automatically diagnose Ménière’s disease and predict endolymphatic hydrops based on PTA data. Traditionally, a diagnosis of Ménière’s disease has relied on gadolinium-enhanced magnetic resonance imaging and hearing tests; however, magnetic resonance imaging is costly and has limited accessibility. To address this issue, this study proposed a more efficient diagnostic approach by training a machine learning model using pure-tone audiometry data. The study collected data from 262 patients with Ménière’s disease and 74 patients with vestibular migraine, using air conduction thresholds in the frequency range of 125 Hz to 8 kHz as key features. To train the model, five machine learning algorithms—logistic regression, support vector machine, decision tree, random forest, and light gradient-boosting machine—were employed and compared. Cross-validation was conducted to evaluate model performance, and the light gradient-boosting machine achieved the highest accuracy (87%), sensitivity (83%), specificity (90%), and receiver operating characteristic area under the curve (0.95), demonstrating its effectiveness in diagnosing Ménière’s disease. Additionally, the performance of the machine learning model was comparable to or even superior to that of experienced otolaryngologists in certain aspects.
Compared to speech audiometry, PTA provides a more objective evaluation by precisely measuring hearing thresholds at specific frequencies. This allows for a detailed assessment of auditory function across different frequency bands [27]. While speech audiometry evaluates language comprehension in daily communication, PTA objectively quantifies frequency-specific hearing loss, providing clinicians with a systematic approach to diagnosing and analyzing auditory impairments [28]. During the test, sound intensity is gradually adjusted to determine hearing thresholds, helping to identify whether hearing loss is concentrated in specific frequencies or affects a broader range. Furthermore, PTA is conducted in a soundproof booth, eliminating external noise interference and enhancing test reliability. This controlled environment allows for consistent retesting under identical conditions, making it well suited for monitoring changes over time [29]. The test is relatively quick and can be applied to individuals of all ages, from infants to older adults. While the patient responds to auditory stimuli at various frequencies, the examiner records their responses in real time, visualizing the degree and pattern of hearing loss [30,31].
PTA is particularly effective in detecting early-stage hearing loss and identifying mild hearing impairments. Even when hearing loss is limited to specific frequencies, the test can differentiate these cases, enabling a personalized approach to treatment and rehabilitation [32]. Additionally, PTA results play a crucial role in designing auditory rehabilitation programs and predicting the impact of hearing loss on daily life. By providing objective data, PTA is a reliable decision-making tool for healthcare professionals and remains an essential diagnostic method for managing hearing impairments [33].

3. Materials and Methods

3.1. MLP

The MLP model learns nonlinear relationships between complex inputs and outputs. It consists of an input layer, multiple hidden layers, and an output layer, where neurons in each layer process data through activation functions, as shown in Figure 1. The MLP implemented in this study addresses a multi-class classification problem using the ReLU activation function in hidden layers and a softmax output layer. In the input layer, the model receives the features of the training data, while the hidden layers transform the input data using weights and biases.
Batch normalization is applied to normalize the inputs at each layer, improving training speed, while dropout is used to deactivate certain neurons, preventing overfitting. The ReLU activation function is used mainly in hidden layers to capture nonlinear relationships effectively. In the final output layer, the softmax activation function is applied to return the predicted probabilities for each class. During training, the MLP utilizes the backpropagation algorithm to measure and minimize the difference between predicted and actual values by adjusting weights and biases. The implementation uses sparse categorical cross-entropy as the loss function to optimize predictions for integer-encoded class labels and employs the Adam optimizer to adapt the learning rate efficiently.
The Adam optimizer was chosen for the MLP model to ensure stable training and fast convergence. As a multilayer neural network, the MLP experiences increased gradient fluctuations as its depth increases, making it difficult to tune the learning rate with stochastic gradient descent (SGD). If a fixed learning rate is applied, the optimization process may become excessively slow or, conversely, excessive updates may prevent the model from finding optimal weights, leading to divergence. To address this issue, the Adam optimizer dynamically adjusts the learning rate using first-moment estimates (momentum) and second-moment estimates (as in RMSProp), allowing it to mitigate sudden gradient variations and perform more effective optimization. Additionally, since the MLP has a large number of parameters, preventing overfitting is crucial. Adam regulates individual weight updates, thereby improving generalization performance. In mini-batch training, Adam independently adjusts the learning rate for each weight, ensuring stable training even with high-variance data. Experimental results also confirmed that the MLP model using Adam exhibited a stable learning curve and maintained high accuracy throughout the training process.
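For reference, the standard Adam update for parameters θ with gradient g_t combines these first- and second-moment estimates as follows; this is the textbook formulation (Kingma and Ba), not notation introduced by this paper:

```latex
% First- and second-moment estimates of the gradient
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}
% Bias correction for the zero-initialized moments
\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^{t}}
% Per-parameter update with step size \alpha
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}
```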
The model undergoes iterative training for a user-specified number of epochs, where it is evaluated on both training and validation datasets. The ModelCheckpoint callback is utilized to save the model when the validation loss reaches its lowest point. After completing the training process, the final trained model is stored as a file. This structure enables MLP to learn complex patterns effectively, and performance can be optimized by tuning hyperparameters such as the number of neurons, the number of hidden layers, and the learning rate. Mathematically, in each hidden layer, an input vector x undergoes transformation via a weight matrix W and a bias vector b, followed by applying an activation function f(⋅). For example, the output of the first hidden layer is computed as follows:
h_1 = f(W_1 x + b_1)
where h_1 represents the output of the first hidden layer, which is then used as the input for the subsequent layer. In the final output layer, the pre-activation output y is computed from the weighted sum and bias (the softmax function is then applied to y to yield class probabilities):
y = W_n h_{n-1} + b_n
In this process, backpropagation is used to adjust weights and biases to minimize the loss iteratively. MLP effectively learns high-dimensional data and nonlinear relationships through this mechanism while utilizing dropout and normalization to prevent overfitting, ensuring stable predictive performance.

MLP Model and Methodology in This Paper

In this study, an MLP model was designed, as shown in Table 1, to address the multi-class classification problem. The MLP model consists of multiple fully connected layers, incorporating batch normalization and dropout techniques to enhance generalization performance and prevent overfitting. The input data are represented as a feature vector, with the first hidden layer consisting of 256 neurons and utilizing the ReLU activation function. Batch normalization is then applied to normalize the input, followed by a 40% dropout to prevent overfitting.
Subsequent hidden layers consist of 128, 64, 32, and 16 neurons, each applying the ReLU activation function, batch normalization, and dropout, ensuring stable learning. The output layer consists of 3 neurons and applies the softmax activation function to return class probabilities for multi-class classification. The model is trained using the Adam optimization algorithm, with sparse categorical cross-entropy as the loss function, allowing accurate classification of integer-labeled data. Additionally, accuracy is used as the performance evaluation metric to monitor predictive performance. The model is trained for 300 epochs, with a batch size of 32 to ensure efficient learning. During training, a validation set evaluates performance at each epoch, and the ModelCheckpoint callback saves the model weights when the validation loss is minimized. This ensures that the optimal model weights are retained. After training, the learned model is applied to test data for final accuracy evaluation, and the classification results are visualized using a confusion matrix. Additionally, various hyperparameters, such as hidden layers, number of neurons, and learning rate, can be adjusted to optimize model performance further.
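As a concrete illustration of the architecture in Table 1, a minimal Keras sketch follows. Two assumptions are made that the text does not state verbatim: an 11-dimensional input (age plus AC/BC thresholds at the five tested frequencies) and the 40% dropout rate reused for every hidden layer; the exact values follow Table 1.

```python
from tensorflow.keras import layers, models, callbacks

NUM_FEATURES = 11  # assumption: age + 5 AC + 5 BC thresholds
NUM_CLASSES = 3    # SA classes 0, 1, 2

model = models.Sequential()
model.add(layers.Input(shape=(NUM_FEATURES,)))
for units in (256, 128, 64, 32, 16):
    model.add(layers.Dense(units, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.4))  # 40% stated for the first layer; reused here as an assumption
model.add(layers.Dense(NUM_CLASSES, activation="softmax"))

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Keep the weights from the epoch with the lowest validation loss.
ckpt = callbacks.ModelCheckpoint("mlp_best.keras",
                                 monitor="val_loss", save_best_only=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=300, batch_size=32, callbacks=[ckpt])
```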

3.2. RNN

A recurrent neural network (RNN) is a neural network model designed to learn the temporal dependencies and correlations in sequential data. It is widely used in time-series analysis and natural language processing applications. Unlike MLP, which processes inputs independently, RNN consists of an input layer, hidden layers, and an output layer, as shown in Figure 2. Each time step combines the previous hidden state with the current input to process and propagate information. A key feature of RNN is that it shares the same weights and biases across time steps. This allows it to maintain information flow while reducing the number of parameters, making it highly effective for learning from sequential data. Mathematically, at each time step t, the hidden state h_t is computed using the current input x_t, the previous hidden state h_{t-1}, and the bias term b_h, as follows:
h_t = f(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
where W_{xh} represents the weight matrix from the input data to the hidden state, W_{hh} is the weight matrix from the previous hidden state to the current hidden state, and f(⋅) denotes the nonlinear activation function. The hidden state is the model’s internal memory, enabling it to learn patterns in time-series or numerical data. In the output layer, the final output y_t is computed based on the hidden state as follows:
y_t = W_{hy} h_t + b_y
where W_{hy} represents the weight matrix from the hidden state to the output layer and b_y denotes the bias vector of the output layer. While RNN provides a structural advantage in learning temporal information, it faces the vanishing gradient problem during training. This issue arises when long-term dependencies become difficult to capture as past information gradually fades over time. Long short-term memory (LSTM) and gated recurrent unit (GRU) architectures have been developed to address this. LSTM retains important information over extended sequences by utilizing input, output, and forget gates to store and discard information selectively. GRU achieves similar performance with a simplified structure, improving computational efficiency.
During training, backpropagation through time (BPTT), a variation of the backpropagation algorithm, is used to update weights. The sparse categorical cross-entropy loss function is applied for classification tasks, while mean squared error (MSE) is used for regression problems. The Adam optimizer is employed to adjust the learning rate efficiently. The reason for using the Adam optimizer in the RNN model is to mitigate the vanishing gradient problem and enable effective learning of sequential data. RNNs have a recursive structure, where past information is utilized to generate current outputs. However, during backpropagation, the vanishing gradient problem is likely to occur, making learning difficult. If SGD is used, gradients may diminish over time when training on long sequences, leading to a situation where information is no longer transmitted effectively. To prevent this, applying the Adam optimizer helps correct gradient variations by leveraging momentum and adaptive learning rate adjustments, thereby mitigating the long-term dependency issue. Additionally, RNNs capture temporal variations in sequential data and Adam optimizes weight updates at each time step, allowing for more effective learning of dynamic patterns. In mini-batch training, Adam adjusts the learning rate for each weight separately, ensuring stable training even when dealing with significant gradient fluctuations. The experimental results confirmed that the RNN model using Adam converged faster than conventional optimization techniques and demonstrated enhanced predictive performance.
Various regularization techniques are applied to prevent overfitting and improve training stability. Dropout randomly deactivates specific neurons in the hidden layers, enhancing generalization performance, while batch normalization normalizes the distribution of hidden states, accelerating training. RNN is trained over a predefined number of epochs, iterating through the dataset multiple times. Validation data is used to monitor performance and detect potential overfitting.
Once training is complete, the final model is saved as a file for predictions or real-time applications. The performance of RNN can be optimized by tuning hyperparameters such as hidden state size, number of hidden layers, and learning rate. With these structural improvements, RNN effectively learns complex patterns in sequential data, providing strong predictive performance.

RNN Model and Methodology in This Paper

The RNN model designed in this study stacks three recurrent layers, as summarized in Table 2. The first RNN layer applies the ReLU activation function, batch normalization, and dropout, with “return sequences = True” so that the full sequence is passed to the next layer. The second RNN layer consists of 64 neurons and applies the same ReLU activation function, batch normalization, and dropout as the first layer. It also maintains “return sequences = True” to preserve sequential information. The third RNN layer comprises 32 neurons and applies the same activation and regularization techniques to extract high-dimensional features from the input data, with “return sequences = False” set to output only the final hidden state.
Following the RNN layers, a dense layer with 16 neurons and the ReLU activation function is added. This layer incorporates L2 regularization, batch normalization, and 20% dropout to enhance model generalization. The final output layer consists of 3 neurons with the softmax activation function, returning class probabilities for multi-class classification. The model is trained using the Adam optimization algorithm, with the learning rate set to 0.0005. The sparse categorical cross-entropy loss function optimizes predictions for integer-encoded class labels. Accuracy is the performance evaluation metric used to monitor model performance during training. The model is trained for 300 epochs with a batch size of 128, ensuring efficient learning. A validation set evaluates performance at each epoch, allowing continuous monitoring of the gap between training and validation performance. The model’s training configuration is summarized in Table 2. Ultimately, the trained RNN model effectively learns the temporal patterns of sequential data, making it well suited for multi-class classification tasks. The model’s performance can be further optimized by adjusting hyperparameters such as the number of neurons and learning rate.
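A minimal Keras sketch of this stacked RNN follows, under three stated assumptions: the first recurrent layer is given a hypothetical width of 128 units (not specified in the text), the dropout rate inside the recurrent blocks is set to 20%, and the 11 PTA features are reshaped into a length-11 sequence of scalars so that recurrent layers can consume them.

```python
from tensorflow.keras import layers, models, optimizers, regularizers

SEQ_LEN = 11     # assumption: PTA feature vector treated as a sequence
NUM_CLASSES = 3  # SA classes 0, 1, 2

model = models.Sequential()
model.add(layers.Input(shape=(SEQ_LEN, 1)))
for units, return_seq in ((128, True), (64, True), (32, False)):
    model.add(layers.SimpleRNN(units, activation="relu",
                               return_sequences=return_seq))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.2))  # recurrent-block dropout rate assumed
model.add(layers.Dense(16, activation="relu",
                       kernel_regularizer=regularizers.l2(1e-4)))  # L2 coefficient assumed
model.add(layers.BatchNormalization())
model.add(layers.Dropout(0.2))  # 20% dropout, as described in the text
model.add(layers.Dense(NUM_CLASSES, activation="softmax"))

model.compile(optimizer=optimizers.Adam(learning_rate=0.0005),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train[..., None], y_train, validation_data=(X_val[..., None], y_val),
#           epochs=300, batch_size=128)
```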

3.3. Gradient Boosting

As illustrated in Figure 3, the gradient boosting algorithm is an ensemble learning technique that incrementally improves performance by combining multiple weak learners, typically shallow decision trees. This algorithm performs excellently in both regression and classification tasks and is particularly effective for handling complex nonlinear data. The fundamental idea of gradient boosting is to iteratively improve the model by adding new trees that minimize the residual errors at each stage. Initially, a base model F_0(x) is established, optimized using the constant γ that minimizes the loss function L(y_i, γ) over the given dataset:
F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
where F_0(x) generates the initial predictions, while the loss function L measures the error between the predicted and actual values. In the first stage, the model learns from the data based on this initial prediction. In each subsequent stage, a new tree is trained using the residuals, which represent the difference between the previous model’s predictions and the actual values. The residuals correct the existing model’s prediction errors, allowing the newly added tree to refine the overall model performance. These residuals are computed as the negative gradient of the loss function, mathematically expressed as follows:
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}
where r_{im} represents the residual for the i-th data point at the m-th iteration and corresponds to the negative gradient of the loss function. Using these residuals, a new tree T_m(x) is trained, which corrects the model in the direction that minimizes the loss. The model updates its predictions by combining the previous prediction with the newly learned tree, scaled by the learning rate η. This update process is mathematically expressed as follows:
F_m(x) = F_{m-1}(x) + \eta T_m(x)
This process progressively reduces residuals as more trees are added, enhancing the model’s predictive performance. The learning rate η serves as a hyperparameter that controls the contribution of each newly added tree, playing a crucial role in regulating model complexity and preventing overfitting. In regression tasks, the final predicted value ŷ is obtained by summing the contributions of all trees, which can be expressed as follows:
\hat{y} = \sum_{i=1}^{N} a_i T_i(x)
where N represents the total number of trees and a_i denotes the weight of each tree. Each tree is designed to correct the learned residuals, improving the model’s accuracy. For classification tasks, the model computes the prediction probabilities for each class and selects the class with the highest probability as the final predicted value.
Gradient boosting provides flexibility by allowing various hyperparameters to be tuned for optimal performance based on the dataset’s characteristics. Key hyperparameters include the number of trees, learning rate, maximum tree depth, and minimum number of samples per node. Adjusting these parameters helps prevent overfitting and improves generalization performance. In particular, a lower learning rate allows the model to train more trees, reducing the contribution of each individual tree while preventing excessive model complexity. Thanks to these characteristics, gradient boosting is recognized as a powerful model capable of effectively capturing important patterns and interactions between variables in high-dimensional datasets with numerous features.

Gradient Boosting Model and Methodology in This Paper

Gradient boosting is an ensemble learning technique that sequentially trains weak learners, correcting the errors of previous models to build a strong learner. In this study, scikit-learn’s GradientBoostingClassifier was used for implementation. The model was configured with n_estimators = 100, learning_rate = 0.1, and random_state = 42, optimized for solving the multi-class classification problem. During training, the dataset was randomly resampled in each epoch to evaluate the model’s performance on both training and validation data, enhancing generalization. This random resampling approach helps prevent overfitting and ensures performance assessment across diverse data distributions.
At each epoch, the model undergoes a fitting process using training data and generates both predictions and class probability values for the validation set. The model’s performance is monitored using multiple evaluation metrics, including F1 score, accuracy, log loss, and AUC (area under the curve). The F1 score is computed using a macro-averaging approach to ensure balanced performance across classes, while accuracy quantifies overall predictive performance. Log loss measures the difference between predicted probability distributions and actual labels, and AUC is calculated using the one vs. rest (OvR) method to assess the model’s discrimination ability. The model’s weights were saved at each epoch based on log loss, preserving the best-performing version. The final gradient boosting model was selected based on the lowest validation loss. After training, the model was fine-tuned using the entire training dataset, ensuring reproducibility and real-world applicability, and the final model was stored as a file. Gradient boosting allows performance optimization by tuning key hyperparameters, such as the learning rate, number of trees, and maximum depth. The configurations used in this study were well suited for the multi-class classification task. Performance evaluations confirmed that gradient boosting effectively learns complex data patterns and delivers reliable results across different data distributions and problem types.
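A minimal scikit-learn sketch of this configuration and its evaluation metrics is shown below. A synthetic dataset stands in for the PTA features so the sketch runs as-is; the 11-feature dimensionality is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PTA feature matrix (11 features, 3 SA classes).
X, y = make_classification(n_samples=2000, n_features=11, n_informative=8,
                           n_classes=3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=42)
gb.fit(X_train, y_train)

proba = gb.predict_proba(X_val)  # class probability estimates
pred = proba.argmax(axis=1)
print("accuracy :", accuracy_score(y_val, pred))
print("macro F1 :", f1_score(y_val, pred, average="macro"))  # macro-averaged, as in the text
print("log loss :", log_loss(y_val, proba))
print("AUC (OvR):", roc_auc_score(y_val, proba, multi_class="ovr"))
```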

3.4. XGBoost

XGBoost (extreme gradient boosting) is a high-performance extension of the gradient boosting algorithm. It is designed for fast and efficient learning on large-scale datasets while delivering high predictive accuracy. This algorithm sequentially trains weak learners, primarily decision trees, to form a powerful ensemble model. XGBoost improves model performance at each stage using gradient-based optimization and incorporates enhancements that outperform traditional gradient boosting, as shown in Figure 4. In XGBoost, the predicted value ŷ_i is computed by summing the outputs of multiple weak learners. This process can be mathematically expressed as follows:
\hat{y}_i = \sum_{t=1}^{T} f_t(x_i)
where T represents the total number of generated trees and f_t(x_i) denotes the output of the t-th tree for the i-th data point x_i. Each tree f_t(x) is sequentially added during training based on the gradient of the loss function. During the training process, XGBoost optimizes the model by minimizing the objective L^{(t)} at each iteration, defined as follows:
L^{(t)} = \sum_{i=1}^{n} l\left( y_i, \hat{y}_i^{(t-1)} + f_t(x_i) \right) + \Omega(f_t)
where l(y_i, ŷ_i) represents the loss between the predicted value ŷ_i and the actual value y_i, while Ω(f_t) is a regularization term that controls model complexity. The regularization term Ω(f_t) helps reduce model complexity and prevents overfitting, and it is expressed as follows:
\Omega(f_t) = \gamma T_{leaf} + \frac{1}{2} \lambda \sum_{j=1}^{T_{leaf}} w_j^{2}
Here, T_{leaf} represents the number of leaf nodes in the tree (distinct from the number of trees T above), w_j denotes the weight of each leaf node, and γ and λ are hyperparameters that control the regularization strength.
XGBoost supports efficient distributed and parallel processing during training, enabling rapid learning even with large-scale datasets. It demonstrates strong performance in both classification and regression tasks, with key hyperparameters (e.g., learning rate, maximum tree depth, and number of trees) that can be fine-tuned for optimal results. Through this structural design, XGBoost effectively learns complex data patterns, delivers high predictive performance, and is widely utilized across various data analysis and machine learning applications.

XGBoost Model and Methodology in This Paper

This study used the XGBoost algorithm to design a model to solve the multi-class classification problem. As an extension of the gradient boosting algorithm, XGBoost sequentially trains weak learners (decision trees), correcting the errors of previous models to form a strong learner. XGBoost supports optimized parallel processing and regularization techniques, enabling faster and more stable training compared to traditional gradient boosting models. The model was configured with n_estimators = 100, learning_rate = 0.1, and random_state = 42, making it well suited for the multi-class classification task. Training and validation datasets were randomly resampled in each epoch to evaluate the model’s generalization performance. This random resampling method assessed model performance across various data distributions. The model undergoes a fitting process at each epoch using training data and generates predictions and class probability values for the validation set. The model’s performance was monitored using multiple evaluation metrics, including F1 score, accuracy, log loss, and area under the curve (AUC).
The F1 score was computed using a macro-averaging approach to ensure balanced performance across multiple classes while accurately quantifying the model’s overall predictive performance. Log loss measured the difference between predicted probability distributions and actual labels, and AUC was computed using the one vs. rest (OvR) method to assess the model’s classification capability. The model’s weights were preserved based on log loss, storing the best-performing version. The final XGBoost model was selected based on the lowest validation loss. After training, the model was fine-tuned using the entire training dataset, ensuring the reproducibility of experimental results. The final model was stored as a file for further use. The XGBoost model allows for performance optimization by adjusting key hyperparameters such as the number of estimators, learning rate, and regularization coefficients. The settings used in this study were well suited for the multi-class classification task. Experimental results confirmed that XGBoost effectively learns complex data patterns and provides reliable performance across various data distributions and problem types.
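The corresponding XGBoost sketch, under the same assumptions as the gradient boosting example above (a synthetic 11-feature stand-in for the PTA data):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=11, n_informative=8,
                           n_classes=3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  stratify=y, random_state=42)

xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42,
                    objective="multi:softprob", eval_metric="mlogloss")
xgb.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

proba = xgb.predict_proba(X_val)
pred = proba.argmax(axis=1)
print("accuracy :", accuracy_score(y_val, pred))
print("macro F1 :", f1_score(y_val, pred, average="macro"))
print("log loss :", log_loss(y_val, proba))
print("AUC (OvR):", roc_auc_score(y_val, proba, multi_class="ovr"))
```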

4. Results

4.1. Data Collection and Segmentation

In this study, training was conducted using PTA and speech audiometry data from 12,972 patients, as shown in Table 3. The dataset includes age, representing the patient’s age, and SA, which denotes the speech audiometry result. The PTA data consist of AC and BC threshold values at 250 Hz, 500 Hz, 1000 Hz, 2000 Hz, and 4000 Hz. A higher SA value indicates better speech audiometry performance, whereas higher PTA threshold values indicate worse hearing ability; the threshold represents the minimum intensity required for the patient to perceive sounds. The speech audiometry scores range from 0 to 100 and were grouped into three classes (0 to 2), where class 0 indicates excellent speech audiometry, class 1 represents poor speech audiometry, and class 2 signifies very poor speech audiometry. This classification serves as an indicator of speech recognition ability and was used as the target label for model training. The PTA dataset used in this study was collected from 12,972 patient records at Ajou University Hospital, Asan Medical Center, Konyang University Hospital, Soonchunhyang University Hospital, Sookmyung Women’s University Hospital, Seoul National University Hospital, and Yonsei University Hospital. The dataset was reviewed and approved by the Institutional Review Board (IRB) of Soonchunhyang University Cheonan Hospital (2021-06-040).
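To make the data layout concrete, the following sketch shows one plausible way to load and split such a dataset. The file and column names are hypothetical placeholders for the schema in Table 3, the random stand-in exists only so the sketch runs, and the stratified 70/15/15 split is an assumption rather than the paper’s stated protocol.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical column names for the schema in Table 3.
FEATURES = ["age"] + [f"ac_{f}" for f in (250, 500, 1000, 2000, 4000)] \
                   + [f"bc_{f}" for f in (250, 500, 1000, 2000, 4000)]

# df = pd.read_csv("pta_sa_dataset.csv")  # hypothetical file name
rng = np.random.default_rng(42)           # synthetic stand-in so the sketch runs
df = pd.DataFrame(rng.integers(0, 100, size=(1000, len(FEATURES))),
                  columns=FEATURES)
df["sa_class"] = rng.integers(0, 3, size=1000)  # SA classes 0-2

X = df[FEATURES].to_numpy()
y = df["sa_class"].to_numpy()

# Stratified split preserves the class ratios across train/validation/test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```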

4.2. Experimental Setup

We conducted experiments on a system equipped with an Intel(R) Core(TM) i9-14900K CPU (base clock 3.20 GHz), an Nvidia GeForce RTX 4090 GPU, and 128 GB of system memory. The program code was implemented in Python (3.12.4) using TensorFlow and Keras, with CUDA libraries utilized to accelerate training. Unless stated otherwise in Section 3, training employed the Adam optimizer with a learning rate of 0.001 and a batch size of 16, and model performance was evaluated with the metrics described in Sections 4.3 and 4.4. Table 4 presents detailed information on the hardware, software, and parameters used in the experiment.
In this study, a multi-class classification model was developed to predict speech audiometry based on PTA data using various algorithms such as XGBoost, gradient boosting, MLP, and RNN. During the research process, challenges were encountered in optimizing performance and ensuring stability depending on the specific algorithm, and various experiments and hyperparameter optimizations were conducted to address these issues. Scikit-learn and TensorFlow/Keras were utilized to construct machine learning and deep learning models, and rather than simply applying the default algorithms, a trial-and-error approach was employed to fine-tune hyperparameters and maximize model performance. For the XGBoost and gradient boosting models, key hyperparameters such as n_estimators, learning_rate, and max_depth were adjusted to explore optimal values, while in the MLP and RNN models, experiments were conducted to improve training stability by adjusting the learning rate of the Adam optimizer, batch_size, dropout, and the number of hidden layers and units. Initially, models were trained using default settings, and then hyperparameters were modified to mitigate overfitting and address performance degradation issues. For example, in the RNN model, various recurrent neural network architectures such as LSTM and GRU were tested, and dropout rates were adjusted to prevent overfitting and ensure optimal generalization performance. Additionally, in gradient boosting-based models, learning_rate was adjusted within a range of 0.001 to 0.1, and the combination of max_depth and n_estimators was modified to find the optimal configuration.

4.3. MLP and RNN Accuracy, F1 Score, Loss

Figure 5 shows the accuracy of the MLP and RNN models in predicting speech audiometry from PTA data, while Figure 6 presents the loss values for each model. To evaluate the performance of the models designed in this study, the test dataset was used with evaluation metrics including accuracy, log loss, and F1 score, and a confusion matrix was utilized to analyze classification performance for each class. The MLP model achieved a test accuracy of 85.77%. The test loss was 0.3181, indicating a relatively low difference between predicted probability distributions and actual labels. The F1 score was 0.8596, demonstrating that the model effectively balanced precision and recall. Similarly, the RNN model was evaluated using the same performance metrics on the test dataset. The test accuracy of the RNN model was 85.41% and the test loss was 0.3796, representing the quantitative difference between predicted probabilities and actual labels. The F1 score was 0.8548, indicating that the model maintained a balanced classification performance across multiple classes. Both models demonstrated strong classification performance, with the MLP model achieving slightly better accuracy and lower loss than the RNN model. The RNN model nevertheless maintained a competitive F1 score, suggesting balanced classification despite the weak sequential structure of the PTA data.

MLP and RNN Confusion Matrix

In this study, a confusion matrix was used to analyze and compare the performance of the MLP and RNN models designed for solving the multi-class classification problem, as shown in Figure 7. Each model was evaluated based on its classification performance across three classes (0, 1, 2), allowing for an assessment of inter-class confusion and the strengths of each model. An analysis of the confusion matrix for the MLP model showed that class 0 recorded 682 true positives (TPs) with only 56 misclassified samples (55 + 1), indicating that the model effectively distinguished class 0 from the others. Class 1 recorded 670 TPs with 68 misclassified samples (10 + 58); the 58 samples assigned to class 2 suggest some confusion between these two classes. Class 2 recorded 531 TPs, but its 206 false negatives (FNs) were relatively high, revealing considerable misclassification between class 1 and class 2. This result suggests that class 2 had overlapping distributions with class 1 or lacked sufficient representative data.
For the RNN model, class 0 achieved 692 TPs with 46 misclassified samples (43 + 3), demonstrating a classification performance similar to MLP in distinguishing class 0. Class 1 recorded 656 TPs with 82 misclassified samples (30 + 52), indicating a higher misclassification rate, particularly between class 1 and class 2. Class 2 achieved 543 TPs with 195 FNs (2 + 193), showing slightly less confusion for this class than the MLP. This suggests that the RNN model learned some distinguishing features of class 2. Overall, the MLP model demonstrated higher accuracy in classifying class 0 and class 1, performing well in cases where data distribution overlap was minimal. In contrast, the RNN model showed better classification performance for class 2, mitigating some of the class overlap issues. However, the RNN model exhibited more misclassifications for class 1, suggesting a higher confusion rate between classes.
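For reference, the per-class counts quoted above can be derived from a scikit-learn confusion matrix as follows: rows are actual classes and columns are predictions, so row sums minus the diagonal give per-class FNs and column sums minus the diagonal give per-class FPs. The y_true/y_pred arrays below are tiny placeholders, not the study’s data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Placeholders; in practice, y_true holds the test-set SA classes and
# y_pred the argmax of the trained model's predicted probabilities.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 0, 1, 2, 2, 1, 2, 1])

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
tp = cm.diagonal()
fn = cm.sum(axis=1) - tp  # samples of class i predicted as another class
fp = cm.sum(axis=0) - tp  # samples of other classes predicted as class i
print(cm, tp, fn, fp, sep="\n")

# ConfusionMatrixDisplay(cm, display_labels=[0, 1, 2]).plot(cmap="Blues")
```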

4.4. Gradient Boosting and XGBoost Accuracy, F1 Score, Loss

In this study, gradient boosting and XGBoost models were utilized to solve the multi-class classification problem, and their performance was analyzed based on accuracy, F1 score, log loss, and the confusion matrix. The classification accuracy of both models is illustrated in Figure 8, providing a visual comparison of their predictive performance. The gradient boosting model achieved a test accuracy of 86.22%, demonstrating high classification performance. The F1 score was 0.8635, indicating a well-balanced trade-off between precision and recall. The log loss was 0.3083, suggesting a small difference between the predicted probability distribution and the actual labels. These results indicate that gradient boosting provided a stable overall performance. The XGBoost model achieved a test accuracy of 86.04%, slightly lower than gradient boosting, with a comparable F1 score of 0.8619; as the confusion matrix analysis below shows, however, its per-class performance was more balanced. The log loss, shown in Figure 9, was 0.3056, slightly lower than that of gradient boosting, further demonstrating its effectiveness in reducing inter-class confusion.

Gradient Boosting and XGBoost Confusion Matrix

In this study, gradient boosting and XGBoost models were utilized to address the multi-class classification problem, with their performance compared using a confusion matrix, as shown in Figure 10. For the gradient boosting model, class 0 achieved 687 true positives (TPs) with 51 misclassified samples (48 + 3), demonstrating strong classification performance and a relatively low misclassification rate. Class 1 recorded 625 TPs with 113 misclassified samples (19 + 94); the 94 samples assigned to class 2 indicate misclassification between class 1 and class 2. In class 2, the model achieved 597 TPs but 141 false negatives (FNs), a relatively high value, suggesting that class 2 significantly overlapped with other classes.
For the XGBoost model, class 0 recorded 683 TPs, showing classification performance similar to gradient boosting, with 55 misclassified samples (52 + 3). In class 1, the model achieved 641 TPs with 97 misclassified samples (17 + 80), a lower false-negative count than gradient boosting, indicating that XGBoost performed better in terms of recall for class 1. In class 2, 581 TPs and 157 FNs were recorded, a misclassification rate similar to gradient boosting. The gradient boosting model demonstrated high accuracy on the test data, with strong classification performance for class 0 and class 1. However, its performance declined for class 2 due to the high false-negative count, limiting its classification effectiveness for that class. While the XGBoost model had slightly lower accuracy, it reduced false negatives for class 1, resulting in higher recall and a slight reduction in class confusion compared to gradient boosting, indicating a more balanced performance across classes. However, both models exhibited high false-negative counts for class 2, confirming that overlapping data distributions between classes remain challenging.

4.5. Discussion

In this study, various machine learning and deep learning models were developed and compared to solve the multi-class classification problem of predicting speech audiometry using PTA data. The models used included MLP, RNN, gradient boosting, and XGBoost, and their performance was evaluated using accuracy, F1 score, and log loss, as shown in Table 5. A confusion matrix was also used to analyze each model’s predictive performance, highlighting their strengths and limitations. On the test dataset, the MLP model achieved an accuracy of 85.77%, demonstrating relatively strong classification performance. It performed well in class 0 and class 1, but its classification accuracy for class 2 was lower, leading to confusion between classes. The RNN model recorded an accuracy of 85.41%, and despite its structural advantage in learning sequential patterns, exhibited higher misclassification between class 1 and class 2. This suggests that RNN may have limitations in fully capturing the characteristics of PTA data.
When comparing gradient boosting and XGBoost, the gradient boosting model achieved the highest accuracy of 86.22%. Still, it exhibited a high false-negative rate for class 2, indicating that the model tended to misclassify some data points. The XGBoost model recorded an accuracy of 86.04%, slightly lower than gradient boosting, with a comparable F1 score and more balanced classification performance across classes. Notably, XGBoost was more effective in reducing false negatives for class 1, demonstrating its ability to mitigate overfitting to specific classes better than gradient boosting. The comparison of the models’ performance suggests that machine learning-based models (gradient boosting and XGBoost) outperformed deep learning-based models (MLP and RNN) in accuracy. This implies that PTA data may lack the complexity required for deep learning models to learn effectively and that tree-based boosting methods may be more suitable for capturing the relationships between variables. Additionally, since PTA data exhibit relatively low temporal dependency, RNN did not provide the expected performance improvement. The findings of this study suggest that machine learning models can serve as practical alternatives for predicting speech audiometry using PTA data. However, additional data preprocessing and feature engineering may be necessary to address overlapping class distributions. Future research could focus on extracting more effective features, optimizing model hyperparameters, and improving performance. Additionally, ensemble techniques that combine the strengths of multiple models could be explored, along with incorporating a more diverse patient dataset to assess the real-world applicability of these models in clinical environments.

5. Conclusions

This study developed a multi-class classification model for predicting speech audiometry (SA) using pure-tone audiometry (PTA) data and compared the performance of machine learning and deep learning models. To achieve this, deep learning models such as MLP and RNN, as well as machine learning models like gradient boosting and XGBoost, were implemented. The models were evaluated based on accuracy, F1 score, and log loss, while a confusion matrix was analyzed to compare detailed classification performance and identify the most suitable model for hearing data prediction. Experimental results showed that the gradient boosting model achieved the highest accuracy of 86.22%, demonstrating superior classification performance. The XGBoost model recorded an accuracy of 86.04% with a comparable F1 score and a more balanced classification performance across classes. In contrast, the deep learning-based models, MLP and RNN, showed relatively lower performance, with accuracies of 85.77% and 85.41%, respectively. The RNN model in particular was limited in effectiveness because PTA data lack strong sequential structure, preventing it from fully utilizing its recurrent architecture. Additionally, all models exhibited lower prediction performance for class 2 (the poorest speech audiometry category), likely due to overlapping data distributions causing some misclassifications. Nevertheless, both machine learning and deep learning models demonstrated a reasonable level of predictive performance, with gradient boosting and XGBoost in particular achieving high accuracy and balanced classification, thereby validating the feasibility of using PTA data for speech audiometry prediction. These findings suggest that deep learning models are not always the optimal choice for predicting speech audiometry from PTA data and that tree-based machine learning models may serve as a practical alternative. Specifically, because PTA data do not exhibit strong sequential patterns, RNN did not perform as expected, whereas gradient boosting and XGBoost, which effectively capture relationships between variables, proved to be more suitable. This underscores the importance of selecting the appropriate model for hearing data analysis, emphasizing that different types of data require tailored algorithmic approaches.
Clinically, these results suggest that machine learning models could serve as reliable assistive tools in hearing assessment and hearing aid fitting. Traditional audiological evaluation relies heavily on expert judgment; the proposed models could act as an automated diagnostic aid, supporting hearing loss evaluation and treatment planning. Future research should focus on improving data quality through feature engineering and on hyperparameter optimization to further enhance model performance, and could also consider hybrid models that combine machine learning and deep learning, or ensemble techniques that maximize predictive capability. In this way, the study's findings may contribute to the practical clinical application of PTA-based speech audiometry prediction models, ultimately providing a more precise foundation for hearing assessment.

Author Contributions

Conceptualization, M.H. and S.J.C.; methodology, J.s.S. and M.H.; software, J.s.S.; validation, J.s.S., J.M., M.M., S.y.K., N.-J.S. and S.J.C.; formal analysis, J.s.S., J.M., M.M. and S.y.K.; investigation, J.s.S.; resources, J.s.S. and S.J.C.; data curation, J.s.S., J.M., M.M., S.y.K., N.-J.S. and S.J.C.; writing—original draft preparation, J.s.S.; writing—review and editing, M.H., J.M., M.M., S.y.K., N.-J.S. and S.J.C.; visualization, J.s.S.; supervision, M.H.; project administration, S.J.C.; funding acquisition, M.H. and S.J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the BK21 FOUR (Fostering Outstanding Universities for Research) Grant (No. 5199990914048) and supported by the Soonchunhyang University Research Fund.

Institutional Review Board Statement

This study was conducted in accordance with the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of Soonchunhyang University Cheonan Hospital (Cheonan, Republic of Korea) (IRB 2021-06-040; approved 4 June 2021).

Informed Consent Statement

Because of the retrospective design of the study, patient consent was waived.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Abbreviation | Description
PTA | pure-tone audiometry
SRT | speech recognition threshold
SDT | speech discrimination testing
SDS | speech discrimination score
Val_loss | validation loss
LSTM | long short-term memory
GRU | gated recurrent unit
BPTT | backpropagation through time
MSE | mean squared error
MAE | mean absolute error
AUC | area under the curve
OvR | one vs. rest
MLP | multilayer perceptron
RNN | recurrent neural network
XGBoost | extreme gradient boosting
AC | air conduction
BC | bone conduction
dB | decibel
Hz | hertz
TP | true positive
FP | false positive
FN | false negative

References

  1. Sharifani, K.; Amini, M. Machine Learning and Deep Learning: A Review of Methods and Applications. World Inf. Technol. Eng. J. 2023, 10, 3897–3904. [Google Scholar]
  2. Khalil, M.; McGough, A.S.; Pourmirza, Z.; Pazhoohesh, M.; Walker, S. Machine Learning, Deep Learning and Statistical Analysis for Forecasting Building Energy Consumption—A Systematic Review. Eng. Appl. Artif. Intell. 2022, 115, 105287. [Google Scholar] [CrossRef]
  3. Bhatt, C.; Kumar, I.; Vijayakumar, V.; Singh, K.U.; Kumar, A. The State of the Art of Deep Learning Models in Medical Science and Their Challenges. Multimed. Syst. 2021, 27, 599–613. [Google Scholar] [CrossRef]
  4. Shehab, M.; Abualigah, L.; Shambour, Q.; Abu-Hashem, M.A.; Shambour, M.K.Y.; Alsalibi, A.I.; Gandomi, A.H. Machine Learning in Medical Applications: A Review of State-of-the-Art Methods. Comput. Biol. Med. 2022, 145, 105458. [Google Scholar]
  5. Hoff, M.; Göthberg, H.; Tengstrand, T.; Rosenhall, U.; Skoog, I.; Sadeghi, A. Accuracy of Automated Pure-Tone Audiometry in Population-Based Samples of Older Adults. Int. J. Audiol. 2024, 63, 622–630. [Google Scholar] [CrossRef]
  6. Ryćko, P.; Rogowski, M. Speech Recognition and Speech Audiometry Parameters in Evaluation of Aural Rehabilitation Progress in Cochlear Implant Patients. Pol. J. Otolaryngol. 2024, 78, 1–6. [Google Scholar] [CrossRef]
  7. Sincock, B.P. Clinical Applicability of Adaptive Speech Testing: A Comparison of the Administration Time, Accuracy, Efficiency and Reliability of Adaptive Speech Tests with Conventional Speech Audiometry. Ph.D. Thesis, University of Canterbury, Christchurch, New Zealand, 2008. [Google Scholar]
  8. Shin, J.S.; Ma, J.; Choi, S.J.; Kim, S.; Hong, M. Development of a Deep Learning Model for Predicting Speech Audiometry Using Pure-Tone Audiometry Data. Appl. Sci. 2024, 14, 9379. [Google Scholar] [CrossRef]
  9. Wallaert, N.; Perry, A.; Jean, H.; Creff, G.; Godey, B.; Paraouty, N. Performance and Reliability Evaluation of an Automated Bone-Conduction Audiometry Using Machine Learning. Trends Hear. 2024, 28, 23312165241286456. [Google Scholar] [CrossRef]
  10. DeRuiter, M.; Ramachandran, V. Basic Audiometry Learning Manual; Plural Publishing: San Diego, CA, USA, 2021. [Google Scholar]
  11. Suatbayeva, R.; Toguzbayeva, D.; Taukeleva, S.; Mukanova, Z.; Sadykov, M. Speech Perception and Parameters of Speech Audiometry after Hearing Aid: Systematic Review and Meta-Analysis. Electron. J. Gen. Med. 2024, 21, em563. [Google Scholar] [CrossRef]
  12. Puglisi, G.E.; di Berardino, F.; Montuschi, C.; Sellami, F.; Albera, A.; Zanetti, D.; Albera, R.; Astolfi, A.; Kollmeier, B.; Warzybok, A. Evaluation of Italian simplified matrix test for speech-recognition measurements in noise. Audiol. Res. 2021, 11, 73–88. [Google Scholar] [CrossRef]
  13. Humes, L.E. Factors Underlying Individual Differences in Speech-Recognition Threshold (SRT) in Noise among Older Adults. Front. Aging Neurosci. 2021, 13, 702739. [Google Scholar] [CrossRef] [PubMed]
  14. Oh, J.H.; Lim, T.; Joo, J.B.; Cho, J.E.; Park, P.; Kim, J.Y. The Relationship Between Tinnitus Frequency and Speech Discrimination in Patients with Hearing Loss. Korean J. Otorhinolaryngol.-Head Neck Surg. 2023, 66, 156–161. [Google Scholar]
  15. Ristovska, L.; Jachova, Z.; Kovacevic, J. Cross Validation of the Pure Tone Threshold with Speech Audiometry. In Proceedings of the 6th International Scientific Conference—30 Years of Studies in Special Education and Rehabilitation, Ohrid, Republic of N. Macedonia, 14–16 September 2023; Petrov, R., Jachova, Z., Dimitrova-Radojičić, D., Eds.; Faculty of Philosophy: Skopje, North Macedonia, 2023; pp. 520–531. [Google Scholar]
  16. Deniz, B.; Gülmez, Z.D.; Kara, H.; Kara, E. Effect of Digital Noise Reduction in Hearing Aids on Speech Intelligibility in Both Quiet and Noisy Environments. Noise Health 2024, 26, 220–225. [Google Scholar] [PubMed]
  17. Dellazizzo, L.; Giguère, S.; Léveillé, N.; Potvin, S.; Dumais, A. A systematic review of relational-based therapies for the treatment of auditory hallucinations in patients with psychotic disorders. Psychol. Med. 2022, 52, 2001–2008. [Google Scholar]
  18. Zhao, F.; Mayr, R. Pure Tone Audiometry and Speech Audiometry. In Manual of Clinical Phonetics; Routledge: London, UK, 2021; pp. 444–460. [Google Scholar]
  19. Born, N.M.; Marciano, M.D.S.; Mass, S.D.C.; Silva, D.P.C.D.; Scharlach, R.C. Influence of the Type of Acoustic Transducer in Pure-Tone Audiometry. CoDAS 2022, 34, e20210019. [Google Scholar] [CrossRef]
  20. Masalski, M. The Hearing Test App for Android Devices: Distinctive Features of Pure-Tone Audiometry Performed on Mobile Devices. Med. Devices Evid. Res. 2024, 17, 151–163. [Google Scholar] [CrossRef]
  21. Giotakis, A.I.; Mariolis, L.; Koulentis, I.; Mpoutris, C.; Giotakis, E.I.; Apostolopoulou, A.; Papaefstathiou, E. The Benefit of Air Conduction Pure-Tone Audiometry as a Screening Method for Hearing Loss over the VAS Score. Diagnostics 2023, 14, 79. [Google Scholar] [CrossRef]
  22. Brandt, J.P.; Winters, R. Bone Conduction Evaluation. In StatPearls; StatPearls Publishing: Treasure Island, FL, USA, 2023. [Google Scholar]
  23. Le Prell, C.G.; Brewer, C.C.; Campbell, K. The Audiogram: Detection of Pure-Tone Stimuli in Ototoxicity Monitoring and Assessments of Investigational Medicines for the Inner Ear. J. Acoust. Soc. Am. 2022, 152, 470–490. [Google Scholar]
  24. Kassjański, M.; Kulawiak, M.; Przewoźny, T.; Tretiakow, D.; Kuryłowicz, J.; Molisz, A.; Grono, M. Automated Hearing Loss Type Classification Based on Pure Tone Audiometry Data. Sci. Rep. 2024, 14, 14203. [Google Scholar]
  25. Kim, H.; Park, J.; Choung, Y.H.; Jang, J.H.; Ko, J. Predicting speech discrimination scores from pure-tone thresholds—A machine learning-based approach using data from 12,697 subjects. PLoS ONE 2021, 16, e0261433. [Google Scholar]
  26. Liu, X.; Guo, P.; Wang, D.; Hsieh, Y.L.; Shi, S.; Dai, Z.; Wang, D.; Li, H.; Wang, W. Applications of Machine Learning in Meniere’s Disease Assessment Based on Pure-Tone Audiometry. Otolaryngol.–Head Neck Surg. 2025, 172, 233–242. [Google Scholar] [PubMed]
  27. Musiek, F.E.; Shinn, J.; Chermak, G.D.; Bamiou, D.E. Perspectives on the Pure-Tone Audiogram. J. Am. Acad. Audiol. 2017, 28, 655–671. [Google Scholar]
  28. Wang, X.; Rasidi, W.N.A.; Seluakumaran, K. Simplified Frequency Selectivity Measure as a Potential Candidate for Hearing Screening: Changes with Masker Level and Test-Retest Reliability of Self-Administered Testing. Int. J. Audiol. 2024, 1–10. [Google Scholar] [CrossRef]
  29. Walker, J.J.; Cleveland, L.M.; Davis, J.L.; Seales, J.S. Audiometry Screening and Interpretation. Am. Fam. Physician 2013, 87, 41–47. [Google Scholar]
  30. Kemaloğlu, Y.K.; Gündüz, B.; Gökmen, S.; Yilmaz, M. Pure Tone Audiometry in Children. Int. J. Pediatr. Otorhinolaryngol. 2005, 69, 209–214. [Google Scholar] [PubMed]
  31. Oosterloo, B.C.; Homans, N.C.; de Jong, B.R.J.; Ikram, M.A.; Nagtegaal, A.P.; Goedegebure, A. Assessing Hearing Loss in Older Adults with a Single Question and Person Characteristics; Comparison with Pure Tone Audiometry in the Rotterdam Study. PLoS ONE 2020, 15, e0228349. [Google Scholar] [CrossRef]
  32. Ahn, J.H.; Lee, H.S.; Kim, Y.J.; Yoon, T.H.; Chung, J.W. Comparing Pure-Tone Audiometry and Auditory Steady-State Response for the Measurement of Hearing Loss. Otolaryngol.-Head Neck Surg. 2007, 136, 966–971. [Google Scholar] [CrossRef]
  33. Komazec, Z.; Lemajić-Komazec, S.; Jović, R.; Nađ, Č.; Jovančević, L.; Savović, S. Comparison Between Auditory Steady-State Responses and Pure-Tone Audiometry. Vojnosanit. Pregl. 2010, 67, 761–765. [Google Scholar] [CrossRef]
Figure 1. MLP model structure.
Figure 2. RNN model structure.
Figure 3. Gradient boosting model structure.
Figure 4. XGBoost model structure.
Figure 5. MLP and RNN model accuracy.
Figure 6. MLP and RNN model loss.
Figure 7. MLP and RNN confusion matrix.
Figure 8. Gradient boosting and XGBoost model accuracy.
Figure 9. Gradient boosting and XGBoost model loss.
Figure 10. Gradient boosting and XGBoost confusion matrix.
Table 1. MLP training parameters.
Parameter | Value
Batch size | 32
Momentum | 0.9
Weight decay | L2 0.01
Epochs | 300
Learning rate | 0.001
Optimizer | Adam
Workers | 1
Table 2. RNN training parameters.
Parameter | Value
Batch size | 128
Weight decay | L2 0.001
Epochs | 300
Learning rate | 0.0005
Optimizer | Adam
Workers | 1
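As a concrete reading of Tables 1 and 2, the Keras sketch below wires those hyperparameters into minimal MLP and RNN definitions. The hidden-layer sizes, the input width, and the number of output classes are assumptions for illustration; the paper does not fix them here.

    # Minimal Keras sketch of the two deep models, using only the
    # hyperparameters listed in Tables 1 and 2; layer sizes, input width,
    # and class count are illustrative assumptions.
    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    N_FEATURES, N_CLASSES = 11, 3   # assumed from Table 3's columns and SA labels

    def build_mlp():
        # Table 1: batch size 32, L2 weight decay 0.01, 300 epochs, lr 0.001,
        # Adam (the listed momentum of 0.9 matches Adam's default beta_1).
        inputs = tf.keras.Input(shape=(N_FEATURES,))
        x = layers.Dense(64, activation="relu",
                         kernel_regularizer=regularizers.l2(0.01))(inputs)
        x = layers.Dense(32, activation="relu",
                         kernel_regularizer=regularizers.l2(0.01))(x)
        outputs = layers.Dense(N_CLASSES, activation="softmax")(x)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    def build_rnn():
        # Table 2: batch size 128, L2 weight decay 0.001, 300 epochs, lr 0.0005.
        # The tabular thresholds are fed as a length-N_FEATURES sequence of
        # scalars, one plausible way to give a recurrent layer PTA data.
        inputs = tf.keras.Input(shape=(N_FEATURES, 1))
        x = layers.SimpleRNN(32,
                             kernel_regularizer=regularizers.l2(0.001))(inputs)
        outputs = layers.Dense(N_CLASSES, activation="softmax")(x)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Usage: build_mlp().fit(X_tr, y_tr, batch_size=32, epochs=300,
    #                        validation_split=0.2)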
Table 3. PTA and speech audiometry data.
Age | SA | AC250 | AC500 | AC1000 | AC2000 | AC4000 | BC250 | BC500 | BC1000 | BC2000 | BC4000
68 | 1.0 | 30.0 | 30.0 | 40.0 | 45.0 | 55.0 | 20.0 | 35.0 | 40.0 | 40.0 | 50.0
51 | 2.0 | 50.0 | 60.0 | 65.0 | 85.0 | 95.0 | 40.0 | 60.0 | 70.0 | 70.0 | 70.0
55 | 0.0 | 15.0 | 20.0 | 35.0 | 50.0 | 40.0 | 15.0 | 20.0 | 35.0 | 50.0 | 35.0
55 | 0.0 | 15.0 | 20.0 | 35.0 | 50.0 | 40.0 | 15.0 | 15.0 | 35.0 | 50.0 | 35.0
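To make the table's layout explicit, the short sketch below shows one way such rows could be split into features and labels; the column order follows the table above, and the array literal reuses its first three rows purely as an example.

    # Converting Table 3-style rows into model inputs; column order as above.
    import numpy as np

    # Age, SA, AC 250-4000 Hz, BC 250-4000 Hz (thresholds in dB)
    rows = np.array([
        [68, 1.0, 30.0, 30.0, 40.0, 45.0, 55.0, 20.0, 35.0, 40.0, 40.0, 50.0],
        [51, 2.0, 50.0, 60.0, 65.0, 85.0, 95.0, 40.0, 60.0, 70.0, 70.0, 70.0],
        [55, 0.0, 15.0, 20.0, 35.0, 50.0, 40.0, 15.0, 20.0, 35.0, 50.0, 35.0],
    ])
    y = rows[:, 1].astype(int)      # SA class label
    X = np.delete(rows, 1, axis=1)  # age plus the ten AC/BC thresholds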
Table 4. Software and simulation parameters.
Software/Parameter | Value
Windows 10 | 64-bit
Programming language | Python, Keras, TensorFlow
CPU | Intel(R) Core(TM) i9-14900K
GPU | Nvidia GeForce RTX 4090
RAM | 128 GB
Batch size | 16
Validation split | 0.2
Test split | 0.2
Optimizer | Adam
Learning rate | 0.001
Loss function | Huber
Epochs | 300
Dataset | 12,972
Table 5. DL and ML performance evaluation.
Method | Accuracy | Loss | F1 Score | Time
MLP | 85.77% | 0.3181 | 0.8596 | 91 s
RNN | 85.41% | 0.3796 | 0.8548 | 102 s
Gradient Boosting | 86.22% | 0.3083 | 0.8635 | 297 s
XGBoost | 86.04% | 0.3056 | 0.8619 | 34 s
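Because the aggregate scores in Table 5 hide per-class behavior, per-class recall derived from the confusion matrix is the check that exposes the class 2 weakness discussed above. The sketch below uses made-up counts purely for illustration, not the study's actual results.

    # Per-class recall from a confusion matrix; the counts are illustrative
    # placeholders, not the paper's data.
    import numpy as np

    cm = np.array([[900,  40,  10],   # rows: true class, columns: predicted
                   [ 50, 820,  80],
                   [ 20, 120, 450]])  # a borderline class loses recall here
    recall = cm.diagonal() / cm.sum(axis=1)   # TP / (TP + FN) per class
    for cls, r in enumerate(recall):
        print(f"class {cls}: recall = {r:.3f}")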
