Diagnosis of Tympanic Membrane Disease and Pediatric Hearing Using Convolutional Neural Network Models with Multi-Layer Perceptrons



Introduction
The world is seeing a steady increase in ear diseases, affecting millions of children a year, as shown in Figure 1. Among them, the most prominent is middle ear infection, accounting for 46.3%. Otitis media, a common ear infection in children, can cause hearing loss and significantly impact quality of life. The infection arises from inflammation or blockage of the Eustachian tube, which often occurs because children's Eustachian tubes are shorter and underdeveloped [1][2][3].
The diagnosis of otitis media is challenging because symptoms such as slight hearing loss and discomfort are often mild and go unrecognized by the patient [4]. This can lead to delayed treatment and potential complications [5,6]. Therefore, accurate diagnosis and early intervention are crucial to preventing hearing loss and its negative consequences for social interaction, education, employment, and overall well-being [7]. Different techniques are employed to diagnose the tympanic membrane, but objective and precise diagnostic methods are lacking. Current diagnostic methods include clinical examination, hearing tests, and imaging tests. First, in a clinical examination, a doctor visually inspects for the signs of otitis media; such an examination can be subjective and imprecise. Second, hearing tests are used to check for pediatric hearing loss due to otitis media, but children may find these time-consuming tests inconvenient. Finally, imaging tests generate images of the middle ear cavity using techniques such as CT (computed tomography) or MRI (magnetic resonance imaging). Imaging can identify structural changes in otitis media, but it is expensive and poses a risk of radiation exposure.
Recently, artificial intelligence technology, particularly deep learning [8,9] among artificial neural network approaches [10], has been used to develop new diagnostic methods in the medical field. Deep learning can build predictive models by analyzing data such as images, speech, and text. Various deep learning studies [11] are also being conducted on middle ear disease [12], aimed at assisting diagnosis. The average diagnostic accuracies of doctors [13] are 73% for otolaryngologists and 50% for pediatricians. Since diagnostic results and their precision vary from doctor to doctor, and since there is a possibility of bias [14], research on diagnostic assistance using deep learning is emerging to address this problem.
We propose a novel model for diagnosing tympanic membrane disease and predicting pediatric hearing by combining CNN (convolutional neural network) [15] and MLP (multi-layer perceptron) [16] models. Previous research has demonstrated that CNN models are highly effective at extracting and classifying features from medical images, while MLP models are effective at learning complex nonlinear relationships.
This study covers the medical definitions of tympanic membrane diseases and pediatric hearing. First, OME (otitis media with effusion) is a condition in which non-infectious fluid accumulates in the middle ear. Second, congenital cholesteatoma is an abnormal skin growth that occurs in the middle ear. Third, traumatic perforation is the formation of a hole in the tympanic membrane. Fourth, COM (chronic otitis media) is inflammation of the middle ear lasting more than 3 months. Fifth, AOM (acute otitis media) is inflammation of the middle ear lasting less than 3 weeks. Sixth, otitis externa is inflammation of the ear canal. Finally, pediatric hearing, which is affected by these diseases, refers to a child's ability to perceive sounds and understand language.
The main goals of this study are as follows: first, to develop a combined CNN and MLP model for diagnosing tympanic membrane disease; second, to develop a combined CNN and MLP model for predicting pediatric hearing; and last, to evaluate the performance of the proposed models.
This study is anticipated to improve the accuracy of medical judgment in diagnosing tympanic membrane diseases and predicting pediatric hearing. In addition, the proposed model is expected to demonstrate the potential of combining the CNN and the MLP in the field of medical image analysis.

Related Work
Recently, as artificial neural networks have emerged as diagnostic tools, their influence has gradually expanded, and various neural network methods have been applied to research on medical data.
The autoencoder, an unsupervised learning method, is one of them. Song et al. [17] studied anomaly detection, the task of identifying samples that do not match the overall data distribution, using a variational autoencoder [18]. Because complex data such as tympanic membrane endoscopic images are difficult for variational autoencoders to learn, they preprocessed the tympanic membrane images using adaptive histogram equalization and Canny edge detection. They then trained the variational autoencoder only on preprocessed images of normal tympanic membranes and applied the anomaly scores of normal and abnormal tympanic membrane images under the learned distribution to the k-nearest neighbor algorithm to classify them. As a result, a total of 1232 normal and abnormal tympanic membrane images were classified with 94.5% accuracy by an algorithm trained only on normal tympanic membrane images. Studies on lightweight models are also being attempted in many directions.
Yue et al. [19] constructed the first large-scale ear endoscopy dataset, consisting of eight types of ear disease and disease-free samples from two institutions. Inspired by ShuffleNetV2 [20], their Best-EarNet is an ultra-fast and ultra-lightweight network that enables real-time ear disease diagnosis. Best-EarNet includes a novel local-global spatial feature fusion module and a multi-scale supervision strategy, making it easy to focus on global-local information within feature maps at different levels. Using transfer learning, Best-EarNet, with only 0.77 M parameters, achieved accuracies of 95.23% (on 22,581 internal images) and 92.14% (on 1652 external images). Specifically, it averaged 80 frames per second, making real-time computation possible.
Zeng et al. [21] presented a deep learning model to automatically diagnose tympanic diseases in real time using abundant otoscope image data obtained from clinical cases. They trained nine common deep CNNs on a total of 20,542 endoscopic images to classify eight conditions: normal, cholesteatoma of the middle ear, chronic suppurative otitis media, external auditory canal bleeding, impacted cerumen, otomycosis externa, secretory otitis media, and tympanic membrane calcification. They selected transfer learning models to construct an ensemble of DenseNet-BC169 [22] and DenseNet-BC161, which achieved an average accuracy of 95.59%.

Materials and Methods
In this section, we describe the datasets employed for training and the process of preprocessing them. We also describe the proposed model's structure, its hyperparameters, and the model evaluation metrics.

Open-Access Tympanic Membrane Dataset
This study utilized the open-access tympanic membrane dataset [23], acquired from Kaggle, an open dataset used in various papers. The dataset comprises 757 TIFF images representing the normal, COM, AOM, and otitis externa classes. The ratio between the training and test data is 75:25, as shown in Table 1.
Prior to training on the SCH (Soonchunhyang University Hospital) tympanic membrane dataset, the performance of each model is compared on the open-access tympanic membrane dataset. A total of five models are compared: MobileNet V3 [24], DenseNet 201, EfficientNet B7 [25], ConvNeXt [26], and the proposed model.

SCH Tympanic Membrane Dataset
This study uses 23,302 JPG image files provided by SCH after de-identification, approved by the institutional review board of SCH. The dataset is divided into a classification dataset and a regression dataset, each with a different training task. Typically, patients' eardrum and EAC (external auditory canal) images are obtained via oto-endoscopy (Pentax, Berlin, Germany) upon visit. The resolution of these images is 1280 (h) × 1350 (w) pixels.
The tympanic membrane disease subset is a classification dataset with a total of five classes: normal (a completely normal eardrum, or normal with a healed perforation or some tympanosclerosis); OME (a light yellow, orange, or amber color; if the fluid does not fill the tympanic cavity, a fluid level can be seen through the tympanic membrane); cholesteatoma (a loose inner pocket can be seen, with white exfoliated epithelium inside the pocket); traumatic perforation (a perforation of the tympanic membrane, not of uniform size); and COM. In COM, the tympanic membrane may perforate due to tension and exhibit blood clumps and uneven size. Most perforations are single. The residual tympanic membrane may show calcification, ulceration, and granulation tissue growth around the perforation margin. All image labeling was conducted by three ear specialists, each with more than ten years of experience.
OME was diagnosed according to clinical otologic practice, including medical history, physical examination with otoscopes, and audiological tests (PTA [pure tone audiometry] and tympanometry). Inclusion criteria required that otoscopic images and audiological assessment results be measured at the same time and on individual OME ears. Ears with OME and a history of middle ear surgery (e.g., grommet insertion) were excluded. The pediatric hearing subset in the OME dataset targets the hearing threshold at 1 kHz in the left and right ears.
The split of the SCH tympanic membrane disease subset into training, validation, and test sets was handled by SCH, with a ratio of 8:1:1. For training, the training set and validation set were received first; the composition of the training data is shown in Table 2. After confirming through the training and validation sets that training was complete, the test set was received and the test was conducted. The composition of the tested data is also shown in Table 2. The pediatric hearing subset is a regression dataset whose target is a value in dB (decibels). The split of this dataset was also handled by SCH, with a training, validation, and test set ratio of 8:1:1. Training was conducted first with the training and validation sets, whose composition is shown in Table 3. After confirming through the training and validation sets that training was complete, the test set was received and the test was conducted. The composition of the tested data is also shown in Table 3, and the distribution of all data across training, validation, and testing is visualized in Figure 2.


Data Preprocessing
A standardization layer for convergence of learning was adopted from EfficientNet, resizing image data from various formats to a 600 × 600 8-bit RGB format. For the classification datasets, namely the open-access tympanic membrane dataset and the SCH tympanic membrane disease subset, the ground truth used one-hot encoding and label smoothing [27].
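As a concrete illustration, the label smoothing of Equation (1) and the standardization of Equation (2) (used later for the regression subset) might be implemented as follows. This is a minimal NumPy sketch; the function names are ours, not the paper's:

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    # Equation (1): y_k' = y_k * (1 - alpha) + alpha / K
    K = one_hot.shape[-1]
    return one_hot * (1.0 - alpha) + alpha / K

def standardize(x, mean, std):
    # Equation (2): z = (x - mean) / std, as in EfficientNet's built-in layer
    return (x - mean) / std
```

With α = 0.1 and five classes, a one-hot target [1, 0, 0, 0, 0] becomes [0.92, 0.02, 0.02, 0.02, 0.02]; the smoothed vector still sums to 1.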
In label smoothing, a correction is applied to prevent predictions close to 0 and 1 from becoming overly confident; through this correction, the neural network keeps focusing on classes with lowered predictions, improving performance. The formula for label smoothing is shown in Equation (1), where y_k is the GT value, α is the label smoothing ratio, and K is the number of classes. In the experiments, training was conducted with a label smoothing ratio of 1 × 10⁻¹, as shown in Tables 4 and 5.

Standardization was used for the regression dataset, the SCH pediatric hearing subset. The formula used for its standardization is shown in Equation (2) below, which is the same as the formula of the standardization layer designed inside EfficientNet. The mean and standard deviation values used in the equation are shown in Table 6.

EfficientNet
The basic backbone model is EfficientNet, which achieved state-of-the-art results on five datasets, including Flowers and CIFAR-100. EfficientNet attained high top-1 and top-5 accuracy on ImageNet while reducing the number of parameters, unlike existing CNN models with their large parameter counts. Compound scaling is essential to improving model performance, and the optimal values were found by jointly adjusting the width, depth, and resolution scaling.
As shown in Table 7, the optimized values exhibit the best trade-off between computation and accuracy, and the compound scaling combination follows Equation (3) below. In this equation, α, β, and γ are constants found via grid search, and φ is a factor that can be controlled by the user, taking an appropriate value according to the available resources. EfficientNet comprises a family of models, B0 through B8 and L2 (B8 and L2 were added after the initial release), and each model has its own compound scaling value. As the model index increases, the amount of computation roughly doubles, and the strength of the regularization used to prevent overfitting also increases.
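Equation (3)'s compound scaling can be illustrated numerically. The constants below are the values reported for EfficientNet (α = 1.2, β = 1.1, γ = 1.15, found by grid search), which satisfy α · β² · γ² ≈ 2, so each increment of φ roughly doubles FLOPS:

```python
# Compound scaling (Equation (3)): depth d = alpha**phi, width w = beta**phi,
# resolution r = gamma**phi, subject to alpha * beta**2 * gamma**2 ~= 2.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # EfficientNet's grid-search result

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for a scaling factor phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi
```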

Multi-Layer Perceptron
The MLP structure is utilized in this study to enhance the performance of EfficientNet. The MLP used repeats a block of a fully connected layer with 4096 units, Swish activation, and dropout [28] with a 50% rate five times, following the structure of Transformer [29] models, which utilize MLPs in various ways. This structure is attached to EfficientNet B7, as shown in Figure 3.
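The head described above can be sketched as follows. This is a NumPy stand-in with random placeholder weights, not the trained Keras layers; in the actual model, each block would be a Dense(4096) layer with Swish activation and 50% dropout:

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):
    # Swish activation: x * sigmoid(x)
    return x * (1.0 / (1.0 + np.exp(-x)))

def mlp_head(x, n_blocks=5, units=4096, drop_rate=0.5, training=True):
    """Repeat [fully connected -> Swish -> dropout] n_blocks times."""
    for _ in range(n_blocks):
        w = rng.standard_normal((x.shape[-1], units)) * 0.01  # placeholder weights
        x = swish(x @ w)
        if training:
            keep = rng.random(x.shape) >= drop_rate
            x = np.where(keep, x / (1.0 - drop_rate), 0.0)  # inverted dropout
    return x
```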


Drop Connect
To prevent overfitting due to the huge size of EfficientNet B7, we applied drop connect [30]. Drop connect is a follow-up study to dropout, which randomly selects nodes and sets them to zero; unlike dropout, drop connect deactivates individual weights, a regularization that prevents co-adaptation. Dropout had previously been used within the MLP part, while drop connect was additionally employed to enhance performance, with a 50% weight deactivation rate.
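The distinction can be sketched directly: dropout zeroes activations, while drop connect zeroes individual entries of the weight matrix (an illustrative NumPy sketch, not the library layer):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_drop_connect(x, w, rate=0.5, training=True):
    """Fully connected layer with DropConnect: each WEIGHT is kept with
    probability (1 - rate) and rescaled, so the expected output is unchanged."""
    if training:
        keep = rng.random(w.shape) >= rate
        w = np.where(keep, w / (1.0 - rate), 0.0)
    return x @ w
```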

Calibration Weight Classes
ImageNet has more than 1000 classes, and the amount of data per class differs. Very few datasets have data evenly distributed across classes. As such, class imbalance is a very important issue in classification tasks. To address it, one can consider adjusting the sampling frequency or adjusting per-class weights. This paper calculates the weight of each class according to Equation (4).
Except for the SCH pediatric hearing subset, which is a regression task, the open-access tympanic membrane dataset and the SCH tympanic membrane disease subset are both classification tasks, so class weights can be calculated. In essence, high weights are given to classes with limited data and low weights to classes with abundant data; the calculated per-class weights are shown in Table 8 for the open-access tympanic membrane dataset and Table 9 for the SCH tympanic membrane disease subset.
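Equation (4) is not reproduced here, but a common "balanced" weighting consistent with the description (rare classes get high weights, frequent classes low weights) is w_c = N / (K · n_c), sketched below; whether this matches the paper's exact formula is an assumption:

```python
import numpy as np

def class_weights(counts):
    # w_c = N_total / (K * n_c): a class holding 1/K of the data gets weight 1;
    # a rarer class gets a proportionally larger weight.
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)
```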


Rand Augment
The augmentation used in training is rand augment [31], which applies up to N random augmentations, each with a random intensity of at most M. Rand augment builds on fast auto augment [32] and can be applied with a very small amount of computation. Figure 4 shows an example of rand augment, illustrating the difference between M values of 9, 17, and 28 when shearX and auto contrast were randomly selected at N = 2. Thus, rand augment randomly selects N augmentation techniques for each image and applies a random magnitude between 0 and M. In the experiment, the standard rand augment operations flipLR, identity, auto contrast, equalize, rotate, solarize, color, posterize, contrast, brightness, sharpness, shearX, shearY, translateX, and translateY were applied, with N = 2 and M = 28.
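A stripped-down sketch of the procedure (pick N operations at random, each with a random magnitude in [0, M]) is shown below; the real operation list includes shear, rotate, solarize, and so on, while here three toy ops stand in:

```python
import numpy as np

rng = np.random.default_rng(0)

def identity(img, m):
    return img

def brightness(img, m):
    # scale pixel values up with magnitude, clipped to valid range
    return np.clip(img * (1.0 + m / 30.0), 0.0, 255.0)

def contrast(img, m):
    # push pixels away from the mean with magnitude
    mean = img.mean()
    return np.clip((img - mean) * (1.0 + m / 30.0) + mean, 0.0, 255.0)

OPS = [identity, brightness, contrast]  # toy stand-ins for the 15 real ops

def rand_augment(img, n=2, m=28):
    """Apply n randomly chosen ops, each with a random magnitude in [0, m]."""
    for i in rng.integers(0, len(OPS), size=n):
        img = OPS[i](img, rng.uniform(0.0, m))
    return img
```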

AdaBelief
AdaBelief [33] is an optimizer derived from Adam [34] that adjusts convergence speed and generalization performance [35] by replacing Adam's second moment (the squared gradient) with the variance of the gradient. "Ada" comes from Adam, and "Belief" refers to computing this variance as the squared distance between the observed gradient and the currently estimated momentum (the predicted gradient). Despite being a one-line change in the code, it has received much attention for its exceptional performance improvement. This study uses a global clip norm of 1, a learning rate of 1 × 10⁻⁴, and a weight decay of 1 × 10⁻⁴ for training.
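A single AdaBelief update can be sketched as follows (simplified from the published algorithm; the hyperparameter defaults here are illustrative, with decoupled weight decay matching the 1 × 10⁻⁴ used in this study):

```python
import numpy as np

def adabelief_step(w, g, m, s, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8, wd=1e-4):
    """One AdaBelief step. The only change from Adam is in s: it tracks the
    variance of g around its EMA m (the 'belief'), not the raw squared gradient."""
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * (g - m) ** 2 + eps      # <- (g - m)**2 instead of g**2
    m_hat = m / (1 - b1 ** t)                       # bias correction
    s_hat = s / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(s_hat) + eps) + wd * w)
    return w, m, s
```

On a toy quadratic loss, repeated steps drive the parameter toward the minimum.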

Mixed Precision
Mixed precision [36] converts existing float32 operations into float16 operations while keeping the classifier in float32, enabling up to twice as fast training by halving the memory burden while maintaining accuracy. During float16 operations, values exceeding the representable range may be lost, which is corrected through scaling. In this study, mixed precision allowed the maximum batch size to be raised from 16 to 32, which enabled smooth training and improved performance by increasing the effectiveness of batch normalization [37] in the model.
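The scaling correction mentioned above can be illustrated in isolation: a gradient too small for float16 flushes to zero, but scaling before the cast and unscaling after preserves it. This is a toy NumPy sketch, not the framework's implementation (which scales the loss before backpropagation):

```python
import numpy as np

def cast_with_loss_scaling(grad_fp32, scale=2.0 ** 16):
    """Cast a gradient to float16 naively and with loss scaling."""
    naive = grad_fp32.astype(np.float16)             # may underflow to 0
    scaled = (grad_fp32 * scale).astype(np.float16)  # survives the cast
    recovered = scaled.astype(np.float32) / scale    # unscale back in float32
    return naive, recovered
```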

Loss
Standard categorical cross-entropy was used as the loss for the classification tasks. For the regression task, we used the Huber loss [38], which combines the outlier robustness of the L1 loss with the fast convergence and training stability afforded by the differentiability of the L2 loss. As shown in Equation (5), if the difference between the GT and the predicted value is less than a specific threshold delta, the Huber loss follows the L2 loss; if it is larger, it follows the L1 loss. In training, this delta value was set to 0.25.
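With δ = 0.25 as in training, Equation (5) can be written directly (the standard Huber form; a sketch):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=0.25):
    """Quadratic (L2-like) for |error| <= delta, linear (L1-like) beyond it;
    the two branches meet with matching value and slope at |error| = delta."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear).mean()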

Metrics
Evaluation metrics are used separately for the classification tasks (the open-access tympanic membrane dataset and the SCH tympanic membrane disease subset) and the regression task (the SCH pediatric hearing subset).
To calculate the classification metrics, the confusion matrix for each class is first obtained. Based on the resulting TP (true positive), TN (true negative), FP (false positive), and FN (false negative) counts, the average accuracy, average sensitivity, and average specificity over the classes are obtained, and the model is evaluated with these metrics. Each metric follows Equations (6)-(8).
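Equations (6)-(8) are not reproduced here; assuming the standard per-class (one-vs-rest) definitions, they compute as follows:

```python
def class_metrics(tp, tn, fp, fn):
    """Per-class accuracy, sensitivity (recall), and specificity from the
    one-vs-rest confusion matrix; these are then averaged over classes."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity
```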
The models for the regression task are evaluated with mean absolute error and mean squared logarithmic error, which follow Equations (9) and (10).
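Assuming the standard definitions for Equations (9) and (10) (with MSLE in its usual log1p form), the two regression metrics are:

```python
import numpy as np

def mean_absolute_error(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mean_squared_log_error(y, y_hat):
    # log1p keeps the metric defined at 0 dB and damps large-value errors
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)
```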

Results
The experiments were run on hamoniKR 6.0, based on Ubuntu 20.04 Linux. The machine was equipped with an Intel Xeon Gold 6346 CPU (64 cores, 3.10 GHz), four RTX 3090 GPUs, and 256 GB of RAM. Nvidia CUDA 11.2 and cuDNN 11.2 were installed, and Python 3.9.13 with Anaconda 22.9 was used. All experiments were conducted with TensorFlow 2.11.1 and Keras 2.11.0 as the deep learning framework.

Open-Access Tympanic Membrane Dataset
In this study, benchmarks for each model were conducted using the open-access tympanic membrane dataset, with validation and performance measurements performed on the test set. Five models were compared in terms of average accuracy, average sensitivity, and average specificity over the normal, COM, AOM, and otitis externa classes; the training graph for each model is shown in Figure 5. For a quick comparison, each model used weights pre-trained on ImageNet, and all models converged within 100 epochs. Based on the epoch with the highest performance, the models ranked, from best to worst: our model, ConvNeXt, vanilla EfficientNet B7, DenseNet 201, and MobileNet V3. In particular, the fact that our model led ConvNeXt, which itself outperformed vanilla EfficientNet B7, shows that the MLP and drop connect had a positive effect on performance, as shown in Table 10.

SCH Tympanic Membrane Dataset
Training on the SCH tympanic membrane dataset used the proposed model: EfficientNet B7 with the MLP and drop connect attached. For better performance, the model was fine-tuned from the noisy student weights [39]. The noisy student weights come from additionally training on the JFT-300M dataset on top of the large ImageNet dataset using the noisy student training method, in which a teacher and a student model are trained with unlabeled data.
The experimental results for the tympanic membrane disease dataset are as follows. Training ran for 100 epochs, and the weights at epoch 50 showed the highest performance; the validation and test performance of these weights are shown in Table 11. Table 12 shows the inference time for measuring the performance of the test set received from SCH with these weights. Figure 6 visualizes, using Grad-CAM [40], correctly predicted examples for each class.
The experimental results for the pediatric hearing dataset are as follows. Of all 300 training epochs, epoch 215 showed the highest performance, and the validation and test performance at this epoch are shown in Table 13. The inference time for measuring the performance of the test set received from SCH is shown in Table 14, and Figure 7 shows Grad-CAM visualizations of predictions, for each dB level, whose difference from the GT is less than 5. For Table 13, additional benchmarks were executed with the same data to evaluate the regression performance of the proposed model. As in the classification part, the performance of our model was confirmed to be the best among the comparative models.
Unlike the average performance in the experiment, both the classification model and the regression model suffered from lowered performance in a specific class or dB level. As shown in Figures 8 and 9, many such problems were seen in cholesteatoma for the tympanic membrane disease model and at 60 dB for the pediatric hearing model. This seems to be a problem caused by data imbalance, and it was not completely overcome by methods such as class weights in the training process.

Cross Validation
We performed additional k-fold cross validation to evaluate the performance of each class in more detail. This was performed on the previous two classification datasets; after merging the training, validation, and test sets, the data were re-split by stratified sampling with 5 folds.
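Stratified sampling assigns samples to folds so that each fold preserves the overall class proportions. A minimal sketch of the idea (in practice a library routine such as scikit-learn's `StratifiedKFold` would be used; the labels below are toy data):

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Assign each sample index to one of k folds so that every fold
    preserves the overall class proportions (round-robin per class)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

labels = ["normal"] * 10 + ["perforation"] * 5
folds = stratified_kfold(labels, k=5)
# each fold holds 2 "normal" indices and 1 "perforation" index
```

Without stratification, a data-poor class could be entirely absent from some folds, which would make the per-class fold results in Tables 15-18 meaningless for that class.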
The results for the open-access tympanic membrane dataset are shown in Tables 15 and 16, and a box-plot visualization is shown in Figure 10. The results for the SCH tympanic membrane dataset are shown in Tables 17 and 18, with the corresponding box plot in Figure 11. From the box plot of each dataset, it can be seen that the per-fold deviation is not small compared to the average performance. These deviations are attributed to data imbalance caused by data-poor classes. In the open-access tympanic membrane dataset, normal in Accuracy, otitis externa in Sensitivity, and normal in Specificity showed the largest deviation; in the SCH tympanic membrane dataset, normal in Accuracy, cholesteatoma in Sensitivity, and perforation in Specificity showed the largest deviation. Considering that an error of around 1-3% can be regarded as normal generalization variance, the deviation of these classes falls outside this level and cannot be said to generalize fully per class; a way to overcome this seems necessary in future research.
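The fold-to-fold deviation summarized by the box plots can be quantified as the mean and standard deviation of a per-fold metric. A small sketch with hypothetical per-fold sensitivities (the numbers are illustrative, not the paper's results):

```python
import statistics

def fold_deviation(scores):
    """Mean and sample standard deviation of a per-fold metric.
    A std that is large relative to the 1-3% band signals the
    fold-to-fold deviation visible in the box plots."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical per-fold sensitivities for a data-poor class
mean, std = fold_deviation([0.89, 0.93, 0.85, 0.91, 0.80])
```

Reporting the deviation alongside the 5-fold average makes it explicit which classes are stable across folds and which are dominated by the particular samples that landed in each fold.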

Conclusions
In this study, we proposed a tympanic membrane disease classification and pediatric hearing prediction method based on the EfficientNet B7 model with an MLP and drop connect. In the benchmark on the open-access tympanic membrane dataset, the proposed model, fine-tuned from the ImageNet weights, showed the best performance, with an Average Accuracy of 93.59%, an Average Sensitivity of 87.19%, and an Average Specificity of 95.73%. This contrasts with the vanilla EfficientNet B7, which performed worse than ConvNeXt. On the SCH datasets, fine-tuned from the noisy student weights, the tympanic membrane disease model showed an Average Accuracy of 98.28%, an Average Sensitivity of 89.66%, an Average Specificity of 98.68%, and an average inference time of 0.2 s, while the pediatric hearing model showed a Mean Absolute Error of 6.9801, a Mean Squared Logarithmic Error of 0.2887, and an average inference time of 0.2 s.
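For reference, the metrics reported above have standard definitions. The sketch below spells them out with toy numbers (the counts and predictions are illustrative, not the paper's data); sensitivity and specificity are computed one-vs-rest from confusion counts, and MSLE is the mean squared logarithmic error used for the hearing regression model.

```python
import math

def sensitivity_specificity(tp, fp, tn, fn):
    """Per-class sensitivity (recall on positives) and specificity
    (recall on negatives) from one-vs-rest confusion counts."""
    return tp / (tp + fn), tn / (tn + fp)

def mae(preds, targets):
    """Mean absolute error, in the same unit as the targets (dB here)."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

def msle(preds, targets):
    """Mean squared logarithmic error; log1p keeps 0-valued targets valid."""
    return sum((math.log1p(p) - math.log1p(t)) ** 2
               for p, t in zip(preds, targets)) / len(preds)

sens, spec = sensitivity_specificity(tp=26, fp=2, tn=140, fn=4)
err = mae([42.0, 58.0], [40.0, 60.0])  # average 2 dB off -> 2.0
```

MSLE penalizes relative rather than absolute deviation, so a 5 dB error at 20 dB costs more than a 5 dB error at 80 dB, which fits the clinical intuition that errors at low hearing thresholds matter more.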
Future research will explore ways to overcome the performance degradation caused by this data imbalance, such as data augmentation through a GAN (a generative artificial intelligence model) or training on unseen data through teacher-student training, e.g., noisy student. In addition, we will study how to train on tympanic membrane disease data and a more diverse dB range of pediatric hearing data not covered in this study, and will investigate improved model structures.

Figure 2 .
Figure 2. Distribution chart of hearing data by dB.


Figure 4 .
Figure 4. Example images augmented by rand augment.


Figure 5 .
Figure 5.The visualization of validation trajectory in 100 epochs on open-access tympanic membrane dataset.


Figure 6 .
Figure 6. Visualization comparison with Grad-CAM on SCH tympanic membrane disease dataset.


Figure 7 .
Figure 7. Visualization of the pediatric hearing using Grad-CAM.

Figure 8 .
Figure 8. Correct and incorrect classification of Cholesteatoma. (Bold green letters indicate the correct prediction, and bold red letters indicate the incorrect prediction).

Figure 9 .
Figure 9. Small prediction error and large prediction error when GT is 60 dB. (Bold green letters indicate the correct prediction, and bold red letters indicate the incorrect prediction).


Figure 10 .
Figure 10.Visualization comparison of box plot results on open-access tympanic membrane disease model.


Figure 11 .
Figure 11. Visualization comparison of box plot results on SCH tympanic membrane disease model.


Table 3 .
SCH pediatric hearing subset training dataset.


Table 4 .
The result of label smoothing on open-access tympanic membrane dataset.


Table 5 .
The result of label smoothing on SCH tympanic membrane disease subset dataset.

Table 6 .
The Mean and Standard Deviation on SCH pediatric hearing dataset.

Table 7 .
Performance with scale change at the same amount of computation.

Table 8 .
Open-access tympanic membrane dataset weights by disease type.

Table 9 .
SCH tympanic membrane disease subset dataset weights by disease type.


Table 10 .
The quantitative comparison results on the open-access tympanic membrane dataset.

Table 11 .
The quantitative comparison results on SCH tympanic membrane disease dataset.

Table 12 .
Inference time on the proposed model.

Table 13 .
The quantitative pediatric hearing result on SCH tympanic membrane disease dataset.

Table 14 .
Inference time on the pediatric hearing model.


Table 15 .
The performance of an open-access tympanic membrane disease model varies based on the number of folds.

Table 16 .
The 5-fold average performance of the open-access tympanic membrane disease model.

Table 17 .
The performance of an SCH tympanic membrane disease model varies based on the number of folds.

Table 18 .
The 5-fold average performance of the SCH tympanic membrane disease model.