Gender, Smoking History, and Age Prediction from Laryngeal Images

Flexible laryngoscopy is commonly performed by otolaryngologists to detect laryngeal diseases and to recognize potentially malignant lesions. Recently, researchers have introduced machine learning techniques to facilitate automated diagnosis using laryngeal images and achieved promising results. The diagnostic performance can be improved when patients’ demographic information is incorporated into models. However, the manual entry of patient data is time-consuming for clinicians. In this study, we made the first endeavor to employ deep learning models to predict patient demographic information to improve the detector model’s performance. The overall accuracy for gender, smoking history, and age was 85.5%, 65.2%, and 75.9%, respectively. We also created a new laryngoscopic image set for the machine learning study and benchmarked the performance of eight classical deep learning models based on CNNs and Transformers. The results can be integrated into current learning models to improve their performance by incorporating the patient’s demographic information.


Introduction
Flexible laryngoscopy is a commonly used diagnostic tool to visually identify diseases of the larynx [1,2]. While it has advantages over other diagnostic methods given its ease of use and lack of ionizing radiation exposure, discerning between benign and malignant lesions on laryngoscopy requires expert interpretation. Previously, computer vision techniques utilizing deep learning, including Convolutional Neural Networks (CNNs) and Transformers, have been implemented to determine the pathologic diagnosis based on laryngoscopic medical images or videos [3][4][5][6][7]. Such models have been shown to be sufficiently accurate in the diagnosis of laryngeal cancer with only a limited training set [3][4][5][6][7][8].
The majority of prior studies that utilized machine learning for medical image analysis focused on lesion or polyp detection, segmentation, and classification [9][10][11][12][13]. To date, no studies have attempted to automatically incorporate patient characteristics into lesion detection models by predicting them using laryngeal images. Even for well-trained experts, identifying the age, gender, or smoking status of patients based on laryngoscopy alone is virtually impossible. Fortunately, this is never necessary because this information is readily available to clinicians performing laryngoscopy. However, the incorporation of patient characteristics into deep learning models for medical image analysis typically requires manual entry.
In this study, we have demonstrated the capability of deep learning models, such as CNNs and Transformers, to extract discernible features from laryngeal images, allowing the identification of patients' demographic characteristics. This has the potential to enhance clinical diagnoses by automatically integrating demographic information into intelligent learning models. For instance, we can automate multimodal learning to improve the detection of laryngeal cancers by considering factors such as the patient's smoking status and age during decision making. Additionally, our research contributes to the field of explainable artificial intelligence (XAI), which emphasizes the provision of clear and interpretable explanations for the decisions and predictions of models. By enhancing transparency and trust, XAI plays a crucial role in medical contexts, where healthcare decisions carry significant importance [14][15][16][17]. Analyzing how the models predict patients' demographic characteristics, especially through activation saliency maps, can deepen our understanding of the underlying workings of deep learning models.
This study is the first endeavor to predict the patient's gender, smoking history, and age directly from laryngeal images. We implemented and compared the performance of the following classical CNN-based and Transformer-based deep learning models: ResNet-18 [18], ResNet-50 [18], ResNet-101 [18], DenseNet-121 [19], MobileNetv2 [20], ShuffleNetv2 [21], and ViT [22]. The major contributions of this paper are as follows:
• We performed the first study on predicting the gender, age, and smoking status of the patient purely based on laryngeal images from laryngoscopy.
• We created a dataset of 33,906 laryngeal image frames captured from 398 patients. The dataset is annotated with the clinical diagnosis, the pathologic diagnosis for the lesion, and the patient's demographic information. This is the first large laryngoscopic image set for machine learning studies.
• We implemented and benchmarked the performance of eight classical deep learning models and achieved very promising results.
• We employed the Class Activation Map (CAM) to visualize and analyze the regions of interest in the image. This approach contributes to the explainability of the learning models by providing insights into which specific areas of the image influenced the decision-making process.
The labeled dataset and developed learning models are available to the research community upon request.

Dataset
Data from flexible video stroboscopic exams performed during patient care in the Department of Otolaryngology-Head and Neck Surgery at the University of Kansas Medical Center (KUMC) were collected over a one-year period. Digital videos were collected in MPEG-4 format at 30 frames per second (fps) with a resolution of 720 × 486 pixels. Each video was labeled with a clinical diagnosis (structurally normal larynx, polyp, papilloma, leukoplakia, or malignant neoplasm) and a pathologic diagnosis for lesions that were biopsied. Additional patient demographic information was captured, including age, sex, and history of tobacco use.
A total of 398 video sequences were included for analysis and randomly separated into training (n = 319, 80%) and testing (n = 79, 20%) cohorts. Every 10th video frame was extracted from each video sequence, creating a dataset of 33,906 laryngeal images in total, with 26,424 for training and 7,482 for testing. All classification models were pretrained on the ImageNet benchmark [23]. Transfer learning was then used to fine-tune the learning models using the collected training set. Finally, the testing set was employed to evaluate the performance of the final classification models.
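The cohort split and frame sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names, the fixed random seed, the integer video identifiers, and the choice to start sampling at frame 0 are all assumptions; only the 319/79 split and the every-10th-frame rule come from the text.

```python
import random


def split_videos(video_ids, n_train, seed=0):
    """Randomly split video IDs into training and testing cohorts.

    Splitting by video keeps all frames of one sequence in the same
    cohort. The seed value is an arbitrary assumption.
    """
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]


def sampled_frame_indices(n_frames, step=10):
    """Return the indices of every `step`-th frame, starting at frame 0."""
    return list(range(0, n_frames, step))


# Hypothetical identifiers standing in for the 398 video sequences.
train_ids, test_ids = split_videos(range(398), n_train=319)
print(len(train_ids), len(test_ids))   # 319 79
print(sampled_frame_indices(45))       # [0, 10, 20, 30, 40]
```

Splitting at the video level, as the text describes, avoids placing near-identical frames of the same sequence in both cohorts.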

Deep Learning Models
The following classical deep learning models were implemented and compared: ResNet-18 [18], ResNet-50 [18], ResNet-101 [18], DenseNet-121 [19], MobileNetv2 [20], ShuffleNetv2 [21], and Vision Transformer [22]. The general process of deep-learning-based classification is shown in Figure 1. Given an input of a laryngeal image frame, the trained network can predict the gender, smoking history, or age of the patient purely based on the features in the input image. Below is a brief introduction to the implemented learning models.
Figure 1. Illustration of using deep learning models for laryngeal image classification. The deep learning models were pretrained on ImageNet and then fine-tuned on the laryngeal dataset using transfer learning. The output prediction could be gender, smoking history, or age.

ResNet: ResNet [18] introduces residual (shortcut) connections to facilitate the training of deep neural networks. Gradients can be back-propagated easily through the shortcut connections, so deep networks can be optimized more easily and outperform their shallower counterparts. Since its introduction, ResNet has become a standard baseline for many computer vision tasks. Shortcut connections have also been adopted in other architectures, such as Transformers, which achieve state-of-the-art performance in both natural language processing and computer vision applications.
DenseNet: In DenseNet [19], each layer is directly connected to all subsequent layers so that the gradient can flow smoothly, which mitigates the vanishing-gradient problem, a common difficulty in deep neural network training. The features from different layers are combined by concatenation instead of summation.
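The difference between the two connection patterns (summation in ResNet, concatenation in DenseNet) can be written compactly. The notation is introduced here for illustration and follows the usual presentation of these architectures: x_l is the output of layer l, F_l and H_l are the layer transformations, and square brackets denote channel-wise concatenation.

```latex
\begin{align}
\text{ResNet:}\quad   & x_{l} = x_{l-1} + F_{l}(x_{l-1}) \\
\text{DenseNet:}\quad & x_{l} = H_{l}\big([x_{0}, x_{1}, \ldots, x_{l-1}]\big)
\end{align}
```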
MobileNetv2: MobileNetv2 [20] is based on MobileNetv1 [24], which factorizes standard convolutions into depthwise convolutions followed by pointwise (1 × 1) convolutions, a combination known as depthwise separable convolution that requires fewer parameters and computations. MobileNetv2 introduces an inverted residual block that projects the feature maps to a high dimension and then back to a low dimension. The proposed inverted module reduces memory access and accelerates inference speed.
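To make the parameter saving concrete, the short sketch below counts the weights of a standard convolution and of its depthwise separable replacement. The layer sizes (3 × 3 kernel, 64 input channels, 128 output channels) are illustrative assumptions and are not taken from the models evaluated here; biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer sizes (assumed, not from the paper).
k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep)  # 73728 8768
```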
ShuffleNetv2: ShuffleNetv2 [21] was developed from ShuffleNetv1 [25] to empirically design high-efficiency mobile-level networks. Practical guidelines were incorporated for higher efficiency and a more lightweight network, including equal channel width, group convolution cost, less network fragmentation, and fewer element-wise operations.
Vision Transformers: Transformers [26] were initially designed for natural language processing for global connections between long-range tokens. Transformers have since been applied to computer vision tasks and have achieved state-of-the-art performance in classification [27,28] and object detection [29,30]. For image classification, the images are split into patches of the same size, which are embedded into tokens and fed into the Transformer blocks. Usually, there is an extra class token that interacts with all other tokens and produces the ultimate class prediction. Due to the lack of inductive bias, Vision Transformers [22] normally require more data and much longer training epochs to converge.
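The patch-splitting step can be illustrated by counting the tokens that reach the Transformer blocks. This is a sketch under assumed settings: a 224 × 224 input and 16 × 16 patches are common choices for a base-size ViT but are not stated in this paper, and the function name is hypothetical.

```python
def vit_token_count(image_size, patch_size, class_token=True):
    """Number of tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    n_patches = patches_per_side ** 2
    # The optional class token is prepended to the patch tokens.
    return n_patches + (1 if class_token else 0)


# Assumed configuration: 224 x 224 input, 16 x 16 patches.
print(vit_token_count(224, 16))  # 197
```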

Training Settings
Given that the current dataset was relatively small compared to other benchmark datasets in computer vision, transfer learning was employed, and all deep learning models were pretrained on ImageNet [23]. For each learning model, the same structure and hyperparameters as reported in the original paper were utilized. The batch size was set to 16, and the initial learning rate was 0.00005 (reduced by 0.2 each epoch) with a total of 5 epochs. The optimizer utilized was Adam [31], and all code was written with PyTorch [32].
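The phrase "reduced by 0.2 each epoch" admits two readings, and the sketch below simply tabulates both so the difference is visible. Neither reading is confirmed by the text; the function name and the multiplicative form of the decay are assumptions. In PyTorch, a multiplicative per-epoch decay of this kind is commonly expressed with an exponential scheduler such as `torch.optim.lr_scheduler.ExponentialLR`.

```python
def lr_schedule(initial_lr, gamma, epochs):
    """Learning rate per epoch under multiplicative decay: lr_e = initial_lr * gamma**e."""
    return [initial_lr * gamma ** epoch for epoch in range(epochs)]


# Reading 1 (assumption): the rate is multiplied by 0.2 after each epoch.
by_factor = lr_schedule(5e-5, gamma=0.2, epochs=5)
# Reading 2 (assumption): the rate is lowered by 20%, i.e. multiplied by 0.8.
by_percent = lr_schedule(5e-5, gamma=0.8, epochs=5)

for epoch, (a, b) in enumerate(zip(by_factor, by_percent)):
    print(f"epoch {epoch}: {a:.2e} (x0.2) vs {b:.2e} (x0.8)")
```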

The Metrics for Evaluation
We evaluated the performance of our deep learning models on the laryngeal dataset using four commonly used metrics: precision, recall, F1 score, and overall accuracy. The definitions of these metrics can be found in [13].
Precision assesses the accuracy of positive predictions by measuring the proportion of correctly classified positive instances out of all instances predicted as positive. It indicates how well the model identifies positive instances and has a low false-positive rate. Recall, also known as sensitivity or the true-positive rate, measures the proportion of correctly classified positive instances out of all actual positive instances. It focuses on the model's ability to detect all positive instances and has a low false-negative rate. The F1 score combines precision and recall into a single value, providing a balanced measure. It is calculated as the harmonic mean of precision and recall, considering both false positives and false negatives. The F1 score serves as an overall performance metric, providing a single evaluation measure. Overall accuracy measures the proportion of correctly classified instances, including both positive and negative, out of all instances.
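The four metrics can be computed directly from the counts of true and false positives and negatives. The sketch below is a minimal illustration with made-up labels; the function name and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the study.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 score, and accuracy for one positive class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Here two of the three positive predictions are correct (precision 2/3), two of the four actual positives are found (recall 1/2), and seven of the ten samples are classified correctly (accuracy 0.7).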
These metrics offer insights into different aspects of the model's classification abilities. When evaluating a medical image classification model, it is crucial to consider the specific requirements and priorities of the application. The importance of each metric may vary depending on the context. Additionally, it is important to interpret these metrics alongside domain-specific considerations, such as the severity of misclassifications and their potential impact on patient outcomes.

Results
A total of 398 video sequences were used in our analysis and divided into a training cohort of 319 sequences and a testing cohort of 79 sequences. The models were trained on the training cohort using the ground-truth labels for the patient's gender, smoking history, and age, and their classification performance was then assessed on the independent testing set. In this section, we first evaluate the models' performance at the image level and then present the results at the patient (sequence) level. This approach allows us to examine both the classification accuracy for individual images and the overall performance across an entire video sequence.
Figure 2 depicts the loss curves of the different models for age, gender, and smoking history predictions. The plotted loss is the running average of all previous loss values, which results in a smoothed curve. During training, the loss curves converged quickly, and training was terminated after five epochs. MobileNetv2, ShuffleNetv2, and ViT-B exhibit higher loss values at convergence than the other models.
The evaluation metrics, namely, precision, recall, F1 score, and overall accuracy, are presented in Tables 1-4. These metrics assess the performance of the models on three binary classification tasks: gender (male or female), smoking history (smoker or non-smoker), and age (<50 or ≥50). For smoking history, a non-smoker is defined as a patient who has never smoked, while a smoker is a patient with any smoking history. For age prediction, patients are divided into two groups: young (<50) and senior (≥50).
Table 1 provides the precision values calculated for each class, along with the mean and standard deviation computed across all deep learning models. Overall, the models performed consistently, although the standard deviations for predicting female gender and age < 50 were relatively large. For the remaining classes, the standard deviations in Table 1 indicate relatively low variation across the different models. Among all deep learning models, ResNet-50 achieved the highest precision for predicting male gender, with a value of 94.7%. In contrast, the precision for age < 50 was only 44%, considerably lower than that of the other categories. The high precision for predicting male gender may be attributed to visually distinguishable laryngeal features between male and female patients, whereas features related to age are harder to discern, particularly for patients near the age threshold. Notably, lightweight models such as ResNet-18 outperform more complex models with a larger number of parameters (e.g., ResNet-18 achieves the highest precision for predicting "female"). This observation suggests that the limited size of the dataset may favor simpler models, as they are less prone to overfitting, whereas complex models such as Vision Transformers typically require a larger amount of data to achieve optimal performance. Because precision captures only one aspect of performance, the remaining tables report additional metrics.
Table 2 provides the recall rates for each class. The recall rate measures the proportion of positive samples that are correctly identified among all positive samples in each category. Although the recall rates vary more than the precision values, the overall pattern is consistent across models. Notably, smoking history exhibits the lowest recall rate across all deep learning models. This can be attributed, in part, to the inherent variability in smoking habits among individuals. Smokers who smoke infrequently or who quit long ago may display fewer visible changes in the larynx, making them difficult to distinguish from non-smokers based on visual cues alone and leading to a lower recall rate for smoking history prediction.
To comprehensively evaluate the performance of the deep learning models on the larynx dataset, it is important to consider a metric that incorporates both precision and recall. The F1 score, presented in Table 3, is the harmonic mean of precision and recall and provides a balanced assessment of the models. The F1 scores are consistent across the different deep learning models, as indicated by the small standard deviations. Notably, the F1 scores for male gender, female gender, and age ≥ 50 surpass those of the other classes, which implies that the models achieve a good balance between precision and recall for these categories.
The overall accuracy of each learning model was calculated as the number of correctly predicted samples divided by the total number of samples. Table 4 presents the results, showing that gender prediction achieved the highest overall accuracy, followed by age and smoking history predictions. Gender prediction exhibited a mean accuracy of 83.2% across models. This suggests that deep learning models can effectively capture and analyze specific features in laryngeal images that are indicative of a patient's gender. These distinguishing features may not be readily discernible to human experts, underscoring the potential of deep learning models in extracting valuable information from medical images.

The Performance of Deep Learning Models at Image Level
Gender prediction is a straightforward binary classification task, whereas age and smoking history classification are more challenging because the underlying quantities are continuous. Choosing an age threshold is difficult, as the distinguishing features between age groups may not be readily apparent. Similarly, predicting smoking history is complicated by the wide range of smoking habits among smokers. For instance, the laryngeal characteristics of a social smoker or someone with a short smoking history may differ significantly from those of a heavy smoker. Consequently, the boundary between smokers and non-smokers is not always clearly discernible, even though heavy smokers and non-smokers are clearly separated. These difficulties in establishing clear boundaries likely contribute to the lower accuracy observed in age and smoking history prediction compared to gender prediction. Nevertheless, the developed learning models achieved mean accuracies of 73% for age classification and 63.6% for smoking history prediction, which indicates that the models learned discriminative features and extracted relevant information from laryngeal images for these challenging tasks.
In summary, deep learning models demonstrate promising performance in predicting gender, smoking history, and age, with gender prediction being particularly accurate. Because identifying these characteristics from laryngoscopy alone is virtually impossible even for well-trained experts, the models extract information from laryngeal images that is not accessible through visual inspection, showcasing their potential in advancing medical image analysis. This finding underscores the promising role of deep learning models in leveraging visual data to enhance diagnostic capabilities in healthcare. By identifying subtle patterns and characteristics, these models can aid healthcare professionals in providing more accurate assessments based on laryngeal images, ultimately improving patient care and outcomes.

Overall Performance Based On Patients
The experiments described above were evaluated based on individual image frames, but in clinical settings, all frames in a video sequence belong to the same patient. Therefore, it is more meaningful to evaluate the performance of classification at the sequence level. This section reports the overall accuracy of gender, smoking history, and age predictions at the patient level by combining the results of all frames in the same sequence.
Two methods were used to evaluate sequence-level predictions: majority voting and probability voting. In majority voting, the final prediction is the label predicted for the majority of the image frames in the sequence. In probability voting, the predicted probabilities for each candidate label are aggregated separately over all frames, and the final prediction is the label with the higher aggregated probability. The comparative results are presented in Table 5. With sequence-based prediction, the overall accuracy for predicting gender, smoking history, and age is much higher than that based on individual frames (Table 4). We also note that majority voting performs slightly better overall than probability voting.
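The two aggregation rules can be sketched in a few lines. This is an illustration of the idea rather than the authors' implementation: the function names, the use of a plain sum as the aggregation, the tie-breaking behavior, and the example probabilities are all assumptions. Each frame is represented by its list of per-class probabilities.

```python
from collections import Counter


def majority_vote(frame_probs):
    """Pick the label predicted by the largest number of frames.

    Ties are resolved in favor of the label seen first (an assumption).
    """
    frame_labels = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    return Counter(frame_labels).most_common(1)[0][0]


def probability_vote(frame_probs):
    """Sum per-class probabilities over all frames and pick the largest total."""
    n_classes = len(frame_probs[0])
    totals = [sum(p[c] for p in frame_probs) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


# Made-up probabilities for three frames of one sequence: [P(class 0), P(class 1)].
frames = [
    [0.45, 0.55],
    [0.40, 0.60],
    [0.95, 0.05],
]
print(majority_vote(frames))     # 1
print(probability_vote(frames))  # 0
```

The example is chosen so that the two rules disagree: two weakly confident frames outvote one highly confident frame under majority voting, while the summed probabilities favor the confident frame.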

Visualization
To illustrate the response strength within the different image areas that correspond to the prediction results, a Class Activation Map (CAM) [33] was extracted. Only the results obtained by ResNet-50 were used for visualization, as similar results were obtained by the other learning models. The CAMs for the prediction of gender, smoking history, and age are illustrated in Figure 3, Figure 4, and Figure 5, respectively.
For visualization, the CAMs were overlaid on top of the original laryngeal images with a ratio of 2:5 so that the high-response areas in the original images could be easily recognized. The red color indicates a high response, and the blue color represents a low response. High-response areas correspond to areas that contribute more to the prediction results.
Figure 3 shows the CAMs for gender prediction. The left images are from male patients, and the right images are from female patients. The high-response areas were similar in both male and female images; they involved the true and false vocal folds and, partially, the arytenoids. For smoking history and age prediction, the corresponding CAMs are illustrated in Figures 4 and 5, respectively. The high-response areas had increased arytenoid involvement and were less focused on the vocal folds.
Figure 3. The CAM visualization of gender prediction. "pred" stands for the predicted result, and "gt" represents the ground truth. The left column demonstrates the maps for male patients, and the right column illustrates the maps for female patients. The red color indicates the areas in the image that have a high response for the predicted result, and the blue color indicates the areas in the image that have a low response for the predicted result.
Figure 4. The CAM visualization of smoking history prediction. "pred" stands for the predicted result, and "gt" represents the ground truth. The left column demonstrates the maps for male patients, and the right column illustrates the maps for female patients. The red color indicates the areas in the image that have a high response for the predicted result, and the blue color indicates the areas in the image that have a low response for the predicted result.
Figure 5. The CAM visualization of age prediction. "pred" stands for the predicted result, and "gt" represents the ground truth. The left column demonstrates the maps for male patients, and the right column illustrates the maps for female patients. The red color indicates the areas in the image that have a high response for the predicted result, and the blue color indicates the areas in the image that have a low response for the predicted result.
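For readers unfamiliar with the maps in Figures 3-5, the standard CAM formulation and a simple blending rule can be written as follows. The notation is introduced here for illustration. In particular, reading the 2:5 ratio as a heatmap-to-image weighting (α = 2/7) is an assumption, since the text does not state which term each number refers to.

```latex
\begin{align}
M_c(x, y) &= \sum_{k} w_k^{c}\, f_k(x, y) \\
I_{\text{overlay}}(x, y) &= \alpha\, \operatorname{colormap}\!\big(\tilde{M}_c(x, y)\big) + (1 - \alpha)\, I(x, y),
\qquad \alpha = \tfrac{2}{7}
\end{align}
```

Here f_k is the k-th feature map of the last convolutional layer, w_k^c is the weight of the final fully connected layer that links the globally averaged feature k to class c, M̃_c is M_c upsampled to the image size and normalized to [0, 1], and I is the original image.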

Conclusions
This is the first study to employ deep learning models with computer visualization to predict the gender, smoking history, and age of patients from laryngeal images. The deep learning models tested achieved consistent and promising results for these tasks. Visualizing the CAMs of the laryngeal images revealed that the high-response areas were focused primarily around the true and false vocal folds, which indicates that these areas may exhibit subtle differences among patients of different genders, ages, and smoking statuses.
While we have annotated a laryngoscopic dataset in this study, the trained models may exhibit poor generalizability due to the relatively small scale of the dataset. To mitigate this limitation, it is essential to explore strategies that enhance service continuity in medical image classification. One viable solution is the design of self-organized systems [34] that can dynamically optimize model parameters based on the scalability of the dataset, thereby improving the reliability and adaptability of the system. In our future studies, we will integrate the findings of this study into a comprehensive model for laryngeal disease classification by leveraging multi-modality learning techniques to effectively combine information from various sources, leading to more accurate and reliable diagnostic outcomes. We believe that these advancements will contribute to improved diagnostic capabilities and ultimately benefit patient care.

Informed Consent Statement: Informed consent was waived because the study used anonymized data without personally identifiable information.
Data Availability Statement: Not applicable.

Conflicts of Interest: The authors declare no conflict of interest.