1. Introduction
The growing digitization of healthcare systems has led to the exponential accumulation of clinical data in electronic health records (EHRs) [1,2]. Among these records, textual medical notes, such as physicians’ observations, discharge summaries, and clinical reports, contain rich information that can reflect patients’ conditions, diagnoses and clinical histories. However, the unstructured and heterogeneous nature of such textual data poses challenges for automated processing and analysis [3]. Efficiently classifying textual medical notes into relevant disease categories can facilitate improved clinical decision making and support healthcare management systems through faster retrieval and knowledge extraction.
Machine Learning (ML) approaches have shown promise in transforming unstructured clinical text into actionable insights. By learning patterns and associations from labeled datasets, ML models can help automatically categorize medical documents according to disease types. This capability is important for developing intelligent healthcare applications such as predictive diagnostic tools and population health monitoring frameworks [4]. Nevertheless, selecting an effective classification model remains a challenge, as different algorithms show varying degrees of performance depending on data characteristics, preprocessing methods, and feature representations.
Although recent advances in deep learning and transformer-based language models have gained attention, these approaches often require extensive training resources and very large datasets, which may not be feasible for many healthcare settings [5]. Consequently, traditional ML methods continue to play an important role due to their interpretability, computational efficiency, and competitive performance on high-dimensional sparse features such as TF–IDF representations [6]. Despite their relevance, the literature still lacks comprehensive and statistically rigorous benchmark studies evaluating multiple traditional ML models on multiclass medical note classification. Existing studies often rely on binary classification settings (e.g., disease vs. non-disease) or restricted, domain-specific corpora, limiting the generalizability and reliability of their findings. Given that traditional ML models differ in their learning strategies, underlying assumptions, and robustness to noise, a systematic multiclass comparison remains valuable for identifying their relative strengths and weaknesses in clinical NLP applications [6].
This research aims to address this gap by conducting a comparative study of four widely used machine learning models, namely Random Forest, Logistic Regression, Naive Bayes, and Support Vector Machine, for classifying textual medical notes. The study uses a curated dataset containing over 9000 labeled clinical notes and applies consistent preprocessing to train and fine-tune each model. Model performance is assessed using standard multiclass metrics, including accuracy, precision, recall, F1-score, and multiclass ROC-AUC, enabling a rigorous and balanced comparison.
The key contributions of this paper can be summarized as follows:
A statistically robust benchmark comparing four established machine learning algorithms on a multiclass clinical text classification task, using repeated stratified evaluation and cross-validated hyperparameter tuning.
Empirical insights into the strengths, limitations, and suitability of traditional ML approaches for disease-oriented clinical text analytics, providing a foundation for evaluating more advanced NLP methods.
By establishing a transparent and statistically reliable benchmark for traditional ML models, this study supports future research on clinical NLP, including transformer-based models, domain-specific embeddings, and automated or weakly supervised learning frameworks.
The remainder of this paper is organized as follows. Section 2 reviews related work on medical text classification and prior comparative studies. Section 3 describes the dataset and preprocessing procedures used in this study. Section 4 details the methodologies of the selected machine learning models. Section 5 presents and discusses the experimental results. Finally, Section 6 concludes the paper and outlines directions for future research.
2. Related Work
The growing availability of electronic health records (EHRs) and digitized clinical narratives has accelerated research on machine learning (ML) and natural language processing (NLP) applications in healthcare [7]. Over the past decade, numerous studies have explored how ML techniques can extract, classify, and interpret meaningful information from unstructured medical text, such as clinical notes and discharge summaries. Existing studies range from broad surveys that summarize progress and challenges in the field to empirical investigations comparing traditional supervised algorithms, and more recently, deep learning and hybrid frameworks capable of modeling complex semantic patterns.
2.1. Surveys on Machine Learning for Clinical and Medical Text
Several surveys have examined the increasing integration of ML within healthcare and clinical NLP. Spasic et al. conducted a systematic review of 110 studies applying ML to clinical text and identified text classification as the most common NLP task in healthcare [4]. The authors found that most datasets were small and institution-specific, limiting model generalizability, and emphasized the annotation bottleneck as a critical challenge for supervised learning [4]. Strategies such as active learning, distant supervision, and crowdsourcing were discussed as potential solutions to reduce manual labeling costs.
Mustafa et al. surveyed the emerging field of Automated Machine Learning (AutoML) in healthcare, highlighting its potential for clinical note analysis [8]. Although AutoML has shown promise in structured data settings, its application to unstructured medical text remains underdeveloped. The authors noted key barriers including data heterogeneity, privacy concerns, and model interpretability, concluding that an AutoML platform for clinical notes could greatly enhance scalability and reduce human effort in ML-based healthcare solutions [8].
Kino et al. provided a scoping review of ML applications to the social determinants of health (SDH) [9]. Reviewing 82 studies published before 2020, they observed that most used predictive ML models on structured survey data, with limited exploration of unstructured sources such as clinical narratives. The authors underscored the broader expansion of ML into health research and emphasized the need for interpretable, transparent, and well-validated approaches when applying ML to clinical textual data [9].
Kadhim offered a comprehensive overview of supervised ML techniques for text classification, detailing the standard pipeline of data preprocessing, feature extraction, and model evaluation [10]. The review compared algorithms such as Naive Bayes (NB), Support Vector Machines (SVM), and k-Nearest Neighbors (k-NN), noting that TF–IDF weighting schemes significantly enhance classification accuracy.
These reviews establish the theoretical and methodological foundation for applying ML to unstructured health text and motivate the empirical comparisons undertaken in the present study.
2.2. Traditional Machine Learning Approaches for Medical Text Classification
A broad range of studies demonstrates that traditional supervised learning remains effective for medical text classification and disease prediction. Weng et al. developed a machine learning–based NLP framework for medical subdomain classification of clinical notes [11]. Using cTAKES and UMLS features, the authors compared SVMs with convolutional recurrent neural networks (RNNs) and showed that SVMs offered comparable accuracy with better interpretability, validating their utility for cross-institutional applications [11].
López-Úbeda et al. proposed an ML-based system for automatic classification of radiological protocols using a corpus of 700,000 CT and MRI reports [12]. Several NLP-driven classifiers were evaluated, including SVM, Random Forest, neural networks, and transfer-learning approaches. The system achieved high accuracy and has since been implemented as a clinical decision-support tool, demonstrating the practical value of ML for workflow optimization.
Tiwari et al. examined multiclass disease prediction using Random Forest, SVM, Naive Bayes, and Decision Tree algorithms on a symptom dataset [13]. Both Random Forest and Decision Tree achieved the highest accuracy (99%), whereas Naive Bayes yielded the lowest (86%), confirming the effectiveness of ensemble and tree-based methods for healthcare classification tasks.
Sung et al. applied supervised ML and text mining to automated phenotyping of ischemic stroke using 4640 patient EMRs [14]. The integration of structured variables with textual data improved classification, and decomposing the multiclass problem into binary subtasks further enhanced performance. Their findings highlight the potential of ML to replace manual annotation in disease phenotyping.
Rabby and Berka investigated multi-class classification of COVID-19 biomedical research papers using ten ML algorithms and eleven feature configurations [15]. They found that TF–IDF features of abstracts yielded the highest accuracy, with Random Forest and BERT models performing best, demonstrating the versatility of traditional ML for biomedical document classification.
Gupta et al. developed an NLP pipeline to automatically identify immune-related adverse events (irAEs) from unstructured oncology notes [16]. Employing keyword filtering, TF–IDF, and BioWordVec embeddings as input for Logistic Regression, SVM, Random Forest, CNN, and Bi-LSTM models, they achieved an F1 score of 0.75 and AUC of 0.85, demonstrating that classical ML methods augmented with embeddings can automate complex clinical annotation tasks.
Gao et al. introduced KeyClass, a weakly supervised framework for assigning ICD-9 codes to unstructured clinical notes without manual labeling [17]. Tested on the MIMIC-III dataset, KeyClass achieved performance comparable to supervised models trained on thousands of labeled samples, underscoring the promise of weak supervision for scalable medical text classification.
Lenivtceva et al. explored multi-label classification of 11,671 Russian medical notes [18]. The authors compared several algorithms and proposed classifier-chain ensembles to capture inter-label dependencies, achieving notable performance gains and illustrating the strength of ensemble strategies for complex medical text tasks.
These studies confirm that traditional ML models, particularly SVM, Logistic Regression, and Random Forest, offer robust, interpretable, and computationally efficient baselines for medical text classification.
2.3. Advances and Extensions Using Deep Learning and Hybrid Methods
While traditional algorithms remain effective, recent research increasingly applies deep learning and hybrid NLP approaches to capture the semantic richness of clinical text. da Silva et al. evaluated machine learning and deep learning models for oncology clinical notes, comparing Logistic Regression, Random Forest, Decision Tree, k-NN, Multilayer Perceptron (MLP), and LSTM networks on 3308 documents [19]. Preprocessing raised mean accuracy from 26% to 93.9%, with the MLP model achieving the best F1 score (93.6%), demonstrating the influence of text normalization on performance.
Goodrum et al. developed a framework to classify scanned EHR documents into clinically relevant and non-relevant categories using OCR-extracted text [20]. A ClinicalBERT model achieved an accuracy of 0.973, highlighting the power of transformer architectures for document-level clinical classification.
Lu et al. compared seven deep learning models including CNN, RNN, GRU, LSTM, Bi-LSTM, Transformer encoders, and BERT, for discharge note classification under varying class-imbalance conditions [21]. Transformer encoders yielded the best results overall, whereas CNNs achieved similar accuracy with shorter training time, suggesting a practical balance between computational efficiency and predictive accuracy.
These studies reflect a gradual evolution from traditional ML pipelines toward deep neural and hybrid models that exploit pre-trained embeddings and transformer architectures to enhance semantic understanding. However, they also reveal that well-tuned traditional algorithms can offer comparable performance with greater interpretability and lower computational demands, qualities that are valuable in clinical settings [19].
Across surveys, traditional models, and modern deep learning methods, the literature demonstrates the maturity and adaptability of ML for clinical and medical text classification. The challenges, such as data sparsity, annotation costs, and the trade-off between interpretability and complexity, continue to shape the field. Building on these insights, the present study contributes by conducting a comparative evaluation of four traditional ML algorithms for multiclass classification of textual medical notes, thereby building empirical benchmarks to guide future research that may integrate deep learning, weak supervision, or AutoML techniques for enhanced performance.
4. Methodologies
This section presents the machine learning algorithms used to classify textual medical notes into disease categories. To identify the most effective traditional learning algorithm for this task, four widely used classifiers, namely Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), and Support Vector Machine (SVM), were implemented and compared under consistent preprocessing conditions. Each model represents a distinct learning paradigm: probabilistic reasoning in NB, linear discriminative modeling in LR, margin maximization in SVM, and ensemble-based decision aggregation in RF. All models were trained using the same vectorized representation of the preprocessed medical notes and optimized through hyperparameter tuning to achieve the best classification performance. The following subsections describe the theoretical foundations, parameter configurations, and implementation details of each model.
4.1. Random Forest Classification
The Random Forest (RF) algorithm is an ensemble-based learning method that constructs multiple decision trees during training and outputs the class predicted by the majority of trees [23]. By combining bootstrap aggregation (bagging) and random feature selection, Random Forest reduces overfitting while maintaining strong predictive accuracy [24]. Each tree is trained on a random subset of samples, and at each node split, a random subset of features is considered. This dual randomness enhances model diversity and stability, making Random Forest suited for high-dimensional and sparse text datasets [25].
In this study, the Random Forest classifier was applied to the vectorized medical notes to classify documents into four disease categories. The feature space consisted of 27,609 unique tokens derived from the preprocessed corpus. The algorithm recursively partitions the feature space to minimize impurity, measured using the Gini index. Its ability to handle large vocabularies without explicit feature selection makes it an appropriate choice for textual data.
Hyperparameters were optimized using a grid search with 5-fold cross-validation to achieve a balance between accuracy and generalization. The parameters tuned included the number of trees, the maximum tree depth, the minimum number of samples required at a leaf node, and the minimum number of samples required for node splits. Model performance was assessed using the macro-averaged F1-score, which provides balanced sensitivity across all disease classes.
Random Forest was selected for its robustness to noise, interpretability of feature importance, and capacity to model nonlinear relationships in textual data. Its ensemble structure also identifies key discriminative medical terms, supporting interpretability and transparency in clinical applications.
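To make this configuration concrete, the following is a minimal scikit-learn sketch of the Random Forest tuning step. The toy corpus, variable names, and candidate grid values are illustrative assumptions for demonstration, not the exact data or grid used in this study.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the preprocessed medical notes
# (1 = Neoplasms, 2 = Digestive, 3 = Nervous, 4 = Cardiovascular).
notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25
labels = [1, 2, 3, 4] * 25

X = TfidfVectorizer().fit_transform(notes)

# Candidate values are examples only; the tuned parameters are listed in Section 4.6.
param_grid = {"n_estimators": [200, 400],
              "max_depth": [None, 50],
              "max_features": ["sqrt"],
              "min_samples_split": [2, 5],
              "min_samples_leaf": [1, 4]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="f1_macro", n_jobs=-1)
search.fit(X, labels)
print(search.best_params_, round(search.best_score_, 4))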
4.2. Naive Bayes
The Naive Bayes (NB) classifier is a probabilistic model based on Bayes’ Theorem, which assumes conditional independence among features given the class label [26,27]. Despite this simplifying assumption, it remains effective and computationally efficient for text classification because it models word-occurrence probabilities directly [28]. In this study, NB was applied to estimate the likelihood that a medical note belongs to one of four disease categories using the distribution of words within each note. Given a document $d$ and a class $c$, the posterior probability is expressed as:

$$P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

where $P(c)$ is the prior probability of a class and $P(w_i \mid c)$ represents the likelihood of observing word $w_i$ in that class [26]. The class with the highest posterior probability is then assigned as the predicted label.
The Multinomial Naive Bayes (MNB) variant was employed, as it is well suited for count-based representations such as term frequency or TF–IDF vectors derived from the medical notes. Additive Laplace smoothing was used to handle zero probabilities for rare or unseen words, improving generalization. To optimize the smoothing parameter $\alpha$, a grid search with 5-fold cross-validation was conducted over a range of candidate values, using the macro-averaged F1-score as the selection metric.
Naive Bayes was chosen for its simplicity, speed, and interpretability, which make it a strong baseline model for medical text classification.
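As an illustration, a minimal scikit-learn sketch of the MNB tuning step is shown below. The toy corpus and the candidate smoothing values are assumptions for demonstration rather than the exact grid searched in this study.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25   # toy stand-in corpus
labels = [1, 2, 3, 4] * 25

X = TfidfVectorizer().fit_transform(notes)                      # non-negative TF-IDF values suit MNB

# alpha is the additive (Laplace) smoothing coefficient tuned in this study.
grid = GridSearchCV(MultinomialNB(),
                    {"alpha": [0.01, 0.1, 0.5, 1.0], "fit_prior": [True, False]},
                    cv=5, scoring="f1_macro")
grid.fit(X, labels)
print(grid.best_params_)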
4.3. Logistic Regression
The Logistic Regression (LR) model is a linear classifier that estimates the probability of a document belonging to a specific class using a logistic (sigmoid) function [29]. Logistic regression is a discriminative approach that directly learns the decision boundary between classes by maximizing the likelihood of the observed labels [30]. It is well suited for high-dimensional, sparse datasets like textual representations, where individual tokens serve as features. Given a document represented by a feature vector $\mathbf{x}$, the probability that it belongs to class $c$ is expressed as:

$$P(y = c \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}$$

where $\mathbf{w}$ denotes the model weights and $b$ is the bias term [30]. For multiclass problems, a softmax extension is applied to ensure all class probabilities sum to one.
In this study, logistic regression was implemented using the one-vs.-rest (OvR) strategy, where an independent binary classifier was trained for each disease class. The input features were derived from the TF–IDF document–term matrix generated during preprocessing. Model parameters were optimized through L2 (ridge) regularization to control overfitting, and hyperparameter tuning was performed using grid search with 5-fold cross-validation. Regularization type and solver choice were further optimized by testing L1, L2, and elasticnet penalties with solvers including ‘lbfgs’, ‘liblinear’, and ‘saga’. Model performance was evaluated using the macro averaged F1-score.
Logistic Regression was selected for its interpretability, scalability, and ability to produce well-calibrated probability estimates, making it a reliable model for textual medical note classification.
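The sketch below illustrates this tuning setup in scikit-learn; only the saga solver is shown for brevity, since it supports all three penalties on sparse data. The toy corpus and grid values are illustrative assumptions, and parameter combinations are grouped so that l1_ratio is only passed with the elastic net penalty.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25   # toy stand-in corpus
labels = [1, 2, 3, 4] * 25

X = TfidfVectorizer().fit_transform(notes)

# Grouped grids avoid invalid parameter combinations during the search.
param_grid = [
    {"penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
    {"penalty": ["elasticnet"], "C": [0.1, 1, 10], "l1_ratio": [0.25, 0.5, 0.75]},
]
grid = GridSearchCV(LogisticRegression(solver="saga", max_iter=1000),
                    param_grid, cv=5, scoring="f1_macro")
grid.fit(X, labels)
print(grid.best_params_)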
4.4. Support Vector Machine Classification
Support Vector Machine (SVM) is a discriminative learning algorithm that seeks an optimal separating hyperplane to maximize the margin between data points of different classes [31]. It is effective for text classification, where data are typically high-dimensional and sparse, conditions under which linear SVMs perform well [32]. By identifying a subset of critical data points, known as support vectors, the model defines the decision boundary that best separates classes in the feature space. Given a training set of labeled samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i$ represents the feature vector and $y_i$ denotes the class label, the SVM optimization problem can be formulated as:

$$\min_{\mathbf{w},\, b,\, \xi} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0$$

where $\mathbf{w}$ is the weight vector, $b$ is the bias term, $\xi_i$ are slack variables, and $C$ is the regularization parameter balancing margin maximization and classification error [33].
In this study, a linear SVM was employed due to its scalability and strong performance on large text corpora. The model was trained on a TF–IDF document–term matrix containing 27,609 features. Hyperparameters, including the regularization parameter C and kernel type, were tuned using grid search with 5-fold cross-validation. Linear, polynomial, radial basis function (RBF), and sigmoid kernels were tested, with performance evaluated using the macro averaged F1-score to account for class imbalance. The final configuration, based on the highest cross-validated F1-score, achieved a balanced trade-off between accuracy and generalization, confirming SVM’s robustness and interpretability for medical text classification.
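A compact sketch of this kernel search is given below; the corpus and candidate values are illustrative assumptions. Kernel-specific parameters are grouped so that degree applies only to the polynomial kernel and gamma only to the RBF and sigmoid kernels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25   # toy stand-in corpus
labels = [1, 2, 3, 4] * 25

X = TfidfVectorizer().fit_transform(notes)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "C": [1, 10], "degree": [2, 3], "gamma": ["scale"]},
    {"kernel": ["rbf", "sigmoid"], "C": [1, 10], "gamma": ["scale", 0.1]},
]
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro")
grid.fit(X, labels)
print(grid.best_params_)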
4.5. Evaluation Metrics
To evaluate the performance of the four machine learning models, a set of widely accepted metrics was employed. These metrics provide complementary perspectives on predictive accuracy, robustness across classes, and reliability when applied to multiclass textual data. The primary metrics used in this study include accuracy, precision, recall, F1-score, and ROC-AUC, each computed for individual class labels and then aggregated to summarize overall model performance.
For a given class label $c \in \{1, \dots, C\}$, where $C$ represents the total number of categories, let $TP_c$, $FP_c$, and $FN_c$ denote true positives for class $c$, false positives for class $c$, and false negatives for class $c$, respectively. The following definitions apply:

Precision for class $c$ quantifies the proportion of correctly predicted samples of class $c$ among all samples predicted as class $c$:

$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}$$

Recall for class $c$ (or sensitivity) measures the proportion of actual samples of class $c$ that were correctly identified:

$$\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$

F1-score for class $c$ is the harmonic mean of precision and recall:

$$\text{F1}_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$

Accuracy measures the overall proportion of correctly classified samples across all classes:

$$\text{Accuracy} = \frac{\sum_{c=1}^{C} TP_c}{N}$$

where $N$ is the total number of evaluated samples.
Since this study involves four disease categories, macro averaged and weighted averaged versions of these metrics were used. Macro-averaging treats all classes equally, regardless of their size, while weighted averaging assigns weights proportional to the number of instances per class, providing a more realistic assessment under class imbalance.
In addition to these numerical metrics, confusion matrices were analyzed to visualize class-level performance, identify common misclassifications, and detect semantic overlaps among disease categories. A confusion matrix provides a detailed view of a classifier’s performance by comparing predicted versus actual class labels [34]. The diagonal elements represent correctly classified medical notes, while off-diagonal values indicate misclassifications [34].
To further characterize the discriminative behavior of the models, Receiver Operating Characteristic (ROC) analysis was conducted. For multiclass problems, ROC curves were generated using the one-vs.-rest (OvR) strategy, in which each class is considered the positive class in turn [35]. For each classifier, class-wise ROC curves and their associated area under the curve (AUC$_c$) values were computed. A macro-averaged AUC was obtained by averaging over all classes.
Together, these evaluation measures provide a comprehensive and balanced assessment of model performance, enabling reliable comparison among the Random Forest, Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine classifiers.
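The sketch below shows how these quantities can be computed with scikit-learn; the arrays are small illustrative stand-ins for the test-set labels, predictions, and class-probability scores (e.g., from predict_proba), not results from this study.

import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# Illustrative outputs for a 4-class problem (classes labeled 1-4).
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([1, 1, 2, 3, 3, 2, 4, 4])
y_score = np.array([[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1],
                    [0.1, 0.6, 0.2, 0.1], [0.1, 0.2, 0.6, 0.1],
                    [0.1, 0.1, 0.7, 0.1], [0.1, 0.5, 0.3, 0.1],
                    [0.1, 0.1, 0.1, 0.7], [0.2, 0.1, 0.1, 0.6]])

print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=4))   # per-class plus macro/weighted averages
print(confusion_matrix(y_true, y_pred))                  # rows: actual classes, columns: predicted
print(roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))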
4.6. Model Training and Validation
To ensure a rigorous and unbiased evaluation, all models were trained and validated using a repeated experimental protocol consisting of five independent stratified 90–10 train–test splits. For each run, 90% of the data was used for model training and hyperparameter tuning, while the remaining 10% was reserved exclusively for evaluation. Stratification preserved the class distribution across splits, providing consistent representation of disease categories in each subset.
Within each run, hyperparameter optimization was performed using scikit-learn’s GridSearchCV with 5-fold cross-validation applied only to the training portion. This ensured that the choice of hyperparameters reflected performance averaged across multiple folds rather than a single validation set, reducing variance and improving the reliability of model selection. After tuning, each model was retrained on the full training split and evaluated on the corresponding test subset.
The hyperparameter grids were intentionally designed to be non-trivial and to include parameters known to affect model complexity, regularization, and generalization behavior, particularly in high-dimensional sparse text settings.
For Random Forest, the parameter grid included:
n_estimators
max_depth
max_features
min_samples_split
min_samples_leaf
These parameters influence ensemble stability, bias–variance trade-offs, and resistance to noise.
For Multinomial Naive Bayes, tuning focused on:
Smoothing coefficient alpha
fit_prior
These parameters affect the handling of rare terms in TF–IDF distributions.
For Logistic Regression, the grid explored:
Regularization strength C
penalty
l1_ratio
solver
These hyperparameters govern coefficient sparsity, optimization behavior, and decision-boundary flexibility.
For the Support Vector Machine, tuning considered:
kernel
C
degree (for polynomial kernels)
gamma (for RBF and sigmoid kernels)
These parameters control the geometric complexity of the decision boundary and the model’s ability to capture linear versus nonlinear relationships.
The parameter ranges were selected based on prior text-classification research, empirical guidance from the scikit-learn documentation, and computational feasibility given the high dimensionality of the TF–IDF feature space. The consistency of optimal configurations across runs demonstrates that the search space was appropriate and sufficiently expressive for this task.
Model evaluation metrics included accuracy, macro and weighted F1-scores, and multiclass ROC–AUC. Final performance results represent the mean values across the five runs, while confusion matrices, classification reports, and ROC curves are shown for the run closest to the mean performance.
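For reproducibility, the sketch below condenses this protocol for a single model (Logistic Regression); the toy corpus, grid values, and random seeds are illustrative assumptions, and the same loop applies to the other three classifiers.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 50   # toy stand-in corpus
labels = [1, 2, 3, 4] * 50

results = []
for run in range(5):                                            # five stratified 90-10 splits
    train_txt, test_txt, y_train, y_test = train_test_split(
        notes, labels, test_size=0.10, stratify=labels, random_state=run)

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_txt)
    X_test = vectorizer.transform(test_txt)

    # 5-fold grid search restricted to the training portion; with refit=True (default),
    # the best configuration is retrained on the full training split before evaluation.
    grid = GridSearchCV(LogisticRegression(solver="saga", max_iter=1000),
                        {"C": [0.1, 1, 10]}, cv=5, scoring="f1_macro")
    grid.fit(X_train, y_train)

    y_pred = grid.predict(X_test)
    results.append((accuracy_score(y_test, y_pred),
                    f1_score(y_test, y_pred, average="macro"),
                    f1_score(y_test, y_pred, average="weighted")))

mean_acc, mean_macro, mean_weighted = (sum(col) / len(results) for col in zip(*results))
print(mean_acc, mean_macro, mean_weighted)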
5. Experimental Results and Analysis
This section presents and discusses the experimental results obtained from the comparative evaluation of the four machine learning models (Random Forest, Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine) applied to classifying textual medical notes into four disease categories. All experiments were conducted using scikit-learn version 1.7 in Python 3.13, which provides reliable and standardized implementations of machine learning algorithms and evaluation procedures [36]. The experiments were conducted on the preprocessed dataset containing 9633 labeled records. Each model was trained and optimized using the best hyperparameter configurations determined through cross-validation, as detailed in Section 4.6.
The evaluation focuses on multiple performance metrics, including accuracy, class-wise precision, recall, and F1-score, along with macro- and weighted-averaged forms that account for imbalanced class distributions. In addition, Receiver Operating Characteristic (ROC) curves were generated using the one-vs.-rest strategy to evaluate the discriminative ability of the models across all classes, and confusion matrices were used to visualize misclassification patterns. The following subsections present detailed results for each model, followed by a comparative analysis highlighting their relative strengths, weaknesses, and suitability for medical text classification tasks.
5.1. Random Forest Classification Model Evaluation
The Random Forest classifier was trained and evaluated using the repeated experimental procedure described in Section 4.6, consisting of five independent stratified 90–10 train–test splits. For each run, hyperparameter tuning was carried out on the training portion using GridSearchCV with 5-fold cross-validation. Across all five runs, the tuning process consistently selected the same optimal hyperparameter configuration: bootstrap = True, max_depth = None, max_features = ‘sqrt’, min_samples_leaf = 4, min_samples_split = 2, and n_estimators = 400. This configuration allows the model to grow deep trees with random feature subsets while maintaining generalization through leaf-size regularization. This stability indicates that the Random Forest model converges to a reliable configuration under multiple data partitions.
Table 2 summarizes the accuracy, macro F1-score, and weighted F1-score for each of the five runs.
To provide detailed diagnostic insight, Run #5 (where # denotes the run number), whose accuracy was closest to the mean, was chosen for subsequent analysis.
Table 3 presents the confusion matrix for this representative run. It shows how the Random Forest model’s predictions are distributed across the four disease categories on the test set.
The confusion matrix shows that the classifier performed strongly on Neoplasms (Class 1) and Cardiovascular Diseases (Class 4), with high true positive counts and relatively few misclassifications. More confusion is observed for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3), where overlapping terminology and shared clinical descriptors likely contributed to errors.
The corresponding classification report for Run #5 is shown in Table 4, which includes class-wise precision, recall, F1-score, and support.
Among all categories, Neoplasms (Class 1) and Cardiovascular Diseases (Class 4) achieved the highest recall values (0.8864 and 0.9082), showing that the model effectively identified these disease types. In contrast, Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3) had lower recall values (0.6309 and 0.6321), reflecting greater confusion with other classes.
To further assess the discriminative capability of the Random Forest classifier, ROC curves for Run #5 were generated for each class using the one-vs.-rest strategy.
Figure 1 displays the ROC curves for all four disease categories.
These AUC results indicate separation between positive and negative classes across all disease categories. The slightly lower AUC for Class 3 reflects the higher misclassification rate observed in the confusion matrix and classification report, consistent with textual overlap between neurological and digestive case descriptions.
Overall, the Random Forest classifier performed competitively across all metrics, establishing a strong baseline for textual medical note classification.
5.2. Naive Bayes Model Evaluation
The Multinomial Naive Bayes (MNB) classifier was evaluated using the repeated experimental procedure outlined in Section 4.6, consisting of five independent stratified 90–10 train–test splits. For each run, hyperparameter tuning was performed on the training subset using GridSearchCV with 5-fold cross-validation. Across all runs, the tuning procedure consistently selected the same optimal hyperparameters, namely the same value of the smoothing coefficient alpha together with fit_prior = False. This stability indicates that the MNB model converges reliably to a robust configuration across varying data partitions.
Table 5 summarizes the performance of the Naive Bayes classifier across the five runs.
Run #1, which exhibited accuracy closest to the mean, was selected for in-depth diagnostic evaluation. The confusion matrix for this representative run is shown in Table 6, which illustrates how the model classified the four disease categories based on word-occurrence probabilities.
The confusion matrix shows that the classifier performed strongly on Neoplasms (Class 1) and Cardiovascular Diseases (Class 4), with high true positive counts. Misclassification was more common for Nervous System Diseases (Class 3), likely due to overlapping clinical terminology shared with digestive and cardiovascular descriptions.
Table 7 presents the corresponding classification report for Run #1, which includes class-wise precision, recall, F1-score, and support.
These results highlight the model’s strong recall for Cardiovascular Diseases (Class 4), while Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3) exhibited the lowest recall, reflecting greater lexical overlap with other categories. Overall, MNB demonstrated stable and competitive performance, with accuracy and weighted F1-scores consistently exceeding 0.80 across all runs.
Figure 2 presents the ROC curves for each disease category using the one-vs.-rest strategy, as well as the corresponding AUC values for Run #1.
The high AUC values indicate strong discriminative ability across all classes, with Nervous System Diseases again showing the lowest separation performance.
Overall, the Multinomial Naive Bayes classifier produced balanced results with strong overall accuracy and consistent weighted averages across all metrics. Its efficiency and simplicity make it a reliable baseline model for multiclass medical note classification, particularly when computational efficiency and interpretability are desired.
5.3. Logistic Regression Model Evaluation
The Logistic Regression classifier was trained and evaluated following the repeated experimental protocol described in Section 4.6, involving five independent stratified 90–10 train–test splits. For each run, hyperparameter tuning was conducted using GridSearchCV with 5-fold cross-validation on the training subset. Across all runs, the same optimal configuration was consistently selected: C = 1, class_weight = None, l1_ratio = 0.5, max_iter = 1000, multi_class = ‘multinomial’, penalty = ‘elasticnet’, and solver = ‘saga’. This combination of elastic net regularization and the SAGA solver provided a balanced control of both L1 and L2 penalties, enabling the model to handle sparse features while maintaining good generalization. The multinomial formulation was chosen to directly optimize the cross-entropy loss for the four disease categories. This consistent selection demonstrates the stability of the logistic regression model under repeated data partitioning.
Table 8 summarizes the accuracy, macro F1-score, and weighted F1-score for each of the five runs.
Run #2, whose accuracy was closest to the overall mean, was selected for detailed diagnostic evaluation. The confusion matrix for this representative run is shown in Table 9. It summarizes the classifier’s predictions across the four disease categories on the test set.
The classifier demonstrated particularly strong performance for Neoplasms (Class 1) and Cardiovascular Diseases (Class 4), with high true positive counts and minimal misclassification. Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3) exhibited slightly lower recall values, reflecting greater semantic overlap among their textual descriptions.
The detailed performance metrics for Run #2 are shown in Table 10, including class-wise precision, recall, F1-score, and support.
The logistic regression classifier achieved high precision and recall for Neoplasms Diseases (Class 1) and Cardiovascular Diseases (Class 4), and balanced performance for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3). The overall results highlight logistic regression as the best-performing model among the four classical approaches evaluated in this study, demonstrating strong discriminative ability and consistent generalization across repeated data splits.
Figure 3 presents the ROC curves for each disease category using the one-vs.-rest strategy for Run #2.
These AUC scores indicate excellent separability across all disease categories, with the highest discrimination achieved for Cardiovascular Diseases (Class 4). The slightly lower AUC for Nervous System Diseases (Class 3) is consistent with its lower recall, reflecting overlapping vocabulary with other clinical descriptions.
Overall, the Logistic Regression model demonstrated robust and consistent performance across all categories. Its high accuracy and balanced precision-recall scores confirm its effectiveness for multiclass text classification of medical notes. The model’s interpretability and well-calibrated probability estimates further highlight its suitability for clinical NLP tasks, establishing it as the best-performing approach among the four models evaluated.
5.4. Support Vector Machine Classification Model Evaluation
The Support Vector Machine (SVM) classifier was trained and evaluated using the repeated experimental protocol described in Section 4.6, involving five independent stratified 90–10 train–test splits. For each run, hyperparameter tuning was performed using GridSearchCV with 5-fold cross-validation on the training subset. As part of hyperparameter tuning, four SVM kernels, i.e., linear, polynomial, RBF, and sigmoid, were systematically evaluated. Across all five runs, the tuning procedure consistently selected the same optimal configuration: C = 1, class_weight = None, and kernel = ‘linear’. The linear kernel consistently produced the highest cross-validation performance, while nonlinear kernels did not offer improvements. This outcome aligns with theoretical expectations, as the TF–IDF representation yields a high-dimensional sparse feature space that is particularly well suited to linear decision boundaries [37]. This stability suggests that the linear SVM converges reliably to a robust parameter setting across different partitions of the data.
Table 11 summarizes the accuracy, macro F1-score, and weighted F1-score across the five runs.
Run #3, whose accuracy was closest to the overall mean, was selected as the representative run for detailed analysis.
Table 12 presents the confusion matrix of Run #3, which summarizes the model’s predictions across the four disease categories on the test set.
As shown in the table, the SVM accurately classified Neoplasms (Class 1) and Cardiovascular Diseases (Class 4), with high true positive counts. Some confusion persisted for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3), reflecting shared terminology and overlapping clinical themes in the textual data.
The performance metrics in Table 13 reveal balanced precision, recall, and F1-scores across all categories for Run #3.
These results indicate that the linear SVM classifier performs consistently well across all disease categories, with particularly high precision and recall for Neoplasms Diseases (Class 1) and Cardiovascular Diseases (Class 4). Performance for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3) was slightly lower but still competitive, reflecting the greater semantic complexity of these categories.
Figure 4 presents the ROC curves for each disease category using the one-vs.-rest strategy for Run #3.
These high AUC values confirm that the SVM possesses strong discriminative ability across all four categories, even in cases where class boundaries are less distinct due to overlapping vocabulary.
Overall, the Support Vector Machine classifier exhibited the most stable performance across all evaluation metrics. Its inherent ability to effectively handle high-dimensional, sparse textual features rendered it particularly well-suited to this dataset. These results confirm that SVM strikes a balance between predictive accuracy and generalization, establishing it as a highly competitive model for classifying textual medical notes.
5.5. Model Comparison and Discussion
5.5.1. Overall Model Comparison
The comparative performance of the four machine learning models, Random Forest (RF), Multinomial Naive Bayes (MNB), Logistic Regression (LR), and Support Vector Machine (SVM), was assessed using the repeated experimental protocol described in Section 4.6. Mean accuracy, macro F1-score, and weighted F1-score were computed across five independent stratified 90–10 splits to provide statistically reliable estimates of model behavior.
Table 14 summarizes the average performance metrics for all models.
Logistic Regression achieved the highest overall performance, with an average accuracy of 0.8469 and the highest macro and weighted F1-scores. This indicates that the elastic net regularized multinomial logistic model was particularly effective at capturing discriminative linear relationships among the TF–IDF features. The consistent superiority of Logistic Regression across multiple data splits highlights its robustness and suitability for multiclass medical text classification.
The Support Vector Machine achieved the second highest overall performance (accuracy = 0.8199), demonstrating strong discriminative ability and stable generalization under repeated splits. The linear kernel performed particularly well given the high-dimensional and sparse nature of the text representation.
Multinomial Naive Bayes performed well (accuracy = 0.8075), exceeding the Random Forest model and achieving weighted F1 comparable to its accuracy. This model’s strength reflects the compatibility between word-frequency features and Naive Bayes’ probabilistic assumptions, making it an efficient and competitive baseline for textual classification tasks.
Random Forest achieved slightly lower performance (accuracy = 0.8002) relative to the other models. Although it performed well for Neoplasms and Cardiovascular Diseases, it showed greater confusion between Digestive and Nervous System Diseases. This may be attributed to the difficulty of modeling subtle lexical distinctions with tree-based splits, especially in high-dimensional sparse feature spaces.
Across all models, the results consistently showed stronger performance for Neoplasms Diseases (Class 1) and Cardiovascular Diseases (Class 4), where domain-specific terminology provides clearer semantic cues for classification. Misclassifications were more frequent for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3), reflecting greater lexical overlap between these categories in the medical note corpus.
5.5.2. Interpretability Analysis
Beyond quantitative performance, traditional machine learning models offer the important advantage of interpretability, which is critical in clinical applications. Random Forest provides feature-importance scores that highlight the terms most useful for distinguishing between disease categories. Logistic Regression and linear SVM yield coefficient weights that indicate the positive or negative influence of specific terms on class predictions, making it possible to identify discriminative vocabulary associated with each condition. Although the full ranked lists are not included for brevity, these interpretability properties demonstrate how classical ML models can provide transparent decision-making insights, supporting clinician trust and facilitating integration into clinical text analysis workflows.
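As a brief illustration of this property, the sketch below lists the highest-weighted TF–IDF terms per class from a fitted Logistic Regression model; the corpus is a toy stand-in and all variable names are illustrative, so the printed terms do not correspond to the study's actual vocabulary.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25   # toy stand-in corpus
labels = [1, 2, 3, 4] * 25

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(notes)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# model.coef_ holds one row of term weights per class; the largest positive weights
# indicate the vocabulary most strongly associated with that disease category.
terms = np.array(vectorizer.get_feature_names_out())
for class_label, weights in zip(model.classes_, model.coef_):
    top_terms = terms[np.argsort(weights)[::-1][:5]]
    print(class_label, ", ".join(top_terms))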
5.5.3. Qualitative Error Analysis
Examination of misclassified samples revealed that much of the confusion between Digestive System Diseases and Nervous System Diseases arises from shared terminology and general clinical descriptors. Many notes contain non-specific symptoms—such as pain, nausea, weakness, study, and treatment—that appear across multiple disease categories, making them difficult to differentiate using surface-level lexical features. In several misclassified cases, the clinical narratives involved multi-system presentations or overlapping symptom clusters, such as gastrointestinal complications secondary to neurological disorders. These findings indicate that the observed misclassification patterns reflect genuine linguistic overlap in the dataset, rather than isolated model failures, and suggest that more advanced contextual representations may help better capture subtle semantic distinctions in future work.
Overall, the repeated experimental runs confirm that Logistic Regression provides the best balance of accuracy, interpretability, and stability, followed by SVM and Naive Bayes. Random Forest remains a useful model for interpreting feature importance but is less competitive for this particular classification task.
6. Conclusions and Future Work
This study presented a comprehensive and statistically sound comparison of four traditional machine learning models—Random Forest, Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine—for multiclass classification of textual medical notes into four disease categories. Using a rigorously designed experimental protocol that incorporated five independent stratified train–test data splits, 5-fold cross-validation for hyperparameter tuning, and aggregated performance reporting across five independent runs, the evaluation provides a statistically reliable assessment of model behavior. Detailed evaluation using accuracy, macro and weighted F1-scores, confusion matrices, and multiclass ROC–AUC values provided a clear picture of each model’s strengths and limitations.
Among the evaluated models, Logistic Regression achieved the highest and most consistent performance, with an average accuracy of 0.8469, macro F1-score of 0.8339, and weighted F1-score of 0.8462, demonstrating consistent accuracy and excellent discriminative ability across all four disease categories. The Support Vector Machine ranked second, followed closely by Multinomial Naive Bayes, which exceeded expectations given its simplicity and computational efficiency. Random Forest, while competitive, showed lower average performance, particularly for disease categories with overlapping clinical terminology. Collectively, these findings highlight the effectiveness of traditional ML methods for clinical text classification and demonstrate that interpretable, computationally efficient models can still achieve strong performance in biomedical NLP settings. The interpretability of classical models, including the ability to examine feature-importance rankings and coefficient weights, further enhances their suitability for clinical text applications.
The limitations identified through our error and interpretability analyses point to several concrete directions for future research. Misclassification patterns—particularly between Digestive System Diseases and Nervous System Diseases—highlight the need for models capable of capturing deeper semantic and contextual relationships than TF–IDF features provide. Future work will explore contextualized biomedical language models, such as BioBERT and ClinicalBERT, which are specifically trained on clinical and biomedical text and may better distinguish subtle domain-specific language. In addition, the sparsity of the 27,609-token TF–IDF feature space suggests the potential benefits of dimensionality-reduction or embedding-based methods, including PCA, LSA/LSI, or neural embeddings.
To address the scarcity of labeled clinical text, we also propose leveraging weak supervision frameworks, such as Snorkel-style labeling functions, to automatically generate training labels from heuristic rules or domain knowledge. Furthermore, integrating medical ontologies (e.g., UMLS, MeSH, ICD hierarchies) could enrich textual features with structured semantic information and improve differentiation among clinically related disease categories.
Future studies should also evaluate probability calibration and uncertainty estimation, which are essential for clinical deployment where reliable decision thresholds are needed. Finally, more advanced interpretability tools, such as SHAP or LIME, may provide deeper insights into model predictions and enhance clinician trust in automated classification systems.
Overall, the study enhances methodological rigor within clinical NLP and provides a solid benchmark that supports and motivates future innovations in medical text classification.