1. Introduction
The growing digitization of healthcare systems has led to the exponential accumulation of clinical data in electronic health records (EHRs) [1,2]. Among these records, textual medical notes, such as physicians’ observations, discharge summaries, and clinical reports, contain rich information that can reflect patients’ conditions, diagnoses and clinical histories. However, the unstructured and heterogeneous nature of such textual data poses challenges for automated processing and analysis [3]. Efficiently classifying textual medical notes into relevant disease categories can facilitate improved clinical decision making and support healthcare management systems through faster retrieval and knowledge extraction.
Machine Learning (ML) approaches have shown promise in transforming unstructured clinical text into actionable insights. By learning patterns and associations from labeled datasets, ML models can help automatically categorize medical documents according to disease types. This capability is important for developing intelligent healthcare applications such as predictive diagnostic tools and population health monitoring frameworks [4]. Nevertheless, selecting an effective classification model remains a challenge, as different algorithms show varying degrees of performance depending on data characteristics, preprocessing methods, and feature representations.
Although recent advances in deep learning and transformer-based language models have gained attention, these approaches often require extensive training resources and very large datasets, which may not be feasible for many healthcare settings [5]. Consequently, traditional ML methods continue to play an important role due to their interpretability, computational efficiency, and competitive performance on high-dimensional sparse features such as TF–IDF representations [6]. Despite their relevance, the literature still lacks comprehensive and statistically rigorous benchmark studies evaluating multiple traditional ML models on multiclass medical note classification. Existing studies often rely on binary classification settings (e.g., disease vs. non-disease) or restricted, domain-specific corpora, limiting the generalizability and reliability of their findings. Given that traditional ML models differ in their learning strategies, underlying assumptions, and robustness to noise, a systematic multiclass comparison remains valuable for identifying their relative strengths and weaknesses in clinical NLP applications [6].
This research aims to address this gap by conducting a comparative study of four widely used machine learning models, namely Random Forest, Logistic Regression, Naive Bayes, and Support Vector Machine, for classifying textual medical notes. The study uses a curated dataset containing over 9000 labeled clinical notes and applies consistent preprocessing to train and fine-tune each model. Model performance is assessed using standard multiclass metrics, including accuracy, precision, recall, F1-score, and multiclass ROC-AUC, enabling a rigorous and balanced comparison.
The key contributions of this paper can be summarized as follows:
A statistically robust benchmark comparing four established machine learning algorithms on a multiclass clinical text classification task, using repeated stratified evaluation and cross-validated hyperparameter tuning.
Empirical insights into the strengths, limitations, and suitability of traditional ML approaches for disease-oriented clinical text analytics, providing a foundation for evaluating more advanced NLP methods.
By establishing a transparent and statistically reliable benchmark for traditional ML models, this study supports future research on clinical NLP, including transformer-based models, domain-specific embeddings, and automated or weakly supervised learning frameworks.
The remainder of this paper is organized as follows. Section 2 reviews related work on medical text classification and prior comparative studies. Section 3 describes the dataset and preprocessing procedures used in this study. Section 4 details the methodologies of the selected machine learning models. Section 5 presents and discusses the experimental results. Finally, Section 6 concludes the paper and outlines directions for future research.
2. Related Work
The growing availability of electronic health records (EHRs) and digitized clinical narratives has accelerated research on machine learning (ML) and natural language processing (NLP) applications in healthcare [7]. Over the past decade, numerous studies have explored how ML techniques can extract, classify, and interpret meaningful information from unstructured medical text, such as clinical notes and discharge summaries. Existing studies range from broad surveys that summarize progress and challenges in the field to empirical investigations comparing traditional supervised algorithms, and more recently, deep learning and hybrid frameworks capable of modeling complex semantic patterns.
2.1. Surveys on Machine Learning for Clinical and Medical Text
Several surveys have examined the increasing integration of ML within healthcare and clinical NLP. Spasic et al. conducted a systematic review of 110 studies applying ML to clinical text and identified text classification as the most common NLP task in healthcare [4]. The authors found that most datasets were small and institution-specific, limiting model generalizability, and emphasized the annotation bottleneck as a critical challenge for supervised learning [4]. Strategies such as active learning, distant supervision, and crowdsourcing were discussed as potential solutions to reduce manual labeling costs.
Mustafa et al. surveyed the emerging field of Automated Machine Learning (AutoML) in healthcare, highlighting its potential for clinical note analysis [8]. Although AutoML has shown promise in structured data settings, its application to unstructured medical text remains underdeveloped. The authors noted key barriers including data heterogeneity, privacy concerns, and model interpretability, concluding that an AutoML platform for clinical notes could greatly enhance scalability and reduce human effort in ML-based healthcare solutions [8].
Kino et al. provided a scoping review of ML applications to the social determinants of health (SDH) [9]. Reviewing 82 studies published before 2020, they observed that most used predictive ML models on structured survey data, with limited exploration of unstructured sources such as clinical narratives. The authors underscored the broader expansion of ML into health research and emphasized the need for interpretable, transparent, and well-validated approaches when applying ML to clinical textual data [9].
Kadhim offered a comprehensive overview of supervised ML techniques for text classification, detailing the standard pipeline of data preprocessing, feature extraction, and model evaluation [10]. The review compared algorithms such as Naive Bayes (NB), Support Vector Machines (SVM), and k-Nearest Neighbors (k-NN), noting that TF–IDF weighting schemes significantly enhance classification accuracy.
These reviews establish the theoretical and methodological foundation for applying ML to unstructured health text and motivate the empirical comparisons undertaken in the present study.
2.2. Traditional Machine Learning Approaches for Medical Text Classification
A broad range of studies demonstrates that traditional supervised learning remains effective for medical text classification and disease prediction. Weng et al. developed a machine learning–based NLP framework for medical subdomain classification of clinical notes [11]. Using cTAKES and UMLS features, the authors compared SVMs with convolutional recurrent neural networks (RNNs) and showed that SVMs offered comparable accuracy with better interpretability, validating their utility for cross-institutional applications [11].
López-Úbeda et al. proposed an ML-based system for automatic classification of radiological protocols using a corpus of 700,000 CT and MRI reports [12]. Several NLP-driven classifiers were evaluated, including SVM, Random Forest, neural networks, and transfer-learning approaches. The system achieved high accuracy and has since been implemented as a clinical decision-support tool, demonstrating the practical value of ML for workflow optimization.
Tiwari et al. examined multiclass disease prediction using Random Forest, SVM, Naive Bayes, and Decision Tree algorithms on a symptom dataset [13]. Both Random Forest and Decision Tree achieved the highest accuracy (99%), whereas Naive Bayes yielded the lowest (86%), confirming the effectiveness of ensemble and tree-based methods for healthcare classification tasks.
Sung et al. applied supervised ML and text mining to automated phenotyping of ischemic stroke using 4640 patient EMRs [14]. The integration of structured variables with textual data improved classification, and decomposing the multiclass problem into binary subtasks further enhanced performance. Their findings highlight the potential of ML to replace manual annotation in disease phenotyping.
Rabby and Berka investigated multi-class classification of COVID-19 biomedical research papers using ten ML algorithms and eleven feature configurations [15]. They found that TF–IDF features of abstracts yielded the highest accuracy, with Random Forest and BERT models performing best, demonstrating the versatility of traditional ML for biomedical document classification.
Gupta et al. developed an NLP pipeline to automatically identify immune-related adverse events (irAEs) from unstructured oncology notes [16]. Employing keyword filtering, TF–IDF, and BioWordVec embeddings as input for Logistic Regression, SVM, Random Forest, CNN, and Bi-LSTM models, they achieved an F1 score of 0.75 and AUC of 0.85, demonstrating that classical ML methods augmented with embeddings can automate complex clinical annotation tasks.
Gao et al. introduced KeyClass, a weakly supervised framework for assigning ICD-9 codes to unstructured clinical notes without manual labeling [17]. Tested on the MIMIC-III dataset, KeyClass achieved performance comparable to supervised models trained on thousands of labeled samples, underscoring the promise of weak supervision for scalable medical text classification.
Lenivtceva et al. explored multi-label classification of 11,671 Russian medical notes [18]. The authors compared several algorithms and proposed classifier-chain ensembles to capture inter-label dependencies, achieving notable performance gains and illustrating the strength of ensemble strategies for complex medical text tasks.
These studies confirm that traditional ML models, particularly SVM, Logistic Regression, and Random Forest, offer robust, interpretable, and computationally efficient baselines for medical text classification.
2.3. Advances and Extensions Using Deep Learning and Hybrid Methods
While traditional algorithms remain effective, recent research increasingly applies deep learning and hybrid NLP approaches to capture the semantic richness of clinical text. da Silva et al. evaluated machine learning and deep learning models for oncology clinical notes, comparing Logistic Regression, Random Forest, Decision Tree, k-NN, Multilayer Perceptron (MLP), and LSTM networks on 3308 documents [19]. Preprocessing raised mean accuracy from 26% to 93.9%, with the MLP model achieving the best F1 score (93.6%), demonstrating the influence of text normalization on performance.
Goodrum et al. developed a framework to classify scanned EHR documents into clinically relevant and non-relevant categories using OCR-extracted text [20]. A ClinicalBERT model achieved an accuracy of 0.973, highlighting the power of transformer architectures for document-level clinical classification.
Lu et al. compared seven deep learning models including CNN, RNN, GRU, LSTM, Bi-LSTM, Transformer encoders, and BERT, for discharge note classification under varying class-imbalance conditions [21]. Transformer encoders yielded the best results overall, whereas CNNs achieved similar accuracy with shorter training time, suggesting a practical balance between computational efficiency and predictive accuracy.
These studies reflect a gradual evolution from traditional ML pipelines toward deep neural and hybrid models that exploit pre-trained embeddings and transformer architectures to enhance semantic understanding. However, they also reveal that well-tuned traditional algorithms can offer comparable performance with greater interpretability and lower computational demands, qualities that are valuable in clinical settings [19].
Across surveys, traditional models, and modern deep learning methods, the literature demonstrates the maturity and adaptability of ML for clinical and medical text classification. The challenges, such as data sparsity, annotation costs, and the trade-off between interpretability and complexity, continue to shape the field. Building on these insights, the present study contributes by conducting a comparative evaluation of four traditional ML algorithms for multiclass classification of textual medical notes, thereby building empirical benchmarks to guide future research that may integrate deep learning, weak supervision, or AutoML techniques for enhanced performance.
4. Methodologies
This section presents the machine learning algorithms used to classify textual medical notes into disease categories. To identify the most effective traditional learning algorithm for this task, four widely used classifiers, namely Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), and Support Vector Machine (SVM), were implemented and compared under consistent preprocessing conditions. Each model represents a distinct learning paradigm: probabilistic reasoning in NB, linear discriminative modeling in LR, margin maximization in SVM, and ensemble-based decision aggregation in RF. All models were trained using the same vectorized representation of the preprocessed medical notes and optimized through hyperparameter tuning to achieve the best classification performance. The following subsections describe the theoretical foundations, parameter configurations, and implementation details of each model.
4.1. Random Forest Classification
The Random Forest (RF) algorithm is an ensemble-based learning method that constructs multiple decision trees during training and outputs the class predicted by the majority of trees [23]. By combining bootstrap aggregation (bagging) and random feature selection, Random Forest reduces overfitting while maintaining strong predictive accuracy [24]. Each tree is trained on a random subset of samples, and at each node split, a random subset of features is considered. This dual randomness enhances model diversity and stability, making Random Forest suited for high-dimensional and sparse text datasets [25].
In this study, the Random Forest classifier was applied to the vectorized medical notes to classify documents into four disease categories. The feature space consisted of 27,609 unique tokens derived from the preprocessed corpus. The algorithm recursively partitions the feature space to minimize impurity, measured using the Gini index. Its ability to handle large vocabularies without explicit feature selection makes it an appropriate choice for textual data.
Hyperparameters were optimized using a grid search with 5-fold cross-validation to achieve a balance between accuracy and generalization. The parameters tuned included the number of trees, the maximum tree depth, the minimum number of samples required at a leaf node, and the minimum number of samples required for node splits. Model performance was assessed using the macro-averaged F1-score, which provides balanced sensitivity across all disease classes.
Random Forest was selected for its robustness to noise, interpretability of feature importance, and capacity to model nonlinear relationships in textual data. Its ensemble structure also identifies key discriminative medical terms, supporting interpretability and transparency in clinical applications.
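To make this configuration concrete, the following is a minimal scikit-learn sketch of the Random Forest tuning step. The toy corpus, variable names, and candidate grid values are illustrative assumptions for demonstration, not the exact data or grid used in this study.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the preprocessed medical notes
# (1 = Neoplasms, 2 = Digestive, 3 = Nervous, 4 = Cardiovascular).
notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25
labels = [1, 2, 3, 4] * 25

X = TfidfVectorizer().fit_transform(notes)

# Candidate values are examples only; the tuned parameters are listed in Section 4.6.
param_grid = {"n_estimators": [200, 400],
              "max_depth": [None, 50],
              "max_features": ["sqrt"],
              "min_samples_split": [2, 5],
              "min_samples_leaf": [1, 4]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="f1_macro", n_jobs=-1)
search.fit(X, labels)
print(search.best_params_, round(search.best_score_, 4))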
4.2. Naive Bayes
The Naive Bayes (NB) classifier is a probabilistic model based on Bayes’ Theorem, which assumes conditional independence among features given the class label [26,27]. Despite this simplifying assumption, it remains effective and computationally efficient for text classification because it models word-occurrence probabilities directly [28]. In this study, NB was applied to estimate the likelihood that a medical note belongs to one of four disease categories using the distribution of words within each note. Given a document $d$ and a class $c$, the posterior probability is expressed as:

$$P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(w_i \mid c)$$

where $P(c)$ is the prior probability of a class and $P(w_i \mid c)$ represents the likelihood of observing word $w_i$ in that class [26]. The class with the highest posterior probability is then assigned as the predicted label.
The Multinomial Naive Bayes (MNB) variant was employed, as it is well suited for count-based representations such as term frequency or TF–IDF vectors derived from the medical notes. Additive Laplace smoothing was used to handle zero probabilities for rare or unseen words, improving generalization. To optimize the smoothing parameter $\alpha$, a grid search with 5-fold cross-validation was conducted over a range of candidate values, using the macro-averaged F1-score as the selection metric.
Naive Bayes was chosen for its simplicity, speed, and interpretability, which make it a strong baseline model for medical text classification.
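As an illustration, a minimal scikit-learn sketch of the MNB tuning step is shown below. The toy corpus and the candidate smoothing values are assumptions for demonstration rather than the exact grid searched in this study.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25   # toy stand-in corpus
labels = [1, 2, 3, 4] * 25

X = TfidfVectorizer().fit_transform(notes)                      # non-negative TF-IDF values suit MNB

# alpha is the additive (Laplace) smoothing coefficient tuned in this study.
grid = GridSearchCV(MultinomialNB(),
                    {"alpha": [0.01, 0.1, 0.5, 1.0], "fit_prior": [True, False]},
                    cv=5, scoring="f1_macro")
grid.fit(X, labels)
print(grid.best_params_)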
4.3. Logistic Regression
The Logistic Regression (LR) model is a linear classifier that estimates the probability of a document belonging to a specific class using a logistic (sigmoid) function [29]. Logistic regression is a discriminative approach that directly learns the decision boundary between classes by maximizing the likelihood of the observed labels [30]. It is well suited for high-dimensional, sparse datasets like textual representations, where individual tokens serve as features. Given a document represented by a feature vector $\mathbf{x}$, the probability that it belongs to class $c$ is expressed as:

$$P(y = c \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}}$$

where $\mathbf{w}$ denotes the model weights and $b$ is the bias term [30]. For multiclass problems, a softmax extension is applied to ensure all class probabilities sum to one.
In this study, logistic regression was implemented using the one-vs.-rest (OvR) strategy, where an independent binary classifier was trained for each disease class. The input features were derived from the TF–IDF document–term matrix generated during preprocessing. Model parameters were optimized through L2 (ridge) regularization to control overfitting, and hyperparameter tuning was performed using grid search with 5-fold cross-validation. Regularization type and solver choice were further optimized by testing L1, L2, and elasticnet penalties with solvers including ‘lbfgs’, ‘liblinear’, and ‘saga’. Model performance was evaluated using the macro averaged F1-score.
Logistic Regression was selected for its interpretability, scalability, and ability to produce well-calibrated probability estimates, making it a reliable model for textual medical note classification.
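The sketch below illustrates this tuning setup in scikit-learn; only the saga solver is shown for brevity, since it supports all three penalties on sparse data. The toy corpus and grid values are illustrative assumptions, and parameter combinations are grouped so that l1_ratio is only passed with the elastic net penalty.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25   # toy stand-in corpus
labels = [1, 2, 3, 4] * 25

X = TfidfVectorizer().fit_transform(notes)

# Grouped grids avoid invalid parameter combinations during the search.
param_grid = [
    {"penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
    {"penalty": ["elasticnet"], "C": [0.1, 1, 10], "l1_ratio": [0.25, 0.5, 0.75]},
]
grid = GridSearchCV(LogisticRegression(solver="saga", max_iter=1000),
                    param_grid, cv=5, scoring="f1_macro")
grid.fit(X, labels)
print(grid.best_params_)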
4.4. Support Vector Machine Classification
Support Vector Machine (SVM) is a discriminative learning algorithm that seeks an optimal separating hyperplane to maximize the margin between data points of different classes [31]. It is effective for text classification, where data are typically high-dimensional and sparse, conditions under which linear SVMs perform well [32]. By identifying a subset of critical data points, known as support vectors, the model defines the decision boundary that best separates classes in the feature space. Given a training set of labeled samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i$ represents the feature vector and $y_i$ denotes the class label, the SVM optimization problem can be formulated as:

$$\min_{\mathbf{w},\, b,\, \xi} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0$$

where $\mathbf{w}$ is the weight vector, $b$ is the bias term, $\xi_i$ are slack variables, and $C$ is the regularization parameter balancing margin maximization and classification error [33].
In this study, a linear SVM was employed due to its scalability and strong performance on large text corpora. The model was trained on a TF–IDF document–term matrix containing 27,609 features. Hyperparameters, including the regularization parameter C and kernel type, were tuned using grid search with 5-fold cross-validation. Linear, polynomial, radial basis function (RBF), and sigmoid kernels were tested, with performance evaluated using the macro averaged F1-score to account for class imbalance. The final configuration, based on the highest cross-validated F1-score, achieved a balanced trade-off between accuracy and generalization, confirming SVM’s robustness and interpretability for medical text classification.
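A compact sketch of this kernel search is given below; the corpus and candidate values are illustrative assumptions. Kernel-specific parameters are grouped so that degree applies only to the polynomial kernel and gamma only to the RBF and sigmoid kernels.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25   # toy stand-in corpus
labels = [1, 2, 3, 4] * 25

X = TfidfVectorizer().fit_transform(notes)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "C": [1, 10], "degree": [2, 3], "gamma": ["scale"]},
    {"kernel": ["rbf", "sigmoid"], "C": [1, 10], "gamma": ["scale", 0.1]},
]
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="f1_macro")
grid.fit(X, labels)
print(grid.best_params_)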
4.5. Evaluation Metrics
To evaluate the performance of the four machine learning models, a set of widely accepted metrics was employed. These metrics provide complementary perspectives on predictive accuracy, robustness across classes, and reliability when applied to multiclass textual data. The primary metrics used in this study include accuracy, precision, recall, F1-score, and ROC-AUC, each computed for individual class labels and then aggregated to summarize overall model performance.
For a given class label $c \in \{1, \dots, C\}$, where $C$ represents the total number of categories, let $TP_c$, $FP_c$, and $FN_c$ denote true positives for class $c$, false positives for class $c$, and false negatives for class $c$, respectively. The following definitions apply:

Precision for class $c$ quantifies the proportion of correctly predicted samples of class $c$ among all samples predicted as class $c$:

$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}$$

Recall for class $c$ (or sensitivity) measures the proportion of actual samples of class $c$ that were correctly identified:

$$\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$

F1-score for class $c$ is the harmonic mean of precision and recall:

$$\text{F1}_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$

Accuracy measures the overall proportion of correctly classified samples across all classes:

$$\text{Accuracy} = \frac{\sum_{c=1}^{C} TP_c}{N}$$

where $N$ is the total number of evaluated samples.
Since this study involves four disease categories, macro averaged and weighted averaged versions of these metrics were used. Macro-averaging treats all classes equally, regardless of their size, while weighted averaging assigns weights proportional to the number of instances per class, providing a more realistic assessment under class imbalance.
In addition to these numerical metrics, confusion matrices were analyzed to visualize class-level performance, identify common misclassifications, and detect semantic overlaps among disease categories. A confusion matrix provides a detailed view of a classifier’s performance by comparing predicted versus actual class labels [34]. The diagonal elements represent correctly classified medical notes, while off-diagonal values indicate misclassifications [34].
To further characterize the discriminative behavior of the models, Receiver Operating Characteristic (ROC) analysis was conducted. For multiclass problems, ROC curves were generated using the one-vs.-rest (OvR) strategy, in which each class is considered the positive class in turn [35]. For each classifier, class-wise ROC curves and their associated area under the curve (AUC$_c$) values were computed. A macro-averaged AUC was obtained by averaging over all classes.
Together, these evaluation measures provide a comprehensive and balanced assessment of model performance, enabling reliable comparison among the Random Forest, Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine classifiers.
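The sketch below shows how these quantities can be computed with scikit-learn; the arrays are small illustrative stand-ins for the test-set labels, predictions, and class-probability scores (e.g., from predict_proba), not results from this study.

import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# Illustrative outputs for a 4-class problem (classes labeled 1-4).
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([1, 1, 2, 3, 3, 2, 4, 4])
y_score = np.array([[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1],
                    [0.1, 0.6, 0.2, 0.1], [0.1, 0.2, 0.6, 0.1],
                    [0.1, 0.1, 0.7, 0.1], [0.1, 0.5, 0.3, 0.1],
                    [0.1, 0.1, 0.1, 0.7], [0.2, 0.1, 0.1, 0.6]])

print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=4))   # per-class plus macro/weighted averages
print(confusion_matrix(y_true, y_pred))                  # rows: actual classes, columns: predicted
print(roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))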
4.6. Model Training and Validation
To ensure a rigorous and unbiased evaluation, all models were trained and validated using a repeated experimental protocol consisting of five independent stratified 90–10 train–test splits. For each run, 90% of the data was used for model training and hyperparameter tuning, while the remaining 10% was reserved exclusively for evaluation. Stratification preserved the class distribution across splits, providing consistent representation of disease categories in each subset.
Within each run, hyperparameter optimization was performed using scikit-learn’s GridSearchCV with 5-fold cross-validation applied only to the training portion. This ensured that the choice of hyperparameters reflected performance averaged across multiple folds rather than a single validation set, reducing variance and improving the reliability of model selection. After tuning, each model was retrained on the full training split and evaluated on the corresponding test subset.
The hyperparameter grids were intentionally designed to be non-trivial and to include parameters known to affect model complexity, regularization, and generalization behavior, particularly in high-dimensional sparse text settings.
For Random Forest, the parameter grid included:
n_estimators
max_depth
max_features
min_samples_split
min_samples_leaf
These parameters influence ensemble stability, bias–variance trade-offs, and resistance to noise.
For Multinomial Naive Bayes, tuning focused on:
Smoothing coefficient alpha
fit_prior
These parameters affect the handling of rare terms in TF–IDF distributions.
For Logistic Regression, the grid explored:
Regularization strength C
penalty
l1_ratio
solver
These hyperparameters govern coefficient sparsity, optimization behavior, and decision-boundary flexibility.
For the Support Vector Machine, tuning considered:
kernel
C
degree (for polynomial kernels)
gamma (for RBF and sigmoid kernels)
These parameters control the geometric complexity of the decision boundary and the model’s ability to capture linear versus nonlinear relationships.
The parameter ranges were selected based on prior text-classification research, empirical guidance from the scikit-learn documentation, and computational feasibility given the high dimensionality of the TF–IDF feature space. The consistency of optimal configurations across runs demonstrates that the search space was appropriate and sufficiently expressive for this task.
Model evaluation metrics included accuracy, macro and weighted F1-scores, and multiclass ROC–AUC. Final performance results represent the mean values across the five runs, while confusion matrices, classification reports, and ROC curves are shown for the run closest to the mean performance.
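For reproducibility, the sketch below condenses this protocol for a single model (Logistic Regression); the toy corpus, grid values, and random seeds are illustrative assumptions, and the same loop applies to the other three classifiers.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 50   # toy stand-in corpus
labels = [1, 2, 3, 4] * 50

results = []
for run in range(5):                                            # five stratified 90-10 splits
    train_txt, test_txt, y_train, y_test = train_test_split(
        notes, labels, test_size=0.10, stratify=labels, random_state=run)

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_txt)
    X_test = vectorizer.transform(test_txt)

    # 5-fold grid search restricted to the training portion; with refit=True (default),
    # the best configuration is retrained on the full training split before evaluation.
    grid = GridSearchCV(LogisticRegression(solver="saga", max_iter=1000),
                        {"C": [0.1, 1, 10]}, cv=5, scoring="f1_macro")
    grid.fit(X_train, y_train)

    y_pred = grid.predict(X_test)
    results.append((accuracy_score(y_test, y_pred),
                    f1_score(y_test, y_pred, average="macro"),
                    f1_score(y_test, y_pred, average="weighted")))

mean_acc, mean_macro, mean_weighted = (sum(col) / len(results) for col in zip(*results))
print(mean_acc, mean_macro, mean_weighted)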
5. Experimental Results and Analysis
This section presents and discusses the experimental results obtained from the comparative evaluation of the four machine learning models (Random Forest, Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine) applied to classifying textual medical notes into four disease categories. All experiments were conducted using scikit-learn version 1.7 in Python 3.13, which provides reliable and standardized implementations of machine learning algorithms and evaluation procedures [36]. The experiments were conducted on the preprocessed dataset containing 9633 labeled records. Each model was trained and optimized using the best hyperparameter configurations determined through cross-validation, as detailed in Section 4.6.
The evaluation focuses on multiple performance metrics, including accuracy, class-wise precision, recall, and F1-score, along with macro- and weighted-averaged forms that account for imbalanced class distributions. In addition, Receiver Operating Characteristic (ROC) curves were generated using the one-vs.-rest strategy to evaluate the discriminative ability of the models across all classes, and confusion matrices were used to visualize misclassification patterns. The following subsections present detailed results for each model, followed by a comparative analysis highlighting their relative strengths, weaknesses, and suitability for medical text classification tasks.
5.1. Random Forest Classification Model Evaluation
The Random Forest classifier was trained and evaluated using the repeated experimental procedure described in Section 4.6, consisting of five independent stratified 90–10 train–test splits. For each run, hyperparameter tuning was carried out on the training portion using GridSearchCV with 5-fold cross-validation. Across all five runs, the tuning process consistently selected the same optimal hyperparameter configuration: bootstrap = True, max_depth = None, max_features = ‘sqrt’, min_samples_leaf = 4, min_samples_split = 2, and n_estimators = 400. This configuration allows the model to grow deep trees with random feature subsets while maintaining generalization through leaf-size regularization. This stability indicates that the Random Forest model converges to a reliable configuration under multiple data partitions.
Table 2 summarizes the accuracy, macro F1-score, and weighted F1-score for each of the five runs.
To provide detailed diagnostic insight, Run #5 (where # denotes the run number), whose accuracy was closest to the mean, was chosen for subsequent analysis.
Table 3 presents the confusion matrix for this representative run. It shows how the Random Forest model’s predictions are distributed across the four disease categories on the test set.
The confusion matrix shows that the classifier performed strongly on Neoplasms (Class 1) and Cardiovascular Diseases (Class 4), with high true positive counts and relatively few misclassifications. More confusion is observed for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3), where overlapping terminology and shared clinical descriptors likely contributed to errors.
The corresponding classification report for Run #5 is shown in Table 4, which includes class-wise precision, recall, F1-score, and support.
Among all categories, Neoplasms (Class 1) and Cardiovascular Diseases (Class 4) achieved the highest recall values (0.8864 and 0.9082), showing that the model effectively identified these disease types. In contrast, Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3) had lower recall values (0.6309 and 0.6321), reflecting greater confusion with other classes.
To further assess the discriminative capability of the Random Forest classifier, ROC curves for Run #5 were generated for each class using the one-vs.-rest strategy.
Figure 1 displays the ROC curves for all four disease categories.
These AUC results indicate separation between positive and negative classes across all disease categories. The slightly lower AUC for Class 3 reflects the higher misclassification rate observed in the confusion matrix and classification report, consistent with textual overlap between neurological and digestive case descriptions.
Overall, the Random Forest classifier performed competitively across all metrics, establishing a strong baseline for textual medical note classification.
5.2. Naive Bayes Model Evaluation
The Multinomial Naive Bayes (MNB) classifier was evaluated using the repeated experimental procedure outlined in Section 4.6, consisting of five independent stratified 90–10 train–test splits. For each run, hyperparameter tuning was performed on the training subset using GridSearchCV with 5-fold cross-validation. Across all runs, the tuning procedure consistently selected the same optimal hyperparameters, namely the same value of the smoothing coefficient alpha together with fit_prior = False. This stability indicates that the MNB model converges reliably to a robust configuration across varying data partitions.
Table 5 summarizes the performance of the Naive Bayes classifier across the five runs.
Run #1, which exhibited accuracy closest to the mean, was selected for in-depth diagnostic evaluation. The confusion matrix for this representative run is shown in Table 6, which illustrates how the model classified the four disease categories based on word-occurrence probabilities.
The confusion matrix shows that the classifier performed strongly on Neoplasms (Class 1) and Cardiovascular Diseases (Class 4), with high true positive counts. Misclassification was more common for Nervous System Diseases (Class 3), likely due to overlapping clinical terminology shared with digestive and cardiovascular descriptions.
Table 7 presents the corresponding classification report for Run #1, which includes class-wise precision, recall, F1-score, and support.
These results highlight the model’s strong recall for Cardiovascular Diseases (Class 4), while Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3) exhibited the lowest recall, reflecting greater lexical overlap with other categories. Overall, MNB demonstrated stable and competitive performance, with accuracy and weighted F1-scores consistently exceeding 0.80 across all runs.
Figure 2 presents the ROC curves for each disease category using the one-vs.-rest strategy, as well as the corresponding AUC values for Run #1.
The high AUC values indicate strong discriminative ability across all classes, with Nervous System Diseases again showing the lowest separation performance.
Overall, the Multinomial Naive Bayes classifier produced balanced results with strong overall accuracy and consistent weighted averages across all metrics. Its efficiency and simplicity make it a reliable baseline model for multiclass medical note classification, particularly when computational efficiency and interpretability are desired.
5.3. Logistic Regression Model Evaluation
The Logistic Regression classifier was trained and evaluated following the repeated experimental protocol described in Section 4.6, involving five independent stratified 90–10 train–test splits. For each run, hyperparameter tuning was conducted using GridSearchCV with 5-fold cross-validation on the training subset. Across all runs, the same optimal configuration was consistently selected: C = 1, class_weight = None, l1_ratio = 0.5, max_iter = 1000, multi_class = ‘multinomial’, penalty = ‘elasticnet’, and solver = ‘saga’. This combination of elastic net regularization and the SAGA solver provided a balanced control of both L1 and L2 penalties, enabling the model to handle sparse features while maintaining good generalization. The multinomial formulation was chosen to directly optimize the cross-entropy loss for the four disease categories. This consistent selection demonstrates the stability of the logistic regression model under repeated data partitioning.
Table 8 summarizes the accuracy, macro F1-score, and weighted F1-score for each of the five runs.
Run #2, whose accuracy was closest to the overall mean, was selected for detailed diagnostic evaluation. The confusion matrix for this representative run is shown in Table 9. It summarizes the classifier’s predictions across the four disease categories on the test set.
The classifier demonstrated particularly strong performance for Neoplasms (Class 1) and Cardiovascular Diseases (Class 4), with high true positive counts and minimal misclassification. Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3) exhibited slightly lower recall values, reflecting greater semantic overlap among their textual descriptions.
The detailed performance metrics for Run #2 are shown in Table 10, including class-wise precision, recall, F1-score, and support.
The logistic regression classifier achieved high precision and recall for Neoplasms Diseases (Class 1) and Cardiovascular Diseases (Class 4), and balanced performance for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3). The overall results highlight logistic regression as the best-performing model among the four classical approaches evaluated in this study, demonstrating strong discriminative ability and consistent generalization across repeated data splits.
Figure 3 presents the ROC curves for each disease category using the one-vs.-rest strategy for Run #2.
These AUC scores indicate excellent separability across all disease categories, with the highest discrimination achieved for Cardiovascular Diseases (Class 4). The slightly lower AUC for Nervous System Diseases (Class 3) is consistent with its lower recall, reflecting overlapping vocabulary with other clinical descriptions.
Overall, the Logistic Regression model demonstrated robust and consistent performance across all categories. Its high accuracy and balanced precision-recall scores confirm its effectiveness for multiclass text classification of medical notes. The model’s interpretability and well-calibrated probability estimates further highlight its suitability for clinical NLP tasks, establishing it as the best-performing approach among the four models evaluated.
5.4. Support Vector Machine Classification Model Evaluation
The Support Vector Machine (SVM) classifier was trained and evaluated using the repeated experimental protocol described in Section 4.6, involving five independent stratified 90–10 train–test splits. For each run, hyperparameter tuning was performed using GridSearchCV with 5-fold cross-validation on the training subset. As part of hyperparameter tuning, four SVM kernels, i.e., linear, polynomial, RBF, and sigmoid, were systematically evaluated. Across all five runs, the tuning procedure consistently selected the same optimal configuration: C = 1, class_weight = None, and kernel = ‘linear’. The linear kernel consistently produced the highest cross-validation performance, while nonlinear kernels did not offer improvements. This outcome aligns with theoretical expectations, as the TF–IDF representation yields a high-dimensional sparse feature space that is particularly well suited to linear decision boundaries [37]. This stability suggests that the linear SVM converges reliably to a robust parameter setting across different partitions of the data.
Table 11 summarizes the accuracy, macro F1-score, and weighted F1-score across the five runs.
Run #3, whose accuracy was closest to the overall mean, was selected as the representative run for detailed analysis.
Table 12 presents the confusion matrix of Run #3, which summarizes the model’s predictions across the four disease categories on the test set.
As shown in the table, the SVM accurately classified Neoplasms (Class 1) and Cardiovascular Diseases (Class 4), with high true positive counts. Some confusion persisted for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3), reflecting shared terminology and overlapping clinical themes in the textual data.
The performance metrics in Table 13 reveal balanced precision, recall, and F1-scores across all categories for Run #3.
These results indicate that the linear SVM classifier performs consistently well across all disease categories, with particularly high precision and recall for Neoplasms Diseases (Class 1) and Cardiovascular Diseases (Class 4). Performance for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3) was slightly lower but still competitive, reflecting the greater semantic complexity of these categories.
Figure 4 presents the ROC curves for each disease category using the one-vs.-rest strategy for Run #3.
These high AUC values confirm that the SVM possesses strong discriminative ability across all four categories, even in cases where class boundaries are less distinct due to overlapping vocabulary.
Overall, the Support Vector Machine classifier exhibited the most stable performance across all evaluation metrics. Its inherent ability to effectively handle high-dimensional, sparse textual features rendered it particularly well-suited to this dataset. These results confirm that SVM strikes a balance between predictive accuracy and generalization, establishing it as a highly competitive model for classifying textual medical notes.
5.5. Model Comparison and Discussion
5.5.1. Overall Model Comparison
The comparative performance of the four machine learning models, Random Forest (RF), Multinomial Naive Bayes (MNB), Logistic Regression (LR), and Support Vector Machine (SVM), was assessed using the repeated experimental protocol described in Section 4.6. Mean accuracy, macro F1-score, and weighted F1-score were computed across five independent stratified 90–10 splits to provide statistically reliable estimates of model behavior.
Table 14 summarizes the average performance metrics for all models.
Logistic Regression achieved the highest overall performance, with an average accuracy of 0.8469 and the highest macro and weighted F1-scores. This indicates that the elastic net regularized multinomial logistic model was particularly effective at capturing discriminative linear relationships among the TF–IDF features. The consistent superiority of Logistic Regression across multiple data splits highlights its robustness and suitability for multiclass medical text classification.
The Support Vector Machine achieved the second highest overall performance (accuracy = 0.8199), demonstrating strong discriminative ability and stable generalization under repeated splits. The linear kernel performed particularly well given the high-dimensional and sparse nature of the text representation.
Multinomial Naive Bayes performed well (accuracy = 0.8075), exceeding the Random Forest model and achieving weighted F1 comparable to its accuracy. This model’s strength reflects the compatibility between word-frequency features and Naive Bayes’ probabilistic assumptions, making it an efficient and competitive baseline for textual classification tasks.
Random Forest achieved slightly lower performance (accuracy = 0.8002) relative to the other models. Although it performed well for Neoplasms and Cardiovascular Diseases, it showed greater confusion between Digestive and Nervous System Diseases. This may be attributed to the difficulty of modeling subtle lexical distinctions with tree-based splits, especially in high-dimensional sparse feature spaces.
Across all models, the results consistently showed stronger performance for Neoplasms Diseases (Class 1) and Cardiovascular Diseases (Class 4), where domain-specific terminology provides clearer semantic cues for classification. Misclassifications were more frequent for Digestive System Diseases (Class 2) and Nervous System Diseases (Class 3), reflecting greater lexical overlap between these categories in the medical note corpus.
5.5.2. Interpretability Analysis
Beyond quantitative performance, traditional machine learning models offer the important advantage of interpretability, which is critical in clinical applications. Random Forest provides feature-importance scores that highlight the terms most useful for distinguishing between disease categories. Logistic Regression and linear SVM yield coefficient weights that indicate the positive or negative influence of specific terms on class predictions, making it possible to identify discriminative vocabulary associated with each condition. Although the full ranked lists are not included for brevity, these interpretability properties demonstrate how classical ML models can provide transparent decision-making insights, supporting clinician trust and facilitating integration into clinical text analysis workflows.
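As a brief illustration of this property, the sketch below lists the highest-weighted TF–IDF terms per class from a fitted Logistic Regression model; the corpus is a toy stand-in and all variable names are illustrative, so the printed terms do not correspond to the study's actual vocabulary.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

notes = ["tumor mass biopsy", "abdominal cramps ulcer",
         "seizure tremor headache", "chest pain angina"] * 25   # toy stand-in corpus
labels = [1, 2, 3, 4] * 25

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(notes)
model = LogisticRegression(max_iter=1000).fit(X, labels)

# model.coef_ holds one row of term weights per class; the largest positive weights
# indicate the vocabulary most strongly associated with that disease category.
terms = np.array(vectorizer.get_feature_names_out())
for class_label, weights in zip(model.classes_, model.coef_):
    top_terms = terms[np.argsort(weights)[::-1][:5]]
    print(class_label, ", ".join(top_terms))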
5.5.3. Qualitative Error Analysis
Examination of misclassified samples revealed that much of the confusion between Digestive System Diseases and Nervous System Diseases arises from shared terminology and general clinical descriptors. Many notes contain non-specific symptoms—such as pain, nausea, weakness, study, and treatment—that appear across multiple disease categories, making them difficult to differentiate using surface-level lexical features. In several misclassified cases, the clinical narratives involved multi-system presentations or overlapping symptom clusters, such as gastrointestinal complications secondary to neurological disorders. These findings indicate that the observed misclassification patterns reflect genuine linguistic overlap in the dataset, rather than isolated model failures, and suggest that more advanced contextual representations may help better capture subtle semantic distinctions in future work.
Overall, the repeated experimental runs confirm that Logistic Regression provides the best balance of accuracy, interpretability, and stability, followed by SVM and Naive Bayes. Random Forest remains a useful model for interpreting feature importance but is less competitive for this particular classification task.
6. Conclusions and Future Work
This study presented a comprehensive and statistically sound comparison of four traditional machine learning models—Random Forest, Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine—for multiclass classification of textual medical notes into four disease categories. Using a rigorously designed experimental protocol that incorporated five independent stratified train–test data splits, 5-fold cross-validation for hyperparameter tuning, and aggregated performance reporting across five independent runs, the evaluation provides a statistically reliable assessment of model behavior. Detailed evaluation using accuracy, macro and weighted F1-scores, confusion matrices, and multiclass ROC–AUC values provided a clear picture of each model’s strengths and limitations.
Among the evaluated models, Logistic Regression achieved the highest and most consistent performance, with an average accuracy of 0.8469, macro F1-score of 0.8339, and weighted F1-score of 0.8462, demonstrating consistent accuracy and excellent discriminative ability across all four disease categories. The Support Vector Machine ranked second, followed closely by Multinomial Naive Bayes, which exceeded expectations given its simplicity and computational efficiency. Random Forest, while competitive, showed lower average performance, particularly for disease categories with overlapping clinical terminology. Collectively, these findings highlight the effectiveness of traditional ML methods for clinical text classification and demonstrate that interpretable, computationally efficient models can still achieve strong performance in biomedical NLP settings. The interpretability of classical models, including the ability to examine feature-importance rankings and coefficient weights, further enhances their suitability for clinical text applications.
The limitations identified through our error and interpretability analyses point to several concrete directions for future research. Misclassification patterns—particularly between Digestive System Diseases and Nervous System Diseases—highlight the need for models capable of capturing deeper semantic and contextual relationships than TF–IDF features provide. Future work will explore contextualized biomedical language models, such as BioBERT and ClinicalBERT, which are specifically trained on clinical and biomedical text and may better distinguish subtle domain-specific language. In addition, the sparsity of the 27,609-token TF–IDF feature space suggests the potential benefits of dimensionality-reduction or embedding-based methods, including PCA, LSA/LSI, or neural embeddings.
To address the scarcity of labeled clinical text, we also propose leveraging weak supervision frameworks, such as Snorkel-style labeling functions, to automatically generate training labels from heuristic rules or domain knowledge. Furthermore, integrating medical ontologies (e.g., UMLS, MeSH, ICD hierarchies) could enrich textual features with structured semantic information and improve differentiation among clinically related disease categories.
Future studies should also evaluate probability calibration and uncertainty estimation, which are essential for clinical deployment where reliable decision thresholds are needed. Finally, more advanced interpretability tools, such as SHAP or LIME, may provide deeper insights into model predictions and enhance clinician trust in automated classification systems.
Overall, the study enhances methodological rigor within clinical NLP and provides a solid benchmark that supports and motivates future innovations in medical text classification.