Article

Exploratory Study on Hybrid Systems Performance: A First Approach to Hybrid ML Models in Breast Cancer Classification

by Francisco J. Rojas-Pérez 1, José R. Conde-Sánchez 2, Alejandra Morlett-Paredes 3, Fernando Moreno-Barbosa 2, Julio C. Ramos-Fernández 1, José Luna-Muñoz 1,4,5, Genaro Vargas-Hernández 1, Blanca E. Jaramillo-Loranca 1, Juan M. Xicotencatl-Pérez 1 and Eucario G. Pérez-Pérez 1,*
1 Dirección de Investigación, Innovación y Posgrado, Universidad Politécnica de Pachuca, Zempoala 43830, Hidalgo, Mexico
2 Facultad de Ciencias Físico Matemáticas (FCFM), Benemérita Universidad Autónoma de Puebla, Puebla 72570, Puebla, Mexico
3 Department of Neurosciences, University of California San Diego, La Jolla, CA 92093, USA
4 National Dementia BioBank, AMPAEYDEN A.C., Cuautitlán Izcalli 54743, Estado de Mexico, Mexico
5 Banco Nacional de Cerebros-UNPHU, Universidad Nacional Pedro Henríquez Ureña, Santo Domingo 10602, Dominican Republic
* Author to whom correspondence should be addressed.
Submission received: 27 October 2025 / Revised: 29 December 2025 / Accepted: 8 January 2026 / Published: 15 January 2026

Abstract

The classification of breast cancer using machine learning techniques has become a critical tool in modern medical diagnostics. This study analyzes the performance of hybrid models that combine traditional machine learning algorithms (TMLAs) with a convolutional neural network (CNN)-based VGG16 model for feature extraction to improve accuracy for classifying eight breast cancer subtypes (BCS). The methodology consists of three steps. First, image preprocessing is performed on the BreakHis dataset at 400× magnification, which contains 1820 histopathological images classified into eight BCS. Second, the CNN VGG16 is modified to function as a feature extractor that converts images into representative vectors. These vectors constitute the training set for TMLAs, such as Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naive Bayes (NB), leveraging VGG16’s ability to capture relevant features. Third, k-fold cross-validation is applied to evaluate the model’s performance by averaging the metrics obtained across all folds. The results reveal that hybrid models leveraging a CNN-based VGG16 model for feature extraction, followed by TMLAs, achieve outstanding experimental accuracy. The KNN-based hybrid model stood out with a precision of 0.97, accuracy of 0.96, sensitivity of 0.96, specificity of 0.99, F1-score of 0.96, and ROC-AUC of 0.97. These findings suggest that, with an appropriate methodology, hybrid models based on TMLAs have strong potential in classification tasks, offering a balance between performance and predictive capability.

1. Introduction

Breast cancer is one of the most prevalent malignancies among women and remains a leading cause of cancer-related mortality worldwide. According to GLOBOCAN 2022 [1], breast cancer accounts for the highest female mortality in over 100 countries, underscoring the urgent need for accurate and efficient diagnostic systems that support early detection and improved survival.
In recent years, machine learning (ML) and deep learning (DL) techniques have been increasingly applied to assist in medical image classification, pattern recognition, and diagnostic decision-making. DL approaches such as CNNs have shown remarkable performance in biomedical applications due to their ability to automatically extract hierarchical features from raw data.
In theory, even a shallow neural network can be trained to approximate an ideal classifier. This is consistent with the universal approximation theorem, which states that a feedforward network with at least one sufficiently large hidden layer can approximate any continuous function with arbitrary precision. However, such a network is not necessarily the optimal choice for BCS classification, as it may not adequately capture the underlying structure of the problem, which involves complex features that may be better modeled by alternative techniques. Simpler models such as RF, SVM, NB, and KNN, which rely on different learning approaches, may be more suitable given the complexity of the multiclass task and the size and quality of the data.
However, DL models often require large datasets, high computational power, and complex tuning procedures, which can limit their efficiency and interpretability in clinical environments. TMLA models, in contrast, are less computationally demanding and perform robustly on smaller or structured datasets.
To overcome the limitations of both paradigms, hybrid systems combining CNN-based feature extraction with TMLA have emerged as promising alternatives. By leveraging CNNs to capture intricate spatial features and conventional algorithms—such as RF, SVM, NB, and KNN—for classification, these hybrid models aim to balance accuracy, provide an easy-to-understand interpretation, and benefit from the efficiency of feature extractors.
This study focuses on evaluating the performance of such hybrid architectures for breast cancer classification. Specifically, we investigate how combining CNN feature extraction with TMLA has the potential to serve as a viable alternative capable of achieving accuracy levels comparable to state-of-the-art approaches and methodologies.

2. Related Work

In 2020, the study entitled Comparative Study of Machine Learning Algorithms for Breast Cancer Prediction [2] compared the performance of various ML algorithms for breast cancer prognosis. The authors quantitatively evaluated decision tree (DT) and logistic regression (LR) classifiers, concluding that ML-based early detection plays a vital role in improving clinical outcomes. The models achieved high accuracies of 94.4% (LR) and 95.1% (DT), highlighting that algorithmic selection strongly influences predictive performance even when applied to the same dataset. During the same year, in Conventional Machine Learning and Deep Learning Approach for Multi-Classification of Breast Cancer Histopathology Images [3], the authors addressed the automation of multiclass breast cancer diagnosis using the BreakHis image set. The study compared conventional feature-based classifiers with transfer learning models (VGG16, VGG19, and ResNet50). Results indicated that pretrained networks as feature extractors significantly outperformed handcrafted feature approaches. The hybrid combination of the VGG16 network with a linear SVM achieved the best performance, reaching 91.79% accuracy at 400× magnification.
In 2021, the study entitled Malignant and Benign Breast Cancer Classification using Machine Learning Algorithms [4] focused on early detection of benign and malignant tumors, using the Wisconsin breast cancer dataset. Several classifiers were tested, including SVM, LR, KNN, DT, NB, and RF. RF and SVM demonstrated superior results, both achieving accuracies around 96.5%, supporting their potential for integration into automated diagnostic systems. In 2022, the authors of An Improvised Random Forest Model for Breast Cancer Classification [5] proposed a cost-sensitive enhancement to the standard RF algorithm. The model incorporated a penalty matrix assigning higher costs to false negatives, thereby improving minority class detection. This adjustment yielded a notable accuracy of 97.51%, outperforming conventional RF models and reinforcing the algorithm’s adaptability to clinical data imbalance.
More recently, the 2023 study entitled Breast cancer diagnosis through knowledge distillation of Swin transformer-based teacher–student models [6] explored transformer-based networks for histopathological image analysis. The authors proposed a teacher–student framework where a compact learner model was trained via knowledge distillation from a Swin Transformer teacher model. Despite being significantly lighter, the learner achieved 98.71% accuracy, only marginally lower than the teacher’s 98.91%, suggesting its practical value for real-time diagnostic support. Also, in the same year, the authors of Deep learning- and expert knowledge-based feature extraction and performance evaluation in breast histopathology images [7] examined how DL-derived and expert-defined features affect classification performance. The study compared CNN-based, VGG16 transfer learning, and knowledge-driven feature extraction, tested across seven classifiers. The knowledge-based approach achieved the highest accuracy of up to 98% when combined with neural networks, RF, and multilayer perceptron classifiers.
By reviewing these works, it becomes evident that ML and DL models substantially contribute to breast cancer diagnosis. However, differences in accuracy, computational demand, and interpretability highlight the need for hybrid frameworks that combine the representational strength of deep networks with the efficiency and explainability of TMLA.

3. Proposed Methodology

The overall workflow of the proposed methodology is illustrated in Figure 1. The process begins with data augmentation, where oversampling techniques and digital image processing (DIP) operations are applied to expand the dataset and mitigate errors inherent to limited data, characterized by low representativeness, low diversity, and class imbalance [8].
The augmented image set is then fed into a VGG16 CNN configured as a feature extractor to obtain high-level representations of the input data. These extracted features are subsequently used to train four TMLAs: RF, SVM, NB, and KNN. Finally, k-fold cross-validation is performed on all trained models to evaluate performance metrics, including accuracy (Acc), precision (P), recall (R), specificity (Sp), and F1-score (F1). The hybrid model achieving the highest overall performance is selected for its potential in the breast cancer classification task.

3.1. Implementation Steps

The methodology was implemented in four main stages, as described below.

3.1.1. Data Augmentation

The BreakHis image set consists of 7909 histopathological biopsy images collected from 82 patients by P&D Laboratory in Brazil in 2014 [9]. These images are organized into four magnifications (40×, 100×, 200×, and 400×), as shown in Figure 2a. Each magnification level contains an unbalanced distribution of eight BCS. For the BreakHis 400× subset, there are 1820 images distributed across the following categories (Figure 2b): Adenosis (A), Fibroadenoma (F), Phyllodes tumor (PT), Tubular adenoma (TA), Ductal carcinoma (DC), Lobular carcinoma (LC), Mucinous carcinoma (MC), and Papillary carcinoma (PC). As a first approximation, each image in the BreakHis dataset is treated as an independent sample, omitting any existing correlations among images.
In ML model training, a limited dataset can lead to overfitting, poor generalization, and inaccurate predictions [10], since the model does not have enough information to learn representative patterns. This constraint is particularly critical in exploratory research, where insufficient training data can compromise model performance.
To mitigate data scarcity, data augmentation techniques [11] are applied to the BreakHis 400× subset. Synthetic samples were generated using DIP operations without altering the integrity or biological relevance of the original images. The following transformations were applied: image rotation, image mirroring, and contrast-limited adaptive histogram equalization (CLAHE). Applied sequentially, they generate the eight images shown in Figure 3.
In this way, synthetic images are generated without empty pixels or null values that complicate classification, while increasing the number of images and allowing analysis from different angles and perspectives without affecting image quality or content.
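As an illustration of the geometric part of this step, the following sketch combines 90-degree rotations with mirroring to produce eight variants per original image. It is a toy example on a 2 × 2 pixel grid, not the actual DIP pipeline, and CLAHE (a contrast operation) is omitted for brevity.

```python
# Sketch of the geometric augmentation: the four 90-degree rotations of an
# image combined with horizontal mirroring yield eight distinct views.
def rotate90(img):
    """Rotate a 2D pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def mirror(img):
    """Mirror a 2D pixel grid horizontally."""
    return [row[::-1] for row in img]

def dihedral_views(img):
    """Return the eight rotation/mirror variants of an image."""
    views = []
    current = img
    for _ in range(4):
        views.append(current)
        views.append(mirror(current))
        current = rotate90(current)
    return views

sample = [[1, 2], [3, 4]]  # toy 2x2 "image"
augmented = dihedral_views(sample)
print(len(augmented))  # 8 variants per original image
```

In practice the same transforms would be applied with an image-processing library to full-resolution micrographs; the logic above only shows why eight images result.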

3.1.2. VGG16 Feature Extractor

A feature extractor [13] is an algorithm that identifies and encodes relevant visual patterns such as shape, texture, and color into numerical feature vectors. A popular method for feature extraction is the use of autoencoders, which are deep neural networks capable of learning compressed representations (encodings) of the input data. Autoencoders can reduce data dimensionality, perform noise reduction, and extract salient features that may improve downstream classification performance. However, they may also lead to information loss due to forced compression in the latent space, and they often suffer from training instabilities such as vanishing gradients, which can affect convergence and overall performance. For multiclass tasks, autoencoders focus on learning latent representations and reconstruction, rather than optimizing directly for classification objectives, which may reduce their accuracy and effectiveness.
In contrast, specialized architectures such as VGG16 are explicitly designed for efficient feature extraction, providing richer hierarchical and discriminative representations. This structure supports robust and clinically relevant recognition of complex breast tumor samples.
The VGG16 network, a pretrained CNN, was employed as a feature extractor. This architecture consists of 16 layers in total: 13 convolutional layers and 3 fully connected layers. The convolutional layers are organized into five blocks, each containing multiple 3 × 3 convolutional filters, interleaved with pooling layers to progressively reduce spatial dimensionality.
To analyze the potential of TMLAs integrated into hybrid systems, VGG16 was used as the sole feature extractor. This choice is based on its well-established ability to extract high-quality features and serves as a starting point for the present study, providing an initial foundation upon which future methodological improvements can be developed and evaluated.
To use VGG16 as a feature extractor, the convolutional layers are frozen, meaning that their weights remain unchanged during the feature extraction process. Feature extraction is performed at the final output of the convolutional blocks, just before the fully connected layers. For an RGB input of 224 × 224 pixels, this produces a convolutional feature map of dimensions 7 × 7 × 512; flattening it yields 25,088 attributes per image, while a global average pooling or global max pooling strategy reduces it to a 512-dimensional vector. These vectors capture essential spatial information and serve as inputs for the subsequent TMLA models, leveraging transfer learning to improve efficiency and performance (Figure 4).
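The shape bookkeeping of this step can be sketched as follows. A random tensor stands in for the output of the frozen VGG16 convolutional blocks, since the point here is only the flatten-versus-pooling dimensionality, not the actual pretrained features.

```python
import numpy as np

# Stand-in for the 7x7x512 feature map that VGG16's last convolutional
# block produces for a 224x224 RGB input (random values, for shape only).
rng = np.random.default_rng(0)
feature_map = rng.random((7, 7, 512))

flattened = feature_map.reshape(-1)  # 7 * 7 * 512 = 25,088 attributes
gap = feature_map.mean(axis=(0, 1))  # global average pooling -> 512 values
gmp = feature_map.max(axis=(0, 1))   # global max pooling -> 512 values

print(flattened.shape, gap.shape, gmp.shape)
```

Either vector can then be stacked row-wise over all images to form the training matrix for the TMLA classifiers.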
When training hybrid models, it is possible that they inherit certain disadvantages from the CNN-based feature extractors, such as the need for large amounts of labeled data, long training periods, and substantial hardware and energy requirements.
However, these limitations are offset by the ability of CNN feature extractors to generate high-level representations that capture complex details and spatial hierarchies in histopathological micrographs, thereby improving the quality of model training and enhancing generalization ability.
Although CNNs typically require extensive data and computational resources for training, the proposed methodology leverages their pretrained capabilities to extract high-quality features that strengthen the predictive power of the TMLA classifiers. This hybrid approach combines the best of both worlds: powerful automated feature extraction with faster and more cost-efficient classification. Therefore, despite the computational demands and data requirements, this hybrid strategy improves model accuracy and allows us to examine the potential of hybrid models for multiclass breast cancer classification.

3.1.3. ML Models

This study compared four hybrid multiclassifier models that combine the VGG16 feature extractor with TMLA methods—specifically, RF, SVM, KNN, and NB—to determine the most promising hybrid model for BCS classification. Although TMLAs are less complex than DL architectures, integrating them with CNN-based feature extraction can yield comparable results in settings where data availability and computational resources are limiting factors.
Since traditional classifiers cannot process raw images directly, the VGG16 converts images into a structured feature vector. These vectors are compiled into a dataset to train the four TMLA models described below.
RF: An ensemble algorithm that combines multiple decision trees to enhance predictive accuracy and reduce overfitting [15]. For a dataset with N samples and M features, bootstrap sampling generates B subsets, and a decision tree t_b is built for each bootstrap subset b.
Mathematically, the prediction of an RF with B trees can be expressed as
\hat{y}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b(x)
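A minimal sketch of this averaging rule follows, with stand-in functions playing the role of trained trees (not a real forest; for classification, sklearn-style forests average per-class probabilities in the same way).

```python
# Sketch of the random-forest prediction rule: the ensemble output is the
# average of B individual tree predictions. The "trees" here are stand-in
# functions returning a toy probability that x is malignant.
def forest_predict(trees, x):
    return sum(t(x) for t in trees) / len(trees)

trees = [lambda x: 0.9, lambda x: 0.6, lambda x: 0.9]
print(forest_predict(trees, None))  # 0.8
```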
SVM: A supervised maximum-margin model [16], which makes it resistant to noise and misclassified data. It is used to analyze simple or high-dimensional data and to solve classification problems. The algorithm finds the hyperplane that best separates two classes of data points while maximizing the margin between them. The margin is defined as the maximum width of the region parallel to the hyperplane that contains no interior data points.
Given a training dataset \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}, where x_i \in \mathbb{R}^p are feature vectors and y_i \in \{-1, 1\} are class labels, the objective of the SVM is to find the hyperplane w \cdot x + b = 0 that maximizes the margin between the two classes.
Mathematically, the maximal margin classifier principle behind SVM is formulated as a quadratic optimization problem that minimizes the squared norm of the normal vector to the hyperplane, \|w\|^2, subject to the constraint that all data points lie on the correct side of the hyperplane with a minimum margin of 1:
y_i (w \cdot x_i + b) \geq 1, \quad \text{for } i = 1, \ldots, n
where w is the normal vector to the hyperplane, b is the intercept term of the hyperplane, and \|w\| is the Euclidean norm of w, which is minimized to maximize the margin.
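The margin constraint above can be checked directly for a candidate hyperplane; the hyperplane (w, b) and the toy points below are hypothetical, chosen only to illustrate the feasibility test.

```python
# Sketch of the SVM margin constraint: a candidate hyperplane (w, b) is
# feasible when every training point satisfies y_i * (w . x_i + b) >= 1.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def satisfies_margin(w, b, X, y):
    return all(yi * (dot(w, xi) + b) >= 1 for xi, yi in zip(X, y))

X = [(2, 2), (3, 3), (-2, -2), (-3, -3)]  # toy linearly separable points
y = [1, 1, -1, -1]
print(satisfies_margin((0.5, 0.5), 0.0, X, y))  # True: constraint holds
```

The optimization itself then searches, among all feasible (w, b), for the one with the smallest \|w\|^2.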
KNN: This model is based on the mathematical principle of assigning a new instance the most frequent class among its k nearest neighbors in the feature space [17].
Given a training dataset with instances \{(x_i, y_i)\}, where x_i is a vector in \mathbb{R}^p and y_i is the class label, the distance d(x_0, x_i) between a new point x_0 and each training point is computed using a distance metric such as the Euclidean distance. The k training points with the smallest distances d(x_0, x_i) are selected. The predicted class for x_0 is then determined by the most frequent class among these k neighbors:
\hat{y}_0 = \arg\max_{c} \sum_{i \in N_k(x_0)} \mathbb{1}(y_i = c)
where N_k(x_0) is the set of indices corresponding to the k nearest neighbors of x_0, and \mathbb{1}(\cdot) is the indicator function.
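A minimal sketch of this prediction rule on toy 2-D points follows; the coordinates and labels are hypothetical, and Euclidean distance is used as the metric.

```python
from collections import Counter
import math

# Minimal KNN classifier: the label of a new point is the most frequent
# class among its k nearest neighbors in the feature space.
def knn_predict(X_train, y_train, x0, k=3):
    nearest = sorted(
        range(len(X_train)),
        key=lambda i: math.dist(X_train[i], x0),
    )[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]
print(knn_predict(X, y, (0.5, 0.5)))  # "benign"
```

In the hybrid setting, X would be the VGG16 feature vectors rather than 2-D points, but the voting rule is identical.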
NB: A probabilistic classifier that applies Bayes’ theorem for prediction [18]. It operates under the assumption that the predictor variables are mutually independent, which significantly simplifies the required calculations. Given a set of predictor variables X = (x_1, x_2, \ldots, x_p) and a class variable C, NB computes the posterior probability P(C|X) using the formula
P(C \mid X) = \frac{P(C) \prod_{i=1}^{p} P(x_i \mid C)}{P(X)}
where P(C) is the prior probability of the class, P(x_i \mid C) is the conditional probability of each predictor variable given the class, and P(X) is the marginal probability of the predictor variables.
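A minimal sketch of this rule follows; since P(X) is a shared normalizer, comparing the prior-times-likelihood product per class is enough to pick the prediction. The priors and likelihood tables below are illustrative values, not estimates from the BreakHis data.

```python
from math import prod

# Sketch of the Naive Bayes decision rule: the posterior for each class is
# proportional to the prior times the product of per-feature likelihoods.
def nb_predict(priors, likelihoods, x):
    scores = {
        c: priors[c] * prod(likelihoods[c][i][xi] for i, xi in enumerate(x))
        for c in priors
    }
    return max(scores, key=scores.get)

priors = {"benign": 0.6, "malignant": 0.4}
# likelihoods[class][feature_index][feature_value] -> P(x_i | C)
likelihoods = {
    "benign": [{"low": 0.8, "high": 0.2}, {"small": 0.7, "large": 0.3}],
    "malignant": [{"low": 0.1, "high": 0.9}, {"small": 0.2, "large": 0.8}],
}
print(nb_predict(priors, likelihoods, ("high", "large")))  # "malignant"
```

For the continuous VGG16 features used in this study, the per-feature likelihoods would instead be modeled with a distribution (e.g., Gaussian), but the independence-based product is the same.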

3.1.4. Model Training and Performance

Supervised ML models automatically adjust their internal parameters during the training process to optimize their performance in BCS classification, learning directly from the data. However, to ensure reproducibility and to maintain control over the learning process, training hyperparameters must be defined in advance. These hyperparameters are external configurations set prior to model training and govern key aspects of the learning process.
Table 1 presents the training hyperparameters for the RF, SVM, KNN, and NB models. In addition to listing the primary hyperparameters for each model, the table includes the specific values used in this study as well as suggested search ranges for future reference.
The primary goal is to individually evaluate the performance of the proposed hybrid models and to identify the most suitable model for BCS multiclass classification using the Acc, R, Sp, P, and F1 metrics, which are standard and deterministic criteria for assessing ML model performance. A final ensemble of all models is not necessarily required, as the comparative analysis aims to determine the optimal model.
The train–test split is one of the most fundamental techniques for training and validating ML models. It divides the dataset into two subsets: the training set and the validation set, typically allocating 70–80% of the data for training and 20–30% for validation.
However, because this method relies on a single random data partition, it may produce a less reliable estimate of model performance and be influenced by sampling variability. In some cases, the selected partition may not accurately represent the true distribution of the data, potentially biasing the results and leading to misleading conclusions.
K-fold cross-validation is a more rigorous and reliable technique for training and validating ML models than the simple train–test split. In this approach, the dataset is systematically divided into k equally sized subsets (folds). The model is iteratively trained on k − 1 folds and validated on the remaining fold, ensuring that every sample is used for both training and validation across the k iterations (Figure 5).
This process provides a more stable and unbiased estimate of model performance by maximizing data utilization and mitigating the randomness inherent in single-split validation. As a result, k-fold cross-validation offers a robust and comprehensive assessment that effectively compensates for the limitations of the traditional train–test split.
Furthermore, k-fold cross-validation is particularly effective for comparative model analysis, as it evaluates all models under identical data partitions and experimental conditions. This consistency ensures a fair and reliable comparison of performance results across different classifiers.
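The fold construction described above can be sketched as follows (indices only; no shuffling or stratification, which a production setup would typically add):

```python
# Sketch of k-fold cross-validation: split n sample indices into k folds,
# each fold serving once as the validation set while the remaining k-1
# folds form the training set.
def kfold_indices(n, k):
    folds = []
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds

splits = kfold_indices(10, 5)
print(len(splits))  # 5 folds; every sample appears exactly once in validation
```

Evaluating all four hybrid models on these identical splits is what makes the comparison in this study fair.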
During each training and validation iteration, a confusion matrix [19] is generated to summarize the model’s correct and incorrect predictions for each class. The matrix is organized into four categories that describe classification outcomes:
  • True Positives (TP): Malignant cases correctly identified as malignant.
  • False Positives (FP): Benign cases incorrectly identified as malignant.
  • True Negatives (TN): Benign cases correctly identified as benign.
  • False Negatives (FN): Malignant cases incorrectly identified as benign.
Evaluation metrics
The elements of the confusion matrix (TP, FP, TN, and FN) are used to compute several quantitative measures that assess the accuracy and effectiveness of a classification model. In this case, because the task involves an eight-category multiclass classification, the resulting confusion matrix has dimensions of 8 × 8.
These metrics, Acc, P, R, Sp, and F1, evaluate a model’s performance in terms of its ability to generalize and avoid overfitting to training data patterns.
Micro Acc represents the proportion of correctly predicted instances among all predictions. It provides an overall measure of model performance, and for multiclass classification, it is calculated as follows:
Acc_{micro} = \frac{\sum_{i=1}^{8} TP_i}{\sum_{i=1}^{8} (TP_i + TN_i + FP_i + FN_i)}
Micro P measures the proportion of TP predictions among all instances predicted as positive. In other words, it quantifies how many of the samples classified as positive are actually positive.
P_{micro} = \frac{\sum_{i=1}^{8} TP_i}{\sum_{i=1}^{8} (TP_i + FP_i)}
Micro R measures the model’s ability to correctly identify positive cases. It represents the proportion of actual positive instances that are correctly detected by the model. Mathematically, it is expressed as follows:
R_{micro} = \frac{\sum_{i=1}^{8} TP_i}{\sum_{i=1}^{8} (TP_i + FN_i)}
Micro Sp measures the model’s ability to correctly identify negative cases. It represents the proportion of actual negative instances that are correctly classified as negative by the model. Mathematically, it is defined as follows:
Sp_{micro} = \frac{\sum_{i=1}^{8} TN_i}{\sum_{i=1}^{8} (TN_i + FP_i)}
Micro F1 represents the harmonic mean of micro P and micro R, providing a single metric that balances both. It evaluates the overall quality of a classification model by considering the impact of both false positives and false negatives. Mathematically, it is expressed as follows:
F1_{micro} = \frac{2 \cdot P_{micro} \cdot R_{micro}}{P_{micro} + R_{micro}}
These evaluation metrics are fundamental for assessing the effectiveness and performance of ML models in predictive tasks.
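As a worked example, the micro-averaged P, R, and F1 defined above can be computed by pooling per-class counts before dividing. The counts below are illustrative (three classes shown rather than eight), not results from this study.

```python
# Sketch of micro-averaged metrics: TP/FP/FN are summed over all classes
# first, then the ratios are taken on the pooled totals.
def micro_metrics(counts):
    tp = sum(c["tp"] for c in counts)
    fp = sum(c["fp"] for c in counts)
    fn = sum(c["fn"] for c in counts)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

counts = [
    {"tp": 40, "fp": 5, "fn": 5},
    {"tp": 30, "fp": 10, "fn": 5},
    {"tp": 20, "fp": 5, "fn": 10},
]
p, r, f1 = micro_metrics(counts)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Note that in single-label multiclass classification the pooled FP and FN totals coincide, so micro P, micro R, and micro F1 are equal, as in this example.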

3.2. Performance Considerations

1: For each hybrid model, two sets of evaluation tables are presented: the first corresponds to the performance obtained with the original image set, and the second shows the metrics derived from the synthetically augmented image set. This comparison highlights the impact of data augmentation on each hybrid model’s training performance.
2: The comparison between models trained on the original and augmented datasets is not entirely equivalent because the validation subset of the augmented dataset is larger. This discrepancy affects the absolute counts of TP, TN, FP, and FN, potentially introducing bias in the performance evaluation metrics.
For this reason, the evaluation of model performance was extended to include a testing stage using an independent image set (BreakHis Version 2). This image set is distinct from the BreakHis 400× image set used for training but was documented following the same acquisition and annotation methodology. The dataset is distributed as follows: A—30 images, F—74 images, PT—39 images, TA—33 images, DC—248 images, LC—27 images, MC—50 images, and PC—44 images, for a total of 545 images. Each image in this dataset is treated as an independent sample, omitting the patient-wise split and any correlations that may exist between images.
To ensure a fair comparison, the metrics obtained during the testing stage are used to determine the best-performing hybrid model, while those from the validation stage serve as internal references for assessing model behavior during training.
3: In accordance with the above considerations, each results table is organized into two sections: the first section reports the validation metrics (highlighted in blue), while the second section presents the testing metrics (highlighted in green). This structure provides a clear distinction between internal model performance and external generalization capability.

4. Results

As described in the methodology section, the values of TP, FP, TN, and FN were quantified to calculate the evaluation metrics for each trained model. Two sets of performance tables were constructed for each hybrid classifier: one based on the original image set and another using the synthetically augmented dataset. In addition to the tables, confusion matrices and ROC-AUC curve plots are provided, complementing and strengthening the evaluation of each hybrid model by offering both a visual and a quantitative measure of its ability to distinguish among the eight BCS classes.

5. Discussion

The RF model (Table 2 and Table 3 and Figure 6, Figure 7, Figure 8 and Figure 9) showed that data augmentation produced a moderate but consistent improvement in performance. While the validation Acc remained around 0.51, the testing Acc increased to 0.94, demonstrating that the inclusion of synthetic data enhanced the model’s generalization ability.
For the SVM models (Table 4 and Table 5 and Figure 10, Figure 11, Figure 12 and Figure 13), the improvement was more pronounced. After training with augmented data, the average Acc rose from 0.82 to 0.91 and the F1 from 0.77 to 0.89. These results confirm that the hybrid integration of VGG16 feature extraction with SVM enables superior discrimination of BCS compared to models trained exclusively on the original dataset.
In contrast, the NB-based model (Table 6 and Table 7 and Figure 14, Figure 15, Figure 16 and Figure 17) experienced a decline in Acc and F1 when trained with synthetic data. This suggests that the proposed methodology is not well suited for hybrid frameworks relying on NB, likely due to its strong independence assumptions and reduced adaptability to high-dimensional features.
The KNN-based hybrid model (Table 8 and Table 9 and Figure 18, Figure 19, Figure 20 and Figure 21) exhibited the most significant improvement. Training with augmented data increased Acc from 0.92 to 0.97 and F1 from 0.90 to 0.96, demonstrating excellent stability and generalization. These findings position the KNN-based hybrid model as the most promising approach for multi-class BCS classification.
Table 10 and Table 11 summarize the average, SD, and CI95 values for all models, clearly showing that the KNN hybrid model trained with synthetically augmented data consistently outperformed the others across all evaluation metrics. In turn, Table 12 presents the mean values, SD, and CI95 of the ROC-AUC scores for each hybrid model, confirming the generalization capability of each model in relation to the calculated metrics.
Overall, these findings demonstrate that hybrid systems combining feature extraction with classical classifiers can achieve high performance, while also revealing the possibility of data leakage and the need for a feature-selection process to reduce the dimensionality of the attributes extracted by VGG16. Together, these insights motivate a more rigorous experimental design that incorporates selective sampling, grouping images from the same patient within the same cross-validation fold to ensure better control over the training data and to prevent data leakage that would inflate the performance metrics of ML models.

6. Conclusions

This study presented a methodology for breast cancer classification using hybrid models that combine VGG16 feature extraction with traditional ML classifiers. The proposed approach demonstrated that high performance can be achieved in terms of Acc, P, R, Sp, and F1, highlighting the viability of hybrid models for BCS classification.
Overall, the RF and SVM hybrids achieved computationally satisfactory levels of accuracy; however, these levels remain insufficient from a clinical standpoint. Therefore, further optimization and methodological refinement are needed to enhance their applicability to multi-class classification of BCS.
In contrast, the NB-hybrid model showed limited adaptability and failed to achieve significant performance gains, reflecting its inherent limitations when applied to high-dimensional data. This restricts its suitability for contexts where precision and reliability are paramount.
The KNN-hybrid model, on the other hand, demonstrated the most consistent and accurate performance, reaching an average Acc of 0.97. These results underscore the potential of hybrid architectures integrating TMLA-based systems, demonstrating that the hybrid approach is a viable alternative.
One area for improvement is addressing a key limitation in the proposed methodology related to potential information leakage during the training of hybrid models, which poses a critical risk to the model’s reliability and practical utility. Possible causes include insufficient rigor and oversight in the construction of the training set, as well as artificially inflated model performance resulting from placing highly correlated histopathological micrographs in both the training and validation sets. This issue may arise from the incorporation of synthetic data during the preprocessing stage, leading to performance estimates that do not accurately reflect the model’s true predictive capacity in real-world scenarios.
Other limitations of the proposed methodology include restricted generalization, since the model works exclusively with 400× magnification micrographs, and the reliance on a single feature extractor (VGG16), whose pretrained biases could lead to overfitting.
In future work, efforts to mitigate information leakage in the training of hybrid models will focus on implementing rigorous safeguards to ensure data integrity and security without compromising the scientific validity of the model. In addition, a feature-selection process will be implemented to reduce data dimensionality and improve the generalization capability of the hybrid models, along with a statistical analysis that supports the robustness of the experimental design.
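As a sketch of the planned feature-selection step, one common option is an ANOVA-based filter such as scikit-learn's SelectKBest. The choice of filter and of k = 64 here are illustrative assumptions, not the method the authors committed to, and the data are synthetic stand-ins.

```python
# ANOVA-based feature selection (SelectKBest) as one possible way to
# reduce the dimensionality of the extracted features before training
# the classical classifiers; k = 64 is an illustrative choice and the
# data are synthetic stand-ins.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 512))   # stand-in for VGG16 feature vectors
y = np.arange(100) % 8            # balanced labels for 8 classes

selector = SelectKBest(score_func=f_classif, k=64)
X_reduced = selector.fit_transform(X, y)
assert X_reduced.shape == (100, 64)
```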

Author Contributions

F.J.R.-P.: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—review and editing, visualization, supervision, project; J.R.C.-S.: Conceptualization, validation, formal analysis, writing—review and editing, project administration; A.M.-P.: Validation, investigation, formal analysis, writing—review and editing; F.M.-B.: Conceptualization, methodology, formal analysis, writing—review and editing; J.C.R.-F.: Conceptualization, software, writing—review and editing; J.L.-M.: Conceptualization, validation, investigation, writing—review and editing; G.V.-H.: Conceptualization, validation, investigation, writing—review and editing; B.E.J.-L.: Conceptualization, validation, investigation, writing—review and editing; J.M.X.-P.: Conceptualization, software, writing—review and editing; E.G.P.-P.: Conceptualization, methodology, software, formal analysis, data curation, writing—original draft, writing—review and editing, supervision, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

The first author received a grant from the Consejo Nacional de Humanidades, Ciencia y Tecnología (CONACYT), CVU: 1078268.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The BreakHis Version 2 breast cancer histopathology micrograph set is publicly available at the following URLs: https://web.inf.ufpr.br/vri/databases/breast-cancer-histopathological-database-breakhis (accessed on 9 January 2025) or https://www.kaggle.com/datasets/ambarish/breakhis (accessed on 9 January 2025). This dataset contains 7909 microscopic images of breast tumor tissue, classified as benign or malignant, obtained from 82 patients at different magnification factors (40×, 100×, 200×, and 400×). It is widely used for classification tasks in ML and allows for comparative evaluation of models on histopathological images of breast cancer. The BreakHis 400× breast cancer histopathology micrograph set is publicly available at the following URL: https://www.kaggle.com/datasets/forderation/breakhis-400x (accessed on 9 January 2025). This dataset contains 545 microscopic images of breast tumor tissue, classified as benign or malignant, at a 400× magnification factor. The dataset is distributed as follows: A—30 images, F—74 images, PT—39 images, TA—33 images, DC—248 images, LC—27 images, MC—50 images, and PC—44 images. The purpose of this dataset is to serve as a resource for the training and testing of deep learning and machine learning models.

Acknowledgments

The first author received a grant from the Consejo Nacional de Humanidades, Ciencia y Tecnología (CONACYT); the authors are grateful for the support provided by this institution. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
TMLA: Traditional Machine Learning Algorithms
BCS: Breast Cancer Subtypes
RF: Random Forest
SVM: Support Vector Machine
KNN: K-Nearest Neighbors
NB: Naive Bayes
AI: Artificial Intelligence
DT: Decision Tree
LR: Logistic Regression
DL: Deep Learning
CNN: Convolutional Neural Network
ML: Machine Learning
DIP: Digital Image Processing
Acc: Accuracy
P: Precision
R: Recall
Sp: Specificity
F1: F1-Score
CLAHE: Contrast Limited Adaptive Histogram Equalization
TP: True Positive
FP: False Positive
TN: True Negative
FN: False Negative
SD: Standard Deviation
CI95: 95% Confidence Interval
ROC-AUC: Receiver Operating Characteristic–Area Under the Curve

References

  1. Bray, F.; Laversanne, M.; Sung, H.; Ferlay, J.; Siegel, R.L.; Soerjomataram, I.; Jemal, A. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA A Cancer J. Clin. 2024, 74, 229–263.
  2. Sengar, P.P.; Gaikwad, M.J.; Nagdive, A.S. Comparative study of machine learning algorithms for breast cancer prediction. In Proceedings of the 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 August 2020; IEEE: New York, NY, USA, 2020; pp. 796–801.
  3. Sharma, S.; Mehra, R. Conventional machine learning and deep learning approach for multi-classification of breast cancer histopathology images—A comparative insight. J. Digit. Imaging 2020, 33, 632–654.
  4. Ara, S.; Das, A.; Dey, A. Malignant and benign breast cancer classification using machine learning algorithms. In Proceedings of the 2021 International Conference on Artificial Intelligence (ICAI), Lucknow, India, 22–23 May 2021; IEEE: New York, NY, USA, 2021; pp. 97–101.
  5. Mathew, T.E. An improvised random forest model for breast cancer classification. NeuroQuantology 2022, 20, 713.
  6. Kolla, B.; Venugopal, P. Breast cancer diagnosis through knowledge distillation of Swin transformer-based teacher–student models. Mach. Learn. Sci. Technol. 2023, 4, 045047.
  7. Kode, H.; Barkana, B.D. Deep learning- and expert knowledge-based feature extraction and performance evaluation in breast histopathology images. Cancers 2023, 15, 3075.
  8. Ebrahimy, H.; Mirbagheri, B.; Matkan, A.A.; Azadbakht, M. Effectiveness of the integration of data balancing techniques and tree-based ensemble machine learning algorithms for spatially-explicit land cover accuracy prediction. Remote Sens. Appl. Soc. Environ. 2022, 27, 100785.
  9. Spanhol, F.A.; Oliveira, L.S.; Petitjean, C.; Heutte, L. A dataset for breast cancer histopathological image classification. IEEE Trans. Biomed. Eng. 2015, 63, 1455–1462.
  10. Barkah, A.S.; Selamat, S.R.; Abidin, Z.Z.; Wahyudi, R. Impact of data balancing and feature selection on machine learning-based network intrusion detection. JOIV Int. J. Inform. Vis. 2023, 7, 241–248.
  11. Tarawneh, A.S.; Hassanat, A.B.; Altarawneh, G.A.; Almuhaimeed, A. Stop oversampling for class imbalance learning: A review. IEEE Access 2022, 10, 47643–47660.
  12. Krawczyk, B.; Jeleń, Ł.; Krzyżak, A.; Fevens, T. Oversampling methods for classification of imbalanced breast cancer malignancy data. In Proceedings of the Computer Vision and Graphics: International Conference, ICCVG 2012, Warsaw, Poland, 24–26 September 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 483–490.
  13. Dara, S.; Tumma, P. Feature extraction by using deep learning: A survey. In Proceedings of the 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 29–31 March 2018; IEEE: New York, NY, USA, 2018; pp. 1795–1801.
  14. Bakasa, W.; Viriri, S. Vgg16 feature extractor with extreme gradient boost classifier for pancreas cancer prediction. J. Imaging 2023, 9, 138.
  15. Salman, H.A.; Kalakech, A.; Steiti, A. Random forest algorithm overview. Babylon. J. Mach. Learn. 2024, 2024, 69–79.
  16. Guido, R.; Ferrisi, S.; Lofaro, D.; Conforti, D. An overview on the advancements of support vector machine models in healthcare applications: A review. Information 2024, 15, 235.
  17. Kramer, O. K-nearest neighbors. In Dimensionality Reduction with Unsupervised Nearest Neighbors; Springer: Berlin/Heidelberg, Germany, 2013; pp. 13–23.
  18. Wickramasinghe, I.; Kalutarage, H. Naive Bayes: Applications, variations and vulnerabilities: A review of literature with code snippets for implementation. Soft Comput. 2021, 25, 2277–2293.
  19. Liang, J. Confusion matrix: Machine learning. POGIL Act. Clgh. 2022, 3, 4.
Figure 1. Proposed training methodology diagram for hybrid ML models.
Figure 2. Distribution of images in the BreakHis training set: (a) BreakHis image distribution per magnification level; (b) BreakHis 400×—BCS distribution.
Figure 3. Oversampling process [12]: (a) original image; (b) original image with rotation of 180° (in range −360° to 360°); (c) original image with mirror effect (horizontal axis); (d) original image with rotation of 180° and horizontal mirror effect; (e) CLAHE image (with clipLimit = 2.0 and tileGridSize = (8,8)); (f) CLAHE image with rotation of 180°; (g) CLAHE image with horizontal mirror effect; (h) CLAHE image with rotation of 180° and horizontal mirror effect.
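The geometric transformations in Figure 3 (panels a–d) can be sketched with plain NumPy. The CLAHE variants (panels e–h) would additionally apply OpenCV's cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)) first; that step is omitted here to keep the sketch dependency-free.

```python
# Geometric augmentations from Figure 3 (a)-(d), sketched with NumPy.
# CLAHE variants (e)-(h) would first apply OpenCV's
# cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).
import numpy as np

def augment(img: np.ndarray) -> list:
    rotated = np.rot90(img, k=2)   # 180-degree rotation
    mirrored = np.fliplr(img)      # horizontal mirror
    both = np.fliplr(rotated)      # rotation + mirror
    return [img, rotated, mirrored, both]

img = np.arange(12).reshape(3, 4)  # toy "image"
variants = augment(img)
assert len(variants) == 4
# a 180-degree rotation followed by a horizontal mirror equals a
# vertical flip, so the fourth variant is flipud(img)
assert np.array_equal(variants[3], np.flipud(img))
```

Applying these four variants to both the original and CLAHE-enhanced images yields the eightfold augmentation shown in the figure.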
Figure 4. The VGG16 CNN [14] is modified into a feature extractor by removing its fully connected layers and final classification layer (red box). Instead, the output of an intermediate layer (green box) is taken, representing a high-resolution feature vector. These vectors are then stacked to create a dataset for training the TMLAs.
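The extract-and-stack step in Figure 4 can be sketched as follows. A real run would use the truncated VGG16 (e.g., Keras's VGG16 with include_top=False) as the extractor; here a global-average-pooling stand-in keeps the sketch dependency-free, so the output dimensionality is illustrative only.

```python
# Feature extraction and stacking as in Figure 4. The `extract`
# function is a stand-in for the truncated VGG16's intermediate-layer
# output; the 3-dim vectors it produces are illustrative only.
import numpy as np

def extract(batch: np.ndarray) -> np.ndarray:
    """Map a batch of (H, W, C) images to one feature vector each."""
    return batch.mean(axis=(1, 2))  # stand-in for CNN activations

images = np.random.rand(10, 224, 224, 3)   # toy image batch
# extract features batch by batch, then stack into a training matrix
features = np.vstack([extract(b) for b in np.array_split(images, 2)])
assert features.shape == (10, 3)           # one vector per image
```

The stacked matrix, paired with the image labels, becomes the training set for the RF, SVM, KNN, and NB classifiers.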
Figure 5. Schematic illustration of k-fold cross-validation with K = 10 (folds = 10). The dataset is divided into 10 subsets, each representing 10% of the total training set. In each cycle, the roles of the training subsets (shown in blue) and the validation subset (shown in red) are rotated, such that 90% of the data are used to train the model and the remaining 10% are used to validate it.
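The scheme in Figure 5 maps directly onto scikit-learn's KFold and cross_val_score. The classifier and data below are synthetic stand-ins, not the study's features.

```python
# 10-fold cross-validation: each cycle trains on 90% of the data and
# validates on the held-out 10%; the fold scores are then averaged.
# Data and classifier settings are synthetic stand-ins.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 2, size=200)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=cv)
assert scores.shape == (10,)      # one score per fold
mean_acc = scores.mean()          # reported metric: average across folds
```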
Figure 6. Confusion matrices of the RF-based hybrid model trained with the original image set (TP indicated in red): (a) confusion matrix for the validation stage (182 images); (b) confusion matrix for the test stage (545 images).
Figure 7. ROC-AUC curves of the RF model trained with the original dataset (reference line equal to 0.5 in black): (a) ROC-AUC curve of the validation stage; (b) ROC-AUC curve of the test stage.
Figure 8. Confusion matrices of the RF-based hybrid model trained with the synthetically augmented image set (TP indicated in red): (a) confusion matrix for the validation stage (1456 images); (b) confusion matrix for the test stage (545 images).
Figure 9. ROC-AUC curves of the RF model trained with the synthetically augmented dataset (reference line equal to 0.5 in black): (a) ROC-AUC curve of the validation stage; (b) ROC-AUC curve of the test stage.
Figure 10. Confusion matrices of the SVM-based hybrid model trained with the original image set (TP indicated in yellow): (a) confusion matrix for the validation stage (182 images); (b) confusion matrix for the test stage (545 images).
Figure 11. ROC-AUC curves of the SVM model trained with the original dataset (reference line equal to 0.5 in black): (a) ROC-AUC curve of the validation stage; (b) ROC-AUC curve of the test stage.
Figure 12. Confusion matrices of the SVM-based hybrid model trained with the synthetically augmented image set (TP indicated in yellow): (a) confusion matrix for the validation stage (1456 images); (b) confusion matrix for the test stage (545 images).
Figure 13. ROC-AUC curves of the SVM model trained with the synthetically augmented dataset (reference line equal to 0.5 in black): (a) ROC-AUC curve of the validation stage; (b) ROC-AUC curve of the test stage.
Figure 14. Confusion matrices of the NB-based hybrid model trained with the original image set (TP indicated in purple): (a) confusion matrix for the validation stage (182 images); (b) confusion matrix for the test stage (545 images).
Figure 15. ROC-AUC curves of the NB model trained with the original dataset (reference line equal to 0.5 in black): (a) ROC-AUC curve of the validation stage; (b) ROC-AUC curve of the test stage.
Figure 16. Confusion matrices of the NB-based hybrid model trained with the synthetically augmented image set (TP indicated in purple): (a) confusion matrix for the validation stage (1456 images); (b) confusion matrix for the test stage (545 images).
Figure 17. ROC-AUC curves of the NB model trained with the synthetically augmented dataset (reference line equal to 0.5 in black): (a) ROC-AUC curve of the validation stage; (b) ROC-AUC curve of the test stage.
Figure 18. Confusion matrices of the KNN-based hybrid model trained with the original image set (TP indicated in orange): (a) confusion matrix for the validation stage (182 images); (b) confusion matrix for the test stage (545 images).
Figure 19. ROC-AUC curves of the KNN model trained with the original dataset (reference line equal to 0.5 in black): (a) ROC-AUC curve of the validation stage; (b) ROC-AUC curve of the test stage.
Figure 20. Confusion matrices of the KNN-based hybrid model trained with the synthetically augmented image set (TP indicated in orange): (a) confusion matrix for the validation stage (1456 images); (b) confusion matrix for the test stage (545 images).
Figure 21. ROC-AUC curves of the KNN model trained with the synthetically augmented dataset (reference line equal to 0.5 in black); (a) ROC-AUC curve of the validation stage; (b) ROC-AUC curve of the test stage.
Table 1. Hyperparameters and suggested search ranges for the RF, SVM, KNN, and NB models.

| Model | Main Hyperparameters | Suggested Search Ranges |
|---|---|---|
| RF | n_estimators = 100; random_state = 42; max_depth = None; min_samples_split = 2; max_features = 1.0; bootstrap = True | n_estimators: [50, 100, 200, 500]; random_state: constant value for reproducibility; max_depth: [None, 10, 20, 30]; min_samples_split: [2, 5, 10, 20]; max_features: [‘sqrt’, ‘log2’, None]; bootstrap: [True, False] |
| SVM | kernel = ‘rbf’; probability = True; C = 1.0; gamma = ‘scale’; degree = 3 (only for ‘poly’, irrelevant for ‘rbf’) | kernel: [‘rbf’, ‘poly’]; probability: True (fixed); C: [0.1, 1.0, 10, 100]; gamma: [‘scale’, ‘auto’, 0.001, 0.01, 0.1]; degree: [3] (fixed) |
| KNN | n_neighbors = 1; metric = ‘minkowski’; p = 2 | n_neighbors: [1, 3, 5, 7, 9, 15]; metric: [‘minkowski’, ‘euclidean’, ‘manhattan’]; p: [1, 2] (only for ‘minkowski’) |
| NB | var_smoothing = 1 × 10−6 | var_smoothing: [1 × 10−9, 1 × 10−8, 1 × 10−7, 1 × 10−6, 1 × 10−5] |
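The search ranges above can be explored with scikit-learn's GridSearchCV. The sketch below covers the KNN ranges from Table 1 (n_neighbors and metric; p applies only to the minkowski metric and is left at its default here) on synthetic stand-in data.

```python
# Grid search over the Table 1 KNN ranges (n_neighbors, metric).
# Feature vectors and labels are synthetic stand-ins.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 32))      # stand-in for extracted features
y = np.arange(120) % 8              # balanced labels for 8 classes

param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9, 15],
    "metric": ["minkowski", "euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
search.fit(X, y)
assert search.best_params_["n_neighbors"] in param_grid["n_neighbors"]
```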
Table 2. Performance of the RF model trained with the original dataset.

| Model Id | Acc (Val) | P (Val) | R (Val) | Sp (Val) | F1 (Val) | Acc (Test) | P (Test) | R (Test) | Sp (Test) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.4341 | 0.1065 | 0.1761 | 0.8875 | 0.1293 | 0.9321 | 0.9671 | 0.8884 | 0.9851 | 0.9242 |
| 2 | 0.4890 | 0.1309 | 0.1913 | 0.8906 | 0.1521 | 0.9284 | 0.9591 | 0.8751 | 0.9847 | 0.9114 |
| 3 | 0.5549 | 0.2687 | 0.2046 | 0.8963 | 0.1794 | 0.9431 | 0.9724 | 0.8988 | 0.9876 | 0.9321 |
| 4 | 0.5330 | 0.1794 | 0.1984 | 0.9000 | 0.1730 | 0.9303 | 0.9660 | 0.8820 | 0.9847 | 0.9201 |
| 5 | 0.4890 | 0.3774 | 0.2063 | 0.8967 | 0.1777 | 0.9413 | 0.9543 | 0.9014 | 0.9884 | 0.9260 |
| 6 | 0.4835 | 0.1474 | 0.2034 | 0.8970 | 0.1630 | 0.9486 | 0.9651 | 0.9125 | 0.9894 | 0.9370 |
| 7 | 0.4615 | 0.1238 | 0.1857 | 0.8938 | 0.1422 | 0.9486 | 0.9667 | 0.9134 | 0.9891 | 0.9378 |
| 8 | 0.5165 | 0.1362 | 0.1656 | 0.8874 | 0.1387 | 0.9450 | 0.9642 | 0.9195 | 0.9887 | 0.9403 |
| 9 | 0.4341 | 0.2318 | 0.1903 | 0.8885 | 0.1498 | 0.9358 | 0.9693 | 0.8901 | 0.9863 | 0.9260 |
| 10 | 0.5220 | 0.1276 | 0.1838 | 0.8915 | 0.1489 | 0.9413 | 0.9672 | 0.9048 | 0.9877 | 0.9333 |
| Mean | 0.4918 | 0.1830 | 0.1906 | 0.8929 | 0.1554 | 0.9395 | 0.9651 | 0.8986 | 0.9872 | 0.9288 |
Table 3. Performance of the RF model trained with synthetically augmented data.

| Model Id | Acc (Val) | P (Val) | R (Val) | Sp (Val) | F1 (Val) | Acc (Test) | P (Test) | R (Test) | Sp (Test) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5034 | 0.7277 | 0.2279 | 0.8962 | 0.2267 | 0.9376 | 0.9590 | 0.8945 | 0.9869 | 0.9232 |
| 2 | 0.5460 | 0.7661 | 0.2416 | 0.9004 | 0.2488 | 0.9450 | 0.9667 | 0.9094 | 0.9887 | 0.9358 |
| 3 | 0.5041 | 0.7194 | 0.2273 | 0.8955 | 0.2228 | 0.9505 | 0.9789 | 0.9089 | 0.9891 | 0.9398 |
| 4 | 0.5144 | 0.7104 | 0.2237 | 0.8974 | 0.2179 | 0.9248 | 0.9627 | 0.8755 | 0.9836 | 0.9143 |
| 5 | 0.5082 | 0.7793 | 0.2327 | 0.8957 | 0.2368 | 0.9505 | 0.9780 | 0.9074 | 0.9893 | 0.9388 |
| 6 | 0.5275 | 0.7710 | 0.2412 | 0.8982 | 0.2421 | 0.9394 | 0.9635 | 0.8968 | 0.9876 | 0.9274 |
| 7 | 0.5110 | 0.7549 | 0.2256 | 0.8959 | 0.2344 | 0.9394 | 0.9646 | 0.9002 | 0.9870 | 0.9298 |
| 8 | 0.5151 | 0.7673 | 0.2414 | 0.8986 | 0.2493 | 0.9413 | 0.9676 | 0.8944 | 0.9872 | 0.9273 |
| 9 | 0.5446 | 0.7749 | 0.2512 | 0.9006 | 0.2629 | 0.9468 | 0.9710 | 0.9077 | 0.9883 | 0.9367 |
| 10 | 0.5103 | 0.7200 | 0.2484 | 0.8977 | 0.2605 | 0.9486 | 0.9677 | 0.9157 | 0.9891 | 0.9396 |
| Mean | 0.5185 | 0.7491 | 0.2361 | 0.8976 | 0.2402 | 0.9424 | 0.9680 | 0.9011 | 0.9877 | 0.9313 |
Table 4. Performance of the SVM model trained on the original dataset.

| Model Id | Acc (Val) | P (Val) | R (Val) | Sp (Val) | F1 (Val) | Acc (Test) | P (Test) | R (Test) | Sp (Test) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.4890 | 0.3205 | 0.2359 | 0.9017 | 0.1991 | 0.8128 | 0.9316 | 0.6817 | 0.9607 | 0.7572 |
| 2 | 0.5330 | 0.4400 | 0.2706 | 0.9041 | 0.2576 | 0.8239 | 0.9239 | 0.6993 | 0.9631 | 0.7700 |
| 3 | 0.6044 | 0.5064 | 0.2684 | 0.9140 | 0.2609 | 0.8294 | 0.9357 | 0.7034 | 0.9645 | 0.7778 |
| 4 | 0.5659 | 0.4000 | 0.2362 | 0.9119 | 0.2287 | 0.8202 | 0.9247 | 0.6878 | 0.9619 | 0.7616 |
| 5 | 0.5220 | 0.3161 | 0.2441 | 0.9071 | 0.2099 | 0.8220 | 0.9312 | 0.6944 | 0.9632 | 0.7654 |
| 6 | 0.5549 | 0.5305 | 0.3082 | 0.9153 | 0.2821 | 0.8275 | 0.9210 | 0.7010 | 0.9639 | 0.7720 |
| 7 | 0.5330 | 0.3394 | 0.2512 | 0.9138 | 0.2080 | 0.8330 | 0.9361 | 0.7112 | 0.9653 | 0.7831 |
| 8 | 0.5934 | 0.6330 | 0.2793 | 0.9099 | 0.3039 | 0.8257 | 0.9364 | 0.7074 | 0.9631 | 0.7786 |
| 9 | 0.5110 | 0.4889 | 0.2767 | 0.9076 | 0.2507 | 0.8165 | 0.9305 | 0.6902 | 0.9618 | 0.7696 |
| 10 | 0.5879 | 0.3867 | 0.2473 | 0.9095 | 0.2252 | 0.8257 | 0.9302 | 0.7019 | 0.9636 | 0.7768 |
| Mean | 0.5495 | 0.4362 | 0.2618 | 0.9095 | 0.2426 | 0.8237 | 0.9301 | 0.6978 | 0.9631 | 0.7712 |
Table 5. Performance of the SVM model trained with synthetically augmented data.

| Model Id | Acc (Val) | P (Val) | R (Val) | Sp (Val) | F1 (Val) | Acc (Test) | P (Test) | R (Test) | Sp (Test) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.8029 | 0.8513 | 0.7228 | 0.9623 | 0.7718 | 0.9174 | 0.9468 | 0.8633 | 0.9833 | 0.8991 |
| 2 | 0.8180 | 0.8659 | 0.7096 | 0.9633 | 0.7700 | 0.9028 | 0.9302 | 0.8444 | 0.9807 | 0.8816 |
| 3 | 0.8015 | 0.8567 | 0.7006 | 0.9619 | 0.7559 | 0.9083 | 0.9394 | 0.8440 | 0.9816 | 0.8841 |
| 4 | 0.8001 | 0.8754 | 0.6970 | 0.9604 | 0.7633 | 0.9119 | 0.9451 | 0.8558 | 0.9819 | 0.8946 |
| 5 | 0.8036 | 0.8676 | 0.7126 | 0.9613 | 0.7708 | 0.9156 | 0.9523 | 0.8574 | 0.9824 | 0.8971 |
| 6 | 0.8036 | 0.8668 | 0.7068 | 0.9606 | 0.7669 | 0.9046 | 0.9350 | 0.8426 | 0.9809 | 0.8811 |
| 7 | 0.7720 | 0.8360 | 0.6610 | 0.9551 | 0.7243 | 0.9083 | 0.9436 | 0.8478 | 0.9811 | 0.8881 |
| 8 | 0.7953 | 0.8529 | 0.6989 | 0.9608 | 0.7576 | 0.9083 | 0.9386 | 0.8441 | 0.9813 | 0.8836 |
| 9 | 0.8235 | 0.8763 | 0.7291 | 0.9643 | 0.7837 | 0.9064 | 0.9368 | 0.8421 | 0.9808 | 0.8806 |
| 10 | 0.7988 | 0.8562 | 0.7150 | 0.9613 | 0.7681 | 0.9119 | 0.9376 | 0.8559 | 0.9824 | 0.8909 |
| Mean | 0.8019 | 0.8605 | 0.7053 | 0.9611 | 0.7632 | 0.9096 | 0.9405 | 0.8497 | 0.9816 | 0.8881 |
Table 6. Performance of the NB model trained on the original dataset.

| Model Id | Acc (Val) | P (Val) | R (Val) | Sp (Val) | F1 (Val) | Acc (Test) | P (Test) | R (Test) | Sp (Test) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.3132 | 0.0948 | 0.1169 | 0.0910 | 0.8694 | 0.8569 | 0.8700 | 0.8643 | 0.8644 | 0.9752 |
| 2 | 0.4121 | 0.2414 | 0.1663 | 0.1553 | 0.8816 | 0.8550 | 0.8686 | 0.8651 | 0.8618 | 0.9754 |
| 3 | 0.4725 | 0.2699 | 0.1940 | 0.1935 | 0.8837 | 0.8752 | 0.8833 | 0.8946 | 0.8849 | 0.9791 |
| 4 | 0.3846 | 0.1873 | 0.1525 | 0.1434 | 0.8803 | 0.8532 | 0.8555 | 0.8710 | 0.8604 | 0.9750 |
| 5 | 0.3791 | 0.1511 | 0.1530 | 0.1342 | 0.8784 | 0.8716 | 0.8762 | 0.8787 | 0.8747 | 0.9780 |
| 6 | 0.3681 | 0.1045 | 0.1353 | 0.1041 | 0.8785 | 0.8679 | 0.8748 | 0.8796 | 0.8725 | 0.9780 |
| 7 | 0.3791 | 0.1812 | 0.1629 | 0.1473 | 0.8820 | 0.8716 | 0.8764 | 0.8940 | 0.8805 | 0.9791 |
| 8 | 0.4176 | 0.0982 | 0.1280 | 0.1066 | 0.8743 | 0.8752 | 0.8847 | 0.8964 | 0.8859 | 0.9791 |
| 9 | 0.3516 | 0.3958 | 0.1589 | 0.1506 | 0.8761 | 0.8624 | 0.8772 | 0.8677 | 0.8686 | 0.9760 |
| 10 | 0.4066 | 0.1320 | 0.1305 | 0.1156 | 0.8722 | 0.8624 | 0.8652 | 0.8749 | 0.8625 | 0.9772 |
| Mean | 0.3885 | 0.1856 | 0.1498 | 0.1342 | 0.8777 | 0.8651 | 0.8732 | 0.8786 | 0.8716 | 0.9772 |
Table 7. Performance of the NB model trained with synthetically augmented data.

| Model Id | Acc (Val) | P (Val) | R (Val) | Sp (Val) | F1 (Val) | Acc (Test) | P (Test) | R (Test) | Sp (Test) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.4251 | 0.4277 | 0.4402 | 0.9116 | 0.4136 | 0.5211 | 0.6466 | 0.6283 | 0.9341 | 0.5549 |
| 2 | 0.4210 | 0.4155 | 0.4219 | 0.9119 | 0.3941 | 0.5138 | 0.6427 | 0.6313 | 0.9333 | 0.5481 |
| 3 | 0.4265 | 0.4403 | 0.4297 | 0.9122 | 0.4088 | 0.5064 | 0.6291 | 0.6173 | 0.9321 | 0.5388 |
| 4 | 0.4190 | 0.4330 | 0.4340 | 0.9109 | 0.4095 | 0.5101 | 0.6255 | 0.6202 | 0.9325 | 0.5428 |
| 5 | 0.4245 | 0.4376 | 0.4509 | 0.9128 | 0.4201 | 0.5174 | 0.6438 | 0.6370 | 0.9333 | 0.5586 |
| 6 | 0.4025 | 0.4204 | 0.4143 | 0.9084 | 0.3901 | 0.5174 | 0.6440 | 0.6348 | 0.9334 | 0.5570 |
| 7 | 0.4093 | 0.4131 | 0.4246 | 0.9099 | 0.3951 | 0.5174 | 0.6291 | 0.6306 | 0.9330 | 0.5500 |
| 8 | 0.4121 | 0.4320 | 0.4278 | 0.9105 | 0.4080 | 0.5028 | 0.6265 | 0.6173 | 0.9318 | 0.5387 |
| 9 | 0.4196 | 0.4209 | 0.4522 | 0.9125 | 0.4118 | 0.5174 | 0.6466 | 0.6238 | 0.9334 | 0.5504 |
| 10 | 0.4588 | 0.4634 | 0.4711 | 0.9178 | 0.4479 | 0.5119 | 0.6304 | 0.6223 | 0.9322 | 0.5473 |
| Mean | 0.4218 | 0.4304 | 0.4367 | 0.9119 | 0.4099 | 0.5136 | 0.6364 | 0.6263 | 0.9329 | 0.5487 |
Table 8. Performance of the KNN model trained on the original dataset.

| Model Id | Acc (Val) | P (Val) | R (Val) | Sp (Val) | F1 (Val) | Acc (Test) | P (Test) | R (Test) | Sp (Test) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.3791 | 0.2630 | 0.3017 | 0.9063 | 0.2661 | 0.9229 | 0.8912 | 0.9027 | 0.9880 | 0.8964 |
| 2 | 0.3736 | 0.2325 | 0.2497 | 0.9052 | 0.2285 | 0.9138 | 0.8862 | 0.9053 | 0.9867 | 0.8946 |
| 3 | 0.3462 | 0.1978 | 0.1898 | 0.8986 | 0.1874 | 0.9138 | 0.8857 | 0.9087 | 0.9872 | 0.8947 |
| 4 | 0.3516 | 0.2605 | 0.2365 | 0.8998 | 0.2213 | 0.9248 | 0.9071 | 0.9031 | 0.9877 | 0.9038 |
| 5 | 0.3242 | 0.2039 | 0.2128 | 0.8990 | 0.1911 | 0.9266 | 0.8931 | 0.9072 | 0.9891 | 0.8980 |
| 6 | 0.3407 | 0.3314 | 0.2909 | 0.9036 | 0.2865 | 0.9229 | 0.8937 | 0.9244 | 0.9884 | 0.9063 |
| 7 | 0.3791 | 0.3794 | 0.2863 | 0.9048 | 0.2596 | 0.9339 | 0.9109 | 0.9259 | 0.9897 | 0.9150 |
| 8 | 0.3297 | 0.2048 | 0.2278 | 0.8941 | 0.1970 | 0.9101 | 0.8745 | 0.9281 | 0.9865 | 0.8986 |
| 9 | 0.3297 | 0.2251 | 0.2130 | 0.8983 | 0.2111 | 0.9156 | 0.8841 | 0.8875 | 0.9869 | 0.8826 |
| 10 | 0.3846 | 0.3395 | 0.2699 | 0.9081 | 0.2512 | 0.9266 | 0.9014 | 0.9250 | 0.9891 | 0.9088 |
| Mean | 0.3539 | 0.2638 | 0.2478 | 0.9018 | 0.2300 | 0.9211 | 0.8928 | 0.9118 | 0.9879 | 0.8999 |
Table 9. Performance of the KNN model trained with synthetically augmented data.

| Model Id | Acc (Val) | P (Val) | R (Val) | Sp (Val) | F1 (Val) | Acc (Test) | P (Test) | R (Test) | Sp (Test) | F1 (Test) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.8585 | 0.8241 | 0.8512 | 0.9785 | 0.8352 | 0.9725 | 0.9644 | 0.9623 | 0.9954 | 0.9631 |
| 2 | 0.8592 | 0.8284 | 0.8457 | 0.9780 | 0.8334 | 0.9596 | 0.9492 | 0.9566 | 0.9933 | 0.9526 |
| 3 | 0.8571 | 0.8381 | 0.8248 | 0.9774 | 0.8254 | 0.9743 | 0.9737 | 0.9630 | 0.9952 | 0.9676 |
| 4 | 0.8688 | 0.8519 | 0.8448 | 0.9794 | 0.8404 | 0.9615 | 0.9487 | 0.9558 | 0.9938 | 0.9509 |
| 5 | 0.8757 | 0.8551 | 0.8662 | 0.9804 | 0.8581 | 0.9761 | 0.9656 | 0.9694 | 0.9961 | 0.9675 |
| 6 | 0.8606 | 0.8332 | 0.8383 | 0.9783 | 0.8297 | 0.9725 | 0.9584 | 0.9673 | 0.9959 | 0.9619 |
| 7 | 0.8510 | 0.8211 | 0.8236 | 0.9763 | 0.8155 | 0.9706 | 0.9593 | 0.9678 | 0.9953 | 0.9633 |
| 8 | 0.8661 | 0.8458 | 0.8462 | 0.9791 | 0.8409 | 0.9725 | 0.9640 | 0.9595 | 0.9956 | 0.9613 |
| 9 | 0.8736 | 0.8515 | 0.8517 | 0.9792 | 0.8485 | 0.9798 | 0.9700 | 0.9694 | 0.9966 | 0.9695 |
| 10 | 0.8468 | 0.8263 | 0.8259 | 0.9760 | 0.8222 | 0.9761 | 0.9685 | 0.9658 | 0.9959 | 0.9669 |
| Mean | 0.8617 | 0.8376 | 0.8418 | 0.9783 | 0.8349 | 0.9716 | 0.9622 | 0.9637 | 0.9953 | 0.9625 |
Table 10. Average, standard deviation (SD), and 95% confidence interval (CI95) values of evaluation metrics (validation stage) for the overall and comparative interpretation of the performance of hybrid ML models.

| Model | Statistic | Acc | P | R | Sp | F1 |
|---|---|---|---|---|---|---|
| RF (Original data) | Mean ± SD | 0.49176 ± 0.0406 | 0.18297 ± 0.0857 | 0.19055 ± 0.0132 | 0.89293 ± 0.0045 | 0.15541 ± 0.0172 |
| RF (Original data) | CI95 | [0.4627, 0.5208] | [0.1217, 0.2443] | [0.1811, 0.2000] | [0.8897, 0.8961] | [0.1431, 0.1677] |
| RF (Synthetic data) | Mean ± SD | 0.51846 ± 0.0157 | 0.74910 ± 0.0267 | 0.23610 ± 0.0099 | 0.89762 ± 0.0019 | 0.24022 ± 0.0153 |
| RF (Synthetic data) | CI95 | [0.5072, 0.5297] | [0.7300, 0.7682] | [0.2290, 0.2432] | [0.8963, 0.8990] | [0.2293, 0.2512] |
| SVM (Original data) | Mean ± SD | 0.54945 ± 0.0382 | 0.43615 ± 0.1034 | 0.26179 ± 0.0230 | 0.90949 ± 0.0044 | 0.24261 ± 0.0343 |
| SVM (Original data) | CI95 | [0.5221, 0.5768] | [0.3622, 0.5101] | [0.2453, 0.2782] | [0.9063, 0.9127] | [0.2181, 0.2672] |
| SVM (Synthetic data) | Mean ± SD | 0.80193 ± 0.0137 | 0.86051 ± 0.0123 | 0.70534 ± 0.0187 | 0.96113 ± 0.0025 | 0.76324 ± 0.0158 |
| SVM (Synthetic data) | CI95 | [0.7921, 0.8117] | [0.8517, 0.8693] | [0.6920, 0.7187] | [0.9594, 0.9629] | [0.7520, 0.7745] |
| NB (Original data) | Mean ± SD | 0.38845 ± 0.0427 | 0.18562 ± 0.0948 | 0.14983 ± 0.0227 | 0.13416 ± 0.0305 | 0.87765 ± 0.0046 |
| NB (Original data) | CI95 | [0.3579, 0.4190] | [0.1178, 0.2535] | [0.1336, 0.1661] | [0.1123, 0.1560] | [0.8744, 0.8809] |
| NB (Synthetic data) | Mean ± SD | 0.42184 ± 0.0151 | 0.43039 ± 0.0148 | 0.43667 ± 0.0171 | 0.91185 ± 0.0025 | 0.40990 ± 0.0164 |
| NB (Synthetic data) | CI95 | [0.4111, 0.4326] | [0.4198, 0.4410] | [0.4244, 0.4489] | [0.9101, 0.9136] | [0.3981, 0.4217] |
| KNN (Original data) | Mean ± SD | 0.35385 ± 0.0233 | 0.26379 ± 0.0646 | 0.24784 ± 0.0381 | 0.90178 ± 0.0044 | 0.22998 ± 0.0344 |
| KNN (Original data) | CI95 | [0.3372, 0.3705] | [0.2176, 0.3100] | [0.2206, 0.2751] | [0.8986, 0.9050] | [0.2053, 0.2546] |
| KNN (Synthetic data) | Mean ± SD | 0.862 ± 0.009 | 0.838 ± 0.013 | 0.842 ± 0.014 | 0.978 ± 0.001 | 0.835 ± 0.013 |
| KNN (Synthetic data) | CI95 | [0.8551, 0.8684] | [0.8284, 0.8467] | [0.8320, 0.8517] | [0.9773, 0.9793] | [0.8259, 0.8440] |
Table 11. Average, standard deviation (SD), and 95% confidence interval (CI95) values of evaluation metrics (test stage) for the overall and comparative interpretation of the performance of hybrid ML models.

| Model | Statistic | Acc | P | R | Sp | F1 |
|---|---|---|---|---|---|---|
| RF (Original data) | Mean ± SD | 0.93945 ± 0.0074 | 0.96514 ± 0.0051 | 0.89860 ± 0.0145 | 0.98717 ± 0.0018 | 0.92882 ± 0.0090 |
| RF (Original data) | CI95 | [0.9342, 0.9447] | [0.9615, 0.9688] | [0.8882, 0.9090] | [0.9859, 0.9885] | [0.9224, 0.9352] |
| RF (Synthetic data) | Mean ± SD | 0.94239 ± 0.0078 | 0.96797 ± 0.0064 | 0.90105 ± 0.0115 | 0.98768 ± 0.0017 | 0.93127 ± 0.0084 |
| RF (Synthetic data) | CI95 | [0.9368, 0.9480] | [0.9634, 0.9726] | [0.8928, 0.9093] | [0.9865, 0.9889] | [0.9253, 0.9373] |
| SVM (Original data) | Mean ± SD | 0.82367 ± 0.0060 | 0.93013 ± 0.0054 | 0.69783 ± 0.0092 | 0.96311 ± 0.0014 | 0.77121 ± 0.0081 |
| SVM (Original data) | CI95 | [0.8194, 0.8280] | [0.9263, 0.9340] | [0.6913, 0.7044] | [0.9621, 0.9641] | [0.7654, 0.7770] |
| SVM (Synthetic data) | Mean ± SD | 0.90955 ± 0.0046 | 0.94054 ± 0.0064 | 0.84974 ± 0.0076 | 0.98164 ± 0.0009 | 0.88808 ± 0.0070 |
| SVM (Synthetic data) | CI95 | [0.9062, 0.9129] | [0.9359, 0.9451] | [0.8443, 0.8552] | [0.9810, 0.9822] | [0.8831, 0.8931] |
| NB (Original data) | Mean ± SD | 0.86514 ± 0.0083 | 0.87319 ± 0.0087 | 0.87863 ± 0.0124 | 0.87162 ± 0.0096 | 0.97721 ± 0.0017 |
| NB (Original data) | CI95 | [0.8592, 0.8711] | [0.8670, 0.8794] | [0.8697, 0.8875] | [0.8647, 0.8785] | [0.9760, 0.9784] |
| NB (Synthetic data) | Mean ± SD | 0.51357 ± 0.0058 | 0.63643 ± 0.0089 | 0.62629 ± 0.0071 | 0.93291 ± 0.0007 | 0.54866 ± 0.0070 |
| NB (Synthetic data) | CI95 | [0.5095, 0.5177] | [0.6300, 0.6428] | [0.6212, 0.6314] | [0.9324, 0.9334] | [0.5436, 0.5537] |
| KNN (Original data) | Mean ± SD | 0.92110 ± 0.0075 | 0.89279 ± 0.0111 | 0.91179 ± 0.0134 | 0.98793 ± 0.0011 | 0.89988 ± 0.0090 |
| KNN (Original data) | CI95 | [0.9158, 0.9264] | [0.8848, 0.9007] | [0.9022, 0.9214] | [0.9871, 0.9887] | [0.8934, 0.9063] |
| KNN (Synthetic data) | Mean ± SD | 0.972 ± 0.006 | 0.962 ± 0.008 | 0.964 ± 0.005 | 0.995 ± 0.001 | 0.962 ± 0.006 |
| KNN (Synthetic data) | CI95 | [0.9670, 0.9761] | [0.9562, 0.9681] | [0.9601, 0.9673] | [0.9946, 0.9960] | [0.9580, 0.9670] |
Table 12. Average, standard deviation (SD), and 95% confidence interval (CI95) values of ROC-AUC.

Model Id | Validation (Mean ± SD, CI95) | Test (Mean ± SD, CI95)
RF (Original data) | 0.77745 ± 0.01688, [0.7654, 0.7895] | 0.9948 ± 0.0013, [0.9939, 0.9957]
RF (Synthetic data) | 0.9017 ± 0.00518, [0.898, 0.9054] | 0.99672 ± 0.00211, [0.9952, 0.9982]
SVM (Original data) | 0.88203 ± 0.01085, [0.8743, 0.8898] | 0.99016 ± 0.0014, [0.9892, 0.9912]
SVM (Synthetic data) | 0.97967 ± 0.0015, [0.9786, 0.9807] | 0.995325 ± 0.00098, [0.9946, 0.996]
NB (Original data) | 0.51393 ± 0.01398, [0.5039, 0.5239] | 0.93027 ± 0.00719, [0.9251, 0.9354]
NB (Synthetic data) | 0.69413 ± 0.01001, [0.687, 0.7013] | 0.80661 ± 0.00263, [0.8047, 0.8085]
KNN (Original data) | 0.5748 ± 0.02081, [0.5599, 0.5897] | 0.94984 ± 0.00695, [0.9449, 0.9548]
KNN (Synthetic data) | 0.91004 ± 0.0075, [0.9047, 0.9154] | 0.97951 ± 0.00301, [0.9774, 0.9817]
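The per-fold aggregation reported in the tables above (mean, SD, and CI95 of each metric across the k cross-validation folds) can be sketched as follows. This is a minimal illustration, not the authors' exact code: the function name `summarize_folds` and the normal-approximation confidence interval (z = 1.96) are assumptions, since the paper does not state which CI formula was used.

```python
import math
import statistics

def summarize_folds(scores, z=1.96):
    """Aggregate one metric's per-fold values into mean, sample SD,
    and a normal-approximation 95% confidence interval."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)            # sample SD (n - 1 denominator)
    half = z * sd / math.sqrt(len(scores))   # CI half-width
    return m, sd, (m - half, m + half)

# Example: five hypothetical per-fold accuracy values
mean_acc, sd_acc, ci95 = summarize_folds([0.96, 0.97, 0.95, 0.96, 0.98])
```

A t-distribution critical value instead of z = 1.96 would widen the interval slightly for small k; either choice is common when summarizing k-fold results.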
Share and Cite

Rojas-Pérez, F.J.; Conde-Sánchez, J.R.; Morlett-Paredes, A.; Moreno-Barbosa, F.; Ramos-Fernández, J.C.; Luna-Muñoz, J.; Vargas-Hernández, G.; Jaramillo-Loranca, B.E.; Xicotencatl-Pérez, J.M.; Pérez-Pérez, E.G. Exploratory Study on Hybrid Systems Performance: A First Approach to Hybrid ML Models in Breast Cancer Classification. AI 2026, 7, 29. https://doi.org/10.3390/ai7010029