1. Introduction
The problem of classification requires data to train the classifiers. A larger availability of data implies that, in general, we can train a more accurate classifier tool. However, constantly annotating new data for novel tasks, new specific cases, and samples is a costly task. In addition, adding only new data does not guarantee a successful improvement of the classifier. Rather, one needs to add meaningful data samples that would provide additional information to the classifier to learn [
1]. Data selection and data generation for Machine Learning (ML) is a well-studied topic. For instance, in [
2], features are selected to provide the best training support. In [
3], data are statistically analyzed to provide the best coverage of the problem space. In general, when a dataset is prepared, the data are also statistically analyzed for different measures in order to represent a good sample distribution.
To improve the training and generalization ability of the trained decision-making Algorithm, data generation has recently been used to achieve moderate improvements in certain tasks. For instance, in [
4], the generation and augmentation of a real dataset by synthetic data improved the generalization ability of the model, while using synthetic data for training large language models can lead to a serious performance reduction [
5].
However, in a real-world context, no single Algorithm can provide the best processing for every input instance. This effect has been described, for instance, in the framework of the no-free-lunch theorem [
6] or complementary performance [
7,
8]. The no-free-lunch theorem states that for a given problem, no single Algorithm can solve all instances of the problem with the highest accuracy. Algorithm complementarity expands this concept into the observation that different Algorithms, even when trained on the same dataset, will exhibit different per-instance accuracy due to Algorithm bias. Per-instance Algorithm selection is a meta-strategy that aims to solve this problem by learning each Algorithm’s performance over a certain number of samples and then generalizing this knowledge. To improve the classification task, we start from a set of pretrained component Algorithms. Then, we construct an Algorithm selection mechanism that chooses the best component Algorithm on a case-by-case basis. The general methodology is to train an Algorithm classifier using a set of meta-features that capture Algorithm performance and related information. The standard approaches are, for instance, based on meta-features [
9], reasoning [
10], parallel execution [
11], Algorithm configuration [
12], etc. In all the cases, however, the data and the features must be able to describe the Algorithms’ performance well enough to allow for a reliable Algorithm selection [
13].
The training of the meta-classifier, i.e., the Algorithm selector, however, also requires data. The data can be obtained either from a set of runs of the component Algorithms on training data or from a set of solved problem instances. The main difference is that one can use the Algorithm evaluation as a conceptual step towards task change [
14]. In ideal conditions, one would always want to have a specific dataset that would reveal as much information about the individual Algorithms as possible. However, in general, training data for systems in production might not be available and not enough novel annotated data might be generated. For this purpose, we consider a specific scenario for optimization. Let the problem of training an Algorithm selector be a function of available data. Considering a dataset
we analyze (a) how much data are experimentally necessary to train an Algorithm selector and (b) how much data augmentation using data generation can improve the selector accuracy. In particular, we study what amount of natural and synthetic data is necessary to train an Algorithm selector that outperforms the best Algorithm, what properties of the selector allow us to efficiently select component Algorithms, and how to generate synthetic data that yield the highest Algorithm selection accuracy. Finally, we evaluate the different approaches to Algorithm selection on input image classification.
While data augmentation by synthetic data has been effectively used in various machine learning tasks, so far there is no work related to Algorithm selection using a synthetic dataset. Algorithm selection has been evaluated on various mechanisms as well as on an increasing number of features or meta-features, and therefore we study whether the additional information required for the generalization ability of an Algorithm selector can be generated using a synthetic data generator.
The results of this paper can be summarized as follows:
The evaluation of synthetic data for Algorithm selection;
An analysis of synthetic vs. real-world data with respect to Algorithm selection accuracy;
The determination of the relation of Algorithms’ complementarity for Algorithm selection in classification problems.
This paper is organized as follows: in
Section 2, an overview of the previous work is given; in
Section 3, we describe the methodology used. In
Section 4 we describe the experimental settings and in
Section 5, we describe the experiments and the obtained results. Finally, in
Section 6 we discuss the results and in
Section 7 we conclude the paper.
2. Background
The authors of [
8] examine the trade-off between per-instance Algorithm selection and running multiple Algorithms in parallel. While Algorithm selection leverages complementary performance among Algorithms, it requires complex models and additional computation. In contrast, parallel execution is feasible on modern hardware but can lead to resource contention and slowdowns. The results suggest that Algorithm selection is beneficial, particularly for large Algorithm portfolios.
The work [
15] proposed an approach for decreasing the time needed for Algorithm selection training. Usually, the training process includes the evaluation of the performance of all pretrained Algorithms on the testing set, which can take a lot of time. The authors propose to start by evaluating small batches of testing data and then gradually increase the size of the evaluation data. During the process, Algorithms can be excluded from the pool based on the cross-validation between their learning curves. This allows for the early exclusion of under-performing Algorithms and can decrease the required training time by 20% to 50% without a noticeable loss of accuracy.
The current literature shows that Algorithm selection can marginally improve classification in many problems. Most applications achieve a 1–5% improvement in relative classification accuracy [
16]. Among many studies, the best results for simple Algorithm selection classifiers are often shown by Random Forest and Support Vector Machine (SVM) classifiers. Boosted ensemble classifiers created with AdaBoost also show good results and outperform single classifiers in some cases.
The authors of [
17] trained a range of content-similarity-assessing Algorithms and used a Random Forest classifier to choose the best one for a particular document instance. The Random Forest classifier had 15% better accuracy than the random choice strategy for Algorithm selection. However, Random Forest-based Algorithm selection still had lower accuracy than the best single Algorithm (0.51 vs. 0.58).
The authors of work [
13] applied Algorithm selection for segmentation problems on the Microsoft Common Objects in Context (MSCOCO) and Visual Object Classes Challenge 2012 (VOC2012) datasets without prior extraction of meta-features. For the VOC2012 dataset, the SVM Algorithm selector showed the best accuracy, while for the MSCOCO dataset, the AdaBoost ensemble showed the best result. The best overall result was achieved using an ensemble of five predetermined Algorithms, resulting in 3% higher accuracy than the best single Algorithm. In another work, the authors of [
10] applied a per-object Algorithm selector by dividing the image into parts and using Algorithm selection for choosing the segmentation Algorithm for each part. This approach resulted in 2% improvement in accuracy using combinations of Artificial Neural Network (ANN) + Preference Rules and ANN + SVM classifiers.
Considerable work has gone into creating fully automated ML frameworks that can be grouped under the term AutoML. The goal of AutoML systems is to provide users with a fully working and automated pipeline for analyzing the data, choosing the machine learning approach, configuring the Algorithm, and evaluating the results. SmartML [
18] and AutoSklearn [
19] are examples of such automated suites. SmartML, being the simpler one, executes an extensive grid search on the range of pretrained Algorithms as well as a range of Algorithm selection classifiers. It also experiments with different training, validation, and test splits in the process. The AutoSklearn framework was created for the AutoML competition with the strict requirement of giving the best predictions in under 20 min. The authors achieved the best result by cutting out unpromising combinations for a grid search. They employed a resource-management strategy that devotes processing time and power to the most promising combinations based on pre-collected data.
Some works concentrate on preemptively predicting the potential error rate of the classifier based on the uncertainty [
20] and imbalance [
21] of input data. The preemptive extraction and analysis of datasets’ meta-features allow this approach to improve Algorithm selection recommendations. The extensive analysis of classification performance on different datasets revealed that in some cases the imbalance of the training dataset may improve the resulting classification accuracy.
The paper [
22] analyzes the optimization dynamics of Generative Adversarial Networks (GANs), focusing on gradient descent updates for both the generator and discriminator. The authors show that while GAN optimization is not a convex–concave game, equilibrium points in traditional GANs remain locally stable under certain conditions. In contrast, Wasserstein GANs (WGANs) can exhibit non-convergent limit cycles near equilibrium. To address these stability issues, the paper proposes a regularization term that improves local stability, accelerates convergence, and mitigates mode collapse.
The work [
23] investigates the quality of conventional evaluation techniques for measuring GAN performance, such as the Inception Score and Fréchet Inception Distance. The authors propose two additional evaluation techniques. The proposed GAN-train technique evaluates the diversity and realism of a GAN’s generated images by training a classifier on synthetic images and testing it on real ones, assessing how well the learned distribution matches the target distribution. In contrast, the proposed GAN-test technique measures the classifier’s performance when trained on real images and tested on generated ones, capturing the fidelity of the synthetic data. The results demonstrate that these evaluation methods complement existing ones and, when used together, provide a more comprehensive assessment of GAN performance.
The study [
24] explores the applicability of synthetic images from text-to-image models for image recognition tasks, focusing on zero-shot, few-shot learning, and large-scale model pretraining. The authors find that synthetic data significantly improve classification accuracy in zero-shot settings and remain beneficial in few-shot scenarios, though domain gaps pose challenges. They propose strategies like diversified text prompts and real-image-guided generation to enhance effectiveness. Additionally, synthetic data prove highly effective for model pretraining, sometimes surpassing ImageNet pretraining, particularly for architectures based on Vision Transformers. The study highlights both the potential and limitations of synthetic data, encouraging further research in this area.
The authors of [
25] use hidden Markov models and regression models to generate synthetic time-series data based on a real smart-home dataset. As part of their research, they investigated a problem of generating the synthetic data from a small amount of real-world samples. Their results show that training Algorithms on a combination of a small real data sample and the synthetic data generated from it can achieve significantly higher activity recognition accuracy than training on the small real data sample alone.
Despite extensive research in both Algorithm selection and synthetic data generation, no prior work has explored the use of synthetically generated data for training Algorithm selectors. Algorithm selection has been widely studied across various domains, demonstrating improvements in classification accuracy and efficiency, while synthetic data generation has shown promise in augmenting training datasets, improving model performance, and reducing reliance on large real-world datasets. However, the potential of leveraging synthetic data to enhance the training process of Algorithm selectors remains unexplored. This gap in the literature suggests an opportunity to investigate whether synthetic data can improve the generalization and robustness of Algorithm selection models, particularly in cases where real-world training data are scarce or expensive to obtain.
3. Methodology
Algorithm selection has been shown to have excellent results with meta-features [
7,
18,
26] and some success with regular features [
13,
14,
27]. In both cases, analyzing the whole dataset can provide valuable statistics-based meta-features, but it also requires a lot of labeled data. To avoid this problem, we study Algorithm selection training using a synthetic dataset. The overall approach is shown in
Figure 1. We focus on medical image classification. The task is defined as a mapping
; for each image
, determine the label
.
The Algorithm selection approach improves performance by dynamically choosing the optimal Algorithm for each input instance [
28,
29]. Unlike ensembles that combine multiple Algorithm outputs, Algorithm selection chooses a single best Algorithm for each specific input, leveraging the complementary strengths of different Algorithms across the input space [
7]. This meta-learning strategy works by training a selector that learns to map input features to the most suitable Algorithm from a portfolio of pretrained component Algorithms [
30]. The selector essentially learns which Algorithm performs best on which types of inputs, allowing the system to achieve higher overall accuracy than any single Algorithm could achieve alone [
8,
9].
The method starts with a set of pretrained Algorithms or by training a set of Algorithms for the classification task defined by mapping C. We prepare an initial set of data by separating it into three subsets: training, validation, and testing. Using the training dataset, we train a set of binary classifiers. Each Algorithm is evaluated on the validation dataset, and its accuracy is recorded. The Algorithms are all trained on the same training dataset and will be referred to as component Algorithms.
Next, from the validation dataset, we draw a subset of samples, referred to as the seed dataset. Using the seed dataset, we train a Generative Adversarial Network (GAN). Then, using the trained GAN, we create a new set of synthetic images. Specifically, for each label in the dataset, we train a new GAN and generate a target number of synthetic images. For each sample x of the resulting synthetic dataset, we determine the best Algorithm. The best Algorithm is any Algorithm that classifies x correctly, i.e., whose predicted label equals the ground-truth label for the input image x. For synthetic images, it is important to note that we generate them class by class. When training the GAN on images of a specific class, e.g., “pneumonia” or “healthy”, the generated synthetic images inherit the same class label as their training data source. Thus, when we refer to an Algorithm classifying a synthetic image “correctly”, we mean it assigns the image to the same class used to train the GAN that generated it. This approach provides unambiguous ground-truth labels for all synthetic images used in our experimental evaluation.
Using the synthetic images and the best Algorithms, we build the Algorithm selector training dataset. Its inputs are the synthetic images, and its outputs are the Algorithm names, which form the selector’s label set. Using this dataset, we train the Algorithm selector. Once the Algorithm selector is trained, we evaluate its accuracy and compare it to the accuracy of the individual Algorithms using the test dataset.
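The construction of the selector training dataset described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `build_selector_dataset`, the toy Algorithms, and the choice to skip images that no Algorithm classifies correctly are assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def build_selector_dataset(
    images: Sequence[object],
    class_labels: Sequence[int],
    algorithms: Dict[str, Callable[[object], int]],
) -> List[Tuple[object, List[str]]]:
    """Pair each image with the names of all Algorithms that classify it correctly."""
    dataset = []
    for image, label in zip(images, class_labels):
        correct = [name for name, predict in algorithms.items() if predict(image) == label]
        # Assumption: an image that no Algorithm classifies correctly is skipped,
        # because it provides no positive selection target.
        if correct:
            dataset.append((image, correct))
    return dataset


# Toy stand-ins for component Algorithms (hypothetical): images are plain integers.
toy_algorithms = {
    "always_zero": lambda x: 0,
    "parity": lambda x: x % 2,
}
selector_dataset = build_selector_dataset([1, 2, 3], [1, 0, 0], toy_algorithms)
print(selector_dataset)
```

In the toy run, the second image is classified correctly by both stand-in Algorithms, which is exactly the multiple-correct-answers situation that the selector training has to resolve.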
The GAN used can be described as follows. Let
represent the mapping performed by the GAN, with
an input drawn from a distribution of random pixel intensities and
S being an output image. According to the original definition of GAN [
31], the GAN plays a zero-sum game, where the generator is trained to make more and more realistic images and the discriminator is trained to differentiate between synthetic and real input images. Both the generator and the discriminator are Convolutional Neural Networks (CNN), in order to enable the generation and recognition of images.
The classification task studied in this paper
is a binary task: the target is to determine if an input image shows a patient with a diagnosis ‘pneumonia’ or a patient without one, ‘healthy’. To assess the accuracy of the classification
C, we use a direct accuracy measure. If the classification can be expressed as
with
and
and
, then the accuracy measure for a single Algorithm is shown in Equation (
1).
where
is the ground truth for the input image
and
n is the number of samples in the dataset under evaluation. This accuracy is the ratio of correctly classified images over the total number of images under evaluation.
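A form of this accuracy measure that is consistent with the description (the ratio of correctly classified images to all evaluated images) can be written as below. The symbols a (a component Algorithm), x_i (the i-th input image), and l_i (its ground-truth label) are notation assumed for this sketch and may differ from the notation of the original equation.

```latex
\begin{equation}
\mathrm{Acc}(a) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ a(x_i) = l_i \right]
\end{equation}
```

Here the indicator function equals 1 when the Algorithm's predicted label matches the ground truth and 0 otherwise.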
When evaluating the Algorithm selector approach, Equation (
1) is modified to take into account the Algorithm selection process. First, let
be the Algorithm selected by the Algorithm selection for the input sample
. Second, let the Algorithm selection be described by
. Finally, the accuracy for the Algorithm selector can be written as in Equation (
2).
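A form of the selector accuracy that matches this description is sketched below. The notation is assumed for illustration: s(x_i) denotes the component Algorithm chosen by the selector for image x_i, l_i is the ground-truth image label, and n is the number of evaluated samples.

```latex
\begin{equation}
\mathrm{Acc}_{\mathrm{sel}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[ s(x_i)(x_i) = l_i \right]
\end{equation}
```

The only change with respect to the single-Algorithm accuracy is that the classifier applied to each image is the one chosen by the selector for that image.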
Note that the mapping
is purposefully abstracted in order to allow the evaluation of the Algorithm selector using the image classification accuracy measure. This means that we do not directly measure the Algorithm selector accuracy with respect to the labels
but rather we measure the final classification accuracy over the set of image labels
L by implicitly learning the labels
maximizing the classification accuracy. In general, we will refer to accuracy from Equations (
1) and (
2) by
.
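The two accuracy measures can be illustrated with a short sketch. The function names and the toy Algorithms are hypothetical; the point is only that the selector is scored by the image label produced by the Algorithm it selects, not by whether it picked a particular Algorithm.

```python
from typing import Callable, Dict, Sequence

Classifier = Callable[[object], int]


def accuracy(algorithm: Classifier, images: Sequence[object], labels: Sequence[int]) -> float:
    """Fraction of images whose predicted label equals the ground-truth label."""
    correct = sum(1 for x, label in zip(images, labels) if algorithm(x) == label)
    return correct / len(labels)


def selector_accuracy(
    selector: Callable[[object], str],
    algorithms: Dict[str, Classifier],
    images: Sequence[object],
    labels: Sequence[int],
) -> float:
    """Same measure, but each image is classified by the Algorithm the selector picks."""
    correct = sum(
        1 for x, label in zip(images, labels) if algorithms[selector(x)](x) == label
    )
    return correct / len(labels)


# Toy data (hypothetical): integer "images" and two simple stand-in Algorithms.
images = [0, 1, 2, 3]
labels = [0, 1, 0, 0]
algorithms = {"always_zero": lambda x: 0, "parity": lambda x: x % 2}

acc_always_zero = accuracy(algorithms["always_zero"], images, labels)
acc_parity = accuracy(algorithms["parity"], images, labels)
acc_selector = selector_accuracy(
    lambda x: "parity" if x < 3 else "always_zero", algorithms, images, labels
)
print(acc_always_zero, acc_parity, acc_selector)
```

In this toy case each stand-in Algorithm fails on a different image, so a selector that routes each image to the right Algorithm exceeds both of them.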
We use different Algorithms on the same dataset to obtain different Algorithm biases rather than different data biases. Each Algorithm is trained according to its specific training requirements. In most cases, the classification Algorithms are CNN-based.
Data Preparation and Construction
The data generation was primarily performed using a GAN [
31]. While we initially explored Stable Diffusion [
32] as an alternative generative model, all the experiments reported in this paper were conducted using GAN-generated images to maintain consistency in our comparative analysis. Our preliminary tests with Stable Diffusion suggested similar patterns to those observed with GANs, but a comprehensive comparison between different generative models for Algorithm selection remains an interesting direction for future work.
The specifics of the data generation methodology are as follows. Both parts of the GAN (a generator and a discriminator) are implemented as convolutional neural networks. The schematic representation of the GAN training is shown in
Figure 2. The general principle of training a GAN can be explained as follows. Propagate a latent vector
z of randomly generated noise to obtain a set of synthetic samples that, together with some real images, form the discriminator training set. Then, train the discriminator by setting the Training Generator variable to 0 to generate an output from the discriminator and update its parameters (dashed-line discriminator update). Repeat the process, but this time set the Training Generator variable to 1 to propagate the result of the discriminator to the generator and update its parameters (dashed-line generator update). Repeating this process makes the discriminator converge to a high accuracy in discriminating between real and synthetic images and makes the generator converge towards generating synthetic samples very close to the real images.
The generator takes a latent vector z and maps it to an 8 × 8 feature space using a fully connected layer. It then applies four transposed convolution layers with a kernel size of 4 × 4 and a stride of 2, progressively upsampling to 128 × 128. Each layer uses LeakyReLU activation (a = 0.2), except for the final convolutional layer, which applies Tanh to produce an image.
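The upsampling path of the generator can be checked with the standard output-size formula for a transposed convolution (without dilation or output padding). The padding of 1 used below is an assumption, since the padding is not stated in the text; with a 4 × 4 kernel and a stride of 2, this padding makes every layer exactly double the resolution.

```python
from typing import List


def transposed_conv_output_size(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Spatial output size of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def generator_resolutions(start: int = 8, layers: int = 4) -> List[int]:
    """Resolution after each transposed convolution, starting from the 8 x 8 feature space."""
    sizes = [start]
    for _ in range(layers):
        sizes.append(transposed_conv_output_size(sizes[-1]))
    return sizes


resolutions = generator_resolutions()
print(resolutions)
```

Under this assumption, the four layers take the 8 × 8 feature space to the 128 × 128 output resolution given in the text.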
The discriminator is a convolutional network that processes input images through five convolutional layers with a kernel size of 5 × 5, progressively downsampling the resolution from 128 × 128 to 8 × 8 with a stride of 2. Each layer uses LeakyReLU activation. The feature maps are then flattened, followed by a dropout layer (0.4) and a fully connected layer with a sigmoid activation for binary classification.
We use the Inception Score (IS) and the Fréchet Inception Distance (FID) to assess the quality and variety of generated images. The IS is calculated by applying a pretrained Inception V3 model to a sample of generated images. The FID is calculated by first extracting high-level features from both the real and the generated image sets, then computing the mean and covariance statistics of both feature sets and comparing them to each other.
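For reference, the commonly used definitions of the two measures are given below. The symbols are standard notation assumed here rather than taken from the text: p(y | x) is the Inception V3 class posterior for a generated image x, p(y) is its marginal over generated images, and (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the Inception features of the real and generated images, respectively.

```latex
\begin{align}
\mathrm{IS} &= \exp\left( \mathbb{E}_{x \sim p_g} \left[ D_{\mathrm{KL}}\left( p(y \mid x) \,\|\, p(y) \right) \right] \right) \\
\mathrm{FID} &= \left\| \mu_r - \mu_g \right\|_2^2 + \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
\end{align}
```

A higher IS indicates more confident and more diverse class predictions on the generated images, while a lower FID indicates that the feature statistics of the generated images are closer to those of the real images.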
4. Experimental Settings
4.1. Algorithms
To perform Algorithm selection on this task, we needed a pool of image classification Algorithms. We used the following CNN-based image classification neural networks from the PyTorch “Torchvision” and “Pretrained models for Pytorch” libraries: Dual-Path Network (DPN), Residual Networks (ResNet), Densely Connected Convolutional Networks (DenseNet), Very Deep Convolutional Networks (VGG), and SqueezeNet. The summary of the Algorithms’ main features is shown in
Table 1. We used pretrained checkpoints for each Algorithm to lower the training time and data requirements, since the models have already learned to extract features effectively by training on a vast corpus of images. The pretrained checkpoints were obtained from the Torchvision library, which provides models trained on large-scale datasets such as ImageNet. Each model was initialized with these pretrained weights and subsequently fine-tuned on our dataset for several training epochs.
Unless stated otherwise in the experiments, every component Algorithm is trained for up to five epochs. The main goal of this work is not to use the best-performing Algorithms but rather to evaluate Algorithm selection under specific training constraints. The focus is therefore on whether the Algorithm selector improves through the use of synthetic data rather than on how the accuracy improves through further training, and five epochs were chosen as a reasonable amount of training for the dataset. In addition, the Algorithms’ accuracies vary enough for complementarity to be observed. The Algorithms were trained to display a certain amount of complementary performance; that is, we are interested in Algorithms whose sets of correctly and incorrectly classified samples do not overlap completely, even if their overall classification accuracies differ.
4.2. Algorithm Selectors
To train Algorithm selectors, we extracted feature representations from input images using a pretrained AlexNet [
38] model. Specifically, we used the intermediate activations from five convolutional layers of the network as a compact image descriptor. This approach enables the capture of meaningful visual patterns at various levels of abstraction, while keeping the feature dimensionality within a practical range. All input images were resized to 128 × 128 before being passed through the feature-extraction pipeline. For each image, we computed the mean activation values across feature maps from the first, fourth, seventh, ninth, and eleventh convolutional layers. The resulting values were concatenated into a single feature vector. Each vector was then labeled with the identity of the best-performing classification Algorithm for the corresponding image, determined based on validation accuracy.
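The construction of the feature vector can be sketched independently of the network itself. The example below assumes one reading of the description, namely that each feature map (channel) is reduced to its spatial mean and that the per-layer means are concatenated; the function names and the tiny hand-written activations are hypothetical stand-ins for real AlexNet activations.

```python
from statistics import fmean
from typing import List

FeatureMap = List[List[float]]  # one channel: rows x columns of activations


def channel_means(layer_activations: List[FeatureMap]) -> List[float]:
    """Reduce every feature map of one layer to its mean activation."""
    return [fmean(value for row in channel for value in row) for channel in layer_activations]


def build_descriptor(activations_per_layer: List[List[FeatureMap]]) -> List[float]:
    """Concatenate the per-channel means of all tapped layers into one feature vector."""
    descriptor: List[float] = []
    for layer_activations in activations_per_layer:
        descriptor.extend(channel_means(layer_activations))
    return descriptor


# Hypothetical activations: a layer with two 2 x 2 feature maps and a layer with one 1 x 1 map.
layer_one = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 2.0]]]
layer_two = [[[3.0]]]
descriptor = build_descriptor([layer_one, layer_two])
print(descriptor)
```

With this reading, the length of the descriptor equals the total number of feature maps across the tapped layers, which keeps the dimensionality small compared with the raw activations.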
For the evaluation of the methodology, we used nine different shallow Algorithm selectors: ExtraTreesClassifier [
39], SVM [
40], DecisionTreeClassifier [
41], KNeighborsClassifier [
42], GradientBoostingClassifier [
43], SGDClassifier [
44], LogisticRegression [
45], RandomForestClassifier [
46], and MLPClassifier [
47]. We represent them as
, where
is the set of all Algorithm selectors.
Each classifier is trained in an identical manner, described specifically for each experiment, and learns the mapping from image features to Algorithm labels. In this context, meta-learning refers to the process where the Algorithm selector learns to predict which component Algorithm will perform best for a given input. This approach falls under the broader meta-learning framework, where knowledge is transferred across learning episodes, allowing the system to learn from previous Algorithm performances to improve future selections. The Algorithm selector essentially operates at a meta level, learning patterns about when different Algorithms perform well rather than solving the original classification task directly. However, because the task at hand is classification, we have to prepare the training data in a specific manner to address the problem of multiple correct answers during the learning of the selector. During the learning process of the selector, each dataset sample is to be mapped to a unique Algorithm label. However, because the component Algorithms are not fully complementary, many samples are correctly classified by more than one Algorithm, which creates a one-to-many mapping. Therefore, during the training process, one has to decide which target Algorithm is given to the classifier as the training label.
We explore three possible strategies. The first strategy assigns the target label of the component Algorithm with the highest confidence score. The second strategy assigns the target label randomly. The third strategy creates as many samples as there are target labels and includes them in the final dataset with equal probability.
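The three labeling strategies can be sketched as follows. The function names are hypothetical, and the input is assumed to be a dictionary that maps the name of each Algorithm that classified the sample correctly to its confidence score. The third function emits one training sample per correct Algorithm, which is one way to read "equal probability".

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[tuple, str]  # (feature vector, target Algorithm name)


def label_by_confidence(features: tuple, confidences: Dict[str, float]) -> List[Sample]:
    """Strategy 1: keep only the correct Algorithm with the highest confidence."""
    best = max(confidences, key=confidences.get)
    return [(features, best)]


def label_randomly(features: tuple, confidences: Dict[str, float], rng: random.Random) -> List[Sample]:
    """Strategy 2: pick one of the correct Algorithms uniformly at random."""
    return [(features, rng.choice(sorted(confidences)))]


def label_all(features: tuple, confidences: Dict[str, float]) -> List[Sample]:
    """Strategy 3: emit one training sample per correct Algorithm."""
    return [(features, name) for name in sorted(confidences)]


# Hypothetical sample that two Algorithms classify correctly.
features = (0.1, 0.2)
correct_algorithms = {"resnet": 0.91, "vgg": 0.74}

by_confidence = label_by_confidence(features, correct_algorithms)
by_random = label_randomly(features, correct_algorithms, random.Random(0))
by_all = label_all(features, correct_algorithms)
print(by_confidence, by_random, by_all)
```

The first two strategies keep the dataset size unchanged, whereas the third enlarges it and presents the same features with several different targets.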
For the evaluation of the studied Algorithm selection, we compare it to two different approaches. First, we compare the meta-strategy to the most accurate component Algorithm (also referred to as the best Algorithm). Second, we compare the Algorithm selector trained on synthetic data to the Algorithm selector trained on real data. We will refer to the Algorithm selector trained on synthetic data as SAS (standing for Synthetic data Algorithm Selection) and to the Algorithm selector trained on real data as RAS (standing for Real data Algorithm Selection).
4.3. Dataset
The component Algorithms were trained on the training part of the dataset
. We used the chest X-ray dataset taken from [
48,
49]. The original dataset contains 5232 training images (1349 healthy, 3883 pneumonia) and 624 test images (234 healthy and 390 pneumonia).
All images in the dataset were cropped to remove the markers that show the orientation of the X-ray capture. In [
50], the authors demonstrated that Algorithms can focus on these and other markers to improve accuracy. This happens because different hospitals have different markers and different rates of sick vs. healthy individuals. X-ray images in which the subject had an unnatural posture (e.g., hunched or distorted) or had medical devices such as tubes attached were manually removed. These images were removed because an unnatural posture or the presence of medical attachments often indicates a patient with a severe medical condition. Allowing Algorithms to rely on these indicators for classification could lead to biased results, shifting focus away from assessing the actual condition of the lungs.
The synthetic data are produced by training a GAN separately for each label in the dataset (healthy or pneumonia). To assess the convergence of GAN training, we use the Inception Score [
51] and the Fréchet Inception Distance [
52]. The data for training the GANs were taken from the validation part of the dataset.
The procedure to prepare the different datasets from the original train and test sets is shown in
Figure 3. All experiments were performed on a balanced dataset. Therefore, we equalized the number of images with healthy and pneumonia labels in the train and test parts of the dataset. The balanced test dataset is
. The train dataset was then split into a new train
and validation
parts using 75% and 25% of the dataset, respectively. The seed dataset
is created on a case-by-case basis. When the size of
, all samples
are drawn from
. If the size is larger than the size of the validation dataset
, all additional samples are drawn randomly from
. The
is then used as the training set for the synthetic data generator.
The resulting balanced dataset contains 1800 training images (900 healthy, 900 pneumonia), 540 validation images (270 healthy, 270 pneumonia), and 456 test images (228 healthy, 228 pneumonia).
Unless specified otherwise, the component Algorithms are trained on the training part of the dataset, and the Algorithm selector is trained on the Algorithm selector training dataset. The accuracy is always assessed on the test part of the dataset unless specified otherwise. The seed dataset is created as a subset of the original validation dataset (containing only real images). The Algorithm selector training dataset is constructed from either purely synthetic or purely real images.
5. Experiments
5.1. Component Algorithm Accuracy
The first set of experiments aims to verify the accuracy of the individual Algorithms.
Table 2 displays how accurate each Algorithm is for classifying both the validation dataset
and the test set
. The first column shows the Algorithm name, the second column shows the accuracy on the validation dataset of real images
, and the third column shows the accuracy on the test dataset
. The last column shows the accuracy on the validation dataset that contains only synthetic images
. All the Algorithms were trained for five epochs on the train dataset
.
In each column of
Table 2, the results are ranked in descending order, from the most to the least accurate. Columns three and five show the differences in the relative rank of each Algorithm on a given dataset. For instance, the DPNWrapper obtained the highest accuracy on the validation dataset
(column two, row two) but on the test set
, the same Algorithm’s accuracy is second to last (column four), and thus the difference in rank, given the order introduced above, is 3. A rank difference of 0 means that the rank of the Algorithm’s classification accuracy relative to the other Algorithms does not change from one dataset to another.
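The rank difference described above can be computed as in the following sketch. The Algorithm names and accuracy values are invented for illustration and are not the values reported in Table 2; the first entry mimics the case of an Algorithm that is ranked first on one dataset and fourth on the other.

```python
from typing import Dict


def rank_by_accuracy(accuracies: Dict[str, float]) -> Dict[str, int]:
    """Rank Algorithms from most (rank 1) to least accurate."""
    ordered = sorted(accuracies, key=accuracies.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def rank_differences(reference: Dict[str, float], other: Dict[str, float]) -> Dict[str, int]:
    """Absolute change in rank of every Algorithm between two datasets."""
    reference_ranks = rank_by_accuracy(reference)
    other_ranks = rank_by_accuracy(other)
    return {name: abs(reference_ranks[name] - other_ranks[name]) for name in reference}


# Made-up accuracies for five hypothetical Algorithms on two datasets.
validation_accuracy = {"algo_a": 0.95, "algo_b": 0.93, "algo_c": 0.91, "algo_d": 0.90, "algo_e": 0.88}
test_accuracy = {"algo_a": 0.86, "algo_b": 0.92, "algo_c": 0.91, "algo_d": 0.90, "algo_e": 0.83}

differences = rank_differences(validation_accuracy, test_accuracy)
print(differences)
```

A large rank difference for an Algorithm signals that its standing on one dataset says little about its standing on the other.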
The reason for these experiments is not only to determine the individual Algorithms’ accuracy but also to demonstrate that the accuracy of classification on the validation dataset is not a very reliable predictor of the accuracy obtained on the test dataset. The accuracy on the validation dataset is an even weaker predictor of the accuracy on the synthetic dataset. However, the accuracy on the synthetic dataset is a better predictor of the accuracy on the test dataset. Therefore, it should be possible to leverage this relation between the synthetic dataset and the test dataset for Algorithm selection by training the selector on purely synthetic data.
There are three main detailed observations from this experiment. First, note that when evaluated on
, which contains only synthetic images, only one Algorithm, DenseNet, achieves an accuracy higher than
. The other Algorithms perform quite poorly, with VGG scoring a nearly random result with accuracy
. The second observation is that on both the validation
and test datasets
, all five Algorithms provide a comparable accuracy of ≈90%, with only SqueezeNet reaching ≈83% on the test dataset
. The final observation is the discrepancy in Algorithm accuracy between the
and
.
Table 2 has been color-coded to indicate the relative accuracy of the individual Algorithms in descending order, starting from the highest, using the following colors: red, green, blue, orange, and yellow. Observe that the evaluation on the
results in relatively poor generalization. If one were to select the most accurate Algorithm based on the validation dataset, then both the
and
would result in poorer accuracy.
To further analyze the performance of the individual Algorithms,
Figure 4 shows the complementarity of the accuracy of the five Algorithms, displayed as a correlation matrix. Each cell in
Figure 4 shows the relative overlap between the classification results of two Algorithms. For instance, the correlation between SqueezeNet and DPN92 is
. This score means that
of their results are the same for both positive and negative labels. The notably lower accuracy on
compared to real data is expected and informative for our study. This performance gap indicates that synthetic images capture class-specific features while representing a different distribution than real images. This domain shift helps identify which Algorithms are more robust to input variations, which is valuable information for the Algorithm selector to learn. Rather than indicating poor GAN training, this controlled distribution difference creates a more challenging testbed that better differentiates the Algorithms’ complementary behaviors. The results of this correlation analysis show that the selected Algorithms have enough complementarity to be used as component Algorithms in our Algorithm selection methodology.
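The pairwise overlap reported in each cell of the matrix can be illustrated as follows. This is a minimal sketch under the assumption that the overlap is the fraction of samples on which two Algorithms output the same label; the prediction vectors are hypothetical.

```python
def agreement_matrix(predictions):
    """Fraction of samples on which each pair of Algorithms predicts the same label."""
    names = list(predictions)
    matrix = {}
    for a in names:
        for b in names:
            same = sum(pa == pb for pa, pb in zip(predictions[a], predictions[b]))
            matrix[(a, b)] = same / len(predictions[a])
    return matrix


# Hypothetical binary predictions (1 = diagnosed, 0 = healthy) on eight samples.
predictions = {
    "SqueezeNet": [1, 0, 1, 1, 0, 0, 1, 0],
    "DPN92":      [1, 0, 0, 1, 0, 1, 1, 0],
    "DenseNet":   [1, 1, 1, 1, 0, 0, 1, 0],
}

matrix = agreement_matrix(predictions)
for (a, b), value in matrix.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: {value:.3f}")
```

The lower the agreement between two Algorithms, the more room there is for a selector to exploit their differences.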
It is worth noting that while more sophisticated generative models such as conditional GANs could explicitly incorporate class-label information into the generation process, we opted for a simpler approach where we train separate GAN models for each class. This class-by-class training implicitly conditions the generation process on the class label without requiring modifications to the GAN architecture. Our approach allows us to control the class distribution in the generated dataset while maintaining architectural simplicity. Future work could explore whether conditional GANs or other label-aware generative models might produce higher-quality synthetic images that further improve Algorithm selection performance.
5.2. Synthetic Data Generators Assessment
Before starting the Algorithm selection evaluation, we evaluate the performance of the data generator networks.
Figure 5a,b show two common measures used to assess the quality of generated images.
Figure 5a shows the Inception Score (IS) measure [
53] applied to the pneumonia synthetic images generated by the used GAN. The IS metric is used to assess how sharp and distinct the generated images are. IS is calculated using the Inception v3 [
54] pretrained image classification model applied to images generated by GAN. The score is maximized when two conditions are met. The first condition is high confidence in the label assigned to each image, indicating image sharpness. The second condition is the variety of the labels assigned to different images, indicating a high diversity of the generated images.
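The two conditions can be made concrete with the usual definition of the IS as the exponential of the average KL divergence between the per-image label distribution and the marginal label distribution. The sketch below assumes that the class probabilities have already been obtained from the pretrained classifier; the probability rows used here are hypothetical.

```python
import math


def inception_score(probabilities):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    `probabilities` holds one row of class probabilities per generated image.
    """
    n = len(probabilities)
    num_classes = len(probabilities[0])
    marginal = [sum(row[k] for row in probabilities) / n for k in range(num_classes)]
    total_kl = 0.0
    for row in probabilities:
        total_kl += sum(p * math.log(p / marginal[k]) for k, p in enumerate(row) if p > 0)
    return math.exp(total_kl / n)


# Hypothetical outputs: confident and diverse labels versus uninformative labels.
sharp_and_diverse = [[1.0, 0.0], [0.0, 1.0]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]

print(inception_score(sharp_and_diverse), inception_score(uninformative))
```

With two classes, the score ranges from 1 (every image receives the same uninformative distribution) to 2 (each image is labeled with full confidence and both labels occur equally often).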
Figure 5b shows the FID score. The FID score [
52] measures how far the distribution of the synthetic images is from the distribution of the real images. In particular, the mean and the covariance of the synthetic image distribution are compared with the corresponding parameters of the real image distribution.
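For reference, the comparison of means and covariances is usually written in the closed form below. The notation is ours and is an assumption, not taken from the text: $\mu_r, \Sigma_r$ denote the mean and covariance of the real-image features and $\mu_s, \Sigma_s$ those of the synthetic-image features, with the features typically extracted by the Inception v3 network.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_s \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_s - 2\,(\Sigma_r \Sigma_s)^{1/2} \right)
```

The score is 0 when both the means and the covariances coincide and grows as the two distributions drift apart.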
Observe that both measures confirm that, at least statistically, the evaluated GAN converges as expected. In other words, as more seed images are used to train the GAN, the generated samples more closely follow the original distribution of the real images, and the synthetic images become increasingly similar to the seed images. Note that we are not interested in generating synthetic samples that closely match the distribution of real images; rather, we aim to capture the noise required for efficient generalization. While an FID score near 0 indicates the highest similarity to the original real-world images, it also implies that, in order to capture the noise beneficial for Algorithm selection, fewer images in the seed dataset are desired.
Finally, we also evaluated the quality of synthetic images in a classification setting. For this purpose, we evaluated the accuracy of the five component Algorithms that were trained on images from
for 5 epochs on the
data generated by the GAN from various sizes of the
dataset.
Figure 6 shows the individual Algorithm accuracy as a function of the size of the
dataset.
Figure 6 shows that when the GAN is seeded with over 300 real images, the resulting synthetic dataset provides enough information to result in an accuracy over
for at least some Algorithms. When seeded with 900 images, the accuracy of all Algorithms is higher than
. This means that the larger the seed dataset is, the greater the ability of the GAN to generate images similar to the seed dataset.
This indicates that when the GAN is provided with too many images, its statistical output distribution converges to the original distribution of the seed dataset. While this result is expected [
31], in this work we are looking for the complementarity of the Algorithm selection performance to the performance of the individual Algorithms. This means that we are looking for a complementary distribution of data samples that would allow us to predict the Algorithm accuracy
on
from
using the smallest possible set of seed samples.
Figure 7 shows both real and synthetic images.
Figure 7a,b show examples of healthy and diagnosed images from the original dataset.
Figure 7c,d show examples of synthetic images representing healthy individuals.
Figure 7c shows the generated images for the highest FID score, while
Figure 7d shows generated images for the best IS score.
Figure 7e,f show generated images for diagnosed individuals for the highest FID score and for the best IS score, respectively. As a general observation, when comparing the real images from
Figure 7b to the synthetic images from
Figure 7c,e, the generated images are noisier and in general appear as an extrapolation of the real-world images.
5.3. Algorithm Selection as a Function of Algorithm Selectors
The next set of experiments shows the performance of the individual Algorithm selectors.
Figure 8 shows the accuracy of classification when the Algorithm selector is trained on synthetic samples from
generated by GAN from an increasing number of seed samples
which are selected from
and
.
As shown in
Figure 3, the training dataset used here is constructed in a fashion similar to that of the sample dataset
. This means that only the first samples from
are used, and when their number is larger than
, samples from
are used. However, all seed images are real and all Algorithm selectors are trained on synthetic images generated by the GAN network.
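How the final classification accuracy depends on the per-sample choice made by an Algorithm selector can be sketched as follows. The function and variable names are illustrative only, and the labels, predictions, and selector outputs are hypothetical; in the experiments, the selector outputs would come from a selector trained on synthetic images.

```python
def accuracy(predicted, labels):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == y for p, y in zip(predicted, labels)) / len(labels)


def selection_accuracy(selected, predictions, labels):
    """Accuracy when sample i is classified by the Algorithm named selected[i]."""
    chosen = [predictions[name][i] for i, name in enumerate(selected)]
    return accuracy(chosen, labels)


# Hypothetical test labels and component Algorithm predictions.
labels = [1, 0, 1, 0, 1, 0]
predictions = {
    "DenseNet": [1, 0, 0, 0, 1, 1],
    "VGG":      [0, 0, 1, 0, 1, 0],
}
# Hypothetical per-sample output of an Algorithm selector.
selected = ["DenseNet", "VGG", "VGG", "DenseNet", "VGG", "VGG"]

print("DenseNet alone:", accuracy(predictions["DenseNet"], labels))
print("VGG alone:", accuracy(predictions["VGG"], labels))
print("With selection:", selection_accuracy(selected, predictions, labels))
```

In this toy case, the two Algorithms fail on different samples, so a selector that routes each sample correctly exceeds the accuracy of either Algorithm alone.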
Figure 8 shows that for each size of seed dataset
, a different Algorithm selector provides the highest classification accuracy. Because of this observation, we will further report only the best performance for each experiment by selecting the most accurate Algorithm selector for each task.
Note that, as expected, the more images are used to generate the dataset, the higher the Algorithm selection accuracy is on average. Interestingly, the variation is lowest for . The implication of this observation is that for very few samples in , GAN-generated synthetic images are too noisy, while for a very large amount in , the generated images do not provide enough variation for all the Algorithms used by the Algorithm selector, thus resulting in almost no increase in classification accuracy. However, an interesting observation is that the variation between the classification accuracies of the different Algorithm selectors is also very small when only 10 images are used to seed the GAN generator.
Figure 9 summarizes these results. The plots show the average (Avg.) and most accurate (Best) Algorithm selection accuracies for both the RAS and SAS approaches. The average accuracy is computed over the set of all nine Algorithm selectors trained on the same set of training samples and is plotted against the accuracy of the most accurate Algorithm selector. The Algorithm selectors were trained on an increasing number of real or synthetic images, and the most accurate result for each type of learning is displayed. The classification accuracy was evaluated on 228 real images per class. Note that Algorithm selection trained on synthetic images outperforms or equals Algorithm selection trained on real images on five out of six data points (red line).
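The Avg. and Best curves can be derived from per-selector accuracies as in the following minimal sketch. The accuracy values, the choice of three selectors, and the two training-set sizes are hypothetical and serve only to show the aggregation.

```python
from statistics import mean


def summarize(accuracy_by_size):
    """Return {size: (average accuracy, best accuracy, name of the best selector)}."""
    summary = {}
    for size, per_selector in accuracy_by_size.items():
        best_name = max(per_selector, key=per_selector.get)
        average = mean(list(per_selector.values()))
        summary[size] = (average, per_selector[best_name], best_name)
    return summary


# Hypothetical accuracies of three Algorithm selectors for two training-set sizes.
sas_accuracy = {
    10: {"SVM": 0.93, "KNeighborsClassifier": 0.94, "MLPClassifier": 0.92},
    270: {"SVM": 0.91, "KNeighborsClassifier": 0.92, "MLPClassifier": 0.90},
}

for size, (average, best, name) in summarize(sas_accuracy).items():
    print(f"{size:>4} images: Avg. = {average:.3f}, Best = {best:.3f} ({name})")
```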
Another observation is that the Algorithm selection provides the highest accuracy when its training data exhibit the highest degree of noise. Here, this means that the Algorithm selection is trained on the smallest number of samples (10 in
Figure 9) in
.
5.4. Algorithm Selection as a Function of Component Algorithms Training Epochs
The next set of experiments evaluates Algorithm selection accuracy as a function of the number of training epochs of the component Algorithms. The Algorithm selectors were trained on either real or synthetic images (270 real or 270 synthetic images, respectively).
Figure 10 shows the results of both the evaluation of the accuracy of the component Algorithms as well as the accuracy of Algorithm selection as a function of component Algorithms’ training epochs.
Figure 10a shows the accuracy of the individual component Algorithms.
Figure 10b shows the accuracy of Algorithm selection when using images
generated by GAN (in the case of SAS) or real images
(in the case of RAS) for the training of Algorithm selectors. In
Figure 10b, we consider for each epoch the most accurate Algorithm selection.
In
Figure 10a, the more the Algorithms are specifically trained to the dataset, the more they converge to individual accuracies with reduced variation. It can be concluded that the component Algorithms’ complementary performance becomes less relevant due to the relative overfitting or training of the Algorithms. By relative overfitting, we mean that an Algorithm’s accuracy
tends to a much lower variance by matching the training dataset more and more precisely. In other words, as the performance overlap becomes larger, the Algorithm selection converges to select Algorithms with diminishing effect of improvement. Note that when component Algorithms are trained for one epoch, the difference in the average accuracy is larger than
, while, when trained for 30 epochs, their relative accuracy difference is <
. In addition, their relative accuracy increases from the interval [84%–90%] at epoch 1 to [91%–93%] at 30 epochs. Note also that the highest and lowest classification accuracies were achieved by the DPN, at 24 and 4 training epochs, respectively.
Figure 10b displays a similar trend: it shows the accuracy of the best component Algorithm as well as the accuracy of Algorithm selection trained on real and on synthetic images. Similarly to
Figure 10a, at epoch 1, the difference of accuracies is in interval [86%–90%], while at training epoch
, the interval is [91%–93%]. The best Algorithm from all five component Algorithms is taken for each epoch. First, observe that both the average synthetic and the average real Algorithm selection accuracies always underperform the best Algorithm.
The best Algorithm accuracy (blue curve) shows the highest value for eleven data points. The SAS (orange line) shows the highest value for four data points. The RAS (red line) shows the highest value for three data points. The best Algorithm is more accurate for most of the training at lower epoch numbers. These epochs are [1–4,5,8–10,14–16,22]. The Algorithm selection, on the other hand, is more accurate for most of the experiments with a higher number of training epochs; in particular, for epochs [5,7,12,20,24–30].
Interestingly, while the highest complementarity of the Algorithms occurs in the early training epochs (due to the highly randomized nature of under-trained Algorithms), the highest accuracy in the early training epochs is dominated by the best Algorithm (
Figure 10b). This would imply that for Algorithms trained for only a few epochs, their behavior might be too difficult for the Algorithm selector to predict, despite their high complementary performance. It is only once the Algorithms are predictable enough that Algorithm selection improves the overall accuracy. This is consistent with the observation that, while the Algorithms are not well converged, predicting their classification is difficult due to its largely randomized nature.
5.5. Algorithm Selection as a Function of Number of Real Images Used to Train the Generator
The next set of experiments is conducted in order to understand how many synthetic samples are necessary for an Algorithm selector to improve over the best Algorithm. First, three types of Algorithm selection strategies are evaluated in this experiment:
,
, and
(as previously described in
Section 4.2), along with two types of training datasets. In these experiments, the Algorithm selector was either trained on 270 real images or 500 synthetic images. The results are shown in
Figure 11.
The first observation is that, when using real data for training, all three Algorithm selection , , and result in constant accuracy. This is natural because the Algorithm selection is always trained on the same 270 real images (solid plot lines). The synthetic images are, however, generated for each different number of seed samples from , and therefore the Algorithm selector accuracy varies (dashed plot lines).
The second observation is the evolution of Algorithm selector accuracy in the classification task. First, observe that from the Algorithm selection methods trained on real data, the is the most accurate, followed by , and the least accurate is . Second, for 50, 200 and 900 seed images, the Algorithm selection is the most accurate, while for 500 seed images, the Algorithm selection is the most accurate. Finally, observe that when the number of training images grows to over 100, the and Algorithm selections are much more accurate than the Algorithm selection.
The last observation is that with the strategy, both Best RAS and Best SAS obtain the highest accuracy. Only the SAS Algorithm selection strategy outperforms the best Algorithm (the best Algorithm accuracy is shown as a black line with triangle markers) as well as the Best RAS strategy in some cases.
5.6. SAS vs. RAS Under Limited Data Availability
Another set of experiments was conducted to compare the performance of RAS and SAS under conditions of limited data availability. In these experiments, we used the same five Algorithms, each fine-tuned on data from
for five epochs. The amount of validation data from
used in each run was reduced to 50 images per class. We performed 10 runs of the experiment for both RAS and SAS. The results are presented in
Table 3. For each RAS run, eight Algorithm selectors were trained on 50 randomly sampled images from
. For each SAS run, we used the same 50 images selected for RAS as the seed dataset
to train a GAN. This ensured that neither RAS nor SAS had an advantage in data sampling. For each run, the GAN was trained independently with a distinct fixed random seed. For five of the runs, we used the Inception Score (IS) to select the generator training epoch for data generation, and for the other five runs, we used the Fréchet Inception Distance (FID). The results show that the difference between using IS and FID for generator epoch selection was not statistically significant for this set of experiments. Using the selected generator epoch, 1000 synthetic images per class were generated for each run. The Algorithm selectors for SAS were trained on the generated synthetic images, and the final classification accuracy after Algorithm selection was recorded. The “Best Classifier” row reports the average result of the best-performing classifier from each run. The “Average” row shows the mean accuracy of all eight Algorithm selectors across all runs. For each Algorithm selector, we report the mean accuracy along with the 95% confidence interval computed across 10 runs.
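The mean accuracy and its 95% confidence interval over 10 runs can be computed as in the sketch below. The text does not state which interval construction was used; the Student t interval shown here is an assumption, and the run accuracies are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% Student t quantile for 9 degrees of freedom (10 runs).
T_CRITICAL_DF9 = 2.262


def mean_with_ci(values, t_critical=T_CRITICAL_DF9):
    """Return (mean, half-width of the confidence interval) for one selector."""
    half_width = t_critical * stdev(values) / sqrt(len(values))
    return mean(values), half_width


# Hypothetical accuracies of one Algorithm selector over 10 runs.
runs = [0.90, 0.92, 0.90, 0.92, 0.90, 0.92, 0.90, 0.92, 0.90, 0.92]

average, half_width = mean_with_ci(runs)
print(f"{average:.3f} +/- {half_width:.3f}")
```

For a different number of runs, the quantile passed as `t_critical` has to be replaced by the value for the corresponding degrees of freedom.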
We performed paired t-tests between RAS and SAS results for each Algorithm selector. A p-value less than 0.05 indicates a statistically significant difference between RAS and SAS. A p-value greater than 0.05 suggests that the difference is not statistically significant. Some Algorithm selectors performed very consistently despite being trained on different data with varying random seeds. However, the distribution of test results violated the normality assumption required for the paired t-test for several models: SVM, KNeighborsClassifier, SGDClassifier, LogisticRegression, RandomForestClassifier, MLPClassifier, and Best Classifier. Therefore, we also performed the Wilcoxon signed-rank test as a non-parametric alternative to paired t-tests. The results showed a statistically significant improvement for SVM, LogisticRegression, RandomForestClassifier, and MLPClassifier when trained on synthetic data compared to real data. Conversely, DecisionTreeClassifier showed a statistically significant decrease in performance under the same condition. KNeighborsClassifier achieved the highest average performance across 10 runs and did not show a statistically significant difference in results between training on synthetic versus real data.
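To illustrate the non-parametric test, the sketch below computes the Wilcoxon signed-rank statistic and an exact two-sided p-value by enumerating all sign assignments, which is feasible for 10 paired runs. This is a self-contained illustration under our own assumptions, not the routine used in the experiments (a library function such as `scipy.stats.wilcoxon` would normally be preferred), and the accuracy values are hypothetical.

```python
from itertools import product


def midranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def wilcoxon_signed_rank(real, synthetic):
    """Return (W_plus, exact two-sided p-value) for paired samples."""
    # Zero differences carry no sign information and are dropped.
    differences = [s - r for r, s in zip(real, synthetic) if s != r]
    ranks = midranks([abs(d) for d in differences])
    w_plus = sum(rank for rank, d in zip(ranks, differences) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    extreme = 0
    # Under the null hypothesis, every sign assignment is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(rank for rank, sign in zip(ranks, signs) if sign)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


# Hypothetical accuracies (in %) of one selector over 10 paired runs.
ras = [88, 89, 87, 90, 88, 89, 87, 88, 89, 88]
sas = [90, 92, 91, 95, 94, 96, 95, 97, 99, 99]

w_plus, p_value = wilcoxon_signed_rank(ras, sas)
print(f"W+ = {w_plus}, p = {p_value:.4f}")
```

A p-value below 0.05 would be read, as in the text, as a statistically significant difference between RAS and SAS for that selector.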
5.7. Algorithm Selection as a Function of Component Algorithms
The final set of experiments illustrates the impact of the individual Algorithms used for the target task on the Algorithm selection accuracy. For this experiment, we consider the removal of the individual Algorithms from the set . The purpose of this experiment is to determine how the set of component Algorithms affects the Algorithm selection accuracy.
Figure 12 shows the results of the Algorithm selection as a function of components of the set
. The
x axis shows which Algorithms have been removed from the initial
, while the
y axis shows the accuracy
. Both Algorithm selection trained on synthetic and on real data are illustrated. The first five data points (
x axis) show the results of classification when a single Algorithm is removed from
and the last two data points show the removal of two and three Algorithms from
, respectively.
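The effect of removing a component Algorithm can be explored with the following sketch. Instead of a trained Algorithm selector, it reports two reference values for each reduced set: the accuracy of the best remaining Algorithm and an oracle upper bound (a sample counts as correct if at least one remaining Algorithm classifies it correctly). Both the oracle stand-in and the correctness flags are assumptions made for illustration.

```python
def best_single_accuracy(correct, names):
    """Accuracy of the most accurate Algorithm among `names`."""
    return max(sum(correct[n]) / len(correct[n]) for n in names)


def oracle_accuracy(correct, names):
    """Upper bound for any selector restricted to the Algorithms in `names`."""
    n_samples = len(correct[names[0]])
    hits = sum(any(correct[n][i] for n in names) for i in range(n_samples))
    return hits / n_samples


def ablation(correct):
    """Map each removed Algorithm to (best single accuracy, oracle accuracy)."""
    results = {}
    for removed in correct:
        kept = [n for n in correct if n != removed]
        results[removed] = (best_single_accuracy(correct, kept), oracle_accuracy(correct, kept))
    return results


# Hypothetical per-sample correctness flags (1 = correctly classified).
correct = {
    "DenseNet":   [1, 1, 1, 0, 1, 1, 0, 1],
    "VGG":        [1, 0, 1, 1, 1, 1, 1, 0],
    "SqueezeNet": [0, 1, 1, 0, 0, 1, 0, 1],
}

for removed, (best, oracle) in ablation(correct).items():
    print(f"without {removed}: best single = {best:.3f}, oracle = {oracle:.3f}")
```

In this toy example, removing VGG closes the gap between the best single Algorithm and the oracle, which means that no selector could improve over the best remaining Algorithm.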
The first interesting observation is that the classification accuracy when using Algorithm selection is strongly improved only when SqueezeNet is removed. Interestingly, SqueezeNet is the least accurate Algorithm on the training and testing dataset and is second to last on the
in
Table 1. This indicates either that SqueezeNet has low predictability due to its lower evaluation accuracy, or that including SqueezeNet in the learning dataset alters the Algorithm selection data bias in such a way that the Algorithm selector makes more false selections.
Second, similarly to previous experiments, the synthetic Algorithm selection here also almost always outperforms the real Algorithm selection. There are only two instances where real Algorithm selection outperformed synthetic Algorithm selection. These two cases are the DenseNet and the DenseNet-VGG-ResNet removal. However, in both cases, the best Algorithm remained more accurate than the Algorithm selection approach.
6. Results and Discussion
The results of the performed experiments can be summarized from several points of view. First, the experiments demonstrated that Algorithm selection can improve the accuracy when trained on well-behaved Algorithms. Well-behaved Algorithms are those for which the difference between the selection confidence and the processing result tends to 0. This good Algorithmic behavior does, however, go against the idea of improving generalization by modeling processing noise. The reason is that, in order to effectively select Algorithms for the classification task, the prediction must take into account the outliers of each Algorithm. In general, such outliers are not directly predictable, as they are outside of the main distribution of the Algorithm’s input–output mapping. Therefore, training an Algorithm selector on well-behaved Algorithms can result in a selection that does not handle the sample outliers properly. Both of these effects can be seen in
Figure 10b. The best Algorithm selection accuracy is obtained at epochs 24, 28, and 30, pointing towards the fact that the Algorithm selector improves the predictive accuracy with increasing epochs of training of the component Algorithms. However, at the same time, for epochs 14, 16, and 22, the Algorithm selection does not provide any improvement when compared to the most accurate Algorithm. This would indicate that the ability to accurately select an Algorithm would benefit from directly predicting each Algorithm’s outliers. Regarding the component Algorithms’ training duration, while five epochs might seem limited for deep learning models, our results demonstrate this was sufficient for our experimental purposes. As seen in
Figure 8, these pre-trained models achieved reasonably stable performance within this timeframe on our medical imaging dataset. In fact, the early training epochs created opportunities for complementary performance patterns that were valuable for Algorithm selection. As models train longer and approach convergence, they tend to find similar solutions, potentially reducing the benefits of Algorithm selection. Our approach intentionally balanced adequate individual performance with maintaining Algorithmic diversity, which aligns with our research focus on Algorithm selection rather than maximizing individual Algorithm accuracy.
It is worth noting that our component Algorithms were deliberately chosen to represent a diverse range of architectures commonly used in medical image classification. While more advanced state-of-the-art architectures might achieve higher absolute accuracy, our focus was on demonstrating the efficacy of Algorithm selection using synthetic data rather than maximizing classification performance per se. The moderate accuracy levels (<95%) of our component Algorithms actually provide a realistic scenario where Algorithm selection can make meaningful improvements. In settings where all Algorithms achieve near-perfect accuracy, the potential benefits of selection would be minimal. Future work could explore how Algorithm selection using synthetic data performs with more advanced architectures like Vision Transformers [
55], MedicalNet [
56], or other domain-specific models, though we expect the core findings about synthetic data’s utility for Algorithm selection to remain valid regardless of the specific architectures employed.
Second, Algorithm selection experimentally benefits from synthetic data. In particular, when using the synthetic images generated from a certain number of seed samples, the SAS outperforms RAS in most of the experimental cases. As shown in experiments illustrated, for instance, in
Figure 8 and
Figure 9, the improvement provided by Algorithm selection is well observed for Algorithm selectors trained on few images, implying that even with very partial knowledge about the well-behaved Algorithms, the noisy samples from the GAN generator provide enough information to obtain an accuracy
at least as good as the best Algorithm. It also shows that for a small size of
, the GAN generates images that have enough information to model the sample distribution from the dataset (
Figure 8). Similarly, this is reflected by the fact that for few synthetic images, the SAS approach results in highest accuracy
(
Figure 9).
Third, training the
or the
Algorithm selector is shown experimentally to negate the Algorithm selection bias. This can be observed in
Figure 11, where not only do the
and
SAS improve other instances of Algorithm selection training, but the best Algorithm is actually the
RAS Algorithm. Note that the sample replication is not a novel method and it is not Algorithm selection specific. The unique position of our approach is that we specifically target synthetic images. In addition, one could ask why we need the GAN generator if we potentially have enough images to train a shallow classifier. As shown in
Figure 8 and
Figure 9, even if the maximum of 540 (270 × 2) images is used for training, the resulting Algorithm selection is statistically less accurate than the one that is trained purely on synthetic data.
Finally, we also verified the statistical significance of the experimental setup. Interestingly, while the statistical verification demonstrated that synthetic images are a viable tool for efficiently selecting Algorithms, it also showed that the improvement is model dependent. Four out of nine classifiers consistently improved their performance when using the synthetic data, while an additional two showed a statistically significant improvement only under the Wilcoxon signed-rank test. This would also imply that this result may be task specific. Therefore, we consider this a motivation for directly extending this work to other domains.
7. Conclusions
Under limited data conditions, our experiments demonstrated that synthetic data generated by GANs can effectively support Algorithm selection. In several cases, SAS not only matched but statistically outperformed RAS in final classification accuracy, with significant gains observed for SVM, LogisticRegression, RandomForestClassifier, and MLPClassifier. Importantly, these improvements were consistent across multiple seeds used for GAN and Algorithm selector training. Although DecisionTreeClassifier showed a performance drop when trained on synthetic data, and KNeighborsClassifier exhibited no significant difference, the overall trend supports the viability of synthetic data for Algorithm selection in low-resource settings. The generated synthetic images make medical image classification more accurate through Algorithm selection, achieving better results than using only the best Algorithm or Algorithm selection trained on the original dataset.
While our study primarily used traditional machine learning Algorithms as selectors due to their interpretability and efficiency with limited training data, future work could explore deep neural networks as Algorithm selectors. Deep learning-based selectors might capture more complex patterns in the data that determine Algorithm effectiveness, potentially further improving selection accuracy. However, such approaches would likely require larger training datasets and might introduce additional computational overhead during both training and inference. The trade-off between selector complexity and performance gains represents an interesting direction for future research, particularly in the context of synthetic training data, where generating larger volumes of data is feasible.
One limitation of our current study is that we investigated Algorithm selection trained on either purely real or purely synthetic data, but did not explore mixed datasets combining both types. Future work could examine how varying the proportion of real to synthetic images in the training data affects Algorithm selection accuracy. Such an investigation might reveal an optimal mixing ratio that leverages the complementary information in both real and synthetic samples. This approach could potentially further improve Algorithm selection performance while minimizing the amount of real data required. Additionally, it would be valuable to study whether certain types of real samples are more informative when combined with synthetic data, which could lead to more efficient strategies for creating hybrid training datasets.