Synthetic Data-Based Algorithm Selection for Medical Image Classification Under Limited Data Availability
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper considers using synthetic data to train a medical image classification algorithm selector when real data are insufficient. The authors present the results of experiments with real chest X-rays of healthy patients and patients with pneumonia. In particular, they discuss classification algorithm selection as a function of the number of real images used to synthesize a new dataset with a generative adversarial network (GAN).
The topic is relevant in the field. Recently, GANs have emerged as a powerful and innovative approach in medical image processing.
The authors' contribution is of an experimental nature. It can be summarized as follows: (i) Evaluation of synthetic data for algorithm selection, (ii) Analysis of synthetic vs. real-world data with respect to algorithm selection accuracy, (iii) Determination of the relation of algorithm complementarity to algorithm selection in classification problems.
The conclusions are consistent with the evidence.
The references are appropriate. The list includes 50 representative items, about half of them from the last five years.
Remarks:
The title of the paper should be more informative. So please indicate more precisely what algorithm it is about. Moreover, the Abstract should be presented more neatly. You use the notion meta-learning, which should be better explained in section 2.2.
Author Response
Dear Reviewer, thank you very much for your comments and suggestions. Below please find our point-by-point answers, with indications of where in the text your comments have been addressed.
{Comment} The title of the paper should be more informative. So please indicate more precisely what algorithm it is about.
{Answer}
The title has been changed to "Synthetic Data-Based Algorithm Selection for Medical Image Classification Under Limited Data Availability".
{Comment} Moreover, the Abstract should be presented more neatly. You use the notion meta-learning, which should be better explained in section 2.2.
{Answer}
The abstract has been revised to provide a cleaner flow and terminology. The changed parts are shown in red. Regarding meta-learning, the concept has been clarified in lines 37-46, 187-195, and 330-336.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript investigated how the algorithm selection strategy may be impacted if models are trained on synthetic data. Extensive ablation studies on several key factors were conducted.
- The word “algorithm selection” is not very clear to me at first. I would suggest elaborating the rationale and intuition of this module, and how it differs from complementary algorithms.
- Most complementary algorithms are based on relatively traditional architectures, and the accuracy is not that high (< 95% on the test set). I’m not sure if such accuracy is good enough for complementary algorithms; more SOTA methods may need to be included to see if better accuracy can be achieved.
- Could the manuscript clarify the specific generative model used? GAN was mentioned solely most times, but stable diffusion was also mentioned in line 214. If both of them were used, how will this impact the following algorithm selection?
- How was the label of the synthetic data assigned? In line 188: “we determine the best algorithm: an algorithm a from the set A, that classifies i correctly is used to build a label set LS.” How is the word “correctly” defined here since there’s no label for synthetic data?
- Line 246: Only 5 epochs were trained here. 5 epochs might be too small. Are there any loss curve or accuracy curve plots that can support the model can be well-trained within 5 epochs?
- The algorithm selectors currently only consider traditional machine learning algorithms, I wonder if it can be expanded to include deep neural networks.
- Table 2: How is the accuracy evaluated in synthetic data? The accuracy in this dataset seems much lower than that in training or test data. Could this imply the image generator was not well trained so that there was domain shift between real data and synthetic data?
- Also, in Table 2, it seems the image generator didn’t consider the labels in training data, I wonder if some methods such as conditional GAN can be used to incorporate such information, which may help improve the quality of synthetic data.
- Line 339: The correlation mentioned here is 0.7753, but it’s 0.7636 in the associated figure (Figure 4).
- What features were used for algorithm in algorithm selector?
- Line 474: the algorithm was either trained on real or synthetic images. I would suggest making a dataset containing both real and synthetic images, and adjust the proportion of synthetic images to see how the accuracy changes.
Author Response
Dear Reviewer, thank you very much for your comments and suggestions. Below please find our point-by-point answers, with indications of where in the text your comments have been addressed.
{Comment} The word “algorithm selection” is not very clear to me at first. I would suggest elaborating the rationale and intuition of this module, and how it differs from complementary algorithms.
{Answer}
This has been addressed in several locations in the paper. First, we provide an extended distinction between the no-free-lunch theorem and algorithm complementarity, and we then introduce the concept of algorithm selection as a way to address both issues (red, lines 37-46). In addition, the notion of algorithm selection is explained in more detail (red, lines 187-195).
{Comment} Most complementary algorithms are based on relatively traditional architectures, and the accuracy is not that high ($< 95\%$ on the test set). I’m not sure if such accuracy is good enough for complementary algorithms; more SOTA methods may need to be included to see if better accuracy can be achieved.
{Answer}
Thank you for the valuable feedback. We agree that incorporating more recent and higher-performing models could strengthen the overall effectiveness of the system, especially in terms of final classification accuracy. In this work, however, our main focus is not on achieving the highest possible accuracy, but rather on studying the relative effectiveness of algorithm selection using synthetic versus real images. To that end, we intentionally used a diverse set of existing CNN models to simulate a realistic scenario in which pretrained models of varying quality are available.
That said, we fully recognize the importance of evaluating the selection framework with stronger base models, particularly as part of assessing its potential for deployment in real-world systems. As part of our future work, we plan to expand the pool of algorithms to include recent SOTA architectures and investigate how improved base accuracy interacts with the selection process. We believe this will offer a more comprehensive view of the practical benefits of Synthetic Algorithm Selection.
{Comment} Could the manuscript clarify the specific generative model used? GAN was mentioned solely most times, but stable diffusion was also mentioned in line 214. If both of them were used, how will this impact the following algorithm selection?
{Answer} Thank you for your comment. In this research we used only a GAN for generating synthetic data. We added a statement at the beginning of the "Data Preparation and Construction" subsection to explicitly clarify that. In future work we are planning to also test the performance of SAS when using Diffusion Models for data generation and compare it with GANs. This issue was specifically addressed in lines 252-258.
{Comment} How was the label of the synthetic data assigned? In line 188: “we determine the best algorithm: an algorithm a from the set A, that classifies i correctly is used to build a label set LS.” How is the word “correctly” defined here since there’s no label for synthetic data?
{Answer} We trained a GAN separately for each class. Thus, we assign the label of the class used as a seed dataset for the GAN to the generated images. We did some experiments with a conditional GAN that can be trained on both classes at once and then generate images of the needed class based on the parameters. In our experiments, the conditional GAN was significantly more prone to mode collapse during training. However, we refrain from mentioning this in the paper, as we did not perform comprehensive research on this topic. We are planning to dedicate more time to trying different versions of GANs and Diffusion Models for data generation in future work. (This is clarified in lines 196-221, blue color.)
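To make the labeling explicit, the sketch below illustrates the per-class scheme: one GAN is trained per class, and every generated image inherits the label of its seed set. The code is only illustrative; train_gan is a placeholder for the actual GAN training, and the image size and seed-set sizes are arbitrary.

```python
# Minimal sketch of per-class synthetic labeling (illustrative; names are placeholders).
import torch

def train_gan(seed_images: torch.Tensor):
    """Placeholder for the per-class GAN training; returns generator(n) -> n images."""
    def generator(n: int) -> torch.Tensor:
        # A real generator would sample from the learned class distribution;
        # here we just return noise with the same image shape as the seed set.
        return torch.rand(n, *seed_images.shape[1:])
    return generator

def build_synthetic_dataset(seed_by_class: dict, n_per_class: int):
    images, labels = [], []
    for class_label, seed_images in seed_by_class.items():
        generate = train_gan(seed_images)        # one GAN per class
        images.append(generate(n_per_class))     # synthetic images for this class
        labels += [class_label] * n_per_class    # label = class of the seed dataset
    return torch.cat(images), labels

# Toy usage: two seed datasets D_E (random tensors stand in for X-ray images)
seeds = {"healthy":   torch.rand(200, 1, 128, 128),
         "pneumonia": torch.rand(200, 1, 128, 128)}
X_synth, y_synth = build_synthetic_dataset(seeds, n_per_class=5)
print(X_synth.shape, y_synth[:3])
```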
{Comment} Line 246: Only 5 epochs were trained here. 5 epochs might be too small. Are there any loss curve or accuracy curve plots that can support the model can be well-trained within 5 epochs?
{Answer}
Thank you for this observation. We agree that training a model for only 5 epochs may not be sufficient to reach optimal performance, and we would like to clarify our rationale.
In this experiment, our goal was not to fully train each model to convergence, but rather to simulate a realistic scenario in which we have access to a pool of pretrained algorithms, possibly trained on different data, for varying durations, and with varying levels of performance. The diversity in performance across these models is crucial for the algorithm selection task to be meaningful. If all models performed similarly (as tends to happen with longer training), the selector would have little to distinguish between them, reducing the practical utility of the selection mechanism. The main focus of our research is comparing algorithm selection performed on real data to algorithm selection performed on synthetic data. That is why we created the conditions best suited to reveal potential differences and make them more pronounced.
We do observe that additional training improves individual model accuracy, as shown in Figures 10a and 10b, but it also leads to performance homogenization across the pool. To preserve model diversity, we intentionally limited training to 5 epochs in this study. We agree that further analysis of how the amount of training affects both individual model performance and the effectiveness of the selector, in both SAS and RAS settings, would be an interesting direction for future work.
{Comment} The algorithm selectors currently only consider traditional machine learning algorithms, I wonder if it can be expanded to include deep neural networks.
{Answer}
Thank you for your suggestion. We are planning to investigate the impact of using more complex machine learning models, including DNNs, for algorithm selection in future work.
{Comment} Table 2: How is the accuracy evaluated in synthetic data? The accuracy in this dataset seems much lower than that in training or test data. Could this imply the image generator was not well trained so that there was domain shift between real data and synthetic data?
{Answer}
The accuracy of individual algorithms is evaluated as shown in Eq. 1. Figure 6 shows that the component algorithms are able to classify, with high accuracy, the synthetic images generated using a seed dataset $\mathbb{D}_E$ of 900 images for training the GAN. In Table 2, we present the accuracy on synthetic data produced using a $\mathbb{D}_E$ of 200 images for training the GAN. Figure 6 also shows that when more seed images are used for training the GAN, the component algorithms are able to classify the produced synthetic images better. While synthetic images produced by a GAN with a $\mathbb{D}_E$ of 200 images may differ significantly from the real images, our experiments showed that even with a small $\mathbb{D}_E$, the GAN is able to produce synthetic images that capture the information needed for training the algorithm selectors.
{Comment} Also, in Table 2, it seems the image generator didn’t consider the labels in training data, I wonder if some methods such as conditional GAN can be used to incorporate such information, which may help improve the quality of synthetic data.
{Answer}
We trained a GAN separately for each class (red, lines 209-221). Thus, we assign the label of the class used as a seed dataset for the GAN to the generated images. We did some experiments with a conditional GAN that can be trained on both classes at once and then generate images of the needed class based on the parameters. In our experiments, the conditional GAN was significantly more prone to mode collapse during training. However, we refrain from mentioning this in the paper, as we did not perform comprehensive research on this topic. We are planning to dedicate more time to trying different versions of GANs and Diffusion Models for data generation in future work. In addition, this is specifically addressed in lines 442-450.
{Comment} Line 339: The correlation mentioned here is 0.7753, but it’s 0.7636 in the associated figure (Figure 4).
{Answer}
Thank you for pointing out our mistake. This was corrected in the text at line 432.
{Comment} What features were used for algorithm in algorithm selector?
{Answer}
Thank you for pointing out this omission. This specific issue was addressed in lines 313-323 (red).
{Comment} Line 474: the algorithm was either trained on real or synthetic images. I would suggest making a dataset containing both real and synthetic images, and adjust the proportion of synthetic images to see how the accuracy changes.
{Answer}
The premise of this paper is the scenario in which there are not enough samples to train the algorithm selector (red, lines 59-63). Therefore, the concept of generating a large number of synthetic images from a smaller set of real images was introduced. While mixing real and synthetic images is a valid approach for enlarging a training dataset, our scenario assumes that not enough real images are available; therefore, mixing real and synthetic images is not included in our scenario. We are planning to investigate how different ratios of real and synthetic data affect algorithm selection in future work on this topic.
Reviewer 3 Report
Comments and Suggestions for Authors
<Introduction>
1. "We consider a scenario, where additional data might not be available for training an algorithm selector and to implement a selection mechanism data must be generated."
→ This sentence is overly long and ambiguous. It should be split into two clear sentences, and the term "selection mechanism" needs to be explicitly defined.
<Conclusion>
2. "The generated synthetic images allow to improve the algorithm selection beyond the accuracy..."
→ The phrase "allow to improve" is grammatically incorrect. It should be revised to "allow us to improve" or "enable improvement in..." for proper and formal usage.
<Method>
3. Equations (1) and (2) are conceptually correct but should include more precise definitions for readers unfamiliar with the mapping. It is unclear how the labels L_S and L differ in role. Please elaborate on whether the algorithm selector is evaluated based on selecting the correct algorithm or the final classification result — this distinction is important.
The sentence: “The mapping S : D_S → L_S is purposefully obfuscated…” is both unclear and academically inappropriate. Replace “obfuscated” with “abstracted” or clarify that the labels L_S are used implicitly in evaluation via the selected algorithm’s performance.
<Experimental>
4. The experimental design is methodologically sound. However, statistical significance testing (e.g., confidence intervals, p-values) is missing when comparing accuracies across conditions (especially in Figures 9 and 11). These would lend more weight to the observed differences between SAS and RAS models.
5. The entire study is centered on chest X-ray data. To support generalizability, consider including a second dataset from a different domain (even a small one) or discuss explicitly the limitations and potential domain-specific biases.
Author Response
Dear Reviewer, thank you very much for your comments and suggestions. Below please find our point-by-point answers, with indications of where in the text your comments have been addressed.
{Comment} Introduction: "We consider a scenario, where additional data might not be available for training an algorithm selector and to implement a selection mechanism data must be generated." $\rightarrow$ This sentence is overly long and ambiguous. It should be split into two clear sentences, and the term "selection mechanism" needs to be explicitly defined.
{Answer}
This was fixed in lines 59-63. The term "selection mechanism" was removed, as it was only obfuscating the statement.
{Comment} Conclusion: "The generated synthetic images allow to improve the algorithm selection beyond the accuracy..." $\rightarrow$ The phrase "allow to improve" is grammatically incorrect. It should be revised to "allow us to improve" or "enable improvement in..." for proper and formal usage.
{Answer}
This was corrected according to the reviewer's suggestions.
{Comment} Method: Equations (1) and (2) are conceptually correct but should include more precise definitions for readers unfamiliar with the mapping. It is unclear how the labels $\mathbb{L}_S$ and $\mathbb{L}$ differ in role. Please elaborate on whether the algorithm selector is evaluated based on selecting the correct algorithm or the final classification result — this distinction is important.
{Answer}
Lines 229-243 have been modified to specifically address this issue. Eq. 1 and Eq. 2 both evaluate the final classification accuracy. As such, both the individual algorithms and the algorithm selection are always evaluated only on the final classification result.
{Comment} The sentence: “The mapping S : $\mathbb{D}_S \rightarrow\mathbb{L}_S$ is purposefully obfuscated…” is both unclear and academically inappropriate. Replace “obfuscated” with “abstracted” or clarify that the labels $\mathbb{L}_S$ are used implicitly in evaluation via the selected algorithm’s performance.
{Answer}
This was modified according to the reviewer's suggestion at line 242.
{Comment} Experimental: The experimental design is methodologically sound. However, statistical significance testing (e.g., confidence intervals, p-values) is missing when comparing accuracies across conditions (especially in Figures 9 and 11). These would lend more weight to the observed differences between SAS and RAS models.
{Answer}
Thank you for your comment. We added Section 5.6 and Table 3, in which we performed multiple runs with different data samples and random seeds. We show the results in Table 3 along with the results of a paired t-test and a Wilcoxon signed-rank test.
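For clarity, the comparison in Table 3 follows the standard paired-test pattern sketched below; the accuracy values here are made up, and the actual per-run results are reported in the paper.

```python
# Sketch of the significance tests reported in Table 3 (illustrative values only).
from scipy import stats

# Per-run accuracies of the same selector trained on real (RAS) vs. synthetic (SAS)
# data, paired by run (same data sample and random seed).
ras = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.88, 0.90, 0.92, 0.89]
sas = [0.91, 0.92, 0.91, 0.93, 0.92, 0.92, 0.90, 0.91, 0.93, 0.91]

t_stat, t_p = stats.ttest_rel(sas, ras)   # paired t-test
w_stat, w_p = stats.wilcoxon(sas, ras)    # Wilcoxon signed-rank test
print(f"paired t-test: t = {t_stat:.3f}, p = {t_p:.4f}")
print(f"Wilcoxon test: W = {w_stat:.3f}, p = {w_p:.4f}")
```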
{Comment} The entire study is centered on chest X-ray data. To support generalizability, consider including a second dataset from a different domain (even a small one) or discuss explicitly the limitations and potential domain-specific biases.
{Answer}
Thank you for your valuable comment. In this study, we focused on a single domain—chest X-ray classification—due to our emphasis on algorithm selection under limited data conditions. While this choice helps us control experimental variables and investigate the core challenges of data scarcity, we acknowledge that domain-specific characteristics may influence our findings. We plan to explore the generalizability of our approach across different data domains as part of future work.
Reviewer 4 Report
Comments and Suggestions for Authors
Dear authors,
Thank you for your interesting paper. I have several questions regarding the reliability of your results, which I hope will strengthen the manuscript:
How does the removal of specific component algorithms (e.g., SqueezeNet) quantitatively affect the complementarity of the algorithm selector and its overall accuracy? Could this reveal biases or dependencies in the selection process? Please elaborate on how the "conf," "rand," and "copy" label-assignment strategies for synthetic data influence the robustness of the algorithm selector when applied to out-of-distribution data. Do certain strategies introduce unintended biases or improve generalization? What was the effect of training component algorithms for fewer than 5 epochs on SAS accuracy? Does undertraining introduce beneficial noise for algorithm selection, or does it destabilize the selector’s performance?
How does varying the ratio of real vs. synthetic samples in the seed dataset (DE) affect the generalization of GAN-generated data? What is the minimum size of DE required for SAS to match or exceed RAS performance across multiple datasets? How does SAS perform when synthetic data is augmented with artificial noise (e.g., Gaussian blur, contrast shifts) compared to RAS? Does SAS maintain accuracy when synthetic data is intentionally imbalanced (e.g., 90% pneumonia samples)? How does this imbalance affect algorithm selection bias?
Are GAN-generated synthetic images vulnerable to adversarial attacks? If so, how does this impact the reliability of SAS? Does aligning real and synthetic feature distributions (via domain adaptation) improve SAS accuracy on real-world test data?
Please add tables demonstrating the statistical significance of SAS improvements over RAS and the best algorithm (e.g., via paired t-tests or ANOVA). Include confidence intervals for SAS accuracy across different GAN seeds and compare them to RAS. Do FID/IS scores correlate with SAS accuracy, or do these metrics fail to capture critical aspects of synthetic data utility for algorithm selection?
Thanks, and looking forward to the revised version!
Author Response
Dear Reviewer, thank you very much for your comments and suggestions. Below please find our point-by-point answers, with indications of where in the text your comments have been addressed.
{Comment} How does the removal of specific component algorithms (e.g., SqueezeNet) quantitatively affect the complementarity of the algorithm selector and its overall accuracy? Could this reveal biases or dependencies in the selection process? Please elaborate on how the "conf," "rand," and "copy" label-assignment strategies for synthetic data influence the robustness of the algorithm selector when applied to out-of-distribution data. Do certain strategies introduce unintended biases or improve generalization? What was the effect of training component algorithms for fewer than 5 epochs on SAS accuracy? Does undertraining introduce beneficial noise for algorithm selection, or does it destabilize the selector’s performance?
{Answer}
The effect of removing a specific classifier can be seen in Figure 12. In general, removing a single algorithm results in lower accuracy, because the complementarity can no longer be exploited as efficiently as when all algorithms are available. While the final classification accuracy depends on the component algorithms, the accuracy of the selection depends on the data describing each algorithm. Removing a certain algorithm implies that certain samples no longer have the correct label, or have no label at all. So while a difference in accuracy may be visible, a bias is not directly observed; rather, the complementarity of the algorithms' performance is what can be observed.
The three approaches, copy, rand and conf, show quite interesting behavior. The rand approach is the weakest, as it provides the smallest amount of control over the data bias and thus over the overall accuracy. The conf approach seems to work better, but in general it would require the dataset to be rebalanced, because the algorithm with the highest confidence will be overrepresented in the training data. The result is that the algorithm with the highest confidence will be overused and the overall accuracy will not be the highest. The copy mechanism is the one that shares the weights in the decision process among multiple algorithms in the selection process. As such, it yields the highest image classification accuracy, because the algorithm bias is most influenced by the data bias. Figure 10b shows the accuracy changes of both RAS and SAS as a function of the number of epochs. As can be seen, the biggest difference between few and more training epochs is the relative stability of the image classification accuracy when using algorithm selection. This means that even when using algorithms trained for only a few epochs, the algorithm selection can benefit from their complementarity, but it is hindered by the fact that algorithms behave differently when trained for different numbers of epochs.
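For concreteness, the sketch below shows our reading of how the three label-assignment strategies turn the component algorithms' predictions into training labels for the selector; the data structure, names, and tie handling are illustrative assumptions rather than the exact implementation.

```python
# Simplified sketch of the copy / conf / rand label-assignment strategies
# (illustrative reading; not the exact implementation in the paper).
import random

def selector_labels(samples, strategy="copy"):
    """samples: list of dicts with
         'features'  : meta-features of the image,
         'correct'   : list of component algorithms that classify it correctly,
         'confidence': dict algorithm -> softmax confidence."""
    X, y = [], []
    for s in samples:
        if not s["correct"]:
            continue                             # no algorithm is correct -> skip
        if strategy == "copy":
            for algo in s["correct"]:            # one copy of the sample per correct algorithm
                X.append(s["features"]); y.append(algo)
        elif strategy == "conf":
            best = max(s["correct"], key=lambda a: s["confidence"][a])
            X.append(s["features"]); y.append(best)
        elif strategy == "rand":
            X.append(s["features"]); y.append(random.choice(s["correct"]))
    return X, y

# Toy usage
sample = {"features": [0.3, 0.7], "correct": ["ResNet", "VGG"],
          "confidence": {"ResNet": 0.95, "VGG": 0.88}}
print(selector_labels([sample], "copy"))   # two training rows, labels ResNet and VGG
print(selector_labels([sample], "conf"))   # one row, label ResNet
```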
{Comment} How does varying the ratio of real vs. synthetic samples in the seed dataset (DE) affect the generalization of GAN-generated data? What is the minimum size of DE required for SAS to match or exceed RAS performance across multiple datasets? How does SAS perform when synthetic data is augmented with artificial noise (e.g., Gaussian blur, contrast shifts) compared to RAS? Does SAS maintain accuracy when synthetic data is intentionally imbalanced (e.g., 90\% pneumonia samples)? How does this imbalance affect algorithm selection bias?
{Answer}
The $\mathbb{D}_E$ dataset is constructed using only real images because it is used to train the GAN (Figure 1 has been reworked to show explicitly which dataset is used for which part of the system). Therefore, we do not consider this specific mixture for training the GAN. In our experiments, the minimum size of $\mathbb{D}_E$ was found to be 50 (Figure 11, using the copy SAS approach). Augmenting the synthetic data with specific noise is an interesting idea. However, the generated images are already noisy due to the limited amount of training, and this was one of the scopes of the paper.
{Comment} Are GAN-generated synthetic images vulnerable to adversarial attacks? If so, how does this impact the reliability of SAS? Does aligning real and synthetic feature distributions (via domain adaptation) improve SAS accuracy on real-world test data?
{Answer}
Thank you for your insightful questions.
Regarding adversarial attacks on GAN-generated images: while it is theoretically possible to craft adversarial perturbations that affect synthetic images, it is important to note that in our setup, these images are used solely to train the algorithm selector—a classifier that predicts the most suitable CNN model for a given input. The actual image classification is performed by downstream CNNs, which are trained exclusively on real data. Therefore, even if an adversarial attack were to fool the algorithm selector, it would not necessarily compromise the final prediction unless the same adversary also succeeded in attacking the chosen CNN model. This two-step structure adds a level of robustness, as an attacker would need to fool both components in a coordinated manner.
As for domain adaptation, aligning the feature distributions of real and synthetic images could in principle improve the performance of the algorithm selector, especially in cases where the synthetic data does not adequately reflect the distributional properties of real inputs. While we did not incorporate explicit domain adaptation in this work, we agree that such techniques could further enhance the reliability of the selector and represent a promising direction for future research.
{Comment} Please add tables demonstrating the statistical significance of SAS improvements over RAS and the best algorithm (e.g., via paired t-tests or ANOVA). Include confidence intervals for SAS accuracy across different GAN seeds and compare them to RAS. Do FID/IS scores correlate with SAS accuracy, or do these metrics fail to capture critical aspects of synthetic data utility for algorithm selection?
{Answer} We have added Table 3 showing the results of RAS and SAS averaged across 10 different seeds. We performed the paired t-test and the Wilcoxon signed-rank test to determine whether the difference between the RAS and SAS approaches is statistically significant for the experiment. We show that for some algorithm selectors the SAS approach shows a statistically significant improvement, while for others the RAS approach is better.
Based on our experiments, we can conclude that the FID and IS metrics can serve as good indicators of the quality of the generated data. We did not find a statistically significant difference when using one metric over the other. We are planning to dedicate future work to investigating the optimal strategies for generating synthetic data for the purposes of algorithm selection. We plan to experiment with different metrics, sizes of the seed dataset, generation heuristics, and data augmentations for the GAN.
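As an illustration, FID and IS can be computed as in the sketch below (here with torchmetrics and random stand-in tensors; our actual pipeline may differ in preprocessing details).

```python
# Sketch of FID / IS computation for generated images (illustrative inputs).
# Requires: pip install torchmetrics[image]  (pulls in torch-fidelity)
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

def to_uint8_rgb(x: torch.Tensor) -> torch.Tensor:
    # Grayscale images in [0, 1] -> 3-channel uint8 tensors, as expected by the
    # Inception backbone used by both metrics.
    return (x.repeat(1, 3, 1, 1) * 255).to(torch.uint8)

real = to_uint8_rgb(torch.rand(64, 1, 299, 299))   # stand-in for real X-rays
fake = to_uint8_rgb(torch.rand(64, 1, 299, 299))   # stand-in for GAN outputs

fid = FrechetInceptionDistance(feature=64)         # small feature size for this toy example
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake)
mean, std = inception.compute()
print("IS:", mean.item(), "+/-", std.item())
```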
Reviewer 5 Report
Comments and Suggestions for Authors
The authors' work is about data generation using generative models. They also study whether algorithms trained on data generated with a GAN obtain the same accuracy as if trained on real-world natural data. Tests were made to choose the right algorithm.
The paper is well structured.
The methodology is detailed. Data Preparation and Construction also includes GAN representation details; the Inception Score (IS) and the Fréchet Inception Distance (FID) are used to check the quality and variety of the generated images.
Experimental Settings: the tools used are presented. CNN-based image classification networks from the PyTorch "Torchvision" and "Pretrained models for Pytorch" libraries: DPN, ResNet, DenseNet, VGG and SqueezeNet. Algorithm selectors (classifiers): ExtraTrees, SVM, DecisionTree, KNeighbors, GradientBoosting, SGD, LogisticRegression, RF, and MLP.
Dataset: X-RAY dataset;
- synthetic data was created by training a GAN separately for each dataset label: Healthy/Pneumonia
Figure 3: represented data preparation steps
Experiments
- Component Algorithms Accuracy: detailed procedures; including Table: Accuracy of individual algorithms
- Figure: Correlation between the individual algorithms;
- Figure: Graphs of FID and IS metrics for GAN generated synthetic images
- Synthetic Data Generators Assessment: Figure: Ex. images: both real images & synthetic images, healthy & with pneumonia
- accuracy: Figure
- Algorithm Selection as a Function of Algorithm Selectors: Figure 9;
Conclusions-"algorithm selection trained on synthetic images outperforms or equals algorithm selection trained on real images on five out of six data points "; "algorithm selection provides the highest accuracy when the algorithm selection provides the highest degree of noise"
- Algorithm Selection as a function of Component Algorithms Training epochs: Figure 10.
Conclusion: "while algorithms are not well converged, predicting their classification is difficulty due to largely randomized nature"
- Algorithm Selection as a Function of Number of Real Images Used to Train the Generator: Figure 11
Conclusions:
- real data for training, all three algorithm selection: constant accuracy
- evolution of alg. selector accuracy in the classification task.
- copy alg. obtains the highest accuracy
- Algorithm Selection as a Function of Component Algorithms: Figure 12
Conclusions: "when only when SqueezeNet is removed alg. selection strongly improves the algorithm selection accuracy"; the synthetic alg. selection almost always outperforms the real alg. selection
The paper has a detailed "Results Discussion" part and a Conclusion.
- a small dataset, e.g., 200-300 images, can be efficiently used as a seed for the GAN.
Author Response
Dear Reviewer, thank you very much for your comments and suggestions. We have considered your comments and thank you for the encouraging feedback.
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The author responded appropriately to the reviewer's questions, and there were also improvements to the paper.
Reviewer 4 Report
Comments and Suggestions for Authors
Dear authors, thank you for clarifying my concerns. I recommend accepting it for publication.