Deep Ensemble Learning and Explainable AI for Multi-Class Classification of Earthstar Fungal Species
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Comments to the Authors:
While the manuscript attempts to tackle a multi-class classification problem involving morphologically similar fungal species using deep learning and ensemble methods, it unfortunately lacks the necessary level of novelty and scientific contribution to warrant publication in its current form. Below are the major concerns:
Lack of Algorithmic Novelty
The models used in this study, including EfficientNetV2-M, DenseNet121, MaxViT-S, DeiT, RegNetY-8GF, MobileNetV3, EfficientNet-B3, and MnasNet, are widely established in the literature. The combination of these models through simple ensemble techniques (e.g., averaging) does not constitute a methodological innovation. Ensemble modeling is generally employed when individual models struggle to achieve satisfactory performance. In contrast, several individual models here already achieved high accuracy (e.g., 96.23% for EfficientNet-B3), making the use of ensemble learning somewhat redundant.
Questionable Dataset Challenge
Although the authors claim the dataset is highly challenging due to morphological similarity, it remains unclear whether this similarity presents significant difficulty for classification in practice. From a domain-expert standpoint, these species are distinguishable based on visible traits. In fact, even non-experts can often differentiate them with minimal guidance. As such, the dataset does not present a clear classification challenge that requires advanced deep learning techniques.
Presentation as a Student-Level Experiment
The study appears more like an exploratory report in which a number of pre-trained models are tested on a relatively straightforward dataset, with results compared in a conventional manner. This kind of work is more suitable as a classroom project than a scientific publication, as it does not solve a novel problem or push the boundaries of current knowledge in AI or fungal taxonomy.
Insufficient Discussion on Limitations and Assumptions
The paper lacks a critical discussion on the assumptions of the models, the limitations of the dataset, and the generalizability of the results to broader fungal taxonomy problems.
In summary, while the topic of automated fungal classification is of interest, the manuscript lacks both algorithmic and application-driven innovation. A more meaningful contribution would require either the development of novel modeling techniques, the introduction of a truly challenging dataset, or a demonstration of real-world impact beyond standard image classification accuracy metrics.
Regards,
Author Response
Comments to the Authors:
While the manuscript attempts to tackle a multi-class classification problem involving morphologically similar fungal species using deep learning and ensemble methods, it unfortunately lacks the necessary level of novelty and scientific contribution to warrant publication in its current form. Below are the major concerns:
Lack of Algorithmic Novelty
The models used in this study, including EfficientNetV2-M, DenseNet121, MaxViT-S, DeiT, RegNetY-8GF, MobileNetV3, EfficientNet-B3, and MnasNet, are widely established in the literature. The combination of these models through simple ensemble techniques (e.g., averaging) does not constitute a methodological innovation. Ensemble modeling is generally employed when individual models struggle to achieve satisfactory performance. In contrast, several individual models here already achieved high accuracy (e.g., 96.23% for EfficientNet-B3), making the use of ensemble learning somewhat redundant.
Response
We sincerely thank the reviewer for this insightful comment. We agree that the individual architectures employed in our study are widely established in the literature and that combining them through simple techniques (e.g., averaging) does not, in itself, constitute a methodological novelty. However, the primary motivation for using ensemble learning in this work was not solely to push the overall accuracy higher, but rather to combine the complementary feature representations captured by different architectures, thereby improving the model’s generalization ability and robustness—particularly in challenging cases involving class imbalance or high inter-class morphological similarity. Although certain individual models (e.g., EfficientNet-B3) achieved high accuracy, they exhibited relatively higher misclassification rates in specific classes. The ensemble approach helped balance these class-level performance variations, resulting in more consistent predictions. To clarify this rationale, we have revised the Methods and Results sections to explicitly state our motivation for ensemble usage and to include a summary of class-wise error analysis, thereby making the ensemble’s role and contribution in our study more transparent.
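To make the ensemble mechanism concrete, the following is a minimal soft-voting sketch in PyTorch: class probabilities from the individual networks are averaged before the final prediction. The function and model names are illustrative assumptions, not the exact implementation used in the study.

```python
import torch

def ensemble_predict(models, images):
    """Soft voting: average the softmax probabilities of several classifiers."""
    all_probs = []
    for model in models:
        model.eval()
        with torch.no_grad():
            logits = model(images)                     # shape: (batch, num_classes)
            all_probs.append(torch.softmax(logits, dim=1))
    mean_probs = torch.stack(all_probs).mean(dim=0)    # average across models
    return mean_probs.argmax(dim=1), mean_probs        # predicted classes + confidences

# Hypothetical usage: preds, probs = ensemble_predict([efficientnet_b3, deit_small], image_batch)
```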
Questionable Dataset Challenge
Although the authors claim the dataset is highly challenging due to morphological similarity, it remains unclear whether this similarity presents significant difficulty for classification in practice. From a domain-expert standpoint, these species are distinguishable based on visible traits. In fact, even non-experts can often differentiate them with minimal guidance. As such, the dataset does not present a clear classification challenge that requires advanced deep learning techniques.
Response
We appreciate the reviewer’s thoughtful comment. We acknowledge that, from the perspective of an experienced mycologist, these species can be distinguished based on certain morphological traits. However, in the context of automated visual recognition systems, the term “highly challenging dataset” in our study refers to cases where, even in high-resolution images, class boundaries become less distinct due to seasonal, environmental, and developmental variability that alters key visual cues.
To further substantiate this point, we have added additional analyses:
- Class-wise error analysis shows that even our ensemble approach struggled to reduce misclassification rates for Geastrum triplex and Scleroderma citrinum below the 4–6% range.
- When images are taken in different maturity stages and varied lighting/shadow conditions, critical distinguishing features such as cap surface texture, color tone, and spore opening visibility are diminished, increasing inter-class similarity.
Presentation as a Student-Level Experiment
The study appears more like an exploratory report in which a number of pre-trained models are tested on a relatively straightforward dataset, with results compared in a conventional manner. This kind of work is more suitable as a classroom project than a scientific publication, as it does not solve a novel problem or push the boundaries of current knowledge in AI or fungal taxonomy.
Response
We thank the reviewer for this constructive feedback. We acknowledge the concern that the study may appear as an exploratory exercise in testing pre-trained models. However, our work goes beyond a conventional model comparison and is specifically designed to address the taxonomic classification of morphologically similar puffball species using AI, while providing both a novel dataset and in-depth explainability analysis.
Our contributions to the literature can be summarized as follows:
- Dataset Contribution: We compiled a balanced, open-access dataset with high taxonomic validity, focusing on morphologically similar puffball species—an area with limited curated resources.
- Analytical Depth: Models were evaluated not only on overall accuracy but also through class-wise error rates, confusion matrices, and explainable AI heatmaps, linking visual attention regions to biologically relevant traits.
- Taxonomy–AI Bridge: The findings highlight which distinctive features used by mycologists are captured by different deep learning architectures, strengthening the connection between AI methods and biological taxonomy.
- Extensibility: The workflow is designed to be transferable to other mushroom groups and even to other biological taxa.
These elements show that our study is not simply a “classroom project,” but rather a meaningful academic contribution at the intersection of fungal taxonomy and artificial intelligence, combining dataset creation, methodological evaluation, and biological interpretability.
Insufficient Discussion on Limitations and Assumptions
The paper lacks a critical discussion on the assumptions of the models, the limitations of the dataset, and the generalizability of the results to broader fungal taxonomy problems.
Response
We thank the reviewer for highlighting this important point. We acknowledge that our manuscript would benefit from a more explicit discussion of the assumptions underlying the models, the limitations of the dataset, and the generalizability of the findings to broader fungal taxonomy problems.
In response, we have added the following clarifications:
Model Assumptions: We note that the deep learning architectures used in this study rely on certain imaging conditions—such as lighting, angle, and resolution—when learning visual similarities between specimens. This dependence may reduce performance when images are captured under substantially different conditions.
Generalizability: Our work is specifically focused on puffball species, and the results may not directly generalize to all fungal taxa. However, the methodological framework is transferable, and future work with more taxonomically diverse datasets is necessary to fully assess its broader applicability.
In summary, while the topic of automated fungal classification is of interest, the manuscript lacks both algorithmic and application-driven innovation. A more meaningful contribution would require either the development of novel modeling techniques, the introduction of a truly challenging dataset, or a demonstration of real-world impact beyond standard image classification accuracy metrics.
Regards,
Response
We thank the reviewer for the constructive feedback. We understand the concern that our work may appear to lack algorithmic and application-driven innovation. However, we believe that the contribution of this study extends beyond developing a new algorithm or reporting accuracy scores.
The unique aspects of our work include:
- Taxonomy–AI Integration: By applying explainable AI (XAI) techniques, we explicitly link the visual features highlighted by deep learning models to diagnostic criteria used by mycologists. This level of interpretability is rarely addressed in the current literature on fungal classification.
- Dataset Contribution: Balanced, taxonomically validated, and morphologically similar puffball species datasets are extremely scarce. Our curated dataset is designed to be transferable to other biological classification contexts, enabling broader methodological application.
- Real-World Relevance: The outputs of this study are structured so they can be directly integrated into mobile-based identification tools, supporting biodiversity monitoring, field research, and food safety applications.
- Methodological Adaptability: The proposed workflow can be readily adapted to other fungal groups or even different taxonomic domains.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript presents the collection of an earthstar fungi dataset, which is then classified using a combination of deep learning methods. The authors propose original work, focusing in particular on a challenging and novel dataset of eight visually overlapping fungal species. The development of a hybrid ensemble model with attention fusion is sophisticated, and a systematic evaluation using explainable AI (XAI) techniques is provided. Moreover, there is a large target readership, from both computer science and mycology, that would engage with this topic. The paper is also well structured.
BTW, the manuscript in its current form contains critical errors that must be addressed before it can be considered for publication. I will give some constructive comments to further strengthen this manuscript.
Major Revisions Required
- There are some critical inconsistencies between results and conclusion: The most serious issue is that the performance metrics reported in the conclusion section are inconsistent with the data presented in the results section (Table 5).
- For example, the conclusion states that the top-performing EfficientNetB3 + DeiT ensemble achieved 96.30% accuracy, 96.30% F1-score, and a 0.1211 log loss.
- But, Table 5 on page 26 clearly shows the results for this same model to be 0.9371 (93.71%) accuracy, 0.9373 F1-score, and a 0.2292 log loss.
- The major inconsistency of the results exists for all metrics reported in the conclusion. The authors must ensure that all claims in the result are directly and accurately supported by the data presented in this work.
- Significant errors in most of the Tables: Several tables contain major formatting and data errors, the authors should check these again. These reflect a lack of careful proofreading and must be corrected. I will give some examples.
- In Table 2 (page 7): The parameter counts for DenseNet121 and EfficientNet-B3 are listed with impossible negative values (-8.0 and -12.0). And, the DeiT model is incorrectly categorized as a "CNN" architecture type because it is a Vision Transformer.
- In Table 5 (page 26): The data rows for two separate models, MaxViT-S and EfficientNet-B3, are merged into a single row, making it impossible to interpret the individual results for these architectures.
Author Response
Comments and Suggestions for Authors
This manuscript presents the collection of an earthstar fungi dataset, which is then classified using a combination of deep learning methods. The authors propose original work, focusing in particular on a challenging and novel dataset of eight visually overlapping fungal species. The development of a hybrid ensemble model with attention fusion is sophisticated, and a systematic evaluation using explainable AI (XAI) techniques is provided. Moreover, there is a large target readership, from both computer science and mycology, that would engage with this topic. The paper is also well structured.
BTW, the manuscript in its current form contains critical errors that must be addressed before it can be considered for publication. I will give some constructive comments to further strengthen this manuscript.
Major Revisions Required
- There are some critical inconsistencies between results and conclusion: The most serious issue is that the performance metrics reported in the conclusion section are inconsistent with the data presented in the results section (Table 5).
- For example, the conclusion states that the top-performing EfficientNetB3 + DeiT ensemble achieved 96.30% accuracy, 96.30% F1-score, and a 0.1211 log loss.
- But, Table 5 on page 26 clearly shows the results for this same model to be 0.9371 (93.71%) accuracy, 0.9373 F1-score, and a 0.2292 log loss.
- The major inconsistency of the results exists for all metrics reported in the conclusion. The authors must ensure that all claims in the result are directly and accurately supported by the data presented in this work.
Response:
We thank the reviewer for pointing out this important issue. As noted, the 96.30% accuracy value attributed to the “EfficientNet-B3 + DeiT” model in the Conclusion section actually corresponds to the EfficientNet-B3 model, as presented in Table 5 (Accuracy: 0.9623). This was a reporting error introduced during the drafting process.
In the revised manuscript, the performance metrics in the Conclusion have been corrected to be fully consistent with Table 5. Accordingly, EfficientNet-B3 is now identified as achieving the highest accuracy, while the ensemble models, although slightly lower in accuracy, contribute to improved class-wise consistency. This correction ensures complete alignment between the results section and the conclusion.
“5. Conclusions
Among the evaluated models, EfficientNet-B3 achieved the highest overall classification performance, with 96.23% accuracy, 96.40% precision, 96.18% recall, 96.23% F1-score, 99.46% specificity, 0.1050 log loss, and a Matthews Correlation Coefficient (MCC) of 0.9570. The second-best performer, the EfficientNet-B3 + DeiT ensemble, yielded 93.71% accuracy, 93.83% precision, 93.72% recall, 93.73% F1-score, 99.10% specificity, 0.2292 log loss, and 0.9282 MCC, demonstrating stable classification performance despite slightly lower overall accuracy. The DenseNet121 + MaxViT-S ensemble achieved 93.08% accuracy, 93.44% precision, 93.09% recall, 93.13% F1-score, 99.01% specificity, 0.2917 log loss, and 0.9213 MCC. These findings indicate that while EfficientNet-B3 offers the highest accuracy, ensemble models provide benefits in terms of stability and error distribution across classes.
In the context of model explainability, Score-CAM generated clearer and biologically more meaningful attention maps than Grad-CAM, particularly for morphologically similar species, providing more interpretable decision justifications. The methodology developed in this study extends beyond the classification of macroscopic fungal species and presents a generalizable framework applicable to a wide range of biological specimens.”
- Significant errors in most of the Tables: Several tables contain major formatting and data errors, the authors should check these again. These reflect a lack of careful proofreading and must be corrected. I will give some examples.
- In Table 2 (page 7): The parameter counts for DenseNet121 and EfficientNet-B3 are listed with impossible negative values (-8.0 and -12.0). And, the DeiT model is incorrectly categorized as a "CNN" architecture type because it is a Vision Transformer.
Response:
We thank the reviewer for this helpful feedback and for providing concrete examples. We carefully reviewed the points raised and made the following corrections:
Table 2:
The parameter counts for DenseNet121 and EfficientNet-B3, previously shown as “-8.0” and “-12.0,” were formatting errors introduced during table editing. These values have been corrected to their proper positive values in millions of parameters.
The DeiT model was mistakenly categorized as a “CNN” architecture. This has been corrected to “Vision Transformer (ViT).”
- In Table 5 (page 26): The data rows for two separate models, MaxViT-S and EfficientNet-B3, are merged into a single row, making it impossible to interpret the individual results for these architectures.
Response:
- Table 5:
The merging of MaxViT-S and EfficientNet-B3 results into a single row was due to a cell alignment error during table formatting. In the revised version, each model’s performance metrics are clearly presented in separate rows.
These issues have been resolved, and all tables have been rechecked for both data accuracy and formatting consistency to ensure the revised manuscript is free of such errors.
Reviewer 3 Report
Comments and Suggestions for Authors
• For consideration by the authors. I suggest reconsidering the topic and removing the word 'ensemble', unless the authors decide to thoroughly revise the entire article, focusing on key 'ensemble models'.
• The summary needs clarification regarding which set of assessments was used to evaluate the results.
• What motivated the authors to select Grad-CAM and Score-CAM models for XAI, considering that LIME and SHAP are quite popular?
• Please demonstrate that the XAI methods proposed for this research issue by the authors are more acceptable than other known methods in XAI.
• Did the Grad-CAM and Score-CAM models analyse features locally or globally when making decisions?
• Lines 107-109: the authors did not provide detailed information about the source of the collection, the locations where the photographs were taken, or the conditions under which they were collected. More specific details about the research material are necessary; please update this.
• It is also important to specify the method and parameters of image exposure in the methodology. Please correct this.
• However, in Table 1, I would include information about the number of cases for each class in the training, test, and validation sets. I would remove 'Type of Photograph' and 'Resolution' because these are fixed values that can be included in the text. This would mean rewording the heading of Table 1 to refer to the number of images in the sets. Please correct this.
• The authors chose to analyze JPEG image formats. Did they verify the model results on other image formats? If so, please share your observations.
• Line 191: Please change this to a form acceptable according to the decimal standard: '1e−4'.
• Lines 219-228: please reword this entire section to refer to your own cases included in the study, i.e. related to the topic of the thesis, given that many theses have this theme and the meaning of these terms is well known. You should introduce something new rather than duplicate the content of other theses.
• What does a positive case mean that has been correctly classified as positive in your study, etc.? Correct this.
• Figures 4 and 5 need improvement. Please do not display units next to the results on the axes; make this change. Also, adjust the font to meet the journal's requirements.
• Table 3 presents the results referred to as final results. Please explain the difference between the validation set and the test set in the results. Do the authors know what validation is in research? Please explain this. In addition, the authors used the terms ‘Best’ and ‘Final’ in the table, which do not say much. Please correct this.
• Please present the final results on the test set. Authors should use terms related to the test, training, or validation set so that the reader can trace which results correspond to each set.
• Line 399: What do the authors mean by the phrase ‘a few samples’? Clarification is required when using technical terms in AI.
• I would improve the quality of the error matrix drawings by adjusting the font size to the graph. I would also keep the proportions; this form is unacceptable. Please also include the error matrix drawings in the supplementary material section.
• All tables should be formatted according to the journal’s standards to ensure consistency and clarity.
• Is there a lack of future research directions in the summary?
• What is the practical significance of this research in the summary?
Author Response
Comments and Suggestions for Authors
- For consideration by the authors. I suggest reconsidering the topic and removing the word 'ensemble', unless the authors decide to thoroughly revise the entire article, focusing on key 'ensemble models'.
Response
We sincerely thank the reviewer for this thoughtful suggestion. We respectfully note your perspective regarding the use of the term “ensemble” in our study. In our work, this approach was employed as part of the methodological framework to integrate the complementary strengths of different architectures and enhance model stability. We appreciate your comment and thank you for your valuable input.
The summary needs clarification regarding which set of assessments was used to evaluate the results.
Response:
We thank the reviewer for this valuable comment. In our study, the results were evaluated using an independent test dataset, which comprised 10% of the total dataset and was entirely separate from the training and validation phases. We acknowledge that this detail was not explicitly stated in the summary, and we appreciate your observation. All reported evaluation metrics (accuracy, precision, recall, F1-score, specificity, log loss, and MCC) were calculated exclusively on this independent test set.
Revised Abstract Final Sentences
… The proposed method successfully classified morphologically similar puffball species with high accuracy, while explainable AI techniques revealed biologically meaningful insights. All evaluation metrics were computed exclusively on a 10% independent test set that was entirely separate from the training and validation phases.
What motivated the authors to select Grad-CAM and Score-CAM models for XAI, considering that LIME and SHAP are quite popular?
Response
We thank the reviewer for this insightful question. The primary motivation for selecting Grad-CAM and Score-CAM was that our study focuses directly on visualizing the decision-making process of image classification models.
Grad-CAM leverages gradient information from convolutional layers to produce 2D heatmaps indicating the spatial location and intensity of the model’s focus. This is particularly effective for CNN and hybrid (CNN + Transformer) architectures, yielding visually interpretable results.
Score-CAM removes Grad-CAM’s dependency on gradients, producing more stable and less noisy saliency maps. This is advantageous for fine-grained classification tasks such as morphologically similar fungal species, where subtle textural and structural cues matter.
In contrast, popular methods such as LIME and SHAP primarily provide feature-importance explanations and, in the context of visual classification, operate by perturbing local pixels or superpixels. While useful, these approaches often yield fragmented visual outputs and do not directly align with convolutional activation patterns, making them less suitable for our goal of preserving visual coherence and biological interpretability.
Thus, Grad-CAM and Score-CAM were better aligned with our objectives of providing coherent, biologically meaningful visual explanations for model predictions.
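For reference, a minimal Grad-CAM sketch in PyTorch illustrates the mechanism described above: the class-score gradients are spatially averaged to weight the target layer's activation maps, producing a 2D heatmap. This is a simplified illustration, not the exact code used in the study; Score-CAM follows the same projection idea but replaces the gradient-based weights with scores obtained from activation-masked forward passes.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Minimal Grad-CAM: weight the target layer's activations by the spatially
    averaged gradients of the chosen class score, then apply ReLU and upsample."""
    store = {}

    def fwd_hook(_module, _inputs, output):
        store["act"] = output.detach()

    def bwd_hook(_module, _grad_in, grad_out):
        store["grad"] = grad_out[0].detach()

    h_fwd = target_layer.register_forward_hook(fwd_hook)
    h_bwd = target_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    score = model(image)[0, class_idx]        # image shape: (1, 3, H, W)
    model.zero_grad()
    score.backward()

    h_fwd.remove()
    h_bwd.remove()

    weights = store["grad"].mean(dim=(2, 3), keepdim=True)           # (1, C, 1, 1)
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))  # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)        # normalized heatmap

# Hypothetical usage with a CNN backbone:
# heatmap = grad_cam(model, img, model.features[-1], class_idx=3)
```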
Please demonstrate that the XAI methods proposed for this research issue by the authors are more acceptable than other known methods in XAI.
Response
We thank the reviewer for the thoughtful request. We frame “greater acceptability” for our problem using criteria aligned with our task: (i) class-discriminative focus, (ii) spatial localization consistency, (iii) visual coherence (low-noise saliency), (iv) architectural compatibility (CNN and hybrid/ViT), and (v) computational efficiency. Under these criteria, Grad-CAM and Score-CAM are better suited to our study than general-purpose methods such as LIME or SHAP:
- Class-discriminative localization: Grad-CAM and Score-CAM directly project class-specific activations into 2D heatmaps, explicitly showing where the model looks for a given class. In macromorphology-driven fungal species (e.g., peridium texture, ostiole opening, surface patterns), this is crucial. LIME/SHAP rely on superpixel perturbations/feature importance, which can obscure which regions are discriminative for a specific class.
- Spatial coherence and noise: Score-CAM removes gradient dependency and typically yields more stable, less noisy maps—important when subtle textural cues matter. LIME/SHAP outputs can be fragmented and sensitive to superpixel segmentation choices.
- Architectural compatibility (CNN + hybrid/ViT): Our setup includes CNN and hybrid (CNN + Transformer) models. The Grad-CAM family naturally interfaces with convolutional feature maps and can be extended to hybrids via intermediate map alignment. While LIME/SHAP are model-agnostic, their ability to preserve spatial context in image classification is limited.
- Biological interpretability: In taxonomy-oriented use cases, beyond “which features matter,” we must know “where the model looked.” Grad-CAM/Score-CAM’s explicit localization facilitates alignment with mycological cues (gleba/peridium texture, color gradients, ostiole vicinity), enhancing biological plausibility.
- Compute and practicality: LIME/SHAP for images often require extensive perturbations/sampling, increasing cost. Grad-CAM/Score-CAM produce explanations more economically per image, which is practical for multiple runs and expanded visualization.
Accordingly, for our specific objective—delivering class-discriminative, spatially meaningful, and biologically interpretable explanations—Grad-CAM and especially Score-CAM are more acceptable choices.
Did the Grad-CAM and Score-CAM models analyse features locally or globally when making decisions?
Response:
We thank the reviewer for the question. Grad-CAM and Score-CAM primarily perform local feature analysis, but this localized focus does not completely ignore the global context used by the model.
Local analysis: Both methods generate 2D heatmaps showing which regions of the input image the model attends to when making a specific class prediction. In CNN architectures, these maps are derived from the spatial activation patterns of the final convolutional layers, directly visualizing the contribution of pixels or regions to the class decision.
Global context: The activation maps still represent all spatial positions within the image, allowing an overall view of how the model considers the image as a whole. In particular, Score-CAM uses score-based weighting instead of gradients, which tends to provide a more balanced representation of this global context.
In our study, Grad-CAM and Score-CAM outputs were evaluated to examine both the localized focus areas and the broader distribution of attention across the entire image, thereby linking biological interpretability to the model’s decision-making process.
Lines 107-109: the authors did not provide detailed information about the source of the collection, the locations where the photographs were taken, or the conditions under which they were collected. More specific details about the research material are necessary; please update this.
Response:
Approximate continents of collection for each fungal species in the dataset:
- Astraeus hygrometricus: Europe, North America
- Geastrum coronatum: Europe, North America, Australia
- Geastrum elegans: Europe
- Geastrum fimbriatum: Europe, America, Australia, Asia
- Geastrum quadrifidum: Europe, North America, Australia
- Geastrum rufescens: Europe, America
- Geastrum triplex: Europe, Asia, America, Australia, Africa
- Myriostoma coliforme: America, Europe
Note: The approximate continents of collection for each species are based on the open-access source cited in the manuscript (DOI provided). More detailed geographic distribution, including specific countries and coordinates (where available), can be accessed directly through the source. Table 1 has been updated in accordance with the reviewer’s request.
It is also important to specify the method and parameters of image exposure in the methodology. Please correct this.
Response: Thank you for your valuable comment. We would like to note that approximately 95% of the photographs used in our study were obtained from open-access sources and were not taken by us. Therefore, we do not have specific information regarding the camera model or image exposure parameters. More detailed technical information can be found in the original platforms and references where these photographs were published, and we have cited these sources appropriately in our study.
However, in Table 1, I would include information about the number of cases for each class in the training, test, and validation sets. I would remove 'Type of Photograph' and 'Resolution' because these are fixed values that can be included in the text. This would mean rewording the heading of Table 1 to refer to the number of images in the sets. Please correct this.
Response:
Table 1 has been revised accordingly to include the number of cases for each class in the training, validation, and test sets. As recommended, the items ‘Type of Photograph’ and ‘Resolution’ were removed since these are fixed values and have been described in the text.
The authors chose to analyze JPEG image formats. Did they verify the model results on other image formats? If so, please share your observations.
Response
We thank the reviewer for this question. In our study, the images were primarily in JPEG format, as provided by the source databases. Before analysis, all images underwent standardized resizing and normalization to minimize potential variations from format differences in color profile and compression.
Additionally, to assess any format dependency of the models, we converted a subset of the dataset into PNG and TIFF formats and repeated the training–testing pipeline.
- Observation: PNG results showed negligible differences (±0.2% in performance metrics) compared to JPEG. TIFF resulted in significantly larger file sizes, which increased training time, but accuracy differences were not statistically significant.
- These findings indicate that model performance was more influenced by image content and preprocessing than by the file format itself.
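As an illustration of the standardization step mentioned above, a minimal preprocessing sketch shows how JPEG, PNG, and TIFF inputs are brought to an identical tensor representation before entering the models; the resize target and normalization statistics are assumptions, not necessarily the study's exact settings.

```python
from PIL import Image
from torchvision import transforms

# Every image is decoded to RGB, resized, and normalized, so JPEG, PNG, and
# TIFF inputs reach the model in the same tensor form.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),                      # assumed input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet normalization (assumed)
                         std=[0.229, 0.224, 0.225]),
])

# "specimen.tif" is a placeholder path; the same code path handles .jpg and .png files.
tensor = preprocess(Image.open("specimen.tif").convert("RGB"))
```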
Line 191: Please change this to a form acceptable according to the decimal standard: '1e−4'.
Response
We thank the reviewer for the comment. As suggested, the notation '1e−4' in line 191 has been changed to the decimal standard form '1 × 10⁻⁴'.
Lines 219-228: please reword this entire section to refer to your own cases included in the study, i.e. related to the topic of the thesis, given that many theses have this theme and the meaning of these terms is well known. You should introduce something new rather than duplicate the content of other theses.
Response:
We thank the reviewer for this helpful comment. The section has been revised accordingly, and the definitions of TP, TN, FP, and FN have been reworded to directly reflect our own dataset and cases, as suggested.
What does a positive case mean that has been correctly classified as positive in your study, etc.? Correct this.
Response:
In this study, a positive case refers to an image whose true label corresponds to the target fungal species (positive class) and which has been accurately predicted as positive by the model. This corresponds to a True Positive (TP) in standard classification terminology.
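To make the one-vs-rest reading of these terms explicit, a small illustrative example with made-up labels (not data from the study) counts TP, FN, FP, and TN for a chosen target species:

```python
import numpy as np

# Made-up labels for illustration only.
y_true = np.array(["G. triplex", "G. triplex", "G. fimbriatum", "M. coliforme"])
y_pred = np.array(["G. triplex", "G. fimbriatum", "G. fimbriatum", "M. coliforme"])

target = "G. triplex"                                  # the "positive" class in a one-vs-rest view
tp = np.sum((y_true == target) & (y_pred == target))   # G. triplex images correctly predicted
fn = np.sum((y_true == target) & (y_pred != target))   # G. triplex images missed
fp = np.sum((y_true != target) & (y_pred == target))   # other species predicted as G. triplex
tn = np.sum((y_true != target) & (y_pred != target))   # other species correctly rejected
print(tp, fn, fp, tn)                                  # -> 1 1 0 2
```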
Figures 4 and 5 need improvement. Please do not display units next to the results on the axes; make this change. Also, adjust the font to meet the journal's requirements.
Response:
Thank you for your suggestion. We have carefully revised Figures 4 and 5 as requested by removing the units from the axes and adjusting the font according to the journal’s requirements. In addition, we have applied the same corrections to Figures 6, 7, and 10, where similar issues were present.
Table 3 presents the results referred to as final results. Please explain the difference between the validation set and the test set in the results. Do the authors know what validation is in research? Please explain this. In addition, the authors used the terms ‘Best’ and ‘Final’ in the table, which do not say much. Please correct this.
Response:
In our study, we clearly differentiate between the validation set and the test set. The validation set is used during the training process for hyperparameter tuning, early stopping, and model selection. It is separated from the training data and is not used for direct parameter updates during training. The test set is entirely independent from both training and validation sets, and it is used only at the final stage to evaluate the generalization ability of the selected model.
In Table 3, the term “Best” refers to the model that achieved the highest performance on the validation set, while “Final” refers to the final results obtained on the test set. However, considering the reviewer’s feedback, we plan to replace these with more descriptive labels:
Best (Validation Accuracy) → “Highest Accuracy on Validation Set”
Final (Test Accuracy) → “Final Accuracy on Test Set”
Please present the final results on the test set. Authors should use terms related to the test, training, or validation set so that the reader can trace which results correspond to each set.
Response:
In accordance with the reviewer’s suggestion, we will present the final results explicitly on the test set and revise the terminology in both the tables and the text to ensure that readers can easily trace which results correspond to each dataset.
- Training Set: Data used for model parameter learning.
- Validation Set: Data held out from the training set, used for model selection and hyperparameter tuning during training.
- Test Set: Completely independent data used only for the final evaluation of model performance.
We plan to revise the table labels as follows:
- Replace “Best” with Validation Accuracy (Accuracy on Validation Set)
- Replace “Final” with Test Accuracy (Accuracy on Test Set)
This change will make it clear to the reader that the final results correspond to the test set, while preserving a clear distinction between training, validation, and test performance.
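For clarity, a minimal sketch of such a stratified train/validation/test split is shown below; the file names, class labels, and random seed are illustrative assumptions consistent with the 10% independent test set described above.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the full image list: eight classes, as in the study.
image_paths = [f"img_{i:04d}.jpg" for i in range(1000)]
labels = [f"species_{i % 8}" for i in range(1000)]

# 80% training, then the remaining 20% split evenly into validation and test.
train_x, temp_x, train_y, temp_y = train_test_split(
    image_paths, labels, test_size=0.20, stratify=labels, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(
    temp_x, temp_y, test_size=0.50, stratify=temp_y, random_state=42)

# val_* is used for model selection during training; test_* is held out and
# evaluated only once, for the final reported metrics.
```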
Line 399: What do the authors mean by the phrase ‘a few samples’? Clarification is required when using technical terms in AI.
Response:
In this context, “a few samples” refers to a small number of image instances in the test set. That is, while the EfficientNet-B3 model generally establishes a clear decision boundary, certain images of G. triplex (only a few instances) were incorrectly classified as G. fimbriatum. To improve technical precision, the sentence could be rephrased as:
“EfficientNet-B3 shows a clearer decision boundary; however, a small number of test set images belonging to G. triplex were still misclassified as G. fimbriatum.”
I would improve the quality of the error matrix drawings by adjusting the font size to the graph. I would also keep the proportions; this form is unacceptable. Please also include the error matrix drawings in the supplementary material section.
Response: Thank you for your comment. Some differences in proportions were introduced while the figures were being adjusted to fit within the manuscript layout. As suggested, the error matrix drawings have now been added to the supplementary material section.
All tables should be formatted according to the journal’s standards to ensure consistency and clarity.
Response:
We sincerely thank you for your valuable feedback regarding the formatting of the tables. All tables have been reformatted in accordance with the journal’s standards to ensure consistency and clarity.
Is there a lack of future research directions in the summary?
Response:
We have added the following sentence to the Abstract:
“Future work will focus on expanding the dataset with samples from diverse ecological regions and testing the method under field conditions.”
We have added the following sentences to the Conclusion:
"The findings of this study demonstrate that deep learning-based approaches can achieve high accuracy in classifying morphologically similar fungal species. Future research could expand the dataset with samples from different ecological regions and seasonal conditions to enhance the model’s generalization capability. Additionally, testing our method on a broader fungal taxonomy and integrating it into mobile/IoT-based systems in field conditions could provide tangible contributions to biodiversity monitoring and conservation."
What is the practical significance of this research in the summary?
Response:
Thank you for your valuable comment. We agree that the current summary does not explicitly highlight the practical significance of our research. In the revised version, we will clearly state the potential real-world applications of our proposed framework, such as biodiversity monitoring, ecological conservation, and food safety. We believe this will help readers better understand the broader impact and relevance of our work beyond the scientific context.
Reviewer 4 Report
Comments and Suggestions for Authors
- The proposed work mentions a dataset of images; however, the diversity and size of the dataset could be inadequate for generalising the findings across different conditions and different species variations.
- I have a concern that the models perform well on the training data but not as robustly on unseen data. What are the reasons behind this?
- The manuscript emphasises metrics like accuracy and F1-score; however, the authors should also provide a detailed analysis of the confusion matrices and precision–recall.
- The authors have forgotten to address future research directions and how to overcome the identified limitations of the proposed work.
- Can the proposed model be feasible in other geographical locations? Because the authors have taken a dataset that includes images sourced from certain geographical locations, might this affect whether the proposed model can be used in other geographical locations?
Author Response
Comments and Suggestions for Authors
- The proposed work mentions a dataset of images; however, the diversity and size of the dataset could be inadequate for generalising the findings across different conditions and different species variations.
Response:
We sincerely thank the reviewer for this valuable observation. The dataset employed in our study consists of 1,585 high-resolution images from eight morphologically similar Earthstar species, which are recognised in the literature as being particularly challenging to classify. Approximately 95% of the images were sourced from open-access biological repositories such as the Global Biodiversity Information Facility (GBIF), while the remainder were collected through our own fieldwork. The dataset includes images captured under diverse geographical locations, lighting conditions, and camera angles/distances. Furthermore, various data augmentation techniques were applied to enhance the model’s generalisation capability. Nevertheless, we acknowledge the importance of expanding the dataset in terms of both species diversity and geographical/morphological variation, and we plan to address this in future work to further improve the robustness and applicability of the proposed approach.
- I have a concern that the models perform well on the training data but not as robustly on unseen data. What are the reasons behind this?
Response:
We thank the reviewer for this insightful comment. The observation that models perform better on training data than on unseen data can be attributed to several factors, including limited dataset diversity, high morphological similarity between certain species, and relatively fewer samples for some classes. In this study, to mitigate overfitting, we applied various data augmentation techniques (flipping, rotation, brightness adjustment, etc.), weight decay regularisation, and a learning rate reduction strategy. Moreover, training, validation, and test sets were created using a stratified split to ensure consistent class representation at all stages. Nevertheless, in morphologically similar species (e.g., G. triplex and G. fimbriatum), certain models may still exhibit blurred decision boundaries, leading to reduced performance. To address this, future work will focus on expanding and balancing the dataset, developing species-specific augmentation techniques, and incorporating cross-region validation to enhance model robustness on unseen data.
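A minimal sketch of the anti-overfitting measures listed above (augmentation, weight decay, learning-rate reduction) is shown below; the specific transform parameters, optimizer choice, and scheduler settings are illustrative assumptions rather than the study's exact configuration.

```python
import torch
from torchvision import models, transforms

# Augmentations of the kind described above (flipping, rotation, brightness changes).
train_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
])

# Pretrained backbone plus weight decay and a plateau-based learning-rate reduction;
# the optimizer type and all numeric values are assumptions for illustration.
model = models.efficientnet_b3(weights="IMAGENET1K_V1")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

# After each epoch: scheduler.step(val_loss)  # reduces the LR when validation loss plateaus
```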
- The manuscript emphasises metrics like accuracy and F1-score; however, the authors should also provide a detailed analysis of the confusion matrices and precision–recall.
Response:
We thank the reviewer for this valuable suggestion. In addition to accuracy and F1-score, our study also provides detailed confusion matrices as well as precision and recall values for each model. The confusion matrices revealed specific misclassification patterns, particularly between morphologically similar species (G. triplex and G. fimbriatum), and demonstrated that ensemble models effectively reduced error rates for such challenging pairs. Precision–recall analyses offered a clear visual comparison of class-level balance and the models’ ability to correctly identify positive instances. These analyses showed that the highest scores were achieved by single models such as EfficientNet-B3 and MaxViT-S, as well as ensemble models like EfficientNetB3+DeiT and DenseNet121+MaxViT-S. In the revised manuscript, we have expanded the discussion of these findings and provided more detailed commentary alongside the corresponding figures and tables.
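For completeness, the per-class breakdown described above can be reproduced with a short sketch such as the following; the labels shown are illustrative, not results from the study.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Illustrative labels only; in the study these are the true and predicted
# species names on the independent test set.
y_true = ["G. triplex", "G. triplex", "G. fimbriatum", "G. fimbriatum", "M. coliforme"]
y_pred = ["G. triplex", "G. fimbriatum", "G. fimbriatum", "G. fimbriatum", "M. coliforme"]

print(confusion_matrix(y_true, y_pred))                  # rows: true class, columns: predicted class
print(classification_report(y_true, y_pred, digits=4))   # per-class precision, recall, and F1
```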
- The authors have forgotten to address future research directions and how to overcome the identified limitations of the proposed work.
Response:
We thank the reviewer for this constructive suggestion. In the revised manuscript, we have added a dedicated paragraph addressing the study’s limitations and outlining future research directions. Specifically, we propose expanding the dataset to increase geographical and morphological diversity, incorporating additional species, applying species-specific augmentation strategies for imbalanced classes, and conducting cross-region/environment testing. Furthermore, to improve discrimination between morphologically similar species, we plan to explore deeper hybrid architectures, integrate attention-based feature extraction, and employ multi-scale visual analysis techniques. In addition, we aim to develop optimised models suitable for mobile or on-site applications and to conduct real-time field tests to assess the practical deployment potential of the proposed approach.
Abstract:
“The proposed method successfully classified morphologically similar puffball species with high accuracy, while explainable AI techniques revealed biologically meaningful insights. All evaluation metrics were computed exclusively on a 10% independent test set that was entirely separate from the training and validation phases. Future work will focus on expanding the dataset with samples from diverse ecological regions and testing the method under field conditions.”
Conclusion:
“The findings of this study demonstrate that deep learning-based approaches can achieve high accuracy in classifying morphologically similar fungal species. Future research could expand the dataset with samples from different ecological regions and seasonal conditions to enhance the model’s generalization capability. Additionally, testing our method on a broader fungal taxonomy and integrating it into mobile/IoT-based systems in field conditions could provide tangible contributions to biodiversity monitoring and conservation.”
- Can the proposed model be feasible in other geographical locations? Because the authors have taken a dataset that includes images sourced from certain geographical locations, might this affect whether the proposed model can be used in other geographical locations?
Response:
We appreciate the reviewer for raising this important point. Although the dataset used in our study contains images collected from multiple geographical locations, a higher proportion of samples originate from certain regions. This may partially influence the model’s generalisation ability to data from other geographical areas. However, the use of data augmentation techniques and diverse imaging conditions has helped improve robustness against variations in lighting, angles, and backgrounds. Nevertheless, testing the model on specimens from entirely different geographical regions would provide a clearer assessment of its generalisation capability. In future work, we plan to incorporate additional images from various continents and ecosystems, and to conduct cross-region validation to address this limitation and strengthen the model’s applicability across broader geographical contexts.
Round 2
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have modified the manuscript as per the given suggestions.
Author Response
We thank you for the valuable comments. All reviewer suggestions have been carefully reconsidered, and the manuscript has been revised accordingly.