Review Reports
- Baijuan Wang 1,2,
- Weihao Liu 2 and
- Xiaoxue Guo 2,3
- et al.
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In this study, several concepts are proposed and integrated to formulate a detection model, MC-YOLOv13-L, featuring the YOLOv13 model as its backbone. It may incorporate biomimetic concepts that have little grounding in computational evidence regarding the necessity and advantage of such methods. For instance, it invokes the multi-scale visual pathways of primates but uses methodologies that are essentially multi-scale linear attention, similar to current SOTA lightweight attention (e.g., EfficientFormer-inspired linear attention). The reasoning connecting the biological and algorithmic models is only conceptual, not functional.
- It is proposed that CEAC shortens the inference period to 12.20% of standard conditions when x=3 tiling is applied with little loss of mAP. However:
• GPU memory discussions are missing, as it increases memory usage because of the addition of tensors.
• “CEAC appears to be just another known tile inference trick, and it is of little novel interest.”
- There appears to be logical validity in the design, yet the following are not provided in the paper:
- FLOP breakdown of the MSLA module,
- Ablation inside MSLA (kernel sizes, head numbers),
- Visualization of attention maps to demonstrate biological analogy.
Author Response
Thank you very much for taking the time to review this manuscript. We really appreciate your comments and suggestions. We have considered these comments carefully and tried our best to address every one of them, and the corresponding parts of the text have been modified using red font.
- In this study, several concepts are proposed and integrated to formulate a detection model, MC-YOLOv13-L, featuring the YOLOv13 model as its backbone. It may incorporate biomimetic concepts that have little grounding in computational evidence regarding the necessity and advantage of such methods. For instance, it invokes the multi-scale visual pathways of primates but uses methodologies that are essentially multi-scale linear attention, similar to current SOTA lightweight attention (e.g., EfficientFormer-inspired linear attention). The reasoning connecting the biological and algorithmic models is only conceptual, not functional.
Modification instructions: We greatly appreciate your valuable comments on our paper. We fully agree with your focus on the biologically inspired visual mechanisms and thank you for your in-depth consideration of the relationship between the biological and algorithmic models in our approach. The core of this study is indeed based on mimicking biological visual mechanisms to address the challenges faced by the YOLOv13 model in detecting drought stress during the seedling stage of the Yunnan large-leaf tea tree. We drew inspiration from multi-scale visual pathways, particularly the visual processing mechanisms in primates, which helped us overcome the difficulties YOLOv13 faces when handling complex backgrounds and drought features at different scales. By introducing a mechanism similar to multi-scale linear attention, we aim to simulate how organisms process information from multiple scales in parallel to better perceive subtle environmental changes, thereby enhancing the sensitivity and accuracy of YOLOv13 in detecting drought stress in tea trees. Although the current lightweight attention mechanism shares similarities with our multi-scale linear attention, we have not merely applied existing methods. Instead, we have made targeted adjustments based on biological principles to better adapt it to drought detection tasks. For example, we use parallel multi-scale pathways to simultaneously capture heterogeneous information, ranging from fine weak fluorescence spots to the overall leaf morphology and vein orientation. This is followed by global aggregation and position-dependent reallocation, which align and coordinate cross-region and multi-scale evidence at the same level. This significantly improves the visibility and consistency of weak signals along leaf veins and local attenuation at the leaf edges. Finally, we introduce learnable channel recalibration, which achieves information compression and discriminative subspace projection without altering spatial resolution, thereby reducing false negatives and alleviating drought level confusion. Once again, we sincerely thank you for your valuable suggestions. We will continue to improve the paper to ensure its scientific rigor and theoretical soundness.
- It is proposed that CEAC shortens the inference period to 12.20% of standard conditions when x=3 tiling is applied with little loss of mAP. However: GPU memory discussions are missing, as it increases memory usage because of the addition of tensors. “CEAC appears to be just another known tile inference trick, and it is of little novel interest.”
Modification instructions: We greatly appreciate your valuable comments on our paper. We fully agree with your concern regarding GPU memory usage and thank you for raising these technical questions. In our experiments, the YOLO series algorithms resize all input images to 640×640, including the stitched ones. Therefore, although we used the CEAC stitching technique during processing, the image stitching did not significantly increase GPU memory usage; the primary change was in image clarity. After testing, we found that the image preprocessing time, inference time, and post-processing time did not change significantly, so the introduction of CEAC did not notably increase the inference time. Although the images were stitched, the overall efficiency of the inference process remained at the same level. Due to table length limitations, we report only the FPS in the paper, which combines the image preprocessing, inference, and post-processing times (calculated by dividing 1000 by the sum of these three times). If you require more detailed parameters, we are happy to provide additional information and offer new tables to further support our conclusions. Once again, we sincerely thank you for your valuable suggestions.
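For clarity, the FPS value described above corresponds to the relation below, assuming the three stage times are measured in milliseconds per image (this formulation is added for illustration and is not quoted from the manuscript):

$$\mathrm{FPS} = \frac{1000}{t_{\mathrm{pre}} + t_{\mathrm{infer}} + t_{\mathrm{post}}}$$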
Additionally, we understand your view on CEAC as an image stitching inference technique. Indeed, simple image stitching methods are technically quite common. However, we have made further innovations in the application of this method, not only stitching images but also incorporating the bounding box reverse mapping technique. This improvement allows us to process the stitched images more accurately, avoiding displacement of the bounding box positions during stitching, thus improving detection accuracy. We have added relevant mathematical derivations and technical explanations in the main text, detailing how reverse mapping ensures the precise localization of bounding boxes in the stitched image. This process plays an important role in enhancing the model's detection capability in complex backgrounds.
“During the splitting process, this study first calculates the absolute values of the normalized TXT labels, as shown in Equations 1 to 4. Here, x and y represent the coordinates of the center of the bounding box, w and h represent the width and height of the bounding box, and W and H represent the width and height of the combined image. After the image is split, the absolute labels are normalized and restored according to the size of the cropped image, as shown in Equations 5 to 8. Here, x0 and y0 represent the coordinates of the top-left corner of the current cropped image, while wc and hc represent the width and height of the cropped image.”
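To make the label conversion in the quoted passage concrete, the following is a minimal Python sketch, not the authors' code; the function names, the YOLO-style normalized label convention, and the crop-offset parameters are assumptions for illustration only.

```python
def denormalize_box(xc_n, yc_n, w_n, h_n, W, H):
    """Convert a normalized label (relative to the combined image) into
    absolute pixel values, in the spirit of Equations 1-4."""
    return xc_n * W, yc_n * H, w_n * W, h_n * H


def renormalize_box_to_crop(xc, yc, w, h, x0, y0, crop_w, crop_h):
    """Shift an absolute box into a crop's coordinate frame and re-normalize
    it to the crop size, in the spirit of Equations 5-8."""
    return (xc - x0) / crop_w, (yc - y0) / crop_h, w / crop_w, h / crop_h


# Example: a box in the top-left 640x640 tile of a hypothetical 1280x1280 mosaic
xc, yc, w, h = denormalize_box(0.25, 0.25, 0.1, 0.1, 1280, 1280)
print(renormalize_box_to_crop(xc, yc, w, h, x0=0, y0=0, crop_w=640, crop_h=640))
# -> (0.5, 0.5, 0.2, 0.2)
```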
- There appears to be logical validity in the design, yet the following are not provided in the paper: FLOP breakdown of the MSLA module,
Modification instructions: Thank you for your valuable suggestions. We greatly appreciate your attention to our research design and thank you for pointing out the issue regarding the MSLA module's FLOPs. To address your suggestion, we have added the relevant content to the paper. Through the MSLA optimization, the network's FLOPs were reduced by 0.2 G. The original network contains 6 C3k2 modules, and our MSLA optimization is applied to these six modules. The parameter counts of the six optimized modules are 6424, 23724, 96600, 29232, 72024, and 280232, compared to 5792, 20800, 115328, 35136, 90752, and 345344 for the original C3k2 modules. Although the parameters in the backbone slightly increased, there is a significant decrease in the Neck part. These details have been presented in a table in the main text.
- Ablation inside MSLA (kernel sizes, head numbers)
Modification instructions: Thank you very much for your valuable suggestions. In response to the ablation study within the MSLA module, we have provided detailed additions in the paper. We have listed the parameters for each module in detail (6424, 23724, 96600, 29232, 72024, and 280232). Our optimization incorporates four kernel sizes: 3×3, 5×5, 7×7, and 9×9; after performing element-wise residual addition on the outputs of each branch, ReLU is used as the nonlinear activation function. This design is partially inspired by the method in the study "Exploiting Multi-Scale Parallel Self-Attention and Local Variation via Dual-Branch Transformer-CNN Structure for Face Super-Resolution" by Jingang Shi et al. Furthermore, to strengthen the paper's persuasiveness, we have introduced the mAP@50-95 metric and added a confusion matrix to analyze the experimental results. The results show that moderate and severe drought are the most easily confused; testing revealed that this is because, under moderate and severe drought conditions, the phenotypic changes in some tea seedlings' leaves are very similar and lack clear boundaries, leading to confusion between the two categories. Once again, thank you for your valuable suggestions.
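To illustrate the branch structure described above, here is a minimal PyTorch-style sketch of a parallel multi-kernel block with element-wise residual addition followed by ReLU. It is an illustrative sketch under the stated kernel sizes, not the authors' exact MSLA implementation; the class name and channel count are assumptions.

```python
import torch
import torch.nn as nn


class MultiScaleBranchBlock(nn.Module):
    """Illustrative block: parallel 3x3/5x5/7x7/9x9 convolutions whose outputs
    are added element-wise to a residual path, followed by ReLU."""

    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7, 9)
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x  # residual path
        for branch in self.branches:
            out = out + branch(x)  # element-wise addition of each branch output
        return self.act(out)


# Quick shape check on a dummy feature map
feat = torch.randn(1, 64, 80, 80)
print(MultiScaleBranchBlock(64)(feat).shape)  # torch.Size([1, 64, 80, 80])
```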
- Visualization of attention maps to demonstrate biological analogy.
Modification instructions: Thank you very much for your valuable suggestions. In response to your mention of visualizing attention maps to demonstrate biological analogies, we have made the corresponding additions in Section 3.2 of the paper. We have added visualized attention maps to show how the MC-YOLOv13-L model simulates biological visual mechanisms when handling drought stress. Through these attention maps, we can visually observe how the model focuses on key areas at different scales when recognizing drought features.
“To verify the performance improvement of the multi-scale linear attention mechanism, CMUNeXt module, and auxiliary bounding box algorithm optimization on the YOLOv13 network, a visual analysis of the model’s attention regions was conducted. This study further introduces Grad-CAM (Gradient-weighted Class Activation Mapping) for analysis [38]. As a gradient-based visualization method, Grad-CAM is primarily used to explain the decision-making process of deep convolutional neural networks in object detection tasks. It calculates the gradient information from specific convolutional layers in the network and generates class activation heatmaps to visually display the key regions the model focuses on when predicting specific categories. In this study, the highlighted areas focused on by the model are all phenotypic features related to drought stress, such as wilting at the leaf edges and changes in color. The analysis results, shown in Figure 10, indicate that the improved YOLOv13 network shows enhanced attention to details under conditions such as occlusion, small objects, blur, and low illumination, with significantly improved region focusing capability.”
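For readers unfamiliar with the method, the following is a minimal, generic Grad-CAM sketch in PyTorch showing the gradient-weighted class-activation computation described above. It assumes a model whose forward pass returns class scores (a detection head does not do this directly) and a chosen convolutional layer; it is not the authors' analysis code.

```python
import torch
import torch.nn.functional as F


def grad_cam(model, image, target_layer, class_idx):
    """Generic Grad-CAM: weight a conv layer's activations by the spatial mean
    of the gradients of the class score, then apply ReLU and normalize."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    score = model(image)[0, class_idx]  # assumes a (1, num_classes) score output
    model.zero_grad()
    score.backward()
    h1.remove()
    h2.remove()

    acts, grads = activations[0], gradients[0]       # (1, C, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)   # channel-wise importance
    cam = F.relu((weights * acts).sum(dim=1))        # (1, H, W)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam  # upsample to the input size before overlaying as a heatmap
```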
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Abstract
The abstract is very dense in biomimetic terminology (compound-eye apposition concatenation, multi-scale linear attention, CMUNeXt, Inner-IoU), but it does not clearly state what is biologically or agronomically new compared with standard YOLO-based stress detection, nor what the practical gain is beyond a few percentage points in mAP. I suggest explicitly stating:
(i) what problem in tea drought monitoring is addressed that previous YOLO variants could not handle, and
(ii) What practical detection performance (e.g., the earliest detectable stage, under which drought level) is achieved?
The abstract only reports relative improvements versus YOLOv13. It should at least mention the number of images/plants and the existence of an independent external validation set (Laobanzhang base) to better support the generalisation claims.
Introduction
The introduction reviews several biomimetic vision models (aphids, small objects, and remote-sensing targets), but it does not clearly articulate what remains to be solved for chlorophyll-fluorescence-based drought detection in tea. For example:
– Are there existing CF-imaging drought classifiers for other crops?
– Is the main limitation early-stage sensitivity, class confusion, or speed?
A sharper, more quantitative gap statement would better justify the complexity of MC-YOLOv13-L.
Likewise, drought is considered a field problem for Yunnan tea plantations, but the experiment is conducted on potted seedlings with PEG-6000 solution in controlled conditions. The authors should explicitly discuss, already in the Introduction, why this setup is expected to approximate field drought responses (or where it may differ).
Methods
The manuscript specifies PEG-6000 concentrations (0, 10, 20, 30%) and soil relative humidity thresholds for drought grades (Table 1), but it is unclear how these two are quantitatively linked and how long plants were kept at each stress level before imaging. For example:
• How many days after treatment were CF images taken?
• Were soil moisture and PEG levels monitored over time or only set initially?
This is crucial for reproducibility and for understanding the physiological significance of the labels.
The authors mention 324 raw samples and then augment to 4,212 images. It is unclear:
• Does one “sample” correspond to one plant, one imaging day, or one image?
• Were multiple images per plant taken at different angles and/or dates, and if so, was the train/validation/test split done at plant level or image level?
If images from the same plant can appear in both training and validation/test sets (even after augmentation), performance may be overestimated.
Table 2 presents the correlations of 14 indices with drought stress, with Fv/Fm exhibiting the strongest Pearson correlation. However, the decision to discard all other channels and use only Fv/Fm as input to the detection model should be justified more rigorously (e.g., multivariate redundancy analysis, risk of overfitting when adding channels, practical sensor constraints). It is plausible that combining several partially correlated indices could further improve early detection.
Results
Many reported improvements are within a few percentage points (e.g., +4.2% mAP vs YOLOv13; +0.75–1.36% in ablations). Because the dataset is relatively small, these gains may be attributed to variance across runs or random splits. I strongly recommend reporting:
• mean ± standard deviation over multiple runs, or
• confidence intervals and/or statistical tests for the metrics,
especially for the ablation study, where each configuration is already trained with five seeds.
The authors note that recall for severe drought is initially lower and improves with MC-YOLOv13-L. It would strengthen the paper to provide:
• a confusion matrix by drought level,
• per-class AP values, and
• a short discussion of which levels remain most confused (e.g., mild vs moderate),
since “drought level confusion” is one of the core motivations of the work.
The Laobanzhang base dataset is used for external verification, but the details are minimal. Please specify:
• the number of plants and images per drought class,
• whether the stress protocol (PEG-induced or natural drought) and imaging device/settings were identical or different, and
• whether any images from this base were used at any stage of training or hyperparameter tuning.
Without this information, the “external validation” could still be very similar to the training domain, limiting the strength of the robustness claims.
Conclusion and Discussion
Section 4 is largely a repetition and structured summary of results, with limited deeper interpretation of biological or agronomic implications. I recommend:
• explicitly discussing how early in the drought progression the model can detect stress (e.g., at mild or moderate drought),
• comparing this with conventional physiological indicators (soil moisture, leaf water potential), and
• explaining what this would concretely change in tea management (e.g., irrigation scheduling, yield protection).
The final paragraph briefly notes that the study focuses only on Yunnan large-leaf tea and that the compound-eye mechanism is limited to static parallel stitching. This is useful but could be more explicit about:
• the limitations of PEG-induced drought vs natural field drought (soil heterogeneity, root depth, temperature interactions),
• potential differences in canopy structure and background clutter in plantations vs lab imaging, and
• the need to test the approach under different illumination conditions and with multi-channel CF inputs.
Author Response
Thank you very much for taking the time to review this manuscript. We really appreciate your comments and suggestions. We have considered these comments carefully and tried our best to address every one of them, and the corresponding parts of the text have been modified using red font.
- The abstract is very dense in biomimetic terminology (compound-eye apposition concatenation, multi-scale linear attention, CMUNeXt, Inner-IoU), but it does not clearly state what is biologically or agronomically new compared with standard YOLO-based stress detection, nor what the practical gain is beyond a few percentage points in mAP. I suggest explicitly stating: (i) what problem in tea drought monitoring is addressed that previous YOLO variants could not handle.
Modification instructions: We greatly appreciate your thorough review and valuable feedback on our paper. We have carefully considered your suggestions and made corresponding revisions to the paper. Regarding the issue of drought monitoring in tea plants you mentioned, we have clarified in the revision that this study addresses the problem of drought level confusion in the detection of drought stress during the seedling stage of the Yunnan large-leaf tea variety using the traditional YOLOv13 network.
“To address the issue of drought level confusion in the detection of drought stress during the seedling stage of the Yunnan large-leaf tea variety using the traditional YOLOv13 network, this study proposes an improved version of the network, MC-YOLOv13-L, based on animal vision.”
- (ii) What practical detection performance (e.g., the earliest detectable stage, under which drought level) is achieved? The abstract only reports relative improvements versus YOLOv13. It should at least mention the number of images/plants and the existence of an independent external validation set (Laobanzhang base) to better support the generalisation claims.
Modification instructions: Thank you for your valuable feedback. In response to your concerns about the practical detection performance, we have further included specific experimental results and practical application performance in the revised version. External validation results from the Laobanzhang base in Xishuangbanna, Yunnan Province, show that the MC-YOLOv13-L network can quickly and accurately capture the drought stress response of tea plants under mild drought conditions. In addition, we have clearly stated in the paper that the drought stress dataset used in this study consists of 324 original images, with 4212 images after data augmentation.
“The testing results from the drought stress dataset (324 original images, 4212 images after data augmentation) indicate that, in the training set, the Box Loss, Cls Loss, and DFL Loss of the MC-YOLOv13-L network decreased by 5.08%, 3.13%, and 4.85%, respectively, compared to the YOLOv13 network. In the validation set, these losses decreased by 2.82%, 7.32%, and 3.51%, respectively. Overall, the improved MC-YOLOv13-L improves Precision, Recall, and mAP by 4.64%, 6.93%, and 4.2%, respectively, at the cost of only 0.63 FPS. External validation results from the Laobanzhang base in Xishuangbanna, Yunnan Province, indicate that the MC-YOLOv13-L network can quickly and accurately capture the drought stress response of tea plants under mild drought conditions. This lays a solid foundation for the intelligence-driven development of the tea production sector and, to some extent, promotes the application of bio-inspired computing in complex ecosystems.”
- The introduction reviews several biomimetic vision models (aphids, small objects, and remote-sensing targets), but it does not clearly articulate what remains to be solved for chlorophyll-fluorescence-based drought detection in tea. For example:– Are there existing CF-imaging drought classifiers for other crops?
Modification instructions: Thank you for your review and valuable suggestions on our paper. In response to the issues you raised regarding the introduction, we have made the necessary revisions and added relevant content, including examples of existing drought detection methods based on chlorophyll fluorescence imaging.
“Muhammad Akbar Andi Arief and colleagues developed a system based on chlorophyll fluorescence imaging technology to address the issue of changes in the photosynthetic efficiency of strawberries under drought conditions. The system excites chlorophyll fluorescence using blue LED light and captures the fluorescence signal with a monochrome camera to measure the maximum photochemical quantum efficiency (Fv/Fm). The study shows that drought stress significantly reduces the photosynthetic efficiency of strawberries. This research not only validates the potential application of chlorophyll fluorescence imaging technology in plant drought monitoring but also provides a new technical approach for non-destructive real-time monitoring of plant health.”
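For reference, the Fv/Fm quantity mentioned above is conventionally computed from the minimum fluorescence F0 and maximum fluorescence Fm of a dark-adapted leaf; this standard definition is added here for clarity and is not quoted from the manuscript:

$$\frac{F_v}{F_m} = \frac{F_m - F_0}{F_m}$$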
- Is the main limitation early-stage sensitivity, class confusion, or speed? A sharper, more quantitative gap statement would better justify the complexity of MC-YOLOv13-L.
Modification instructions: Thank you very much for your valuable feedback on our paper. In response to your concerns about the model's limitations, we have provided further clarification in the revised paper. After testing, the main limitation of the model is category confusion, particularly in distinguishing between different drought levels. This is also the core issue we focused on addressing during the improvement of the MC-YOLOv13-L network. To better highlight the complexity and advantages of the MC-YOLOv13-L, we have provided a detailed comparison in the data analysis section, contrasting the performance gap between the improved network and existing mainstream models in practical detection. Through quantitative analysis, we demonstrated a significant improvement in the drought level classification accuracy and recall rate of the MC-YOLOv13-L, which fully proves the model's effectiveness in addressing category confusion and improving drought monitoring accuracy.
“However, when performing drought stress detection during the seedling stage of the Yunnan large-leaf tea variety, existing models are still prone to the influence of drought level confusion, resulting in low classification accuracy.”
“Compared to the original YOLOv13 network, YOLOv10, SSD, RT-DETR, and Faster-RCNN, the MC-YOLOv13-L network improved Precision by 4.64%, 6.4%, 17.66%, 6.68%, and 17.79%, respectively. Recall was improved by 6.93%, 7.72%, 14.73%, 10.08%, and 18.99%, respectively. F1 increased by 5.78%, 7.05%, 16.26%, 8.38%, and 18.38%, respectively. mAP was improved by 4.2%, 5.29%, 14.59%, 7.19%, and 17.32%, respectively.”
- Likewise, drought is considered a field problem for Yunnan tea plantations, but the experiment is conducted on potted seedlings with PEG-6000 solution in controlled conditions. The authors should explicitly discuss, already in the Introduction, why this setup is expected to approximate field drought responses (or where it may differ).
Modification instructions: Thank you very much for your valuable suggestions. In response to your comments about the differences between the experimental setup and the actual drought conditions in the tea garden, we have provided a detailed discussion in the introduction section of the revised paper.
“To accurately control the extent of drought stress in response to the above issues, this study uses a PEG-6000 solution to simulate drought conditions and constructs a dataset using chlorophyll fluorescence imaging technology[16]. Tests show that, although there are more complex factors in the field drought environment, the basic physiological effects of drought stress on tea seedlings are similar. Therefore, the phenotypic changes observed in the pot experiments are highly consistent with the results from the field experiments.”
- The manuscript specifies PEG-6000 concentrations (0, 10, 20, 30%) and soil relative humidity thresholds for drought grades (Table 1), but it is unclear how these two are quantitatively linked and how long plants were kept at each stress level before imaging. For example: How many days after treatment were CF images taken? Were soil moisture and PEG levels monitored over time or only set initially? This is crucial for reproducibility and for understanding the physiological significance of the labels.
Modification instructions: Thank you very much for your valuable suggestions. In response to your comments regarding the quantitative relationship between the PEG-6000 concentration and the soil relative humidity threshold, as well as the timing of data collection, we have provided additional clarification in the paper. To ensure that the plants adapt to the drought environment and exhibit stable physiological responses, this study maintained the normal growth of the tea seedlings for 15 days after the drought treatment and used the PR-3002-TRREC-N01 soil sensor to measure the soil moisture content of the potted plants daily at 9 AM. By precisely controlling soil moisture variation, we ensured that the soil moisture at each drought level remained within the preset range.
“To ensure that the plants adapt to the drought environment and exhibit stable physiological responses, this study maintained the normal growth of the tea seedlings for 15 days after the drought treatment and used the PR-3002-TRREC-N01 soil sensor to measure the soil moisture content of the potted plants daily at 9 AM. By precisely controlling soil moisture variation, we ensured that the soil moisture at each drought level remained within the preset range.”
- The authors mention 324 raw samples and then augment to 4,212 images. It is unclear: Does one “sample” correspond to one plant, one imaging day, or one image? Were multiple images per plant taken at different angles and/or dates, and if so, was the train/validation/test split done at plant level or image level? If images from the same plant can appear in both training and validation/test sets (even after augmentation), performance may be overestimated.
Modification instructions: Thank you very much for your valuable suggestions. To further clarify some details of the experimental design, we have provided the necessary supplementary explanations in the paper. In this study, an original sample refers to a single tea seedling. All original samples were photographed after the drought treatment on day 15, with each tea seedling only undergoing one data collection session.
“To ensure that the plants adapt to the drought environment and exhibit stable physiological responses, this study maintained the normal growth of the tea seedlings for 15 days after the drought treatment and used the PR-3002-TRREC-N01 soil sensor to measure the soil moisture content of the potted plants daily at 9 AM. By precisely controlling soil moisture variation, we ensured that the soil moisture at each drought level remained within the preset range.”
“A total of 324 original samples were collected in this study, with each sample derived from a different experimental plant.”
- Table 2 presents the correlations of 14 indices with drought stress, with Fv/Fm exhibiting the strongest Pearson correlation. However, the decision to discard all other channels and use only Fv/Fm as input to the detection model should be justified more rigorously (e.g., multivariate redundancy analysis, risk of overfitting when adding channels, practical sensor constraints). It is plausible that combining several partially correlated indices could further improve early detection.
Modification instructions: Thank you very much for your valuable suggestions. In response to your concern about using only Fv/Fm as the input channel, this is because, apart from Fv/Fm, the other 13 indicators have a low correlation with drought stress, with the highest being only 0.597, much lower than Fv/Fm's 0.89. Therefore, during model training, we chose Fv/Fm as the sole input channel. We also conducted several experiments testing the inclusion of other indicators with lower correlation, and found that these did not significantly improve the model's detection performance, but instead led to a decline in model performance. In addition, we reviewed a substantial amount of prior research, with many studies indicating that Fv/Fm is one of the most representative indicators for drought stress detection, and in most related studies, Fv/Fm is the only indicator used. For example, Muhammad Akbar Andi Arief and colleagues also used only Fv/Fm in their drought phenotype detection and achieved good results. Once again, thank you for your valuable suggestions, and we have made the corresponding revisions in the paper.
“The results indicate that the correlation between Maximum Photosynthetic Efficiency (Fv/Fm) and drought stress is significantly higher than that of other indicators. Therefore, this study uses the Fv/Fm channel images as the input data for the detection model.”
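For completeness, the Pearson correlation coefficient referenced above is the standard quantity below, computed between an index value x and a drought-stress label y over n samples (general background, not quoted from the manuscript):

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^{2}}}$$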
- Many reported improvements are within a few percentage points (e.g., +4.2% mAP vs YOLOv13; +0.75–1.36% in ablations). Because the dataset is relatively small, these gains may be attributed to variance across runs or random splits. I strongly recommend reporting: mean ± standard deviation over multiple runs, or confidence intervals and/or statistical tests for the metrics, especially for the ablation study, where each configuration is already trained with five seeds.
Modification instructions: Thank you very much for your valuable suggestions. Regarding the issue you mentioned about the small dataset potentially leading to results being affected by randomness, we have added data augmentation to increase the number of images in the "2.2 Dataset Construction" section, and we have made targeted improvements in the revised paper based on your feedback. To better evaluate the model's stability and statistical significance, we have included the reporting of the mAP@50-95 metric in the experimental results.
“Similarly, mAP@0.5-95 represents the AP across all thresholds from an IoU of 0.5 to 0.95, with a threshold step of 0.05.”
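Written out, this averaging corresponds to the standard formulation below (added for clarity, not quoted from the paper):

$$\mathrm{mAP@50\text{-}95} = \frac{1}{10}\sum_{t \in \{0.50,\,0.55,\,\ldots,\,0.95\}} \mathrm{mAP}_{\mathrm{IoU}=t}$$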
“As shown in Table 6, the multi-scale linear attention mechanism optimization improved the original network’s Precision by 0.55%, Recall by 3.76%, mAP@50 by 0.75%, mAP@50-95 by 1.29%, and FPS by 2.27, while reducing the original network’s FLOPs by 0.2 G. This optimization reduced the original network's missed detections and drought level confusion to some extent, while improving the model's detection speed. The CMUNeXt module optimization improved the original network’s Precision by 0.63%, Recall by 4.13%, mAP@50 by 1.36%, and mAP@50-95 by 1.41%, while increasing FLOPs by 0.9 G and reducing FPS by 1.23. Although this optimization reduced the model's detection speed by 2.20%, it significantly improved the model’s feature discriminative ability. The auxiliary bounding box algorithm optimization, without changing the original network's basic architecture, improved the original network’s Precision by 0.72%, Recall by 3.47%, mAP@50 by 1.08%, and mAP@50-95 by 1.36%, significantly improving the localization accuracy and recall rate of tea seedling drought stress detection. After the overall improvements, compared to the original YOLOv13 network, the MC-YOLOv13-L network's Precision, Recall, mAP@50, and mAP@50-95 were improved by 4.64%, 6.93%, 4.2%, and 5.07%, respectively, with only a 0.63 decrease in FPS.”
- The authors note that recall for severe drought is initially lower and improves with MC-YOLOv13-L. It would strengthen the paper to provide: a confusion matrix by drought level, per-class AP values, and a short discussion of which levels remain most confused (e.g., mild vs moderate), since “drought level confusion” is one of the core motivations of the work.
Modification instructions: Thank you very much for your valuable suggestions. To better address your suggestions, we have made further revisions and additions to the paper. We have added a confusion matrix to the paper to more clearly demonstrate the classification performance between different drought levels. We have also specifically discussed the reason why severe and moderate droughts are most easily confused in the model. This is mainly because the drought manifestations of these two levels are relatively similar, and their leaf photosynthetic efficiency and water status changes are quite close.
“In the study of drought stress detection for Yunnan large-leaf tea seedlings, the confusion matrix is used to show the model’s detection performance across different drought levels. As shown in Figure 9, the rows of the matrix represent the true drought levels, while the columns represent the predicted drought levels. The larger the values on the diagonal, the higher the recognition accuracy for that class. The off-diagonal elements reflect the degree of confusion between different drought levels in the model. The results show that moderate drought and severe drought are the most easily confused. Testing revealed that this is due to similar phenotypic changes in the leaves of some tea seedlings under moderate and severe drought conditions, lacking clear boundaries. Compared to the YOLOv13 network, the improved MC-YOLOv13-L increased detection accuracy for tea seedling phenotypes under no drought stress by 2 percentage points, for mild drought stress by 3 percentage points, for moderate drought stress by 4 percentage points, and for severe drought stress by 11 percentage points. MC-YOLOv13-L effectively enhances the model's feature discriminative ability, alleviating confusion between drought levels.”
- The Laobanzhang base dataset is used for external verification, but the details are minimal. Please specify: the number of plants and images per drought class, whether the stress protocol (PEG-induced or natural drought) and imaging device/settings were identical or different, and whether any images from this base were used at any stage of training or hyperparameter tuning. Without this information, the “external validation” could still be very similar to the training domain, limiting the strength of the robustness claims.
Modification instructions: Thank you very much for your valuable suggestions. In response to your mention of the detailed information regarding the external validation of the Laobanzhang dataset, we have made the corresponding additions in the paper.
“The external validation set consists of 100 samples, with 25 samples for each drought stress level. The drought treatment and data collection equipment are the same as those used in the training set, and this dataset was not involved in the training process.”
- Section 4 is largely a repetition and structured summary of results, with limited deeper interpretation of biological or agronomic implications. I recommend: explicitly discussing how early in the drought progression the model can detect stress (e.g., at mild or moderate drought), comparing this with conventional physiological indicators (soil moisture, leaf water potential), and explaining what this would concretely change in tea management (e.g., irrigation scheduling, yield protection).
Modification instructions: Thank you very much for your valuable suggestions. In response to your mention of the lack of in-depth biological or agricultural significance interpretation in Section 4, we have made corresponding additions and improvements to the paper.
“The external validation results show that the improved MC-YOLOv13-L network can quickly and accurately capture the drought resistance response of tea seedlings under mild drought conditions, and its detection accuracy is significantly better than mainstream algorithms.”
“In the drought stress detection task for Yunnan large-leaf tea seedlings, this study used the Plant Explorer Pro as the data collection device. Compared to traditional physiological indicator detection, the chlorophyll fluorescence imaging technology used by the Plant Explorer Pro can sensitively reflect the phenotypic changes in tea seedling leaves under mild drought conditions.”
“The developed drought stress detection model can help farmers adjust irrigation plans more promptly, preventing tea seedling growth limitations or damage caused by drought stress.”
- The final paragraph briefly notes that the study focuses only on Yunnan large-leaf tea and that the compound-eye mechanism is limited to static parallel stitching. This is useful but could be more explicit about: the limitations of PEG-induced drought vs natural field drought (soil heterogeneity, root depth, temperature interactions), potential differences in canopy structure and background clutter in plantations vs lab imaging, and the need to test the approach under different illumination conditions and with multi-channel CF inputs.
Modification instructions: Thank you very much for your valuable suggestions. We fully agree with your viewpoint and appreciate your thorough review of the paper. In response to your mention of the need for further discussion on the experimental limitations, we have made the corresponding revisions and additions in the paper.
“PEG-induced drought treatment cannot fully simulate the complexity of natural field drought, and the canopy structure of tea trees in tea gardens is more complex than in laboratory environments, with more background interference. Therefore, our team will conduct field experiments in the future and perform plant phenotyping under different light conditions and backgrounds.”
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
(1) Each data augmentation technique (such as HSV space perturbation and Gaussian blur) should be briefly explained, including the reasons for its selection and how it helps improve the model's generalization ability or adapt to different environmental conditions.
(2) A detailed explanation of the sub-image stitching method and the bounding box inverse mapping algorithm should be provided, supplemented with relevant diagrams or formula derivations.
(3) The results section should discuss whether the performance improvement has led to an increase in computational costs. For instance, although detection accuracy may have improved, is the computation speed sufficient to meet the requirements for real-time applications? It is recommended to analyze the impact on computation speed (FPS) alongside accuracy improvements, and evaluate its applicability and limitations in practical deployment.
(4) The figures should provide more detailed legends and descriptions, especially regarding the Grad-CAM heat maps. It is necessary to explain why specific areas are emphasized and how this attention influences detection results.
(5) The paper mentions that experiments compared multiple detection models, but there is a lack of in-depth discussion regarding the reasons behind the comparison results. For example, what specific advantages does MC-YOLOv13-L have over YOLOv13? Does MC-YOLOv13-L perform better in all scenarios?
(6) The discussion section should include more comparative analysis, exploring the reasons for performance differences among different models, and examining the performance of MC-YOLOv13-L under various experimental conditions.
(7) The conclusion section should be expanded to propose future research directions, particularly regarding practical deployment, model optimization, and cross-domain applications, along with a detailed discussion.
Author Response
Thank you very much for taking the time to review this manuscript. We really appreciate your comments and suggestions. We have considered these comments carefully and tried our best to address every one of them, and the corresponding parts of the text have been modified using red font.
- Each data augmentation technique (such as HSV space perturbation and Gaussian blur) should be briefly explained, including the reasons for its selection and how it helps improve the model's generalization ability or adapt to different environmental conditions.
Modification instructions: Thank you very much for your valuable suggestions. We have provided further explanation and additions on data augmentation techniques in the paper.
“The HSV spatial perturbation is mainly used to simulate different color temperatures, white balance, and backlighting conditions, reducing the model's reliance on non-physiological color variations and false color mapping, while enhancing the model's focus on drought-related features, thereby improving the model's adaptability to different shooting conditions. Mean Blur is used to simulate the detail degradation caused by slight defocus and resampling; it preserves the large-scale intensity gradient while suppressing high-frequency noise and random particles, thereby improving the model's recognition accuracy for low-definition images. By approximating the point spread function of the optical system, Gaussian Blur simulates the blur caused by defocus or slight motion and enhances the model's detection accuracy under fast imaging or unstable exposure. Median Blur uses a small-scale kernel for median filtering, which effectively removes salt-and-pepper noise and hot-pixel outlier interference; while maintaining the integrity of the edge structure, it reduces the probability of the model misjudging abnormal bright and dark spots. CutOut is used to simulate common occlusions such as leaves occluding each other, identification stickers, and shadow reflections; the number of random occlusions is set to 1–6, and the occlusion area accounts for 5%–40% of the original image. Without changing the topological structure of the image, the D4 transform (a combination of rotations and mirrors) applies rotation and mirror operations to the samples to enhance the model's generalization to different shooting angles and placement habits.”
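As an illustration of how such augmentations are commonly implemented, here is a minimal OpenCV/NumPy sketch; the operations are generic, not the authors' pipeline, and the kernel sizes and perturbation gains are placeholder assumptions.

```python
import random

import cv2
import numpy as np


def hsv_jitter(img, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly scale hue/saturation/value to mimic color-temperature shifts."""
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv *= 1 + np.random.uniform(-1, 1, 3) * (h_gain, s_gain, v_gain)
    hsv[..., 0] %= 180  # hue wraps around in OpenCV's 0-179 range
    return cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)


def random_blur(img):
    """Apply one of mean, Gaussian, or median blur to simulate defocus or noise."""
    k = random.choice([3, 5])
    op = random.choice([
        lambda x: cv2.blur(x, (k, k)),
        lambda x: cv2.GaussianBlur(x, (k, k), 0),
        lambda x: cv2.medianBlur(x, k),
    ])
    return op(img)


def cutout(img, n_holes=(1, 6), area_frac=(0.05, 0.40)):
    """Mask random rectangles to simulate occluding leaves, stickers, and shadows."""
    out = img.copy()
    h, w = out.shape[:2]
    for _ in range(random.randint(*n_holes)):
        frac = random.uniform(*area_frac)
        ph, pw = int(h * np.sqrt(frac)), int(w * np.sqrt(frac))
        y, x = random.randint(0, h - ph), random.randint(0, w - pw)
        out[y:y + ph, x:x + pw] = 0
    return out


def d4_transform(img):
    """Random element of the D4 group: 0/90/180/270 rotation plus optional mirror."""
    img = np.rot90(img, k=random.randint(0, 3))
    return np.fliplr(img) if random.random() < 0.5 else img
```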
- A detailed explanation of the sub-image stitching method and the bounding box inverse mapping algorithm should be provided, supplemented with relevant diagrams or formula derivations.
Modification instructions: Thank you very much for your valuable suggestions. In response to your mention of the detailed explanation of the sub-image stitching method and the bounding box reverse mapping algorithm, we have made further additions in the paper. The working principles have been explained in detail using mathematical derivations in the revised version of the paper. If supporting materials are required, I am willing to provide the corresponding code immediately.
“During the splitting process, this study first calculates the absolute values of the normalized TXT labels, as shown in Equations 1 to 4. Here, x and y represent the coordinates of the center of the bounding box, w and h represent the width and height of the bounding box, and W and H represent the width and height of the combined image. After the image is split, the absolute labels are normalized and restored according to the size of the cropped image, as shown in Equations 5 to 8. Here, x0 and y0 represent the coordinates of the top-left corner of the current cropped image, while wc and hc represent the width and height of the cropped image.”
- The results section should discuss whether the performance improvement has led to an increase in computational costs. For instance, although detection accuracy may have improved, is the computation speed sufficient to meet the requirements for real-time applications? It is recommended to analyze the impact on computation speed (FPS) alongside accuracy improvements, and evaluate its applicability and limitations in practical deployment.
Modification instructions: Thank you very much for your valuable suggestions. We have included further discussion on computational performance in the revised version of the paper. Although the model's complexity increased the computational load, the FPS of the MC-YOLOv13-L model only decreased by 0.63, still maintaining a high inference speed. The improved model still performs well in real-time applications and can meet the requirements for real-time detection.
- The figures should provide more detailed legends and descriptions, especially regarding the Grad-CAM heat maps. It is necessary to explain why specific areas are emphasized and how this attention influences detection results.
Modification instructions: Thank you very much for your valuable suggestions. We have revised the paper based on your suggestions, particularly the section on Grad-CAM heatmaps. In this study, the highlighted areas focused on by the model are all phenotypic features related to drought stress, such as wilting at the leaf edges and changes in color. Through the heatmap, we can visually observe that the model’s attention is focused on key areas of the tea seedling, such as the center of the leaf or the veins, where the drought response is more pronounced, thus helping to improve the model's detection accuracy.
- The paper mentions that experiments compared multiple detection models, but there is a lack of in-depth discussion regarding the reasons behind the comparison results. For example, what specific advantages does MC-YOLOv13-L have over YOLOv13? Does MC-YOLOv13-L perform better in all scenarios?
Modification instructions: Thank you very much for your valuable suggestions. In response to your mention of the discussion on the comparison of multiple detection models, we have made further additions and improvements in the paper. In the revised paper, we have discussed in detail the advantages of MC-YOLOv13-L compared to YOLOv13. We explicitly state that MC-YOLOv13-L improves the model's detection accuracy while maintaining a high detection speed. Especially in the drought stress detection task, MC-YOLOv13-L is able to more accurately identify different drought levels, with significant improvement in accuracy during the early stages of drought. While MC-YOLOv13-L has shown significant advantages in our drought detection task, its performance has not been significantly superior to other traditional models in more generalized detection tasks. The advantage of MC-YOLOv13-L is more evident in its application to the specific task of drought stress detection for Yunnan large-leaf tea seedlings.
- The discussion section should include more comparative analysis, exploring the reasons for performance differences among different models, and examining the performance of MC-YOLOv13-L under various experimental conditions.
Modification instructions: Thank you very much for your valuable suggestions. We have made revisions and clarified that the advantages of MC-YOLOv13-L are more evident in the specific task of drought stress detection for Yunnan large-leaf tea seedlings. However, in more generalized detection tasks, its performance is not significantly superior to the original YOLOv13 network. In addition, to better compare the performance differences between MC-YOLOv13-L and YOLOv13 networks, we have further introduced the mAP@50-95 metric in the paper and included a confusion matrix to analyze the results. The research results indicate that moderate and severe droughts are the most easily confused. Testing revealed that this is due to the similar phenotypic changes in tea seedlings under moderate and severe drought conditions, lacking clear boundaries. Furthermore, we have added an explanation that PEG-induced drought treatment cannot fully simulate the complexity of natural field drought, and the canopy structure of tea trees in tea gardens is more complex than in laboratory environments, with more background interference. Therefore, our team will conduct field experiments in the future and perform plant phenotyping under different light conditions and background environments.
- The conclusion section should be expanded to propose future research directions, particularly regarding practical deployment, model optimization, and cross-domain applications, along with a detailed discussion.
Modification instructions: We greatly appreciate your valuable suggestions. Based on your advice, we have expanded the conclusion section and added a detailed discussion on future research directions. For model optimization, we have proposed the possibility of a lightweight design, exploring how to optimize the model’s inference speed and computational resource consumption while maintaining high detection accuracy. By integrating adaptive optimization techniques, the model’s generalization ability and adaptability across different drought levels, crops, and complex backgrounds will be further enhanced.
Author Response File:
Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Revised manuscript is fine for acceptance.
Author Response
We sincerely thank you for your review and acknowledgment of our manuscript, as well as for the professional suggestions and thorough review provided throughout the entire evaluation process. Your feedback has significantly contributed to improving the clarity of the presentation, the transparency of the experiments, and the methodological rigor of the paper. We deeply appreciate the time and effort you have dedicated, which has played a key role in enhancing the quality of the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
I appreciate the authors’ detailed responses and the substantial revisions made to the manuscript. The paper has clearly improved in clarity, experimental transparency, and interpretation, particularly with respect to dataset construction, drought-level confusion analysis, and the inclusion of an external validation set. However, one important methodological concern raised in my original review remains insufficiently addressed and should be resolved before acceptance. Although the authors note that each ablation configuration was trained with five different random seeds, the manuscript reports only the best-performing run for each configuration. Given the relatively small dataset and the modest magnitude of several reported improvements (e.g., +0.7–1.4% in mAP in the ablation study), selecting the best run risks overestimating model gains and does not adequately demonstrate robustness.
Author Response
Thank you very much for taking the time to review this manuscript. We really appreciate your comments and suggestions. We have considered these comments carefully and tried our best to address every one of them, and the corresponding parts of the text have been modified using red font.
- I appreciate the authors’ detailed responses and the substantial revisions made to the manuscript. The paper has clearly improved in clarity, experimental transparency, and interpretation, particularly with respect to dataset construction, drought-level confusion analysis, and the inclusion of an external validation set. However, one important methodological concern raised in my original review remains insufficiently addressed and should be resolved before acceptance. Although the authors note that each ablation configuration was trained with five different random seeds, the manuscript reports only the best-performing run for each configuration. Given the relatively small dataset and the modest magnitude of several reported improvements (e.g., +0.7–1.4% in mAP in the ablation study), selecting the best run risks overestimating model gains and does not adequately demonstrate robustness.
Modification instructions: Thank you very much for your careful review of our paper and for your valuable comments. We take your concerns about the ablation study results very seriously. To address this issue, we have updated all ablation study models in the manuscript and added the mean and standard deviation for each configuration. We again appreciate your valuable feedback. We have revised the paper accordingly and hope these improvements make the work more rigorous and complete.
Table 6. Ablation Study Results.
| Model | Precision (%) | Recall (%) | mAP@50 (%) | Avg-mAP@50 (%) | FLOPs (G) | Parameters | mAP@50-95 (%) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLOv13 | 88.39 | 88.08 | 91.74 | 90.34±0.78 | 6.4 | 2,460,691 | 72.98 | 55.87 |
| M-YOLOv13 | 88.94 | 91.84 | 92.49 | 91.10±1.09 | 6.2 | 2,355,775 | 74.27 | 58.14 |
| C-YOLOv13 | 89.02 | 92.21 | 93.10 | 91.96±0.82 | 7.3 | 3,582,483 | 74.39 | 54.64 |
| YOLOv13-L | 89.11 | 91.55 | 92.82 | 91.77±0.67 | 6.4 | 2,460,691 | 74.34 | 57.14 |
| MC-YOLOv13 | 89.67 | 92.32 | 94.02 | 93.15±0.73 | 7.1 | 3,477,567 | 75.71 | 55.25 |
| M-YOLOv13-L | 91.01 | 91.49 | 93.78 | 92.71±1.00 | 6.2 | 2,355,775 | 74.74 | 58.48 |
| C-YOLOv13-L | 92.15 | 94.78 | 94.24 | 93.69±0.33 | 7.3 | 3,582,483 | 76.58 | 54.95 |
| MC-YOLOv13-L | 93.03 | 95.01 | 95.94 | 95.16±0.53 | 7.1 | 3,477,567 | 78.05 | 55.24 |
Note: M: Multi-scale Linear Attention Mechanism Optimization; C: CMUNeXt Module Optimization; L: Loss Function Optimization; Avg: Average.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Thank you for your detailed responses to my questions. I have no further questions, and I recommend this paper for publication.
Author Response
We sincerely thank you for your review and acknowledgment of our manuscript, as well as for the professional suggestions and thorough review provided throughout the entire evaluation process. Your feedback has significantly contributed to improving the clarity of the presentation, the transparency of the experiments, and the methodological rigor of the paper. We deeply appreciate the time and effort you have dedicated, which has played a key role in enhancing the quality of the manuscript.