Peer-Review Record

A Garbage Detection and Classification Model for Orchards Based on Lightweight YOLOv7

Sustainability 2025, 17(9), 3922; https://doi.org/10.3390/su17093922

by Xinyuan Tian¹, Liping Bai¹,* and Deyun Mo¹,²

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Marta Bistron

Reviewer 4: Anonymous

Submission received: 6 March 2025 / Revised: 21 April 2025 / Accepted: 23 April 2025 / Published: 27 April 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper uses deep learning to identify 16 kinds of garbage in orchards, which is innovative, but it has the following problems:
Line 34: The positioning needs to be revised, and the significance of garbage should be analyzed from a global perspective.
Line 46: Please add the scale of orchard waste.
Line 248: The image is of little value, and you are advised to delete it.
Line 320: Delete Formula 8.
Line 379: Add other feature pictures of the sample.
Line 135: Please explain the difference between orchard garbage, lawn garbage, and garbage along the road.

Author Response

Comments 1: Line 34: The positioning needs to be revised, and the significance of garbage should be analyzed from a global perspective.

Response 1: I agree that this change is necessary and have revised the manuscript as follows: global garbage-sorting policies have proven to be a key governance tool for improving the efficiency of resource conversion and the quality of human settlements. Standardised waste disposal mechanisms not only significantly improve resource recycling efficiency but also affect the public's quality of life, and they are an important indicator of a society's level of civilisation.

Comments 2: Line 46: Please add the scale of orchard waste.

Response 2: A corresponding description has been added to the manuscript.

Global orchard waste generation reaches 280-350 million tonnes annually, a scale equivalent to 1.8 times the global municipal garden waste. However, due to bottlenecks in treatment capacity and degradation technology, the current resource utilisation rate is less than 28%, which is a key constraint to improving the effectiveness of agricultural recycling systems.

Comments 3: Line 248: The image is of little value, and you are advised to delete it.

Response 3: This may be due to a difference in manuscript versions, but line 248 in my version does not contain a figure. I assume you are referring to Figure 1. That figure presents the fusion of the relation vectors learned by MobileNetV3 and GhostNet, and I think it illustrates well how the fusion is done, so I have kept it.

Comments 4: Line 320: Delete Formula 8.

Response 4: Formula 8 has been deleted.

Comments 5: Line 379: Add other feature pictures of the sample.

Response 5: Additions have been made to the manuscript.

Comments 6: Line 135: Please explain the difference between orchard garbage, lawn garbage, and garbage along the road.

Response 6: Orchard garbage is mainly pruned branches, rusty tools, and glass, which are hard, slow to degrade, and usually not specifically handled. Lawn garbage resembles crushed grass clippings, is soft and perishable, and may be suitable for composting for reuse. Roadside garbage is like greyish leaves, often mixed with vehicle exhaust pollutants and plastic waste, and requires special disinfection.

Reviewer 2 Report

Comments and Suggestions for Authors

Strengths / Good Points

Innovative Application:

The focus on orchard-specific waste detection addresses a real-world niche that hasn’t been sufficiently studied. This is a fresh and relevant application of deep learning and sustainability efforts.

Lightweight Model Design:

Smart replacement of CSPDarknet53 with MobileNetV3 and GhostNet, significantly reducing computational load to 16% while achieving 84.4% mAP.
Useful for real-time detection on edge devices like drones and patrol vehicles.

Strong Experimental Section:

Well-done ablation studies, hyperparameter analysis, and comparison of feature fusion strategies.
Clear visualization (e.g., Figures 5–9) supports claims.

Practical Deployment & Impact:

Field test results (e.g., 41% increase in recycling, 35% reduction in hazardous spill) are promising.
Alignment with Zero Waste Orchard policy and Sustainable Development Goals (SDGs) adds social impact.

Parts That Need Corrections / Improvements

English Language Quality:
Inconsistent or Missing References:

Some URLs (e.g., Huawei competition) should be properly formatted.

Lack of Discussion on Limitations:

No explicit discussion of model limitations, such as:

Sensitivity to lighting or occlusion?
False positives in natural clutter?
How often the drone misclassifies due to seasonal changes?

Clarity in Mathematical Formulations:

Some formula blocks are a bit dense for readers unfamiliar with contrastive learning. More intuitive explanations or breakdowns of variables would help.
Example: Equation (8) – unclear what Sq or K specifically refers to without digging.

Suggestions for Improvement

Add Limitation & Future Improvement Discussion:

Explain how seasonal changes, lighting conditions, or motion blur can affect the system during drone capture?
Suggest adding LiDAR or depth sensors as future work for occlusion issues.

Language Polishing:

A native English speaker or language service should edit for tone, grammar, and flow.

Visual Summaries:

It is preferable to:

Include a workflow diagram summarizing the model pipeline (YOLOv7, MobileNetV3, GhostNet, SFT, etc.)
Add a comparison table of different YOLO versions and performance metrics (YOLOv5, YOLOv7, etc.) for clarity.

Policy/Practical Deployment Details:

How is the system integrated into real orchard operations? Maybe include a flowchart of patrol route, detection, robotic arm sorting?

Comments on the Quality of English Language

Language Polishing:

A native English speaker or language service should edit for tone, grammar, and flow.

Author Response

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files. We will improve the English expression in the manuscript; our responses to the specific comments follow.

Comment1：Inconsistent or Missing References: Some URLs (e.g., Huawei competition) should be properly formatted.

Response1：We will recheck the reference citations in the manuscript. In the meantime, we have verified that the link https://www.huaweicloud.com/zhishi/dasai-19ljfl.html opens correctly.

Comment2：Lack of Discussion on Limitations:

No explicit discussion of model limitations, such as:

Sensitivity to lighting or occlusion?

False positives in natural clutter?

Response2：

As seasonal changes can lead to differences in the environment, we specifically collected images by manual photography during different seasons to increase the adaptability of the dataset.

The effect of lighting on the algorithm was also considered. We applied data augmentation to the dataset so that the model recognises objects better under different lighting conditions.

We have also added suggestions for future improvements to the manuscript:
1) Integration of LIDAR or Depth Sensors
In order to mitigate occlusion and improve detection accuracy in complex environments, the integration of LIDAR (Light Detection and Ranging) or depth sensors can greatly improve the model's ability to distinguish between garbage and surrounding vegetation. These sensors provide three-dimensional information about the scene, allowing for more accurate garbage detection, especially in cluttered or occluded areas.
2) Improved pre-processing for cluttered backgrounds
To reduce false alarms from UAVs filming in naturally cluttered environments, a better pre-processing step could be integrated into the model to distinguish litter from natural elements using techniques such as semantic segmentation or contextual feature learning. This would allow the system to better distinguish between garbage and environmental elements that look like garbage.

Comment3：Clarity in Mathematical Formulations:

Some formula blocks are a bit dense for readers unfamiliar with contrastive learning. More intuitive explanations or breakdowns of variables would help.

Example: Equation (8) – unclear what Sq or K specifically refers to without digging.

Response3：These three dense formulas define the loss function. This part is not the focus of this study; we simply use the loss function in this form, and the corresponding explanations are given in the manuscript.

Comment4：Add a comparison table of different YOLO versions and performance metrics (YOLOv5, YOLOv7, etc.) for clarity.

Response4：We did make this comparison: our overall results are much better than those of YOLOv5, and detection speed is also improved. We did not include the comparison results because the original YOLOv7 already outperforms YOLOv5, so a side-by-side comparison with YOLOv7 seemed sufficient here.

Comment5：How is the system integrated into real orchard operations? Maybe include a flowchart of patrol route, detection, robotic arm sorting?

Response5：Thank you for your valuable input; this will guide our subsequent research. We think the algorithm can be deployed on a small sorting robot or on the on-board computing unit of a UAV, and the usage would differ between the two. UAV inspection will also depend on integration with the platform. This part is work we still need to study, and we have not yet been able to demonstrate it well. This research mainly focuses on dataset expansion and algorithm optimization.

Comment6：Include a workflow diagram summarizing the model pipeline (YOLOv7, MobileNetV3, GhostNet, SFT, etc.)

Response6：

Your comments have greatly helped us to improve the rigour of our research and the clarity of our presentation. We will continue to improve our work and sincerely hope that this revision will meet your requirements. If there is anything else that needs to be added or adjusted, we will cooperate fully. Thank you for your time and effort in your busy schedule!

Reviewer 3 Report

Comments and Suggestions for Authors

Notes in attached document.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Included in the review.

Author Response

Comment1：The introduction is quite brief. It is recommended to enrich it with more references to recent literature on the Zero Waste Orchard policy and its challenges

Response1：Taking into account your valuable comments and the comments of other reviewers, we have changed this policy to a policy on waste separation and also expanded the application area. We have made corresponding changes in the introductory part.

Comment2：The presentation of the proposed methodology does not include an introductory general conceptual framework. It is recommended that Chapter 3 starts with a diagram of the general architecture with a concise explanation, and only then moves on to specific structures and methods as in the following subsections.

Response2：We agree with your comment, and we have included a full architecture diagram with explanations in the manuscript. This should make the method clearer for the reader.

Comment3：The manuscript does not clearly distinguish which elements of the model are new contributions (e.g., SFT module, contrastive fusion strategy) and which are adapted from existing literature. A summary table or a clear paragraph presenting the scientific contributions of the authors would be helpful.

Response3：We have also made changes in section 3.1 to highlight our work. The corresponding changes have been highlighted in the manuscript.

Comment4：Key methodological blocks (e.g., relational network module, contrastive loss) are explained in dense text with equations, but there is a lack of diagrams or pseudocode to better understand the ideas described.

Response4：We have explained the formulas in this part and described their parameters. We tried to illustrate them with a figure, but it did not seem very intuitive. We have highlighted the explanations of the formulas in the text, which should help readers understand them.

Comment5：Table 2 shows that the best performing single lightweight framework in terms of mAP is MobileNetV2 (0.971), outperforming both MobileNetV3 (0.805) and GhostNet (0.812). The authors decided to combine MobileNetV3 and GhostNet, but do not adequately justify this decision. This is particularly puzzling since their combination does not outperform MobileNetV2. A discussion explaining the rationale—or reconsidering the combination strategy—is needed.

Response5：Thank you for reading the manuscript so carefully and meticulously. This is a clerical error: we wrongly wrote 0.791 as 0.971; it is obvious that a network's mAP can't be that high either. We apologise for not catching it earlier.

Comment6：The paper does not specify how the orchard waste dataset was divided into training, validation, and test sets, or whether any cross-validation was applied. This is critical for assessing generalizability and avoiding data leakage.

Response6：We performed this split during training and have added a description to the manuscript. We randomly divided the dataset into three subsets (training, validation and test) using a Python script, in a 7:2:1 ratio. The training set was used to train the model, the validation set was used to tune the model's hyper-parameters and make an initial assessment of its capabilities, and the test set was used to measure detection accuracy and assess generalisation ability. We also used k-fold cross-validation during the training phase.
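
The split described in this response can be sketched as follows. This is a minimal illustration and not the authors' script: the function names `split_dataset` and `k_fold_indices`, the fixed seed, the fold count and the file names are all assumptions, and the record does not say how the k-fold procedure was combined with the fixed 7:2:1 split.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle items and split them into train/val/test by the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) index lists for plain k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


if __name__ == "__main__":
    # Hypothetical file names; the real dataset layout is not given in the record.
    images = [f"img_{i:04d}.jpg" for i in range(1000)]
    train, val, test = split_dataset(images)
    print(len(train), len(val), len(test))
```

Splitting once with a fixed seed, before any augmentation, is one common way to keep augmented copies of the same image from landing in both the training and test sets, which is the data-leakage concern the reviewer raises.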

Comment7：Figure 8 summarizes the critical results, but is not discussed in sufficient detail. The authors should explain which classes achieved high or low mAP, possible reasons, and implications for orchard-specific use cases.

Response7：This is a very good suggestion and will help us a lot in our subsequent research. We have added this description to the manuscript. We believe that recognition is poorer for some classes either because they have few samples in the dataset or because items of that type are sensitive to lighting.

Comment8：There is no error analysis or discussion of false positives/negatives, which would be necessary to assess the risk of real-world implementation.

Response8：Table 4 reports and compares precision (P), recall (R) and F1, although we consider mAP the more important metric. Based on the P/R/F1 values, we consider the results acceptable.
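
For readers relating P/R/F1 to the reviewer's question about false positives and false negatives, the standard definitions can be sketched as below. The counts in the example are hypothetical and are not taken from the manuscript's Table 4; the function name `precision_recall_f1` is likewise only illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from per-class detection counts.

    tp: correct detections, fp: false positives, fn: missed objects.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts for a single class, for illustration only.
    p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Precision falls as false positives rise and recall falls as false negatives rise, so per-class P and R already carry part of the error analysis the reviewer asks for.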

Comment9：Figure 9 aims to show examples of successful detections, but includes only one detection specific to orchard waste (fertilizer-bag). More representative examples (e.g. pesticide containers, branches) should demonstrate the target capabilities of the model.

Response9：We considered the choice of this figure carefully. During inspection we found that, besides fertiliser bags, the most common items in the orchard are plastic bottles, metal cans and cigarette butts. The cigarette-butt detections seemed to reflect the characteristics of the orchard even less, so we chose the four images shown in the manuscript.

Comment10：Conclusions for future work relate to plans to expand into new ecosystems, but remain unclear. The authors should clarify whether they plan to expand the dataset, explore different model architectures, address domain adaptation? More detailed explanations are recommended.

Response10：Future work will prioritise expanding the model's adaptability to different orchard ecosystems such as temperate fruit forests and tropical plantations. We have planned two directions of development, the first in terms of modelling and the second in terms of integrating external capabilities. We have described the details in the manuscript.

Comment11：The use of terms like “semantic feature-relational network” or “relation vector” should be carefully defined and consistently used.

Response11：We will optimise this in the manuscript.

Comment12：Typos and grammar (e.g., “garbage” vs. “waste”, inconsistent spacing in equations) should be revised for clarity.

Response12：We will optimise this in the manuscript.

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript proposes a novel lightweight YOLOv7-based object detection model tailored to the classification of garbage in orchard environments. The work replaces the CSPDarknet53 backbone with MobileNetV3 and GhostNet to achieve a reduction in computational load while maintaining high accuracy. I have the following observation to improve the quality of the paper.

Some sections such as supervised contrastive learning, equations 6–8, are highly dense and could benefit from more intuitive explanations or diagrams for general readers. Include simplified flow diagrams or illustrative examples showing the role of contrastive learning in enhancing representation.
The methodology section at times repeats concepts: fusion layers and relational networks are described in multiple subsections with similar content. Consolidate repeated information and clearly separate model components: (1) backbone, (2) feature fusion, (3) contrastive learning.
The dataset has significant class imbalance such as "plastic product" = 99, and "glass" = 819. Acknowledge the implications of this imbalance in the results section and discuss any mitigation techniques (e.g., data augmentation, class reweighting).
Some references are cited with incomplete or inconsistent formatting (e.g., missing author names in some places).

The paper has strong merit in terms of its contribution to lightweight garbage detection in orchards using a YOLOv7-based framework. However, clarity in explanation, reproducibility of the dataset, and comparative benchmarking with existing models must be strengthened before the manuscript can be considered for publication.

Comments on the Quality of English Language

The manuscript has minor grammatical inconsistencies (e.g., "The Chinese government's also proposal...").

Author Response

Comment1：Some sections such as supervised contrastive learning, equations 6–8, are highly dense and could benefit from more intuitive explanations or diagrams for general readers. Include simplified flow diagrams or illustrative examples showing the role of contrastive learning in enhancing representation.

Response1：We have included schematics in the appropriate section.

Comment2：The methodology section at times repeats concepts: fusion layers and relational networks are described in multiple subsections with similar content. Consolidate repeated information and clearly separate model components: (1) backbone, (2) feature fusion, (3) contrastive learning.

Response2：Although there is some repetition of descriptions, the body of content presented in different subsections is not the same. We have adapted them accordingly in the manuscript.

Comment3：The dataset has significant class imbalance such as "plastic product" = 99, and "glass" = 819.

Response3：We have expanded the categories with few samples, either by manually searching for more images or by flipping and cropping existing images. The manuscript has been modified accordingly.
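
As a rough illustration of flip-based augmentation for an under-represented class, the sketch below mirrors an image together with its bounding boxes. It assumes YOLO-style labels (class id, x-centre, y-centre, width, height, all normalised to [0, 1]) and an image stored as a list of pixel rows; the record does not describe the authors' actual augmentation code, and both function names are hypothetical.

```python
def hflip_image(image):
    """Mirror an image left-to-right; the image is a list of pixel rows."""
    return [row[::-1] for row in image]


def hflip_yolo_boxes(boxes):
    """Mirror YOLO-style boxes to match a horizontally flipped image.

    Each box is (class_id, x_center, y_center, width, height), normalised
    to [0, 1]. Only the x-centre changes under a horizontal flip.
    """
    return [(c, 1.0 - xc, yc, w, h) for (c, xc, yc, w, h) in boxes]


if __name__ == "__main__":
    # Toy 2x3 "image" and one box, for illustration only.
    image = [[1, 2, 3],
             [4, 5, 6]]
    boxes = [(0, 0.25, 0.5, 0.2, 0.4)]
    print(hflip_image(image))
    print(hflip_yolo_boxes(boxes))
```

Cropping needs more care than flipping, because boxes that fall partly outside the crop window have to be clipped or dropped, so it is left out of this sketch.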

Comment4：Some references are cited with incomplete or inconsistent formatting (e.g., missing author names in some places).

Response4：We have reworked the reference formatting in the manuscript.

Comment5：However, clarity in explanation, reproducibility of the dataset, and comparative benchmarking with existing models must be strengthened before the manuscript can be considered for publication.

Response5：The dataset is reusable, as most of the images come from existing datasets or from the web. However, it probably has some limitations: it has a strong geographical character, because we are targeting orchard environments in coastal monsoon climate zones. The comparative benchmark shows the advantages of the model by comparing its results with those of the regular YOLOv7 model.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors

I appreciate the authors' efforts in revising the manuscript and addressing many of the earlier comments. Most of the minor issues raised in the first round have been sufficiently clarified or corrected. However, there are still several key points where, in my view, the current version of the manuscript remains incomplete or could benefit from further revision. These are outlined below.

Ad. Comment 1:

Thank you for expanding the introduction. I see that the references have been updated, but the number of cited sources remains quite limited (only five). To clarify: my earlier comment was not about providing detailed descriptions of specific models—which are appropriately discussed in Section 2—but about building a more literature-informed context in the introduction. A stronger background section would help frame the problem within the broader landscape of orchard waste management and related research, thus better motivating the study. I recommend incorporating additional references to strengthen this foundation.

Ad. Comment 3:

Thank you for the additional details in Section 3.1. However, the manuscript still lacks a clear and explicit statement of the authors' original contributions. My initial suggestion was to clearly distinguish the new components (e.g., the proposed fusion strategy, the relational network module) from standard elements adapted from existing literature. This is not just about describing the technical content, but about clearly indicating what is novel in this work. At present, no such summary appears in either the introduction or the methodology section.

Ad. Comment 5:

Thank you for the clarification regarding the mAP value. I understand that this was a simple clerical error, and I agree that such outlier values can often be immediately recognizable. That said, I would encourage the authors to adopt a more neutral and professional tone when responding to reviewer comments. The phrase “it is obvious that a network's mAP can’t be that high either” risks sounding dismissive and may not reflect the constructive spirit expected in academic discourse. A more careful and respectful phrasing would be more appropriate in future revisions.

Aside from these points, I am generally satisfied with the responses and revisions related to the other comments in my initial review. I encourage the authors to consider the issues outlined above in order to further improve the quality and clarity of the manuscript.

Author Response

Thank you very much for taking the time to review this manuscript again. We do apologise that some of the responses made you feel uncomfortable. Please see the detailed responses below and the corresponding revisions will be highlighted in the resubmission.

Comment1：A stronger background section would help frame the problem within the broader landscape of orchard waste management and related research, thus better motivating the study. I recommend incorporating additional references to strengthen this foundation.

Response1：Thanks to your comments, we have updated the manuscript to include more references.

Comment2：To clearly distinguish the new components (e.g., the proposed fusion strategy, the relational network module) from standard elements adapted from existing literature. This is not just about describing the technical content, but about clearly indicating what is novel in this work. At present, no such summary appears in either the introduction or the methodology section.

Response2：The approach in this paper takes lightweighting as its objective, and different backbone networks were tested as replacements. We ensure recognition accuracy and efficiency by replacing the original backbone with a combination of two backbone networks, and we use the Feature Fusion Module in a novel way to make the model more applicable to real-world scenarios.
Replacing a backbone network is common, but combining different backbone networks is rarely tried. The Feature Fusion Module has also appeared in previous research, but we use Semantic Feature-wise Transformation to learn different weights while fusing the learnt relation vectors. This combined use is the innovative aspect of our research. We have added a corresponding summary to the manuscript.

Comment3：That said, I would encourage the authors to adopt a more neutral and professional tone when responding to reviewer comments. The phrase “it is obvious that a network's mAP can’t be that high either” risks sounding dismissive and may not reflect the constructive spirit expected in academic discourse. A more careful and respectful phrasing would be more appropriate in future revisions.

Response3：Firstly, we would like to apologise to you for our previous response. My intention was to show that this was indeed a typing error on my part by pointing out that it was an obvious mistake. Thank you very much for reviewing my manuscript so carefully; this is an error that no other reviewer found. We will remember this mistake and take a serious and careful attitude in our subsequent research. Once again, we apologise for the poor wording.

Your comments are very important to us and have helped us to improve the rigour of our research and the clarity of our presentation. We sincerely hope that this revision will meet your requirements and thank you again for taking time and effort out of your busy schedule!
