PVConv: Enhancing Depthwise Separable Convolution via Preference-Value Learning for Similar-Feature Discrimination
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors propose a new deep learning method named PVConv, achieving superior object detection performance compared to strong baselines such as YOLOv8, YOLOv5, YOLOv10, and YOLOv11. Once the major revisions listed below have been addressed, I believe the manuscript will be suitable for publication.
In the Introduction, the studies discussed in Lines 27–28, Line 35, and Lines 80–89 are not up to date. Recent works such as LDConv and RepViT, both introduced in 2024, should also be included. As can be seen from the reference list, the only citation from 2025 corresponds to the dataset used. For this reason, the authors need to update Sections 2.1, 2.2, and 2.3 by incorporating more recent literature.
In addition, previous work on beverage recognition or detection must be integrated into the related work section. Recommended papers include, but are not limited to:
1. Toward New Retail: A Benchmark Dataset for Smart Unmanned Vending Machines
2. Enhancing Egocentric Insights: Comparison of Deep Learning Models for Food and Beverage Object Detection from Egocentric Images
Section 3.2 and Section 3.3 are clearly written and sufficiently informative.
Figure 5 effectively illustrates the structure of the proposed architecture.
However, the class names in Figure 11 are not legible and should be improved for readability.
For better clarity and purpose-driven organization, the subsections 4.5.1 Parameter and Computational Analysis and 4.5.6 Generalization and Model Transfer Experiments should be grouped under a dedicated Ablation Study section.
From a structural perspective, it would also be more coherent to present 4.5.3 Analysis of Precision and 4.5.5 F1-score Curve Analysis consecutively, followed immediately by 4.5.4 Analysis of Similar-Class and Background Detections.
Including Gaussian and Laplace activation function comparisons is a valuable contribution to the manuscript.
The authors have not provided any motivation for using the Beverage Containers Dataset. They should connect this dataset to broader topics such as waste management, public health, recycling, or environmental sustainability. Both the Introduction and Discussion & Conclusions sections should emphasize the real-world potential of this work and outline how future research can build upon it.
Furthermore, the authors should add other methods previously evaluated on the Beverage Containers Dataset into Table 4 or Table 6, and compare their findings with prior work to provide a comprehensive performance discussion.
The manuscript must demonstrate PVConv’s performance on a well-known benchmark dataset, enabling a fair comparison with established methods. Benchmark-level evaluation is essential for determining the generalizability and competitiveness of the proposed PVConv framework. I strongly recommend evaluating PVConv on at least one of the following:
MS-COCO or AI Crowd Food Recognition Dataset (Food & Beverage subset) or WaRP – Waste Recycling Plant Dataset
https://www.kaggle.com/datasets/parohod/warp-waste-recycling-plant-dataset
Author Response
Comment 1:
“In the Introduction, the studies discussed in Lines 27–28, Line 35, and Lines 80–89 are not up to date. Recent works such as LDConv and RepViT, both introduced in 2024, should also be included. As can be seen from the reference list, the only citation from 2025 corresponds to the dataset used. For this reason, the authors need to update Sections 2.1, 2.2, and 2.3 by incorporating more recent literature.”
Response 1:
We sincerely thank the reviewer for this insightful comment. We agree that the Related Work section should be updated to include the latest developments in convolutional neural networks. The reviewer’s suggestion helped us recognize the need to incorporate recent works to ensure that our discussion reflects the current state of the field.
Following this advice, we have carefully reviewed recent publications from 2024–2025 and added several representative and relevant works, including architectures analogous to LDConv and RepViT, as well as the two beverage recognition papers recommended by the reviewer. The newly added references are:
- Large Kernel ConvNets (Zhang et al., 2025)
- ShiftwiseConv (Li et al., CVPR 2025)
- RapidNet (Munir et al., WACV 2025)
- Depthwise Convolutions in Vision Transformers (Zhang et al., Neurocomputing 2025)
These additions provide a more up-to-date overview of recent convolutional designs, including methods for fine-grained feature modeling and efficient channel processing. By incorporating these works, the manuscript now better situates PVConv within the broader context of modern CNN research.
Updated text in the manuscript, highlighted in red:
Recent advances continue to broaden CNN design spaces, with large-kernel… Page 3, Lines 88-92.
More recently, LDConv employs linear deformable convolutions for fine-grained structure modeling. Page 3, Lines 104-109.
Comment 2:
“In addition, previous work on beverage recognition or detection must be integrated into the related work section. Recommended papers include, but are not limited to:
1. Toward New Retail: A Benchmark Dataset for Smart Unmanned Vending Machines
2. Enhancing Egocentric Insights: Comparison of Deep Learning Models for Food and Beverage Object Detection from Egocentric Images”
Response 2:
We sincerely thank the reviewer for this valuable suggestion. In response, we have added a new subsection titled “Food and Beverage Object Detection” in the Related Work section, which incorporates the two papers recommended by the reviewer:
- Zhang et al., 2020 (Toward New Retail: A Benchmark Dataset for Smart Unmanned Vending Machines)
- Hossain & Sazonov, 2024 (Enhancing Egocentric Insights: Comparison of Deep Learning Models for Food and Beverage Object Detection from Egocentric Images)
The new subsection discusses fine-grained recognition tasks involving beverages and packaged goods, highlighting challenges such as real-world illumination, occlusion, shelf-layout variations, small object size, visual similarity among categories, and complex backgrounds. These observations underscore the importance of efficient and discriminative feature extractors, which directly motivates the design of our PVConv and PVDSC modules.
Updated text in the manuscript, highlighted in red:
Beyond general-purpose CNN design, several studies have focused on fine-grained recognition tasks... Page 3, Lines 111-121.
Comment 3:
"However, the class names in Figure 11 are not legible and should be improved for readability."
Response 3:
Thank you for your valuable comment. We agree that the class names in the previous figure were difficult to read. To improve readability, we have increased the font size of the class labels in the confusion matrix from 12 pt to 24 pt. This adjustment ensures that all category names are clearly legible without affecting the figure layout. Due to updates in the manuscript structure, the revised figure is now Figure 12 (previously Figure 11). The updated figure can be found on Page 18 of the revised manuscript.
Comment 4:
"For better clarity and purpose-driven organization, the subsections 4.5.1 Parameter and Computational Analysis and 4.5.6 Generalization and Model Transfer Experiments should be grouped under a dedicated Ablation Study section."
Response 4:
We thank the reviewer for this valuable suggestion regarding the organization of our results. We would like to clarify that the primary innovation of our work lies in the proposed PVDSC convolution module, rather than in the design of a new overall network architecture. As such, our experiments are focused on demonstrating the effectiveness and efficiency of PVDSC as a module-level enhancement.
Specifically, the “Parameter and Computational Analysis” subsection evaluates the computational cost and parameter overhead introduced by PVDSC relative to standard convolution and DSC, while the “Generalization and Model Transfer Experiments” subsection verifies the module’s effectiveness when integrated into multiple YOLO backbones. These experiments serve as module-level validation rather than full-network ablation, and thus combining them into a traditional Ablation Study section would not fully align with the scope of our contribution.
We have ensured that both subsections are clearly labeled and presented consecutively to maintain readability and logical flow. Additionally, we have revised the text to explicitly emphasize that these experiments are intended to validate the PVDSC module’s performance across different architectures while keeping computational overhead minimal.
We hope this clarification helps the reviewer understand our rationale and the purpose-driven organization of these results.
Updated text in the manuscript, highlighted in red:
we conducted experiments by integrating it into different YOLO architectures...across multiple backbones. Page 19, Lines 473-475.
Comment 5:
"From a structural perspective, it would also be more coherent to present 4.5.3 Analysis of Precision and 4.5.5 F1-score Curve Analysis consecutively, followed immediately by 4.5.4 Analysis of Similar-Class and Background Detections."
Response 5:
We sincerely thank the reviewer for this constructive suggestion regarding the presentation order of the subsections. We agree that arranging the “Analysis of Precision” and “F1-score Curve Analysis” subsections consecutively, followed immediately by the “Analysis of Similar-Class and Background Detections” subsection, improves the logical flow and readability of the results.
Accordingly, we have revised the manuscript to follow this recommended order, ensuring that readers can better follow the progression from overall performance metrics to detailed error analyses.
Comment 6:
"The authors have not provided any motivation for using the Beverage Containers Dataset. They should connect this dataset to broader topics such as waste management, public health, recycling, or environmental sustainability. Both the Introduction and Discussion & Conclusions sections should emphasize the real-world potential of this work and outline how future research can build upon it."
Response 6:
We thank the reviewer for this valuable suggestion regarding the motivation for using the Beverage Containers Dataset. In the revised manuscript, we have clarified that the dataset was selected not only because it contains visually similar categories and challenging backgrounds, but also because it is relevant to real-world applications such as waste management, recycling, and environmental monitoring.
Specifically, in the Introduction and Dataset sections, we have highlighted that distinguishing between similar beverage containers can aid automated recycling systems and facilitate accurate inventory and waste tracking, demonstrating the practical significance of our work. Furthermore, in the Discussion and Conclusion sections, we have emphasized the broader potential of PVDSC to support real-world vision applications, including efficient monitoring of recyclable materials and public health–related object detection tasks.
These additions serve to connect our experimental evaluation to meaningful, practical scenarios, while maintaining the focus on the technical contributions of PVDSC.
Updated text in the manuscript, highlighted in red:
Moreover, beverage container recognition has practical implications for environmental sustainability... Page 2, Lines 60-64.
Beyond these technical motivations, beverage containers are prevalent in daily life and contribute... Page 10, Lines 303-306.
...on the Beverage Containers dataset, which contains visually similar categories of objects... Page 21, Lines 534-536.
These results not only validate the module’s effectiveness but also highlight its real-world... Page 21, Lines 548-551.
Overall, PVDSC effectively addresses the limitations of standard DSC in spatial information... Page 22, Lines 556-561.
Comment 7:
"Furthermore, the authors should add other methods previously evaluated on the Beverage Containers Dataset into Table 4 or Table 6, and compare their findings with prior work to provide a comprehensive performance discussion."
Response 7:
We appreciate the reviewer’s suggestion to include results from prior work on the Beverage Containers Dataset.
However, the dataset used in our experiments is a custom subset that we constructed by extracting and reorganizing samples from the original source dataset(s).
To the best of our knowledge, no previous studies have conducted experiments on this specific version of the dataset, and therefore no existing benchmark results are available for comparison.
Additionally, the primary goal of this work is to evaluate the effectiveness of the proposed PVDSC module, rather than to benchmark various full detection models. Thus, we selected YOLOv8s as the baseline and ensured consistent training settings across all variants to isolate the effect of the convolution module itself.
Comment 8:
"The manuscript must demonstrate PVConv’s performance on a well-known benchmark dataset, enabling a fair comparison with established methods. Benchmark-level evaluation is essential for determining the generalizability and competitiveness of the proposed PVConv framework. I strongly recommend evaluating PVConv on at least one of the following:
MS-COCO or AI Crowd Food Recognition Dataset (Food & Beverage subset) or WaRP – Waste Recycling Plant Dataset
https://www.kaggle.com/datasets/parohod/warp-waste-recycling-plant-dataset"
Response 8:
We sincerely thank the reviewer for this valuable suggestion. We fully agree that demonstrating PVConv’s performance on an additional benchmark dataset is important for validating its generalizability and competitiveness.
After carefully reviewing the recommended datasets (MS-COCO, AI Crowd Food Recognition, and WaRP), we found that while they are well-established, their category distributions and inter-class similarities do not fully align with the design motivation of PVConv, which is particularly aimed at scenarios with highly visually similar categories requiring fine-grained feature discrimination. For example, many classes in MS-COCO or WaRP have very distinct visual appearances, limiting the ability to highlight PVConv’s preference-value mechanism for disambiguating similar objects.
To better reflect the intended use case, we instead conducted supplementary experiments on the Animals Detection Images Dataset from Kaggle, constructing a Bird Subset containing 12 visually similar bird species with 3,869 images. Many images include multiple bird instances, creating a challenging fine-grained detection scenario. Notably, the dataset size is comparable to the WaRP dataset, ensuring a fair evaluation in terms of data scale. This subset aligns closely with the core goal of PVConv: enhancing discrimination among visually similar categories.
We trained YOLOv8 models with Conv, DSC, and PVDSC_G modules under identical settings and evaluated each model across five independent runs. The results, summarized in the new results table in the revised manuscript, show that PVDSC consistently outperforms DSC and approaches the accuracy of standard convolution. This benchmark-level evaluation demonstrates that PVDSC’s preference-value mechanism effectively improves fine-grained feature discrimination while maintaining strong generalization beyond the primary dataset.
We are grateful for this suggestion, as it allowed us to strengthen the empirical evidence of PVConv’s generalizability in the context for which it was designed, providing a more meaningful and targeted evaluation.
Updated text in the manuscript, highlighted in red:
to further assess the generalizability of the proposed PVDSC module, we conducted... Pages 19-20, Lines 488-511.
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes PVConv, a preference-value-guided add-on to depthwise separable convolution, then builds PVDSC and plugs it into YOLOv8 to improve detection of very similar beverage container classes. The direction is interesting, but the paper needs clearer novelty framing, stronger and cleaner experiments, and fixes for several reporting inconsistencies.
- PVConv is like “implicit attention,” but the paper does not clearly separate it from existing channel or value-based gating / attention ideas. Include baselines like SE, ECA, CBAM-lite, or other DSC-enhancement modules, not just Conv and DSC.
- Preference parameter scaling trick needs justification. Explain why +99 and 0.01 were chosen, not just that they worked.
- All main results are on one dataset from Roboflow. That is a very specific domain. If you claim “general purpose,” show at least one more dataset or task.
- Training images were randomly downsampled to 8,715, but the authors do not show that this does not bias the results. Report seed(s), repeat runs, and the average with std. Explain why that subset size is enough.
- Authors state images were resized to 640×640, but Table 2 lists image size 512. That is a direct conflict and affects FLOPs and results.
- Table 6 has impossible metric ordering: your mAP@50:95 values are higher than mAP@50, which cannot happen. Example: YOLOv5s row shows 0.76628 vs 0.92263.
- Text says transfer tests on YOLOv5s and YOLOv11s, but Table 6 also includes YOLOv10s without explanation.
- Only FLOPs and parameter counts are reported. For edge and real-time claims, latency and FPS on real hardware are more important.
- Measure inference speed (FPS or ms) on at least one GPU and one embedded device if possible.
- Show memory impact too, since PVConv adds extra tensors.
- Several sentences are long and repetitive, especially in Introduction and Discussion.
Author Response
Comment 1:
"PVConv is like “implicit attention,” but the paper does not clearly separate it from existing channel or value-based gating / attention ideas. Include baselines like SE, ECA, CBAM-lite, or other DSC-enhancement modules, not just Conv and DSC."
Response 1:
We sincerely thank the reviewer for this insightful comment regarding the relationship between PVConv and existing attention-based mechanisms. We acknowledge that our previous description may have suggested that PVConv functions as a form of attention. To clarify, PVConv is not an attention module. While inspired by attention ideas, its operational focus is fundamentally different: conventional attention mechanisms reweight feature vectors based on their own distributions, whereas PVConv enhances the convolution process itself through a dedicated preference-value path that assists depthwise separable convolution (DSC).
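To make this distinction concrete, the following minimal PyTorch sketch contrasts the two designs. The class name, the per-channel form of the preference, and its placement are simplified placeholders for illustration, not our full PVConv implementation.

```python
# Illustrative sketch only: a simplified preference-value path attached to a
# depthwise separable convolution. Names and structure are placeholders and
# do not reproduce the exact PVConv design.
import torch
import torch.nn as nn

class PVDSCSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # Standard DSC: depthwise spatial filtering + pointwise channel mixing.
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        # Preference-value path: a learnable per-channel preference that
        # modulates the convolution output directly, instead of reweighting
        # features from their own pooled statistics as SE/ECA-style
        # attention does.
        self.preference = nn.Parameter(torch.ones(1, in_ch, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.depthwise(x)
        # The preference acts on the convolution path itself, not on a
        # separately computed attention descriptor.
        return self.pointwise(v * self.preference)
```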
Accordingly, we have revised multiple parts of the manuscript. In the Introduction, we explicitly describe PVConv’s mechanism and its distinction from conventional attention. In the Related Work section, we removed the discussion of attention modules and replaced it with a focused subsection on food and beverage object detection, highlighting the practical motivation of our study.
These changes are highlighted in red in the revised manuscript. We believe the updated Introduction and Related Work sections now clearly reflect the novelty and mechanism of PVConv, avoiding any confusion with standard attention modules.
Updated text in the manuscript, highlighted in red:
The design of PVConv is inspired by the idea of attention, but its... Page 2, Lines 60-64.
Beyond general-purpose CNN design, several studies have focused on fine-grained... Page 3, Lines 111-121.
Comment 2:
"Preference parameter scaling trick needs justification. Explain why +99 and 0.01 were chosen, not just that they worked."
Response 2:
We thank the reviewer for highlighting the need to clarify the preference parameter scaling. In the original manuscript, the description of the scaling procedure was brief and did not provide sufficient detail regarding how the specific factors were determined. We appreciate the reviewer’s suggestion, which prompted us to improve the clarity and reproducibility of this important training detail.
The scaling factors (0.01 for optimization and +99 for forward compensation) were determined empirically through systematic experiments. Starting from the initial parameter values, we carefully monitored the updates of the preference parameters during the first few training epochs to ensure that each parameter received effective updates.
Through iterative adjustments of the scaling and compensation factors, we ensured that by the end of training, the majority of preference parameters were distributed within a reasonable and meaningful range. This confirmed that the network successfully learned the intended preference information while avoiding ineffective updates or bias suppression.
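For illustration, the following sketch shows one plausible reading of this reparameterization. The exact formulation appears in the revised manuscript (Page 12, Lines 326-332); the formula used here is an assumption chosen so that the effective preference starts near 1.0.

```python
# Hypothetical sketch of the preference reparameterization described above.
# This is one plausible reading, not the manuscript's exact formula: the
# effective preference is taken as 0.01 * (raw + 99), so it starts near 1.0
# while gradients with respect to `raw` are damped by the 0.01 factor.
import torch
import torch.nn as nn

class ScaledPreference(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Raw parameter initialized at 1; the optimizer updates this tensor.
        self.raw = nn.Parameter(torch.ones(1, channels, 1, 1))

    def effective(self) -> torch.Tensor:
        # Forward compensation (+99) and scaling (0.01) keep the effective
        # value near 1.0 at initialization: 0.01 * (1 + 99) = 1.0.
        return 0.01 * (self.raw + 99.0)

pref = ScaledPreference(64)
print(pref.effective().mean())  # tensor(1.0000, grad_fn=...)
```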
The revised manuscript now provides a more comprehensive description of this procedure, making it clear how the preference parameter scaling facilitates effective learning. We sincerely thank the reviewer for pointing out this omission, which allowed us to strengthen the transparency and rigor of our experimental methodology.
Updated text in the manuscript, highlighted in red:
These specific scaling values (0.01 and +99) were determined empirically through... Page 12, Lines 326-332.
Comment 3:
"All main results are on one dataset from Roboflow. That is a very specific domain. If you claim “general purpose,” show at least one more dataset or task."
Response 3:
We sincerely thank the reviewer for this insightful comment. We fully agree that evaluating PVConv on an additional dataset is important to support claims of general-purpose applicability.
In response, we conducted supplementary experiments on the Animals Detection Images Dataset from Kaggle, focusing on a Bird Subset containing 12 visually similar bird species with 3,869 images. Many images include multiple bird instances, creating a challenging fine-grained detection scenario. This evaluation complements our original experiments on the Roboflow Beverage Containers dataset and provides evidence that PVConv generalizes beyond a single, domain-specific dataset.
We trained YOLOv8 models with Conv, DSC, and PVDSC_G modules under identical configurations, repeating each experiment across multiple seeds to account for randomness. As shown in the new results table in the revised manuscript, PVDSC consistently outperforms DSC and approaches the performance of standard convolution, particularly under stricter IoU thresholds (mAP@50:95).
We are grateful for this suggestion, as it allowed us to demonstrate the robustness and broader applicability of PVConv, confirming its effectiveness in a multi-instance, visually similar detection context beyond the initial dataset.
Updated text in the manuscript, highlighted in red:
To further assess the generalizability of the proposed PVDSC module, we conducted... Pages 19-20, Lines 488-511.
Comment 4:
"randomly downsampled training images to 8,715 but do not show that this does not bias results. Report seed(s), repeat runs, and average with std. Explain why that subset size is enough."
Response 4:
We sincerely thank the reviewer for the careful and constructive suggestion regarding the dataset subset and the robustness of our results. In response, we have clarified the following points in the manuscript:
- Subset Selection and Representativeness: The 8,715 training images were randomly sampled from the full dataset while preserving the overall distribution of target categories. This subset remains sufficiently large (containing tens of thousands of labeled instances) to enable stable training and effective convergence, while significantly reducing computational cost. It also maintains a reasonable ratio with the 1,303 validation images, ensuring unbiased evaluation.
- Random Seeds and Repeat Runs: To further validate robustness, all experiments were repeated five times using different random seeds (0, 10, 100, 1000, and 10000); a minimal sketch of this protocol follows this list. The mean and standard deviation of the key mAP metrics across these runs are reported in the revised results table, confirming that the results are consistent and not sensitive to the specific random subset.
- Bias Mitigation: By maintaining the overall category distribution and reporting averaged results across multiple seeds, we ensure that our evaluation remains representative of the full dataset and does not introduce significant bias.
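The sketch below illustrates the multi-seed protocol; train_and_eval is a hypothetical placeholder for one full training and validation run.

```python
# Minimal sketch of the multi-seed evaluation protocol described above.
import random
import statistics
import numpy as np
import torch

SEEDS = [0, 10, 100, 1000, 10000]

def set_seed(seed: int) -> None:
    # Fix all relevant RNGs so each run is reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def train_and_eval(seed: int) -> float:
    # Hypothetical placeholder: run one full training + validation cycle
    # and return the resulting mAP@50.
    raise NotImplementedError

map50_runs = []
for seed in SEEDS:
    set_seed(seed)
    map50_runs.append(train_and_eval(seed))

print(f"mAP@50: {statistics.mean(map50_runs):.4f} "
      f"+/- {statistics.stdev(map50_runs):.4f}")
```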
We greatly appreciate the reviewer’s guidance, which helped us strengthen the experimental rigor and clarity of our dataset description.
Updated text in the manuscript, highlighted in red:
To reduce computational cost while maintaining representativeness, a subset of 8,715... Page 10, Lines 308-313.
To ensure statistical robustness against randomness introduced by data sampling, weight initialization... Page 12, Lines 334-338.
To assess robustness, all models are trained with five different random seeds... Page 14, Lines 383-384.
The multi-run statistics further confirm this trend. PVDSC_G and PVDSC_L achieve... Page 14, Lines 393-397.
Comment 5:
Authors state images were resized to 640×640, but Table 2 lists image size 512. That is a direct conflict and affects FLOPs and results.
Response 5:
We sincerely thank the reviewer for carefully noticing the discrepancy regarding image sizes. We appreciate this helpful suggestion, which prompted us to clarify the details. For clarity, all dataset images were initially resized to 640×640 by the data provider. In our experiments, we trained the models at 512×512 to accommodate our custom CUDA implementation, while FLOPs are reported at 640×640 to allow standardized comparison with prior works. We confirm that this setup does not affect the validity or fairness of the training and evaluation results. We have added a clarifying sentence in the manuscript to explicitly explain the difference between the dataset image size and the training resolution.
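As an illustration of how the two resolutions are used, the following sketch profiles FLOPs at 640 × 640 while training proceeds at 512 × 512. It uses the third-party thop profiler as one common option, and model stands for any PyTorch detector; neither reflects our exact tooling.

```python
# Sketch: training runs at 512x512, while FLOPs are profiled at 640x640 for
# comparability with prior work. Uses the third-party `thop` profiler.
import torch
from thop import profile  # pip install thop

def report_gflops(model: torch.nn.Module, imgsz: int) -> float:
    dummy = torch.randn(1, 3, imgsz, imgsz)
    macs, _ = profile(model, inputs=(dummy,), verbose=False)
    # thop reports multiply-accumulates; 2 FLOPs per MAC by convention.
    return 2 * macs / 1e9

# report_gflops(model, 640)  # resolution used for the reported FLOPs
# report_gflops(model, 512)  # actual training resolution
```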
Updated text in the manuscript, highlighted in red:
Note that while all dataset images are 640 × 640, models were trained at 512 × 512... Page 12, Lines 339-341.
Comment 6:
Table 6 has impossible metric ordering: your mAP@50:95 values are higher than mAP@50, which cannot happen. Example: YOLOv5s row shows 0.76628 vs 0.92263.
Response 6:
We sincerely thank the reviewer for pointing out the issue regarding the metric ordering in Table 6. We apologize for the oversight. The values of mAP@50 and mAP@50:95 were inadvertently reversed in the original table. This has now been corrected in the revised Table 6.
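For clarity, the constraint the reviewer invoked follows directly from the metric's definition: mAP@50:95 averages AP over ten IoU thresholds that are each at least as strict as 0.50, and AP is non-increasing in the IoU threshold, so the average can never exceed mAP@50.

```latex
% mAP@50:95 is the average of AP over ten IoU thresholds, each at least as
% strict as 0.50; since AP is non-increasing in the IoU threshold, the
% average cannot exceed the AP at the loosest threshold.
\mathrm{mAP}@{50{:}95}
  \;=\; \frac{1}{10} \sum_{t \in \{0.50,\,0.55,\,\dots,\,0.95\}} \mathrm{mAP}@t
  \;\le\; \mathrm{mAP}@50
```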
Comment 7:
Text says transfer tests on YOLOv5s and YOLOv11s, but Table 6 also includes YOLOv10s without explanation.
Response 7:
We sincerely thank the reviewer for highlighting this point. We apologize for the oversight in the original manuscript. In the revised version, we have updated the text to explicitly indicate the specific layer positions where PVDSC and DSC replace convolutional modules in YOLOv10s (layers 3, 19 for Conv and layers 5, 7, 20 for SCDown), in addition to YOLOv5s and YOLOv11s. This ensures full consistency with Table 6 and clarifies that PVDSC consistently improves performance across all included backbones.
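To illustrate the replacement procedure, the following sketch swaps modules by layer index in an Ultralytics-style model. PVDSC, the attribute names, and the channel handling are simplified assumptions for Conv-style blocks (SCDown replacement follows the same pattern with its own channel attributes); the actual integration in our experiments follows the model YAML definitions.

```python
# Illustrative sketch of index-based module replacement in an Ultralytics-
# style detector, where detector.model.model is a sequential list of layers.
# PVDSC stands in for our module; the in/out channels are read from the
# Conv-style block being replaced.
def replace_layers(detector, indices, make_module):
    layers = detector.model.model  # sequential backbone/head blocks
    for i in indices:
        old = layers[i]
        new = make_module(old.conv.in_channels, old.conv.out_channels)
        # Preserve the graph bookkeeping attributes used by the framework.
        new.i, new.f = old.i, old.f
        layers[i] = new

# e.g. replace_layers(yolov10s, [3, 19], lambda ci, co: PVDSC(ci, co))
```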
Updated text in the manuscript, highlighted in red:
Specifically, in YOLOv5s, YOLOv10s, and YOLOv11s, the original convolutional layers were replaced... Page 19, Lines 473-480.
Comment 8-10:
8. Only FLOPs and parameter counts are reported. For edge and real-time claims, latency and FPS on real hardware are more important.
9. Measure inference speed (FPS or ms) on at least one GPU and one embedded device if possible.
10. Show the memory impact too, since PVConv adds extra tensors.
Response 8-10:
We sincerely thank the reviewer for the valuable suggestions regarding the necessity of reporting not only FLOPs and parameter counts but also practical runtime indicators such as latency/FPS on real hardware, as well as GPU memory usage during training. These comments are very constructive and have helped us significantly improve the completeness and practical relevance of our evaluation.
In response, we have added a measurement of inference speed (FPS) for all compared models, obtained using our custom CUDA implementation. Since our CUDA kernels are experimental and not fully optimized, the achieved FPS mainly reflects functional correctness rather than optimized real-time performance. Therefore, we clearly label these values as reference only in the revised manuscript to avoid misleading interpretations. We fully agree that optimized FPS on production-level kernels may differ, and our data should be considered as comparative indicators rather than absolute hardware limits.
Furthermore, following the reviewer’s suggestion, we also report training-time GPU memory usage for each model. This addition helps provide a more complete understanding of the resource impact introduced by PVDSC, especially since PV-based computation involves additional intermediate tensors. We note that although PVDSC slightly increases memory consumption compared with DSC, the difference remains small and does not undermine its practical deployment value.
These additions (reference FPS measurements and training-memory-usage statistics) have now been incorporated into the revised manuscript (Section “Analysis of Detection Performance”, Table 4). We again thank the reviewer for guiding us to strengthen the experimental rigor and practical relevance of our work.
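For reproducibility, the following sketch shows how such measurements can be taken in PyTorch: FPS with explicit CUDA synchronization and warm-up, and peak training-memory usage via the CUDA memory statistics. The model and image size are assumptions, not our exact benchmarking harness.

```python
# Sketch of the runtime measurements: average FPS with proper CUDA
# synchronization (warm-up iterations excluded), plus peak GPU memory.
import time
import torch

@torch.no_grad()
def measure_fps(model, imgsz=512, warmup=20, iters=100, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(warmup):
        model(x)  # warm-up: exclude kernel compilation / cache effects
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

# Peak training-time memory: reset the counter before an epoch, then read
# the high-water mark afterwards.
# torch.cuda.reset_peak_memory_stats()
# ... one training epoch ...
# peak_gib = torch.cuda.max_memory_allocated() / 2**30
```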
Updated text in the manuscript, highlighted in red:
Regarding resource usage, Conv has the highest computational cost... Page 14, Lines 398-408.
Comment 11:
Several sentences are long and repetitive, especially in Introduction and Discussion.
Response 11:
We sincerely thank the reviewer for this helpful comment. We carefully reviewed the Introduction and Discussion sections and identified multiple sentences that were indeed overly long or repeated similar ideas. Following the reviewer’s suggestion, we have revised these sections to improve clarity, conciseness, and readability. Specifically, we shortened complex sentences, removed redundant descriptions of CNN development history and lightweight convolution methods, and streamlined the explanation of DSC limitations and the motivation behind PVConv. These revisions substantially improve the logical flow of the manuscript while preserving all essential technical details. We appreciate the reviewer’s guidance, which has led to a clearer and more focused presentation of our contributions.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have successfully addressed the majority of the concerns raised in the initial review cycle. However, a critical issue remains regarding Comment 8.
While I acknowledge that the authors have processed a new dataset, the comparative performance of the proposed method against state-of-the-art approaches remains ambiguous. As emphasized in my previous report, I strongly recommend utilizing standardized scientific datasets rather than crowd-sourced platforms like Kaggle. Adopting established academic benchmarks is essential to validly demonstrate the study's contribution to the literature.
Since the current evaluation does not sufficiently prove the method's superiority or distinctiveness due to the dataset choice and lack of comparison, I recommend a Major Revision.
Validation must be performed on a recognized scientific dataset to ensure reproducibility and rigor. For the new dataset, the manuscript must include detection results and Grad-CAM++ heatmaps, mirroring the visualization standards presented in Figures 13 and 14 to verify model focus and interpretability.
Author Response
Comment:
While I acknowledge that the authors have processed a new dataset, the comparative performance of the proposed method against state-of-the-art approaches remains ambiguous. As emphasized in my previous report, I strongly recommend utilizing standardized scientific datasets rather than crowd-sourced platforms like Kaggle. Adopting established academic benchmarks is essential to validly demonstrate the study's contribution to the literature.
Since the current evaluation does not sufficiently prove the method's superiority or distinctiveness due to the dataset choice and lack of comparison, I recommend a Major Revision.
Validation must be performed on a recognized scientific dataset to ensure reproducibility and rigor. For the new dataset, the manuscript must include detection results and Grad-CAM++ heatmaps, mirroring the visualization standards presented in Figures 13 and 14 to verify model focus and interpretability.
Response:
We sincerely thank the reviewer for the valuable suggestion to evaluate PVDSC on a recognized benchmark dataset. Following your recommendation, we conducted supplementary experiments on the WaRP – Waste Recycling Plant Dataset (Yudin et al., 2024), which is a well-established academic benchmark. This dataset allows us to validate PVDSC under realistic and challenging conditions while respecting computational constraints.
On WaRP, we trained YOLOv8 models with Conv, DSC, and PVDSC modules under identical settings. Each model was evaluated across multiple independent runs, and both the best single-run performance and the mean ± standard deviation were reported. The results, summarized in the new benchmark table in the revised manuscript, demonstrate that PVDSC consistently outperforms DSC and approaches the accuracy of standard convolution, highlighting the effectiveness of the preference-value mechanism in enhancing fine-grained feature discrimination.
In addition, we updated the manuscript to include new detection visualizations and Grad-CAM++ heatmaps for both the previously introduced Bird Subset and the WaRP datasets. These visualizations provide clear evidence of PVDSC’s ability to distinguish visually similar targets and suppress background interference, illustrating the benefits of introducing PVDSC for fine-grained feature discrimination.
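As a pointer for readers reproducing these visualizations, the following sketch generates a Grad-CAM++ overlay using the third-party pytorch-grad-cam package. The choice of target layer is illustrative; the exact layers used for the figures in the manuscript may differ.

```python
# Sketch of generating a Grad-CAM++ heatmap overlay with the third-party
# pytorch-grad-cam package (pip install grad-cam). For detectors, a late
# backbone/neck convolution is a typical target-layer choice.
import numpy as np
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image

def gradcampp_heatmap(model, input_tensor, rgb_img, target_layer):
    cam = GradCAMPlusPlus(model=model, target_layers=[target_layer])
    grayscale = cam(input_tensor=input_tensor)[0]  # HxW map in [0, 1]
    # Overlay on the original image (rgb_img: float32 HxWx3 in [0, 1]).
    return show_cam_on_image(rgb_img.astype(np.float32), grayscale,
                             use_rgb=True)
```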
We are grateful to the reviewer for this recommendation, which allowed us to strengthen our empirical validation and demonstrate the generalizability of PVDSC on a recognized academic benchmark.
Updated text in the manuscript, highlighted in red:
To further assess the generalizability of the proposed PVDSC module, we conducted... Pages 19-20, Lines 488-512.
To further demonstrate the effectiveness of PVConv in distinguishing visually similar... Pages 20-21, Lines 514-526.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have revised the manuscript, and it is now improved.
Author Response
We sincerely thank the reviewer for their thoughtful evaluation and positive comments. We are pleased that the revisions have improved the manuscript and addressed the concerns raised. The reviewer's acknowledgment of our efforts is greatly appreciated.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
The authors included more detailed results and visual predictions. Even though the manuscript still lacks benchmarking against current methods, it can be published in its current form.

