Article
Peer-Review Record

Spatial and Channel Similarity-Aware Attention-Enhanced Network for Object Counting

Appl. Sci. 2025, 15(5), 2563; https://doi.org/10.3390/app15052563
by Ran Li *, Chunlei Wu, Jing Lu and Wenqi Zhao
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 22 January 2025 / Revised: 17 February 2025 / Accepted: 25 February 2025 / Published: 27 February 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a spatial and channel similarity-aware attention enhancement network for few-shot object counting. The method introduces a spatial similarity-aware module and a channel attention enhancement module to improve feature extraction for objects viewed from diverse angles. While the proposed approach is interesting, the paper requires major revisions to improve clarity, justify design choices, and strengthen the evaluation.

Major concerns:

  1. The related work section does not discuss whether prior research has applied transformers or other recent advancements in deep learning to the few-shot object counting problem. A comparison with such approaches is necessary to establish the paper’s contribution relative to existing methods.
  2. The proposed spatial similarity extraction method convolves query and support feature maps using H-wise and W-wise ordering. However, it is unclear whether further separating H and W is necessary. For example, in the H-wise direction, do top-to-bottom and bottom-to-top processing generate the same similarity map, differing only in the arrangement of the results? Similarly, in the W-wise direction, does right-to-left processing yield the same result as left-to-right? An ablation study should be conducted to demonstrate whether this separation is effective and essential (see the sketch after this list).
  3. In Table 2, the method performs worse on the CARPK dataset compared to CSRNet. Given that CARPK has a higher maximum object count per image, it appears that the proposed method struggles in such cases. Further analysis is needed to understand why performance declines with increasing object count. Is it due to feature overlapping, occlusion, or another factor? More experiments or discussion should be provided to address this limitation.
  4. The paper does not mention how much the model size increases compared to other methods. Additionally, real-time performance metrics are missing. How does the proposed model compare in terms of inference speed and computational cost? A comparison with prior works would strengthen the evaluation.
  5. Figures 1 & 2: The abbreviations “NL,” “SA,” and “ga()” should be expanded to improve readability. Alternatively, this information should be included in the figure captions.
  6. Figure 2: It is confusing whether fs comes from support images or from the same query image (input image). This needs to be clarified.
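
For concreteness, the following minimal NumPy sketch illustrates the redundancy raised in point 2. It assumes, purely for illustration, that the H-wise similarity is obtained by correlating each horizontal slice of the support feature map against the query feature map; the shapes and the helper h_wise_similarity are hypothetical and not taken from the paper. Under that assumption, reversing the slice order merely permutes the rows of the resulting similarity map.

```python
import numpy as np

def h_wise_similarity(f_q, f_s, top_to_bottom=True):
    """Correlate each H-slice of the support map f_s (C, Hs, Ws) with the
    query map f_q (C, Hq, Wq), producing one similarity row per slice."""
    C, Hs, Ws = f_s.shape
    order = range(Hs) if top_to_bottom else range(Hs - 1, -1, -1)
    sims = []
    for h in order:
        kernel = f_s[:, h, :]                                  # (C, Ws) slice used as a kernel
        out = np.stack([
            np.array([np.sum(f_q[:, i, j:j + Ws] * kernel)     # valid correlation along W
                      for j in range(f_q.shape[2] - Ws + 1)])
            for i in range(f_q.shape[1])
        ])
        sims.append(out)
    return np.stack(sims)                                      # (Hs, Hq, Wq - Ws + 1)

rng = np.random.default_rng(0)
f_q = rng.normal(size=(8, 16, 16))   # hypothetical query feature map  (C, Hq, Wq)
f_s = rng.normal(size=(8, 4, 4))     # hypothetical support feature map (C, Hs, Ws)

top_down = h_wise_similarity(f_q, f_s, top_to_bottom=True)
bottom_up = h_wise_similarity(f_q, f_s, top_to_bottom=False)
print(np.allclose(top_down, bottom_up[::-1]))  # True: the maps differ only in row arrangement
```

If the paper's H-wise and W-wise branches behave analogously, the requested ablation would show whether keeping both orderings adds information beyond this rearrangement.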

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This manuscript presents a novel deep learning model for object counting. Here are some suggestions for improvement:

1. In Section 3.3.1, what are the sizes of fs and fq? After the 1×1 convolution, do their sizes stay the same?

2. In Lines 176-177, how does the convolution that uses fs as the kernel work? If fs has the same size as fq, is its output a single scalar? (See the sketch after these questions.)

3. In Lines 186-188, is the output of the convolution for each slice a scalar that is added to the next slice?

4. In Equation 2, does "Ⅹ" indicate multiplication, or does it show the size of the input data? If it denotes multiplication, please use an alternative symbol.

5. In Equation (9), what is the meaning of the operation "⊙"?
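
Questions 1-3 hinge on what a convolution does when its kernel is as large as its input. The following PyTorch sketch shows only that generic behavior; the shapes are illustrative assumptions, not the sizes used in the manuscript.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature sizes; double precision keeps the comparison below exact enough.
C, H, W = 256, 32, 32
f_q = torch.randn(1, C, H, W, dtype=torch.float64)  # query feature map
f_s = torch.randn(1, C, H, W, dtype=torch.float64)  # support feature map, same size as f_q

# Using f_s as the convolution weight (shape (1, C, H, W)) with no padding:
out = F.conv2d(f_q, f_s)
print(out.shape)                                         # torch.Size([1, 1, 1, 1]) -> one scalar
print(torch.allclose(out.squeeze(), (f_q * f_s).sum()))  # True: it equals their inner product
```

If fq were spatially larger than fs, the same call would instead produce a small similarity map of size (Hq - Hs + 1) × (Wq - Ws + 1) rather than a single value.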

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This research addresses the challenge of object counting in computer vision by proposing a novel few-shot Spatial and Channel Similarity-Aware Attention Enhanced Network (SCSAEN). Object counting is a critical task with applications in agriculture, medical research, traffic monitoring, and drone-based surveillance. Traditional methods, such as classical image analysis and feature-based approaches, are limited by their reliance on manually extracted features and predefined constraints, which reduce their adaptability. While convolutional neural networks (CNNs) have improved counting accuracy by automatically extracting features, they often struggle with object angle variations and require large annotated datasets, making them costly and time-consuming. The proposed SCSAEN model mitigates these limitations by leveraging a few-shot learning framework, reducing the need for extensive annotations. It introduces a Spatial Similarity-Aware (SSA) Module for multi-directional feature extraction and a Channel Attention Enhancement (CAE) Module to improve feature representation, making it particularly effective for counting objects with diverse orientations and limited training samples.

The main contribution of this research lies in enhancing object counting performance while addressing key challenges such as annotation complexity and viewpoint variability. The integration of similarity-based matching allows for effective object recognition with minimal data, significantly reducing annotation costs compared to traditional CNN-based models. The SSA module improves robustness to angle variations by analyzing spatial features across multiple directions, while the CAE module refines feature representation along different channels, improving adaptability. However, potential limitations include the computational cost associated with the additional attention mechanisms and the reliance on sufficient variation in the few-shot training samples to generalize effectively across diverse real-world scenarios. Despite these challenges, the proposed approach presents a significant advancement in object counting, offering a more flexible and efficient solution applicable to various fields.

Main comments:

  1. The introduction effectively highlights the significance of object counting and provides a well-structured overview of existing methods. However, it would benefit from a clearer transition between the general importance of object counting and the specific problem your research addresses. Additionally, the mention of "CAEeras" appears to be a typographical error and should be corrected. Consider refining the problem statement to explicitly link the limitations of current methods with the motivation for developing the proposed SCSAEN model.
  2. The research review is well structured, but you could improve its coherence and depth of analysis. For example, you could better distinguish between problems with existing methods and suggested improvements, ensuring a smoother transition between the review and your contribution. You could also strengthen the critique by discussing the limitations of the reviewed approaches in the context of this research.
  3. The methods are presented qualitatively. Adding mathematical formulas would strengthen their presentation.
  4. I recommend adding a subsection called "Discussion" to the "Results" section. It should provide a brief overview of the results obtained, highlighting their strengths and limitations. Based on the identified limitations, it is necessary to formulate prospects for further research.
  5. I recommend making the conclusions more specific with an emphasis on the results obtained.
  6. The figures are of high quality. However, I recommend checking whether the article itself meets the requirements of the MDPI publishing house. The references fully cover this study.

The article is recommended for publication after minor revisions.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The paper presents an interesting approach to few-shot object counting using a spatial and channel similarity-aware attention enhancement network. However, the manuscript could be improved by addressing the following points:

1. Introduction: The original contributions of the paper should be explicitly stated to enhance clarity.

2. Line 48: The term CAEeras appears to be a typographical error and should be corrected.

3. Figure 1: The three main components of the network—backbone, similarity feature extraction and processing module, and density map prediction module—should be more clearly highlighted to improve readability.

4. Normalization in Figure 1 and Section 3.3.1: It should be explicitly stated whether the normalization applied is Max-normalization or Min-Max normalization to ensure clarity for the reader (see the sketch after this list).

5. Line 277: The phrase “It should be proposed that…” is unclear; the first occurrence of “proposed” should be revised for better readability.

6. Line 306: The phrase “on the CARPK and PUCPR+ datasets” should be modified to “on the TRANCOS, CARPK, and PUCPR+ datasets” for accuracy.

7. Conclusion (Lines 349 and 353): SCA should be verified, as SSA might be the intended term.
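
As a side note to point 4, the two candidate normalizations differ in whether the minimum is subtracted before scaling; the toy values below are made up purely to show the contrast.

```python
import numpy as np

x = np.array([2.0, 5.0, 8.0])                    # made-up response values

max_norm = x / x.max()                           # Max-normalization: divide by the maximum -> 0.25, 0.625, 1.0
min_max = (x - x.min()) / (x.max() - x.min())    # Min-Max normalization: rescale to [0, 1] -> 0.0, 0.5, 1.0

print(max_norm, min_max)
```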

Addressing these points will improve the clarity and precision of the manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you!
