DE-CLIP: Unsupervised Dense Counting Method Based on Multimodal Deep Sharing Prompts and Cross-Modal Alignment Ranking
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors present an unsupervised dense counting method named DE-CLIP that is based on multimodal deep sharing prompts and cross-modal alignment ranking. In general, the manuscript is well-structured and the presented idea bears noticeable innovation. However, there are issues regarding the presentation of methods and the results that should be addressed.
1) I would advise the authors to review the presentation of their methods, specifically on explaining the notation used and trying to avoid the use of the same symbols to express different things. For instance, in line 186 the authors use the symbol I to represent the image block and in line 203 they use the same symbol to express the image embedding. Similarly, in line 187 they explain that C(x_c, y_c) denotes the center point coordinates while in line 203 the image embeddings have dimensionality MxC without explaining what C is and baffling the reader. Please revise the presentation of methods and alleviate any ambiguities.
2) The presentation of the experimental results should be reviewed. The authors repeat some of their findings in two consecutive paragraphs (lines 407-418 and 424-429). Furthermore, in Table 1 they present results from methods that seem to exhibit optimal performance but they are not commented in the text (e.g., Switch CNN, LSC-CNN and CLTR). Regarding the results that are commented in the text, the authors express strong statements that are not supported by their findings. For instance, the authors deem that DE-CLIP's performance improvements across multiple datasets "significantly surpass those of existing supervised methods" (lines 442-443) although the performance of Crowd-CLIP on the ShanghaiTech Part A and Part B datasets is considerably better. Similarly stated in lines 407-409, "DE-CLIP, achieves significant performance improvements compared to existing unsupervised counting models across all evaluated datasets" (supervised or unsupervised counting models?) and in the summary section (lines 623-624) "DE-CLIP outperforms existing supervised and unsupervised methods across multiple challenging datasets."
3) The authors have conducted an extensive ablation study, and it would further highlight the contribution of their work if the experimental settings were equally extensive.
Author Response
1) I would advise the authors to review the presentation of their methods, specifically on explaining the notation used and trying to avoid the use of the same symbols to express different things. For instance, in line 186 the authors use the symbol I to represent the image block and in line 203 they use the same symbol to express the image embedding. Similarly, in line 187 they explain that C(x_c, y_c) denotes the center point coordinates while in line 203 the image embeddings have dimensionality MxC without explaining what C is and baffling the reader. Please revise the presentation of methods and alleviate any ambiguities.
1) Following your suggestions, I have revised the ambiguous parts of the paper, clarifying that one symbol denotes the image patches and a distinct symbol denotes the embeddings of the image patches. In line 203, the embeddings have dimensionality MxC, where M is the number of image patches and C is the dimension of the image-patch embedding vector, determined by the output layer size of the network. I have added detailed explanations after these terms.
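For illustration only (the exact symbols adopted in the revised manuscript may differ), one unambiguous way to write this notation is:

```latex
% Illustrative notation, not necessarily the paper's final symbols:
% P_1, \dots, P_M : image patches, the i-th with center point (x_{c_i}, y_{c_i})
E = f_{\mathrm{img}}(P_1, \dots, P_M), \qquad E \in \mathbb{R}^{M \times C}
```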
2) The presentation of the experimental results should be reviewed. The authors repeat some of their findings in two consecutive paragraphs (lines 407-418 and 424-429). Furthermore, in Table 1 they present results from methods that seem to exhibit optimal performance but they are not commented in the text (e.g., Switch CNN, LSC-CNN and CLTR). Regarding the results that are commented in the text, the authors express strong statements that are not supported by their findings. For instance, the authors deem that DE-CLIP's performance improvements across multiple datasets "significantly surpass those of existing supervised methods" (lines 442-443) although the performance of Crowd-CLIP on the ShanghaiTech Part A and Part B datasets is considerably better. Similarly stated in lines 407-409, "DE-CLIP, achieves significant performance improvements compared to existing unsupervised counting models across all evaluated datasets" (supervised or unsupervised counting models?) and in the summary section (lines 623-624) "DE-CLIP outperforms existing supervised and unsupervised methods across multiple challenging datasets."
2) Regarding the issue of repetitive experimental results, I have made the necessary revisions and removed the duplicated sections. I also added explanations for models such as Switch CNN, LSC-CNN, and CLTR. As for the issue in lines 442-443, all methods above the horizontal line in Table 1 are supervised, and those below are unsupervised; therefore, our method indeed significantly outperforms existing supervised methods. Relative to the existing unsupervised models, our method achieves clear improvements and performance gains, which are explained in the paper. I also revised some expressions in the paper. In fact, Crowd-CLIP performs better on the ShanghaiTech A and B datasets, and I address this issue later in this section, explaining that the proposed method performs better on high-density datasets while slightly underperforming on sparse datasets. Hence, the title of the paper emphasizes dense counting tasks, not counting tasks in general, to highlight the superiority of our method.
3) The authors have conducted an extensive ablation study, and it would further highlight the contribution of their work if the experimental settings were equally extensive.
3) Thank you for your suggestion. In the revised version, we will further expand the experimental setup to cover more experimental conditions and variables, to better demonstrate the breadth of the ablation study and the contribution of our work. We will also provide detailed explanations of the impact of each component in the ablation experiments, highlighting their contributions to the overall performance in the experimental results. We have added experiments on the influence of different prompt types on model performance.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript presents DE-CLIP, an unsupervised dense counting method based on multi-modal deep sharing prompts with cross-modal alignment ranking. DE-CLIP incorporates a cross-modal alignment ranking loss during training and recursively integrates visual information to achieve layer-by-layer fusion of text and visual prompts. Additionally, a multimodal collaborative fusion module facilitates bidirectional interaction between text and visual modalities. The authors evaluate the effectiveness of DE-CLIP through experiments on multiple datasets such as QNRF, ShanghaiTech, and UCF-CC50. After a thorough review, I have outlined my primary concerns and specific comments below:
- The acronym "DE-CLIP" is not explicitly defined in the manuscript. It would be beneficial for readers if the authors could clarify its meaning and explain how it relates to the proposed method.
- The source code is not mentioned. Providing the code would enhance the reproducibility of the method and facilitate further research by the community.
- The equations lack numeric labels. Adding equation numbers would improve readability and allow reviewers and readers to easily reference specific equations throughout the paper.
- The notations used in the equations are not fully explained. For instance, in Section 4, the symbols E and C in the MAE and MSE equations are not defined.
- Table 1 currently lacks clarity. A well-designed quantitative table should be easy to understand at first glance. For instance, the "Improvement" values are not clearly explained, readers may struggle to understand their significance. If the authors intend to emphasize a comparison with the Crowd-CLIP method, this should be explicitly noted in the table or its caption. Additionally, the "%" symbol should be included for percentage values to avoid confusion with MAE and MSE metrics.
- The methods listed in the upper part of Table 1 demonstrate superior performance compared to others, but they are not discussed in the text.
- Section 6 is currently titled "Summary". Given its content, it would be more appropriate to rename it "Conclusion" to better reflect its purpose and align with standard academic conventions.
- The computational cost of the proposed method is not discussed. How does it compare to the other methods listed in Table 1 in terms of computational efficiency? Additionally, would the method be feasible for real-time applications?
- The manuscript does not have Figure 3 while having Figure 4.
- The manuscript would benefit from including typical visual examples that demonstrate the outputs of the proposed method. This would help readers better understand the method's performance and practical applications.
- The text in Figure 2 is too small, which may hinder readability. It is recommended to increase the font size or adjust the layout to ensure that all details are clearly visible.
- The use of subsections should follow standard formatting conventions. For instance, Section 3.5.1 is defined under Section 3.5, but there is no corresponding Section 3.5.2. Subsections should only be used if there are two or more subdivisions within a section.
- If the Multimodal Collaborative Fusion Module is considered a main contribution of the paper, it would benefit from a single and larger figure with more detailed descriptions. Currently, its integration into Figure 2 makes it difficult for readers to follow the corresponding text.
- The experimental section requires additional detail regarding the setup. Please specify the amount of data used for training and validation, the training hyperparameters, the programming framework used for implementation, and the specifications of the environment.
- Given that the comparison includes only one method from 2023 and other older methods, it is important to know if any relevant methods published in 2024 were considered. If so, please include them in the comparison.
In conclusion, while the manuscript introduces an interesting approach, it requires substantial revisions to address the concerns and issues highlighted in this review. Addressing these points will significantly strengthen the clarity and overall impact of the work.
Author Response
- The acronym "DE-CLIP" is not explicitly defined in the manuscript. It would be beneficial for readers if the authors could clarify its meaning and explain how it relates to the proposed method.
1)Regarding the "DE-CLIP" acronym: Thank you for the suggestion. In the revised version, we will clearly define the full name and meaning of "DE-CLIP" (Dense counting CLIP) in the introduction.
- The source code is not mentioned. Providing the code would enhance the reproducibility of the method and facilitate further research by the community.
2) Regarding the source code: Once all the experimental code is organized, we will release it as open source.
- The equations lack numeric labels. Adding equation numbers would improve readability and allow reviewers and readers to easily reference specific equations throughout the paper.
3)Regarding equation numbering and symbol explanations: Thank you for your feedback. In the revised version, we will add numbers to the equations and provide complete symbol definitions in the relevant sections to ensure that readers can clearly understand the meaning of each symbol.
- The notations used in the equations are not fully explained. For instance, in Section 4, the symbols E and C in the MAE and MSE equations are not defined.
4)Regarding incomplete symbol explanations in the equations: The MAE (Mean Absolute Error) and MSE (Mean Squared Error) formulas are standard evaluation metrics; we will define each of their symbols when we elaborate on them in the paper.
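For reference, the definitions standard in the crowd-counting literature, where "MSE" conventionally denotes the root of the mean squared error over the N test images, with predicted count \hat{C}_i and ground-truth count C_i, are:

```latex
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{C}_i - C_i \right|,
\qquad
\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{C}_i - C_i \right)^2}
```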
- Table 1 currently lacks clarity. A well-designed quantitative table should be easy to understand at first glance. For instance, the "Improvement" values are not clearly explained, readers may struggle to understand their significance. If the authors intend to emphasize a comparison with the Crowd-CLIP method, this should be explicitly noted in the table or its caption. Additionally, the "%" symbol should be included for percentage values to avoid confusion with MAE and MSE metrics.
5)Regarding the design of Table 1: In the revised version, we will further optimize the design of Table 1, clarify the meaning of the "Improvement" value, and ensure that percentage values are marked with the '%' symbol. Additionally, we will explicitly indicate the comparison with the Crowd-CLIP method in the table's title.
- The methods listed in the upper part of Table 1 demonstrate superior performance compared to others, but they are not discussed in the text.
6)Regarding the discussion of other methods in Table 1: We will add a detailed discussion in the text regarding other methods in Table 1, particularly comparing them with DE-CLIP, highlighting their advantages and disadvantages.
- Section 6 is currently titled "Summary". Given its content, it would be more appropriate to rename it "Conclusion" to better reflect its purpose and align with standard academic conventions.
7)Regarding the title of Section 6: Thank you for your suggestion. We will change the title of Section 6 from "Summary" to "Conclusion" to better reflect its content.
- The computational cost of the proposed method is not discussed. How does it compare to the other methods listed in Table 1 in terms of computational efficiency? Additionally, would the method be feasible for real-time applications?
8)Regarding the computational cost: Currently, we have not completed a detailed analysis of the computational cost, but preliminary results show that DE-CLIP can effectively meet the requirements for real-time tasks. We plan to further explore the computational efficiency of the method in the revised version and compare it with other methods, while also discussing its feasibility for real-time applications.
- The manuscript does not have Figure 3 while having Figure 4.
9)Regarding the missing Figure 3: Thank you for the reminder. We will revise the manuscript to ensure that Figure 3 is included and that the figure numbering is consistent with the content of the paper.
- The manuscript would benefit from including typical visual examples that demonstrate the outputs of the proposed method. This would help readers better understand the method's performance and practical applications.
10)Regarding visual examples: It should be noted that the output is simply the crowd count result for the input image; we will describe this output clearly to help readers better understand the practical effects of the method.
- The text in Figure 2 is too small, which may hinder readability. It is recommended to increase the font size or adjust the layout to ensure that all details are clearly visible.
11)Regarding the text in Figure 2: We will adjust the font size and layout of Figure 2 to ensure that all the details in the figure are clearly visible.
- The use of subsections should follow standard formatting conventions. For instance, Section 3.5.1 is defined under Section 3.5, but there is no corresponding Section 3.5.2. Subsections should only be used if there are two or more subdivisions within a section.
12)Regarding subsection formatting: Thank you for your reminder. We will check the use of subsections and ensure they adhere to standard formatting. If unnecessary subsections are present, we will make the appropriate adjustments.
- If the Multimodal Collaborative Fusion Module is considered a main contribution of the paper, it would benefit from a single and larger figure with more detailed descriptions. Currently, its integration into Figure 2 makes it difficult for readers to follow the corresponding text.
13)Regarding the multimodal collaborative fusion module: Following your suggestion, we will design a larger figure and provide a detailed description of the multimodal collaborative fusion module to help readers better understand its structure and function. Section 3.5 in the paper is dedicated to the design of the multimodal collaborative fusion module.
- The experimental section requires additional detail regarding the setup. Please specify the amount of data used for training and validation, the training hyperparameters, the programming framework used for implementation, and the specifications of the environment.
14)Regarding details of the experimental section: We will add further details to the experimental section in the revised version.
- Given that the comparison includes only one method from 2023 and other older methods, it is important to know if any relevant methods published in 2024 were considered. If so, please include them in the comparison.
15)Regarding relevant methods from 2024: Thank you for your suggestion. In the latest 2024 articles, we have not found any methods that use CLIP for dense counting tasks.
Reviewer 3 Report
Comments and Suggestions for Authors
This manuscript presents an unsupervised counting method using multimodal prompts and cross-modal alignment. While the research idea is good enough, the paper needs substantial revisions to clarify methodology, improve readability, and enhance discussion on study limitations.
1. Lacks a clear emphasis on the novelty of the research
The introduction provides a general overview of dense counting tasks and multimodal prompt learning, but it does not clearly highlight how DE-CLIP differs from existing methods. Therefore, it would be beneficial to explicitly state the limitations of previous studies and how DE-CLIP addresses these gaps. A stronger emphasis on the unique contributions of this work would help distinguish it from prior research.
2. Rationale behind design choices in the research methodology
Certain methodological decisions, such as the specific criteria for ranking loss or the selection of numerical ordering text prompts, lack detailed justification. Instead of merely presenting numerical experimental results, it would be helpful to logically explain why these particular design choices were optimal and how they impact the model’s performance.
3. Too complex methodology section to follow
The explanation of cross-modal alignment ranking loss and the multimodal collaborative fusion module is written in a highly technical manner without sufficient intuitive guidance. To enhance clarity, consider adding pseudo-code or an algorithm block diagram to illustrate the step-by-step process. Providing a more intuitive interpretation of mathematical formulas would also help readers better understand the theoretical foundation of the proposed method.
4. Sufficient and intuitive results Analysis needed
While the paper includes a significant number of tables and graphs, the key takeaways are not always clearly highlighted. It would be useful to summarize the most important findings and explicitly discuss how they demonstrate the effectiveness of the proposed approach. A deeper comparison with existing models (e.g., CSS-CCNN, CrowdCLIP) should also be included, particularly regarding which datasets show the most significant performance improvements and where the limitations of DE-CLIP remain.
5. Readability and clarity of the manuscript needs to be improved
The writing contains awkward phrasings and grammatical inconsistencies, which make some sections difficult to read. For example, the phrase "enables two-way interaction between text and visual information through self-attention and cross-modal attention mechanisms" could be rewritten as "Our approach effectively facilitates bidirectional interaction between textual and visual information via self-attention and cross-modal attention." Overall, sentence structures should be refined to improve conciseness and readability, ensuring that the explanations flow smoothly and naturally.
6. Minor Grammar and spelling checking
Several mistakes and grammar issues are seen. Please check and revise those.
Comments on the Quality of English Language
Included in the comments for authors section.
Author Response
1. Lacks a clear emphasis on the novelty of the research
The introduction provides a general overview of dense counting tasks and multimodal prompt learning, but it does not clearly highlight how DE-CLIP differs from existing methods. Therefore, it would be beneficial to explicitly state the limitations of previous studies and how DE-CLIP addresses these gaps. A stronger emphasis on the unique contributions of this work would help distinguish it from prior research.
1)Lack of clear emphasis on the novelty of the research: Thank you for your feedback. We will clearly highlight the innovation of DE-CLIP in the introduction, especially how it fills the gaps in existing research, and elaborate on the limitations of previous studies. We will further explain how DE-CLIP overcomes these limitations and emphasize the unique contributions of this study.
2. Rationale behind design choices in the research methodology
Certain methodological decisions, such as the specific criteria for ranking loss or the selection of numerical ordering text prompts, lack detailed justification. Instead of merely presenting numerical experimental results, it would be helpful to logically explain why these particular design choices were optimal and how they impact the model’s performance.
2)Justification behind the design choices in the research methodology: Thank you for your suggestion. We will add more theoretical justifications for our design choices in the paper, particularly why we selected specific ranking loss functions and numerical ordering of text prompts. We will discuss in detail how these choices optimize model performance and enhance the effectiveness of the method.
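To illustrate how such a ranking criterion can be realized (a minimal sketch under our own assumptions, not necessarily the exact loss used in the paper), a margin-based ranking loss can penalize any pair of nested patches whose count-related scores violate the expected ordering:

```python
import torch
import torch.nn.functional as F

def ranking_loss(scores: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """Margin-based ranking loss over K nested image patches.

    `scores` is a (K,) tensor; scores[k] is a count-related score for the
    k-th patch, with patches ordered so that patch k is nested inside patch
    k+1 and therefore contains no more people. A pair violates the expected
    ordering when scores[k] + margin > scores[k+1].
    """
    violations = scores[:-1] - scores[1:] + margin  # positive for violations
    return F.relu(violations).mean()
```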
3. Too complex methodology section to follow
The explanation of cross-modal alignment ranking loss and the multimodal collaborative fusion module is written in a highly technical manner without sufficient intuitive guidance. To enhance clarity, consider adding pseudo-code or an algorithm block diagram to illustrate the step-by-step process. Providing a more intuitive interpretation of mathematical formulas would also help readers better understand the theoretical foundation of the proposed method.
3)Methodology section is too complex and difficult to understand: Thank you for your feedback. We acknowledge that the methodology section is quite technical, and we will revise some explanations in the revised version to make them clearer.
4. Sufficient and intuitive results Analysis needed
While the paper includes a significant number of tables and graphs, the key takeaways are not always clearly highlighted. It would be useful to summarize the most important findings and explicitly discuss how they demonstrate the effectiveness of the proposed approach. A deeper comparison with existing models (e.g., CSS-CCNN, CrowdCLIP) should also be included, particularly regarding which datasets show the most significant performance improvements and where the limitations of DE-CLIP remain.
4)Need for sufficient and intuitive result analysis: We have reorganized the experimental results section, adding comparisons between our method and other methods. We will also further analyze the advantages and limitations of DE-CLIP in greater depth.
5. Readability and clarity of the manuscript needs to be improved
The writing contains awkward phrasings and grammatical inconsistencies, which make some sections difficult to read. For example, the phrase "enables two-way interaction between text and visual information through self-attention and cross-modal attention mechanisms" could be rewritten as "Our approach effectively facilitates bidirectional interaction between textual and visual information via self-attention and cross-modal attention." Overall, sentence structures should be refined to improve conciseness and readability, ensuring that the explanations flow smoothly and naturally.
5)Improvement in readability and clarity of the manuscript: Thank you for your feedback. We will carefully review the entire manuscript to correct awkward phrasing and grammatical inconsistencies. We will refine some sentence structures to ensure the language is more concise and fluent, thus improving the overall readability of the paper.
6. Minor Grammar and spelling checking
Several mistakes and grammar issues are seen. Please check and revise those.
6)Grammar and spelling checks: Thank you for your careful review. We will thoroughly check the entire manuscript and correct all grammar and spelling errors to ensure the language is accurate and error-free.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
First and foremost, I would like to thank the authors for addressing the issues raised in my previous review. Most of my questions and remarks have been addressed; however, there is still one issue that I consider important and I would like to draw the attention of the authors.
In my first review I noticed that some comparative results in Table 1 were not commented and that raised questions to the reader. In the current version, the authors have added some comments but, either I am missing something obvious, or the new comments still do not answer the questions. The supervised methods (Switch CNN, LSC-CNN and CLTR) clearly outperform all other unsupervised methods (including DE-CLIP), so the statement "DE-CLIP significantly outperforms current state-of-the-art methods across all evaluated datasets" in line 437 is not valid. The authors' claim is even stronger in lines 463-465.
Furthermore, the new comments in lines 438-444 which are summarized in line 445 "These models have performance bottlenecks when faced with high-density, dynamic, or resource-constrained scenarios" are not supported by any result or evidence in the manuscript and it further baffles the reader. I would advise the authors to provide more solid support on these claims or to revise their statements.
Author Response
1. First and foremost, I would like to thank the authors for addressing the issues raised in my previous review. Most of my questions and remarks have been addressed; however, there is still one issue that I consider important and to which I would like to draw the authors' attention.
In my first review I noticed that some comparative results in Table 1 were not commented and that raised questions to the reader. In the current version, the authors have added some comments but, either I am missing something obvious, or the new comments still do not answer the questions. The supervised methods (Switch CNN, LSC-CNN and CLTR) clearly outperform all other unsupervised methods (including DE-CLIP), so the statement "DE-CLIP significantly outperforms current state-of-the-art methods across all evaluated datasets" in line 437 is not valid. The authors' claim is even stronger in lines 463-465.
Furthermore, the new comments in lines 438-444 which are summarized in line 445 "These models have performance bottlenecks when faced with high-density, dynamic, or resource-constrained scenarios" are not supported by any result or evidence in the manuscript and it further baffles the reader. I would advise the authors to provide more solid support on these claims or to revise their statements.
1. First, we would like to thank the reviewer for their thorough and valuable feedback on our work. We highly appreciate the issues raised and have made corresponding revisions and additions to address most of them. However, we would like to provide the following responses to the two key points raised by the reviewer:
- On the comparison results of DE-CLIP and comments in Table 1: We understand the reviewer’s concern about the lack of detailed discussion regarding some of the comparison results in Table 1. In the current version, we have added a more detailed discussion of the different methods, particularly the performance differences between supervised and unsupervised methods. Regarding the reviewer’s comment on the statement “DE-CLIP significantly outperforms current state-of-the-art methods across all evaluated datasets,” we agree that supervised methods such as Switch CNN, LSC-CNN, and CLTR perform better on certain datasets than DE-CLIP. However, our main contribution lies in unsupervised counting, where DE-CLIP achieves significant performance improvements on specific datasets (e.g., QNRF, UCF_CC_50) and in high-density and dynamic scenes. Therefore, we have revised the manuscript to clarify that DE-CLIP excels among unsupervised methods, rather than making a blanket claim of surpassing all methods in every scenario. We have made the necessary revisions in lines 437 and 463-465 to ensure the statement is accurate.
- On performance bottlenecks in high-density, dynamic, or resource-constrained scenarios: Regarding the reviewer’s comment on the performance bottlenecks of these models, we have provided a more detailed explanation in the manuscript of the limitations of the previous models, in particular an analysis of their bottlenecks. We have revised this section to better support the claims made.
We once again thank the reviewer for their valuable feedback. These comments have greatly helped improve the quality and clarity of the paper.
Reviewer 2 Report
Comments and Suggestions for Authors
After carefully reviewing the revised manuscript and the authors' responses, I have the following specific comments:
- While I appreciate the authors' intention to release the code later, I believe providing at least a minimal working version is crucial for facilitating validation of the results. I strongly recommend making the code available prior to submission to enhance the reproducibility of the study.
- The change in the title of Section 3.2.1 from “Image Block and Ordered Text Prompt” to “Image Block and Ordered Text Prompt enables two-way interaction between text and visual information through” is unclear and grammatically incomplete. The revised title does not convey a meaningful or coherent idea. I suggest revising it to clearly and concisely reflect the section's content.
- The discussion of Switch CNN, LSC-CNN, and CLTR in the experiment section is inappropriate for this context. These methods should be moved to the related work section, where they can be properly contextualized. In the experiment section, the authors should instead provide a quantitative analysis that directly compares the proposed method with these approaches to highlight its strengths and differences.
- The authors’ response regarding computational cost, “Regarding the computational cost: Currently, we have not completed a detailed analysis of the computational cost, but preliminary results show that DE-CLIP can effectively meet the requirements for real-time tasks”, lacks convincing evidence. Without a detailed analysis or quantitative metrics, this claim weakens the credibility of the work. I strongly recommend including a thorough evaluation of computational efficiency to substantiate this assertion.
- I disagree with the authors’ statement that “the output is simply the crowd count result from the input image”. To illustrate my point, I have included an example of visual results for dense counting tasks, which demonstrates the importance of providing detailed visualizations to better understand the model’s performance and output. Reference: CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification (https://arxiv.org/abs/2403.09281v1)
- The font size and layout of Figure 2 have not been significantly improved. The authors appear to have only resized the figure slightly without addressing its clarity or design. This does not resolve the issue of readability or enhance the figure’s effectiveness in conveying information.
- The authors state that “Following your suggestion, we will design a larger figure and provide a detailed description of the multimodal collaborative fusion module to help readers better understand its structure and function”. However, I cannot find any such figure in the revised manuscript. This omission is concerning and undermines the credibility of the revisions.
In conclusion, I find that the authors have not adequately addressed the concerns raised during the review process. The revisions appear to have been made carelessly, and key issues remain unresolved. As a result, I recommend rejecting the paper in its current form.
Comments for author File: Comments.pdf
Author Response
1.While I appreciate the authors' intention to release the code later, I believe providing at least a minimal working version is crucial for facilitating validation of the results. I strongly recommend making the code available prior to submission to enhance the reproducibility of the study.
1.Our Response:
We fully acknowledge the importance of reproducibility. In the revised manuscript, we have included a link to a minimal working version of the code (covering core modules for prompt tuning, cross-modal alignment, and inference) in the supplementary materials. The full codebase will be released on GitHub upon publication.
2.The change in the title of Section 3.2.1 from “Image Block and Ordered Text Prompt” to “Image Block and Ordered Text Prompt enables two-way interaction between text and visual information through” is unclear and grammatically incomplete. The revised title does not convey a meaningful or coherent idea. I suggest revising it to clearly and concisely reflect the section's content.
2.Our Response:
We apologize for the lack of clarity in the original phrasing. The section title has been revised to:
"3.2.1 Image Patch Generation and Text Prompt Design"
This revised title directly reflects the core focus of the section, which details the generation of progressively zoomed-in image patches and their corresponding text prompts.
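For concreteness, a minimal sketch of one way to realize this design, under our own assumptions (the crop schedule and the prompt template below are hypothetical, not taken from the paper):

```python
from typing import List
from PIL import Image

def nested_center_crops(img: Image.Image, levels: int = 5,
                        ratio: float = 0.8) -> List[Image.Image]:
    """Progressively zoomed-in, center-aligned crops: crop k+1 is nested
    inside crop k, so the crowd count it contains cannot increase."""
    w, h = img.size
    crops = []
    for k in range(levels):
        cw, ch = int(w * ratio ** k), int(h * ratio ** k)
        left, top = (w - cw) // 2, (h - ch) // 2
        crops.append(img.crop((left, top, left + cw, top + ch)))
    return crops

def ordered_count_prompts(counts: List[int]) -> List[str]:
    """Ordered text prompts from a hypothetical template, e.g.
    ordered_count_prompts([10, 50, 100, 500])."""
    return [f"There are {c} people in the image." for c in counts]
```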
3.The discussion of Switch CNN, LSC-CNN, and CLTR in the experiment section is inappropriate for this context. These methods should be moved to the related work section, where they can be properly contextualized. In the experiment section, the authors should instead provide a quantitative analysis that directly compares the proposed method with these approaches to highlight its strengths and differences.
3.Our Response:
Thank you for highlighting this structural issue. We have moved the aforementioned description to Section 4.3.2, and in Section 4.3.1, we now directly compare the Mean Squared Error (MSE) and Mean Absolute Error (MAE) of DE-CLIP with mainstream unsupervised methods.
4.The authors’ response regarding computational cost, “Regarding the computational cost: Currently, we have not completed a detailed analysis of the computational cost, but preliminary results show that DE-CLIP can effectively meet the requirements for real-time tasks”, lacks convincing evidence. Without a detailed analysis or quantitative metrics, this claim weakens the credibility of the work. I strongly recommend including a thorough evaluation of computational efficiency to substantiate this assertion.
4.Our Response:
We have added comprehensive computational analyses:
Table 4 now includes parameters (151.2M) and FPS (18.3 on an NVIDIA V100 GPU).
A new subsection, 4.3.4 Computational Efficiency, discusses the trade-off between accuracy and speed, and compares DE-CLIP with other models.
5.I disagree with the authors’ statement that “the output is simply the crowd count result from the input image”. To illustrate my point, I have included an example of visual results for dense counting tasks, which demonstrates the importance of providing detailed visualizations to better understand the model’s performance and output. Reference: CLIP-EBC: CLIP Can Count Accurately through Enhanced Blockwise Classification (https://arxiv.org/abs/2403.09281v1)
5.Our Response:
We agree and have revised the output description in Section 3.5 to:
"The model generates refined count predictions."
Figure 4 (new) demonstrates the comparison between DE-CLIP’s predictions and the ground truth.
6.The font size and layout of Figure 2 have not been significantly improved. The authors appear to have only resized the figure slightly without addressing its clarity or design. This does not resolve the issue of readability or enhance the figure’s effectiveness in conveying information.
6.Our Response:
We have redesigned Figure 2 with the following improvements:
Larger font sizes for enhanced readability.
Removal of the multimodal collaborative fusion module's internal details from Figure 2, so that the figure focuses on visualizing the overall multimodal fusion process.
7.The authors state that “Following your suggestion, we will design a larger figure and provide a detailed description of the multimodal collaborative fusion module to help readers better understand its structure and function”. However, I cannot find any such figure in the revised manuscript. This omission is concerning and undermines the credibility of the revisions.
7.Our Response:
This oversight in the initial revision has been addressed. Additionally, we have introduced:
Figure 3: A detailed schematic diagram of the Multimodal Collaborative Fusion Module, including its self-attention, cross-attention, and residual paths.
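For readers who want a concrete picture of this design, a minimal sketch of one plausible reading of such a block, assuming same-dimensional text and visual token sequences (an illustration under our own assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn

class CollaborativeFusionBlock(nn.Module):
    """Illustrative reading of the structure described for Figure 3:
    per-modality self-attention, bidirectional cross-attention, and
    residual paths. Not the authors' actual code."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_vis = nn.LayerNorm(dim)

    def forward(self, txt: torch.Tensor, vis: torch.Tensor):
        # Self-attention within each modality, with residual connections.
        txt = txt + self.self_txt(txt, txt, txt)[0]
        vis = vis + self.self_vis(vis, vis, vis)[0]
        # Cross-modal attention in both directions, again with residuals.
        txt = self.norm_txt(txt + self.vis_to_txt(txt, vis, vis)[0])
        vis = self.norm_vis(vis + self.txt_to_vis(vis, txt, txt)[0])
        return txt, vis
```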
We sincerely appreciate your strict review, which has been of great help to our work. If there is anything that needs further clarification, we will be happy to answer it in a timely manner.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
I find that the revised version of your manuscript has improved. However, there are several more points to address to improve it further.
1. Research Framework
Why don't you add a research framework diagram for an intuitive understanding of your work? This is very important because it would improve your methods part, which is still very complex.
2. Pseudo Codes
I want you to add pseudocode for particular parts of your work. I believe the paper already addresses novelty well by highlighting the limitations of previous studies and explaining how DE-CLIP overcomes them. However, some sections—especially those describing the cross-modal alignment ranking loss and the multimodal collaborative fusion module—are highly technical and might be difficult for readers to grasp without additional guidance. Therefore, the addition of proper pseudocode would help readers understand it better.
3. Readability Issue
There are still many sentences that are not very readable. I really want you to revisit each sentence and revise it rigorously.
Author Response
1. Research Framework
Why don't you add a research frame for intuitive understanding of your work? This is very important because it would improve your method part which is still very complex.
1.Our Response:
We have revised the structural diagrams of the model to enhance clarity and coherence, providing an intuitive research framework.
2.I want you to add pseudocode for particular parts of your work. I believe the paper already addresses novelty well by highlighting the limitations of previous studies and explaining how DE-CLIP overcomes them. However, some sections—especially those describing the cross-modal alignment ranking loss and the multimodal collaborative fusion module—are highly technical and might be difficult for readers to grasp without additional guidance. Therefore, the addition of proper pseudocode would help readers understand it better.
2.Our Response:
Pseudocode blocks added to both sections to algorithmically outline core procedures:
Section 3.2: Pseudocode for contrastive learning weight allocation in the alignment loss (e.g., similarity matrix computation, temperature scaling, and softmax normalization).
Section 3.3: Pseudocode for attention-based feature interaction in the fusion module (e.g., self-attention, cross-attention, and residual connections).
Pseudocode includes input/output definitions, key loops/conditionals, and cross-references to equations in the main text.
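As one concrete illustration of the Section 3.2 procedure, a sketch under our own assumptions (shapes and the temperature value are illustrative, not the paper's exact settings):

```python
import torch
import torch.nn.functional as F

def alignment_weights(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Weight allocation via similarity-matrix computation, temperature
    scaling, and softmax normalization (an illustrative sketch).

    img_emb: (M, C) image-patch embeddings; txt_emb: (K, C) prompt embeddings.
    Returns an (M, K) matrix whose rows are softmax-normalized weights.
    """
    img_emb = F.normalize(img_emb, dim=-1)   # unit-norm rows -> cosine sim
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()              # (M, K) similarity matrix
    return F.softmax(sim / temperature, dim=-1)  # temperature-scaled softmax
```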
3.There are still many sentences that are not very readable. I really want you to revisit each sentence and revise it rigorously.
3.Our Response:
- Sentence restructuring: Split long sentences into shorter, logically connected clauses (e.g., replacing nested clauses with sequential statements).
- Active voice: Revised passive constructions to active voice (e.g., "The experimental results demonstrate..." → "Our experiments demonstrate...").
- Terminology consistency: Unified technical terms across sections (e.g., "cross-modal alignment" instead of alternating terms like "inter-modal matching").
- Enhanced flow: Added transitional phrases (e.g., "To address this, we propose...") to clarify the logical progression of ideas.
Round 3
Reviewer 1 Report
Comments and Suggestions for Authors
I would like to thank the authors for addressing my last remarks. They provide details that cover, to some extent, my questions regarding their claims about the drawbacks of the other methods. However, I would advise the authors to support these claims with proper citations.
A final issue that should be resolved is the bold font in tables 1 & 2: In the case of the ShanghaiTech dataset (both part A and B) Crowd-CLIP exhibits the best performance, so it should be in bold.
Author Response
1.I would like to thank the authors for addressing my last remarks. They provide details that cover, to some extent, my questions regarding their claims about the drawbacks of the other methods. However, I would advise the authors to support these claims with proper citations.
1.Response:
Thank you for highlighting this need for clarity. We have revised Section 4.3.2 to include citations that substantiate our analysis of the limitations in existing methods. These additions align our critique with established literature and provide readers with further context.
2.A final issue that should be resolved is the bold font in tables 1 & 2: In the case of the ShanghaiTech dataset (both part A and B) Crowd-CLIP exhibits the best performance, so it should be in bold.
2.Response:
We appreciate your feedback on the table formatting. Following your suggestion, the Crowd-CLIP results for ShanghaiTech Part A and Part B in Tables 1 and 2, which represent the optimal performance on those datasets, are now shown in bold, ensuring that the best-performing methods in each category are always highlighted. Thank you for helping us improve the readability of our manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
Unfortunately, there are still confusing points that need to be reconsidered after the major revisions.
1. Tables and Figures
While I appreciate the efforts in revising the structural diagrams, some tables remain unclear. I suggest increasing the readability by adjusting font size, adding annotations, or restructuring the table layout if necessary.
2. Pseudocodes
The inclusion of pseudocode in Sections 3.2 and 3.3 is a valuable addition. However, true pseudocode should not adhere to any specific programming language syntax. Please ensure that the pseudocode is written in a more general algorithmic form, emphasizing logic over implementation details.
3. Readability Issues
While sentence restructuring and terminology consistency have improved, some parts are still difficult to follow. I recommend further simplifying complex sentences and ensuring a more natural flow of ideas, particularly in Sections 3.2 and 3.3.
4. Other concerns
1) Could you provide further clarification on how baseline models were implemented to ensure a fair comparison?
2) You should release the full code and datasets (even if the datasets are openly accessible, you must upload all modifications made to the data, such as batch handling, including data manipulation definitions) on GitHub or another platform for reproducibility. Also, you should provide a detailed description of the experimental setup.
3) The pseudocode should strictly adhere to an algorithmic representation without specific programming language constraints. Could you ensure its clarity and consistency with the main text?
Author Response
1. Tables and Figures
While I appreciate the efforts in revising the structural diagrams, some tables remain unclear. I suggest increasing the readability by adjusting font size, adding annotations, or restructuring the table layout if necessary.
1.Response:
Thank you for your valuable feedback on improving the readability of the tables. We have reorganized Tables 1-4: adjusting font sizes within the tables, annotating key results, and revising several table captions.
2. Pseudocodes
The inclusion of pseudocode in Sections 3.2 and 3.3 is a valuable addition. However, true pseudocode should not adhere to any specific programming language syntax. Please ensure that the pseudocode is written in a more general algorithmic form, emphasizing logic over implementation details.
2.Response:
Thank you for your valuable feedback on improving the pseudocode representation. Following your suggestions, we have rewritten the pseudocode in Sections 3.2 and 3.3 in a general algorithmic form, emphasizing logic over language-specific implementation details.
3. Readability Issues
While sentence restructuring and terminology consistency have improved, some parts are still difficult to follow. I recommend further simplifying complex sentences and ensuring a more natural flow of ideas, particularly in Sections 3.2 and 3.3.
3.Response:
Thank you for your feedback! We have thoroughly simplified Sections 3.2-3.3 and reworked the text by restructuring sentence logic, adopting a three-stage "problem → method → effect" structure, making greater use of transitional phrases, and reducing the density of technical expression.
4. Other concerns
1) Could you provide further clarification on how baseline models were implemented to ensure a fair comparison?
2) You should release the full codes and datasets (even if datasets are openly accessible, you must upload all modifications made to the data such as batch handlings or etc., including data manipulation definitions) on Github or any other platforms for reproducibility. Also, you should provide a detailed description of the experimental setup?
3) The pseudocode should strictly adhere to an algorithmic representation without specific programming language constraints. Could you ensure its clarity and consistency with the main text?
4.Response:
Thank you for your feedback! We have added a description of how the unsupervised method implements counting and revised the pseudocode description to a general algorithmic form. We will later post the article's source code and related experimental setup instructions on GitHub.