1. Introduction
The core objective of dense counting tasks is to accurately estimate the number of objects in images or videos, with critical applications in crowd management, traffic flow monitoring, and environmental surveillance. However, precise counting in dense scenarios faces significant challenges, including mutual occlusion between objects, interference from complex backgrounds, and limitations in image quality, all of which substantially degrade model accuracy.
Traditional dense counting methods predominantly rely on density map regression models [1,2,3], which require precise point-level annotations for each object (e.g., individual heads in crowd scenes) to generate density maps. However, this point-level annotation process is extremely laborious and time-consuming. As illustrated in Figure 1a, the annotation of datasets such as NWPU-Crowd [4] demands substantial human and temporal resources. To mitigate annotation costs, researchers have proposed weakly supervised [5,6] and semi-supervised methods [7,8], utilizing coarse count-level annotations or combining limited fully labeled data with large-scale unlabeled images for training. Nevertheless, these approaches still heavily depend on labeled data when handling dense or complex scenes, leaving annotation costs prohibitively high. Consequently, reducing the reliance on annotated data while enhancing model performance in unsupervised settings remains a pivotal challenge in dense counting research.
In recent years, language–image contrastive pre-trained models such as CLIP [9] have achieved remarkable success in various visual tasks, including object detection [10], semantic segmentation [11], and image generation [12], thanks to their excellent transfer learning capabilities. CLIP learns visual representations from large-scale noisy image–text pair data, but when applied to downstream tasks, it often relies on manually designed prompts (e.g., “a photo of [CLASS]”), which are typically based on intuition and experience, making it challenging to achieve optimal results. To address this issue, researchers have developed automated prompt tuning techniques such as CoOp [13], CoCoOp [14], and VPT [15]. These methods primarily focus on optimizing prompts for a single modality (either visual or textual), as shown in Figure 1b. However, research on simultaneous prompt tuning for both visual and textual modalities—leveraging the complementary advantages of multimodal data to enhance the accuracy and robustness of dense counting tasks—has not been fully explored.
In dense counting tasks, a natural approach is to discretize the number of objects into multiple intervals, transforming the counting problem into a classification task. Specifically, the similarity between the image encoder and text encoder embeddings is computed, and the most similar image–text pair is selected as the prediction. However, the zero-shot CLIP model performs suboptimally in such tasks because the original CLIP model is primarily trained on single-target images, making it difficult to effectively capture semantic information in complex scenes. Additionally, the non-uniform distribution of objects in the image can cause areas without targets to interfere with the CLIP model, negatively impacting prediction accuracy. A straightforward remedy is single-modal prompt tuning—designing independent prompts for each modality and optimizing them through end-to-end supervised training—but while this improves performance to some extent, it often fails to fully leverage the complementary advantages of multimodal data because of the inherent differences between the visual and textual modalities. As a result, the performance improvements are limited. To overcome this limitation, we propose an innovative multimodal deep sharing prompt tuning method, as shown in Figure 1c. By enabling deeper fusion between modalities, this method enhances the model’s ability to understand and process complex scenes, thereby significantly improving the accuracy and robustness of dense counting tasks.
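For illustration, the following minimal sketch shows how such an interval-based, zero-shot counting baseline can be expressed with the open-source CLIP package; the model variant, prompt template, and count intervals are illustrative assumptions rather than the configuration used by DE-CLIP.

```python
# Minimal sketch: zero-shot counting as interval classification with CLIP.
# Count intervals, prompt template, image path, and model name are illustrative.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Discretize counts into intervals and describe each interval with a text prompt.
count_bins = [10, 50, 100, 200, 400, 800]
prompts = [f"a photo of a crowd of about {c} people" for c in count_bins]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("scene.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text_tokens)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sim = image_emb @ text_emb.t()          # (1, num_bins) image-text similarities

pred_bin = sim.argmax(dim=-1).item()        # most similar image-text pair
print(f"predicted count interval: ~{count_bins[pred_bin]}")
```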
Although existing unimodal prompt tuning methods such as CoOp, CoCoOp, and VPT have improved the performance of visual or textual modalities to some extent, they often overlook the complementary nature between modalities, which limits performance improvement in handling complex and dynamic dense counting tasks. To address this issue, DE-CLIP (dense counting CLIP) introduces a multimodal deep shared prompt tuning method that fully integrates the strengths of both visual and textual modalities. Specifically, DE-CLIP constructs ordered textual prompts and cross-modal alignment ranking loss during the training phase, which encourages deep interaction and alignment between the image encoder and the text encoder in the embedding space, enabling the model to better understand and process semantic information in complex scenes. During the testing phase, DE-CLIP employs a multimodal collaborative fusion module that enables bidirectional prompt fusion and deeper modality interaction, effectively overcoming the limitations of traditional unimodal methods. Through this innovative approach, DE-CLIP not only improves the accuracy of dense counting tasks, but also enhances robustness across different datasets and complex scenarios.
We conducted extensive experiments on three challenging datasets to evaluate the effectiveness of DE-CLIP. Notably, on the QNRF dataset, DE-CLIP achieved improvements of 5.6% and 38.8% in MAE (mean absolute error) compared to the existing unsupervised state-of-the-art methods, CrowdCLIP [16] and CSS-CCNN [17], respectively. In cross-dataset validation, our approach even outperforms several popular fully supervised methods [3,18], demonstrating superior performance and broad application potential under unsupervised conditions.
The main contributions of this paper are as follows:
- (1) We propose a novel architecture based on CLIP that incorporates a deep-level prompt injection and prompt sharing mechanism, enabling dense counting tasks to be addressed in an unsupervised setting.
- (2) We introduce a cross-modal alignment ranking loss to guide the image encoder in learning ordered embedded representations, enhancing the model’s numerical awareness.
- (3) We present a multimodal collaborative fusion module that addresses the issue of underutilized prompts in multimodal environments by effectively fusing and facilitating deep interaction between prompt information across different modalities.
2. Related Work
Supervised methods for dense counting can be broadly categorized into full supervision and semi-supervision. In mainstream supervised dense counting studies [2,19,20,21,22], the primary approach is to estimate the number of objects by regressing density maps. These density maps are typically generated from carefully labeled point annotations. However, such point labels often fail to accurately reflect the size of objects, making density-based methods prone to errors when faced with scale variations. To address this issue, researchers have proposed several solutions. Specifically, some studies [3,21] use multi-layer network architectures to learn multi-scale features, enhancing the model’s ability to handle objects of varying scales. Additionally, attention mechanisms play a crucial role in improving feature representation, with common approaches including self-attention mechanisms [23], spatial attention mechanisms [24], and other customized attention modules [25].
In addition to density map regression, other approaches [26,27] employ supervised classifiers to categorize the number of objects into different intervals, achieving satisfactory performance in certain scenarios. Furthermore, location-based methods [28,29,30] have gained attention in recent years and can be grouped into three main categories: predicting pseudo-bounding boxes [30,31], custom location-based maps [28,30], and direct regression of point coordinates [32,33]. These methods typically do not require complex pre-processing or post-processing steps.
Weakly supervised methods propose using count-level labeling instead of point-level labeling as a supervisory signal, while semi-supervised methods further enhance model performance by combining a small amount of labeled data with a large volume of unlabeled data.
Currently, only CSS-CCNN [17] and CrowdCLIP [16] focus on purely unsupervised settings, where the model is trained without any labeled data. The core idea behind CSS-CCNN [17] is that the distribution of natural crowds follows a power-law distribution, which can be leveraged to generate backpropagation errors. CrowdCLIP [16] trains the encoder to classify images and incorporates prompts for counting. Our experiments show that there remains a significant performance gap between CSS-CCNN [17] and some popular fully supervised methods [3,34].
In recent years, visual-language models (VLMs) pre-trained on large-scale image–text pairs from the Internet have gained increasing attention. These VLMs, pre-trained on vast datasets, exhibit advanced zero-shot image–text matching capabilities. CLIP, in particular, learns an aligned multimodal embedding space and has inspired various applications, including image-level classification. Recently, two CLIP-based target counting models have been proposed, both of which rely on image-level classification. However, these models are limited in their counting granularity and precision, and achieving accurate density estimation using VLMs remains a challenge. In this paper, we explore how to transfer visual-language knowledge to unsupervised density estimation tasks. To this end, we propose DE-CLIP, a novel method that transforms the dense counting task into an image–text matching problem, significantly enhancing the performance of unsupervised dense scene counting.
3. Methods
3.1. General Architecture
This study proposes a method for optimizing the visual-language model, with the overall framework consisting of two phases: training and testing. Our approach enhances the model’s numerical perception by constructing ordered text prompts and guiding the image encoder to learn ordered embedded representations through cross-modal alignment losses. Through the hierarchical recursive injection of visual information, the text encoder achieves layer-by-layer fusion of textual and visual prompts, improving the representation capability of multimodal information. The image encoder interacts deeply with the text prompts at each layer of the transformer, enhancing the synergy between visual features and text information. The multimodal collaborative fusion module enables bidirectional interaction between textual and visual information through self-attention and cross-modal attention mechanisms, thereby improving the model’s ability to understand and process complex scenes. Finally, the density estimation result is generated by matching the image and text. The overall architecture of the model is shown in Figure 2.
3.2. Cross-Modal Alignment Ranking Mechanism
To enhance CLIP’s ability to recognize density-related objects, we propose an ordered contrastive fine-tuning strategy. By aligning progressively expanding image regions with corresponding text prompts through a novel ranking loss, we enable the image encoder to learn ordered feature representations, improving numerical sensitivity and generalization (pseudocode in Appendix A).
3.2.1. Image Block Segmentation and Text Prompt Design
While CLIP excels at zero-shot image–text matching, its original design lacks the numerical perception capabilities crucial for dense counting tasks. To effectively count densely packed objects, the model must precisely estimate quantities within images.
To enhance CLIP’s numerical awareness, we propose ordered image–text pairs with progressive scaling. Our method enables the model to learn correlations between visual patterns and quantity levels through structured input–output pairs. Order-preserving ranking loss enforces alignment between expanding visual regions and their textual descriptions, enhancing dense counting accuracy.
First, the input image is divided into a series of square image blocks $\{B_i\}$ of increasing size, with the center point $(c_x, c_y)$ of the image as the reference. Let $B_i$ represent the $i$-th square image block, where $i = 1, 2, \dots, M$. The side lengths $s_i$ of each image block satisfy the condition $s_1 < s_2 < \dots < s_M$. As $i$ increases, the side length $s_i$ of each square block grows, causing the area covered by the image block to expand and the number of objects contained within the block to increase. The image block generation formula is as follows:
$$B_i = I\left[c_x - \tfrac{s_i}{2} : c_x + \tfrac{s_i}{2},\ c_y - \tfrac{s_i}{2} : c_y + \tfrac{s_i}{2}\right],$$
where $B_i$ represents the $i$-th square image block cropped from the input image $I$, $(c_x, c_y)$ denotes the center point coordinates of the image, and $s_i$ represents the side length of the $i$-th image block.
The image blocks follow an incremental pattern, where each block expands progressively from the center. Larger image blocks encompass more objects than smaller ones, ensuring an increasing relationship with the number of objects within each block. To effectively capture the order among these image blocks, we design the corresponding ordered text prompts, defined as follows:
$$r_j = r_1 + (j - 1)\, d, \quad j = 1, 2, \dots, K,$$
where $r_1$ is the initial text prompt (count value), $d$ represents the incremental step size, and $K$ is the total number of prompts. This set of ordered text prompts is fed into the frozen text encoder to generate text embeddings.
Through structured alignment between visual quantities and text prompts, the model learns progressive object distributions across image regions. This multimodal correlation learning enhances complex scene interpretation by explicitly encoding numerical patterns.
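The construction described above can be sketched as follows. The crop schedule, the prompt template (“There are N people …”), and the default values $r_1 = 20$ and $d = 35$ (the best configuration reported in Section 5.1) are illustrative assumptions rather than the exact implementation.

```python
# Sketch of the ordered image-text pair construction: center-anchored square
# crops of increasing side length paired with count prompts growing by a
# fixed step. Symbols (s_i, r_1, d, M/K) follow the notation above.
from PIL import Image

def build_ordered_pairs(image: Image.Image, K: int = 6, r1: int = 20, d: int = 35):
    w, h = image.size
    cx, cy = w // 2, h // 2
    s_max = min(w, h)                                            # largest crop fits in the image
    side_lengths = [int(s_max * (i + 1) / K) for i in range(K)]  # s_1 < ... < s_K

    crops, prompts = [], []
    for i, s in enumerate(side_lengths):
        # Square block B_i of side s centered at (cx, cy).
        box = (cx - s // 2, cy - s // 2, cx + s // 2, cy + s // 2)
        crops.append(image.crop(box))
        # Ordered count prompt: r_j = r_1 + (j - 1) * d.
        prompts.append(f"There are {r1 + i * d} people in the image.")
    return crops, prompts

crops, prompts = build_ordered_pairs(Image.open("scene.jpg"))
print(prompts)  # ['There are 20 people ...', 'There are 55 people ...', ...]
```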
3.2.2. Cross-Modal Alignment Ranking Loss
For the generated image embeddings $V = [v_1, \dots, v_M] \in \mathbb{R}^{M \times D}$, $M$ is the number of image patches and $D$ is the dimension of the image patch embedding vector, which is determined by the output layer size of the network; for the text embeddings $T = [t_1, \dots, t_K] \in \mathbb{R}^{K \times D}$, $K$ is the total number of prompts and $D$ is the dimension of the text prompt embedding vector, likewise determined by the output layer size of the network. The similarity matrix $S$ is calculated using the inner product, i.e., $S = V T^{\top}$, which represents the similarity between the image embeddings $V$ and the text embeddings $T$.
Since the image and text have a pre-set ordering relationship, we aim for the similarity matrix $S$ to have a specific order, thus preserving the ordered alignment of the image and text embeddings. To achieve this, we set $M = K$, ensuring that the similarity matrix $S$ is a square matrix. Based on this, we propose a cross-modal alignment ranking loss $\mathcal{L}_{\mathrm{rank}}$ to fine-tune the image encoder and maintain the order of the image and text embeddings in the corresponding space. This loss function is defined as follows:
$$\mathcal{L}_{\mathrm{rank}} = \sum_{i=1}^{K} \sum_{j \neq i} \max\left(0,\ m - \left(S_{ii} - S_{ij}\right)\right),$$
where $S_{ii}$ represents the similarity between the image embedding $v_i$ and the text embedding $t_i$, while $S_{ij}$ denotes the similarity between the image embedding $v_i$ and the text embedding $t_j$. The parameter $m$ is a predefined threshold (margin) that ensures that the similarity value on the main diagonal is higher than all off-diagonal values.
This loss function establishes ordered alignment between image–text embeddings in the multimodal space, enhancing numerical sensitivity and boosting generalization capabilities through structured feature matching.
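A minimal PyTorch sketch of this ranking loss is given below, assuming $K$ paired, ordered image-block and text embeddings; the margin value and the averaging over violations (rather than summing) are illustrative choices.

```python
# Sketch of the cross-modal alignment ranking loss on the similarity matrix S.
import torch
import torch.nn.functional as F

def cross_modal_alignment_ranking_loss(image_emb: torch.Tensor,
                                        text_emb: torch.Tensor,
                                        margin: float = 0.1) -> torch.Tensor:
    """image_emb, text_emb: (K, D) embeddings with matched (ordered) pairing."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    S = image_emb @ text_emb.t()                 # (K, K) similarity matrix
    diag = S.diag().unsqueeze(1)                 # S_ii, shape (K, 1)
    off_diag = ~torch.eye(S.size(0), dtype=torch.bool, device=S.device)
    # Hinge penalty whenever an off-diagonal similarity S_ij comes within
    # `margin` of the matched diagonal similarity S_ii in the same row.
    violations = F.relu(margin - (diag - S))[off_diag]
    return violations.mean()                     # averaged here for scale stability

# Example: 6 blocks / 6 prompts with 512-dimensional embeddings.
loss = cross_modal_alignment_ranking_loss(torch.randn(6, 512), torch.randn(6, 512))
```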
3.3. Text Encoder
During model deployment, the text encoder enables joint multimodal optimization through bidirectional visual–textual interaction. This enhanced architecture processes text prompts while dynamically incorporating visual features, establishing cross-modal information exchange across all transformer layers. This deep fusion mechanism significantly improves complex scene comprehension through coordinated feature learning.
3.3.1. Input Layer
The initial text prompt consists of natural language information provided externally, typically used to describe aspects relevant to the image, such as the number of objects present. In the model, the text prompt is first processed through a word embedding layer, resulting in a fixed-dimensional vector representation. This vector comprises three types of information: the text prompt embedding (similar to a template such as “A photo of”), the word embeddings of the class name, representing the image class, and the embedding of the special token.
3.3.2. Hierarchical Information Fusion on the Text Side
Next, we progressively inject the text prompt embeddings into each layer of the transformer network, up to the $J$-th layer (the prompt depth). This hierarchical injection allows the text encoder to progressively refine its understanding of the image by incorporating visual information in each layer, ensuring that both textual and visual cues are deeply integrated at multiple levels of the transformer. This fusion strengthens the model’s ability to understand and interpret complex scenes, enhancing its performance on tasks such as dense counting.
The text prompts used in the recursive injection process across different transformer layers are represented as follows:
At the first layer, the initial text prompt, the class-name embedding, and the token embedding are fed into the network.
In subsequent layers, up to the $J$-th layer, the text prompt is replaced by a multimodal prompt that incorporates visual information. This updated prompt, along with the class-name embedding and the token embedding, is then passed into the next layer of the transformer.
In the first transformer encoding layer, the input prompt is the raw text prompt, which is essentially the text representation without significant modal processing. In each subsequent layer, the text prompt is combined with the visual prompt. This combined prompt undergoes processing through the cross-modal attention mechanism in the multimodal collaborative fusion module, allowing the text prompt to progressively integrate visual information into the multimodal prompt representation.
This hierarchical integration process enables the progressive fusion of multimodal features through layered inputs, significantly strengthening cross-modal learning capabilities. The deep prompt injection mechanism drives multimodal transition by adaptively enriching text prompts with visual context across transformer layers, achieving comprehensive visual–textual fusion in deep network stages.
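The layer-wise injection described above can be summarized in the following schematic sketch. Here, `text_layers`, `fusion`, and `proj` are placeholders for the frozen transformer blocks, the multimodal collaborative fusion module of Section 3.5, and the linear transformation of Section 3.3.3; all names and shapes are assumptions for illustration, not the exact implementation.

```python
# Schematic sketch of deep prompt injection on the text side (Section 3.3).
# Tensor shapes are (batch, tokens, dim); J is the prompt depth.
import torch

def encode_text_with_deep_prompts(text_layers, fusion, proj,
                                  P_t0, class_emb, tok_emb, visual_prompts, J):
    # Layer-1 input: [initial text prompt | class-name embedding | token embedding]
    P_t = P_t0
    x = torch.cat([P_t, class_emb, tok_emb], dim=1)
    n_prompt = P_t.size(1)

    for l, layer in enumerate(text_layers):
        x = layer(x)
        if l + 1 < J:
            # Replace the prompt tokens with a multimodal prompt fused with the
            # visual prompt of the same depth, then project it to the width
            # expected by the next layer.
            P_t_fused, _ = fusion(x[:, :n_prompt], visual_prompts[l])
            P_t = proj(P_t_fused)
            x = torch.cat([P_t, x[:, n_prompt:]], dim=1)
    return x  # final text encoding carrying deeply fused multimodal prompts
```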
3.3.3. Output Layer
After each layer is processed, the text prompts are linearly transformed to ensure that the output dimensions align with the input requirements of the next transformer layer. Following this transformation, the resulting text prompt (i.e., the merged text prompt) is passed to the next layer. Through the recursive fusion across multiple transformer layers, the model generates a deeply integrated multimodal feature representation, with the final text code encapsulating rich visual and linguistic information.
In the design of the text encoder, a hierarchical recursive injection mechanism is employed to gradually fuse textual prompt and visual prompt information. This process enhances the text prompt in each layer, allowing the final generated text encoding to fully capture the shared semantics between the image and the text. This design not only improves the expressiveness of the text information, but also strengthens the synergy between cross-modal data, enabling the model to better understand and handle complex multimodal tasks.
3.4. Image Encoder
To further enhance the cooperative optimization of multimodal information, this study designs an image encoder that processes the input image to generate the final image representation. Simultaneously, the text prompt is integrated, enabling bidirectional information interaction between the visual data and the textual prompt.
3.4.1. Input Layer
The input image is first processed through patch embedding to obtain a fixed-dimensional vector representation. This generated vector contains three types of information: the token encoding, the initial visual feature (prompt) information, and the image patch information.
3.4.2. Hierarchical Information Fusion on the Image Side
Next, we inject the visual prompts into each layer of the transformer network up to the $J$-th layer, similar to the text prompt injection process.
The image prompts for the recursive injection process at different transformer layers are represented as follows:
At the first layer, the initial visual prompt, the patch information, and the token embedding are input.
In subsequent layers, up to the $J$-th layer, the visual prompt is replaced by a multimodal prompt after incorporating the text information. This updated prompt is then input into the next transformer layer, along with the patch information and the token embedding.
Through this deep visual prompt mechanism, visual features are progressively updated and multimodally fused at each layer. Ultimately, these prompts enable the model to capture the multimodal features of the image at a deeper level, further enhancing the alignment between the visual and textual branches.
3.4.3. Output Layer
After each layer of processing, the image prompts are linearly transformed to ensure that the output dimensions align with the input requirements of the transformer layer. Once this processing is complete, the image prompts—representing the deeply fused and processed image features—are fed into the next transformer layer. Ultimately, after passing through multiple transformer layers, the resulting image code captures all relevant information from the image in a deep feature representation.
The design of the image encoder incorporates a deep multimodal fusion mechanism, enabling visual features to interact extensively with text prompts at each transformer layer. This ensures that the image features are richly represented within the multimodal context. Through the hierarchical recursive injection mechanism, the image encoder effectively generates image embeddings that integrate cross-modal information, thereby enhancing the model’s performance in multimodal tasks. This design not only improves the semantic representation of images, but also strengthens the synergy between cross-modal features, allowing the model to better capture the interrelationships between images and text.
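For completeness, the image side can be sketched as a mirror of the text-side loop above, under the same assumptions (placeholder module names, illustrative shapes, prompt depth $J$).

```python
# Mirror sketch for the image side (Section 3.4): visual prompts are injected
# layer by layer and fused with the text prompts at the same depth.
import torch

def encode_image_with_deep_prompts(visual_layers, fusion, proj,
                                   cls_tok, P_v0, patch_emb, text_prompts, J):
    # Layer-1 input: [class token | initial visual prompt | patch embeddings]
    P_v = P_v0
    x = torch.cat([cls_tok, P_v, patch_emb], dim=1)
    n_cls, n_prompt = cls_tok.size(1), P_v.size(1)

    for l, layer in enumerate(visual_layers):
        x = layer(x)
        if l + 1 < J:
            # Fuse the current visual prompt with the text prompt at the same
            # depth, then re-insert the projected multimodal prompt.
            _, P_v_fused = fusion(text_prompts[l], x[:, n_cls:n_cls + n_prompt])
            P_v = proj(P_v_fused)
            x = torch.cat([x[:, :n_cls], P_v, x[:, n_cls + n_prompt:]], dim=1)
    return x  # final image encoding enriched with cross-modal prompt information
```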
3.5. Multimodal Collaborative Fusion Module
To integrate textual and visual prompts, this study introduces a multimodal collaborative fusion module, as shown in Figure 3. The module aims to enhance the model’s ability to understand complex scenes, particularly when processing both textual and visual information, by improving the collaborative optimization of multimodal data through bidirectional interaction and deep feature fusion. The lightweight structure of the module ensures that textual and visual prompts are fully integrated at each transformer layer, thereby enhancing the expression and processing capabilities of cross-modal features. The pseudocode can be found in Appendix B.
The multimodal collaborative fusion module primarily consists of the following key components:
Input Layer: The textual prompt and visual prompt are embedded representations obtained from the text encoder and image encoder, respectively.
Multi-Head Attention Mechanism: This mechanism is divided into two layers: the self-attention layer and the cross-modal attention layer.
Self-Attention Layer: This layer computes the self-attention within each modality separately, i.e., for the textual prompt and the visual prompt . It captures the internal feature relationships within each modality, such as the contextual relationships between words in the textual prompt and the correlations between different regions in the visual prompt.
Cross-Modal Attention Layer: This layer uses cross-attention to calculate the attention weights between the textual prompt and the visual prompt . This enables the interaction of information and feature fusion between the two modalities. By capturing the cross-modal correlations, it provides the model with a richer, more comprehensive multimodal context.
Residual Connection and Normalization: After the multi-head attention mechanism, a residual connection and normalization layer is adopted, consisting of the following two steps:
Residual Connection: The input features from the text prompt and the visual prompt are added to the attention output, which helps to mitigate the issue of vanishing gradients. Specifically, the residual connections generate the updated textual and visual features by adding the input features to the output of the cross-modal attention layer.
Normalization: The fused features are then normalized to accelerate convergence and improve training stability, yielding the normalized textual and visual features.
Feedforward Network: The fused features undergo further nonlinear mapping to enhance the model’s ability to express complex patterns. This operation transforms the normalized textual and visual features into higher-dimensional feature representations. After passing through the feedforward network, the resulting textual and visual features capture more complex semantic and visual patterns, representing the integrated textual and visual information.
Linear Transform Layer: A linear transformation is applied to adjust the merged prompts to dimensions that are consistent with the input requirements of the next transformer layer, ensuring compatibility for subsequent processing. This transformation aligns the fused textual and visual features with the dimensional requirements of the following layer, maintaining consistency between the input and output dimensions of the model. The resulting features represent the merged textual and visual prompts, which are ready for further processing in the next layer or the final output stage.
Output Layer: After a series of operations, including self-attention, cross-modal attention, residual connections, normalization, feedforward network processing, and linear transformations, the fused text prompt and fused visual prompt are generated. These prompts now contain integrated features from both the textual and visual modalities, making them ready for use in the next transformer layer or for the final model output. The fused text prompt captures the semantics of the text, along with its relationship to the visual information, while the fused visual prompt incorporates visual features that are enhanced by the text information, strengthening the semantic associations within the visual data.
The multimodal collaborative fusion module enables two-way information interaction between the textual prompt and visual prompt through effective information fusion. It combines the outputs of both self-attention and cross-modal attention to integrate features from the textual and visual modalities.
The two-way information interaction mechanism ensures a bidirectional flow of information between text prompts and visual prompts, enabling full integration of both modalities within each transformer layer. This deep interaction not only enhances the information representation capabilities of each individual modality, but also strengthens the synergy of cross-modal features. Consequently, the model’s ability to understand and process complex scenes is significantly improved.
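The components listed above can be summarized in the following minimal PyTorch sketch. The embedding dimension, head count, and the exact placement of the residual connections are assumptions for illustration, not the exact implementation.

```python
# Minimal sketch of the multimodal collaborative fusion module: per-modality
# self-attention, bidirectional cross-modal attention, residual + normalization,
# feedforward refinement, and a linear projection to the next layer's width.
import torch
import torch.nn as nn

class MultimodalCollaborativeFusion(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, out_dim: int = 512):
        super().__init__()
        self.self_attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries vision
        self.cross_attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)  # vision queries text
        self.norm_t, self.norm_v = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn_t = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.ffn_v = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.proj_t, self.proj_v = nn.Linear(dim, out_dim), nn.Linear(dim, out_dim)

    def forward(self, P_t: torch.Tensor, P_v: torch.Tensor):
        # Intra-modal context via self-attention.
        t, _ = self.self_attn_t(P_t, P_t, P_t)
        v, _ = self.self_attn_v(P_v, P_v, P_v)
        # Bidirectional cross-modal attention with residual connections.
        t_cross, _ = self.cross_attn_t(t, v, v)   # text attends to visual features
        v_cross, _ = self.cross_attn_v(v, t, t)   # vision attends to textual features
        t = self.norm_t(P_t + t_cross)
        v = self.norm_v(P_v + v_cross)
        # Feedforward refinement and linear projection to the next layer's width.
        t = self.proj_t(t + self.ffn_t(t))
        v = self.proj_v(v + self.ffn_v(v))
        return t, v  # fused textual and visual prompts

# Example usage with (batch=2, tokens=4, dim=512) prompt tensors.
fusion = MultimodalCollaborativeFusion()
P_t_fused, P_v_fused = fusion(torch.randn(2, 4, 512), torch.randn(2, 4, 512))
```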
5. Ablation Experiments
All experiments in this section were conducted on the QNRF dataset [35], focusing on the following aspects:
5.1. Impact of Text Prompts Design
As shown in Table 4, we first systematically investigated the impact of different numerically ordered text prompts on model performance. To this end, we designed a series of experiments that adjusted the numerical sequence of the ranking prompts to evaluate their specific effects on the model’s counting ability. In a key experiment, we set the rank prompt to [‘20’, ‘55’, ‘90’, ‘125’, ‘160’, ‘195’], corresponding to parameters $r_1$ = 20 and $d$ = 35.
The experimental results demonstrate that, under this particular configuration, the DE-CLIP model achieves optimal performance, significantly outperforming the other rank prompt configurations. This outcome indicates that DE-CLIP is highly effective at learning from numerically ranked text and capturing a meaningful representation of the number of objects in an image.
Further analysis reveals that appropriate numerical ranking prompts not only provide extensive numerical information, but also create a contextual framework that enhances the model’s accuracy in counting. This capability makes DE-CLIP particularly well suited for object counting tasks in complex scenarios.
To comprehensively evaluate the impact of different rank prompts, we tested a variety of rank prompt configurations, varying both the starting values ($r_1$) and the step sizes ($d$). The results demonstrated that DE-CLIP consistently outperformed the original CLIP model in counting performance across all prompt configurations. This consistent superiority underscores the effectiveness and robustness of DE-CLIP in learning and utilizing numerical ordering prompts.
5.2. Prompt Depth Sensitivity Analysis
In Figure 5, we detail the specific effects of varying prompt depths on the performance of the DE-CLIP model’s text encoder and visual encoder. Prompt depth refers to the number of layers at which the prompt vectors are inserted into the model architecture, effectively determining the feature-space level at which they are applied. By systematically adjusting the prompt depth, we gain insights into how prompt vectors operate at different levels of the model’s feature hierarchy.
Our experimental results indicate that the overall performance of the DE-CLIP model improves significantly as the prompt depth increases gradually. This trend suggests that deeper feature spaces are more effective at integrating the information conveyed by the prompt vectors, thereby enhancing the model’s ability to comprehend the semantic relationships between images and text. Specifically, when the prompt vector is inserted into the model’s mid-to-upper feature layers (e.g., layers five to nine), the model more effectively captures complex semantic relationships and fine-grained information. This leads to improved accuracy and robustness in counting tasks.
These findings demonstrate that the strategic placement of prompt vectors within deeper layers of the model architecture plays a crucial role in optimizing DE-CLIP’s performance, particularly in understanding and processing intricate semantic associations in complex scenes.
In particular, when we inserted randomly initialized prompt vectors into the frozen model’s deep feature space, the model’s performance exhibited greater sensitivity to prompt depth. This indicates that the deep feature space not only integrates prompt information more effectively, but also relies more heavily on the initialization and configuration of the prompt vectors. Therefore, it is crucial to design an appropriate initialization strategy for the prompt vectors to fully leverage the potential of deep prompts.
By increasing the prompt depth, we significantly enhanced the performance of DE-CLIP compared to earlier shallow prompt methods, such as those using a prompt depth of one in CoOp. Shallow prompting typically captures only basic feature information, which limits its ability to handle complex semantic relationships and diverse scenes. In contrast, deep prompts operate within higher-level feature spaces, providing richer and more detailed contextual information. This enhancement improves the model’s object counting capabilities in complex scenes, demonstrating the superiority of deep prompting over shallow methods.
Further experimental results demonstrate that DE-CLIP achieves optimal performance when the prompt depth is set to nine. Compared to shallow prompt methods, DE-CLIP exhibits significantly improved counting accuracy and stability. Selecting the optimal prompt depth not only enhances the model’s performance, but also strikes a favorable balance between computational efficiency and resource utilization. A prompt depth that is too shallow may lead to insufficient information fusion, while an excessively deep prompt depth could introduce redundant information or increase computational complexity.
Additionally, we conducted several ablation experiments to analyze the specific effects of different prompt depths on the features at each level of the model. The results indicate that deep prompting not only enhances the model’s ability to capture high-level semantic information, but also increases its sensitivity to low-level visual features to a certain extent. This multi-level information fusion mechanism enables DE-CLIP to perform better in processing diverse and complex image scenes.
5.3. Encoder Collaborative Optimization Validation
We further investigated the specific effects of fine-tuning the image encoder or text encoder separately on the counting performance of the DE-CLIP model. To achieve this, we designed a series of comparative experiments to elucidate how fine-tuning strategies applied to different encoders impact overall model performance. The experimental results, presented in Figure 5, illustrate the counting accuracy and stability of the model under various fine-tuning configurations.
Firstly, the model’s counting performance significantly decreased when only the text encoder was supplemented with prompts. This suggests that enhancing the text encoder’s ability to interpret image content in isolation is insufficient to improve the overall counting task. Specifically, while the text encoder becomes better at processing and analyzing numerical information from the text prompts, the lack of simultaneous optimization of the visual information causes discrepancies when integrating multimodal data, ultimately reducing counting accuracy.
Similarly, when prompts were added exclusively to the image encoder, the model’s performance also declined. This may be because enhancing visual feature extraction through image side prompts does not concurrently optimize text understanding, leading to reduced efficiency in matching and fusing textual and visual information. Unilateral optimization of the image encoder prevents the model from fully leveraging the numerical information contained in the text prompts, thereby limiting its object counting capabilities in complex scenes.
In contrast, adding a shared prompt to both the text encoder and the image encoder resulted in significant performance improvements, achieving the best results. This outcome underscores the importance of multimodal collaborative optimization for enhancing the model’s counting ability. By incorporating a shared prompt into both encoders, the model can simultaneously receive and integrate numerical information from both textual and visual inputs, leading to more accurate object recognition and counting. Specifically, the shared prompt facilitates synchronous optimization of text and image features, enhances semantic consistency, and promotes information complementarity between the two modalities. This enables DE-CLIP to more effectively capture the number and distribution of objects within an image.
Furthermore, additional analysis revealed that the shared prompt not only improved counting accuracy, but also significantly enhanced the model’s robustness across different scenes and complex backgrounds. The model demonstrated better adaptability to diverse image contents and variations, achieving stable object count estimations through a unified prompt mechanism. This collaborative optimization strategy effectively mitigates the information asymmetry that can arise from single-end optimization, ensuring coordinated and efficient fusion of multimodal information.
In summary, fine-tuning both the image encoder and the text encoder with shared prompts is crucial for maximizing DE-CLIP’s performance in object counting tasks. This approach leverages the strengths of both modalities, resulting in enhanced accuracy, stability, and robustness in handling complex and diverse image scenarios.
5.4. Optimization of Image Slice Quantity
Finally, we conducted an in-depth study on the specific effects of varying the number of image slices (denoted as p) during the training stage of the DE-CLIP model. The number of image slices p refers to the number of regions into which an image is segmented during training. This parameter is crucial for enabling the model to perform fine-grained analysis and feature extraction.
To systematically evaluate the optimization impact of different p values, we designed a series of experiments and summarized the results in Table 5. These experiments aimed to determine how varying the number of image slices influences the model’s ability to accurately and efficiently process and count objects within an image.
Our findings indicate that the number of image slices p significantly affects the model’s performance. An optimal p value facilitates effective segmentation, allowing the model to capture detailed features and perform precise object counting. Conversely, too few slices may result in inadequate feature representation, while too many slices could lead to increased computational complexity without substantial performance gains.
Table 5 presents the experimental results, highlighting the relationship between the number of image slices and the model’s counting accuracy and stability. The data demonstrate that there is a balanced range of p values where DE-CLIP achieves peak performance, effectively managing the trade-off between detailed feature extraction and computational efficiency.
In summary, the number of image slices p is a pivotal parameter in the training of DE-CLIP, directly influencing its ability to perform detailed image analysis and accurate object counting. Properly selecting and optimizing p ensures that the model can efficiently handle fine-grained features while maintaining robust performance across diverse and complex image scenarios.
The experimental results demonstrate that the DE-CLIP model achieves optimal performance in both mean absolute error (MAE) and mean squared error (MSE) metrics when the number of image slices (p) is set to five or six. Specifically, configurations with p = 5 and p = 6 significantly outperform other p value settings on these evaluation metrics, indicating that an appropriate number of image slices effectively enhances the model’s counting accuracy and stability.
When the p value is low, the image is divided into fewer slices, each containing more comprehensive information but less detail. This configuration can reduce the model’s sensitivity to subtle changes in the number of objects within the image, thereby affecting counting accuracy. For instance, at p = 4, although there is an improvement in model performance, both MAE and MSE remain higher compared to p = 5 and p = 6. This suggests that the number of image slices is insufficient to fully capture the distribution and quantity information of objects in complex scenes.
Conversely, when p is set to five or six, the image is segmented into a moderate number of slices, each containing adequate contextual information and fine-grained feature details. This balance enables the model to identify and count objects with greater precision. The experimental results show that DE-CLIP achieves the lowest MAE and MSE values at p = 5 and p = 6, indicating that the model can best integrate and utilize the image data within this range of slice numbers for efficient object counting.
However, as the p value increases further (e.g., p = 7, p = 8, p = 9), the image is divided into more slices, resulting in smaller slice areas with relatively limited contextual information. This can hinder the model’s ability to effectively integrate global information, thereby negatively impacting overall counting performance.
In summary, through a systematic study of the effects of different numbers of image slices (p values) on DE-CLIP model performance, we found that appropriate p value settings (specifically p = 5 and p = 6) significantly improve the model’s counting accuracy and stability. Based on these findings, p = 5 was selected to finalize the experimental results.
5.5. Validation of Multimodal Fusion Effectiveness
We conducted ablation experiments to compare the performance of using image prompts, text prompts, and a combination of both. Specifically, we designed three experimental setups, as follows:
Text-only Prompt: In this experiment, the model relies solely on the text prompt for inference, using only the text input to generate embedding vectors through the text encoder, while ignoring image information.
Image-only Prompt: In this experiment, the model relies solely on the image prompt for inference, using only the image input to extract features through the image encoder, while ignoring text information.
Text–Image Fusion: In this experiment, the model combines both text and image prompts, utilizing both features for inference through a multimodal fusion mechanism.
From the results shown in Table 6, it can be seen that the text–image fusion setup achieved the best performance across all evaluation metrics. Specifically, the MAE and MSE were 267.5 and 464.4, respectively, significantly lower than the results of using only text or image prompts, indicating that the fusion method performs better in reducing prediction errors.
In the text-only prompt experiment, the MAE was 397.6 and the MSE was 755.2, showing that text prompts perform poorly without image information, especially when handling details and high-density areas, where the spatial structure of the image cannot be accurately captured. In contrast, the image-only prompt setup showed some improvement with a MAE of 342.1 and MSE of 641.3. However, due to the lack of contextual information provided by the text, the model still could not make accurate judgments in complex scenarios.
The text–image multimodal prompt demonstrated its advantages in this experiment. By combining the semantic information from the text prompt and the visual features from the image prompt, the model was able to understand the input data more comprehensively, leading to significant performance improvements in the QNRF dataset. This result validates the effectiveness and advantages of our proposed text–image fusion method in multimodal tasks.