
Scale and Background Aware Asymmetric Bilateral Network for Unconstrained Image Crowd Counting

1 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
2 School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
3 School of Advanced Manufacturing Engineering, Hefei University, Hefei 230601, China
4 School of Artificial Intelligence, Anhui University, Hefei 230601, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(7), 1053; https://doi.org/10.3390/math10071053
Submission received: 2 March 2022 / Revised: 20 March 2022 / Accepted: 22 March 2022 / Published: 25 March 2022
(This article belongs to the Special Issue Computer Vision and Pattern Recognition with Applications)

Abstract: This paper attacks two challenging problems of image-based crowd counting, namely scale variation and complex background. To that end, we present a novel crowd counting method, called the Scale and Background aware Asymmetric Bilateral Network (SBAB-Net), which is able to handle scale variation and background noise in a unified framework. Specifically, the proposed SBAB-Net contains three main components: a pre-trained backbone convolutional neural network (CNN) as the feature extractor and two asymmetric branches to generate a density map. These two asymmetric branches have different structures and use features from different semantic layers. One branch is a densely connected stacked dilated convolution (DCSDC) sub-network with different dilation rates, which relies on one deep feature layer and can handle scale variation. The other branch is a parameter-free densely connected stacked pooling (DCSP) sub-network with various pooling kernels and strides, which relies on shallow features and can fuse features with several receptive fields to reduce the impact of background noise. The two sub-networks are fused by an attention mechanism to generate the final density map. Extensive experimental results on three widely-used benchmark datasets have demonstrated the effectiveness and superiority of our proposed method: (1) We achieve competitive counting performance compared to state-of-the-art methods; (2) Compared with the baseline, the MAE and MSE are decreased by at least 6.3% and 11.3%, respectively.

1. Introduction

Crowd counting, a fundamental task in many public safety monitoring systems, aims to estimate the number of people in a still image. Many studies have been devoted to solving this problem and have made some progress [1,2]. Conventional methods solve the crowd counting problem by using hand-crafted features to detect each individual person, or by predicting the count of people via regression or density estimation [3,4,5,6,7]. Due to the insufficient semantic representation of hand-crafted features, these methods usually perform poorly. Recently, benefiting from the powerful feature representation of CNNs, CNN-based methods have become dominant in crowd counting. According to the type of network architecture, CNN-based crowd counting models can be divided into two categories: single-column based methods [8,9] and multi-column based methods [10,11].
However, there are still several challenges preventing the computer vision community from designing a model capable of precise and robust crowd counting, such as occlusion, complex background, scale variation, non-uniform distribution, perspective distortion, rotation, illumination variation and weather changes. In this paper, we make an effort to address two of these challenges: scale variation and background noise. As illustrated in Figure 1, the scales of people vary with their distance from the camera, and background regions may have colors or textures similar to those of the people.
Scale variation is the primary problem that should be addressed in crowd counting models. In the era of deep learning, researchers have made efforts to integrate semantic features of different scales to solve this problem. The main algorithmic frameworks can be divided into three categories: (1) Multi-column. Some works [10,12,13] adopted a “multi-column” architecture, where each branch has a different filter kernel size to handle a specific scale; (2) Nonstandard convolution. Nonstandard convolution operations such as dilated convolution [14,15,16] or deformable convolution [17,18] are exploited to model multi-scale information; (3) Feature pyramid networks (FPN). Several works [19,20,21] assumed that features from different levels capture different scale information and utilized FPN to fuse multi-level features. To eliminate noise caused by cluttered backgrounds, semantic segmentation and visual attention operations [22,23,24,25] are two common ways to suppress the response of background regions; these methods guide the network to focus on person instances via a mask image. Moreover, transformers are a new genre of deep neural networks that stack multiple self-attention layers, and there has been an increasing effort to re-purpose such transformer models for crowd counting [26,27,28,29,30,31]. However, all the methods mentioned above usually require millions of extra to-be-learned parameters (e.g., learnable multi-column, learnable segmentation/attention branch, multi-head transformer structure), which results in a higher computational burden.
Recently, some works [32,33,34] have attempted to model the scale and background problems in a unified framework. The structures of these models are basically the same: a backbone to extract the image feature, and two multi-scale fusion branches to predict a density map and a background mask, respectively. These two branches share the image feature in a dual/symmetric way, that is, they utilize one or multiple common layers to learn scale-aware and background-aware features, respectively. However, the latest research [35] shows that, compared with deep features, shallow features have a better ability to distinguish background from people. In addition, ref. [36] demonstrates the effectiveness of an asymmetric module.
In this paper, we propose a new method for unconstrained image crowd counting, termed the Scale and Background aware Asymmetric Bilateral Network (SBAB-Net), which is able to handle scale variation and background noise in a unified framework. Specifically, the proposed SBAB-Net contains three main components: a pre-trained backbone CNN as the feature extractor and two asymmetric branches to generate a density map. These two asymmetric branches have different structures and use features from different semantic layers. One branch is a densely connected stacked dilated convolution (DCSDC) sub-network with different dilation rates, which relies on one deep feature layer and can handle scale variation. The other branch is a densely connected stacked pooling (DCSP) sub-network with various pooling kernels and strides, which relies on shallow features and can fuse features with several receptive fields to reduce the impact of background noise. Compared with existing semantic segmentation or visual attention based background-aware methods, the proposed DCSP is parameter-free. The two sub-networks are fused by an attention mechanism to generate the final density map. Comprehensive experimental results on three widely-used benchmark datasets demonstrate that the proposed method is effective at the crowd counting task.
We note that Huang et al. [37] also apply a stacked pooling structure to crowd counting. The main differences between our method and [37] are three-fold. First, the feature fusion strategies are different: we fuse features with several receptive fields in a densely connected manner, while [37] uses a channel-wise average operation. Second, inspired by [35], we utilize shallow features to suppress background noise, while [37] uses deep features to model scale information. Third, we handle scale and background information simultaneously, while [37] only considers the scale problem.
Moreover, it should be noted that although several existing methods (e.g., CSRNet [16], DSACA [14], ADNet [15], DADNet [38]) also utilize a dilated convolution structure to model scale variation, the differences between these existing dilation approaches and the proposed approach are three-fold. First, the proposed densely connected stacked dilated convolution (DCSDC) sub-network is a new and effective module to handle scale variation. Compared with the most famous dilation solution, CSRNet [16], our DCSDC fuses features from three different convolution layers with different dilation rates (i.e., 1, 2, 3), while CSRNet only stacks six dilated convolution layers with the same dilation rate (i.e., 2). Although ADNet [15] presents an adaptive dilated network in which the dilation rate can be learned online, only a single dilation rate is estimated for an input feature map. DSACA [14] uses different dilation rates to capture multi-scale information by fusing features from different layers of the backbone; on the contrary, the proposed DCSDC only utilizes the last feature layer of the backbone and captures multi-scale information with a densely connected stacked dilated convolution sub-network with different dilation rates. The multi-dilation convolution branch in DADNet [38] is a parallel structure, while in our DCSDC it is a serial structure; the disadvantage of the parallel structure is that it neglects the small-scale visual context when modeling a large-scale visual context. Extensive experiments on three widely-used benchmarks demonstrate that the proposed method is better than these existing dilation approaches. Second, the scale-aware feature $f_s$ is refined by the background-aware feature $f_b$ in our framework. Finally, the dual branches in our SBAB-Net are asymmetric, while most existing frameworks, for example SFANet [32], are symmetric.
In summary, this paper makes the following contributions.
  • We propose a novel asymmetric bilateral network to handle scale variation and background noise in a unified framework. The scale-aware feature is captured by a multi-convolution branch based on one deep feature layer, and the background-aware feature is captured by a multi-pooling branch based on one shallow feature layer. These two features are fused in an attention manner. On the contrary, most existing methods organize the scale and background branches in a symmetric/dual structure, that is, they share the image feature from one or multiple common layers.
  • A new DCSP sub-network is proposed to capture the background-aware feature, which can fuse features with several receptive fields by several pooling layers with different strides to reduce the impact of background noise, without any extra learnable parameters.
  • We propose a new DCSDC sub-network to extract multi-scale information based on one deep feature layer, which densely connects several stacked dilated convolution layers with different dilation rates to capture the scale-aware feature. Moreover, the scale-aware feature is refined by the background-aware feature to ensure that the final feature can handle scale and background information simultaneously.
  • We conducted extensive experiments on three challenging crowd counting datasets, that is, ShanghaiTech [10], UCF-QNRF [39], NWPU-Crowd [40]. Both quantitative and qualitative results demonstrate the effectiveness of the proposed method.
The remainder of this paper is organized as follows. Section 2 reviews the related work of CNN-based crowd counting methods for addressing scale variation and background noise challenges. Section 3 presents the proposed method in detail. Section 4 conducts comparisons on benchmarks and provides the experimental analysis. Finally, we conclude this paper in Section 5.

2. Related Work

The research on crowd counting can be roughly divided into three different dimensions: detection-based, regression-based and density estimation approaches.
Conventional methods [6,41] count the number of people in an image by using hand-crafted features to detect heads or bodies, which performs poorly for highly crowded images. Thus, direct regression methods have been proposed, which aim to predict the number of people directly from the image feature. These regression methods can be further divided into traditional and deep learning algorithms according to whether low-level [4,42] or deep [9,43,44] features are used. However, direct regression methods ignore spatial context information, which makes their results less robust. Recently, deep learning based density map estimation methods have become dominant in the crowd counting area. Each value in the density map represents the count of persons corresponding to that region of the original image; consequently, the sum over the density map is the number of people appearing in the whole image.
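As a toy illustration of this property (the numbers below are hypothetical and not taken from any dataset), summing a density map directly yields the estimated count:

```python
import numpy as np

# Toy 4x4 density map: each cell holds the expected person count in that region.
density_map = np.array([
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.3, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# The crowd count is simply the sum over all cells.
estimated_count = density_map.sum()   # 3.0 people in this toy example
print(f"Estimated count: {estimated_count:.1f}")
```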
In the following, we briefly review some representative density estimation works related to our framework.

2.1. Handle Scale Variation

Early works [10,11,16] extract scale-aware features via a multi-column or multi-network structure, in which each column or sub-network handles specific scale information via a different convolution type; this is complicated to optimize and wastes computation. Thus, single-column architectures have become more popular. There are three common strategies for modeling multiple scales in a single-column CNN: (1) Fuse features extracted from different layers [45], under the assumption that different layers capture different object scales; (2) Integrate features via a pyramid operation, which may be an image pyramid [46] or a feature pyramid [19,20]; and (3) Learn the scale from the annotation distribution directly [47,48,49]. Different from the above methods, we propose a novel densely connected stacked dilated convolution sub-network with different dilation rates, which can handle multi-scale variation based on only one deep feature layer.

2.2. Handle Background Noise

An attention structure is widely used to handle background noise. Liu et al. [22] present a detection module to guide the network to count the crowd region. Gao et al. [50] introduce spatial-wise and channel-wise attention modules to constrain the network's focus on the head region; a similar idea is reported in [51]. Sindagi et al. [52] propose an inverse attention mechanism which can efficiently infuse segmentation information into the counting network. Shi et al. [53] treat the pixels near annotations as foreground and design a branch that predicts these foreground regions to learn attention factors. Liu et al. [54] propose a recurrent attentive zooming network to detect ambiguous regions recurrently.

2.3. Handle Scale Variation and Background Noise Simultaneously

Recently, some works have attempted to handle scale variation and background noise simultaneously in a unified framework. Zhang et al. [55] propose a cross-layer attentional neural field. Guo et al. [38] and Liu et al. [17] use both attention and deformable convolutional structures in one model. All the attention-based methods mentioned above share a common shortcoming: their attention modules require millions of parameters to be optimized, which makes the model more complex.
In addition, some works attempt to improve the performance of the counting model by using auxiliary tasks; for example, a classification task is used in [21,56,57], a segmentation task is utilized in [58,59,60], and a depth estimation task is adopted in [59,61]. These methods improve performance significantly, which indicates that modeling related tasks simultaneously can improve crowd counting accuracy. Inspired by this, in this paper we treat foreground mask prediction as the auxiliary task and achieve it efficiently by feeding the shallow feature into the proposed parameter-free densely connected stacked pooling sub-network.

3. Methodology

In this section, we introduce the proposed SBAB-Net for crowd counting in detail. Our goal is to estimate the number of persons in a crowd image. To this end, following [62], we formulate the crowd counting task as a density map regression and distribution matching problem. In SBAB-Net, the input is the crowd image and the output is the corresponding density map.

3.1. Framework

As shown in Figure 2, our proposed SBAB-Net is composed of three modules, that is, the backbone, the Densely Connected Stacked Dilated Convolution (DCSDC) module and the Densely Connected Stacked Pooling (DCSP) module. The backbone is a pre-trained CNN that extracts both shallow and deep features of the input image. Let $I \in \mathbb{R}^{H \times W \times 3}$ denote the input crowd image of width $W$ and height $H$. The extracted shallow and deep features are represented as $S_I \in \mathbb{R}^{\frac{H}{d_1} \times \frac{W}{d_1} \times c_1}$ and $D_I \in \mathbb{R}^{\frac{H}{d_2} \times \frac{W}{d_2} \times c_2}$, respectively, where $d_1$ and $d_2$ are output strides and $c_1$ and $c_2$ are the numbers of feature map channels. $D_I$ is fed into the DCSDC module to obtain the scale-aware feature $f_s$, and $S_I$ is fed into the DCSP module to obtain the background-aware feature $f_b$. We ensure that the dimensions of the two features $f_s$ and $f_b$ are equal by assigning proper convolutional kernel numbers in the DCSDC module. Then we use a foreground mask regression head (a lightweight CNN that predicts the probability of pedestrians at each location) to distinguish background from person instances based on the feature $f_b$. To handle scale and background information simultaneously, we present an element-wise multiplication based attention mechanism which refines the feature $f_s$ with the feature $f_b$, formulated as
$$f = f_s \odot f_b.$$
Apparently, according to Equation (1), the features of background regions in $f_s$ will be suppressed. Finally, a density map regression head (a lightweight CNN that predicts the number of pedestrians at each location) is utilized to predict a person density map based on the refined feature $f$. Therefore, the sum of the density map is the estimated number of persons in the input image $I$. The structures of the proposed density map regression head and foreground mask regression head are illustrated in Figure 3; the learnable structures and input-output dimensions of these two sub-networks are the same, while the output semantics are different. In addition, both the density map regression head and the foreground mask regression head map local features to semantics. Therefore, $f_s$ is sensitive to person scales, $f_b$ is sensitive to regions with pedestrians, and the refined feature $f$ is sensitive to scales and background regions simultaneously.
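To make the data flow concrete, the following PyTorch sketch implements the fusion step of Equation (1) and the two regression heads described above. The layer widths and the final ReLU/sigmoid activations are our own placeholder choices, not the exact configuration of Figure 3.

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    """Lightweight CNN head (placeholder widths, not the exact Figure 3 design)."""
    def __init__(self, in_channels, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )

    def forward(self, x):
        return self.net(x)

def fuse_and_predict(f_s, f_b, density_head, mask_head):
    # Equation (1): refine the scale-aware feature with the background-aware one.
    f = f_s * f_b                                     # element-wise multiplication
    density_map = torch.relu(density_head(f))         # non-negative density values (our choice)
    foreground_mask = torch.sigmoid(mask_head(f_b))   # pedestrian probability at each location
    return density_map, foreground_mask
```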
It should be noted that both the foreground mask and the density map are predicted during training, while only the density map needs to be estimated at test time.
Next, we will introduce the DCSDC and DCSP modules in detail.

3.2. Densely Connected Stacked Dilated Convolution (DCSDC)

The DCSDC implements a non-linear mapping function $H(\cdot)$ which takes the deep feature $D_I$ as input and produces the scale-aware feature $f_s$, that is,
$$f_s = H(D_I).$$
As shown in Figure 2, DCSDC consists of three dilation convolution layers with different dilation rates, and these layers are connected in a densely connected way. It is able to handle scale variation based on one deep feature layer.
Specifically, all three convolution layers have $3 \times 3$ kernels, and the dilation rates are 1, 2 and 3, respectively. Meanwhile, since the feature dimensions of each convolution layer in the DCSDC must be consistent, the corresponding padding sizes are also set to 1, 2 and 3, respectively. The non-linear functions of the three dilated convolution layers are denoted as $h_1(\cdot)$, $h_2(\cdot)$ and $h_3(\cdot)$, and the output of the first dilated convolution layer $H_1$ is formulated as
$$H_1 = [D_I, h_1(D_I)],$$
where $[\cdot]$ represents the feature concatenation operation.
The output of the second dilated convolution layer $H_2$ is formulated as
$$H_2 = [D_I, h_2(H_1), H_1].$$
Consequently, the output of the third dilated convolution layer, that is, $f_s$, is formulated as
$$f_s = [D_I, h_3(H_2), H_1, H_2].$$
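A minimal PyTorch sketch of the DCSDC module following Equations (2)-(5). The number of output channels per dilated layer (k) and the ReLU non-linearity are placeholder assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

class DCSDC(nn.Module):
    """Densely Connected Stacked Dilated Convolution (sketch of Eqs. (2)-(5))."""
    def __init__(self, in_channels, k=64):
        super().__init__()
        # Three 3x3 dilated convolutions; padding equals the dilation rate so the
        # spatial size is preserved and features can be concatenated channel-wise.
        self.h1 = nn.Conv2d(in_channels, k, 3, padding=1, dilation=1)
        self.h2 = nn.Conv2d(in_channels + k, k, 3, padding=2, dilation=2)
        self.h3 = nn.Conv2d(2 * (in_channels + k), k, 3, padding=3, dilation=3)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, d_i):
        h1 = torch.cat([d_i, self.relu(self.h1(d_i))], dim=1)          # Eq. (3)
        h2 = torch.cat([d_i, self.relu(self.h2(h1)), h1], dim=1)       # Eq. (4)
        f_s = torch.cat([d_i, self.relu(self.h3(h2)), h1, h2], dim=1)  # Eq. (5)
        return f_s
```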

3.3. Densely Connected Stacked Pooling (DCSP)

The DCSP implements a non-linear mapping function $G(\cdot)$ which takes the shallow feature $S_I$ as input and produces the background-aware feature $f_b$, that is,
$$f_b = G(S_I).$$
As shown in Figure 2, DCSP consists of three max pooling layers with different pooling kernels, and these layers are connected in a densely connected way. It is able to fuse features with several receptive fields to reduce the impact of background noise in a parameter-free manner.
Specifically, the first two pooling kernels in DCSP are $2 \times 2$ max pooling with stride 2, and the third pooling kernel is a $3 \times 3$ max pooling with stride 1. The non-linear functions of the three pooling layers are denoted as $g_1(\cdot)$, $g_2(\cdot)$ and $g_3(\cdot)$, and the output of the first pooling layer $G_1$ is formulated as
$$G_1 = g_1(S_I).$$
The output of the second pooling layer $G_2$ is formulated as
$$G_2 = [g_2(G_1), G_1].$$
Consequently, the output of the third pooling layer, that is, $f_b$, is formulated as
$$f_b = [g_3(G_2), G_1, G_2].$$
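The following PyTorch sketch mirrors Equations (6)-(9). Since a stride-2 pooling layer halves the spatial resolution, the second pooled feature is bilinearly upsampled back to a common size before concatenation; this alignment step and the padding of the 3 × 3 pooling layer are our own assumptions, as the text does not spell them out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCSP(nn.Module):
    """Densely Connected Stacked Pooling (parameter-free sketch of Eqs. (6)-(9))."""
    def __init__(self):
        super().__init__()
        self.g1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.g2 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.g3 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # padding assumed, keeps size

    def forward(self, s_i):
        g1 = self.g1(s_i)                                            # Eq. (7)
        # Upsample the twice-pooled feature back to g1's resolution (our assumption).
        g2_up = F.interpolate(self.g2(g1), size=g1.shape[-2:],
                              mode="bilinear", align_corners=False)
        g2 = torch.cat([g2_up, g1], dim=1)                           # Eq. (8)
        f_b = torch.cat([self.g3(g2), g1, g2], dim=1)                # Eq. (9)
        return f_b
```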

3.4. Loss Function

In this section, we first introduce the ground-truth generation strategy, then detail the loss function.
To obtain the ground-truth density map $D_i^{GT}$ (here $i$ indexes the $i$-th image), we follow the density map generation method of [62], which does not need any Gaussian kernel to preprocess the ground-truth annotations. The generation of the density map $D_i^{GT}$ can be formulated as
$$D_i^{GT}(x) = \delta(x),$$
where $\delta(x)$ denotes the number of annotated persons at position $x$.
Based on the ground-truth density map $D_i^{GT}$, we compute the ground-truth foreground mask $F_i^{GT}$ as follows:
$$F_i^{GT}(x) = \begin{cases} 0, & \delta(x) = 0, \\ 1, & \delta(x) > 0. \end{cases}$$
Equation (11) ensures that a position $x$ containing a person instance is deemed foreground.
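A minimal sketch of the ground-truth generation of Equations (10) and (11): point annotations are turned into a sparse density map and a binary foreground mask (in practice both would additionally be matched to the resolution of the network output).

```python
import numpy as np

def make_ground_truth(points, height, width):
    """Build the ground-truth density map (Eq. 10) and foreground mask (Eq. 11)
    from point annotations; `points` is assumed to hold (row, col) head
    coordinates in pixel units."""
    density = np.zeros((height, width), dtype=np.float32)
    for r, c in points:
        density[int(r), int(c)] += 1.0            # delta(x): annotated persons at x
    mask = (density > 0).astype(np.float32)        # 1 where a person is annotated, else 0
    return density, mask
```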
To increase the generalization performance of the proposed model, inspired by [62], a counting loss, an optimal transport loss and a total variation loss are utilized to measure the difference between the ground-truth density map $D_i^{GT}$ and the predicted density map $D_i^{P}$.
The counting loss ($L_c$) is defined as follows:
$$L_c(D_i^{P}, D_i^{GT}) = \big| \, \|D_i^{P}\|_1 - \|D_i^{GT}\|_1 \, \big|,$$
where $\|\cdot\|_1$ denotes the $L_1$ norm of a vector.
The optimal transport loss ($L_{ot}$) is defined as
$$L_{ot}(D_i^{P}, D_i^{GT}) = \left\langle \alpha^{*}, \frac{D_i^{GT}}{\|D_i^{GT}\|_1} \right\rangle + \left\langle \beta^{*}, \frac{D_i^{P}}{\|D_i^{P}\|_1} \right\rangle,$$
where $\alpha^{*}$ and $\beta^{*}$ are the solutions of the Monge-Kantorovich optimal transport formulation [63].
The total variation loss ($L_{tv}$) is defined as
$$L_{tv}(D_i^{P}, D_i^{GT}) = \frac{1}{2} \, \|D_i^{GT}\|_1 \left\| \frac{D_i^{GT}}{\|D_i^{GT}\|_1} - \frac{D_i^{P}}{\|D_i^{P}\|_1} \right\|_1.$$
To measure the difference between the ground-truth foreground mask $F_i^{GT}$ and the predicted foreground mask $F_i^{P}$, the binary cross-entropy (BCE) segmentation loss ($L_b$) is defined as
$$L_b(F_i^{P}, F_i^{GT}) = -F_i^{GT} \log F_i^{P} + (F_i^{GT} - 1) \log (1 - F_i^{P}).$$
The final loss is the weighted sum of the four loss terms mentioned above,
$$Loss = L_c + \lambda_1 L_{ot} + \lambda_2 L_{tv} + \lambda_3 L_b,$$
where the hyper-parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ control the contributions of the different components.
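The sketch below illustrates how the counting, total variation and mask losses of Equations (12), (14) and (15), and the weighted sum of Equation (16), can be computed. The optimal transport term of Equation (13) is passed in as a precomputed value, since it requires the Sinkhorn-based solver of DM-Count [62] and is omitted here.

```python
import torch
import torch.nn.functional as F

def counting_loss(d_pred, d_gt):
    # Eq. (12): absolute difference between predicted and ground-truth counts.
    return torch.abs(d_pred.sum() - d_gt.sum())

def tv_loss(d_pred, d_gt, eps=1e-8):
    # Eq. (14): total variation between the normalized density maps,
    # weighted by the ground-truth count.
    p = d_pred / (d_pred.sum() + eps)
    g = d_gt / (d_gt.sum() + eps)
    return 0.5 * d_gt.sum() * torch.abs(g - p).sum()

def mask_loss(m_pred, m_gt):
    # Eq. (15): binary cross-entropy on the foreground mask (m_pred in (0, 1)).
    return F.binary_cross_entropy(m_pred, m_gt)

def total_loss(d_pred, d_gt, m_pred, m_gt, ot_loss,
               lam1=0.1, lam2=0.01, lam3=1.0):
    # Eq. (16); `ot_loss` is the optimal transport term of Eq. (13),
    # computed in practice with the solver from DM-Count [62].
    return (counting_loss(d_pred, d_gt) + lam1 * ot_loss
            + lam2 * tv_loss(d_pred, d_gt) + lam3 * mask_loss(m_pred, m_gt))
```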

4. Experiments

In this section, we first introduce the experimental setting and implementation details, and then analyze in detail the effectiveness of each component in our framework through a set of ablation experiments on three public benchmarks. After that, we evaluate our method by comparing it with state-of-the-art methods.

4.1. Experimental Setting

4.1.1. Datasets

The experiments are conducted and evaluated on three common crowd counting benchmarks, including ShanghaiTech [10], UCF-QNRF [39] and NWPU-Crowd [40]; some samples are shown in Figure 4. Each person in a crowd image on these datasets is annotated with one point close to the center of the head.
The ShanghaiTech dataset consists of 1198 annotated crowd images and 330,165 annotated people. It is divided into two independent parts: Part-A and Part-B. Part-A contains 482 images which are collected from the Internet, and it is split into train and test subsets consisting of 300 and 182 images, respectively. Part-B contains 716 images which are collected from the busy streets, and it is split into training and test subsets consisting of 400 and 316 images, respectively.
UCF-QNRF is composed of 1535 images (1201 for training and 334 for testing) with more than 1.25 million label instances.
The NWPU-Crowd dataset is a large-scale congested dataset recently introduced in [40], which consists of 5109 images, with a total of 2,133,375 annotated heads with various illumination scenes and density ranges. The dataset is randomly split into training, validation and test subsets, which respectively contain 3109, 500 and 1500 images.

4.1.2. Evaluation Metrics

Following the existing works [64,65,66,67], the mean absolute error (MAE) and mean squared error (MSE) are used to evaluate the performance of different crowd counting methods.
The MAE is calculated by
$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|,$$
and the MSE is formulated as
$$MSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^{2}},$$
where $N$ is the number of images in the test set, and $C_i$ and $C_i^{GT}$ are the estimated and ground-truth counts of the $i$-th image, respectively.
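For concreteness, a small helper that computes both metrics over a list of per-image counts (following the common crowd counting convention, in which MSE includes the square root):

```python
import numpy as np

def evaluate(pred_counts, gt_counts):
    """MAE (Eq. 17) and MSE (Eq. 18) over a test set of per-image counts."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())  # root of the mean squared error
    return mae, mse
```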

4.2. Implementation Details

We set the hyper-parameters following [62]. All empirical studies are conducted on an Nvidia RTX 3070 GPU in PyTorch. During training, the input size is 256 × 256 on ShanghaiTech Part-A, 384 × 384 on NWPU, and 512 × 512 on ShanghaiTech Part-B and UCF-QNRF; all inputs are randomly cropped from the original images. During testing, the input image keeps its original resolution. The AdamW [68] optimizer is used to train all models, and the learning rate and weight decay are set to 0.00001 and 0.0001, respectively. The output strides $d_1$ and $d_2$ are set to 4 and 8, respectively. The hyper-parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ in the loss function are set to 0.1, 0.01 and 1, respectively. No further augmentation is used during training or inference.
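A minimal sketch of this training configuration; the helper name and the crop-size dictionary are our own and are introduced only for illustration.

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW with the learning rate (1e-5) and weight decay (1e-4) reported above.
    return torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)

# Random crop sizes used during training (per dataset, in pixels).
CROP_SIZE = {"ShanghaiTech Part-A": 256, "NWPU-Crowd": 384,
             "ShanghaiTech Part-B": 512, "UCF-QNRF": 512}
```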

4.3. Ablation Studies

The purpose of this section is to demonstrate the effectiveness of the proposed solution rather than to achieve state-of-the-art performance on a particular dataset by any means necessary. Therefore, we utilize the lightweight MobileNetV2 [69] as our backbone.

4.3.1. Convergence

Figure 5 shows the value of the loss function of our method versus the number of training epochs on the ShanghaiTech Part-A dataset. We can observe that the loss value decreases almost monotonically and converges smoothly during the entire training procedure, which indicates that the proposed framework can be efficiently trained with the stochastic gradient based optimizer AdamW. The model is considered to have converged if the loss value does not decrease for 20 consecutive epochs.
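A sketch of this convergence criterion as an early-stopping loop; `train_one_epoch` is a hypothetical helper that runs one epoch and returns its training loss.

```python
def train_until_converged(model, optimizer, train_one_epoch, patience=20):
    """Stop when the training loss has not decreased for `patience` consecutive
    epochs, mirroring the convergence criterion described above."""
    best_loss, stale_epochs = float("inf"), 0
    while stale_epochs < patience:
        loss = train_one_epoch(model, optimizer)
        if loss < best_loss:
            best_loss, stale_epochs = loss, 0   # improvement: reset the counter
        else:
            stale_epochs += 1                   # no improvement this epoch
    return best_loss
```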

4.3.2. Impact of Different Components

Compared with the baseline [62], the proposed SBAB-Net has two novel components, that is, DCSDC and DCSP. To investigate the impact of these components on the performance of the proposed method, we developed and evaluated two variations of SBAB-Net: one is SBAB-Net w/o DCSDC, in which only the backbone and DCSP are used to predict the density map directly; the other is SBAB-Net w/o DCSP, in which the background region is not modeled explicitly. The optimization procedure of these two variations is similar to that of the proposed SBAB-Net.
Table 1 shows the performance comparison of SBAB-Net and its two variations on the three benchmarks. From the results, we can see that the full SBAB-Net performs best on all datasets, which indicates that both novel components (DCSDC and DCSP) contribute to the final counting accuracy. We can also see that SBAB-Net w/o DCSP outperforms SBAB-Net w/o DCSDC by a large margin, which indicates the importance of the DCSDC for learning a scale-aware representation.
We qualitatively evaluate the effectiveness of the proposed SBAB-Net by visualizing the density maps, as shown in Figure 6. We can see that there are obvious differences between the density maps of the baseline and SBAB-Net in the following two aspects: (1) robustness to noise, as shown in the second and fourth rows; (2) robustness to scale variations, as shown in the first and third rows.
Both the quantitative and qualitative experimental results (Table 1 and Figure 6) demonstrate that refining $f_s$ with $f_b$ via the element-wise multiplication operation can improve the counting performance (see the “SBAB-Net w/o DCSP” and “SBAB-Net” comparison in Table 1) and suppress background regions (see the visualization results in Figure 6).
It should be noted that the major difference between our DCSP and the SP presented in [37] is that features from several different pooling kernels are fused in a densely connected manner in our framework. Figure 7 shows the performance comparison of the proposed DCSP and SP in terms of MAE and MSE on the ShanghaiTech Part-A dataset, from which we can see that DCSP performs better than SP. We believe the reason is that the channel-wise average operation in the original SP loses many details.

4.4. Comparison with State-of-the-Art Methods

To verify the effectiveness of our SBAB-Net, we compare the proposed method with sixteen state-of-the-art methods, namely CSRNet [16], SFCN [70], CAN [71], Bayesian+ [65], S-DCNet [72], SANet + SPANet [73], SDANet [35], ADNet [15], ADSCNet [15], ASNet [74], AMRNet [75], AMSNet [76], DM-Count [62], P2PNet [64], DADNet [38] and SFANet [32]. Following the structure of DM-Count [62], the backbone of the proposed SBAB-Net is a pre-trained VGG-19 [77].
Table 2, Table 3 and Table 4 report the MAE and MSE of the proposed SBAB-Net and the compared methods (performance numbers are cited directly from the original papers) on the ShanghaiTech, UCF-QNRF and NWPU datasets, from which we make the following observations:
  • Although the structures of DCSDC and DCSP are simple, the proposed SBAB-Net achieves competitive counting performance compared to state-of-the-art methods on both small and large datasets. In particular, SBAB-Net significantly outperforms the baseline (DM-Count [62]) while using the same backbone as the feature extractor.
  • The proposed asymmetric bilateral network performs better than the commonly used dual structure (e.g., SFANet [32]), which further supports the observation in [35] that features from different CNN layers have different abilities to model background noise.
  • Although the same three dilation rates are utilized in the proposed framework and CSRNet [16], we achieve a much better performance, which indicates the advantage of our DCSDC structure.
  • Our SBAB-Net is not the best approach in every case: a better multi-task learning framework (e.g., P2PNet [64]), a better CNN structure (e.g., AMSNet [76]) or a better training strategy (e.g., ADSCNet [15]) can obtain higher crowd counting performance.
Table 2. Performance comparison of the proposed method with state-of-the-art methods in terms of MAE and MSE on the ShanghaiTech Part-A dataset.

Methods | MAE | MSE
CSRNet [16] | 68.2 | 115.0
SFCN [70] | 64.8 | 107.5
CAN [71] | 62.3 | 100.0
Bayesian+ [65] | 62.8 | 101.8
S-DCNet [72] | 58.3 | 95.0
SANet + SPANet [73] | 59.4 | 92.5
SDANet [35] | 63.6 | 101.8
ADSCNet [15] | 55.4 | 97.7
ADNet [15] | 61.3 | 103.9
ASNet [74] | 57.78 | 90.13
AMRNet [75] | 61.59 | 98.36
AMSNet [76] | 56.7 | 93.4
DM-Count [62] | 59.7 | 95.7
DADNet [38] | 64.2 | 99.9
SFANet [32] | 63.8 | 105.2
Ours | 57.69 | 93.77
Table 3. Performance comparison of the proposed method with state-of-the-art methods in terms of MAE and MSE on the UCF-QNRF dataset.

Methods | MAE | MSE
CSRNet [16] | 120.3 | 208.5
SFCN [70] | 102 | 171
CAN [71] | 107 | 183
Bayesian+ [65] | 88.7 | 154.8
S-DCNet [72] | 104.4 | 176.1
ASNet [74] | 91.59 | 159.71
AMRNet [75] | 86.6 | 152.2
AMSNet [76] | 101.8 | 163.2
DM-Count [62] | 85.6 | 148.3
SFANet [32] | 100.8 | 174.5
P2PNet [64] | 85.32 | 154.5
ADNet [15] | 90.1 | 147.1
DADNet [38] | 113.2 | 189.4
Ours | 84.81 | 152.17
Table 4. Performance comparison of the proposed method with state-of-the-art methods in terms of MAE and MSE on the NWPU validation set.

Methods | MAE | MSE
CSRNet [16] | 104.8 | 433.4
SFCN [70] | 95.4 | 608.3
CAN [71] | 93.5 | 489.9
Bayesian+ [65] | 93.6 | 470.3
DM-Count [62] | 70.5 | 357.6
Ours | 63.26 | 211.75

5. Conclusions

In this paper, we presented a new framework (SBAB-Net) for unconstrained image crowd counting, which can handle scale variation and background noise in a unified framework. SBAB-Net achieves this goal with an asymmetric bilateral network consisting of two novel branches. One branch is a densely connected stacked dilated convolution (DCSDC) sub-network with different dilation rates, which relies on one deep feature layer and can handle scale variation. The other branch is a parameter-free densely connected stacked pooling (DCSP) sub-network with various pooling kernels and strides, which relies on shallow features and can fuse features with several receptive fields to reduce the impact of background noise. The two sub-networks are fused by an attention mechanism to generate the final density map. Extensive experimental results on three widely-used benchmark datasets have demonstrated the effectiveness and superiority of our proposed method: (1) We achieve competitive counting performance compared to state-of-the-art methods; (2) Compared with the baseline, the MAE and MSE are decreased by at least 6.3% and 11.3%, respectively.

Author Contributions

Conception, F.N.; Formal analysis, G.L.; Methodology, G.L.; Project administration, Y.S.; Validation, Y.X.; Writing—original draft, Y.X.; Writing—review & editing, Z.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the University Synergy Innovation Program of Anhui Province (No. GXXT-2019-048), the National Natural Science Foundation (NSF) of China (No. 61902104), the National Key R&D Program (No. 2020YFC2005603), the Anhui Provincial Natural Science Foundation (No. 2008085QF295), and the University Natural Sciences Research Project of Anhui Province (No. KJ2020A0651).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository. The data presented in this study are openly available at https://github.com/gjy3035/Awesome-Crowd-Counting, accessed on 12 February 2022.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, T.; Chang, H.; Wang, M.; Ni, B.; Hong, R.; Yan, S. Crowded scene analysis: A survey. IEEE Trans. Circuits Syst. Video Technol. 2014, 25, 367–386. [Google Scholar] [CrossRef] [Green Version]
  2. Gao, G.; Gao, J.; Liu, Q.; Wang, Q.; Wang, Y. Cnn-based density estimation and crowd counting: A survey. arXiv 2020, arXiv:2003.12783. [Google Scholar]
  3. Ge, W.; Collins, R.T. Marked point processes for crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 2913–2920. [Google Scholar]
  4. Idrees, H.; Saleemi, I.; Seibert, C.; Shah, M. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 2547–2554. [Google Scholar]
  5. Idrees, H.; Soomro, K.; Shah, M. Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1986–1998. [Google Scholar] [CrossRef] [PubMed]
  6. Li, M.; Zhang, Z.; Huang, K.; Tan, T. Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection. In Proceedings of the International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4. [Google Scholar]
  7. Pham, V.Q.; Kozakaya, T.; Yamaguchi, O.; Okada, R. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3253–3261. [Google Scholar]
  8. Fu, M.; Xu, P.; Li, X.; Liu, Q.; Ye, M.; Zhu, C. Fast crowd density estimation with convolutional neural networks. Eng. Appl. Artif. Intell. 2015, 43, 81–88. [Google Scholar] [CrossRef]
  9. Hu, Y.; Chang, H.; Nian, F.; Wang, Y.; Li, T. Dense crowd counting from still images with convolutional neural networks. J. Vis. Commun. Image Represent. 2016, 38, 530–539. [Google Scholar] [CrossRef]
  10. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 589–597. [Google Scholar]
  11. Babu Sam, D.; Surya, S.; Venkatesh Babu, R. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 5744–5752. [Google Scholar]
  12. Yang, Y.; Li, G.; Du, D.; Huang, Q.; Sebe, N. Embedding Perspective Analysis Into Multi-Column Convolutional Neural Network for Crowd Counting. IEEE Trans. Image Process. 2020, 30, 1395–1407. [Google Scholar] [CrossRef]
  13. Cheng, Z.Q.; Li, J.X.; Dai, Q.; Wu, X.; He, J.Y.; Hauptmann, A.G. Improving the learning of multi-column convolutional neural network for crowd counting. In Proceedings of the ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1897–1906. [Google Scholar]
  14. Xu, W.; Liang, D.; Zheng, Y.; Xie, J.; Ma, Z. Dilated-Scale-Aware Category-Attention ConvNet for Multi-Class Object Counting. IEEE Signal Process. Lett. 2021, 28, 1570–1574. [Google Scholar] [CrossRef]
  15. Bai, S.; He, Z.; Qiao, Y.; Hu, H.; Wu, W.; Yan, J. Adaptive dilated network with self-correction supervision for counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4594–4603. [Google Scholar]
  16. Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1091–1100. [Google Scholar]
  17. Liu, N.; Long, Y.; Zou, C.; Niu, Q.; Pan, L.; Wu, H. Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 3225–3234. [Google Scholar]
  18. Zou, Z.; Su, X.; Qu, X.; Zhou, P. Da-net: Learning the fine-grained density distribution with deformation aggregation network. IEEE Access 2018, 6, 60745–60756. [Google Scholar] [CrossRef]
  19. Amirgholipour, S.; He, X.; Jia, W.; Wang, D.; Liu, L. Pdanet: Pyramid density-aware attention net for accurate crowd counting. arXiv 2020, arXiv:2001.05643. [Google Scholar]
  20. Chen, X.; Bin, Y.; Sang, N.; Gao, C. Scale pyramid network for crowd counting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1941–1950. [Google Scholar]
  21. Sindagi, V.A.; Patel, V.M. Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1861–1870. [Google Scholar]
  22. Liu, J.; Gao, C.; Meng, D.; Hauptmann, A.G. Decidenet: Counting varying density crowds through attention guided detection and density estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5197–5206. [Google Scholar]
  23. Hossain, M.; Hosseinzadeh, M.; Chanda, O.; Wang, Y. Crowd counting using scale-aware attention networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1280–1288. [Google Scholar]
  24. Liu, L.; Wang, H.; Li, G.; Ouyang, W.; Lin, L. Crowd counting using deep recurrent spatial-aware network. arXiv 2018, arXiv:1807.00601. [Google Scholar]
  25. Chen, J.; Su, W.; Wang, Z. Crowd counting with crowd attention convolutional neural network. Neurocomputing 2020, 382, 210–220. [Google Scholar] [CrossRef]
  26. Sun, G.; Liu, Y.; Probst, T.; Paudel, D.P.; Popovic, N.; Van Gool, L. Boosting crowd counting with transformers. arXiv 2021, arXiv:2105.10926. [Google Scholar]
  27. Gao, J.; Gong, M.; Li, X. Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer. arXiv 2021, arXiv:2108.00584. [Google Scholar]
  28. Sajid, U.; Chen, X.; Sajid, H.; Kim, T.; Wang, G. Audio-visual transformer based crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2249–2259. [Google Scholar]
  29. Tian, Y.; Chu, X.; Wang, H. Cctrans: Simplifying and improving crowd counting with transformer. arXiv 2021, arXiv:2109.14483. [Google Scholar]
  30. Liang, D.; Xu, W.; Bai, X. An End-to-End Transformer Model for Crowd Localization. arXiv 2022, arXiv:2202.13065. [Google Scholar]
  31. Wang, F.; Liu, K.; Long, F.; Sang, N.; Xia, X.; Sang, J. Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting. arXiv 2022, arXiv:2203.06388. [Google Scholar]
  32. Zhu, L.; Zhao, Z.; Lu, C.; Lin, Y.; Peng, Y.; Yao, T. Dual path multi-scale fusion networks with attention for crowd counting. arXiv 2019, arXiv:1902.01115. [Google Scholar]
  33. Rong, L.; Li, C. A Strong Baseline for Crowd Counting and Unsupervised People Localization. arXiv 2020, arXiv:2011.03725. [Google Scholar]
  34. Modolo, D.; Shuai, B.; Varior, R.R.; Tighe, J. Understanding the impact of mistakes on background regions in crowd counting. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Breckenridge, CO, USA, 5–7 January 2021; pp. 1650–1659. [Google Scholar]
  35. Miao, Y.; Lin, Z.; Ding, G.; Han, J. Shallow feature based dense attention network for crowd counting. In Proceedings of the AAAI Conference on Artificial Intelligence, Hilton New York Midtown, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11765–11772. [Google Scholar]
  36. Yi, Q.; Liu, Y.; Jiang, A.; Li, J.; Mei, K.; Wang, M. Scale-aware network with regional and semantic attentions for crowd counting under cluttered background. arXiv 2021, arXiv:2101.01479. [Google Scholar]
  37. Huang, S.; Li, X.; Cheng, Z.Q.; Zhang, Z.; Hauptmann, A. Stacked Pooling for Boosting Scale Invariance of Crowd Counting. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 2578–2582. [Google Scholar]
  38. Guo, D.; Li, K.; Zha, Z.J.; Wang, M. Dadnet: Dilated-attention-deformable convnet for crowd counting. In Proceedings of the ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1823–1832. [Google Scholar]
  39. Idrees, H.; Tayyab, M.; Athrey, K.; Zhang, D.; Al-Maadeed, S.; Rajpoot, N.; Shah, M. Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 532–546. [Google Scholar]
  40. Wang, Q.; Gao, J.; Lin, W.; Li, X. NWPU-crowd: A large-scale benchmark for crowd counting and localization. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 2141–2149. [Google Scholar] [CrossRef]
  41. Leibe, B.; Seemann, E.; Schiele, B. Pedestrian detection in crowded scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 878–885. [Google Scholar]
  42. Chan, A.B.; Liang, Z.S.J.; Vasconcelos, N. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–7. [Google Scholar]
  43. Zhang, C.; Li, H.; Wang, X.; Yang, X. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 833–841. [Google Scholar]
  44. Iqbal, M.; Rehman, M.A.; Iqbal, N.; Iqbal, Z. Effect of Laplacian Smoothing Stochastic Gradient Descent with Angular Margin Softmax Loss on Face Recognition. In International Conference on Intelligent Technologies and Applications; Springer: Berlin/Heidelberg, Germany, 2019; pp. 549–561. [Google Scholar]
  45. Zeng, L.; Xu, X.; Cai, B.; Qiu, S.; Zhang, T. Multi-scale convolutional neural networks for crowd counting. In Proceedings of the IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017; pp. 465–469. [Google Scholar]
  46. Kang, D.; Chan, A. Crowd counting by adaptively fusing predictions from an image pyramid. arXiv 2018, arXiv:1805.06115. [Google Scholar]
  47. Xu, C.; Liang, D.; Xu, Y.; Bai, S.; Zhan, W.; Bai, X.; Tomizuka, M. AutoScale: Learning to Scale for Crowd Counting and Localization. arXiv 2019, arXiv:1912.09632. [Google Scholar]
  48. Song, Q.; Wang, C.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Wu, J.; Ma, J. To Choose or to Fuse? Scale Selection for Crowd Counting. In Proceedings of the AAAI Conference on Artificial Intelligence, Arlington, VA, USA, 4–6 November 2021; Volume 35, pp. 2576–2583. [Google Scholar]
  49. Ma, Z.; Wei, X.; Hong, X.; Gong, Y. Learning scales from points: A scale-aware probabilistic model for crowd counting. In Proceedings of the ACM International Conference on Multimedia, Singapore, 16–18 December 2020; pp. 220–228. [Google Scholar]
  50. Gao, J.; Wang, Q.; Yuan, Y. SCAR: Spatial-/channel-wise attention regression networks for crowd counting. Neurocomputing 2019, 363, 1–8. [Google Scholar] [CrossRef] [Green Version]
  51. Sindagi, V.A.; Patel, V.M. Ha-ccn: Hierarchical attention-based crowd counting network. IEEE Trans. Image Process. 2019, 29, 323–335. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  52. Sindagi, V.A.; Patel, V.M. Inverse attention guided deep crowd counting network. In Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, Taipei, Taiwan, 18–21 September 2019; pp. 1–8. [Google Scholar]
  53. Shi, Z.; Mettes, P.; Snoek, C.G. Counting with focus for free. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 4200–4209. [Google Scholar]
  54. Liu, C.; Weng, X.; Mu, Y. Recurrent attentive zooming for joint crowd counting and precise localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 1217–1226. [Google Scholar]
  55. Zhang, A.; Yue, L.; Shen, J.; Zhu, F.; Zhen, X.; Cao, X.; Shao, L. Attentional neural fields for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 5714–5723. [Google Scholar]
  56. Gao, J.; Wang, Q.; Li, X. Pcc net: Perspective crowd counting via spatial convolutional network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3486–3498. [Google Scholar] [CrossRef] [Green Version]
  57. Sindagi, V.A.; Patel, V.M. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
  58. Jiang, S.; Lu, X.; Lei, Y.; Liu, L. Mask-aware networks for crowd counting. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3119–3129. [Google Scholar] [CrossRef] [Green Version]
  59. Zhao, M.; Zhang, J.; Zhang, C.; Zhang, W. Leveraging heterogeneous auxiliary tasks to assist crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 12736–12745. [Google Scholar]
  60. Huang, S.; Li, X.; Zhang, Z.; Wu, F.; Gao, S.; Ji, R.; Han, J. Body structure aware deep crowd counting. IEEE Trans. Image Process. 2017, 27, 1049–1059. [Google Scholar] [CrossRef]
  61. Lian, D.; Li, J.; Zheng, J.; Luo, W.; Gao, S. Density map regression guided detection network for rgb-d crowd counting and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 1821–1830. [Google Scholar]
  62. Wang, B.; Liu, H.; Samaras, D.; Nguyen, M.H. Distribution Matching for Crowd Counting. arXiv 2020, arXiv:2009.13077. [Google Scholar]
  63. Villani, C. Optimal transport, old and new. Notes for the 2005 Saint-Flour summer school. In Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences]; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  64. Song, Q.; Wang, C.; Jiang, Z.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Wu, Y. Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3365–3374. [Google Scholar]
  65. Ma, Z.; Wei, X.; Hong, X.; Gong, Y. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6142–6151. [Google Scholar]
  66. Ranjan, V.; Le, H.; Hoai, M. Iterative crowd counting. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 270–285. [Google Scholar]
  67. Ranjan, V.; Sharma, U.; Nguyen, T.; Hoai, M. Learning To Count Everything. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3394–3403. [Google Scholar]
  68. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  69. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  70. Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 8198–8207. [Google Scholar]
  71. Liu, W.; Salzmann, M.; Fua, P. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 5099–5108. [Google Scholar]
  72. Xiong, H.; Lu, H.; Liu, C.; Liu, L.; Cao, Z.; Shen, C. From open set to closed set: Counting objects by spatial divide-and-conquer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 8362–8371. [Google Scholar]
  73. Cheng, Z.Q.; Li, J.X.; Dai, Q.; Wu, X.; Hauptmann, A.G. Learning spatial awareness to improve crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6152–6161. [Google Scholar]
  74. Jiang, X.; Zhang, L.; Xu, M.; Zhang, T.; Lv, P.; Zhou, B.; Yang, X.; Pang, Y. Attention scaling for crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4706–4715. [Google Scholar]
  75. Liu, X.; Yang, J.; Ding, W.; Wang, T.; Wang, Z.; Xiong, J. Adaptive mixture regression network with local counting map for crowd counting. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 241–257. [Google Scholar]
  76. Hu, Y.; Jiang, X.; Liu, X.; Zhang, B.; Han, J.; Cao, X.; Doermann, D. Nas-count: Counting-by-density with neural architecture search. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 747–766. [Google Scholar]
  77. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
Figure 1. Illustrations of the challenges of varying scales of people and cluttered backgrounds.
Figure 2. Overview of the proposed SBAB-Net. Backbone: a pre-trained CNN to extract both shallow and deep features of the input image. Densely Connected Stacked Dilated Convolution (DCSDC): a densely connected sub-network with different dilation rates, which relies on one deep feature layer and can get scale aware feature. Densely Connected Stacked Pooling (DCSP): a densely connected sub-network with different pooling kernels, which relies on one shallow feature layer and can get background aware feature. Finally, scale and background features are fused by an attention mechanism to generate density map.
Figure 3. The structure of the proposed density map regression head and foreground mask regression head.
Figure 4. Some samples from representative crowd counting datasets.
Figure 5. The value of the loss function of SBAB-Net versus the different number of training epochs on the ShanghaiTech Part-A dataset. (The backbone is MobileNetV2.)
Figure 6. Density maps of NWPU dataset generated by baseline and our SBAB-Net. (The backbone is MobileNetV2).
Figure 7. Ablation study. Performance comparison of the proposed DCSP and SP [37] in terms of MAE and MSE on ShanghaiTech Part-A dataset. (The backbone is MobileNetV2.)
Table 1. Ablation studies. Performance comparison of the baseline, the two ablated variations and the full SBAB-Net in terms of MAE and MSE on the benchmarks. The best result on each dataset is shown in boldface. (The backbone is MobileNetV2).

Components | ShanghaiTech Part-A MAE | ShanghaiTech Part-A MSE | ShanghaiTech Part-B MAE | ShanghaiTech Part-B MSE | UCF-QNRF MAE | UCF-QNRF MSE | NWPU MAE | NWPU MSE
Baseline | 68.92 | 108.08 | 8.57 | 15.38 | 210.56 | 393.69 | 85.29 | 401.35
SBAB-Net w/o DCSDC | 65.06 | 105.29 | 8.34 | 14.88 | 172.11 | 255.84 | - | -
SBAB-Net w/o DCSP | 62.50 | 100.31 | 8.17 | 13.72 | - | - | 82.87 | 389.29
SBAB-Net | 60.57 | 95.89 | 8.03 | 12.24 | 108.58 | 185.38 | 69.91 | 237.14
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
