Article

A Pedestrian Detection Network Based on an Attention Mechanism and Pose Information

Zhaoyin Jiang, Shucheng Huang and Mingxing Li
1 School of Information Engineering, Yangzhou Polytechnic College, Yangzhou 225009, China
2 School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212003, China
3 Jingjiang College, Jiangsu University, Zhenjiang 212013, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8214; https://doi.org/10.3390/app14188214
Submission received: 17 June 2024 / Revised: 27 August 2024 / Accepted: 28 August 2024 / Published: 12 September 2024

Abstract

Pedestrian detection has recently attracted widespread attention as a challenging problem in computer vision. Its accuracy is affected by differences in pose, background clutter, local occlusion, scale differences, pixel blur, and other factors occurring in real scenes, which lead to false and missed detections. To compensate for these deficiencies in the visual description, we leveraged pedestrian pose information as a supplementary resource for addressing the occlusion challenges that arise in pedestrian detection. Because the acquisition of pose information is limited by the pose estimation algorithm, an attention mechanism was integrated into the visual information as a further supplement. We developed a pedestrian detection method that integrates an attention mechanism with visual and pose information and comprises pedestrian region proposal and pedestrian recognition networks, effectively addressing occlusion and false detection issues. The pedestrian region proposal network generates a series of candidate regions that may contain pedestrian targets from the original image. The pedestrian recognition network then judges whether each candidate region contains a pedestrian target. The pedestrian recognition network is composed of four parts: the visual feature, pedestrian pose, pedestrian attention, and classification modules. The visual feature module extracts the visual feature descriptions of candidate regions. The pedestrian pose module extracts pose feature descriptions. The pedestrian attention module extracts attention information, and the classification module fuses the visual features and pedestrian pose descriptions with the attention mechanism. The experimental results on the Caltech and CityPersons datasets demonstrate that the proposed method identifies pedestrians substantially more accurately than current state-of-the-art methods.

1. Introduction

Pedestrian detection, as a special branch of object detection, has long been a focus of academia and industry. Considerable progress has been achieved in pedestrian detection through the successful application of deep convolutional neural networks [1,2,3,4,5,6,7,8,9]. Anchor-based and anchor-free detection are the two major CNN-based pedestrian detection approaches. Typical anchor-based methods include faster R-CNN [10] and its derivatives [11,12]. These methods generate candidate proposals in advance and then employ a classifier to judge whether pedestrians exist in each candidate proposal. Anchor-based models are time-consuming, and the majority of candidate proposals are uninformative. To solve these problems, researchers developed an anchor-free detector that directly predicts pedestrians in images [13]. With this detector, redundant steps such as the definition of anchor points and the feature extraction of candidate regions are skipped, and the pedestrian target is predicted directly from the original image. Although existing research has made some progress in pedestrian detection, the developed methods typically rely solely on visual features for localizing human targets. In scenarios with severe occlusion and cluttered backgrounds, the visual cues of the targets to be detected are easily confused with the background because robust visual features are lacking for low-quality targets, resulting in a substantial drop in detection accuracy. This is the most noteworthy issue in pedestrian detection.
Occlusion is a common and core challenge in various computer vision tasks such as pedestrian and face detection. In occluded scenes, the disruption of a pedestrian’s body structure information causes missed detections. Moreover, occluded pedestrian features often exhibit diverse characteristics due to differences in the occlusion degree or pattern. Researchers have developed several methods to address the occlusion issue in pedestrian detection by enhancing pedestrian visual descriptions or employing a series of partial detectors. Inspired by partial detectors, Tian et al. [14] proposed DeepParts, which is based on R-CNN and consists of a set of deep partial detectors. Zhou et al. [15] used a multilabel learning approach with partial detectors to exploit the correlations among them and improve detection performance while reducing the computation costs of the partial detectors. Both of these methods use simple visual description models to detect pedestrians. However, visual descriptions cannot be used alone to distinguish pedestrians from the background in situations where visual clues are indistinguishable, for instance, when the visual features of pedestrians resemble the background and are obscured by other objects. Therefore, generating robust descriptions to filter occluded pedestrians becomes the central problem in occluded pedestrian detection.
To address the above problems, we introduced pedestrian pose information and an attention mechanism to enhance the robustness of pedestrian detection. Pose information is composed of several key points on the human skeleton and their connections, which help to represent a pedestrian’s structural information. With partial occlusion or background clutter, detection models struggle to capture the whole torso of a pedestrian target, so an occluded candidate area is classified as a negative sample when only visual characteristics are used. With pose information, however, the model can take the key points on the visible area of the pedestrian body and other clues as the basis for judging whether the area contains a pedestrian target. This overcomes the problems caused when a pedestrian target is partially occluded. Thus, we developed a pedestrian detection framework based on the integration of visual and pose information with an attention mechanism. As illustrated in Figure 1, the model first uses the pedestrian region proposal network to generate a series of candidate regions that may contain pedestrian targets from the original image, and each region has a corresponding confidence score. Second, the pedestrian recognition network is used to judge whether each candidate region contains pedestrian targets. The pedestrian recognition network contains the visual feature, pedestrian pose, pedestrian attention, and classification modules. The visual feature module is responsible for extracting the visual feature descriptions of the candidate regions. The pedestrian pose module is used to extract the pose feature descriptions based on an improved OpenPose [16] framework. The pedestrian attention module is used to extract attention information, and the classification module is responsible for fusing the visual features and pedestrian pose descriptions with the attention mechanism.
The main contributions of this study can be summarized as follows:
(1)
We developed a pedestrian detection method that integrates an attention mechanism with visual and pose information and includes pedestrian region proposal and pedestrian recognition networks, effectively addressing occlusion and false detection issues.
(2)
We found that pose information and an attention mechanism, as supplements to the visual description, are key in pedestrian detection.
(3)
In this study, experiments on the public datasets Caltech [17] and CityPersons [12] were conducted to validate the effectiveness of the proposed network, which fuses an attention mechanism with visual and pose information.

2. The Related Literature

In this section, we introduce some algorithms used for object and pedestrian detection, and then discuss the issues of occlusion and false positives in pedestrian detection.
A.
Object Detection
Object detection methods can be classified into two types. The first type includes single-stage algorithms, such as the YOLO series [18,19,20,21] and SSD [22], which eliminate the candidate proposal generation stage and directly predict the bounding boxes and category probabilities of the targets in a whole image, thus increasing detection speed while retaining accuracy. The second type includes two-stage algorithms, such as faster R-CNN [10], R-FCN [23], and PV-RCNN [24]. These methods generate a large number of target proposal regions in advance and then optimize them through classification and regression networks. However, two-stage algorithms require the repeated extraction of convolutional features, resulting in slower model execution, which fails to meet the needs of applications with real-time requirements. Anchor-free detectors, such as CornerNet [25] and CenterNet [26], have also been constructed. Song et al. developed a pedestrian detection network based on vertical lines that uses the vertical characteristics of upright pedestrians [13]. With these approaches, additional prior box parameters do not need to be set, as feature maps can be directly used for classification prediction and location regression.
B.
Pedestrian Detection
Computer vision based on deep learning technology is rapidly advancing; many variants of faster R-CNN [10], such as SA-Fast RCNN [27] and MS-CNN [28], have been developed to improve detection performance by directly addressing the problem of target scale. Although two-stage detectors are widely used, they mostly use only visual information to locate pedestrian objects in images. Cascade R-CNN [8] is a multistep detection model that continuously increases its detection accuracy by gradually raising the IoU threshold; however, only the visual information of the target is used. Single-stage detectors are similar in this regard. For example, ALFNet [29] adopts a progressive localization fitting strategy to continuously optimize the default anchor boxes. As an anchor-free detector, CSP [30] locates pedestrian targets by directly predicting the center point and the width and height of the bounding box. Despite these achievements, existing methods still face noteworthy challenges in pedestrian detection, primarily stemming from occlusion and false detections. Reference [31] proposed a progressive training process that improves a model’s generalization ability across different datasets; for example, by training on the WiderPedestrian + CrowdHuman datasets and testing on CityPersons, a score of 12.8 was achieved on the Reasonable subset of the CityPersons dataset. F2DNet [32] eliminates the redundancy of current two-stage detectors by replacing the region proposal network with a focal detection network and the bounding box head with a fast suppression head; it demonstrated good performance when trained on an NVIDIA RTX A6000 GPU cluster (NVIDIA, Santa Clara, CA, USA). Reference [33] used synthetic data generation methods, including scene generation, 3D object usage, and sensor simulation, to improve pedestrian detection accuracy. Reference [34] primarily addressed the occlusion problem by using mix-up augmentation to enhance the performance of their proposed architecture. Reference [35] proposed a novel NMS loss function that allows the NMS process to be trained end-to-end without adding any extra network parameters; specifically, it introduced a pull loss to bring predictions for the same object closer together and a push loss to push predictions for different objects further apart. Reference [36] proposed a novel method for context-aware pedestrian detection through visual–language semantic self-supervision (VLPD), which explicitly models semantic context without requiring any additional annotations; it introduced a self-supervised visual–language semantic (VLS) segmentation method that learns fully supervised pedestrian detection and context segmentation using semantic class labels explicitly generated by a visual–language model.
Compared with the above methods, this study does not use any data augmentation techniques or additional semantic information labels. Instead, it only utilizes an NVIDIA RTX 4090 GPU and innovatively introduces pose information to describe partially occluded human targets. Additionally, an attention mechanism is introduced to extract visual attention information, achieving good pedestrian detection performance on the Caltech and CityPersons datasets.
C.
Occlusion and False Positive Error
Researchers have recently focused on overcoming the challenges in pedestrian target detection. Xie et al. proposed MGAN [37] to address the occlusion problem. Enzweiler et al. [38] used strength, depth, and motion characteristics to construct a component-based classification model. Noh et al. [39] proposed an occlusion-processing method that combines single-stage detection models to reduce false positive errors. An occlusion-sensitive detection score is obtained by updating the output tensor to include the component confidence score. Zhang et al. proposed an occlusion-aware R-CNN framework [40] by introducing aggregation loss to address the occlusion problem. Wang et al. proposed repulsion loss [41], which includes three parts: the loss value between the predicted box and the target ground-truth box; the loss value between the predicted box and the adjacent target ground-truth box; and the loss between the predicted and adjacent predicted boxes that are not predicting the same real target.
We introduce additional pose information to describe a partially occluded pedestrian object, which produces the following benefits:
(1)
Pedestrian pose information has a strong robustness to occlusion and the changes in appearance caused by variations in clothing and lighting. Additionally, pose estimation algorithms [42,43,44,45,46,47,48] have considerably progressed in recent years.
(2)
The attention mechanism enables the model to focus on the most crucial regions in an image to improve the representation of target pedestrian features and its detection performance in occlusion scenarios. The attention mechanism can also reduce irrelevant information interference and increase the robustness and detection accuracy of pedestrian targets.

3. Methodology

The two main challenges encountered in pedestrian detection are as follows: first, the many instances of local occlusion in images lead to missed detections; second, images may contain numerous indistinguishable negative samples, such as similar vertical structures, resulting in false detections. We developed a new pedestrian detection method that integrates an attention mechanism with visual and pose information.
As shown in Figure 2, the pedestrian region proposal network was designed based on existing detectors (such as ALFNet [29] and FCOS [49]), with the aim of generating a large number of candidate bounding boxes. The pedestrian recognition network is used to refine the confidence scores of proposal regions and remove challenging negative samples. For a given proposal region, the visual feature and pedestrian pose modules extract feature descriptions and pose feature descriptions, respectively. The pedestrian attention module extracts attention information as a complement to pose descriptions. The classification module is responsible for fusing these three feature descriptions and generating an embedded representation that can distinguish a pedestrian from the background.
The pedestrian region proposal network is optimized using a multitask loss function:
L_{rpn} = \sum_i L_{cls}(p_i, p_i^*) + \mu \sum_i p_i^* L_{reg}(t_i, t_i^*),
where $L_{cls}$ is the binary cross-entropy loss used to judge whether the candidate proposal region contains a pedestrian object and $L_{reg}$ is the regression loss for pedestrian object localization. The coefficient $\mu$ balances the classification loss $L_{cls}$ and the regression loss $L_{reg}$. We define $p_i$ as the predicted probability and $p_i^*$ as the ground-truth label. $p_i^*$ is set to 1 only when the IoU between candidate box $i$ and any ground-truth box is larger than 0.5; otherwise, $p_i^*$ is set to 0 as the background. The classification loss function $L_{cls}$ is defined as:
L_{cls}(p_i, p_i^*) = -(1 - p_i^*)\log(1 - p_i) - p_i^*\log(p_i),
In this study, we adopt the same smooth L1 loss proposed in fast R-CNN [22] as the regression loss function to learn the transformation from each candidate box to its nearest ground-truth box:
L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*),
\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}
where $t_i = [t_x, t_y, t_w, t_h]$ denotes the four parameterized coordinates of the predicted box and $t_i^* = [t_x^*, t_y^*, t_w^*, t_h^*]$ denotes the corresponding parameterized coordinates of the ground-truth box. The relationship to the anchor box is expressed as follows:
t_x = \frac{x_p - x_a}{w_a}, \quad t_y = \frac{y_p - y_a}{h_a}, \quad t_w = \log\frac{w_p}{w_a}, \quad t_h = \log\frac{h_p}{h_a},
t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a},
where $x$ and $y$ denote the center coordinates of a box and $w$ and $h$ its width and height; the subscripts $p$ and $a$ and the superscript $*$ indicate the predicted box, the anchor box, and the ground-truth box, respectively.
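For readers who want to connect these formulas to an implementation, the following is a minimal PyTorch sketch of the anchor encoding and the multitask loss above; the function names encode_boxes and rpn_loss are ours, and the per-anchor sampling and normalization details of the actual detector are omitted.

import torch
import torch.nn.functional as F

def encode_boxes(anchors, gt_boxes):
    """Encode ground-truth boxes relative to anchors as (tx, ty, tw, th).
    Boxes are given as (x_center, y_center, w, h); both tensors have shape (N, 4)."""
    tx = (gt_boxes[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (gt_boxes[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = torch.log(gt_boxes[:, 2] / anchors[:, 2])
    th = torch.log(gt_boxes[:, 3] / anchors[:, 3])
    return torch.stack([tx, ty, tw, th], dim=1)

def rpn_loss(cls_logits, box_preds, labels, reg_targets, mu=1.0):
    """Multitask RPN loss: binary cross-entropy for pedestrian vs. background plus
    smooth-L1 regression computed only on positive anchors (labels == 1)."""
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, labels.float())
    pos = labels == 1
    if pos.any():
        reg_loss = F.smooth_l1_loss(box_preds[pos], reg_targets[pos])
    else:
        reg_loss = box_preds.sum() * 0.0  # no positive anchors in this batch
    return cls_loss + mu * reg_loss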
If all target candidate regions are obtained, duplication and redundancy are eliminated using the non-maximum suppression (NMS) algorithm. Existing detectors (such as ALFNet [29], FCOS [49] (ResNet50), and FCOS [49] (VoVNet39)) were selected as region-generating networks to roughly extract the candidate regions of pedestrian targets. As these detectors generate many false detection and low-confidence areas, the pedestrian recognition network developed to solve this problem is described in detail below.
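Duplicate removal itself can be performed with an off-the-shelf NMS routine; the snippet below uses torchvision's nms as an illustration, with an IoU threshold of 0.5 chosen for the example rather than taken from the paper.

import torch
from torchvision.ops import nms

# boxes: (N, 4) in (x1, y1, x2, y2) format, scores: (N,) confidence values
boxes = torch.tensor([[10., 10., 110., 210.], [12., 8., 108., 205.], [300., 40., 360., 200.]])
scores = torch.tensor([0.92, 0.85, 0.60])
keep = nms(boxes, scores, iou_threshold=0.5)   # indices of the boxes to keep
candidates = boxes[keep]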
A.
Pedestrian Recognition Network
After generating candidate regions that may contain pedestrian objects, the proposed pedestrian recognition network optimizes the confidence scores of these candidate regions and reduces the number of false detections. Because most false detections are caused by local occlusion or background clutter, and visual descriptions are not robust to low-quality images affected by these factors, we added an attention mechanism to the visual information to reduce the influence of local occlusion and background. The convolutional block attention module (CBAM) [50] is an attention mechanism for convolutional neural networks (CNNs) designed to enhance a model’s ability to focus on specific regions of the input. The CBAM improves feature representation by applying attention along both the channel and spatial dimensions, thereby enhancing model performance. In the CBAM, channel attention can be applied before spatial attention (CBAM) or after it (reverse CBAM, RCBAM); in either ordering, the weights of the second attention module are generated from feature maps that have already been re-weighted by the first. In other words, the preceding attention modifies the original input feature map, so the subsequent attention module learns from the altered feature map, which affects the features it learns to a certain degree. Through experiments, we validated that this interference caused by the sequential connection can make the effectiveness of the attention modules unstable in pedestrian detection classification tasks. Therefore, in the proposed pedestrian attention module, the original feature map generated by the visual feature module is fed directly into the spatial attention mechanism.
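As a concrete illustration, the following is a minimal PyTorch sketch of this design, assuming a CBAM-style channel branch and spatial branch that both read the original feature map; the class name ParallelCBAM and the layer sizes are our own illustrative choices, not values reported in the paper.

import torch
import torch.nn as nn

class ParallelCBAM(nn.Module):
    """Sketch of the described attention variant: both channel and spatial attention
    weights are computed from the ORIGINAL feature map, so neither branch learns
    from a map already re-weighted by the other (our reading of the paper)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))                 # (B, C) average-pooled descriptor
        mx = x.amax(dim=(2, 3))                  # (B, C) max-pooled descriptor
        ch_att = torch.sigmoid(self.channel_mlp(avg) + self.channel_mlp(mx)).view(b, c, 1, 1)
        sp_in = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        sp_att = torch.sigmoid(self.spatial_conv(sp_in))   # (B, 1, H, W)
        return x * ch_att * sp_att               # apply both weights to the original map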
In addition to visual features, pose information provides crucial clues for judging pedestrian targets. Unlike visual descriptions, pedestrian pose information is composed of several key points on the human skeleton and their connections, which help to represent a pedestrian’s structural information. The detection model uses the key points on the head and shoulders as the basis for judging whether the area is the targeted pedestrian body to overcome the problems that arise when local areas are blocked.
Figure 3 shows that the visual feature module comprises the first twenty-two layers of ResNet50 [51] and several convolutional blocks. For each target candidate region, the image is first resized to 256 × 256 and then input into the visual feature module to obtain a 128-dimensional visual feature description $f_v$. For each target candidate proposal resized to 256 × 256, the pedestrian pose module is used to obtain the corresponding pedestrian pose description. The pedestrian pose module is constructed using an improved version of the pose estimation model OpenPose [16]. First, the first twenty-two layers of ResNet50 [51] are used to extract a visual feature map $F$. Second, two subnetworks are employed to predict a confidence map $M$ and an affinity field $A$ separately. The confidence map represents the key points on the human skeleton, and the affinity field represents the connections between the key points. Each branch adopts an iterative prediction structure, optimizing the prediction results over successive stages. This iterative approach allows for the continuous refinement and improvement of the final pose estimation. The pose module first generates a confidence map $M^1 = \rho^1(F)$ and an affinity field $A^1 = \varphi^1(F)$, where $\rho^1$ and $\varphi^1$ are composed of three 3 × 3 and two 1 × 1 convolution layers. In each subsequent stage, the current prediction is generated by combining the predictions of the two branches from the previous stage with the original image feature $F$:
M^t = \rho^t(F, M^{t-1}, A^{t-1}), \quad A^t = \varphi^t(F, M^{t-1}, A^{t-1}), \quad \forall t \ge 2,
where $\rho^t$ and $\varphi^t$ are each composed of two 7 × 7, two 5 × 5, and two 1 × 1 convolution layers. In the last stage, the confidence map $M^6$ is concatenated with the affinity field $A^6$ to obtain the pedestrian pose information.
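The two-branch, multi-stage structure described above can be sketched as follows; this is our PyTorch approximation based on the layer counts stated in the text, and the channel numbers (19 keypoint maps, 38 affinity-field channels, 128 intermediate channels) follow the common OpenPose configuration rather than values reported in the paper.

import torch
import torch.nn as nn

def conv_block(kernel_sizes, in_ch, out_ch):
    """Build a small stack of convolutions; kernel_sizes lists the kernel size of each layer."""
    mods, ch = [], in_ch
    for k in kernel_sizes[:-1]:
        mods += [nn.Conv2d(ch, 128, k, padding=k // 2), nn.ReLU()]
        ch = 128
    mods += [nn.Conv2d(ch, out_ch, kernel_sizes[-1], padding=kernel_sizes[-1] // 2)]
    return nn.Sequential(*mods)

class PoseBranches(nn.Module):
    """Iterative two-branch refinement in the spirit of OpenPose: stage 1 uses
    three 3x3 and two 1x1 convolutions; later stages use two 7x7, two 5x5, and
    two 1x1 convolutions and also see the previous M and A predictions."""
    def __init__(self, feat_ch=256, n_maps=19, n_paf=38, stages=6):
        super().__init__()
        self.rho1 = conv_block([3, 3, 3, 1, 1], feat_ch, n_maps)
        self.phi1 = conv_block([3, 3, 3, 1, 1], feat_ch, n_paf)
        in_ch = feat_ch + n_maps + n_paf
        self.rho_t = nn.ModuleList([conv_block([7, 7, 5, 5, 1, 1], in_ch, n_maps) for _ in range(stages - 1)])
        self.phi_t = nn.ModuleList([conv_block([7, 7, 5, 5, 1, 1], in_ch, n_paf) for _ in range(stages - 1)])

    def forward(self, F):                         # F: backbone feature map (B, feat_ch, H, W)
        M, A = self.rho1(F), self.phi1(F)
        for rho, phi in zip(self.rho_t, self.phi_t):
            x = torch.cat([F, M, A], dim=1)       # condition each stage on the previous predictions
            M, A = rho(x), phi(x)
        return M, A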
The pedestrian pose module is initialized with the improved OpenPose [16] model, which was pretrained on a pose dataset. Because the pedestrian pose module extracts useful pose information, its parameters remain unchanged during the whole training process. Since pedestrian pose information can be used to judge whether a candidate region contains a pedestrian target or background, the network maps the pedestrian pose information into a 128-dimensional pose feature description $f_p$ with a fully connected layer. We then use a fully connected classifier and a cross-entropy loss $L_p$ to optimize the feature description $f_p$. The obtained visual feature description $f_v$ and pose feature description $f_p$ are first concatenated to generate a 256-dimensional feature description. Second, a fully connected layer is used to reduce the dimensions of the fused features. Finally, the joint loss function $L_{rpn}$ is minimized to optimize the pedestrian recognition network:
L_{rpn} = L_c + \lambda L_v + \mu L_p,
where $L_c$, $L_v$, and $L_p$ represent the binary cross-entropy losses of the classification, visual feature, and pedestrian pose modules, respectively. $\lambda$ and $\mu$ are two weight parameters; referring to previous findings [52], we chose $\lambda = 0.5$ and $\mu = 0.5$.
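A compact sketch of the classification module and the joint loss is shown below, under the assumption that the attention-weighted visual descriptor is already folded into $f_v$; the module and head names are placeholders of ours, not names used in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Concatenate the 128-d visual and 128-d pose descriptors, reduce the fused
    256-d vector with a fully connected layer, and combine the three binary
    cross-entropy losses with lambda = mu = 0.5 as stated in the text."""
    def __init__(self, dim=128):
        super().__init__()
        self.reduce = nn.Linear(2 * dim, dim)
        self.cls_head = nn.Linear(dim, 1)    # pedestrian vs. background on the fused feature
        self.vis_head = nn.Linear(dim, 1)    # auxiliary head providing L_v
        self.pose_head = nn.Linear(dim, 1)   # auxiliary head providing L_p

    def forward(self, f_v, f_p, labels, lam=0.5, mu=0.5):
        fused = F.relu(self.reduce(torch.cat([f_v, f_p], dim=1)))
        y = labels.float()
        L_c = F.binary_cross_entropy_with_logits(self.cls_head(fused).squeeze(1), y)
        L_v = F.binary_cross_entropy_with_logits(self.vis_head(f_v).squeeze(1), y)
        L_p = F.binary_cross_entropy_with_logits(self.pose_head(f_p).squeeze(1), y)
        return L_c + lam * L_v + mu * L_p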
During model testing, for any target candidate region, we perform a weighted fusion of the confidence score $score_r$ output by the region proposal network and the confidence score $score_p$ output by the pedestrian recognition network. The combined score is taken as the final confidence score of the candidate region:
score = \alpha \cdot score_r + \beta \cdot score_p,
where $\alpha$ and $\beta$ are two weight parameters. The final prediction is more accurate with the use of fused scores. Based on the results of the parameter analysis in our experiments, we set $\alpha$ to 0.8 and $\beta$ to 0.2. When both networks produce low predictive scores, the candidate region is considered to be background.
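The score fusion reduces to a weighted sum; a small sketch follows, in which the background threshold is purely illustrative because the paper only states that regions scored low by both networks are treated as background.

def fuse_scores(score_r, score_p, alpha=0.8, beta=0.2, background_thr=0.1):
    """Weighted fusion of the region-proposal and recognition confidences."""
    score = alpha * score_r + beta * score_p
    is_background = (score_r < background_thr) and (score_p < background_thr)
    return score, is_background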
To more clearly describe our method, the algorithm is described in Algorithm 1.
Algorithm of the proposed pedestrian recognition method.
Algorithm 1 Pedestrian Detection Based on Pose Information and Attention Mechanism
Input: Image F, loss weight parameters λ and μ
Output: Pedestrian detection results {Pi (i = 0,1,2…)}
1. Initialize pedestrian region proposal network (RPN) using existing detectors (e.g., ALFNet, FCOS)
2. Generate candidate bounding boxes using RPN
Brpn = RPN(F)
3. Initialize pedestrian recognition network
4. For each candidate bounding box b in Brpn,
4.1. Extract visual feature descriptions f v
4.1.1. Resize candidate region to 256 × 256
4.1.2. Use the first 22 layers of ResNet50 and convolutional blocks to obtain f v
4.2. Extract pose feature descriptions f p
4.2.1. Resize candidate region to 256 × 256
4.2.2. Use improved OpenPose with first 22 layers of ResNet50 to extract feature map F
4.2.3. Iteratively predict confidence map M and affinity field A to obtain pose information
4.2.4. Map pose information to 128-dimensional feature description f p using a fully connected layer
4.3. Extract attention information f a using pedestrian attention module
4.4. Fuse visual, pose, and attention feature descriptions
f = concatenate ( f v , f p , f a )
f = reduce dimension(f)
4.5. Refine confidence score and remove challenging negative samples using classification module
scorep = classifier(f)
5. Optimize pedestrian region proposal network using multitask loss:
L r p n = L c + λ L v + μ L p
6. Calculate final confidence score for each candidate region, with α = 0.8 and β = 0.2:
score = α ∗ scorer + β ∗ scorep
7. Perform non-maximum suppression (NMS) to eliminate duplicate and redundant detections
8. Return final pedestrian detection results {Pi (i = 0,1,2…)}

4. Experiments

A.
Datasets
We conducted experiments on two widely used public datasets: Caltech [17] and CityPersons [12]. We performed ablation studies and compared the results of our method with those of related methods.
Caltech [17] is a dataset widely used for pedestrian detection. Caltech includes 137 segments of approximately one-minute-long videos, totaling 250,000 frames of annotated images with a resolution of 640 × 480. For each image containing a pedestrian, the dataset provides a compact bounding box describing the entire pedestrian. An additional bounding box is marked to outline the visible area of the occluded pedestrian. We only used data from the “person” category for the model training and testing. The entire dataset was divided into 11 subsets, with the first 6 subsets (set00–set05) containing 42,782 images used as the training set and the last 5 subsets (set06–set10) containing 4024 images used as the test set.
CityPersons [12] is a diversified pedestrian detection dataset that evolved from the Cityscapes dataset. CityPersons contains a total of 5000 images (2048 × 1024 pixels), including 2975 images in the training set, 500 images in the validation set, and 1525 images in the test set. In this study, only the data from the “pedestrian” category, which represents walking, running, or standing human targets, were used for the model training and testing. Additionally, as illustrated in Figure 4, the dataset was further divided based on different levels of occlusion: reasonably, barely, partially, and heavily occluded.
B.
Evaluation Metrics
The miss rate (MR) is a metric used to evaluate the results of human body detection, where lower values indicate a more accurate performance. The accuracy of human body detection is primarily reflected in two aspects: detecting as many human targets as possible and minimizing false positives. Therefore, the MR and false positives per image (FPPI) are usually considered together to evaluate the pedestrian detection performance; often, the decision threshold (or confidence score) in the detection algorithm must be adjusted to achieve a balance between the two. If a lower decision threshold is set, the MR decreases and the FPPI increases; conversely, if a higher decision threshold is set, the MR increases and the FPPI decreases. Hence, to more accurately represent the performance of human body detection under different decision threshold conditions, Dollar [17] proposed calculating the logarithmic average of the MR within a certain range of FPPI as a quantitative metric, referred to as the log-average miss rate (LAMR). The LAMR is calculated as follows:
LAMR = \exp\left(\frac{1}{N}\sum_{i=1}^{N}\log MR(FPPI_i)\right),
where $FPPI_i$ represents the FPPI value corresponding to the selected sampling point $i$ and $N$ denotes the number of sampling points. To reflect the miss rate of the detector under low-false-positive conditions and ensure fair comparisons with existing methods, the FPPI was sampled at intervals of $10^{0.25}$ within the range $[10^{-2}, 10^{0}]$. The log-average miss rate computed in this setting is referred to as the miss rate and denoted as MR−2 in this paper.
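As an illustration of how MR−2 can be computed from a miss rate versus FPPI curve, a small NumPy sketch follows; it assumes the curve is given as parallel arrays sorted by increasing FPPI.

import numpy as np

def log_average_miss_rate(fppi, miss_rate, n_points=9):
    """Sample the miss rate at FPPI values spaced by 10^0.25 over [1e-2, 1e0]
    and take the geometric mean (exp of the mean log), i.e. MR-2."""
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    ref = np.logspace(-2.0, 0.0, n_points)       # 9 points: 10^-2 ... 10^0 in 10^0.25 steps
    sampled = []
    for r in ref:
        idx = np.where(fppi <= r)[0]
        # miss rate at the largest FPPI not exceeding the reference point
        sampled.append(miss_rate[idx[-1]] if len(idx) else 1.0)
    sampled = np.clip(np.array(sampled), 1e-10, None)   # avoid log(0)
    return float(np.exp(np.mean(np.log(sampled))))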
The average precision (AP) represents the average accuracy of all image detections belonging to a certain class. The calculation formula is as follows:
AP = \frac{1}{n}\sum_{i=1}^{n} Precision_i,
Precision = \frac{TP}{TP + FP},
where n represents the total number of images belonging to a certain class.
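A direct translation of this AP definition (per-image precision averaged over the n images of a class, which differs from the ranked precision–recall AP used in some benchmarks) might look as follows.

import numpy as np

def average_precision(tp_per_image, fp_per_image):
    """Average of per-image precision TP / (TP + FP); images with no detections count as 0."""
    tp = np.asarray(tp_per_image, dtype=float)
    fp = np.asarray(fp_per_image, dtype=float)
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    return float(precision.mean())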
C.
Implementation Details
Training Data
The model proposed in this section comprises two key parts: a region generation network and a pedestrian recognition network. All training images were first used to train the region generation network. For each target candidate box generated by the region generation network, a box whose IoU with any ground-truth box was larger than 0.5 was treated as a positive sample, and a box whose maximum IoU with all ground-truth boxes was between 0.1 and 0.3 was treated as a negative sample. We randomly selected a subset of the negative samples to ensure a positive-to-negative sample ratio of 1:3. Each target candidate area was scaled to 256 × 256 while preserving the original aspect ratio, and the surrounding area was filled with black pixels before being input into the pedestrian recognition network.
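The sample assignment and region preprocessing described above can be sketched as follows; the helper names are ours, OpenCV is assumed only for resizing, and the 1:3 negative subsampling step is left out for brevity.

import numpy as np
import cv2

def assign_label(max_iou):
    """IoU-based assignment: > 0.5 positive, within [0.1, 0.3] negative, otherwise ignored."""
    if max_iou > 0.5:
        return 1
    if 0.1 <= max_iou <= 0.3:
        return 0
    return None

def letterbox_crop(image, box, size=256):
    """Crop a candidate region, scale it while preserving the aspect ratio, and pad
    the surrounding area with black pixels to reach size x size."""
    x1, y1, x2, y2 = [int(v) for v in box]
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    scale = size / max(h, w)
    resized = cv2.resize(crop, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.zeros((size, size, 3), dtype=image.dtype)
    canvas[:resized.shape[0], :resized.shape[1]] = resized
    return canvas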
Initialization and Setting
The approach presented in this section was implemented in the PyTorch framework and optimized using Adam [53]. The region generation network was trained in strict accordance with ALFNet and FCOS. The OpenPose [16] model was pretrained on a pose dataset, and the parameters of the pedestrian pose module remained unchanged throughout training. For the Caltech dataset [17], during the first three epochs the other modules were fixed and only the fully connected layers of the classification module were updated, with a training batch size of 128 and a learning rate of 0.01. From epochs 4 to 20, the visual feature module also participated in the update; the training batch size was adjusted to 64 and the learning rate to 0.0001. For the CityPersons [12] dataset, during the first five epochs only the fully connected layers of the classification module were updated, with a training batch size of 100 and a learning rate of 0.01. From epochs 6 to 30, the visual feature module also participated in the update; the training batch size was adjusted to 50 and the learning rate to 0.0001.
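A sketch of the two-phase optimization schedule for the Caltech setting is shown below; the module attribute names (classifier, visual) are placeholders of ours, and only the Adam optimizer, learning rates, and batch sizes stated above are taken from the paper.

import torch

def build_optimizer(model, phase):
    """Two-phase Caltech schedule: epochs 1-3 update only the classification module's
    fully connected layers (lr 0.01, batch 128); from epoch 4 the visual feature
    module also trains (lr 1e-4, batch 64)."""
    if phase == "warmup":            # epochs 1-3
        params, lr, batch_size = list(model.classifier.parameters()), 0.01, 128
    else:                            # epochs 4-20
        params = list(model.classifier.parameters()) + list(model.visual.parameters())
        lr, batch_size = 1e-4, 64
    return torch.optim.Adam(params, lr=lr), batch_size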
D.
Baseline
In this study, ALFNet [29] and FCOS [49] were selected as the pedestrian region proposal networks.
ALFNet [29] is a single-step detection model with a simple but effective structure. We extended the ALFNet model by incorporating additional modules, with the variations denoted as follows:
(1)
ALFNet + visual: In this experiment, we added a visual feature module to the ALFNet model.
(2)
ALFNet + visual + pose: This experiment involved adding both a visual module and pedestrian pose modules to the ALFNet model.
(3)
ALFNet + visual + pose + attention: For this experiment, we incorporated the visual feature module, pedestrian pose modules, and attention mechanism into the ALFNet model.
E.
Ablation Study
A series of ablation experiments were performed on the Caltech [17] dataset to verify the validity of each module in the model.
Table 1 summarizes the miss rates of several models on the Caltech [17] dataset. The visual feature module alone reduced the miss rates of ALFNet, FCOS (ResNet50), and FCOS (VoVNet39) from 5.13%, 4.68%, and 4.57% to 4.99%, 4.53%, and 4.41%, respectively. Further incorporating the pedestrian pose module and the attention mechanism yielded the highest accuracy for all models, with miss rates of 4.71%, 4.26%, and 4.13%, respectively.
Table 2 presents the statistics on the number of proposals, number of true positives, and number of false positives for all models. The original detection models generated a large number of candidate targets, i.e., 9459 for ALFNet and 17,823 for FCOS. The number of false positives was reduced, while the number of true positives was maintained by using an independent visual feature module. Taking FCOS as an example, the number of false positives decreased from 13,969 to 9080, whereas the number of correct targets only decreased by 155. This indicates that visual descriptions can increase detection accuracy. Furthermore, the number of proposals was substantially reduced by integrating the pedestrian pose information and attention mechanisms.
We conducted extensive comparisons of our method with several methods on the CityPersons [12] dataset. Table 3 shows that the fusion of a visual attention mechanism and pose information led to improvements in detection performance across all evaluation subsets compared with that of the original model. Furthermore, Table 4 shows that the proposed approach effectively reduced the number of candidate targets. Table 3 indicates that the proposed method reduced the miss rate, regardless of the degree of occlusion. Moreover, the more severe the occlusion, the greater the performance improvement. For instance, for FCOS and ALFNet, an improvement of almost 3% was achieved.
F.
Comparison with State-of-the-Art Methods
Table 5 compares the performance of the proposed model with that of related methods on the Caltech dataset [17]. The proposed method achieved a miss rate of 4.13%, outperforming AR-Ped [54], which had a miss rate of 4.36%.
Table 6 shows the performance of the proposed model compared with that of related methods on the CityPersons dataset [12]. The proposed model achieved miss rates of 10.11%, 6.56%, 9.23%, and 45.40% on the Reasonable, Bare, Partial, and Heavy occlusion subsets, respectively.
G.
Parameter Analysis
We studied the detection results using a series of different weight parameters. Evaluation experiments were conducted on the CityPersons dataset [12], and the results are shown in Table 7. Table 7 shows that excessively large weight parameters α and β led to a worse detection performance, because using a single network (region proposal network or pedestrian recognition network) alone was insufficient for generating reliable prediction results. The best detection performance was achieved when α = 0.8 and β = 0.2.
H.
Visualization
This section provides some visual results, demonstrating the efficacy of integrating visual attention mechanisms with pose information. Figure 5 shows that using human pose information and attention mechanisms can considerably reduce false positives and partially address occlusion issues.

5. Conclusions

In this paper, we proposed a pedestrian detection method that integrates an attention mechanism with visual and pose information. First, the model uses a pedestrian region proposal network to roughly extract the candidate pedestrian targets in the image. Second, the visual and pose feature descriptions corresponding to the candidate proposals are extracted through the visual feature and pedestrian pose modules, respectively. Third, the classification module fuses these three feature descriptions and generates an embedded representation that can distinguish a pedestrian from the background. The experimental results showed that the fused features, which combine the advantages of the visual attention features and pedestrian pose descriptions, can effectively overcome occlusion problems and reduce false positive errors, thus increasing the robustness of pedestrian detection.

Author Contributions

S.H. conceptualized the study. M.L. acquired the data. Z.J. and M.L. analyzed and interpreted the data and wrote the manuscript. This article contains original data. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the corresponding author on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, S.; Benenson, R.; Omran, M.; Hosang, J.; Schiele, B. Towards reaching human performance in pedestrian detection. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 973–986. [Google Scholar] [CrossRef] [PubMed]
  2. Hosang, J.; Omran, M.; Benenson, R.; Schiele, B. Taking a deeper look at pedestrians. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4073–4082. [Google Scholar]
  3. Tian, Y.; Luo, P.; Wang, X.; Tang, X. Pedestrian detection aided by deep learning semantic tasks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5079–5087. [Google Scholar]
  4. Zhang, S.; Benenson, R.; Omran, M.; Hosang, J.; Schiele, B. How far are we from solving pedestrian detection? In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1259–1267. [Google Scholar]
  5. Zhou, C.; Yuan, J. Multi-label learning of part detectors for heavily occluded pedestrian detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3506–3515. [Google Scholar]
  6. Ouyang, W.; Zeng, X.; Wang, X. Partial occlusion handling in pedestrian detection with a deep model. IEEE Trans. Circuits Syst. Video Technol. 2016, 26, 2123–2137. [Google Scholar] [CrossRef]
  7. Hu, Q.; Wang, P.; Shen, C.; van den Hengel, A.; Porikli, F. Pushing the limits of deep CNNs for pedestrian detection. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 1358–1368. [Google Scholar] [CrossRef]
  8. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  9. Tang, P.; Wang, X.; Bai, S.; Shen, W.; Bai, X.; Liu, W.; Yuille, A. PCL: Proposal cluster learning for weakly supervised object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 176–191. [Google Scholar] [CrossRef] [PubMed]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  11. Zhang, L.; Lin, L.; Liang, X.; He, K. Is faster R-CNN doing well for pedestrian detection? In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 443–457. [Google Scholar]
  12. Zhang, S.; Benenson, R.; Schiele, B. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4457–4465. [Google Scholar]
  13. Song, T.; Sun, L.; Xie, D.; Sun, H.; Pu, S. Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 536–551. [Google Scholar]
  14. Tian, Y.; Luo, P.; Wang, X.; Tang, X. Deep learning strong parts for pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1904–1912. [Google Scholar]
  15. Zhou, C.; Yuan, J. Bi-box regression for pedestrian detection and occlusion estimation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 135–151. [Google Scholar]
  16. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1302–1310. [Google Scholar]
  17. Dollar, P.; Wojek, C.; Schiele, B.; Perona, P. Pedestrian detection: An evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 743–761. [Google Scholar] [CrossRef] [PubMed]
  18. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June—1 July 2016; pp. 779–788. [Google Scholar]
  19. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  20. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  21. Bochkovskiy, A.; Wang, C.; Liao, H.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  23. Li, Z.; Chen, Y.; Gang, Y. R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
  24. Shi, S.; Guo, C.; Li, J. PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 27 October–2 November 2020; pp. 10526–10535. [Google Scholar]
  25. Law, H.; Deng, J. CornerNet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  26. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577. [Google Scholar]
  27. Li, J.; Liang, X.; Shen, S.; Xu, T.; Feng, J.; Yan, S. Scale-aware fast R-CNN for pedestrian detection. IEEE Trans. Multimed. 2018, 20, 985–996. [Google Scholar] [CrossRef]
  28. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 354–370. [Google Scholar]
  29. Liu, W.; Liao, S.; Hu, W.; Liang, X.; Chen, X. Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 618–634. [Google Scholar]
  30. Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5187–5196. [Google Scholar]
  31. Hasan, I.; Liao, S.; Li, J.; Akram, S.U.; Shao, L. Generalizable pedestrian detection: The elephant in the room. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  32. Khan, A.H.; Munir, M.; van Elst, L.; Dengel, A. F2dnet: Fast focal detection network for pedestrian detection. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022. [Google Scholar]
  33. Hagn, K.; Grau, O. Increasing pedestrian detection performance through weighting of detection impairing factors. In Proceedings of the 6th ACM Computer Science in Cars Symposium, Ingolstadt, Germany, 8 December 2022. [Google Scholar]
  34. Khan, A.H.; Nawaz, M.S.; Dengel, A. Localized semantic feature mixers for efficient pedestrian detection in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  35. Luo, Z.; Fang, Z.; Zheng, S.; Wang, Y.; Fu, Y. NMS-loss: Learning with non-maximum suppression for crowded pedestrian detection. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 16–19 November 2021. [Google Scholar]
  36. Liu, M.; Jiang, J.; Zhu, C.; Yin, X.C. Vlpd: Context-aware pedestrian detection via vision-language semantic self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  37. Lu, R.; Ma, H.; Wang, Y. Semantic head enhanced pedestrian detection in a crowd. Neurocomputing 2020, 400, 343–351. [Google Scholar] [CrossRef]
  38. Enzweiler, M.; Eigenstetter, A.; Schiele, B.; Gavrila, D.M. Multi-cue Pedestrian Classification with Partial Occlusion Handling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 990–997. [Google Scholar]
  39. Noh, J.; Lee, S.; Kim, B.; Kim, G. Improving occlusion and hard negative handling for single-stage pedestrian detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 966–974. [Google Scholar]
  40. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 637–653. [Google Scholar]
  41. Wang, X.; Xiao, T.; Jiang, Y.; Shao, S.; Sun, J.; Shen, C. Repulsion loss: Detecting pedestrians in crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7774–7783. [Google Scholar]
  42. Larsson, V.; Kukelova, Z.; Zheng, Y. Camera pose estimation with unknown principal point. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2984–2992. [Google Scholar]
  43. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7103–7112. [Google Scholar]
  44. Guler, R.A.; Neverova, N.; Kokkinos, I. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7297–7306. [Google Scholar]
  45. Dong, L.; Chen, X.; Wang, R.; Zhang, Q.; Izquierdo, E. ADORE: An adaptive holons representation framework for human pose estimation. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 2803–2813. [Google Scholar] [CrossRef]
  46. Yang, B.; Ma, A.J.; Yuen, P.C. Body parts synthesis for cross-quality pose estimation. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 461–474. [Google Scholar] [CrossRef]
  47. Liu, S.; Li, Y.; Hua, G. Human pose estimation in video via structured space learning and halfway temporal evaluation. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 2029–2038. [Google Scholar] [CrossRef]
  48. Liu, Z.; Feng, R.; Chen, H.; Wu, S.; Gao, Y.; Gao, Y.; Wang, X. Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10996–11006. [Google Scholar]
  49. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9627–9636. [Google Scholar]
  50. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  52. Jiao, Y.; Yao, H.; Xu, C. PEN: Pose-Embedding Network for Pedestrian Detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 1150–1162. [Google Scholar] [CrossRef]
  53. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  54. Brazil, G.; Liu, X. Pedestrian detection with autoregressive network phases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 7231–7240. [Google Scholar]
  55. Du, X.; El-Khamy, M.; Lee, J.; Davis, L. Fused DNN: A deep neural network fusion approach to fast and robust pedestrian detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Santa Rosa, CA, USA, 24–31 March 2017; pp. 953–961. [Google Scholar]
  56. Brazil, G.; Yin, X.; Liu, X. Illuminating pedestrians via simultaneous detection and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4950–4959. [Google Scholar]
  57. Fu, C.-Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector [EB/OL]. arXiv 2017, arXiv:1701.06659. [Google Scholar]
  58. Zhou, P.; Zhou, C.; Peng, P.; Du, J.; Sun, X.; Guo, X.; Huang, F. NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination. In Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1967–1975. [Google Scholar]
  59. Xu, Z.; Li, B.; Yuan, Y.; Dang, A. Beta R-CNN: Looking into Pedestrian Detection from Another Perspective. arXiv 2022, arXiv:2210.12758. [Google Scholar]
Figure 1. Pedestrian detection framework that integrates an attention mechanism with visual and pose information.
Figure 2. Detailed framework diagram of pedestrian detection based on pose information and an attention mechanism.
Figure 3. The detailed structure of the proposed pedestrian recognition network.
Figure 4. Samples of the four occlusion levels.
Figure 5. Illustration of the results obtained with different methods.
Table 1. The miss rate on the Caltech dataset.
Model | MR−2 (%)
ALFNet | 5.13
ALFNet + visual | 4.99
ALFNet + visual + pose | 4.79
ALFNet + visual + pose + attention | 4.71
FCOS(ResNet50) | 4.68
FCOS(ResNet50) + visual | 4.53
FCOS(ResNet50) + visual + pose | 4.39
FCOS(ResNet50) + visual + pose + attention | 4.26
FCOS(VoVNet39) | 4.57
FCOS(VoVNet39) + visual | 4.41
FCOS(VoVNet39) + visual + pose | 4.27
FCOS(VoVNet39) + visual + pose + attention | 4.13
Table 2. The number of proposals on the Caltech dataset.
Model | Proposals | TP | FP | AP
ALFNet | 9459 | 2158 | 7301 | 0.23
ALFNet + visual | 8141 | 2113 | 6028 | 0.26
ALFNet + visual + pose | 6233 | 2035 | 4198 | 0.32
ALFNet + visual + pose + attention | 6019 | 2027 | 3992 | 0.34
FCOS(ResNet50) | 17,823 | 3858 | 13,969 | 0.22
FCOS(ResNet50) + visual | 12,838 | 3703 | 9080 | 0.29
FCOS(ResNet50) + visual + pose | 8965 | 3526 | 5239 | 0.39
FCOS(ResNet50) + visual + pose + attention | 7721 | 3329 | 4392 | 0.43
Table 3. The miss rate on the CityPersons dataset.
Model | Reasonable | Bare | Partial | Heavy
ALFNet | 16.01 | 9.96 | 13.85 | 52.47
ALFNet + visual | 15.68 | 9.53 | 13.25 | 52.40
ALFNet + visual + pose | 15.29 | 9.25 | 12.97 | 50.70
ALFNet + visual + pose + attention | 14.93 | 9.02 | 12.76 | 49.62
FCOS(ResNet50) | 11.97 | 7.87 | 11.52 | 49.17
FCOS(ResNet50) + visual | 11.69 | 7.53 | 11.25 | 48.53
FCOS(ResNet50) + visual + pose | 11.32 | 7.25 | 10.71 | 47.70
FCOS(ResNet50) + visual + pose + attention | 10.73 | 7.02 | 10.36 | 47.12
FCOS(VoVNet39) | 11.09 | 7.16 | 10.51 | 48.77
FCOS(VoVNet39) + visual | 10.78 | 7.03 | 10.12 | 47.39
FCOS(VoVNet39) + visual + pose | 10.52 | 6.95 | 9.57 | 46.28
FCOS(VoVNet39) + visual + pose + attention | 10.11 | 6.56 | 9.23 | 45.40
Table 4. The number of proposals on the CityPersons dataset.
Model | Proposals | TP | FP | AP
ALFNet | 17,401 | 2943 | 14,458 | 0.17
ALFNet + visual | 7377 | 2587 | 4790 | 0.35
ALFNet + visual + pose | 7208 | 2642 | 4566 | 0.37
ALFNet + visual + pose + attention | 6502 | 2637 | 3865 | 0.41
FCOS(ResNet50) | 23,029 | 5309 | 17,720 | 0.23
FCOS(ResNet50) + visual | 18,013 | 5292 | 12,721 | 0.29
FCOS(ResNet50) + visual + pose | 13,396 | 5011 | 8185 | 0.37
FCOS(ResNet50) + visual + pose + attention | 8677 | 4783 | 3894 | 0.55
Table 5. Comparison of miss rates of proposed and existing methods on the Caltech dataset.
Method | MR−2 (%)
MS-CNN [28] | 8.08
RPN + BF [11] | 7.28
F-DNN [55] | 6.89
SDS-RCNN [56] | 6.44
ALFNet [29] | 6.10
RepLoss [41] | 5.00
CSP [30] | 4.59
AR-Ped [54] | 4.36
Ours (PEAN) | 4.13
Table 6. Comparison of miss rates between proposed and existing methods on the CityPersons dataset for different occlusion levels.
Method | Reasonable | Bare | Partial | Heavy
YOLOv2 [19] | 23.36 | 14.23 | 22.65 | 52.50
SSD [22] | 22.54 | 16.91 | 21.95 | 50.66
DSSD [57] | 19.70 | 15.75 | 18.90 | 51.88
TLL [13] | 15.50 | 10.00 | 17.20 | 53.60
Faster R-CNN [10] | 15.40 | 9.30 | 18.90 | 55.00
RepLoss [41] | 13.20 | 7.60 | 16.80 | 56.90
OR-CNN [40] | 12.80 | 6.70 | 15.30 | 55.70
ALFNet [29] | 12.00 | 8.40 | 11.40 | 51.90
CSP [30] | 11.00 | 7.30 | 10.40 | 49.30
NOH-NMS [58] | 10.80 | 6.60 | 11.20 | 53.00
Beta R-CNN [59] | 10.60 | 6.40 | 10.30 | 47.10
Ours (PEAN) | 10.11 | 6.56 | 9.23 | 45.40
Table 7. Analysis of parameters α and β.
Method | α and β | MR−2 (%)
ALFNet + visual + pose + attention | α = 1.0, β = 0.0 | 15.52
ALFNet + visual + pose + attention | α = 0.9, β = 0.1 | 15.11
ALFNet + visual + pose + attention | α = 0.8, β = 0.2 | 14.93
ALFNet + visual + pose + attention | α = 0.7, β = 0.3 | 15.71
ALFNet + visual + pose + attention | α = 0.6, β = 0.4 | 15.93
FCOS(ResNet50) + visual + pose + attention | α = 1.0, β = 0.0 | 11.33
FCOS(ResNet50) + visual + pose + attention | α = 0.9, β = 0.1 | 11.02
FCOS(ResNet50) + visual + pose + attention | α = 0.8, β = 0.2 | 10.73
FCOS(ResNet50) + visual + pose + attention | α = 0.7, β = 0.3 | 11.41
FCOS(ResNet50) + visual + pose + attention | α = 0.6, β = 0.4 | 11.57
FCOS(VoVNet39) + visual + pose + attention | α = 1.0, β = 0.0 | 10.65
FCOS(VoVNet39) + visual + pose + attention | α = 0.9, β = 0.1 | 10.34
FCOS(VoVNet39) + visual + pose + attention | α = 0.8, β = 0.2 | 10.11
FCOS(VoVNet39) + visual + pose + attention | α = 0.7, β = 0.3 | 10.72
FCOS(VoVNet39) + visual + pose + attention | α = 0.6, β = 0.4 | 10.89
