A Pedestrian Detection Network Based on an Attention Mechanism and Pose Information
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors introduce an object detection model that integrates a region proposal network for region proposals, followed by a pose module and a visual feature extraction module with attention, which are used for classification. The model is trained by combining three loss terms for the classification module, the pedestrian pose module, and the visual feature module. The authors show that the introduction of the additional modules can improve existing detectors such as ALFNet and FCOS.
Strengths
* The ablation experiments demonstrate the contribution of all of the modules to the final model performance.
Weaknesses
* My main (and major) concern is that the approach considered does not cite the state-of-the-art approaches to this problem, and does not have results that are competitive against or placed in the context of such methods. Example papers the authors should reference and compare against are [1][2][3][4], for proper citing of the relevant literature and positioning of the paper in their context.
References
[1] Hasan, Irtiza, et al. "Generalizable pedestrian detection: The elephant in the room." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[2] Khan, Abdul Hannan, et al. "F2dnet: Fast focal detection network for pedestrian detection." 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.
[3] Hagn, Korbinian, and Oliver Grau. "Increasing pedestrian detection performance through weighting of detection impairing factors." Proceedings of the 6th ACM Computer Science in Cars Symposium. 2022.
[4] Khan, Abdul Hannan, Mohammed Shariq Nawaz, and Andreas Dengel. "Localized semantic feature mixers for efficient pedestrian detection in autonomous driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
Comments on the Quality of English Language
Minor typos:
Figure 1: Recognization -> Recognition
Figure 2: Recogination -> Recognition
Author Response
Reviewer 1:
Comments 1: [My main (and major) concern is that the approach considered does not cite the state-of-the-art approaches to this problem, and does not have results that are competitive against or placed in the context of such methods. Example papers the authors should reference and compare against are [1][2][3][4], for proper citing of the relevant literature and positioning of the paper in their context.
References
[1] Hasan, Irtiza, et al. "Generalizable pedestrian detection: The elephant in the room." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[2] Khan, Abdul Hannan, et al. "F2dnet: Fast focal detection network for pedestrian detection." 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.
[3] Hagn, Korbinian, and Oliver Grau. "Increasing pedestrian detection performance through weighting of detection impairing factors." Proceedings of the 6th ACM Computer Science in Cars Symposium. 2022.
[4] Khan, Abdul Hannan, Mohammed Shariq Nawaz, and Andreas Dengel. "Localized semantic feature mixers for efficient pedestrian detection in autonomous driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.]
Response 1: Thank you for pointing this out.
- (1) Reference 1 mainly studies the generalization ability of detectors across different datasets: it trains on the WiderPedestrian+CrowdHuman datasets and tests on the CityPersons dataset. In contrast, this paper conducts both training and testing on the CityPersons dataset. Furthermore, this paper achieves a Reasonable result of 10.11 on CityPersons, which is significantly better than the 12.8 reported in reference 1.
- (2) For reference 2, the Reasonable result on the CityPersons dataset reached 8.7, which is better (i.e., lower) than the 10.11 achieved in this paper. However, the authors mention: “We used the Nvidia RTXA6000 GPU cluster to train our models. We used Distributed Data-Parallel to achieve parallel training on multiple GPUs with a manual seed. We used 2 GPUs with 32 and 4 images per GPU for training model on Caltech Pedestrian and City Persons datasets respectively.” In contrast, this paper used only a single NVIDIA GeForce RTX 4090 GPU (24 GB VRAM). We believe that the greater compute capacity used in reference 2 is a main reason for its better result.
- (3) Reference 3 used "methods for generating synthetic data, including scene generation, 3D object usage, and sensor simulation." We believe that reference 3 therefore relies on additional data, whereas our method does not use any additional data. Therefore, we consider a direct comparison between reference 3 and this paper unfair.
- (4) Reference 4 mainly addresses the occlusion problem and uses extensive data augmentation, whereas this paper does not use any data augmentation techniques; we therefore consider a direct comparison unfair as well.
Comments 2: [Comments on the Quality of English Language: Minor Typos
Figure 1: Recognization -> Recognition
Figure 2: Recogination -> Recognition]
Response 2: Thank you for pointing this out. We have made these corrections in the newly submitted manuscript.
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a pedestrian detection approach that addresses the prevalent issues of false and missed detections in computer vision. The approach integrates attention mechanisms and pose information, offering a significant contribution compared to state-of-the-art related methods.
A final version, including some significant missing information, would be interesting.
Include a brief description of what an attention mechanism is and what it is for.
Since the proposed approach addresses the limitations of previous methods, describe the limitations of the new approach.
In solving this kind of problem, what kind of application can be deployed using this approach?
How do you describe a general solution that is framed by this approach?
Figure 1 detects only 2 pedestrians. What are the limitations?
The paper's argument is based on the detection methods of two-stage algorithms, but something needs to be said about pedestrian detection and occlusion problems for single-stage algorithms. What are the state-of-the-art single-stage algorithms for occlusion-free pedestrian detection?
What are the classes of the classification module?
It is necessary to increase the resolution in Figure 3.
The MR definition in section 4 only applies to this problem but has a more general definition. Include this definition and a reference.
Comments on the Quality of English Language
The paper demands a careful revision of vocabulary and grammar.
Author Response
Reviewer 2:
Comments 1: [Include a brief description of what an attention mechanism is and what it is for.]
Response 1: Thank you for pointing this out.
CBAM (Convolutional Block Attention Module) is an attention mechanism used in Convolutional Neural Networks (CNNs) designed to enhance the model's ability to focus on specific regions of the input data. CBAM improves feature representation by applying attention mechanisms along both channel and spatial dimensions, thereby enhancing model performance.
Here are some key features of CBAM:
- Dual-Dimensional Attention: CBAM applies attention along both the channel and the spatial dimensions. This dual focus allows the model to learn which channels and which spatial locations are important.
- Channel Attention: The channel attention component captures inter-channel relationships using global average pooling (GAP) and global max pooling (GMP), and then generates a weight for each channel through a shared learnable bottleneck (typically implemented with 1×1 convolutions) followed by a sigmoid.
- Spatial Attention: The spatial attention component aggregates the feature map along the channel dimension using average pooling and max pooling, concatenates the two resulting maps, and then generates a weight for each spatial location using a learnable convolutional layer followed by a sigmoid.
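For illustration only, here is a minimal PyTorch sketch of a CBAM-style block as it is commonly implemented; the module names, tensor shapes, and the reduction ratio are our assumptions for this sketch and are not taken from the paper.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: GAP and GMP over H x W, a shared bottleneck, sigmoid gate."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumed hyperparameter
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        w = self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x))
        return torch.sigmoid(w)  # (B, C, 1, 1) per-channel weights

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average/max maps, a 7x7 conv, sigmoid gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)    # (B, 1, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)  # (B, 1, H, W)
        w = self.conv(torch.cat([avg_map, max_map], dim=1))
        return torch.sigmoid(w)  # (B, 1, H, W) per-location weights

class CBAM(nn.Module):
    """Sequential CBAM: channel attention first, then spatial attention."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)  # refine channels
        x = x * self.sa(x)  # refine spatial locations of the channel-refined map
        return x

# Example: refine a 256-channel feature map
feats = torch.randn(1, 256, 32, 32)
refined = CBAM(256)(feats)  # same shape as feats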
Compared to CBAM, the pedestrian attention module proposed in this paper adds an additional input: the original feature map generated by the visual feature module. The reason is that in CBAM, whether channel attention is applied before spatial attention (denoted CBAM) or spatial attention before channel attention (denoted reverse CBAM, or RCBAM), the weights of the second attention are generated from a feature map already modified by the first. In other words, the second attention mechanism learns from an altered feature map, which affects the features it learns to a certain degree. Through experiments, we validated that this interference caused by the "sequential connection" can make the effectiveness of the attention modules unstable in pedestrian detection classification tasks. Therefore, we feed the original feature map produced by the visual feature module directly into the spatial attention mechanism.
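Based only on the description above, the modification might be sketched as follows (continuing the PyTorch sketch above, so it reuses ChannelAttention and SpatialAttention); the exact wiring in the paper may differ, so this is a hypothetical reading rather than the actual implementation.

class PedestrianAttention(nn.Module):
    """Hypothetical reading of the paper's module: the spatial attention weights
    are computed from the original visual-feature map rather than from the
    channel-refined map, so neither attention branch learns from a map already
    modified by the other."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x_orig):            # x_orig: output of the visual feature module
        x = x_orig * self.ca(x_orig)      # channel refinement of the original map
        x = x * self.sa(x_orig)           # spatial weights taken from the *original* map
        return x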
We have added a detailed description of the attention mechanism in the “Methodology” section of the paper.
Comments 2: [Since the proposed approach addresses the limitations of previous methods, describe the limitations of the new approach.]
Response 2: Thank you for pointing this out.
The limitations of previous methods have been described in this paper, such as
“Although existing research has made some progress in pedestrian detection, these methods typically rely solely on visual features for localizing human targets. Due to the lack of robustness of visual features to low-quality targets, in scenarios with severe occlusion and cluttered backgrounds, the visual cues of the targets to be detected can easily be confused with the background, resulting in a significant drop in detection performance. This is the most noteworthy issue in pedestrian detection.” and
“In situations where visual clues are indistinguishable, for instance, when pedestrian visual features resemble the background and are obscured by others, relying solely on visual descriptions is insufficient to distinguish pedestrians from the background.”
Comments 3: [In solving this kind of problem, what kind of application can be deployed using this approach?]
Response 3: Thank you for pointing this out.
The pedestrian detection method proposed in this paper can be deployed in the following types of applications:
- Autonomous Vehicles: Integrating this approach into the perception systems of autonomous vehicles allows for accurate and efficient pedestrian detection, enabling the vehicle to make safe navigation decisions.
- Driver Assistance Systems: In vehicles equipped with Advanced Driver Assistance Systems (ADAS), the pedestrian detection system can enhance features such as pedestrian collision avoidance, cross-traffic alerts, and automatic emergency braking.
- Surveillance Systems: The ability to detect pedestrians in real time can be used for security monitoring applications, monitoring public spaces, detecting unusual activities, or managing crowd flow.
- Traffic Management: In intelligent transportation systems, pedestrian detection systems can better monitor and manage pedestrian traffic flow and behavior.
- Robot Navigation: In service robots that share environments with humans, pedestrian detection systems can help robots navigate safely by detecting and avoiding pedestrians.
- Smart Cities: As part of smart city infrastructure, pedestrian detection systems can be used to analyze pedestrian movement, optimize public transportation, and improve urban planning.
Comments 4: [How do you describe a general solution that is framed by this approach?]
Response 4: Thank you for pointing this out.
A general solution under this framework includes the following key steps:
- Input Data Processing: Receive input images, which may be real-time images from surveillance cameras or in-vehicle cameras.
- Pose Estimation: Use a pose estimation module to identify the body key points of pedestrians in the image. These key points provide information about the pedestrian's pose, helping to understand their orientation and shape.
- Feature Extraction: Extract image features using Convolutional Neural Networks (CNNs). These features capture both local and global information about the pedestrian, providing a foundation for subsequent detection and analysis.
- Spatial and Channel Attention:
  - Spatial Attention: Apply attention mechanisms to the spatial dimensions of the image to enable the model to focus on regions containing pedestrians.
  - Channel Attention: Weight the channels of the feature maps to highlight the most important feature channels for pedestrian detection tasks.
- Feature Fusion: Combine the results of spatial and channel attention to enhance feature representation, allowing the model to better understand the pedestrian's shape, size, and pose.
- Pedestrian Detection: Use the improved feature representation for pedestrian detection, which may include generating bounding boxes and classifying pedestrian categories.
- Loss Function Definition: Define loss functions for training the network, which may include localization loss, confidence loss, and pose loss, to ensure the model accurately detects pedestrians and learns correct pose information (a brief sketch of such a combined loss is shown after this list).
- Model Training: Train the model on an annotated dataset containing pedestrians and backgrounds, optimizing with the aforementioned loss functions.
- Post-processing: Perform post-processing on the detection results, such as Non-Maximum Suppression (NMS) to remove overlapping bounding boxes, and possible calibration steps to improve detection accuracy.
- Performance Evaluation: Evaluate the model's performance on the test set using standard metrics (e.g., Average Precision (AP), the Precision-Recall Curve (PRC), etc.).
- Testing Model Generalization: Ensure the model maintains stable performance across diverse scenarios.
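To make the loss-definition step above concrete, a minimal PyTorch sketch of a combined detection loss with classification, localization, and pose terms is given below. The specific loss choices and weighting coefficients are illustrative assumptions for this sketch, not the actual terms or values used in the paper.

import torch
import torch.nn as nn

class DetectionLoss(nn.Module):
    """Illustrative combined loss: classification + localization + pose terms.
    The lambda weights are hypothetical and would be tuned in practice."""
    def __init__(self, lambda_loc=1.0, lambda_pose=0.5):
        super().__init__()
        self.cls_loss = nn.BCEWithLogitsLoss()  # pedestrian vs. background
        self.loc_loss = nn.SmoothL1Loss()       # bounding-box regression
        self.pose_loss = nn.MSELoss()           # body key-point regression
        self.lambda_loc = lambda_loc
        self.lambda_pose = lambda_pose

    def forward(self, cls_logits, cls_targets, boxes, box_targets, keypoints, kp_targets):
        l_cls = self.cls_loss(cls_logits, cls_targets)
        l_loc = self.loc_loss(boxes, box_targets)
        l_pose = self.pose_loss(keypoints, kp_targets)
        return l_cls + self.lambda_loc * l_loc + self.lambda_pose * l_pose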
This paper proposes a pedestrian detection method based on pose information and spatial-channel attention mechanisms. By combining key pose information and spatial structural features of the image, the method improves the accuracy and robustness of the detection algorithm, especially when handling occlusions, different poses, and scale variations of pedestrians.
Comments 5: [Figure 1 detects only 2 pedestrians. What are the limitations?]
Response 5: Thank you for pointing this out.
Sorry for the confusion. The resolution of the figure was insufficient; in fact, three pedestrians are detected in Figure 1. We have thickened the bounding boxes and revised the figure in the newly submitted manuscript.
Comments 6: [The paper's argument is based on the detection methods of two-stage algorithms, but something needs to be said about pedestrian detection and occlusion problems for single-stage algorithms. What are the state-of-the-art single-stage algorithms for occlusion-free pedestrian detection?]
Response 6: Thank you for pointing this out.
The state-of-the-art single-stage algorithm is YOLOv9, released on February 21, 2024. It builds on YOLOv7 and introduces an approach that addresses core challenges in deep-learning-based object detection, focusing on information loss and efficiency issues within the network architecture. The approach rests on four key components: the Information Bottleneck Principle, Reversible Functions, Programmable Gradient Information (PGI), and the Generalized Efficient Layer Aggregation Network (GELAN). Based on this analysis, the authors argue that a new training method is needed which can generate reliable gradients to update the model while remaining applicable to shallow and lightweight networks. PGI is such a solution: it includes a main branch for inference, an auxiliary reversible branch for reliable gradient computation, and multi-level auxiliary information, effectively addressing the deep supervision problem without adding extra inference cost.
Comments 7: [What are the classes of the classification module?]
Response 7: Thank you for pointing this out.
The classification module has two classes: pedestrian and background.
Comments 8: [It is necessary to increase the resolution in Figure 3.]
Response 8: Thank you for pointing this out.
The images in Figure 3 are uniformly normalized to a size of 256×256, which is already a high resolution for a single pedestrian.
Comments 9: [The MR definition in section 4 only applies to this problem but has a more general definition. Include this definition and a reference.]
Response 9: Thank you for pointing this out.
We have added a more general definition, named MR-2, together with a reference: “Dollar [45] proposed calculating the logarithmic average of the MR within a certain range of FPPI as a quantitative metric, referred to as the log-average miss rate (LAMR). The calculation formula is as follows:

\mathrm{LAMR} = \exp\left( \frac{1}{N} \sum_{i=1}^{N} \ln \mathrm{MR}\!\left(\mathrm{FPPI}_i\right) \right)

where FPPI_i represents the FPPI value corresponding to the selected sampling point i, and N denotes the number of sampling points. To better reflect the miss rate of the detector under low false-positive conditions, and to facilitate fair comparisons with existing methods, FPPI is sampled at intervals of 10^0.25 in the range [10^-2, 10^0]. The log-average miss rate (LAMR) computed in this way is referred to as the miss rate and denoted MR-2 in this paper.”
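For readers who wish to reproduce the metric, the sketch below shows one common way MR-2 can be computed from an FPPI/miss-rate curve. This is our own NumPy illustration; the function and variable names are ours, and the exact sampling convention of a given evaluation toolkit may differ slightly.

import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """MR-2: geometric mean of the miss rate sampled at nine FPPI reference
    points spaced by a factor of 10**0.25 over [1e-2, 1e0].
    Assumes `fppi` is sorted ascending with `miss_rate` aligned to it."""
    refs = np.logspace(-2.0, 0.0, num=9)                 # 10^-2, 10^-1.75, ..., 10^0
    sampled = []
    for ref in refs:
        idx = np.where(fppi <= ref)[0]
        # miss rate at the largest FPPI not exceeding the reference point;
        # if the curve never reaches this FPPI, fall back to the worst case (1.0)
        sampled.append(miss_rate[idx[-1]] if idx.size else 1.0)
    sampled = np.clip(np.asarray(sampled), 1e-10, None)  # guard against log(0)
    return float(np.exp(np.mean(np.log(sampled))))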
Comments 10: [Comments on the Quality of English Language, the paper demands a careful revision of vocabulary and grammar.]
Response 10: Thank you for pointing this out.
We have carefully revised the vocabulary and grammar of the paper.
Author Response File: Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
Methods
The methods are adequately described. The article breaks down the pedestrian recognition network into four modules: Visual Feature, Pedestrian Pose, Pedestrian Attention, and Classification modules. Each module's role is clearly explained, providing a clear understanding of how they contribute to the overall detection process. However, it would be beneficial to include more detailed algorithmic descriptions and possibly some pseudocode to enhance reproducibility and clarity.
Results
The results are clearly presented, with extensive experimental validation on the Caltech and CityPersons datasets. The article claims a substantial improvement over state-of-the-art methods, which is supported by quantitative results. Including visual examples of detections could further illustrate the improvements and provide a more intuitive understanding of the method's effectiveness.
Conclusions
The conclusions are well-supported by the results. The article successfully demonstrates that integrating attention mechanisms with visual and pose information enhances pedestrian detection performance, particularly in challenging scenarios involving occlusion and cluttered backgrounds. The conclusion summarizes the findings effectively and suggests potential for further research and application.
The article presents an original approach by combining attention mechanisms with visual and pose information, which is a novel contribution to pedestrian detection research. This integrated method addresses some persistent issues in the field, such as occlusion and false detections, in a new way.
The content is significant as it tackles a critical challenge in pedestrian detection, which has wide-ranging applications in areas like autonomous driving and surveillance. The proposed method's improvement over existing techniques highlights its potential impact.
The quality of presentation is generally high, with clear structure and logical flow. However, there are minor issues with English grammar and phrasing that, if corrected, would enhance readability. Additionally, including more detailed figures and diagrams could further improve the presentation.
The research is scientifically sound, with a well-defined methodology and robust experimental validation. The theoretical basis for using attention mechanisms and pose information is well-explained, and the results support the claims made.
Overall, the article makes a valuable contribution to the field of pedestrian detection. With improvements in presentation and more detailed methodological descriptions, it has the potential to be a highly impactful paper.
Recommendations for Improvement
- Enhance Clarity: Improve English grammar and phrasing throughout the article.
- Detailed Methodology: Provide more detailed algorithmic descriptions or pseudocode.
By addressing these areas, the article will not only be more accessible but also more impactful for its intended audience.
Comments on the Quality of English Language
The quality of presentation is generally high, with clear structure and logical flow. However, there are minor issues with English grammar and phrasing that, if corrected, would enhance readability. Additionally, including more detailed figures and diagrams could further improve the presentation.
Author Response
Reviewer 3:
Comments 1: [Provide more detailed algorithmic descriptions or pseudocode.]
Response 1: Thank you for pointing this out.
We have added a detailed algorithmic description in the “Methodology” section of the paper.
Comments 2: [The quality of presentation is generally high, with clear structure and logical flow. However, there are minor issues with English grammar and phrasing that, if corrected, would enhance readability. Additionally, including more detailed figures and diagrams could further improve the presentation.]
Response 2: Thank you for pointing this out.
We have carefully revised the vocabulary and grammar of the paper.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The comments that were raised in the previous review are still unaddressed; the authors should add the relevant literature and the differences from their method to the Related Work section so that their work is presented in the context of modern research in this field, instead of addressing them in the author response alone. These related works can then be mentioned when proposing future work. In addition to the references mentioned before [1][2][3][4], I am adding two more potential references for the authors' consideration [5][6].
References
[1] Hasan, Irtiza, et al. "Generalizable pedestrian detection: The elephant in the room." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[2] Khan, Abdul Hannan, et al. "F2dnet: Fast focal detection network for pedestrian detection." 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.
[3] Hagn, Korbinian, and Oliver Grau. "Increasing pedestrian detection performance through weighting of detection impairing factors." Proceedings of the 6th ACM Computer Science in Cars Symposium. 2022.
[4] Khan, Abdul Hannan, Mohammed Shariq Nawaz, and Andreas Dengel. "Localized semantic feature mixers for efficient pedestrian detection in autonomous driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[5] Luo, Zekun, et al. "NMS-loss: Learning with non-maximum suppression for crowded pedestrian detection." Proceedings of the 2021 International Conference on Multimedia Retrieval. 2021.
[6] Liu, Mengyin, et al. "Vlpd: Context-aware pedestrian detection via vision-language semantic self-supervision." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
Author Response
Comments 1: The comments that were raised in the previous review are still unaddressed; the authors should add the relevant literature and the differences from their method to the Related Work section so that their work is presented in the context of modern research in this field, instead of addressing them in the author response alone. These related works can then be mentioned for proposing future work. In addition to the references mentioned before [1][2][3][4], I am adding two more potential references for the authors' consideration [5][6].
References
[1] Hasan, Irtiza, et al. "Generalizable pedestrian detection: The elephant in the room." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[2] Khan, Abdul Hannan, et al. "F2dnet: Fast focal detection network for pedestrian detection." 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022.
[3] Hagn, Korbinian, and Oliver Grau. "Increasing pedestrian detection performance through weighting of detection impairing factors." Proceedings of the 6th ACM Computer Science in Cars Symposium. 2022.
[4] Khan, Abdul Hannan, Mohammed Shariq Nawaz, and Andreas Dengel. "Localized semantic feature mixers for efficient pedestrian detection in autonomous driving." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[5] Luo, Zekun, et al. "NMS-loss: Learning with non-maximum suppression for crowded pedestrian detection." Proceedings of the 2021 International Conference on Multimedia Retrieval. 2021.
[6] Liu, Mengyin, et al. "Vlpd: Context-aware pedestrian detection via vision-language semantic self-supervision." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
Response 1: Thank you for pointing this out.
Thank you for your valuable feedback and for highlighting the importance of contextualizing our work within the broader scope of modern research. We appreciate your patience and have taken your suggestions seriously.
We acknowledge that the comments raised in the previous review were not fully addressed in our initial revision. In response, we have thoroughly revised the "Related Work" section to include the relevant literature you mentioned and highlighted the modifications in yellow. We have also detailed the differences between our method and the referenced works to clearly position our contribution within the current research landscape.
Additionally, we have incorporated the new references [5][6] you provided and discussed their relevance to our work. These works have been included in the "Related Work" section, and their implications for future research directions have been noted as you suggested.
“Reference 5 proposed a novel NMS loss function that allows the NMS process to be trained end-to-end without adding any additional network parameters. Specifically, Reference 5 introduced a pull loss to bring predictions of the same object closer together and a push loss to separate predictions of different objects. Reference 5 achieved a Reasonable score of 10.08% on the CityPersons dataset, which is almost identical to our result of 10.11%. However, our method achieved an MR-2 score of 4.13% on the Caltech dataset, which is significantly better than the 5.92% reported in Reference 5.
Reference 6 proposed a novel method for context-aware pedestrian detection (VLPD) through visual-language semantic self-supervision, explicitly modeling semantic context. Reference 6 used a large number of language semantic labels, whereas our method did not use any additional labels; in our view, this makes a direct comparison with our method unfair.”
We believe these changes better align our manuscript with the expectations of the research community and hope that the revisions meet your approval.
Thank you again for your constructive comments and continued support.