Discriminative Deformable Part Model for Pedestrian Detection with Occlusion Handling
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- The deformable part model using machine learning (vs. human intuition) is a key contribution. However, the paper lacks a clear comparison with similar part-based models (e.g., DPM by Felzenszwalb et al. [1]). How does DDPM differ in part selection and deformation handling?
- Equation (1) for the decision weight W_k is not sufficiently justified. The choice of the exponential decay factor 2/n^2 and its impact on model performance need empirical or theoretical validation.
- Computational efficiency is a concern. Training for 200 hours on a high-end GPU (RTX 3050) may limit practicality. A runtime comparison with baseline methods (e.g., YOLO variants) would clarify trade-offs between accuracy and efficiency.
- The proposed dataset (5,609 images) is small compared to benchmarks like COCO or Cityscapes. While cultural diversity is valuable, the authors should justify the dataset size and annotation strategy (e.g., why a 70-20-10 split instead of standard splits?).
- For VisDrone, the reported mAP improvement (24.3% vs. 16.45%) is significant. However, the evaluation is limited to mAP@0.5. Including mAP@0.5:0.95 would provide a more robust assessment, especially for occlusion-heavy scenarios.
- Transfer learning details are sparse. Clarify which layers were frozen/retrained, the hyperparameters, and the training duration. Reproducibility is hampered without this information.
- Comparisons on Pascal VOC focus on older methods (SS, RPN, YOLOv3). Including recent state-of-the-art detectors (e.g., Faster R-CNN, DETR, or newer YOLO versions) would better situate DDPM's performance.
- The VisDrone results in Table 5 show DDPM outperforming methods like ACM-OD, but the cited baseline results ([40]) are not clearly described. Ensure baseline implementations are fair (e.g., same training data, augmentation).
- Figures and tables are referenced inconsistently (e.g., Figure 1(a)/(b) are mentioned before Figure 1 itself). Ensure all figures/tables are labeled and described in order.
- Citations in the references are incomplete (e.g., [30], [31], [32] lack volume/page numbers). Follow the journal's formatting guidelines strictly.
- The literature review in Section 2 includes papers dated to 2025, which are likely anachronistic. Verify publication dates and citations.
- Typos: "Transfer Leaning" → "Transfer Learning" (Section 4.3.1); "intraclass" → "intra-class" (Figure 10 caption).
- Passive voice is overused (e.g., "A framework is designed..."). Revise for active voice where possible.
- Tables 1 and 2 could be merged for conciseness.
- Provide code or pseudocode for the DDPM framework, especially the discriminative region mining process.
- Specify the hyperparameters (e.g., learning rate, batch size) for training and transfer learning.
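On the mAP@0.5:0.95 point above: the COCO-style metric averages AP over ten IoU thresholds rather than reporting AP at IoU 0.5 alone, which is why it is more robust under heavy occlusion. A minimal sketch of that averaging (the per-threshold AP values below are placeholders for illustration only, not results from the paper):

```python
# COCO-style mAP@0.5:0.95: average AP over the ten IoU thresholds
# 0.50, 0.55, ..., 0.95 instead of reporting AP at IoU 0.5 alone.

def map_50_95(ap_by_iou):
    """ap_by_iou maps an IoU threshold (rounded to 2 decimals) to its AP."""
    thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_by_iou[t] for t in thresholds) / len(thresholds)

# Placeholder per-threshold APs for illustration only (NOT from the paper):
# AP typically decreases as the IoU threshold tightens.
ap_at = {round(0.50 + 0.05 * i, 2): 0.90 - 0.08 * i for i in range(10)}
print(round(map_50_95(ap_at), 3))
```

Because AP normally drops as the IoU requirement tightens, mAP@0.5:0.95 penalizes loose localizations that mAP@0.5 would credit in full.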
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper develops machine learning algorithms for detecting occluded pedestrians, with a focus on pedestrians whose body parts are hidden by traditional Eastern costumes. This is achieved by breaking the human image into parts and using discriminative part models to classify individual image patches, which then contribute to the final decision on the validity of the object. The proposed method achieves the best performance on public datasets (Pascal VOC and VisDrone), as well as on their own dataset of pedestrians on Pakistani roads wearing Eastern costumes.
[General]
The paper has a clear motivation. The sample images presented really help the reader to understand the problem and the proposed methodology. The author also compares and shows different model performances on the public datasets and their own Pakistani dataset.
[Critical flaw]
1. Lack of up-to-date benchmarks: all of the models the author compares with are outdated, dating back to 2020. The author should include some of the latest models, such as YOLOv11/v12, or the latest models (2022-2024) referenced in their literature review.
[Minor]
2. The author could list the computation resource and time spent for processing one image, and compare their model with benchmark models.
3. The limitation of this study is missing.
4. I recommend publishing the Pakistani dataset in an open-access repository to allow other research groups to follow up.
[Editorial]
5. Lines 422-423 repeat the content of Lines 406-408.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
- Abstract, Line 14: "Machine learning has been used only for rigid objects like traffic signs." → Unsupported generalization.
- Section 1, Line 43-44: “This optimization was achieved by breaking the object parts through machine learning for traffic sign detection [2].” → Clarify novelty. Explicitly contrast with prior deformable object methods (e.g., cite Felzenszwalb et al. [1] for human-intuition-based part models).
- Section 3.3.1, Line 351-353: Equation (1) lacks derivation. Justify the exponential term e^{-(2d_i^2/n^2)}. Explain why this specific decay rate was chosen (e.g., empirical testing vs. theoretical basis).
- Section 4.1, Table 3: DDPM’s 88.3% mAP vs. IA²-Net’s 84.13% is not analyzed. Add a brief discussion on architectural differences (e.g., part-based vs. holistic approaches) driving performance gains.
- Section 5, Line 493-494: “Using the default data, our proposed algorithm [...] outperforms YOLO v3 [...] and YOLO v5 [...].” → Specify training protocols (e.g., backbone, input resolution, augmentation) to ensure fair comparison.
- Section 2.5, Line 211-215: “Vision Transformers process pictures by using self-attention techniques. Its architecture makes use of a series of transformer blocks [...] feed-forward layer.” → Overly generic. Replace with technical specifics (e.g., patch embedding, positional encoding) relevant to occlusion handling.
- Section 3.2, Line 274-276: “The main idea [...] using a discriminative deformable part model.” → Vague phrasing. Replace with concrete steps (e.g., “The framework combines part localization via ML-learned regions with deformation-aware scoring”).
- Section 4.3, Line 467-469: “The data [...] employing the domain adaptation technique [...] pre-trained on a large, diverse dataset.” → Non-specific. Clarify the adaptation method (e.g., adversarial training, fine-tuning layers).
- Dataset annotation: No mention of annotation protocols (e.g., labeling tools, annotator training, inter-annotator agreement). Add to Section 3.4.
- Occlusion levels in the proposed dataset: Missing quantification (e.g., % of samples per occlusion category). Include in Table 1/2 or add a subsection.
- Baseline comparison details: Absence of training hyperparameters (e.g., epochs, learning rates) for YOLOv3/v5 and DDPM. Add to Section 4.3.
- Some figures are low resolution (almost all the textual information in the figures).
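To make the request about Equation (1) concrete, the contested term e^{-(2 d_i^2 / n^2)} can be probed numerically. The sketch below is an illustration only; it assumes d_i is a part's displacement from its expected location and n a normalization constant (both readings are assumptions, since the paper does not derive the term):

```python
import math

def decision_weight(d, n):
    """Exponential decay term from Equation (1): e^{-(2 d^2 / n^2)}.

    d: displacement of a part from its expected location (assumed role).
    n: normalization constant (assumed role); larger n slows the falloff.
    """
    return math.exp(-(2.0 * d * d) / (n * n))

# Larger n tolerates larger displacements before the weight collapses,
# which is exactly the behavior an empirical justification should probe:
for n in (2, 4, 8):
    weights = [round(decision_weight(d, n), 3) for d in range(5)]
    print(f"n={n}: {weights}")
```

Plotting or tabulating these curves for the n used in the paper would directly answer why the 2/n^2 rate (rather than, say, 1/n^2) was chosen.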
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The author has addressed all of my concerns, and I recommend publishing the manuscript in its present form.
Reviewer 3 Report
Comments and Suggestions for Authors
All my comments were addressed.