Real-Time Multi-Camera Tracking for Vehicles in Congested, Low-Velocity Environments: A Case Study on Drive-Thru Scenarios
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. SUMMARY
This paper tackles multi-object, multi-camera vehicle tracking in challenging drive-thru scenarios, characterized by slow or stationary vehicles, heavy congestion, frequent occlusions, overlapping camera views, and real-time constraints. Traditional IoU- or Kalman filter-based trackers struggle due to bounding box instability and high vehicle loss rates.
The authors propose a real-time multi-camera tracking system with four main components:
- MTCD: A novel single-camera tracker using corner-point displacement instead of IoU, improving robustness to bounding box distortions and camera jitter (a minimal sketch of this idea follows the list).
- SA: Projects each camera’s view onto a metric plane for spatial consistency across cameras.
- MCA: Combines multi-camera data using appearance and spatio-temporal cues with “Joints” to preserve identities across views.
- Person-Vehicle Interaction: Detects customer interactions to estimate dwell time and analyze queue flow.
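To make the corner-displacement idea concrete, here is a minimal Python sketch, not the authors' implementation: the cost function, the greedy matcher, and the gating value are illustrative assumptions. It shows why a mean corner displacement stays smooth for slow or stationary vehicles whose boxes jitter, whereas IoU can fluctuate sharply.

```python
import numpy as np

def corner_displacement_cost(prev_box, curr_box):
    """Mean displacement of the four bounding-box corners between frames.

    Boxes are (x1, y1, x2, y2). Unlike IoU, this cost degrades gracefully
    when a slow or stationary vehicle's box jitters or deforms slightly.
    """
    px1, py1, px2, py2 = prev_box
    cx1, cy1, cx2, cy2 = curr_box
    prev_corners = np.array([[px1, py1], [px2, py1], [px1, py2], [px2, py2]])
    curr_corners = np.array([[cx1, cy1], [cx2, cy1], [cx1, cy2], [cx2, cy2]])
    return float(np.linalg.norm(curr_corners - prev_corners, axis=1).mean())

def associate(tracks, detections, max_disp=40.0):
    """Greedy track-detection association gated by maximum corner displacement.

    `tracks` maps track id -> last box; `detections` is a list of boxes.
    The gate value (in pixels) is a made-up placeholder.
    """
    matches, used = [], set()
    for t_id, t_box in tracks.items():
        best, best_cost = None, max_disp
        for d_idx, d_box in enumerate(detections):
            if d_idx in used:
                continue
            cost = corner_displacement_cost(t_box, d_box)
            if cost < best_cost:
                best, best_cost = d_idx, cost
        if best is not None:
            matches.append((t_id, best))
            used.add(best)
    return matches
```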
Evaluated on two real drive-thru sites with synchronized multi-camera data, the system outperforms a strong baseline (NvDCF) in precision, recall, and F1-score, while running in real-time on embedded GPUs.
Key contributions: a corner-displacement tracker for slow/static vehicles, robust multi-camera spatial alignment, lightweight identity association, integration of interaction analysis, and real-world deployment validation.
2. However, the paper has the following issues:
- Minor Terminology and Typographical Issues
- In the Introduction, the phrase: “The intermittent, low-velocity nature of movement in drive-thrus further exacerbates these problems.”
- In the Conclusion: “This paper introduced a novel Multi-Target Multi-Camera system specifically designed for accurate vehicle trajectory tracking in challenging, low-velocity, and highly congested environments like drive-thrus.”
Replace “drive-thrus” with “drive-thru” in these contexts to better reflect standard academic usage when referring to a general type of environment.
- In Section 2.1.1, the sentence: “The information of each vehicle is stored into JSON files and sended to a database/dashboard in the cloud.” contains a grammatical error.
“Sended” should be corrected to “sent.”
- Lack of High-Level Method Architecture Diagram
Although the paper provides detailed textual explanations of each module, including MTCD, SA, MCA, and the Joint mechanism, the absence of an overall architecture diagram makes it difficult for readers to grasp the system workflow at a glance.
Suggestion: Include a block diagram or flowchart that outlines the structure and interaction of the major components in the tracking pipeline. This visual aid will significantly enhance understanding of the algorithm's data flow and modular logic.
- Clarify visual sequence in Figure 1 by labeling (1) and (2) directly on the image.
In the caption of Figure 1, the phrase “(1) comes before (2)” is used to indicate a temporal sequence. However, the figure itself does not include clear visual cues—such as frame numbers, timestamps, or arrows—to establish this order.
Author Response
Thank you for your valuable suggestions to improve the document quality.
Comment 1: Thank you for the typo corrections. We performed an additional detailed pass to correct these and other typographical errors.
Comment 2: We addressed the suggestion by adding a diagram of the pipeline in which all the methods are integrated, to enhance the explanation of the algorithm. It can be found in the section "Multi-Target Multi-Camera tracking methodology".
Comment 3: The image has been modified to facilitate understanding of the sequence.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors need to address the following comments for publication:
- The title includes references to remote sensing data and images; it is recommended to clearly reflect these aspects in the title.
- In the introduction, clearly describe and clarify what is meant by "Difficult-to-Detect Scenarios."
- Currently, the introduction section is excessively lengthy, making the manuscript difficult to follow. It is recommended to split this section into two distinct parts: one clearly defined Introduction and a separate Literature Review.
- Within the Literature Review, two sub-sections should be established. The first should describe and justify the application and relevance of deep learning methods. The second sub-section should specifically justify the selection of YOLOv11, highlighting the research gap addressed by choosing this particular version.
- Numerous papers introduce and validate the reliability of YOLOv10 in their respective contexts. Therefore, you must explicitly discuss why YOLOv11 was chosen over YOLOv10. Additionally, clarify the implications and significance of using YOLOv11. Specifically, the paper titled "Image Augmentation Approaches for Building Dimension Estimation in Street View Images Using Object Detection and Instance Segmentation Based on Deep Learning" employed YOLOv10 effectively. These papers should be discussed in literature review, alongside additional relevant literature, to comprehensively justify your choice of YOLOv11.
- In the conclusion section, clearly summarize the key findings and their significance. Currently, the section briefly mentions only a few findings; a more comprehensive summary is necessary.
Author Response
Thank you for your thorough review and valuable feedback on our manuscript. We appreciate your time and suggestions, which have helped us improve the clarity and impact of our work.
Comment 1: We appreciate your suggestion. We have reviewed the title and believe it accurately reflects the scope and methodologies employed in our research, which indeed involve remote sensing data and images. We feel the current title adequately defines these aspects without being overly verbose.
Comment 2: Thank you for pointing this out. We have carefully reviewed our manuscript and were unable to locate the phrase "Difficult-to-Detect Scenarios" within the text. It is possible that there is a discrepancy between document versions.
Comment 3: We understand your concern regarding the length of the introductory material. Currently, our manuscript is structured with an "Introduction" and a separate "Related Work" section. We consider the "Related Work" section to be analogous to a Literature Review, where we discuss relevant prior research. We believe this existing division already addresses your suggestion for two distinct parts.
Comment 4: To address your valuable suggestion about justifying the model choice, we have added an explanation in the "Identified patterns in detector-tracker behaviour" subsection. This new content clarifies why YOLOv7 was chosen for our research, elaborating on its suitability for our specific objectives and the research gap it helps address.
Comment 5: We appreciate you bringing these relevant papers to our attention. However, as previously mentioned, our work does not utilize YOLOv11. Furthermore, since our current manuscript does not involve any model training, we believe that a detailed discussion and justification for a specific YOLO version (like YOLOv10 or YOLOv11) beyond our chosen YOLOv7 model would not align with the scope of our presented research. We have focused on justifying the selection of YOLOv7 where appropriate within the context of our work.
Comment 6: We agree with your assessment regarding the conclusion section. We have revised and expanded the conclusion to provide a more comprehensive summary of our key findings and their significance. We have ensured that this section now clearly articulates the main contributions and implications of our research.
Reviewer 3 Report
Comments and Suggestions for Authors
Please check the attached file.
Comments for author File: Comments.pdf
Correct some grammatical and typographical errors.
Author Response
Thank you for your constructive feedback and detailed suggestions, which are invaluable for improving our manuscript. We have carefully considered each point and made revisions accordingly.
Comment 1: We want to emphasize that each of these modules is complementary and critical for addressing specific error sources. Removing any one of them would lead to a significant decrease in precision, as they are designed to jointly resolve complex detection and tracking challenges. We have clarified this interconnectedness and the anticipated impact of removing individual components by adding a discussion to the "Discussion" section, explaining how each module contributes to overcoming specific issues and the overall performance.
Comment 2: Thank you for bringing this to our attention. We sincerely apologize for the typographical errors. We have conducted a thorough review of the entire manuscript to correct all identified typos, including "Discusion" to "Discussion," and other grammatical inconsistencies, to enhance the overall readability and professionalism of the paper.
Comment 3: Our algorithm is primarily designed for scenarios with overlapping camera views to facilitate robust vehicle tracking across multiple perspectives. Therefore, it is not usable in non-overlapping camera scenarios, where its behaviour would primarily resemble that of a standalone tracker without the benefit of multi-camera data association.
Comment 4: Thank you for this excellent point. We agree that a clearer explanation of the automated calibration process is necessary. We have significantly enhanced the description of the automatic calibration methodology in the relevant section of the manuscript to provide more detail on its principles and operation.
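As a hedged illustration of the plane-projection step that underlies this kind of multi-camera calibration (the SA module projects each view onto a shared metric plane), the following Python/OpenCV sketch maps image points to a ground plane via a homography. The matrix values and variable names are made-up assumptions, not the paper's calibration code.

```python
import numpy as np
import cv2

# Hypothetical homography from one camera's image plane to a shared metric
# ground plane, e.g., estimated once per camera from reference points.
H = np.array([[0.02,   0.001,  -5.0],
              [0.0005, 0.03,   -8.0],
              [0.0,    0.0001,  1.0]], dtype=np.float64)

def to_metric_plane(points_px, H):
    """Project pixel coordinates of shape (N, 2) onto the metric plane."""
    pts = np.asarray(points_px, dtype=np.float64).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

# Example: map the bottom-center of a vehicle's bounding box, a common
# proxy for its position on the ground plane.
bbox = (410, 220, 530, 300)                     # (x1, y1, x2, y2) in pixels
foot = [((bbox[0] + bbox[2]) / 2, bbox[3])]
print(to_metric_plane(foot, H))                 # metres on the shared plane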
Comment 5: We appreciate you highlighting the relationship between YOLOv7 and MTCD. We have added a detailed explanation of how YOLOv7's detection outputs influence and are integrated into the MTCD algorithm in the "Identified patterns in detector-tracker behaviour" section. This addition clarifies the interplay between the detector and our multi-camera tracking system, significantly strengthening the justification of our research and its findings.
Comment 6: Thank you for suggesting these relevant comparisons. We have incorporated a new table in the "Discussion" section that presents a comparative analysis of our approach with state-of-the-art methods, including references 12 and 17, as you recommended. This table highlights the strengths and differences of our system in relation to current research.
Comment 7: Thank you for raising this important consideration. Our system has been tested in scenarios that include significant variations in vehicle perspective and rotation. Specifically, in cameras 1 and 2 of the DTB (Drive Thru B) dataset, vehicles exhibit approximately 180-degree rotations, and in camera 4 of the same dataset, vehicles are viewed from an upward perspective.
The MTCD algorithm is designed to handle these cases effectively because the changes in bounding box representations and vehicle appearance are gradual enough between frames and cameras. This allows our data association and tracking mechanisms to maintain robust performance even under these challenging conditions. We have clarified this capability in the discussion section of the manuscript.
Reviewer 4 Report
Comments and Suggestions for Authors
The paper presents a novel Multi-Target Multi-Camera (MTMC) tracking system for vehicles in congested, low-velocity drive-thru environments, introducing the Multi-object Tracker based on Corner Displacement (MTCD) algorithm and associated methodologies. While the work demonstrates significant advancements, several weaknesses and areas for improvement can be identified to enhance its reliability, clarity, and impact. Below is a detailed review focusing on these weaknesses:
1. Please provide a more detailed explanation of each algorithm, especially focusing on edge-case handling and parameter initialization. For the calibration process, clarify what criteria or metrics were used during fine-tuning. A visual aid such as a flowchart or diagram illustrating the entire pipeline would enhance clarity.
2. Table 3 lists threshold parameters (e.g., th_0 = 0.25, th_3 = 1 m) without justifying how these values were selected. The reference to "DT features" is vague. Please explain what specific features or data were used to determine these thresholds.
3. The evaluation dataset is limited to two drive-thru locations and a total of 297 vehicles over 6.3 hours. This is relatively small for generalization. The paper should discuss environmental variations such as lighting (day/night), weather conditions, or different drive-thru layouts as limitations.
4. The paper claims 10 FPS real-time performance on a Jetson Xavier NX, but it lacks detailed analysis of computational complexity and latency. Please discuss trade-offs between speed and accuracy, especially in higher frame-rate or resolution scenarios.
5. While the system uses four cameras per location, there’s no analysis on how it would scale to more cameras or multiple overlapping fields of view. Consider discussing how increased object density or camera count might affect performance.
6. Although the paper includes figures (e.g., Figures 1–5), it misses visualizations of tracked trajectories, joint associations, or error corrections. Including such visuals would help readers assess the effectiveness of the proposed MTCD method.
7. Section 4 highlights system strengths but neglects critical limitations. Please address issues like reliance on stock YOLOv7 (with no fine-tuning or augmentation), sensitivity to camera placement, and possible failure modes in complex environments.
8. The interaction timings are computed using distance and angle rules without ground-truth validation. Without validated benchmarks, the timing accuracy and reliability of this component remain uncertain.
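For concreteness, a distance-and-angle interaction rule of the kind this comment refers to might look like the following Python sketch. All function names, parameter values, and the dwell-time convention are hypothetical illustrations, not taken from the paper.

```python
import math

def is_interacting(person_xy, vehicle_xy, vehicle_heading_deg,
                   max_dist_m=1.5, max_angle_deg=60.0):
    """Hypothetical distance-and-angle rule for person-vehicle interaction.

    Returns True when the person is within max_dist_m of the vehicle and
    roughly beside it (angle measured relative to the vehicle heading).
    The thresholds are illustrative placeholders.
    """
    dx = person_xy[0] - vehicle_xy[0]
    dy = person_xy[1] - vehicle_xy[1]
    dist = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(dy, dx)) - vehicle_heading_deg
    angle = (angle + 180.0) % 360.0 - 180.0     # wrap to [-180, 180)
    return dist <= max_dist_m and abs(angle) <= max_angle_deg

def dwell_time_seconds(interaction_flags, fps=10.0):
    """Dwell time from per-frame interaction flags at a given frame rate."""
    return sum(interaction_flags) / fps
```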
Author Response
Thank you once again for your meticulous review and invaluable feedback. Your insightful comments have significantly contributed to improving the clarity, robustness, and overall quality of our manuscript. We have carefully considered each point and made the following revisions.
Comment 1: For the calibration process, we have clarified the specific criteria and metrics utilized during fine-tuning to ensure accuracy and robustness. Furthermore, to improve overall clarity, we have incorporated a flowchart diagram that illustrates the entire pipeline, providing a comprehensive visual overview of our system.
Comment 2: Thank you for pointing out the need for clearer justification of our threshold parameters. We have expanded the explanation for these values in the manuscript. These thresholds are primarily determined based on the physical dimensions and dynamic behavior of typical vehicles in a drive-thru environment.
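As a hedged illustration of how physically grounded thresholds of this kind might be encoded, the sketch below ties a cross-camera distance gate to typical vehicle dimensions. The numeric values echo those quoted by the reviewer from Table 3; the derivation and variable names are our assumptions, not the paper's.

```python
# Illustrative only: hypothetical threshold choices anchored to physical
# vehicle dimensions, in the spirit of the authors' explanation.
AVG_VEHICLE_LENGTH_M = 4.5   # typical passenger-car length
AVG_VEHICLE_WIDTH_M = 1.8

thresholds = {
    # score gate for accepting an association (value quoted from Table 3)
    "th_0": 0.25,
    # max ground-plane distance (m) for cross-camera identity matching;
    # kept well under half a vehicle length so that neighbouring vehicles
    # in a tight queue are not merged into one identity
    "th_3": 1.0,
}
assert thresholds["th_3"] < AVG_VEHICLE_LENGTH_M / 2
```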
Comment 3: We acknowledge your concern regarding the dataset size and its potential impact on generalization. Our experiments were conducted during the standard operational hours of the drive-thru locations, encompassing various times of the day, which naturally included diverse lighting conditions. However, we confirm that evaluations during night-time hours were not feasible within the scope of this study.
Comment 4: Thank you for highlighting the importance of a more detailed analysis of computational aspects. Our experiments primarily focused on the accuracy of inferences and algorithm evaluation. Our experience with soft real-time video processing shows that undefined behaviour appears only when processing is slow and batches accumulate or are dropped. In our experiments we did not measure significant delays indicating hardware bottlenecks.
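A minimal sketch of the drop-when-behind policy this response alludes to, assuming a standard single-producer frame queue (the queue size and structure are illustrative, not the authors' pipeline):

```python
import queue

# Bounded frame queue: when the tracker falls behind, the stalest frame is
# dropped so end-to-end latency stays bounded instead of accumulating.
frames = queue.Queue(maxsize=8)

def enqueue_frame(frame):
    """Single-producer enqueue that discards the oldest frame when full."""
    try:
        frames.put_nowait(frame)
    except queue.Full:
        frames.get_nowait()      # drop the oldest (stalest) frame
        frames.put_nowait(frame)
```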
Comment 5: You've raised an excellent point regarding scalability. We recognize the importance of understanding how our system would perform with an increased number of cameras or higher object density. Due to current time constraints for providing a response, a detailed scalability analysis falls outside the scope of the present research. However, we agree this is a critical aspect for future work. We have explicitly included scalability to more cameras and higher object density as a key direction for future analysis in the "Future Work" subsection of our conclusion.
Comment 6: Thank you for this constructive suggestion. We concur that visualizations of tracked trajectories and association processes would significantly enhance the reader's understanding. We have now incorporated a new diagram with illustrative images in the "Multi-Target Multi-Camera tracking methodology" subsection.
Comment 7: We appreciate your candid feedback on the balance between strengths and limitations. You've correctly identified areas where clarification is needed. We want to clarify that fine-tuning and training with data augmentation were performed on YOLOv7 in our development process. However, even with these efforts, the fundamental detection errors persisted, which is precisely why our subsequent MTCD and correction modules were critical. We have updated the "Identified patterns in detector-tracker behaviour" subsection to explicitly address this.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
There is no further comment. I recommend that this paper be published.
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have made notable improvements in their revised manuscript.