Article

Real-Time Multi-Camera Tracking for Vehicles in Congested, Low-Velocity Environments: A Case Study on Drive-Thru Scenarios

by
Carlos Gellida-Coutiño
1,*,
Reyes Rios-Cabrera
1,*,
Alan Maldonado-Ramirez
2 and
Anand Sanchez-Orta
1
1
Robotics and Advanced Manufacturing Division, Research Center for Advanced Studies (CINVESTAV), Industria Metalúrgica 1062, Parque Industrial Ramos Arizpe, Ramos Arizpe 25903, Mexico
2
Introid Inc., 199-Santa Susana Avenue, Saltillo 25297, Mexico
*
Authors to whom correspondence should be addressed.
Electronics 2025, 14(13), 2671; https://doi.org/10.3390/electronics14132671
Submission received: 30 May 2025 / Revised: 27 June 2025 / Accepted: 30 June 2025 / Published: 1 July 2025
(This article belongs to the Special Issue New Trends in Computer Vision and Image Processing)

Abstract

In this paper we propose a novel set of techniques for real-time Multi-Target Multi-Camera (MTMC) tracking of vehicles in congested, low-speed environments, such as drive-thru scenarios, where metrics such as the number of vehicles, time of stay, and interactions between vehicles and staff are needed and must be highly accurate. Traditional tracking methods based on Intersection over Union (IoU) and basic appearance features produce fragmented trajectories or misidentifications under these conditions. Furthermore, detectors such as YOLO (You Only Look Once) architectures exhibit different types of errors due to vehicle proximity, lane changes, and occlusions. Our methodology introduces a new tracker algorithm, the Multi-Object Tracker based on Corner Displacement (MTCD), that improves robustness against bounding box deformations by analysing corner displacement patterns and several other factors involved. The proposed solution was validated on real-world drive-thru footage, outperforming standard IoU-based trackers such as the Nvidia Discriminative Correlation Filter (NvDCF) tracker. By maintaining accurate cross-camera trajectories, our framework enables the extraction of critical operational metrics, including vehicle dwell times and person–vehicle interaction patterns, which are essential for optimizing service efficiency. This study tackles persistent tracking challenges in constrained environments, showcasing practical applications for real-world surveillance and logistics systems where precision is critical. The findings underscore the benefits of incorporating geometric resilience and delayed decision-making into MTMC architectures. Furthermore, our approach offers the advantage of seamless integration with existing camera infrastructure, eliminating the need for new deployments.

1. Introduction

Optimizing vehicle flow in congested areas is a critical challenge with significant economic and operational implications, spanning urban traffic management [1,2] to efficiency analysis in industrial and retail settings. In contexts such as drive-thru (DT) environments, the precise tracking of vehicle trajectories, queue dwell times, and interactions with personnel is essential for identifying bottlenecks, reducing customer wait times, and minimizing economic losses due to delays. This requires robust vehicle tracking across multiple camera views in challenging, high-density environments.
Multi-target multi-camera tracking refers to the set of techniques employed to obtain the trajectories of objects of interest across video frame sequences captured by multiple cameras. This process fundamentally involves identifying the same object as it moves through distinct fields of view of different cameras. Generally, MTMC algorithms comprise a Multi-Object Tracker (MOT) and techniques for re-identification or cross-camera association [3]. A Multi-Object Tracker is an algorithm responsible for obtaining the trajectories of various objects of interest within a single view of the camera, such as vehicles in a drive-thru scenario. MOTs can be classified based on their operational mode: offline (requiring information from all frames to compute trajectories) or online (calculating object displacement solely using information from previous and current frames) [4]. Furthermore, MOTs may possess real-time attributes, which, in the context of this article, implies adherence to soft real-time conditions [5]. This primarily means that algorithm processing is completed within determined intervals with a tolerable margin of error, thereby enabling continuous operation on video frames with a predictable delay after capture. Thus, a real-time tracker is a subtype of online trackers.
MOTs incorporate various association techniques to connect different “tracklets,” which refer to incomplete object trajectories. Tracklets are common due to detector failures, as well as occlusion and intersection events between detections—challenges that are particularly difficult to mitigate in two-dimensional scenarios like video frames. Common association techniques within MOTs include the use of similarity embeddings, Intersection over Union (IoU), and motion estimation and prediction. A widely utilized group of MOTs is the SORT-like trackers, which was first introduced by Bewley [6]. These trackers are characterized by their use of motion prediction filtered by Kalman filters [7] and object correlation via the Hungarian algorithm [8]. When SORT-type trackers are augmented with appearance embeddings, they are referred to as DeepSORT trackers [9,10].
Real-time multi-target multi-camera vehicle tracking in scenarios of low velocity and congested spaces, as in the case of multi-lane DT areas, presents several challenges. Unlike typical traffic monitoring on open roads, these scenarios involve multiple lanes with permissible lane changes, extremely tight vehicle spacing, prolonged stationary periods (up to 15 min), sudden stop-and-go movements, and frequent partial occlusions from other vehicles or structures. These conditions severely challenge traditional computer vision techniques.
Existing vehicle identification methods, including appearance embedding comparison [11,12,13], license plate recognition, and position-based association [14,15], face significant reliability issues in this specific context. Appearance embeddings struggle with variations in viewpoint, visual similarity between vehicles, and the use of a cropped image that contains fragments from adjacent vehicles, a common condition in dense queues. License plate recognition requires ideal camera placement, high resolution, and consistent lighting, which are often not met, and it is insufficient for the detailed trajectory analysis needed for interaction studies. Positional methods, while promising, are highly susceptible to the persistent inaccuracies of vehicle detectors in congested scenes, particularly false positives and negatives, fragmented bounding boxes, and inconsistent positioning due to occlusions. These detection failures propagate, leading to broken trajectories and corrupted identity information.
The intermittent, low-velocity nature of movement in drive-thrus further exacerbates these problems. Standard tracking algorithms relying on IoU, motion tracking, and similarity models often lose track when vehicles stop for extended periods in positions where detection fails or when bounding boxes change abruptly within a few video frames. Simply increasing tracker memory and relaxing parameters leads to significant identity interchange between tracked targets, especially in areas where vehicles enter and leave the scene.
We observed that in congested and low-velocity environments, the detector failures follow specific patterns. Thus, we propose a novel tracker that achieves robustness in a simple manner, bypassing some detector errors by exploiting knowledge of these failure patterns. To complement the tracker and to mitigate the false positives and negatives of the detector, we propose a novel real-time MTMC methodology for calculating the vehicle trajectory through complex areas covered by different cameras. The core of our approach includes (1) a new MOT algorithm that identifies and leverages specific patterns in bounding box deformations (like stretching or fragmentation) to maintain identities; (2) the use of perspective transformation to project vehicle positions onto a real-world scaled plane, enabling robust spatial analysis for refining detections and understanding inter-vehicle relationships; and (3) the introduction of a novel association method that utilizes a system of “joints” to connect the single-camera tracklets, associate the vehicles across camera views, and register interactions between vehicles and staff members, thereby effectively managing the uncertainty introduced by temporary occlusions and detection failures. This system allows for the reconstruction of precise vehicle trajectories and enables the analysis of critical metrics like dwell time and person–vehicle interactions.

Related Work

This section provides a concise overview of prior research relevant to our study, detailing the techniques used, including the artificial intelligence model, tracking methods, and a multi-camera re-identification approach. Recent approaches related to vehicle tracking are summarized in Table 1.
In [18], it was demonstrated that the precision and recall of a multi-camera tracking system based on the field of view (FoV) primarily depends on the quality of the detections. Their work quantified how false positives and negatives, as well as bounding boxes that do not accurately fit the vehicles, can negatively impact accuracy. Diangang Li et al. [19] focused on developing a person re-identification network capable of associating images captured in the visible light spectrum with images taken in the infrared spectrum. This is highly beneficial in low-illumination environments, particularly during nighttime.
In studies addressing the problem of multi-camera tracking of people, the method by Jungik Jang [3] stands out. This approach leverages the projected position of individuals onto a plane to estimate their velocity and direction of movement, achieving an identification rate exceeding 95% based on position and FoV intersection. However, this method is not suitable for scenarios where vehicles stop for extended periods and move intermittently. Haowen Hu [20] performed multi-camera tracking of people within an operating room. To accomplish this, he utilized three modules, including a detector, body pose estimation, and embeddings, achieving 85% accuracy in trajectory tracking. His method does not operate in real time; instead, it uses trajectory fragments generated by the tracker and subsequently reconstructs the information. While not directly applicable to our case of interest, it shares similarities since individuals in an operating room also remain stationary for periods ranging from seconds to minutes, posing similar detection and tracking challenges.

2. Materials and Methods

2.1. Architecture and Design of the Proposal

2.1.1. System Overview

The proposed solution is designed to monitor the dwell times of vehicles inside two drive-thru areas with multiple lanes and turns, working completely on-edge. Each DT is covered by the FoV of four cameras. The cameras are connected to a switch that is directly connected to an Nvidia (Santa Clara, CA, USA) Jetson device. This device processes the camera videos in real time by executing the proposed tracking and MTMC tracking algorithms. The information of each vehicle is stored in JavaScript Object Notation (JSON) files and sent to a database/dashboard in the cloud.

2.1.2. Cameras Used

Our proposal was designed to work with already installed infrastructure. For our testbed, we used the following:
  • Video capture: recordings from two drive-thru locations. The first one uses four 1080 × 720 analog Dahua (Hangzhou, China) cameras. The second uses four Dahua cameras with a resolution of 2048 × 1080 pixels and a field of view of 180 degrees.
  • Processing: Data processing was performed on a Jetson model Xavier-NX with a capacity of 21 TOPS (tera operations per second).
The base model used for detection is YOLO version 7 [21]. The NvDCF tracker version 6.0 [9], the IoU tracker, and our C++ adaptation of the ByteTrack algorithm described in [22] were used as MOTs for comparison purposes.

2.2. Identification of Patterns in Detector–Tracker Behaviour

Prior to our proposed methodology, we conducted experiments using the YOLOv7 detector. Since our first experiments were performed in 2022, we continued using this detector to maintain consistency across different experiments. These experiments included the use of various MOTs, including the DeepStream NvDCF, the DeepStream IoU tracker [9], and ByteTrack [22]. These experiments aimed to solve the tracking problem by associating pairs of vehicles across different cameras when their bounding boxes (BBs) simultaneously intersected with assigned areas within the FoV overlap regions. The obtained results demonstrated that their precision and recall were within an unacceptable range. The maximum achieved recall and precision in these experiments, measuring tracking along the full multi-camera trajectory as described in Section 3, were 0.58 and 0.59, respectively, when measured using DeepStream NvDCF. In contrast, we obtained values as low as 0.47 and 0.51 for the simpler IoU tracker.
To address the limitations observed in the initial tracking experiments, a comprehensive dataset of over five thousand vehicle images was meticulously labelled. Subsequently, a YOLOv7s model was trained using this dataset. Training was performed using the repository by Wong Kin Yiu [23]. Despite this effort, the precision of multi-object tracking remained an unresolved challenge. Subsequent attempts to solve the problem included the evaluation of YOLOv8 with various model sizes; however, no significant improvements were observed in tracking accuracy. It was consistently noted that while bounding box placements for most vehicles were accurate, detection errors persisted in specific vehicle distribution scenarios, irrespective of the detector employed. A detailed post hoc analysis of the failures revealed and categorized several error types attributable to both detection and Multi-Object Tracking (MOT) failures. The predominant detection-related errors and patterns identified are detailed in the subsequent discussion.
Errors in detector: Double-Object-BB Error (DOB): When vehicles are very close to each other, the detector confuses both vehicles with a single object, generating a single BB that covers both vehicles. This failure occurs when the vehicles are aligned, as observed in Figure 1.
DOB Pattern: It was observed that when the DOB forms, one of the sides of the BB remains almost identical to that of the BB of one of the vehicles, appearing as if the rectangle stretches in the direction perpendicular to the fixed side. The expansion of the BB occurs in a single frame upon completion of the alignment.
Fragmented Bounding Box (FB): This error refers to the fragmentation of the box into two or more boxes.
FB Pattern: It was observed that in the sequence where a single BB becomes two or more boxes, these new boxes have most of their area intersecting the original bounding box from the previous frames.
False Negative BB (FNB): An object is not detected.
FNB Pattern: It occurs when vehicles in specific positions move to other positions and are then detected again.
False Positive BB (FPB): An object is detected when there is none.
FPB Pattern: The detected BB appears either between two vehicles or in places with complex textures. In the latter case, the BB always appears in the same location. In cases where objects are in constant motion, errors such as DOB and FB should be filtered out by a SORT-type tracker. However, in the current case study, vehicles remain stationary for prolonged periods, so the tracker is unable to keep the vehicles in memory. If the memory time is increased, it results in ID contamination, especially in areas where vehicles are entering or leaving the scene.
Errors in tracking: The main tracking errors observed and their patterns are described below:
Partial-Occlusion-Loss Error (POL): When a person or vehicle moves from a partially occluded region to an unoccluded one, or vice versa, the bounding box usually increases or reduces its size. Under these conditions, the tracker often loses its target. This error is presented in Figure 2.
POL Pattern: It was observed that one of the bounding box dimensions, width or height, changes only slightly during these sequences. The pattern is similar to that observed in DOB, but the BB does not intersect another BB of the same class.
Fast Entry on a Scene (FE): When a person or vehicle enters the scene quickly, its bounding box grows rapidly. Under these conditions, the studied tracking algorithms are unable to maintain the ID.
FE Pattern: It was observed that one of the sides of the BB’s rectangle remains approximately unchanged in size or position. The error is similar to POL and DOB, but in this case, the size change is more gradual.

2.3. Multi-Target Multi-Camera Tracking Methodology

The proposed MTMC tracking methodology consists of combining several techniques and includes the presentation of a new tracking algorithm that leverages the error patterns of the YOLOv7 detector, as well as the sequences of rapid bounding box size changes that appear in DOB, FE, and POL errors. In this manner, our tracker offers a computationally inexpensive and effective solution. We call the new tracker Multi-Object Tracker based on Corner Displacement (MTCD).
Perspective transformation is used to obtain the positions of the bounding boxes in a plane with real-world dimensions and spatial constraints, denoted as the metric plane (MP). The positions of the vehicles in the MP allow for the construction of a graph of positions and relative distances between vehicles, which are called “object networks”, and they are useful for reducing FB, FPB, and FNB errors using a spatial analysis (SA) method.
Vehicles are associated across multiple cameras by projecting their positions onto a common reference frame in the MP, within the overlapping FoV. The Multi-Camera Association (MCA) method facilitates these associations by generating temporary “joints.” These joints preserve object information, enabling the establishment and dissolution of associations until the vehicle exits the scene or sufficient data is available to confirm valid associations. These algorithms are integrated into a video analysis pipeline for soft real-time processing, as shown in Figure 3. In this diagram, inside the tracking with MTCD block, the rectangles with different colours represent two bounding boxes detected in consecutive frames, and the circle represents the main idea of association using nearby corners. The rest of this section describes each of the algorithms and their integration.

2.3.1. Multi-Object Tracker Based on Corner Displacement

In the current case study, vehicles remain stationary for several minutes and perform sudden rapid movements, such as lane changes or simply moving forward. Under these conditions, trackers based on the Intersection over Union (IoU) of bounding boxes and other more complex SORT-like trackers tend to lose more than 30% of vehicles due to the errors described in Section 2.2. Based on the observed error patterns, a new multi-object tracking algorithm that is robust to those errors is proposed.
The pattern observed in DOB, FE, and POL errors shows that one of the sides of the BB experiences small displacements and size changes, while the dimension perpendicular to this side changes its magnitude considerably. A human observer watching the BBs through a sequence of frames perceives that the rectangle “stretches” or “contracts” in one direction. The MTCD leverages this characteristic. Thus, the distances between the four corners of all bounding boxes and their analogues in the previous video frame are compared.
Let $o_j(i)$ and $o_k(i-1)$ be two arbitrary BBs: one detected in video frame $i$ and the other in $i-1$, respectively (consecutive frames). The sides of the rectangles will be denoted as $l_{top}$, $l_{right}$, $l_{bottom}$, and $l_{left}$, and the notation $o_j(i)[l_{bottom}]$ will be used to refer to the bottom side of object $o_j(i)$. Each side is composed of two points, denoted as 0 and 1, where the order is established clockwise starting from the center of the rectangle. Thus, we can unequivocally express the point we are referring to by adding the index of the point in brackets to the side expression, e.g., $o_j(i)[l_{bottom}][1]$. To express the calculation of the Euclidean distance between two points, we use the operator $d(p_0, p_1)$, where $p_0$ and $p_1$ are any two points. Using these definitions, the length of an arbitrary side $a_{side}$ of a rectangle can be defined with Equation (1).
$o_j(i)[a_{side}]_{length} \triangleq d\big(o_j(i)[a_{side}][0],\; o_j(i)[a_{side}][1]\big) \qquad (1)$
Based on this definition, we have outlined the conditions for identifying the same object across consecutive frames in Equations (2) and (3).
$\frac{d\big(o_j(i)[a_{side}][\ell],\; o_k(i-1)[a_{side}][\ell]\big)}{\min\big(o_k(i-1)[a_{side}]_{length},\; o_j(i)[a_{side}]_{length}\big)} < th_0, \quad \forall\, \ell \in \{0, 1\} \qquad (2)$
Equation (2) ensures that at least one of the sides is close to its position in the previous frame.
$\left|\, 1 - \frac{o_k(i-1)[a_{side}]_{length}}{o_j(i)[a_{side}]_{length}} \,\right| < th_1 \qquad (3)$
The third condition, expressed in Equation (3), disambiguates intersecting rectangles when the dimensions of the unchanging side are different. When an object rapidly changes one of its dimensions, the bounding box dimension must be maintained to avoid DOB and POL errors, keeping the BB of the vehicle with its dimensions prior to the error. To detect rapid increases in BB size, the condition presented in Equation (4) is used, where $o_i[area]$ represents the area of the bounding box of object $i$.
$\left|\, 1 - \frac{o_j[area]}{o_k[area]} \,\right| > th_2 \qquad (4)$
The MTCD method is summarized in Algorithm 1.
Algorithm 1 MTCD algorithm
 1: for all frames $f_i$ in stream do
 2:     if it is the first frame then
 3:         Assign an ID number to each object
 4:         continue
 5:     end if
 6:     for all objects $o_j$ in frame $f_i$ do
 7:         for all objects $o_k$ in frame $f_{i-1}$ do
 8:             for all $a_{side}$ in [top, right, bottom, left] do
 9:                 if conditions (2) and (3) are satisfied then
10:                     $o_j$ is associated with $o_k$, receiving its ID
11:                     if $\left|1 - o_j[area]/o_k[area]\right| > th_2$ then
12:                         $o_j = o_k$
13:                     end if
14:                     continue to next object in frame $f_i$
15:                 end if
16:             end for
17:         end for
18:     end for
19: end for
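To make the association test concrete, the following C++ sketch (a minimal illustration, not the authors' production DeepStream code) evaluates conditions (2)–(4) for a pair of bounding boxes from consecutive frames; the struct layout, helper names, and default threshold values are assumptions introduced here for readability.

// Minimal sketch of the MTCD corner-displacement test (conditions (2)-(4)).
// The BBox/Point structs, helper names, and default thresholds are illustrative.
#include <algorithm>
#include <array>
#include <cmath>

struct Point { double x, y; };
struct BBox {
    double x0, y0, x1, y1;                               // top-left and bottom-right corners
    double area() const { return (x1 - x0) * (y1 - y0); }
};

// End points of one side (0 = top, 1 = right, 2 = bottom, 3 = left), ordered clockwise.
static std::array<Point, 2> side(const BBox& b, int s) {
    switch (s) {
        case 0:  return {{{b.x0, b.y0}, {b.x1, b.y0}}};  // top
        case 1:  return {{{b.x1, b.y0}, {b.x1, b.y1}}};  // right
        case 2:  return {{{b.x1, b.y1}, {b.x0, b.y1}}};  // bottom
        default: return {{{b.x0, b.y1}, {b.x0, b.y0}}};  // left
    }
}

static double dist(const Point& a, const Point& b) { return std::hypot(a.x - b.x, a.y - b.y); }

// True if box `cur` (frame i) may inherit the ID of `prev` (frame i-1): some side keeps
// both corners close to their previous position (Eq. (2)) and a similar length (Eq. (3)).
bool sameObject(const BBox& cur, const BBox& prev, double th0 = 0.3, double th1 = 0.2) {
    for (int s = 0; s < 4; ++s) {
        auto sc = side(cur, s), sp = side(prev, s);
        double lc = dist(sc[0], sc[1]), lp = dist(sp[0], sp[1]);
        double ref = std::min(lc, lp);
        bool cornersClose  = dist(sc[0], sp[0]) / ref < th0 &&
                             dist(sc[1], sp[1]) / ref < th0;   // Eq. (2)
        bool similarLength = std::fabs(1.0 - lp / lc) < th1;   // Eq. (3)
        if (cornersClose && similarLength) return true;
    }
    return false;
}

// Eq. (4): a sudden area change signals a DOB/POL/FE deformation; in that case the
// tracker keeps the geometry the box had before the deformation.
bool suddenDeformation(const BBox& cur, const BBox& prev, double th2 = 0.5) {
    return std::fabs(1.0 - cur.area() / prev.area()) > th2;
}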

2.3.2. Method for Spatial Analysis

FB, FPB, and FNB errors result in catastrophic failures for multi-camera tracking algorithms based on position overlap for the following reasons: a false positive (FPB), when it occurs in FoV overlap zones, results in the association of multiple different objects, or contamination; each false negative can result in the loss of the trajectory of the vehicle; and the fragmentation of bounding boxes into smaller boxes in the case of an FB causes more vehicle instances to be produced, leading to a loss of trajectory information and affecting both recall and precision. The new spatial analysis (SA) algorithm described in this section allows for the preservation of information about interactions between vehicles and people despite these errors. It is based on the fact that there are multiple fragments of information, which can be connected depending on how their trajectories relate over time. The algorithm works in real time, but associations are not fully determined until vehicles leave the scene.
Vehicle positions are represented using a single point. We propose using the centre of the bounding box for vehicles and the centre of the top side for people. Any area traversable by vehicles must be covered with perspective transformation polygons, each with 4 vertices, corresponding to sections between 4 and 8 square meters, as observed in the yellow colour in Figure 4. Using excessively large transformation polygons leads to errors due to the distortion introduced by the camera, which causes vehicles not to be projected onto the desired position on the plane. The ROIs must be small enough to reduce distortion error.
The next step is to obtain the coordinates of the projections of the transformation polygons onto the MP. The latter must contain recognizable features that aid in the positioning of the polygons; in our case, roof vertices and on-site measured distances in meters are used. For each polygon $P_i$, a pair of coordinate sets will be obtained: the projections of the polygon on the plane of camera $j$ are denoted by $P_i[j]$, and its coordinates in the MP are denoted by $P_i[MP]$. Through direct linear transformation, the homography matrix that relates both sets of coordinates, $H_{i:j\to MP}$, is calculated. With this matrix, the homogeneous coordinates of a vehicle or person with an arbitrary ID $k$, observed by camera $j$ and located within polygon $P_i$ during video frame $h$, are represented by $o_k(h)[j] = (o_k^{x,j}, o_k^{y,j})$, where $o_k^{x,j}$ and $o_k^{y,j}$ correspond to its coordinates in pixels. The coordinates of the vehicle on the MP, $o_k(h)[MP] = (o_k^{x,MP}, o_k^{y,MP})$, can be approximated using the transformation of Equation (5),
$\begin{bmatrix} b\, o_k^{x,MP} \\ b\, o_k^{y,MP} \\ b \end{bmatrix} = H_{i:j\to MP} \begin{bmatrix} o_k^{x,j} \\ o_k^{y,j} \\ 1 \end{bmatrix} \qquad (5)$
where $b$ is a bias factor that is eliminated by simple division.
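As an illustration of Equation (5), the sketch below applies a 3 × 3 homography stored in row-major order to a pixel coordinate and divides out the homogeneous factor $b$; the type and function names are placeholders, and in practice the matrix would be estimated beforehand from the four polygon-vertex correspondences.

// Sketch of Eq. (5): project a pixel position (px, py) onto the metric plane (MP)
// with a 3x3 homography H stored in row-major order. H would be estimated
// beforehand from the four vertex correspondences of the transformation polygon.
#include <array>

struct MPPoint { double x, y; };

MPPoint projectToMP(const std::array<double, 9>& H, double px, double py) {
    double bx = H[0] * px + H[1] * py + H[2];   // b * x_MP
    double by = H[3] * px + H[4] * py + H[5];   // b * y_MP
    double b  = H[6] * px + H[7] * py + H[8];   // homogeneous factor b
    return { bx / b, by / b };                  // eliminate b by simple division
}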
It is necessary to perform a fine adjustment of the positioning of the polygon vertices so that the mapping of vehicles does not introduce errors into the algorithm. For this adjustment, the projections obtained from the vehicles in the MP must be compared with the expected position in the same plane. In this work, we performed the adjustment by trial and error, as the vehicles must be in the correct lane, and we used the observable features of the MP to validate the fit with the desired position.
For calculating the velocity of a vehicle $o_k(h)[MP]$, represented by $v_k(h)[MP]$, a first-order approximation of the position derivative is used. The BB can move from one frame to another due to factors other than the car’s motion; thus, the estimated position of the vehicle has an error that can be represented as white noise. To reduce the effect of position error on the calculated velocity, the velocity is filtered by taking the average over the last 5 frames as the instant velocity, as presented in Equation (6).
$v_k(h)[MP] = \frac{FPS}{c} \sum_{i=h-c}^{h} \big(o_k(i)[MP] - o_k(i-1)[MP]\big), \quad c = 5 \qquad (6)$
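A minimal sketch of the velocity filter of Equation (6), assuming the MP trajectory of a single vehicle is available as a vector of positions ordered from oldest to newest; the names and the handling of short histories are illustrative choices.

// Sketch of Eq. (6): instant velocity as the average of the last c = 5 frame-to-frame
// displacements on the MP, scaled by the frame rate. `positions` holds the MP
// trajectory of one vehicle, ordered from oldest to newest.
#include <algorithm>
#include <vector>

struct MPPoint { double x, y; };

MPPoint velocityMP(const std::vector<MPPoint>& positions, double fps, int c = 5) {
    MPPoint v{0.0, 0.0};
    int n = static_cast<int>(positions.size());
    if (n < 2) return v;                          // not enough history yet
    int start = std::max(1, n - c);               // use at most the last c displacements
    for (int i = start; i < n; ++i) {
        v.x += positions[i].x - positions[i - 1].x;
        v.y += positions[i].y - positions[i - 1].y;
    }
    int steps = n - start;
    v.x *= fps / steps;                           // meters per second on the MP
    v.y *= fps / steps;
    return v;
}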
The next step of the algorithm is to construct an abstract representation (AR) of each tracked object. An AR is a collection of information that contains, among other things, the trajectory of a tracked object. These trajectories are expressed in both the MP coordinates and the BB coordinates of the frames of detection. The AR contains references to joints, which in this context refer to another abstract collection of data that represents the association between two objects. The joint contains references to the objects joined and the start time of the union. ARs also contain references to the joints that unite them, forming circular references.
In FoV overlap areas, joints can be generated between pairs of objects seen by multiple cameras, but not between objects from the same camera. Conversely, in the rest of the observed areas, only joints between pairs of objects from the same camera are allowed. The rule for creating a joint is that the coordinates of the objects must be within a distance less than threshold $th_3$ in the same frame. A joint is also broken if, in the same frame, the two objects it unites are at a distance greater than threshold $th_4$. When an object joins a false positive with a joint and then moves away from it, the joint breaks, preserving its information. Also, if the BB of an object grows or displaces erroneously, briefly generating a joint with a nearby object, this joint should break when the objects separate.
Joints do not break when one of the objects disappears and thus allow the connecting of objects that were first observed in other cameras. The union between an active object and one that disappeared also allows for the transferring of information from a BB to fragments in case of an FB and then recovering the information when the original BB reappears.
Vehicles that are related to others due to a chain of joints and other vehicles belong to an object network (ON), which is a set in which all its elements should, ideally, be the same vehicle. ONs can temporarily include other vehicles and be “contaminated”; however, the proposed rules also allow ONs to be “cleaned,” containing the information of a single object at the end of their trajectories. Vehicles in an ON are not eliminated until time $th_4$ has elapsed since the last time any of its elements were observed.
When a vehicle with a new ID appears for the first time in an area far from FoV intersection zones or vehicle entry zones, it is assumed to be either a false positive or the end of a false negative. To validate the second case, a search is conducted among all inactive object networks for one that was last in a nearby position, with similar dimensions, within a time less than $th_4$. As an additional condition to identify the end of an FNB, the speed of the vehicle before disappearing is analysed, and its value must be below the speed threshold $th_5$. If all conditions are met, a joint is made with the disappeared object.
When there is more than one active instance of an object, the object position to create new joints is the average of the positions of the instances. The SA algorithm is condensed in Algorithm 2.
Algorithm 2 Spatial analysis (SA) algorithm
 1: for all video frame batches $b$ do
 2:     for all cameras $c$ do
 3:         for all tracked objects $o_k$ do
 4:             if $o_k$ is not vehicle class then
 5:                 continue
 6:             end if
 7:             for all polygons $P_i$ do
 8:                 if $o_k$ is inside $P_i$ then
 9:                     project $o_k$ onto MP using Equation (5)
10:                     if $o_k$ already has an AR then
11:                         add new data
12:                         estimate instant velocity using Equation (6)
13:                     else
14:                         Create new AR instance
15:                     end if
16:                 end if
17:             end for
18:         end for
19:     end for
20:     for all ARs $AR_i$ do
21:         if $AR_i$ is new and in a FoV overlap area then
22:             for all other ARs $AR_j$ of the same class do
23:                 if $AR_j[b]$ camera $\neq c$ then
24:                     if $d(AR_i[b], AR_j[b]) < th_3$ then
25:                         joint($AR_i$, $AR_j$)
26:                     end if
27:                 end if
28:             end for
29:         else if $AR_i$ is new in a non-FoV area then
30:             for all other ARs $AR_j$ of the same class do
31:                 if $AR_j[b]$ camera $= c$ then
32:                     if $d(AR_i[b], AR_j[b]) < th_3$ then
33:                         joint($AR_i$, $AR_j$)
34:                     end if
35:                 end if
36:             end for
37:             for all other non-active ARs $AR_k$ of the same class do
38:                 if $AR_k[-1]$ camera $= c$ then
39:                     if $d(AR_k[-1], AR_i[b]) < th_3$ and iou($AR_k[-1]$, $AR_i[b]$) $> 0.9$ and last velocity of $AR_k < th_5$ then
40:                         joint($AR_k$, $AR_i$)
41:                     end if
42:                 end if
43:             end for
44:         end if
45:         for all $AR_j$ in the network of $AR_i$ do
46:             if $AR_j[b]$ camera $= c$ then
47:                 if $d(AR_i[b], AR_j[b]) > th_4$ then
48:                     unjoint($AR_i$, $AR_j$)
49:                 end if
50:             end if
51:         end for
52:     end for
53: end for
The bracket after $AR_i$ indicates selecting the position in the batch corresponding to the index inside. Note that ARs are sets of data that store the information of tracklets, and the ON is an abstract graph that connects them temporally.
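To make the data model concrete, the following sketch shows one possible layout for ARs, joints, and object networks; the field names and the use of shared pointers are assumptions, and the actual implementation may store additional per-camera trajectory data.

// Simplified data model for the SA algorithm: abstract representations (ARs) store
// per-object tracklet data, joints connect pairs of ARs, and an object network (ON)
// is the set of ARs reachable through a chain of unbroken joints.
#include <memory>
#include <string>
#include <vector>

struct MPPoint { double x, y; };
struct Joint;                                       // forward declaration (ARs and joints reference each other)

struct AR {
    int id = -1;                                    // single-camera tracker ID
    int cameraId = -1;
    std::string objClass;                           // "vehicle" or "person"
    std::vector<MPPoint> trajectoryMP;              // positions on the metric plane
    std::vector<std::shared_ptr<Joint>> joints;     // joints this AR participates in
    double lastSeenTime = 0.0;                      // used by the th4 deletion rule
    bool active = true;
};

struct Joint {
    AR* a = nullptr;                                // the two joined objects
    AR* b = nullptr;
    double startTime = 0.0;                         // when the association was created
    double endTime = 0.0;                           // last time the condition held
    bool broken = false;                            // set when the distance exceeds th4
};

// Collect the object network of `seed`: every AR reachable through unbroken joints.
void collectNetwork(AR* seed, std::vector<AR*>& out) {
    for (AR* known : out) if (known == seed) return;   // already visited
    out.push_back(seed);
    for (const auto& j : seed->joints) {
        if (!j || j->broken) continue;
        AR* other = (j->a == seed) ? j->b : j->a;
        if (other) collectNetwork(other, out);
    }
}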

2.3.3. Methods to Estimate Person–Vehicle Interaction

In the drive-thru case study, staff members first approach vehicles to take orders, then deliver the orders, and finally process payments. The objective of analysing staff–vehicle interactions is to measure the time elapsed from the initial staff interaction with the customer to the final interaction, which serves as a performance metric. In the proposed approach, staff–vehicle interactions are identified based on the distance between the person and the vehicle in the MP, the duration of the interaction, and the angle of interaction. Determining the relationship between the staff and vehicles in video frames is non-trivial because the relative distance in pixels between the vehicle and the person depends on both the actual distance between the objects and the camera, as well as camera distortion at the point of interaction.
For staff location, we also use abstract representations (ARs). However, they are not used to track people but rather to store the interaction. The staff always interacts with vehicles through the driver's window. Using this fact, the interaction can be defined in terms of the relative position of the staff with respect to the vehicle when the staff member is static. To evaluate this, let $AR_i[b]$ be the abstract representation of a person in a batch of video frames $b$, and let $AR_j[b]$ be the abstract representation of a vehicle in the same frame; then
$\hat{p}_{v|p} = \frac{\big(d_x(AR_i[b], AR_j[b]),\; d_y(AR_i[b], AR_j[b])\big)}{d(AR_i[b], AR_j[b])} \qquad (7)$
where $d_x(\cdot)$ and $d_y(\cdot)$ represent the distance in $x$ and $y$, respectively, between both objects in meters in MP coordinates. Therefore, $\hat{p}_{v|p}$ represents the unit vector of the position of the staff with respect to the vehicle. To determine whether the orientation of one with respect to the other is correct, a desired unit vector in MP coordinates is proposed, denoted by $\hat{e}$, and finally, the attention conditions are represented by Equations (8) and (9).
$\hat{e} \bullet \hat{p}_{v|p} < th_6 \qquad (8)$
where $\bullet$ represents the dot product.
$d(AR_i[b], AR_j[b]) < th_7 \qquad (9)$
When both conditions, orientation (Equation (8)) and distance (Equation (9)), are met, a person–vehicle joint is generated. This joint only stores information about the beginning and end of the interaction. A person’s AR can contain many joints with various vehicle ARs and vice versa. Unlike vehicle–vehicle joints, person–vehicle joints do not break when both objects move away from each other. Instead, the break occurs when the interaction time between the person and the vehicle is less than the time threshold $th_8$. A person’s AR is not eliminated until all vehicles with which it maintains a joint are eliminated. Algorithm 3 shows how joints are formed between interactions of both types of objects:
Algorithm 3 Person–vehicle interaction algorithm
 1: for all video frame batches $b$ do
 2:     for all cameras $c$ do
 3:         for all tracked objects $o_k$ do
 4:             if $o_k$ is not person class then
 5:                 continue
 6:             end if
 7:             for all polygons $P_i$ in the service area do
 8:                 if $o_k$ is inside $P_i$ then
 9:                     project $o_k$ onto MP using Equation (5)
10:                     if $o_k$ already has an AR then
11:                         add new data
12:                     else
13:                         Create new AR instance
14:                     end if
15:                 end if
16:             end for
17:         end for
18:     end for
19:     for all active $AR_i$ of class person do
20:         for all active $AR_j$ of class vehicle in the service area do
21:             Calculate $\hat{p}_{v|p}$ using Equation (7)
22:             if conditions (8) and (9) are met then
23:                 if joint($AR_j$, $AR_i$) exists then
24:                     update times of joint($AR_j$, $AR_i$)
25:                 else
26:                     create joint($AR_j$, $AR_i$)
27:                 end if
28:             end if
29:         end for
30:     end for
31: end for
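The following sketch combines Equations (7)–(9) into a single attention test on MP coordinates; the function signature and the interpretation of the thresholds follow the text above, but the names are placeholders rather than the authors' implementation.

// Sketch of the attention test of Eqs. (7)-(9): the unit vector from the vehicle to
// the person is compared against the desired service direction e, and the distance
// must be below th7 meters on the MP. Thresholds are illustrative; the inequality
// on the dot product follows the text above.
#include <cmath>

struct MPPoint { double x, y; };

bool personVehicleInteraction(const MPPoint& person, const MPPoint& vehicle,
                              const MPPoint& eHat, double th6, double th7) {
    double dx = person.x - vehicle.x;                // d_x in Eq. (7)
    double dy = person.y - vehicle.y;                // d_y in Eq. (7)
    double d  = std::hypot(dx, dy);
    if (d <= 0.0) return false;
    double px = dx / d, py = dy / d;                 // unit vector p_{v|p}
    double dot = eHat.x * px + eHat.y * py;          // orientation term of Eq. (8)
    return (dot < th6) && (d < th7);                 // Eq. (8) and Eq. (9)
}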

2.3.4. Method for Data Deletion and Storage

The full trajectories of vehicles, from entry into the drive-thru at the first camera to the exit at the last one, are obtained when the instances of vehicle ARs are deleted. No vehicle in an ON must have been observed by any camera for a time $th_4$ before the full object network is deleted. When an object network is eliminated, the entry time to the DT, the exit time from the DT, the entry time to the zone observed by each camera, and its first and last interactions with the staff are stored. Non-tracked objects and FPBs are naturally discarded because they are only observed in a single camera, have no joints, and thus have no entry and exit times.
When a vehicle’s AR is eliminated, all its joints with people are iterated. The joint that was established earliest corresponds to the start time of the first interaction, while the end time of the interaction of the most recent joint shows the last interaction time. This operation is performed for all vehicles in the ON of the eliminated vehicle, and with a simple subtraction process, the total interaction time between vehicles and the staff is obtained. The process for storing data and deleting the instances is described in Algorithm 4.
Algorithm 4 Instance elimination and data storage algorithm
 1: for all video frame batches $b$ do
 2:     for all non-active ARs $AR_k$ of class vehicle do
 3:         Get last view in $ON_k$
 4:         if current time − last_view_time of $AR_k > th_4$ then
 5:             Get first view in $ON_k$
 6:             Get first person interaction in $ON_k$
 7:             Get last person interaction in $ON_k$
 8:             Store data
 9:             Delete each object in $ON_k$ with its joints
10:         end if
11:     end for
12:     for all non-active $AR_i$ of class person do
13:         if all joints with vehicle objects are resolved/inactive then
14:             Destroy $AR_i$
15:         end if
16:     end for
17: end for
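As an illustration of the metric extraction performed when an object network is eliminated, the sketch below derives the dwell interval and the first/last interaction times from the stored joints; it reuses the simplified data model sketched earlier, and the field names are assumptions.

// Sketch of the metric extraction performed when an object network is eliminated:
// the dwell interval spans the first and last observation of any AR in the network,
// and the service span runs from the earliest to the latest person-vehicle joint.
#include <algorithm>
#include <limits>
#include <vector>

struct InteractionJoint { double startTime, endTime; };
struct TrackedVehicle   { double firstSeen, lastSeen;
                          std::vector<InteractionJoint> personJoints; };

struct VehicleMetrics {
    double entryTime, exitTime;                  // drive-thru dwell interval
    double firstInteraction, lastInteraction;    // staff service span
};

VehicleMetrics summarize(const std::vector<TrackedVehicle>& network) {
    const double inf = std::numeric_limits<double>::max();
    VehicleMetrics m{ inf, 0.0, inf, 0.0 };
    for (const auto& v : network) {
        m.entryTime = std::min(m.entryTime, v.firstSeen);
        m.exitTime  = std::max(m.exitTime,  v.lastSeen);
        for (const auto& j : v.personJoints) {
            m.firstInteraction = std::min(m.firstInteraction, j.startTime);
            m.lastInteraction  = std::max(m.lastInteraction,  j.endTime);
        }
    }
    return m;   // dwell time = exitTime - entryTime; service time = lastInteraction - firstInteraction
}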

2.3.5. Summary of the MTC Algorithm with Person–Vehicle Interaction

Before starting real-time video processing, we perform algorithm calibration following the steps described in Algorithm 5.
Algorithm 5 Calibration algorithm
 1: Obtain images for each camera.
 2: Obtain a scaled layout of the drive-thru.
 3: Position the perspective transform polygons.
 4: Fine-tune the perspective transform polygons.
 5: Define the desired service direction in the MP, $\hat{e}$.
 6: Define time and distance thresholds based on DT features, $th_i: i \in [0, 9]$.
 7: Define the service area as a polygon in the MP.
 8: Define the FoV overlap areas as polygons in the MP.
 9: Define the FPS that enables real-time performance and latency management.
10: Define push_time_out according to latency to avoid desynchronization.
After calibrating, the MTC and person–vehicle interaction algorithm starts automatically following the procedure described in Algorithm 6.
Algorithm 6 MTC and person–vehicle interaction algorithm
1: Form synchronized video frame batches.
2: Perform inference on each video frame in the batch.
3: Execute the tracking algorithm using Algorithm 1.
4: Execute the spatial analysis algorithm using Algorithm 2.
5: Execute person–vehicle interaction using Algorithm 3.
6: Execute deletion and data calculation using Algorithm 4.
7: Restart the algorithm during out-of-service times to reset counters.
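The per-batch loop implied by Algorithm 6 can be sketched as follows; the stage functions are empty stubs standing in for the components described above (capture, detection, MTCD, SA, interaction, deletion) and are not the actual DeepStream pipeline API.

// Skeleton of the per-batch loop of Algorithm 6. The stage functions are stubs
// for the components described above, not the real DeepStream pipeline interface.
#include <vector>

struct Detection  { int cameraId; double x0, y0, x1, y1; int classId; };
struct FrameBatch { std::vector<Detection> detections; double timestamp = 0.0; };

FrameBatch grabSynchronizedBatch()         { return {}; }  // step 1: synchronized capture
void runDetector(FrameBatch&)              {}              // step 2: YOLO inference
void runMTCD(FrameBatch&)                  {}              // step 3: Algorithm 1
void runSpatialAnalysis(const FrameBatch&) {}              // step 4: Algorithm 2
void runPersonVehicleInteraction(double)   {}              // step 5: Algorithm 3
void runDeletionAndStorage(double)         {}              // step 6: Algorithm 4

void processStream(const bool& running) {
    while (running) {
        FrameBatch batch = grabSynchronizedBatch();
        runDetector(batch);
        runMTCD(batch);
        runSpatialAnalysis(batch);
        runPersonVehicleInteraction(batch.timestamp);
        runDeletionAndStorage(batch.timestamp);
    }
}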

3. Results

Methodology and Experimental Validation

To evaluate the algorithm, two drive-thrus are used, denoted as DTA and DTB, respectively. DTA has a length of 130 m and three sections of two lanes. DTA is covered by four cameras whose FoV covers the entirety of the drive-thru. These FoVs have three overlap zones denoted by Roman numerals I–III. The scaled MP of DTA is presented on the left in Figure 5, where the entry and exit flows can be seen. DTB has a length of 81 m. It consists of two sections of one lane and one section of three lanes. Similarly to DTA, there are four cameras covering the entire vehicle path with three overlap zones. DTB is presented on the right in the same figure.
A frame from each camera of DTA is presented on the left in Figure 6, and similarly, frames from the DTB cameras are presented in the same figure on the right. To evaluate the precision of our methods, we manually annotated several videos. The videos are processed at 10 FPS synchronously, simulating real-time processing. The processing is carried out through a DeepStream pipeline, which, in conjunction with the proposed MTCD algorithm, is programmed in C++ to achieve high execution speed. The rendered output of the pipeline is not stored in order to prioritize computational resources for detection and post-processing. However, we did execute all the steps of our solution: process, detect, send information to the cloud, and show the information on the online dashboards for the final user. To perform the evaluation, JSON files containing vehicle metadata are stored, and image crops of each vehicle with a complete trajectory are also stored.
To create the ground truth, the entry time, the first and last service times by the staff, and the exit time of each vehicle are annotated. The recorded vehicles are compared with the vehicle times in the JSON files, and their image crops were used for association. In this manner, 3 h of DTA and 3.3 h of DTB are analysed, divided into seven videos corresponding to different days and times. The duration of each video and the number of vehicle instances counted are shown in Table 2.
The trackers' performance is not evaluated individually, because it is already contained in the whole process: our SA algorithm works together with the trackers. Using the SA algorithm, vehicles lost by the tracker can be recovered, and false positives are eliminated.
The initial experiments showed that a single camera's recall and precision are as low as 60%. Instead, we measure the precision and recall of the full MTMC tracker using two different trackers: (1) the DeepStream NvDCF tracker from Nvidia [9], configured with the parameters detailed in Table 3, and (2) our proposed MTCD tracker. The NvDCF tracker is selected because it obtained the highest precision and recall in the previous experiments when compared with a set of different trackers, as described in Section 2.2. It uses features of a SORT-like method, with a Kalman filter for motion prediction and IoU association; furthermore, NvDCF builds appearance models to improve the robustness of target association. The full parameters used in our algorithm and tracker are outlined in Table 4.
The parameters detailed in Table 4 are configured based on the dynamics of the vehicles, the frame rate of the system, and the typical dimensions of the target vehicles. Specifically, $th_1$ is determined by the relative displacement of bounding box (BB) corners across consecutive frames. $th_2$ and $th_3$ define the upper bounds for the permissible increase in BB dimensions between successive frames. Parameters $th_3$ and $th_4$ are correlated with 0.5 and 1 times the typical lateral distance between vehicles in adjacent lanes of the DT, respectively. $th_5$ establishes the maximum allowable error in the measured velocity for stationary vehicles. Finally, $th_7$ and $th_8$ are empirically set based on observed staff behaviour.
For the ground truth evaluation, a true positive (TP) is defined as a vehicle whose trajectory was accurately tracked from its entry into the drive-thru lane until its exit, encompassing the interaction period with staff. A false positive (FP) is identified as a vehicle whose trajectory is completely tracked but whose identity is erroneously swapped with another vehicle. Conversely, a false negative (FN) denotes a vehicle whose trajectory is not tracked in its entirety. Based on these definitions, precision and recall metrics are calculated according to Equations (10) and (11).
$precision = \frac{TP}{TP + FP} \qquad (10)$
$recall = \frac{TP}{TP + FN} \qquad (11)$
The ground truth results are presented in Table 5.
The results vary for each video. After carefully analysing the detections, it is inferred that the variation is due to factors such as variable detector quality with different vehicle types and colours; distance between vehicles; the presence or absence of pickup trucks and other large vehicles causing occlusions; and lighting depending on the time of day. Overall, the weighted average precision and recall by the number of vehicles using NvDCF are 0.81 and 0.69, respectively. These same values using MTCD are 0.89 and 0.91, respectively.
To quantify staff–vehicle interaction, each instance of interaction between vehicles and personnel is meticulously labelled. Subsequently, the metadata for vehicles with successfully tracked trajectories is filtered to isolate relevant interactions. The experimental results, however, exhibited significant variability across different experiments. Several corner cases that adversely affected the performance of interaction detection were identified, including instances where staff–vehicle interaction occurred but the vehicle is not detected; pauses by staff members among vehicles without interacting with them; and the proximity of non-staff individuals to vehicles. With the proposed methodology, the average precision for staff–vehicle interaction was 0.65, and the average recall was 0.51.
A total of 297 vehicles were counted, accumulating a duration of 1217 min, or 730,450 frames.

4. Discussion

In terms of the number of vehicles and the video hours analysed, our study utilized a more extensive evaluation database compared to similar recent vehicle identification studies, as can be observed in Table 6, which also condenses the evaluation metrics. The obtained results demonstrate that using the observed detector error patterns is reliable, and incorporating this knowledge into the design of the tracking algorithm allows higher accuracy. Furthermore, we used already-installed cameras that are not located in optimal positions, and the illumination of the observed areas is not ideal. These characteristics allow the system to be replicated on already installed infrastructure.
The proposed method stands out from other methodologies due to its efficient use of computational resources for the following reasons: First, it is not necessary to generate, store, and operate embeddings to identify vehicles. Second, the tracker is simple and does not involve motion prediction. Third, the association operations only involve the comparison of coordinates in a plane. Despite its simplicity, the method offers above-average precision. We did not evaluate the individual performance of MTCD separately from SA: working separately, minor errors are interpreted as persistent false negatives, and fast bounding box deformations drastically reduce performance. Only by working together can the presented methods achieve acceptable results.
Regarding the proposed tracker, according to the results in Table 5, it should be used together with the proposed algorithms in congested and low-velocity environments instead of NvDCF. We found that detectors based on the YOLO architecture often have significant room for improvement, even if they have been retrained, possibly due to limitations intrinsic to the model. Nevertheless, with suitable tracking techniques, the detections remain useful.

Practical Considerations

Our proposal has the following requirements, which we consider very feasible: It is necessary for the FoVs of the cameras to overlap and fully cover the trajectory. The cameras must be in elevated positions to avoid obstructions. In addition, it is necessary to obtain the dimensions of the study area and a scaled representation of its geometry. With that, we can calibrate the perspective transformation polygons. To alleviate the time spent on this calibration, a configuration dataset could be generated automatically. This dataset would consist of vehicle trajectories projected onto the associated MP. Then, an algorithm could be used to find the optimal perspective transformation polygons.
Our study introduces novel techniques that prove effective in practical scenarios and suggest new avenues for enhancing both multi-object tracking and multi-camera tracking. Furthermore, these techniques offer the potential for combination with existing embedding-based methods, potentially leading to further improvements in their precision and recall, perhaps approaching error-free tracking.

5. Conclusions

This paper introduced a novel multi-target multi-camera system specifically designed for accurate vehicle trajectory tracking in challenging, low-velocity, and highly congested environments like drive-thrus. It has the advantage of working in existing infrastructure. We addressed the limitations of standard tracking algorithms in such scenarios, which often fail due to prolonged stationary periods, tight vehicle spacing, and frequent occlusions. Our solution tackles these issues through several innovations: The Multi-Object Tracker based on Corner Displacement algorithm maintains object identities despite bounding box distortions. Metric plane projection refines detections and manages trajectory fragments, and a novel “joints” concept in our multi-camera association algorithm flexibly links vehicle identities across different camera views, even with temporary disappearances.
Evaluated on real-world drive-thru video, our system demonstrated superior performance in precise multi-camera vehicle tracking compared to traditional approaches like NvDCF by Nvidia. The ability to precisely track vehicles through drive-thru queues offers valuable opportunities for operational analysis, such as measuring dwell times and understanding person–vehicle interactions, ultimately enabling the optimization of service flow and efficiency in various retail and industrial settings. Furthermore, the core principles of the MTCD, metric plane analysis, and joint-based association could be adapted to MTMC tracking in other challenging scenarios characterized by similar low-velocity, congested, and dynamically occluded conditions.

Author Contributions

Conceptualization, C.G.-C.; methodology, C.G.-C., A.M.-R. and R.R.-C.; software, C.G.-C.; validation, A.S.-O. and A.M.-R.; formal analysis, C.G.-C. and A.S.-O.; investigation, C.G.-C. and R.R.-C.; resources, A.M.-R.; data curation, R.R.-C. and A.M.-R.; writing—original draft preparation, C.G.-C.; writing—review and editing, C.G.-C., A.M.-R., R.R.-C. and A.S.-O.; visualization, A.M.-R.; supervision, C.G.-C.; project administration, R.R.-C.; funding acquisition, R.R.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset supporting the findings of this study is not publicly available. This is due to restrictions stipulated by the private enterprise that provided access to its facilities for data collection. These restrictions are in place to maintain the anonymity of the enterprise, protect proprietary information, and safeguard its commercial interests. Consequently, the public dissemination of the raw or annotated data, including video footage, is prohibited under the terms of the data use agreement.

Acknowledgments

The authors wish to express their sincere gratitude to Introid Inc. and its members for their significant contributions to this research. We are particularly grateful to Rene Padilla Calderon for initiating this collaboration, his instrumental role in data obtention, and the provision of necessary hardware. We also extend our thanks to M.C. Arturo Alvarez-Hernandez for his valuable technical support. Furthermore, we acknowledge Guillermo Arozqueta and Rocio Gonzales for their meticulous and dedicated efforts in data labelling, which were crucial for this study.

Conflicts of Interest

Author Alan Maldonado-Ramirez was employed by the company Introid Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AR: Abstract Representation of Objects
BB: Bounding Box
DOB: Double-Object Bounding Box (Detector Error)
DTA: Drive-Thru A
DTB: Drive-Thru B
DT: Drive-Thru
FB: Fragmented Bounding Box (Detector Error)
FE: Fast Entry on a Scene (Tracker Error)
FNB: False Negative Bounding Box (Detector Error)
FoV: Field of View (of the Cameras)
FPB: False Positive Bounding Box (Detector Error)
FPS: Frames per Second
ID: Identification Number for Each Vehicle
IoU: Intersection over Union (Method)
JSON: JavaScript Object Notation
MCA: Multi-Camera Association Method
MP: Metric Plane, Plane of the Study Area with Scale
MOT: Multi-Object Tracker
MTCD: Multi-Object Tracker Based on Corner Displacement
MTMC: Multi-Target Multi-Camera (Tracker Algorithm)
NvDCF: NVIDIA Discriminative Correlation Filter (Tracker)
ON: Objects Network
POL: Partial Occlusion Loss (Tracker Error)
Re-ID: Re-identification
ResNet: Residual Neural Network
SA: Spatial Analysis Method
SORT: Simple Online and Real-time Tracking
TOPS: Tera Operations Per Second
YOLOv7: You Only Look Once Version 7 (Detector)

References

  1. Medina-Salgado, B.; Sánchez-DelaCruz, E.; Pozos-Parra, P.; Sierra, J.E. Urban traffic flow prediction techniques: A review. Sustain. Comput. Inform. Syst. 2022, 35, 100739. [Google Scholar] [CrossRef]
  2. Peri, N.; Khorramshahi, P.; Rambhatla, S.S.; Shenoy, V.; Rawat, S.; Chen, J.C.; Chellappa, R. Towards real-time systems for vehicle re-identification, multi-camera tracking, and anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 622–623. [Google Scholar]
  3. Jang, J.; Seon, M.; Choi, J. Lightweight indoor multi-object tracking in overlapping FOV multi-camera environments. Sensors 2022, 22, 5267. [Google Scholar] [CrossRef] [PubMed]
  4. Xiang, Y.; Alahi, A.; Savarese, S. Learning to track: Online multi-object tracking by decision making. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4705–4713. [Google Scholar]
  5. Laplante, P.A. Real-Time Systems Design and Analysis; Wiley: New York, NY, USA, 2004; Volume 3. [Google Scholar]
  6. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3464–3468. [Google Scholar]
  7. Kalman, R.E. A new approach to linear filtering and prediction problems. Trans. ASME—J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
  8. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  9. NVIDIA. DeepStream 6.0.1 Release Documentation: Gst-nvtracker. Available online: https://docs.nvidia.com/metropolis/deepstream/6.0.1/dev-guide/text/DS_plugin_gst-nvtracker.html (accessed on 25 May 2025).
  10. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3645–3649. [Google Scholar]
  11. Huang, X.; He, P.; Rangarajan, A.; Ranka, S. Machine-learning-based real-time multi-camera vehicle tracking and travel-time estimation. J. Imaging 2022, 8, 101. [Google Scholar] [CrossRef]
  12. Zhang, H.; Fang, R.; Li, S.; Miao, Q.; Fan, X.; Hu, J.; Chan, S. Multi-Camera Multi-Vehicle Tracking Guided by Highway Overlapping FoVs. Mathematics 2024, 12, 1467. [Google Scholar] [CrossRef]
  13. Tan, X.; Wang, Z.; Jiang, M.; Yang, X.; Wang, J.; Gao, Y.; Su, X.; Ye, X.; Yuan, Y.; He, D.; et al. Multi-camera vehicle tracking and re-identification based on visual and spatial-temporal features. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 275–284. [Google Scholar]
  14. Luna, E.; SanMiguel, J.C.; Martínez, J.M.; Escudero-Viñolo, M. Online clustering-based multi-camera vehicle tracking in scenarios with overlapping fovs. Multimed. Tools Appl. 2022, 81, 7063–7083. [Google Scholar] [CrossRef]
  15. Li, P.; Li, G.; Yan, Z.; Li, Y.; Lu, M.; Xu, P.; Gu, Y.; Bai, B.; Zhang, Y.; Chuxing, D. Spatio-temporal Consistency and Hierarchical Matching for Multi-Target Multi-Camera Vehicle Tracking. In Proceedings of the CVPR Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 222–230. [Google Scholar]
  16. Yang, H.; Cai, J.; Zhu, M.; Liu, C.; Wang, Y. Traffic-informed multi-camera sensing (TIMS) system based on vehicle re-identification. IEEE Trans. Intell. Transp. Syst. 2022, 23, 17189–17200. [Google Scholar] [CrossRef]
  17. Li, Y.L.; Li, H.T.; Chiang, C.K. Multi-Camera Vehicle Tracking Based on Deep Tracklet Similarity Network. Electronics 2022, 11, 1008. [Google Scholar] [CrossRef]
  18. Rios-Cabrera, R.; Tuytelaars, T.; Van Gool, L. Efficient multi-camera vehicle detection, tracking, and identification in a tunnel surveillance application. Comput. Vis. Image Underst. 2012, 116, 742–753. [Google Scholar] [CrossRef]
  19. Li, D.; Wei, X.; Hong, X.; Gong, Y. Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 4610–4617. [Google Scholar]
  20. Hu, H.; Hachiuma, R.; Saito, H.; Takatsume, Y.; Kajita, H. Multi-camera multi-person tracking and re-identification in an operating room. J. Imaging 2022, 8, 219. [Google Scholar] [CrossRef] [PubMed]
  21. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  22. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–21. [Google Scholar]
  23. WongKinYiu. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. 2022. Available online: https://github.com/WongKinYiu/yolov7 (accessed on 26 June 2025).
Figure 1. Appearance of a double bounding box error; sequence (1) occurs before sequence (2).
Figure 2. Sequence showing the bounding box of a person shrinking and then growing in length.
Figure 3. A diagram of the different parts of the methodology implemented in a video pipeline.
Figure 4. Example of perspective transform polygons labelled as $P_{s_i}$, $i \in \{1, \ldots, 6\}$.
Figure 5. Metric plane with the camera FoVs of both drive-thrus: MPA on the left and MPB on the right.
Figure 6. A frame from each camera, with DTA on the left and DTB on the right.
Table 1. Overview of multi-camera vehicle tracking systems: general information.
Reference | Detection Model | Single-Camera Tracker | Re-Identification Method/Model | Brief Description
Hao Yang et al. [16] | YOLOv3 | TrackletNet Tracker | Deep metric features; ResNet50 for embeddings. Graph-based association (spatial info, movement direction, and observation order). Vehicle orientation by ResNet. | System for vehicle trajectory tracking in non-overlapping fields of view (FoVs) to extract vehicular traffic information.
Xiaohui Huang et al. [11] | YOLO-based architecture | DeepSort | ResNet-50 embeddings. | Novel techniques for multi-camera vehicle tracking in non-overlapping FoVs across three intersections.
Hongkai Zhang et al. [12] | Swin Transformer detection | SORT-like tracker | Overlapping-FoV position relationship combined with embedding comparison. Embeddings from parallel networks: ResNet-101-IBN, ResNeSt101, ResNeSt101, and Se-ResNet101-IBN. | Novel system integrating various techniques for vehicle tracking within a tunnel environment using cameras with overlapping FoVs.
Yun-Lun Li et al. [17] | Faster RCNN | TrackletNet tracker | ResNet101-IBN-Net for embeddings. Embedding comparison refined by spatial distance; association graph constructed. | Multi-camera tracking with IoU-based filtering of false positives. Association graph prevents linking distinct, distant vehicles.
Peilun Li et al. [15] | Feature Pyramid Networks | ResNet50 embeddings, GPS coordinates, and the Hungarian algorithm for final matching. | ResNet50 embeddings. Perspective transformation for approximate coordinates. Discrimination based on speed, distance, and embeddings. | Multi-target multi-camera vehicle tracking at an intersection. Employs perspective transformation for vehicle coordinates.
Elena Luna et al. [14] | Evaluated: YOLOv3, SSD512, Mask R-CNN, and EfficientDet | SORT-like tracker | ResNet-50 (modified final layer) for embeddings. Vehicle coordinates are projected to GPS coordinates; association in the transformed space via inter-vehicle distances. | Novel approach for online multi-camera vehicle tracking based on embeddings and position within overlapping FoVs.
Table 2. Experiment duration and instances.
Exp_id | Place | Duration | No. Vehicles
I | DTA | 1 h | 59
II | DTA | 1 h | 33
III | DTA | 1 h | 11
IV | DTB | 0.6 h | 34
V | DTB | 0.6 h | 32
VI | DTB | 0.6 h | 16
VII | DTB | 1.5 h | 112
Table 3. DeepStream NvDCF tracker parameters used for comparison (an illustrative regrouping of these values is sketched after the table).
Parameter | Value | Description
minDetConf | 0.1 | Min detector confidence.
enBboxUnClip | 1 | Enable BB un-clipping at the border.
maxTgtPerStream | 200 | Max targets to track per stream.
minIou4New | 0.5 | Min IOU diff. for new target.
minTrkConf | 0.1 | Min tracker confidence for the active state.
probAge | 3 | Frames to track for the valid state.
maxShadowAge | 25 | Max frames in shadow mode.
earlyTermAge | 1 | Shadow age for tentative target termination.
useUniqueID | 0 | Use 64-bit unique IDs (0 = false).
type | 0 | Data associator type.
matchType | 0 | Matcher algorithm.
chkClass | 1 | Check class match for association.
minOverallScore | 0.0 | Min overall score for association.
minSizeSimScore | 0.1 | Min BB size similarity score.
minIouScore | 0.2 | Min IOU score for association.
minVisSimScore | 0.1 | Min visual similarity score.
wVisSim | 0.1 | Visual similarity weight.
wSizeSim | 0.15 | Size similarity weight.
wIou | 0.11 | IOU score weight.
type | 1 | State estimator type.
procNoiseLoc | 2.0 | Process noise variance for location.
procNoiseSize | 1.0 | Process noise variance for size.
procNoiseVel | 0.1 | Process noise variance for velocity.
measNoiseDet | 4.0 | Meas. noise variance for detector.
measNoiseTrk | 16.0 | Meas. noise variance for tracker.
type | 1 | Visual tracker type (1 = NvDCF).
useColor | 1 | Use ColorNames feature.
useHog | 0 | Use HOG feature.
featSizeLvl | 2 | Feature image size level (1–5).
featOffsetY | −0.2 | Feature focus Y offset factor.
filterLr | 0.075 | DCF filter learning rate.
chWeightLr | 0.1 | DCF channel weight learning rate.
gaussSigma | 0.75 | Gaussian sigma for the DCF response.
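Because the parameter name "type" appears three times in Table 3 (for the data associator, the state estimator, and the visual tracker), the following sketch regroups the baseline values by module. This is an illustrative Python dictionary and not the DeepStream configuration file itself; the actual Gst-nvtracker YAML uses its own (longer) key names and section headers, which should be taken from the DeepStream 6.0.1 documentation [9].

```python
# Illustrative regrouping of the NvDCF baseline values from Table 3.
# Keys follow the abbreviated labels used in the table; the real DeepStream
# 6.0.1 tracker YAML uses different key names and section headers [9].
nvdcf_baseline = {
    "detector": {"minDetConf": 0.1},
    "target_management": {
        "enBboxUnClip": 1, "maxTgtPerStream": 200, "minIou4New": 0.5,
        "minTrkConf": 0.1, "probAge": 3, "maxShadowAge": 25,
        "earlyTermAge": 1, "useUniqueID": 0,
    },
    "data_associator": {
        "type": 0, "matchType": 0, "chkClass": 1,
        "minOverallScore": 0.0, "minSizeSimScore": 0.1,
        "minIouScore": 0.2, "minVisSimScore": 0.1,
        "wVisSim": 0.1, "wSizeSim": 0.15, "wIou": 0.11,
    },
    "state_estimator": {
        "type": 1, "procNoiseLoc": 2.0, "procNoiseSize": 1.0,
        "procNoiseVel": 0.1, "measNoiseDet": 4.0, "measNoiseTrk": 16.0,
    },
    "visual_tracker": {
        "type": 1,  # 1 = NvDCF
        "useColor": 1, "useHog": 0, "featSizeLvl": 2, "featOffsetY": -0.2,
        "filterLr": 0.075, "chWeightLr": 0.1, "gaussSigma": 0.75,
    },
}
```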
Table 4. Parameter descriptions of our methods (a sketch of how the attention thresholds are combined follows the table).
Parameter | Value | Description
th_0 | 0.25 | Relative distance between analogue corners for joining.
th_1 | 0.2 | Relative max. side-ratio error.
th_2 | 0.4 | Relative max. area increase in a frame.
th_3 | 1 m | Max. distance for joining ARs in the MP.
th_4 | 2 m | Distance for unjoining ARs in the MP.
th_5 | 0.5 m/s | Max. velocity for static disappearance.
th_6 | 0.9 | Min. cosine of the angle of attention.
th_7 | 2.5 m | Max. distance of attention from the vehicle centre.
th_8 | 4 s | Min. duration of a vehicle–staff interaction.
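As a worked example of how the last three thresholds interact, the sketch below flags a vehicle–staff interaction when a staff member stands within th_7 = 2.5 m of the vehicle centre, faces it with a cosine of at least th_6 = 0.9, and holds that attention for at least th_8 = 4 s. This is a minimal reading of Table 4; the function names, the unit-vector heading input, and the frame-level bookkeeping are illustrative assumptions rather than the published implementation.

```python
import math

# Thresholds from Table 4 (constant names are illustrative; values follow the table).
TH_6_MIN_COS_ATTENTION = 0.9    # th_6: min. cosine of the angle of attention
TH_7_MAX_ATTENTION_DIST = 2.5   # th_7: max. distance of attention from vehicle centre (m)
TH_8_MIN_INTERACTION_S = 4.0    # th_8: min. duration of a vehicle-staff interaction (s)


def is_attending(staff_pos, staff_heading, vehicle_centre):
    """Hypothetical check: is a staff member close to and oriented towards a vehicle?

    staff_pos, vehicle_centre: (x, y) positions in the metric plane, in metres.
    staff_heading: unit vector giving the staff member's facing direction.
    """
    dx = vehicle_centre[0] - staff_pos[0]
    dy = vehicle_centre[1] - staff_pos[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0 or dist > TH_7_MAX_ATTENTION_DIST:
        return False
    cos_angle = (staff_heading[0] * dx + staff_heading[1] * dy) / dist
    return cos_angle >= TH_6_MIN_COS_ATTENTION


def interaction_detected(attention_flags, fps):
    """Hypothetical rule: count an interaction once attention has been held
    continuously for at least th_8 seconds at the given frame rate."""
    needed = int(TH_8_MIN_INTERACTION_S * fps)
    run = 0
    for attending in attention_flags:
        run = run + 1 if attending else 0
        if run >= needed:
            return True
    return False


# Example: 25 FPS stream where attention is held for 5 s (125 consecutive frames).
flags = [False] * 10 + [True] * 125 + [False] * 10
print(interaction_detected(flags, fps=25))  # True
```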
Table 5. Algorithm performance comparison.
Exp_id | MTMC + DeepStream NvDCF: Precision | MTMC + DeepStream NvDCF: Recall | MTMC + MTCD (Ours): Precision | MTMC + MTCD (Ours): Recall
I | 0.66 | 0.46 | 0.94 | 0.89
II | 0.76 | 0.57 | 0.94 | 0.96
III | 0.30 | 0.36 | 0.90 | 0.81
IV | 0.91 | 0.94 | 0.97 | 0.97
V | 0.82 | 0.75 | 1.00 | 0.90
VI | 0.71 | 0.62 | 0.93 | 0.87
VII | 0.84 | 0.79 | 0.89 | 0.93
Table 6. Overview of multi-camera vehicle tracking systems: evaluation details and metrics.
Reference | Evaluation Dataset(s) | Training Datasets (Re-ID, Detection, etc.) | Evaluation Setup | Precision | Recall | Other Metrics
Hao Yang et al. [16] | Custom (20 min video footage) | Orientation ResNet: ImageNet; Embedding ResNet50: dataset not specified, pre-training assumed | 6 cameras and non-overlapping FoV | 0.76 | 0.74 | –
Xiaohui Huang et al. [11] | Custom (data from 3 intersections) | Re-ID ResNet-50: VeRi dataset | 200 vehicles | SC: 0.96; CC assoc.: 0.85 | SC: 0.95 | CC assoc. Acc: 1.00
Hongkai Zhang et al. [12] | Custom (3.03 h video footage) | Re-ID networks: CityFlow-V2, CityFlow-V2-CROP, vehicleX, and vehicleX SPGAN | Number of vehicles not specified; overlapping FoV | 0.818 (Detector P: 0.585) | 0.806 (Detector R: 0.706) | –
Yun-Lun Li et al. [17] | CityFlow V2 dataset | Re-ID ResNet101-IBN-Net: CityFlow V2 (668 instances for training) | 668 instances for evaluation | 0.73 | 0.42 | –
Peilun Li et al. [15] | AI City Challenge (validation split implied) | Re-ID ResNet50: AI City Challenge (333 vehicles for train/val) | Approx. 167 vehicles (AI City val. split of 333 total) | 0.78 | 0.64 | –
Elena Luna et al. [14] | CityFlow benchmark dataset | Re-ID ResNet-50: CityFlow and VeRi datasets | 129 annotated vehicles, 4 cameras, and overlapping FoV | 0.572 (best) | 0.719 (best) | –
Ours | Own labelled dataset | Default parameters | 297 vehicles (6.3 h), 4 cameras, and overlapping FoV | 0.89 (average) | 0.91 (average) | –
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

