Lightweight Indoor Multi-Object Tracking in Overlapping FOV Multi-Camera Environments

Multi-Target Multi-Camera Tracking (MTMCT), which aims to track multiple targets within a multi-camera network, has recently attracted considerable attention due to its wide range of applications. The main challenge of MTMCT is to match local tracklets (i.e., sub-trajectories) obtained by different cameras and to combine them into global trajectories across the multi-camera network. This paper addresses the cross-camera tracklet matching problem in scenarios with partially overlapping fields of view (FOVs), such as indoor multi-camera environments. We present a new lightweight matching method for the MTMC task that employs similarity analysis for location features. The proposed approach comprises two steps: (i) extracting the motion information of targets based on a ground projection method and (ii) matching the tracklets using similarity analysis based on the Dynamic Time Warping (DTW) algorithm. We use a Kanade–Lucas–Tomasi (KLT) algorithm-based frame-skipping method to reduce the computational overhead in object detection and to produce a smooth estimate of the target’s local tracklets. To improve matching accuracy, we also investigate three different location features to determine the most appropriate feature for similarity analysis. The effectiveness of the proposed method has been evaluated through real experiments, demonstrating its ability to accurately match local tracklets.


Introduction
Multi-Target Multi-Camera Tracking (MTMCT) has recently received considerable attention due to the growing demand for intelligent monitoring and surveillance systems. It aims to track multiple targets of interest and infer a complete trajectory for each target across a multi-camera network [1]. MTMCT can be applied to various tasks such as video surveillance [1][2][3], city-scale traffic management [4], smart buildings [5,6], and in-store customer analysis [7].
Owing to the rapid development of object detection techniques [8][9][10][11], most state-of-the-art MTMCT methods employ a two-phase pipeline, namely detection-and-tracking [12], to focus on the tracking functionality. In the first phase, they detect targets using a modern object detector and generate sets of local trajectories for each detected target within a single camera. To track targets across the entire multi-camera network, cross-camera tracklet matching is performed in the second phase, which matches local tracklets across all the cameras to generate a complete trajectory for each target [1].
However, this is a challenging task, as these methods are inherently prone to camera field-of-view (FOV) issues, such as occlusion, i.e., the blind areas of camera views, and/or significant changes in the visual appearance of moving targets. As a result, the trajectory of each target generated within each camera is easily divided into multiple local tracklets, i.e., sub-trajectories. In addition, the tracker can generate incorrect duplicate local tracklets for the same target, which makes the cross-camera tracklet matching problem even more challenging.

The rest of the paper is organized as follows: Section 2 discusses the related work and illustrates the preliminaries. Section 3 details the proposed framework and its main components. Section 4 demonstrates the evaluation results. Section 5 concludes the study.

Multi-Target Multi-Camera Tracking (MTMCT)
The MTMCT task aims to infer a complete trajectory for each target in multi-camera environments. It usually comprises two steps: (1) generating the local tracklets of all targets within each camera and (2) matching the local tracklets of the same global target across the multi-camera network. In recent years, studies on the MTMCT task based on various approaches have been actively conducted.
Fleuret et al. [15] proposed a solution that utilizes a probabilistic occupancy map (POM) to approximate the probabilities of occupancy and combines it with the usual color and motion attributes. Berclaz et al. [16] optimized the MTMCT task by employing the POM and K-Shortest Path (KSP) algorithm. Hu et al. [17] and Eshel and Moses [18] presented a matching algorithm for the same target across multiple cameras using the homography correspondence. Similarly, the proposed method employs homography; however, it uses the DTW algorithm to match global IDs. Hou et al. [19] presented a new approach focusing on local neighboring data matching using a Locality Aware Appearance Metric (LAAM) composed of a metric network. Bredereck et al. [20] matched local tracklets of all cameras using the greedy matching association method. As an example of solving the MTMCT task using hierarchical clustering, Zhang et al. [21] proposed an approach that uses the distance matrix between averaged Re-ID features and applies re-ranking [22] to cluster local tracklets. Jiang et al. [23] solved the problem of trajectory association under orientation variations and occlusions, and they improved the matching efficiency using camera topology. Xu et al. [24] presented an approach using a hierarchical composition model for MTMCT. They re-formulated MTMCT as a composition structure optimization problem. He et al. [1] obtained the tracklet-TID assignment matrix with the Restricted Non-negative Matrix Factorization (RNMF) algorithm and used it to match the tracklet to the target ID (TRACTA). You et al. [25] used Optical-based Pose Association (OPA) for MTMCT and solved the occlusion problem using local pose matching. In addition, the distance problem caused by fast motion was reduced by applying optical flow. Wu et al. [26] proposed a three-step cooperative tracking method to track people in a multi-camera environment through tracking token transfer. Zhang et al. 
[27] proposed an online (real-time) tracking framework that improves cross-camera person recall through appearance and spatial-temporal features. The MTMCT task has been widely applied to both vehicles and people. Hsu et al. [28] proposed a vehicle MTMCT framework, the Trajectory-based Camera Link Model (TCLM), which obtains spatial-temporal information and improves MTMCT performance by reducing the Re-ID candidate search process.

Multi-Object Tracking (MOT)
Multi-Object Tracking (MOT), the first step of the MTMCT process described above, can be viewed as the problem of tracking multiple objects in a single camera. MOT aims to estimate and associate a bounding box and an ID for each of a number of objects appearing in an image. As our study utilizes the object information (tracklets) of local image data generated by MOT, the accuracy of the proposed method relies on the MOT performance. The MOT task can be divided into two methods, namely detection-by-tracking and tracking-by-detection, depending on how the detection results are used. The tracking-by-detection method has recently attracted attention with the advent of high-performance object detection models. Moreover, it can be categorized into two approaches: (1) batch tracking, in which data are associated using the information of all frames, and (2) online tracking, in which data are associated using only past and current frame information. SORT, proposed by Bewley et al. [8], is a popular online tracking method that predicts the tracklet position in a new frame using a Kalman filter [29] and associates the data using the Hungarian algorithm [30]. DeepSORT [9] mitigates the ID-switching problem caused by occlusion, which is a weakness of SORT, with a deep appearance descriptor, and it enables more accurate tracking with a cascaded matching strategy. FastMOT [31], used in our study, resolves the bottleneck caused by DeepSORT's two-stage tracker by running the detector and feature extractor only every few frames. In addition, motion compensation makes it possible to track objects with a moving camera. Most MOT-related studies deal with outdoor tracking, such as video surveillance and autonomous driving, which clearly differs from indoor environments. Therefore, Liu et al.
[32] presented the depth-enhanced tracking-by-detection (DET) framework optimized for indoor environments, where occlusion frequently occurs. ByteTrack [11], which has recently achieved state-of-the-art results in the field of MOT, dramatically improves performance by also associating detection boxes with low detection scores. Although ByteTrack achieves the highest performance, we wanted to take advantage of FastMOT's skip function, which enhances the frame processing speed. In addition, there are various other MOT methods [10,[33][34][35][36][37][38][39][40][41][42][43][44][45]. In the experimental stage, the accuracy of global ID matching is measured while changing the skip parameter value. In the proposed framework, the MOT model can be replaced by other models.

System Design
This section presents an overview of the proposed system and its main components. Figure 1 illustrates the high-level overview of our proposed system. We consider a multi-camera network comprising M cameras. The proposed system takes M input video clips from each camera and generates a complete trajectory for each target across the M camera network. It consists of three components: (1) a Multi-Object Tracker (Section 3.2), (2) Ground Projector (Section 3.3), and (3) Global Trajectory Mapper (Section 3.4); these components perform two steps: (i) local tracklet generation and (ii) complete global trajectory generation. As shown in Figure 1, the first two components handle the first step, and the latter one performs the second step. We explain each component and phase in detail in the following sub-sections.

Multi-Object Tracker
Given input video V_i from the i-th camera (i = 1, ..., M), the Multi-Object Tracker detects a set of targets and generates their local tracklet information. As mentioned in the previous section, we rely on a state-of-the-art multiple object tracker for this task. To detect the targets, i.e., people, in the field of view (FOV) and their bounding boxes, we employ FastMOT [31], which significantly accelerates the object tracking system to run in real time.
As the output for input video V_i (i = 1, ..., M), the Multi-Object Tracker generates a set of frame-by-frame local object detection information I^L_i, which is given by

I^L_i = {I^L_i,1, I^L_i,2, ..., I^L_i,K},   (1)

where K is the number of frames in video V_i and I^L_i,k denotes the set of local image tuples for the targets detected in the k-th frame (k = 1, ..., K). Note that each video V_i may have a different number of frames, which should strictly be denoted as K_i; however, we omit the subscript i to simplify the notation throughout the paper.
Let Π_i denote the set of targets' local IDs (LIDs) observed in video V_i and π_i,k denote the set of LIDs observed in the k-th frame of V_i; then Π_i = ∪_{k=1..K} π_i,k. The set of local image information I^L_i,k for the k-th frame (k = 1, ..., K) is composed of a series of tuples u_i,k(L), each with three attributes: (1) the frame number k, (2) the target's local ID L, and (3) the local foot coordinates (the bottom center of the bounding box, which is often assumed to lie on the ground):

u_i,k(L) = (k, L, (x, y)_L),   (2)

where the coordinate (x, y)_L is the reference location of the target with LID L, i.e., the bottom center of its bounding box. One of the challenges at this stage is that the object detector exhibits large temporal variations when generating the bounding box of a target across frames; fluctuations may occur owing to motion blur, partial occlusion, pose changes, and other factors. This causes short-term fluctuations in the derived local foot coordinates (x, y)_L, which act as short-term noise.
To handle this issue, we use a simple yet effective strategy, known as Frame Skipping, which detects and tracks only the selected frames at a specific sampling period s. Frame Skipping predicts target positions using the KLT tracker without executing the detector and feature extractor for the frames between two selected frames. For a skipping period s and a certain selected k-th frame, I i,k+1 , I i,k+2 , . . . , I i,k+s−1 are estimated by the KLT tracker. Frame Skipping alleviates the bottleneck of traditional MOT methods and enables real-time execution. In addition, we will show that the accuracy of the proposed matching algorithm is improved by removing noise in the calculation of the direction of each target (vector features), which will be described in Section 3.4.
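To make the schedule concrete, the following sketch shows how detection alternates with lightweight propagation for a skip period s. It is a minimal illustration of the scheduling only: `detect` and `propagate` are placeholder callables standing in for FastMOT's detector/feature extractor and the KLT optical-flow tracker, respectively.

```python
def track_with_frame_skipping(frames, detect, propagate, s):
    """Run the full detector only on every s-th frame; for the s-1 frames
    in between, propagate the last result with a cheap tracker (KLT in
    our pipeline). Returns one result per frame."""
    results = []
    last = None
    for k, frame in enumerate(frames):
        if k % s == 0:
            last = detect(frame)           # full detector + feature extractor
        else:
            last = propagate(last, frame)  # lightweight KLT-style update
        results.append(last)
    return results
```

With s = 5, the detector runs only on frames 0, 5, 10, ..., so the heavy stage executes on 1/5 of the frames while every frame still receives a position estimate.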

Bird's Eye View (BEV) Ground Projector and Feature Extractor
The Ground Projector takes the set of local image information I^L = {I^L_1, I^L_2, ..., I^L_M} and, for each I^L_i (i = 1, ..., M), produces the set of coordinates (x, y)^P projected onto the target coordinate map by using a homography matrix H_i; a homography is a transformation (a 3 × 3 matrix) that maps the points in one image to the corresponding points in another image [46].
To obtain the homography matrix H_i (∈ H) for V_i, four point correspondences, i.e., eight point coordinates, are needed: four points in the image data and the four corresponding points on the projection map. We obtained these reference points in the experiments by placing a rectangular grid carpet and measuring its edge positions offline.
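As an illustration of how H_i can be computed from the four measured correspondences, the sketch below implements the standard Direct Linear Transform (DLT). The paper does not specify the estimation routine, so this is one common choice; an OpenCV `findHomography` call would serve equally well.

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate the 3x3 homography H mapping src -> dst from four
    point correspondences via the Direct Linear Transform (DLT)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the null vector of A, i.e., the last right singular vector.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def project(H, pt):
    """Project an image point onto the ground map (homogeneous division)."""
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]
```

Given H_i, each local foot coordinate (x, y)_L is mapped to its ground-map coordinate (x, y)^P with `project`.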
In our experiments, we have observed a well-known camera distortion problem, i.e., barrel distortion [47], at the edges of frames ( Figure 2). This distortion significantly affects on-the-ground projection, especially depending on the installation height and angle of the camera. This can cause a decrease in accuracy with regard to the performance of our similarity-based global ID-matching algorithm.
Therefore, camera calibration including distortion correction is required to obtain the accurate matrix H and improve the matching accuracy. The parameters derived for camera calibration can be used continuously once acquired; however, if the camera's angle of view is changed, it must be derived again.
One key observation to address this challenge is that the error caused by camera distortion significantly affects the derivation of the targets' coordinates but has a negligible impact on the derivation of the targets' moving direction. Based on this observation, we adopt the targets' movement direction as the key feature for our global ID-matching algorithm instead of a costly calibration process. In particular, we investigate the vector-based features of moving targets for the global ID-matching algorithm, which will be described in Section 3.4.

Each local image tuple u_i,k(L) in Equation (2), which contains the local coordinates (x, y)_L, is projected into the corresponding coordinates (x, y)_P on the target map (Figure 3), generating a set of projected local image tuples I^P_i,k for the k-th frame (k = 1, ..., K):

I^P_i,k = {(k, L, (x, y)_P) | L ∈ π_i,k}.   (3)

Then, for each target with local ID x ∈ Π_i, we can construct the local tracklet T^L_i,x as the set of its tuples over the frames k = 1, 2, ..., K in V_i:

T^L_i,x = {u_i,k(x) | k = 1, ..., K}.   (4)

Let T^L_i denote the set of local tracklets for V_i extracted from the i-th camera, which is given by

T^L_i = {T^L_i,1, T^L_i,2, ..., T^L_i,N},   (5)

where N = |∪_{i=1..M} Π_i| is the total number of targets observed across all the cameras. Note that the number of targets appearing in V_i may be less than N because some targets may never appear in camera i.
For example, when the p-th local ID is detected in the image data over the frame range [1, 4, ..., 1024], T_i,p for V_i is given as T_i,p = {u_i,1(p), u_i,4(p), ..., u_i,1024(p)}. Figure 3 shows the local tracklet set T on the projection map. The set of M local tracklet sets generated across the multi-camera network is used as the input for the global ID-matching module.
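Constructing T_i,p from the per-frame tuples amounts to a group-by on the local ID. A minimal sketch, assuming the projected tuples are plain `(frame, lid, (x, y))` triples:

```python
from collections import defaultdict

def build_tracklets(projected_tuples):
    """Group projected per-frame tuples (k, lid, (x, y)) into per-target
    local tracklets T[lid] = [(k, (x, y)), ...], ordered by frame number."""
    tracklets = defaultdict(list)
    for k, lid, xy in projected_tuples:
        tracklets[lid].append((k, xy))
    for lid in tracklets:
        tracklets[lid].sort()  # restore chronological order
    return dict(tracklets)
```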

Global ID Matching
Finally, given the M sets of local tracklets T^L_i (i = 1, 2, ..., M) from the M cameras, the global ID-matching component uses the Dynamic Time Warping (DTW) algorithm to perform similarity analysis. The DTW algorithm measures the similarity of two sequences and has the advantage of working even when the lengths of the input sequences differ. We use the two-dimensional sequence sets S_i,a (∀a ∈ Π_i) generated from the coordinate information of T_i,a to measure the similarity.
The features used for input sequence generation are the (i) scalar, (ii) vector, and (iii) unit vector of each target's movement. In particular, to generate S_i,a for target a, the projected coordinates (x, y)^P_k(a) and (x, y)^P_{k+1}(a) of two adjacent frames (i.e., the k-th and (k+1)-th frames) in tuples u_i,k(a) and u_i,k+1(a) ∈ T_i,a are used. The k-th element of S_i,a generated using the (i) scalar, (ii) vector, and (iii) unit vector is s_k, v_k, and w_k, respectively, given by:

s_k = ||(x, y)^P_{k+1}(a) − (x, y)^P_k(a)||,   (6)

v_k = (x, y)^P_{k+1}(a) − (x, y)^P_k(a),   (7)

w_k = v_k / s_k.   (8)

Algorithm 1 generates sequences using the aforementioned features. To compare the similarity between synchronized trajectories, we compute the sequence of each target's local tracklet over the same period of time (frames).

Algorithm 1: Sequence Generation
1: for all T^L_i,a ∈ T^L_i do
2:   for all u_i,k(a) ∈ T^L_i,a do
3:     Generate v_k (or s_k, w_k) using Equation (7) (or Equation (6), Equation (8), respectively)
4:     and add v_k (or s_k, w_k) to S_i,a
5:   end for
6: end for

Given the sequence sets of each target, we use the DTW distance function d_DTW(S_i,a, S_j,b) for two different sequences from the i-th and j-th cameras (i, j ∈ {1, 2, ..., M}, i ≠ j) to calculate the distance value D:

D(S_i,a, S_j,b) = d_DTW(S_i,a, S_j,b),   (9)

where a lower D value indicates a higher similarity. We use Algorithm 2 to perform global ID matching by calculating a tracklet similarity matrix R. Under the overlapping FOV condition that a target assigned local ID p in camera i also appears in other cameras j (j ∈ {1, 2, ..., M}, j ≠ i), we calculate the distance D(S_i,p, S_j,q) between S_i,p and S_j,q (S_j,q ∈ S_j) using Equation (9) and add it to a tracklet similarity matrix R_i,j.

Algorithm 2: Global ID Matching
1: for i = 1, 2, ..., M − 1 do
2:   Create an empty matching list G_i
3:   for all S_i,p ∈ S_i do
4:     for all S_j (j = i + 1, i + 2, ..., M) ∈ S do
5:       for all S_j,q ∈ S_j do
6:         Calculate the DTW distance D(S_i,p, S_j,q) using Equation (9)
7:         Set the element of row p and column q in R_i,j to D(S_i,p, S_j,q), i.e., R_i,j(p, q) = D(S_i,p, S_j,q)
8:       end for
9:     end for
10:   end for
11:   Calculate the matching matrix A_i,j for each j according to Equation (10)
12:   Add all pairs with A_i,j(p, q) = 1 to G_i in the form of a tuple (p, q)
13: end for
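The sequence generation of Algorithm 1 and the DTW distance of Equation (9) can be sketched as follows. The DTW here is the textbook dynamic-programming recurrence with a Euclidean local cost, which the paper does not further specify, so treat it as one reasonable instantiation.

```python
import numpy as np

def motion_sequences(coords):
    """From a target's projected foot coordinates, build the three candidate
    input sequences: scalar s_k (stride length), vector v_k (displacement),
    and unit vector w_k (direction only), per Equations (6)-(8)."""
    coords = np.asarray(coords, dtype=float)
    v = np.diff(coords, axis=0)              # v_k
    s = np.linalg.norm(v, axis=1)            # s_k
    w = v / np.maximum(s, 1e-9)[:, None]     # w_k (guard against zero stride)
    return s, v, w

def dtw_distance(A, B):
    """Classic O(|A||B|) DTW; handles sequences of different lengths, as
    required when two cameras observe a target for different durations."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(A[i - 1], float) - np.asarray(B[j - 1], float))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Two tracklets of the same target walking in the same direction yield a small distance even if one camera observed fewer frames, which is exactly the property the matching step relies on.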
Given R_i,j, we calculate a matching matrix A_i,j ∈ {0, 1}^{N×N}, where each element A_i,j(p, q) is a binary value representing a matching (A_i,j(p, q) = 1) or non-matching (A_i,j(p, q) = 0) decision. If tracklets p and q have a high similarity, i.e., a small distance D(S_i,p, S_j,q), we should have A_i,j(p, q) = 1; conversely, for a low similarity, A_i,j(p, q) = 0, which implies that p and q are not the same target. This bipartite-graph-matching problem can be solved through the following optimization problem:

A*_i,j = argmin_A ||A ⊙ R_i,j||_2,  subject to  Σ_q A(p, q) ≤ 1 for all p  and  Σ_p A(p, q) ≤ 1 for all q,   (10)

where ⊙ denotes element-wise matrix multiplication and || · ||_2 denotes the L2 norm of the input matrix. The constraints ensure the mutual exclusion of trajectories, such that each local tracklet is matched to at most one tracklet in the other camera. This optimization problem can be solved efficiently by the Hungarian algorithm [30]. Finally, given the sets G = {G_1, G_2, ..., G_M} containing the local tracklet matching information, we assign a global ID to each matched pair and create a set I^G for V. I^G has the same structure as I^L except that the IDs are global rather than local, and each coordinate is the average of the coordinates of the local IDs mapped to the same global ID. We assign a new global ID to the elements with the tuple (p, :) and/or (:, q) for ∀(p, q) ∈ G_i (∀G_i ∈ G). The overall pipeline is described in Algorithm 3.
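The mutually exclusive assignment over R_i,j can be solved with an off-the-shelf Hungarian solver. The sketch below uses SciPy's `linear_sum_assignment` and adds a distance gate `tau` to reject pairs that are assigned but still dissimilar; the gating threshold is our illustrative addition, not a parameter from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tracklets(R, tau):
    """Given the DTW distance matrix R (rows: tracklets of camera i,
    columns: tracklets of camera j), find the minimum-cost one-to-one
    assignment and keep only pairs whose distance is below tau."""
    rows, cols = linear_sum_assignment(np.asarray(R, dtype=float))
    return [(p, q) for p, q in zip(rows, cols) if R[p][q] < tau]
```

The resulting (p, q) pairs populate G_i and receive a common global ID.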

Algorithm 3: Overall Pipeline to Generate Global Information
Input: a set of input videos V = {V_1, V_2, ..., V_M} and the homography matrices H
1: for all V_i ∈ V do
2:   Generate local video information I^L_i using a multi-object tracker [31]
3: end for
4: for all local video information I^L_i ∈ I^L do
5:   Calculate the projected foot coordinates (x, y)^P using the homography matrix H_i and generate projected video information I^P_i
6: end for
7: for all I^P_i ∈ I^P do
8:   for all p ∈ Π_i do
9:     Generate local tracklet T_i,p and create the sequence set S_i,p using Algorithm 1 (T_i,p ∈ T_i, S_i,p ∈ S_i)
10:   end for
11: end for
12: Calculate similarities using DTW and generate the matched local ID list set G = {G_1, G_2, ..., G_M} using Algorithm 2
13: For ∀(p, q) ∈ G_i (∀G_i ∈ G), assign a new global ID to the elements with the tuple (p, :) and/or (:, q)
14: Create empty global video information I^G
15: while current frame < the end of frames do
16:   In I^P, find all IDs existing in the current frame, update them to their global IDs, and calculate the average coordinates
17:   Add the updated information to I^G
18: end while

Experimental Setup
To evaluate our method, we have conducted experiments using Wisenet's four-channel camera systems, SNK-B73047BW [48]. For experiments, we constructed overlapping FOV environments by installing up to four cameras in various places including classrooms and daycare centers, as shown in Figure 4. Each video had three different configurations: 180,000, 135,000, and 90,000 frames.
In the first stage of MTMCT, wherein the Ground Projector generates the sets of local tracklets (Section 3.3), we consider different parameters for the KLT algorithm-based frame-skipping method with a fixed sampling interval, namely the skip value: 1 (no skip), 5, and 10. A larger skip value reduces the number of frames used for target tracking and speeds up the computation. The frames between two selected frames are inferred by applying the KLT tracker. To evaluate its effect and find the desired skip parameter, we compared the performance on the same dataset with different skip parameters. Moreover, as explained in Section 3.3, we considered the moving direction of the targets to filter out the errors in tracklet positions caused by camera distortion. In the second step, we performed global ID matching based on the similarity analysis between projected tracklets using the DTW algorithm and measured the accuracy. For a given coordinate sequence of each target, three different features, namely the (i) scalar, (ii) vector, and (iii) unit vector based on Equations (6)-(8), respectively, were used to generate input sequences for the DTW similarity analysis, and we compare their effect on the accuracy of global ID matching.

Evaluation Results
The accuracy of the proposed global ID-matching method mainly relies on the tracklet data obtained from the output of the Multi-Object Tracker. This implies that the performance of the Multi-Object Tracker used has a significant impact. In particular, it is sensitive to the problems of ID switching and fragmentation. ID switching is a phenomenon in which an existing ID assigned to object X is incorrectly assigned to another object Y. It can be caused by several reasons, including occlusion, where other objects partially or totally obscure the identified object during a short period. For instance, Figure 5 shows a case of ID switching due to occlusion when two targets assigned with IDs 1 and 2 intersect for a short time.
In order to evaluate the performance of our proposal, therefore, it is important to understand the basic characteristics of the object tracker and obtain a ground-truth dataset for performance evaluation. To this end, we collected ground-truth values by preprocessing the experimental image data to identify information on these ID-switching problems and measured the number of incorrect ID switching occurrences.
First, we examine the effect of the Frame Skipping method applied to alleviate the fundamental accuracy problem of the MOT. Table 1 lists the number of incorrect ID switches for each camera in a four-camera network with three different skip values. As ID switching depends on the state of the video image, it varies with the installation location and angle of the camera. In our experiment, Re-ID failure occurred most frequently in camera 2, i.e., in 48 out of a total of 150 cases. This can be attributed to the fact that there are no obstacles near the wall where camera 2 was installed; thus, targets can move directly under camera 2, so only their upper bodies are captured, which reduces the accuracy of Re-ID. Figure 6 shows the global ID-matching accuracy for three target persons in overlapping FOV indoor environments covered by two cameras, with three different target movement features (i.e., scalar, vector, and unit vector) and three different skip values (i.e., no skip, skip 5, and skip 10). Table 2 lists their F1 scores. In calculating the accuracy of the assignment results, an ID assignment is considered successful only if the global ID-matching results are correctly assigned to all three targets across the cameras. As shown in Figure 4, the overlapping area of cameras 1 and 4 was relatively wide, and camera 1 was deployed to easily observe the obstacles around the window. Thus, the experiments using cameras 1 and 4 and cameras 1 and 3 in Figure 6a,b show very high accuracy for all three motion features. As shown in Figure 6c,d, the results involving camera 2 exhibit relatively low performance. We analyzed the causes of matching failure and identified two main reasons. First, the bounding boxes produced by the Multi-Object Tracker fluctuated severely in the videos from camera 2, which had a greater impact on the similarity analysis between sequences obtained from vector motion features that use direction information.
Next, we observed high Re-ID errors in the videos from camera 2 due to its deployment environment. Table 3 shows the Re-ID failure rate for the results in Figure 6d for cameras 2 and 3, which yielded the lowest performance among the four combinations. For the original video (no skip) and sampling with a skip value of 5 (skip 5), global ID-matching failures due to Re-ID failures account for more than half of all failures. This implies that most errors occur not because of the proposed matching solution but due to the inherent Re-ID problem of the Multi-Object Tracker. In short, the Multi-Object Tracker's Re-ID problem (which is outside the scope of this paper) greatly affects the performance of the proposed technique. For the combination of cameras 1 and 2, frame skipping with a skip value of 5 (skip 5) significantly improved the performance of the unit vector feature, although the accuracy degrades again at a higher skip value, i.e., 10. This is because 'skip 10' predicts the target's position over a longer interval, resulting in a loss of original information. Based on these results, we can observe the effect of the KLT-based frame-skipping technique on performance: despite the smaller number of frames used to generate sequence sets compared with the original data, the frame-skipping scheme improved the matching accuracy.
Next, we conducted experiments with additional people in the overlapping FOV indoor environment with two cameras. Figure 7 shows the experimental results for five and ten persons with three different frame-skipping values and motion features. Table 4 lists their F1 scores. In the five-person experiment, the proposed scheme achieved an accuracy of 100% with a frame-skipping value of 5 and the scalar motion feature. In the ten-person experiment, a trial was counted as a failure if even one of the ten matchings was incorrect, so a significant decrease in matching accuracy was observed. Note that the ten-person experiment is a very congested scenario. Nevertheless, we observed that frame skipping improved the performance for all motion features in these experiments. When the vector feature was used, the accuracy was lower than that of the other two features. This is because the vector takes into account both the stride length and the direction of the target; as previously explained, accurate stride measurements are not possible owing to fluctuations of the bounding box and errors in the ground projection. Therefore, we can confirm that the scalar and the unit vector are the motion features best suited to the proposed method. Indoor environments have blind spots due to obstacles and other factors, which decrease MOT accuracy. Therefore, it is necessary to intentionally construct an overlapping FOV environment using redundant cameras. In this regard, in the following experiment, we conducted measurements with more cameras to test whether our global ID-matching method shows stable performance with more than two cameras. Figure 8a,b show high accuracy using three and four cameras, respectively, for three persons. In calculating the accuracy, a trial was considered a success only if the targets of all the cameras were matched correctly. Despite the increase in the number of cameras, the proposed technique achieved high accuracy.
The best performance was achieved with a skip value of 5 for the vector feature and a skip value of 10 for both the scalar and unit vector features. We investigated the errors caused by the Re-ID of the Multi-Object Tracker among the causes of matching failure for four cameras and summarize them in Table 5. The results indicate that most errors (in the experiments with the unit vector and scalar features) are not related to our algorithms but rather are due to the inherent noise in the sequence sets caused by Re-ID problems. Finally, we studied how reducing the amount of input data, and thus the computational load, affects matching accuracy. As explained in Section 3.2, sequence sets were generated using Algorithm 1 and KLT-based frame skipping. In addition, we further reduced the amount of data required to generate sequence sets by selectively using frames from each video V_i. Given V_i with the default frame rate of 24 frames/s, we use interval sampling (also known as systematic sampling), which selects every k-th frame of the video; we tested k = 1 (default/original), 2 (half frame rate), and 3 (one-third), corresponding to 24, 12, and 8 frames/s, respectively. Figure 9 shows the experimental performance for two cameras and five persons considering the vector, unit vector, and scalar features. We observed a clear effect of the frame rate on performance for all three movement features and a significant performance improvement when using the half rate of 12 frames/s instead of the original frame rate. If the frame rate is too low (e.g., 8 frames/s), the accuracy decreases again. When using unit vectors for generating sequence sets, the original video without frame skipping at 24 frames/s yielded an accuracy of 90%; this improved to 100% when applying frame skipping with a skip value of 5 at the half frame rate of 12 frames/s, where the amount of data corresponds to 1/10 of the original.
In the case of the scalar feature, sampling with a skip value of 5 achieved an accuracy of 100%, and there was no performance degradation even when the amount of information was reduced to one-third at 8 frames/s. These results imply that the proposed approach achieves high accuracy with a much smaller amount of data (as little as 1/10 of the original) and a correspondingly lightweight computational overhead for the DTW similarity and matching tasks.
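The interval sampling used above is a one-liner, and combining it with the skip schedule shows where the roughly 1/10 data reduction for "skip 5 at 12 frames/s" comes from; the bookkeeping below is our illustration of that arithmetic.

```python
def interval_sample(frames, k):
    """Systematic sampling: keep every k-th frame (24 fps -> 24/k fps)."""
    return frames[::k]

def detector_fraction(n_frames, k, s):
    """Fraction of the original frames on which the full detector runs when
    interval sampling (every k-th frame) is combined with skip period s."""
    kept = interval_sample(list(range(n_frames)), k)
    return len(kept[::s]) / n_frames
```

For k = 2 and s = 5, the detector runs on 1/10 of the original frames, matching the data reduction reported above.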

Conclusions
In this paper, we proposed a new lightweight matching method for MTMC tracking consisting of two steps: (i) extracting targets' motion information based on a ground projection method, and (ii) matching the tracklets using similarity analysis based on the Dynamic Time Warping (DTW) algorithm. We reduced the computational overhead by leveraging a KLT-based frame-skipping and smoothing method when using targets' location information to generate input sequence sets for our matching algorithm. In addition, three different location features, namely the scalar, vector, and unit vector, were studied to derive the best input feature for similarity analysis and improve matching accuracy. Extensive experiments demonstrated the effectiveness of the proposed method, showing that our scheme achieves high accuracy in most overlapping FOV environments. In dense environments, performance degradation occurred, but we showed that many errors arise not from the proposed matching solution but from the inherent Re-ID problem of the Multi-Object Tracker.
Limitation and Future Work. A significant limitation is that the accuracy of the proposed framework strongly relies on the performance of the Multi-Object Tracker, since the tracklet data are obtained from its output. Although we alleviate the inherent position error caused by camera distortion by exploiting the moving direction of the target instead of its absolute position, the similarity-based matching still suffers from these errors, even with the proposed features, when the camera distortion is severe.
In the future, we would like to extend our work in the following three directions: (i) employing adaptive feature selection for different objects' movement characteristics, (ii) optimizing tracklet matching using graph models and/or a Bayesian formulation, and (iii) reducing the computational complexity of DTW for obtaining similarity measures between feature sequences by applying fast DTW algorithms.