Multiple Objects Fusion Tracker Using a Matching Network for Adaptively Represented Instance Pairs

Multiple-object tracking is affected by various sources of distortion, such as occlusion, illumination variations and motion changes. Overcoming these distortions by tracking on RGB frames, such as shifting, has limitations because of material distortions caused by RGB frames. To overcome these distortions, we propose a multiple-object fusion tracker (MOFT), which uses a combination of 3D point clouds and corresponding RGB frames. The MOFT uses a matching function initialized on large-scale external sequences to determine which candidates in the current frame match with the target object in the previous frame. After conducting tracking on a few frames, the initialized matching function is fine-tuned according to the appearance models of target objects. The fine-tuning process of the matching function is constructed as a structured form with diverse matching function branches. In general multiple object tracking situations, scale variations for a scene occur depending on the distance between the target objects and the sensors. If the target objects in various scales are equally represented with the same strategy, information losses will occur for any representation of the target objects. In this paper, the output map of the convolutional layer obtained from a pre-trained convolutional neural network is used to adaptively represent instances without information loss. In addition, MOFT fuses the tracking results obtained from each modality at the decision level to compensate the tracking failures of each modality using basic belief assignment, rather than fusing modalities by selectively using the features of each modality. Experimental results indicate that the proposed tracker provides state-of-the-art performance considering multiple objects tracking (MOT) and KITTIbenchmarks.


Introduction
Object tracking is an important task in various research areas, such as surveillance, sports analysis, human-computer interaction and autonomous driving systems. As a result, various forms of tracking are being actively researched; these include multiple objects tracking (MOT), tracking using multiple sensors and model-free tracking. In this paper, we propose a new multiple-object tracker that uses multi-sensor modality. The main objective of MOT is to estimate the states of target objects from given frames in a video sequence. However, despite much success, MOT techniques still face various challenges caused by illumination and scale changes, occlusion and other disturbance factors.
One way to handle distortions is to model them into trackers a priori. Affine transformation [1], illumination invariance [2] and occlusion detection [3,4] have been widely applied to trackers to deal with disturbances. While trackers in which distortion handlers are embedded are able to overcome a specific disturbing factor, tracking may fail when other distortions are introduced. Another way to maintain tracking performance despite distortions is to adaptively train the appearance (and/or motion) model of the object tracker online. Although the appearance models are adaptively CNNs have previously been employed in trackers [11,12]. Those trackers used only the fixed outputs of layers from CNN architectures as a representation of target objects. However, because target objects have different scale levels, information loss can occur [13]. To solve this problem, we adaptively select the convolutional layer from a pre-trained CNN according to the scales of the target objects as a representation. We train a matching function that can be well fitted to generic distortions without adaptively updating models during testing, while still achieving state-of-the-art performance. To achieve this goal, large-scale external video sequences are used to initialize a matching function. For the tracker, we train a CNN comprising two sub-networks sharing weights, called the matching network, in an offline manner. During testing, the matching scores between a set of target objects in frames k and k + 1 are computed using the learned matching network with frozen weights. At this point, the training datasets used to initialize the matching network are not overlapped with the sequences for testing because our aim is to build a generic matching function that can be applied to any unseen object.
We use CNNs that are pre-trained on large-scale annotated datasets. Although many kinds of pre-trained CNNs for representing RGB images are available, to the best of our knowledge, no CNNs pre-trained on depth data exist. To overcome this problem, we apply supervision transfer [14] to transfer from learned representation on large-scale annotated data (source data) to another modality (target data) for tracking. Supervision transfer is generated on paired data modalities and facilitates the representation of data without annotations.
To compensate the limitations of tracking in RGB frames, in this paper, we propose a fusion tracker that combines the tracking results of RGB frames and depth frames. In particular, to solve the aforementioned problems besetting feature-fusion tracking schemes, we fuse tracking results at the decision level. In the proposed decision-level-fusion scheme, fusion is performed when MOT is completed on each modality using basic belief masses (BBMs).
For the first few frames, tracking is performed by using the initialized matching network. After performing tracking on a few frames, the initialized matching network is fine-tuned based on the weighted average of the matching score to update the target appearance models. At this point, fine-tuning is not performed on the same matching network iteratively. According to the matching score extracted from the matching network, we construct a structured path for fine-tuning. Targets that have a similar appearance model activate the same path of fine-tuned matching networks while the targets' difference to previous paths is updated on a newly-created node. The proposed MOFT performs well without this update procedure, but the MOFT with structured target appearance models can more precisely track temporally-changed target states.
The contributions of this work are as follows: (1) matching function learning that can be applied without adaptive updating from large-scale external video sequences; (2) a proposed fusion tracker that fuses at the decision level; and (3) the structured fine-tuning strategy for the matching function. Our proposed tracker achieved state-of-the-art tracking performance on published object tracking benchmark datasets. After the optimization of our code, we plan to submit it to benchmark competitions.

Related Work
Deep learning tracking: Several trackers have used neural networks with training models trained online for tracking [15][16][17][18][19][20]. Among them, the trackers proposed by Danelljan et al. [15] and Nam and Han [16] exhibit state-of-the-art performance for the tracking task. However, because online neural network training is very slow, these trackers are very slow. Recently, several trackers that use a matching function trained offline using a CNN architecture have been presented. Siamese instance search for tracking (SINT) [11] is a simple tracker that matches the initial target object in the first frame with candidates in the current frame using a Siamese network. Similarly, generic object tracking using regression networks (GOTURN) [12] trains a neural network offline and can track objects at 100 fps at test time. GOTURN reduces the search candidates using a search strategy in which the locations of moving objects are slowly changed from frame k to frame k + 1. Previously-proposed trackers that use CNNs in an offline manner represent instances using a fixed number of layers. However, because instances have different scale levels, representing them using fixed layers can result in information loss.
Fusion of multiple modalities for vision tasks: In many applications, multiple modalities have been fused to completely generate their tasks [21][22][23][24][25][26]. In particular, tracking results with disturbances occurring in RGB frames have been compensated by combining depth information extracted from laser scanners, stereo vision and RGB-D sensors. For robotics systems, many kinds of modalities are fused to track moving objects. In sports analysis, the moving objects are tracked from sets of calibrated cameras to precisely localize the objects [21,22]. The depth data from stereo vision sensors, Kinect and 3D LIDAR are often fused with RGB images for dynamic environments [23][24][25]. Mertz et al. [27] proposed a detector and tracker that combine single-layer and multi-layer (LIDAR) laser scanners. Cho et al. [24] proposed a tracker that uses radar, LIDAR and CCD sensor data in which their features are fused. Aufere et al. [28] presented a fusion approach for avoiding collisions by tracking through the use of CCD and LIDARs. In a benchmark presented by Sturm et al. [29], there are various trackers using RGB-D data for visual odometry and SLAM tasks. Teutsch et al. [30] proposed an object detector that used the long wavelength infrared camera to detect low resolution objects. Kummerle et al. [31] fused the multiple infrared spectrum images to detect and track multiple humans from a bird's eye view. Gozalez et al. [32] presented an object detector, which leverages multiple cues from the multi-sensor modality for driving situations. They claimed that each modality has different views and features, and these aspects can increase the detection accuracy by aggregating them. The fusion trackers cited above separately extract features from each modality in order to fuse them and required equally well-captured multi-sensor data for generating them with accurate performances. However, because they cannot consider beliefs about sensor measurements, these trackers are significantly influenced by noise from each modality.

Problem Definition and Pre-Requisites
In this section, we describe the traditional issues that the proposed tracking method addresses. Figure 1 shows the overview of the proposed method. Suppose there exist RGB frames and 3D point clouds, which should be pre-calibrated and synchronized. The proposed tracker can be independently applied on RGB frames or 3D point clouds without following the fusion procedure; however, we fuse both data to compensate for each other's tracking failures. Because the coordinate systems of two modalities differ, the coordinate systems between two different modalities should be homogenized to find the associations when the tracking results are fused at the decision level. Therefore, the proposed MOFT is generated on image coordinates for both RGB frames and 3D point clouds. Pre-defined algorithm settings for each step will be discussed in Section 7.2. We also evaluate tracking performances when the proposed tracker is independently applied on each modality in Section 7.3. Note that the maximum size of an object to be tracked is set to the image size because the tracking is performed on image coordinates for both modalities. To track objects on 3D point clouds in image coordinates, we employ a transformation method [33], which transforms 3D point clouds to a dense depth map by up-sampling the point clouds.
The representation step means representing object instances (e.g., the bounding box of each object) using a feature extraction method. Each instance x i has state x i = [x, y, w, h] where x and y denote the left-top location, and w and h are the width and height, respectively. In this paper, we adaptively use an output of the VGG-16 architecture [34] pre-trained on the ILSVRC2012 as a representation of each object instance according to size. To identify the size of each instance, we use the scale-dependent pooling (SDP) [35]. By doing this, the instances with different sizes can be suitably represented without any information loss. To identify the matched pairs between given representations of target object instances at time k (X k = {x k i,i=1,··· ,n }) and representations of candidate instances at time k + 1 (X k+1 = {x k+1 j,j=1,··· ,m }), we train a matching network. Before finding matched pairs, the candidates at time k + 1 should be set. The MOFT uses both traditional candidate selection methods for tracking, which include candidate sampling [11] and the object detector [36]. Because directly accepting the bounding boxes of instances extracted from the candidate sampling method or object detector can increase localization error, we use a bounding box regressor [37], which regresses center coordinates of the box and their width and height.

Matching Network
We propose a matching function that can track objects when various disturbing factors are introduced into the scenes.

Target Representation
Target objects contain large variations in their scales according to disturbance factors, such as pose change and movement states. If the instances that have a low resolution (i.e., small scale) are passed to subsequent layers, fine features may be gradually ignored owing to operations such as convolution and pooling. To suitably represent instances according to their scales, we first classify scales into a sampled range (scale-level ranging is discussed in Section 4.1). If the scale of instances is small, the earlier output of the convolutional layer is extracted as a representation, while the latter output of the convolutional layer is extracted as a representation. Figure 2 shows the architecture of the proposed representation method. Feeding entire instances into the pre-trained network can result in high computational cost. In this work, we apply ROI pooling [38] to frame representation to reduce the computational cost. First, we extract all of the outputs of the convolutional layers of a frame from the pre-trained CNN. Then, the representation of each instance is pooled from the output map of its corresponding frame according to its scale level using ROI pooling. Because the sizes of the outputs from each convolutional layer vary, we sample them individually by applying different sampling layers to feed them into the matching function. Max pooling is applied to sub-sample the represented instances with sizes greater than the input size of the matching network. A deconvolutional operation is also used to up-sample represented instances that are smaller than the input size of the matching network. Finally, local response normalization (LRN) [39] is used to normalize all of the representations.

Representation of the Depth Frame
To represent instances of depth data that have no pre-trained CNNs on large-scale annotated dataset, we employ supervision transfer [14].
κ]} be layered representations of RGB and depth frame, respectively, where i denotes the number of layers. The main task of supervision transfer is to effectively learn the weight parameters W [1,··· ,κ] = {w i , i ∈ [1, · · · , κ]} to represent unannotated depth images from a fixed CNN architecture. Supervision transfer proceeds by measuring the similarity between the representations of both modalities by using a loss function f (in this work, L 2 distance is used for the loss function). The similarity can be measured as follows: where t(•) is the transformation function for embedding ψ i into the same dimension of φ i . w i denotes the learned weight parameter. In this work, if depth images are obtained from 3D point clouds, the up-sampling method [33] is used for the transformation function.

Architecture of the Matching Network
To train a generic matching function that can be generated on sequences that include disturbing factors, we employ a CNN architecture comprising two sub-networks sharing weights as a matching network. The proposed matching network is shown in Figure 3. To construct each paired input (x k i , x k+1 j ), representations (Sections 4.1 and 4.2) of target object x k i and candidates x k+1 j are separately fed into two identical sub-networks that are in the form of a CNN. For the proposed matching network, the sub-network comprises three convolutional layers and two fully-connected layers. In traditional CNN architectures, strong values in the local neighborhood can only be activated to be fed into subsequent layers when max pooling is applied on inputs, i.e., the spatial resolutions of activated values are substantially reduced. While one benefit of max pooling is invariance to local deformations, maintaining the small appearance changes of the target objects over time is more important for the MOT task. Therefore, the max pooling layers are not included in the proposed sub-network architecture.
Finally, the outputs of the last fully-connected layer in each sub-network are concatenated and fed into a two-way Softmax layer. The Softmax layer is used to determine the matching between x k i and x k+1 j . In our work, Classes 1 and 0 denote the matching (positive) and mismatching (negative), respectively.

Fusion Tracker
The proposed tracker is independently generated from represented instances, both RGB and depth frames. We obtain more accurate tracking results by fusing the tracking results generated from each modality at the decision level than using only the tracker-generated RGB frames.
In this section, we present a method of assessing the discounting factors to be applied to the tracking results from each modality using the basic belief assignment (BBA). The discounting factor α indicates the degree of trust to apply to the tracking results.

Basic Belief Assignment
The BBA is based on the Dempster-Shafer theory (DST) [40,41]. The DST is a generalization of the Bayesian inference. The frame of discernment Ω = {a i,i=1,··· ,n } is the set of elementary hypotheses in the DST. Elementary a i are disjointed with each other while they cover the complete event space. The mass function m of the BBA to map the power set 2 Ω to [0, 1] is defined as follows: where m(A) represents the certainty for proposition A of belief committed to the subset. All BBAs are assigned as probability functions, i.e., they are measurements of certainty. The Dempster-Shafer combination rule is applied to combine two different BBAs, m 1 and m 2 , as follows: The Belief bel m and plausibility pl m can be construed as pessimistic and optimistic guesses that ignore the additional information for uncertainty if a decision is required. The pignistic transformation [40] can incorporate this additional information by transforming the BBA into probabilistic function. Furthermore, the BBA can be equally distributed by repeating each BBA among each singleton element. In other words, the pignistic transformation gives an equal probability to elements given a lack of information by building a probability distribution on each element as follows: where | • | denotes the amount of elementary hypotheses of Ω.

Fusion for Tracking Results
Let Θ ∈ {0, 1, Ω} be the frame of discernment for the MOFT, where 0, 1 and Ω represent non-matching, matching and uncertainty. The BBA to tracking results can be defined as follows: where m(A) is a basic belief mass (BBM) representing the part of belief committed to the subset. The conjunctive rule of combination is used to combine the BBA of RGB frames with the BBA of depth frames. We already know that an instance from one modality is corresponding with that from the other modality because 3D point clouds were projected to the image coordinates as a depth image (Section 3). Therefore, we can fuse tracking results generated from different modalities by considering only the final results from the matching networks. Let m D and m R be the BBMs for the tracking results of the depth frames and the RGB frames, respectively. The conjunctive combination m D ⊕ m R is defined as follows: The main idea underlying the proposed fusion scheme is the assignment of weights according to the tracking results. To do this, we add a discounting factor to each BBM. The BBM can be discounted when the correctness of a BBM is only valid with a discounting factor. Clearly, the discounting factor α indicates that the degree of trust be applied to the tracking results. The discounting factor for the tracking results of each modality can be defined as follows: We set the discounting factor α using the normalized belief functions in [40]. Let us suppose that tracking is independently completed in each modality for the same frames. Before we merge the tracking results, each tracking result should be discounted by their discounting factors. From the given discounted BBMs of the RGB frames m c is the tracked object in each sensor, the joint BBM m α {x t+1 c } is computed as follows: where α is computed by a linear function and a quadratic function of a minimization program [40].
In this work, we set α R = 0.22 and α D = 0.31.

Structured Fine-Tuning of the Matching Network
In this section, we describe how to fine-tune the matching network for updating target appearance models and perform tracking using the target appearance models. To build a robust matching network, the initialized matching network trained on external video sequences is fine-tuned after performing tracking on a few frames in a structured model. The fine-tuned matching network can preserve model consistency and become robust to temporally-changed target appearances. At this point, previously fine-tuned matching networks are maintained as paths of the target appearance models. Figure 4 shows an example of our proposed structured fine-tuning of the matching network.

Structured Target Appearance Models
The proposed structured target appearance model is constructed as a hierarchical tree structure by adaptively fine-tuning the matching network. Let S = {V, E } be a node for the fine-tuned matching network, where v ∈ V and (u, v) ∈ E denote a vertex related to a fine-tuned matching network and a directed edge indicating the path relationship between vertices, respectively. The relationship between two vertices (an edge) is defined as follows: where φ E (u, v) denotes the relationship score between vertices u and v, Θ v F is a set of consecutive frames on which tracking is performed by using the matching network until the vertex v and M u (x is the matching score when x f j∈m is judged as a matched candidate with a previous target object x f −1 i∈n by using vertex u.

Inference
The target state from a new frame fis inferred by accumulating the relationship scores from multiple fine-tuned matching networks in the form of a structure. Let where M n is a weighted average matching score from the activated fine-tuning path of the n-th target object. The activated fine-tuning path of our paper denotes the optimized path that satisfies Equation (3). If a path for a new frame is selected as an activation, the remainder paths are left as deactivated paths, i.e., the remainders are not used ever for the frame. When V n α ⊆ V is the activated fine-tuning path of the n-th target appearance model, the weighted average matching score can be measured as follows: i∈n , x f j∈m ) denotes the matching score between x f −1 i∈n and x f j∈m corresponding to the vertex v of the n-th target appearance models, which considers the probability to Class 1 (matching), and w n v→ f is the weight of vertex v in frame f on a path of the n-th target object. If a candidate is not matched with the target object, it is considered as a newly-detected object to track. Further, if a target object has lower matching scores than θ dis for all candidates, it is considered to be a disappearing object. In this work, we experimentally set θ dis to 0.6.
To determine weight w n v→ f , we consider the reliability of the matching network. This weight is assigned to prevent that fine-tuning is generated on unreliable cases, for which high matching scores are measured despite noisy instances. To measure the reliability of a matching network, the fine-tuning paths, and not in entire paths, are recursively explored as follows: where P v is the parent node of vertex v.

Adaptive Model Update
In the structured target appearance models, a node includes a fine-tuned matching network. We describe an approach to precisely select the fine-tuning path for new training frames.
Let z denote a newly-created node to fine-tune the matching network. The matching network is fine-tuned after tracking is completed on 15 consecutive frames without model update, i.e., |Θ z F | = 15. The newly fine-tuned matching network has a parent node P z to maximize reliability, which satisfies: whereφ E is the interim edge. Equation (6) can be solved by using a tree searching method, easily. After finding the parent node of the newly-created node z, fine-tuning is performed on two sets of frames |Θ z F | and |Θ P z F | on only fully-connected layers because fine-tuning all layers including convolutional layers is too expensive to perform and update online. Although the fine-tuning is performed on only fully-connected layers, there is no significant performance degradation as compared to fine-tuning for all of the layers. After the 10th fine-tuning is completed on a path, the earliest node is eliminated when the new fine-tuned matching network is added. Finally, the newly fine-tuned matching network of node z is added to V n α .

Experimental Evaluation
In this section, we discuss the evaluations conducted in terms of performance and effectiveness. First, we validate the architectural design and algorithms for the proposed MOFT by comparing it with intra-varied MOFT. Then, we compare it with state-of-the-art trackers on tracking benchmarks.

Dataset and Evaluation Metric
Dataset: We evaluated our proposed method on a project of Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) tracking [42] and MOT benchmark [43] datasets. We constructed a pair dataset consisting of cropped objects in frames k and k + 1 from both RGB and depth frames for the training data.
The KITTI tracking dataset contains data captured in driving environments. It consists of 21 training sequences with annotations and also provides various sensor modalities, such as single image, stereo image and 3D point clouds. In our experiments, we used two kinds of paired modalities: (1) RGB frames with depth frames extracted from a stereo camera and (2) RGB frames with depth frames extracted from 3D point clouds. For training, we used 40,000 pairs from 18 training sequences; the remainder was used for evaluations.
We used both the 2015 and 2016 MOT benchmarks (2015: 11 sequences; 2016: seven sequences). Because the MOT benchmark does not provide the depth sequences, we only generated MOFT on RGB frames when evaluations were conducted on the MOT benchmark. For training, we used 32,000 pairs from 15 training sequences; the remainder was used for evaluations.
Evaluation metrics: The following tracking evaluation metrics [44,45] from both benchmarks were utilized: multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT), mostly lost targets (ML) and the number of ID switches (IDS).

Experimental Setup
Environments: We implemented the proposed method using Caffe [39] and MATLAB on an Intel-Core i7-6700 quad-core 4.0-GHz processor with 64.00 GB of RAM and an NVIDIA GeForce Titan X graphics card with 12 GB of memory for CUDA computations.
Tracking candidates: In this work, the object detector in [36] was used to sample the target candidates. Furthermore, the candidate sampling methods in [11] were used as a comparison model.
Depth frames extraction: We used the adaptive random work method proposed by Lee et al. [46] to extract the depth frames from the stereo camera of the KITTI tracking benchmark. To track objects in 3D point clouds from the KITTI benchmark, we mapped the 3D point clouds into a 2D dense depth map using up-sampling [33].
Data representation: In this work, we used the VGG-16 network [34] pre-trained on the ILSVRC2012 dataset [47] to represent RGB image data. With the same layered architecture, supervision transfer was applied to the depth map for data representation. The scale-level range was set to five levels because the VGG-16 network has five convolutional layers. To classify the scale-level of instances, we used scale-dependent pooling (SDP) [35].
Bounding box regression: If we have used a candidate sampling method or detection strategy, we generated a refinement strategy for the bounding boxes extracted from each frame using the method presented by Girshick et al. [37] to precisely localize the bounding box of target objects. Bounding box regression facilitates accuracy in the tracked target boxes by training four ridge regressors (x, y, w, h), where (x, y) is the center coordinates of the box and (w, h) is the width and height of the box. In our work, the regressors are not updated during testing because of noise.

Evaluation
We evaluated the proposed method by comparing it with intra-varied MOFTs to validate the performance of our architecture choices for MOFT. The matching target experiment was conducted to validate the suitability of the matching between target objects in frame k and candidates in frame k + 1 for MOT tasks. In the representation architecture experiment, we observed which pre-trained CNNs precisely represent instances and whether adaptively representing instances in accordance with their scale levels is more suitable than representing in a fixed form. The data modality experiment was conducted to show the effectiveness of the proposed fusion tracking method. To show whether the structured fine-tuning update improves the performance of MOFT or not, we compared MOFT with and without fine-tuning in the update experiment. The proposed matching function was designed for application to any unseen target. The generality experiment was conducted to verify this issue. Finally, we compared the proposed MOFT with state-of-the-art trackers.
The basic setting for the proposed method (the row ours in Table 1) was as follows: representation using VGG-16 architectures, adaptive representation according to scale levels of the instance, applying supervision transfer to represent depth data, matching targets in frame k with candidates in frame k + 1 and the result fusion using BBMs. All of the models, including the proposed model, were evaluated using RGB and depth fusion trackers (except in the data modality and generality experiments). The depth data were extracted from 3D point cloud sequences from the KITTI tracking benchmark dataset. The shown tracking results validated on KITTI benchmark were measured as the average in the entire object classes. Table 1. Comparison models used to evaluate the proposed MOFT. Depth (PC) and Depth (stereo) denote depth frames extracted from 3D point clouds and stereo vision, respectively. "Init." indicates the initialized target. w and w/o of the "update" column mean that the proposed fine-tuning method was used for MOFT (w) or not (w/o), respectively.

Tracker Matching Target Representation Representation Usage
Modality Update Matching target: To show which target models are suitable for matching with target candidates, we compared ours to a method that matches initialized target objects in the first frame with candidates in the current frame (model 1 ). To train the matching function for model 1 , we constructed a dataset in which a pair consisted of a cropped object from the first frame and the corresponding object from the frames remaining in a sequence. As shown in the rows of ours and model 1 in Table 2, the matching between consecutive frames (ours) can more accurately track the multiple objects than model 1 in most metrics. Because the target states of the current frame are significantly influenced in the previous frame, noises introduced by temporally-changed states can have a negative effect on the matching function. Table 2. Comparison of the proposed MOFT with design-varied trackers on testing sequences. The best and second-best scores are boldfaced and underlined, respectively. The direction of the arrows indicates the direction of better performances; multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT), mostly lost targets (ML) and the number of ID switches (IDS). Representation architecture: First, we compared the pre-trained network architectures on their ability to represent instances. For a comparison target, we used AlexNet [48] pre-trained on ILSVRC2012 (model 2 ) because it is a popular network architecture that comprises smaller layers than VGG-16. As in our proposed model, we divided the scale into five levels for adaptively representing instances on AlexNet. The rows ours and model 2 in Table 2 show that the representation of instances from the larger network (VGG-16) has better tracking performances on all metrics.
Next, we compared the performances of trackers according to the usages of layers for representation. The following comparison models were used: outputs of conv1 − only (model 3 ), conv5 − only (model 4 ), f c7 − only (model 5 ) and conv5 with f c7 (model 6 ) layers from VGG-16. To uniformly feed different sizes of layered representations into our matching network, we applied the sampling scheme described in Section 4.1 into each layered representation. From the results shown in the rows ours and model 3,··· ,6 in Table 2, it is clear that adaptively representing instances according to their scale levels results in more accurate tracking than representing all scale levels of instances in the fixed layers, because information loss is prevented.
Data modality: To observe accuracy differences in the used data modality, we measured the performance when the tracker was generated on each sensor. model 7 was generated on RGB sequences, and model 8 and model 9 were generated on depth sequences extracted from stereo camera and 3D point clouds, respectively. As shown in the rows model 7,··· ,9 , MOTA was the highest when tracking was generated on only RGB sequences, whereas the MOTPs of model 8 and model 9 were higher than that of model 7 . As a result, ours gave the best performances on all of the metrics compared with model 7,··· ,9 . Thus, it is clear that information conflicts generated from modalities can be compensated by combining the tracking results of modalities.
Update: To show that the proposed structured fine-tuning makes MOFT become robust, we compared ours with MOFT without the fine-tuning procedure. As a result, the entire metrics are high for MOFT tracked objects with the structured fine-tuning. As shown in the rows model 10 , however, the performance of MOFT without the structured fine-tuning is not a low.
Generality: This experiment was used to verify that the proposed matching function can be generally applied without references to test sequences. To this end, we evaluated the trackers in two ways: (1) training and testing the matching function on the same dataset and (2) training and testing the matching function on different datasets. This evaluation was conducted on both the MOT [43] and KITTI [42] benchmark datasets. Because the MOT benchmark does not include depth data, the trackers only tracked objects on RGB sequences in this experiment. Table 3 shows the dataset used for training and testing and the resulting performances. The first row in Table 3 is the same as model 7 in Tables 1 and 2. As a result, although MOFTs for which training and testing are performed on the same dataset can track objects more accurately than other cases, there was no significant performance degradation when comparing the first and third rows or the second and fourth rows. State-of-the-art comparisons: One of the advantages of MOFT stated above is that it is robust to distortion factors because of a matching function trained on external video sequences offline. Further, information conflicts can be compensated by employing a modality fusion scheme. Table 4 shows the tracking performances on the MOT16 dataset. Tables 5 and 6 show the tracking performances on the KITTI tracking benchmark. On the MOT benchmark, MOFT provides the best performance in terms of MOTA and MT. On the KITTI benchmark, MCMOT-CPT, which uses a general CNN detector, has the higher performances on the car category, while the MOFT has the higher performance on the pedestrian category. Generally, the instances of pedestrians have a low resolution while an instance including car has an adequate resolution, which can avoid information losses raised from passing many layers of CNNs. In other words, if an instance including a pedestrian was passed in many layers of CNNs, information losses can be easily introduced. Therefore, the MOFT can track multiple objects without category-dependence because the MOFT adaptively represents instances according to their sizes. Further, MOFT achieved state-of-the-art performances on the other metrics.
Qualitative results: MOFT was only generated on the KITTI benchmark dataset to qualitatively evaluate the proposed tracker because the MOT benchmark does not include depth data. Figure 5a,b depicts the tracked targets of each tracker on RGB and depth (from 3D point clouds) frames, respectively. Figure 5c shows the tracked targets of MOFT. It can be seen that, whereas the bounding boxes are shifted to similar objects in the RGB tracker, the depth tracker cannot precisely track the distant targets. MOFT compensates the limitations of each modality. As shown in Figure 5, the tracking failures are overcome.
Although a modality fusion procedure is applied with discounting factors in MOFT, tracking failures from the failures of each modality still occurred. On the top of Figure 6, bounding box shifting and missing between similar targets can be observed. This may be as a result of the influence of the RGB tracker. On the bottom of Figure 6, distant objects are considered as disappearing objects. Even though they maintain tracking-available sizes for the RGB tracker, they may be overlooked in the depth tracker. Table 4. Comparison of the proposed MOFT with previous trackers on the MOT16 [49] benchmark dataset. The boldfaced and underlined scores indicate the best and second-best scores, respectively. The direction of the arrows indicates the direction of better performances; multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT), and mostly lost targets (ML).  Table 5. Comparison of the proposed MOFT with previous trackers on the KITTI benchmark dataset [59]. This evaluation was validated on the car category. The boldfaced and underlined scores indicate the best and second-best scores, respectively. The direction of the arrows indicates the direction of better performances; multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT), and mostly lost targets (ML).  Table 6. Comparison of the proposed MOFT with previous trackers on the KITTI benchmark dataset [59]. This evaluation was validated on the pedestrian category. The boldfaced and underlined scores indicate the best and second-best scores, respectively. The direction of the arrows indicates the direction of better performances; multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), mostly tracked targets (MT), and mostly lost targets (ML).

Conclusions
In this paper, we proposed a multiple objects fusion tracker called MOFT. We demonstrated that we can train a matching function regardless of disturbing factors using a matching network. In addition, because each instance has a different scale level, we adaptively represent instances from layered representations of the pre-trained CNN. Further, to compensate tracking failures occurring in each modality, we proposed a fusion tracker based on combining BBMs. Our fusion tracker fuses the tracking results of each modality at the decision level to prevent information conflicts introduced by feature-level fusion schemes. MOFT was verified as achieving better results than conventional multiple object trackers on the KITTI and MOT benchmarks.
In future work, we will plan to extend MOFT to combine various modalities. We will also propose methods for real-time computation.