Relation3DMOT: Exploiting Deep Affinity for 3D Multi-Object Tracking from View Aggregation

Autonomous systems need to localize and track surrounding objects in 3D space for safe motion planning. As a result, 3D multi-object tracking (MOT) plays a vital role in autonomous navigation. Most MOT methods follow a tracking-by-detection pipeline that comprises object detection and data association. However, many approaches detect objects in 2D RGB sequences for tracking, which lacks reliability when localizing objects in 3D space. Furthermore, it remains challenging to learn discriminative features for temporally-consistent detection across frames, and the affinity matrix is normally learned from independent object features without considering the feature interaction between objects detected in different frames. To address these problems, we first employ a joint feature extractor to fuse the 2D and 3D appearance features captured from 2D RGB images and 3D point clouds respectively, and then propose a novel convolutional operation, named RelationConv, to better exploit the correlation between each pair of objects in adjacent frames and learn a deep affinity matrix for further data association. We finally provide extensive evaluation showing that our proposed model achieves state-of-the-art performance on the KITTI tracking benchmark.


Introduction
Multi-object tracking in 3D world space (3D MOT) plays an indispensable role in environmental perception for autonomous systems [13,20,41,47]. The 3D MOT task first localizes all surrounding 3D objects in a sequence and then assigns them consistent identity numbers (IDs). As a result, the same object keeps the same ID throughout the sequence, which is then used to predict the trajectories of surrounding objects and to plan safe paths for autonomous navigation. In recent years, 2D multi-object tracking has made great progress in the computer vision domain [3,23,37,39,48]. However, the camera sensor cannot provide depth information and is quite sensitive to lighting conditions (e.g. overexposure) for autonomous systems operating in the real 3D world.
In order to improve reliability and safety, recent tracking-by-detection approaches [33,47] first combine the camera with a Lidar sensor, which is capable of offering precise spatial information. By leveraging sensor fusion and the redundant information from multiple sensors, the performance of 3D MOT can be significantly boosted. A pairwise feature similarity between any two objects in different frames is then learned from the fused features. Finally, data association algorithms, e.g. the Hungarian algorithm [16] or Joint Probabilistic Data Association (JPDA) [10], are employed to assign consistent IDs to the corresponding objects.
It is easy to observe that capturing discriminative features, which help distinguish different objects, is the critical step when learning the affinity matrix and performing the subsequent data association. One efficient method to solve this problem is to represent the objects in adjacent frames as a directed graph. Specifically, each object is treated as a node in the graph, and the relationship between an object pair is the edge between the two related nodes. As a result, the problem of exploiting discriminative features between two objects is converted into learning the relations between nodes in a directed graph. Consequently, we propose a deep neural network that introduces a graph neural network into 3D MOT to better exploit discriminative features.
Furthermore, because the nodes in the graph are unordered, we cannot directly leverage the advantages of convolutional neural networks (CNNs) to exploit their features. Prior works for both 2D MOT and 3D MOT [13,37,39,42,47] use a multi-layer perceptron (MLP) to capture contextual features for each node. However, an MLP is not efficient at learning local features between nodes, as the MLP operation is not a spatial convolution. As a result, we introduce a novel convolutional operation, named RelationConv, to better exploit the relations between nodes in the graph.
In order to learn temporal-spatial features for the objects in the sequence, we also propose a feature extractor that jointly learns appearance features and motion features from both 2D images and 3D point clouds. Specifically, we use off-the-shelf 2D image and 3D point cloud detectors to extract the appearance features of the objects. We learn the motion features with a subnetwork that takes the 2D bounding boxes in the images and the 3D bounding boxes in the point clouds as input. We finally fuse the appearance and motion features together for further data association. We summarize our contributions as:
• We represent the detected objects as nodes in a directed graph and propose a graph neural network to exploit discriminative features for the objects in adjacent frames for 3D MOT.
• We propose a novel joint feature extractor to learn both 2D/3D appearance features and motion features from the images and the point clouds in the sequence.
• We propose a novel RelationConv operation to efficiently learn the correlation between each pair of objects for the affinity matrix.

2D Multi-Object Tracking
MOT methods can be categorized into online and offline methods. An online method only predicts the data association between detections in the current frame and a few past frames, and is normally used for real-time applications. Early 2D trackers [5,8,9,22,38] enhance the robustness and accuracy of tracking by exploiting and combining deep features from different layers. However, these features integrated from multiple layers are not helpful when the targets are heavily occluded or even unseen in a certain frame. Several 2D MOT methods [7,15,24] employ correlation filters to improve the decision models. Deep reinforcement learning is used in [45] to efficiently predict the location and size of the targets in the new frame.
On the other hand, offline methods aim to find a global optimum for the whole sequence. Some models [31,46] build a neural network with a min-cost flow algorithm to optimize the total cost of the data association problem.

3D Multi-Object Tracking
3D object detection has achieved great success in recent years, especially since PointNets [27,28] became capable of processing unordered point clouds efficiently. As a result, many researchers have turned their attention to 3D object tracking built on accurate 3D detection results. Some approaches [25,30,40] first predict the 3D objects with off-the-shelf 3D detectors, followed by a filter-based model to track the 3D objects continuously. mmMOT [47] builds an end-to-end neural network for feature extraction and data association. Specifically, it employs a 2D feature extractor [35] and a 3D feature extractor [27] to capture 2D and 3D features for the objects in two adjacent frames. A sensor fusion module is then proposed to aggregate the multi-modality features, which are finally used for data association. However, the model only learns appearance features for the detected objects; motion features are not considered. Alternatively, GNN3DMOT [42] proposes a joint feature extractor to learn discriminative appearance features for the objects from the images and the point clouds, and then employs an LSTM network to capture the motion information. Finally, a batch triplet loss is used for data association.

Joint Multi-Object Detection and Tracking
The joint multi-object detection and tracking method has become popular because decoupling the detection and tracking tasks may lead to a sub-optimal solution. Recent methods [44,48] construct an end-to-end framework with a multi-task loss to directly localize objects and associate them with the objects in the previous frame. Similarly, FaF [21] first converts the sequential point clouds into stacked voxels, and then applies standard 3D CNNs over 3D space and time to predict 3D object locations, associate them across multiple frames, and forecast their motions.

Data Association in MOT
Data association is an essential problem in the MOT task: assigning the same identity to the same object throughout the sequence. Traditionally, the Hungarian algorithm [16] minimizes the total cost of the similarity between each pair of observations and hypotheses. JPDA [10] considers all possible assignment hypotheses and uses a joint probabilistic score to associate objects in different frames. Modern works [6,46] first represent the objects and their relations as a directed graph: each object is treated as a node, and the relation between each pair of objects is an edge. The data association problem can then be cast as a linear program to seek the optimal solution.
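As a toy illustration of the objective the Hungarian algorithm [16] optimizes, the following sketch brute-forces the minimum-cost matching for a small square cost matrix. The cost values are made up; real trackers use an O(n³) solver such as SciPy's `linear_sum_assignment` rather than this exponential enumeration.

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Brute-force the min-cost assignment over a square cost matrix.
    Illustrates the objective the Hungarian algorithm minimizes; not
    suitable beyond toy sizes."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return best_perm, best_cost

# Toy cost matrix: rows are tracks in frame t-1, columns are detections in frame t.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
perm, total = min_cost_assignment(cost)   # perm[i] = detection assigned to track i
```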

Model Structure
Our proposed 3D MOT network follows the tracking-by-detection paradigm. As shown in Figure 1, the framework takes a sequence of images and the related point clouds as input and consists of four modules: (a) a joint feature extractor to capture the appearance feature and the motion feature from the 2D images and 3D point clouds; (b) a feature interaction module that takes the joint feature as input and uses the proposed RelationConv operation to exploit the correlation between pairs of objects in different frames; (c) a confidence estimator that predicts whether a certain detected object is a valid detection; (d) a data association module to compute the affinity matrix for associating the objects in adjacent frames.

Problem Statement
As an online 3D MOT method, our model performs object association in every two consecutive frames of a given sequence. We refer to the current frame at time t with N detected objects as X_t = {x_t^i | i = 1, 2, . . . , N}, and the previous frame at time t − 1 with M detected objects as X_{t−1} = {x_{t−1}^j | j = 1, 2, . . . , M}. We aim at exploiting a discriminative feature for each (x_t^i, x_{t−1}^j) pair, predicting a feature affinity matrix for correct matching, and finally assigning the matched IDs to the corresponding objects in current frame t.

2D Detector
For object detection, there are two ways to localize the objects in the frames. One is to use existing 3D detectors (e.g. PointRCNN [34], PointPillars [17]) to predict 3D bounding boxes, which are then projected onto the corresponding images to obtain 2D bounding boxes. However, point clouds lack the rich colour and texture information that provides sufficient semantic cues for the objects, so this route is still not efficient for precise localization. The other way is to use off-the-shelf 2D detectors to predict 2D bounding boxes; with respect to the 3D location, we employ [26] to estimate the localization of the objects. We take this second route and adopt RRC-Net [29] as our 2D detector, as we observe that it obtains higher recall and accuracy.

Joint Feature Extractor
In order to exploit sufficient information for the detected objects in the frames, we propose a joint feature extractor (see Figure 1 (a)) to learn deep representations from both the images and the point clouds. Specifically, we employ a modified VGG-16 [35] to extract 2D appearance features, and then apply PointNet [27] over the trimmed points, which are obtained by extruding related 2D bounding boxes into the 3D frustums in the point cloud, to capture 3D appearance features and predict 3D bounding boxes. After that, we propose a sensor fusion module to aggregate the 2D and 3D appearance features together for further feature interaction. With regard to the motion features, we build a subnetwork to learn the high-level motions of objects using the information of the 2D bounding boxes and 3D bounding boxes.

2D Appearance Feature Extraction
As shown in Figure 1 (a), we take the objects inside the 2D bounding boxes in the image as input; they are cropped and resized to a fixed size of 224 × 224 to fit the VGG-16 [35] feature extractor. [38] indicates that features in different CNN layers carry different semantic properties: a lower convolutional layer is likely to capture more detailed spatial information but low-level features, while a deeper convolutional layer captures more abstract, high-level information. As a result, inspired by [38,47], we embed a skip-pool [2] method into the VGG-16 network (see Figure 2) to involve the features of all levels in the global feature generation, which is treated as the 2D appearance feature of the corresponding detected object.
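The skip-pool idea can be sketched as follows: pool the ROI feature map of each selected VGG-16 stage, normalize, and concatenate into one multi-level descriptor. This is a hedged illustration with random toy tensors; the channel counts and pooling choice are illustrative, not the exact configuration of [2].

```python
import numpy as np

def skip_pool(feature_maps):
    """Sketch of skip pooling: global-average-pool each stage's ROI
    feature map, L2-normalize the pooled vector, and concatenate all
    stages into one multi-level 2D appearance feature."""
    pooled = []
    for fmap in feature_maps:            # fmap: (C, H, W)
        v = fmap.mean(axis=(1, 2))       # global average pool -> (C,)
        v = v / (np.linalg.norm(v) + 1e-8)
        pooled.append(v)
    return np.concatenate(pooled)

# Toy ROI feature maps from three conv stages of decreasing resolution.
maps = [np.random.rand(64, 56, 56),
        np.random.rand(128, 28, 28),
        np.random.rand(256, 14, 14)]
feat = skip_pool(maps)                   # 64 + 128 + 256 = 448 dims
```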

3D Appearance Feature Extraction
We first obtain the 3D points of each object by extruding the corresponding 2D bounding box into a 3D frustum, from which the 3D points on the object are then trimmed. After that, we apply PointNet [27] to capture the spatial feature as the 3D appearance feature and to predict the corresponding 3D bounding box.
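The frustum trimming step amounts to a point-in-box test in image space. In the sketch below, `project` is a hypothetical stand-in for the camera projection; a real system would use the KITTI calibration matrices.

```python
import numpy as np

def frustum_points(points, box2d, project):
    """Keep the lidar points whose image projection falls inside the
    2D bounding box [x1, y1, x2, y2]. `project` maps (N, 3) points to
    (N, 2) pixel coordinates."""
    uv = project(points)
    x1, y1, x2, y2 = box2d
    mask = ((uv[:, 0] >= x1) & (uv[:, 0] <= x2) &
            (uv[:, 1] >= y1) & (uv[:, 1] <= y2))
    return points[mask]

# Toy orthographic "projection" that simply drops the depth coordinate.
project = lambda pts: pts[:, :2]
points = np.array([[1.0, 1.0, 5.0], [3.0, 3.0, 6.0], [10.0, 10.0, 7.0]])
trimmed = frustum_points(points, box2d=(0.0, 0.0, 4.0, 4.0), project=project)
```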

Motion Feature
With respect to the motion feature of each object, we directly use the 2D bounding box and the corresponding 3D bounding box as the motion cue. We define the 2D bounding box of a certain object as B_2d = [x_2d, y_2d, w_2d, h_2d], where [x_2d, y_2d] is the center of the 2D bounding box, and w_2d, h_2d are its width and height respectively. Similarly, the 3D bounding box is defined as B_3d = [x_3d, y_3d, z_3d, w_3d, h_3d, l_3d, θ_3d], where [x_3d, y_3d, z_3d] is the center of the 3D bounding box, w_3d, h_3d, l_3d are its width, height and length respectively, and θ_3d indicates the orientation of the object in 3D space. Finally, the motion cue is defined as B = [B_2d, B_3d] (see Figure 1 (a)), which is then fed to a 3-layer MLP subnetwork to capture the motion feature of each detected object.
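A minimal sketch of this motion branch, assuming hypothetical layer widths and untrained random weights: the 11-dimensional cue B = [B_2d, B_3d] is pushed through a 3-layer MLP with ReLU activations.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(x, w, b):
    return np.maximum(x @ w + b, 0.0)    # Linear + ReLU

# Motion cue B = [B_2d, B_3d]: 4 values for the 2D box plus 7 for the
# 3D box -> an 11-dimensional vector per detected object (toy numbers).
b2d = np.array([0.30, 0.40, 0.10, 0.20])               # [x, y, w, h]
b3d = np.array([5.0, 1.2, 14.0, 1.6, 1.5, 3.9, 0.05])  # [x, y, z, w, h, l, theta]
motion_cue = np.concatenate([b2d, b3d])

# Untrained 3-layer MLP with hypothetical widths (11 -> 64 -> 128 -> 256).
dims = [11, 64, 128, 256]
params = [(rng.normal(0.0, 0.1, (i, o)), np.zeros(o)) for i, o in zip(dims, dims[1:])]
motion_feat = motion_cue
for w, b in params:
    motion_feat = mlp_layer(motion_feat, w, b)
```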

Features Aggregation and Fusion
We first aggregate the 2D appearance feature and the 3D appearance feature to exploit sufficient semantic information (e.g. spatial and colour features) for each detected object. We propose three fusion operators: (1) an intuitive method is to add the 2D appearance feature and the 3D appearance feature together after forcing them to have the same number of channels; (2) another common approach is to concatenate the 2D and 3D appearance features, after which a 1-layer MLP adapts the dimension of the fused feature; (3) the third operator is an attention-based weighted sum. Specifically, a scalar score is first computed by learning a 1-dimensional feature from the features of the different sensors, after which the score is normalized with a sigmoid function. Finally, the weighted sum is computed by element-wise multiplication of the normalized scores with the corresponding sensor features. We compare the performance of the different fusion operators in the experiments, and use the concatenation operator as the 2D and 3D appearance feature aggregation method in our model.
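The three fusion operators can be sketched as follows. The feature sizes are toy values, and `w2d`/`w3d` are hypothetical learned 1-D projections used only by the attention variant.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(f2d, f3d, w2d=None, w3d=None, mode="concat"):
    """Sketch of the three fusion operators for 2D/3D appearance features."""
    if mode == "add":                      # (1) element-wise addition
        return f2d + f3d                   # assumes equal channel counts
    if mode == "concat":                   # (2) concatenation (used in the paper)
        return np.concatenate([f2d, f3d])  # a 1-layer MLP would then resize this
    if mode == "attention":                # (3) attention-based weighted sum
        a2d = sigmoid(f2d @ w2d)           # scalar gate per modality
        a3d = sigmoid(f3d @ w3d)
        return a2d * f2d + a3d * f3d
    raise ValueError(mode)

f2d, f3d = np.ones(8), 2.0 * np.ones(8)
fused = fuse(f2d, f3d, mode="concat")                          # shape (16,)
gated = fuse(f2d, f3d, np.zeros(8), np.zeros(8), "attention")  # both gates = 0.5
```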
We finally fuse the appearance feature and motion feature by concatenating the aggregated 2D/3D appearance feature with the related motion feature for further feature interaction.

Feature Interaction Module
After the joint feature extractor, we obtain the fused features for the M objects in frame t − 1 and the N objects in frame t. We then propose a feature interaction module, as shown in Figure 1 (b), to learn the relations between each pair of objects (one in previous frame t − 1 and the other in current frame t).

Graph Construction
In order to efficiently represent the objects in different frames, we treat the feature of each object as a node in a graph structure, and the edge between two nodes indicates the relationship between two objects in different frames. As a result, we first construct a directed acyclic graph to represent the objects in two adjacent frames. Generally, a directed acyclic graph can be defined as G = (V, E), where V ⊂ R^F are the nodes with F-dimensional features, and E ⊆ V × V are the edges connecting pairs of nodes in the graph.
However, since we only learn the correlation between every object in frame t and all the objects in frame t − 1, rather than taking into account the relations between pairs of objects within the same frame, our graph for the object representation is constructed as G = (V_t ∪ V_{t−1}, E_t), where V_t indicates the N objects at current frame t, and V_{t−1}, which are also the neighbourhood nodes of V_t, denotes the M objects at previous frame t − 1. The edge feature set E_t for all the nodes in the current frame is defined as Equation 1:

y_t^{ij} = |x_t^i − x_{t−1}^j|,  e_t^i = {y_t^{ij} | j = 1, 2, . . . , M},  E_t = {e_t^i | i = 1, 2, . . . , N}    (1)

where x_t^i indicates a certain object node in current frame t, y_t^{ij} denotes the edge feature connecting the i-th object node x_t^i in current frame t and the j-th object node x_{t−1}^j in previous frame t − 1, and | · | is the absolute value operation. e_t^i are the edge features connecting the i-th object node in current frame t and all the M neighbourhood nodes in previous frame t − 1.
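The edge feature set of Equation 1 amounts to a pairwise absolute difference between the fused features of the two frames, which broadcasts cleanly:

```python
import numpy as np

def edge_features(x_t, x_prev):
    """Edge features y_t^{ij} = |x_t^i - x_{t-1}^j| for every pair of
    an object i in frame t and an object j in frame t-1 (Equation 1)."""
    # x_t: (N, F), x_prev: (M, F) -> E_t: (N, M, F) via broadcasting.
    return np.abs(x_t[:, None, :] - x_prev[None, :, :])

x_t = np.random.rand(4, 8)      # N = 4 fused object features in frame t
x_prev = np.random.rand(3, 8)   # M = 3 fused object features in frame t-1
E_t = edge_features(x_t, x_prev)
```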

Relation Convolution Operator
Considering the irregular and unordered properties of the nodes in the graph, we cannot use regular CNN filters (e.g. 3 × 3, 7 × 7 filters) on the unstructured graph for the convolutional operation, as these filters are only suited to standard grid data such as images. The traditional approach is therefore to apply a shared MLP on the graph to learn local and global contextual representations. However, the shared 1 × 1 filters of an MLP are not efficient at extracting spatially local features from unordered data, which leads to small receptive fields. Inspired by the behaviour of standard convolutional kernels, PointCNN [18] learns a transformation matrix as a regular CNN filter to capture local features of irregular point clouds, and it outperforms the MLP-based PointNets [27,28] by a large margin.
In order to leverage the ability of CNNs to extract spatially-local correlations with large receptive fields, we propose a relation convolution operator, named RelationConv, to abstract fine-grained local representations for the nodes of the graph. The advantage of our RelationConv is that it behaves like a standard convolutional kernel yet works on irregular data (e.g. graphs). In standard CNNs, the convolutional operation between the filters W and the feature map X can be defined as Equation 2 to obtain the abstract features f_X:

f_X = W · X    (2)
where X represents the feature map with a standard grid distribution, and W are the convolutional filters; the operation "·" denotes element-wise multiplication. Similarly, as defined in Equation 3, our RelationConv first learns a flexible filter W(·) by applying a shared MLP with a non-linear function to the edge features of the unordered graph, and then element-wise multiplication is applied between the learned filter W and the edge features E_t to extract the local features of the nodes in the graph (see Figure 3). It is easy to observe that our flexible filter W is learned from all the nodes and their neighbourhoods in the graph, which forces it to consider the global information of the graph and also makes it independent of the ordering of the nodes.

W(E_t) = RELU(MLP(E_t)),  f_{E_t} = W(E_t) · E_t    (3)

where E_t is the edge feature set, W(E_t) are the learnable and flexible filters obtained from a shared multi-layer perceptron MLP(·) with the non-linear function RELU(·) [43], and f_{E_t} are the abstract features captured by our RelationConv operation.
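A minimal sketch of RelationConv under Equation 3, with an untrained two-layer shared MLP (random placeholder weights) standing in for MLP(·):

```python
import numpy as np

rng = np.random.default_rng(1)

def relation_conv(E, w1, w2):
    """Sketch of RelationConv (Equation 3): a shared MLP with ReLU
    learns a filter W(E_t) from the edge features, which is then
    element-wise multiplied with E_t itself."""
    h = np.maximum(E @ w1, 0.0)   # shared MLP, hidden layer + ReLU
    W = np.maximum(h @ w2, 0.0)   # learned filter W(E_t), same shape as E
    return W * E                  # element-wise multiplication with E_t

N, M, F = 4, 3, 8
E = rng.normal(size=(N, M, F))    # edge feature set E_t
w1 = rng.normal(0.0, 0.1, (F, 16))
w2 = rng.normal(0.0, 0.1, (16, F))
f_E = relation_conv(E, w1, w2)    # abstract features f_{E_t}
```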

Feature Interaction
We believe that the same object in different frames should yield similar discriminative features, i.e. their feature similarity should be high, whereas the feature similarity of two different objects should be low. As a result, discriminative features help avoid confusion when matching the objects.
Given the obtained fused features containing 2D/3D appearance features and motion features, we propose a feature interaction module to learn the correlation between each object pair in the different frames as shown in Figure 1 (b). Rather than directly learning the deep affinity for further data association, we firstly employ a feature interaction module equipped with a 1-layer RelationConv network to learn more discriminative features, which allows the feature communication between two objects in adjacent frames before learning the affinity matrix.

Confidence Estimator
As shown in Figure 1 (c), we build a 3-layer MLP classification network to predict scalar scores s_det as the confidence that each detected object is a valid detection.

Affinity Matrix Learning
In order to associate and match the objects in different frames, given the correlation representation after feature communication among objects, we use a 3-layer MLP subnetwork to learn an affinity matrix with a 1-dimensional output feature s_A. The affinity matrix determines whether a certain object pair indicates a link; each scalar score in the matrix shows the confidence that the object pair shares the same identity.
Furthermore, inspired by [47], we further learn a start estimator and an end estimator to predict whether an object is linked. Specifically, the start estimator learns the scalar scores s start to determine whether a certain object just appears in previous frame t − 1. On the other hand, the end estimator predicts the score s end whether a certain object is likely to disappear in the frame t due to hard occlusion or out-of-bounds, etc. The start estimator and the end estimator firstly use an average pooling over the deep correlation representation to summarize the relations, and then employ the respective MLP network to learn the scalar scores for all the objects.

Linear Programming
We obtain several binary variables with prediction scores from our proposed neural network, as shown in Figure 1 (d). In summary, the detection score s_det^i indicates the confidence that the i-th object is a true positive detection. s_A^{ij} denotes the affinity confidence that the j-th object in previous frame t − 1 and the i-th object in current frame t are the same object. s_start^j denotes the confidence that the j-th object in previous frame t − 1 starts a new trajectory in frame t − 1. s_end^i denotes the confidence that the i-th object in current frame t ends a trajectory in frame t. We then aggregate all the prediction scores into a new vector S = [s_det^i, s_A^{ij}, s_start^j, s_end^i] for the optimization of the data association problem.
Considering the graph structure for all the detected objects, the data association problem can be formulated as the min-cost flow graph problem [31,47]. Specifically, we use these obtained prediction scores to define linear constraints, and then find an optimal solution for matching problem.
There are two circumstances for a certain true positive object in previous frame t − 1: it can either be matched to an object in current frame t, or it starts a new trajectory. As a result, we define the linear constraint as Equation 4:

s_det^j = s_start^j + Σ_{i=1}^{N} s_A^{ij}    (4)
Similarly, a certain true positive object in current frame t can either be matched to an object in previous frame t − 1, or it ends a trajectory. Consequently, the linear constraint can be defined as Equation 5:

s_det^i = s_end^i + Σ_{j=1}^{M} s_A^{ij}    (5)
Finally, we formulate the data association problem as Equation 6:

S* = argmax_S Θ(X)^T S,  s.t. C_S S = 0    (6)

where Θ(X) indicates a flattened vector that comprises all the prediction scores, and C_S is the matrix form of the two linear constraints in Equations 4 and 5.
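For intuition, the following sketch brute-forces the association objective on a tiny instance, assuming every detection is a true positive (so the equality constraints of Equations 4 and 5 reduce to "matched at most once, otherwise start/end a trajectory"). The scores are made up; production systems solve the same problem with a linear-program or min-cost-flow solver rather than enumeration.

```python
from itertools import product

def best_association(aff, start, end):
    """Brute-force the data association objective on a tiny instance.
    aff[i][j]: affinity between detection i in frame t and detection j
    in frame t-1; start[j]/end[i]: trajectory start/end scores."""
    N, M = len(aff), len(aff[0])
    best, best_score = None, float("-inf")
    for bits in product([0, 1], repeat=N * M):
        A = [list(bits[i * M:(i + 1) * M]) for i in range(N)]
        if any(sum(A[i][j] for i in range(N)) > 1 for j in range(M)):
            continue  # an object in t-1 matched more than once
        if any(sum(row) > 1 for row in A):
            continue  # an object in t matched more than once
        score = sum(aff[i][j] * A[i][j] for i in range(N) for j in range(M))
        # Unmatched objects start (frame t-1) or end (frame t) a trajectory.
        score += sum(start[j] * (1 - sum(A[i][j] for i in range(N))) for j in range(M))
        score += sum(end[i] * (1 - sum(A[i])) for i in range(N))
        if score > best_score:
            best, best_score = A, score
    return best, best_score

A, score = best_association([[0.9, 0.1], [0.2, 0.8]],
                            start=[0.1, 0.1], end=[0.1, 0.1])
```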

Dataset
We evaluate our neural network on the KITTI object tracking benchmark [11,12]. The benchmark consists of 21 training sequences and 29 testing sequences. We split the training sequences into 10 sequences for training and 11 sequences for validation, which yields 3975 frames for training and 3945 frames for validation. The dataset is captured from a car equipped with two colour/grayscale stereo camera rigs, one Velodyne HDL-64E rotating 3D laser scanner and one GPS/IMU navigation system. Each object is annotated with a unique identity number (ID) across the frames in the sequences, together with its 2D and 3D bounding box parameters. We measure the distance between a predicted bounding box and the bounding box of the matched object-hypothesis by calculating the intersection over union (IoU).

Evaluation Metrics
The evaluation metrics used to assess the performance of tracking methods are based on CLEAR MOT [4] and [19]. Specifically, MOT precision (MOTP) measures the average distance error over all frames in the sequences, as defined in Equation 7, and indicates the total misalignment between the predicted bounding boxes and the corresponding matched object-hypotheses:

MOTP = (Σ_t Σ_i d_t^i) / (Σ_t c_t)    (7)

where d_t^i indicates the distance between the bounding box of the i-th object and the corresponding matched hypothesis in frame t, and c_t indicates the total number of matched objects in frame t.
MOT accuracy (MOTA) measures the total tracking accuracy over all frames, as defined in Equation 8:

MOTA = 1 − (Σ_t (FN_t + FP_t + IDSW_t)) / (Σ_t GT_t)    (8)

where FN_t, FP_t, IDSW_t and GT_t denote the total numbers of false negatives, false positives, identity switches and ground-truth objects respectively in frame t.
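Both metrics can be computed directly from per-frame counts; the numbers below are made-up toy values.

```python
def motp(distances, matched):
    """MOTP (Equation 7): total distance error over total matches.
    distances[t] lists d_t^i for frame t; matched[t] is c_t."""
    return sum(sum(d) for d in distances) / sum(matched)

def mota(fn, fp, idsw, gt):
    """MOTA (Equation 8): 1 - (FN + FP + IDSW) / GT, summed over frames."""
    return 1.0 - (sum(fn) + sum(fp) + sum(idsw)) / sum(gt)

acc = mota(fn=[1, 0], fp=[0, 1], idsw=[0, 0], gt=[10, 10])   # 1 - 2/20
prec = motp(distances=[[0.1, 0.2], [0.3]], matched=[2, 1])   # 0.6 / 3
```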
Besides, [19] introduces further metrics to improve the tracking assessment: mostly tracked (MT) indicates the percentage of trajectories in the sequences that are covered for more than 80% of their total length; mostly lost (ML) indicates the percentage covered for less than 20% of their total length; partially tracked (PT) equals 1 − MT − ML.

Training Settings
Our model is implemented in PyTorch 1.1. We train it on a ThinkStation P920 workstation with one NVIDIA GTX 1080Ti, and use Adam as the optimizer with an initial learning rate of 3e-4. Besides, the super-convergence training strategy is employed to speed up training, with the maximum learning rate set to 6e-4.
Table 1 shows that our Relation3DMOT model achieves competitive results compared with recent state-of-the-art online tracking methods. It is easy to observe that our model achieves the best ID-SW metric when compared to 2D online tracking methods. We argue that 3D spatial information and location help avoid confusion when matching pairwise objects. Our MOTA and MOTP results outperform those of GNN3DMOT [42], whose feature interaction mechanism employs an MLP network to exploit discriminative features; this shows the effectiveness of our RelationConv operation for feature interaction. Compared to mmMOT [47], our model is slightly better, benefiting from our RelationConv operation and the combination of both appearance and motion features.

Ablation study
We investigate different hyper-parameter settings to evaluate the effectiveness of our model on the KITTI object tracking benchmark [11].
Table 2 indicates the effectiveness of the different fusion methods. It shows that the concatenation operation outperforms the addition and attention-based weighted sum methods. We believe that the concatenation operation is capable of exploiting more useful information from the features obtained from different sensors. Specifically, the addition operation is inefficient at aligning the features captured from 2D RGB images and 3D point clouds, making it difficult to learn discriminative features after element-wise addition. Although the attention-based weighted sum method can highlight the importance of different features, the concatenation operation is a more general operation that gathers all the information from the different modalities.
Table 3 shows that the performance is best when the edge feature is y_t^{ij} = |x_t^i − x_{t−1}^j|. We observe that the first option, x_{t−1}^j, only considers the neighbourhood information without involving the center nodes. The third option, x_t^i − x_{t−1}^j, uses the relative distance between the object pair as the edge feature, which is efficient for spatial data (e.g. point clouds); however, the result shows that the absolute value is more efficient when learning discriminative features for similarity. The fourth option, [x_t^i, x_{t−1}^j], encodes the edge feature by combining the individual information without explicitly considering the distance between each pair of objects.
We also investigate the effectiveness of the appearance feature and the motion feature, as shown in Table 4, which indicates that the motion feature improves MOTA by 1.07% and significantly reduces the numbers of ID-SW and Frag by 119 and 113 respectively. This convincingly verifies that the motion of the 2D/3D localization helps to match the correct objects for data association.
Table 5 compares our RelationConv operation with the traditional MLP method for local feature extraction. It shows that our RelationConv outperforms the MLP method by a significant 1.2% on the MOTA metric and performs much better in terms of the ID-SW and Frag metrics.
It is worthwhile noting that the MOTP metric is only related to the distance between the predicted bounding boxes and the corresponding matched object-hypotheses, as shown in Equation 7. Besides, MT and ML are unaffected by whether the IDs of the objects remain the same throughout the entire sequence. As a result, the MOTP, MT and ML metrics do not change during our ablation study (see Tables 2, 3, 4 and 5) as long as the performance of object detection remains the same.

Conclusion
We propose a deep affinity network, named Relation3DMOT, to learn discriminative features and associate the objects in the adjacent frames for 3D MOT. We employ a joint feature extractor to capture the 2D/3D appearance feature and motion feature from 2D images and 3D point clouds respectively, followed by a feature interaction module to enhance the feature communication among objects in the different frames. We also propose an efficient convolutional operation, named RelationConv, to abstract semantic and contextual relations for each object pair. We finally perform extensive experiments on the KITTI object tracking benchmark to demonstrate the effectiveness of our Relation3DMOT tracker.
In the future, we plan to improve our model by considering objects under hard occlusion in the frames. Furthermore, it would be worthwhile to develop an end-to-end framework for the joint task of object detection and tracking, which we believe could avoid the decoupling issue of handling object detection and tracking separately.

Conflicts of Interest:
The authors declare no conflict of interest.