Multi-Target Tracking Based on a Combined Attention Mechanism and Occlusion Sensing in a Behavior-Analysis System

Multi-object tracking (MOT) is a topic of great interest in the field of computer vision and is essential in smart behavior-analysis systems for healthcare, such as human-flow monitoring, crime analysis, and behavior warnings. Most MOT methods achieve stability by combining object-detection and re-identification networks. However, MOT requires high efficiency and accuracy in complex environments with occlusions and interference. This often increases the algorithm's complexity, slows tracking computations, and reduces real-time performance. In this paper, we present an improved MOT method that combines an attention mechanism and occlusion sensing. A convolutional block attention module (CBAM) calculates spatial and channel attention weights from the feature map; the attention weights are used to fuse the feature maps and adaptively extract robust object representations. An occlusion-sensing module detects whether an object is occluded, and the appearance features of an occluded object are not updated. This enhances the model's ability to extract object features and mitigates the appearance-feature pollution caused by short-term occlusion. Experiments on public datasets demonstrate the competitive performance of the proposed method compared with state-of-the-art MOT methods. The experimental results show that our method has powerful data-association capability, e.g., 73.2% MOTA and 73.9% IDF1 on the MOT17 dataset.


Introduction
Artificial intelligence is widely used in the field of healthcare [1,2]. Specifically, researchers focus on human behavior analysis based on multi-object tracking (MOT) for healthcare systems [3]. MOT is a topic of interest in the field of computer vision, with broad prospects in fields including intelligent video monitoring [4][5][6], assisted driving [7][8][9], smart agriculture [10,11], and behavior analysis [12][13][14]. The main task is to track multiple objects in a video sequence, assign a unique identifier (ID) to each object, maintain identity stability when occlusion and interaction occur, and finally obtain each object's motion track.
The main problems to be solved include an object's occlusion, interference of similar objects, and mutual influence between multiple objects. As the tracking environment is complex and changeable, and the characteristics of tracked objects are similar, the performance of MOT systems is limited by their ability to distinguish the appearance characteristics of multiple objects and keep them stable [15].
An MOT algorithm can be divided into detection, tracking, and classification stages; algorithms are categorized as detection-based tracking (DBT) or detection-free tracking (DFT), according to whether a detector must first process each image. The present study focuses on improving (1) the speed of extracting different object features as well as the stability of distinguishing between object features; and (2) the accuracy of associations under different association methods, to balance tracking accuracy and operational efficiency [16].
MOT algorithms can be online or offline, depending on whether a video sequence is tracked progressively frame by frame, i.e., whether real-time video streams can be analyzed and predicted. Among online algorithms, Bewley et al. proposed SORT [17], which divides tracking into modules: Faster R-CNN [18] performs object detection on the input video, and a Kalman filter combined with the Hungarian algorithm determines whether objects in different frames are the same. To address the instability of the SORT algorithm in the association module, DeepSORT [19] introduced appearance-model-assisted data association, which improved association accuracy and reduced the probability of failed associations due to occlusion and other problems.
Zhang et al. [20] proposed an improved DeepSORT algorithm based on YOLOv5 for MOT, improving tracking efficiency. Considering that the detection model and embedding model in DeepSORT-like algorithms run independently, which reduces the efficiency of MOT, Wang et al. [21] proposed JDE, which integrates the detection and embedding models for joint embedding learning and is more efficient than a separate detection and embedding model (SDE) such as DeepSORT. Its image resolution and running speed reached 1088 × 608 and 18.8 FPS, respectively, close to the real-time requirement (20 FPS). Yoo et al. [22] designed an object-constraint learning method to raise tracking efficiency. Boragule et al. [23] advanced a pixel-guided method to combine the joint-detection and tracking tasks for MOT.
Two difficulties remain in current research on MOT algorithms: (1) Tracking consistency: in complex environments, such as those with many occlusions, the algorithm's ability to keep tracking the same object needs to be improved. (2) Operational efficiency: the application scenarios of multi-object tracking require running at speeds close to, or even beyond, real time (20 FPS); online algorithms, in particular, face high running-speed requirements.
This paper presents an improved tracking method based on joint detection and embedding learning (JDE) and makes the following contributions: (1) To unify spatial and channel features, a CBAM extracts attention weights in the spatial and channel dimensions from the feature map; adaptive fusion of the feature maps using these weights enhances the model's ability to extract object features. (2) To address the temporary occlusion of a detected object, an occlusion-detection module adaptively determines the current occlusion situation; when an object is judged to be occluded, its appearance features are not updated, which mitigates feature contamination. (3) The proposed method achieves excellent tracking performance on public datasets.
The rest of this paper is organized as follows. Section 2 describes related work. Section 3 introduces the proposed method in detail. Some experimental results are discussed in Section 4 followed by our concluding remarks in Section 5.

MOT Algorithm
MOT algorithms can be two-step or one-step, depending on whether a single network predicts both the detection information and the re-identification (Re-ID) features of an object.

Two-Step MOT Algorithm
The two-step algorithm takes the output of the object-detection task as the input of the object re-identification task. Hence, two tasks are processed separately using different network models, which requires serial execution.
For example, building on SORT [17], DeepSORT [19] introduces an appearance model based on a CNN [18], which adds the ability to extract appearance-feature information for objects in the input image and enhances the robustness of the algorithm during tracking. The integration of appearance features improves the model's handling of long-term occlusion, which reduces the erroneous object-ID switching that occurs during tracking. Tracktor++V2 [24] utilizes the bounding box to predict the location of the target in the next frame, thus converting the detector into a tracker. However, because two networks are used to obtain object-detection information and object-appearance characteristics, DeepSORT is less efficient than SORT, and serially scheduling the two tasks further limits efficiency.

One-Step MOT Algorithm
Two-step MOT algorithms use two networks to obtain object-detection information and object-appearance characteristics. The JDE [21] algorithm was proposed to integrate the object-detection model and the appearance-embedding model into the same network. FairMOT [25] applies CenterNet [26] to create two homogeneous branches, detection and embedding, for predicting pixel-level objectness information. TransCenter [27] implements multi-scale, pixel-level queries and object-centered heatmaps, linking them frame by frame with a transformer network.
This allows a single network to extract both types of information in one forward pass, improving the algorithm's efficiency. The performance differences between one-step and two-step algorithms mainly manifest in four aspects: speed, accuracy, model complexity, and data augmentation. Compared to two-step algorithms, one-step algorithms are faster, since they directly perform dense bounding-box prediction on the entire image and avoid separate object-detection and object-tracking computations, making them suitable for applications with strict real-time requirements.

However, one-step algorithms usually have lower accuracy than two-step algorithms: because they perform dense bounding-box prediction on the entire image, they are prone to false positives and false negatives. Additionally, one-step algorithms usually have simpler network structures and fewer parameters, whereas two-step algorithms need separate object-detection and object-tracking networks and are thus relatively more complex. Finally, one-step algorithms often adopt techniques such as data augmentation and adaptive sampling to improve generalization and robustness, further enhancing performance.

Attention Mechanism
Attention is an inherent mechanism of human vision. When the human eye looks at an object or scene, attention is distributed differently according to the object or scene. Such a mechanism helps humans quickly obtain critical information from the environment and allows careful observation of an object's details. The attention mechanism in deep learning is inspired by human attention and is widely used in deep-learning tasks such as image classification and natural language processing. Depending on its scope, an attention mechanism can be categorized as channel attention [28], spatial attention [29], or mixed attention [30].
Spatial attention mechanisms arise because, for an input image, part of the area is unrelated to the recognition or segmentation task, so only the task-relevant area needs to be processed. Spatial attention computes attention over the spatial layout of the input image, preserving key information and suppressing non-key information. A representative spatial attention model is the spatial transformer network (STN) [29], proposed by Google DeepMind, which learns from the input to select a preprocessing operation suitable to the task.
For visual tasks, the input image has both spatial and channel dimensions, and the network's ability to extract feature information can be effectively improved by modeling the dependency between channels in the feature map. SENet [28] is a representative channel-attention model: it compresses the spatial dimensions of the input feature map while preserving the channel dimension, generates a weight for each channel through the network, learns to adjust these weights during training, and multiplies the resulting weight vector by the original input feature map. This amplifies the feature information of important channels, suppresses that of less important channels, and improves the efficiency of the network at extracting feature information.
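The squeeze-and-excitation computation described above can be sketched in a few lines. The following is an illustrative NumPy sketch, not SENet's actual implementation; the function names and the two weight matrices W0 and W1 (a small ReLU MLP standing in for the learned excitation layers) are our assumptions.

```python
import numpy as np

def se_channel_weights(F, W0, W1):
    """Squeeze-and-excitation: global average pool, two FC layers, sigmoid gate."""
    z = F.mean(axis=(1, 2))             # squeeze: C x H x W -> C
    s = W1 @ np.maximum(W0 @ z, 0.0)    # excitation MLP with ReLU bottleneck
    return 1.0 / (1.0 + np.exp(-s))     # per-channel weights in (0, 1)

def se_rescale(F, W0, W1):
    """Multiply each channel of F by its learned importance weight."""
    return F * se_channel_weights(F, W0, W1)[:, None, None]
```

Channels whose weight approaches 1 pass through almost unchanged, while channels with small weights are suppressed, which matches the amplify/suppress behavior described above.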
Attention mechanisms that use only the spatial or only the channel dimension ignore the intrinsic relationship between features and fail to consider both characteristics simultaneously. The mixed-attention mechanism combines spatial and channel information in a hybrid design; the CBAM [30] and the dual-attention network [31] are representative models.

Shortcomings or Research Gaps
Based on the current related work, we summarize the current shortcomings of multi-object tracking as follows:
1. Robustness: Multi-object tracking algorithms still need to improve their robustness to external factors, such as lighting changes, occlusions, and motion blur.
2. Long-term tracking: Long-term tracking involves cross-frame object re-identification and model updates, and there are still some challenges, such as model drift, occlusions, and motion blur.
3. Object re-identification: Object re-identification is one of the key technologies of multi-object tracking; however, there are still certain issues, such as object deformations and viewpoint changes.
4. Algorithm efficiency: Multi-object tracking algorithms usually need to process a large amount of data, and fast and efficient algorithms are needed for real-time applications.

FairMOT Framework
Researchers have found that one-step models such as JDE [21] use an anchor-based detection network [32], which results in inconsistencies between the appearance-embedding features extracted at the anchor point and the real object during training, thereby reducing tracking accuracy. Therefore, Zhang et al. [25] proposed FairMOT, a one-step MOT algorithm based on an anchor-free framework.

DLAseg-Based Backbone
Deep layer aggregation (DLA) [33] iteratively aggregates information across a network structure, achieving higher accuracy with fewer parameters, in a manner similar to a feature pyramid network. The DLAseg [26] network used by the FairMOT algorithm introduces more skip connections on top of the DLA network, enabling more information to be shared between low-level and high-level features. To alleviate the problem of detecting key points and aligning objects, deformable convolution improves the information-extraction capability of the upsampling operations, enabling the network to dynamically adjust its receptive field to the size of the object.
Different from traditional convolutional kernels, the shape and position of deformable convolutional kernels are learned, making them better suited to irregular shapes and position variations of the targets. In addition, deformable convolutions can reduce the number of parameters, enhance the generalization ability of the model, and improve its performance. The structure of the DLAseg network, as shown in Figure 1, is based on DLA34 and introduces a variant network after deformable convolution.
In the input and output parts of the network, we denote the size of the input image as H_image × W_image; the output feature map has dimensions C × H × W, where H = H_image/4 and W = W_image/4.

Object Detection Branch
Based on the DLAseg network, the FairMOT algorithm adds three parallel predictors for the object-detection part of the tracking task, as shown in Figure 2: a heatmap predictor, a box-size predictor, and a center-offset predictor, each consisting of a 256-channel convolution with kernel sizes of 3 × 3 and 1 × 1. The heatmap predictor predicts the center position of the object; its output feature map is H × W × 1, and if a location in the heatmap coincides with the center of a real object, the response at that location is 1. The box-size predictor predicts the size of the object-detection bounding box at each object location.
The center-offset predictor locates the object more accurately in the image. As the resolution of the feature map is one-fourth that of the original image, a stride of 4 in the feature map introduces an error of up to 4 pixels in the original image. With the center-offset predictor, the offset of each pixel relative to the true object center can be estimated from the predictor's output feature map, thereby mitigating the error introduced by downsampling.
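The offset correction amounts to simple arithmetic: a heatmap peak at a feature-map cell, plus the predicted sub-cell offset, is scaled back to image coordinates by the stride. A minimal sketch (the helper name decode_center is ours, not FairMOT's):

```python
def decode_center(x_feat, y_feat, dx, dy, stride=4):
    """Map a heatmap peak (x_feat, y_feat) plus predicted sub-cell offset
    (dx, dy) back to image-pixel coordinates."""
    return (x_feat + dx) * stride, (y_feat + dy) * stride

# Without the offset, a true center at image x = 101 quantizes to cell 25
# and back-projects to 100; a learned offset of 0.25 recovers it exactly.
```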

Object Recognition Branch
The object-recognition branch, as shown in Figure 3, generates visual features that distinguish different tracked objects. Ideally, the similarity between features of different objects output by this branch should be lower than the similarity between features of the same object.

Convolutional Attention Module
The CBAM [30], as shown in Figure 4, is a lightweight attention module. Based on the SE module of SENet [28], it adds an attention module in a second dimension, the spatial dimension. Hence, the CBAM can be divided into channel-attention and spatial-attention modules. The CBAM extracts features from multiple channels and spatial locations by blending them with convolution. In terms of the transformation of the feature map, the calculation can be expressed as

F' = M_c(F) ⊗ F,    (1)
F'' = M_s(F') ⊗ F',    (2)

where F represents the input feature map; M_c is the channel-attention computation; M_s is the spatial-attention computation; and F' and F'' are the output feature maps after applying channel and spatial attention, respectively.

The channel-attention module, as shown in Figure 5, performs maximum and average pooling on the input feature map and uses a shared-weight multilayer perceptron (MLP) to learn and predict. After adding the two output vectors, the channel-attention weights are obtained through the sigmoid activation function [34]. These operations can be expressed as Equation (3):

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F^c_avg)) + W_1(W_0(F^c_max))),    (3)
where σ represents the sigmoid activation function; W_0 and W_1 represent the two layers of the shared MLP; F^c_avg represents the average-pooled features for channel attention; and F^c_max represents the max-pooled features for channel attention.
The spatial-attention module, as shown in Figure 6, applies max pooling and average pooling along the channel axis of the input feature map in parallel. The experiments in [30] indicate that performing only a single pooling operation results in significant information loss; using average and max pooling in parallel reduces the information lost compared to a single pooling, yielding better performance. The concatenated result is convolved with a 7 × 7 kernel, and the spatial-attention weights are obtained by applying the sigmoid activation function to the result. These operations can be expressed as Equation (4):

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])) = σ(f^{7×7}([F^s_avg; F^s_max])),    (4)
where σ represents the sigmoid activation function; f^{7×7} represents convolution with a 7 × 7 kernel; F^s_avg represents the average-pooled features for spatial attention; and F^s_max represents the max-pooled features for spatial attention. The CBAM can be added to the DLAseg [26] backbone network and the prediction branches, and its location affects its effect. Experiments show that adding the CBAM after the prediction branches best improves the comprehensive performance across multiple indicators; the experimental process and results are given in Section 4. The network structure after adding the CBAM is shown in Figure 7. In the prediction branches, the numbers of channels of the heatmap predictor, center-offset predictor, box-size predictor, and object-recognition predictor are 1, 2, 4, and 128, respectively. Applying attention to feature maps with few channels is not effective; therefore, in this paper the CBAM is inserted into the box-size predictor and the object-recognition predictor.
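The channel- and spatial-attention computations described above can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the CBAM implementation: the function names, the small shared MLP weights W0/W1, and the explicit 7 × 7 kernel K stand in for learned layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W0, W1):
    """Channel weights from shared-MLP(avg-pool) + shared-MLP(max-pool); F is C x H x W."""
    f_avg = F.mean(axis=(1, 2))                   # C-dim average-pooled descriptor
    f_max = F.max(axis=(1, 2))                    # C-dim max-pooled descriptor
    mlp = lambda v: W1 @ np.maximum(W0 @ v, 0.0)  # shared two-layer MLP with ReLU
    return sigmoid(mlp(f_avg) + mlp(f_max))       # per-channel weights in (0, 1)

def spatial_attention(F, K):
    """Spatial weights from a 7x7 conv over channel-wise avg/max pooled maps."""
    pooled = np.stack([F.mean(axis=0), F.max(axis=0)])  # 2 x H x W
    pad = np.pad(pooled, ((0, 0), (3, 3), (3, 3)))      # same-size output
    H, W = F.shape[1:]
    out = np.zeros((H, W))
    for i in range(H):                                   # naive convolution
        for j in range(W):
            out[i, j] = np.sum(K * pad[:, i:i + 7, j:j + 7])
    return sigmoid(out)

def cbam(F, W0, W1, K):
    """Channel attention first, then spatial attention, as in the CBAM ordering."""
    Fp = F * channel_attention(F, W0, W1)[:, None, None]  # F' = M_c(F) * F
    return Fp * spatial_attention(Fp, K)[None]            # F'' = M_s(F') * F'
```

A real implementation would use learned convolution layers and batched tensors; the sketch only makes the data flow of the two attention stages explicit.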

Occlusion-Sensing Module
For occlusion detection (as shown in Figure 8), the traditional IoU algorithm [31] calculates the overlap ratio as the intersection area of two detection boxes divided by the area of their union and filters the objects that meet the requirements with a threshold. This works well when the objects are of similar size. However, in pedestrian-tracking tasks, the sizes of pedestrian-detection boxes can vary greatly depending on the distance from the camera, which greatly reduces the effectiveness of the IoU criterion. In Figure 9a,b, the IoU values reflect the overlap of two objects whose boxes are of similar size. In case (c), however, the size difference between the boxes is large, and the IoU value does not accurately capture the occlusion of the small object.

Based on the traditional IoU algorithm [31], we introduce a judgment based on the object's center point: when judging two object boxes for occlusion, if the center point of one box lies within the coordinate range of the other, the object is determined to be occluded. Assume two object-detection boxes, b_1 and b_2, with center points c_1 and c_2, respectively; IoU denotes the intersection-over-union of the two boxes, and F indicates occlusion (1 for occlusion, 0 for none):

F = { 0, if IoU ≥ 0 and c_1 ∉ b_2 and c_2 ∉ b_1
    { 1, otherwise.
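The center-point test above can be sketched in plain Python. The function names are illustrative, and this is a simplified sketch of the rule rather than the exact implementation, but it shows why the test catches the large/small-box case that plain IoU misses:

```python
def iou(b1, b2):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def center_in(box_a, box_b):
    """True if the center of box_a lies inside box_b."""
    cx, cy = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    return box_b[0] <= cx <= box_b[2] and box_b[1] <= cy <= box_b[3]

def occluded(b1, b2):
    """F = 1 if either box's center lies inside the other box, else 0."""
    return 1 if center_in(b1, b2) or center_in(b2, b1) else 0
```

For a small pedestrian box fully inside a large one, IoU stays near zero, yet the center-point rule still flags the occlusion.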
In FairMOT [25], the detected object is matched against the tracks held by the tracker through cascade matching and IoU matching in the association stage. Most successful associations occur in the cascade-matching part; thus, adding the occlusion-detection module there yields a good result. The improved tracker flow is shown in Figure 10, where the bold module represents the modification made when adding the occlusion-detection module.

Heatmap Loss Function
The size of the heatmap is H × W × 1. If a location coincides with the center of a real object, the response at that location is 1, and the response decays exponentially with the distance between the heatmap location and the object center. The heatmap response at location (x, y) is

M_xy = Σ_{i=1}^{N} exp(−((x − c̃^i_x)² + (y − c̃^i_y)²) / (2σ_c²)),    (5)

where N is the number of objects in the image, (c̃^i_x, c̃^i_y) is the downsampled center of the i-th object, and σ_c is the standard deviation.
The loss function is defined as pixel-wise logistic regression with focal loss:

L_heat = −(1/N) Σ_{xy} { (1 − M̂_xy)^α log(M̂_xy), if M_xy = 1
                        { (1 − M_xy)^β (M̂_xy)^α log(1 − M̂_xy), otherwise,    (6)

where M̂ is the predicted heatmap, and α and β are preset loss parameters.
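The pixel-wise focal loss can be sketched in NumPy as follows. This is an illustrative sketch, not FairMOT's code: the function name and the eps clamp for numerical stability are our additions.

```python
import numpy as np

def heatmap_focal_loss(M_hat, M, alpha=2.0, beta=4.0, eps=1e-12):
    """Pixel-wise focal loss over a heatmap: positives where M == 1,
    distance-weighted negatives elsewhere; normalized by object count."""
    pos = (M == 1.0)
    n = max(int(pos.sum()), 1)
    # hard-example weighting on positives: (1 - M_hat)^alpha
    pos_loss = ((1 - M_hat) ** alpha * np.log(M_hat + eps))[pos].sum()
    # negatives near centers are down-weighted by (1 - M)^beta
    neg_loss = ((1 - M) ** beta * M_hat ** alpha * np.log(1 - M_hat + eps))[~pos].sum()
    return -(pos_loss + neg_loss) / n
```

A prediction matching the target yields a loss near zero, and the loss grows as the prediction degrades, which is the behavior required for training the heatmap branch.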

Box Offset and Size Prediction Loss Function
The box-offset predictor is used to locate the object more accurately in the image. As the stride of the feature map is 4, a non-negligible error of up to 4 pixels is introduced; the predictor estimates the continuous offset of each pixel from the object center to mitigate the effect of downsampling. The size predictor estimates the size of the object bounding box at each location.
The outputs of the box-offset predictor and the size predictor are O ∈ R^{2×H×W} and S ∈ R^{2×H×W}, respectively. For each GT box b_i = (x^i_1, y^i_1, x^i_2, y^i_2), the size can be calculated as s_i = (x^i_2 − x^i_1, y^i_2 − y^i_1), and the GT offset is o_i = (c^i_x/4 − ⌊c^i_x/4⌋, c^i_y/4 − ⌊c^i_y/4⌋), where (c^i_x, c^i_y) is the box center.    (7)

The loss is

L_box = Σ_{i=1}^{N} (‖o_i − ô_i‖_1 + λ_s ‖s_i − ŝ_i‖_1),    (8)

where ô_i and ŝ_i are the predicted offset and size, and λ_s is a weight factor, set to 0.1 in the original CenterNet [26] network.
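A minimal sketch of the GT size/offset construction and the combined L1 loss, under the stride-4 assumption; the helper names are illustrative, not from the original code:

```python
import numpy as np

def gt_size_offset(box, stride=4):
    """Derive GT size s_i and sub-cell offset o_i from a box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    s = (x2 - x1, y2 - y1)                               # box width and height
    o = (cx / stride - int(cx // stride),                 # fractional part lost
         cy / stride - int(cy // stride))                 # to downsampling
    return s, o

def box_loss(o_gt, o_pred, s_gt, s_pred, lambda_s=0.1):
    """L1 offset loss plus weighted L1 size loss, summed over the N objects."""
    return float(np.abs(o_gt - o_pred).sum() + lambda_s * np.abs(s_gt - s_pred).sum())
```

For example, a box centered at x = 30 falls in feature cell 7 with a residual offset of 0.5, which is exactly what the offset branch is trained to predict.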

Object Recognition Loss Function
The resulting feature map is E ∈ R^{128×H×W}, and the object-recognition feature extracted for an object whose center is located at (x, y) is E_{x,y} ∈ R^{128}. For the object centered at (x_i, y_i), the feature vector E_{x_i,y_i} is extracted and mapped to a class distribution vector P = {p(k), k ∈ [1, K]} using a fully connected layer and a softmax operation. With the one-hot GT class label denoted L^i(k), the object-recognition loss is

L_id = −Σ_{i=1}^{N} Σ_{k=1}^{K} L^i(k) log(p(k)),    (9)

where K is the number of identity classes.
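An illustrative sketch of this identity-classification loss: the projection matrix W stands in for the fully connected layer, and the dimensions are kept generic (the paper uses 128-D embeddings).

```python
import numpy as np

def id_loss(E, centers, W, labels, eps=1e-12):
    """Cross-entropy over identity classes from embeddings sampled at object centers.
    E: D x H x W embedding map; centers: list of (x, y); W: K x D projection."""
    loss = 0.0
    for (x, y), k in zip(centers, labels):
        logits = W @ E[:, y, x]            # map the center embedding to K scores
        p = np.exp(logits - logits.max())  # numerically stable softmax
        p /= p.sum()
        loss -= np.log(p[k] + eps)         # one-hot label selects class k
    return loss / max(len(labels), 1)
```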

Overall Loss Function
The losses of the detection and recognition branches are added to form the total loss, with task-uncertainty weighting [35] used to automatically balance the detection and recognition tasks, as shown in Equations (10) and (11):

L_det = L_heat + L_box,    (10)

L_total = (1/2)(e^(−w_1) L_det + e^(−w_2) L_id + w_1 + w_2),    (11)
where w_1 and w_2 are learnable parameters used to balance the detection and recognition tasks.
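The uncertainty-weighted combination can be written directly, assuming (as in FairMOT) that w_1 and w_2 act as learnable log-variance terms whose exponentials scale each task loss:

```python
import math

def total_loss(l_det, l_id, w1, w2):
    """Uncertainty-weighted sum of the detection and identity losses.
    w1, w2 are learnable scalars; e^(-w) scales a task down as its
    uncertainty grows, while the +w terms prevent w from diverging."""
    return 0.5 * (math.exp(-w1) * l_det + math.exp(-w2) * l_id + w1 + w2)
```

With w_1 = w_2 = 0 this reduces to a plain average of the two branch losses; during training the optimizer shifts the balance automatically.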

MOT Series Dataset
The MOT-series datasets are open datasets proposed by the MOTChallenge, focusing on pedestrian-tracking tasks. The images of both the training and test sets are publicly released; the training-set labels are public, while the test-set labels are withheld. The series comprises the MOT15 [36], MOT16 [37], MOT17, and MOT20 [38] datasets. MOT15 was assembled from older datasets. MOT16 and MOT17 are newer datasets in which pedestrians are much more crowded. MOT20 is the largest and densest dataset in the series. We used the MOT-series datasets for validation and testing.

CrowdHuman Dataset
The CrowdHuman dataset [39] is a publicly available open dataset focusing on pedestrian-detection tasks. Its training set contains 15,000 images, each with head, body, and visible-region bounding boxes labeled for every pedestrian. The test-set label file is not released. We used the CrowdHuman dataset for training.

MIX Dataset
MIX is a hybrid dataset proposed by the authors of the JDE algorithm. It includes six datasets: Caltech Pedestrian, CityPersons, CUHK-SYSU, PRW, ETHZ, and MOT17. MIX is mainly used for training MOT models. We used these datasets for training.

Evaluation Metrics
We thoroughly benchmarked our method using five standard evaluation metrics, the main ones being MOTA [40] and IDF1 [41]; we also report mostly tracked objects (MT), mostly lost objects (ML), and identity switches (IDs). MOTA measures the overall performance of the tracker by combining errors from three sources, namely false negatives (misses), false positives, and identity switches. IDF1 is concerned with the quality of identity assignment, unifying identity preservation with detection quality.
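As a reminder of how the two headline metrics are computed, using their standard definitions written as small helpers (the function names are ours):

```python
def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (misses + false positives + identity switches) / GT detections.
    Can be negative when errors outnumber ground-truth detections."""
    return 1.0 - (fn + fp + idsw) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN): an F1 score computed over
    identity-consistent matches rather than per-frame detections."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

For example, a sequence with 100 ground-truth detections, 10 misses, 5 false positives, and 2 ID switches scores a MOTA of 0.83.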

Analysis of Experimental Results
Based on the FairMOT [25] research and our improvements, we trained the improved algorithm, tested the resulting model on the MOT-series datasets, and analyzed the results.

Training Process and Results
We conducted experiments using PyTorch on a server with two NVIDIA RTX 2080Ti GPUs, an 8-core Intel Xeon Silver 4110 CPU, and 32 GB of memory. Starting from the pretrained model, the improved algorithm was trained for 60 epochs on the CrowdHuman dataset and then for 30 epochs on the MIX dataset, and the resulting final model was used for testing. The hyperparameters were set as follows: the batch size was 6 for CrowdHuman training and 12 for MIX training.
The initial learning rate was set to 0.0001, and the number of dataset-loading threads (num_workers) was set to 8. The relationships between the loss curves and training epochs are shown in Figures 11 and 12. Figure 11 shows the changes in each loss function during the 60 epochs of CrowdHuman training from the pretrained model; Figure 12 shows the changes during the 30 epochs of MIX training applied to the CrowdHuman-trained model.
From the results shown in Figures 11 and 12, we can see that by the end of training, all the loss values of the network had converged well. Table 1 shows the experimental results on the MOT20 dataset using the FairMOT model and the improved model in this paper. It can be seen in Table 1 that the MOTA [40] and IDF1 [41] indices of the modified model on the MOT20 dataset increased by 1.7% and 1.54%, respectively. Therefore, after the introduction of the CBAM and the pedestrian occlusion-detection module, the model's ability to extract object-detection information and appearance features improved, producing more accurate feature information and improving tracking accuracy.

Model Comparison Experiment
In terms of the number of MT and ML indicators, for the same object in the tracking process, the stability of the tracker was improved due to the improvement of the feature information. At the same time, the short-term pedestrian occlusion-detection module reduced the problem of appearance feature pollution by stopping appearance feature updates after occlusion detection. It also improved the robustness of tracking. Table 2 shows the experimental results on MOT16. Most algorithms do not publish the specific number of MT and ML indicators but instead publish the percentage of indicators. MT and ML from the experimental results of this algorithm were converted to percentages in Table 2 and subsequent comparisons.
From Table 2, we can see that the proposed model improved on the MOT16 dataset, especially in MOTA and IDF1, the most critical MOT measures, by 0.1% and 1.3%, respectively, indicating enhanced tracking robustness. Significantly, the number of ID switches in this model is markedly lower than that in FairMOT, indicating an improved predictive effect of the tracker after adding the attention mechanism to the object-recognition branch and introducing the pedestrian occlusion-detection module. The lower number of ID switches makes the model's tracking results more useful in practical applications. The improved model achieves excellent levels on all indicators when compared with the other models in the table. The results on the MOT17 test set are shown in Table 3.

Ablation Experiments
To analyze the effectiveness of the different components of our proposed framework, we also designed a series of baseline methods for comparison. The MIX dataset was used for training, and the MOT20 training set was used for testing. The ablation study of occlusion sensing is reported in Table 4, where OS indicates the occlusion-sensing module. Figure 13 visualizes the tracking results on MOT17: without OS, the blocked object is lost (ID 807 changes to ID 811 as a new object), whereas with OS this situation is handled correctly. Both results illustrate the effectiveness of the occlusion-sensing module.

Conclusions
In this paper, we proposed a novel approach for multi-target tracking that utilizes a combined attention mechanism and occlusion sensing. The motivation behind our approach was to tackle the challenges posed by object occlusions, which can significantly affect the accuracy and robustness of object-tracking systems.
To this end, we designed a convolutional block attention module that calculates spatial and channel attention weights from the feature map. The attention weights were then used to fuse the feature maps, allowing the adaptive extraction of robust object representations. Additionally, we introduced an occlusion-sensing module capable of detecting occlusions; once an occlusion occurs, the occluded object's appearance features are not updated, preserving their purity. To evaluate the effectiveness of our method, we conducted experiments on three widely used datasets: MOT16, MOT17, and MOT20.
The experimental results show that our method achieved improvements of varying degrees on these datasets, demonstrating its accuracy and robustness in multi-target tracking scenarios. Specifically, our approach outperformed state-of-the-art methods on multiple evaluation metrics, such as MOTA and IDF1. These results validate the effectiveness of our approach in handling challenging scenarios, such as occlusions, and in improving the overall performance of multi-target tracking systems.