Enhancement of Multi-Target Tracking Performance via Image Restoration and Face Embedding in Dynamic Environments

Abstract: In this paper, we propose several methods to improve the performance of multiple object tracking (MOT), especially for humans, in dynamic environments such as robots and autonomous vehicles. The first method is to restore and re-detect unreliable results to improve the detection. The second is to restore noisy regions in the image before the tracking association to improve the identification. To implement the image restoration function used in these two methods, an image inference model based on SRGAN (super-resolution generative adversarial networks) is used. Finally, the third method includes an association method using face features to reduce failures in the tracking association. Three distance measurements are designed so that this method can be applied to various environments. In order to validate the effectiveness of our proposed methods, we select two baseline trackers for comparative experiments and construct a robotic environment that interacts with real people and provides services. Experimental results demonstrate that the proposed methods efficiently overcome dynamic situations and show favorable performance in general situations.


Introduction
The multiple object tracking (MOT) problem aims to assign IDs to multiple detected targets and to estimate the trajectory of each target until it disappears. High-performance real-time MOT is increasingly required in scenarios such as human-computer interaction, autonomous vehicles, and humanoid robots. For this reason, research on improving real-time MOT performance, such as [1][2][3][4][5], has been actively conducted. Existing MOT frameworks can be classified into two types, offline and online, depending on the temporal range of data to be considered [6]. The offline methods [7][8][9] consider the range from the past to the future, while the online methods [10][11][12][13][14] consider the range from the past to the present. In general, the offline methods perform better than the online methods because global optimization can take future states into account, but they are not suitable for real-time tracking applications such as the scenarios above. The online MOT frameworks for real-time tracking are often applied in complex, dynamic, or unexpected situations, but overcoming tracking failures in these environments remains a challenge.
Online MOT frameworks need the best data association in every frame because only current and past frames are considered. In the MOT problem, the data association generally means updating the state of an object being tracked through the collaboration of a motion model and an appearance model. The motion model compares the positional similarity between the current state of the tracking object and the current detection result. Here, prediction methods such as Kalman filters [15,16] or particle filters [17,18] are used to predict the current position of the tracking object. The appearance model compares the similarity of the appearance between the past state of the tracking object and the present detection result. Here, a method such as visual embedding is used to effectively extract features from a noisy image.
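As a rough illustration of how a motion cue and an appearance cue can be combined into one association score, consider the sketch below. It is not the scoring rule of any tracker discussed in this paper: the constant-velocity shift merely stands in for the prediction step of a Kalman filter, and the weighted sum, the `weight` parameter, and all function names are assumptions made for the example.

```python
import numpy as np

def predict_box(box, velocity):
    """Shift a box (x, y, w, h) by a per-frame velocity (vx, vy).
    Assumption: a constant-velocity stand-in for a Kalman prediction step."""
    x, y, w, h = box
    vx, vy = velocity
    return (x + vx, y + vy, w, h)

def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def association_score(track_box, track_velocity, track_feature,
                      det_box, det_feature, weight=0.5):
    """Hypothetical score: weighted sum of positional and appearance similarity."""
    motion = iou(predict_box(track_box, track_velocity), det_box)
    appearance = cosine_similarity(track_feature, det_feature)
    return weight * motion + (1.0 - weight) * appearance
```

A higher score means the detection is a better match for the tracklet. Under motion blur or nonlinear movement both terms degrade, which is the failure mode the following sections address.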
The dynamic movements of a tracking object and a camera in a real-time environment cause the reliabilities of the motion model and the appearance model to decrease. Motion models are negatively impacted by the complexity of moving the camera and objects separately. This is because the motion models of existing frameworks generally adopt linear functions, making it difficult for them to infer objects with nonlinear movements. Appearance models are negatively affected by motion blurs caused by dynamic movements. Because the surrounding pixel information is mixed, the detailed pixel representation, especially the texture information of the picture, is lost, and the outline of the object is blurred. This causes difficulties in distinguishing the boundary between the background and the object. The motion blur refers to a phenomenon in which pixels bleed due to the movement of the photographing object while the photosensor of the camera records an image [19][20][21]. The phenomenon occurs often when there is vibration caused by an uneven road surface, when the tracking target moves quickly, and when the camera mounted on the moving platform moves. As a result, the dynamic situation during multiple object tracking negatively affects the motion model and the appearance model.
In this paper, we propose three methods to overcome the limitations of existing multiple object tracking in the event of dynamic movement. The first is to perform redetection after the image restoration on the detection result whose reliability is lowered by noise. This makes it possible to calibrate ambiguous detection results that the detector could not screen. The second is to classify and restore the damaged area before the image is entered into the appearance model. This makes it easy to guess the intact state of a damaged image and match the state of the existing object that the appearance model remembers with the current detection of the state change. We adopt and train a GAN (generative adversarial networks)-based image inference model to recover damaged images due to dynamic situations in the previous two approaches. The third method introduces a face appearance model, which is an association method that uses face features. This improves the performance of the discrimination using a large amount of information on the face and a relatively low number of occlusions.
Many MOT studies use the MOT Challenge [22] Benchmark dataset to evaluate the performance of the framework. However, since it targets a stationary or smoothly moving environment, it is difficult to simulate an environment (e.g., robot, car) in which real-time MOT is applied. We thus constructed a robot environment that can provide services in a real-time environment and produced the images observed from the robot viewpoint as a benchmark set following the MOT16 benchmark rule.
Our main contributions in this paper are as follows.
1. We present three methods to enhance the performance of multiple object tracking in a dynamic environment. These three methods for overcoming dynamic situations contribute to improved detection, improved identification, and a lower chance of association failures, respectively.
2. Since each of the proposed methods is modular, there is no cost for re-learning the entire framework, so it can be easily attached to various trackers.
3. To demonstrate the effectiveness of the proposed methodology, we constructed a benchmark set on a real robot environment and verified our approaches through experimental ablation studies.

Related Work
Online multiple object tracker. Strong motion models and appearance models are essential for online MOT methods because the optimal selection must be made in the current frame without future frames. With the advent of the latest advanced object detectors [23][24][25], various MOT methods that link their tracking based on detection results have become popular. The work in [26] proposed a simple Kalman filter-based motion model that relies on the performance of the latest CNN (convolutional neural network)-based object detector. Furthermore, [27] extended [26] with a CNN-based appearance model. The authors in [28] proposed a method of classifying the correct detection candidates among the crowd and selecting the optimal detection candidate using a heatmap generation model. [29] adopted an RNN (recurrent neural network) to cope with the problem of the occlusion between objects by integrating the spatial and temporal information. The work in [30] improved model performance by integrating the information from the CNN intermediate layers to compensate for the information loss that results when existing tracking frameworks use only the last CNN layer of a detector. When providing real-time services, dynamic or unexpected movements frequently occur, but the existing MOT methods often fail to cope with such situations, resulting in poor associations. In contrast, we apply methods for overcoming dynamic situations to traditional online MOT frameworks and evaluate their effectiveness in the experiments.
GAN-based image inference model. Generative adversarial networks (GAN) [31] propose an adversarial loss in which two competitors, a discriminator and a generator, compete and learn from each other. Deep Convolutional GAN (DCGAN) [32] designed a CNN-based generator for image inference using a GAN. It indicates that when using the adversarial loss for the image inference problem, a pixel distribution close to the actual data can be obtained, resulting in more realistic images compared to the autoencoder-based model. As a result, great progress has been made in image inference problems such as style transfer [33][34][35], super resolution [36], and deblurring [37]. In this paper, we implement an image restoration module for MOT by adopting and training a GAN-based image inference model for the damaged image restoration.
Appearance embedding model. Identifying whether two images represent the same person or not is accompanied by considerable difficulties due to the curse of dimensionality. In particular, identifying an unaware person adds to the difficulty. To overcome these problems, image embedding methods [38,39] using CNN were proposed recently. These models can learn to represent the whole body or a part of the body as feature vectors with small dimensions, and after learning, they are able to extract features of people that are not involved in the learning. In particular, [39] proposed a learning method using the triplet loss which achieves great results in facial recognition. We try to use a face feature to relax the problem that occurs when only a body feature is used to implement the identification function. There are two problems with using body features only. Firstly, because the boundaries become blurred when the occlusion between multiple people occurs, the embedding results mixed with the features of several people can be extracted. Secondly, the objects may be wearing similar clothes, which leads to less differentiation. To alleviate these limitations, we propose an appearance model using face features with a low incidence of occlusion and high discrimination.

Method
Our main goal is to overcome association failures caused by dynamic situations when executing MOT in real time. Figure 1 shows the overall structure of our framework as data flows. In this section, we propose three strategies to achieve our goals in the order of data flow. The following summarizes each of the methods we propose.
Section 3.1 Re-detection: Re-detection is performed in the detection/re-detection process. This section describes the process of defining ambiguous detection results and re-detecting them after image restoration to increase the reliability of the detection results.
Section 3.2 Site restoration: This section presents the site restoration process of defining the noisy area and restoring the image of such an area to improve the reliability of the image to be used in the appearance model.
Section 3.3 Face appearance model: The face appearance model is applied in the data association process. This section defines how to associate candidates using an appearance model based on face features to overcome association failures of baseline trackers.

Re-Detection
Since the online MOT framework does not consider the future state, the best choice is needed at every frame. Since the candidates for the association are generally proposed from the detection result, the framework has a high dependency on the detection result. Therefore, if false detections on an image with noise can be reduced, tracking of the wrong object and loss of the tracked object due to detection failure can be prevented. Figure 2 illustrates our re-detection method. It aims to increase the detection reliability by re-detecting, after reconstructing the image, the ambiguous detection results whose confidence is neither too low nor high enough. First, the raw detection result is required to classify the ambiguous detection results. The raw detection result $D_{origin}$ of the current input image $x$ can be obtained using the pre-trained HumanDetector as follows:
$$D_{origin} = \mathrm{HumanDetector}(x)$$
The variable $d$ denotes a detected instance, which includes the location information $(x, y, w, h)$ and the confidence $c_{body}$; thus, $d = (b_{detect}, c_{body})$. Here, $b_{detect}$ denotes the x, y coordinates, width, and height values that make up the bounding box, and $c_{body}$ denotes the reliability of the detected object.
To classify an ambiguous detection set, $D_{origin}$ should be separated into an ambiguous detection set $D_{amb}$ and a verified detection set $D_{verified}$, depending on whether re-detection is needed or not, respectively. We define the confidence thresholds $\tau_{detect}$ and $\tau_{amb}$ to classify detection results whose confidence is neither too low nor high enough. $\tau_{detect}$ stands for the most basic threshold for the detection, and $\tau_{amb}$ is a threshold to find ambiguous detections by ignoring low confidences. This process is defined as ambiguous detect filtering and is formulated as:
$$D_{verified} = \{d \in D_{origin} \mid c_{body} \geq \tau_{detect}\}, \qquad D_{amb} = \{d \in D_{origin} \mid \tau_{amb} \leq c_{body} < \tau_{detect}\}$$
If the value of the confidence $c_{body}$ is lower than $\tau_{amb}$, the target is determined to be a non-object; thus, it is excluded from the detection set and not used in the tracker.
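The ambiguous detect filtering step can be sketched as follows. This is a minimal illustration, assuming that a detection at exactly $\tau_{detect}$ counts as verified and one at exactly $\tau_{amb}$ counts as ambiguous; the `Detection` type and the function name are hypothetical and chosen only for the example.

```python
from typing import List, NamedTuple, Tuple

class Detection(NamedTuple):
    box: Tuple[float, float, float, float]  # b_detect = (x, y, w, h)
    confidence: float                       # c_body

def filter_ambiguous(detections: List[Detection], tau_detect: float, tau_amb: float):
    """Split raw detections into a verified set and an ambiguous set.

    Detections with confidence below tau_amb are treated as non-objects
    and appear in neither returned list.
    """
    verified = [d for d in detections if d.confidence >= tau_detect]
    ambiguous = [d for d in detections if tau_amb <= d.confidence < tau_detect]
    return verified, ambiguous

# Example with illustrative thresholds
raw = [Detection((0, 0, 10, 20), 0.90),
       Detection((5, 5, 10, 20), 0.30),
       Detection((9, 9, 10, 20), 0.10)]
verified, ambiguous = filter_ambiguous(raw, tau_detect=0.40, tau_amb=0.25)
```

Only the ambiguous list is passed on to restoration and re-detection, so the extra cost grows with the number of borderline detections rather than with the number of all detections.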
For the classified ambiguous detection set $D_{amb}$, the re-detection set $D_{redetect}$ is obtained by restoring the ambiguous detection regions with the RestorationModule and then re-detecting them by reusing the existing HumanDetector. The RestorationModule used to restore damaged images is a GAN-based image inference model. The detailed procedure of training the module to optimize our model is described in Section 4.1.
To classify reliable detections from the re-detection result set $D_{redetect}$, we use the following definition. It classifies the re-verified detections using the detection threshold $\tau_{redetect}$ for the confidence $c_{redetect}$:
$$D_{re\text{-}verified} = \{d \in D_{redetect} \mid c_{redetect} \geq \tau_{redetect}\}$$
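The restore, re-detect, and re-verify steps can be sketched as below. The callables `restore` and `detect` are hypothetical stand-ins for the RestorationModule and the HumanDetector, and several details are assumptions made for the example rather than statements about the authors' implementation: that re-detection runs on the restored crop (not the full frame), that boxes are integer pixel coordinates inside the image, and that the restored crop may be larger than the input crop, so boxes are scaled back to original coordinates.

```python
import numpy as np

def redetect(image, ambiguous, restore, detect, tau_redetect):
    """Restore each ambiguous region, detect again, keep confident results.

    image:     H x W x C array
    ambiguous: list of ((x, y, w, h), c_body) with integer boxes
    restore:   callable crop -> restored crop (may change resolution)
    detect:    callable image -> list of ((x, y, w, h), confidence)
    """
    re_verified = []
    for (x, y, w, h), _c_body in ambiguous:
        crop = image[y:y + h, x:x + w]
        if crop.size == 0:
            continue  # box lies outside the image
        restored = restore(crop)
        # Scale factors between the restored crop and the original crop
        sx = restored.shape[1] / crop.shape[1]
        sy = restored.shape[0] / crop.shape[0]
        for (rx, ry, rw, rh), c_redetect in detect(restored):
            if c_redetect >= tau_redetect:
                # Map the box back into original image coordinates
                box = (x + rx / sx, y + ry / sy, rw / sx, rh / sy)
                re_verified.append((box, c_redetect))
    return re_verified
```

Passing the two models in as callables keeps the step independent of any particular detector or restoration network, which matches the modularity the method aims for.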
As a result of these processes, a combination of $D_{verified}$ and $D_{re\text{-}verified}$ can be used to construct a detection set $D_{complete}$ to be used for the tracking association. However, since a re-detected object may indicate the same object as an existing detection result, duplicate objects may exist in $D_{complete}$. Therefore, NMS (non-maximum suppression) is performed to remove the redundancy after constructing the union. The following definition refers to the NMS process, where $\tau_{nms}$ is the IOU (intersection over union) threshold for the NMS:
$$D_{complete} = \mathrm{NMS}(D_{verified} \cup D_{re\text{-}verified}, \tau_{nms})$$
The final detection set $D_{complete}$ is used as the set of candidates for restoring the blur sites in Section 3.2.
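A minimal sketch of building the final detection set is given below, using the standard greedy form of NMS: detections are visited in order of decreasing confidence, and a detection is dropped when its IOU with an already kept detection exceeds the threshold. The function names are illustrative, and whether the authors use exactly this greedy variant is an assumption.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(detections, tau_nms):
    """Greedy NMS over a list of (box, confidence) pairs."""
    ordered = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for det in ordered:
        # Keep the detection only if it does not overlap a stronger one too much
        if all(iou(det[0], k[0]) <= tau_nms for k in kept):
            kept.append(det)
    return kept

def build_complete_set(verified, re_verified, tau_nms):
    """Union of verified and re-verified detections, de-duplicated by NMS."""
    return nms(list(verified) + list(re_verified), tau_nms)

verified = [((0, 0, 10, 10), 0.90)]
re_verified = [((1, 0, 10, 10), 0.70), ((50, 50, 10, 10), 0.65)]
complete = build_complete_set(verified, re_verified, tau_nms=0.5)
```

In this example the re-verified box at (1, 0) overlaps the verified box with an IOU of about 0.82 and is suppressed, while the box at (50, 50) survives as a newly recovered detection.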

Site Restoration
Frameworks for the online MOT problem generally require one or more powerful appearance models. The appearance model aims to determine whether a tracklet (an object being tracked) and a detected object are the same object based on their visual information. The appearance model is mainly used when the motion model cannot predict due to the complicated movement of the object, or when re-identification is needed based on the visual data of the object due to the failure of the detection. However, unpredictable noise such as motion blur caused by the movement of an object has a negative effect on the data association because it makes identification by the appearance model difficult. We thus discriminate whether the detected objects are blurry images or sharp images before the image is used in the appearance model, and then restore the detection regions of those discriminated as blurry images. Figure 3 represents our site restoration method. To classify whether each element of $D_{complete}$ is blurry or sharp, we use the Laplacian kernel, inspired by [40]. Laplacian kernels are generally used to detect edges of an image, but can be used to quantify the blur of an image as well. This works because sharp images show a large number of edge detections, whereas blurry images show a small number of edge detections. The variance of the result of the convolution operation using a Laplacian kernel is called the Laplacian variance, which quantifies the blur. Therefore, an image can be judged to be blurry when its Laplacian variance is low and sharp when it is high.
We define the Laplacian variance of each detection $d$ in the detection set $D$ as the variance of the response obtained by convolving the image region of $d$ with the Laplacian kernel. If the Laplacian variance is lower than the blur threshold $\tau_{blur}$, $d$ is determined to be a blurry image.
Finally, the image restoration is conducted on the blurry image using the RestorationModule. Then, so that the reconstructed image can be used in the appearance model, the damaged areas of the original image are replaced with the reconstructed image.
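The blur test can be sketched with NumPy alone. The example below assumes a grayscale crop of at least 3 × 3 pixels and the common 4-neighbour Laplacian kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]] evaluated on interior pixels; the exact kernel and border handling used by the authors are not stated, so both are assumptions, and the function names are illustrative.

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of the Laplacian response of a 2-D grayscale image."""
    g = np.asarray(gray, dtype=np.float64)
    # Sum of the four neighbours minus four times the centre pixel
    response = (g[:-2, 1:-1] + g[2:, 1:-1] +
                g[1:-1, :-2] + g[1:-1, 2:] -
                4.0 * g[1:-1, 1:-1])
    return float(response.var())

def is_blurry(gray, tau_blur):
    """A crop is treated as blurry when its Laplacian variance is below tau_blur."""
    return laplacian_variance(gray) < tau_blur

# A flat patch has no edges; a checkerboard has an edge at every pixel
flat = np.full((8, 8), 128.0)
checker = (np.indices((8, 8)).sum(axis=0) % 2) * 255.0
```

With these two patches, the flat one yields a variance of zero and is classified as blurry for any positive threshold, while the checkerboard yields a very large variance and is classified as sharp.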
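Replacing a damaged area with its reconstruction can be sketched as follows. Because a super-resolution model typically returns a crop larger than its input, the sketch resizes the restored crop back to the box size before pasting; this resizing step, the nearest-neighbour method, and the function names are assumptions made for the example, and the box is assumed to lie fully inside the image.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an H x W (x C) array."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

def replace_region(image, box, restored):
    """Return a copy of image whose box region holds the restored crop."""
    x, y, w, h = box
    out = image.copy()  # leave the original frame untouched
    out[y:y + h, x:x + w] = resize_nearest(restored, h, w)
    return out

frame = np.zeros((20, 20, 3), dtype=np.uint8)
restored_crop = np.full((16, 16, 3), 255, dtype=np.uint8)
patched = replace_region(frame, (2, 3, 4, 4), restored_crop)
```

Working on a copy means the detector output and the original frame stay available to other modules, while the appearance model sees only the patched frame.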

Face Appearance Model
When tracking a person, the face of the target can be observed in many situations, allowing face data to be used for identification. The face data is advantageous for identification compared to the other recognizable information of the body. For example, it is less likely to meet people with similar faces than to meet people with similar fashions. In addition, features are less likely to be mixed because occlusion is relatively less likely than when using the whole body. We propose an appearance model that uses face features to compensate for the problems that arise when using only body features. Our association method aims to associate candidates that are not yet associated after executing the association of the baseline tracker. The unassociated detection set is defined as $D_{missed} = \{d_1, d_2, \ldots, d_{missedDets}\}$ and has the detection position $d$ as an element. The unassociated tracklet set is defined as $T_{missed} = \{H_1, H_2, \ldots, H_{missedTrks}\}$ and has a tracklet $H$ as an element. Since the tracklet has its past trajectories, it is defined as $H = \{h_1, h_2, \ldots, h_{historyNum}\}$, where $h$ is a past position.
For the extraction of face features, the candidates' faces are first detected and then embedded. The embedding is performed only for those faces whose detection confidence is greater than or equal to $\tau_{faceDetect}$. Consequently, the feature vector $v_d$ is obtained from the detection position $d$, and the feature vector set $V_H$ is obtained from the tracklet $H$.
$V_H$ is defined as $V_H = \{v_{p_1}, v_{p_2}, \ldots, v_{p_{featureNum}}\}$, which is a set of feature vectors $v_p$ extracted from the past positions $h$ of tracklet $H$. $featureNum$ denotes the number of face features derived from $H$, and satisfies $featureNum \leq historyNum$ since a face may not be detected at a past position.
To make the association, the face appearance model needs to calculate the similarity between one detection and one tracklet, that is, the distance between $v_d$ and $V_H$. We propose three distance measurements to consider various environments when calculating the distance.
The first distance measurement uses the minimum distance between $v_d$ and $V_H$ as follows:
$$l_{min}(v_d, V_H) = \min_{v \in V_H} \operatorname{dist}(v_d, v) \tag{11}$$
The second distance measurement uses the average distance between $v_d$ and $V_H$ as follows:
$$l_{mean}(v_d, V_H) = \frac{1}{featureNum} \sum_{v \in V_H} \operatorname{dist}(v_d, v) \tag{12}$$
The third distance measurement uses the distance between $v_d$ and the most recent vector $v_{p_{featureNum}}$ of $V_H$ as follows:
$$l_{last}(v_d, V_H) = \operatorname{dist}(v_d, v_{p_{featureNum}}) \tag{13}$$
Here, $\operatorname{dist}(\cdot, \cdot)$ denotes the distance between two face feature vectors. The distance between the detection and the tracklet is measured by selecting the appropriate measurement from the three proposed distance measurements. If the distance is less than the threshold $\tau_{face}$, the two are determined to be the same person and the association proceeds. The detailed association procedure is defined in Algorithm 1.
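The three distance measurements and a simple way to use them for association can be sketched as below. Two points are assumptions made for the example and are not taken from the paper: the distance between embeddings is taken to be Euclidean, and candidate pairs are matched greedily from the smallest distance upward, which is not necessarily the procedure of Algorithm 1. All function names are illustrative.

```python
import numpy as np

def _distances(v_d, V_H):
    """Euclidean distance from v_d to every vector in V_H (oldest first)."""
    return np.linalg.norm(np.asarray(V_H, dtype=float) - np.asarray(v_d, dtype=float), axis=1)

def l_min(v_d, V_H):
    return float(_distances(v_d, V_H).min())

def l_mean(v_d, V_H):
    return float(_distances(v_d, V_H).mean())

def l_last(v_d, V_H):
    return float(_distances(v_d, V_H)[-1])  # most recent face feature

def associate_by_face(det_features, track_features, distance, tau_face):
    """Greedy one-to-one matching of unassociated detections and tracklets.

    det_features:   dict detection id -> v_d
    track_features: dict tracklet id  -> list of face vectors (oldest first)
    Returns a dict detection id -> tracklet id.
    """
    candidates = []
    for d_id, v_d in det_features.items():
        for t_id, V_H in track_features.items():
            if len(V_H) == 0:
                continue  # no face was ever observed for this tracklet
            dist = distance(v_d, V_H)
            if dist < tau_face:
                candidates.append((dist, d_id, t_id))
    candidates.sort(key=lambda c: c[0])
    matches, used_tracks = {}, set()
    for _dist, d_id, t_id in candidates:
        if d_id not in matches and t_id not in used_tracks:
            matches[d_id] = t_id
            used_tracks.add(t_id)
    return matches
```

Passing the distance function as an argument mirrors the idea of selecting the measurement per environment: a tracker with many candidates might use `l_last`, while one with few candidates might use `l_mean`.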

Experiment Configuration
Baseline tracker. Two baseline trackers are selected according to several conditions to test our method. The first condition is that our method aims to overcome the problems occurring in the real-time environment; thus, we apply the proposed method to online MOT frameworks. The second is to select validated trackers that are listed in the MOT Challenge [22]. The last condition is that open-source project models are selected to avoid polarization and to evaluate the performance fairly. In accordance with those conditions, we adopt MOTDT [28] and DeepSORT [27] as baseline trackers. MOTDT proposes a Faster RCNN-based heatmap generation model to filter the correct candidates from the expanded candidates by combining the current detection results with the previous tracking results. This approach is effective in situations where the data is noisy because objects that failed detection can additionally be found. DeepSORT uses the Kalman filter-based motion model proposed by SORT [26] and proposes a CNN-based appearance model. DeepSORT has a high dependency on detection results when constructing candidates, but it is effective in dynamic situations due to the simple association method. As the human detector that provides detection results for the two baseline trackers, we use YOLOv2 [23] 544 × 544 trained with the VOC (2007 + 2012) [41] dataset.
Restoration module. We use the SRGAN (super-resolution GAN) [36] as the image restoration module used in the re-detection and site restoration sections. SRGAN applies an adversarial loss to solve the super resolution problem and proves that the pixel values to be interpolated can be located in a realistic manifold. It is more effective than existing models using MSE (mean squared error). Based on these properties, SRGAN is used to estimate high-resolution sharp images from low-resolution blurry images. The traditional SRGAN uses the high-resolution image and the reduced low-resolution image for the ground truths and the input data, respectively, to learn to estimate the high-resolution image from the low-resolution image. However, in order for the tracker to adapt properly to dynamic movements, it is necessary not only to estimate high-resolution images from low-resolution images but also to estimate sharp images from blurry images. Therefore, our image restoration module is trained using high-resolution sharp images for the ground truths and low-resolution blurry images for the input data. Figure 5 shows how to construct a dataset to train our image restoration module. To create one training data pair consisting of a low-resolution blurry image and a high-resolution sharp image, one chunk needs to be configured. This chunk is constructed by selecting an odd number of images from a set of images arranged in a chronological order. From the constructed chunk, the low-resolution blurry image can be obtained by averaging the images and scaling it down to a quarter, and high-resolution sharp images can be obtained from the middle image of the chunk. In order for our image restoration module to focus on restoring a person's image, we need to exclude the background of the learning image. Therefore, we detect and crop humans in high-resolution sharp images and crop the same positions in low-resolution blurry images to produce training data pairs.
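The construction of one training pair from a chunk can be sketched as follows. The sketch assumes that "a quarter" means a quarter of the width and of the height, and it uses block averaging for the downscaling; both are assumptions, as is the function name. Human detection and cropping are omitted.

```python
import numpy as np

def make_training_pair(chunk, scale=4):
    """Build (low-resolution blurry input, high-resolution sharp target).

    chunk: odd-length sequence of consecutive frames, each H x W x C.
    """
    frames = np.asarray(chunk, dtype=np.float64)
    if len(frames) % 2 == 0:
        raise ValueError("chunk must contain an odd number of frames")
    sharp = frames[len(frames) // 2]   # middle frame is the sharp target
    blurred = frames.mean(axis=0)      # temporal average simulates motion blur
    # Crop so that both sides are divisible by the scale factor
    h = blurred.shape[0] - blurred.shape[0] % scale
    w = blurred.shape[1] - blurred.shape[1] % scale
    blocks = blurred[:h, :w].reshape(h // scale, scale, w // scale, scale, -1)
    low = blocks.mean(axis=(1, 3))     # block-average downscaling
    return low, sharp

# Seven synthetic frames whose pixel values equal their frame index
chunk = [np.full((8, 8, 3), float(k)) for k in range(7)]
low, sharp = make_training_pair(chunk)
```

Averaging frames captured at a high refresh rate approximates the light that a slower shutter would have integrated, which is why the middle frame serves as the sharp counterpart of the averaged image.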
There are a few conditions in taking a video to construct a training dataset:
1. It should feature people who are not included in the benchmark set for fair evaluation.
2. To simulate a natural and precise blurry image, the video must be taken at a high refresh rate with dynamic movement.
According to the above conditions, images are taken at a 240 Hz refresh rate, the chunk size is set to 7, and datasets of 31,640 pairs (body 21,220 pairs, face 10,420 pairs) are extracted from the images. The configuration for learning is based on the SRGAN default setting with 400 epochs. Figure 6 shows the examples of the restoration using our image restoration module.
Face appearance model. For the face detector, we use a mobileNet SSD [25] trained with a WIDERFACE dataset, which follows the mobileNet default configuration of the object detection API provided by the TensorFlow official project. FaceNet [39] is used for the face embedding, which is trained to represent face features in 128d using triplet loss with the MS-Celeb-1M dataset. The input image size is 224 × 224 according to the FaceNet default setting. When resizing, the interpolation method uses the inter-linear method.
The reconstructed images from the image restoration module may have a positive effect on the face appearance model. However, even if those images have visually realistic results, they may adversely affect the recognition performance. We thus evaluate the improvement in the performance of the face recognition compared to the existing interpolation method when ×4 up-sampling low-resolution face images are restored using our image restoration module. To measure the performance of the face recognition of both methods, we use the benchmarking method using the LFW(Labeled Faces in the Wild) dataset [42] proposed by FaceNet. Figure 7 shows the re-id benchmark results. The performance with the size of 20 × 20 is lower than that of bicubic interpolation because the GAN model generates some artifacts due to the severely lacking information, but for 30 × 30 to 60 × 60, superior results are shown when using our method. Since the amount of information for the recognition is large enough for over 70 × 70, the difference in performance is insignificant. Consequently, our restoration module positively affects the appearance model.

Benchmark Set
Our methods are proposed to enhance the performance by overcoming problems in dynamic situations. To see the contributions of our methods, we construct a robot environment dataset and evaluate the performance according to the MOT16 [22] benchmark method. A Turtlebot v2 is used for mobility and the recording camera is a Galaxy S8 + 12MP mounted on the robot. The video is taken following the scenario where the robot interacts with the user or patrols inside the building, and the ground truths are made by labeling humans' bounding boxes with a handcraft according to the manner of the construction of the MOT16 benchmark set. Table 1 lists the details of our benchmark sets. The Interaction-1 and Interaction-2 benchmark sets are based on the robotic guidance scenarios. The guidance robot interacts with the person and moves to its service destination. During the process, the users follow the robot with nonlinear movements that are difficult to predict. Interaction-1 and Interaction-2 include a small number of people and a relatively large number of people, respectively.
The Patrol benchmark set is based on a robotic patrol scenario. If the robot has no current purpose in progress, it patrols the lobby until a new command is received from the user. A large number of people appear in the scenes, and the robot repeats the actions of getting close to and away from the people to patrol.
Our dataset contains more dynamic scenes than the MOT16 dataset because it is based on images taken from a robot in service. To validate the degree of dynamic movements, we quantify them with Laplacian variance, taking advantage of the fact that a dynamic movement inevitably produces motion blur. Figure 8 represents the Laplacian variances in our dataset in comparison with the MOT16 datasets.
All three datasets used in our experiments are found to show lower Laplacian variances compared to the MOT16 datasets. This indicates that the edge detection by the Laplacian kernel is difficult due to the dynamic movements.
We are concerned about whether the reason for the low Laplacian variances is the camera characteristics or not. To address this concern, we additionally constructed an image set, Robot-Stop, which observes moving people in the stationary condition of the robot. The experimental results demonstrate that the low Laplacian variances of our datasets are not given by the camera characteristics because the second highest result was observed amongst all the datasets.

Experiment Results
Evaluation metric. We use Multiple Object Tracking Accuracy (MOTA) [43], the IDF1 score [44], the ratio of Mostly Tracked targets (MT), the ratio of Mostly Lost targets (ML), and the number of ID Switches (IDS) as the performance metrics, which are significant among the metrics used in the MOT Challenge. Specifically, MOTA and IDF1 are considered to be the most important performance metrics. MOTA combines false positives, missed targets, and identity switches, and IDF1 represents the consistent tracking rate of object ID. In our experiments, we consider the MOTA score as the first priority and the IDF1 score as the second priority in performance. For the evaluation, the MATLAB-based MOT Challenge Development Kit is used according to the MOT16 benchmark rule, and the benchmarking results of the three datasets are given with their weighted average scores.
Evaluation baseline tracker. The default $\tau_{detect}$ value of the detector may provide biased performance to some trackers. Therefore, to avoid this, we first observe the performance for each value of the selected detector's detection confidence threshold, $\tau_{detect}$, and fix the threshold that results in the highest performance for each baseline tracker.
In the experimental results for the baseline shown in Figure 9, the best values of $\tau_{detect}$ are 0.31 for MOTDT and 0.40 for DeepSORT. We specify MOTA and IDF1 scores for the thresholds as the baseline performance to compare with our methods.
Evaluation and ablation studies. The finalized model proposed in this paper combines the three aforementioned methods. The ablation study confirms the effectiveness of each combination of methods. When combining methods, we use the parameters that record the highest performance for each of the three methods individually, without additional hyperparameter tuning for each combination. Table 2 shows the experimental results of the ablation study. The arrows after evaluation metrics indicate whether higher (↑) or lower (↓) values represent better performance. In this table, the method column refers to which of the three methods (the re-detection, the site restoration, and the face appearance model, simplified as "face appearance") are applied to the corresponding baseline tracker. The proposed models generally perform better than the baseline methods, even when using only some of our methods, and perform best when all methods are combined. Specifically, comparing our finalized model with the baseline, the MOTA score is improved by 0.65% in MOTDT and 0.57% in DeepSORT, and the IDF1 score by 0.6% in MOTDT and 1.79% in DeepSORT.
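For reference, the two primary metrics can be computed from accumulated counts as in the sketch below, which follows the standard definitions: MOTA is one minus the ratio of all errors (misses, false positives, identity switches) to the number of ground-truth objects, and IDF1 is the F1 score over identity-level true positives, false positives, and false negatives. The function and parameter names are illustrative, and the counts in the example are made up.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)

# Illustrative counts only
example_mota = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
```

Because MOTA sums three error types, it can be negative when errors outnumber ground-truth objects, whereas IDF1 always lies between 0 and 1.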
Analysis of re-detection. Two experiments are conducted to adjust the aforementioned thresholds, τ amb and τ redetect , to find the optimal parameters for the re-detection method. In the first experiment, τ amb ranges from 0 to τ detect , and τ redetect ranges from τ detect to 0.9. Those parameters are adjusted in units of 0.05 to identify tendencies in performance. Delicate evaluation is conducted in the second experiment, adjusting those parameters to 0.01 units for the ±0.05 range on the value which shows the maximum MOTA score in the first experiment and where the IDF1 score was above the baseline. Figures 10 and  11 represent the results of the first and second experiments, respectively, where the green dotted line represents the baseline performance. According to the results shown in Figure 11, the highest score is achieved when the value of τ amb and τ redetect are 0.25 and 0.61 in MOTDT and the value of τ amb and τ redetect are 0.3 and 0.64 in DeepSORT, respectively. The reason why DeepSORT is enhanced more than MOTDT when using re-detection is that DeepSORT uses only detect results as tracking linkage candidates. This indicates that DeepSORT is more dependent on detect results. On the contrary, MOTDT uses not only the detect results but the locations estimated by the motion model as an association candidate. Consequently, DeepSORT is more affected in the performance improvement by our re-detection method.
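The coarse-to-fine threshold search can be sketched as below. The callable `evaluate` is a hypothetical stand-in for running the tracker on the benchmark and returning its MOTA score for a given pair of thresholds. For brevity the sketch selects by MOTA only and omits the additional condition that the IDF1 score stay above the baseline.

```python
import math

def grid(lo, hi, step):
    """Values lo, lo + step, ... not exceeding hi, rounded to two decimals."""
    n = math.floor((hi - lo) / step + 1e-9)
    return [round(lo + i * step, 2) for i in range(n + 1)]

def two_stage_search(evaluate, tau_detect, coarse=0.05, fine=0.01, upper=0.9):
    """Coarse sweep over both thresholds, then a fine sweep around the best pair."""
    def best(ambs, reds):
        return max(((evaluate(a, r), a, r) for a in ambs for r in reds),
                   key=lambda t: t[0])

    # Stage 1: tau_amb in [0, tau_detect], tau_redetect in [tau_detect, upper]
    _, a0, r0 = best(grid(0.0, tau_detect, coarse), grid(tau_detect, upper, coarse))

    # Stage 2: +/- 0.05 around the coarse optimum, clipped to the valid ranges
    ambs = [a for a in grid(a0 - 0.05, a0 + 0.05, fine) if 0.0 <= a <= tau_detect]
    reds = [r for r in grid(r0 - 0.05, r0 + 0.05, fine) if tau_detect <= r <= upper]
    score, a, r = best(ambs, reds)
    return a, r, score

# Synthetic score surface with a known optimum, used in place of a real tracker run
toy_evaluate = lambda a, r: -((a - 0.23) ** 2 + (r - 0.62) ** 2)
tau_amb, tau_redetect, _ = two_stage_search(toy_evaluate, tau_detect=0.40)
```

The two-stage design keeps the number of tracker runs manageable: the coarse sweep locates the promising region, and only that region is examined at the finer resolution.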
Analysis of site restoration. We perform an experiment that varies the aforementioned threshold, τ_blur, to find the best configuration of the site restoration method. As an additional experiment, we compare SRGAN, which we use in the image restoration module, with DeblurGAN, a representative GAN-based model for the deblurring problem. DeblurGAN is trained on the same dataset as SRGAN, and the detailed training configuration follows the default settings suggested in the original paper. Figure 12 shows the effect of the image restoration module on the appearance model as τ_blur is varied from 0 to 200 in steps of 1. As illustrated in Figure 12, the highest performance is obtained with τ_blur = 24 for MOTDT and τ_blur = 70 for DeepSORT. This indicates that site restoration helps MOTDT mainly in severely dynamic situations, whereas it helps DeepSORT even in relatively weakly dynamic situations. The additional experiment shows that SRGAN outperforms DeblurGAN in most cases. DeblurGAN does tend to remove blur more visibly. However, when we zoom in on a restored image, as shown in Figure 13, we observe specific patterns that were not present in the input. We speculate that these patterns have a negative effect on the appearance model.
Figure 13. An inferred image by DeblurGAN. We apply the learning method and dataset proposed in [37].
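The role of τ_blur as a gate in front of the restoration module can be illustrated with the sketch below. Two things are assumptions here and are not confirmed by this section: the blur score is taken to be the variance of the Laplacian of the grayscale crop (lower means blurrier), and a crop is restored when its score falls below τ_blur. The `restore` argument is a hypothetical stand-in for the SRGAN-based module.

```python
from typing import Callable, Tuple

import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of a 4-neighbour Laplacian over the interior pixels.

    ASSUMPTION: used here as the blur score; the paper's measure may differ.
    """
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def maybe_restore(crop: np.ndarray, tau_blur: float,
                  restore: Callable[[np.ndarray], np.ndarray]
                  ) -> Tuple[np.ndarray, bool]:
    """Restore the crop only when its blur score is below tau_blur."""
    if laplacian_variance(crop) < tau_blur:
        return restore(crop), True
    return crop, False


if __name__ == "__main__":
    flat = np.full((8, 8), 128, dtype=np.uint8)            # no edges at all
    sharp = (np.indices((8, 8)).sum(axis=0) % 2) * 255     # checkerboard
    identity = lambda img: img                             # placeholder restorer
    print(maybe_restore(flat, 24, identity)[1])
    print(maybe_restore(sharp, 24, identity)[1])
```

Under this reading, a small threshold such as 24 sends only heavily blurred regions to the restoration module, while a larger one such as 70 also restores moderately blurred regions.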
Analysis of the face appearance model. We conduct an experiment with the three distance measurements (Equations (11)-(13)) to find the optimal value of τ_face in the face appearance model. Figure 14 shows the effect of the face appearance model on the baseline trackers as τ_face is varied from 0 to 1 in steps of 0.01 for each distance measurement. As shown in Figure 14, MOTDT achieves the highest performance when the distance measurement is l_last and τ_face is 0.65. This indicates that, because MOTDT has a relatively large number of tracking candidates, using only the most recent face information contributes more to performance than using all of the past face information. Moreover, when only the face information of the last state is used, the embedding vector distances are relatively small because the face rarely changes between consecutive states. The performance is thus improved at a relatively strong threshold. DeepSORT achieves the highest performance when the distance measurement is l_mean and τ_face is 0.58. Because DeepSORT relies on the detection results to form its tracking candidates, comparing these limited candidates more thoroughly against the past face information contributes to better performance. In addition, since the face information of all past states is used, a wider variety of embedding vectors must be considered, so the performance improves at a relatively weak threshold.
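The difference between the two distance measurements named above can be sketched as follows. This is an illustration only: Equations (11)-(13) are not reproduced in this section, so the sketch assumes Euclidean distance between embeddings and assumes that l_mean averages the distances to all stored embeddings. The third measurement is omitted because it is not named here, and the function names are invented for this example.

```python
from typing import Sequence

import numpy as np


def l_last(history: Sequence[np.ndarray], query: np.ndarray) -> float:
    """Distance from the query face embedding to the track's most recent one."""
    return float(np.linalg.norm(np.asarray(history[-1]) - query))


def l_mean(history: Sequence[np.ndarray], query: np.ndarray) -> float:
    """Mean distance from the query to every stored embedding of the track.

    ASSUMPTION: the paper's l_mean may instead use the mean embedding.
    """
    stacked = np.asarray(history, dtype=np.float64)
    return float(np.linalg.norm(stacked - query, axis=1).mean())


if __name__ == "__main__":
    # Toy 2-D embeddings; real face embeddings have far more dimensions.
    track = [np.array([0.0, 0.0]), np.array([3.0, 4.0])]
    face = np.array([0.0, 0.0])
    print(l_last(track, face))   # compares only against the latest state
    print(l_mean(track, face))   # averages over the whole history
```

The resulting value is then compared against τ_face during association. The example shows why the two measurements call for different thresholds: the same query can be far from the latest embedding yet moderately close to the history on average, or the reverse.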
Experiments on MOT Challenge datasets. To check how effective the final model is in an unfamiliar environment, we conduct experiments in a new environment using the parameters found in the previous experiments. We use the MOT Challenge dataset to simulate this new environment. Among all MOT Challenge sequences, only the four recorded from a moving platform (MOT16-05, MOT16-10, MOT16-11, and MOT16-13) are used, to match the scope of our paper. The experimental results are reported in Table 3, where the arrows after the evaluation metrics indicate whether higher (↑) or lower (↓) values represent better performance.

Table 3. Experimental results to confirm the effect of the proposed model applied to an unfamiliar environment. The parameters found through the previous experiment are used without change, and the MOT Challenge dataset is used as a test set. Highlighting represents better performance.

Baseline | Test Data | Method | MOTA (↑) | IDF1 (↑) | MT (↑) | ML (↓) | ID_sw (↓)
The experimental results show an improvement in the main metrics, MOTA and IDF1, on three of the four sequences. The lack of improvement on the remaining sequence can be attributed to excessively low image quality. The sequences with improved performance (MOT16-10, MOT16-11, and MOT16-13) have a high resolution of 1920 × 1080 and little noise, whereas the sequence without improvement (MOT16-05) has a low resolution of 640 × 480 and a lot of noise. Since low resolution and heavy noise cause inference failures in the Image Restoration Module and the Face Embedding Module, the thresholds used for verification need to be strict. Therefore, when the image quality differs significantly from that of the experimental environment, retuning τ_redetect, τ_blur, and τ_face to stricter values may improve performance. Overall, the results validate that our proposed methodology enhances tracking performance when applied to a general situation.
Qualitative results. We present several qualitative assessments drawn from the benchmark results. DeepSORT is used as the baseline tracker, and Figure 15 shows three qualitative evaluation results. In each assessment, the top row is the result of the baseline tracker and the bottom row is the result of our proposed model applied to that baseline. Bounding boxes with the same color within an evaluation result are recognized by the tracker as the same object. In particular, the red bounding boxes mark the object for which the proposed model and the baseline tracker differ most. The examples show that the three proposed methods maintain object IDs properly by overcoming the visual changes and the congestion of objects that occur in dynamic situations.
(A) shows selected frames from the Patrol dataset. This dataset contains severe motion blur, especially for distant objects, due to camera rotation. With the baseline, detection of the object marked in red fails and its track is terminated. With our model, in contrast, the object marked in red keeps its ID, because re-detection based on the restoration function maintains a high detection success rate even for blurry objects.
(B) shows an example from the Interaction-2 dataset, which has many occlusions between objects that are relatively far apart. With the baseline, an association failure occurs for the object indicated by the red bounding box because of the noise generated by the dynamic movement. Our method, in contrast, keeps the object's ID, because restoring the noisy region with the site restoration method improves the accuracy of the appearance model.
(C) shows examples from the Interaction-1 dataset, in which humans are observed at close range and many occlusions occur due to movement. In particular, when the objects indicated by the red and black bounding boxes occlude each other, the baseline terminates the track after the occlusion and assigns a new ID. This tracking error is caused by the appearance model being confused by the mixed features of the two objects. With our method, in contrast, the ID remains intact because the use of a face feature alleviates the mixing of features.

Conclusions
In this paper, we propose three methods to enhance the performance of multiple object tracking, especially in dynamic environments. First, the re-detection and site restoration methods remove noise from an image to improve detection and identification performance, respectively. To remove the noise, we implement the image restoration module by adopting and training a GAN-based image inference model suited to dynamic environments. In addition, the face appearance model uses face features to reduce the likelihood of an association failure. We design three distance measurements to efficiently calculate the distance between multiple features so that our appearance model can be applied to general-purpose environments. To validate the effectiveness of the proposed methods, we construct dynamic robot environments and conduct experiments with robot service scenarios. The results show that the proposed model improves multiple object tracking performance over the existing trackers because it adapts to dynamic environments. The proposed image restoration module has a limitation: because it restores each image independently, it cannot exploit the time-series characteristics of the data. In future work, applying a recurrent neural-network-based image inference model to the image restoration module is expected to yield better performance in dynamic situations by also using the time-series characteristics of the dataset.