Search Results (26)

Search Parameters:
Keywords = visual object tracking (VOT)

22 pages, 7677 KiB  
Article
Universal Low-Frequency Noise Black-Box Attack on Visual Object Tracking
by Hanting Hou, Huan Bao, Kaimin Wei and Yongdong Wu
Symmetry 2025, 17(3), 462; https://doi.org/10.3390/sym17030462 - 19 Mar 2025
Viewed by 492
Abstract
Adversarial attacks on visual object tracking aim to degrade tracking accuracy by introducing imperceptible perturbations into video frames, exploiting vulnerabilities in neural networks. In real-world symmetrical double-blind engagements, both attackers and defenders operate with mutual unawareness of strategic parameters or initiation timing. Black-box attacks based on iterative optimization show excellent applicability in this scenario. However, existing state-of-the-art adversarial attacks based on iterative optimization suffer from high computational costs and limited effectiveness. To address these challenges, this paper proposes the Universal Low-frequency Noise black-box attack method (ULN), which generates perturbations through discrete cosine transform to disrupt structural features critical for tracking while mimicking compression artifacts. Extensive experimentation on four state-of-the-art trackers, including transformer-based models, demonstrates the method’s severe degradation effects. GRM’s expected average overlap drops by 97.77% on VOT2018, while SiamRPN++’s AUC and Precision on OTB100 decline by 76.55% and 78.9%, respectively. The attack achieves real-time performance with a computational cost reduction of over 50% compared to iterative methods, operating efficiently on embedded devices such as Raspberry Pi 4B. By maintaining a structural similarity index measure above 0.84, the perturbations blend seamlessly with common compression artifacts, evading traditional spatial filtering defenses. Cross-platform experiments validate its consistent threat across diverse hardware environments, with attack success rates exceeding 40% even under resource constraints. These results underscore the dual capability of ULN as both a stealthy and practical attack vector, and emphasize the urgent need for robust defenses in safety-critical applications such as autonomous driving and aerial surveillance. 
The efficiency of the method, when combined with its ability to exploit low-frequency vulnerabilities across architectures, establishes a new benchmark for adversarial robustness in visual tracking systems. Full article
(This article belongs to the Section Computer)
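The DCT-based low-frequency perturbation the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' ULN implementation; the `strength` and `cutoff` parameters are hypothetical, and a real attack would tune the noise against a tracker's response.

```python
import numpy as np
from scipy.fft import dctn, idctn

def low_freq_perturbation(frame, strength=8.0, cutoff=0.1):
    """Embed noise only in the low-frequency DCT band so the perturbation
    resembles compression artifacts (illustrative parameters)."""
    h, w = frame.shape
    coeffs = dctn(frame, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[: int(h * cutoff), : int(w * cutoff)] = 1.0   # keep only low frequencies
    rng = np.random.default_rng(0)
    noise = rng.uniform(-strength, strength, size=coeffs.shape) * mask
    perturbed = idctn(coeffs + noise, norm="ortho")
    return np.clip(perturbed, 0, 255)                  # stay in valid pixel range

frame = np.full((64, 64), 128.0)   # toy grayscale frame
adv = low_freq_perturbation(frame)
print(adv.shape)
```

Because the noise occupies the same band as JPEG-style artifacts, spatial filtering defenses that target high-frequency noise would leave it largely intact.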

14 pages, 1901 KiB  
Article
Learning from Outputs: Improving Multi-Object Tracking Performance by Tracker Fusion
by Vincenzo M. Scarrica and Antonino Staiano
Technologies 2024, 12(12), 239; https://doi.org/10.3390/technologies12120239 - 22 Nov 2024
Viewed by 2496
Abstract
This paper presents an approach to improving visual object tracking performance by dynamically fusing the results of two trackers, where the scheduling of trackers is determined by a support vector machine (SVM). By classifying the outputs of other trackers, our method learns their behaviors and exploits their complementarity to enhance tracking accuracy and robustness. Our approach consistently surpasses the performance of individual trackers within the ensemble. Despite being trained on only 4 sequences and tested on 144 sequences from the VOTS2023 benchmark, our approach achieves a Q metric of 0.65. Additionally, our fusion strategy demonstrates versatility across different datasets, achieving 73.7 MOTA on MOT17 public detections and 82.8 MOTA on MOT17 private detections. On the MOT20 dataset, it achieves 68.6 MOTA on public detections and 79.7 MOTA on private detections, setting new benchmarks in multi-object tracking. These results highlight the potential of using an ensemble of trackers with a learner-based scheduler to significantly improve tracking performance. Full article
(This article belongs to the Section Information and Communication Technologies)
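The SVM scheduler described above classifies tracker outputs to decide which tracker to trust per frame. A minimal sketch with scikit-learn's `SVC`; the feature set (per-tracker confidence plus inter-box IoU) and the toy training data are assumptions for illustration, not the paper's features.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-frame features: [tracker A confidence, tracker B confidence,
# IoU between the two predicted boxes].  Label 0 -> trust A, 1 -> trust B.
X_train = np.array([[0.9, 0.3, 0.5], [0.8, 0.2, 0.7],
                    [0.2, 0.9, 0.6], [0.1, 0.8, 0.4]])
y_train = np.array([0, 0, 1, 1])

scheduler = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

def fuse(box_a, box_b, feats):
    """Return the box from whichever tracker the SVM currently trusts."""
    choice = scheduler.predict(feats.reshape(1, -1))[0]
    return box_a if choice == 0 else box_b

box = fuse((10, 10, 50, 50), (12, 11, 48, 49), np.array([0.95, 0.25, 0.6]))
print(box)
```

The appeal of a learner-based scheduler is that it needs no access to tracker internals: it learns purely from outputs, which is why so few training sequences suffice.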

14 pages, 2945 KiB  
Article
Security in Transformer Visual Trackers: A Case Study on the Adversarial Robustness of Two Models
by Peng Ye, Yuanfang Chen, Sihang Ma, Feng Xue, Noel Crespi, Xiaohan Chen and Xing Fang
Sensors 2024, 24(14), 4761; https://doi.org/10.3390/s24144761 - 22 Jul 2024
Viewed by 1547
Abstract
Visual object tracking is an important technology in camera-based sensor networks and has wide applicability in autonomous driving systems. A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data, and it has been widely applied in the field of visual tracking. Unfortunately, the security of the transformer model remains unclear, which exposes transformer-based applications to security threats. In this work, the security of the transformer model was investigated through an important component of autonomous driving, i.e., visual tracking. Such deep-learning-based visual tracking is vulnerable to adversarial attacks, so adversarial attacks were implemented as the security threats with which to conduct the investigation. First, adversarial examples were generated on top of video sequences to degrade the tracking performance, taking the frame-by-frame temporal motion into consideration when generating perturbations over the predicted tracking results. Then, the influence of the perturbations on tracking performance was investigated and analyzed. Finally, extensive experiments on the OTB100, VOT2018, and GOT-10k datasets demonstrated that the generated adversarial examples effectively degraded the performance of transformer-based visual tracking. White-box attacks showed the highest effectiveness, with attack success rates exceeding 90% against transformer-based trackers. Full article
(This article belongs to the Special Issue Advances in Automated Driving: Sensing and Control)
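The abstract does not give the attack's exact form, so the sketch below shows the generic one-step sign-gradient (FGSM-style) idea on a toy quadratic "tracking loss"; a real attack would backpropagate through the tracker network, and the temporal-motion term the paper adds is omitted here.

```python
import numpy as np

def fgsm_perturb(frame, target_response, eps=2.0):
    """One-step sign-gradient perturbation on the toy loss
    L = 0.5 * ||frame - target_response||^2 (hypothetical stand-in
    for a tracker's real, network-dependent loss)."""
    grad = frame - target_response          # dL/dframe for the toy loss
    return np.clip(frame + eps * np.sign(grad), 0, 255)

frame = np.full((8, 8), 100.0)
target = np.full((8, 8), 90.0)              # response the attacker pushes away from
adv = fgsm_perturb(frame, target)
print(float(adv[0, 0]))                     # 102.0: every pixel moved by eps
```

The `eps` bound is what keeps the perturbation visually imperceptible while still shifting the model's response.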

14 pages, 3873 KiB  
Article
Evolution of Siamese Visual Tracking with Slot Attention
by Jian Wang, Xiangzhou Ye, Dongjie Wu, Jinfu Gong, Xinyi Tang and Zheng Li
Electronics 2024, 13(3), 586; https://doi.org/10.3390/electronics13030586 - 31 Jan 2024
Cited by 2 | Viewed by 1449
Abstract
Siamese network object tracking is a widely employed tracking method due to its simplicity and effectiveness. It first employs a two-stream network to independently extract template and search region features; these features are then combined through feature association to yield object information within the visual scene. However, the conventional approach faces limitations when it uses the template features as a convolution kernel to convolve the search image features, which restricts the ability to capture complex and nonlinear feature transformations of objects and thereby limits its discriminative capabilities. To overcome this challenge, we propose replacing traditional convolutional correlation with Slot Attention for feature association. This novel approach enables the effective extraction of nonlinear features within the scene while augmenting the discriminative capacity. Furthermore, to increase inference efficiency and reduce the parameter footprint, we deploy a single Slot Attention module for multiple associations. Our tracking algorithm, SiamSlot, was evaluated on diverse benchmarks, including VOT2019, GOT-10k, UAV123, and NfS. The experiments show a remarkable improvement in performance relative to previous methods under the same network size. Full article
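The distinguishing trait of Slot Attention is that attention is normalized over the slot axis, so slots compete to explain the inputs. A heavily stripped-down, single-iteration numpy sketch (the real module adds learned query/key/value projections, a GRU update, and layer normalization):

```python
import numpy as np

def slot_attention_step(slots, inputs):
    """One simplified Slot Attention iteration: softmax over the SLOT axis
    makes slots compete for inputs; each slot is then updated with the
    weighted mean of the inputs it wins."""
    logits = slots @ inputs.T                        # (n_slots, n_inputs)
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)    # compete across slots
    attn = attn / attn.sum(axis=1, keepdims=True)    # weighted-mean weights
    return attn @ inputs                             # updated slots

rng = np.random.default_rng(0)
slots = rng.normal(size=(2, 4))     # 2 slots, feature dimension 4
inputs = rng.normal(size=(6, 4))    # 6 input feature vectors
updated = slot_attention_step(slots, inputs)
print(updated.shape)
```

Contrast with convolutional correlation: the update is a data-dependent, nonlinear reweighting rather than a fixed linear kernel, which is the property the abstract credits for the improved discrimination.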

18 pages, 1329 KiB  
Article
Adaptive Kalman Filter for Real-Time Visual Object Tracking Based on Autocovariance Least Square Estimation
by Jiahong Li, Xinkai Xu, Zhuoying Jiang and Beiyan Jiang
Appl. Sci. 2024, 14(3), 1045; https://doi.org/10.3390/app14031045 - 25 Jan 2024
Cited by 4 | Viewed by 3938
Abstract
Real-time visual object tracking (VOT) may suffer from performance degradation and even divergence owing to inaccurate noise statistics typically engendered by non-stationary video sequences or alterations in the tracked object. This paper presents a novel adaptive Kalman filter (AKF) algorithm, termed AKF-ALS, based on the autocovariance least square estimation (ALS) methodology to improve the accuracy and robustness of VOT. The AKF-ALS algorithm involves object detection via an adaptive thresholding-based background subtraction technique and object tracking through real-time state estimation via the Kalman filter (KF) and noise covariance estimation using the ALS method. The proposed algorithm offers a robust and efficient solution to adapting the system model mismatches or invalid offline calibration, significantly improving the state estimation accuracy in VOT. The computation complexity of the AKF-ALS algorithm is derived and a numerical analysis is conducted to show its real-time efficiency. Experimental validations on tracking the centroid of a moving ball subjected to projectile motion, free-fall bouncing motion, and back-and-forth linear motion, reveal that the AKF-ALS algorithm outperforms a standard KF with fixed noise statistics. Full article
(This article belongs to the Special Issue Autonomous Vehicles and Robotics)
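The Kalman filter at the core of AKF-ALS follows the standard predict/update recursion. The sketch below tracks a 1-D centroid with a constant-velocity model and fixed noise covariances Q and R; those fixed covariances are exactly what the paper's ALS step replaces with online estimates.

```python
import numpy as np

dt = 1.0
F = np.array([[1, dt], [0, 1]])    # state transition (position, velocity)
H = np.array([[1.0, 0.0]])         # only position is measured
Q = np.eye(2) * 1e-3               # process noise (fixed here, adaptive in AKF-ALS)
R = np.array([[0.5]])              # measurement noise

x = np.array([[0.0], [0.0]])       # initial state
P = np.eye(2)                      # initial covariance

for z in [1.0, 2.1, 2.9, 4.2, 5.0]:        # noisy centroid positions
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
    x = x + K @ (np.array([[z]]) - H @ x)
    P = (np.eye(2) - K @ H) @ P

print(round(float(x[0, 0]), 2))    # estimate tracks the recent measurements
```

When the true Q and R drift (non-stationary video, object changes), a fixed-covariance filter like this one degrades; the ALS method re-estimates them from innovation autocovariances to keep the gain K well calibrated.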

17 pages, 15790 KiB  
Article
Siamese Visual Tracking with Spatial-Channel Attention and Ranking Head Network
by Jianming Zhang, Yifei Liang, Xiaoyi Huang, Li-Dan Kuang and Bin Zheng
Electronics 2023, 12(20), 4351; https://doi.org/10.3390/electronics12204351 - 20 Oct 2023
Viewed by 1859
Abstract
Trackers based on the Siamese network have received much attention in recent years owing to their remarkable performance; the task of object tracking is to predict the location of the target in the current frame. However, during the tracking process, distractors with similar appearances affect the judgment of the tracker and lead to tracking failure. To solve this problem, we propose a Siamese visual tracker with spatial-channel attention and a ranking head network. First, we propose a Spatial Channel Attention Module, which fuses the features of the template and the search region by capturing both spatial and channel information simultaneously, allowing the tracker to distinguish the target from the background. Second, we design a ranking head network. By introducing joint ranking loss terms, including a classification ranking loss and a confidence-and-IoU ranking loss, the classification and regression branches are linked to refine the tracking results. Through the mutual guidance between the classification confidence score and the IoU, a better regression box is selected, improving the performance of the tracker. To demonstrate the effectiveness of the proposed method, we test the tracker on the OTB100, VOT2016, VOT2018, UAV123, and GOT-10k datasets. On OTB100, the precision and success rate of our tracker are 0.925 and 0.700, respectively. Considering both accuracy and speed, our method achieves overall state-of-the-art performance. Full article
(This article belongs to the Special Issue Deep Learning in Computer Vision and Image Processing)

16 pages, 5513 KiB  
Article
Fast and Accurate Visual Tracking with Group Convolution and Pixel-Level Correlation
by Liduo Liu, Yongji Long, Guoning Li, Ting Nie, Chengcheng Zhang and Bin He
Appl. Sci. 2023, 13(17), 9746; https://doi.org/10.3390/app13179746 - 29 Aug 2023
Cited by 1 | Viewed by 1470
Abstract
Visual object trackers based on Siamese networks perform well in visual object tracking (VOT); however, tracking accuracy degrades when the target undergoes fast motion, large-scale changes, or occlusion. In this study, to solve this problem and enhance the inference speed of the tracker, we propose fast and accurate visual tracking with group convolution and pixel-level correlation based on a Siamese network. The algorithm incorporates multi-layer feature information into the Siamese network. We designed a multi-scale feature aggregated channel attention block (MCA) and a global-to-local-information-fused spatial attention block (GSA), which enhance the feature extraction capability of the network. A pixel-level mutual correlation operation matches the search region with the template region, refining the bounding box and reducing background interference. Compared with the latest algorithms, the precision and success rates on the UAV123, OTB100, LaSOT, and GOT10K datasets were improved, and our tracker runs at 40 FPS, performing better in complex scenes such as those with occlusion, illumination changes, and fast motion. Full article
(This article belongs to the Special Issue Recent Advances in Robotics and Intelligent Robots Applications)
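Pixel-level (pointwise) correlation treats each spatial position of the template feature map as a 1×1 kernel and correlates it with every position of the search feature map, so no spatial structure of the template is baked into a single large kernel. A minimal sketch of that operation, under the assumption of plain dot-product matching:

```python
import numpy as np

def pixel_correlation(template, search):
    """Correlate every template pixel's C-dim feature vector with every
    search position, yielding a (Ht*Wt, Hs, Ws) response volume."""
    c, ht, wt = template.shape
    _, hs, ws = search.shape
    t = template.reshape(c, ht * wt).T    # (Ht*Wt, C): one row per template pixel
    s = search.reshape(c, hs * ws)        # (C, Hs*Ws)
    return (t @ s).reshape(ht * wt, hs, ws)

rng = np.random.default_rng(1)
template = rng.normal(size=(16, 4, 4))    # C=16 template features, 4x4 spatial
search = rng.normal(size=(16, 8, 8))
resp = pixel_correlation(template, search)
print(resp.shape)                         # (16, 8, 8): one map per template pixel
```

Because each template pixel produces its own response map, the subsequent head can localize parts of the target independently, which helps refine the bounding box under deformation.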

21 pages, 1052 KiB  
Article
Efficient and Lightweight Visual Tracking with Differentiable Neural Architecture Search
by Peng Gao, Xiao Liu, Hong-Chuan Sang, Yu Wang and Fei Wang
Electronics 2023, 12(17), 3623; https://doi.org/10.3390/electronics12173623 - 27 Aug 2023
Cited by 2 | Viewed by 1587
Abstract
Over the last decade, Siamese network architectures have emerged as the dominant tracking paradigm and have led to significant progress. These architectures consist of a backbone network and a head network. The backbone network comprises two identical feature extraction sub-branches, one for the target template and one for the search candidate. The head network takes both the template and candidate features as inputs and produces a local similarity score for the target object at each location of the search candidate. Despite the promising results attained in visual tracking, challenges persist in developing efficient and lightweight models due to the inherent complexity of the task. Manually designed tracking models rely heavily on the knowledge and experience of experts, and existing tracking approaches achieve excellent performance at the cost of large numbers of parameters and vast amounts of computation. A novel Siamese tracking approach called TrackNAS, based on neural architecture search, is proposed to reduce the complexity of the neural architecture applied in visual tracking. First, following the principle of the Siamese network, backbone and head network search spaces are constructed, together constituting the search space for the network architecture. Next, under given resource constraints (FLOPs), the network architecture that meets the tracking performance requirements is obtained by optimizing a hybrid search strategy that combines distributed and joint approaches. Then, an evolutionary method is used to lighten the searched network architecture to facilitate deployment on resource-constrained devices. Finally, to verify the performance of TrackNAS, comparison and ablation experiments are conducted on several large-scale visual tracking benchmark datasets, including OTB100, VOT2018, UAV123, LaSOT, and GOT-10k. The results indicate that the proposed TrackNAS achieves competitive accuracy and robustness with far fewer network parameters and much less computation than other advanced Siamese trackers, meeting the requirements for lightweight deployment on resource-constrained devices. Full article

24 pages, 5426 KiB  
Article
Siamese Network Tracker Based on Multi-Scale Feature Fusion
by Jiaxu Zhao and Dapeng Niu
Systems 2023, 11(8), 434; https://doi.org/10.3390/systems11080434 - 18 Aug 2023
Viewed by 2345
Abstract
The main task in visual object tracking is to track a moving object in an image sequence. In this process, the object's trajectory and behavior can be described by calculating its position, velocity, acceleration, and other parameters, or by memorizing its position in each frame of the corresponding video. Visual object tracking therefore supports many higher-level tasks, performs well in real scenes, and is widely used in automated driving, traffic monitoring, human–computer interaction, and so on. Siamese-network-based trackers have received a great deal of attention from the tracking community, but they still have several drawbacks. This paper analyzes these shortcomings in detail and proposes a new target-tracking framework that improves the Siamese network tracker through multi-scale feature fusion. A feature map with low resolution but strong semantic information and a feature map with high resolution and rich spatial information are integrated to improve the model's ability to describe an object, and the problem of scale change is addressed by fusing features at different scales. Furthermore, we utilize a 3D Max Filtering module to suppress repeated predictions of features at different scales. Finally, experiments conducted on the four tracking benchmarks OTB2015, VOT2016, VOT2018, and GOT10K show that the proposed algorithm effectively improves tracking accuracy and robustness. Full article
(This article belongs to the Special Issue AI-Driven Information and Engineering Systems for Future Mobility)
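The duplicate-suppression idea behind 3D Max Filtering can be sketched with scipy's `maximum_filter` applied over a (scale, y, x) stack of score maps: a prediction survives only where it is the local maximum across both space and scale. This is an illustrative reading of the module, not the paper's exact formulation.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def max_filter_3d(score_maps, size=(3, 3, 3)):
    """Zero out any score that is not the local maximum within a
    3x3x3 (scale, y, x) neighborhood, suppressing repeated predictions
    of the same object at neighboring scales."""
    filtered = maximum_filter(score_maps, size=size, mode="constant")
    return np.where(score_maps == filtered, score_maps, 0.0)

maps = np.zeros((3, 5, 5))     # 3 scales, 5x5 score maps
maps[0, 2, 2] = 0.8            # the same object detected at two scales
maps[1, 2, 2] = 0.9
out = max_filter_3d(maps)
print(float(out[1, 2, 2]), float(out[0, 2, 2]))   # 0.9 0.0: weaker duplicate removed
```

Only the strongest cross-scale response remains, so the downstream head sees one prediction per object instead of one per fused scale.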

14 pages, 3329 KiB  
Article
PACR: Pixel Attention in Classification and Regression for Visual Object Tracking
by Da Li, Haoxiang Chai, Qin Wei, Yao Zhang and Yunhan Xiao
Mathematics 2023, 11(6), 1406; https://doi.org/10.3390/math11061406 - 14 Mar 2023
Cited by 2 | Viewed by 1766
Abstract
Anchor-free trackers have achieved remarkable performance in single visual object tracking in recent years. Most anchor-free trackers treat the rectangular region close to the target center as the positive sample in the training phase, while they use the maximum of the corresponding response map to determine the location of the target in the tracking phase, which makes the tracker inconsistent between training and tracking. To solve this problem, we propose a pixel-attention module (PAM), which ensures the consistency of the training and tracking phases through self-attention. Moreover, we put forward a new refinement branch, named the Acc branch, to inherit the benefit of the PAM. The score of the Acc branch tunes the classification and regression of the tracking target more precisely. We conduct extensive experiments on challenging benchmarks such as VOT2020, UAV123, DTB70, OTB100, and the large-scale benchmark LaSOT. Compared with other anchor-free trackers, our tracker performs excellently on small-scale datasets. On UAV benchmarks such as UAV123 and DTB70, the precision of our tracker increases by 4.3% and 1.8%, respectively, compared with the state of the art among anchor-free trackers. Full article

13 pages, 3768 KiB  
Communication
Object Relocation Visual Tracking Based on Histogram Filter and Siamese Network in Intelligent Transportation
by Jianlong Zhang, Yifan Liu, Qiao Li, Ci He, Bin Wang and Tianhong Wang
Sensors 2022, 22(22), 8591; https://doi.org/10.3390/s22228591 - 8 Nov 2022
Cited by 2 | Viewed by 1657
Abstract
Target detection and tracking algorithms are among the key technologies for autonomous driving in intelligent transportation, providing important sensing capabilities for vehicle localization and path planning. Siamese network-based trackers formulate the visual tracking mission as an image-matching process with regression and classification branches, which simplifies the network structure and improves tracking accuracy. However, several problems remain. (1) Lightweight neural networks decrease the feature representation ability, so the tracker easily fails under disturbing distractors (e.g., deformation and similar objects) or large changes in viewing angle. (2) The tracker cannot adapt to variations of the object. (3) The tracker cannot reposition an object once tracking has failed. To address these issues, we first propose a novel match filter arbiter based on the Euclidean distance histogram between the centers of multiple candidate objects to automatically determine whether the tracker has failed. Secondly, the Hopcroft–Karp algorithm is introduced to select the winners from the dynamic template set through a backtracking process, and object relocation is achieved by comparing the Gradient Magnitude Similarity Deviation between the template and the winners. Experiments show that our method obtains better performance than state-of-the-art methods on several tracking benchmarks, i.e., OTB100, VOT2018, GOT-10k, and LaSOT. Full article
(This article belongs to the Section Navigation and Positioning)
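The template-to-candidate assignment the abstract attributes to Hopcroft–Karp is a maximum bipartite matching. The sketch below uses Kuhn's augmenting-path algorithm, which computes the same maximum matching (Hopcroft–Karp is an asymptotically faster variant); the toy edge list standing in for template/candidate compatibility is an assumption.

```python
def max_bipartite_matching(adj, n_right):
    """Kuhn's augmenting-path maximum bipartite matching.
    adj[u] lists the candidate indices compatible with template u;
    returns (matching size, candidate -> template assignment)."""
    match_right = [-1] * n_right

    def try_augment(u, seen):
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                # candidate v is free, or its current template can be re-routed
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    size = sum(try_augment(u, set()) for u in range(len(adj)))
    return size, match_right

# Templates 0..2 vs. candidate objects 0..2 (edges = plausible matches).
adj = [[0, 1], [0], [1, 2]]
size, matching = max_bipartite_matching(adj, 3)
print(size, matching)
```

A maximum matching ensures every template in the dynamic set is paired with a distinct candidate whenever the compatibility graph allows it, which is what makes the subsequent GMSD comparison between template and "winner" well defined.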

18 pages, 4841 KiB  
Article
Fast and Robust Visual Tracking with Few-Iteration Meta-Learning
by Zhenxin Li, Xuande Zhang, Long Xu and Weiqiang Zhang
Sensors 2022, 22(15), 5826; https://doi.org/10.3390/s22155826 - 4 Aug 2022
Cited by 2 | Viewed by 2231
Abstract
Visual object tracking has been a major research topic in the field of computer vision for many years. Given the bounding box in the first frame, object tracking aims to identify and localize the object of interest in subsequent frames. Object-tracking algorithms are also required to be robust and to run in real time. These requirements create unique challenges: a model can easily overfit when only a very small training dataset of the object is available during offline training, while too many iterations in the model-optimization process during offline training, or in the model-update process during online tracking, lead to poor real-time performance. We address these problems by introducing a meta-learning method based on fast optimization. Our proposed tracking architecture mainly contains two parts: a base learner and a meta learner. The base learner is primarily a target-versus-background classifier, complemented by a regression network for object bounding box prediction. The primary goal of the transformer-based meta learner is to learn the representations used by the classifier. The accuracy of our proposed algorithm on OTB2015 and LaSOT is 0.930 and 0.688, respectively. Moreover, it performs well on the VOT2018 and GOT-10k datasets. Combined with comparative experiments on real-time performance, these results show that our algorithm is fast and robust. Full article
(This article belongs to the Section Sensing and Imaging)

18 pages, 4185 KiB  
Article
Enhancement: SiamFC Tracker Algorithm Performance Based on Convolutional Hyperparameters Optimization and Low Pass Filter
by Rogeany Kanza, Yu Zhao, Zhilin Huang, Chenyu Huang and Zhuoming Li
Mathematics 2022, 10(9), 1527; https://doi.org/10.3390/math10091527 - 3 May 2022
Cited by 3 | Viewed by 2569
Abstract
Over the past few decades, convolutional neural networks (CNNs) have achieved outstanding results on a broad scope of computer vision problems. Despite these improvements, fully convolutional Siamese neural networks (FCSNN) still hardly adapt to complex scenes, such as appearance change, scale change, similar-object interference, etc. The present study focuses on an enhanced FCSNN based on convolutional block hyperparameter optimization, a new activation function (ModReLU), and a Gaussian low pass filter. The optimization of hyperparameters is an important task, as it has a crucial influence on tracking performance, especially regarding the initialization of weights and biases, which must work efficiently with the following activation function layer; inadequate initialization can result in vanishing or exploding gradients. In the first method, we propose an optimization strategy for initializing the weights and biases in the convolutional block to improve feature learning so that each neuron learns as much as possible, with the activation function normalizing the output. We implement this by setting the convolutional weight initialization to constant, the bias initialization to zero, and using a Leaky ReLU activation function at the output. In the second method, we propose a new activation function, ModReLU, in the activation layer of the CNN. Additionally, we introduce a Gaussian low pass filter to minimize image noise and improve image structure at distinct scales, and we add a pixel-domain-based color adjustment to enhance the capacity of the proposed strategies. The proposed implementations better handle rotation, motion, occlusion, and appearance change, and improve tracking speed. Our experimental results on the Visual Object Tracking (VOT) Challenge 2016 dataset clearly show a significant improvement in overall performance compared to the original fully convolutional Siamese network (SiamFC): the first proposed technique surpasses the original SiamFC with an increase of 15.42% in precision, 16.79% in AUPC, and 15.93% in IoU, while the second yields an 18.07% precision increment, a 17.01% AUPC improvement, and an increase of 15.87% in IoU. Both outperform the original SiamFC tracker and many other top performers. Full article
(This article belongs to the Special Issue Recent Advances in Computational Intelligence and Its Applications)
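ModReLU is commonly defined as shrinking a value's magnitude by a learnable bias while keeping its sign (or phase, for complex inputs); the abstract does not spell out the paper's exact variant, so the definition below is an assumption. The Gaussian low pass filter step is sketched alongside it with scipy.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def modrelu(z, bias=-0.1):
    """Assumed ModReLU form: relu(|z| + bias) * sign(z).
    Values whose magnitude does not exceed -bias are zeroed;
    larger values are shrunk toward zero but keep their sign."""
    mag = np.abs(z)
    return np.where(mag + bias > 0, (mag + bias) * np.sign(z), 0.0)

# Gaussian low pass filter to suppress high-frequency image noise.
rng = np.random.default_rng(0)
image = np.full((32, 32), 100.0) + rng.normal(0, 10, (32, 32))
smoothed = gaussian_filter(image, sigma=1.5)

print(np.std(smoothed) < np.std(image))   # smoothing reduces noise variance
```

The magnitude threshold gives ModReLU a built-in denoising effect similar to soft-thresholding, which complements the Gaussian pre-filtering of the input frames.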

18 pages, 3753 KiB  
Article
CTT: CNN Meets Transformer for Tracking
by Chen Yang, Ximing Zhang and Zongxi Song
Sensors 2022, 22(9), 3210; https://doi.org/10.3390/s22093210 - 22 Apr 2022
Cited by 8 | Viewed by 2928
Abstract
Siamese networks are one of the most popular directions in deep-learning-based visual object tracking. In Siamese networks, the feature pyramid network (FPN) and the cross-correlation operation perform feature fusion and the matching of features extracted from the template and search branches, respectively. However, object tracking should also account for global and contextual dependencies. Hence, we introduce into our tracker, as part of the neck, a delicate residual transformer structure containing a self-attention mechanism in an encoder-decoder arrangement. Under this structure, the encoder promotes interaction between the low-level features extracted from the target and search branches by the CNN to obtain global attention information, while the decoder replaces cross-correlation to send the global attention information to the head module. We add a spatial and channel attention component in the target branch, which further improves the accuracy and robustness of our proposed model at low computational cost. Finally, we thoroughly evaluate our tracker CTT on the GOT-10k, VOT2019, OTB-100, LaSOT, NfS, UAV123, and TrackingNet benchmarks, and our proposed method obtains results competitive with state-of-the-art algorithms. Full article
(This article belongs to the Section Intelligent Sensors)

13 pages, 2527 KiB  
Article
An Accurate Refinement Pathway for Visual Tracking
by Liang Xu, Shuli Cheng and Liejun Wang
Information 2022, 13(3), 147; https://doi.org/10.3390/info13030147 - 11 Mar 2022
Viewed by 2459
Abstract
Recently, in the field of visual object tracking, algorithms that combine tracking with visual object segmentation have achieved impressive results on the VOT2020 dataset, where targets are labeled with masks. Most of these trackers obtain the object mask by increasing the resolution through multiple upsampling modules, gradually refining the mask by summing with features from the backbone network. However, this refinement pathway does not fully exploit the spatial information of the backbone features, and therefore the segmentation results are imperfect. In this paper, a cross-stage and cross-resolution (CSCR) module is proposed to optimize the segmentation. This module makes full use of the semantic information of high-level features and the spatial information of low-level features, fusing them via skip connections to achieve highly accurate segmentation. Experiments were conducted on the VOT dataset; the results outperform other excellent trackers and verify the effectiveness of the proposed algorithm. Full article