Special Issue "Visual Sensor Networks for Object Detection and Tracking"

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensor Networks".

Deadline for manuscript submissions: closed (31 March 2021).

Special Issue Editor

Prof. Dr. Byung-Gyu Kim
Guest Editor
Department of IT Engineering, Sookmyung Women's University, Seoul, Korea
Interests: visual sensor network; real-time object segmentation; deep learning for object detection; facial expression recognition

Special Issue Information

Dear Colleagues,

Information obtained through the human eye is more efficient and diverse for object recognition/tracking than information obtained through any other sensory organ. Recently, visual object detection, recognition, and tracking have been enabled by more flexible vision sensors and their network schemes, such as the 5G standard. In addition, visual intelligence technology and inference systems based on deep/reinforcement learning are being actively researched to make vision systems more accurate. This issue will publish original technical papers and review papers on these recent technologies, focusing on visual recognition, real-time visual object tracking, knowledge extraction, distributed visual sensor networks, and applications.

You are welcome to submit unpublished original research work related to the theme of “Visual Sensor Networks for Object Detection and Tracking.”

Prof. Dr. Byung-Gyu Kim
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2200 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Intelligent object detection algorithms
  • Fast and complexity reduction algorithms for real-time object detection and tracking
  • Knowledge extraction and mining from visual sensor data
  • Visual sensor network architecture for object detection and tracking
  • Awareness-based visual sensor network design
  • Intelligent machine learning mechanism for object detection and recognition
  • Lightweight deep learning for real-time object detection and tracking
  • Visual data representation and transmission in a 5G network
  • Real-time visual object tracking in vision sensor network
  • Intelligent CCTV applications

Published Papers (14 papers)

Article
Enhanced Single Image Super Resolution Method Using Lightweight Multi-Scale Channel Dense Network
Sensors 2021, 21(10), 3351; https://doi.org/10.3390/s21103351 - 12 May 2021
Viewed by 428
Abstract
Super resolution (SR) generates a high-resolution (HR) image from one or more low-resolution (LR) images. Since a variety of CNN models have recently been studied in computer vision, these approaches have been combined with SR to provide better image restoration. In this paper, we propose a lightweight CNN-based SR method, named multi-scale channel dense network (MCDN). In order to design the proposed network, we extracted the training images from the DIVerse 2K (DIV2K) dataset and investigated the trade-off between the SR accuracy and the network complexity. The experimental results show that the proposed method can significantly reduce the network complexity, such as the number of network parameters and total memory capacity, while maintaining slightly better or similar perceptual quality compared to the previous methods.
(This article belongs to the Special Issue Visual Sensor Networks for Object Detection and Tracking)

Article
Face Recognition at a Distance for a Stand-Alone Access Control System
Sensors 2020, 20(3), 785; https://doi.org/10.3390/s20030785 - 31 Jan 2020
Cited by 7 | Viewed by 2178
Abstract
Although access control based on human face recognition has become popular in consumer applications, it still has several implementation issues before it can realize a stand-alone access control system. Owing to a lack of computational resources, lightweight and computationally efficient face recognition algorithms are required. Conventional access control systems require significant active cooperation from the users despite their non-aggressive nature. The lighting/illumination change is one of the most difficult and challenging problems for human-face-recognition-based access control applications. This paper presents the design and implementation of a user-friendly, stand-alone access control system based on human face recognition at a distance. The local binary pattern (LBP)-AdaBoost framework was employed for face and eye detection, which is fast and invariant to illumination changes. It can detect faces and eyes of varied sizes at a distance. For fast face recognition with a high accuracy, the Gabor-LBP histogram framework was modified by substituting the Gabor wavelet with Gaussian derivative filters, which reduced the facial feature size by 40% relative to the Gabor-LBP-based facial features, and was robust to significant illumination changes and complicated backgrounds. The experiments on benchmark datasets produced face recognition accuracies of 97.27% on an E-face dataset and 99.06% on an XM2VTS dataset. The system achieved a 91.5% true acceptance rate with a 0.28% false acceptance rate and averaged a 5.26 frames/sec processing speed on a newly collected face image and video dataset in an indoor office environment.
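
The abstract above relies on local binary patterns (LBP) for their robustness to illumination changes. As a rough illustration only, the following Python sketch computes the basic 3 × 3 LBP code and its histogram. It is not the paper's LBP-AdaBoost detector or its modified Gabor-LBP pipeline; the function names `lbp_3x3` and `lbp_histogram`, the bit ordering, and the sample patch are assumptions made for this example.

```python
import numpy as np

def lbp_3x3(gray):
    """Basic 8-neighbor LBP code for each interior pixel of a 2-D array.

    Assumes the input is at least 3 x 3. The bit order (clockwise from the
    top-left neighbor) is a convention chosen for this sketch.
    """
    gray = np.asarray(gray, dtype=np.int32)
    h, w = gray.shape
    center = gray[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        # Shifted view of the image aligned with the interior pixels.
        neighbor = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbor is at least as bright as the center.
        codes |= (neighbor >= center).astype(np.int32) << bit
    return codes

def lbp_histogram(gray):
    """Normalized 256-bin histogram of LBP codes (a simple texture descriptor)."""
    codes = lbp_3x3(gray)
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()

# Hypothetical 3 x 3 patch: only its single interior pixel receives a code.
patch = np.array([[5, 9, 1],
                  [4, 6, 7],
                  [2, 3, 8]])
code = lbp_3x3(patch)[0, 0]

# A brighter, higher-contrast copy of the patch yields the same code,
# because LBP depends only on the ordering of gray levels.
same_code = lbp_3x3(2 * patch + 30)[0, 0]
```

Because each bit only records whether a neighbor is at least as bright as the center, any strictly increasing change of gray levels leaves the codes unchanged, which is the sense in which LBP features tolerate illumination changes.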

Article
Real-Time Instance Segmentation of Traffic Videos for Embedded Devices
Sensors 2021, 21(1), 275; https://doi.org/10.3390/s21010275 - 03 Jan 2021
Cited by 1 | Viewed by 977
Abstract
The paper proposes a novel instance segmentation method for traffic videos devised for deployment on real-time embedded devices. A novel neural network architecture is proposed using a multi-resolution feature extraction backbone and improved network designs for the object detection and instance segmentation branches. A novel post-processing method is introduced to ensure a reduced rate of false detection by evaluating the quality of the output masks. An improved network training procedure is proposed based on a novel label assignment algorithm. An ablation study on speed-vs.-performance trade-off further modifies the two branches and replaces the conventional ResNet-based performance-oriented backbone with a lightweight speed-oriented design. The proposed architectural variations achieve real-time performance when deployed on embedded devices. The experimental results demonstrate that the proposed instance segmentation method for traffic videos outperforms the You Only Look At CoefficienTs (YOLACT) algorithm, the state-of-the-art real-time instance segmentation method. The proposed architecture achieves qualitative results with 31.57 average precision on the COCO dataset, while its speed-oriented variations achieve speeds of up to 66.25 frames per second on the Jetson AGX Xavier module.

Article
Virtual to Real Adaptation of Pedestrian Detectors
Sensors 2020, 20(18), 5250; https://doi.org/10.3390/s20185250 - 14 Sep 2020
Cited by 1 | Viewed by 820
Abstract
Pedestrian detection through Computer Vision is a building block for a multitude of applications. Recently, there has been an increasing interest in convolutional neural network-based architectures to execute such a task. One of these supervised networks’ critical goals is to generalize the knowledge learned during the training phase to new scenarios with different characteristics. A suitably labeled dataset is essential to achieve this purpose. The main problem is that manually annotating a dataset usually requires a lot of human effort, and it is costly. To address this, we introduce ViPeD (Virtual Pedestrian Dataset), a new synthetically generated set of images collected with the highly photo-realistic graphical engine of the video game GTA V (Grand Theft Auto V), where annotations are automatically acquired. However, when training solely on the synthetic dataset, the model experiences a Synthetic2Real domain shift leading to a performance drop when applied to real-world images. To mitigate this gap, we propose two different domain adaptation techniques suitable for the pedestrian detection task, but possibly applicable to general object detection. Experiments show that the network trained with ViPeD can generalize over unseen real-world scenarios better than the detector trained over real-world data, exploiting the variety of our synthetic dataset. Furthermore, we demonstrate that with our domain adaptation techniques, we can reduce the Synthetic2Real domain shift, making the two domains closer and obtaining a performance improvement when testing the network over the real-world images.

Review
A Review of Vision-Based On-Board Obstacle Detection and Distance Estimation in Railways
Sensors 2021, 21(10), 3452; https://doi.org/10.3390/s21103452 - 15 May 2021
Viewed by 573
Abstract
This paper provides a review of the literature on vision-based on-board obstacle detection and distance estimation in railways. Environment perception is crucial for autonomous detection of obstacles in a vehicle’s surroundings. The use of on-board sensors for road vehicles for this purpose is well established, and advances in Artificial Intelligence and sensing technologies have motivated significant research and development in obstacle detection in the automotive field. However, research and development on obstacle detection in railways has been less extensive. To the best of our knowledge, this is the first comprehensive review of on-board obstacle detection methods for railway applications. This paper reviews currently used sensors, with particular focus on vision sensors due to their dominant use in the field. It then discusses and categorizes the methods based on vision sensors into methods based on traditional Computer Vision and methods based on Artificial Intelligence.

Article
A Visual Tracker Offering More Solutions
Sensors 2020, 20(18), 5374; https://doi.org/10.3390/s20185374 - 19 Sep 2020
Cited by 1 | Viewed by 673
Abstract
Most trackers focus solely on robustness and accuracy. Visual tracking, however, is a long-term problem with tight time constraints. A tracker that is robust and accurate, with long-term sustainability and real-time processing, is of high research value and practical significance. In this paper, we comprehensively consider these requirements in order to propose a new, state-of-the-art tracker with excellent performance. EfficientNet-B0 is adopted for the first time via neural architecture search technology as the backbone network for the tracking task. This improves the network feature extraction ability and significantly reduces the number of parameters required for the tracker backbone network. In addition, maximal Distance Intersection-over-Union is set as the target estimation method, enhancing network stability and increasing the offline training convergence rate. Channel and spatial dual attention mechanisms are employed in the target classification module to improve the discrimination of the trackers. Furthermore, the conjugate gradient optimization strategy increases the speed of the online learning target classification module. A two-stage search method combined with a screening module is proposed to enable the tracker to cope with sudden target movement and reappearance following a brief disappearance. Our proposed method has an obvious speed advantage compared with pure global searching and achieves optimal performance on OTB2015, VOT2016, VOT2018-LT, UAV-123 and LaSOT while running at over 50 FPS.
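
The abstract above names Distance Intersection-over-Union (DIoU) as the quantity used for target estimation. As a minimal sketch, the following Python function computes DIoU as it is commonly defined for two axis-aligned boxes: the IoU minus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box. How the tracker maximizes this score is not described in the abstract; the function name `diou`, the `(x1, y1, x2, y2)` box format, and the example boxes are assumptions for illustration.

```python
def diou(box_a, box_b):
    """Distance-IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU: overlap area divided by union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between the two box centers.
    center_dist_sq = (((ax1 + ax2) - (bx1 + bx2)) / 2.0) ** 2 \
                   + (((ay1 + ay2) - (by1 + by2)) / 2.0) ** 2

    # Squared diagonal of the smallest box enclosing both inputs.
    diag_sq = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
            + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - center_dist_sq / diag_sq

# Hypothetical boxes for illustration.
identical = diou((0, 0, 2, 2), (0, 0, 2, 2))
partial = diou((0, 0, 2, 2), (1, 1, 3, 3))
disjoint = diou((0, 0, 1, 1), (2, 0, 3, 1))
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, DIoU still decreases as the boxes move apart, so it keeps ranking candidates even when they do not overlap the target.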

Article
A Light-Weight Practical Framework for Feces Detection and Trait Recognition
Sensors 2020, 20(9), 2644; https://doi.org/10.3390/s20092644 - 06 May 2020
Cited by 9 | Viewed by 1374
Abstract
Fecal trait examinations are critical in the clinical diagnosis of digestive diseases, and they can effectively reveal various aspects regarding the health of the digestive system. An automatic feces detection and trait recognition system based on a visual sensor could greatly alleviate the burden on medical inspectors and overcome many sanitation problems, such as infections. Unfortunately, the lack of digital medical images acquired with camera sensors due to patient privacy has obstructed the development of fecal examinations. In general, the computing power of an automatic fecal diagnosis machine or a mobile computer-aided diagnosis device is not always enough to run a deep network. Thus, a light-weight practical framework is proposed, which consists of three stages: illumination normalization, feces detection, and trait recognition. Illumination normalization effectively suppresses the illumination variances that degrade the recognition accuracy. Neither the shape nor the location is fixed, so shape-based and location-based object detection methods do not work well in this task. Meanwhile, this leads to a difficulty in labeling the images for training convolutional neural networks (CNNs) in detection. Our segmentation scheme is free from training and labeling. The feces object is accurately detected with a well-designed threshold-based segmentation scheme on the selected color component to reduce the background disturbance. Finally, the preprocessed images are categorized into five classes with a light-weight shallow CNN, which is suitable for feces trait examinations in real hospital environments. The experimental results on our collected dataset demonstrate that our framework yields a satisfactory accuracy of 98.4%, while requiring low computational complexity and storage.

Article
Detecting Defects on Solid Wood Panels Based on an Improved SSD Algorithm
Sensors 2020, 20(18), 5315; https://doi.org/10.3390/s20185315 - 17 Sep 2020
Cited by 7 | Viewed by 774
Abstract
Wood is widely used in construction, the home, and art applications all over the world because of its good mechanical properties and aesthetic value. However, because the growth and preservation of wood are greatly affected by the environment, it often contains different types of defects that affect its performance and ornamental value. To solve the issues of high labor costs and low efficiency in the detection of wood defects, we used machine vision and deep learning methods in this work. A color charge-coupled device camera was used to collect the surface images of two types of wood from Akagi and Pinus sylvestris trees. A total of 500 images with a size of 200 × 200 pixels containing wood knots, dead knots, and checking defects were obtained. The transfer learning method was used to apply the single-shot multibox detector (SSD), a target detection algorithm, and the DenseNet network was introduced to improve the algorithm. The mean average precision for detecting the three types of defects (live knots, dead knots, and checking) was 96.1%.

Article
Design of an Interactive Mind Calligraphy System by Affective Computing and Visualization Techniques for Real-Time Reflections of the Writer’s Emotions
Sensors 2020, 20(20), 5741; https://doi.org/10.3390/s20205741 - 09 Oct 2020
Viewed by 707
Abstract
A novel interactive system for calligraphy called mind calligraphy that reflects the writer’s emotions in real time by affective computing and visualization techniques is proposed. Unlike traditional calligraphy, which emphasizes artistic expression, the system is designed to visualize the writer’s mental-state changes during writing using audio-visual tools. The writer’s mental state is measured with a brain wave machine to yield attention and meditation signals, which are then classified into four types of emotion, namely, focusing, relaxation, calmness, and anxiety. These emotion types are then represented both by animations and color palettes for by-standing observers to appreciate. Based on conclusions drawn from data collected from on-site observations, surveys via Likert-scale questionnaires, and semi-structured interviews, the proposed system was improved gradually. The participating writers’ cognitive, emotional, and behavioral engagements in the system were recorded and analyzed to obtain the following findings: (1) the interactions with the system raise the writer’s interest in calligraphy; (2) the proposed system reveals the writer’s emotions during the writing process in real time via animations of mixtures of fish swimming and sounds of raindrops, insects, and thunder; (3) the dynamic visualization of the writer’s emotion through animations and color-palette displays helps the writer better understand the connection between calligraphy and personal emotions; (4) the real-time audio-visual feedback increases the writer’s willingness to continue in calligraphy; and (5) the engagement of the writer in the system with interactions of diversified forms provides the writer with a new experience of calligraphy.

Article
Laser Ranging-Assisted Binocular Visual Sensor Tracking System
Sensors 2020, 20(3), 688; https://doi.org/10.3390/s20030688 - 27 Jan 2020
Cited by 3 | Viewed by 977
Abstract
To improve the low measurement accuracy of the binocular vision sensor along the optical axis during target tracking, we propose a method for auxiliary correction using a laser-ranging sensor. During system measurement, the mechanical performance of the two-dimensional turntable limits the laser-ranging sensor, so its measurement values lag. In this paper, the lag information is updated directly to solve the time delay. Moreover, in order to give full play to the advantages of binocular vision sensors and laser-ranging sensors in target tracking, federated filtering is used to improve the information utilization and measurement accuracy and to solve the estimated correlation. The experimental results show that the real-time performance and measurement accuracy of the laser ranging-assisted binocular visual-tracking system are improved by the direct update algorithm and the federated filtering algorithm. The results of this paper are significant for binocular vision sensors and laser-ranging sensors in engineering applications involving target tracking systems.
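
The abstract above starts from the low accuracy of binocular vision along the optical axis. One standard way to see why, under the idealized assumption of a rectified pinhole stereo pair (which may differ from the sensor model used in the paper), is that depth is Z = f·B/d, so a fixed disparity error produces a depth error that grows with the square of the depth. The Python sketch below works through that arithmetic; the function names and all numeric values are hypothetical.

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """Depth along the optical axis for a rectified stereo pair: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def depth_error(focal_px, baseline_m, depth_m, disparity_error_px):
    """First-order depth error caused by a disparity error: dZ = Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

# Hypothetical camera: 1000 px focal length, 0.1 m baseline, 0.5 px disparity error.
focal_px, baseline_m, disparity_error_px = 1000.0, 0.1, 0.5

depth_at_10px = stereo_depth(focal_px, baseline_m, 10.0)
error_at_10m = depth_error(focal_px, baseline_m, 10.0, disparity_error_px)
error_at_20m = depth_error(focal_px, baseline_m, 20.0, disparity_error_px)
```

With these assumed values, doubling the target distance quadruples the first-order depth error, which is the kind of range-dependent degradation that an independent range measurement can help correct.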

Article
A Runway Safety System Based on Vertically Oriented Stereovision
Sensors 2021, 21(4), 1464; https://doi.org/10.3390/s21041464 - 20 Feb 2021
Viewed by 713
Abstract
In 2020, over 10,000 bird strikes were reported in the USA, with average repair costs exceeding $200 million annually, rising to $1.2 billion worldwide. These collisions of avifauna with airplanes pose a significant threat to human safety and wildlife. This article presents a system dedicated to monitoring the space over an airport, which is used to localize and identify moving objects. The solution is a stereovision-based real-time bird protection system, which uses IoT and distributed computing concepts together with advanced HMI to provide the setup’s flexibility and usability. To create a high degree of customization, a modified stereovision system with freely oriented optical axes is proposed. To provide a market-tailored solution affordable for small and medium-sized airports, a user-driven design methodology is used. The mathematical model is implemented and optimized in MATLAB. The implemented system prototype is verified in a real environment. The quantitative validation of the system performance is carried out using fixed-wing drones with GPS recorders. The results obtained prove the system’s high efficiency for detection and size classification in real-time, as well as a high degree of localization certainty.

Article
Mask-Refined R-CNN: A Network for Refining Object Details in Instance Segmentation
Sensors 2020, 20(4), 1010; https://doi.org/10.3390/s20041010 - 13 Feb 2020
Cited by 31 | Viewed by 2030
Abstract
With the rapid development of flexible vision sensors and visual sensor networks, computer vision tasks, such as object detection and tracking, are entering a new phase. Accordingly, more challenging comprehensive tasks, including instance segmentation, can develop rapidly. Most state-of-the-art network frameworks for instance segmentation are based on Mask R-CNN (mask region-convolutional neural network). However, the experimental results confirm that Mask R-CNN does not always successfully predict instance details. The scale-invariant fully convolutional network structure of Mask R-CNN ignores the difference in spatial information between receptive fields of different sizes. A large-scale receptive field focuses more on detailed information, whereas a small-scale receptive field focuses more on semantic information. So the network cannot consider the relationship between the pixels at the object edge, and these pixels will be misclassified. To overcome this problem, Mask-Refined R-CNN (MR R-CNN) is proposed, in which the stride of ROIAlign (region of interest align) is adjusted. In addition, the original fully convolutional layer is replaced with a new semantic segmentation layer that realizes feature fusion by constructing a feature pyramid network and summing the forward and backward transmissions of feature maps of the same resolution. The segmentation accuracy is substantially improved by combining the feature layers that focus on the global and detailed information. The experimental results on the COCO (Common Objects in Context) and Cityscapes datasets demonstrate that the segmentation accuracy of MR R-CNN is about 2% higher than that of Mask R-CNN using the same backbone. The average precision of large instances reaches 56.6%, which is higher than those of all state-of-the-art methods. In addition, the proposed method requires low time cost and is easily implemented. The experiments on the Cityscapes dataset also prove that the proposed method has great generalization ability.

Article
Mixed YOLOv3-LITE: A Lightweight Real-Time Object Detection Method
Sensors 2020, 20(7), 1861; https://doi.org/10.3390/s20071861 - 27 Mar 2020
Cited by 16 | Viewed by 2400
Abstract
Embedded and mobile smart devices face problems related to limited computing power and excessive power consumption. To address these problems, we propose Mixed YOLOv3-LITE, a lightweight real-time object detection network that can be used with non-graphics processing unit (GPU) and mobile devices. Based on YOLO-LITE as the backbone network, Mixed YOLOv3-LITE adds residual blocks (ResBlocks) and parallel high-to-low resolution subnetworks, fully utilizes shallow network characteristics while increasing network depth, and uses a “shallow and narrow” convolution layer to build a detector, thereby achieving an optimal balance between detection precision and speed when used with non-GPU based computers and portable terminal devices. The experimental results obtained in this study reveal that the size of the proposed Mixed YOLOv3-LITE network model is 20.5 MB, which is 91.70%, 38.07%, and 74.25% smaller than YOLOv3, tiny-YOLOv3, and SlimYOLOv3-spp3-50, respectively. The mean average precision (mAP) achieved using the PASCAL VOC 2007 dataset is 48.25%, which is 14.48% higher than that of YOLO-LITE. When the VisDrone 2018-Det dataset is used, the mAP achieved with the Mixed YOLOv3-LITE network model is 28.50%, which is 18.50% and 2.70% higher than tiny-YOLOv3 and SlimYOLOv3-spp3-50, respectively. The results prove that Mixed YOLOv3-LITE can achieve higher efficiency and better performance on mobile terminals and other devices.
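
The size comparison in the abstract above can be unpacked with a little arithmetic. If a 20.5 MB model is p% smaller than a baseline, the baseline size implied by that statement is 20.5 / (1 − p/100) MB. The Python sketch below applies this to the three reported percentages. The resulting baseline sizes are derived here from the abstract's figures and are not values quoted from the paper; the helper name `implied_baseline_size` is an assumption for this example.

```python
def implied_baseline_size(model_size_mb, reduction_percent):
    """Baseline size for which model_size_mb is `reduction_percent` percent smaller."""
    return model_size_mb / (1.0 - reduction_percent / 100.0)

# Figures as reported in the abstract.
reported_size_mb = 20.5
reported_reductions = {
    "YOLOv3": 91.70,
    "tiny-YOLOv3": 38.07,
    "SlimYOLOv3-spp3-50": 74.25,
}

# Baseline sizes implied by those figures (derived, not reported).
implied = {
    name: implied_baseline_size(reported_size_mb, percent)
    for name, percent in reported_reductions.items()
}

for name, size_mb in implied.items():
    print(f"{name}: about {size_mb:.1f} MB implied")
```

Under this reading, the reported reductions correspond to baselines of roughly 247 MB, 33 MB, and 80 MB, which gives a quick sanity check on the scale of the savings.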

Article
Weighted Kernel Filter Based Anti-Air Object Tracking for Thermal Infrared Systems
Sensors 2020, 20(15), 4081; https://doi.org/10.3390/s20154081 - 22 Jul 2020
Viewed by 919
Abstract
Visual object tracking is an important component of surveillance systems and many high-performance methods have been developed. However, these tracking methods tend to be optimized for the Red/Green/Blue (RGB) domain and are thus not suitable for use with the infrared (IR) domain. To overcome this disadvantage, many researchers have constructed datasets for IR analysis, including those developed for the Thermal Infrared Visual Object Tracking (VOT-TIR) challenges. As a consequence, many state-of-the-art trackers for the IR domain have been proposed, but there remains a need for reliable IR-based trackers for anti-air surveillance systems, including the construction of a new IR dataset for this purpose. In this paper, we collect various anti-air thermal-wave IR (TIR) images from an electro-optical surveillance system to create a new dataset. We also present a framework based on an end-to-end convolutional neural network that learns object tracking in the IR domain for anti-air targets such as unmanned aerial vehicles (UAVs) and drones. More specifically, we adopt a Siamese network for feature extraction and three region proposal networks for the classification and regression branches. In the inference phase, the proposed network is formulated as a detection-by-tracking method, and kernel filters for the template branch that are continuously updated for every frame are introduced. The proposed network is able to learn robust structural information for the targets during offline training, and the kernel filters can robustly track the targets, demonstrating enhanced performance. Experimental results from the new IR dataset reveal that the proposed method achieves outstanding performance, with a real-time processing speed of 40 frames per second.