Special Issue "Video Analysis and Tracking Using State-of-the-Art Sensors"

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Physical Sensors".

Deadline for manuscript submissions: closed (30 October 2017)

Special Issue Editor

Guest Editor
Prof. Dr. Joonki Paik

Department of Image, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, Korea
Interests: image processing; computer vision; imaging sensors; computational photography

Special Issue Information

Dear Colleagues,

Object detection, identification, recognition, and tracking from video is a fundamental problem in the computer vision and image processing fields. This task requires object modelling and motion analysis, and various types of object models have been developed for improved performance. However, a practical object detection and tracking algorithm must contend with a number of limitations, including object occlusion, unstable illumination, object deformation, insufficient resolution of the input video, and limited computational resources for meeting the required video rendering speed, to name a few. Recent developments in state-of-the-art sensors widen the application area of video object tracking by overcoming these practical limitations.

The objective of this Special Issue is to highlight innovative developments in video analysis and tracking technologies related to various state-of-the-art sensors. Topics include, but are not limited to:

  • Detection, identification, recognition, and tracking of objects using various sensors
  • Multiple-camera networks or camera association for very-wide-range surveillance
  • Development of non-visual sensors, such as time-of-flight sensors, RGB-D cameras, IR sensors, RADAR, LIDAR, motion sensors, and acoustic wave sensors, and their applications to video analysis and tracking
  • Image and video enhancement algorithms to improve the quality of visual sensors for video tracking
  • Computational photography and imaging for advanced object detection and tracking
  • Depth estimation and three-dimensional reconstruction for augmented reality (AR) and/or advanced driver assistance systems (ADAS)

Prof. Dr. Joonki Paik
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All papers will be peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • video tracking
  • motion estimation
  • optical flow
  • RGB-D camera
  • infrared (IR) sensor
  • RADAR
  • LIDAR
  • computational photography
  • augmented reality (AR)
  • surveillance

Published Papers (28 papers)


Research

Open Access Article
High-Speed Video System for Micro-Expression Detection and Recognition
Sensors 2017, 17(12), 2913; https://doi.org/10.3390/s17122913
Received: 25 October 2017 / Revised: 11 December 2017 / Accepted: 11 December 2017 / Published: 14 December 2017
Abstract
Micro-expressions play an essential part in understanding non-verbal communication and deceit detection. They are involuntary, brief facial movements that are shown when a person is trying to conceal something. Automatic analysis of micro-expressions is challenging due to their low amplitude and short duration (they occur as fast as 1/15 to 1/25 of a second). We propose a full micro-expression analysis system consisting of a high-speed image acquisition setup and a software framework that can detect the frames in which micro-expressions occur as well as determine the type of the emerged expression. The detection and classification methods use fast and simple motion descriptors based on absolute image differences. The recognition module only involves the computation of several 2D Gaussian probabilities. The software framework was tested on two publicly available high-speed micro-expression databases, and the whole system was used to acquire new data. The experiments we performed show that our solution outperforms state-of-the-art works that use more complex and computationally intensive descriptors. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)
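The two core computations described in the abstract (absolute-difference motion descriptors and 2D Gaussian recognition scores) can be illustrated with a minimal sketch. The 4x4 grid, the function names, and the unnormalised score are illustrative assumptions, not the authors' actual implementation:

```python
import numpy as np

def motion_descriptor(prev_frame, curr_frame):
    """Simple motion descriptor: mean absolute frame difference per grid cell."""
    diff = np.abs(curr_frame.astype(np.float64) - prev_frame.astype(np.float64))
    h, w = diff.shape
    # Split the frame into a 4x4 grid and average the differences per cell.
    cells = diff[: h // 4 * 4, : w // 4 * 4].reshape(4, h // 4, 4, w // 4)
    return cells.mean(axis=(1, 3)).ravel()  # 16-dimensional descriptor

def gaussian_score(x, mean, cov):
    """Unnormalised 2D Gaussian probability used to score a candidate class."""
    d = np.asarray(x, dtype=float) - mean
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))
```

A micro-expression frame would then be flagged when the descriptor magnitude spikes, and the expression class with the highest Gaussian score would win.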

Open Access Article
Iterative Refinement of Transmission Map for Stereo Image Defogging Using a Dual Camera Sensor
Sensors 2017, 17(12), 2861; https://doi.org/10.3390/s17122861
Received: 28 October 2017 / Revised: 27 November 2017 / Accepted: 6 December 2017 / Published: 9 December 2017
Abstract
Recently, the stereo imaging-based image enhancement approach has attracted increasing attention in the field of video analysis. This paper presents a dual camera-based stereo image defogging algorithm. Optical flow is first estimated from the stereo foggy image pair, and the initial disparity map is generated from the estimated optical flow. Next, an initial transmission map is generated using the initial disparity map. Atmospheric light is then estimated using the color line theory. The defogged result is finally reconstructed using the estimated transmission map and atmospheric light. The proposed method can refine the transmission map iteratively. Experimental results show that the proposed method can successfully remove fog without color distortion. The proposed method can be used as a pre-processing step for an outdoor video analysis system and a high-end smartphone with a dual camera system. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)
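The pipeline in the abstract rests on the standard atmospheric scattering model, I = J*t + A*(1 - t), with the transmission t derived from stereo depth. A minimal sketch follows, in which the attenuation coefficient `beta` and the function names are assumptions rather than the paper's exact formulation:

```python
import numpy as np

def transmission_from_disparity(disparity, focal, baseline, beta=0.05):
    """Depth from stereo disparity, then transmission t = exp(-beta * depth)."""
    depth = focal * baseline / np.maximum(disparity, 1e-6)
    return np.exp(-beta * depth)

def defog(foggy, t, airlight, t_min=0.1):
    """Invert the atmospheric scattering model I = J*t + A*(1 - t)."""
    t = np.clip(t, t_min, 1.0)  # avoid amplifying noise where t is tiny
    return (foggy - airlight) / t[..., None] + airlight
```

The iterative refinement described in the abstract would repeat this with an improved disparity map; here a single pass is shown.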

Open Access Article
Vision System for Coarsely Estimating Motion Parameters for Unknown Fast Moving Objects in Space
Sensors 2017, 17(12), 2820; https://doi.org/10.3390/s17122820
Received: 30 October 2017 / Revised: 27 November 2017 / Accepted: 1 December 2017 / Published: 5 December 2017
Abstract
Motivated by biological interest in analyzing the navigation behaviors of flying animals, we attempt to build a system that measures their motion states. To this end, in this paper, we build a vision system to detect unknown fast-moving objects within a given space, calculating their motion parameters represented by positions and poses. We propose a novel method to detect reliable interest points in images of moving objects, which can hardly be detected by general-purpose interest point detectors. 3D points reconstructed using these interest points are then grouped and maintained for detected objects according to a careful schedule that considers appearance and perspective changes. In the estimation step, a method is introduced to adapt the robust estimation procedure used for dense point sets to the case of sparse sets, reducing the potential risk of greatly biased estimation. Experiments are conducted on real scenes, showing the capability of the system to detect multiple unknown moving objects and estimate their positions and poses. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)

Open Access Article
Scene-Aware Adaptive Updating for Visual Tracking via Correlation Filters
Sensors 2017, 17(11), 2626; https://doi.org/10.3390/s17112626
Received: 27 September 2017 / Revised: 9 November 2017 / Accepted: 11 November 2017 / Published: 15 November 2017
Abstract
In recent years, visual object tracking has been widely used in military guidance, human-computer interaction, road traffic, scene monitoring and many other fields. Tracking algorithms based on correlation filters have shown good performance in terms of accuracy and tracking speed. However, their performance is not satisfactory in scenes with scale variation, deformation, and occlusion. In this paper, we propose a scene-aware adaptive updating mechanism for visual tracking via a kernel correlation filter (KCF). First, a low-complexity scale estimation method is presented, in which the corresponding weights at five scales are employed to determine the final target scale. Then, an adaptive updating mechanism is presented based on scene classification. We classify the video scenes into four categories by video content analysis. According to the target scene, we exploit the adaptive updating mechanism to update the kernel correlation filter to improve the robustness of the tracker, especially in scenes with scale variation, deformation, and occlusion. We evaluate our tracker on the CVPR2013 benchmark. The experimental results obtained with the proposed algorithm are improved by 33.3%, 15%, 6%, 21.9% and 19.8% compared to those of the KCF tracker in scenes with scale variation, partial or long-time large-area occlusion, deformation, fast motion and out-of-view, respectively. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)
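The five-scale estimation step can be sketched as a weighted arg-max over the correlation peak values. The scale pool and weights below are hypothetical values for illustration; the paper's actual weights are not given in the abstract:

```python
import numpy as np

# Hypothetical scale pool and weights: responses at non-unit scales are
# slightly damped so the tracker keeps the current scale unless the
# evidence for a change is strong.
SCALES = np.array([0.90, 0.95, 1.00, 1.05, 1.10])
WEIGHTS = np.array([0.96, 0.98, 1.00, 0.98, 0.96])

def select_scale(responses):
    """Pick the final target scale from the five correlation peak values."""
    weighted = np.asarray(responses) * WEIGHTS
    return SCALES[int(np.argmax(weighted))]
```

The selected scale would then rescale the tracked bounding box before the next filter update.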

Open Access Article
Motion-Blur-Free High-Speed Video Shooting Using a Resonant Mirror
Sensors 2017, 17(11), 2483; https://doi.org/10.3390/s17112483
Received: 15 September 2017 / Revised: 22 October 2017 / Accepted: 25 October 2017 / Published: 29 October 2017
Abstract
This study proposes a novel concept of actuator-driven frame-by-frame intermittent tracking for motion-blur-free video shooting of fast-moving objects. The camera frame and shutter timings are controlled for motion blur reduction in synchronization with a free-vibration-type actuator vibrating with a large amplitude at hundreds of hertz so that motion blur can be significantly reduced in free-viewpoint high-frame-rate video shooting for fast-moving objects by deriving the maximum performance of the actuator. We develop a prototype of a motion-blur-free video shooting system by implementing our frame-by-frame intermittent tracking algorithm on a high-speed video camera system with a resonant mirror vibrating at 750 Hz. It can capture 1024 × 1024 images of fast-moving objects at 750 fps with an exposure time of 0.33 ms without motion blur. Several experimental results for fast-moving objects verify that our proposed method can reduce image degradation from motion blur without decreasing the camera exposure time. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)

Open Access Article
DEEP-SEE: Joint Object Detection, Tracking and Recognition with Application to Visually Impaired Navigational Assistance
Sensors 2017, 17(11), 2473; https://doi.org/10.3390/s17112473
Received: 28 August 2017 / Revised: 5 October 2017 / Accepted: 25 October 2017 / Published: 28 October 2017
Abstract
In this paper, we introduce the so-called DEEP-SEE framework that jointly exploits computer vision algorithms and deep convolutional neural networks (CNNs) to detect, track and recognize, in real time, objects encountered during navigation in the outdoor environment. A first feature concerns an object detection technique designed to localize both static and dynamic objects without any a priori knowledge about their position, type or shape. The methodological core of the proposed approach relies on a novel object tracking method based on two convolutional neural networks trained offline. The key principle consists of alternating between tracking using motion information and predicting the object location in time based on visual similarity. The validation of the tracking technique is performed on standard benchmark VOT datasets, and shows that the proposed approach returns state-of-the-art results while minimizing the computational complexity. Then, the DEEP-SEE framework is integrated into a novel assistive device designed to improve the cognition of visually impaired (VI) people and to increase their safety when navigating in crowded urban scenes. The validation of our assistive device is performed on a video dataset with 30 elements acquired with the help of VI users. The proposed system shows high accuracy (>90%) and robustness (>90%) scores regardless of the scene dynamics. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)

Open Access Article
Robust Small Target Co-Detection from Airborne Infrared Image Sequences
Sensors 2017, 17(10), 2242; https://doi.org/10.3390/s17102242
Received: 14 August 2017 / Revised: 17 September 2017 / Accepted: 25 September 2017 / Published: 29 September 2017
Abstract
In this paper, a novel infrared target co-detection model combining the self-correlation features of backgrounds and the commonality features of targets in the spatio-temporal domain is proposed to detect small targets in a sequence of infrared images with complex backgrounds. Firstly, a dense target extraction model based on nonlinear weights is proposed, which can suppress the background of images and enhance small targets better than weights of singular values. Secondly, a sparse target extraction model based on entry-wise weighted robust principal component analysis is proposed. The entry-wise weight adaptively incorporates a structural prior in terms of local weighted entropy; thus, it can extract real targets accurately and suppress background clutter efficiently. Finally, the commonality of targets in the spatio-temporal domain is used to construct a target refinement model for false alarm suppression and target confirmation. Since real targets can appear in both the dense and sparse reconstruction maps of a single frame, and form trajectories after tracklet association of consecutive frames, the location correlation of the dense and sparse reconstruction maps for a single frame and the tracklet association of the location correlation maps for successive frames have a strong ability to discriminate between small targets and background clutter. Experimental results demonstrate that the proposed small target co-detection method can not only suppress background clutter effectively, but also detect targets accurately even in the presence of target-like interference. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)
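The structural prior described in the abstract is expressed in terms of local weighted entropy; a minimal sketch of the entropy computation for a grey-level patch follows. The bin count and value range are assumptions:

```python
import numpy as np

def local_entropy(patch, bins=16):
    """Shannon entropy of a grey-level patch; cluttered patches score higher."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())
```

In an entry-wise weighted RPCA, such a score could down-weight entries in cluttered regions so the sparse component captures real targets rather than background clutter.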

Open Access Article
American Sign Language Alphabet Recognition Using a Neuromorphic Sensor and an Artificial Neural Network
Sensors 2017, 17(10), 2176; https://doi.org/10.3390/s17102176
Received: 24 July 2017 / Revised: 2 September 2017 / Accepted: 13 September 2017 / Published: 22 September 2017
Abstract
This paper reports the design and analysis of an American Sign Language (ASL) alphabet translation system implemented in hardware using a Field-Programmable Gate Array. The system process consists of three stages, the first being communication with the neuromorphic camera (also called a Dynamic Vision Sensor, DVS) using the Universal Serial Bus protocol. The feature extraction of the events generated by the DVS is the second part of the process, consisting of the digital image processing algorithms developed in software, which aim to reduce redundant information and prepare the data for the third stage. The last stage of the system process is the classification of the ASL alphabet, achieved with a single artificial neural network implemented in digital hardware for higher speed. The overall result is the development of a classification system using the contours of ASL signs, fully implemented in a reconfigurable device. The experimental results consist of a comparative analysis of the recognition rate among the alphabet signs using the neuromorphic camera in order to prove the proper operation of the digital image processing algorithms. In experiments performed with 720 samples of 24 signs, a recognition accuracy of 79.58% was obtained. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)

Open Access Article
Comparative Evaluation of Background Subtraction Algorithms in Remote Scene Videos Captured by MWIR Sensors
Sensors 2017, 17(9), 1945; https://doi.org/10.3390/s17091945
Received: 1 July 2017 / Revised: 7 August 2017 / Accepted: 17 August 2017 / Published: 24 August 2017
Abstract
Background subtraction (BS) is one of the most commonly encountered tasks in video analysis and tracking systems. It distinguishes the foreground (moving objects) from the video sequences captured by static imaging sensors. Background subtraction in remote scene infrared (IR) video is important and common to many fields. This paper provides a Remote Scene IR Dataset captured by our designed medium-wave infrared (MWIR) sensor. Each video sequence in this dataset is identified with specific BS challenges, and the pixel-wise ground truth of the foreground (FG) for each frame is also provided. A series of experiments were conducted to evaluate BS algorithms on the proposed dataset. The overall performance of the BS algorithms and their processor/memory requirements were compared. Proper evaluation metrics and criteria were employed to evaluate the capability of each BS algorithm to handle the different kinds of BS challenges represented in this dataset. The results and conclusions in this paper provide valid references for developing new BS algorithms for remote scene IR video sequences, and some of them are not limited to remote scene or IR video sequences but are generic to background subtraction. The Remote Scene IR dataset and the foreground masks detected by each evaluated BS algorithm are available online: https://github.com/JerryYaoGl/BSEvaluationRemoteSceneIR. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)
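Evaluations like this one typically score each algorithm's foreground masks against the pixel-wise ground truth with precision, recall, and F-measure. A minimal sketch of these standard metrics (the paper's exact metric set is not specified in the abstract):

```python
import numpy as np

def bs_metrics(pred_mask, gt_mask):
    """Pixel-wise precision, recall and F-measure for a foreground mask."""
    pred = np.asarray(pred_mask).astype(bool)
    gt = np.asarray(gt_mask).astype(bool)
    tp = np.logical_and(pred, gt).sum()   # true foreground pixels found
    fp = np.logical_and(pred, ~gt).sum()  # background flagged as foreground
    fn = np.logical_and(~pred, gt).sum()  # foreground pixels missed
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f_measure
```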

Open Access Article
Headgear Accessories Classification Using an Overhead Depth Sensor
Sensors 2017, 17(8), 1845; https://doi.org/10.3390/s17081845
Received: 22 June 2017 / Revised: 2 August 2017 / Accepted: 8 August 2017 / Published: 10 August 2017
Abstract
In this paper, we address the generation of semantic labels describing the headgear accessories worn by people in a scene under surveillance, using only depth information obtained from a Time-of-Flight (ToF) camera placed in an overhead position. We propose a new method for headgear accessory classification based on the design of a robust processing strategy that includes the estimation of a meaningful feature vector providing the relevant information about the people's head and shoulder areas. This paper includes a detailed description of the proposed algorithmic approach, and the results obtained in tests with persons with and without headgear accessories, and with different types of hats and caps. In order to evaluate the proposal, a wide experimental validation has been carried out on a fully labeled database (which has been made available to the scientific community), including a broad variety of people and headgear accessories. For the validation, three different levels of detail have been defined, considering different numbers of classes: the first level only includes two classes (hat/cap, and no hat/cap), the second one considers three classes (hat, cap, and no hat/cap), and the last one includes the full class set with five classes (no hat/cap, cap, small size hat, medium size hat, and large size hat). The achieved performance is satisfactory in every case: the average classification rate for the first level reaches 95.25%, for the second one it is 92.34%, and for the full class set it equals 84.60%. In addition, the online stage processing time is 5.75 ms per frame on a standard PC, thus allowing for real-time operation. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)

Open Access Article
A Study of Deep CNN-Based Classification of Open and Closed Eyes Using a Visible Light Camera Sensor
Sensors 2017, 17(7), 1534; https://doi.org/10.3390/s17071534
Received: 2 June 2017 / Revised: 26 June 2017 / Accepted: 28 June 2017 / Published: 30 June 2017
Abstract
The necessity for the classification of open and closed eyes is increasing in various fields, including analysis of eye fatigue in 3D TVs, analysis of the psychological states of test subjects, and eye status tracking-based driver drowsiness detection. Previous studies have used various methods to distinguish between open and closed eyes, such as classifiers based on the features obtained from image binarization, edge operators, or texture analysis. However, when it comes to eye images with different lighting conditions and resolutions, it can be difficult to find an optimal threshold for image binarization or optimal filters for edge and texture extraction. In order to address this issue, we propose a method to classify open and closed eye images with different conditions, acquired by a visible light camera, using a deep residual convolutional neural network. After conducting performance analysis on both self-collected and open databases, we have determined that the classification accuracy of the proposed method is superior to that of existing methods. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)

Open Access Article
Improving Video Segmentation by Fusing Depth Cues and the Visual Background Extractor (ViBe) Algorithm
Sensors 2017, 17(5), 1177; https://doi.org/10.3390/s17051177
Received: 19 March 2017 / Revised: 12 May 2017 / Accepted: 18 May 2017 / Published: 21 May 2017
Abstract
Depth-sensing technology has led to broad applications of inexpensive depth cameras that can capture human motion and scenes in three-dimensional space. Background subtraction algorithms can be improved by fusing color and depth cues, thereby allowing many issues encountered in classical color segmentation to be solved. In this paper, we propose a new fusion method that combines depth and color information for foreground segmentation based on an advanced color-based algorithm. First, a background model and a depth model are developed. Then, based on these models, we propose a new updating strategy that can eliminate ghosting and black shadows almost completely. Extensive experiments have been performed to compare the proposed algorithm with other, conventional RGB-D (Red-Green-Blue and Depth) algorithms. The experimental results suggest that our method extracts foregrounds with higher effectiveness and efficiency. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)
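The color-based algorithm the fusion builds on, ViBe, keeps a small set of past samples per pixel and classifies a pixel as background when it lies close to enough of them. A single-channel sketch of that rule, using commonly cited defaults for the radius and match count (assumptions here):

```python
import random

def vibe_classify(pixel, samples, radius=20, min_matches=2):
    """Background if the pixel is within `radius` of at least `min_matches` samples."""
    matches = sum(abs(pixel - s) < radius for s in samples)
    return matches >= min_matches

def vibe_update(pixel, samples, rate=16):
    """Conservative random update: with probability 1/rate, overwrite one stored sample."""
    if random.randrange(rate) == 0:
        samples[random.randrange(len(samples))] = pixel
```

The paper's fusion adds a depth model alongside the color model and a new updating strategy to remove ghosting; only the grey-level rule is shown here.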

Open Access Article
A Human Activity Recognition System Based on Dynamic Clustering of Skeleton Data
Sensors 2017, 17(5), 1100; https://doi.org/10.3390/s17051100
Received: 12 January 2017 / Revised: 3 May 2017 / Accepted: 8 May 2017 / Published: 11 May 2017
Abstract
Human activity recognition is an important area in computer vision, with its wide range of applications including ambient assisted living. In this paper, an activity recognition system based on skeleton data extracted from a depth camera is presented. The system makes use of machine learning techniques to classify the actions, which are described with a set of a few basic postures. The training phase creates several models related to the number of clustered postures by means of a multiclass Support Vector Machine (SVM), trained with Sequential Minimal Optimization (SMO). The classification phase adopts the X-means algorithm to find the optimal number of clusters dynamically. The contribution of the paper is twofold. The first aim is to perform activity recognition employing features based on a small number of informative postures, extracted independently from each activity instance; secondly, it aims to assess the minimum number of frames needed for adequate classification. The system is evaluated on two publicly available datasets, the Cornell Activity Dataset (CAD-60) and the Telecommunication Systems Team (TST) Fall detection dataset. The number of clusters needed to model each instance ranges from two to four elements. The proposed approach achieves excellent performance using only about 4 s of input data (~100 frames) and outperforms the state of the art when it uses approximately 500 frames on the CAD-60 dataset. The results are promising for tests in real contexts. Full article
(This article belongs to the Special Issue Video Analysis and Tracking Using State-of-the-Art Sensors)
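The posture-clustering idea can be illustrated with a plain k-means over skeleton joint vectors. The paper uses X-means to pick the cluster count dynamically and an SVM for classification; the sketch below, with a simple farthest-point seeding, only shows how frames collapse into a few basic postures:

```python
import numpy as np

def init_centers(frames, k):
    """Farthest-point seeding: spread the k initial postures across the data."""
    centers = [frames[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(frames - c, axis=1) for c in centers], axis=0)
        centers.append(frames[int(np.argmax(d))])
    return np.array(centers, dtype=float)

def kmeans_postures(frames, k, iters=20):
    """Cluster skeleton frames (N x D joint coordinate vectors) into k postures."""
    centers = init_centers(frames, k)
    for _ in range(iters):
        # Assign each frame to its nearest posture centre, then recompute centres.
        dists = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = frames[labels == j].mean(axis=0)
    return centers, labels
```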

Open Access Article
Fuzzy System-Based Target Selection for a NIR Camera-Based Gaze Tracker
Sensors 2017, 17(4), 862; https://doi.org/10.3390/s17040862
Received: 15 February 2017 / Revised: 29 March 2017 / Accepted: 11 April 2017 / Published: 14 April 2017
Abstract
Gaze-based interaction (GBI) techniques have been a popular subject of research in the last few decades. Among other applications, GBI can be used by persons with disabilities to perform everyday tasks, can serve as a game interface, and can play a pivotal role in the human-computer interface (HCI) field. While gaze tracking systems have shown high accuracy in GBI, detecting a user's gaze for target selection is a challenging problem that needs to be considered while using a gaze detection system. Past research has used the blinking of the eyes for this purpose as well as dwell time-based methods, but these techniques are either inconvenient for the user or require a long time for target selection. Therefore, in this paper, we propose a fuzzy system-based target selection method for near-infrared (NIR) camera-based gaze trackers. The results of experiments, together with tests of the usability and on-screen keyboard use of the proposed method, show that it is better than previous methods. Full article

Open Access Article
Tracking a Non-Cooperative Target Using Real-Time Stereovision-Based Control: An Experimental Study
Sensors 2017, 17(4), 735; https://doi.org/10.3390/s17040735
Received: 14 February 2017 / Revised: 23 March 2017 / Accepted: 28 March 2017 / Published: 31 March 2017
Cited by 2
Abstract
Tracking a non-cooperative target is a challenge, because in unfamiliar environments most targets are unknown and unspecified. Stereovision is suited to this problem because it allows passive scanning of large areas and estimation of the relative position, velocity and shape of objects. This research is an experimental effort aimed at developing, implementing and evaluating a real-time non-cooperative target tracking method using stereovision measurements only. A computer-vision feature detection and matching algorithm was developed to identify and locate the target in the captured images. Three different filters were designed for estimating the relative position and velocity, and their performance was compared. A line-of-sight control algorithm was used to keep the target within the field of view. Extensive analytical and numerical investigations were conducted on the multi-view stereo projection equations and their solutions, which were used to initialize the different filters. This research shows, through experimental and numerical evaluation, the benefits of using the unscented Kalman filter and the total least squares technique in the stereovision-based tracking problem. These findings offer a general and more accurate method for solving the static and dynamic stereovision triangulation problems and the concomitant line-of-sight control.
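The geometric step the filters are initialized from — stereo triangulation — reduces, for a rectified camera pair, to the classic disparity relation Z = f·B/d. A minimal sketch under a pinhole-camera assumption; the focal length, baseline and principal point below are made-up example values:

```python
def triangulate(u_left, u_right, v, f_px, baseline_m, cx, cy):
    # Rectified-stereo triangulation: disparity d = u_left - u_right,
    # depth Z = f * B / d; X and Y then follow from the pinhole model.
    d = u_left - u_right
    if d <= 0:
        raise ValueError("non-positive disparity: point at or beyond infinity")
    Z = f_px * baseline_m / d
    X = (u_left - cx) * Z / f_px
    Y = (v - cy) * Z / f_px
    return X, Y, Z

# 700 px focal length, 12 cm baseline, 20 px disparity -> 4.2 m depth
print(triangulate(340, 320, 240, 700.0, 0.12, 320.0, 240.0))
```

Noise in the pixel coordinates enters this relation non-linearly, which is why the abstract's comparison of least-squares variants (total least squares vs. ordinary) for the triangulation step matters.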

Open Access Article
Gender Recognition from Human-Body Images Using Visible-Light and Thermal Camera Videos Based on a Convolutional Neural Network for Image Feature Extraction
Sensors 2017, 17(3), 637; https://doi.org/10.3390/s17030637
Received: 31 January 2017 / Revised: 10 March 2017 / Accepted: 18 March 2017 / Published: 20 March 2017
Cited by 9
Abstract
Extracting powerful image features plays an important role in computer vision systems. Many methods have previously been proposed to extract image features for various computer vision applications, such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), local binary patterns (LBP), the histogram of oriented gradients (HOG), and weighted HOG. Recently, the convolutional neural network (CNN) method for image feature extraction and classification has been used in various computer vision applications. In this research, we propose a new gender recognition method for recognizing males and females in observation scenes of surveillance systems, based on feature extraction from visible-light and thermal camera videos through a CNN. Experimental results confirm the superiority of our proposed method over state-of-the-art recognition methods for the gender recognition problem using human body images.
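For contrast with the CNN features the paper adopts, the hand-crafted HOG baseline it mentions is easy to sketch: per-cell histograms of gradient orientations weighted by gradient magnitude. A minimal single-cell version on a toy 6×6 patch; the bin count is simplified and block normalisation is omitted:

```python
import math

def hog_cell(cell, bins=9):
    # Orientation histogram for one cell: each pixel's gradient magnitude
    # votes into an unsigned-orientation bin (0..180 deg).
    hist = [0.0] * bins
    n = len(cell)
    for y in range(1, n - 1):
        for x in range(1, n - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # central differences
            gy = cell[y + 1][x] - cell[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(ang / (180.0 / bins)), bins - 1)] += mag
    return hist

# A vertical intensity edge puts all gradient energy in the 0-degree bin.
cell = [[0, 0, 0, 9, 9, 9] for _ in range(6)]
hist = hog_cell(cell)
print(hist)
```

A CNN learns such filters (and far richer ones) from data rather than fixing them by hand, which is the comparison the abstract draws.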

Open Access Article
Conditional Random Field (CRF)-Boosting: Constructing a Robust Online Hybrid Boosting Multiple Object Tracker Facilitated by CRF Learning
Sensors 2017, 17(3), 617; https://doi.org/10.3390/s17030617
Received: 6 December 2016 / Revised: 13 March 2017 / Accepted: 14 March 2017 / Published: 17 March 2017
Cited by 3
Abstract
Due to the reasonably acceptable performance of state-of-the-art object detectors, tracking-by-detection is a standard strategy for visual multi-object tracking (MOT). In particular, online MOT is more demanding due to its diverse applications in time-critical situations. A main issue in realizing online MOT is how to associate noisy object detection results on a new frame with previously tracked objects. In this work, we propose a multi-object tracking method called CRF-boosting, which utilizes a hybrid data association method based on online hybrid boosting facilitated by a conditional random field (CRF). For data association, the learned CRF is used to generate reliable low-level tracklets, which then serve as the input of the hybrid boosting. While existing boosting-based data association methods require training data with ground-truth information to improve robustness, CRF-boosting ensures sufficient robustness without such information thanks to its synergetic cascaded learning procedure. Further, a hierarchical feature association framework is adopted to improve MOT accuracy. Experimental results on public datasets show that the proposed hybrid approach yields a noticeable benefit over other competitive MOT systems.

Open Access Article
Effective Visual Tracking Using Multi-Block and Scale Space Based on Kernelized Correlation Filters
Sensors 2017, 17(3), 433; https://doi.org/10.3390/s17030433
Received: 9 December 2016 / Revised: 13 February 2017 / Accepted: 17 February 2017 / Published: 23 February 2017
Cited by 5
Abstract
Accurate scale estimation and occlusion handling are challenging problems in visual tracking. Recently, correlation filter-based trackers have shown impressive results in terms of accuracy, robustness, and speed. However, such models are not robust to scale variation and occlusion. In this paper, we address these problems by employing a scale space filter and a multi-block scheme based on a kernelized correlation filter (KCF) tracker. Furthermore, we develop a more robust algorithm using an appearance update model that approximates changes of state due to occlusion and deformation. In particular, an adaptive update scheme is presented to make each process robust. The experimental results demonstrate that the proposed method outperformed 29 state-of-the-art trackers on 100 challenging sequences. Specifically, the results obtained with the proposed scheme improved on those of the KCF tracker by 8% for 49 occlusion sequences and by 18% for 64 scale variation sequences. The proposed tracker can therefore be a robust and useful tool for object tracking when occlusion and scale variation are involved.

Open Access Article
A Real-Time High Performance Computation Architecture for Multiple Moving Target Tracking Based on Wide-Area Motion Imagery via Cloud and Graphic Processing Units
Sensors 2017, 17(2), 356; https://doi.org/10.3390/s17020356
Received: 29 November 2016 / Revised: 4 February 2017 / Accepted: 7 February 2017 / Published: 12 February 2017
Cited by 3
Abstract
This paper presents the first attempt at combining Cloud with Graphic Processing Units (GPUs) in a complementary manner within the framework of a real-time high performance computation architecture for detecting and tracking multiple moving targets based on Wide Area Motion Imagery (WAMI). More specifically, the GPU and Cloud Moving Target Tracking (GC-MTT) system applies a front-end web-based server to perform the interaction with Hadoop, together with highly parallelized computation functions based on the Compute Unified Device Architecture (CUDA). The introduced multiple moving target detection and tracking method can be extended to other applications such as pedestrian tracking, group tracking, and Patterns of Life (PoL) analysis. Cloud- and GPU-based computing provides an efficient real-time target recognition and tracking approach compared to workflows that use only central processing units (CPUs). The simultaneous tracking and recognition results demonstrate that the GC-MTT approach provides drastically improved tracking at low frame rates under realistic conditions.

Open Access Article
Robust Video Stabilization Using Particle Keypoint Update and l1-Optimized Camera Path
Sensors 2017, 17(2), 337; https://doi.org/10.3390/s17020337
Received: 17 October 2016 / Revised: 3 February 2017 / Accepted: 4 February 2017 / Published: 10 February 2017
Cited by 4
Abstract
Acquisition of stabilized video is an important issue for various types of digital cameras. This paper presents an adaptive camera path estimation method using robust feature detection to remove shaky artifacts in a video. The proposed algorithm consists of three steps: (i) robust feature detection using particle keypoints between adjacent frames; (ii) camera path estimation and smoothing; and (iii) rendering to reconstruct a stabilized video. As a result, the proposed algorithm can estimate the optimal homography by redefining important feature points in the flat region using particle keypoints. In addition, stabilized frames with fewer holes can be generated from the optimal, adaptive camera path that minimizes a temporal total variation (TV). The proposed video stabilization method is suitable for enhancing the visual quality of various portable cameras and can be applied to robot vision, driving assistant systems, and visual surveillance systems.
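Step (ii) — smoothing the estimated camera path — is the heart of stabilization: each frame is warped by the difference between the shaky path and a smoothed one. The paper minimizes a temporal total variation (an l1 objective); the sketch below substitutes a simple moving average to show the mechanics, with made-up 1-D translation data:

```python
def smooth_path(path, radius=2):
    # Moving-average stand-in for the paper's l1/TV-optimised camera path:
    # each smoothed sample averages a +/- radius window (clamped at the ends).
    out = []
    for i in range(len(path)):
        lo, hi = max(0, i - radius), min(len(path), i + radius + 1)
        out.append(sum(path[lo:hi]) / (hi - lo))
    return out

# Jittery horizontal camera translation per frame (pixels).
shaky = [0, 5, -3, 6, -2, 7, -1, 4]
stable = smooth_path(shaky)
correction = [s - p for p, s in zip(shaky, stable)]  # per-frame warp offsets
print([round(x, 2) for x in stable])
```

The l1/TV objective additionally keeps the smoothed path piecewise constant or linear, avoiding the low-frequency drift a plain moving average can introduce.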

Open Access Article
Real-Time Straight-Line Detection for XGA-Size Videos by Hough Transform with Parallelized Voting Procedures
Sensors 2017, 17(2), 270; https://doi.org/10.3390/s17020270
Received: 31 October 2016 / Accepted: 20 January 2017 / Published: 30 January 2017
Cited by 7
Abstract
The Hough Transform (HT) is a method for extracting straight lines from an edge image. The main limitations of the HT in actual applications are its computation time and storage requirements. This paper reports a hardware architecture for HT implementation on a Field Programmable Gate Array (FPGA) with a parallelized voting procedure. The 2-dimensional accumulator array, namely the Hough space in parametric form (ρ, θ), which computes the strength of each line by a voting mechanism, is mapped onto a 1-dimensional array with regular increments of θ. This Hough space is then divided into a number of parallel parts, so that the computation of (ρ, θ) for the edge pixels and the voting procedure for straight-line determination are executable in parallel. In addition, a synchronized initialization of the Hough space further increases the speed of straight-line detection, so that XGA video processing becomes possible. The designed prototype system has been synthesized on a DE4 platform with a Stratix-IV FPGA device. In the application of road-lane detection, the average processing speed of this HT implementation is 5.4 ms per XGA frame at a 200 MHz working frequency.
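The voting procedure being parallelized here is the standard HT accumulation: every edge pixel votes for all (ρ, θ) bins consistent with it, and strong lines emerge as accumulator peaks. A minimal software version, using a dictionary accumulator instead of the FPGA's partitioned 1-D array; the resolutions are illustrative:

```python
import math

def hough_lines(edge_pixels, rho_step=1.0, theta_steps=180, top=1):
    # Vote in the (rho, theta) accumulator: each edge pixel votes for every
    # line rho = x*cos(theta) + y*sin(theta) that could pass through it.
    acc = {}
    for x, y in edge_pixels:
        for t in range(theta_steps):
            theta = t * math.pi / theta_steps
            rho = round((x * math.cos(theta) + y * math.sin(theta)) / rho_step)
            acc[(rho, t)] = acc.get((rho, t), 0) + 1
    return sorted(acc.items(), key=lambda kv: -kv[1])[:top]

# Points on the horizontal line y = 5 vote most heavily near theta = 90 deg,
# rho = 5 (one degree per theta step here).
pixels = [(x, 5) for x in range(20)]
(rho, t), votes = hough_lines(pixels)[0]
print(rho, t, votes)
```

The FPGA design partitions exactly this accumulator across θ so that many votes land in the same clock cycle; the software loop above shows why the naive version is the bottleneck.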

Open Access Article
Visual Object Tracking Based on Cross-Modality Gaussian-Bernoulli Deep Boltzmann Machines with RGB-D Sensors
Sensors 2017, 17(1), 121; https://doi.org/10.3390/s17010121
Received: 1 December 2016 / Revised: 5 January 2017 / Accepted: 5 January 2017 / Published: 10 January 2017
Cited by 3
Abstract
Visual object tracking technology is one of the key issues in computer vision. In this paper, we propose a visual object tracking algorithm based on cross-modality feature deep learning using Gaussian-Bernoulli deep Boltzmann machines (DBM) with RGB-D sensors. First, a cross-modality feature learning network based on a Gaussian-Bernoulli DBM is constructed, which can extract cross-modality features of the samples in RGB-D video data. Second, the cross-modality features of the samples are input into a logistic regression classifier, and the observation likelihood model is established according to the confidence score of the classifier. Finally, the object tracking results over RGB-D data are obtained using a Bayesian maximum a posteriori (MAP) probability estimation algorithm. The experimental results show that the proposed method has strong robustness to abnormal changes (e.g., occlusion, rotation, illumination change, etc.). The algorithm can steadily track multiple targets and has higher accuracy.

Open Access Article
3D Visual Tracking of an Articulated Robot in Precision Automated Tasks
Sensors 2017, 17(1), 104; https://doi.org/10.3390/s17010104
Received: 1 November 2016 / Revised: 21 December 2016 / Accepted: 4 January 2017 / Published: 7 January 2017
Cited by 4
Abstract
The most compelling requirements for visual tracking systems are high detection accuracy and adequate processing speed. Combining the two in real-world applications is very challenging, because more accurate tracking often requires longer processing times, while quicker responses are more prone to errors; a trade-off between accuracy and speed is therefore required. This paper aims to satisfy both requirements by implementing an accurate and time-efficient tracking system. An eye-to-hand visual system that automatically tracks a moving target is introduced. An enhanced Circular Hough Transform (CHT) is employed to estimate the trajectory of a spherical target in three dimensions. The colour feature of the target is selected with a new colour selection process that combines a colour segmentation method (Delta E) with the CHT algorithm to find the proper colour of the tracked target; the target is attached to the end-effector of a six-degree-of-freedom (DOF) robot performing a pick-and-place task. Two cooperating eye-to-hand cameras with image averaging filters are used to obtain clear and steady images. The paper also examines a new technique, named Controllable Region of interest based on Circular Hough Transform (CRCHT), for generating and controlling the observation search window in order to increase the computational speed of the tracking system. Moreover, a new mathematical formula is introduced for updating the depth information of the vision system during object tracking. For more reliable and accurate tracking, a simplex optimization technique is employed to calculate the parameters of the camera-to-robot transformation matrix. The results show the applicability of the proposed approach, tracking the moving robot with an overall error of 0.25 mm, and the effectiveness of the CRCHT technique, which saves up to 60% of the overall image-processing time.

Open Access Article
Multi-User Identification-Based Eye-Tracking Algorithm Using Position Estimation
Sensors 2017, 17(1), 41; https://doi.org/10.3390/s17010041
Received: 13 December 2016 / Revised: 22 December 2016 / Accepted: 23 December 2016 / Published: 27 December 2016
Cited by 2
Abstract
This paper proposes a new multi-user eye-tracking algorithm using position estimation. Conventional eye-tracking algorithms are typically suitable only for a single user and therefore cannot be used in a multi-user system. Even when they can track the eyes of multiple users, their detection accuracy is low and they cannot identify the users individually. The proposed algorithm solves these problems and enhances detection accuracy. Specifically, it adopts a classifier to detect faces in the red, green, and blue (RGB) and depth images. It then calculates histogram-of-oriented-gradient features for each detected facial region to identify the users, and selects the template that best matches each user from a pre-determined face database. Finally, the algorithm extracts the final eye positions based on anatomical proportions. Simulation results show that the proposed algorithm improved the average F1 score by up to 0.490 compared with benchmark algorithms.
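The last step — placing the eyes from anatomical proportions once a face is matched — amounts to fixed ratios of the detected face box. A toy version with commonly quoted ratios; the exact proportions the paper uses are not given here, so these numbers are illustrative:

```python
def eye_positions(face_box):
    # Anatomical-proportion heuristic: eyes sit roughly 40% down the face box,
    # at about 30% and 70% of its width (illustrative ratios, not the paper's).
    x, y, w, h = face_box
    left = (x + 0.30 * w, y + 0.40 * h)
    right = (x + 0.70 * w, y + 0.40 * h)
    return left, right

# Face detected at (100, 50), 80 px wide, 100 px tall.
print(eye_positions((100, 50, 80, 100)))
```

Because the positions come from the face box rather than per-eye detection, the step is cheap and works even when the eyes themselves are hard to segment in depth data.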

Open Access Article
A Novel Probabilistic Data Association for Target Tracking in a Cluttered Environment
Sensors 2016, 16(12), 2180; https://doi.org/10.3390/s16122180
Received: 20 October 2016 / Revised: 10 December 2016 / Accepted: 14 December 2016 / Published: 18 December 2016
Cited by 9
Abstract
The problem of data association for target tracking in a cluttered environment is discussed. To improve the real-time processing and accuracy of target tracking, a novel data association algorithm using distance weighting, built on probabilistic data association, is proposed; it enhances the association probability of measurements originating from the target, and a Kalman filter then estimates the target state more accurately. The tracking performance for non-maneuvering targets in a densely cluttered environment is thus improved, including cases where two targets move in parallel or cross at a small angle. For maneuvering targets, we combine the improved probabilistic data association method with an interacting multiple model framework, yielding a combined interacting multiple model probabilistic data association algorithm for tracking a maneuvering target in a densely cluttered environment. Monte Carlo simulations show that the proposed algorithm is more effective and reliable across different target tracking scenarios in densely cluttered environments.
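The distance-weighted association idea can be sketched in one dimension: weight each gated measurement by a Gaussian likelihood of its distance from the prediction, reserve a small weight for the "all clutter" hypothesis, and fuse before the Kalman update. All constants (σ, detection and clutter parameters) are invented for illustration, and the filter itself is omitted:

```python
import math

def pda_update(pred, measurements, sigma=1.0, p_detect=0.9, clutter=0.1):
    # Score each measurement with a Gaussian likelihood of its innovation,
    # normalise into association probabilities beta_i (with a leftover weight
    # for the miss/clutter hypothesis), and fuse with those weights.
    liks = [p_detect * math.exp(-((z - pred) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)) for z in measurements]
    b0 = clutter * (1 - p_detect)          # "no measurement is the target"
    norm = b0 + sum(liks)
    betas = [l / norm for l in liks]
    fused = sum(b * z for b, z in zip(betas, measurements)) + (b0 / norm) * pred
    return fused, betas

pred = 10.0                  # predicted target position (1-D for brevity)
meas = [9.8, 10.3, 14.0]     # two plausible returns plus one clutter point
fused, betas = pda_update(pred, meas)
print(round(fused, 2), [round(b, 3) for b in betas])
```

In the full PDA filter these β weights also inflate the updated covariance to account for association uncertainty; the interacting multiple model layer then runs one such filter per motion model.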

Open Access Article
Robust and Accurate Vision-Based Pose Estimation Algorithm Based on Four Coplanar Feature Points
Sensors 2016, 16(12), 2173; https://doi.org/10.3390/s16122173
Received: 26 October 2016 / Revised: 11 December 2016 / Accepted: 14 December 2016 / Published: 17 December 2016
Cited by 3
Abstract
Vision-based pose estimation is an important application of machine vision. Currently, analytical and iterative methods are used to solve for the object pose. Analytical solutions generally take less computation time but are extremely susceptible to noise. Iterative solutions minimize the distance error between feature points based on 2D image pixel coordinates, but the non-linear optimization needs a good initial estimate of the true solution; otherwise it is more time consuming than the analytical approach. Moreover, image processing error grows rapidly as the measurement range increases, which leads to pose estimation errors. All of these factors cause accuracy to decrease. To solve this problem, a novel pose estimation method based on four coplanar points is proposed. First, the coordinates of the feature points are determined according to the linear constraints formed by the four points. The initial coordinates acquired through this linear method are then refined through an iterative method. Finally, a coordinate system of object motion is established and a method is introduced to solve for the object pose; through this coordinate system, the pose estimation errors caused by growing image processing error can be decreased. The proposed method is compared with two existing methods through experiments. Experimental results demonstrate that the proposed method works efficiently and stably.

Open Access Article
Adaptive Local Spatiotemporal Features from RGB-D Data for One-Shot Learning Gesture Recognition
Sensors 2016, 16(12), 2171; https://doi.org/10.3390/s16122171
Received: 16 October 2016 / Revised: 1 December 2016 / Accepted: 7 December 2016 / Published: 17 December 2016
Abstract
Noise and constant empirical motion constraints affect the extraction of distinctive spatiotemporal features from one or a few samples per gesture class. To tackle these problems, an adaptive local spatiotemporal feature (ALSTF) using fused RGB-D data is proposed. First, motion regions of interest (MRoIs) are adaptively extracted using grayscale and depth velocity variance information to greatly reduce the impact of noise. Then, corners are used as keypoints if their depth and their grayscale and depth velocities meet several adaptive local constraints in each MRoI. With further filtering of noise, an accurate and sufficient number of keypoints is obtained within the desired moving body parts (MBPs). Finally, four kinds of multiple descriptors are calculated and combined in extended gradient and motion spaces to represent the appearance and motion features of gestures. Experimental results on the ChaLearn gesture, CAD-60 and MSRDailyActivity3D datasets demonstrate that the proposed feature achieves higher performance than published state-of-the-art approaches under the one-shot learning setting, and comparable accuracy under leave-one-out cross validation.

Open Access Article
Detecting Target Objects by Natural Language Instructions Using an RGB-D Camera
Sensors 2016, 16(12), 2117; https://doi.org/10.3390/s16122117
Received: 20 September 2016 / Revised: 24 November 2016 / Accepted: 7 December 2016 / Published: 13 December 2016
Cited by 1
Abstract
Controlling robots by natural language (NL) is increasingly attracting attention for its versatility, its convenience, and its freedom from extensive user training. Grounding is a crucial challenge in this problem: enabling robots to understand NL instructions from humans. This paper mainly explores the object grounding problem and concretely studies how to detect target objects from NL instructions using an RGB-D camera in robotic manipulation applications. In particular, a simple yet robust vision algorithm is applied to segment objects of interest. With the metric information of all segmented objects, object attributes and relations between objects are extracted. The NL instructions, which incorporate multiple cues for object specification, are parsed into domain-specific annotations. The annotations from NL and the information extracted from the RGB-D camera are matched in a computational state estimation framework to search all possible object grounding states. The final grounding is accomplished by selecting the states with the maximum probabilities. An RGB-D scene dataset associated with different groups of NL instructions, based on different cognition levels of the robot, is collected. Quantitative evaluations on the dataset illustrate the advantages of the proposed method. Experiments on NL-controlled object manipulation and NL-based task programming using a mobile manipulator show its effectiveness and practicability in robotic applications.
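The matching step — scoring possible grounding states against the parsed annotations and keeping the maximum-probability one — reduces, in its simplest form, to an argmax over attribute agreement. A toy sketch with invented attribute names and scene objects; the paper's state estimation framework also handles relations between objects and uncertainty, both omitted here:

```python
def ground(annotation, objects):
    # Score every candidate object against the parsed NL annotation and
    # return the most probable grounding (argmax over matched attributes).
    def score(obj):
        hits = sum(1 for k, v in annotation.items() if obj.get(k) == v)
        return hits / len(annotation)
    return max(objects, key=score)

# "the red cup on the left" -> a domain-specific annotation
annotation = {"color": "red", "category": "cup", "position": "left"}
scene = [
    {"id": 1, "color": "red", "category": "cup", "position": "left"},
    {"id": 2, "color": "blue", "category": "cup", "position": "right"},
    {"id": 3, "color": "red", "category": "bowl", "position": "left"},
]
print(ground(annotation, scene)["id"])
```

The attributes here would come from the RGB-D segmentation (colour, size, metric position) and the annotation from the NL parser; a probabilistic version replaces the hit count with per-cue likelihoods.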

Sensors EISSN 1424-8220, published by MDPI AG, Basel, Switzerland