Open Access Article

Multi-View Visual Question Answering with Active Viewpoint Selection

1 Graduate School of Science and Technology, University of Tsukuba, Tsukuba 305-8577, Japan
2 National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan
* Author to whom correspondence should be addressed.
Sensors 2020, 20(8), 2281; https://doi.org/10.3390/s20082281
Received: 31 March 2020 / Revised: 13 April 2020 / Accepted: 14 April 2020 / Published: 17 April 2020
This paper proposes a framework that observes a scene iteratively in order to answer a given question about it. Conventional visual question answering (VQA) methods are designed to answer questions from single-view images. However, in real-world applications such as human–robot interaction (HRI), where camera angles and occlusions must be considered, answering questions from a single view can be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to formulate the VQA task in a multi-view setting. In addition, because observing a scene from arbitrary viewpoints is usually impractical, we designed the framework to observe the scene actively until it has gathered the information needed to answer the given question. The proposed framework achieves question-answering performance comparable to a state-of-the-art method while reducing the number of observation viewpoints required by a significant margin. Additionally, we found that the framework appears to learn to choose better viewpoints for answering questions, lowering the number of camera movements required. Moreover, we built a multi-view VQA dataset based on real images; on this unseen real-image dataset, the proposed framework achieves high accuracy (94.01%).
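The abstract describes an observe-decide loop: encode each newly observed view, fuse the accumulated features, and let a learned policy either answer the question or select the next camera viewpoint. The following is a minimal, hypothetical sketch of such a loop; the names observe, encode, decide, and PolicyAction, as well as the mean-pooling fusion step, are illustrative assumptions and not the authors' actual architecture (the paper trains the viewpoint-selection policy with reinforcement learning).

from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class PolicyAction:
    # Exactly one field is set: an answer string, or the index of the next view.
    answer: Optional[str] = None
    next_viewpoint: Optional[int] = None

def answer_with_active_viewpoints(observe, encode, decide, question_emb,
                                  start_view=0, max_views=8):
    """Iteratively observe a scene from actively chosen viewpoints until the
    policy decides it has enough information to answer the question.

    observe(view) -> image tensor; encode(image) -> feature vector;
    decide(fused_features, question_emb) -> PolicyAction.
    All three callables are hypothetical stand-ins for the paper's modules.
    """
    features, view = [], start_view
    for _ in range(max_views):
        features.append(encode(observe(view)))        # gather one more view
        fused = torch.stack(features).mean(dim=0)     # simple mean fusion
        action = decide(fused, question_emb)
        if action.answer is not None:                 # confident: stop early
            return action.answer, len(features)
        view = action.next_viewpoint                  # otherwise move the camera
    # Viewpoint budget exhausted: answer with everything gathered so far
    # (may still be None if the policy declines to answer).
    return decide(fused, question_emb).answer, len(features)

In this sketch, the early return is what reduces the number of observation viewpoints: the loop stops as soon as the policy emits an answer rather than always exhausting the camera budget.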
Keywords: visual question answering; three-dimensional (3D) vision; reinforcement learning; deep learning; human–robot interaction
MDPI and ACS Style

Qiu, Y.; Satoh, Y.; Suzuki, R.; Iwata, K.; Kataoka, H. Multi-View Visual Question Answering with Active Viewpoint Selection. Sensors 2020, 20, 2281.
