1. Introduction
Recent developments in deep neural networks have resulted in significant technological advancements and have broadened the applicability of human–robot interaction (HRI). Vision-and-language tasks, such as visual question answering (VQA) [1,2,3,4,5,6] and visual dialog [7,8], can be extremely useful in HRI applications. In a VQA system, the input is an image along with a text question that the system needs to answer by recognizing the image, interpreting the natural language in the question, and determining the relationships between them. VQA tasks play an essential role in various real-world applications. For example, in HRI applications, VQA can be used to connect robot perceptions with human operators through a question-answering process. In video surveillance systems, the question–answer process can serve as an interface to help avoid the manual checking of each video frame, significantly reducing labor costs.
Although VQA methods can be useful in real-world applications, there are numerous problems related to their implementation in real-world environments. For example, conventional VQA methods answer a given question based on a single image. However, in real-world environments, because it is challenging to take photographs continuously from optimal viewpoints, objects can be heavily occluded, and thus answering questions based on single-view images can be difficult. Considering that multi-view observations are possible in HRI applications, this study discusses VQA under multi-view settings. Qiu et al. [9] proposed a multi-view VQA framework that uses perimeter viewpoint observation for answering questions. However, perimeter viewpoints may be difficult to set up in real-world environments due to environmental constraints, making their method difficult to implement. In addition, observing each scene from perimeter viewpoints is relatively inefficient, especially for applications that require real-time processing. Moreover, the authors did not evaluate their method under real-world image settings.
Here, we propose a framework, shown in Figure 1, in which a scene is actively observed and questions are answered based on the previous observations of the scene. The overall framework consists of three modules, namely a scene representation network (SRN) that integrates multi-view images into a compact scene representation, a viewpoint selection network that selects the next observation viewpoint (or ends the observation) based on the input question and the previously observed scene, and a VQA network that predicts the answer based on the observed scene and question. We built a computer graphics (CG) multi-view VQA dataset with 12 viewpoints. On this dataset, the proposed framework achieved accuracy comparable to that of a state-of-the-art method [9] that answers questions based on the perimeter observation of scenes from fixed directions (97.11% (ours) vs. 97.37%), while using an average of just 2.98 viewpoints, in contrast to the 12 viewpoints required by the previous method. In addition, we found that our framework learned to choose viewpoints efficiently for answering questions: it ends the observation once the observed scene contains all objects needed to answer the question, or additionally observes the scene from viewpoints with large spacing to increase the accessible scene information, thereby lowering the camera movement cost. Furthermore, to evaluate the effectiveness of the proposed method in realistic settings, we also created a real-image dataset. In experiments conducted with this dataset, the proposed framework outperformed the existing method by a significant margin (+11.39%).
The contributions of our work are three-fold:
(1) We discuss the VQA task under a multi-view and interactive setting, which is more representative of a real-world environment than traditional single-view VQA settings, and we build a dataset for this purpose.
(2) We propose a framework that actively observes scenes and answers questions based on previously observed scene information. Our framework achieves high question-answering accuracy with high efficiency in terms of the number of observation viewpoints, and it can be applied in applications with restricted observation viewpoints.
(3) We conduct experiments on a multi-view VQA dataset that consists of real images, which can be used to evaluate the generalization ability of VQA methods. The proposed framework shows high performance on this dataset, indicating its suitability for realistic settings.
3. Approach
In real-world HRI applications, it can be challenging to obtain photographs from perimeter viewpoints. In addition, it is more efficient to end the observation once the gathered scene information is sufficient to answer the question. Based on the above, we propose a framework that actively observes the environment and decides when to answer the question based on previously observed scene information.
As shown in Figure 2, the proposed framework consists of three modules, namely an SRN, a viewpoint selection network, and a VQA network. The inputs of the overall framework are the default viewpoint $v^0$, the image $x^0$ of the scene observed from $v^0$, and the question $q$. $x^0$ and $v^0$ are first processed by the SRN to obtain the original scene representation $r^0$. The question is processed by an embedding layer and a two-layered long short-term memory (LSTM) [27] network to obtain the question embedding $e_q$. Then, the viewpoint selection network predicts the next observation viewpoint (or selects the end action) based on the current scene representation $r^t$ and $e_q$. If viewpoint $v^t$ is chosen at time step $t$, the agent obtains the image $x^t$ of the scene from viewpoint $v^t$ (e.g., for a robot, a scene image from $v^t$ is taken). Next, the SRN updates the scene representation based on $x^t$ and $v^t$. If the end action is chosen, the VQA network predicts an answer based on $e_q$ and the integrated scene representation at that time. In the following sections, we discuss these three networks in greater detail.
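To make the control flow concrete, the following is a minimal Python sketch of the observation loop. The module interfaces (srn, viewpoint_selector, vqa_net, scene.observe) are hypothetical names introduced for illustration only, not the actual implementation.

```python
# Minimal sketch of the active observation loop (hypothetical interfaces).
# srn, viewpoint_selector, and vqa_net stand in for the three modules above.

def answer_question(scene, question, srn, viewpoint_selector, vqa_net,
                    default_viewpoint, max_steps=12):
    """Actively observe `scene` until the end action is chosen, then answer."""
    e_q = vqa_net.embed_question(question)          # embedding layer + 2-layer LSTM
    x = scene.observe(default_viewpoint)            # image from the default viewpoint
    r = srn.encode(x, default_viewpoint)            # initial scene representation r^0

    for _ in range(max_steps):
        action = viewpoint_selector.select(r, e_q)  # next viewpoint or END
        if action == "END":
            break
        x = scene.observe(action)                   # image from the chosen viewpoint
        r = srn.update(r, x, action)                # integrate the new observation

    return vqa_net.predict(r, e_q)                  # answer from scene + question
```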
3.1. Scene Representation
We use the SRN proposed by Eslami et al. [17] to obtain integrated scene representations $r$ from viewpoints $v$ and images $x$. For scene $i$ observed from $K$ viewpoints, the observation $O_i$ is defined as follows:

$O_i = \{(x_i^k, v_i^k) \mid k = 1, \ldots, K\}.$  (1)

The per-view observations are encoded and aggregated into the scene representation $r_i$, which integrates multi-view information into a compact form. The SRN is jointly trained with a generation network that renders the scene image from an arbitrary viewpoint $v^m$, maximizing the likelihood between the predicted and the ground truth images. We use the above framework to train the SRN.
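For illustration, the order-invariant aggregation of per-view encodings can be sketched in PyTorch as below. The encoder layers are placeholder choices, and the generation network used for training is omitted; this is a simplified sketch, not Eslami et al.'s exact architecture.

```python
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    """Simplified SRN-style encoder: sums per-view encodings into one representation."""

    def __init__(self, view_dim=7, repr_dim=256):
        super().__init__()
        # Toy image encoder; the actual SRN uses a deeper convolutional network.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc = nn.Linear(64 + view_dim, repr_dim)

    def forward(self, images, viewpoints):
        # images: (K, 3, H, W); viewpoints: (K, view_dim)
        feats = self.cnn(images)                                    # (K, 64)
        per_view = self.fc(torch.cat([feats, viewpoints], dim=1))   # (K, repr_dim)
        return per_view.sum(dim=0)          # sum over views -> compact representation
```

Because the per-view encodings are summed, the representation is invariant to the order in which viewpoints are observed, which suits the incremental updates used in our framework.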
3.2. Viewpoint Selection
For VQA in real-world environments, it is necessary to choose observations based on the question and previously observed visual information. For example, for the question “Is there a red ball?”, if a red ball has previously been observed, the question can be answered instantly. Additionally, for highly occluded scenes, it may be necessary to make observations from a variety of viewpoints. Therefore, we propose a deep Q-network (DQN)-based viewpoint selection network to actively choose actions.
More specifically, assuming that the scene can be observed from $K$ viewpoints, we define an action set $A = \{a_1, \ldots, a_K, a_{K+1}\}$, where $a_1, \ldots, a_K$ denote the viewpoint selection actions and $a_{K+1}$ represents the end observation action.
The viewpoint selection network $f_\phi$ predicts a $(K+1)$-dimensional vector that represents the obtainable reward value of each action (after it is executed) from the input of the previously observed scene representation $r^t$ and the question embedding $e_q$. Assuming that the $j$-th action $a_j$ is chosen at time step $t$, we denote the predicted reward value $Q(s_t, a_j)$ of action $a_j$ under the environment state $s_t = (r^t, e_q)$ as follows:

$Q(s_t, a_j) = f_\phi(r^t, e_q)_j.$  (2)

The real reward value $y_t$ of action $a_j$ can be formulated as Equation (3), where $R_t$ denotes the reward obtained from the environment, and $\gamma$ is the discount factor of future rewards:

$y_t = R_t + \gamma \max_{a'} Q(s_{t+1}, a').$  (3)

The overall objective of viewpoint selection is to minimize the distance between the predicted value $Q(s_t, a_j)$ and the target value $y_t$.
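A minimal PyTorch sketch of this objective is shown below, assuming a hypothetical q_net that maps $(r^t, e_q)$ to the $(K+1)$-dimensional action-value vector and a replay batch with the listed fields; the Huber loss is used here as one illustrative choice of distance, and a separate target network (common in DQN training) is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.9):
    """Temporal-difference loss between predicted Q(s_t, a_j) and the target y_t."""
    # batch fields (hypothetical): r / next_r (scene representations), e_q (question
    # embedding), action (long), reward (float), done (1.0 if the end action was taken).
    q_all = q_net(batch.r, batch.e_q)                               # (B, K+1)
    q_taken = q_all.gather(1, batch.action.unsqueeze(1)).squeeze(1)

    with torch.no_grad():                                           # target is not backpropagated
        next_max = q_net(batch.next_r, batch.e_q).max(dim=1).values
        y = batch.reward + gamma * next_max * (1.0 - batch.done)    # Equation (3)

    return F.smooth_l1_loss(q_taken, y)                             # distance between Q and y
```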
We designed the reward $R_t$ based on the correctness of the question answering and the number of selected viewpoints. For each newly added observation viewpoint and for each viewpoint that is chosen repeatedly, we assign the penalties $p_{\mathrm{new}}$ and $p_{\mathrm{rep}}$ (hyperparameters), respectively. We denote the VQA loss as $L_{\mathrm{VQA}}$ (normalized to $[0, 1]$). We designed $R_t$ for three action types, as shown below. Letting $p \in \{p_{\mathrm{new}}, p_{\mathrm{rep}}\}$ be the penalty corresponding to the chosen viewpoint, the reward for the final viewpoint selection action is $-p - L_{\mathrm{VQA}}$; for the other viewpoint selection actions, the reward is $-p$; and for the end action, the reward is $-L_{\mathrm{VQA}}$.
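The following is a minimal sketch of this reward assignment, assuming the penalties and the normalized VQA loss enter as negative terms as described above; the penalty values are placeholders, not the paper's tuned hyperparameters.

```python
def reward(action_type, is_new_viewpoint, vqa_loss=None,
           p_new=0.1, p_rep=0.2):
    """Reward R_t for the three action types (penalty values are placeholders)."""
    p = p_new if is_new_viewpoint else p_rep         # penalty for the chosen viewpoint
    if action_type == "view":                        # ordinary viewpoint selection
        return -p
    if action_type == "final_view":                  # last viewpoint before the end action
        return -p - vqa_loss                         # also credited with the VQA loss
    if action_type == "end":                         # end observation, answer the question
        return -vqa_loss                             # vqa_loss is normalized to [0, 1]
    raise ValueError(f"unknown action type: {action_type}")
```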
3.3. Visual Question Answering
VQA predicts an answer based on the integrated scene information $s$ and the processed question $e_q$. We denote the VQA network to be trained as $f_{\mathrm{VQA}}$. The answer $\hat{a}$ can be predicted by the following:

$\hat{a} = \arg\max f_{\mathrm{VQA}}(s, e_q).$  (4)

The network is optimized by minimizing the cross-entropy loss between the predicted $\hat{a}$ and the ground truth answer. In this study, we used the state-of-the-art FiLM method [10] as the VQA network. However, it is noteworthy that any VQA network can be used in its place.
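As a sketch, the prediction and training objective can be written as follows, with f_vqa standing in for the FiLM-based network (a hypothetical interface that returns one logit per candidate answer):

```python
import torch
import torch.nn.functional as F

def predict_answer(f_vqa, s, e_q):
    """Predict the answer index from scene representation s and question embedding e_q."""
    logits = f_vqa(s, e_q)                  # one logit per candidate answer
    return logits.argmax(dim=-1)            # a_hat = argmax f_vqa(s, e_q), Equation (4)

def vqa_loss(f_vqa, s, e_q, answer):
    """Cross-entropy between the predicted answer distribution and the ground truth."""
    return F.cross_entropy(f_vqa(s, e_q), answer)
```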
The proposed framework cannot deal with ill-structured or non-English questions that are not included in the training dataset. However, it could be extended to handle these situations by integrating additional sentence-structure checking or translation modules.
5. Conclusions
In this study, we proposed a multi-view VQA framework that actively chooses observation viewpoints to answer questions. VQA, the task of answering a text question based on an image, is essential in various HRI systems. Existing VQA methods answer questions based on single-view images. However, in real-world applications, single-view image information can be insufficient for answering questions, and observation viewpoints are usually limited. Moreover, the ability to observe scenes efficiently from optimal viewpoints is crucial for real-world applications with time restrictions. To resolve these issues, we proposed a framework that makes iterative observations under a multi-view VQA setting. The proposed framework iteratively selects additional observation viewpoints and updates the scene information until it is sufficient to answer the question. The framework achieves performance comparable to that of a state-of-the-art method on a VQA dataset with CG images while greatly reducing the number of observation viewpoints (from 12 to 2.98). In addition, the proposed method outperforms the existing method by a significant margin (+11.39% in overall accuracy) on a dataset with real images, which is closer to the real-world setting, making our method more promising for real-world applications. However, directly applying a model trained on CG images to real images results in a performance gap (a drop of 30.82% in accuracy). In the future, we will consider various methods to narrow this gap.