Article

The Core of Smart Cities: Knowledge Representation and Descriptive Framework Construction in Knowledge-Based Visual Question Answering

1 School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
2 YGSOFT Inc., Zhuhai 519085, China
* Author to whom correspondence should be addressed.
Sustainability 2022, 14(20), 13236; https://doi.org/10.3390/su142013236
Submission received: 25 September 2022 / Revised: 10 October 2022 / Accepted: 12 October 2022 / Published: 14 October 2022

Abstract

Visual question answering (VQA), as an important instance of an AI-complete task and a form of visual Turing test, together with its potential application value, has attracted widespread attention from researchers in both computer vision and natural language processing. However, there is little research on how knowledge is expressed and participates in VQA. Considering the importance of knowledge for answering questions correctly, this paper analyzes the stratification, expression, and participation of knowledge in VQA and proposes a knowledge description framework (KDF) to guide research on knowledge-based VQA (Kb-VQA). The KDF consists of a basic theory, implementation methods, and specific applications. This paper focuses on the mathematical models at the basic theoretical level, as well as the knowledge hierarchy theory and key implementation behaviors established on this basis. In our experiments, statistics on VQA accuracy drawn from the relevant literature corroborate the results of this paper on knowledge stratification, participation methods, and expression forms.

1. Introduction

As the core of sustainable cities, smart cities are responsible for information integration, distribution, scheduling, and assisting city managers in decision-making [1,2]. Exploring and studying the key technologies and knowledge systems involved in the process of building a smart city is of great significance for sustainable urban life [3,4].
The completeness of artificial intelligence (AI) can be regarded as an important indicator of the Turing test, which tests and verifies the anthropomorphic intelligence of an algorithm model. Antol et al. [5] proposed a free-form and open-ended VQA task, which requires high-level multi-modal fusion and a well-defined quantitative evaluation metric. That VQA is AI-complete and can serve as a visual Turing test has become a consensus in this research field [6,7,8].
Building a city brain based on VQA technology is a strong choice for realizing a smart city. In this process, the VQA system must possess a large amount of knowledge, and this is the main research topic of this paper. For the problem of low efficiency in knowledge sharing, a knowledge grid is a good solution: via mining and citation methods, it can effectively acquire, represent, and exchange massive data and information and integrate them into a communication infrastructure of useful knowledge [9].
VQA is the intersection and fusion of image understanding, natural language processing (NLP), and knowledge reasoning, which is extremely challenging [10]. In order to obtain the correct answer, knowledge reasoning and understanding are essential (the acts of knowledge reasoning and understanding in this paper are collectively referred to as knowledge metabolism). Obviously, knowledge plays a very important role in VQA, and the intelligence of VQA is also positively correlated with its level of knowledge. In order to better understand the role of knowledge in VQA, we will review VQA-related research at different stages according to the degree of knowledge participation.
Agrawal et al. [11] studied the priors underlying reasoning results and proposed solutions; Wu et al. [12] believed that the results of multi-modal fusion at different scales would affect the accuracy of the answer, so they proposed multi-scale relation reasoning. Ma et al. [13] constructed a joint-embedding model based on dynamic word vectors and a non-KB-specific network, which differs from the commonly used visual question-answering models based on static word vectors. These studies are the cornerstone of the rapid development of VQA, but they seldom involve knowledge metabolism. Multi-modal feature fusion and local image feature understanding represented by the attention mechanism are milestone works that promote the development of VQA in understanding explicit image content, but this understanding is superficial and requires little intellectual involvement [14,15,16]. The most representative examples are image-oriented recognition and counting problems. Relational reasoning after cross-modal fusion provides the VQA system with the ability to think and understand, thereby further improving the knowledge level of VQA [17,18,19]: for example, reasoning about the relationships between characters in a scene or about geographic locations.
With the development of theories and the maturity of technology, VQA faces more open problems and challenges from reality, which inevitably involve external common sense or professional knowledge; thus, external knowledge needs to be introduced. Zhu et al. [20] constructed a large-scale multimodal knowledge base to interface with VQA, which can answer various visual queries without retraining. In addition, the knowledge base can also be used for image reasoning, which is also the research area of Kb-VQA in this paper. Wu et al. [21] converted image content into semantic content and then combined it with information extracted from external knowledge bases in order to answer more open questions; thanks to the external knowledge bases, this method can answer more complex questions than before. Zhu et al. [22] split a question that requires logical reasoning into multiple questions that can be answered directly and then obtained the final answer by querying the answer to each sub-question from an external knowledge source. Compared with Wu et al.'s method, this method splits the reasoning process, so it becomes easier to seek answers from external knowledge bases for each sub-question; thus, the accuracy is higher. Su et al. [23] proposed a visual knowledge memory network (VKMN), which seamlessly integrates structured human knowledge and deep visual features. Compared with methods that directly leverage external knowledge to support visual question answering, this method focuses on two mechanisms. The first is the integration of visual content and knowledge facts: VKMN addresses this by jointly embedding knowledge triples (topic, relation, and object) and deep visual features into visual knowledge features. The second is the handling of multiple knowledge facts extended from question answering: VKMN stores the joint embeddings in key-value pair structures in memory networks so that it can easily handle multiple facts. Yu et al. [24] used multiple knowledge graphs to describe multi-modal knowledge sources from the perspectives of vision, semantics, and facts and finally unified them, which is conducive to relational reasoning. This method is similar to that of Zhu et al. [22] in that multiple inferences are used to obtain the answer. The difference is that Zhu et al. split a complex question into multiple simple questions and then found the answer for each question, while Yu et al. perform parallel reasoning on visual and semantic information; thus, their method is more efficient. Zhang et al. [25] proposed an extended network to adaptively balance the weight of visual features and external knowledge. The previous methods basically start from the perspective of the question and seek to combine the question with external knowledge, while this method converts the image content into a knowledge graph from the perspective of the image and then uses the attention mechanism to balance the focus of the question. Zheng et al. [26] built a knowledge base graph-embedding module to extend the generality of knowledge-based visual question-answering models; this module can improve the accuracy of the model, especially when facing problems that require public or external knowledge.
Liu et al. [27] proposed a fact-based visual question-answering (FVQA) model, which answers questions based on observed images and external knowledge. The core of their research is to enable agents to understand questions and images and then reason over the knowledge base to find the correct answer. The method starts from the dual-process theory in cognitive science and includes a perception module and an explicit reasoning module. Given a question and an image, the perception module learns their joint representation, while the explicit reasoning module predicts the answer by reasoning over fact and semantic graphs. Uehara et al. [28] proposed generating informative sub-questions about relevant information in the image to derive answers to questions whose answers are not directly present in the image. This is also a study of using sub-questions to derive the answer to the original question; the difference is that a question-generation model and an info-score model are introduced, where the info-score model estimates the amount of information contained in the generated questions so that this amount can be controlled. The above papers carefully study external knowledge in VQA, each with a different focus. These studies were carried out from specific problems, and their analysis helps us propose the knowledge pyramid and knowledge description framework in this paper. However, none of these papers summarize the characteristics of Kb-VQA from a macro perspective, which is the value of this paper. Obviously, VQA with external knowledge has been further improved in terms of the complexity and richness of knowledge, and it can begin to deal with open problems.
The survey results reflect that the degree of knowledge participation differs across VQA models. Current research on Kb-VQA (including knowledge metabolism) mainly focuses on how to combine knowledge with traditional models to answer more complex and open questions, but there is less research on knowledge expression and participation methods. For this reason, current Kb-VQA research is not sufficiently in-depth. In response to this gap, this paper puts forward a knowledge hierarchy and a knowledge pyramid, defines the connotation and extension of Kb-VQA, and answers questions about knowledge expression forms. Furthermore, a theoretical model of Kb-VQA is proposed to explain the manner in which knowledge participates. Moreover, on the basis of the Kb-VQA theoretical model, the underlying logic of the knowledge description framework is constructed by combining the knowledge hierarchy theory and the knowledge pyramid. Based on this logic, users of the knowledge description framework can determine the specific implementation method (such as using a CNN, an RNN, or an attention mechanism); finally, combined with the application scenario, the implementation plan is determined.
The innovations of this paper are the following: (1) it unifies and standardizes the expression and mathematical model of Kb-VQA; (2) by analyzing the relevant research results of existing Kb-VQA, the phenomenon of knowledge stratification is identified and the knowledge pyramid is summarized; (3) using these research results, it points out the significance and importance of Kb-VQA for building smart cities. The application areas of this paper include the following: (1) providing a set of standardized technical routes for follow-up research on Kb-VQA; (2) providing a feasible solution for constructing smart brains in smart cities, that is, using Kb-VQA as the basis for building smart brains; (3) positioning Kb-VQA as a possible alternative path toward general artificial intelligence.
The contributions of this paper are mainly reflected in the following aspects:
A knowledge description framework is proposed to express the knowledge composition, morphology and mathematical model in visual question answering;
The concept of knowledge pyramid is proposed to describe the knowledge content and participation forms in different visual question answering;
The basic mathematical model of knowledge-based visual question answering is proposed;
The rationality of the research content of this paper is explained by means of example elaboration and experimental verification.
This paper is organized as follows. Section 2 introduces the composition of the knowledge pyramid and explains it with examples. Section 3 introduces the Kb-VQA mathematical model. In Section 4, we construct a knowledge description framework and illustrate its key components. The experiment and analysis are provided in Section 5, and the conclusion is presented in Section 6.

2. Knowledge Pyramid

In order to answer the question of the degree of knowledge participation, this paper proposes the Knowledge Hierarchy Theory (KHT) based on a systematic study of Kb-VQA. As noted in the introduction, different VQA models have different levels of knowledge participation. Some VQA models can only answer simple questions, while others can answer many questions that even humans may not know. This shows that the knowledge in VQA is hierarchical.
KHT can be preliminarily expressed as follows: in different Kb-VQA models, knowledge exists in different forms and at different levels, which are arranged from lower to upper like a pyramid. High-level Kb-VQA can answer common-sense or professional questions that are complex, require more reasoning, and may even require external knowledge for completion.
Obviously, such a VQA model is "smarter". The KHT proposed in this paper contains five levels, presented in the form of a knowledge pyramid, as shown in Figure 1.
The arrow on the right side of the pyramid reflects the gradual increase in knowledge complexity, level, and difficulty of acquisition as the level increases. The horizontal width of each layer of the pyramid reflects the number and scale of related research qualitatively. Obviously, most current research studies on VQA do not involve knowledge or involve a small amount of image knowledge. The second level of the pyramid mainly solves counting and recognition problems; the third level obtains knowledge using simple explicit reasoning with respect to the image’s content, which is mainly internal knowledge; the fourth level introduces an external knowledge base to solve common sense problems and professional problems; the top level is mainly unknown knowledge, which is used to solve prediction problems.
Specifically, VQA without knowledge metabolism can be understood as "what you see is what you get". Image knowledge requires finding answers to questions directly from images. Internal reasoning knowledge is more complex: it requires algebraic or logical operations based on the connections between pieces of image knowledge. Unknown knowledge involves prediction and prospect. The specific performance of the five levels is shown in Table 1.
The existing forms of knowledge in Table 1 correspond to the forms of knowledge in the knowledge pyramid. From the examples in Table 1, the following observations can be made: (1) the visual image alone may not determine the form and level of knowledge (see the three groups of examples in Table 1); (2) answers are not absolutely related to knowledge forms (Number 2 and Number 3 in Table 1); (3) the question determines the form of knowledge; (4) for unknown knowledge forms, current VQA technology may not be able to obtain the correct answer (e.g., Number 5, Number 10, and Number 15). Based on these observations, the following concepts can be clarified.
The knowledge in VQA is the result of an information flow that starts from the question, reacts with the relevant image, and is presented in the form of an answer. The connotation of Kb-VQA is the process of generating answers after the knowledge metabolism of images and questions, and its extension includes the basic level without knowledge metabolism and the prediction level with unknown knowledge. For specific manifestations of different levels of knowledge in Kb-VQA, please refer to examples (a) and (b) in Figure 2 (the material in Figure 2a,b comes from Figure A1k,o).
Analyzing the questions and answers in Figure 2 reveals that higher-level questions often build on lower-level questions and answers. For example, in Figure 2a, the second question involves dogs, and the dogs are derived from the answer to the first question. Sometimes the first question is seemingly not asked, but the first question and its answer are actually implicit in the process of asking the second question. This situation occurs in both (a) and (b) of Figure 2.
Obviously, although knowledge has levels, these levels are not independent of each other: the existence of high-level knowledge is often based on low-level knowledge. Currently, obtaining knowledge through internal reasoning (the third level in Figure 1) is a research focus of Kb-VQA. On this basis, further adding external knowledge to answer broader and more open questions is a future development direction (the fourth level in Figure 1). The ultimate goal of Kb-VQA is to explore predictive VQA based on unknown knowledge (the fifth level in Figure 1). Finally, this section summarizes the current research status of the existing forms of knowledge in Kb-VQA and displays them in Table 2.
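To make the hierarchy concrete, the following minimal Python sketch encodes the five levels of the knowledge pyramid as an enumeration; the example questions mirror the first group in Table 1, while the lookup helper and its mapping are hypothetical illustrations rather than part of any published Kb-VQA system.

```python
from enum import IntEnum

class KnowledgeLevel(IntEnum):
    """Five levels of the knowledge pyramid (Section 2, Figure 1)."""
    NO_KNOWLEDGE = 1        # "what you see is what you get"
    IMAGE_KNOWLEDGE = 2     # counting / recognition from the image itself
    INTERNAL_REASONING = 3  # explicit reasoning over image content
    EXTERNAL_KNOWLEDGE = 4  # common-sense or professional knowledge bases
    UNKNOWN_KNOWLEDGE = 5   # prediction and prospect

# Hypothetical mapping mirroring the Figure A1a group in Table 1.
EXAMPLE_QUESTIONS = {
    "What fruits are in the image?": KnowledgeLevel.IMAGE_KNOWLEDGE,
    "What is the mustache made of?": KnowledgeLevel.INTERNAL_REASONING,
    "What kind of banana is it?": KnowledgeLevel.EXTERNAL_KNOWLEDGE,
    "What are the girls doing?": KnowledgeLevel.UNKNOWN_KNOWLEDGE,
}

def required_level(question: str) -> KnowledgeLevel:
    """Look up the knowledge level a question demands (illustrative only)."""
    return EXAMPLE_QUESTIONS.get(question, KnowledgeLevel.NO_KNOWLEDGE)

if __name__ == "__main__":
    q = "What is the mustache made of?"
    print(q, "->", required_level(q).name)  # INTERNAL_REASONING
```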

3. Theoretical Model

Traditional VQA comprises visual feature extraction, language feature extraction, multi-modal feature fusion and reasoning, and answer generation. When designing a theoretical model, incorporating knowledge is rarely considered. Although the knowledge element has been introduced in some recent studies, it is only used as a means to improve the performance of VQA and is not treated as a component as essential as visual feature extraction. However, based on simple and intuitive understanding, knowledge is an indispensable component when humans answer a question about a given scene. Without corresponding knowledge as support, it is difficult for VQA to obtain correct and reasonable answers.
Note that Kb-VQA can be described as a procedural task that uses the knowledge $K$ mastered by the model to answer a random and open question $Q$ about a scene image $I$. The ultimate goal is to find the answer $a^{*}$ with the highest confidence from the set of possible answers $A$. This process can be formulated as follows:
$$ a^{*} = \arg\max_{a \in A} p_{kbvqa}\left(I, Q, K; \theta\right) \qquad (1) $$
where $p_{kbvqa}(\cdot)$ is the probability model of Kb-VQA, derived from the traditional VQA model framework [6]; $A$ is the answer set, $\theta$ denotes the model parameters, $a$ is any candidate answer, and $a^{*}$ is the answer with the highest confidence. Compared with the traditional VQA model, the model represented by Equation (1) adds a knowledge module to represent knowledge (which comes from internal reasoning or external import). The knowledge in the model constructed by Equation (1) is still abstract, so this paper starts from the components of VQA and further elaborates on how knowledge is introduced. The visual feature $\upsilon$ is expressed as follows:
$$ \upsilon = F\left(I, \kappa\right) \qquad (2) $$
where $F$ is the model used for visual feature extraction, most commonly a convolutional neural network, and $\kappa$ is a knowledge-based semantic network, whose construction can follow [41]. The knowledge-based semantic network can be used to verify whether the extracted visual features are correct. In addition, it can be used to construct an internal semantic network to assist cross-modal feature fusion and reasoning, thereby improving answer accuracy. Assuming that a total of $N$ visual features are extracted from an image, $\upsilon$ has the following representation:
$$ \upsilon = \left\{\upsilon_{i}\right\}, \quad \text{s.t.} \; i \in [1, N] \qquad (3) $$
The language feature extraction follows the expression of Farazi et al. [6], appropriately modified in this paper to expand its scope of application. The language feature extraction model is expressed as follows:
$$ q = L\left(w_{l}\right) \qquad (4) $$
where $q$ is the semantic feature extracted from the input question; $L(\cdot)$ is the language feature extraction model, whose implementation can refer to the related literature [25]; $w_{l}$ is a fixed-size word embedding, and $l$ is the length of the word embedding.
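To make Equations (2)-(4) concrete, the following minimal PyTorch sketch implements a stand-in visual encoder $F$ and language encoder $L$. The small CNN, the LSTM, all layer sizes, and the `kappa.verify` hook are illustrative assumptions, not the configuration of any cited model.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Sketch of Eq. (2)-(3): v = F(I, kappa) -> {v_i}, i in [1, N].
    A tiny CNN stands in for F; in practice this would be a deeper backbone
    (e.g., ResNet or Faster R-CNN region features). The knowledge-based
    semantic network kappa is only hinted at by the hypothetical `verify` hook."""
    def __init__(self, dim: int = 512, grid: int = 6):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d((grid, grid))   # N = grid * grid regions

    def forward(self, image: torch.Tensor, kappa=None) -> torch.Tensor:
        fmap = self.pool(self.cnn(image))                # (B, dim, grid, grid)
        v = fmap.flatten(2).transpose(1, 2)              # (B, N, dim) = {v_i}
        if kappa is not None:
            v = kappa.verify(v)   # hypothetical check against the semantic network
        return v

class LanguageEncoder(nn.Module):
    """Sketch of Eq. (4): q = L(w_l), here a word embedding followed by an LSTM."""
    def __init__(self, vocab: int = 10000, emb: int = 300, dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, dim, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.lstm(self.embed(tokens))
        return h[-1]                                     # (B, dim) question feature q

# Quick shape check with random inputs.
v = VisualEncoder()(torch.randn(2, 3, 224, 224))
q = LanguageEncoder()(torch.randint(0, 10000, (2, 14)))
print(v.shape, q.shape)   # torch.Size([2, 36, 512]) torch.Size([2, 512])
```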
In traditional VQA, once visual and language features are extracted, they are combined, and the answer to the question is inferred. In this paper, the two-modality problem is transformed into a three-modality problem, the most significant change being the addition of a knowledge modality. A multi-modal embedding function $\Phi$ is learned, and the learned joint embedding $z$ is described as follows:
$$ z = \Phi\left(\upsilon, q, \hat{k}\right) \qquad (5) $$
where $\hat{k}$ is the auxiliary knowledge from external knowledge bases, and $z$ is the learned joint embedding of the input question, image, and knowledge [23,24,25].
Furthermore, $\upsilon$ and $q$ are the specific expression forms of $I$ and $Q$ in Equation (1), and $\hat{k}$ is a specific form of $K$, but only a part of it. Assuming that the knowledge derived from the image is $k$ (including the image knowledge and internal reasoning knowledge in Table 1) and the unknown knowledge in Table 1 is $\tilde{k}$, then $K$ can be expressed in the following form:
$$ K = \hat{k} \cup k \cup \tilde{k} \qquad (6) $$
In this paper, $K$, $\kappa$, and $\hat{k}$ represent abstract knowledge, the knowledge-based semantic network, and external auxiliary knowledge, respectively. Abstract knowledge only indicates that the knowledge modality is introduced into the VQA model proposed in this paper, while the knowledge-based semantic network is a specific technical means and, like external auxiliary knowledge, belongs to the category of knowledge. The knowledge-based semantic network is built from features extracted from the image, so the main body of its information is internal; external auxiliary knowledge is different, as its main source is an external knowledge base.
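The following sketch completes the picture by implementing Equation (5) and the answer selection of Equation (1) in the same PyTorch style. The element-wise fusion, the layer sizes, and the assumption that external knowledge $\hat{k}$ arrives as a fixed-size vector (e.g., pooled embeddings of retrieved knowledge triples) are illustrative choices of this sketch, not the exact mechanism of any cited model.

```python
import torch
import torch.nn as nn

class KbVqaHead(nn.Module):
    """Sketch of Eq. (5) and Eq. (1): z = Phi(v, q, k_hat), then
    a* = argmax_{a in A} p_kbvqa(I, Q, K; theta)."""
    def __init__(self, v_dim=512, q_dim=512, k_dim=300, hid=1024, n_answers=3000):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid)
        self.q_proj = nn.Linear(q_dim, hid)
        self.k_proj = nn.Linear(k_dim, hid)
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(hid, n_answers))

    def forward(self, v, q, k_hat):
        v_pooled = v.mean(dim=1)                         # pool the N region features
        z = self.v_proj(v_pooled) * self.q_proj(q) * self.k_proj(k_hat)  # joint embedding z
        return self.classifier(z)                        # scores over the answer set A

head = KbVqaHead()
v = torch.randn(1, 36, 512)      # {v_i} from the visual encoder
q = torch.randn(1, 512)          # question feature
k_hat = torch.randn(1, 300)      # external auxiliary knowledge embedding
logits = head(v, q, k_hat)
a_star = logits.argmax(dim=-1)   # index of the highest-confidence answer a*
print(a_star.item())
```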
This section constructs a theoretical model of Kb-VQA, describes the form of knowledge participation from the theoretical level, and also points out the difference and connection of knowledge forms between the whole and each component.

4. Knowledge Description Framework

In view of the understanding of Kb-VQA, KHT, and the theoretical model, it can be inferred that key implementation actions (KIA) related to knowledge must exist between different levels. A KIA refers to the most important action introduced when executing VQA at a given level. Understanding these KIA can help researchers distinguish the type of Kb-VQA they are studying. Figure 3 shows the composition of the KIA.
The action on the left side of Figure 3 is training and learning, which accompanies every level of Kb-VQA. Note, however, that as the level increases, the training carrier, that is, the data set, also changes; this change is mainly reflected in data sets that incorporate more logical reasoning and external knowledge. The right side of Figure 3 shows answer generation, an action that is the same for all levels of VQA. In addition, the acquisition and generation of knowledge also exist at multiple levels, and as the level increases, this action is gradually enhanced.
Furthermore, based on the KIA of Kb-VQA, this paper proposes the KDF. The significance of the KDF is to summarize and integrate Kb-VQA along the three dimensions of basic theory, implementation methods, and specific applications; the research results of the KHT and KIA are then encapsulated into the core content of the basic theory and the implementation methods. Finally, arrows are used to indicate the flow of knowledge. The composition of the KDF is shown in Figure 4.
Figure 4 depicts the existence and expression of knowledge in knowledge-based visual question answering, the accompanying methods in the application process, and front-end technical solutions. At the base layer, knowledge exists not only in the visual question-answering process in the form of external knowledge but also in visual features, which together push the visual question-answering model toward more accurate answers. Furthermore, with knowledge as the key element, we can stratify knowledge-based visual question answering and build a knowledge pyramid model, which in turn determines the key implementation behaviors of visual question answering at different levels. Guided by the key implementation behaviors, specific technical solutions are formed by combining specific implementation methods (such as CNNs and RNNs) to solve practical application problems. Specifically, we first determine the knowledge level required for the service scenario; then we design the required algorithms according to the basic theoretical model; furthermore, we select the available deep neural networks, databases, and knowledge bases; finally, these specific methods are combined and trained on a suitable dataset. Considering that the description of the implementation methods in the knowledge description framework is relatively brief, we elaborate on it in more detail in Table 3.
Figure 3. Key implementation behaviors at different levels of VQA. Regardless of the light blue on the left and the gray modules on the right, each level from the top to bottom in the figure corresponds to the level in the knowledge pyramid.
Figure 4. Knowledge description framework for Kb-VQA.
Figure 4 not only shows the knowledge form in Kb-VQA from different levels but also points out the direction of the flow of knowledge information. For example, knowledge exists in visual feature extraction and cross-modal feature fusion at the basic theoretical level. From the perspective of implementation methods, knowledge exists in CNN, RNN, and attention mechanisms. Furthermore, if analyzed from a specific application level, knowledge exists in the process of human–computer interactions.
The KDF summarizes and integrates the research content of this paper and also serves as a reference and guide for Kb-VQA research.
Table 3. The specific implementation method in the knowledge description framework.
Category | Type | Method
Data | Data set | VQA1.0 [5], VQA2.0 [42], FVQA [39], Visual 7W [43], GQA [44], ⋯
Data | Knowledge base | ConceptNet [45], VGR [46], DBpedia [47], Webchild [48], ⋯
Network type | Convolutional neural network | Faster R-CNN [49], YOLO v3 [50], Resnet [51], VggNet [52], ⋯
Network type | Recurrent neural network | LSTM [53], GRU [54], BERT [55], GPT-3 [56], ⋯
Network type | Attention mechanism | Transformer [57], Re-Attention [58], Rahman [34], ⋯
Index | Evaluation indicators | Accuracy, WUPS, BLEU, Consensus, etc. [8]
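As a concrete illustration of how the components in Table 3 might be assembled under the KDF, the following sketch records one hypothetical configuration. The `KdfPlan` structure and the chosen component names are assumptions made for illustration; they are not a prescribed pipeline of this paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KdfPlan:
    """Hypothetical planning record for a KDF user: fix the target knowledge
    level, then pick data, knowledge base, networks, and evaluation indicators
    from the categories listed in Table 3."""
    knowledge_level: int                 # 1-5, from the knowledge pyramid
    dataset: str
    knowledge_base: Optional[str]        # None for applications below Level 4
    visual_network: str
    language_network: str
    fusion: str
    metrics: List[str] = field(default_factory=lambda: ["Accuracy"])

# Example: a Level-4 (external knowledge) configuration.
plan = KdfPlan(
    knowledge_level=4,
    dataset="FVQA",
    knowledge_base="ConceptNet",
    visual_network="Faster R-CNN",
    language_network="LSTM",
    fusion="Transformer attention",
)
print(plan)
```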

5. Experiment and Analysis

This section conducts a comparative analysis of different levels of Kb-VQA to quantitatively evaluate the impact of knowledge participation on the results of open problems. The statistical data are derived from the corresponding references, and the accuracy of answer prediction is used for evaluation. The results are shown in Table 4.
The statistical results in Table 4 have the following characteristics: (1) most Kb-VQA performance tests are carried out on the VQA v1 and v2 data sets; (2) in experiments based on the VQA v1 and v2 data sets, the results of high-level Kb-VQA (Level 3 and Level 4) are overall better than those of low-level Kb-VQA (Level 1 and Level 2); (3) among high-level Kb-VQA, Level 3 performs better than Level 4; (4) in the past two years, new knowledge-based datasets such as GQA and FVQA have appeared; (5) no relevant research has been published for Level 5 Kb-VQA.
The statistical results and conclusions in Table 4 verify, to a certain extent, the knowledge pyramid and knowledge hierarchy theory proposed in this paper: the addition of knowledge helps improve the effect of VQA, knowledge itself has levels, and the knowledge level is positively correlated with the final result.
To further confirm the above characteristics and better support the conclusions of this section, this paper collects the data obtained on the VQA v2 data set in Table 4 and displays them in the form of a graph. The results are shown in Figure 5.
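For reference, the following minimal matplotlib sketch rebuilds this kind of comparison from the VQA v2 rows of Table 4. The level grouping follows Table 4; the bar-chart styling is an arbitrary choice of the sketch and does not reproduce the exact layout of Figure 5.

```python
import matplotlib.pyplot as plt

# VQA v2 accuracies taken from Table 4, grouped by knowledge level.
methods = ["Agrawal [11]", "Liu [31]", "Bai [14]", "Li [16]", "Rahman [34]",
           "Zhang [35]", "Zhang [17]", "Kim [19]", "Zhang [25]"]
levels  = [1, 1, 2, 2, 2, 3, 3, 3, 4]
acc     = [48.24, 62.19, 67.94, 65.19, 70.90, 67.34, 71.27, 70.79, 71.84]

colors = {1: "tab:gray", 2: "tab:blue", 3: "tab:green", 4: "tab:orange"}
plt.figure(figsize=(8, 4))
plt.bar(methods, acc, color=[colors[l] for l in levels])
plt.ylabel("Accuracy on VQA v2 (%)")
plt.xticks(rotation=45, ha="right")
plt.title("Kb-VQA accuracy by knowledge level (data from Table 4)")
plt.tight_layout()
plt.show()
```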
Based on the experimental results in Table 4 and Figure 5, the following conclusions can be drawn: (1) improving the knowledge level in Kb-VQA has a direct impact on the accuracy of the VQA model; (2) the performance of Level 3 is better than that of Level 4, mainly because Level 4 Kb-VQA builds on Level 3: the Level 4 methods with relatively poor results in Table 4 were mainly proposed in 2017 and 2018, when the knowledge reasoning algorithms (Level 3 KIA) were weaker, which dragged down the Level 4 results; (3) the method of Zhang et al. [25] was proposed in 2020, yet its accuracy exceeds that of the 2021 methods of Zhang et al. [17] and Kim et al. [19], which further shows that Level 4 is superior to Level 3 in answering open questions. The experimental results are consistent with the hierarchical theory described in this paper, and the analysis of the results further enriches the knowledge description framework we propose.

6. Conclusions

A modern city and its management form an open, complex giant system characterized by multiple subjects, multiple levels, multiple structures, multiple forms, and non-linear urban life. Smart cities and their management must grasp these complex system characteristics in the new network, perception, and data environment, and such tasks are clearly difficult to complete by manpower alone. We hope to provide a practical solution for building city brains for smart cities by using Kb-VQA. The Kb-VQA system will serve as the center of the city brain, powering information interaction, data exchange, knowledge reasoning, and short-term prediction in smart cities.
This paper proposes the KDF, which integrates our research results on Kb-VQA, including the KHT, the knowledge pyramid, the knowledge theory model, and the KIA, and points out the direction of knowledge flow. In addition, we show that advanced knowledge can help Kb-VQA achieve improved results on complex and open issues. In the experiment and analysis section, we compiled statistics on published Kb-VQA results. The patterns reflected by the experimental results are consistent with the KHT proposed in this paper; the analysis also reveals that high-level knowledge helps the VQA model obtain higher accuracy when answering complex and open questions.
Unlike previous research on specific VQA methods, this paper focuses more on the architecture and methodology of VQA, so its content is more general. We hope the research content of this paper provides some help for the subsequent development of Kb-VQA. Of course, since the focus of this paper is on constructing the architecture and discussing the methodology, some details may be insufficient, and more experimental verification is lacking; we will improve this part of the work in follow-up studies.

Author Contributions

Conceptualization, R.W. and S.W.; methodology, R.W. and X.W.; software, R.W.; validation, X.W. and S.W.; formal analysis, X.W.; investigation, R.W.; resources, S.W.; writing—original draft preparation, R.W.; writing—review and editing, X.W.; visualization, R.W.; supervision, X.W.; project administration, S.W.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grants 61876209, 61936004, and U1913602.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Materials Used in this Paper

The images in the first row are all people, the images in the second row are working scenes, the images in the third row are street scenes, the fourth row comprises natural and animal scenes, and the fifth row comprises indoor scenes.
Figure A1. Example image.

References

  1. Sheng, H.; Zhang, Y.; Wang, W.; Shan, Z.; Fang, Y.; Lyu, W.; Xiong, Z. High confident evaluation for smart city services. Front. Environ. Sci. 2022, 1103. [Google Scholar] [CrossRef]
  2. Li, C.; Xuan, W. Green development assessment of smart city based on PP-BP intelligent integrated and future prospect of big data. Acta Electron. Malays. (AEM) 2017, 1, 1–4. [Google Scholar] [CrossRef]
  3. Fang, Y.; Shan, Z.; Wang, W. Modeling and key technologies of a data-driven smart city system. IEEE Access 2021, 9, 91244–91258. [Google Scholar] [CrossRef]
  4. Lu, H.P.; Chen, C.S.; Yu, H. Technology roadmap for building a smart city: An exploring study on methodology. Future Gener. Comput. Syst. 2019, 97, 727–742. [Google Scholar] [CrossRef]
  5. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. Vqa: Visual question answering. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA, 7–13 December 2015; pp. 2425–2433. [Google Scholar]
  6. Farazi, M.; Khan, S.; Barnes, N. Accuracy vs. complexity: A trade-off in visual question answering models. Pattern Recogn. 2021, 120, 108106. [Google Scholar] [CrossRef]
  7. Teney, D.; Wu, Q.; van den Hengel, A. Visual question answering: A tutorial. IEEE Signal Process. Mag. 2017, 34, 63–75. [Google Scholar] [CrossRef]
  8. Manmadhan, S.; Kovoor, B.C. Visual question answering: A state-of-the-art review. Artif. Intell. Rev. 2020, 53, 5705–5745. [Google Scholar] [CrossRef]
  9. Hosseinioun, S. Knowledge grid model in facilitating knowledge sharing among big data community. Comput. Sci. 2018, 2, 8455–8459. [Google Scholar] [CrossRef]
  10. Aditya, S.; Yang, Y.; Baral, C. Explicit reasoning over end-to-end neural architectures for visual question answering. Aaai Conf. Artif. Intell. 2018, 32, 629–637. [Google Scholar] [CrossRef]
  11. Agrawal, A.; Batra, D.; Parikh, D.; Kembhavi, A. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4971–4980. [Google Scholar]
  12. Wu, Y.; Ma, Y.; Wan, S. Multi-scale relation reasoning for multi-modal Visual Question Answering. Signal Process. Image Commun. 2021, 96, 116319. [Google Scholar] [CrossRef]
  13. Ma, Z.; Zheng, W.; Chen, X.; Yin, L. Joint embedding VQA model based on dynamic word vector. PeerJ Comput. Sci. 2021, 7, e353. [Google Scholar] [CrossRef] [PubMed]
  14. Bai, Y.; Fu, J.; Zhao, T.; Mei, T. Deep attention neural tensor network for visual question answering. Proc. Eur. Conf. Comput. Vis. 2018, 11216, 20–35. [Google Scholar]
  15. Gordon, D.; Kembhavi, A.; Rastegari, M.; Redmon, J.; Fox, D.; Farhadi, A. Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4089–4098. [Google Scholar]
  16. Li, W.; Yuan, Z.; Fang, X.; Wang, C. Knowing where to look? Analysis on attention of visual question answering system. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 145–152. [Google Scholar]
  17. Zhang, W.; Yu, J.; Zhao, W.; Ran, C. DMRFNet: Deep multimodal reasoning and fusion for visual question answering and explanation generation. Inf. Fusion 2021, 72, 70–79. [Google Scholar] [CrossRef]
  18. Liang, W.; Jiang, Y.; Liu, Z. GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering. arXiv 2021, arXiv:2104.10283. [Google Scholar]
  19. Kim, J.J.; Lee, D.G.; Wu, J.; Jung, H.G.; Lee, S.W. Visual question answering based on local-scene-aware referring expression generation. Neural Netw. 2021, 139, 158–167. [Google Scholar] [CrossRef]
  20. Zhu, Y.; Zhang, C.; Ré, C.; Fei-Fei, L. Building a Large-scale Multimodal Knowledge Base for Visual Question Answering. arXiv 2015, arXiv:1507.05670. [Google Scholar]
  21. Wu, Q.; Wang, P.; Shen, C.; Dick, A.; Van Den Hengel, A. Ask me anything: Free-form visual question answering based on knowledge from external sources. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4622–4630. [Google Scholar]
  22. Zhu, Y.; Lim, J.J.; Fei-Fei, L. Knowledge acquisition for visual question answering via iterative querying. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1154–1163. [Google Scholar]
  23. Su, Z.; Zhu, C.; Dong, Y.; Cai, D.; Chen, Y.; Li, J. Learning visual knowledge memory networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7736–7745. [Google Scholar]
  24. Yu, J.; Zhu, Z.; Wang, Y.; Zhang, W.; Hu, Y.; Tan, J. Cross-modal knowledge reasoning for knowledge-based visual question answering. Pattern Recognit. 2020, 108, 107563. [Google Scholar] [CrossRef]
  25. Zhang, L.; Liu, S.; Liu, D.; Zeng, P.; Li, X.; Song, J.; Gao, L. Rich visual knowledge-based augmentation network for visual question answering. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4362–4373. [Google Scholar] [CrossRef]
  26. Zheng, W.; Yin, L.; Chen, X.; Ma, Z.; Liu, S.; Yang, B. Knowledge base graph embedding module design for Visual question answering model. Pattern Recognit. 2021, 120, 108153. [Google Scholar] [CrossRef]
  27. Liu, L.; Wang, M.; He, X.; Qing, L.; Chen, H. Fact-based visual question answering via dual-process system. Knowl.-Based Syst. 2022, 237, 107650. [Google Scholar] [CrossRef]
  28. Uehara, K.; Duan, N.; Harada, T. Learning To Ask Informative Sub-Questions for Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21 June 2022; pp. 4681–4690. [Google Scholar]
  29. Cudic, M.; Burt, R.; Santana, E.; Principe, J.C. A flexible testing environment for visual question answering with performance evaluation. Neurocomputing 2018, 291, 128–135. [Google Scholar] [CrossRef]
  30. Lioutas, V.; Passalis, N.; Tefas, A. Explicit ensemble attention learning for improving visual question answering. Pattern Recognit. Lett. 2018, 111, 51–57. [Google Scholar] [CrossRef]
  31. Liu, F.; Xiang, T.; Hospedales, T.M.; Yang, W.; Sun, C. Inverse visual question answering: A new benchmark and VQA diagnosis tool. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 460–474. [Google Scholar] [CrossRef] [PubMed]
  32. Lu, P.; Li, H.; Zhang, W.; Wang, J.; Wang, X. Co-attending free-form regions and detections with multi-modal multiplicative feature embedding for visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 1–8. [Google Scholar]
  33. Mun, J.; Lee, K.; Shin, J.; Han, B. Learning to Specialize with Knowledge Distillation for Visual Question Answering. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; pp. 8092–8102. [Google Scholar]
  34. Rahman, T.; Chou, S.H.; Sigal, L.; Carenini, G. An Improved Attention for Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1653–1662. [Google Scholar]
  35. Zhang, W.; Yu, J.; Hu, H.; Hu, H.; Qin, Z. Multimodal feature fusion by relational reasoning and attention for visual question answering. Inf. Fusion 2020, 55, 116–126. [Google Scholar] [CrossRef]
  36. Bajaj, G.; Bandyopadhyay, B.; Schmidt, D.; Maneriker, P.; Myers, C.; Parthasarathy, S. Understanding Knowledge Gaps in Visual Question Answering: Implications for Gap Identification and Testing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 386–387. [Google Scholar]
  37. Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3195–3204. [Google Scholar]
  38. Wu, Q.; Shen, C.; Wang, P.; Dick, A.; Van Den Hengel, A. Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1367–1381. [Google Scholar] [CrossRef] [Green Version]
  39. Wang, P.; Wu, Q.; Shen, C.; Dick, A.; Van Den Hengel, A. Fvqa: Fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2413–2427. [Google Scholar] [CrossRef] [Green Version]
  40. Wang, P.; Wu, Q.; Shen, C.; Hengel, A.V.D.; Dick, A. Explicit Knowledge-based Reasoning for Visual Question Answering. Proc. Conf. Artif. Intell. 2017, 1290–1296. [Google Scholar] [CrossRef] [Green Version]
  41. Teney, D.; Liu, L.; van Den Hengel, A. Graph-structured representations for visual question answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3233–3241. [Google Scholar]
  42. Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. Int. J. Comput. Vis. 2019, 398–414. [Google Scholar] [CrossRef] [Green Version]
  43. Zhu, Y.; Groth, O.; Bernstein, M.; Fei-Fei, L. Visual7w: Grounded question answering in images. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4995–5004. [Google Scholar]
  44. Hudson, D.A.; Manning, C.D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6700–6709. [Google Scholar]
  45. Speer, R.; Chin, J.; Havasi, C. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]
  46. Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef] [Green Version]
  47. Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. Dbpedia: A nucleus for a web of open data. Semant. Web. 2017, 722–735. [Google Scholar] [CrossRef] [Green Version]
  48. Tandon, N.; Melo, G.; Weikum, G. Acquiring comparative commonsense knowledge from the web. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; p. 28. [Google Scholar]
  49. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
  50. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 770–778. [Google Scholar]
  52. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  53. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  54. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  55. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  56. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.bibsonomy.org/bibtex/273ced32c0d4588eb95b6986dc2c8147c/jonaskaiser (accessed on 24 September 2022).
  57. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://papers.nips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 24 September 2022).
  58. Guo, W.; Zhang, Y.; Yang, J.; Yuan, X. Re-attention for visual question answering. IEEE Trans. Image Process. 2021, 30, 6730–6743. [Google Scholar] [CrossRef]
Figure 1. VQA knowledge pyramid based on KHT. The upper left corner is used to explain that the pyramids are arranged in a time line in the vertical direction, and the scale of related research results (measured by published papers) is qualitatively described by the width of the pyramid in the horizontal direction.
Figure 2. Examples of knowledge forms in Kb-VQA. (a) Example A1k; (b) Example A1o.
Figure 5. Comparing the accuracy of Kb-VQA methods at different levels on the VQA v2 data set. (a) Agrawal [11]; (b) Liu [31]; (c) Bai [14]; (d) Li [16]; (e) Rahman [34]; (f) Zhang [35]; (g) Zhang [17]; (h) Kim [19]; (i) Zhang [25].
Table 1. Existing form of knowledge.
Number | Existing Form | Sample Image | Sample Question | Standard Answer
1 | No knowledge | Figure A1a | Null | Null
2 | Image knowledge | Figure A1a | What fruits are in the image? | Banana
3 | Internal reasoning knowledge | Figure A1a | What is the mustache made of? | Banana
4 | External knowledge | Figure A1a | What kind of banana is it? | Emperor banana
5 | Unknown knowledge | Figure A1a | What are the girls doing? | Play?
6 | No knowledge | Figure A1e | Null | Null
7 | Image knowledge | Figure A1e | How many people are there in the image? | Two
8 | Internal reasoning knowledge | Figure A1e | What’s in front of the woman? | Computer
9 | External knowledge | Figure A1e | What brand of computer does the woman in the image use? | Apple
10 | Unknown knowledge | Figure A1e | Are they working overtime? | Maybe
11 | No knowledge | Figure A1h | Null | Null
12 | Image knowledge | Figure A1h | How many people are there in the image? | Two
13 | Internal reasoning knowledge | Figure A1h | What is the man holding in his hand? | Umbrella
14 | External knowledge | Figure A1h | What are the two of them doing? | Taking wedding photos
15 | Unknown knowledge | Figure A1h | This is where? | Do not know
The first knowledge level does not involve knowledge metabolism, so there is no input and output of questions and answers; thus, it is set to “Null”. Refer to Appendix A for sample images.
Table 2. The research status of knowledge existing forms in Kb-VQA.
Level | Existing Form | Related Research Work
1 | No knowledge | Agrawal [11], Cudic [29], Lioutas [30], Liu [31], Lu [32], Mun [33], Wu [12]
2 | Image knowledge | Bai [14], Gordon [15], Li [16], Rahman [34]
3 | Internal reasoning knowledge | Zhang [17], Liang [18], Kim [19], Zhang [35], Bajaj [36]
4 | External knowledge | Zhang [25], Yu [24], Marino [37], Su [23], Aditya [10], Zhu [22], Wu [38], Wang [39,40], Wu [21], Zhu [20]
5 | Unknown knowledge | None
Table 4. Comparison of Kb-VQA performance at all levels.
Level | Method | Publication | Dataset a | Accuracy
1 | Agrawal [11] | 2018 | VQA v2 b [42] | 48.24
1 | Cudic [29] c | 2018 | – | –
1 | Lioutas [30] | 2018 | Visual7W [43] | 66.6
1 | Liu [31] | 2018 | VQA v2 | 62.19
1 | Lu [32] | 2018 | VQA v1 | 69.97
1 | Mun [33] | 2018 | – | –
1 | Wu [12] | 2021 | VQA v1 | 68.47
2 | Bai [14] | 2018 | VQA v2 | 67.94
2 | Gordon [15] | 2018 | – | –
2 | Li [16] | 2018 | VQA v2 | 65.19
2 | Rahman [34] | 2021 | VQA v2 | 70.90
3 | Zhang [35] | 2020 | VQA v2 | 67.34
3 | Bajaj [36] | 2020 | – | –
3 | Zhang [17] | 2021 | VQA v2 | 71.27
3 | Liang [18] | 2021 | GQA [44] | 94.78
3 | Kim [19] | 2021 | VQA v2 | 70.79
4 | Zhu [22] | 2017 | VQA v1 | 68.90
4 | Wu [38] | 2017 | VQA v1 | 59.50
4 | Wang [40] | 2017 | VQA v1 | 69.60
4 | Su [23] | 2018 | VQA v1 | 66.10
4 | Aditya [10] | 2018 | VQA v2 | –
4 | Zhang [25] | 2020 | VQA v2 | 71.84
4 | Yu [24] | 2020 | FVQA [39] | 79.63
4 | Yu [24] | 2020 | Visual7W | 69.03
5 | – | – | – | –
a The data set in the table only refers to the data set from which the experimental results are obtained, and does not mean that only this data set is used in the corresponding paper. b All tests for VQA v2 are executed on test-standard, and the test scope is overall. c Some methods do not give accuracy results because the papers corresponding to these methods do not conduct related Kb-VQA.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
