A Metamorphic Testing Approach for Assessing Question Answering Systems

Abstract: Question Answering (QA) enables the machine to understand and answer questions posed in natural language, and has emerged as a powerful tool in various domains. However, QA is a challenging task, and there is increasing concern about its quality. In this paper, we propose to apply the technique of metamorphic testing (MT) to evaluate QA systems from the users' perspectives, in order to help the users to better understand the capabilities of these systems and then to select appropriate QA systems for their specific needs. Two typical categories of QA systems, namely, textual QA (TQA) and visual QA (VQA), are studied, and a total of 17 metamorphic relations (MRs) are identified for them. These MRs respectively focus on characteristics of different aspects of QA. We further apply MT to four QA systems (including two APIs from the AllenNLP platform, one API from the Transformers platform, and one API from CloudCV) by using all of the MRs. Our experimental results demonstrate the capabilities of the four subject QA systems from various aspects, revealing their strengths and weaknesses. These results further suggest that MT can be an effective method for assessing QA systems.


Introduction
Question answering (QA) [1,2] focuses on returning correct answers to given questions. Among various QA systems, textual question answering (TQA) and visual question answering (VQA) represent two typical paradigms that enable the machine to answer a question in natural language by referring to the given contents (i.e., text or images). As shown in Figure 1, TQA [3] focuses on answering a question about a passage of text, which is also known as the NLP task of machine reading comprehension, while VQA [4] focuses on answering a question based on an image, which leverages techniques from the domains of NLP and computer vision. Both TQA and VQA have various potential applications. For example, TQA has been widely adopted by conversational agents [5] and customer service support [6], while VQA has a broad range of applications in autonomous agents and virtual assistants [7]. In addition, a large number of neural network models have been created for implementing both TQA and VQA; for instance, BiDAF [8], BERT [9], and RoBERTa [10] for TQA, and ViLBERT [11] for VQA.
Due to the importance and popularity of QA, it is critical to properly assess QA systems in order to demonstrate their capabilities and limitations. QA systems are commonly evaluated against a test dataset; however, such a dataset is not necessarily representative of the real world. Because of this, various approaches have been proposed and applied to evaluate QA systems, revealing a series of problems concerning different aspects. Jia et al. [12] proposed an adversarial evaluation scheme to investigate whether QA systems can answer questions about passages containing adversarially inserted sentences, and their experimental results revealed that the QA models under investigation had poor performance. Divyansh et al. [13] further investigated the limitations of popular QA benchmarks.
This study focuses on assessing TQA and VQA systems from the users' perspective in order to reveal to which degree QA systems satisfy the users' expectations. This kind of assessment helps the users to better understand QA systems such that they are able to select appropriate QA systems for their specific needs. To this end, we propose to adopt the technique of metamorphic testing (MT). MT is a property-based testing technique, which has shown promising effectiveness in various software engineering activities, such as testing [18], fault localization [19], and program repair [20,21]. The key component of MT is the metamorphic relation (MR), which encodes system properties via the relationship among multiple related inputs and outputs. MT was originally applied for software verification. In recent years, it has been successfully extended to software validation and system comprehension [22,23].
In this study, we identify a total of 17 MRs for QA systems. These MRs respectively focus on different aspects of TQA and VQA, which can help the users to understand the capabilities of TQA and VQA systems from different perspectives, and can also provide guidance for the users to select appropriate systems to satisfy their specific needs. We conduct experiments on four QA systems (two TQA APIs provided by AllenNLP [24] and Transformers [25], and two VQA APIs provided by AllenNLP and CloudCV) using all of the MRs, demonstrating the capabilities and limitations of the QA systems under investigation. To summarize, the paper makes three major contributions.
• We proposed to apply the technique of metamorphic testing to assess QA systems from the users' perspectives, and presented 17 MRs by considering different aspects of QA systems.
• We conducted experiments on four common QA systems (two TQA systems and two VQA systems), demonstrating the feasibility and effectiveness of MT in assessing QA systems.
• We conducted a comparison analysis among the subject QA systems to reveal their capabilities of understanding and processing the input data, and also demonstrated how the analysis results can help the users to select appropriate QA systems for their specific needs.
The remainder of the paper is organized as follows. Section 2 introduces the technique of metamorphic testing. Section 3 clarifies the overall approach, and Section 4 presents a list of MRs identified for QA systems. Our experimental setup is introduced in Section 5, and the experimental results are presented and analyzed in Section 6. Section 7 discusses related work, and Section 8 concludes the present study.

Metamorphic Testing
Metamorphic testing (MT) [26,27] is a property-based testing technique. MT describes the necessary properties of the target system through the relationships among the inputs and outputs of multiple executions. Such properties are expressed as metamorphic relations (MRs). Specifically, an MR describes how to construct the follow-up input from a given input (which is known as the source input), and also encodes the relationship among the source and follow-up outputs (namely, the outputs for the source and follow-up inputs, respectively). As an illustration, consider the program Max that finds the maximum of two input values. An MR for Max can be: "Suppose that the source input is t s = (x, y), where x and y can be arbitrary numeric values, and the follow-up input t f is constructed by swapping the two input values of t s (that is, t f = (y, x)). As a result, the source and follow-up outputs are expected to be identical".
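The Max example above can be sketched in a few lines of Python (a minimal illustration; the function names are ours, not from any particular MT framework):

```python
def max_of_two(x, y):
    """The program under test: returns the maximum of two values."""
    return x if x >= y else y

def check_swap_mr(x, y):
    """True iff swapping the inputs leaves the output unchanged."""
    source_output = max_of_two(x, y)    # source execution on (x, y)
    followup_output = max_of_two(y, x)  # follow-up execution on (y, x)
    return source_output == followup_output
```

Note that the check runs the program twice and compares the two outputs; it never needs to know the expected maximum in advance.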
Generally, MRs can be identified by referring to the system's requirements or based on the users' expectations of the system. Given an MR and a set of its source inputs (which can be generated by arbitrary strategies), MT can be conducted as follows. First, the corresponding follow-up inputs are constructed from the source inputs according to the MR. After that, for every group of source and follow-up inputs, MT runs the target program on both the source and follow-up inputs, yielding the source and follow-up outputs. MT finally checks each group of source and follow-up inputs and outputs against the relevant MR to see whether or not the MR is violated. Any group of source and follow-up inputs with which the program violates the MR is regarded as incurring an MR violation. If the relevant MR is identified with reference to the system's requirements, an MR violation indicates the existence of defects in the target system. If the MR is instead identified with respect to the users' expected characteristics of the system, an MR violation reveals either the existence of defects or a discrepancy between the system behavior and the users' expectations. Different from traditional testing techniques that check the correctness of the outputs of individual inputs, MT checks the satisfaction of MRs on individual groups of source and follow-up executions. Because of this, MT can be conducted without using oracles, and has been applied for software verification and validation [18,22] as well as for helping users to understand system behaviors [23]. It is also noted that once MRs are identified, the whole procedure of MT can be easily automated.
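The three-step procedure above can be expressed as a small, generic test driver; this is a minimal sketch, assuming an MR is modeled as a follow-up constructor plus an output relation (both names are illustrative):

```python
def run_metamorphic_test(system, source_inputs, make_followup, relation_holds):
    """Run `system` on each source/follow-up pair and collect MR violations."""
    violations = []
    for src in source_inputs:
        fup = make_followup(src)                  # step 1: build follow-up input
        src_out = system(src)                     # step 2: source execution
        fup_out = system(fup)                     #         follow-up execution
        if not relation_holds(src_out, fup_out):  # step 3: check the relation
            violations.append((src, fup))
    return violations
```

For instance, the swap MR of the Max example can be checked by passing `lambda t: (t[1], t[0])` as the follow-up constructor and output equality as the relation.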

Methodology
This study proposes to apply MT to evaluate QA systems by considering different users' requirements. An overview of the approach is presented in Figure 2. Given a set of source inputs (namely, passage-question pairs for TQA and image-question pairs for VQA) and a list of MRs, a corresponding set of perturbed passage-question pairs and image-question pairs are generated, which are respectively the follow-up inputs for TQA and VQA. By executing the TQA and VQA systems with the source and follow-up inputs that are relevant to individual MRs, their source and follow-up answers are collected. Since both TQA and VQA provide a phrase or a sentence as an output answer, we conduct semantic similarity analysis on groups of source and follow-up answers with respect to the relevant MR to determine the testing result. Finally, for each MR and every TQA and VQA system under investigation, we calculate the violation rate, which denotes the rate of occurrence of MR violations. A higher violation rate indicates a higher degree to which the system's behaviors deviate from the users' expectations.
Based on the evaluation data, we further conduct a comparison analysis to reveal the capabilities of the QA systems under investigation. Our analysis mainly focuses on three aspects: both TQA's and VQA's capabilities of understanding and answering questions, TQA's capability of understanding and processing passages, and VQA's capability of understanding and processing images. We also demonstrate how our analysis results can help the users to select appropriate QA systems according to their specific needs.
The key task of applying MT to QA systems lies in the identification of MRs by considering the characteristics of QA systems as well as the users' expectations of these systems. Moreover, once the MRs are identified, the whole evaluation procedure can be automated.

Metamorphic Relations of Question Answering Systems
In order to evaluate QA systems by MT, we defined a series of MRs. These MRs consider the users' expected characteristics of QA systems, and thus the satisfaction and violation of these MRs can help users to better understand the capability and limitations of QA systems. In total, 17 MRs are identified, each of which focuses on some aspects of QA. This section presents the details of these MRs, and also gives illustrative examples for some MRs.

Output Relationships
Let t s and t f be a group of source and follow-up inputs of a QA system with respect to an MR, and let A s and A f be the corresponding source and follow-up outputs. In this study, we consider the following relationships between A s and A f . In order to determine whether two answers A s and A f have similar semantics, we first transform them into vector representations. This is done by employing the bert-as-service API [28], which encodes a sentence as a fixed-length vector by using the BERT model [9]. BERT is a pre-trained transformer network built upon the attention mechanism [29]. The model has multiple layers, each of which consists of an attention sub-layer and a feed-forward network sub-layer. The former helps the model to gain a broad range of information from the input. For an input, the attention sub-layer extracts three vectors, namely, the query vector, key vector, and value vector, and packs them together into matrices Q, K, and V, respectively. Based on this, it conducts the self-attention calculation as below [29]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where d k represents the dimension of the keys of the input, and softmax is the normalized exponential function. Specifically, BERT adopts a multi-head attention mechanism, which concatenates multiple attention calculations of linearly transformed queries, keys, and values. The output of the attention sub-layer is provided to another sub-layer that contains a feed-forward network, which is responsible for conducting linear transformations as below [29]:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

where W 1 , W 2 , b 1 , and b 2 are learned projection matrices and bias vectors.
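As an illustration of the self-attention calculation described above, the following pure-Python sketch computes softmax(QK^T/√d_k)V for toy row vectors; a real model such as BERT operates on large matrices with learned projections and multiple heads:

```python
import math

def softmax(xs):
    """Normalized exponential function over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    result = []
    for q in Q:
        # one attention score per key vector, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # weighted sum of the value vectors
        result.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return result
```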
Based on the vector representations yielded by BERT for A s and A f , we further apply cosine similarity analysis [30] to decide whether or not they are semantically equivalent. Suppose that the size of the resulting vectors is n, and let v s = [v s1 , ..., v sn ] and v f = [v f1 , ..., v fn ] be the vectors representing A s and A f , respectively. The semantic similarity of A s and A f is measured by

$$\mathrm{sim}(A_s, A_f) = \frac{\sum_{i=1}^{n} v_{si} \cdot v_{fi}}{\sqrt{\sum_{i=1}^{n} v_{si}^2}\,\sqrt{\sum_{i=1}^{n} v_{fi}^2}}$$

As a result, a similarity score that is higher than a threshold value indicates the equivalence of A s and A f in terms of their semantics.
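The similarity check can be sketched as below; the 0.9 threshold is an illustrative assumption, as the concrete threshold value is not fixed by the formula itself:

```python
import math

def cosine_similarity(vs, vf):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(vs, vf))
    norm_s = math.sqrt(sum(a * a for a in vs))
    norm_f = math.sqrt(sum(b * b for b in vf))
    return dot / (norm_s * norm_f)

def semantically_equivalent(vs, vf, threshold=0.9):
    # The 0.9 threshold is an assumption for illustration only.
    return cosine_similarity(vs, vf) >= threshold
```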

MRs for QA Systems
The input of TQA consists of a passage and a question, and the input of VQA contains an image and a question. As such, we use P s (or I s ) and Q s to denote the passage (or image) and question in t s , and P f (or I f ) and Q f to denote the corresponding information in t f . That is, t s = (P s , Q s ) and t f = (P f , Q f ) for TQA, while t s = (I s , Q s ) and t f = (I f , Q f ) for VQA. Different MRs may operate on different input parameters of t s to construct t f , leading to discrepancies between t s and t f . Accordingly, we classify all MRs into three categories, which are summarized in Table 1 and explained below.
• MR1.x has t s = (P, Q s ) and t f = (P, Q f ), or t s = (I, Q s ) and t f = (I, Q f ). That is, t s and t f of MR1.x have the same P (or I), but different questions Q s and Q f . This category of MRs operates on Q s to construct Q f ; hence, they focus on QA's capability of understanding and answering questions.
• MR2.x has t s = (P s , Q) and t f = (P f , Q). That is, t s and t f of MR2.x have the same Q, but different passages P s and P f . This category of MRs operates on P s to construct P f , and focuses on the TQA's capability of processing and understanding the input passage.
• MR3.x has t s = (I s , Q) and t f = (I f , Q). That is, t s and t f of MR3.x have the same Q, but different images I s and I f . This category of MRs operates on I s to construct I f , and concentrates on the VQA's capability of processing and understanding the input image.
Table 1. Summary of metamorphic relations (MRs).

MR1.x
This category of MRs is designed to investigate the QA's capability of understanding and answering questions. For each MR, t s and t f use the same input passage or image but different questions, that is, (P, Q s ) and (P, Q f ) for TQA, while (I, Q s ) and (I, Q f ) for VQA. Different MRs alter Q s in different ways to construct Q f and also encode the relationship that is expected to be satisfied by A s and A f . We identify four MRs, which are described as follows.
MR1.1 (Capitalization): Q f is constructed by replacing lowercase letters of Q s with the corresponding uppercase letters. As a result, A f is expected to be equivalent to A s .
MR1.2 (Rephrasing a comparative question): Suppose that Q s contains comparative phrases. Q f is constructed by rephrasing Q s without changing the meaning of Q s . As a result, A f is expected to be equivalent to A s .
MR1.3 (Replacing a comparative word with its antonym): Suppose that Q s contains comparative words. Q f is constructed by replacing a comparative word in Q s with its antonym such that Q f expresses a different meaning from Q s . As a result, A f is expected to be different from A s .
MR1.4 (Changing the subject of a question): Q f is constructed by replacing the subject of Q s with another noun. This change leads to different meanings of the two questions. As a result, A f is expected to be different from A s .
Table 2 shows some illustrative examples of Q s and Q f of MR1.1-MR1.4, where Q f is highlighted with underlines. For each MR, the interpretation of MR violations is also presented.

MR2.x
This category of MRs is identified to study the TQA's capability of processing and understanding the input passage. For each MR, the source input is t s = (P s , Q), and the corresponding follow-up input is t f = (P f , Q). Every MR proposes a way of altering P s to construct P f and also predicts the relationship between the corresponding A s and A f . Table 3 summarizes this category of MRs, the details of which are presented below.
MR2.1 (Capitalization): P f is constructed by replacing lowercase letters of P s with the corresponding uppercase letters. As a result, A f is expected to be equivalent to A s .
MR2.2 (Reversing the order of sentences): P f is constructed by reversing the order of sentences of P s . As a result, A f is expected to be equivalent to A s .
MR2.3 (Addition of irrelevant sentences): P f is constructed by adding some sentences that are irrelevant to the question into P s . As a result, A f is expected to be equivalent to A s .
MR2.4 (Removal of irrelevant sentences): P f is constructed by removing sentences that are irrelevant to the question from P s . As a result, A f is expected to be equivalent to A s . MR2.5 (Replacing the answer-related words): Suppose that A s is a numeric value, which is an answer to questions of types of how many, how old, how long, or when. P f is constructed by replacing A s in P s with A s + n, where n is a randomly selected numeric constant, which makes A s + n a numeric value that is different from A s and is also unique in P s . As a result, A f is expected to be different from A s but is equal to A s + n.
MR2.5 is designed by considering a special case where TQA returns a numeric value as an answer to a given question. In this study, we consider four types of questions, namely, how many, how old, how long, and when. An illustrative example of MR2.5 is presented in Table 4, which demonstrates the way of constructing P f based on both P s and A s . Obviously, MR2.5 can only be applied to source inputs that contain the aforementioned four types of questions.
Table 4. Example P s and P f of MR2.5 (n is set to be 3).
P s : After graduating from high school, West received a scholarship to attend Chicago's American Academy of Art in 1997 and began taking painting classes, but shortly after transferred to Chicago State University to study English. He soon realized that his busy class schedule was detrimental to his musical work, and at 20 he dropped out of college to pursue his musical dreams. This action greatly displeased his mother, who was also a professor at the university.
Q s : How old was Kanye when he dropped out of college?
A s : 20
P f : After graduating from high school, West received a scholarship to attend Chicago's American Academy of Art in 1997 and began taking painting classes, but shortly after transferred to Chicago State University to study English. He soon realized that his busy class schedule was detrimental to his musical work, and at 23 he dropped out of college to pursue his musical dreams. This action greatly displeased his mother, who was also a professor at the university.
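The construction of P f under MR2.5 can be sketched as follows; the helper name and the use of regular expressions are illustrative assumptions, not the paper's actual implementation:

```python
import re

def build_mr25_followup(passage, source_answer, n=3):
    """Return P_f with A_s replaced by A_s + n, or None if not applicable."""
    new_value = str(int(source_answer) + n)
    # Precondition of MR2.5: A_s + n must be unique in the passage, so it
    # must not already occur as a standalone number.
    if re.search(r"\b" + re.escape(new_value) + r"\b", passage):
        return None
    return re.sub(r"\b" + re.escape(str(source_answer)) + r"\b",
                  new_value, passage)
```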

MR3.x
This category of MRs is identified for evaluating the VQA's capability of processing and understanding the input image. For each MR, the source input is t s = (I s , Q), and the corresponding follow-up input is t f = (I f , Q). Accordingly, each MR designs a way of altering I s to construct I f and also predicts the relationship between the source and follow-up outputs. Researchers have proposed a series of operations, such as image scaling and image rotation, to perturb images for evaluating deep neural network based models [31]. In this study, we consider 2D input images, and identify MRs by adopting some of these operations.
We first consider the rotation operation. To rotate an image by a given angle, a rotation matrix is constructed and applied to the image (https://github.com/jrosebr1/imutils, accessed on 8 October 2020). Suppose that c = (c x , c y ) is the center of the rotation, θ is the rotation angle, and x denotes the scale factor. The rotation matrix is as follows:

$$M = \begin{bmatrix} \alpha & \beta & (1-\alpha)c_x - \beta c_y \\ -\beta & \alpha & \beta c_x + (1-\alpha)c_y \end{bmatrix}$$

where α = x · cos θ and β = x · sin θ. Based on this operation, MR3.1-MR3.3 are identified, which rotate I s by different angles to construct I f .
We next consider the conversion of RGB images into grayscale images. This can be implemented by using the ITU-R 601-2 luma transform (https://github.com/python-pillow/Pillow, accessed on 8 October 2020), where each channel of a pixel is expressed as 8 bits and is transformed as below:

L = R * 299/1000 + G * 587/1000 + B * 114/1000,

where R, G, and B are the RGB values in the range 0-255, and L is the resulting single-channel output. Based on this, MR3.4 is identified.
MR3.4: Suppose that I s is a RGB image. I f is constructed by converting I s to its corresponding grayscale image. As a result, A f is expected to be equivalent to A s .
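A minimal sketch of the luma transform underlying MR3.4, applied to a single 8-bit RGB pixel (Pillow's grayscale conversion applies the same ITU-R 601-2 weights channel-wise):

```python
def rgb_to_luma(r, g, b):
    """ITU-R 601-2 luma transform for one 8-bit RGB pixel.
    L = R * 299/1000 + G * 587/1000 + B * 114/1000, truncated to an int."""
    return (r * 299 + g * 587 + b * 114) // 1000
```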
We further consider another two types of image operations: image flipping and resizing. Flipping an image uses a method similar to that for rotating images but with different parameter configurations, while resizing an image can be implemented by applying scale factors along the horizontal and vertical axes. Based on these two types of operations, the following four MRs are identified.
MR3.5: I f is constructed by flipping I s horizontally. As a result, A f is expected to be equivalent to A s .
MR3.6: I f is constructed by flipping I s vertically. As a result, A f is expected to be equivalent to A s .
MR3.7: I f is constructed by magnifying the size of I s by a factor of 1.5. As a result, A f is expected to be equivalent to A s .
MR3.8: I f is constructed by reducing the size of I s by a factor of 1.5. As a result, A f is expected to be equivalent to A s .
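The flip and resize perturbations of MR3.5-MR3.8 can be sketched on a toy image represented as a nested list of pixel values; a real implementation would use an image library, and the nearest-neighbour resize here is only one possible interpolation choice:

```python
def flip_horizontal(img):
    """MR3.5: reverse each row of pixels."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """MR3.6: reverse the order of the rows."""
    return img[::-1]

def resize_nearest(img, factor):
    """MR3.7/MR3.8: nearest-neighbour resize by `factor` along both axes."""
    h, w = len(img), len(img[0])
    new_h = max(1, round(h * factor))
    new_w = max(1, round(w * factor))
    return [[img[min(h - 1, int(y / factor))][min(w - 1, int(x / factor))]
             for x in range(new_w)]
            for y in range(new_h)]
```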

Experimental Setup
A series of experiments were conducted to evaluate four QA systems by using all of the 17 MRs. This section presents our experimental setup, including the implementation of MRs, our subject QA systems, the datasets used in the experiments, and the source inputs of MRs.

MRs Implementation
All of the identified MRs were implemented in order to automatically evaluate QA systems by MT. Some specific MR implementations are presented as below.
MR1.3: MR1.3 replaces the comparative word in Q s with its antonym to construct Q f . To this end, we applied nltk (http://www.nltk.org/, accessed on 23 October 2020) for part-of-speech tagging, which can identify the comparative form of an adjective or adverb in Q s . We further searched for the antonym of the given word by using PyDictionary (https://github.com/geekpradd/PyDictionary, accessed on 23 October 2020).
MR1.4: MR1.4 changes the subject of Q s to construct Q f . In this study, we treated a word of Q s representing a PERSON entity as the subject of Q s . To identify and change the subject of Q s , we applied the Named Entity Recognizer StanfordNERTagger (https://nlp.stanford.edu/software/CRF-NER.html, accessed on 2 November 2020). Given a Q s , StanfordNERTagger was first applied to extract the PERSON entity from Q s . If the PERSON entity was successfully identified, we further replaced it with another PERSON entity that was not included in the passage.
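A simplified sketch of MR1.3's follow-up construction is shown below; a tiny hand-written antonym table stands in for the nltk tagger and the PyDictionary lookup, purely for illustration:

```python
# Illustrative entries only; the real implementation queries PyDictionary
# after nltk's POS tagger flags a comparative adjective or adverb.
COMPARATIVE_ANTONYMS = {
    "older": "younger",
    "larger": "smaller",
    "earlier": "later",
}

def build_mr13_followup(question):
    """Replace the first known comparative word with its antonym, or None."""
    words = question.split()
    for i, w in enumerate(words):
        antonym = COMPARATIVE_ANTONYMS.get(w.lower())
        if antonym:
            words[i] = antonym
            return " ".join(words)
    return None  # MR1.3 not applicable: no comparative word found
```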
MR3.1-MR3.3: These MRs rotate I s to construct I f . To automate this procedure, we utilized a package called imutils (https://github.com/jrosebr1/imutils, accessed on 2 November 2020), which provides a function rotate_bound for rotating images by given degrees.
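The rotation matrix described in Section 4 can be constructed directly; this sketch mirrors the 2x3 affine form used by OpenCV-style libraries such as imutils, with illustrative parameter names:

```python
import math

def rotation_matrix(cx, cy, theta_deg, scale=1.0):
    """2x3 affine rotation matrix around center (cx, cy), as in Section 4."""
    theta = math.radians(theta_deg)
    a = scale * math.cos(theta)  # alpha = x * cos(theta)
    b = scale * math.sin(theta)  # beta  = x * sin(theta)
    return [[a, b, (1 - a) * cx - b * cy],
            [-b, a, b * cx + (1 - a) * cy]]
```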
To automatically check the relationship between A s and A f , we employed the bert-as-service API [28], which determines the degree to which two given sentences have similar semantics. This API represents a sentence as a fixed-length vector according to BERT [9], based on which we calculated the cosine similarity of the vectors of A s and A f to determine whether they are equivalent or different.

Datasets and Source Inputs of MRs
The SQuAD 2.0 dataset [34] was used for preparing source inputs of TQA. SQuAD 2.0 contains over 150,000 questions. For VQA, we utilized the DAQUAR dataset [35], which contains 1449 images and 12,468 questions. A source input obtained from the SQuAD 2.0 dataset was a passage-question pair, while a source input extracted from the DAQUAR dataset was an image-question pair.
Nine MRs, namely, MR1.1-MR1.4 and MR2.1-MR2.5, were used to evaluate TQA systems, while 12 MRs, namely, MR1.1-MR1.4 and MR3.1-MR3.8, were used to evaluate VQA systems. Each MR was applied to individual source inputs in order to generate the relevant follow-up inputs. Note that an MR may not be applicable to some source inputs due to its preconditions and the operations used for constructing follow-up inputs. For example, MR1.3 operates on comparative words, and thus it cannot be applied to source inputs whose questions contain no comparative word. As a result, different MRs may have varying numbers of groups of source and follow-up inputs. In total, over 50,000 groups of source and follow-up inputs were used for evaluating TQA systems, and over 80,000 groups were used for evaluating VQA systems.

Results and Analysis
In this section, the MT results of evaluating the four subject QA systems are presented. Then, the capabilities of our subject QA systems are analyzed and discussed with respect to relevant MRs.

MT Results for QA Systems
To evaluate QA systems, the violation rate (VR) was used as the evaluation metric. Given an MR and a QA system, let y be the total number of groups of source and follow-up inputs of the MR that were applied to test the QA system, and let x be the number of groups of source and follow-up inputs with which the system violated the MR. The VR of this QA system with respect to this MR is x/y. Obviously, a lower VR value indicated a higher degree to which the QA system conformed to the relevant MR, revealing a higher satisfaction of the users' needs. Conversely, a higher VR value denoted that the QA system was more sensitive to the MR operations, and thus was more likely to produce unexpected answers for the given question. In particular, a violation rate of 0 means that no violation of the relevant MR was revealed in our experiments, suggesting that the system was likely to be robust with respect to the MR and all of its source and follow-up inputs.
Table 5. Violation rates of question answering (QA) systems ('*' denotes that the number of source inputs is 0, while '-' means that the relevant MR is not applicable to the system).
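The violation-rate metric can be computed as below; returning None for an empty input set mirrors the '*' entries of Table 5:

```python
def violation_rate(num_violations, num_groups):
    """VR = x / y; None when the MR has no applicable source inputs."""
    if num_groups == 0:
        return None  # corresponds to the '*' entries in Table 5
    return num_violations / num_groups
```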

Further Analysis
Based on the MT results reported in Table 5, an in-depth analysis was conducted to reveal the capabilities of the four QA systems from different perspectives. Each VR value reported in Table 5 represents the extent to which a system deviated from the properties specified by the relevant MR. Furthermore, as described in Section 4, different MRs operate on different input parameters and refer to different capabilities of QA. More importantly, a system may perform well in some aspects but badly in others, while different users may be concerned with different QA capabilities due to their distinct application scenarios. It is therefore important for the users to know the strengths and weaknesses of different systems such that appropriate systems can be selected to satisfy their needs. Because of this, we compared the subject QA systems by inspecting the VR values of MRs pertaining to specific QA capabilities, in order to reveal the strengths and weaknesses of individual systems from different aspects.

QA's Capability of Understanding and Answering Questions
Both TQA and VQA have to understand the question and then give an appropriate answer to it. When using these systems, the users may want to know which QA system has a better capability of processing questions. Four of the proposed MRs, namely, MR1.1-MR1.4, focus on this aspect by describing the relationships among source and follow-up inputs that differ exactly in the input questions. Figure 3 compares the TQA systems and VQA systems based on MR1.1-MR1.4. As shown in Figure 3a, Transformers-TQA had lower VR values than AllenNLP-TQA for three out of the four MRs. It can be further observed from Table 5 that the average VR value of Transformers-TQA on these four MRs was also much lower than that of AllenNLP-TQA. Therefore, as compared with AllenNLP-TQA, Transformers-TQA exhibited better capabilities of understanding and answering questions. Similarly, as shown in Figure 3b, the two VQA systems also had varying violation rates for MR1.1 (the other three MRs had no source inputs for VQA, and thus no data were collected). As compared with CloudCV-VQA, AllenNLP-VQA had a relatively lower violation rate with respect to MR1.1, suggesting that AllenNLP-VQA was more robust to the letter case of input questions. Moreover, Figure 3b also shows that the two VQA systems under investigation had a better capability of handling questions with lowercase or uppercase letters than the two TQA systems, because the former two had much lower VR values (namely, 10.34% and 20.14%) than the latter (namely, 65.10% and 91.11%) with respect to MR1.1.

TQA's Capability of Understanding and Processing Passages
TQA answers a given question based on a passage; it thus needs to understand and process the passage for extracting information related to the given question. We defined five MRs, MR2.1-MR2.5, for investigating TQA's capability of understanding and processing input passages. Figure 4 compares the violation rates of the two TQA systems (AllenNLP-TQA and Transformers-TQA) with respect to MR2.1-MR2.5. Firstly, both TQA systems had much lower violation rates for MR2.2-MR2.5 (VR values lower than 35%) as compared with those for MR2.1 (VR values higher than 65%). These results reveal that the two TQA systems were much more robust to adding, removing, or replacing some contents of the input passage, but were less robust to the conversion of lowercase letters to uppercase letters in the input passage. Secondly, Transformers-TQA had violation rates similar to AllenNLP-TQA for MR2.2-MR2.4 (the discrepancies between the VR values of the two systems with respect to individual MRs were about 2%), but had very different violation rates from AllenNLP-TQA for the other two MRs (the VR value of the former was about 20% higher than that of the latter with respect to MR2.1, while the VR value of the former was about 10% lower than that of the latter with respect to MR2.5). In other words, the two TQA systems had an equivalent capability of dealing with passages containing sentences in different orders as well as containing more or fewer irrelevant sentences. Nevertheless, AllenNLP-TQA did better at handling passages containing lowercase or uppercase letters, while Transformers-TQA performed better when dealing with passages containing minor replaced contents.

VQA's Capability of Understanding and Processing Images
While TQA understands and processes the input passage for answering a question, VQA relies on the input image for giving an answer to a question. We identified eight MRs, MR3.1-MR3.8, for investigating the VQA's capability of understanding and processing images.

Further Analysis and Discussion
TQA and VQA have the commonality that they both need to understand and process the given question. Figure 3b compares our four subject systems with respect to MR1.1, showing that the two VQA systems had relatively better capabilities than the two TQA systems in terms of processing questions containing lowercase or uppercase letters. However, TQA and VQA differ in that the former relies on a passage of text while the latter relies on an image. Concerning these aspects, we respectively used MR2.x and MR3.x for evaluating TQA and VQA. It can further be found from Table 5 that the two TQA systems generally had lower violation rates for MR2.x (which focus on TQA's capability of understanding and processing passages) as compared with the VQA systems' violation rates for MR3.x (which concentrate on VQA's capability of understanding and processing images). These results indicate that, compared with the image processing capability of the two VQA systems, the two TQA systems had a better capability of processing passages. Furthermore, Table 5 presents the average violation rates across all applied MRs for individual subject QA systems (as shown in the last row of Table 5). Based on the average VR values, it was found that the two TQA systems generally performed better than the two VQA systems, because the former two had average VR values of 44.71% and 32.11% while the latter two had average VR values of 56.68% and 48.47%.
In summary, the proposed 17 MRs encode various characteristics of QA systems, based on which the MT results revealed the capabilities of our subject TQA and VQA systems from different perspectives. On one hand, the MT results report the VR values for every subject system with respect to individual MRs, which can help users gain a better understanding of the capabilities and limitations of the relevant systems. For example, by inspecting the VR values of AllenNLP-TQA, users can find that this system was good at extracting question-related information from the passage either with or without irrelevant sentences (as suggested by its VR value of 2.05% for MR2.3), but was largely incapable of properly understanding questions containing comparative words (as indicated by its VR value of 92.98% for MR1.3). On the other hand, the MT results support the comparison of different QA systems along different aspects, thus providing guidance for users to select appropriate QA systems for their specific needs. For example, if users want a VQA system that is insensitive to the use of lowercase or uppercase letters in the question description, they can check the VQA systems' VR values with respect to MR1.1, because MR1.1 encodes the relationship between source and follow-up inputs that reflects the degree to which a QA system is sensitive to the letter case of a question. In our experiments, AllenNLP-VQA had a VR value of 10.34% for MR1.1, while CloudCV-VQA had a VR value of 20.14%; based on this result, it would be natural for users to choose AllenNLP-VQA over CloudCV-VQA. Note that different users may have varying needs and expectations of QA systems, and thus the MT results of different MRs should be consulted in different application scenarios.
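As an illustration of this MR-guided selection, the following sketch picks the system with the lowest VR for a given MR and computes a per-system average across MRs. The dictionary values are the MR1.1 VR percentages reported above; `select_system` and `average_vr` are hypothetical helpers introduced only for illustration, not part of any released tooling.

```python
# VR values (%) for MR1.1 (letter case of the question), taken from our
# experimental results for the two VQA systems.
vr_mr11 = {"AllenNLP-VQA": 10.34, "CloudCV-VQA": 20.14}

def select_system(vr_by_system):
    """Pick the system with the lowest violation rate for the MR of
    interest (a lower VR means the system is less sensitive to the
    property that the MR encodes)."""
    return min(vr_by_system, key=vr_by_system.get)

def average_vr(vr_by_mr):
    """Average VR of one system across all MRs applied to it, as in the
    last row of Table 5."""
    return sum(vr_by_mr.values()) / len(vr_by_mr)
```

A user concerned with letter-case robustness would call `select_system(vr_mr11)`, while `average_vr` supports the overall comparison across systems.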

Related Work
A large body of studies focuses on assessing QA systems' robustness. To construct input data, various strategies have been proposed, such as adversarially inserting sentences into the input passages of TQA [12], perturbing questions with respect to high-attribute terms [14], rephrasing questions by applying linguistic variations [36], introducing noise into questions [15,37], and applying universal adversarial triggers [38]. Another line of work focuses on improving or explaining QA systems' robustness. Chen et al. [17] proposed a model for TQA based on sub-part alignment, which was able to filter out bad predictions and thus achieved higher robustness, while Patro et al. [16] proposed a collaborative correlated network for providing visual and textual explanations of VQA's answers. Although robustness is important for evaluation, these studies are orthogonal to our focus on assessing the degree to which QA systems satisfy users' specific expectations. Moreover, most existing studies focused on either TQA or VQA, and proposed strategies for changing only part of an input (namely, the question or the passage). In contrast, our study proposes a list of MRs involving various operations that can be applied to both the input questions and the input passages (input images) of TQA (VQA).
Apart from the work on QA systems' robustness, Ribeiro et al. [39] evaluated the logical consistency of QA systems. They transformed a question and derived the implied answer by considering the positive and negative implications of the given question with respect to the context. While useful, this method still did not take the other parts of the input (i.e., passages or images) into account, and thus the evaluation remained restricted to part of QA's capabilities.
Ribeiro et al. [15] introduced MT to one category of QA systems, namely TQA, and proposed using MT to evaluate TQA's robustness. However, their work identified only one MR, which introduced a specific type of noise (namely, typos) into the input passage or the input question to generate follow-up inputs. In contrast, our study proposes applying MT as a comprehensive, user-oriented evaluation method for both TQA and VQA. We have identified a large number of MRs for QA, including MRs that reflect systems' robustness (such as the MRs applying the capitalization operation to the inputs) and MRs that focus on particular system functionalities (such as the MRs applying word replacement). Moreover, these MRs can construct diverse test data with changes to both the input passages (images) and the questions of TQA (VQA).

Conclusions
In recent years, question answering (QA) has emerged as a popular and powerful tool in various domains, due to its capability of enabling the machine to understand and answer questions posed in natural language. Meanwhile, recent studies have adopted various techniques to evaluate QA systems, revealing a series of problems concerning different aspects. In this paper, we focused on the evaluation of two typical categories of QA systems, namely, textual QA (TQA) and visual QA (VQA). We applied the technique of metamorphic testing (MT) to QA, and identified 17 metamorphic relations (MRs) by considering users' varying expectations of QA systems. In the experiments, we evaluated two TQA systems and two VQA systems using all of the MRs, and our experimental results reveal their capabilities from different perspectives. These results further suggest that the proposed MRs are capable of encoding the expected characteristics of QA, and that MT can be an effective evaluation method for QA.