Article

Evaluation of Chinese Natural Language Processing System Based on Metamorphic Testing

School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(8), 1276; https://doi.org/10.3390/math10081276
Submission received: 31 December 2021 / Revised: 28 March 2022 / Accepted: 6 April 2022 / Published: 12 April 2022
(This article belongs to the Section Mathematics and Computer Science)

Abstract

A natural language processing system enables effective communication between humans and computers in natural language. Because its evaluation relies on large amounts of labeled data and on human judgment, systematically evaluating its quality remains a challenging task. In this article, we use metamorphic testing technology to evaluate natural language processing systems from the user's perspective, helping users better understand the functionalities of these systems and then select the appropriate system according to their specific needs. We defined three metamorphic relation patterns, each focusing on characteristics of a different aspect of natural language processing. On this basis, we defined seven metamorphic relations and chose three tasks (text similarity, text summarization, and text classification) to evaluate the quality of the systems, with Chinese as the target language. We extended the defined abstract metamorphic relations to these tasks, generating seven specific metamorphic relations for each task. We then judged whether the metamorphic relations were satisfied for each task and used them to evaluate the quality and robustness of the natural language processing systems without reference outputs. We further applied the metamorphic tests to three mainstream natural language processing systems (the BaiduCloud API, AliCloud API, and TencentCloud API) on the PAWS-X, LCSTS, and THUCNews datasets. The experiments reveal the advantages and disadvantages of each system. These results further show that metamorphic testing can effectively test natural language processing systems without annotated data.

1. Introduction

Natural language processing (NLP) techniques facilitate communication between machines and human beings and have thus recently received increasing attention. Among various NLP systems, the tasks of text similarity [1], text summarization [2], and text classification [3] are widely used in specific applications. The ability of NLP systems to process text has greatly improved in recent years, but several problems remain.
At present, the evaluation indicators [4,5] for text similarity are limited and cannot effectively evaluate similarity from the user's perspective. For text summarization [6,7,8], performance evaluation has always been a major challenge for researchers because ideal reference abstracts are difficult to define: human beings can write different correct abstracts for the same original document or document set according to different topics and perspectives. However, existing datasets generally provide a single reference abstract, which lacks accuracy and diversity. Some evaluation methods cannot determine the quality of abstracts at the semantic level, while manual evaluation incurs labor and time costs, and its focus varies with the question at hand. The same problem appears in text classification. The existing evaluation indicators cannot indicate whether a model understands text at the semantic level as human beings do. In addition, with the emergence of noisy samples, even small amounts of noise may lead to significant changes in decision confidence or even decision reversal. Therefore, it is necessary to verify the semantic representation ability and robustness of models in practice; moreover, we need evaluation indicators from the user's perspective.
For example, a very small change in the input of a sentence similarity calculation can cause a drastic change in the output similarity score, and a small disturbance in a text summarization task can cause a huge difference in the system's output. As shown in Figure 1 and Figure 2, we entered two paragraphs into Baidu's online text summarization system. The two paragraphs differ only slightly, as highlighted by the red rectangles. From the user's point of view, they express similar meanings, although they involve different persons' information. Nevertheless, the outputs produced by Baidu's natural language processing system are quite different.
Currently, NLP is a very important research topic and is widely applied across industries. From this point of view, the evaluation of NLP systems has important practical and research significance, and improving model robustness is a hot spot and challenge in current research. At present, most state-of-the-art natural language processing models use deep learning [9]. For example, a multi-feature fusion neural network [10] achieves good results in text similarity, and ConvS2S-based [11] automatic text summarization has also improved greatly. In recent years, XLNet [12] and RoBERTa [13] have shown unique advantages in feature extraction and semantic mining and achieve excellent performance in text classification. However, these deep learning models are black boxes: although the training process is reproducible, the implicit semantics and outputs lack interpretability.
In addition, the evaluation of NLP systems relies on a large number of labeled datasets [14,15,16], which are difficult to obtain and usually rely on manual annotation. At the same time, the evaluation indicators are limited, which leads to one-sided evaluation results that cannot truly reflect the performance of the NLP systems [17]. Moreover, most testing methods for NLP systems target English, and English and Chinese are quite different [17]. Unlike English, which has a small set of letters, Chinese is composed of a large number of characters. Therefore, English attack methods such as deleting characters, swapping letters, or perturbing tense, capitalization, and singular/plural forms cannot be applied to Chinese. The grammatical structures of Chinese and English also differ considerably: Chinese is written without word delimiters, and very small changes may bring significant semantic changes.
To address these issues, we propose an NLP system evaluation method based on metamorphic relations (MRs). This research mainly evaluates the NLP system from the user’s perspective to reveal the extent to which the NLP system meets the user’s expectations. This evaluation helps users better understand the NLP system and enables them to choose the appropriate NLP system according to their specific needs. Therefore, we recommend the use of metamorphic testing [18] (MT) technology. MT is a property-based testing technology that has shown promising effectiveness in various software engineering activities such as image generation [19], fault detection [20,21], and optimization or encryption programs [22,23].
The key component of MT is the MR, which encodes system properties through relationships among multiple related inputs and outputs. MT was originally used for software verification; in recent years, it has been successfully extended to system validation and comprehension [24,25].
In this research, we designed three metamorphic relation patterns (MRPs) for NLP systems; based on the characteristics and needs of each task, we proposed seven MRs. We applied all MRs to three NLP systems (Baidu [26], Ali [27], Tencent [28]), demonstrating the capabilities and limitations of each NLP system under study. In summary, this article makes three major contributions.
  • We propose to apply MT to evaluate Chinese NLP systems from the perspective of users. By considering different aspects of the NLP system, we propose three categories of MRPs, and then define seven MRs under these MRPs. According to the characteristics of each task, each MR is concretized to evaluate the quality of different NLP systems;
  • We conduct experiments on three common NLP systems (Baidu, Ali, Tencent) and demonstrate the feasibility and effectiveness of MT in evaluating NLP systems;
  • We conduct a comparative analysis of NLP systems, revealing their capabilities and demonstrating how the results of the analysis can help users choose the appropriate NLP system according to their specific needs.
The rest of this article is organized as follows: Section 2 introduces MT. Section 3 describes the overall approach and provides lists of MRs identified for NLP systems. In Section 4, our experimental setup is introduced. The experimental results are presented and analyzed in Section 5. Section 6 discusses related work, and Section 7 summarizes the research.

2. Metamorphic Testing

Normally, testing verifies whether the results of test case execution are correct. In traditional software testing, the usual practice is to compare the expected outputs with the actual outputs. However, in many cases, given the input of a system, it is difficult to distinguish the corresponding expected, correct behavior from potentially incorrect behavior. Such problems are called oracle problems [29]. To alleviate this kind of problem, Chen et al. [30] in Australia put forward the concept of metamorphic testing (MT). More and more studies have applied MT and shown that it is very effective in alleviating oracle problems. MT is a property-based software testing method. Given one or more test cases, MT generates one or more follow-up test cases. The testers do not need to know the execution result of the program in advance; they only need to check whether the relationship between multiple test cases satisfies the expected relationship. The difference between MT and traditional testing technology is that it does not focus on verifying each individual output of the software under test (SUT) but rather checks the relationships between the inputs and outputs of the SUT across multiple executions. Such a relationship is called a metamorphic relation (MR). An MR is a necessary property of the intended program function: if an MR is violated, the SUT must be defective.

2.1. Definition of MRs

In MT, the SUT is checked according to the prescribed MR. We chose a classic case of MR for the purpose of illustration. Suppose that the SUT P implements the objective function f(x) = sin(x). For a random value x, we cannot intuitively obtain the expected result sin(x) without the help of other tools. We can compare the results of sin(x) and the results of sin(π − x) to determine the correctness of P execution. Therefore, according to the characteristics of the f(x) function, an MR can be defined. The MR is as follows:
sin(x) = sin(π − x)
For example, for the source test case x1 = 0.2, the follow-up test case is x2 = π − 0.2, and MT checks whether the MR sin(x1) = sin(x2) is satisfied. If the equation does not hold, that is, the result does not satisfy the MR, the SUT has certain defects.
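This check can be sketched in a few lines of Python (a minimal illustration; `sut` stands for any implementation of sine under test):

```python
import math

def satisfies_sin_mr(sut, x, tol=1e-9):
    """Check the MR sin(x) = sin(pi - x) without knowing the expected output."""
    return abs(sut(x) - sut(math.pi - x)) <= tol

# A correct implementation satisfies the MR for any x:
assert satisfies_sin_mr(math.sin, 0.2)
# A faulty implementation can violate the MR, revealing a defect:
assert not satisfies_sin_mr(lambda x: math.sin(x) + 0.01 * x, 0.2)
```

Note that no oracle value of sin(0.2) is needed: only the relationship between the two outputs is checked.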

2.2. Metamorphic Relation Pattern

In early MT research, researchers usually put forward specific MRs for the specific problems under study. To support the systematic use of MRs, Zhou et al. [26] proposed the concept of general MRs, and their experimental results show that general MRs have strong fault detection capabilities. Zhou et al. [31] used them to detect failures in search engines, but their work has certain limitations: they only considered SUTs for a specific task, namely software for search operations, whose outputs are lists of search results. Therefore, their patterns are not generic enough to cover other types of features, functions, and application domains.
Segura et al. [32] introduced the term Metamorphic Relation Output Pattern (MROP), defined as an abstract relationship between source outputs and follow-up outputs from which multiple specific MRs can be derived. Their work opened up a broad new direction for the study of "general relation pattern" MT. To improve the universality of MRs, Zhou et al. [25] formally defined the metamorphic relation pattern (MRP): an abstract concept that describes a set, possibly infinite, of MRs. MRPs simplify the systematic identification of MRs and can serve as a starting point for the automatic inference of MRs that conform to patterns. Moreover, they proposed the Metamorphic Relation Input Pattern (MRIP), based on the concept of symmetry, as a method to convert source inputs into follow-up inputs.
For the field of NLP, MT has been applied to various tasks, such as machine translation [5,6] and question answering systems [7]. In these applications, MRs are identified for specific tasks. Therefore, drawing on the concept of the MRIP, we propose three MRPs that can be widely used in the field of NLP.

3. Our Approach

This study evaluates NLP systems based on different features and user requirements. For NLP, we analyze the characteristics of the key parts of the text, which is crucial to the construction of MRs. We define the MRs we need by analyzing the characteristics of words, the smallest units that constitute the meaning of a sentence.
Figure 3 outlines the method. Given a set of source inputs (i.e., sentence pairs, texts, etc.) and lists of MRs, a set of follow-up inputs is generated. These are the follow-up inputs of sentence similarity, text summarization, and text classification. By executing the NLP system using source inputs and follow-up inputs related to an MR, their source outputs and follow-up outputs can be obtained. Then, we check the relationship between source outputs and follow-up outputs.
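The workflow can be sketched as a generic loop (illustrative only; `system`, `transform`, and `check` are placeholders for an NLP system, an MR's input transformation, and its output relation):

```python
def run_metamorphic_test(system, source_inputs, transform, check):
    """Execute the SUT on source and follow-up inputs and collect MR violations."""
    violations = []
    for src in source_inputs:
        follow = transform(src)      # build the follow-up input from the MR
        out_src = system(src)        # source output
        out_follow = system(follow)  # follow-up output
        if not check(out_src, out_follow):
            violations.append((src, follow, out_src, out_follow))
    return violations

# Toy example: a length-based "system" should be insensitive to reversal.
found = run_metamorphic_test(len, ["abc", "hello"], lambda s: s[::-1],
                             lambda a, b: a == b)
```

The same loop applies to all three tasks; only `transform` and `check` change per MR.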

3.1. MRPs for NLP

In order to use MT to evaluate NLP systems, we define a series of MRs. This paper proposes three MRPs and seven MRs for NLP systems. For the three tasks of text similarity, text summarization, and text classification, we define specific MRs to better reveal the performance of the systems. We describe these MRs in detail in the following subsections and give some illustrative examples.

3.1.1. Property-Based Substitution MRP (MRPpbs)

Property-based substitution MRP (MRPpbs) is used to study the sensitivity of NLP systems to words of the same property. According to the MRs we defined, the NLP system should process words of the same property more similarly. MRPpbs aims to replace the same property words in the source inputs to generate follow-up inputs.
We used HanLP [33] for part-of-speech (POS) tagging. It applies a pruning strategy on the basis of BERT, so that each task retains only the attention heads that are strictly necessary to obtain the best performance. BERT is essentially a stack of multi-head attention layers, and there is a broad consensus that different layers learn different knowledge. A linear layer is used as the POS decoder: it obtains the final embedding of each token from BERT and generates an output vector in which each dimension gives a score for a specific POS tag. For an input, the attention sub-layer extracts three vectors, namely the query, key, and value vectors, and packs them together into matrices Q, K, and V, respectively; dk is the dimension of the query vector. Based on this, the self-attention calculation is conducted as below [34].
Attention^(j)(Q, K, V) = z_j · softmax(QK^T / √d_k) V
In order to achieve efficient continuous optimization, each z_j is then relaxed into a random variable drawn independently from a continuous distribution. Specifically, the relaxed z is re-parameterized as G_α(u) by the inverse of its cumulative density function (CDF).
Then, the Hard Concrete Distribution [35] is chosen for z, which gives the following differentiable form of G_α(u), where (l, r) defines the interval into which g_α(u) is stretched (l < 0, r > 1):
g_α(u) = sigmoid(log u − log(1 − u) + α)
G_α(u) = min(1, max(0, g_α(u) × (r − l) + l))
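Sampling a gate value under these definitions can be sketched as follows (a sketch of the equations above, not HanLP's actual implementation; the stretch interval (l, r) = (−0.1, 1.1) is an assumed default):

```python
import math
import random

def hard_concrete_gate(alpha, l=-0.1, r=1.1, rng=random.random):
    """Sample z = G_alpha(u): a sigmoid relaxation stretched to (l, r), clipped to [0, 1]."""
    # Clamp u away from 0 and 1 so the logs below stay finite.
    u = min(max(rng(), 1e-6), 1.0 - 1e-6)
    g = 1.0 / (1.0 + math.exp(-(math.log(u) - math.log(1.0 - u) + alpha)))  # g_alpha(u)
    return min(1.0, max(0.0, g * (r - l) + l))                              # G_alpha(u)
```

A large positive α pushes the gate toward 1 (keep the head); a large negative α pushes it toward exactly 0 (prune the head), which is what makes the hard concrete distribution suitable for pruning.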
Different from traditional static pruning strategies, HanLP [33] uses a dynamic pruning (DP) strategy: the encoder dynamically adapts to the pruning performed during training, which makes training more effective and thereby improves the efficiency of word segmentation and labeling. The attended-value matrix H ∈ ℝ^(n×d) is created by multiplying the attention matrix A ∈ ℝ^(n×n) with the value matrix V ∈ ℝ^(n×d), such that H = AV (n: sentence length; d: embedding size). The centroid of each label is obtained through PseudoCluster [35] before being used to predict the labels of all spans.
Through HanLP, we can mark the POS of the text and find the replaced words. The detailed implementation process is described in Section 5.1.
Then, we define a function to represent the replacement relationship:
Tf = ReplaceX1→X2(Ts)
  • Ts represents the given source text, X1 is the word to be replaced, X2 is the replacement word, and Tf is the follow-up text generated by replacing X1 with X2 in Ts.
  • Set the source input as Is, the follow-up input as If. Different MRPs alter Is in different ways to construct If and also encode the relationship that is expected to be satisfied by Os and Of. Based on the analysis of the system and the expectations of users, we have defined four MRPpbss, which can be described in detail according to the specific tasks as follows:
    MRPpbs1:
    Replace a name or pronoun in Is to generate a follow-up input If, that is, If = Replacename1→name2(Is). Then, Os and Of should be consistent.
    MRPpbs2:
    Replace a country in Is to generate a follow-up input If, that is, If = Replacecountry1→country2(Is). Then, Os and Of should be consistent.
    MRPpbs3:
    Replace an occupation in Is to generate a follow-up input If, that is, If = Replaceoccupation1→occupation2(Is). Then, Os and Of should be consistent.
    MRPpbs4:
    Replace punctuation in Is to generate a follow-up input If, that is, If = Replacepunctuation1→punctuation2(Is). Then, Os and Of should be consistent.
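These replacement MRs can be sketched with a simple string substitution (a simplification: in the paper the target word is located via HanLP POS tagging, while here the word and its substitute are given directly):

```python
def replace_word(ts, x1, x2):
    """Tf = Replace_{X1 -> X2}(Ts): substitute every occurrence of X1 with X2."""
    return ts.replace(x1, x2)

# MRPpbs1: replace a person name to build the follow-up input.
source_text = "小明喜欢踢足球。"
follow_text = replace_word(source_text, "小明", "小红")  # "小红喜欢踢足球。"
```

Both texts are then submitted to the system under test, and the outputs Os and Of are compared for consistency.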

3.1.2. Synonym-Based Substitution MRP (MRPsbs)

Synonym-based substitution MRP (MRPsbs) is used to study the sensitivity of the NLP system to the synonyms. According to the MRPsbss we defined, the NLP system should process words with similar meanings similarly. MRPsbs aims to replace some words in the source inputs to generate follow-up inputs and to find the defects of the system by directly comparing the results of source inputs and follow-up inputs. Similarly, we have defined two MRPsbss as follows:
MRPsbs5:
Replace nouns in Is with synonyms to generate a follow-up input If, that is, If = Replacenoun1→noun2(Is). Then, Os and Of should be consistent.
MRPsbs6:
Replace verbs in Is with synonyms to generate a follow-up input If, that is, If = Replaceverb1→verb2(Is). Then, Os and Of should be consistent.
We utilized the Chinese Synonyms toolkit published by Wang et al. [36]. This tool is widely used in NLP tasks such as similarity calculation, semantic shift, and keyword extraction. Given an input word, it returns similar words and their corresponding scores, which range from zero to one; the closer a score is to 1, the more similar the words are. To ensure that word meanings remain close, we replace a word with the candidate that has the highest score, requiring that score to be no less than 0.8 [37].
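The selection rule can be sketched as follows (the `nearby` lookup is a stand-in for the Synonyms toolkit's candidate/score query, and the toy lexicon is purely illustrative):

```python
def pick_synonym(word, nearby, threshold=0.8):
    """Return the highest-scoring synonym whose score is at least the threshold, else None."""
    candidates, scores = nearby(word)
    best_word, best_score = max(zip(candidates, scores),
                                key=lambda pair: pair[1], default=(None, 0.0))
    return best_word if best_score >= threshold else None

# Toy lexicon standing in for the real toolkit's (words, scores) output:
lexicon = {"喜欢": (["喜爱", "爱好"], [0.92, 0.81])}
chosen = pick_synonym("喜欢", lambda w: lexicon.get(w, ([], [])))  # "喜爱"
```

Words without a sufficiently similar candidate are left unreplaced, which keeps the follow-up input semantically close to the source input.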

3.1.3. Reordering MRP (MRPr)

We define the reordering MRP (MRPr) based on the change of the text order. In the NLP system, from the user’s point of view, the order of inputs should not have a great impact on outputs. MRPr aims to study the influence of the sequence of source inputs on the system and to find the defects of the system by directly comparing the results of the source inputs and follow-up inputs. We define the MRPr as follows:
MRPr7:
Change the order of sentences of Is to generate the follow-up input If, then Os and Of should be consistent.

4. Metamorphic Relations for Specific NLP Tasks

4.1. Text Similarity

Text similarity is divided into word similarity, sentence similarity, text similarity, and so on. Measuring the similarity between words, sentences, paragraphs, and documents is an important part of various tasks such as information retrieval, automatic question answering [38], word sense disambiguation, automatic paper scoring, short answer scoring, and machine translation [39]. Text similarity refers to the similarity between two texts (sentences) and is widely used in search engines, paper identification, machine translation, automatic response, named entity recognition, spelling error correction, and other fields. In a text similarity task, the user inputs two pieces of text and the system returns a value indicating their degree of similarity. Take sentence similarity as an example: as shown in Table 1, the two sentences are entered into the NLP system, which calculates their similarity and gives a return value. The return value ranges from 0 to 1, where 0 means the two sentences have no similarity and 1 means they are identical. The larger the return value, the higher the similarity of the sentences.
In the field of text similarity, taking sentence similarity as an example, we concretize the MRPs defined above and obtain a set of MRs for text similarity tasks. We denote the source input as (S1, S2) and the follow-up input generated by an MR as (Sf1, Sf2). The source output and follow-up output are recorded as sim (S1, S2) and sim (Sf1, Sf2), which are expected to satisfy the relevant MR. Generally speaking, the output of a text similarity system is a decimal between 0 and 1, divided into five intervals [37] ([0.8, 1.0], [0.6, 0.8], [0.4, 0.6], [0.2, 0.4], [0, 0.2]), which respectively represent "perfect correlation", "strong correlation", "moderate correlation", "weak correlation", and "zero correlation". We concretize the seven above-mentioned MRs, which are described as follows:
MR1.1–MR4.1:
Replace the words X1 in (S1, S2) with words X2 of the same property to generate follow-up inputs (Sf1, Sf2), that is, (Sf1, Sf2) = ReplaceX1→X2(S1, S2). Then, sim (S1, S2) and sim (Sf1, Sf2) should satisfy: sim (S1, S2) ∈ C and sim (Sf1, Sf2) ∈ C, where C is any one of the five intervals.
MR5.1–MR6.1:
Replace the words X1 in (S1, S2) with synonyms X2 to generate follow-up inputs (Sf1, Sf2), that is, (Sf1, Sf2) = ReplaceX1→X2(S1, S2). Then, sim (S1, S2) and sim (Sf1, Sf2) should satisfy: sim (S1, S2) ∈ C and sim (Sf1, Sf2) ∈ C, where C is any one of the five intervals.
MR7.1:
Sim (S1, S2) and sim (S2, S1) should satisfy: sim (S1, S2) ∈ C and sim (S2, S1) ∈ C, where C is any one of the five intervals.
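For MR1.1–7.1, checking the relation amounts to verifying that both similarity scores fall into the same interval; a minimal sketch:

```python
def interval_of(score):
    """Map a similarity score to one of the five correlation intervals (0 = perfect, 4 = zero)."""
    for i, lower in enumerate([0.8, 0.6, 0.4, 0.2, 0.0]):
        if score >= lower:
            return i
    return 4

def similarity_mr_holds(sim_source, sim_follow):
    """The MR is satisfied when both outputs lie in the same interval C."""
    return interval_of(sim_source) == interval_of(sim_follow)
```

For instance, scores of 0.723585 and 0.776188 both map to the "strong correlation" interval, so the MR holds for that pair.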
Table 2 further explains the above MR1.1–7.1:
In MR1.1, we replace “小明” (Xiaoming) in the source input with “小红” (Xiaohong) to generate the follow-up input. Comparing the source outputs and the follow-up outputs, we can see that they both fall into the “strong correlation” interval (sim (S1, S2) = 0.723585, sim (Sf1, Sf2) = 0.776188); that is, sim (S1, S2) and sim (Sf1, Sf2) satisfy the MR.

4.2. Text Summarization

Text summarization refers to the refinement and condensation of massive amounts of data into concise and intuitive summaries that capture the main content users are concerned about, so that users can quickly understand and browse it. Text summaries help people cope with information overload and improve work efficiency. The NLP system extracts the main content from the text and produces a short output. However, because ideal reference abstracts for original documents or document collections are lacking, evaluating the performance of text summarization is very difficult and ideal evaluation criteria are missing [2].
Text summarization refers to summarizing a given single document or multiple documents. The abstract should be kept as concise as possible while reflecting the important content of the original document. High-quality text summaries play an important role in multiple scenarios. For example, in information retrieval, summaries can replace the original documents in the index, which effectively shortens retrieval time; summarization can also reduce redundant information in retrieval results and improve the user experience. In a text summarization task, the user provides a text input and the system returns a string representing its summary. Taking short news summarization as an example, as shown in Table 3, inputting the following short news text into the NLP system produces a short summary through the system.
In the field of text summarization, we concretize the MRPs defined above and obtain a set of MRs for the text summarization tasks. We set the source input as Ts and the follow-up input generated by the MR as Tf. The outputs are marked as sum (Ts) and sum (Tf), which are expected to satisfy the relevant MR.
We concretize the seven above-mentioned MRs, which are described as follows:
MR1.2–4.2:
Replace X1 in Ts with X2 of the same property to generate a follow-up input Tf, that is, Tf = ReplaceX1→X2(Ts). Then, sum (Ts) and sum (Tf) should satisfy: sum (Tf) = ReplaceX1→X2(sum (Ts)).
MR5.2–6.2:
Replace X1 in Ts with synonyms X2 to generate a follow-up input Tf, that is, Tf = ReplaceX1→X2(Ts). Then, sum (Ts) and sum (Tf) should satisfy: sum (Tf) = ReplaceX1→X2(sum (Ts)).
MR7.2:
Change the order of sentences to generate follow-up input Tf, and then sum (Ts) and sum (Tf) should satisfy: sum (Ts) = sum (Tf).
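These checks can be sketched directly (a simplification using exact string comparison; in practice the outputs are compared as described in Section 5.1):

```python
def substitution_summary_mr_holds(sum_source, sum_follow, x1, x2):
    """MR1.2-6.2: applying the same X1 -> X2 replacement to the source summary
    should reproduce the follow-up summary."""
    return sum_source.replace(x1, x2) == sum_follow

def reorder_summary_mr_holds(sum_source, sum_follow):
    """MR7.2: reordering the input sentences should leave the summary unchanged."""
    return sum_source == sum_follow
```

Neither check needs a reference abstract: only the two system outputs are compared, which is how MT sidesteps the oracle problem for summarization.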
Table 4 gives a further explanation, taking MR7.2 as an example:
In Table 4, input/output (ORI) denotes the source input/output, and input/output (FLU) denotes the follow-up input/output. We change the order of the sentences of the source input to generate the follow-up input. Comparing the source output sum (Ts) and the follow-up output sum (Tf), we can see that they satisfy sum (Ts) = sum (Tf), which conforms to the MR.

4.3. Text Classification

In the era of big data, text data on the Internet is increasing day by day, and it is particularly important to use text classification technology to organize and manage such massive data scientifically. Text is the most widely distributed information carrier with the largest amount of data, and how to effectively organize and manage it is a difficult problem that urgently needs to be solved. Text classification is a basic NLP task whose purpose is to sort and classify text resources; it is also a key link in solving text information overload. It is often used in news recommendation, digital libraries, public opinion analysis [40], mail filtering [41], and other fields, and provides strong support for the query and retrieval of text resources.
Text classification refers to machines classifying text into different categories according to certain rules, divided into multi-label and single-label text classification. In a text classification task, the user provides a text input and the system returns a character string indicating the classification of the text. Take news text classification as an example. Common news categories include finance, lottery, real estate, stocks, home furnishing, education, technology, society, fashion, current affairs, sports, constellations, games, entertainment, and so on. As shown in Table 5, after the following piece of news text is input into the NLP system, the output label given by the system shows that the text belongs to the sports category.
Similar to text summarization, we set the source input as Ts and the follow-up input generated by an MR as Tf. We set the outputs as Ls and Lf, which are expected to satisfy the relevant MR. We concretize the seven above-mentioned MRs, which are described as follows.
MR1.3–4.3:
Replace X1 in Ts with X2 of the same property to generate a follow-up input Tf, that is, Tf = ReplaceX1→X2(Ts). Then, Ls and Lf should satisfy: Ls = Lf.
MR5.3–6.3:
Replace X1 in Ts with synonyms X2 to generate a follow-up input Tf, that is, Tf = ReplaceX1→X2(Ts). Then, Ls and Lf should satisfy: Ls = Lf.
MR7.3:
Change the order of sentences in the text to generate follow-up input Tf. Then, Ls and Lf should satisfy: Ls = Lf.
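All three classification MRs reduce to checking label equality; a minimal sketch, with a toy keyword classifier standing in for a real NLP system:

```python
def classification_mr_holds(classify, source_text, follow_text):
    """Ls = Lf: source and follow-up inputs should receive the same label."""
    return classify(source_text) == classify(follow_text)

# Toy classifier (illustrative only; a real system returns a category string):
def toy_classify(text):
    return "sports" if "比赛" in text or "足球" in text else "other"

ok = classification_mr_holds(toy_classify, "小明在比赛中进球。", "小红在比赛中进球。")
```

A name substitution should not move a sports article into another category, so a violation here points to an over-sensitive classifier.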
Table 6 further explains the above MR, taking MR1.3 as an example:
In Table 6, we replace “练束梅” (Lian Shumei) in the source input with “白露” (Bai Lu) to generate the follow-up input. Comparing the source output Ls and the follow-up output Lf, we can see that they belong to the same category, that is, Ls = Lf, which conforms to the MR.

5. Experimental Setup

In order to verify the effectiveness of the MRPs proposed in this article, we try to answer the following three research questions (RQs).
RQ1:
Are the MRs defined in this study effective for evaluating the target NLP systems?
RQ2:
Can the MRPs and their MRs that we define effectively reflect the advantages and disadvantages of the SUTs?
RQ3:
What are the common problems of the SUTs under the different MRPs and their MRs?
The purpose of RQ1 is to test whether the MRs we defined can be applied to the three NLP tasks to evaluate the quality of the systems and give quantitative evaluation results. The purpose of RQ2 is to check the rationality of the results: it is necessary to analyze the performance of different systems and reveal their advantages and disadvantages, so that users can choose according to their needs and make better use of the relevant system. RQ3 analyzes the results and reveals the common problems of the SUTs.
To study the above questions, we conducted a series of experiments evaluating three NLP systems with the seven MRs we defined, involving three tasks: text similarity, text summarization, and text classification. This section introduces our experimental setup, including the implementation of the MRs, our subject NLP systems, the datasets used in the experiments, and the source inputs of the MRs.

5.1. Realization of MRs

We evaluate the NLP systems by MT through the defined MRs. The implementation processes of some specific MRs are shown in Figure 4:
We use HanLP [33] for POS tagging. The principle is to first perform rough word segmentation, constructing a word graph that enumerates all possible segmentation paths, then to intervene with a user-customized dictionary, and finally to use the Viterbi algorithm to select the optimal path. Number recognition and entity recognition (person name, place name, and organization name recognition) are then performed before hidden-Markov-model POS tagging.
For MRPpbs, we first analyze the source input via HanLP POS tagging. Based on this, we extract words with three attributes, name, pronoun, and country, from the source input, construct the respective word lists, and obtain occupation word lists from BiasFinder [42]. Then, for a given source input, replacement words are randomly selected from the corresponding word lists to construct the follow-up input. To check the relationship among the relevant source and follow-up outputs, different strategies are applied with respect to the target NLP systems. First, when an MR is applied to text similarity systems, a group of source and follow-up outputs is checked against the predefined value intervals to determine whether they belong to the same interval. For the text summarization systems, the same replacement operation is conducted on the source output to check whether the resulting text is the same as the relevant follow-up output. When applying this category of MRs to the text classification systems, a group of source and follow-up outputs is checked directly to see whether they represent the same category.
For MRPsbs, we perform the same POS tagging operation. Then, according to the MRs we defined, we use Synonym [36] to acquire replacement words and construct the follow-up input. The rest of the procedure is the same as for MRPpbs.
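A minimal sketch of this synonym-based substitution follows, with a toy synonym table standing in for the lookup provided by the Synonym toolkit; the entries are illustrative only.

```python
# Toy synonym table (placeholder for the Synonym toolkit's lookup).
SYNONYMS = {"购买": "采购", "汽车": "轿车"}

def synonym_follow_up(tokens, tags, target_pos=("noun", "verb")):
    """MRPsbs: replace nouns and verbs with synonyms to build a follow-up
    input whose meaning should match the source input."""
    return "".join(SYNONYMS.get(t, t) if g in target_pos else t
                   for t, g in zip(tokens, tags))
```

Tokens whose tag is outside `target_pos`, or which have no synonym entry, pass through unchanged.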
For MRPr, we first use punctuation matching to identify the sentences in the source input and then shuffle the text sentence by sentence to generate the follow-up input. The rest of the procedure is the same as for MRPpbs.
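The sentence-shuffling step can be sketched as follows; restricting sentence boundaries to the three marks 。！？ is a simplifying assumption.

```python
import random
import re

def shuffle_sentences(text, seed=None):
    """MRPr: split the source input into sentences on Chinese sentence-final
    punctuation, then reorder the sentences to build the follow-up input."""
    # Keep each sentence together with its trailing punctuation mark.
    sentences = re.findall(r"[^。！？]+[。！？]?", text)
    random.Random(seed).shuffle(sentences)
    return "".join(sentences)
```

Shuffling preserves the content of each sentence, so the follow-up input contains exactly the same characters as the source input, only in a different sentence order.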

5.2. Target Systems

According to the “China Artificial Intelligence Cloud Service Market Research Report (2020H1)” [43] released by IDC, the top three vendors are Baidu Smart Cloud, Alibaba Cloud, and Tencent Cloud. Therefore, we chose these three mainstream AI systems as the target systems for the experiments.
BaiduNLP: (https://cloud.baidu.com/product/nlp_basic, accessed on 1 August 2021) Baidu NLP provides comprehensive and leading NLP basic module capabilities, covering underlying capabilities at different granularities such as words, phrases, and sentences. It is applicable to a variety of technologies and business directions.
TencentNLP: (https://cloud.tencent.com/product/nlp, accessed on 1 August 2021) Tencent Cloud NLP deeply integrates Tencent's top NLP technology, relying on an accumulated Chinese corpus of 100 billion tokens, and provides 16 intelligent text processing capabilities, including intelligent word segmentation, entity recognition, text error correction, sentiment analysis, text classification, word vectors, keyword extraction, automatic summarization, intelligent chat, encyclopedia knowledge, map query, etc.
AliNLP: (https://ai.aliyun.com/nlp, accessed on 1 August 2021) Ali NLP is a core tool for text analysis and mining for all kinds of enterprises and developers. It is widely used by customers in the e-commerce, culture and entertainment, finance, logistics, and other industries, and can be used to build smart products such as content search, content recommendation, public opinion recognition and analysis, text structuring, and dialogue robots.

5.3. Datasets

We used the PAWS-X datasets [14], published by Google researchers, to prepare the source inputs for text similarity. PAWS-X contains 23,659 human-translated evaluation sentence pairs and 296,406 machine-translated PAWS training sentence pairs; among them, the Chinese datasets account for more than 50,000 pairs. For text summarization, we used the LCSTS datasets [15], a large-scale Chinese short text summarization dataset taken from Sina Weibo, which contains 2 million real Chinese short texts and the summaries given by their authors. For text classification, we used the THUCNews datasets [16], generated by filtering the historical data of Sina News, which contain 740,000 news documents organized, on the basis of the original Sina News classification system, into 14 categories such as finance, technology, and education. We selected seven categories (sports, entertainment, education, fashion, games, society, and technology) for experimentation.
It is worth mentioning that an MR may not be suitable for certain source inputs because of its preconditions. For example, MR3 replaces occupations, so it cannot be applied to source inputs whose sentences do not contain occupations. Therefore, different MRs may have different numbers of source and follow-up inputs. In total, more than 220,000 groups of source and follow-up inputs were used to evaluate the text similarity systems, more than 310,000 groups were used to evaluate the text summarization systems, and more than 1.65 million groups were used to evaluate the text classification systems.
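The precondition filtering described above can be sketched as follows; the occupation word list and helper names are placeholders for illustration.

```python
# Placeholder occupation word list (not the list used in the experiments).
OCCUPATIONS = ["医生", "老师", "工程师"]

def applicable_sources(sources, mr_precondition):
    """Keep only the source inputs that satisfy an MR's precondition,
    e.g. MR3 applies only to sentences containing an occupation word."""
    return [s for s in sources if mr_precondition(s)]

def contains_occupation(sentence):
    """MR3's precondition: the sentence mentions at least one occupation."""
    return any(word in sentence for word in OCCUPATIONS)
```

Applying this filter per MR is what makes the number of source/follow-up groups differ across MRs.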

6. Results and Analysis

This section presents the MT results of evaluating the three NLP systems, followed by an analysis and discussion of their capabilities. In our experiments, we used the violation rate (VR) as the evaluation metric. The violation rate is calculated as follows:
VR = V / Tc
where Tc refers to the total number of groups of source and follow-up inputs used to evaluate the NLP system, and V is the number of groups for which the system violates the relevant MR. For a given MR, the VR value indicates the extent to which the system behaves inconsistently with respect to that MR. The range of VR values is from 0 to 1, and (1 - VR) conveys information similar to the typical metric, accuracy, used in existing studies. A higher VR value indicates that the NLP system performs poorly under the MR and does not satisfy it; this not only means that we can find software defects from the user's point of view to better improve the system, but also allows users to deepen their understanding of the system and use it better. Conversely, a lower VR value indicates that the NLP system conforms to the relevant MR to a higher degree, which means it can better meet users' needs. A VR value of 0 means that no violation was observed in our experiments, indicating that the system is likely to adhere to the MR. For example, in terms of text similarity, the VR value of MR7.1 is almost 0%, so the systems are barely disturbed in this respect.
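The metric and its accuracy counterpart can be computed directly; this is a minimal sketch of the definition above.

```python
def violation_rate(violations, total):
    """VR = V / Tc: the fraction of source/follow-up groups that
    violate a given MR."""
    return violations / total if total else 0.0

def mt_accuracy(violations, total):
    """1 - VR, comparable to the accuracy reported by
    dataset-based evaluation."""
    return 1.0 - violation_rate(violations, total)
```

For example, 250 violating groups out of 1000 gives VR = 0.25 and an MT accuracy of 0.75.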

6.1. RQ1: Effectiveness of MT and MRs

In order to answer RQ1, we report the VR values of individual NLP systems with respect to each MR.

6.1.1. Text Similarity

Table 7 shows the performance of the three mainstream NLP systems on text similarity under the seven MRs. It can be seen that, in MR1.1, when replacing names and pronouns, all three systems perform well. TencentNLP has the highest VR at 8.29%, while BaiduNLP's VR is much lower than the others, only 0.15%, which shows strong stability. In MR2.1 and MR3.1, the VRs are lower than 1.71%, and the substitution of countries and occupations causes little interference with sentence similarity, which reflects the high robustness of the three NLP systems in this respect. It is worth mentioning that AliNLP and BaiduNLP have zero VR in MR3.1. For the replacement of punctuation (MR4.1), BaiduNLP's VR is lower than that of the other two systems, only 0.54%; TencentNLP suffers relatively large interference in this regard. From the results of MR7.1, the three systems are not affected by the order of the input sentences. In MR5.1 and MR6.1, we can see that synonyms replacing nouns and verbs disturb BaiduNLP the least.

6.1.2. Text Summarization

Table 8 shows the performance of the three mainstream NLP systems on text summarization under the seven MRs; the percentages in Table 8 represent the VR of each system under each of the seven MRs we defined. It can be seen that, in MR1.2, when replacing names and pronouns, AliNLP has the lowest VR at 8.89%, while BaiduNLP and TencentNLP are affected to varying degrees, with VRs reaching 11.75% and 38.29%, respectively. In MR2.2, the VRs of the three systems do not differ much. In MR3.2 and MR4.2, the VR of TencentNLP is lower than that of the other two systems; the substitution of occupations and punctuation causes little interference with the summarization results, which reflects the robustness of the three NLP systems in this respect. In MR5.2 and MR6.2, synonyms replacing nouns and verbs do not disturb the three systems much. In MR7.2, we changed the order of sentences in the text. From the perspective of the NLP systems, the semantics changed a lot, which leads to large differences in the outputs: the VRs of the three systems are all greater than 60%. From the user's point of view, however, the content of the reordered text is not much different, and the summarization results should be similar. These data show that, at the user level, the NLP systems perform well on some aspects of the text summarization task but have certain defects in others.

6.1.3. Text Classification

We used the THUCNews datasets, generated by filtering Sina News historical data, as the source inputs for the text classification task; they include news in fourteen categories. To avoid unfairness in data selection, after comparing the classification scopes and categories supported by AliNLP, BaiduNLP, and TencentNLP, we selected seven categories for experimentation: sports, entertainment, education, fashion, games, society, and technology. About 350,000 source inputs generated about 1.65 million follow-up inputs, with which we evaluated the performance of the three systems on the text classification task from various aspects and showed the performance of the NLP systems under the seven MRs from the user's point of view.
Table 9 shows the results of AliNLP's text classification. A total of more than 165,000 follow-up inputs were generated, covering seven topics and seven MRs. In MR1.3, the source and follow-up inputs under the society topic have the highest violation rate, reaching 9.26%, indicating that the classification of social news is more susceptible to interference from names and pronouns. As for MR2.3, the games and technology topics are greatly disturbed by country substitution on AliNLP, with VRs exceeding 10%. The replacement of occupations (MR3.3) and the change of punctuation (MR4.3) cause little interference to AliNLP, which shows that AliNLP is robust in this regard. For MR5.3 and MR6.3, the VRs of games and technology are higher than those of the other topics. For MR7.3, the technology category suffers greater interference under the text classification task.
Table 10 shows the results of BaiduNLP's text classification. A total of more than 134,000 follow-up inputs were generated. In MR1.3, the source and follow-up inputs under the entertainment and fashion topics have the highest VRs, reaching 10.56% and 10.36%, indicating that the classification of entertainment and fashion news is more susceptible to interference from names and pronouns. As for MR2.3 and MR3.3, the replacement of countries and occupations under the science and technology theme disturbs BaiduNLP much more than the other themes. The substitution of punctuation (MR4.3) and the change of sentence order (MR7.3) interfere with BaiduNLP under the games and technology themes. Apart from the science and technology theme, BaiduNLP shows good robustness in MR5.3 and MR6.3.
Table 11 shows the text classification results of TencentNLP. A total of more than 174,000 follow-up inputs were generated. The VRs of TencentNLP under the fashion and society themes in MR1.3 are relatively high, reaching 12.73% and 14.50%, which shows that the system is more sensitive to names under these two themes: changes in names cause the system to change the output news category. It can be seen from MR2.3 and MR3.3 that countries and occupations have little effect on the system's classification results. The perturbation of punctuation marks (MR4.3) has almost no impact on TencentNLP's classification task; the VR is 0 in some cases, which shows that the system has strong anti-interference ability in this respect. In MR5.3 and MR6.3, the system is disturbed to varying degrees under the games, society, and technology themes, each slightly higher than the other themes. For MR7.3, the VRs for society and games are relatively large, with the VR under the society theme as high as 51.16%, which shows that sentence order affects the system's judgment of social news.

6.1.4. Conclusions

In response to RQ1, based on the analysis of the VR values of the three NLP systems for all identified MRs in the fields of text similarity, text summarization, and text classification, the NLP systems violate some MRs to varying degrees, with VR values ranging from 0% to 77.84%. For example, consider the VR value of AliNLP for MR1.1 (2.36%): it indicates that 2.36% of the source and follow-up groups violated this MR. Most of the VR values in the experiments are low, reflecting that the three NLP systems perform well under the MRPs and their MRs, but some shortcomings remain. These results further show that the MRPs and MRs we defined can be applied in the field of NLP and can reflect the capabilities of the systems under test from different aspects.

6.2. RQ2: Advantages and Disadvantages of the SUTs

In response to RQ2, based on the MT results, we conducted an in-depth analysis to determine whether the MRP and its MRs defined by us can effectively reflect the characteristics of the SUT and are reasonable and convincing. We show the performance of the three NLP systems under each topic of text classification and the comparison under the three tasks of text similarity, text summarization, and text classification. The abscissa in Figure 5, Figure 6 and Figure 7 represents the MRs we defined, and the ordinate represents the VR value, that is, the proportion of the number of violations of the system under each MR. Each MR demonstrates the ability of NLP to process information in different aspects.

6.2.1. Text Similarity

Figure 5 shows the comparison of AliNLP, TencentNLP, and BaiduNLP in the field of text similarity. It can be seen that BaiduNLP's violation rates under each MR are much lower than those of AliNLP and TencentNLP, all below 2%. In this field, TencentNLP's performance is also relatively good, as its VRs are lower than AliNLP's.

6.2.2. Text Summarization

In the field of text summarization, we can see that AliNLP and BaiduNLP perform very well: with the exception of MR7, their VRs are below 17%. TencentNLP performs well in MR4, with a VR of only 0.45%; in other words, TencentNLP can largely avoid interference from punctuation marks without affecting the output results. Regarding the interference that sentence order causes to the summaries, the systems show certain shortcomings and cannot extract summaries from a good semantic perspective.

6.2.3. Text Classification

In the field of text classification, we can see that the VR of the three systems is less than 30% under the seven MRs we defined. Overall, BaiduNLP maintains strong stability from the user’s perspective. The three systems performed well in MRPpbs. In MRPsbs and MRPr, these systems have higher VRs, which means they are more sensitive to these MRPs.
Figure 8a–c intuitively shows the text classification of AliNLP, TencentNLP, and BaiduNLP under different news topics. The results show that the VR is greatly affected by the topic of the news. Among them, social news and science and technology news are more susceptible to interference from various factors, which affects the output results of the systems. From the user's perspective, we analyze and summarize the following factors that interfere with the systems' outputs. First, in social news, the substitution of names and occupations, such as replacing some names with those of entertainment stars or athletes, caused the systems to misjudge the classification results as entertainment or sports. A similar situation appears in the games and technology fields. These interferences caused a small portion of the test cases to fail. This shows that the MRPs and MRs we defined can effectively reflect the characteristics of the SUTs and reveal that the SUTs are susceptible to interference from individual words, producing different judgments.

6.2.4. Conclusions

To sum up, in response to RQ2, we have characterized the NLP systems under the different MRPs and their MRs. Through our experimental analysis and comparison, it can be seen that AliNLP, TencentNLP, and BaiduNLP have good task processing capabilities in the fields of text similarity, text summarization, and text classification; however, they still have certain defects. On this basis, the MT results reveal the capabilities of the tested NLP systems from different perspectives. From the user's point of view, the MT results report the VR value of each SUT relative to each MR, which can help users better understand the capabilities and limitations of the related systems. For example, the text summarization task of TencentNLP is more sensitive to names (38.29%): different names in the same text lead to large changes in the summary content. However, it is almost free from interference from punctuation (0.45%), which means that the tone of the text and similar factors have little effect on the summary content. On the other hand, the MT results support the comparison of different NLP systems from different aspects, thereby providing guidance for users to choose a suitable NLP system according to their specific needs. For example, suppose a user wants to select a text classification system that is not sensitive to the order of sentences. According to the analysis results, the VR value of MR7.3 under BaiduNLP text classification is 8.09%, under AliNLP it is 11.64%, and under TencentNLP it is 27.36%. Therefore, according to our experimental results, this type of user should choose BaiduNLP.

6.3. RQ3: Common Problems of the SUTs

In response to RQ3, our analysis of the results reveals common problems of the SUT. We analyzed the specific performance of the AliNLP, BaiduNLP, and TencentNLP systems under each task.
Figure 9 reports the VR values of AliNLP, BaiduNLP, and TencentNLP under the three major tasks, where sim denotes text similarity, sum denotes text summarization, and cla denotes text classification. It can be seen that the three mainstream NLP systems perform well under MRPpbs and MRPsbs; however, under MRPr, they are challenged to different degrees. On the whole, BaiduNLP outperforms the other two NLP systems in most tasks and MRs. In the text summarization task, MRPr has a higher VR than in the other fields. AliNLP's VRs on the text similarity task are generally low, which shows that the system is highly stable on this task. For AliNLP's text summarization task, MR4.2 has the highest VR within MRPsbs, indicating that punctuation causes the greatest interference in text summarization, while the VR of text summarization under MRPr is higher than 60%. In TencentNLP's MRPpbs, among the MRs of all tasks, MR1.2 has the highest VR, and the lowest VRs are those of MR4.2 and MR4.3, indicating that punctuation has little effect on text summarization and text classification.
From the seven MRs we defined and the seven graphs in Figure 10, we can see the results of each MR under different systems and different tasks; the ordinate represents the VR. In the figure, we use A for Ali, B for Baidu, and T for Tencent. MR1.x has a small VR on the text similarity and text classification tasks but a greater impact on text summarization; the summaries of TencentNLP in particular are more affected by names and pronouns. For MR2.x, under the text summarization task, the three systems are all affected to varying degrees, and the VR is much higher than on the text similarity and text classification tasks. Each system generally shows strong stability in MR3.x (occupation) and MR4.x (punctuation), indicating that these two MRs cause little interference to the NLP systems. Noun substitution (MR5.x) and verb substitution (MR6.x) have a greater impact on text classification but less interference on text summarization; moreover, BaiduNLP shows a particularly outstanding anti-interference ability on the text similarity task. For MR7.x, the text similarity task is almost unaffected under the three systems, while on the text summarization task the three systems are highly sensitive to sentence order, with VRs above 60%. On the text classification task, BaiduNLP is more stable than the other two systems.
Therefore, for RQ3, each system has strong anti-interference ability on the text similarity task and, to a large extent, gives correct results for reasonable replacements. In the field of text summarization, each system performs better under MRPpbs and MRPsbs than under MRPr. However, word replacement is more disturbing to the text classification task, which shows that, in this respect, the NLP systems have weak anti-interference ability. In addition, we can see that a system may perform well in some aspects but be unsatisfactory in others. Depending on their needs and application scenarios, users should refer to the performance of the systems in different aspects to select the system they require. For this reason, we visually present system performance by checking the VRs of the MRs on specific tasks, so that users can choose an appropriate system to meet their needs.

6.4. Comparison Results

We compared our approach with the dataset-based evaluation approach. The dataset-based evaluation approach relies on the ground truth result of a dataset to confirm the correctness of the output of an NLP task. In our experiments, we directly used the inputs and the corresponding ground truth results provided by the datasets introduced in Section 5.3.
Table 12 reports the accuracies of the target NLP systems under both approaches. As a reminder, the accuracy values reported by MT are obtained via 1 - VR. It can be seen from Table 12 that, for every NLP task under evaluation, MT reports a lower accuracy value than the dataset-based evaluation method. These results indicate that, compared with the dataset-based evaluation method, our approach can reveal more failures of the systems.

7. Related Work

A lot of research has focused on evaluating the robustness of NLP. In order to construct input data, researchers have proposed various strategies, such as automatically generating training data [44], reducing the original feature set [45], and adversarial sample generation [46]. Shehu et al. [47] proposed three data augmentation techniques to increase the diversity of the training data and used three DL algorithms for sentiment analysis in Turkish textual data obtained from Twitter. To overcome the problem of unbalanced data, reference [48] proposed a fine-grained NER model by using the contextual information of coarse-grained NEs.
Some research studies have focused on improving or explaining the robustness of NLP systems, such as using iterative input [49] to improve the quality of text summarization systems and combining decision trees with enhancement techniques [50] to improve classification accuracy. Qin et al. [51] proposed a Machine Natural Language Parser (MParser) to address the semantic interoperability problem between users and computers; to evaluate the annotator agreement of MParser outputs, 154 non-expert participants manually evaluated the sentences' semantic expressiveness. Zhou et al. [52] selected five state-of-the-art models as baselines to evaluate the contextual ensemble network (CENet). Hao et al. [53] reviewed deep-learning-based semantic segmentation methods and highlighted three challenges: the balance between accuracy and efficiency, the dependency on high-quality training data, and the domain gap across different datasets. Lateef et al. [54] provided a comprehensive survey of deep learning techniques used for semantic segmentation, evaluating ten classes of methods and thirty-five datasets in terms of accuracy, speed, and complexity. A new feature selection scheme was also presented for term selection in the field of emotion recognition from text [55]. Wang et al. [56] proposed a novel fine-grained multimodal fusion network (FMFN) to fully fuse textual and visual features for fake news detection. Jiang et al. [57] used quantitative and qualitative research methods to verify that the hero's personal and social identity is affected in both target language versions. Moreover, they put forward four challenges facing natural language processing research: widespread uncertainty, unpredictability, data insufficiency, and the complexity of language generation.
Researchers use mixed similarity measures [4,5] to improve the quality of text similarity detection, and often use Pearson correlation coefficient [4] as an evaluation indicator.
There are two main types of evaluation methods for text summaries, namely automatic evaluation methods and manual evaluation methods. The commonly used indicators in automatic evaluation methods mainly include ROUGE [6] and METEOR [7]. ROUGE is widely used to evaluate the performance of automatic summarization models. The basic idea is to compare the system abstract generated by the model with the reference abstract and evaluate the quality of the system abstract by calculating the number of basic units overlapping between them. The METEOR method is an improvement of BLEU [8], and it also considers the accuracy and recall rate of the entire corpus, therefore the reliability is higher. Because the current automatic evaluation method can only describe the superficial relationship between sentences and cannot distinguish the quality of the abstract through semantics, the emergence of manual evaluation makes up for the shortcomings of the automatic evaluation method to some extent. However, manual evaluation is greatly affected by subjective factors, and the cost is high, and the efficiency is low.
When evaluating text classification models, accuracy and F1 score are the most commonly used metrics. Later, as classification tasks became more difficult or task-specific, the evaluation indicators were refined, for example, Hamming loss [58] for evaluating misclassification and NDCG [59] as a big data classification index. Although some new text classification models perform well under these indicators, this does not indicate whether a model understands text at the semantic level as humans do. In addition, with the emergence of noisy samples, small sample noise may lead to large changes in decision confidence and even to decision reversal.
For text tasks, many studies use adversarial text generation [60] to test NLP systems, focusing on the robustness of the system. For example, the text adversarial attack of semantically equivalent adversaries (SEAs) [61] generates adversarial examples that keep the semantics unchanged while changing the model's prediction results, and proposes semantically equivalent adversarial rules (SEARs) for rule-based find-and-replace. The rules are generalized from existing adversarial samples, and the generated adversarial samples can help people find bugs in NLP models and repair model vulnerabilities through adversarial training.
Most existing applications of MT in the field of NLP target machine translation [62,63] and question answering systems [64]. Zhou et al. [63] proposed an automated neural machine translation testing method that requires no human evaluation. Zhong et al. [65] proposed a multi-granularity testing framework based on metamorphic testing to evaluate translation quality and translation robustness. Yuan et al. [66] detected more than 4.9 million perceptual failures in popular visual question answering (VQA) models through metamorphic testing. Regarding research on MRPs in the field of NLP, Segura et al. [67] proposed a catalog of MRPs to help test search engine systems and used formal relational algebra to describe query language semantics; Wu et al. [67,68] proposed “noise” MRPs, which achieved good results in the fields of obstacle perception, machine translation, and named entity recognition. At present, there is no convincing MT method for text similarity, text summarization, and text classification, and, due to the complexity and diversity of Chinese, systematic quality assessment is even more difficult. In our MRPs, we consider the substitution of names, countries, occupations, punctuation, nouns, and verbs. Owing to the particularities of Chinese, the diversity of expressions of words such as time and place and the uncertainty of meaning lead to some unreasonable results, which we will study further in follow-up work. Our research puts forward MT as a comprehensive user-oriented evaluation method. We have identified a large number of MRs to meet users' various requirements for the systems and have achieved good results.

8. Conclusions

In recent years, NLP systems have become commonly used tools because they are applicable in various fields. Recent studies have used various techniques to evaluate NLP systems, revealing a series of issues in different aspects. In this article, we focus on evaluating three areas of NLP systems: text similarity, text summarization, and text classification. We applied metamorphic testing (MT) to NLP systems; by considering users' different expectations of an NLP system, we determined three MRPs and seven MRs, instantiated as 21 task-specific MRs. In the experiments, our MT method shows the performance of the NLP systems from many aspects and demonstrates the feasibility and effectiveness of MT in evaluating NLP systems. At the same time, from the user's point of view, it reveals the NLP systems' ability to understand language. Under the MRs we defined, different NLP systems show different processing capabilities. Through the analysis and comparison of a large amount of data, these results further show that the proposed MRs can demonstrate the performance of various NLP systems and constitute an effective evaluation method.
Future work can be carried out from the following aspects. First, researchers can continue to improve and optimize each process in the MT to achieve a more reasonable quality assessment effect, such as finding better substitution methods and MRs to detect defects in different aspects of the system. Second, researchers can modify the test method to make it more universal.

Author Contributions

Conceptualization, L.J.; methodology, L.J. and H.Z.; software, L.J.; validation, L.J., Z.D. and H.Z.; investigation, L.J., Z.D. and H.Z.; data processing, L.J.; writing—original draft preparation, L.J.; writing—review and editing, L.J.; visualization, L.J.; supervision, Z.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gomaa, W.H.; Fahmy, A.A. A Survey of Text Similarity Approaches. Int. J. Comput. Appl. 2013, 68, 13–18. [Google Scholar]
  2. Gambhir, M.; Gupta, V. Recent automatic text summarization techniques: A survey. Artif. Intell. Rev. 2016, 47, 1–66. [Google Scholar] [CrossRef]
  3. Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150. [Google Scholar] [CrossRef] [Green Version]
  4. Islam, A.; Inkpen, D. Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2008, 2, 1–25. [Google Scholar] [CrossRef]
  5. Aggarwal, N.; Asooja, K.; Buitelaar, P. DERI&UPM: Pushing Corpus Based Relatedness to Similarity: Shared Task System Description. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Montreal, QC, Canada, 7–8 June 2012; pp. 643–647. [Google Scholar]
  6. Lin, C.-Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL, Barcelona, Spain, 25 July 2004. [Google Scholar]
  7. Denkowski, M.; Lavie, A. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; pp. 376–380. [Google Scholar]
  8. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  9. Han, M.; Zhang, X.; Yuan, X.; Jiang, J.; Yun, W.; Gao, C. A survey on the techniques, applications, and performance of short text semantic similarity. Concurr. Comput. Pract. Exp. 2020, 33, e5971. [Google Scholar] [CrossRef]
  10. Ruan, H.; Li, Y.; Wang, Q.; Liu, Y. A research on sentence similarity for question answering system based on multi-feature fusion. In Proceedings of the 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Omaha, NE, USA, 13–16 October 2016; pp. 507–510. [Google Scholar]
  11. Fan, A.; Grangier, D.; Auli, M. Controllable abstractive summarization. arXiv 2017, arXiv:1711.05217. [Google Scholar]
  12. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. Adv. Neural Inf. Process. Syst. 2019, 32, 5754–5764. [Google Scholar]
  13. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  14. Yang, Y.; Zhang, Y.; Tar, C.; Baldridge, J. PAWS-x: A cross-lingual adversarial dataset for paraphrase identification. arXiv 2019, arXiv:1908.11828. [Google Scholar]
  15. Hu, B.; Chen, Q.; Zhu, F. LCSTS: A large scale chinese short text summarization dataset. arXiv 2015, arXiv:1506.05865. [Google Scholar]
  16. Sun, M.; Li, J.; Guo, Z.; Yu, Z.; Zheng, Y.; Si, X.; Liu, Z. THUCTC: An Efficient Chinese Text Classifier. GitHub repository, 2016. Available online: https://github.com/diuzi/THUCTC (accessed on 1 August 2021).
  17. Li, J.; Du, T.; Ji, S.; Zhang, R.; Lu, Q.; Yang, M.; Wang, T. TextShield: Robust Text Classification Based on Multimodal Embedding and Neural Machine Translation. In Proceedings of the 29th USENIX Security Symposium, San Diego, CA, USA, 12–14 August 2020; pp. 1381–1398. [Google Scholar]
  18. Segura, S.; Fraser, G.; Sanchez, A.B.; Ruiz-Cortés, A. A survey on metamorphic testing. IEEE Trans. Softw. Eng. 2016, 42, 805–824. [Google Scholar] [CrossRef] [Green Version]
  19. Deng, Y.; Zheng, X.; Zhang, T.; Lou, G.; Liu, H.; Kim, M. RMT: Rule-based metamorphic testing for autonomous driving models. arXiv 2020, arXiv:2012.10672. [Google Scholar]
  20. Cao, Y.; Zhou, Z.Q.; Chen, T.Y. On the correlation between the effectiveness of metamorphic relations and dissimilarities of test case executions. In Proceedings of the 2013 13th International Conference on Quality Software, Nanjing, China, 29–30 July 2013; pp. 153–162. [Google Scholar]
  21. Zhou, Z.Q. Using coverage information to guide test case selection in adaptive random testing. In Proceedings of the 2010 IEEE 34th Annual Computer Software and Applications Conference Workshops, Seoul, Korea, 19–23 July 2010; pp. 208–213. [Google Scholar]
  22. Barus, A.C.; Chen, T.Y.; Grant, D.; Kuo, F.C.; Lau, M.F. Testing of heuristic methods: A case study of greedy algorithm. In Software Engineering Techniques; Huzar, Z., Koci, R., Meyer, B., Walter, B., Zendulka, J., Eds.; Springer: Berlin, Germany, 2011; Volume 4980, pp. 246–260. [Google Scholar]
  23. Chen, T.Y.; Kuo, F.C.; Liu, H.; Wang, S. Conformance testing of network simulators based on metamorphic testing technique. In Formal Techniques for Distributed Systems; Lee, D., Lopes, A., Poetzsch-Heffter, A., Eds.; Springer: Berlin, Germany, 2009; Volume 5522, pp. 243–248. [Google Scholar]
  24. Zhou, Z.Q.; Xiang, S.; Chen, T.Y. Metamorphic testing for software quality assessment: A study of search engines. IEEE Trans. Softw. Eng. 2016, 42, 264–284. [Google Scholar] [CrossRef]
  25. Zhou, Z.Q.; Sun, L.; Chen, T.Y.; Towey, D. Metamorphic Relations for Enhancing System Understanding and Use. IEEE Trans. Softw. Eng. 2020, 46, 1120–1154. [Google Scholar] [CrossRef] [Green Version]
  26. Available online: https://cloud.baidu.com/product/nlp_basic (accessed on 1 August 2021).
  27. Available online: https://cloud.tencent.com/product/nlp (accessed on 1 August 2021).
  28. Available online: https://ai.aliyun.com/nlp (accessed on 1 August 2021).
  29. Barr, E.T.; Harman, M.; McMinn, P.; Shahbaz, M.; Yoo, S. The Oracle Problem in Software Testing: A Survey. IEEE Trans. Softw. Eng. 2015, 41, 507–525. [Google Scholar] [CrossRef]
  30. Chen, T.Y.; Kuo, F.C.; Liu, H.; Poon, P.L.; Towey, D.; Tse, T.H.; Zhou, Z.Q. Metamorphic Testing: A Review of Challenges and Opportunities. ACM Comput. Surv. 2018, 51, 1–27. [Google Scholar] [CrossRef] [Green Version]
  31. Zhou, Z.Q.; Tse, T.H.; Kuo, F.C.; Chen, T.Y. Automated Functional Testing of Web Search Engines in the Absence of an Oracle; Technical Report TR-2007–06; Department of Computer Science, The University of Hong Kong: Hong Kong, China, 2007. [Google Scholar]
  32. Segura, S.; Parejo, J.A.; Troya, J.; Ruiz-Cortés, A. Metamorphic testing of RESTful web APIs. IEEE Trans. Softw. Eng. 2018, 44, 1083–1099. [Google Scholar] [CrossRef]
  33. Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Texts. In Proceedings of the EMNLP, Barcelona, Spain, 1 January 2004; p. 85. [Google Scholar]
  34. He, H.; Choi, J.D. The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders. 2021. Available online: https://arxiv.org/abs/2109.06939 (accessed on 1 August 2021).
  35. Louizos, C.; Welling, M.; Kingma, D.P. Learning Sparse Neural Networks through L0 Regularization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  36. Wang, H.L.; Xi, H.Y. Synonyms: Chinese Synonyms for Natural Language Processing and Understanding. Available online: https://github.com/chatopera/Synonyms (accessed on 1 August 2021).
  37. Bao, W.; Bao, W.; Du, J.; Yang, Y.; Zhao, X. Attentive siamese lstm network for semantic textual similarity measure. In Proceedings of the 2018 International Conference on Asian Language Processing (IALP), Bandung, Indonesia, 15–17 November 2018; pp. 312–317. [Google Scholar]
  38. Bouziane, A.; Bouchiha, D.; Doumi, N.; Malki, M. Question Answering Systems: Survey and Trends. Procedia Comput. Sci. 2015, 73, 366–375. [Google Scholar] [CrossRef] [Green Version]
  39. Li, Y.C.; Xiong, D.Y.; Zhang, M. A survey of neural machine translation. Chin. J. Comput. 2018, 41, 100–121. [Google Scholar]
  40. Zhan, G.; Wang, M.; Zhan, M. Public opinion detection in an online lending forum: Sentiment analysis and data visualization. In Proceedings of the 2020 IEEE 5th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 10–13 April 2020; pp. 211–213. [Google Scholar]
  41. Bagui, S.; Nandi, D.; Bagui, S.; White, R.J. Classifying phishing email using machine learning and deep learning. In Proceedings of the 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), Oxford, UK, 3–4 June 2019; pp. 1–2. [Google Scholar]
  42. Asyrofi, M.H.; Yang, Z.; Yusuf, I.N.; Kang, H.J.; Thung, F.; Lo, D. BiasFinder: Metamorphic Test Generation to Uncover Bias for Sentiment Analysis Systems. IEEE Trans. Softw. Eng. 2021. [Google Scholar] [CrossRef]
  43. Available online: https://www.idc.com/getdoc.jsp?containerId=prCHC47212020 (accessed on 1 August 2021).
  44. Peyrard, M.; Eckle-Kohler, J. Supervised learning of automatic pyramid for optimization-based multi-document summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1084–1094. [Google Scholar]
  45. Abdi, A.; Shamsuddin, S.M.; Hasan, S.; Piran, J. Machine learning-based multi-documents sentiment-oriented summarization using linguistic treatment. Expert Syst. Appl. 2018, 109, 66–85. [Google Scholar] [CrossRef]
  46. Wallace, E.; Feng, S.; Kandpal, N.; Gardner, M.; Singh, S. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 2153–2162. [Google Scholar]
  47. Shehu, H.A.; Sharif, M.H.; Sharif, M.H.; Datta, R.; Tokat, S.; Uyaver, S.; Kusetogullari, H.; Ramadan, R.A. Deep Sentiment Analysis: A Case Study on Stemmed Turkish Twitter Data. IEEE Access 2021, 9, 56836–56854. [Google Scholar] [CrossRef]
  48. Kim, H. Fine-Grained Named Entity Recognition Using a Multi-Stacked Feature Fusion and Dual-Stacked Output in Korean. Appl. Sci. 2021, 11, 10795. [Google Scholar] [CrossRef]
  49. Chen, X.; Gao, S.; Tao, C.; Song, Y.; Zhao, D.; Yan, R. Iterative document representation learning towards summarization with polishing. arXiv 2019, arXiv:1809.10324. [Google Scholar]
  50. Schapire, R.E.; Singer, Y. BoosTexter: A boosting-based system for text categorization. Mach. Learn. 2000, 39, 135–168. [Google Scholar] [CrossRef] [Green Version]
  51. Qin, P.; Tan, W.; Guo, J.; Shen, B.; Tang, Q. Achieving Semantic Consistency for Multilingual Sentence Representation Using an Explainable Machine Natural Language Parser (MParser). Appl. Sci. 2021, 11, 11699. [Google Scholar] [CrossRef]
  52. Zhou, Q.; Wu, X.; Zhang, S.; Kang, B.; Ge, Z.; Latecki, L.J. Contextual ensemble network for semantic segmentation. Pattern Recognit. 2022, 122, 108290. [Google Scholar] [CrossRef]
  53. Hao, S.; Zhou, Y.; Guo, Y. A Brief Survey on Semantic Segmentation with Deep Learning. Neurocomputing 2020, 406, 302–321. [Google Scholar] [CrossRef]
  54. Lateef, F.; Ruichek, Y. Survey on Semantic Segmentation using Deep Learning Techniques. Neurocomputing 2019, 338, 321–348. [Google Scholar] [CrossRef]
  55. Erenel, Z.; Adegboye, O.R.; Kusetogullari, H. A New Feature Selection Scheme for Emotion Recognition from Text. Appl. Sci. 2020, 10, 5351. [Google Scholar] [CrossRef]
  56. Wang, J.; Mao, H.; Li, H. FMFN: Fine-Grained Multimodal Fusion Networks for Fake News Detection. Appl. Sci. 2022, 12, 1093. [Google Scholar] [CrossRef]
  57. Jiang, K.; Lu, X. Natural Language Processing and Its Applications in Machine Translation: A Diachronic Review. In Proceedings of the 2020 IEEE 3rd International Conference of Safe Production and Informatization (IICSPI), Chongqing, China, 28–30 November 2020; pp. 210–214. [Google Scholar] [CrossRef]
  58. Schapire, R.E.; Singer, Y. Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 1999, 37, 297–336. [Google Scholar] [CrossRef] [Green Version]
  59. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar]
  60. Zhang, W.E.; Sheng, Q.Z.; Alhazmi, A.; Li, C. Adversarial Attacks on Deep Learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 2019, 11, 1–41. [Google Scholar] [CrossRef] [Green Version]
  61. Ribeiro, M.T.; Singh, S.; Guestrin, C. Semantically Equivalent Adversarial Rules for Debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Vol 1), Melbourne, Australia, 15–20 July 2018. [Google Scholar]
  62. Pesu, D.; Zhou, Z.Q.; Zhen, J.F.; Towey, D. A Monte Carlo method for metamorphic testing of machine translation services. In Proceedings of the 2018 IEEE/ACM 3rd International Workshop on Metamorphic Testing (MET), Gothenburg, Sweden, 27 May–3 June 2018; pp. 38–45. [Google Scholar]
  63. Zhou, Z.Q.; Sun, L.Q. Metamorphic testing for machine translations: MT4MT. In Proceedings of the 2018 25th Australasian Software Engineering Conference (ASWEC), Adelaide, SA, Australia, 26–30 November 2018; pp. 96–100. [Google Scholar]
  64. Tu, K.; Jiang, M.; Ding, Z. A metamorphic testing approach for assessing question answering systems. Mathematics 2021, 9, 726. [Google Scholar] [CrossRef]
  65. Zhong, W.K.; Ge, J.D.; Chen, X.; Li, C.Y.; Tang, Z.; Luo, B. Multi-Granularity Metamorphic Testing for Neural Machine Translation System. Ruan Jian Xue Bao/J. Softw. 2021, 32, 1051–1066. Available online: http://www.jos.org.cn/1000-9825/6221.htm (accessed on 1 August 2021). (In Chinese).
  66. Yuan, Y.; Wang, S.; Jiang, M.; Chen, T.Y. Perception Matters: Detecting Perception Failures of VQA Models Using Metamorphic Testing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16908–16917. [Google Scholar]
  67. Segura, S.; Durán, A.; Troya, J.; Ruiz-Cortés, A. Metamorphic Relation Patterns for Query-Based Systems. In Proceedings of the 2019 IEEE/ACM 4th International Workshop on Metamorphic Testing (MET), Montreal, QC, Canada, 26 May 2019; pp. 24–31. [Google Scholar] [CrossRef]
  68. Wu, C.; Sun, L.; Zhou, Z.Q. The Impact of a Dot: Case Studies of a Noise Metamorphic Relation Pattern. In Proceedings of the IEEE/ACM 4th International Workshop on Metamorphic Testing ACM, Montreal, QC, Canada, 26 May 2019; pp. 17–23. [Google Scholar]
Figure 1. Example 1 of Baidu online text summarization results (input: Mr. Shi, a citizen, said that while he was driving, his blood pressure suddenly rose and he felt unwell, so he stopped his car by the roadside to rest. Unexpectedly, the traffic police issued him a parking ticket. He is a serious person. He believes that, according to the road traffic safety law, the traffic police should not issue a ticket while the driver is in the car. He took the Huangpu traffic police detachment to court. Output: He took the Huangpu traffic police detachment to court.).
Figure 2. Example 2 of Baidu online text summarization results (input: Wang Qiang, a citizen, said that while he was driving, his blood pressure suddenly rose and he felt unwell, so he stopped his car by the roadside to rest. Unexpectedly, the traffic police issued him a parking ticket. He is a serious person. He believes that, according to the road traffic safety law, the traffic police should not issue a ticket while the driver is in the car. He took the Huangpu traffic police detachment to court. Output: Wang Qiang, a citizen, said that while he was driving, his blood pressure suddenly rose and he felt unwell, so he stopped his car by the roadside to rest. Unexpectedly, the traffic police issued him a parking ticket. He is a serious person. He believes that, according to the road traffic safety law, the traffic police should not issue a ticket while the driver is in the car. He took the Huangpu traffic police detachment to court.).
Figure 3. The procedure of evaluating the NLP system based on MT.
Figure 4. MRs realization process.
Figure 5. VR of 3 NLP systems on text similarity.
Figure 6. VR of 3 NLP systems on text summarization.
Figure 7. VR of 3 NLP systems on text classification.
Figure 8. VR of 3 NLP systems on text classification under different news topics.
Figure 9. Results of each system in the MRP.
Figure 10. Experimental results of each MR under different systems and tasks.
Table 1. Text similarity.
Input 1: 在数学天文学领域, 他的声誉归功于天文球体的提出, 以及他对理解行星运动的早期贡献。 (In the field of mathematical astronomy, he is credited with the creation of the astronomical sphere and his early contributions to understanding the motion of planets.)
Input 2: 他在数学天文学方面享有盛誉是因为他引入了天文地球仪, 并对理解行星运动作出了早期贡献。 (He enjoys a high reputation in mathematical astronomy because he introduced the astronomical globe and made early contributions to the understanding of planetary motion.)
Output: 0.74384534
Table 2. Explanation and simple example of text similarity MR1.1–6.1.
MRs | Interpretation of MR Violation | Examples
MR1.1–MR4.1 | The NLP system is sensitive to words of the same property under the regular rules. | S1: "小明有一个苹果。" S2: "小明有一个梨。" (S1: "Xiaoming has an apple." S2: "Xiaoming has a pear.")
| | Sn1: "小红有一个苹果。" Sn2: "小红有一个梨。" (Sn1: "Xiaohong has an apple." Sn2: "Xiaohong has a pear.")
| | sim (S1, S2) = 0.723585; sim (Sn1, Sn2) = 0.776188
MR5.1–MR6.1 | The NLP system is sensitive to similar words under the regular rules. | S1: "老师今天表扬了小明。" S2: "同学们今天表扬了小明。" (S1: "The teacher praised Xiaoming today." S2: "The students praised Xiaoming today.")
| | Sn1: "老师今天夸赞了小明。" Sn2: "同学们今天夸赞了小明。" (Sn1: "The teacher spoke highly of Xiaoming today." Sn2: "The students spoke highly of Xiaoming today.")
| | sim (S1, S2) = 0.842097; sim (Sn1, Sn2) = 0.828796
MR7.1 | The NLP system is sensitive to the order of the inputs. | S1: "小明有一个苹果。" S2: "小明有一个梨。" (S1: "Xiaoming has an apple." S2: "Xiaoming has a pear.")
| | Sn1: "小明有一个梨。" Sn2: "小明有一个苹果。" (Sn1: "Xiaoming has a pear." Sn2: "Xiaoming has an apple.")
| | sim (S1, S2) = 0.723585; sim (Sn1, Sn2) = 0.723585
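The checks behind Table 2 can be expressed as a small predicate over any similarity function. The sketch below uses a character-bigram Dice coefficient purely as a stand-in for the cloud similarity APIs tested in the paper, together with English stand-ins for the example sentences; `dice_similarity`, `mr_holds`, and the tolerance `eps` are illustrative names, not part of any tested system:

```python
def dice_similarity(a, b):
    """Character-bigram Dice coefficient: a crude local stand-in for a
    text-similarity API such as those of Ali, Tencent, or Baidu."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

def mr_holds(sim, s1, s2, sn1, sn2, eps=0.1):
    """MR1.1-style check: substituting the same word in both inputs
    should leave the similarity score (nearly) unchanged."""
    return abs(sim(s1, s2) - sim(sn1, sn2)) <= eps

# Source pair, and follow-up pair with "Xiaoming" replaced by "Xiaohong".
s1, s2 = "Xiaoming has an apple.", "Xiaoming has a pear."
sn1, sn2 = "Xiaohong has an apple.", "Xiaohong has a pear."
print(mr_holds(dice_similarity, s1, s2, sn1, sn2))  # True
```

The same predicate covers MR7.1 by passing the swapped pair as the follow-up inputs.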
Table 3. Text summarization.
Input: 前天, 上海有网友发现, 马路上竟有几辆正在行驶的碰碰车。原来, 开碰碰车的是附近游乐场员工, 主要为免去运输烦恼。警方则表示目前正在查证, 但碰碰车绝不能上路。网友纷纷表示, 这就是跑跑卡丁车现实版! (The day before yesterday, a netizen in Shanghai found several bumper cars driving on the road. It turned out that the bumper cars were driven by employees of a nearby amusement park, mainly to save the trouble of transporting them. The police said they were investigating, but that bumper cars must never be driven on the road. Netizens said this is a real-life version of the kart racing game!)
Output: 上海有网友发现, 马路上竟有几辆正在行驶的碰碰车。开碰碰车的是附近游乐场员工。 (Some netizens in Shanghai found that there were several bumper cars running on the road. The bumper car is driven by the staff of the nearby amusement park.)
Table 4. Text summarization. MR7.2 explanation and simple example.
Input (ORI): “月光女神”英国歌手莎拉•布莱曼近日则表示, 9月1日她将乘坐俄“联盟”号飞船进入国际空间站, 并停留10天。她希望在空间站与地球上的小朋友进行“零时差”合唱。为了实现上太空的梦想, 55岁的她每天要进行16个小时身体训练, 还要进行心理辅导。(British singer Sarah Brightman, known as the “Moonlight Goddess”, said recently that on September 1 she would travel to the International Space Station aboard a Russian Soyuz spacecraft and stay for 10 days. She hopes to sing a “zero time difference” chorus with children on Earth from the space station. To realize her dream of going to space, the 55-year-old undergoes 16 hours of physical training every day, along with psychological counseling.)
Output (ORI): 英国歌手莎拉•布莱曼近日则表示“飞船国际空间站”她在空间站与地球上的小朋友进行合唱。(British singer Sarah Brightman said recently that she sang with children on earth on the “spaceship International Space Station”.)
Input (FLU): “月光女神”英国歌手莎拉•布莱曼近日则表示, 9月1日她将乘坐俄“联盟”号飞船进入国际空间站, 并停留10天。为了实现上太空的梦想, 55岁的她每天要进行16个小时身体训练, 还要进行心理辅导。她希望在空间站与地球上的小朋友进行“零时差”合唱。(British singer Sarah Brightman, known as the “Moonlight Goddess”, said recently that on September 1 she would travel to the International Space Station aboard a Russian Soyuz spacecraft and stay for 10 days. To realize her dream of going to space, the 55-year-old undergoes 16 hours of physical training every day, along with psychological counseling. She hopes to sing a “zero time difference” chorus with children on Earth from the space station.)
Output (FLU): 英国歌手莎拉•布莱曼近日则表示“飞船国际空间站”她在空间站与地球上的小朋友进行合唱。(British singer Sarah Brightman said recently that she sang with children on earth on the “spaceship International Space Station”.)
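MR7.2 in Table 4 expects a summary to survive reordering of complete sentences. A toy illustration, with a longest-sentence extractor standing in for the real summarization APIs (both function names and the example text are hypothetical):

```python
def toy_summarizer(text):
    """Toy extractive summarizer (a stand-in for the cloud summarization
    APIs tested in the paper): return the single longest sentence."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return max(sentences, key=len) + "."

def mr7_2_holds(summarize, text, reordered):
    """MR7.2-style check: reordering complete sentences should not
    change the extracted summary."""
    return summarize(text) == summarize(reordered)

original = ("She will fly to the station on September 1. "
            "She trains sixteen hours every day. "
            "She hopes to sing with children.")
reordered = ("She trains sixteen hours every day. "
             "She hopes to sing with children. "
             "She will fly to the station on September 1.")
print(mr7_2_holds(toy_summarizer, original, reordered))  # True
```

A real abstractive system may legitimately produce different surface forms, so in practice the comparison is relaxed (e.g., a similarity threshold rather than strict equality).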
Table 5. Text classification.
Input: 特派记者郑菁埃因霍温报道 2月7日晚, PSV埃因霍温俱乐部官方网站宣布, 球队已经签下中国球员周海滨, 他将于9日正式开始随队训练。 周海滨与俱乐部签署的是一份为期1年的合同……一旦周海滨出现伤病情况, 埃因霍温方面必须一直负责到他能正式恢复踢球为止。(Special correspondent Zheng Jing reports from Eindhoven: on the evening of February 7, the official website of the PSV Eindhoven club announced that the team had signed the Chinese player Zhou Haibin, who will officially start training with the team on the 9th. Zhou Haibin signed a one-year contract with the club... Should Zhou Haibin be injured, PSV Eindhoven must remain responsible until he can officially resume playing.)
Output: 体育 (sports)
Table 6. Text classification. MR1.3 explanation and simple example.
Input (ORI): 日前, 由张黎执导的电视剧《圣天门口》正在横店热拍, 在片中出演重要女性角色麦香的则是青年演员练束梅练束梅在由张黎执导的《人间正道是沧桑》中曾有过短暂的亮相……练束梅称, 麦香这个角色必须用真心的把自己放进去, 因为离自己太远所以要抛开生活中的自己全心投入其中。 (Recently, the TV series “Holy Gate” directed by Zhang Li has been filming intensively in Hengdian, with the young actress Lian Shumei playing Mai Xiang, an important female role. Lian Shumei previously made a brief appearance in “The Right Way in the World Is Vicissitudes”, also directed by Zhang Li... Lian Shumei said that to play Mai Xiang she must truly put herself into the role; because the character is far removed from her own life, she has to set aside her everyday self and devote herself fully to it.)
Output (ORI): 娱乐(entertainment)
Input (FLU): 日前, 由张黎执导的电视剧《圣天门口》正在横店热拍, 在片中出演重要女性角色麦香的则是青年演员白露白露在由张黎执导的《人间正道是沧桑》中曾有过短暂的亮相……白露称, 麦香这个角色必须用真心的把自己放进去, 因为离自己太远所以要抛开生活中的自己全心投入其中。 (Recently, the TV series “Holy Gate” directed by Zhang Li has been filming intensively in Hengdian, with the young actress Bai Lu playing Mai Xiang, an important female role. Bai Lu previously made a brief appearance in “The Right Way in the World Is Vicissitudes”, also directed by Zhang Li... Bai Lu said that to play Mai Xiang she must truly put herself into the role; because the character is far removed from her own life, she has to set aside her everyday self and devote herself fully to it.)
Output (FLU): 娱乐(entertainment)
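MR1.3 in Table 6 expects the predicted category to be unaffected by swapping a person's name. A toy keyword classifier makes the check concrete (the cue lists, function names, and example sentences are illustrative only, not the behavior of any tested API):

```python
def toy_classifier(text):
    """Toy keyword classifier (a stand-in for the cloud text-classification
    APIs): pick the label whose cue words appear most often."""
    cues = {
        "sports": ["match", "team", "player", "training"],
        "entertainment": ["TV", "actor", "role", "film"],
    }
    scores = {label: sum(text.count(w) for w in words)
              for label, words in cues.items()}
    return max(scores, key=scores.get)

def mr1_3_holds(classify, original, renamed):
    """MR1.3-style check: replacing a person's name must not change
    the predicted category."""
    return classify(original) == classify(renamed)

original = "The actor Lian Shumei plays an important role in the TV series."
renamed = "The actor Bai Lu plays an important role in the TV series."
print(mr1_3_holds(toy_classifier, original, renamed))  # True
```

The same harness applies to any of the 21 task-specific MRs: only the transformation that produces the follow-up input and the comparison of outputs change.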
Table 7. Results of text similarity.
VR (%) | MR1.1 | MR2.1 | MR3.1 | MR4.1 | MR5.1 | MR6.1 | MR7.1
Ali | 2.36 | 0.12 | 0.00 | 8.24 | 14.35 | 16.46 | 0.00
Tencent | 8.29 | 1.12 | 1.71 | 12.82 | 18.02 | 17.66 | 0.00
Baidu | 0.15 | 0.03 | 0.00 | 0.54 | 1.70 | 1.49 | 0.00
Table 8. Results of text summarization.
VR (%) | MR1.2 | MR2.2 | MR3.2 | MR4.2 | MR5.2 | MR6.2 | MR7.2
Ali | 8.89 | 15.33 | 11.04 | 17.42 | 8.96 | 7.37 | 60.94
Tencent | 38.29 | 18.84 | 7.05 | 0.45 | 9.76 | 7.05 | 77.84
Baidu | 11.75 | 14.65 | 10.32 | 10.42 | 6.97 | 6.27 | 60.63
Table 9. AliNLP text classification results.
VR (%) | MR1.3 | MR2.3 | MR3.3 | MR4.3 | MR5.3 | MR6.3 | MR7.3
sports | 1.05 | 1.54 | 1.02 | 0.14 | 5.92 | 3.66 | 0.92
entertainment | 7.35 | 8.32 | 7.21 | 0.67 | 21.22 | 23.65 | 7.70
education | 1.30 | 5.36 | 1.54 | 0.19 | 13.55 | 9.14 | 5.22
fashion | 5.29 | 9.16 | 4.97 | 0.63 | 26.39 | 24.24 | 5.41
games | 5.37 | 12.64 | 6.08 | 1.01 | 41.10 | 33.96 | 13.49
society | 9.26 | 8.46 | 6.35 | 0.96 | 28.86 | 22.80 | 6.81
technology | 5.58 | 14.27 | 8.54 | 1.13 | 43.89 | 40.92 | 37.18
All | 5.31 | 6.40 | 5.50 | 0.67 | 24.79 | 21.68 | 11.64
Table 10. BaiduNLP text classification results.
VR (%) | MR1.3 | MR2.3 | MR3.3 | MR4.3 | MR5.3 | MR6.3 | MR7.3
sports | 0.75 | 0.83 | 0.44 | 1.58 | 3.46 | 1.08 | 1.85
entertainment | 10.56 | 3.56 | 2.36 | 6.37 | 9.12 | 4.44 | 7.39
education | 2.48 | 2.79 | 1.43 | 4.43 | 8.87 | 3.47 | 5.66
fashion | 10.36 | 3.88 | 3.72 | 8.74 | 13.77 | 7.27 | 9.91
games | 4.13 | 5.42 | 4.46 | 12.15 | 43.90 | 10.15 | 16.39
society | 1.48 | 5.53 | 1.73 | 3.51 | 6.73 | 3.43 | 4.39
technology | 7.68 | 13.08 | 17.88 | 16.93 | 54.91 | 20.16 | 23.39
All | 4.09 | 3.55 | 5.51 | 6.34 | 18.29 | 6.90 | 8.09
Table 11. TencentNLP text classification results.
VR (%) | MR1.3 | MR2.3 | MR3.3 | MR4.3 | MR5.3 | MR6.3 | MR7.3
sports | 1.74 | 1.45 | 0.41 | 0.00 | 3.08 | 1.88 | 4.61
entertainment | 4.70 | 2.13 | 1.82 | 0.00 | 5.50 | 4.46 | 9.30
education | 3.06 | 4.46 | 1.30 | 0.00 | 7.89 | 4.86 | 10.66
fashion | 12.73 | 6.18 | 4.07 | 0.00 | 14.30 | 9.69 | 18.13
games | 8.20 | 8.33 | 4.99 | 0.00 | 36.75 | 23.45 | 42.36
society | 14.50 | 11.41 | 6.84 | 0.00 | 36.07 | 25.54 | 51.16
technology | 9.03 | 8.57 | 5.22 | 0.00 | 41.21 | 24.47 | 39.05
All | 6.91 | 4.71 | 3.71 | 0.00 | 20.75 | 13.48 | 27.36
Table 12. Comparative experimental result: the accuracies reported by MT and dataset-based evaluation approach.
(%) | MT: Text Similarity | MT: Text Summarization | MT: Text Classification | Dataset: Text Similarity | Dataset: Text Summarization | Dataset: Text Classification
AliNLP | 85.35 | 76.05 | 87.50 | 91.32 | 84.92 | 94.41
TencentNLP | 83.52 | 77.55 | 86.66 | 92.20 | 83.21 | 96.12
BaiduNLP | 92.47 | 76.74 | 91.46 | 92.25 | 88.60 | 96.54
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Jin, L.; Ding, Z.; Zhou, H. Evaluation of Chinese Natural Language Processing System Based on Metamorphic Testing. Mathematics 2022, 10, 1276. https://doi.org/10.3390/math10081276


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
