1. Introduction
The development of machine reading comprehension (MRC) and natural language processing (NLP) [1] is attracting increasing attention. In Chinese machine reading comprehension, one important task is the reading comprehension of Chinese idioms. Because idioms are compact in form, their literal meanings often differ from their actual meanings, which makes it difficult for machines to select the correct answer by simple matching.
Researchers have carried out many studies to improve the accuracy of Chinese machine reading comprehension. For example, Cui et al. [2] proposed BERT-WWM, which improves the accuracy of Chinese machine reading by replacing single-character masking with a whole-word masking strategy. To better study cloze-style idiom machine reading comprehension, researchers proposed the ChID [3] dataset, which focuses on Chinese idioms and asks readers to select the option that best fits the context of a given passage. However, because idiom machine reading comprehension is complex, it still faces the following problems:
(1) Idioms often carry deep semantic information, and understanding them usually requires the introduction of external knowledge. However, most current studies [4] rely on the ChID dataset alone for idiom machine reading comprehension research. Because the ChID dataset focuses only on the idioms themselves, it is difficult to improve the accuracy of idiom machine reading comprehension with this dataset alone, and little consideration has been given to extending it with external knowledge such as idiom interpretations.
(2) The treatment of idioms in current research is not reasonable enough. Most existing methods [2] either split four-character idioms into individual Chinese characters or input them as complete words without further processing. As a result, the model cannot parse the intrinsic structure of idioms or extract their feature information, which reduces its ability to learn idiom semantics.
(3) Cloze-style machine reading comprehension requires models to understand the passage and the options deeply. Existing studies have not fully exploited the advantages of the end-to-end information extraction of deep models and have not been able to reason from multiple perspectives, such as global and local, over the information given by the question.
In response to the above problems, this paper improves three aspects, namely idiom representation, interpretation expansion, and multi-granularity semantic capture, and proposes an idiom-oriented machine reading comprehension model, BERT-IDM (IDM is an abbreviation of idiom). The main research work includes the following three aspects:
(1) To address the problem that current models lack external knowledge expansion and corpus data support, we constructed a corpus of idiom interpretation sentences for the pre-training stage and, by training on it, initially incorporated idiom interpretation information as external knowledge. In the fine-tuning stage, we introduced an autoregressive model, XLNet, so that the two mainstream types of pre-trained model complemented each other and formed a dual-stream model structure. In this way, the idiom interpretation could be encoded and interact with the main model to integrate interpretation, options, and contextual semantics, thus improving the model’s ability to understand idioms.
(2) To improve the model’s idiom reading comprehension level, we used idiom mask training to obtain a representation (the “two plus two” structure) that was more in line with the inherent characteristics of idioms, allowing the model to understand an idiom and its context more deeply and naturally.
(3) To address the problem that the model could not efficiently extract semantic information from candidates and contexts, we designed a self-masked self-attention mechanism (SMSA) that improves the attention mechanism in the continued pre-training phase, preventing inter-layer information leakage and reducing model complexity. In the fine-tuning phase, we then combined global attention, windowed attention, and idiom masks to modify the Transformer’s multi-head attention structure, obtaining the multi-granularity integrated attention (MGIA) mechanism for the information fusion of the dual-stream model. This yields a better latent representation and effectively improves model robustness and multidimensional information capture.
The rest of the paper is organized as follows: 
Section 2 introduces the work related to the study of idiom machine reading comprehension; 
Section 3 presents the idiom reading comprehension model; 
Section 4 compares the model proposed in this paper with other benchmark models used for the experiments; 
Section 5 presents an analysis of the experimental results and conclusions; and 
Section 6 summarizes the entire work.
  2. Related Work
Machine reading comprehension is not a new topic in natural language processing. As early as 1977, Lehnert et al. [5] built the question-answering program QUALM to comprehend stories, bringing contextual comprehension under study for the first time. In 1999, Hirschman et al. [6] created Deep Read, a reading comprehension system with a development set and a test set of 60 discourse items, each using reading material from grades 3 to 6. At that time, Deep Read’s baseline system obtained 30–40% accuracy in eleven subtasks, while most machine reading comprehension systems of the same period were rule-based [7] or statistical [8] models. Riloff et al. [9] designed a rule-based machine reading comprehension system, Quarc, which used multiple heuristic rules and morphological analysis to provide answers. In 2010, Poon et al. [10] started to use machine learning methods for machine reading, combining techniques such as bootstrap sampling, Markov logic, and self-supervised learning. However, the lack of high-quality, large-scale reading comprehension datasets and the heavy reliance on manually constructed rule sets and features meant that, for a long period, research on machine reading comprehension did not receive sufficient attention.
The above situation was resolved in 2015 with the emergence of neural machine reading comprehension and new large-scale benchmark datasets. On the one hand, deep-learning-based machine reading comprehension began to show its unique advantage in capturing contextual information; on the other hand, datasets such as CNN/Daily Mail [11], the Stanford Question-Answering Dataset (SQuAD) [12], and MS MARCO [13] provided data support and evaluation testbeds for deep neural network architectures. Deep learning does not rely on linguistic feature tools, does not require manual feature construction, and generalizes more robustly than traditional methods.
Hermann et al. [8] proposed Attentive Reader, a supervised attention-based bidirectional long short-term memory (LSTM) model, for the cloze-style CNN/Daily Mail dataset. It reached an accuracy of 63.8%, an improvement of more than 10% over traditional models. The following year, Chen et al. [14] replaced the tanh layer with a bilinear layer, further improving the accuracy to over 70%. Although neural models had been successful in NLP tasks, their gains in MRC accuracy remained limited. A significant reason is that the datasets available for most supervised NLP tasks (except machine translation) were small, whereas deep neural networks usually have a large number of parameters, so training on small datasets leads to overfitting. As a result, early neural models for NLP tasks were relatively shallow, often containing only one to three layers.
A large body of research in recent years has shown that pre-trained models (PTMs) based on large corpora can learn generic linguistic representations and benefit downstream NLP tasks while avoiding training models from scratch. With the growth of computational power, the emergence of deep models (such as the Transformer [15]) and advances in training techniques have allowed PTMs to evolve from shallow to deep.
In 2018, ELMo [16] used a bidirectional LSTM as a feature extractor to learn deep contextual word representations. A new paradigm of pre-training plus fine-tuning gradually formed and opened a new era of pre-trained language models (PLMs). Current deep PLMs have shown a powerful ability to learn universal language representations. Owing to its stronger feature extraction ability, the Transformer [17] structure was widely used in subsequent large PLMs such as BERT [18] and OpenAI GPT [15]. At the same time, the attention mechanism allowed models to learn to focus on specific information. With this combination of features, PLMs continue to be used in various NLP subfields, setting new state-of-the-art (SOTA) scores. BERT, which is based on the Transformer structure and pre-trained with masked language modeling (MLM) and next sentence prediction (NSP) on a large unlabeled corpus, brought machine reading comprehension research into the BERT-based era. Among the models that followed, RoBERTa [19] removed the NSP pre-training task of BERT, showing that it was not beneficial for downstream tasks, and further improved performance by replacing BERT’s static random masking with dynamic masking. Joshi et al. [20] proposed SpanBERT, a pre-trained model that treats the input as a set of spans and achieves good results by pre-training on a span-level masked language modeling task. Zhang et al. [21] introduced CPM (Chinese pre-trained language model), a large-scale generative model with strong language generation capabilities that was pre-trained on a corpus of over 10 billion Chinese tokens.
Unlike English, Chinese is unique in its syntax, vocabulary, and pronunciation. Li et al. [22] proposed in 2019 that deep learning representations of Chinese should use Chinese characters as the basic unit in the word segmentation process, instead of using words or subwords as in English according to the standard proposed by Wu et al. [23]. Xu et al. [24] constructed a standard dataset for evaluating Chinese natural language processing models in 2020. Subsequently, ERNIE [25] proposed three masking strategies (character-level masking, phrase-level masking, and entity-level masking) to enhance the model’s ability to capture semantics at multiple granularities. Cui et al. [2] proposed BERT-WWM based on BERT and pre-trained the model with a whole-word masking strategy, which masks all the characters of a Chinese word together instead of masking single characters as independent units. As a result, the model was forced to predict whole words at the masked positions instead of predicting word components, as in the original BERT model, which improved the model’s understanding of Chinese inputs.
Idioms differ from ordinary Chinese words in that they have semantic unity and structural fixedness. Semantically, an idiom is indivisible as a whole, and its overall meaning cannot be inferred from the individual characters that make it up. Structurally, neither the order of its characters nor its grammatical structure can be changed at will. These properties lead to the low accuracy of existing models on idiom machine reading.
The Modern Chinese Dictionary defines “idiom” as follows: “a fixed phrase or short sentence that is concise and incisive in usage; most Chinese idioms consist of four characters”. More than 95% of Chinese idioms comprise four characters (the rest range from 3 to 16 characters), and among these four-character idioms, the majority have the “two plus two” structure. This structure requires the first and second characters and the third and fourth characters to each be treated as a single unit. Idioms of this kind are the most widely distributed [26] in existing corpora and in everyday discourse. In 1995, Goldberg [27] proposed a theoretical framework for construction grammar, arguing that some aspects of the form or meaning of a construction cannot be fully predicted from its components or from established constructions.
Based on Goldberg’s theory, Wang et al. [28] investigated the antonymic co-occurrence construction “no A no B”, which fixes the first and third characters and pairs them with second and fourth characters of opposite meaning, and explored the intrinsic non-compositional meaning formation mechanism of this construction. Xie et al. [29] used the event-related brain potential technique to study how the human brain understands Chinese idioms from a neuroscientific perspective. They found that idiom processing patterns differ between the East and the West and that Chinese idiom comprehension follows the constructional theory of idioms. They also confirmed that the process of Chinese idiom comprehension cannot be reduced to a simple linear extraction, and they discussed the relationship between prosodic structure and syntactic construction at the level of Chinese utterances. The two-plus-two prosodic unit of the four-character idiom is considered a unique feature of Chinese idiom comprehension compared with other languages, especially English. Yang [30] explicitly proposed that, under the constructional theory, idiom comprehension requires splitting idioms into two-plus-two structures and explored the relationship between these components. However, incorporating two-plus-two structural semantics into training for idiom semantic understanding is still a fundamental open problem in machine learning. The interaction with contextual semantics also needs to be considered when understanding idiomatic semantics, yet current methods still lack the capability for multi-dimensional semantic acquisition, which limits their understanding of idioms. Madabushi et al. [31] proposed a cross-lingual idiom detection and sentence embedding evaluation task, which aimed to evaluate whether computers could accurately identify idioms in multiple languages and embed them contextually into semantic space.
This paper improves the BERT-Chinese [25] model by using interpretation expansion and external knowledge to enhance the model’s understanding of idioms, changing the masking method according to the structure of Chinese idioms, and introducing multi-granularity semantic reasoning to obtain more contextual information and thereby improve the accuracy of the model.
  3. Model
In this paper, BERT-Chinese was first improved to obtain the BERT-IDMbase model. Then, the overall model of this paper, BERT-IDM-FULL, was obtained by adding a multi-granularity inference mechanism. The two models were compared with a series of benchmark models in the experimental phase. The overall architecture of the model is shown in Figure 1.
  3.1. BERT-IDMbase Model
In contrast to classical machine learning methods, the “pre-training + fine-tuning” model enables larger, better-performing, and more generalizable models using large-scale unlabeled data without needing data labeling. The in-domain further pre-training (IDFP) method continues to pre-train a pre-existing model on specific types of data, allowing increasingly large pre-training models to avoid starting from scratch each time. With the support of a sufficient unlabeled corpus, IDFP can effectively improve the performance of the original pre-trained model on a particular task. In this work, we first crawled the idiom entries in the “Idioms in Sentences” section of the Sentences website as an index and then used the “Sentence Search” function of the site to backtrack the idiom sentences in the index dictionary and stitch them together using English commas. Based on this, since the idioms and their corresponding meanings also provided opportunities for the model to learn the idiom comprehension mechanism, we continued to query the implications based on the crawled idiom index from the online idiom dictionary website, and the idioms and their corresponding definitions were also separated by commas and stitched together.
The BERT-IDMbase model proposed in this paper is shown in Figure 2. In Figure 2, [CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token. We denote the input embedding as E, the final hidden vector of the special [CLS] token as C, and the final hidden vector for the ith input token as T_i. To address the problem that current idiom segmentation approaches do not enable the model to fully understand idioms, the BERT-IDMbase model uses a two-plus-two masking scheme for idiom input: it masks two characters at a time, using the idiom mask to treat the first and second characters and the third and fourth characters of each idiom as two input units. This scheme enables the model to attend to the association information contained in the idiom structure and thus supports the extraction of additional structural information. It avoids both character-based prediction, which destroys structural information, and whole-word masking, which masks the whole idiom at once and ignores structural information. To conform to the structural features of idioms, the model is required both to infer the missing part from the given part of the idiom (simulating an understanding of the idiom itself) and to retain the advantage of whole-word masking, which places the target as a whole word into context for overall contextual inference. The Transformer structure, however, not only requires a large number of matrix operations in the self-attention module, which greatly increases the computational load, but also introduces the problem of information leakage in the MLM target task.
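The two-plus-two masking idea described above can be illustrated with a minimal sketch. It assumes character-level tokens, a known start index for the idiom, and that exactly one half of the idiom is hidden per training example; the function name `two_plus_two_mask` and these choices are illustrative assumptions, not the authors' implementation.

```python
import random

MASK = "[MASK]"

def two_plus_two_mask(chars, idiom_start, rng=random):
    """Mask one half (two characters) of a four-character idiom.

    chars: list of single-character tokens for the sentence.
    idiom_start: index of the idiom's first character (assumed known).
    Returns (masked_tokens, labels); labels holds the original character
    at masked positions and None elsewhere.
    """
    if idiom_start < 0 or idiom_start + 4 > len(chars):
        raise ValueError("idiom span must lie inside the sentence")
    # Characters 1-2 and 3-4 form the two units; pick one unit to hide.
    half = rng.choice([0, 1])
    begin = idiom_start + 2 * half
    masked = list(chars)
    labels = [None] * len(chars)
    for pos in (begin, begin + 1):
        labels[pos] = chars[pos]
        masked[pos] = MASK
    return masked, labels

sentence = list("他做事总是一丝不苟的")
masked, labels = two_plus_two_mask(sentence, idiom_start=5, rng=random.Random(0))
print("".join(masked))
```

The visible half of the idiom stays in the input, so the model has to recover the hidden half from both the remaining half and the surrounding context.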
In this paper, the structure of the Transformer is improved, and the self-masked self-attention mechanism (SMSA) is proposed. In the self-attention masking process, SMSA takes two critical steps to keep the model efficient and free of information leakage. Firstly, in the information aggregation stage, SMSA removes the token information from the query vector Q and retains only the positional information. Raw content in the query vector Q could leak the information that the model is asked to predict; retaining only the positional information avoids this. Secondly, in the self-attention calculation, SMSA normalizes the attention scores through a scaling operation and then adds an attention masking matrix. This attention masking matrix is a diagonal matrix with 1s on the main diagonal and 0s elsewhere. Its role is to mask the elements on the main diagonal of the attention score matrix, thereby shielding each vector from its own information. When attention weights are calculated, each vector therefore attends only to the information of other vectors, which reduces computational complexity, improves the efficiency of the model, and prevents the self-information of each vector from interfering with the prediction. The attention mask pattern is shown in Figure 3. Based on this, the compatibility of the model with multi-layer deep stacking architectures and residual connections was restored by setting up an isolation mechanism for vectors.
The key to the Transformer, the core structure of BERT, is the self-attention mechanism, which operates on the Q, K, and V vectors. In BERT’s self-attention mechanism, the Q, K, and V vectors are derived from the same input. If the layer-wise propagation of the original BERT self-attention is used to select the K and V vectors in subsequent hidden layers, the content information in the token vectors will diffuse into the query vector Q in the subsequent layers, because the K and V vectors have already been fused in the preceding layers. The final vector used for the prediction output is the query vector Q, so if this problem is not addressed, the SMSA mechanism will fail after the first layer in a deep model. To solve this problem, we set up an isolation mechanism for the K and V vectors to restore the compatibility of the model with a multi-layer deep stacking architecture and residual connections. Specifically, all K and V vectors in the encoding layers were fixed to a constant value, and only the query vector Q, which was used for the final prediction output and did not contain self-content information, was updated across layers. The fixed values of the K and V vectors were determined by the combination of the input embedding sequence E and the position vector P. The attention calculation process of the modified SMSA is shown in Figure 4. Algorithm 1 formally defines the attention flow isolation of SMSA and describes the flow in detail. By fixing these vectors, the algorithm also reduces the time spent on the large number of linear operations in the computation of the self-attention mechanism.
        
Algorithm 1: BERT-IDMbase based on the SMSA mechanism
Input: Query vector Q, key vector K, value vector V
Output: Query vector for predicting the output
Step 1: Load the sequence of token embeddings E and the corresponding position vector P.
Step 2: Initialize the query vector of the first-layer input as the position vector P; fix the key vector K and the value vector V to the vector determined by E and P, and do not update them in subsequent layers.
Step 3: In layer m, the input query vector and the K vector undergo matrix multiplication and scaling, pass through the attention mask module and then the Softmax layer, and are multiplied by the vector V to give the intermediate output of the attention module, in accordance with Equation (1).
Step 4: The output of this layer is obtained by forward propagation through the fully connected feed-forward network (FFN). It satisfies Equation (2) after the residual connection and normalization are applied to the result of step 3, where the FFN satisfies Equation (3) for input x.
Step 5: Repeat steps 3 and 4 for each Transformer layer in the stacking architecture until the end of the calculation.
Step 6: The final layer outputs the trained query vector, and the algorithm ends.
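The attention flow in Algorithm 1 can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' code: K and V are fixed to the element-wise sum E + P (one possible reading of "the combination of E and P"), the diagonal mask is applied by setting the self-score to negative infinity before the softmax, and the learned projections and the FFN are omitted. The names `smsa_weights` and `smsa_encode` are hypothetical, and sequences need at least two positions.

```python
import math

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-6):
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def smsa_weights(Q, K):
    """Scaled dot-product weights with the main diagonal masked out."""
    d = len(K[0])
    weights = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        scores[i] = float("-inf")  # a position never attends to itself
        weights.append(softmax(scores))
    return weights

def smsa_encode(E, P, num_layers=2):
    """Update only Q across layers; K and V stay fixed (assumed E + P)."""
    K = [[e + p for e, p in zip(ev, pv)] for ev, pv in zip(E, P)]
    V = K
    Q = [list(p) for p in P]  # first-layer query holds position info only
    for _ in range(num_layers):
        W = smsa_weights(Q, K)
        A = [[sum(W[i][j] * V[j][c] for j in range(len(V)))
              for c in range(len(V[0]))] for i in range(len(Q))]
        # Residual connection and normalization (FFN omitted in this sketch).
        Q = [layer_norm([q + a for q, a in zip(Q[i], A[i])])
             for i in range(len(Q))]
    return Q

E = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0]]
P = [[0.1, 0.2, 0.3], [0.3, 0.1, 0.5], [0.2, 0.4, 0.1]]
print(smsa_encode(E, P)[0])
```

Because K and V never change and the diagonal is masked in every layer, the output at position i never depends on the embedding of token i itself, which is the leakage-prevention property the text describes.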
  3.2. Multi-Granularity Integrated Attention Mechanism
We found that the existing work focused on optimizing the network and model structure from an engineering perspective rather than modeling the realistic behavior of humans in reading comprehension. In contrast, human reading comprehension is characterized by leaps, global grasp, and local focus, and we attempted to build a new idiom reading comprehension model based on the BERT-IDMbase model proposed in the previous section with more explanatory capabilities using multi-granularity reasoning and the introduction of idiom interpretation for expansion.
We based our model on the fact that people tend to focus their attention on the main idea of the passage after grasping it, rather than grasping every detail for memorization, when performing reading comprehension. Therefore, we improved the attention mechanism of the BERT-IDMbase model in the fine-tuning phase by introducing a multi-granularity integrated attention mechanism.
In MGIA, global attention (GA) is used to simulate the general reading process of human reading comprehension, stride attention (SA) is used to simulate the jump reading process, and window attention (WA) is used to simulate fine reading that focuses on local information. MGIA is a kind of sparse attention, which can effectively reduce the number of attention scores to be computed and improve operational efficiency compared with the fully connected dense matrix used by the original self-attention mechanism.
In terms of implementation, MGIA makes use of the multi-head attention mechanism of the Transformer structure, in which the stacked attention heads are assigned to different attention types according to a predefined pattern. This allows the model to apply different approaches to the reading comprehension inference process simultaneously, and the importance (i.e., weight) of the different inference patterns can be adjusted by setting the distribution of attention heads. The combined attention mask matrix is shown in Figure 5, where the pink squares are global attention (GA), the blue squares are stride attention (SA), and the yellow squares are window attention (WA).
Let the attention set record the positions of the elements attended to from the current position. The three types of attention used by MGIA can then be specified as described below (using the lower triangular matrix).
There are two ways to implement global attention (GA): one is to select individual tokens at the head of the sequence as global tokens, and the other is to add new tokens without real meaning to the beginning of the sequence to aggregate the information of the whole sequence. In this section, we chose the latter and uniformly added [CLS] tokens to avoid the bias caused by data samples. After the global tokens are set, all tokens compute attention scores with the global tokens, and the global tokens in turn compute attention scores with all tokens. Thus, all tokens in the sequence (i.e., the information distributed over the whole sequence) are linked through the global tokens, so extracting the global tokens has the effect of grasping the main idea of the passage. The attention set corresponding to global attention is given in Equation (4).
The stride attention mechanism (SA) was originally used for image generation and later applied to language modeling. With SA, long-range dependencies in sequences can be constructed, and vector representations can be compressed to reduce computational complexity. This effectively increases the receptive field (analogous to a convolution operation) for the same computational cost at longer sequence lengths. For a given stride length k, the attention matrix attends to every kth element counting from the current element and aggregates these elements into the attention set. Increasing the stride length further reduces the computational complexity. For example, when k = √n for a sequence of length n, the complexity of the attention mechanism decreases from O(n²) to O(n√n). The attention set corresponding to stride attention is given in Equation (5).
Window attention (WA) focuses on the information around the current time step. For a window size of k, window attention takes the elements within distance k of the current position as the objects of attention. Window attention can also be regarded as a kind of local attention: it enables the model to focus fully on the semantic information in the vicinity of the position being processed and significantly reduces the attention diverted to information that is theoretically irrelevant to the current subword. The attention set corresponding to window attention is given in Equation (6).
With these three predefined templates, the attention mechanism of MGIA allocates computational resources to the key parts efficiently. It avoids distracting attention while combining global and local information at multiple granularities, models long-range dependencies, and reduces the complexity of the attention computation. The general structure of the MGIA mechanism is shown in Figure 6.
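The three attention patterns described above can be written as boolean masks. The sketch below is an illustration under assumptions, since the exact set definitions of Equations (4) to (6) are not reproduced here: global tokens are taken to be the first `num_global` positions, the stride pattern keeps positions whose distance from the current one is a multiple of `stride`, and the window keeps positions at distance at most `window`. All function and parameter names are hypothetical.

```python
def mgia_masks(n, stride, window, num_global=1):
    """Return boolean n x n masks (GA, SA, WA) in lower-triangular form.

    mask[i][j] is True when position i may attend to position j (j <= i).
    """
    # Global attention: every position sees the global tokens, and
    # global tokens see every position allowed by the triangular form.
    ga = [[j <= i and (j < num_global or i < num_global) for j in range(n)]
          for i in range(n)]
    # Stride attention: every stride-th element counting back from i.
    sa = [[j <= i and (i - j) % stride == 0 for j in range(n)]
          for i in range(n)]
    # Window attention: elements within `window` steps before i.
    wa = [[0 <= i - j <= window for j in range(n)] for i in range(n)]
    return ga, sa, wa

def combine(*masks):
    """Union of several masks, useful for viewing the integrated pattern."""
    n = len(masks[0])
    return [[any(m[i][j] for m in masks) for j in range(n)] for i in range(n)]

def render(mask):
    return "\n".join("".join("x" if cell else "." for cell in row)
                     for row in mask)

ga, sa, wa = mgia_masks(n=6, stride=2, window=1)
print(render(combine(ga, sa, wa)))
```

Each row of a mask activates far fewer than n positions, which is where the savings over dense attention come from.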
The multi-head attention mechanism is an extension of the single-head attention mechanism that splits the input vectors according to the number of attention heads. The attention heads are isolated from and invisible to one another, so attention scores are computed independently and in parallel. Different attention masks were chosen for different heads so that the same input yielded different weight coefficients, and the head outputs were concatenated end to end. In this way, the semantic information learned by attention at many different granularities was integrated, simulating the way humans read and reason using multiple modes. For a given input sequence, with the set of activated elements in the attention mask of the MGIA mechanism determining which positions may be attended to, the update of the multi-head self-attention mechanism is expressed formally in Equation (7).
        
  3.3. Embedding Layer
In the embedding layer, a pre-trained model that could sufficiently extract semantic information and syntactic structure needed to be selected as the encoder for the input. Both BERT-IDMbase and XLNet are pre-trained models that contain a large amount of linguistic knowledge learned from a large-scale corpus, and each is capable of performing this task independently. Nevertheless, the advantages of both could be combined to obtain a better idiom representation by building a dual-stream pre-trained model.
A major drawback of BERT is that downstream tasks do not contain artificially added symbols such as [MASK], leading to a mismatch between pre-training and fine-tuning. This is a common problem in the design of BERT models and a side effect of the MLM objective. However, this drawback was not significant in the downstream task of this study, because the cloze task requires the model to select the correct candidate for a masked, incomplete input in order to recover the original passage. The input therefore contains artificially added marker symbols, and the task pattern is consistent with MLM. XLNet was not suitable as the backbone model because of the large difference between its pre-training objective and the cloze task. However, because of its powerful semantic information extraction ability and the inherent advantage that an autoregressive model can take into account the interrelationships between the subwords of a sequence, it was suitable for use as a secondary model.
The additional idiom interpretation data introduced in this work did not contain any artificial symbols and were therefore poorly suited to encoding with BERT-IDMbase; using XLNet as a black box in an end-to-end fashion was appropriate for this scenario. In addition, the pre-training task of BERT-IDM taught the model to process idioms in a “two-plus-two” structure. To further enhance the multi-granularity setting of the model, the input of XLNet was therefore decomposed into a simple sequence of characters, i.e., the idioms were not entered into the model as whole words or in a two-plus-two structure.
For a given passage and seven options (the ChID dataset provides seven-choice cloze questions), the input for XLNet consisted of the passage followed by the options decomposed character by character, so that the jth character of the ith option formed one input unit, while the input for BERT-IDM kept each option as a whole word in the sequence.
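The difference between the two input streams can be made concrete with a small sketch. It assumes the passage has already been tokenized and ignores special tokens and separators, whose exact layout is not given here; `build_dual_stream_inputs` and the sample idioms are illustrative only.

```python
def build_dual_stream_inputs(passage_tokens, options):
    """Build character-level and word-level input sequences.

    passage_tokens: list of passage tokens (tokenization is assumed done).
    options: the seven candidate idioms of one cloze question.
    Returns (char_stream, word_stream): the first decomposes every option
    into single characters (XLNet side), the second keeps each option as
    one whole-word unit (BERT-IDM side).
    """
    if len(options) != 7:
        raise ValueError("expected seven candidate idioms")
    char_stream = list(passage_tokens) + [ch for opt in options for ch in opt]
    word_stream = list(passage_tokens) + list(options)
    return char_stream, word_stream

passage = ["他", "做", "事", "[MASK]", "。"]
options = ["一丝不苟", "马马虎虎", "粗心大意", "精益求精",
           "一心一意", "三心二意", "全力以赴"]
char_stream, word_stream = build_dual_stream_inputs(passage, options)
print(len(char_stream), len(word_stream))
```

The same question thus reaches the two encoders at two granularities, which is the property the dual-stream design relies on.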
  3.4. Feature Extraction Layer
After acquiring the embedding representation of the original input for initialization in the embedding layer, the idioms were encoded from the character level and word level using XLNet and BERT-IDMbase, respectively, potentially introducing the ability to understand idioms in multiple dimensions, while in the feature extraction layer the two input streams were directly feature-extracted using MGIA, the multi-grained integrated attention proposed in this paper. The models in this layer performed multi-granularity inference on top of their respective encoders, which was equivalent to testing two subjects who had learned the language in different ways, both using the same inference paradigm consistent with human thinking patterns to complete the test questions. The multi-granularity inference approach further amplified the differences between the multi-dimensional comprehension modalities in the embedding layer, increasing the model’s potential in terms of multi-perspective reading inference capabilities and generalization performance.
Let the hidden layer of the attention mechanism be represented as . The synthetic vector of  input into the feature extraction layer through the embedding layer is represented as , and the computation between the hidden layers is shown in Equation (8).
        
  3.5. Semantic Interaction Layer
The vector flow passed through the feature extraction layer and entered the semantic interaction layer, where the context was fused with the combination of options plus interpretation. In this layer, multiple attention heads computed attention scores in parallel and integrated them in the manner shown in Equation (9). The results of the two reading comprehension methods of different granularity were merged into a unified model in this layer, and the vectors were spliced to obtain the overall attention in the form of Equation (10) in preparation for the final output. The fusion vector at this point already contained the results of multi-granularity inference and of the expansion with interpretation information.
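The parallel attention heads and the splicing of the two streams described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: it uses standard scaled dot-product attention without learned projection matrices, the function names `multi_head_attention` and `fuse_streams` are hypothetical, and the exact form of Equations (9) and (10) in the paper may differ.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(q, k, v, num_heads: int) -> np.ndarray:
    """Scaled dot-product attention computed per head, heads concatenated.

    q: (Lq, d); k, v: (Lk, d); d must be divisible by num_heads.
    Learned projections are omitted to keep the sketch short.
    """
    d = q.shape[-1]
    dh = d // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = q[:, sl] @ k[:, sl].T / np.sqrt(dh)   # (Lq, Lk)
        heads.append(softmax(scores) @ v[:, sl])        # (Lq, dh)
    return np.concatenate(heads, axis=-1)               # (Lq, d)


def fuse_streams(context, char_stream, word_stream, num_heads: int = 12):
    """Attend from the context to each stream, then splice the two results."""
    a = multi_head_attention(context, char_stream, char_stream, num_heads)
    b = multi_head_attention(context, word_stream, word_stream, num_heads)
    return np.concatenate([a, b], axis=-1)              # (Lq, 2 * d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    context = rng.normal(size=(5, 24))       # toy sizes, not the paper's
    char_stream = rng.normal(size=(9, 24))
    word_stream = rng.normal(size=(4, 24))
    print(fuse_streams(context, char_stream, word_stream).shape)
```

Concatenation is used here because the text speaks of splicing the vectors; other fusion operators would require a different final step.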
        
  3.6. Answer Prediction Layer
Finally, the model prediction results were output after the probability of the semantic fusion information  from the upper layer was calculated through the linear fully connected layer, and the inferred answer was derived as shown in Equation (11).
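As a rough illustration of this prediction step, the sketch below scores each candidate with a linear layer, normalizes the scores with a softmax, and selects the most probable option. The function name `predict_answer`, the one-vector-per-option layout, and the softmax normalization are assumptions; the precise form of Equation (11) is given in the original paper.

```python
import numpy as np


def predict_answer(fused: np.ndarray, weight: np.ndarray, bias: float = 0.0):
    """Score each option and return (probabilities, index of the best option).

    fused:  (num_options, d) fused representation, one row per candidate
    weight: (d,) weights of a hypothetical linear scoring layer
    """
    logits = fused @ weight + bias          # one score per option
    logits = logits - logits.max()          # stabilize the exponentials
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs, int(np.argmax(probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fused = rng.normal(size=(7, 16))        # seven options, as in ChID
    weight = rng.normal(size=16)
    probs, best = predict_answer(fused, weight)
    print(best, probs.round(3))
```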
        
This concludes the introduction of the BERT-IDM-Full model proposed in this paper. Experiments comparing it with BERT-IDMbase and other baseline models are presented in the next section.
  5. Results and Analysis
  5.1. Experimental Results
This section presents the experiments conducted on the proposed models using the ChID dataset and the CNCID dataset constructed in this study. Commonly used baseline models were compared with the proposed BERT-IDMbase model and the BERT-IDM-Full model in terms of reading comprehension and out-of-domain generalization ability. The experimental results indicated that the pre-trained models (PTMs) achieved significant improvements over the traditional models: BERTBASE, RoBERTa, and BERT-IDM-Full all outperformed the classic baseline models by a large margin. The overall experimental results are shown in Table 4.
This experiment retained the same BiLSTM baseline as XLNet for regression prediction; XLNet itself was proposed as a large pre-trained model that took RoBERTa as its baseline and obtained better results on several tasks. XLNet was indeed stronger than RoBERTa and BERT-IDMbase in the experiments, but it was slightly weaker than these two models on both the ChID out-of-domain and CNCID datasets, which indicated that its inference generalization ability was slightly weaker than that of the other two baselines. This may have been because the inference process modeled by the AR model only computed the log-conditional probability of each random variable and summed these to obtain the log-likelihood, unlike the AE model, which learned an unsupervised representation of the data in the self-encoding process.
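The contrast between the two objectives can be written out in their standard textbook forms. The notation below ($x$, $\hat{x}$, $m_t$, $\theta$) is not taken from this paper and is introduced only for illustration; XLNet additionally takes an expectation over permutations of the factorization order, which is omitted here.

```latex
% AR objective: the log-likelihood of a sequence x = (x_1, ..., x_T)
% is the sum of the log-conditional probabilities of its tokens.
\[
  \log p_\theta(x) \;=\; \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
\]

% AE (masked language model) objective: \hat{x} is the corrupted input,
% and m_t = 1 if position t was masked, 0 otherwise.
\[
  \sum_{t=1}^{T} m_t \,\log p_\theta\!\left(x_t \mid \hat{x}\right)
\]
```

In the first form each token is conditioned only on the tokens before it, whereas in the second form every masked token is reconstructed from the whole corrupted sequence.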
Furthermore, because XLNet offered strong performance but slightly deficient generalization in reasoning, using it as one of the two streams of the fused semantic integration could effectively help to improve the performance of the model, while the self-encoding model used in the other stream provided the basis for the model's high generalization ability. The integrated AR and AE models handled idioms as characters (character embedding) and as whole idioms (idiom embedding), respectively, and the paraphrase information of the idioms was added to both, bringing accuracy improvements of 1.2% and 1.6%, respectively, for the overall BERT-IDM-Full model proposed in this paper. Although this model achieved the best performance on all three datasets, there was still a substantial gap of close to 4% in reading comprehension ability compared with trained native Chinese speakers.
The accuracy convergence of the baseline models and the overall model during training is shown in Figure 8. All models reached convergence before the seventh round of training, and the performance of all the large-scale pre-trained models improved. The baseline BERT converged significantly worse because of its lack of language knowledge and of the ability to process idioms, while the simple BiLSTM model converged faster but with a significant performance gap. The overall model in this paper obtained a better convergence speed and prediction accuracy because it could fully parse the idiom structure and obtain knowledge of idioms in the target domain through further pre-training, which showed that the overall BERT-IDM-Full model proposed in this work had better robustness and better performance in reading and understanding idioms.
  5.2. MGIA Experimental Analysis
There are two configurable parts in MGIA: the scheme for assigning the three different types of attention to the 12 attention heads, and the Stride value of the stride attention. To find the attention head allocation scheme that improved performance, the following three attention combinations were chosen for this experiment. Each combination was fixed while the step size was varied on the ChID test set, and the results are shown in Table 5. The three combinations were: 12 stride attention heads (12SA); 6 global attention heads and 6 stride attention heads (6GA + 6SA); and 4 global attention heads, 4 stride attention heads, and 4 local (windowed) attention heads (4GA + 4SA + 4WA).
As can be seen from the table, the model performed best when the step size was 1, and any increase in the step size led to a decrease in performance. This may have been because the sample sequences of the ChID dataset were too short for a larger step size to be useful. The first scheme, which used only stride attention, performed poorly because the absence of global attention deprived it of full-text information, although its performance picked up later, probably as a result of model oscillation. With the addition of global attention and windowed attention to supply global and local information, the third scheme achieved the best performance among all the schemes. Although the model performance decreased as the step size increased, the third scheme degraded more slowly and with fewer oscillations, which indicated that the multi-granularity inference approach designed in this section was more robust.
However, a very important property of a step size of 1 was that it divided an idiom into two subwords, following the idiom mask setting in Section 3, so that the two subwords remained invisible to each other and provided no information to each other. Similar to BERT-WWM, which predicts whole words instead of characters and isolates the exchange of information between subwords, this setting made the learning process more challenging for the model. The experiments in this section examined the effect of the three schemes on the robustness and oscillation resistance of the model under step sizes of 1 and 3. The experimental results are shown in Figure 9, where 1S and 3S denote step sizes of 1 and 3, respectively.
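To make the three allocation schemes concrete, the following NumPy sketch builds one boolean attention mask per head for each scheme. The mask definitions are assumptions, not the paper's: the stride mask is taken to skip `stride` positions between attended tokens (so that adjacent tokens do not see each other when the step size is 1, in line with the remark above), and the window half-width `window` is a hypothetical parameter.

```python
import numpy as np

# Head counts per attention type for the three schemes named in the text.
SCHEMES = {
    "12SA": {"stride": 12},
    "6GA+6SA": {"global": 6, "stride": 6},
    "4GA+4SA+4WA": {"global": 4, "stride": 4, "window": 4},
}


def global_mask(n: int) -> np.ndarray:
    """Every position attends to every position."""
    return np.ones((n, n), dtype=bool)


def stride_mask(n: int, stride: int) -> np.ndarray:
    """Assumed definition: attend to positions whose offset is a multiple
    of (stride + 1), i.e. skip `stride` tokens between attended tokens."""
    idx = np.arange(n)
    return (idx[:, None] - idx[None, :]) % (stride + 1) == 0


def window_mask(n: int, window: int) -> np.ndarray:
    """Local attention: attend to positions at most `window` tokens away."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def head_masks(scheme: str, n: int, stride: int = 1, window: int = 1) -> np.ndarray:
    """Return a (12, n, n) stack of masks, one per attention head."""
    builders = {
        "global": lambda: global_mask(n),
        "stride": lambda: stride_mask(n, stride),
        "window": lambda: window_mask(n, window),
    }
    masks = []
    for kind, count in SCHEMES[scheme].items():
        masks.extend([builders[kind]()] * count)
    return np.stack(masks)


if __name__ == "__main__":
    for name in SCHEMES:
        print(name, head_masks(name, n=6, stride=1).shape)
```

Under these assumed definitions, increasing `stride` makes the stride heads sparser, while the global and windowed heads in the mixed schemes are unaffected by the step size.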
From Figure 9, it can be seen that the final training losses under a step size of 3 were all greater than those under a step size of 1. With a step size of 1, the loss of the MGIA module with the hybrid attention head mechanism decreased more smoothly, and the accuracy corresponding to this combination of settings, as shown in Table 5, was also better, which indicated that higher robustness improved both performance and convergence.
  5.3. Ablation Experiments
In this section, we present the ablation experiments conducted on two constituent elements of BERT-IDM-Full, the XLNet interpretation expansion module and the MGIA multi-granularity integrated attention reasoning module, to investigate the impact of both on the performance of the overall model and the effectiveness of the model improvements. The experimental results obtained are shown in Table 6.
From the experimental results, it can be seen that adding the MGIA module for enhanced inference improved performance significantly, by more than 1% over the baseline BERT-IDM model, which indicated that multi-granularity inference enabled BERT-IDM to catch up with XLNet in terms of performance. However, the generalization ability of the model with this setting was slightly weakened. The generalization ability of XLNet in combination with the BERT-IDMbase model was greater than that of XLNet alone, and the substantial improvement in performance compared to the baseline was a good indication of the effectiveness of incorporating idiom interpretation. However, the results of this model were only comparable to those of the BERT-IDM model with MGIA added, which again was not the upper limit of XLNet’s capability. The results of both experiments indicated that only after the multi-granularity integration of MGIA could the AE and AR models each take advantage of their strengths to achieve a better overall model structure.
In contrast, the overall model proposed in this work still significantly outperformed the other models on the new web-based idiom dataset CNCID, which indicated that the multi-granularity splitting of idioms followed by multi-granularity inference could indeed enhance the reading comprehension of the model and fully exploit the potential of the overall model.