Leveraging Label-Attention Networks and POS Tagging for Generating Chinese Cloze Questions

Hou, Yanyang; Xiong, Shufeng; Li, Yang

doi:10.3390/a19060501

by Yanyang Hou ¹, Shufeng Xiong ²,³ and Yang Li ²,³,*

¹ School of Information Engineering, Zhengzhou University of Industrial Technology, Zhengzhou 451100, China

² College of Information and Management Science, Henan Agricultural University, Zhengzhou 450002, China

³ Henan International Joint Laboratory of Agricultural Big Data and Artificial Intelligence, Zhengzhou 450002, China

* Author to whom correspondence should be addressed.

Algorithms 2026, 19(6), 501; https://doi.org/10.3390/a19060501 (registering DOI)

Submission received: 27 April 2026 / Revised: 17 June 2026 / Accepted: 18 June 2026 / Published: 22 June 2026

(This article belongs to the Special Issue Deep Learning Methods and Applications)

Abstract

Chinese cloze question generation for educational assessments requires identifying gap phrases that accurately reflect key knowledge points, posing significant challenges to automated systems. We observe that the syntactic boundaries revealed by part-of-speech (POS) tags closely align with the semantic boundaries of target gap phrases. Motivated by this observation, we propose a multi-task learning framework in which gap phrase identification serves as the primary task and POS tagging as a complementary auxiliary task. The two tasks share a common BERT-BiLSTM encoder, enabling mutual reinforcement of both syntactic and semantic representations through joint training. To further capture the interaction between label semantics and contextual word representations, we introduce a label-attention mechanism that models dependencies between the global word sequence and candidate label embeddings. Additionally, we construct a refined POS tag subset by excluding categories whose boundaries show no alignment with gap phrase boundaries, thereby strengthening the correspondence between the two tasks. Evaluated on a real-world dataset of 20.5K questions spanning five academic disciplines, our method achieves an F1 score of 65.85%, with a Recall of 67.79%, representing improvements of 2.12% and 4.35% over the prior state-of-the-art, respectively. These results demonstrate that exploiting the alignment between syntactic and semantic structures through joint learning is effective for generating educationally meaningful fill-in-the-blank questions.

Keywords: question generation; deep learning; label attention; multi-task learning

1. Introduction

Natural language processing has long been applied to support educational assessment, including computer-assisted language learning [1], automated essay scoring [2], and test item generation [3]. In recent years, advances in language generation models have further encouraged the adoption of automated question generation in educational settings [4,5,6]. This study focuses on Chinese cloze question generation (CCQG) for examinations, a task initially conceptualized in [7] in the context of biology textbook chapters for Advanced Placement exams. That framework divides the process into three subtasks: sentence selection, target (gap) selection, and distractor generation. Our work corresponds to the second subtask: identifying the key phrase to be omitted from a given sentence. For example, given the Chinese sentence “计算机工作时, 内存储器中存储的是指令与数据。” (When the computer is working, the memory stores instructions and data), the goal is to recognize “指令与数据 (instructions and data)” as the gap phrase that encapsulates the intended knowledge point.

Unlike cloze tasks in reading comprehension [8], the CCQG task for educational examinations is designed to assess a test-taker’s ability to apply linguistic and grammatical rules, or to evaluate their mastery of specific knowledge domains [4,5,6]. Consequently, the selected gap phrases (GPs) must accurately reflect the targeted knowledge points (such as grammatical structures, conceptual definitions, or procedural methods), and the selection is often constrained by formal grammatical rules or a defined body of subject knowledge.

In contrast, gap selection in reading comprehension is typically confined to a local vocabulary within a passage. This task revolves primarily around paragraph content, where words are masked within relevant sentences and answers are generally recoverable from the surrounding context. For instance, in the Chinese cloze dataset People Daily & Children’s Fairy Tale [9], answers are restricted to nouns that appear at least twice in the paragraph, though the specific choice is random. Correct answers in such tasks often exhibit lexical overlap with the context. As an example, in the sentence “The so-called ‘silly money’ ____ is that to buy and hold the common combination of U.S. stock,” the correct answer “Strategies” appears explicitly in the subsequent sentence: “This strategy is better than other complex investment methods…”

Approaches to cloze question generation range from the rule-based method of the original study [7] to the machine learning method of [5] and the knowledge link-based method of [10]. However, all of these methods were developed on small-scale labeled data. Although some work, such as [11], has constructed large-scale datasets, the vast majority of research focuses on English language learning corpora, and publicly available resources for Chinese remain limited. In 2021, one study [12] proposed a Chinese fill-in-the-blank question generation dataset for non-English subject exams that is large enough to support the training of current mainstream deep learning models, which alleviates the shortage of annotated data for CCQG to some extent. However, in today’s large language model-based natural language processing framework, the number of samples available for each category during training is still insufficient to learn a good feature representation.

This paper addresses the challenge of model learning under low-resource conditions, focusing on two key aspects. Firstly, a label-attention-mechanism-based approach is introduced, which enhances feature representations of word sequences by assigning preferences to different label types. This enriches sample feature representations without the need for increased training data. Secondly, we introduce POS tagging as an auxiliary task, integrating it into a multi-task learning model to enhance the feature encoding module’s modeling capacity and subsequently improve the performance of the sequence annotation module. The design of auxiliary tasks is driven by the observation that POS tag boundaries align closely with GP boundaries. A specific example is shown in Figure 1. A GP usually consists of one or more related categories of POS phrases. For example, a VV (other verb) “尊重 (respect)” and a NN (common noun) “人 (people)” form the GP “尊重人 (respect people)” in Ex.1.

The main contributions of this study are summarized as follows. First, we investigate the linguistic relationship between gap phrases and POS structures in Chinese cloze question generation. Different from general sequence labeling tasks, GP identification in educational cloze questions is closely related to the boundaries of knowledge-bearing syntactic units. We explicitly exploit this task-specific observation and use POS boundary information as auxiliary supervision.

Second, we propose a POS-guided multi-task learning framework for Chinese cloze question generation. In this framework, GP identification is treated as the primary task, while POS tagging is used as an auxiliary task to enhance the shared encoder’s ability to capture syntactic boundary information. This design allows the model to benefit from syntactic supervision without directly injecting manually engineered POS features into the input.

Third, we construct a refined POS tag subset for the auxiliary task. Instead of using the full CTB POS tag inventory, we exclude POS categories whose boundaries show little correspondence with annotated GP boundaries, thereby improving the relevance between the auxiliary task and the primary task.

Fourth, we adapt a label-attention mechanism to model the interaction between contextual word representations and GP label embeddings. This allows the primary task decoder to obtain label-aware representations for sequence prediction.

2. Related Work

2.1. Automatic Question Generation

Much of the early research on question generation relied on grammatical rules and templates. This approach divides the question generation task into two subtasks: “what to ask” and “how to ask” [13]. More specifically, linguistic rules and templates are devised to extract question content from the text, using information such as the grammatical composition of the input. The extracted content is then inserted into pre-constructed question templates to form the question sentence [14,15,16,17]. However, such methods depend on hand-crafted rules, cannot adapt to different text domains from data, and are costly to transfer, which limits their wider use. Subsequently, sequence-to-sequence encoder–decoder neural networks have been widely used in question generation tasks [18,19,20,21]. The approaches described above primarily concentrate on generating the question itself and often require the answer to be given in advance.

Another group of research works selects answers and generates questions directly in the modeling process. Yang et al. [22] adopt a rule-based approach to pick answer words from unlabeled text. Scialom et al. [23] proposed a multi-module interaction framework (a multi-agent communication framework) in which a local extraction module automatically identifies phrases worth asking about; these phrases support question generation, and feedback from the generated questions is in turn used to improve key phrase recognition. Subramanian et al. [24] use a two-stage model of keyword extraction followed by question generation. Willis et al. [25] exploit an end-to-end model that generates a list of candidate answers from the context. Wang et al. [26] use a multi-agent framework for phrase extraction and question generation. Our work is concerned with selecting answer phrases; once the answer is determined, the fill-in-the-blank question can be generated naturally.

2.2. Label Embedding

Pretrained language models, which are the core foundation of current neural text processing [27], can only represent information about the text itself. Information about the labels can also be incorporated. Label embedding is a technique that converts the category label of an image or text into a vector representation, and it has been increasingly applied to various tasks in natural language processing (NLP). Tang et al. [28] used label information in representation learning for large-scale heterogeneous text networks and achieved good experimental results. For text classification, Wang et al. [29] argued that in general models the label information only plays a supervisory role in the final classification layer, and that few studies combine label information with attention mechanisms to design efficient attention models. They therefore proposed the Label-Embedding Attentive Model (LEAM), which embeds word representations and label representations in a joint space to obtain label-related attention, and then applies the attention weights to the text representation to obtain a more accurate representation and ultimately improve text classification. Ma et al. [30] combine prototype and hierarchical information to learn label embeddings, which enhances named entity category labeling in the zero-shot setting. For multi-task text classification, Zhang et al. [31] proposed Multi-Task Label Embedding, which converts the label information in text classification into label vectors that carry semantic information, thereby turning text classification into a vector matching task and enhancing model performance. For sequence labeling, Zhang et al. [32] model the dependencies between labels and improve the accuracy of label sequences. Cui et al. [33] designed layer-refined RNN networks combined with label attention to gradually adjust the probability distribution of sequence labels and improve the final prediction performance. Liu et al. [34] combine label embedding with co-attention to improve text classification. Zhang et al. [35] designed a label-attention module for hierarchical multi-label text classification networks. Lv et al. [36] introduce predicate label representations in the event extraction task to guide the model toward better semantic understanding.

These studies demonstrate that label embedding is effective and can substantially enhance text representations. In our task, to make full use of the label embedding, this paper weights the encoder output features with the label embeddings and fuses the resulting attention values with the encoder feature representation for sequence prediction, which enhances the performance of the model.

2.3. Multi-Task Learning and Auxiliary Task Learning

Multi-task learning jointly trains multiple related tasks and learns shared representations across them, so that information shared between tasks helps each task, and the primary task in particular, to learn [43]. It is widely used in image and NLP tasks [37,38,39,40,41,42]. Parameter-sharing methods [44,45] in multi-task learning include hard sharing and soft sharing, among others. Specifically, Balikas et al. [46] proposed a recurrent neural network-based multi-task learning method that treats three-class sentiment classification (positive, neutral, and negative) and five-class sentiment classification (very positive, positive, neutral, negative, and very negative) as related tasks and solves the five-class problem through joint learning. Yang et al. [47] introduced a cross-modal multi-task transformer for multimodal sentiment analysis. Cheng et al. [48] jointly learned the entity linking and relation detection tasks for KB-QA. Zeng et al. [49] used a multi-task learning model for document retrieval and query generation and achieved good performance.

Unlike general multi-task learning, auxiliary task learning aims only to enhance the performance of the primary task; the auxiliary tasks serve solely to support its learning. Specifically, Kung et al. [50] studied data selection for auxiliary tasks; their method selects auxiliary training data subsets based on feature similarity, which improves training efficiency. Kumari et al. [51] designed novelty detection and emotion recognition as two auxiliary tasks to improve the performance of misinformation detection. Coavoux et al. [52] set up two supervised auxiliary tasks, a lexical annotation task and a functional label prediction task (determining the constituent relations between words and central words), to enhance multilingual constituent syntactic analysis. Hu et al. [53] addressed the problem of large legal text categories by redesigning the category system according to the specific content of the legal text (whether a death occurred, whether there was violence, etc.) and used classification under the new category system as an auxiliary task to help the primary task of classifying legal texts. Multi-task learning is also widely used in sequence labeling. Rei et al. [54] proposed unsupervised auxiliary tasks to help neural models learn deep semantic and syntactic information from text. Lin et al. [55] proposed a cross-lingual multi-task learning method to alleviate the shortage of corpora in a specific NER domain. This paper proposes an auxiliary learning method based on POS tagging and determines, through manual selection, the subset of POS categories closely related to the primary task, so that the feature encoding layer can better take POS information into account, thereby improving the performance of the primary task.

3. Research Objective

As mentioned in the previous section, CCQG currently suffers from a scarcity of annotated data, which makes it difficult to exploit the powerful capabilities of mainstream pretrained language models. To address this problem, we propose the following two research objectives while reusing the original pretrained language model architecture: (1) to explore a feature representation method that enriches sample representations without requiring additional training data; (2) to construct a multi-task learning framework and introduce appropriate auxiliary tasks to improve the performance of the CCQG model. To achieve these goals, we propose an approach based on a label-attention mechanism, which learns feature representations of the word sequence with a preference for the various labels. Furthermore, we design a POS-tagging auxiliary task to improve the representation of the feature encoding module, based on the observation that the boundaries of POS tags and GPs are highly consistent.

4. Definition of the CCQG Task

CCQG belongs to the sequence labeling problem in natural language processing, which uses an algorithm to allocate a predefined class label to each character in a sequence of Chinese text while preserving word order and context. Given an input sequence $x = \{x_t\}_{t=1}^{T}$ of length $T$ tokens, to identify the gap phrase (GP) in the sequence, the output label sequence $y = \{y_t\}_{t=1}^{T}$ is required to indicate the position of the GPs. In our task, the location information uses the BMEOS annotation mode, where B (beginning) means that the token is located at the beginning of the GP, M (middle) means that the token is located inside the GP, E (end) means that the token is located at the end of the GP, O (outside) indicates a non-GP token, and S (single) represents a GP with only one token. There is no need to identify type information in the CCQG task; however, for the auxiliary POS-tagging task we designed, the category information is the POS tag of the token. For POS tagging, assuming that the category set is $C = \{c_k\}_{k=1}^{K}$, after combining the position and category information $(B\text{-}C, M\text{-}C, E\text{-}C, S\text{-}C)$, the size of the token classification space is $4K + 1$, that is, $y_t \in [0, 4K]$.

The CCQG model $\theta$ seeks to produce the sequence label $\hat{y}$ in such a way that the probability of the true label sequence $y$ occurring in the dataset is as high as possible, which is expressed as in Equations (1) and (2):

$$\hat{y} = \arg\max_{y} P(y \mid x, \theta) \tag{1}$$

$$P(y) = \prod_{i=1}^{L} P(y_i \mid y_1, y_2, \dots, y_{i-1}) \tag{2}$$

where $y_1, y_2, \dots, y_L$ are the tags that make up the entire label sequence, and $L$ is the length of the label sequence.
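The BMEOS scheme can be illustrated with a short sketch. The Python snippet below is not taken from the paper: the function names `bmeos_labels` and `label_space_size` are hypothetical, and it assumes character-level tokens with gap phrases given as half-open index spans. It labels the example sentence from the Introduction (written here without the space after the comma) and computes the size of a combined position-and-category label space.

```python
from typing import List, Tuple


def bmeos_labels(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Label each token with B/M/E/O/S given gap-phrase spans [start, end)."""
    labels = ["O"] * len(tokens)
    for start, end in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span ({start}, {end})")
        if end - start == 1:
            labels[start] = "S"          # single-token gap phrase
        else:
            labels[start] = "B"          # first token of the gap phrase
            for i in range(start + 1, end - 1):
                labels[i] = "M"          # interior tokens
            labels[end - 1] = "E"        # last token of the gap phrase
    return labels


def label_space_size(num_categories: int) -> int:
    """B/M/E/S combined with K categories, plus one O label: 4K + 1."""
    return 4 * num_categories + 1


sentence = "计算机工作时,内存储器中存储的是指令与数据。"
tokens = list(sentence)                  # assumption: one token per character
gap = "指令与数据"
start = sentence.index(gap)
labels = bmeos_labels(tokens, [(start, start + len(gap))])
for token, label in zip(tokens[start:], labels[start:]):
    print(token, label)
```

With a single implicit category (K = 1), `label_space_size` returns five, matching the B, M, E, O, and S labels of the primary task.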

5. Baseline

The typical solution to the sequence labeling problem is to use a neural encoder–decoder model; here, we choose BERT-BiLSTM-CRF as the baseline method. Because the input Chinese sentences are represented by word-level features, the model is made up of a word representation layer, a sequence encoding layer, and a CRF (conditional random field) inference layer.

5.1. Word Representation Layer

The semantic representation of text (embedding) is the basis of sequence labeling with deep neural networks. This research uses the BERT (Bidirectional Encoder Representations from Transformers) model as the embedding layer to generate semantic representation vectors. BERT is a language model built on the principle of “pretraining–finetuning”, and it has excelled at several NLP tasks [56]. As the embedding layer of the sequence labeling model, BERT generates semantic representation vectors with rich contextual features for the input sentence sequence. In the pretraining stage, the large-scale training corpus enables BERT to learn the rich semantic information in the input text [56]. During pretraining, the masked language model objective randomly masks some words in the corpus so that the model learns the context of each word. The BERT model uses a bidirectional transformer [57] for feature extraction. Unlike a convolutional or recurrent neural network, the transformer is a feature extraction model based on an attention mechanism and consists only of a self-attention mechanism and a feedforward neural network. It effectively captures the long-distance dependencies in sentences that are difficult for traditional NLP models. The bidirectional transformer encodes the input sentence sequence, resulting in output word vectors with contextual semantic features.

The BERT embedding layer converts each token $x_i$ of the input sentence $x = (x_1, x_2, \dots, x_n)$ into an embedding vector $e_i$, giving the embedded representation of the sentence $E = (e_1, e_2, \dots, e_n)$. Here $e_i$ is the $m$-dimensional vector representation corresponding to $x_i$, and $E$ is an $n \times m$ matrix in which each row corresponds to the vector representation of a word in the sentence.

5.2. Sequence Encoding Layer

The objective of this layer is to capture additional contextual features and acquire richer semantic information. To this end, we employ a BiLSTM (Bidirectional Long Short-Term Memory) layer to encode contextual semantic features and extract the overall semantics of the sentence. This process relies mainly on the forward and backward positional relationships among words to capture structural information within the sentence. The BiLSTM [58] concatenates a forward and a backward LSTM [59], so it can record both the preceding and the following context of each input item. The LSTM, a variant of the RNN, adds memory units and a gating mechanism that control the forgetting, updating, and passing of information. As a result, it can learn long-range dependencies and effectively mitigates the vanishing and exploding gradient problems of conventional RNNs. The output of this layer is shown in Equation (3):

$$h_t = [\overrightarrow{h_t}, \overleftarrow{h_t}] \tag{3}$$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden layer outputs of the forward and backward LSTM, respectively.

5.3. CRF

While the probability matrix derived from the BiLSTM could be used directly to determine the final outcome, it may still yield incorrect results because it does not consider correlations between labels. To address this limitation, the final layer of the model is a conditional random field (CRF), which captures dependencies between adjacent labels and imposes constraints that keep preceding and following labels coherent. A CRF is a conditional probability distribution model, which can be represented by $P(y \mid x)$. In our task, a linear-chain CRF is used. Here $x$ is the input variable, representing the observation sequence to be labeled, while $y$ is the output sequence of labels corresponding one-to-one to $x$. Its core form is shown in Equation (4):

$$P(y \mid x) \propto \exp\left(\sum_{k=1}^{K} \omega_k f_k(y, x)\right) \tag{4}$$

where $f_k$ denotes a feature function and $\omega_k$ denotes its corresponding weight. During training, the conditional probability model $\hat{P}(y \mid x)$ is derived through maximum likelihood estimation. At prediction time, for a given observation sequence, the Viterbi algorithm produces the label sequence $y$ with the highest conditional probability $\hat{P}(y \mid x)$.
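The decoding step can be made concrete with a small sketch of Viterbi search over per-token label scores and label-to-label transition scores. This is an illustration under stated assumptions rather than the authors' implementation: the function name `viterbi`, the toy emission scores, and the hard-coded `ALLOWED` transition table are all hypothetical, and in the actual model the transition scores are learned by the CRF rather than fixed by hand.

```python
from typing import List, Sequence


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label index sequence.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    if not emissions:
        return []
    num_labels = len(emissions[0])
    score = list(emissions[0])           # best score of a path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            best_prev = max(range(num_labels),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


LABELS = ["B", "M", "E", "O", "S"]
# Assumption: transitions that break the BMEOS scheme get a large penalty.
ALLOWED = {"B": "ME", "M": "ME", "E": "BOS", "O": "BOS", "S": "BOS"}
transitions = [[0.0 if b in ALLOWED[a] else -1e4 for b in LABELS] for a in LABELS]
emissions = [
    [2.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 1.5, 2.0, 0.0],
    [0.0, 0.0, 0.0, 3.0, 0.0],
]
decoded = [LABELS[i] for i in viterbi(emissions, transitions)]
print(decoded)  # ['B', 'E', 'O']
```

Choosing the best label independently at each position would give B, O, O here, which is not a valid BMEOS sequence; taking transition scores into account is what lets the CRF layer avoid such outputs.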

6. Methodology

The multi-task label-attention network proposed in this paper learns feature representations with the support of the POS-tagging task and combines them with the label-attention mechanism to boost the primary task’s effectiveness. Figure 2 illustrates the structure of our model. For the primary task, the label embeddings are learnable parameters that are updated during training. On the basis of the label embeddings, the model learns an attentional representation of the words. For prediction, the CRF layer receives the output of the BiLSTM encoding layer together with the attentional representation of the words. For the auxiliary task, because the number of label types is large and data are scarce, we do not learn label embeddings; instead, the output of the BiLSTM is fed directly into a CRF layer for POS prediction. The BERT and BiLSTM encoding layers are shared by the two tasks.

6.1. Label-Attention Network

Given a candidate set of labels $B = \{b_1, \dots, b_k, \dots, b_{|B|}\}$, where $|B|$ is the number of labels in the output label candidate set, each label in the set is represented by an embedding vector as shown in Equation (5):

$$l_k = e^{b}(b_k) \tag{5}$$

where $e^{b}$ represents the look-up table of the label embeddings. The label embeddings are randomly initialized at the beginning of training and are continuously adjusted during the training process.

We leverage the label-attention mechanism to introduce label embeddings into the model, enabling the model to learn the dependencies between labels and the correlation between labels and words. First, the attention score between the label embedding matrix $L = [l_1, l_2, \dots, l_{|B|}]$ and the hidden state $h_i$ of the BiLSTM encoder is calculated as shown in Equations (6) and (7):

$$a_i = \mathrm{softmax}(o_i L) \tag{6}$$

$$o_i = W_i h_i + b \tag{7}$$

This attention score differs from the frequently employed sequence-level attention, which focuses on modeling interactions among words within the input sequence. In contrast, the proposed label-attention mechanism models the interaction between contextual word representations and label embeddings.

Since the label-attention vector $a_i$ is represented in the label space, it cannot be directly combined with the original BiLSTM hidden representation $h_i$. Therefore, we first project $h_i$ into the same label space through a linear transformation:

$$h_i' = W_p h_i + b_p \tag{8}$$

where $h_i \in \mathbb{R}^{d}$ denotes the hidden representation produced by the BiLSTM encoder, $h_i' \in \mathbb{R}^{|B|}$ is the projected representation in the label space, and $a_i \in \mathbb{R}^{|B|}$ is the label-attention vector. Therefore, $h_i'$ and $a_i$ have the same dimensionality.

The projected representation and the label-attention vector are then combined through element-wise addition:

$$c_i = h_i' + a_i \tag{9}$$

The fused representation $c_i$ is subsequently used as the input to the CRF inference layer.

In this work, we use additive fusion to combine the projected BiLSTM hidden representation and the label-aware attention representation. This design is intentionally lightweight and consistent with the role of the label-attention module in our framework. The objective of the module is not to replace or heavily transform the contextual representation learned by the encoder, but rather to inject label-aware information that can guide subsequent sequence prediction.

Since the CCQG dataset is relatively limited in size, more complex fusion strategies, such as concatenation followed by a multi-layer perceptron (MLP) or gated fusion mechanisms, would introduce additional trainable parameters and potentially increase the risk of overfitting. In contrast, additive fusion preserves the original contextual representation while incorporating label-aware information with minimal architectural complexity and computational overhead.

Therefore, the label-attention module is designed as an auxiliary enhancement to the encoder output before CRF decoding rather than as a separate feature transformation component. A systematic comparison of alternative fusion strategies is an interesting direction for future work and will be investigated in subsequent studies.
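The computation in Equations (6) to (9) can be traced for a single token with the following plain-Python sketch. It is a minimal illustration and not the authors' code: the names `affine`, `softmax`, `label_attention_fuse`, `w_o`, and `b_o` are hypothetical, the sketch assumes one weight matrix shared across positions in Equation (7), and it assumes that $o_i$ has the same dimensionality as the label embeddings so that the product with $L$ is defined.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(weight: Matrix, x: Vector, bias: Vector) -> Vector:
    """Compute weight @ x + bias for a row-major weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention_fuse(h: Vector, label_emb: Matrix,
                         w_o: Matrix, b_o: Vector,
                         w_p: Matrix, b_p: Vector) -> Vector:
    """Fuse one BiLSTM state with its label-attention vector.

    label_emb holds one row per label (|B| rows, each of the size of o_i).
    """
    o = affine(w_o, h, b_o)                                   # Eq. (7)
    scores = [sum(x * y for x, y in zip(o, l_k)) for l_k in label_emb]
    a = softmax(scores)                                       # Eq. (6), in R^{|B|}
    h_proj = affine(w_p, h, b_p)                              # Eq. (8), in R^{|B|}
    return [hp + ak for hp, ak in zip(h_proj, a)]             # Eq. (9)


# Toy sizes (assumed): d = 2, label embedding size 2, |B| = 3.
h = [1.0, 0.0]
label_emb = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_o, b_o = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_p, b_p = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.0, 0.0, 0.0]
c = label_attention_fuse(h, label_emb, w_o, b_o, w_p, b_p)
print(c)
```

With the projection set to zero, the fused vector reduces to the attention distribution itself, which makes it easy to see that the first label, whose embedding is most similar to the token state, receives the largest weight.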

6.2. Design of POS-Tagging Auxiliary Task

As shown in Figure 1, POS information is a useful reference for selecting GPs. One direct method is to feed the POS information into the model as an input feature. Another is to make the encoder’s output representation capable of supporting a decoder in identifying POS categories and boundaries. This paper adopts the second approach: an auxiliary task forces the shared encoder to take POS tagging into account. This method requires no change to the encoder structure and allows text sequences to be encoded directly with a large pretrained language model such as BERT.

We use the CTB POS-tagging standard, which defines 36 original POS categories [60]. To reduce noise in the auxiliary POS-tagging task, we did not directly use all CTB POS categories. Instead, we constructed a refined POS subset according to the relevance between POS boundaries and annotated GP boundaries in the training data. The selection criterion was whether a POS category frequently appeared inside or at the boundary of annotated GPs. The POS subset was determined exclusively using the training portion of the dataset. No validation-set or test-set annotations were used during the subset construction process. Therefore, the subset selection procedure did not introduce information leakage from the evaluation data. POS categories that rarely overlapped with GP boundaries and mainly functioned as grammatical or discourse markers were excluded.

The excluded categories include BA, CS, DER, ETC, IJ, NOI, ON, and SP. These tags usually correspond to construction markers, conjunctions, particles, interjections, noise tokens, onomatopoeia, or sentence-final particles. Although they are important for syntactic analysis, they seldom form knowledge-bearing answer phrases in educational cloze questions. Therefore, excluding them from the auxiliary task helps the shared encoder focus more on POS categories that are informative for GP boundary recognition. The excluded POS categories are listed in Table 1.
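One way to picture the resulting auxiliary labels is the sketch below, which expands word-level POS tags into character-level position-plus-category labels. It rests on assumptions that the paper does not state explicitly: that characters of excluded categories receive the single `O` label (which would account for the extra label in the $4K + 1$ label space), and that the input is already segmented and tagged. The function name `pos_aux_labels` and the three-word input are hypothetical.

```python
from typing import List, Tuple

# Categories the paper excludes from the auxiliary task (Section 6.2).
EXCLUDED_POS = {"BA", "CS", "DER", "ETC", "IJ", "NOI", "ON", "SP"}


def pos_aux_labels(tagged_words: List[Tuple[str, str]]) -> List[str]:
    """Expand word-level POS tags into character-level auxiliary labels.

    Assumption: characters of excluded categories are labelled "O".
    """
    labels: List[str] = []
    for word, tag in tagged_words:
        if not word:
            continue                                  # skip empty tokens
        if tag in EXCLUDED_POS:
            labels.extend(["O"] * len(word))
        elif len(word) == 1:
            labels.append(f"S-{tag}")                 # single-character word
        else:
            labels.append(f"B-{tag}")
            labels.extend([f"M-{tag}"] * (len(word) - 2))
            labels.append(f"E-{tag}")
    return labels


# "尊重" (VV) and "人" (NN) come from the paper's example; "等" (ETC) is added
# here only to show an excluded category.
labels = pos_aux_labels([("尊重", "VV"), ("人", "NN"), ("等", "ETC")])
print(labels)  # ['B-VV', 'E-VV', 'S-NN', 'O']
```

Under this reading, the boundaries of the VV and NN labels coincide with the boundaries of the gap phrase “尊重人”, which is the alignment the auxiliary task is meant to expose to the shared encoder.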

6.3. Optimization Objectives

The objective function of our proposed model consists of two parts, namely, the loss function of the GP sequence labeling in the primary task and the loss function of the POS tagging in the auxiliary task.

Both the GP labeling task and the POS-tagging task are optimized using the negative log-likelihood (NLL) of the correct label sequence produced by the corresponding CRF layer, as shown in Equations (10) and (11):

$$L_1(\theta_1) = -\log P(g \mid x; \theta_1) \tag{10}$$

$$L_2(\theta_2) = -\log P(s \mid x; \theta_2) \tag{11}$$

where $P(g \mid x; \theta_1)$ and $P(s \mid x; \theta_2)$ denote the conditional probabilities of the complete GP label sequence and POS tag sequence computed by the corresponding CRF layers, respectively. Here, $\theta_1$ and $\theta_2$ are the learning parameters of the primary task and the auxiliary task, respectively. The final objective is a weighted sum of the two component losses, as shown in Equation (12):

$$L = (1 - \lambda) L_1 + \lambda L_2 \tag{12}$$

where $\lambda$ is the weight factor used to adjust the balance between the primary task and the auxiliary task.
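The objective can be sketched in plain Python as follows. The snippet computes the negative log-likelihood of a label sequence under a linear-chain CRF with the forward algorithm and then combines two such losses as in Equation (12). It is an illustrative sketch, not the authors' training code: the function names and the toy scores are hypothetical, start and end transition scores are omitted for brevity, and the check that $\lambda$ lies in [0, 1] is an assumption rather than something the paper states.

```python
import math
from itertools import product
from typing import Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """Stable log(sum(exp(v))) over a non-empty sequence."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sequence_score(emissions, transitions, labels) -> float:
    """Unnormalised score of one label sequence."""
    score = emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


def crf_nll(emissions, transitions, labels) -> float:
    """-log P(labels | x) for a linear-chain CRF via the forward algorithm."""
    n = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [log_sum_exp([alpha[i] + transitions[i][j] for i in range(n)])
                 + emissions[t][j] for j in range(n)]
    log_partition = log_sum_exp(alpha)
    return log_partition - sequence_score(emissions, transitions, labels)


def joint_loss(loss_gp: float, loss_pos: float, lam: float) -> float:
    """Eq. (12): (1 - lambda) * L1 + lambda * L2."""
    if not 0.0 <= lam <= 1.0:            # assumed range for the weight factor
        raise ValueError("lam is expected to lie in [0, 1]")
    return (1.0 - lam) * loss_gp + lam * loss_pos


# Toy two-label example: the probabilities of all sequences should sum to one.
emissions = [[1.0, 0.2], [0.3, 0.8], [0.5, 0.5]]
transitions = [[0.1, -0.4], [-0.2, 0.3]]
total = sum(math.exp(-crf_nll(emissions, transitions, list(seq)))
            for seq in product(range(2), repeat=len(emissions)))
print(round(total, 6), joint_loss(2.0, 4.0, 0.25))
```

In the model described above, the emission scores of the primary task would come from the fused representation $c_i$ and those of the auxiliary task from the BiLSTM output; setting $\lambda = 0$ recovers training on the primary task alone.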

7. Experiment

7.1. Datasets

The dataset (https://github.com/tianlin668/CSFQGD, accessed on 17 June 2026) used in this study was released by Zhang et al. [12] and consists of Chinese fill-in-the-blank questions collected from educational examination resources. Unlike cloze-style reading comprehension datasets, the blanks in this dataset are designed to assess specific knowledge points rather than to recover contextually repeated words. Therefore, the annotated gap phrases usually correspond to knowledge-bearing units such as concepts, definitions, terms, numerical expressions, or procedural components. The detailed statistics of the dataset are shown in Table 2.

The dataset covers five academic disciplines, including engineering, training, computing, medicine, and economics. The original train/validation/test split is retained in this study to ensure comparability with previous work. Each sample consists of a sentence and one or more annotated gap phrases. The annotation of GPs is knowledge-oriented: annotators select the phrase or phrases that are suitable to be omitted for constructing an educational cloze question.

It should be noted that the distribution of GPs is naturally imbalanced. Many GPs are short noun phrases or terminology units, whereas long GPs and structurally complex phrases occur less frequently. This imbalance increases the difficulty of boundary detection, especially when multiple educationally meaningful phrases appear in the same sentence.

7.2. Experimental Settings

We employ the bert-base-chinese (https://huggingface.co/bert-base-chinese, accessed on 17 June 2026) model (12 layers, hidden size 768, 12 attention heads) as the pretrained language model. The other hidden layers of the network also have dimension 768, the maximum sequence length is 256, the dropout rate is 0.1, and the training batch size is 16. The label vectors in the label-attention module are initialized from a uniform distribution over [−0.1, 0.1] and updated during training. The balance factor $\lambda$ between the primary and auxiliary tasks is set to 0.2. We adopt the AdamW optimizer [61] with an initial learning rate of $6 \times 10^{-5}$, and the maximum number of training epochs is set to 50. Precision, Recall, and F1 score are used as evaluation metrics. The original dataset study reported a human prediction F1 of 87.64%, indicating the inherent subjectivity of GP annotation.
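For readers who want to reproduce the evaluation metrics, one common way to score sequence labeling output is exact-match, span-level Precision, Recall, and F1. The sketch below assumes a plain BIO tag set ("B", "I", "O") and exact boundary matching; the text does not spell out the matching rule, so both are assumptions, and the function names are illustrative.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into a set of (start, end) spans, end exclusive."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new span begins (a stray "I" is treated as a span start).
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans

def prf1(gold_seqs, pred_seqs):
    """Micro-averaged exact-match Precision, Recall, and F1 over all sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_to_spans(gold), bio_to_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy example: one of two gold spans is matched exactly.
gold = [["B", "I", "O", "B", "O"]]
pred = [["B", "I", "O", "O", "B"]]
p, r, f = prf1(gold, pred)
```

Under exact matching, a prediction with a shifted boundary counts as both a false positive and a false negative, which is one reason boundary errors weigh heavily on F1.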

7.3. Reproducibility Details

To facilitate reproducibility, we provide additional implementation details. All experiments were implemented using PyTorch 1.10.0 and CUDA 11.1. The pretrained encoder was bert-base-chinese. Input sentences were tokenized using the original BERT WordPiece tokenizer with a maximum sequence length of 256. No additional text normalization or lowercasing was applied.

The auxiliary task uses POS labels annotated according to the Chinese Treebank (CTB) POS tag standard. No external POS tagger was employed during training or inference.

To reduce the influence of random initialization, all experiments were repeated five times using different random seeds. The reported results correspond to the mean and standard deviation over these five independent runs. Training was conducted for 50 epochs. No additional early-stopping criterion was used. The checkpoint achieving the highest validation-set F1 score was selected as the final model for evaluation.
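The reporting protocol in this paragraph (select the checkpoint with the highest validation F1, then report mean and standard deviation over five seeds) could be sketched as below. The helper names and all numbers are placeholders for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def best_checkpoint(val_f1_by_epoch):
    """Return (epoch, f1) for the epoch with the highest validation F1 (1-indexed).

    Ties are resolved in favor of the earliest epoch.
    """
    idx = max(range(len(val_f1_by_epoch)), key=val_f1_by_epoch.__getitem__)
    return idx + 1, val_f1_by_epoch[idx]

def summarize_runs(scores_per_seed):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(scores_per_seed), statistics.stdev(scores_per_seed)

# Placeholder numbers for illustration only; they are not results from the paper.
val_curve = [0.50, 0.58, 0.61, 0.60]
epoch, f1 = best_checkpoint(val_curve)
mean_f1, std_f1 = summarize_runs([0.60, 0.62, 0.61, 0.63, 0.59])
```

Selecting on validation F1 and reporting test scores only for the selected checkpoint keeps the test set out of model selection.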

Experiments were conducted on a server equipped with NVIDIA RTX 3090 GPUs, Intel Xeon Silver 4214R CPUs, and 256 GB RAM (Santa Clara, CA, USA). The software environment includes Python 3.7, PyTorch 1.10.0, CUDA 11.1, NumPy 1.21.2, tqdm 4.64.0, tensorboardX 2.5.1, and pytorch-transformers 1.2.0. The complete source code, preprocessing scripts, configuration files, and dependency specifications are publicly available in the GitHub repository accompanying this work.

7.4. Baseline Models

To validate the performance of our proposed model, we compare it with three baseline models:

BiLSTM-CRF: This model uses Word2vec to generate word vectors, and encodes text semantic information and dependencies through a bidirectional LSTM layer. The decoder part uses a standard CRF to learn the transition probabilities between different tags according to the training data to better predict the GP tags.

Lattice LSTM: This model is a representative approach that combines word-level and character-level information for training. It integrates characters and words through a lattice structure and has achieved good performance in Chinese named entity recognition [62]. It also performs well on other Chinese sequence labeling problems [63,64].

BERT-BiLSTM-CRF: This model uses the BERT model to obtain the vector representation of words, and the sentence encoding and decoding parts still adopt BiLSTM and CRF models. With the powerful representation ability of BERT, it has achieved state-of-the-art performance on the problem of text sequence labeling [12,65]. BERT-BiLSTM-CRF is the strongest baseline reported in the original dataset publication and therefore serves as the primary reference model in this study.

7.5. Experimental Results

The comparison results are shown in Table 3. The experimental results show that our approach has the best performance among the four models (F1 value is 65.85%). Both BERT-BiLSTM-CRF and our approach outperform the other two methods based on traditional word vector encoding by over 10%. When using BERT as the embedding layer for vector representation, it adopts a bidirectional transformer structure, so the output vector already contains rich information such as word features and contextual semantics. However, the word vectors used by BiLSTM-CRF and Lattice LSTM contain fewer contextual semantic features, so the semantic information of the entire sentence is incomplete, which affects the learning of sentence patterns. In addition, the transformer attention mechanism during training weakens information such as position and directional distance between sequences, but position and directional information are crucial in sequence annotation tasks [66]. By connecting a BiLSTM layer, the two models are able to learn the position information of the input sequences to compensate for the deficiencies in the BERT coding process and effectively learn the pattern features, such as the location of constituent items and sentence structure within a sentence.

To reduce the influence of random initialization and training instability, we performed multiple independent runs only for the strongest baseline (BERT-BiLSTM-CRF) and our proposed model, as the other two baselines (BiLSTM-CRF and Lattice LSTM) exhibit substantially lower performance and are not competitive with our approach. Specifically, both BERT-BiLSTM-CRF and our model were trained and evaluated over five independent runs using different random seeds. For these two models, we report the mean and standard deviation of Precision, Recall, and F1. To examine whether the improvement of our model over the BERT-BiLSTM-CRF baseline is statistically reliable, we conducted a paired t-test on the scores obtained from the repeated runs. The significance level was set to 0.05. The symbols in Table 3 indicate the corresponding p-value ranges (**: $p < 0.001$, *: $p < 0.05$, ∼: $p \geq 0.05$). For the non-competitive baselines, we directly cite the single-run results from their original papers, as indicated by the superscript ♮ in Table 3.
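The paired t-test over five matched runs can be illustrated with the standard library alone. The sketch below computes the t statistic and compares it against the two-sided critical value for 4 degrees of freedom at the 0.05 level (2.776). Pairing runs by random seed is an assumption, the function name is illustrative, and the scores are invented placeholders rather than values from Table 3.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, df) for a paired t-test on matched runs."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of the differences
    n = len(diffs)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Two-sided critical value of Student's t for df = 4 at alpha = 0.05.
T_CRIT_DF4 = 2.776

# Placeholder F1 scores for five seeds (invented; not the paper's numbers).
ours = [0.66, 0.65, 0.67, 0.66, 0.68]
base = [0.64, 0.64, 0.65, 0.65, 0.66]
t, df = paired_t_statistic(ours, base)
significant = df == 4 and abs(t) > T_CRIT_DF4
```

An exact p-value requires the t distribution's cumulative distribution function, which a statistics package would normally supply; the critical-value comparison above answers only the yes/no question at the 0.05 level.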

When interpreting the experimental results, the subjectivity of GP annotation should also be considered. Unlike conventional sequence labeling tasks such as named entity recognition, GP selection in educational cloze question generation depends on pedagogical objectives and may admit multiple valid answers. In many sentences, more than one phrase can reasonably represent the target knowledge point, whereas the dataset annotation usually records only the phrase or phrases selected by the original annotators. Consequently, a model prediction that differs from the gold annotation may still correspond to a valid educationally meaningful gap phrase.

This annotation uncertainty imposes an inherent upper bound on automatic performance. The original dataset study reported a human prediction F1 score of 87.64%, indicating that even human annotators do not achieve perfect agreement on GP selection. Therefore, the performance of automatic systems should be interpreted in the context of this annotation subjectivity and the existence of multiple plausible answers.

Although sequence labeling metrics such as Precision, Recall, and F1 are widely adopted in GP identification research and are consistent with the evaluation protocol of the original dataset, they do not fully capture the pedagogical quality of generated cloze questions. In educational assessment scenarios, multiple gap phrases within the same sentence may be equally suitable for evaluating a target knowledge point. Consequently, a prediction that differs from the gold annotation may still produce an educationally meaningful cloze question. Future work may therefore incorporate additional phrase-level evaluation metrics and expert-based human evaluation protocols to better assess the pedagogical usefulness of automatically selected gap phrases.

Compared to the BERT-BiLSTM-CRF baseline, our method achieves a Precision improvement of +0.2%, a Recall improvement of +1.5%, and an F1 improvement of +0.9%. A paired t-test across the five runs indicates that the Precision gain is not statistically significant, whereas the Recall gain is. The overall effectiveness of our approach is therefore driven primarily by the Recall improvement, which in turn yields a statistically significant F1 gain. Both models utilize BERT for input feature representation and BiLSTM for sentence encoding. However, our model incorporates a label-attention module to model the interrelationship between label representations and sentence sequence features. Moreover, by training on an auxiliary task, our approach exhibits enhanced GP boundary recognition capabilities. Compared with the alternative methods, our approach is stronger in contextual semantic feature extraction, sentence structure learning, tag relationship learning, and tag boundary recognition. As a result, it yields more accurate GP extraction results.

The purpose of the baseline comparison is to evaluate whether the proposed POS auxiliary learning and label-attention modules can improve a strong BERT-BiLSTM-CRF sequence labeling framework under the same dataset and evaluation protocol. Therefore, this study mainly compares with previously reported models and the re-implemented BERT-BiLSTM-CRF baseline. We acknowledge that replacing the encoder with more recent pretrained Chinese models, such as MacBERT, RoBERTa-wwm, or DeBERTa, may further improve performance. Since the focus of this work is the effectiveness of the proposed task-specific modules rather than the comparison of different pretrained encoders, we leave a systematic evaluation of stronger pretrained baselines to future work.

Nevertheless, the proposed POS-guided auxiliary learning and label-attention modules are encoder-agnostic and can be incorporated into alternative pretrained language models. Therefore, the effectiveness demonstrated on the BERT-BiLSTM-CRF framework reflects the contribution of the proposed modules rather than a specific encoder choice.

To avoid ambiguity, we report the final comparison results and ablation results separately. The final performance comparison in Table 3 is used to evaluate the proposed model against baseline methods under the test setting. The ablation study in Table 4 is used to analyze the relative contribution of different components. Therefore, the F1 value of the full model in Table 4 is not intended to replace the final test-set result reported in Table 3.

7.6. Ablation Study

To verify the contribution of each mechanism in the proposed method, we conducted two sets of ablation experiments. (1) Contribution analysis of label attention and the auxiliary task: we remove the POS-tagging auxiliary task (denoted POS) and the label-attention module (denoted LA) from our model in turn. For ease of comparison, we also report the results with both modules removed (denoted w/o POS & LA, which is equivalent to BERT-BiLSTM-CRF). (2) Contribution analysis of the POS-tagging subsets: we denote the complete CTB POS-tagging task as POS1 and our refined POS-tagging task as POS2 and observe the improvement each brings when combined with BERT-BiLSTM-CRF. We further examine how performance changes when each is added to BERT-BiLSTM-CRF together with the label-attention module LA.

The results of the first group are given in Table 4. Compared with the full model, removing the auxiliary task POS lowers Precision by 1.51 percentage points and Recall by 0.9 percentage points, resulting in a 1.22 percentage point drop in F1. Removing the LA module also hurts performance, with F1 dropping by 0.41 percentage points. When both modules are removed and the model degrades to BERT-BiLSTM-CRF, Recall drops markedly, by 4.33 percentage points, and F1 drops further as a result. These observations show that, of the two modules, removing the auxiliary task POS causes the larger F1 decrease; in other words, the auxiliary task contributes the larger performance improvement. The auxiliary task POS affects Precision more than label attention LA does, while LA improves Precision and Recall in a balanced manner. We infer that this is because the auxiliary task improves the accuracy of boundary recognition. The combination of the two mainly improves Recall and, thus, F1. It can also be observed that using only the label-attention module (w/o POS) actually lowers Precision while significantly increasing Recall compared to BERT-BiLSTM-CRF; that is, this module enhances the global recognition ability of the model.

In addition to the typical error analysis, we also provide detailed statistics on the categories of error samples. To facilitate the interpretation of the error statistics, Table 5 summarizes the definitions and typical causes of the four error categories considered in this study, while Table 6 reports their distribution.

As shown in Table 6, Case 1 (incomplete GP recognition) and Case 4 (over-prediction) account for the vast majority of errors, representing 34.96% and 52.65% of all error cases, respectively. In contrast, Case 2 (boundary mismatch) and Case 3 (missing GP) occur much less frequently.

The dominance of Case 1 and Case 4 reflects the intrinsic characteristics of educational cloze question generation. In many educational sentences, multiple knowledge-bearing phrases coexist, and the model may successfully identify some but not all annotated GPs, resulting in incomplete GP recognition. This phenomenon is particularly common in sentences containing coordinated concepts or parallel noun phrases.

For Case 4, the model often predicts additional phrases that are educationally meaningful but are not included in the gold annotations. This observation is closely related to the subjectivity of GP annotation. Since multiple phrases may reasonably serve as candidate blanks for assessing the same knowledge point, some model predictions counted as false positives may still correspond to valid educationally meaningful gap phrases. Example 5 in Table 7 provides a representative illustration of this phenomenon.

The improvement in Precision brought by LA is smaller than that obtained by the POS auxiliary task alone, and the Precision gain when LA is combined with multi-task learning is smaller than when multi-task learning is used on its own. Both observations suggest that the interaction between label-aware features and the shared features learned in the multi-task setting still has room for improvement.

The second group of ablation results is presented graphically, as shown in Figure 3a–c. It can be seen that the auxiliary task POS2 achieves a significantly higher Precision score than the CTB POS full set tagging, which is due to the accurate identification of important POS boundaries.

7.7. Influence of the Multi-Task Weight Factor

In our model, the parameter $\lambda$ serves as a balancing coefficient between the primary and auxiliary tasks and therefore affects labeling performance. A larger $\lambda$ gives the auxiliary-task loss a higher proportion of the objective, whereas a smaller $\lambda$ gives the primary-task loss a higher proportion. Figure 4 shows the F1 scores for different $\lambda$ values. Candidate values are taken at intervals of 0.1 between 0 and 1, giving nine candidate weights (0.1 to 0.9) for comparison. The curve shows that the best F1 score is achieved at a weight of 0.2, and the score declines as the weight moves away from this value in either direction. Therefore, 0.2 is chosen as the final $\lambda$ value in our model.
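The weight search described above amounts to a one-dimensional grid search. A minimal sketch follows; `evaluate` is a hypothetical callable that would train the model with a given weight and return its F1 score, and the toy scoring function below merely stands in for that step. Which data split the F1 in Figure 4 is measured on is not stated in the text, so selecting on validation F1 is an assumption.

```python
def lambda_grid(step=0.1):
    """Candidate weights strictly between 0 and 1: 0.1, 0.2, ..., 0.9 for step=0.1."""
    n = round(1 / step)
    return [round(i * step, 10) for i in range(1, n)]

def pick_lambda(evaluate, candidates):
    """Return (best_weight, scores) where scores maps each candidate to its F1.

    `evaluate` is a hypothetical callable: weight -> F1 score.
    """
    scores = {lam: evaluate(lam) for lam in candidates}
    best = max(scores, key=scores.get)
    return best, scores

def toy_eval(lam):
    """Stand-in scoring function that peaks at 0.2 (purely illustrative)."""
    return 0.65 - abs(lam - 0.2)

best_lam, all_scores = pick_lambda(toy_eval, lambda_grid())
```

Because each candidate requires a full training run, a coarse grid such as this one keeps the cost at nine runs per seed.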

7.8. Result Analysis

To facilitate the interpretation of the error statistics, Table 5 summarizes the definitions and typical causes of the four error categories considered in this study. The corresponding error distribution is reported in Table 6. Representative examples illustrating different error patterns are presented in Table 7.

The recognition scenarios for GPs are impossible to categorize definitively, but the error categories can be classified into four cases: (1) too few GPs recognized, (2) inaccurate segmentation, (3) unrecognized GPs, and (4) too many GPs recognized. In conducting the case study, we categorize the misidentified samples of the BERT-BiLSTM-CRF model according to the cause of the error, and then test whether our proposed method correctly predicts these samples, and if it does, further compare and analyze the cause. The five sentences listed in Table 7 of the manuscript are typical examples that reflect the differences between the two models.
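One way to assign a mispredicted sentence to one of the four cases is sketched below. The exact definitions are those of Table 5; the rules and their precedence here are simplifying assumptions made for illustration (for instance, a sentence with both missing and extra non-overlapping spans is counted as Case 4), and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) spans with exclusive ends share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def error_case(gold, pred):
    """Classify one sentence: 0 = exact match, otherwise case 1-4.

    1: too few GPs recognized (predictions are a proper subset of gold)
    2: inaccurate segmentation (a wrong span overlaps a missed gold span)
    3: unrecognized GPs (nothing predicted)
    4: too many GPs recognized (extra spans not overlapping any missed gold span)
    """
    gold, pred = set(gold), set(pred)
    if gold == pred:
        return 0
    if not pred:
        return 3
    missed, extra = gold - pred, pred - gold
    if any(overlaps(g, p) for g in missed for p in extra):
        return 2
    if pred < gold:
        return 1
    return 4

# Spans are (start, end) token offsets; the examples are invented.
cases = [
    error_case([(0, 2), (3, 5)], [(0, 2)]),   # one gold GP missing
    error_case([(0, 4)], [(0, 2)]),           # boundary too short
    error_case([(0, 2)], []),                 # nothing predicted
    error_case([(0, 2)], [(0, 2), (5, 7)]),   # one extra GP
]
```

Counting the returned case numbers over all mispredicted sentences would give a distribution of the kind reported in Table 6.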

Our method correctly predicts the GPs in Example 1, Example 2 and Example 3. BERT-BiLSTM-CRF did not predict both GPs in Example 1, did not correctly identify the GP boundaries in Example 2, and identified no GP in Example 3. The correct predictions of our approach on these three examples can be attributed to the label-attention (LA) module and the auxiliary POS task, respectively. For example, label attention can comprehensively capture the relationship between “价值尺度 (value scale)”, “流通手段 (circulation means)” and the whole sentence, especially “货币的原始职能 (the original function of money)” in Example 1. The shared features learned jointly with the POS auxiliary task better represent the possibility that “1986年 (the year 1986)” forms a single chunk in Chinese in Example 2. Example 3 is an interesting case, where our method correctly identifies the boundary of the second GP, and the BERT-BiLSTM-CRF method only identifies a part of the GP, “启发 (heuristics)”. However, neither method identifies the other GP “复习谈话 (review talk)”. A situation similar to Example 5 is more common in the dataset, where a sentence actually contains $n$ available GPs but the annotator selects only $m$ ($m < n$) of them for generating questions.

In this work, we hypothesize that POS information contributes to the main task of GP recognition and therefore use POS tagging as an auxiliary task. To further validate this argument, this section summarizes the correlation between GPs and POS. Figure 5 illustrates the distribution of POS tag patterns by the number of GPs. It can be observed that, in addition to the noun phrase pattern (NN+), which accounts for 45% of all GPs, JJ+NN (other noun-modifier with common noun), NR (proper noun), CD (cardinal number), and VV+NN (other verb with common noun) are also major patterns that constitute GPs. We also observe that some less frequent POS tags, such as NT, can help the model better determine GP boundaries, as in Example 2 in Table 7.
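A pattern distribution of this kind can be obtained by mapping each annotated GP to the sequence of POS tags it covers and counting the resulting patterns. The sketch below is illustrative only: collapsing consecutive repeated tags (so that NN NN and NN fall into the same class, corresponding to NN+ in the text) is an assumed convention, and the toy data are invented.

```python
from collections import Counter
from itertools import groupby

def pos_pattern(pos_tags):
    """Collapse consecutive repeats and join with '+', e.g. JJ NN NN -> 'JJ+NN'."""
    collapsed = [tag for tag, _ in groupby(pos_tags)]
    return "+".join(collapsed)

def pattern_distribution(gp_pos_sequences):
    """Return {pattern: share of GPs}, ordered from most to least frequent."""
    counts = Counter(pos_pattern(seq) for seq in gp_pos_sequences)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.most_common()}

# POS tag sequences of four invented gap phrases.
gps = [["NN", "NN"], ["NN"], ["JJ", "NN"], ["NT"]]
dist = pattern_distribution(gps)
```

Here the key "NN" plays the role of the NN+ class, since runs of one or more NN tags are merged before counting.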

It should be noted that, because manual GP selection is subjective, the predictions of existing models do not fully agree with the manual annotation. This is also why the human prediction performance reported by the data provider reaches an F1 value of only 87.64% [12]. The above analysis shows that, setting aside the subjectivity of annotation, our model performs better in complete GP discovery and boundary recognition. However, inadequate understanding of whole sentences in the case of sparse text can still lead to errors such as incomplete GP recognition, which poses challenges and suggests directions for model improvement.

8. Theoretical and Practical Implications

In this work, we introduce a label-attention network along with an auxiliary part-of-speech tagging task for generating Chinese cloze questions in educational assessments. Our contribution not only enhances and refines the landscape of research in the domain of cloze question generation but also furnishes valuable technical foundations for the real-world implementation of educational examination applications.

From a theoretical perspective, our work provides a new solution for CCQG research. Our goal is to alleviate the data scarcity problem of the CCQG task while reusing the pretrained language model framework. Specifically, we propose a label-attention model that gives the encoder label-preference-aware feature representations of word sequences. Our proposed multi-task learning framework with POS tagging as an auxiliary task further improves the representation capability of the encoding module. Both attempts can offer researchers in the CCQG task and other NLP domains a new perspective on the problem of pretrained language model reuse.

From a practical perspective, our work can provide an aid to educational examination question generation. With the widespread use of AI technologies in education, AI technologies are also increasingly used to assist in the generation of examination questions. Our model is capable of generating cloze-type examination questions for a knowledge point assessment based on a small amount of annotated data. For teachers, our model can help them generate test questions from texts such as electronic textbooks and syllabuses, saving their working time. For educational institutions, the CCQG model can help them quickly build up a large pool of test questions, thus saving a lot of human resources.

9. Limitations

Although the proposed method achieves consistent improvements over existing approaches, several limitations should be acknowledged.

First, the experiments were conducted on a single publicly available Chinese cloze question generation dataset. Although this dataset covers multiple academic disciplines, its scale remains relatively limited compared with datasets commonly used in large-scale natural language processing research. Therefore, the generalization ability of the proposed framework on other educational datasets requires further investigation.

Second, the annotation of gap phrases is inherently subjective. Unlike conventional sequence labeling tasks, educational cloze question generation may admit multiple valid gap phrases within the same sentence. However, the current dataset provides only the phrase or phrases selected by the original annotators as reference annotations. Consequently, some predictions counted as errors may still correspond to educationally meaningful gap phrases. This annotation uncertainty introduces an inherent limitation to automatic evaluation and may underestimate the actual quality of model predictions. Future work could improve evaluation reliability through multi-reference annotation, expert validation, or human-centered assessment protocols.

Third, this study focuses on validating the effectiveness of the proposed POS-guided auxiliary learning strategy and label-attention mechanism using a BERT-based encoder. More recent pretrained language models, such as MacBERT, RoBERTa-wwm, and DeBERTa, were not systematically evaluated in this work.

Finally, the label-attention module adopts a lightweight additive fusion strategy. Although this design reduces model complexity and mitigates overfitting risks under limited training data, alternative fusion mechanisms may further improve performance and deserve future investigation.

10. Conclusions

A multi-task learning model is proposed in this paper for generating Chinese cloze questions for educational examinations. The model utilizes a label-attention component to improve global perception of label recognition and an auxiliary POS-tagging task to enhance recognition accuracy. Compared to other deep learning-based sequence labeling models, our approach efficiently determines label boundaries and achieves more comprehensive identification of GPs. However, there are areas for improvement in the current work. Firstly, in scenarios with sparse text representation, the model’s semantic understanding may be limited in capturing knowledge information from examination questions. Secondly, the shared feature layer encoding is influenced by incorrect decisions from the auxiliary task due to the multi-task learning framework. Lastly, the sequence labeling model can be further enhanced by applying task-specific constraints, such as requiring at least one GP per sentence. Overall, our model’s effectiveness is validated through experiments, but it is important to acknowledge these areas for potential enhancement.

Extensive experiments on the benchmark dataset demonstrate that our approach achieves competitive performance, particularly in terms of Recall and overall F1 score. Beyond the current framework, future work will focus on adapting our approach to cross-domain scenarios and reducing the model’s inference latency, rather than on random-seed validation.

Author Contributions

Conceptualization, Y.H.; methodology, Y.H. and Y.L.; software, Y.H.; validation, Y.H.; formal analysis, Y.L.; resources, S.X.; investigation, S.X.; writing—original draft preparation, Y.L.; writing—review and editing, Y.H. and Y.L.; visualization, S.X.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Henan Province key research and development project (No. 251111211300), the Natural Science Foundation of Henan Province (No. 262300421491) and the 2024 Teaching Reform Project of Henan Agricultural University (No. 2024XJGLX030).

Data Availability Statement

Both code and data are available on GitHub: https://github.com/pdsxsf/CCQG (accessed on 17 June 2026).

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare that they have no competing interests or other interests that might be perceived to influence the results and/or discussion reported in this paper.

References

Zock, M. Computational linguistics and its use in real world. In Proceedings of the COLING 1996 Volume 2: The 16th International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 1996; p. 1002. [Google Scholar] [CrossRef]
Foltz, P.W.; Laham, D.; Landauer, T.K. The intelligent essay assessor: Applications to educational technology. Interact. Multimed. Electron. J.-Comput.-Enhanc. Learn. 1999, 1, 939–944. [Google Scholar]
Mitkov, R.; Ha, L.A.; Karamanis, N. A computer-aided environment for generating multiple-choice test items. Nat. Lang. Eng. 2006, 12, 177–194. [Google Scholar] [CrossRef]
Kumar, G.; Banchs, R.E.; D’Haro, L.F. Automatic fill-the-blank question generator for student self-assessment. In 2015 IEEE Frontiers in Education Conference (FIE); IEEE: Piscataway, NJ, USA, 2015; Volume 2025, pp. 2–4. [Google Scholar] [CrossRef]
Kumar, G.; Banchs, R.E.; D’Haro, L.F. Revup: Automatic gap-fill question generation from educational texts. In 10th Workshop on Innovative Use of NLP for Building Educational Applications, BEA 2015 at the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2015; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 154–161. [Google Scholar] [CrossRef]
Le, N.T.; Kojiri, T.; Pinkwart, N. Automatic question generation for educational applications—The state of art. In Proceedings of the Advances in Intelligent Systems and Computing; van Do, T., Thi, H.A.L., Nguyen, N.T., Eds.; Springer: Cham, Switzerland, 2014; Volume 282, pp. 325–338. [Google Scholar] [CrossRef]
Agarwal, M.; Mannem, P. Automatic Gap-fill question generation from text books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, BEA 2011 at the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 56–64. [Google Scholar]
Cui, Y.; Chen, Z.; Wei, S.; Wang, S.; Liu, T.; Hu, G. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 593–602. [Google Scholar] [CrossRef]
Cui, Y.; Liu, T.; Chen, Z.; Wang, S.; Hu, G. Consensus attention-based neural networks for Chinese reading comprehension. In Proceedings of the COLING 2016—26th International Conference on Computational Linguistics: Technical Papers; The COLING 2016 Organizing Committee: Osaka, Japan, 2016; pp. 1777–1786. [Google Scholar]
Faizan, A.; Lohmann, S. Automatic generation of multiple choice questions from slide content using linked data. In ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2018. [Google Scholar] [CrossRef]
Marrese-Taylor, E.; Nakajima, A.; Matsuo, Y.; Ono, Y. Learning to Automatically Generate Fill-In-The-Blank Quizzes. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 152–156. [Google Scholar] [CrossRef]
Zhang, T.; Cui, Z.; Leng, J.; Liu, Y. CSFQGD: Chinese Sentence Fill-in-the-blank Question Generation Dataset for Examination. In Proceedings of the 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design, CSCWD 2021; Shen, W., Barthès, J.P.A., Luo, J., Shi, Y., Zhang, J., Eds.; IEEE: Piscataway, NJ, USA, 2021; pp. 609–613. [Google Scholar] [CrossRef]
Lindberg, D.; Popowich, F.; Nesbit, J.; Winne, P. Generating natural language questions to support learning on-line. In Proceedings of the ENLG 2013—14th European Workshop on Natural Language Generation; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 105–114. [Google Scholar]
Mazidi, K.; Nielsen, R.D. Linguistic considerations in automatic question generation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; Volume 2, pp. 321–326. [Google Scholar] [CrossRef]
Hussein, H.; Elmogy, M.; Guirguis, S. Automatic English question generation system based on template driven scheme. Int. J. Comput. Sci. Issues (IJCSI) 2014, 11, 45. [Google Scholar]
Labutov, I.; Basu, S.; Vanderwende, L. Deep questions without deep understanding. In Proceedings of the ACL-IJCNLP 2015—53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; Volume 1, pp. 889–898. [Google Scholar] [CrossRef]
Kusuma, S.F.; Siahaan, D.O.; Fatichah, C. Automatic question generation with various difficulty levels based on knowledge ontology using a query template. Knowl.-Based Syst. 2022, 249, 108906. [Google Scholar] [CrossRef]
Zhou, Q.; Yang, N.; Wei, F.; Tan, C.; Bao, H.; Zhou, M. Neural question generation from text: A preliminary study. In Proceedings of the Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2018; Volume 10619 LNAI, pp. 662–671. [Google Scholar] [CrossRef]
Kim, Y.; Lee, H.; Shin, J.; Jung, K. Improving neural question generation using answer separation. Proc. AAAI Conf. Artif. Intell. 2019, 33, 6602–6609. [Google Scholar] [CrossRef]
Liu, B.; Lai, K.; Zhao, M.; He, Y.; Xu, Y.; Niu, D.; Wei, H. Learning to generate questions by learning what not to generate. In Proceedings of the Web Conference 2019—Proceedings of the World Wide Web Conference, WWW 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1106–1118. [Google Scholar] [CrossRef]
Zeng, H.; Zhi, Z.; Liu, J.; Wei, B. Improving paragraph-level question generation with extended answer network and uncertainty-aware beam search. Inf. Sci. 2021, 571, 50–64. [Google Scholar] [CrossRef]
Yang, Z.; Hu, J.; Salakhutdinov, R.; Cohen, W.W. Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the ACL 2017—55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; Volume 1, pp. 1040–1050. [Google Scholar] [CrossRef]
Scialom, T.; Piwowarski, B.; Staiano, J. Self-attention architectures for answer-agnostic neural question generation. In Proceedings of the ACL 2019—57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6027–6032. [Google Scholar] [CrossRef]
Subramanian, S.; Wang, T.; Yuan, X.; Zhang, S.; Bengio, Y.; Trischler, A. Neural Models for Key Phrase Extraction and Question Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 78–88. [Google Scholar] [CrossRef]
Willis, A.; Davis, G.; Ruan, S.; Manoharan, L.; Landay, J.; Brunskill, E. Key phrase extraction for generating educational question-answer pairs. In Proceedings of the 6th 2019 ACM Conference on Learning at Scale, L@S 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–10. [Google Scholar] [CrossRef]
Wang, S.; Wei, Z.; Fan, Z.; Liu, Y.; Huang, X. A multi-agent communication framework for question-worthy phrase extraction and question generation. Proc. AAAI Conf. Artif. Intell. 2019, 33, 7168–7175. [Google Scholar] [CrossRef]
Wiedemann, G.; Remus, S.; Chawla, A.; Biemann, C. Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. In Proceedings of the 15th Conference on Natural Language Processing, KONVENS 2019, Erlangen, Germany, 9–11 October 2019; pp. 161–170. [Google Scholar]
Tang, J.; Qu, M.; Mei, Q. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15; Association for Computing Machinery: New York, NY, USA, 2015; pp. 1165–1174. [Google Scholar] [CrossRef]
Sheng, Y.; Takashi, I. Joint Embedding of Words and Labels for Sentiment Classification. In Proceedings of the 2020 International Conference on Asian Language Processing (IALP); IEEE: Piscataway, NJ, USA, 2020; Volume 1, pp. 264–269. [Google Scholar] [CrossRef]
Ma, Y.; Cambria, E.; Gao, S. Label embedding for zero-shot fine-grained named entity typing. In COLING 2016—26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers; The COLING 2016 Organizing Committee: Osaka, Japan, 2016; pp. 171–180. [Google Scholar]
Shuang, K.; Xu, M.; Zhang, W.; Zhang, Z. Adversarial Multi-task Label Embedding for Text Classification. In ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2019; pp. 45–50. [Google Scholar] [CrossRef]
Zhang, Y.; Chen, H.; Zhao, Y.; Liu, Q.; Yin, D. Learning tag dependencies for sequence tagging. In IJCAI International Joint Conference on Artificial Intelligence; International Joint Conferences on Artificial Intelligence: Marina Del Rey, CA, USA, 2018; pp. 4581–4587. [Google Scholar] [CrossRef] [PubMed]
Cui, L.; Zhang, Y. Hierarchically-refined label attention network for sequence labeling. In Proceedings of the EMNLP-IJCNLP 2019—2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4115–4128. [Google Scholar] [CrossRef]
Liu, M.; Liu, L.; Cao, J.; Du, Q. Co-attention network with label embedding for text classification. Neurocomputing 2022, 471, 61–69. [Google Scholar] [CrossRef]
Zhang, X.; Xu, J.; Soh, C.; Chen, L. LA-HCN: Label-based Attention for Hierarchical Multi-label Text Classification Neural Network. Expert Syst. Appl. 2022, 187, 115922. [Google Scholar] [CrossRef]
Lv, J.; Zhang, Z.; Jin, L.; Li, S.; Li, X.; Xu, G.; Sun, X. Trigger is Non-central: Jointly event extraction via label-aware representations with multi-task learning. Knowl.-Based Syst. 2022, 252, 109480. [Google Scholar] [CrossRef]
Kendall, A.; Gal, Y.; Cipolla, R. Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 7482–7491. [Google Scholar] [CrossRef]
Clark, K.; Luong, M.T.; Khandelwal, U.; Manning, C.D.; Le, Q.V. BAM! Born-again multi-task networks for natural language understanding. In Proceedings of the ACL 2019—57th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5931–5937. [Google Scholar] [CrossRef]
Li, Y.; Caragea, C. Multi-task stance detection with sentiment and stance lexicons. In Proceedings of the EMNLP-IJCNLP 2019—2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 6299–6305. [Google Scholar] [CrossRef]
Wang, X.; Lyu, J.; Dong, L.; Xu, K. Multitask learning for biomedical named entity recognition with cross-sharing structure. BMC Bioinform. 2019, 20, 427. [Google Scholar] [CrossRef] [PubMed]
Liu, Z.; Xu, Y. Multi-task nonparallel support vector machine for classification. Appl. Soft Comput. 2022, 124, 37–56. [Google Scholar] [CrossRef]
Mirfallah Lialestani, S.P.; Parcerisa, D.; Benomar, M.H.; Abbaszadeh Shahri, A. Generating 3D Geothermal Maps in Catalonia, Spain Using a Hybrid Adaptive Multitask Deep Learning Procedure. Energies 2022, 15, 4602. [Google Scholar] [CrossRef]
Zhang, Y.; Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 2018, 5, 30–43. [Google Scholar]
Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, PMLR, Stockholm, Sweden, 10–15 July 2018; Volume 2, pp. 1240–1251. [Google Scholar]
Yang, Y.; Hospedales, T.M. Deep multi-task representation learning: A tensor factorisation approach. In Proceedings of the 5th International Conference on Learning Representations, ICLR 2017—Conference Track Proceedings, Toulon, France, 24–26 April 2017. [Google Scholar]
Balikas, G.; Moura, S.; Amini, M.R. Multitask Learning for Fine-Grained Twitter Sentiment Analysis. In Proceedings of the SIGIR 2017—Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17; Association for Computing Machinery: New York, NY, USA, 2017; pp. 1005–1008. [Google Scholar] [CrossRef]
Yang, L.; Na, J.C.; Yu, J. Cross-Modal Multitask Transformer for End-to-End Multimodal Aspect-Based Sentiment Analysis. Inf. Process. Manag. 2022, 59, 103038. [Google Scholar] [CrossRef]
Cheng, L.; Xie, F.; Ren, J. KB-QA based on multi-task learning and negative sample generation. Inf. Sci. 2021, 574, 349–362. [Google Scholar] [CrossRef]
Zeng, J.; Yu, Y.; Wen, J.; Jiang, W.; Cheng, L. Personalized Dynamic Attention Multi-task Learning model for document retrieval and query generation. Expert Syst. Appl. 2023, 213, 119026. [Google Scholar] [CrossRef]
Kung, P.N.; Chen, Y.C.; Yin, S.S.; Yang, T.H.; Chen, Y.N. Efficient Multi-Task Auxiliary Learning: Selecting Auxiliary Data by Feature Similarity. In Proceedings of the EMNLP 2021—2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 416–428. [Google Scholar] [CrossRef]
Kumari, R.; Ashok, N.; Ghosal, T.; Ekbal, A. Misinformation detection using multitask learning with mutual learning for novelty detection and emotion recognition. Inf. Process. Manag. 2021, 58, 102631. [Google Scholar] [CrossRef]
Coavoux, M.; Crabbe, B. Multilingual lexicalized constituency parsing with word-level auxiliary tasks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; Volume 2, pp. 331–336. [Google Scholar] [CrossRef]
Hu, Z.; Li, X.; Tu, C.; Liu, Z.; Sun, M. Few-shot charge prediction with discriminative legal attributes. In Proceedings of the COLING 2018—27th International Conference on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 487–498. [Google Scholar]
Rei, M. Semi-supervised multitask learning for sequence labeling. In Proceedings of the ACL 2017—55th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; Volume 1, pp. 2121–2130. [Google Scholar] [CrossRef]
Lin, Y.; Yang, S.; Stoyanov, V.; Ji, H. A multi-lingual multi-task architecture for low-resource sequence labeling. In Proceedings of the ACL 2018—56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; Volume 1, pp. 799–809. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL HLT 2019—2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 5999–6009. [Google Scholar]
Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM networks. Proc. Int. Jt. Conf. Neural Netw. 2005, 4, 2047–2052. [Google Scholar] [CrossRef]
Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Xia, F. The Part-of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0); IRCS Technical Reports Series; Institute for Research in Cognitive Science: Philadelphia, PA, USA, 2000. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Zhang, Y.; Yang, J. Chinese NER using lattice LSTM. In Proceedings of the ACL 2018—56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; Volume 1, pp. 1554–1564. [Google Scholar] [CrossRef]
Wang, H.; Wang, B.; Duan, J.; Zhang, J. Chinese Spelling Error Detection Using a Fusion Lattice LSTM. ACM Trans. Asian Low.-Resour. Lang. Inf. Process. 2021, 20, 28. [Google Scholar] [CrossRef]
Zhang, Y.; Wang, Y.; Yang, J. Lattice LSTM for Chinese Sentence Representation. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1506–1519. [Google Scholar] [CrossRef]
Meng, F.; Yang, S.; Wang, J.; Xia, L.; Liu, H. Creating Knowledge Graph of Electric Power Equipment Faults Based on BERT–BiLSTM–CRF Model. J. Electr. Eng. Technol. 2022, 17, 2507–2516. [Google Scholar] [CrossRef]
Hu, B.; Huang, Z.; Hu, M.; Zhang, Z.; Dou, Y. Adaptive Threshold Selective Self-Attention for Chinese NER. In Proceedings of the International Conference on Computational Linguistics, COLING, 2022; International Committee on Computational Linguistics: Gyeongju, Republic of Korea, 2022; Volume 29, pp. 1823–1833. [Google Scholar]

Figure 1. Four example sentences with ground-truth gap phrases (GPs) and their corresponding POS tags. The GPs in each example are marked with parentheses, and the POS tag of each word is shown above the sentence. The English translations of the four sentences are: Ex.1 “The core of etiquette and politeness is to respect people.”; Ex.2 “The basic goal of psychological counseling is to learn to adjust.”; Ex.3 “The no-load voltage of the electrode arc welder is generally 60–90 V.”; Ex.4 “After all, the competition of enterprises is the competition of talents.”

Figure 2. Architecture of the proposed model. The right part is the overall architecture of the model, and the left is the network structure of label attention. Arrows indicate the flow of information between components, and the circled “+” symbol denotes element-wise addition for representation fusion. The input $x_{t}$ of the model is the original sentence. After passing through the BERT and BiLSTM layers shared by two tasks, it is fed into the task-specific network layer. The primary task is designed with one more label-attention module than the auxiliary task. The primary task output $g_{t}$ is the GP label, and the auxiliary task output $s_{t}$ is the POS tag.

Figure 3. Ablation study results of different multi-task settings.

Figure 4. F1 scores on the validation dataset for different $\lambda$ values.

Figure 5. Distribution of the POS patterns associated with GPs. The ‘+’ symbol in the legend indicates one or more repetitions of the same POS tag, and the ‘#’ symbol indicates a combination of different POS tags. Patterns with a share of less than 1% are grouped as “other”.

Table 1. The eight types of POS tags that were excluded.

Tag	Description
BA	bǎ in ba-construction
CS	subordinating conjunction
DER	resultative de; de in V-de constructions and V-de-R
ETC	for words like “etc.”
IJ	interjection
NOI	noise (characters written in the wrong order)
ON	onomatopoeia
SP	sentence-final particle

Table 2. The statistics of the datasets.

	Train	Validation	Test
Num. questions	14,370	3080	3080
Avg. sentence length	41.45
Avg. blanks	1.67
Avg. blanks length	4.25
Vocabulary size	3858

Table 3. Performance of the existing models and the proposed model. The superscript ♮ indicates that the score comes from the original paper, and the superscript ♯ indicates that the score comes from our re-implemented model. The reported results are mean ± standard deviation over five independent runs. Statistical significance was assessed using paired t-tests against the BERT-BiLSTM-CRF baseline. The symbols **, *, and ∼ correspond to p < 0.001, p < 0.05, and p > 0.05, respectively.

	Precision	Recall	F1
BiLSTM-CRF ^♮	57.95	47.20	52.03
Lattice LSTM ^♮	52.08	50.10	51.07
BERT-BiLSTM-CRF ^♮	$64.02 \pm 0.25$	$63.44 \pm 0.31$	$63.73 \pm 0.22$
Our Approach	$64.05 \pm {0.28}^{\sim}$	$67.79 \pm 0.35$ **	$65.85 \pm 0.26$ *
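
The significance markers in Table 3 come from paired t-tests over five runs. The sketch below illustrates how such a marker could be derived, using only the Python standard library. The per-run F1 lists are toy placeholders, since per-run scores are not reported here. Pairing runs by index, using a two-sided test, and reporting the sample standard deviation are all assumptions. The critical values are the standard two-sided Student's t thresholds for 4 degrees of freedom.

```python
import math
import statistics

# Two-sided critical values of Student's t for df = 4 (five paired runs).
T_CRIT_P05 = 2.776   # |t| above this gives p < 0.05
T_CRIT_P001 = 8.610  # |t| above this gives p < 0.001


def summarize(scores):
    """Format scores as 'mean ± sample standard deviation'."""
    return f"{statistics.mean(scores):.2f} ± {statistics.stdev(scores):.2f}"


def paired_t_statistic(baseline, proposed):
    """t statistic of the per-run differences (proposed minus baseline)."""
    diffs = [p - b for b, p in zip(baseline, proposed)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


def significance_marker(t_stat):
    """Map a t statistic (df = 4) to the marker convention of Table 3."""
    if abs(t_stat) > T_CRIT_P001:
        return "**"
    if abs(t_stat) > T_CRIT_P05:
        return "*"
    return "~"  # corresponds to the "∼" marker (p > 0.05)


# Toy placeholder scores for five runs; NOT the paper's per-run results.
toy_baseline_f1 = [60.0, 62.0, 61.0, 59.0, 63.0]
toy_proposed_f1 = [62.0, 63.0, 64.0, 61.0, 64.0]

t = paired_t_statistic(toy_baseline_f1, toy_proposed_f1)
print(summarize(toy_baseline_f1), summarize(toy_proposed_f1))
print(f"t = {t:.2f}, marker = {significance_marker(t)}")
```

The hard-coded thresholds only apply to exactly five paired runs; a different number of runs would need the critical values for the matching degrees of freedom.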

Table 4. Ablation results on the validation set, used to compare the relative contributions of POS auxiliary learning and label attention (LA). Because these results are computed on the validation set for component analysis, they are not directly comparable to the final test-set results reported in Table 3.

	Precision	Recall	F1
Full model	64.40	67.87	66.09
w/o POS	62.89	66.97	64.87
w/o LA	64.25	67.17	65.68
w/o POS & LA	64.08	63.39	63.74
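
As a consistency check on Table 4, the F1 column can be recomputed from the Precision and Recall columns. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is illustrative. The recomputed values agree with the reported ones to within about 0.01, which is the size of difference expected from rounding the inputs to two decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for each row of Table 4, in percent.
ABLATION_ROWS = {
    "Full model": (64.40, 67.87, 66.09),
    "w/o POS": (62.89, 66.97, 64.87),
    "w/o LA": (64.25, 67.17, 65.68),
    "w/o POS & LA": (64.08, 63.39, 63.74),
}

for name, (p, r, reported) in ABLATION_ROWS.items():
    recomputed = f1_score(p, r)
    print(f"{name:<13} recomputed F1 = {recomputed:.2f} (reported {reported:.2f})")
```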

Table 5. Definition of error categories.

Error Category	Definition	Explanation
Case 1: Incomplete GP Recognition	The model identifies only part of the annotated GPs in a sentence.	Usually occurs when multiple GPs appear in one sentence and the model fails to recognize all annotated targets.
Case 2: Boundary Mismatch	The model detects the correct GP region but predicts an incorrect boundary.	Often related to phrase segmentation errors, such as missing modifiers or partial phrase recognition.
Case 3: Missing GP	The model fails to identify the annotated GP.	Usually occurs when the GP is semantically implicit, context-dependent, or weakly indicated by local syntactic cues.
Case 4: Over-prediction	The model predicts extra phrases that are not annotated as GPs.	Often occurs when several noun phrases, technical terms, or concept expressions are plausible blanks but only a subset is annotated in the dataset.
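
The categories in Table 5 can be made concrete with a small span-based classifier. The sketch below is an illustration only, not the procedure used to produce the error statistics: the function names and the order in which the categories are checked are assumptions, and a sentence that shows several error types is assigned only the first matching category. The inputs use the bracket notation of Table 7, where `[...]` marks a gap phrase.

```python
def parse_spans(marked):
    """Return (start, end) offsets of [bracketed] phrases, measured on the
    text with the brackets removed."""
    spans, start, pos = [], None, 0
    for ch in marked:
        if ch == "[":
            start = pos
        elif ch == "]":
            spans.append((start, pos))
            start = None
        else:
            pos += 1
    return spans


def overlaps(a, b):
    """True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def categorize(gold_marked, pred_marked):
    """Assign one Table 5 category (assumed precedence: 4, 2, 3, 1)."""
    gold = set(parse_spans(gold_marked))
    pred = set(parse_spans(pred_marked))
    if gold == pred:
        return "Correct"
    # Predicted phrases that touch no annotated GP at all.
    if any(not any(overlaps(p, g) for g in gold) for p in pred):
        return "Case 4: Over-prediction"
    # Predicted phrases that overlap a GP but miss its exact boundary.
    if any(p not in gold for p in pred):
        return "Case 2: Boundary Mismatch"
    # From here on, every prediction is an exact match but some GP is missed.
    if not pred:
        return "Case 3: Missing GP"
    return "Case 1: Incomplete GP Recognition"


# Sentences from Table 7 (example 5 uses the text of the prediction rows).
print(categorize("[价值尺度]和[流通手段]是货币的原始职能。",
                 "[价值尺度]和流通手段是货币的原始职能。"))
print(categorize("关于安乐死, 我国首起安乐死事件发生时间是[1986年]。",
                 "关于安乐死, 我国首起安乐死事件发生时间是[1986]年。"))
print(categorize("[联系]是指实体间存在的对应关系。",
                 "联系是指实体间存在的对应关系。"))
print(categorize("颈部以[斜方肌]为界分为颈前外侧部和颈后部。",
                 "颈部以[斜方肌]为界分为[颈前外侧部]和[颈后部]。"))
```

Under these assumptions, the four calls correspond to Cases 1, 2, 3, and 4 respectively.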

Table 6. Error statistics of different error categories.

Error Category	Count	Percentage (%)
Case 1	316	34.96
Case 2	90	9.96
Case 3	22	2.43
Case 4	476	52.65
Total	904	100.00
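
The percentages in Table 6 follow directly from the counts. The short check below recomputes them; the variable names are illustrative.

```python
# Error counts per category, taken from Table 6.
ERROR_COUNTS = {"Case 1": 316, "Case 2": 90, "Case 3": 22, "Case 4": 476}

total = sum(ERROR_COUNTS.values())
percentages = {
    case: round(100 * count / total, 2) for case, count in ERROR_COUNTS.items()
}

for case, pct in percentages.items():
    print(f"{case}: {ERROR_COUNTS[case]} of {total} errors ({pct:.2f}%)")
```

Cases 1 and 4 together cover 792 of the 904 errors, so incomplete recognition and over-prediction account for most of the errors analyzed.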

Table 7. Case study.

No	Methods	Text
1	Ground Truth	[价值尺度]和[流通手段]是货币的原始职能。 (The measure of value and the means of circulation are the original functions of money.)
	Our Approach	[价值尺度]和[流通手段]是货币的原始职能。
	BERT-BiLSTM-CRF	[价值尺度]和流通手段是货币的原始职能。
2	Ground Truth	关于安乐死, 我国首起安乐死事件发生时间是[1986年]。 (Regarding euthanasia, the first euthanasia case in China occurred in 1986.)
	Our Approach	关于安乐死, 我国首起安乐死事件发生时间是[1986年]。
	BERT-BiLSTM-CRF	关于安乐死, 我国首起安乐死事件发生时间是[1986]年。
3	Ground Truth	[联系]是指实体间存在的对应关系。 (A relation is a correspondence that exists between entities.)
	Our Approach	[联系]是指实体间存在的对应关系。
	BERT-BiLSTM-CRF	联系是指实体间存在的对应关系。
4	Ground Truth	谈话法可分[复习谈话]和[启发谈话]两种。 (The conversation method can be divided into two types: review conversation and enlightenment conversation.)
	Our Approach	谈话法可分复习谈话和[启发谈话]两种。
	BERT-BiLSTM-CRF	谈话法可分复习谈话和[启发]谈话两种。
5	Ground Truth	颈部以[斜方肌]为界分为颈前外侧部和颈后部。 (The neck is divided into the anterolateral portion and the back portion of the neck by the trapezius muscle.)
	Our Approach	颈部以[斜方肌]为界分为[颈前外侧部]和[颈后部]。
	BERT-BiLSTM-CRF	颈部以斜方肌为界分为[颈前外侧部]和[颈后部]。

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
