1. Introduction
Natural language processing has long been applied to support educational assessment, including computer-assisted language learning [1], automated essay scoring [2], and test item generation [3]. In recent years, advances in language generation models have further encouraged the adoption of automated question generation in educational settings [4,5,6]. This study focuses on Chinese cloze question generation (CCQG) for examinations, a task initially conceptualized by [7] in the context of biology textbook chapters for Advanced Placement exams. Their framework divides the process into three subtasks: sentence selection, target (gap) selection, and distractor generation. Our work corresponds to the second subtask: identifying the key phrase to be omitted in a given sentence. For example, given the Chinese sentence “计算机工作时, 内存储器中存储的是指令与数据。” (When the computer is working, the memory stores instructions and data), the goal is to recognize “指令与数据 (instructions and data)” as the gap phrase that encapsulates the intended knowledge point.
Unlike cloze tasks in reading comprehension [8], the CCQG task for educational examinations is designed to assess a test-taker’s ability to apply linguistic and grammatical rules, or to evaluate their mastery of specific knowledge domains [4,5,6]. Consequently, the selection of gap phrases (GPs) must accurately reflect the targeted knowledge points, such as grammatical structures, conceptual definitions, or procedural methods, and is often constrained by formal grammatical rules or a defined body of subject knowledge.
In contrast, gap selection in reading comprehension is typically confined to a local vocabulary within a passage. This task revolves primarily around paragraph content, where words are masked within relevant sentences and answers are generally recoverable from the surrounding context. For instance, in the Chinese cloze dataset People Daily & Children’s Fairy Tale [
9], answers are restricted to nouns that appear at least twice in the paragraph, though the specific choice is random. Correct answers in such tasks often exhibit lexical overlap with the context. As an example, in the sentence “The so-called ‘silly money’
— is that to buy and hold the common combination of U.S. stock,” the correct answer “Strategies” appears explicitly in the subsequent sentence: “This strategy is better than other complex investment methods…”
Approaches to cloze question generation range from the rule-based method proposed in the original study [7] to the machine learning method of [5] and the knowledge link-based method of [10]. However, all of these methods are limited by small-scale labeled data. Although some work, such as [11], has constructed large-scale datasets, the vast majority of research focuses on English language learning corpora, and few Chinese resources are publicly available. In 2021, one study [12] proposed a Chinese fill-in-the-blank question generation dataset for non-English subject exams, whose size can support the training of current mainstream deep learning models and alleviates, to some extent, the shortage of annotated data faced by CCQG. However, in today’s large language model-based natural language processing framework, the number of samples available for each category during model training is still insufficient to learn a good feature representation.
This paper addresses the challenge of model learning under low-resource conditions, focusing on two key aspects. First, a label-attention-mechanism-based approach is introduced, which enhances feature representations of word sequences by assigning preferences to different label types. This enriches sample feature representations without requiring additional training data. Second, we introduce POS tagging as an auxiliary task, integrating it into a multi-task learning model to enhance the feature encoding module’s modeling capacity and, in turn, improve the performance of the sequence annotation module. The design of the auxiliary task is driven by the observation that POS tag boundaries align closely with GP boundaries. A specific example is shown in Figure 1. A GP usually consists of one or more related categories of POS phrases. For example, a VV (other verb) “尊重 (respect)” and an NN (common noun) “人 (people)” form the GP “尊重人 (respect people)” in Ex.1.
The main contributions of this study are summarized as follows. First, we investigate the linguistic relationship between gap phrases and POS structures in Chinese cloze question generation. Different from general sequence labeling tasks, GP identification in educational cloze questions is closely related to the boundaries of knowledge-bearing syntactic units. We explicitly exploit this task-specific observation and use POS boundary information as auxiliary supervision.
Second, we propose a POS-guided multi-task learning framework for Chinese cloze question generation. In this framework, GP identification is treated as the primary task, while POS tagging is used as an auxiliary task to enhance the shared encoder’s ability to capture syntactic boundary information. This design allows the model to benefit from syntactic supervision without directly injecting manually engineered POS features into the input.
Third, we construct a refined POS tag subset for the auxiliary task. Instead of using the full CTB POS tag inventory, we exclude POS categories whose boundaries show little correspondence with annotated GP boundaries, thereby improving the relevance between the auxiliary task and the primary task.
Fourth, we adapt a label-attention mechanism to model the interaction between contextual word representations and GP label embeddings. This allows the primary task decoder to obtain label-aware representations for sequence prediction.
2. Related Work
2.1. Automatic Question Generation
A substantial portion of initial question generation research employs strategies rooted in grammatical rules and templates. This approach bifurcates the question generation task into two distinct subtasks: “what to ask” and “how to ask” [
13]. More specifically, linguistic rules and templates are devised to systematically extract question content from the text, leveraging information such as the grammatical composition of the input text. Subsequently, the extracted content is incorporated into pre-constructed question sentence templates, culminating in the formation of the question sentence [
14,
15,
16,
17]. However, such methods rely on hand-crafted rules, cannot adapt to new text domains from data, and are costly to transfer, which limits their wider use. Subsequently, sequence-to-sequence encoder–decoder neural networks have been widely used in question generation tasks [
18,
19,
20,
21]. The approaches described above primarily concentrate on the generation of questions and often require the answer to be given in advance.
Another group of research works selects answers and generates questions directly in the modeling process. Yang et al. [
22] adopt a rule-based approach to pick answer words from the unlabeled text. Scialom et al. [
23] proposed a multi-agent communication framework in which a local extraction module automatically identifies phrases worth asking about, these phrases guide question generation, and feedback from the generated questions is then used to improve key phrase recognition. Subramanian et al. [
24] use a two-stage keyword extraction and question generation model. Willis et al. [
25] exploit the end-to-end model, which generates a list of candidate answers from the context. Wang et al. [
26] use a multi-agent framework for phrase extraction and question generation. Our work is concerned with selecting answer phrases; once the answer is determined, the fill-in-the-blank question can be generated naturally.
2.2. Label Embedding
Pretrained language models, which are the core foundation of current neural text processing [
27], can only represent information about the text itself. In contrast, information about the labels can also be incorporated. Label embedding is a technology that converts the category label of an image or text into a vectorized representation. Recently, many scholars have increasingly applied label embedding to various tasks in natural language processing (NLP). Tang et al. [
28] used label information for representation learning for large-scale heterogeneous text networks and achieved good experimental results. For text classification tasks, Wang et al. [
29] argued that the label information of current general models only plays a supervisory role in the final classification prediction layer, and few studies combine label information and attention mechanisms to design efficient attention models. Therefore, they proposed the Label-Embedding Attentive Model (LEAM), which combines label embedding and attention mechanisms to obtain label-related attention by embedding word representations and label representations in a joint space for learning. Subsequently, they apply weighted attention to the text representation to get a more accurate text representation and ultimately improve the effectiveness of text classification. Ma et al. [
30] combine prototype and hierarchical information to learn label embedding, which enhances the performance of named entity category labeling in the zero-shot setting. For multi-task text classification, Zhang et al. [
31] proposed Multi-Task Label Embedding, which achieves the conversion from text classification task to vector matching task by converting the label information in text classification into a label vector that contains semantic information, and enhances the performance of the model. When dealing with sequence labeling problems, Zhang et al. [
32] model the dependencies between labels and improve the accuracy of label sequences. Cui et al. [
33] designed layer-refined RNN networks combined with label attention to gradually adjust the probability distribution of sequence labels for improving the final prediction performance. Liu et al. [
34] combine label embedding with co-attention to improve the performance of text classification. Zhang et al. [
35] designed a label-attention module for hierarchical multi-label text classification networks. Lv et al. [
36] introduce predicate label representation in the event extraction task to guide the model for better semantic understanding. The above research studies have all proved the efficacy of label embedding and its significant enhancement to text representation. In our task, to make full use of the label embedding, this paper weights the output features of the encoder with the label embedding and finally fuses the obtained attention values with the encoder feature representation for sequence prediction, which enhances the performance of the model.
2.3. Multi-Task Learning and Auxiliary Task Learning
Multi-task learning methods are commonly used in natural language processing, learning shared representations across multiple related tasks to enhance the performance of all tasks. Multi-task learning is widely used in image and NLP tasks [
37,
38,
39,
40,
41,
42]. Multi-task learning refers to the joint training of multiple related tasks, which helps the primary task learn by sharing information between tasks [
43]. Parameter-sharing methods [
44,
45] in multi-task learning include hard sharing, soft sharing, etc. Specifically, Balikas et al. [
46] proposed a recurrent neural network-based multi-task learning method that treats emotion classification (positive, neutral, and negative) and quintuple classification (very positive, positive, neutral, negative, and very negative) as related tasks and solves the quintuple emotion classification problem by jointly learning. Yang et al. [
47] introduced the cross-modal multi-task transformer in a multimodal sentiment analysis task. Cheng et al. [
48] designed the entity linking and relation detection tasks for KB-QA for joint learning. Zeng et al. [
49] used the multi-task learning model for document retrieval and query generation and achieved good performance.
Unlike multi-task learning, auxiliary task learning aims only to enhance the performance of the primary task; the auxiliary tasks serve solely to support its learning. Specifically, Kung et al. [
50] studied the problem of data selection in auxiliary tasks, and their method selected auxiliary training data subsets based on feature similarity, which improved the training efficiency. Kumari et al. [
51] designed novelty detection and emotion recognition as two auxiliary tasks to improve the performance of the misinformation detection task. Coavoux et al. [
52] set up two supervised auxiliary tasks, a lexical annotation task and a functional label prediction task (determining the constituent relations between words and central words), to enhance the performance of the multilingual constituent syntactic analysis task. Hu et al. [
53] addressed the problem of large legal text categories by redesigning the category system according to the specific content involved in the legal text (whether a death occurred, whether there was violence, etc.), and helped the primary task of classifying legal texts through the new category system classification auxiliary tasks. In sequence labeling tasks, multi-task learning is also widely used. Rei et al. [
54] proposed the use of unsupervised auxiliary tasks to help neural models to learn deep textual semantic, syntactic information. Lin et al. [
55] suggest a cross-linguistic multi-task learning method to alleviate the problem of insufficient corpus in a specific NER domain. This paper proposes an auxiliary learning method for POS tagging and determines the subset of POS closely related to the primary task through manual selection so that the feature encoding layer can better take into account the POS information, thereby improving the performance of the primary task.
3. Research Objective
As mentioned in the previous sections, CCQG currently suffers from a scarcity of annotated data, which makes it difficult to exploit the capabilities of mainstream pretrained language models. To address this problem while reusing the original pretrained language model architecture, we pursue two research objectives: (1) to explore a feature representation method that enriches samples without requiring more training data; and (2) to construct a multi-task learning framework with appropriate auxiliary tasks to improve the performance of the CCQG model. To achieve these goals, we propose an approach based on a label-attention mechanism, which learns feature representations of word sequences with a preference for particular labels. Furthermore, we design a POS-tagging auxiliary task to improve the representation produced by the feature encoding module, based on the observation that the boundaries of POS tags and GPs are highly consistent.
4. Definition of the CCQG Task
CCQG is a sequence labeling problem in natural language processing: a predefined class label is assigned to each character in a sequence of Chinese text while preserving word order and context. Given an input sequence $X$ of length T tokens, to identify the gap phrase (GP) in the sequence, the output label sequence $Y$ is required to indicate the position of the GPs. In our task, the location information uses the BMEOS annotation mode, where B (beginning) means that the token is located at the beginning of the GP, M (middle) means that the token is located inside the GP, E (end) means that the token is located at the end of the GP, O (outside) indicates a non-GP token, and S (single) represents a GP with only one token. There is no need to identify type information in the CCQG task; however, for the auxiliary POS-tagging task we designed, the category information is the POS tag of the token. For POS tagging, assuming that the category set is
, after combining the position and category information
, the size of the token classification space is
, that is,
The CCQG model
seeks to produce the sequence label
in such a way that the probability of the true label sequence
occurring in the dataset is as high as possible, which is expressed as in Equations (
1) and (
2):
where
are the tags that make up the entire label sequence, and
L is the length of the label sequence.
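The BMEOS scheme described above can be illustrated with a minimal sketch. The function name `spans_to_bmeos` and the inclusive `(start, end)` span convention are assumptions made for this illustration and are not taken from the dataset's own tooling; the sentence is the character-level example from the Introduction.

```python
from typing import List, Tuple


def spans_to_bmeos(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert gap-phrase spans into BMEOS tags.

    Each span is (start, end) with both indices inclusive (an assumption of
    this sketch). Tokens outside every span are tagged "O".
    """
    tags = ["O"] * num_tokens
    for start, end in spans:
        if not 0 <= start <= end < num_tokens:
            raise ValueError(f"span {(start, end)} is out of range")
        if start == end:
            tags[start] = "S"          # single-token GP
        else:
            tags[start] = "B"          # first token of the GP
            for i in range(start + 1, end):
                tags[i] = "M"          # interior tokens
            tags[end] = "E"            # last token of the GP
    return tags


# Character-level example; the annotated GP is "指令与数据".
sentence = "计算机工作时,内存储器中存储的是指令与数据。"
gp = "指令与数据"
start = sentence.index(gp)
tags = spans_to_bmeos(len(sentence), [(start, start + len(gp) - 1)])
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Every character before and after the GP receives O, while the five GP characters receive B, M, M, M, E.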
5. Baseline
The typical solution to the sequence labeling problem is to use a neural encoder–decoder model; here, we choose BERT-BiLSTM-CRF as the baseline method. The input in Chinese sentences is word-level features, so the model is made up of a word representation layer, a sequence encoding layer, and a CRF (conditional random field) inference layer.
5.1. Word Representation Layer
The semantic representation of text (embedding) is the basis of sequence labeling with deep neural networks. This research uses the BERT (Bidirectional Encoder Representations from Transformers) model as the embedding layer to generate semantic representation vectors. BERT is a language model built on the “pretraining–finetuning” paradigm, and it has excelled at several NLP tasks [
56]. As the embedding layer of the sequence labeling model, BERT generates a semantic representation vector with rich contextual semantic features according to the input sentence sequence. In the pretraining stage, the large-scale training corpus enables BERT to effectively ingest and learn the rich semantic information in the input text [
56]. At the same time, during pretraining, some words in the corpus are randomly masked under the masked language model objective, so that the model learns the context information of each word. The BERT model uses a bidirectional transformer [57] for feature extraction. Unlike a convolutional or recurrent neural network, the transformer is a feature extraction model based on an attention mechanism and consists only of self-attention and feedforward layers. It effectively captures long-distance dependencies in sentences, a long-standing difficulty in traditional NLP tasks. The bidirectional transformer encodes the input sentence sequence, resulting in output word vectors with contextual semantic features.
The input sentence $X = (x_1, \ldots, x_T)$ is converted by the BERT embedding layer into the embedded representation $E = (e_1, \ldots, e_T)$, where $e_i$ is the vector representation corresponding to $x_i$, each $e_i$ is an m-dimensional vector, and $E$ is a $T \times m$ matrix. Each row corresponds to the vector representation of a word in the sentence.
5.2. Sequence Encoding Layer
The objective of this layer is to capture additional contextual features and acquire more extensive semantic information. To accomplish this, we employ the BiLSTM (Bidirectional Long Short-Term Memory) layer to encode contextual semantic features, enabling the extraction of the sentence’s overall semantics. This process predominantly relies on the backward and forward positional relationships among words to assimilate structural information within the sentence. The BiLSTM [
58] is composed by splicing a forward and a backward LSTM [
59], which can record both forward and backward information of each input item. The LSTM network, as a variant of the RNN, introduces memory units and a gating mechanism to control the forgetting, updating, and passing of information. As a result, it can learn long-range dependencies and effectively mitigates the vanishing and exploding gradient problems that occur in plain RNN structures. The output of this layer is shown in Equation (3):
$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}] \quad (3)$$
where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden layer outputs of the forward and backward LSTM, respectively.
5.3. CRF
While the probability matrix derived from BiLSTM can be used to determine the final outcome directly, it may still yield incorrect results because it ignores correlations between labels. To address this limitation, the final layer of the model incorporates a conditional random field (CRF). The CRF captures dependencies between adjacent labels and imposes constraints that keep preceding and following labels coherent. A CRF is a conditional probability distribution model, which can be represented by $P(Y \mid X)$. In our task, a linear-chain conditional random field is used. Here $X$ is the input variable, representing the observation sequence to be labeled, while $Y$ represents the series of labels corresponding one by one to $X$ as the output sequence. Its core premise is shown in Equation (4):
$$P(Y \mid X) = \frac{1}{Z(X)} \exp\Big( \sum_{i} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, X, i) \Big) \quad (4)$$
where $f_k$ represents a feature function, $\lambda_k$ denotes the corresponding weight of the feature function, and $Z(X)$ is the normalization factor obtained by summing the exponentiated score over all candidate label sequences. During training, the conditional probability model $P(Y \mid X)$ is derived through maximum likelihood estimation. When predicting, for a given observation sequence, the Viterbi algorithm is used to produce the label sequence Y with the highest conditional probability $P(Y \mid X)$.
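The Viterbi decoding step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: scores are additive (log-space), the emission and transition values are invented toy numbers, and the reduced label set B/E/O is chosen only to keep the example small.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label path for a linear-chain model.

    emissions[t][y]   : score of label y at position t
    transitions[p][y] : score of moving from label p to label y
    Scores are treated as log-space values and therefore added.
    """
    if not emissions:
        return []
    best = {y: emissions[0][y] for y in labels}
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = {}, {}
        for y in labels:
            # Best previous label for reaching y at position t.
            prev = max(labels, key=lambda p: best[p] + transitions[p][y])
            new_best[y] = best[prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["B", "E", "O"]
BLOCK = -10.0  # toy penalty for transitions that break the tagging scheme
transitions = {p: {y: 0.0 for y in labels} for p in labels}
for p, y in [("B", "B"), ("B", "O"), ("O", "E"), ("E", "E")]:
    transitions[p][y] = BLOCK

emissions = [
    {"B": 1.0, "E": 0.0, "O": 0.5},
    {"B": 0.0, "E": 0.4, "O": 0.6},
    {"B": 0.0, "E": 0.0, "O": 1.0},
]
path = viterbi(emissions, transitions, labels)
print(path)
```

In this toy case, choosing the best label independently at each position would give B, O, O, which leaves the GP opened by B unclosed. The transition scores steer the decoder to B, E, O instead, which is the kind of label coherence the CRF layer is meant to enforce.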
6. Methodology
The multi-task label-attention network proposed in this paper learns the feature representation with the support of the POS-tagging task and combines the label-attention mechanism to boost the primary task’s effectiveness.
Figure 2 illustrates the structure of our model. On the primary task, the label embedding is used as a learning parameter, which is learned during the training process. On the basis of the label embedding, the model learns the attentional representation of words. In order to predict, the CRF layer receives the output of the BiLSTM encoding layer along with the attentional representation of words. On the auxiliary task, due to the large number of label types and the scarcity of data, we did not learn its label embedding but directly fed the output of BiLSTM into the CRF layer as its input for POS prediction. The BERT and BiLSTM encoding layers are shared by the two tasks.
6.1. Label-Attention Network
Given a candidate set of labels
, where
is the number of labels in the output label candidate set, and each label in the set is represented by an embedding vector as shown in Equation (
5).
where
represents the look-up table of the label embeddings, the label embeddings are randomly initialized at the beginning of the training of the model, and are continuously adjusted during the training process.
We leverage the label-attention mechanism to introduce label embeddings into the model, enabling the model to learn the dependencies between labels, and the correlation between labels and words. First, the attention score between the label embedding matrix
and the hidden state
of the BiLSTM encoder is calculated, and the calculation formula is as shown in Equations (
6) and (
7):
This attention score differs from the frequently employed sequence-level attention, which focuses on modeling interactions among words within the input sequence. In contrast, the proposed label-attention mechanism models the interaction between contextual word representations and label embeddings.
Since the label-attention vector
is represented in the label space, it cannot be directly combined with the original BiLSTM hidden representation
. Therefore, we first project
into the same label space through a linear transformation
where
denotes the hidden representation produced by the BiLSTM encoder,
is the projected representation in the label space, and
is the label-attention vector. Therefore,
and
have the same dimensionality.
The projected representation and the label-attention vector are then combined through element-wise addition:
The fused representation is subsequently used as the input to the CRF inference layer.
In this work, we use additive fusion to combine the original BiLSTM hidden representation and the label-aware attention representation. This design is intentionally lightweight and consistent with the role of the label-attention module in our framework. The objective of the module is not to replace or heavily transform the contextual representation learned by the encoder, but rather to inject label-aware information that can guide subsequent sequence prediction.
Since the CCQG dataset is relatively limited in size, more complex fusion strategies, such as concatenation followed by a multi-layer perceptron (MLP) or gated fusion mechanisms, would introduce additional trainable parameters and potentially increase the risk of overfitting. In contrast, additive fusion preserves the original contextual representation while incorporating label-aware information with minimal architectural complexity and computational overhead.
Therefore, the label-attention module is designed as an auxiliary enhancement to the encoder output before CRF decoding rather than as a separate feature transformation component. A systematic comparison of alternative fusion strategies is an interesting direction for future work and will be investigated in subsequent studies.
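To make the additive fusion concrete, the following sketch computes a label-aware representation for each token. Several details are assumptions of this illustration rather than statements about the model: attention scores are dot products between a hidden state and each label embedding, they are normalized with a softmax over the labels, the label embeddings share the hidden dimension, and the projection has no bias term. All numeric values are made up.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix: Matrix, vec: Vector) -> Vector:
    """Multiply a (rows x d) matrix by a d-dimensional vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def label_attention_fusion(hidden: Matrix, label_emb: Matrix,
                           proj: Matrix) -> Matrix:
    """Fuse encoder states with label attention by element-wise addition.

    hidden    : T x d      encoder hidden states (one row per token)
    label_emb : |L| x d    label embedding table (assumed same d)
    proj      : |L| x d    linear map from hidden space to label space
    returns   : T x |L|    fused scores, one row per token
    """
    fused = []
    for h in hidden:
        attention = softmax(matvec(label_emb, h))  # lives in label space
        projected = matvec(proj, h)                # projected into label space
        fused.append([p + a for p, a in zip(projected, attention)])
    return fused


gp_labels = ["B", "M", "E", "O", "S"]
hidden = [[0.2, -0.1], [0.5, 0.3]]                 # T = 2 tokens, d = 2
label_emb = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.1], [0.05, -0.05], [0.0, 0.0]]
proj = [[0.3, 0.1], [0.0, 0.2], [0.1, 0.1], [0.2, -0.1], [-0.1, 0.0]]
fused = label_attention_fusion(hidden, label_emb, proj)
for row in fused:
    print([round(v, 3) for v in row])
```

Because the fusion is a plain sum, the only parameters beyond the label embedding table are those of the projection, which is the lightweight property argued for above.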
6.2. Design of POS-Tagging Auxiliary Task
As shown in Figure 1, POS information is a useful reference for the selection of GPs. One direct method is to feed the POS information into the model as an input feature. Another is to make the encoder’s output representation capable of supporting a decoder in identifying POS categories and boundaries. This paper adopts the second approach: an auxiliary task forces the shared encoder to account for POS tagging. This method requires no change to the encoder structure and allows text sequences to be encoded directly by a pretrained language model such as BERT.
We use the CTB POS-tagging standard, which defines 36 original POS categories [
60]. To reduce noise in the auxiliary POS-tagging task, we did not directly use all CTB POS categories. Instead, we constructed a refined POS subset according to the relevance between POS boundaries and annotated GP boundaries in the training data. The selection criterion was whether a POS category frequently appeared inside or at the boundary of annotated GPs. The POS subset was determined exclusively using the training portion of the dataset. No validation-set or test-set annotations were used during the subset construction process. Therefore, the subset selection procedure did not introduce information leakage from the evaluation data. POS categories that rarely overlapped with GP boundaries and mainly functioned as grammatical or discourse markers were excluded.
The excluded categories include BA, CS, DER, ETC, IJ, NOI, ON, and SP. These tags usually correspond to construction markers, conjunctions, particles, interjections, noise tokens, onomatopoeia, or sentence-final particles. Although they are important for syntactic analysis, they seldom form knowledge-bearing answer phrases in educational cloze questions. Therefore, excluding them from the auxiliary task helps the shared encoder focus more on POS categories that are informative for GP boundary recognition. The excluded POS categories are listed in
Table 1.
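The subset construction can be sketched as follows. The helper names, the toy samples, and the choice to map excluded categories to a placeholder label "O" are assumptions made for illustration; the text does not specify how tokens carrying excluded tags are represented in the auxiliary label sequence, nor a numeric threshold for exclusion.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Excluded CTB categories as listed in the text.
EXCLUDED_POS = frozenset({"BA", "CS", "DER", "ETC", "IJ", "NOI", "ON", "SP"})


def gp_overlap_ratio(
    samples: Iterable[Tuple[List[str], List[str]]]
) -> Dict[str, float]:
    """For each POS tag, return the fraction of its tokens that lie in a GP.

    Each sample is a (pos_tags, gp_tags) pair aligned token by token, where
    gp_tags uses the BMEOS scheme and "O" marks tokens outside any GP.
    """
    total, inside = Counter(), Counter()
    for pos_tags, gp_tags in samples:
        for pos, gp in zip(pos_tags, gp_tags):
            total[pos] += 1
            if gp != "O":
                inside[pos] += 1
    return {pos: inside[pos] / total[pos] for pos in total}


def refine_pos_tags(pos_tags: List[str], outside: str = "O") -> List[str]:
    """Replace excluded POS categories with a placeholder label (assumed)."""
    return [outside if tag in EXCLUDED_POS else tag for tag in pos_tags]


# Made-up training samples, used only to exercise the functions.
train_samples = [
    (["VV", "NN", "SP"], ["B", "E", "O"]),
    (["NN", "SP"], ["S", "O"]),
]
ratios = gp_overlap_ratio(train_samples)
print(ratios)
print(refine_pos_tags(["VV", "NN", "SP"]))
```

Computing the ratios on the training split only, as in this sketch, matches the statement that no validation or test annotations were used when selecting the subset.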
6.3. Optimization Objectives
The objective function of our proposed model consists of two parts, namely, the loss function of the GP sequence labeling in the primary task and the loss function of the POS tagging in the auxiliary task.
Both the GP labeling task and the POS-tagging task are optimized using the negative log-likelihood (NLL) of the correct label sequence produced by the corresponding CRF layer, as shown in Equations (
10) and (
11):
where
and
denote the conditional probabilities of the complete GP label sequence and POS tag sequence computed by the corresponding CRF layers, respectively. Here,
and
are the learning parameters of the primary task and the auxiliary task, respectively. The final objective is a weighted sum of the two component losses, as shown in Equation (
12):
where
is the weight factor used to adjust the balance between the primary task and the auxiliary task.
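A small numeric sketch of the joint objective follows. It assumes one common form of weighted sum, in which the weight scales only the auxiliary loss; a convex combination of the two losses would be an equally plausible reading of the description. The default value 0.2 is the balance factor reported in the experimental settings, and the probabilities are invented.

```python
import math


def sequence_nll(prob_of_gold_sequence: float) -> float:
    """Negative log-likelihood of the gold label sequence under a CRF."""
    return -math.log(prob_of_gold_sequence)


def joint_loss(gp_nll: float, pos_nll: float, aux_weight: float = 0.2) -> float:
    """Primary GP loss plus a weighted auxiliary POS loss (assumed form)."""
    return gp_nll + aux_weight * pos_nll


# Toy probabilities assigned by the two CRF layers to their gold sequences.
gp_nll = sequence_nll(0.30)
pos_nll = sequence_nll(0.10)
print(round(joint_loss(gp_nll, pos_nll), 4))
```

With a small auxiliary weight, the POS loss mainly shapes the shared encoder while gradients from the GP loss dominate.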
7. Experiment
7.1. Datasets
The dataset (
https://github.com/tianlin668/CSFQGD, accessed on 17 June 2026) used in this study was released by [
12] and consists of Chinese fill-in-the-blank questions collected from educational examination resources. Unlike cloze-style reading comprehension datasets, the blanks in this dataset are designed to assess specific knowledge points rather than to recover contextually repeated words. Therefore, the annotated gap phrases usually correspond to knowledge-bearing units such as concepts, definitions, terms, numerical expressions, or procedural components. The detailed statistics of the dataset are shown in
Table 2.
The dataset covers five academic disciplines: engineering, training, computing, medicine, and economics. The original train/validation/test split is retained in this study to ensure comparability with previous work. Each sample consists of a sentence and one or more annotated gap phrases. The annotation of GPs is knowledge-oriented: annotators select the phrase or phrases that are suitable to be omitted for constructing an educational cloze question.
It should be noted that the distribution of GPs is naturally imbalanced. Many GPs are short noun phrases or terminology units, whereas long GPs and structurally complex phrases occur less frequently. This imbalance increases the difficulty of boundary detection, especially when multiple educationally meaningful phrases appear in the same sentence.
7.2. Experimental Settings
We employ the bert-base-chinese (https://huggingface.co/bert-base-chinese, accessed on 17 June 2026) model (12 layers, 768 hidden units, 12 heads) as the pretrained language model. The dimension of the other hidden layers of the network is also 768, the maximum sequence length is set to 256, the dropout rate is 0.1, and the batch size in the training phase is 16. The label vectors in label attention are initialized from a uniform distribution over [−0.1, 0.1] and adjusted during training. The balance factor between the primary and the auxiliary tasks is set to 0.2. Additionally, we adopt the AdamW optimizer [
61] with an initial learning rate of
for objective optimization, and the maximum number of training epochs was set to 50. The experimental results use Precision, Recall and F1 score as evaluation indicators. The original dataset study reported a human prediction F1 of 87.64%, indicating the inherent subjectivity of GP annotation.
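The text does not state whether Precision, Recall and F1 are computed per token or per gap phrase. The sketch below assumes exact-match scoring over whole GP spans, a common convention for BMEOS-style sequence labeling; the function names and the toy tag sequences are illustrative only.

```python
from typing import List, Set, Tuple


def bmeos_to_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Extract (start, end) spans, inclusive, from a BMEOS tag sequence."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.add((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E":
            if start is not None:
                spans.add((start, i))
            start = None
        elif tag == "O":
            start = None
        # "M" leaves an open span unchanged
    return spans


def span_prf(gold: List[List[str]],
             pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over exact-match spans."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        g_spans, p_spans = bmeos_to_spans(g), bmeos_to_spans(p)
        tp += len(g_spans & p_spans)
        n_gold += len(g_spans)
        n_pred += len(p_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [["O", "B", "M", "E", "O", "S"]]
pred = [["O", "B", "M", "E", "O", "O"]]
print(span_prf(gold, pred))
```

Here the prediction recovers one of the two gold GPs exactly, so precision is 1.0 and recall is 0.5.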
7.3. Reproducibility Details
To facilitate reproducibility, we provide additional implementation details. All experiments were implemented using PyTorch 1.10.0 and CUDA 11.1. The pretrained encoder was bert-base-chinese. Input sentences were tokenized using the original BERT WordPiece tokenizer with a maximum sequence length of 256. No additional text normalization or lowercasing was applied.
The auxiliary task uses POS labels annotated according to the Chinese Treebank (CTB) POS tag standard. No external POS tagger was employed during training or inference.
To reduce the influence of random initialization, all experiments were repeated five times using different random seeds. The reported results correspond to the mean and standard deviation over these five independent runs. Training was conducted for 50 epochs. No additional early-stopping criterion was used. The checkpoint achieving the highest validation-set F1 score was selected as the final model for evaluation.
Experiments were conducted on a server equipped with NVIDIA RTX 3090 GPUs, Intel Xeon Silver 4214R CPUs, and 256 GB RAM (Santa Clara, CA, USA). The software environment includes Python 3.7, PyTorch 1.10.0, CUDA 11.1, NumPy 1.21.2, tqdm 4.64.0, tensorboardX 2.5.1, and pytorch-transformers 1.2.0. The complete source code, preprocessing scripts, configuration files, and dependency specifications are publicly available in the GitHub repository accompanying this work.
7.4. Baseline Models
To validate the performance of our proposed model, we compare it with three baseline models:
BiLSTM-CRF: This model uses Word2vec to generate word vectors, and encodes text semantic information and dependencies through a bidirectional LSTM layer. The decoder part uses a standard CRF to learn the transition probabilities between different tags according to the training data to better predict the GP tags.
Lattice LSTM: This model is a representative approach that combines word-level and character-level information during training. It integrates characters and words through a lattice structure and has achieved good performance in Chinese named entity recognition [
62]. It also has good results for other Chinese sequence annotation problems [
63,
64].
BERT-BiLSTM-CRF: This model uses the BERT model to obtain the vector representation of words, and the sentence encoding and decoding parts still adopt BiLSTM and CRF models. With the powerful representation ability of BERT, it has achieved state-of-the-art performance on the problem of text sequence labeling [
12,
65]. BERT-BiLSTM-CRF is the strongest baseline reported in the original dataset publication and therefore serves as the primary reference model in this study.
7.5. Experimental Results
The comparison results are shown in
Table 3. Our approach achieves the best performance among the four models, with an F1 score of 65.85%. Both BERT-BiLSTM-CRF and our approach outperform the two methods based on traditional word vector encoding by over 10%. Because BERT adopts a bidirectional transformer structure, the vectors it produces as the embedding layer already contain rich information such as word features and contextual semantics. By contrast, the word vectors used by BiLSTM-CRF and Lattice LSTM carry fewer contextual semantic features, so the semantic information of the entire sentence is incomplete, which hinders the learning of sentence patterns. In addition, the transformer attention mechanism weakens information such as position and directional distance within a sequence, although position and directional information are crucial in sequence annotation tasks [
66]. By adding a BiLSTM layer on top of BERT, the two BERT-based models can learn the position information of the input sequences, which compensates for this deficiency in the BERT encoding and helps them learn pattern features such as the location of constituents and the sentence structure.
To reduce the influence of random initialization and training instability, we performed multiple independent runs only for the strongest baseline (BERT-BiLSTM-CRF) and our proposed model, as the other two baselines (BiLSTM-CRF and Lattice LSTM) exhibit substantially lower performance and are not competitive with our approach. Specifically, both BERT-BiLSTM-CRF and our model were trained and evaluated over five independent runs using different random seeds. For these two models, we report the mean and standard deviation of Precision, Recall, and F1. To examine whether the improvement of our model over the BERT-BiLSTM-CRF baseline is statistically reliable, we conducted a paired
t-test on the scores obtained from the repeated runs. The significance level was set to 0.05. The symbols in
Table 3 indicate the corresponding
p-value ranges (**:
, *:
, ∼:
). For the non-competitive baselines, we directly cite the single-run results from their original papers, as indicated by the superscript ♮ in
Table 3.
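The paired t-test over five seed-matched runs can be sketched as below. The statistic is computed from the per-seed score differences and compared with the two-sided critical value of Student's t distribution for 4 degrees of freedom at the 0.05 level (2.776). The function names and the example scores are hypothetical and are not results from this study.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence

# Two-sided critical value of Student's t for df = 4 at alpha = 0.05.
T_CRIT_DF4_ALPHA_05 = 2.776


def paired_t_statistic(ours: Sequence[float], baseline: Sequence[float]) -> float:
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


def significant_over_five_runs(ours: Sequence[float], baseline: Sequence[float]) -> bool:
    """Two-sided test at the 0.05 level; valid only for exactly five pairs."""
    if len(ours) != 5:
        raise ValueError("the critical value used here assumes five paired runs")
    return abs(paired_t_statistic(ours, baseline)) > T_CRIT_DF4_ALPHA_05


if __name__ == "__main__":
    # Hypothetical per-seed F1 scores, for illustration only.
    ours = [0.660, 0.655, 0.662, 0.658, 0.657]
    base = [0.650, 0.649, 0.651, 0.650, 0.646]
    print(paired_t_statistic(ours, base), significant_over_five_runs(ours, base))
```

Pairing by seed matters here: the test examines the per-run differences, so variation shared by both models within a run does not inflate the variance estimate.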
When interpreting the experimental results, the subjectivity of GP annotation should also be considered. Unlike conventional sequence labeling tasks such as named entity recognition, GP selection in educational cloze question generation depends on pedagogical objectives and may admit multiple valid answers. In many sentences, more than one phrase can reasonably represent the target knowledge point, whereas the dataset annotation usually records only the phrase or phrases selected by the original annotators. Consequently, a model prediction that differs from the gold annotation may still correspond to a valid educationally meaningful gap phrase.
This annotation uncertainty imposes an inherent upper bound on automatic performance. The original dataset study reported a human prediction F1 score of 87.64%, indicating that even human annotators do not achieve perfect agreement on GP selection. Therefore, the performance of automatic systems should be interpreted in the context of this annotation subjectivity and the existence of multiple plausible answers.
Although sequence labeling metrics such as Precision, Recall, and F1 are widely adopted in GP identification research and are consistent with the evaluation protocol of the original dataset, they do not fully capture the pedagogical quality of generated cloze questions, because several gap phrases within the same sentence may be equally suitable for evaluating a target knowledge point. Future work may therefore incorporate additional phrase-level evaluation metrics and expert-based human evaluation protocols to better assess the pedagogical usefulness of automatically selected gap phrases.
Compared to the BERT-BiLSTM-CRF baseline, our method improves Precision by 0.2%, Recall by 1.5%, and F1 by 0.9%. A paired t-test across the five runs indicates that the Precision gain is not statistically significant, whereas the Recall gain is. The overall effectiveness of our approach is therefore driven primarily by the Recall improvement, which in turn yields a statistically significant gain in F1. Both models use BERT for input feature representation and BiLSTM for sentence encoding. However, our model adds a label-attention module that models the interrelationship between label representations and sentence sequence features. Moreover, by training on an auxiliary task, our approach exhibits enhanced GP boundary recognition. Compared with the alternative methods, our approach performs better in contextual semantic feature extraction, sentence structure learning, tag relationship learning, and tag boundary recognition, and as a result yields more accurate GP extraction.
The purpose of the baseline comparison is to evaluate whether the proposed POS auxiliary learning and label-attention modules can improve a strong BERT-BiLSTM-CRF sequence labeling framework under the same dataset and evaluation protocol. Therefore, this study mainly compares with previously reported models and the re-implemented BERT-BiLSTM-CRF baseline. We acknowledge that replacing the encoder with more recent pretrained Chinese models, such as MacBERT, RoBERTa-wwm, or DeBERTa, may further improve performance. Since the focus of this work is the effectiveness of the proposed task-specific modules rather than the comparison of different pretrained encoders, we leave a systematic evaluation of stronger pretrained baselines to future work.
To avoid ambiguity, we report the final comparison results and ablation results separately. The final performance comparison in
Table 3 is used to evaluate the proposed model against baseline methods under the test setting. The ablation study in
Table 4 is used to analyze the relative contribution of different components. Therefore, the F1 value of the full model in
Table 4 is not intended to replace the final test-set result reported in
Table 3.
Although only a BERT encoder is evaluated here, the proposed POS-guided auxiliary learning and label-attention modules are encoder-agnostic and can be incorporated into alternative pretrained language models. The effectiveness demonstrated on the BERT-BiLSTM-CRF framework therefore reflects the contribution of the proposed modules rather than a specific encoder choice.
7.6. Ablation Study
To verify the contribution of each mechanism in our proposed method, we conducted two sets of ablation experiments. (1) Contribution analysis of label attention and the auxiliary task: we removed the POS-tagging auxiliary task (denoted as POS) and the label-attention module (denoted as LA) from our model, one at a time. For ease of comparison, we also report the results with both modules removed (denoted w/o POS & LA, equivalent to BERT-BiLSTM-CRF). (2) Contribution analysis of POS-tagging subsets: we denote the complete CTB POS-tagging task as POS1 and our simplified POS-tagging task as POS2, and observe the performance improvement each brings when combined with BERT-BiLSTM-CRF. We further examine how performance changes when each is combined with the label attention LA on top of BERT-BiLSTM-CRF.
The results of the first group are given in
Table 4. Compared with the full model, removing the auxiliary task POS lowers Precision by 1.51 percentage points and Recall by 0.9 percentage points, resulting in a 1.22 percentage point drop in F1. Removing the LA module also affects performance, with F1 dropping by 0.41 percentage points. When both modules are removed and the model degrades to BERT-BiLSTM-CRF, Recall drops by 4.33 percentage points, and F1 drops further as a result. These observations show that removing the auxiliary task POS reduces F1 more than removing LA; in other words, the auxiliary task contributes the larger performance improvement. The auxiliary task POS affects Precision more than label attention LA does, while LA improves Precision and Recall in a balanced manner. We infer that this is because the auxiliary task improves the accuracy of boundary recognition. The combination of the two mainly improves Recall and, thus, F1. It can also be observed that using only the label-attention module (w/o POS) lowers the Precision of the model while significantly increasing Recall compared to BERT-BiLSTM-CRF; that is, this module enhances the global recognition ability of the model.
The improvement in Precision brought by LA is smaller than that obtained by the POS auxiliary task alone, suggesting that the interaction between label-aware features and the shared features learned in the multi-task setting still has room for improvement. Better exploiting the synergy between LA and these shared features is a potential direction for further enhancement.
The second group of ablation results is presented graphically, as shown in
Figure 3a–c. The auxiliary task POS2 achieves a significantly higher Precision score than the full CTB POS tag set (POS1), which we attribute to the accurate identification of important POS boundaries.
7.7. Influence of the Multi-Task Weight Factor
In our model, a weight factor balances the losses of the primary and auxiliary tasks, and its value can affect labeling performance. When the weight is large, the auxiliary task accounts for a higher proportion of the loss; when it is small, the primary task dominates.
Figure 4 shows the F1 scores obtained for different weight values. Candidate values were chosen at 0.1 intervals between 0 and 1, giving a total of nine candidate weight hyperparameter values for comparison. The curves show that the best F1 score is achieved when the weight is 0.2, and the score declines as the weight moves away from this value in either direction. Therefore, 0.2 is chosen as the final weight in our model.
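The role of the weight factor can be illustrated with a small sketch. The convex combination used below, (1 − w) · main + w · auxiliary, is an assumption chosen because it matches the description that a larger weight gives the auxiliary task a larger share of the loss; the exact form of the joint objective is the one defined with the model, and all names here are hypothetical.

```python
from typing import Dict, List


def combined_loss(main_loss: float, aux_loss: float, weight: float = 0.2) -> float:
    """Assumed joint objective: (1 - weight) * main + weight * auxiliary."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * main_loss + weight * aux_loss


def candidate_weights() -> List[float]:
    """Nine candidates at 0.1 intervals: 0.1, 0.2, ..., 0.9."""
    return [round(0.1 * k, 1) for k in range(1, 10)]


def select_weight(f1_by_weight: Dict[float, float]) -> float:
    """Pick the candidate weight with the highest F1 score."""
    return max(f1_by_weight, key=f1_by_weight.get)
```

In practice, `select_weight` would be applied to the F1 scores measured for each value returned by `candidate_weights`, one training run per candidate.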
7.8. Result Analysis
To facilitate the interpretation of the error statistics,
Table 5 summarizes the definitions and typical causes of the four error categories considered in this study. The corresponding error distribution is reported in
Table 6. Representative examples illustrating different error patterns are presented in
Table 7.
Although GP recognition outcomes cannot be categorized exhaustively, the errors can be grouped into four cases: (1) too few GPs recognized, (2) inaccurate segmentation, (3) unrecognized GPs, and (4) too many GPs recognized. In the case study, we first categorize the samples misidentified by the BERT-BiLSTM-CRF model according to the cause of the error, then test whether our proposed method predicts these samples correctly, and, if it does, compare and analyze the reasons. The five sentences listed in
Table 7 are typical examples that reflect the differences between the two models.
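The four error cases can be made concrete with a small sketch that assigns one category to each sentence by comparing predicted and gold GP spans. The decision rules and their priority order below are assumptions made for illustration; the definitions actually used in this study are those summarized in Table 5. Spans are (start, end) token offsets with an exclusive end, and the function names are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def categorize(gold: Set[Span], pred: Set[Span]) -> str:
    """Assign one sentence to 'correct' or to one of the four error cases."""
    if pred == gold:
        return "correct"
    if not pred:
        return "case3_missing_gp"  # nothing was recognized
    missed = gold - pred  # gold GPs without an exact match
    extra = pred - gold   # predictions without an exact match
    if any(overlaps(p, g) for p in extra for g in missed):
        return "case2_boundary_mismatch"  # right region, wrong boundaries
    if extra:
        return "case4_over_prediction"  # additional phrases were predicted
    return "case1_incomplete"  # a proper subset of the gold GPs was found
```

A sentence that exhibits several problems at once receives only the first matching label under this ordering, so the resulting counts depend on the chosen priority.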
Our method correctly predicts the GPs in Example 1, Example 2 and Example 3. BERT-BiLSTM-CRF did not predict both GPs in Example 1, did not correctly identify the GP boundaries in Example 2, and identified no GP in Example 3. The correct predictions of our approach for these three examples can be attributed to the label-attention (LA) module and the auxiliary POS task. For example, the label attention LA can comprehensively capture the relationship between “价值尺度 (value scale)”, “流通手段 (circulation means)” and the whole sentence, especially “货币的原始职能 (the original function of money)”, in Example 1. The shared features learned jointly with the POS auxiliary task better represent the possibility that “1986年 (the year 1986)” forms a single chunk in Chinese in Example 2. Example 4 is an interesting case: our method correctly identifies the boundary of the second GP, whereas BERT-BiLSTM-CRF identifies only part of that GP, “启发 (heuristics)”. However, neither method identifies the other GP, “复习谈话 (review talk)”. Situations like Example 5 are common in the dataset: a sentence actually contains n usable GPs, but the annotator selects only m of them for generating questions.
As shown in
Table 6, Case 1 (incomplete GP recognition) and Case 4 (over-prediction) account for the vast majority of errors, representing 34.96% and 52.65% of all error cases, respectively. In contrast, Case 2 (boundary mismatch) and Case 3 (missing GP) occur much less frequently.
The dominance of Case 1 and Case 4 reflects the intrinsic characteristics of educational cloze question generation. In many educational sentences, multiple knowledge-bearing phrases coexist, and the model may successfully identify some but not all annotated GPs, resulting in incomplete GP recognition. This phenomenon is particularly common in sentences containing coordinated concepts or parallel noun phrases.
For Case 4, the model often predicts additional phrases that are educationally meaningful but are not included in the gold annotations. This observation is closely related to the subjectivity of GP annotation. Since multiple phrases may reasonably serve as candidate blanks for assessing the same knowledge point, some model predictions counted as false positives may still correspond to valid educationally meaningful gap phrases. Example 5 in
Table 7 provides a representative illustration of this phenomenon.
In this work, we hypothesize that POS information contributes to the main task of GP recognition and therefore use POS tagging as an auxiliary task. To further support this argument, this section summarizes the correlation between GPs and POS tags.
Figure 5 illustrates the distribution of POS tag patterns over GPs. In addition to noun phrases (NN+), which account for 45% of all GPs, the combinations JJ+NN (other noun-modifier with common noun), NR (proper noun), CD (cardinal number), and VV+NN (other verb with common noun) are also the main patterns that constitute GPs. We also observe that some less frequent POS tags, such as NT (temporal noun), can help the model better determine GP boundaries, as in Example 2 in
Table 7.
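A distribution of this kind can be derived from POS-tagged GPs with a few lines of code. The sketch below is illustrative only: it collapses runs of the same tag, so that both NN and NN NN map to the key "NN" (corresponding to the NN+ pattern above), and joins different tags with "+". The function names and this normalization rule are assumptions rather than the procedure used to produce Figure 5.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List, Sequence


def pos_pattern(tags: Sequence[str]) -> str:
    """Collapse consecutive repeats and join the remaining tags with '+'."""
    return "+".join(tag for tag, _ in groupby(tags))


def pattern_distribution(gp_tag_sequences: List[Sequence[str]]) -> Dict[str, float]:
    """Share of GPs that fall under each collapsed POS pattern."""
    counts = Counter(pos_pattern(tags) for tags in gp_tag_sequences)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Hypothetical CTB tag sequences of four GPs, for illustration only.
    gps = [["NN"], ["NN", "NN"], ["JJ", "NN"], ["NT"]]
    print(pattern_distribution(gps))
```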
It should be noted that, because manual GP selection is subjective, the predictions of existing models do not fully agree with the manual annotation. This is also why the human prediction performance reported by the data provider reaches an F1 value of only 87.64% [
12]. The above analysis shows that, setting aside the subjectivity of annotation, our model performs better in complete GP discovery and boundary recognition. However, inadequate understanding of whole sentences when the text is sparse can still lead to errors such as incomplete GP recognition, which poses a challenge and suggests a direction for model improvement.
8. Theoretical and Practical Implications
In this work, we introduce a label-attention network along with an auxiliary part-of-speech tagging task for generating Chinese cloze questions in educational assessments. This contribution extends research on cloze question generation and provides a technical foundation for practical educational examination applications.
From a theoretical perspective, our work provides a new solution for CCQG research. Our goal is to alleviate the data scarcity problem of the CCQG task while reusing the pretrained language model framework. Specifically, we propose a label-attention model that gives the encoder a label preference-aware feature representation of word sequences. Our multi-task learning framework, with POS tagging as an auxiliary task, further improves the representation capability of the encoding module. Both ideas offer researchers in CCQG and other NLP domains a new perspective on reusing pretrained language models.
From a practical perspective, our work can support educational examination question generation. As AI technologies become widespread in education, they are increasingly used to assist in generating examination questions. Our model can generate cloze-type examination questions that assess a knowledge point based on a small amount of annotated data. For teachers, the model can help generate test questions from texts such as electronic textbooks and syllabuses, saving working time. For educational institutions, the CCQG model can help quickly build a large pool of test questions, thus saving considerable human resources.
9. Limitations
Although the proposed method achieves consistent improvements over existing approaches, several limitations should be acknowledged.
First, the experiments were conducted on a single publicly available Chinese cloze question generation dataset. Although this dataset covers multiple academic disciplines, its scale remains relatively limited compared with datasets commonly used in large-scale natural language processing research. Therefore, the generalization ability of the proposed framework on other educational datasets requires further investigation.
Second, the annotation of gap phrases is inherently subjective. Unlike conventional sequence labeling tasks, educational cloze question generation may admit multiple valid gap phrases within the same sentence. However, the current dataset provides only the phrase or phrases selected by the original annotators as reference annotations. Consequently, some predictions counted as errors may still correspond to educationally meaningful gap phrases. This annotation uncertainty introduces an inherent limitation to automatic evaluation and may underestimate the actual quality of model predictions. Future work could improve evaluation reliability through multi-reference annotation, expert validation, or human-centered assessment protocols.
Third, this study focuses on validating the effectiveness of the proposed POS-guided auxiliary learning strategy and label-attention mechanism using a BERT-based encoder. More recent pretrained language models, such as MacBERT, RoBERTa-wwm, and DeBERTa, were not systematically evaluated in this work.
Finally, the label-attention module adopts a lightweight additive fusion strategy. Although this design reduces model complexity and mitigates overfitting risks under limited training data, alternative fusion mechanisms may further improve performance and deserve future investigation.
10. Conclusions
A multi-task learning model is proposed in this paper for generating Chinese cloze questions for educational examinations. The model utilizes a label-attention component to improve global perception of label recognition and an auxiliary POS-tagging task to enhance recognition accuracy. Compared to other deep learning-based sequence labeling models, our approach efficiently determines label boundaries and achieves more comprehensive identification of GPs. However, there are areas for improvement in the current work. Firstly, in scenarios with sparse text representation, the model’s semantic understanding may be limited in capturing knowledge information from examination questions. Secondly, the shared feature layer encoding is influenced by incorrect decisions from the auxiliary task due to the multi-task learning framework. Lastly, the sequence labeling model can be further enhanced by applying task-specific constraints, such as requiring at least one GP per sentence. Overall, our model’s effectiveness is validated through experiments, but it is important to acknowledge these areas for potential enhancement.
Extensive experiments on the benchmark dataset demonstrate that our approach achieves competitive performance, particularly in terms of Recall and overall F1 score. Beyond the current framework, future work will focus on adapting our approach to cross-domain scenarios and reducing the model’s inference latency, rather than on random-seed validation.