1. Introduction
Assessment instruments are among the most influential yet least interrogated artifacts in formal education [1]. Examination questions implicitly encode curricular intent, cognitive demand, and evaluative priorities, while marking schemes make these assumptions explicit through structured rubrics, scoring logic, and model responses [2]. Despite their intrinsic coupling, the relationship between examination questions and marking schemes has traditionally been treated as unidirectional: questions are authored first, and marking schemes are subsequently derived as interpretive guides [3]. This linear workflow obscures a critical opportunity for computational insight, namely the ability to infer, reconstruct, or validate examination questions directly from their associated marking schemes [4].
Recent advances in artificial intelligence, particularly in natural language processing (NLP), have transformed how unstructured educational texts can be analyzed, represented, and generated [5]. Large language models, semantic parsers, and discourse-aware transformers now demonstrate the capacity to capture latent instructional intent, hierarchical reasoning patterns, and domain-specific evaluative cues embedded within textual data [6]. However, the application of these techniques in educational assessment has largely focused on forward-generation tasks, such as automated question generation from syllabi or essay scoring from student responses. The inverse problem of reverse-engineering examination questions from marking schemes remains substantially underexplored, despite its relevance to assessment quality assurance, curriculum alignment, and academic integrity [7].
Despite rapid progress in AI-driven educational assessment, existing research remains predominantly forward-oriented, treating marking schemes as secondary scoring tools rather than as structured semantic representations of assessment intent. Current systems generate questions from content or grade responses against fixed rubrics, yet they do not computationally model rubric-derived constraints such as expected reasoning pathways, conceptual weighting, or evaluative granularity. Moreover, contemporary NLP frameworks lack mechanisms for explicitly incorporating pedagogical attributes such as cognitive complexity, scope of evaluation, and mark allocation during generation, resulting in outputs that may be linguistically coherent but pedagogically misaligned. Critically, no systematic methodology exists for reverse assessment validation through structured reconstruction of examination prompts from marking schemes. Consequently, the inverse problem of question reconstruction remains under-theorized and methodologically unsupported, representing a substantive gap at the intersection of NLP, assessment design, and learning analytics.
Marking schemes function as structured semantic blueprints of expected knowledge representations [8]. They encode acceptable reasoning paths, conceptual anchors, and weighted subcomponents, making them computationally decomposable into intent signals and evaluative operators [9,10]. Reverse question inference therefore constitutes a constraint-aware structured reasoning problem rather than a surface-level generation task [11].
This study addresses three research questions: (RQ1) the feasibility of reconstructing examination questions from marking schemes using semantic and pedagogical signals; (RQ2) methods for explicitly incorporating assessment attributes such as cognitive level, evaluation scope, and mark allocation during reconstruction; (RQ3) the potential of reverse question inference to support automated assessment validation and alignment analysis.
This paper introduces an AI-powered NLP framework designed to systematically reconstruct examination questions from marking schemes. The proposed approach treats the task as a multi-stage inference pipeline, combining semantic extraction, pedagogical intent modeling, and controlled natural language generation. Unlike generic text-to-text generation paradigms, the framework explicitly models assessment dimensions such as cognitive level, scope of evaluation, and mark-weighted emphasis, ensuring that reconstructed questions are not only linguistically coherent but also pedagogically valid. By grounding question reconstruction in the latent structure of marking schemes, the framework enables both faithful regeneration of original assessment prompts and the generation of equivalent alternatives for moderation and benchmarking.
The significance of this work extends beyond automation. Reverse-engineering examination questions enables novel forms of assessment analytics, including the detection of misalignment between learning outcomes and evaluation criteria, the identification of rubric ambiguity, and the systematic validation of assessment fairness across cohorts and institutions. Moreover, the framework has practical implications for scalable assessment design, especially in resource-constrained educational environments where expert moderation capacity is limited. By transforming marking schemes into generative knowledge artifacts, this research contributes to a broader vision of intelligent, transparent, and auditable educational assessment systems.
In doing so, this study advances the discourse on AI-assisted assessment by reframing marking schemes as primary computational objects rather than secondary documentation. The proposed NLP framework establishes a foundation for future work in assessment reverse engineering, neuro-symbolic evaluation modeling, and explainable educational AI, positioning reverse question inference as a critical capability in next-generation learning analytics infrastructures.
To the best of our knowledge, no prior framework has systematically modeled marking schemes as primary semantic inputs for structured reverse question inference while explicitly incorporating pedagogical attributes during generation.
3. Methodology
The proposed framework for reverse-engineering examination questions from marking schemes is designed to model the conditional probability of a question given a marking scheme. Unlike the conventional workflow, in which answers and marking schemes are derived from questions, this framework inverts the paradigm by reconstructing appropriate exam questions from examiner-provided solutions. To ensure reproducibility, the methodology provides explicit details regarding dataset preparation, model architecture, hyperparameters, reconstruction mechanisms, and pedagogical constraint integration.
3.1. Data Representation and Preprocessing
The dataset employed in this study consists of 7021 aligned records collected from the Department of Data Science, Sol Plaatje University, South Africa. Each record contains an examiner-provided marking scheme and its corresponding ground-truth examination question, forming a supervised parallel corpus suitable for sequence-to-sequence learning. For each instance, the marking scheme is denoted as $M_i$ and the associated question as $Q_i$, ensuring deterministic one-to-one alignment throughout the training and evaluation processes. In addition to the primary input–output pair, each record contains auxiliary metadata including a unique sample identifier, an annotated Bloom’s Taxonomy level, token length statistics, and a predefined dataset split label (training, validation, or test). This structured organization enables reproducibility, controlled experimentation, and consistent pedagogical conditioning during model training.
The dataset is stored in JavaScript Object Notation (JSON) format, where each record represents a single aligned pair. A representative sample instance is shown below for clarity:
{
  "sample_id": 1024,
  "marking_scheme": "A confusion matrix evaluates classification performance by comparing predicted labels with actual labels. It includes true positives, true negatives, false positives, and false negatives.",
  "question": "Explain how a confusion matrix is used to evaluate the performance of a classification model.",
  "bloom_level": "Analyze",
  "answer_length": 32,
  "question_length": 17,
  "split": "train"
}
In this representation, the marking_scheme field serves as the model input, while the question field represents the target output. The bloom_level attribute provides pedagogical supervision aligned with Bloom’s Taxonomy, enabling cognitive conditioning of the generative model. The answer_length and question_length fields are derived after tokenization and are used for statistical reporting and distributional analysis. The split attribute ensures deterministic partitioning of the dataset into 70% training, 15% validation, and 15% test subsets, thereby guaranteeing reproducibility of experimental results.
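The record layout described above can be sketched in Python as follows. This is a minimal illustration, assuming the corpus is stored as a JSON array of objects with exactly the fields in the sample instance; the `Record` class and the `load_records` and `split_proportions` helpers are hypothetical names, and the split labels other than "train" are assumed.

```python
import json
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One aligned (marking scheme, question) pair with its metadata."""
    sample_id: int
    marking_scheme: str   # model input
    question: str         # target output
    bloom_level: str      # annotated Bloom's Taxonomy level
    answer_length: int    # token count of the marking scheme
    question_length: int  # token count of the question
    split: str            # "train" in the sample; other label strings are assumed


def load_records(json_text: str) -> list[Record]:
    """Parse a JSON array of records into typed Record objects."""
    return [Record(**row) for row in json.loads(json_text)]


def split_proportions(records: list[Record]) -> dict[str, float]:
    """Return the fraction of records carrying each split label."""
    counts = Counter(r.split for r in records)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```

Calling `split_proportions` on the full corpus would be one way to confirm that the stored labels match the stated 70/15/15 partition.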
Prior to model training, all textual data underwent a standardized preprocessing pipeline to reduce noise and ensure representational consistency. The normalization process included lowercasing all characters and removing punctuation and extraneous symbols. Tokenization was performed using the SentencePiece tokenizer associated with the T5 model, preserving subword structure and ensuring compatibility with the transformer architecture. Formally, each marking scheme can be represented as a token sequence:
$$M_i = (w_1, w_2, \ldots, w_n), \quad w_j \in \mathcal{V},$$
where $\mathcal{V}$ denotes the SentencePiece vocabulary. The average marking scheme length is 34.7 tokens, while the average question length is 21.3 tokens, indicating that the dataset predominantly contains semantically rich solutions paired with concise yet cognitively dense examination questions.
To obtain semantically informative representations, each tokenized marking scheme was encoded using MPNet, a transformer encoder optimized for natural language understanding tasks. The encoding function projects the token sequence into a continuous embedding space:
$$m_i = f_{\mathrm{MPNet}}(M_i) \in \mathbb{R}^{d},$$
where $d$ denotes the embedding dimensionality (768, as noted in Algorithm 1). These embeddings serve as semantic conditioning inputs to the T5-based generative model described in Section 3.2. The exclusive focus on Data Science courses provides a homogeneous domain-specific benchmark, facilitating controlled investigation of the reverse question engineering task while limiting cross-disciplinary variability.
The complete preprocessing workflow applied to construct the final training corpus is summarized in Algorithm 1.
Algorithm 1: Dataset Preprocessing Pipeline
Input: Raw dataset D_raw containing (marking_scheme, question, bloom_level)
Output: Processed dataset D_processed with embeddings and metadata
Initialize empty dataset D_processed
For each record r in D_raw do
    Extract M_i ← r.marking_scheme
    Extract Q_i ← r.question
    Extract B_i ← r.bloom_level
    // Text normalization
    M_i ← lowercase(M_i)
    Q_i ← lowercase(Q_i)
    M_i ← remove_punctuation(M_i)
    Q_i ← remove_punctuation(Q_i)
    // Tokenization
    tokens_M ← SentencePiece_tokenize(M_i)
    tokens_Q ← SentencePiece_tokenize(Q_i)
    // Length computation
    answer_length ← length(tokens_M)
    question_length ← length(tokens_Q)
    // Semantic encoding
    m_i ← MPNet_encode(tokens_M) // 768-dimensional embedding
    Store processed record:
        {sample_id, tokens_M, tokens_Q, m_i, B_i, answer_length, question_length}
End For
Deterministically split D_processed into:
    70% training set
    15% validation set
    15% test set
Return D_processed
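The normalization, length computation, and deterministic split steps of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch under stated assumptions: whitespace splitting stands in for the subword tokenizer, the MPNet encoding step is omitted, the seed value 42 is hypothetical, and only ASCII punctuation is stripped.

```python
import random
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase the text, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def preprocess(records, tokenize=str.split):
    """Normalize both fields and attach token-length statistics.

    `tokenize` defaults to whitespace splitting, a stand-in for the
    subword tokenizer used in the study.
    """
    processed = []
    for r in records:
        tokens_m = tokenize(normalize(r["marking_scheme"]))
        tokens_q = tokenize(normalize(r["question"]))
        processed.append({
            "sample_id": r["sample_id"],
            "tokens_M": tokens_m,
            "tokens_Q": tokens_q,
            "bloom_level": r["bloom_level"],
            "answer_length": len(tokens_m),
            "question_length": len(tokens_q),
        })
    return processed


def deterministic_split(items, seed=42, ratios=(0.70, 0.15, 0.15)):
    """Shuffle with a fixed seed, then cut into train/validation/test.

    The test set receives whatever remains after the first two cuts.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

Because the shuffle uses its own seeded generator rather than the global random state, repeated calls with the same seed return identical partitions.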
The overall statistics of the dataset are presented in Table 1. The average answer length of 34.7 tokens indicates that the dataset predominantly contains extended marking schemes rather than short factual spans, while the average question length of 21.3 tokens reflects the compact yet semantically dense nature of exam-style questions. The exclusive focus on Data Science courses provides a homogeneous but domain-specific benchmark, enabling in-depth exploration of the reverse-engineering task.
In addition to deterministic random splits (70/15/15), a course-wise split evaluation was conducted to assess internal generalization across distinct Data Science courses.
3.2. Reverse Question Generation Model
To generate questions from marking schemes, this study employs a sequence-to-sequence transformer based on T5-small. The encoder takes the semantic embeddings of a marking scheme as input, while the decoder autoregressively generates the corresponding exam question. The primary training objective maximizes the conditional likelihood of the question given the marking scheme, which is equivalent to minimizing the negative log-likelihood:
$$\mathcal{L}_{\mathrm{gen}}(\theta) = -\sum_{t=1}^{T} \log P_\theta(q_t \mid q_{<t}, m_i),$$
where $q_t$ is the $t$-th token of the target question, $m_i$ is the marking scheme embedding, and $P_\theta$ represents the probability distribution parameterized by the transformer with parameters $\theta$. Training is performed with teacher forcing to accelerate convergence, and the AdamW optimizer is applied with a learning rate of $3 \times 10^{-5}$.
To reinforce semantic fidelity, the framework introduces a reconstruction mechanism. After a candidate question $\hat{Q}_i$ is generated, it is passed into a secondary T5 model $g_\phi$, which attempts to regenerate the original marking scheme as an embedding $\hat{m}_i$. A reconstruction loss ensures that these regenerated embeddings closely approximate the original marking scheme embeddings $m_i$:
$$\mathcal{L}_{\mathrm{rec}} = \lVert m_i - \hat{m}_i \rVert_2^2$$
The reconstruction loss is computed in continuous embedding space rather than on discrete token sequences. Specifically, the secondary T5 model first regenerates the marking scheme in textual form. The regenerated text is then encoded using the same MPNet encoder to obtain its semantic embedding $\hat{m}_i$. The L2 loss is subsequently calculated between the original embedding $m_i$ and the regenerated embedding $\hat{m}_i$.
This design ensures that reconstruction fidelity is measured at the semantic representation level rather than through categorical cross-entropy over tokens, thereby aligning the optimization objective with semantic preservation rather than surface-form exactness.
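The final objective combines the primary sequence generation loss and the reconstruction loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{gen}} + \lambda \, \mathcal{L}_{\mathrm{rec}},$$
where $\mathcal{L}_{\mathrm{gen}}$ is the generation loss, $\mathcal{L}_{\mathrm{rec}}$ is the reconstruction loss, and the hyperparameter $\lambda$ controls the contribution of the reconstruction term. In this study, $\lambda$ was empirically set to 0.3, providing a balance between question fluency and semantic preservation.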
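The combined objective, a generation loss plus a weighted reconstruction term, can be illustrated numerically with a small Python sketch. It assumes the generation loss is the negative log-likelihood of the reference tokens and that the L2 reconstruction loss is the squared Euclidean distance between embeddings; the function names are illustrative, and plain Python lists stand in for tensors.

```python
import math


def generation_loss(target_token_probs):
    """Negative log-likelihood of the reference tokens under teacher forcing.

    target_token_probs[t] is the probability the decoder assigns to the
    reference token at step t, given the reference prefix.
    """
    return -sum(math.log(p) for p in target_token_probs)


def reconstruction_loss(original, regenerated):
    """Squared L2 distance between two embedding vectors (assumed form)."""
    return sum((a - b) ** 2 for a, b in zip(original, regenerated))


def total_loss(target_token_probs, original, regenerated, lam=0.3):
    """Generation loss plus the lambda-weighted reconstruction loss."""
    return (generation_loss(target_token_probs)
            + lam * reconstruction_loss(original, regenerated))
```

With the weight set to 0.3, a unit of reconstruction error costs less than a third of a unit of generation loss, which is consistent with treating reconstruction as a regularizing term rather than the primary objective.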
To ensure reproducibility, the model was implemented using the HuggingFace Transformers library with the pre-trained T5-small backbone (60M parameters). Training was conducted for 10 epochs using the AdamW optimizer with a learning rate of $3 \times 10^{-5}$ and a batch size of 16. Early stopping was monitored based on validation loss to prevent overfitting. The reconstruction weight λ was empirically set to 0.3 after preliminary tuning on the validation set, balancing semantic fidelity and fluency.
All experiments were conducted using deterministic dataset splits (70% training, 15% validation, 15% test) defined prior to training to prevent data leakage. Random seeds were fixed for model initialization and data loading to ensure consistent reproducibility across runs.
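The reported training settings can be gathered into a single configuration object, sketched here in Python together with a simple early-stopping check. The hyperparameter values are those stated above; the seed value and the stopping patience are not reported in the text and are hypothetical placeholders, as are the names `TrainingConfig`, `should_stop`, and `seeded_order`.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    backbone: str = "t5-small"
    optimizer: str = "AdamW"
    learning_rate: float = 3e-5
    batch_size: int = 16
    max_epochs: int = 10
    reconstruction_weight: float = 0.3
    seed: int = 42      # hypothetical: the seed value is not reported
    patience: int = 2   # hypothetical: the stopping patience is not reported


def should_stop(val_losses, patience):
    """Return True once validation loss has not improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


def seeded_order(n_items, seed):
    """Deterministic ordering of example indices for data loading."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    return order
```

Keeping the seed inside the configuration means a run can be repeated from the configuration alone.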
The T5-small model was selected as the core sequence-to-sequence generator to ensure computational feasibility and reproducibility within the available infrastructure. While larger transformer models and instruction-tuned large language models (LLMs) such as LLaMA 3, Mistral 1.1, or GPT-4 may offer higher representational capacity and potentially improved performance, resource constraints limited the scope of the present experiments. Accordingly, the primary baseline comparisons focus on an unconstrained T5-small model and a rule-based template system, providing a meaningful reference for evaluating the contributions of semantic reconstruction and Bloom-level conditioning; comparisons against larger models are reported separately in Sections 4.1 and 4.4.
3.3. Bloom Annotation and Classifier Details
While semantic validity is necessary, it is insufficient to guarantee pedagogically useful questions. To embed instructional depth, the framework integrates Bloom’s Taxonomy as a guiding constraint. Each question in the dataset was manually annotated with one of six Bloom levels—Knowledge, Comprehension, Application, Analysis, Synthesis, or Evaluation. These levels serve as control tokens prepended to marking schemes during training, conditioning the model not only on the semantic content of the answer but also on the intended cognitive level.
Bloom-level labels were manually assigned by two subject-matter experts within the Department of Data Science using standard Bloom’s Taxonomy definitions. Disagreements were resolved through discussion to ensure consistent labeling. The final dataset therefore contains a single agreed-upon Bloom label per instance. During training, the Bloom level was prepended as a control token (e.g., <ANALYZE>) to the marking scheme input sequence, enabling explicit conditioning of the generative model on cognitive level.
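Prepending the Bloom level as a control token can be sketched in Python as follows. The helper names are hypothetical, and the token format simply follows the `<ANALYZE>` example given above; with a subword tokenizer, such tokens would typically also need to be added to the vocabulary so that they are not split into pieces.

```python
def bloom_control_token(bloom_level: str) -> str:
    """Map a Bloom label such as "Analyze" to a control token such as "<ANALYZE>"."""
    return f"<{bloom_level.strip().upper()}>"


def build_model_input(bloom_level: str, marking_scheme: str) -> str:
    """Prepend the Bloom control token to the marking scheme text."""
    return f"{bloom_control_token(bloom_level)} {marking_scheme.strip()}"
```

For the sample record in Section 3.1, the generator input would begin with `<ANALYZE>` followed by the marking scheme text.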
The modified conditional probability of generating a question is therefore expressed as
$$P_\theta(Q_i \mid m_i, b) = \prod_{t=1}^{T} P_\theta(q_t \mid q_{<t}, m_i, b), \qquad P_\theta(q_t \mid q_{<t}, m_i, b) = \operatorname{softmax}(W_o h_t)_{q_t},$$
where $b$ represents the Bloom-level embedding vector, $h_t$ is the decoder hidden state at time step $t$, and $W_o$ is the output projection matrix. This formulation allows the decoder to modulate question style and complexity in accordance with the targeted Bloom category. As such, the generated questions are not only semantically aligned with the marking scheme but also pedagogically coherent.
3.4. Workflow and System Architecture
The workflow of the framework comprises four tightly coupled stages: the examiner-provided marking scheme is first semantically encoded using MPNet; the resulting embedding is passed to the T5 generator augmented with Bloom-level constraints to produce candidate questions; the generated question is subsequently validated by a secondary T5 model that reconstructs the marking scheme; and finally, the reconstructed marking scheme is compared with the original, and Bloom-level alignment is verified to enforce pedagogical integrity.
This process can be understood as a closed-loop generative system in which both semantic reconstruction and pedagogical filtering ensure the validity of the generated questions. The workflow is illustrated in Figure 1, which depicts the data flow from marking scheme to final validated question.
The visual representation makes clear that the architecture does not simply produce a one-way mapping from answers to questions but instead incorporates recursive validation, ensuring both semantic integrity and educational coherence.
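The four-stage closed loop can be sketched in Python with each model supplied as a plain callable. This is an illustrative skeleton only: the callables stand in for MPNet, the T5 generator, the secondary T5 model, and the Bloom classifier, and the acceptance threshold `rf_threshold` is a hypothetical parameter that the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]


@dataclass(frozen=True)
class ValidationResult:
    question: str
    reconstruction_fidelity: float
    bloom_match: bool
    accepted: bool


def reverse_engineer(
    marking_scheme: str,
    bloom_level: str,
    encode: Callable[[str], Vector],
    generate: Callable[[Vector, str], str],
    reconstruct: Callable[[str], str],
    classify_bloom: Callable[[str], str],
    similarity: Callable[[Vector, Vector], float],
    rf_threshold: float = 0.8,  # hypothetical acceptance threshold
) -> ValidationResult:
    # Stage 1: encode the examiner-provided marking scheme.
    original = encode(marking_scheme)
    # Stage 2: generate a candidate question conditioned on the Bloom level.
    question = generate(original, bloom_level)
    # Stage 3: regenerate the marking scheme from the question and re-encode it.
    regenerated = encode(reconstruct(question))
    # Stage 4: compare embeddings and verify Bloom-level alignment.
    fidelity = similarity(original, regenerated)
    bloom_match = classify_bloom(question) == bloom_level
    accepted = fidelity >= rf_threshold and bloom_match
    return ValidationResult(question, fidelity, bloom_match, accepted)
```

Passing the models in as arguments keeps the validation logic independent of any particular library, so the same loop can be exercised with stub functions.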
3.5. Experimental Setup
The evaluation of the proposed AI-powered NLP framework was conducted using a dataset of 7021 examiner-provided solutions and their corresponding questions collected from the Department of Data Science, Sol Plaatje University, South Africa. Consistent with Section 3.1, the dataset was divided into training, validation, and test subsets in a 70:15:15 ratio to facilitate reliable model development and assessment. All textual data were preprocessed through normalization, including lowercasing and punctuation removal, followed by SentencePiece tokenization. Semantic representations of marking schemes were generated using MPNet embeddings, which served as input to the T5-small sequence-to-sequence generative model.
To benchmark the performance of the reverse question generation framework, two categories of comparative models were employed. The first comprised conventional sequence-to-sequence transformer models fine-tuned in the same manner as the proposed framework but without the reconstruction loss or Bloom-level conditioning. This comparison enabled quantification of the contributions of the semantic reconstruction and pedagogical constraint mechanisms. The second category consisted of rule-based question generation systems, which leveraged keyword extraction and syntactic templates to produce candidate questions from examiner solutions. These baselines provided a reference for evaluating improvements attributable to deep learning and constrained generative modeling, highlighting the added value of semantic and pedagogical integration.
A set of complementary metrics was selected to assess both semantic fidelity and pedagogical quality of the generated questions. Standard NLP evaluation measures (BLEU, ROUGE-L, and METEOR) quantified n-gram overlap, longest common subsequence, and semantic similarity between generated and ground-truth questions, providing a robust indication of linguistic and structural fidelity. To evaluate the preservation of semantic content from marking schemes, reconstruction fidelity (RF) was computed using the embeddings of the original and reconstructed marking schemes via the secondary T5 model, as follows:
$$\mathrm{RF} = \frac{\hat{m}_i \cdot m_i}{\lVert \hat{m}_i \rVert \, \lVert m_i \rVert},$$
where $\hat{m}_i$ represents the embedding of the reconstructed marking scheme and $m_i$ that of the original. Higher RF values indicate stronger retention of semantic information in the generated questions. Pedagogical alignment was assessed through Bloom-level classification accuracy, in which each generated question was passed through a pre-trained Bloom classifier to predict its cognitive level. Accuracy was then defined as the proportion of generated questions whose predicted cognitive level matched the manually annotated level, providing an empirical measure of whether the system successfully produced questions that are both semantically correct and cognitively appropriate.
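Both measures reduce to short computations, sketched here in Python. The sketch takes RF to be the cosine similarity between the two MPNet embeddings, as Section 4.1 describes, and uses plain lists in place of embedding tensors; the function names are illustrative.

```python
import math


def reconstruction_fidelity(original, reconstructed):
    """Cosine similarity between the original and reconstructed embeddings."""
    dot = sum(a * b for a, b in zip(original, reconstructed))
    norm = (math.sqrt(sum(a * a for a in original))
            * math.sqrt(sum(b * b for b in reconstructed)))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def bloom_accuracy(predicted_levels, annotated_levels):
    """Fraction of generated questions whose predicted level matches the annotation."""
    if len(predicted_levels) != len(annotated_levels) or not annotated_levels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == a for p, a in zip(predicted_levels, annotated_levels))
    return matches / len(annotated_levels)
```

Because cosine similarity ignores vector magnitude, RF reflects only the direction of the two embeddings, so a regenerated marking scheme of different length can still score highly if it carries the same content.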
All relevant hyperparameters used in the experiments (including the generator and encoder models, optimizer, learning rate, batch size, sequence lengths, number of epochs, and reconstruction weight) are summarized in Table 2. These configurations were determined through preliminary validation experiments designed to balance training efficiency with optimal model performance, ensuring that the study can be independently replicated while maintaining transparency and rigor.
4. Results and Evaluation
The performance of the proposed AI-powered NLP framework for reverse-engineering examination questions from marking schemes was evaluated using a dataset of 7021 aligned (marking scheme, question) pairs obtained from the Department of Data Science, Sol Plaatje University, South Africa. The evaluation focused on three key dimensions: semantic fidelity of the generated questions, reconstruction fidelity as a measure of the semantic content preserved from marking schemes, and pedagogical alignment with Bloom’s Taxonomy. The dataset was partitioned into training, validation, and test subsets in a 70:15:15 ratio, and all textual data were preprocessed through normalization and SentencePiece tokenization, with semantic embeddings generated via MPNet to serve as input to the T5-small generative model.
To contextualize the performance of the proposed framework, two categories of primary baselines were employed. The first baseline consisted of a conventional T5-small sequence-to-sequence model trained under the same conditions as the proposed framework but without reconstruction loss or Bloom-level conditioning. The second baseline was a rule-based question generation system that leveraged keyword extraction and syntactic templates to produce candidate questions from examiner solutions. While these baselines provide a meaningful reference for evaluating the contributions of semantic reconstruction and pedagogical constraints, they are relatively modest benchmarks compared with current large-scale neural architectures, which are likely to offer higher performance due to their greater representational capacity. T5-small was nevertheless selected as the core generator to ensure computational feasibility and reproducibility within the available infrastructure. To account for model scale, a T5-base baseline is included in Section 4.1, and fine-tuned large language model (LLM) baselines (GPT-4o and LLaMA-3-8B-Instruct) are compared in Section 4.4. Extending the comparison to further models and to prompting-only configurations remains future work.
4.1. Semantic Fidelity and Similarity Metrics
To rigorously evaluate the quality of the generated examination questions, multiple complementary metrics were employed to capture lexical overlap, semantic preservation, and pedagogical alignment. BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated and reference questions, providing an indication of surface-level similarity. ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence) evaluates structural alignment via the longest common subsequence, offering tolerance to moderate word-order variation. METEOR (Metric for Evaluation of Translation with Explicit ORdering) incorporates unigram precision and recall with stemming and synonym matching, enabling more semantically sensitive comparison than pure n-gram overlap.
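Of the three lexical metrics, ROUGE-L is the simplest to compute from first principles, and a minimal Python sketch is given below. It assumes lowercased whitespace tokens and the balanced F-measure (β = 1); the exact variant and tokenization used in the reported experiments are not stated, so values from this sketch need not match the tables.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """ROUGE-L F-score between a generated and a reference question."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

The subsequence need not be contiguous, which is why ROUGE-L tolerates inserted words such as "how" or "works" better than a strict n-gram match would.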
Beyond lexical similarity, Reconstruction Fidelity (RF) was used to assess semantic preservation within the proposed architectural framework. RF is computed as the cosine similarity between MPNet embedding representations of the original marking scheme and the reconstructed marking scheme generated from the produced question. Importantly, RF functions as an internal consistency diagnostic rather than an independent external metric. It evaluates whether sufficient semantic information is retained for reconstruction within the model pipeline. Bloom-Level Accuracy was additionally used to quantify pedagogical alignment, defined as the proportion of generated questions whose predicted cognitive level matches the annotated Bloom label.
Table 3 summarizes semantic evaluation results across lexical metrics for all baselines, including a stronger seq2seq comparison (T5-base) trained under identical preprocessing and optimization settings.
The inclusion of T5-base provides a stronger scale-controlled baseline. While the larger model achieves moderate improvements over unconstrained T5-small, it remains below the full proposed framework that integrates reconstruction loss and Bloom-level conditioning. This suggests that the observed gains are attributable to architectural enhancements rather than model size alone.
These results establish improved semantic alignment under automatic evaluation metrics within the studied domain. However, automatic metrics do not fully capture question clarity, ambiguity, or educational appropriateness, and they should be interpreted as quantitative indicators rather than definitive pedagogical validation.
4.2. Reconstruction Fidelity
The ability of the framework to preserve semantic content was further evaluated using Reconstruction Fidelity (RF). RF measures the cosine similarity between MPNet embedding vectors of the original marking scheme and the reconstructed marking scheme generated from the model’s question output. Because this metric operates in embedding space and reflects the internal reconstruction pathway, it is interpreted as an architectural consistency measure rather than an external evaluation benchmark.
Table 4 reports reconstruction fidelity scores across models.
The proposed framework achieves the highest RF score, indicating stronger semantic retention within the reconstruction loop. Notably, increasing model scale alone (T5-base without reconstruction loss) improves reconstruction modestly compared to T5-small, but does not match the performance of the full architecture. This supports the role of reconstruction loss as a meaningful structural constraint that encourages semantic completeness in generated questions.
While high RF values indicate strong embedding-level alignment between original and regenerated marking schemes, this metric does not independently guarantee pedagogical quality or real-world exam suitability. Instead, RF complements lexical metrics by verifying semantic information preservation within the model’s internal representation.
4.3. Ablation Study
To isolate the contribution of individual architectural components, a series of quantitative ablation experiments was conducted on the same held-out test split used for primary evaluation. Specifically, we examined the performance impact of (i) removing the reconstruction loss, (ii) removing Bloom-level conditioning, and (iii) varying the reconstruction weighting coefficient λ ∈ {0.1, 0.3, 0.5}. All models were trained under identical preprocessing, tokenization, and optimization settings to ensure fair comparison.
Table 5 summarizes the results across lexical metrics and reconstruction fidelity.
Removing the reconstruction loss results in the most pronounced degradation across all lexical metrics, with BLEU decreasing from 0.71 to 0.64 and reconstruction fidelity dropping substantially from 0.84 to 0.69. This confirms that the reconstruction mechanism plays a central role in enforcing semantic completeness during question generation. Without the reconstruction constraint, the model tends to produce linguistically plausible but less semantically grounded questions.
Eliminating Bloom-level conditioning produces a more moderate decline in lexical performance, suggesting that Bloom embeddings primarily influence pedagogical alignment rather than surface-level semantic similarity. Notably, reconstruction fidelity remains relatively stable when Bloom conditioning is removed (RF = 0.82), indicating that the reconstruction mechanism operates largely independently of cognitive-level guidance.
The λ sensitivity analysis further illustrates the balancing role of reconstruction weighting. A lower weight (λ = 0.1) reduces semantic enforcement, leading to modest decreases in BLEU and RF. Conversely, a higher weight (λ = 0.5) slightly increases reconstruction fidelity (RF = 0.86) but does not substantially improve lexical similarity, suggesting diminishing returns when reconstruction is overly emphasized. The selected value λ = 0.3 provides a balanced trade-off between semantic preservation and fluent generation.
Taken together, the ablation results demonstrate that both reconstruction loss and Bloom-level conditioning contribute measurably to model performance, with reconstruction serving as the dominant driver of semantic fidelity and Bloom embeddings supporting cognitive alignment. These findings substantiate the architectural design choices and confirm that improvements cannot be attributed solely to model scale or training configuration.
4.4. Large Language Model Baseline Comparison
To align with current evaluation standards in contemporary NLP research, we incorporated two recent large language model (LLM) baselines: GPT-4o, developed by OpenAI, and LLaMA-3-8B-Instruct, released by Meta. Both models were fine-tuned on the same training split used for the proposed framework to ensure a fair comparison under identical data conditions. Fine-tuning followed a supervised instruction-style format, where marking schemes and Bloom-level indicators were provided as structured inputs and the corresponding question served as the target output.
Table 6 summarizes the comparative results across lexical similarity metrics and reconstruction fidelity. As expected, both LLM baselines achieve strong lexical performance. GPT-4o attains the highest BLEU (0.73) and ROUGE-L (0.70) scores, reflecting its large-scale pretraining and strong generative fluency. LLaMA-3-8B-Instruct demonstrates comparable performance, with BLEU of 0.72 and ROUGE-L of 0.69.
However, differences emerge when evaluating reconstruction fidelity. While the LLMs maintain competitive RF scores (0.79 and 0.81, respectively), the proposed reconstruction-informed architecture achieves the highest fidelity (0.84). This suggests that explicitly incorporating a reconstruction objective provides additional semantic grounding beyond what is implicitly learned through large-scale pretraining and fine-tuning alone.
To sum up, the results indicate that large instruction-tuned LLMs constitute strong baselines for reverse question generation. Nevertheless, the proposed framework remains competitive in lexical metrics while demonstrating superior structural alignment with marking schemes. These findings support the continued relevance of task-specific architectural constraints, even in the era of large pretrained models.
4.5. Human Expert Evaluation
To complement the automatic evaluation metrics and provide pedagogical validation beyond lexical and embedding-based similarity measures, a small-scale expert assessment was conducted. Two subject-matter lecturers in Data Science independently evaluated a randomly selected subset of 40 generated questions drawn from the held-out test split. The evaluators were not involved in model development and were blinded to the generation configuration. Each question was assessed using a 5-point Likert scale (1 = very poor, 5 = excellent) across four dimensions: clarity, correctness with respect to the marking scheme, absence of ambiguity, and alignment with the annotated Bloom cognitive level.
Inter-rater reliability was measured using Cohen’s kappa, yielding κ = 0.78, indicating substantial agreement between evaluators. The mean ratings across the four dimensions are reported in Table 7.
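For readers unfamiliar with the statistic, the following sketch computes unweighted Cohen's kappa for two raters. It is an illustration under assumptions: the text does not state whether a weighted variant was used or how ratings were pooled across the four dimensions, and the ratings below are made up rather than taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equally long sequences of category labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over categories.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() & freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical category throughout.
    return (observed - expected) / (1.0 - expected)


# Hypothetical 5-point Likert ratings for eight questions.
ratings_a = [5, 4, 4, 3, 5, 4, 2, 5]
ratings_b = [5, 4, 3, 3, 5, 4, 2, 4]
kappa = cohens_kappa(ratings_a, ratings_b)
print(f"kappa = {kappa:.3f}")
```

Because Likert scores are ordinal, a weighted kappa that penalizes near-misses less than distant disagreements is a common alternative; the unweighted form is shown only because it is the simplest case.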
The results indicate that the majority of generated questions were judged to be clear, semantically aligned with the intended marking schemes, and consistent with the specified cognitive levels. Correctness received the highest mean score (4.35), suggesting that the reconstruction-informed architecture effectively preserves answer-relevant content during generation. Bloom-level alignment (mean = 4.22) further supports the contribution of cognitive conditioning in guiding question formulation. Slightly lower scores on the absence-of-ambiguity dimension reflect occasional phrasing that would benefit from minor refinement, although overall ratings remained strongly positive.
These findings provide qualitative support for the automatic evaluation metrics presented earlier, indicating that embedding-level semantic preservation and lexical similarity correspond reasonably well with expert judgments in this domain. However, this evaluation remains limited in scale and disciplinary scope. While the results offer preliminary evidence of pedagogical soundness within Data Science assessment contexts, broader validation involving multiple institutions, larger samples, and cross-disciplinary reviewers would be necessary to establish general educational robustness.
4.6. Pedagogical Alignment
The effectiveness of Bloom-level integration was evaluated by measuring the Bloom classification accuracy of generated questions.
Table 8 summarizes the alignment of predicted cognitive levels with the annotated levels in the test dataset. The proposed framework achieved an accuracy of 0.79, demonstrating a high degree of correspondence between intended and generated cognitive levels. In comparison, T5-small without Bloom embeddings achieved only 0.56, while rule-based approaches were limited to 0.43.
These results show the importance of incorporating pedagogical constraints into the generative process. The proposed model is capable of generating questions that not only reflect the content of examiner solutions but also align with the intended cognitive depth, a crucial factor for ensuring assessment validity and fairness. This alignment is further validated by Figure 2, which presents the confusion matrix, offering a detailed view of Bloom-level prediction performance. The confusion matrix complements the overall accuracy metric and reinforces the conclusion that embedding Bloom-level constraints substantially enhances the model’s adherence to the intended cognitive levels.
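The relationship between the accuracy figure and the confusion matrix can be sketched as follows. The function name and the labels are illustrative assumptions, and the sample predictions are made up; how the Bloom level of a generated question was predicted in the experiments is not restated here.

```python
from collections import defaultdict


def confusion_and_accuracy(annotated, predicted):
    """Count (annotated, predicted) label pairs and derive overall accuracy."""
    if len(annotated) != len(predicted) or not annotated:
        raise ValueError("Label lists must be non-empty and of equal length.")
    counts = defaultdict(int)
    for gold, pred in zip(annotated, predicted):
        counts[(gold, pred)] += 1
    # Accuracy is the share of mass on the diagonal of the confusion matrix.
    correct = sum(v for (gold, pred), v in counts.items() if gold == pred)
    return dict(counts), correct / len(annotated)


# Hypothetical labels for five generated questions.
annotated = ["Application", "Application", "Analysis", "Analysis", "Evaluation"]
predicted = ["Application", "Analysis", "Analysis", "Analysis", "Evaluation"]
matrix, accuracy = confusion_and_accuracy(annotated, predicted)
print(accuracy)
print(matrix[("Application", "Analysis")])
```

Off-diagonal cells such as `("Application", "Analysis")` show which levels are confused with each other, information that a single accuracy value hides.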
4.7. Training Convergence
Figure 3 illustrates the training and validation loss curves across 10 epochs. The steady decline of both losses indicates stable convergence, with the reconstruction component reducing variance compared to the unconstrained baseline. Stable convergence of this kind is consistent with the model generating reliable questions from available answers, although loss curves alone do not establish output quality.
4.8. Question Length and Distribution Analysis
To further examine structural properties of the generated outputs, Figure 4 presents the empirical distribution of generated question lengths measured in tokens. The histogram illustrates a unimodal distribution centered within the 18–25 token interval, with a gradual tapering toward both shorter and longer sequences. Only a small proportion of questions fall below 12 tokens or exceed 32 tokens, indicating controlled variability in generation length.
The observed distribution suggests that the model produces questions of moderate length, consistent with typical short-answer and conceptual assessment items within the dataset. Importantly, the absence of extreme outliers indicates that the decoding strategy does not systematically favor overly concise or excessively verbose formulations. This structural consistency complements the previously reported semantic evaluation metrics by demonstrating that output fluency is accompanied by stable length characteristics.
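A length analysis of this kind can be sketched in a few lines. The sketch assumes whitespace tokenization and a bin width of 4, both chosen for illustration; the tokenizer behind the reported lengths may differ (for example, a subword tokenizer), which would shift the counts. The 12- and 32-token bounds are the ones quoted above.

```python
from collections import Counter


def length_histogram(questions, bin_width=4):
    """Return per-question token counts and a histogram keyed by bin start."""
    # Assumption: tokens are whitespace-separated words.
    lengths = [len(q.split()) for q in questions]
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return lengths, dict(sorted(bins.items()))


def share_outside(lengths, low=12, high=32):
    """Proportion of questions shorter than `low` or longer than `high` tokens."""
    return sum(n < low or n > high for n in lengths) / len(lengths)


# Synthetic questions of 3, 20, and 22 tokens, for illustration only.
questions = ["a b c", " ".join(["w"] * 20), " ".join(["w"] * 22)]
lengths, histogram = length_histogram(questions)
print(histogram)
print(share_outside(lengths))
```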
4.9. Semantic Similarity Visualization
In addition to quantitative metrics, a scatter plot of cosine similarity between generated questions and ground-truth questions was constructed to visualize semantic alignment across the test set.
Figure 5 shows that most points are clustered near a similarity score of 0.85, reflecting high consistency between model outputs and human-designed questions. Outliers correspond to complex multi-part questions in which minor deviations in phrasing produced lower similarity scores, although the generated questions remain pedagogically valid.
This visualization complements the numerical evaluation, providing intuitive insight into the overall semantic fidelity of generated questions across the dataset.
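The per-pair scores behind such a plot can be computed as in the sketch below. It assumes that each question has already been mapped to a fixed-length embedding vector by some encoder; the toy two-dimensional vectors and the 0.6 outlier threshold are hypothetical and serve only to make the example runnable.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same dimensionality.")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("Cosine similarity is undefined for zero vectors.")
    return dot / (norm_u * norm_v)


def low_similarity_indices(generated_vecs, reference_vecs, threshold=0.6):
    """Indices of generated/reference pairs scoring below a (hypothetical) threshold."""
    return [
        i
        for i, (g, r) in enumerate(zip(generated_vecs, reference_vecs))
        if cosine_similarity(g, r) < threshold
    ]


# Toy embeddings standing in for encoder outputs.
generated = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
print(low_similarity_indices(generated, reference))
```

Flagging low-scoring pairs in this way makes it straightforward to inspect outliers by hand, for instance multi-part questions whose wording diverges from the reference.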
4.10. Internal Generalization Analysis (Course-Wise Split Evaluation)
To evaluate robustness beyond a conventional random split, we conducted an internal generalization experiment using a course-wise partitioning strategy. In this setting, the model was trained on marking schemes from multiple Data Science courses and evaluated on a held-out course that was excluded entirely from training. This configuration provides a stricter test of generalization, as it requires the model to transfer learned representations across variations in instructional emphasis, terminology, and assessment style.
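A course-wise partition of this kind can be sketched as follows. The record layout (a `course` key per record) and the course names are hypothetical; the sketch only illustrates the principle that every record from the held-out course is excluded from training.

```python
def course_wise_split(records, held_out_course):
    """Hold out every record of one course for testing; train on all others."""
    train = [r for r in records if r["course"] != held_out_course]
    test = [r for r in records if r["course"] == held_out_course]
    if not test:
        raise ValueError(f"No records found for course {held_out_course!r}.")
    return train, test


# Hypothetical aligned records; field names and courses are illustrative.
records = [
    {"course": "Machine Learning", "scheme": "...", "question": "..."},
    {"course": "Data Mining", "scheme": "...", "question": "..."},
    {"course": "Machine Learning", "scheme": "...", "question": "..."},
    {"course": "Statistics", "scheme": "...", "question": "..."},
]
train, test = course_wise_split(records, held_out_course="Data Mining")
print(len(train), len(test))
```

Unlike a random split, no course contributes to both sides, so course-specific phrasing seen during training cannot leak into the test set.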
Table 9 reports quantitative performance under both evaluation settings. As expected, performance under the course-wise split shows moderate degradation compared to the random split. BLEU decreases from 0.71 to 0.66, ROUGE-L from 0.68 to 0.63, and METEOR from 0.65 to 0.60. Reconstruction Fidelity exhibits a smaller decline, from 0.84 to 0.80. The relative reductions range between approximately 5% and 8%, indicating sensitivity to course-specific phrasing while maintaining substantial semantic alignment.
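The relative reductions quoted above follow directly from the reported scores, as this short check shows. Only the dictionary and function names are introduced for illustration; the values are those stated in the text.

```python
# Scores reported for the random split and the course-wise split.
random_split = {"BLEU": 0.71, "ROUGE-L": 0.68, "METEOR": 0.65, "RF": 0.84}
course_split = {"BLEU": 0.66, "ROUGE-L": 0.63, "METEOR": 0.60, "RF": 0.80}


def relative_drop(before, after):
    """Relative reduction of a score, as a fraction of its original value."""
    return (before - after) / before


drops = {m: relative_drop(random_split[m], course_split[m]) for m in random_split}
for metric, drop in drops.items():
    print(f"{metric}: {drop:.1%}")
# BLEU: 7.0%
# ROUGE-L: 7.4%
# METEOR: 7.7%
# RF: 4.8%
```

The three lexical metrics each lose roughly 7 to 8% of their value, whereas Reconstruction Fidelity loses about 5%, which accounts for the stated 5% to 8% range.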
Figure 6 further visualizes this comparison using a grouped bar chart, illustrating the consistent but controlled performance drop across all metrics. The bar chart highlights two important patterns. First, the degradation is uniform rather than erratic, suggesting that the model does not fail catastrophically when exposed to unseen course material. Second, reconstruction fidelity remains comparatively stable relative to lexical metrics, supporting the claim that the reconstruction objective promotes answer-level semantic grounding even when surface-level wording differs.
Taken together, these results demonstrate that, while course-specific stylistic variation affects lexical similarity scores, the framework preserves core semantic structure across courses within the same disciplinary domain. These findings provide evidence of meaningful internal generalization rather than simple memorization of course-specific expressions. Nonetheless, broader validation across distinct academic disciplines remains necessary to establish external generalizability.
4.11. Discussion
The results of this study demonstrate that the proposed AI-powered NLP framework effectively addresses the challenge of reverse-engineering examination questions from examiner-provided marking schemes. The high BLEU, ROUGE-L, and METEOR scores indicate that the generated questions closely mirror the lexical and structural content of the reference questions, reflecting strong semantic fidelity. The Reconstruction Fidelity (RF) metric further confirms that the framework preserves essential semantic information, as generated questions consistently contain sufficient content to accurately reconstruct the original marking schemes. This semantic preservation is a direct outcome of the secondary reconstruction module, which enforces a closed-loop feedback mechanism, penalizing deviations from the original embeddings and thereby enhancing content retention.
Analysis of Bloom-level predictions provides additional insight into the pedagogical performance of the model. The confusion matrix reveals minor errors primarily between adjacent cognitive levels, such as “Application” and “Analysis.” These confusions suggest that while the framework captures nuanced cognitive distinctions, certain levels may share semantic overlap, making exact classification challenging. Nevertheless, the overall Bloom-level accuracy of 0.79 demonstrates that the model successfully generates questions aligned with intended cognitive complexity, highlighting the value of integrating Bloom embeddings as control tokens.
The combination of semantic reconstruction and Bloom-level conditioning also impacts the structural and stylistic quality of the generated questions. Questions produced by the unconstrained T5-small baseline frequently omitted critical semantic details or exhibited reduced pedagogical coherence, while rule-based templates, although syntactically valid, lacked cognitive alignment. In contrast, the proposed framework balances semantic completeness with cognitive specificity, producing questions that are both meaningful and functionally equivalent to human-authored items.
While the automatic evaluation metrics provide useful quantitative measures, they have inherent limitations. BLEU, ROUGE-L, and METEOR primarily assess surface-form similarity and may not fully capture semantic equivalence or pedagogical quality, particularly for paraphrased or multi-part questions. Similarly, RF, while indicative of semantic consistency, is partially dependent on the reconstruction module itself and therefore may not serve as an independent measure of semantic correctness. These considerations underscore the need for larger-scale human-in-the-loop evaluation, beyond the small expert assessment reported in Section 4.5, to complement automatic metrics and provide expert assessment of question clarity, relevance, and cognitive appropriateness.
Finally, the results reflect the constraints of the dataset, which consists exclusively of Data Science courses from a single institution. This domain specificity may limit the generalizability of the framework to other subjects or institutional contexts. Although the model demonstrates robustness within the given dataset, further testing across disciplines and educational settings is necessary to confirm broader applicability.
Collectively, these findings establish that transformer-based architectures, when combined with semantic reconstruction and Bloom-level conditioning, can generate examination questions that are semantically accurate, pedagogically meaningful, and structurally coherent. The framework provides a scalable approach for automated exam construction, digital archiving, and adaptive assessment, while also highlighting areas for future research to improve generalization and to extend validation through larger-scale human expert evaluation.
5. Conclusions and Future Work
This study introduced and validated a novel AI-powered NLP framework for reverse-engineering examination questions from examiner-provided solutions, addressing a significant gap in intelligent assessment technologies. By combining semantic embeddings, transformer-based generation, reconstruction fidelity, and Bloom-level conditioning, the framework achieved strong performance across multiple metrics, including BLEU = 0.71, ROUGE-L = 0.68, METEOR = 0.65, reconstruction fidelity = 0.84, and Bloom-level accuracy = 0.79. These results demonstrate that the generated questions not only preserve semantic content but also align with intended cognitive levels, producing items that are structurally coherent, pedagogically meaningful, and functionally equivalent to human-authored questions.
Despite these encouraging findings, the study has several limitations. The dataset consists of 7021 aligned records exclusively from Data Science courses within a single university, which may limit generalizability to other subjects, institutions, or educational contexts. Evaluation relied primarily on automatic metrics, which, while informative, do not fully capture human judgments of clarity, appropriateness, or exam suitability. Furthermore, human evaluation was limited to a small-scale assessment of 40 generated questions by two Data Science lecturers (Section 4.5), and cross-disciplinary validation remains untested. These limitations highlight the need for cautious interpretation of the reported performance and motivate future work to improve robustness and applicability.
The framework also has practical implications for educational practice. By leveraging examiner-provided marking schemes, it supports digital archiving of legacy assessment materials and enables automated draft question generation, reducing preparation time for lecturers. Additionally, the system can facilitate adaptive assessment, providing tailored questions aligned with desired cognitive outcomes, and can be integrated into human-in-the-loop exam construction workflows, where educators review and refine AI-generated drafts. Collectively, these applications demonstrate the potential of AI-assisted assessment to enhance both efficiency and educational quality.
Future research will address several directions. First, cross-disciplinary and cross-institutional evaluation will assess generalization beyond Data Science courses. Second, multi-part, scenario-based, and higher-order questions will be incorporated to extend pedagogical coverage. Third, larger-scale human-in-the-loop evaluation will provide broader expert validation of question clarity, cognitive alignment, and educational appropriateness. These efforts aim to advance AI-driven educational evaluation, supporting scalable, semantically accurate, and pedagogically meaningful automated assessment systems.
It is important to note that the current evaluation is restricted to Data Science courses from a single institution. Consequently, the present findings establish effectiveness only for in-domain reverse question generation under relatively homogeneous curricular conditions. Generalization to other subjects and institutional contexts requires further empirical validation, which the cross-disciplinary and cross-institutional studies outlined above are intended to provide.