AI-Powered Natural Language Processing Framework for Reverse-Engineering Examination Questions from Marking Schemes

Olaniyan, Julius; Verkijika, Silas Formunyuy; Obagbuwa, Ibidun Christiana

doi:10.3390/computers15040204


by Julius Olaniyan ¹, Silas Formunyuy Verkijika ¹ and Ibidun Christiana Obagbuwa ²,³,*

¹ Center for Applied Data Science (CADS), Faculty of Natural and Applied Sciences, Sol Plaatje University, Kimberley 8300, South Africa

² Department of Computer Science & Information Technology, Faculty of Natural and Applied Sciences, Sol Plaatje University, Kimberley 8300, South Africa

³ Department of Mathematical Science and Computing, Walter Sisulu University, Mthatha 5117, South Africa

* Author to whom correspondence should be addressed.

Computers 2026, 15(4), 204; https://doi.org/10.3390/computers15040204

Submission received: 24 January 2026 / Revised: 26 February 2026 / Accepted: 5 March 2026 / Published: 26 March 2026

(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling (2nd Edition))


Abstract

The generation of examination questions from examiner-provided marking schemes remains a critical yet underexplored challenge in automated assessment. This study proposes an AI-powered natural language processing (NLP) framework that reverse-engineers exam questions using transformer-based generative modeling, semantic reconstruction, and pedagogical constraints. Marking schemes are encoded with MPNet embeddings and decoded into candidate questions by a T5-small model, with a reconstruction module ensuring semantic fidelity and Bloom-level embeddings enforcing cognitive alignment. Evaluation on a dataset of 7021 marking schemes from Sol Plaatje University demonstrated strong performance, with BLEU = 0.71, ROUGE-L = 0.68, METEOR = 0.65, reconstruction fidelity = 0.84, and Bloom-level accuracy = 0.79. Comparative baselines, including an unconstrained T5 (BLEU = 0.62, RF = 0.68, Bloom = 0.56) and rule-based methods (BLEU = 0.48, RF = 0.51, Bloom = 0.43), confirmed the effectiveness of the proposed approach. The results indicate that the framework generates questions that are semantically accurate, structurally coherent, and pedagogically valid, offering a scalable solution for adaptive assessment, digital archiving, and automated exam construction.

Keywords:

reverse question generation; natural language processing; transformer models; Bloom’s Taxonomy; semantic reconstruction


1. Introduction

Assessment instruments are among the most influential yet least interrogated artifacts in formal education [1]. Examination questions implicitly encode curricular intent, cognitive demand, and evaluative priorities, while marking schemes make these assumptions explicit through structured rubrics, scoring logic, and model responses [2]. Despite their intrinsic coupling, the relationship between examination questions and marking schemes has traditionally been treated as unidirectional: questions are authored first, and marking schemes are subsequently derived as interpretive guides [3]. This linear workflow obscures a critical opportunity for computational insight: the ability to infer, reconstruct, or validate examination questions directly from their associated marking schemes [4].

Recent advances in artificial intelligence, particularly in natural language processing (NLP), have transformed how unstructured educational texts can be analyzed, represented, and generated [5]. Large language models, semantic parsers, and discourse-aware transformers now demonstrate the capacity to capture latent instructional intent, hierarchical reasoning patterns, and domain-specific evaluative cues embedded within textual data [6]. However, the application of these techniques in educational assessment has largely focused on forward-generation tasks, such as automated question generation from syllabi or essay scoring from student responses. The inverse problem of reverse-engineering examination questions from marking schemes remains substantially underexplored, despite its relevance to assessment quality assurance, curriculum alignment, and academic integrity [7].

Despite rapid progress in AI-driven educational assessment, existing research remains predominantly forward-oriented, treating marking schemes as secondary scoring tools rather than as structured semantic representations of assessment intent. Current systems generate questions from content or grade responses against fixed rubrics, yet they do not computationally model rubric-derived constraints such as expected reasoning pathways, conceptual weighting, or evaluative granularity. Moreover, contemporary NLP frameworks lack mechanisms for explicitly incorporating pedagogical attributes such as cognitive complexity, scope of evaluation, and mark allocation during generation, resulting in outputs that may be linguistically coherent but pedagogically misaligned. Critically, no systematic methodology exists for reverse assessment validation through structured reconstruction of examination prompts from marking schemes. Consequently, the inverse problem of question reconstruction remains under-theorized and methodologically unsupported, representing a substantive gap at the intersection of NLP, assessment design, and learning analytics.

Marking schemes function as structured semantic blueprints of expected knowledge representations [8]. They encode acceptable reasoning paths, conceptual anchors, and weighted subcomponents, making them computationally decomposable into intent signals and evaluative operators [9,10]. Reverse question inference therefore constitutes a constraint-aware structured reasoning problem rather than a surface-level generation task [11].

This study addresses three research questions: (RQ1) the feasibility of reconstructing examination questions from marking schemes using semantic and pedagogical signals; (RQ2) methods for explicitly incorporating assessment attributes such as cognitive level, evaluation scope, and mark allocation during reconstruction; (RQ3) the potential of reverse question inference to support automated assessment validation and alignment analysis.

This paper introduces an AI-powered NLP framework designed to systematically reconstruct examination questions from marking schemes. The proposed approach treats the task as a multi-stage inference pipeline, combining semantic extraction, pedagogical intent modeling, and controlled natural language generation. Unlike generic text-to-text generation paradigms, the framework explicitly models assessment dimensions such as cognitive level, scope of evaluation, and mark-weighted emphasis, ensuring that reconstructed questions are not only linguistically coherent but also pedagogically valid. By grounding question reconstruction in the latent structure of marking schemes, the framework enables both faithful regeneration of original assessment prompts and the generation of equivalent alternatives for moderation and benchmarking.

The significance of this work extends beyond automation. Reverse-engineering examination questions enables novel forms of assessment analytics, including the detection of misalignment between learning outcomes and evaluation criteria, the identification of rubric ambiguity, and the systematic validation of assessment fairness across cohorts and institutions. Moreover, the framework has practical implications for scalable assessment design, especially in resource-constrained educational environments where expert moderation capacity is limited. By transforming marking schemes into generative knowledge artifacts, this research contributes to a broader vision of intelligent, transparent, and auditable educational assessment systems.

In doing so, this study advances the discourse on AI-assisted assessment by reframing marking schemes as primary computational objects rather than secondary documentation. The proposed NLP framework establishes a foundation for future work in assessment reverse engineering, neuro-symbolic evaluation modeling, and explainable educational AI, positioning reverse question inference as a critical capability in next-generation learning analytics infrastructures.

To the best of our knowledge, no prior framework has systematically modeled marking schemes as primary semantic inputs for structured reverse question inference while explicitly incorporating pedagogical attributes during generation.

2. Literature Review

2.1. Automated Question Generation in Educational NLP

Automatic Question Generation (AQG) has emerged as a prominent research area within educational NLP, primarily motivated by the need to scale assessment creation and personalize learning experiences. Early work in this space focused on generating fact-based or causal questions directly from instructional text. For instance, Stasaski et al. [12] proposed a pipeline for automatically generating cause-and-effect questions from passages, emphasizing discourse structure and semantic triggers. Their work demonstrated that meaningful questions can be derived from latent relational cues in text, but it remained constrained to source passages rather than evaluative artifacts such as marking schemes.

More recent studies have leveraged transformer-based architectures and large language models to enhance both the quality and diversity of generated questions. Yao et al. [13] introduced MCQG-SRefine, an iterative framework that integrates self-critique and correction loops to improve multiple-choice question generation and evaluation. While effective in refining question clarity and distractor quality, the framework assumes the availability of source content and does not address the reconstruction of assessment intent from grading criteria. Similarly, Babakhani et al. [14] presented Opinerium, which uses large language models to generate subjective questions, demonstrating strong fluency and contextual relevance. However, the generative process remains forward-facing, treating question creation as a creative task rather than an inferential one grounded in evaluative structure.

Hybrid approaches have also been explored to improve informativeness in generated questions. Yehia et al. [15] proposed a sentence extraction strategy that combines statistical and semantic features to identify question-worthy content prior to generation. Although this improves input selection, it again presupposes instructional text rather than marking schemes as the primary semantic source. Collectively, AQG research has advanced substantially in generation quality, yet it continues to model questions as primary artifacts, leaving unexplored the inverse relationship between questions and their corresponding evaluation rubrics.

2.2. Subjective Question Generation and Answer Evaluation

Parallel to AQG, a substantial body of work has focused on subjective answer evaluation, often framing the task as semantic similarity measurement between student responses and reference answers. Studies by Das et al. [16], Bashir et al. [17], and Rambola et al. [18] employed machine learning and NLP techniques including TF-IDF, word embeddings, and syntactic features to automate grading of descriptive answers. These approaches demonstrated that rubric-aligned evaluation can approximate human grading under constrained conditions, but they treat marking schemes as static scoring references rather than generative knowledge sources.

More recent contributions have incorporated deep learning architectures to improve robustness. Manikandan et al. [19] applied recurrent neural networks to capture sequential semantics in subjective responses, while Yasin Sharif and Ravindhar [20] proposed enhanced evaluators using refined semantic representations. Similarly, Metan et al. [21], Naikar et al. [22], and Aggarwal et al. [23] developed end-to-end systems combining NLP pipelines with supervised learning for automated evaluation. Despite their technical sophistication, these systems implicitly assume that questions and marking schemes are externally provided and correct, offering no mechanism to interrogate or reconstruct the assessment prompt itself.

Islam et al. [24] extended this paradigm by integrating subjective question generation with answer evaluation, presenting a unified NLP-based framework. While this integration narrows the gap between question creation and evaluation, it still follows a forward generation logic and does not consider whether marking schemes alone can serve as sufficient semantic signals for reconstructing the original assessment intent.

2.3. Assessment Systems and Secure Evaluation Frameworks

Several works have addressed assessment automation from a systems perspective. Ragasudha and Saravanan [25] proposed a secure framework for automatic question paper generation coupled with subjective answer evaluation, focusing on integrity and deployment considerations. Pradeep and Vishaka [26] developed a web-based application for rapid subjective answer evaluation, emphasizing usability rather than inference or explainability. Vichare et al. [27] introduced QGen, combining question generation and answer evaluation within a single NLP-driven workflow. While these systems demonstrate practical feasibility, they largely treat assessment components as modular yet independent entities, without exploiting the latent bidirectional relationship between questions and marking schemes.

Across the reviewed literature, two consistent assumptions prevail: (i) examination questions are primary artifacts authored independently of marking schemes, and (ii) marking schemes function solely as evaluation tools rather than semantic representations of assessment intent. Existing AQG methods generate questions from instructional content, while answer evaluation systems consume marking schemes as static rubrics. None of the reviewed studies explicitly investigate the inverse problem of reconstructing examination questions from marking schemes, nor do they model marking schemes as high-density semantic objects capable of supporting controlled natural language generation.

This gap is non-trivial. Marking schemes encode not only expected answers but also implicit constraints on cognitive level, conceptual scope, and mark-weighted emphasis. Ignoring this structure limits the ability of automated systems to validate assessment alignment, detect rubric ambiguity, or support scalable moderation. The absence of reverse-engineering frameworks thus represents a foundational limitation in current educational NLP research.

In summary, prior work in automatic question generation and subjective answer evaluation has made significant progress in forward-generation and grading accuracy. However, the literature lacks a principled NLP framework that treats marking schemes as generative inputs for reconstructing examination questions. By addressing this overlooked inverse relationship, the present study positions itself at the intersection of assessment intelligence, semantic inference, and explainable educational AI, extending existing research beyond automation toward deeper computational understanding of assessment design.

3. Methodology

The proposed framework for reverse-engineering examination questions from marking schemes is designed to model the conditional probability of a question given a marking scheme. Unlike conventional question generation systems, which generate questions from instructional content, this framework reconstructs appropriate exam questions from examiner-provided solutions. To ensure reproducibility, the methodology provides explicit details regarding dataset preparation, model architecture, hyperparameters, reconstruction mechanisms, and pedagogical constraint integration.

3.1. Data Representation and Preprocessing

The dataset employed in this study consists of 7021 aligned records collected from the Department of Data Science, Sol Plaatje University, South Africa. Each record contains an examiner-provided marking scheme and its corresponding ground-truth examination question, forming a supervised parallel corpus suitable for sequence-to-sequence learning. For each instance, the marking scheme is denoted as M_{i} and the associated question as Q_{i}, ensuring deterministic one-to-one alignment throughout the training and evaluation processes. In addition to the primary input–output pair, each record contains auxiliary metadata including a unique sample identifier, an annotated Bloom’s Taxonomy level, token length statistics, and a predefined dataset split label (training, validation, or test). This structured organization enables reproducibility, controlled experimentation, and consistent pedagogical conditioning during model training.

The dataset is stored in JavaScript Object Notation (JSON) format, where each record represents a single aligned pair. A representative sample instance is shown below for clarity:

    {
      "sample_id": 1024,
      "marking_scheme": "A confusion matrix evaluates classification performance by comparing predicted labels with actual labels. It includes true positives, true negatives, false positives, and false negatives.",
      "question": "Explain how a confusion matrix is used to evaluate the performance of a classification model.",
      "bloom_level": "Analyze",
      "answer_length": 32,
      "question_length": 17,
      "split": "train"
    }

In this representation, the marking_scheme field serves as the model input, while the question field represents the target output. The bloom_level attribute provides pedagogical supervision aligned with Bloom’s Taxonomy, enabling cognitive conditioning of the generative model. The answer_length and question_length fields are derived after tokenization and are used for statistical reporting and distributional analysis. The split attribute ensures deterministic partitioning of the dataset into 70% training, 15% validation, and 15% test subsets, thereby guaranteeing reproducibility of experimental results.
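As a minimal sketch of how records with these fields might be read and grouped by their split label, the snippet below assumes one JSON object per line (a JSON Lines layout, which the source does not specify). The helper names `parse_record` and `group_by_split` are illustrative and not part of the published framework.

```python
import json
from collections import defaultdict

# Field names taken from the sample record shown in the text.
REQUIRED_FIELDS = {
    "sample_id", "marking_scheme", "question", "bloom_level",
    "answer_length", "question_length", "split",
}

def parse_record(line: str) -> dict:
    """Parse one JSON record and check that every expected field is present."""
    record = json.loads(line)
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")
    return record

def group_by_split(records) -> dict:
    """Group (marking_scheme, question) pairs by the predefined split label."""
    groups = defaultdict(list)
    for record in records:
        groups[record["split"]].append((record["marking_scheme"], record["question"]))
    return dict(groups)

# The representative record from the text, serialized as a single line.
SAMPLE = json.dumps({
    "sample_id": 1024,
    "marking_scheme": "A confusion matrix evaluates classification performance by "
                      "comparing predicted labels with actual labels. It includes true "
                      "positives, true negatives, false positives, and false negatives.",
    "question": "Explain how a confusion matrix is used to evaluate the performance "
                "of a classification model.",
    "bloom_level": "Analyze",
    "answer_length": 32,
    "question_length": 17,
    "split": "train",
})
```

Reading the split from the stored label, rather than re-sampling at load time, is what keeps the partition identical across runs.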

Prior to model training, all textual data underwent a standardized preprocessing pipeline to reduce noise and ensure representational consistency. The normalization process included lowercasing all characters and removing punctuation and extraneous symbols. Tokenization was performed using the SentencePiece tokenizer associated with the T5 model, preserving subword structure and ensuring compatibility with the transformer architecture. Formally, each marking scheme can be represented as a token sequence:

M_{i} = \{w_{1}, w_{2}, \dots, w_{n}\}, \quad w_{j} \in V    (1)

where V denotes the SentencePiece vocabulary. The average marking scheme length is 34.7 tokens, while the average question length is 21.3 tokens, indicating that the dataset predominantly contains semantically rich solutions paired with concise yet cognitively dense examination questions.
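The normalization step described above (lowercasing followed by punctuation removal) can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: it strips ASCII punctuation only, and the whitespace collapsing at the end is an added convenience that the source does not mention.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text: str) -> str:
    """Lowercase the text, drop ASCII punctuation, and collapse whitespace."""
    lowered = text.lower()
    stripped = lowered.translate(_PUNCT_TABLE)
    # Assumption: runs of whitespace left behind by removed symbols are collapsed.
    return " ".join(stripped.split())
```

Note that removing punctuation also removes hyphens, so a term such as "true-positive" becomes "truepositive" under this sketch; a production pipeline might prefer to replace hyphens with spaces first.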

To obtain semantically informative representations, each tokenized marking scheme was encoded using MPNet, a transformer encoder optimized for natural language understanding tasks. The encoding function projects the token sequence into a continuous embedding space:

m_{i} = f_{MPNet}(M_{i}) \in \mathbb{R}^{d}    (2)

where d = 768 denotes the embedding dimensionality. These embeddings serve as semantic conditioning inputs to the T5-based generative model described in Section 3.2. The exclusive focus on Data Science courses provides a homogeneous domain-specific benchmark, facilitating controlled investigation of the reverse question engineering task while limiting cross-disciplinary variability.

The complete preprocessing workflow applied to construct the final training corpus is summarized in Algorithm 1.

Algorithm 1: Dataset Preprocessing Pipeline

Input: Raw dataset D_raw containing (marking_scheme, question, bloom_level)
Output: Processed dataset D_processed with embeddings and metadata

Initialize empty dataset D_processed
For each record r in D_raw do
    Extract M_i ← r.marking_scheme
    Extract Q_i ← r.question
    Extract B_i ← r.bloom_level
    // Text normalization
    M_i ← lowercase(M_i)
    Q_i ← lowercase(Q_i)
    M_i ← remove_punctuation(M_i)
    Q_i ← remove_punctuation(Q_i)
    // Tokenization (SentencePiece tokenizer of the T5 model)
    tokens_M ← SentencePiece_tokenize(M_i)
    tokens_Q ← SentencePiece_tokenize(Q_i)
    // Length computation
    answer_length ← length(tokens_M)
    question_length ← length(tokens_Q)
    // Semantic encoding
    m_i ← MPNet_encode(tokens_M)   // 768-dimensional embedding
    Store processed record:
        {sample_id, tokens_M, tokens_Q, m_i, B_i,
         answer_length, question_length}
End For
Deterministically split D_processed into:
    70% training set
    15% validation set
    15% test set
Return D_processed
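The final step of Algorithm 1, the deterministic 70/15/15 split, could be implemented roughly as follows. The seed value and the choice to shuffle sorted sample identifiers are assumptions made for illustration; the source states only that the split is deterministic and fixed before training.

```python
import random

def deterministic_split(sample_ids, seed=42):
    """Split identifiers 70/15/15 in a way that is repeatable across runs.

    The seed (42) is a hypothetical choice; any fixed value gives a stable split.
    """
    ids = sorted(sample_ids)             # canonical order, independent of input order
    random.Random(seed).shuffle(ids)     # local RNG, does not touch global state
    n = len(ids)
    n_train = n * 70 // 100              # integer arithmetic avoids float rounding
    n_val = n * 15 // 100
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],   # remainder goes to the test set
    }
```

With the 7021 records reported in the text, this rounding scheme would yield 4914 training, 1053 validation, and 1054 test instances; the exact counts in the published split may differ depending on how the authors rounded.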

The overall statistics of the dataset are presented in Table 1. The average answer length of 34.7 tokens indicates that the dataset predominantly contains extended marking schemes rather than short factual spans, while the average question length of 21.3 tokens reflects the compact yet semantically dense nature of exam-style questions. The exclusive focus on Data Science courses provides a homogeneous but domain-specific benchmark, enabling in-depth exploration of the reverse-engineering task.

In addition to deterministic random splits (70/15/15), a course-wise split evaluation was conducted to assess internal generalization across distinct Data Science courses.

3.2. Reverse Question Generation Model

To generate questions from marking schemes, this study employs a sequence-to-sequence transformer based on T5-small. The encoder takes the semantic embeddings of a marking scheme as input, while the decoder autoregressively generates the corresponding exam question. The primary training objective maximizes the conditional likelihood of the question given the marking scheme:

L(θ) = -\sum_{i=1}^{N} \log P_{θ}(Q_{i} \mid M_{i})    (3)

where P_{θ} represents the probability distribution parameterized by the transformer with parameters θ. Training is performed with teacher forcing to accelerate convergence, and the AdamW optimizer is applied with a learning rate of 3 \times 10^{-5}.
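To make the objective in Equation (3) concrete, the sketch below computes the negative log-likelihood from per-token probabilities. It relies on the standard autoregressive factorization, in which the probability of a question is the product of the probabilities assigned to each gold token given the gold prefix (which is what teacher forcing provides). The function names and input layout are illustrative assumptions.

```python
import math

def sequence_nll(token_probs):
    """Negative log-likelihood of one question.

    token_probs[t] is the probability the model assigned to the gold token at
    step t, given the marking scheme and the gold prefix (teacher forcing).
    """
    return -sum(math.log(p) for p in token_probs)

def corpus_nll(batch):
    """Sum of sequence-level losses over all (marking scheme, question) pairs."""
    return sum(sequence_nll(token_probs) for token_probs in batch)
```

For example, a two-token question whose gold tokens each receive probability 0.5 contributes 2 ln 2 ≈ 1.386 to the loss, while a perfectly predicted question contributes zero.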

To reinforce semantic fidelity, the framework introduces a reconstruction mechanism. After a candidate question {\hat{Q}}_{i} is generated, it is passed into a secondary T5 model g_{ϕ}, which attempts to regenerate the original marking scheme, represented as an embedding {\hat{M}}_{i}. A reconstruction loss ensures that these regenerated embeddings closely approximate the original marking scheme embeddings (written M_{i} in this subsection, corresponding to m_{i} in Equation (2)):

L_{rec}(θ, ϕ) = \sum_{i=1}^{N} ∥ {\hat{M}}_{i} - M_{i} ∥_{2}^{2}    (4)

The reconstruction loss is computed in continuous embedding space rather than on discrete token sequences. Specifically, the secondary T5 model first regenerates the marking scheme in textual form. The regenerated text is then encoded using the same MPNet encoder to obtain its semantic embedding {\hat{M}}_{i} \in \mathbb{R}^{768}. The L2 loss is subsequently calculated between the original embedding M_{i} and the regenerated embedding {\hat{M}}_{i}.

This design ensures that reconstruction fidelity is measured at the semantic representation level rather than through categorical cross-entropy over tokens, thereby aligning the optimization objective with semantic preservation rather than surface-form exactness.
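The reconstruction term in Equation (4) reduces to a sum of squared Euclidean distances between embedding pairs. A minimal pure-Python sketch follows, with embeddings represented as plain lists of floats; in the described setup each vector would have 768 entries produced by MPNet, which is outside the scope of this snippet.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equally sized vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    return sum((a - b) ** 2 for a, b in zip(u, v))

def reconstruction_loss(regenerated, original):
    """Sum of squared L2 distances over all embedding pairs, as in Equation (4)."""
    if len(regenerated) != len(original):
        raise ValueError("need one regenerated embedding per original embedding")
    return sum(squared_l2(r, o) for r, o in zip(regenerated, original))
```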

The final objective combines the primary sequence generation loss and the reconstruction loss:

L_{total} = L(θ) + λ L_{rec}(θ, ϕ)    (5)

where the hyperparameter λ controls the contribution of the reconstruction term. In this study, λ was empirically set to 0.3, providing a balance between question fluency and semantic preservation.
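As a worked illustration of the combined objective, suppose a batch yields a generation loss of 2.10 and a reconstruction loss of 0.50. Both numbers are hypothetical and chosen only to show the arithmetic; only λ = 0.3 comes from the text.

```latex
L_{total} = L(\theta) + \lambda L_{rec}(\theta, \phi)
          = 2.10 + 0.3 \times 0.50
          = 2.25
```

With this weighting the reconstruction term contributes 0.15 of the 2.25 total, so it nudges the generator toward semantic preservation without dominating the likelihood term.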

To ensure reproducibility, the model was implemented using the HuggingFace Transformers library with the pre-trained T5-small backbone (60M parameters). Training was conducted for 10 epochs using the AdamW optimizer with a learning rate of 3 × 10⁻⁵ and a batch size of 16. Early stopping on validation loss was used to prevent overfitting. The reconstruction weight λ = 0.3 was selected after preliminary tuning on the validation set, balancing semantic fidelity and fluency.

All experiments were conducted using deterministic dataset splits (70% training, 15% validation, 15% test) defined prior to training to prevent data leakage. Random seeds were fixed for model initialization and data loading to ensure consistent reproducibility across runs.

Importantly, the T5-small model was selected as the core sequence-to-sequence generator to ensure computational feasibility and reproducibility within the available infrastructure. While larger transformer models and instruction-tuned large language models (LLMs) such as LLaMA 3, Mistral 1.1, or GPT-4 may offer higher representational capacity and potentially improved performance, resource constraints limited the scope of the present experiments. Accordingly, baseline comparisons focus on an unconstrained T5-small model and a rule-based template system, providing a meaningful reference for evaluating the contributions of semantic reconstruction and Bloom-level conditioning.

3.3. Bloom-Level Annotation and Conditioning

While semantic validity is necessary, it is insufficient to guarantee pedagogically useful questions. To embed instructional depth, the framework integrates Bloom’s Taxonomy as a guiding constraint. Each question in the dataset was manually annotated with one of six Bloom levels—Knowledge, Comprehension, Application, Analysis, Synthesis, or Evaluation. These levels serve as control tokens prepended to marking schemes during training, conditioning the model not only on the semantic content of the answer but also on the intended cognitive level.

Bloom-level labels were manually assigned by two subject-matter experts within the Department of Data Science using standard Bloom’s Taxonomy definitions. Disagreements were resolved through discussion to ensure consistent labeling. The final dataset therefore contains a single agreed-upon Bloom label per instance. During training, the Bloom level was prepended as a control token (e.g., <ANALYZE>) to the marking scheme input sequence, enabling explicit conditioning of the generative model on cognitive level.
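The control-token conditioning described above amounts to a small string transformation on the model input. The sketch below is illustrative: the `<ANALYZE>` token form follows the example in the text, while the single-space separator and the helper name `build_model_input` are assumptions.

```python
def build_model_input(bloom_level: str, marking_scheme: str) -> str:
    """Prepend a Bloom-level control token to the marking scheme text.

    Assumption: the token is the upper-cased label wrapped in angle brackets,
    followed by one space, mirroring the <ANALYZE> example in the text.
    """
    label = bloom_level.strip()
    if not label:
        raise ValueError("bloom_level must be a non-empty label")
    return f"<{label.upper()}> {marking_scheme}"
```

In a HuggingFace-based setup, such tokens would typically also need to be added to the tokenizer vocabulary (with the model's embedding matrix resized accordingly) so that each control token maps to a single learned embedding rather than being split into subwords.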

The modified conditional probability of generating the t-th question token is therefore expressed as

P(q_{t} \mid q_{<t}, M, b) = softmax(W h_{t})    (6)

where q_{t} is the t-th token of the question Q, q_{<t} denotes the previously generated tokens, b represents the Bloom-level embedding vector, h_{t} is the decoder hidden state at time step t, and W is the output projection matrix. The probability of the full question, P(Q \mid M, b), is the product of these per-token terms. This formulation allows the decoder to modulate question style and complexity in accordance with the targeted Bloom category. As such, the generated questions are not only semantically aligned with the marking scheme but also pedagogically coherent.
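The softmax projection in Equation (6) can be illustrated with a toy vocabulary. The snippet below is a sketch in pure Python: `W` is a list of rows (one per vocabulary entry) and `h_t` is the decoder hidden state, both with made-up sizes far smaller than those of T5-small.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(W, h_t):
    """Project the hidden state onto the vocabulary and normalize.

    W has one row per vocabulary entry; each logit is the dot product of a row
    with the hidden state h_t.
    """
    logits = [sum(w * h for w, h in zip(row, h_t)) for row in W]
    return softmax(logits)
```

The Bloom-level signal does not appear explicitly in this computation; it influences the result through the hidden state, which is conditioned on the control token in the encoder input.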

3.4. Workflow and System Architecture

The workflow of the framework comprises four tightly coupled stages: the examiner-provided marking scheme is first semantically encoded using MPNet; the resulting embedding is passed to the T5 generator augmented with Bloom-level constraints to produce candidate questions; the generated question is subsequently validated by a secondary T5 model that reconstructs the marking scheme; and finally, the reconstructed marking scheme is compared with the original, and Bloom-level alignment is verified to enforce pedagogical integrity.

This process can be understood as a closed-loop generative system in which both semantic reconstruction and pedagogical filtering ensure the validity of the generated questions. The workflow is illustrated in Figure 1, which depicts the data flow from marking scheme to final validated question.

The visual representation makes clear that the architecture does not simply produce a one-way mapping from answers to questions but instead incorporates recursive validation, ensuring both semantic integrity and educational coherence.
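The four stages of this closed loop can be sketched as a single function that takes the model components as callables. Everything here is a hypothetical skeleton: `encode`, `generate`, `reconstruct`, and `classify_bloom` stand in for MPNet, the primary T5 generator, the secondary T5 model, and the Bloom classifier, and the acceptance threshold `max_relative_error` is an assumed parameter that the source does not specify.

```python
import math

def relative_error(regenerated, original):
    """L2 distance between embeddings, scaled by the norm of the original."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(regenerated, original)))
    norm = math.sqrt(sum(x * x for x in original))
    if norm == 0.0:
        raise ValueError("original embedding has zero norm")
    return diff / norm

def closed_loop_check(marking_scheme, bloom_level, encode, generate,
                      reconstruct, classify_bloom, max_relative_error=0.2):
    """Run the four workflow stages and report whether the candidate passes."""
    original = encode(marking_scheme)               # stage 1: semantic encoding
    question = generate(original, bloom_level)      # stage 2: constrained generation
    regenerated = encode(reconstruct(question))     # stage 3: reconstruct and re-encode
    error = relative_error(regenerated, original)   # stage 4: compare with original
    bloom_ok = classify_bloom(question) == bloom_level
    return {
        "question": question,
        "relative_error": error,
        "bloom_ok": bloom_ok,
        "accepted": error <= max_relative_error and bloom_ok,
    }
```

Passing the components in as arguments keeps the control flow testable with simple stubs, independent of any particular model library.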

3.5. Experimental Setup

The evaluation of the proposed AI-powered NLP framework was conducted using a dataset of 7021 examiner-provided solutions and their corresponding questions collected from the Department of Data Science, Sol Plaatje University, South Africa. Consistent with Section 3.1, the dataset was divided into training, validation, and test subsets in a 70:15:15 ratio to facilitate reliable model development and assessment. All textual data were preprocessed through normalization, including lowercasing and punctuation removal, followed by tokenization using the SentencePiece tokenizer of the T5 model. Semantic representations of marking schemes were generated using MPNet embeddings, which served as input to the T5-small sequence-to-sequence generative model.

To benchmark the performance of the reverse question generation framework, two categories of comparative models were employed. The first comprised conventional sequence-to-sequence transformer models fine-tuned in the same manner as the proposed framework but without the reconstruction loss or Bloom-level conditioning. This comparison enabled quantification of the contributions of the semantic reconstruction and pedagogical constraint mechanisms. The second category consisted of rule-based question generation systems, which leveraged keyword extraction and syntactic templates to produce candidate questions from examiner solutions. These baselines provided a reference for evaluating improvements attributable to deep learning and constrained generative modeling, highlighting the added value of semantic and pedagogical integration.
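The source describes the rule-based baseline only as keyword extraction combined with syntactic templates. The following is therefore a purely hypothetical sketch of that style of system, not the baseline actually used: the stopword list, the frequency-based keyword ranking, and the template string are all assumptions.

```python
import string

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "by", "with", "is", "are", "it"}
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def extract_keywords(text, k=2):
    """Return the k most frequent non-stopword tokens, ties broken by first occurrence."""
    tokens = [t for t in text.lower().translate(_PUNCT_TABLE).split() if t not in STOPWORDS]
    counts, first_seen = {}, {}
    for position, token in enumerate(tokens):
        counts[token] = counts.get(token, 0) + 1
        first_seen.setdefault(token, position)
    ranked = sorted(counts, key=lambda t: (-counts[t], first_seen[t]))
    return ranked[:k]

def rule_based_question(marking_scheme, template="Explain the concept of {topic}."):
    """Fill a fixed question template with keywords taken from the marking scheme."""
    keywords = extract_keywords(marking_scheme)
    if not keywords:
        raise ValueError("no keywords found in marking scheme")
    return template.format(topic=" ".join(keywords))
```

A system of this kind has no notion of cognitive level or of what the marking scheme is actually asking for, which is consistent with the weak Bloom-level accuracy the paper reports for its rule-based baseline.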

A set of complementary metrics was selected to assess both semantic fidelity and pedagogical quality of the generated questions. Standard NLP evaluation measures—including BLEU, ROUGE-L, and METEOR—quantified n-gram overlap, longest common subsequence, and semantic similarity between generated and ground-truth questions, providing a robust indication of linguistic and structural fidelity. To evaluate the preservation of semantic content from marking schemes, reconstruction fidelity (RF) was computed using the embeddings of the original and reconstructed marking schemes via the secondary T5 model, as follows:

RF = 1 - \frac{\sum_{i=1}^{N} ∥ {\hat{M}}_{i} - M_{i} ∥_{2}}{\sum_{i=1}^{N} ∥ M_{i} ∥_{2}}

where {\hat{M}}_{i} represents the reconstructed marking scheme and M_{i} the original. Higher RF values indicate stronger retention of semantic information in the generated questions. Pedagogical alignment was assessed through Bloom-level classification accuracy, in which each generated question was passed through a pre-trained Bloom classifier to predict its cognitive level. Accuracy was then defined as the proportion of generated questions whose predicted cognitive level matched the manually annotated level, providing an empirical measure of whether the system successfully produced questions that are both semantically correct and cognitively appropriate.
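Both metrics defined above are straightforward to compute once embeddings and predicted labels are available. The sketch below follows the RF formula as written (a ratio of summed, unsquared L2 norms) and the stated definition of Bloom-level accuracy; the function names and the list-of-floats embedding representation are illustrative assumptions.

```python
import math

def l2_norm(vector):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))

def reconstruction_fidelity(reconstructed, original):
    """RF = 1 - (sum of ||M_hat_i - M_i||) / (sum of ||M_i||)."""
    numerator = sum(
        l2_norm([a - b for a, b in zip(r, o)])
        for r, o in zip(reconstructed, original)
    )
    denominator = sum(l2_norm(o) for o in original)
    if denominator == 0.0:
        raise ValueError("original embeddings have zero total norm")
    return 1.0 - numerator / denominator

def bloom_accuracy(predicted, annotated):
    """Fraction of generated questions whose predicted level matches the annotation."""
    if not annotated or len(predicted) != len(annotated):
        raise ValueError("need equally sized, non-empty label lists")
    matches = sum(p == a for p, a in zip(predicted, annotated))
    return matches / len(annotated)
```

RF equals 1 for a perfect reconstruction and falls to 0 when the total reconstruction error is as large as the total norm of the original embeddings; it can become negative if the error exceeds that norm.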

All relevant hyperparameters used in the experiments—including the generator and encoder models, optimizer, learning rate, batch size, sequence lengths, number of epochs, and reconstruction weight—are summarized in Table 2. These configurations were determined through preliminary validation experiments designed to balance training efficiency with optimal model performance, ensuring that the study can be independently replicated while maintaining transparency and rigor.

4. Results and Evaluation

The performance of the proposed AI-powered NLP framework for reverse-engineering examination questions from marking schemes was evaluated using a dataset of 7021 aligned (marking scheme, question) pairs obtained from the Department of Data Science, Sol Plaatje University, South Africa. The evaluation focused on three key dimensions: semantic fidelity of the generated questions, reconstruction fidelity to preserve semantic content from marking schemes, and pedagogical alignment with Bloom’s Taxonomy. The dataset was partitioned into training, validation, and test subsets in a 70:15:15 ratio, and all textual data were preprocessed through normalization and tokenization using the SentencePiece tokenizer of the T5 model, with semantic embeddings generated via MPNet to serve as input to the T5-small generative model.

To contextualize the performance of the proposed framework, two categories of baselines were employed. The first baseline consisted of a conventional T5-small sequence-to-sequence model trained under the same conditions as the proposed framework but without reconstruction loss or Bloom-level conditioning. The second baseline was a rule-based question generation system that leveraged keyword extraction and syntactic templates to produce candidate questions from examiner solutions. While these baselines provide a meaningful reference for evaluating the contributions of semantic reconstruction and pedagogical constraints, they represent relatively modest benchmarks compared to the current landscape of large-scale neural architectures. Larger transformer models and instruction-tuned large language models (LLMs), such as LLaMA 3 or GPT-4, are likely to offer higher performance due to their increased representational capacity. However, T5-small was selected as the generator for these experiments to ensure computational feasibility and reproducibility within the available infrastructure. To address this limitation, a scale-controlled T5-base comparison is reported in Section 4.1 and fine-tuned LLM baselines are reported in Section 4.4.

4.1. Semantic Fidelity and Similarity Metrics

To rigorously evaluate the quality of the generated examination questions, multiple complementary metrics were employed to capture lexical overlap, semantic preservation, and pedagogical alignment. BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated and reference questions, providing an indication of surface-level similarity. ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence) evaluates structural alignment via the longest common subsequence, offering tolerance to moderate word-order variation. METEOR (Metric for Evaluation of Translation with Explicit ORdering) incorporates unigram precision and recall with stemming and synonym matching, enabling more semantically sensitive comparison than pure n-gram overlap.
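To make the longest-common-subsequence idea behind ROUGE-L concrete, the sketch below computes an LCS-based score for one generated and one reference question. Whitespace tokenisation, lowercasing, and the balanced F1 combination are simplifying assumptions; standard ROUGE implementations may tokenise and weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based precision and recall combined into a balanced F1 score."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical pair: four of five tokens appear in the same order.
score = rouge_l_f1("explain the bias variance tradeoff",
                   "define the bias variance tradeoff")
```

Because the LCS only requires tokens to appear in the same relative order, inserted or substituted words lower the score less sharply than they would under strict n-gram matching.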

Beyond lexical similarity, Reconstruction Fidelity (RF) was used to assess semantic preservation within the proposed architectural framework. RF is computed as the cosine similarity between MPNet embedding representations of the original marking scheme and the reconstructed marking scheme generated from the produced question. Importantly, RF functions as an internal consistency diagnostic rather than an independent external metric. It evaluates whether sufficient semantic information is retained for reconstruction within the model pipeline. Bloom-Level Accuracy was additionally used to quantify pedagogical alignment, defined as the proportion of generated questions whose predicted cognitive level matches the annotated Bloom label.

Table 3 summarizes semantic evaluation results across lexical metrics for all baselines, including a stronger seq2seq comparison (T5-base) trained under identical preprocessing and optimization settings.

The inclusion of T5-base provides a stronger scale-controlled baseline. While the larger model achieves moderate improvements over unconstrained T5-small, it remains below the full proposed framework that integrates reconstruction loss and Bloom-level conditioning. This suggests that the observed gains are attributable to architectural enhancements rather than model size alone.

These results establish improved semantic alignment under automatic evaluation metrics within the studied domain. However, automatic metrics do not fully capture question clarity, ambiguity, or educational appropriateness, and they should be interpreted as quantitative indicators rather than definitive pedagogical validation.

4.2. Reconstruction Fidelity

The ability of the framework to preserve semantic content was further evaluated using Reconstruction Fidelity (RF). RF measures the cosine similarity between MPNet embedding vectors of the original marking scheme and the reconstructed marking scheme generated from the model’s question output. Because this metric operates in embedding space and reflects the internal reconstruction pathway, it is interpreted as an architectural consistency measure rather than an external evaluation benchmark. Table 4 reports reconstruction fidelity scores across models.
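The embedding-space comparison described here can be sketched as a cosine similarity between two vectors. Averaging over the test set is an assumption about how per-item scores are aggregated, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def mean_cosine_fidelity(original_embs, reconstructed_embs):
    """Average cosine similarity over aligned embedding pairs (assumed)."""
    if len(original_embs) != len(reconstructed_embs) or not original_embs:
        raise ValueError("need two non-empty, aligned lists of embeddings")
    sims = [cosine_similarity(o, r)
            for o, r in zip(original_embs, reconstructed_embs)]
    return sum(sims) / len(sims)
```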

The proposed framework achieves the highest RF score, indicating stronger semantic retention within the reconstruction loop. Notably, increasing model scale alone (T5-base without reconstruction loss) improves reconstruction modestly compared to T5-small, but does not match the performance of the full architecture. This supports the role of reconstruction loss as a meaningful structural constraint that encourages semantic completeness in generated questions.

While high RF values indicate strong embedding-level alignment between original and regenerated marking schemes, this metric does not independently guarantee pedagogical quality or real-world exam suitability. Instead, RF complements lexical metrics by verifying semantic information preservation within the model’s internal representation.

4.3. Ablation Study

To isolate the contribution of individual architectural components, a series of quantitative ablation experiments was conducted on the same held-out test split used for primary evaluation. Specifically, we examined the performance impact of (i) removing the reconstruction loss, (ii) removing Bloom-level conditioning, and (iii) varying the reconstruction weighting coefficient λ ∈ {0.1, 0.3, 0.5}. All models were trained under identical preprocessing, tokenization, and optimization settings to ensure fair comparison. Table 5 summarizes the results across lexical metrics and reconstruction fidelity.

Removing the reconstruction loss results in the most pronounced degradation across all lexical metrics, with BLEU decreasing from 0.71 to 0.64 and reconstruction fidelity dropping substantially from 0.84 to 0.69. This confirms that the reconstruction mechanism plays a central role in enforcing semantic completeness during question generation. Without the reconstruction constraint, the model tends to produce linguistically plausible but less semantically grounded questions.

Eliminating Bloom-level conditioning produces a more moderate decline in lexical performance, suggesting that Bloom embeddings primarily influence pedagogical alignment rather than surface-level semantic similarity. Notably, reconstruction fidelity remains relatively stable when Bloom conditioning is removed (RF = 0.82), indicating that the reconstruction mechanism operates largely independently of cognitive-level guidance.

The λ sensitivity analysis further illustrates the balancing role of reconstruction weighting. A lower weight (λ = 0.1) reduces semantic enforcement, leading to modest decreases in BLEU and RF. Conversely, a higher weight (λ = 0.5) slightly increases reconstruction fidelity (RF = 0.86) but does not substantially improve lexical similarity, suggesting diminishing returns when reconstruction is overly emphasized. The selected value λ = 0.3 provides a balanced trade-off between semantic preservation and fluent generation.
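The role of λ can be illustrated with a weighted sum of the two training terms. The additive form below is an assumption made for illustration, and the loss values are placeholders rather than measured quantities.

```python
def combined_loss(generation_loss, reconstruction_loss, lam=0.3):
    """Generation term plus lam times the reconstruction term (assumed form)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return generation_loss + lam * reconstruction_loss


# Placeholder loss values: the reconstruction term counts for more as lam grows.
for lam in (0.1, 0.3, 0.5):
    print(lam, combined_loss(2.0, 1.0, lam))
```

With λ = 0 the objective reduces to the generation term alone, which corresponds to the ablation without reconstruction loss.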

Taken together, the ablation results demonstrate that both reconstruction loss and Bloom-level conditioning contribute measurably to model performance, with reconstruction serving as the dominant driver of semantic fidelity and Bloom embeddings supporting cognitive alignment. These findings substantiate the architectural design choices and confirm that improvements cannot be attributed solely to model scale or training configuration.

4.4. Large Language Model Baseline Comparison

To align with current evaluation standards in contemporary NLP research, we incorporated two recent large language model (LLM) baselines: GPT-4o, developed by OpenAI, and LLaMA-3-8B-Instruct, released by Meta. Both models were fine-tuned on the same training split used for the proposed framework to ensure a fair comparison under identical data conditions. Fine-tuning followed a supervised instruction-style format, where marking schemes and Bloom-level indicators were provided as structured inputs and the corresponding question served as the target output.
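One way to picture the instruction-style format is a function that turns a record into a prompt and target pair. The template wording, the dictionary keys, and the sample record are hypothetical; the actual prompts used for fine-tuning are not reproduced here.

```python
def build_instruction_example(marking_scheme, bloom_level, question):
    """Format one training record as a prompt/completion pair (hypothetical)."""
    prompt = (
        "Generate the examination question for the marking scheme below.\n"
        f"Bloom level: {bloom_level}\n"
        f"Marking scheme: {marking_scheme}\n"
        "Question:"
    )
    # The ground-truth question is the supervised target.
    return {"prompt": prompt, "completion": " " + question.strip()}


# Hypothetical record, for illustration only.
example = build_instruction_example(
    marking_scheme="Overfitting occurs when a model fits noise in the training data.",
    bloom_level="Understand",
    question="Explain what is meant by overfitting.",
)
```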

Table 6 summarizes the comparative results across lexical similarity metrics and reconstruction fidelity. As expected, both LLM baselines achieve strong lexical performance. GPT-4o attains the highest BLEU (0.73) and ROUGE-L (0.70) scores, reflecting its large-scale pretraining and strong generative fluency. LLaMA-3-8B-Instruct demonstrates comparable performance, with BLEU of 0.72 and ROUGE-L of 0.69.

However, differences emerge when evaluating reconstruction fidelity. While the LLMs maintain competitive RF scores (0.79 and 0.81, respectively), the proposed reconstruction-informed architecture achieves the highest fidelity (0.84). This suggests that explicitly incorporating a reconstruction objective provides additional semantic grounding beyond what is implicitly learned through large-scale pretraining and fine-tuning alone.

To sum up, the results indicate that large instruction-tuned LLMs constitute strong baselines for reverse question generation. Nevertheless, the proposed framework remains competitive in lexical metrics while demonstrating superior structural alignment with marking schemes. These findings support the continued relevance of task-specific architectural constraints, even in the era of large pretrained models.

4.5. Human Expert Evaluation

To complement the automatic evaluation metrics and provide pedagogical validation beyond lexical and embedding-based similarity measures, a small-scale expert assessment was conducted. Two subject-matter lecturers in Data Science independently evaluated a randomly selected subset of 40 generated questions drawn from the held-out test split. The evaluators were not involved in model development and were blinded to the generation configuration. Each question was assessed using a 5-point Likert scale (1 = very poor, 5 = excellent) across four dimensions: clarity, correctness with respect to the marking scheme, absence of ambiguity, and alignment with the annotated Bloom cognitive level.

Inter-rater reliability was measured using Cohen’s kappa, yielding κ = 0.78, indicating substantial agreement between evaluators. The mean ratings across the four dimensions are reported in Table 7.
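For reference, Cohen's kappa compares observed agreement with the agreement expected by chance. The sketch below implements the unweighted statistic for two raters; whether an unweighted or weighted variant was applied to the Likert ratings is not stated, so the unweighted form is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.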

The results indicate that the majority of generated questions were judged to be clear, semantically aligned with the intended marking schemes, and consistent with the specified cognitive levels. Correctness received the highest mean score (4.35), suggesting that the reconstruction-informed architecture effectively preserves answer-relevant content during generation. Bloom-level alignment (mean = 4.22) further supports the contribution of cognitive conditioning in guiding question formulation. Slightly lower scores for ambiguity reflect occasional instances of phrasing that could benefit from minor refinement, although overall ratings remained strongly positive.

These findings provide qualitative support for the automatic evaluation metrics presented earlier, indicating that embedding-level semantic preservation and lexical similarity correspond reasonably well with expert judgments in this domain. However, this evaluation remains limited in scale and disciplinary scope. While the results offer preliminary evidence of pedagogical soundness within Data Science assessment contexts, broader validation involving multiple institutions, larger samples, and cross-disciplinary reviewers would be necessary to establish general educational robustness.

4.6. Pedagogical Alignment

The effectiveness of Bloom-level integration was evaluated by measuring the Bloom classification accuracy of generated questions. Table 8 summarizes the alignment of predicted cognitive levels with the annotated levels in the test dataset. The proposed framework achieved an accuracy of 0.79, demonstrating a high degree of correspondence between intended and generated cognitive levels. In comparison, T5-small without Bloom embeddings achieved only 0.56, while rule-based approaches were limited to 0.43.

These results show the importance of incorporating pedagogical constraints into the generative process. The proposed model is capable of generating questions that not only reflect the content of examiner solutions but also align with the intended cognitive depth, a crucial factor for ensuring assessment validity and fairness. This alignment is further validated by Figure 2, which presents the confusion matrix, offering a detailed view of Bloom-level prediction performance. The confusion matrix complements the overall accuracy metric and reinforces the conclusion that embedding Bloom-level constraints substantially enhances the model’s adherence to the intended cognitive levels.
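A confusion matrix of annotated against predicted Bloom levels, and the accuracy derived from it, can be computed as in the following sketch. The label list and the toy annotations are illustrative assumptions.

```python
from collections import Counter


def confusion_matrix(annotated, predicted, labels):
    """Rows index the annotated level; columns index the predicted level."""
    counts = Counter(zip(annotated, predicted))
    return [[counts[(gold, pred)] for pred in labels] for gold in labels]


def accuracy_from_matrix(matrix):
    """Share of items on the diagonal, i.e. predicted level equals annotation."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy labels (hypothetical): one Application item is predicted as Analysis.
labels = ["Application", "Analysis"]
annotated = ["Application", "Application", "Analysis", "Analysis"]
predicted = ["Application", "Analysis", "Analysis", "Analysis"]
matrix = confusion_matrix(annotated, predicted, labels)
print(matrix, accuracy_from_matrix(matrix))  # [[1, 1], [0, 2]] 0.75
```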

4.7. Training Convergence

Figure 3 illustrates the training and validation loss curves across 10 epochs. The steady decline of both losses indicates stable convergence, with the reconstruction component reducing variance compared to the unconstrained baseline. These results indicate that the proposed model is capable of generating reliable questions from available answers.

4.8. Question Length and Distribution Analysis

To further examine structural properties of the generated outputs, Figure 4 presents the empirical distribution of generated question lengths measured in tokens. The histogram illustrates a unimodal distribution centered within the 18–25 token interval, with a gradual tapering toward both shorter and longer sequences. Only a small proportion of questions fall below 12 tokens or exceed 32 tokens, indicating controlled variability in generation length.

The observed distribution suggests that the model produces questions of moderate length, consistent with typical short-answer and conceptual assessment items within the dataset. Importantly, the absence of extreme outliers indicates that the decoding strategy does not systematically favor overly concise or excessively verbose formulations. This structural consistency complements the previously reported semantic evaluation metrics by demonstrating that output fluency is accompanied by stable length characteristics.
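A length distribution of this kind can be tabulated with a simple binning routine. Whitespace tokenisation and the bin width are assumptions made for illustration; the token counts reported above may be based on a subword tokenizer.

```python
from collections import Counter


def length_histogram(questions, bin_width=4):
    """Count questions per length bin, keyed by the lower edge of each bin."""
    lengths = [len(q.split()) for q in questions]  # whitespace tokens (assumed)
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(bins.items()))


# Toy inputs with 3, 5, and 6 tokens respectively.
hist = length_histogram(["a b c", "a b c d e", "a b c d e f"])
print(hist)  # {0: 1, 4: 2}
```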

4.9. Semantic Similarity Visualization

In addition to quantitative metrics, a scatter plot of cosine similarity between generated questions and ground-truth questions was constructed to visualize semantic alignment across the test set. Figure 5 shows that most points are clustered near a similarity score of 0.85, reflecting high consistency between model outputs and human-designed questions. Outliers correspond to complex multi-part questions where minor deviations in phrasing produced lower similarity scores, which are nonetheless pedagogically valid.

This visualization complements the numerical evaluation, providing intuitive insight into the overall semantic fidelity of generated questions across the dataset.

4.10. Internal Generalization Analysis (Course-Wise Split Evaluation)

To evaluate robustness beyond a conventional random split, we conducted an internal generalization experiment using a course-wise partitioning strategy. In this setting, the model was trained on marking schemes from multiple Data Science courses and evaluated on a held-out course that was excluded entirely from training. This configuration provides a stricter test of generalization, as it requires the model to transfer learned representations across variations in instructional emphasis, terminology, and assessment style.
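The course-wise partition amounts to holding out every record from one course. In the sketch below the `course` field and the course names are hypothetical, since the record schema is not specified at this level of detail.

```python
def course_wise_split(records, held_out_course):
    """Train on all other courses; test only on the held-out course."""
    train = [r for r in records if r["course"] != held_out_course]
    test = [r for r in records if r["course"] == held_out_course]
    if not test:
        raise ValueError("held-out course has no records")
    return train, test


# Hypothetical records, for illustration only.
records = [
    {"course": "Machine Learning", "question": "q1"},
    {"course": "Machine Learning", "question": "q2"},
    {"course": "Data Visualisation", "question": "q3"},
]
train_set, test_set = course_wise_split(records, "Data Visualisation")
```

Unlike a random split, no item from the held-out course can appear in training, so course-specific phrasing cannot be memorised.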

Table 9 reports quantitative performance under both evaluation settings. As expected, performance under the course-wise split shows moderate degradation compared to the random split. BLEU decreases from 0.71 to 0.66, ROUGE-L from 0.68 to 0.63, and METEOR from 0.65 to 0.60. Reconstruction Fidelity exhibits a smaller decline, from 0.84 to 0.80. The relative reductions range between approximately 5% and 8%, indicating sensitivity to course-specific phrasing while maintaining substantial semantic alignment.
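The relative reductions follow directly from the scores quoted above, as the short calculation below shows. The dictionary and function names are illustrative.

```python
random_split = {"BLEU": 0.71, "ROUGE-L": 0.68, "METEOR": 0.65, "RF": 0.84}
course_split = {"BLEU": 0.66, "ROUGE-L": 0.63, "METEOR": 0.60, "RF": 0.80}


def relative_drop(before, after):
    """Fractional decrease relative to the random-split score."""
    return (before - after) / before


drops = {m: relative_drop(random_split[m], course_split[m]) for m in random_split}
for metric, drop in drops.items():
    print(f"{metric}: {drop:.1%}")
# BLEU: 7.0%
# ROUGE-L: 7.4%
# METEOR: 7.7%
# RF: 4.8%
```

The three lexical metrics each fall by about 7 to 8%, while RF falls by just under 5%.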

Figure 6 further visualizes this comparison using a grouped bar chart, illustrating the consistent but controlled performance drop across all metrics. The bar chart highlights two important patterns. First, the degradation is uniform rather than erratic, suggesting that the model does not fail catastrophically when exposed to unseen course material. Second, reconstruction fidelity remains comparatively stable relative to lexical metrics, supporting the claim that the reconstruction objective promotes answer-level semantic grounding even when surface-level wording differs.

Together, this analysis demonstrates that while course-specific stylistic variation impacts lexical similarity scores, the framework preserves core semantic structure across courses within the same disciplinary domain. These findings provide evidence of meaningful internal generalization rather than simple memorization of course-specific expressions. Nonetheless, broader validation across distinct academic disciplines remains necessary to establish external generalizability.

4.11. Discussion

The results of this study demonstrate that the proposed AI-powered NLP framework effectively addresses the challenge of reverse-engineering examination questions from examiner-provided marking schemes. The high BLEU, ROUGE-L, and METEOR scores indicate that the generated questions closely mirror the lexical and structural content of the reference questions, reflecting strong semantic fidelity. The Reconstruction Fidelity (RF) metric further confirms that the framework preserves essential semantic information, as generated questions consistently contain sufficient content to accurately reconstruct the original marking schemes. This semantic preservation is a direct outcome of the secondary reconstruction module, which enforces a closed-loop feedback mechanism, penalizing deviations from the original embeddings and thereby enhancing content retention.

Analysis of Bloom-level predictions provides additional insight into the pedagogical performance of the model. The confusion matrix reveals minor errors primarily between adjacent cognitive levels, such as “Application” and “Analysis.” These confusions suggest that while the framework captures nuanced cognitive distinctions, certain levels may share semantic overlap, making exact classification challenging. Nevertheless, the overall Bloom-level accuracy of 0.79 demonstrates that the model successfully generates questions aligned with intended cognitive complexity, highlighting the value of integrating Bloom embeddings as control tokens.

The combination of semantic reconstruction and Bloom-level conditioning also impacts the structural and stylistic quality of the generated questions. Questions produced by the unconstrained T5-small baseline frequently omitted critical semantic details or exhibited reduced pedagogical coherence, while rule-based templates, although syntactically valid, lacked cognitive alignment. In contrast, the proposed framework balances semantic completeness with cognitive specificity, producing questions that are both meaningful and functionally equivalent to human-authored items.

While the automatic evaluation metrics provide useful quantitative measures, they have inherent limitations. BLEU, ROUGE-L, and METEOR primarily assess surface-form similarity and may not fully capture semantic equivalence or pedagogical quality, particularly for paraphrased or multi-part questions. Similarly, RF, while indicative of semantic consistency, is partially dependent on the reconstruction module itself, and therefore may not serve as an independent measure of semantic correctness. These considerations underscore the need for larger-scale human-in-the-loop evaluation, beyond the small expert assessment in Section 4.5, to complement automatic metrics and provide expert assessment of question clarity, relevance, and cognitive appropriateness.

Finally, the results reflect the constraints of the dataset, which consists exclusively of Data Science courses from a single institution. This domain specificity may limit the generalizability of the framework to other subjects or institutional contexts. Although the model demonstrates robustness within the given dataset, further testing across disciplines and educational settings is necessary to confirm broader applicability.

Collectively, these findings establish that transformer-based architectures, when combined with semantic reconstruction and Bloom-level conditioning, can generate examination questions that are semantically accurate, pedagogically meaningful, and structurally coherent. The framework provides a scalable approach for automated exam construction, digital archiving, and adaptive assessment, while also highlighting areas for future research to improve generalization and validation through human expert evaluation.

5. Conclusions and Future Work

This study introduced and validated a novel AI-powered NLP framework for reverse-engineering examination questions from examiner-provided solutions, addressing a significant gap in intelligent assessment technologies. By combining semantic embeddings, transformer-based generation, reconstruction fidelity, and Bloom-level conditioning, the framework achieved strong performance across multiple metrics, including BLEU = 0.71, ROUGE-L = 0.68, METEOR = 0.65, reconstruction fidelity = 0.84, and Bloom-level accuracy = 0.79. These results demonstrate that the generated questions not only preserve semantic content but also align with intended cognitive levels, producing items that are structurally coherent, pedagogically meaningful, and functionally equivalent to human-authored questions.

Despite these encouraging findings, the study has several limitations. The dataset consists of 7021 aligned records exclusively from Data Science courses within a single university, which may limit generalizability to other subjects, institutions, or educational contexts. Evaluation relied primarily on automatic metrics, which, while informative, do not fully capture human judgments of clarity, appropriateness, or exam suitability. Furthermore, human evaluation was limited to a small-scale assessment of 40 questions by two Data Science lecturers (Section 4.5), and cross-disciplinary validation remains untested. These limitations highlight the need for cautious interpretation of the reported performance and motivate future work to improve robustness and applicability.

The framework also has practical implications for educational practice. By leveraging examiner-provided marking schemes, it supports digital archiving of legacy assessment materials and enables automated draft question generation, reducing preparation time for lecturers. Additionally, the system can facilitate adaptive assessment, providing tailored questions aligned with desired cognitive outcomes, and can be integrated into human-in-the-loop exam construction workflows, where educators review and refine AI-generated drafts. Collectively, these applications demonstrate the potential of AI-assisted assessment to enhance both efficiency and educational quality.

Future research will address several directions. First, cross-disciplinary and cross-institutional evaluation will assess generalization beyond Data Science courses. Second, multi-part, scenario-based, and higher-order questions will be incorporated to extend pedagogical coverage. Third, human-in-the-loop evaluation will provide expert validation of question clarity, cognitive alignment, and educational appropriateness. These efforts aim to advance AI-driven educational evaluation, supporting scalable, semantically accurate, and pedagogically meaningful automated assessment systems.

It is important to note that the current evaluation is restricted to Data Science courses from a single institution. Consequently, the present findings establish effectiveness within in-domain reverse question generation under relatively homogeneous curricular conditions. Generalization to other subjects and institutional contexts requires further empirical validation through the cross-disciplinary and cross-institutional evaluation outlined above.

Author Contributions

Conceptualization, J.O., S.F.V. and I.C.O.; methodology, J.O.; software, J.O.; validation, S.F.V. and I.C.O.; formal analysis, J.O.; investigation, S.F.V. and I.C.O.; resources, S.F.V. and I.C.O.; data curation, J.O.; writing—original draft preparation, J.O.; writing—review and editing, S.F.V. and I.C.O.; visualization, J.O.; supervision, S.F.V. and I.C.O.; project administration, S.F.V.; funding acquisition, I.C.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by I.C.O.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study comprises 7021 aligned marking scheme–question pairs collected from archived examination materials within the Department of Data Science at Sol Plaatje University, Kimberley, South Africa. Each record contains an examiner-provided solution, its corresponding ground-truth examination question, and associated pedagogical metadata (including Bloom’s Taxonomy annotations). All records were digitized, cleaned, and structured in JSON format to support automated preprocessing, semantic encoding, and model evaluation. Due to institutional data governance policies and the inclusion of official examination materials, the dataset is not publicly available. Access may be granted by the corresponding author upon reasonable request, subject to approval by Sol Plaatje University and compliance with applicable data protection and academic use regulations.

Acknowledgments

Special thanks to all the contributors of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. The Architectural Overview of the Proposed Framework.

Figure 2. Bloom-Level Confusion Matrix for Generated Questions (Proposed Model).

Figure 3. Training vs. Validation Loss across Epochs.

Figure 4. Distribution of Generated Question Lengths.

Figure 5. Cosine Similarity of Generated vs. Ground-Truth Questions.

Figure 6. Random vs. Course-Wise Split Performance.

Table 1. Dataset Statistics.

Statistic	Value
Total Samples	7021
Average Answer Length	34.7 tokens
Average Question Length	21.3 tokens
Domains Covered	Data Science (core + electives)

Table 2. Experimental Hyperparameters.

Parameter	Value
Generator Model	T5-small
Encoder Model	MPNet-base
Optimizer	AdamW
Learning Rate	3 × 10⁻⁵
Batch Size	32
Epochs	10
Max Seq. Length (Answer)	128 tokens
Max Seq. Length (Question)	64 tokens
Reconstruction Weight (λ)	0.3
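
For readers who want to reproduce the setup, the hyperparameters in Table 2 can be collected into a single configuration object. The following is a minimal sketch only: the class name `ExperimentConfig`, its field names, and the helper `combined_loss` are hypothetical, and the additive form of the objective (generation loss plus λ times reconstruction loss) is an assumption about how the reconstruction weight might be applied, not a statement of the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Values copied from Table 2; the field names are illustrative."""
    generator_model: str = "T5-small"
    encoder_model: str = "MPNet-base"
    optimizer: str = "AdamW"
    learning_rate: float = 3e-5
    batch_size: int = 32
    epochs: int = 10
    max_answer_tokens: int = 128
    max_question_tokens: int = 64
    reconstruction_weight: float = 0.3  # lambda in Table 2


def combined_loss(generation_loss: float,
                  reconstruction_loss: float,
                  config: ExperimentConfig) -> float:
    # ASSUMPTION: the two terms are combined additively,
    # L = L_gen + lambda * L_rec. The table only gives the weight itself.
    return generation_loss + config.reconstruction_weight * reconstruction_loss


config = ExperimentConfig()
print(config)
print(combined_loss(generation_loss=1.0, reconstruction_loss=0.5, config=config))
```

Keeping the values in a frozen dataclass makes it harder to change a setting accidentally between runs, and the ablation settings in Table 5 (λ = 0.1 and λ = 0.5) could be expressed as `ExperimentConfig(reconstruction_weight=0.1)` and `ExperimentConfig(reconstruction_weight=0.5)`.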

Table 3. Semantic Fidelity Metrics Across Baselines.

Model	BLEU	ROUGE-L	METEOR
Proposed Framework	0.71	0.68	0.65
T5-Base (No Reconstruction/Bloom)	0.65	0.62	0.60
T5-Small (No Constraints)	0.62	0.59	0.56
Rule-Based Baseline	0.48	0.45	0.42

Table 4. Reconstruction Fidelity (RF).

Model	Reconstruction Fidelity (RF)
Proposed Framework	0.84
T5-Base (No Reconstruction)	0.72
T5-Small (No Constraints)	0.68
Rule-Based Baseline	0.51

Table 5. Quantitative Ablation Results.

Model Variant	BLEU	ROUGE-L	METEOR	Reconstruction Fidelity (RF)
Full Model (λ = 0.3)	0.71	0.68	0.65	0.84
Without Reconstruction Loss	0.64	0.61	0.58	0.69
Without Bloom Conditioning	0.67	0.64	0.61	0.82
λ = 0.1	0.68	0.65	0.62	0.80
λ = 0.5	0.69	0.66	0.63	0.86

Table 6. LLM Baseline Comparison.

Model	BLEU	ROUGE-L	METEOR	Reconstruction Fidelity
T5-base	0.65	0.62	0.60	0.72
LLaMA-3-8B-Instruct (Fine-tuned)	0.72	0.69	0.66	0.79
GPT-4o (Fine-tuned)	0.73	0.70	0.67	0.81
Proposed Model	0.71	0.68	0.65	0.84

Table 7. Human Expert Evaluation Results (n = 40).

Criterion	Mean Score	Standard Deviation
Clarity	4.28	0.62
Correctness	4.35	0.55
Ambiguity	4.10	0.71
Bloom-Level Alignment	4.22	0.64

Table 8. Bloom-Level Classification Accuracy.

Model	Bloom Accuracy
Proposed Framework	0.79
T5-Small (No Bloom Embeddings)	0.56
Rule-Based Baseline	0.43

Table 9. Random vs. Course-wise Split Performance.

Evaluation Setting	BLEU	ROUGE-L	METEOR	Reconstruction Fidelity
Random Split	0.71	0.68	0.65	0.84
Course-wise Split	0.66	0.63	0.60	0.80
Relative Change (%)	−7.0%	−7.4%	−7.7%	−4.8%
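
The relative-change row of Table 9 can be checked with a short calculation. The sketch below assumes the change is measured against the random-split score, that is, (course-wise − random) / random × 100; the function name `relative_change_percent` and the dictionary names are illustrative, and "RF" abbreviates reconstruction fidelity.

```python
def relative_change_percent(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Scores copied from Table 9.
random_split = {"BLEU": 0.71, "ROUGE-L": 0.68, "METEOR": 0.65, "RF": 0.84}
course_split = {"BLEU": 0.66, "ROUGE-L": 0.63, "METEOR": 0.60, "RF": 0.80}

# Round to one decimal place, as reported in the table.
changes = {
    metric: round(
        relative_change_percent(random_split[metric], course_split[metric]), 1
    )
    for metric in random_split
}

for metric, change in changes.items():
    print(f"{metric}: {change:+.1f}%")
```

Rounded to one decimal place, these values agree with the reported row (−7.0%, −7.4%, −7.7%, and −4.8%).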

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Olaniyan, J.; Verkijika, S.F.; Obagbuwa, I.C. AI-Powered Natural Language Processing Framework for Reverse-Engineering Examination Questions from Marking Schemes. Computers 2026, 15, 204. https://doi.org/10.3390/computers15040204
