Adaptive Online Assessment in Higher Education: An Improved UEMD-GA Approach with Independent Rating

Li, Tianrui; Wang, Handong

doi:10.3390/app16115516


by Tianrui Li * and Handong Wang

School of Software, Henan University of Engineering, Zhengzhou 451191, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(11), 5516; https://doi.org/10.3390/app16115516

Submission received: 2 May 2026 / Revised: 20 May 2026 / Accepted: 23 May 2026 / Published: 2 June 2026


Abstract

This paper investigates personalized online learning platforms in higher education and their role in enhancing educational outcomes. Traditional programming assessment systems often struggle with scalability and limited functionality, failing to effectively leverage advanced recommendation algorithms and comprehensive data analytics. The proposed system integrates user interest profiles with capability metrics to deliver personalized exercise recommendations, while implementing a novel rating calculation method that eliminates dependency on external platforms. Built upon a decoupled microservice architecture using Spring Cloud and Vue.js, the system incorporates a simulated time decay that gives recent exercise records more weight in the ability estimate. We evaluated the approach on data extracted from a university’s online assessment system. The results show improvements in recommendation precision and system performance over baselines, suggesting the algorithm’s applicability and scalability. This research contributes valuable theoretical insights and practical methodologies for advancing online education technologies, particularly in the context of data-driven teaching evaluation and adaptive learning systems.

Keywords:

learning analytics; educational technology; personalized recommendation; online assessment system; system architecture

1. Introduction

In the broader domain of online learning, recommendation algorithms have achieved significant progress in handling complex, heterogeneous interactions. Traditional collaborative filtering often fails to capture the multi-faceted relationships among students, teachers, and learning materials. To address this limitation, Shou et al. [1] proposed LPRWHIN, an algorithm based on a weighted heterogeneous information network. Their method systematically integrates various object types (such as students, videos, exercises, and knowledge points) along with specific attribute values like interaction degrees, to automatically generate meaningful weighted meta-paths. Subsequently, the algorithm employs random walks and a Bayesian Personalized Ranking (BPR) framework to calculate personalized weights, translating intricate structural semantics into precise peer recommendations. This work demonstrates the necessity of structured data modeling for capturing complex educational interactions.

Beyond structural modeling, recommendation algorithms must also navigate the nuances of temporal data drift and generate highly interpretable outputs. Static recommendation models often falter when user states or environmental contexts shift dynamically. To counteract this, Yang et al. [2] introduced SNPERS, an innovative algorithm that synergizes statistical principles with natural language processing (NLP). By leveraging the probabilistic drift property of the Poisson distribution, the algorithm actively stabilizes conceptual drift, allowing for timely adjustments to recommendation plans. Furthermore, it employs a fused max function and extracts emotion words to quantify recommendation strength, eliminating mechanical suggestions while ensuring logical plausibility through a double-chain feedback mechanism. This work provides valuable insights into adapting recommendations to dynamic learning environments.

Most recently, advanced recommendation paradigms have shifted towards the synergistic fusion of deep cognitive models and large language models (LLMs). While knowledge tracing (KT) is essential for assessing student mastery, existing models are typically restricted to single-step predictions and lack effective feedback loops. To overcome this, Lin and Wu [3] developed LPReKL, a novel algorithmic framework where an LLM acts as a pedagogical planner to generate reference exercises, while an enhanced, masked KT model simulates multi-step knowledge states. Crucially, the algorithm introduces an iterative feedback mechanism: the KT model quantitatively evaluates the effectiveness of the generated path and guides the LLM to refine its output. The result is a dynamic, closed-loop optimization system that bridges generative capabilities with rigorous quantitative assessment, establishing a promising paradigm for next-generation intelligent tutoring systems.

Furthermore, recommendation systems are increasingly expanding into IoT environments. In such real-world deployments, multi-criteria and multi-objective optimization are crucial for sustaining user engagement. Single-objective models frequently struggle to reconcile competing goals such as accuracy, novelty, and contextual relevance under dynamically evolving conditions. To address this challenge, Stitini and Kaloun [4] proposed a multi-objective contextual multi-criteria recommender system tailored for sustainable smart homes. By continuously capturing real-time data from IoT sensors—encompassing physiological indicators such as heart rate, stress levels, physical activity, and sleep quality—their framework dynamically categorizes users and generates personalized health interventions through a machine learning pipeline built upon classifiers such as Random Forest. Central to their approach is an explicit multi-criteria evaluation mechanism that balances accuracy against novelty-based diversity, employing various similarity metrics to ensure recommendations remain both relevant and exploratory. This work underscores the necessity of incorporating multi-objective optimization and real-time contextual awareness into recommendation architectures, particularly when deployed in environments where user states, environmental factors, and well-being objectives co-evolve, an insight equally applicable to educational contexts where cognitive loads and learning states fluctuate dynamically.

In the realm of computer science education, Online Judge (OJ) systems have evolved from niche competitive programming tools into pivotal platforms for automated assessment in higher education. Wasik et al. [5] classified these systems by objectives—education, recruitment, and compilation—highlighting their expanded utility. However, traditional frameworks often lack generalizability and functionality. Han et al. [6] noted that many platforms lack course-oriented features such as systematic test case management and robust anti-cheating mechanisms, limiting their pedagogical efficacy—a critical gap that our current work seeks to address.

The development of such systems is also often constrained by the scarcity of large-scale educational datasets. To address this challenge, Chen et al. [7] introduced ACcoding, a graph-structured dataset spanning six years of online judge (OJ) submissions. Their work overcomes scale limitations and enables applications like knowledge tracing. More importantly, their analysis demonstrates that leveraging structured behavioral data significantly enhances the accuracy of educational models. This finding highlights a critical shift from mere data scale to deep behavioral analysis, providing a theoretical basis for our method to incorporate fine-grained data.

Concurrently, pedagogical designs are shifting towards novice support. Mentari et al. [8] developed JPLAS, a platform engineered for beginners incorporating code typing and debugging exercises. Their findings suggest that focused introductory tasks facilitate syntax and logic mastery, highlighting the necessity of integrating pedagogical structures rather than prioritizing technical architecture alone.

Parallel to content design, a paradigm shift leverages learning analytics for intelligent assessment. Dai et al. [9] modeled learning engagement via submission sequences, while Rico-Juan et al. [10] employed Explainable AI (XAI) for student profiling, facilitating early performance prediction. In contrast, Messer et al. [11] found that current tools rely heavily on unit testing, often neglecting software engineering skills like maintainability. This results in binary feedback that offers little constructive guidance, underscoring the demand for adaptive systems with sophisticated algorithms.

However, designing effective adaptive systems requires more than just algorithmic sophistication; it must be grounded in pedagogical principles. Traditionally, recommender systems in educational contexts have often treated learners as general users, focusing on coarse-grained preferences rather than cognitive states. Recent research highlights the necessity of shifting from mere preference tracking to knowledge tracing [12]. This shift is crucial because the ultimate goal of educational recommendation is to facilitate learning. According to Self-Determination Theory and Flow Theory, mismatched problem difficulty leads to “need frustration,” a major barrier to learning, while well-matched challenges satisfy the need for competence and promote a flow state of deep engagement [13]. Therefore, effective adaptive algorithms must go beyond simple preference matching to continuously estimate a student’s fine-grained knowledge state, providing optimal challenges that minimize frustration and foster both flow experience and learning motivation.

Moreover, ensuring evaluation integrity remains critical. Karnalim [14] proposed a low-level structural approach using Bytecode analysis to detect plagiarism, effectively identifying obfuscation tactics like identifier renaming. While vital for fairness, implementing such rigorous mechanisms imposes significant technical complexities.

To address these limitations, this study designs a comprehensive online assessment system tailored for programming education. The principal contributions are as follows:

1. Development of an integrated educational platform. By combining a personalized repository with automated judging, the system enhances autonomous learning and computational thinking. Robust class management functionalities also alleviate administrative burdens, allowing educators to focus on pedagogical design.
2. Advancement of architecture and recommendation algorithms. A unified judging interface ensures high scalability. Furthermore, we propose the UEMD-GA algorithm for personalized recommendations. Avoiding complex deep learning architectures, this method achieves high accuracy with low computational overhead, ensuring interpretability and ease of deployment.
3. Analysis of experimental results and insights. Empirical validation using data from a higher education institution reveals the proposed system effectively mitigates stochastic biases through architectural diversity, with a strong positive correlation between exercise difficulty and student performance. The algorithm’s basic matching degree also significantly correlates with final recommendation scores, validating the proposed approach. These findings provide valuable insights for improving online assessment systems.

2. System Architecture and Overall Design

2.1. Current Technical Architectures

In the domain of higher education online competitions, HUSTOJ stands out as a widely adopted open-source platform. Constructed upon a technology stack comprising PHP and C++, the system leverages the scripting nature of PHP to facilitate a certain degree of flexibility, allowing for the dynamic addition or modification of modules to meet specific operational requirements. Nevertheless, the platform’s architectural design presents significant limitations. Specifically, the functional modules suffer from insufficient encapsulation, resulting in a high degree of code coupling. This structural flaw not only escalates the complexity of system maintenance—rendering the system susceptible to unintended side effects in unrelated modules during functional updates—but also obstructs the long-term stability and iterative evolution of the platform.

2.2. System Architecture

The system adopts a decoupled microservice architecture to ensure high cohesion and low coupling, optimizing both maintainability and scalability. Structurally, it is organized into three distinct layers: the client layer, the gateway layer, and the service layer [15]. The client layer utilizes Vue.js [16] and Element-UI [17] to construct a responsive user interface, leveraging virtual DOM for efficient rendering. The backend is powered by Spring Boot and Spring Cloud [18,19], which facilitate service discovery and load balancing. Furthermore, Docker containerization [20] is employed to create isolated, secure environments for code execution. This hierarchical design supports independent deployment and dynamic scaling, effectively accommodating the high-concurrency demands of educational scenarios. A detailed depiction of the system architecture is presented in Figure 1.

2.3. Database Design

The data persistence layer relies on MySQL [21] to guarantee ACID compliance for core business entities, such as user profiles and submission records. To mitigate database load during peak usage, Redis is integrated as a high-speed cache [22] for frequently accessed data and session management. The database schema is streamlined to support essential functionalities, centering on the Solution table for submission metadata. The detailed field definitions are presented in Table 1.

3. Formal Problem Definition

To facilitate the rigorous design and analysis of key modules—including the judging architecture, problem recommendation algorithm, and core judging functions—this section formally defines the essential terminology utilized throughout the study:

Definition 1.

Test Dataset: The test dataset D serves as the definitive benchmark for validating the correctness of code execution outcomes. It comprises a collection of individual test data instances, denoted as $D = \{d_{1}, d_{2}, \dots\}$.

Definition 2.

Test Sample Set: The test sample set F represents the aggregate of input samples associated with the problem, expressed as $F = \{f_{1}, f_{2}, \dots\}$. Each input sample $f_{i}$ maintains a direct correspondence with a specific test data instance $d_{i}$.

Definition 3.

Test Result Set: The test result set $D^{*}$ refers to the collection of output results generated following the execution of each test sample. It is defined as $D^{*} = \{d_{1}^{*}, d_{2}^{*}, \dots\}$, where each element $d_{i}^{*}$ signifies the result derived from the respective test sample.

Definition 4.

Judging Code Class: The Judging Code Class J constitutes the minimal information unit utilized by both front-end and back-end systems for evaluation purposes. This class encapsulates essential attributes, including user ID, problem ID, programming language, submitted code, and submission type. Furthermore, it integrates operational variables such as the test dataset D associated with the specific problem and the necessary container execution commands.

4. Implementation of Core Technologies

4.1. Judger Architecture

The user interaction layer is designed to streamline the submission and transmission of code for evaluation. To guarantee data integrity during transmission, a standardized communication protocol is implemented. This interface serves as a comprehensive portal, integrating essential functionalities such as problem selection, language specification, and an embedded code editor. Upon submission, user inputs are serialized into a unified JSON format [23] and dispatched to the backend services. This payload encapsulates critical metadata required for the evaluation process, including the unique problem identifier, programming language type, the source code content, and predefined resource thresholds such as time and memory limits. The backend module is tasked with receiving and parsing the incoming data stream, as well as orchestrating the compilation and execution processes. Leveraging containerization technology—specifically Docker—the system establishes isolated runtime environments for various programming languages. This architectural approach prevents interference between different execution contexts and enables fine-grained resource management during the execution phase. The detailed construction process is outlined in Algorithm 1:

Algorithm 1 Judging Engine Algorithm

    Input: Judging Code Class J
    Output: Judging result correctness information
    1. DoJudge(J) // Initialize judging logic
    2. JudgeCode.init() // Initialize judging parameters
    3. JudgeCode.choiceLanguage()
       // Instantiate Docker container based on language
    4. JudgeCode.compile()
    5. JudgeCode.execute()
    6. JudgeCode.checkAnswer()
    End
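The submission payload described before Algorithm 1 could look roughly like the following Python sketch. The field names (`user_id`, `problem_id`, `language`, `source_code`, `time_limit_ms`, `memory_limit_mb`) are assumptions made for illustration; the paper does not publish the actual JSON schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Submission:
    """Hypothetical submission payload; field names are illustrative only."""
    user_id: int
    problem_id: int
    language: str
    source_code: str
    time_limit_ms: int
    memory_limit_mb: int


def to_payload(sub: Submission) -> str:
    """Serialize the submission into a JSON string for the backend."""
    return json.dumps(asdict(sub), ensure_ascii=False)


def from_payload(raw: str) -> Submission:
    """Parse the JSON string back into a Submission, as a backend might."""
    return Submission(**json.loads(raw))
```

A round trip such as `from_payload(to_payload(sub)) == sub` is a cheap way to check that the client and backend agree on the fields.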

The Sandbox Environment [24] plays a pivotal role in ensuring the security and isolation of code execution. Within this sandbox, user code is confined to an isolated container, effectively preventing potential interference with the host system. To maintain system stability, critical resources—such as execution time, memory usage, and CPU allocation—are strictly regulated to preclude any adverse effects on the overall system performance. The Judging Engine module is subsequently responsible for evaluating the correctness of the user’s submission. The detailed construction logic of the sandbox is outlined in Algorithm 2:

Algorithm 2 Judging Process Algorithm

    Input: Judging Code Class J; Test Sample Set $F = \{f_{1}, f_{2}, \dots, f_{m}\}$
    Output: Test Result Set $D^{*} = \{d_{1}^{*}, d_{2}^{*}, \dots, d_{m}^{*}\}$

    Begin:
    1. WorkPath(J)
        // Generate unique working directory
    2. CreatContainer(J)
       // Create corresponding judging container
    3. ContainerRun(J)
       // Execute judging code class J in the container and return result information
    4. DeleteContainer(J)
       // Asynchronously delete container
    End
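The container step of Algorithm 2 might be realized by assembling a `docker run` command with resource limits, as in the sketch below. The image name, the `/judge` mount point, and the specific limits are hypothetical; the paper states only that Docker containers isolate execution and that time, memory, and CPU are restricted. The function only builds the argument list and does not start a container.

```python
from typing import List


def build_run_command(image: str, work_dir: str, run_cmd: List[str],
                      memory_mb: int, cpus: float) -> List[str]:
    """Build a `docker run` argv that isolates one judging job.

    The image, mount point, and limits are hypothetical choices.
    """
    return [
        "docker", "run", "--rm",        # remove the container once it exits
        "--network", "none",            # no network access for user code
        "--memory", f"{memory_mb}m",    # hard memory cap
        "--cpus", str(cpus),            # CPU cap
        "-v", f"{work_dir}:/judge",     # mount the unique working directory
        "-w", "/judge",                 # run inside the mounted directory
        image, *run_cmd,
    ]
```

The wall-clock time limit is not a `docker run` flag in this sketch; a caller would typically enforce it from outside, for example with the `timeout` argument of `subprocess.run`.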

4.2. Problem Recommendation Algorithm

To cater to the demands of personalized exercise recommendation in Online Judge (OJ) systems, this study draws upon the User-Exercise Matching Degree-Generate Algorithm (UEMD-GA) [25]. Leveraging historical submission records, the algorithm synthesizes the user’s interest in specific knowledge points with the alignment between user ability and problem difficulty. Through quantitative analysis, it constructs a multi-dimensional user-exercise matching degree matrix, which serves as a precise foundation for facilitating personalized recommendations. Furthermore, this paper proposes enhancements to the algorithm by refining the computation model for user ability values and optimizing specific vector parameters. Consequently, a robust framework for personalized exercise recommendation is derived, as illustrated in Figure 2.

Within this framework, the mathematical formulations and their corresponding interpretations are defined as follows:

Exercise Embedding Vector: To rigorously characterize the intrinsic attributes of an exercise, the embedding vector for the i-th exercise is defined as $q_{i} \in \mathbb{R}^{32}$. The vector dimensions are stratified into difficulty, knowledge point count, and supplementary features. The specific calculation is governed by Equation (1).

$$q_{i}[j] = \begin{cases} 1 - r_{i}, & j = 0, \\ \dfrac{n_{i}}{10}, & j = 1, \\ \dfrac{1}{|K_{i}|} \sum_{k \in K_{i}} D_{k}, & j \geq 2. \end{cases} \tag{1}$$

Remark 1.

The repetition of the ability value for $j \geq 1$ in the user vector (Equation (2)) is a deliberate design to introduce a global cognitive prior and facilitate subsequent scalability. Unlike end-to-end learnable embeddings that act as black boxes and are prone to overfitting in sparse educational data, this explicit representation forces the student’s cognitive state to uniformly influence the interaction across all feature dimensions. This not only ensures dimensional compatibility for the cross-compression operation but also provides the structural flexibility to seamlessly incorporate more fine-grained features in future expansions, guaranteeing both strict interpretability and robustness by injecting strong pedagogical priors.

Here, $r_{i}$ denotes the pass rate of exercise i. Given the inverse correlation between the pass rate and the objective difficulty, a positive characterization of difficulty is achieved via the term $1 - r_{i}$. The term $n_{i}$ signifies the count of knowledge point tags associated with exercise i; this value is normalized by a factor of 10 to confine the dimension to the interval $[0, 1]$. $K_{i}$ represents the set of knowledge points encompassed by exercise i, while $D_{k}$ indicates the base difficulty value of knowledge point k. The supplementary feature dimensions are populated using the average difficulty of the associated knowledge points, thereby ensuring both dimensional integrity and feature richness.
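Equation (1) can be sketched in a few lines of Python. The function and parameter names are illustrative assumptions; only the three-part structure of the vector comes from the text.

```python
from typing import List

DIM = 32  # embedding dimensionality stated in the text


def exercise_embedding(pass_rate: float, n_tags: int,
                       kp_difficulties: List[float]) -> List[float]:
    """Build q_i per Equation (1).

    kp_difficulties holds D_k for each knowledge point in K_i (non-empty).
    """
    avg_difficulty = sum(kp_difficulties) / len(kp_difficulties)
    # j = 0: difficulty, j = 1: scaled tag count, j >= 2: mean tag difficulty
    return [1.0 - pass_rate, n_tags / 10.0] + [avg_difficulty] * (DIM - 2)
```

Note that $n_{i}/10$ stays within $[0, 1]$ only when an exercise carries at most ten tags; the sketch does not clamp larger counts.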

User Feature Vector: This vector serves as the primary representation of the user’s proficiency level. It aligns its dimensionality with the exercise embedding vector (32 dimensions) to facilitate downstream vector operations. The embedding vector for user u is denoted as $u \in \mathbb{R}^{32}$, with the computational specifics detailed in Equation (2).

$$u[j] = \begin{cases} \mathrm{ability}_{u}, & j = 0 \quad (\text{ability dimension}), \\ \mathrm{ability}_{u}, & j \geq 1 \quad (\text{feature expansion dimension}). \end{cases} \tag{2}$$

In this formulation, $\mathrm{ability}_{u}$ signifies the normalized proficiency value of user u, constrained to the interval $[0, 1]$. The feature expansion dimensions corresponding to $j \geq 1$ are currently populated with the user’s ability value. This design strategy preserves structural space for the prospective incorporation of additional metrics, such as problem-solving velocity and the breadth of knowledge mastery.

Knowledge Point Interest Degree: The user’s interest in the knowledge points associated with an exercise is quantified based on the intersection of the exercise’s knowledge tags and the user’s historical problem-solving trajectory, reflecting the user’s focus on the relevant concepts. For a specific exercise q, the knowledge point interest degree, denoted as $I_{q}$, is formulated in Equation (3).

$$I_{q} = \frac{1}{|T_{q}|} \sum_{t \in T_{q}} \left( 1 - \frac{1}{1 + c_{t}} \right) \tag{3}$$

Here, $T_{q}$ denotes the set of knowledge point tags for exercise q, while $c_{t}$ signifies the cumulative count of problems solved by the user pertaining to knowledge point t. As the frequency of problem-solving for a specific knowledge point rises, the term $1 - \frac{1}{1 + c_{t}}$ converges towards 1, indicating a heightened level of interest. The final interest score for the exercise is derived by averaging the interest values across all associated knowledge points.
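A minimal sketch of Equation (3) follows. The dictionary-based representation of solved counts and the handling of untagged exercises are assumptions, not details given in the paper.

```python
from typing import Dict, Iterable


def interest_degree(tags: Iterable[str], solved_counts: Dict[str, int]) -> float:
    """I_q per Equation (3): mean of 1 - 1/(1 + c_t) over the exercise's tags."""
    tag_list = list(tags)
    if not tag_list:
        return 0.0  # assumption: an untagged exercise gets zero interest
    per_tag = (1.0 - 1.0 / (1.0 + solved_counts.get(t, 0)) for t in tag_list)
    return sum(per_tag) / len(tag_list)
```

For example, an exercise tagged with two knowledge points, one solved three times and one never practiced, scores (0.75 + 0) / 2 = 0.375. A tag the user has never practiced contributes 0, so an exercise whose tags are all new to the user has an interest degree of exactly 0.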

Difficulty Fitness: This metric quantifies the alignment between the user’s proficiency level and the objective difficulty of an exercise. Its primary objective is to preclude recommendations that are either excessively challenging or overly simplistic, thereby fulfilling the personalized recommendation strategy of “reachability with moderate effort,” which aligns with the Zone of Proximal Development theory. For a specific exercise q, the difficulty fitness $S_{q}$ is calculated as presented in Equation (4).

$$S_{q} = 1 - |d_{q} - a_{u}| \tag{4}$$

Here, $d_{q} = 1 - r_{q}$ denotes the objective difficulty of exercise q (where $r_{q}$ represents the pass rate), and $a_{u} \in [0, 1]$ signifies the normalized ability value of user u, the same quantity written as $\mathrm{ability}_{u}$ in Equation (2). The metric attains its maximum value of 1 when the user’s ability perfectly aligns with the exercise difficulty. Conversely, as the disparity between the two increases, the fitness value diminishes, indicating a lower degree of suitability.

Basic Matching Degree: The knowledge point interest degree and difficulty fitness are synthesized through a multiplicative model to derive the basic matching degree $B_{q}$ between the user and the exercise, as formulated in Equation (5).

$$B_{q} = I_{q} \times S_{q} \tag{5}$$

This computational approach embodies a strict filtering logic wherein a deficiency in either interest or capability alignment results in a significantly diminished matching score. Consequently, if the user lacks interest in the exercise or if the difficulty is misaligned with their proficiency, the basic matching degree declines sharply, thereby establishing a robust quantitative baseline for the subsequent recommendation process.
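The filtering effect of the multiplicative model in Equations (4) and (5) is easy to see in a short sketch; the function names are illustrative.

```python
def difficulty_fitness(pass_rate: float, ability: float) -> float:
    """S_q per Equation (4), with d_q = 1 - r_q."""
    return 1.0 - abs((1.0 - pass_rate) - ability)


def basic_matching(interest: float, fitness: float) -> float:
    """B_q per Equation (5): product of interest degree and difficulty fitness."""
    return interest * fitness
```

With a pass rate of 0.9 (difficulty 0.1) and an ability of 0.6, the fitness is 0.5; combined with an interest degree of 0.75, the basic matching degree is 0.375. If either factor is 0, the product is 0 regardless of the other.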

Cross-Compression Score: To further exploit the feature associations between users and exercises, a vector cross-compression unit is introduced to augment the basic matching degree. By executing dot product operations on the user and exercise embedding vectors, deep-seated correlations between features are effectively captured. Ultimately, a comprehensive recommendation score is generated by synthesizing the basic matching degree with the cross-compression score. The cross-compression score, denoted as $C_{u, q}$, is derived by computing the dot product of the user vector u and the exercise vector q, followed by a normalization process, as illustrated in Equation (6). The sum runs over the same zero-based component indices used in Equations (1) and (2).

$$C_{u, q} = \frac{u \cdot q}{D} = \frac{\sum_{j = 0}^{D - 1} u_{j} \cdot q_{j}}{D} \tag{6}$$

Remark 2.

The normalized dot product is deliberately chosen over cosine similarity and attention-based methods. Unlike cosine similarity, which discards magnitude information, this formulation preserves the absolute intensity of the student’s ability while applying normalization solely for numerical stability and probability mapping. Furthermore, as a parameter-free operation, it guarantees the strict interpretability required in cognitive diagnosis and prevents overfitting on sparse educational data, avoiding the “black-box” nature of neural embeddings or attention mechanisms.

Here, $D = 32$ represents the dimensionality of the embedding vectors. The dot product $u \cdot q$ quantifies the similarity between the user’s and the exercise’s feature vectors. The subsequent normalization mitigates the dimensional scaling effect on the calculation, ensuring that the cross-compression score remains within a standardized and reasonable interval.
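Equation (6) amounts to a dot product divided by the vector length, as the following sketch shows. The function name and the length check are illustrative additions.

```python
from typing import Sequence


def cross_compression(user_vec: Sequence[float],
                      exercise_vec: Sequence[float]) -> float:
    """C_{u,q} per Equation (6): dot product divided by the dimensionality D."""
    if len(user_vec) != len(exercise_vec):
        raise ValueError("vectors must share the same dimensionality")
    d = len(user_vec)
    return sum(u * q for u, q in zip(user_vec, exercise_vec)) / d
```

Because every component of the user vector currently equals the ability value (Equation (2)), the score reduces to the ability value multiplied by the mean of the exercise vector's components. For an ability of 0.5 and an exercise vector of `[0.8, 0.2]` followed by thirty entries of 0.4, this gives 0.5 × 13/32 = 0.203125.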

Comprehensive Recommendation Score: Equation (7) defines the comprehensive recommendation score $P_{q}$ for exercise q, which integrates the basic matching degree and the cross-compression score.

$$P_{q} = B_{q} \times (1 + C_{u, q}) \tag{7}$$

In this formulation, the basic matching degree is modulated by the factor $(1 + C_{u, q})$. A high similarity between the user and exercise feature vectors yields a larger cross-compression score, which positively amplifies the final recommendation score, thereby strengthening the efficacy of the feature matching process. Upon computation of the comprehensive recommendation scores, the candidate exercise set $Q_{\mathrm{candidate}}$, which excludes previously completed exercises, is sorted in descending order based on $P_{q}$. The top N exercises are then selected to construct the personalized recommendation list, as described in Equation (8).

$$\mathrm{RecommendList} = \operatorname{Top-N}_{q \in Q_{\mathrm{candidate}}} (P_{q}) \tag{8}$$
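Equations (7) and (8) can be sketched as a scoring function followed by a top-N selection. The exercise identifiers and the use of `heapq.nlargest` are implementation choices made for this illustration, not details from the paper.

```python
import heapq
from typing import Dict, List, Set, Tuple


def final_score(basic: float, cross: float) -> float:
    """P_q per Equation (7)."""
    return basic * (1.0 + cross)


def recommend(scores: Dict[str, float], completed: Set[str],
              n: int = 10) -> List[Tuple[str, float]]:
    """Top-n (exercise_id, P_q) pairs among not-yet-completed exercises."""
    candidates = ((ex, p) for ex, p in scores.items() if ex not in completed)
    # nlargest returns the pairs sorted by score in descending order
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])
```

Selecting with a heap avoids sorting the whole candidate set when the problem bank is much larger than N.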

In the proposed algorithm, N is set to 10; however, this parameter can be flexibly adjusted to meet the specific application requirements of the OJ system. Distinct from the foundational UEMD-GA framework, which lacks a self-contained mechanism for ability estimation, the user ability value (denoted as $\mathrm{ability}_{u}$) is proposed herein as a cornerstone for feature vector formulation and difficulty fitness evaluation. In light of this, we propose a novel multi-dimensional fusion methodology to independently quantify user ability. This newly designed approach dynamically synthesizes diverse variables, including the inherent difficulty of knowledge points, problem complexity, user solving accuracy, and temporal decay effects. The specific mathematical expressions governing this calculation are detailed as follows:

Knowledge Point Basic Difficulty: The intrinsic difficulty value for an individual knowledge point is denoted as $B_{k} \in [0, 1]$, representing a fundamental attribute of the subject knowledge system. This value is predetermined by system administrators and stratified into three distinct levels based on mastery complexity, as detailed in Equation (9).

$$B_{k} \in \begin{cases} [0.1, 0.3], & \text{entry-level knowledge points}, \\ [0.4, 0.7], & \text{intermediate knowledge points}, \\ [0.8, 1.0], & \text{advanced knowledge points}. \end{cases} \tag{9}$$

Cumulative Difficulty Normalization: To prevent an exercise from being rated as overly difficult merely because it covers many knowledge points, the cumulative knowledge point difficulty for exercise t, denoted as $B_{t}^{\mathrm{sum}}$, is normalized to the range $[0, 2]$. The specific calculation is presented in Equation (10).

$$B_{t}^{\mathrm{sum}} = \left( \sum_{k \in K_{t}} w_{t, k} \times B_{k} \right) \times \frac{2}{B_{\max}^{\mathrm{sum}}} \tag{10}$$

In this formulation, $K_{t}$ represents the set of knowledge points encompassed by exercise t. The term $w_{t, k}$ denotes the assessment weight assigned to knowledge point k within exercise t, which defaults to $\frac{1}{|K_{t}|}$ and satisfies the constraint $\sum_{k \in K_{t}} w_{t, k} = 1$. Furthermore, $B_{\max}^{\mathrm{sum}}$ signifies the maximum cumulative difficulty value observed across the platform, functioning as a global normalization constant.
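The following sketch evaluates Equation (10) for one exercise. The dictionary inputs and the parameter names are assumptions; `max_sum` stands for the platform-wide constant $B_{\max}^{\mathrm{sum}}$, which would have to be computed over all exercises beforehand.

```python
from typing import Dict, Optional


def cumulative_difficulty(kp_difficulty: Dict[str, float], max_sum: float,
                          weights: Optional[Dict[str, float]] = None) -> float:
    """B_t^sum per Equation (10).

    kp_difficulty maps each knowledge point in K_t to its B_k value.
    """
    if weights is None:
        # default weight 1/|K_t| for every knowledge point, as in the text
        weights = {k: 1.0 / len(kp_difficulty) for k in kp_difficulty}
    weighted = sum(weights[k] * b for k, b in kp_difficulty.items())
    return weighted * 2.0 / max_sum
```

With the default weights the weighted sum is simply the mean of the $B_{k}$ values, so adding more knowledge points does not inflate the result; the exercise that attains `max_sum` maps to the upper bound of 2.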

User Single-Exercise Accuracy: The single-exercise accuracy, denoted as $C_{u, t} \in [0, 1]$, incorporates submission frequency to precisely quantify the user’s actual mastery level of a specific exercise, as formulated in Equation (11).

$$C_{u, t} = \begin{cases} \dfrac{1}{\mathrm{submit}_{u, t}}, & \text{solved correctly}, \\ 0, & \text{not solved correctly}. \end{cases} \tag{11}$$

In this formulation, $\mathrm{submit}_{u, t}$ represents the total number of submissions made by user u for exercise t. A successful resolution on the first attempt yields an accuracy of 1; conversely, the accuracy value diminishes as the submission count rises. This mechanism aligns with the pedagogical principle that requiring multiple attempts to reach a correct solution suggests a lower level of proficiency.

Time Decay Coefficient: To attenuate the impact of historical exercise records on the current ability estimation, a temporal decay coefficient $\beta_{u, t} \in (0, 1]$ is incorporated, as formulated in Equation (12).

$$\beta_{u, t} = \beta^{T - t} \tag{12}$$

Here, $\beta = 0.9$ denotes the base decay factor. T represents the time sequence index of the user’s most recent exercise (the current time step), while t in the exponent is the time sequence index of the record for exercise t. In this formulation, a larger difference $T - t$ signifies an older record, resulting in a smaller decay coefficient. This mechanism ensures that the contribution of earlier exercise records to the current ability value is diminished, prioritizing recent performance.
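To give a sense of scale, the decay coefficient of Equation (12) can be evaluated for a few record ages with the stated base of 0.9. This is plain arithmetic on the formula, not a result reported in the paper.

```latex
\begin{aligned}
T - t = 0  &: \quad \beta_{u,t} = 0.9^{0}  = 1 \\
T - t = 1  &: \quad \beta_{u,t} = 0.9^{1}  = 0.9 \\
T - t = 5  &: \quad \beta_{u,t} = 0.9^{5}  = 0.59049 \\
T - t = 10 &: \quad \beta_{u,t} = 0.9^{10} \approx 0.3487 \\
\text{half-life} &: \quad \frac{\ln 0.5}{\ln 0.9} \approx 6.58 \text{ time steps}
\end{aligned}
```

A record therefore loses about half of its weight after six to seven subsequent exercises.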

Single-Exercise Ability Contribution: The single-exercise ability contribution value, denoted as $S_{u, t} \in [0, 2]$, measures how much a single exercise record contributes to the estimate of the student’s mastery. It synthesizes the user’s single-exercise accuracy, the cumulative knowledge point difficulty, and the temporal decay coefficient, as formulated in Equation (13).

$$S_{u, t} = C_{u, t} \times B_{t}^{\mathrm{sum}} \times \beta_{u, t} \tag{13}$$

This formulation encapsulates the rationale that correctly solving a challenging problem with minimal attempts and recent engagement yields a superior contribution to the ability score, thereby precisely quantifying the educational value of each exercise.
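Equations (11) to (13) combine into a short chain of functions, sketched below. The function names and the integer time-step representation are assumptions for illustration.

```python
def accuracy(submissions: int, solved: bool) -> float:
    """C_{u,t} per Equation (11): 1/submissions if solved, else 0."""
    return 1.0 / submissions if solved and submissions > 0 else 0.0


def time_decay(current_step: int, record_step: int, beta: float = 0.9) -> float:
    """beta_{u,t} per Equation (12): beta raised to the record's age."""
    return beta ** (current_step - record_step)


def contribution(submissions: int, solved: bool, cum_difficulty: float,
                 current_step: int, record_step: int) -> float:
    """S_{u,t} per Equation (13)."""
    return (accuracy(submissions, solved)
            * cum_difficulty
            * time_decay(current_step, record_step))
```

For instance, an exercise with cumulative difficulty 1.6 that was solved on the second attempt two time steps ago contributes 0.5 × 1.6 × 0.81 = 0.648.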

User Raw Ability Score: This score represents a student’s overall mastery level by combining the learning gains from all past practice. The user raw ability score,

S_{u}

, aggregates the contribution values from all exercise records via a weighted summation. To prevent potential overestimation, a knowledge point difficulty calibration weight is incorporated, as detailed in Equation (14).

S_{u} = (\frac{1}{N_{u}} \times \sum_{t \in T_{u}} S_{u, t}) \times W_{u, K}

(14)

Here,

T_{u}

represents the set of all exercise records for user u.

N_{u}

denotes the effective exercise count, calculated as

N_{u}

= number of correct exercises + 0.2 × number of incorrect exercises, a formulation in which each incorrect exercise counts only one-fifth as much as a correct one, limiting the skewing effect of incorrect submissions on the denominator.

W_{u, K}

serves as the difficulty calibration weight, defined as

W_{u, K} = \frac{1}{| K_{u} |} \sum_{k \in K_{u}} B_{k}

, where

K_{u}

is the set of distinct knowledge points covered by the user’s exercises, and

| K_{u} |

represents the cardinality of this set.
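The aggregation in Equation (14) can be sketched as follows. The `Record` structure, the function name, and the dictionary `kp_difficulty` (mapping each knowledge point to its difficulty B_k) are hypothetical names used only for this illustration. Returning 0 for a user without records is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class Record:
    contribution: float                # S_{u,t} from Equation (13)
    solved: bool                       # whether the exercise was solved correctly
    knowledge_points: tuple[str, ...]  # tags attached to the exercise


def raw_ability(records: list[Record], kp_difficulty: dict[str, float],
                incorrect_weight: float = 0.2) -> float:
    """Raw ability S_u following Equation (14)."""
    if not records:
        return 0.0  # assumption: no activity means no measurable ability
    n_correct = sum(1 for r in records if r.solved)
    n_incorrect = len(records) - n_correct
    # Effective exercise count N_u: incorrect exercises weigh 0.2 each.
    n_effective = n_correct + incorrect_weight * n_incorrect
    # K_u: distinct knowledge points covered by the user's exercises.
    covered = {kp for r in records for kp in r.knowledge_points}
    if n_effective == 0 or not covered:
        return 0.0
    # W_{u,K}: mean difficulty over the covered knowledge points.
    calibration = sum(kp_difficulty[kp] for kp in covered) / len(covered)
    return (sum(r.contribution for r in records) / n_effective) * calibration
```

With one solved exercise contributing 0.6 and one unsolved exercise, N_u is 1.2, so the mean contribution is 0.5 before the calibration weight is applied.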

Normalized Ability Value: The raw ability score

S_{u}

is projected onto the interval

[0, 1]

to derive the final normalized ability value

A_{u}

(equivalent to

ability_{u}

), as presented in Equation (15).

A_{u} = \frac{S_{u} - S_{min}}{S_{max} - S_{min}}

(15)

In this equation,

S_{max}

represents the platform’s highest raw ability score, calculated as the average of the top 1% of users to eliminate outliers. Conversely,

S_{min}

is the lowest raw ability score, derived from the average of the bottom 1% of users; users with no activity or entirely incorrect records are assigned a score of 0.
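A minimal sketch of the normalization in Equation (15) is given below. Two details are assumptions not stated in the text: values falling outside the anchors are clipped to [0, 1], and at least one user is always included when averaging the top or bottom 1%. The function names are hypothetical.

```python
import math


def percentile_anchor(scores: list[float], fraction: float, top: bool) -> float:
    """Average of the top (or bottom) `fraction` of scores, using at least one."""
    ordered = sorted(scores, reverse=top)
    k = max(1, math.ceil(len(ordered) * fraction))
    return sum(ordered[:k]) / k


def normalize_abilities(raw_scores: dict[str, float],
                        fraction: float = 0.01) -> dict[str, float]:
    """Map raw ability scores S_u to A_u in [0, 1] following Equation (15)."""
    values = list(raw_scores.values())
    s_max = percentile_anchor(values, fraction, top=True)
    s_min = percentile_anchor(values, fraction, top=False)
    span = s_max - s_min
    if span <= 0:
        # Degenerate case (all users equal or no users): assumed to map to 0.
        return {user: 0.0 for user in raw_scores}
    # Clipping is an assumption: users above s_max or below s_min would
    # otherwise fall outside the target interval.
    return {user: min(1.0, max(0.0, (s - s_min) / span))
            for user, s in raw_scores.items()}
```

Averaging the extreme 1% rather than taking the single maximum and minimum makes the anchors less sensitive to individual outliers, which is the motivation given above.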

5. Analysis of Experimental Results

To ensure the reliability of the data analysis, data from our institution’s online assessment system was initially extracted and consolidated to maintain source stability throughout the study. During the data processing phase, to align the dataset with the algorithm’s input specifications, specific scores were assigned to the corresponding knowledge points. These values were strictly normalized within the 0–1 interval, maintaining consistency with the scoring framework defined previously. Additionally, since certain exercises lacked explicit knowledge point associations, a default intermediate score of 0.5 was uniformly applied to these instances.

This study strictly adheres to ethical guidelines for educational data mining and data protection regulations. To protect student privacy, rigorous anonymization protocols were applied: all personally identifiable information (PII), including student names, IDs, and IP addresses, was removed and replaced with irreversible hashed identifiers prior to analysis, ensuring the dataset is entirely de-identified. Furthermore, informed consent was obtained from all participants via the platform’s privacy policy during account registration. The data is stored in an encrypted, access-controlled internal database and is used exclusively for non-commercial, pedagogical research and algorithm optimization purposes described herein.

By integrating diverse data origins, such as freshman weekly contests, university-level tournaments, and provincial competitions, this multi-channel collection strategy supports the objectivity and reliability of our subsequent analytical findings. Following data cleansing and preprocessing, the finalized corpus comprises 894 distinct programming problems, 666,121 submission logs, and approximately 170 independent knowledge point tags. Because the data originate from Chinese websites, naming conventions were inconsistent and some labels could not be translated with full accuracy. Select knowledge point labels were therefore renamed manually to facilitate observation, a procedure that does not compromise the integrity of the subsequent data analysis.

Regarding typological distribution, the repository is fundamentally anchored in conventional foundational exercises while being strategically interspersed with authentic competition problems. This architectural diversity not only enriches the heterogeneity of the problem set but also significantly mitigates the stochastic biases typically induced by monolithic data sources. The quantitative distribution of the top 10 most frequent knowledge points and their associated problem counts is visualized in Figure 3.

To map problems to knowledge concepts, the raw labels were systematically refined and reorganized, yielding the knowledge distribution illustrated in Figure 4. Because the raw dataset contains no explicit difficulty metrics, a weighting scheme was formulated to quantify individual knowledge point scores, expressed as

Score = (\text{Average Difficulty} \times 0.6 + \text{Mastery Degree} \times 0.4) / 5

. Operationally, the “Average Difficulty” of a specific concept is calculated as the arithmetic mean of the term

(1 - Pass Rate)

across all its associated problems, whereas the “Mastery Degree” is derived as

1 - Average Pass Rate

. Consequently, the spatial distribution of high-scoring concepts derived from this mechanism is depicted in Figure 5. This quantitative paradigm is demonstrably capable of capturing the granular weight disparities among distinct knowledge concepts within personalized recommendation ecosystems.
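The knowledge point scoring rule can be sketched as follows. Because the text does not specify how the "Average Pass Rate" used for the Mastery Degree is aggregated, the sketch takes it as a separate input instead of guessing. The function and parameter names are hypothetical.

```python
def knowledge_point_score(problem_pass_rates: list[float],
                          average_pass_rate: float,
                          w_difficulty: float = 0.6,
                          w_mastery: float = 0.4,
                          scale: float = 5.0) -> float:
    """Score = (Average Difficulty * 0.6 + Mastery Degree * 0.4) / 5."""
    if not problem_pass_rates:
        raise ValueError("a knowledge point needs at least one associated problem")
    # Average Difficulty: mean of (1 - pass rate) over the associated problems.
    avg_difficulty = (sum(1.0 - p for p in problem_pass_rates)
                      / len(problem_pass_rates))
    # Mastery Degree: 1 - average pass rate, supplied by the caller.
    mastery_degree = 1.0 - average_pass_rate
    return (avg_difficulty * w_difficulty + mastery_degree * w_mastery) / scale
```

For instance, two problems with pass rates 0.2 and 0.4 and an average pass rate of 0.25 give (0.7 × 0.6 + 0.75 × 0.4) / 5 = 0.144.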

Synthesizing the aforementioned feature extraction, Figure 6 illustrates the joint distribution pattern between problem difficulty levels and the number of associated knowledge points. Evidently, the problem repository exhibits a predominant concentration in the medium-to-high difficulty range, with a smooth gradient progression that lays a solid data foundation for the algorithm’s precise recommendation. Furthermore, each problem, on average, encompasses 3 to 4 knowledge point tags, effectively mitigating the drawbacks of insufficient distinguishability and weak semantic association under single-label scenarios, thereby further ensuring the accuracy of recommendation results.

Focusing on the intrinsic correlation between knowledge point difficulty attributes and users’ actual submission pass rates, this study conducted targeted exploration and visualized analysis, with the specific results presented in Figure 7.

Following the preliminary data profiling, the aforementioned multidimensional features were fed into the proposed recommendation algorithm for joint validation. Figure 8 depicts the correlation between indicators after algorithm execution: the base matching degree of an exercise is strongly and positively correlated with its final recommendation score. This supports the internal consistency of the proposed algorithm’s scoring logic from an experimental perspective.

5.1. Analysis of $β$ Selection

The parameter

β

in Equation (12) plays a pivotal role in scaling the user single-exercise contribution value, thereby directly influencing the final exercise recommendations. To elucidate the variation patterns of the user single-exercise ability contribution value, simulation experiments were conducted utilizing user submission data. The results are illustrated in Figure 9, Figure 10 and Figure 11.

Figure 9 demonstrates that as

β

decreases, the contribution values of earlier exercises decay more rapidly, while recent exercises retain higher weights, consistent with the “forgetting curve” in learning psychology. Figure 10 compares the total contribution values across different

β

settings, showing that

β = 0.9

yields a balanced total contribution that neither overweights recent data (e.g.,

β = 0.99

leads to inflated recent contributions) nor underweights historical performance (e.g.,

β = 0.5

fails to reflect recent progress). Figure 11 further validates the optimality of

β = 0.9

by analyzing the variance of user ability values: when

β = 0.9

, the variance reaches a significant peak (0.003310), indicating enhanced discriminability (the model can better distinguish between users), while subsequent increases in

β

(e.g., 0.95, 0.99) lead to diminishing marginal gains in variance (0.005545, 0.012710), suggesting potential overfitting to recent submissions (the model becomes overly sensitive to the latest data, reducing stability).

In conclusion,

β = 0.9

is chosen as the optimal time decay coefficient because it achieves the best balance between model discriminability and stability. At

β = 0.9

, the model maximizes user differentiation (Figure 11) while avoiding excessive sensitivity to recent data (Figure 9 and Figure 10). This optimal setting ensures the algorithm effectively captures both recent learning progress and long-term performance trends, making it the most suitable choice for exercise recommendation.
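To make the role of β in Equation (12) concrete, the following sketch computes the decay weights for a hypothetical history of equally spaced exercises. It does not reproduce the simulation behind Figures 9 to 11. The function names are assumptions.

```python
def decay_profile(beta: float, history_length: int) -> list[float]:
    """Weights beta^(T - t) for t = 1..T, listed from oldest to newest."""
    return [beta ** (history_length - t) for t in range(1, history_length + 1)]


def total_weight(beta: float, history_length: int) -> float:
    """Sum of all decay weights, i.e. the total weight given to the history."""
    return sum(decay_profile(beta, history_length))


# Example: compare how much weight the oldest of 20 records keeps.
# for beta in (0.5, 0.9, 0.99):
#     print(beta, decay_profile(beta, 20)[0], total_weight(beta, 20))
```

Since β lies in (0, 1], a smaller β shrinks the weight of old records faster, while a β close to 1 keeps nearly the full history in play.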

5.2. Pedagogical Implications and Learning Outcomes

Beyond technical performance, the proposed adaptive system has implications for educational practice. Regarding student engagement and motivation, the adaptive recommendation mechanism adjusts exercise difficulty based on real-time performance, which helps keep students from becoming frustrated by overly difficult tasks or bored by overly simple ones. Concerning learning paths, the system supports personalized learning: instead of giving all students the same tasks, it identifies individual knowledge gaps and recommends specific exercises, which can improve learning efficiency. The time decay mechanism (

β = 0.9

) ensures that assessments reflect both long-term retention and recent progress, providing a fairer measure of student ability.

Our current study primarily focuses on validating the effectiveness of the core adaptive algorithms. As this technology moves toward real-world classroom deployment, integrating ethical considerations becomes a vital next step. Algorithmic fairness must be ensured so that personalized recommendations benefit all student groups equally, without creating disparities. Additionally, explainability is key for educational adoption. By making the recommendation logic transparent, students can understand why they need to practice certain concepts, which supports self-reflection, and teachers can better trust the system’s guidance. Addressing these aspects will transform the current technical prototype into a fully mature educational tool.

5.3. Complexity and Scalability Analysis

To support the scalability claims, we analyze the theoretical complexity and evaluate the runtime of the UEMD-GA algorithm. To mitigate hardware performance fluctuations, each test included a warm-up phase and was repeated 5 times, with the median of the 5 runs reported as the final result. Experiments were performed on a device with an 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz processor and 16.0 GB RAM.

Let U, P, and K denote the number of users, problems, and average knowledge points per problem, respectively.

Recommendation Calculation: For each user, the algorithm evaluates the matching score across all problems. This results in a time complexity of $O(U \times P \times K)$.
Time Decay Calculation: The time decay factor only depends on the user’s history. Thus, its complexity is $O(U)$.

Since the recommendation calculation dominates the process, the total theoretical time complexity is

O (U \times P \times K)

. For fixed P and K, this cost grows linearly with the number of users U, so the computational workload scales predictably as the user base grows.
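The origin of the U × P × K term can be seen in a nested-loop sketch. The scoring rule used here, which sums a per-user weakness value over the knowledge points of each problem, is a placeholder assumption and not the UEMD-GA matching score. Only the loop structure is the point.

```python
def score_all(users: dict[str, dict[str, float]],
              problems: dict[int, list[str]]) -> tuple[dict[tuple[str, int], float], int]:
    """Score every (user, problem) pair and count the elementary operations.

    users maps a user id to a {knowledge point: weakness} dictionary;
    problems maps a problem id to its list of knowledge point tags.
    """
    scores: dict[tuple[str, int], float] = {}
    operations = 0
    for user_id, weakness in users.items():           # U iterations
        for problem_id, tags in problems.items():     # P iterations per user
            total = 0.0
            for tag in tags:                          # about K iterations per problem
                total += weakness.get(tag, 0.0)
                operations += 1
            scores[(user_id, problem_id)] = total
    return scores, operations
```

With 2 users, 3 problems, and 2 tags per problem, the counter reaches 2 × 3 × 2 = 12, and doubling the number of users doubles it.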

The performance metrics, obtained by running the test 5 times and taking the median, are summarized in Table 2. The metrics are defined as follows:

Recommendation Time: Execution time of the core recommendation algorithm, with a time complexity of $O(U \times P \times K)$.
Total Time: Complete execution time of the algorithm (including recommendation and time decay calculations), dominated by $O(U \times P \times K)$.
Throughput: Number of operations processed per second, serving as an efficiency metric.

As shown in Table 2, when the number of users increases from 50 to 1000 (a 20-fold increase), the total computation time increases from 336.37 ms to 8389.27 ms (a 24.94-fold increase). The ratio of time growth to user growth is 24.94/20 ≈ 1.25, which is close to the value of 1 expected under the theoretical linear complexity

O (U \times P \times K)

. The throughput remains relatively stable (averaging approximately 120,000 ops/sec), indicating no significant performance bottlenecks as data volume increases.
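The scaling figures quoted above can be recomputed directly from Table 2. The snippet below is a small arithmetic check using the table's total times and throughput values. The variable and function names are introduced here for illustration.

```python
# (users, total time in ms, throughput in ops/s), copied from Table 2.
TABLE_2 = [
    (50, 336.37, 132_888),
    (100, 698.10, 128_061),
    (200, 1594.74, 112_118),
    (400, 2911.70, 122_814),
    (800, 6102.11, 117_205),
    (1000, 8389.27, 106_564),
]


def scaling_summary(rows):
    """Return (user growth, time growth, growth ratio, mean throughput)."""
    first, last = rows[0], rows[-1]
    user_growth = last[0] / first[0]
    time_growth = last[1] / first[1]
    mean_throughput = sum(r[2] for r in rows) / len(rows)
    return user_growth, time_growth, time_growth / user_growth, mean_throughput


user_growth, time_growth, ratio, mean_throughput = scaling_summary(TABLE_2)
```

A ratio of exactly 1 would indicate perfectly linear scaling in the number of users. The observed value is somewhat above 1, matching the gradual decline in throughput at the largest scale.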

It should be noted that the 8.4 s evaluation simulates an extreme scenario where 1000 users initiate requests simultaneously. In practical educational contexts, requests are typically staggered, resulting in millisecond-level response times for individual users. This predictable scalability aligns well with the microservice architecture: during peak hours, the linearly increasing workload can be efficiently distributed across multiple service nodes, ensuring robust performance for large-scale applications.

5.4. Comparative Evaluation of Recommendation Performance

To evaluate the real-world effectiveness of the recommendations rather than just the internal coherence of the scoring mechanism, we adopt a temporal next-item prediction protocol. This protocol simulates the real-world scenario where the system recommends problems based on a student’s historical learning trajectory and predicts their subsequent learning needs. For each student, all interaction records are sorted chronologically. We perform a strict temporal split: the most recent interactions are held out as the test set (evaluation targets), while all preceding interactions are used as the training set (historical records). The recommendation algorithms generate a ranked list of top-K candidate problems based solely on the historical records. A recommendation is considered a “hit” if the knowledge points of the recommended problems match the knowledge points of the student’s actual subsequent interactions in the test set. This protocol ensures that no future information is leaked into the recommendation generation process. Furthermore, since the temporal split is strictly deterministic based on timestamps and all algorithms are evaluated under identical configurations, the experimental results are fully reproducible.
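The protocol above can be sketched as follows. Two points are assumptions, since the text does not fix them: the number of held-out interactions (`holdout`), and the reading of a "match" as sharing at least one knowledge point. All function names are hypothetical.

```python
def temporal_split(interactions, holdout=1):
    """Split (timestamp, problem_id) pairs into chronological train and test lists."""
    ordered = sorted(interactions, key=lambda pair: pair[0])
    if not 1 <= holdout < len(ordered):
        raise ValueError("holdout must leave at least one training interaction")
    train = [problem for _, problem in ordered[:-holdout]]
    test = [problem for _, problem in ordered[-holdout:]]
    return train, test


def is_hit(recommended, test_items, problem_kps, k=5):
    """True if any top-k recommendation shares a knowledge point with the test items."""
    target = set()
    for problem in test_items:
        target |= set(problem_kps.get(problem, ()))
    return any(target & set(problem_kps.get(p, ())) for p in recommended[:k])


def hit_rate(per_user, problem_kps, k=5):
    """Fraction of users with at least one hit; per_user holds (recommended, test) pairs."""
    results = [is_hit(rec, test, problem_kps, k) for rec, test in per_user]
    return sum(results) / len(results) if results else 0.0
```

Because the recommender only ever sees the `train` list, the held-out interactions cannot leak into the ranking, which is the property the protocol is designed to guarantee.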

The evaluation reveals a trade-off between prediction accuracy and personalization in educational recommendations. Popularity achieves the highest prediction accuracy (MAP@5 = 0.5454) but the lowest personalization (PD@5 = 0.2824), indicating that it tends to recommend similar popular problems to most students. Conversely, Random and Difficulty-Only achieve high personalization (PD@5 = 0.7456 and 0.7759) but low directionality (MAP@5 = 0.2252 and 0.3177), meaning their recommendations are personalized but less well-guided. Item-KNN exhibits moderate performance on both dimensions (MAP@5 = 0.4265, PD@5 = 0.4919), reflecting limited personalization due to popularity bias inherent in similarity-based methods. Notably, KP-Coverage achieves the lowest MAP@5 (0.0235) and Hit Rate@5 (0.190), suggesting that naive weak-point matching without proper scoring may not be effective for prediction.

UEMD-GA achieves a relatively balanced performance between these extremes: it attains a personalization degree of PD@5 = 0.7505 (2.66× that of Popularity and 1.53× that of Item-KNN) while maintaining moderate prediction accuracy (MAP@5 = 0.3277, 45.5% higher than Random). Compared to Difficulty-Only, which has comparable personalization, UEMD-GA achieves slightly higher MAP at all K values (e.g., MAP@10 = 0.3885 vs. 0.3606), suggesting that integrating knowledge-point targeting with difficulty matching may provide somewhat better-directed recommendations than difficulty matching alone. Furthermore, UEMD-GA’s recommendation diversity (RD@5 = 0.1842) is higher than both Popularity (0.1258) and Item-KNN (0.1368), indicating relatively broader knowledge coverage.
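The relative figures quoted in this comparison follow from the K = 5 entries of Table 3 and Table 4. The short check below recomputes them. The dictionary and function names are introduced here for illustration.

```python
# Values at K = 5, copied from Table 3 (MAP) and Table 4 (PD).
PD_AT_5 = {"Popularity": 0.2824, "Item-KNN": 0.4919, "UEMD-GA": 0.7505}
MAP_AT_5 = {"Random": 0.2252, "UEMD-GA": 0.3277}


def ratio(value: float, baseline: float) -> float:
    """How many times larger value is than baseline."""
    return value / baseline


def relative_gain_percent(value: float, baseline: float) -> float:
    """Relative improvement of value over baseline, in percent."""
    return (value - baseline) / baseline * 100.0


pd_vs_popularity = ratio(PD_AT_5["UEMD-GA"], PD_AT_5["Popularity"])
pd_vs_item_knn = ratio(PD_AT_5["UEMD-GA"], PD_AT_5["Item-KNN"])
map_gain_vs_random = relative_gain_percent(MAP_AT_5["UEMD-GA"], MAP_AT_5["Random"])
```

Rounded to two decimals, the two PD ratios are 2.66 and 1.53, and the MAP gain over Random rounds to 45.5%.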

In summary, UEMD-GA demonstrates a certain degree of “directional personalization”—recommendations that are personalized while retaining some pedagogical guidance—offering a different trade-off from baselines that are either accurate-but-homogeneous or personalized-but-aimless. This is consistent with the goal of educational recommendation: recommending what students should practice, tailored to their individual profiles. The detailed results are shown in Table 3 and Table 4.

6. Conclusions

This study comprehensively investigates existing online assessment systems and implements targeted improvements. By integrating technical measures with functional optimizations, we have refined the system’s design to better align with pedagogical requirements, significantly enhancing its applicability and utility in teaching scenarios. Furthermore, we have developed and optimized an innovative exercise recommendation algorithm. By intelligently analyzing user learning data and behavioral patterns, the algorithm generates highly targeted exercise suggestions. This mechanism facilitates knowledge consolidation and skill enhancement during the problem-solving process, offering novel perspectives for empowering instruction within online assessment systems. Experimental results demonstrate that the proposed system provides competitive performance compared to traditional approaches, offering a scalable and intelligent solution that bridges theoretical insights with practical applications in online education.

Despite promising results, this study has limitations. The single-university dataset restricts generalizability, which future multi-institutional collaborations will address. Scaling for real-time concurrent users requires further computational optimization. Moreover, the system has not yet undergone pedagogical validation; future longitudinal studies will assess its long-term impact on knowledge retention. Finally, we plan to integrate explainable AI (XAI) to improve transparency, build trust, and support self-regulated learning.

Author Contributions

Conceptualization, T.L. and H.W.; methodology, H.W.; validation, H.W.; formal analysis, H.W.; investigation, T.L.; resources, H.W.; data curation, H.W.; writing—original draft preparation, H.W.; writing—review and editing, T.L. and H.W.; visualization, H.W.; supervision, H.W.; project administration, H.W.; funding acquisition, T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Higher Education Key Scientific Research Program Funded by Henan Province under Grants 24A520008 and 25A630021, the Natural Science Foundation of Henan Province under Grant 252300421507, the Henan Province Science and Technology Research Projects under Grant 252102210173, the Key Research and Development Program of Henan Province under Grant 261111211200, and the Doctoral Cultivation Fund Project of Henan University of Engineering under Grant D2022030.

Institutional Review Board Statement

Ethical review and approval were waived for this study based on Article 32 of the Measures for Ethical Review of Life Science and Medical Research Involving Human Subjects issued by the State Council of China. This study strictly adheres to ethical guidelines for educational data mining and data protection regulations. All data were processed to remove identifiable information (e.g., student names, IDs, and IP addresses), and no sensitive personal data were involved. The research does not include any intervention with human participants and poses no risk of harm; therefore, ethical approval was not required.

Informed Consent Statement

Informed consent was obtained from all individual participants included in the study.

Data Availability Statement

Data will be made available upon reasonable request by contacting the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their kind suggestions and constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Shou, Z.; Shi, Z.; Wen, H.; Liu, J.; Zhang, H. Learning Peer Recommendation Based on Weighted Heterogeneous Information Networks on Online Learning Platforms. Electronics 2023, 12, 2051.
Yang, Y.; Lin, Y.; Chen, Z.; Lei, Y.; Liu, X.; Zhang, Y.; Sun, Y.; Wang, X. SNPERS: A Physical Exercise Recommendation System Integrating Statistical Principles and Natural Language Processing. Electronics 2023, 12, 61.
Lin, Y.; Wu, Z. Learning Path Recommendation Enhanced by Knowledge Tracing and Large Language Model. Electronics 2025, 14, 4385.
Stitini, O.; Kaloun, S. Towards a Multi-Objective and Contextual Multi-Criteria Recommender System for Enhancing User Well-Being in Sustainable Smart Homes. Electronics 2025, 14, 809.
Wasik, S.; Antczak, M.; Badura, J.; Laskowski, A.; Sternal, T. A Survey on Online Judge Systems and Their Applications. ACM Comput. Surv. 2018, 51, 1–34.
Han, Y.; Zhang, Z.; Yuan, B.; Bi, H.; Shahzad, M.N.; Liu, L. An Experimental Online Judge System Based on Docker Container for Learning and Teaching Assistance. In Proceedings of the 2019 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI); IEEE: Piscataway, NJ, USA, 2019; pp. 1462–1467.
Chen, K.; Huang, F.; Liu, Z.; Wang, Y.; Zhang, L. ACcoding: A graph-based dataset for online judge programming. Sci. Data 2024, 11, 548.
Mentari, M.; Funabiki, N.; Kinari, S.A.; Puspitasari, D.; Kao, W.C. A Study of Introductory Exercise Problems for Novice Students in Java Programming Learning Assistant System. IAENG Int. J. Comput. Sci. 2025, 52, 3526–3544.
Dai, Z.; Lv, S.; Zhang, Q. A Programming Learning Engagement Assessment Model and Teaching Application Based on Online Judgement System. Front. Educ. Res. 2024, 7, 223–234.
Rico-Juan, J.; Sánchez-Cartagena, V.M.; Valero-Mas, J.J.; Gallego, A.J. Identifying student profiles within online judge systems using explainable artificial intelligence. IEEE Trans. Learn. Technol. 2023, 16, 955–969.
Messer, M.; Brown, N.C.C.; Kölling, M.; Shi, M. Automated Grading and Feedback Tools for Programming Education: A Systematic Review. ACM Trans. Comput. Educ. 2024, 24, 1–43.
Hwang, J.; Lee, H. From knowledge tracing to preference tracing: Capturing dynamic user preferences for personalized recommendation. Electron. Commer. Res. Appl. 2025, 73, 101527.
Chen, Y.; Guo, Q.; Qiu, Q.; Li, X.; Wang, Z. The Impact and Pathways of Digital Game-Based Learning on STEM Undergraduates’ Engineering Concepts: Integrating Self-Determination Theory and Flow Theory. Comput. Appl. Eng. Educ. 2025, 33, e70066.
Karnalim, O. A Low-Level Structure-based Approach for Detecting Source Code Plagiarism. IAENG Int. J. Comput. Sci. 2017, 44.
Vyas, R. Comparative analysis on front-end frameworks for web applications. Int. J. Res. Appl. Sci. Eng. Technol. 2022, 10, 298–307.
Medin, H.F.; Vieyra, A.N.; Robles, K.A.A.; Santos, J.M.; Cruz, R.L. How to improve web application development using a layered reference model and quality indicators. J. Eng. Res. 2023, 3, 2–16.
Bao, X. Deep learning-driven personalized recommendations and layout optimization for UI interaction design. Prog. Artif. Intell. 2025.
Cheng, F. Talent Recruitment Management System for Small and Micro Enterprises Based on Springboot Framework. Adv. Educ. Technol. Psychol. 2021, 5, 99–105.
Rodriguez, R.S.K.; Galindo, A.D.J.; Juan, T.J.; Santos, M.R.; Garcia, P.L. Secure Development Methodology for Full Stack Web Applications: Proof of the Methodology Applied to Vue.js, Spring Boot and MySQL. Comput. Mater. Contin. 2025, 85, 1807–1858.
Preeti, P.; Nakul, M.; Dudeja, C.; Singh, R.; Kumar, A. MeXDocker: An integrated application for transposon identification and annotation based on the Docker platform. Gene Rep. 2025, 41, 102364.
Andrijević, N.; Lovreković, Z.; Salkić, H.; Matić, I.; Ćosić, K. Benchmarking PHP–MySQL Communication: A Comparative Study of MySQLi and PDO Under Varying Query Complexity. Electronics 2025, 15, 21.
Belalem, G.; Matallah, H.; Bouamrane, K. Evaluation of NoSQL Databases: MongoDB, Cassandra, HBase, Redis, Couchbase, OrientDB. Int. J. Softw. Sci. Comput. Intell. 2020, 12, 71–91.
Attouche, L.; Baazizi, A.M.; Colazzo, D.; Ghelli, G.; Klessinger, S.; Sartiani, C.; Scherzinger, S. Elimination of annotation dependencies in validation for Modern JSON Schema. Theor. Comput. Sci. 2026, 1063, 115645.
Kuo, J.Y.; Wen, Z.J.; Hsieh, T.F.; Lin, C.H.; Chen, W.H. A Study on the Security of Online Judge System Applied Sandbox Technology. Electronics 2023, 12, 3018.
Ye, J.; Song, J.; Zhang, K.; Li, M.; Wang, H. Research on exercise recommendation algorithm for online judge system enhanced by knowledge graph. J. Chin. Comput. Syst. 2023, 44, 2558–2565.

Figure 1. System Architecture Diagram.

Figure 2. Problem Recommendation Framework.

Figure 3. Source Distribution.

Figure 4. Knowledge Point Distribution Map.

Figure 5. Knowledge Points Score Ranking.

Figure 6. Problem Difficulty and Knowledge Points Distribution.

Figure 7. Correlation between Knowledge Point Difficulty and Pass Rate.

Figure 8. Recommendation Analysis showing the strong positive correlation between Base Match $B_{q}$ and Recommendation Score $F_{q}$ (Pearson $R = 0.9980$, $R^{2} = 0.9960$).

Figure 9. Impact of Time Decay Coefficient on Contribution Values of Each Exercise.

Figure 10. Comparison of Total Contribution Values under Different Beta Values.

Figure 11. User Ability Variance and Marginal Change Rate Across Different $β$ Values.

Table 1. Solution Table.

Field Name	Data Type	Description
solution_id	int	Submission ID
qid	int	Problem ID
user_id	char(48)	User ID
nick	char(20)	Nickname
time	int	Runtime
memory	int	Memory Usage
in_date	datetime	Submission Time
result	int	Result
language	int	Use Language
cid	int	Homework ID
valid	int	Is Judged
code_length	int	Code Length
judgetime	datetime	Judge Time
pass_rate	decimal(3, 2)	Pass Rate

Table 2. Runtime evaluation of the UEMD-GA algorithm under different user scales (median of 5 runs).

Users	Rec. Time (ms)	Total Time (ms)	Throughput (ops/s)
50	336.30	336.37	132,888
100	698.02	698.10	128,061
200	1594.59	1594.74	112,118
400	2911.59	2911.70	122,814
800	6101.85	6102.11	117,205
1000	8389.10	8389.27	106,564

Table 3. Comparison of Knowledge-Level Prediction Accuracy Metrics.

K	Algorithm	Precision	Recall	Hit Rate	MAP	NDCG
5	Random	0.2740	0.3745	0.8400	0.2252	0.3356
	Popularity	0.4500	0.6220	0.9800	0.5454	0.6550
	Difficulty-Only	0.2920	0.4526	0.9500	0.3177	0.4362
	KP-Coverage	0.0340	0.0301	0.1900	0.0235	0.0409
	Item-KNN	0.4040	0.5709	0.9900	0.4265	0.5440
	UEMD-GA	0.3020	0.4199	0.9000	0.3277	0.4420
10	Random	0.2130	0.5747	0.9900	0.2593	0.4108
	Popularity	0.2880	0.7653	1.0000	0.5724	0.6990
	Difficulty-Only	0.2260	0.6320	1.0000	0.3606	0.5108
	KP-Coverage	0.1050	0.2675	0.8100	0.0546	0.1459
	Item-KNN	0.2930	0.7734	0.9900	0.4890	0.6255
	UEMD-GA	0.2460	0.6683	1.0000	0.3885	0.5429
20	Random	0.1585	0.8192	1.0000	0.3137	0.5051
	Popularity	0.1640	0.8626	1.0000	0.6013	0.7386
	Difficulty-Only	0.1590	0.8316	1.0000	0.4113	0.5920
	KP-Coverage	0.1225	0.6468	1.0000	0.1136	0.2887
	Item-KNN	0.1725	0.8894	0.9900	0.5239	0.6725
	UEMD-GA	0.1555	0.8130	1.0000	0.4245	0.6003

Table 4. Comparison of Pedagogical Value Metrics.

Algorithm	WCR@5	WCR@10	WCR@20	RD@5	RD@10	RD@20	PD@5	PD@10	PD@20
Random	0.2460	0.3674	0.6268	0.1935	0.2972	0.4030	0.7456	0.6257	0.4681
Popularity	0.3310	0.5436	0.7124	0.1258	0.1638	0.2000	0.2824	0.2744	0.2260
Difficulty-Only	0.2512	0.4242	0.6310	0.1987	0.2898	0.3933	0.7759	0.6778	0.4910
KP-Coverage	0.0172	0.1348	0.3587	0.4230	0.4787	0.5782	0.5610	0.4512	0.1762
Item-KNN	0.2392	0.5069	0.6750	0.1368	0.1990	0.2618	0.4919	0.3925	0.2305
UEMD-GA	0.2072	0.3949	0.5477	0.1842	0.2915	0.3778	0.7505	0.6325	0.4778

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

