1. Introduction
Fraud can be defined in many ways. In the most general terms, it can be viewed as a deceptive action performed to obtain an undeserved gain [1], but this definition lacks the specific characteristics that can help detect fraud. A more specific definition is provided in [2]: “Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving, and often carefully organized crime that appears in many types of forms”. The latter captures several key characteristics of fraud that are relevant to detecting it. First, fraud is rare: fraudulent actions are far sparser than legitimate ones, which translates into imbalanced data sets. Imbalanced data sets present a challenge to current ML algorithms, which tend to focus on the majority class and ignore the minority class [3]. Since most actions are legitimate, fraudulent actions have ample opportunity to hide among them. Another important aspect of fraud is that it changes over time: fraudsters adapt their behavior to elude identification, so the detection model must adjust to this change as well.
Academic fraud encompasses any form of unauthorized activity that violates rules and regulations in order to gain an unfair advantage and to present a performance that misrepresents one’s true ability [4,5]. A problem specific to the academic field is that students’ perception of what constitutes cheating can vary significantly with their background [5,6]. To address this behavior, students must be informed about academic integrity during classes, and this message should be reinforced throughout the course. Furthermore, faculty should provide clear guidelines on what constitutes academic fraud [6]. Cheating undermines the validity of examinations in higher education and jeopardizes a fair learning and assessment environment [4]. Studies have shown that teachers detect academic fraud no better than chance [7]. To address this limitation, ML techniques can be applied to detect academic fraud either in real time or after the exam is completed, with better results. However, legitimate privacy concerns arise when detecting academic fraud, given the necessity of handling sensitive data [8].
Detecting fraud has the potential to save companies significant amounts of money: the Association of Certified Fraud Examiners estimates that up to 5% of company revenue can be lost to fraud [2]. It can likewise spare the taxpayers who ultimately cover fraud losses in the medical field and other areas of the public sector [9,10]. Detecting academic fraud provides equal opportunities for all students, ensures the quality of education, and maintains academic integrity.
ML techniques have long been used in this field. Credit card fraud is detected using Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbors (KNN) [
11], or Random Forest (RF) and Multilayer Perceptron (MLP) [
12]. Fraud detection in the American healthcare system, particularly in Medicare, can be achieved using RF and LR, which are employed in both [
9,
13]. Additionally, Gradient Tree Boosting (GTB) [
9], NB, KNN, Support Vector Machine (SVM), and Decision Tree (DT) [
13] are also utilized. RF has been shown to be effective in detecting tax fraud [
1], while SVM is capable of detecting money laundering [
14]. Clustering is performed to detect anomalies in time series independently of domain [
15,
16]. Academic fraud is detected using DT, LR, and KNN [
17], as well as RF [
18].
More recently, deep learning and graph-based techniques have been applied in this field. Graph Neural Networks (GNNs) that utilize both labeled and unlabeled data [
19], along with Artificial Neural Networks (ANNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs) [
20], are used for credit card fraud detection. Spam and fake review detection is achieved using Graph-Based methods [
21,
22] and GNNs [
23,
24]. Hierarchical Attention Networks [
25] and Graph-Based methods [
10] are used in detecting tax fraud. ANNs have been shown to be capable in detecting medical fraud [
26]. Education fraud is detected using Convolutional Neural Networks (CNNs) [
27,
28], RNNs [
28], and LSTM [
28,
29].
Hybrid techniques seek to enhance classification performance by combining multiple classifiers. In credit card fraud, the techniques proposed by [
30] involve calculating the similarity between a cardholder’s transactions using KNN and making decisions based on Dynamic Random Forest. Another approach for detecting credit card fraud, outlined in [
31], employs a hybrid method that combines multiple models, such as NB, DT, and RF, with statistical methods like Adaptive Boosting (AdaBoost) and Majority Voting. Additionally, a hybrid approach presented in [
32], which combines an ANN with RF, has been shown to be effective in detecting spam and fake reviews. Labeling based on K-means clustering is used to train an SVM for detecting academic fraud [
6].
Data imbalance is a drawback in all domains where fraud detection occurs [
3,
9,
14,
33,
34] due to the nature of fraud itself: in every data set, most actions are legitimate and only a minority are fraudulent. Furthermore, in the educational field, collecting reliable data to train detection models is a challenge. Previous work on detecting academic fraud has focused on using data from MOOC classes [
5,
6], using custom evaluation tools to detect illegal actions [
8], or relying on synthetic data [
29].
This paper presents a new approach that leverages Moodle logs as input data sets for supervised ML algorithms, without employing a custom evaluation tool. We selected log data covering the COVID-19 pandemic years, preceded and followed by standard academic years. While the standard years provide a baseline of normal behavior, the pandemic period was included to address class imbalance: examinations were then conducted online with limited proctoring mechanisms, which naturally resulted in a higher frequency of fraudulent activities than in in-class proctored exams. By combining these years, we obtained a data set with a higher representation of the minority class, fraudulent activity, which is essential for training ML models. The primary goal is to develop a robust methodology for extracting, labeling, and processing Moodle platform logs to train ML models capable of detecting fraudulent activities before final grades are issued. Although the resulting models might not transfer directly to other courses, an educator following the proposed method can train their own.
In line with the above objective, the research is framed by the following research questions (RQ):
RQ1. What is the impact of temporal window size on the discriminative power of features extracted from Moodle activity logs for academic fraud detection?
RQ2. Which combination of resampling strategy and classification algorithm yields the best trade-off between fraud detection rate and precision in a highly imbalanced educational data set?
RQ3. How does the inclusion of auxiliary student metadata (e.g., grade book information) influence the performance of machine learning models in detecting academic fraud compared to using log-derived features alone?
This paper is structured as follows:
Section 2 discusses the work related to fraud detection in various domains with a focus on the educational field. In
Section 3, the steps taken to create the data sets are explained, followed by an overview of the evaluation of the chosen ML algorithms. The results obtained, clustered according to the research questions, are presented in
Section 4. In
Section 5, we discuss the results obtained from the experiment. Finally, the conclusions of this study are presented in
Section 6.
3. Materials and Methods
3.1. Data Set Construction
Politehnica University Timisoara operates an online learning environment based on the open-source Moodle system [
44], known as Campus Virtual [
45]. The data set utilized in this paper was derived from logs recording the actions of users on this platform, spanning three courses across three academic years. The specific procedure followed to construct this data set is outlined in Algorithm 1.
| Algorithm 1 The procedural pipeline for constructing the ML data set, encompassing raw log filtering, feature extraction, and integration with auxiliary grade book data |
| Input: cv.upt.ro |
| Output: ML Data Set |
| 1: Retrieve raw logs from cv.upt.ro |
| 2: Encrypt user specific information |
| 3: Get semester start and stop time |
| 4: Determine list of active students |
| 5: for all logs do |
| 6: if (log timestamp < semester start OR log timestamp > semester stop) then |
| 7: Remove log |
| 8: end if |
| 9: if (log user not in list of active students) then |
| 10: Remove log |
| 11: end if |
| 12: end for |
| 13: Map student to exam information |
| 14: for all student–exam pairs do |
| 15: Count internal elements access |
| 16: Count external elements access |
| 17: Check fixed rules violation |
| 18: Check flexible rules violation |
| 19: end for |
| 20: Merge current data with Grade Book Data Set |
| 21: Filter and remove duplicates |
| 22: Filter and remove students who withdrew from the course |
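The filtering steps of Algorithm 1 (steps 3–12) might be sketched in pandas as follows. The column names `time`, `user`, and `component`, and the use of the `quiz` component to identify exam attendance, are illustrative assumptions rather than the actual log schema:

```python
import pandas as pd

def filter_logs(logs: pd.DataFrame, sem_start, sem_stop) -> pd.DataFrame:
    """Sketch of steps 3-12 of Algorithm 1 (hypothetical column names)."""
    # Steps 5-8: drop logs recorded outside the semester timeframe
    logs = logs[(logs["time"] >= sem_start) & (logs["time"] <= sem_stop)]
    # Steps 4 and 9-11: keep only active students, i.e., those who
    # generated at least one exam ('quiz') log during the semester
    active = set(logs.loc[logs["component"] == "quiz", "user"])
    return logs[logs["user"].isin(active)].reset_index(drop=True)
```

Filtering out inactive students before feature extraction keeps the student–exam mapping in step 13 well defined.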
3.2. Logs Retrieval
We collected logs from three different courses: LP2 (Programming Languages 2), POO (Object-Oriented Programming), and BD (Database Management Systems). These courses are part of the Electronics, Telecommunications and Information Technology (ETTI) specialization offered by Politehnica University Timisoara (UPT), a technical university in Romania. The students are undergraduates in their second and third year of study. Owing to their technical background, the students have good digital literacy and technical proficiency. The students’ profile is similar across the three courses, as they are all part of the same specialization. The courses are offered in the first and second semesters of the academic year. The logs are generated by students’ interactions with the Moodle platform, which include activities such as accessing course materials, submitting assignments, or responding to quizzes. Because of the students’ specialization, it is important to note that these interaction patterns may differ from those observed in non-technical disciplines, such as the humanities.
We selected log data covering both standard academic years and the COVID-19 pandemic period to address class imbalance. It is important to note that the pedagogical structure and the actual activities on the Moodle platform remained the same across both formats. The difference between the two educational environments lies in the physical location of the exam and the effectiveness of the proctoring. While face-to-face exams took place inside the classroom with strict physical oversight, online exams were conducted remotely with weaker proctoring constraints: the only requirement was that cameras be turned on during the exam. We assume that this environmental change from face-to-face to online exams yields a higher incidence of academic fraud due to the reduction in physical proctoring. Including the online format was intentional, as it better represents the minority class (fraud) in the final data set.
The logs are obtained from the following academic years: 2019–2020, 2020–2021, 2021–2022. Of the three courses, one takes place in the first semester from September to February (POO), while the other two courses (LP2, BD) take place in the second semester from February to June. The COVID-19 pandemic forced classes to take place in an online environment during the second semester of 2019–2020, the entire academic year 2020–2021, and the first semester of 2021–2022.
Table 2 illustrates the distribution of data regarding the number of students and the number of logs generated across the three courses and the three academic years. Campus Virtual is an e-learning platform that can be used in face-to-face courses as well. Therefore, the available logs span two semesters of face-to-face classes and four semesters of online classes.
The initial logs, as can be seen in
Figure 1, consist of the time the action was performed, the user’s name, the course component interacted with, a description of the action, the source of the action (web, app, cli), and the IP address that was used. To protect users’ privacy, all user-specific information was encrypted using a one-way cryptographic function, MD5 with a custom salt string. The encryption was performed by one of the course instructors, so appropriate measures to safeguard sensitive data were taken before preprocessing. The sensitive information consists of the user’s ID, IP address, and name.
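The anonymization step can be sketched with Python’s standard hashlib; the salt value shown here is a placeholder, not the one actually used:

```python
import hashlib

SALT = "custom-salt-string"  # placeholder; the real salt is kept private

def anonymize(value: str) -> str:
    """One-way salted MD5 hash, as applied to user IDs, names, and IP addresses."""
    return hashlib.md5((SALT + value).encode("utf-8")).hexdigest()
```

Because the function is deterministic, the same user maps to the same pseudonym across all logs, preserving linkability for analysis without exposing identity.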
Besides the logs generated by the students on the Moodle platform, we also had access to the students’ grade book for each course. Its content included the student’s encrypted identifier, gender, and grades, as well as a flag indicating students who took only the final examination, their course activity having been previously finalized.
3.3. Data Preprocessing
The goal of data preprocessing is to map information about each student and the exams they attended during a course. The information obtained after this step is stored in the Student–Exam data set,
Figure 1.
The first step was to filter out the logs recorded outside the semester’s timeframe, as maintenance work can appear in the logs but is not relevant to our study. Afterward, we extracted the ID of the resource the user interacted with and appended it to the information provided by the logs. The resource ID indicates the type of resource, such as a file, a URL, or an exam activity.
Since a student can take part in some of a semester’s activities but decide not to take any exams, such students’ actions must be removed from the logs. A list of the exams that took place in the respective semester can be retrieved from the logs. Based on this list, we extracted the list of active students for that semester, i.e., the students who attended at least one exam, and used it to filter the logs once more, keeping only the actions performed by active students. This approach is similar to that of Sangalli et al. [6], as they processed log actions to retain only exercise-related information.
In the
Student–Exam data set, each student is paired with the exams they took the corresponding semester. The origin of the data is maintained by features
Course Name and
Academic Year. Based on the logs, the start and stop times of the exam for each student–exam pair can be retrieved and the exam duration computed. To quantify a student’s effort to prepare for an exam, we counted how many times the student accessed the available exam-preparation resources from the beginning of the semester until the start of the exam. This yields a quantifiable measure of a student’s involvement with the course directly from the logs, and is an alternative to methodologies that rely on quiz or midterm grades to predict a final result [
18,
29]. The resources are split into three categories: File, URL, and Other. A second-level differentiation is performed based on the location of resource access: “Interior” if accessed on faculty premises during classes, or “Exterior” if accessed from external devices outside scheduled class hours. This can be achieved by looking at the IP address and comparing it with the list of IP addresses used in the laboratory. In the existing literature, shared IP addresses are used to identify collaborating students [
6]. We use the IP data to establish the location of the students’ preparation and exam environment. A parameter that conveys the number of times a student looked through their exam questions is saved as
Number of Attempts During Exam. For later use in the labeling process, we retrieve how many resources are accessed during the exam; since this is not allowed, these accesses are saved as “Illegal Actions”. We also check whether the student opened multiple sessions of the exam from distinct locations (detected as rapid IP changes around the exam), as well as the student’s behavior in the minutes preceding the exam. Detecting rapid IP changes to identify multiple sessions aligns with the identification of copying using multiple accounts on the edX platform [
5]. We use this rule-based system to establish our ground truth. An alternative to our approach is presented by unsupervised clustering used for labeling [
6].
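The interior/exterior access counting described above could be sketched as follows; the laboratory IP list and the column names `user`, `time`, and `ip` are hypothetical stand-ins for the real log schema:

```python
import pandas as pd

LAB_IPS = {"192.168.1.10", "192.168.1.11"}  # hypothetical laboratory IP list

def count_accesses(logs: pd.DataFrame, student: str, exam_start) -> dict:
    """Count one student's resource accesses before the exam, split by location.

    'Interior' accesses originate from faculty premises (lab IPs);
    everything else is counted as 'Exterior'.
    """
    before = logs[(logs["user"] == student) & (logs["time"] < exam_start)]
    interior = before["ip"].isin(LAB_IPS)
    return {"interior": int(interior.sum()), "exterior": int((~interior).sum())}
```

In the actual pipeline, these counts would be computed per resource category (File, URL, Other) for every student–exam pair.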
3.4. Data Labeling
To apply supervised ML algorithms, the data set must have a target value. Since the logs only supply information about what actions have been performed, the target has to be manually created. We created the label based on two sets of rules: “Fixed” and “Flexible”, described in
Table 3. If at least one of the fixed rules is broken, we can be certain that the student cheated on the exam, and we set the label accordingly. Flexible rules cannot establish fraud on their own; rather, they highlight suspicious behavior.
Fixed Rule #1 prohibits students from accessing course resources or any other resources during the exam. Failure to follow this rule, and thus accessing any resource, will result in the student being marked as a cheater (committing fraud).
Fixed Rule #2 prohibits impersonation by any person acting on behalf of the student. This can be verified by tallying the number of unique IP addresses used during the exam. If multiple IP addresses are detected, it indicates that the same user account has been logged in from different computers. This suggests that someone is impersonating the student and therefore violating this rule. As a result, the student is flagged as a cheater (committing fraud). This approach can detect identity theft based on the IP address change from the standard Moodle logs. It presents an alternative to systems requiring specialized behavioral biometrics, such as keystroke dynamics [
17,
27], which are not available by default on the Moodle platform.
Two fixed rules were chosen because they detect fraudulent behaviors that can be definitively verified from Moodle activity logs alone. Fixed Rule #1 is verifiable because Moodle logs every resource access with a timestamp. Fixed Rule #2 is verifiable because Moodle records the IP address of each user action. Conversely, copying from printed materials, looking at a neighbor’s screen, or downloading resources before the exam for offline access leaves no trace in the platform logs. Detecting such actions requires physical proctoring or additional monitoring, such as real-time camera feeds [
28]. By restricting labeling to log-verifiable violations, we ensure 100% labeling precision at the cost of uncertain labeling recall.
Flexible rules can be violated without necessarily labeling the student as committing fraud; they serve as indicators of suspicious behavior.
Fixed Rule #1 can be circumvented by downloading the relevant resources before the start of the exam. In this manner, a student could use the resources during the exam without leaving any trace in the logs. Flexible Rule #1 is designed to detect such behavior.
Students can circumvent Fixed Rule #2 by not logging in during the exam themselves: the single login is executed by the impersonator, so only one IP address is observed during the exam. The actual student may then log in only after the exam has concluded. Flexible Rule #2 is designed to identify this scenario.
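A minimal sketch of the two-tier labeling logic, with hypothetical argument names standing in for the log-derived indicators of Table 3:

```python
def label_student(resources_during_exam: int, unique_ips_during_exam: int,
                  bulk_downloads_before_exam: int,
                  logged_in_only_after_exam: bool) -> dict:
    """Rule-based labeling sketch: fixed rules determine the Fraud label,
    while flexible rules only flag suspicious behavior."""
    fixed_1 = resources_during_exam > 0        # resource access during the exam
    fixed_2 = unique_ips_during_exam > 1       # same account on several machines
    flexible_1 = bulk_downloads_before_exam > 0   # resources fetched pre-exam
    flexible_2 = logged_in_only_after_exam        # possible impersonation
    return {"fraud": fixed_1 or fixed_2,
            "suspicious": flexible_1 or flexible_2}
```

Keeping the two tiers separate preserves the guarantee that the Fraud label is only ever set by log-verifiable violations.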
3.5. ML Data Set
The framework presented in
Figure 1 represents a template that was used to derive 12 distinct data sets by combining the available features. The description of the attributes found in this template is provided in
Table 4. The distinct data sets were derived from the standard data set by adjusting the attribute computation interval and selectively including auxiliary information, as illustrated in
Figure 2.
The configuration of the experimental data sets is summarized in
Table 5. In the standard data set, each metric is computed cumulatively from the beginning of the semester. In contrast, the alternative data sets utilize temporal windows of 1, 3, 5, or 7 days for attribute calculation. The granular approach to defining computation intervals is adapted from financial fraud detection [
38]. Auxiliary information encompasses the number of exam attempts and the metadata available in the student’s grade book. The final grade was intentionally excluded from the final data set, as the model is designed to detect academic fraud proactively, before the issuance of the final grade. This design choice distinguishes our framework from studies where grades received during a semester are used to identify fraud based on outlier detection after the final exams are completed [
18,
29].
Different courses, or the same course in different academic years, may contain a different number of resources with which a student interacts. To merge the data into a single data set, the features must be standardized; standardization is also important for achieving generalization across courses, as confirmed by Alexandron et al. [5] in their multi-course evaluations. For both computation approaches, we scaled the number of resources the student accessed before the exam: the mean and standard deviation were computed for each exam, and the samples were scaled based on (
1).
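Assuming (1) is the usual z-score transformation, the per-exam scaling can be sketched with pandas; the `exam_id` grouping column is hypothetical:

```python
import pandas as pd

def standardize_per_exam(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Z-score each feature within its exam group: z = (x - mean) / std."""
    out = df.copy()
    for col in cols:
        grp = out.groupby("exam_id")[col]
        out[col] = (out[col] - grp.transform("mean")) / grp.transform("std")
    return out
```

Scaling within each exam group, rather than globally, makes access counts comparable across courses that offer different numbers of resources.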
3.6. Data Balancing
An important aspect of the prediction’s performance is the distribution of entries among the classes. As can be seen in
Figure 3, the original data set (indicated by the
None label) is highly unbalanced, with the
Fraud class representing only 5.83% of the total entities. Beyond this baseline, the figure also details the comparative class distribution of
Non-Fraud vs.
Fraud entities for every balancing method evaluated in this study. These values are calculated on the entire data set and serve strictly to aid in understanding the data and the balancing methods. In our experiment, we employed a Repeated Stratified K-Fold procedure to split the data into training and testing sets. The balancing techniques were applied exclusively to the training data, leaving the testing data unmodified.
To address the class imbalance, we evaluated eleven resampling protocols, ranging from pure oversampling to hybrid pipelines.
Table 6 describes the specific balancing techniques evaluated. We employ SMOTE and
Adaptive Synthetic Sampling (ADASYN) for oversampling the minority class. ADASYN was selected to prioritize minority instances located near the decision boundary that are more difficult to learn [
46]. These resampling methods were used both in isolation and within pipelines.
To address the potential for overfitting and noise amplification associated with oversampling, we introduced
Tomek Links as a data cleaning step. A Tomek Link is a pair of instances from different classes that are each other’s nearest neighbors. Because such pairs lie close to the decision boundary, removing them effectively widens the margin between the two classes, thereby potentially enhancing the classifier’s performance. Furthermore, we implemented hybrid sampling architectures that combine oversampling (targeting a minority ratio of 0.2) with subsequent undersampling using Random Undersampling (RUS) or
NearMiss to achieve a balanced data set. The efficacy of hybrid resampling strategies in preserving minority class structure in highly skewed data sets has been validated in medical fraud detection [
9,
13,
26]. This sequential approach aims to balance the data set while preserving the structure of the minority class.
We used the Python implementation available in the
imbalanced-learn library [
47] for the oversampling, undersampling, and cleaning techniques. We used Python 3.12, imbalanced-learn version 0.14.0 and sklearn version 1.6.2.
3.7. ML Algorithms
Fraud detection is a binary classification problem: an activity is either fraudulent or not. To ensure that the same class balance is kept between training and test sets, we implemented Repeated Stratified K-Fold cross-validation with five folds and three repetitions. Its Python implementation is available in the sklearn.model_selection library; we used sklearn version 1.6.1. By employing stratified sampling, we ensure that the class distribution remains consistent across all training and testing partitions, which is critical in fraud detection to prevent the formation of unrepresentative folds due to class imbalance. In this configuration, the data set is divided into five distinct folds; in each iteration, one fold serves as the testing set, while the remaining four constitute the training set. This entire cycle is repeated three times with different randomizations to ensure the stability of the results.
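The evaluation protocol can be sketched with sklearn’s RepeatedStratifiedKFold; synthetic data stands in for the real data set, and a DT classifier is used purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the Moodle-derived data set
X, y = make_classification(n_samples=500, weights=[0.94], random_state=0)

# 5 folds x 3 repetitions = 15 stratified train/test splits
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(balanced_accuracy_score(y[test_idx], clf.predict(X[test_idx])))
```

Any resampling would be fitted inside the loop on the training fold only, so the test fold always reflects the original class distribution.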
In this study, we employ six classifiers: LR, DT, SVM, Gradient Boosting Machine (GBM), AdaBoost, and RF. Our selection of these classifiers is based on two criteria. Firstly, the classifiers must have demonstrated good results in detecting fraud. Secondly, they must be capable of handling small data sets.
LR has demonstrated good performance in detecting fraudulent credit card transactions [
11,
30,
33,
35,
38] and medical fraud within the United States’ healthcare system [
9]. LR and DT have been successfully used to create an authentication solution based on facial recognition and keystroke dynamics for preventing academic fraud [
17]. SVM has proven effective in detecting cheating students in Massive Open Online Courses (MOOCs) [
6]. Regarding detection of credit card fraud, Randhawa et al. [
31] demonstrated that AdaBoost is capable of detecting fraudulent transactions, while Dhankhad et al. [
35] and Taha and Malebary [
48] successfully used variants of GBM. Meanwhile, Varmedja et al. [
12] concluded that although LR can achieve better recall for the fraudulent class, it is surpassed by RF, which shows better precision and overall accuracy. In detecting academic fraud, RF has demonstrated strong performance in predicting whether students provide correct or incorrect answers, thus identifying as cheaters those who consistently provide answers that do not align with their usual performance patterns [
18]. Moreover, RF has proven good results in detecting tax fraud [
1], as well as medical fraud [
13]. GBMs were also employed in the medical domain, where Herland et al. [
9] utilized them for Medicare fraud detection.
LSTM has been used to detect academic fraud [
29], credit card fraud [
20], and tax fraud [
28,
29]. ANNs have been used to detect credit card fraud [
20] and medical fraud [
26]. Academic fraud has been detected using CNNs [
27,
28]. Deep learning approaches are capable of detecting fraud; however, our data set encompasses, before applying any resampling technique, a total of 3788 entries. Considering that deep networks generally perform better with large data sets [
12], and given that our data set is small, we decided not to use any deep learning networks in our experiment.
In our study, we utilized the Python implementation of these classifiers available in the sklearn library.
3.8. Evaluation Metrics
To evaluate classifier performance, we employed six metrics commonly used to gauge ML classifiers: precision, recall, F1-Score, balanced accuracy, G-Mean, and Cohen’s Kappa. Our data set consists of two classes, Fraud and Non-Fraud, and is imbalanced, with the majority Non-Fraud class representing 94.17% of the total entries.
To better understand the evaluation metrics, we define the components of the confusion matrix for a binary classification problem. True Positives (TPs) are instances correctly classified as Fraud, True Negatives (TNs) are instances correctly classified as Non-Fraud, False Positives (FPs) are Non-Fraud instances incorrectly classified as Fraud, and False Negatives (FNs) are Fraud instances incorrectly classified as Non-Fraud. Based on these components, we define the True Positive Rate (TPR) based on (
2) and False Positive Rate (FPR) based on (
3).
3.8.1. Minority Class Centric Metrics
To overcome the limited number of entries in the minority class, we employ evaluation metrics focused on this class. These metrics are less sensitive to class imbalance, as their calculation does not rely on the majority class. Evaluation metrics suitable for this scenario are: precision, recall, and F1-Score.
Precision is the ability of the classifier not to label as positive a sample that is negative. It is calculated based on (
4). Precision is a metric that is computed for each class of the output label. The value range is between 0 and 1, where 0 shows that none of the predictions for a certain class are correct, and 1 shows that all predicted instances for a certain class are correct.
Recall represents a classifier’s ability to identify a certain class. It is calculated based on (
5). Recall is computed for each class of the output label. It ranges between 0 and 1, where 0 indicates that the classifier is unable to identify any instance of a certain class, and 1 indicates that the classifier can identify every instance of a certain class.
F1-Score is the harmonic mean of the precision and recall. It is calculated based on (
6). The F1-Score ranges from 0 to 1, with the worst score at 0 and the best score at 1.
3.8.2. Balanced Aggregation Metrics
Another way to handle the imbalanced data set is to use balanced metrics, such as balanced accuracy, G-Mean, and Cohen’s Kappa. These metrics normalize performance contributions across classes, treating the minority class as equally important as the majority class.
Balanced accuracy is an alternative to accuracy that takes the class imbalance of the data set into account. It is calculated as the average of the TPR and the True Negative Rate (TNR), as shown in (
7). Balanced accuracy ranges from 0 to 1, where 0.5 indicates random guessing, and 1 represents perfect classification.
G-Mean represents the geometric mean of the True Positive Rate (TPR) and True Negative Rate (TNR). It indicates the balance between the classification performance on the majority and minority classes. G-Mean is calculated based on (
8). The G-Mean ranges from 0 to 1, where 0 indicates that the classifier is unable to correctly classify any instance from either class, and 1 indicates that the classifier can perfectly classify all instances from both classes.
Cohen’s Kappa measures the level of agreement between two raters assigning instances to mutually exclusive categories; here, the two raters are the classifier’s predictions and the ground-truth labels. It is calculated using (9). Cohen’s Kappa ranges from −1 to 1, with 0 indicating agreement no better than chance and 1 indicating perfect agreement.
where p_o denotes the observed agreement between the two raters and p_e denotes the agreement expected by chance.
To evaluate the classifiers, we utilize the Python implementation of these metrics available in the sklearn.metrics library.
4. Results
In order to construct a data set to train ML models for fraud detection, we processed the logs available in the Campus Virtual Moodle platform, as depicted in Algorithm 1. Data privacy is ensured by encrypting all the user-specific information from the logs. The information found in the logs is mapped to describe each student’s behavior in relation to a certain exam. Data labeling is performed by identifying the cheaters.
Table 7 presents the fraud ratios observed across two teaching methods during the study period: Face-to-Face (Face2Face) and Online.
Table 8 provides the incidence of fraud for each course per academic year. Moreover, it illustrates the evolution in time of the fraud phenomenon among students. Further filtering of the data set was performed to eliminate duplicate records and students who withdrew from the course. Cheaters are represented by those who break the fixed rules.
After labeling, a notable disproportion was observed between honest students and cheaters: only 5.83% of students were identified as cheaters. Such data set imbalance is a prevalent issue in fraud detection problems.
To handle the class imbalance problem, we evaluated 11 resampling strategies, as described in Table 6, ranging from traditional oversampling methods to hybrid solutions. These include the SMOTE algorithm followed by Tomek Links removal to avoid overfitting, and the ADASYN algorithm, which is designed to handle challenging boundary samples. Moreover, hybrid solutions consisting of partial oversampling of the minority class to a target ratio, followed by undersampling with the Random Undersampling (RUS) method or the NearMiss method, were also implemented.
We proposed 12 approaches for the ML data set based on the template depicted in Figure 1. The 12 distinct data sets were derived by varying the attributes’ computation interval, using either standard cumulative metrics or temporal windows of 1, 3, 5, or 7 days, and by selectively incorporating auxiliary information such as exam attempts and grade book metadata, as seen in Figure 2. Comprehensive attribute descriptions and a summary of the experimental data sets are provided in Table 4 and Table 5, respectively.
In all cases, data labeling is performed based on fixed rules. The final step in data set construction is scaling the attributes according to (1). The 12 data sets created serve as input for six ML classifiers.
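Equation (1) is not reproduced in this section; assuming it denotes min–max normalization (a common choice for attribute scaling), the scaling step could look like the sketch below. The attribute values are illustrative, not real log features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumption: Equation (1) is min-max normalization to [0, 1].
# Two illustrative attribute columns (e.g., event counts and grades).
X = np.array([[10.0, 2.0],
              [50.0, 4.0],
              [30.0, 3.0]])

# Each column is independently mapped to [0, 1] using its own min and max.
X_scaled = MinMaxScaler().fit_transform(X)
```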
We analyzed the performance of ML techniques in identifying academic fraud by conducting a comprehensive experimental study comprising 792 unique configurations. These experiments examined the performance of 12 derived data sets, 11 balancing protocols, and six classification algorithms. Classifiers included in this study are LR, SVM, DT, GBM, AdaBoost, and RF.
The classifiers’ performance was evaluated using Repeated Stratified K-Fold cross-validation with five folds and three repetitions; each fold thus uses 80% of the data for training and 20% for testing. The 5-fold process was repeated three times with different random data partitions, yielding a total of 15 independent evaluation runs. Given the highly imbalanced nature of the data set (the minority fraud class represents 5.83% of entries), classifier performance was assessed using metrics capable of capturing minority class performance (precision, recall, and F1-Score) as well as balanced metrics (balanced accuracy, G-Mean, and Cohen’s Kappa).
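This evaluation protocol can be sketched with scikit-learn as follows; the synthetic data set and classifier settings are placeholders, not the study's actual data or tuned models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Placeholder imbalanced data set (~6% minority class), for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.94, 0.06], random_state=0)

# 5 folds x 3 repeats = 15 evaluation runs; stratification preserves the
# class ratio in every fold.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_validate(
    RandomForestClassifier(random_state=0), X, y, cv=cv,
    scoring=["recall", "f1", "balanced_accuracy"],
)

recall_runs = scores["test_recall"]
print(f"recall: {recall_runs.mean():.3f} \u00b1 {recall_runs.std():.3f}")
```

Reporting the mean and standard deviation over the 15 runs matches the "value ± deviation" format used for the results below.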
Table 9 summarizes the top-performing experiments across the following metrics: recall, F1-Score, G-Mean, and Cohen’s Kappa. To compare performance, the result obtained with the original data set is provided for each metric. The following section details our findings in response to the research questions introduced in Section 1.
RQ1. What is the impact of temporal window size on the discriminative power of features extracted from Moodle activity logs for academic fraud detection?
Our experimental results indicate that narrowing the temporal window improves the fraud detection rate. The highest recall was achieved by the RF classifier using the one-day temporal window data set (1 d) combined with a hybrid balancing strategy: ADASYN oversampling (ratio 0.2), Tomek Links cleaning, and NearMiss undersampling. This configuration yielded a recall of 0.757 ± 0.056. Furthermore, the highest G-Mean (0.607 ± 0.035) was achieved by the AdaBoost classifier using the one-day window paired with auxiliary information and a hybrid pipeline of ADASYN oversampling and Random Undersampling. The best results on the baseline, unmodified data set were obtained using SVC, with a recall of 0.617 ± 0.078 and a G-Mean of 0.569 ± 0.034.
RQ2. Which combination of resampling strategy and classification algorithm yields the best trade-off between fraud detection rate and precision in a highly imbalanced educational data set?
When prioritizing the detection rate, hybrid architectures yielded the best results. The top five results for recall were achieved using hybrid methods involving NearMiss undersampling, Tomek Links for eliminating outliers, and either ADASYN or SMOTE for oversampling. While these models achieved high recall (≈75%), they exhibited very low precision (≈6%), indicating a high rate of false positives. To optimize the trade-off between detection rate and precision, we evaluated the F1-Score and Kappa metrics. The RF classifier performed best on F1-Score and Kappa jointly when working with the original+extra data set balanced using ADASYN and Tomek Links, yielding the highest F1-Score of 0.202 ± 0.038 and Kappa of 0.156 ± 0.040. The SMOTE equivalent, using Tomek Links and RF on the same data set, performed equally well, with an F1-Score of 0.201 ± 0.033 and Kappa of 0.156 ± 0.035. The best performance on the baseline data set, without any modification, was considerably lower: the SVC model achieved an F1-Score of 0.134 ± 0.014, and the DT classifier obtained the maximum Kappa of 0.064 ± 0.054.
RQ3. How does the inclusion of auxiliary student metadata (e.g., grade book information) influence the performance of machine learning models in detecting academic fraud compared to using log-derived features alone?
Integrating student information and grade book data alongside log-derived features improved the classifiers’ overall performance and stability. When assessing the models using the G-Mean metric, the best result was given by the AdaBoost classifier. The best results for F1-Score and Kappa were achieved by the RF classifier. Notably, all five best-performing experiments when measuring G-Mean, F1-Score, and Kappa utilized data sets enriched with auxiliary information (+extra). The improvements over the baseline version are represented by an increase of 6.67% in G-Mean, 50.7% in F1-Score, and 143% in Kappa.
5. Discussions
The educational sector requires a fair and balanced environment for all involved stakeholders, such as students and teachers. Maintaining equity in the evaluation process requires accurate and sound assessment techniques. As hypothesized in our methodology, the shift to online learning environments with limited proctoring resulted in a higher incidence of academic fraud, more than doubling from 3.02% in face-to-face settings to 7.34% in online classes, as shown in Table 7. This trend is confirmed by the analysis of fraud rates across courses and academic years in Table 8. Instances of academic fraud increased with the start of online learning and remained high throughout that period, peaking at a rate of about 10% before declining as face-to-face learning resumed in late 2021. The results indicate that the lack of physical surveillance during online exams contributed greatly to the occurrence of misconduct, underlining the need for an automated system that safeguards academic integrity. This research tackles that need by investigating the following research questions: the impact of temporal window size on the discriminative power of Moodle log features (RQ1), the optimal combination of resampling strategy and classifier to balance detection rate and precision (RQ2), and the effect of adding auxiliary student metadata on detection performance (RQ3). The results show that while there are some similarities between detecting academic and financial fraud, academic fraud detection poses distinct challenges regarding feature engineering and class imbalance.
A contribution of this study is the creation of data sets derived from Moodle activity logs. Our G-Mean, F1-Score, and Cohen’s Kappa analysis showed that the inclusion of auxiliary data (students’ grade book attributes) produced better classification performance. This aligns with prior research in academic fraud detection. Kamalov et al. [29] demonstrated that detecting cheating requires analyzing differences between a student’s performance (grades) during the course and their final exam score, treating fraud as an outlier in the student’s performance. Similarly, Hu et al. [18] utilized gaps between practice exercises and final exam scores to flag potential cheaters. Our findings support this, confirming that “grade book” attributes, specifically grades obtained prior to the exam, are relevant features for academic fraud detection.
Furthermore, our use of temporal windows to capture recent behavior is similar to the sliding window strategies used in financial fraud detection by Dornadula and Geetha [38]. In the educational domain, Sangalli et al. [6] focused on granular interaction metrics, such as co-occurring submission times, to detect fraud. By translating the sliding window concept to the educational domain, where recent financial transactions are analogous to recent study activities, we demonstrated that temporal granularity (e.g., the 1 d data set) is more effective for fraud detection than semester-long aggregates when prioritizing recall or G-Mean. Notably, all top-five experiments for recall and G-Mean utilized temporal windows, achieving a maximum recall of 0.757 ± 0.056 with the 1 d_ADASYN_clean_nearmiss_RF configuration and a maximum G-Mean of 0.607 ± 0.035 with 1 d + extra_ADASYN_under_AdaB.
While temporal windows proved decisive for maximizing detection rates (recall and G-Mean), our analysis of F1-Score and Cohen’s Kappa reveals a divergent trend. The highest stability, indicated by the top F1-Score (0.202 ± 0.038) and Cohen’s Kappa (0.156 ± 0.040), was achieved using the cumulative (original + extra) data set rather than the segmented temporal windows. This distinction highlights the specific roles of different feature types: short-term temporal windows are sensitive to bursts of anomalous activity, making them suitable for flagging potential fraud (high recall). However, this sensitivity often comes at the cost of precision, as it generates more false positives. The semester-long cumulative metrics provide the necessary historical context to differentiate between a cheater and a hard-working student, effectively filtering out false alarms. Nevertheless, this stability induces a substantial trade-off: recall drops severely from 0.757 ± 0.056 to 0.189 ± 0.037 to achieve the best values for F1-Score and Cohen’s Kappa.
Regarding the capability of ML models to detect fraud, our extensive experimental evaluation identified RF and AdaBoost as the superior classifiers; RF alone accounted for 13 of the 20 top-performing results in our study. This finding is strongly supported by the existing literature, where RF has demonstrated efficacy in detecting financial fraud [1], medical fraud [13], and academic fraud [18]. However, a direct comparison of performance metrics reveals the distinct difficulty of this domain. While Sangalli et al. [6] reported accuracy exceeding 95% for specific collusion types, our optimal models achieved a recall of approximately 75% and a Cohen’s Kappa of 0.16. This disparity may stem from the definition of fraud: whereas Sangalli et al. targeted specific, detectable patterns, such as multiple-account cheating, our study aimed to detect a broader range of anomalous behaviors, which are inherently more subtle. This reflects the reality of the evaluated data set, where students did not adhere to a specific cheating protocol, requiring a generalized anomaly detection strategy capable of identifying diverse forms of academic misconduct.
Our results confirm that standard, unmodified data sets are insufficient for detecting fraud, yielding a baseline Kappa of only 0.06. Hybrid resampling architectures, specifically those combining oversampling (SMOTE/ADASYN) with cleaning (Tomek Links) or undersampling (NearMiss), showed the best results among the balancing methods. We observed a distinct trade-off: hybrid undersampling with NearMiss maximized recall (identifying more fraud instances), while oversampling with SMOTE or ADASYN paired with Tomek Links cleaning optimized precision and F1-Score. These findings are similar to those of Garg and Goel [8], who utilized feature engineering and clustering to distinguish cheating patterns but noted the difficulty of establishing a clear boundary between “intense study” and “cheating”.
Considering the experiments that yielded high recall, approximately 75% at the expense of precision, these prediction results include a wide range of students who are possibly at-risk, including actual cheaters and many false alarms. Therefore, these results should not be considered proof of guilt but are more appropriately used for optimization in the proctoring phase. Instructors can use the risk assessment tool for proactive and non-intrusive intervention techniques. Such techniques may involve assigning the identified students high-visibility seating, giving the students priority for “random” checks, and improved monitoring through frequent walkthroughs and pinned video monitoring. Such measures will have the overall effect of increasing the chances of detecting actual cheating and improving the efficiency of human monitoring, while appearing as standard administrative procedures to innocent students.
A limitation of this study is the process of creating the data sets. The twelve data sets were created from logs of three courses over several years in Campus Virtual. Although the extracted information is accurate in its current context, the extraction algorithm is necessarily context-dependent; when applied to log data from a different course, it may fail to produce a valid data set. Since there is no standardized structure that a Moodle course must follow, a custom algorithm may be required to extract information from the logs of another course. Future research can focus on using data retrieved from other courses and years. Given the absence of a standardized course structure, this requires either developing an extraction method that does not rely on a specific course structure or extending the current method to handle diverse course structures.
Another limitation may arise from the way we label the data set. Our labeling has perfect precision (100%), but its recall is uncertain. We label as cheaters (fraud) those students who break the fixed rules, which gives certainty that those labeled as cheaters indeed cheated on the exam. Nonetheless, this method may miss cheaters whose fraudulent actions are not covered by the fixed rules: a student can download the course material and access it locally during an exam, share answers with colleagues who took the exam earlier, or look at a neighbor’s screen. The fixed rules cannot cover such actions because they leave no trace in the logs, so these students are not labeled as cheaters. The moderate performance of the ML models may thus be partially explained by label noise, since the “honest” class contains an unknown proportion of undetected cheaters. A possible solution, to be explored in future work, is to use unsupervised learning techniques for labeling, as done by Sangalli et al. [6].
While our hybrid models improved detection capabilities, the high false positive rate, indicated by the low precision of the high-recall experiments, remains a limitation. As noted by Carcillo et al. [49] in the financial domain, augmenting data sets with too many derived scores can lead to variance issues. Future work could address this by integrating unsupervised learning techniques, similar to the clustering methods used by Garg and Goel, to identify different types of legitimate study patterns (e.g., “cramming” vs. “consistent study”). By distinguishing these diverse normal behaviors from actual anomalies, we could reduce the noise that leads to the high number of false alarms.
Any intervention must be non-intrusive and framed as standard procedure to avoid creating a hostile environment for innocent students.
In summary, the contributions of this paper are:
The framework presented in Algorithm 1 is used for constructing data sets from the available Moodle activity logs. We applied this framework to create 12 data sets based on the logs of three courses over three academic years. The distinction between the data sets is based on the temporal windows used and the inclusion of auxiliary information from the students’ grade book. Other instructors can use this framework to create data sets based on their available logs, but they should treat it as a template and adapt it to their specific course structure.
The analysis of the temporal windows revealed that the usage of shorter temporal windows results in capturing a higher number of fraudulent activities. The highest recall of 0.757 ± 0.056 is achieved using the one-day window for the data set. The best G-Mean of 0.607 ± 0.035 was also achieved by a one-day window data set, but with the inclusion of auxiliary information. This suggests that short temporal windows are more efficient in detecting fraudulent activities, while the inclusion of auxiliary data improves the overall performance.
The evaluation of 11 resampling strategies over 12 data sets and six classifiers (792 experiments total) demonstrated that class imbalance can be better treated using hybrid methods. Oversampling followed by cleaning with Tomek Links and undersampling with NearMiss showed the best recall values, approximately 75% for the top five results. Across all evaluated metrics, the best results were achieved with a hybrid resampling method.
Given the high false positive rate inherent in high-recall configurations, model outputs should not be treated as evidence of guilt. An instructor can best use these as screening tools. The students flagged by the model can be prioritized for “random checks”, they can be assigned to high-visibility seats, and they can be monitored more closely. These measures can increase the chances of detecting actual fraud and improving the efficiency of human proctoring, at the same time appearing as standard procedures to honest students.
6. Conclusions
Our study explored how activity logs can be used to construct a data set on which ML models can be trained for academic fraud detection. We established a framework for creating data sets from Moodle logs and student grade books. This process produced 12 distinct data sets by varying the temporal windows used in feature extraction and the inclusion of student grade book information. We used 11 resampling techniques: oversampling with SMOTE and ADASYN; hybrid resampling based on oversampling followed by undersampling with RUS or NearMiss; hybrid resampling with cleaning based on Tomek Links; and no resampling to serve as a benchmark. We evaluated six ML techniques: LR, DT, SVM, GBM, AdaBoost, and RF. In total, we conducted 792 experiments by testing every combination of data set, resampling technique, and ML technique. We found that maximizing recall is more effective when features are computed using smaller temporal windows rather than standard semester-long windows (RQ1). The highest stability, indicated by the F1-Score and Cohen’s Kappa, was achieved using the auxiliary grade book information (RQ3). While RF and AdaBoost outperformed the other techniques when using hybrid resampling methods involving NearMiss, Tomek Links, and either ADASYN or SMOTE (RQ2), the overall performance of the evaluated models remains unsatisfactory, as none provided acceptable values for F1-Score, Cohen’s Kappa, or G-Mean. A possible explanation relates to the ground truth: we defined rules that identify fraudulent activities with certainty, but they cannot cover every possible fraud scenario. This labeling method ensures that every entity labeled as a “cheater” committed fraud, but does not guarantee that all “honest”-labeled students did not cheat. Another finding is that hybrid undersampling maximized recall (≈75%), while hybrid oversampling optimized metrics such as F1-Score and Cohen’s Kappa.
Our findings suggest that the evaluated models cannot definitively prove guilt, but high-recall models can function as screening tools. Instructors can use these predictions to focus their attention on the sensitive cases. They can take actions such as moving flagged students to better visibility seats, prioritizing them for “random” checks, or closely observing them, while leaving honest students undisturbed. As future work, we will focus on reducing false positives by using unsupervised clustering to better distinguish diverse, legitimate study patterns from misconduct, and validating these approaches on data sets that originate from different institutions.