Applied Sciences
  • Article
  • Open Access

27 November 2023

Composing Multiple Online Exams: The Bees Algorithm Solution

1 Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh 12371, Saudi Arabia
2 Department of Computer Science, Faculty of Applied Sciences, Taiz University, Taiz P.O. Box 6803, Yemen
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Artificial Intelligence Technologies for Education: Advancements, Challenges, and Impacts

Abstract

Online education has gained increasing importance in recent years due to its flexibility and ability to cater to a diverse range of learners. The COVID-19 pandemic has further emphasized the significance of online education as a means to ensure continuous learning during crisis situations. With the disruption of traditional in-person exams, online examinations have become the new norm for universities worldwide. Among the popular formats for online tests are multiple-choice questions, which are drawn from a large question bank. However, creating online tests often involves meeting specific requirements, such as minimizing the overlap between exams, grouping related questions, and determining the desired difficulty level. The manual selection of questions from a sizable question bank while adhering to numerous constraints can be a laborious task. Additionally, traditional search methods that evaluate all possible solutions are impractical and time-consuming for such a complex problem. Consequently, approximate methods like metaheuristics are commonly employed to achieve satisfactory solutions within a reasonable timeframe. This research proposes the application of the Bees Algorithm (BA), a popular metaheuristic algorithm, to address the problem of generating online exams. The proposed solution entails creating multiple exam forms that align with the desired difficulty level specified by the educator, while considering other identified constraints. Through extensive testing and comparison with four rival methods, the BA demonstrates superior performance in achieving the primary objective of matching the desired difficulty level in most test cases, as required by the educator. Furthermore, the algorithm exhibits robustness, indicated by minimal standard deviation across all experiments, which suggests its ability to generalize, adapt, and be practically applicable in real-world scenarios. However, the algorithm does have limitations related to the number of successful solutions and the achieved overlap percentage. These limitations have also been thoroughly discussed and highlighted in this research.

1. Introduction

Over the past few years, the education system has experienced a significant transformation, primarily attributable to the impact of the COVID-19 pandemic. The shift towards online exams has been motivated by the imperative to prioritize the safety and well-being of both students and staff, while also ensuring the uninterrupted progression of education. Online exams provide students with a safer environment compared to traditional modes of examination. In general, from an instructor's standpoint, multiple-choice questions (MCQ) can be advantageous with respect to the ease of scoring, perceived objectivity in grading, fast return of scores in large classes, and the capacity to ask more questions [1,2].
The process of selecting exam questions, though, can be time-consuming, particularly when multiple exam models are necessary. Furthermore, when exams are administered to a large number of students at different times, it becomes essential to have a question bank that encompasses a considerable number of questions. Question banks facilitate the inclusion of questions in the exam at different intervals, offering a variety of selection criteria. This approach enhances the diversity of the generated exams. However, a challenge faced in both automatic and manual online exam generation is the question bank’s size. To ensure exam variety and minimize repetition, having a question bank larger than the total number of questions required is generally recommended [3,4]. While there is no specific guideline regarding the exact question bank size in relation to the number of questions needed, educational experts and researchers suggest having a substantial number of questions available. A larger question bank offers broader coverage of the curriculum and ensures adequate assessment of all important topics [5,6]. The specific question bank size will depend on factors such as the subject matter, questions’ difficulty level, and desired variety in the exams [3,4]. Generating various exam models from a question bank can be accomplished through manual or automated methods. The manual process is typically intricate and time-consuming, while employing an algorithm to automate the task allows for precise adherence to specific criteria instead of random question selection. This approach guarantees that the exam models maintain consistent difficulty levels and meet the required constraints.
The multiple-exam composition problem has recently attracted the attention of many researchers [2]. The problem at hand involves the automated generation of multiple MCQ exam papers or test sheets. These test sheets cover the same subject matter and maintain an identical or nearly identical difficulty level, ensuring a fair assessment process. It is crucial that the difficulty levels of the generated exams align with the desired test level established by the educator. Furthermore, to promote diversity and mitigate the potential for cheating, the overlap between the different tests, in terms of identical questions, should not exceed a specified percentage.
In this study, we propose a method to simultaneously create multiple forms of MCQ exams while adhering to specific criteria, including the difficulty level, avoidance of question repetition, and a predetermined number of questions in each section. Given that traditional search methods that exhaustively examine all feasible solutions are often impractical and time-consuming for larger problem sizes, we introduce a metaheuristic algorithm as a solution, namely, the Bees Algorithm (BA) [7].
In brief, the concept of the BA revolves around a set of initial solutions. The algorithm then focuses on exploiting promising and elite solutions by making modifications to enhance the likelihood of discovering improved solutions. The ultimate objective is to achieve the optimal or near-optimal solution through these iterative refinements [8]. The motivation for employing the BA algorithm lies in its ability to effectively tackle real-world problems that have proven challenging for other classical algorithms [8]. This algorithm, in particular, demonstrates promise in addressing combinatorial optimization problems [7]. Moreover, for the multiple-exam composition problem specifically, we opted for the BA for two primary reasons. Firstly, despite its popularity and simplicity, the BA has not been previously applied to solve this particular problem, unlike other well-known methods such as Simulated Annealing (SA) [9], the Genetic Algorithm (GA) [10], Particle Swarm Optimization (PSO) [2,11,12], and Ant Colony Optimization (ACO) [13]. Secondly, we believe that the BA possesses an inherent advantage over other methods which directly aligns with the multiple-exam composition problem. Specifically, it naturally selects the best and elite solutions as part of its optimization process, thereby readily generating multiple exams (i.e., solutions) in each run. In other words, the BA does not require any additional steps to generate multiple tests simultaneously, as seen in other algorithms, such as the random algorithm [14], SA [9], GA [10], and ACO [13], which involve repeating runs and/or sorting the population to select the best individuals, or incorporating parallelization, as in PSO [2,11,12].
The problem tackled in this research is defined as follows [15]:
Let a question bank QB = {Q1, Q2, Q3, …, Qk} and a generated test Ti = {Q1, Q2, Q3, …, Qn}, where n ≤ k. The difficulty level requirement of the exam set by the educator is DLR ∈ [0, 1]; the difficulty level of each question, DLx ∈ [0, 1], is also determined by the educator. Moreover, each question belongs to a certain chapter (part). The objective function that determines the quality of the solution (exam) computes the average of the difficulty values of all questions in the exam and then compares it to the required difficulty level DLR, as shown in Equation (1) [15].
f(Q_ij · DL) = | ( ∑_{j=1}^{n} Q_ij · DL ) / n − DLR |   (1)
where Q_ij · DL represents the difficulty level of question Q_ij. The smaller f(Q_ij · DL) becomes, the better the fitness of the solution. In addition, each generated exam must adhere to the following constraints, where C1, C2, and C3 are hard (i.e., mandatory) constraints, while C4 is a soft (i.e., desired but not mandatory) constraint:
  • C1: Each question in the exam must be unique.
  • C2: The difficulty levels of the questions should not all be concentrated around the required difficulty level of the exam, ensuring diversity. In other words, selecting questions that all have the same difficulty level for the test is not allowed.
  • C3: The number of questions extracted from each chapter (or section) must match the desired number specified by the educator.
  • C4: To generate multiple exams while ensuring a high level of uniqueness in each test, it is desirable to restrict the percentage of overlapping questions between tests. The acceptable overlap percentage can be determined using Equation (2). This approach guarantees that the exams exhibit minimal redundancy and maintain a diverse collection of questions. Following [2], the maximum acceptable overlap percentage is 0.3. Assuming we have m generated tests, the overlap percentage P_test between the m tests is calculated as follows:
P_test = ((O − U) / (m · n)) × 100   (2)
where
  • O is the number of overlapping questions between all tests;
  • U is the number of unique questions within O;
  • m is the number of tests generated;
  • n is the number of questions in each test.
The O − U term accounts for the diversity of the questions involved in the overlap count. Its purpose is to slightly mitigate the impact of overlap when the overlapping questions are diverse, and to penalize situations in which the overlapping portion consists of only a few distinct questions, which indicates a higher similarity between the exams. To illustrate this concept, consider the following example:
Let us consider five exams, each consisting of ten questions. Suppose there are 15 overlapping questions, all of which come from Q1, Q2, and Q3. In this case, the overlap percentage would be calculated as ((15 − 3)/50) × 100 = (12/50) × 100 = 24%.
On the other hand, suppose we have 15 overlapping questions, but these questions include Q1, Q2, Q3, Q4, and Q5. In this case, the overlap percentage would be ((15 − 5)/50) × 100 = (10/50) × 100 = 20%. The second scenario is considered better than the first because it demonstrates more diversity in the overlapping questions. If we removed the U (unique) component from Equation (2), both scenarios would yield the same overlap percentage, which we believe would not be fair.
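To make Equations (1) and (2) concrete, the following minimal Python sketch computes the fitness of an exam and the overlap percentage between tests; the function names and data layout are our own illustration rather than the authors' implementation, and the last two lines reproduce the two scenarios above.

def fitness(difficulties, dlr):
    # Equation (1): absolute difference between the exam's average difficulty and DLR.
    return abs(sum(difficulties) / len(difficulties) - dlr)

def overlap_percentage(overlapping_questions, m, n):
    # Equation (2): ((O - U) / (m * n)) * 100, where U counts distinct questions within O.
    o = len(overlapping_questions)
    u = len(set(overlapping_questions))
    return (o - u) / (m * n) * 100

# Five exams of ten questions each (m * n = 50), as in the two scenarios above.
print(overlap_percentage(["Q1"] * 5 + ["Q2"] * 5 + ["Q3"] * 5, m=5, n=10))  # 24.0
print(overlap_percentage(["Q1", "Q2", "Q3", "Q4", "Q5"] * 3, m=5, n=10))    # 20.0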
In this research, the BA is employed to construct a number of MCQ exams from a question bank while adhering to specific constraints. The primary objective of the solution is to attain the desired difficulty level determined by the educator. As a metaheuristic algorithm, the BA offers an approximate solution that can be optimal or near-optimal. As previously highlighted, the BA is well-suited for simultaneously generating multiple exam forms as it interacts with a population of solutions and selects the best and elite solutions during an intensified search process.
This study presents four main contributions. Firstly, it introduces a unique adaptation of the BA, specifically designed to address this problem. Secondly, it enables the generation of multiple online tests simultaneously, meeting the user requirements without resorting to parallelization techniques. Thirdly, it introduces problem-specific neighborhood moves, crafted to adhere to the constraints of this particular problem. Finally, the experimental results demonstrate that our proposed method performs favorably compared to existing approaches, delivering high-quality solutions to the problem at hand.
The remainder of this paper is organized as follows. Section 2 reviews some background information, as well as previous related work. Section 3 details the proposed BA approach for the multiple-exam composition problem. Section 4 describes our experiments and results. Section 5 discusses our findings, and finally Section 6 concludes the study.

3. Methodology

This section provides a detailed description of the algorithm employed to address the multiple-exam composition problem. Firstly, we will outline the main steps followed by the BA to solve this problem. After specifying the number of desired exams (m), the required difficulty level (DLR), the number of questions in each part, and the overlap threshold, the algorithm proceeds as follows:
  • Generate an initial population of s scout bees (s > m), in which each bee represents a potential exam. Sort the population based on each bee's fitness, i.e., the deviation of the exam's difficulty level from the required level, calculated using Equation (1).
  • Select the fittest bees (m) for neighborhood search. During neighborhood search, some questions in the selected exam are replaced, resulting in a new exam or neighbor. The number of neighbors generated for the elite (fittest) bees (e) within m is higher than the number of neighbors generated for the remaining bees (m-e). The fittest bee from each of the m patches or neighborhoods is then chosen to transfer to the next generation. Two neighborhood search methods are applied: random for the best and elite bees, and deterministic when the set of m exams fails the overlap condition. In the deterministic method, overlapping questions are replaced with new questions from the same chapters with similar difficulty levels. This preserves the fitness value while reducing the overlap percentage.
  • Test the overlap condition by comparing the m exams and keeping track of overlapping questions. If the overlap condition is satisfied, select the fittest bees from the m patches to transfer to the next generation and discard the remaining (s-m) bees. However, if the overlap percentage, calculated using Equation (2), exceeds the overlap threshold, perform a new neighborhood search operation on the previously selected m bees until the overlap condition is met.
  • Generate new (s-m) bees randomly to maintain the original population size.
The stopping-condition test memorizes the best solutions found so far and checks whether the overlap percentage has changed. If it has not, the algorithm returns the desired m exams. The process (steps 2–4) repeats until the termination condition is met. The termination condition is satisfied when the resulting set of bees includes m exams that fulfill all the conditions specified by the educator, when the algorithm reaches a maximum pre-defined number of iterations, or when the overlap percentage remains unchanged for five consecutive iterations. Algorithm 1 presents the pseudocode of the proposed method.
Algorithm 1: BA for Multiple Exam Generation.
Input: number of the desired exams (m), required difficulty level, overlap percentage, number of required chapters and the number of questions in each chapter.
Output: multiple tests satisfying the conditions.
Initialize the number of scout bees s, the number of elite sites e, and the patch size.
Initialize a population of s scout bees.
Repeat until the maximum number of iterations is reached.
  If iteration != 0 Then
   If the overlap condition is not satisfied Then
    Apply overlap neighborhood move.
   End if
   If stopping condition is satisfied Then
    Return the m exams generated.
   End if
  End if
  Calculate the objective function of the population.
  Choose m best sites for neighborhood search.
  Assign the size of the neighborhood (patch size).
  Apply neighborhood search; search more around the best e sites.
  Select the fittest bee from each patch.
  Assign the (s-m) remaining bees to random search.
  Generate a new population of scout bees.
End Repeat
Return the m exams generated.
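The following Python sketch mirrors the structure of Algorithm 1. It is a simplified illustration under our own naming assumptions: random_exam, fitness, neighborhood_move, overlap_move, overlap_percentage, and stopping_condition are placeholders for the operations described above, and the parameter defaults are arbitrary rather than the values used in our experiments.

def bees_algorithm(bank, counts, m, dlr, max_overlap,
                   s=30, e=2, nep=10, nsp=5, max_iter=1000):
    # Initialize a population of s scout bees; each bee is a candidate exam.
    population = [random_exam(bank, counts) for _ in range(s)]
    for iteration in range(max_iter):
        # Evaluate the objective function (Equation (1)) and choose the m best sites.
        population.sort(key=lambda exam: fitness(exam, bank, dlr))
        best_sites = population[:m]
        if iteration != 0:
            if overlap_percentage(best_sites) > max_overlap:
                # Overlap condition violated: apply the deterministic overlap move.
                best_sites = overlap_move(best_sites, bank)
            elif stopping_condition(best_sites, bank, dlr):
                return best_sites
        # Neighborhood search: more neighbors (nep) around the e elite sites,
        # fewer (nsp) around the remaining m - e sites; keep the fittest bee per patch.
        survivors = []
        for rank, exam in enumerate(best_sites):
            patch = [neighborhood_move(exam, bank) for _ in range(nep if rank < e else nsp)]
            survivors.append(min(patch + [exam], key=lambda x: fitness(x, bank, dlr)))
        # The remaining s - m bees scout randomly to form the new population.
        population = survivors + [random_exam(bank, counts) for _ in range(s - m)]
    return sorted(population, key=lambda exam: fitness(exam, bank, dlr))[:m]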
A crucial aspect of the algorithm is the representation of a problem’s solution, which, in our case, corresponds to an exam or test. The exam is structured into parts or chapters, with each part containing multiple questions of varying difficulty levels. The educator specifies the number of questions per part, the overall difficulty level of the test, and the desired overlap percentage between tests. Hence, the solution is represented as a test object comprising parts or chapters, where each part consists of a collection of questions identified via a question ID. Figure 2 provides an illustration of a solution representation for an exam.
Figure 2. An example of a solution representation.
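One possible encoding of this representation in Python is sketched below; the class and field names are our own illustrative choices, and Figure 2 shows the corresponding diagram.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Part:
    chapter: int                                          # chapter (part) the questions belong to
    question_ids: List[str] = field(default_factory=list)

@dataclass
class Exam:
    parts: List[Part] = field(default_factory=list)       # one Part per required chapter

    def all_question_ids(self) -> List[str]:
        return [q for part in self.parts for q in part.question_ids]

# A five-question exam: two questions from chapter 1, two from chapter 2, one from chapter 3.
exam = Exam(parts=[Part(1, ["Q8", "Q9"]), Part(2, ["Q13", "Q11"]), Part(3, ["Q24"])])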
As mentioned earlier, our neighborhood move involves replacing one question in the exam to generate a new exam. In the following section, we present an example of a question bank that will be utilized to illustrate and describe the behavior of our algorithm.
Ch1:  Q1    Q2    Q3    Q4    Q5    Q6    Q7    Q8    Q9    Q10
DL:   0.56  0.93  0.45  0.12  0.98  0.71  0.24  0.25  0.20  0.62
Ch2:  Q11   Q12   Q13   Q14   Q15   Q16   Q17   Q18   Q19   Q20
DL:   0.37  0.54  0.50  0.36  0.86  0.13  0.90  0.67  0.76  0.28
Ch3:  Q21   Q22   Q23   Q24   Q25   Q26   Q27   Q28   Q29   Q30
DL:   0.42  0.21  0.33  0.60  0.13  0.59  0.78  0.94  0.81  0.63
The initial step involves initializing a random population. In this example, population size = 3, and there are two exams to be generated (m = 2). Each exam consists of five questions, with two questions allocated to chapter 1, two questions to chapter 2, and one question to chapter 3. The desired difficulty level is set to 0.65, and the overlap limit is defined as 0.2. The fitness of each exam is calculated by computing the difference between the average difficulty level of the exam and the required difficulty level specified by the user.
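The following Python sketch illustrates this initialization step on the example question bank above. For brevity, an exam is represented here simply as a dictionary mapping each chapter to its selected question IDs; the data layout and function names are our own illustrative choices. The three randomly generated exams of this example are listed below the sketch.

import random

# The example question bank, stored as {chapter: {question_id: difficulty}}.
bank = {
    "Ch1": {"Q1": 0.56, "Q2": 0.93, "Q3": 0.45, "Q4": 0.12, "Q5": 0.98,
            "Q6": 0.71, "Q7": 0.24, "Q8": 0.25, "Q9": 0.20, "Q10": 0.62},
    "Ch2": {"Q11": 0.37, "Q12": 0.54, "Q13": 0.50, "Q14": 0.36, "Q15": 0.86,
            "Q16": 0.13, "Q17": 0.90, "Q18": 0.67, "Q19": 0.76, "Q20": 0.28},
    "Ch3": {"Q21": 0.42, "Q22": 0.21, "Q23": 0.33, "Q24": 0.60, "Q25": 0.13,
            "Q26": 0.59, "Q27": 0.78, "Q28": 0.94, "Q29": 0.81, "Q30": 0.63},
}

def random_exam(bank, counts):
    # Draw the required number of distinct questions from each chapter (constraints C1 and C3).
    return {ch: random.sample(sorted(bank[ch]), k) for ch, k in counts.items()}

def exam_fitness(exam, bank, dlr):
    # Equation (1): |average difficulty of the selected questions - DLR|.
    dls = [bank[ch][q] for ch, qs in exam.items() for q in qs]
    return abs(sum(dls) / len(dls) - dlr)

# Population size s = 3; two questions from Ch1 and Ch2, one from Ch3; DLR = 0.65.
counts = {"Ch1": 2, "Ch2": 2, "Ch3": 1}
population = sorted((random_exam(bank, counts) for _ in range(3)),
                    key=lambda ex: exam_fitness(ex, bank, dlr=0.65))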
Exam 1 (fitness = 0.266)
Question:    Q8    Q9    Q13   Q11   Q24
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.25  0.20  0.50  0.37  0.60

Exam 2 (fitness = 0.212)
Question:    Q1    Q7    Q17   Q14   Q25
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.56  0.24  0.90  0.36  0.13

Exam 3 (fitness = 0.016)
Question:    Q2    Q6    Q12   Q18   Q29
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.93  0.71  0.54  0.67  0.32
Next, a neighborhood search is conducted on the fittest m bees (exams), where m equals 2 in this instance. During this step, one question in each of the fittest m exams is exchanged for another question from the same chapter. As Exam 3 is the fittest (or elite) exam, we extract two neighbors from it. For clarity, each neighbor's header below indicates which question was exchanged.
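Continuing the Python sketch above (with bank as defined earlier; names remain illustrative), the random neighborhood move can be expressed as follows:

def neighborhood_move(exam, bank):
    # Replace one randomly chosen question with a different question from the same
    # chapter, keeping the questions in the exam unique (constraint C1).
    neighbor = {ch: list(qs) for ch, qs in exam.items()}
    ch = random.choice(list(neighbor))
    idx = random.randrange(len(neighbor[ch]))
    candidates = [q for q in bank[ch] if q not in neighbor[ch]]
    neighbor[ch][idx] = random.choice(candidates)
    return neighbor

# Two neighbors are drawn around the elite exam (Exam 3) and one around Exam 2.
exam3 = {"Ch1": ["Q2", "Q6"], "Ch2": ["Q12", "Q18"], "Ch3": ["Q29"]}
exam2 = {"Ch1": ["Q1", "Q7"], "Ch2": ["Q17", "Q14"], "Ch3": ["Q25"]}
neighbors = [neighborhood_move(exam3, bank), neighborhood_move(exam3, bank),
             neighborhood_move(exam2, bank)]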
Exam 4 (neighbor of Exam 2; Q14 exchanged for Q19; fitness = 0.132)
Question:    Q1    Q7    Q17   Q19   Q25
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.56  0.24  0.90  0.76  0.13

Exam 5 (neighbor of Exam 3; Q29 exchanged for Q21; fitness = 0.004)
Question:    Q2    Q6    Q12   Q18   Q21
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.93  0.71  0.54  0.67  0.42

Exam 6 (neighbor of Exam 3; Q12 exchanged for Q11; fitness = 0.05)
Question:    Q2    Q6    Q11   Q18   Q29
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.93  0.71  0.37  0.67  0.32
The next step is to select the fittest exams from each neighborhood or patch, specifically Exams 4 and 5. We then evaluate the overlap between them. In this case, as there are no overlapping questions, the overlap ratio is 0, which does not exceed the defined overlap threshold. Consequently, we generate a new random population of (s-m) bees. Since the m bees are Exams 4 and 5, only one new bee is created randomly. Thus, the new set of bees comprises Exams 4, 5, and 7.
Exam 4 (fitness = 0.132)
Question:    Q1    Q7    Q17   Q19   Q25
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.56  0.24  0.90  0.76  0.13

Exam 5 (fitness = 0.004)
Question:    Q2    Q6    Q12   Q18   Q21
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.93  0.71  0.54  0.67  0.42

Exam 7 (fitness = 0.126)
Question:    Q5    Q1    Q20   Q15   Q30
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.98  0.56  0.85  0.86  0.63
Then, we proceed to repeat the neighborhood search operation on Exams 5 and 7, since they are the best exams.
Exam 8 (neighbor of Exam 5; fitness = 0.014)
Question:    Q5    Q6    Q12   Q18   Q21
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.98  0.71  0.54  0.67  0.42

Exam 10 (neighbor of Exam 7; fitness = 0)
Question:    Q10   Q1    Q20   Q15   Q30
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.35  0.56  0.85  0.86  0.63

Exam 9 (neighbor of Exam 5; fitness = 0)
Question:    Q2    Q6    Q12   Q18   Q29
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.93  0.71  0.54  0.67  0.40
The fittest exams from each neighborhood are Exams 9 and 10. Since there are no overlapping questions between them, the overlap percentage is calculated as 0%. At this point, we check the stopping condition. Since we have two distinct exams, both with a fitness value of 0, the algorithm terminates, resulting in two unique exams that have the same difficulty level and the required distribution of questions across chapters.
Furthermore, we provide an example of the overlap neighborhood move, which occurs when the algorithm exceeds the predefined overlap percentage. When the algorithm fails to satisfy the overlap condition, it focuses on diversifying the top m exams. This is accomplished by replacing as many overlapping questions as possible with new questions from the same chapter. The replacement questions are carefully selected to ensure a similar or closely matched difficulty level to the questions they replace. The process of changing the overlapping questions is carried out individually for each question and one exam at a time. In other words, for every overlapping question, an attempt is made sequentially on each overlapping exam to replace the corresponding question with another unique question from the same chapter, with a similar or closely matched difficulty level. If the attempt is successful, there is no need to consider other exams. However, if the attempt fails, the next exam in the sequence is taken into consideration.
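A possible realization of this overlap move, continuing the Python sketch above, is given below. It uses the 0.05 difficulty margin mentioned in the example that follows; the helper name and structure are our own illustration, and the final lines apply the move to the two best exams of that example.

def overlap_move(exams, bank, margin=0.05):
    # For each overlapping question, try the exams one at a time and replace the
    # question with an unused question from the same chapter whose difficulty is
    # within the given margin; stop at the first successful replacement.
    used = {q for exam in exams for qs in exam.values() for q in qs}
    counts = {}
    for exam in exams:
        for qs in exam.values():
            for q in set(qs):
                counts[q] = counts.get(q, 0) + 1
    for q in [q for q, c in counts.items() if c > 1]:          # overlapping questions
        for exam in exams:                                      # one exam at a time
            for ch, qs in exam.items():
                if q not in qs:
                    continue
                dl = bank[ch][q]
                candidates = [p for p, d in bank[ch].items()
                              if p not in used and abs(d - dl) <= margin]
                if candidates:
                    qs[qs.index(q)] = candidates[0]
                    used.add(candidates[0])
                    break                                       # replaced in this exam
            else:
                continue                                        # not replaced here; try the next exam
            break                                               # success: skip the remaining exams

# Exams 1 and 2 of the example below: Q9 -> Q7 and Q13 -> Q12 are applied to Exam 1.
exam1 = {"Ch1": ["Q2", "Q9"], "Ch2": ["Q13", "Q18"], "Ch3": ["Q29"]}
exam2 = {"Ch1": ["Q8", "Q9"], "Ch2": ["Q13", "Q11"], "Ch3": ["Q24"]}
overlap_move([exam1, exam2], bank)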
The parameters used in this example are as follows: population size = 3, m = 2 (number of exams), and five questions per exam. The difficulty level is set to 0.65, and the overlap limit is defined as 0.15. Exams 1 and 2 are identified as the best m exams. It is noteworthy that both Q9 and Q13 appear twice in the selected best exams, so the total number of overlapping questions (O) is 2 + 2 = 4. The variable U represents the number of unique questions within the set of overlapping questions; since all overlapping questions are either Q9 or Q13, U = 2. Applying Equation (2) yields an overlap percentage of ((4 − 2)/(5 × 2)) × 100 = 20%, i.e., 0.2. Consequently, the overlap percentage exceeds the maximum overlap threshold of 0.15 specified above.
Exam 1 (fitness = 0.126)
Question:    Q2    Q9    Q13   Q18   Q29
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.93  0.20  0.50  0.67  0.32

Exam 2 (fitness = 0.266)
Question:    Q8    Q9    Q13   Q11   Q24
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.25  0.20  0.50  0.37  0.60

Exam 3 (fitness = 0.292)
Question:    Q1    Q9    Q12   Q14   Q25
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.56  0.20  0.54  0.36  0.13
The overlap neighborhood move is then performed on the best m exams, specifically Exams 1 and 2. During this process, the overlapping questions are replaced with new and unique questions from the same chapter. The replacement questions are chosen to have a difficulty level that is either identical to or within a 0.05 margin of the question being replaced. In this case, since Q7 is available and meets the specified criteria (from chapter 1 and within the margins of the difficulty level), it is an appropriate replacement for Q9. Similarly, Q13 is replaced with Q12 for the same reasons. These changes were applied to Exam 1.
Exam 1 (fitness = 0.11)
Question:    Q2    Q7    Q12   Q18   Q29
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.93  0.24  0.54  0.67  0.32

Exam 2 (fitness = 0.266)
Question:    Q8    Q9    Q13   Q11   Q24
Chapter:     Ch1   Ch1   Ch2   Ch2   Ch3
Difficulty:  0.25  0.20  0.50  0.37  0.60
After the execution of the overlap neighborhood move, the resulting best m exams are entirely distinct from each other. As a result, the overlap percentage is now reduced to zero.

4. Experimental Design and Results

To evaluate the performance of the proposed algorithm, we utilized two datasets obtained from references [2,15]. These datasets are outlined in Table 1. The first dataset, the small question bank (small QB), consists of 1000 questions with varying difficulty levels distributed across 14 chapters. The second dataset, the large question bank (large QB), consists of 12,000 questions with diverse difficulty levels across 12 chapters.
Table 1. Summary of experimental datasets; dataset 1 for the small question bank [2], dataset 2 for the large question bank [15].
To run the algorithm, the educator needs to provide several parameters, including the number of desired exams (m), the difficulty level requirement for each exam (DLR), the desired chapters (Ch1, Ch2, … Chc) for each exam, and the number of questions covering each selected chapter (Qn1, Qn2, … Qnc). Additionally, there are other parameters specific to our Bees Algorithm (BA) that need to be specified, such as the number of scout bees (initial population size) (s), the number of best (elite) solutions from the m exams (e), the neighborhood (patch) size for both e and m-e, and the maximum allowed overlap percentage between generated tests. We conducted a parameter-tuning test, exploring different values for these parameters to determine the optimal configuration for our algorithm.
The performance criterion considered was that the quality of each exam, as measured by the fitness function, should have a margin of error no greater than 0.0001 [2]. We also introduced additional measures to handle the trade-off between fitness value and runtime. If the algorithm fails to produce m exams with a fitness value within the range of [0, 0.0001] within the allocated time, the margin of error is increased to 0.15. The resulting exams should not exceed a maximum overlap percentage of 0.3, as assumed in reference [2].
Initially, the algorithm prioritizes the fitness value, which is based on the difficulty of the exam. The algorithm keeps track of its solutions and saves the overall best solutions. Once an acceptable fitness value is achieved, the algorithm shifts its focus towards improving the overlap percentage. If the overlap percentage does not improve over five iterations, the algorithm terminates and outputs its best solution.
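A simplified sketch of this two-phase termination logic is shown below; the class name and structure are ours, and the time-based relaxation of the margin of error to 0.15 is omitted for brevity.

class StoppingCondition:
    # Prioritize the fitness value first; once every exam is within the error margin,
    # monitor the overlap percentage and stop when it has not improved for
    # `patience` consecutive iterations.
    def __init__(self, margin=0.0001, patience=5):
        self.margin = margin
        self.patience = patience
        self.best_overlap = None
        self.stalled = 0

    def __call__(self, fitnesses, overlap):
        if max(fitnesses) > self.margin:
            return False                     # still improving the difficulty-level objective
        if self.best_overlap is None or overlap < self.best_overlap:
            self.best_overlap, self.stalled = overlap, 0
        else:
            self.stalled += 1
        return self.stalled >= self.patience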
We implemented the proposed approach using Python 3.6.12 in the Colab environment, leveraging several libraries that support machine learning algorithms.
To evaluate the effectiveness of the proposed approach, we conducted a comparative analysis with four other algorithms: Parallel Migration PSO (PMPSO) [2], sequential PSO (SPSO) [2], the random algorithm (RA) [14], and the Simulated Annealing (SA) [9] algorithm. The results of these four methods were documented in reference [2].
The performance measurements used for evaluation were the quality of the solution and the standard deviation. Multiple runs of each algorithm were performed to obtain robust results. It is important to note that the population size and difficulty level requirement (DLR) were kept consistent across all algorithms in this comparison. However, the size of the question bank was varied to evaluate the algorithms’ performance across different search space sizes (i.e., different datasets). The objective function computed the quality of the solution, with a decreasing fitness function indicating an improved solution. The fitness value of each test was measured to assess the quality of the results.
The standard deviation served as a measure of the generated exams' proximity to the desired difficulty level (DLR). It acted as an indicator of the algorithm's stability or robustness [2]. To evaluate the stability of the algorithm, we measured the standard deviation across multiple DLRs.
Table 2 presents the parameters utilized in the conducted experiments. We performed two distinct experiments, each executed twice: once on the large question bank and once on the small question bank. The first experiment aimed to assess the algorithms’ performance when varying the number of generated exams, while the second experiment evaluated their performance when altering the required difficulty level. As the BA is a stochastic algorithm that can produce different results with each run, each experiment was repeated ten times to ensure accurate and reliable outcomes, which aligns with the number of runs conducted in [2].
Table 2. Parameters used in the testing phase.
Table 3 shows the detailed performance of all algorithms when changing the number of exams while keeping the difficulty level the same. The large question bank was used in this experiment. The success rate measures the number of times the algorithm reaches a successful solution, i.e., a solution with a fitness ∈ [0, 0.0001]. Figure 3 shows the average fitness of all algorithms.
Table 3. Algorithms’ results when changing the number of exams, using the large question bank.
Figure 3. Average fitness values for different numbers of exams using the large question bank.
As evident from the outcomes presented in Table 3 and Figure 3, the BA exhibits superior performance in terms of average fitness values across all dataset sizes compared to all other methods. The PMPSO and the SPSO demonstrate comparable performance, while the RA and SA underperform across all test cases. With regard to the standard deviation, the BA and PSO methods exhibit similar values, with slightly smaller deviations observed for the PSOs. This suggests that all three algorithms demonstrate stability, as the variation between different runs is minimal. Conversely, the RA and SA once again underperform in this aspect.
For the success rate, the BA was able to achieve an approximately 93% success rate on average, compared to the 100% success rate for the PSOs and much worse success rates for the RA and SA.
For the overlap percentage, it is important to highlight that our method for calculating the overlap percentage differs from the approach used by [2]. The reason behind this discrepancy is that their calculation involves dividing the number of overlapping questions by the number of tests, without considering the total number of questions in the tests. We believe that this method is not accurate for calculating the overlap percentage because when calculating a percentage, it is necessary to consider the relationship between a part and the whole. In other words, both the numerator and the denominator should be of the same type. Since exams are distinct entities from questions, we have developed our own formula for determining the overlap percentage, as shown in Equation (2). Therefore, a direct comparison of the overlap percentage between the algorithms cannot be provided. However, since we have established a baseline for comparison, which is the overlap threshold defined by the user, our ultimate objective is to ensure that the overlap percentage does not exceed this threshold, regardless of the calculation method employed. Observing the obtained results, it is evident that the BA was unable to attain the targeted overlap percentage of 30% for the larger test cases. This outcome is somewhat expected as the likelihood of question repetition increases with a larger number of exams.
Table 4 shows the performance of all algorithms when varying the number of exams using the small question bank, while Figure 4 shows a comparison of the algorithms in terms of average fitness values with respect to the number of exams.
Table 4. Algorithms’ results when changing the number of exams using the small question bank.
Figure 4. Average fitness values for different numbers of exams using the small question bank.
Based on the outcomes presented in Table 4 and Figure 4, it is evident once again that the BA generated exams with smaller fitness values compared to all other competing methods. This indicates that the exams produced by the BA were closer to the desired DLR. However, it should be noted that the two variants of PSO achieved an optimal success rate, which the BA did not attain, albeit with a small margin. Nevertheless, the BA exhibited a significantly higher success rate compared to the RA and SA. For standard deviation, both the BA and PSO methods demonstrated comparable values, suggesting stability across different runs for the obtained results. In contrast, the RA and SA exhibited much higher deviations, indicating instability in the outcomes across multiple runs. On the other hand, the BA once again fell short of achieving the desired overlap percentage, as calculated using our method.
Figure 5 illustrates the runtime behavior (in seconds) of the Bees Algorithm when varying the number of exams for question banks of different sizes. It can be observed from the figure that as the size of the question bank increases, the runtime also significantly increases, which is expected.
Figure 5. Bees Algorithm’s runtime when changing the number of exams.
The subsequent experiment involves modifying the required difficulty level (DLR) using the large question bank. The results of all algorithms for this experiment are presented in Table 5, while Figure 6 provides a side-by-side comparison of the algorithms in terms of their average fitness values. From these findings, it can be observed that the BA outperforms the PSO methods in terms of average fitness in three out of five instances (DLR = 0.4, 0.5, and 0.6), while having a slightly worse fitness for a DLR = 0.7, and much worse fitness for a DLR = 0.3. Nevertheless, the BA outperforms the RA and SA consistently in all instances. Also, the BA and both PSO algorithms demonstrate comparable standard deviations in most instances, outperforming the RA and SA.
Table 5. Algorithms’ results when changing the DLR using the large question bank.
Figure 6. Average fitness values for different DLRs using the large question bank.
Regarding the success rate, the BA achieved an average success rate of approximately 90% across the various test cases, significantly outperforming the RA and SA, which failed to obtain any successful solutions for the majority of the test cases. The PSO methods, though, had a 100% success rate for all test cases. In terms of the overlap percentage, the BA achieves better results in this experiment compared to the previous one on the same question bank (Table 3), as is discussed in more detail in Section 5.
The performance of the algorithms when varying the required difficulty levels using the small question bank is presented in Table 6. Similar to previous experiments, both PSO methods exhibited higher success rates than the BA in all instances, except for the instance in which DLR = 0.3, where both PSO methods significantly underperformed in comparison to the BA.
Table 6. Algorithms’ results when changing the DLR using the small question bank.
Figure 7 and Figure 8 provide a comparison of the average fitness values for each method across different instances, with Figure 8 separating the outlier instance (DLR = 0.3) for clarity. From these figures, it can be observed that the BA demonstrated better performance in terms of average fitness in two instances (DLR = 0.3 and 0.5), while it had a worse average fitness than the PSO methods in the remaining three instances. However, it is worth noting that the BA consistently outperformed the RA and SA in all instances. Additionally, the PSO methods exhibited slightly better standard deviations than the BA in most cases. Again in this experiment, the desired threshold for the overlap percentage was still not achieved by the BA.
Figure 7. Average fitness values when changing the DLR [0.4–0.7] for the small question bank.
Figure 8. Average fitness for DLR = 0.3 for the small question bank.

5. Discussion

In our study, we evaluated the performance of the Bees Algorithm (BA) based on four key criteria: the number of successful solutions, fitness, overlap percentage, and standard deviation. The computational experimentation demonstrated that the BA achieved favorable results in terms of successful solutions, fitness, and standard deviation compared to its rival methods. Specifically, the BA outperformed all competing methods in terms of average fitness in most test cases. However, the PSO methods (both parallel and sequential versions) exhibited a higher success rate, highlighting their advantage over the BA in this particular metric, although the BA outperformed the RA and SA in this metric.
Amongst the different difficulty levels (DLRs) considered, the case of DLR = 0.5 consistently showed the best results in terms of the fitness function, standard deviation, and successful solutions, as indicated in (Table 3, Table 4, Table 5 and Table 6). This can be attributed to the nature of our neighborhood move, which selects replacement questions at random. Therefore, it is relatively easier to balance difficulty level values for DLR = 0.5, since there is an approximately equal number of questions with higher and lower difficulty levels than 0.5. On the other hand, in the case of a DLR = 0.7, the majority of the questions in the database have a difficulty level lower than 0.7, making it more challenging to generate an exam with that difficulty level.
Furthermore, the standard deviation results were comparable to those of the PMPSO, according to the computational results. Once again, the case of DLR = 0.5 exhibited greater robustness of the algorithm. Additionally, the successful solutions were generally close to the optimal solution, as evident in the results, particularly for DLR = 0.5. It is worth noting that even better results could have been achieved if we had set a margin of error smaller than 0.15. Furthermore, the BA did not meet the specified overlap threshold reported in [2]. Despite this, our results can be considered reasonable, especially considering the small question bank, which has a limited number of questions per chapter. For instance, taking the first row of Table 4 as an example, the small question bank comprised only 614 unique questions across the first ten chapters, while the total number of generated questions was 10,000. This resulted in a significant overlap percentage.
Moreover, in line with the results presented in Table 3, there is a direct correlation between the number of exams and the overlap percentage. As expected, as the number of exams increased, the overlap percentage also increased. Additionally, Table 5 revealed that the selected difficulty level (DLR) could impact the overlap percentage, as previously explained.
In conclusion, we believe that the BA successfully achieved our research goals by producing near-optimal solutions that satisfied the required difficulty level set by the educator. It displayed robustness with respect to the difficulty level, as evidenced by the small standard deviation.

6. Conclusions

Automatically generated online exams have garnered significant interest among researchers and educators. However, there is a scarcity of resources that can produce multiple versions of the same exam automatically. Educators require such tools for fair assessments and to prevent cheating during exams. This study aims to contribute to the field of education by providing a method to quickly generate unique and fair exams compared to manual solutions. We propose the utilization of the Bees Algorithm (BA) to compose multiple exams from a question bank, with a focus on achieving the desired difficulty level as the primary objective. Each generated exam also adheres to several constraints defined by the educators to ensure an equitable examination process. The contribution of this research lies in presenting a customized version of the BA, specifically designed to address the problem at hand. Our proposed method enables the generation of multiple online tests simultaneously, effectively meeting user requirements without the need for parallelization techniques. Additionally, we introduce innovative problem-specific neighborhood moves that are designed to adhere to the constraints of the problem.
Through extensive testing on benchmark data and comparison with the four rival methods reported in [2], the experimental results indicate that our algorithm outperforms the competing methods in terms of the primary objective, i.e., the difficulty level of the exams, in the majority of test cases. Furthermore, the algorithm exhibits a small standard deviation across all experiments, demonstrating its robustness. However, it has a slightly lower success rate compared to the Particle Swarm Optimization (PSO) methods, although it outperforms the Random and Simulated Annealing (SA) algorithms in this aspect. Furthermore, the BA fell short of attaining the targeted overlap percentage, as assumed in [2].
For future work, we aim to enhance the design of the BA by employing smarter neighborhood moves, particularly when dealing with overlapping questions between exams. Fine-tuning this approach and exploring alternative methods may mitigate the overlap percentage and further improve the algorithm’s performance in terms of exam construction. One possible approach is to utilize question selection heuristics instead of randomly selecting replacement questions. An example heuristic is the Least Recently Used (LRU) heuristic, which prioritizes questions that have been used the least number of times in previous exams. By favoring less recently used questions, the LRU heuristic can maximize question diversity and minimize overlap. Additionally, runtime improvements can be achieved through parallelization or other optimization techniques (e.g., adaptive parameter tuning). Moreover, we intend to test the algorithm’s scalability by evaluating larger-sized instances.
In conclusion, the BA is a metaheuristic algorithm that proves useful in solving our problem. It stands out for its simplicity, adaptability, and ease of use. We believe that our BA competes well with other metaheuristic algorithms that have been successfully applied to this problem. Our research findings hold promise for improving educational outcomes by reducing the workload for instructors, enhancing student satisfaction, and optimizing resource allocation.

Author Contributions

Conceptualization, M.H.; methodology, M.H.; validation, M.H. and R.H.; formal analysis, M.H.; investigation, M.H. and R.H.; resources, M.H. and N.A.; data curation, M.H. and R.H.; writing—original draft preparation, R.H.; writing—review and editing, M.H. and N.A.; visualization, M.H.; supervision, M.H.; project administration, M.H.; funding acquisition, N.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been funded by King Saud University, Riyadh, Saudi Arabia, through the Researchers Supporting Project number (RSPD2023R857).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available at https://doi.org/10.1109/ACCESS.2021.3057515 (reference [2]) and https://doi.org/10.6688/JISE.201811_34(6).0004 (reference [15]).

Acknowledgments

This work has been funded by King Saud University, Riyadh, Saudi Arabia, through the Researchers Supporting Project number (RSPD2023R857). The authors acknowledge the contributions of Rafeef Alazzaz, Demah Alawadh, Atheer Mohammad, Beshayer Al-mutawa, Ghadah Alshabib for their role in reviewing the literature, implementing the algorithm, and reporting the results.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Stanger-Hall, K.F. Multiple-Choice Exams: An Obstacle for Higher-Level Thinking in Introductory Science Classes. CBE—Life Sci. Educ. 2012, 11, 294–306. [Google Scholar] [CrossRef] [PubMed]
  2. Nguyen, T.; Nguyen, L.T.T.; Bui, T.; Loc, H.D.; Pedrycz, W.; Snasel, V.; Vo, B. Multi-Swarm Optimization for Extracting Multiple-Choice Tests From Question Banks. IEEE Access 2021, 9, 32131–32148. [Google Scholar] [CrossRef]
  3. Raje, S.; Stitzel, S. Strategies for Effective Assessments While Ensuring Academic Integrity in General Chemistry Courses during COVID-19. J. Chem. Educ. 2020, 97, 3436–3440. [Google Scholar] [CrossRef]
  4. Beerepoot, M.T.P. Formative and Summative Automated Assessment with Multiple-Choice Question Banks. J. Chem. Educ. 2023, 100, 2947–2955. [Google Scholar] [CrossRef]
  5. Wen, J.; Lu, W.; Chen, Z. Innovation and Construction of Examination Database of Pharmacology. Indian J. Pharm. Educ. Res. 2020, 54, 279–283. [Google Scholar] [CrossRef]
  6. Krzic, M.; Brown, S. Question Banks for Effective Online Assessments in Introductory Science Courses. Nat. Sci. Educ. 2022, 51, e20091. [Google Scholar] [CrossRef]
  7. Yuce, B.; Mastrocinque, E.; Packianather, M.S.; Lambiase, A.; Pham, D.T. The Bees Algorithm and Its Applications. In Handbook of Research on Artificial Intelligence Techniques and Algorithms; IGI Global: Pennsylvania, PA, USA, 2015; pp. 122–151. ISBN 978-1-4666-7258-1. [Google Scholar]
  8. Al-Negheimish, S.; Alnuhait, F.; Albrahim, H.; Al-Mogherah, S.; Alrajhi, M.; Hosny, M. An Intelligent Bio-Inspired Algorithm for the Faculty Scheduling Problem. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 151–159. [Google Scholar] [CrossRef]
  9. Lu, P.; Cong, X.; Zhou, D. The Research on Web-Based Testing Environment Using Simulated Annealing Algorithm. Sci. World J. 2014, 2014, 167124. [Google Scholar] [CrossRef]
  10. Rahim, T.N.T.A.; Aziz, Z.A.; Rauf, R.H.A.; Shamsudin, N. Automated Exam Question Generator Using Genetic Algorithm. In Proceedings of the 2017 IEEE Conference on e-Learning, e-Management and e-Services (IC3e), Miri, Malaysia, 16–17 November 2017; pp. 12–17. [Google Scholar]
  11. Nguyen, T.; Bui, T.; Vo, B. Multi-Swarm Single-Objective Particle Swarm Optimization to Extract Multiple-Choice Tests. Vietnam J. Comp. Sci. 2019, 6, 147–161. [Google Scholar] [CrossRef]
  12. Bui, T.; Nguyen, T.; Huynh, H.M.; Vo, B.; Chun-Wei Lin, J.; Hong, T.-P. Multiswarm Multiobjective Particle Swarm Optimization with Simulated Annealing for Extracting Multiple Tests. Sci. Program. 2020, 2020, e7081653. [Google Scholar] [CrossRef]
  13. Hu, X.-M.; Zhang, J.; Chung, H.S.-H.; Liu, O.; Xiao, J. An Intelligent Testing System Embedded With an Ant-Colony-Optimization-Based Test Composition Method. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2009, 39, 659–669. [Google Scholar] [CrossRef]
  14. Naik, K.; Sule, S.; Jadhav, S.; Pandey, S. Automatic Question Paper Generation System Using Randomization Algorithm. Int. J. Eng. Technol. Res. 2014, 2, 192–194. [Google Scholar]
  15. Bui, T.; Nguyen, T.; Vo, B.; Pedrycz, W.; Snasel, V. Application of Particle Swarm Optimization to Create Multiple-Choice Tests. J. Inf. Sci. Eng. 2018, 34, 1405–1423. [Google Scholar] [CrossRef]
  16. Glover, F.; Sörensen, K. Metaheuristics. Scholarpedia 2015, 10, 6532. [Google Scholar] [CrossRef]
  17. Cheng, S.; Liu, B.; Ting, T.O.; Qin, Q.; Shi, Y.; Huang, K. Survey on Data Science with Population-Based Algorithms. Big Data Anal. 2016, 1, 3. [Google Scholar] [CrossRef]
  18. Gandomi, A.; Yang, X.-S.; Talatahari, S.; Alavi, A. Metaheuristic Algorithms in Modeling and Optimization. In Metaheuristic Applications in Structures and Infrastructures; Elsevier: Amsterdam, The Netherlands, 2013; pp. 1–24. ISBN 978-0-12-398364-0. [Google Scholar]
  19. Ziarati, K.; Akbari, R.; Zeighami, V. On the Performance of Bee Algorithms for Resource-Constrained Project Scheduling Problem. Appl. Soft Comput. 2011, 11, 3720–3733. [Google Scholar] [CrossRef]
  20. Li, T.; Zhou, C.; Hu, M. An Improved Artificial Bee Colony Algorithm for 3D Protein Structure Prediction. In Proceedings of the 2017 International Conference on Biometrics Engineering and Application, Hong Kong, China, 21–23 April 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 7–12. [Google Scholar]
  21. Korkmaz, E.; Akgüngör, A.P. The Forecasting of Air Transport Passenger Demands in Turkey by Using Novel Meta-heuristic Algorithms. Concurr. Comput. 2021, 33, e6263. [Google Scholar] [CrossRef]
  22. Korkmaz, E.; Akgüngör, A.P. Comparison of Artificial Bee Colony and Flower Pollination Algorithms in Vehicle Delay Models at Signalized Intersections. Neural Comput. Appl. 2020, 32, 3581–3597. [Google Scholar] [CrossRef]
  23. Korkmaz, E.; Doğan, E.; Akgüngör, A. Estimation of Car Ownership in Turkey Using Artificial Bee Colony Algorithm. In Proceedings of the 3rd International Conference on Traffic and Transport Engineering (ICTTE), Belgrade, Serbia, 24–25 November 2016. [Google Scholar]
  24. Dogan, E.; Korkmaz, E.; Akgungor, A.P. Comparison Of Different Approaches In Traffic Forecasting Models For The D-200 Highway In Turkey. Sci. J. Silesian Univ. Technol. Ser. Transp. 2018, 99, 25–42. [Google Scholar] [CrossRef]
  25. Li, J.-Q.; Song, M.-X.; Wang, L.; Duan, P.-Y.; Han, Y.-Y.; Sang, H.-Y.; Pan, Q.-K. Hybrid Artificial Bee Colony Algorithm for a Parallel Batching Distributed Flow-Shop Problem With Deteriorating Jobs. IEEE Trans. Cybern. 2020, 50, 2425–2439. [Google Scholar] [CrossRef]
  26. Alhuwaishel, N.; Hosny, M. A Hybrid Bees/Demon Optimization Algorithm for Solving the University Course Timetabling Problem. In Proceedings of the 3rd NAUN International Conference on Mathematical, Computational and Statistical Sciences, Dubai, United Arab Emirates, 22–24 February 2015. [Google Scholar]
  27. Almaneea, L.I.; Hosny, M.I. A Two Level Hybrid Bees Algorithm for Operating Room Scheduling Problem. In Intelligent Computing; Arai, K., Kapoor, S., Bhatia, R., Eds.; Advances in Intelligent Systems and Computing; Springer International Publishing: Cham, Switzerland, 2019; Volume 858, pp. 272–290. ISBN 978-3-030-01173-4. [Google Scholar]
  28. Pham, D.; Ghanbarzadeh, A.; Koç, E.; Otri, S.; Rahim, S.; Zaidi, M. The Bees Algorithm Technical Note; Manufacturing Engineering Centre, Cardiff University: Cardiff, UK, 2005; pp. 1–57. [Google Scholar]
  29. Talbi, E.-G. Metaheuristics: From Design to Implementation; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2009; Volume 74, ISBN 978-0-470-27858-1. [Google Scholar]
  30. Teodorović, D.; Dell’Orco, M. Bee Colony Optimization—A Cooperative Learning Approach to Complex Transportation Problems. Adv. OR AI Methods Transp. 2005, 51, 60–69. [Google Scholar]
  31. Ma, L.; Zhu, X.; Feng, Q. Improvement and Design of Genetic Algorithm in Personalized Test Paper Composition System. In Proceedings of the 4th International Conference on Computer Science and Application Engineering, Sanya, China, 20–22 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1–5. [Google Scholar]
  32. Hwang, G.-J.; Lin, B.M.T.; Tseng, H.-H.; Lin, T.-L. On the Development of a Computer-Assisted Testing System with Genetic Test Sheet-Generating Approach. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2005, 35, 590–594. [Google Scholar] [CrossRef]
  33. Shushu, L.; Fengying, W. Strategy and Realization of Auto-Generating Exam Paper Based on Genetic Algorithm. In Proceedings of the 2010 International Conference on Artificial Intelligence and Computational Intelligence, Sanya, China, 23–24 October 2010; Volume 2, pp. 478–482. [Google Scholar]
  34. Hwang, G.-J.; Yin, P.-Y.; Yeh, S.-H. A Tabu Search Approach to Generating Test Sheets for Multiple Assessment Criteria. IEEE Trans. Educ. 2006, 49, 88–97. [Google Scholar] [CrossRef]
  35. Baronti, L.; Castellani, M.; Pham, D.T. An Analysis of the Search Mechanisms of the Bees Algorithm. Swarm Evol. Comput. 2020, 59, 100746. [Google Scholar] [CrossRef]
