Article

AI Chatbots as Tools for Designing Evaluations in Road Geometric Design According to Bloom’s Taxonomy

by
Yasmany García-Ramírez
Department of Civil Engineering, Universidad Técnica Particular de Loja, Loja 110101, Ecuador
Appl. Sci. 2025, 15(16), 8906; https://doi.org/10.3390/app15168906
Submission received: 25 July 2025 / Revised: 6 August 2025 / Accepted: 7 August 2025 / Published: 13 August 2025
(This article belongs to the Special Issue Intelligent Systems and Tools for Education)

Abstract

In the realm of educational assessment, the integration of artificial intelligence (AI) offers a promising pathway for the development of robust evaluations. This study explores the application of AI chatbots in crafting and validating examinations tailored to road geometric design, while adhering to the principles of Bloom’s Taxonomy. Utilizing Gemini AI Studio, three distinct exam versions were generated, covering eight crucial topics within road geometric design. A panel of expert chatbots, including ChatGPT 3.5, Claude 3 Sonnet, Copilot, Perplexity, and You, assessed the validity of the exam content. These chatbots achieved scores of 9.17 or higher, establishing their proficiency as experts. Subsequent evaluations focused on relevance and wording, revealing high scores for both metrics, indicating the adequacy of the assessment tools. The two remaining versions were administered to student groups enrolled in the Road Construction II course at the Universidad Técnica Particular de Loja. Only 1.2% of students reached Bloom’s Taxonomy level 3, with many questions deemed easy, leading to varying trends in cognitive levels. Comparative analysis revealed significant discrepancies between student scores on the AI-generated exams and those from a previous “classic” exam. While AI shows potential in crafting valid assessments aligned with Bloom’s Taxonomy, greater human involvement is necessary to ensure high-quality instrument generation.

1. Introduction

Artificial intelligence (AI) language models are systems trained on massive amounts of text data, which enables them to understand, summarize, and generate human-like language. In the evolving field of educational assessment, AI presents promising tools for creating robust evaluations [1]. Among the emerging applications, AI language models (AI LMs) deployed as chatbots have become valuable assistants in crafting exam content, capable of generating a diverse array of materials, including assessment questions, essays, and stories [2]. Notably, research has demonstrated ChatGPT’s capacity to produce multiple-choice questions (MCQs) of comparable quality to those crafted by human examiners, and to do so at a significantly accelerated pace [3,4]. Furthermore, studies have reported successful outcomes when using ChatGPT to propose questions based on uploaded documents, indicating promising potential for exam preparation [4,5].
Parallel to the advancements in AI for educational assessment, Bloom’s Taxonomy, conceptualized by Bloom in 1956 [6] and revised in 2001 [7], has served as a foundational framework for evaluating the cognitive proficiency of students in several disciplines. This taxonomy delineates six hierarchical levels—remembering, understanding, applying, analyzing, evaluating, and creating—that encapsulate increasingly complex cognitive processes, from basic information retention to the capacity for evaluation and creation [8]. Higher-order levels (Analyzing, Evaluating, Creating) are generally emphasized more in modern educational assessments to promote critical thinking and problem-solving skills [9,10,11]. Recognizing its value, the American Society of Civil Engineers (ASCE) has integrated Bloom’s Taxonomy into educational strategies, affirming its role in enriching educational practices [12].
While AI chatbots like ChatGPT offer potential for generating exam content efficiently, some concerns have been raised about their use. One issue is that ChatGPT-generated questions may not align well with learning objectives, leading to superficial questions that fail to address student misconceptions [13]. Additionally, ChatGPT can occasionally produce biased or inaccurate content, highlighting the need for human verification when creating questions [14]. Furthermore, challenges persist with transfer questions or tasks requiring creativity and complex insights, where ChatGPT may struggle to provide satisfactory responses [15].
Regarding the possibility of considering chatbots as experts, the results are mixed. Specific limitations have been observed, such as ChatGPT achieving only moderate success rates of 44% on General Chemistry problems [16], and 46% on Ophthalmic Knowledge exams [17]. However, the more advanced GPT-4 model has shown significant improvement, with high success rates of 93% for detailed prompts and 91% for short prompts [18]. In the medical sciences, ChatGPT performed reasonably well, achieving 74% on basic sciences and 70% on clinical sciences [19]. AI chatbots like Microsoft Bing, Google Bard, and ChatGPT averaged 80.0% across various assessments [20].
While these performance metrics provide a useful benchmark, they do not fully capture the underlying cognitive mechanisms—or lack thereof—used by AI language models when generating educational content. Unlike human instructors who draw upon deep subject understanding, pedagogical strategies, and awareness of student misconceptions, AI models operate through statistical associations based on training data. As a result, they may struggle to generate questions that require reasoning, contextual interpretation, or creativity beyond pattern replication. This cognitive limitation becomes particularly relevant when aligning AI-generated content with frameworks such as Bloom’s Taxonomy, which is grounded in hierarchical thinking and the development of higher-order cognitive skills. Understanding these constraints is essential when assessing the appropriateness and pedagogical value of AI-generated exams in specialized fields.
Despite these limitations, the potential of AI chatbots in educational assessment remains promising, particularly in specialized fields where subject matter expertise is critical. Road geometric design, a vital component of civil engineering, governs the safe and efficient movement of traffic while considering factors such as topography, geology, drainage, environmental impact, and road safety. As such, it is imperative to ensure that educational assessments in this domain accurately gauge students’ understanding and ability to apply relevant principles and best practices. Exploring the feasibility of using AI chatbots to design exams aligned with established pedagogical frameworks like Bloom’s Taxonomy holds particular significance for specialized fields like road geometric design.
While previous studies have explored AI’s potential in educational assessment, this research uniquely investigates the application of AI chatbots in designing and validating domain-specific exams, specifically in road geometric design. By aligning AI-generated exam questions with Bloom’s Taxonomy, the study addresses a critical gap in the intersection of AI and specialized pedagogy, offering a novel framework for leveraging AI to enhance the quality and precision of assessments in specialized fields. Specifically, we evaluate the performance of Gemini AI Studio (1.5 Pro) in generating exam content and validate it with the assistance of various expert chatbots and students. The research objectives of this study are (1) to assess the ability of Gemini AI Studio (1.5 Pro) to generate exam questions for road geometric design courses that are aligned with Bloom’s Taxonomy; (2) to validate the exam content generated by Gemini AI Studio (1.5 Pro) using a panel of expert chatbots and evaluate its relevance, wording, and adherence to Bloom’s Taxonomy; (3) to administer the validated exam versions to students and compare their performance with previous exam results, evaluating the effectiveness of the AI-generated assessments.

2. Materials and Methods

2.1. Design

This study used a mixed-methods approach to evaluate the effectiveness of AI chatbots in generating exams for a road geometric design course. We used Gemini AI Studio (1.5 Pro) to create three exam versions, which were then validated for content by multiple AI chatbots. After validation, the exams were administered to students in the Road Construction II course. We compared student performance on these AI-generated exams with results from previous years using traditional exams.

2.2. Exam Development

We used Gemini AI Studio (1.5 Pro) (https://aistudio.google.com/app/, accessed on 24 July 2025) to generate the three exam versions. For content validation, we employed several other chatbots: ChatGPT 3.5 (https://openai.com/index/openai-api/, accessed on 24 July 2025), Claude 3 (https://www.anthropic.com/), Copilot (https://copilot.microsoft.com/), Perplexity (https://www.perplexity.ai/, accessed on 24 July 2025), and You-Smart mode (https://you.com/). All chatbots were used in their free versions until 7 May 2024 and provided diverse perspectives on complex engineering topics, aligning with Bloom’s Taxonomy. We chose them to minimize potential bias and ensure a comprehensive review process.
The exam content was derived from the book “Design of Geometric and Operation of Two-Lane Road” [21], with topics directly aligned with the book’s chapter titles. The exams were designed to align with Bloom’s Taxonomy, a common practice in engineering education. Each of the eight key themes from the book was assessed with 10 questions. Questions were distributed across the six levels of the taxonomy as follows: two questions for each of levels 1–4 (Remembering, Understanding, Applying, Analyzing) and one question for each of levels 5–6 (Evaluating, Creating). This distribution was chosen to reflect the typical requirements for undergraduate students.
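As a concrete illustration of this blueprint, the short Python sketch below (not the study’s actual tooling) assembles the 8-topic by 10-question structure described above; the topic labels are taken from the weighting list given later in Section 2.3, and the data layout is an assumption made only for illustration.

```python
# Illustrative sketch: build the exam blueprint of 8 topics x 10 questions,
# with 2 questions for Bloom levels 1-4 and 1 question for levels 5-6.

TOPICS = [
    "Introduction", "Driver", "Traffic", "Route study",
    "Horizontal geometric design", "Vertical geometric design",
    "Transverse geometric design", "Geometric design consistency",
]

# Bloom level -> number of questions per topic
QUESTIONS_PER_LEVEL = {1: 2, 2: 2, 3: 2, 4: 2, 5: 1, 6: 1}

def build_blueprint():
    """Return a list of (topic, bloom_level, question_index) slots."""
    blueprint = []
    for topic in TOPICS:
        for level, count in QUESTIONS_PER_LEVEL.items():
            for i in range(count):
                blueprint.append((topic, level, i + 1))
    return blueprint

if __name__ == "__main__":
    print(len(build_blueprint()))  # 80 questions per exam version
```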
To create the exams, the digital chapters of the reference book were uploaded to Gemini AI Studio. The following prompt structure was used as shown in Table 1.
Each of the three exam versions consisted of 80 questions. An initial review found that some questions required students to draw or design roads, which was not feasible for a computer-based exam. We used a follow-up prompt to have Gemini suggest alternative formats for levels 5 and 6 questions that did not require spatial manipulation.
“I have a computer-based exam that covers all six levels of Bloom’s Taxonomy, which I’ll share with you []. Since students won’t be able to draw routes, design roads, or perform spatial manipulations on the computer, I want to ensure the exam effectively assesses higher-order thinking skills like analysis and evaluation (levels 5 & 6 of Bloom’s Taxonomy). Could you suggest alternative question formats for levels 5 and 6 that don’t require spatial manipulation?”

2.3. Expert Validation

The exams were validated by a panel of expert judges, which included a diverse group of AI systems (ChatGPT 3.5, Claude 3, Perplexity, Copilot, and You) and human experts in test design and road design.
Before evaluating the student exams, the AI systems were tested on one of the Gemini-generated exam versions. Scoring was based on Bloom’s Taxonomy levels, with specific weights assigned to each level to reflect their importance in promoting critical thinking:
  • Levels 1 and 2 (Remembering and Understanding): Foundational knowledge and comprehension levels, weighted at 0.8 and 1.2, respectively.
  • Levels 3 and 4 (Applying and Analyzing): Involve applying knowledge and analyzing information, weighted at 1.5 and 1.0, respectively.
  • Level 5 (Evaluating): Critical thinking and making judgments, weighted at 0.50, reflecting the challenge of assessing this level in multiple-choice format.
  • Level 6 (Creating): Generating new ideas, weighted at 0.50 due to the difficulty of assessing creativity in multiple-choice format.
A minimum score of 7 was required to pass, aligned with ASCE’s competency levels for undergraduate civil engineering students, which generally correspond to level 3 of Bloom’s Taxonomy. The following weights were used for the 8 topics, transformed to a 10-point scale and chosen based on their importance in geometric design: introduction—5%, driver—5%, traffic—10%, route study—10%, horizontal geometric design—20%, vertical geometric design—20%, transverse geometric design—20%, and geometric design consistency—10%. AI judges had to score 9 points or higher to proceed.
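The text does not spell out exactly how the Bloom-level weights and topic weights are combined, so the following Python sketch is only a hedged reading of the scheme: each topic score is taken as a weighted average of the per-level fractions of correct answers, and topic scores are then aggregated with the topic weights and rescaled to 10 points.

```python
# Hedged sketch of the scoring scheme described above. The exact combination
# of Bloom-level weights and topic weights is an assumption for illustration.

LEVEL_WEIGHTS = {1: 0.8, 2: 1.2, 3: 1.5, 4: 1.0, 5: 0.5, 6: 0.5}

TOPIC_WEIGHTS = {  # percentages stated in the text
    "introduction": 5, "driver": 5, "traffic": 10, "route study": 10,
    "horizontal geometric design": 20, "vertical geometric design": 20,
    "transverse geometric design": 20, "geometric design consistency": 10,
}

def topic_score(fraction_correct_by_level):
    """Weighted average (0..1) of per-level fractions of correct answers."""
    weights = {lvl: LEVEL_WEIGHTS[lvl] for lvl in fraction_correct_by_level}
    weighted = sum(weights[lvl] * frac
                   for lvl, frac in fraction_correct_by_level.items())
    return weighted / sum(weights.values())

def exam_score(fractions_by_topic):
    """Aggregate topic scores with the topic weights, rescaled to 10 points."""
    total = sum(TOPIC_WEIGHTS[topic] * topic_score(fracs)
                for topic, fracs in fractions_by_topic.items())
    return 10 * total / sum(TOPIC_WEIGHTS.values())

# A judge answering every question correctly would score 10; the study
# required AI judges to reach at least 9 points to proceed.
```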
The second stage of the validation process shifted the focus to the remaining two exam versions intended for student use. Here, AI judges evaluated the relevance and wording of the questions. Relevance refers to how well each question aligns with the intended learning objective and the targeted cognitive level based on Bloom’s Taxonomy. In simpler terms, are the questions truly testing what students are supposed to learn at different levels of understanding? Wording, on the other hand, assesses the clarity, conciseness, and appropriateness of the language used in each question. The following prompt was used to guide the AI judges in their evaluation as seen in Table 2.
The validation process for the exam content was built on a comprehensive framework grounded in established methodological approaches. Drawing from content validity research and item development principles, the panel evaluated the content, noting that scales with more than six points offer minimal added benefit to relevance ratings [22] and that positive and negative phrasing significantly impacts validity [23,24,25]. Each question was meticulously evaluated for alignment with learning objectives and Bloom’s Taxonomy, with cognitive levels cross-verified against the reference text. The panel also systematically verified the factual accuracy and precision of the questions to ensure they reflected current engineering practices [26]. Question quality was assessed to eliminate ambiguity and align terminology with the target audience. The validation employed the content validity ratio framework [27], calculating the Aiken’s V coefficient with a 95% confidence interval and setting an acceptance threshold of 0.8 or higher for both relevance and wording [28]. Any questions with a low average relevance score were flagged for revision or potential exclusion, ensuring that only questions meeting the highest standards of academic and technical precision were retained.
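For reference, Aiken’s V for a single item rated by n judges on a c-point ordinal scale is V = S / [n(c − 1)], where S sums each rating’s distance from the lowest scale point. The sketch below assumes a 1–5 rating scale and hypothetical ratings (the exact scale used by the judges is not stated); the 95% confidence interval applied in the study is not reproduced here.

```python
# Minimal sketch of Aiken's V for a single item; the ratings and the 1-5
# scale are illustrative assumptions.

def aikens_v(ratings, lowest=1, highest=5):
    """V = sum(r_i - lowest) / (n * (highest - lowest))."""
    n = len(ratings)
    s = sum(r - lowest for r in ratings)
    return s / (n * (highest - lowest))

# Example: five judges rating the relevance of one question on a 1-5 scale.
print(aikens_v([5, 5, 4, 5, 5]))  # 0.95, above the 0.8 acceptance threshold
```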

2.4. Participants and Procedure

The two validated exam versions were administered to two groups of students, and their scores were compared with those from previous academic periods that used traditional exams. Those previous exams consisted of three tasks evaluating levels 3 to 5 of Bloom’s Taxonomy, as seen in Table 3.

2.5. Data Analysis

To assess the internal consistency of the AI-generated exams, the Kuder–Richardson reliability coefficient (KR-20) [29] was calculated. To evaluate the effectiveness and appropriateness of these exam versions, statistical methods were applied to analyze student scores across Bloom’s Taxonomy levels. The following approaches were employed (an illustrative sketch of selected computations is provided after the list):
  • Threshold-based performance analysis: to assess student mastery of Bloom’s Taxonomy levels 1 to 3, performance thresholds were defined at 30%, 80%, and 90% correct answers. These thresholds enabled the identification of incremental improvements in student outcomes as the criteria for mastery were relaxed. This analysis aimed to provide insights into the proportion of students meeting specific cognitive benchmarks and areas requiring additional preparation.
  • Comparison of scores across Bloom’s Taxonomy levels: student grades for Bloom’s Taxonomy levels were converted to a 10-point scale to facilitate comparisons across levels and between the two AI-generated exam versions. This normalization provided a consistent framework for identifying performance trends and deviations from expected progressive declines in grades from lower to higher cognitive levels.
  • Item difficulty analysis: the difficulty index for each question was calculated as the proportion of students answering correctly. Questions were categorized into five difficulty levels (easy, relatively easy, medium difficulty, relatively difficult, and difficult). This categorization enabled the examination of the distribution of questions across difficulty levels in both versions and comparisons with ideal distributions recommended in prior research.
  • Validation through prior evaluations: to validate the AI-generated instruments, scores from both versions were compared against grades from prior semester exams covering Bloom’s Taxonomy levels 3, 4, and 5. A paired t-test was used to evaluate the statistical significance of differences in mean scores between the new and prior assessments. Minitab software 14.2 [30] was used to perform the calculations for this analysis.
  • Error metrics: to assess the alignment between the new and prior exams, error metrics were calculated, including Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Mean Squared Error (MSE). These metrics evaluated how closely the AI-generated exams approximated the grading patterns of the previous semester’s exam, providing additional evidence of their validity.
  • Student feedback and qualitative analysis: semi-structured interviews were conducted to gather qualitative feedback on the AI-generated exams. Students were asked about perceived difficulty, time allocation, and preferred question types. This feedback provided context for interpreting the statistical results, particularly in relation to question clarity, time management, and alignment with practical competencies.
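The Python sketch below illustrates two of the computations listed above: the item difficulty index with a five-level categorization (the cut points are assumptions, since the paper does not state them) and the error metrics in their standard definitions. The paired t-test itself was run in Minitab 14.2; scipy.stats.ttest_rel would be an equivalent open-source route.

```python
# Hedged sketch of two of the analyses listed above; difficulty cut points
# are illustrative assumptions, and error metrics use standard definitions.

from statistics import mean

def difficulty_index(correct_flags):
    """Proportion of students answering an item correctly (0..1)."""
    return sum(correct_flags) / len(correct_flags)

def difficulty_category(p):
    """Map a difficulty index to the five categories (assumed cut points)."""
    if p >= 0.80:
        return "easy"
    if p >= 0.65:
        return "relatively easy"
    if p >= 0.45:
        return "medium difficulty"
    if p >= 0.30:
        return "relatively difficult"
    return "difficult"

def error_metrics(previous_scores, new_scores):
    """MAE, MAPE (%), and MSE between prior-exam and AI-generated-exam scores.
    Students with near-zero prior scores should be excluded beforehand, as in
    the study, to avoid division problems in the MAPE."""
    errors = [new - prev for prev, new in zip(previous_scores, new_scores)]
    mae = mean(abs(e) for e in errors)
    mape = 100 * mean(abs(e) / prev for prev, e in zip(previous_scores, errors))
    mse = mean(e ** 2 for e in errors)
    return mae, mape, mse
```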

3. Results

3.1. Exam Generation

Gemini AI Studio (1.5 Pro) generated three exam versions, each containing 80 questions. Per topic, 10 questions were created: 2 questions each for Bloom’s Taxonomy levels 1 to 4, and 1 question each for levels 5 and 6, balancing the assessment across cognitive levels. Higher-order questions requiring spatial manipulation were replaced with alternative formats that still assess higher-order thinking skills. The AI-generated versions and the student-evaluated versions can be requested from the author.
The assessment questions systematically addressed different levels of Bloom’s Taxonomy across the 8 topics and 3 versions, serving distinct cognitive functions. “Remembering” questions tested direct recall of essential facts and definitions for understanding road design principles, such as the primary objective of geometric road design or the responsible agency for publishing key manuals. “Understanding” questions required interpreting concepts and relationships within road design, such as explaining the significance of driver behavior or describing how traffic volume impacts road service levels.
“Applying” questions challenged students to use their knowledge to solve practical problems or make decisions in novel situations. Examples include proposing design changes to enhance safety on a problematic road segment or calculating specific parameters like superelevation for road curves. “Analyzing” tasks involved dissecting information, identifying patterns, and drawing evidence-based conclusions, such as analyzing factors contributing to traffic congestion or evaluating the environmental implications of various design alternatives.
“Evaluating” questions demanded informed judgments by weighing criteria against standards. At the highest “Creating” level, students had to generate innovative solutions and comprehensive plans. This could involve designing sustainable transportation systems for urban growth or proposing research studies to investigate the impacts of new design features.
After the exam generation, an analysis examined the relationship between the number of words in the correct answer and the average number of words in the incorrect answer choices. The ratio of the correct option’s length to the average length of the incorrect options was calculated for each question across the three exam versions. This ratio is important because large language models (LLMs), such as those underlying chatbots, may produce random variations in their outputs. While the length of the correct answer can vary with these models, the ratio still provides a useful metric for identifying potential biases in the answer choices. A ratio of 1 indicates that the correct answer and the incorrect answers have similar lengths. Deviations from this ratio could suggest that students might be more likely to choose longer correct answers, or that cognitive biases may influence their answer selection.
The length of correct and incorrect answer options in multiple-choice questions (MCQs) can significantly influence student responses, introducing bias. This phenomenon arises from various cognitive and perceptual biases. Research suggests that longer answers may be perceived as more informative or accurate due to the simple presence of more information [31]. This can lead to a “length bias,” where students are more likely to select longer options, even if they are incorrect, due to an assumption of greater depth or complexity [32]. For instance, if a correct answer contains 15 words while the average incorrect answer spans 12 words, the resulting length ratio would be 1.25. Such a ratio suggests the correct answer is relatively longer, which could potentially introduce unintended response biases among students. In contrast, large language models (LLMs) may exhibit inherent biases towards certain answer lengths due to biases present in their training data. These biases can subtly influence human responses when similar patterns are observed in educational assessment [33,34].
Figure 1 illustrates the ratio of the correct option’s length to the average length of the incorrect options for each question in this study. The analysis reveals that few questions exhibit a ratio close to 1, indicating a significant disparity between the lengths of correct and incorrect answer options.
Examining the average ratio for each topic from Figure 1, we observe that “Traffic” displays values closest to 1 across all versions, suggesting a more balanced length distribution. In contrast, “Driver” and “Transverse Geometric Design” exhibit ratios furthest from 1, indicating substantial length discrepancies between correct and incorrect options. Ideally, this ratio should fall within the range of 0.75 to 1.25. For instance, if the correct answer comprises 100 letters, the incorrect options should ideally contain between 75 and 125 letters to minimize bias arising from length differences. However, it is crucial to acknowledge that answer length is not the sole determinant of student choices. Poorly worded options or subtle differences, such as the presence of punctuation in only the correct answer, can also significantly influence student responses.
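A minimal sketch of this length-ratio check is given below; whether length is counted in words or characters is left configurable, since the text uses word counts in one example and letter counts in another, and 0.75–1.25 is the range suggested above.

```python
# Illustrative check of the answer-length ratio discussed above; the unit of
# "length" (words vs. characters) is configurable.

def length_ratio(correct_option, incorrect_options, unit="words"):
    measure = (lambda s: len(s.split())) if unit == "words" else len
    avg_incorrect = sum(measure(o) for o in incorrect_options) / len(incorrect_options)
    return measure(correct_option) / avg_incorrect

def flags_length_bias(ratio, low=0.75, high=1.25):
    """True if the ratio falls outside the suggested balance range."""
    return not (low <= ratio <= high)

# Example from the text: a 15-word correct answer against distractors of
# 12 words on average gives a ratio of 1.25, at the edge of the range.
```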

3.2. AI Chatbot Performance Analysis

This study evaluated the performance of five AI chatbots—ChatGPT 3.5, Claude 3, Copilot, Perplexity, and You (in Smart mode)—on the multiple-choice exams. The exams comprised 10 questions for each of eight topics. The chatbots answered one question at a time, and their responses were graded for accuracy. While they correctly answered 98% of the multiple-choice questions, their performance dropped to 55% for questions at Bloom’s Taxonomy levels 5 and 6. The results suggest that AI chatbots perform well on multiple-choice exams, especially for lower-level questions, but struggle with more complex questions requiring higher-order thinking skills, as found in previous studies. A potential explanation is that multiple-choice questions often involve identifying the correct answer from given options, a task well-suited to AI’s pattern recognition and text processing capabilities.
Table 4 presents the scores achieved by AI chatbots across various road engineering topics. All scored over 9 points, indicating they performed well enough to proceed to the next evaluation phase. The highest score was in the “driver” topic, which also had a high ratio of correct to incorrect answer length. The three geometric design topics yielded the lowest scores.
The weights assigned to each topic in Table 4 reflect their relative importance within the domain of road engineering. Higher weights were allocated to topics such as geometric design (horizontal, vertical, and transverse), as these are fundamental to road engineering and involve complex calculations and design standards. Conversely, lower weights were given to introductory topics or driver-related concepts, as these are less critical to the technical aspects of road design. This weighting system ensures the evaluation emphasizes areas of higher technical significance while maintaining a balanced overall assessment.

3.3. Chatbot Scoring and Exam Item Validation

Having qualified as experts, the five AI chatbots were then tasked with evaluating the relevance and wording of the assessment items using specific rating scales. Their evaluations yielded an average Aiken’s V of 0.95 for relevance and above 0.89 for wording, confirming the adequacy of the assessment tools. Figure 2 illustrates the average relevance and wording scores across different questions and topics in the road geometric design assessments.
The top graph in Figure 2 shows that the average Aiken’s V for relevance (Version 1 and Version 2 of the exam) remains relatively high, generally fluctuating between 0.90 and 1.00, across different questions and Bloom’s Taxonomy levels. This suggests that the AI-generated exams maintain a high level of relevance throughout the various cognitive levels. However, the wording evaluations (V1 and V2) display more variability compared to relevance, with noticeable dips below 0.90 for several questions, particularly those corresponding to Bloom’s Taxonomy levels 3 to 6. While relevance is consistently high, the clarity or appropriateness of wording may require further refinement.
The bottom graph in Figure 2 indicates that the average Aiken’s V varies across different road geometric design topics. The relevance scores (V1 and V2) consistently stay above 0.90 for all topics, indicating a high degree of alignment between the exam questions and the intended subject matter. However, topics 1 (introduction) and 4 (route study) show lower average Aiken’s V values for wording, suggesting that these areas might need more attention to ensure clarity and precision in the exam questions.
Both relevance and wording scores tend to converge towards higher values (above 0.90), indicating that the AI chatbots effectively produce valid and well-worded exam questions on average. However, the slight downward trends and variability in wording scores suggest a need for periodic review and adjustment, particularly for more complex topics and higher Bloom’s Taxonomy levels.
Additionally, the average coefficient of variation for relevance and wording was analyzed to measure the variation among the five artificial judges. A coefficient of zero means that all judges chose the same rating for the item, whereas higher values indicate more discrepancy among the judges. The average coefficient of variation for relevance and wording is shown in Figure 3.
The upper part of Figure 3 shows that the coefficients of variation for relevance are generally lower and more stable compared to those for wording across both exam versions. This suggests greater consensus among judges regarding the relevance of questions. However, for the last two levels, there was greater discrepancy in relevance ratings, similar to the discrepancy observed for wording.
In the lower part of Figure 3, a similar trend is seen between relevance and wording, except for Topic 6 (vertical geometric design). The greatest differences among judges were found in Topic 1. A coefficient of variation of around 10% typically corresponds to one of the five judges assigning a rating noticeably different from the others, who agreed on the score. In this context, while the items could be considered valid, improvements in wording may be beneficial, especially to avoid undesirable biases in the evaluation.

3.4. Estimation of the Internal Reliability of the Exams

The two validated exam versions were administered to students enrolled in the Road Construction II course at Universidad Técnica Particular de Loja during the April–August 2024 period. Group A consisted of 48 students, while Group B had 35 students. The exam comprised 80 questions, and students were given a 2 h time limit to complete it. They were allowed to use printed materials, such as textbooks or notes, during the exam. The average completion time for Group A was 109.83 min, and for Group B, it was 109.50 min. Appendix A.1 includes a sample version of the exam on horizontal geometric design, while Appendix A.2 presents representative questions covering other topics and Bloom’s cognitive levels.
To measure the internal reliability of the assessment instrument, a Kuder–Richardson Coefficient 20 (KR20) analysis was conducted. The KR20 indicates the consistency of the instrument in measuring its intended objective. For this analysis, student responses were assigned a value of 1 for correct answers and 0 for incorrect answers. The KR20 formula is shown in Equation (1):
r_{KR20} = \frac{k}{k - 1} \left( 1 - \frac{\sum pq}{\sigma^{2}} \right)   (1)
where k is the number of items in the instrument, p is the proportion of students who answered each item correctly, q is the proportion who answered it incorrectly (q = 1 − p), and σ2 is the total variance of the instrument scores [29].
The value of k was 80 questions. For the first version of the test, Σpq was 8.54 and the variance was 29.7, so rkr20 is equal to 0.72. On the other hand, for the second version of the test, Σpq was 6.53 and the variance was 25.9, so rkr20 is equal to 0.76.
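As a quick arithmetic check, plugging the reported values into Equation (1) reproduces the stated coefficients; the short sketch below assumes nothing beyond the figures given in the text.

```python
# Verification of Equation (1) using the values reported above.

def kr20(k, sum_pq, variance):
    return (k / (k - 1)) * (1 - sum_pq / variance)

print(round(kr20(80, 8.54, 29.7), 2))  # 0.72 (Version 1)
print(round(kr20(80, 6.53, 25.9), 2))  # 0.76 (Version 2)
```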
The KR20 values for both versions of the exam (0.72 and 0.76) indicate acceptable internal reliability, meaning that the exams are consistent in measuring what they are intended to measure: the students’ knowledge of road geometric design. Nevertheless, the exams could still be improved by adding more items or by revising some of the existing items to make them more discriminating.

3.5. Percentage of Correct Responses

Student responses were analyzed by calculating the percentage of correct answers for each topic, categorized by Bloom’s Taxonomy levels and exam version. The average results are presented in Figure 4. For levels 5 and 6, responses were graded as correct (100%), partially correct (50%), or incorrect (0%).
The top graph in Figure 4, representing Version 1 of the student exam, shows significant fluctuations in results up to level 4, with a marked decrease in the percentage of correct answers at levels 5 and 6 across all topics. The introductory topic had the highest percentage of correct answers, while the traffic topic was easier at levels 2 and 3 but more difficult at level 1. The traffic, driver, and route study topics displayed varying levels of difficulty across Bloom’s Taxonomy levels. The horizontal, vertical, and cross-sectional geometric design topics generally started easier and became more difficult at higher levels. The geometric design consistency topic started easy, increased in difficulty, and then became easier again. Ideally, the graph should show higher percentages of correct answers at lower Bloom’s Taxonomy levels and lower percentages at higher levels.
Version 2 (bottom graph of Figure 4) exhibited a similar pattern to Version 1, with a decrease in correct answers at levels 5 and 6 and fluctuations across topics at levels 1 to 4. However, the topic patterns differed from Version 1. Version 2 had less dispersion than Version 1 at levels 1 to 3. Only the traffic topic followed the ideal pattern, while the rest showed varying responses. In general, there was high variability across topics at levels 1, 2, and 3.
To validate whether the random characteristics of large language models (LLMs) impacted student performance, Pearson correlation was applied to analyze the relationships between student scores across Bloom’s Taxonomy levels and road geometric design topics for both versions of the exam (data from Figure 4). The results show significant correlations (p-value < 0.05) between most topics, indicating consistent student performance trends. For the first version of the exam, strong correlations were observed between “Driver” and “Introduction” (r = 0.964, p = 0.002), and between “Transversal Geometric Design” and most other topics, such as “Introduction” (r = 0.984, p < 0.001). Similarly, in the second version, “Introduction” demonstrated high correlations with “Transversal Geometric Design” (r = 0.990, p < 0.001) and “Driver” (r = 0.948, p = 0.004).
These strong correlations suggest that student performance was consistently influenced across topics, regardless of potential random variations in the output of LLMs used to generate exam questions. This consistency is further evidenced by high correlations in critical topics like “Vertical Geometric Design” and “Horizontal Geometric Design”, which are integral to road engineering. The observed correlations indicate that any variability introduced by LLM-generated questions did not disrupt overall student performance patterns, reinforcing the robustness of the evaluation process.
However, some topics, such as “Traffic” in the first version, exhibited slightly weaker correlations with other topics, such as “Route Study” (r = 0.652, p = 0.16), though they remained positively correlated. This may reflect the cognitive complexity of certain topics or varying student familiarity with them, warranting further investigation into specific question designs and content alignment.
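For completeness, the correlation analysis above can be reproduced with a routine such as the one below; the per-level average scores are hypothetical placeholders for the values read from Figure 4.

```python
# Hedged sketch of the topic-to-topic correlation analysis described above.
# The per-level average scores below are hypothetical placeholders for the
# values read from Figure 4 (one value per Bloom's Taxonomy level per topic).

from scipy.stats import pearsonr

introduction = [95, 88, 76, 70, 42, 38]  # hypothetical % correct, levels 1-6
driver = [92, 85, 80, 66, 45, 35]        # hypothetical % correct, levels 1-6

r, p_value = pearsonr(introduction, driver)
print(f"r = {r:.3f}, p = {p_value:.3f}")  # significant if p < 0.05
```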

3.6. Analysis of Student Grades

A detailed analysis of student grades revealed that only one student from both groups correctly answered all questions from Bloom’s Taxonomy levels 1 to 3, representing a mere 1.2% of the sample. This finding highlights the need for further student preparation. By adjusting the threshold to 90% correct answers for levels 1 to 3, the percentage of students achieving this increased to 3.6%. Lowering the threshold further to 80% resulted in 27.7% of students reaching Bloom’s Taxonomy level 3. As the minimum threshold continued to decrease, the percentage of students at Bloom’s Taxonomy level 3 rose correspondingly, reaching 100% when the threshold was set at 30%.
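The threshold analysis above can be reproduced with a short routine like the one below; the per-student fractions of correct answers on Bloom levels 1 to 3 are hypothetical placeholders, since the raw data are available only from the author on request.

```python
# Hedged sketch of the threshold-based mastery analysis described above;
# the student data are hypothetical placeholders.

def share_reaching_level3(fractions_correct_l1_l3, threshold):
    """Percentage of students whose fraction of correct answers on levels 1-3
    meets or exceeds the given threshold (e.g., 1.00, 0.90, 0.80, 0.30)."""
    n = len(fractions_correct_l1_l3)
    return 100 * sum(f >= threshold for f in fractions_correct_l1_l3) / n

fractions = [1.00, 0.92, 0.81, 0.74, 0.55]  # hypothetical example
for t in (1.00, 0.90, 0.80, 0.30):
    print(t, share_reaching_level3(fractions, t))
```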
Figure 5 presents the grades for questions at each Bloom’s Taxonomy level across all eight topics, converted to a 10-point scale. This figure demonstrates that Version 2 had higher grades than Version 1 up to level 4, after which grades for Version 2 dropped significantly below Version 1 at level 5. This drop corresponds to the increased cognitive demands of level 5 questions, which require evaluation and critical reasoning, skills that are notably challenging for undergraduate students. In Version 2, grades increased slightly at level 6, while the highest grades were observed at level 4. Conversely, in Version 1, the highest grades were at level 2. Although a general decline in grades was anticipated from level 1 to level 6, the pattern was not progressive. Grades remained relatively similar between levels 1 to 4, with a sudden and noticeable drop at level 5, highlighting the difficulty of questions at this level.

3.7. Analysis of Item Difficulty

Referring to Figure 5, we analyzed the difficulty levels of multiple-choice questions (Bloom’s Taxonomy levels 1 to 4) in both exam versions. Table 5 presents the results of this analysis. For Version 1, over 50% of the questions were categorized as easy or relatively easy, while in Version 2, this percentage exceeded 60%. This difference in the distribution of easy questions might explain the variation in grades between the two versions. Both versions had a similar proportion of medium difficulty questions. However, Version 1 contained a higher percentage of relatively difficult or difficult questions compared to Version 2.
Table 5 also shows the ideal percentage of questions an exam should have, based on previous research [35]. Both AI-generated exam versions deviated significantly from these ideal percentages, making the exams relatively easy. This deviation explains the trends observed in Figure 4 and Figure 5, where student performance was higher than expected. Additionally, the relatively short answer options reflected in Figure 1 might have made the questions easier for students to work through.
To further understand the exam experience, we interviewed students from both groups. Regarding the allocated time and materials, students mentioned that the time was adequate, and printed materials were only helpful when they knew where to look; otherwise, searching wasted time. Concerning the test difficulty, students noted they could often eliminate incorrect options and then select the answer. Group A students reported being able to answer up to 5 to 7 of the 10 questions per topic, while Group B reported 7 to 8, consistent with the difficulty results in Table 5. Finally, students expressed a preference for questions involving road section design or redesign, as these tested their knowledge without requiring drawing or software skills.

3.8. Validation with Students

Beyond the results presented in the previous sections, an attempt was made to validate the instruments against prior evaluations of the same students. These prior evaluations only covered Bloom’s levels 3, 4, and 5; therefore, calculations for both instrument versions were performed for those levels and scaled to a 10-point scale.
It is important to acknowledge that the comparison between AI-generated exams and previous semester assessments involves inherent limitations that preclude the use of more sophisticated statistical approaches such as ANCOVA or regression modeling. Several factors contribute to this limitation: (1) the temporal separation between assessments prevents proper matching of covariates such as student preparation levels, study habits, or external factors that may have influenced performance; (2) the fundamental structural differences between the AI-generated multiple-choice format and the previous semester’s assessment format make direct statistical comparisons methodologically problematic; (3) the absence of baseline measurements for relevant covariates during the original assessment period limits our ability to control for confounding variables; and (4) the sample size and institutional context restrict the generalizability of sophisticated statistical modeling.
Figure 6 shows the results, indicating that students generally scored higher on the new instruments compared to the previous design exam. Ideally, the scores should align with the reference line. To evaluate the differences, a t-test was conducted. For Version 1, the estimated difference between the two evaluation methods was 2.05 (95% CI: 1.36 to 2.74), with a t-value of 5.94 and p-value of 0.000, suggesting a statistically significant difference from the previous exam grades. Similarly, for Version 2, the estimated difference was 3.20 (95% CI: 2.37 to 4.03), with a t-value of 7.77 and p-value of 0.000, indicating a statistically significant difference.
Furthermore, an analysis of errors between the previous semester’s exam grades and the AI-generated versions was performed, excluding one student each from groups A and B who scored nearly zero previously. For Version 1, the Mean Absolute Error (MAE) was 4.47, Mean Absolute Percentage Error (MAPE) was 275.28, and Mean Squared Error (MSE) was 21.84. For Version 2, the MAE was 5.37, MAPE was 314.0, and MSE was 31.43. Version 1 had better error metrics, suggesting a closer approximation to the previous exam grades. However, the high error values indicate little relation between the previous exam and the AI-generated exams. The error metrics calculated between assessments should be interpreted cautiously, as they reflect format differences rather than instrument quality. These methodological constraints suggest that future research should incorporate prospective designs with proper covariate measurement to enable more rigorous statistical analysis.

4. Discussion

The integration of AI chatbots in road geometric design assessment reveals both promising capabilities and significant limitations. Our findings demonstrate that while AI can generate coherent exam content aligned with Bloom’s Taxonomy, substantial challenges remain in creating high-quality assessments for specialized engineering domains.

4.1. AI Performance and Question Generation Quality

The exam generation process demonstrated that AI can produce structurally coherent questions aligned with Bloom’s Taxonomy, supporting previous research indicating comparable quality between AI and human-generated questions [4]. However, the analysis revealed systematic biases, particularly in answer length ratios where correct options frequently differed substantially from distractors. This finding aligns with established research on cognitive biases in multiple-choice questions [31,32], where length disparities inadvertently cue students to correct answers, compromising assessment validity.
AI chatbots achieved high performance (98%) on lower-level cognitive tasks but dropped to 55% at Bloom’s levels 5 and 6, consistent with findings by Herrmann-Werner et al. [18] and previous studies showing AI struggles with higher-order thinking assessment [20]. This performance pattern reflects fundamental limitations in AI’s capacity to generate questions requiring genuine evaluation and creativity, rather than pattern recognition tasks characteristic of lower cognitive levels.

4.2. Content Validation and Reliability Concerns

The validation process using multiple AI judges yielded high Aiken’s V scores for relevance (0.95) but more variable results for wording (0.89). While these metrics exceed established thresholds [28], the methodological approach raises concerns about circular validation—using AI systems to evaluate AI-generated content may not capture essential human pedagogical expertise. This limitation echoes concerns regarding AI-generated questions potentially lacking alignment with learning objectives and failing to address student misconceptions [13].
Internal reliability analysis (KR20 = 0.72–0.76) indicated acceptable consistency, though item difficulty analysis revealed problematic distributions. Over 50% of questions were categorized as easy or relatively easy, significantly deviating from ideal distributions recommended in educational assessment literature [35]. This pattern suggests AI tends to generate fewer discriminating items, potentially limiting the instruments’ ability to accurately assess student competencies across ability levels.

4.3. Student Performance and Assessment Validity

Student performance analysis revealed that only 1.2% achieved mastery across Bloom’s levels 1–3, indicating potential issues with either student preparation or question calibration. The strong correlations between topics (r > 0.9) suggest consistent performance patterns but may also indicate insufficient discrimination between domain-specific competencies, a concern given the specialized nature of road geometric design.
Comparison with previous assessments, while methodologically constrained, suggested AI-generated exams may be less challenging than traditional evaluations. This finding raises questions about whether AI-generated assessments adequately prepare students for professional practice, particularly given the specialized technical requirements of civil engineering [12].

4.4. Implications and Limitations

These findings have significant implications for AI adoption in educational assessment. While AI demonstrates efficiency in generating large volumes of questions, substantial human oversight remains essential for ensuring pedagogical quality, particularly in specialized domains requiring complex problem-solving skills [14]. The study supports recommendations for hybrid approaches that leverage AI capabilities while maintaining human expertise for quality assurance and pedagogical alignment.
Despite the comprehensive nature of this study, several limitations warrant consideration in interpreting the results and implications. The limitations identified—including small sample size, single-institution context, and methodological constraints in comparative analysis—underscore the need for larger-scale, multi-institutional research with prospective designs that enable more rigorous statistical analysis and broader generalizability to diverse educational contexts.

4.5. Future Research

Looking ahead, future developments in AI-assisted educational assessments may focus on enhancing the alignment of AI-generated questions with complex learning objectives and increasing their capacity to assess higher-order cognitive skills. Continued research is needed to explore hybrid models that integrate human expertise and AI automation, as well as the development of adaptive assessments that respond dynamically to student performance. These innovations could pave the way for more personalized, accurate, and equitable evaluation systems in engineering education and beyond.

5. Conclusions

The findings of this study highlight the potential of AI-generated exams in evaluating student learning within road geometric design education. It demonstrates that AI chatbots can effectively create exam questions aligned with various levels of Bloom’s Taxonomy, offering students a diverse set of cognitive challenges. Despite the positive performance of AI chatbots in generating exam content and maintaining relevance and clarity across different topics and cognitive levels, certain limitations should be acknowledged. These include constraints inherent to the multiple-choice format, challenges in assessing higher-order thinking skills, and the subjective nature of content validation by AI judges. However, the study emphasizes the importance of ongoing exploration and refinement of AI technologies in educational assessment, providing valuable insights into both the opportunities and challenges associated with integrating AI-generated exams into teaching practices. This study stands out as a pioneering effort to evaluate the integration of AI chatbots in creating domain-specific assessments, particularly in road geometric design education. By demonstrating the feasibility of aligning AI-generated content with Bloom’s Taxonomy, it offers a novel approach to modernizing assessment practices in specialized fields. The findings not only validate the transformative potential of AI in reducing the time and subjectivity inherent in traditional exam creation but also set a foundation for expanding AI-driven assessment frameworks across diverse disciplines.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the use of fully anonymized and aggregated data collected through voluntary participation. The study involved no collection of personally identifiable information, and verbal informed consent was obtained from all participants prior to completing the questionnaire. According to current national and institutional regulations, formal ethics committee approval is not required for minimal-risk studies that do not involve identifiable data.

Informed Consent Statement

Verbal consent was obtained rather than written because the study posed no physical or psychological risk to participants, data were collected anonymously, and the setting (public urban spaces) and informal recruitment approach made verbal consent more practical and culturally appropriate. Participants were fully informed about the purpose of the study, their right to decline or withdraw at any time, and how their data would be used and protected. A copy of the consent script is provided as part of the submission.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the author on request.

Acknowledgments

During the preparation of this manuscript, the author used ChatGPT 3.5 to improve the clarity, coherence, and academic tone of selected sections of the text. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

Appendix A

Appendix A.1

This appendix presents an example of the exam on the topic of horizontal geometric design.
  • Bloom’s Taxonomy level 1: recall
What is the main objective of horizontal geometric design of roads?
  • (a)
    Minimize environmental impact.
    (b)
    Maximize vehicle speed.
    (c)
    Provide safety and comfort to users. *
    (d)
    Reduce construction costs.
Which element of horizontal design is responsible for smooth transitions between tangents?
  • (a)
    Superelevation.
    (b)
    Circular curves.
    (c)
    Transition curves. *
    (d)
    Sight distance.
  • Bloom’s Taxonomy level 2: understand
Why is it important to consider the design speed when planning horizontal curves?
  • (a)
    Because it determines the type of pavement to be used.
    (b)
    Because it affects the required sight distance and the safety of the curve. *
    (c)
    Because it defines the lane width.
    (d)
    Because it influences the amount of signage required.
How does superelevation relate to centrifugal force in a curve?
  • (a)
    Superelevation increases centrifugal force, improving adhesion.
    (b)
    Superelevation counteracts centrifugal force, providing greater stability. *
    (c)
    Superelevation has no relationship to centrifugal force.
    (d)
    Superelevation reduces centrifugal force, allowing lower speeds.
  • Bloom’s Taxonomy level 3: apply
An engineer is designing a curve with a design speed of 40 km/h. Which of the following formulas should they use to calculate the minimum radius of the curve?
  • (a)
    Formula for calculating stopping sight distance.
    (b)
    Formula for calculating superelevation.
    (c)
    Formula of the transition spiral.
    (d)
    Formula relating design speed to radius and superelevation. *
If the stopping sight distance on a road is 120 m and a vehicle is approaching at a speed of 60 km/h, what action should the engineer take to ensure safety?
  • (a)
    Increase the superelevation of the curve.
    (b)
    Reduce the radius of the curve.
    (c)
    Implement measures to improve sight distance. *
    (d)
    Decrease the design speed.
  • Bloom’s Taxonomy level 4: analyze
What are the main differences between a simple circular curve and a curve composed of spiral–circular–spiral?
  • (a)
    The composite curve provides a smoother and more gradual transition, improving comfort and safety. *
    (b)
    The simple circular curve is more suitable for high-speed roads.
    (c)
    The composite curve requires more space for its construction.
    (d)
    The simple circular curve is cheaper to build.
When analyzing an existing stretch of road, a high rate of road accidents is observed in a specific curve. What factors of horizontal geometric design could be contributing to this situation?
  • (a)
    Radius of the curve too small, insufficient superelevation, or limited visibility. *
    (b)
    Inadequate vertical signage or lack of lane markings.
    (c)
    Poor drainage or problems with the pavement surface.
    (d)
    Lack of lighting in the curve area.
  • Bloom’s Taxonomy level 5: evaluate
You are presented with two design proposals for a horizontal curve with different radii and superelevations. One design prioritizes driver safety and comfort, while the other prioritizes cost reduction. Analyze the advantages and disadvantages of each design in terms of safety, driver experience, and project economy. What design would you recommend and why?
  • Bloom’s Taxonomy level 6: create
Imagine you are a road engineer and you need to design a road in a mountainous terrain. What creative strategies could you implement to adapt the horizontal geometric design to the terrain conditions and ensure the safety of the users?

Appendix A.2

This appendix presents a set of representative sample questions drawn from the full assessment instrument used in the study. The questions included here aim to illustrate the diversity of topics covered in the course and the alignment of each item with various cognitive levels of Bloom’s revised taxonomy. While the complete test comprises a larger number of items, the samples provided reflect the overall structure, thematic breadth, and progression from lower- to higher-order thinking skills that characterize the full evaluation.
Each question has been selected to demonstrate how the instrument assesses knowledge, understanding, application, analysis, evaluation, and creation across different core topics, such as driver characteristics, traffic studies, route selection, horizontal and vertical geometric design, and cross-sectional elements.
  • Topic: Introduction
Bloom Level 1: Remember
What is the main objective of traffic engineering?
  • (a)
    To design and operate efficient and safe transportation systems.
    (b)
    To build new roads and highways.
    (c)
    To regulate traffic laws.
    (d)
    To investigate the causes of road accidents.
  • Topic: Driver Characteristics
Bloom Level 1: Remember
How is the driver’s “reaction time” defined?
  • (a)
    The time it takes for the driver to perceive an obstacle.
    (b)
    The time it takes for the driver to decide.
    (c)
    The time it takes for the driver to execute an action.
    (d)
    The total time elapsed from the moment an obstacle is perceived until an evasive maneuver is performed.
  • Topic: Traffic Studies
Bloom Level 2: Understand
What happens to traffic speed as vehicle density increases?
  • (a)
    Increases proportionally
    (b)
    Decreases gradually
    (c)
    Remains constant
    (d)
    Fluctuates unpredictably
  • Topic: Route Study
Bloom Level 2: Understand
Why is it important to consider terrain topography in route study?
  • (a)
    To determine the type of pavement to use.
    (b)
    To estimate construction and maintenance costs.
    (c)
    To minimize environmental impact.
    (d)
    To identify areas prone to landslides.
  • Topic: Horizontal Geometric Design
Bloom Level 3: Apply
If a road has a cross slope of 2%, what is the elevation difference between the edge of the road and the center in a section with a width of 7 m?
  • (a)
    7 cm.
    (b)
    14 cm.
    (c)
    28 cm.
    (d)
    140 cm.
  • Topic: Vertical Geometric Design
Bloom Level 3: Apply
A section of road has a grade of 5% and needs to connect with another section with a grade of −3%. If the design speed is 80 km/h, what is the minimum length of the vertical curve required to meet design criteria?
  • (a)
    120 m.
    (b)
    150 m.
    (c)
    180 m.
    (d)
    200 m.
  • Topic: Transverse Geometric Design
Bloom Level 4: Analyze
Analyze the advantages and disadvantages of using a high superelevation in a curve.
  • (a)
    Advantages: greater safety and stability; disadvantages: higher construction cost, possible discomfort for slow-moving vehicles. *
    (b)
    Advantages: lower construction cost; disadvantages: lower safety, risk of rollover.
    (c)
    Advantages: higher design speed; disadvantages: greater environmental impact.
    (d)
    Advantages: improved drainage; disadvantages: greater tire wear.
  • Topic: Consistency of Geometric Design
Bloom Level 5: Evaluate
Two alternative designs are presented for a section of road: one that prioritizes landscape aesthetics and another that prioritizes geometric consistency. What criteria should be considered to evaluate which design is most appropriate?
Bloom Level 6: Create
Imagine you are part of a team of engineers tasked with developing an automated monitoring system to detect and correct inconsistencies in road geometric design. Describe how this system would use machine learning algorithms to identify problem areas and propose design solutions to improve the consistency of the road.

References

  1. Perez Sanpablo, A.I.; Arquer Ruiz, M.d.C.; Meneses Peñaloza, A.; Rodriguez Reyes, G.; Quiñones Uriostegui, I.; Anaya Campos, L.E. Development and Evaluation of a Diagnostic Exam for Undergraduate Biomedical Engineering Students Using GPT Language Model-Based Virtual Agents; Flores Cuautle, J.d.J.A., Ed.; Springer Nature: Berlin/Heidelberg, Germany, 2024; pp. 128–136. [Google Scholar]
  2. Alves de Castro, C. A Discussion about the Impact of ChatGPT in Education: Benefits and Concerns. J. Bus. Theory Pract. 2023, 11, 28–34. [Google Scholar] [CrossRef]
  3. Sanjay, M.; Vikas, S.; Prashant, D. ChatGPT: Optimizing Text Generation Model for Knowledge Creation. I-Manag. J. Softw. Eng. 2023, 17, 21–26. [Google Scholar] [CrossRef]
  4. Cheung, B.H.H.; Lau, G.K.K.; Wong, G.T.C.; Lee, E.Y.P.; Kulkarni, D.; Seow, C.S.; Wong, R.; Co, M.T.H. ChatGPT versus Human in Generating Medical Graduate Exam Multiple Choice Questions—A Multinational Prospective Study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS ONE 2023, 18, e0290691. [Google Scholar] [CrossRef]
  5. Sreelakshmi, A.S.; Abhinaya, S.B.; Nair, A.; Jaya Nirmala, S. A Question Answering and Quiz Generation Chatbot for Education. In Proceedings of the Grace Hopper Celebration India (GHCI), Bangalore, India, 6–8 November 2019; IEEE: Bangalore, India, 2019; pp. 1–6. [Google Scholar]
  6. Bloom, B.S. Taxonomy of Educational Objectives; Edwards Brothers: Ann Arbor, MI, USA, 1956; ISBN 058232386X. [Google Scholar]
  7. Anderson, L.W.; Krathwohl, D.R. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives; Longman: London, UK, 2001; ISBN 9780321084057. [Google Scholar]
  8. Dorodchi, M.; Dehbozorgi, N.; Frevert, T.K. “I Wish I Could Rank My Exam’s Challenge Level!”: An Algorithm of Bloom’s Taxonomy in Teaching CS1. In Proceedings of the Proceedings-Frontiers in Education Conference, FIE, Indianapolis, IN, USA, 18–21 October 2017; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2017; Volume 2017, pp. 1–5. [Google Scholar]
  9. Amin, M.; Naqvi, S.U.E.L.; Amin, H.; Kayfi, S.Z.; Amjad, F. Bloom’s Taxonomy and Prospective Teachers’ Preparation in Pakistan. Qlantic J. Soc. Social. Sci. 2024, 5, 391–403. [Google Scholar] [CrossRef]
  10. Breckwoldt, J.; Lingemann, C.; Wagner, P. Reanimationstraining Für Laien in Erste-Hilfe-Kursen: Vermittlung von Wissen, Fertigkeiten Und Haltungen. Anaesthesist 2016, 65, 22–29. [Google Scholar] [CrossRef]
  11. Bharatha, A.; Ojeh, N.; Rabbi, A.M.F.; Campbell, M.H.; Krishnamurthy, K.; Layne-Yarde, R.N.A.; Kumar, A.; Springer, D.C.R.; Connell, K.L.; Majumder, M.A.A. Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom’s Taxonomy. Adv. Med. Educ. Pract. 2024, 15, 393. [Google Scholar] [CrossRef] [PubMed]
  12. American Society of Civil Engineers (Ed.) Civil Engineering Body of Knowledge: Preparing the Future Civil Engineer; American Society of Civil Engineers: Reston, VA, USA, 2019; ISBN 9780784415221. [Google Scholar]
  13. Lu, K. Can ChatGPT Help College Instructors Generate High-Quality Quiz Questions? In Human Interaction and Emerging Technologies (IHIET-AI 2023): Artificial Intelligence and Future Applications; AHFE International: Orlando, FL, USA, 2023; Volume 70, pp. 311–318. [Google Scholar] [CrossRef]
  14. Bhatia, P. ChatGPT for Academic Writing: A Game Changer or a Disruptive Tool? J. Anaesthesiol. Clin. Pharmacol. 2023, 39, 1–2. [Google Scholar] [CrossRef]
  15. Fuhrmann, T.; Niemetz, M. Analysis and Improvement of Engineering Exams Toward Competence Orientation by Using an AI Chatbot. In Towards a Hybrid, Flexible and Socially Engaged Higher Education. ICL 2023; Auer, M.E., Cukierman, U.R., Vendrell Vidal, E., Tovar Caro, E., Eds.; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2024; Volume 899, pp. 403–411. ISBN 978-3-031-51979-6. [Google Scholar]
  16. Clark, T.M. Investigating the Use of an Artificial Intelligence Chatbot with General Chemistry Exam Questions. J. Chem. Educ. 2023, 100, 1905–1916. [Google Scholar] [CrossRef]
  17. Mihalache, A.; Popovic, M.M.; Muni, R.H. Performance of an Artificial Intelligence Chatbot in Ophthalmic Knowledge Assessment. JAMA Ophthalmol. 2023, 141, 589–597. [Google Scholar] [CrossRef] [PubMed]
  18. Herrmann-Werner, A.; Festl-Wietek, T.; Holderried, F.; Herschbach, L.; Griewatz, J.; Masters, K.; Zipfel, S.; Mahling, M. Assessing ChatGPT’s Mastery of Bloom’s Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study. J. Med. Internet Res. 2024, 26, e52113. [Google Scholar] [CrossRef]
  19. Meo, S.A.; Al-Masri, A.A.; Alotaibi, M.; Meo, M.Z.S.; Meo, M.O.S. ChatGPT Knowledge Evaluation in Basic and Clinical Medical Sciences: Multiple Choice Question Examination-Based Performance. Healthcare 2023, 11, 2046. [Google Scholar] [CrossRef]
  20. Govender, R.G. My AI Students: Evaluating the Proficiency of Three AI Chatbots in Completeness and Accuracy. Contemp. Educ. Technol. 2024, 16, ep509. [Google Scholar] [CrossRef] [PubMed]
  21. García-Ramírez, Y. Diseño Geométrico y Operación de Carreteras de Dos Carriles, 1st ed.; Ediciones de la U: Bogotá, Colombia, 2022. [Google Scholar]
  22. Tang, R.; Shaw, W.; Vervea, J. Towards the Identification of the Optimal Number of Relevance Categories. J. Am. Soc. Inf. Sci. 1999, 50, 254–264. [Google Scholar] [CrossRef]
  23. Clark, L.A.; Watson, D. Constructing Validity: Basic Issues in Objective Scale Development. Psychol. Assess. 1995, 7, 309–319. [Google Scholar] [CrossRef]
  24. Greenberger, E.; Chen, C.; Dmitrieva, J.; Farruggia, S.P. Item-Wording and the Dimensionality of the Rosenberg Self-Esteem Scale: Do They Matter? Pers. Individ. Dif. 2003, 35, 1241–1254. [Google Scholar] [CrossRef]
  25. Clark, L.A.; Watson, D. Constructing Validity: New Developments in Creating Objective Measuring Instruments. Psychol. Assess. 2019, 31, 1412–1427. [Google Scholar] [CrossRef] [PubMed]
  26. Haladyna, T.M.; Downing, S.M.; Rodriguez, M.C. A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment. Appl. Meas. Educ. 2002, 15, 309–333. [Google Scholar] [CrossRef]
  27. Lai, V.; Chen, C.; Smith-Renner, A.; Liao, Q.V.; Tan, C. Towards a Science of Human-AI Decision Making: An Overview of Design Space in Empirical Human-Subject Studies. In Proceedings of the ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1369–1385. [Google Scholar]
  28. Aiken, L.R. Three Coefficients for Analyzing the Reliability and Validity of Ratings. Educ. Psychol. Meas. 1985, 45, 131–142. [Google Scholar] [CrossRef]
  29. Kuder, G.F.; Richardson, M.W. The Theory of the Estimation of Test Reliability. Psychometrika 1937, 2, 151–160. [Google Scholar] [CrossRef]
  30. Minitab, version 14.2; Statistical Software: State College, PA, USA, 2005.
  31. Wang, Z.; Chen, L.; You, H.; Xu, K.; He, Y.; Li, W.; Codella, N.; Chang, K.W.; Chang, S.F. Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond. Find. Assoc. Comput. Linguist. EMNLP 2023, 8598–8617. [Google Scholar] [CrossRef]
  32. Kumar, S. Answer-Level Calibration for Free-Form Multiple Choice Question Answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics (ACL): Dublin, Ireland, 2022; Volume 1, pp. 665–679. [Google Scholar]
  33. Khatun, A.; Brown, D.G.; Cheriton, D.R. A Study on Large Language Models’ Limitations in Multiple-Choice Question Answering. Comput. Lang. 2024, 1–17. [Google Scholar] [CrossRef]
  34. Myrzakhan, A.; Bsharat, S.M.; Shen, Z. Open-LLM-Leaderboard: From Multi-Choice to Open-Style Questions for LLMs Evaluation, Benchmark, and Arena. Comput. Lang. 2024, 1–19. [Google Scholar] [CrossRef]
  35. López, V.G.; López, V.M.G.; Gracia, S.R.; Galaviz, J.L.G.; Sánchez, K.I.B.; Sánchez, C.M.B. Índice de Dificultad y Discriminación de Ítems Para La Evaluación en Asignaturas Básicas de Medicina. Educ. Médica Super. 2020, 34, 1–12. [Google Scholar]
Figure 1. Ratio of the length of the correct option to the average length of the incorrect options (multiple choice) for the three versions of the exam.
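As a companion to Figure 1, the following minimal sketch (not the study's actual script; the item layout and the function name option_length_ratio are illustrative assumptions) shows how the ratio of the correct option's length to the average length of the incorrect options could be computed for one item:

# Sketch: length of the correct option relative to the average incorrect option (illustrative).
def option_length_ratio(options, correct_key):
    """Return len(correct option) / mean length of the incorrect options for one item."""
    correct_len = len(options[correct_key])
    incorrect_lens = [len(text) for key, text in options.items() if key != correct_key]
    return correct_len / (sum(incorrect_lens) / len(incorrect_lens))

item = {
    "a": "To design and operate efficient and safe transportation systems.",
    "b": "To build new roads and highways.",
    "c": "To regulate traffic laws.",
    "d": "To investigate the causes of road accidents.",
}
print(round(option_length_ratio(item, "a"), 2))  # a ratio well above 1 flags a conspicuously long correct option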
Figure 2. Average Aiken’s V scores for relevance and wording evaluation across Bloom’s Taxonomy levels and road geometric design topics.
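For context on Figure 2, a standard formulation of Aiken's V [28] for an item rated by n judges on a c-point scale is V = \dfrac{\sum_{i=1}^{n}(r_i - l)}{n\,(c - 1)}, where r_i is the rating given by judge i and l is the lowest possible rating; with the 1-to-5 scales of Table 2, c - 1 = 4 and V ranges from 0 (all ratings at the minimum) to 1 (all ratings at the maximum).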
Figure 3. Average coefficient of variation for relevance and wording across different questions and topics in student exam versions.
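For reference, the coefficient of variation in Figure 3 is the ratio of the standard deviation to the mean, CV = (s / \bar{x}) \times 100\%, presumably computed here over the expert chatbots' relevance and wording ratings for each question; lower values indicate closer agreement among the raters.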
Figure 4. Student performance across Bloom’s Taxonomy levels (1–6) by road geometric design topics.
Figure 5. Comparison of student grades by Bloom’s Taxonomy levels across two versions.
Figure 6. Comparison of student grades across different evaluation methods.
Table 1. Prompt structure for AI-generated multiple-choice questions.
Input: Chapter “[Chapter Title]” and modified Bloom’s Taxonomy.
Process: 1. Generate multiple-choice questions. 2. Assign Bloom’s Taxonomy level. 3. Mark the correct answer. 4. Avoid irrelevant details.
Output: 10 multiple-choice questions with 4 options each, assigned to specific Bloom’s Taxonomy levels as follows. Two questions: test recalling information (Remembering level). Two questions: assess understanding of concepts (Understanding level). Two questions: evaluate the application of knowledge (Applying level). Two questions: measure the ability to analyze information (Analyzing level). One question: assess judgment-making capability (Evaluating level). One question: measure the ability to generate new ideas (Creating level).
Condition: Each question must align with the assigned Bloom’s Taxonomy level.
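The prompt in Table 1 could be assembled programmatically along the lines of the sketch below; the exact wording sent to Gemini AI Studio, as well as the helper name build_prompt, are assumptions for illustration rather than the study's actual prompt.

# Sketch of assembling the Table 1 prompt (wording is illustrative, not the study's exact prompt).
BLOOM_DISTRIBUTION = {
    "Remembering": 2, "Understanding": 2, "Applying": 2,
    "Analyzing": 2, "Evaluating": 1, "Creating": 1,
}

def build_prompt(chapter_title: str) -> str:
    lines = [
        f'Input: chapter "{chapter_title}" and the modified Bloom\'s Taxonomy.',
        "Process: generate multiple-choice questions, assign each a Bloom's Taxonomy level,",
        "mark the correct answer, and avoid irrelevant details.",
        "Output: 10 multiple-choice questions with 4 options each, distributed as follows:",
    ]
    lines += [f"- {n} question(s) at the {level} level" for level, n in BLOOM_DISTRIBUTION.items()]
    lines.append("Condition: each question must align with its assigned Bloom's Taxonomy level.")
    return "\n".join(lines)

print(build_prompt("[Chapter Title]"))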
Table 2. AI-guided evaluation protocol for question relevance and wording.
Input: A set of 10 questions to be evaluated; an evaluation scale for relevance (from 1 to 5); an evaluation scale for wording (from 1 to 5).
Process:
1. Analyze each provided question.
2. Assess the relevance of each question based on the following scale:
- Not Relevant: The question has no connection to the evaluation objective.
- Slightly Relevant: Minimal connection to the objective; relevance is questionable.
- Moderately Relevant: Partially aligned with the objective but can be improved.
- Relevant: Directly related to the objective; significantly contributes to the measurement.
- Highly Relevant: Essential and completely aligned with the evaluation objective.
3. Evaluate the wording of each question using the following scale:
- Very Poor: Confusing, incomprehensible, or contains severe errors.
- Poor: Difficult to understand and includes major errors.
- Acceptable: Mostly clear but may contain minor errors or slight ambiguity.
- Good: Clear, understandable, with few or no errors.
- Excellent: Impeccable, precise, and completely comprehensible.
Output: A relevance score (1 to 5) and a wording score (1 to 5) for each question, with specific recommendations for improving the questions, if necessary.
Condition: Questions must be correctly formatted and comprehensible; evaluation scales must be predefined.
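The scales in Table 2 map naturally onto 1-to-5 numeric codes; the sketch below shows one way to encode them and to average a panel's ratings per question. The data structures and the summarize helper are illustrative assumptions, not the study's implementation.

# Sketch: encoding the Table 2 rubric and aggregating a judge panel's scores (illustrative).
from statistics import mean

RELEVANCE_SCALE = {1: "Not Relevant", 2: "Slightly Relevant", 3: "Moderately Relevant",
                   4: "Relevant", 5: "Highly Relevant"}
WORDING_SCALE = {1: "Very Poor", 2: "Poor", 3: "Acceptable", 4: "Good", 5: "Excellent"}

def summarize(ratings):
    """ratings: list of (relevance, wording) pairs, one per expert rater."""
    relevance = mean(r for r, _ in ratings)
    wording = mean(w for _, w in ratings)
    return {"relevance_mean": relevance, "wording_mean": wording,
            "relevance_label": RELEVANCE_SCALE[round(relevance)],
            "wording_label": WORDING_SCALE[round(wording)]}

print(summarize([(5, 4), (4, 5), (5, 5), (4, 4), (5, 5)]))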
Table 3. Task descriptions and Bloom’s Taxonomy levels in the exam.
Level 3 (Applying)
Task: Calculate the AADT (Annual Average Daily Traffic) for the [anonymized year], based on historical data provided in the table. The project starts in [anonymized year], with the [anonymized percentage]% annual growth rate in vehicle traffic starting from that year. The attracted traffic is estimated at [anonymized] vehicles per day for the [anonymized year]. Additionally, [anonymized percentage]% of traffic is generated by the project, and [anonymized percentage]% of traffic is developed due to the project’s influence.
Details: This task requires applying data and traffic growth rates to calculate AADT, which involves applying knowledge of traffic estimation methods.
Level 4 (Analyzing)
Task: On the map provided, plot the shortest route between points A and B, considering the AADT results obtained from the previous calculation. Ensure that the route follows the average slope for longitudinal gradients.
Details: This task involves analyzing and interpreting the AADT data to select the optimal route based on slope and other geographic considerations.
Level 5 (Evaluating)
Task: Modify the road layout to meet design standards and regulations for a road with a speed limit of [anonymized speed] km/h and a maximum superelevation of [anonymized percentage]%. Consider factors such as maximum slopes, minimum radii, minimum and maximum clearances. Perform a consistency analysis of the alignment using Criterion II, and draw the necessary superelevation for one of the simple circular curves.
Details: This task requires a critical evaluation of the road layout, including the application of design standards and performing a consistency analysis to ensure the design is safe and efficient.
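The Level 3 task in Table 3 corresponds to the usual compound-growth traffic projection; the sketch below illustrates that calculation with placeholder numbers, since the real inputs are anonymized. Treating generated and developed traffic as percentages of the projected existing traffic is an assumption based on the task wording; a course solution might apply them to a different base.

# Sketch of the AADT projection implied by the Level 3 task (placeholder values only).
def project_aadt(current_aadt, growth_rate, years, attracted, generated_pct, developed_pct):
    """Project existing AADT forward and add attracted, generated, and developed traffic."""
    projected = current_aadt * (1 + growth_rate) ** years   # compound growth of existing traffic
    generated = generated_pct * projected                   # traffic generated by the project
    developed = developed_pct * projected                   # traffic developed by induced land use
    return projected + attracted + generated + developed

print(round(project_aadt(current_aadt=2500, growth_rate=0.03, years=10,
                         attracted=150, generated_pct=0.05, developed_pct=0.02)))  # ~3745 veh/day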
Table 4. Scores achieved by AI chatbots on one version of the exam.
Topic | Weights | ChatGPT 3.5 | Claude 3 | Copilot | Perplexity | You | Average
Introduction | 5 | 10 | 10 | 8 | 10 | 10 | 9.60
Driver | 5 | 10 | 10 | 10 | 10 | 9.5 | 9.90
Traffic | 10 | 10 | 10 | 8.75 | 10 | 9.75 | 9.70
Route study | 10 | 9.75 | 9.5 | 9.75 | 10 | 9.5 | 9.70
Horizontal geometric design | 20 | 9.75 | 7.6 | 9.5 | 9 | 10 | 9.17
Vertical geometric design | 20 | 9 | 9.75 | 9.75 | 9.5 | 9 | 9.40
Transverse geometric design | 20 | 9.5 | 9.5 | 9.5 | 7.6 | 9 | 9.02
Geometric design consistency | 10 | 9 | 10 | 9.5 | 9.5 | 10 | 9.60
Final score | - | 9.53 | 9.32 | 9.45 | 9.17 | 9.50 | 9.39
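The values in the "Final score" row of Table 4 are consistent with a weighted average of the topic scores using the listed weights (which sum to 100); the short check below reproduces the ChatGPT 3.5 figure. This is a reconstruction from the table, not code from the study.

# Check: weighted final score for ChatGPT 3.5 from Table 4 (weights sum to 100).
weights = [5, 5, 10, 10, 20, 20, 20, 10]
chatgpt_scores = [10, 10, 10, 9.75, 9.75, 9, 9.5, 9]

final = sum(w * s for w, s in zip(weights, chatgpt_scores)) / sum(weights)
print(final)  # 9.525, i.e., the 9.53 reported in the table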
Table 5. Item difficulty analysis in this study.
Item Qualification | Difficulty Index Range | % of Questions, Version 1 | % of Questions, Version 2 | % of Questions, Ideal
Easy | 0.91–1 | 31.25 | 56.25 | 5
Relatively Easy | 0.81–0.9 | 28.13 | 9.38 | 20
Medium Difficulty | 0.51–0.8 | 26.56 | 28.13 | 50
Relatively Difficult | 0.40–0.50 | 6.25 | 3.13 | 20
Difficult | 0–0.39 | 7.81 | 3.11 | 5
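The difficulty index behind Table 5 is commonly defined as the proportion of students who answer an item correctly (cf. [35]); the sketch below computes it and classifies the item into the table's bands. The function names and data layout are illustrative assumptions.

# Sketch: item difficulty index (proportion correct) classified with the Table 5 bands.
def difficulty_index(responses, correct_option):
    """responses: the option selected by each student for a single item."""
    return sum(1 for r in responses if r == correct_option) / len(responses)

def classify(p):
    if p >= 0.91:
        return "Easy"
    if p >= 0.81:
        return "Relatively Easy"
    if p >= 0.51:
        return "Medium Difficulty"
    if p >= 0.40:
        return "Relatively Difficult"
    return "Difficult"

answers = ["a", "a", "b", "a", "a", "d", "a", "a"]  # 6 of 8 students chose the keyed option "a"
p = difficulty_index(answers, "a")
print(p, classify(p))  # 0.75 -> Medium Difficulty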
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
