AI Chatbots as Tools for Designing Evaluations in Road Geometric Design According to Bloom’s Taxonomy
Abstract
1. Introduction
2. Materials and Methods
2.1. Design
2.2. Exam Development
“I have a computer-based exam that covers all six levels of Bloom’s Taxonomy, which I’ll share with you []. Since students won’t be able to draw routes, design roads, or perform spatial manipulations on the computer, I want to ensure the exam effectively assesses higher-order thinking skills like analysis and evaluation (levels 5 & 6 of Bloom’s Taxonomy). Could you suggest alternative question formats for levels 5 and 6 that don’t require spatial manipulation?”
2.3. Expert Validation
- Levels 1 and 2 (Remembering and Understanding): Foundational knowledge and comprehension levels, weighted at 0.8 and 1.2, respectively.
- Levels 3 and 4 (Applying and Analyzing): Involve applying knowledge and analyzing information, weighted at 1.5 and 1.0, respectively.
- Level 5 (Evaluating): Critical thinking and making judgments, weighted at 0.50, reflecting the challenge of assessing this level in multiple-choice format.
- Level 6 (Creating): Generating new ideas, weighted at 0.50 due to the difficulty of assessing creativity in multiple-choice format.
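The weighting scheme above can be illustrated with a short sketch. It assumes, hypothetically, that each weight is the number of points awarded per correct question and that each 10-question set contains 2, 2, 2, 2, 1, and 1 questions for levels 1 to 6; the names `BLOOM_WEIGHTS`, `QUESTIONS_PER_LEVEL`, `max_score`, and `weighted_score` are illustrative and do not come from the study.

```python
# Points per correct question for each Bloom level (weights from the list above).
BLOOM_WEIGHTS = {1: 0.8, 2: 1.2, 3: 1.5, 4: 1.0, 5: 0.5, 6: 0.5}

# Assumption: number of questions per level in one 10-question set.
QUESTIONS_PER_LEVEL = {1: 2, 2: 2, 3: 2, 4: 2, 5: 1, 6: 1}


def max_score():
    """Highest attainable score for one question set under the assumptions above."""
    return sum(BLOOM_WEIGHTS[level] * n for level, n in QUESTIONS_PER_LEVEL.items())


def weighted_score(correct_per_level):
    """Sum the weights of correctly answered questions.

    correct_per_level maps a Bloom level (1-6) to the number of correct answers.
    """
    total = 0.0
    for level, n_correct in correct_per_level.items():
        if not 0 <= n_correct <= QUESTIONS_PER_LEVEL[level]:
            raise ValueError(f"invalid number of correct answers for level {level}")
        total += BLOOM_WEIGHTS[level] * n_correct
    return total


if __name__ == "__main__":
    # Hypothetical student: misses one level-3, one level-4, and the level-5 question.
    print(weighted_score({1: 2, 2: 2, 3: 1, 4: 1, 5: 0, 6: 1}), "of", max_score())
```

Under these assumptions a fully correct set is worth 10 points, which would be consistent with reporting grades on a 10-point scale.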
2.4. Participants and Procedure
2.5. Data Analysis
- Threshold-based performance analysis: to assess student mastery of Bloom’s Taxonomy levels 1 to 3, performance thresholds were defined at 30%, 80%, and 90% correct answers. These thresholds enabled the identification of incremental improvements in student outcomes as the criteria for mastery were relaxed. This analysis aimed to provide insights into the proportion of students meeting specific cognitive benchmarks and areas requiring additional preparation.
- Comparison of scores across Bloom’s Taxonomy levels: student grades for Bloom’s Taxonomy levels were converted to a 10-point scale to facilitate comparisons across levels and between the two AI-generated exam versions. This normalization provided a consistent framework for identifying performance trends and deviations from expected progressive declines in grades from lower to higher cognitive levels.
- Item difficulty analysis: the difficulty index for each question was calculated as the proportion of students answering correctly. Questions were categorized into five difficulty levels (easy, relatively easy, medium difficulty, relatively difficult, and difficult). This categorization enabled the examination of the distribution of questions across difficulty levels in both versions and comparisons with ideal distributions recommended in prior research.
- Validation through prior evaluations: to validate the AI-generated instruments, scores from both versions were compared against grades from prior-semester exams covering Bloom’s Taxonomy levels 3, 4, and 5. A paired t-test was used to evaluate the statistical significance of differences in mean scores between the new and prior assessments. Minitab version 14.2 [30] was used to perform the calculations for this analysis.
- Error metrics: to assess the alignment between the new and prior exams, error metrics were calculated, including Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Mean Squared Error (MSE). These metrics evaluated how closely the AI-generated exams approximated the grading patterns of the previous semester’s exam, providing additional evidence of their validity.
- Student feedback and qualitative analysis: semi-structured interviews were conducted to gather qualitative feedback on the AI-generated exams. Students were asked about perceived difficulty, time allocation, and preferred question types. This feedback provided context for interpreting the statistical results, particularly in relation to question clarity, time management, and alignment with practical competencies.
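The error metrics and the paired t-test listed above can be sketched with the Python standard library. This is a minimal illustration rather than the study's procedure (which used Minitab): the function names are hypothetical, the sketch returns only the t statistic and degrees of freedom (no p-value), and MAPE is taken relative to the prior-exam grade, which is an assumption.

```python
import math
from statistics import mean, stdev


def error_metrics(prior, new):
    """Return (MAE, MAPE in percent, MSE) for paired grades.

    Assumption: the prior-exam grade is the reference value, so it must be non-zero
    for the percentage error to be defined.
    """
    errors = [n - p for p, n in zip(prior, new)]
    mae = mean([abs(e) for e in errors])
    mape = 100 * mean([abs(e) / abs(p) for e, p in zip(errors, prior)])
    mse = mean([e * e for e in errors])
    return mae, mape, mse


def paired_t_statistic(prior, new):
    """Return (t, degrees of freedom) for a paired t-test on new minus prior grades."""
    diffs = [n - p for p, n in zip(prior, new)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev uses the n - 1 denominator
    return mean(diffs) / standard_error, n - 1


if __name__ == "__main__":
    # Hypothetical grades on a 10-point scale, for illustration only.
    prior_grades = [8.0, 6.0, 5.0, 10.0]
    new_grades = [7.0, 6.5, 6.0, 9.0]
    print(error_metrics(prior_grades, new_grades))
    print(paired_t_statistic(prior_grades, new_grades))
```

Small MAE, MAPE, and MSE values together with a t statistic close to zero would indicate that the new exam grades track the prior grades closely.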
3. Results
3.1. Exam Generation
3.2. AI Chatbot Performance Analysis
3.3. Chatbot Scoring and Exam Item Validation
3.4. Estimation of the Internal Reliability of the Exams
3.5. Percentage of Correct Responses
3.6. Analysis of Student Grades
3.7. Analysis of Item Difficulty
3.8. Validation with Students
4. Discussion
4.1. AI Performance and Question Generation Quality
4.2. Content Validation and Reliability Concerns
4.3. Student Performance and Assessment Validity
4.4. Implications and Limitations
4.5. Future Research
5. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Appendix A.1
- Bloom’s Taxonomy level 1: recall
- (a) Minimize environmental impact.
- (b) Maximize vehicle speed.
- (c) Provide safety and comfort to users. *
- (d) Reduce construction costs.
- (a) Superelevation.
- (b) Circular curves.
- (c) Transition curves. *
- (d) Sight distance.
- Bloom’s Taxonomy level 2: understand
- (a) Because it determines the type of pavement to be used.
- (b) Because it affects the required sight distance and the safety of the curve. *
- (c) Because it defines the lane width.
- (d) Because it influences the amount of signage required.
- (a) Superelevation increases centrifugal force, improving adhesion.
- (b) Superelevation counteracts centrifugal force, providing greater stability. *
- (c) Superelevation has no relationship to centrifugal force.
- (d) Superelevation reduces centrifugal force, allowing lower speeds.
- Bloom’s Taxonomy level 3: apply
- (a) Formula for calculating stopping sight distance.
- (b) Formula for calculating superelevation.
- (c) Formula of the transition spiral.
- (d) Formula relating design speed to radius and superelevation. *
- (a) Increase the superelevation of the curve.
- (b) Reduce the radius of the curve.
- (c) Implement measures to improve sight distance. *
- (d) Decrease the design speed.
- Bloom’s Taxonomy level 4: analyze
- (a) The composite curve provides a smoother and more gradual transition, improving comfort and safety. *
- (b) The simple circular curve is more suitable for high-speed roads.
- (c) The composite curve requires more space for its construction.
- (d) The simple circular curve is cheaper to build.
- (a) Radius of the curve too small, insufficient superelevation, or limited visibility. *
- (b) Inadequate vertical signage or lack of lane markings.
- (c) Poor drainage or problems with the pavement surface.
- (d) Lack of lighting in the curve area.
- Bloom’s Taxonomy level 5: evaluate
- Bloom’s Taxonomy level 6: create
Appendix A.2
- Topic: Introduction
- (a) To design and operate efficient and safe transportation systems.
- (b) To build new roads and highways.
- (c) To regulate traffic laws.
- (d) To investigate the causes of road accidents.
- Topic: Driver Characteristics
- (a) The time it takes for the driver to perceive an obstacle.
- (b) The time it takes for the driver to decide.
- (c) The time it takes for the driver to execute an action.
- (d) The total time elapsed from the moment an obstacle is perceived until an evasive maneuver is performed.
- Topic: Traffic Studies
- (a) Increases proportionally
- (b) Decreases gradually
- (c) Remains constant
- (d) Fluctuates unpredictably
- Topic: Route Study
- (a) To determine the type of pavement to use.
- (b) To estimate construction and maintenance costs.
- (c) To minimize environmental impact.
- (d) To identify areas prone to landslides.
- Topic: Horizontal Geometric Design
- (a) 7 cm.
- (b) 14 cm.
- (c) 28 cm.
- (d) 140 cm.
- Topic: Vertical Geometric Design
- (a) 120 m.
- (b) 150 m.
- (c) 180 m.
- (d) 200 m.
- Topic: Transverse Geometric Design
- (a) Advantages: greater safety and stability; disadvantages: higher construction cost, possible discomfort for slow-moving vehicles. *
- (b) Advantages: lower construction cost; disadvantages: lower safety, risk of rollover.
- (c) Advantages: higher design speed; disadvantages: greater environmental impact.
- (d) Advantages: improved drainage; disadvantages: greater tire wear.
- Topic: Consistency of Geometric Design
References
- Perez Sanpablo, A.I.; Arquer Ruiz, M.d.C.; Meneses Peñaloza, A.; Rodriguez Reyes, G.; Quiñones Uriostegui, I.; Anaya Campos, L.E. Development and Evaluation of a Diagnostic Exam for Undergraduate Biomedical Engineering Students Using GPT Language Model-Based Virtual Agents; Flores Cuautle, J.d.J.A., Ed.; Springer Nature: Berlin/Heidelberg, Germany, 2024; pp. 128–136.
- Alves de Castro, C. A Discussion about the Impact of ChatGPT in Education: Benefits and Concerns. J. Bus. Theory Pract. 2023, 11, 28–34.
- Sanjay, M.; Vikas, S.; Prashant, D. ChatGPT: Optimizing Text Generation Model for Knowledge Creation. I-Manag. J. Softw. Eng. 2023, 17, 21–26.
- Cheung, B.H.H.; Lau, G.K.K.; Wong, G.T.C.; Lee, E.Y.P.; Kulkarni, D.; Seow, C.S.; Wong, R.; Co, M.T.H. ChatGPT versus Human in Generating Medical Graduate Exam Multiple Choice Questions—A Multinational Prospective Study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS ONE 2023, 18, e0290691.
- Sreelakshmi, A.S.; Abhinaya, S.B.; Nair, A.; Jaya Nirmala, S. A Question Answering and Quiz Generation Chatbot for Education. In Proceedings of the Grace Hopper Celebration India (GHCI), Bangalore, India, 6–8 November 2019; IEEE: Bangalore, India, 2019; pp. 1–6.
- Bloom, B.S. Taxonomy of Educational Objectives; Edwards Brothers: Ann Arbor, MI, USA, 1956; ISBN 058232386X.
- Anderson, L.W.; Krathwohl, D.R. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives; Longman: London, UK, 2001; ISBN 9780321084057.
- Dorodchi, M.; Dehbozorgi, N.; Frevert, T.K. “I Wish I Could Rank My Exam’s Challenge Level!”: An Algorithm of Bloom’s Taxonomy in Teaching CS1. In Proceedings of the Proceedings-Frontiers in Education Conference, FIE, Indianapolis, IN, USA, 18–21 October 2017; Institute of Electrical and Electronics Engineers Inc.: New York, NY, USA, 2017; Volume 2017, pp. 1–5.
- Amin, M.; Naqvi, S.U.E.L.; Amin, H.; Kayfi, S.Z.; Amjad, F. Bloom’s Taxonomy and Prospective Teachers’ Preparation in Pakistan. Qlantic J. Soc. Social. Sci. 2024, 5, 391–403.
- Breckwoldt, J.; Lingemann, C.; Wagner, P. Reanimationstraining Für Laien in Erste-Hilfe-Kursen: Vermittlung von Wissen, Fertigkeiten Und Haltungen. Anaesthesist 2016, 65, 22–29.
- Bharatha, A.; Ojeh, N.; Rabbi, A.M.F.; Campbell, M.H.; Krishnamurthy, K.; Layne-Yarde, R.N.A.; Kumar, A.; Springer, D.C.R.; Connell, K.L.; Majumder, M.A.A. Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom’s Taxonomy. Adv. Med. Educ. Pract. 2024, 15, 393.
- American Society of Civil Engineers (Ed.) Civil Engineering Body of Knowledge: Preparing the Future Civil Engineer; American Society of Civil Engineers: Reston, VA, USA, 2019; ISBN 9780784415221.
- Lu, K. Can ChatGPT Help College Instructors Generate High-Quality Quiz Questions? In Human Interaction and Emerging Technologies (IHIET-AI 2023): Artificial Intelligence and Future Applications; AHFE International: Orlando, FL, USA, 2023; Volume 70, pp. 311–318.
- Bhatia, P. ChatGPT for Academic Writing: A Game Changer or a Disruptive Tool? J. Anaesthesiol. Clin. Pharmacol. 2023, 39, 1–2.
- Fuhrmann, T.; Niemetz, M. Analysis and Improvement of Engineering Exams Toward Competence Orientation by Using an AI Chatbot. In Towards a Hybrid, Flexible and Socially Engaged Higher Education. ICL 2023; Auer, M.E., Cukierman, U.R., Vendrell Vidal, E., Tovar Caro, E., Eds.; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2024; Volume 899, pp. 403–411. ISBN 978-3-031-51979-6.
- Clark, T.M. Investigating the Use of an Artificial Intelligence Chatbot with General Chemistry Exam Questions. J. Chem. Educ. 2023, 100, 1905–1916.
- Mihalache, A.; Popovic, M.M.; Muni, R.H. Performance of an Artificial Intelligence Chatbot in Ophthalmic Knowledge Assessment. JAMA Ophthalmol. 2023, 141, 589–597.
- Herrmann-Werner, A.; Festl-Wietek, T.; Holderried, F.; Herschbach, L.; Griewatz, J.; Masters, K.; Zipfel, S.; Mahling, M. Assessing ChatGPT’s Mastery of Bloom’s Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study. J. Med. Internet Res. 2024, 26, e52113.
- Meo, S.A.; Al-Masri, A.A.; Alotaibi, M.; Meo, M.Z.S.; Meo, M.O.S. ChatGPT Knowledge Evaluation in Basic and Clinical Medical Sciences: Multiple Choice Question Examination-Based Performance. Healthcare 2023, 11, 2046.
- Govender, R.G. My AI Students: Evaluating the Proficiency of Three AI Chatbots in Completeness and Accuracy. Contemp. Educ. Technol. 2024, 16, ep509.
- García-Ramírez, Y. Diseño Geométrico y Operación de Carreteras de Dos Carriles, 1st ed.; Ediciones de la U: Bogotá, Colombia, 2022.
- Tang, R.; Shaw, W.; Vervea, J. Towards the Identification of the Optimal Number of Relevance Categories. J. Am. Soc. Inf. Sci. 1999, 50, 254–264.
- Clark, L.A.; Watson, D. Constructing Validity: Basic Issues in Objective Scale Development. Psychol. Assess. 1995, 7, 309–319.
- Greenberger, E.; Chen, C.; Dmitrieva, J.; Farruggia, S.P. Item-Wording and the Dimensionality of the Rosenberg Self-Esteem Scale: Do They Matter? Pers. Individ. Dif. 2003, 35, 1241–1254.
- Clark, L.A.; Watson, D. Constructing Validity: New Developments in Creating Objective Measuring Instruments. Psychol. Assess. 2019, 31, 1412–1427.
- Haladyna, T.M.; Downing, S.M.; Rodriguez, M.C. A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment. Appl. Meas. Educ. 2002, 15, 309–333.
- Lai, V.; Chen, C.; Smith-Renner, A.; Liao, Q.V.; Tan, C. Towards a Science of Human-AI Decision Making: An Overview of Design Space in Empirical Human-Subject Studies. In Proceedings of the ACM International Conference Proceeding Series; Association for Computing Machinery: New York, NY, USA, 2023; pp. 1369–1385.
- Aiken, L.R. Three Coefficients for Analyzing the Reliability and Validity of Ratings. Educ. Psychol. Meas. 1985, 45, 131–142.
- Kuder, G.F.; Richardson, M.W. The Theory of the Estimation of Test Reliability. Psychometrika 1937, 2, 151–160.
- Minitab, version 14.2; Statistical Software: State College, PA, USA, 2005.
- Wang, Z.; Chen, L.; You, H.; Xu, K.; He, Y.; Li, W.; Codella, N.; Chang, K.W.; Chang, S.F. Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond. Find. Assoc. Comput. Linguist. EMNLP 2023, 8598–8617.
- Kumar, S. Answer-Level Calibration for Free-Form Multiple Choice Question Answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics (ACL): Dublin, Ireland, 2022; Volume 1, pp. 665–679.
- Khatun, A.; Brown, D.G.; Cheriton, D.R. A Study on Large Language Models’ Limitations in Multiple-Choice Question Answering. Comput. Lang. 2024, 1–17.
- Myrzakhan, A.; Bsharat, S.M.; Shen, Z. Open-LLM-Leaderboard: From Multi-Choice to Open-Style Questions for LLMs Evaluation, Benchmark, and Arena. Comput. Lang. 2024, 1–19.
- López, V.G.; López, V.M.G.; Gracia, S.R.; Galaviz, J.L.G.; Sánchez, K.I.B.; Sánchez, C.M.B. Índice de Dificultad y Discriminación de Ítems Para La Evaluación en Asignaturas Básicas de Medicina. Educ. Médica Super. 2020, 34, 1–12.
Step | Description |
---|---|
Input | Chapter “[Chapter Title]” and modified Bloom’s Taxonomy. |
Process | 1. Generate multiple-choice questions. 2. Assign Bloom’s Taxonomy level. 3. Mark correct answer. 4. Avoid irrelevant details. |
Output | 10 multiple-choice questions with 4 options each, assigned to specific Bloom’s Taxonomy levels as follows: Two questions: Test Recalling Information (Remembering level). Two questions: Assess Understanding of Concepts (Understanding level). Two questions: Evaluate the Application of Knowledge (Applying level). Two questions: Measure the Ability to Analyze Information (Analyzing level). One question: Assess Judgment-Making Capability (Evaluating level). One question: Measure the Ability to Generate New Ideas (Creating level). |
Condition | Each question must align with the assigned Bloom’s Taxonomy level. |
Step | Description |
---|---|
Input | A set of 10 questions to be evaluated. An evaluation scale for relevance (from 1 to 5). An evaluation scale for wording (from 1 to 5). |
Process | 1. Analyze each provided question. 2. Assess the relevance of each question based on the following scale: - Not Relevant: The question has no connection to the evaluation objective. - Slightly Relevant: Minimal connection to the objective, relevance is questionable. - Moderately Relevant: Partially aligned with the objective but can be improved. - Relevant: Directly related to the objective, significantly contributes to the measurement. - Highly Relevant: Essential and completely aligned with the evaluation objective. 3. Evaluate the wording of each question using the following scale: - Very Poor: Confusing, incomprehensible, or contains severe errors. - Poor: Difficult to understand and includes major errors. - Acceptable: Mostly clear but may contain minor errors or slight ambiguity. - Good: Clear, understandable, with few or no errors. - Excellent: Impeccable, precise, and completely comprehensible. |
Output | A relevance score (1 to 5) and a wording score (1 to 5) for each question. Specific recommendations for improving the questions, if necessary. |
Condition | Questions must be correctly formatted and comprehensible. Evaluation scales must be predefined. |
Bloom’s Taxonomy Level | Task Description | Details |
---|---|---|
Level 3 (Applying) | Calculate the AADT (Annual Average Daily Traffic) for the [anonymized year], based on historical data provided in the table. The project starts in [anonymized year], with the [anonymized percentage]% annual growth rate in vehicle traffic starting from that year. The attracted traffic is estimated at [anonymized] vehicles per day for the [anonymized year]. Additionally, [anonymized percentage]% of traffic is generated by the project, and [anonymized percentage]% of traffic is developed due to the project’s influence. | This task requires applying data and traffic growth rates to calculate AADT, which involves applying knowledge of traffic estimation methods. |
Level 4 (Analyzing) | On the map provided, plot the shortest route between points A and B, considering the AADT results obtained from the previous calculation. Ensure that the route follows the average slope for longitudinal gradients. | This task involves analyzing and interpreting the AADT data to select the optimal route based on slope and other geographic considerations. |
Level 5 (Evaluating) | Modify the road layout to meet design standards and regulations for a road with a speed limit of [anonymized speed] km/h and a maximum superelevation of [anonymized percentage]%. Consider factors such as maximum slopes, minimum radii, minimum and maximum clearances. Perform a consistency analysis of the alignment using Criterion II, and draw the necessary superelevation for one of the simple circular curves. | This task requires a critical evaluation of the road layout, including the application of design standards and performing a consistency analysis to ensure the design is safe and efficient. |
Topic | Weights | ChatGPT 3.5 | Claude 3 | Copilot | Perplexity | You | Average |
---|---|---|---|---|---|---|---|
Introduction | 5 | 10 | 10 | 8 | 10 | 10 | 9.60 |
Driver | 5 | 10 | 10 | 10 | 10 | 9.5 | 9.90 |
Traffic | 10 | 10 | 10 | 8.75 | 10 | 9.75 | 9.70 |
Route study | 10 | 9.75 | 9.5 | 9.75 | 10 | 9.5 | 9.70 |
Horizontal geometric design | 20 | 9.75 | 7.6 | 9.5 | 9 | 10 | 9.17 |
Vertical geometric design | 20 | 9 | 9.75 | 9.75 | 9.5 | 9 | 9.40 |
Transverse geometric design | 20 | 9.5 | 9.5 | 9.5 | 7.6 | 9 | 9.02 |
Geometric design consistency | 10 | 9 | 10 | 9.5 | 9.5 | 10 | 9.60 |
Final score | - | 9.53 | 9.32 | 9.45 | 9.17 | 9.50 | 9.39 |
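The "Final score" row of the table above appears to be the weight-averaged topic score for each chatbot. The sketch below assumes that reading; the variable and function names are illustrative, and the values are copied from the table.

```python
# Topic weights, in table order: Introduction, Driver, Traffic, Route study,
# Horizontal, Vertical, Transverse geometric design, Geometric design consistency.
WEIGHTS = [5, 5, 10, 10, 20, 20, 20, 10]

# Topic scores per chatbot, copied column by column from the table.
SCORES = {
    "ChatGPT 3.5": [10, 10, 10, 9.75, 9.75, 9, 9.5, 9],
    "Claude 3": [10, 10, 10, 9.5, 7.6, 9.75, 9.5, 10],
    "Copilot": [8, 10, 8.75, 9.75, 9.5, 9.75, 9.5, 9.5],
    "Perplexity": [10, 10, 10, 10, 9, 9.5, 7.6, 9.5],
    "You": [10, 9.5, 9.75, 9.5, 10, 9, 9, 10],
}


def final_score(topic_scores, weights=WEIGHTS):
    """Weighted mean of topic scores (assumed definition of the final score)."""
    return sum(w * s for w, s in zip(weights, topic_scores)) / sum(weights)


if __name__ == "__main__":
    for chatbot, topic_scores in SCORES.items():
        print(f"{chatbot}: {final_score(topic_scores):.3f}")
```

Computed this way, the weighted means agree with the "Final score" row to within rounding, which supports the assumed definition.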
Item Qualification | Difficulty Index Range | Version 1 (% of Questions) | Version 2 (% of Questions) | Ideal (% of Questions) |
---|---|---|---|---|
Easy | 0.91–1 | 31.25 | 56.25 | 5 |
Relatively Easy | 0.81–0.9 | 28.13 | 9.38 | 20 |
Medium Difficulty | 0.51–0.8 | 26.56 | 28.13 | 50 |
Relatively Difficult | 0.40–0.50 | 6.25 | 3.13 | 20 |
Difficult | 0–0.39 | 7.81 | 3.11 | 5 |
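The difficulty index and the five bands in the table above can be sketched as follows. This is a minimal illustration, assuming the index is rounded to two decimals before it is matched to a band (the table gives the band limits to two decimals); the function names and the sample values are hypothetical.

```python
from collections import Counter

# Difficulty bands (label, lower limit, upper limit) from the table above.
BANDS = [
    ("Easy", 0.91, 1.00),
    ("Relatively Easy", 0.81, 0.90),
    ("Medium Difficulty", 0.51, 0.80),
    ("Relatively Difficult", 0.40, 0.50),
    ("Difficult", 0.00, 0.39),
]


def difficulty_index(responses):
    """Proportion of students answering the item correctly (True = correct)."""
    return sum(responses) / len(responses)


def classify(index):
    """Map a difficulty index to its band; rounding to 2 decimals is an assumption."""
    rounded = round(index, 2)
    for label, low, high in BANDS:
        if low <= rounded <= high:
            return label
    raise ValueError(f"difficulty index out of range: {index}")


def band_distribution(indices):
    """Percentage of items falling in each band."""
    counts = Counter(classify(i) for i in indices)
    return {label: 100 * counts[label] / len(indices) for label, _, _ in BANDS}


if __name__ == "__main__":
    # Hypothetical item: three of four students answered correctly.
    p = difficulty_index([True, True, True, False])
    print(p, classify(p))
```

Comparing the output of `band_distribution` for each exam version with the "Ideal" column shows where a version has too many easy items or too few items of medium difficulty.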
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
García-Ramírez, Y. AI Chatbots as Tools for Designing Evaluations in Road Geometric Design According to Bloom’s Taxonomy. Appl. Sci. 2025, 15, 8906. https://doi.org/10.3390/app15168906