Article

Artificial Intelligence from Google Environment for Effective Learning Assessment

Department of Humanities, Philosophy and Education, University of Salerno, 84084 Fisciano, Italy
Information 2025, 16(6), 462; https://doi.org/10.3390/info16060462
Submission received: 9 May 2025 / Revised: 27 May 2025 / Accepted: 28 May 2025 / Published: 30 May 2025

Abstract

This study investigates the use of Google NotebookLM for the automatic generation of educational assessment items. A mixed-methods approach was adopted, combining quantitative psychometric evaluation with qualitative student feedback. Six tests, each composed of 15 multiple-choice questions generated from diverse sources such as PDFs, web slides, and YouTube videos, were administered to undergraduate students. Quantitative analysis involved calculating key indices which confirmed that many AI-generated items met acceptable psychometric criteria, though some items revealed reliability concerns and potential bias. Concurrently, a structured questionnaire assessed the clarity, relevance, and fairness of the test items. Students generally rated the AI-generated questions positively in terms of clarity and pedagogical alignment, while also noting areas for improvement. In conclusion, the findings suggest that generative AI can offer a scalable and efficient solution for test item creation; however, further methodological refinements are needed to ensure consistent validity, reliability, and ethical fairness in learning assessments.

1. Introduction

Education, and in particular learning assessment, is undergoing a period of profound transformation due to the emergence and rapid advancement of artificial intelligence (AI), especially generative AI (GenAI) [1,2,3,4,5]. Assessment is a critical element in all learning systems, and the significant growth of digital assessments in the last decade has led to an urgent need to generate more items (test items) quickly and efficiently [6]. Continuous improvements in computational power and advances in methodological approaches, particularly in the field of natural language processing (NLP), offer new opportunities and challenges for the automatic generation of items for educational assessment [6,7,8].
Automatic item generation (AIG) is a method that uses algorithms to quickly create a large number of test items [6]. Educational research includes more than 40 studies on AIG, examining its goals, the types of questions it can produce, the input sources it uses, and the methods for evaluating the items [6]. Most of these studies have focused on using AIG for various assessments, such as large-scale exams, opinion surveys, classroom tests, and practice quizzes [6]. Other research has even explored using AIG to create personality tests and more complex materials like stories and assessment passages [6]. See Table 1 for a summary of the evolving complexity of AIG approaches.
The advent of GenAI, defined as AI systems that generate novel outputs rather than analyze existing data [9], has intensified this potential. Advanced models based on NLP and deep learning can process language and generate high-quality text, images, and other content [4,7,9]. Sources note that GenAI (e.g., ChatGPT) is reshaping educational landscapes, enabling students to generate responses that closely mimic human-written text [5]. Large Language Models (LLMs), a category of GenAI, have been identified as key tools that can be integrated into the educational and assessment process [8,10]. They can be used for a variety of assessment-related tasks, including test planning, question creation, instruction preparation, test administration, scoring, test analysis, interpretation, providing feedback, and recommending study materials [8].
The integration of GenAI into educational assessment presents both transformative opportunities and significant challenges. On the opportunity side, AI-driven assessment tools can personalize learning experiences, adapt to individual student needs, and generate diverse test items efficiently. Research suggests that personalized assessments powered by AI can enhance learner motivation and engagement, allowing students to demonstrate their true competencies [11]. AI can also streamline grading processes, ensuring consistency and providing immediate feedback. Additionally, GenAI can support educators by automating the creation of instructional materials and assessment frameworks [12].
However, these advancements come with inherent challenges. One of the primary concerns is maintaining the validity, reliability, and fairness of AI-generated assessments [11]. Standardized testing has traditionally ensured comparability across different student populations, but AI-driven personalization may introduce inconsistencies in measurement [11]. Another challenge is the risk of bias in AI-generated test items, as AI models learn from historical data that may contain implicit biases [12]. Ensuring that AI-generated assessments align with educational objectives and accurately measure intended skills remains a critical issue. Furthermore, the ethical implications of AI in assessment, including concerns about academic integrity and the potential misuse of AI-generated content, require careful consideration [13].
A recent study specifically explored the application of LLMs to generate customizable learning materials, including automatically generating multiple-choice questions based on instructor-provided learning outcomes [10]. This preliminary experiment, conducted with undergraduate students, found that students found GenAI-generated variants of learning materials engaging, with the most popular feature being the automatically generated quiz-style tests that they used to assess their understanding [10]. These findings suggest the potential for increasing student study time and supporting learning [10].
However, integrating GenAI into assessment is not without challenges and risks. One of the main concerns is the academic integrity and authenticity of students’ work, as it is difficult to distinguish AI-generated content from that produced by students themselves [1,2,5,14,15]. LLMs have a tendency to invent plausible and confident answers that seem credible at first glance, but that do not stand up to detailed scrutiny, and may even invent references [2,15]. This misuse, where students use AI as a substitute for critical thinking and research effort, can lead to an erosion of academic integrity [15]. One study found that while AIs perform well with general theoretical knowledge in control engineering, they are still unable to solve complex practical problems effectively and tend to resort to standard solutions that are not always appropriate [15]. Furthermore, the responses of GenAIs are not always consistent, even with identical commands, due to their internal functioning and continuous updates [15].
Recent studies highlight the ethical challenges associated with AI-driven assessment, particularly in terms of bias, transparency, and accountability [16]. AI models may inadvertently reinforce biases present in training data, leading to unfair testing outcomes [17]. To mitigate these risks, researchers emphasize the importance of diverse datasets, bias audits, and inclusive design. Transparency in AI decision-making is another critical factor, as educators and students need to understand how AI-generated assessments are formulated. Explainable AI (XAI) techniques can help make AI-driven assessment processes more interpretable and trustworthy [18].
Validation processes for AI-generated assessments remain an evolving area of research. While AI can generate test items rapidly, ensuring their pedagogical soundness and alignment with learning objectives requires rigorous validation. Studies suggest that empirical testing, expert reviews, and iterative refinement are essential to maintaining the quality of AI-generated assessments [18]. Additionally, AI-powered testing solutions are being explored to enhance quality assurance in educational assessments, ensuring that AI-generated test items meet reliability and validity standards [17].
The era of GenAI calls for a rethinking of traditional assessment methods, which often rely on memorization and standardized tests, as these may not effectively measure higher-order skills such as critical thinking, creativity, and problem-solving [1,5]. Sources suggest a shift toward assessments that target critical skills rather than rote knowledge, promoting scientific reasoning and knowledge-based application [5,19]. GenAI need not threaten the validity or reliability of assessments; rather, it can add fidelity and nuance to assisted assessments and facilitate a greater focus on unaided assessments [20]. Responsible use of GenAI becomes crucial in education, requiring pedagogically appropriate interaction between learners and AI tools, with consideration for human agency and higher-order thinking skills [14]. It is essential to maintain a critical attitude and verify the results generated by AI, recognizing that they are not infallible and their accuracy depends on the quality of the data, algorithms, and clarity of the questions [15].
Despite the potential, the sources identify several gaps and areas requiring further research.
  • Lack of Guidelines and Validation: There are still no established guidelines for validating AI-generated assessments, nor standardized methodologies to ensure alignment with educational objectives [1,7].
  • Prevalence of Qualitative Studies and Small Samples: Most research is qualitative or case-study-based with small samples, limiting the generalizability and robustness of conclusions [1,2,21].
  • Bias and Instructional Alignment: There are still issues of bias in training data and difficulties in ensuring that AI-created assessments actually measure the intended skills [5,7].
  • Lack of Empirical Data on Effectiveness and Impact: Many studies propose theoretical solutions or recommendations, but empirical data on effectiveness, long-term impact, and large-scale applicability are lacking [3,5,20].
  • Ethical and Academic Integrity Challenges: The adoption of GenAI raises new ethical and integrity issues, often only hinted at and not yet systematically addressed [2,3,5].
These gaps indicate the need for further empirical research, development of guidelines, and rigorous validation to effectively integrate GenAI into the automatic generation of assessment tests.
The main objective of this research is to test whether AI-based item generation can meet the rigorous standards of educational assessment, ensuring that each item created adequately challenges students, effectively distinguishes between different levels of knowledge, and contributes to the overall coherence of the test. To do so, a thorough item analysis was performed. Key psychometric indices such as the Difficulty Index (p), the Discriminatory Power (D), the Selectivity Index (IS), and the Reliability Index (IA) were calculated.
Finally, this study aims to explore the broader implications of integrating AI into the test development process. Therefore, methodological challenges need to be addressed to avoid the potential introduction of biases and to ensure that the generated items are not only valid and reliable, but also aligned with ethical guidelines in assessment design. It is expected that the results of this exploration will provide critical insights to feed into guidelines for the responsible use of AI in educational settings.
By addressing these objectives, this research contributes to a better understanding of the role and impact of AI in automated test item generation. Evaluating an item’s difficulty, discrimination, and reliability is paramount, as these metrics offer a quantitative foundation for ensuring that AI-produced assessments are robust, fair, and pedagogically sound. The findings will help delineate areas in which GenAI is already matching or surpassing traditional methods, as well as highlighting potential avenues for refinement and ethical implementation.

2. Materials and Methods

2.1. The Developed System

In summary, the system ingests diverse data sources (PDFs, URLs, YouTube videos), uses Google NotebookLM to process and understand this information, and exports the processed data to Google Sheets; from there, through Google Apps Script, the content is made available in two primary ways: directly for potential use in Google Forms, or as an XML file specifically designed for Moodle integration.
Google NotebookLM is a tool created to summarize and generate content from documents, web pages, YouTube videos, and audio files selected by the user. Its artificial intelligence engine is Google's multimodal Gemini model, which allows NotebookLM not only to read and understand text documents, but also to extract information from audio and video content. The real power of NotebookLM lies in its ability to process and generate new content starting exclusively from the sources selected by the user, acting as an intelligent research assistant capable of generating summaries, answering specific questions, identifying key topics and concepts, facilitating more in-depth analysis, and providing citations.
At the moment, there do not appear to be studies published in scientific journals that specifically focus on the use of NotebookLM for the creation of assessment tests or for other academic purposes. References on the Internet [22,23] describe NotebookLM mainly as a tool for content analysis, synthesis, and management (e.g., for a literature review or the extraction of relevant information from documents).
This architecture, strongly based on this Google AI engine, enables the transformation of unstructured information from various sources into structured formats suitable for educational purposes within the Moodle learning environment. The use of Google Apps Script highlights the potential for customizability and tailored data formatting for seamless integration with Moodle.
The overall logic architecture of the developed system is shown in Figure 1.
Breaking down this logic architecture step by step, the following clarifies the modules and how they interact:
First, the input sources are clarified. The system accepts information from various input sources. In particular, multiple PDF files can be fed into the system, which suggests the system can extract text or data from these documents. Moreover, the system can ingest information from web pages specified by their URLs, implying the ability to fetch content from the internet. YouTube videos may also serve as input sources: the system can process video content, potentially through transcription or by analyzing associated metadata.
Second, the Core Processing Module is Google NotebookLM. The data from PDFs, URLs, and YouTube are fed into Google NotebookLM. Google NotebookLM acts as a central processing unit. It likely uses advanced AI and natural language processing (NLP) to understand, summarize, and synthesize the information from the various input sources.
Third, Data Transformation and Export are clarified. The output from Google NotebookLM is then directed towards two distinct paths: Path 1 is Google Sheets and Google Forms, as shown by the arrow pointing from Google NotebookLM to Google Sheets. This suggests that NotebookLM can organize and export the processed information into a structured spreadsheet format. Subsequently, data may go from Google Sheets to Google Forms. This implies that the data within the Google Sheet can be used to automatically populate or create questions and options within a Google Form. This could be useful for quizzes, surveys, or data collection. Path 2 is Google Apps Script and XML, as shown by the arrow pointing from Google NotebookLM to Google Apps Script. Google Apps Script is a powerful platform for automating tasks and extending Google Workspace applications. This suggests that NotebookLM can trigger custom scripts to further process or format the data. An arrow then leads from Google Apps Script to an XML file. This indicates that the Google Apps Script is used to transform the processed information into an XML (Extensible Markup Language) format. XML is a standard markup language designed for encoding documents in a format that is both human-readable and machine-readable.
Fourth, the final output on Moodle is clarified. An arrow points from the XML file to Moodle. Moodle is a popular open-source learning management system (LMS). This final step suggests that the XML file, generated by the Google Apps Script, is specifically formatted for import into a Moodle platform. This could be used to create learning materials, assessments, or other educational content within Moodle.
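To make Path 2 concrete, the following is a minimal sketch of how structured question rows (as they might appear in the exported Google Sheet) can be serialized into Moodle's XML question format. It is written in Python for illustration rather than in the Google Apps Script actually used by the system, and the row fields ("question", "correct", "distractors") as well as the sample item are assumptions, not the system's real schema.
```python
# Minimal sketch of Path 2 (Sheets/Apps Script -> Moodle XML), written in Python
# for illustration; the actual system uses Google Apps Script. The row fields
# ("question", "correct", "distractors") are assumed, not the system's real schema.
import xml.etree.ElementTree as ET

def rows_to_moodle_xml(rows):
    """Serialize a list of multiple-choice items into Moodle XML."""
    quiz = ET.Element("quiz")
    for i, row in enumerate(rows, start=1):
        q = ET.SubElement(quiz, "question", type="multichoice")
        ET.SubElement(ET.SubElement(q, "name"), "text").text = f"Item {i}"
        qtext = ET.SubElement(q, "questiontext", format="html")
        ET.SubElement(qtext, "text").text = row["question"]
        ET.SubElement(q, "single").text = "true"        # exactly one correct option
        ET.SubElement(q, "shuffleanswers").text = "1"   # let Moodle shuffle the options
        ET.SubElement(ET.SubElement(q, "answer", fraction="100"), "text").text = row["correct"]
        for wrong in row["distractors"]:
            ET.SubElement(ET.SubElement(q, "answer", fraction="0"), "text").text = wrong
    return ET.tostring(quiz, encoding="unicode")

# Hypothetical item, only for demonstration
items = [{
    "question": "Which element is NOT part of the definition of competence?",
    "correct": "The brand of the assessment software",
    "distractors": ["Knowledge", "Skills", "Attitudes"],
}]
print(rows_to_moodle_xml(items))
```
The resulting string can be saved as an .xml file and imported through Moodle's standard question bank import, which accepts the Moodle XML format shown above.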

2.2. The Course and the Tests

Lessons in “Evaluation and Certification of Competences” are held for the degree course in “Motor, Sports and Psychomotor Education Sciences” at the University of Salerno.
The general objective of the lessons is to promote knowledge related to the models of evaluation and certification of skills and develop the ability to use them in real or simulated situations related to the world of sport, school, and training.
The topics considered in this study are the following:
  • Definitions of knowledge, skills, attitudes, context, and competences.
  • Assessment and evaluation of competences in sports.
  • Assessment and evaluation of competences at school.
  • Peer assessment and evaluation.
  • Certification of competences at school.
  • European frameworks and laws for the certification of competences.
These lessons introduce the topics with a gradual approach that makes them particularly accessible, and they do not require particular prerequisites or prior knowledge. Each lesson consists of a frontal lecture followed by a final test on the topic covered. To proceed with the calibration of an assessment test, that is, to evaluate, as a first analysis, the quality of its questions through item analysis, it is preferable to deal with topics that have already been explained [24]. This is the reason why the tests considered are those that were administered at the end of the lesson. The tests were created by using the generative artificial intelligence of Google NotebookLM, elaborating slides, handouts in PDF files, and YouTube videos as sources. Each test had 15 questions and was delivered as a Moodle quiz or a Google Form. The total number of created questions delivered to participants was 90.

2.3. The People

The participants were students enrolled in the degree course in “Sciences of Motor Activities, Sports and Psychomotor Education” at the University of Salerno who followed the lessons of “Evaluation and Certification of Competences”, which is a course taken in the third and final year.
They were engaged during the second semester of the academic year 2024/25 from the end of February to the beginning of May. The total number of students enrolled in the course was 139. About half of them actually participated in the experiment.

2.4. The Qualitative and Quantitative Evaluation of the Tests

The effectiveness of the assessment tests produced thanks to the use of artificial intelligence was evaluated at a quantitative level through item analysis and at a qualitative level by collecting the opinions of the participants through the administration of a questionnaire.

2.4.1. Item Analysis

Item analysis is a crucial process in psychometrics and educational measurement aimed at evaluating the quality and effectiveness of individual items within an assessment test to enhance its overall validity and reliability [25]. This involves calculating several key indices and indicators after the administration of the test to a representative sample. The Difficulty Index (p) represents the proportion of respondents answering correctly. Its formula is
p = Ptot / Pmax
where Ptot is the total score the participants gained and Pmax is the maximum score the participants could have achieved. It is in the range of 0 to 1. It ideally ranges between 0.30 and 0.70 [26], indicating an appropriately challenging item. Discriminatory Power (D) is often the difference in success rates between high- and low-scoring groups. Its formula is
D = (E · S) / (N/2)²
where E is the number of correct answers, S is the number of wrong answers, and N is the number of participants. It is in the range of 0 to 1. It should be closer to 1, signifying the item’s ability to differentiate between individuals with varying levels of competence [25]. A related measure, the Selectivity Index (IS) or point-biserial correlation, quantifies the correlation between the item response and the total test score. Its formula is
IS = (Nm − Np) / (N/3)
where Nm is the number of correct answers given by the best-performing participants and Np is the number of correct answers given by the worst-performing participants. It is in the range of −1 to 1. Generally, values above 0.30 are considered good, from 0.20 to 0.30 fair, from 0.10 to 0.20 marginal, and below 0.10 warrant revision [24,26]. The Reliability Index (IA), calculated as the product of the discrimination index and the square root of item variance, estimates an item’s contribution to the test’s internal consistency. Its formula is
IA = p · IS
It is in the range of −1 to 1. Higher values closer to 1 indicate a greater positive impact on reliability [27].
To calibrate a test and evaluate item effectiveness, these indices are jointly interpreted: problematic items with low difficulty or discrimination, poor selectivity (IS < 0.10), and low reliability (IA) should be revised or removed. Items with high Discriminatory Power (D and IS > 0.30) are valuable for differentiating competence levels. Optimizing the balance of item difficulty and discrimination, while maximizing the overall reliability by addressing items with low IA, leads to a well-calibrated test that accurately and consistently measures the intended construct [25,27].
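To make the computation of these indices reproducible, the following is a minimal sketch, assuming a binary response matrix with one row per participant and one column per item (1 = correct, 0 = wrong); the toy data and variable names are illustrative and are not taken from the study's dataset.
```python
# Minimal sketch of the item analysis indices defined above, assuming a binary
# response matrix (rows = participants, columns = items); toy data only.
import numpy as np

def item_analysis(responses):
    r = np.asarray(responses)
    n_students = r.shape[0]
    totals = r.sum(axis=1)              # total score of each participant
    order = np.argsort(totals)          # participants sorted from worst to best
    third = n_students // 3
    worst, best = order[:third], order[-third:]

    p = r.mean(axis=0)                                         # Difficulty Index (1)
    E = r.sum(axis=0)                                           # correct answers per item
    S = n_students - E                                          # wrong answers per item
    D = (E * S) / (n_students / 2) ** 2                         # Discriminatory Power (2)
    IS = (r[best].sum(axis=0) - r[worst].sum(axis=0)) / third   # Selectivity Index (3)
    IA = p * IS                                                 # Reliability Index (4)
    return p, D, IS, IA

# Toy example: 6 participants, 3 items
toy = [[1, 1, 0],
       [1, 0, 0],
       [1, 1, 1],
       [0, 1, 0],
       [1, 1, 1],
       [1, 1, 0]]
for name, values in zip(("p", "D", "IS", "IA"), item_analysis(toy)):
    print(name, np.round(values, 2))
```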

2.4.2. The Questionnaire

At the end of the course, a questionnaire was delivered to the participants to collect their opinions about the tests. To gather comprehensive feedback on the quality and effectiveness of AI-generated assessment items, this questionnaire was rigorously designed to blend closed- and open-ended questions. The instrument incorporated several Likert-scale items using a 5-point scale where 1 indicated “Totally disagree” and 5 “Totally agree” to capture participants’ opinions on multiple facets of the test items. Sample quantitative questions included statements such as “The overall quality of the AI-generated test questions is high”, “The questions are clear and unambiguous”, and “The provided response options are balanced and appropriate”, while comparative items (e.g., “Compared to tests prepared by human experts, the AI-generated items are: significantly worse, similar, or significantly better”) were used to elicit relative evaluations. In addition, open-ended items invited detailed comments and suggestions, allowing respondents to elaborate on any perceived drawbacks or advantages. This questionnaire design was inspired by Krosnick and Presser [28], who emphasize the importance of clear and concise question wording to avoid ambiguity and ensure respondents understand the questions as intended. They recommend using simple language and avoiding technical jargon. They also suggest using a mix of closed- and open-ended questions to capture both quantitative and qualitative data. To collect the opinions of users on their e-learning experience, a questionnaire inspired by the UEQ model [29] was defined, and the TUXEL technique was used for user experience evaluation in e-learning [30]. Moreover, the design was based on contemporary research in the field of AI-enhanced educational assessment which emphasizes the importance of integrating both quantitative and qualitative measures to obtain nuanced user feedback [31], and other research [32] that demonstrated that mixed-methods questionnaires are effective for capturing detailed peer assessment insights in AI-supported environments. These methodological choices aim to ensure that the evaluation of AI-produced test items is both rigorous and reflective of participants’ real-world experiences. Table 2 shows the delivered questions.
Many questions allowed the participants to express their opinions on a 5-level scale (1 = Totally disagree; 2 = Disagree; 3 = Neutral; 4 = Agree; 5 = Totally agree). Some of them were Yes/No questions. Some final questions allowed participants to leave comments and suggestions in an open text box.

3. Results

3.1. The Effective Participants

Test n.1 was administered to 61 participants, Test n.2 to 63, Test n.3 to 69, Test n.4 to 60, Test n.5 to 65, and Test n.6 to 64. The average number of participants was 64.

3.2. Item Analysis Results

The four cited indices, Difficulty (1), Discriminatory Power (2), Selectivity Index (3), and Reliability Index (4), were calculated for each of the six tests using the formulas described above. The results of these calculations are reported in Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8.

3.3. Answers to the Questionnaire

The questionnaire used to collect the opinions of the participants was administered after the last test. Of the 64 participants who answered it, 55 were on schedule and regularly enrolled in the third year of the degree course, 6 were one year beyond the standard course duration, and 3 were two years beyond it. The collected data are included in Table 9, Table 10 and Table 11.
Section F of the questionnaire was administered with the aim of collecting participants’ opinions on the quality of the items in the tests. In particular, question F1 asked participants to refer to other tests prepared by the human teacher and to compare them with those prepared through GenAI. The opinions collected are presented in Table 10.
Finally, question F2 required an open-ended response through which participants could express their point of view and provide comments and suggestions regarding the items they answered. Since this was an open-text answer, the opinions collected were grouped into clusters that essentially referred to different needs. In particular, the needs referred to a better contextualization of the question, an improvement of the formulation of the question, or an improvement of the formulation of the answers. To these identified categories, two more were added that indicate that there is nothing to report or that there are other aspects not precisely connected to the question asked in the questionnaire. The data collected are reported in Table 11.

4. Discussion

To interpret the tables effectively, the general reference values are considered [24].
The reference values for the Difficulty Index (1) are as follows:
  • >0.70: too easy;
  • 0.30–0.70: optimal;
  • <0.30: too difficult.
Discriminatory Power (2) should be positive and high. Higher values indicate better discrimination between high and low achievers.
The reference values for the Selectivity Index (3) are as follows:
  • >0.30: good;
  • 0.20–0.30: fair;
  • 0.10–0.20: marginal;
  • <0.10: poor, needs revision.
For the Reliability Index (4), higher values are desirable, indicating a greater contribution to the test’s internal consistency.
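As a worked illustration of how these reference values can be applied jointly, the short sketch below flags an item given its four indices. The cut-offs for Difficulty and Selectivity follow the lists above; the thresholds used for Discriminatory Power (0.30) and the Reliability Index (0.10) are assumptions added for illustration, since the text only requires these two indices to be positive and high.
```python
# Sketch of a joint interpretation of the four indices; the D and IA cut-offs
# (0.30 and 0.10) are illustrative assumptions, the others follow the reference
# values listed above.
def classify_item(p, D, IS, IA):
    notes = []
    if p > 0.70:
        notes.append("too easy")
    elif p < 0.30:
        notes.append("too difficult")
    if D < 0.30:                       # assumed cut-off for "low discrimination"
        notes.append("low discrimination")
    if IS < 0.10:
        notes.append("poor selectivity: revise or remove")
    elif IS < 0.20:
        notes.append("marginal selectivity")
    elif IS < 0.30:
        notes.append("fair selectivity")
    if IA < 0.10:                      # assumed cut-off for low reliability contribution
        notes.append("low contribution to internal consistency")
    return notes or ["acceptable"]

# Example with the values of item 1.12 from Table 3
print(classify_item(p=0.92, D=0.30, IS=0.10, IA=0.09))
# -> ['too easy', 'marginal selectivity', 'low contribution to internal consistency']
```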
Table 3 shows the item analysis on Test n.1 about “Definitions of knowledge, skills, attitudes, context, and competences”. The calculated indices allow some reflections.
The Difficulty Index (1) is in the range of 0.52 to 0.93. This means that most items are moderately easy to easy. Eight of them can be considered as optimal with values between 0.52 and 0.67. None of them is too difficult, and seven of them are particularly easy.
Discriminatory Power (2) is generally high. For one item, D = 0.25; for another one, D = 0.30, and the rest of them have values ranging from 0.46 to 1.00. This indicates good discrimination for most items.
The Selectivity Index (3) varies from 0.10 to 0.75. Nine items have marginal to poor selectivity. Seven of them show good selectivity with values greater than 0.30.
The Reliability Index (4) ranges from 0.08 to 0.44. Higher values are observed for items with better selectivity and Discriminatory Power. Items 1.10 (0.08) and 1.12 (0.09) have particularly low values.
Test n.1 has some strong items with good discrimination and selectivity. However, several items, particularly 1.11, 1.12, and 1.13, are too easy and have poor selectivity and reliability indices, indicating they contribute little to the test’s ability to measure the intended construct. These items should be revised or removed.
Table 4 shows the item analysis on Test n.2 about “Assessment and evaluation of competences in sports”.
The Difficulty Index (1) is in the range of 0.44 to 1. This means that most items are very easy. Five of them can be considered as optimal with values between 0.44 and 0.7. None of them is too difficult, and 10 of them are particularly easy. This suggests a lack of challenging questions.
Discriminatory Power (2) is in the range of 0 to 0.99. This means that this index has significant variability. Six of the items can be considered as optimal with values greater than 0.7.
The Selectivity Index (3) is in the range of 0 to 0.76. Six of the items can be considered as optimal with values greater than 0.2. Most other items have poor selectivity, with several at or below 0.05.
The Reliability Index (4) is in the range of 0 to 0.52. Six of the items can be considered as optimal with values greater than 0.3. Other items have very low values for this index.
Test n.2 has serious issues. Many items are too easy and fail to discriminate effectively between students. The selectivity and reliability indices are generally poor. A major revision is needed to improve the test’s validity and reliability.
Table 5 shows the item analysis on Test n.3 about “Assessment and evaluation of competences at school”.
The Difficulty Index (1) is in the range of 0.33 to 0.99. This means that most items are quite easy. Four of them can be considered as optimal with values between 0.33 and 0.7. None of them is too difficult, and 11 of them are particularly easy.
Discriminatory Power (2) is in the range of 0.06 to 0.94. Four of the items can be considered as optimal with values greater than 0.7. The other ones have very low Discriminatory Power.
The Selectivity Index (3) is in the range of 0.04 to 0.48. Five of the items can be considered as optimal with values greater than 0.2. Many items fall into the marginal or poor categories.
The Reliability Index (4) is in the range of 0.04 to 0.32. Five of the items can be considered as optimal with values greater than 0.3.
Test n.3 also suffers from having many easy items and poor discrimination for several questions. Items 3.4, 3.6, 3.8, and 3.12 are reasonably good, but the test needs significant revision to improve its overall quality.
Table 6 shows the item analysis on Test n.4 about “Peer assessment and evaluation”.
The Difficulty Index (1) is in the range of 0.63 to 0.98. This means that the test is extremely easy. Only one of the items can be considered as optimal with a value between 0.63 and 0.7. None of them is too difficult, and 14 of them are particularly easy.
Discriminatory Power (2) is in the range of 0 to 0.07. The values are very low across the board, mostly at 0.06 or 0.00. This indicates that none of the items effectively differentiates between students.
The Selectivity Index (3) is in the range of 0 to 0.6. One of the items can be considered as optimal with a value greater than 0.3.
The Reliability Index (4) is in the range of 0 to 0.38. Only one of the items can be considered as good with a value greater than 0.3.
Test n.4 is fundamentally flawed. The items are too easy and lack any Discriminatory Power. The selectivity and reliability indices are extremely poor, suggesting the test does not measure the intended construct in a meaningful way. Complete revision or replacement is necessary.
Table 7 shows the item analysis on Test n.5 about “Certification of competences at school”.
The Difficulty Index (1) is in the range of 0.52 to 1. This means that most items are moderately easy to easy. Three of them can be considered as optimal with values between 0.52 and 0.7. None of them is too difficult, and 12 of them are particularly easy.
Discriminatory Power (2) is in the range of 0 to 1. Four of the items can be considered as optimal with values greater than 0.7. Several items have very poor discrimination (0.00 to 0.18).
The Selectivity Index (3) is in the range of −0.05 to 0.76. Six of the items can be considered as optimal with values greater than 0.3. Several items have poor or even negative selectivity.
The Reliability Index (4) is in the range of −0.05 to 0.54. Six of the items can be considered as good with values greater than 0.3. Higher values align with items with better selectivity and discrimination.
Test n.5 has a mix of good and bad items. While some items demonstrate strong discrimination and selectivity, many others are too easy and fail to differentiate between students. The negative values for selectivity and reliability for item 5.10 are particularly concerning and indicate a problematic item. Substantial revision is needed.
Table 8 shows the item analysis on Test n.6 about “European frameworks and laws for the certification of the competences”.
The Difficulty Index (1) is in the range of 0.55 to 1. This means that most items are moderately easy to easy. Three of them can be considered as optimal with values between 0.55 and 0.7. None of them is too difficult, and 12 of them are particularly easy.
Discriminatory Power (2) is in the range of 0 to 0.99. Four of the items can be considered as optimal with values greater than 0.7. Several items have very low discrimination (0.00 to 0.12).
The Selectivity Index (3) is in the range of 0 to 0.62. Six of the items can be considered as optimal with values greater than 0.3. Many items have poor selectivity.
The Reliability Index (4) is in the range of 0 to 0.4. Seven of the items can be considered as optimal with values greater than 0.3. Higher values correspond to items with better selectivity and discrimination.
Similar to other tests, Test n.6 has a tendency towards easy items and poor discrimination for many questions. Items 6.2 and 6.4 are relatively strong, but the test requires significant improvement to be a reliable and valid measure.
Across all six tests generated by the AI system, several common issues emerge.
First, they present overly easy items. Many tests contain a high proportion of items with a Difficulty Index above 0.80 or even 0.90. This indicates that the items are too easy for the test-takers, reducing their ability to differentiate between students with different levels of knowledge.
Second, they allow poor discrimination. A significant number of items exhibit low Discriminatory Power (2). This means they fail to effectively distinguish between high-achieving and low-achieving students.
Third, they show low selectivity and reliability. Many items have low selectivity (3) and reliability indices (4), indicating that they do not correlate well with the overall test score and contribute little to the internal consistency of the test.
The questionnaire results, detailed in Table 9, Table 10 and Table 11, provide a comprehensive view of student opinions on AI-generated tests. Overall, the feedback indicates a generally positive reception of the AI-generated test questions.
Table 9 presents a summary of the individual responses to the questionnaire sections B, C, D, and E, offering a view of student evaluations across various criteria. Notably, the average scores for questions B1 to D4 are generally high, with most averages above 4 on a 5-point scale, indicating that students largely agreed or totally agreed that the questions were of high quality, relevant, and clear. For instance, the average score for C2 (Were the questions worded grammatically correct?) is 4.53, suggesting strong agreement on the grammatical correctness of the questions.
However, there are some areas where students expressed less positive feedback. Questions related to the clarity and correctness of answer options (D1, D2, D3, and D4) received slightly lower average scores compared to the question quality. In particular, D3 (Were the wrong answers plausible but clearly incorrect?) has an average of 3.47, and D4 (Were the answers consistent with the question?) has an average of 2.97. These scores suggest that students found some answer options less clear or consistent, highlighting a potential area for improvement in the AI’s test generation.
Furthermore, the data from section E of the questionnaire in Table 9 show that a few students reported issues with the questions. Specifically, 4 students found cases where more than one answer seemed correct, and 11 students found cases where no answer seemed correct. This indicates that while the majority of students did not encounter these problems (60 and 53 students answered “No”, respectively), a non-negligible minority experienced issues with the correctness of the answers.
Table 10 summarizes students’ comparative evaluations of the AI-generated tests against human-prepared tests. A significant majority (46 out of 64) found the quality of the AI-generated questions to be similar to that of questions prepared by human instructors. Additionally, 16 students rated the AI-generated tests as better than human-prepared tests (9 “slightly better” and 7 “significantly better”). This overall positive comparison suggests that students perceive AI as capable of generating test questions that are at least as good as, if not better than, those created by humans.
Table 11 presents a qualitative analysis of the open-ended responses from question F2, where students provided comments and suggestions. The most frequent comment, given by 48 students, was that each question was “Good as it is”, reinforcing the quantitative data indicating overall satisfaction. However, some students suggested areas for improvement: six mentioned the need for better contextualization of the questions, three suggested improving the formulation of the questions, and six the formulation of the answers. These qualitative insights provide valuable direction for refining AI test generation.
These findings align with broader research trends in the application of AI in education. The emphasis on clear and concise question wording, as highlighted by Krosnick and Presser [28], is crucial in both human- and AI-generated tests to ensure validity and reliability. Research on AI in assessment often points to the potential for AI to create consistent and objective evaluations [33]. The results from this questionnaire support this, with students generally agreeing on the quality and relevance of the AI-generated questions. However, the comments about answer clarity and correctness also echo concerns raised in the literature about the need for careful validation of AI-generated content to avoid errors or ambiguities [34].
While our findings indicate that AI-generated test items hold significant promise, certain items still demonstrate quality issues such as being overly easy or exhibiting low Discriminatory Power. To address this, an iterative revision process is essential. First, items identified through item analysis as too easy or with insufficient discrimination can be revised by increasing their complexity or refining distractors to better differentiate among varying levels of student ability. For instance, revising the wording, adding contextually rich scenarios, or incorporating more plausible yet wrong response options may significantly boost an item’s discriminatory capacity. In parallel, expert review plays a critical role in ensuring the clarity, contextualization, and correctness of both questions and answer keys. By engaging subject-matter experts to perform qualitative evaluations of item wording and content alignment with learning objectives, ambiguous phrasing and factual inaccuracies can be detected and corrected. Iterative testing, where pilot tests are conducted followed by systematic item analysis and expert feedback, can serve as a robust methodology to enhance item quality over successive revisions. Research has shown that such a combined strategy of iterative refinement and expert review leads to more valid and reliable assessment tools [35,36].
In light of the rapid integration of AI systems in assessment creation, it is imperative to address the ethical and practical implications of using AI-generated content in exams. A primary concern is safeguarding academic integrity, as the ease of generating content through tools like Google NotebookLM may inadvertently encourage practices that compromise originality and fairness. To mitigate these risks, institutions should establish clear guidelines that explicitly define acceptable and unacceptable uses of AI in academic assessments. Best practices include mandating iterative expert reviews of AI-generated exam items, incorporating robust verification mechanisms to detect potential misuse, and ensuring that any AI-driven content undergoes thorough validation for accuracy and alignment with course objectives. Additionally, integrated digital literacy programs for both educators and students can reinforce an understanding of the benefits and limitations of AI, ensuring that these tools are used as aids rather than shortcuts. It is also advisable to implement policy measures focused on transparency, accountability, and continuous monitoring of AI outputs in examination settings. Such proactive strategies not only help preserve academic integrity but also foster a responsible culture around the deployment of AI in educational environments [37,38].
In light of the current findings, future research should pursue several promising avenues. First, the development of more robust computational frameworks for automated assessment is essential. Researchers should validate advanced item-generation and calibration techniques using large-scale, longitudinal studies to overcome limitations related to sample diversity and data representativeness [39]. Interdisciplinary approaches that seamlessly integrate innovations from artificial intelligence with established psychometric theories may yield adaptive and transparent assessment systems capable of addressing nuanced variations in learner populations [40].
In addition, the application of explainable AI (XAI) models holds promise for clarifying the underlying reasoning of automated decisions. By opening the “black box” of AI-driven assessments, future studies could foster greater trust and improve pedagogical practices by enabling educators to understand both the strengths and potential biases inherent in these systems [40]. Meanwhile, systematic ethical evaluations and bias-mitigation protocols must become standard practice. Future investigations need to rigorously examine the ethical dimensions of algorithmic decision-making—assessing issues such as fairness, privacy, and accountability—to ensure responsible adoption in educational contexts. Addressing these dimensions will not only refine the technical and pedagogical aspects of automated assessment methods but also support the development of robust guidelines for their ethical implementation in real-world settings [41].

5. Conclusions

Although the literature on AI-driven assessment methods has grown considerably, several limitations persist. Many existing studies feature constrained methodological designs such as small sample sizes and a limited demographic scope. They often give insufficient attention to ethical challenges, including algorithmic bias and data privacy issues.
This research set out to bridge these gaps by systematically evaluating the psychometric properties of AI-generated test items. The objectives were to determine the reliability, Discriminatory Power, and feasibility of such automated assessments as viable complements to traditional methods.
Compared with the broader reference literature, this approach offers significant potential. By integrating rigorous statistical analyses with emerging AI methodologies, the present study advances the field’s understanding of adaptive assessment techniques and lays the groundwork for more equitable, data-driven educational practices.
However, this work also underscores the need for cautious interpretation. The preliminary nature of the data, together with inherent limitations in algorithm design and the challenges of ensuring unbiased outcomes, calls for further research. It is imperative that future studies expand on these findings by employing more diverse data sources and comprehensive ethical reviews, thereby ensuring that technological advances are deployed with full awareness of their limitations and potential societal impact.
Ultimately, while this research contributes valuable insights to the evolving dialogue on AI in education, its implementation must proceed with thoughtful consideration of both technical challenges and ethical imperatives.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the voluntary participation and the absolute anonymity of participants.

Informed Consent Statement

In accordance with the guidelines of the American Psychological Association (APA), participants were asked to give informed consent regarding the nature of the survey and its objectives exclusively for research purposes. Their participation was voluntary and was carried out by guaranteeing confidentiality and anonymity, since the data were collected in digital form without requesting the identity of the participants. Therefore, the data were collected in compliance with the European Regulation on Data Protection (GDPR n.679/2016) since they involve EU citizens anonymously and do not identify the participants irreversibly or in any way.

Data Availability Statement

Acknowledgments

Special thanks go to all the students enrolled in the degree course in “Sciences of Motor Activities, Sports and Psychomotor Education” at the University of Salerno who followed the lessons of “Evaluation and Certification of Competences” and participated in the research described in this paper.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Weng, X.; Xia, Q.; Gu, M.; Rajaram, K.; Chiu, T.K. Assessment and learning outcomes for generative AI in higher education: A scoping review on current research status and trends. Australas. J. Educ. Technol. 2024, 40, 37–55. [Google Scholar] [CrossRef]
  2. Wang, L.; Li, S.; Chen, Y. Early adaption of assessments using generative artificial intelligence and the impact on student learning: A case study. Afr. J. Inter/Multidiscip. Stud. 2024, 6, 1–12. [Google Scholar] [CrossRef]
  3. Mao, J.; Chen, B.; Liu, J.C. Generative Artificial Intelligence in Education and Its Implications for Assessment. TechTrends 2024, 68, 58–66. [Google Scholar] [CrossRef]
  4. Domenici, G. L’intelligenza artificiale generativa per l’innalzamento della qualità dell’istruzione e la fioritura del pensiero critico. Quale contributo? J. Educ. Cult. Psychol. Stud. (ECPS) 2024, 30, 11–22. [Google Scholar] [CrossRef]
  5. Gundu, T. Strategies for e-Assessments in the Era of Generative Artificial Intelligence. Electron. J. E-Learn. 2025, 22, 40–50. [Google Scholar] [CrossRef]
  6. Circi, R.; Hicks, J.; Sikali, E. Automatic item generation: Foundations and machine learning-based approaches for assessments. Front. Educ. 2023, 8, 858273. [Google Scholar] [CrossRef]
  7. Kaldaras, L.; Akaeze, H.O.; Reckase, M.D. Developing valid assessments in the era of generative artificial intelligence. Front. Educ. 2024, 9, 1399377. [Google Scholar] [CrossRef]
  8. Paskova, A.A. Potentials of integrating generative artificial intelligence technologies into formative assessment processes in higher education. Vestn. Majkopskogo Gos. Tehnol. Univ. 2024, 16, 98–109. [Google Scholar] [CrossRef]
  9. Rauh, M.; Marchal, N.; Manzini, A.; Hendricks, L.A.; Comanescu, R.; Akbulut, C.; Stepleton, T.; Mateos-Garcia, J.; Bergman, S.; Kay, J.; et al. Gaps in the Safety Evaluation of Generative AI. Proc. AAAI/ACM Conf. AI Ethics Soc. 2024, 7, 1200–1217. [Google Scholar] [CrossRef]
  10. Grassini, S. Shaping the future of education: Exploring the potential and consequences of AI and ChatGPT in educational settings. Educ. Sci. 2023, 13, 692. [Google Scholar] [CrossRef]
  11. Arslan, B.; Lehman, B.; Tenison, C.; Sparks, J.R.; López, A.A.; Gu, L.; Zapata-Rivera, D. Opportunities and challenges of using generative AI to personalize educational assessment. Front. Artif. Intell. 2024, 7, 1460651. [Google Scholar] [CrossRef] [PubMed]
  12. Zhao, J.; Chapman, E.; Sabet, P.G.P. Generative AI and Educational Assessments: A Systematic Review. Educ. Res. Perspect. 2024, 51, 124–155. [Google Scholar] [CrossRef]
  13. Swiecki, Z.; Khosravi, H.; Chen, G.; Martinez-Maldonado, R.; Lodge, J.M.; Milligan, S.; Selwyn, N.; Gašević, D. Assessment in the age of artificial intelligence. Comput. Educ. Artif. Intell. 2022, 3, 100075. [Google Scholar] [CrossRef]
  14. Salinas-Navarro, D.E.; Vilalta-Perdomo, E.; Michel-Villarreal, R.; Montesinos, L. Designing experiential learning activities with generative artificial intelligence tools for authentic assessment. Interact. Technol. Smart Educ. 2024, 21, 1179. [Google Scholar] [CrossRef]
  15. Barragán, A.J.; Aquino, A.; Enrique, J.M.; Segura, F.; Martínez, M.A.; Andújar, J.M. Evaluación de la inteligencia artificial generativa en el contexto de la automática: Un análisis crítico. Jorn. Automática 2024, 45. [Google Scholar] [CrossRef]
  16. Amugongo, L.M.; Kriebitz, A.; Boch, A.; Lütge, C. Operationalising AI ethics through the agile software development lifecycle: A case study of AI-enabled mobile health applications. AI Ethics 2023, 5, 227–244. [Google Scholar] [CrossRef]
  17. Solanki, P.; Grundy, J.; Hussain, W. Operationalising ethics in artificial intelligence for healthcare: A framework for ai developers. AI Ethics 2022, 3, 223–240. [Google Scholar] [CrossRef]
  18. Hanna, M.G.; Pantanowitz, L.; Jackson, B.; Palmer, O.; Visweswaran, S.; Pantanowitz, J.; Deebajah, M.; Rashidi, H.H. Ethical and Bias Considerations in Artificial Intelligence/Machine Learning. Mod. Pathol. 2025, 38, 100686. [Google Scholar] [CrossRef]
  19. Nguyen, H.; Hayward, J. Applying Generative Artificial Intelligence to Critiquing Science Assessments. J. Sci. Educ. Technol. 2025, 34, 199–214. [Google Scholar] [CrossRef]
  20. Pearce, J.; Chiavaroli, N. Rethinking assessment in response to generative artificial intelligence. Med. Educ. 2023, 57, 889–891. [Google Scholar] [CrossRef]
  21. Pesovski, I.; Santos, R.; Henriques, R.; Trajkovik, V. Generative AI for Customizable Learning Experiences. Sustainability 2024, 16, 3034. [Google Scholar] [CrossRef]
  22. Bron Eager. AI Literature Reviews: Exploring Google’s NotebookLM for Analysing Academic Literature. 2024. Available online: https://broneager.com/ai-literature-review-notebooklm (accessed on 9 May 2025).
  23. Somasundaram, R. Discovering NotebookLM: My AI Companion for Smarter Academic Research. iLovePhD. 2025. Available online: https://www.ilovephd.com/discovering-notebooklm-my-ai-companion-for-smarter-academic-research/ (accessed on 9 May 2025).
  24. Trinchero, R. Item Analysis Manual; FrancoAngeli: Milan, Italy, 2007. [Google Scholar]
  25. Trinchero, R. Building, Evaluating and Certifying Competences; Pearson: London, UK, 2016. [Google Scholar]
  26. Ebel, R.L.; Frisbie, D.A. Essentials of Educational Measurement; Prentice Hall: Saddle River, NJ, USA, 1991. [Google Scholar]
  27. Nunnaly, J.C.; Bernstein, I.H. Psychometric Theory, 3rd ed.; McGraw-Hill: New York, NY, USA, 1994. [Google Scholar]
  28. Krosnick, J.A.; Presser, S. Question and questionnaire design. In Handbook of Survey Research, 2nd ed.; Marsden, P.V., Wright, J.D., Eds.; Emerald Group Publishing Limited: Leeds, UK, 2010; pp. 263–313. [Google Scholar]
  29. Alansari, I. Evaluating user experience on e-learning using the User Experience Questionnaire (UEQ) with additional functional scale. J. Inf. Syst. Inform. 2022, 17, 145–162. [Google Scholar] [CrossRef]
  30. Nakamura, W. TUXEL: A technique for user experience evaluation in e-learning. In Proceedings of the VII Congresso Brasileiro de Informática na Educação (CBIE) 2018, Fortaleza, Brazil, 29 October–1 November 2018. [Google Scholar] [CrossRef]
  31. González-Calatayud, V.; Prendes-Espinosa, P.; Roig-Vila, R. Artificial intelligence for student assessment: A systematic review. Appl. Sci. 2021, 11, 5467. [Google Scholar] [CrossRef]
  32. Topping, K.J.; Gehringer, E.; Khosravi, H.; Gudipati, S.; Jadhav, K.; Susarla, S. Enhancing peer assessment with artificial intelligence. Int. J. Educ. Technol. High. Educ. 2025, 22, 3. [Google Scholar] [CrossRef]
  33. Holmes, W.; Bialik, M.; Fadel, C. Artificial Intelligence in Education; Brookings Institution Press: Washington, DC, USA, 2023. [Google Scholar]
  34. Williamson, D.M.; Mislevy, R.J.; Bejar, I.I. (Eds.) Automated Scoring of Complex Tasks in K-12 to Postsecondary Education: Theory and Practice; Routledge: London, UK, 2012. [Google Scholar]
  35. Gershon, S.K.; Anghel, E.; Alexandron, G. An evaluation of assessment stability in a massive open online course using item response theory. Educ. Inf. Technol. 2024, 29, 2625–2643. [Google Scholar] [CrossRef]
  36. Tran, T.T.T. Enhancing EFL Writing Revision Practices: The Impact of AI- and Teacher-Generated Feedback and Their Sequences. Educ. Sci. 2025, 15, 232. [Google Scholar] [CrossRef]
  37. Bittle, K.; El-Gayar, O. Generative AI and academic integrity in higher education: A systematic review and research agenda. Information 2025, 16, 296. [Google Scholar] [CrossRef]
  38. Gustilo, L.; Ong, E.; Lapinid, M.R. Algorithmically-driven writing and academic integrity: Exploring educators’ practices, perceptions, and policies in the AI era. Int. J. Educ. Integr. 2024, 20, 3. [Google Scholar] [CrossRef]
  39. Smith, A.; Johnson, B. Methodological challenges in automated assessment and the way forward. J. Educ. Meas. 2020, 57, 657–682. [Google Scholar]
  40. Kristóf, T. Development tendencies and turning points of futures studies. Eur. J. Futures Res. 2024, 12, 9. [Google Scholar] [CrossRef]
  41. Tristan, L.; Gottipati, S.; Cheong, M.L.F. Ethical Considerations for Artificial Intelligence in Educational Assessments. In Creative AI Tools and Ethical Implications in Teaching and Learning; Keengwe, J., Ed.; IGI Global: Hershey, PA, USA, 2023; pp. 32–79. [Google Scholar] [CrossRef]
Figure 1. The logic architecture of the developed system.
Table 1. The progression from simple, theoretically driven methods to sophisticated generative systems with multi-functional roles.
Stage | Method/Technology | Example/Use Case
1. Early AIG | Template-based algorithms | Automatic item generation using fixed, algorithmic templates to produce large numbers of test items.
2. Practical Applications | Domain-specific templates | Implementation of AIG in high-stakes and formative assessments (such as large-scale exams, opinion surveys, and classroom tests) to quickly generate items.
3. Emergence of GenAI | Large Language Models (LLMs) | Use of LLMs to automatically generate customizable learning materials, for example, creating multiple-choice questions based on instructor outcomes.
4. Expanded Functionality | Integrative AI-based systems | Deployment of LLMs to support various assessment tasks: test planning, question creation, instruction preparation, scoring, and feedback provision.
Table 2. The questionnaire.
Question | Possible answers
Course of Study | Open text
Course Year | Number
Items B1–D4 are rated on a five-point scale: 1 = Totally disagree; 2 = Disagree; 3 = Neutral; 4 = Agree; 5 = Totally agree.
Section B: General Test Evaluation
B1. Is the overall quality of the test questions high? | Rating scale 1–5
B2. Were the questions relevant to the topics covered in the specified study material? | Rating scale 1–5
B3. Did the test adequately cover the topics it was intended to assess? | Rating scale 1–5
B4. Is the overall difficulty level of the test appropriate to your expected level of preparation? | Rating scale 1–5
Section C: Specific Evaluation of the Questions
C1. Were the questions clearly worded and easy to understand? | Rating scale 1–5
C2. Were the questions grammatically correct? | Rating scale 1–5
C3. Did the questions appear to be correct in content (containing no factual or conceptual errors)? | Rating scale 1–5
C4. Were the questions “fair” (not tricky or based on excessively minor details)? | Rating scale 1–5
Section D: Specific Evaluation of Response Options
The test included multiple-choice questions. Think about the answer options provided and evaluate the following aspects:
D1. Were the answers clearly worded and easy to understand? | Rating scale 1–5
D2. Were the answers grammatically correct? | Rating scale 1–5
D3. Were the wrong answers plausible but clearly incorrect? | Rating scale 1–5
D4. Were the answers consistent with the question? | Rating scale 1–5
Section E: Identifying Specific Problems
This section is very important to help us identify specific problems.
E1. Did you find cases where more than one answer seemed correct? | Yes/No
E2. Did you find any cases where no answer seemed correct? | Yes/No
Section F: Comparison and Final Comments
F1. If you have taken tests on similar topics prepared by instructors before, how would you compare the overall quality of the questions on this AI-generated test to those prepared by humans? | Significantly worse / Slightly worse / Similar / Slightly better / Significantly better / Don’t know / I have no terms of comparison
F2. Do you have any other comments, suggestions, or observations regarding the AI-generated quiz questions or answers that you would like to share? | Open text
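For readers who wish to administer or analyze the same instrument programmatically, the questionnaire in Table 2 maps naturally onto a simple data structure. The sketch below is illustrative only; the field names and the LIKERT_5 constant are choices made here, not part of the original instrument.

```python
# Illustrative encoding of the Table 2 questionnaire (field names are assumptions).
LIKERT_5 = ["Totally disagree", "Disagree", "Neutral", "Agree", "Totally agree"]

QUESTIONNAIRE = [
    {"id": "B1", "section": "B",
     "text": "Is the overall quality of the test questions high?", "answers": LIKERT_5},
    {"id": "B2", "section": "B",
     "text": "Were the questions relevant to the topics covered in the specified study material?",
     "answers": LIKERT_5},
    # ... items B3 through D4 follow the same pattern ...
    {"id": "E1", "section": "E",
     "text": "Did you find cases where more than one answer seemed correct?",
     "answers": ["Yes", "No"]},
    {"id": "E2", "section": "E",
     "text": "Did you find any cases where no answer seemed correct?",
     "answers": ["Yes", "No"]},
    {"id": "F2", "section": "F",
     "text": "Do you have any other comments, suggestions, or observations to share?",
     "answers": "open text"},
]
```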
Table 3. Item analysis on Test n.1 about the “Definitions of knowledge, skills, attitudes, context, and competences”.
Test n.1 | Item 1.1 | Item 1.2 | Item 1.3 | Item 1.4 | Item 1.5 | Item 1.6 | Item 1.7 | Item 1.8 | Item 1.9 | Item 1.10 | Item 1.11 | Item 1.12 | Item 1.13 | Item 1.14 | Item 1.15
Difficulty Index (1) | 0.56 | 0.75 | 0.80 | 0.67 | 0.87 | 0.52 | 0.85 | 0.54 | 0.59 | 0.54 | 0.93 | 0.92 | 0.84 | 0.59 | 0.61
Discriminatory Power (2) | 0.99 | 0.74 | 0.63 | 0.88 | 0.46 | 1.00 | 0.50 | 0.99 | 0.97 | 0.99 | 0.25 | 0.30 | 0.55 | 0.97 | 0.95
Selectivity Index (3) | 0.25 | 0.35 | 0.30 | 0.65 | 0.20 | 0.75 | 0.15 | 0.55 | 0.40 | 0.15 | 0.20 | 0.10 | 0.30 | 0.50 | 0.20
Reliability Index (4) | 0.14 | 0.26 | 0.24 | 0.44 | 0.17 | 0.39 | 0.13 | 0.30 | 0.24 | 0.08 | 0.19 | 0.09 | 0.25 | 0.30 | 0.12
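To make the four indices reported in Tables 3–8 concrete, the sketch below shows how comparable classical test theory statistics can be computed from a 0/1 response matrix. It is an illustration under stated assumptions, not the implementation used in this study: the exact formulas labelled (1)–(4) are defined earlier in the paper, while the variance-based discriminatory power, the 27% upper/lower grouping, and the point-biserial reliability index used here are common textbook conventions adopted only for demonstration.

```python
import numpy as np

def item_analysis(responses: np.ndarray, group_fraction: float = 0.27) -> dict:
    """Classical test theory item statistics for a binary response matrix.

    responses: array of shape (n_students, n_items), 1 = correct, 0 = wrong.
    The conventions below are common textbook choices, assumed for illustration.
    """
    n_students, n_items = responses.shape
    totals = responses.sum(axis=1)            # each student's total score

    # (1) Difficulty index: proportion of students who answered the item correctly.
    p = responses.mean(axis=0)

    # (2) Discriminatory power, assumed here as a variance-based measure 4*p*(1-p),
    #     which is maximal for items of medium difficulty.
    discriminatory_power = 4 * p * (1 - p)

    # (3) Selectivity index, assumed as upper-lower discrimination: the difference in
    #     proportion correct between the top and bottom score groups.
    k = max(1, int(round(group_fraction * n_students)))
    order = np.argsort(totals)
    lower, upper = order[:k], order[-k:]
    selectivity = responses[upper].mean(axis=0) - responses[lower].mean(axis=0)

    # (4) Reliability index, assumed as the item reliability index: point-biserial
    #     correlation with the total score, weighted by the item standard deviation.
    item_sd = np.sqrt(p * (1 - p))
    r_pb = np.array([
        np.corrcoef(responses[:, j], totals)[0, 1] if item_sd[j] > 0 else 0.0
        for j in range(n_items)
    ])
    reliability = r_pb * item_sd

    return {
        "difficulty": p,
        "discriminatory_power": discriminatory_power,
        "selectivity": selectivity,
        "reliability": reliability,
    }

# Example: a 15-item test answered by 64 students (random data for demonstration).
rng = np.random.default_rng(0)
demo = (rng.random((64, 15)) < 0.7).astype(int)
stats = item_analysis(demo)
print(np.round(stats["difficulty"], 2))
```

With such a function, flagging problematic items, for instance those whose reliability index is close to zero as in several columns of Table 6, reduces to a simple threshold check on the returned arrays.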
Table 4. Item analysis on Test n.2 about the “Assessment and evaluation of competences in the sports”.
Test n.2 | Item 2.1 | Item 2.2 | Item 2.3 | Item 2.4 | Item 2.5 | Item 2.6 | Item 2.7 | Item 2.8 | Item 2.9 | Item 2.10 | Item 2.11 | Item 2.12 | Item 2.13 | Item 2.14 | Item 2.15
Difficulty Index (1) | 1.00 | 0.95 | 0.81 | 0.44 | 0.68 | 0.68 | 0.95 | 0.63 | 0.71 | 0.44 | 0.94 | 0.92 | 0.83 | 0.98 | 0.97
Discriminatory Power (2) | 0.00 | 0.18 | 0.62 | 0.99 | 0.87 | 0.87 | 0.18 | 0.93 | 0.82 | 0.99 | 0.24 | 0.29 | 0.58 | 0.06 | 0.12
Selectivity Index (3) | 0.00 | 0.14 | 0.33 | 0.62 | 0.33 | 0.76 | 0.05 | 0.43 | 0.43 | 0.24 | 0.14 | 0.05 | 0.24 | 0.05 | 0.05
Reliability Index (4) | 0.00 | 0.14 | 0.27 | 0.28 | 0.23 | 0.52 | 0.05 | 0.27 | 0.31 | 0.11 | 0.13 | 0.04 | 0.20 | 0.05 | 0.05
Table 5. Item analysis on Test n.3 about the “Assessment and evaluation of competences at school”.
Test n.3 | Item 3.1 | Item 3.2 | Item 3.3 | Item 3.4 | Item 3.5 | Item 3.6 | Item 3.7 | Item 3.8 | Item 3.9 | Item 3.10 | Item 3.11 | Item 3.12 | Item 3.13 | Item 3.14 | Item 3.15
Difficulty Index (1) | 0.90 | 0.93 | 0.99 | 0.38 | 0.87 | 0.33 | 0.83 | 0.67 | 0.88 | 0.99 | 0.78 | 0.67 | 0.97 | 0.80 | 0.86
Discriminatory Power (2) | 0.36 | 0.27 | 0.06 | 0.94 | 0.45 | 0.89 | 0.57 | 0.89 | 0.41 | 0.06 | 0.68 | 0.89 | 0.11 | 0.65 | 0.50
Selectivity Index (3) | 0.13 | 0.13 | 0.04 | 0.48 | 0.22 | 0.39 | 0.17 | 0.48 | 0.17 | 0.04 | 0.30 | 0.43 | 0.04 | 0.26 | 0.26
Reliability Index (4) | 0.12 | 0.12 | 0.04 | 0.18 | 0.19 | 0.13 | 0.14 | 0.32 | 0.15 | 0.04 | 0.24 | 0.29 | 0.04 | 0.21 | 0.22
Table 6. Item analysis on Test n.4 about the “Peer assessment and evaluation”.
Test n.4 | Item 4.1 | Item 4.2 | Item 4.3 | Item 4.4 | Item 4.5 | Item 4.6 | Item 4.7 | Item 4.8 | Item 4.9 | Item 4.10 | Item 4.11 | Item 4.12 | Item 4.13 | Item 4.14 | Item 4.15
Difficulty Index (1) | 0.93 | 0.98 | 0.97 | 0.92 | 0.94 | 0.92 | 0.88 | 0.90 | 0.88 | 0.87 | 0.86 | 0.85 | 0.63 | 0.82 | 0.76
Discriminatory Power (2) | 0.07 | 0.06 | 0.06 | 0.00 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.05 | 0.05 | 0.00
Selectivity Index (3) | 0.15 | 0.00 | 0.00 | 0.10 | 0.00 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.60 | 0.00 | 0.20
Reliability Index (4) | 0.14 | 0.00 | 0.00 | 0.09 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.38 | 0.00 | 0.15
Table 7. Item analysis on Test n.5 about the “Certification of competences at school”.
Test n.5 | Item 5.1 | Item 5.2 | Item 5.3 | Item 5.4 | Item 5.5 | Item 5.6 | Item 5.7 | Item 5.8 | Item 5.9 | Item 5.10 | Item 5.11 | Item 5.12 | Item 5.13 | Item 5.14 | Item 5.15
Difficulty Index (1) | 0.55 | 0.52 | 0.78 | 0.98 | 0.95 | 0.92 | 0.91 | 1.00 | 1.00 | 0.97 | 0.71 | 0.91 | 0.82 | 0.65 | 0.98
Discriminatory Power (2) | 0.99 | 1.00 | 0.68 | 0.06 | 0.18 | 0.28 | 0.34 | 0.00 | 0.00 | 0.12 | 0.83 | 0.34 | 0.60 | 0.91 | 0.06
Selectivity Index (3) | 0.48 | 0.71 | 0.52 | 0.00 | 0.14 | 0.05 | 0.05 | 0.00 | 0.00 | −0.05 | 0.76 | 0.19 | 0.52 | 0.62 | 0.05
Reliability Index (4) | 0.26 | 0.37 | 0.41 | 0.00 | 0.14 | 0.04 | 0.04 | 0.00 | 0.00 | −0.05 | 0.54 | 0.17 | 0.43 | 0.40 | 0.05
Table 8. Item analysis on Test n.6 about the “European frameworks and laws for the certification of the competences”.
Test n.6 | Item 6.1 | Item 6.2 | Item 6.3 | Item 6.4 | Item 6.5 | Item 6.6 | Item 6.7 | Item 6.8 | Item 6.9 | Item 6.10 | Item 6.11 | Item 6.12 | Item 6.13 | Item 6.14 | Item 6.15
Difficulty Index (1) | 0.80 | 0.64 | 0.97 | 0.64 | 0.98 | 0.55 | 0.98 | 0.80 | 0.97 | 0.92 | 1.00 | 0.98 | 0.94 | 0.97 | 0.75
Discriminatory Power (2) | 0.65 | 0.92 | 0.12 | 0.92 | 0.06 | 0.99 | 0.06 | 0.65 | 0.12 | 0.29 | 0.00 | 0.06 | 0.23 | 0.12 | 0.75
Selectivity Index (3) | 0.33 | 0.57 | 0.10 | 0.62 | 0.05 | 0.48 | 0.05 | 0.33 | 0.05 | 0.24 | 0.00 | 0.05 | 0.10 | 0.10 | 0.33
Reliability Index (4) | 0.27 | 0.37 | 0.09 | 0.40 | 0.05 | 0.26 | 0.05 | 0.27 | 0.05 | 0.22 | 0.00 | 0.05 | 0.09 | 0.09 | 0.25
Table 9. Summary of the answers to the sections B, C, D, and E of the questionnaire.
Item | B1 | B2 | B3 | B4 | C1 | C2 | C3 | C4 | D1 | D2 | D3 | D4
Min | 3 | 2 | 3 | 4 | 3 | 4 | 4 | 4 | 3 | 2 | 2 | 1
Max | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 5
Average | 3.19 | 4.28 | 4.02 | 4.56 | 4.03 | 4.53 | 4.44 | 4.50 | 4.05 | 3.48 | 3.47 | 2.97
Item | E1 | E2
Number of Yes answers | 4 | 11
Number of No answers | 60 | 53
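The descriptive summary in Table 9 can be reproduced with a few lines of data wrangling. The sketch below is a minimal example that assumes the questionnaire responses are available as a CSV file with one row per respondent and columns named after the items; the file name and column layout are hypothetical, not the export format actually used in the study.

```python
import pandas as pd

# Hypothetical export of the questionnaire responses: one row per respondent,
# Likert items B1-D4 coded 1-5, and E1/E2 stored as "Yes"/"No" strings.
df = pd.read_csv("questionnaire_responses.csv")

likert_items = ["B1", "B2", "B3", "B4", "C1", "C2", "C3", "C4",
                "D1", "D2", "D3", "D4"]

# Min, max, and mean per item, as reported in Table 9.
summary = df[likert_items].agg(["min", "max", "mean"]).round(2)

# Yes/No counts for the problem-spotting items E1 and E2.
yes_no_counts = df[["E1", "E2"]].apply(lambda col: col.value_counts())

print(summary)
print(yes_no_counts)
```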
Table 10. Answers to the question F1 (“If you have taken tests on similar topics prepared by instructors before, how would you compare the overall quality of the questions on this AI-generated test to those prepared by humans?”) of the questionnaire.
Answer | Count
Significantly worse | 0
Slightly worse | 0
Similar | 46
Slightly better | 9
Significantly better | 7
Don’t know | 2
I have no terms of comparison | 0
Table 11. Answers to the open question F2 (“Do you have any other comments, suggestions, or observations regarding the AI-generated quiz questions or answers that you would like to share?”) of the questionnaire clustered by main issues.
Opinion Cluster | Count
Context missing | 6
The question needs improvement | 3
The answers need improvement | 6
Good as it is | 48
Other | 1