1. Introduction
Recent advances in generative artificial intelligence (AI), powered by large language models, present opportunities and challenges for assessment in higher education. AI is now widely used across sectors including health, industry, and research (
McKinsey, 2024;
Sun et al., 2023), and is permanently reshaping the nature of academic tasks. In educational settings, AI has already shown potential to support learning by providing personalised feedback, scaffolding writing processes, and automating routine tasks (
Kasneci et al., 2023;
Strielkowski et al., 2025). Interest in the role of AI in education has accelerated rapidly in recent years (
Strielkowski et al., 2025), with growing attention being paid to its implications for assessment and feedback practices (e.g.,
Henderson et al., 2025;
Usher, 2025). In this study, we extend this literature by evaluating a novel assessment design that contrasts different modalities of AI use, providing new insight into how AI can be critically and ethically integrated into higher education assessment. Our participatory methodology is transferable to other educational contexts, and we provide practical resources to support educators in adapting this approach.
Initial studies suggest that, while students may benefit from AI-enhanced feedback, overreliance on these tools may undermine opportunities for deep learning and critical engagement (
Kasneci et al., 2023;
Suriano et al., 2025;
Zawacki-Richter et al., 2019;
Zhai et al., 2024). The integration of generative AI in education also presents challenges. Equity concerns persist, including unequal access to reliable AI tools and the digital skills needed to use them meaningfully (
UNESCO, 2024). Academic integrity is also at risk, as AI can be used to ‘cheat’ in ways that evade detection (
Yusuf et al., 2024). Moreover, the use of AI complicates traditional concepts of authorship and scholarship, raising questions about what constitutes independent academic work (
Kulkarni et al., 2024;
Luo, 2024). There are also concerns that critical thinking, a key goal of higher education, could be weakened if students accept AI outputs without careful evaluation (
Bittle & El-Gayar, 2025).
In response, there is growing recognition of the need to build critical AI literacy among students and staff. This means not just knowing how to use AI tools, but understanding how they work, the wider impacts they have, and how to assess AI-generated content carefully and ethically (
Abdelghani et al., 2023). Developing critical AI literacy is needed to prepare students to be thoughtful, responsible users of AI, and should be built into teaching and assessment strategies.
The overarching aim of this study is to improve the critical AI literacy of postgraduate students and teaching staff through the co-design and evaluation of an AI-integrated written coursework assessment that contrasts different AI modalities. In this assessment, students used generative AI tools to draft a blog critically summarising an empirical research article and produced a reflective, critical commentary on the AI-generated content. Specifically, we asked two research questions:
Is the AI-integrated assessment acceptable and feasible for students and teaching staff?
Can teaching staff distinguish between assessments completed in accordance with the brief and those generated entirely by AI?
Our findings were used to develop practical guidance and a toolkit for educators, support the implementation and iterative improvement of AI-integrated assessments, and contribute to the wider pedagogical literature on assessment in higher education.
2. Materials and Methods
This study uses a participatory evaluation approach. Participatory evaluation involves contributors not just as participants, but as co-designers and co-evaluators (
Fetterman et al., 2017), and has been used previously to explore AI-related resources and curriculum development (
Cousin, 2006;
Teodorowski et al., 2023). A strength of this approach is its emphasis on different forms of expertise, including lived experience, disciplinary knowledge, and teaching practice, which contribute to the development of assessments that are both grounded and relevant.
The protocol is available at OSF Registries: doi.org/10.17605/OSF.IO/JQPCE. We used the Guidance for Reporting Involvement of Patients and the Public short form (GRIPP2-SF) checklist to report involvement in the study (reported in
Table S1 in the Supplementary Materials;
Staniszewska et al., 2017). This study was approved by the Research Ethics Panel of King’s College London (LRS/DP-23/24-42387; 27 June 2024).
This study involved twelve participants from the 2023–24 cohort of a postgraduate course at the Institute of Psychiatry, Psychology, and Neuroscience, King’s College London, a Russell Group university in the United Kingdom. The student cohort comprised approximately 30 individuals. Most were in their early twenties and had entered the MSc programme directly after completing their undergraduate studies, with around one in six being mature students returning to education after spending time in the workforce. A very small number were men. Approximately one-third were UK home students, while two-thirds were international, the majority of whom were from East Asia.
Eight students and four members of the teaching team took part in the study. The teaching staff included a Teaching Fellow, a Lecturer, and two Research Associates. All participants had recently completed or marked a summative assessment within the course. We considered the sample size to be adequate given the small cohort, the participatory nature of the research, and the principle of information power, which suggests that the more relevant information the sample holds, the smaller the sample size needed (
Malterud et al., 2016). In this study, participants were well positioned to inform the evaluation, having first-hand experience with the assessment and its development. They brought a range of expertise and experience with AI, from high digital literacy to limited prior use, as well as strengths in academic writing and assessment design. This ensured that the participatory methods supported shared ownership, practical relevance, and opportunities for innovation.
2.1. Stage 1: AI-Integrated Assessment
The research team collaborated with other members of the course’s teaching team to adapt an existing summative assessment already embedded in the curriculum. This assessment required students to write a blog post summarising and critically appraising an empirical research article on mental health. Framed as an authentic assessment, the task included the potential for selected blogs to be published on science communication platforms.
We used the Transforming Assessment in Higher Education framework developed by AdvanceHE to guide our approach to integrating generative AI tools into this assessment (
Healey & Healey, 2019). The framework highlights the need for assessments that are authentic, inclusive, and aligned with learning outcomes, emphasising the importance of involving students in the assessment development process. This emphasis aligned with our approach of integrating AI tools to reflect real-world practices and to develop critical AI literacy.
Under the revised assessment approach, students were asked to use two AI tools to assist with drafting a blog based on an empirical article. The written assessment consisted of three components:
Two AI-generated blog drafts, one produced with each of the two AI tools.
A final blog that combined the strongest elements of the AI outputs with the student’s own revisions and original contributions, assessed for the accurate and critical appraisal of the empirical article.
A commentary critically reflecting on the AI-generated content and explaining the rationale for revisions made, assessed for the depth of critical and ethical reflection.
The marking matrix was revised to retain the use of a standard critical appraisal checklist for assessing students’ understanding of the empirical article, alongside the programme-wide marking framework (stage 2). New criteria were introduced to evaluate students’ critical engagement with AI-generated content (stage 3). The adapted format built on the existing learning outcome of critically appraising empirical research, extending it to assess students’ ability to reflect on the role of AI in academic work, apply subject knowledge to evaluate AI outputs, and make informed editorial decisions.
2.2. Stage 2: Assessment Trial
All participants were invited to take part in a trial of the adapted assessment. They first attended a workshop designed to support students in their AI-assisted assessment. Microsoft Copilot, in both balanced and precise modes, was the mandated generative AI tool, selected for its free availability for the participants (ensuring equitable access) and to allow for direct comparisons between model outputs. While Copilot was used in this instance, the assessment was designed to be transferable to other AI tools.
The workshop was delivered in four stages. The first introduced Copilot’s core functions, including its strengths, limitations, and examples of effective prompt writing. In the second stage, students practised drafting prompts and used the AI models to generate and revise a mock blog post. The final two stages drew on Gibbs’ Reflective Cycle to guide structured learning (
Gibbs, 1988). In stage three, students critically appraised an AI-generated blog and compared the outputs produced by the two Copilot modes. This exercise supported a deeper understanding and analysis of AI-generated content. In the final stage, students reflected on their use of AI and developed an action plan for how they would apply AI tools in future academic work. This reflection aimed to consolidate learning and promote ethical, informed use of generative AI tools.
Feedback on the workshop was collected through qualitative discussions at the end of the session and a short survey. The survey included a Likert-scale question assessing whether the workshop would help students complete the assessment (responses: yes, somewhat, no) and two free-text questions: “What did you learn from the workshop?” and “What was missing from the workshop that would help you feel more prepared for the pilot assessment?”
Student participants were then randomly allocated to one of two groups. Those in the ‘compliant’ group were instructed to follow the coursework brief precisely, using the designated AI tools as directed. Students in the ‘unrestricted’ group were given freedom to complete the assessment by any means, including generating the entire submission using AI tools. They were encouraged to be creative and to push the boundaries of the process. Teaching staff participants were asked to mark the submitted assessments and provide written feedback to students using the adapted marking matrix. They also indicated whether they believed the student had completed the assessment as instructed (i.e., was in the ‘compliant’ group) or had been in the unrestricted group.
2.3. Stage 3: Evaluation
Participants engaged in an iterative process of reviewing and refining the assessment materials, including the workshop content, coursework brief, and marking matrix.
To explore the feasibility, acceptability, and perceived integrity of the AI-integrated assessment approach, we conducted a series of semi-structured focus groups with students and individual interviews with teaching staff. This format was chosen to accommodate participant preferences and availability, while also helping to reduce power imbalances by providing students with a peer-supported setting in which to reflect on an assessment co-designed with researchers who were also their course instructors.
The discussion guides are reported in
Supplement SB in the Supplementary Materials and at the Open Science Framework project: osf.io/ctewk/. They were developed to address the study’s research aims and to capture experiences across both groups regarding their engagement with generative AI in the context of assessments. Both focus groups and interviews lasted from approximately 45 to 60 min and were structured in two parts: the first explored participants’ existing knowledge of generative AI and their experiences of completing or marking the assessment; the second addressed their reflections on the assessment design and its potential for future implementation. In addition, we explored perceptions of ‘cheating’ in the assessment, including whether students in the compliant and unrestricted groups felt they had met the intended learning outcomes and whether staff felt able to distinguish between the two groups. Particular attention was paid to whether the approach supported intended learning outcomes and provided a fair measure of student performance.
We also asked questions about the initial training workshop as part of the interviews. This feedback was reviewed alongside data from the survey questions completed by participants after the workshop and was used to revise and improve the training content.
Focus groups and interviews were conducted via Microsoft Teams. Thematic analysis was led by one researcher (AFM), following
Braun and Clarke’s (
2006) approach, including familiarisation with the data, initial coding, theme identification, and iterative theme refinement. Analyses were performed separately for students and teaching staff. Emerging themes were reviewed and refined through discussion within the research team and with participants who took part in subsequent workshops.
In addition to qualitative comparisons, we conducted a statistical analysis to compare how successful markers were at identifying assessments written by students in the compliant and unrestricted groups. Given the small sample size and expected cell counts below five, we used Fisher’s exact test rather than the chi-square approximation (
Howell, 2011), calculated using base R (
R Core Team, 2025).
We held co-design workshops with students and teaching staff to further refine the assessment brief and marking matrix, respectively. The think-aloud technique was used (
Charters, 2003;
Someren et al., 1994), whereby each section of the assessment materials was reviewed in turn. Participants took part in a facilitated group discussion, voicing their thoughts, suggestions, and reactions in real-time as they engaged with the materials. Data saturation was considered to have occurred when no further substantial changes were proposed by the participants. Two workshops were held with students and one with teaching staff, which likely reflects the fact that more extensive feedback had already been gathered from teaching staff during earlier individual interviews and incorporated into the materials prior to the workshops.
Feedback gathered during these sessions was used to inform revisions to the assessment materials. We documented this process using a Table of Changes (ToC) from the Person-Based Approach, applying the MoSCoW method, a prioritisation framework used to collaboratively decide which features, changes, or recommendations should be implemented (must, should, could, would like) (
Bradbury et al., 2018). We also used a Custom GPT built on GPT-4-turbo, which allows for the creation of a personalised version of ChatGPT-4 tailored to specific tasks or knowledge domains, to review the final materials for accessibility and readability.
3. Results
3.1. Assessment Materials and Learning Outcomes
Feedback on the co-designed assessment materials produced using the Custom GPT indicated that, while both the assessment brief and marking matrix were generally well-structured and aligned with learning outcomes, several refinements could improve readability and accessibility. These included ensuring consistency of language and tone, using bullet points and clearer formatting to support navigation, and clarifying instructions around AI tool use and submission structure. Minor revisions were recommended to the learning outcomes and the reflection criteria to enhance alignment with marking expectations.
The final versions of the assessment materials (workshop proformas, assessment brief, and marking matrix) and the amendments recommended by ChatGPT are included in
Supplement SC in the Supplementary Materials and at the Open Science Framework project: osf.io/ctewk/.
Feedback on the learning outcomes was generally positive or neutral, with no negative responses offered. Students and teaching staff appreciated the inclusion of learning outcomes and found them helpful for understanding the purpose of the assessment. Some suggested making the link between the learning outcomes and the specific assessment tasks more explicit to improve alignment and clarify expectations. The major change that emerged from all feedback sources was the need to communicate that critical appraisal of the original empirical article is as important as the appraisal of the ability of AI to generate seemingly useful content. One student noted that engaging with the AI output highlighted inaccuracies, such as fabricated participant details, which prompted them to critically verify the content against the original source. This process, while demanding, was seen as intellectually valuable: “It forces you to actually figure out whether you’re critically appraising the critical appraisal.” (S5) Another contributor reflected on the need to distinguish between assessing AI literacy and assessing critical thinking (S2), suggesting that the learning objectives should clearly indicate which of these skills is being prioritised. This feedback informed revisions to the assessment brief and the learning outcomes.
Our revised learning outcomes became the following:
Critical appraisal: Students will demonstrate the ability to critically appraise academic content by the following:
Evaluating an empirical research article using an established critical appraisal checklist.
Assessing the accuracy, relevance, and limitations of AI-generated content in relation to the original empirical article.
Comparing outputs from different AI tools, identifying their strengths and weaknesses in academic content generation.
Generative AI literacy: Students will develop foundational AI literacy by using generative tools to support scientific blog writing. They will demonstrate an understanding of AI’s capabilities and limitations, including the ability to identify common errors such as fabrication or hallucination.
Editorial and reflective judgement: Students will apply editorial judgement to revise AI-generated content, integrating critical analysis and original insight. They will reflect on their use of AI tools and articulate the rationale for content modifications in alignment with accuracy, academic standards, and ethical considerations.
3.2. Feasibility, Acceptability, and Integrity
Table 5 and
Table 6 present summaries of the key findings and illustrative quotes from the thematic analysis of the focus groups and interviews.
Student feedback highlighted that, while AI tools could streamline aspects of the writing process, they did not reduce workload due to the effort required to refine outputs. Perceptions of feasibility, acceptability, and integrity varied, with students valuing the opportunity to build critical thinking skills, but also expressing concerns about fairness, skill development, and, notably, ownership of their work. Some considered equitable access and thoughtful integration of AI to be particularly important for maintaining academic standards. Teaching staff found the assessment structure clear, although marking was initially time-intensive because of the dual task of evaluating both AI and student contributions. Efficiency improved with familiarity, and staff recognised the assessment’s potential to support critical engagement. While challenges remained in distinguishing AI-generated from student-authored content, most staff endorsed transparent and pedagogically grounded use of AI in academic settings.
Students in the unrestricted group found that using AI to complete the entire assessment was challenging, with outputs, particularly the reflective commentary, requiring substantial oversight and correction. Some spent a similar amount of time on the task as those in the compliant group, while others felt they used somewhat less. Most felt they had achieved the intended learning outcomes due to the time spent checking, appraising, and reflecting on the AI-generated content.
Assessment marks ranged from 35 (fail) to 78 (distinction). For most assessments, marks from different markers fell within a ten-point range, but for one assessment, scores ranged more widely (from 58 to 78). Markers correctly identified 6 out of 14 students in the compliant group (42.9%) compared to 3 out of 6 in the unrestricted group (50.0%). Fisher’s exact test produced an odds ratio of 0.75, p = 1.00, indicating that marker accuracy did not differ meaningfully between the groups. Markers’ views on identifying students in the unrestricted condition were polarised: some reported having no clear sense, while others felt very confident that they could recognise AI-generated submissions. However, these subjective impressions were not reflected in their actual ability to accurately distinguish between the groups.
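To illustrate the analysis described in Section 2.3, this comparison can be reproduced in a few lines of base R. The sketch below is ours, not the authors’ script: it assumes the marker judgements are arranged as a 2 × 2 table of group (compliant, unrestricted) by identification outcome (correct, incorrect), using the counts reported above, and the object name is illustrative.
```r
# Minimal sketch of the Fisher's exact test on marker identification accuracy,
# assuming the counts reported above: 6 of 14 compliant and 3 of 6 unrestricted
# submissions were identified correctly by markers.
identification <- matrix(
  c(6, 8,    # compliant group: identified correctly, incorrectly
    3, 3),   # unrestricted group: identified correctly, incorrectly
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("compliant", "unrestricted"),
                  marker_judgement = c("correct", "incorrect"))
)
fisher.test(identification)
# Sample (cross-product) odds ratio: (6/8) / (3/3) = 0.75; two-sided p = 1.00.
# Note that fisher.test() reports a conditional maximum-likelihood estimate of the
# odds ratio, which can differ slightly from the simple cross-product ratio.
```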
4. Discussion
The findings from this study add to the nascent body of literature that highlights the dual role of AI-integrated assessments as tools for digital literacy and as mechanisms for reflective, critical pedagogy. The blog format provided a unique opportunity for students to practise public-facing, accessible academic writing, aligning with real-world expectations in science communication. The pilot findings show that students found the approach to be feasible and helpful for developing critical skills, although engaging with AI outputs was perceived to increase workload. Teaching staff initially found marking more demanding and had limited success distinguishing unrestricted AI-generated content, but valued the assessment’s potential to promote ethical and critical AI use.
A key success of the project was the development of students’ critical AI literacy, with findings suggesting that the blog assessment promoted active engagement with AI outputs. Students were required to critique AI-generated content, identify inaccuracies, and justify their editorial decisions. This process appeared to encourage deeper critical engagement and helped students to view AI as a tool requiring human oversight rather than as a source of ready-made answers. However, some students may have used AI to support parts of the evaluative process itself, for example, by prompting AI to critique its own outputs, blurring the boundary between human and AI intervention. This challenge is prompting the development of pedagogical tools to support deeper engagement with AI content, including a revised version of Bloom’s Taxonomy (
Gonsalves, 2024). In our study, students in the unrestricted group reported limited success when attempting to outsource critical reflection and revision entirely to AI, noting that human oversight remained essential to complete the task successfully. This supports
Gonsalves’ (
2024) observation of AI as a cocreator, where students collaboratively refined, challenged, and integrated output. Nonetheless, the timing and degree of human input will vary between students, highlighting the need for structured scaffolding to support meaningful engagement with AI whilst safeguarding academic skill development.
The requirement to compare outputs across different AI models also supported the development of critical evaluation skills, as students reflected on the variability and limitations of AI-generated content. Importantly, these findings address concerns raised during the qualitative evaluation and reflect issues highlighted in previous research, such as the risk that overreliance on AI could undermine opportunities for deep learning and reflective practice (
Kasneci et al., 2023;
Larson et al., 2024;
Zawacki-Richter et al., 2019). These findings align with recommendations that AI in education must go beyond functional skills to include AI literacy, as well as active learning skills and metacognition (
Abdelghani et al., 2023).
Beyond promoting critical engagement with AI outputs, this study also highlights strategies for maintaining assessment integrity and supporting academic skill development. Teaching staff expressed concerns that AI use could make it harder to distinguish original work from AI-generated content. This echoes broader challenges in the literature, where AI use may complicate traditional definitions of scholarship and independent academic work (
Kulkarni et al., 2024;
Luo, 2024;
Yusuf et al., 2024). Although markers were sometimes confident, their accuracy in distinguishing AI-reliant submissions from those compliant with the assessment instructions was poor. This is likely because students in the unrestricted group generally described a similar editorial process to those in the compliant group. Nevertheless, the assessment’s structure, which required critical appraisal of the empirical article, critique of AI outputs, evidence of revision, and transparency, may have helped to mitigate these risks, although this needs further testing.
By embedding critical evaluation and editorial judgement, the assessment addressed concerns that AI could weaken core academic skills such as critical thinking and reflective analysis (
Bittle & El-Gayar, 2025). One key challenge identified by participants was that the focus on evaluating AI-generated content risked overshadowing the critical appraisal of the empirical article itself. In response, the final co-produced brief more clearly separated and emphasised both components and better balanced the dual aims of the task. Maintaining this balance will be essential in future implementations to ensure that the assessment remains both authentic and educationally robust. Students also recognised that genuine engagement, not uncritical acceptance of AI outputs, was needed to meet the learning outcomes. However, the extent to which students internalised critical evaluation versus simply complying with task requirements remains unclear. Future studies could explore students’ metacognitive strategies and critical reasoning during AI use through longitudinal or think-aloud methodologies (
Charters, 2003;
Someren et al., 1994). Overall, the findings suggest that carefully designed AI-integrated assessments can uphold academic integrity while supporting the development of essential academic competencies.
Involving students and teaching staff in the co-design and evaluation process was central to developing an assessment that was authentic, feasible, and acceptable. The participatory approach drew on academic, pedagogical, and lived experience to shape the teaching workshop and assessment materials, helping us spot practical challenges early and promote shared ownership of the development of the assessment (
Fetterman et al., 2017;
Teodorowski et al., 2023). This aligns with broader calls for more inclusive, responsive, and transparent innovation in educational assessment (
Bovill et al., 2016;
Healey & Healey, 2019). However, participatory approaches also carry limitations, including potential power imbalances between participants and researchers, risks of tokenism, and the possibility of over-relying on stakeholder input to the detriment of expert judgement. Future research should continue to embed participatory evaluation while remaining mindful of these challenges to ensure AI-integrated assessment remains student-centred and pedagogically sound.
Several limitations of this study should be acknowledged. First, students were not involved in the initial design phase of the assessment, falling short of authentic co-production (
Cook-Sather et al., 2014). Although this was partly mitigated through later participatory evaluation, involving students earlier could have strengthened the creativity, relevance, and ethical responsiveness of the assessment. Second, qualitative feedback was collected following the initial pilot rather than after a full module-wide rollout. As such, findings may reflect early impressions rather than longer-term engagement. However, this timing allowed for immediate adjustments and iterative revisions of the assessment materials. Third, this study was conducted within a single institutional setting with a small cohort and a consequently small sample size, which limits the generalisability of the evaluation findings to other universities or international contexts with different AI access, policies, and pedagogical cultures. However, this study did not aim for statistical generalisability, but rather sought to explore feasibility and acceptability in context, using participatory methods grounded in information power (
Malterud et al., 2016). Our broader goal was to model a co-design and evaluation approach that is transferable and could be adapted to different educational settings. The resulting assessment toolkit supports wider applications, helping educators adapt AI-integrated assessments to their own institutional and disciplinary contexts.