How to Evaluate Augmented Reality Embedded in Lesson Planning in Teacher Education

Abstract: Augmented reality (AR) is vital in education for enhancing learning and motivation through interactive environments and experiments. This requires teacher training in AR creation and integration. Research indicates that learning effectiveness relies on thorough preparation, calling for the development of scoring rubrics for evaluating both educational AR and AR’s educational integration. However, no current studies provide such a rubric for assessing AR’s pedagogical implementation. Hence, a scoring rubric, EVAR (Evaluating Augmented Reality in Education), was developed based on the framework for the analysis and development of augmented reality in science and engineering teaching by Czok and colleagues, and extended with core concepts of instructional design and lesson organization, featuring 18 items in five subscales rated on a four-point Likert scale. To evaluate the validity and reliability of the scoring rubric, AR learning scenarios, designed by eleven master’s seminar pre-service teacher students at the University of Konstanz, majoring in biology, chemistry, or physics, were assessed by five AR experts using the newly developed scoring rubric. The results reveal that a simple classification of AR characteristics is insufficient for evaluating its pedagogical quality in learning scenarios. Instead, the newly developed scoring rubric for evaluating AR in educational settings showed high inter-rater reliability and can discriminate between different groups according to the educational quality of the AR and the implementation of AR into lesson planning.

Research in the realm of experimentation has revealed that the effectiveness of student learning depends on thorough preparation before and follow-up processing after the experimentation. Hence, it is crucial how activities are embedded in the lesson [58,59]. To assess the structured and didactically sound integration of AR into lessons or lesson plans, a scoring rubric is needed to allow for objective, reliable, valid, and test-economical measurement during lesson observations, including mock trials, for example. While research-based frameworks for the analysis and development of augmented reality in science and engineering teaching already exist [32,60], they need to be extended to incorporate aspects of instructional implementation, as they do not cover the embedding of experiments in lessons. However, research on carrying out experiments shows that it is precisely the embedding of experiments in the classroom that is of particular importance for students' learning success. In addition to the categories from Czok et al. (namely adaptivity, interactivity, immersion, congruence with reality, content proximity to reality, game elements, and complexity), aspects of instructional embedding are also needed, such as "frictionless function of the AR", "confidence of the teacher when handling the AR", "simplicity of handling for the learners", "promotion of a learning objective through the AR", "embedding in the lesson", "design laws", and "cognitive load". Overall, these aspects can be divided thematically into the following main categories: "Technical Implementation", "Fit of the AR", "Interactivity and Engagement", "Visualization", and "Creativity and Originality". This leads to the following research questions:

1. Which categorizations, according to Czok et al. [32], can be found in augmented reality embedded in teaching scenarios created by pre-service teacher students in a master's seminar for teacher education?
2. How can the quality of the embedding of augmented reality in teaching be evaluated?
3. To what extent can the deductively derived structuring of the categorizations be mapped to reliable subscales?
4. To what extent does the quality of an AR learning environment determine the overall quality of the lesson planning integrating this AR learning environment?

Sample
Eleven pre-service teacher students (six female, five male; seven biology, four chemistry, and three physics students; multiple answers were possible, as in Germany teacher students select at least two subjects) voluntarily took part in this study during a master's seminar on science education at the University of Konstanz in the summer term of 2023. The participants were divided into six groups of one to two people (the students were allowed to choose their own partner).

Instrument
To assess the characteristics of augmented reality environments, the evaluation criteria proposed by [32] were used to answer research question one. A new scoring rubric to evaluate the use of augmented reality in teaching scenarios was developed to examine the second research question. The rubric is based on core concepts proposed by Czok et al. [32]. Based on the design parameters described there, it was checked for each category which form of lesson embedding is necessary to emphasize this aspect of learning with AR, and then a corresponding item was formulated. This was done to identify the most beneficial ways for teachers to use and embed AR in lessons. For example, in the area of interactivity [32], the contribution that the quality of the AR makes to the lesson was examined. It is not enough for the AR to be able to interact with the learners. Rather, the newly developed rubric should record whether this interactivity is actually being used in a meaningful way. The items were thematically grouped into five dimensions: Technical Implementation, Fit of the AR, Interactivity and Engagement, Visualization, and Creativity and Originality of AR Use. A four-point Likert scale from "1-strongly agree" to "4-strongly disagree" was used. In addition, the option "no answer possible" could be selected. Notes could be added to each item to allow for a complementary qualitative assessment. Further, the rubric was consensually validated by six science education researchers with experience in the development and implementation of augmented reality in teaching and teacher training. The scoring rubric, EVAR (Evaluating Augmented Reality in Education), can be found in Table 1 and downloaded as supplementary material Table S1.
Table 1. The five subscales of the scoring rubric EVAR (Evaluating Augmented Reality in Education) with the corresponding 18 items. The terms teacher and learner are used in the scoring rubric. Teachers refer to those who have created or selected the augmented reality (including pre-service teachers or trainees). Learners here refer to those who act as participants in the teaching scenario (e.g., fellow students in the seminar context or pupils).

Item  Item text
Technical Implementation
1  The AR in the learning scenario operates smoothly and reliably.
2  The teachers are confident in controlling the AR.
3  The handling of the AR is intuitive and simple for the learners.
4  The functionality of the AR is sufficiently described and explained.
5  The tracking method chosen is appropriate for the teaching scenario.
Fit of the AR
6  The AR supports at least one specific learning goal.
7  There is a connection to previous and subsequent teaching sequences.
8  Relevant references to real situations or applications are made.
9  The AR offers clear benefits compared to conventional visualizations.
10  Potential benefits and challenges of AR use for teaching are discussed.
11  The AR helps learners to develop a better understanding of the content.
Interactivity and Engagement
12  The AR encourages learners to actively engage with the subject matter.
13  There are additional possibilities besides viewing the object, e.g., interactivity or individualization.
14  There are feedback mechanisms (analogous or digital) to provide learners with feedback on their use of AR.
Visualization
15  The complexity of the AR (cf. [60]) fits the learning goal addressed (in terms of cognitive load [61]).
16  The design laws [62] are taken into account.
Creativity and Originality
17  The lesson design demonstrates an original and creative use of AR to support the learning process.
18  The AR was created by the teachers themselves.
As part of the expert survey, items 7, 8, and 9 were shortened and simplified by deleting the lead-in phrase before the colon (crossed out in the original): "Integration into the teaching process: There is a connection to previous and subsequent teaching sequences." (I7), "Integration into the course of the lesson: Relevant references to real situations or applications are made." (I8), and "Added Value and educational benefits: The AR offers clear benefits compared to conventional visualizations." (I9). Item 11 was assigned to "Fit of the AR"; before the expert survey, it was thematically assigned to "Visualization". Items 13 and 15 were supplemented with examples or references for precision (extension in brackets): "There are additional possibilities besides viewing the object (e.g., interactivity or individualization)." (I13) and "The complexity of the AR (cf. [62]) fits the learning goal addressed (in terms of cognitive load [63])." (I15). Finally, item 16 was newly added.

Study Design
Six groups of pre-service teachers presented self-created lessons in a 45-min presentation. Parts of the lessons were also carried out with fellow students in a mock trial. The subject areas were specified by the experts to ensure that the topic was fundamentally suitable for the use of AR. Five observing AR experts in the field of science education research participated during the presentation and applied the developed rubric EVAR (live and on site). Subsequently, Czok's questionnaire was applied to the submitted AR materials.

Context
The study was conducted during a master's seminar [63,64] specifically targeted at the development of digital competencies for teaching in science education, and took place in the summer term of 2023 at the University of Konstanz. This seminar was divided into three phases (see Figure 1). In the first part, the basics of teaching with digital media were introduced and practiced in alternating theory and voluntary on-site exercises with a team of tutors following the DiKoLAN framework [65,66]. DiKoLAN is a competency framework that defines seven digital core competency areas that science education students should have acquired by the end of their studies.
DiKoLAN is divided into two sections: general competency areas, encompassing documentation, presentation, communication/collaboration, and information search and evaluation, as well as competency areas specific to the natural sciences, including data acquisition, data processing, and simulation and modeling. In the context of this seminar, particular emphasis was placed on the application of AR as an example of simulation and modeling, as well as the creation of AR content [64]. Here, students acquire essential knowledge about models, their development, and the application of AR.
AR is explained through the lens of seven design parameters, according to [32]. Each parameter has different levels or indicators that enable a comparison between different AR implementations. These parameters are adaptivity, interactivity, immersion, congruence with reality, content proximity to reality, game elements, and complexity. Adaptivity describes the program's ability to adjust to various situations by reacting to activities, events, or changes in situations. Interactivity refers to the intended interaction between the user and the digital media components and includes six levels of interaction. According to [32], immersion is understood as the ability of digital media to influence human senses, and the degree of immersion increases as more senses are engaged. Congruence with reality assesses the plausibility and realism of AR implementations in terms of social and perceptual realism. Content proximity to reality examines the plausibility of AR content regarding causal, spatial, and temporal factors, as well as the tracking method's appropriate use. The incorporation of game elements in education can enhance interactivity and motivation. For this parameter, eight indicators are provided. Lastly, complexity reflects the content-related and cognitive structures of AR functions, whereby achieving a higher level of complexity is associated with a higher demand on the user or more extensive cognitive activity.
In general, the aim of the seminar is to promote the future-oriented and didactically sound use of digital tools in science lessons. Therefore, students are trained to create AR content themselves and analyze suitable tools. Hence, AR is not viewed merely as a technical gimmick but as a powerful tool for future educators, which could change teaching science in school.
In the second part of the seminar, students planned teaching sequences for upper-level classes in groups of two, incorporating AR. The prospective teachers were given predetermined topics from the field of molecular orbital theory, including core chemistry, underlying physics concepts, and biological contexts. The students could choose one of these secondary level II/undergraduate science topics for which AR visualization is promising. It was, therefore, not a matter of investigating whether AR makes sense in general; it can be assumed that the prerequisites for (meaningful) AR were given in principle. Clear guidelines were provided by the instructors after diagnosing the potential of AR. The students had around four weeks to plan a teaching unit, select or develop an AR, and implement it into a teaching unit (i.e., a lesson plan). To enhance the clarity and understanding of the results, the created teaching sequences are presented in Table 2. The educational offerings included lectures, exercises, and DiKoLAN sessions with individual supervision [63]. Additionally, students could deepen their knowledge through self-learning units on the DiBaNa website (DiBaNa: Digitale Basiskompetenzen in den Naturwissenschaften, German for Digital Basic Competencies for Science Teachers), an online platform for acquiring digital teaching competencies [67]. In the third phase, each group of pre-service teachers presented their planned lessons, and a written elaboration was handed in.

Statistical Analysis
For many of the items, a bimodal distribution of responses was expected (separating agreement and disagreement). In addition, all items were positively worded so that, ideally, students would achieve a positive rating with their unit after the training measure. This leads to the expectation of a one-sided distribution. In the evaluation, therefore, statistical measures such as Cohen's kappa [72], Fleiss' kappa [73], or Krippendorff's alpha [74] were not applicable since these are known to be problematic with a one-sided distribution, especially for small sample sizes [75]. Instead, a graphical method was used for the evaluation of the inter-rater reliability, and a frequency map was created with the statistical software R [76]. In order to compare the results of the two parts of the research, the mean value across all seven areas according to [32] was compared with the mean value across all 18 self-generated items for each group using Excel [77]. To check the scale reliability, Guttman's lambda-4 and lambda-6 [78] were calculated with R. Cronbach's alpha [79] was unsuitable since no normal distribution was expected due to the assumed polarization of the responses.
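The two reliability coefficients named above can be illustrated with a short Python sketch. It follows the standard textbook definitions (lambda-4 as the maximum split-half reliability over all two-part splits, lambda-6 from the error variance of each item regressed on the remaining items). The function names are hypothetical, and the sketch is not the R code used in the study, whose exact routine is not stated here.

```python
from itertools import combinations

import numpy as np


def guttman_lambda6(scores):
    """Guttman's lambda-6 for a (rated units x items) score matrix.

    Assumes the item covariance matrix is invertible.
    """
    cov = np.cov(scores, rowvar=False)
    total_var = cov.sum()  # variance of the sum score
    # Residual variance of item j regressed on all other items: 1 / (C^-1)_jj
    error_var = 1.0 / np.diag(np.linalg.inv(cov))
    return 1.0 - error_var.sum() / total_var


def guttman_lambda4(scores):
    """Guttman's lambda-4: best split-half reliability, found by brute force."""
    cov = np.cov(scores, rowvar=False)
    n_items = cov.shape[0]
    total_var = cov.sum()
    best = -np.inf
    for size in range(1, n_items // 2 + 1):
        for half in combinations(range(n_items), size):
            a = np.array(half)
            b = np.setdiff1d(np.arange(n_items), a)
            # Split-half reliability equals 4 * cov(A, B) / var(total)
            cross_cov = cov[np.ix_(a, b)].sum()
            best = max(best, 4.0 * cross_cov / total_var)
    return best
```

The brute-force search is feasible for a rubric of this size (18 items give roughly 130,000 splits), but it grows exponentially with the number of items.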

Characteristics of the AR Used
For each augmented reality presented by the students, a classification of characteristic features was performed based on [32,60]. The results of the seven categories for all six groups were plotted on a spider web plot. As the different categories have different maximum scores, the values reached in each category were normalized. The results are shown in Figure 2.
Comparing the spider web plots shows that the groups differ both in which category is most pronounced and in whether the values are similar across all categories or fluctuate. Only the groups represented by Figure 2c,d show the same values over all categories. As a common feature, all of the featured ARs have low values for game elements (GE) as well as for immersion (Imm). For five out of six, the relative value for congruence with reality (CwR) is about 0.6; two-thirds of the apps have a relative value of 0.4 in terms of content proximity to reality (CPtR), and half of them have the same values for complexity (Comp) and interactivity (Int). In the other categories, the spider webs differ.
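The normalization behind these relative values can be sketched as follows. This is a minimal illustration only: the maximum scores in `MAX_SCORES` are placeholders (only the six interactivity levels and eight game-element indicators are mentioned in the text, and even their mapping to maximum scores is an assumption), and the key "Adapt" is a hypothetical abbreviation.

```python
# Hypothetical maximum score per design parameter. The real maxima come from
# the framework in [32] and are not listed in this text.
MAX_SCORES = {
    "Adapt": 3, "Int": 6, "Imm": 5, "CwR": 5, "CPtR": 5, "GE": 8, "Comp": 4,
}


def normalize_profile(raw_scores, max_scores=MAX_SCORES):
    """Scale each category score to the range 0..1 for a spider web plot."""
    profile = {}
    for category, maximum in max_scores.items():
        value = raw_scores[category]
        if not 0 <= value <= maximum:
            raise ValueError(f"{category}: {value} is outside 0..{maximum}")
        profile[category] = value / maximum
    return profile
```

With these placeholder maxima, a raw score of 3 for congruence with reality would yield the relative value 0.6.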

Evaluation of the Teaching Scenarios including AR
The aim is to verify whether the newly developed scoring rubric can serve as an evaluation basis for the instructional context in which augmented reality was used.For this purpose, the inter-rater reliability is presented below.
For the presentations of groups 3, 5, and 6, the raters rated the majority of items similarly highly (cf. Figure 3). For groups 1 and 4, the scatter is wider, while for group 2, the scatter of the ratings is the widest. The variance in the scatter can be easily recognized in the heat map.
In addition to the significant differences between the groups, it is also striking that for items 4, 10, 13, and 14, the answers are distributed over three to four neighboring scale levels, or there is contrasting checkbox behavior for at least three groups, indicating a low level of agreement among the raters.
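The tally underlying such a frequency map can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the R code used for Figure 3: the function names are hypothetical, and `None` is used here to stand for the option "no answer possible".

```python
from collections import Counter

SCALE = (1, 2, 3, 4)  # 1 = strongly agree ... 4 = strongly disagree


def answer_frequencies(ratings):
    """Count how often each scale level was chosen for one item in one group."""
    counts = Counter(r for r in ratings if r is not None)
    return [counts.get(level, 0) for level in SCALE]


def rating_spread(ratings):
    """Distance between the highest and lowest level used by the raters."""
    used = [r for r in ratings if r is not None]
    return max(used) - min(used) if used else 0
```

A spread of 2 or more means the answers cover at least three scale levels, which corresponds to the low-agreement pattern described for items 4, 10, 13, and 14.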

Reliability of the Rubric and Its Theoretically Derived Subscales
Based on the concepts from Czok et al. [32], the scale was divided thematically into subscales. To check the meaningfulness of this division, Guttman's lambda-4 and lambda-6 were calculated for the main scale and the five dimensions Technical Implementation, Fit of the AR, Interactivity and Engagement, Visualization, and Creativity and Originality. The results are shown in Table 3, together with the absolute and relative frequencies of the individual response options for each item across all groups. For the main scale, a lambda-4 of 0.93 and a lambda-6 of 0.99 were reached.

Relevance of the Quality of an AR Learning Environment for the Overall Quality of the Lesson Planning
For further consideration, the mean value of each group across all seven areas according to [32] was compared with the mean value of each group across all 18 items of the developed rubric. The results are shown in Figure 4. A slight correlation can be seen between achieving a high score according to [32] and a better evaluation of the teaching embedding. Around 19.4% of the variance in the evaluation of lesson planning can be explained by the quality of the integrated AR learning environment.
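A share of explained variance of this kind is commonly obtained as the squared Pearson correlation between the two sets of group means. The sketch below assumes that approach; how the 19.4% was actually computed is not stated here, and the function name is hypothetical.

```python
import numpy as np


def variance_explained(x, y):
    """Squared Pearson correlation (r^2) between two sets of group means."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2
```

Under this assumption, a value of about 0.194 corresponds to a correlation of roughly 0.44 in absolute terms. Because low EVAR values express agreement, a positive relationship between AR quality and lesson quality would appear as a negative correlation between the raw means.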

Characteristics of the AR Used
As a common feature, it can be seen that low immersion values are often achieved. Furthermore, all ARs contain only a few or even no game elements. Apart from that, no specific preferred trend can be identified for the other areas. This clearly reflects the seminar content. On the one hand, the seminar addressed only visual immersion; the immersion of further senses was not treated. Likewise, game elements played no role in the design of the seminar unit on AR and therefore are not found in the environments designed by the students. On the other hand, no specific training was conducted with the aim of achieving particularly high scale levels in the other categories, which explains the varying values for these categories.
The AR of group 5 stands out with high or higher item ratings in all areas. This can be attributed to the fact that this group used a very comprehensive augmented reality application, leARnCHEM [80], developed at the University of Toronto, that already included many features, e.g., for individualization and different levels of complexity. In contrast, the ARs of groups 1 to 4 focused on a specific problem that was needed in the teaching setting being worked on. For this, no more functions than necessary were included.

Evaluation of the Teaching Scenarios including AR
To verify that the rubric created is a suitable way to evaluate AR in an instructional context, the inter-rater reliability was first examined. Given the relatively high number of five raters across the six groups, a high level of agreement was achieved for many of the items (see Figure 3). The different results for the different groups seem to allow a conclusion to be drawn about the technical quality of the AR used. While the AR worked very well for groups 3, 5, and 6, the realization was technically rather challenging for group 2. Nonetheless, group 2 delivered a very convincing and thoughtful instructional concept for the use of their AR in the classroom. Further research is needed to confirm this assumption. In addition, the students' challenges in the technical realization of AR seemed to make the ratings more ambiguous. This has to be addressed by making the raters aware of the issue and finding ways to deal with it. It may also be beneficial to provide advanced training to experts who offer advice, helping them determine whether to prioritize the technical implementation or the quality of the idea, particularly in cases where a good idea is compromised by poor technical execution.
Summarizing the results of the frequency map (Figure 3), items 4 ("The functionality of the AR is sufficiently described and explained."), 10 ("Potential benefits and challenges of AR use for teaching are discussed."), 13 ("There are additional possibilities besides viewing the object, for example, interactivity or individualization"), and 14 ("Feedback mechanisms (analogous or digital) are in place to provide learners with feedback on their use of AR.") were assessed differently by the raters. One way to counteract this would be a more detailed coding guide to train the rating experts.

Reliability of the Rubric and Its Theoretically Derived Subscales
The calculated lambda-6 values for the Technical Implementation and Fit of the AR subscales are between 0.8 and 0.9 and thus show that the theoretically derived subscales hold up well. A still acceptable value is achieved for the Interactivity and Engagement subscale. It is only the combination of items 15 and 16 on the Visualization subscale that cannot be confirmed by the calculation of Guttman's lambda. However, when looking at the cross-tabulation of the answers given, this combination cannot be refuted either (see Table 4).

Relevance of the Quality of an AR Learning Environment for the Overall Quality of the Lesson Planning
Even though a slight connection was found between the two ratings, the evaluation according to [32] and the evaluation with the instrument EVAR, it is clear that evaluating design options according to [32] is an important factor but not a reliable indicator of good teaching embedding. This clearly requires a specific rubric for evaluating lesson embedding. In groups 3 and 4, for example, which were assessed in the same way according to [32], the assessment of AR in teaching embedding differs noticeably. This shows why a pure classification based on different characteristics is not sufficient for the evaluation of AR in a teaching context: neither the occurrence of many different characteristics nor a focus on a few characteristics is better or worse without a teaching context.

Limitations
This study focuses on the validation of the rubric. Therefore, based on the available data, no statement can be made about the occurrence of certain characteristics of the population due to the small sample. A larger sample is necessary for this.
Considering the agreement between the raters, a subjective interpretation cannot be ruled out. The evaluation of the selected AR application was carried out individually during each presentation and the introduction of each lesson and was not revised after the end of each presentation. Therefore, finding a consensus among the raters can be ruled out, as the raters did not discuss the material.
The rubric was developed in order to evaluate the use of AR in the classroom for given topics that were classified by a team of experts as beneficial for the use of AR. Therefore, no items can be found that assess whether the selected topics are suitable for the use of AR at all.

Conclusions
EVAR fills a gap by giving teacher educators a tool that can be used to evaluate a teaching sequence incorporating augmented reality. It offers the possibility of assessing the features of an AR rather than just recording them. The rubric is useful when selecting an AR to determine whether it is conducive to the teaching purpose. By guiding the questions that the teacher has to ask about the AR, it makes the selection of the AR easier. The rubric was further developed to provide teachers with an instrument that they can use to make a reflective decision as to whether creating a tool is worthwhile or whether an already-created tool or another alternative can fulfil the same learning objective. For example, it assesses whether the additional work involved in creating the tool could be worthwhile because it reduces the workload elsewhere, whether the desired learning objective can be achieved, or whether another alternative can achieve the same learning objective with less effort. It was shown that around 20% of the variance in the evaluation of the lesson planning can be explained by the quality of the augmented reality. There is therefore an opportunity to carry out further investigations with a larger sample, for example, to determine the influence of individual design parameters according to Czok et al. on the individual subscales of EVAR.

Declaration of AI and AI-Assisted Technologies in the Writing Process
During the preparation of this work, the authors used DeepL (www.deepl.com, accessed on 29 January 2024) and Grammarly (www.grammarly.com, accessed on 29 January 2024) in order to improve the readability and language of single sentences. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Figure 1. The structure of the seminar with its three phases: initial lecture phase, second project phase, and final presentation.

Figure 3. The frequency of the given answers broken down by answer options 1 ("I agree absolutely") to 4 ("I do not agree at all") across all 18 items.

Figure 4. The comparison of the assessment from the two research parts. In the evaluation according to [32], 0 stands for no points achieved and 1 for maximum points achieved in all seven areas. In the rubric presented here, 1 stands for "I totally agree" and 4 for "I totally disagree".

Table 2. Topics of teaching sequences, learning goals, and AR implementation by group.

Table 3. The counted answers for each item across all groups as absolute values and as relative values. The last columns contain the calculated values for lambda-4 and lambda-6 for each subscale.

Table 4. Comparison of the answers for items 15 and 16.