Evidence-Based Assessment of Student Performance in Virtual Worlds

Virtual Worlds (VWs) are popular tools for teaching and learning in the twenty-first century classroom. The challenge remains, however, to provide the means by which teachers can sustainably analyse and assess the performance of large groups of students in such environments. Unfortunately, external game features such as game scores and play duration have turned out to be unfair in some assessments. In this context, a case study was carried out in a foreign language course, illustrating how teachers could easily retrieve a number of performance indicators from VW interaction logs and harness them to conduct a fine-grained analysis of students' performance, while at the same time obtaining valuable tools for their assessment. Objective performance indicators in a server database were made accessible using an end-user development programming language. In this way, a range of data visualisation methods could be employed to contrast different assumptions regarding learner performance when playing a VW-based game designed to help CEFR A1 level students learn German. As a result, factors such as the randomisation of game tasks, which could negatively affect learner performance, were alleviated.


Introduction
The use of Virtual Worlds (VWs) and VW-based games in education is not new. This is evidenced by the large body of literature on VW-based teaching and learning experiences [1][2][3]. However, despite extensive data indicating that VWs are effective teaching/learning tools, the use of VWs for educational purposes has not been as widespread as might be expected [4]. This may be due to high development costs, technological barriers and administrative hurdles, as well as the inherent complexities of monitoring and analysing the learning process, especially when the number of students increases [5,6].
The idea of using VW-based games for learning is twofold: on the one hand, to make the learning process more enjoyable and, on the other, to provide learners with interactive environments which offer them opportunities to practice and foster different skills from their respective area of study. In terms of teaching, the main challenge remains, however, to provide the means by which teachers could sustainably analyse students' interaction with the game environment and use this information for assessing students' performance in the targeted skills [7]. In this context, the current study aims to answer the following research question: Can VW interaction logs help teachers to sustainably conduct a fine-grained analysis of students' performance and facilitate their assessment in VW-based game environments?
The study is based on the assumption that the information retrieved from VW interaction logs could help analyse learner performance and offer teachers, even those with few programming skills and large groups of students, valuable information on students' performance. This paper illustrates how teachers can easily retrieve refined information from interaction logs by employing an end-user development (EUD) programming language and use that information, presented visually, to support teaching and learning as well as assessment processes. To this end, a case study was conducted in a CEFR A1 level German language course during which students played a VW-based game.

State of the Art
In recent years, many experts have recognised the enormous educational and motivational potential of video games and VWs [8][9][10][11]; yet, to date, there are few empirical studies directly exploring their impact on learning processes [12][13][14].
The interaction that takes place in VWs aimed at learning a target language, both among students and between students and virtual agents, favours the development of reading and discourse management skills [15]. Peterson applied massively multiplayer online role-playing games (MMORPGs) [15] and the 3D multiuser virtual environment (MUVE) Second Life [16], respectively, to language learning. Students' feedback suggested that the benefits of using these tools included valuable opportunities to practice the target language and learn new vocabulary while enjoying the learning process.
Both off-the-shelf and tailor-made games, as well as most VWs, are restricted by private software licences; hence, in general, they allow teachers neither to create or tailor them to specific learning needs and curricular requirements nor to trace and analyse learner interaction [5,17].
Despite the widespread trend to use external learning assessment procedures [7,18] to measure the learning impact of game-based learning experiences, conventional criteria (pre- and post-tests) as well as other more recent completion criteria (e.g., teacher observation, the game levels played) entail some important limitations, since information about the game experience itself is barely considered [12]. In-game assessment, by contrast, can provide more detailed information for analysing and assessing students' learning process, even though it is more difficult for non-technical staff to implement [19].
Some previous studies [20,21] underlined the need for a deeper analysis of learner interaction with a view to streamlining the assessment of learner performance. In a research study presented by Hsiao, Lan, Kao and Li [22], student records were analysed to examine the relationship between learning outcomes and learning paths and strategies within a VW for learning Mandarin Chinese. The analysis was done by using the Second Life learning database (SLLDB), a computer tool that was developed by the authors for the purposes of their research study to record students' interaction. Although this proposal does not allow teachers to adapt the analysis of students' interaction to the specific needs of their subject, it does provide an initial effort to analyse large amounts of language learning data and their relationship with the learning outcomes. Other works [23] have studied the visualisation of students' progress when learning through game-based environments as a way of tracking students' interaction and measuring their activity in VW-based learning environments. Students' interaction is depicted by means of online graphs, illustrating their activity based on indicators such as voice and text chat communications as well as user sessions. Beyond regular graphs, Minović, Milovanović, Šošević and González [24] have used coloured circular views to visualise time-related data describing students' progress on a number of key concepts. Other types of circular views, such as semantic spiral timelines, have been used to explore the use of virtual learning environments across time [25]. Moreover, some authors have proposed the use of dendrogram representations which show how VW student communities can be visually identified after clustering avatar positions [26].
While the previously mentioned studies provided computer experts with some valuable techniques for analysing and assessing students' in-game performance, the current work intends to make in-game assessment available to a broader audience, without requiring specific programming skills, by using an EUD programming language. The EUD programming language is implemented as a Domain-Specific Language (DSL). A DSL is a programming language or specification designed to solve a specific problem. Its use is spreading, as it provides users with a programming language that is customised to the user's domain [17,27]. To illustrate how an EUD programming language can be used in the area of foreign language learning, the authors have carried out a study with a group of undergraduate students from a German language course playing a VW-based game.

Methodology
For this study, the authors have used the design and creation research strategy defined by Oates [28], which focuses on developing information technology (IT) artefacts such as constructs, models, methods, etc. in order to solve complex problems. In order to test our initial assumption that accessing students' interaction logs when playing VW-based games could help teachers (in this case foreign language teachers) conduct a customised fine-grained analysis of students' performance, we propose an evidence-based method. This method aims to provide teachers with the means by which they could easily access and analyse students' interaction logs in a sustainable way (that is, independently of the number of students to be assessed).

Evidence-Based Method
In this paper, we propose the use of a method that has previously been applied in other virtual environments to assess students' performance through their interaction with both a wiki environment [29] and a learning management system [30,31].
The evidence-based method implies working in different refinement phases until an objective and valid indicator is found that could help teachers assess their students' performance when completing a given learning task. The proposed method and its phases are described in detail below:

1. Instruction: first, students need to receive instructions on how to play the game in order to complete the given learning task.

2. Task performance: next, students are asked to play the game.

3. Assessment: in order to monitor and assess students' performance, during or after the game, the teacher needs to retrieve different indicators from the VW regarding specific student skills. These indicators can be retrieved from the VW interaction logs and later be refined with the help of the following evidence-based method (Figure 1):
(a) First assessment proposal: the teacher defines an initial proposal to assess students' performance with regard to a specific skill, using for this purpose the information stored in the VW interaction logs.
(b) Submission: the assessment proposal is coded in a Virtual World Query Language (VWQL) query. A VWQL query specifies a request for specific information from the VW interaction logs. After being coded, the query must be submitted to EvalSim, the system that interprets VWQL queries [20].
(c) Data collection: the system provides the data requested from the logs.
(d) Analysis: the teacher analyses the data in order to decide whether they are valid indicators to assess the targeted skill/s.
(e) Validated: the process ends once the data obtained are valid indicators for assessing the targeted skill/s.
(f) Refinement: in case the data are not valid or need to be refined, the entire process is repeated, and a new query is launched.
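
The assessment phase described above is essentially a loop that ends when an indicator is accepted. The following Python sketch is only an illustration of that control flow under stated assumptions: `run_query` and `is_valid` are hypothetical stand-ins for submitting a query to EvalSim and for the teacher's own judgement, and neither is part of VWQL or EvalSim.

```python
from typing import Callable, Iterable, Optional, Tuple


def refine_indicator(
    proposals: Iterable[str],
    run_query: Callable[[str], dict],
    is_valid: Callable[[dict], bool],
) -> Optional[Tuple[str, dict]]:
    """Try assessment proposals in order until one yields a valid indicator.

    run_query (hypothetical) submits a coded proposal and returns the data.
    is_valid (hypothetical) encodes the teacher's analysis of that data.
    """
    for query in proposals:
        data = run_query(query)   # submission and data collection
        if is_valid(data):        # analysis
            return query, data    # validated: the process ends
        # otherwise: refinement, i.e. move on to the next proposal
    return None                   # no proposal was accepted
```

In practice the list of proposals would not be fixed in advance, since each refinement is informed by the analysis of the previous result; the sketch only makes the stop condition explicit.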

Implementation
To implement the evidence-based method, two IT artefacts are needed: firstly, a VW-based game in which students are required to interact while developing the targeted skill/s and secondly, an EUD programming language, implemented as a DSL, to provide teachers with a language that helps them with coding the assessment proposals.

Virtual World
For the current study, a tailor-made competitive two-player VW-based game called GEFE (German Expert and Fast waitEr), designed in line with specific learning needs and curricular requirements for CEFR A1-level German language learners, was used. Since key items of the students' language curriculum include basic vocabulary and structures needed to perform daily tasks in the target language (e.g., ordering in a restaurant or cafeteria, buying in a supermarket, etc.), the focus here was on providing students with the opportunity to practice those items by means of a competitive role-play game. The game recreates an outdoor cafeteria scenario, using simple artificial intelligence algorithms to randomly place a total of 12 virtual client bots all over the playing field. Additionally, the algorithm assigns one waiter avatar to each player. Play starts with the two waiter avatars lined up along the counter. Each waiter avatar receives a call for service from one of the client bots. To attend a client, the waiter avatar must first approach the table of the respective client bot and then ask him/her for the items he/she wants to order. Interaction between the waiter avatar and the client bot takes place via text chat and in the target language. Once the order has been taken, the waiter avatar must go back to the counter and prepare the order (each order consists of 3 items: a beverage, a dish and a dessert) by identifying and clicking on the image of the different items. The waiter avatar must then return to the client's table and deliver the order (Figures 2 and 3). Once the waiter avatar has delivered the order, the student is provided with immediate feedback in the form of points. Points are calculated as follows: for each item of the order that has been correctly delivered, the student is awarded 1 point, with 3 points being the maximum for each client.
Once a waiter avatar has delivered the order, he/she receives a new call for service from another client bot. Thus, if a student makes a mistake when delivering items, the maximum number of points he/she can get in the game is reduced. Although the server initially assigns the same number of client bots (12 bots) to each waiter avatar, the play round finishes as soon as one of the two waiter avatars has attended all client bots assigned to him/her. Due to random task assignment, each game session differs from the previous one. Because client bots are assigned at random and are placed at varying distances across the playing field, the time needed to complete a given game task varies from one game session to the next.
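
The scoring rule can be made concrete with a minimal Python sketch. It assumes that an order is represented as a mapping from the three item types to item names; this representation and the German item names are hypothetical and are not taken from the GEFE implementation.

```python
# The three item types that make up every order in the game.
SLOTS = ("beverage", "dish", "dessert")


def order_points(ordered: dict, delivered: dict) -> int:
    """One point per item type delivered as ordered (0 to 3 per client)."""
    return sum(1 for slot in SLOTS if delivered.get(slot) == ordered[slot])


def game_score(orders: list) -> int:
    """Total score over a list of (ordered, delivered) pairs."""
    return sum(order_points(o, d) for o, d in orders)


# Hypothetical example: the dish is wrong, so the client yields 2 points.
ordered = {"beverage": "Kaffee", "dish": "Suppe", "dessert": "Eis"}
delivered = {"beverage": "Kaffee", "dish": "Salat", "dessert": "Eis"}
points = order_points(ordered, delivered)
```

Since each attended client contributes at most 3 points, a student who attends n clients can obtain at most 3n points, which is why every wrongly delivered item lowers the score still attainable in the session.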
To ensure that student logs are stored correctly despite possible server or internet connection problems, player movement traces are registered throughout the game by recording player coordinates at regular intervals.
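
A movement trace of this kind can be turned into a distance covered by summing the straight-line segments between consecutive samples. The sketch below is an assumption-laden illustration: the actual log format and sampling interval are not described here, so the trace is taken to be a list of hypothetical (x, y) positions in metres.

```python
import math


def path_length(trace: list) -> float:
    """Sum of straight-line distances between consecutive (x, y) samples."""
    return sum(math.dist(p, q) for p, q in zip(trace, trace[1:]))


# Hypothetical trace: 5 m along the first segment, then 6 m along the second.
trace = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
covered = path_length(trace)
```

With regular sampling this underestimates the true path whenever the avatar turns between two samples, so the result is best read as a lower bound on the distance covered.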

Domain Specific Language
A Domain Specific Language (DSL) is a programming language which allows an expert to formalise and represent specific knowledge in a particular field or topic, which in this case is foreign language learning assessment.
To code the assessment proposals, the DSL used for the analysis was VWQL (available at https://bitbucket.org/RaulGS/vwql/). VWQL was developed by the authors in the context of previous works in order to retrieve information on German language learners' performance by analysing their interaction when learning through VW-based game environments developed with OpenSim [20]. VWQL queries are submitted to EvalSim, which processes them and generates reports with the requested data. The technical development documentation and the user manual are available in [32].
The information made available through VWQL comprised, firstly, the words and sentences (complex sentences and one-word sentences) employed by the learners while interacting via text chat; secondly, the turns and time students needed to accomplish the given game task; and thirdly, the number of clients attended and points obtained. Below, we can see the reserved words and syntax of VWQL.
Moreover, VWQL makes it possible to obtain two types of information: a VW tracking map for each game session and a general data report. A VW tracking map is a visualisation method that allows teachers to visually analyse students' movements when playing VW-based games. This process is complemented by a general data report, i.e., a set of files containing students' movement traces and interaction logs. Additionally, a spreadsheet program can be used to process files, analyse data and generate charts and graphs.

Evaluation
The evaluation carried out for the current study was done with 16 undergraduate students from a CEFR A1-level German language course, in collaboration with the students' language teacher, who is one of the authors of the study. All participants had previously been enrolled in a 2-semester German language course at a Spanish university (6 ECTS/semester) for a total of 96 classroom contact hours and 204 independent learning hours.
The experiment was organised in two sessions: the first session aimed to familiarise students with the VW-based game environment; the second required students to play the game by means of waiter avatars and to interact with a number of client bots using the target language (German) as the only vehicle for communication. Each game session lasted a maximum of 25 minutes. Table 1 shows the different game sessions and the students who participated in each session. With a view to analysing and assessing students' language performance when playing the game, the course teacher intended to look at two aspects: students' vocabulary knowledge and students' interaction in the target language.

Vocabulary Knowledge
In order to assess students' vocabulary knowledge, the teacher first analysed their game performance in terms of game scores. This procedure was based on the language teacher's assumption that the more clients a student attended and the more orders he/she was able to deliver successfully, the better his/her vocabulary knowledge. Consequently, the more items he/she was able to deliver, the higher the game score he/she obtained.

Interaction in the Target Language
Second, the teacher assessed students' interaction by analysing the time a student spent interacting with the client bots and performing the given game task, together with the game score (points) he/she obtained. This procedure was based on the assumption that the less time a student needed to perform the given task correctly, the better his/her ability to interact and negotiate effectively in the target language.

Results
The purpose of the following section is twofold: firstly, to show the assessment phase (Figure 1) that was carried out by the teacher in order to evaluate the aforementioned language skills, and secondly, to discuss the suitability of the evidence-based method for helping teachers assess different fine-grained aspects of students' language learning process.

Vocabulary Knowledge: Analysis and Validation
In the following, the steps taken to assess students' vocabulary knowledge are described.

First Assessment Proposal
Students' vocabulary knowledge was analysed by their game performance in terms of game scores.
• Submission: the assessment proposal is coded through the VWQL query shown below.
Evidence students_scores: get students show points.
• Results: the system delivers various reports and figures with the data requested. Figure 4 illustrates the game scores (points) obtained by each student. Game scores range from 0 to 34, with a standard deviation (SD) of more than 10.
• Analysis: regarding the assessment proposal, the data illustrate, on the one hand, that two students (Stud8 and Stud12) performed significantly better than the rest of the participating students (Stud8 obtained 32 points and Stud12 obtained 33 points, while the rest of the students obtained 22 points or fewer) and, on the other hand, that six (37.5%) of the participating students (Stud5, Stud6, Stud9, Stud10, Stud13, Stud14) obtained only 6 points or fewer, that is, a score at least five times lower than that obtained by Stud8 and Stud12.
• Validated: although the obtained report provides valuable information to help teachers assess students' performance in terms of vocabulary knowledge, the teacher became aware that she needed more detailed information on students' performance in order to validate the retrieved assessment indicator. For instance, if a student failed several client orders but attended many more clients than another student, the first one would probably obtain a higher score than the second one, who delivered his/her orders more precisely. Thus, the proposed indicator should be refined by also considering the number of orders that have been delivered incorrectly.
• Refinement: to refine the game score indicator the teacher will need to analyse the number of orders each student delivered successfully. To obtain this new indicator, the teacher will need the points obtained by each student as well as the number of clients attended.

Second Assessment Proposal
Students' vocabulary knowledge will be analysed by their game performance in terms of points/clients.
• Submission: the assessment proposal is coded through the VWQL query shown below.
Evidence students_points_per_clients: get students show points, clients.
• Results: the values obtained for the points/client ratio range from 0 points, when the student failed to deliver any of the items in the order, to a maximum of 3 points, when the student successfully delivered all 3 items to the client (Table 2).
• Analysis: with regard to the assessment proposal, the data from Table 2 illustrate that three students (Stud2, Stud8 and Stud15) completed the given game task very precisely, delivering all their clients' orders correctly. Nevertheless, student scores depend not only on the success of each student's deliveries, but also on the time they played, which could differ in each game session. Therefore, it is not only the score that has to be considered, but also the number of clients that the student was able to serve, which is necessarily related to the time the student played. The authors assume that some of the students might be more reflective learners and thus would have needed more time to perform the same game task and to obtain a higher game score. Finally, to determine whether a student had accomplished the game task satisfactorily, the teacher could set a score threshold by establishing the points (ratio per order) a student would need in order to be considered a learner with strong vocabulary knowledge. Such a score threshold could establish, for instance, that obtaining 2.60 points (see Stud1 and Stud11) or more (see Stud12) would be an indicator of strong vocabulary knowledge, regardless of the fact that Stud1 and Stud11 attended fewer than half of the clients Stud12 was able to attend in the same time. Although four students (Stud5, Stud6, Stud9 and Stud13) also completed all of their deliveries accurately, they are considered outliers for having served two clients or fewer.
• Validated: from the teacher's point of view, the information on the points/client ratio provides a valid indicator with regard to students' vocabulary knowledge, since she was able to determine whether her students had successfully acquired the targeted vocabulary from the course syllabus. In fact, the data show that all learners except one (Stud10) obtained a good points/client ratio (in the range of 2 to 3). Although the teacher used the points per order indicator to obtain a more precise assessment of students' vocabulary knowledge, she considered it necessary to additionally establish a minimum number of clients attended. This was based on analysing the performance of Stud5, Stud6, Stud9, Stud10, Stud13 and Stud14, since these students attended 2 or fewer clients. The data therefore suggest that the points per order indicator alone might not be sufficiently valid for the assessment of these students, since they attended less than 17 percent of the total number of clients.
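
The points/client indicator, the score threshold and the minimum number of clients can be combined in a few lines. The following Python sketch is illustrative only: the function names are hypothetical, the threshold of 2.60 is the example value mentioned in the analysis, and the minimum of 3 clients is an assumption derived from treating students with 2 or fewer clients as outliers. The three records use point totals quoted in the text; the client counts are inferred from them (for instance, 18 points with every item correct corresponds to 6 clients).

```python
def points_per_client(points: int, clients: int) -> float:
    """Average points per attended client (each client is worth at most 3)."""
    if clients <= 0:
        raise ValueError("no clients attended")
    return points / clients


def vocabulary_verdict(points, clients, threshold=2.6, min_clients=3):
    """Classify a student; threshold and min_clients are assumed values."""
    if clients < min_clients:
        return "too few clients"
    ratio = points_per_client(points, clients)
    return "strong" if ratio >= threshold else "below threshold"


# (points, clients) pairs; client counts are inferred, not taken from Table 2.
records = {"Stud15": (18, 6), "Stud16": (10, 5), "Stud6": (6, 2)}
verdicts = {s: vocabulary_verdict(p, c) for s, (p, c) in records.items()}
```

Keeping the two conditions separate mirrors the teacher's reasoning: a perfect ratio obtained over only two clients is reported as insufficient evidence rather than as a strong result.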

Interaction in the Target Language: Analysis and Validation
In the following, the authors describe the steps that were taken to assess students' interaction skills in the target language.

First Assessment Proposal
Students' interaction in the target language will be assessed by analysing the relationship between students' game scores and playing time. This means that the average score per minute (points/minute ratio) could help assess students' performance in terms of effectiveness, providing a valuable indicator for evaluating students' ability to interact in the target language.
• Submission: the assessment proposal is coded through the VWQL query shown below.
Evidence students_points_per_minute: get students show points, time.
• Results: the system delivers several reports and figures, providing detailed information on the game scores and play duration of each student. Since play duration is shown in minutes, a third indicator, named points/min ratio, was added using spreadsheet software (Table 3). The results from Table 3 indicate that values for the points/minute ratio range from 0.50 to 3.68.
• Analysis: with regard to the assessment proposal, the data illustrate that two of the participating students (Stud12 and Stud8) performed significantly better than the rest of the students: while Stud12 obtained a points/minute ratio of 3.78, Stud8 obtained 3.67. Additionally, a look at students' game scores highlights that Stud8 and Stud12 are also the students with the best results. In order to compare both indicators, a ranking of students based on points/minute and game score is shown in Table 4. A comparison of both indicators shows that only in the case of Stud6 could the assessment differ significantly depending on which indicator is considered. In fact, Stud6 obtained very few points (6 points) since he/she played for less time than the rest of the top-ranked students (Table 3). Nonetheless, Stud6 performed well in the given game task, delivering all client orders correctly, which explains why Stud6 obtained a relatively high points/minute ratio, similar to those of the top-ranked students (Stud12, Stud8, Stud7, Stud2 and Stud4). However, the points obtained by students who played for significantly less time than the rest of the group had to be considered carefully: while Stud6's ratio was extremely high, those of Stud9 and Stud10 were really poor, and only Stud5 obtained an average result. Therefore, such measures need further contrast before they can be considered on the same terms as those of the rest of the players.
Table 4. Student ranking position based on points/minute and on game score.

Student   Ranking position based   Ranking position based
          on points/minute         on game score
Stud12    1st                      1st
Stud8     2nd                      2nd
Stud7     3rd                      3rd
Stud6     3rd                      11th
Stud2     5th                      4th
Stud4     6th                      5th
Stud15    7th                      6th
Stud3     8th                      7th
Stud5     9th                      14th
Stud11    10th                     8th
Stud1     10th                     8th
Stud16    12th                     10th
Stud9     13th                     14th
Stud13    14th                     11th
Stud14    15th                     13th
Stud10    16th                     16th
• Validated: from the teacher's point of view, the points per minute ratio provides a valid indicator for assessing students' ability to interact in the target language, since it makes it possible to gather information on students' efficiency when performing the given game task. This means that, given a minimum of 4 min played, the points per minute ratio indicator will show the time each student needed to correctly deliver his/her clients' orders.
• Refinement: despite accepting the validity of the points per minute ratio indicator, the teacher decided to consider another external factor which could have influenced students' game performance and could therefore explain some of the observed differences between student players in terms of efficiency. Due to random task assignment, each game session differed from the previous one, hence requiring a different effort from each player in order to attend client orders and to complete the given game task successfully.
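
The points/minute ratio and a ranking in which tied students share a position (as in the tied 3rd and 10th places of Table 4) can be sketched as follows. This is a minimal Python illustration, not the spreadsheet procedure used in the study; the function names and the four sample players are hypothetical, chosen so that two of them tie.

```python
def points_per_minute(points: float, minutes: float) -> float:
    """Game score divided by play duration in minutes."""
    if minutes <= 0:
        raise ValueError("play time must be positive")
    return points / minutes


def competition_rank(values: dict) -> dict:
    """Rank from best (1) downwards; ties share a rank, e.g. 1, 2, 2, 4."""
    ordered = sorted(values.values(), reverse=True)
    # The first index of a value equals the number of strictly better values.
    return {name: ordered.index(v) + 1 for name, v in values.items()}


# Hypothetical (points, minutes) pairs; B and C end up with the same ratio.
sample = {"A": (30, 10.0), "B": (20, 8.0), "C": (10, 4.0), "D": (4, 4.0)}
ratios = {n: points_per_minute(p, m) for n, (p, m) in sample.items()}
ranks = competition_rank(ratios)
```

Ties are detected by exact equality, so ratios that are rounded for display (for example to two decimals) should be rounded before ranking if they are meant to tie.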

Second Assessment Proposal
Students' interaction in the target language will be analysed by considering the potential relation between students' points per minute ratio and the effort required to accomplish the given task. For the purposes of this study, effort does not refer to student initiative but is synonymous with the distance a student had to cover in order to attend his/her client orders. Note that clients are randomly placed by the system and thus each student has to cover a different distance to attend the same number of clients.
• Submission: the assessment proposal is coded through the VWQL query shown below.
Evidence students_distance_clients_points: get students show distance, clients, points.
• Results: the system delivers several reports and figures, providing detailed information on the distance covered by each student player pair. A look at Figures 5-7 shows how the retrieved information can be illustrated by means of a VW tracking map, providing some interesting insights into students' performance, based on student players' movement and the distance covered.
• Analysis: a look at the different VW tracking maps (Figures 5-7) reveals that not all students were required to cover the same distance in order to attend to their clients. This implies that some students had to make more of an effort than others to complete the same game task. Additionally, in terms of assessment, it means that conventional performance assessment criteria such as game score alone are clearly insufficient. For instance, the data in Figure 5, which show the interaction and distance covered by Stud4 (Waiter A) and Stud3 (Waiter B), clearly indicate that both students faced a similar challenge in terms of distance and hence of the effort required to deliver their clients' orders. In this case, the student who performed best (Stud4), having a higher game score, clearly showed a better knowledge of the target language than the one who performed worse (Stud3). Nonetheless, Figure 6 illustrates that in the game session played by Stud7 and Stud8, Stud8 (Waiter B) needed to cover a much longer distance (146.08 metres and thus 30 per cent longer) compared to Stud7 (Waiter A) in order to attend two fewer clients. In fact, while Stud7 (Waiter A) attended 11 clients, Stud8 (Waiter B) attended only 9 clients. Hence, considering game score alone would not be a fair way to assess students' performance. A third case is finally illustrated in Figure 7. The data from Figure 7 show that Stud15 (Waiter A) attended, in the same amount of time, one more client than Stud16 (Waiter B), hence being more efficient. Regardless of the fact that Stud15 (Waiter A) needed to cover a greater distance and thus make a greater effort, he/she was able to attend much better to his/her clients' orders than Stud16 (Waiter B). A look at students' game scores confirms the authors' assumption, clearly indicating that Stud15 has a stronger knowledge of the target language compared to Stud16: while Stud15 (Waiter A) obtained 18 points (a ratio of 100% regarding the delivered items), Stud16 (Waiter B) obtained only 10 points (resulting from an average ratio of 66%). After analysing the different VW tracking maps, we tried to identify a general group behaviour based on normalised data (Figure 8). While the blue line indicates the normalised clients (i.e., the clients attended by the student divided by the maximum number of clients attended by a student in the group), the red line indicates the normalised distance (i.e., the metres run by the student divided by the maximum metres run by a student in the group). Students are sorted according to the normalised clients.
• Validated: from the teacher's point of view, the indicator for the number of clients attended complements the previous indicator (distance covered) when assessing students' interaction in the target language, since both indicators highlight the effort students had to make in order to accomplish the given task. Moreover, the data from Figure 8 show that the more clients a student attended, the greater the effort he/she was required to make (i.e., distance covered). [...] Table 2), after having played the game for 3 min (see Table 3), so their interaction was not long enough to be taken into account.
On the other hand, Stud11 showed a significant interaction with the VW, while his/her performance was average: he/she attended 5 orders in the same time in which his/her game partner (Stud12) was able to attend 12 orders, so he/she probably needed to approach his/her clients several times to make sure that his/her orders were delivered correctly.
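
The normalisation used for the group view in Figure 8 (each value divided by the maximum value observed in the group) can be sketched in Python as follows. The function name is hypothetical, and the example uses only the client counts quoted in the text for four of the sixteen students, so it does not reproduce the figure.

```python
def normalise(values: dict) -> dict:
    """Divide every value by the group maximum, giving numbers in [0, 1]."""
    peak = max(values.values())
    if peak == 0:
        raise ValueError("cannot normalise when the maximum is zero")
    return {name: v / peak for name, v in values.items()}


# Clients attended, as quoted in the text for these four students.
clients = {"Stud7": 11, "Stud8": 9, "Stud11": 5, "Stud12": 12}
normalised_clients = normalise(clients)

# Sorting by normalised clients, as done for the students in Figure 8.
order = sorted(normalised_clients, key=normalised_clients.get)
```

The same function can be applied to the distances covered, after which both series share a common 0 to 1 scale and can be drawn on the same axis.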

Discussions
From the language teachers' point of view, the implications of being able to assess students by using activity records extracted from VWs and an EUD programming language that allows them to customise their assessments are the following.

1.
It allows language teachers to sustainably assess VW-based learning experiences by focusing not only on the final results according to the scores achieved by each learner, but also on how each individual learner interacted with the game. By extracting the records of students' activity while playing the game, it is possible to take into account how they performed with regard to other indicators. For example, if learners have to complete a task (deliver orders) in a given time and the teacher focuses only on the scores obtained, learners who are less skilled at playing video games [15] might score even lower by having a 100% success compared to other learners who have a 50% success and have moved much faster through the game attending more clients and delivering more orders.

2. It allows language teachers to design VW-based learning experiences with all the actions the learner has to take in mind [12]. For example, the study found that a learner who had to run a longer distance than another learner in order to attend his/her clients was at a disadvantage. Being able to extract this information allows the teacher, firstly, to take this feature into account for a fairer assessment and, secondly, to consider it when redesigning the game, so that running a longer distance to complete a given task does not put a learner at a disadvantage when competing with learners who are not required to run the same distance.

3. Thanks to the use of an EUD programming language, the teacher can develop his or her own evaluation criteria rather than depending on VWs and video games which, although they may provide some indicators based on students' activity records, are usually tailor-made in line with the interests of those (teacher or computer specialist) who created the game [22]. The VW-based game (GEFE) implemented for the current study illustrates this: if the game had provided only each student's score and the time spent on the given learning task, it would not have been possible to refine the indicators by taking into account aspects such as the number of orders delivered correctly or the distance run, which later proved relevant for the current study.
Finally, it is important to note that the indicators used in this study to assess the students' language skills (i.e., vocabulary knowledge and interaction in the target language) have been endorsed by the language teacher who participated in the current study and who refined them on the basis of her personal experience as well as her observation of students' game performance. Taking into consideration the results of the current study, the authors consider that the added value of tools such as the one discussed in this paper lies in the fact that it is the language teacher himself/herself who will be able to use, adapt, discard or redesign his/her indicators through the use of the EUD programming language.
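The contrast drawn in point 1 above, between a raw score and a success ratio, can be made concrete with a small sketch. It is written in Python rather than in the EUD programming language used in the study; the `SessionRecord` fields, the helper names and the sample numbers are assumptions chosen for illustration, and the game's actual scoring rule is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class SessionRecord:
    """Hypothetical per-student summary extracted from an interaction log."""
    student: str
    items_delivered: int   # all items the student handed to clients
    items_correct: int     # items that matched the client's order
    minutes_played: float


def success_ratio(rec: SessionRecord) -> float:
    """Share of delivered items that were correct (0.0 if nothing delivered)."""
    if rec.items_delivered == 0:
        return 0.0
    return rec.items_correct / rec.items_delivered


def correct_per_minute(rec: SessionRecord) -> float:
    """Correct deliveries per minute played; a speed-dependent indicator."""
    if rec.minutes_played <= 0:
        return 0.0
    return rec.items_correct / rec.minutes_played


# Illustrative values only: a careful but slow player and a fast but error-prone one.
careful = SessionRecord("StudX", items_delivered=6, items_correct=6, minutes_played=10.0)
fast = SessionRecord("StudY", items_delivered=16, items_correct=8, minutes_played=10.0)

for rec in (careful, fast):
    print(
        rec.student,
        rec.items_correct,
        f"{success_ratio(rec):.0%}",
        f"{correct_per_minute(rec):.1f}",
    )
```

In this made-up case the faster learner collects more correct deliveries although only half of them are right, which is the kind of imbalance a score-only assessment would miss and a teacher-defined indicator can expose.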

Conclusions
In terms of in-game assessment, very little research has yet been done to help language teachers analyse and assess students' performance when learning through VW-based environments. While external game features such as game scores and play duration have turned out to be unfair in some evaluations (especially when randomisation comes into play), other, more internal aspects such as students' interaction and the effort made to complete the given game task successfully could reveal interesting information on students' learning process and the language knowledge they have acquired.
The current study aims to help language teachers assess students' performance when learning through VW-based games by providing them with tools to sustainably analyse and assess students' language skills. With this purpose in mind, the paper suggests the use of both an evidence-based method and an EUD programming language. To corroborate the validity of the proposed method for students' language assessment, a case study with students from an undergraduate German foreign language course was conducted.
The findings of this study reveal that the use of VW interaction logs can help teachers to sustainably conduct a fine-grained analysis of students' performance and facilitate their assessment in VW-based game environments:

1. Firstly, a specific EUD programming language allows teachers to easily retrieve objective indicators from students' logs and to combine them with a range of visualisation methods (i.e., VW tracking map and general data report). This way, they can sustainably (that is, independently of the number of students) gather valuable information on students' task performance and the skills involved.

2. Secondly, an evidence-based method allows teachers to refine their initial assessment criteria and thus implement a more comprehensive assessment method. This refinement is especially helpful when considering differences in students' performance due to external factors such as the randomisation of game tasks. The analysis of students' results illustrated that such factors could significantly affect their game performance, since the effort required to fulfil the given task could differ in each case and game session.
Future work needs to focus on more specific aspects, such as students' use of the foreign language in terms of specific structures, communication strategies, etc., to be included in the assessment process. Apart from refining language assessment, the authors intend to extend their study to other professional fields and skills, e.g., generic skills such as teamwork or leadership skills, in order to provide teachers from a wide range of areas with the means to easily assess their students' performance when using games for learning.

Institutional Review Board Statement: Ethical review and approval were waived for this study because it did not involve personally identifiable or sensitive data.
Informed Consent Statement: Student consent was waived because the study did not involve personally identifiable or sensitive data.

Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: https://doi.org/10.6084/m9.figshare.13491240.