Exploring (Collaborative) Generation and Exploitation of Multiple Choice Questions: Likes as Quality Proxy Metric

Abstract: Multiple Choice Questions (MCQs) are an established medium of formal educational contexts. The collaborative generation of MCQs by students follows the perspectives of constructionist and situated learning and is an activity that fosters learning processes. The MCQs generated are, besides the learning processes, further outcomes of collaborative generation processes. Quality MCQs are a valuable resource, so that collaboratively generated quality MCQs might also be exploited in further educational scenarios. However, the quality MCQs first need to be identified from the corpus of all generated MCQs. This article investigates whether Likes distributed by students when answering MCQs are viable as a metric for identifying quality MCQs. Additionally, this study explores whether the process of collaboratively generating MCQs and using the quality MCQs generated in commercial quiz apps is achievable without additional extrinsic motivators. Accordingly, this article describes the results of a two-stage field study. The first stage investigates whether quality MCQs may be identified through collaborative inputs. For this purpose, the Reading Game (RG), a gamified, web-based software aiming at collaborative MCQ generation, is employed as a semester-accompanying learning activity in a bachelor course in Urban Water Management. The reliability of a proxy metric for quality calculated from the ratio of Likes received and appearances in quizzes is compared to the quality estimations of domain experts for selected MCQs. The selection comprised the ten best and the ten worst rated MCQs. Each of the MCQs is rated regarding five dimensions. The results support the assumption that the RG-given quality metric allows identification of well-designed MCQs. In the second stage, MCQs created by RG are provided in a commercial quiz app (QuizUp) in a voluntary educational scenario. Despite the prevailing pressure to learn, neither the motivational effects of RG nor of the app are found in this study to be sufficient for encouraging students to voluntarily use them on a regular basis. Besides confirming that quality MCQs may be generated by collaborative software, it is to be stated that in the collaborative generation of MCQs, Likes may serve as a proxy metric for the quality of the MCQs generated.


Introduction
Multiple choice questions (MCQs) have been used in educational contexts for a long time, predominantly for assessment purposes (e.g., [1][2][3]). MCQs are not restricted to a certain domain; they may cover almost all knowledge domains and (at least in theory) all complexity levels of knowledge [4][5][6]. Besides testing, MCQs may be used for learning; additionally, there is a so-called testing effect [7,8]: repeated answering of MCQs leads to memorization, although not all the effective mechanisms are well understood so far [9]. The testing effect is not specific to MCQs but applies to testing in general. Larsen & Butler (2013) [10] classify MCQs as recognition tests. In contrast, fill-in-the-blank tests, as investigated, for example, by Wiklund-Hörnqvist, Jonsson, & Nyberg (2013) [11], belong to production tests. Retrieval from memory is considered a fundamental cognitive process producing this effect [12]. An overview of work in the context of repeated tests is presented by [...] promotes learning [56]. Student-generated MCQs may be used in a manual way, for example, by being discussed in the classroom after being polished by the lecturer and made available in discussion forums [57]. Yet, web platforms for collaborative question generation are also available, such as PeerWise [58]. Positive effects on learning processes include increasing learner engagement and fostering collaborative learning [58][59][60]. However, challenges emerge when students consider writing MCQs difficult without perceiving adequate learning gains at the same time. In addition, students do not trust the questions of their peers [61]. Additionally, the concealment of the identity of students in collaborative questioning is considered beneficial [62].
Despite all challenges, systems based on user-generated MCQs, such as Quizzical [63], RiPPLE [64], or UpGrade [65], are considered to have great learning potential, provided they are supported by external incentives in formal educational scenarios, as was done in the referenced studies.
This article complements the fundamentals described with the results of a two-stage field study. The objectives of the study were to investigate collaborative MCQ generation and the identification of quality MCQs in the first stage and, in the second stage, the use of the collaborative question generation software and a commercial quiz app for further utilization of MCQs in voluntary learning scenarios. In a pragmatic approach, the MCQs created in the first stage may be used in a motivating learning activity in the second stage. The research questions to be answered are as follows. For the first stage (RQ 1): to what extent may the quality of the generated MCQs be measured by collaboratively generated data? For the second stage (RQ 2): is the collaborative generation of MCQs as a process sufficiently motivating that it works without the extrinsic incentives of educational scenarios and, if it does not, does a commercial quiz app succeed without such incentives instead?
In the following section, the methodology is described; thereafter, the results of the first stage (RG as a tool to generate MCQs) and the second stage (QuizUp as a tool to answer MCQs) are presented. Subsequently, the results are discussed, and, finally, conclusions are derived.

Materials and Methods
Two software tools were used: in the first stage, MCQs are created and ranked by students. In the second stage, these MCQs are provided for learning purposes via a well-established commercial quiz app. In the following, employed software tools and the experimental settings are described.

Generating MCQs: The Reading Game
The Reading Game (RG) is implemented as a Moodle [50] module. It aims at prompting users to generate quality MCQs. During the study, RG was still in an experimental state. RG is an epistemic game that requires students to create questions and answer questions at a predefined regular rate, indicated by a control bar [66]. Because students may review MCQs by adding comments as well as by liking them or by reporting them to an administrator, this gamified application provides a form of collaborative question engineering.

Answering MCQs: QuizUp
Generated MCQs have been provided in a commercial quiz app. In this way, the vast popularity and attractiveness of commercial quiz apps should be exploited for learning. In September 2015, QuizUp, a major commercial quiz app, was opened for user-defined content [45]. This opening has created an opportunity to add user-defined topics to QuizUp sourced from RG mechanisms.

The Experimental Setting
In the first stage, RG was introduced to a bachelor course in Urban Wastewater Management (n = 29 students enrolled in the course initially; n = 16 students took the final exam). Students were instructed by a written assignment description to fulfill their weekly quota, which required them to create one MCQ and to answer five MCQs of their fellow students each week. Students had to meet the weekly quota in 10 of the 16 weeks of the semester in order to be admitted to the final exam. Besides this formal incentive, students were encouraged in the written assignment description to use the Like feature to identify any well-generated MCQ. Students were also made aware of the Comment feature for contributing to the improvement of MCQs. Overall, students were informed that engaging with MCQs would support the learning objectives of the course. An administrator ensured that reported MCQs (MCQs that had been marked as wrong by the participants) were either approved or deactivated.
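The admission rule described above can be illustrated with a minimal sketch. The quota values are taken from the assignment description; the `WeekActivity` record and all function names are hypothetical, and the sketch assumes that surplus activity in one week does not carry over to another week, which the text does not specify.

```python
from dataclasses import dataclass

# Values taken from the written assignment description.
MCQS_TO_CREATE_PER_WEEK = 1
MCQS_TO_ANSWER_PER_WEEK = 5
WEEKS_REQUIRED = 10  # quota must be met in 10 of the 16 weeks


@dataclass
class WeekActivity:
    """Hypothetical record of one student's activity in one week."""
    created: int
    answered: int


def quota_met(week: WeekActivity) -> bool:
    """A week counts only if both parts of the quota are fulfilled."""
    return (week.created >= MCQS_TO_CREATE_PER_WEEK
            and week.answered >= MCQS_TO_ANSWER_PER_WEEK)


def admitted_to_exam(weeks: list[WeekActivity]) -> bool:
    """Assumption: weeks are judged independently, without carry-over."""
    return sum(quota_met(w) for w in weeks) >= WEEKS_REQUIRED


def minimum_totals() -> tuple[int, int]:
    """Smallest totals (created, answered) that can lead to admission."""
    return (WEEKS_REQUIRED * MCQS_TO_CREATE_PER_WEEK,
            WEEKS_REQUIRED * MCQS_TO_ANSWER_PER_WEEK)
```

Under these assumptions, a student needs at least 10 generated and 50 answered MCQs over the semester to be admitted.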
In the second stage, a group of five students used another instance of the RG in the context of the bachelor course Capital Budgeting (CB). As part of a project assignment, the five students were mandated to investigate the impact of RG. First, the five students generated a basic corpus of MCQs through collaborative playing of RG for three weeks at the beginning of the semester. After these three weeks, the students began encouraging their fellow students (n = 30) from the CB course to participate in RG as well, so that all fellow students could prepare for the final exam. After none of the fellow students participated in RG, and feedback indicated that fellow students shied away from generating MCQs, the instructional design was adjusted: instead of answering the MCQs in RG, the MCQs generated were transferred to a topic Capital Budgeting in QuizUp, where the MCQs were available for practice. The response for this topic in QuizUp was again poor. To evaluate the reasons for student inactivity, the group of students employed a self-designed questionnaire, which was answered by 18 fellow students.
In the first stage, during the 16 weeks in which the game was active in the semester, 29 users created 379 MCQs and answered 6689 MCQs. 326 MCQs were liked, i.e., Likes were issued at a low rate of only 0.5%. 15 MCQs were reported by students. All MCQs reported were finally deactivated. The comment feature of the RG was used very seldom, mainly for communication about reported MCQs. There was only one game-related student inquiry: a clarification about a deactivated MCQ. Four of the 29 students did not adhere to their weekly quota and were not admitted to the final exam.
A first observation was that, after an orientation phase, students predominantly generated negated MCQs. Negated MCQs are quite easy to generate, as they free the originator from the demanding task of finding well-balanced distractors. Instead of three distractors, only one distractor has to be found; the other three answer options are formed by correct answers. By use of negation, this distractor becomes the correct answer. However, this approach is not recommended as good practice for building MCQs [67]. Figure 1 illustrates that after approximately a third of the semester, students predominantly generated negated MCQs. More than half of all MCQs generated included a negation for almost a third of the course period. This phenomenon stopped after the course administration instructed students to avoid this type of MCQ. Soon after this instruction, the percentage of MCQs asking for a numerical answer increased again, doubling from 10% to 20%. Besides negation, an MCQ asking for a numerical answer is another type of MCQ that eases MCQ generation because distractors (other numerical values) might be found with little effort. Although MCQs asking for numerical answers may also be regarded as MCQs of high quality, focusing solely on MCQs with numerical answers underrepresents other domains, as numerical answers are more common in specific domains, such as mathematics or physics [68]. Further, not all learning objectives are supportable by MCQs employing numerical answers.
A further observation was that throughout the game a small group of 3-4 students battled for the leading position in the game and answered up to 10 times as many questions as required for admission to the final exam. This phenomenon has been observed in other competitions, including educational quizzes, as well [69]. Another group of students fulfilled just their weekly quota with respect to both the MCQs to generate and the MCQs to answer. In between, there was a group which did not stick strictly to the quota but answered an extra number of MCQs from time to time. A classification might be hypothesized into competitive overachievers, interest-driven casual learners, and effort-optimizing minimalists. The group of effort-optimizing minimalists may be interpreted as evidence suggesting that weekly quotas as an admission requirement for the final exam acted as a trigger to start participation in RG. In general, the number of MCQs answered varied much more than the number of MCQs generated. On average, a student answered almost 500% of the mandatory quota, but only generated 40% more MCQs than required. These ratios may indicate that generating MCQs is perceived to be significantly more difficult than answering MCQs. Table 1 includes the minimum, maximum, and average number of MCQs answered and generated. In addition, the number of MCQs required to be admitted to the final exam is given as an indicator value. The numbers refer to those 16 participants who completed the final exam and therefore should be considered as the most regular users of RG.

Students' Perceptions
Upon completion of the RG activity, but before the final exam, students were asked to answer a questionnaire consisting of 21 questions. 26 answers were received; 10 of the respondents either decided not to take the final exam or were not admitted to it. 19 (73%) of them took part regularly in RG because they wanted to receive admission to the final exam. Another 5 participants (19%), who had already been admitted to the test in a previous semester, wanted to prepare for the final exam. The majority of participants (73%) logged into RG once a week. 50% of respondents estimated the weekly time spent in RG as 10 to 20 min.
Respondents were asked for their estimation of the difficulty of various tasks in RG. Answers on a 6-point Likert scale are depicted in Figure 2. The effort of creating a new MCQ was marked as very challenging. All related tasks (having an idea, finding distractors, and formulating the MCQ) received higher values of perceived difficulty than the alternatives of quizzing (answering 10 MCQs in a row) and answering a single MCQ. Noteworthy is the huge difference between both categories of difficulties: tasks related to creating MCQs are rated almost 2 points more difficult than just answering MCQs.
A further question is the efficacy of gamification elements. RG is positioned as a kind of game. Hence, a fundamental question is the expectation with which a student enters RG: are they in the mood to play a game or to use a learning tool? Respondents were asked for the main information elements that they considered important for the operation of the RG. The results of this question, which again used a 6-point Likert scale for each control element, are depicted in Figure 3. The most-observed information is the control bar. This might not be surprising, as this gauge is the official measurement that the course administrators stick to. Related to this indicator are the given numbers of MCQs still to be answered and asked. The most important gamification element is competition, as the next indicator (the position in the ranking list) suggests. Remarkable here is the large gap of more than one point relative to the previous indicators. The urge towards competition is assisted by the information regarding how many points are needed to move one rank up. Competition in this context seems not to be strongly personal, as the names of the ranking list neighbors are not that important. Assigned karma, i.e., a measurement of how much a person's MCQs are liked and therefore a kind of recognition from fellow students for one's work, seems at least to be noted, together with the information of how often one's generated MCQs have been answered. The most ignored kind of information is the stars assigned. This is a weekly reward for most points and most answered MCQs, an achievement, which is considered a classic gamification element [70].
The next group of questions evaluated students' perception of RG as a learning tool. As Figure 4 reveals, there are no settled statements. The only denied statement is that RG is operated collaboratively. Further, the respondents admitted that RG stimulated learning activities on a regular basis. While respondents were undecided whether the game supported their learning, they mostly rejected the (admittedly provocative) statement that RG is a waste of time.
In general, RG was received more as additional work than as a game. In a comparative question about preferred lecture-accompanying learning activities, RG got the lowest marks (2.7) on a 6-point Likert scale compared to online questionnaires (4.3) and calculation exercises (3.2). In particular, the generation of MCQs led to avoidance behavior (negated MCQs).

Analysis
The data described above were linked to further data from the didactic scenario, described in Table 2. Students not only completed the RG during the semester, but also had to complete regular online pretests as a further requirement for admission to the final exam. Each of the 9 online pretests consisted of 5 MCQs. Seven of them had to be passed with at least 60% to obtain admission to the final exam. For these 9 tests, a pool of 140 MCQs from previous semesters of the course was used. This pool was enlarged by 32 MCQs from RG, which were identified by the number of Likes received. The RG MCQs were checked for technical accuracy by two domain experts and adapted accordingly. A selection of MCQs from this pool was also the subject of the final exam [71].
These data (n = 16) were subjected to a correlation analysis. All correlations with absolute values higher than 0.4 are included in Table 3. The final exam results (MCQs) (F) are positively correlated with the number of mock tests completed (E), the number of MCQs answered (A) in RG, and the points (C) in RG. As C and A may be regarded as indicators of time spent on test preparation, the values of the correlation coefficients seem to be reasonable.
The increased value of 0.79 between the sum of A and E and the final exam results (MCQs) (F) seems to be reasonable, as efforts of mock tests and answering MCQs may be mutually substitutable. Overall, the correlations shown are in line with the recognized assumption that active engagement with the learning subject matter results in better learning outcomes [72,73]. An unexpected result is the negative correlation of −0.60 between MCQs generated (B) and the final exam result (Calculations) (G). Whether participation in RG was motivated by compensation, especially among students who had weaknesses in the calculation tasks to be solved in the final exam, still needs to be explored. In a previous study, a few students reported that engaging in quizzes for a short period of time led, in particular, to a sense of relief from having contributed towards learning [71].
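The text does not name the correlation coefficient used. Assuming Pearson's product-moment coefficient (an assumption, not confirmed by the source), the value reported for the sum of A and E against F would correspond to the following computation, where $A_i$, $E_i$, and $F_i$ denote the values of student $i$ for the variables A, E, and F:

```latex
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad x_i = A_i + E_i, \quad y_i = F_i, \quad n = 16
```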

Analysis of the MCQs Generated
A special kind of feedback that students may give in RG is liking an MCQ they have answered. The Likes for all MCQs of a participant are aggregated into a Karmascore, which serves as a form of reward for well-received MCQs. The question arises of whether the number of Likes might be used as a measurement for the quality of an MCQ. MCQs generated by students especially need to be assessed for their quality, also because students have doubts about the quality of self-generated MCQs [52]. Besides peer assessment of quality, a further approach to assessing the quality of MCQs is the assessment by experts, who examine the MCQs for supported educational goals; for example, in [74] an MCQ pool is mapped to Bloom's taxonomy of educational goals [75]. Artificial intelligence may also be used [76].
To determine whether Likes are a valid proxy metric for selecting quality MCQs, expert judgment was used as a reference here. A number of MCQs and an assessment scheme for MCQs were included in a questionnaire. This questionnaire was answered by domain experts. Finally, it was evaluated whether there are correlations between the number of Likes an MCQ has received and the assessments of the domain experts. In the following, the steps of the methodology are described.
Selection of MCQs. Ten well-rated and ten poorly rated MCQs from the RG corpus of 379 questions were selected to provide a broad range of quality. Those with at least three Likes and the best ratio of Likes per answer (Karmascore) were selected as the best MCQs. The ten worst rated MCQs were identified as those with the most answers without any Likes. Finally, the selected 20 MCQs were included in an arbitrary order in a questionnaire, so that the quality could not be inferred from the position in the questionnaire.
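The selection step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure as implemented in RG: the `McqStats` record and the function names are hypothetical, and the Karmascore is taken here to be the ratio of Likes to answers, as the text describes.

```python
from dataclasses import dataclass


@dataclass
class McqStats:
    """Hypothetical per-MCQ statistics exported from RG."""
    mcq_id: str
    likes: int
    answers: int  # how often the MCQ was answered in quizzes


def karmascore(mcq: McqStats) -> float:
    """Ratio of Likes per answer; 0.0 for MCQs never answered."""
    return mcq.likes / mcq.answers if mcq.answers else 0.0


def select_best(mcqs: list[McqStats], n: int = 10,
                min_likes: int = 3) -> list[McqStats]:
    """MCQs with at least `min_likes` Likes, ranked by Karmascore."""
    eligible = [m for m in mcqs if m.likes >= min_likes]
    return sorted(eligible, key=karmascore, reverse=True)[:n]


def select_worst(mcqs: list[McqStats], n: int = 10) -> list[McqStats]:
    """MCQs without any Likes, ranked by number of answers."""
    unliked = [m for m in mcqs if m.likes == 0]
    return sorted(unliked, key=lambda m: m.answers, reverse=True)[:n]
```

The minimum of three Likes guards against MCQs that reach a high ratio from very few answers, for example one Like on a single answer.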
Assessment scheme. An assessment scheme for MCQs was developed guided by the work of Haladyna & Rodriguez (2013) [67]. The dimensions of the scheme are presented in Table 4. Each MCQ has to be rated according to each dimension. A 5-point scale from 1 (not at all) to 5 (yes, completely) has been used.
Results. For each MCQ, the dimensions' mean values and the Karmascore have been analyzed for correlations (see Table 5). The best value of 0.51 for a correlation has been found between Relevance and Karmascore. Complexity follows with a value of 0.34, whereas Precision and Correctness show rather low values and Selectivity seems not to be correlated to Karmascore at all. As a summary, MCQs that receive Likes by students seem to be characterized mostly by Relevance and Complexity. Therefore, the Like feature is to be considered as valid for ranking MCQs according to their quality.
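The analysis behind Table 5 can be sketched as follows. The sketch assumes that the expert ratings of each MCQ are averaged per dimension and then correlated with the Karmascore using Pearson's coefficient; the text does not state which coefficient was used, and the data layout and function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; undefined if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def dimension_correlations(
    expert_ratings: dict[str, list[list[int]]],
    karmascores: list[float],
) -> dict[str, float]:
    """Correlate each dimension's mean expert rating with the Karmascore.

    expert_ratings[dimension][i] holds the ratings (1 to 5) that the
    experts gave MCQ i; karmascores[i] is the Karmascore of MCQ i.
    """
    result = {}
    for dimension, ratings_per_mcq in expert_ratings.items():
        mean_ratings = [mean(ratings) for ratings in ratings_per_mcq]
        result[dimension] = pearson(mean_ratings, karmascores)
    return result
```

With only 20 MCQs in the sample, coefficients computed this way are sensitive to individual items, so they are best read as indications rather than precise estimates.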

Motivation and Method
In the previous setting, participation was mandatory for all students, as there were means to sanction their non-participation. At least in theory, RG provides a frame that may be filled by the self-directed and self-paced work of the students, and that provides relatively short-cycled feedback by statistics and by Karma, provided by fellow students. Therefore, it is worthwhile to test whether RG fosters motivation sufficiently by letting students participate in the game voluntarily. The research question was whether RG can serve as a tool in an informal learning context, i.e., without a formal obligation.
The Starting Phase. The study was started as a student project of five participants in their studies for a bachelor's degree. In parallel, they had to take part in the course Capital Budgeting (CB). Their task was to operate an instance of RG and thereby to build up a pool of MCQs. To grow the number of MCQs quickly, each member of the team had to provide three MCQs in the first two weeks. Participants claimed that CB would not be an appropriate domain for MCQs: there would be only little knowledge to memorize, and mostly procedural knowledge would have to be applied. Consequently, a further MCQ type was introduced: rough estimation MCQs. These MCQs are to be solved by mental calculations; they should transform procedural calculation knowledge into MCQs. The low number of participants became a problem in later weeks, when new MCQs were not generated sufficiently, and participants could not fulfill their quota without interventions.
The Blossoming Phase. After three weeks, the project group advertised RG in a short introduction in the lecture with the intention of inviting their fellow students to join. The project group repeated this invitation two times. No other student joined the game. As the project group learned from personal conversations with their fellow students, the main worry of their peers was the requirement to generate an MCQ. The fellow students stated a desire to benefit from RG by answering the MCQs but considered the effort of creating MCQs as too much. Thus, there was no blossoming phase.
The Harvesting Period. In line with the preference of the fellow students to have access to the MCQs without the obligation to create MCQs, the project group transferred the MCQs to the commercial quiz app QuizUp. Thus, the user-defined QuizUp topic Capital Budgeting, including 57 MCQs, was created. Remarkably, in the project session in which the project group was first introduced by their advisors to the option of transferring their RG MCQs to QuizUp, the project group unintentionally demonstrated the low-threshold accessibility of mobile apps: all members took out their mobile devices unprompted and installed the app within five minutes.
Again, the new QuizUp topic was advertised in a lecture by the students. The result was disappointing again; only six (out of 30) students tested the topic but did not use it regularly. Altogether, QuizUp seems not to be an attractive tool in informal educational contexts. Consequently, the reasons for not participating either in RG or in playing QuizUp, although the provided contents were relevant for the written test, were collected by a questionnaire.

Questionnaire
The questionnaire consisted of 9 questions in the categories RG, QuizUp and Learning (see: Supplementary Materials). It was launched after the last lecture, supported by the lecturer of the course CB. 20 answers were received; 15 of them completed the questionnaire. The first question asked whether respondents were aware of RG. 18 of 20 confirmed. A second question asked for the reason not to enter RG (Figure 6). Again, participants could indicate their reasons on a 5-point Likert scale. The statement rated highest indicated that the presentation in the lecture was not convincing, meaning that students could not envision a beneficial learning situation. Together with the next most named reasons, the unwillingness to create a question and the lack of formal approval for this tool, these answers might serve as an explanation for missing participation in RG. Further hindrances were mistrust of the idea that such a game might contribute to learning and that semester-accompanying learning is useful and required.
The propagation of QuizUp and its educational topic CB was not successful. Only 4 of 17 respondents were aware of QuizUp, and only 3 of them already had experience with QuizUp. Figure 7 summarizes the non-representative attitudes of these 3 participants. Nevertheless, new questions arise about the use of QuizUp: Is playing QuizUp fun (as suggested by its commercial success)? Is there a difference between educational and entertainment topics (as suggested in [77])? Finally, is the quality of the MCQs sufficient, which might be an important aspect of accepting quiz apps as learning tools?
Three verbal responses were received, which pointed to prevalent issues here. The first person indicated that the effort required, in combination with the experimental character, kept her from taking part in QuizUp: "I prefer investing my time in learning activities which have already proven their efficiency." Another person doubted the quality of student-provided MCQs and, further, was not convinced that the complex content of higher education lectures can be transferred into MCQs, a phenomenon which has been described before [61]. Additionally, a third person raised an issue that applies to a small group of students: visual learning. Instead of dealing with the meaning of the answers, these students memorize the visual form of the answers and identify the correct answer by the length of the words and their visual appearance.
Another question addressed how students approach their learning tasks in general (Figure 8). The most prevalent method is the use of lecture notes for learning sessions. At this point, four written answers indicated that students write summaries of their lecture notes. Learning activities during the semester do not seem to be very popular. A similar observation, namely that students tend to study little during the semester and instead try to prepare intensively for the final exam, was also made in an earlier study [71]. Working with flash cards is not much favored, and the use of flash card learning apps is not popular at all. Overall, using digital tools in self-initiated learning activities appeared to be uncommon among the students in this cohort. A possible reason, which still needs to be investigated, could be the high proportion of procedural knowledge, which may be less well practiced with the help of MCQs but is nonetheless required for the calculation tasks in the final exam.

Discussion
In general, the idea of generating MCQs by a collaborative game for use in further educational scenarios appears to be beneficial to learning. RG-generated MCQs were included in QuizUp and were available for playing. Learning effects were observed when students indicated that regular course-accompanying learning activities were triggered by QuizUp. The results are partially in alignment with findings in the literature indicating a positive impact of engagement in generating MCQs on performance in the final exam [53,58,63]. However, some studies have not observed an impact of engagement in MCQ generation on final exam performance [78,79].
The results presented appear to be inconsistent with those of an earlier study, in which QuizUp experienced much higher uptake, but in which QuizUp activities were not voluntary [77]. QuizUp was received as a game, even when it was used for educational purposes, as the results of the Game Experience Questionnaire (GEQ) [80] suggest. As a learning tool in synchronous lecture settings, it was accepted. However, as an asynchronous spare time activity, it received lower acceptance. There were comparatively high values for Positive affect, Challenge, and Competence, whereas values for Negative affect and Tension were low, seemingly typical for a game experience (Figure 9). Also noteworthy from the previous study is the categorization of players into Learners and Gamers: Learners fulfill their quota of assigned tasks and probably play a few further entertainment matches, but then leave the app. Gamers, however, accomplish their educational tasks in the game and then get stuck in the app, i.e., they play 10 times more matches in entertainment topics than in educational topics.
In general, a few limitations of this study still need to be resolved. Certain sample sizes (especially 16 to 29 students operating RG and 18 out of 30 students answering the questionnaire regarding QuizUp) in the study are to be increased in replication studies for attaining greater significance.
Further, RG was received as an assignment rather than a game, although the students valued RG (and QuizUp) as an enrichment of the courses. Only a small fraction of the students seemed to be susceptible to gaming features. Students accomplished their weekly quota, but generating MCQs did not seem to be a preferred task. Additionally, tasks such as liking and commenting were done only reluctantly; thus, the collaborative part of the game did not work as intended. These deficits certainly impacted the quality of the MCQs generated, although the quality has been rated as acceptable, and especially high-quality MCQs might be selected based on the Likes received. In the following, measures aiming at further developments are summarized: organizational measures (including didactical necessities) and software improvements (including game design).
Organizational Measures. Didactically, extending the introduction to RG and underlining the positive effects for students might be beneficial for the students' motivation. Further, providing a corpus of well-designed sample MCQs and a design guide for MCQs could provide more orientation to students. Additionally, reviews and assistance by domain experts during the game might improve the learning process and shorten periods of unsound MCQ generation strategies, such as negated MCQs. In general, these experiments confirmed that learning tools such as RG and QuizUp require a dedicated didactic scenario. Learning activities were not performed voluntarily but had to be spurred by an educational scenario linking learning activities formally to intended course outcomes.
RG Software Improvements. The RG module was a prototype and had functional limitations that also reduced its effectiveness. For example, comments on MCQs were not shared with interested students, who in turn could not respond to them. Thus, discussing a question was cumbersome. In addition, an editing feature was missing: faulty MCQs had to be deactivated and re-submitted. Additionally, although students were instructed to like MCQs, only half a percent of all answers were accompanied by a Like. However, as shown, Likes may be used to rank MCQs and are therefore important for identifying quality MCQs. Extension of the Like feature is suggested in two ways. Firstly, a multi-star rating might help to clarify the value of a Like. Secondly, Likes should be a partially mandatory part of answering an MCQ: when a student has not issued a Like for a certain number of answered MCQs in a row, such a rating would appear and would have to be completed. Further options for enhancing the quality of interactions in the game might be introduced, for example mandatory interactions such as assessments, or direct competition between participants, as a means to contribute to a more meaningful game experience.
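The ranking by Likes and the partially mandatory rating described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RG implementation: the names `McqStats`, `rank_by_like_ratio`, and `RatingPrompter` are hypothetical, as is the default streak threshold of five unrated answers, which the study does not specify. The like ratio follows the proxy metric used in this study, i.e., Likes received divided by appearances in quizzes.

```python
from dataclasses import dataclass


@dataclass
class McqStats:
    """Like and appearance counts for one MCQ (hypothetical record layout)."""
    mcq_id: str
    likes: int
    appearances: int

    @property
    def like_ratio(self) -> float:
        # Proxy metric: Likes received per appearance in a quiz.
        # MCQs that never appeared get 0.0 to avoid division by zero.
        return self.likes / self.appearances if self.appearances else 0.0


def rank_by_like_ratio(stats):
    """Return the MCQs sorted from highest to lowest like ratio."""
    return sorted(stats, key=lambda s: s.like_ratio, reverse=True)


class RatingPrompter:
    """Tracks one student's answers and decides when a rating is mandatory."""

    def __init__(self, max_unrated_streak: int = 5):
        # Assumed threshold; the article only says "a certain number".
        self.max_unrated_streak = max_unrated_streak
        self.unrated_streak = 0

    def record_answer(self, rated: bool) -> bool:
        """Record one answered MCQ; return True if a rating is now required."""
        self.unrated_streak = 0 if rated else self.unrated_streak + 1
        return self.unrated_streak >= self.max_unrated_streak
```

In such a design, the quiz front end would call `record_answer` after each answered MCQ and show the rating dialog as a required step whenever it returns `True`; completing the rating resets the streak.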
QuizUp. The usage of commercial quiz apps, such as QuizUp, in educational scenarios is not well documented in the literature. A threat of using commercial software in educational scenarios is always the loss of the software, be it due to licensing reasons or due to the discontinuation of the software, as is the case with QuizUp, which was discontinued in 2021 [81]. However, such a loss may also affect dedicated software such as RG, which has meanwhile been discontinued, too. If software is discontinued, there is usually alternative software available that may be used with a one-time setup effort. Related to this study, for example, RG might be substituted by PeerWise [82], along with appropriate instructions for the quotas to be produced. For the substitution of QuizUp, various alternatives are also available, for example the Keeunit quiz app [20]. Further research is required into increasing the attractiveness of educational topics. Further, user-defined topics suffer from some restrictions that impair game enjoyment; for example, players are not awarded specific titles when they reach a certain level. Additionally, the behavior of bots as opponents is too simple; e.g., it is almost impossible to beat certain bots, which frustrates players. Among the positive aspects of QuizUp is its openness to all technical domains; thus, it is a domain-independent generic learning tool.

Conclusions
This two-stage field study investigated two multiple choice question (MCQ)-based digital tools in learning scenarios of bachelor's courses. The first tool (RQ 1) was the Reading Game (RG), a platform for collaborative generation of MCQs. Regarding RG, it was shown that quality MCQs may be identified by the Likes given by students to their fellow students' MCQs. In the second stage (RQ 2), it was confirmed, in line with findings from the literature, that the process of collaborative generation of MCQs is not solely motivated by the achievable learning outcomes but must rely on the external incentives of a framing educational scenario. Further, MCQs generated by RG may be transferred to a quiz app with little effort from the lecturer, so that the MCQs are available there for further learning scenarios. The commercial quiz app used in this study was the well-established entertainment app QuizUp. Because its use was not mandatory, student use of this app was likewise only marginal, indicating that even the motivational effects of a commercial quiz app are not sufficient, despite an upcoming final exam, to draw students into voluntary learning activities. Further, both RG and QuizUp have been discontinued in the meantime. Nevertheless, since each of the two tools represents a class of learning tools and may be substituted, the results confirm that using collaborative MCQ platforms for generating MCQs, a selection of which is then used in a quiz app in other learning scenarios, is a sustainable and especially domain-independent approach. Furthermore, the results suggest that Likes provide a proxy metric for the quality of collaboratively generated MCQs.
Supplementary Materials: The following questionnaires are available online at https://www.mdpi.com/article/10.3390/educsci12050297/s1. For each stage of the study, data was collected by a questionnaire, named QuestionnaireStage1.pdf and QuestionnaireStage2.pdf.
Institutional Review Board Statement: Ethical review and approval were waived for this study, due to surveying existing learning scenarios in the field, i.e., the measurements reported here had no influence on the design of the learning scenarios.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.