Robot Tutoring of Multiplication: Over One-Third Learning Gain for Most, Learning Loss for Some



Introduction
With the current COVID-19 pandemic, learners worldwide rely on online teaching and media applications for their education. Nonetheless, the United Nations fears knowledge deficits, learning losses, and gaps in the learning process from a lack of face-to-face interaction (UN, 2020, p. 4, p. 23). Therefore, the UN pleads for different ways of content delivery: hybrid learning that is flexible and quasi-individualized (UN, 2020, p. 25): "We should seize the opportunity to find new ways to address the learning crisis and bring about a set of solutions previously considered difficult or impossible to implement" (UN, 2020, p. 4). If every child had a robot tutor at home, would this to some extent make up for missing out on human interaction? Whereas a few years ago, robot teachers were mere science fiction, today a number of schools include some form of robot education. This varies from educational programs such as Science, Technology, Engineering, and Mathematics (STEM) in which young children learn to build and program robots (e.g., Gomoll, Šabanović, Tolar, Hmelo-Silver, Francisco, & Lawlor, 2017; STEMex, 2019) to humanoids that teach children mathematics or language (e.g., Chang, Lee, Chao, Wang, & Chen, 2010; Nuse, 2017). Multiple studies show that robots can be beneficial for learning outcomes. A recent review points out that appearance, behavior, and different kinds of social roles of the robot may affect learning outcomes positively but sometimes also negatively (Belpaeme, Kennedy, Ramachandran, Scassellati, & Tanaka, 2018).
It seems that people respond better to instructions delivered by a social robot than by a tablet with the same programs and voice (e.g., Mann, MacDonald, Kuo, Li, & Broadbent, 2015). Pupils apparently learn significantly more from their robotic tutors than from a tablet or no robot at all (VanLehn, 2011; Brown, Kerwin, & Howard, 2013).
Common understanding has it that in human-human teaching, warm, social, and personal teachers are more successful in advancing their pupils' level of study performance (e.g., Hattie, 2015;Tiberius & Billson, 1991;Saerbeck, Schut, Bartneck, & Janse, 2010). In human teacher-student relationships, a teacher should not just offer theoretical instruction and correct mistakes but also support students personally while creating a healthy relationship (e.g., Frymier & Houser, 2000; embodiment played a modest role in intentions to use the robot and feeling engaged with it. In robot design also, realism is not all (Van Vugt, Konijn, Hoorn, Eliëns, & Keur, 2007).
To study the effects of robot tutoring on learning a STEM related task such as rehearsing multiplications, we varied different forms of human-likeness in the design of the robot (cf. Syrdal, Dautenhahn, Woods, Walters, & Koay, 2007). Our hypothesis (H1) was that we expected positive effects of a more humanlike design on rehearsing the multiplications.
As our H2, we presumed that working with a robot tutor would potentially be more beneficial to lower-ability pupils than to advanced students. For below-average students, larger progress may be achieved whereas for the high performers, the added value may be minimal.
From Konijn and Hoorn (2017;2020a), one could infer that robot tutoring improves learning the multiplications the more the child emotionally bonds with the robot tutor. Bonding is stimulated when the robot's design looks and behaves like a human and in the perception of the child is experienced as high on anthropomorphism, relevance, realism, and affordances. Therefore, H3 supposed that building rapport or establishing an emotional bond with the robot would lead to better task performance, perhaps in a mediating or moderating manner. As a control, we queried the social role the robot played for these children (cf. Chen, Park Hae, & Breazeal, 2020) and how appealing ('beautiful') and new they felt their robot tutors were.

Participants and Design
After obtaining approval from the institutional IRB (lbn344; April 1, 2019; FSW Research Ethics Review Committee, RERC), parental consent letters were distributed through two Hong Kong primary schools. Due to strict time planning by the schools and because parents picked up their children early, eventually 75 students were able to participate in at least one session with a robot tutor and completed the pre- and post-test (N = 75; MAge = 8.4, SDAge = .82, range: 7-10, 44% female, Hongkongers). For more details on demographics, consult the technical report in Supplementary Materials 1.
We planned for all pupils to participate in 3 robot tutoring sessions spread over three weeks (within-subjects). Due to the schools' tight time schedules, however, not every pupil could participate in every session. Children from the S.K.H. Good Shepherd Primary School took only one session. Together with those from the Free Methodist Bradbury Chun Lei Primary School who took but one session, this resulted in 48 children participating only once. Those who participated twice (n = 13) and thrice (n = 14) were all from Chun Lei.¹ For a complete overview of the participatory division, consult the technical report in Supplementary Materials 1.
To test our hypotheses, we administered an experiment with the between-subjects factors of Robot Design (3) and Advancement Level (4) to measure their effects on the within-subjects scores at the multiplication test, before-and-after robot tutoring. We also examined the mediating or moderating effects of affective Bonding with the robot on learning the multiplications. We invited the children to participate in three sessions with the tutoring robot.
¹ Those participating twice or thrice were different children. Boys and girls were distributed over the Robot Design conditions as follows: Humanoid (15 males, 6 females), Puppy (15 males, 12 females), and Droid (12 males, 15 females). The schools' strict time scheduling caused unequal distributions of Gender over the three robots but this did not render a significant effect (χ²(2) = 3.49, p = .174).
To determine the Advancement Level of the pupils, we took the average Baseline score (N = 75, M = 37.16, SD = 14.88) established in the pre-test and categorized the children into four groups for further exploration. Those who scored lower than one standard deviation below average (Baseline ≤ 22.28) were categorized as 'Challenged' students (n = 11). Those between one negative standard deviation and the average were categorized as 'Below average' (22.28 < Baseline ≤ 37.16) (n = 34). Those between average and one positive standard deviation were categorized as 'Above average' (37.16 < Baseline ≤ 52.04) (n = 19), and those beyond one positive standard deviation were categorized as 'Advanced' students (Baseline > 52.04) (n = 11). No significant effect of unequal distributions was found between Advancement Level and Robot Design (χ²(6) = 1.73, p = .943). For more details, see the technical report in Supplementary Materials 1.
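The categorization rule can be sketched in a few lines of Python. This is an illustration only, not the authors' analysis script: the function name and the example scores are hypothetical, and the standard deviation of 14.88 is the value implied by the reported cut-offs of 22.28 and 52.04 around the mean of 37.16.

```python
def advancement_level(baseline, mean=37.16, sd=14.88):
    """Assign a pupil to one of four groups from the pre-test score.

    Defaults are the mean reported in the text and the SD implied by
    the cut-offs 22.28 and 52.04 (an inference, not a new estimate).
    """
    if baseline <= mean - sd:
        return "Challenged"
    if baseline <= mean:
        return "Below average"
    if baseline <= mean + sd:
        return "Above average"
    return "Advanced"


# Hypothetical pre-test scores, one per category.
example_scores = [16, 30, 45, 69]
example_levels = [advancement_level(s) for s in example_scores]
print(example_levels)
```

Because the groups are defined relative to the sample mean and standard deviation, the same rule applied to another sample would yield different cut-offs.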

Procedure
At the Free Methodist Bradbury Chun Lei Primary School, the experiment took place on Tuesdays over three weeks. The S.K.H. Good Shepherd Primary School had time for but one session. In class, the topic and procedure were introduced and pupils took a 5-minute multiplication pre-test of 147 equations (Table 1, Figure 2). One week later, after class, the pupils from Chun Lei were asked to wait in the corridor before entering the experiment classroom (Figure 3). Those from Good Shepherd were taken out of class one at a time by one of the research assistants and entered the experiment room upon arrival. When a pupil of either school entered the room, one of the assistants brought them to the table where the robot stood (Figure 4). With three Bioloid robots available, three children were tutored simultaneously such that they did not disturb each other. The assistant explained that the robot would ask a question and that the pupil could answer by typing on the numpad and pressing Enter afterwards (Figure 4). All interactions, tests, and questionnaires were recorded in Cantonese. The robot started the session by asking if the pupil was ready. Upon confirmation, the multiplication program started, automatically drawing 147 equations randomly from various multiplication tables. Equations consisted of one-digit numbers times two-digit numbers (see Table 1). Questioning went on for 5 minutes, after which the program thanked the child, reported on the number of correct answers, and dismissed the pupil from the session. After one and after two weeks, the same procedure was repeated (at Chun Lei).
The three assistants who operated the robots were sitting behind a curtain. This way, the pupil had the illusion that the robot was fully autonomous while in fact someone was pressing buttons on a remote control. The answers that participants typed in on the numpad could be read by the assistant. When the answer was correct, the assistant pressed the button that triggered positive feedback such as clapping or nodding; when incorrect, the assistant pressed the button that triggered feedback about the mistake such as shaking the head or head scratching (對不起。那是不對的。"I am sorry. That is not right").
Each time the pupils completed their sessions, they took another multiplication test as post-test (once, twice, thrice). The same procedure as in the pre-test was used. After the post-test, pupils filled out a questionnaire about their experiences with the robot. At Chun Lei, the questionnaire was a homework assignment; the pupils from the Good Shepherd did the questionnaire in class.

Apparatus and Materials
Humanoid, Puppy, and Droid (Figure 1) were built from three identical Bioloid Premium DIY kits and programmed on the same CM530 computer.³ To tease out bonding tendencies, we put comparable eyes on the three machines (Figure 1) so each robot would 'look' at the participants. Attached to the Bioloids were Fresh 'n Rebel Rockbox Cube Fabriq Army (59 × 59 × 59 mm, Bluetooth 4.0, 1-channel mono, 3 W) front speakers that were connected to a self-written speech engine in Node.js (a JavaScript runtime), which ran independently from the robot software.
Trials consisted of prerecorded Cantonese male speech (23 years of age) of multiplication equations, for instance, "5 times 12?" and the child's input was followed by various feedback such as "I'm sorry that is incorrect," "Well done, that's correct." Trials were composed from separate audio files of the numbers 1 to 99, of the word "times," and of "equals." The program would then randomly select a number audio file, followed by the "times" audio file, followed by another random number audio file, followed by the "equals" audio file.
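The way a trial is assembled from separate audio files can be illustrated with a short Python sketch. It is not the authors' Node.js speech engine: the file names, function names, and default number ranges are assumptions made for illustration (the first factor is drawn from 1 to 99 and the second is kept at two digits; both ranges are parameters so that a one-digit first factor can be used instead).

```python
import random


def make_trial(rng, first_range=(1, 99), second_range=(10, 99)):
    """Draw one equation and list the audio clips that would voice it.

    File names such as '12.wav' and 'times.wav' are hypothetical.
    """
    first = rng.randint(*first_range)
    second = rng.randint(*second_range)
    clips = [f"{first}.wav", "times.wav", f"{second}.wav", "equals.wav"]
    return first, second, clips


def make_session(n_trials=147, seed=None):
    """Build the list of trials for one tutoring session."""
    rng = random.Random(seed)
    return [make_trial(rng) for _ in range(n_trials)]


session = make_session(seed=1)
first, second, clips = session[0]
print(f"{first} times {second} equals ... (answer: {first * second})")
```

Passing a seed makes a session reproducible, which helps when the same set of equations has to be replayed; leaving it out gives a fresh random draw each time.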
The speech program kept track of the pupil's answers but motoric functions of the robot were controlled remotely because the speech program in Node.js was incompatible with the Robotis+ code language of the robot.⁴ Therefore, a wireless Bluetooth receiver was attached to the robot's computer, communicating with a wireless controller. Pupils could input their answers on a Gembird numerical keyboard or numpad (OS independent, plug-and-play, 124 × 81 × 21 mm, USB 2.0 powered with type A-plug) (Figure 5). Apart from audio feedback, a correct answer was rewarded with Humanoid clapping its hands, Puppy nodding its head, and Droid moving up and down. For negative feedback, Humanoid scratched its head, Puppy shook its head, and Droid wiggled from left to right.
The program terminated after 5 minutes, counted the number of correct answers, and based on the results played "Well done" or "I'm sorry." It then thanked the child for participating and asked the child to leave the room. Table 1 offers a synopsis of the variables investigated in this study. The full record of variables can be found in Supplementary Materials 1. Table 1 has two types of dependent measures that are theoretically relevant: learning and experience. Additionally, several control variables are tabulated as well. Learning variables were derived from pre- and post-test in which pupils solved 147 equations drawn from the range [1-99] with the second number always having two digits (e.g., 3 × 12 or 15 × 31). In the analysis, our focus will be on Learning gain (the absolute difference between pre- and post-test) and Gain percentage (learning gain relative to a child's baseline knowledge).

Measures
We created the measure of Gain percentage because, for example, 5 more correct answers after robot tutoring may be a relatively big gain for those who performed poorly before but a small gain for those who already performed at a high level (cf. ceiling effect). Percentage_Fin_min_Base, then, was calculated as Fin_min_Base divided by Baseline (Table 1).
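As a small worked illustration of the two learning measures, the following Python sketch computes the absolute gain and the gain percentage. The function names and the example scores are hypothetical; only the definitions (final minus baseline, and that difference divided by baseline) come from the text.

```python
def learning_gain(baseline, final):
    """Absolute gain: post-test score minus pre-test score."""
    return final - baseline


def gain_percentage(baseline, final):
    """Gain relative to the pupil's own baseline, as a percentage."""
    if baseline == 0:
        raise ValueError("Gain percentage is undefined for a baseline of 0")
    return 100.0 * (final - baseline) / baseline


# Two hypothetical pupils who both answer 5 more equations correctly.
low_performer = gain_percentage(baseline=10, final=15)
high_performer = gain_percentage(baseline=50, final=55)
print(low_performer, high_performer)
```

The same absolute gain of 5 corresponds to 50% for the pupil starting at 10 and to 10% for the pupil starting at 50, which is the asymmetry the relative measure is meant to capture.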
The experiential variables were measured by a 43-item paper-and-pencil structured questionnaire that was filled out after pupils completed their tutoring session(s) (Appendix A). Indicative and counter-indicative Likert-type items were scored on a 6-point rating scale (1 = totally disagree, 6 = totally agree). The counter-indicative items on the questionnaire were recoded into new variables, after which we calculated Cronbach's α for all scales followed by Principal Component Analysis (PCA). From the remaining items, we calculated Cronbach's α again.
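The recoding and reliability steps can be sketched as follows. This is a minimal illustration with made-up ratings, not the authors' SPSS procedure: the function names are hypothetical and the formula is the standard Cronbach's α based on item variances and the variance of the sum scores.

```python
from statistics import variance


def recode(score, low=1, high=6):
    """Reverse a counter-indicative rating on the 6-point scale (1 <-> 6)."""
    return low + high - score


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    `items` is a list of columns: one list of ratings per item, with
    respondents in the same order in every column.
    """
    k = len(items)
    if k < 2:
        raise ValueError("At least two items are needed")
    totals = [sum(ratings) for ratings in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Made-up ratings of six respondents on one indicative item and one
# counter-indicative item that mirrors it exactly.
indicative = [1, 2, 3, 4, 5, 6]
counter_indicative = [6, 5, 4, 3, 2, 1]
recoded = [recode(r) for r in counter_indicative]
alpha = cronbach_alpha([indicative, recoded])
print(round(alpha, 2))
```

Without recoding, the two example items would correlate negatively and the reliability estimate would be meaningless, which is why the recoding step has to come first.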
Representation. To check the manipulation with the three different Robot Designs, participants rated to what degree they felt the design of their robot represented a human being, an animal, and a machine. All three dimensions were rated for each robot. In addition, they evaluated the Social role of the robot (e.g., a friend, a teacher, etc.).
Bonding was measured with 5 items (i.e. bond, interested, connected, friends, understand). Two examples of indicative items are "I felt a bond with the robot" and "The robot understands me." Cronbach's α = .88.
Anthropomorphism contained 4 items (machine, talk like human, humanlike reaction, humanlike interaction). Two examples are: "It felt just like a human was talking to me" and "I reacted to the robot just as I react to a human." Only these two items were left after psychometric analysis: Spearman-Brown Correlation (r = .68, p = .000).
Perceived realism was based on Paauwe, Hoorn, Konijn, and Keyson (2015) and Van Vugt, Hoorn, Konijn, and De Bie Dimitriadou (2006). The scale had 4 items (real creature, like real, feels fabricated, real conversation), two examples of which are: "The robot resembled a real-life creature" and "It was just like real to me." Psychometric analysis indicated 3 items for sufficient reliability: Cronbach's α = .75.
Perceived relevance was based on Van Vugt, Hoorn, Konijn, and De Bie Dimitriadou (2006) and consisted of four items (important, help, useless, need). Two examples are: "The robot was important to do my exercises" and "The robot is what I need to practice the multiplication tables." With four items, Cronbach's α = .73.
Perceived affordances also was based on Van Vugt, Hoorn, Konijn, and De Bie Dimitriadou (2006) (immediately clear, took a while, puzzled). Two examples are: "I understood the task with the robot immediately" and "The robot was clear in its instructions." Only these two items achieved just sufficient reliability (r = .61, p = .000).
Engagement was included in addition to Bonding and was measured based on two scales by Paauwe, Hoorn, Konijn, and Keyson (2015) and Van Vugt, Hoorn, Konijn, and De Bie Dimitriadou (2006). Engagement was constructed from 5 items (e.g., like, dislike, feeling uncomfortable, fun). Examples are "I like the robot" and "I felt uncomfortable with the robot." Cronbach's α = .79.
Use intentions also was based on Van Vugt, Hoorn, Konijn, and De Bie Dimitriadou (2006). It consisted of 3 items (use again, another time, help again), an example being: "I would use the robot again." Cronbach's α = .63, which is just sufficient for group comparisons.

Principal component analysis
In a 7- and a 5-factor solution, divergent validity of the questionnaire items was weak and the only scale having good measurement quality overall, clearly distinguishable from other components, was Bonding (5 items, Cronbach's α = .88), which will be the experiential measure we use for further analysis. For in-depth PCA analysis, consult Supplementary Materials 1.

Preliminary analyses
To check the Robot Design manipulation, participants rated the extent to which they believed their robot resembled a human, an animal, and a machine (i.e. Human-like, Animal-like, and Machine-like). We ran a General Linear Model Multivariate Analysis (MANOVA) of Robot Design (3) on the Representation ratings of Human-like, Animal-like, and Machine-like. Pupils judged their robots as significantly different in what they represented: The effect of Robot Design on the ratings of Representation was significant (Wilks' Λ = .57, F(6,134) = 7.17, p < .001, ηp² = .24). Significant effects were found for Human-like (F(2,69) = 8.32, p = .001) and Animal-like (F(2,69) = 12.41, p = .000). Thus, the robots did not differ in their machine-likeness but they did differentiate according to their representation of a human being or an animal.
Six two-tailed independent t-tests of Robot Design (Humanoid-Puppy, Humanoid-Droid, and Puppy-Droid) on ratings of Human-like and Animal-likeness showed that Human-likeness of the Humanoid robot (n = 19, M = 3.89, SD = 1.91) was significantly higher than that of Puppy (n = 26, M = 1.88, SD = 1.42) (t(43) = 4.05, p = .000). Human-likeness of Humanoid (n = 19, M = 3.89, SD = 1.91) also was significantly higher than that of Droid Therefore, Humanoid was rated as more human-like and Puppy was more animal-like, whereas for Droid, no differences were significant. Thus, all robots were machine-like with Droid as the starting point, while Puppy added an animalistic and Humanoid a more humanlike impression.
As an extra control on the manipulation, we asked the pupils if they experienced the robot as a classmate, a teacher, a tutor, and other Social Roles. We ran three GLM Multivariate Analyses (MANOVA) of Social Role (Friend, Classmate, Teacher, etc.) on Human-like, Animal-like, and Machine-like as separate dependents so that effects would become significant easily. However, the different Social Roles were not significant for Human-likeness (F(30,246) = .94, p = .563) and had no significant effect on Animal-likeness (F(30,246) = 1.18, p = .246). The different Social Roles were significant for Machine-likeness (F(30,246) = 1.75, p = .012): Between-subject effects indicated that the effect of Teacher (F(5,66) = 2.75, p = .026) and the effect of Machine (F(5,66) = 5.53, p = .000) on Machine-likeness were significant. However, there were six dependent variables in the analysis so that the rejection area α should be corrected, according to Bonferroni (.05 / 6 = .0083). Hence, only the categorization as Machine (F(5,66) = 5.53, p = .000) exerted significant effects on Machine-likeness, indicating that students perceived a machine-like robot indeed as a machine.
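The Bonferroni step amounts to dividing the nominal α by the number of tests and comparing each p-value with the result. A minimal Python sketch follows; the function names are hypothetical, and because the Machine effect is reported only as p = .000, the value 0.0005 below is a stand-in for "smaller than .001", not a reported figure.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def is_significant(p, alpha=0.05, n_tests=1):
    """True if p falls below the Bonferroni-corrected threshold."""
    return p < bonferroni_alpha(alpha, n_tests)


threshold = bonferroni_alpha(0.05, 6)
teacher_survives = is_significant(0.026, n_tests=6)    # reported p = .026
machine_survives = is_significant(0.0005, n_tests=6)   # stand-in for p < .001
print(round(threshold, 4), teacher_survives, machine_survives)
```

With six tests the threshold is about .0083, so an effect at p = .026 no longer counts as significant while one below .001 still does.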
To check on possible confounding effects of non-theoretical variables, we ran a School (2) × Gender (2) ANCOVA on the Baseline score from the pre-test with Age as a covariate (N = 75). The only significant difference was caused by Age (F(1,70) = 4.35, p = .041) (r = .36, p = .002). With age, pupils performed better. School, Gender, and their interaction had no significant effect on Baseline performance. Only as isolated effects, while disregarding omnibus variance, did a two-tailed independent samples t-test show that the mean Baselines of Good Shepherd (n = 48, M = 39.71, SD = 15.85) and Chun Lei (n = 27, M = 32.63, SD = 11.94) significantly differed (t(73) = 2.02, p = .047) in favor of Good Shepherd. Likewise, while ignoring overall variance, the Baseline means of Boys (n = 42, M = 34.07, SD = 13.81) versus Girls (n = 33, M = 41.09, SD = 15.46) significantly differed (t(73) = -2.08, p = .042): Girls did more multiplications correct during the pre-test (not on the post-test after robot intervention as we shall see later). It seems that effects of School and Gender while significant on the detailed level (t-test) were spurious when more factors were added (F-test).
In a School (2) × Gender (2) ANCOVA on FinMSco with Age as a covariate (N = 75), none of the differences were significant. Although in an isolated correlation analysis, Age significantly affected the FinMSco (r = .24, p = .039), this relationship dissolved in the ANCOVA. Probably, the interaction with the robot countered the effect of Age on learning.
In addition, the correlation between Novelty and Fin_min_Base was not significant (r = .187, p = .12). Thus, novelty of the robot did not affect learning.
To explore the effects of the number of tutoring sessions on learning, we ran a number of tests with the factor Sessions (partaking once, twice, thrice). To see whether advancement level and number of sessions had an effect, we ran a GLM Univariate (ANCOVA) of Sessions (3) × Advancement Level (4) on Fin_min_Base with Age as a covariate. Yet, the interaction was not significant (F = .668).
We also conducted a One-way ANOVA of Sessions (participating once, twice, thrice) on Fin_min_Base without other variables involved but still no significant effects were established (F(2,71) = .866, p = .425). More robot-tutoring sessions did not improve learning performance any further.
Although there was not much difference among the groups that took one, two, or three tutoring sessions, we wanted to know how big the learning gain was within each group. We conducted three paired samples t-tests of Sessions on Baseline score versus FinMSco, representing the gain in absolute numbers and in percentages (Table 2). Those who worked once with the robot improved by 8.42 more answers correct (21.20%). Those who did two sessions had a 7.68 improvement (21.73%) compared to Baseline. Those who interacted thrice had a 10.54 improvement (36.83%) compared to Baseline. Although at face value three tutoring sessions seem better, the One-way ANOVA reported above showed that, statistically, the differences among the numbers of sessions were not significant.

Learning effects
H1 expected positive effects of Robot Design on learning with a significant advantage for Humanoid. H2 assumed differences in learning as a function of Advancement Level of the students, the Challenged students gaining significantly more from robot tutoring.
To test H1 and H2, we ran a GLM Repeated Measures of Robot Design (3) × Advancement Level (4) (between-subjects) on the (within-subjects) number of equations correctly solved before (Baseline) and after (Final Score) robot tutoring (N = 75). Note that this was the score in absolute numbers, not the percentage of gain relative to Baseline.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 2 December 2020 doi:10.20944/preprints202012.0061.v1
Our key finding was a significant and moderately strong main before-after effect on the absolute number of multiplications solved correctly (V = .50, F(1,63) = 62.43, p = .000, ηp² = .50). The mean score MFinal = 45.73 (SD = 17.40) was significantly larger than MBaseline = 37.16 (SD = 14.88) (t(74) = 7.19, p = .000), the mean difference being 8.57 equations more solved correctly after one session of robot tutoring, irrespective of Robot Design or Advancement Level. Multivariate tests also showed a significant second-order interaction among Robot Design, Advancement Level, and before-after score (V = .22, F(6,63) = 2.99, p = .012, ηp² = .22). Inspection of the mean scores showed that the largest difference was established for Challenged pupils working with Humanoid (MBaseline = 16.33, SD = 6.03; MFinal = 41.67, SD = 17.93) and a small reverse effect was found for Advanced pupils working with Droid (MBaseline = 69.33, SD = 5.52; MFinal = 68.00, SD = 18.61). A paired-samples t-test, however, showed that the effect for Challenged pupils working with Humanoid (n = 3) was not significant (not even preceding Bonferroni correction): t(2) = 3.51, p = .072; probably due to the large SDs and lack of power. No other main or interaction effects were significant (Supplementary Materials 1) except for the main effect of Advancement Level, which obviously was a trivial finding. H1 and H2 were refuted for learning gain in absolute numbers of correctly answered multiplications.

Learning gain (difference scores)
GLM Repeated Measures accounts for multiple sources of variance and is therefore the strictest test on our hypotheses. To assess if nothing was gained at all from Robot Design or Advancement Level, we included fewer sources of variance in our analysis from the reasoning that if lenient tests do not render significant effects either, we can dismiss Robot Design and Advancement Level from our theorizing altogether. Therefore, we calculated the difference score as Final Mean Score (FinMSco) − Baseline Score = Final_minus_Baseline (Fin_min_Base). Whereas 64 pupils gained from robot tutoring, there were 11 (about 15%) who did not perform better but worse after robot interaction (Fin_min_Base = -1 to -35). Ten of the worse performers came from the categories Below Average and Challenged, the remaining one coming from Advanced.
We conjectured that perhaps certain Robot Designs exercised negative effects on learning. Therefore, we reran the analyses on the group that performed worse after robot tutoring. However, Robot Design and School again did not exert significant effects on Fin_min_Base. In all, schools, gender, and robot designs neither improved nor worsened the children's learning as measured through the difference scores.
For the 64 children (about 85%) that did show learning gains after robot intervention, we ran a paired samples t-test on Baseline versus FinMSco to see how much those children gained. The difference between Baseline (n = 64, M = 37.98, SD = 1.91) and FinMSco (n = 64, M = 49.14, SD = 2.05) was highly significant (t(63) = -11.20, p = .000). On average, those who learned from the robot did over one-third better compared to Baseline. Although most children learned significantly from robot tutoring, the various robot designs did not significantly differentiate the learning effects, therefore countering H1.
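For readers less familiar with the paired-samples t-test, the statistic can be computed from the per-pupil difference scores as sketched below. The data are invented for illustration and the function name is hypothetical; the sign of t depends only on the direction of subtraction (here final minus baseline, so a gain yields a positive t).

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Paired-samples t statistic and degrees of freedom.

    Differences are taken as after - before, so gains give t > 0.
    """
    if len(before) != len(after):
        raise ValueError("Both score lists need one entry per pupil")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Invented pre- and post-test scores for four pupils.
baseline_scores = [10, 20, 30, 40]
final_scores = [12, 25, 33, 48]
t_value, df = paired_t(baseline_scores, final_scores)
print(f"t({df}) = {t_value:.2f}")
```

Swapping the two arguments flips the sign of t but leaves its magnitude and the p-value unchanged, which is why the same effect can be reported with a positive or a negative t.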
Although Robot Design did not exert significant effects on learning, perhaps the experience of the design as Human-like, Animal-like, or Machine-like would, allowing yet another chance for H1 to come to expression; albeit in a more perceptual way. To check the effects of the children's perceptions of their robot on learning, we did regression analysis of Human-like, Animal-like, and Machine-like on Fin_min_Base. However, no significant relationship was established (Human-like: t = -.47, p = .640; Animal-like: t = -.52, p = .610; Machine-like: t = -.50, p = .620). Also with Gain percentage as dependent (Table 1: Per_Fin_min_Base), significant effects remained absent (Human-like: t = -.26, p = .800; Animal-like: t = -1.16, p = .250; Machine-like: t = -.71, p = .480). Combined with the results from the section on Learning effects, students perceived the robot as we expected but their perception had no effect on learning; not in absolute numbers of correct answers and not as a percentage of improvement from the Baseline. Although overall learning gains were achieved, the design of the robot embodiment or what it represented to the children did not matter, rejecting H1.
For H2 on Advancement Level, we ran a One-way ANOVA of Advancement Level on the difference score Fin_min_Base but none of the effects were significant (F(3,71) = 1.58, p = .202). No matter how well or poorly children performed initially, it did not affect their learning gain on average.

Summary of findings for learning
1. Prior to robot intervention, pupils performed better with age and girls did better on baseline performance than boys. After 5 minutes of robot interaction, these differences disappeared.
2. Most children (85%) learned from the robot; a small group (15%) performed worse.
3. Those who learned from the robot had an average gain of more than one-third after tutoring.
4. The weakest students that gained from robot tutoring did so in percentage of gain (90%), not in absolute numbers, compared to their earlier achievements.
5. Neither school, gender, design of the robot, the number of times these children were tutored, nor the experience of novelty of the robot was influential for learning through robot tutoring.

Experience
Although we had a range of psychometric scales on our questionnaire to measure dimensions of affect (i.e. Engagement, Bonding, Anthropomorphism, Perceived Realism, Relevance, Perceived Affordances, and Use Intentions), none but Bonding achieved convergent and divergent measurement reliability (Supplementary Materials 1). Therefore, we decided to work with the only clear-cut case we had, Bonding, and not make ad-hoc decisions.
H3 expected that emotional bonding with the robot would positively affect the learning outcomes in a mediating or moderating way. To examine H3, we ran the previous GLM Repeated Measures again of Robot Design (3) × Advancement Level (4) (between-subjects) on the (within-subjects) number of equations correctly solved before and after robot tutoring but now with mean Bonding as the covariate. However, mean Bonding exerted no significant main or interaction effects on the multiplication scores and the earlier pattern of results was not altered (Supplementary Materials 1).
To let the presumed relation between bonding and learning show up more easily, we ran a two-tailed bivariate correlation analysis between MBond and Fin_min_Base (r = .007, p = .951) and between MBond and Per_Fin_min_Base (r = -.076, p = .531). Yet, neither was significant.
Therefore, H3 was rejected. Bonding tendencies were independent of the design of the robot and the advancement level of the children. The level of bonding with a robot tutor did not seem to have any substantial correlation with learning, neither in absolute numbers nor in relative gain.
To check if any of the non-theoretical variables would affect the level of learning and bonding, we conducted GLM Multivariate Analysis (MANOVA) of Robot Design (3) × Advancement Level (4) × School (2) × Gender (2) on Fin_min_Base and MBond and on Per_Fin_min_Base and MBond with Age, Novelty, and Aesthetics as covariates. The following results were obtained:
1. The interaction of Robot Design × School × Gender on Fin_min_Base (F(1,30) = 6.44, p = .017) was significant. However, earlier we showed that none of the contrasts in the factors Robot Design, School, and Gender were significant so that 1. can be considered a false positive.
2. The interaction of Robot Design × School × Gender on Per_Fin_min_Base (F(1,30) = 9.56, p = .004) was significant. To scrutinize the contrasts of the factor Robot Design, we ran three independent samples t-tests on Per_Fin_min_Base. Yet, none of the differences were significant (Humanoid-Puppy: t(43) = .14, p = .89; Humanoid-Droid: t(44) = 1.03, p = .31; Puppy-Droid: t(51) = 1.18, p = .24). Additionally, neither the difference between School (t(70) = -1.23, p = .22) nor that between Gender (t(70) = .13, p = .90) was significant. We therefore conclude that the significant F-value for 2. came from the accumulation of noise in the contrasts.
3. The interaction of Robot Design × Advancement Level on Per_Fin_min_Base (F(6,30) = 4.15, p = .004) was the product of 4. and 5.
4. The main effect of Robot Design on Per_Fin_min_Base (F(2,30) = 6.06, p = .006) was significant but as said in 2., the contrasts of the factor Robot Design were not, so that the inconsistency between ANOVA and t-test indicates the propagation of noise from a set of non-significant contrasts, resulting in a false positive for the F-value.
5. The main effect of Advancement Level on Per_Fin_min_Base (F(3,30) = 4.12, p = .015) was significant. As shown earlier, Per_Fin_min_Base decreased with the increase of Advancement, which was due to the group we regarded as Challenged.
6. The only significant effect that included Bonding was that Aesthetics covaried with MBond (F(1,71) = 13.21, p = .001): A robot experienced as 'prettier' raised stronger bonding tendencies.

Effects on Bonding
We ran a Univariate Analysis of Variance (ANOVA) of Robot Design and Advancement Level directly on mean Bonding. Not all children who took the multiplication test also filled out the questionnaire, therefore N = 70. The intercept was significantly different from zero so that Bonding tendencies did occur (F(1,58) = 194.76, p < .001, ηp² = .77). However, none of the main effects or the interaction was significant (F < 1) (Supplementary Materials 1). Neither Robot Design nor Advancement Level exerted significant effects on Bonding.
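As a plausibility check on the effect size reported with the intercept test (.77), the following minimal sketch recovers a partial eta squared from the reported F-value and degrees of freedom. It assumes that the reported value is indeed a partial eta squared and uses the standard conversion from F; the function name is ours and is not part of the original analysis.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values as reported for the intercept: F(1,58) = 194.76
eta_p2 = partial_eta_squared(194.76, df_effect=1, df_error=58)
print(round(eta_p2, 2))  # 0.77
```

The recovered value agrees with the reported effect size, which suggests that the statistic and its degrees of freedom are internally consistent.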
As an extra exploration, we conducted an ANOVA of Robot Design (3) × Advancement Level (4) × School (2) × Gender (2) on the grand averages of MBond, showing that only the difference in School was significant (F(1,34) = 4.57, p = .04). We ran an independent samples t-test of School on MBond, showing that Bonding at Good Shepherd was significantly higher than at Chun Lei (t(68) = 2.99, p = .004). Theoretically, this is an irrelevant finding.
We then ran three t-tests with Sessions as the grouping variable (once-twice, once-thrice, twice-thrice). The differences in MBond between Once and Thrice and between Twice and Thrice were not significant (Once-Thrice: t(54) = 1.31, p = .20; Twice-Thrice: t(20) = .97, p = .34). However, the difference between Once and Twice was significant for MBond (Once-Twice: t(60) = 3.01, p = .004), even if α was Bonferroni-corrected to .017. Apparently, mean Bonding decreased upon the second encounter (MBond1 = 3.60, SD = 1.64; MBond2 = 2.19, SD = 1.70), which was due to Chun Lei pupils alone. The insignificant difference with those encountering the robot thrice might indicate a ceiling effect.
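The Bonferroni reasoning for the three Sessions comparisons can be sketched as follows. This is an illustration only: it assumes a familywise α of .05 divided over the three pairwise tests, and it uses the p-values as reported in the text, not the raw data. The variable names are ours.

```python
# Familywise alpha (assumed .05) divided over the number of pairwise comparisons
FAMILYWISE_ALPHA = 0.05

# p-values as reported in the text for the three Sessions comparisons on MBond
reported_p = {
    "Once-Twice": 0.004,
    "Once-Thrice": 0.20,
    "Twice-Thrice": 0.34,
}

corrected_alpha = FAMILYWISE_ALPHA / len(reported_p)

# A comparison survives the correction only if its p-value is below the corrected alpha
significant = {pair: p < corrected_alpha for pair, p in reported_p.items()}

print(round(corrected_alpha, 3))  # 0.017
print(significant)
```

Under these assumptions, only the Once-Twice comparison remains significant after correction.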
We wondered if the high bonding upon first encounter was due to a novelty effect, wearing off after multiple encounters. Therefore, we correlated MBond with Novelty and found that the correlation was significant but not very strong (r = .31, p = .01). Children from Chun Lei saw the robot more often so that less novelty may have led to lower rates of bonding. MBond also correlated with Aesthetics (r = .56, p < .001), indicating that the experience of 'prettier' was associated with stronger bonding tendencies, as supported by the covariance analysis earlier on.
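To make "significant but not very strong" more concrete, the reported correlations can be squared to obtain the proportion of variance that two measures share. The sketch below uses only the r-values reported in the text; reading r² as shared variance is a standard interpretation and not an additional analysis by the authors.

```python
# Correlations with MBond as reported in the text
reported_r = {"Novelty": 0.31, "Aesthetics": 0.56}

# Squared correlation = proportion of variance shared with MBond
shared_variance = {name: round(r ** 2, 2) for name, r in reported_r.items()}

for name, r2 in shared_variance.items():
    print(f"{name}: r^2 = {r2}")
```

Novelty thus shares roughly a tenth of its variance with Bonding, whereas Aesthetics shares about three times as much.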

Summary of findings for experience
With respect to the experience of the robot tutor as a social entity, we found that:
1. The pupils perceived the robot as intended (manipulation successful).
2. The social role they attributed to the robots had no significant effect on their perceptions of human, animal, or machine-likeness, except that the role of 'machine' indeed raised significant machine-likeness, which is a trivial finding.
3. From a design perspective, the Bioloids to these children were basically all machines like Droid, while Puppy added animal-like features to that basic frame and Humanoid added human-like features to it. However, type of robot (humanoid, animal, or machine) did not affect the bonding tendencies.
4. Only the Bonding scale was psychometrically reliable and all other measures for these children seemed to be related to that experience or were confusing.
5. Bonding had no significant relation with learning gains. In 5 minutes of robot training, children improved their skills irrespective of the quality of the established relationship.
6. The Good Shepherd children experienced more bonding with their robot tutor than Chun Lei pupils, maybe owing to a novelty effect.
7. Stronger perceptions of the robot's attractiveness ('beautiful') were associated with stronger bonding tendencies.

Discussion and Conclusions
We found that 5 minutes of robot tutoring improved the learning of multiplications irrespective of the design of the robot or the advancement level of the pupils. This result counters our hypothesis H1 that a more anthropomorphic design would enhance performance. It also counters H2 on different effects for advancement level when dealt with as the absolute number of equations solved correctly. H2 is confirmed when seen as the relative gain pupils get from robot tutoring as compared to their earlier achievements; then, the more challenged children (n = 10) gain relatively more than the others. H3, that the child learns more while developing a stronger emotional bond with the robot tutor, was disconfirmed. While rehearsing multiplication equations in this study, learning and bonding seemed to be two different strands of processing, both happening, but not affecting each other significantly.
Thus, our conclusion is very straightforward: Apparently, children improved on the multiplication tables with 5 minutes of exercise with a robot; more sessions were unnecessary. Initial differences between gender, age, or school disadvantages were compensated for and the novelty of the method had no significant effect on learning. The type of robot or its social role (teacher, peer, friend) also did not matter (cf. Onyeulo & Gandhi, 2020): A more human-like machine did not improve performance, a teacher role was no better than a peer, and the level of emotional bonding of the child with the tutoring machine (because it is new and beautiful) made no difference for their learning outcomes. This is good news for teaching practice (cf. UN, 2020) because cheap and simple robots of whatever kind may help the larger part of pupils gain more than 33% better scores with little time investment. The weakest pupils should be treated with caution because many may have a 90% progress but some challenged and under-average children may be set back by robot tutoring. For different reasons, challenged as well as certain advanced students can be easily distracted and may experience learning difficulties (e.g., Beckmann & Minnaert, 2018).
The theory of affective bonding (Konijn & Hoorn, 2017;2020a) was not supported. For the children, all the different conceptualizations of affordances, relevance, realism, and anthropomorphism seemed to be diffuse except for the notion of bonding ("I felt connected to the robot") and such bonding may be present but was not influential for rational performance.
Robots are not human beings (cf. Onyeulo & Gandhi, 2020). It may be that a warm relationship with a human teacher makes a child want to work harder and may improve its social-emotional development (e.g., Frymier & Houser, 2000;Hattie, 2015;Skinner & Belmont, 1993;Hamre & Pianta, 2001). Yet, for a simple drill like quickly practicing multiplications with a little robot, warm relationships did not seem to be necessary in our case, perhaps because the interaction was so short. According to Serholt and Barendregt (2016), it may be that children do not develop bonds with robots in the human sense but engage in a different sort of relationship and what that is, needs to be found out.
Our work does coincide with the results of Hindriks and Liebens (2019) that social behavior during a maths task is not conducive to learning. Moreover, for certain challenged pupils, the effects we found were even counter-productive. It seems that matching the robot's appearance with its task is insignificant despite some individual preferences for specific robot appearances in some tasks (Li, Rau, & Li, 2010;Imai, Ono, & Ishiguro, 2003;Mutlu, Forlizzi, & Hodgins, 2006;Konijn, Smakman, & Van den Berghe, 2020). Our robots were successful at maintenance rehearsal and repeated exercise (e.g., Wei, Hung, Lee, & Chen, 2011;Huang & Hoorn, 2018) and during the remedial teaching of a strongly rational task, the bonding aspects of the robot appeared to be unimportant.
A strong point of our study is the comparability of the three robot designs. It is quite hard to compare existing factory robots of a different make and tell which design elements are responsible for the differences in user responses. The basic design, materials, and general appearance of our robots were similar but differentiated in representation: It is a rather unique finding that the children recognized the basic design of all three robots as a machine with human features added for the humanoid and animal characteristics for the puppy. Unexpectedly, these representational variations were not conducive to learning, which brings us to the limitations of this study.
Field studies add to ecological validity and plausibility yet at the cost of methodological soundness. The time schedules of schools and parents left us with 75 children that could participate in but one session so the insignificant progress after the second and third session may have been due to a lack of power. Also, effects of the advancement level (weaker-stronger pupils) may have been disturbed by the small numbers in a cell. Working with children in itself already yields noisier data than with adults, which may have drowned some effects of taking multiple sessions, the mix-up of psychometric constructs (e.g., anthropomorphism, realism), or the effects on bonding. It may be argued that 5 minutes of interaction is too short to become attached to a machine.

Future outlook
Due to severe budget cuts and fewer teachers, education faces a lack of human resources to serve an increasingly larger number of pupils with a wider variety of individual needs. Owing to changes in care systems (in Europe), children with special needs are integrated in regular rather than special schools (e.g., Mader, 2017; for the situation in Hong Kong, see Lee, Yeung, Tracey, & Barker, 2015). Migration causes new mixes of children with diverse backgrounds, cultural and educational differences. The current pandemic asks for novel teaching solutions to make up for learning loss (UN, 2020). These transitions demand ways of teaching that differ from class-wise instructions (ibid.). As is, the teaching level converges to the middle whereas children learn most if instruction matches their level of proficiency (Leyzberg, Spaulding, & Scassellati, 2014).
Social robots may provide support, which probably has far-reaching implications for classroom instruction and organization. For example, repetitive tasks may be performed by the robot while the teacher focuses on special cases or develops and teaches advanced topics. This actually asks teachers to recalibrate their profession. In the near future, teachers may have to consider working in teams that also consist of synthetic colleagues. However, before the role of this new robot colleague can be outlined, we have to understand how a robot's (limited) capabilities can match the teaching needs of pupils but also of teachers. In this respect, moral deliberations on robots in education should be proliferated (e.g., .
Our results suggest that the robot does not have to be fancy in looks or behavior to help the child increase its performance quickly in arithmetic rehearsal tasks. In this study, weak pupils benefited strongly from robot instruction with the exception of a few challenged children. Robot teachers in motion pictures and comic books do not have to remain mere science fiction. Educators and parents may apply a simple and cheap machine equipped with the proper software to make up for knowledge deficits and gaps in the learning process without having to fear the lack of face-to-face interaction. That makes robot tutoring feasible in times of a COVID-19 pandemic.
Supplementary Materials: Technical Report S1: Bioloids, multiplication TechRep, Software S1: Bioloids, code, Audio S1: Bioloids, audio files. Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.