Implementing Government Elementary Math Exercises Online: Positive Effects Found in RCT under Social Turmoil in Chile

Abstract: The impact of online math programs depends on their implementation, especially in vulnerable populations from developing countries. An existing online platform was adapted, at the request of the Chilean Ministry of Education, to exclusively include exercises previously designed and tested by a paper-based government program for elementary school. We carried out a cluster-randomized controlled trial (RCT) with 50 fourth grade classrooms. Treatment classrooms used the platform in a weekly 90-min math session. Due to an outbreak of social instability in the country, a large unexpected disruption with huge absenteeism occurred in the second half of the semester, which turned this study into a unique opportunity to explore the robustness of the platform's effects on students' learning. Using multiple imputation and multilevel models, we found a statistically significant effect size of 0.13, which corresponds to two extra months of learning. This effect is meaningful for four reasons. First, it is double the effect of the paper-based version. Second, it was achieved during one semester only. Third, it is half that obtained with the platform for a complete year with its own set of exercises and with two sessions per week instead of one. Fourth, it was attained in a semester with a lot of absenteeism.


Introduction
There is ample evidence to suggest that education has not dramatically changed over recent centuries. Even after the introduction of textbooks, students continue to spend their class time primarily listening to lectures and taking notes. Why does education seem so immune to transformations? Labaree [1] argues that education is a far more complex domain than other areas. For example, he compares a typical nuclear power facility with a school. Since every component of a nuclear facility is causally interrelated with the others, it is much easier to trace the source of any deficiencies and fix them accordingly. Schools, conversely, are composed of completely independent units: isolated classrooms. If one classroom performs well, it does not immediately produce an effect in parallel classrooms. Superintendents and principals generally track mean performance across classrooms, and, on average, good and bad performances cancel each other out. As a whole, a school therefore remains highly stable.
However, after several decades of experimental studies introducing ICT for math teaching and learning in K-12, there is still a wide range of impacts. For example, a meta-analysis of 71 evaluations in the United States reported effects by time of use [2]. The study shows that for evaluated programs where students spent less than 30 min a week, the average effect was 0.06 SD; where students spent between 30 and 75 min, it was 0.20 SD; finally, where students spent more than 75 min, contrary to what one would expect, the result was 0.14 SD. In a more recent study, [3] reports 14 studies that strongly emphasize the use of technology. Most of them rotate students through technology and non-technology activities. The weighted mean effect size was +0.07. A 2019 study in 26 municipalities in Sweden [4] found no significant impact, on average, of an ICT program on standardized tests in mathematics or language, but noted that it could increase inequality in education. Further, a systematic review of 85 independent evaluations found that shorter ICT programs were much more effective in promoting mathematics achievement than longer ones, with a mean effect size of 0.35 SD [5]. The theoretical framework for the study of ICT in schools highlights the importance of the implementation process and the context in which this implementation is situated [6]. The integration and final adoption of technological tools rely heavily on these factors. This framework has been supported by empirical evidence of the effect of practice with immediate feedback from peers and teachers and the inclusion of written justifications for math problems [7,8]. In developing countries, results are also diverse.
A review of experimental evaluations in developing countries focused on mathematics [9] reported effect sizes ranging from 0.14 SD for programs with 80 min of weekly computer time in China, to 0.35 SD for a program with 120 min of weekly practice in India, and 0.28 SD for another program with 300 min spent using computers during after-school sessions. However, another 300 min per week program in India had an effect size of −0.48. Cristia et al. [10] and Beuermann et al. [11] studied a randomized experiment with a 1:1 program in poor regions of rural Peru and found no significant impact on test scores in mathematics or language. De Melo, Machado, and Miranda [12] found no effects on math or reading scores in the national implementation of a 1:1 program in primary schools in Uruguay. This wide variety of effect sizes points to a possible strong dependence on how the programs are implemented.
According to the UNESCO 2013 TERCE assessment, Chile has the highest national average in 6th grade mathematics in Latin America [13]. However, the Programme for International Student Assessment (PISA) test for fifteen-year-old students places Chile 59th out of 78 participating countries. Further, its score is not statistically significantly different from the scores of countries such as Kazakhstan, Moldova, Baku (Azerbaijan), Thailand, Uruguay, and Qatar [14]. Araya et al. [7] present evidence and theoretical reasons to back the claim that guided technology programs focused on practice can be effective, efficient, and relatively easy to scale up in the Chilean context. In [15], the use of the same platform with the originally designed platform exercises was reported. The effect was computed in 15 fourth grade classes from 11 vulnerable Chilean schools where the platform was used during the full educational year. Measured with the National Standardized test, a paper-based assessment implemented yearly in all schools by an independent government agency, the improvement over previous years was 0.26 SD higher than the national improvement.
Later, Araya et al. [16] reported the results of three years of implementation of the same platform in 11 public schools from a low SES urban district in Chile. This included 43 fourth grade classes and 1355 students. Improvement over previous years on the National Standardized fourth grade math test was 0.28 SD higher than the improvement made by a neighboring district with a similar population. Next, in [17], eight years of use of the same platform and exercises were analyzed. The authors found that on the national standardized test, the 80 classes that were under treatment obtained results 0.30 SD higher than the 32 classes that were not treated. In a more recent study, Araya et al. [18] experimentally evaluated the platform with an RCT with 48 classes from 24 low-performing primary schools in Chile, where at each school a class was randomly assigned to treatment. It was implemented with two weekly sessions in a computer lab during the whole educational year. The impact was measured with the Chilean National Standardized Exam by using a multilevel model, and a positive effect on math learning of 0.27 SD was found.
Moreover, in 2011 and 2012 the Ministry of Education ran a paper-based implementation of the "Plan de Apoyo Compartido" (PAC) program. This is a standardized teaching material program that included the support of internal and external pedagogic teams. It was implemented in under-performing schools in Chile. In [19], Bassi et al. conducted an RCT to estimate the effectiveness of PAC. The intervention improved performance in math for the first cohort of students (a statistically significant effect size of 0.068), but not for the second cohort. Thus, in this paper we study the impact of exclusively using PAC exercises in the ConectaIdeas platform, instead of the standard ConectaIdeas exercises that were previously designed and tested. According to Bowen [20], the need for customizable platforms that allow teachers to customize materials is perhaps the largest obstacle to widespread adoption of interactive online learning. This study can help determine whether the effect is due to the exercises or to their implementation in an online platform.
Hill et al. [21] reviewed experimental evaluations in education in the US and documented that the average effect on broad standardized tests was 0.07 SD, compared to an average effect of 0.23 SD for narrow standardized tests and 0.44 SD for specialized tests developed for specific interventions. According to Cheung et al. [22], effect sizes are roughly twice as large for published articles, small-scale trials, and experimenter-made measures than for unpublished documents, large-scale studies, and independent measures, respectively. In addition, effect sizes are significantly higher in quasi-experiments than in randomized experiments. Moreover, across seven WWC-accepted math studies, the mean effect size was +0.45 for treatment-inherent measures and −0.03 for measures used in the same studies that were not inherent to the treatment [22].
In this paper, we explore the use of an online platform in an unforeseen environment. First, instead of using the exercises originally designed and improved for the platform, this study implements paper-based exercises designed by the Chilean Ministry of Education. These valuable exercises are the product of an extensive compilation, updated and upgraded in a previous program developed by the Ministry of Education. Moreover, this upgraded program and its exercises have been previously studied [19] and showed positive results in math during the first year of implementation, but not in the second year. According to [19], a possible explanation for this decline is the decrease in rigor of implementation compared to the first year. The first research question is therefore whether the effect size is maintained, or ideally increased, in the online version. Second, from the middle of the semester until the end of the implementation, a major outbreak of social unrest shook the country. Several schools closed due to teacher strikes and social unrest. Thus, the second research question aims to determine whether the online platform could still impact student learning under this unstable condition, which involved a very high level of student absenteeism, and how much the effect eroded compared with previous evaluations of the same online platform in math for fourth graders.
In this study, we used a large-scale test to estimate the impact of the program on students' outcomes. The main contribution of this paper is to measure the effect of a platform when a completely new set of exercises is used exclusively or, from another point of view, when an existing set of materials with exercises is delivered through an online platform. Moreover, the social turmoil during the second half of the implementation period had a huge impact on attendance, which turned this study into a rare opportunity to estimate the robustness of the effect of the intervention under difficult contextual circumstances. Missing data was one of the main unexpected challenges caused by the social turmoil. This paper illustrates the application of multilevel multiple imputation models to deal with missingness in the outcome variable, together with the use of multilevel regression models to estimate the program effect size. Finally, we analyzed the effect of including at least one open-ended question in each session, with written answers and peer review.

Sample and Implementation
According to [23], a minimum of 30 to 60 teachers or schools is necessary in order to have sufficient statistical power to detect at least medium-sized effects and allow for attrition. In this study, we purposively recruited 50 fourth grade classrooms from underserved schools in several Santiago districts in July 2019; 77 percent of the participating schools are publicly funded, and 23 percent are voucher schools. Selection criteria included: time and space allocations for the use of ICT for math teaching; school administrators who are open to the implementation of ICT; willingness of teachers to engage in the use of ICT and to accept classroom visits; and school technological infrastructure, including an internet connection. Half of the schools were randomly assigned to the treatment. A total of 659 students participated in the treatment and 538 were part of the control group. Participating students were, on average, between 9 and 10 years old. Participating teachers had no previous experience with the ConectaIdeas platform. Teachers in the treatment group were assigned to an initial training where they were introduced to the platform and the final objectives of the project. The implementation team secured official written consent for participation, classroom observation, and data collection from administrators and legal guardians of participating students.
Given the short time available for the pilot implementation (from August to November), the random assignment was performed before results from the pretest were available. The pretest results became available almost a month after the last classroom completed the test in mid-August. To allow sufficient implementation time, the program started right after the baseline was measured. Thus, the balance between the treatment and control groups was verified later. As illustrated in Table 1, the control group had slightly lower results, a difference of 0.04 standard deviations in the pretest. However, this difference was not statistically significant (t(1195) = 0.783, p = 0.433).
Previous studies show that using technology that allows for immediate feedback in the classroom can have a positive impact on students' outcomes, particularly in math and science [24,25]. The ConectaIdeas platform was designed to drive the progression of the classroom as a whole, and not to leave students by themselves. It provides a real-time early alert system that lists students who are having more difficulties during the session and promotes cooperation among peers. Thus, using the automatic early alert system, teachers can assist those students or "assign" them to students who are ahead of their peers (i.e., students who finish early and perform well) for help. Students being assisted by peers can in turn evaluate the quality of the support. This system allows teachers and lab coordinators to work with both types of students in order to improve their understanding of the content, as well as their communication skills.
The platform also detects whether there are exercises that are proving difficult for the whole class, permitting teachers and lab coordinators to freeze the system and explain the necessary concepts. All exercises are related to specific Learning Objectives of the National Curriculum. In this implementation, all the exercises of the PAC program were assigned to those specific Learning Objectives. After each session, teachers receive a report of the Learning Objectives coverage. A particular feature of the implementation on this platform is that it promotes reflection and written argumentation, as well as peer review of those written arguments. In each session, teachers ask at least one open question. Students answer on their devices, and then they have to review the answers of a randomly assigned classmate. Implementation started in mid-August and lasted until the first week of December, but for 2 to 4 weeks in late October the average answer length was zero, given that classes were interrupted due to the social turmoil. As shown in Figure 1, the average length of students' answers to math problems ranged between 9 and 16 words and, excluding December, when there was just one week of implementation, this average showed a positive growth trend. Moreover, the difference between the mean answer length in August and November was statistically significant (t(892) = 4.184, p < 0.001).
Class attendance of each student has a direct effect on students' learning, but it can also affect classmates. According to Gottfried [26], chronic absenteeism (missing more than 10% of school days) has a damaging effect on students and a potential negative spillover that reduces outcomes for other students in the same classroom. This negative spillover effect responds to the paradigm of teachers' time and instruction being a public good [27], i.e., something that is 'consumed' by all students in the classroom. Thus, greater chronic absenteeism produces a big disruption in instruction, and then it consumes the efforts and time of teachers attending to those students when they return.
The implementation was led by an experienced teacher who supervised two new lab coordinators. In each treatment classroom, a lab coordinator supported or led the sessions. Both lab coordinators were elementary school teachers who had never worked on the platform before, and who were trained on the job during the sessions of the first week. Each lab coordinator visited 12 or 13 classrooms each week and provided on-the-job training to participating teachers. The treatment session lasted 90 min, and it took place in one of the regular weekly math sessions. Thus, there was no increase in instructional time. The overall mean number of math exercises completed by students was 364, September being the month with the highest average number (105). The reduction in the number of math problems during the following months (October and November) can be explained by the social turmoil and its impact on student attendance. In fact, the mean number of math problems completed by students in September was significantly higher than in October (t(1252) = 5.7, p < 0.001) and November (t(1145) = 9.9, p < 0.001). As previously discussed, during December, the program was only implemented for one week, which explains the lower number of exercises, as can be seen in Figure 2.
Student attendance was measured before and during the implementation. As shown in Figure 3, both treatment and control groups followed a similar pattern of mean monthly attendance. The average attendance was lower during the second academic semester (August-December) for both groups, with a mean difference of four missing days per month. The reduction between the first and second semester was significant for the treatment group (t(1957) = 17.1, p < 0.001) as well as the control group (t(847) = 12.2, p < 0.001), and the treatment group had, on average, a lower attendance level than the control group in the second semester. Moreover, on average, 11 percent of the students in a classroom missed more than 10 percent of school days during the second semester. Following Gottfried [26], we anticipate this having a negative impact on students' outcomes.
Finally, the implementation was carried out meeting the following criteria [3]:

1. Students who qualified for special education services but attended mainstream mathematics classes were included.
2. Random assignment to treatment and control.
3. Control groups used an alternative program already in place, or "business-as-usual".
4. The treatment program was delivered by ordinary teachers, not by the program developers, researchers, or their graduate students.
5. Pretest differences between experimental and control groups were less than 25% of a standard deviation. Indeed, the difference was just 4% of a standard deviation.
6. Differential attrition between experimental and control groups from pretest to posttest was 10%, which is less than the limit of 15% suggested in [3].
7. Assessments were not made by developers of the program or researchers. They were designed and administered by a regular provider of the Ministry of Education, the one with the most experience in the country, which is also a test provider for the UNESCO ERCE 2019 [28] assessment for Latin America.
8. The study had more than two teachers and 30 students in each condition. Indeed, there were 18 teachers in the Treatment Group, another 18 teachers in the Control Group, and a total of 1197 students.
9. The study lasted more than 12 weeks.
10. Additionally, the intervention in the treatment group took place in regular class hours, not in extra supplementary time.
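The chronic absenteeism threshold used in this section (missing more than 10% of school days [26]) can be turned into a classroom-level share with a few lines of code. The following is a minimal sketch only: the record layout, the function name, and the toy values are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict

# Threshold from Gottfried [26]: chronic absence = missing MORE than 10% of school days.
CHRONIC_THRESHOLD = 0.10


def chronic_absence_share(records, threshold=CHRONIC_THRESHOLD):
    """Return {classroom_id: share of chronically absent students}.

    `records` is an iterable of (classroom_id, days_absent, days_enrolled)
    tuples, one per student. This layout is an assumption for illustration.
    """
    totals = defaultdict(int)
    chronic = defaultdict(int)
    for classroom, days_absent, days_enrolled in records:
        totals[classroom] += 1
        # Strict inequality: exactly 10% absent does not count as chronic.
        if days_enrolled > 0 and days_absent / days_enrolled > threshold:
            chronic[classroom] += 1
    return {c: chronic[c] / totals[c] for c in totals}


# Made-up toy records: (classroom, days absent, days enrolled).
toy_records = [
    ("A", 2, 80), ("A", 9, 80), ("A", 8, 80), ("A", 0, 80),
    ("B", 20, 80), ("B", 1, 80),
]
shares = chronic_absence_share(toy_records)
```

Averaging these classroom shares over all classrooms would give a figure comparable to the 11 percent reported above, assuming the same definition of school days.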

Analysis
A third-party test, the Item Response Theory (IRT) calibrated SEPA test, was used as our pre- and post-intervention outcome measure. SEPA was developed by MIDE UC, Universidad Católica de Chile, and measures the progress of student learning throughout the school year across a set of tests, based on the Chilean curricular framework from 1st to 11th grade in Language and Mathematics. SEPA defines a Reference Sample (RS) to better represent the national distribution of schools according to dependency, socioeconomic level, and school performance (in the National Standardized Test, SIMCE), and then fits an IRT model using the RS to estimate a standard score for each student. The fourth grade SEPA test has been validated and has a reliability of 0.91 [29].
Given the nested structure of the data, a three-level Hierarchical Linear Model (HLM) was specified to explore the effect size of the intervention on students' academic achievement. HLM is commonly used in education research, mainly because aggregation levels must be taken into account in order to understand the differences observed between students [30], which in turn allows an unbiased significance test [31]. Following HLM procedures, an unconditional model (Equation (1)) was first estimated to clarify whether HLM is appropriate for the data:
$$Y_{ij} = \gamma_{0} + \mu_{j} + \varepsilon_{ij} \qquad (1)$$
where $Y_{ij}$ is the estimated student outcome for the ith observation in the jth group (class/school level), $\gamma_{0}$ is the unobserved overall mean, $\mu_{j}$ is the unobserved random effect shared by all values in group j, and $\varepsilon_{ij}$ is the student-level residual term. The intraclass correlation (ICC) is 0.149; in other words, 15% of the total variance of the SEPA math test is explained by classroom differences. Further, 2.8% of the total variance is explained by school differences. Based on these results, and following the guideline of considering clusters with an ICC as low as 0.01, we included school and classroom clusters in our three-level model [32].
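As an illustration of how the variance shares above relate to the variance components of a multilevel model, the following minimal sketch computes each level's share of the total variance and applies the 0.01 clustering guideline. The function name and the variance components are hypothetical: the components were chosen only so that the shares match the reported 0.149 and 0.028, and they are not the study's estimates.

```python
def variance_shares(var_school, var_class, var_student):
    """Share of total outcome variance attributable to each level."""
    total = var_school + var_class + var_student
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {
        "school": var_school / total,
        "class": var_class / total,
        "student": var_student / total,
    }


# Hypothetical variance components (illustrative only).
shares = variance_shares(var_school=2.8, var_class=14.9, var_student=82.3)

# Guideline cited in the text: treat a level as a cluster if its ICC >= 0.01.
ICC_MIN = 0.01
levels_to_model = [lvl for lvl in ("school", "class") if shares[lvl] >= ICC_MIN]
```

With these illustrative inputs both the school and the classroom level pass the 0.01 threshold, which mirrors the decision to keep both as clusters.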
Second, a full model was specified to examine the effect of factors at the student, classroom, and school level (Equation (2)):
$$Y_{ijk} = \gamma_{0jk} + \beta_{1jk}\,STUDENT_{1ijk} + \dots + \beta_{njk}\,STUDENT_{nijk} + \varepsilon_{ijk}, \qquad \gamma_{0jk} = \gamma_{00k} + \mu_{0jk}, \qquad \gamma_{00k} = \pi_{000} + r_{00k} \qquad (2)$$
where $Y_{ijk}$ is the ith observation in the jth classroom at the kth school, $\beta_{1jk}, \dots, \beta_{njk}$ refer to the fixed effects (slopes) of the student-level variables $STUDENT_{ijk}$, $\gamma_{00k}$ refers to the class-level random intercept (i.e., grand mean of scores), and $\pi_{000}$ refers to the school-level random intercept.
We followed the methods used by the National Center for Education Evaluation (NCEE) Technical Methods report [33] to address the problem of missing data in the analysis of Randomized Controlled Trials (RCTs) of educational interventions, with a particular focus on the common educational situation in which groups of students, such as entire classrooms or schools, are randomized. Table 2 describes the proportion of missing cases in each treatment group. For the purpose of this analysis, we assumed that the missing data follow a Missing at Random (MAR) mechanism [33], meaning the probability of being missing is the same only within groups defined by the observed data. This implies that there is a systematic relationship between the propensity of missing values and the observed data [34]. Further, once one has conditioned on all the observed data, any remaining missingness is completely random. Consequently, when the cause of missingness is taken into account, MAR missingness leads to unbiased parameter estimates [35]. Table 3 presents a comparison between observed values of students who completed the SEPA posttest and those who did not. We used the Kruskal-Wallis test for the continuous variables and the Chi-squared test for categorical variables to determine significant differences between both groups.
Results suggested that there is a significant relationship between observed variables and missingness in SEPA-post outcomes, which supports our claim that the data are not Missing Completely at Random (MCAR) and that a MAR assumption is more suitable [36,37]. Complete case deletion is probably the most commonly used procedure when dealing with missing data. However, when the missingness mechanism fails to meet the MCAR assumption, as in our case, complete case deletion will yield biased estimates [34,38]. Further, the loss of sample members can reduce the power to detect statistically significant differences. Instead, we used the multiple imputation approach, which has become the method of choice in many contexts of missing data [39].
Ignoring the clustering and imputing the data with a one-level approach will underestimate the ICC [40,41,42], which in certain cases can be more harmful than complete case deletion because of the wrong model specification [43]. Thus, we used the multilevel predictive mean matching method, which uses linear mixed models with random draws from the regression coefficients and the random effects to impute missing outcomes [44,45]. Moreover, following recommendations in the literature, we included all the available complete variables in the data set in order to support the MAR assumption [46,47]. Based on previous recommendations [48,49], we first generated 20 imputations. We then analyzed each completed data set using the HLM model previously detailed. Finally, using Rubin's rules as implemented in the 'mice' R package, we combined the estimates from the analyses and obtained our effect size estimates [50].
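The pooling step can be illustrated with a minimal sketch of Rubin's rules for a single scalar parameter. This is written in Python rather than R purely for illustration; it is not the 'mice' implementation, the function name is hypothetical, the input numbers are made up, and the degrees of freedom follow Rubin's classic formula (software packages may apply small-sample adjustments instead).

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors.

    Returns (pooled_estimate, total_variance, degrees_of_freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching inputs")
    q_bar = mean(estimates)      # pooled point estimate
    u_bar = mean(variances)      # within-imputation variance
    b = variance(estimates)      # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b
    if b == 0:
        df = math.inf            # no between-imputation variability
    else:
        r = (1 + 1 / m) * b / u_bar   # relative increase in variance
        df = (m - 1) * (1 + 1 / r) ** 2
    return q_bar, total, df


# Made-up inputs from three hypothetical imputed-data analyses.
pooled, total_var, df = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The total variance combines the average sampling variance with the extra uncertainty due to the missing data, which is why the pooled standard error is larger than that of any single completed data set when the imputations disagree.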

Results
Relevant student-level covariates in this study include continuous variables such as SEPA-math baseline (SEPAMathPre), overall attendance, grade point average (GPA), the total number of performed math exercises (NumberExercises), and average length of answers to open-ended math questions (AnswerLength), as well as sex and treatment group indicators. Descriptive statistics for each predictor are presented in Table 4. On average, students scored 559.67 points in the pretest and had a mean GPA of 5.87. In general, mean attendance was 89.42 percent and the average number of platform exercises was 202. Answers to open-ended questions had a length of 7.2 words on average. Further, the correlation between pre-test and post-test scores was 0.72. Equation (3) specifies the linear mixed model used to impute missing SEPA-math post results:
$$Y_{ij} = \gamma_{0j} + \beta_{1j}\,SEPAMathPre_{ij} + \beta_{2j}\,\text{Group:Treatment}_{ij} + \beta_{3j}\,\text{Sex:Male}_{ij} + \beta_{4j}\,GPA_{ij} + \beta_{5j}\,Attendance_{ij} + \beta_{6j}\,NumberExercises_{ij} + \beta_{7j}\,AnswerLength_{ij} + \varepsilon_{ij}$$
$$\gamma_{0j} = \gamma_{00} + \mu_{j}, \qquad \mu_{j} \sim N(0, \sigma^{2}_{\mu}), \qquad \varepsilon_{ij} \sim N(0, \sigma^{2}), \text{ all independent.} \qquad (3)$$
Further, Figure 3 illustrates the distribution of SEPA-math post scores for observed (blue) and imputed (red) values after generating twenty multiple imputed datasets. Results suggest that the imputed SEPA-math post values follow a distribution similar to that of the observed values.
For each completed data set, the implementation effect size was estimated by fitting the HLM shown in Equation (4). The estimates from each analysis were then combined following Rubin's rules. Results showed a positive significant effect of the treatment on SEPA-math post scores (t(78) = 2.802, p = 0.035). The intervention effect size was estimated using the covariate-adjusted mean difference (regression coefficient) and the unadjusted post-test standard deviation. Thus, the estimated treatment effect size was 0.13 SD, with a variance of 0.0016.
Moreover, SEPA pre-test results (t(79) = 16.49, p < 0.001) and overall GPA (t(35) = 3.85, p < 0.001) were also significant and had a positive effect on students' post-test scores. On the other hand, male students scored on average 2.6 points higher than female students, but this difference is not significant (t(186) = 1.34, p = 0.182). Attendance appears to have had a negative effect on overall results (t(136) = −0.41, p = 0.002). Table 5 summarizes the final effect size estimates from the HLM model, which is specified in Equation (4):
$$Y_{ijk} = \gamma_{0jk} + \beta_{1jk}\,PreTest_{ijk} + \beta_{2jk}\,\text{Group:Treatment}_{ijk} + \beta_{3jk}\,\text{Sex:Male}_{ijk} + \beta_{4jk}\,GPA_{ijk} + \beta_{5jk}\,Attendance_{ijk} + \varepsilon_{ijk}$$
$$\gamma_{0jk} = \gamma_{00k} + \mu_{0jk}, \qquad \gamma_{00k} = \pi_{000} + r_{00k}, \qquad \mu_{0jk} \sim N(0, \sigma^{2}_{\mu}), \qquad \varepsilon_{ijk} \sim N(0, \sigma^{2}), \text{ all independent.} \qquad (4)$$
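To make the effect size calculation concrete, the sketch below divides an adjusted mean difference by the unadjusted post-test standard deviation and derives an approximate 95% interval from the reported effect size (0.13 SD) and variance (0.0016). The function names and the raw difference and SD are hypothetical, and the interval uses a normal approximation, which is an assumption here rather than the procedure reported in the paper.

```python
import math


def standardized_effect(adjusted_mean_difference, unadjusted_post_sd):
    """Effect size in SD units: regression coefficient / post-test SD."""
    if unadjusted_post_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return adjusted_mean_difference / unadjusted_post_sd


def normal_interval(effect, effect_variance, z=1.96):
    """Approximate 95% interval under a normal approximation (assumption)."""
    se = math.sqrt(effect_variance)
    return effect - z * se, effect + z * se


# Hypothetical raw inputs, chosen only to illustrate the ratio.
example_effect = standardized_effect(13.0, 100.0)

# Values reported in the text: effect size 0.13 SD, variance 0.0016 (SE = 0.04).
low, high = normal_interval(0.13, 0.0016)
```

Under this approximation the interval runs from roughly 0.05 to 0.21 SD, so it excludes zero, in line with the significant treatment effect reported above.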

Discussion
The aim of this study was to estimate whether the impact of nationally developed math exercises is maintained or even increased when they are integrated into an online platform. Results show a positive effect size of 0.13 SD of the implementation on students' outcomes when measured with the large-scale SEPA-math test. This effect corresponds to almost two extra months of learning when translated according to US year-long learning gains in math for fourth graders [51]. Further, the effect size achieved by the online platform intervention was double the effect achieved by using the same exercises in a paper version for a whole year.
The vast majority of Chilean urban schools have fiber optic internet connections, which allowed us to convert a paper-based government program of math exercises to an online version. Further, the selection criteria applied in this project made it possible to ensure that all participating schools had the technological infrastructure required for the use of the online platform in the classroom. During the implementation, students were able to carry out all the activities without internet connection problems.
According to Gottfried [26], chronic absenteeism not only has a negative effect on students who miss excessive school days, but also has the potential to lower outcomes for other students within a similar educational context. The results of this study shed light on the effectiveness of the ConectaIdeas platform under unstable conditions due to the social outbreak, despite the increase in absenteeism during the application period. Moreover, the estimated effect size is almost half the impact achieved with two sessions per week for a complete year when using the same online platform with its own pre-designed exercises. Thus, the current implementation has proven promising when compared with the effects found in previous evaluations of the same online tool for fourth graders.
Further, as discussed by Kuhfeld et al. [52], the effect size in the second semester is 0.01 SD lower than the effect size in the first semester in 4th grade in the US, even though the second semester is longer. Accordingly, the Average Monthly RIT Gains are 2.00 in the second semester and 2.02 in the first semester. We can then estimate that the yearly overall effect size for this implementation would have been 0.27 SD. However, this estimate does not consider the extra absenteeism in the second semester due to the social outbreak; under normal conditions, the yearly overall impact of the implementation could therefore have been even higher.
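The calculation behind the 0.27 SD figure is not spelled out in the text. One reading that reproduces it, assuming that the first-semester effect of the intervention would exceed the observed second-semester effect by the same 0.01 SD gap reported for the US norms, is:

```latex
\[
\mathrm{ES}_{\text{year}} \approx \mathrm{ES}_{\text{sem 1}} + \mathrm{ES}_{\text{sem 2}}
= (0.13 + 0.01) + 0.13 = 0.27\ \mathrm{SD}
\]
```

Here 0.13 SD is the effect observed in this study during the second semester, and the additive combination of the two semesters is itself an assumption of this sketch.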
There is evidence that writing can improve learning. A meta-analysis of 48 writing-to-learn programs with 6th to 8th graders [53] shows that writing can have a small, positive impact on conventional measures of academic achievement. According to the authors, writing can prompt and support the use of rehearsal strategies, elaboration strategies, organization strategies, and comprehension-monitoring strategies. In another, more recent meta-analysis of 12 studies, Bicer et al. [8] found an overall effect size of 0.42. Similarly, our findings show a significant positive effect of the average length of students' answers to math problems on math learning. Likewise, recent studies show that incorporating real-time monitoring and feedback into online platforms can have a positive impact on students' overall outcomes in math [24,25]. We argue that both components are an essential part of the positive results of the ConectaIdeas online implementation presented here.
Finally, this paper provides evidence of the positive impact of incorporating regular paper-based math exercises into an online platform, as well as of the robustness of the effect of an intervention under unique contextual circumstances. Furthermore, it exemplifies the use of multilevel multiple imputation models to handle missing data in the outcome variable, as opposed to deleting observations (complete case deletion), which would have reduced the power of our study and biased the estimated results.

Conclusions
The implementation evaluated in this work has important practical implications. First, converting paper-based mathematics exercises, previously used and refined for years by the Ministry of Education, to an online platform improved the effectiveness of such exercises. The effect doubled and is significant even though the number of sessions was reduced from twice to once per week and the intervention lasted only one semester. Moreover, the effect was achieved despite the social turmoil that affected the country in the middle of the semester, which increased absenteeism to levels much higher than the historical ones.
Second, in each session students were required to answer at least one open question, which included arguing for the procedures and the logic used to solve the problem. These written answers were shared with their peers, who reviewed and commented on the answers. This activity was shown to have an effect on student learning. These results contribute to informing policy decisions regarding the use of existing math exercises on an online platform.
Although these findings are promising, several aspects require further study and will be addressed in future work. One is to study not only the length of the written answers but also how they relate to the type of question posed by the teacher. Araya et al. [54] addressed this issue and found that the presence of certain keywords in the question proved to be relevant. However, it is necessary to further extend the study of question types using topic models or other natural language processing methods. It is also necessary to analyze the type and "quality" of the answers given by students and their relationship to learning.
A second aspect that needs to be studied further is the effect of the peer collaboration strategy through student assistants implemented in ConectaIdeas. In each session, a platform module preselects students who are performing well as candidates for classroom assistants. A couple of students are then selected by the teacher to be teaching assistants during the session. Students can then request help from any assistant or from the teacher to solve an exercise. Once the assistant has finished, students can evaluate the quality of the help received, and the assistant can also evaluate how well he or she thinks the person who was helped understood the explanation. Evaluating the impact of this strategy will require a different experimental study.
A third feature that is important to address is the impact of the platform on teachers' didactic strategy. In Araya et al. and Uribe et al. [55–57], various classroom observation protocols are used to classify each moment of the session, and different machine learning algorithms are used to automatically analyze transcriptions of teaching discourse. We have been using both methodologies to determine the impact of the use of platforms on teaching strategies. This is work in progress.
Finally, one of the main limitations of this implementation is related to its sustainability and was revealed this year during the quarantine in response to COVID-19. Although most urban underserved schools in Chile have fiber optic internet connections, students at home have very unstable internet. In addition, a large proportion of them rely on their parents' smartphones for internet connections. Even though the ConectaIdeas platform requires very little internet bandwidth, it does need a stable connection. Thus, the challenge is to adapt the platform to work offline and to adapt both the interface and the exercises for use on small-screen devices. In a future study, we will analyze an offline version of the platform for smartphones that is now being tested by students from vulnerable sectors in Chile and Peru.