Investigating the Statistical Distribution of Learning Coverage in MOOCs

Learners participating in Massive Open Online Courses (MOOCs) have a wide range of backgrounds and motivations. Many MOOC learners enroll in a course only to take a brief look; only a few go through the entire content, and even fewer eventually obtain a certificate. We observed this phenomenon after examining 92 courses on the xuetangX and edX platforms. More specifically, we found that the learning coverage in many courses (one of the metrics used to estimate learners' active engagement with online courses) follows a Zipf distribution. We apply maximum likelihood estimation to fit Zipf's law and test our hypothesis using a chi-square test. In the xuetangX dataset, the learning coverage in 47 of 76 courses fits Zipf's law, but in all 16 courses on the edX platform, the hypothesis that the learning coverage follows Zipf's law is rejected. The results of our study are expected to bring insight into the unique learning behavior on MOOCs.


Introduction
Massive Open Online Courses (MOOCs) have gained tremendous popularity since 2008 [1]. Thus far, many MOOC platforms, such as Coursera, edX, and Udacity (the three pioneer platforms), have seen tremendous growth and success, especially in recent years [2]. Around the world, many other platforms have also been developed, such as Khan Academy in North America, Miriada and Spanish MOOC in Spain, Iversity in Germany, FutureLearn in England, Open2Study in Australia, Fun in France, Veduca in Brazil, Schoo in Japan, and xuetangX in China [3]. Many universities, including prestigious ones, now develop and offer MOOCs on these platforms. In doing so, MOOCs have extended education beyond the boundary of university campuses.
MOOCs have also brought unparalleled opportunities for studying learning behavior, both for online education in general and for courses on MOOCs in particular: for example, how students learn and how new technologies can be incorporated to transform teaching and learning. Online learning platforms maintain rich records of students' demographics and enrollment history, as well as their online activities when interacting with the learning platforms. The latter include browsing behavior, click streams, downloads, video streaming, and so on. Access to this data, albeit sanitized and anonymized, gives us the opportunity to analyze learning behavior at unprecedented scale and detail. Earlier, we explored this opportunity and studied Zipf's law in MOOC learning behavior [4].
Many researchers have become interested in studying the learning behavior of MOOC participants. One of the most highlighted issues is how to measure the effectiveness of MOOCs in general, given that the student completion rate (the proportion of students obtaining MOOC certificates) is substantially lower than that of traditional online education courses [5]. The released data on MOOC enrollment and certification point to a very low certification rate, averaging less than 15%. This problem has generated significant research efforts to study the causes of low certification rates and to suggest improvement strategies (e.g., [6,7]). The general belief is that certification is a poor indicator of learning. MOOCs have a large and diverse learner body with different backgrounds, intentions, and motivations [8]. Many students engage with the courses and yet choose not to complete the assessments for credit. Consequently, the certification rate cannot be used as a reliable indicator of learning [9].
Another highlighted issue in studying learning behavior is the difference in the engagement patterns of learners as they interact with the learning platforms. Many researchers use the data collected by the MOOC platforms to define and extract prominent features that describe different learning behaviors and use them to identify different engagement patterns (e.g., [10][11][12][13]). The focus there is to classify learners into categories by engagement pattern and to analyze the relationship of these categories with performance attributes, student demographics, social activities, and so on.
In this paper, we focus on the statistical distribution of "learning coverage". We define "learning coverage" as the amount of course material accessed by a student. We found that the statistical distribution of learning coverage among students enrolled in online courses has a pronounced long-tail feature. In particular, we found that, like many types of natural and man-made events, the learning coverage in MOOCs follows Zipf's law and can thus be approximated with a Zipf distribution.
In our study, we analyzed two datasets from different MOOC platforms. One is provided by the xuetangX platform and contains over 40 million event log entries from 76 courses. The courses cover a wide range of disciplines, including mathematics, computer science, engineering, physics, chemistry, philosophy, history, business, and so on. The results show that in 47 courses the students' learning coverage in the xuetangX dataset follows a Zipf distribution with only slight differences between the courses (in the exponent parameter), which we believe can be attributed to the inherent features of specific courses, such as their level of difficulty and popularity. The other dataset is a public one from edX, containing per-user statistics for 16 courses held by Harvard and MIT. In these courses, learning coverage also shows an explicit long-tail feature but does not fit Zipf's law mathematically. The dataset and major source code can be found at: https://github.com/SophieMEN/learning_coverage_distribution.
By investigating the statistical distribution of the learning coverage, we want to answer questions like "how much do students learn from a MOOC" or "how deeply do students engage with a MOOC". We found that the distribution of the learning coverage shows a clear long-tail feature and fits Zipf's law in over half of the courses. These results suggest that the learning coverage is quite diverse and that most students only learn a little. Our study can be considered an attempt to characterize engagement patterns in a more quantitative and statistical direction. Our study is the first of its kind in that we explore and derive the statistical distribution of students' learning behavior by analyzing large datasets from MOOCs. We are the first to show the existence of a Zipf distribution in student engagement patterns. Our study can yield further insight into the unique learning behaviors associated with MOOCs and thus help both MOOC developers and course providers improve the effectiveness of the learning platforms as well as the design of the courses.

MOOC Learning Behavior
Multidimensional data composed of user profiles and learning activities has been made available to researchers in the education and data science fields. There have been studies attempting to establish relationships between students' background, motivation, and performance (e.g., [14,15]).
Many researchers classify students and activities according to the level of engagement with the online courses. For example, Perna et al. [16] define "starters" as those who register for a course no later than one week after its start date. Ho et al. [17] divide students into three types: "registrant" as any registered user, "participant" as a registrant who has accessed the content of a course, and "explorer" as a participant who has accessed more than half of a course's content. Anderson et al. [12] classify users into five categories based on their accomplishment in the assignments: "viewers", "all-rounders", "solvers", "collectors", and "bystanders". Here, the collectors are those who primarily download lectures, while the bystanders are those with a very low level of activity. Similarly, Kizilcec, Piech, and Schneider [10] define four types of learning patterns: "on track", "auditing", "behind", and "out". Evan et al. [13] define three types of activities: "engagement" refers to any activity such as downloading materials or watching lectures; "persistence" refers to engagement for a prolonged duration; and "completion" refers to persistence to the end of the course.
Our study also focuses on student engagement. Our definition of learning coverage is a quantitative measure of student engagement in a particular course. We discuss learning coverage in detail in Section 3.2.

Zipf's Law
Zipf's law builds on the fundamental premise that the occurrences of many types of natural and man-made events can be approximated with a Zipf distribution. Initially, Zipf's law was applied in the context of language studies. It states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Mathematically, this is $P_n \sim 1/n^{\alpha}$, where $P_n$ is the frequency of the word ranked $n$th. With α = 1, this means the second word occurs approximately 1/2 as often as the first, the third word 1/3 as often as the first, and so on. Zipf's law reveals that while only a few words are used very often, most are used rarely. Since then, Zipf's law has been found to apply to similar phenomena in various areas, such as city populations, company sizes, and science article citations, as well as many other natural and physical phenomena [18]. On the Internet in particular, Zipf's law governs many features and has strong implications for the design and function of the network. The connectivity of Internet routers influences the robustness of the network, while the distribution of the number of email contacts affects the spread of email viruses. Even web caching strategies are formulated to account for a Zipf distribution in the number of requests for webpages [19].
A Zipf distribution can be defined more concretely as $f(r) = C r^{-\alpha}$, where $C = \left( \sum_{r=1}^{k} r^{-\alpha} \right)^{-1}$ and α is the exponent parameter, which takes a positive value. In the classic version of a Zipf distribution, the exponent α is 1. If we plot the Zipf distribution, frequency versus rank, on a log-log scale, the result is a line with slope −α. Because of this, most authors who report Zipf's law patterns (e.g., [18][19][20][21][22][23]) use linear regression to examine the linearity between log(frequency) and log(rank). The better the linearity, the closer the data are to a Zipf distribution.
This procedure, however, treats the intercept as a nuisance parameter, ignoring the fact that it is related to α. More precisely, the intercept should be equal to log(C). Moreover, linear regression through ordinary least squares is inefficient in this case, given that r is an integer [24]. A better method to fit Zipf's law to empirical data is maximum likelihood estimation (MLE), which has proven effective in practice for similar distributions, such as the Zipf-Mandelbrot law [25] and the power law [26]. In our study, we also use MLE to estimate the exponent parameter of the Zipf distribution and check the goodness of fit by performing a chi-square test.
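To make the definition concrete, the following minimal Python sketch evaluates f(r) = C r^(−α) for a finite number of ranks k and checks the slope on a log-log scale. The function names `zipf_pmf` and `loglog_slope` are illustrative and do not come from the paper's source code.

```python
import math


def zipf_pmf(alpha, k):
    """Return [f(1), ..., f(k)] with f(r) = C * r**(-alpha)."""
    weights = [r ** (-alpha) for r in range(1, k + 1)]
    c = 1.0 / sum(weights)  # normalization factor C
    return [c * w for w in weights]


def loglog_slope(pmf, r1, r2):
    """Slope of log f(r) versus log r between ranks r1 and r2 (1-based)."""
    rise = math.log(pmf[r2 - 1]) - math.log(pmf[r1 - 1])
    run = math.log(r2) - math.log(r1)
    return rise / run


# Classic Zipf (alpha = 1) over three ranks: probabilities 6/11, 3/11, 2/11.
probs = zipf_pmf(alpha=1.0, k=3)

# For any alpha, the log-log slope between two ranks equals -alpha.
slope = loglog_slope(zipf_pmf(alpha=1.5, k=100), 2, 50)
```

Because C depends on α through the sum, the intercept log(C) of the log-log line is not a free parameter, which is the point made above about ordinary least squares.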

Dataset
In this study, we use two MOOC datasets. The first dataset is provided by xuetangX (www.xuetangx.com) and contains information on 76 courses held by Tsinghua University in 2014 and 2015. The dataset contains information on individual users and courses, as well as the event logs of all users' online activities. The information for each course contains the course's name, level, and subject area. Each event log entry consists of the user ID, the IP address, the course ID, the chapter ID, the section ID, the event type, the event time, and other information depending on the specific event type. There are more than 4 million event log entries in the dataset.
The second dataset is the Person-Course Dataset AY2013 (http://dx.doi.org/10.7910/DVN/26147) and contains 16 courses held by MIT and Harvard on edX in 2012 and 2013. This dataset does not have detailed event logs; however, it contains important information on each student enrolled in the courses, including an anonymized user ID, the course ID, the number of chapters accessed, and other statistics on the user's activities, such as the number of active days, the number of video play events, and the number of forum posts.
Figure 1 shows the distributions of the number of registrants and the number of participants in the courses. Here, registrants are the users who have enrolled in the course, and participants are the registrants who have accessed the course content [17]. The minimum number of registrants across all 92 courses is 1490, while the minimum number of participants is 139, both from the xuetangX dataset, whose courses are smaller than those in the edX dataset. Most courses in the edX dataset have over 10,000 registrants and 5000 participants. The difference between the numbers of registrants and participants indicates that many users simply enrolled in a course but never accessed any course materials.
Figure 2 shows the distribution of courses across disciplines. Courses in the xuetangX dataset are already labeled as part of the course's summary information. Courses in the edX dataset do not have such labels, so we added them manually. As we can see, engineering, which is a mix of many subjects including electronic engineering, mechanical engineering, civil engineering, and so on, has the largest number of courses (25.84%).

Learning Coverage
We define learning coverage as the amount of course content a learner has accessed. On MOOCs, the course content is usually organized as a multi-level tree: each course contains several chapters, a chapter contains several sections, and a section contains various materials, including texts, videos, assignments, and quizzes. Conceptually, one can calculate the learning coverage at different granularities (chapter, section, or specific content within a section).
The xuetangX dataset contains event logs that record users' online activities on the learning platform. Using the event logs, we can locate the specific section for each event. This enables us to count how many sections a learner has accessed. The edX dataset does not come with an event log, but it records the number of chapters a learner has accessed. In this case, we can calculate the learning coverage at the chapter level, but we do not have access to more fine-grained information.
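The section-level computation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the event log is reduced to hypothetical `(user_id, section_id)` pairs for a single course, and the function names are invented for this example.

```python
from collections import Counter, defaultdict


def learning_coverage(events):
    """Map each user to the number of distinct sections they accessed.

    `events` is an iterable of (user_id, section_id) pairs for one course
    (a hypothetical, simplified view of the event log).
    """
    sections_by_user = defaultdict(set)
    for user_id, section_id in events:
        sections_by_user[user_id].add(section_id)
    return {user: len(sections) for user, sections in sections_by_user.items()}


def coverage_histogram(coverage):
    """Count how many learners share each learning coverage value."""
    return Counter(coverage.values())


# Toy log: u1 touches two sections (one of them twice), u2 and u3 one each.
events = [("u1", "s1"), ("u1", "s1"), ("u1", "s2"), ("u2", "s1"), ("u3", "s3")]
coverage = learning_coverage(events)
histogram = coverage_histogram(coverage)
```

Repeated events on the same section count once, so the measure reflects how much of the course was reached rather than how often it was visited. For the edX data, the chapter count recorded per user would play the role of `coverage` directly.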
As a typical example, Figure 3 shows the histogram of learning coverage for the course Financial Analysis and Decision Making on xuetangX. Other courses show similar distributions. The distribution has an apparent long-tail feature. In the histogram, the high "head" indicates that a large number of students learned very little in the MOOC, and the long "tail" indicates that students who learn a lot have quite diverse learning coverage. Looking at Figure 3 in more detail, we can see that as the learning coverage increases, the number of students with that learning coverage generally decreases, with some fluctuation. Using the methods presented in [26], we test the learning coverage for a power law. However, the null hypothesis that the learning coverage follows a power law is rejected in all 92 courses. We believe that the absence of monotonicity is a major cause of the rejection.
Consequently, we test the learning coverage for Zipf's law, which describes the relationship between frequency and rank. We sort the frequencies of the learning coverage values in descending order and then, as a pre-experiment, apply linear regression to frequency versus rank on a log-log scale. The results show that the learning coverage fits the Zipf distribution well and consistently. Figure 4 shows scatter plots of the frequency versus the rank of the learning coverage on a log-log scale for six courses from different disciplines. The points fit a straight line well. Other courses produce similar plots. The coefficient of the linear regression on frequency versus rank on a log-log scale is the negative of the exponent parameter α in Zipf's law. The distribution of α is shown in Figure 5a for the xuetangX courses and the edX courses, respectively. Overall, α ranges from 1.0018 to 2.2503. In fact, there are only three courses with α greater than 2.0.
Figure 5b shows the R-squared values for the two sets of courses. Across all 92 courses, the R-squared values range from 0.6780 to 0.9893. Only five courses, three from the xuetangX dataset and two from the edX dataset, have a value lower than 0.9. For the other 87 courses, the R-squared value is larger than 0.9, indicating a high goodness of fit. The result is encouraging enough that we decided to use the maximum likelihood method for a more effective and accurate estimation of Zipf's law.
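The pre-experiment can be sketched in a few lines of Python: sort the frequencies in descending order, regress log(frequency) on log(rank) by ordinary least squares, and report the negated slope as α together with R-squared. This is a minimal sketch; `loglog_regression` is an illustrative name, and dropping zero frequencies before taking logarithms is an assumption of this example.

```python
import math


def loglog_regression(frequencies):
    """OLS fit of log(frequency) on log(rank).

    Returns (alpha, intercept, r_squared), where alpha is the negated slope.
    """
    # Zero frequencies are dropped because their logarithm is undefined.
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        raise ValueError("frequencies are all equal; R-squared is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, intercept, r_squared


# Synthetic frequencies that follow an exact power law with exponent 1.5.
freqs = [1000.0 * r ** -1.5 for r in range(1, 21)]
alpha, intercept, r_squared = loglog_regression(freqs)
```

On exact power-law data the regression recovers the exponent and R-squared is essentially 1; on real course data the fluctuations in the tail lower R-squared, which is what Figure 5b summarizes.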

Fitting Zipf's Law
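Formally, a random variable X is Zipf distributed with parameter α (X ∼ Zipf_α) if, for a given α ∈ R,

$$P(X = x_r) = C r^{-\alpha}, \tag{1}$$

where $x_r$ is the $r$th most frequent element and $C$ is the normalization factor:

$$C = \left( \sum_{r=1}^{k} r^{-\alpha} \right)^{-1}.$$

Writing $n_r$ for the number of times $x_r$ occurs in the sample and $N = \sum_{r=1}^{k} n_r$ for the sample size, we obtain the likelihood function for a sample $x$ as follows:

$$l_\alpha(x) = \prod_{r=1}^{k} \left( C r^{-\alpha} \right)^{n_r}, \tag{2}$$

which gives the probability of the observed sample under a Zipf distribution with parameter α.
The method of MLE estimates α by finding the value of α that maximizes $l_\alpha(x)$. For ease of calculation, we maximize the log-likelihood function, which is

$$\log l_\alpha(x) = -\alpha \sum_{r=1}^{k} n_r \log r - N \log \left( \sum_{j=1}^{k} j^{-\alpha} \right). \tag{3}$$

The gradient of the log-likelihood function with respect to α is:

$$\frac{\partial \log l_\alpha(x)}{\partial \alpha} = -\sum_{r=1}^{k} n_r \log r + N \, \frac{\sum_{j=1}^{k} j^{-\alpha} \log j}{\sum_{j=1}^{k} j^{-\alpha}}. \tag{4}$$

We can use gradient descent on the negative log-likelihood to obtain the optimal parameter α, which maximizes the log-likelihood function.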
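As an illustration of the estimation step, the sketch below runs plain gradient ascent on the per-sample Zipf log-likelihood in Python. It is a minimal sketch under stated assumptions rather than the solver used in this study: `counts[r-1]` holds the number of learners at rank r, the function names are invented here, and the fixed step size 4 / (log k)^2 is a conservative choice made for this example, based on the fact that the curvature of the per-sample log-likelihood is the variance of log(rank), which cannot exceed (log k)^2 / 4.

```python
import math


def log_likelihood_gradient(alpha, counts):
    """Gradient of the Zipf log-likelihood with respect to alpha.

    counts[r - 1] is the observed frequency of the rank-r element.
    """
    k = len(counts)
    n_total = sum(counts)
    weights = [j ** (-alpha) for j in range(1, k + 1)]
    z = sum(weights)
    # Expected value of log(rank) under the Zipf model with this alpha.
    expected_log_rank = sum(
        w * math.log(j) for j, w in enumerate(weights, start=1)
    ) / z
    observed_log_rank = sum(
        n * math.log(r) for r, n in enumerate(counts, start=1)
    )
    return -observed_log_rank + n_total * expected_log_rank


def fit_zipf_mle(counts, alpha0=1.5, max_iter=5000, tol=1e-10):
    """Estimate alpha by gradient ascent on the per-sample log-likelihood."""
    ranked = sorted(counts, reverse=True)  # rank 1 = most frequent value
    k = len(ranked)
    n_total = sum(ranked)
    if k < 2 or n_total <= 0:
        raise ValueError("need at least two ranks and a non-empty sample")
    step = 4.0 / (math.log(k) ** 2)  # assumption: conservative fixed step
    alpha = alpha0
    for _ in range(max_iter):
        grad = log_likelihood_gradient(alpha, ranked) / n_total
        alpha += step * grad  # ascent: move uphill on the log-likelihood
        if abs(grad) < tol:
            break
    return alpha


# Synthetic counts that follow Zipf's law with alpha = 1.5 over 50 ranks.
counts = [round(100000 * r ** -1.5) for r in range(1, 51)]
alpha_hat = fit_zipf_mle(counts, alpha0=1.0)
```

Because the log-likelihood is concave in α, the ascent converges to the unique maximizer; a general-purpose optimizer applied to the negative log-likelihood would reach the same estimate.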

Goodness-of-fit Test
The method described above allows us to fit a Zipf distribution to a given data set and provides an estimate of α. Now we need to determine whether the sample data are consistent with the hypothesized distribution, in this case, a Zipf distribution with the parameter α. For this, we use the chi-square goodness-of-fit test [27]. The null hypothesis for the test is as follows: $H_0$: The learning coverage data is consistent with a Zipf distribution with parameter α.
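We calculate the chi-square statistic as:

$$\chi^2 = \sum_{r=1}^{k} \frac{(n_r - E_r)^2}{E_r}, \tag{5}$$

where $n_r$ is the observed frequency of the $r$th ranked element and $E_r = N C r^{-\alpha}$ is the frequency expected under the fitted Zipf distribution for a sample of size $N$. We obtain a p-value from the chi-square distribution with k − 1 degrees of freedom. The p-value is the probability of observing a chi-square statistic at least as extreme as the calculated value (from Equation (5)). We set the significance level to 0.01. That is, if the p-value < 0.01, we reject the null hypothesis that the learning coverage data is Zipf distributed with the parameter α; otherwise, the null hypothesis is not rejected.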
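The test can be sketched in Python using only the standard library. The sketch assumes Pearson's chi-square statistic with expected frequencies N C r^(−α) and k − 1 degrees of freedom, as described above; the function names are illustrative. The survival function `chi2_sf` is written out with the standard recurrence for the regularized incomplete gamma function so that the example is self-contained; a statistics library would normally supply it.

```python
import math


def zipf_expected(alpha, k, n_total):
    """Expected frequency of each rank under Zipf(alpha) for n_total samples."""
    weights = [r ** (-alpha) for r in range(1, k + 1)]
    z = sum(weights)
    return [n_total * w / z for w in weights]


def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_sf(x, dof):
    """P(chi-square with `dof` degrees of freedom >= x), for integer dof >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-half)  # closed form for dof = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(half))  # closed form for dof = 1
    # Recurrence Q(a + 1, h) = Q(a, h) + h^a * exp(-h) / Gamma(a + 1).
    while a < dof / 2.0 - 1e-9:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def zipf_goodness_of_fit(counts, alpha, significance=0.01):
    """Return (statistic, p_value, reject) for H0: counts follow Zipf(alpha)."""
    ranked = sorted(counts, reverse=True)
    k = len(ranked)
    expected = zipf_expected(alpha, k, sum(ranked))
    statistic = chi_square_statistic(ranked, expected)
    p_value = chi2_sf(statistic, k - 1)
    return statistic, p_value, p_value < significance


# Counts matching Zipf(alpha = 1) over three ranks exactly, and a flatter sample.
good_fit = zipf_goodness_of_fit([600, 300, 200], alpha=1.0)
poor_fit = zipf_goodness_of_fit([400, 400, 300], alpha=1.0)
```

The first sample matches the expected frequencies exactly, so the statistic is close to zero and the null hypothesis is not rejected; the second deviates strongly and is rejected at the 0.01 level. Note that for a fixed relative deviation the statistic grows with the sample size N.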

Results
We calculate the maximum likelihood estimate of α based on the log-likelihood function (Equation (3)) and its gradient (Equation (4)). To find the value of α that maximizes the log-likelihood function, we use the MATLAB built-in function fminunc as the optimization solver, with the initial α set to 1.5. After obtaining α, we conduct a chi-square test to determine whether the observed data fit Zipf's law. This procedure is conducted for all 92 courses.
The results for α and the p-value for the two datasets are summarized in Table 1. In the xuetangX dataset, α ranges from 0.8915 to 1.9751, while in the edX dataset, α ranges from 1.2352 to 1.6860, a considerably higher minimum. Figure 6 shows the results of parameter estimation together with the observed data for two courses whose α is close to the mean of the respective dataset.
In the xuetangX dataset, the learning coverage of 47 courses is likely to fit Zipf's law, which accounts for 61.84% of the courses. The results also show that over 25% of the courses have a p-value close to zero, leading to a definite rejection of the null hypothesis. Figure 7 compares the number of participants between the courses fitting a Zipf distribution and those that do not. We observe that the courses with more than 3000 participants all reject Zipf's law. On the other hand, most courses with fewer than 1000 participants are likely to fit Zipf's law. We have looked at other differences (e.g., disciplines, semesters) between the courses that fit and those that do not, but so far no meaningful conclusions have been found.
In the edX dataset, the null hypothesis is rejected in all 16 courses. This means that even though the learning coverage in these courses looks Zipf-like, it does not necessarily fit Zipf's law mathematically. This may be caused by the large number of samples in these courses, just like what we have observed in the rejected courses in the xuetangX dataset. A previous study that uses a similar method for testing the Zipf-Mandelbrot model [25] claimed that the size of the dataset should not be larger than 3000 for the maximum likelihood method. The rejection of the null hypothesis may therefore be caused by the limitation of the method. Another possible reason for the rejection is the relatively low number of degrees of freedom. In the edX dataset, we calculate the learning coverage at the chapter level due to the limited granularity of the data, so the learning coverage has a narrower range than when it is calculated at the section level. A narrower range of the learning coverage means k is smaller, which directly leads to fewer degrees of freedom in the chi-square test (see Section 3.4).

[Figure 7. Axis: # of participants (0 to 10,000); groups: courses that fit Zipf, courses that reject Zipf.]
The learning coverage is intrinsically related to how students perceive and carry on with the courses. In particular, the exponent parameter α can be regarded as an indicator of student retention in the course. A higher α means that more students are either dropping out of the course early or simply taking in less content overall. That is, students are less engaged with an online course with a higher α than with one with a lower α. It would be interesting to correlate this α parameter with a course's content quality and alternative teaching methods (for example, more forum activities) to evaluate potential improvements in student retention. We leave the investigation for future work.

Discussion
In this study, inspired by preliminary data analysis, we investigated the statistical distribution of the learning coverage and tried to identify the existence of Zipf's law. A long-tail feature has been observed clearly across 92 courses on two MOOC platforms, and in 61.84% of the courses on the xuetangX platform the learning coverage fits Zipf's law mathematically. The results show that a large portion of students only learn a little. The statistical distribution can be considered characteristic of the engagement pattern, a general feature of the student population that one needs to take into consideration when designing and running MOOCs.
Our study can be improved in several ways. First, further work is needed to examine whether section-level learning coverage fits Zipf's law in courses on other MOOC platforms and in other learning contexts. This requires fine-grained data. Second, we need to explore the current fitting and goodness-of-fit test methods and study their limitations. Then we can be more certain whether the rejection is indeed caused by a defect of the method. Third, we define learning coverage as the amount of content a learner has accessed, which can sometimes be misleading because accessing content is not the same as learning. To be more rigorous, we can set a threshold in terms of the number of events; that is, we can define the learning coverage as the amount of content a learner has accessed more than a certain number of times. Some MOOC platforms organize learning materials with a requirement that a learner must complete certain sections before moving on to the next ones. On these platforms, the learning coverage can be defined more precisely.
More in-depth studies on discovering the knowledge behind the data are warranted. Not only will they provide us with methods for evaluating the effectiveness of learning on MOOCs, but they will also provide educators and MOOC providers with a basis for further improving teaching methods and learning platforms. In future work, we would like to investigate other methods that can capture more aspects of students' learning behavior and therefore represent their overall learning behavior more accurately.

Conclusions
To measure learner engagement in MOOCs, we introduce a new metric, called the learning coverage, to estimate the amount of course content (the number of sections or the number of chapters) accessed by the learners. It is a measure of how far a learner has advanced into the course. By analyzing the datasets provided by the MOOC platforms xuetangX and edX, we calculate the learning coverage for 92 courses of various disciplines, either at the section level or at the chapter level. We find that the learning coverage distribution has a clear long-tail feature and shows good linearity in a log-log plot. To confirm the observation, we apply the MLE method to fit Zipf's law and conduct the chi-square goodness-of-fit test. The exponent parameter of the Zipf distribution can be used as an inherent feature of the course, representing to some degree the student retention in the course, and therefore a reflection of the course's difficulty and popularity. The results show that the learning coverage is likely to fit a Zipf distribution mathematically in 61.84% of the courses on the xuetangX platform, but in no course on the edX platform. Rejection in all edX courses may be caused by the limitation of the MLE method or the relatively small number of degrees of freedom in the chi-square test. The prevalence of Zipf's law in the learning coverage can be considered an engagement pattern, a general feature of the student population to be taken into account when designing and running MOOCs.

Figure 3. Histogram of the learning coverage of a course: Intro to Entrepreneurship (xuetangX).

Figure 4. Linear regression results of six courses.

Figure 6. Maximum likelihood estimation (MLE) fitting results. (a) Medical Parasitology in the xuetangX dataset; (b) Mechanics Review in the edX dataset.

Figure 7. Number of participants for courses fitting and rejecting Zipf in the xuetangX dataset.

Table 1. Statistics of α and p-value of the two datasets.