Measuring Arithmetic Word Problem Complexity through Reading Comprehension and Learning Analytics

Abstract: Numerous studies have addressed the relationship between performance in mathematics problem-solving and reading comprehension in students of all educational levels. This work presents a new proposal to measure the complexity of arithmetic word problems through the students' reading comprehension of the problem statement and the use of learning analytics. The procedure to quantify this reading comprehension comprises two phases: (a) the division of the statement into propositions and (b) the computation of the time dedicated to reading each proposition through a technological environment that records the interactions of the students while solving the problem. We validated our approach by selecting a collection of problems containing mathematical concepts related to fractions and their different meanings, such as fractional numbers over a natural number, basic mathematical operations with a natural whole or fractional whole and the fraction as an operator. The main results indicate that a student’s reading time is an excellent proxy to determine the complexity of both propositions and the complete statement. Finally, we used this time to build a logistic regression model that predicts the success of students in solving arithmetic word problems.


Introduction
Previous work has studied the relationship between performance in mathematics problem-solving and reading comprehension in students of all educational levels [1][2][3]. Authors such as Pólya [4] and Puig and Cerdán [5] have shown that reading and understanding the statement are key phases of the problem-solving process. The National Council of Teachers of Mathematics (NCTM) [6] determined that solving a mathematical problem requires many of the skills present in all areas of the educational curriculum, such as reading, reflection and understanding. The latest PISA report [7] indeed highlights that a solid reading competence is fundamental for academic achievement in all subjects of the educational system (including mathematics), while being a prerequisite for successful participation in most areas of adult life [8][9][10].
Our research is framed within the context of arithmetic word problems (from now on AWPs or AWP in singular) and focuses on how to measure the complexity of the statements involved. To this end, we compute the reading comprehension of students through a technological environment and use learning analytics to predict student performance in solving this sort of problem.

• The linguistic approach, based on the student's reading ability [15] and the readability of external texts different from the AWP statement [16].
• The structural variables approach, based on the so-called task variables (e.g., syntactic variables or context variables) as defined by Kilpatrick [17] or Goldin and McClintock [18].
• The open sentences approach, based on the situation of the question within the statement [19].
• The semantic approach, based on the semantic structure of the statement, considered as a whole [5] or divided into segments [19], so that an association can be built between keywords and operators in a partial problem-solving process.
To the authors' knowledge, none of the previous approaches has yet measured the complexity of AWPs through the students' reading comprehension of the statement itself. This aspect makes our research a novel and original contribution to the state of the art of mathematical problem-solving.

Measuring the Complexity of AWP Statements through a Technological Environment
To measure the complexity of an AWP, we split its statement into propositions, as follows from the partial semantic approach defined above. Our unit of analysis is thus a proposition, which contains a verb and a quantity associated with the related action. Figure 1 shows an example of an AWP statement [20] divided into three propositions. Propositions can be classified into levels to facilitate comparability and to determine their complexity. Hunt [21] names these constructions T-units or minimal terminable units of language. Each T-unit consists of a main clause plus the subordinate clauses it may include, and T-units can be organized on a number of levels: declarative sentences represent level 0; level 1 adds a subordination to sentences of level 0; level 2 adds a subordination to those of level 1 and so on. The higher the level, the more complex the sentence will be. This way, Proposition 1 in Figure 1 belongs to level 0, and Propositions 2 and 3 are of level 1, since they are respectively subordinated by the terms "of them" and "now".
We measure the complexity of each proposition by obtaining the time per word that students spend while reading the corresponding segment of the AWP statement. The time per word is computed through a technological learning environment able to control which information is displayed at any time and to register the interaction of students with the content. This approach gives finer control than monitoring reading from printed texts.
The use of intelligent tutors or technological learning environments (e.g., Moodle, Edmodo or Bakpax) has increased in recent years across all educational stages [22]. However, these environments have not yet been used to measure reading comprehension and the complexity of AWPs. These tools can usually be accessed through mobile devices and smart screens and allow one to register student-computer, student-teacher or student-content interactions [23][24][25][26], thereby giving rise to the so-called learning analytics research field. This field deals with applying data analytics to education and it is defined as the area of investigation in charge of measuring, compiling and analyzing data sets obtained through the use of computer-assisted learning platforms that track and record student digital interactions [27,28].
Technological environments and learning analytics are a cutting-edge approach to detect patterns in student strategies when solving a learning task. They are also helpful in understanding study habits, the use of teaching materials or the time dedicated to the proposed activities [29], sometimes supplemented by information on attendance, participation or motivation [30].
This work focuses on the analysis of the student-computer and student-content interactions obtained through the Read and Learn (R&L) technological environment [24,31]. R&L is a research tool to carry out experiments that analyze the strategies of students when they first have to read a text or problem statement and then answer a series of questions in a digital context.

Predicting Student Performance When Solving AWPs
Mathematical models have been extensively used to try to predict the probability of correctly solving a learning task. These models are commonly used to build a personalized route that guides students through an adapted teaching-learning process [32].
Logistic and Bayesian knowledge tracing models stand out among the statistical prediction models used for this purpose. The former have been used to predict the probability of success from the students' previous skills and the difficulty of the task [33]. The latter use hidden Markov models to estimate latent parameters and predict student success [32].
Following previous work on the matter [26,34], this work presents a binary logistic regression model to predict student performance from the complexity of an AWP measured by the reading comprehension of its statement.
The remainder of the paper is organized as follows. Section 2 describes the materials and methods used to measure the complexity of AWPs, the features of the R&L technological environment, a validation experiment for a sample population and the tested hypotheses. Section 3 presents the experimental results that determined the feasibility of our approach for assessing the complexity of mathematical problems through reading comprehension. Section 4 shows how to build a logistic model to predict student performance from the complexity computed for an AWP. Finally, discussion and conclusions are drawn in Section 5 in the context of the state-of-the-art literature.

Procedure for Measuring the Complexity of AWPs
The complexity of an AWP can be derived from the complexity of all the propositions that form its statement. To estimate the complexity of a proposition, we compute the reading time per word for a group of students using the R&L technological environment. The reading time of proposition j in task i ($T_{ij}$ in Equation (1)) thus comprises the time spent by each student ($t_{ijs}$) in the group (of size n) and the number of words in the proposition (k).
The total complexity of an AWP can in turn be measured by averaging the previous reading times per student for all propositions (Equation (2)), where m represents the number of propositions in the statement.
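Equations (1) and (2) are not reproduced in this text. One reading that is consistent with the definitions above, offered here as an assumption rather than as the authors' exact formulation, averages the per-word reading time over the n students and then averages the result over the m propositions. The student index s and the per-proposition word count $k_j$ are notation introduced for this sketch.

```latex
% Possible form of Equation (1): mean reading time per word of proposition j in task i
T_{ij} = \frac{1}{n} \sum_{s=1}^{n} \frac{t_{ijs}}{k_j}

% Possible form of Equation (2): complexity of task i as the mean over its m propositions
T_{i} = \frac{1}{m} \sum_{j=1}^{m} T_{ij}
```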

Instrument
R&L is a technological environment in which to design research experiments on reading comprehension in text and image-related learning tasks. It is a web tool that can be accessed through mobile devices, computers and smart screens using any browser on any operating system. R&L records all user interactions with the statements, questions and response options along with timestamps, which allows tracking the access history to the learning content with millisecond precision. Any user action is registered, such as displaying a hidden proposition or moving the focus from the statement to the questions and vice versa. This way, we can determine aspects such as: what part of the statement the student is focused on, at which point in time a certain proposition is read, how long a student remains in the same proposition, how many times a proposition is consulted and in which order students access the statement, the questions and the answer options.
R&L is able to digest these learning data flows and compute the variables of interest from the previously recorded data (e.g., the time reading a proposition or answering a question). Data can then be exported as CSV so that they can be further used in any preferred data analysis software (e.g., R or SPSS). For more details on R&L the interested reader can check the literature [24] and our website about data analytics and technological tools in education at https://go.uv.es/grimo/datte.
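As an illustration of how a reading time per word could be derived from such an export, the following Python sketch sums the time each proposition stays visible for each student and divides it by the number of words. The column names (`student`, `proposition`, `shown_ms`, `hidden_ms`), the sample rows and the function name are hypothetical and do not describe the actual R&L export format.

```python
import csv
import io
from collections import defaultdict


def time_per_word(log_rows, word_counts):
    """Return {(student, proposition): seconds per word}.

    log_rows: iterable of dicts with the hypothetical keys
        student, proposition, shown_ms, hidden_ms.
    word_counts: dict mapping a proposition id to its number of words.
    """
    totals_ms = defaultdict(int)
    for row in log_rows:
        key = (row["student"], row["proposition"])
        # A proposition may be displayed several times; add up every visit.
        totals_ms[key] += int(row["hidden_ms"]) - int(row["shown_ms"])
    return {
        key: ms / 1000.0 / word_counts[key[1]]
        for key, ms in totals_ms.items()
    }


# Made-up export: student s1 opens P11 twice, student s2 opens it once.
SAMPLE_CSV = """student,proposition,shown_ms,hidden_ms
s1,P11,0,9000
s1,P12,9000,21000
s1,P11,21000,27000
s2,P11,0,6000
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
# Word counts of 3 and 6 follow the lengths reported for P11 and P12.
per_word = time_per_word(rows, {"P11": 3, "P12": 6})
```

Summing over repeated visits is a design choice of this sketch: it treats every return to a proposition as additional reading time for that proposition.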

Experimental Design
To test our proposal we have conducted a descriptive quantitative study involving a group of 70 students, 26 girls and 44 boys, aged between 15 and 16 years old.
At the time of the study, the students belonged to two public secondary schools in Spain selected by a convenience non-probability sampling. One school is located in an upper-middle socioeconomic area of a town of twelve thousand inhabitants. The other one is located in a multicultural suburb with medium-low socioeconomic status in a city of eight hundred thousand inhabitants.
Informed consent was obtained from schools, teachers and students before the start of the experiment. Anonymity of the data was guaranteed by just collecting the year of birth, gender, course and a dummy school code for each student. Any combination of data with a frequency of less than 5 observations was considered subject to statistical secrecy and it was removed to prevent de-anonymization.
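The suppression rule described above can be sketched in a few lines of Python. The field names and the sample records are hypothetical; only the threshold of 5 observations comes from the text.

```python
from collections import Counter


def suppress_rare_combinations(records, keys, min_count=5):
    """Drop records whose combination of `keys` occurs fewer than min_count times."""
    def combo(record):
        return tuple(record[k] for k in keys)

    counts = Counter(combo(r) for r in records)
    return [r for r in records if counts[combo(r)] >= min_count]


# Hypothetical records: six students share one combination, two share another.
KEYS = ("birth_year", "gender", "course", "school")
common = {"birth_year": 2004, "gender": "F", "course": "4ESO", "school": "A"}
rare = {"birth_year": 2003, "gender": "M", "course": "4ESO", "school": "B"}
records = [dict(common) for _ in range(6)] + [dict(rare) for _ in range(2)]

kept = suppress_rare_combinations(records, KEYS)
```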
The experiment was run individually using the school's computer room. Students were introduced to the R&L technological environment before starting the session. Following fair and ethical practices, participants were made aware that they were involved in a research study. They were clearly informed about the aims of the study and that their performance would not be considered in their grades.
Participants were asked to solve a couple of AWPs presented as two tasks with their corresponding statement and five answer options. The statements were designed taking into account the mathematical and the grammatical complexity. We built two isomorphic tasks [35] dealing with introductory mathematical concepts related to fractional numbers over a natural number, basic mathematical operations with a fractional whole and the fraction as an operator. In addition, we classified the propositions of the statements into levels as defined by Hunt [21], which allows the measured reading comprehension to be compared.
Tasks were written in Spanish since all participants were native Spanish speakers. For the sake of readability, we also show the translation of the statements into English. Both tasks have an equal mathematical structure, expressed in terms of the relationships between the variables and quantities involved. This means that they are solved by applying the same rules, procedures, and algorithms. The question is placed at the end of the statement following the pattern a × b = ? where a and b are known quantities. Note that the semantic relationship between the variables and the unknown quantity, the lack of data in the question and the absence of irrelevant data are equivalent in both statements. The tasks can be classified as two AWPs of multiplicative comparison according to Puig and Cerdán [5]. This sort of problem uses a scalar function (I) to link two extensive quantities (E) of the same type of magnitude (E × I = E, the Schwartz relation [36]). For example, the scalar function in task 1 is "two-thirds of," while the two extensive quantities are "thirty candies" and the unknown quantity of "strawberry candies".
The proposed AWPs use the fraction (i.e., two-thirds) as an operator [37] that transforms an initial quantity (i.e., thirty candies or one-half of a pizza) into a final quantity (e.g., strawberry candies or a fraction of the pizza). This transformation is associated with the scalar function and the multiplication operator, as shown in Figure 3. The tasks are consistent [38] since they can be solved by directly translating the key terms in the statement (e.g., "are" or "is") into the operation to be performed, in this case a multiplication.
We can determine the grammatical complexity of the tasks by dividing the statement into propositions and analyzing their syntax. Each statement is composed of three propositions, as shown in Table 1. The first two relate to the informative part of the statement and the third one is the question. We configured the tasks in R&L so that just one proposition could be displayed at a time while the rest of them remained hidden (see the different colored segments in Figure 4). The length of the informative parts is the same in both statements (i.e., 3 + 6 words for P11 + P12 and P21 + P22 as from the original text in Spanish). The number of words in the question part differs (i.e., 5 to 7 words for P13 and P23 as shown in Table 2) because the introduction of rational numbers replaces the Spanish quantifier "cuántos" with "qué porción de," although the length stays the same in English. The grammatical complexity of each proposition is also represented by the number of nouns, verbs, numerals, prepositions and conjunctions in Table 2. The type of sentences can be categorized into levels as defined by Hunt [21]. Propositions P11 and P21 are declarative sentences of level 0. The rest of the propositions are level 1 since they include a subordination to the previous sentences by the terms "of them" (P12), "of it" (P22), "candies" (P13) and "of the pizza" (P23) respectively.
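As a worked instance of the relation E × I = E, the quantities quoted for task 1 (thirty candies and the scalar function "two-thirds of") give the following computation. The numerical result is derived here from those two quantities and is not quoted from the original statement.

```latex
\underbrace{30\ \text{candies}}_{E} \times \underbrace{\tfrac{2}{3}}_{I}
  = \underbrace{20\ \text{strawberry candies}}_{E}
```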

Research Hypotheses
We pose the following hypotheses in line with previous work on the mathematical concepts dealt with by our study:
• H1: The change from natural to fractional numbers increases the complexity of AWPs. According to Perera Dzul [39], difficulties begin when students face the study of fractions without having prior knowledge and enough situations in daily life that present problems related to rational numbers. Gairín and Muñoz [40], in a study on textbooks for the teaching of rational numbers in secondary education in Spain, affirm that rational numbers are overshadowed by the study of procedural aspects, making it difficult to transfer this concept to daily life problems.
• H2: The use of the fraction as an operator makes statements harder to understand. Authors like Hart [41] have already shown how challenging a syntagm of the type "two-thirds of them are" can be. Sanz, Figueras and Gómez [42] have also observed that students from 15 to 16 years old find it difficult to tackle this expression when presented literally in simple operative exercises.
• H3: Operating on a rational whole is more difficult than operating on a natural whole. Problems arise when the concept of the whole is reformulated. If the whole is not a natural but a fractional number, solving an AWP becomes a more difficult task [43].
Hypotheses 1 and 3 were tested by comparing the average reading times of propositions of the same level. Regarding H1, any increase in complexity from P11 to P21 can be attributed to the mere presence of fractional instead of natural numbers. By comparing the complexity of P12 and P22 we checked the effect of reformulating the whole (H3) from a natural number (i.e., thirty candies) to a fractional one (i.e., one-half of a pizza).
To test H2, we compared the average reading times of level 1 subordinate propositions with that of proposition P21. Propositions P12 and P22 include the syntagms "of them are" and "of them it is" that refer to the use of the fraction as an operator (from now on, we refer only to syntagms "of them are" in order to improve readability). We take proposition P21 as the reference level 0 declarative sentence since it also uses a rational number (i.e., one-half of a pizza), but it does so as a fractional quantity.

Analysis and Results
Reading times were rather dispersed in our group of students, as shown by the high standard deviations in Table 3 (values are expressed in seconds per word or s/word). The Kolmogorov-Smirnov test confirmed that the times recorded did not follow a normal distribution (p-value < 0.05) for the propositions ($T_{ij}$) or the complete statement ($T_i$). Therefore, we use the median as a good representative of each set of times. We did not use the mean in our analysis, since it is affected by outliers in the obtained asymmetric distributions. For example, see how most of the students read faster than the average reading time (empty circle) in the box-plots shown in Figure 5.
We checked for differences in the reading times due to the socioeconomic context and the gender of students. Differences between schools were not statistically significant following the non-parametric Wilcoxon signed-rank test for paired samples (p-values > 0.05). Reading times were also not statistically different between boys and girls (p-values > 0.05). We can then use the data obtained for the whole group to study the complexity of the statements.
Table 3. Reading times (s/word) for each proposition ($T_{ij}$) and task ($T_i$).
By comparing the reading times in Table 3 we can test our hypotheses as follows:
• H1: The change from natural to fractional numbers increases the complexity of AWPs. The median reading time increases from 5.12 s/word for proposition P11 to 6.68 s/word for proposition P21 (see also the difference reported in Figure 5). This rise in complexity is due to the change from a natural to a fractional initial quantity. The difference in medians is statistically significant according to the Wilcoxon signed-rank test (p-value = 0.0001 < 0.05). The results thus confirm this hypothesis.
• H2: The use of the fraction as an operator makes statements harder to understand. The median reading time of propositions that use the fraction as an operator (i.e., 2.78 s/word for P12 and 6.01 s/word for P22) is shorter than that of the proposition using the fraction as a quantity (i.e., 6.68 s/word for P21). The difference in medians is not statistically significant for task 2 according to the Wilcoxon signed-rank test (p-value = 0.069 > 0.05). The difference is significant for task 1 (p-value = 0.004 < 0.05) mainly due to the ease of operating on a natural whole, as we analyze below in H3. Thus, the syntagm "of them are" does not introduce further complexity to the statements in the AWPs studied.
• H3: Operating on a rational whole is more difficult than operating on a natural whole. The median reading time of proposition P22 (i.e., 6.01 s/word) is longer than that of proposition P12 (i.e., 2.78 s/word). Differences are statistically significant according to the Wilcoxon signed-rank test (p-value = 0.0002 < 0.05), as is also shown in Figure 5. Those results confirm the hypothesis that it was more complex to operate on a rational whole (e.g., one-half of a pizza) than to operate on a natural whole (e.g., thirty candies).
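The paired comparisons above rely on the Wilcoxon signed-rank test. The sketch below is a simplified pure-Python version that uses the normal approximation without tie or continuity corrections, so its p-values may differ slightly from those of a statistics package and it is not meant to reproduce the figures reported here. The sample reading times are made up.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples.

    Uses the normal approximation with no tie or continuity correction.
    Returns (W, p), where W is the smaller of the two signed rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p


# Made-up reading times (s/word) of eight students for two propositions.
slow = [5.0, 6.1, 7.3, 4.8, 9.0, 6.6, 7.7, 8.2]
fast = [3.0, 4.0, 5.1, 2.5, 6.5, 4.0, 4.9, 5.0]
w_stat, p_value = wilcoxon_signed_rank(slow, fast)
```

With small samples such as this one, an exact test would normally be preferred over the normal approximation.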
Student performance was rather good when solving the two proposed tasks. The success rate was 94.3% for task 1 and 62.9% for task 2. The median reading time of all propositions in task 2 was longer than that of task 1 (7.21 s/word and 3.67 s/word respectively) and the distribution was more dispersed (e.g., compare $T_2$ and $T_1$ in Figure 5). The previous results confirm that solving task 2 was more complicated than solving task 1.

Predicting Student Success from the Proposed Complexity Measure
We use a binary logistic regression model to predict the student success when solving an AWP. The model estimates the probability of succeeding (or failing) in completing a task from the complexity of its statement, measured as the reading time per word. The data obtained in our study were used to train a model for each task, as described by Equation (3), where $T_{ij}$ is the time taken by students to read each proposition (j) of the problem (i).
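Equation (3) is not reproduced in this text. For a task with three propositions, the standard binary logistic form that matches the fitted model reported for task 1 would read as follows. The intercept symbol $b_0$ is notation assumed for this sketch, while $b_j$ denotes the coefficient of proposition j.

```latex
P(\text{success}_i = 1) =
  \frac{1}{1 + e^{-\left(b_0 + b_1 T_{i1} + b_2 T_{i2} + b_3 T_{i3}\right)}}
```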
We discarded outliers from our data and kept the results of 58 students to build the model for task 1 and of 57 students for task 2. We trained the models with a random sample of 50 students and validated them with the remaining eight students (task 1) and seven students (task 2). Table 4 shows the relation between the reading time per proposition and the success of students from direct observation of the data. Faster reading times led to better performance in task 1 (inverse relation), whereas slower students were the best performers in task 2 (direct relation). These results are in line with the complexity of the statements analyzed above.
Table 4. Relation between the reading time per proposition and student success.
Proposition   P11      P12      P13      P21     P22     P23
Success       inverse  inverse  inverse  direct  direct  direct
The model built for task 1 is shown in Equation (4). It explains between 0.142 (Cox and Snell $R^2$ value) and 0.424 (Nagelkerke $R^2$ value) of the variance of the dependent variable. It gives an accuracy of 98.3% when calibrating on the train set and it correctly predicts the success of the eight students in the validation set. The sign of the coefficients obtained for each proposition ($b_j$) reproduces the inverse relation previously found between the reading time and the probability of successfully solving task 1 (see Table 4).
$$P(\text{success} = 1) = \frac{1}{1 + e^{-(7.302 - 0.063\,T_{11} - 0.788\,T_{12} - 0.269\,T_{13})}} \qquad (4)$$
We analyzed the odds ratio (OR) to understand the magnitude of the effect, that is, how much the probability of success changes as a result of increasing by one second the reading time of a proposition, the rest being constant. An OR greater than one indicates an increase in the probability while an OR less than one implies a decrease. Taking more time to read proposition P12 (i.e., higher values of $T_{12}$) lowers the probability of success since OR = 0.455. Increasing the reading time for propositions P11 and P13 does not affect the student's success that much since OR remains near to one (OR = 0.939 and OR = 0.764 respectively).
The model built for task 2 (see Equation (5)) is more limited since it explains between 0.056 (Cox and Snell $R^2$ value) and 0.175 (Nagelkerke $R^2$ value) of the variance of the dependent variable. It gives an accuracy of 65.4% when calibrating on the train set and it correctly predicts the success of four students in the validation set. All coefficients are positive and confirm the direct relation found in Table 4. They are also close to zero, which makes OR rather close to one.
For example, increasing the reading time of proposition P22 slightly raises the probability of success (OR = 1.117); the time taken to read propositions P21 and P23 does not have any significant effect on student success (OR = 1.009 and OR = 1.059 respectively).
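As a minimal sketch, the task 1 model of Equation (4) can be evaluated directly, and its odds ratios follow from exponentiating each coefficient. The intercept and coefficients are those reported for task 1; the function names and the sample reading times are hypothetical.

```python
import math

# Intercept and coefficients of the task 1 model as reported in Equation (4).
INTERCEPT = 7.302
COEFFS = {"T11": -0.063, "T12": -0.788, "T13": -0.269}


def success_probability(times):
    """Probability of solving task 1 given reading times keyed like COEFFS."""
    z = INTERCEPT + sum(COEFFS[name] * times[name] for name in COEFFS)
    return 1.0 / (1.0 + math.exp(-z))


def odds_ratios():
    """Odds ratio of each proposition: exp(coefficient)."""
    return {name: math.exp(b) for name, b in COEFFS.items()}


# Hypothetical student, then the same student reading P12 more slowly.
baseline = success_probability({"T11": 5.0, "T12": 3.0, "T13": 4.0})
slower_p12 = success_probability({"T11": 5.0, "T12": 6.0, "T13": 4.0})
```

Because every coefficient is negative, any increase in a reading time lowers the predicted probability, which is the inverse relation described for task 1.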
Far from being contradictory, the models represent the different complexities of the two statements. The overall reading time for task 1 was half the overall time for task 2 (e.g., see $T_1$ and $T_2$ in Table 3). Students having reading comprehension problems in task 1 thus showed higher probabilities of failure. By contrast, task 2 appeared as a more complex AWP whose successful resolution could benefit from investing more time in reading its propositions.

Discussion and Conclusions
We have presented a novel proposal to measure the complexity of an AWP through the students' reading comprehension of its statement. The approach allowed us to predict the students' success from their reading times when solving the task. The students' reading time has proved to be a good proxy to determine the complexity of AWPs and it can become an essential tool for the design of problem statements. By analyzing the statement propositions, one can adjust the level of complexity of the task to focus on certain student profiles.
The paper also introduces the use of the R&L technological environment to compute the complexity of a problem statement, without the need to use traditional paper-and-pencil questionnaires. In addition to that, R&L enables the collection of extensive data on student interactions and opens the way for more data-driven research on the topic.
The results obtained confirm that our procedure for measuring the complexity of AWPs is consistent with previous findings [14]. The two tasks under study can be classified as multiplicative comparison problems according to the semantic approach [5], whose difficulty lies in the introduction of fractional versus natural numbers [39][40][41].
We identified the complexity of the syntagms "of them are" or "of them it is" (or its equivalent "son de" in Spanish), which is related to the multiplication operator and to the concept of "fraction of" or "part of" [37]. These ideas begin to be developed in the school curriculum from the fourth year of primary education. The complexity of this concept, though, increases when it is applied to a fraction. These results may be linked to the design of tasks for current textbooks, where the concept of natural number is introduced through graphic support and considering the whole as a discrete quantity. However, when this concept is introduced over a fraction in the sixth year of primary education, the visual representation is usually removed and the whole becomes a continuum. As a result, the mathematical concept is taught through a rote rule, which associates this expression with the multiplication of fractions and leads to possible errors in later courses, as shown by researchers at the Rational Number Project (http://www.cehd.umn.edu/ci/rationalnumberproject/) and the National Assessment of Educational Progress (https://nces.ed.gov/nationsreportcard/). Our work confirmed this issue with a sample group of students in the last year of compulsory secondary education.
The complexity of the statement propositions has been used to build binary logistic regression models that predict the probability of success in solving AWPs. The models confirmed that the propositions that most affect this probability are those that involve a more difficult mathematical concept. In our study, these propositions are the ones that deal with the fraction as an operator over both a natural and a rational number.
It is worth noting that our approach also proposes the segmentation of the statement into propositions, whose complexity can be measured and compared following the classification into levels by Hunt [21]. In our study, first level propositions are declarative alphanumeric sentences where the numerical values are either natural numbers or fractions. Second level propositions introduce a subordinate clause through the syntagms "of them are" or "of it is". This fact goes far beyond evaluating the complexity by the success rate [44] and allows comparing the complexity of mathematical concepts within and across AWPs.
This work opens up a line of research on using technological environments and data analytics to determine the complexities of AWPs by measuring the level of understanding of each of the statements and dealing with the mathematical concepts that make them more difficult to solve. Next steps include the design of a longitudinal study by students' age that analyzes the evolution of the concepts and the possible blockages that occur. Future work will also help to define an index that allows creating AWP statements with prefixed complexities by weighting the propositions in the statement according to their level following the classification by Hunt [21].
These sorts of metrics and tools can be implemented by intelligent tutors designed to teach maths through problem-solving. They can help to track personalized teaching-learning paths for each student while using reading comprehension as one of the key drivers for predicting students' skills [26]. Despite the benefits provided by technological environments, the development of digital teaching competence continues to be a challenge for the education system [45,46]. However, the introduction of emerging tools and data analytics is progressively providing teachers and researchers with new experimental scenarios to study, for example, the possible impact of the use of feedback oriented to success when students interact with a given statement [31]. As Alonso et al. pointed out [22], the development of good teaching practices that integrate technology in the classroom can help teachers to start applying digital learning tools effectively and to improve their digital competence.