When Easy Becomes Boring and Difficult Becomes Frustrating: Disentangling the Effects of Item Difficulty Level and Person Proficiency on Learning and Motivation

The research on electronic learning environments has evolved towards creating adaptive learning environments. In this study, the focus is on adaptive curriculum sequencing, in particular, the efficacy of an adaptive curriculum sequencing algorithm based on matching the item difficulty level to the learner’s proficiency level. We therefore explored the effect of the relative difficulty level on learning outcome and motivation. Results indicate that, for learning environments consisting of questions focusing on just one dimension and with knowledge of correct response, it does not matter whether we present easy, moderate or difficult items or whether we present the items with a random mix of difficulty levels, regarding both learning and motivation.


Introduction
The appearance and functionality of electronic learning environments have changed tremendously as a result of both technological advances and the increased attention of researchers and companies [1,2].
Research today focuses mainly on creating adaptive learning environments in which one or more characteristics of the learning environment (e.g., difficulty of the items and type of feedback) are adapted to one or more features of the learner. Different classifications of learner features have been suggested in previous research. For instance, [3] differentiate between proficiency and learning styles; [4] go a step further and divide learner features into static (e.g., age and mother tongue) and dynamic categories (e.g., proficiency and learning styles); [5] differentiates between domain-specific (e.g., proficiency and skills) and domain-independent information (i.e., learning preferences and demographics). In this study, the learner feature of interest is the learner's proficiency, and the learning environment characteristic of interest is the item difficulty level. Such a personalized/individualized learning environment can, for instance, incorporate an adaptive item curriculum sequencing algorithm that provides a sequence of items that is contingent on the performance of the learner on previous items and on the difficulty of the remaining unsolved items [6,7].
Adaptive curriculum sequencing requires two main processes. The first process involves estimating the learner's proficiency level and the difficulty level of the course material (i.e., the difficulty of the items presented to the learners). The second process makes use of the obtained estimates to optimize the interaction between the learner and the learning material given to him or her [8]. The focus of this article is on the second process: it investigates the relationship between item difficulty level and person proficiency level, and their impact on learning and motivation, using Item Response Theory (IRT) in an item-based learning environment.
Before this can be investigated, we further clarify the kind of e-learning environment that is the focus of this study, because the characteristics of the e-learning environment affect the estimation method applied for determining the learner's prior proficiency and the difficulty level of the learning material, as well as the choice of sequencing algorithm. We focus on item-based e-learning environments that consist of small, independent tasks (which we call items), in which users learn by working through items and getting immediate feedback on their answers. Items differ in the amount of ability, but not the kind of ability, that is needed to solve them correctly. Most previous research attention has been drawn to the adaptive sequencing algorithms underlying learning environments with tasks, items, or learning materials that are not independent, but rather linked by some kind of relationship. For instance, if a learner correctly answers an item related to the dimension reading comprehension, then the probability is high that the learner will also correctly answer items related to the dimension technical reading, because technical reading is a prerequisite for reading comprehension. The adaptive sequencing algorithms most frequently implemented in such learning environments are the classic rule-based curriculum sequencing techniques [9,10] and probabilistic graphical models, such as Bayesian networks [11]. In rule-based curriculum sequencing, the learning path is based on the relationship between multiple dimensions defined by experts. Each dimension represents items related to a specific latent proficiency [12,13]. For instance, reading fluency and verb conjugation are two different latent proficiencies (each with their own series of items), and both proficiencies are expected to be correlated. The relationship in simple rule-based curriculum sequencing techniques is the prerequisite relationship, and curriculum sequencing is based on a single rule: Learn prerequisite knowledge first. Bayesian networks differ from rule-based models because they incorporate uncertainty into the different relationships by modeling the strength of a relationship as a probability. Furthermore, Bayesian networks are used to update the probabilities when new information comes in.
One disadvantage of these curriculum sequencing techniques is that they are only applicable to learning environments that include learning material consisting of dimensions that are linked by some kind of relationship (e.g., prerequisite, analogy, etc.). However, some learning environments consist of unrelated dimensions. While there is a fairly large body of research related to adaptive curriculum sequencing in learning environments with linked dimensions (i.e., multidimensional learning environments), less research has been conducted on sequencing algorithm techniques in learning environments where this relationship is absent, i.e., unidimensional learning environments [14]. Even though [12] states that such a simple curriculum can only be offered by random question sequencing, the aim of this study is to explore whether this statement can be underpinned by empirical research or whether a specific sequencing algorithm can be applied to item-based adaptive learning environments in order to improve learning efficiency, taking motivation into account. To reach our goal, a brief elaboration on the estimation of the learner's proficiency level and the item difficulty level is offered, followed by the implementation of this estimation method for adaptive item sequencing in testing environments and in item-based adaptive learning environments. Finally, we will argue that the optimal relative difficulty (depending on the learner's proficiency and item difficulty) might not be fixed for a given learner, but could increase or decrease according to his or her proficiency level [12].

Adaptive Item Sequencing in Item-Based Learning Environments
Simultaneous estimation of the item difficulty parameters and the learner's proficiency level and its application. It is important to estimate both the item difficulty level and the learner's proficiency level because the relative difficulty might be more influential than the absolute item difficulty. In item-based learning environments, the probability of success can be estimated by means of item response theory (IRT) [15]. IRT is a psychometric approach that emphasizes the fact that the probability of a discrete outcome, such as the correctness of a response to an item, is influenced by qualities of the item and by qualities of the person. Various IRT models exist, differing in degree of complexity, with the simplest IRT model (the Rasch model) stating that a person's response to an item depends on the person's proficiency level and the item's difficulty level [16]. The item difficulty parameter (i.e., $\beta_i$) and the person proficiency parameter (i.e., $\theta_s$) can be found in the following equation:

$$\pi_{pi} = P(X_{si} = 1) = \frac{\exp(\theta_s - \beta_i)}{1 + \exp(\theta_s - \beta_i)} \qquad (1)$$

As a consequence, IRT makes it possible to estimate the probability of success (i.e., $\pi_{pi} = P(X_{si} = 1)$) for each combination of an item difficulty and a person's proficiency. The person and item parameters can be placed on the same continuous scale, making it possible to match the difficulty of the item to the proficiency level of the learner. More specifically, the difficulty level of an item can be interpreted as the proficiency needed to have a 0.5 probability (i.e., $\pi_{pi} = 0.5$) of giving a correct answer. The higher the person's proficiency compared to the item difficulty (i.e., $\theta_s - \beta_i$ is high and positive), the greater the probability of giving a correct answer (i.e., the higher $\pi_{pi}$). The reverse results in a smaller probability. As a consequence, the Rasch model presented in Equation (1) takes into account that the difficulty of an item is relative to the learner's ability (and this is reflected in the probability of giving a correct answer, the relative item difficulty).
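As a minimal sketch of the relationship described above, the following Python function computes the Rasch probability of a correct response from a proficiency value and an item difficulty placed on the same scale. The function name and the example values are illustrative assumptions, not part of the study's software.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Probability of a correct response under the Rasch model.

    theta: person proficiency; beta: item difficulty (same logit scale).
    """
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


# When proficiency equals difficulty, the probability of success is 0.5.
print(rasch_probability(0.0, 0.0))

# A learner one logit above the item difficulty succeeds more often,
# a learner one logit below it less often.
print(rasch_probability(1.0, 0.0) > 0.5 > rasch_probability(-1.0, 0.0))
```

Because only the difference between proficiency and difficulty enters the formula, the same item can be relatively easy for one learner and relatively difficult for another.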
A major application of IRT is situated in computerized adaptive testing (CAT) [17,18], where adaptive item sequencing is used to obtain an estimate of the student's true underlying proficiency, based upon the item difficulty and the student proficiency. By varying the difficulty level of the item, one can evaluate the change in probability of giving a correct answer. More specifically, the sequencing algorithm in CAT, guided by the objective of precise measurement, targets the item that provides the most information on the person's proficiency level. For the Rasch model, this means that items are administered for which the person is expected to have about a 50% probability of answering the item correctly (i.e., items with a difficulty level close to the proficiency level). The sequencing algorithm in CAT can be described using the following steps: (1) a prior calibration study is conducted to create calibrated items (i.e., the item difficulty parameter values are estimated); (2) items with an optimal difficulty level are administered to the participants (more specifically, the optimal difficulty level of an item can be interpreted as the proficiency needed to have a 0.50 probability (i.e., $\pi_{pi} = 0.50$) of giving a correct answer); (3) the participant's proficiency level is estimated; (4) items with an adjusted difficulty level are administered (based upon the proficiency level estimated in step 3); (5) the estimated proficiency level is adjusted; (6) step 1 is repeated, or, if the proficiency level is accurately estimated (or if the test length has attained its maximum size), the sequence is stopped. The focus of this manuscript is step (4), estimating the adjusted difficulty level.
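The selection rule described above can be sketched as follows. For the Rasch model, the Fisher information of an item at proficiency theta equals p(1 - p), which is largest when p = 0.5, so maximizing information amounts to picking the unused item whose difficulty is closest to the current proficiency estimate. The function names and the small item bank are hypothetical and serve only as an illustration.

```python
import math


def rasch_probability(theta, beta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def item_information(theta, beta):
    """Fisher information of a Rasch item at proficiency theta: p * (1 - p)."""
    p = rasch_probability(theta, beta)
    return p * (1.0 - p)


def select_next_item(theta, difficulties, administered):
    """Index of the not-yet-administered item with maximal information."""
    candidates = [i for i in range(len(difficulties)) if i not in administered]
    if not candidates:
        raise ValueError("all items have already been administered")
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))


# Hypothetical calibrated item bank (difficulties on the logit scale).
difficulties = [-2.0, -0.5, 0.3, 1.0, 2.5]
first = select_next_item(0.4, difficulties, administered=set())
second = select_next_item(0.4, difficulties, administered={first})
print(first, second)
```

In a full CAT loop, the proficiency estimate would be updated after each response before the next item is selected.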
An important difference between testing and learning environments makes us question whether this CAT item selection algorithm is also suitable for learning purposes.While in testing environments, the objective is to select the item that would be most informative for refining the person's proficiency estimate; in learning environments, the objective is to select the item that optimizes the probability of progressing to a higher proficiency level [19].
Adaptive item sequencing. Researchers have recognized the importance of developing a productive adaptive curriculum sequencing strategy, that is, a strategy that leads to effective and efficient learning. Whether this strategy alternates between difficult and easy items, aims at resolving misconceptions, or makes the decision based on the ideas of CAT, the overall objective is to enhance learning and increase or maintain motivation. Some theories, such as the flow theory [20] and the self-determination theory [21], state that learners are more motivated by challenging tasks. According to the flow theory, the learner's perceived challenge and his or her proficiency should be balanced. If there is not a good balance, feelings of anxiety (with high challenge and low proficiency) and boredom or uninvolvement (with low challenge and high proficiency) will be the result [22]. Moderately challenging tasks, i.e., tasks that are somewhat beyond the learner's current proficiency [20,21,23], make learners, on the one hand, aware that they lack some proficiency but, on the other hand, keep them involved [24,25]. Hence, problems of intermediate difficulty engage learners and lead to a greater enjoyment of the task. Furthermore, learners conducting moderately challenging tasks feel more successful, efficacious, and in control of their own learning [25]. On the other hand, overly challenging tasks can have an adverse effect on motivation and persistence. More specifically, tasks that are too difficult relative to the learner's actual proficiency or his/her perceived proficiency have a negative impact on the feeling of competence, expectations of success, and enjoyment of the activity, and increase anxiety [21,22,26,27]. The underlying idea, supported by the flow theory, is that those feelings of anxiety may inhibit the learner's involvement and task engagement [20,28]. Furthermore, overly challenging or difficult tasks can be perceived as a threat to the learner's sense of competence, resulting in lower self-efficacy [26]. The learner's interest in the learning material may buffer the negative effects of overly challenging tasks on motivation [29]. Interest is composed of both intrinsic motivation [30] and task value/task motivation [31]. Learners who are interested in the task are more likely to enjoy challenging tasks, while learners who are not interested in the task are more likely to avoid challenge [32,33]. The underlying process is possibly mediated by arousal and attention [34].
In sum, to maximize the motivation of learners, tasks should provide an intermediate probability of success, rather than offering an almost certain probability of success (i.e., the probability of correctly answering an item is always close to 1) or failure (i.e., the probability of correctly answering an item is always close to 0) [29,35,36]. In other words, research is needed to find the optimal probability of answering an item correctly (i.e., $\pi_{pi}$ in Equation (1)) in order to keep the learner engaged. This probability depends on both the learner's proficiency and the item difficulty, as reflected in the Rasch model presented in Equation (1). In instructional game research, it is indeed found that games that are too easy or too difficult can lead to a reduction in motivation and in time on task [37]. [38] further argue that this effect may result in less positive learning outcomes [39]. However, with regard to learning outcome, results are found to be inconsistent. [40] studied the effect of feedback and adaptive sequencing of tasks on learning outcome and learning efficiency. Results indicated that adaptive task sequencing did not lead to more effective learning. On the other hand, some studies did find a significant effect of adapting the difficulty and support of learning tasks to the learner's competences and perceived cognitive load [41-43]. Other researchers found a positive effect of IRT-based adaptive item sequencing on learning. More precisely, adaptive environments in which items were selected because the learner had a 50% probability of answering them correctly yielded faster learning than a non-adaptive learning environment [44,45]. Furthermore, research on CAT suggests that administering easier items would foster motivation and lead to a higher performance score, especially for persons with a low proficiency level [46]. Hence, the selection of challenging tasks is not only supposed to enhance motivation, but could also have an effect on learning. In addition, prior research found that learner characteristics and, in particular, the learner's proficiency can influence learning outcomes [38,47-50] and the need for any enhancement to the basic learning material, such as adaptive task sequencing [51-55]. The overall finding is that students with low proficiency benefit more from adaptive learning environments than do students with high proficiency [38,53,55]. However, one study found that adapting the difficulty is more beneficial for advanced learners than it is for novice or intermediate learners [52]. Based on this prior research, the present study will also examine the influence of the learner's prior proficiency level on the relationship between the adaptive item sequencing algorithm and motivation and learning outcomes in item-based learning environments.
Previous studies did not differentiate between different levels of difficulty and different levels of proficiency, which is needed for an accurate estimate of the relative item difficulty. In this study, we sought to provide initial evidence as to whether a particular relative difficulty level is more effective than others in a specific item sequencing algorithm in terms of learning and motivation, and hence aim to answer the following question: What is the optimal relative difficulty to use in an item-based learning environment where the item difficulty level and the learner's proficiency level are estimated by means of IRT? The relationship between item difficulty level and person proficiency level, and their impact on learning and motivation, will be disentangled.

Experiment
To date, no previous research has systematically compared item selection algorithms in item-based adaptive learning environments by considering different levels of relative item difficulty. This research sought to provide initial evidence as to whether a particular relative item difficulty level (i.e., the probability of answering an item correctly) is more effective than others in terms of maintaining the learner's motivation and, in turn, enhancing learning outcomes. We chose to examine six different item selection algorithms in the learning environment: items for which the learner has a probability between 0.40 and 0.50 of answering the item correctly (i.e., $\pi_{pi}$ in Equation (1) ranges from 0.40 to 0.50), a probability between 0.50 and 0.60, 0.60 and 0.70, 0.70 and 0.80, 0.80 and 0.90, and a selection algorithm that randomly selects items for which the learner has a probability between 0.40 and 0.90 of answering the item correctly. The model used to estimate the relative item difficulty is presented in Equation (1). The outcome score (i.e., $\pi_{pi}$, the relative difficulty or probability of correctly responding to the item) is a function of the item difficulty and the learner's proficiency parameters, $\beta_i$ and $\theta_s$, respectively. Following previous studies focusing on adaptive technologies [40,56], we predict that adaptive item sequencing will result in higher learning outcomes and a higher level of motivation.
This results in the following research hypotheses: (1) Items with a moderate relative difficulty ($\pi_{pi}$ = 0.60-0.70) will result in higher task involvement, higher interest, and higher perceived competence than when presenting more difficult items ($\pi_{pi}$ = 0.40-0.60).
(2) Relatively easy items ($\pi_{pi}$ = 0.70-0.90) will result in lower task involvement (effort). The learner's interest (intrinsic motivation and task motivation) is presumed to buffer the negative effect that difficult items have on motivation.
(3) Proficiency has a moderating effect on the relationship between the relative item difficulty level and learning outcome. In other words, the relation between relative item difficulty and learning outcome depends on proficiency.

Method
Participants. Students from ten educational programs in the Flemish part of Belgium (1st and 2nd year of the Bachelor Linguistics and Literature-KU Leuven; 1st, 2nd and 3rd year of the Bachelor Teacher-Training for primary education-Katho Tielt; 1st and 2nd year of the Bachelor Teacher-Training for secondary education-Katho Reno; 1st and 2nd year of the Bachelor of Applied Linguistics-HUB and Lessius; and 1st year of the Bachelor Educational Science-KU Leuven) were contacted to participate in the experiment. Two hundred twenty participants completed the entire study (i.e., pre-test, learning phase and post-test). Descriptive statistics of the participants are presented in Table 1.
Design. In a pre-test, proficiency and motivation were measured. A covariate adaptive randomization design was used with proficiency (6 levels) as covariate. Participants within each covariate level were randomly assigned to one of the six between-subject conditions (i.e., relative difficulty level) that were part of the learning phase: (1) very difficult (VD), in which participants had a probability between 0.40 and 0.50 of answering an item correctly; (2) difficult (D), in which participants had a probability between 0.50 and 0.60 of answering an item correctly; (3) moderate (M), in which participants had a probability between 0.60 and 0.70 of answering an item correctly; (4) easy (E), in which participants had a probability between 0.70 and 0.80 of answering an item correctly; (5) very easy (VE), in which participants had a probability between 0.80 and 0.90 of answering an item correctly; and (6) random (R), in which participants were presented a random set of items for which they had a probability between 0.40 and 0.90 of answering those items correctly. Every difficulty condition included a similar number of participants (n(VD) = 36; n(D) = 34; n(M) = 38; n(E) = 37; n(VE) = 39; n(R) = 31). After the learning phase, a post-test was administered, consisting of a proficiency test and a post-experimental motivation measurement.
Material. The web-based learning environment. In this study, the open source software Moodle 2.0 ® (http://www.moodle.org) was used to create and administer: (1) the pre-test, (2) the course of the learning phase, and (3) the post-test. The testing and learning material (i.e., items) consisted of fill-in exercises on French verb conjugation. Every item contained one example of the required verbal form, followed by the actual verb that the learner needed to conjugate. After completing an item, participants received explanatory feedback on the correct response. Each item had an associated item difficulty parameter value. The items were calibrated (i.e., the item difficulty parameter values were estimated) by means of a calibration study conducted by SELOR (Selectie en Orientatie, the official assessment center of the federal Belgian government that selects and tests candidate civil servants in Belgium). Items were calibrated using the Rasch model, based on the data from 2961 examinees. The SELOR examinees completed the calibration study because the administered items are used to test the examinees' proficiency in French verb conjugation. Examinees who successfully completed the test were promoted within the government. The examinees are not part of the current study.
Introduction and pre-test. All participants completed a proficiency test consisting of 25 fill-in items testing French verb conjugation. The test was not time-limited and the average time to complete the test was close to 20 min. The pre-test total scores ranged from 4 to 25 with a mean of 15.81 and a standard deviation of 4.53.
To measure motivation, we adapted the Motivated Strategies for Learning Questionnaire (i.e., MSLQ) developed by [57] so that this questionnaire would be applicable to French language learning. Sample items include the following: (1) "For learning French connector words, I prefer tasks that really challenge me so I can learn new things"; (2) "Understanding the use of French connector words is very important to me". The questionnaire consisted of 18 6-point Likert type items (1 = strongly disagree, 6 = strongly agree), divided over three scales: (1) self-efficacy and performance, (2) intrinsic motivation, and (3) task value.
Based on the responses of the 215 study participants who filled out the questionnaire, we found that these scales are internally consistent (by calculating Cronbach's α, [58]): intrinsic motivation, consisting of four items (α = 0.732), asking students why they are engaging in the learning task; task value, consisting of six items (α = 0.836), asking students how interesting, important, and useful they find the task; and self-efficacy and performance, consisting of eight items (α = 0.937), asking students for their expectancy for success and self-efficacy.The motivation questionnaire (as measured by the three subscales) was found to be reliable (α = 0.803), with all subscales showing a positive correlation (p < 0.01).Both intrinsic motivation and task value are regarded as pre-experimental motivation/interest, while self-efficacy and performance was considered a separate scale.
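For readers unfamiliar with the reliability coefficient reported above, the following sketch shows how Cronbach's α can be computed from a respondents-by-items score matrix, using the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of total scores). The function name and the toy responses are hypothetical; they are not the study's data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[j] for row in responses]) for j in range(n_items)
    ]
    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in responses])
    return n_items / (n_items - 1) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical 6-point Likert responses of five respondents to four items.
toy_responses = [
    [5, 4, 5, 4],
    [2, 3, 2, 2],
    [6, 6, 5, 6],
    [3, 3, 4, 3],
    [4, 5, 4, 4],
]
print(round(cronbach_alpha(toy_responses), 3))
```

Values closer to 1 indicate that the items of a scale vary together, which is the sense in which the subscales are described as internally consistent.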
Learning phase. For each combination of prior proficiency (n = 6) and difficulty condition (n = 6), random sets of 80 items were compiled. All items were on French verb conjugation and were scored binarily (1 for a correct response and 0 for an incorrect response). Learning phase total scores ranged from 21 to 76 with a mean equal to 55.54 and a standard deviation of 11.11.
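One way such item sets could be compiled is sketched below. Inverting the Rasch model gives β = θ − logit(π), so a target probability band translates into a window of item difficulties for a given proficiency value. The item bank, the use of a single representative proficiency value per proficiency level, and all function names are assumptions made for illustration; the source does not describe its implementation.

```python
import math
import random

# Target probability bands per condition, as defined in the design.
CONDITIONS = {
    "VD": (0.40, 0.50), "D": (0.50, 0.60), "M": (0.60, 0.70),
    "E": (0.70, 0.80), "VE": (0.80, 0.90), "R": (0.40, 0.90),
}


def rasch_probability(theta, beta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def logit(p):
    """Log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def difficulty_window(theta, p_low, p_high):
    """Difficulty range whose success probability lies in [p_low, p_high]."""
    # A higher success probability corresponds to a lower item difficulty.
    return theta - logit(p_high), theta - logit(p_low)


def compile_item_set(theta, condition, item_bank, n_items, rng):
    """Randomly draw n_items from the bank that fall inside the condition's band.

    item_bank: hypothetical dict mapping item id -> calibrated difficulty.
    """
    low, high = difficulty_window(theta, *CONDITIONS[condition])
    eligible = [item for item, beta in item_bank.items() if low <= beta <= high]
    if len(eligible) < n_items:
        raise ValueError("not enough calibrated items inside the target band")
    return rng.sample(eligible, n_items)


# Hypothetical bank of 601 items with difficulties from -3.0 to 3.0.
bank = {f"item{i}": -3.0 + 0.01 * i for i in range(601)}
selected = compile_item_set(0.5, "M", bank, 10, random.Random(1))
print(len(selected))
```

Under this sketch, the random condition (R) is simply the widest band, so its items mix the difficulty levels of the five other conditions.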
Post-experimental phase. After the learning phase, all participants received 25 fill-in items on French verb conjugation, with equal content for all conditions. The post-test scores ranged from 5 to 25, with a mean equal to 16.59 and a standard deviation of 4.21. To measure post-experimental motivation, a translated version of the Intrinsic Motivation Inventory (IMI) [59] was used. We selected four relevant subscales. The questionnaire consisted of 25 6-point Likert type items (1 = strongly disagree, 6 = strongly agree) divided into four subscales that were found to be reliable in the present study (n = 215, Cronbach's α): interest/enjoyment, consisting of seven items (α = 0.924), perceived competence, consisting of six items (α = 0.918), value/usefulness, consisting of seven items (α = 0.923), and effort/importance, consisting of five items (α = 0.853). The motivation questionnaire (as measured by the four subscales) was found to be reliable (α = 0.797), with all subscales showing a positive correlation (p < 0.01).
Procedure. Introduction and pre-test. During the pre-experimental phase, the participants first received a short introduction to the experiment. Subsequently, they signed the informed consent, provided some background information, and filled in the Motivated Strategies for Learning Questionnaire (MSLQ). After completing the MSLQ, the participants completed the proficiency test consisting of 25 fill-in items.
Intermediate analysis. The proficiency of participants was assessed by applying the Rasch model (Equation (1)) to the participants' scores on the 25 fill-in items of the proficiency test, whose difficulties were known. Based on the resulting proficiency estimates, participants were grouped into six proficiency levels: [2;3[, and [3;8[. Within each proficiency level, participants were randomly assigned to one of the six experimental conditions.
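The proficiency estimation step can be illustrated with a maximum-likelihood sketch. This is an assumption: the source states only that the Rasch model was applied with known item difficulties, not which estimator was used. The Newton-Raphson update below uses the first derivative of the log-likelihood (observed score minus expected score) and the test information. Because all-correct and all-incorrect response patterns have no finite maximum-likelihood estimate, the sketch returns a hypothetical bound in those cases.

```python
import math


def rasch_probability(theta, beta):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def estimate_proficiency(responses, difficulties, bound=8.0, tol=1e-8, max_iter=100):
    """Maximum-likelihood proficiency estimate given known item difficulties.

    responses: list of 0/1 scores; difficulties: matching calibrated values.
    bound: hypothetical limit returned for zero or perfect scores.
    """
    score = sum(responses)
    if score == 0:
        return -bound
    if score == len(responses):
        return bound
    theta = 0.0
    for _ in range(max_iter):
        probs = [rasch_probability(theta, beta) for beta in difficulties]
        gradient = score - sum(probs)                    # observed - expected score
        information = sum(p * (1.0 - p) for p in probs)  # test information
        step = gradient / information
        theta = max(-bound, min(bound, theta + step))
        if abs(step) < tol:
            break
    return theta


# Hypothetical example: ten items of difficulty 0, seven answered correctly.
print(round(estimate_proficiency([1] * 7 + [0] * 3, [0.0] * 10), 3))
```

With equally difficult items the estimate reduces to the log-odds of the proportion correct, which offers a quick sanity check on the routine.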
Learning phase. One week after the pre-test, participants completed 80 items during a learning phase. After each response, they received feedback on the correctness of their answer; at the same time, the correct response was provided [60].
Post-experimental phase. During the post-experimental phase, the participants completed the post-test consisting of 25 items. Subsequently, they filled in the IMI. The total duration of the learning and post-experimental phase was approximately one and a half hours.
Data Analysis. The total number of students who completed the pre-test, learning phase and post-test in the experiment was 220. Participants with a score on the pre-test, learning set, or post-test of 3 SDs below or above the average score were excluded from the analysis (n = 4). We chose 3 SDs as the criterion for identifying outliers because scores deviating more than 3 SDs from the mean are unlikely (i.e., 0.3% of the scores if we assume a normal distribution [61]). All excluded participants had a score of more than 3 SDs below the average score (i.e., $\bar{X}_{\text{pre}}$ = 15.77, $\bar{X}_{\text{learn}}$ = 54.95, $\bar{X}_{\text{post}}$ = 16.48), possibly due to a lack of effort those participants had put into the experiment. 45 out of the 215 (20.83%) study participants had missing values on either the MSLQ or IMI scale (completely at random), which was used for the post-experimental motivation analysis (i.e., the effect of proficiency, prior motivation, and difficulty on post-experimental motivation). Instead of deleting the participants with missing values on these scales, we applied the regression-based multiple imputation technique (after investigating the percentage of missing data per variable and per case and investigating the pattern of missing values [62]). Values were imputed by borrowing strength from the known values of the different variables in the dataset. A sensitivity analysis was conducted to investigate whether the imputation method had an effect on the results by comparing the results of the regression-based imputation method with those of maximum likelihood estimation. Only small differences were found, and conclusions remained the same. Therefore, we report here the results of the regression-based imputation method.
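The 3-SD criterion described above can be sketched in a few lines. The snippet flags scores lying more than three sample standard deviations from the mean and computes the two-sided tail probability of a standard normal distribution beyond ±3, which is the source of the roughly 0.3% figure. The function name and the toy scores are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def outlier_flags(scores, n_sd=3.0):
    """True for every score more than n_sd sample SDs away from the mean."""
    center = mean(scores)
    spread = stdev(scores)
    return [abs(score - center) > n_sd * spread for score in scores]


# Share of a normal distribution lying more than 3 SDs from the mean.
tail_probability = 2.0 * (1.0 - NormalDist().cdf(3.0))
print(round(tail_probability, 4))  # 0.0027

# Hypothetical test scores with one extremely low value at the end.
toy_scores = [14, 15, 16, 17, 18] * 6 + [0]
print(outlier_flags(toy_scores).count(True))
```

Note that the mean and standard deviation in this sketch are computed with the extreme score included, so in small samples a single outlier inflates the threshold it is compared against.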
Participants with a score on the different scales of the MSLQ and IMI of more than 3 SDs below or above the average score were also excluded from this specific analysis (n = 1). The excluded participant had a score of more than 3 SDs below the average score on the value/usefulness subscale of the Intrinsic Motivation Inventory. In sum, a total of 215 study participants were included in the analysis with imputed values on the MSLQ and IMI variables.
The influence of the difficulty condition (i.e., independent variable, grouping variable) on the learning outcome (measured as the difference between the post-test and pre-test score), controlling for self-efficacy, prior motivation, and proficiency (i.e., the covariates), was investigated. Analysis of covariance (i.e., ANCOVA) is the recommended analysis method for this purpose. Prior to the analysis, we tested the homogeneity assumption using Levene's test. Based on this, we concluded that the homogeneity assumption was not violated, F(5,209) = 1.69, p = 0.14. In addition, we evaluated the normality assumption by applying the Kolmogorov-Smirnov and Shapiro-Wilk tests, and no significant deviations from normality were identified (the kurtosis statistic varies between −0.64 and 1.14 and the skewness between −0.230 and 0.230). In addition, ANCOVA was robust, as our group sizes are very similar, there are at least 20 degrees of freedom, and the smallest response category contained at least 20% of all responses [62].
A multivariate analysis of covariance (MANCOVA) was applied to investigate the influence of the grouping variable (difficulty level) and covariates (i.e., prior motivation and self-efficacy) on multiple dependent variables (i.e., the four subscales of post-intervention motivation). The following assumptions were evaluated: independence of observations, multivariate normality using the Kolmogorov-Smirnov and Shapiro-Wilk tests, homogeneity of the covariance matrices using Box's test, homogeneity of error variance using Levene's test, and the absence of multicollinearity. Ideally, the dependent variables are moderately correlated with each other. If correlations are low, it is better to run separate one-way ANOVAs; if the correlations are larger than 0.9, then there is such strong multicollinearity that an analysis of one of the dependent variables is sufficient. Box's test of equality of covariance matrices indicates that there is no significant difference between the covariance matrices of the four dependent variables [Box's M = 50.007, F(50, 44676.961) = 0.937, p = 0.601]. In addition, Levene's test of equality of error variances indicates that the error variances are equal across the groups. The Kolmogorov-Smirnov test and Shapiro-Wilk test both indicate that the data are multivariate normally distributed. The correlations between the four dependent variables were moderate in size (ranging from 0.305 to 0.661), supporting the choice of a MANCOVA.

Results
All analyses reported in the present study used a significance level of 0.05. The equality of conditions (VD, D, M, E, VE, and R) was ascertained for proficiency, as measured by the total score on the pre-test, F(5,209) = 0.320, p = 0.904, and prior motivation, F(5,200) = 1.44, p = 0.211. This means that there was no systematic difference between the six conditions in terms of proficiency and prior motivation.
Manipulation check. A logistic regression analysis was conducted with the binary response on the learning phase as the dependent variable and five of the six difficulty groups (VD, D, M, E and VE) as the independent variable. The difficulty groups had a statistically significant effect on the outcome score [t(5) = 558.35, p < 0.001]. The random difficulty condition was not included in this analysis because, in this condition, a random mix of difficulty levels was presented. The proportion correct score for the learning phase by the difficulty condition and the confidence intervals of the mean proportion correct score for the learning phase by the difficulty condition can be found in Figure 1.
Learning outcome. The mean of the pre-test (X_pre = 15.85) was significantly lower than the mean of the post-test (X_post = 16.63), t(215) = −3.37, p = 0.001. Since both tests were equally difficult (the true score at θ = 0.5 is 12.504 for the pre-test and 12.505 for the post-test), the results suggest that learning occurred.
An ANCOVA with self-efficacy (measured by the MSLQ), prior motivation (i.e., intrinsic motivation and task value measured by the MSLQ), and proficiency as covariates, and the difficulty condition as the independent variable (VD, D, M, E, VE, and R), was run to explain the variance in learning outcome (i.e., the difference between the score on the post-test and the score on the pre-test). Learners' self-efficacy scores did not affect learning outcome, F(1,205) = 1.09, p = 0.297, ηp² = 0.003. Prior motivation had a positive but very small and non-significant effect on learning outcome, F(1,205) = 0.07, p = 0.787, ηp² = 0.0002.
Moderator effect of proficiency. A hierarchical multiple linear regression was conducted to determine whether the difficulty condition and proficiency have a significant interaction effect on learning outcome. We wanted to investigate whether the effect of prior proficiency on learning outcome depends on the level of difficulty. The difficulty condition and proficiency were entered in Step 1, explaining 24.01% of the variance in the learning outcome scores. The predictive model for Step 1 was statistically significant, F(6,208) = 10.96, p < 0.001. After entering the interaction term at Step 2, the total variance explained by the model as a whole was 25.63%, F(11,203) = 6.36, p < 0.001. The interaction term, therefore, explains hardly any additional variance in the learning outcome; the proportion of explained variance changed by only 0.016 [F(5,203) = 0.995, p = 0.422]. Examination of the beta values highlighted the significant contribution of proficiency (b = −0.413, p < 0.001). This demonstrated that learning outcomes decrease as proficiency level increases. The interaction effect of the difficulty condition and proficiency (b = 0.014, p = 0.601) and the main effect of the difficulty condition (b = −0.384, p = 0.394) were non-significant. Detailed statistics of this hierarchical multiple regression are provided in Table 3.
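As a worked check of the two regression steps, the overall F statistic of each model can be recovered from its reported R² and degrees of freedom. The sketch below uses the standard relation F = (R²/k) / ((1 − R²)/(n − k − 1)); the predictor counts (6 and 11) are inferred from the reported degrees of freedom, not stated explicitly in the text, and the function name is illustrative.

```python
def f_from_r2(r2, n_predictors, n_obs):
    """Overall F statistic of a linear regression, computed from its R^2."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    f_value = (r2 / df_model) / ((1.0 - r2) / df_resid)
    return f_value, df_model, df_resid


N = 215  # participants included in the analysis

# Step 1: difficulty condition and proficiency; 6 predictors inferred from F(6,208).
f1, df1_model, df1_resid = f_from_r2(0.2401, 6, N)

# Step 2: interaction terms added; 11 predictors inferred from F(11,203).
f2, df2_model, df2_resid = f_from_r2(0.2563, 11, N)

# Change in the proportion of explained variance between the two steps.
delta_r2 = 0.2563 - 0.2401

print(f"Step 1: F({df1_model},{df1_resid}) = {f1:.2f}")
print(f"Step 2: F({df2_model},{df2_resid}) = {f2:.2f}")
print(f"Delta R^2 = {delta_r2:.3f}")
```

Because the inputs are the rounded R² values from the text, the computed statistics agree with the reported ones only up to rounding.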

Discussion
In this study, we aimed at identifying the optimal item sequencing algorithm in item-based adaptive learning environments by disentangling the relationship between the item difficulty level and the learning outcome and motivation. As little experimental research has been conducted on evaluating the efficacy of item sequencing algorithms with varying item difficulty levels, this study tried to bridge the gap by evaluating six difficulty conditions in which participants had a varying probability of answering an item correctly: (1) between 0.40 and 0.50; (2) between 0.50 and 0.60; (3) between 0.60 and 0.70; (4) between 0.70 and 0.80; (5) between 0.80 and 0.90; and (6) between 0.40 and 0.90. The six difficulty conditions were evaluated on learning outcome and motivation.
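The six probability bands can be read as a filter over a calibrated item bank. The sketch below is illustrative only: it assumes a Rasch model for the probability of a correct response (the model is not named in this section), maps the condition labels VD to R onto the bands in the order listed (an inference), treats each band as a half-open interval, and uses a hypothetical item bank and function names.

```python
import math


def p_correct(theta, b):
    """Probability of a correct response under a Rasch model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Target probability bands per condition; the label-to-band mapping is inferred.
BANDS = {
    "VD": (0.40, 0.50),
    "D": (0.50, 0.60),
    "M": (0.60, 0.70),
    "E": (0.70, 0.80),
    "VE": (0.80, 0.90),
    "R": (0.40, 0.90),
}


def eligible_items(theta, item_bank, condition):
    """Return the ids of items whose success probability falls in the condition's band."""
    lo, hi = BANDS[condition]
    return [
        item_id
        for item_id, b in item_bank.items()
        if lo <= p_correct(theta, b) < hi
    ]


# Hypothetical item bank: item id -> difficulty parameter on the proficiency scale.
bank = {
    "item_a": 0.2,
    "item_b": -0.2,
    "item_c": -0.6,
    "item_d": -1.1,
    "item_e": -1.7,
    "item_f": 1.5,
}

print(eligible_items(0.0, bank, "M"))
print(eligible_items(0.0, bank, "R"))
```

In an adaptive setting, the proficiency estimate `theta` would be updated after each response, so the set of eligible items shifts as the learner progresses.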
Results showed that the difficulty condition had no significant effect on either learning outcome or motivation. Because the number of participants is relatively large, the lack of significance does not seem to be a consequence of a lack of power. This finding therefore suggests that, in item-based adaptive learning environments covering only one latent proficiency (in this study, French verb conjugation), it makes no important difference whether you present items that are adapted to the learner's proficiency level or whether you select items randomly. As a consequence, Hypothesis (1), stating that items with a moderate difficulty will result in higher task involvement, higher interest, and higher perceived competence than more difficult items, and Hypothesis (2), stating that relatively easy items will result in lower task involvement (effort), could not be confirmed. Higher proficiency appeared to be predictive of lower learning outcomes, and this was independent of the difficulty condition. Therefore, Hypothesis (3), assuming that proficiency has a moderating effect on the relationship between the relative item difficulty level and learning outcome, could not be confirmed either.
Furthermore, no single difficulty condition maximized the learning outcome relative to the others. Hence, the results provide empirical evidence for Brusilovsky's statement [11] that simple curriculum learning does not benefit from adaptive sequencing compared to random question sequencing. Moreover, the results are in line with the findings of [40], who found that adaptive task sequencing did not yield more efficient learning.
In this study, the learning outcome was measured by means of a post-test of approximately the same difficulty as the pre-test. Because both the pre-test and the post-test consisted of items from a calibrated item bank (i.e., the item difficulty parameters are known and located on one continuous scale), the scores on the pre-test and the post-test could be compared and could function as a measure of learning outcome. Other possible measures, such as retention and the time it takes to learn, were not taken into consideration. Because we randomly assigned the participants to experimental conditions and the conditions differed only in the difficulty of the items in the learning phase, we can exclude the influence of confounding factors. However, we must be more prudent in interpreting the improvement of the average score from pre-test to post-test: whereas this improvement suggests a positive effect of the learning phase, it cannot be excluded that the improvement is (partly) due to, for instance, study activities in the days between the pre- and post-test.
Furthermore, it needs to be considered that the proportion correct scores in the learning phase for each difficulty condition, and particularly for the more difficult conditions, were substantially higher than expected on the basis of the item difficulty parameters. This shortcoming in the difficulty manipulation might have resulted in too small a distinction between the different difficulty conditions, leading to non-significant effects of the difficulty condition. It could also explain why the difficulty condition was found to have no negative effect on post-experimental motivation: the difficult items might not have been difficult enough to have an adverse effect on motivation. A possible explanation for the high proportion correct scores in the learning phase is that learning took place during the learning phase itself.
The MSLQ [57] was used to measure prior motivation (interest, i.e., intrinsic motivation and task value, and self-efficacy). The IMI [59] was used to measure post-experimental motivation (interest/enjoyment, perceived competence, effort/importance and value/usefulness). The two questionnaires contain distinct subscales; consequently, it would have been better to choose one questionnaire to present to the learners both before and after the learning phase. Furthermore, asking learners to rate their agreement with specific attitudes, beliefs, and activities is only one method of measuring motivation. Future research could also focus on behavior in the learning environment as an indicator of motivation.
Because the literature reports inconsistent results with regard to the presence or absence of the relationship between difficulty level and learning outcome, future research should focus on inferring the specific characteristics of the learning environments in which this relationship does or does not hold. Besides, the grammar items in our study are non-authentic, while some researchers suggest that authentic tasks can speed up the learning of grammar [63]. Furthermore, simple knowledge of correct response feedback might not be enough to effectively promote learning. According to [40], elaborated feedback would ensure that the assessment itself is a valid learning experience. Besides examining the specific characteristics of the learning environments in which the studied relationship does or does not hold, future research may also consider different item selection algorithms. In this article, we explored an item selection algorithm that is comparable to the item selection algorithm in CAT. Other procedures for sequencing the items in an item-based learning environment are available, but require further investigation, such as alternating between relatively difficult and relatively easy items or incorporating a moving window as in the moving test approach [64,65].
In sum, this study provides initial evidence as to whether a particular relative item difficulty level is more effective than others in terms of maintaining learner motivation and, in turn, enhancing learning outcome in an item-based learning environment. Findings indicate that, for learning environments consisting of simple questions (i.e., questions dealing with one proficiency) provided with knowledge of the correct response, it does not matter whether we present easy, moderate, or difficult items or whether we present the items with a random mix of difficulty levels. This research may instigate further examination, which could take other characteristics of the learner and the learning environment into consideration.

Appendix A
Motivated Strategies for Learning Questionnaire (i.e., MSLQ), by [57]. Response scale: Totally Disagree to Totally Agree.

Figure 1. Mean proportion correct score for each difficulty condition. Error bars represent 95% confidence intervals. VD = Very Difficult; D = Difficult; M = Moderate; E = Easy; VE = Very Easy; R = Random.

Table 1. Descriptive Statistics of Study Participants.
Note: GSO = General Secondary Education. TSO = Technical Secondary Education. The total number of participants is 215.

Table 3. Testing Moderator Effect Using Hierarchical Multiple Regression.
[59] consider the difficulty level of this topic, the type of instruction and my own proficiency, then I will perform well for this course.
Intrinsic Motivation Inventory (i.e., IMI) [59]