1. Introduction
A social interactionist approach to learning acknowledges that learning in educational settings is a dynamic process shaped by the interplay between the student, others, and the environment (Vygotsky, 1978). What is to be measured, therefore, is the individual student’s growth or progress in educational contexts. Indeed, there has been a growing tendency for policymakers and researchers to shift the focus away from mean or average effects for groups and toward how student learning evolves over time for the individual student (
Anderman et al., 2015). It would follow that available methods for assessing individual-level change need to be systematically compared, thus motivating the current study.
Having a reliable method to capture change is critically important for educators and clinicians (e.g., psychologists, speech-language pathologists), allowing decision-making to be guided by developmental trajectory (
Page, 2014;
Vygotsky, 1978). For example, measuring individual change is useful for documenting a student’s academic growth (or lack thereof), monitoring progress in order to decide on support measures, and ending intervention when relevant changes have been made (e.g.,
Zahra et al., 2016;
Daub et al., 2019). In research, there is a long tradition of using frequentist statistics to quantify the amount of change experienced by a group (
Pernet, 2015). Take, for example, the effects of full-day (vs. half-day) kindergarten on developing academically relevant competencies, which are typically studied at the group level using frequentist statistics (see Cooper et al., 2010 for a review). Such developmental research then informs policy decisions, such as implementing a full-day kindergarten curriculum for all students, given that full-day kindergarten is generally efficacious (e.g., Canada: Akbari et al., 2024; United States:
Kauerz, 2005). The ability to identify specific individuals who change, however, would offer more information beyond group-based effects. Although it would seem simple to examine change by seeing whose scores increase or decrease over time, the task is deceptively complex. First, showing that change is not due to maturation (e.g.,
Bedard & Dhuey, 2006) or random effects (e.g.,
Duff, 2012;
Anderman et al., 2015) alone may be challenging. Another difficulty is that there is no consensus on
how to measure change and, relatedly,
who makes a reliable change (e.g.,
Anderman et al., 2015;
Frijters et al., 2013;
Hendricks & Fuchs, 2020;
Maassen et al., 2009). Various methods are available to measure who has changed—whether a group or a given individual—but each method yields a slightly different result. What follows is a broad overview of the group- and individual-level methods most commonly used in research and clinical practice.
1.1. Identifying Group-Level Changes
In the scientific and research literature, group comparisons are commonly conducted using classical frequentist tests (i.e., null hypothesis significance testing) (
Pernet, 2015). The most common statistical tests to index change overtime are repeated measures analysis of variance (ANOVA) and pairwise
t-tests, with results being communicated in terms of significance. Results found to be statistically significant—large
F- or
t-values, small
p-values below 0.05, confidence intervals, and effect sizes—indicate that the group has changed overtime.
There are several limitations to null hypothesis significance testing, however. Most notably, focusing on the group effect does not readily translate into an understanding of individual performance. These methods also do not account for initial levels of performance or regression to the mean. Even an analysis of covariance (ANCOVA), in which initial scores are included as covariates, fails to provide the necessary statistical controls (
Miller & Chapman, 2001). Further, for common tests such as the
t-test, only one outcome (e.g., a language test) between the two time points can be compared. If the two points do not differ on this outcome, there is only a 5% probability that the result will be statistically significant because of chance when a benchmark of
p < 0.05 is used (
Sainani, 2009). However, more often than not, a battery of outcomes is typically utilized to make comparisons. For example, if the two groups are compared five times for each outcome (e.g., a language test, a reading score, etc.), there is a 23% probability that one of the outcomes will be significant by chance (
Sainani, 2009). Testing a number of hypotheses in this way, called multiple testing/comparisons, increases the chances of making a Type 1 statistical error (e.g., concluding that there is change when there is not) (
Sainani, 2009). Finally,
p-values are often misinterpreted. A non-significant result does not support the null hypothesis per se, but rather fails to reject the null hypothesis (
Brydges & Bielak, 2020). Conversely, a Bayesian analysis can be used to find evidence for a null effect by quantifying the amount of evidence (
Brydges & Bielak, 2020;
Wagenmakers et al., 2018). Nevertheless, Bayesian statistics still only accounts for group-level change and does not indicate which individual within a group has changed. It follows that methods for (a) quantifying changes in outcomes for individuals and (b) determining whether such changes are reliable are warranted.
1.2. Identifying Individual-Level Changes
1.2.1. Normalization Method
To assess which specific individuals change, two methods have been proposed: normalization and growth statistics. The normalization approach might be most familiar to professionals (educators and clinicians) and researchers who use norm-referenced tests in their practice (e.g.,
Torgesen et al., 2001;
Hendricks & Fuchs, 2020;
Frijters et al., 2013). Children typically meet eligibility for special education services if they receive a standard score of 85 or lower (or the 16th percentile or lower) on nationally normed assessment measures (
Spaulding et al., 2012). Then, children who receive intervention are considered to have made a change if post-intervention scores are within the “average” or “normalized” range, which is typically a standard score of 90 or higher (corresponding to the 25th percentile or higher). However, it is important to note that these cut-off-point criteria vary by test and across regions (
Spaulding et al., 2012).
Standard scores represent a child’s performance relative to the performance of age-matched peers (i.e., child’s age-adjusted standing), whereas percentiles are used to clarify the interpretation of standard scores by ranking scores. Standard scores and percentiles can be converted from raw scores for commercially available, norm-referenced tests across domains. For example, the Clinical Evaluation of Language Fundamentals—Fifth Edition (CELF-5;
Wiig et al., 2013), one of the gold-standard assessment measures for language, has a mean standard score of 100 (SD = 15) for typically developing children. In contrast, deriving standard scores/percentiles for experimenter-created measures requires more resources, including a representative sample for norming and statistical training to convert raw scores to standard scores and percentiles. Nevertheless, the normalization method is appealing because the individual can be seen to have moved, providing a gauge of ‘how much’ they moved from an at-risk level (below the 16th percentile) to within normalized levels (25th percentile or greater).
Although interpreting the results may seem straightforward, concerns about the normalization approach must be acknowledged. Like group-based approaches, the normalization method considers only two time points. Nonetheless, real-life time and resource constraints likely mean that practitioners only conduct pre–post measures anyway. Most notably, this method can only measure progress in a subset of the sample, namely those who score below the cut-off at initial testing. Relatedly, an important point to consider is the potential bias of norm-referenced tests when the culture and language of the student being assessed differ from those of the normative group (e.g.,
Laher & Cockcroft, 2017;
Higgins & Lefebvre, 2024). This can increase the risk of overidentification/misdiagnosis of culturally and linguistically diverse students as having learning difficulties when relying on norm-referencing (e.g.,
Paradis et al., 2013). It would be useful to have feasible methods that capture growth across the range of individual differences, including typically developing students as well as struggling individuals who change but do not reach the normalized range.
Another concern with the normalization method is that using standard scores to demonstrate change may be inappropriate (
Daub et al., 2019;
Kerr et al., 2003) and can even be misleading (
Daub et al., 2019;
Spaulding et al., 2012;
Kwok et al., 2022). For example, one student who raises their standard score by five points from an initial standard score of 85 to a final standard score of 90 may be viewed as normalized, whereas another student who makes a five-point gain from 80 to 85 would not be categorized as having changed because their final score remains under the cut-off point of 90 (e.g.,
Torgesen et al., 2001;
Hendricks & Fuchs, 2020;
Frijters et al., 2013). Further, changes in a child’s standard scores over time do not directly imply improvement or regression in their skills but rather indicate a change in the child’s performance relative to peers of the same age (
Kwok et al., 2022). The same standard score at pre- and post-test may at first glance be assumed to indicate no change, when, in fact, maintaining the same standard score over time indicates that the individual progressed at the same rate as the normative group (e.g.,
Daub et al., 2019). A negative change in standard scores could also be interpreted in different ways. Although norm-referenced scores often decrease over time among individuals with neurodevelopmental disorders who exhibit slower-than-average progress (e.g.,
Farmer et al., 2020), negative change does not necessarily mean that the child has regressed and lost skills. It could also mean that the child gained the skills but at a slower rate than most peers or that they have experienced no change in their skills (
Kwok et al., 2022).
In addition, although standard scores are readily available with commercially developed norm-referenced tests, deriving standard scores for experimental or informal measures can be cumbersome. This becomes problematic when schools or clinics design their own outcome measures, for example. The measure would need to be administered to a representative sample so that a normative distribution could be formed, from which standard scores/percentile ranks can then be derived. Nevertheless, given that in real-life contexts, clinicians and educators are likely attempting to use standard scores from norm-referenced tests as indicators of change (
Kerr et al., 2003), we included this method in the current paper as a comparison against potentially more sensitive measures, namely growth methods. We were able to use the normalization method with our experimental measures because we had sufficient resources (e.g., a representative sample and statistical training).
1.2.2. Growth Methods
The individual-level approaches considered in this paper are collectively called growth methods (also referred to as distribution-based methods (
de Vet et al., 2006) or idiographic methods (
Blampied, 2022)). Growth methods compare the change in an individual based on statistical characteristics of the obtained sample. We reviewed the extant literature to determine methods that reflected best practice and, importantly, that would be easy for practitioners, who typically have minimal statistical training (
Lakhlifi et al., 2023), to calculate and to interpret based on available data. We also wanted to choose non-overlapping methods to compare. We focused on the reliable change index (RCI) because it is the most widely used method for evaluating individual change (
Jacobson & Truax, 1991;
Chelune et al., 1993;
Blampied, 2022;
Duff, 2012), standardized individual difference (SID) methods because of the acceptable false positive rates compared to other methods (
Payne & Jones, 1957;
Ferrer & Pardo, 2014), and standardized regression-based (SRB) methods for their precise estimates of change (
McSweeny et al., 1993;
Crawford & Garthwaite, 2007).
First, the RCI is a method for assessing whether a change between two observations of the same individual is statistically reliable and not due to measurement error (
Jacobson & Truax, 1991).
Chelune et al. (
1993) further adjusted the RCI formula to control for practice effects on repeat testing. To measure progress at the student level across kindergarten in this paper, we adopted the practice effect-adjusted RCI (henceforth called the RCI for brevity), as per best-practice recommendations (
Duff, 2012;
McAleavey, 2024). The numerator of this adjusted RCI equation is the difference between an individual’s two test scores minus the group’s mean practice effect. The numerator is then divided by the standard error of measurement (SEM) of the difference score (see
Table 1 for equation; adopted from
Duff (
2012) and
Hendricks and Fuchs (
2020)). Each individual will receive a resulting RCI value that is then compared to the z-distribution in order to make interpretations about reliable change. On a z-score distribution, which is standardized with a mean of 0 (SD = 1), the cut-off point for a 95% confidence interval is 1.96, and for a 90% confidence interval, it is 1.645. We use the recommended ±1.645 cut-off given that our design is a single-group pre–post study (
Estrada et al., 2019) and that this cut-off is widely used in clinical psychology (
Duff, 2012). Thus, an RCI value of +1.645 or greater (i.e., an increase over time) represents a significant and reliable change that would be unlikely to occur through measurement error alone, whereas a value of −1.645 or lower indicates a reliable change in the negative direction (i.e., a decrease over time). Note that the sign of the RCI value indicates the positive or negative direction of change, not the arithmetic sign of the change score (
Blampied, 2022).
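Because the equation itself appears in Table 1, only a sketch of one common formulation of the practice effect-adjusted RCI is given here for orientation; the exact form used in this study, including how the SEM is pooled across time points, is the one in Table 1:

\[
\mathrm{RCI} = \frac{(x_{2} - x_{1}) - (\bar{X}_{2} - \bar{X}_{1})}{S_{\mathrm{diff}}},
\qquad
S_{\mathrm{diff}} = \sqrt{2\,\mathrm{SEM}^{2}},
\qquad
\mathrm{SEM} = SD\,\sqrt{1 - r_{12}},
\]

where \(x_{1}\) and \(x_{2}\) are an individual’s time 1 and time 2 scores, \(\bar{X}_{2} - \bar{X}_{1}\) is the group’s mean practice effect, \(SD\) is the sample standard deviation, and \(r_{12}\) is the test–retest reliability.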
The SID method uses the standard deviation of the pre–post differences, rather than the SEM, as the denominator for identifying reliable change (
Payne & Jones, 1957). The standardized score results from dividing the difference between the initial and final scores by the standard deviation of the differences. Similarly to the RCI method, the resulting SID value is compared to the z-distribution. We considered an individual change to be reliable when the absolute SID value exceeded 1.645. Compared to other methods, the SID method has yielded acceptable false positive rates (
Ferrer & Pardo, 2014). Others, however, have found similar results between the RCI and SID methods (
Estrada et al., 2019).
Finally, the SRB index uses regression analyses to predict change (
McSweeny et al., 1993). In this paper, we adopted a simpler version, termed the estimated SRB approach, which was proposed to ease computation (
Crawford & Garthwaite, 2007;
Maassen et al., 2009), an important consideration when creating an easy-to-use calculator for clinicians. In the estimated SRB method, summary data (e.g., mean, standard deviation) that are typically available in test manuals are used to form the regression equation with which to calculate reliable change. The SRB approach uses the regression equation to predict the final score; the predicted final score is then compared to the observed final score to determine whether a meaningful change occurred over time. Similarly to the other methods, the ±1.645 benchmark was used when considering change.
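For orientation, one common way of writing the estimated SRB index from summary statistics is sketched here; the sign convention (observed minus predicted, so that positive values indicate greater-than-expected gains) is an assumption on our part, and the exact formulation adopted in this study follows Crawford and Garthwaite (2007) and Maassen et al. (2009):

\[
\mathrm{SRB} = \frac{x_{2} - \hat{x}_{2}}{S_{\mathrm{diff}}},
\qquad
\hat{x}_{2} = \bar{X}_{2} + r_{12}\,\frac{SD_{2}}{SD_{1}}\,\bigl(x_{1} - \bar{X}_{1}\bigr),
\]

where \(\hat{x}_{2}\) is the time 2 score predicted from the individual’s time 1 score and the group means, standard deviations, and test–retest reliability, and \(S_{\mathrm{diff}}\) is the same standard error of the difference used in the RCI.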
The advantage of the growth methods over the normalization method is that they can be applied to all individuals and not just those whose initial score is below a cut-off. Having a statistical method for measuring individual change could be especially advantageous when evaluating single-case experimental designs, which have traditionally relied on visual inspection rather than formal statistics (cf.
Fisch, 2001). Other benefits include a precise estimate of relative change, accounting for test–retest reliability and measurement error, and controlling for practice effects (e.g.,
Chelune et al., 1993;
Duff, 2012). However, these measures are rarely calculated in educational and developmental studies (
Zahra et al., 2016; cf.
Frijters et al., 2013;
Hendricks & Fuchs, 2020;
Unicomb et al., 2015), possibly because the formulas are complicated to calculate. For example, instead of calculating an RCI for every individual, an increasingly popular alternative is to invert the RCI formula to calculate the smallest amount of change needed from the initial to the final score for the change to be considered reliable (formula in
Table 1). This coefficient is more widely known as the minimally detectable change (e.g.,
de Vet et al., 2006).
While there are a number of change score methods available, studies with adults have found that most change indices perform similarly to each other, except for the Hageman–Arrindell (HA) method, which is considered “the most conservative” (
McGlinchey et al., 2002). Still, others have found that the RCI and HA methods yield comparable results (
Ronk et al., 2016;
Ferrer & Pardo, 2014), with the RCI method being suggested to promote consistent practice (
Lambert & Ogles, 2009). The HA method was not included in this study because a preliminary analysis of the current dataset found that the reliability of difference score was below the recommended 0.40 threshold. Despite acceptable false positive rates, the confidence interval for the individual predicted value of the linear regression method (
Ferrer & Pardo, 2014) was also not included due to potential obstacles to practice. Clinicians will likely not have access to the full dataset and software when generating regression analyses
1. Instead, the SID method showed comparable false positive rates to the confidence interval method (
Ferrer & Pardo, 2014) and is included in the current study.
To our knowledge, no studies have systematically compared the normalization, RCI controlling for practice effects, SID, and estimated SRB methods in an unselected sample of young children to measure academic progress. Recently,
Frijters et al. (
2013) and
Hendricks and Fuchs (
2020) compared the normalization and original RCI method to identify change in struggling readers in response to intervention. Strikingly, results were similar in both studies: for a given child, one method would indicate change, while another would not. For example, the normalization method identified high rates of responders in
Hendricks and Fuchs (
2020) but low percentages of change in
Frijters et al. (
2013). Overall, measuring change in children is far from clear-cut. Nevertheless, measuring progress is routinely carried out by educators and clinicians. It would then follow that providing practitioners with accessible ways of evaluating individual change—which could serve as one piece of clinical evidence—is critical, motivating the current paper.
1.3. The Current Study
The goal of the current study was to provide a preliminary investigation of using sophisticated methods for examining individual change in the academic skills of kindergarten students. We were interested in testing whether the individual change indices offer more information than group-based statistics. Thus, the novelty of our study lies in our methodology of comparing four individual change scores—normalization, RCI controlling for practice effects, SID, and estimated SRB—to identify who exhibits a meaningful change with respect to school-relevant competencies. Our research aims were (1) to assess change at the individual level for all students, (2) to assess change in a subgroup of struggling learners, (3) to compare how change was captured across methods, and (4) to develop user-friendly Excel calculators with which practitioners can examine change.
2. Method
2.1. Participants
In Ontario, Canada, kindergarten is a 2-year program, with those entering the first year of the program turning 4 years of age and those entering the second year turning 5 years of age at some point during that school year. Participants were part of a larger longitudinal study following children from Year 2 of kindergarten to grade 2 (
Pham et al., 2025). The larger study used a multi-wave approach in which a total of 16 schools (2 rural) in London, Ontario participated in the study over 7 years (2013–2019), with 6–9 schools participating in new recruitment each year. Ethics was approved by the local university and participating school boards. Written consent was obtained by the principal of each of the 16 different schools. Teachers sent home consent packets which included a letter of information about the study to all children. If parents wished to consent, they returned the written consent form to their child’s teacher, along with child assent at the study onset.
The larger study had 330 participants (162 females) who completed the assessment battery in the spring of Year 1. The current study focuses on 157 participants (70 females) who had data points available in both the spring of Year 1 (mean age = 4.33, SD = 0.49) and the spring of Year 2 (mean age = 5.38, SD = 0.49). The two time points will henceforth be referred to as the time 1 score and time 2 score, respectively. Attrition between the two years was therefore 52.4% (157 of the original 330 participants were retained). The main reason for attrition was that the larger study focused on kindergarten students in Year 2; students in Year 1 were included only if they were part of a split Year 1/2 class, as all students in such classes were invited to participate. Other reasons for attrition included child absence, child moving schools, child withdrawal from the study, and school withdrawal (board-level policy changes, labour shortages, and disruptions). There were no specific inclusion criteria for participation, as we were interested in obtaining an unselected, representative sample of kindergarten students.
The larger study did not collect specific information about the demographic characteristics of the sample (e.g., ethnicity, language(s) spoken, socioeconomic status, etc.), nor did the schools share this information. Limited information was available from a demographic survey that was discontinued after the first year of the study. As per parent reports (
n = 101), children were rated to be ‘good’ on average at counting and recognizing numbers; letter names; number relationships; quantity concepts; and understanding patterns. Children were rated to be ‘satisfactory’ on average at letter sounds and meanings of written words. Further, children ‘never’ or ‘rarely’ (once or twice a month) attended English as a Second Language or Literacy programs in the six months before kindergarten. Based on previous studies with cohorts from this school board (
Archibald et al., 2013), we could only speculate that the sample might be largely monolingual English-speaking and of relatively high socioeconomic status.
2.2. Procedure
All participants were tested individually in a single session, lasting 30–40 min, by trained research assistants (RAs). There were on average 14 trained RAs (range = 7–20) administering the assessment each year. Each RA attended a 2–3 h training session. Training included viewing a video recording of the second author providing step-by-step explanations of how to administer and score the test battery, demonstrated with a practice child participant who was not included in the study. The RAs also had opportunities to practice administering and scoring each measure. We created a website that contained the training video and other resources for RAs to access during and after training.
Students completed the battery of language, reading, and mathematical measures in the spring (March–May) of Year 1 kindergarten (i.e., time 1 scores). These outcome measures were repeated approximately 1 year later in the spring (March–May) of Year 2 kindergarten (i.e., time 2 scores). All participants completed the measures described below. Further details of the procedure are reported in the larger study (
Pham et al., 2025). Caregivers could opt in to receive the lab’s annual newsletter which would contain a summary of the study.
2.3. Materials
The kindergarten assessment consisted of nine tasks: vocabulary, sentence recall, letter-name and letter-sound knowledge, rapid color naming, phonological awareness, number naming, number line estimation, dot and symbol magnitude comparison, and arithmetic skills. Materials are fully described elsewhere (
Pham et al., 2025) and freely available on Open Science Framework (
https://osf.io/gnekj/; accessed on 21 October 2025). Test–retest reliability when tested approximately 1 month apart, as reported in
Pham et al. (
2025), is provided; additional psychometric properties are reported where available. Brief descriptions of each tool are as follows:
Vocabulary. Participants were asked to name each of 10 pictures of various nouns, for a total of 10 points. This vocabulary task was from the Reading Readiness Screening Tool (RRST) v7.5 (
Learning Disability Association of Alberta, 2011). The RRST was developed by a committee of specialists with the Learning Disability Association of Alberta and used to screen students from kindergarten to grade 1. Norming of the tool is currently being conducted by Alberta Education, the governing agency that oversees the education system. The test–retest reliability was 0.73.
Sentence recall. Participants were asked to repeat sentences after hearing a digital recording of the sentence (
Redmond, 2005). There were 16 sentences. Sentences were scored in real time as 2 (correct), 1 (three or fewer errors), or 0 (four or more errors or no response) for a maximum score of 32. Inter-rater consistency was r = 0.99 and test–retest reliability was r = 0.95 in
Redmond (
2005), and test–retest reliability was 0.95 in
Pham et al. (
2025). The task and manual are available at
https://health.utah.edu/communication-sciences-disorders/research/redmond (accessed on 10 October 2025).
Letter-name and letter-sound knowledge. Participants were asked to name and give the sound for the 26 letters of the alphabet presented in upper- and lower-case letters with no time constraints. Correct responses were tallied out of 26 for each of the upper case sounds, lower case sounds, upper case letter names, and lower case letter names, and an average score across the four trials was used. The test–retest reliability was 0.90.
Color rapid automatized naming (RAN). Participants had to serially name colored boxes aloud as quickly and accurately as possible (
Howe et al., 2006). The time required to name all stimuli (in seconds) was used for data analysis. The test–retest reliability was 0.89.
Phonological awareness. This was a bespoke measure developed by the school board’s Speech and Language Services. The test consisted of six subtests: rhyme recognition, rhyme production, sound blending, sound identification, sound segmentation, and sound/syllable deletion. There were 33 items in total across the subtests, for a maximum score of 33. The test–retest reliability was 0.74.
Number name. Participants were asked to name the 10 digits from 0 to 9 without time constraints. There are 10 items, with a total of 10 points. The test–retest reliability was 0.93.
Number line estimation. Participants had to repeatedly estimate the position of a number on a 15.9 cm number line with 0 and 10 at the respective ends of the line by making a mark on the line to indicate where the number would go (
Hawes et al., 2019). The average percentage of absolute error across the eight trials was calculated using the following equation: |(observed value − expected value)/15.9|. The test–retest reliability was r = 0.32 in
Hawes et al. (
2019) and 0.72 in
Pham et al. (
2025).
Dot and symbol magnitude comparison. Participants were required to compare pairs of magnitudes ranging from 1 to 9, presented either in symbolic (56 digit pairs) or non-symbolic (56 pairs of dot arrays) formats. Participants were instructed to cross out the larger of the two magnitudes and were given two minutes to complete each condition. The proportion of correct answers (number of correct responses relative to the number of items attempted) was used for data analysis. For more information, see
Nosworthy et al. (
2013);
Hawes et al. (
2019); and
https://www.numeracyscreener.org (accessed on 16 October 2025). The test–retest reliabilities for the symbolic and nonsymbolic magnitude comparison tasks were r = 0.72 and r = 0.61, respectively, in
Hawes et al. (
2019) and r = 0.61 and r = 0.90, respectively, in
Pham et al. (
2025).
Arithmetic skills. Participants had to complete 5 single-digit addition and 5 single-digit subtraction problems, with no time constraints. The number of correct responses was tallied. The test–retest reliability was r = 0.60 in
Hawes et al. (
2019) and r = 0.80 in
Pham et al. (
2025).
2.4. Data Analysis Plan
Data Cleaning. There was a small amount of missing data (5.15%) for various reasons (e.g., child absent, did not complete). An empirical test (Little, 1988) revealed that the data were not missing completely at random, χ²(186) = 353.80, p < 0.001. We used the K Nearest Neighbour (KNN) imputation method to estimate the missing data (
Troyanskaya et al., 2001), which is an algorithm applicable to imputing values that are not missing at random. The DMwR2 package (
Torgo, 2016) in R version 4.4.2 (
R Core Team, 2024) was used to handle missing data.
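As an illustration, a minimal R sketch of this imputation step is given below; the data frame and column names are hypothetical, and k is set small only to suit the toy example (the knnImputation default is k = 10).

library(DMwR2)

# Hypothetical data frame: one row per child, one column per subtest score,
# with NA marking missing values
raw_scores <- data.frame(
  vocab_t1   = c(7, 8, NA, 5, 9, 6),
  sentrec_t1 = c(20, 25, 18, NA, 30, 22),
  vocab_t2   = c(9, 9, 8, 7, 10, NA),
  sentrec_t2 = c(24, 28, 21, 19, 31, 26)
)

# K Nearest Neighbour imputation (Troyanskaya et al., 2001): each missing value
# is filled in using the k most similar complete cases (k = 3 here only because
# the toy data set is tiny)
imputed_scores <- knnImputation(raw_scores, k = 3)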
For most measures, higher scores reflected better performance, and thus, an increase in score from time 1 to time 2 indicates an improvement, except for color RAN and number line estimation, where lower raw scores reflected better performance. Guided by
Brysbaert and Stevens (
2018), we transformed RAN reaction times and number line estimation scores into inverse scores (1000/(RAN reaction time) and 1/(number line score), respectively) so that, consistent with the other measures, poorer performance corresponded to lower values. The numerators were selected to avoid values that are too small in our analyses (
Brysbaert & Stevens, 2018). Inverse scores were used in subsequent analyses. Further, given that each subtest used a different scale and had a different maximum score, we converted raw scores to z-scores for each subtest to provide a standardized scale for interpretation. Z-scores were used in the analyses.
Finally, given that each individual subtest is a screening measure, we sought to create a more robust measure by combining the subtests, as informed by results from an exploratory factor analysis. An evaluation of the scree plot and parallel analysis suggested that a single factor be extracted for both time 1 scores (significant Bartlett’s test of sphericity, KMO = 0.81, eigenvalue = 3.68, 40.94% of variance explained) and time 2 scores (significant Bartlett’s test of sphericity, KMO = 0.76, eigenvalue = 3.33, 37.00% of variance explained). The full results are reported in the
Supplementary Materials. As a result, we created a
school screening composite score by adding z-scores across all measures. The school screening composite will be used in the remaining analyses.
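A minimal R sketch of this step is shown below; the object names are hypothetical, and the psych package is one possible tool for the parallel analysis and related diagnostics (the paper does not specify which package was used).

library(psych)

# Hypothetical data frame of z-scored subtests (157 children x 9 measures)
set.seed(1)
z_scores_t1 <- as.data.frame(matrix(rnorm(157 * 9), ncol = 9))
names(z_scores_t1) <- paste0("subtest_", 1:9)

fa.parallel(z_scores_t1, fa = "fa")                        # scree plot and parallel analysis
KMO(z_scores_t1)                                           # sampling adequacy
cortest.bartlett(cor(z_scores_t1), n = nrow(z_scores_t1))  # Bartlett's test of sphericity

# If a single factor is suggested (as in the study), sum the z-scores across
# all measures to form the school screening composite
composite_t1 <- rowSums(z_scores_t1)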
Analyses. To identify which individual student showed a change, we used four individual change indices:
Normalization. For the normalization method, the pnorm function in base R was used to calculate the percentile corresponding to the composite z-score at times 1 and 2. A student was considered to have changed if they moved from the 16th percentile or lower (time 1 score) to the 25th percentile or higher (time 2 score). These values correspond to standard scores of 85 and 90, respectively, which are widely used as cut-offs for norm-referenced tests. Notably, this method is only applicable to the subgroup of students who scored at or below the 16th percentile at time 1. We refer to this subgroup as the ‘low scorers’ in the remainder of the paper; the remaining students were considered within normalized levels.
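A minimal R sketch of this classification, using hypothetical composite z-scores for five students, might look as follows.

# Hypothetical composite z-scores at each time point
composite_z_t1 <- c(-1.30, -1.05, 0.20, -0.10, -1.60)
composite_z_t2 <- c(-0.50, -1.10, 0.35, 0.00, -0.90)

pct_t1 <- pnorm(composite_z_t1) * 100   # percentile rank at time 1
pct_t2 <- pnorm(composite_z_t2) * 100   # percentile rank at time 2

low_scorer <- pct_t1 <= 16              # subgroup eligible for this method
normalized <- low_scorer & pct_t2 >= 25 # moved into the normalized range

data.frame(pct_t1 = round(pct_t1, 1), pct_t2 = round(pct_t2, 1), low_scorer, normalized)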
Reliable change index (RCI). The first growth method we applied to our data was the practice effect-adjusted RCI formula. The RCI was calculated using the revised formula by
Chelune et al. (
1993), as reported in
Duff (
2012;
Table 1). The numerator corresponds to the difference between the student’s time 2 and time 1 z-scores minus the mean practice effect. The denominator is based on the standard error of measurement (SEM) of the difference z-score, which is estimated from the pooled sample variance and the reliability of the test. We estimated reliability as the correlation between the pre- and post-administrations of the composite z-score (
Hendricks & Fuchs, 2020). An RCI value was calculated for each student. We considered an individual change to be reliable when its corresponding RCI was greater than or equal to ±1.645. The ±1.645 benchmark indicates that 90% of change scores will fall within this range in a normal distribution, that only 5% of cases will fall below this point based on chance, and that only 5% of cases will fall above this point. The RCI results were then constrained to the low scorers only to allow for comparison with the normalization method. In addition, the RCI coefficient accounting for practice effects was calculated to determine the smallest amount of change that would indicate a reliable change (i.e., minimally detectable change) (
de Vet et al., 2006).
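A minimal R sketch of this calculation on hypothetical composite z-scores is given below; the way the SEM is pooled here is an assumption on our part, and the RCI coefficient is derived by inverting that same assumed formula, so the authoritative version is the one in Table 1.

# Hypothetical composite z-scores at the two time points
t1 <- c(-1.30, -1.05, 0.20, -0.10, -1.60, 0.75)
t2 <- c(-0.50, -1.10, 0.35, 0.00, -0.90, 0.80)

practice <- mean(t2) - mean(t1)             # mean practice effect for the group
r12      <- cor(t1, t2)                     # test-retest reliability estimate
sd_pool  <- sqrt((sd(t1)^2 + sd(t2)^2) / 2) # pooled standard deviation (assumed pooling)
sem      <- sd_pool * sqrt(1 - r12)         # standard error of measurement
sem_diff <- sqrt(2 * sem^2)                 # SEM of the difference score

rci <- ((t2 - t1) - practice) / sem_diff    # one RCI value per student
reliable_rci <- abs(rci) >= 1.645           # 90% benchmark

# RCI coefficient (minimally detectable change): the smallest gain beyond the
# practice effect that would count as reliable
mdc <- practice + 1.645 * sem_diff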
Standardized individual difference (SID). The SID is calculated by dividing the individual difference between the time 2 and time 1 z-scores by the standard deviation of these differences. An absolute SID value of 1.645 or greater indicates a reliable change, the same benchmark as for the RCI. The dataset was then constrained to low scorers only for comparison with the normalization method.
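The corresponding R sketch, again on hypothetical composite z-scores, is straightforward.

t1 <- c(-1.30, -1.05, 0.20, -0.10, -1.60, 0.75)
t2 <- c(-0.50, -1.10, 0.35, 0.00, -0.90, 0.80)

diff_scores  <- t2 - t1
sid          <- diff_scores / sd(diff_scores)  # standardized individual difference
reliable_sid <- abs(sid) >= 1.645              # same 90% benchmark as the RCI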
Standardized regression-based (SRB) index. The final change score was the estimated SRB index (
Crawford & Garthwaite, 2007;
Maassen et al., 2009). The numerator corresponds to the difference between the student’s predicted and observed time 2 z-scores. The predicted time 2 z-score was estimated using summary data: mean, SD, and test–retest reliability (
Crawford & Garthwaite, 2007;
Maassen et al., 2009). The denominator is the SEM (i.e., the same denominator as in the RCI formula). Similarly to the previous methods, an absolute SRB value of 1.645 or greater indicates a reliable change. The dataset was then constrained to low scorers only for comparison among the methods.
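A minimal R sketch of the estimated SRB on hypothetical composite z-scores is shown below; the prediction equation from summary statistics and the sign convention (observed minus predicted, so that positive values indicate greater-than-expected gains) are assumptions on our part, with the exact formula following Crawford and Garthwaite (2007) and Maassen et al. (2009).

t1 <- c(-1.30, -1.05, 0.20, -0.10, -1.60, 0.75)
t2 <- c(-0.50, -1.10, 0.35, 0.00, -0.90, 0.80)

r12   <- cor(t1, t2)
slope <- r12 * sd(t2) / sd(t1)                     # regression slope from summary data
predicted_t2 <- mean(t2) + slope * (t1 - mean(t1)) # predicted time 2 score

sd_pool  <- sqrt((sd(t1)^2 + sd(t2)^2) / 2)
sem_diff <- sqrt(2 * (sd_pool * sqrt(1 - r12))^2)  # same denominator as the RCI

srb          <- (t2 - predicted_t2) / sem_diff
reliable_srb <- abs(srb) >= 1.645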
Classification profiles. Profiles of academic progress were created based on the results across indices. Students were classified into one of three profiles: (i) Improved over time meant that the student received a reliable change score in the positive direction (+1.645 or greater) on the RCI, SID, and/or SRB. (ii) No reliable change meant that the score across all methods did not meet the benchmark. (iii) Worsened over time meant that there was a reliable change in the negative direction (−1.645 or lower) on the RCI, SID, and/or SRB. Note that even if raw scores increase from time 1 to time 2, reliable change could still be classified as negative (below the −1.645 benchmark).
Agreement. Cohen’s kappa (κ) was used to examine agreement between the change score indices. The following classification scheme was used to interpret κ: values ≤ 0 as no agreement, 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement (
McHugh, 2012).
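As an illustration, agreement between two indices could be computed in R as sketched below; the psych package’s cohen.kappa function is one option (the paper does not name the software used for this step), and the classifications are hypothetical.

library(psych)

# Hypothetical binary classifications (1 = reliable change, 0 = no reliable change)
rci_class <- c(1, 0, 0, 1, 0, 0, 1, 0)
srb_class <- c(1, 0, 0, 1, 0, 1, 1, 0)

cohen.kappa(cbind(rci_class, srb_class))   # Cohen's kappa with confidence interval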
4. General Discussion
This study provided a preliminary evaluation of using individual-level indices to capture changes in learning outcomes across kindergarten. Children were assessed using our screening battery on various language, reading, and mathematical skills in the spring of each year of a 2-year kindergarten program. While the group-level results showed no significant differences over time for the full sample (see
Supplementary Materials), responsiveness at the level of the individual child varied. For the full sample, the RCI/SID and SRB methods identified 7.64% and 8.28% of students as having changed, respectively, with almost perfect agreement among the methods (research aims 1 and 3). We then restricted our dataset to a subgroup of low scorers to allow us to compare the normalization approach to the RCI and SRB methods (research aims 2 and 3). For this subgroup, there was a significant difference between time 1 and 2 scores, with scores increasing over time for the group (see
Supplementary Materials). However, the change score methods revealed individual differences that were masked by group-based results. Specifically, the RCI/SID and SRB methods identified 14.81% and 16.67% of students as having changed, respectively, while the normalization method identified 24.07%. Agreement was almost perfect among the RCI, SID, and SRB methods for this subgroup, whereas the normalization method differed from the rest. We next interpret our main findings and discuss practical implications, including the creation of a Growth Calculator (research aim 4).
Our results show that individual change indices have the potential to offer more precise estimates of relative change compared to group-level statistics. In particular, group-level statistics might mislead educators and clinicians into underestimating or overestimating the extent of individual change. Had we stopped our analysis after finding a lack of significance using frequentist/Bayesian statistics for the full sample, we would have concluded that, on average, this group of kindergarteners showed no-to-minimal gains in academic competencies over time. However, the growth methods further revealed that about a tenth of students made progress, with 3% (
n = 3) moving downwards but still remaining within normalized levels. This type of information would be crucial for clinical judgements, such as monitoring academics or initiating services, if problems persist for these specific students (
Anderman et al., 2015). For the low-scoring subgroup, group-based effects were associated with strong evidence for a positive change over time. While this was indeed the case for 31% of students, the vast majority (63%) did not make a change that was considered reliable and 5% moved downwards. Knowing who made positive gains would mean triaging services and supports to those who have not made a reliable change during the same period. Knowing who has not made a reliable change or has moved downwards can mean initiating/increasing services to ensure that timely supports are in place. Being able to track individual progress (or a lack thereof) is in line with the Vygotskian theoretical framework, in which educators or clinicians could support each student’s needs and make equitable educational decisions for individual students in order to improve learning outcomes (
Vygotsky, 1978;
Anderman et al., 2015).
Another advantage is that results from the individual indices could be more easily interpreted by practitioners than group-based statistics (
p-value, effect size) (
Page, 2014). For example, a practitioner could quickly conclude whether a reliable change occurred if the individual’s resulting RCI, SID, or SRB score falls beyond ±1.645 or if the difference between their scores exceeds the RCI coefficient benchmark. Having precise interpretations of individual performance would in turn facilitate educational and clinical decision-making and the communication of results to the individual. Nevertheless, we want to emphasize that change score indices should be used with caution, as one piece of evidence among others.
Despite the potential advantages of individual-level indices, each method yielded slightly different results, especially for the low-scoring group. The SRB index identified slightly more students as having changed than the RCI/SID index, but agreement was almost perfect. For the low-scoring subgroup, the RCI/SID and SRB indices were on par with each other but differed from the normalization method. Nevertheless, our level of agreement and general consensus about change (or no change) were similar to or better than those of previous work focusing on young children (e.g.,
Frijters et al., 2013;
Hendricks & Fuchs, 2020). For example, in
Hendricks and Fuchs (
2020), the biggest gap was between the normalization method, which identified 42% of students as having changed, and the RCI method, which identified 24%, almost half as many.
Potential differences between the RCI and SRB formulas are likely due to the numerator, given that both methods use the same denominator. The RCI numerator accounts for practice effects, and the SRB numerator takes this one step further to control for variability in time 1 scores, which, in turn, corrects for variability when predicting time 2 scores. While our RCI and SRB results were similar, best-practice guidelines (which have focused on adult control vs. treatment groups) recommend using SRB when the individual is demographically similar to the group or has a typical baseline score and using the RCI when characteristics are dissimilar (e.g.,
Levine et al., 2007;
Ouimet et al., 2009). The more complex version of the SRB method can also account for demographic variables by using a stepwise regression, but this is not applicable to the version of the SRB chosen for the study. Interestingly though, including demographic variables does not always improve performance (
Durant et al., 2019). Even though the SID method is based on the standardized pre–post difference, whereas the RCI and SRB calculations use the SEM, previous work has found that the RCI and SID perform similarly (
Estrada et al., 2019).
Finally, the normalization method performed differently compared to other indices in our study, aligning with prior work (
Frijters et al., 2013;
Hendricks & Fuchs, 2020). There may be several explanations. The normalization method was focused on children with relatively low initial scores. Children with lower initial scores have been found to generally improve more than those with higher initial scores (e.g.,
Luwel et al., 2011;
Lervåg & Hulme, 2010). Relatedly, another explanation may be regression to the mean, as scores for this subgroup were relatively low at time 1. Indeed, the normalization equation does not account for additional variables (e.g., regression to the mean, practice effects). This interpretation tentatively suggests that for about a quarter of the children (
n = 13/54) with the lowest scores at the outset, a compensatory pattern may be emerging in which they catch up with the rest of the sample by the second year of formal education. However, this is perhaps only to a certain degree, as most of them did not make a change that was considered reliable based on our indices (
n = 11/13). Practically, this might mean that practitioners should avoid prematurely ceasing services once these at-risk children seem to be within normalized levels. Instead, these children may have a high potential for growth that may need more time to be realized. Notably, the majority of children in the low-scoring subgroup (
n = 34/54) did not make a change according to any of the indices, highlighting that test scores should only be used as one source of data to guide educational and clinical decisions. Overall, we acknowledge that different methods yielding slightly different results might preclude us from concluding which method may be most accurate. Further investigation is required to compare the methods to ascertain the accuracy of each index for evaluating young children’s academic progress.
A final yet important point we want to address is that progress should be measured in children of all abilities, aligning with the policy of universal screening to monitor students’ performance across academic years (
Anderman et al., 2015;
Zahra et al., 2016). Indeed, our sample was an unselected sample of children, including typically developing and struggling learners. Previous studies using change indices have primarily measured change before and after intervention, hence focusing on struggling learners only (e.g.,
Frijters et al., 2013;
Hendricks & Fuchs, 2020) or clinical populations (e.g.,
Jacobson & Truax, 1991;
Crawford & Garthwaite, 2007;
Ronk et al., 2016;
Duff, 2012 for review, etc.). Being able to estimate the amount of typical change expected to occur over kindergarten, in this case, would provide practitioners with benchmarks of change that could be attributed to maturation and learning. For instance, the RCI coefficient provides a preliminary estimation of change that we might expect in kindergarten children over a 1-year time period due to development in the local setting of the study. It would follow that practitioners could then compare this benchmark against changes as a result of specific intervention or programs and determine whether the change that occurred was within, above, or below developmental expectations. However, we must emphasize that more research will be needed to validate the specificity and sensitivity of the current outcome measures and that many other factors must be considered when interpreting test scores.
4.1. Growth Calculator
Converging with the results from other studies (e.g.,
Frijters et al., 2013;
Hendricks & Fuchs, 2020;
Estrada et al., 2019;
Duff, 2012), we advocate for the use of individual indices to measure change. Several advantages of the RCI, SID, and SRB methods over group-level statistics and even the normalization method include (i) being applicable to all students, even if the individual was not in the at-risk level to start, (ii) the ability to indicate which individual made a reliable change rather than random change (e.g., measurement error, regression to the mean), and (iii) the results being easy to interpret and communicate. The change score indices could be used by practitioners to determine precisely
who has reliably changed. For single-case studies, visual inspection of data continues to be the most widely applied method of data interpretation due to the lack of a robust formal statistical approach (cf.
Fisch, 2001). We surmise that growth methods could serve as adjunctive methods, aligning with recent work (e.g.,
Jones & Hurrell, 2019;
Unicomb et al., 2015).
One reason these change score methods are underutilized in educational and developmental practice and research is their complex calculations. To address research aim 4, we constructed freely available, user-friendly calculators for educators and clinicians to measure growth via the RCI and SRB using our screening battery for children of similar ages. Given the strong similarities between the RCI and SID indices, we focused on the RCI method because it is more commonly used and the simplest to calculate, echoing previous recommendations (
Ronk et al., 2016;
Lambert & Ogles, 2009). The online tool consists of an Excel-based calculator:
https://www.uwo.ca/fhs/lwm/publications/growth_calculator.html (accessed on 21 October 2025) or available for download from the
Supplementary Materials. By inputting the time 1 and 2 scores for an individual child—the only step practicing professionals need to complete—the calculator automatically computes individual growth via the RCI and SRB indices and provides a nominal classification relative to our sample psychometrics:
- (i)
Improved over time: Reliable change on both the RCI and SRB; reliable change on the RCI or SRB;
- (ii)
No reliable change: No reliable change on both the RCI and SRB;
- (iii)
Worsened over time: Negative reliable change on both; negative reliable change on the RCI or SRB.
Calculations are based on summary data (i.e., mean, SD, test–retest reliability) from our sample for each measure; however, users could change these values to match the characteristics of their setting. Instructions are provided regarding how to make these changes. We hope that this practical tool will provide educators and clinicians with a quick and easy way to evaluate whether an individual made a reliable change. Nevertheless, we want to emphasize that the result from the change score index should be interpreted as part of a profile of relevant strengths and weaknesses for individual children in order to avoid misinterpreting results from the calculator. We encourage interested readers to contact the first author for further consultation about our calculator or to facilitate creating their own.
4.2. Limitations
This study has several limitations. First, it was a re-analysis of available data. As a result, we only had experimental tests of language, reading, and mathematics. Although we have started to validate these tools (e.g.,
Hawes et al., 2019;
Nosworthy et al., 2013), having standardized or commercially developed tests as outcome measures would have substantiated our results. The psychometric properties of the measures, including internal consistency reliability, construct validity, known-groups validity, and the sensitivity and specificity of diagnostic accuracy, should be evaluated. In addition, given that only two data points were available for each participant, we could only use the RCI, SID, and SRB growth indices. Other growth approaches can account for multiple repeated assessments (e.g., growth scale values), which would increase the reliability and sensitivity of identifying change. Nevertheless, the RCI, SID, and SRB indices are among the most widely used and researched methods in clinical psychology, and here we aim to highlight their potential benefits for educational and developmental studies as well.
Second, we relied on screening tools (e.g., the vocabulary subtest from the Reading Readiness Screening Tool; the magnitude comparison tasks from the Numeracy Screener) or brief assessments (e.g., 10 questions/items) to illustrate change. This may not be appropriate because screening measures are meant to be relatively brief evaluations, and ceiling effects were observed for some measures. However, screeners have been shown to be efficient and effective methods for identifying struggling students (e.g.,
Stormont et al., 2015). Importantly, schools (and clinics) need brief and cost-effective universal screeners to facilitate early intervention. We had to find a balance between forming a comprehensive screening tool and the feasibility of administering the assessment; thus, we selected well-established measures that have been widely used in research and practice.
Finally, although the individual change indices reviewed here are an exciting avenue of research, the normalized range, RCI and RCI coefficient, SID, and SRB values are specific to our experimental measures and only applicable to kindergarteners measured in the spring of a 2-year program when approximately 4.5 and 5.5 years of age. We present a calculator based on this specific population. The potential lack of linguistic and cultural diversity in our sample could limit the generalizability of the findings and the application of the Excel-based Growth Calculator to a broader population. We provide instructions to change the summary data to match the sample characteristics of the context, if available, which, in turn, could improve the accuracy of the calculator. We encourage future research to investigate the individual change indices reviewed here across different cultural, ethnic, racial, and socioeconomic groups.
4.3. Conclusions
The goal of this study was to quantify individual change in kindergarteners using various methods. We were able to identify specifically for whom change occurred, using the RCI, SID, and SRB growth methods for all students and the normalization method for low scorers. We therefore recommend including individual-level indices, especially growth approaches, in future research as an adjunct to traditional group-level statistics for quantifying change. For practicing professionals, a tool is provided to assist in evaluating individual change based on the RCI and SRB growth methods using our outcome measures. Overall, these methods provide starting points for measuring progress in kindergarten.