Article

A Comparison of Different Methods for Measuring Individual Change in Kindergarten Children

1 School of Communication Sciences and Disorders, University of Western Ontario, London, ON N6G 1H1, Canada
2 Department of Psychology, University of Western Ontario, London, ON N6A 3K7, Canada
3 Research and Assessment Services, Thames Valley District School Board, London, ON N5W 5P2, Canada
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(11), 1475; https://doi.org/10.3390/educsci15111475
Submission received: 8 September 2025 / Revised: 20 October 2025 / Accepted: 22 October 2025 / Published: 3 November 2025

Abstract

Measuring progress in students is an important consideration when making decisions in education, clinical practice, and research. However, group-level change due to learning over time cannot be used to interpret change in an individual. Therefore, the current study compared four methods for measuring individual change: the reliable change index controlling for practice effects (RCI), the standardized individual difference (SID), estimated standardized regression-based (SRB) change, and a normalization approach. Participants included 157 children (4 years initially and 5 years at follow-up) who completed measures of language, reading, and mathematics and were tested 1 year apart. We measured individual differences in children as they developed academic-relevant competencies. The RCI and SID indices yielded the same results. While group-based statistics did not find a change overall, the RCI/SID and SRB methods identified 7.64% and 8.28% of students as having changed, respectively. Further, in a subgroup of 54 low scorers, the RCI/SID and SRB methods indicated that 14.81% and 16.67% of students changed, respectively, whereas the normalization method identified a higher rate at 24.07%. The RCI, SID, and SRB methods showed similar results, whereas the normalization method differed from the others. Finally, a practical tool (Excel-based Growth Calculator) is provided to assist practitioners in evaluating individual change. Overall, these methods provide starting points for measuring change in individuals.

1. Introduction

A social interactionist approach to learning acknowledges that learning in educational settings is a dynamic process that reflects the interplay between the student and interaction with others and the environment (Vygotsky, 1978). What is to be measured, therefore, is the individual student’s growth or progress in educational contexts. Indeed, there has been a growing tendency for policymakers and researchers to shift the focus away from the mean effect for groups and to focus instead on how student learning evolves over time for the individual student (Anderman et al., 2015). It would follow that available methods for assessing individual-level change need to be systematically compared, thus motivating the current study.
Having a reliable method to capture change is critically important for educators and clinicians (e.g., psychologists, speech-language pathologists), allowing decision-making to be guided by developmental trajectory (Page, 2014; Vygotsky, 1978). For example, measuring individual change is useful for documenting a student’s academic growth (or lack thereof), monitoring progress in order to decide on support measures, and ending intervention when relevant changes have been made (e.g., Zahra et al., 2016; Daub et al., 2019). In research, there is a long tradition of using frequentist statistics to quantify the amount of change experienced by a group (Pernet, 2015). Take, for example, the effects of full-day (vs. half-day) kindergarten on developing academic-relevant competencies, which are typically studied at the group level using frequentist statistics (see Cooper et al., 2010 for a review). Subsequently, developmental research informs policy decisions such as implementing a full-day kindergarten curriculum for all students, given that full-day kindergarten is generally efficacious (e.g., Canada: Akbari et al., 2024; United States: Kauerz, 2005). The ability to identify specific individuals who change, however, would offer more information beyond group-based effects. Although it would seem simple to examine change by seeing who increases or decreases their scores over time, the task is deceptively complex. First, showing that change is not due to maturation (e.g., Bedard & Dhuey, 2006) or random effects (e.g., Duff, 2012; Anderman et al., 2015) alone may be challenging. Another difficulty is that there is no consensus on how to measure change and, relatedly, who makes a reliable change (e.g., Anderman et al., 2015; Frijters et al., 2013; Hendricks & Fuchs, 2020; Maassen et al., 2009). Various methods are available to measure who has changed—whether a group or a given individual—but each method yields a slightly different result. What follows is a broad overview of the group- and individual-level methods most commonly used in research and clinical practice.

1.1. Identifying Group-Level Changes

In the scientific and research literature, group comparisons are commonly conducted using classical frequentist tests (i.e., null hypothesis significance testing) (Pernet, 2015). The most common statistical tests to index change over time are repeated measures analysis of variance (ANOVA) and pairwise t-tests, with results being communicated in terms of significance. Statistically significant results—large F- or t-values with small p-values below 0.05, typically reported alongside confidence intervals and effect sizes—are taken to indicate that the group has changed over time.
There are several limitations to null hypothesis significance testing, however. Most notably, focusing on the group effect is not a readily applicable method for understanding individual performance. These methods also do not account for initial levels of performance or regression to the mean. Even an analysis of covariance (ANCOVA), where variables are included as covariates, fails to provide the necessary statistical controls (Miller & Chapman, 2001). Further, for common tests such as the t-test, only one outcome (e.g., a language test) can be compared between the two time points. If the two time points do not differ on this outcome, there is only a 5% probability that the result will be statistically significant by chance when a benchmark of p < 0.05 is used (Sainani, 2009). In practice, however, a battery of outcomes is typically used to make comparisons. For example, if the two time points are compared on five outcomes (e.g., a language test, a reading score, etc.), there is a 23% probability that at least one of the outcomes will be significant by chance (Sainani, 2009). Testing a number of hypotheses in this way, called multiple testing/comparisons, increases the chances of making a Type 1 statistical error (i.e., concluding that there is change when there is not) (Sainani, 2009). Finally, p-values are often misinterpreted. A non-significant result does not support the null hypothesis per se, but rather fails to reject it (Brydges & Bielak, 2020). Conversely, a Bayesian analysis can be used to find evidence for a null effect by quantifying the amount of evidence (Brydges & Bielak, 2020; Wagenmakers et al., 2018). Nevertheless, Bayesian statistics still only accounts for group changes and not which individual within a group has changed. It follows that identifying methods for (a) quantifying changes in outcomes for individuals and (b) determining whether such changes are reliable is warranted.
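The familywise error figures cited here follow directly from the probability of at least one false positive across independent tests; a quick check in R (the language used for the analyses in this paper):

```r
# P(at least one false positive) across k independent tests at alpha = 0.05
# is 1 - (1 - alpha)^k.
1 - (1 - 0.05)^1  # 0.05  -> a single comparison
1 - (1 - 0.05)^5  # 0.226 -> five comparisons, roughly the 23% cited above
```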

1.2. Identifying Individual-Level Changes

1.2.1. Normalization Method

To assess which specific individuals change, two methods have been proposed: normalization and growth statistics. The normalization approach might be most familiar to professionals (educators and clinicians) and researchers who use norm-referenced tests in their practice (e.g., Torgesen et al., 2001; Hendricks & Fuchs, 2020; Frijters et al., 2013). Children typically meet eligibility for special education services if they receive a standard score of 85 or lower (or the 16th percentile or lower) on nationally normed assessment measures (Spaulding et al., 2012). Then, children who receive intervention are considered to have made a change if post-intervention scores are within the “average” or “normalized” range, which is typically a standard score of 90 or higher (corresponding to the 25th percentile or higher). However, it is important to note that these cut-off-point criteria vary by test and across regions (Spaulding et al., 2012).
Standard scores represent a child’s performance relative to the performance of age-matched peers (i.e., child’s age-adjusted standing), whereas percentiles are used to clarify the interpretation of standard scores by ranking scores. Standard scores and percentiles can be converted from raw scores for commercially available, norm-referenced tests across domains. For example, the Clinical Evaluation of Language Fundamentals—Fifth Edition (CELF-5; Wiig et al., 2013), one of the gold-standard assessment measures for language, has a mean standard score of 100 (SD = 15) for typically developing children. In contrast, deriving standard scores/percentiles for experimenter-created measures requires more resources, including a representative sample for norming and statistical training to convert raw scores to standard scores and percentiles. Nevertheless, the normalization method is appealing because the individual can be seen to have moved, providing a gauge of ‘how much’ they moved from an at-risk level (below the 16th percentile) to within normalized levels (25th percentile or greater).
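These standard score/percentile correspondences are easy to verify; for example, in R, assuming the usual norm-referenced metric of M = 100 and SD = 15:

```r
# A standard score of 85 sits near the 16th percentile of a normal distribution...
pnorm(85, mean = 100, sd = 15)    # ~0.159
# ...and the 25th percentile corresponds to a standard score of roughly 90.
qnorm(0.25, mean = 100, sd = 15)  # ~89.9
```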
Although interpreting the results may seem straightforward, concerns about the normalization approach must be acknowledged. Like group-based approaches, the normalization method considers only two time points. Nonetheless, it is likely that real-life time and resource constraints mean that practitioners only conduct pre–post measures anyway. Most notably, this method can only measure progress in a subset of the sample, namely those who score below the cut-off at initial testing. Relatedly, an important point to consider is the potential bias of norm-referenced tests when the culture and language of the student being assessed differ from those of the normative group (e.g., Laher & Cockcroft, 2017; Higgins & Lefebvre, 2024). This can increase the risk of overidentification/misdiagnosis of culturally and linguistically diverse students as having learning difficulties when relying on norm-referencing (e.g., Paradis et al., 2013). It would be useful to have feasible methods that capture growth across the range of individual differences, including individuals developing typically and struggling individuals who change but may not have normalized.
Another concern with the normalization method is that using standard scores to demonstrate change may be inappropriate (Daub et al., 2019; Kerr et al., 2003) and can even be misleading (Daub et al., 2019; Spaulding et al., 2012; Kwok et al., 2022). For example, one student who raises their standard score by five points from an initial standard score of 85 to a final standard score of 90 may be viewed as normalized, whereas another student who makes a five-point gain from 80 to 85 would not be categorized as having changed because their final score remains under the cut-off point of 90 (e.g., Torgesen et al., 2001; Hendricks & Fuchs, 2020; Frijters et al., 2013). Further, changes in a child’s standard scores over time do not directly imply improvement or regression in their skills but rather indicate a change in the child’s performance relative to peers of the same age (Kwok et al., 2022). The same standard score at pre- and post-test at first glance may be assumed to indicate no change, when, in fact, maintaining the same standard score over time indicates that the individual progressed at the same rate as the normative group (e.g., Daub et al., 2019). A negative change in standard scores could also be interpreted in different ways. Although norm-referenced scores often decrease over time among individuals with neurodevelopmental disorders who exhibit slower-than-average progress (e.g., Farmer et al., 2020), negative change does not necessarily mean that the child has regressed and lost skills. It could also mean that the child gained the skills but at a slower rate than most peers or that they have experienced no change in their skills (Kwok et al., 2022).
In addition, although standard scores are readily available with commercially developed norm-referenced tests, deriving standard scores for experimental or informal measures can be cumbersome. This becomes problematic when schools or clinics design their own outcome measures, for example. The measure would need to be administered to a representative sample so that a normative distribution could be formed, from which standard scores/percentile ranks can then be derived. Nevertheless, given that in real-life contexts, clinicians and educators are likely attempting to use standard scores from norm-referenced tests as indicators of change (Kerr et al., 2003), we included this method in the current paper as a comparison against potentially more sensitive measures, namely growth methods. We were able to use the normalization method with our experimental measures because we had sufficient resources (e.g., a representative sample and statistical training).

1.2.2. Growth Methods

The individual-level approaches considered in this paper are collectively called growth methods (also referred to as distribution-based methods (de Vet et al., 2006) or idiographic methods (Blampied, 2022)). Growth methods compare the change in an individual based on statistical characteristics of the obtained sample. We reviewed the extant literature to determine methods that reflected best practice and, importantly, that would be easy for practitioners, who typically have minimal statistical training (Lakhlifi et al., 2023), to calculate and to interpret based on available data. We also wanted to choose non-overlapping methods to compare. We focused on the reliable change index (RCI) because it is the most widely used method for evaluating individual change (Jacobson & Truax, 1991; Chelune et al., 1993; Blampied, 2022; Duff, 2012), standardized individual difference (SID) methods because of the acceptable false positive rates compared to other methods (Payne & Jones, 1957; Ferrer & Pardo, 2014), and standardized regression-based (SRB) methods for their precise estimates of change (McSweeny et al., 1993; Crawford & Garthwaite, 2007).
First, the RCI is a method for assessing whether a change between two observations of the same individual is statistically reliable and not due to measurement error (Jacobson & Truax, 1991). Chelune et al. (1993) further adjusted the RCI formula to control for practice effects on repeat testing. To measure progress at the student level across kindergarten in this paper, we adopted the practice effect-adjusted RCI (henceforth called the RCI for brevity), as per best-practice recommendations (Duff, 2012; McAleavey, 2024). The numerator of this improved RCI equation is the individual’s difference between test scores minus the group’s mean practice effect. The numerator is then divided by the standard error of measurement (SEM) of the difference score (see Table 1 for the equation, adopted from Duff (2012) and Hendricks and Fuchs (2020)). Each individual receives a resulting RCI value that is then compared to the z-distribution in order to make interpretations about reliable change. On a z-score distribution, which is standardized with a mean of 0 (SD = 1), the cut-off point for a 95% confidence interval is 1.96, and for a 90% confidence interval, it is 1.645. We use the recommended ±1.645 cut-off given that our design is a single-group pre–post study (Estrada et al., 2019) and that this cut-off is widely used in clinical psychology (Duff, 2012). Thus, an RCI value of +1.645 or greater (an increase over time) represents a significant and reliable change that would be unlikely to occur by measurement error, whereas −1.645 or lower indicates a reliable change in the negative direction (a decrease over time). Note that the sign of the RCI value indicates the positive or negative direction of change, not the arithmetic sign of the change score (Blampied, 2022).
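In symbols, the practice effect-adjusted RCI described above can be written as follows (our reconstruction from the description and from Duff (2012); the authoritative version appears in Table 1):

$$\mathrm{RCI} = \frac{(x_2 - x_1) - (\bar{X}_2 - \bar{X}_1)}{S_{\mathrm{diff}}}, \qquad S_{\mathrm{diff}} = \sqrt{2\,\mathrm{SEM}^2}, \qquad \mathrm{SEM} = S_1\sqrt{1 - r_{12}},$$

where $x_1$ and $x_2$ are the individual’s time 1 and time 2 scores, $\bar{X}_2 - \bar{X}_1$ is the group mean practice effect, $S_1$ is the time 1 standard deviation, and $r_{12}$ is the test–retest reliability.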
The SID method identifies reliable change using the standardized pre–post difference, with the standard deviation of the difference scores as the denominator instead of the SEM (Payne & Jones, 1957). The standardized score results from dividing the difference between the initial and final scores by the standard deviation of the differences. Similarly to the RCI method, the resulting SID value is also compared to the z-distribution. We considered an individual change to be reliable when the SID value fell at or beyond ±1.645. Compared to other methods, the SID method has yielded acceptable false positive rates (Ferrer & Pardo, 2014). Others, however, have found similar results between the RCI and SID methods (Estrada et al., 2019).
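Based on the description above (Payne & Jones, 1957), the SID takes the form

$$\mathrm{SID} = \frac{x_2 - x_1}{S_D},$$

where $S_D$ is the standard deviation of the pre–post difference scores in the sample.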
Finally, the SRB index uses regression analyses to predict change (McSweeny et al., 1993). In this paper, we adopted a simpler version, termed the estimated SRB approach, which was proposed to ease computation (Crawford & Garthwaite, 2007; Maassen et al., 2009), an important consideration when creating an easy-to-use calculator for clinicians. In the estimated SRB method, summary data (e.g., standard deviation, mean) that are typically available in test manuals are used to form the regression equation with which to calculate reliable change. The SRB approach uses the regression equation to predict the final score; the predicted final score is then compared to the observed final score to determine whether a meaningful change occurred over time. Similarly to the other methods, the ±1.645 benchmark was used when considering change.
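A sketch of the estimated SRB, as we read Crawford and Garthwaite (2007), with the regression parameters estimated from summary data and the same denominator as the RCI (per the description above):

$$\hat{x}_2 = b\,x_1 + a, \qquad b = r_{12}\frac{S_2}{S_1}, \qquad a = \bar{X}_2 - b\,\bar{X}_1, \qquad \mathrm{SRB} = \frac{x_2 - \hat{x}_2}{S_{\mathrm{diff}}}.$$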
The advantage of the growth methods over the normalization method is that they can be applied to all individuals, not just those whose initial score is below a cut-off. Having a statistical method for measuring individual change could be especially advantageous when evaluating single-case experimental designs, which have traditionally relied on visual inspection rather than formal statistics (cf. Fisch, 2001). Other benefits include a precise estimate of relative change, accounting for test–retest reliability and measurement error, and controlling for practice effects (e.g., Chelune et al., 1993; Duff, 2012). However, these measures are seldom calculated in educational and developmental studies (Zahra et al., 2016; cf. Frijters et al., 2013; Hendricks & Fuchs, 2020; Unicomb et al., 2015), possibly because the formulas are complicated to calculate. For example, instead of calculating an RCI for every individual, an increasingly popular alternative is to invert the RCI formula to calculate the smallest amount of change needed from the initial to final scores for the change to be considered reliable (formula in Table 1). This coefficient is more widely known as the minimally detectable change (e.g., de Vet et al., 2006).
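Inverting the practice effect-adjusted RCI in this way gives, under the same notation (a sketch; the exact formula is in Table 1):

$$\mathrm{MDC} = (\bar{X}_2 - \bar{X}_1) + 1.645\,S_{\mathrm{diff}},$$

i.e., the smallest pre–post gain that exceeds the mean practice effect by more than measurement error would allow at the 90% level.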
While a number of change score methods are available, studies with adults have found that most change indices perform similarly to each other, except for the Hageman–Arrindell (HA) method, which is considered “the most conservative” (McGlinchey et al., 2002). Still, others have found that the RCI and HA methods yield comparable results (Ronk et al., 2016; Ferrer & Pardo, 2014), with the RCI method being suggested to promote consistent practice (Lambert & Ogles, 2009). The HA method was not included in this study because a preliminary analysis of the current dataset found that the reliability of the difference score was below the recommended 0.40 threshold. The confidence interval for the individual predicted value of the linear regression method (Ferrer & Pardo, 2014) was also not included, despite acceptable false positive rates, due to practical obstacles: clinicians will likely not have access to the full dataset and software needed to generate regression analyses (see Note 1). Instead, the SID method, which showed comparable false positive rates to the confidence interval method (Ferrer & Pardo, 2014), is included in the current study.
To our knowledge, no studies have systematically compared the normalization, RCI controlling for practice effects, SID, and estimated SRB methods in an unselected sample of young children to measure academic progress. Recently, Frijters et al. (2013) and Hendricks and Fuchs (2020) compared the normalization and original RCI methods to identify change in struggling readers in response to intervention. Strikingly, both studies found the same pattern: for a given child, one method would indicate change while another would not. The overall rates also varied across studies: the normalization method identified high rates of responders in Hendricks and Fuchs (2020) but low percentages of change in Frijters et al. (2013). Overall, it is clear that measuring change in children is far from straightforward. Nevertheless, measuring progress is routinely carried out by educators and clinicians. It follows that providing practitioners with accessible ways of evaluating individual change—which could serve as one piece of clinical evidence—is critical, motivating the current paper.

1.3. The Current Study

The goal of the current study was to provide a preliminary investigation of using sophisticated methods for examining individual change in the academic skills of kindergarten students. We were interested in testing whether the individual change indices offer more information than group-based statistics. Thus, the novelty of our study lies in our methodology of comparing four individual change scores—normalization, RCI controlling for practice effects, SID, and estimated SRB—to identify who exhibits a meaningful change with respect to school-relevant competencies. Our research aims were (1) to assess change at the individual level for all students, (2) to assess change in a subgroup of struggling learners, (3) to compare how change was captured across methods, and (4) to develop user-friendly Excel calculators with which practitioners can examine change.

2. Method

2.1. Participants

In Ontario, Canada, kindergarten is a 2-year program: children entering Year 1 turn 4 years of age, and those entering Year 2 turn 5, at some point during that school year. Participants were part of a larger longitudinal study following children from Year 2 of kindergarten to grade 2 (Pham et al., 2025). The larger study used a multi-wave approach in which a total of 16 schools (2 rural) in London, Ontario participated over 7 years (2013–2019), with 6–9 schools participating in new recruitment each year. Ethics approval was obtained from the local university and participating school boards. Written consent was obtained from the principal of each of the 16 schools. Teachers sent consent packets, which included a letter of information about the study, home with all children. If parents wished to consent, they returned the written consent form to their child’s teacher; child assent was also obtained at study onset.
The larger study had 330 participants (162 females) who completed the assessment battery in the spring of Year 1. The current study focuses on 157 participants (70 females) who had data points available in both the spring of Year 1 (Mage = 4.33, SD = 0.49) and the spring of Year 2 (Mage = 5.38, SD = 0.49). The two time points will henceforth be referred to as the time 1 score and time 2 score, respectively. There was 47.57% attrition between the two years. The main reason for attrition was that the larger study focused on kindergarten students in Year 2. Students in Year 1 were included in the study if they were part of a split Year 1/2 class, as all students in the class were invited to participate. Other possible reasons for attrition included child absence, child moving schools, child withdrawal from the study, and school withdrawal (board-level policy changes, labour shortages, and disruptions). There were no specific criteria listed for participation, as we were interested in obtaining an unselected, representative sample of kindergarten students.
The larger study did not collect specific information about the demographic characteristics of the sample (e.g., ethnicity, language(s) spoken, socioeconomic status, etc.), nor did the schools share this information. Limited information was available from a demographic survey that was discontinued after the first year of the study. As per parent reports (n = 101), children were rated ‘good’ on average at counting and recognizing numbers, letter names, number relationships, quantity concepts, and understanding patterns, and ‘satisfactory’ on average at letter sounds and meanings of written words. Further, children ‘never’ or ‘rarely’ (once or twice a month) attended English as a Second Language or literacy programs in the six months before kindergarten. Based on previous studies with cohorts from this school board (Archibald et al., 2013), we can only speculate that the sample was largely monolingual English-speaking and of relatively high socioeconomic status.

2.2. Procedure

All participants were tested individually in a single session, lasting 30–40 min, by trained research assistants (RAs). There were on average 14 trained RAs (range = 7–20) administering the assessment each year. Each RA attended a 2–3 h training session. Training included viewing a video recording of the second author providing step-by-step explanations of how to administer and score the test battery, conducted on a practice child participant who was not included in the study. The RAs also had opportunities to practice the administering and scoring of each measure. We created a website that contained the training video and other resources for RAs to access during and after training.
Students completed the battery of language, reading, and mathematical measures in the spring (March–May) of Year 1 kindergarten (i.e., time 1 scores). These outcome measures were repeated approximately 1 year later in the spring (March–May) of Year 2 kindergarten (i.e., time 2 scores). All participants completed the measures described below. Further details of the procedure are reported in the larger study (Pham et al., 2025). Caregivers could opt in to receive the lab’s annual newsletter which would contain a summary of the study.

2.3. Materials

The kindergarten assessment consisted of nine tasks: vocabulary, sentence recall, letter-name and letter-sound knowledge, rapid color naming, phonological awareness, number naming, number line estimation, dot and symbol magnitude comparison, and arithmetic skills. Materials are fully described elsewhere (Pham et al., 2025) and freely available on Open Science Framework (https://osf.io/gnekj/; accessed on 21 October 2025). Test–retest reliability over approximately 1 month, as reported in Pham et al. (2025), is provided for each task; additional psychometric properties are provided where available. Brief descriptions of each tool are as follows:
Vocabulary. Participants were asked to name each of 10 pictures of various nouns, for a total of 10 points. This vocabulary task was from the Reading Readiness Screening Tool (RRST) v7.5 (Learning Disability Association of Alberta, 2011). The RRST was developed by a committee of specialists with the Learning Disability Association of Alberta and is used to screen students from kindergarten to grade 1. Norming of the tool is currently being conducted by Alberta Education, the governing agency that oversees the education system. The test–retest reliability was 0.73.
Sentence recall. Participants were asked to repeat sentences after hearing a digital recording of the sentence (Redmond, 2005). There were 16 sentences. Sentences were scored online as 2 (correct), 1 (three or fewer errors), or 0 (more than four errors or no response) for a maximum score of 32. Inter-rater consistency was r = 0.99 and test–retest reliability was r = 0.95 in Redmond (2005), and test–retest reliability was 0.95 in Pham et al. (2025). The task and manual are available at https://health.utah.edu/communication-sciences-disorders/research/redmond (accessed on 10 October 2025).
Letter-name and letter-sound knowledge. Participants were asked to name and give the sound for the 26 letters of the alphabet presented in upper- and lower-case letters with no time constraints. Correct responses were tallied out of 26 for each of the upper case sounds, lower case sounds, upper case letter names, and lower case letter names, and an average score across the four trials was used. The test–retest reliability was 0.90.
Color rapid automatized naming (RAN). Participants serially named colored boxes aloud as quickly and accurately as possible (Howe et al., 2006). The time required to name all stimuli (in seconds) was used for data analysis. The test–retest reliability was 0.89.
Phonological awareness. This was a bespoke measure developed by the school board’s Speech and Language Services. The test consisted of 6 subtests: rhyme recognition, rhyme production, sound blending, sound identification, sound segmentation, and sound/syllable deletion. There are 33 items in total across the subtests, for a maximum score of 33. The test–retest reliability was 0.74.
Number naming. Participants were asked to name the 10 digits from 0 to 9 without time constraints, for a total of 10 points. The test–retest reliability was 0.93.
Number line estimation. Participants repeatedly estimated the position of a number on a 15.9 cm number line (see Note 2) with 0 and 10 at the respective ends by making a mark on the line to indicate where the number would go (Hawes et al., 2019). The average proportion of absolute error across the eight trials was calculated as |(observed value − expected value)/15.9|. The test–retest reliability was r = 0.32 in Hawes et al. (2019) and 0.72 in Pham et al. (2025).
Dot and symbol magnitude comparison. Participants were required to compare pairs of magnitudes ranging from 1 to 9, presented either in symbolic (56 digit pairs) or non-symbolic (56 pairs of dot arrays) formats. Participants were instructed to cross out the larger of the two magnitudes and were given two minutes to complete each condition. The proportion of correct answers (correct responses relative to the number of items attempted) was used for data analysis. For more information, see Nosworthy et al. (2013); Hawes et al. (2019); and https://www.numeracyscreener.org (accessed on 16 October 2025). The test–retest reliabilities for the symbolic and nonsymbolic magnitude comparison tasks were r = 0.72 and r = 0.61, respectively, in Hawes et al. (2019) and r = 0.61 and r = 0.90, respectively, in Pham et al. (2025).
Arithmetic skills. Participants had to complete 5 single-digit addition and 5 single-digit subtraction problems, with no time constraints. The number of correct responses was tallied. The test–retest reliability was r = 0.60 in Hawes et al. (2019) and r = 0.80 in Pham et al. (2025).

2.4. Data Analysis Plan

Data Cleaning. A small amount of data (5.15%) was missing for various reasons (e.g., child absent, task not completed). An empirical test revealed that the data were not missing completely at random (Little, 1988): χ2(186) = 353.80, p < 0.001. We therefore used the K Nearest Neighbour (KNN) imputation method to estimate the missing data (Troyanskaya et al., 2001), an algorithm applicable to imputing values that are not missing at random. The DMwR2 package (Torgo, 2016) in R version 4.4.2 (R Core Team, 2024) was used to handle missing data.
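A minimal sketch of this step using the DMwR2 package cited above; the data frame `scores` and the choice of k are illustrative, not taken from our analysis scripts:

```r
library(DMwR2)
# Impute each missing subtest score from the 10 most similar children
# (a weighted average of neighbours; k = 10 is the package default).
scores_complete <- knnImputation(scores, k = 10)
```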
For most measures, higher scores reflected better performance, and thus an increase in score from time 1 to time 2 indicates an improvement; the exceptions were color RAN and number line estimation, where lower raw scores reflected better performance. Guided by Brysbaert and Stevens (2018), we transformed RAN reaction times and number line estimation scores into inverse scores (1000/(RAN reaction time) and 1/(number line score), respectively) for ease of interpretation, so that, on all measures, poorer performance corresponded to lower values. The numerators were selected to avoid having values that are too small in our analyses (Brysbaert & Stevens, 2018). Inverse scores are used in subsequent analyses. Further, given that each subtest used a different scale and had a different maximum score, we converted raw scores to z-scores for each subtest to provide a standardized scale for interpretation. Z-scores were used in the analyses.
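Illustratively, with hypothetical column names `ran_rt` (RAN time in seconds) and `nle_error` (number line proportion of absolute error) in a data frame `df` of subtest scores, the transformations look like this:

```r
# Inverse-transform the two 'lower is better' measures so that, as with the
# other subtests, higher values indicate better performance.
df$ran_inv <- 1000 / df$ran_rt   # faster naming -> higher value
df$nle_inv <- 1 / df$nle_error   # smaller estimation error -> higher value
subtests <- setdiff(names(df), c("ran_rt", "nle_error"))  # drop raw versions
df_z <- as.data.frame(scale(df[, subtests]))  # z-score each subtest
```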
Finally, given that each individual subtest is a screening measure, we sought to create a more robust measure by combining the subtests, as informed by results from an exploratory factor analysis. An evaluation of the scree plot and parallel analysis suggested that a single factor be extracted for both time 1 scores (significant Bartlett’s test of sphericity, KMO = 0.81, eigenvalue = 3.68, 40.94% of variance explained) and time 2 scores (significant Bartlett’s test of sphericity, KMO = 0.76, eigenvalue = 3.33, 37.00% of variance explained). The full results are reported in the Supplementary Materials. As a result, we created a school screening composite score by adding z-scores across all measures. The school screening composite will be used in the remaining analyses.
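One way to reproduce this kind of check is with the psych package (an assumption on our part; the text does not name the EFA software, and `time1_z` is a hypothetical matrix of time 1 z-scores):

```r
library(psych)
KMO(time1_z)                                         # sampling adequacy
cortest.bartlett(cor(time1_z), n = nrow(time1_z))    # Bartlett's test of sphericity
fa.parallel(time1_z, fa = "fa")                      # scree plot plus parallel analysis
composite_t1 <- rowSums(time1_z)  # school screening composite: sum of z-scores
```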
Analyses. To identify which individual student showed a change, we used four individual change indices (a combined numerical sketch is provided after the list):
  • Normalization. For the normalization method, the function pnorm in base R was used to calculate the percentile corresponding to the composite z-score for time 1 and 2. A student is considered to have changed if they moved from the 16th percentile or less (time 1 score) to the 25th percentile or higher (time 2 score). These values correspond to standard scores of 85 and 90, respectively, which are widely used as cut-offs for norm-referenced tests. Notably, this method would only be applicable to a subgroup of students who initially scored below the 16th percentile on a given measure at time 1. We refer to this subgroup as the ‘low scorers’ in the remainder of the paper; the remaining students were considered within normalized levels.
  • Reliable change index (RCI). The first growth method we applied to our data was the practice effect-adjusted RCI formula. The RCI was calculated using the revised formula by Chelune et al. (1993), as reported in Duff (2012; Table 1). The numerator corresponds to the difference between the student’s time 2 and time 1 z-scores minus the mean practice effect. The denominator is based on the standard error of measurement (SEM) of the difference z-score, which is estimated by pooling the variance of the sample and the reliability of the test. We estimated reliability as the correlation between the pre- and post-administrations of the composite z-score (Hendricks & Fuchs, 2020). An RCI value was calculated for each student. We considered an individual change to be reliable when its corresponding RCI fell at or beyond ±1.645. The ±1.645 benchmark indicates that, in a normal distribution, 90% of change scores will fall within this range, with only 5% of cases falling below and 5% above it by chance. The RCI results were then constrained to the low scorers only to allow for comparison with the normalization method. In addition, the RCI coefficient accounting for practice effects was calculated to determine the smallest amount of change that would indicate a reliable change (i.e., the minimally detectable change) (de Vet et al., 2006).
  • Standardized individual difference (SID). The SID is calculated by dividing the individual difference between the time 2 and time 1 z-scores by the standard deviation of these differences. A resulting SID value at or beyond ±1.645 indicates a reliable change, the same benchmark as the RCI. The dataset was then constrained to low scorers only for comparison with the normalization method.
  • Standardized regression-based (SRB) index. The final change score was the estimated SRB index (Crawford & Garthwaite, 2007; Maassen et al., 2009). The numerator corresponds to the difference between the student’s predicted and observed time 2 z-scores. The predicted time 2 z-score was estimated using summary data: mean, SD, and test–retest reliability (Crawford & Garthwaite, 2007; Maassen et al., 2009). The denominator is the SEM (i.e., the same denominator as in the RCI formula). Similarly to the previous methods, a resulting SRB value at or beyond ±1.645 indicates a reliable change. The dataset was then constrained to low scorers only for comparison among the methods.
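To make the four indices concrete, here is a minimal numerical sketch in R for a single student. All summary values are hypothetical placeholders, not our sample statistics (those accompany the calculator in the Supplementary Materials):

```r
# Hypothetical summary data for the composite z-score.
m1 <- 0.00; sd1 <- 1.00   # time 1 mean and SD
m2 <- 0.15; sd2 <- 1.05   # time 2 mean and SD
r12 <- 0.85               # test-retest reliability (pre-post correlation)
x1 <- -1.20; x2 <- -0.10  # one student's time 1 and time 2 composite z-scores

sem    <- sd1 * sqrt(1 - r12)                        # standard error of measurement
s_diff <- sqrt(2 * sem^2)                            # SEM of the difference score
sd_d   <- sqrt(sd1^2 + sd2^2 - 2 * r12 * sd1 * sd2)  # SD of difference scores (from summary data)

rci <- ((x2 - x1) - (m2 - m1)) / s_diff   # practice effect-adjusted RCI
sid <- (x2 - x1) / sd_d                   # standardized individual difference
b   <- r12 * sd2 / sd1                    # estimated SRB regression slope
a   <- m2 - b * m1                        # estimated SRB regression intercept
srb <- (x2 - (a + b * x1)) / s_diff       # estimated SRB index

# Normalization: below the 16th percentile at time 1, at/above the 25th at time 2.
normalized <- pnorm(x1) <= 0.16 && pnorm(x2) >= 0.25

c(RCI = rci, SID = sid, SRB = srb)        # compare each to +/- 1.645
```

With these placeholder values, the student crosses the ±1.645 benchmark on the RCI (1.73) and SID (1.95) but not the SRB (1.50), illustrating how the indices can diverge for a given child.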
Classification profiles. Profiles of academic progress were created based on the results across indices. Students were classified into one of three profiles: (i) Improved over time meant that the student received a reliable change score in the positive direction (+1.645 or greater) on the RCI, SID, and/or SRB. (ii) No reliable change meant that the scores across all methods did not meet the benchmark. (iii) Worsened over time meant that there was a reliable change in the negative direction (−1.645 or lower) on the RCI, SID, and/or SRB. Note that even if raw scores increase from time 1 to time 2, the change could still be classified as negative (below the −1.645 benchmark).
Agreement. Cohen’s kappa (κ) was used to examine agreement between the change score indices. The following classification scheme was used to interpret κ: values ≤ 0 as no agreement, 0.01–0.20 as none to slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement (McHugh, 2012).
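Agreement of this kind can be computed with, for example, the irr package (one option among several; the vectors below are toy data, not our classifications):

```r
library(irr)
# Dichotomous change classifications (1 = reliable change) from two indices
# for six hypothetical students.
rci_class <- c(1, 0, 0, 1, 0, 1)
srb_class <- c(1, 0, 0, 1, 1, 1)
kappa2(data.frame(rci_class, srb_class))  # Cohen's kappa with a z-test
```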

3. Results

3.1. Full Sample

Table 2 reports descriptive statistics of raw scores from all kindergarten measures. The results from the RCI and SID indices were identical and will be discussed together. The RCI/SID and SRB indices identified a similar number of students as making a reliable change: 7.64% (12/157 students) and 8.28% (13/157 students), respectively (Table 3). Similarity between the methods was substantiated by perfect agreement between the RCI and SID, κ = 1, p < 0.001, and an almost perfect level of agreement between the RCI and SRB, κ = 0.96, p < 0.001.
We outline the different profiles of change that emerged from our dataset in Table 3 and Figure 1. The majority of students (n = 99 within normalized levels + 34 low scorers = 133) did not make a reliable change (solid-line square cross in Figure 1). A minority of students (n = 1 normal + 2 low = 3) made a reliable change large enough to be captured by both the RCI/SID and SRB indices, alongside normalization for the low scorers (dashed filled circle in Figure 1). Seventeen students from the low-scoring subgroup showed a reliable change in the positive direction based on at least one of the indices (RCI/SID, SRB, or normalization; dashed filled triangle and dashed filled square in Figure 1). Six students (3 normal + 3 low) showed a reliable change in the negative direction (solid-line plus in Figure 1). Notably, there were no cases in which one index indicated a positive reliable change (≥+1.645) while another indicated a negative reliable change (≤−1.645). The RCI coefficient was +5.67 points, indicating that participants had to increase their composite z-score by at least 5.67 from time 1 to time 2 to be classified as having reliably changed.

3.2. Low Scorers

Table 3 reports the change indices for the low-scoring subgroup (n = 54). Similarly to the full sample, the RCI/SID and SRB approaches identified approximately the same proportion of students as making a reliable change: 14.81% (8/54 students) and 16.67% (9/54 students), respectively. The normalization method identified about a quarter of students as making a change: 24.07% (13/54 students). Agreement was perfect between the RCI and SID methods, κ = 1, p < 0.001, and almost perfect between the RCI and SRB, κ = 0.93, p < 0.001, similarly to the full sample. There was, however, almost no agreement between the RCI/SID and normalization methods, κ = 0.01, p = 0.95, nor between the SRB and normalization methods, κ = −0.02, p = 0.89 (Table 3).
A closer look at the profiles of change in Table 3 indicates that the majority of low-scoring students (n = 34) did not make a reliable change. Seventeen made a positive reliable change based on at least one method, although four of them were still not considered to be within the normalized level. Finally, three students made a negative reliable change based on the RCI, SID, and/or SRB methods but not according to the normalization method.

4. General Discussion

This study provided a preliminary evaluation of using individual-level indices to capture changes in learning outcomes across kindergarten. Children were assessed using our screening battery on various language, reading, and mathematical skills in the spring of each year of a 2-year kindergarten program. While the group-level results found no significant differences over time for the full sample (see Supplementary Materials), responsiveness at the level of the individual child varied. For the full sample, the RCI/SID and SRB methods identified 7.64% and 8.28% of students as having changed, respectively, with almost perfect agreement among the methods (research aims 1 and 3). We then restricted our dataset to a subgroup of low scorers to allow us to compare the normalization approach to the RCI and SRB methods (research aims 2 and 3). For this subgroup, there was a significant difference between time 1 and 2 scores, with scores increasing over time for the group (see Supplementary Materials). However, the change score methods revealed individual differences that were masked by group-based results. Specifically, the RCI/SID and SRB methods identified 14.81% and 16.67% of students as having changed, respectively, while the normalization method identified 24.07%. Agreement was almost perfect between the RCI, SID, and SRB methods for this subgroup, whereas the normalization method differed from the rest. We next interpret our main findings and discuss practical implications, including the creation of a Growth Calculator (research aim 4).
Our results show that individual change indices have the potential to offer more precise estimates of relative change than group-level statistics. In particular, group-level statistics might mislead educators and clinicians into underestimating or overestimating the extent of individual change. Had we stopped our analysis after finding a lack of significance using frequentist/Bayesian statistics for the full sample, we would have concluded that, on average, this group of kindergarteners showed no-to-minimal gains in academic competencies over time. However, the growth methods further revealed that about a tenth of students made progress, with 3% (n = 3) moving downwards but still remaining within normalized levels. This type of information would be crucial for clinical judgements, such as monitoring academics or initiating services if problems persist for these specific students (Anderman et al., 2015). For the low-scoring subgroup, group-based effects were associated with strong evidence for a positive change over time. While this was indeed the case for 31% of students, the vast majority (63%) did not make a change that was considered reliable, and 5% moved downwards. Knowing who made positive gains would mean triaging services and supports to those who have not made a reliable change during the same period. Knowing who has not made a reliable change or has moved downwards can mean initiating or increasing services to ensure that timely supports are in place. Being able to track individual progress (or a lack thereof) is in line with the Vygotskian theoretical framework, in which educators or clinicians could support each student’s needs and make equitable educational decisions for individual students in order to improve learning outcomes (Vygotsky, 1978; Anderman et al., 2015).
Another advantage is that results from the individual indices could be more easily interpreted by practitioners than group-based statistics (p-value, effect size) (Page, 2014). For example, a practitioner could (quickly) conclude whether a reliable change occurred if the individual’s resulting RCI, SID, or SRB score falls beyond ±1.645 or if the difference between their scores exceeds the RCI coefficient benchmark. Having precise interpretations about individual performance would in turn facilitate educational and clinical decision-making and the communication of results to the individual. Nevertheless, we want to emphasize that change score indices should be used with caution, as one piece of evidence among others.
Despite the potential advantages of individual-level indices, each method yielded slightly different results, especially for the low-scoring group. The SRB index identified slightly more students as having changed than the RCI/SID index, but agreement was almost perfect. For the low-scoring subgroup, the RCI/SID and SRB indices were on par with each other but differed from the normalization method. Nevertheless, our level of agreement and general consensus of change (or not) were similar to or better than in previous work focusing on young children (e.g., Frijters et al., 2013; Hendricks & Fuchs, 2020). For example, in Hendricks and Fuchs (2020), the biggest gap between methods was found between the normalization method identifying 42% of students as having changed and the RCI method identifying 24%, almost half that rate.
Potential differences between the RCI and SRB formulas are likely due to the numerator, given that both methods use the same denominator. The RCI numerator accounts for practice effects, and the SRB numerator takes this one step further to control for variability in time 1 scores, which, in turn, corrects for variability when predicting time 2 scores. While our RCI and SRB results were similar, best-practice guidelines (which have focused on adult control vs. treatment groups) recommend using SRB when the individual is demographically similar to the group or has a typical baseline score and using the RCI when characteristics are dissimilar (e.g., Levine et al., 2007; Ouimet et al., 2009). The more complex version of the SRB method can also account for demographic variables by using a stepwise regression, but this is not applicable to the version of the SRB chosen for the study. Interestingly though, including demographic variables does not always improve performance (Durant et al., 2019). Even though the SID method is based on standardized pre–post difference, whereas the RCI and SRB calculations used SEM, previous work has found that the RCI and SID perform similarly (Estrada et al., 2019).
Finally, the normalization method performed differently from the other indices in our study, aligning with prior work (Frijters et al., 2013; Hendricks & Fuchs, 2020). There may be several explanations. The normalization method focused on children with relatively low initial scores, and children with lower initial scores have been found to generally improve more than those with higher initial scores (e.g., Luwel et al., 2011; Lervåg & Hulme, 2010). Relatedly, another explanation may be regression to the mean, as scores for this subgroup were relatively low at time 1. Indeed, the normalization equation does not account for additional variables (e.g., regression to the mean, practice effects). This tentatively suggests that about a quarter of the children with the lowest initial scores (n = 13/54) exhibited a compensatory pattern, catching up with the rest of the sample by the second year of formal education. However, this holds perhaps only to a certain degree, as most of them did not make a change that was considered reliable based on our indices (n = 11/13). Practically, this might mean that practitioners should avoid prematurely ceasing services once these at-risk children seem to be within normalized levels. Instead, these children may have a high potential for growth that may need more time to be realized. Notably, the majority of children in the low-scoring subgroup (n = 34/54) did not make a change according to any of the indices, highlighting that test scores should only be used as one source of data to guide educational and clinical decisions. Overall, we acknowledge that different methods yielding slightly different results might preclude us from concluding which method is most accurate. Further investigation is required to compare the methods and ascertain the accuracy of each index for evaluating young children’s academic progress.
A final yet important point we want to address is that progress should be measured in all children of all abilities, aligning with the policy of universal screenings to monitor students’ performance across academic years (Anderman et al., 2015; Zahra et al., 2016). Indeed, our sample was composed of an unselected sample of children, including typically developing and struggling learners. Previous studies using change indices have primarily focused on measuring change before and after intervention, hence focusing on struggling learners only (e.g., Frijters et al., 2013; Hendricks & Fuchs, 2020) or clinical populations (e.g., Jacobson & Truax, 1991; Crawford & Garthwaite, 2007; Ronk et al., 2016; Duff, 2012 for review, etc.). Being able to estimate the amount of typical change expected to occur over kindergarten, in this case, would provide practitioners with benchmarks of change that could be attributed to maturation and learning. For instance, the RCI coefficient provides a preliminary estimation of the change we might expect in kindergarten children over a 1-year period due to development in the local setting of the study. It would follow that practitioners could then compare this benchmark against changes resulting from a specific intervention or program and determine whether the change that occurred was within, above, or below developmental expectations. However, we must emphasize that more research will be needed to validate the specificity and sensitivity of the current outcome measures and that many other factors must be considered when interpreting test scores.

4.1. Growth Calculator

Converging with the results from other studies (e.g., Frijters et al., 2013; Hendricks & Fuchs, 2020; Estrada et al., 2019; Duff, 2012), we advocate for the use of individual indices to measure change. Several advantages of the RCI, SID, and SRB methods over group-level statistics and even the normalization method include (i) being applicable to all students, even if the individual was not in the at-risk level to start, (ii) the ability to indicate which individual made a reliable change rather than random change (e.g., measurement error, regression to the mean), and (iii) the results being easy to interpret and communicate. The change score indices could be used by practitioners to determine precisely who has reliably changed. For single-case studies, visual inspection of data continues to be the most widely applied method of data interpretation due to the lack of a robust formal statistical approach (cf. Fisch, 2001). We surmise that growth methods could serve as adjunctive methods, aligning with recent work (e.g., Jones & Hurrell, 2019; Unicomb et al., 2015).
One reason that these change score methods are underutilized in educational and developmental practice and research is their complex calculations. To address research aim 4, we have constructed freely available and user-friendly calculators for educators and clinicians to measure growth via the RCI and SRB using our screening battery for children of similar ages. Given the strong similarities between the RCI and SID indices, we focused on the RCI method because it is more commonly used and simplest to calculate, echoing previous recommendations (Ronk et al., 2016; Lambert & Ogles, 2009). The online tool consists of an Excel-based calculator: https://www.uwo.ca/fhs/lwm/publications/growth_calculator.html (accessed on 21 October 2025), also available for download from the Supplementary Materials. By inputting the time 1 and 2 scores for an individual child—the only step practicing professionals are required to do—the calculator will automatically compute individual growth through the RCI and SRB indices and provide a nominal classification relative to our sample psychometrics:
(i) Improved over time: reliable positive change on both the RCI and SRB, or on either one;
(ii) No reliable change: no reliable change on either the RCI or SRB;
(iii) Worsened over time: negative reliable change on both the RCI and SRB, or on either one.
Calculations are based on summary data (i.e., mean, SD, test–retest reliability) from our sample for each measure; however, users could change these values to match the characteristics of their setting. Instructions are provided regarding how to make these changes. We hope that this practical tool will provide educators and clinicians with a quick and easy way to evaluate whether an individual made a reliable change. Nevertheless, we want to emphasize that the result from the change score index should be interpreted as part of a profile of relevant strengths and weaknesses for individual children in order to avoid misinterpreting results from the calculator. We encourage interested readers to contact the first author for further consultation about our calculator or to facilitate creating their own.

4.2. Limitations

This study has several limitations. First, this study was a re-analysis of available data. As a result, we only had experimental tests of language, reading, and mathematics. Although we have started to validate these tools (e.g., Hawes et al., 2019; Nosworthy et al., 2013), having standardized or commercially developed tests as outcome measures would have substantiated our results. The psychometric properties of the measures, including internal consistency reliability, construct validity, known-groups validity, and the sensitivity and specificity of diagnostic accuracy, should be evaluated. In addition, given that only two data points were available for each participant, we could only use the RCI, SID, and SRB growth indices. Other growth approaches can account for multiple repeated assessments (e.g., growth scale values), which would increase the reliability and sensitivity of identifying change. Nevertheless, the RCI, SID, and SRB indices are among the most widely used and researched methods in clinical psychology, and here we aim to highlight their potential benefits for educational and developmental studies as well.
Second, we relied on screening tools (e.g., vocabulary subtest from the Reading Readiness Screening Tool; magnitude comparison tasks from the Numeracy Screener) or brief assessments (e.g., 10 questions/items) to illustrate change. This may not be appropriate because screening measures are meant to be a relatively brief evaluation and ceiling effects were observed for some measures. However, screeners have been shown to be efficient and effective methods for identifying struggling students (e.g., Stormont et al., 2015). Importantly, schools (and clinics) need brief and cost-effective universal screeners to facilitate early intervention. We had to find a balance between forming a comprehensive screening tool and the feasibility of administering the assessment; thus, we selected well-established measures that have been widely used in research and practice.
Finally, although the individual change indices reviewed here are an exciting avenue of research, the normalized range, RCI and RCI coefficient, SID, and SRB values are specific to our experimental measures and only applicable to kindergarteners measured in the spring of a 2-year program when approximately 4.5 and 5.5 years of age. We present a calculator based on this specific population. The potential lack of linguistic and cultural diversity in our sample could limit the generalizability of the findings and the application of the Excel-based Growth Calculator to a broader population. We provide instructions to change the summary data to match the sample characteristics of the context, if available, which, in turn, could improve the accuracy of the calculator. We encourage future research to investigate the individual change indices reviewed here across different cultural, ethnic, racial, and socioeconomic groups.

4.3. Conclusions

The goal of this study was to quantify individual change in kindergarteners using various methods. We were able to identify specifically for whom change occurred using the RCI, SID, and SRB growth methods for all students and the normalization method for low scorers. We therefore recommend including individual-level indices, especially growth approaches, in future research as an adjunct to traditional group-level statistics when quantifying change. For practicing professionals, we provide a tool to assist in evaluating individual change based on the RCI and SRB growth methods using our outcome measures. Overall, these methods provide starting points for measuring progress in kindergarten.

Supplementary Materials

Author Contributions

Conceptualization, T.P., J.O., and L.M.D.A.; Methodology, T.P.; Formal analysis, T.P.; Data curation, T.P.; Resources, J.O., D.A., M.F.J., C.S., and L.M.D.A.; Funding acquisition, D.A., and L.M.D.A.; Supervision, L.M.D.A.; Writing—review & editing, T.P., J.O., D.A., M.F.J., C.S., and L.M.D.A.; Writing—original draft, T.P., and L.M.D.A. All authors have read and agreed to the published version of the manuscript.

Funding

This study was made possible by funding from the Natural Sciences and Engineering Research Council of Canada to LMDA and DA, Grant Number: 371201-2009.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board at the University of Western Ontario (protocol code 104064 and date of approval: 13 September 2017).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data used in this research were collected under provision of informed consent of the participants. Access to the data will be granted in line with that consent, subject to approval by the university ethics board and under a formal Data Sharing Agreement. Some assessment materials for which we own the copyright are publicly accessible at https://osf.io/gnekj/ (accessed on 21 October 2025). The change score calculator is available at https://www.uwo.ca/fhs/lwm/publications/growth_calculator.html (accessed on 21 October 2025) or can be downloaded from the Supplementary Materials.

Acknowledgments

The work of many research assistants is gratefully acknowledged. We would also like to thank the children and their families as well as the school districts who partnered with us.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Preregistration

The study and analyses presented here were not preregistered.

Notes

1. The interested reader can refer to the results for the confidence interval for the individual predicted value of the linear regression method in the Supplementary Materials. Notably, these results overlapped with those of the methods in the main text.
2. The number line was intended to be 16 cm long but was smaller due to a printing error.

References

1. Akbari, E., McCuaig, K., & Mehta, S. (2024). Early childhood education report 2023. Ontario Institute for Studies in Education/University of Toronto. Available online: https://ecereport.ca/media/uploads/2023-overview/ecer_2023_overview-en.pdf (accessed on 17 October 2025).
2. Anderman, E. M., Gimbert, B., O’Connell, A. A., & Riegel, L. (2015). Approaches to academic growth assessment. British Journal of Educational Psychology, 85(2), 138–153.
3. Archibald, L. M. D., Oram Cardy, J., Joanisse, M. F., & Ansari, D. (2013). Language, reading, and math learning profiles in an epidemiological sample of school age children. PLoS ONE, 8(10), e77463.
4. Bedard, K., & Dhuey, E. (2006). The persistence of early childhood maturity: International evidence of long-run age effects. The Quarterly Journal of Economics, 121(4), 1437–1472.
5. Blampied, N. M. (2022). Reliable change and the reliable change index: Still useful after all these years? Cognitive Behaviour Therapist, 15, e50.
6. Briggs, N. E., & MacCallum, R. C. (2003). Recovery of weak common factors by maximum likelihood and ordinary least squares estimation. Multivariate Behavioral Research, 38, 25–56.
7. Brydges, C. R., & Bielak, A. A. M. (2020). A Bayesian analysis of evidence in support of the null hypothesis in gerontological psychology (or lack thereof). The Journals of Gerontology: Series B, 75(1), 58–66.
8. Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1, 9.
9. Chelune, G. J., Naugle, R. I., Luders, H., Sedlak, J., & Awad, I. A. (1993). Individual change after epilepsy surgery: Practice effects and base-rate information. Neuropsychology, 7(1), 41–52.
10. Cohen, J. (1988). Statistical power analysis for the behavioural sciences. Academic Press.
11. Cooper, H., Batts Allen, A., Patall, E., & Dent, A. (2010). Effects of full-day kindergarten on academic achievement and social development. Review of Educational Research, 80(1), 34–70.
12. Crawford, J. R., & Garthwaite, P. H. (2006). Comparing patients’ predicted test scores from a regression equation with their obtained scores: A significance test and point estimate of abnormality with accompanying confidence limits. Neuropsychology, 20(3), 259–271.
13. Crawford, J. R., & Garthwaite, P. H. (2007). Using regression equations built from summary data in the neuropsychological assessment of the individual case. Neuropsychology, 21(5), 611–620.
14. Crawford, J. R., & Howell, D. C. (1998). Regression equations in clinical neuropsychology: An evaluation of statistical methods for comparing predicted and obtained scores. Journal of Clinical and Experimental Neuropsychology, 20(5), 755–762.
15. Daub, O., Skarakis-Doyle, E., Bagatto, M. P., Johnson, A. M., & Cardy, J. O. (2019). A comment on test validation: The importance of the clinical perspective. American Journal of Speech-Language Pathology, 28(1), 204–210.
16. de Vet, H. C., Terwee, C. B., Ostelo, R. W., Beckerman, H., Knol, D. L., & Bouter, L. M. (2006). Minimal changes in health status questionnaires: Distinction between minimally detectable change and minimally important change. Health and Quality of Life Outcomes, 4(1), 54.
17. Duff, K. (2012). Evidence-based indicators of neuropsychological change in the individual patient: Relevant concepts and methods. Archives of Clinical Neuropsychology, 27(3), 248–261.
18. Durant, J., Duff, K., & Miller, J. B. (2019). Regression-based formulas for predicting change in memory test scores in healthy older adults: Comparing use of raw versus standardized scores. Journal of Clinical and Experimental Neuropsychology, 41(5), 460–468.
19. Estrada, E., Ferrer, E., & Pardo, A. (2019). Statistics for evaluating pre-post change: Relation between change in the distribution center and change in the individual scores. Frontiers in Psychology, 9, 2696.
20. Farmer, C. A., Kaat, A. J., Thurm, A., Anselm, I., Akshoomoff, N., Bennett, A., Berry, L., Bruchey, A., Barshop, B. A., Berry-Kravis, E., Bianconi, S., Cecil, K. M., Davis, R. J., Ficicioglu, C., Porter, F. D., Wainer, A., Goin-Kochel, R. P., Leonczyk, C., Guthrie, W., … Miller, J. S. (2020). Person ability scores as an alternative to norm-referenced scores as outcome measures in studies of neurodevelopmental disorders. American Journal on Intellectual and Developmental Disabilities, 125(6), 475–480.
21. Ferrer, R., & Pardo, A. (2014). Clinically meaningful change: False positives in the estimation of individual change. Psychological Assessment, 26(2), 370–383.
22. Fisch, G. S. (2001). Evaluating data from behavioral analysis: Visual inspection or statistical models? Behavioural Processes, 54(1–3), 137–154.
23. Frijters, J. C., Lovett, M. W., Sevcik, R. A., & Morris, R. D. (2013). Four methods of identifying change in the context of a multiple component reading intervention for struggling middle school readers. Reading and Writing, 26(4), 539–563.
24. Hair, J., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis. Pearson Educational International.
25. Hawes, Z., Nosworthy, N., Archibald, L., & Ansari, D. (2019). Kindergarten children’s symbolic number comparison skills relates to 1st grade mathematics achievement: Evidence from a two-minute paper-and-pencil test. Learning and Instruction, 59, 21–33.
26. Hendricks, E. L., & Fuchs, D. (2020). Are individual differences in response to intervention influenced by the methods and measures used to define response? Implications for identifying children with learning disabilities. Journal of Learning Disabilities, 53(6), 428–443.
27. Higgins, Z. E., & Lefebvre, P. (2024). Features of culturally and linguistically relevant speech-language assessments for Indigenous children: A scoping review. The Australian Journal of Rural Health, 32(6), 1100–1117.
28. Howe, A., Arnell, K. M., Klein, R. M., Joanisse, M. F., & Tannock, R. (2006). The ABCs of computerized naming: Equivalency, reliability, and predictive validity of a computerized rapid automatized naming (RAN) task. Journal of Neuroscience Methods, 151(1), 30–37.
29. Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change in psychotherapy research. Journal of Consulting and Clinical Psychology, 59, 12–19.
30. Jeffreys, H. (1961). The theory of probability. Oxford University Press.
31. Jennrich, R. I., & Sampson, P. F. (1966). Rotation for simple loadings. Psychometrika, 31(3), 313–323.
32. Jones, S., & Hurrell, E. (2019). A single case experimental design: How do different psychological outcome measures capture the experience of a client undergoing CBT for chronic pain. British Journal of Pain, 13(1), 6–12.
33. Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141–151.
34. Kauerz, K. (2005). Full-day kindergarten: A study of state policies in the United States. Education Commission of the States. Available online: https://files.eric.ed.gov/fulltext/ED486074.pdf (accessed on 17 October 2025).
35. Kerr, M. A., Guildford, S., & Kay-Raining Bird, E. (2003). Standardized language test use: A Canadian survey. Journal of Speech-Language Pathology and Audiology, 27(1), 10–28.
36. Kwok, E., Feiner, H., Grauzer, J., Kaat, A., & Roberts, M. Y. (2022). Measuring change during intervention using norm-referenced, standardized measures: A comparison of raw scores, standard scores, age equivalents, and growth scale values from the preschool language scales-fifth edition. Journal of Speech, Language, and Hearing Research, 65(11), 4268–4279.
37. Laher, S., & Cockcroft, K. (2017). Moving from culturally biased to culturally responsive assessment practices in low-resource, multicultural settings. Professional Psychology: Research and Practice, 48(2), 115–121.
38. Lakhlifi, C., Lejeune, F.-X., Rouault, M., Khamassi, M., & Rohaut, B. (2023). Illusion of knowledge in statistics among clinicians: Evaluating the alignment between objective accuracy and subjective confidence, an online survey. Cognitive Research: Principles and Implications, 8(1), 23.
39. Lambert, M. J., & Ogles, B. M. (2009). Using clinical significance in psychotherapy outcome research: The need for a common procedure and validity data. Psychotherapy Research, 19, 493–501.
40. Learning Disability Association of Alberta. (2011). Reading readiness screening tool v7.5. Learning Disability Association of Alberta.
41. Lervåg, A., & Hulme, C. (2010). Predicting the growth of early spelling skills: Are there heterogeneous developmental trajectories? Scientific Studies of Reading, 14(6), 485–513.
42. Levine, A. J., Hinkin, C. H., Miller, E. N., Becker, J. T., Selnes, O. A., & Cohen, B. A. (2007). The generalizability of neurocognitive test/retest data derived from a nonclinical sample for detecting change among two HIV+ cohorts. Journal of Clinical and Experimental Neuropsychology, 29(6), 669–678.
43. Little, R. J. A. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83(404), 1198–1202.
44. Luwel, K., Foustana, A., Papadatos, Y., & Verschaffel, L. (2011). The role of intelligence and feedback in children’s strategy competence. Journal of Experimental Child Psychology, 108(1), 61–76.
45. Maassen, G. H., Bossema, E., & Brand, N. (2009). Reliable change and practice effects: Outcomes of various indices compared. Journal of Clinical and Experimental Neuropsychology, 31(3), 339–352.
46. McAleavey, A. A. (2024). When (not) to rely on the Reliable Change Index: A critical appraisal and alternatives to consider in clinical psychology. Clinical Psychology, 31(3), 351–366.
47. McGlinchey, J. B., Atkins, D. C., & Jacobson, N. S. (2002). Clinical significance methods: Which one to use and how useful are they? Behavior Therapy, 33, 529–550.
48. McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276–282.
49. McSweeny, A. J., Naugle, R. I., Chelune, G. J., & Luders, H. (1993). “T Scores for Change”: An illustration of a regression approach to depicting change in clinical neuropsychology. Clinical Neuropsychologist, 7, 300–312.
50. Miller, G. A., & Chapman, J. P. (2001). Misunderstanding analysis of covariance. Journal of Abnormal Psychology, 110(1), 40–48.
51. Nosworthy, N., Bugden, S., Archibald, L., Evans, B., & Ansari, D. (2013). A two-minute paper-and-pencil test of symbolic and nonsymbolic numerical magnitude processing explains variability in primary school children’s arithmetic competence. PLoS ONE, 8(7), e67918.
52. Ouimet, L. A., Stewart, A., Collins, B., Schindler, D., & Bielajew, C. (2009). Measuring neuropsychological change following breast cancer treatment: An analysis of statistical models. Journal of Clinical and Experimental Neuropsychology, 31(1), 73–89.
53. Page, P. (2014). Beyond statistical significance: Clinical interpretation of rehabilitation research literature. International Journal of Sports Physical Therapy, 9(5), 726–736.
54. Paradis, J., Schneider, P., & Duncan, T. S. (2013). Discriminating children with language impairment among English-language learners from diverse first-language backgrounds. Journal of Speech, Language, and Hearing Research, 56(3), 971–981.
55. Payne, R. W., & Jones, H. G. (1957). Statistics for the investigation of individual cases. Journal of Clinical Psychology, 13, 115–121.
56. Pernet, C. (2015). Null hypothesis significance testing: A short tutorial. F1000Research, 4, 621.
57. Pham, T., Joanisse, M. F., Ansari, D., Oram Cardy, J., Stager, C., & Archibald, L. M. D. (2025). Early cognitive predictors of language, reading, and mathematics outcomes in the primary grades. Early Childhood Research Quarterly, 70, 187–198.
58. R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. Available online: https://www.R-project.org/ (accessed on 21 October 2025).
59. Redmond, S. M. (2005). Differentiating SLI from ADHD using children’s sentence recall and production of past tense morphology. Clinical Linguistics & Phonetics, 19(2), 109–127.
60. Ronk, F. R., Hooke, G. R., & Page, A. C. (2016). Validity of clinically significant change classifications yielded by Jacobson-Truax and Hageman-Arrindell methods. BMC Psychiatry, 16, 187.
61. Sainani, K. L. (2009). The problem of multiple testing. PM & R, 1(12), 1098–1103.
62. Spaulding, T. J., Swartwout Szulga, M., & Figueroa, C. (2012). Using norm-referenced tests to determine severity of language impairment in children: Disconnect between U.S. policy makers and test developers. Language, Speech, and Hearing Services in Schools, 43(2), 176–190.
63. Stormont, M., Herman, K. C., Reinke, W. M., King, K. R., & Owens, S. (2015). The kindergarten academic and behavior readiness screener: The utility of single-item teacher ratings of kindergarten readiness. School Psychology Quarterly, 30(2), 212–228.
64. Torgesen, J. K., Alexander, A., Wagner, R., Rashotte, C., Voeller, K., & Conway, T. (2001). Intensive remedial instruction for children with severe reading disabilities: Immediate and long-term outcomes from two instructional approaches. Journal of Learning Disabilities, 34, 33–58.
65. Torgo, L. (2016). Data Mining with R, learning with case studies (2nd ed.). Chapman and Hall/CRC. Available online: http://ltorgo.github.io/DMwR2 (accessed on 21 October 2025).
66. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., & Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6), 520–525.
67. Unicomb, R., Colyvas, K., Harrison, E., & Hewat, S. (2015). Assessment of reliable change using 95% credible intervals for the differences in proportions: A statistical analysis for case-study methodology. Journal of Speech, Language, and Hearing Research, 24(2), 1–14.
68. Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Harvard University Press.
69. Wagenmakers, E. J., Love, J., Marsman, M., Jamil, T., Ly, A., Verhagen, J., Selker, R., Gronau, Q. F., Dropmann, D., Boutin, B., Meerhoff, F., Knight, P., Raj, A., van Kesteren, E., van Doorn, J., Šmíra, M., Epskamp, S., Etz, A., Matzke, D., … Morey, R. D. (2018). Bayesian inference for psychology. Part II: Example applications with JASP. Psychonomic Bulletin and Review, 25, 58–76.
70. Wiig, E. H., Semel, E., & Secord, W. A. (2013). Clinical evaluation of language fundamentals (5th ed.). Pearson.
71. Zahra, D., Hedge, C., Pesola, F., & Burr, S. (2016). Accounting for test reliability in student progression: The reliable change index. Medical Education, 50(7), 738–745.
Figure 1. Selective sample of the possible profiles of change (n = 5) using the school screening composite. Note. To illustrate the different classifications, we selected one profile from each of the three categories to discuss in detail. The following summary data, based on z-scores, were used across all profiles: M1 = 1.12 × 10⁻¹⁵; M2 = 3.57 × 10⁻¹⁶; SD1 = 5.66; SD2 = 5.35; rxy = 0.80; SEMdiff = 3.45; RCI coefficient = +5.67. Refer to Table 1 for the formulas; colours correspond to Table 3. (i) Improved over time (filled triangle, dashed line): An individual with x1 = −9.50 and x2 = −1.08 obtained RCI = 2.44, SID = 2.43, and SRB = 2.29, all of which fall above the cut-off of +1.645, indicating a reliable change. The reliable change can also be seen in the 8.42-point increase from the initial to the final score, which exceeds the RCI coefficient of +5.67. Note that this participant’s final score was at the 14.06th percentile (below the 25th percentile); hence, no change was registered by the normalization method, despite the growth methods indicating otherwise. Other participants fitting this profile are shown as the dashed filled circle and dashed filled square. (ii) No reliable change (square cross, solid line): An individual with x1 = 6.04 and x2 = 8.29 was classified with this outcome, with RCI = 0.65, SID = 0.65, and SRB = 0.75; their 2.25-point increase from the initial to the final score falls below the required RCI coefficient. (iii) Worsened over time (plus, solid line): An individual with x1 = 4.03 and x2 = −2.44 worsened over time, with RCI = −1.88, SID = −1.87, and SRB = −1.81. RCI = reliable change index; SID = standardized individual difference; SRB = standardized regression-based formula.
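As a quick arithmetic check, the indices reported for profile (i) can be reproduced from the caption's summary values (a sketch using the rounded composite SDs from Table 2; variable names are ours):

```python
# Reproducing Figure 1, profile (i), from the published summary values.
x1, x2 = -9.50, -1.08
sem_diff = 3.45              # SEMdiff reported in the caption
b = 5.35 / 5.66              # SRB slope = sd2/sd1 (composite SDs, Table 2)

rci = (x2 - x1) / sem_diff        # means of the z-score composite are ~0
srb = (x2 - b * x1) / sem_diff    # intercept is ~0 for the same reason

print(round(rci, 2), round(srb, 2))  # -> 2.44 2.29, matching the caption
```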
Table 1. Individual-level change indices and their formulas.
Individual change score | Formula
Normalization | x1 < 16th percentile and x2 > 25th percentile
Reliable change index (RCI) controlling for practice effects | RCI = [(x2 − x1) − (M2 − M1)] / SEMdiff, where SEMdiff = √[(sd1² + sd2²)(1 − r12)]
RCI coefficient (i.e., minimally detectable change) controlling for practice effects | (1.645 × SEMdiff) + (M2 − M1)
Standardized individual difference (SID) | SID = [(x2 − x1) − (M2 − M1)] / sddiff
Estimated standardized regression-based (SRB) method | SRB = (x2 − x2′) / SEMdiff, where x2′ = bx1 + c, b = sd2/sd1, c = M2 − (b × M1), and SEMdiff is as defined above
Note. x1 = score at time 1; x2 = score at time 2; SEMdiff = standard error of measurement of the difference score; sd1 = standard deviation at time 1; sd2 = standard deviation at time 2; sddiff = standard deviation of the difference scores; r12 = test–retest reliability; x2′ = predicted score at time 2 based on the regression model; b = slope of the regression model from summary data; c = intercept of the regression model from summary data; M1 = mean at time 1; M2 = mean at time 2.
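To make the Table 1 formulas concrete, the following sketch implements each index from summary data. It is our own illustrative translation (the Summary container and function names are ours), not code from the study:

```python
import math
from dataclasses import dataclass

@dataclass
class Summary:
    """Summary data required by the Table 1 formulas."""
    m1: float       # mean at time 1
    m2: float       # mean at time 2
    sd1: float      # standard deviation at time 1
    sd2: float      # standard deviation at time 2
    r12: float      # test-retest reliability
    sd_diff: float  # standard deviation of the difference scores (SID only)

def sem_diff(s: Summary) -> float:
    """Standard error of measurement of the difference score."""
    return math.sqrt((s.sd1 ** 2 + s.sd2 ** 2) * (1 - s.r12))

def rci(x1: float, x2: float, s: Summary) -> float:
    """Reliable change index controlling for practice effects."""
    return ((x2 - x1) - (s.m2 - s.m1)) / sem_diff(s)

def rci_coefficient(s: Summary, z: float = 1.645) -> float:
    """Minimally detectable change controlling for practice effects."""
    return z * sem_diff(s) + (s.m2 - s.m1)

def sid(x1: float, x2: float, s: Summary) -> float:
    """Standardized individual difference."""
    return ((x2 - x1) - (s.m2 - s.m1)) / s.sd_diff

def srb(x1: float, x2: float, s: Summary) -> float:
    """Estimated standardized regression-based change score."""
    b = s.sd2 / s.sd1    # slope from summary data
    c = s.m2 - b * s.m1  # intercept from summary data
    return (x2 - (b * x1 + c)) / sem_diff(s)

def normalized(x1_percentile: float, x2_percentile: float) -> bool:
    """Normalization: < 16th percentile initially, > 25th at follow-up."""
    return x1_percentile < 16 and x2_percentile > 25

# Example: school screening composite summary values (Table 2), with
# sd_diff computed from the same rounded SDs.
s = Summary(m1=0.0, m2=0.0, sd1=5.66, sd2=5.35, r12=0.80,
            sd_diff=math.sqrt(5.66**2 + 5.35**2 - 2 * 0.80 * 5.66 * 5.35))
print(round(rci(-9.50, -1.08, s), 2), round(sid(-9.50, -1.08, s), 2))
# -> 2.42 2.41 (Figure 1 reports 2.44 and 2.43 from unrounded values)
```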
Table 2. Descriptive statistics for initial and final raw scores across experimental tasks for the full sample and the subgroup of low scorers; school screening composite (z-score).
Measure | Full sample, Time 1: M (SD) | Full sample, Time 2: M (SD) | Low scorers (n = 54), Time 1: M (SD) | Low scorers, Time 2: M (SD) | Maximum score
Subtest:
Vocabulary test | 8.29 (1.41) | 8.99 (1.077) | 7.074 (1.69) | 8.30 (1.56) | 10
Sentence recall | 10.55 (7.37) | 17.49 (7.09) | 3.59 (4.55) | 10.41 (7.30) | 32
Letter-sound knowledge | 18.65 (7.40) | 23.82 (3.38) | 6.42 (6.46) | 19.22 (5.56) | 26
Rapid color naming | 81.74 (23.70) | 67.17 (20.69) | 98.62 (29.08) | 83.074 (30.94) | No max time
Inverse rapid color naming | 13.17 (3.48) | 15.98 (3.95) | 10.97 (3.11) | 13.27 (3.91) | Highest = 24.39
Phonological awareness | 21.09 (7.12) | 27.86 (5.62) | 10.93 (5.74) | 20.23 (7.80) | 33
Number name | 8.62 (2.44) | 9.82 (0.90) | 5.072 (3.32) | 9.41 (1.95) | 10
Number line estimation | 0.18 (0.09) | 0.13 (0.06) | 0.27 (0.09) | 0.17 (0.07) | –
Inverse number line estimation | 7.28 (4.01) | 8.97 (3.64) | 4.056 (1.17) | 7.16 (3.15) | Highest = 27.35
Magnitude comparison | 0.88 (0.15) | 0.94 (0.07) | 0.76 (0.17) | 0.91 (0.13) | 1
Arithmetic skills | 1.44 (1.73) | 3.77 (2.83) | 0.46 (1.39) | 2.0 (2.24) | 10
School screening composite (z-score) | 1.12 × 10⁻¹⁵ (5.66) | 3.57 × 10⁻¹⁶ (5.35) | −6.35 (4.24) | −4.54 (5.95) | –
Table 3. Number of individuals being classified within the different profiles of change based on the three methods.
Sample size | Group | Normalization | RCI/SID | SRB | Profile | Classification
1 | Normal | – | Change | Change | RCI & SRB | Improved
3 | Normal | – | Negative | Negative | Negative RCI & SRB | Worsened
99 | Normal | – | No change | No change | No reliable change | No reliable change
2 | Low | Change | Change | Change | All change | Improved
4 | Low | No change | Change | Change | RCI & SRB change | Improved
11 | Low | Change | No change | No change | Normalization only | Improved
2 | Low | No change | Negative | Negative | Negative RCI & SRB | Worsened
1 | Low | No change | No change | Negative | Negative SRB | Worsened
34 | Low | No change | No change | No change | No reliable change | No reliable change
Note. The classification is as follows: (i) Improved over time includes a positive reliable change on both the RCI and SRB, or on either one (green font). (ii) No reliable change means no reliable change on either the RCI or SRB (yellow font). (iii) Worsened over time includes a negative reliable change on both the RCI and SRB, or on either one; note that even if raw scores increase from time 1 to time 2, the change can still be classified as negative (below the −1.645 benchmark) (red font). See the online article for the color-coded version of this table. This table should be read as follows, using the first row as an example: 1 individual who scored within normal limits (i.e., not in the low-scoring subgroup) made a positive reliable change according to both the RCI and SRB methods and was therefore classified as improved over time. RCI = reliable change index; SID = standardized individual difference; SRB = standardized regression-based formula; – = the normalization method is not applicable for those within normal limits.
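The classification rule described in the note can be expressed compactly; the sketch below (our own naming) combines the RCI and SRB outcomes as in Table 3:

```python
def classify_change(rci_value: float, srb_value: float,
                    cutoff: float = 1.645) -> str:
    """Combine RCI and SRB outcomes per the Table 3 classification.

    A mixed profile (reliably positive on one index and reliably
    negative on the other) did not occur in our data and is not
    handled here.
    """
    if rci_value > cutoff or srb_value > cutoff:
        return "improved over time"
    if rci_value < -cutoff or srb_value < -cutoff:
        return "worsened over time"
    return "no reliable change"

# Example: Figure 1, profile (iii), where both indices fall below -1.645.
print(classify_change(-1.88, -1.81))  # -> "worsened over time"
```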