Article

Assessing Differentiation in All Phases of Teaching (ADAPT): Properties and Quality of the ADAPT Instrument

1 Section of Teacher Professional Development, University of Twente, Postbus 217, 7500 AE Enschede, The Netherlands
2 Knowledge and Innovation Centre, Hogeschool KPZ, Ten Oeverstraat 68, 8012 EW Zwolle, The Netherlands
3 Team Kwaliteit van Leraren, HAN University of Applied Sciences, Postbus 5375, 6802 EJ Arnhem, The Netherlands
4 Section on Cognition, Data and Education, University of Twente, Postbus 217, 7500 AE Enschede, The Netherlands
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(11), 1530; https://doi.org/10.3390/educsci15111530
Submission received: 22 May 2025 / Revised: 23 October 2025 / Accepted: 7 November 2025 / Published: 13 November 2025
(This article belongs to the Special Issue Recent Advances in Measuring Teaching Quality)

Abstract

Existing observation instruments to measure differentiated instruction often lack insight into the degree to which teachers’ decisions match the actual needs of their students, and neglect the importance of preparation and evaluation. This article describes the psychometric evaluation of a comprehensive instrument (Assessing Differentiation in All Phases of Teaching; ADAPT) that does not suffer from these shortcomings. To assess its quality, 41 raters used it to score videos of lessons and interviews of 86 primary school teachers. A 5-dimensional item-response model showed good fit and high internal consistency, and a decision study was conducted to determine the reliability and agreement coefficients for different numbers of raters. For the intended low-stakes use, a single rater would be enough to provide a reliable estimate of a teacher’s overall score. Finally, rater experiences showed that ADAPT has high practical value due to the comprehensive manual and detailed score descriptions and examples. The instrument can therefore not only be used for research purposes, but can also serve as a valuable resource for teachers and teacher educators in practice.

1. Introduction

Adapting teaching to students’ needs is considered an important but complex teaching skill. It requires deliberate decision-making that is based on pedagogical and content knowledge as well as an ongoing evaluation of students’ needs (Eysink & Schildkamp, 2021; Moon, 2005; van Geel et al., 2019). Teachers need to be able to assess and address students’ needs within the dynamic complexity of a classroom situation. This dynamic complexity makes it challenging to capture whether or not the teacher has succeeded in meeting their students’ needs.
In their review on differentiation, Graham et al. (2021) concluded that there was a tendency in the included studies “to use weak forms of survey methodology [and] researcher-developed instruments with no clear theoretical or empirical foundation” (p. 191), with a narrow focus on “whether and how often teachers differentiate and what they do when differentiating” (p. 190). The authors therefore recommended investigating the planning and enactment of evidence-based practices that teachers can use to meet the needs of all learners in heterogeneous classrooms, and the use of “rigorous mixed-method research designs capable of assessing the adequacy of those practices for meeting the full range of individual learning needs” (Graham et al., 2021, p. 192). Similar conclusions were drawn by van Geel et al. (2019), based on their inventory of existing instruments to measure differentiated instruction. Existing instruments, ranging from classroom observation schemes to teacher self-reported practices and student perception questionnaires, were mainly focused on identifying whether or not teachers apply specific differentiation strategies. The degree to which teachers take students’ needs into account in making educational decisions and whether those decisions match the needs of their students are mostly neglected (van Geel et al., 2019). For example, in some situations, teachers may consider whole-class instruction best suited for their students and the lesson goals, while in other instances, teachers might apparently differentiate by using fixed ability groups to provide additional instruction, without taking students’ actual needs into account. This illustrates that merely checking whether a teacher implements small-group instruction (a prevalent differentiation strategy) does not guarantee education that matches students’ needs (M. I. Deunk et al., 2018). The quality of differentiated instruction rather depends on the degree to which teachers’ decisions match the needs of students (Keuning & van Geel, 2021).
Given this understanding of what high-quality differentiated instruction entails, how can it be measured? Making informed statements about (aspects of) teaching quality requires accurate methods to capture this construct (Daltoé, 2024). As empirically identified in the cognitive task analysis by van Geel et al. (2019) and highlighted by, for example, Graham et al. (2021) and Letzel-Alt and Pozas (2023), planning and evaluation are essential for differentiated instruction; differentiation during the lesson cannot be separated from the phases prior to and after classroom teaching. As most currently available instruments lack this perspective and are merely focused on what teachers do, instead of on why teachers make certain decisions that take their students’ needs into account, we set out to develop a comprehensive instrument to fully capture differentiation (Keuning et al., 2022). In this article, we describe and discuss the psychometric evaluation of the developed instrument, which combines a classroom observation with an interview with the teacher so as to gain insight into and assess the teacher’s choices related to differentiation in all four phases of teaching: lesson series preparation, individual lesson preparation, actual teaching, and evaluation.

1.1. Differentiation

Differentiation, the focus of the instrument evaluated in this study, typically refers to the adaptation of instruction to students’ diverse educational needs (e.g., M. Deunk et al., 2015). Often, the primary focus is on needs based on students’ current level of knowledge and skills, although teachers can take a variety of student characteristics and differences into account (Tomlinson et al., 2003). This idea is not limited to literature on differentiation. Other interrelated and partially complementary approaches to adapting education to students’ needs are formative assessment (e.g., Black & Wiliam, 2018; Eysink & Schildkamp, 2021), universal design for learning (e.g., Griful-Freixenet et al., 2020; Cai et al., 2024), and adaptive teaching (e.g., Corno, 2008; van Geel et al., 2023). In each of these approaches, the importance of ongoing, systematic, goal-oriented monitoring of student progress, understanding and needs is stressed, as is the significance of adaptations based on these insights. Roy et al. (2013, p. 1187) defined differentiation as “an approach by which teaching is varied and adapted to match students’ abilities using systematic procedures for academic progress monitoring and data-based decision making.” The literature on differentiation combines research on inclusive education with research on education for gifted students, as it is aimed at addressing the variety of students in heterogeneous classrooms (Ardenlid et al., 2025). For making these adaptations during teaching in the classroom, planning is considered essential (e.g., Graham et al., 2021; Letzel-Alt & Pozas, 2023; Prast et al., 2015). Smale-Jacobse et al. (2019) and van Geel et al. (2019) identified four interrelated phases in which teachers prepare, apply and evaluate differentiation: (1) preparing a lesson series, (2) preparing an individual lesson, (3) actual teaching of the lesson, and (4) evaluation. Successful implementation requires a broad range of skills and underlying knowledge (van Geel et al., 2019), while instructional decisions can be guided by five underlying principles for differentiation: (1) strong goal orientation, (2) continuous monitoring of students’ progress and understanding, (3) challenging all students, (4) adapting instruction and exercises to match students’ needs, and (5) stimulating students’ self-regulation (Keuning & van Geel, 2021).
In the current study, the term ‘differentiation’ is used for providing education that matches the needs of all students in a heterogeneous classroom, based on the deliberate application of the five principles for differentiation during all four phases of differentiated instruction. This implies that differentiation can be recognized during the various phases by the implementation of a wide variety of instructional practices that aim to meet the needs of all learners in heterogeneous classrooms. The level of differentiation depends on the degree to which teachers make deliberate, goal-oriented decisions to support all students’ learning and ensure appropriate challenge and guidance, where students are also given an active role in the decision-making process regarding their own learning. This operationalization aligns with M. I. Deunk et al. (2018) who stated: “(t)o understand the effects of differentiation, it is important to use an ecologically valid operationalization, which is inherently somewhat fuzzy, because differentiation is an educational approach in which multiple practices are combined” (p. 44).

1.2. Assessing Differentiation

Measuring teachers’ differentiation could accomplish various goals. It is important for research purposes, such as evaluating the effectiveness of professional development activities aimed at improving differentiation skills. For example, Langelaan et al. (2024) sought to identify key elements for successful teacher preparation and development for differentiated instruction, and concluded that there is a “need for a more comprehensive approach to evaluating the effectiveness of [differentiated instruction] teacher training programs” (p. 11). Given the limited empirical evidence with regard to effects of differentiation on students’ achievement (e.g., M. I. Deunk et al., 2018), the development of a reliable and valid instrument that captures differentiation as a comprehensive approach to teaching would also be beneficial for empirical research into the effectiveness of (professional development for) differentiation. Furthermore, such an instrument could be used to support improvement of teachers’ differentiation. An informative instrument could provide insight into the development of differentiation skills, and could help teachers identify their own zone of proximal development, which would allow them to focus their professional development or coaching on the most appropriate skills at a given time.
In an analysis of various existing instruments for measuring the quality of differentiation in practice, van Geel et al. (2019) included instruments ranging from teacher self-reported practice to student questionnaires, and from classroom observation instruments to teacher responses to vignettes. Six overarching categories of skills that teachers need to adapt their teaching to students’ needs were identified, three of which related to differentiation prior to teaching (curriculum, identifying instructional needs, and setting challenging goals) and three during teaching (monitoring and diagnosing student progress, adapting instruction and activities, and general teaching quality dimensions). The authors further concluded that “the match between students’ needs and the adaptation is crucial to the real quality of the adaptation. However, items assessing this match explicitly are lacking” (van Geel et al., 2019, p. 53).
The ability to make informed statements about teaching quality constructs requires accurate methods for capturing these constructs (Daltoé, 2024). As observations by external observers are generally considered “the gold standard for assessing teaching quality” (Helmke, 2009, as cited in Daltoé, 2024, p. 12), another difficulty arises. To assess differentiation, observation of preparation and evaluation as well as in-class teaching is essential. Luoto and Selling (2021) argued that insight into a teacher’s intentions and lesson goals can provide a deeper understanding of instructional decisions and implementation of strategies, and complements findings from classroom observation. This implies that existing classroom observation instruments are not suitable to assess the quality of differentiation in its full complexity, as they do not give observers access beyond observable behavior in the classroom.
In line with Parsons et al. (2018), van Geel et al. (2019) therefore recommended the development of a comprehensive instrument to identify the degree to which teachers take the needs of their students into account in instructional decision-making, in class as well as during the preparation and evaluation phases.

1.3. Purpose of the Study

In the current study, such a comprehensive instrument was developed: Assessing Differentiation in All Phases of Teaching (ADAPT, Keuning et al., 2022), where scores are assigned based on a classroom observation of a lesson combined with an interview with the teacher after the lesson. This instrument is described in detail in Section 2.1.
The primary aim of the current study is to examine the psychometric properties of the ADAPT instrument. In addition, the practical feasibility of using such a comprehensive instrument will be assessed, as combining a classroom observation with an interview with the teacher may require much time and effort from raters. To achieve this goal, the study addresses the following research questions:
1. What support is found for the construct validity of ADAPT?
This study aims to investigate the extent to which the ADAPT instrument effectively measures the construct of differentiation. Construct validity is investigated by assessing whether the responses to the items on the measurement instrument correspond with a latent variable model reflecting the instrument’s structure (outlined below). In other words, does an item response theory (IRT) model reflecting the latent structure fit the data?
2. How reliably do scores on the ADAPT instrument measure the quality of differentiated instruction?
In this context, reliability refers to the consistency and stability of the instrument’s results under similar conditions. Reliability is investigated in the framework of the latent variable model referred to in the first research question, enhanced with a generalizability model. Additionally, the study seeks to determine the number of observers or raters needed to obtain reliable results. Note that the inferences regarding the reliability of the model are only valid if it is shown that the model holds.
3. How did raters experience the training and rating process?
The labor-intensive nature of the rating process raises the question of whether it is feasible and beneficial for raters to use the ADAPT instrument in this manner. Therefore, this study explores the experiences of the raters, and their suggestions for future use of the instrument.
By addressing these research questions, the study aims to provide empirical evidence regarding both the psychometric quality and practical feasibility of the ADAPT instrument.

2. Materials and Methods

2.1. The ADAPT Instrument

We developed a comprehensive instrument: Assessing Differentiation in All Phases of Teaching (ADAPT, Keuning et al., 2022), that is scored based on a classroom observation of a lesson combined with a structured interview with the teacher after the lesson. By conducting this interview, information about cues related to preparation phases as well as in-class decision-making is gathered, enabling raters to gain a more complete understanding and therefore reach a more accurate rating (Daltoé, 2024; White et al., 2025).
To develop the ADAPT instrument, first, performance objectives (“What does mastery of this skill mean?”) were formulated for each skill identified in the cognitive task analysis (CTA; Keuning & van Geel, 2021; van Geel et al., 2019) of differentiated instruction, in consultation with expert teachers and external experts involved in the CTA. The first version of the ADAPT instrument consisted of 27 indicators, each with a brief, overall description. Instead of developing a score description that would represent theoretical excellence, a score of 4 was assigned to the expert behavior identified in the cognitive task analysis for each indicator. The behavior represented by a score of 4 therefore is a realistic representation of high-quality differentiation. Reasoning back from there, score descriptions for scores of 1, 2, and 3 were formulated based on examples from practice, where 1 and 2 represent insufficient mastery of the assessed skill for each indicator, and 3 and 4 show (medium- to expert-level) mastery. In focus group sessions, indicators were merged where deemed relevant, explanatory notes for the indicators were refined, and detailed score descriptions to reduce variability in indicator interpretation and to enable formative use of ADAPT were added. This version was studied with four raters, who all scored ADAPT for the same 35 teachers. The inter-rater reliability (pertaining to the correlation between raters) and the agreement (the degree to which the scores of raters agreed on the same observation) were quite good. However, the number of data points was too small to be able to judge the quality of the instrument. Based on raters’ experiences, the indicators in the instrument and the interview guidelines were further refined, leading to the current version (Keuning et al., 2022). In addition, based on recommendations by Bell et al. (2019), rating procedures were established “to ensure that raters are well trained and are able to use the rating scales accurately and reliably over time” (p. 5). These procedures on the one hand consisted of detailed score descriptions and scoring guidelines and rules, which were included in an extensive rater manual, and on the other hand included the development of a rater training procedure, giving raters the opportunity to familiarize themselves with the theoretical foundations, the meaning of all indicators, and the scoring rules, as well as the opportunity to practice by scoring videos of lessons and interviews.
The final version of ADAPT, as used in the current study, consists of 23 indicators, divided across the four phases: preparation of a lesson series (8 indicators), individual lesson preparation (6 indicators), actual teaching (8 indicators), and evaluation (1 indicator). Each indicator is scored on a 4-point scale, where the rater is instructed to select the score that best represents the teacher’s behavior or explanation. The scoring rubric consists of one page for each indicator, where an elaborate description is provided for each numerical score, in which the difference between behaviors appropriate for the current and next-lower score is highlighted. An explanation and examples from practice in various school subjects are also presented next to the score description. A translated overview of all indicators, including related principle(s) of differentiation, is provided in Table 1. ‘Not applicable’ can be selected for certain items (e.g., for Indicator 2.5, if a teacher states that there are no high-performing students for a specific lesson goal) and ‘not enough information’ can be selected for any item when insufficient information is available.
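For illustration, a completed ADAPT score form could be represented as a simple data record for later analysis. The sketch below is ours, with assumed names and types; it is not part of the published instrument or manual.

```python
from dataclasses import dataclass, field

# Minimal sketch (assumed names and types; not part of the published ADAPT manual)
# of how a completed score form might be stored for analysis: 23 indicators across
# four phases, each scored 1-4, or marked 'NA' (not applicable) / 'NEI' (not enough
# information), together with the rater's written rationale per indicator.
@dataclass
class AdaptScoreForm:
    teacher_id: str
    rater_id: str
    scores: dict = field(default_factory=dict)      # e.g., {"1.1": 3, "2.5": "NA", "3.8": "NEI"}
    rationales: dict = field(default_factory=dict)  # raters give a reason for each assigned score
```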

2.2. Phase 1: Recordings of Lessons and Interviews

To identify the quality of the ADAPT instrument, it was necessary to have access to classroom lessons and interviews with teachers. Although the instrument was developed to be used in practice, with an in-class observation and in-person interview afterwards, for the sake of this study we needed to have multiple raters scoring the same lesson and interview. Therefore, video recordings were made for a convenience sample of 86 Dutch primary school teachers. The only inclusion criteria that were applied were (a) that the participating teachers should teach in Grades 1 to 6 (children aged 6 to 12 years), in either a single-grade or multi-grade class, and (b) that teaching was not fully individualized. These teachers were recruited through various professional networks of the first three authors. Teachers applied voluntarily, and were asked whether colleagues would also be willing to participate to expand the initial sample. Teachers as well as parents/guardians of all children in participating teachers’ classrooms provided active informed consent for using the videos for this research. Ethical permission was requested and granted by the BMS ethics committee of the University of Twente. In total, 86 primary school teachers (teaching in single and multigrade groups, from grades 1 to 6; see Table 2 for descriptives) participated in this first phase of data collection. The composition of this sample is rather representative of the entire population of Dutch primary school teachers.
For each teacher, a regular mathematics lesson (lasting 45–60 min) was observed and simultaneously video-recorded, the teacher was interviewed later on the same day, and the interview was recorded as well. As the interview is partially based on the teaching during the lesson, recordings were made by trained assessors (student teachers, students in Educational Sciences, and researchers) who also conducted the interviews afterwards. These assessors were instructed not to interfere with the lesson, not to interact with students, and to stay in their position at the back of the classroom during the entire lesson. For the interview, the interview protocol from the ADAPT manual was followed, ensuring consistency and preventing bias from observers. Interviews lasted 35 to 50 min. The assessors who made the recordings and conducted the interviews did not assign scores to those teachers for the research we describe in this paper, as we needed the observation conditions to be similar across all ratings, i.e., based on videos.

2.3. Phase 2: Rater Training & Procedure

In the second phase, 41 trained raters used the ADAPT instrument to score the video recordings from Phase 1. All raters provided active informed consent for using their assigned scores and their evaluation of the ADAPT instrument, and signed a declaration of confidentiality to safeguard the privacy of the participating teachers and students in the videos the raters would watch. Ethical permission was requested and granted by the BMS ethics committee of the University of Twente.
A total of 45 raters participated in this phase of data collection. Besides the research team (first three authors and a research assistant), five student teachers participated as part of their study research project, and 36 volunteers (teachers, student teachers, teacher educators and academic coaches) were recruited via various networks of the first three authors, mainly via email and social media. Interested volunteers were informed about the time investment (approximately 15 h for training, and 2 h per teacher they would rate for data collection) and expectations (follow training, score videos of at least 5 teachers). Volunteers were only eligible for participation if they had a background in primary education as a teacher or other professional. All raters were trained in using the ADAPT instrument by the first three authors of this paper. The training consisted of: watching an introductory video in which theoretical foundations of ADAPT were explained, reading the ADAPT manual (including all indicators, score descriptions and scoring guidelines), and discussing all indicators during a meeting. Next, there were two rounds of individually watching recordings of a teacher (lesson and interview), scoring all ADAPT indicators for this teacher, and jointly comparing and discussing similarities and differences in assigned scores. Since the future application of the instrument is not intended to be in a high-stakes situation, the raters were required to reach 80% judgement agreement on the second recording before they could start scoring ADAPT for the purpose of this study. This judgement agreement was computed by identifying whether raters agreed on a score of insufficient (1 or 2) or sufficient (3 or 4) as compared to the master score (consensus score by the first three authors) for ≥80% of the indicators. Four volunteers did not meet this criterion and could therefore not participate in the rating of teachers for the purpose of this study, resulting in 41 participating raters (32 volunteers, 5 students, 4 members of the research team). The entire training and rating took place online and raters spent approximately 15 h on the training phase.
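The certification criterion can be made concrete with a small sketch. The function below is ours (assumed data layout, not the authors' scripts); it dichotomizes scores into insufficient (1–2) versus sufficient (3–4) and computes the proportion of indicators on which a rater agrees with the master score.

```python
# Sketch of the 80% judgement-agreement criterion used to certify raters
# (assumed data layout; not the authors' scripts).
def judgement_agreement(rater_scores, master_scores):
    """Both arguments map indicator IDs to scores 1-4; indicators scored
    'not applicable' or 'not enough information' are assumed to be left out."""
    shared = [k for k in master_scores if k in rater_scores]
    hits = sum((rater_scores[k] >= 3) == (master_scores[k] >= 3) for k in shared)
    return hits / len(shared)

# A rater would be certified if judgement_agreement(rater, master) >= 0.80.
```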
Based on their indicated availability, each rater was assigned a number of teacher IDs, the videos for which had to be scored in that specific order. To rule out order effects, we ensured that each teacher ID appeared at the beginning, in the middle, and at the end of the lists of different raters. Raters were instructed not to pause and certainly not to rewind videos. They were required to watch both recordings on one day, preferably in one run, in the fixed order of first the lesson, then the interview. Furthermore, raters were required to give a reason for each score, and to always use the ADAPT manual when assigning scores. On average, raters scored the recordings of almost 10 teachers (range 5 to 15). In total, 399 ADAPT score forms were submitted. The recordings for each teacher were rated by four (31 teachers) or five (55 teachers) raters. After rating all assigned videos, all volunteer raters were asked to what degree they had followed the instructions and procedures (see the section below on the Rater Experience Questionnaire). Only one of the 16 responding raters indicated that they had not used the manual for all teachers. The majority of responding raters (68.8%) admitted to having paused the videos in some cases (with an estimated maximum of 30% of rated videos); the other five raters (31.2%) did not pause videos, but two of them indicated they would have liked to.

2.4. Rater Experience Questionnaire

To get insight into rater experiences, after completing all ratings, the 32 certified volunteers were invited to respond to an online questionnaire; 16 of them responded. In this questionnaire, raters were asked how they perceived scoring with ADAPT, what factors affected their ratings, whether they followed the prescribed rating procedures, and whether they had any additional comments. Responses to open-ended questions were analyzed using Atlas.ti.
In addition, an email was sent to the same 32 raters to ask for more qualitative input about their experiences and suggestions, e.g., to elaborate on the suggestions for future use. Ten raters replied to this request. The responses to the questionnaire were anonymous, so that data from questionnaires and emails could not be linked.

2.5. Data Analysis

The data were analyzed with a combination of an item-response model (IRT model; see, for instance, Lord, 1980) and a generalizability model (GT model; see, for instance, Brennan, 2001). The IRT model mapped the discrete item responses (the assigned scores) of the raters onto a continuous latent scale and the generalizability theory (GT) model separated the variance in these latent measurements into rater effects and the effects of the proficiency of the teachers. The technical details of the combined IRT and GT model are outlined in Appendix A, together with information on item fit and item reliability. Further details on the combined IRT and GT model can be found in Glas et al. (2024). Here we will only outline the conceptual arguments motivating the choice of the model and present the main results. Note that showing that the model fits the observed response data supports research question 1, regarding construct validity. Further, if the model is shown to fit, inferences regarding the reliability of the instrument can be validly made.

2.5.1. IRT Model

As shown in Table 1, each of the 23 items pertained to one or more principles of differentiation. Therefore, a special 5-dimensional IRT model was used, where each principle of differentiation was associated with one dimension of the 5-dimensional model, shown in Formula (A1) in Appendix A. This type of multidimensional IRT model is often referred to as a full-information factor-analysis model (see Bock et al., 1988). In Table 1, it can be seen that the factor-analysis model cannot have a simple structure where each item loads on one dimension only, given that an item could pertain to multiple principles of differentiation. So, the 5 dimensions cannot be associated with 5 subscales that could be administered individually. That is, the complete instrument must be administered as a whole.

2.5.2. GT Model

In Appendix A, the values of the five latent variables associated with the item responses are indicated as θijd, which is the assessment of teacher j as scored by rater i on dimension d, conditional on the item parameters. Besides the item parameters, the values of θijd also depend on the raters’ perspective. A GT model defined on the latent variables was used to assess these rater effects. The model decomposes θijd into a latent variable associated with the teacher’s level of performance, the rater effect and a residual. The GT model is shown in Formula (A2) in Appendix A. The model supports decomposition of the total variance θijd into teacher variance (the target of the measurement), rater variance, and the residual variance. Reliability and agreement coefficients for a rating can then be computed by dividing the explained variance of the teachers by the total variance of the latent variables.
The parameters of the combined IRT and GT model (see Glas et al., 2024) were concurrently estimated using a Bayesian approach with the WinBUGS software (version 1.4.3). Finally, we conducted a D-study (decision study; see Shavelson et al., 1989) to examine how the reliability and agreement coefficients change when the number of raters is increased or decreased.
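To make the logic of a D-study concrete, the sketch below shows how the agreement and reliability coefficients change when the number of raters varies. It is illustrative only: the actual estimation was done in WinBUGS, and the variance components below are placeholders, not estimates from this study.

```python
# Illustrative D-study sketch: given estimated variance components from the G-study,
# compute agreement and reliability coefficients for varying numbers of raters.
# The variance components below are placeholders, not estimates from this study.
var_teacher = 0.50  # variance of teacher proficiency (target of measurement)
var_rater = 0.10    # variance of rater effects
var_resid = 0.30    # residual variance

for n_raters in range(1, 6):
    # Agreement: rater variance counts as error, because the absolute level of scores matters.
    agreement = var_teacher / (var_teacher + (var_rater + var_resid) / n_raters)
    # Reliability: rater variance drops out, because averaging over raters shifts all
    # teachers' scores equally, so only the residual error affects their ordering.
    reliability = var_teacher / (var_teacher + var_resid / n_raters)
    print(f"{n_raters} rater(s): agreement = {agreement:.2f}, reliability = {reliability:.2f}")
```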

3. Results

3.1. Descriptives—Raw Scores

In total, 399 ADAPT score forms were completed, for 86 teachers. The 399 response patterns proved to be enough for stable estimates of the item parameters and the covariance matrix between the latent variables.
Assigned scores ranged from 1 to 4; raw score frequencies per indicator are presented in Table 3. Note that this overview includes multiple scores per teacher per indicator.
In Figure 1, the distribution of teachers’ average raw scores per indicator is displayed. Although we did not aim for a representative sample of teachers, this distribution of scores across teachers provides an overview of the variation in means and spread of scores on the different indicators.

3.2. The Structure of ADAPT’s Construct Validity

The development of ADAPT involved using the outcomes of a cognitive task analysis (CTA), including the identification of the essential skills required for differentiation, formulating performance objectives based on the identified skills, and seeking input from various experts (van Geel et al., 2019). This rigorous process was undertaken to ensure the face validity of the indicators, confirming that they effectively capture and represent the construct of differentiation being measured. The next step related to validity was two-fold: with the resulting 23 indicators, the intention was first to create separate scales for the five dimensions (associated with the principles of differentiation), and second to capture the multidimensional nature of differentiation within a unified construct. These indicators were expected to collectively assess the different phases and principles of differentiation as intended, providing a comprehensive evaluation of teachers’ differentiated instructional practices.

3.2.1. Descriptives—IRT Model

As stated previously, the latent variables related to the responses of a rater i of a teacher j regarding principle of differentiation d, θijd, were decomposed into latent variables associated with the teacher’s level of differentiation, the effect of the rater, and a residual. The main focus of the measurement, the differentiation performance of the teacher, consisted of five dimensions. As the latent variable model was shown to reflect the instrument’s structure, it can be concluded that the ADAPT instrument effectively measures the construct of differentiation. The covariance matrix and the correlation matrix of the five dimensions are displayed in Table 4.
Although this is not a standard cut-off value, correlations above 0.400 are bolded to enhance the readability of Table 4. The pattern reveals small to moderate correlations between the scores on the dimensions (principles) of goal orientation, monitoring, and adapting instruction. This is understandable from a conceptual perspective, as monitoring students’ progress and understanding related to the goals requires having a clear goal orientation, and adapting instruction to support students in reaching those goals generally requires insight into students’ progress and understanding. However, the correlations are not high enough to justify a unidimensional approach.
In Table 5, the factor loadings of the items on the five dimensions are presented. Note that we used a confirmatory model according to the structure defined in Table 1, so some of the loadings were structurally zero (these are left blank in Table 5). The displayed factor loadings indicate the extent to which an indicator depends on the latent variable associated with a principle of differentiation. The factor loadings show that some indicators load more strongly on one of their two or three associated principles (e.g., Indicator 1.1 on MO or 2.2 on AD), while two indicators show similar factor loadings on multiple factors (Indicator 1.6 on CH and AD, or 3.8 on GO and MO). As can be noted in Table 1 and Table 5, the principle of goal orientation (GO) is often combined with one or two other principles, where the factor loading is often higher for the additional factor(s). From a conceptual perspective, indicators appear to load highest on the conceptually most evident principle as identified in Table 1 (indicated by bold Xs), except for the two indicators (1.6 and 3.8) that show similar factor loadings across multiple factors.

3.2.2. Agreement and Reliability

Indices of agreement and reliability (generalizability indices) were concurrently computed with the estimates of the variance components of the combined IRT and GT model. Both indices are a ratio of the variance of the target of the assessment (in this case: the differentiation performance of the teachers), and the total relevant variance. Agreement pertains to the absolute ordering of the targets by the raters, while reliability pertains to the correlation between raters. It is assumed that the overall assessment score is the average over raters. For agreement, the total variance consists of the targets’ variances and all error variance that contributes to errors in the ordering of the targets, that is, the teachers’ scores. For reliability, the raters’ variance is deleted from the denominator of the variance ratio, because taking the average of the assessment over raters works out the same for all teachers.
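In symbols, for an assessment averaged over n raters, these two indices take the usual generalizability-theory form (a restatement of the description above, written with the variance components of Formula (A2) for the composite latent score, not an additional result):

$$
\text{agreement} = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \dfrac{\sigma_\omega^2 + \sigma_\varepsilon^2}{n}},
\qquad
\text{reliability} = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \dfrac{\sigma_\varepsilon^2}{n}}.
$$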
The estimated generalizability indices derived from the collected data are shown in Table 6 in the row labeled ‘4 raters’, which is the average of the number of raters used in the data collection design. The column labeled ‘ALL’ gives the estimates for the complete set of 23 items. Both the target variance and the error variance are computed as the inner product of their covariance matrices with a unit vector. The other five columns pertain to the agreement and reliability indices for the five unidimensional constituent dimensions.
Besides being an easy summary of the generalizability of the assessments, generalizability coefficients also support what is known as a decision study (D-study). Using the estimates of the variance components (obtained in the so-called G-study), a D-study can be used to estimate the number of raters needed to reach a target level of agreement or reliability by varying the number of raters in the generalizability indices. Usually, the criteria for agreement and reliability indices are 0.70 for research applications and 0.80 for high-stakes decisions on the individual level (see, for instance, Evers et al., 2009). Note that the current study is based on one lesson per teacher, and we can therefore not draw any conclusions regarding generalizability across lessons or contexts.
The results of the D-study are displayed in the rows of Table 6. For agreement, the number of raters required to meet the criteria is of course higher than for reliability, with a required number of five raters for high-stakes applications (which we do not endorse) where both agreement and reliability need to exceed 0.80. When applying the 0.70 criterion for research and low-stakes applications, three raters would be required for agreement, and one rater is expected to provide a reliable estimate for the overall assessment as well as for each of the individual principles. For the overall assessment, and for MO, CH and SR, the reliability for one rater even exceeds the value of 0.80. In fact, reliability indices for overall scores are well above 0.85, so the measurement quality of the instrument proved very high.

3.3. Rater Experiences

To gain insight into rater experiences, we administered a questionnaire that was completed by 16 raters. These raters were positive about scoring with the ADAPT instrument. When asked to assign a score from 1 (most negative) to 10 (most positive), raters graded scoring with ADAPT on average with a 7.8 (ranging from 7 to 9). At the end of the questionnaire, raters could include comments about their experiences. Raters indicated that they learned from all elements of their participation in this research project: the training, generally ‘working with the instrument’, and specifically rating teachers with ADAPT. The raters perceived the instrument as very clear, due to the detailed score descriptions and elaborate explanations and examples from practice.
Raters were asked to what degree several factors affected their ratings. Three factors were indicated by at least half of the raters to have had some degree of influence on their ratings: whether the interviewer followed the ADAPT interview guidelines (87.5%), the person who conducted the interview (62.5%), and the quality of the image in the video recordings (50% of the raters). Raters indicated that when the interviewer did not fully follow the ADAPT interview guidelines, sometimes (too) little information was available to score an indicator. According to most raters, factors such as the length of the recorded lesson or interview, the number of teachers rated in one day, or whether a teacher was the first or last on their list did not influence their ratings. However, four of the 16 raters indicated that rating became easier over time. One rater indicated:
With the first teachers, I had less experience in scoring, and during the interviews, I was less able to listen with a specific focus. After scoring several teachers, I became better at filtering what information is relevant for scoring during a lesson or interview, and I could listen more focused.
Besides their experiences with using ADAPT for research and evaluation purposes, raters identified additional potential uses of the ADAPT instrument. They noted that the manual and detailed score descriptions can guide self-assessment and reflection, and can support teachers in preparing a lesson and lesson series. Raters emphasized that ADAPT “prompts reflection”, encouraging teachers to consider how well they applied differentiation and the underlying principles in their teaching practice. They noted that, for instance, teachers can use the manual as a guide to assess their current differentiation level and identify and implement improvements. Additionally, the instrument acts as a conversation starter among colleagues, as it helps to facilitate discussions about “the deeper purpose of differentiation”.

4. Discussion

Three main themes that emerged from the findings will be discussed below. First, we will discuss lessons learned and directions for further research into the quality of the ADAPT instrument. Next, we describe potential uses of ADAPT in educational research. Finally, we propose various uses in educational practice.

4.1. Lessons Learned and Further Research into ADAPT

Creating an instrument that is founded in practice, based on expert insights, and refined through several iterations is a demanding and time-consuming process. However, crafting a comprehensive user manual including detailed score descriptions and rating quality procedures, following the suggestions by Bell et al. (2019), seems to have contributed to positive results for both reliability and rater experiences. We therefore strongly suggest that future developers adopt a similar approach, taking these guidelines by Bell et al. (2019) into account.
The ADAPT instrument, when used by trained raters, exhibited high internal consistency and reliability. Participating in an intensive training trajectory also requires time and effort from (potential) raters. Although the raters clearly appreciated the training and recognized its added value for becoming acquainted with the instrument, the instrument itself was perceived as very clear, due to the detailed score descriptions and accompanying information. We would therefore recommend a future investigation into the instrument’s reliability with untrained raters, as in practice not all raters may invest time and effort in following such training. Some raters in the current study also noted that rating became easier after three or four teachers. They also practiced with rating two teachers in the training phase. By conducting a study where untrained raters each score at least six videos (without receiving feedback on their relative rating accuracy), we could assess whether reliability would be adequate when the instrument is used by untrained raters. By comparing reliability for their first and last ratings, we could also identify whether the amount of practice would have an influence on reliability. This could lead to more detailed and substantiated recommendations about the necessary training intensity and practice frequency for reliable use.
The current study was only focused on the reliability of the instrument for assessing differentiation in one lesson. This study therefore does not provide us with information on generalizability of teacher scores to other lessons or contexts. A generalizability study would be recommended to gain deeper understanding of the degree to which scores based on a single lesson-interview combination could be extrapolated to other situations. Such a study would ideally encompass multiple lessons and lesson series per subject, with multiple subjects per teacher, for multiple teachers. These lessons should then each be rated by multiple raters. Regardless of the potential generalizability of ADAPT scores, we would like to stress that ADAPT should not be used to make high-stakes decisions about teachers. As Blikstad-Balas et al. (2021) argued, ethical tensions can arise when measuring the quality of teaching; those who try to measure the quality of teaching should be clear about how the data are expected to be used. We aimed to develop an instrument that could be used for research purposes, one that could possibly be used formatively to support improvement of teachers’ differentiation.
Finally, we began this article by arguing that differentiation goes beyond what can be observed in the classroom and that an interview is necessary to make preparation and evaluation, as well as decision-making during the lesson, accessible for raters (Daltoé, 2024). Based on this assumption, we expect that raters use the interview to get a more nuanced and comprehensive understanding of what happened during the lesson. Due to the current data collection set-up we were unable to compare initial and final scores for lesson performance indicators, and could therefore not study whether changes were made based on what was discussed in the interview. In future research, the degree to which scores on (specific) indicators regarding lesson performance are changed based on interview information could be explored, for example, by letting raters enter initial scores and providing the opportunity to change them after the interview.

4.2. Unlocking Possibilities: Educational Research with ADAPT

The availability of this comprehensive instrument to measure adaptive teaching provides opportunities for future research into differentiation. Although meta-analyses have shown that adapting instruction to students’ needs can lead to improved learning and achievement (Eysink & Schildkamp, 2021), previous studies have mainly focused on the impact of applying specific differentiation strategies, such as student grouping or the use of adaptive software, on student performance (M. I. Deunk et al., 2018). The ADAPT instrument could potentially (if its scores prove to be generalizable) be used to outline the relationship between differentiation and student achievement in greater detail. The instrument could furthermore be used to assess effects of intervention studies that aim at improving differentiation in practice. We also see potential in comparing ADAPT scores with students’ ratings, to identify to what degree students’ perceptions of differentiation are related to their teacher’s scores on ADAPT.

4.3. Practical Applications: Using ADAPT for Enhancing Differentiation

The ADAPT instrument holds promise not only as a means of assessing differentiation, but also as a tool for enhancing differentiation skills. However, we should also handle these opportunities with caution. As raters in this study highlighted, ADAPT can serve as a valuable resource for improving differentiation skills in practice due to its detailed information and comprehensive overview. The instrument has the potential to be applied in both pre-service and in-service education to develop differentiation competencies.
Based on suggestions from the raters in the current study, we identified various possibilities for using ADAPT in this way. The instrument and manual can serve as a guide or a step-by-step framework for lesson preparation, supporting both pre-service as well as in-service teachers in planning for providing differentiated instruction. Additionally, the instrument could be used as a self-assessment tool, allowing teachers to identify areas for improvement and monitor their progress. Finally, ADAPT can act as a conversation starter, promoting dialogues among teachers and fostering a community of practice focused on differentiation. The use of the manual and detailed score descriptions can facilitate the development of a common language and understanding (cf. Klette, 2023).
With regard to its use in practice, it should be noted that the option ‘not enough information’ was mainly added for use in research settings, where raters were required to assign scores based on previously recorded videos. This was needed, as some raters indicated that the degree to which the interview protocol was followed influenced their ability to assign scores to some of the indicators. For use in practice, we would therefore strongly recommend that raters make sure to gather enough information to be able to assign scores for all indicators.

5. Conclusions

In this study, we assessed the construct validity, reliability, and feasibility of use of the ADAPT instrument for assessing differentiated instruction. Performance objectives derived from the cognitive task analysis by van Geel et al. (2019) served as the starting point for the development of this instrument. Through multiple iterations, an instrument consisting of 23 indicators spread across four phases was created. Planning and evaluation are just as essential for differentiation as in-class decision-making. Therefore, the indicators are scored based on a combination of a classroom observation and an interview, making the ADAPT instrument one of a kind. In the current study, we used video recordings of lessons and interviews to assess the quality of the ADAPT instrument. The first research question aimed at investigating the extent to which a five-dimensional IRT model fitted the data. The intention of this model was to create separate scales for the five dimensions, and to capture the multidimensional nature of differentiation within a unified construct. This model, in which each principle of differentiation was associated with one of the five dimensions, showed good fit and therefore high internal consistency.
Research question two was focused on the reliability of the scores on the ADAPT instrument. By conducting a D-study, we demonstrated that trained raters could assign highly reliable scores using the ADAPT instrument, with reliability indices for overall scores well above 0.85, even based on one rater. However, it is important to note that our study was based on a single lesson per teacher, and we therefore cannot generalize about other lessons provided by the same teacher. Further research would be required for such claims. We strongly emphasize that ADAPT should not be used for high-stakes decisions about teachers.
In the final research question, the experiences of the raters, the feasibility and desirability of using ADAPT, and their suggestions for future use were addressed. Even though the use of ADAPT is rather labor-intensive, raters found it not only feasible, but also very informative to work with the ADAPT instrument. Notably, raters expressed interest in using the instrument themselves, not as an observation and interview tool for assessment and research purposes, but to enhance their own educational practices. From their suggestions, we conclude that ADAPT can serve as a guide for preparation, a self-assessment tool to determine areas for improvement, and a conversation starter.
To conclude, although our initial goal was the development of an instrument for measuring differentiation for research purposes, we learned that the instrument could also serve as a tool for improving it.

Author Contributions

Conceptualization, M.v.G., T.K. and M.D.; methodology, M.v.G., T.K., M.D. and C.G.; formal analysis, M.D. and C.G.; data curation, M.v.G., M.D. and C.G.; writing—original draft preparation, M.v.G., T.K., M.D. and C.G.; writing—review and editing, M.v.G. and C.G.; project administration, M.D.; funding acquisition, M.v.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by Nationaal Regieorgaan Onderwijsonderzoek (grant numbers 405-15-733 and 405-16830-018).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved on 16 November 2023 by the BMS Ethics Committee of the University of Twente, under protocol code 200990.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
CTA: Cognitive Task Analysis
DI: Differentiated Instruction
ADAPT: Assessing Differentiation in All Phases of Teaching (the developed instrument)

Appendix A

The original scores assigned by the raters ranged from 1 to 4. For the IRT analyses, the scores were recoded to 0 to 3. The items are labeled k = 1, …, K, the response categories are labeled m = 0, …, M, the teachers are labeled j = 1, …, J, the raters are labeled i = 1, …, I, and the dimensions are labeled d = 1, …, D. So in the present application, K = 23, M = 3, J = 86, I = 41 and D = 5. The recordings of each teacher were rated by four (31 teachers) or five (55 teachers) raters. In total, 399 ADAPT score forms were available for analysis. This proved to be sufficient to obtain stable estimates of the item parameters and the covariance matrix defined below. The item responses ‘not applicable’ and ‘not enough information’ were treated as missing values. Adding them as an extra item score category, probably for all 5 dimensions separately, would result in a 10-dimensional model without enough information for the extra item parameters. Further, it is unlikely that an item score such as ‘not applicable’ or ‘not enough information’ would contribute much information regarding a teacher’s level of differentiation.
The IRT Model for the responses is given by:
$$
P\!\left(U_{ijk} = m \mid \theta_{ij1}, \ldots, \theta_{ijD}\right)
= \frac{\exp\!\left(m \sum_{d=1}^{D} w_{kd}\,\alpha_{kd}\,\theta_{ijd} - \sum_{c=1}^{m} \delta_{kc}\right)}
{1 + \sum_{h=1}^{M} \exp\!\left(h \sum_{d=1}^{D} w_{kd}\,\alpha_{kd}\,\theta_{ijd} - \sum_{c=1}^{h} \delta_{kc}\right)}
\tag{A1}
$$

where
  • $U_{ijk}$: response of rater $i$ to teacher $j$ on item $k$
  • $\theta_{ijd}$: latent IRT variable related to dimension $d$
  • $w_{kd}$: equals 1 if item $k$ loads on dimension $d$, and 0 otherwise
  • $\alpha_{kd}$: loading of item $k$ on dimension $d$, if $w_{kd} = 1$
  • $\delta_{kc}$: location of category $c$ of item $k$ on the latent scale
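As a concrete illustration of Formula (A1), the sketch below computes the category probabilities for a single item, rater, and teacher. The function and its parameter values are ours (placeholders), not the estimation code used in the study.

```python
import numpy as np

# Sketch of Formula (A1): category probabilities for item k given the D = 5 latent
# dimensions and the loading structure w_kd. Parameter values are placeholders.
def category_probabilities(theta, w_k, alpha_k, delta_k):
    """theta: length-D latent values for one rater-teacher combination;
    w_k, alpha_k: length-D design weights and loadings of item k;
    delta_k: length-M category locations (delta_k[c-1] is the location of category c).
    Returns the probabilities of scores m = 0, ..., M."""
    M = len(delta_k)
    composite = np.dot(w_k * alpha_k, theta)  # sum_d w_kd * alpha_kd * theta_ijd
    logits = [m * composite - delta_k[:m].sum() for m in range(M + 1)]  # m = 0 gives logit 0
    expd = np.exp(logits)
    return expd / expd.sum()

# Example with D = 5 and M = 3 (scores recoded to 0-3):
p = category_probabilities(theta=np.zeros(5),
                           w_k=np.array([1.0, 1.0, 0.0, 0.0, 0.0]),
                           alpha_k=np.array([0.8, 0.5, 0.0, 0.0, 0.0]),
                           delta_k=np.array([-1.0, 0.0, 1.0]))
```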
The Generalizability Theory model is given by:
$$
\theta_{ijd} = \tau_{jd} + \omega_{id} + \varepsilon_{ijd}
\tag{A2}
$$

where
  • $\tau_{jd}$: proficiency of teacher $j$ on dimension $d$, with $(\tau_{j1}, \ldots, \tau_{jD}) \sim N(0, \Sigma_\tau)$
  • $\omega_{id}$: effect of rater $i$ on dimension $d$, with $(\omega_{i1}, \ldots, \omega_{iD}) \sim N(0, \Sigma_\omega)$
  • $\varepsilon_{ijd}$: error term, with $\varepsilon_{ijd} \sim N(0, \sigma_\varepsilon^2)$
The matrix $\Sigma_\tau$ is the covariance matrix of the teachers’ proficiencies on the five dimensions of differentiation. This matrix and the associated correlation matrix are displayed in Table 4.
Model fit can be viewed as the extent to which the model reproduces the data. In the present application, model fit was evaluated by an item-oriented approach, that is, by comparing the observed average item scores with the expected item scores in four homogeneous subgroups on the latent continuum. To achieve this, the sample of 399 response patterns was divided into four subsamples of approximately equal size according to the level of the total scores associated with the response patterns. The procedure targets the extent to which the item characteristic curves, that is, the posterior item-response probabilities dictated by the model, fit the observed item responses.
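The sketch below illustrates this item-fit check (assumed variable layout; not the authors' scripts): response patterns are split into four subgroups by total score, and the observed mean item score per subgroup is compared with the mean posterior expected item score under the model.

```python
import numpy as np

# Sketch of the item-oriented fit check (assumed variable layout; not the authors' scripts).
def item_fit_table(observed, expected):
    """observed, expected: arrays of shape (n_patterns, n_items) holding the recoded
    item scores (0-3) and the posterior expected item scores under the model."""
    totals = observed.sum(axis=1)
    groups = np.array_split(np.argsort(totals), 4)  # four subgroups of roughly equal size
    rows = []
    for item in range(observed.shape[1]):
        obs = np.array([observed[g, item].mean() for g in groups])
        exp = np.array([expected[g, item].mean() for g in groups])
        rows.append((obs, exp, np.abs(obs - exp).mean()))  # per-group means and mean |Obs - Exp|
    return rows
```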
The results are given in Table A1. The columns labeled ‘Obs.’ and ‘Exp.’ give the average observed and posterior expected item scores in each of the four subgroups. Note that the items are scored 0, 1, 2, and 3, so the values in the table also lie in this range. Note further that both the observed and expected values increase monotonically from left to right, in line with the four score levels used to define the subsamples. Finally, the column labeled ‘Abs. Dif.’ gives the mean over subsamples of the absolute difference between observed and expected values.
Table A1. Observed and Expected Item Scores across Four Subgroups.

Indicator   Group 1         Group 2         Group 3         Group 4         Abs. Dif.
            Obs.    Exp.    Obs.    Exp.    Obs.    Exp.    Obs.    Exp.
1.1         1.59    1.62    2.03    1.99    2.29    2.26    2.45    2.49    0.03
1.2         1.74    1.77    2.06    2.12    2.37    2.37    2.67    2.58    0.04
1.3         1.36    1.30    1.78    1.85    2.21    2.23    2.60    2.55    0.05
1.4         1.45    1.50    1.98    1.97    2.42    2.31    2.53    2.60    0.06
1.5         1.44    1.42    1.74    1.85    2.22    2.10    2.29    2.31    0.07
1.6         0.77    0.78    1.20    1.13    1.48    1.47    1.73    1.79    0.04
1.7         0.88    0.84    1.25    1.30    1.68    1.70    2.16    2.13    0.04
1.8         0.75    0.69    0.82    0.82    0.96    0.95    1.01    1.09    0.04
2.1         1.55    1.66    2.15    2.03    2.31    2.28    2.51    2.54    0.07
2.2         1.24    1.21    2.00    1.83    2.17    2.24    2.49    2.58    0.09
2.3         1.48    1.50    2.07    2.05    2.41    2.36    2.53    2.58    0.04
2.4         0.91    0.89    1.64    1.63    2.06    2.09    2.42    2.42    0.02
2.5         0.38    0.43    0.75    0.76    1.29    1.15    1.51    1.60    0.07
2.6         1.41    1.21    1.46    1.51    1.51    1.75    2.11    1.99    0.15
3.1         1.61    1.61    1.73    1.71    1.76    1.78    1.87    1.87    0.01
3.2         1.29    1.23    1.38    1.42    1.61    1.56    1.68    1.74    0.05
3.3         1.92    2.00    2.03    2.07    2.18    2.12    2.22    2.18    0.06
3.4         2.19    2.18    2.31    2.31    2.42    2.40    2.45    2.49    0.02
3.5         1.24    1.13    1.57    1.71    2.02    2.08    2.47    2.41    0.09
3.6         0.54    0.58    0.86    0.82    1.07    1.07    1.38    1.39    0.02
3.7         0.78    0.81    1.04    1.07    1.17    1.31    1.79    1.58    0.10
3.8         0.70    0.79    0.86    0.86    0.92    0.91    1.05    0.98    0.04
4.1         1.69    1.77    2.09    2.09    2.43    2.27    2.34    2.42    0.08
Note. ‘Obs.’ = observed, ‘Exp.’ = expected item scores.
It can be seen that the differences between observed and expected values are very small. The values of the average absolute differences in the column labeled ‘Abs. Dif.’ are also small. The conclusion is that the combined IRT and GT model represented the data very well.
Global reliability is the extent to which members of a well-defined population (in this case, teachers) can be distinguished from one another. Besides global reliability, IRT provides local reliability, which reflects the local estimation error of a test at a given value on the latent scale. This local error variance is the reciprocal of the test information, which, in turn, is the sum of the item information values. Item information is therefore a measure of an item’s contribution to the total test information and, with that, to local reliability; it can thus be seen as a measure of item quality and can be used to select items for a test. Table A2 gives the item information at five selected points on the latent scale. For convenience, the latent scale has been transformed to standard normal, so that the mean (0.000), extreme values (−2.000 and 2.000), and moderately low and high values (−1.000 and 1.000) are easy to identify. For instance, Indicator 2.4 provides most of its information at moderately low to average values, Indicator 2.3 contributes more to the reliability at lower levels, and Indicator 3.6 contributes more to the reliability at higher levels.
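The sketch below illustrates one common way to compute such item information for a polytomous item along a single dimension, namely as the squared loading times the conditional variance of the item score at a given point on the latent scale; the parameter values are illustrative and are not the estimates underlying Table A2:

```python
# Illustrative computation of item information for a polytomous item along one dimension:
# information at theta equals the squared loading times the conditional variance of the item
# score at theta, and is the item's contribution to the test information (whose reciprocal is
# the local error variance). Parameters are illustrative, not the estimates behind Table A2.
import numpy as np

def item_information(theta: float, alpha: float, delta: np.ndarray) -> float:
    M = len(delta)
    exponents = np.array([m * alpha * theta - np.sum(delta[:m]) for m in range(M + 1)])
    probs = np.exp(exponents) / np.exp(exponents).sum()   # category probabilities at theta
    scores = np.arange(M + 1)
    mean_score = np.sum(scores * probs)
    return alpha ** 2 * np.sum((scores - mean_score) ** 2 * probs)

# Item information at the five reporting points used in Table A2 (illustrative item).
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    info = item_information(theta, alpha=1.2, delta=np.array([-1.5, 0.0, 1.5]))
    print(round(theta, 3), round(info, 3))
```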
Table A2. Item Information at five positions on the latent continuum.
Item Information Value at:
Indicator   −2.000   −1.000   0.000   1.000   2.000
1.1         0.511    0.475    0.407   0.321   0.196
1.2         0.282    0.348    0.320   0.213   0.115
1.3         0.302    0.467    0.509   0.294   0.118
1.4         0.439    0.583    0.533   0.291   0.123
1.5         0.385    0.363    0.242   0.168   0.129
1.6         0.223    0.330    0.402   0.402   0.326
1.7         0.285    0.478    0.607   0.506   0.286
1.8         0.024    0.030    0.037   0.043   0.049
2.1         0.181    0.287    0.303   0.203   0.103
2.2         0.406    0.699    0.597   0.272   0.104
2.3         0.510    0.556    0.403   0.235   0.122
2.4         0.343    0.867    0.705   0.335   0.178
2.5         0.226    0.466    0.655   0.676   0.589
2.6         0.081    0.094    0.096   0.084   0.066
3.1         0.023    0.024    0.024   0.024   0.023
3.2         0.039    0.043    0.046   0.046   0.043
3.3         0.016    0.014    0.013   0.012   0.011
3.4         0.048    0.043    0.037   0.031   0.026
3.5         0.292    0.466    0.407   0.228   0.116
3.6         0.148    0.207    0.270   0.353   0.398
3.7         0.076    0.102    0.123   0.127   0.112
3.8         0.013    0.015    0.016   0.018   0.020
4.1         0.220    0.188    0.148   0.113   0.084

References

1. Ardenlid, F., Lundqvist, J., & Sund, L. (2025). A scoping review and thematic analysis of differentiated instruction practices: How teachers foster inclusive classrooms for all students, including gifted students. International Journal of Educational Research Open, 8, 100439.
2. Bell, C. A., Dobbelaer, M. J., Klette, K., & Visscher, A. (2019). Qualities of classroom observation systems. School Effectiveness and School Improvement, 30(1), 3–29.
3. Black, P., & Wiliam, D. (2018). Classroom assessment and pedagogy. Assessment in Education: Principles, Policy & Practice, 25(6), 551–575.
4. Blikstad-Balas, M., Tengberg, M., & Klette, K. (2021). Why—And how—Should we measure instructional quality? In M. Blikstad-Balas, K. Klette, & M. Tengberg (Eds.), Ways of analyzing teaching quality: Potential and pitfalls (pp. 9–20). Universitetsforlaget.
5. Bock, R., Gibbons, R., & Muraki, E. (1988). Full-information factor analysis. Applied Psychological Measurement, 12, 261–280.
6. Brennan, R. L. (2001). Generalizability theory. Springer.
7. Cai, J., Wen, Q., Bi, M., & Lombaerts, K. (2024). How Universal Design for Learning (UDL) is related to Differentiated Instruction (DI): The mediation role of growth mindset and teachers’ practices factors. Social Psychology of Education, 27(6), 3513–3532.
8. Corno, L. (2008). On teaching adaptively. Educational Psychologist, 43(3), 161–173.
9. Daltoé, T. L. M. (2024). The assessment of teaching quality through classroom observation—New approaches for teacher education and research. Tübingen.
10. Deunk, M., Doolaard, S., Smale-Jacobse, A., & Bosker, R. J. (2015). Differentiation within and across classrooms: A systematic review of studies into the cognitive effects of differentiation practices. GION Onderwijs/Onderzoek.
11. Deunk, M. I., Smale-Jacobse, A. E., de Boer, H., Doolaard, S., & Bosker, R. J. (2018). Effective differentiation practices: A systematic review and meta-analysis of studies on the cognitive effects of differentiation practices in primary education. Educational Research Review, 24, 31–54.
12. Evers, A., Lucassen, W., Meijer, R. R., & Sijtsma, K. (2009). COTAN Beoordelingssysteem voor de kwaliteit van tests [Assessment system for the quality of tests]. Nederlands Instituut van Psychologen/Commissie Testaangelegenheden Nederland.
13. Eysink, T. H. S., & Schildkamp, K. (2021). A conceptual framework for assessment-informed differentiation (AID) in the classroom. Educational Research, 63(3), 261–278.
14. Glas, C. A., Jorgensen, T. D., & Hove, D. T. (2024). Reducing attenuation bias in regression analyses involving rating scale data via psychometric modeling. Psychometrika, 89, 42–63.
15. Graham, L. J., De Bruin, K., Lassig, C., & Spandagou, I. (2021). A scoping review of 20 years of research on differentiation: Investigating conceptualisation, characteristics, and methods used. Review of Education, 9(1), 161–198.
16. Griful-Freixenet, J., Struyven, K., Vantieghem, W., & Gheyssens, E. (2020). Exploring the interrelationship between universal design for learning (UDL) and differentiated instruction (DI): A systematic review. Educational Research Review, 29, 100306.
17. Keuning, T., & van Geel, M. (2021). Differentiated teaching with adaptive learning systems and teacher dashboards: The teacher still matters most. IEEE Transactions on Learning Technologies, 14(2), 201–210.
18. Keuning, T., van Geel, M., & Dobbelaer, M. (2022). ADAPT: Krijg zicht op differentiatievaardigheden [ADAPT: Gain insight into differentiation skills]. PICA.
19. Klette, K. (2023). Classroom observation as a means of understanding teaching quality: Towards a shared language of teaching? Journal of Curriculum Studies, 55(1), 49–62.
20. Langelaan, B. N., Gaikhorst, L., Smets, W., & Oostdam, R. J. (2024). Differentiating instruction: Understanding the key elements for successful teacher preparation and development. Teaching and Teacher Education, 140, 104464.
21. Letzel-Alt, V., & Pozas, M. (Eds.). (2023). Differentiated instruction around the world. A global inclusive insight. Waxmann Verlag.
22. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Erlbaum.
23. Luoto, J. M., & Selling, A. J. V. (2021). Exploring the potential in using teachers’ intended lesson goals as a context-sensitive lens to understanding observational scores of instructional quality. In M. Blikstad-Balas, K. Klette, & M. Tengberg (Eds.), Ways of analyzing teaching quality: Potential and pitfalls (pp. 229–253). Universitetsforlaget.
24. Moon, T. R. (2005). The role of assessment in differentiation. Theory into Practice, 44(3), 226–233.
25. Parsons, S. A., Vaughn, M., Scales, R. Q., Gallagher, M. A., Parsons, A. W., Davis, S. G., Pierczynski, M., & Allen, M. (2018). Teachers’ instructional adaptations: A research synthesis. Review of Educational Research, 88(2), 205–242.
26. Prast, E. J., van de Weijer-Bergsma, E., Kroesbergen, E. H., & van Luit, J. E. H. (2015). Readiness-based differentiation in primary school mathematics: Expert recommendations and teacher self-assessment. Frontline Learning Research, 3(2), 90–116.
27. Roy, A., Guay, F., & Valois, P. (2013). Teaching to address diverse learning needs: Development and validation of a Differentiated Instruction Scale. International Journal of Inclusive Education, 17(11), 1186–1204.
28. Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 44(6), 922–932.
29. Smale-Jacobse, A. E., Meijer, A., Helms-Lorenz, M., & Maulana, R. (2019). Differentiated instruction in secondary education: A systematic review of research evidence. Frontiers in Psychology, 10, 2366.
30. Tomlinson, C. A., Brighton, C., Hertberg, H., Callahan, C. M., Moon, T. R., Brimijoin, K., Conover, L. A., & Reynolds, T. (2003). Differentiating instruction in response to student readiness, interest, and learning profile in academically diverse classrooms: A review of literature. Journal for the Education of the Gifted, 27(2–3), 119–145.
31. van Geel, M., Keuning, T., Frèrejean, J., Dolmans, D., van Merriënboer, J., & Visscher, A. J. (2019). Capturing the complexity of differentiated instruction. School Effectiveness and School Improvement, 30(1), 51–67.
32. van Geel, M., Keuning, T., Meutstege, K., de Vries, J., Visscher, A., Wolterinck, C., Schildkamp, K., & Poortman, C. (2023). Adapting teaching to students’ needs: What does it require from teachers? In R. Maulana, M. Helms-Lorenz, & R. M. Klassen (Eds.), Effective teaching around the world (pp. 723–736). Springer.
33. White, M., Göllner, R., & Kleickmann, T. (2025). Reconsidering the measurement of teaching quality using a lens model: Measurement dilemmas and trade-offs. School Effectiveness and School Improvement, 36(2), 192–210.
Figure 1. Boxplots, showing distributions of teachers’ (n = 86) average raw scores per indicator. Colors indicate different phases.
Table 1. Overview of all indicators and associated principle(s).
Principle(s) of Differentiation #
Phase IndicatorGOMOCHADSR
1. Lesson series preparation1.1Evaluation of students’ learning achievementsXX
1.2Insight into educational needs X
1.3Insight into the range of instruction offeredX
1.4Prediction of support needs X X
1.5 *Determination of supplementary remedial objectives and approachesX X
1.6 *Formulation of supplementary enrichment objectives and compilation of a suitable range of instructionX XX
1.7Organisation of instructional sessions for groups of students X
1.8Involvement of students in the objectives and approach X
2. Lesson preparation2.1Determination of lesson objectivesX
2.2Composition of instructional groups X X
2.3Preparation of instruction and processing for the core group X
2.4 *Preparation of instruction and processing for the intensive instructional groupX X
2.5 *Preparation of instruction and processing for the enrichment groupX XX
2.6Preparation of encouragement for self-regulation X
3. Actual teaching3.1Sharing of the lesson objectiveX
3.2Activation and inventory of prior knowledge X
3.3 *Providing didactically sound and purposive core instructionX
3.4Monitoring of comprehension and the working process X
3.5 *Instruction and processing for the intensive group in the lessonX XX
3.6 *Challenging the enrichment group in the lesson XX
3.7Encouragement of self-regulation during the lesson X
3.8Conclusion of the lessonXX
4. Evaluation4.1Evaluation and determination of follow-up actionsXX X
Note. * indicates that the rater could indicate the indicator was ‘not applicable’; # GO = strong goal-orientation; MO = continuously monitoring students’ progress and understanding; CH = challenging all students; AD = adapting instruction and exercises to students’ needs; SR = stimulating students’ self-regulation. Where multiple principles are associated with an indicator, a bold X indicates the principle that is most evident.
Table 2. Teacher descriptives (n = 86).
                     n
Grade taught
  1                  9
  2                  11
  3                  17
  4                  24
  5                  24
  6                  26
Gender
  Male               26 (30.2%)
  Female             60 (69.8%)
Note: The total n for grades taught adds up to more than 86 because 21 teachers (24.4%) taught a group with two or three grades in one classroom.
Table 3. Raw score frequencies per indicator.
Indicator                                                                                          1     2     3     4    nei   n/a
1.1   Evaluation of students’ learning achievements                                                9    68   192   120    10    -
1.2   Insight into educational needs                                                               3    74   138   161    23    -
1.3   Insight into the range of instruction offered                                               23   115    71   163    27    -
1.4   Prediction of support needs                                                                 16    89   116   166    12    -
1.5 * Determination of supplementary remedial objectives and approaches                           38    35   197    85    39    5
1.6 * Formulation of supplementary enrichment objectives and compilation of a suitable range of instruction   76   148   113    35    17   10
1.7   Organisation of instructional sessions for groups of students                               70   129   117    77     6    -
1.8   Involvement of students in the objectives and approach                                     132   114    29    35    89    -
2.1   Determination of lesson objectives                                                           1   115    90   173    20    -
2.2   Composition of instructional groups                                                         45    76    89   175    14    -
2.3   Preparation of instruction and processing for the core group                                24    59   141   156    19    -
2.4 * Preparation of instruction and processing for the intensive instructional group             73    25   139   102    46   14
2.5 * Preparation of instruction and processing for the enrichment group                         103   147    85    14    36   14
2.6   Preparation of encouragement for self-regulation                                            90    68    73   123    45    -
3.1   Sharing of the lesson objective                                                             14   135   183    64     3    -
3.2   Activation and inventory of prior knowledge                                                 87   138    56   114     4    -
3.3 * Providing didactically sound and purposive core instruction                                 12    34   248    96     0    9
3.4   Monitoring of comprehension and the working process                                          5    48   149   196     1    -
3.5 * Instruction and processing for the intensive group in the lesson                            68    43   110   125    16   37
3.6 * Challenging the enrichment group in the lesson                                             101   179    56    18    26   19
3.7   Encouragement of self-regulation during the lesson                                         136    99   102    57     5    -
3.8   Conclusion of the lesson                                                                   106   234    30    16    13    -
4.1   Evaluation and determination of follow-up actions                                           15    48   174   130    32    -
Note. nei = not enough information; n/a = not applicable; * indicates that ‘not applicable’ could be selected for this indicator.
Table 4. Covariances and correlations for latent variables.
Covariance Matrix (Teachers)
        GO      MO      CH      AD      SR
GO      0.190   0.143   0.027   0.079   0.081
MO      0.143   0.330   0.141   0.114   0.116
CH      0.027   0.141   0.864   0.068   0.292
AD      0.079   0.114   0.068   0.157   0.133
SR      0.081   0.116   0.292   0.133   0.921
Correlation Matrix (Teachers)
        GO      MO      CH      AD      SR
GO      1.000   0.572   0.067   0.457   0.194
MO      0.572   1.000   0.264   0.503   0.210
CH      0.067   0.264   1.000   0.184   0.327
AD      0.457   0.503   0.184   1.000   0.351
SR      0.194   0.210   0.327   0.351   1.000
Note: correlations above 0.400 are bolded to enhance the readability. GO = strong goal-orientation; MO = continuously monitoring students’ progress and understanding; CH = challenging all students; AD = adapting instruction and exercises to students’ needs; SR = stimulating students’ self-regulation.
Table 5. Factor Loadings.
Principle of Differentiation
IndicatorGOMOCHADSR
1.10.5721.421
1.2 1.665
1.31.890
1.4 0.911 1.211
1.50.531 0.950
1.60.259 0.9090.918
1.7 2.295
1.8 0.468
2.11.839
2.2 0.171 1.868
2.3 1.721
2.40.989 1.425
2.50.297 0.8381.629
2.6 0.933
3.10.637
3.2 0.313
3.30.657
3.4 0.340
3.50.710 0.0311.092
3.6 0.6911.081
3.7 0.937
3.80.4120.400
4.10.6550.252 0.363
Note: bold indicates highest load. GO = strong goal-orientation; MO = continuously monitoring students’ progress and understanding; CH = challenging all students; AD = adapting instruction and exercises to students’ needs; SR = stimulating students’ self-regulation.
Table 6. Agreement and reliability of overall scores and scorers per principle.
Agreement
Raters   ALL     GO      MO      CH      AD      SR
2        0.635   0.659   0.752   0.858   0.616   0.692
3        0.719   0.743   0.820   0.900   0.706   0.771
4        0.771   0.794   0.858   0.923   0.762   0.818
5        0.807   0.828   0.883   0.938   0.800   0.849
Reliability
Raters   ALL     GO      MO      CH      AD      SR
1        0.860   0.738   0.831   0.928   0.700   0.932
2        0.923   0.849   0.908   0.963   0.823   0.965
3        0.947   0.894   0.936   0.975   0.875   0.976
4        0.959   0.919   0.952   0.981   0.903   0.982
5        0.967   0.934   0.961   0.985   0.921   0.986