Emotional and behavioral problems in students pose a major challenge in classrooms. These problems have traditionally been classified into externalizing and internalizing behavior problems (Achenbach and Edelbrock 1978). Externalizing behavior problems are outwardly directed behaviors that represent a maladaptive underregulation of cognitive and emotional states (Achenbach and Edelbrock 1978). Internalizing behavior problems, in contrast, typically develop and persist within an individual and represent a maladaptive overregulation of cognitive and emotional states (Achenbach and Edelbrock 1978). According to national and international prevalence studies, 10% to 20% of all school-age children and adolescents show such behavioral problems (e.g., Costello et al. 2003). Longitudinal studies on the consequences of students' externalizing and internalizing behavior problems have shown that these behavior patterns in the classroom correlate with academic failure, social exclusion, and delinquency (e.g., Krull et al. 2018; Moffitt et al. 2002; Reinke et al. 2008). In addition, teachers report high levels of stress when they face students' externalizing and internalizing behavior problems in the classroom (e.g., Center and Callaway 1999).
School-based behavioral interventions have been shown to be an efficient way to prevent and decrease the occurrence of externalizing and internalizing behavior problems (e.g., Durlak et al. 2011; Fabiano and Pyle 2018; Waschbusch et al. 2018). However, the effectiveness of these interventions increases when intervention planning, implementation, and evaluation are closely linked to school-based assessment practices (Eklund et al. 2009). Two assessment methods have been shown to lead to more effective interventions (Volpe et al. 2010): universal behavior screening and behavioral progress monitoring. Universal screening tools identify students who might benefit from a behavior intervention and additionally guide its planning and implementation. Behavioral progress monitoring is used to evaluate an individual student's response to a behavioral intervention. Progress monitoring data are collected frequently, up to several times a day. They allow teachers to recognize behavioral changes over a short time period, which supports decisions about maintaining or modifying the intervention.
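To illustrate how such frequent ratings can inform this kind of decision, the following minimal sketch fits a least-squares trend line to hypothetical daily ratings. The data, the 0 to 6 scale, and the decision threshold are illustrative assumptions, not part of any study discussed here.

```python
import numpy as np

# Hypothetical daily behavior ratings (0-6 scale) for one student across
# ten school days of an intervention; all values are illustrative only.
ratings = np.array([2, 2, 3, 2, 4, 3, 4, 5, 4, 5])
days = np.arange(len(ratings))

# Ordinary least-squares slope: the average change in the rating per day.
slope, intercept = np.polyfit(days, ratings, deg=1)

# A simple, purely illustrative decision rule: keep the intervention if the
# behavior improves by at least a chosen minimum rate, otherwise modify it.
MIN_IMPROVEMENT_PER_DAY = 0.1  # hypothetical threshold
decision = "maintain" if slope >= MIN_IMPROVEMENT_PER_DAY else "modify"
print(f"slope = {slope:.2f} points/day -> {decision} intervention")
```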
While many existing tools can be used for universal behavior screening (see, e.g., Daniels et al. 2014; Volpe et al. 2018 for an overview), the development of methods for behavioral progress monitoring is still in its initial stages. Traditionally, two approaches have been widely used for school-based behavior assessment: behavior rating scales (BRS) and systematic direct behavior observations (SDO; Christ et al. 2009). BRS usually consist of a pool of items representing specific behaviors that an individual might show. The intensity or frequency of these behaviors is rated on a Likert scale, so the documentation and interpretation of the behavior occur at the same time. BRS can be completed by multiple informants, such as teachers, parents, or the individuals themselves. BRS are an efficient way to measure specific behaviors, since they are easy to understand, complete, and interpret. However, the scores generated by BRS represent a subjective perception of an individual's behavior. SDO, in contrast, represent an objective tool for assessing a student's behavior (Volpe et al. 2005). In SDO, the documentation and interpretation of an individual's behavior are usually separated. The observer first identifies and defines the behavior of interest and the observation interval. The targeted behavior is then observed in the relevant interval using time-sampling methods. Finally, the observation scores are analyzed and interpreted. While this procedure generates objective, reliable, and valid data, it is labor- and time-intensive, and observation training is often required. In conclusion, BRS and SDO alone have limitations when collecting behavioral progress monitoring data (Christ et al. 2009).
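As an illustration of the time-sampling step, the following minimal sketch shows partial-interval recording, one common time-sampling variant. The interval grid and the observation data are assumptions made only for this example.

```python
# Minimal sketch of partial-interval time sampling for SDO.
# A lesson is divided into fixed intervals (e.g., 15 s); the observer marks
# an interval True if the target behavior occurred at any point within it.
# All values here are illustrative.
observed = [True, False, False, True, True, False, True, False]  # 8 intervals

# The SDO score is then typically reported as the percentage of intervals
# in which the target behavior was observed.
pct = 100 * sum(observed) / len(observed)
print(f"Target behavior observed in {pct:.0f}% of intervals")
```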
Direct Behavior Rating (DBR) represents a relatively new assessment method that allows for progress monitoring measurements over short intervals. DBR is a hybrid of systematic direct observation and behavior rating scales in which individuals observe a behavior in a specific situation and rate it (e.g., on a Likert scale) immediately afterwards (Chafouleas 2011). In recent years, two DBR forms have been developed and evaluated for progress-monitoring purposes: Single-Item Scales (DBR-SIS) and Multi-Item Scales (DBR-MIS; Volpe and Briesch 2015). DBR-SIS usually targets more global behaviors (e.g., academic engagement, disruptive behavior) and may be the most efficient way to broadly measure a student's overall level of behavioral success. This information can be useful when a student exhibits a broad range of specific problem behaviors that relate to problem behavior in general. However, DBR-SIS has not typically been used to assess specific classroom behaviors (e.g., hand raising), which might be more informative for evaluating a student's response to a behavioral intervention. In contrast, DBR-MIS usually includes three to five specific behavior items (e.g., completes classwork in allowed time, starts working independently, turns in assignments appropriately) that operationalize a higher-order behavioral dimension. These more specific items can then be analyzed individually or added up to produce a sum score, as sketched below (Volpe and Briesch 2012).
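A minimal sketch of how DBR-MIS ratings might be represented and aggregated follows. The item wording mirrors the examples above, while the 0 to 6 scale and the rating values are illustrative assumptions.

```python
# Minimal sketch of a DBR-MIS rating for one higher-order dimension.
# The 0-6 rating scale and the values are illustrative assumptions.
ratings = {
    "completes classwork in allowed time": 4,
    "starts working independently": 3,
    "turns in assignments appropriately": 5,
}

# Items can be inspected individually or aggregated to a dimension sum score.
total = sum(ratings.values())
print(f"DBR-MIS sum score: {total} of {6 * len(ratings)}")
```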
Previous studies have shown that DBR meets the criteria required for behavioral progress monitoring. First, DBR is feasible and effective because it does not require extensive materials, and the ratings can be completed easily in a few minutes (Chafouleas 2011). Second, DBR is flexible because a broad range of observable behaviors (at both the global and specific levels) can be addressed. Third, DBR is repeatable because the same behavioral target can be observed and rated across many occasions. Fourth, the psychometric quality of DBR has been supported by a broad range of evaluation studies focusing on the performance of the tool under different measurement conditions (Chafouleas 2011; Christ et al. 2009; Huber and Rietz 2015).
Most DBR studies, both within Germany and internationally, have evaluated reliability using Generalizability Theory (GT; see Huber and Rietz 2015). Within GT, which represents a liberalization of Classical Test Theory (CTT), assessments are tied closely to the target populations with respect to the variability of the targeted behaviors. This approach can establish the external validity of a DBR by ensuring that the behavioral targets and evaluation groups are well matched. Most studies were designed within the GT framework in order to measure the true behavior and to investigate potential factors (and their interactions) that might influence the variance in the generated scores (e.g., multiple raters and multiple time points). Such studies are necessary to examine the reliability of behavioral assessment and to determine conditions that might increase reliability (Cronbach et al. 1972). Previous studies found that DBR generates reliable scores, with a large amount of variance explained by the student's actual behavior (e.g., Owens and Evans 2017). However, results from different raters across multiple time points indicate that different persons rate the same behavior differently and that students behave differently across occasions (e.g., Briesch et al. 2010; Volpe and Briesch 2012; Briesch et al. 2014). Therefore, multiple measurement points are necessary to obtain a stable score that is still interpretable. Previous generalizability studies showed that valid results are generated within 4 to 20 measurement points, and that fewer measurement points are needed when DBR-MIS is used (e.g., Casale et al. 2015; Volpe et al. 2011; Volpe and Briesch 2012).
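To make the GT logic concrete, the following sketch decomposes simulated persons-by-occasions ratings into variance components and computes a relative generalizability coefficient. The design, sample sizes, and variance values are illustrative assumptions, not the designs used in the cited studies.

```python
import numpy as np

# Minimal sketch of a persons x occasions generalizability study.
# Rows = students, columns = occasions; scores are simulated for illustration.
rng = np.random.default_rng(0)
n_p, n_o = 30, 5
true_person = rng.normal(0, 1.0, size=(n_p, 1))   # stable behavior level
occasion = rng.normal(0, 0.3, size=(1, n_o))      # day-to-day shifts
scores = 3 + true_person + occasion + rng.normal(0, 0.5, size=(n_p, n_o))

# Two-way ANOVA mean squares for a fully crossed design, one score per cell.
grand = scores.mean()
ms_p = n_o * np.sum((scores.mean(axis=1) - grand) ** 2) / (n_p - 1)
ms_o = n_p * np.sum((scores.mean(axis=0) - grand) ** 2) / (n_o - 1)
resid = (scores - scores.mean(axis=1, keepdims=True)
         - scores.mean(axis=0, keepdims=True) + grand)
ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_o - 1))

# Expected-mean-square solutions for the variance components.
var_p = (ms_p - ms_res) / n_o   # persons (true behavior differences)
var_o = (ms_o - ms_res) / n_p   # occasions
var_e = ms_res                  # residual (incl. person x occasion)

# Relative G coefficient when averaging over n_o occasions.
g_coef = var_p / (var_p + var_e / n_o)
print(f"person={var_p:.2f} occasion={var_o:.2f} "
      f"residual={var_e:.2f} G={g_coef:.2f}")
```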
While GT represents a strong framework for developing reliable measures, validity concerns within the framework of item response theory (IRT) remain. Even though the results on the psychometric characteristics of DBR are promising, two issues remain. First, most previous studies used small samples of five to ten students and three or more raters. Because of the theoretical assumptions of generalizability studies, smaller samples are often sufficient, but such samples are too small to evaluate the validity and technical adequacy of the test itself. For instance, Rasch modelling may require at least 100 participants to obtain sufficiently precise parameter estimates, and around 250 for high-stakes decisions such as screenings, diagnoses, or classroom advancement (Linacre 1994). It is therefore important to embed evaluation throughout the development process and to use an evaluation sample of sufficient size for IRT analyses. DBR should thus be developed in line with both generalizability theory and IRT.
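As a rough illustration of why such sample sizes matter, the following sketch fits a Rasch model by joint maximum likelihood to simulated data with roughly 250 persons. Dichotomous items are used for brevity (polytomous DBR items would call for, e.g., a partial credit model with the same logic), and all data are simulated.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of joint maximum-likelihood estimation for a dichotomous
# Rasch model on simulated data; illustrative only.
rng = np.random.default_rng(1)
n_persons, n_items = 250, 8
theta_true = rng.normal(0, 1, n_persons)           # person abilities
beta_true = np.linspace(-1.5, 1.5, n_items)        # item difficulties
prob = 1 / (1 + np.exp(-(theta_true[:, None] - beta_true[None, :])))
X = (rng.random((n_persons, n_items)) < prob).astype(float)

def neg_log_lik(params):
    theta, beta = params[:n_persons], params[n_persons:]
    beta = beta - beta.mean()                      # identification constraint
    eta = theta[:, None] - beta[None, :]
    return -np.sum(X * eta - np.log1p(np.exp(eta)))

res = minimize(neg_log_lik, np.zeros(n_persons + n_items),
               method="L-BFGS-B",
               bounds=[(-6, 6)] * (n_persons + n_items))
beta_hat = res.x[n_persons:] - res.x[n_persons:].mean()
print("estimated item difficulties:", np.round(beta_hat, 2))
```

Rerunning this sketch with far fewer persons yields visibly noisier difficulty estimates, which is the practical point behind Linacre's sample-size guidance.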
Second, the measurement invariance of DBR across multiple occasions has not yet been examined. Since DBR was developed for assessment within a problem-solving model, it has to be sensitive to behavioral progress (Deno 2005; Good and Jefferson 1998). Only when DBR scores are comparable over time can the results be used to draw valid conclusions about behavioral progress and responses to behavioral interventions (Gebhardt et al. 2015).
This study demonstrated an approach complementary to generalizability theory for examining the item characteristics and the stability of QMBS ratings across occasions. Our comparatively large sample allowed more detailed IRT analyses than the smaller samples typical of generalizability-theory-based DBR development and assessment. This represents a significant addition to previous DBR assessment and development techniques and provides a framework for future studies to improve the assessment of both the reliability and validity of DBR, within Germany and internationally.
Our QMBS achieved a high compliance rate with two trained raters in a pilot study and, in IRT analyses, showed invariance across five measurement points as well as satisfactory reliability at the item level. Results from the CFA confirmed that the overall factor structure parallels the structure reported in past research (e.g., for the SDQ; Goodman et al. 2010), with only minor modifications. Invariance tests revealed its applicability to diverse groups with respect to gender, migration background, and school level. Results of the latent growth models confirmed the overall stability of the scores across all five measurement points. The intraclass correlations based on rater indicate that rater bias had little effect on the overall WLE scores.
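For readers unfamiliar with rater-based intraclass correlations, the following sketch computes one common variant, ICC(1), over simulated WLE person scores from two raters; values near 1 suggest little rater effect. The data, variances, and the choice of ICC variant are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of a one-way intraclass correlation, ICC(1), for person
# scores produced by two raters; the score matrix is simulated.
rng = np.random.default_rng(3)
n_students, n_raters = 40, 2
true_score = rng.normal(0, 1, (n_students, 1))
wle = true_score + rng.normal(0, 0.3, (n_students, n_raters))  # rater noise

grand = wle.mean()
ms_between = (n_raters * np.sum((wle.mean(axis=1) - grand) ** 2)
              / (n_students - 1))
ms_within = (np.sum((wle - wle.mean(axis=1, keepdims=True)) ** 2)
             / (n_students * (n_raters - 1)))

icc1 = (ms_between - ms_within) / (ms_between + (n_raters - 1) * ms_within)
print(f"ICC(1) = {icc1:.2f}  (values near 1 suggest little rater effect)")
```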
The assessment of invariance for the QMBS is an important early step in the development of the scale. This invariance means that the test performs comparably for our primary, secondary, and clinical student samples, for both genders, for learners with SEN, and for those with a migration background. Similarly, it performed comparably across the multiple measurement points, which means that over time, the results for each item can be compared with results from a previous time point. More generally, the CFA and invariance results mean that meaningful comparisons can be made using the sum scores for the different dimensions (Dimitrov 2017). This is an important requirement for any scale that uses repeated measurements to track change in behavior over time.
5.1. Limitations and Future Work
We did not provide any experimental treatments, and every teacher rated their own students individually. It was therefore expected that there would be no differences between school levels or types or across measurement points. Most teachers provided ratings in the center categories, and their ratings remained stable. The highest category, seven (always), was used so rarely that we needed to combine it with a lower category for the IRT model, as sketched below. Consequently, our instrument is less suited to comparing different groups of students at a single measurement point and more suited to measuring individual change over time. Additional studies are needed to assess its sensitivity to behavioral change over time; for this, intervention studies that track behavior over short and long periods are required. Furthermore, normative values should be established with a large, random sample that includes learners from a diverse range of schools. Another caveat is that our design did not allow for a detailed analysis of rater effects (e.g., rater severity, rater drift over time), which can substantially affect longitudinal ratings. More complex designs are needed, especially to disentangle rater effects from item stability and sensitivity to interventions over time.
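The collapsing step can be illustrated as follows; the category frequencies below are simulated, not our observed distribution.

```python
import numpy as np

# Minimal sketch of collapsing a rarely used top category before IRT
# scaling. The 1-7 scale and the frequencies are illustrative assumptions.
rng = np.random.default_rng(4)
ratings = rng.choice(np.arange(1, 8), size=500,
                     p=[.05, .10, .20, .30, .25, .09, .01])  # 7 is rare

counts = np.bincount(ratings, minlength=8)[1:]
print("category frequencies:", dict(zip(range(1, 8), counts)))

# Merge category 7 ("always") into category 6, then re-index to 0..5 so the
# polytomous IRT model sees consecutive, sufficiently populated categories.
collapsed = np.minimum(ratings, 6) - 1
print("collapsed frequencies:", np.bincount(collapsed, minlength=6))
```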
5.2. Implications for Research and Practice
Our study demonstrated the value of an IRT approach to DBR: such an approach can help validate the test items, and items can be assessed for invariance across a number of groupings. This approach is sorely needed as a complement to DBR approaches that focus solely on generalizability theory.