Universal Design for Learning: The More, the Better?

Abstract: An experimental study investigated the effects of applying principles of the Universal Design for Learning (UDL). Focusing on epistemic beliefs (EBs) in inclusive science classes, we compared four groups who worked with learning environments based more or less on UDL principles and filled out an original version of a widely used EBs questionnaire or an adapted version using the Universal Design for Assessment (UDA). Based on measurement invariance analyses, a multiple indicator, multiple cause (MIMIC) approach, and multi-group panel models, the results do not support an outperformance of the extensive UDL environment. Moreover, the UDA-based questionnaire appears to be better suited for detecting learning gains in an inclusive setting. The results emphasize how important it is to carefully adopt and introduce the UDL principles for learning and to care about test accessibility when conducting quantitative research in inclusive settings.


Introduction
The Universal Design for Learning (UDL) provides a theoretical framework for the conception of teaching that addresses the accessibility of learning content and welcomes students' diversity. Accessibility is thought of here in terms of minimizing barriers, an idea that is central to many approaches to implementing inclusive teaching in order to achieve participation in learning for all students [1]. Participation in education for all students is a current social and political challenge. The increasing diversity of learners should be met positively and serve as a resource. In this context, inclusion is defined as a term for an appreciative and welcoming approach to diversity [2,3]. UNESCO also addresses education in the context of the Sustainable Development Goals, which explicitly refer to inclusive education. By 2030, education systems worldwide should be adapted to be more equitable to the diversity of learners. All people, regardless of background, should have access to education and be able to participate in it. This will also realize the right to education [4].
The basic assumption of UDL is that monomodal teaching approaches tend to focus on the "average student" and lead to barriers for many other students. Multimodality of a learning environment is created by multiple forms of representation, processing, and motivational or motivation-maintaining elements in the learning environment [5]. Metaphorically speaking, UDL puts the what, the how, and the why of learning in the focus of lesson planning. The concept of UDL has been widely used in many approaches all around the world [6][7][8][9]. It can be shown that all students, not just those with special education needs, can benefit from a UDL-based learning environment [10]. It has also been a guideline for systemic educational reform after the COVID-19 pandemic [11].
However, UDL is not free from criticism. It is seen as a very complex framework that is, on the one hand, very inspiring to educators but can, on the other hand, also be arbitrary when it comes to concretizing and operationalizing the guidelines [12,13]. It is in question whether UDL is adequately defined to derive clear interventions and to isolate the active
Table 1. The four dimensions of epistemic beliefs (EBs) in dependence on their expression [24].

Dimensions of EBs Naïve Sophisticated
Nature of knowledge

These dimensions have been used in a whole range of studies. There is a body of evidence supporting their importance for learning processes [25], their dimensional structure [18,26], as well as their relations to academic achievement [27]. Recent meta-analytical perspectives support the assumption that EBs can be fostered during intervention studies, being either the focus of an intervention or playing the role of a co-construct that supports learning processes [28].
Fostering students' epistemic cognition is one of the grand goals of science education efforts all around the world. Rather sophisticated EBs can be seen as a prerequisite to understanding fundamental aspects of science and of how science is represented and discussed in, and influences, society. This is one part of enabling societal participation [29,30]. With this in mind, EBs appear particularly important for an inclusive science education that focuses on helping all students participate in a society that is, to a large extent, shaped by science and technology, and that does not only aim at providing a later STEM workforce for economic or academic purposes. For these reasons, we decided to focus on fostering EBs in inclusive science classrooms, comparing an extensive and a focused UDL setting.

Universal Design for Learning
The Universal Design for Learning (UDL) was developed by the Center for Applied Special Technology (CAST) [5]. UDL offers several instructional adaptation options to reach every student regardless of their prerequisites. UDL-based instruction provides multiple ways to present information ("what" of learning), process information and present learning outcomes ("how" of learning), and promote learning engagement and learning motivation ("why" of learning) [5,31]. The three principles are subdivided into nine guidelines, which are described in Table 2. The focus is on the individual so that the barriers to accessibility are minimized. Thus, it is not the learner who must adapt, but the classroom [2,5]. Educators can use UDL principles to create flexible learning pathways for learners to achieve their learning goals. This allows all learners to be addressed by choosing different methods, materials, and assessments based on their individual needs [32]. Applying the UDL principles does not have to be digital, because the focus is on the educational effort to reach all learners. However, UDL promises the advantage of reaching the learner in different ways, for example, by reading a text aloud or using videos to convey the learning content [33].

Universal Design for Assessment
While UDL offers the opportunity to minimize barriers in learning environments, the assessments used for evaluation can themselves contain barriers that significantly influence the results. Capp [7], Edyburn [12,14], and Gregg and Nelson [34] explicitly point out that the assessment should receive more attention when integrating UDL into learning environments. One way to minimize these barriers and increase accessibility is through the Universal Design for Assessment (UDA) framework [35][36][37]. UDA is designed to enable participants to achieve the best possible test scores regardless of personal characteristics that are irrelevant to the test construct. In doing so, UDA focuses on decreasing the construct-irrelevant variance [38]. Similar to UDL, essential elements can also be formulated in UDA (Table 3).
Table 3. Elements of a universally designed test [37].

Inclusive Assessment Population
Tests designed for state, district, or school accountability must include every student except those in the alternate assessment, and this is reflected in assessment design and field-testing procedures.

Precisely Defined Constructs
The specific constructs tested must be clearly defined so that all construct irrelevant cognitive, sensory, emotional, and physical barriers can be removed.

Accessible, Non-Biased Items
Accessibility is built into items at the beginning, and bias review procedures ensure that quality is retained in all items.

Amenable to Accommodations
The test design facilitates the use of needed accommodations (e.g., all items can be Brailled).

Simple, Clear, and Intuitive Instructions and Procedures
All instructions and procedures are simple, clear, and presented in understandable language.

Maximum Readability and Comprehensibility
A variety of readability and plain language guidelines are followed (e.g., sentence length and number of difficult words are kept to a minimum) to produce readable and comprehensible text.

Maximum Legibility
Characteristics that ensure easy decipherability are applied to text, tables, figures, and illustrations, and to response formats.

Research Question
All in all, our aim was to carry out a quasi-experimental study that investigates the impact of using an extensive and a focused UDL setting on the development of EBs in science. We therefore designed and compared two learning environments based on different numbers of UDL principles. We also tried to be sensitive to barriers in research in inclusive settings that might affect research results and hinder participation in testing. Thus, we aimed at testing the effect of adapting an internationally published epistemic beliefs questionnaire using the concept of UDA. More concretely, we focused on the following research questions:

1. Does applying UDA to a widely used instrument affect the results of the study?
2. To what extent can epistemic beliefs be fostered in inclusive science classes using the concept of UDL?
3. How does an extensive or a more focused use of UDL principles impact learning outcomes in the field of epistemic beliefs?
This study was part of the dissertation project of one of the authors where further information can be found [39].

Description of the Learning Environments
Both learning environments were based on the UDL principles. While one referred only to the principle of multiple representations ("MR environment") and contained a video, the second learning environment addressed more UDL principles ("UDL environment"). The extended UDL environment included a comic and an interactive pop-up text in addition to the video from the MR learning environment. It contained more features and customizations, as shown in Table 4. The operationalization of the UDL guidelines drew on research findings from test development and evaluation [38,40] and research on digital learning environments [41]. The learning environment was created with iBooks Author in e-book format [42] and is described more concretely in an article addressing educators in practice [43]. Furthermore, one operationalization can be attributed to several UDL guidelines. Both learning environments showed two scientists holding different hypotheses about the question being addressed: Does the same amount of a substance also have the same weight? This question was related to everyday experiences as well as to the concept of density. This fundamental science concept is rather abstract. However, students in the federal state of Lower Saxony, Germany, where this study took place, should already have come into contact with density while learning about sinking and floating. The learning environment aimed to teach the purpose of experiments (testing hypotheses) and the planning of an experiment. At the beginning of the learning environment, learners were given an overview of the intended learning goals: (1) with experiments, chemists answer their questions, (2) ideas are possible answers to the questions, (3) with experiments, chemists test their ideas, (4) scientists plan an experiment. The learning environment can be seen in Figure 1. Using a self-assessment tool, the students started to reflect on how the scientists proceed to figure out whose hypothesis should be accepted.
Students then engaged in a hands-on activity using everyday materials. They generated data and reflected on the hypotheses as well as the procedures they and the scientists used to generate knowledge. In this way, beliefs such as that experiments are used to test ideas and that experiments justify scientific knowledge were fostered. The reflection on data from experiments was also used for justification purposes. Thus, the justification of scientific knowledge was the main dimension of the EBs fostered. However, students also had opportunities to reflect upon further EB dimensions: the controversy of science determined by the experiments of the students may also foster beliefs that scientific knowledge is subject to change (development), that the students can test scientific knowledge for themselves and do not have to rely on authorities (source), and that scientific knowledge should be reflected upon from more than one perspective (certainty).
The learning environments were based on the theoretical framework of easy language [44]. With the selected materials for the experiment, both hypotheses, "equal amount is not necessarily equal weight" and "equal amount is also equal weight," could be investigated. The following utensils were provided for this purpose: sand, salt, sugar, measuring cylinders (plastic), scales, and spatulas.

Preliminary Study
As part of a pre-study, guided interviews were conducted to develop and evaluate the learning environments and the three content representation forms (video, comic, and interactive pop-up text). The accessibility of the learning environment was tested through this approach. The data were analyzed with a qualitative content analysis [45]. The pre- and post-interviews lasted 10 min each. Working with the learning environment lasted 30 min.
For the preliminary study, 36 learners from 5th to 7th grade were interviewed in a guideline-based approach. Nine of them indicated a diagnosed need for special educational support. The intervention was carried out in groups of four, while the pre- and post-interviews were conducted individually. Learners were assigned in equal numbers to the representational forms: video, text, and comic. The basis for the evaluation was the coding manual of Carey et al. [46]. When intraindividual changes were included, the results showed that the video-based representation had advantages over the pop-up text, but not over the comic-based one. In the framework of a correlation analysis, it can be concluded that the video was not superior when the distribution of levels was examined in the learning environment.
Furthermore, through the interviews conducted, an insight into the abilities of the learners could be gained. For this purpose, the interviews were coded with regard to the hypothetical-deductive way of experimenting [47] and unsystematically trying out (look and see) [46].

Design of the Main Study
A 2 × 2 between-subjects design with a pre-post assessment was selected for the main quantitative study (Table 5). This approach allows differences in learning environments and assessments to be examined. Learners were randomly assigned to one of the experimental groups. This ensured that each study condition was represented in each school class. The intervention lasted 90 min. The standard assessment of Kampa et al. [24] was used to capture all four EB dimensions. In a further step, this assessment was adapted to conform to the UDA. For this purpose, the concept of easy language was utilized [44] and experts (two from German studies, two from special needs education, and two from science education) were consulted to verify the linguistic and content accuracy. The comparability of both assessment forms was secured in this way. Furthermore, a larger text layout and a more everyday response format in the form of stars were chosen (Figure 2). The exemplary wording of the justification scale UDA assessment items can be found in Table 6.
For an extended evaluation, additional learner characteristics were collected via a paper-pencil test and iPad-based tests. The selection of learner characteristics is theory-based and is necessary for a broad understanding of inclusion, as it is not sufficient to focus only on special educational needs. In addition to reading ability and cognitive skills, socioeconomic status, cognitive activation, perception of learning success, as well as gender, age, and diagnosed support needs were assessed (Table 7). We chose these characteristics as they were particularly suitable for describing and quantifying the diversity of the learning groups who participated in this study. We are aware that these characteristics may play a part in categorizing children, contradicting the basic idea of inclusion. However, at least in Germany, characteristics like reading literacy or socioeconomic status show a major impact on school success. We therefore decided to include these characteristics in our study, as the information gained may help advance inclusive teaching.
Table 7. Collected learner characteristics.

Sample
The main study included 348 learners (male = 189; female = 193; mean age 12.2 (SD 0.74)). The learners were from integrated comprehensive schools (IGS) in Lower Saxony, Germany. IGSs stand out in Germany for being the first schools to implement inclusive education. Sixteen learners required special needs education (learning = 12, language = 4), corresponding to a proportion of 4.6% and therefore above average compared to the 3.9% at general education schools in Lower Saxony in the school year 2014/15 [53].

Procedures of Data Analysis
As a first step, we compared the original with the UDA test version and sought to identify a set of items present in both versions with which to evaluate the development of EBs. For this purpose, we calculated and compared McDonald's ω as a reliability coefficient [54] and conducted analyses of measurement invariance using longitudinal confirmatory factor analyses (LCFA). We also checked for instructional sensitivity [55] by using a multiple indicator, multiple cause approach (MIMIC approach; as applied, for example, in Sideridis et al. [56]). By introducing a variable representing the type of learning environment as a predictor of the latent factor as well as of the items into the longitudinal model (pre- and post-test), the MIMIC approach is suitable for indicating differences between both test versions in measuring the development of EBs. We also used t-tests on the item level to check for differences between pre- and post-test. Based on these analyses, we identified a comparable set of items for further analyzing the effects of the learning environments.
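To make the reliability coefficient concrete, the following minimal sketch computes McDonald's ω from standardized factor loadings. It assumes a unidimensional model with uncorrelated residuals, so that each item's residual variance equals one minus its squared loading. The function name `mcdonald_omega` and the example loadings are hypothetical and are not taken from the study's data or analysis scripts.

```python
def mcdonald_omega(loadings):
    """McDonald's omega for a one-factor model with standardized loadings.

    omega = (sum of loadings)^2 / ((sum of loadings)^2 + sum of residual variances),
    where each residual variance is assumed to be 1 - loading^2.
    """
    if not loadings:
        raise ValueError("at least one loading is required")
    common = sum(loadings) ** 2
    residual = sum(1.0 - lam ** 2 for lam in loadings)
    return common / (common + residual)


# Hypothetical standardized loadings for a four-item short scale
example_loadings = [0.6, 0.7, 0.5, 0.65]
omega = mcdonald_omega(example_loadings)
print(round(omega, 3))
```

Because the numerator grows with the summed loadings, a scale whose items load more strongly on the common factor yields a higher ω, which is the sense in which the two test versions can be compared.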
To gain insights into the UDA-implementation, in a second step we used this set of items to re-check measurement invariance and to compare the accessibility of the assessment versions using a graphical analysis for differential item functioning (DIF). We compared item difficulty for each subgroup by using the learner characteristics data to build up subgroups within the sample. The mean scores for reading literacy, intelligence, and socioeconomic status were calculated. The proportion with special educational needs in the sample, however, was too small for a separate evaluation. A difference of one standard deviation from the mean was chosen as the cut-off criterion for forming groups.
Differences in item difficulty for a particular subgroup would indicate differences in test accessibility with regard to a trait that is important for the diversity of learners [36].
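The subgroup formation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: learners more than one standard deviation below or above the sample mean of a characteristic form the two comparison groups, and item difficulty is approximated by the mean item score per group. All function names and data are hypothetical.

```python
from statistics import mean, stdev


def split_by_sd(scores, k=1.0):
    """Return (low, high) sets of learner ids lying more than k sample
    standard deviations below or above the mean of the characteristic."""
    values = list(scores.values())
    centre, spread = mean(values), stdev(values)
    low = {sid for sid, v in scores.items() if v < centre - k * spread}
    high = {sid for sid, v in scores.items() if v > centre + k * spread}
    return low, high


def item_means(responses, group):
    """Mean score per item for the learners in `group`
    (a simple stand-in for item difficulty, assumed for illustration)."""
    rows = [responses[sid] for sid in group]
    return [mean(col) for col in zip(*rows)]


# Hypothetical reading-literacy scores and responses to two items
reading = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50, "f": 60}
responses = {"a": [1, 2], "b": [2, 2], "c": [3, 3],
             "d": [3, 4], "e": [4, 4], "f": [4, 3]}

low, high = split_by_sd(reading)
low_means = item_means(responses, low)
high_means = item_means(responses, high)
print(low_means, high_means)
```

Plotting the per-item means of the two groups against each other would give a graphical check of the kind described: items far from the diagonal behave differently for the two subgroups.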
In a third step, we specified a multi-group panel model that included pre- and post-tests as well as the learner characteristics and the type of learning environment (UDL or MR) as covariates. This allows us to model the learning gains in the context of EBs, the impacts of the learner characteristics, and the impact of the type of learning environment in one step. If the covariate learning environment shows a significant relation to the EB measures, this would be an indicator for an outperformance of the UDL environment (as the UDL environment was coded with one). In order to acknowledge the students' individual learning gains in this quantitative setting, longitudinal plots were calculated. These plots show one line per student connecting the measurement points, displaying the whole range of the sample, as well as a mean line indicating the mean learning gain of the whole sample.
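The data behind such a longitudinal plot can be sketched in a few lines. The example below is only an illustration with hypothetical scores: it pairs each student's pre- and post-test score (one line per student) and computes the mean line for the whole sample. The names `trajectories` and `mean_trajectory` are assumptions and do not refer to the software used in the study.

```python
from statistics import mean


def trajectories(pre, post):
    """Pair pre- and post-test scores for every student present at both
    measurement points: one (pre, post) line per student."""
    return {sid: (pre[sid], post[sid]) for sid in pre if sid in post}


def mean_trajectory(lines):
    """Mean pre- and post-test score across all student lines."""
    pre_mean = mean(p for p, _ in lines.values())
    post_mean = mean(q for _, q in lines.values())
    return pre_mean, post_mean


# Hypothetical scale scores at the two measurement points
pre_scores = {"s1": 2.0, "s2": 3.0, "s3": 2.5}
post_scores = {"s1": 3.0, "s2": 2.5, "s3": 3.5}

lines = trajectories(pre_scores, post_scores)
gains = {sid: q - p for sid, (p, q) in lines.items()}
average_line = mean_trajectory(lines)
print(gains, average_line)
```

In this toy sample, one student's score decreases while the mean line rises, which shows why individual lines carry information that the mean gain alone hides.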
As the learning environments mainly focused on fostering the justification dimension, we will mainly present the results for this dimension. This will then be discussed to provide consistent and structured insight into the presentation of results. All further results will be provided in Appendix A.

Step One: Item Selection
In the first analysis, the items of the scales were analyzed and evaluated following the process described above. The analyses showed that items with a low factor loading are those that do not show a significant mean change in either assessment form (Table 8). Items with a sufficiently high standardized factor loading were selected to form the new scales, and items with significant mean changes despite a low standardized factor loading were also included. Consequently, items 2, 3, 4, and 6 of the justification scale were relevant for further analysis and were converted into a short scale. The results of the selection process are documented in Appendix A. The other three scales were analyzed accordingly, and the number of items was reduced in the form of a short scale. The internal consistency of the short scales was in an acceptable to good range. The exception was the justification scale in the original assessment. An increase in the consistencies from the first to the second measurement point can be seen (Table 9).

Step Two: Checking Test Accessibility of Both Versions
Longitudinal measurement invariance (MI) testing of the justification short scale showed that the data supported configural, metric (∆CFI = −0.007; ∆χ2 = 5.87, p = n.s.), and full scalar MI (∆CFI = 0.002; ∆χ2 = 8.55, p = n.s.). The quality criterion for strict MI was not met (∆CFI = 0.013), but the χ2-difference test did not establish a significant difference between the models (∆χ2 = 12.73, p = n.s.). The remaining fit indices were in the good to very good range. Consequently, it can be assumed that the data supported strict MI (Table 10).
For the test accessibility, the data from the short scales at the first measurement point were used, since the second measurement point was already influenced by the learning environment. The learner characteristics reading literacy, intelligence, gender, and socioeconomic status were used to examine the items' group dependency. We chose one standard deviation from the mean as the cut-off parameter to form groups for the analysis.
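The ∆CFI comparison reported above for the MI steps can be illustrated with a short sketch. The cut-off of 0.01 is an assumption here: it is a commonly used convention and is consistent with the reported pattern, but the text does not state the exact criterion. The function name is hypothetical, and the absolute value is used because sign conventions for ∆CFI differ between sources.

```python
# Assumed conventional cut-off for a tolerable change in CFI between nested models
DELTA_CFI_CUTOFF = 0.01


def invariance_supported(delta_cfi, cutoff=DELTA_CFI_CUTOFF):
    """Return True if the change in CFI stays within the assumed cut-off."""
    return abs(delta_cfi) <= cutoff


# Delta-CFI values as reported in the text for the justification short scale
reported = {"metric": -0.007, "scalar": 0.002, "strict": 0.013}
results = {step: invariance_supported(d) for step, d in reported.items()}
print(results)
```

Under this assumed rule, only the strict step exceeds the cut-off, which is why the χ2-difference test and the remaining fit indices are consulted for that step.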
Regarding the statistical parameters, no significant differences can be observed for the item difficulty of both assessment versions (reading literacy: t(328) = 1.65, p = n.s.; intelligence: t(337) = 1.34, p = n.s.; socioeconomic status: t(332) = 0.21, p = n.s.). Using these criteria, a total of 137 at-risk learners can be identified who meet at least one criterion. This corresponded to 40% of the total sample.
The items of the justification scale of the UDA assessment showed measurement invariance in all four group comparisons (Figure 3). The original assessment showed a similar situation except for item 2. There, students with higher intelligence were more successful than those with lower intelligence (Figure 4). Besides, a lower socioeconomic status led to a higher solution probability. Overall, the UDA assessment was minimally more accessible than the original assessment at the first measurement point regarding the justification scale.

Summarizing the Results and Answering the Research Questions
Using a quasi-experimental study, we investigated the impact of using an extensive and a focused UDL setting on the development of epistemic beliefs in science. We used a 2 × 2 between-subjects design to examine the impact of adapting an EB questionnaire for research in inclusive settings.
Regarding the first research question, our results show that the UDA version yields more adequate test statistics. The UDA assessment has a higher overall inter-item correlation than the original one. Furthermore, the internal consistency of both assessment variants increases towards the second measurement point. Yet with the UDA assessment, a higher consistency can already be assumed at the first measurement point according to McDonald's ω. However, comparing the learning gains, the UDA-based version indicates increased acceptance of sophisticated views on the justification of scientific knowledge, whereas the original version indicates an increased variance with a comparably stable mean: students showed increased as well as decreased acceptance of sophisticated views.
We assume that this effect is due to test barriers in the original questionnaire. We think that students working with the UDA version understand the items better at the first measurement point. Some students with the original version might need the learning environment to elaborate on their understanding. They might answer the original version in what they deem a purposeful manner only at the second measurement point. This might lead to a decreased acceptance of sophisticated beliefs, so that, all in all, the original version might not show the elaboration of beliefs. To follow up on this finding, qualitative studies involving cognitive interviews such as proposed by Kuusela and Paul [57] or Ryan, Gannon-Slater, and Culbertson [58] might be fruitful. They would have to reflect, however, the diversity of students in inclusive learning settings.
Regarding the second and the third research questions, the UDA version of the questionnaire indicates an elaboration of students' views on the justification of scientific knowledge. However, the multi-group panel models do not indicate a significant impact of the variable "learning environment." This means that we could not detect that students learning in the extensive UDL environment outperformed those who learned in the MR environment.
We discuss these findings with regard to four implications:
Implication no. 1: In inclusive settings where quantitative research is conducted, test accommodation plays a significant role. Quantitative instruments should be used with care.
Aiming at a barrier-minimized learning environment is undoubtedly a good step toward enabling all students to participate. When conducting research, however, barriers can be set up again, which can disadvantage particular students and lead to biased research results. Adjustments such as extending the processing time or reducing the number of items do not seem appropriate [59][60][61]. However, the principles of the UDA allow the barriers to be minimized without changing the target construct. If researchers minimize barriers in the assessment, it is important not to change the actual target construct in order to avoid unsystematic scoring patterns [62]. Within the framework of UDA, further adjustments are also possible and beneficial. Zydney, Hord, and Koenig [63] show that video-based assessments for students with learning disabilities may be an excellent way to minimize barriers. Furthermore, there is a need to investigate accessibility through auditory representations [64]. Future projects in UDL-oriented research that contain quantitative approaches might benefit from adding qualitative research on the assessment, the processes of working with the assessment tool, and its possible barriers.
Implication no. 2: The UDL principles should be applied with care. "The more, the better" does not seem to be applicable.
This study could not detect significant advantages of the extensive UDL learning environment. Of course, non-significant findings might be explained by methodological effects such as too much error variance in the data. The reliability and DIF analyses, however, indicate a relatively acceptable amount of noise in the data. The effect of applying more UDL principles does not seem strong enough to hold its ground against the remaining data noise.
It is more likely that using a video as a tool containing multiple representations might be enough to decrease barriers for elaborating EBs. This was also shown in the preliminary study where interviews indicated that the embedded video already had advantages over the other representations. Since both learning environments use the video, the advantages of the UDL over the MR learning environment may be leveled out.
Implication no. 3: The UDL principles should be introduced with care. "The more, the better" might not be applicable in the long run. UDL also means changing a learning culture.
As this study was carried out with students in inclusive schools who did not work with UDL, the UDL learning environment might have been too complex to outperform the MR framework in the first place. We do not have any data on when students become familiar with learning with UDL environments. Since UDL is fundamentally different from monomodal teaching, its integration into the school routine may need to be ritualized over a longer period to unleash the full potential.
Implication no. 4: An unanswered question is how students' learning behavior in a UDL learning environment leads to an increased outcome for all students. Learning analytics could fill this gap in research.
The learning environment was technically realized with an eBook app. When the study was carried out, it was not possible to track the students' learning progress. Qualitative research might be one way to gain more insights into the learning processes. Against the multitude of students' characteristics, future research may be able to draw on technological advances in learning analytics and machine learning in the sense of collecting and analyzing page view times, general usage of the eBook contents, or clickstreams. This makes it possible to "intelligently" process vast amounts of data beyond human capability. Thus, patterns can be detected, learning paths can be recorded, and extensive analysis can be performed. One challenge that future research will face is balancing the individuality of students' learning and the categories that learning analytics and machine learning systems would use to make sense of students' learning. Currently, there are already existing systems that can track the learning path of students with machine learning. With the help of log files, it is possible to identify students' behavior regarding "gaming" the system [65] or the potential to identify student modeling practices more extensively in a way that has not been possible before thanks to machine learning [66]. Nevertheless, the need and benefit for systems that use machine learning are also evident concerning UDA. Through an "intelligent" system, future systems could adapt the assessment individually to the student [63].
All in all, this study might be a confirmatory approach to the UDL literature that focuses on an important research gap [6,7,13]. Our results might contribute to raising even more questions than we can answer in one study. Therefore, proposition #10 stated by Edyburn in 2010 [9] (p. 40), "UDL Is Much More Complex Than We Originally Thought," still seems applicable.

Acknowledgments: Special thanks go to the project "Didaktische Forschung".

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A.
Appendix A.1. Process of Item Selection for the Justification Scale
Table A1. The internal consistency of the justification scale before the item set was shortened [39].
Table A12. Reformulated source, certainty, and development short scales with the standardized factor loadings, mean differences, and associated Bonferroni-corrected significances [39].
Longitudinal plots of the trajectories of all students on the source, certainty, and development scales from both assessments (UDA assessment n = 175; original assessment n = 165). The average is shown in black [39].