Assessing the Word Recognition Skills of German Elementary Students in Silent Reading— Psychometric Properties of an Item Pool to Generate Curriculum-Based Measurements

Given the high proportion of struggling readers in school and the long-term negative consequences of underachievement for those affected, the question of prevention options arises. The early identification of central indicators of reading literacy is a noteworthy starting point. In this context, curriculum-based measurements have established themselves as reliable and valid instruments for monitoring learning progress. This article is dedicated to the assessment of word recognition in silent reading as an indicator of adequate reading fluency. The process of developing an item pool is described, from which instruments for learning progress diagnostics can be derived. A sample of 4268 students from grades 1–4 processed subsets of the items. Each test template included anchor items, which all students processed. Using item response theory, item statistics were estimated for the entire sample and all items. After eliminating unsuitable items (N = 206), a one-dimensional, homogeneous pool of items remained. In addition, there are high correlations with an established reading test. This provides first evidence that the assessment of word recognition skills in silent reading can be seen as an economical indicator of reading skills. Although the item pool forms an important basis for the extraction of curriculum-based measurements, further investigations to assess its diagnostic suitability (e.g., the measurement invariance over different test times) are still pending.


Introduction
Learning disabilities have numerous negative impacts on the educational progress of affected students. Not only are school difficulties assumed to persist throughout a student's school career [1-3], but the problems can also be observed up to adulthood [1]. Against the background of these findings, the proportion of children with learning problems in Germany is very worrisome. In a Germany-wide survey in 2016, the reading performance of almost 30,000 fourth graders was examined against nationwide standards [4]. In the sample, more than 12% of the children did not reach the minimum standard. These children are able to read simple texts and understand their meaning, but only if the information is explicitly stated in the text. To ensure a successful transition to secondary education, intensive support for these children is necessary. Only about 10% achieved the optimum standard, which means that these children can cope with clearly more demanding requirements. They can think independently about texts, grasp topics and motives that

Reading Fluency
One of the most important goals of school attendance for all primary school students is the successful acquisition of reading skills. From the transition to secondary school onwards, it is assumed that students are able to independently extract and understand information from texts [11]. Historically and currently, much importance is attached to reading fluency in connection with the acquisition of reading skills [12,13]. The literature reveals a multitude of attempts to define reading fluency [14-16]. Current approaches see fluency as the result of a successful interplay of different basic competencies [13,17]. The US National Reading Panel [16] published a frequently cited definition, describing fluency as the ability "to read orally with speed, accuracy, and proper expression" [16] (p. 5). Accordingly, reading fluency is a process of appropriate recoding and decoding of what has been read, the quality of which depends on various aspects, such as reading accuracy and phonological, orthographic, and morphological abilities [12,13,17-19].
Reading fluency, however, is not only the outcome of a successful combination of different partial competences, but is also often seen as a prerequisite for higher reading skills [20,21]. In this context, emphasis is placed on the speed of reading [12,22,23]. The basic assumption is that insufficient word reading skills (slow, halting, and erroneous decoding) are an obstacle to the combination of individual pieces of information into larger units of meaning. This in turn complicates the processes of activating prior knowledge, integrating new information into existing knowledge structures, and metacognitive control [21]. Only when the word reading process is automated do resources become available for higher forms of information processing, i.e., more complex reading processes [16,24-26]. Especially in the first years of school, clear connections between the ability to quickly recognize and decode words and reading comprehension can be empirically demonstrated [27,28]. With increasing reading experience, the students' mental lexicon expands, and frequent combinations of sounds, morphemes, and words are stored and linked. Automated word recognition and rapid decoding thus form the basis for appropriate reading fluency at the sentence and text level, which is in turn necessary for the successful understanding of meaning [12,29].
Against the background of the different understandings of reading fluency listed above, it becomes clear that early interventions are necessary to promote reading fluency in order to prevent reading problems [20,30,31]. Reading fluency should therefore be understood in a development-oriented way [20]. Attention should be paid to it even at the beginning of reading acquisition. At this point, this concerns aspects of phonemic segmentation, alphabetic understanding, phonics, and orthography [12,30], as well as word recognition [32]. As with reading fluency, word recognition skills can be seen as an outcome, as well as a predictor. As a result of the interplay of letter and sound knowledge, as well as decoding abilities, word recognition skills serve as an outcome variable [16,33].
According to the National Reading Panel [16], fluency is the direct result of successful word recognition. Overall, word recognition skills can be assumed to be a potential indicator of reading fluency [12,13]. In this sense, the assessment of word recognition skills (the number of words identified correctly within a limited time span) over time plays an important role in preventing reading difficulties. According to the study by Speece and Ritchey [34], word recognition skills develop alongside the very first reading processes and are therefore already important in grade 1. By the end of grade 2, most students should have acquired fundamental word recognition skills [35], and by the middle of grade 4 at the latest [36]. The assessment of basic reading skills, such as the precise recoding and decoding of words, should be a goal of instruction in the first grades at school [33]. Based on these data, further pedagogical decisions can be made.

Assessing Reading Fluency with Curriculum-Based Measurements
Curriculum-based measurements (CBM) [37] are a very prominent approach for the progress monitoring of academic skills. CBM were developed in the USA and already have a long research tradition there, especially in the fields of reading, writing, and mathematics [37-39]. The original aim of CBM was to provide teachers working in special education with reliable and valid data for assessing the development of students, in order to support instructional decisions [37]. These short test procedures can be administered regularly at short intervals. Within a time limit of only a few minutes, the children have to solve as many tasks of a test as possible. Through repeated administration, students' academic progress can be monitored [40]. On the one hand, CBM can be easily implemented in school routines. On the other hand, the instruments must correspond to the current standards of psychological tests, so that the results can be clearly interpreted.
Depending on the domain of use, CBM may refer to separate competencies that are curricularly identified for the area (e.g., CBM for addition tasks in the numerical range up to 20) or that can be regarded as an indicator of general outcome (e.g., reading aloud as an indicator of general reading skills). Alternatively, they may bundle different partial competences relevant to the domain in a single instrument (CBM with mixed tasks for calculating sizes, for factual tasks, etc.) [37,40-43].
The origin of CBM research lies in the field of reading. Accordingly, many methods have been published in this domain [43]. Reading fluency is often assessed by reading aloud individual syllables, words, or texts [44]. The working time is limited to one minute. The test leader documents the correctly read syllables or words.
A large number of research findings show, in particular, that measures of fluency are relevant with regard to students' reading skills [45-48]. According to Fuchs et al. [44] and Reschly, Busch, Betts, Deno, and Long [49], oral reading fluency can be assumed to be a reliable indicator of overall reading competence.
While much attention has been paid to oral reading, there is a lack of research related to the silent reading of students [35,50]. One justification for this can be found in the conclusions of the National Reading Panel [16]. Accordingly, there is a lack of empirical research on the effectiveness of silent reading experiences [51]. Therefore, adequate time should rather be given to reading aloud in class [51-53]. In reality, however, silent reading is the most important form of reading from the first grade onwards [50]. Empirical findings indicate that there is a high correlation between oral reading and silent reading. This is particularly true for gifted readers and in higher grades [53-55].

Research Questions
Our research refers to the word recognition skills of elementary school students. In order to identify struggling elementary school students early, we want to generate CBM to assess their word recognition skills. To create different CBM instruments, we designed an item pool from which items can be flexibly selected according to content-related as well as psychometric aspects. The aim of the study presented here is to test the psychometric suitability of the item pool.

Methods
The psychometric suitability of the items from the item pool was tested using common item parameters (item difficulty, selectivity, and fit to the one-dimensional Rasch model). In addition, coefficients of reliability and validity were determined.

Design
The items of the generated item pool were distributed and piloted within a multimatrix design [56]. For each grade level, the items were divided into eight different word lists. Due to the multimatrix design, each list contained a proportion of identical words (so-called anchor items) shared within a grade level and between grade levels. The tests were administered by the teacher in the middle of the school year without the time limit that is otherwise usual for CBM, in order to be able to calculate characteristic values for each item.
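The linking principle of such a design can be illustrated with a small Python sketch that distributes a hypothetical set of items over booklets sharing a common block of anchor items (all counts and names are illustrative, not the study's actual allocation, which additionally balanced content and grade level):

```python
import random

def build_booklets(items, n_booklets=8, n_anchors=10, seed=42):
    """Distribute items across booklets in a simple multimatrix design:
    every booklet contains the same anchor items (for linking) plus a
    disjoint share of the remaining unique items."""
    rng = random.Random(seed)
    pool = list(items)
    rng.shuffle(pool)
    anchors, unique = pool[:n_anchors], pool[n_anchors:]
    per_booklet = len(unique) // n_booklets
    booklets = []
    for b in range(n_booklets):
        share = unique[b * per_booklet:(b + 1) * per_booklet]
        booklets.append(anchors + share)
    return booklets
```

Because every pair of booklets overlaps exactly in the anchor block, responses from different booklets can later be placed on a common scale.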
The multimatrix design of the test templates made it possible to generate a cross-linked data set. Analyses based on item response theory (IRT) allowed the determination of psychometric parameters for all items based on the total sample. Since the present data matrix is dichotomously coded as "correctly solved" or "incorrectly solved", a dichotomous Rasch model was estimated. The Rasch analyses were performed with the statistics program R [57] using the pairwise package [58]. The model fit of the items was judged by their estimated infit values. Since the outfit statistics are strongly influenced by outlier values, whereas the infit values are more sensitive in the range of medium ability values [59], the infit statistics were primarily examined in the present study for deviations from the expected value of 1 (0.70 ≤ infit ≤ 1.30) [60]. For further analysis of the quality of the items, common item statistics (difficulty and selectivity) were calculated.
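The infit criterion can be made concrete with a short Python sketch (the analyses in the study were run with the R pairwise package; this sketch only illustrates the statistic itself, computed from given person abilities and a given item difficulty):

```python
import numpy as np

def rasch_prob(theta, sigma):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - sigma)))

def infit_mnsq(responses, theta, sigma):
    """Infit mean-square for a single item: the information-weighted mean of
    squared residuals. Values near 1 indicate good fit; the acceptance
    window used here is 0.70 <= infit <= 1.30."""
    p = rasch_prob(theta, sigma)   # expected score for each person
    w = p * (1.0 - p)              # binomial variance = statistical information
    return ((responses - p) ** 2).sum() / w.sum()
```

For perfectly model-conforming data the statistic is close to 1; values above 1.30 signal underfit (more noise than the model expects), values below 0.70 overfit (responses more deterministic than expected).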
In order to check for differences in item difficulties between boys and girls (test fairness regarding gender), a graphical model test was carried out to assess the measurement invariance of the items.
To analyze the reliability, Cronbach's α was computed. Construct validity was tested on the basis of a correlation between the items of the item pool and an external criterion (the ELFE-II test).
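As a sketch of the reliability computation, Cronbach's α for a persons-by-items score matrix can be written in a few lines (illustrative Python; the study computed α per test template):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a persons x items score matrix,
    using sample variances (ddof=1)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                               # number of items
    item_vars = scores.var(axis=0, ddof=1).sum()      # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of total score
    return k / (k - 1) * (1.0 - item_vars / total_var)
```

When all items order the persons identically, α approaches 1; uncorrelated items pull it toward 0.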

Sample
A total of 4268 elementary school students took part in the evaluation of the item pool. Table 1 gives an overview of the distribution of the children among the different grades. A subset of these children (N = 178) solved the tasks of the item pool as well as a German reading comprehension test ("Ein Leseverständnistest für Erst- bis Siebtklässler – Version II", ELFE II; see below).

The Item Pool
Against the theoretical background outlined above, items for assessing word recognition skills were developed and compiled into a comprehensive item pool. In order to form a suitable item pool for assessing the word reading skills of children of primary school age, various considerations are necessary, integrating linguistic, literary, scientific, and curricular analyses. The formal design of the items was based on economical and pragmatic considerations; thus, they had to be feasible as group procedures in class. In line with the previously described significance of silent reading [35,50,51], it was decided that the students should identify a real target word from a selection of pseudowords (e.g., "Maelr"-"Maler"-"Melar"-"Mlaer"; target word: "Maler" = painter).
To generate the item corpus used in this study, an analysis of various common textbooks was carried out. An intersection of the word material was created and compared with the available minimum vocabulary for the primary school sector in Germany. On this basis, 1277 words could be identified as relevant word material for primary schools. The words of the item pool were structured according to different aspects (word type, number of letters, number of syllables, and number of graphemes, as well as phonological, morphological, and orthographic peculiarities) and their occurrence across grade levels. For each word in the item pool, we designed pseudoword distractors. Every distractor shows a visual proximity to the target word. For each item, pseudowords were chosen that have letter combinations valid in German, as well as ones that are unpronounceable in the German language.
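The construction principle of the distractors can be illustrated with a minimal Python sketch that permutes the letters of a target word. This is purely illustrative: in the study, the pronounceability and grapheme validity of the pseudowords were additionally controlled, which this sketch does not do.

```python
import itertools

def distractors(word, n=3):
    """Generate pseudoword distractors with visual proximity to the target:
    permutations of the target's letters that differ from the real word."""
    seen, out = {word}, []
    for perm in itertools.permutations(word):
        cand = "".join(perm)
        if cand not in seen:   # skip the real word and duplicates
            seen.add(cand)
            out.append(cand)
        if len(out) == n:
            break
    return out
```

Each generated distractor contains exactly the letters of the target word, mirroring the "Maelr"/"Melar"/"Mlaer" format of the items.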
Within the prepilot, a total of 533 children of the first to the fourth grade solved between 40 and 50 items according to the described task format, depending on the grade level. The distribution among the different grades is shown in Table 2. An analysis of the student outcomes (frequency of solutions) and interviews with the teachers and students (difficulty with tasks and possible remarks) indicate that the task format is understandable for students in elementary schools, that teachers consider it appropriate, and that there is a high variance of outcomes, i.e., it can differentiate between different achievement levels.

The ELFE-II Test
In addition to the items of the item pool, some of the children worked on an established instrument assessing the reading fluency, reading accuracy, and reading comprehension of German-speaking children at the word, sentence, and text level (ELFE-II) [61]. To test reading comprehension at the word level, the children had to choose, within a limited time span, the correct word out of a list of four for a given picture. At the sentence level, the children had to separate the correct word from four given distractors, and at the text level, the children were asked to answer multiple-choice questions on short texts. The ELFE-II test can be used in an individual or group session from the end of the first grade to the beginning of the seventh grade. The reliability (split-half reliability: r_tt = .96; retest reliability: r_tt = .93; parallel-forms reliability: r_tt = .93) and concurrent validity (correlation with another reading test: r = .77; correlation with the teacher's judgment: r = .70) of the instrument have been demonstrated. Construct validity was determined using structural equation models. In addition, validity studies are available for children with diagnosed reading and spelling disorders and for children from different school types [61].

Results
Due to the scaling according to IRT, empirical characteristic values (difficulty, selectivity, and model fit statistics) considered during item selection are available for each item. The item difficulties σ result from the threshold values estimated in the Rasch model (with dichotomous response options). The σ-values can be interpreted like z-values, so that a value of zero corresponds to an average difficulty. Values below zero indicate words that were easier for the children to read; values above zero indicate more difficult items. The selectivity values were calculated as point-biserial correlations of the item raw scores with the respective total score of the test template. These values indicate how well an item discriminates between students with low and high levels of performance. A value close to one means that the item assesses the same aspect as the overall test; a value close to zero indicates that an item has little in common with the overall test. In this study, a value of r_pbis = 0.20 or above served as the minimum criterion.
Of the items analyzed, 206 showed too little selectivity (r_pbis < 0.2) or an underfit or overfit in the infit statistics (fit < 0.70 or fit > 1.30). These items were eliminated from the item pool for further analysis. The reduced item pool was then scaled again using a one-dimensional Rasch model. The selectivity values were at least r_pbis = 0.20, the maximum was r_pbis = 0.66, and the average value was r_pbis = 0.41. The item fit statistics (infit MnSq) varied between min = 0.70 and max = 1.30, with a mean infit MnSq of 0.92. This indicates that there were no model violations and that all remaining items meet the requirements of a one-dimensional Rasch model. All items thus form a one-dimensional scale; i.e., they measure the same construct.
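The selection rule applied here can be summarized in a few lines of Python (illustrative only; the function names are ours, and the actual analyses were run in R):

```python
import numpy as np

def point_biserial(item_scores, total_scores):
    """Point-biserial selectivity: Pearson correlation between a
    dichotomous item score and the test-template total score."""
    return np.corrcoef(item_scores, total_scores)[0, 1]

def keep_item(r_pbis, infit):
    """Retention rule used for the item pool: discard items with
    r_pbis < 0.20 or an infit MnSq outside [0.70, 1.30]."""
    return r_pbis >= 0.20 and 0.70 <= infit <= 1.30
```

An item with the reported average values (r_pbis = 0.41, infit = 0.92) passes this rule, while items below the selectivity cutoff or outside the fit window are dropped.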
The mean item difficulty was σ = 0.00, and the values ranged from min = −2.99 to max = 3.68. The item pool thus covered a range from very easy to very difficult items. An analysis of the item difficulties separated by grade level showed an increase in the mean values with a constant standard deviation (see Table 3); i.e., the items became easier with increasing grade level. The results show that there is a wide range of item difficulties at every grade level. Due to the use of the Rasch model, it was also possible to map the item and person parameters onto the same scale. The person parameters (WLE; weighted maximum likelihood estimates) [62] were determined by pairwise item comparison [63,64]. This method is particularly suitable for data sets with missing values [64,65]. The WLE can be used to assess the appropriateness of the degree of difficulty of the items. The person-item map (see Figure 1) shows the person parameters as histograms, as well as the item difficulties. It becomes clear that the measurement range of the items essentially corresponds to the distribution of the person parameters. However, one can see that there is a lack of items for students with particularly high skills.

In order to analyze the differences in item difficulties between boys and girls, the item parameters estimated separately by gender were plotted against each other on the x- and y-axes (see Figure 2). If the parameters are constant across the sexes, they run along the bisecting line. Here, there are differences in the item difficulties of individual items. A few items (N = 21) showed very large deviations from the bisecting line. An analysis of variance showed a significant influence of sex (F(1, 4115) = 9.753, p < .05), but the effect was only small (η² = .002 or d = 0.10).

Discussion
The aim of the present study was to design an item pool to assess the word recognition skills of elementary school students. The importance of word recognition in the context of reading fluency was established: overall, word recognition can be seen as a potential indicator of first reading skills [12,13]. To increase the economy of assessment in school practice, the items were conceived as tasks for silent reading. This enables an assessment of the whole class at once. It is assumed that these results are largely related to oral reading skills [53,54].
Another aim of the study was to verify the psychometric suitability of these items. For this purpose, the items were distributed over different test templates, which were linked to each other by means of anchor items. Thus, not every child in the sample had to process all the items, but it was still possible to determine item and person estimates for each item and for all students using item response theory. Of the original 1277 items, a total of 1071 items met the previously set criteria. The other items were dropped due to unfavorable selectivity or an over- or underfit to the computed Rasch model. Overall, the reliability of the individual test templates is high (lowest average α = .85), which speaks for the homogeneity of the items. The correlation with the external reading test is also high (r = .64). This shows that word recognition skills are related to reading speed and reading comprehension.
Although there are indications that some of the items are of different difficulty for girls and boys (in the sense of test fairness), these differences can be regarded as minor. No items measure in the upper performance range, which is another limitation of this study. However, it can be argued that although word recognition is highly correlated with other reading skills, such as passage comprehension throughout primary school [47], it is particularly important in the first years of primary school [34]. From the third grade onwards, it can be assumed that students have largely acquired word recognition skills [35,36]. In this respect, possible ceiling effects are to be expected in higher grades. The items of the item pool therefore differentiate particularly in the lower performance range. Against this background, however, they can be used for screening purposes. Overall, the targeting of the items appears to be adequate. Though word recognition skills seem to be a potential indicator of reading skills, they are not sufficient to diagnose higher reading skills. The use of further test instruments should be considered here.
Based on the item pool, CBM with parallel forms of the same structure were developed, which can be used every four weeks during a school year (10 parallel forms in each grade level). A proportion of easier (−2.5 ≤ σ < −1), medium (−1 ≤ σ < 0), and more difficult items (0 ≤ σ ≤ 1) was selected to map different areas of competence. In further investigations, the suitability of the developed CBM for progress monitoring will be examined. In addition to classical quality criteria (objectivity, reliability, and validity), progress monitoring criteria must also be fulfilled (Fuchs, 2004). The study presented here uses only cross-sectional data. Thus, no direct information can be derived on the suitability of the items for progress monitoring purposes. The scaling according to item response theory, however, is to be seen as a meaningful addition to classical test theory, as it allows first statements about the suitability for progress monitoring (high reliability, unidimensionality of the measured construct, constant item difficulty, and high test fairness) [67-69]. In a further step, the measurement invariance over different test times should be investigated. In addition, the sensitivity to change, as well as the applicability and effectiveness in the school context, should be examined [42].
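A minimal sketch of such a difficulty-stratified form assembly, assuming a calibrated pool that maps each word to its difficulty σ (names, band sizes, and the deterministic selection are illustrative):

```python
def assemble_form(calibrated, n_per_band=10):
    """Draw a parallel CBM form from a calibrated pool by taking a fixed
    number of items from each difficulty band: easy (-2.5 <= s < -1),
    medium (-1 <= s < 0), and difficult (0 <= s <= 1).
    `calibrated` maps each word to its Rasch difficulty sigma."""
    easy      = [w for w, s in calibrated.items() if -2.5 <= s < -1.0]
    medium    = [w for w, s in calibrated.items() if -1.0 <= s < 0.0]
    difficult = [w for w, s in calibrated.items() if 0.0 <= s <= 1.0]
    form = []
    for band in (easy, medium, difficult):
        form.extend(sorted(band)[:n_per_band])  # sorted() for determinism
    return form
```

In practice, items would be sampled randomly per band so that repeated forms stay parallel in difficulty without reusing the same words.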
A calibrated item pool, as described in this study, provides many advantages. Different instruments can be flexibly derived from such a pool. It is also possible to realize adaptive testing, whereby the item selection in the concrete test situation depends on the ability of the child, enabling more precise measurement at the child's ability level. In this context, the use of digital media appears particularly useful [35,70]. In addition, the time taken to process the items can be measured with the aid of a computer. This makes it possible to dispense with a time restriction on processing, which can put increased pressure on students. A further advantage is the possible combination of diagnostic information and training material: computer-aided training programs can react adaptively to the results of an upstream diagnosis. Digital technologies offer the potential to support struggling readers; however, little systematic research has focused on the effect of technology on reading skills [71]. For silent reading, the research base is even sparser.
Future research will concentrate on factors that influence the difficulty of the items of the word pool. Possible variables in this context are structural features of the words (word type, word length, number of syllables or graphemes in a word) and phonological, morphological, and orthographic characteristics and occurrences in textbooks according to grade level.
Author Contributions: The authors contributed equally to the conceptualization, writing, and revision of this article. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the Ministry of Education, Science and Culture of Mecklenburg-Western Pomerania/Germany.