3.1.1. Sample and Procedure
A total of 432 pre-service biology teachers enrolled in 12 universities throughout Germany were recruited: 79.8% aspiring to teaching careers in the academic track and 20.2% in the nonacademic track. The participants' average age was 23.4 years (SD = 2.8; 19.8% male): 23.3 years (SD = 2.8; range 18–43 years; 21.9% male) in the academic-track group and 23.7 years (SD = 2.82; range 19–35 years; 11.5% male) in the nonacademic-track group. The participants had attended 5.9 semesters in higher education on average (SD = 2.8): 5.8 semesters (SD = 2.8; range 2–17) in the academic-track group and 6.0 semesters (SD = 2.9; range 1–14) in the nonacademic-track group. Almost half (43.9%) studied biology together with a second science subject (46.4% of the academic-track group, 34.5% of the nonacademic-track group), while the others studied biology and a second non-science subject.
The tests in this evaluation were administered in the same manner as those in Evaluation 1 in terms of duration (including breaks), local conditions, and monetary reimbursement. Before the break, participants filled out a questionnaire consisting of questions about learning opportunities, interest, and self-concept. After the break, they completed a questionnaire that included the CK instrument and other instruments designed to measure PCK, NOS, and PPK.
3.1.3. Independent Variables
Opportunities to learn. Three measures were used to assess the coverage of CK-, PCK-, and PPK-related contents in the participants' previous education. Two addressed content-related knowledge (CK and PCK) by inviting responses to the prompt "Please tick how intensively the following CK/PCK-related contents have been covered in your university studies to date." on a 5-point Likert scale ranging from 1 (Not Covered) to 5 (Intensively Covered). Cronbach's α values for the CK and PCK items were 0.86 (15 items; M = 42.89, SD = 0.33) and 0.92 (nine items; M = 24.25, SD = 6.74), respectively. The third measure focused on PPK-related content ("Have you attended a university course that addresses this content?"), using a 2-point scale (0 = I have not attended; 1 = I have attended). This measure consisted of 33 items (α = 0.91, M = 14.71, SD = 7.66).
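For illustration, Cronbach's α for multi-item scales such as these can be computed as sketched below. This is a minimal sketch using simulated (hypothetical) responses, not the study's data.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_persons, n_items) score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 200 persons answering 15 OTL items on a 1-5 scale
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))              # shared trait drives consistency
responses = np.clip(np.round(3 + latent + rng.normal(size=(200, 15))), 1, 5)
print(f"alpha = {cronbach_alpha(responses):.2f}")
```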
Pedagogical content knowledge (PCK). PCK was assessed using 36 items of the PCK-IBI instrument [8]. These include 16 open-ended, six short-answer, and eight closed-ended items (six items consist of two or three subitems with different item formats; subitems were merged because of subitem interdependencies). Some items probe participants' knowledge of instructional strategies, including both representation of subject matter and responses to specific learning difficulties (here: knowledge of instructional strategies for teaching; 18 items). Others probe knowledge of students' conceptions and preconceptions (here: knowledge of students' understanding; 18 items; see Figure 1). Two items were deleted because they did not meet our acceptance criteria (item difficulty between 0.22 and 0.84 and discrimination index > 0.19). After removing these items, we averaged responses, resulting in a measure of PCK (34 items; α = 0.77, M = 27.51, SD = 7.52), with expected a posteriori/plausible value (EAP/PV) and weighted likelihood estimate (WLE) reliability values of 0.66 and 0.72, respectively [8]. The PCK-IBI mixes dichotomously scored items (0 = wrong answer; 1 = correct answer) and polytomously scored items (0 = wrong answer; 1 = half-correct answer [partial credit]; 2 = correct answer). A coding scheme was used to judge the pre-service teachers' answers.
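The acceptance criteria applied here (item difficulty between 0.22 and 0.84, discrimination index above 0.19) can be sketched in classical terms as follows; the function and inputs are illustrative assumptions, not the study's actual screening code.

```python
import numpy as np

def screen_items(scores: np.ndarray, max_scores: np.ndarray,
                 diff_range=(0.22, 0.84), min_disc=0.19) -> np.ndarray:
    """Boolean mask of items meeting classical acceptance criteria.

    scores     : (n_persons, n_items) credit earned (0/1 or 0/1/2)
    max_scores : (n_items,) maximum credit obtainable per item
    """
    # Difficulty as the mean proportion of maximum credit earned
    difficulty = scores.mean(axis=0) / max_scores
    # Discrimination as the corrected item-total correlation
    total = scores.sum(axis=1)
    discrimination = np.array([
        np.corrcoef(scores[:, i], total - scores[:, i])[0, 1]
        for i in range(scores.shape[1])
    ])
    return ((difficulty >= diff_range[0]) & (difficulty <= diff_range[1])
            & (discrimination > min_disc))
```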
Nature of science (NOS). NOS understanding was measured using 23 Likert-type items (1 = Strongly Disagree; 2 = Disagree; 3 = Uncertain or Not Sure; 4 = Agree More Than Disagree; 5 = Strongly Agree) of the Student Understanding of Science and Scientific Inquiry (SUSSI) instrument [105]. As both positively and negatively phrased items were included, responses were recoded so that all scores correlated positively with understanding of science and scientific inquiry. The Cronbach's α value for the knowledge of NOS measure was 0.80 (M = 84.06, SD = 9.56).
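The recoding of negatively phrased items can be illustrated as follows; the item positions and responses below are hypothetical.

```python
import numpy as np

# Hypothetical SUSSI-style data: 150 persons x 23 items on a 1-5 scale
rng = np.random.default_rng(1)
responses = rng.integers(1, 6, size=(150, 23))
negative_items = [1, 4, 7, 12, 18]   # assumed positions of negatively phrased items

recoded = responses.copy()
recoded[:, negative_items] = 6 - recoded[:, negative_items]  # maps 1<->5, 2<->4
nos_scores = recoded.sum(axis=1)     # higher = better NOS understanding
```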
Pedagogical/psychological knowledge (PPK). PPK was assessed using 67 items (12 open-ended, four short-answer, and 51 closed-ended). These items were developed by experienced educational researchers, based on the national teacher education standards of Germany [106], in a manner analogous to the CK and PCK items. They focused on pre-service teachers' knowledge of "instructional strategies for teaching" (14 items), "students' understanding" (19 items), "classroom management" (21 items), and "assessment" (13 items). Twelve items were removed because of low discrimination power. The Cronbach's α value of the resulting scale was 0.92 (55 items; M = 82.60, SD = 29.38), indicating good internal consistency.
Interest. We used six items of the Study Interest Questionnaire (SIQ) [107] to measure pre-service biology teachers' interest in the three domains of university studies: the subject (source of CK), pedagogy/psychology (source of PPK), and the pedagogy of subject matter (source of PCK; German term Fachdidaktik) [108] (p. 656). The participants were asked to assess the degree to which each item (e.g., "Dealing with the contents and issues of this study area is one of my favorite activities") applies to the three sources of professional knowledge on 4-point scales ranging from "does not apply at all" (1) to "fully applies" (4). We averaged scores for the items associated with each subscale, resulting in measures of interest in the subject (α = 0.80, M = 20.38, SD = 3.05), pedagogy/psychology (α = 0.88, M = 14.56, SD = 4.40), and the pedagogy of subject matter (α = 0.82, M = 15.56, SD = 3.68).
Self-concept. To measure pre-service teachers' self-concept, we used items of two subscales (professional competence and methodological competence) of the Berlin Evaluation Instrument for self-evaluated student competencies (BEvaKomp) [109]. The instrument measures students' CK-, PCK-, and PPK-related self-concept components. The participants were asked to assess the degree to which each item (e.g., "I can see the connections and inconsistencies in the subject area of CK [vs. PCK vs. PPK]") applies to the three domains of professional knowledge (CK, PCK, and PPK) on 4-point scales ranging from "does not apply at all" (1) to "fully applies" (4). Cronbach's α values for the CK-, PCK-, and PPK-related self-concept components were 0.80 (M = 27.98, SD = 4.23), 0.89 (M = 24.19, SD = 5.57), and 0.88 (M = 22.51, SD = 5.28), respectively, indicating acceptable internal consistency of the subscales.
3.1.4. Statistical Analysis
In order to obtain unconfounded psychometric measurements, the procedures used must capture the same ability or attribute (i.e., latent trait) [110]. Hence, the operationalized variables used in any instrument must each capture a single latent trait to generate valid overall scores [112]. A number of procedures have been developed to test such "unidimensionality" [113], and a frequently used option is to fit the data to a unidimensional measurement model, e.g., the Rasch model [114]. Discrepancies between modelled and empirical data are expressed using descriptive statistics, such as INFIT and OUTFIT, which are mean-square residuals ranging from 0 to infinity, with expected values of 1 (if the null hypothesis of unidimensionality is not violated). Indices smaller than 1 indicate higher than expected predictability, which may be due to redundancy in the data (i.e., overfit) [115]. In Classical Test Theory low mean-square residuals are regarded as desirable, and according to Item Response Theory "they do no harm", despite indicating some redundancy in responses [116] (p. 370). The cited authors do not present strict thresholds for acceptable OUTFIT and INFIT values, but suggest that the values should ideally lie between 0.5 and 1.5. However, in this study we used a tighter range of acceptability: from 0.8 to 1.2. In addition to calculating OUTFIT and INFIT values for the selected items, we applied t-tests to identify values that differed significantly from 1. OUTFIT stands for outlier-sensitive fit: OUTFIT statistics are highly sensitive to unexpected observations, that is, responses to items with difficulties far from a person's ability (i.e., items that respondents find relatively very easy or very hard). In contrast, INFIT stands for inlier-sensitive fit: INFIT statistics are more sensitive to discrepancies in patterns of responses to items targeting the persons' focal ability [115].
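For the dichotomous case, the OUTFIT and INFIT mean squares can be computed from squared standardized residuals as sketched below; the person abilities, item difficulty, and responses are assumed inputs (polytomous items generalize the same idea).

```python
import numpy as np

def rasch_fit_statistics(x: np.ndarray, theta: np.ndarray, b: float):
    """OUTFIT and INFIT mean squares for one dichotomous Rasch item.

    x     : (n_persons,) observed 0/1 responses
    theta : (n_persons,) person ability estimates (logits)
    b     : item difficulty (logits)
    """
    p = 1 / (1 + np.exp(-(theta - b)))        # model-expected response
    var = p * (1 - p)                         # model variance per response
    z2 = (x - p) ** 2 / var                   # squared standardized residuals
    outfit = z2.mean()                        # unweighted: outlier-sensitive
    infit = ((x - p) ** 2).sum() / var.sum()  # information-weighted: inlier-sensitive
    return outfit, infit
```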
We used ACER ConQuest software [118] to analyze the acquired data. As the measures included dichotomously and polytomously scored items, both person ability and item difficulty were estimated using Masters' [119] partial credit model (PCM). This is an extension of the simple logistic (Rasch) model that enables analysis of cognitive items scored in more than two ordered categories, with differing measurement scales, and can provide estimates of threshold parameters for each item, even if the thresholds vary among items [118]. We used the WLE method to estimate person ability, as it is less biased than maximum likelihood estimation and provides the best point estimates of individual ability [121]. The person ability scores were used for further calculations implemented in SPSS®. The measurement accuracy was assessed by calculating EAP/PV and WLE reliability values [118].
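The PCM's category probabilities for a polytomous item can be sketched as follows; the threshold values are hypothetical, and the actual estimation was performed in ConQuest.

```python
import numpy as np

def pcm_probabilities(theta: float, deltas: np.ndarray) -> np.ndarray:
    """Category probabilities under Masters' partial credit model.

    theta  : person ability (logits)
    deltas : (m,) step parameters for an item scored 0..m
    """
    # Numerators: exp(sum_{j<=k}(theta - delta_j)), with the empty sum = 0 for k = 0
    cumulative = np.concatenate(([0.0], np.cumsum(theta - deltas)))
    numerators = np.exp(cumulative - cumulative.max())  # stabilized exponentials
    return numerators / numerators.sum()

# e.g., a PCK item scored 0/1/2 with hypothetical thresholds
print(pcm_probabilities(theta=0.5, deltas=np.array([-0.8, 1.1])))
```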
Dimensionality analysis of the CK-IBI
Multidimensional Rasch analysis was applied to analyze the dimensionality of the CK-IBI. Since our instrument is intended to cover five biological disciplines, a five-dimensional model was initially fitted to the data, assuming that knowledge of ecology, evolution, genetics & microbiology, morphology, and physiology form separate dimensions. We then compared the model's fit parameters to the corresponding parameters of a one-dimensional model, which implicitly assumes that CK scores reflect a single latent trait. To identify the model providing the best fit to the data we calculated deviance values (inversely reflecting the degree to which the data fit underlying assumptions). To assess the significance of differences between the models' deviances we applied χ2-tests, in which the degrees of freedom depend on the number of estimated parameters [122]. In addition, we compared the five- and one-dimensional models using Akaike's Information Criterion (AIC = deviance + 2np; [123]) and Bayes' Information Criterion (BIC = deviance + ln(N) · np, where np is the number of estimated parameters and N the sample size; [124]). These criteria do not allow significance tests, but make it possible to take models' parsimony into account. Generally, the lower the coefficient, the better the fit between the model and the data [124]. The AIC is superior when the number of possible response patterns greatly exceeds the sample size, while the BIC is superior in the opposite case [126].
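The comparison arithmetic can be reproduced as follows, assuming the deviances and numbers of estimated parameters (np) are read from the ConQuest output; the function and its inputs are illustrative, not the study's actual code.

```python
import math
from scipy.stats import chi2

def compare_models(dev_1d: float, np_1d: int,
                   dev_5d: float, np_5d: int, n_persons: int):
    """Likelihood-ratio test plus AIC/BIC for nested one- vs. five-dimensional models."""
    lr = dev_1d - dev_5d                      # deviance difference ~ chi-squared
    df = np_5d - np_1d                        # additional estimated parameters
    p_value = chi2.sf(lr, df)
    aic = {"1D": dev_1d + 2 * np_1d, "5D": dev_5d + 2 * np_5d}
    bic = {"1D": dev_1d + math.log(n_persons) * np_1d,
           "5D": dev_5d + math.log(n_persons) * np_5d}
    return lr, df, p_value, aic, bic
```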
Analysis of the CK, PCK, and PPK measures’ factor structure
We used confirmatory factor analysis (CFA), which can be applied to examine the relationships between any set of observed variables (e.g., a set of items) and a set of continuous non-observed (latent) variables, to analyze the construct validity of our CK, PCK, and PPK measures. Bentler [122] recommends a minimum ratio of 5:1 between sample size and number of free parameters, but we could not meet this recommendation due to the high number of items. Using scores for all of the items would have led to a ratio of nearly 1:1, so we used scores for parcels of items associated with subscales as manifest indicators (e.g., giving 15 rather than 126 factor loadings) of the latent variables [80]. We expected a three-factor model including single latent traits for CK, PCK, and PPK to adequately explain the variation in responses to our measures. We allocated the items associated with each of the three factors to five parcels, thereby reducing the number of factor loadings from 126 to 15 and obtaining a ratio of sample size to number of free parameters of 9:1. To estimate the parcel scores we used the WLE method, which provides the best point estimates [121]. As the participants were enrolled in 12 universities, we used "Type = complex" when conducting the CFA to account for the nested structure of the data, with factor variances set to 1 to fix the metric of the latent variables. The distributions of our parcel scores did not meet normality requirements, so we applied full information maximum likelihood estimation with robust standard errors in Mplus 5.21 [128].
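The parceling step can be sketched as below, using simple item means for illustration; the study estimated parcel scores with the WLE method, and its actual item-to-parcel assignment rule is not reproduced here.

```python
import numpy as np

def build_parcels(item_scores: np.ndarray, n_parcels: int = 5) -> np.ndarray:
    """Average items into parcels to serve as manifest CFA indicators.

    item_scores : (n_persons, n_items) scores for one knowledge domain
    Items are assigned to parcels round-robin (an assumption for
    illustration, not the study's documented rule).
    """
    parcels = np.empty((item_scores.shape[0], n_parcels))
    for p in range(n_parcels):
        parcels[:, p] = item_scores[:, p::n_parcels].mean(axis=1)
    return parcels
```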
Differential item functioning (DIF)
A test instrument should always be "fair" to all test subjects, in the sense of functioning equivalently across all groups. Thus, the difficulty of all the included items should be governed solely by the construct the instrument is intended to measure, and not influenced by any irrelevant or extraneous factors. Otherwise, members of different groups with the same ability in terms of that construct may have differing probabilities of answering an item correctly. This is referred to as differential item functioning (DIF) [129]. Hence, as any group of subjects may include representatives of numerous subgroups (differing, for instance, in race, gender, age, etc.), it is important to identify the factors that are most relevant when testing an instrument. We selected two factors: the participants' track (academic versus nonacademic) and second teaching subject (a science subject (physics or chemistry) versus a non-science subject).
Several techniques have been developed to detect DIF [129]. We chose a model based on item response theory, implemented in ACER ConQuest software [118]. The model included interactions between item difficulty and each of the two factors (track and second teaching subject), in order to capture variations in responses due to differences in item difficulties between the corresponding subgroups (academic versus nonacademic groups, and those taking another science subject versus another non-science subject) on the same scale [132]. The statistical significance of differences in item difficulties between groups depends on sample size, so we used effect sizes suggested by the Educational Testing Service [133]. Following their recommendations, we distinguished three categories, A, B, and C, for differences in item difficulty between two groups that are non-significant (<0.43 logits), significant and indicative of a slight to moderate effect (0.43 to 0.64 logits), and significant and indicative of a moderate to large effect (>0.64 logits), respectively [130].
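The ETS-style classification described above can be expressed as a small decision rule; this is an illustrative sketch, not the software actually used.

```python
def ets_dif_category(dif_logits: float, significant: bool) -> str:
    """Classify a between-group difficulty contrast into ETS categories A/B/C."""
    size = abs(dif_logits)
    if not significant or size < 0.43:
        return "A"   # negligible DIF
    if size <= 0.64:
        return "B"   # slight to moderate DIF
    return "C"       # moderate to large DIF
```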