Article

From Abstract to Domain-Specific: Development and Validation of Matrix Reasoning Tasks for Students in Biology

by Colin Peperkorn * and Claas Wegner
Faculty of Biology, Biology Didactics-Giftedness and Talent Research, Bielefeld University, 33615 Bielefeld, Germany
* Author to whom correspondence should be addressed.
J. Intell. 2026, 14(4), 69; https://doi.org/10.3390/jintelligence14040069
Submission received: 17 December 2025 / Revised: 10 April 2026 / Accepted: 13 April 2026 / Published: 17 April 2026
(This article belongs to the Special Issue Intelligence Testing and Its Role in Academic Achievement)

Abstract

Matrix reasoning tests are frequently used to measure intelligence and identify gifted students across domains. To date, there is limited evidence on the usefulness of contextualised tasks for identifying domain-specific giftedness. In the current study, matrix reasoning tasks tailored to biological contexts were developed and validated for students in grades 3–6. The tasks were evaluated across two research cycles, involving a total of N = 895 students (n1 = 470; n2 = 425). An item analysis based on item response theory indicated acceptable item parameters and fit indices for the final item pool. Correlation analyses revealed moderate-to-strong associations with IQ, assessed via abstract matrix reasoning, as well as with domain-specific achievement in biological inquiry processes. A known-groups comparison revealed that students identified as gifted in biology outperformed a comparison group of peers, providing preliminary known-groups validity evidence for the developed tasks. Overall, the matrix reasoning tasks tailored to biology showed acceptable psychometric properties and positive correlations with achievement in biological inquiry, and the study provided initial evidence of their usefulness for identifying gifted students in biology.

1. Introduction

Intelligence (defined as Spearman’s g) is a main predictor of academic performance across domains (e.g., Deary et al., 2007; Watkins et al., 2007). It is considered one of the most thoroughly researched psychological constructs and thus offers a strong foundation for identifying gifted students (Rost & Sparfeldt, 2017). Although a modern understanding of giftedness goes beyond above-average intelligence, many definitions see it as a necessary condition or potential that enables individuals to contribute to societal development through their actions (e.g., Subotnik et al., 2011; Worrell et al., 2019). Furthermore, it is essential to consider domain-specific abilities, psychosocial skills, motivation, and environmental influences, along with strategies to address these factors, as they enable the individual to translate potential into transformative giftedness (Sternberg, 2024). Subotnik et al. (2011) provided the following definition:
Giftedness is the manifestation of performance or production that is clearly at the upper end of the distribution in a talent domain even relative to that of other high-functioning individuals in that domain. Further, giftedness can be viewed as developmental, in that in the beginning stages, potential is the key variable; in later stages, achievement is the measure of giftedness; and in fully developed talents, eminence is the basis on which this label is granted. Psychosocial variables play an essential role in the manifestation of giftedness at every developmental stage. Both cognitive and psychosocial variables are malleable and need to be deliberately cultivated.
(p. 7)
Therefore, identification processes must consider both potential and achievement, must be designed in accordance with educational goals, and employ multiple stages to shape programmes that foster learning strategies, motivation, and interest in specific domains (VanTassel-Baska, 2005, 2021). At the potential stage, broad cognitive abilities, such as fluid reasoning, should be measured to identify gifted individuals. However, current instruments mainly consist of domain-general tasks or abstract forms, which do not fully align with the domain-specific nature of giftedness. In the current study, contextualised matrix reasoning tasks for biology were developed and tested with students in grades 3–6 to assess their psychometric properties and ability to identify the potential of gifted students in biology.
Intelligence should be viewed not merely as a psychological attribute but as a network of diverse individual strengths and weaknesses across different broad cognitive abilities (McGrew et al., 2023). Fluid intelligence (Gf) is the reasoning ability that enables individuals to handle unfamiliar situations or problems and refers to the ability to solve problems and to recognise patterns or relations (Cattell, 1963; Schneider & McGrew, 2022). It is usually measured using matrix reasoning tests. These tests involve tasks in which a matrix, typically 3 × 3 or 2 × 2, or a row of fields (e.g., 1 × 5) is presented, with one field left empty for participants to fill from a set of options. The completion of these tasks requires inductive, deductive, and relational reasoning, along with spatial visualisation and working memory (Carpenter et al., 1990; Prabhakaran et al., 1997). The most well-known examples are the Raven’s Progressive Matrices (RPM; Raven et al., 1998), the Culture Fair Test (CFT 20-R; Weiß, 2006), the Bochumer Matrizentest (BOMAT; Hossiep & Hasella, 2010), or the Wechsler scales (WISC-V, WAIS-5; Wechsler, 2014, 2024).
Research at a domain-specific level has shown that Gf, assessed by matrix reasoning tests, plays a significant role in both mathematical and verbal skills (Peng et al., 2019). Ren et al. (2015) found that measuring fluid intelligence can provide insights into learning abilities, making matrix reasoning tests a valuable addition to educational assessments. Based on the results, educational practitioners can draw direct conclusions about important aspects of their students’ learning (Klauer & Phye, 2008). In addition, there is an opportunity to gain deeper insights into the potential of their students beyond measures of academic achievement or performance (Freund & Holling, 2011a, 2011b). Given this strong empirical foundation and the large number of available tests, educators rely heavily on general intelligence tests to identify gifted students. Nevertheless, researchers emphasise the importance of considering domain-specific skills, creativity, and non-intellectual factors such as special interests and motivation when identifying gifted students (Subotnik et al., 2011). The main goal of gifted education is to foster all students according to their individual potential, considering real later-life outcomes (Sternberg, 2002). This understanding aligns with the non-g psychometric network analysis (PNA) approach in contemporary intelligence research, which posits that intelligence is formed through the interaction of multiple broad abilities (e.g., McGrew et al., 2023). Therefore, identifying giftedness should not rely on measuring a single, unidimensional construct to differentiate between gifted and non-gifted individuals. This also means that gifted students do not have to perform above average on every metric used (as a high IQ would require; Peters et al., 2020). Instead, the individual expression of giftedness should be identified and fostered. 
To achieve this, broad cognitive abilities, such as Gf, must be assessed in an inclusive, culturally sensitive, and fair manner across all students (Holden & Tanenbaum, 2023).
Gifted individuals interested in a specific area are more likely to use their potential to develop domain-specific skills. Therefore, these are essential for giftedness identification because they indicate the development and expression of potential in a specific area of interest (Subotnik et al., 2011; VanTassel-Baska, 2005). The ability to perform scientific inquiry processes, epistemological views (Nature of Science), and practical skills (scientific working techniques) are core competencies in biology (Arnold et al., 2014; Wellnitz & Mayer, 2011). These include scientific reasoning skills like formulating questions and hypotheses, planning experiments, and drawing conclusions, as well as inquiry methods such as skills in observing, collecting data, or systematically controlling and varying variables in experiments or models (e.g., Bell et al., 2010; Bybee et al., 2009; OECD, 2023; Pedaste et al., 2015). Research and debates on direct relationships between (fluid) intelligence and domain-specific abilities also extend to STEM subjects (science, technology, engineering, and mathematics; Brookman-Byrne et al., 2019; Richland et al., 2007; Yuan et al., 2006). Greiff and Neubert (2014), in a study with 490 high school students, found evidence for the relationships between fluid intelligence, measured with the CFT 20-R (Weiß, 2006), and complex problem solving (CPS) in scientific inquiry processes, as a domain-specific ability, assessed using MicroDYN tasks (Greiff et al., 2012). Scherer and Tiemann (2014) confirmed the relationship between fluid intelligence and CPS in science among students in grades 8 to 10. Other studies suggested that intelligence has only weak to moderate correlations with ability domains such as scientific reasoning (Sternberg et al., 2019). It is noted that the strength of the correlation is highly dependent on the measurement method. 
The results indicate that, for matrix reasoning tests to be used meaningfully in domain-specific research and identification processes, they must be valid and reliable (Van Hoogdalem & Bosman, 2024).
At the potential level, mainly domain-general measures are currently used in giftedness identification processes, which are mostly supplemented with domain-specific achievement measures to serve the educational purpose. One emerging option is to contextualise intelligence measures within a specific field, thereby integrating domain-general and domain-specific cognitive processes (Roberts, 2007). In their study, Benit and Soellner (2012) adapted a matrix reasoning test for mechanical engineering and, with a sample of 360 university students, demonstrated that participants’ willingness to complete the test could be significantly increased. The measurement of domain-specific cognitive processes at the potential level currently receives little attention in giftedness identification, and there is a lack of validated, school-age, domain-specific matrix reasoning tests that accurately measure fluid reasoning.
In the current study, a domain-specific matrix reasoning test in biology was developed within a design-based research (DBR) framework to improve identification processes of gifted students in schools (Peperkorn & Wegner, 2024). The DBR framework involves identifying a problem in educational practice, developing a prototype to address it based on a preliminary examination, and then evaluating and refining it through multiple research cycles. This process yields both practical outputs and contributes to existing theories (Euler, 2014; Shavelson et al., 2003). Therefore, matrix reasoning tasks were designed with biological themes to assess students’ fluid reasoning abilities in grades 3–6 using a domain-specific approach. The age group was selected because, during the transition from primary to secondary school, a decline in interest in science is observed (Gebhard et al., 2017; Potvin & Hasni, 2014). In Germany, students typically transition between 4th and 5th grade, although in two states, it occurs after sixth grade. Identifying and fostering gifted students is particularly important during this phase. The assessment is intended for classroom group administration in low-stakes environments. The primary objective of the study was to pilot the test, investigate its psychometric properties, and assess the quality of the developed items. Different forms of the newly developed matrix test were used in two studies. The matrix reasoning tasks were administered across different cohorts, and accompanying instruments were used to gather preliminary evidence of validity. References to IQ and skills in scientific inquiry processes in biology were examined to analyse subject-specific contextualisation and the associations with subject-specific achievement. In addition, group comparisons were conducted to examine the suitability for subject-specific giftedness identification. The following research questions were posed:
RQ1. What is the psychometric quality of the developed domain-specific matrix reasoning tasks across both test versions?
RQ2. How are the results of the domain-specific matrix reasoning tasks related to IQ and abilities in scientific inquiry processes in biology?
RQ3. Do gifted students in an enrichment program show different ability levels in completing domain-specific matrix reasoning tasks compared to a control group?

2. Materials and Methods

The present study is a quantitative cross-sectional study aimed at examining the psychometric quality of domain-specific matrix reasoning tasks. The study was conducted in Germany as part of an enrichment program to foster gifted students in biology (Wegner et al., 2013) and in cooperating schools of the project. Two research cycles were conducted following the DBR methodology (Peperkorn & Wegner, 2024; Euler, 2014; Shavelson et al., 2003). In the first study, a 24-item test version was administered. In the second study, an expanded version comprising 60 items was used.

2.1. Participants

The total sample consisted of N = 895 students (41.7% female, 54.8% male, 3.5% N/A, mean age = 10.1 years) from the third to sixth grade. The overall sample was divided into two studies. After piloting the initial version, a second expanded test version was used. In both samples, the participants stemmed from two separate cohorts. The first cohort consisted of participants in an enrichment program (Wegner et al., 2013) who were identified by their biology teachers. The second cohort was assembled from students attending partner schools. We used a non-probability convenience sample recruited through collaborating teachers. Students were recruited from primary schools (grades 1–4) and from secondary schools, namely academic-track secondary schools (Gymnasium; grades 5–13), from urban and rural areas. No participants of the enrichment program were part of the second cohort.
The sample of the first study included n1 = 470 students (42.3% female, 55.7% male, 2.0% N/A, mean age = 10.08 years). Of these, 373 were participants from the enrichment program, and 97 were students from participating schools. In the first study, we contacted six teachers from three different schools within our collaboration network. Three teachers agreed to participate, yielding a 50% participation rate. This resulted in four classes (grades 5–6) participating. All students present in the participating classes were invited. A total of 104 students were eligible, 99 provided parental consent, and 97 completed the assessment, resulting in a student participation rate of 93.3%. Two students who provided consent were absent during the study.
The sample of the second study included n2 = 425 students (40.9% female, 53.9% male, 5.2% N/A, mean age = 10.12 years). Of these, 341 were participants from the enrichment program, and 84 were students from participating schools. In the second study, we contacted three teachers from two different schools. All teachers agreed to participate, yielding a 100% participation rate. This resulted in three classes (grades 5–6) in which all students present were invited. A total of 87 students were eligible, 84 provided parental consent, and all completed the assessment, resulting in a student participation rate of 96.6%.

2.2. Data Collection Tools

2.2.1. Domain-Specific Matrix Reasoning Test for Biology

Domain-specific matrix reasoning tasks with biological references were developed. Biological forms, such as animals, plants, natural phenomena, or laboratory materials, replaced common abstract shapes. Items were created in four formats (see Figure 1), including 2 × 2, 3 × 3, and 1 × 5 matrices, as well as patterns with cut-out fields to be completed (Matzen et al., 2010). Five answer options were created for each item, including one correct answer and four distractors. This choice aligns with standard practices in school-age matrix reasoning tests (e.g., WISC-V; CFT 20-R; RPM). To ensure fairness across diverse student groups, the nonverbal items were designed so that their solutions do not require prior knowledge. The items can be enlarged at will or displayed in grayscale. For the first version, 24 items were developed. These were presented in a fixed order in the first study. In the second study, the initial item pool of 24 items was expanded to 60 items, and the items were presented in randomised order. The students were introduced to the items by answering trial items of each type and receiving automatic feedback on whether their answers were correct. Item 1 was used as an example during test administration and was therefore excluded from all analyses. The items were created and administered digitally using LimeSurvey, with processing on tablets (2732 × 2048 pixels). During administration, participants received no feedback on the correctness of their answers and were unable to skip items or revise submitted answers. Participants were given 20 min to complete the test in both versions.

2.2.2. Raven’s Progressive Matrices 2, Clinical Edition—German Short Form (NCS Pearson, 2019)

Participants’ IQ was assessed using the German digital short form of Raven’s Progressive Matrices (NCS Pearson, 2019). The matrix reasoning test is used to evaluate overall cognitive abilities, with a primary focus on fluid intelligence. It has been validated with a European norm sample and is appropriate for assessing individuals aged between 4:0 and 69:11 years. The digital short form demonstrated a test-retest reliability between r = 0.79 and r = 0.81 (NCS Pearson, 2019). The test duration is limited to 20 min.

2.2.3. Abilities in Scientific Inquiry Processes in Biological Research Contexts

The assessment was used to evaluate the students’ abilities in scientific inquiry processes as domain-specific abilities. Scientific inquiry was assessed using the VerE model (Nowak et al., 2013). This theoretical framework encompasses scientific reasoning, including the ability to formulate hypotheses and research questions, plan and perform investigations, and draw conclusions; as well as inquiry methods such as observing, comparing, and arranging, experimenting, and modelling, as overlapping dimensions. The items consisted of a brief description, a visualisation, and a question about the biological phenomenon, model, or experiment. The instrument comprised 54 multiple-choice items, of which 18 were administered in a digital test version. The test instrument was developed as part of the research project. Item analyses were conducted and psychometric properties were assessed for the current sample (grades 3–6; KR-20 = 0.632–0.742). DIF analyses for gender and grade level, divided by student group, showed no significant DIF for any of the items. A translated example was as follows: “A chameleon is observed in its terrarium. You can see that the chameleon turns darker as soon as a conspecific approaches, it is fed, or touched. Which assumption can be verified through the described observation? The chameleon (a) changes its colour in different situations to communicate (correct answer); (b) only changes colour when threatened; (c) only changes colour when it is hungry; (d) becomes brighter when it is touched.” To prevent excessive demands through the description texts, a read-aloud function was implemented. The assessment was also administered via LimeSurvey and took approximately 25 min to complete.

2.3. Procedure

The administration was conducted in two different settings. The cohort of participants in the enrichment program was surveyed during their project courses, with groups consisting of 15–20 students. The participants were equipped with tablets. The control cohort completed surveys in groups of 25–30 students during lessons at the participating schools. If school-owned tablets were available, they were used. Otherwise, the research team provided participants with tablets. The surveys were conducted following a standardised procedure. Each survey session took approximately 30 min, including greetings, introductions, process explanations with a trial item, execution, and farewell. Testing was conducted by trained research staff using a standardised protocol. To ensure comprehension, examiners read the general instructions aloud to all students, administered trial items with feedback, and explained that students could choose and, if necessary, change an answer before submitting. It was also explained that submitted answers could not be corrected. During the test, procedural support was limited to rereading or paraphrasing the general instructions from the protocol, clarifying the response format (e.g., “choose the answer that completes the pattern”), and reminding students that no subject-matter knowledge was required. Examiners were explicitly instructed not to provide hints to the correct answer and not to confirm whether an answer was correct. Assistance procedures were identical across groups to preserve comparability. Written consent was obtained from a parent or legal guardian of all participants. All participants were briefed on the purpose of the research and were informed that their participation was voluntary. They could withdraw at any time. The study was reviewed and approved by the ethics committee at Bielefeld University (approval number: 2025-256; approval date: 27 August 2025).

2.4. Data Analysis

To address the first research question, an item analysis using item response theory (IRT) was conducted for both test versions. A Rasch model (1PL) was estimated to determine the reliability of expected a posteriori (EAP) and mean weighted likelihood estimation (WLE; Warm, 1989), item difficulty (b), and the item-fit values weighted/unweighted mean-square (MNSQ) and z-standardised statistic (ZSTD). Additionally, the Kuder-Richardson 20 formula (KR-20) was used. The unidimensionality of each of the two versions was evaluated using Principal Component Analysis (PCA) of the residuals (Linacre, 1998), and local dependencies were assessed using Yen’s Q3 method (Yen, 1984). In both versions, the PCA showed that the eigenvalue of the first residual contrast was <2.0, indicating that the residuals do not form a meaningful secondary dimension and thus support unidimensionality. For the initial version, Yen’s Q3 method showed a raw mean Q3 of −0.04 and an adjusted mean Q3 of less than 0.001. Item-specific analysis showed that all mean absolute Q3 values were <0.20 (Max = 0.112) with a maximum of two violations observed across all item pairs. Similarly, for the second version, Yen’s Q3 method showed a raw mean Q3 of −0.018 and an adjusted mean Q3 of less than 0.001. Item-specific analysis showed that all mean absolute Q3 values were <0.20 (Max = 0.077), with a maximum of two violations observed across all item pairs as well. Given these generally low residual correlations, the assumption of local independence was considered to be sufficiently met in both versions, and unidimensionality was further verified. In addition, uniform differential item functioning (DIF) analyses for the second version of the matrix reasoning test were conducted across the different cohorts (enrichment/control), genders (male/female), and school levels (primary/secondary level). 
We decided to use school levels rather than grade levels to align the analysis with the key curricular transition and to better capture children’s progress over time. Grade-by-grade DIF may misattribute expected growth to bias and is influenced by small cell sizes in our data, thereby reducing stability and statistical power (Penfield, 2001). For this, the Mantel-Haenszel DIF procedure (MH; Mantel & Haenszel, 1959) with iterative anchor purification was used. The robustness of the results was examined using Lord’s chi-square DIF test (Lord, 1980) and Raju’s area-based DIF test (Raju, 1988). The p-values were adjusted using the Benjamini-Hochberg method (α = 0.05; Benjamini & Hochberg, 1995). DIF effect size was evaluated according to Zwick et al. (1999) on the basis of absolute values: |ΔMH| < 1 indicated a negligible effect, 1 ≤ |ΔMH| < 1.5 indicated a slight to moderate effect, and |ΔMH| ≥ 1.5 indicated a large effect.
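The Mantel-Haenszel effect size described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the analyses were run in R and ACER ConQuest) and omits anchor purification and significance testing. It assumes the common definition ΔMH = −2.35 ln(α_MH), where α_MH is the common odds ratio across matched score strata, and it applies the thresholds to absolute values. The function names `mh_delta` and `classify` and the example counts are hypothetical.

```python
import math


def mh_delta(strata):
    """Return the Mantel-Haenszel delta for one item.

    strata: iterable of (a, b, c, d) counts per matched score level, where
    a/b = reference group correct/incorrect and
    c/d = focal group correct/incorrect.
    Assumes at least one stratum with b * c > 0 (otherwise the ratio is undefined).
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # empty strata carry no information
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den  # common odds ratio across strata
    return -2.35 * math.log(alpha_mh)


def classify(delta):
    """Label the effect size using absolute-value thresholds of 1 and 1.5."""
    size = abs(delta)
    if size < 1.0:
        return "negligible"
    if size < 1.5:
        return "slight to moderate"
    return "large"


# Made-up counts for illustration only (not study data).
example = [(30, 10, 20, 20)]
print(round(mh_delta(example), 2), classify(mh_delta(example)))
```

With this sign convention, a negative ΔMH means the item was easier for the reference group once ability is matched; which cohort serves as reference is a modelling choice.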
For the second research question, the results of the matrix reasoning test, the scientific inquiry assessment, and the IQ were correlated, using Spearman’s ρ (Shapiro-Wilk: p < .05). For this, the person’s ability parameters (θ) were used for the matrix reasoning test and the scientific inquiry assessment. Not-reached responses were recorded as missing and did not affect the likelihood. The correlation analysis was conducted separately for the third and fifth grades to account for age-standardisation of the IQ data.
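For readers unfamiliar with rank correlations, the following Python sketch shows how Spearman’s ρ can be computed as a Pearson correlation on average ranks. It is an illustration only: the helper names and the example values are invented and do not come from the study data.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented ability estimates and IQ scores, for illustration only.
theta_mr = [-1.2, -0.3, 0.1, 0.8, 1.5]
iq = [88, 95, 101, 110, 124]
print(spearman_rho(theta_mr, iq))
```

Because only ranks enter the computation, the coefficient is insensitive to the non-normality indicated by the Shapiro-Wilk test.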
To answer the third research question, we estimated an exploratory latent regression Rasch model for the two different cohorts (enrichment/control). For further verification of the results, we estimated a multiple-group Rasch model using marginal maximum likelihood with EAP ability estimates. To guard against potential measurement non-invariance, we employed Stocking-Lord linking (Stocking & Lord, 1983), assuming complete non-invariance as a robust approach (Baghaei & Robitzsch, 2025; Robitzsch & Lüdtke, 2020). This method enables item response modelling with multiple groups across different item sets because the item parameters for both sets are brought onto a common scale. In our case, this allows us to compare both groups (enrichment/control) across both test versions.
All analyses were performed using R software (Version 4.5.2; R Core Team, 2025) and ACER ConQuest software (Version 5.47.5; Adams et al., 2020).
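As an illustration of the KR-20 coefficient reported alongside the IRT-based reliabilities, the following Python sketch computes it for a complete matrix of dichotomous responses. It is not the authors' code; the function name `kr20` and the toy data are hypothetical, and the sketch assumes no missing responses and uses the population variance of the total scores.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson 20 for a list of per-person lists of 0/1 item scores."""
    n_persons = len(responses)
    n_items = len(responses[0])
    # Proportion of correct answers per item.
    p = [sum(person[i] for person in responses) / n_persons for i in range(n_items)]
    # Sum of item variances p * (1 - p).
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Variance of the persons' total scores.
    var_total = pvariance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# Toy data (five persons, four items), for illustration only.
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(kr20(toy))
```

With randomised item order and not-reached responses, as in the second study, a real computation would first need a rule for handling missing data, which this sketch leaves out.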

3. Results

3.1. Descriptive Results

Table 1 summarises the descriptive results of both test versions separated by cohorts.

3.2. Item Analyses

An item analysis was conducted for both versions of the matrix reasoning test to answer the first research question (see Table 2 and Table 3).
The b values for the items in the first version (24 items) ranged from −3.39 to 2.45 (see Figure 2). All items met the MNSQ criteria for infit and outfit (0.8 ≤ MNSQ ≤ 1.2; Wright & Linacre, 1994). Item 23 showed high ZSTD values for infit and outfit, and Item 22 showed high ZSTD values for outfit (ZSTD > 1.96), indicating underfit. The reliability of the first version was as follows: EAP = 0.60; WLE = 0.58; KR-20 = 0.60. In the second version, the b values ranged between −3.07 and 2.58 (see Figure 2), and the difficulty range was covered more effectively. Coverage was less dense toward the extremes, with only a few items in the ability ranges of θ < −2.0 and θ > 2.0. Additionally, there were fewer items in the range −1.0 < θ < 0. All items met the criteria for MNSQ and ZSTD values for infit. Only item 49 failed the outfit MNSQ criterion, with an unweighted MNSQ = 1.67, indicating underfit. Six items (6; 11; 12; 36; 43; 54) showed a ZSTD value below −1.96 for outfit, indicating overfit. Two items (38; 49) showed ZSTD values for outfit above 1.96, indicating underfit. The reliability of the second version was as follows: EAP = 0.74; WLE = 0.73; KR-20 = 0.75. A uniform DIF analysis was conducted for both versions of the matrix reasoning test across the two cohorts, genders, and grade levels. In the first version, notable DIF was observed for items 3 and 20, with cohort as the grouping variable. However, only item 3 also exhibited DIF in the robustness checks via Lord’s chi-square DIF test and Raju’s area-based DIF test. Both items showed a ΔMH value above 1.5, indicating DIF favouring the control group of students. No significant DIF items were identified across gender or grade levels. We decided to retain item 3 in the item pool to collect additional information on potential biases. In the second version, no significant DIF items were identified across all analysed groups using the MH DIF procedure (see Table 4). 
Regarding the DIF analysis across the two cohorts, items 12 and 15 showed ΔMH values below −1.5, indicating DIF favouring the enrichment students. Items 3 and 27 showed ΔMH values above 1.5, indicating DIF favouring the control group of students. All of these values were non-significant. Items 3, 12, and 33 showed significant DIF in Lord’s chi-square DIF test and Raju’s area-based DIF test, but not in the primary analysis using MH procedure. All items were included in subsequent analyses.
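The infit and outfit mean-square statistics used in the item analysis can be sketched in Python as follows. This is a simplified illustration under the Rasch model with known person and item parameters, not the ConQuest routine the authors used; the function names are hypothetical, and the 0.8 to 1.2 band is the criterion cited in the text.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit MNSQ, outfit MNSQ) for one item.

    responses: 0/1 scores of all persons on the item
    thetas:    the persons' ability estimates
    b:         the item difficulty
    """
    sq_resid = []
    variances = []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        sq_resid.append((x - p) ** 2)   # squared residual
        variances.append(p * (1 - p))   # model variance of the response
    # Outfit: unweighted mean of squared standardised residuals.
    outfit = sum(r / w for r, w in zip(sq_resid, variances)) / len(sq_resid)
    # Infit: information-weighted mean square.
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit


def within_criterion(mnsq, lower=0.8, upper=1.2):
    """Check an MNSQ value against the 0.8 to 1.2 band."""
    return lower <= mnsq <= upper


print(within_criterion(1.67))  # outfit MNSQ reported for item 49
```

Because outfit is unweighted, it reacts strongly to unexpected responses from persons whose ability is far from the item difficulty, which is one reason a single item can fail the outfit criterion while passing infit.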

3.3. Correlational Analyses

Correlation analyses between the results of the developed matrix reasoning test (θMR), the results for abilities in scientific inquiry processes (θSI), and IQ were conducted separately for primary and secondary grade levels (see Table 5). The results of the primary grade level showed significantly positive moderate correlations (ρ > 0.3) between θMR and IQ, and between θMR and θSI. A significantly positive weak correlation (ρ > 0.1) was observed between θSI and IQ. The results for secondary-grade level showed strong, significant positive correlations (ρ > 0.5) between θMR and IQ and between θMR and θSI. A significantly moderate correlation was observed between θSI and IQ.

3.4. Group Comparisons

The Rasch latent regression model included the categorical covariate cohort (enrichment vs. control; enrichment as the reference group). The estimated contrast was β = −0.33, indicating lower mean ability in the control group (R²θ = 0.03; see Table 6).
A multi-group Rasch model with equal item parameters across groups corroborated these results. With the enrichment-group mean fixed to 0, the control group mean was −0.31 logits (d = −0.53; see Table 7).

4. Discussion

The present study investigated the psychometric properties of contextualised matrix reasoning tasks, featuring biological forms and representations, to assess students’ fluid reasoning abilities in grades 3–6 as part of giftedness identification in biology. In the following, we first discuss the findings from the correlation analysis and the group comparison, before detailing the psychometric properties of the tasks.
For the primary grade level, the correlation analysis indicated a moderate association between θMR and IQ. The strength of the correlation indicated at least a meaningful overlap of the tested latent abilities. The developed matrix reasoning test showed a moderate convergent validity with fluid intelligence in the primary grade level. The moderate correlation between θMR and θSI, and the weaker correlation between θSI and IQ, suggest that the domain-specific adaptation was effective and that individuals who score higher on SI also tend to score higher in the domain-specific matrix reasoning tasks. These correlations were evident at the secondary grade level. Here, strong correlations were observed between θMR and IQ, as well as between θMR and θSI, indicating substantial overlap and convergence. Consistent with developmental psychology research, MR correlated similarly with IQ and SI at secondary grade levels. For the MR–IQ association, adolescent development in domain-general executive functions and fluid reasoning enhances reliance on cognitive processes fundamental to both measures, such as pattern abstraction, rule induction, and working memory coordination. A shift toward more analytical, rule-based strategies also strengthens the alignment with IQ (e.g., Best & Miller, 2010). In the MR–SI correlation, secondary school science teaching develops inquiry skills such as identifying patterns, evaluating evidence, and managing variables, as well as representational fluency, such as interpreting biological diagrams and graphs. Since our MR included biology-based stimuli, the lower content novelty and common analytic demands might further connect MR to SI (e.g., Zimmerman, 2007). The higher correlations observed at the secondary level may be due to older students being more experienced in dealing with biological illustrations and common representations. 
Nevertheless, the newly developed test showed no redundancy with Raven’s Progressive Matrices 2 (NCS Pearson, 2019). The strong correlation between θMR and θSI confirmed the primary-grade results and indicated that adapting the matrix reasoning tasks for biology was successful.
Group comparisons between students in the enrichment program and the control group, estimated using the latent regression Rasch model, indicated that students in the enrichment program outperformed control group students. Accordingly, the measurement direction of the test instrument could be confirmed. In the Rasch latent regression that included cohort (enrichment vs. control) as a binary predictor, the estimated group effect was β = −0.33 logits, which corresponds in magnitude to a standardised mean difference of d ≈ 0.33 and is generally considered small to moderate (Cohen, 1988). Because the groups were unbalanced (about 75% enrichment and 25% control), the share of between-person variance that this contrast could explain was limited. With these proportions, an effect of this size is expected to explain about 2% of the variance (Rosenthal & Rosnow, 2008). Consistent with this, the model produced R²θ = 0.03. The small discrepancy is likely due to sampling variability and to the latent variance having been estimated rather than fixed. The estimated multigroup Rasch model confirmed the initial indications of known-groups validity. Nonetheless, to improve the discriminative ability of the test instrument, items should be prepared for adaptive testing in future studies (e.g., Weiss & Vale, 1987).
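The expected share of explained variance mentioned above can be retraced with a short calculation. The sketch below assumes the standard conversion from a standardised mean difference d to a point-biserial correlation for unequal group proportions, r = d / sqrt(d² + 1/(pq)). Whether this is exactly the formula taken from Rosenthal & Rosnow (2008) is an assumption, and the function name is purely illustrative.

```python
import math


def variance_explained(d: float, p: float) -> float:
    """Share of variance a two-group contrast explains, given effect size d
    and the proportion p of cases in one group (assumed conversion)."""
    q = 1.0 - p
    r = d / math.sqrt(d ** 2 + 1.0 / (p * q))  # point-biserial correlation
    return r ** 2


# Values reported in the text: d of about 0.33, roughly 75% enrichment students.
unbalanced = variance_explained(d=0.33, p=0.75)
# Same effect size with equally sized groups, for comparison.
balanced = variance_explained(d=0.33, p=0.50)

print(f"unbalanced groups (75/25): {unbalanced:.3f}")
print(f"balanced groups (50/50):   {balanced:.3f}")
```

Under these assumptions the 75/25 split yields about 2% explained variance, and the same effect size would explain somewhat more with equally sized groups, which illustrates why the imbalance limits the variance attributable to the cohort contrast.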
The item analysis showed that the second test version exhibited acceptable item parameters and fit indices. Expanding the test instrument from the first to the second version significantly improved reliability, which was within an acceptable range for the second version but should be improved through further adjustments. The targeting between item difficulty (b) and person ability (θ) also improved in the second version, because the item difficulties covered more of the observed ability range (see Figure 2). Although the difficulty range was generally well covered, some gaps in item coverage at certain trait levels were still apparent (Lord, 1980; see Figure 2): coverage was less dense at the extremes and sparse in the lower mid-range. In practice, this entails a risk of ceiling effects within the highest-ability subgroup and diminished discrimination for individuals with lower-to-mid abilities. Extreme items might be removed or adjusted to increase their informational value. Considering their application in giftedness identification, high discriminative power and test information are essential for the valid use of matrix reasoning tasks (Rost & Sparfeldt, 2017). To address these gaps, future revisions should expand the item pool to include very difficult items (b > 2.5), very easy items (b < −2), and a small set targeting the lower-mid range (−1.0 < b < 0) to ensure discrimination across all ability levels. Because of the identified targeting gaps, correlations and group comparisons should be interpreted with caution: such gaps increase the conditional standard error of measurement and attenuate group differences and correlations (Lord, 1980). The analysis of fit indices revealed that all items met the infit criterion for high-stakes tests (Wright & Linacre, 1994). 
A total of six items showed standardised outfit values below the criterion (ZSTD < −1.96), which indicated overfit. Because overfitting items are answered more predictably than the model expects and therefore add little independent information, this could allow the instrument to be shortened by eliminating redundant items. However, these findings require verification through additional studies. A short instrument that does not overwhelm participants would be advantageous for identifying gifted students in schools, particularly among younger age groups. The two items that showed underfit on the outfit statistic (ZSTD > 1.96) should be monitored in future research, as their high difficulty may have led participants to guess excessively. Since the infit values met the criterion and difficult items are essential in talent identification for distinguishing high-ability individuals, we see no reason to revise these items at present. For item 3, all three methods identified significant DIF between the cohorts in the first version, favouring the control group. In the second version, no significant DIF was detected for this item. However, this item warrants special attention in future studies, as do items 12, 15, and 27, for which non-significant but noteworthy ΔMH values were observed (see Table 4). These differences could have methodological causes. Although the instrument was administered according to a set procedure, the enrichment cohort was surveyed as part of the project courses, whereas the control group was surveyed during lessons at school, so the survey settings differed in group size and atmosphere. The items in question showed extreme b-parameters, indicating that they were located at the lower or upper end of the difficulty scale. The direction of DIF varied across items and did not follow a consistent pattern, suggesting that the observed effects are more likely caused by item-level properties than by a consistent group-related bias. 
Analysing the cognitive processes and solution strategies associated with these items may provide more detailed explanations and help guide targeted item revisions (e.g., Laurence & Macedo, 2023). Overall, the developed matrix reasoning tasks showed negligible or no significant DIF between the cohorts, which supports the test instrument’s quality and practical applicability. In addition, no indications of DIF were found for gender or grade level. The developed items thus demonstrated acceptable characteristics for further research.
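The targeting argument made above, namely that sparse item coverage at a given trait level lowers measurement precision there, can be illustrated with the Rasch test information function. The following minimal sketch uses the item difficulties of the first test version as listed in Table 2 (the second version's parameters are not reproduced here). It assumes the textbook Rasch relations, item information p(1 − p) and conditional standard error 1/sqrt(information); the function names are illustrative and not taken from the software used in the study.

```python
import math

# Item difficulties (b) of the first test version, copied from Table 2.
B_VERSION_1 = [-2.02, -0.47, -0.18, -1.76, -3.39, -0.62, -2.41, 0.33,
               -1.90, -0.97, -0.97, 0.77, -0.59, -0.29, 1.31, 1.27,
               1.14, 1.15, 1.14, 2.45, 2.19, 1.62, 2.20]


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def total_information(theta: float, difficulties: list) -> float:
    """Sum of the item informations p * (1 - p) at ability level theta."""
    probabilities = (rasch_probability(theta, b) for b in difficulties)
    return sum(p * (1.0 - p) for p in probabilities)


def conditional_se(theta: float, difficulties: list) -> float:
    """Conditional standard error of the ability estimate at theta."""
    return 1.0 / math.sqrt(total_information(theta, difficulties))


for theta in (-3.0, -1.0, 0.0, 1.0, 3.0, 4.0):
    se = conditional_se(theta, B_VERSION_1)
    print(f"theta = {theta:+.1f}  conditional SE = {se:.2f}")
```

Ability levels that are far from most item difficulties receive little information and therefore a larger conditional standard error, which is the pattern that additional very easy and very difficult items are meant to counteract.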
When interpreting the results of this study, several limitations should be considered. The results of the DIF analysis were obtained using the MH procedure (Mantel & Haenszel, 1959), which assesses only uniform DIF. Therefore, Lord’s chi-square test and Raju’s area method were used as robustness checks. Here, three items showed significant values, indicating possible non-uniform DIF; these items should be monitored in future studies. Although the items include contextualised forms, they were designed so that solving them does not require understanding the forms’ content, in order to avoid bias against groups with different levels of knowledge. We lack information about our participants’ ethnic-racial identity and socio-economic background. The fairness of the developed test across ethnic and racial groups should therefore be examined in future research (Holden & Tanenbaum, 2023). Furthermore, only academic-track schools (Gymnasium) participated. To develop an instrument suitable for all school types and to examine potential bias, students from diverse educational backgrounds should be included in future validation efforts. Although both rural and urban schools participated in the study, potential DIF between groups from these areas warrants further investigation. This study depended on existing partnerships, which led to unequal group sizes and a non-probability, clustered sample. Although we standardised the administration, residual confounding could remain, and limited precision for smaller groups cannot be entirely excluded. Future research should incorporate stratified recruitment to improve balance and enhance generalisability. For further analyses addressing the second and third research questions, we decided to retain all items in the data pool to obtain a comprehensive picture of the developed items. This procedure may have introduced distortions. 
Slightly different results might have been obtained with a more rigorous approach that eliminated the flagged items. For the correlation analysis, it should be noted that the calculated IQ was derived from the Raven’s Progressive Matrices 2 – Short Form (NCS Pearson, 2019) and did not include person-ability values. Furthermore, a large proportion of students in the correlation analysis sample came from the enrichment program, which may have affected the results. When calculating the person ability parameters (θ) for the matrix reasoning test and the scientific inquiry assessment, we treated non-responses as missing data and excluded them from the calculations. This approach avoided penalising construct-irrelevant non-responses, but it reduced information, increased SE(θ), and might have biased the results of the correlation analysis. Furthermore, the reliability of the instrument used to assess abilities in scientific inquiry was questionable in the present sample. When comparing the two cohorts, it is important to note that the students in the enrichment program were chosen by their biology teachers. This selection provides only a weak criterion for known-groups validity, as teachers have difficulties in identifying domain-specific gifted students (Bergold, 2014; Machts et al., 2016; Urhahne & Wijnia, 2021). Additionally, the cohorts being compared differ greatly in size, so that considerably more information was available for estimating students’ abilities in the enrichment project. The small number of control students limits the informative value of the analysis; the control group should be enlarged in future studies.
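The limitation that the MH procedure captures only uniform DIF follows from how the statistic is built: it pools a single common odds ratio across all score strata, so group differences that change direction along the ability scale can cancel out. The sketch below shows the calculation of the common odds ratio and its transformation to the ETS delta scale, ΔMH = −2.35 ln(αMH). The counts, the group labels "ref" and "focal", and all function names are hypothetical and serve only as illustration; they are not the study's data or code.

```python
import math
from collections import defaultdict


def mantel_haenszel_delta(responses, groups, matching_scores):
    """Return (alpha_MH, delta_MH) for one dichotomous item.

    responses: 0/1 item scores; groups: "ref" or "focal" labels;
    matching_scores: stratifying variable, e.g. the total test score.
    """
    # One 2 x 2 table per score stratum: [n_correct, n_incorrect] per group.
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for x, g, s in zip(responses, groups, matching_scores):
        tables[s][g][0 if x == 1 else 1] += 1

    numerator = denominator = 0.0
    for cell in tables.values():
        a, b = cell["ref"]      # reference group: correct, incorrect
        c, d = cell["focal"]    # focal group: correct, incorrect
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n

    alpha = numerator / denominator      # common odds ratio across strata
    delta = -2.35 * math.log(alpha)      # ETS delta scale
    return alpha, delta


def expand(counts):
    """Turn (score, group, n_correct, n_incorrect) rows into parallel lists."""
    responses, groups, scores = [], [], []
    for score, group, n_correct, n_incorrect in counts:
        responses += [1] * n_correct + [0] * n_incorrect
        groups += [group] * (n_correct + n_incorrect)
        scores += [score] * (n_correct + n_incorrect)
    return responses, groups, scores


# Hypothetical counts, invented purely for illustration.
counts = [(10, "ref", 30, 10), (10, "focal", 20, 20),
          (15, "ref", 40, 10), (15, "focal", 30, 20)]
alpha, delta = mantel_haenszel_delta(*expand(counts))
print(f"alpha_MH = {alpha:.2f}, delta_MH = {delta:.2f}")
```

In this invented example the reference group has higher odds of success in both strata, so the common odds ratio exceeds 1 and ΔMH is negative. The sketch omits the significance test and any handling of empty strata, which a full analysis would need.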
The results of the present study provide initial evidence that the domain-specific adaptation of matrix reasoning tasks was successful in the biological domain. Nevertheless, the mechanisms underlying this effect must be examined more thoroughly in subsequent studies. Experimental contrasts (e.g., contextual versus abstract matrices) should be used to assess the effects of contextualisation on solving strategies, cognitive demands, and student engagement. Additionally, automatically measured process indicators (e.g., response time; Wise & Kong, 2005) should be examined, and speededness effects should be controlled. To determine whether the stronger correlation with the domain-specific comparison instrument (SI) is attributable to domain-specific skills, knowledge, interests, or increased test-taking motivation, additional test instruments should be used. To enhance the generalisability of the results, it would be appropriate to compare them with matrix reasoning tasks adapted for other STEM subjects or with existing findings in other domains (e.g., Benit & Soellner, 2012). To further explore the initial results regarding convergent validity, additional intelligence tests (e.g., Wechsler, 2014), measures of fluid intelligence (e.g., Weiß, 2006), and measures of domain-specific abilities (e.g., Darman et al., 2024; Greiff & Neubert, 2014) should also be used.
In summary, the results of the present study show that the developed item pool exhibited acceptable psychometric properties. Nevertheless, further validation should be conducted in future research to examine the test instrument’s measurement quality more thoroughly and to make additional adjustments. Particular attention should be given to the appropriateness for identifying domain-specific giftedness in school. Self-report, think-aloud, or eye-tracking studies used to explore cognitive strategies could offer deeper insights (Laurence & Macedo, 2023). Furthermore, future endeavours should examine methodological decisions, such as the inability to revise answers, the lack of response feedback, and the available time (Frey et al., 2024), to continually optimise the instrument’s design (Peperkorn & Wegner, 2024; Euler, 2014; Shavelson et al., 2003). The development of an item bank that enables adaptive testing and the creation of different test versions can further improve the test’s discriminatory ability and enhance its practical usefulness for giftedness identification procedures with regard to DBR goals (e.g., Chierchia et al., 2019; Pallentin et al., 2023). In identifying gifted students, it is important to consider multiple broad ability domains (e.g., McGrew et al., 2023), specific subject skills, creativity, and non-cognitive personality traits (e.g., Sternberg, 2024). The findings of the present study support the idea that measuring the broad ability area Gf with contextualised matrix reasoning tasks can potentially improve identification processes. Future work should broaden the theoretical scope by exploring transfer to other STEM contexts and conducting cross-site studies.

Author Contributions

Conceptualisation, C.P.; Methodology, C.P.; Software, C.P.; Validation, C.P.; Formal Analysis, C.P.; Investigation, C.P.; Resources, C.W.; Data Curation, C.P.; Writing—Original Draft Preparation, C.P.; Writing—Review & Editing, C.W.; Visualisation, C.P.; Supervision, C.W.; Project Administration, C.W.; Funding Acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study involving human participants was reviewed and approved by the ethics committee at Bielefeld University (approval number: 2025-256, approval date: 27 August 2025).

Informed Consent Statement

Written informed consent was obtained from the parents or a legal representative of each participant.

Data Availability Statement

The original data presented in the study are openly available in Mendeley Data at https://data.mendeley.com/datasets/cx7zwx7k66/1 (accessed on 12 April 2026). The developed test instrument is available from the corresponding author upon reasonable request for research purposes. Upon completion of the dissertation, the full instrument, including the manual, IRT parameters, and scoring code, will be made publicly available via the Open Science Framework (OSF).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
g: General Intelligence
Gf: Fluid Intelligence
PNA: Psychometric Network Analysis
STEM: Science, Technology, Engineering, and Mathematics
CPS: Complex Problem Solving
IQ: Intelligence Quotient
DBR: Design-Based Research
VerE model: Modell zur Vernetzung der Erkenntnisgewinnung in Biologie und Chemie [Model of Cross-Linking Scientific Inquiry between Biology and Chemistry]
KR-20: Kuder-Richardson Formula 20
IRT: Item Response Theory
EAP: Expected a Posteriori
WLE: Mean Weighted Likelihood Estimation
MNSQ: Mean-Square
ZSTD: z-standardised statistics
PCA: Principal component analysis
DIF: Differential Item Functioning
MH: Mantel-Haenszel
θMR: Person Ability Parameter in Matrix Reasoning Tasks
θSI: Person Ability Parameter in Scientific Inquiry
SEM: Standard Error of the Mean

References

  1. Adams, R. J., Wu, M. L., Cloney, D., Berezner, A., & Wilson, M. (2020). ACER ConQuest: Generalised item response modelling software (Version 5.47.5) [Computer software]. ACER. Available online: https://www.acer.org/au/conquest (accessed on 12 April 2026).
  2. Arnold, J. C., Kremer, K., & Mayer, J. (2014). Understanding students’ experiments – What kind of support do they need in inquiry tasks? International Journal of Science Education, 36(16), 2719–2749.
  3. Baghaei, P., & Robitzsch, A. (2025). A tutorial on item response modeling with multiple groups using TAM. Educational Methods and Psychometrics, 3(1), 1–14.
  4. Bell, T., Urhahne, D., Schanze, S., & Ploetzner, R. (2010). Collaborative inquiry learning: Models, tools, and challenges. International Journal of Science Education, 32(3), 349–377.
  5. Benit, N., & Soellner, R. (2012). Misst gut, ist gut? Vergleich eines abstrakten und eines berufsbezogenen Matrizentests [Measures good, is good? Comparison of an abstract and a job-related matrix test]. Journal of Business and Media Psychology, 3(1), 22–29.
  6. Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B: Statistical Methodology, 57(1), 289–300.
  7. Bergold, S. (2014). Zur diagnostischen Kompetenz von Lehrkräften bei der Identifikation begabter Schülerinnen und Schüler [On the diagnostic competence of teachers in identifying gifted pupils]. Bildung und Erziehung, 67(2), 219–236.
  8. Best, J. R., & Miller, P. H. (2010). A developmental perspective on executive function. Child Development, 81(6), 1641–1660.
  9. Brookman-Byrne, A., Mareschal, D., Tolmie, A. K., & Dumontheil, I. (2019). The unique contributions of verbal analogical reasoning and nonverbal matrix reasoning to science and maths problem-solving in adolescence. Mind, Brain, and Education, 13(3), 211–223.
  10. Bybee, R., McCrae, B., & Laurie, R. (2009). PISA 2006: An assessment of scientific literacy. Journal of Research in Science Teaching, 46(8), 865–883.
  11. Carpenter, P. A., Just, M. A., & Shell, P. (1990). What one intelligence test measures: A theoretical account of the processing in the Raven Progressive Matrices Test. Psychological Review, 97(3), 404–431.
  12. Cattell, R. B. (1963). Theory of fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology, 54(1), 1–22.
  13. Chierchia, G., Fuhrmann, D., Knoll, L. J., Pi-Sunyer, B. P., Sakhardande, A. L., & Blakemore, S.-J. (2019). The matrix reasoning item bank (MaRs-IB): Novel, open-access abstract reasoning items for adolescents and adults. Royal Society Open Science, 6(10), 190232.
  14. Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge.
  15. Darman, D. R., Suhandi, A., Kaniawati, I., Samsudin, A., & Wibowo, F. C. (2024). Development and validation of scientific inquiry literacy instrument (SILI) using Rasch measurement model. Education Sciences, 14(3), 322.
  16. Deary, I. J., Strand, S., Smith, P., & Fernandes, C. (2007). Intelligence and educational achievement. Intelligence, 35(1), 13–21.
  17. Euler, D. (2014). Design research – A paradigm under development. In Design-based research (pp. 15–44). Franz Steiner Verlag.
  18. Freund, P. A., & Holling, H. (2011a). Retest effects in matrix test performance: Differential impact of predictors at different hierarchy levels in an educational setting. Learning and Individual Differences, 21(5), 597–601.
  19. Freund, P. A., & Holling, H. (2011b). Who wants to take an intelligence test? Personality and achievement motivation in the context of ability testing. Personality and Individual Differences, 50(5), 723–728.
  20. Frey, A., Liu, T., Fink, A., & König, C. (2024). Meta-analysis of the effects of computerized adaptive testing on the motivation and emotion of examinees. European Journal of Psychological Assessment, 40(5), 427–443.
  21. Gebhard, U., Höttecke, D., & Rehm, M. (2017). Interesse an Naturwissenschaft [Interest in science]. In U. Gebhard, D. Höttecke, & M. Rehm (Eds.), Pädagogik der Naturwissenschaften (pp. 125–142). Springer Fachmedien Wiesbaden.
  22. Greiff, S., & Neubert, J. C. (2014). On the relation of complex problem solving, personality, fluid intelligence, and academic achievement. Learning and Individual Differences, 36, 37–48.
  23. Greiff, S., Wüstenberg, S., & Funke, J. (2012). Dynamic problem solving: A new assessment perspective. Applied Psychological Measurement, 36(3), 189–213.
  24. Holden, L. R., & Tanenbaum, G. J. (2023). Modern assessments of intelligence must be fair and equitable. Journal of Intelligence, 11(6), 126.
  25. Hossiep, R., & Hasella, M. (2010). Bochumer Matrizentest Standard [Bochum matrix test standard] (1st ed.). Hogrefe.
  26. Klauer, K. J., & Phye, G. D. (2008). Inductive reasoning: A training approach. Review of Educational Research, 78(1), 85–123.
  27. Laurence, P. G., & Macedo, E. C. (2023). Cognitive strategies in matrix-reasoning tasks: State of the art. Psychonomic Bulletin & Review, 30(1), 147–159.
  28. Linacre, J. M. (1998). Detecting multidimensionality: Which residual data-type works best? Journal of Outcome Measurement, 2(3), 266–283.
  29. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Routledge.
  30. Machts, N., Kaiser, J., Schmidt, F. T. C., & Möller, J. (2016). Accuracy of teachers’ judgments of students’ cognitive abilities: A meta-analysis. Educational Research Review, 19, 85–103.
  31. Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. JNCI: Journal of the National Cancer Institute, 22(4), 719–748.
  32. Matzen, L. E., Benz, Z. O., Dixon, K. R., Posey, J., Kroger, J. K., & Speed, A. E. (2010). Recreating Raven’s: Software for systematically generating large numbers of Raven-like matrix problems with normed properties. Behavior Research Methods, 42(2), 525–541.
  33. McGrew, K. S., Schneider, W. J., Decker, S. L., & Bulut, O. (2023). A psychometric network analysis of CHC intelligence measures: Implications for research, theory, and interpretation of broad CHC scores “beyond g”. Journal of Intelligence, 11(1), 19.
  34. NCS Pearson. (2019). Raven’s Progressive Matrices 2, clinical edition – German short form – Manual. Pearson Clinical Assessment Deutschland.
  35. Nowak, K. H., Nehring, A., Tiemann, R., & Upmeier Zu Belzen, A. (2013). Assessing students’ abilities in processes of scientific inquiry in biology using a paper-and-pencil test. Journal of Biological Education, 47(3), 182–188.
  36. OECD. (2023). PISA 2022 results (Volume II): Learning during – and from – disruption. OECD.
  37. Pallentin, V. S., Danner, D., & Rummel, J. (2023). Construction and validation of the HeiQ: An operation-oriented figural matrices test. Journal of Intelligence, 11(4), 73.
  38. Pedaste, M., Mäeots, M., Siiman, L. A., De Jong, T., Van Riesen, S. A. N., Kamp, E. T., Manoli, C. C., Zacharia, Z. C., & Tsourlidaki, E. (2015). Phases of inquiry-based learning: Definitions and the inquiry cycle. Educational Research Review, 14, 47–61.
  39. Penfield, R. D. (2001). Assessing differential item functioning among multiple groups: A comparison of three Mantel-Haenszel procedures. Applied Measurement in Education, 14(3), 235–259.
  40. Peng, P., Wang, T., Wang, C., & Lin, X. (2019). A meta-analysis on the relation between fluid intelligence and reading/mathematics: Effects of tasks, age, and social economics status. Psychological Bulletin, 145(2), 189–236.
  41. Peperkorn, C., & Wegner, C. (2024). Developing a diagnostic instrument for scientific giftedness in the context of design-based research (DBR). Global Education Review, 11(2), 25–45.
  42. Peters, S. J., Carter, J., & Plucker, J. A. (2020). Rethinking how we identify “gifted” students. Phi Delta Kappan, 102(4), 8–13.
  43. Potvin, P., & Hasni, A. (2014). Analysis of the decline in interest towards school science and technology from grades 5 through 11. Journal of Science Education and Technology, 23(6), 784–802.
  44. Prabhakaran, V., Smith, J. A. L., Desmond, J. E., Glover, G. H., & Gabrieli, J. D. E. (1997). Neural substrates of fluid reasoning: An fMRI study of neocortical activation during performance of the Raven’s Progressive Matrices Test. Cognitive Psychology, 33(1), 43–63.
  45. Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495–502.
  46. Raven, J., Raven, J., & Court, J. H. (1998). Manual for Raven’s Progressive Matrices and Vocabulary Scales. Oxford Psychologists Press.
  47. R Core Team. (2025). R: A language and environment for statistical computing (Version 4.5.2) [Computer software]. R Foundation for Statistical Computing. Available online: https://www.R-project.org/ (accessed on 12 April 2026).
  48. Ren, X., Schweizer, K., Wang, T., & Xu, F. (2015). The prediction of students’ academic performance with fluid intelligence in giving special consideration to the contribution of learning. Advances in Cognitive Psychology, 11(3), 97–105.
  49. Richland, L. E., Zur, O., & Holyoak, K. J. (2007). Cognitive supports for analogies in the mathematics classroom. Science, 316(5828), 1128–1129.
  50. Roberts, M. J. (Ed.). (2007). Contextual facilitation methodology as a means of investigating domain specific cognition. In Integrating the mind: Domain general vs domain specific processes in higher cognition (pp. 13–37). Psychology Press.
  51. Robitzsch, A., & Lüdtke, O. (2020). A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychological Test and Assessment Modeling, 62(2), 233–279.
  52. Rosenthal, R., & Rosnow, R. L. (2008). Essentials of behavioral research: Methods and data analysis (3rd ed.). McGraw-Hill.
  53. Rost, D. H., & Sparfeldt, J. R. (2017). Intelligenz und Hochbegabung [Intelligence and giftedness]. In M. K. W. Schweer (Ed.), Lehrer-Schüler-Interaktion (pp. 315–346). Springer Fachmedien Wiesbaden.
  54. Scherer, R., & Tiemann, R. (2014). Evidence on the effects of task interactivity and grade level on thinking skills involved in complex problem solving. Thinking Skills and Creativity, 11, 48–64.
  55. Schneider, W. J., & McGrew, K. S. (2022). The Cattell–Horn–Carroll theory of cognitive abilities. In D. P. Flanagan, & E. M. McDonough (Eds.), Contemporary intellectual assessment: Theories, tests, and issues (4th ed.). The Guilford Press.
  56. Shavelson, R. J., Phillips, D. C., Towne, L., & Feuer, M. J. (2003). On the science of education design studies. Educational Researcher, 32(1), 25–28.
  57. Sternberg, R. J. (2002). Raising the achievement of all students: Teaching for successful intelligence. Educational Psychology Review, 14(4), 383–393.
  58. Sternberg, R. J. (2024). A new model of giftedness for transformational active concerned citizenship and ethical leadership. Gifted Education International, 40(2), 166–195.
  59. Sternberg, R. J., Wong, C. H., & Sternberg, K. (2019). The relation of tests of scientific reasoning to each other and to tests of general intelligence. Journal of Intelligence, 7(3), 20.
  60. Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201–210.
  61. Subotnik, R. F., Olszewski-Kubilius, P., & Worrell, F. C. (2011). Rethinking giftedness and gifted education: A proposed direction forward based on psychological science. Psychological Science in the Public Interest, 12(1), 3–54.
  62. Urhahne, D., & Wijnia, L. (2021). A review on the accuracy of teacher judgments. Educational Research Review, 32, 100374.
  63. Van Hoogdalem, A., & Bosman, A. M. (2024). Intelligence tests and the individual: Unsolvable problems with validity and reliability. Methodological Innovations, 17(1), 6–18.
  64. VanTassel-Baska, J. (2005). Domain-specific giftedness: Applications in school and life. In R. J. Sternberg, & J. E. Davidson (Eds.), Conceptions of giftedness (2nd ed., pp. 358–376). Cambridge University Press.
  65. VanTassel-Baska, J. (2021). A conception of giftedness as domain-specific learning: A dynamism fueled by persistence and passion. In R. J. Sternberg, & D. Ambrose (Eds.), Conceptions of giftedness and talent (pp. 443–466). Springer International Publishing.
  66. Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54(3), 427–450.
  67. Watkins, M. W., Lei, P.-W., & Canivez, G. L. (2007). Psychometric intelligence and achievement: A cross-lagged panel analysis. Intelligence, 35(1), 59–68.
  68. Wechsler, D. (2014). Wechsler intelligence scale for children – Fifth edition (WISC-V). Pearson.
  69. Wechsler, D. (2024). Wechsler adult intelligence scale – Fifth edition (WAIS-5). Pearson.
  70. Wegner, C., Minnaert, L., & Strehlke, F. (2013). The importance of learning strategies and how the project ‘Kolumbus-Kids’ promotes them successfully. European Journal of Science and Mathematics Education, 1(3), 137–143.
  71. Weiss, D. J., & Vale, C. D. (1987). Adaptive testing. Applied Psychology, 36(3–4), 249–262.
  72. Weiß, R. H. (2006). CFT 20-R. Grundintelligenztest Skala 2 – Revision [CFT 20-R. Basic intelligence test scale 2-revision]. Hogrefe.
  73. Wellnitz, N., & Mayer, J. (2011, April 3–6). Modelling and assessing scientific methods. Annual Meeting of the National Association of Research in Science Teaching (NARST), Orlando, FL, USA.
  74. Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183.
  75. Worrell, F. C., Subotnik, R. F., Olszewski-Kubilius, P., & Dixson, D. D. (2019). Gifted students. Annual Review of Psychology, 70(1), 551–576.
  76. Wright, B., & Linacre, J. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370–371.
  77. Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8(2), 125–145.
  78. Yuan, K., Steedle, J., Shavelson, R., Alonzo, A., & Oppezzo, M. (2006). Working memory, fluid intelligence, and science learning. Educational Research Review, 1(2), 83–98.
  79. Zimmerman, C. (2007). The development of scientific thinking skills in elementary and middle school. Developmental Review, 27(2), 172–223.
  80. Zwick, R., Thayer, D. T., & Lewis, C. (1999). An empirical Bayes approach to Mantel-Haenszel DIF analysis. Journal of Educational Measurement, 36(1), 1–28.
Figure 1. Examples of the developed domain-specific matrix reasoning tasks. Correct answers are outlined in green.
Figure 2. Wright maps of both evaluated matrix reasoning test versions.
Table 1. Descriptive statistics of both test versions.
| Version | Number of Items | Cohort | n | Items Solved: Mean (SD) | Items Solved: Median (IQR) | Items Solved: Min | Items Solved: Max | Items Unanswered (%) |
|---|---|---|---|---|---|---|---|---|
| 1 | 24 | All | 470 | 14.97 (2.96) | 15 (4) | 7 | 22 | 0.19 |
| 1 | 24 | Enrichment | 373 | 15.52 (2.80) | 16 (4) | 8 | 22 | 0.25 |
| 1 | 24 | Control | 97 | 12.84 (2.56) | 13 (4) | 7 | 18 | 0 |
| 2 | 60 | All | 425 | 13.40 (3.01) | 13 (5) | 4 | 20 | 12.20 |
| 2 | 60 | Enrichment | 341 | 13.53 (2.98) | 14 (4) | 4 | 20 | 13.33 |
| 2 | 60 | Control | 84 | 12.83 (3.09) | 13 (4) | 5 | 19 | 7.61 |
Note. SD = standard deviation; IQR = interquartile range.
Table 2. Item parameters and fit indices for all items of the first version of the matrix reasoning test.
| Item | b | SE | Unweighted MNSQ | Unweighted CI | Unweighted ZSTD | Weighted MNSQ | Weighted CI | Weighted ZSTD |
|---|---|---|---|---|---|---|---|---|
| 2 | −2.02 | 0.20 | 1.18 | (0.87, 1.13) | 2.70 | 1.02 | (0.66, 1.34) | 0.20 |
| 3 | −0.47 | 0.12 | 0.99 | (0.87, 1.13) | −0.10 | 1.00 | (0.87, 1.13) | 0.00 |
| 4 | −0.18 | 0.11 | 1.05 | (0.87, 1.13) | 0.80 | 1.03 | (0.90, 1.10) | 0.50 |
| 5 | −1.76 | 0.18 | 0.94 | (0.87, 1.13) | −0.90 | 0.99 | (0.71, 1.29) | 0.00 |
| 6 | −3.39 | 0.37 | 0.91 | (0.87, 1.13) | −1.40 | 0.98 | (0.29, 1.71) | 0.10 |
| 7 | −0.62 | 0.12 | 1.00 | (0.87, 1.13) | 0.00 | 0.98 | (0.86, 1.14) | −0.20 |
| 8 | −2.41 | 0.23 | 1.04 | (0.87, 1.13) | 0.70 | 0.98 | (0.58, 1.42) | 0.00 |
| 9 | 0.33 | 0.10 | 0.96 | (0.87, 1.13) | −0.60 | 0.97 | (0.93, 1.07) | −0.90 |
| 10 | −1.90 | 0.19 | 0.89 | (0.87, 1.13) | −1.80 | 0.95 | (0.68, 1.32) | −0.30 |
| 11 | −0.97 | 0.14 | 0.98 | (0.87, 1.13) | −0.30 | 0.99 | (0.82, 1.18) | −0.10 |
| 12 | −0.97 | 0.14 | 0.82 | (0.87, 1.13) | −2.90 | 0.93 | (0.82, 1.18) | −0.80 |
| 13 | 0.77 | 0.10 | 0.96 | (0.87, 1.13) | −0.50 | 0.96 | (0.94, 1.06) | −1.40 |
| 14 | −0.59 | 0.12 | 0.98 | (0.87, 1.13) | −0.30 | 0.99 | (0.86, 1.14) | −0.20 |
| 15 | −0.29 | 0.11 | 0.84 | (0.87, 1.13) | −2.50 | 0.91 | (0.89, 1.11) | −1.60 |
| 16 | 1.31 | 0.10 | 0.95 | (0.87, 1.13) | −0.70 | 0.96 | (0.94, 1.06) | −1.40 |
| 17 | 1.27 | 0.10 | 0.98 | (0.87, 1.13) | −0.30 | 0.99 | (0.94, 1.06) | −0.50 |
| 18 | 1.14 | 0.10 | 0.98 | (0.87, 1.13) | −0.30 | 0.98 | (0.95, 1.05) | −0.60 |
| 19 | 1.15 | 0.10 | 1.01 | (0.87, 1.13) | 0.20 | 1.01 | (0.95, 1.05) | 0.30 |
| 20 | 1.14 | 0.10 | 1.02 | (0.87, 1.13) | 0.30 | 1.02 | (0.95, 1.05) | 0.60 |
| 21 | 2.45 | 0.12 | 0.96 | (0.87, 1.13) | −0.50 | 0.99 | (0.87, 1.13) | −0.20 |
| 22 | 2.19 | 0.11 | 1.14 | (0.87, 1.13) | 2.10 | 1.07 | (0.89, 1.11) | 1.30 |
| 23 | 1.62 | 0.10 | 1.20 | (0.87, 1.13) | 2.90 | 1.14 | (0.93, 1.07) | 3.70 |
| 24 | 2.20 * | 0.10 | 1.12 | (0.87, 1.13) | 1.80 | 1.06 | (0.89, 1.11) | 1.00 |
Note. n1 = 470. b = item difficulty, SE = standard error; MNSQ = mean square; CI = confidence interval; ZSTD = z-standardised fit. * parameter estimate constrained.
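As a rough illustration of what the unweighted and weighted mean squares in the fit tables summarise, the sketch below computes both for a single dichotomous Rasch item. It assumes the conventional definitions (unweighted MNSQ as the mean of squared standardised residuals, weighted MNSQ as summed squared residuals divided by summed model variances). The function names and toy data are hypothetical and are not taken from the authors' analysis.

```python
import math

def rasch_prob(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (unweighted_mnsq, weighted_mnsq) for one item.

    responses: 0/1 scores of each person on the item
    thetas:    person ability estimates in logits (same order)
    b:         item difficulty in logits
    """
    sum_z2 = 0.0        # squared standardised residuals
    sum_sq_resid = 0.0  # squared raw residuals
    sum_var = 0.0       # model variances p * (1 - p)
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        var = p * (1.0 - p)
        sq = (x - p) ** 2
        sum_z2 += sq / var
        sum_sq_resid += sq
        sum_var += var
    unweighted = sum_z2 / len(responses)
    weighted = sum_sq_resid / sum_var
    return unweighted, weighted

# Hypothetical toy data: six persons answering an item with b = 0.33
toy_thetas = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
toy_responses = [0, 0, 1, 0, 1, 1]
unweighted, weighted = item_fit(toy_responses, toy_thetas, b=0.33)
print(f"unweighted MNSQ = {unweighted:.2f}, weighted MNSQ = {weighted:.2f}")
```

Values close to 1 indicate that responses vary about as much as the model expects. The weighted statistic down-weights persons whose ability is far from the item difficulty, which is why the two columns can diverge for very easy or very hard items.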
Table 3. Item parameters and fit indices for all items of the second version of the matrix reasoning test.
| Item | b | SE | Unweighted MNSQ | Unweighted CI | Unweighted ZSTD | Weighted MNSQ | Weighted CI | Weighted ZSTD |
|---|---|---|---|---|---|---|---|---|
| 2 | −2.39 | 0.20 | 1.06 | (0.87, 1.13) | 0.80 | 1.01 | (0.67, 1.33) | 0.10 |
| 3 | −1.42 | 0.14 | 1.01 | (0.87, 1.13) | 0.10 | 1.00 | (0.82, 1.18) | 0.00 |
| 4 | −0.49 | 0.12 | 0.97 | (0.86, 1.14) | −0.30 | 0.98 | (0.91, 1.09) | −0.50 |
| 5 | −0.92 | 0.13 | 1.03 | (0.85, 1.15) | 0.50 | 1.02 | (0.87, 1.13) | 0.30 |
| 6 | −3.07 | 0.28 | 0.85 | (0.86, 1.14) | −2.20 | 0.94 | (0.50, 1.50) | −0.20 |
| 7 | −0.63 | 0.12 | 0.97 | (0.85, 1.15) | −0.40 | 0.98 | (0.89, 1.11) | −0.30 |
| 8 | −2.70 | 0.24 | 0.99 | (0.85, 1.15) | −0.10 | 0.96 | (0.59, 1.41) | −0.10 |
| 9 | −0.97 | 0.13 | 1.00 | (0.85, 1.15) | 0.00 | 0.99 | (0.86, 1.14) | −0.10 |
| 10 | −2.21 | 0.19 | 0.91 | (0.86, 1.14) | −1.20 | 0.93 | (0.69, 1.31) | −0.40 |
| 11 | −1.68 | 0.16 | 0.85 | (0.86, 1.14) | −2.10 | 0.94 | (0.78, 1.22) | −0.50 |
| 12 | −1.26 | 0.14 | 0.82 | (0.86, 1.14) | −2.60 | 0.92 | (0.83, 1.17) | −0.90 |
| 13 | 0.28 | 0.11 | 0.96 | (0.86, 1.14) | −0.50 | 0.97 | (0.95, 1.05) | −1.20 |
| 14 | −1.11 | 0.13 | 1.03 | (0.86, 1.14) | 0.50 | 1.02 | (0.85, 1.15) | 0.20 |
| 15 | −1.56 | 0.15 | 0.91 | (0.87, 1.13) | −1.30 | 0.97 | (0.80, 1.20) | −0.30 |
| 16 | 0.79 | 0.11 | 0.96 | (0.86, 1.14) | −0.50 | 0.97 | (0.94, 1.06) | −0.90 |
| 17 | 0.87 | 0.11 | 0.94 | (0.86, 1.14) | −0.80 | 0.96 | (0.94, 1.06) | −1.30 |
| 18 | 0.24 | 0.11 | 0.97 | (0.86, 1.14) | −0.40 | 0.97 | (0.95, 1.05) | −1.30 |
| 19 | 0.15 | 0.11 | 1.03 | (0.86, 1.14) | 0.50 | 1.03 | (0.94, 1.06) | 1.00 |
| 20 | 0.42 | 0.11 | 0.98 | (0.85, 1.15) | −0.20 | 0.99 | (0.95, 1.05) | −0.50 |
| 21 | 2.09 | 0.14 | 0.91 | (0.86, 1.14) | −1.20 | 0.97 | (0.82, 1.18) | −0.20 |
| 22 | 1.35 | 0.12 | 1.09 | (0.86, 1.14) | 1.20 | 1.06 | (0.90, 1.10) | 1.10 |
| 23 | 1.25 | 0.12 | 1.13 | (0.85, 1.15) | 1.70 | 1.08 | (0.91, 1.09) | 1.70 |
| 24 | 1.18 | 0.11 | 1.03 | (0.86, 1.14) | 0.40 | 1.01 | (0.92, 1.08) | 0.30 |
| 25 | 0.97 | 0.11 | 1.01 | (0.85, 1.15) | 0.10 | 1.01 | (0.93, 1.07) | 0.30 |
| 26 | 0.47 | 0.11 | 1.00 | (0.86, 1.14) | 0.00 | 1.00 | (0.95, 1.05) | −0.20 |
| 27 | −1.49 | 0.15 | 0.98 | (0.86, 1.14) | −0.20 | 1.00 | (0.80, 1.20) | 0.10 |
| 28 | 0.07 | 0.11 | 0.98 | (0.85, 1.15) | −0.30 | 0.98 | (0.94, 1.06) | −0.80 |
| 29 | −0.15 | 0.11 | 0.98 | (0.86, 1.14) | −0.30 | 0.97 | (0.93, 1.07) | −0.90 |
| 30 | 0.84 | 0.11 | 1.02 | (0.86, 1.14) | 0.30 | 1.02 | (0.94, 1.06) | 0.60 |
| 31 | 0.43 | 0.11 | 0.99 | (0.85, 1.15) | −0.10 | 0.99 | (0.95, 1.05) | −0.40 |
| 32 | −1.67 | 0.16 | 0.88 | (0.86, 1.14) | −1.60 | 0.96 | (0.78, 1.22) | −0.30 |
| 33 | 2.17 | 0.15 | 1.15 | (0.85, 1.15) | 1.90 | 1.03 | (0.81, 1.19) | 0.30 |
| 34 | 1.45 | 0.12 | 1.01 | (0.85, 1.15) | 0.20 | 0.99 | (0.89, 1.11) | −0.10 |
| 35 | 0.82 | 0.11 | 0.95 | (0.85, 1.15) | −0.60 | 0.95 | (0.93, 1.07) | −1.60 |
| 36 | −1.30 | 0.14 | 0.86 | (0.85, 1.15) | −2.00 | 0.93 | (0.82, 1.18) | −0.80 |
| 37 | −1.92 | 0.17 | 0.88 | (0.86, 1.14) | −1.60 | 0.97 | (0.74, 1.26) | −0.20 |
| 38 | 1.51 | 0.12 | 1.18 | (0.86, 1.14) | 2.30 | 1.09 | (0.89, 1.11) | 1.50 |
| 39 | −0.40 | 0.11 | 0.97 | (0.86, 1.14) | −0.40 | 0.98 | (0.91, 1.09) | −0.50 |
| 40 | −0.01 | 0.11 | 1.04 | (0.86, 1.14) | 0.50 | 1.03 | (0.94, 1.06) | 1.00 |
| 41 | 0.26 | 0.11 | 0.97 | (0.86, 1.14) | −0.30 | 0.98 | (0.95, 1.05) | −0.80 |
| 42 | 1.60 | 0.12 | 1.08 | (0.86, 1.14) | 1.20 | 1.04 | (0.88, 1.12) | 0.70 |
| 43 | −1.77 | 0.17 | 0.83 | (0.85, 1.15) | −2.40 | 0.95 | (0.76, 1.24) | −0.40 |
| 44 | 1.56 | 0.12 | 1.13 | (0.85, 1.15) | 1.70 | 1.08 | (0.88, 1.12) | 1.30 |
| 45 | −1.43 | 0.15 | 0.87 | (0.86, 1.14) | −1.80 | 0.94 | (0.81, 1.19) | −0.60 |
| 46 | 1.69 | 0.13 | 1.07 | (0.86, 1.14) | 1.00 | 1.04 | (0.87, 1.13) | 0.60 |
| 47 | −0.63 | 0.12 | 1.02 | (0.86, 1.14) | 0.20 | 1.00 | (0.89, 1.11) | 0.00 |
| 48 | 1.30 | 0.12 | 1.02 | (0.86, 1.14) | 0.30 | 1.02 | (0.90, 1.10) | 0.40 |
| 49 | 2.58 | 0.17 | 1.67 | (0.86, 1.14) | 7.70 | 1.11 | (0.76, 1.24) | 0.90 |
| 50 | 1.57 | 0.12 | 1.08 | (0.86, 1.14) | 1.10 | 1.06 | (0.88, 1.12) | 0.90 |
| 51 | 0.81 | 0.11 | 0.96 | (0.86, 1.14) | −0.50 | 0.97 | (0.94, 1.06) | −1.10 |
| 52 | 1.01 | 0.11 | 0.99 | (0.86, 1.14) | −0.10 | 0.98 | (0.93, 1.07) | −0.40 |
| 53 | −1.07 | 0.13 | 1.03 | (0.86, 1.14) | 0.40 | 1.00 | (0.85, 1.15) | 0.00 |
| 54 | −1.26 | 0.14 | 0.86 | (0.86, 1.14) | −2.00 | 0.92 | (0.83, 1.17) | −0.90 |
| 55 | 1.09 | 0.11 | 1.07 | (0.86, 1.14) | 0.90 | 1.06 | (0.92, 1.08) | 1.50 |
| 56 | 0.99 | 0.11 | 0.99 | (0.86, 1.14) | −0.10 | 0.99 | (0.93, 1.07) | −0.20 |
| 57 | 1.44 | 0.12 | 1.03 | (0.86, 1.14) | 0.50 | 1.02 | (0.89, 1.11) | 0.40 |
| 58 | 0.04 | 0.11 | 0.93 | (0.86, 1.14) | −0.90 | 0.94 | (0.94, 1.06) | −1.90 |
| 59 | 0.10 | 0.11 | 1.03 | (0.86, 1.14) | 0.40 | 1.03 | (0.94, 1.06) | 1.00 |
| 60 | 0.12 * | 0.11 | 1.07 | (0.85, 1.15) | 0.90 | 1.05 | (0.94, 1.06) | 1.80 |
Note. n2 = 425. b = item difficulty, SE = standard error; MNSQ = mean square; CI = confidence interval; ZSTD = z-standardised fit. * parameter estimate constrained.
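Table 4. Mantel-Haenszel DIF results for cohort, gender, and grade level.
| Item | Cohort ΔMH | Cohort padj | Gender ΔMH | Gender padj | Grade Level ΔMH | Grade Level padj |
|---|---|---|---|---|---|---|
| 2 | 0.88 | 0.93 | 0.57 | 0.98 | 1.81 | 0.99 |
| 3 | 3.17 | 0.24 | −1.00 | 0.93 | 1.35 | 0.99 |
| 4 | −0.83 | 0.33 | 0.13 | 0.98 | 0.16 | 0.99 |
| 5 | −1.02 | 0.33 | −0.08 | 0.98 | −0.80 | 0.99 |
| 6 | 1.32 | 0.98 | −1.12 | 0.98 | −0.03 | 0.99 |
| 7 | 0.82 | 0.93 | 0.36 | 0.98 | 0.38 | 0.99 |
| 8 | 0.91 | 0.94 | −0.31 | 0.98 | 0.18 | 0.99 |
| 9 | −1.15 | 0.24 | 0.60 | 0.83 | −0.90 | 0.99 |
| 10 | −1.00 | 0.55 | 0.56 | 0.98 | −0.25 | 0.99 |
| 11 | 0.97 | 0.99 | 0.14 | 0.93 | 0.08 | 0.99 |
| 12 | −1.61 | 0.08 | 0.48 | 0.93 | −0.56 | 0.99 |
| 13 | 0.44 | 0.98 | 0.28 | 0.98 | 0.13 | 0.99 |
| 14 | −1.38 | 0.24 | 0.05 | 0.98 | −0.53 | 0.99 |
| 15 | −2.56 | 0.08 | 1.43 | 0.74 | −0.75 | 0.99 |
| 16 | −0.89 | 0.53 | 0.13 | 0.98 | −0.10 | 0.99 |
| 17 | −1.03 | 0.53 | 1.06 | 0.74 | −0.19 | 0.99 |
| 18 | −1.21 | 0.24 | 0.12 | 0.98 | −0.73 | 0.99 |
| 19 | −0.25 | 0.79 | −0.23 | 0.98 | −0.37 | 0.99 |
| 20 | −1.05 | 0.47 | 0.27 | 0.98 | −0.91 | 0.99 |
| 21 | −0.90 | 0.61 | 0.42 | 0.98 | −0.04 | 0.99 |
| 22 | −0.43 | 0.72 | −0.02 | 0.99 | 0.09 | 0.99 |
| 23 | −0.11 | 0.98 | −0.25 | 0.98 | 0.48 | 0.99 |
| 24 | −0.13 | 0.98 | 0.84 | 0.83 | 0.48 | 0.99 |
| 25 | −0.89 | 0.52 | 0.16 | 0.98 | 0.22 | 0.99 |
| 26 | −0.77 | 0.53 | 0.44 | 0.98 | −0.21 | 0.99 |
| 27 | 2.11 | 0.72 | 0.00 | 0.98 | 0.99 | 0.99 |
| 28 | −0.76 | 0.53 | 0.81 | 0.74 | −0.13 | 0.99 |
| 29 | −0.91 | 0.47 | 0.73 | 0.93 | −0.02 | 0.99 |
| 30 | −0.85 | 0.47 | 0.45 | 0.93 | −0.28 | 0.99 |
| 31 | 0.55 | 0.78 | 0.58 | 0.74 | 0.56 | 0.99 |
| 32 | −0.03 | 0.73 | −0.59 | 0.98 | −0.28 | 0.99 |
| 33 | 0.63 | 0.91 | 0.75 | 0.93 | 0.32 | 0.99 |
| 34 | −0.51 | 0.78 | 0.65 | 0.74 | 0.00 | 0.99 |
| 35 | −0.69 | 0.55 | −0.28 | 0.99 | 0.40 | 0.99 |
| 36 | −0.09 | 0.94 | −0.17 | 0.98 | −0.12 | 0.99 |
| 37 | 0.14 | 0.94 | −0.75 | 0.99 | −0.04 | 0.99 |
| 38 | −0.90 | 0.53 | 0.01 | 0.99 | −0.04 | 0.99 |
| 39 | −0.10 | 0.73 | 0.80 | 0.83 | 0.51 | 0.99 |
| 40 | −1.01 | 0.37 | 0.60 | 0.93 | −1.27 | 0.99 |
| 41 | −0.54 | 0.68 | 0.14 | 0.98 | 0.21 | 0.99 |
| 42 | −0.54 | 0.72 | −0.26 | 0.98 | 0.78 | 0.99 |
| 43 | 0.44 | 0.94 | 0.27 | 0.98 | −0.02 | 0.99 |
| 44 | −0.27 | 0.91 | 0.63 | 0.74 | −0.44 | 0.99 |
| 45 | −0.52 | 0.73 | −0.38 | 0.98 | −0.46 | 0.99 |
| 46 | 0.21 | 0.98 | 0.20 | 0.98 | 0.33 | 0.99 |
| 47 | 0.31 | 0.94 | −0.34 | 0.98 | 0.22 | 0.99 |
| 48 | −0.34 | 0.73 | 0.10 | 0.98 | 0.40 | 0.99 |
| 49 | −0.79 | 0.56 | 0.35 | 0.98 | 0.45 | 0.99 |
| 50 | 0.21 | 0.99 | −0.04 | 0.99 | 0.62 | 0.99 |
| 51 | −0.22 | 0.77 | −0.38 | 0.98 | 0.20 | 0.99 |
| 52 | −0.73 | 0.53 | 0.39 | 0.93 | 0.85 | 0.99 |
| 53 | −0.29 | 0.73 | 0.26 | 0.98 | 0.14 | 0.99 |
| 54 | −1.15 | 0.37 | −0.78 | 0.98 | −0.54 | 0.99 |
| 55 | −1.47 | 0.24 | −0.19 | 0.98 | −0.23 | 0.99 |
| 56 | −0.25 | 0.78 | −0.33 | 0.98 | 0.31 | 0.99 |
| 57 | 0.71 | 0.78 | 0.29 | 0.98 | 0.35 | 0.99 |
| 58 | −0.26 | 0.73 | 0.85 | 0.74 | 0.40 | 0.99 |
| 59 | −0.54 | 0.56 | 0.11 | 0.98 | 0.43 | 0.99 |
| 60 | −0.15 | 0.98 | −0.11 | 0.98 | 0.46 | 0.99 |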
Note. Cohort: focal = control group students, reference = enrichment students; Gender: focal = female, reference = male; Grade level: focal = secondary grade level, reference = primary grade level. ΔMH = ETS delta transform of the Mantel-Haenszel odds ratio; positive values indicate the item favours the focal group, negative values favour the reference group. padj = Benjamini-Hochberg adjusted p-values from the Mantel-Haenszel chi-square test of uniform DIF (adjustment applied within each grouping variable).
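The two quantities named in this note can be sketched in a few lines. The example below assumes the standard ETS convention, ΔMH = −2.35 · ln(αMH), with the common odds ratio αMH pooled over ability strata, and the usual step-up Benjamini-Hochberg adjustment. The function names and the counts are hypothetical; the sketch is not the authors' analysis code.

```python
import math

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio.

    strata: iterable of (ref_correct, ref_wrong, focal_correct, focal_wrong)
            counts, one tuple per ability stratum.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den

def ets_delta(alpha_mh: float) -> float:
    """ETS delta transform; negative values favour the reference group."""
    return -2.35 * math.log(alpha_mh)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts for one item in two ability strata
toy_strata = [(30, 10, 20, 20), (40, 5, 30, 10)]
alpha = mh_odds_ratio(toy_strata)
print(f"alpha_MH = {alpha:.2f}, delta_MH = {ets_delta(alpha):.2f}")
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Because the adjustment is applied within each grouping variable, `benjamini_hochberg` would be called once per column of p-values (cohort, gender, grade level) rather than once over all tests.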
Table 5. Correlations between θMR, θSI, and IQ disaggregated by grade level.
| Variable | 1 | 2 | 3 |
|---|---|---|---|
| 1. θMR | – | .33 *** | .43 *** |
| 2. θSI | .59 *** | – | .27 *** |
| 3. IQ | .55 *** | .44 *** | – |
Note. θMR = person’s ability level in the matrix reasoning test; θSI = person’s ability level in the scientific inquiry assessment. The results for the primary grade level (n = 212) are shown above the diagonal. The results for the secondary grade level (n = 128) are shown below the diagonal. *** p < .001.
Table 6. Latent regression Rasch model for matrix reasoning ability by cohort.
| Regression Variable | Estimate (Logits) | SE | z | p | 95% CI |
|---|---|---|---|---|---|
| Intercept | 0.509 | 0.20 | 16.42 | <.001 | [0.448, 0.570] |
| Cohort | −0.332 | 0.12 | −4.74 | <.001 | [−0.469, −0.195] |
Note. Group coding: 1 = control students, 0 = enrichment students (reference). Negative coefficients for the group variable indicate lower mean ability in the control group.
Table 7. Multi-Group Rasch Model of Matrix Reasoning by Cohorts.
| Enrichment M | Enrichment SD | Control M | Control SD | d |
|---|---|---|---|---|
| 0.00 | 0.54 | −0.31 | 0.61 | −0.53 |
Note. d = standardised mean difference on the theta scale. Results were linked using the Stocking-Lord method. Means (and SDs) are reported in logits on the enrichment-group metric (enrichment = reference).
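For readers who want to see how a standardised mean difference of this size arises from the tabled means and SDs, the sketch below uses one common definition, in which the denominator is the square root of the average of the two group variances. The note does not state which pooling rule was used and the tabled values are rounded, so this is an assumption; the function name is hypothetical.

```python
import math

def standardised_mean_difference(m_ref, sd_ref, m_focal, sd_focal):
    """Cohen's d with an unweighted pooled SD (assumed pooling rule)."""
    pooled_sd = math.sqrt((sd_ref ** 2 + sd_focal ** 2) / 2.0)
    return (m_focal - m_ref) / pooled_sd

# Rounded values from Table 7: enrichment (reference) vs. control
d = standardised_mean_difference(0.00, 0.54, -0.31, 0.61)
print(f"d = {d:.3f}")
```

With these rounded inputs the sketch yields roughly −0.54, close to the reported −0.53. A sample-size-weighted pooled SD, or unrounded means, would shift the value slightly.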
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
